In 2009, Harvey Silverglate published the book Three Felonies a Day, in which he describes how “federal criminal laws have become dangerously disconnected from the English common law tradition and how prosecutors can pin arguable federal crimes on any one of us, for even the most seemingly innocuous behavior.” His claim, as the title suggests, is that each of us commits roughly three such felonies a day.
In an already overloaded criminal justice system, it’s unlikely that you’ll ever be pinned for one of your unrealized felonies, but that system still has to manage information about the more than 11 million arrests that were made across the United States in 2014 alone.
Departments throughout the justice system keep that volume of information under control across their different systems using Master Data Management (MDM).
Gartner describes master data as “the consistent and uniform set of identifiers and extended attributes that describe the core entities of the enterprise and are used across multiple business processes.” MDM is the effort to make sure that there is a near-perfect, or “mastered,” record for the information most important to an enterprise. The field of MDM includes processes and tools for data governance, data quality management, maintaining data lineage, capturing and codifying enterprise events (onboarding a new customer, for example), and monitoring performance and progress on achieving all of these.
There are many different definitions of MDM and quite a few products built specifically for enabling it, including those from Informatica, IBM, Trillium and Oracle. These products include purpose-built UIs for workflow/BPM, integration middleware, data modeling, data profiling, matching, cleansing and linking, information governance support, data stewardship support, and more. The challenge is trying to reach a static, perfect golden record in ever-changing enterprises by doing back-office clean-up of the data with rigid, relational-based MDM.
Recently, a U.S. state’s department of justice (DOJ) challenged us with a system of 58 counties that each report data in a different schema. Their ultimate goal was a single Automated Criminal History System (ACHS), but the data they had contained information for many more “subjects” than there are actual unique people.
Without the ability to identify which subjects are the same “person,” the DOJ cannot get an aggregated view of each individual and is therefore working with an inefficient system that holds millions of erroneous records.
MarkLogic’s MDM Alternative
MarkLogic is proposing a solution that would allow the DOJ to bring the data together in a fully indexed Operational Data Hub (ODH). Its flexible data model and rich search and query would support the necessary data cleanup, data governance, and merge and unmerge operations, and would deliver a single 360-degree view of each person instead of a collection of “subject” data that may or may not relate to the same person. (My colleague Dan McCreary does a great job explaining how you can quickly load all those schemas as is.) The upshot is that all the data goes in, queries can be made across all of it, and the ODH would eliminate errors and duplications.
Moving further to the east, another state’s Criminal History System (CHS) has struggled to manage records of people, bookings, and court cases. The data at hand is too complicated to represent well in a relational database: it takes as many as three dozen tables, which makes search, update, merge, and unmerge operations too difficult.
The state is using MarkLogic as the foundation for a new CHS that allows these records to be represented much more naturally and efficiently as a single logical entity in a document. This system stores a person’s information in a single document, which allows more efficient queries without complex join operations, and it can be adapted as future data requirements change without extensive data modeling efforts and schema changes. Flexible replication allows a subset of the state CHS database to support a public site that, even if it were hacked, does not contain any sensitive data.
‘Colligation’ & Fuzzy Algorithms
The state and its contractor use MarkLogic to colligate (essentially match and group) information that is often duplicated, with errors or other differences, across many separate records. Colligation can be quite difficult because the data elements in separate documents that represent the same person do not always match. For example, the ‘firstname’ field might be John, Jonathan or Jack, and the DOB field might be off on the year, month or day due to data entry errors. The system uses MarkLogic’s field-weighted and fuzzy matching algorithms, with the data merging capabilities also running in MarkLogic.
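To make the idea of field-weighted fuzzy matching concrete, here is a minimal Python sketch. It is not MarkLogic’s algorithm or API: it uses the standard library’s difflib as a stand-in for fuzzy string comparison, and the field names, weights, threshold, and greedy grouping strategy are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical field weights (assumption): in practice these would be
# tuned against records already known to belong to the same person.
WEIGHTS = {"firstname": 0.2, "lastname": 0.4, "dob": 0.4}


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity between two strings, ignoring case.

    A missing value on either side counts as no evidence (0.0).
    """
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(rec_a: dict, rec_b: dict) -> float:
    """Weighted sum of per-field similarities over the weighted fields."""
    return sum(
        weight * similarity(rec_a.get(field, ""), rec_b.get(field, ""))
        for field, weight in WEIGHTS.items()
    )


def colligate(records: list, threshold: float = 0.8) -> list:
    """Greedily group records: each record joins the first group whose
    first member scores at or above the threshold, else starts a new group."""
    groups = []
    for rec in records:
        for group in groups:
            if match_score(group[0], rec) >= threshold:
                group.append(rec)
                break
        else:
            groups.append([rec])
    return groups
```

With these weights, a “John Smith” and a “Jonathan Smith” whose birth dates differ by two transposed digits would land in the same group, while an unrelated record would not. Pure string similarity will not connect nicknames such as Jack and John; that would need an additional lookup of known name variants.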
In order to provide colligation of all of the information relating to a person’s journey through the criminal justice system (arrest, detainment, charges, convictions, sentencing, incarceration, and parole), MarkLogic and the state’s contractor implemented a method to express and persist this information as entities in single documents. These documents include all of the original source information, data source counts for each matching data element, and a history of the changes to support unmerge operations.
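The document shape described above (original sources, per-value source counts, and a change history that makes unmerge possible) can be sketched in a few lines of Python. This is an illustrative model only, not the state’s actual document format or any MarkLogic API; the keys `sources`, `value_counts`, `history`, and `source_id` are hypothetical names.

```python
from copy import deepcopy

# An empty person entity (hypothetical structure).
EMPTY_ENTITY = {"sources": [], "value_counts": {}, "history": []}


def _count_values(sources: list) -> dict:
    """For each field, count how many source records report each value."""
    counts = {}
    for src in sources:
        for field, value in src.items():
            if field == "source_id":
                continue
            field_counts = counts.setdefault(field, {})
            field_counts[value] = field_counts.get(value, 0) + 1
    return counts


def merge(entity: dict, source_record: dict) -> dict:
    """Return a new entity with the source record added and logged."""
    sources = entity["sources"] + [deepcopy(source_record)]
    history = entity["history"] + [
        {"op": "merge", "source_id": source_record["source_id"]}
    ]
    return {
        "sources": sources,
        "value_counts": _count_values(sources),
        "history": history,
    }


def unmerge(entity: dict, source_id: str) -> dict:
    """Return a new entity with one source removed; counts are rebuilt
    from the retained originals, so nothing has to be reverse-engineered."""
    sources = [s for s in entity["sources"] if s["source_id"] != source_id]
    if len(sources) == len(entity["sources"]):
        raise KeyError(source_id)
    history = entity["history"] + [{"op": "unmerge", "source_id": source_id}]
    return {
        "sources": sources,
        "value_counts": _count_values(sources),
        "history": history,
    }
```

Because the original source records are kept verbatim inside the entity, unmerging is just a rebuild from whatever sources remain, and the history still records that the merge once happened.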
One of the primary reasons for using MarkLogic as an MDM solution is that it can ingest any data, which makes retaining original source data, along with its lineage and timestamps, trivial. This allows the state to use any of that data, regardless of type, for search and query. This is supported by MarkLogic’s Universal Index, which indexes the words, values, and document structure of each loaded document, whether structured or unstructured, without advance knowledge of the data structures. This is particularly important for the CHS’s approach to matching records with fuzzy search and field-weighted algorithms, and for the subsequent merge and unmerge operations on those records.
The CHS users also need to be able to monitor certain records for any changes. MarkLogic’s Alerting feature allows them to store queries of any complexity as business rules that proactively identify any piece of data matching a rule as soon as it enters the database. In addition to being used for safety reasons, such as monitoring violent offenders for parole violations, this proactive notification is a way to prevent duplicate data entries from being saved to the database, where they could cause errors and inefficiencies and would require back-office MDM clean-up efforts to fix.
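The pattern behind this kind of alerting is that the queries are stored and each incoming document is tested against them, rather than the other way around. The Python sketch below illustrates only that pattern; it does not use or imitate MarkLogic’s Alerting API, and the class, rule, and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rule:
    """A stored query: a name plus a predicate over a document."""
    name: str
    matches: Callable[[dict], bool]


class AlertingStore:
    """Toy document store that evaluates stored rules on every insert."""

    def __init__(self) -> None:
        self.rules: List[Rule] = []
        self.documents: List[dict] = []

    def register(self, rule: Rule) -> None:
        """Persist a rule so it is checked against all future inserts."""
        self.rules.append(rule)

    def insert(self, doc: dict) -> List[str]:
        """Store the document and return the names of the rules it matched."""
        fired = [rule.name for rule in self.rules if rule.matches(doc)]
        self.documents.append(doc)
        return fired


# Hypothetical rule: flag any new arrest of someone currently on parole.
store = AlertingStore()
store.register(
    Rule(
        name="parolee-arrest",
        matches=lambda d: d.get("event") == "arrest" and d.get("on_parole", False),
    )
)
```

A duplicate-detection rule could be registered the same way, with a predicate that compares the incoming record against existing ones before it is accepted.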
Other NoSQL databases provide a flexible model, but this CHS also needs government-grade security. That security is critical both for preventing unwanted access to very sensitive information and for temporarily allowing access when a user’s permissions change based on their contextual system usage. For example, a background check request may temporarily and safely elevate a user’s access to a specific subset of data.
The data coming through the justice system is both vast and complicated, but with the right tools (even if they’re called by a different name), MarkLogic brings order to law and order.