Data Quality and NoSQL Databases
A question that often comes up when people first start thinking about deploying mission-critical applications in NoSQL environments is: what about data quality? In a traditional relational environment, extensive ETL is used to move data from its native forms into relational tables. An important part of the ETL process is to check for and correct errors in the incoming data so that the data exposed to end users is as clean as possible.
One of the promises of some NoSQL databases is that you can move to a “load as is” environment and immediately give users access to the data. But doesn’t data quality go out the window? How do you prevent users from accessing inaccurate or incomplete data?
These are valid and important questions. The reality is that a NoSQL solution can provide data quality equivalent to or greater than a traditional relational approach, while still enabling faster, less expensive, and more flexible deployments. To explain the mechanics, we will walk through important aspects of NoSQL deployments using the MarkLogic database as an example and show how they can lead to higher levels of data quality than is found in most of the legacy applications they replace.
How Much ETL?
ETL performs multiple functions. It is used for:
- Looking up external data to enrich a data record
- Converting data values into a preferred format (e.g. pounds to dollars)
- Identifying and correcting errors in the initial data
- Converting data from its incoming format to the target format (e.g. XML to relational)
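As a rough illustration of the first three functions, here is a minimal Python sketch (not MarkLogic-specific; the field names, exchange rate, and lookup table are all hypothetical):

```python
# Illustrative sketch of the first three ETL functions (hypothetical fields).
EXCHANGE_RATE_GBP_TO_USD = 1.25  # assumed rate, for the example only
COUNTRY_LOOKUP = {"GB": "United Kingdom", "US": "United States"}  # enrichment source

def enrich(record):
    # 1. Look up external data to enrich the record
    record["country_name"] = COUNTRY_LOOKUP.get(record.get("country_code"), "UNKNOWN")
    return record

def convert(record):
    # 2. Convert data values into a preferred format (pounds -> dollars)
    if record.get("currency") == "GBP":
        record["amount"] = round(record["amount"] * EXCHANGE_RATE_GBP_TO_USD, 2)
        record["currency"] = "USD"
    return record

def validate(record):
    # 3. Identify (and flag) errors in the initial data
    record["errors"] = [f for f in ("country_code", "amount") if f not in record]
    return record

raw = {"country_code": "GB", "currency": "GBP", "amount": 100.0}
clean = validate(convert(enrich(raw)))
```

The fourth function, format conversion, is the one that load-as-is approaches largely avoid.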
In a MarkLogic-based system, the first three functions continue to exist – these steps are critical for a high level of data quality. While the work still needs to be done, MarkLogic Alerts (discussed below) make it possible to automate much of the effort that goes into these functions and to provide superior data quality compared to relational-based systems.
In a typical MarkLogic implementation, the fourth function of converting data will be greatly reduced and sometimes eliminated.
How does this work in practice?
Load As Is and the Envelope Pattern
In MarkLogic, a common design pattern is known as the “envelope pattern”: data is loaded as it comes from the primary source and stored as is, and a separate metadata section is then created to hold the harmonized and enriched (canonicalized) values. The original data is never changed. Most data quality issues are handled as part of this harmonization and enrichment phase. The work of filling out the envelope after the initial load is similar to the ETL work done before data is loaded in relational systems, except that there is generally far less of it.
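A minimal sketch of the idea in Python (the structure is illustrative only; in MarkLogic the envelope is typically an XML or JSON document, and the section names here are hypothetical):

```python
# Illustrative envelope: the original source data is kept untouched,
# while harmonized/enriched values live in a separate metadata section.
def make_envelope(source_record, source_name):
    return {
        "headers": {"source": source_name, "lineage": []},
        "canonical": {},           # harmonized/canonicalized values go here
        "instance": source_record, # the original data, stored as is
    }

def harmonize(envelope, field, value, note):
    # Enrichment writes only to the metadata sections,
    # never to the original instance section.
    envelope["canonical"][field] = value
    envelope["headers"]["lineage"].append({"field": field, "how": note})
    return envelope

raw = {"dob": "01/31/1980"}
env = make_envelope(raw, "feed-A")
env = harmonize(env, "date_of_birth", "1980-01-31",
                "reformatted MM/DD/YYYY to ISO 8601")
```

Because every harmonization step is recorded alongside the untouched original, the envelope carries its own lineage record.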
Data Lineage – An important benefit of the envelope pattern is that data lineage is much easier to maintain.
Data lineage is a critical element of data quality—you cannot confirm the validity of a data element if you do not know where it came from. In recent years regulators and end users have become far more demanding about understanding the lineage of the data used in making important decisions. This is causing stress for many legacy systems.
In many traditional relational based systems, data may undergo four or five separate ETL steps before it is loaded into the database. This can be a disaster for data quality because when a user sees a value from the database it is not clear where it came from. Was the value part of the original record? Or was it added in ETL step 3B? In many systems today it is impossible to tell.
Data lineage becomes a much easier problem when data is loaded as is and transformations and enrichments are performed against a canonical version of the data (with the original preserved unchanged), rather than through multiple ETL steps before the data is ingested into the database. Users can see both the original data and the transformed data, and the enrichment and harmonization processes are far more transparent.
In a database like MarkLogic the envelope pattern is a natural approach. In a relational system, it is much more difficult to implement.
Alerts — Data loading can be a complex process. Data from different sources may arrive at different times and need to be processed and combined to create composite records. Required data may not initially be available and may require action to correct deficiencies or errors. All of this can make creating composite records an expensive and labor and time-intensive activity.
MarkLogic Alerts can be a powerful tool for overcoming this problem. Alerts notify users when newly ingested content matches a predefined query. Because alerts are resolved through specialized indexes, thousands of them can be defined with minimal effect on system performance. While similar in some ways to the triggers found in relational databases, in practice MarkLogic Alerts are very different: SQL triggers are much less flexible and can substantially degrade performance.
Alerts can be used for many purposes beyond data quality. In a data quality scenario, alerts are created to capture known data quality issues. As data is ingested, the appropriate alerts fire, and whenever an alert returns “true,” code is run to handle the issue. For example, if a required field is missing, the alert code may periodically connect to a reference data site to look up the required information until it is available.
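In outline, the mechanism looks like the following generic Python sketch (this is not the MarkLogic Alerting API; the query predicate, handler, and reference data are all hypothetical stand-ins):

```python
# Generic sketch of query-based alerting on ingest (not the MarkLogic API).
alerts = []  # registered (query_predicate, handler) pairs

def register_alert(predicate, handler):
    alerts.append((predicate, handler))

# Stand-in for an external reference data site
REFERENCE_DATA = {"ACME": "US-123456789"}

def fill_tax_id(doc):
    # Handler: look up the missing required field from reference data
    doc["tax_id"] = REFERENCE_DATA.get(doc["company"], "PENDING")

def ingest(doc):
    # Each ingested document is matched against every registered alert query;
    # when a query returns true, its handler runs to correct the issue.
    for predicate, handler in alerts:
        if predicate(doc):
            handler(doc)
    return doc

# Alert query: the required field "tax_id" is missing
register_alert(lambda d: "tax_id" not in d, fill_tax_id)
doc = ingest({"company": "ACME"})
```

In MarkLogic the matching is done through the database's reverse-query indexes rather than a linear scan, which is why large numbers of alerts remain cheap to evaluate.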
The use of MarkLogic Alerts to check for and correct data quality issues can greatly improve data quality while reducing the time and effort needed to generate the data.
Taking it all together, MarkLogic can speed time to market, reduce ETL and modeling effort and cost, while enhancing the data quality of the final solution.
For More Info
Progressive Transformation using the Envelope Pattern: a short, 16-minute on-demand tutorial covering progressive transformation of data, loading relational business entities as is, idiomatic versus canonical data, using the envelope data modeling pattern to store canonical forms of data, and approaches to modeling entity relationships between documents.