In the blog post, Golden Record Helps Insurers Retain Clients, Frédéric Valluet made a pretty bold claim about MarkLogic’s capabilities:
FV: […] Being able to better manage data governance and quality to automate and accelerate data gathering is the very first step. That is where an enterprise NoSQL platform, such as MarkLogic, brings a lot of features and capabilities that help companies easily and quickly solve this first challenge.
AO: Do you have examples of such achievements?
FV: One of our customers had been trying to solve this issue for five years with a traditional relational MDM technology, and never succeeded. Then they tried MarkLogic and did it in literally a few weeks. That’s amazing!
For most of my career I would have doubted such a claim, and I expect readers would feel a similar skepticism. But since joining MarkLogic, and working with global insurance and publishing companies, I find Frédéric’s story to be pretty plausible. So, I thought I should break down how MarkLogic makes such a dramatic acceleration possible for problems like this one.
Whether building and integrating a traditional MDM system or trying to make better use of disparate business data, we’re dealing with the same question: how do we get from a bunch of documents and operational data sources to a “master” representation of the data that gives us enough understanding to build business strategy on top of it?
I’d like to look at the type of use case where the target system is not necessarily MarkLogic, such as migrating data to a traditional MDM system. This will show how MarkLogic can help with one of the most expensive phases of such a project without requiring that any of the data actually end up in MarkLogic. Here, we’re taking advantage of MarkLogic’s multi-model nature to use it as a tool for data exploration.
(This is intended for a wider audience than a technical how-to, but I will link to several articles in MarkLogic documentation, so curious readers can investigate exactly what steps are necessary to do what I describe.)
When we first design an ETL (Extract, Transform, Load) process for a target system, we’re not yet sure how the data should be mapped from its source to its destination. The traditional approach could be called ETE (extract, transform, explore), and it runs straight into a bootstrapping problem:
- We want the data in a database so that we can explore, query, and better understand it, but
- We need to understand it and build a normalized model in order to get it into the database.
A database (yes, even a relational one) is an outstanding tool for exploring data, but you need a data model to get your data into it. A good data model, in turn, needs to be well researched, and that research comes from exploring the data, which is best done in a database, which needs a data model, and so on.
The solution: Iterate through the following steps:
1. Look at the data.
2. Make an intuitive best guess at the data model.
3. Develop transforms to stage and ingest source data into that data model.
4. Check your results, adjust your best guess, and repeat from step 2.
Each of these steps has a different set of tools you can use. Since the work is exploratory and creative at the beginning, the tools tend to be disparate as well, depending on developers’ and analysts’ taste. Usually, it’s some combination of scripting and analytics languages (R, Python, Perl), text-editors, and spreadsheets. There also tend to be a lot of “sample” files floating around on developer laptops or shared network drives somewhere.
But right here at step 1 is the place where MarkLogic is already very useful. Before giving much thought to a normalized data model, staging, or import processes, MarkLogic is already capable of ingesting the data and giving you some of the nice things you would normally have to wait for until step 4:
- one defined spot of infrastructure where the data is located,
- security and access control based on roles, permissions, and privileges,
- backup and restore tools,
- a full-text index, and the ability to define further specific indexes,
- a query execution tool to explore the data.
With these tools, it’s possible to start exploring the data and developing a normalized data model, and then to start developing an export into the format that the target system requires.
Sounds nice, so let’s consider a few examples of typical source data and how much work it really is to ingest it without normalizing.
Handling XML and JSON As Is
This is the best case for a starting point, but also one of the most common, since most databases can export data to XML or JSON. MarkLogic is built to hold XML and JSON documents in their native formats, so we’re already spared a lot of work. We can literally go ahead and put these documents into the database right now, with about as much development effort as a shell script (using MarkLogic Content Pump) or even just copying files over a network (using a MarkLogic-hosted WebDAV server).
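As a sketch, loading a directory of XML or JSON files with MarkLogic Content Pump might look like this (the host, port, credentials, paths, and collection name are all placeholders for your own environment):

```shell
mlcp.sh import -host localhost -port 8000 \
  -username admin -password admin \
  -input_file_path /data/source-docs \
  -input_file_type documents \
  -output_collections raw-source \
  -output_uri_prefix /raw/
```

Tagging everything with a collection such as `raw-source` makes it easy to query the untouched originals later, separately from anything derived from them.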
With a little more effort, we can start defining fields and specific indexes, which let us search and filter documents on more precise criteria. From there, it’s also possible to define views that allow SQL-like queries. Those views can then be connected via ODBC to any traditional analytics tool, used to test the viability of potential primary keys, and used to identify areas where the data needs to be cleaned up.
XLSX, DOCX, PDF and Other Binaries: Built-in Tools to Extract Content
Here again, the first step is to get the files into the database as they are, and built-in tools can extract the text content and/or metadata from 200+ file types and convert it to XML or JSON during ingestion into the database. There’s a chapter on this in the MarkLogic Search Developer’s Guide. Examples are shown in the MarkLogic documentation:
- xdmp:document-filter (in server-side JavaScript, xdmp.documentFilter) for most binary formats.
- xdmp:word-convert, xdmp:excel-convert for older Word (.doc) and Excel (.xls) files.
There are plenty of tricks for integrating these functions into a smooth ingest process, but for educational purposes they are listed here as an ad hoc two-step process:
- copy the binary files into MarkLogic,
- extract content and metadata to a separate portion of the database.
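In MarkLogic’s server-side JavaScript (run from Query Console rather than as a standalone script), the second step might look something like the sketch below. The collection names and URI scheme are made up for illustration; only cts.uris, cts.collectionQuery, xdmp.documentFilter, and xdmp.documentInsert are built-in functions.

```javascript
'use strict';
declareUpdate(); // this script inserts documents

// For every binary previously loaded into a hypothetical "raw-binaries"
// collection, extract text and metadata with the built-in filter and store
// the result as a separate XML document alongside the original.
for (const uri of cts.uris(null, null, cts.collectionQuery('raw-binaries'))) {
  const extracted = xdmp.documentFilter(fn.doc(uri));
  xdmp.documentInsert(uri + '.extracted.xml', extracted,
    { collections: ['extracted-content'] });
}
```

Keeping the extracted documents in their own collection preserves a clean separation between the untouched binaries and the queryable content derived from them.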
Once that’s done, the results of the extraction can be handled and queried like any other XML or JSON data.
CSVs: Transform Rows to Documents
It’s likely that some portion of source data comes from other relational databases, where it’s straightforward to get the data you want as a CSV. The way I’ve usually seen this data handled is using MarkLogic Content Pump, which can convert each row of a CSV file into its own document as part of the ingest process. It’s documented in the MLCP Handbook, complete with an example showing that it’s not much more than a command-line call to mlcp with the appropriate option flags.
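Conceptually, what that row-to-document conversion does can be sketched in a few lines of Python. The column names, sample rows, and URI scheme below are invented for illustration; mlcp itself handles this natively, with the first column’s value used to build each document’s URI by default.

```python
import csv
import io
import json

# Made-up sample CSV, standing in for an export from a relational database.
csv_data = """policy_id,holder,premium
P-1001,Ana Silva,420.50
P-1002,Ben Okafor,315.00
"""

# Each row becomes its own JSON document, with keys taken from the header
# row and a URI derived from the first column's value.
documents = {}
for row in csv.DictReader(io.StringIO(csv_data)):
    uri = f"/policies/{row['policy_id']}.json"
    documents[uri] = json.dumps(row)

print(len(documents))        # 2
print(sorted(documents)[0])  # /policies/P-1001.json
```

Once loaded this way, each former row is an independent document that can be searched, indexed, and exposed through views like any other JSON in the database.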
Results: Transparency From Source to Application
By decoupling the ingest step from normalization and data modeling, we’ve made the ingest process nearly trivial, and we’re going to make the subsequent steps a lot easier. Not only do we have the database and query tools ready to use for analysis, we have them available to export the data in the format we need, once we’ve agreed on the data model we want.
If we want, we can continue building out the MarkLogic database into a permanent staging system for our target MDM.
This makes it a lot easier to figure out the answer to what’s probably the most common question in any data exploration: “Wait, where did that come from?” The answer is usually found simply by checking the original document, which is present in the database as well. In the case of an audit or compliance demand, it’s the same approach.
To Learn More
A New Way of Thinking About MDM: a 45-minute webinar with Mike Doane and Dataversity on why traditional MDM systems are limited, and how you can overcome those limitations with a multi-model database.