Why can it be so hard to achieve? Something I’ve learned with problem solving is: if you find yourself spiraling into a pit of complexity, often the right approach is to go back to the start and look for another way.
For centuries, astronomers struggled to map the paths of the stars and the planets. The more they tried to fit their models to the actual data, the more complex these models became, involving wild calculations that would make even the most hardened mathematician cry. That was, until Copernicus came along and asked: What if the Earth wasn’t the center of the Universe, with everything revolving around it? What if, instead, the Earth revolved around the Sun? And just like that, these absurd and complex models were reduced to a beautiful simplicity.
Of course, astronomy is still a difficult subject; but by finding the right start, vast amounts of unnecessary complexity were removed from the problem.
MDM projects are essentially the same thing. Instead of astral bodies you have data sources. Instead of geocentrism (the model with the Earth at the center of the universe), you have relational technology; whole IT budgets have been wasted trying to make it fit this problem. The question then is, what is the corresponding heliocentric (Earth revolving around the Sun) model?
First a definition: MDM comprises the processes, governance, policies, standards and tools that consistently define and manage the critical data of an organization to provide a single point of reference.
For critical (master) data two criteria are required:
As you can imagine, those two criteria are aspirational, because organizations frequently have a myriad of overlapping data, which arises from both organic growth and immature data management.
To overcome this, organizations invest in huge squads of people who seek to consolidate, clean and de-duplicate their master data — and then manage this single true version of the truth going forward. They may wish to do this using a registry style (leave the data where it is, and instead maintain a registry of which data sits where), or a hub style (where they move the data into a single repository and manage it from there). In either case, because this is dealing with the most critical data to the business, a fully enterprise grade solution is required, with ACID transactions, HA, and DR to ensure data is always consistent, never lost, and always available.
There are, however, a number of challenges associated with the pursuit of this managed single version of the truth, which often results in very large, multi-year, multi-million-dollar projects (in the best case), or outright failure (in the worst). Over two-thirds of all MDM projects fail!
Registry approaches have the best chance of succeeding, but if the Christy Haragan entity is referenced (and perhaps duplicated) across multiple systems, the processes involved in maintaining a single version (conflict resolution, for example) can become expensive and error prone. This approach works for smaller projects with smaller, more localized data sets (e.g., spanning a single data center or perhaps a single geographic location), but as you can imagine, it doesn’t scale.
MDM Registry style: Data is left in the source systems and managed centrally.
The hub approach does scale, but requires a domain model (e.g., Customer, Product, Account, etc.). Due to the inherent intricacy of any of these domains, these models are extremely large and complex. The process of mapping the source data onto these domain models is extremely challenging, expensive, and error prone. Moreover, as data is changed and rearranged (shape-shifting at its finest), the risk of breaking existing processes increases, adding further risk and expense to the project.
MDM Hub style: Data is mapped to a domain model and moved to the central hub to be managed.
On top of those constraints, a common challenge to both of these approaches is the inherent problem of data cleansing and data de-duplication. A successful data cleansing exercise might automate 80 percent of records, but that can still leave you with quite a bit! With a modest million records to process (a low number in MDM terms), the remaining 20 percent would still mean 200,000 records to be cleansed manually. And in the case of multiple addresses, how would the system know which is correct? Maybe both are, maybe neither.
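To make that arithmetic concrete, here is a back-of-the-envelope sketch in Python. The function name and the idea of expressing automation as a whole-number percentage are illustrative assumptions, not part of any MDM tool.

```python
def manual_cleansing_backlog(total_records: int, automated_percent: int) -> int:
    """Return how many records are left for people to cleanse by hand.

    Hypothetical helper: automated_percent is the share of records
    (0-100) that the automated cleansing step handles successfully.
    """
    if not 0 <= automated_percent <= 100:
        raise ValueError("automated_percent must be between 0 and 100")
    # Integer arithmetic avoids floating-point rounding surprises.
    return total_records * (100 - automated_percent) // 100


# The example from the text: one million records, 80 percent automated.
backlog = manual_cleansing_backlog(1_000_000, 80)
print(backlog)  # 200000
```

Even a jump to 95 percent automation would, by the same calculation, leave 50,000 records in the manual queue.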
There are, however, two approaches that can be taken to address these fundamental and challenging problems with a traditional MDM solution:
Operational Data Hub: Data is loaded as-is to the central hub to be managed.
And instead of the big-bang approach required by traditional MDM, which demands that all data be mapped before the system is useful, a schema-agnostic approach is more flexible and responsive. Iterative transformation of data after ingest allows businesses to focus on high-value tasks first, test each change for correctness, and respond to business changes quickly. For data in invalid fields, a schema-agnostic approach means leaving the data in its original form (reducing the risk of breaking existing processes). The entry can be enriched with metadata to indicate whether it’s a street address, zip code, etc.
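As a rough illustration of "leave the original alone, enrich alongside it", here is a minimal Python sketch. It is not MarkLogic's API: the envelope layout, the function names (wrap_as_is, annotate_field), the field names, and the sample record are all assumptions made up for this example.

```python
import copy
from typing import Any, Dict


def wrap_as_is(source: str, record: Dict[str, Any]) -> Dict[str, Any]:
    """Store a source record unchanged, next to an empty metadata section."""
    return {"source": source, "original": copy.deepcopy(record), "metadata": {}}


def annotate_field(envelope: Dict[str, Any], field: str, kind: str) -> Dict[str, Any]:
    """Return a new envelope noting what a field really contains.

    The original record is never modified; only the metadata grows.
    """
    enriched = copy.deepcopy(envelope)
    enriched["metadata"].setdefault("field_kinds", {})[field] = kind
    return enriched


# Hypothetical source record: a zip code has ended up in an address field.
raw = {"name": "Christy Haragan", "address2": "94070"}

envelope = wrap_as_is("crm", raw)
enriched = annotate_field(envelope, "address2", "zip_code")

print(enriched["original"] == raw)          # True: source data is untouched
print(enriched["metadata"]["field_kinds"])  # {'address2': 'zip_code'}
```

Because each annotation is a small, separate step, it can be added, tested, and rolled back on its own, which is the iterative style described above; existing consumers that read the original record keep working throughout.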
Of course, most schema-agnostic approaches won’t provide many of the enterprise-grade features such as ACID transactions, HA and DR.
MarkLogic provides a schema-agnostic database platform that allows data to be stored in its original form and enriched as necessary. Its semantics capability allows flexible links to be created between entities. And it has those highly sought-after enterprise features, so the business doesn’t sacrifice anything in adopting this new approach to managing its most critical data.
MDM, like astronomy, is still a difficult topic. But by starting with the right approach, we can remove large amounts of unnecessary complexity.
MarkLogic is MDM’s heliocentric solution.