Much like we can divide the known cosmic universe into light matter and dark matter, we also can divide the data universe into simple and complex data.
The vast majority of what we observe in the physical cosmos is light matter: planets, stars, galaxies, and so on. The scientific consensus is that the stuff we don't see, dark matter, is many times larger.
In the enterprise IT data universe, the vast majority of what we work with is simpler data: tables, rows, columns, fields, etc. We’ve learned to do some pretty impressive things by combining and connecting simpler data types.
But very often what remains unexploited is complex data. It can be all around us: contracts, lab notes, emails, reports, log files, statements, and so on. Or complexity can arise when you’re faced with connecting many different forms of simpler data types that don’t follow much of a standard.
At some point, you realize that all the familiar tools and techniques that work so well with simpler data are woefully cumbersome when considering complex data types.
Let’s say you’ve been handed a typical business problem: your bank has a number of business functions (customer service, legal, marketing, etc.) that need to look at things holistically across multiple primary systems (deposit accounts, loans, investments, etc.).
Each of those business functions wants a connected view across all the relevant primary systems of what they’re interested in, each somewhat different. But none of the data living in those primary systems was intended to be connected in that way.
If I, as a business user, were simply collecting KPIs or similar simpler data from each primary system — imagine an executive dashboard — connecting and presenting the relevant data would be straightforward.
But the underlying data living in those primary systems is inherently complex. Things like loan documents, account statements, and perhaps process logs. Making matters more interesting is the likelihood that each business function needs to connect that complex data in different ways, depending on need. The legal department will be interested in different things than, say, the marketing group.
This same pattern — complex data, usually multiple sources and multiple consumers — shows up in a surprisingly large number of situations. Yes, it’s easy to imagine in financial services, whether that’s banking, trading, insurance, and so on. Much of their primary data is inherently complex.
But consider, say, pharma research.
Perhaps you’d like to organize everything you know related to “diabetes” across all of your research activities. This use case is interesting because one topic leads to another: insulin, genetics, and much more.
It’s the same “many users, many sources” pattern, but with a twist: a highly structured view of concepts and how they relate.
Or, instead, let’s say you’re building very complex things: aircraft, refineries, and the like. Or perhaps working in public safety. Maybe publishing, or fraud detection. There’s a pattern around inherently complex data that’s very different from the familiar ones from the world of simpler data.
So what exactly is it that makes complex data different from simpler forms? It’s how you go about making connections across multiple data sources.
We connect data together to increase its value. Combine a list of customers with how much each spent on what, and you now have a more valuable list of top customers and what they bought.
With simpler data forms, we’d use something like customer_id to make that connection.
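With simple data, that connection is a one-line join. A minimal sketch in Python (the field names `customer_id` and `amount` are illustrative, not from any real schema):

```python
# Connecting two simple datasets on a shared key, then ranking customers by spend.
customers = [
    {"customer_id": 1, "name": "Acme Corp"},
    {"customer_id": 2, "name": "Globex"},
]

purchases = [
    {"customer_id": 1, "item": "widgets", "amount": 1200.0},
    {"customer_id": 2, "item": "gadgets", "amount": 800.0},
    {"customer_id": 1, "item": "gizmos", "amount": 300.0},
]

# Sum spending per customer, then join back to names via customer_id.
totals = {}
for p in purchases:
    totals[p["customer_id"]] = totals.get(p["customer_id"], 0.0) + p["amount"]

names = {c["customer_id"]: c["name"] for c in customers}
top_customers = sorted(
    ((names[cid], total) for cid, total in totals.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
print(top_customers)  # [('Acme Corp', 1500.0), ('Globex', 800.0)]
```

Because both datasets share a well-defined key, the connection is mechanical. That is exactly what complex data lacks.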
Imagine instead that someone handed you a pile of PDFs containing customer invoices and asked you to figure out the top customers as before. Not so easy, right?
Well, you’d probably start by scanning, converting to text, then searching for key fields, extracting the bits you needed, and so on. Here’s the point: connections have to be built for the data to be useful.
That’s where metadata comes in — data about data. The act of picking off what looks like a customer name etc. from a mostly unstructured sequence of bytes creates metadata.
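Here is a hedged sketch of that act of extraction. The `Customer:` and `Total:` labels are hypothetical; real documents vary widely, which is much of what makes this hard:

```python
import re

# Extracting metadata (data about data) from an unstructured, invoice-like text.
invoice_text = """
INVOICE #2041
Customer: Acme Corp
Date: 2023-04-01
Total: $1,500.00
"""

def extract_metadata(text):
    """Pick off likely key fields from raw text, producing metadata."""
    patterns = {
        "customer": r"Customer:\s*(.+)",
        "total": r"Total:\s*\$([\d,]+\.\d{2})",
    }
    metadata = {}
    for field, pattern in patterns.items():
        match = re.search(pattern, text)
        if match:
            metadata[field] = match.group(1).strip()
    return metadata

print(extract_metadata(invoice_text))
# {'customer': 'Acme Corp', 'total': '1,500.00'}
```

The extracted fields are the metadata: small, structured handles on an otherwise opaque blob, and the raw material for every connection you build afterward.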
The classic example of metadata is a library card catalog: it helps you find things in the library.
One popular way of using metadata is search. Google and others extract metadata by crawling the internet and building search indexes, which helps you find things on the web.
But technologies like Google search only go so far — they have a limited understanding of a user’s specific context and intent. Ideally, you’d like to precisely define what you mean when using specific terms, like “key customer.” You’d like to understand the connections between what you’re looking for, and other potentially useful things that are out there.
To do this, you’d have to enrich and enhance the metadata you’ve previously extracted. The more you did this, the better (and more useful) connections you could make across all of the available data.
The data management tools people use to work with complex data are predictably different from the ones aimed at simpler forms of data. They can present models of complex relationships between complex entities as a matter of course.
Trying to use data management tools designed for simpler forms of data usually does not end well when applied to complex data.
One anti-pattern has been dubbed “table proliferation” — e.g., creating a new relational table every time you need to capture a new entity relationship.
While that might sound great to someone familiar with relational databases, as you go from dozens to hundreds to thousands of highly-dependent tables, you realize you might have made a bad choice. Other challenges will inevitably arise along the way.
That being said, data management tools designed for complex data can easily present data in simpler formats as needed, for example via a SQL query or a similar interface.
It’s fair to ask — when should I consider my data complex, and think about data management tools suited to the task at hand?
The simple answer? When the tools you’re using aren’t working well.
At MarkLogic, we usually get involved after the IT team has run through all the data management tools that they’re familiar with. Typically, every tool they’ve tried was intended for simpler forms of data.
Trying to force-fit complex data into simple tools either results in (a) a brittle, unresponsive environment that accumulates technical debt quickly, or (b) a noble but misguided effort to dramatically simplify the data so their existing tools will work.
Either way, users won’t be pleased with the result. Complex data is different.
Not every enterprise data architect understands that complex data is inherently different from simpler forms. That being said, complex data can be incredibly rich stuff, creating a never-ending stream of business value for those who approach the challenge with eyes open.
I would argue that those who learn to master complex data are very well prepared for the future of data management. Not only is there likely more complex data than simple data, but it can contain more valuable insights.
All we have to do is connect it.