Choices In Enterprise Data Infrastructure
If you’re an armchair student of public policy, you know that infrastructure (transportation, communication, etc.) really matters in any economic growth discussion.
Invest in the right kind of infrastructure — at the right time — and you get great results. Make a poor choice, and the results are usually obvious as well.
If all of our organizations are trying to get more data- and information-literate (inform better decisions, re-engineer processes around connected data, etc.), there’s going to be some infrastructure involved.
For decades, I’ve watched the skirmishes play out between thirsty data consumers and drought-stricken IT groups trying their best to supply them.
Their thirst is easy to understand: give them the data to answer their questions, and they’ll quickly start asking harder questions that require, errr, more data. That thirst isn’t going away, nor should it.
IT is always in a tough place, but real-world experience has shown that some answers are better than others. The conceptual goal is to build better data infrastructure between people and data.
At risk of oversimplification, let’s take a look at different ways to build data infrastructure.
Here’s the Data, Come and Get It
The simplest approach is to run reports and/or dumps against production databases, and make them available to authorized downstream consumers. I think of it as asking people to carry water on their head.
Not exactly the most efficient way to get data into the hands of the people who need it, but — hey — if you’re thirsty, there’s water here.
From a pure efficiency perspective, there’s obvious room for improvement. Data consumers have to be aware of where the data is, how it was captured, how it’s structured. They have to invest effort to move data to another location, and then start massaging it to be usable.
This approach puts the onus entirely on the consumer, and that’s not ideal.
How About We Build You a Water Tank?
The search for a somewhat better answer led to the evolution of data marts, data warehouses, and the like. IT is willing to move the data on a regular basis to a place where you can work on it.
You can analyze more data that way, and people will stop bugging IT for ad-hoc data requests.
But as a consumer, you’re still on the hook for the data’s format, its cleanliness, and everything else. IT moves the data; making it usable is your job.
Back to our water and infrastructure analogy, IT is willing to help you build a big tank — and periodically fill it with some sort of water. The rest is up to you, of course. Not ideal, but better than having to carry water.
Obviously, this pattern leads to many different and specialized water tanks, each aligned with unique missions. This results in two difficult problems.
First, it’s not efficient. Lots of data marts and warehouses, lots of tech, lots of effort, complexity, etc. Any long-term goal of simplifying and standardizing takes a serious hit.
Second, it can lead to poor outcomes. You now have multiple, disconnected “sources of truth” scattered throughout your organization. That makes it hard to make informed decisions around important things in your world: maybe customers, products, health outcomes, etc.
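To make the "sources of truth" problem concrete, here is a minimal sketch in Python. The two marts, the customer IDs, and the find_conflicts helper are all made up for illustration; the point is only that two tanks filled separately can disagree about the same customer.

```python
# Hypothetical extracts from two departmental data marts (illustrative data only).
sales_mart = {
    "C001": {"name": "Acme Corp", "city": "Boston"},
    "C002": {"name": "Globex", "city": "Denver"},
}
support_mart = {
    "C001": {"name": "Acme Corporation", "city": "Boston"},
    "C002": {"name": "Globex", "city": "Austin"},
    "C003": {"name": "Initech", "city": "Dallas"},
}


def find_conflicts(a, b):
    """Return {customer_id: {field: (value_in_a, value_in_b)}} for shared IDs that disagree."""
    conflicts = {}
    for cid in sorted(a.keys() & b.keys()):
        diffs = {
            field: (a[cid].get(field), b[cid].get(field))
            for field in sorted(a[cid].keys() | b[cid].keys())
            if a[cid].get(field) != b[cid].get(field)
        }
        if diffs:
            conflicts[cid] = diffs
    return conflicts


conflicts = find_conflicts(sales_mart, support_mart)
for cid, diffs in conflicts.items():
    print(cid, diffs)
```

Which name or city is right? Neither mart can tell you, and that is exactly the problem.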
Modern Plumbing, Anyone?
What we’d like to get to is something analogous to modern plumbing: high-quality water, any temperature, use it any way you want, etc. Simply turn on the faucet.
If we dig a little deeper, there are some interesting aspects to this analogy. The consumer doesn’t have to care where the water is coming from: river, reservoir, rainfall, etc. The water is tested regularly, and delivered at sufficient quality for most purposes.
If you need something special, like distilled water for your newborn, the effort is minimal. If someone doesn’t like the shared service, they’re welcome to dig a well, invest in pumps and filtration, etc.
The real benefit? As a consumer, I can get on with life without having to think much about water, or being thirsty. But there was some serious infrastructure that made it all happen.
Data Fabrics, Data Mesh, Data Pipelines, and More
Many of the newer memes in this space try to capture this infrastructure-oriented approach: making more and better data available, and presenting it so that people can easily consume it and make better decisions.
As long as we’re thinking about the ideal data infrastructure along these lines, what would we put on our list?
I think we’d start by insisting that we could ingest, process, and add value to data no matter where it’s coming from, in any form.
The data should be immediately usable — to some degree — upon ingestion. Sure, we can add refinements later, but the notion of using data “where is, as is” — and not having to impose a format on it a priori — is very appealing.
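One way to picture "where is, as is" is a rough sketch like the one below. It assumes two invented feeds (a JSON-lines order export and a CSV ticket export) and two hypothetical helpers, ingest and search. Nothing is forced into a shared schema up front, yet the combined store is searchable right away.

```python
import csv
import io
import json


def ingest(raw_text, fmt):
    """Turn raw text into a list of dicts without enforcing a shared schema."""
    if fmt == "json":
        # One JSON object per line; blank lines are skipped.
        return [json.loads(line) for line in raw_text.splitlines() if line.strip()]
    if fmt == "csv":
        return list(csv.DictReader(io.StringIO(raw_text)))
    raise ValueError(f"unsupported format: {fmt}")


def search(records, term):
    """Return records in which any value contains the term (case-insensitive)."""
    term = term.lower()
    return [r for r in records if any(term in str(v).lower() for v in r.values())]


# Illustrative feeds with different shapes and field names.
orders_json = (
    '{"order_id": 1, "customer": "Acme Corp", "total": 120.5}\n'
    '{"order_id": 2, "customer": "Globex"}'
)
tickets_csv = "ticket,customer_name,status\nT-9,Acme Corp,open\nT-10,Initech,closed\n"

store = ingest(orders_json, "json") + ingest(tickets_csv, "csv")
hits = search(store, "acme")
for record in hits:
    print(record)
```

The search finds the Acme order and the Acme ticket even though the two feeds name the customer field differently. Harmonizing those field names is the kind of refinement that can come later.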
People want to search and structure their data in different ways. Everyone has their own lens. It should be easy to build any lens you might need.
There are the familiar rows and columns, but also documents, relationship graphs, geospatial views, RDF triples, ontologies, and so on. Again, in an ideal world, why would you arbitrarily keep people from looking at their data in a particular way?
Not everyone lives in spreadsheets. And the answers to life’s really interesting questions don’t typically live in spreadsheets. The really useful stuff usually lies in connecting scraps of data that weren’t intended to be connected.
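Here is a small sketch of the "many lenses" idea, again with invented records and hypothetical helper names. The same three records are viewed as rows, then as subject-predicate-object triples, then as a graph that links records sharing a value. It is a toy, not how any particular product does it.

```python
from collections import defaultdict

# Row lens: plain records, as they might arrive from different systems.
records = [
    {"id": "order:1", "customer": "Acme Corp", "product": "Widget"},
    {"id": "ticket:9", "customer": "Acme Corp", "product": "Widget", "status": "open"},
    {"id": "order:2", "customer": "Globex", "product": "Gadget"},
]


def as_triples(records):
    """Triple lens: one (subject, predicate, object) tuple per non-id field."""
    return [(r["id"], k, v) for r in records for k, v in r.items() if k != "id"]


def as_graph(triples):
    """Graph lens: link records that share the same (predicate, object) pair."""
    by_value = defaultdict(set)
    for subject, predicate, obj in triples:
        by_value[(predicate, obj)].add(subject)
    graph = defaultdict(set)
    for subjects in by_value.values():
        for subject in subjects:
            graph[subject] |= subjects - {subject}
    return graph


triples = as_triples(records)
graph = as_graph(triples)
print(sorted(graph["order:1"]))
```

The order and the support ticket were never designed to be joined, but the graph lens connects them through the customer and product they share.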
Finally, this is real infrastructure. It has to be scalable, robust, recoverable, secure, auditable, etc., etc. When the town water system has a bad day, everyone has a bad day.
Data Infrastructure Can Be Fun
Perhaps one of the most interesting parts of my job is learning what people have done with modern data infrastructure.
It’s always the same pattern: what new and impactful things can now be done by simply cutting across and connecting multiple data sources quickly and efficiently? It’s fun to see the enthusiasm as the team now realizes they can drive a slew of cool new applications and really move the needle.
Just like modern plumbing has done for most of us.