Efficient use of data can lead to more effective service delivery in the public sector. But much of that data is stored in silos on outdated infrastructure, making it difficult to get a full view. Integrating data across these silos is notoriously difficult. Fortunately, there are solutions that can ease the pain.
But what are they and which one is the best?
Let’s explore and compare a few options. (Hint: By the end, you’ll know which is best!)
A virtual, or federated, database is a system that accepts queries and pretends to be a big database that includes many disparate siloed data sets. But really, it queries the back-end (live, production or warehouse) systems in real time and converts the data to a common format as it is queried.
Because these databases rely on source systems to answer queries, they are limited by the capabilities and availability of those source systems. Some of the limitations of the virtual/federated approach include:
Least Common Denominator Query – If any source system or silo does not support a query (because that query searches by a particular field, orders by a particular field, uses geospatial coordinate search, uses text search or involves custom relevance scores), then the overall system can’t support it either. So any new systems added later may actually decrease the overall capabilities of the database rather than increase them.
Query Re-mapping – A major shortcoming of virtual/federated databases is that every query to the overall system must be converted into many different queries or requests, one for every federated silo or sub-system. This creates additional development work and tightly couples the federated system to the data silos.
Operational Isolation – Federated systems lack operational isolation: the federated system goes down when any federated source goes down. Often, live source systems do not have capacity for even minimal real-time queries, so the federated virtual database may slow down or bring down critical upstream systems.
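The first and third limitations above can be sketched in a few lines of Python. This is only an illustration under assumed names: `SourceSystem`, the capability labels and the three example silos are hypothetical and do not come from any particular federation product.

```python
from dataclasses import dataclass


@dataclass
class SourceSystem:
    """Hypothetical stand-in for one federated silo."""
    name: str
    capabilities: set[str]
    available: bool = True


def federated_capabilities(sources: list[SourceSystem]) -> set[str]:
    """A query type is supported only if every source supports it."""
    if not sources:
        return set()
    return set.intersection(*(s.capabilities for s in sources))


def federation_available(sources: list[SourceSystem]) -> bool:
    """The federation can answer only while every source is reachable."""
    return all(s.available for s in sources)


sources = [
    SourceSystem("case_management", {"field_search", "sort", "text_search"}),
    SourceSystem("licensing", {"field_search", "sort", "geospatial"}),
]
before = federated_capabilities(sources)

# Adding a third, more limited silo shrinks what the whole system can do.
sources.append(SourceSystem("legacy_mainframe", {"field_search"}))
after = federated_capabilities(sources)

print(sorted(before))  # ['field_search', 'sort']
print(sorted(after))   # ['field_search']

# One source going offline takes the federated view down with it.
sources[1].available = False
print(federation_available(sources))  # False
```

The supported query set is an intersection, so each added silo can only keep it the same or shrink it.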
A data lake is a massive, centralized repository of large volumes of structured and unstructured data. If you move all of your data from disparate silos into one system (e.g., Hadoop/HDFS), it is now a data lake. The data need not be harmonized, indexed, searchable or even easily usable, but at least you don’t have to connect to a live production system every time you want to access a record.
From an operational isolation perspective, data lakes are better than federated systems because they move the data to separate infrastructure. This is arguably their primary advantage. However, data lakes have some key disadvantages in ease of use and maintenance:
Query – Data lakes do not index or harmonize data, and the source system indexes are not available in the data lake, so the ability to query is actually worse with a data lake than with a virtual federated database.
Simplicity and Manageability – Because data in a data lake is in various formats, each batch process, ETL job or analytic job requires complex logic to handle it. This code rapidly becomes ungovernable, mismatched, out of date and a burden.
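To make the two data lake drawbacks above concrete, here is a minimal Python sketch. The file names, field names and sample records are invented for illustration; a real lake would hold files on something like HDFS rather than strings in a dictionary.

```python
import csv
import io
import json

# Raw files as they might land in a lake, each in its source system's own format.
RAW_FILES = {
    "dmv.json": '[{"fullName": "Ana Ruiz", "zip": "78701"}]',
    "benefits.csv": "last,first,postal_code\nRuiz,Ana,78701\nLee,Sam,73301\n",
}


def read_people(file_name: str, raw: str) -> list[dict]:
    """Every job that touches the lake has to repeat format-specific logic like this."""
    if file_name.endswith(".json"):
        return [{"name": r["fullName"], "zip": r["zip"]} for r in json.loads(raw)]
    if file_name.endswith(".csv"):
        rows = csv.DictReader(io.StringIO(raw))
        return [{"name": f"{r['first']} {r['last']}", "zip": r["postal_code"]} for r in rows]
    raise ValueError(f"no parser for {file_name}")


def people_in_zip(zip_code: str) -> list[str]:
    """Without an index, answering a question means scanning and parsing every file."""
    matches = []
    for file_name, raw in RAW_FILES.items():
        for person in read_people(file_name, raw):
            if person["zip"] == zip_code:
                matches.append(person["name"])
    return matches


print(people_in_zip("78701"))  # ['Ana Ruiz', 'Ana Ruiz']
```

The same person comes back twice because nothing in the lake relates the two source records to each other, and any new file format means another branch in every job that reads the data.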
Operational Data Hub
What state governments need is a place that allows them to load and manage all of their data in a schema-agnostic manner, relate all of the data to each other via meaningful business concepts and store it in a reliable and transactional manner. This place needs to enable organizations to “run the business” and “observe the business” on the same data without resorting to ETL to move data from place to place to support different applications. As data grows and grows, you need to find a permanent home for integrating it; otherwise, you will spend all of your resources just moving data around between silos.
That permanent home is what we call an Operational Data Hub (ODH). Here are some of the key advantages to the ODH versus federation and data lakes:
Query – ODHs maintain their own indexes and build these indexes over harmonized data. Harmonization can be a progressive, agile process, so as the range of harmonized data increases, the power of the indexing increases as well. So, as an ODH progresses, the query and analytic capabilities both increase.
Operational Isolation – An ODH moves data to separate disks and infrastructure, so query load on the hub does not impact source systems. Upgrades and management of the ODH also do not require updates to source systems.
Batch Processing – Many records can be queried in an ODH. Unlike a data lake, a data hub allows index-driven queries to restrict the quantity of data being processed for any particular purpose, and supports fast lookups and joins to combine data during batch jobs without a “sub-batch” to find relevant, related records.
Simplicity and Manageability – Because data is at least partially harmonized in an ODH, all data can be handled in common code. The harmonized fields will be indexed and uniform, making access to information much simpler.
Discovery and Exploration – Ad hoc or unexpected analyses can be served easily from an ODH. By moving the data to one place and harmonizing the critical data elements, the hard work has been done during movement and indexing, making it easy to query, interact, filter and drill down into data whenever needed.
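The query and harmonization points above can be illustrated with a toy in-memory hub in Python. This is a sketch under stated assumptions, not how any ODH product is implemented: `TinyHub`, its methods and the sample records are all hypothetical. It stores source records unchanged, indexes only the fields that have been harmonized, and lets harmonization grow one source at a time.

```python
from collections import defaultdict
from typing import Callable


class TinyHub:
    """Toy in-memory hub: stores source records as-is and indexes harmonized fields."""

    def __init__(self) -> None:
        self.records: dict[int, dict] = {}
        self.harmonizers: dict[str, Callable[[dict], dict]] = {}
        self.index: dict[tuple[str, str], set[int]] = defaultdict(set)

    def add_harmonizer(self, source: str, fn: Callable[[dict], dict]) -> None:
        """Register a mapping for one source, then index records already loaded from it."""
        self.harmonizers[source] = fn
        for rec_id, rec in self.records.items():
            if rec["source"] == source:
                self._index(rec_id, rec)

    def load(self, source: str, raw: dict) -> int:
        """Load a record unchanged; it is indexed only if its source has a harmonizer."""
        rec_id = len(self.records)
        rec = {"source": source, "raw": raw}
        self.records[rec_id] = rec
        self._index(rec_id, rec)
        return rec_id

    def _index(self, rec_id: int, rec: dict) -> None:
        fn = self.harmonizers.get(rec["source"])
        if fn is None:
            return
        for field_name, value in fn(rec["raw"]).items():
            self.index[(field_name, value)].add(rec_id)

    def find(self, field_name: str, value: str) -> list[dict]:
        """Index lookup: touches only matching records, not the whole store."""
        ids = self.index.get((field_name, value), set())
        return [self.records[i]["raw"] for i in sorted(ids)]


hub = TinyHub()
hub.add_harmonizer("dmv", lambda r: {"zip": r["zip"]})
hub.load("dmv", {"fullName": "Ana Ruiz", "zip": "78701"})
hub.load("benefits", {"first": "Ana", "last": "Ruiz", "postal_code": "78701"})

print(len(hub.find("zip", "78701")))  # 1: only the dmv source is harmonized so far

# Harmonizing a second source later widens what the same query can reach.
hub.add_harmonizer("benefits", lambda r: {"zip": r["postal_code"]})
print(len(hub.find("zip", "78701")))  # 2
```

The query code never changes as sources are harmonized; only the index grows, which is the sense in which query power increases as harmonization progresses.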
Still not sure? Check out our ebook, Introducing the Operational Data Hub, to learn more about functional specifications of what an ODH should do, and explore use cases of how it is being put into practice today.