
The more things change, the more they stay the same. No matter who you are or what business you are in, it’s safe to say that you still have data silos. And these silos are still slowing down your business or government agency.

Sometimes the data silos are an obvious cause of problems. Teams can’t answer seemingly simple questions because data spans two (or 10) systems. Other times, fragmented and incompatible data stifles innovation, but rather than silos being blamed, the development teams and managers are.

Unfortunately, integrating data to merge the silos is notoriously difficult. There is no magic bullet to integrate disparate data, but there are new, breakthrough techniques that ease the pain.

I’ll cover three newer approaches in this blog: Data Lakes, Virtual Databases (also called database “Federation”), and Data Hubs. I’ll explain the pros and cons of each and the differences between them.

MarkLogic supports all three to different extents, but we prefer Data Hubs, and at the end, I’ll explain why Data Hubs are the best and how MarkLogic supports that approach. But even if you don’t build a Data Hub or don’t use MarkLogic, I promise this blog will be useful to you.

The Old Way: Comprehensive Enterprise Modeling

First a brief word on the old approach. People used to (and occasionally still do) build a new enterprise data model that’s comprised of every field and value in their existing enterprise, across all silos, and then map every silo to this new model in a new data warehouse via ETL jobs.

Years ago, Gartner surveyed data warehouse projects built with these older integration methods and found that over 50% of all data warehouses fail—and this number is likely much higher for large “boil the ocean” efforts that attempt to bring in data from too many sources.

So we’ll skip that and talk about Virtual Databases, Data Lakes and Data Hubs, which each try to solve this problem in a new way.

Definitions: Virtual Databases, Data Lakes and Data Hubs

First, let’s define our terms.

Virtual Database

A Virtual Database is a system that accepts queries and pretends to be a big database that includes many disparate siloed data sets. In actuality, it queries the back-end (live, production or warehouse) systems in real time and converts the data to a common format as it is queried.

This is also called a Federated Database, and the sub-systems integrated are called Federates.

Data Lake

A Data Lake is a term popularized in the Hadoop community and has multiple definitions. Broadly, if you move all your data from disparate silos into one system (Hadoop/HDFS) it is now a Data Lake. The data need not be harmonized, indexed, searchable or even easily usable, but at least you don’t have to connect to a live production system every time you want to access a record.

Data Hub

A Data Hub is a hub-and-spoke approach to data integration where data is physically moved and re-indexed into a new system. To be a Data Hub (vs. a Data Lake), this system must support discovery, indexing and analytics.

Movement, Harmonization and Indexing

I’m going to make the case that we can distinguish the three based on how and when (and if) there is Movement, Harmonization and Indexing of data. More definitions below.

Maybe you wondered why I capitalized Movement, Harmonization and Indexing? I’ll use these as specific terms throughout, and have capitalized them to emphasize that these are the three key concepts that distinguish Lakes, Hubs and Federations.

Data Movement

Hubs and Lakes both move data. The data from silos is copied to a new set of disks and processed by a new set of servers and software. Disk space is now very, very cheap, so the operational (and organizational) benefits of having a separate data store that does not hit your production servers for every lookup are huge.

In contrast, a Virtual (Federated) Database delegates queries to each source system for each query, so the data does not move and remains in the silos.

Data Harmonization

Hubs and Federations both harmonize data (although Federations only do it as it is returned or processed). Data is in different formats with different field names and schemas in the various silos, but when it is ultimately served up, the records must look at least similar to be processed as a group. Harmonization falls into three categories:

  1. Naming Differences: The simplest to handle, naming differences occur when different silos have an identical value with identical meaning, but use a different name. E.g. “PERSON.firstName” vs. “IDENTITY.GIVEN_NAME” in two different relational database tables, but holding the same sort of data.
  2. Structural Differences: Somewhat more complex than Naming Differences, structural differences occur when the number and combination of fields differ across silos. E.g. “boxes_available” in one system might be total boxes in the warehouse plus a count_per_box field to derive total items, but in another system, may directly represent total_items (regardless of boxing). Further, one may represent available inventory not already allocated to unfilled orders, where the other may represent all physical inventory, regardless of open orders.
  3. Semantic Differences: These are the worst. One system may have three patient statuses and another may have five. These statuses will often overlap and be hard to map to one another ({Scheduled, Needs_Followup, Inactive} vs. {Intake, Scheduled, Telemedicine-only, Discharged}).
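A minimal sketch of what harmonizing these three kinds of differences can look like in code. The source record shapes, status mappings and the common “ACTIVE/INACTIVE” vocabulary are all hypothetical, invented for illustration:

```python
# Hypothetical harmonization of records from two silos into one common
# shape. Field names, structures and status vocabularies are illustrative.

# Semantic differences: each silo's status set maps onto a common vocabulary.
STATUS_MAP_A = {"Scheduled": "ACTIVE", "Needs_Followup": "ACTIVE",
                "Inactive": "INACTIVE"}
STATUS_MAP_B = {"Intake": "ACTIVE", "Scheduled": "ACTIVE",
                "Telemedicine-only": "ACTIVE", "Discharged": "INACTIVE"}

def harmonize_a(rec):
    return {
        "givenName": rec["firstName"],          # naming difference resolved
        "status": STATUS_MAP_A[rec["status"]],  # semantic difference resolved
    }

def harmonize_b(rec):
    return {
        "givenName": rec["GIVEN_NAME"],
        "status": STATUS_MAP_B[rec["STATUS"]],
    }

print(harmonize_a({"firstName": "Ada", "status": "Needs_Followup"}))
# {'givenName': 'Ada', 'status': 'ACTIVE'}
```

Once both silos yield records with the same field names and value vocabulary, downstream code can treat them as a single data set.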

Progressive Harmonization for Agility

All of the approaches Harmonize. But beware the failing approach of trying to model and map every data field in the entire enterprise. The trick is to model what is needed, when it is needed, producing valuable “data services” and exports in an agile, phased approach. Each iteration will then add more data outputs to the existing core of Harmonized data, avoiding a big-bang approach.

This is difficult to do with relational technologies. Changing the structure of a relational database typically has cascading effects on multiple systems that are tightly coupled to the relational system (often because they embed SQL directly in the calling system; if the tables change, the SQL breaks).

A “Data Services First” approach works well here. First identify consumers and business needs (or use cases) for initial phases and trace the data needed in outputs to identify the inputs and Harmonization needed for that phase.

Later phases extend the core harmonized models and transforms.

MarkLogic® particularly shines here, as MarkLogic Server indexes the structure of data from source systems in both harmonized and raw forms, so some search and processing can be done in the Data Lake style, but critical data is indexed and accessed in the harmonized form.


Data Indexing

The final characteristic that defines the approaches and distinguishes among them is how and when they Index the data. Indexing allows fast lookup and fast analytics on any indexed field. Indexing is only possible after Movement and Harmonization. If the data is not moved to a new place, there is typically nowhere to write the new indexes. If the data is left in raw, disparate forms, there are no consistent values to index. So how data is moved and harmonized determines if it can be Indexed.

  • Federated (Virtual) databases don’t do any Indexing, nor do they have separate data storage to store indexes at all. They rely on the various silos and source systems to have adequate indexing (a bad idea and false assumption, by the way). They map any request into a different request for each source system and execute on all source systems.
  • Data Lakes do very little Indexing. If they do index, they do it by integrating additional components (thus building a “franken-stack” from many integrated components) to index parts separately, such as HBase tables, smaller data marts, SOLR text indexes and so on. See this other blog describing my personal experience with these “Franken-systems.”
  • Data Hubs move all data to one place, partially or fully harmonize it and then index the data in its harmonized form. This is critical because the most useful indexes can only be built atop harmonized data.
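To see concretely why Indexing is only possible after Harmonization, here is a toy inverted index sketched in Python (the records and the harmonized “status” field are hypothetical). It works only because every record exposes the same field with the same value vocabulary:

```python
from collections import defaultdict

# Toy inverted index over one harmonized field. This only works because
# harmonization gave every record the same field name and vocabulary;
# raw, disparate records would leave nothing consistent to index.
def build_index(records, field):
    index = defaultdict(list)
    for doc_id, rec in enumerate(records):
        index[rec[field]].append(doc_id)
    return index

harmonized = [
    {"givenName": "Ada",  "status": "ACTIVE"},
    {"givenName": "Mary", "status": "INACTIVE"},
    {"givenName": "Lin",  "status": "ACTIVE"},
]

index = build_index(harmonized, "status")
print(index["ACTIVE"])  # fast lookup by doc id: [0, 2], no full scan needed
```

If the three records still carried silo-specific names like "STATUS" vs. "patient_state", there would be no single field to build the index over.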

World of Pain

Clearly, it is very hard to aggregate, find, group or analyze data if the data is named differently, has fundamental structural and semantic differences or is not indexed. You or your developers will somehow work around the differences, but if you’re not careful, the treatment will be worse than the cure—you’ll end up with piles of un-governable code scattered across dozens of reports, batch processes and ETL jobs to address the issue.

Summing Up the Differences

So we have three approaches and three criteria that distinguish them. Here’s a summary of the three approaches and how each behaves with respect to each criterion:

  • Federation: Data is not moved; it is left in the silos. Data is harmonized, but only just-in-time, as each query’s results are returned. Nothing is indexed; Federations rely on the source systems’ indexes.
  • Data Lakes: Data is moved to one place, but left in its source format, which makes each algorithm or ETL process responsible for dealing with the variety of data formats in its own way. Indexing is minimal; it’s hard to index data when it is in many incompatible forms.
  • Data Hub: Data is moved to one place and (at least partially) harmonized as it is moved. Data is indexed in the harmonized form for efficient access and analysis.

Implications of Each Approach

That covers quite a lot about the differences between the approaches. But they say the devil is in the details.

So now I’ll go into a few of the implications of those differences to show how each plays out in the real world, and make the case that Data Hubs are the best solution.

Virtual Database (Federation) Implications

Because Virtual Federated Databases rely on source systems for query, they are limited by the capabilities and availability of those source systems.

Least Common Denominator Query: If any source system or silo does not support a query—because that query searches by a particular field, orders by a particular field, uses geospatial coordinate search, uses text search or involves custom relevance scores—then the overall system can’t support it either. This means that any new systems added later may actually decrease the overall capabilities of the Federation, rather than increase them.

Query Re-mapping: The particular shortcoming of Virtual Federated Databases is that every query to the overall system must be converted into many different queries or requests—one for every federated silo or sub-system. This creates additional development work and tightly couples the Federated system to the silos.
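The re-mapping burden can be sketched as follows. The adapters, query shape and stand-in result rows below are hypothetical; the point is that one incoming query fans out into one translated query per silo, each with its own just-in-time harmonization:

```python
# Sketch of federated query fan-out. Each adapter translates the common
# query into the silo's own dialect, runs it against the live source
# system, and harmonizes the results on the way back. Adapters and
# query shapes here are hypothetical stand-ins for real source systems.

def query_silo_a(common_query):
    # e.g. translate to SQL against PERSON.firstName on silo A ...
    rows = [{"firstName": "Ada"}]        # stand-in for a live query result
    return [{"givenName": r["firstName"]} for r in rows]

def query_silo_b(common_query):
    # e.g. translate to a different dialect against IDENTITY.GIVEN_NAME ...
    rows = [{"GIVEN_NAME": "Mary"}]
    return [{"givenName": r["GIVEN_NAME"]} for r in rows]

ADAPTERS = [query_silo_a, query_silo_b]

def federated_query(common_query):
    # One incoming query becomes N outgoing queries; if any silo is down
    # or can't support the query, the whole call degrades.
    results = []
    for adapter in ADAPTERS:
        results.extend(adapter(common_query))
    return results

print(federated_query({"givenName": "*"}))
```

Every new silo means another adapter to write and maintain, and every adapter is tightly coupled to its source system’s schema.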

Real-Time Data: This is the one benefit of Federation. You don’t need to hook into change detection mechanisms or implement them if you don’t have them. I would argue that if your various sources and silos have no ability to figure out what has changed every hour or day or on some schedule, that is an enterprise problem you should solve. That said, Federated systems always query the live data, so they are always real-time even without any ability to detect changes in the source systems.

Operational Isolation: Federated systems go down when any federate goes down, or require complex code and user interactions to support partial queries in a degraded state. Often, live source systems do not have capacity for even minimal real-time queries, much less critical batch processes, so the federated virtual database may bring down or impact critical up-stream systems.

This diagram illustrates the just-in-time form of Harmonization in a Federation.


Because Virtual (Federated) databases perform all Harmonization in real time, they can support real-time web services to “run the business,” but they cannot typically scale to support batch processing or analytics. You can run parts of your business on a Federated Database, but you have to be cautious about how much load you put on the source systems.

To sum up my views on Virtual Federated Databases:

Even if you automate the process of scurrying around to query 10 different systems, you’re still scurrying around to query 10 different systems. It just doesn’t help much.

Data Lake Implications

Operational Isolation: Data Lakes are better than Federated systems because they do provide operational isolation by moving the data to a separate infrastructure (typically Hadoop). This is arguably their primary advantage.

Query: Data Lakes do not index or harmonize data, and the source system indexes are not available in the Data Lake, so query-ability is actually worse with a Data Lake than with a Virtual Federated Database.

Batch Processing: Data Lakes (and Hadoop systems generally) do reasonably well at batch processing, including predictive analytics. If you are willing to load and process every record no matter what, you do not need a sophisticated query and can work around lack of indexes.

Simplicity and Manageability: Because data is in various formats in a Data Lake, it requires complex logic in each batch process, ETL job or analytic job. This code rapidly becomes un-governable, mismatched, out of date and a burden.

Inherently Un-Governed: Data Lakes are useful for interactive data prep and data science, where a small group of analysts or a single analyst needs to copy, filter, alter and experiment with data without any controls or restrictions. This is not an enterprise solution to a business problem, but may be a good “playground” for data science.


Data Lakes require batch analytics to brute-force-process data, or batch ETL to load disparate data formats into a new form, so are generally for data analytics or insight rather than real-time processing. You can “see your business” with a Data Lake, but you cannot “run your business” on them, because of their analytic/batch nature. For operational, real-time loads, data must be indexed for fast access and aggregation.

You can sometimes blur the line between a Data Lake and Data Hub by adding ETL, which actually harmonizes data. Data Lakes still lack the transactional, real-time capabilities that allow an enterprise to “run the business” so they are best for analytic or reporting workloads.

To sum up my views on Data Lakes:

If your child has a disastrously messy room and picks everything up off the floor only to throw it in the closet, and then closes the closet door, is her room really clean? Or is it still a mess, just re-located to a central place?

Data Hub Implications

Query: Data Hubs maintain their own indexes and build these indexes over harmonized data. Remember that harmonization can be a progressive, agile process, so as the range of harmonized data increases, the power of the indexing increases as well.

So as a Data Hub progresses, the query and analytic capabilities both increase. In contrast, other systems must default to “least common denominator” indexing and query. Queries can only be supported in a Federation if they are supported by the least powerful system in that Federation. Analytic or reporting processes can only be supported in a Data Lake if the code can handle the messiest, least-compatible data sources in the lake.

Note that MarkLogic indexes individual data fields, JSON structure, XML structure, embedded or unstructured text, RDF semantic data and geospatial data out of the box. (Binary data is also stored, with text and binary metadata extracted from formats including Word documents, PDFs, JPEGs and others.) This capability for pervasive indexing is one reason MarkLogic typically recommends the Operational Data Hub pattern—we are well suited to both support and “operationalize” real-time access to disparate data.

Operational Isolation: Like a Data Lake, a Data Hub moves data to separate disks and infrastructure, so query load on the Hub does not impact source systems. Similarly, upgrades and management of the Data Hub do not require updates to source systems.

Batch Processing: Like a Data Lake, you can query and process many records when needed in a Hub. Unlike a Data Lake, Data Hubs allow index-driven query to restrict the quantity of data being processed for any particular purpose, and to do fast lookups and joins to combine data during batch jobs without a “sub-batch” to find relevant, related records.

Simplicity and Manageability: Because data is at least partially harmonized in a Data Hub, all data can be handled in common code. Building on the inventory example above, the harmonized data set might have “unitsOnHand” as a harmonized field regardless of which incompatible source system the data came from (total_items vs boxes_available). The harmonized fields will be indexed and uniform, making access to inventory information much simpler.
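Continuing the inventory example, a harmonize-on-ingest step might look like the sketch below. The source names (“warehouse”, “storefront”) and field names are hypothetical; what matters is that both incompatible shapes land in the hub as a single indexed “unitsOnHand” field:

```python
# Harmonize-on-ingest sketch for the inventory example. Source names and
# field names are hypothetical; both source shapes are reduced to one
# common, indexable "unitsOnHand" field.

def harmonize_inventory(rec, source):
    if source == "warehouse":
        # boxes plus a per-box count -> total units (structural difference)
        units = rec["boxes_available"] * rec["count_per_box"]
    elif source == "storefront":
        units = rec["total_items"]  # already a plain unit count
    else:
        raise ValueError(f"unknown source: {source}")
    return {"sku": rec["sku"], "unitsOnHand": units}

print(harmonize_inventory(
    {"sku": "A1", "boxes_available": 4, "count_per_box": 12}, "warehouse"))
# {'sku': 'A1', 'unitsOnHand': 48}
```

After ingest, every report, service and batch job reads “unitsOnHand” directly instead of re-implementing this mapping in its own code.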

Discovery and Exploration: Ad hoc or unexpected analyses can be served easily from a Data Hub. By moving the data to one place and harmonizing the critical data elements, the hard work has been done during movement and indexing, making it easy to query, interact, filter and drill down into data whenever needed.

Query Re-mapping: Data Hubs avoid query re-mapping to each source system by Harmonizing data first and indexing the uniform data set. Note that Harmonization is a “sunk development cost” since you cannot reasonably have a system that returns data from every source in a different format—all the approaches must eventually harmonize (even via bespoke code in every report). By Indexing and storing after Harmonization, Data Hubs allow simple, uniform queries.

Operational Data Hubs: Data Hubs have indexed, queryable data in a common form, so they can support web services, real-time applications, batch and also analytics. When a data hub is enabled with real-time data services, we call this an “operational” Data Hub or ODH. You can connect real-time services, web views, dynamic reports and APIs to an ODH, so you may “run your business” on an ODH.

To sum up my views on Data Hubs:

Build a Data Hub. It is the most powerful approach, and is the easiest to build and maintain.

Why MarkLogic?

I hope all of this information is useful regardless of your current approach or ultimate architectural choice. Having established the advantages of the Data Hub approach over others, why use MarkLogic to build one?

While the MarkLogic Server supports Data Lakes and Federations, we have found Hubs to be the best option, and have spent more than a decade building features to keep your operations running smoothly.

These features include:

Universal Index: MarkLogic takes any data, as is, and stores and indexes it. We leverage the structure already present in common record formats including XML, JSON and RDF, and index that structure automatically (with no coding).

Massive Scale: MarkLogic is inherently (and transparently!) clustered and uses reverse indexes, which allows MarkLogic to scale horizontally like a search engine rather than vertically like most database products.

Big Data Total Cost of Ownership: MarkLogic provides a Data Hub Service on cloud infrastructure that auto-scales to handle any query load.

As huge amounts of data become integrated and available, MarkLogic automatically moves older or less-critical data to cheaper storage, including HDFS or S3 for high-performance computing balanced with cost.

World-class Security: Silo busting is critical for any business and not only makes data easier to access but also easier to steal. MarkLogic secures data at the record level with best-in-class, role-based security, compartment security, per-field security and on-disk encryption via rotating keys and key management.

Flexible, Industry Standard Services: MarkLogic ships with a searchable REST API to expose or customize integrated data. Arbitrary transforms can be applied as data is served, so all clients and APIs get the integrated data in the format they need—typically various JSON, XML, RDF and CSV forms. This API layer enables agility: the data on disk can change while APIs remain stable (decoupled).

Text Search: MarkLogic indexes all text embedded in your data or extracted from binary formats. Stemming, phrase search, tf/idf relevance, and/or/not/near queries, decompounding, multiple languages and other features are present as expected in a full-featured search product.

Geospatial Query: The more data you integrate, the more ways you’ll need to access it. MarkLogic supports more than other products out of the box, including latitude/longitude geospatial query with points, polygons, distance and polygons with “holes.”

I know … this sounds like a lot of features and capabilities and almost too good to be true.

(Did I mention we’ve been doing this for over a decade?)

Want to leverage the cloud for the agility it provides? Learn how MarkLogic Data Hub Service provides immediate query capacity to handle spikes in demand.
