Overview

As data storage options evolve and become more complex, questions arise as to which approach is the right one. Arguments for or against a particular option aren’t always easily defined. It’s important to make comparisons between the different systems, database types, and storage formats, especially within the context of your organization’s specific data requirements. Let’s start with a quick comparison of a MarkLogic Data Hub and a data warehouse, and then look more generally at the differences between data hubs and data warehouses.

Comparison Table

Use Cases
  • MarkLogic Data Hub: Analyze structured and unstructured data; power transactional applications
  • Data Warehouse: BI and reporting on structured data
Data Model
  • MarkLogic Data Hub: Multi-model
  • Data Warehouse: Relational
Search & Query
  • MarkLogic Data Hub: Rich, multi-dimensional, search-style querying; querying with JavaScript, XQuery, SPARQL, and SQL
  • Data Warehouse: SQL queries on structured data, often defined in advance so the warehouse can be optimized
Data Ingestion
  • MarkLogic Data Hub: Optimized for loading multi-structured data
  • Data Warehouse: Optimized for loading relational data
Data Quality
  • MarkLogic Data Hub: Handles raw data that may or may not be curated; schema on read
  • Data Warehouse: Designed for highly curated data; schema on write
Data Curation
  • MarkLogic Data Hub: Supports data curation (enrichment, harmonization, mastering); stores metadata with the data
  • Data Warehouse: Requires ETL tools to curate data before loading
Security
  • MarkLogic Data Hub: Designed for handling mission-critical data
  • Data Warehouse: Designed for handling mission-critical data
Scalability
  • MarkLogic Data Hub: Elastic, scale-out, clustered architecture
  • Data Warehouse: Depends. Most cloud data warehouses are designed to scale. Others require significant work or have high costs
Deployment
  • MarkLogic Data Hub: Any environment
  • Data Warehouse: Any environment
Maturity
  • MarkLogic Data Hub: Modern architecture that has become popular in the past 5 years
  • Data Warehouse: Legacy architecture that has been used for over thirty years

What is a Data Warehouse?

Data warehouses are “observe the business” data stores designed for analyzing data that often comes from upstream “run the business” transactional systems. Their purpose is to provide analysts an aggregate, cross-cutting view of the data.

Data warehouses use a relational model in which data is managed in highly structured rows and columns. The data structure, or schema, is defined in advance (a.k.a. schema on write) and optimized for fast analytical queries using SQL. Analytical queries usually involve joining, aggregating, and filtering the data.
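To make schema on write and a typical analytical query concrete, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for a warehouse. The table names, columns, and values are invented for illustration and do not come from any particular product.

```python
import sqlite3

# In-memory SQLite database standing in for a warehouse (illustration only).
conn = sqlite3.connect(":memory:")

# Schema on write: the structure is declared before any data is loaded.
conn.executescript("""
    CREATE TABLE desks (desk_id INTEGER PRIMARY KEY, region TEXT NOT NULL);
    CREATE TABLE trades (
        trade_id INTEGER PRIMARY KEY,
        desk_id  INTEGER NOT NULL REFERENCES desks(desk_id),
        notional REAL    NOT NULL
    );
""")
conn.executemany("INSERT INTO desks VALUES (?, ?)", [(1, "EMEA"), (2, "APAC")])
conn.executemany(
    "INSERT INTO trades VALUES (?, ?, ?)",
    [(10, 1, 500.0), (11, 1, 1500.0), (12, 2, 50.0), (13, 2, 2000.0)],
)

# A typical analytical query: join, filter, then aggregate.
ANALYTICAL_QUERY = """
    SELECT d.region, COUNT(*) AS trade_count, SUM(t.notional) AS total_notional
    FROM trades AS t
    JOIN desks AS d ON d.desk_id = t.desk_id
    WHERE t.notional >= 100
    GROUP BY d.region
    ORDER BY d.region
"""
exposure_by_region = conn.execute(ANALYTICAL_QUERY).fetchall()
print(exposure_by_region)

# A row that does not fit the declared schema is rejected at load time.
try:
    conn.execute("INSERT INTO trades VALUES (?, ?, ?)", (14, 1, None))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print("non-conforming row rejected:", rejected)
```

The point of the sketch is the ordering: the schema exists first, the data must conform to it when it is loaded, and the query can then rely on that structure.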

While data warehouses have existed for decades, today’s modern data warehouses are purpose-built for the cloud. Examples such as Snowflake and Redshift are popular reincarnations of traditional data warehouses like Netezza and Teradata. Snowflake, in their own words, is “glorified SQL.” These cloud-native data warehouses are fully managed and provide cloud scale and cloud economics. They have also evolved to provide some support for JSON. Their core use case is still the same, however: they support enterprise BI and analytics on relational data.

Let’s consider a typical example of how a data warehouse is used. Imagine a large bank running real-time trading systems to handle transactions. Those transactions happen in multiple OLTP (Online Transaction Processing) systems across the bank and are then aggregated into a central OLAP (Online Analytical Processing) data warehouse using ETL tools to extract, transform, and load the data.

The warehouse is used for further back-end processing (e.g., trade reconciliation), analysis (e.g., aggregate risk exposure), and reporting (e.g., regulatory agency inquiries).

What is a Data Hub?

Data hubs are data stores that act as a stable integration hub in a hub-and-spoke architecture and provide a centralized view of your most important data assets. They use a multi-model database to store multi-structured data of various types, and also have the tools to curate that data (enriching, mastering, harmonizing). They are also operational and transactional, meaning they can power transactional applications, be used for advanced analytics, or simply feed other downstream systems.

While they can serve as systems of record, data hubs usually act as a shared integration point in most architectures, where they are used to create an organization’s 360-degree view. As a rule of thumb, a data hub is not a drop-in upgrade or replacement for a data warehouse. Data hubs and data warehouses can easily coexist, and MarkLogic customers often use both together.

What Are the Key Advantages of a Data Hub?

Compared to data warehouses, data hubs provide greater agility, have built-in data curation tools, and are operational (not just analytical).

Data hubs provide agile DataOps. They make it possible to apply the principles of agile development to managing data in the data layer. This is possible because data hubs do not require a strict schema to be defined in advance; that up-front requirement is what forces a waterfall approach. Instead, raw data can be loaded into a data hub as is. The raw data can then be curated and made fit-for-purpose for downstream use. The process is often referred to as “ELT” because the data is loaded first, then transformed iteratively to meet the needs of the business. Schemas can be defined for the curated data or at query time (a.k.a. schema on read).
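The load-first, transform-later idea can be sketched in a few lines of plain Python. This is only an illustration of schema on read, not MarkLogic's API; the feed contents, the field names, and the read_customer helper are all hypothetical.

```python
import json

# Raw documents from two hypothetical sources, kept exactly as received.
RAW_FEED = [
    '{"source": "crm", "FirstName": "Ada", "signup": "2020-01-15"}',
    '{"source": "billing", "FName": "Grace", "plan": {"tier": "gold"}}',
]

# Load step: store every document as is; no schema is enforced here.
raw_store = [json.loads(line) for line in RAW_FEED]


def read_customer(doc):
    """Schema on read: project a raw document onto the shape a query needs."""
    return {
        "first_name": doc.get("FirstName") or doc.get("FName"),
        "tier": doc.get("plan", {}).get("tier", "unknown"),
    }


# Transform step: applied later, and free to change as requirements change.
customers = [read_customer(doc) for doc in raw_store]
print(customers)
```

Because the raw documents are never altered, the projection in read_customer can be revised iteratively without reloading anything, which is the property the "ELT" label refers to.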

Data hubs also excel when there is ambiguity. They support scenarios when there are unknown, complex data sources that may need to be streamed in (or batch loaded), and unknown use cases for how the data will be used later.

The reason data hubs are good at handling ambiguity is that they index everything and provide search-style querying immediately after ingesting the data. Data hubs also have built-in tools to resolve the ambiguity over time, as downstream use cases become concrete and define how source data needs to be harmonized and curated.
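As a rough illustration of what "index everything" can mean, the toy inverted index below records every word and every field path at ingestion time, so documents with different shapes are searchable immediately. It is a simplified sketch with invented documents and helper names, not a description of how MarkLogic builds its indexes.

```python
from collections import defaultdict


def tokens(doc, prefix=""):
    """Yield (path, word) pairs for every value in a nested dict."""
    for key, value in doc.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            yield from tokens(value, prefix=f"{path}.")
        else:
            for word in str(value).lower().split():
                yield path, word


def build_index(docs):
    """Map each word, and each (path, word) pair, to matching document ids."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for path, word in tokens(doc):
            index[word].add(doc_id)          # word anywhere in the document
            index[(path, word)].add(doc_id)  # word within a specific field
    return index


# Two hypothetical documents with different structures.
docs = {
    "d1": {"name": "Ada Lovelace", "notes": "prefers email contact"},
    "d2": {"customer": {"name": "Grace Hopper"}, "notes": "email bounced"},
}
index = build_index(docs)

print(sorted(index["email"]))                     # word search across all fields
print(sorted(index[("customer.name", "grace")]))  # search scoped to one path
```

No schema was declared for either document, yet both word-level and structure-aware lookups work as soon as the documents are indexed.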

Is a Data Hub Good for Integration?

Here are some examples of the integration challenges that a data hub can resolve:

  • Naming differences (e.g., FirstName vs. FName: the same values described differently just because someone chose two different column names)
  • Structural differences (e.g., varying number and combination of fields: one system might store “boxes_available” as the total boxes in the warehouse plus a “count_per_box” field from which total items are derived, while another system stores “total_items” directly, regardless of boxing)
  • Semantic differences (e.g., similar to naming differences, but here the values themselves also differ: one system may have three patient statuses and another may have four, such as {Scheduled, Needs_Followup, Inactive} vs. {Intake, Scheduled, Telemedicine-only, Discharged}. These statuses often overlap and are hard to map to one another.)
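The three kinds of differences listed above can be sketched as small harmonization functions. This is an illustrative Python sketch, not a data hub API: the canonical field name, the status categories, and in particular the status mapping itself are assumptions that a real project would have to agree on with domain experts.

```python
# Naming differences: several source field names map to one canonical name.
FIELD_ALIASES = {"FirstName": "first_name", "FName": "first_name"}

# Semantic differences: a hand-maintained mapping onto shared categories.
# The target categories here are hypothetical.
STATUS_MAP = {
    "Scheduled": "scheduled",
    "Needs_Followup": "active",
    "Inactive": "inactive",
    "Intake": "active",
    "Telemedicine-only": "active",
    "Discharged": "inactive",
}


def harmonize_names(record):
    """Rename known aliases; leave unknown fields untouched."""
    return {FIELD_ALIASES.get(key, key): value for key, value in record.items()}


def total_items(record):
    """Structural differences: derive total items when only boxes are stored."""
    if "total_items" in record:
        return record["total_items"]
    return record["boxes_available"] * record["count_per_box"]


def harmonize_status(status):
    """Map a source status to a shared category, flagging anything unmapped."""
    return STATUS_MAP.get(status, "unmapped")


print(harmonize_names({"FName": "Grace", "city": "Arlington"}))
print(total_items({"boxes_available": 12, "count_per_box": 20}))
print(total_items({"total_items": 240}))
print(harmonize_status("Telemedicine-only"), harmonize_status("Archived"))
```

Returning "unmapped" instead of guessing keeps unresolved semantic differences visible, so they can be curated later rather than silently merged.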

Data hubs are operational. They can provide a view of the business that is kept up to date in real time, and can even write back to upstream systems when necessary. By allowing real-time updates with transactional support, data hubs provide a reliable data store in which direct updates may be made to integrated data without hurting data governance and accuracy.

What Are the Best Use Cases for a Data Hub?

Here are some of the signs that indicate a data hub is a good choice for your architecture:

  • When you have complex, changing data sources and uses: Data hubs are good at integrating multi-structured, changing data. A data hub is a good choice if you are not sure what the incoming data sources include or when the data will be available, if you have many complex schemas to integrate, if upstream data sources change frequently or are of unknown quality, or if you are not sure what the integrated data will be used for
  • When the business needs data delivered fast: Data hubs provide a significant advantage in terms of agility. They are a good choice if you cannot wait for lots of upfront data modeling and the business needs data delivered fast. They are also a good choice if the needs of the business change frequently and you need agile DataOps
  • When you have complex (unplanned) queries: Querying data in a data hub is more like running searches on Google, making it ideal for asking rich questions of your data that might otherwise be impossible in a traditional data warehouse. Instead of adopting a rows-and-columns mindset and worrying about which complex joins are required, you can think about querying across multi-dimensional entities and relationships, including values, metadata, words and phrases, and structure
  • When you need real-time, operational views: Data hubs are operational and transactional. This makes them a good choice when your analytics team needs a real-time view, not a historical snapshot, or when the use case requires that analysts be able to write back to the system and create a feedback loop as part of a system of knowledge
  • When you need a stable platform and trusted point of integration: Data hubs are backed by a database. This means data hubs persist data and provide HA/DR, transactional consistency, enterprise security, and all the other capabilities required to act as a stable platform that will simplify your overall architecture and not become just another silo

Our customers typically use MarkLogic Data Hub Service for use cases such as building a unified view, search and discovery, and operational analytics.

When is a Data Warehouse a Better Fit?

Data warehouses are proven in the enterprise and almost all organizations have one or more data warehouses, and often a number of data marts that have been spun off them. Data warehouses will always be useful when data is highly structured and well-defined, and when the warehouse’s purpose is also well-defined.

If all you need to do is run fast SQL queries over rows and columns, then a data warehouse is a great solution. Data warehouses are optimized for loading structured data and querying with SQL, and because of their dominance across the enterprise for the past 30+ years, there is an abundance of people with data warehouse and SQL skills.

So, if you are happy with your data warehouse and you don’t have challenges with data integration, there is no reason to change!

How They Can Work Together

Data hubs and data warehouses can easily coexist, and our customers often use both together.

In most cases, organizations have existing data warehouses, but then a new use case arises that requires integrating data from those warehouses. They don’t want to spend significant time and money on ETL and data modeling to build a common schema to integrate it all.

To solve this problem, organizations can employ a data hub to integrate data from those siloed warehouses (and any other data silos). From there, the data hub can power applications, feed curated data to another data warehouse downstream, or offload it into a file system optimized for low-cost storage.

So, the data warehouse continues to be an important part of the architecture, but the data hub serves to make the overall data-integration process more agile and trusted.

Learn More

We have many customers who chose to supplement or replace their data warehouses with a MarkLogic Data Hub. Some examples include AIRBUS, AbbVie, Northern Trust, Hannover Re, and Chevron.
