Hadoop Integration

Over 90 percent of the world’s data was created in the past two years. Organizations need to store and analyze massive amounts of structured and unstructured data from disparate data sources—data too massive to manage effectively with traditional relational databases. Hadoop is a great tool to help with this task, and MarkLogic is the best database for Hadoop. MarkLogic makes Hadoop better by letting you use it as part of an infrastructure to handle both operational and analytic workloads—improving data governance, reducing ETL efforts, and matching your storage mechanisms with the value of your data.


Hadoop: HDFS and MapReduce

Hadoop has become popular because it is designed to store data cheaply in the Hadoop Distributed File System (HDFS) and to run large-scale MapReduce jobs for batch analysis.


HDFS is a Java-based file system that provides scalable and reliable data storage across clusters of commodity servers. In production, HDFS has been shown to scale to 200 petabytes of storage across 4,500 servers, supporting close to a billion files.
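Because HDFS is Java-based, applications talk to it through Hadoop's standard FileSystem API. The minimal sketch below writes a small file into HDFS and reads it back; the namenode address (hdfs://namenode:8020) and the file path are placeholders to adjust for your own cluster.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRoundTrip {
    public static void main(String[] args) throws Exception {
        // Point the client at the cluster; "hdfs://namenode:8020" is a placeholder address.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/hello.txt");

        // Write a small file into HDFS.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("Hello, HDFS".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back to confirm the round trip.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
    }
}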


MapReduce is a processing framework built on a “divide-and-conquer” paradigm: a huge task is broken into small parts (“Map”), and the outputs from each part are then aggregated (“Reduce”). Any large task that can be broken into smaller pieces is a candidate for use with Hadoop.
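To make the Map and Reduce steps concrete, here is a minimal word-count sketch written against Hadoop's standard MapReduce Java API; the input and output HDFS paths are supplied as command-line arguments, and the class names are illustrative.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: split each line of input into words, emitting a (word, 1) pair per word.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: aggregate the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}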

[Figure: MarkLogic and Hadoop side by side. MarkLogic provides online, low-latency applications, real-time transactions, and built-in search; Hadoop provides offline, high-latency processing, long-haul batch analytics, and distributed, cost-effective storage.]

MarkLogic: The Best Database for Hadoop

Hadoop is great for storing and analyzing data, but it still needs a database. Hadoop is simply not designed for the low-latency transactions required by real-time interactive applications, or for applications that require enterprise features such as government-grade security, high availability, and disaster recovery. Even as the Hadoop ecosystem continues to evolve, the real benefits of Hadoop are realized only when it runs alongside an enterprise-grade database.

MarkLogic is the best database for Hadoop because it runs seamlessly alongside the Hadoop ecosystem, acting as the database that powers real-time, transactional applications. Additionally, MarkLogic can leverage HDFS within a tiered storage model, moving data between any combination of HDFS, S3, SSD, SAN, NAS, or local disk to meet specific SLAs and cost objectives without modifying downstream application code.

Using MarkLogic and Hadoop together gives you the low cost of Hadoop storage along with the enterprise features you cannot live without, including ACID transactions, high availability and disaster recovery, government-grade security, and performance monitoring tools.

Modern Hadoop Infrastructure

MarkLogic stores unstructured data across clusters much like Hadoop does. This architectural parity makes it easy to move data partitions (forests) between MarkLogic hosts and the Hadoop ecosystem using the MarkLogic Connector for Hadoop, with MarkLogic acting as either an input source or an output destination.

Use Cases for MarkLogic and Hadoop

Real-time Apps Running Directly on HDFS

HDFS is only a file system. It has no indexes, so it cannot perform low-latency queries or granular updates; for real-time applications you need a database. MarkLogic can run directly on HDFS, so you can deliver real-time applications on data you have staged in HDFS.

HDFS as an Inexpensive Storage Tier

MarkLogic can use HDFS as part of a tiered storage strategy. HDFS is ideal for archive data: as data ages, you can fluidly move it off local disk and onto HDFS, saving money while still providing appropriate performance and availability for your applications.

Utilize the MarkLogic Connector for Hadoop

Use the MarkLogic Connector for Hadoop to run MapReduce jobs for ETL, analytics, or enrichment. In fact, our bulk loading tool, MarkLogic Content Pump (mlcp), uses MapReduce under the covers to load terabytes of data in parallel. A MapReduce-based application can even access the data in a MarkLogic data file without first mounting it to a database, a MarkLogic feature called “direct access.”
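As a rough sketch of what a connector-based job can look like, the map-only example below turns each line of an HDFS file into a small XML document and writes it into MarkLogic. It assumes the connector's ContentOutputFormat and DocumentURI classes and the mapreduce.marklogic.output.* connection properties; verify these names, the port, and the credentials against the connector version you are running.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Assumed connector classes; check the com.marklogic.mapreduce package in your connector version.
import com.marklogic.mapreduce.ContentOutputFormat;
import com.marklogic.mapreduce.DocumentURI;

public class HdfsToMarkLogic {

    // Map: wrap each input line in a trivial XML element, keyed by a generated document URI.
    public static class LineToDocMapper
            extends Mapper<LongWritable, Text, DocumentURI, Text> {
        public void map(LongWritable key, Text value, Context context)
                throws java.io.IOException, InterruptedException {
            DocumentURI uri = new DocumentURI("/imported/line-" + key.get() + ".xml");
            context.write(uri, new Text("<line>" + value.toString() + "</line>"));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed connection properties, following the mapreduce.marklogic.output.* convention.
        conf.set("mapreduce.marklogic.output.host", "localhost");
        conf.set("mapreduce.marklogic.output.port", "8000");
        conf.set("mapreduce.marklogic.output.username", "admin");
        conf.set("mapreduce.marklogic.output.password", "admin");

        Job job = Job.getInstance(conf, "hdfs to marklogic");
        job.setJarByClass(HdfsToMarkLogic.class);
        job.setMapperClass(LineToDocMapper.class);
        job.setOutputKeyClass(DocumentURI.class);
        job.setOutputValueClass(Text.class);
        job.setOutputFormatClass(ContentOutputFormat.class);
        job.setNumReduceTasks(0);  // map-only ingest job

        FileInputFormat.addInputPath(job, new Path(args[0]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}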




Making Hadoop Better

Confused about Hadoop? This video provides an overview of Hadoop and how you can use it.



MarkLogic and Hadoop

Download this datasheet to learn more about how you can use MarkLogic to improve your investment in Hadoop.



Hadoop Developer’s Guide

Read the developer’s guide to gain a deeper understanding of how to use Hadoop alongside MarkLogic.