Connector for Hadoop

The MarkLogic Connector for Hadoop enables real-time Big Data Analytics on structured, semi-structured, and unstructured data.

MarkLogic 5 is an operational database designed for Big Data Applications that deliver better decisions faster. Organizations that recognize the need for real-time analytics on Big Data turn to MarkLogic to handle complex, ad hoc queries. These same organizations might also use software libraries with complex algorithms to process massive volumes of data. To apply these libraries to Big Data, they can use the MarkLogic Connector for Hadoop, which is available for download from the Developer Website.

Hadoop and MarkLogic Integration

The MarkLogic Connector for Hadoop lets MarkLogic and Hadoop work seamlessly together to tackle more problems than can be addressed by either technology alone. The combination of MarkLogic and Hadoop gives customers the best of real-time analytics and batch processing. Hadoop provides the ability to process data with complex algorithms in a batch process, and MarkLogic enables ongoing, ad hoc queries on the processed data. This combination lets users identify new insights in data immediately without having to write more code and wait for the batch process to complete. Running a query in MarkLogic often leads to insights that require drilling down further. The MarkLogic real-time model allows this type of responsive interaction in analytics applications.

The MarkLogic Connector for Hadoop is a drop-in library for the Hadoop framework. It allows developers to run MapReduce jobs on data in MarkLogic through standard Hadoop APIs. The connector offers the flexibility of reading and writing data in MarkLogic as well as in the Hadoop Distributed File System (HDFS), the default storage system for Hadoop. This is particularly useful when batch jobs run on raw data stored in HDFS and their output then needs to be delivered to MarkLogic for additional analysis. By delivering processed data to MarkLogic, users can take advantage of MarkLogic indexes to run precise, ad hoc queries. Also, the connector takes advantage of MarkLogic’s distributed architecture to perform large batch reads and writes in parallel.
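As a rough illustration of the MapReduce data flow described above, the following minimal sketch uses plain Java only. It does not use the connector or any Hadoop API. The class name `MapReduceSketch`, the method `countWords`, and the in-memory "splits" are all hypothetical stand-ins: each inner list plays the role of one partition of documents that a real job might read in parallel from MarkLogic or HDFS.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical, self-contained illustration of the map/reduce pattern.
// It is NOT the MarkLogic Connector for Hadoop API.
public class MapReduceSketch {

    // Each inner list stands in for one input split (one partition of documents).
    static Map<String, Long> countWords(List<List<String>> splits) {
        return splits.parallelStream()                  // splits may be processed in parallel
                .flatMap(List::stream)                  // the documents inside each split
                // "Map" step: emit every lower-cased word found in a document.
                .flatMap(doc -> Arrays.stream(doc.toLowerCase().split("\\W+")))
                .filter(word -> !word.isEmpty())
                // "Reduce" step: group identical words and count them.
                .collect(Collectors.groupingBy(
                        word -> word,
                        TreeMap::new,
                        Collectors.counting()));
    }

    public static void main(String[] args) {
        List<List<String>> splits = List.of(
                List.of("Hadoop runs batch jobs", "MarkLogic runs ad hoc queries"),
                List.of("Batch output feeds queries"));

        // Print one "word<TAB>count" line per distinct word, in sorted order.
        countWords(splits).forEach((word, count) ->
                System.out.println(word + "\t" + count));
    }
}
```

In an actual deployment, the input and output sides would be handled by the connector and Hadoop rather than by in-memory lists, and the aggregated results could then be queried in MarkLogic.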

  Hadoop Connector Diagram

“Having the ability to run batch-processing on information before it goes into MarkLogic, then to also enrich data that’s already inside MarkLogic, offers a lot of new possibilities for our customers. As Big Data continues to be a driving factor in technology decisions, being able to run Hadoop processes on data inside MarkLogic will be a real game changer.”

- Tony Jewitt, Vice President, Big Data Solutions, Avalon Consulting, LLC

“The MarkLogic Connector for Hadoop allows us to do much more with data we extract using Hadoop. For example, we can now run detailed, rich analytics of image fingerprints by putting the data into MarkLogic.”

- Sergio Restrepo, Senior Architect, Yuxi Pacific