
Hey DB-Engines: Yes MarkLogic Does MapReduce

02.10.2014

Hello DB-Engines,

I was just looking at your MarkLogic listing, and wanted to make you aware that there are actually three ways in which MarkLogic supports MapReduce:

1. Our Hadoop connector allows customers to run Hadoop MapReduce jobs on data in MarkLogic. This can be much faster and more secure than running the same jobs over an equivalent corpus in HDFS, because you can specify query constraints on the universe of documents you want to map, and we use our fast index-based queries to map only the documents that match those constraints. In addition, we automatically security-trim the set of documents to map. (A sketch of such a job appears after this list.)

2. As of MarkLogic 7 (shipped November 2013), we support using HDFS to store our native data format, known as forests, so you can run your database directly from HDFS as you would from any other file system. If you do this, you can also run MapReduce jobs directly against those forests, even when they are not attached to a running MarkLogic instance. This is very useful when customers want to archive data on HDFS and detach it so it isn't consuming MarkLogic compute cycles, yet keep it available for batch analytics without any ETL. What's more, if a customer later wants to query that data interactively, the forest can be re-mounted and queried in seconds. (The second sketch below reads such a detached forest.)

3. Finally, we also have a non-Hadoop, in-database MapReduce capability for computing aggregates over large amounts of data in real time. Customers write a user-defined function (UDF) in C++ that follows the map/reduce pattern. These functions are pushed to the nodes where the data is managed and are executed in-process with the server. The source data for aggregation comes from our range indexes, which are memory-mapped files, so the entire computation happens in memory; that is what makes real time possible. (The third sketch below illustrates the pattern.)
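
To make the first option concrete, here is a minimal sketch of a Hadoop job wired to the MarkLogic Connector for Hadoop. The connector classes (com.marklogic.mapreduce.DocumentInputFormat, DocumentURI, MarkLogicNode) and the mapreduce.marklogic.input.* configuration keys follow the connector's documented API, but the host, credentials, and job logic are illustrative, and exact property names can vary by connector version:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

import com.marklogic.mapreduce.DocumentInputFormat;
import com.marklogic.mapreduce.DocumentURI;
import com.marklogic.mapreduce.MarkLogicNode;

public class MarkLogicDocCount {

    // The mapper receives documents streamed from MarkLogic itself,
    // not from HDFS splits.
    public static class DocMapper
            extends Mapper<DocumentURI, MarkLogicNode, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(DocumentURI uri, MarkLogicNode doc, Context ctx)
                throws IOException, InterruptedException {
            // Only documents matching the server-side query constraint
            // (and visible to the connecting user) ever reach this method.
            ctx.write(new Text(uri.toString()), ONE);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Connection details for the MarkLogic server (illustrative values).
        conf.set("mapreduce.marklogic.input.host", "ml-host.example.com");
        conf.set("mapreduce.marklogic.input.port", "8000");
        conf.set("mapreduce.marklogic.input.username", "hadoop-user");
        conf.set("mapreduce.marklogic.input.password", "secret");

        Job job = Job.getInstance(conf, "count MarkLogic documents");
        job.setJarByClass(MarkLogicDocCount.class);
        job.setInputFormatClass(DocumentInputFormat.class); // input from MarkLogic
        job.setMapperClass(DocMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // A reducer and output format would be configured as in any Hadoop job.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```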
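
The second option looks much the same from the Hadoop side: the connector ships a ForestInputFormat that reads forest files straight off HDFS, with no running MarkLogic instance required. In this sketch the class names follow the connector's Direct Access API as documented, but the mapper's key/value types may differ slightly by connector version, and the forest path is hypothetical:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

import com.marklogic.mapreduce.DocumentURI;
import com.marklogic.mapreduce.ForestDocument;
import com.marklogic.mapreduce.ForestInputFormat;

public class ArchivedForestScan {

    public static class ForestMapper
            extends Mapper<DocumentURI, ForestDocument, Text, LongWritable> {
        @Override
        protected void map(DocumentURI uri, ForestDocument doc, Context ctx)
                throws IOException, InterruptedException {
            // Each record is a document decoded from the forest's on-disk
            // format: no ETL, and no MarkLogic server in the loop.
            ctx.write(new Text(uri.toString()), new LongWritable(1));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "scan archived forest");
        job.setJarByClass(ArchivedForestScan.class);
        job.setInputFormatClass(ForestInputFormat.class);
        // Point the job at the detached forest directory on HDFS
        // (path is illustrative).
        FileInputFormat.setInputPaths(job,
                new Path("hdfs://namenode:8020/marklogic/forests/Archive-2013"));
        job.setMapperClass(ForestMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```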
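
The UDFs for the third option are C++ plugins compiled against MarkLogic's aggregate UDF interface, which I won't reproduce here. To show just the pattern those UDFs follow, here is a small Java sketch (the names and data are hypothetical, and this is not the real plugin API) of a mean aggregate decomposed into mergeable partials; it is that mergeability that lets each node map its own slice of the range index and reduce results pairwise:

```java
import java.util.List;

// Illustrates the map/reduce shape of an in-database aggregate:
// each node maps its slice of a range index into a partial result,
// then partials are reduced pairwise into the final answer.
public class MeanAggregateSketch {

    // Partial state carries just enough to merge with any other partial.
    record Partial(long count, double sum) {
        Partial merge(Partial other) { // the "reduce" step
            return new Partial(count + other.count, sum + other.sum);
        }
    }

    // The "map" step: fold one node's in-memory index slice into a partial.
    static Partial map(double[] indexSlice) {
        Partial p = new Partial(0, 0.0);
        for (double v : indexSlice) {
            p = p.merge(new Partial(1, v));
        }
        return p;
    }

    public static void main(String[] args) {
        // Pretend each array is the range-index slice held by one node.
        List<double[]> slices = List.of(
                new double[] {1, 2, 3},
                new double[] {4, 5},
                new double[] {6});
        Partial total = slices.stream()
                .map(MeanAggregateSketch::map)
                .reduce(new Partial(0, 0.0), Partial::merge);
        System.out.println("mean = " + total.sum() / total.count()); // 3.5
    }
}
```

Because partials merge in any order, the reduce step can run as a tree across nodes, which is why the real in-memory version delivers results in real time.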

It would be great to get our listing corrected to reflect this. Ideally, you could list the three ways we support MapReduce in that column (the text in parentheses below would work for the info icon tooltip):

  • Hadoop Connector (to run Hadoop MapReduce jobs on data stored in MarkLogic, taking advantage of indexing and security)
  • Direct Access (allows Hadoop MapReduce jobs to recognize MarkLogic data stored in HDFS as a “mappable” file format)
  • In-Database MapReduce (for distributed computation of aggregates over data in MarkLogic in real time)

Thanks!

— David
