Hey DB-Engines: Yes, MarkLogic Does MapReduce

Hello DB-Engines,

I was just looking at your MarkLogic listing, and wanted to make you aware that there are actually three ways in which MarkLogic supports MapReduce:

1. Our Hadoop connector lets customers run Hadoop MapReduce jobs on data in MarkLogic. This can be much faster and more secure than running the same jobs over the same corpus in HDFS. It is faster because you can specify query constraints on the set of documents you want to map, and we’ll use our fast index-based queries to map only the documents that match those constraints. It is more secure because we automatically security-trim the set of documents to map.

2. As of MarkLogic 7 (shipped November 2013), we support storing our native data format (known as forests) in HDFS, so you can run your database directly from HDFS as if it were any other file system. If you do this, you can also run MapReduce jobs directly against those forests, even if they are not attached to a running MarkLogic instance. This is very useful when customers want to archive data on HDFS and detach it so that it isn’t consuming MarkLogic compute cycles, but still want access to it for batch analytics without performing any ETL on the data. What’s more, if the customer does want to query the data interactively, it can be re-mounted and queried in seconds.

3. Finally, we also have a non-Hadoop, in-database MapReduce capability for computing aggregates over large amounts of data in real time. Customers write a user-defined function (UDF) in C++ that follows the map/reduce pattern. These functions are pushed to the nodes where the data is managed and are executed in-process with the server. The source data for aggregation comes from our range indexes, which are memory-mapped files, so the entire computation happens in memory, which is what allows it to work in real time.
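To make the map/reduce pattern in point 3 a little more concrete, here is a minimal sketch of a distributed mean. It is written in Python for brevity and is not the MarkLogic UDF API (real UDFs are written in C++); the names `Partial`, `map_values`, `reduce_partials`, `finish`, and `aggregate_mean` are hypothetical, and each inner list simply stands in for the range-index values held on one node.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Iterable, List


@dataclass(frozen=True)
class Partial:
    """Partial aggregate: enough state to compute a mean after merging."""
    count: int = 0
    total: float = 0.0


def map_values(values: Iterable[float]) -> Partial:
    """Map step: run on one node, over that node's local values only."""
    count, total = 0, 0.0
    for v in values:
        count += 1
        total += v
    return Partial(count, total)


def reduce_partials(a: Partial, b: Partial) -> Partial:
    """Reduce step: merge two partial results (order does not matter)."""
    return Partial(a.count + b.count, a.total + b.total)


def finish(p: Partial) -> float:
    """Turn the merged partial into the final answer."""
    if p.count == 0:
        raise ValueError("no values to aggregate")
    return p.total / p.count


def aggregate_mean(node_slices: List[List[float]]) -> float:
    """Simulate the whole flow: map per node, then reduce, then finish."""
    partials = [map_values(s) for s in node_slices]
    return finish(reduce(reduce_partials, partials, Partial()))
```

For example, `aggregate_mean([[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]])` returns `3.5`. The point of the design is that each node ships back only a small partial result (a count and a sum here) rather than its raw values, so the merge step stays cheap no matter how much data sits behind each node.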

It would be great to get our listing corrected to reflect this. Ideally, you could list the three ways we support MapReduce in that column (the text in parentheses would work for the info icon tooltip):

  • Hadoop Connector (to run Hadoop MapReduce jobs on data stored in MarkLogic, taking advantage of indexing and security)
  • Direct Access (allows Hadoop MapReduce jobs to recognize MarkLogic data stored in HDFS as a “mappable” file format)
  • In-Database MapReduce (for distributed computation of aggregates over data in MarkLogic in real time)


— David
