Hey DB-Engines: Yes MarkLogic Does MapReduce

Hello DB-Engines,

I was just looking at your MarkLogic listing, and wanted to make you aware that there are actually three ways in which MarkLogic supports MapReduce:

1.       Our Hadoop connector allows customers to run Hadoop MapReduce jobs on data in MarkLogic. Doing this can be much faster and more secure than running Hadoop MapReduce jobs over the same corpus in HDFS because you can specify query constraints on the universe of documents you want to map, and we’ll use our fast index-based query to only map the documents that conform to those constraints. In addition, we will automatically security-trim the set of documents to map.

2.      As of MarkLogic 7 (shipped November 2013), we support using HDFS to store our native data storage format (known as forests), so you can run your database directly from HDFS as if it were any other file system. If you do this, you can also run MapReduce jobs directly against those forests, even if they are not attached to a running MarkLogic instance. This is very useful for cases where customers want to archive data on HDFS, detach it so that it isn’t consuming MarkLogic compute cycles, but still have access to it for batch analytics without having to perform any ETL on the data. What’s more, in this scenario if the customer does want to interactively query the data, it can be re-mounted and queried in seconds.

3.      Finally, we also have a non-Hadoop, in-database MapReduce capability for computing aggregates over large amounts of data in real time. The way it works is that customers can write a user defined function (UDF) in C++ that uses the map/reduce pattern. These functions are pushed to the nodes where the data is managed and are executed in-process with the server process. The source data for aggregation comes from our range indexes, which are memory-mapped files, so the entire process happens in memory, which allows it to work in real time.

It would be great to get our listing corrected to reflect this. Ideally you could list three different ways we allow MapReduce in that column (text in parentheses would work for the info icon tooltip):

  • Hadoop Connector (to run Hadoop MapReduce jobs on data stored in MarkLogic, taking advantage of indexing and security)
  • Direct Access (allows Hadoop MapReduce jobs to recognize MarkLogic data stored in HDFS as a “mappable” file format)
  • In-Database MapReduce (for distributed computation of aggregates over data in MarkLogic in real time)


— David