Hey DB-Engines: Yes MarkLogic Does MapReduce

by David Gorbet

Posted on February 10, 2014

Hello DB-Engines,

I was just looking at your MarkLogic listing, and wanted to make you aware that there are actually three ways in which MarkLogic supports MapReduce:

1. Our Hadoop connector allows customers to run Hadoop MapReduce jobs on data in MarkLogic. This can be much faster and more secure than running Hadoop MapReduce jobs over the same corpus in HDFS: you can specify query constraints on the set of documents you want to map, and we’ll use our fast index-based query to map only the documents that satisfy those constraints. In addition, we automatically security-trim the set of documents to map.

2. As of MarkLogic 7 (shipped November 2013), we support using HDFS to store our native data storage format (known as forests), so you can run your database directly from HDFS as if it were any other file system. If you do this, you can also run MapReduce jobs directly against those forests, even if they are not attached to a running MarkLogic instance. This is very useful for cases where customers want to archive data on HDFS, detach it so that it isn’t consuming MarkLogic compute cycles, but still have access to it for batch analytics without having to perform any ETL on the data. What’s more, in this scenario if the customer does want to interactively query the data, it can be re-mounted and queried in seconds.

3. Finally, we also have a non-Hadoop, in-database MapReduce capability for computing aggregates over large amounts of data in real time. Customers write a user-defined function (UDF) in C++ that follows the map/reduce pattern. These functions are pushed to the nodes where the data is managed and run inside the server process. The source data for aggregation comes from our range indexes, which are memory-mapped files, so the entire computation happens in memory, which is what allows it to run in real time.
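
To make the map/reduce aggregate pattern from item 3 concrete, here is a minimal sketch. It is written in Python for brevity rather than C++, and it does not use MarkLogic's UDF API: the names `MeanState`, `map_partition`, `reduce_states`, `finish` and `distributed_mean` are hypothetical, and plain lists stand in for the range-index values held on each node. The idea it illustrates is that each node computes a small partial result over its own values, and only those partial results are merged.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Iterable, List


@dataclass
class MeanState:
    """Partial aggregate: enough to merge results without revisiting values."""
    total: float = 0.0
    count: int = 0


def map_partition(values: Iterable[float]) -> MeanState:
    """Map step: runs where the data lives and summarizes one node's values."""
    state = MeanState()
    for v in values:
        state.total += v
        state.count += 1
    return state


def reduce_states(a: MeanState, b: MeanState) -> MeanState:
    """Reduce step: merges two partial aggregates into one."""
    return MeanState(a.total + b.total, a.count + b.count)


def finish(state: MeanState) -> float:
    """Final step: turns the merged state into the reported value."""
    return state.total / state.count if state.count else float("nan")


def distributed_mean(partitions: List[List[float]]) -> float:
    # Each inner list is a stand-in for the values managed by one node.
    partials = [map_partition(p) for p in partitions]
    return finish(reduce(reduce_states, partials, MeanState()))
```

For example, `distributed_mean([[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]])` returns `3.5`. Because the merged state carries a sum and a count rather than a per-node mean, the result is the same however the values are split across nodes.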

It would be great to get our listing corrected to reflect this. Ideally you could list three different ways we allow MapReduce in that column (text in parentheses would work for the info icon tooltip):

Hadoop Connector (to run Hadoop MapReduce jobs on data stored in MarkLogic, taking advantage of indexing and security)
Direct Access (allows Hadoop MapReduce jobs to recognize MarkLogic data stored in HDFS as a “mappable” file format)
In-Database MapReduce (for distributed computation of aggregates over data in MarkLogic in real time)

Thanks!

— David
