
Working with Ranged Buckets: User-Defined Functions

12.19.2014
4 minute read

In my previous post about working with Ranged Buckets using custom constraints, we discussed one approach to handling ranged buckets; here, we delve into an approach using user-defined functions (UDFs). To summarize the issue of working with ranges in documents, we have data that look like this:

<doc>
  <lo>2</lo>
  <hi>9</hi>
  <id>1154</id>
</doc>

We want to build a facet with buckets like 0-4, 5-8, 9-12, 13-16, and 17-20. The “lo” and “hi” values in the sample document represent a range, so the document should be counted for the 0-4, 5-8, and 9-12 buckets, even though no value from five to eight explicitly appears in the document.
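
The counting rule behind that statement is a simple interval-overlap test: a document belongs in a bucket whenever its [lo, hi] range intersects the bucket's range. Stated as code (the function name here is mine, for illustration):

// A document with range [lo, hi] belongs in bucket [bLo, bHi]
// exactly when the two intervals overlap.
bool inBucket(int lo, int hi, int bLo, int bHi)
{
    return lo <= bHi && hi >= bLo;
}
// Example: (2, 9) overlaps 0-4, 5-8, and 9-12, but not 13-16 or 17-20.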

In Working with Ranged Buckets: Custom Constraints, we solved this problem using a normal custom constraint. Here, we use a more involved technique: a User-Defined Function (UDF). Also referred to as “Aggregate User-Defined Functions”, UDFs let MarkLogic application developers write C++ code to implement map/reduce jobs. Personally, I have little experience writing meaningful C++ code (the notable exception being the other UDF I have written). I got through it, though, and found some interesting results. (Feel free to suggest improvements to the code, which you can clone if you'd like to follow along.)

Implementation

I’ll refer you to the documentation for the general background on UDFs, but essentially, you need to think about four functions:

start

The start function handles any arguments used to customize this run of the UDF. In my case, I needed to pass in the buckets that I wanted to use. I dynamically allocate an array of buckets that I’ll use throughout the job.
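
To make that concrete, here is a minimal sketch of the aggregate's state and its start function, modeled on the AggregateUDF interface that ships in the Samples/NativePlugins directory. The class name, member layout, and flat lo/hi argument list are my illustrative assumptions, not the exact code from the repo; the post's version dynamically allocates a plain array, while this sketch uses std::vector for brevity, and the exact Sequence overloads may differ from what's shown.

#include <cstdint>
#include <cstdio>
#include <vector>
#include "MarkLogic.h"  // native plugin API from Samples/NativePlugins

using namespace marklogic;

// Hypothetical aggregate: one running count per ranged bucket.
class RangedBuckets : public AggregateUDF
{
public:
  std::vector<int> lo, hi;       // bucket boundaries
  std::vector<int64_t> counts;   // per-bucket counts

  AggregateUDF* clone() const { return new RangedBuckets(*this); }
  void close() { delete this; }

  void start(Sequence& args, Reporter& r);
  void map(TupleIterator& values, Reporter& r);
  void reduce(const AggregateUDF* other, Reporter& r);
  void finish(OutputSequence& os, Reporter& r);
  void encode(Encoder& e, Reporter& r);
  void decode(Decoder& d, Reporter& r);
};

// start: read the bucket boundaries passed in from the query, assumed
// here to arrive as a flat list of pairs: 0,4, 5,8, 9,12, 13,16, 17,20.
void RangedBuckets::start(Sequence& args, Reporter& r)
{
  while (!args.done()) {
    int bLo = 0, bHi = 0;
    args.value(bLo); args.next();
    args.value(bHi); args.next();
    lo.push_back(bLo);
    hi.push_back(bHi);
    counts.push_back(0);
  }
}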

map

Two range indexes get passed in, one for the “lo” element and one for the “hi” element. The map function gets called for each forest stand in the database, examining the values in the input range indexes. When two indexes are passed in, the map function sees the values as tuples. For instance, the values in the sample document above show up as the tuple (2, 9). Always check the frequency of that tuple, in case the same pair occurs in multiple documents. Once this function has been called for a stand, we know the counts for each bucket for the values in that particular stand.
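
Continuing the sketch, map walks the (lo, hi) tuples one stand at a time; frequency() is what accounts for the same pair appearing in more than one document:

// map: called once per stand; values iterates the (lo, hi) tuples
// drawn from the two range indexes handed to the UDF.
void RangedBuckets::map(TupleIterator& values, Reporter& r)
{
  for (; !values.done(); values.next()) {
    if (values.null(0) || values.null(1)) continue;  // skip partial tuples
    int docLo = 0, docHi = 0;
    values.value(0, docLo);
    values.value(1, docHi);
    int64_t freq = values.frequency();  // documents sharing this exact pair
    for (size_t b = 0; b < counts.size(); ++b)
      if (docLo <= hi[b] && docHi >= lo[b])  // interval overlap
        counts[b] += freq;
  }
}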

reduce

The reduce function combines the per-stand counts, aggregating them until a set of values for the entire database is known. My implementation just needed to add the counts for each bucket.
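
In the sketch, that reduction is just per-bucket addition:

// reduce: fold another accumulator (from another stand or host)
// into this one by summing the per-bucket counts.
void RangedBuckets::reduce(const AggregateUDF* other, Reporter& r)
{
  const RangedBuckets* o = static_cast<const RangedBuckets*>(other);
  for (size_t b = 0; b < counts.size(); ++b)
    counts[b] += o->counts[b];
}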

finish

The last step is to organize the results in a way that they can be sent back to XQuery. The finish function builds a map, using “0-4” as the key for the first bucket and the count as the value.
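
Again as a sketch, with the output calls following the pattern in the shipped samples:

// finish: emit a map from bucket labels ("0-4", "5-8", ...) to counts.
void RangedBuckets::finish(OutputSequence& os, Reporter& r)
{
  os.startMap();
  for (size_t b = 0; b < counts.size(); ++b) {
    char key[32];
    std::snprintf(key, sizeof(key), "%d-%d", lo[b], hi[b]);
    os.writeMapKey(key);
    os.writeValue(counts[b]);
  }
  os.endMap();
}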

Encoding and Decoding

When working in a cluster, encode and decode functions are important too. I implemented them for my simple tests, but used the UDF on a single MarkLogic instance, so these functions weren’t called.
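
For completeness, here is what a matching pair can look like in the sketch; I'm less certain of the exact Encoder/Decoder overloads, but the one firm rule is that decode must read back exactly what encode wrote, in the same order:

// encode/decode: serialize the accumulator so it can travel
// between hosts in a cluster. decode mirrors encode exactly.
void RangedBuckets::encode(Encoder& e, Reporter& r)
{
  int64_t n = (int64_t)counts.size();
  e.encode(n);
  for (int64_t b = 0; b < n; ++b) {
    e.encode(lo[b]);
    e.encode(hi[b]);
    e.encode(counts[b]);
  }
}

void RangedBuckets::decode(Decoder& d, Reporter& r)
{
  int64_t n = 0;
  d.decode(n);
  lo.resize(n); hi.resize(n); counts.resize(n);
  for (int64_t b = 0; b < n; ++b) {
    d.decode(lo[b]);
    d.decode(hi[b]);
    d.decode(counts[b]);
  }
}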

Deploying

Building the UDF is pretty simple using the Makefile that MarkLogic provides for UDFs. You'll find it in:

  • /opt/MarkLogic/Samples/NativePlugins/ (Linux)
  • ~/Library/MarkLogic/Samples/NativePlugins/ (Mac)
  • C:\Program Files\MarkLogic\Samples\NativePlugins\ (Windows)

I customized the two places where the name needed to match my filename, but otherwise left the Makefile alone. After compiling, I uploaded the UDF to MarkLogic using Query Console; I exported that workspace to GitHub for your reference.
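
One piece the four-function list above doesn't cover: every native plugin also needs a registration entry point that MarkLogic calls when it loads the compiled library. The shape below follows the shipped samples; the aggregate name "ranged-buckets" is my placeholder, and it's the name the query side would use to invoke the UDF.

// Registration entry point, following the Samples/NativePlugins pattern.
extern "C" PLUGIN_DLL void marklogicPlugin(Registry& r)
{
  r.version();
  r.registerAggregate<RangedBuckets>("ranged-buckets");
}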

You can call a UDF using the /v1/values endpoint, but I decided to wrap it in a custom constraint to allow a straightforward comparison with the custom constraint built in the previous post. After all, the goal is to provide a facet. A custom constraint requires search options XML and an XQuery module.

The Results

I figured UDFs would be more interesting with multiple forests, as mapping a job to a single forest with just one stand doesn't gain any parallelism. With that in mind, I bumped my database up to four forests, then to six, and compared my UDF implementation with the two-function approach described in the previous post, testing with the same 100,000 documents used there.

Median Seconds   4 Forests   6 Forests
UDF              0.002898    0.002858
two-function     0.003909    0.004261

The numbers are the median seconds reported in the facet-resolution-time part of the response to /v1/search?options=udf or /v1/search?options=startfinish. A couple of things jumped out at me. First, the UDF out-performed the two-function XQuery custom facet. Second, the UDF improved very slightly moving from four forests to six; slight enough to call it even. The two-function approach, however, slowed down by a noticeable amount.

Concluding Thoughts on UDFs

When should you reach for a UDF? When your data don't directly contain the values you need, it might be worthwhile. For instance, when working with ranged buckets, we can't simply facet on “lo” or “hi”, because we wouldn't represent the values in between. Writing a UDF is more complicated and more dangerous than other approaches, but it appears to have some performance benefits, as we saw here.

There is usually an alternative. For instance, I could have supplemented the data so that the sample document held every value from two through nine inclusive, allowing me to use a standard facet. That leads to a tradeoff, though: do I want to spend a little more time at ingest and take up a little more space, or do I want to compute the values I need dynamically? The answer is certainly application-specific, but UDFs provide a handy (and sharp!) tool to work with.

