In my previous post, Working with Ranged Buckets: Custom Constraints, we discussed one approach to handling ranged buckets; here, we delve into an approach based on user-defined functions (UDFs). To summarize the issue of working with ranges in documents: we have data that look like this:
<doc> <lo>2</lo> <hi>9</hi> <id>1154</id> </doc>
We want to build a facet with buckets like 0-4, 5-8, 9-12, 13-16, and 17-20. The “lo” and “hi” values in the sample document represent a range, so the document should be counted for the 0-4, 5-8, and 9-12 buckets, even though no value from five to eight explicitly appears in the document.
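To make that overlap rule concrete, here is a tiny standalone C++ sketch (illustrative only, not part of the UDF itself) that checks the sample document's range against each bucket:

```cpp
#include <cstdio>

int main() {
    // The facet's buckets: 0-4, 5-8, 9-12, 13-16, 17-20.
    const int buckets[][2] = { {0, 4}, {5, 8}, {9, 12}, {13, 16}, {17, 20} };

    // The sample document's range.
    const int lo = 2, hi = 9;

    // A document counts toward every bucket its [lo, hi] range overlaps.
    for (const auto& b : buckets) {
        if (lo <= b[1] && hi >= b[0])
            printf("counts toward %d-%d\n", b[0], b[1]);
    }
    return 0;
}
```

Running it reports the 0-4, 5-8, and 9-12 buckets, matching the expectation above.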
In Working with Ranged Buckets: Custom Constraints, we solved this problem with a normal custom constraint. Here, we use a more involved technique: a User-Defined Function (UDF). Also referred to as “Aggregate User Defined Functions”, UDFs let MarkLogic application developers write C++ code to implement map/reduce jobs. Personally, I have had little occasion to write meaningful C++ code (the notable exception being the other UDF I wrote), but I got through it and found some interesting results. Feel free to clone the code, follow along, and suggest improvements.
I’ll refer you to the documentation for general background on UDFs, but essentially, you need to think about four functions; a sketch of the whole plugin follows the walkthrough below:
The start function handles any arguments used to customize this run of the UDF. In my case, I needed to pass in the buckets that I wanted to use. I dynamically allocate an array of buckets that I’ll use throughout the job.
Two range indexes get passed in, one for the “lo” element and one for the “hi” element. The map function gets called for each forest stand in the database, examining the values in the input range indexes. When two indexes are passed in, the map function sees the values as tuples. For instance, the values in the sample document above show up as the tuple (2, 9). Always check the frequency of that tuple, in case the same pair occurs in multiple documents. Once this function has been called for a stand, we know the counts for each bucket for the values in that particular stand.
The reduce function combines the per-stand counts, aggregating them until a set of values for the entire database is known. My implementation just needed to add the counts for each bucket.
The last step is to organize the results in a way that they can be sent back to XQuery. The finish function builds a map, using “0-4” as the key for the first bucket and the count as the value.
When working in a cluster, encode and decode functions are important too. I implemented them for my simple tests, but used the UDF on a single MarkLogic instance, so these functions weren’t called.
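Putting those pieces together, here is a rough C++ sketch of what such a plugin can look like. The class shape and method names follow the AggregateUDF examples that ship in MarkLogic’s Samples/NativePlugins directory, but I am writing the signatures, helper calls (Sequence, TupleIterator, OutputSequence, Encoder/Decoder, Registry), and the “ranged-buckets” registration name from memory as assumptions; this is not the code from my workspace, so check it against the samples before compiling.

```cpp
// Sketch of a ranged-bucket aggregate UDF. Method names follow the AggregateUDF
// interface from MarkLogic's Samples/NativePlugins; exact signatures and type
// overloads here are assumptions and should be verified against the samples.
#include <vector>
#include <string>
#include <cstdint>
#include "MarkLogic.h"   // ships with the sample plugins

using namespace marklogic;

class RangedBuckets : public AggregateUDF
{
public:
  // start: read the bucket boundaries passed from the caller, e.g. 0,4,5,8,...
  void start(Sequence& arg, Reporter& r)
  {
    for (; !arg.done(); arg.next()) {
      int64_t bound;
      arg.value(bound);
      bounds.push_back(bound);
    }
    counts.assign(bounds.size() / 2, 0);   // one counter per (lo, hi) pair
  }

  // map: called once per stand; each tuple is a (lo, hi) pair from the two
  // range indexes, with a frequency for how many documents share that pair.
  void map(TupleIterator& values, Reporter& r)
  {
    for (; !values.done(); values.next()) {
      if (values.null(0) || values.null(1)) continue;
      int64_t lo, hi;
      values.value(0, lo);
      values.value(1, hi);
      for (size_t b = 0; b < counts.size(); ++b) {
        // a document counts toward every bucket its range overlaps
        if (lo <= bounds[2*b + 1] && hi >= bounds[2*b])
          counts[b] += values.frequency();
      }
    }
  }

  // reduce: fold another (per-stand) accumulator into this one.
  void reduce(const AggregateUDF* o, Reporter& r)
  {
    const RangedBuckets* other = static_cast<const RangedBuckets*>(o);
    for (size_t b = 0; b < counts.size(); ++b)
      counts[b] += other->counts[b];
  }

  // finish: hand the results back as a map, keyed "0-4", "5-8", ...
  void finish(OutputSequence& os, Reporter& r)
  {
    os.startMap();
    for (size_t b = 0; b < counts.size(); ++b) {
      std::string key = std::to_string(bounds[2*b]) + "-" +
                        std::to_string(bounds[2*b + 1]);
      os.writeMapKey(key.c_str());
      os.writeValue(counts[b]);
    }
    os.endMap();
  }

  // encode/decode: only exercised when the job crosses cluster nodes.
  void encode(Encoder& e, Reporter& r) { /* e.encode(...) bounds, then counts */ }
  void decode(Decoder& d, Reporter& r) { /* d.decode(...) in the same order */ }

  AggregateUDF* clone() const { return new RangedBuckets(*this); }
  void close() { delete this; }

private:
  std::vector<int64_t> bounds;   // flattened (lo, hi) pairs, one per bucket
  std::vector<int64_t> counts;   // per-bucket document counts
};

// Register the aggregate so it can be invoked by name.
extern "C" PLUGIN_DLLEXPORT void marklogicPlugin(Registry& r)
{
  r.version();
  r.registerAggregate<RangedBuckets>("ranged-buckets");
}
```

The interesting work happens in map: for each (lo, hi) tuple, every bucket whose range overlaps gets the tuple’s frequency added to its count, and reduce then simply sums those per-stand counts.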
Building the UDF is pretty simple using the Makefile provided by MarkLogic. You’ll find the MarkLogic Makefile for UDFs in:
/opt/MarkLogic/Samples/NativePlugins/ (Linux)
~/Library/MarkLogic/Samples/NativePlugins/ (Mac)
I customized the two places where the name needed to match my filename, but otherwise left it alone. After compiling, I uploaded the UDF to MarkLogic using Query Console. I exported the workspace to GitHub, which is available for your reference.
You can call a UDF using the /v1/values endpoint, but I decided to wrap it in a custom constraint to provide a straightforward comparison with the custom constraint built in the previous post. After all, the goal is to provide a facet. A custom constraint requires XML for the search options and XQuery.
I figured UDFs would be more interesting with multiple forests, as mapping a job to a single forest that has just one stand doesn’t gain any parallelism. With that in mind, I bumped my database up to four forests, then to six, and compared my UDF implementation with the two-function approach described in the previous post, testing with the same 100,000 documents used there.
| Median Seconds | 4 Forests | 6 Forests |
|----------------|-----------|-----------|
| UDF            | 0.002898  | 0.002858  |
| Two-function   | 0.003909  | 0.004261  |
The numbers are the median seconds reported in the facet-resolution-time part of the response to /v1/search?options=udf or /v1/search?options=startfinish. A couple of things jumped out at me. First, the UDF out-performed the two-function XQuery custom facet. Second, the UDF improved only very slightly in moving from four forests to six, slightly enough that we can call it even; the two-function approach's time, however, increased by a noticeable amount.
When should you reach for a UDF? When your data don’t directly provide the values you need, it might be worthwhile. For instance, when working with ranged buckets, we can’t simply facet on “lo” or “hi”, because that wouldn’t account for the values in between. Writing a UDF is more complicated and riskier than other approaches, but, as we saw here, it can bring some performance benefits.
There is usually an alternative. For instance, I could have supplemented the data in our example so that the sample document held every value from two through nine inclusive, allowing me to use a standard facet. That leads to a tradeoff: do I want to spend a little more time at ingest and take up a little more space, or do I want to compute the values I need dynamically? The answer is certainly application-specific, but UDFs provide a handy (and sharp!) tool to work with.