How indexing makes XPath fast
Recently, I built a MarkLogic web application using a lot of XSLT and a dash of XQuery. I was able to apply my understanding about how to optimize code to make a site like this perform better, so I thought I would share some of these insights with other beginners.
Whenever MarkLogic evaluates an XPath expression (which is what both XSLT and XQuery use to extract information from XML documents), there are typically two steps that take place:
- Retrieve a list of documents (“fragments” actually)
- Open those documents to extract relevant results
Consider this expression:
A naïve implementation would go grab all available documents, open each of them up and get the
<Title> child of
<Article>, if present. In fact, that’s the only option for an XQuery (or XSLT) implementation that doesn’t use a database. Saxon, for example, can look in all XML documents in the current file system directory using
collection("?select=*.xml")/Article/Title. Since these documents have not been indexed or pre-loaded, there’s no way of knowing whether any of them even have an
<Article> element without first opening them up to see. This works fine up to a certain point. If you try to load too many documents at once, however, you will run out of memory. Also, having to read the file and parse the XML at the time the expression is evaluated is inherently slower than if the documents were somehow pre-loaded. Sometimes, this is unavoidable, such as when the XML you’re processing is generated on-the-fly. This is where an indexed implementation like MarkLogic comes in handy: when the XML is where you actually store your content — especially when you have a lot of it.
One way MarkLogic makes XPath evaluation faster is through indexing. For the purposes of this post, you can conceptually think of the documents as sitting on the file system (we’ll ignore for now the fact that documents are pre-parsed, compressed, and cached in memory to make reads much faster). Let’s call this the “document store.” With a non-database-backed XQuery or XSLT implementation, that’s all you have: the document store. But MarkLogic stores its data in two places:
- the document store, and
- the index
The index does not contain every piece of information about documents. For example, you cannot necessarily reconstruct an entire document from the index alone. In that sense, you can think of it like a back-of-the-book index. The back-of-the-book index lets you look up information in the book more quickly, but it does not substitute for the book itself. Let’s carry the analogy further with an example. Let’s say I’m reading a cookbook, and I want to look up every recipe that has ginger in it. If my cookbook didn’t have an index, then I’d have to flip through the entire book and visually scan each page for the word “ginger.” If I were to break this down into two steps, they might be:
- Get a list of pages
- Scan each page for “ginger,” filtering out the ones that don’t have it
Without an index, the answer to step 1 is “all the pages in the cookbook.” That will take some time. But if my cookbook has an index, then I will see a listing that looks something like this:
Ginger, 15, 32, 49, 50, 57, 65, 66
Now I no longer have to scan the whole book. I only have to look at those 7 pages to find what I’m looking for. I won’t even have to scan the page, filtering out what I don’t want. I can confidently know from the index that each of these pages is a recipe that has ginger in it. We could say that my request is fully searchable from the index. Not only that, it doesn’t require me to do any filtering. (Technically, my filtered results would be the same as if I had skipped filtering.)
But what if I only want ginger recipes that don’t require any heating/cooking? It would be great if my cookbook’s index included a listing like “Ginger – Raw.” But what if it doesn’t? I’d have to make do with the index listing I have, so my process would look like this:
- Get a list of pages (7 altogether)
- Scan each page to see if the recipe requires heating or not
Now I have to do some visual filtering (step 2), but at least I don’t have to scan the whole book. In this case, we could say my query is partially searchable from the index.
One last example: let’s say I want to find all recipes that include both ginger and tuna. In that case, I’d look up two different entries in the index—the one for “ginger” and another one for “tuna,” which might look like this:
Tuna, 13, 32, 49, 57, 80
Now all I have to do is get the intersection of those two listings. In this case, I can quickly see that pages 32, 49, and 57 contain both “ginger” and “tuna,” because those pages appear in both index listings.
If you haven’t gathered this already, each cookbook page in the analogy corresponds to an XML document (fragment, actually). Also, just as the back-of-the-book index saves us time when looking for ginger recipes, MarkLogic’s index saves it time when looking for documents. In that sense, it plays an inherently negative role. Indexing lets you ignore (subtract) a subset of the documents because you already know, for example, that they don’t have any <Title> elements. Another way of thinking about it is that, for any given query, it reduces the size of your document store (while still giving you the same results as if you had scanned all the documents).
I’ve been throwing around the terms “fully searchable” and “partially searchable.” As it turns out, these are technical terms that apply to both XPath expressions and word search queries. MarkLogic Server analyzes each path expression to see how searchable it is. Even if it’s not fully searchable, it might be partially searchable (as in the case of our request for raw ginger recipes). In that case, it reduces the number of documents that it has to “filter” (another technical term, describing the scanning process). As you might expect, learning what makes an XPath expression searchable or not will go a long way towards helping you write highly performant code on MarkLogic Server.
Our original expression
collection()/Article/Title is a fully searchable XPath expression, which is to say that each of its three path expression steps is searchable, so even if we have billions of documents in our database, MarkLogic will only open the documents that contain those elements in that arrangement; of course, I had always assumed this to be the case. I just didn’t have a very clear idea of how it did this. But now, after having played a bunch with xdmp:query-trace(), and after having studied “Inside MarkLogic Server,” it’s all much more clear to me. I hope it is for you, too!