Using namespace wildcards in XPath
Have you ever wished you could just skip having to deal with namespaces in your content? One way to do this is to avoid using namespaces altogether (i.e., avoid using any xmlns:* declarations in your XML content). But given that namespaces are in widespread use both in standard XML vocabularies and in custom application data, that option isn’t always available.
XPath does provide a convenient feature known as local name tests, or namespace wildcards, which lets you avoid having to type your content’s namespace declaration in your query. In fact, you might be tempted to use it all the time to save the effort of typing, but I’m here to tell you that’s not a good idea. Keep reading if you want to know when it’s safe to use them and when it’s not.
What are name tests in XPath?
There are four kinds of name tests in XPath, and three of them are wildcards, shown in the table below. Pay particular attention to the last entry in the table.
|What it matches|Example(s)|
|Match a specific QName|xyz:foo|
|Match any name|*|
|Match any name in a specific namespace|xyz:*|
|Match a specific local name, regardless of namespace|*:foo|
For some time, only the first three kinds of name tests were supported; that was the situation in XPath 1.0 (pre-XQuery). If you wanted to select a <foo> element regardless of its namespace, you would have to write something like this:
*[local-name(.) = 'foo']
One rationale behind this (perhaps obvious) omission is that such a language feature might encourage some bad practices. The idea behind a namespace is that it identifies a distinct set of names. Local names in different namespaces shouldn’t necessarily be related to each other (<head> means one thing in HTML and quite another in, say, AnatomyML). Of course, that still didn’t prevent people from using namespaces for things like versioning, where each new version of a vocabulary gets a new namespace URI.
Local name tests in XPath
Local name tests (namespace wildcards) were eventually added in XPath 2.0 (and thus XQuery):
collection()//*:foo
The above query selects all elements with local name “foo” regardless of namespace. Even if you know these elements are in just one namespace, it can be a convenient shortcut. It saves you from having to write out the namespace declaration:
declare namespace xyz="http://example.com";
collection()//xyz:foo
Problems with namespace wildcards
There are two problems with using namespace wildcards like *:foo. One is that the intentions are unclear. Did you really mean that? Are there really elements named <foo> in more than one namespace? Or were you just being lazy?
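To make the ambiguity concrete, here is a minimal in-memory sketch that reuses the <head> example from earlier. The AnatomyML namespace URI and the sample content are made up for illustration. The wildcard silently picks up both elements, while the QName says exactly which one is meant:

```xquery
declare namespace html = "http://www.w3.org/1999/xhtml";
(: Hypothetical namespace URI, for illustration only :)
declare namespace anat = "http://example.com/anatomyml";

(: An in-memory document with two unrelated <head> elements :)
let $doc :=
  <root>
    <html:head><html:title>Page</html:title></html:head>
    <anat:head><anat:skull/></anat:head>
  </root>
return (
  count($doc/*:head),    (: 2: matches both namespaces :)
  count($doc/html:head)  (: 1: matches only the HTML head :)
)
```

A reader of the first expression can't tell whether matching both elements was the point or an accident.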
The other problem is with performance. MarkLogic indexes elements by QName, not by local name, which means namespace wildcards won’t utilize the index and will require a lot of filtering. We can prove this by using our friend xdmp:plan():
xdmp:plan( collection()//*:foo )
The output shows how many “fragments” (equivalent to documents, unless you’ve enabled fragmenting) have to be read in order to resolve this query. Normally, MarkLogic uses its Universal Index to minimize the number of document reads it has to make. In this case, the output below shows that the *:foo step is the problem:
<qry:info-trace>Analyzing path: fn:collection()/descendant::*:foo</qry:info-trace>
<qry:info-trace>Step 1 is searchable: fn:collection()</qry:info-trace>
<qry:info-trace>Step 2 does not use indexes: descendant::*:foo</qry:info-trace>
Looking further down the output, we see the number of fragments that would have to be opened for the filtering stage:
<qry:info-trace>Selected 14944 fragments</qry:info-trace>
<qry:result estimate="14944"/>
This is not the number of documents that have a <foo> element. This is the total number of documents in my database. So obviously, this query is going to run very slowly, because it’s forcing all of those fragments to be read from the disk.
Specifying the exact QName
In contrast, let’s look at the plan for the case where we specify the exact QName:
declare namespace xyz="http://example.com";
xdmp:plan( collection()//xyz:foo )
In this case, we see that the path is “fully searchable.” In other words, all the steps contribute index constraints that can be used to narrow down the possible number of matching documents:
<qry:info-trace>Analyzing path: fn:collection()/descendant::xyz:foo</qry:info-trace>
<qry:info-trace>Step 1 is searchable: fn:collection()</qry:info-trace>
<qry:info-trace>Step 2 is searchable: descendant::xyz:foo</qry:info-trace>
<qry:info-trace>Path is fully searchable.</qry:info-trace>
And we see further down that MarkLogic knows a priori, from the Universal Index, that no <xyz:foo> elements exist in the database:
<qry:info-trace>Selected 0 fragments</qry:info-trace>
<qry:result estimate="0"/>
So simply by specifying the namespace part of the QName, we’ve gone from having to read all the documents in the database, to having to read none of them.
To summarize, you should generally avoid namespace wildcards like *:foo for two reasons: performance and clarity.
Are there ever any cases where it’s okay to use *:foo? Performance is not nearly as big an issue when you’re processing documents that you’re already committed to opening. For example, if you’re processing a single zip file manifest (the result of xdmp:zip-manifest()), then using *:part because you’re too lazy to declare the zip namespace isn’t a problem as far as performance goes, because you’re not searching among thousands or millions of documents and the index doesn’t even come into play. Still, in production code, it’s a good idea to declare the namespace and use zip:part so your intentions are clearly documented.
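As a rough sketch of the explicit version: the document URI below is hypothetical, and the code assumes the manifest's elements are in the "xdmp:zip" namespace, which you should confirm against the manifest your server actually returns.

```xquery
xquery version "1.0-ml";
(: Assumed namespace of the manifest elements; verify on your server :)
declare namespace zip = "xdmp:zip";

(: "/archives/example.zip" is a hypothetical URI of a zip file
   stored in the database as a binary document :)
let $manifest := xdmp:zip-manifest(doc("/archives/example.zip"))
(: Each zip:part element holds the path of one entry in the archive :)
return $manifest/zip:part/string()
```

The one extra line of declaration costs little, and zip:part tells the next reader exactly which vocabulary the code expects.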
Of course, when your intention actually is to select elements with a specific local name in any of several namespaces, then you can use *:foo; but, again, be sure it’s not when you’re searching across the database. In that case, if it’s possible, you should enumerate all the QNames, so MarkLogic can most effectively narrow down the result set based on what it knows from its indexes.
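One way to enumerate the QNames might look like the sketch below. The second namespace (abc) and its URI are invented for illustration; substitute whichever namespaces your content really uses.

```xquery
xquery version "1.0-ml";
declare namespace xyz = "http://example.com";
(: Hypothetical second namespace, for illustration only :)
declare namespace abc = "http://example.com/v2";

(: Name each QName explicitly instead of writing *:foo :)
collection()//(xyz:foo | abc:foo)
```

Whether every step of a given expression is resolved from the indexes depends on your data and server version, so it's worth wrapping the expression in xdmp:plan() and reading the trace before relying on it.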
If you didn’t even know namespace wildcards existed in XPath, then you might find it odd that I’m both introducing them to you and recommending against using them in the same article. Consider this just another chance to become familiar with xdmp:plan(), which is much more generally useful. It will help you write fast queries and understand what makes them fast.