Using namespace wildcards in XPath

by Evan Lenz

Posted on December 12, 2011

Have you ever wished you could just skip having to deal with namespaces in your content? One way to do this is to avoid using namespaces altogether (i.e., avoid using any xmlns or xmlns:* declarations in your XML content). But given that namespaces are in widespread use both in standard XML vocabularies and in custom application data, that option isn’t always available.

XPath does provide a convenient feature known as local name tests, or namespace wildcards, which lets you avoid having to type your content’s namespace declaration in your query. In fact, you might be tempted to use it all the time to save the effort of typing, but I’m here to tell you that’s not a good idea. Keep reading if you want to know when it might be safe to use them and when it’s not.

What are name tests in XPath?

There are four kinds of name tests in XPath, and three of them are wildcards, shown in the table below. Pay particular attention to the last entry in the table.

What it matches	Example(s)
Match a specific QName	`foo`, `xyz:bar`, etc.
Match any name	`*`
Match any name in a specific namespace	`xyz:*`
Match a specific local name, regardless of namespace	`*:foo`
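To try the four kinds of name tests without a MarkLogic server, here is a minimal sketch using Python's standard-library ElementTree (Python 3.8 or later). Its path syntax spells the wildcards a little differently from XPath: {uri}* stands in for xyz:* and {*}foo stands in for *:foo. The sample document and its namespace URIs are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample content, invented for this illustration.
DOC = """<root xmlns:abc="http://example.com/abc" xmlns:xyz="http://example.com">
  <foo/>
  <abc:foo/>
  <xyz:foo/>
  <xyz:bar/>
</root>"""

XYZ = "http://example.com"
root = ET.fromstring(DOC)

# 1. A specific QName (XPath: xyz:foo), resolved through a prefix-to-URI map.
specific_qname = root.findall(".//xyz:foo", {"xyz": XYZ})

# 2. Any name (XPath: *).
any_name = root.findall(".//*")

# 3. Any name in a specific namespace (XPath: xyz:*).
any_in_namespace = root.findall(f".//{{{XYZ}}}*")

# 4. A specific local name in any namespace, or in none (XPath: *:foo).
any_namespace = root.findall(".//{*}foo")

for label, hits in [
    ("xyz:foo", specific_qname),
    ("*", any_name),
    ("xyz:*", any_in_namespace),
    ("*:foo", any_namespace),
]:
    print(label, [el.tag for el in hits])
```

ElementTree reports each tag as {namespace-uri}localname, so the printed lists make it easy to see which namespaces each test picked up.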

For some time, only the first three kinds of name tests were supported; XPath 1.0 (pre-XQuery) had no namespace wildcard. If you wanted to select a <foo> element regardless of its namespace, you would have to write something like this:

*[local-name(.) = 'foo']
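The same filter-on-local-name idea can be sketched in Python's standard-library ElementTree, which stores each tag as {namespace-uri}localname. The helper names and the sample document here are assumptions for illustration only; they are not part of any XPath API.

```python
import xml.etree.ElementTree as ET

def local_name(tag: str) -> str:
    """Drop ElementTree's '{namespace-uri}' prefix, if there is one."""
    return tag.rpartition("}")[2]

def select_by_local_name(root, name):
    """Rough analogue of //*[local-name(.) = name]: visit every element and filter."""
    return [
        el for el in root.iter()
        if isinstance(el.tag, str) and local_name(el.tag) == name
    ]

# Hypothetical sample content with <foo> in two namespaces and in none.
DOC = """<root xmlns:abc="http://example.com/abc" xmlns:xyz="http://example.com">
  <foo/>
  <abc:foo/>
  <xyz:foo/>
  <xyz:bar/>
</root>"""

root = ET.fromstring(DOC)
matches = select_by_local_name(root, "foo")
print([el.tag for el in matches])
```

Note that the filter has to look at every element before it can decide, which is also how the XPath predicate behaves.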

One rationale behind this (perhaps obvious) omission is that such a language feature might encourage some bad practices. The idea behind a namespace is that it identifies a distinct set of names. Local names in different namespaces shouldn’t necessarily be related to each other (<head> means one thing in HTML and quite another in, say, AnatomyML). Of course, that still didn’t prevent people from using namespaces for things like versioning, where each new version of a vocabulary gets a new namespace URI.

Local name tests in XPath

Local name tests (namespace wildcards) were eventually added in XPath 2.0 (and thus XQuery):

collection()//*:foo

The above query selects all elements with local name “foo” regardless of namespace. Even if you know these elements are in just one namespace, it can be a convenient shortcut. It saves you from having to write out the namespace declaration:

declare namespace xyz="http://example.com";
collection()//xyz:foo

Problems with namespace wildcards

There are two problems with using namespace wildcards like *:foo. One is that the intentions are unclear. Did you really mean that? Are there really elements named <foo> in more than one namespace? Or were you just being lazy?

The other problem is with performance. MarkLogic indexes elements by QName, not by local name, which means namespace wildcards won’t utilize the index and will require a lot of filtering. We can prove this by using our friend xdmp:plan() (or its cousin, xdmp:query-trace()):

xdmp:plan(
  collection()//*:foo
)

The output shows how many “fragments” (equivalent to documents, unless you’ve enabled fragmenting) have to be read in order to resolve this query. Normally, MarkLogic uses its Universal Index to minimize the number of document reads it has to make. In this case, we can see from the output below that the *:foo step is problematic:

<qry:info-trace>Analyzing path: fn:collection()/descendant::*:foo</qry:info-trace>
<qry:info-trace>Step 1 is searchable: fn:collection()</qry:info-trace>
<qry:info-trace>Step 2 does not use indexes: descendant::*:foo</qry:info-trace>

Looking further down the output, we see the number of fragments that would have to be opened for the filtering stage:

<qry:info-trace>Selected 14944 fragments</qry:info-trace>
<qry:result estimate="14944"/>

This is not the number of documents that have a <foo> element. This is the total number of documents in my database. So obviously, this query is going to run very slowly, because it’s forcing all of those fragments to be read from the disk.

Specifying the exact QName

In contrast, let’s look at the plan for the case where we specify the exact QName:

declare namespace xyz="http://example.com";
xdmp:plan(
  collection()//xyz:foo
)

In this case, we see that the path is “fully searchable.” In other words, all the steps contribute index constraints that can be used to narrow down the possible number of matching documents:

<qry:info-trace>Analyzing path: fn:collection()/descendant::xyz:foo</qry:info-trace>
<qry:info-trace>Step 1 is searchable: fn:collection()</qry:info-trace>
<qry:info-trace>Step 2 is searchable: descendant::xyz:foo</qry:info-trace>
<qry:info-trace>Path is fully searchable.</qry:info-trace>

And we see further down that MarkLogic knows a priori, from the Universal Index, that no <xyz:foo> elements exist in the database:

<qry:info-trace>Selected 0 fragments</qry:info-trace>
<qry:result estimate="0"/>

So simply by specifying the namespace part of the QName, we’ve gone from having to read all the documents in the database, to having to read none of them.
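One way to picture the difference is a toy inverted index keyed by QName. This is only a mental model in Python, not MarkLogic's actual index structure: the class, its method names, and the document ids are invented, and it simply assumes that lookups are possible by exact QName only.

```python
from collections import defaultdict

class ToyQNameIndex:
    """Teaching model: maps (namespace URI, local name) to a set of document ids."""

    def __init__(self):
        self._postings = defaultdict(set)
        self._all_docs = set()

    def add(self, doc_id, qnames):
        """Record which element QNames occur in a document."""
        self._all_docs.add(doc_id)
        for qname in qnames:
            self._postings[qname].add(doc_id)

    def candidates_for_qname(self, namespace_uri, local):
        """Exact QName: one key lookup narrows the candidates up front."""
        return set(self._postings.get((namespace_uri, local), set()))

    def candidates_for_wildcard(self, local):
        """Namespace wildcard: no exact key to look up (by assumption),
        so every document remains a candidate and must be opened and filtered."""
        return set(self._all_docs)

index = ToyQNameIndex()
index.add("a.xml", [("http://example.com/abc", "foo")])
index.add("b.xml", [("", "bar")])
index.add("c.xml", [("http://example.com/abc", "baz")])

exact = index.candidates_for_qname("http://example.com", "foo")
wildcard = index.candidates_for_wildcard("foo")
print(len(exact), "candidate documents for the exact QName")
print(len(wildcard), "candidate documents for the namespace wildcard")
```

In this sketch the exact QName resolves to an empty candidate set without opening anything, while the wildcard leaves all three documents to be read.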

Summary

To summarize, you should generally avoid namespace wildcards like *:foo for two reasons: performance and clarity.

Are there ever any cases where it’s okay to use *:foo? Performance is not nearly as big an issue when you’re processing documents that you’re already committed to opening. For example, if you’re processing a single zip file manifest (the result of xdmp:zip-manifest()), then using *:part because you’re too lazy to declare the zip namespace isn’t a problem as far as performance goes, because you’re not searching among thousands or millions of documents and the index doesn’t even come into play. Still, in production code, it’s a good idea to declare the namespace and use zip:part so your intentions are clearly documented.

Of course, when your intention really is to select elements with a specific local name in any of several namespaces, then you can use *:foo; but, again, avoid it when you’re searching across the database. In that case, if it’s possible, you should enumerate all the QNames, so MarkLogic can most effectively narrow down the result set based on what it knows from its indexes:

//(abc:foo|def:foo|xyz:foo)
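To see how an enumerated list differs in meaning from a wildcard, here is a small sketch with Python's standard-library ElementTree (Python 3.8 or later for the {*} wildcard). The namespace URIs bound to abc and def, the extra "other" namespace, and the helper name are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical URIs standing in for the abc, def, and xyz prefixes.
WANTED_NAMESPACES = [
    "http://example.com/abc",
    "http://example.com/def",
    "http://example.com",
]

DOC = """<root xmlns:abc="http://example.com/abc"
      xmlns:def="http://example.com/def"
      xmlns:xyz="http://example.com"
      xmlns:other="http://example.com/other">
  <abc:foo/>
  <def:foo/>
  <xyz:foo/>
  <other:foo/>
  <foo/>
</root>"""

def select_enumerated(root, local, namespace_uris):
    """Rough analogue of //(abc:foo|def:foo|xyz:foo): match only listed QNames."""
    wanted = {f"{{{uri}}}{local}" for uri in namespace_uris}
    return [el for el in root.iter() if el.tag in wanted]

root = ET.fromstring(DOC)
enumerated = select_enumerated(root, "foo", WANTED_NAMESPACES)
wildcard = root.findall(".//{*}foo")
print(len(enumerated), "enumerated matches;", len(wildcard), "wildcard matches")
```

The enumerated form skips <other:foo> and the unqualified <foo>, which the wildcard would silently include, so it documents exactly which vocabularies you meant.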

If you didn’t even know namespace wildcards existed in XPath, then you might find it odd that I’m both introducing them to you and recommending against using them in the same article. Consider this just another chance to become familiar with xdmp:plan(), which is much more generally useful. It will help you write fast queries and understand what makes them fast.
