The new website for MarkLogic is Visit it.

The Secrets to Wildcard Search in MarkLogic

Back to blog
8 minute read
Back to blog
8 minute read

Wildcards are special characters that you can include in your queries to search for variations of a term. A short string with wildcard characters can potentially match thousands terms of in your database, making wildcards an efficient way to search on similarly spelled words. MarkLogic search supports two wildcard characters:

* Matches zero or more non-space characters in a word, with car* matching car, care, and caring (and all other terms that start with car).
? Matches exactly one non-space character in a word, with car? matching care and cart but not car or caring.


You can have wildcards at the beginning, middle, or end of your search terms, or mix them together (e.g., *e*?her?*).

Note that wildcard searches are similar to stemming, which is how MarkLogic automatically considers grammatically similar words when processing a query. With stemming turned on, MarkLogic considers swam and swimming when you search for swim. For more detail, see the Understanding and Using Stemmed Searches section of the Search Developer’s Guide.

MarkLogic supports wildcard search with three types of indexes:

  1. three-character indexes
  2. trailing wildcard indexes
  3. lexicons

How and when it uses these indexes depends on the wildcard pattern you’re searching on.

Three-Character Indexes

Indexes are the key to regular search in MarkLogic, and they are also the key to handling wildcards. An index takes character sequences in your database documents and maps them to document IDs. During a typical search, MarkLogic analyzes the query—breaking a complicated query into its component parts—and then consults the indexes to return a list of matching results. By default, MarkLogic indexes all the words in a database as well as structural information like the element and attribute names (for XML documents) and property names (for JSON documents). You can have MarkLogic index additional sequences to support certain features, such as wildcards.

When you turn on a three-character index for your database, MarkLogic will map every unique three-letter sequence to its corresponding documents. If the term cater appears in a document, entries for cat, ate, and ter are added to the index. MarkLogic also includes contextual information about whether each sequence was at the start or end of a word. A three-character index is perfect for handling wildcard searches that include three characters and *—for example *ate, ate*, or *ate*. MarkLogic can retrieve all of the document IDs it needs for the results from that index.

MarkLogic can also leverage the three-character index for longer wildcard terms. It can split a four-character term such as *ater* into *ate* and *ter*, look up the two terms in the three-character index, and combine the results. However, this can potentially result in false-positives, for example, a document that contains terminate that is incorrectly flagged as a match.

Such false-positives are removed during the filtering part of the search. Filtering is when result documents are opened and double-checked to make sure they actually match the search query. The more filtering MarkLogic has to do, the longer it takes to return results. This means that using the three-character index to handle longer wildcard terms—which can increase filtering time—isn’t always the most efficient strategy. (You can turn off filtering, which means you’ll get results faster but potentially at the expense of false positives.)

MarkLogic also gives you the option to turn on one- and two-character indexes for databases. You might think that these would be helpful for getting wildcard results for shorter terms (e.g., b* and ba*), but turning these indexes on results in much, much larger indexes, and it also impacts the time it takes to load, index, and merge documents. For these reasons, activating these indexes is discouraged, and from this point forward we will assume that they are not turned on.

What about indexing sequences greater than three characters? MarkLogic doesn’t offer these indexes since, in most cases, any benefit would be outweighed by the increased index space. Typically, three-character indexes give you the most bang for the buck by efficiently supporting wildcards and search in general.

Trailing Wildcard Index

For quickly handling wildcard searches that have more than three characters and end in *, the trailing wildcard index can be turned on. This will store entries for all
the possible prefix combinations for each word greater than or equal to three characters. For cater, MarkLogic stores cat, cate, and cater in the index. The index includes contextual information marking the sequences as occurring at the start of a word. Search for cate*, and MarkLogic will use the trailing wildcard index.

A trailing wildcard index isn’t helpful for terms with a leading * (e.g., *ater). Nor does MarkLogic offer an analogous leading wildcard index (i.e., MarkLogic is biased toward English in this sense). Searchers for English terms will more often put wildcards at the end of search terms than at the beginning, since English grammar tends to use more suffixes than prefixes, hence the preference for a trailing wildcard index.

shows how MarkLogic stores sequences from the word cater with a three-character index versus a trailing wildcard index

Figure 1: Indexes map character sequences that appear in documents to their respective document IDs, showing how MarkLogic stores sequences from the word cater when different indexes are turned on.


For wildcard searches that involve something more complicated than three-character strings or a trailing wildcard, MarkLogic will try to use lexicons. A lexicon is an ordered list of the unique words in a database. Unlike an index, a lexicon doesn’t map words to the documents in which they appear; it’s just a big ordered list.

Lexicons enable MarkLogic to quickly retrieve all the potential matches for a wildcard pattern. Let’s say you are searching for the wildcarded string r?b*?t. MarkLogic can take that string and compare it to the lexicon word list. Since the lexicon words are stored in order, MarkLogic can scan them efficiently. (We’ll talk more about that in a moment.)

In our r?b*?t example, a lexicon scan might return the terms rabbit, rebuilt, and robot. With this information, MarkLogic can rewrite the wildcard query like this:

rabbit OR rebuilt OR robot

It can then find the occurrences of these three words in its indexes and combine what it finds into a set of results.

Lexicons also support type-ahead suggestions in search boxes. When a user types vac in a search box, the application can check a lexicon and display a menu with words that match, like vacation and vacuum. It’s similar to performing a search for vac*, but doesn’t require checking indexes.

Limits of Lexicons

Using a lexicon to build an expanded OR query doesn’t always work. Sometimes a wildcard string will return thousands of matching terms from the lexicon, way too many to construct an effective OR query.

To get around this, you can cut down on a long list of terms returned by constructing a shorter list of three-character prefixes and postfixes.

Let’s say a lexicon results list looks like this:

  • staggered
  • staired
  • stammered
  • stared
  • starred

With the prefix/postfix strategy, MarkLogic can rewrite the list as something shorter and more manageable:

  • sta*
  • *red

Then you can use those three-character sequences with the three-character indexes, which is very fast. This can yield false-positives (which will need filtering), but it can be a better way to go when a lexicon scan returns too many words.

MarkLogic employs another strategy for single letters with a trailing wildcard—for example, a*. MarkLogic would break this into three separate searches:

  • search for a using the regular word index (there will be one index entry for a, if it exists)
  • search for a? in the lexicon (this should yield a small number of results) and do an expanded OR query
  • search for a??* in the lexicon, using the prefix/postfix strategy (to avoid a long list of matches for words beginning with a)

Wildcard Search Options

Developers have control over how MarkLogic uses lexicons to resolve wildcard searches with the lexicon-expand option. This option can have the following values:

heuristic Resolve wildcard queries using MarkLogic’s set of internal rules (e.g., if a lexicon scan is estimated to return a large number of terms, use alternative strategies). This is the default.
full Always expand a wildcard query into a set of ORed terms, even if this results in a large number of terms and, potentially, a slow query.Always expand a wildcard query into a set of ORed terms, even if this results in a large number of terms and, potentially, a slow query.
prefix-postfix Always resolve wildcard queries with prefix and postfix sequences based on matching lexicon terms.
off Do not use lexicon expansion for wildcard searches. Use indexes instead.


Better Lexicon Scans

The order of entries in a lexicon is determined by the collation type of the lexicon. The codepoint lexicon, which orders lowercase before uppercase, is recommended for faster searching and more accurate results. However, this presents a challenge when doing case-insensitive searches. For example, a simple linear scan through the lexicon for the wildcard term cat* will check lots of entries that will never match (those starting with d, e, f, etc.) as you scan through the entries starting with c to the entries starting with C.

MarkLogic solves this problem by initially doing a series of binary searches to pick out just the sections of the lexicon that are worth scanning. A binary search works by looking at the middle item of an ordered list, determining if the search term is located before or after that middle item, and then performing a similar search on that half. By continuing to divide and conquer, the search eventually finds what it is looking for—usually much faster than if it searched linearly.

For a wildcard search for cat*, MarkLogic performs binary searches to find the range starting at c* and ending at C*. Then, within that range, it does binary searches to find the subsections from ca* to cA* and from Ca* to CA*. Finally, it finds the subsections for the letters t and T. (In reality, MarkLogic may stop doing binary searches for the characters in a search term once the ranges get small enough.) After the subsections are determined, MarkLogic can scan through them quickly to find the matching entries.

shows areas of potential matches after performing a case insensitive search for the term cat*

Figure 2: MarkLogic uses a series of binary searches to minimize what it needs to scan in a lexicon for a term. This is the process when performing a case insensitive search for the term cat*. The white sections mark the areas that could have matches.


MarkLogic can quickly return results for even very complex wildcard queries. You can learn more about wildcard search and configuring related settings in the MarkLogic documentation.

Thanks to Mary Holstege and Fei Xue for their help with this article.

Mike Wooldridge

Mike Wooldridge is a Senior Software Engineer at MarkLogic. He has built some of the software tools that ship with MarkLogic, including Monitoring History, and has authored open-source projects such as MLPHP. Mike is the coauthor of the latest edition of Inside MarkLogic Server and has also written books for Wiley on graphics software and web design, including Teach Yourself Visually Photoshop CC and Teach Yourself Visually HTML5.

Read more by this author

Share this article

Read More

Related Posts

Like what you just read, here are a few more articles for you to check out or you can visit our blog overview page to see more.


Poker Fun with XQuery

In this post, we dive into building a full five-card draw poker game with a configurable number of players. Written in XQuery 1.0, along with MarkLogic extensions to the language, this game provides examples of some great programming capabilities, including usage of maps, recursions, random numbers, and side effects. Hopefully, we will show those new to XQuery a look at the language that they may not get to see in other tutorials or examples.

All Blog Articles

Protecting passwords in ml-gradle projects

If you are getting involved in a project using ml-gradle, this tip should come in handy if you are not allowed to put passwords (especially the admin password!) in plain text. Without this restriction, you may have multiple passwords in your file if there are multiple MarkLogic users that you need to configure. Instead of storing these passwords in, you can retrieve them from a location where they’re encrypted using a Gradle credentials plugin.

All Blog Articles

Getting Started with Apache Nifi: Migrating from Relational to MarkLogic

Apache NiFi introduces a code-free approach of migrating content directly from a relational database system into MarkLogic. Here we walk you through getting started with migrating data from a relational database into MarkLogic

All Blog Articles

Sign up for a Demo

Don’t waste time stitching together components. MarkLogic combines the power of a multi-model database, search, and semantic AI technology in a single platform with mastering, metadata management, government-grade security and more.

Request a Demo