The Secrets to Wildcard Search in MarkLogic
Wildcards are special characters that you can include in your queries to search for variations of a term. A short string with wildcard characters can potentially match thousands terms of in your database, making wildcards an efficient way to search on similarly spelled words. MarkLogic search supports two wildcard characters:
||Matches zero or more non-space characters in a word, with
||Matches exactly one non-space character in a word, with
You can have wildcards at the beginning, middle, or end of your search terms, or mix them together (e.g.,
Note that wildcard searches are similar to stemming, which is how MarkLogic automatically considers grammatically similar words when processing a query. With stemming turned on, MarkLogic considers
swimming when you search for
swim. For more detail, see the Understanding and Using Stemmed Searches section of the Search Developer’s Guide.
MarkLogic supports wildcard search with three types of indexes:
- three-character indexes
- trailing wildcard indexes
How and when it uses these indexes depends on the wildcard pattern you’re searching on.
Indexes are the key to regular search in MarkLogic, and they are also the key to handling wildcards. An index takes character sequences in your database documents and maps them to document IDs. During a typical search, MarkLogic analyzes the query—breaking a complicated query into its component parts—and then consults the indexes to return a list of matching results. By default, MarkLogic indexes all the words in a database as well as structural information like the element and attribute names (for XML documents) and property names (for JSON documents). You can have MarkLogic index additional sequences to support certain features, such as wildcards.
When you turn on a three-character index for your database, MarkLogic will map every unique three-letter sequence to its corresponding documents. If the term
cater appears in a document, entries for
ter are added to the index. MarkLogic also includes contextual information about whether each sequence was at the start or end of a word. A three-character index is perfect for handling wildcard searches that include three characters and *—for example
*ate*. MarkLogic can retrieve all of the document IDs it needs for the results from that index.
MarkLogic can also leverage the three-character index for longer wildcard terms. It can split a four-character term such as
*ter*, look up the two terms in the three-character index, and combine the results. However, this can potentially result in false-positives, for example, a document that contains
terminate that is incorrectly flagged as a match.
Such false-positives are removed during the filtering part of the search. Filtering is when result documents are opened and double-checked to make sure they actually match the search query. The more filtering MarkLogic has to do, the longer it takes to return results. This means that using the three-character index to handle longer wildcard terms—which can increase filtering time—isn’t always the most efficient strategy. (You can turn off filtering, which means you’ll get results faster but potentially at the expense of false positives.)
MarkLogic also gives you the option to turn on one- and two-character indexes for databases. You might think that these would be helpful for getting wildcard results for shorter terms (e.g.,
ba*), but turning these indexes on results in much, much larger indexes, and it also impacts the time it takes to load, index, and merge documents. For these reasons, activating these indexes is discouraged, and from this point forward we will assume that they are not turned on.
What about indexing sequences greater than three characters? MarkLogic doesn’t offer these indexes since, in most cases, any benefit would be outweighed by the increased index space. Typically, three-character indexes give you the most bang for the buck by efficiently supporting wildcards and search in general.
Trailing Wildcard Index
For quickly handling wildcard searches that have more than three characters and end in *, the trailing wildcard index can be turned on. This will store entries for all
the possible prefix combinations for each word greater than or equal to three characters. For
cater, MarkLogic stores
cater in the index. The index includes contextual information marking the sequences as occurring at the start of a word. Search for
cate*, and MarkLogic will use the trailing wildcard index.
A trailing wildcard index isn’t helpful for terms with a leading * (e.g.,
*ater). Nor does MarkLogic offer an analogous leading wildcard index (i.e., MarkLogic is biased toward English in this sense). Searchers for English terms will more often put wildcards at the end of search terms than at the beginning, since English grammar tends to use more suffixes than prefixes, hence the preference for a trailing wildcard index.
Figure 1: Indexes map character sequences that appear in documents to their respective document IDs, showing how MarkLogic stores sequences from the word
cater when different indexes are turned on.
For wildcard searches that involve something more complicated than three-character strings or a trailing wildcard, MarkLogic will try to use lexicons. A lexicon is an ordered list of the unique words in a database. Unlike an index, a lexicon doesn’t map words to the documents in which they appear; it’s just a big ordered list.
Lexicons enable MarkLogic to quickly retrieve all the potential matches for a wildcard pattern. Let’s say you are searching for the wildcarded string
r?b*?t. MarkLogic can take that string and compare it to the lexicon word list. Since the lexicon words are stored in order, MarkLogic can scan them efficiently. (We’ll talk more about that in a moment.)
r?b*?t example, a lexicon scan might return the terms
robot. With this information, MarkLogic can rewrite the wildcard query like this:
rabbit OR rebuilt OR robot
It can then find the occurrences of these three words in its indexes and combine what it finds into a set of results.
Lexicons also support type-ahead suggestions in search boxes. When a user types
vac in a search box, the application can check a lexicon and display a menu with words that match, like
vacuum. It’s similar to performing a search for
vac*, but doesn’t require checking indexes.
Limits of Lexicons
Using a lexicon to build an expanded
OR query doesn’t always work. Sometimes a wildcard string will return thousands of matching terms from the lexicon, way too many to construct an effective
To get around this, you can cut down on a long list of terms returned by constructing a shorter list of three-character prefixes and postfixes.
Let’s say a lexicon results list looks like this:
With the prefix/postfix strategy, MarkLogic can rewrite the list as something shorter and more manageable:
Then you can use those three-character sequences with the three-character indexes, which is very fast. This can yield false-positives (which will need filtering), but it can be a better way to go when a lexicon scan returns too many words.
MarkLogic employs another strategy for single letters with a trailing wildcard—for example,
a*. MarkLogic would break this into three separate searches:
- search for
ausing the regular word index (there will be one index entry for
a, if it exists)
- search for
a?in the lexicon (this should yield a small number of results) and do an expanded OR query
- search for
a??*in the lexicon, using the prefix/postfix strategy (to avoid a long list of matches for words beginning with
Wildcard Search Options
Developers have control over how MarkLogic uses lexicons to resolve wildcard searches with the lexicon-expand option. This option can have the following values:
||Resolve wildcard queries using MarkLogic’s set of internal rules (e.g., if a lexicon scan is estimated to return a large number of terms, use alternative strategies). This is the default.|
||Always expand a wildcard query into a set of ORed terms, even if this results in a large number of terms and, potentially, a slow query.Always expand a wildcard query into a set of ORed terms, even if this results in a large number of terms and, potentially, a slow query.|
||Always resolve wildcard queries with prefix and postfix sequences based on matching lexicon terms.|
||Do not use lexicon expansion for wildcard searches. Use indexes instead.|
Better Lexicon Scans
The order of entries in a lexicon is determined by the collation type of the lexicon. The codepoint lexicon, which orders lowercase before uppercase, is recommended for faster searching and more accurate results. However, this presents a challenge when doing case-insensitive searches. For example, a simple linear scan through the lexicon for the wildcard term
cat* will check lots of entries that will never match (those starting with
f, etc.) as you scan through the entries starting with
c to the entries starting with
MarkLogic solves this problem by initially doing a series of binary searches to pick out just the sections of the lexicon that are worth scanning. A binary search works by looking at the middle item of an ordered list, determining if the search term is located before or after that middle item, and then performing a similar search on that half. By continuing to divide and conquer, the search eventually finds what it is looking for—usually much faster than if it searched linearly.
For a wildcard search for
cat*, MarkLogic performs binary searches to find the range starting at
c* and ending at
C*. Then, within that range, it does binary searches to find the subsections from
cA* and from
CA*. Finally, it finds the subsections for the letters
T. (In reality, MarkLogic may stop doing binary searches for the characters in a search term once the ranges get small enough.) After the subsections are determined, MarkLogic can scan through them quickly to find the matching entries.
Figure 2: MarkLogic uses a series of binary searches to minimize what it needs to scan in a lexicon for a term. This is the process when performing a case insensitive search for the term
cat*. The white sections mark the areas that could have matches.
MarkLogic can quickly return results for even very complex wildcard queries. You can learn more about wildcard search and configuring related settings in the MarkLogic documentation.