Given the size of MarkLogic’s cts:query API, there are often many ways to construct the cts:query parameter passed to cts:search. While all are expected to return only relevant matches, the search performance can vary (in some cases significantly) depending on the current index options enabled on the database. Let’s walk through why xdmp:plan is handy for optimizing search performance in MarkLogic.
To help understand how MarkLogic constructs responses to search requests, it’s important to understand the concepts of E-nodes and D-nodes. MarkLogic server instances are logically segmented based on the operations performed to satisfy requests. Servers can be Evaluators (E-nodes), Data Managers (D-nodes) or combined E/D-nodes (for instance, a lone MarkLogic instance is a combined E/D-node). E-nodes listen on a socket, parse requests, and generate responses. D-nodes hold data along with its associated indexes, and support E-nodes by providing them with the data they need to satisfy requests and process updates. For more information, refer to MarkLogic E-nodes and D-nodes in the MarkLogic Concepts Guide.
cts:search resolves queries in two phases. The first phase performs index resolution on the D-nodes. This initial result may contain false positives depending on the index configuration and the query. The second phase performs filtering of the results on the E-nodes, which examines the matched documents and removes false positives. If a query can be resolved completely from the indexes, then filtering is not required.
One goal when optimizing search performance is to configure the database indexes, and construct queries that take advantage of them, so that filtering isn’t necessary. Accomplishing this requires analyzing the types of searches an application supports and matching those requirements to the optimal index configuration and associated cts:query constructors. Of course, there are always tradeoffs to consider, primarily between query response times under expected loads, memory requirements, and the on-disk size of the database. In other words, the decision to trade disk space and memory for response times, or response times for disk space and memory, depends entirely on the specific requirements of the application, budgetary constraints, and any Service Level Agreements (SLAs) between the application owners and its users.
Since index resolution takes place in memory and filtering requires reading documents off disk (disk I/O), filtered searches will be slower than queries that can return only relevant results without filtering. Skipping the filtering phase is only safe if the database configuration and the specific cts:query constructors used ensure that false positives are not included during the index resolution phase.
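As a rough sketch of the two phases, the options argument of cts:search lets a search opt in or out of the filtering phase. The search term below is only a placeholder; whether the two counts agree depends on the database's index configuration.

```xquery
xquery version "1.0-ml";

(: Placeholder term, chosen only for illustration. :)
let $query := cts:word-query("ontology")

(: Default behaviour: index resolution followed by filtering. :)
let $filtered := cts:search(fn:doc(), $query, "filtered")

(: Index resolution only; may include false positives when the
   indexes cannot fully resolve the query. :)
let $unfiltered := cts:search(fn:doc(), $query, "unfiltered")

return (fn:count($filtered), fn:count($unfiltered))
```

If the second count is larger than the first, the difference is the number of fragments the filtering phase had to discard.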
A simple example
For this discussion we will use a sample database containing 50,000 documents constructed by randomly selecting words from an English language thesaurus, stringing the words together into randomly selected sentence lengths, and stringing the sentences together into random numbers of paragraphs. Documents also include metadata sections containing randomly selected words in <keyword> elements. In addition, one or more choices from a set of known quotations are included in some documents to provide known sequences of words for testing more complex queries.
The generated documents follow this structure:
<topic id="f79886d8-4a66-b10a-c87e-bf85cd95ea3d" ditaarch:DITAArchVersion="1.0" xmlns:ditaarch="http://dita.oasis-open.org/architecture/2005/">
  <title>One family bangiaceae</title>
  <prolog>
    <author>Mai Dygert</author>
    <critdates>
      <created date="2014-05-22T07:51:11.798739-05:00"> </created>
      <revised modified="2014-06-24T07:51:11.798739-05:00"> </revised>
    </critdates>
    <metadata>
      <keywords>
        <keyword>fagus</keyword>
        <keyword>mendacity</keyword>
        <keyword>vestibule</keyword>
        <keyword>salomon</keyword>
        <keyword>mantineia</keyword>
      </keywords>
    </metadata>
  </prolog>
  <body>
    <p>All camaraderie detusk of because certain triquetral sluice down. Them marcus aurelius vie whoever symphonic music like once everybody spring training ravish most caddice-fly against. Much pudendal artery venesect another mentha longifolia despite or her lamina arcus vertebrae do in. That claymore rusticate or me frumpishly enduringness dole out which haitian capital for. His off-axis reflector apparitional hugger mugger or any strategic warning laugh at a pinchas zukerman round.</p>
    <p>Their reflection relieve oneself that cur near and no genus cadra withstand. Each virginia snakeroot bejewelled bumble before the theatrical performance capsulize our chronic leukemia after. This in so far cranberry barrage jam whomever by experimentation erwinia unless her docking take leave. This multiple myeloma rough out my andrei tarkovsky before the genus anhinga recrudesce into.</p>
    <p>These capitation hallow his seaside goldenrod on since my blindworm equip everyone frying pan. That cheremiss apraxic abut but anything polanisia auscultate one fortification about. Nobody bank holding company matter to past whether someone endospore ametabolous embalm. My quamash immunize per except those high table bamboozle what genus stizostedion among. Much cotyledon ptyalise the pub onto whether herself grater tantalise up. Us vinegar joe stilwell cauterise among whether certain vidar close down.</p>
  </body>
</topic>
The first set of tests is run against a database with all index options disabled except for word searches (at minimum, either word searches or stemming is required for searching content). The test consists of executing a simple search using cts:search(fn:doc(), 'ontology'). This example requests that the MarkLogic server select all documents containing the word “ontology” regardless of where the word appears within a document.
The query plan for this search can be examined by passing the cts:search function to xdmp:plan like this:
xdmp:plan(cts:search(fn:doc(), 'ontology'))
which returns the following plan:
<qry:query-plan xmlns:qry="http://marklogic.com/cts/query">
  <qry:expr-trace>xdmp:eval("xquery version "1.0-ml"; &#10;(:&#10;xdmp:plan(cts:sear...", (), <options xmlns="xdmp:eval"><database>10675285422219569092</database><modules>34183009642494...</options>)</qry:expr-trace>
  <qry:info-trace>Analyzing path for search: fn:doc()</qry:info-trace>
  <qry:info-trace>Step 1 is searchable: fn:doc()</qry:info-trace>
  <qry:info-trace>Path is fully searchable.</qry:info-trace>
  <qry:info-trace>Gathering constraints.</qry:info-trace>
  <qry:word-trace text="ontology">
    <qry:key>3680059471137048531</qry:key>
  </qry:word-trace>
  <qry:info-trace>Search query contributed 1 constraint: cts:word-query("ontology", ("lang=en"), 1)</qry:info-trace>
  <qry:partial-plan>
    <qry:term-query weight="1">
      <qry:key>3680059471137048531</qry:key>
      <qry:annotation>word("ontology")</qry:annotation>
    </qry:term-query>
  </qry:partial-plan>
  <qry:info-trace>Executing search.</qry:info-trace>
  <qry:ordering/>
  <qry:final-plan>
    <qry:and-query>
      <qry:term-query weight="1">
        <qry:key>3680059471137048531</qry:key>
        <qry:annotation>word("ontology")</qry:annotation>
      </qry:term-query>
    </qry:and-query>
  </qry:final-plan>
  <qry:info-trace>Selected 76 fragments to filter</qry:info-trace>
  <qry:result estimate="76"/>
</qry:query-plan>
For this discussion, we’re primarily interested in the final-plan, the estimate, and what the plan indicates can be determined during the index resolution phase.
<qry:final-plan>
  <qry:and-query>
    <qry:term-query weight="1">
      <qry:key>3680059471137048531</qry:key>
      <qry:annotation>word("ontology")</qry:annotation>
    </qry:term-query>
  </qry:and-query>
</qry:final-plan>
<qry:info-trace>Selected 76 fragments to filter</qry:info-trace>
<qry:result estimate="76"/>
The <qry:annotation> element in the final plan indicates that during query formulation only one assertion about documents was identified: the document contains the word “ontology”.
This is the only assertion that can be identified during query formulation given the current index configuration of the database and the supplied query. In addition, the plan contains a <qry:result> element with an estimate of the number of matching documents in the database: <qry:result estimate="76"/>. Using information in the index alone, the server estimates that the database contains 76 matching documents.
Executing a filtered search and counting the results using fn:count(cts:search(fn:doc(), 'ontology')) provides a total of 76, which matches the value provided by estimate. The server is able to accurately retrieve the correct documents using index resolution alone. This search could be performed “unfiltered” with a high degree of confidence that the results do not include false positives. This simple configuration is enough to satisfy many use cases.
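The comparison described above can be sketched in a single query. This assumes the built-in xdmp:estimate function, which returns the index-only estimate directly without producing a full plan; the <comparison> wrapper element is purely illustrative.

```xquery
xquery version "1.0-ml";

(: Index-only estimate: no documents are read from disk. :)
let $estimate := xdmp:estimate(cts:search(fn:doc(), "ontology"))

(: Filtered search: every candidate is checked against the query. :)
let $actual := fn:count(cts:search(fn:doc(), "ontology"))

return
  <comparison>
    <estimate>{$estimate}</estimate>
    <filtered-count>{$actual}</filtered-count>
    <match>{$estimate eq $actual}</match>
  </comparison>
```

Note that the estimate counts fragments, so it lines up with a document count only when documents are not split into multiple fragments.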
Unfortunately, this simple query and minimal database configuration are insufficient to support the wide range of search requirements often found in typical search applications. The following sections build on this simple example to illustrate how to analyze more complex queries and the impact of enabling different sets of additional index options on the server’s ability to accurately resolve searches without filtering.
Restricting matches to specific elements
Since MarkLogic indexes both content and structure, it’s possible to formulate queries that match not just documents containing specific words, but only those documents in which the word appears within a specific element. Consider a more specific query executed against the same database using the same index configuration. In this case, the requirement is to retrieve documents containing the word “ontology”, but only if the word appears in the “keyword” element. This can be accomplished using a query like this:
cts:search(fn:doc(), cts:element-word-query(xs:QName('keyword'), "ontology"))
Like the previous example, this search requests that the server match documents containing the word “ontology”, but in addition, the word must appear in an element named keyword. The enabled index options are unchanged:
- word searches
The final plan for this query is:
<qry:final-plan>
  <qry:and-query>
    <qry:term-query weight="1">
      <qry:key>3680059471137048531</qry:key>
      <qry:annotation>word("ontology")</qry:annotation>
    </qry:term-query>
    <qry:term-query weight="0">
      <qry:key>13038789913933283747</qry:key>
      <qry:annotation>element(keyword)</qry:annotation>
    </qry:term-query>
  </qry:and-query>
</qry:final-plan>
<qry:info-trace>Selected 75 fragments to filter</qry:info-trace>
<qry:result estimate="75"/>
This plan contains two assertions about possible matching documents:
- The document contains the word “ontology”.
- The document contains an element named “keyword”.
Note that it does not assert that the word “ontology” appears in the element “keyword”. This is clearly not enough information to ensure that a document found during the index resolution phase actually matches the query. This is demonstrated by comparing the estimate (75) with a count of the filtered search results (2). The unfiltered search results contain 73 false-positive matches. These false-positive matches must be removed during the filtering phase to guarantee accurate results.
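One way to make this comparison repeatable is to pull the annotations and the estimate out of the plan itself and set them against the filtered count. The following is a minimal sketch; it relies only on the plan element names and the qry namespace shown in the output above.

```xquery
xquery version "1.0-ml";

declare namespace qry = "http://marklogic.com/cts/query";

let $query := cts:element-word-query(xs:QName("keyword"), "ontology")
let $plan := xdmp:plan(cts:search(fn:doc(), $query))

(: Estimate produced by index resolution alone. :)
let $estimate := xs:integer($plan//qry:result/@estimate)

(: Number of documents that survive the filtering phase. :)
let $filtered := fn:count(cts:search(fn:doc(), $query))

return (
  (: The assertions the indexes were able to test. :)
  $plan/qry:final-plan//qry:annotation/fn:string(),
  fn:concat("estimate=", $estimate,
            " filtered=", $filtered,
            " false-positives=", $estimate - $filtered)
)
```

On the sample database described here, the figures reported above are an estimate of 75 against 2 filtered matches.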
Enabling element-based searches
To support resolving queries of this type without resorting to filtering, MarkLogic provides additional index configuration options. The first one to enable is named “fast element word searches”. Note that there are tradeoffs to consider for each additional enabled index option. In this case, enabling fast element word searches results in decreased document ingestion performance and larger database size on disk. This is due to the additional index information captured and persisted on disk when documents are inserted into the database. The enabled index options are now:
- word searches
- fast element word searches
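The option can be switched on in the Admin Interface, or scripted. The sketch below assumes the Admin API module that ships with the server and uses a hypothetical database name, "plan-demo"; substitute the name of the database under test.

```xquery
xquery version "1.0-ml";

import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: "plan-demo" is a hypothetical database name used for illustration. :)
let $db := xdmp:database("plan-demo")
let $config := admin:get-configuration()

(: Turn on the "fast element word searches" index option. :)
let $config :=
  admin:database-set-fast-element-word-searches($config, $db, fn:true())

(: Saving the configuration applies the change. :)
return admin:save-configuration($config)
```

Existing documents pick up the new index data only after they have been reindexed, which is why the next step waits for reindexing to finish.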
Once the server has finished reindexing the documents, executing:
xdmp:plan(cts:search(fn:doc(), cts:element-word-query(xs:QName('keyword'), "ontology")))
results in the following final-plan:
<qry:final-plan>
  <qry:and-query>
    <qry:term-query weight="1">
      <qry:key>13836301292251522151</qry:key>
      <qry:annotation>element(keyword,word("ontology"))</qry:annotation>
    </qry:term-query>
    <qry:term-query weight="1">
      <qry:key>3680059471137048531</qry:key>
      <qry:annotation>word("ontology")</qry:annotation>
    </qry:term-query>
  </qry:and-query>
</qry:final-plan>
<qry:info-trace>Selected 2 fragments to filter</qry:info-trace>
<qry:result estimate="2"/>
This plan now contains two assertions to be tested:
- The document contains an element keyword containing the word “ontology” and …
- The document contains the word “ontology”
These assertions are now specific enough to match the intent of the cts:query passed to cts:search during the index resolution phase. Note the value of the estimate has gone from 75 to 2. This matches the number of results actually returned by executing this query in a filtered search and counting the results. The correct set of matching documents can now be determined solely during the index resolution phase of query execution. With fast element word searches enabled, searches of this type can be executed “unfiltered” with high confidence that the results will not contain false positives.
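The check performed by hand above (estimate versus filtered count) can be wrapped in a small helper and used to decide whether to run a search unfiltered. This is only a sketch: local:estimate-matches-count is a hypothetical name, and agreement for one query on one data set suggests, but does not prove, that every query of the same shape resolves fully from the indexes.

```xquery
xquery version "1.0-ml";

(: Hypothetical helper: true when the index-only estimate equals
   the number of results returned by a filtered search. :)
declare function local:estimate-matches-count(
  $query as cts:query
) as xs:boolean
{
  xdmp:estimate(cts:search(fn:doc(), $query))
    eq fn:count(cts:search(fn:doc(), $query))
};

let $query := cts:element-word-query(xs:QName("keyword"), "ontology")
return
  if (local:estimate-matches-count($query))
  then cts:search(fn:doc(), $query, "unfiltered")  (: skip filtering :)
  else cts:search(fn:doc(), $query)                (: default: filtered :)
```

In practice a check like this belongs in testing rather than in the request path, since the filtered count it computes is exactly the cost that unfiltered searches are meant to avoid.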
This result demonstrates why xdmp:plan is an essential tool for understanding and optimizing search performance in MarkLogic applications. The insight it provides into the inner workings of MarkLogic’s indexing and search capabilities is invaluable in helping application developers deliver the best possible performance to their users.