Grokking the cts API

03.26.2012
10 minute read

As of the MarkLogic 10.0-3 release, the total number of built-in cts (“core text search”) functions comes in at 367! That already excludes deprecated functions. Given how central the cts functions are for building applications on MarkLogic, I thought it would help to provide some pointers in navigating this potentially overwhelming API.

But first of all, if you’re just getting started building a standard search application, you should start with the Search API (which uses and provides hooks into the cts API).

Having said that, here you go!

[Image: word cloud of cts function names]

Just kidding. While word clouds can be fun, they’re not always very useful. (I generated the above based on each function’s number of search hits on this website, so I suppose the result is somewhat interesting; just don’t put too much stock in it.)

Let’s take a tour through the cts API, using some categories I’ve chosen. We’ll knock down all functions, without necessarily explaining how they work. You’ll want to refer to the cts API documentation for those details.

The following list summarizes my breakdown by category:

  • Query Execution (3)
  • Query Objects (220)
  • Lexicon Functions (69)
  • Lexicon Reference (21)
  • Geospatial shapes (18)
  • Sorting (7)
  • Search result meta-data (6)
  • Miscellaneous (23)

Now let’s take a quick tour through each one.

Query execution (3)

The most important function of them all is cts:search, which executes cts queries (we’ll get to those next). A related and also important function is cts:contains, which tests a given node sequence against a given cts query, returning true if it matches and false otherwise. cts:walk is used in much the same way as cts:contains, except that it returns the actual matches instead of just true or false. Three down, 364 to go!
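To make those three concrete, here’s a minimal sketch. The sample paragraph and the search term are invented for illustration, and the first two calls run against an in-memory node rather than anything stored in a database.

```xquery
xquery version "1.0-ml";

(: Hypothetical sample content; any node would do :)
let $node := <p>Core text search makes search fast.</p>
let $query := cts:word-query("search")
return (
  (: cts:contains answers yes or no :)
  cts:contains($node, $query),
  (: cts:walk evaluates an expression per match; $cts:text is the matched text :)
  cts:walk($node, $query, $cts:text),
  (: cts:search runs the same query against the documents in the database :)
  cts:search(fn:doc(), $query)[1 to 3]
)
```

The same cts:query value is reused in all three calls, which is the point: you build a query once and hand it to whichever function suits the job.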

Query objects (220)

MarkLogic extends the XPath data model with an object type called “cts:query”, which is the super-type of a number of more specific cts:query sub-types. Queries can be composed together using the cts query constructor functions. They can then be executed by passing them to cts:search() or passed to other functions, such as lexicon calls or functions in other libraries, including search:resolve(), jsearch’s where clause and many others. All of these function names end in “-query”. If you see a cts function whose name ends in “-query”, you can be assured that it’s a cts:query constructor.

Query constructors can be categorized into different kinds. I’m going to call them leaf, composite, and “special” (for lack of a better word).

Composite query constructors (12)

The composite query constructors build up new queries from other queries, whether leaf queries or other composite queries. Here they are broken down into a few sub-categories:

Logical composition:
  • cts:and-query
  • cts:and-not-query
  • cts:or-query
  • cts:not-query
  • cts:not-in-query

Element/property scoping:
  • cts:element-query
  • cts:json-property-scope-query

Fragment scoping:
  • cts:document-fragment-query
  • cts:locks-fragment-query
  • cts:properties-fragment-query

Special queries:
  • cts:boost-query
  • cts:near-query
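Here’s a rough sketch of how composite constructors nest around leaf queries. The element name and the search terms are made up; only the constructor functions themselves come from the lists in this article.

```xquery
xquery version "1.0-ml";

(: Hypothetical element name and search terms :)
let $query :=
  cts:and-query((
    (: scope a word query to the contents of <title> :)
    cts:element-query(xs:QName("title"), cts:word-query("lexicon")),
    (: either phrase may appear anywhere in the document :)
    cts:or-query((
      cts:word-query("range index"),
      cts:word-query("word lexicon")
    )),
    (: exclude documents mentioning this word :)
    cts:not-query(cts:word-query("deprecated"))
  ))
return cts:search(fn:doc(), $query)
```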

Leaf query constructors (35)

The leaf query constructors are for queries that can stand on their own, i.e. can be constructed without the help of another query constructor. The following list breaks them down into several categories, depending on what the query searches for (collection URIs, directories, words, values, etc.). Watch for the consistent naming conventions.

Collection URIs:
  • cts:collection-query

Document URIs:
  • cts:document-query

Directories:
  • cts:directory-query

Words:
  • cts:element-attribute-word-query
  • cts:element-word-query
  • cts:field-word-query
  • cts:json-property-word-query
  • cts:word-query

Values:
  • cts:element-attribute-value-query
  • cts:element-value-query
  • cts:field-value-query
  • cts:json-property-value-query

Range index:
  • cts:element-attribute-range-query
  • cts:element-range-query
  • cts:field-range-query
  • cts:json-property-range-query
  • cts:path-range-query
  • cts:period-range-query
  • cts:range-query
  • cts:triple-range-query

Geospatial:
  • cts:element-attribute-pair-geospatial-query
  • cts:element-child-geospatial-query
  • cts:element-geospatial-query
  • cts:element-pair-geospatial-query
  • cts:geospatial-region-query
  • cts:json-property-child-geospatial-query
  • cts:json-property-geospatial-query
  • cts:json-property-pair-geospatial-query
  • cts:path-geospatial-query

Timestamp:
  • cts:after-query
  • cts:before-query
  • cts:lsqt-query
  • cts:period-compare-query

Boolean:
  • cts:false-query
  • cts:true-query

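Words and values differ in how they compare content against the search. A word query matches individual words inside the content, while a value query must match the entire value. A JSON document containing {"Text": "some content"} will match cts:word-query("some") but not cts:json-property-value-query("Text", "some").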
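If you’d like to check the word-versus-value difference yourself, a sketch along these lines should do it. It assumes cts:contains can be pointed at an in-memory JSON node built with the object-node constructor; the third query is my own addition to show a value query that does match.

```xquery
xquery version "1.0-ml";

(: The sample document from the paragraph above, built in memory :)
let $doc := object-node { "Text": "some content" }
return (
  (: a word query only needs one word of the value to match :)
  cts:contains($doc, cts:word-query("some")),
  (: a value query compares against the whole value, so this should not match :)
  cts:contains($doc, cts:json-property-value-query("Text", "some")),
  (: ...whereas the complete value should :)
  cts:contains($doc, cts:json-property-value-query("Text", "some content"))
)
```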

Another thing worth noticing about the word, value, and range queries above is that they have consistent ways of scoping queries: by element, by attribute, by JSON property, or by field. So we see a function for each pairing of scope (element, attribute, JSON property, or field) and object (word, value, or range). We’ll see something similar with the lexicon functions. Stay tuned.

This scoping is enforced in filtered search: an element-***-query returns only XML documents, while a json-property-***-query returns only JSON documents. In unfiltered search, both kinds of query will return any JSON and XML documents that match. None of this applies to element-attribute-***-query, of course, since JSON documents have no equivalent of attributes.

Special query constructors (5)

While the functions below each return a cts:query value, they don’t really fall into the above (leaf vs. composite) categories:

Function | Description
cts:query | constructs a cts:query from its XML representation
cts:registered-query | returns a previously registered query (using cts:register)
cts:reverse-query | returns a reverse query (for finding stored queries given a document, rather than stored documents given a query)
cts:similar-query | returns a query matching nodes similar to the given model nodes
cts:parse | converts a search string to an equivalent cts:query using a defined grammar
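Two of these are easy to try out. The sketch below round-trips a query through its XML representation with cts:query and turns a search string into a query with cts:parse. The wrapper element name and the search string are arbitrary choices of mine.

```xquery
xquery version "1.0-ml";

(: Placing a query inside an element yields its XML representation :)
let $original := cts:word-query("hello")
let $as-xml := <wrapper>{$original}</wrapper>/*

(: cts:query rebuilds a query object from that XML :)
let $rebuilt := cts:query($as-xml)

(: cts:parse reads a user-style search string using the default grammar :)
let $parsed := cts:parse("cat AND dog")

return ($as-xml, $rebuilt, $parsed)
```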

Okay, only 312 functions to go. (I promise the pace will pick up soon.)

Query accessors (168)

The query accessor functions aren’t very interesting at all, and there are 168 of them! They’re accessors for the various components of a cts:query value. You can recognize them using this failsafe technique: if you see a cts function whose name includes the string “-query-”, then it’s just an accessor. An example would be cts:word-query and its three accessors: cts:word-query-options, cts:word-query-text, and cts:word-query-weight. See a pattern?
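For the record, here’s what using the three cts:word-query accessors looks like. The text, option, and weight passed to the constructor are arbitrary.

```xquery
xquery version "1.0-ml";

(: Build a word query with some text, one option, and a weight :)
let $q := cts:word-query("hello", ("case-insensitive"), 2)
return (
  cts:word-query-text($q),     (: the text being searched for :)
  cts:word-query-options($q),  (: the options in effect on the query :)
  cts:word-query-weight($q)    (: the weight :)
)
```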

Lexicon functions (69)

Lexicon functions are much more interesting. Whereas cts queries are about efficiently finding documents, lexicon functions are about efficiently retrieving unique values (or words or URIs, etc.) from across a potentially large number of documents. They all require a particular index setting to be enabled. For “search,” think cts:search. For “analytics,” think lexicon functions.

String lexicons (14)

Below are the 14 string lexicon functions and their wildcard counterparts, grouped by lexicon type. Note the consistent naming conventions (at the end of the function names).

Aggregate function | Wildcard function | Source
cts:uris | cts:uri-match | URI lexicon
cts:collections | cts:collection-match | Collection lexicon
cts:words | cts:word-match | Word lexicon
cts:element-words | cts:element-word-match | Element word lexicon
cts:element-attribute-words | cts:element-attribute-word-match | Attribute word lexicon
cts:json-property-words | cts:json-property-word-match | Element word lexicon
cts:field-words | cts:field-word-match | Field word lexicon (inside Fields)

Lexicons are typically found at the database configuration page of the Admin UI, except for Field word lexicon as noted above.
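As a quick illustration, the sketch below calls one aggregate function and two wildcard functions. It assumes the URI lexicon and the word lexicon are both enabled on the database, and the path and word patterns are made up.

```xquery
xquery version "1.0-ml";

(: Assumes the URI lexicon and the word lexicon are enabled :)
(
  (: the first ten document URIs, in lexicon order :)
  cts:uris((), ("limit=10")),
  (: URIs matching a wildcard pattern; the path is hypothetical :)
  cts:uri-match("/blog/*.xml"),
  (: every word in the word lexicon that starts with "lex" :)
  cts:word-match("lex*")
)
```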

Scalar type specific lexicons (18)

Aggregate function | Wildcard function | Source
cts:values | cts:value-match | Range index
cts:element-values | cts:element-value-match | Element range index
cts:element-attribute-values | cts:element-attribute-value-match | Attribute range index
cts:field-values | cts:field-value-match | Field range index
cts:value-ranges | (none) | Range index
cts:element-value-ranges | (none) | Element range index
cts:element-attribute-value-ranges | (none) | Attribute range index
cts:field-value-ranges | (none) | Field range index
cts:value-co-occurrences | (none) | Range index
cts:element-value-co-occurrences | (none) | Element range index
cts:element-attribute-value-co-occurrences | (none) | Attribute range index
cts:field-value-co-occurrences | (none) | Field range index
cts:value-tuples | (none) | Range index
cts:triples | (none) | Triples range index

The plain “Range index” entries above stand for any combination of element, attribute, and field range indexes, and they also cover the collection and URI lexicons. You’ll find the indexes on the left-hand side of the Admin UI when you click on a database (Configure >> Databases >> {database name} >> *** Index). These functions can be used to generate aggregate reports.
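Here’s a small sketch of the kind of aggregate report I mean: the five most frequent values of an element among documents matching a query, each paired with its frequency via cts:frequency. The price element and the search word are hypothetical, and the example assumes an element range index exists for that element.

```xquery
xquery version "1.0-ml";

(: Assumes an element range index on a hypothetical <price> element :)
for $price in cts:element-values(
                xs:QName("price"),
                (),                              (: no starting value :)
                ("frequency-order", "limit=5"),  (: most frequent first :)
                cts:word-query("lexicon"))       (: only count matching documents :)
return fn:concat($price, ": ", cts:frequency($price))
```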

Geospatial lexicons (17)

Aggregate function | Wildcard function | Shape
cts:element-geospatial-values | cts:element-geospatial-value-match | Points
cts:element-child-geospatial-values | cts:element-child-geospatial-value-match | Points
cts:element-pair-geospatial-values | cts:element-pair-geospatial-value-match | Points
cts:element-attribute-pair-geospatial-values | cts:element-attribute-pair-geospatial-value-match | Points
cts:geospatial-co-occurrences | (none) | Point pairs
cts:element-value-geospatial-co-occurrences | (none) | Point pairs
cts:element-attribute-value-geospatial-co-occurrences | (none) | Point pairs
cts:geospatial-boxes | (none) | Boxes
cts:element-geospatial-boxes | (none) | Boxes
cts:element-pair-geospatial-boxes | (none) | Boxes
cts:element-child-geospatial-boxes | (none) | Boxes
cts:element-attribute-pair-geospatial-boxes | (none) | Boxes
cts:match-regions | (none) | Polygon

These functions require the corresponding geospatial index (element, element pair, element child, or element attribute pair). Which of these you use depends on how you chose to represent geospatial coordinates in your data.

Math-specific aggregates (19)

These are functions that will perform the mathematical computations for you.

cts:aggregate cts:linear-model cts:rank*
cts:correlation cts:max cts:stddev
cts:avg-aggregate cts:median* cts:stddev-p
cts:covariance cts:min cts:sum-aggregate
cts:covariance-p cts:percent-rank* cts:variance
cts:count-aggregate cts:percentile* cts:variance-p
cts:triple-value-statistics

*These functions take in a sequence (or an array) of values. The rest of the functions require a range index or collation.
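A minimal sketch of both flavors follows: aggregates computed straight from a range index, and a starred function fed a plain sequence. The price element is hypothetical, and the example assumes exactly one range index is configured for it, so that cts:element-reference can find it without extra options.

```xquery
xquery version "1.0-ml";

(: Assumes an element range index on a hypothetical <price> element :)
let $price := cts:element-reference(xs:QName("price"))
return (
  cts:sum-aggregate($price),
  cts:avg-aggregate($price),
  cts:min($price),
  cts:max($price),
  (: the starred functions take a plain sequence of values instead :)
  cts:median((1, 2, 3, 4))
)
```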

Tuple meta-data (1)

This only contains the function cts:frequency.

Lexicon reference functions (21)

Constructors (17)

Reference function | Target
cts:uri-reference | URI lexicon
cts:collection-reference | Collection lexicon
cts:element-reference | Element range index
cts:json-property-reference | Element range index
cts:element-attribute-reference | Attribute range index
cts:field-reference | Field range index
cts:path-reference | Path range index
cts:geospatial-element-reference | Geospatial element point range index
cts:geospatial-json-property-reference | Geospatial element point range index
cts:geospatial-attribute-pair-reference | Geospatial element attribute point range index
cts:geospatial-element-child-reference | Geospatial element child point range index
cts:geospatial-json-property-child-reference | Geospatial element child point range index
cts:geospatial-element-pair-reference | Geospatial element pair point range index
cts:geospatial-json-property-pair-reference | Geospatial element pair point range index
cts:geospatial-path-reference | Geospatial path point range index
cts:geospatial-region-path-reference | Geospatial region range index
cts:reference-parse | Any index represented by the XML to be parsed

These functions are often used with the string and scalar type-specific lexicon functions mentioned in the previous section.
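Here’s a sketch of references in action with cts:values and cts:value-tuples. It assumes the collection and URI lexicons are enabled and that a range index exists for the hypothetical price element.

```xquery
xquery version "1.0-ml";

(: Assumes the collection and URI lexicons plus a range index on <price> :)
(
  (: every collection URI in the database :)
  cts:values(cts:collection-reference()),
  (: the five highest values of <price> :)
  cts:values(cts:element-reference(xs:QName("price")), (),
             ("descending", "limit=5")),
  (: (document URI, price) pairs that occur together :)
  cts:value-tuples(
    (cts:uri-reference(), cts:element-reference(xs:QName("price"))),
    ("limit=5"))
)
```

The nice part of the reference style is that one function, cts:values, covers every kind of index; only the reference constructor changes.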

Accessors (4)

cts:reference-collation cts:reference-nullable
cts:reference-coordinate-system cts:reference-scalar-type

Geospatial shapes and accessors (18)

Shape | Accessors
cts:point | cts:point-latitude, cts:point-longitude
cts:linestring | cts:linestring-vertices
cts:circle | cts:circle-center, cts:circle-radius
cts:box | cts:box-east, cts:box-north, cts:box-south, cts:box-west
cts:polygon | cts:polygon-vertices
cts:complex-polygon | cts:complex-polygon-inner, cts:complex-polygon-outer

Note that functions like cts:***-intersects and cts:***-contains are now deprecated. Switch to the geo library.

Most commonly, you use these shapes to construct geospatial queries. So first you construct a cts:region (using one or more of the above constructor functions). Then, you construct a geospatial cts:query (using a geospatial query function such as cts:element-geospatial-query), passing it the cts:region(s) you constructed. Finally, you pass the query to cts:search to run a geospatial search, or to a lexicon function to perform some geospatial-related analytics.
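Those three steps might look something like this. The coordinates, the radius, and the location element are all invented for illustration, and the example assumes your documents store points in that element and that a matching geospatial index is configured.

```xquery
xquery version "1.0-ml";

(: Step 1: build a region from a center point and a radius :)
let $center := cts:point(37.5, -122.3)
let $region := cts:circle(10, $center)

(: Step 2: wrap the region in a geospatial query on a hypothetical <location> element :)
let $query := cts:element-geospatial-query(xs:QName("location"), $region)

(: Step 3: run the search :)
return cts:search(fn:doc(), $query)
```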

Sorting (7)

These constructors are typically used to specify which document information to use to “pre-sort” the response of cts:search, jsearch, and search:search.

Constructor | Sorted by
cts:index-order | Sort based on range index
cts:document-order | Sort based on the hash of the document URI
cts:quality-order | Sort based on document quality
cts:score-order | Sort based on search score. Affected by document quality and document frequency
cts:fitness-order | Sort based on fitness. Not affected by document quality nor by document frequency
cts:confidence-order | Sort by confidence. Not affected by document quality
cts:unordered | #iDon’tCare
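As a sketch, here’s cts:index-order passed to cts:search in the options argument to get the newest matches first. The published element and the search word are hypothetical, and a range index on that element is assumed.

```xquery
xquery version "1.0-ml";

(: Assumes an element range index on a hypothetical <published> element :)
cts:search(
  fn:doc(),
  cts:word-query("lexicon"),
  (: sort by the indexed value, newest first :)
  cts:index-order(cts:element-reference(xs:QName("published")), "descending")
)[1 to 10]
```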

Search result meta-data functions (6)

The result of a call to cts:search() is a sequence of nodes that reside in your database. But these node references also carry some special properties (six, to be precise) that extend beyond the XPath data model. They’re very handy for building search applications since they relate to things like search relevance:

Function | Purpose
cts:score | log(term frequency) * (inverse document frequency) + (QualityWeight * Quality)
cts:quality | Document quality
cts:confidence | Score without document frequency
cts:fitness | Confidence without the effect of document quality
cts:relevance-info | Report on how the relevance score was computed
cts:remainder | Estimate of the remaining fragments to process
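To see a few of these side by side, you could print them for the top hits of a search. The search word here is arbitrary.

```xquery
xquery version "1.0-ml";

(: One line per hit: URI, score, confidence, fitness :)
for $hit in cts:search(fn:doc(), cts:word-query("lexicon"))[1 to 5]
return fn:string-join((
  xdmp:node-uri($hit),
  fn:string(cts:score($hit)),
  fn:string(cts:confidence($hit)),
  fn:string(cts:fitness($hit))
), " | ")
```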

Miscellaneous categories (23)

“Miscellaneous” is a popular category in my family’s monthly budget, but I digress. I’ll try to break down these last remaining functions into some sub-categories:

Parsing/tokenization:
  • cts:stem
  • cts:tokenize
  • cts:part-of-speech
  • cts:distinctive-terms

Registered query:
  • cts:deregister
  • cts:register

Classifier:
  • cts:classify
  • cts:thresholds
  • cts:train

Temporal:
  • cts:period
  • cts:period-compare

Clustering:
  • cts:cluster

Entity Services:
  • cts:entity
  • cts:entity-dictionary
  • cts:entity-dictionary-parse
  • cts:entity-highlight

Result node manipulation:
  • cts:element-walk
  • cts:highlight

XPath validation:
  • cts:valid-document-patch-path
  • cts:valid-extract-path
  • cts:valid-index-path
  • cts:valid-optic-path
  • cts:valid-tde-context

I’m not going to explain these (or fall on any swords defending their categorization). The important thing is that the cts API looks a lot less overwhelming to you now, right? There’s a hidden wisdom to it all—an underlying logic, a latent brilliance, a method to the madness…sorry, got a little carried away there.

Conclusion

Congratulations, you made it through the whole tour! As a reward, here’s a little code to look at. It’s the query I ran to generate the data for the Wordle shown at the beginning of the article. And, yes, it does use the cts API:

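(: Read every function name from the attribute value lexicon, keep the cts: ones,
   and pair each with an estimated count of documents that mention it :)
for $func-name in cts:element-attribute-values(xs:QName("function"),
                                               xs:QName("fullname"))
where starts-with($func-name,"cts:")
return
  concat($func-name,":",xdmp:estimate(cts:search(collection(),$func-name)))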

And if you’re thinking to yourself that I must have a range index enabled on my database since I’m calling a value lexicon, you’re right. Well done.
