There was a recent discussion on an internal mailing list asking whether you could set up 10,000 range indexes on a database. When faced with a question like this, we should step back and evaluate the problem we’re trying to solve. The data set in question has about 1,000 entities, with an expectation that an average of 10 fields related to each entity would need to be indexed, leading to the question about 10,000 range indexes.
At first glance, this kind of logic suggests relational thinking: “this is natural; this is what most of us learned first!”. Of course, every index has a cost, regardless of whether the database is MarkLogic, an RDBMS, or another NoSQL database. Setting up 10,000 range indexes isn’t a good idea in MarkLogic, and if you find yourself thinking about that many, there’s probably a better solution.
The first question we should consider is whether we actually need range indexes for those 10,000 fields (or elements). MarkLogic’s Universal Index may provide what we need: a way to index the terms and structure of all documents. Through the Universal Index, we can do full-text searches on any ingested content, and scope it to particular document sections if we want. In many cases, this means we don’t need to set up specific indexes to provide rapid access to particular content.
The Universal Index provides immediate access to text and structure. When do we need range indexes? In a search context, we use range indexes for data-type specific inequalities, such as “find me all articles published since Jan 1, 2012”. By having a date range index on the publication date, we can build a greater-than-or-equal-to query. We can also use range indexes to get lists of values, enabling us to build facets. Jason Hunter’s Inside MarkLogic Server lists some other range index benefits.
In typical applications, we want to search across many (or all) fields, but we don’t need inequality comparisons or to generate thousands of facets. This means that for most applications, we’ll get much of our search capability from the Universal Index and supplement with a small number of range indexes.
In MarkLogic, a field is a structure that lets us refer to the contents of multiple elements by the same name. When we merge data from different sources, we sometimes get multiple elements that represent the same thing, but with different names. For instance, consider two book databases, where one has “published-date” and one has “pub-date”. At first glance, these appear to be two separate types of data, suggesting separate range indexes. However, with MarkLogic’s field feature, a single name can refer to the contents of both elements, with one type-specific index pulling values from all the elements. This is another way that the number of indexes can be reduced.
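To make the idea concrete, here is a toy Python model of a field backed by a single sorted index. It is not MarkLogic code and uses no MarkLogic API: the document layout, the URIs, and the names FIELD_ELEMENTS, build_field_index, and published_since are all hypothetical, chosen only to show how one type-specific index can pull values from both “published-date” and “pub-date” and then answer an inequality query.

```python
from bisect import bisect_left
from datetime import date

# Hypothetical field definition: one logical name covering two element names.
FIELD_ELEMENTS = {"published-date", "pub-date"}

# Toy documents from two merged book databases (made-up URIs and dates).
documents = [
    {"uri": "/a/1.xml", "published-date": date(2011, 6, 1)},
    {"uri": "/b/7.xml", "pub-date": date(2012, 3, 15)},
    {"uri": "/a/2.xml", "published-date": date(2013, 1, 2)},
]

def build_field_index(docs, elements):
    """Collect (value, uri) pairs from every element in the field, sorted by value."""
    entries = [
        (doc[name], doc["uri"])
        for doc in docs
        for name in elements
        if name in doc
    ]
    return sorted(entries)

def published_since(index, start):
    """Return URIs whose field value is greater than or equal to start."""
    # ("" sorts before any URI, so this finds the first entry with value >= start.)
    pos = bisect_left(index, (start, ""))
    return [uri for _, uri in index[pos:]]

index = build_field_index(documents, FIELD_ELEMENTS)
result = published_since(index, date(2012, 1, 1))
print(result)
```

Because both element names feed the same sorted list, the query touches one index rather than two, which is the saving the field feature offers.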
Sometimes you really do want to do range queries across a wide variety of fields. In an extreme case, MarkLogic lets you represent everything as triples, allowing for inequality queries using SPARQL’s FILTER or the cts:triples() function. MarkLogic’s own history monitoring is built entirely with triples. More commonly, triples are used in combination with documents to produce a powerful hybrid.
Having looked at some alternatives to setting up 10,000 range indexes, let’s come back to the original question. It turns out that the answer is no: you should not attempt to make anything on the order of 10,000. A target cap for range indexes should be about 100, and the vast majority of applications require a much smaller number than that. Each forest stores the indexes that relate to the content in that forest; each forest is broken into one or more stands. Each of these stands manages its indexes in two memory-mapped files per index. We commonly see twelve forests on a host (six master, six replica) with about 100 stands in total; multiply that by 10,000 range indexes at two files each and we’d have millions of open file handles!
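The estimate above can be checked with a quick back-of-the-envelope calculation. This Python sketch uses only the figures quoted in the text (about 100 stands per host, two memory-mapped files per index per stand); the function name open_index_files is made up for illustration, and real stand counts vary as forests merge.

```python
# Figures taken from the text; actual values vary by deployment.
FILES_PER_INDEX_PER_STAND = 2   # two memory-mapped files per index
STANDS_PER_HOST = 100           # about 100 stands across twelve forests

def open_index_files(range_indexes,
                     stands=STANDS_PER_HOST,
                     files_per_index=FILES_PER_INDEX_PER_STAND):
    """Estimate how many range index files one host keeps open."""
    return range_indexes * stands * files_per_index

for n in (10, 100, 10_000):
    print(f"{n:>6} range indexes -> {open_index_files(n):>9,} files")
```

At the suggested cap of 100 range indexes the host holds about 20,000 such files; at 10,000 range indexes the same arithmetic gives 2,000,000.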
Sometimes the transition from a relational background to a document and triples model doesn’t click right away for everyone, which can lead to a question like the one we had to begin with. If you find yourself planning to make thousands (or even hundreds) of range indexes, it’s probably worth stepping back and rethinking how the data will be represented. The Universal Index is really powerful, so let it do what it does best! And for cases the Universal Index doesn’t satisfy, simply apply fields, range indexes, and triples as needed.