Gartner Cloud DBMS Report Names MarkLogic a Visionary

Using Many Collections

On an internal discussion list, a question came up recently about a customer who was using a very large number of collections. MarkLogic Founder Chris Lindblad chimed in with the following explanation of how the collection lexicon works and whether using a large number of collections is good or bad. He agreed that sharing his answer here would be useful for others.

I think having a large number of collections is a great way of organizing documents. The collection mechanism in MarkLogic is very scalable. You can easily have as many collections as documents. I encourage using them, not discourage using them.

Collections are implemented as if there is a hidden <collection> element in each document for every collection that the document belongs to. So if a document belongs to ten collections, that is as if there is ten hidden <collection> elements in that document with the names of the ten collections that it belongs to. So the fundamental database cost for collection metadata is the number of collections each document belongs to, times the number of documents. Fundamentally having one collection with a million documents is about the same as having a million collections, each with one document.

Collection lexicons are implemented as if there is a string range index defined for the hidden <collection> element. So for the collection lexicon the cost of a distinct collection name is no more than the cost of a distinct value in an element for which you have defined a string range index. The URI lexicon works the same way. The only difference between the URI lexicon and the collection lexicon is that a document has only one URI, but can be in many collections.

We use large numbers of collections to implement features in MarkLogic. The bitemporal feature uses a distinct collection for each temporally-managed document. Every version of a temporally-managed document exists as a separate database document in that collection. So a bitemporal database with billions of temporally-managed documents would have billions of collections.

Collections are no less and no more expensive than having an extra element in your documents for each collection your document belongs to. The cost of having many collections is no less and no more expensive than having many distinct values in that extra element. The cost of the collection lexicon is no less and no more expensive than having a string range index on that extra element.

Another contributor pointed out that the collection lexicon is off by default and that it only needs to be turned on only if you want to use cts:collections() or cts:collection-match(). As with any range indexes, keep an eye on your memory consumption.

Start a discussion

Connect with the community

STACK OVERFLOW

EVENTS

GITHUB COMMUNITY

Most Recent

View All

Digital Acceleration Series: Powering MDM with MarkLogic

Our next event series covers key aspects of MDM including data integration, third-party data, data governance, and data security -- and how MarkLogic brings all of these elements together in one future-facing, agile MDM data hub.
Read Article

Of Data Warehouses, Data Marts, Data Lakes … and Data Hubs

New technology solutions arise in response to new business needs. Learn why a data hub platform makes the most sense for complex data.
Read Article

5 Key Findings from MarkLogic-Sponsored Financial Data Leaders Study

Financial institutions differ in their levels of maturity in managing and utilizing their enterprise data. To understand trends and winning strategies in getting the greatest value from this data, we recently co-sponsored a survey with the Financial Information Management WBR Insights research division.
Read Article
This website uses cookies.

By continuing to use this website you are giving consent to cookies being used in accordance with the MarkLogic Privacy Statement.