We’ve joined forces with Smartlogic to reveal smarter decisions—together.

Using Many Collections

On an internal discussion list, a question came up recently about a customer who was using a very large number of collections. MarkLogic Founder Chris Lindblad chimed in with the following explanation of how the collection lexicon works and whether using a large number of collections is good or bad. He agreed that sharing his answer here would be useful for others.

I think having a large number of collections is a great way of organizing documents. The collection mechanism in MarkLogic is very scalable. You can easily have as many collections as documents. I encourage using them, not discourage using them.

Collections are implemented as if there is a hidden <collection> element in each document for every collection that the document belongs to. So if a document belongs to ten collections, that is as if there is ten hidden <collection> elements in that document with the names of the ten collections that it belongs to. So the fundamental database cost for collection metadata is the number of collections each document belongs to, times the number of documents. Fundamentally having one collection with a million documents is about the same as having a million collections, each with one document.

Collection lexicons are implemented as if there is a string range index defined for the hidden <collection> element. So for the collection lexicon the cost of a distinct collection name is no more than the cost of a distinct value in an element for which you have defined a string range index. The URI lexicon works the same way. The only difference between the URI lexicon and the collection lexicon is that a document has only one URI, but can be in many collections.

We use large numbers of collections to implement features in MarkLogic. The bitemporal feature uses a distinct collection for each temporally-managed document. Every version of a temporally-managed document exists as a separate database document in that collection. So a bitemporal database with billions of temporally-managed documents would have billions of collections.

Collections are no less and no more expensive than having an extra element in your documents for each collection your document belongs to. The cost of having many collections is no less and no more expensive than having many distinct values in that extra element. The cost of the collection lexicon is no less and no more expensive than having a string range index on that extra element.

Another contributor pointed out that the collection lexicon is off by default and that it only needs to be turned on only if you want to use cts:collections() or cts:collection-match(). As with any range indexes, keep an eye on your memory consumption.

Start a discussion

Connect with the community




Most Recent

View All

Unifying Data, Metadata, and Meaning

We're all drowning in data. Keeping up with our data - and our understanding of it - requires using tools in new ways to unify data, metadata, and meaning.
Read Article

How to Achieve Data Agility

Successfully responding to changes in the business landscape requires data agility. Learn what visionary organizations have done, and how you can start your journey.
Read Article

Scaling Memory in MarkLogic Server

This not-too-technical article covers a number of questions about MarkLogic Server and its use of memory. Learn more about how MarkLogic uses memory, why you might need more memory, when you need more memory, and how you can add more memory.
Read Article
This website uses cookies.

By continuing to use this website you are giving consent to cookies being used in accordance with the MarkLogic Privacy Statement.