Using Many Collections

On an internal discussion list, a question came up recently about a customer who was using a very large number of collections. MarkLogic Founder Chris Lindblad chimed in with the following explanation of how the collection lexicon works and whether using a large number of collections is good or bad. He agreed that sharing his answer here would be useful for others.

I think having a large number of collections is a great way of organizing documents. The collection mechanism in MarkLogic is very scalable: you can easily have as many collections as documents. I would encourage their use, not discourage it.

Collections are implemented as if there is a hidden <collection> element in each document for every collection that the document belongs to. So if a document belongs to ten collections, it is as if there are ten hidden <collection> elements in that document, each holding the name of one of those collections. So the fundamental database cost for collection metadata is the number of collections each document belongs to, times the number of documents. Fundamentally, having one collection with a million documents costs about the same as having a million collections, each with one document.
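The cost argument above can be sketched as a toy model in Python. This is only an illustration of the counting, not how MarkLogic stores anything: the function name `collection_entries`, the document URIs, and the collection names are all made up for the example, and the document count is scaled down from a million to a thousand.

```python
def collection_entries(doc_collections):
    """Count hidden <collection> entries: one per (document, collection) pair.

    doc_collections maps a document URI to the set of collection names
    that the document belongs to (a hypothetical in-memory stand-in).
    """
    return sum(len(names) for names in doc_collections.values())


n = 1000  # scaled down from the million in the text

# One collection holding every document.
one_big = {f"/doc/{i}.xml": {"all-docs"} for i in range(n)}

# As many collections as documents, one document in each.
many_small = {f"/doc/{i}.xml": {f"coll-{i}"} for i in range(n)}

# A single document that belongs to ten collections.
ten_each = {"/doc/0.xml": {f"tag-{k}" for k in range(10)}}

print(collection_entries(one_big))     # one entry per document
print(collection_entries(many_small))  # also one entry per document
print(collection_entries(ten_each))    # ten entries for the one document
```

In this model the first two layouts produce the same number of entries, which is the point being made: the count is driven by memberships per document, not by how many distinct collection names exist.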

Collection lexicons are implemented as if there is a string range index defined for the hidden <collection> element. So for the collection lexicon the cost of a distinct collection name is no more than the cost of a distinct value in an element for which you have defined a string range index. The URI lexicon works the same way. The only difference between the URI lexicon and the collection lexicon is that a document has only one URI, but can be in many collections.
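As a rough mental model, a lexicon can be pictured as the sorted list of distinct values seen across all documents. The Python sketch below builds a collection lexicon and a URI lexicon the same way to show that the only difference is how many values each document contributes. The helpers `build_lexicon` and `match` are hypothetical, the sample data is invented, and Python's `fnmatchcase` is used as a stand-in for wildcard matching; the exact pattern semantics of cts:collection-match() may differ.

```python
from fnmatch import fnmatchcase


def build_lexicon(doc_values):
    """Return the sorted distinct values across all documents."""
    distinct = set()
    for values in doc_values.values():
        distinct.update(values)
    return sorted(distinct)


def match(lexicon, pattern):
    """Return lexicon values matching a shell-style wildcard pattern."""
    return [value for value in lexicon if fnmatchcase(value, pattern)]


# A document can be in many collections...
doc_collections = {
    "/a.xml": ["invoices", "2023"],
    "/b.xml": ["invoices", "2024"],
    "/c.xml": ["reports"],
}
# ...but has exactly one URI.
doc_uris = {uri: [uri] for uri in doc_collections}

collection_lexicon = build_lexicon(doc_collections)
uri_lexicon = build_lexicon(doc_uris)

print(collection_lexicon)                # distinct collection names, sorted
print(match(collection_lexicon, "20*"))  # names starting with "20"
print(uri_lexicon)                       # one entry per document
```

Each distinct collection name appears once in the lexicon no matter how many documents carry it, just as a distinct value would in a range index.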

We use large numbers of collections to implement features in MarkLogic. The bitemporal feature uses a distinct collection for each temporally-managed document. Every version of a temporally-managed document exists as a separate database document in that collection. So a bitemporal database with billions of temporally-managed documents would have billions of collections.

Collections are no less and no more expensive than having an extra element in your documents for each collection your document belongs to. Having many collections costs no less and no more than having many distinct values in that extra element. The collection lexicon costs no less and no more than a string range index on that extra element.

Another contributor pointed out that the collection lexicon is off by default and needs to be turned on only if you want to use cts:collections() or cts:collection-match(). As with any range index, keep an eye on your memory consumption.
