
Mitigating the Impact of Re-indexing


Your application is live in production, you have millions of documents, and now you want to change a database setting or add a custom index. You know that when you move these changes to production it’s going to take hours, maybe days to re-index all the affected documents and while MarkLogic is re-indexing it’s going to take up resources. You need a solution to mitigate the impact of re-indexing.

Re-indexing resource consumption

Before we go over the solution, we need to understand the problem: re-indexing can be a heavy consumer of your system resources. The Understanding System Resources white paper explains index utilization best with these two paragraphs.

MarkLogic indexes consist of both in-memory and on-disk data structures. Range indexes and lexicons are stored in memory-mapped files – if your application uses them, you’ll see equivalent memory usage. Term lists are part of what’s known as the Universal Index, and those are both in-memory (in the List Cache) and in files on disk. The Triple Index also uses memory and disk resources; although not memory mapped, the Triple Cache will grow and shrink as needed to support semantics queries.

Generally, utilization of indexes means you’ll need both more storage space on-disk, and potentially more space utilized in-memory, in the case of lexicons and range indexes. More indexes mean larger index files, and slower ingestion – more work needs to be done as content is ingested to create the index files. Of course, more indexes, particularly when residing in-memory, can result in query performance 100X-1000X faster than if the query needs to be resolved through additional work at query time.

(MarkLogic Performance: Understanding System Resources, p. 14)

From these two paragraphs, we know that indexing takes time. Re-indexing a document takes just as long as it took to index that document in the first place. The re-indexer is smart about it: it queries the data to filter out documents that do not need to be re-indexed, using the same query features as cts:search. So if we want to know how many documents will be affected by an index change, we can run a cts:search on the element or attribute in question, wrap it in xdmp:estimate, and the result is a good approximation of the number of documents affected by the change.
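For example, suppose the new index covers an element named `status` (a placeholder name for illustration). A minimal sketch, run in Query Console against the target database, would estimate how many documents the re-indexer will have to revisit:

```xquery
xquery version "1.0-ml";

(: Estimate how many documents contain the element the new index
   will cover -- these are the documents the re-indexer must touch.
   "status" is a hypothetical element name; substitute your own. :)
xdmp:estimate(
  cts:search(
    fn:doc(),
    cts:element-query(xs:QName("status"), cts:true-query())
  )
)
```

Because xdmp:estimate resolves entirely from the existing indexes, this count is cheap to compute even on a large database.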

Re-indexing is like ingesting a document. The resource utilization will be very similar. Here is what the Understanding System Resources white paper explains about ingestion.

There are multiple operations that consume resources in the ingest process. Some of those operations must happen in the foreground, immediately: writing to the journal, for example. Other operations happen in the background, prioritized behind foreground operations (and subject to throttling through administrative settings). You will find that MarkLogic utilizes resources – particularly I/O and CPU – even at times when no queries are issued. The system is constantly optimizing for the next read or write operation.

This means that you’ll observe the following, all of which is normal and means the system is operating properly:

  • Spiky I/O. This happens when periodic merges run and do big I/O operations to combine files on disk
  • CPU. Merges will show up as nice % in CPU statistics

When ingesting [or re-indexing] content, you should expect to see heavy I/O and CPU activity related to merges.

(MarkLogic Performance: Understanding System Resources, p. 7)

When we are re-indexing, we need to keep in mind the resource consumption of:

  • The new indexes
  • Ingestion (re-indexing) of the documents affected
  • The merges that will happen because of the re-indexing of the affected documents

Mitigating Re-indexing Impact

From the high-level overview, we learn that re-indexing can be a resource-intensive operation, especially when you are re-indexing a large number of documents in a database with many custom indexes. We have seen re-indexing jobs take days to complete.

Most systems cannot have their resources constrained for days. An even bigger issue is that some code requires indexes to be available in order to execute. If that code is deployed at the same time the new database settings are applied, it will error out until re-indexing is done.

To mitigate the resources used and to avoid the code issue, deploy the new database settings before they are needed. Lower the “reindexer throttle” to whatever rate is comfortable and set the merge priority to lower. Re-indexing will still happen, just at a slower rate; it will take longer to finish, which is acceptable because you don’t need the new indexes right away. If you are changing an index, say to a different collation, add the new index alongside the current one, because the current code still needs the old index to run. You can then clean up the old index once re-indexing is done and the new code is deployed.
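These settings can be changed in the Admin UI, or scripted with the Admin API so the change is repeatable across environments. A minimal sketch, assuming a database named "Documents" (a placeholder) and a user with admin privileges:

```xquery
xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: Lower the re-indexer throttle and merge priority for the
   "Documents" database (a placeholder name), then save the
   configuration so the change takes effect. :)
let $config := admin:get-configuration()
let $db     := admin:database-get-id($config, "Documents")
let $config := admin:database-set-reindexer-throttle($config, $db, 2)
let $config := admin:database-set-merge-priority($config, $db, "lower")
return admin:save-configuration($config)
```

The throttle value of 2 here is only an example; the FAQ below discusses how to find a value that fits your cluster.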

Frequently Asked Questions

When you go to implement this new management of re-indexing you’ll inevitably have some questions like the ones below.

How far ahead do you deploy the settings?

There are a few different ways you can handle this. You can reason: I have N documents in my database, and it will take Y hours to re-index all of them, so to be safe I’ll deploy Y plus a buffer before the change is needed. For example, 24 million documents (with 100 custom indexes and many database index settings turned on) might take 72 hours to re-index, so to allow extra time we would deploy the index settings 5 days before they are needed.
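The numbers above are illustrative, but the arithmetic generalizes: divide the affected document count by the re-indexing rate you have measured, then add a buffer. A back-of-envelope sketch (the rate of 93 docs/second below is a placeholder derived from the 24-million-in-72-hours example; measure your own):

```xquery
xquery version "1.0-ml";

(: Rough lead time for deploying index settings ahead of need.
   All figures are placeholders -- substitute values measured
   in your own pre-prod environment. :)
let $affected-docs    := 24000000
let $docs-per-second  := 93      (: observed re-indexing rate :)
let $buffer-hours     := 48
let $reindex-hours    := $affected-docs div ($docs-per-second * 3600)
return fn:ceiling($reindex-hours + $buffer-hours)  (: => 120 hours, i.e. 5 days :)
```

Note that the rate itself depends on document size, index count, and hardware, which is why measuring in an environment that matches production matters.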

That approach works well if you want to build it into your process for all deployments, but what if you are in a more ad hoc deployment scenario? You could measure how long re-indexing takes in your pre-prod environment, which hopefully matches your production environment. If you do not have a pre-prod environment, you can estimate with the cts:search and xdmp:estimate approach discussed earlier.

How do you find a good re-indexer throttle setting?

Most clusters start with the throttle at 5, which is the highest setting; 1 is the lowest. There really isn’t a good way to find the best re-indexer setting besides guess and check. The good news is that there are really only four options, because if you could run at 5 you wouldn’t have this problem. You can approach it in two ways: start at 4, run re-indexing, watch how system resources are being used, and step down one number at a time until resource usage is acceptable; or start at 1 and step up toward 4 until resource consumption is no longer acceptable.

Starting from 1 and going up to 4 is the safer option because you’ll be using fewer resources. One thing to note when changing the re-indexer throttle is that the effect of the old setting is still visible in the merges already in flight. If you are stepping down from a higher number, you’ll keep seeing more merges until the merges catch up with the current re-indexer throttle setting.
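The step-down loop can be scripted so each adjustment is deliberate: read the current value, lower it one notch, and re-check resource consumption before lowering it again. A sketch with the Admin API, again assuming a placeholder database named "Documents":

```xquery
xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: Read the current re-indexer throttle and step it down one
   notch, never going below the minimum of 1. "Documents" is
   a placeholder database name. :)
let $config  := admin:get-configuration()
let $db      := admin:database-get-id($config, "Documents")
let $current := admin:database-get-reindexer-throttle($config, $db)
let $next    := fn:max((1, $current - 1))
return admin:save-configuration(
  admin:database-set-reindexer-throttle($config, $db, $next))
```

Run it, watch your monitoring for a while (remembering the merge lag described above), and run it again only if resources are still constrained.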

How do you know if there are index changes?

If you are using a version control system, you can compare the database settings files between releases to spot index changes.

If you have an environment that you deploy to before production, you can use the Configuration Manager on port 8002. It allows you to export database settings and import them elsewhere. When you import, it will show you the changes, and you can review them without applying them.

How do you just deploy the database settings?

If you are using one of the common build tools, like ml-gradle or Roxy, you can use their commands to deploy just the database settings. For ml-gradle, the mlDeployDatabases task updates each database in your configuration directory. Roxy has a setting that lets you apply selective parts of the bootstrap command; to deploy only indexes, you could run: ./ml local bootstrap --apply-changes=indexes.

If you do not have a common build tool, you can use the Configuration Manager on port 8002 to deploy indexes. You’ll need a cluster, such as a pre-prod environment, that already has the database settings you want to deploy. You can then export the configuration from the pre-prod environment and import it into the production environment.

Are there special considerations when dealing with a cluster set up with database replication?

You will have to make the database changes to the DR cluster first. If you don’t, you will likely have to rebuild the whole replicated database, as you might have missing index information for some documents.


Tyler Replogle

