This post is an update, based on a talk given by Jason Hunter and Franklin Salonga, with updates from M. Joel Dubinko, Principal Engineer. Thanks to Matt Allen for working on the earlier version.
MarkLogic is built from the ground up for speed, yet many of our support cases have to do with performance. Often that’s because people are following historical wisdom that no longer applies. Today, it’s common to find big-memory systems using a 64-bit address space and plentiful CPU cores, and disks with plummeting latencies (but that haven’t grown in throughput as much as they have in size). MarkLogic lives natively in this new reality, and that means that paying attention to the right guidelines will pay huge benefits in the optimal performance of your applications.
While not a replacement for a more thorough discussion of understanding system resources, this article will give you a flavor of the concepts important to high-performance apps the MarkLogic way.
The following list of tips, compiled with the help of our performance geniuses, will help you realize optimal performance in your MarkLogic app.
1. Invest in Compute
MarkLogic is optimized for server-grade systems, those just to the left of the hockey-stick price jump. Today that means at least 16 cores, 128-256 GB of RAM, and 8-20 TB of disk, if not more. Make sure your disks have enough bandwidth and IOPS, not just capacity.
2. Aim for 100KB docs +/- 2 Orders of Magnitude
MarkLogic’s internal optimizer prefers documents around 100 KB (remember, in MarkLogic, each document should be one unit of query and should be seen more like relational rows than tables). Things will still work with documents down to 1 KB or less, but at this end of the scale relative per-document memory, disk, and lock overhead all start to creep up. Likewise, don’t worry about documents up to around 10 MB, but beyond that, the time to read documents off disk starts to be noticeable.
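The guideline above works out to a range of roughly 1 KB to 10 MB around the 100 KB target. As a minimal sketch (the helper name `classify_doc_size` and the use of decimal units are assumptions for illustration, not part of MarkLogic), you could bucket documents by size before loading them:

```python
# Hypothetical helper: bucket a document by size against the guideline above.
# Assumes decimal units (1 KB = 1,000 bytes). These are guidelines, not hard limits.
TARGET_BYTES = 100_000              # ~100 KB target
LOWER_BYTES = TARGET_BYTES // 100   # two orders of magnitude below: 1 KB
UPPER_BYTES = TARGET_BYTES * 100    # two orders of magnitude above: 10 MB


def classify_doc_size(size_bytes: int) -> str:
    """Return 'small', 'ok', or 'large' for a document of the given size."""
    if size_bytes < 0:
        raise ValueError("size must be non-negative")
    if size_bytes < LOWER_BYTES:
        return "small"   # per-document memory, disk, and lock overhead creeps up
    if size_bytes > UPPER_BYTES:
        return "large"   # time to read the document off disk becomes noticeable
    return "ok"
```

Documents flagged "small" may be candidates for combining, and documents flagged "large" for splitting, as long as each result is still one sensible unit of query.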
3. Think of MarkLogic Like an Only Child
On a busy system you may find 100 percent CPU utilization across multiple cores—that’s a feature, not a bug. MarkLogic assumes you want maximum performance given available resources. MarkLogic will also take advantage of however much RAM you have available for caching, so it’s best not to run other services on the same machines as MarkLogic Server. If you’re using shared resources like a SAN or virtual machine, you may want to impose restrictions that limit what MarkLogic can use.
4. The Best Filtering Is Avoiding Filtering
Indexes identify candidate documents, then filtering verifies the exact hits. That means that filtering ensures you get accurate results at the cost of examining lots of documents. For example, imagine a case-sensitive query without a case-sensitive index available: the index can only supply case-insensitive candidates, and filtering has to open each one to check the case. But watch out, as filtering can hide bad index settings! It’s much better to match up index settings with your application needs, since then you can use the “unfiltered” option on queries and still get accurate results, quickly.
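To make the trade-off concrete, here is a toy model in Python of index resolution followed by filtering. It is a sketch of the idea only, not MarkLogic's implementation: the document set, `build_index`, and `search` are all made up for illustration.

```python
# Toy model of "index finds candidates, filter verifies". Not MarkLogic code.
from collections import defaultdict

docs = {
    "a.xml": "MarkLogic is fast",
    "b.xml": "marklogic on a laptop",
    "c.xml": "Other database",
}


def build_index(case_sensitive: bool) -> dict:
    """Map each word (optionally lowercased) to the URIs that contain it."""
    index = defaultdict(set)
    for uri, text in docs.items():
        for word in text.split():
            key = word if case_sensitive else word.lower()
            index[key].add(uri)
    return index


def search(word: str, index: dict, case_sensitive_index: bool, filtered: bool) -> list:
    """Resolve candidates from the index, then optionally verify each one."""
    key = word if case_sensitive_index else word.lower()
    candidates = sorted(index.get(key, set()))
    if not filtered:
        return candidates
    # Filtering: open every candidate document and check the exact match.
    return [uri for uri in candidates if word in docs[uri].split()]


insensitive = build_index(case_sensitive=False)
sensitive = build_index(case_sensitive=True)

# Case-sensitive query against a case-insensitive index:
print(search("MarkLogic", insensitive, False, filtered=False))  # ['a.xml', 'b.xml']
print(search("MarkLogic", insensitive, False, filtered=True))   # ['a.xml']
# Same query against a matching index needs no filtering:
print(search("MarkLogic", sensitive, True, filtered=False))     # ['a.xml']
```

With the mismatched index, the unfiltered result includes a false positive and the filtered result is correct only because every candidate was opened. With the matching index, the unfiltered result is already accurate.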
5. Don’t Try to Outsmart Merging
Contact support if you plan to change any of the advanced merge settings (max size, min size, min ratio, timeout periods). You usually shouldn’t tweak these. If you’re worrying about merge settings, you’re probably underprovisioned (See Recommendation #1).
6. Large Reads Are for Queries, Not Updates
Hurrah! Using MVCC for transaction processing means lock-free reads. But, to be a “read” your module can’t include any update calls. This is determined by static analysis in advance, so even if the update call is never made, the module is still treated as an update and takes locks. Locks are cheap but they’re not free, and any big search to find the top 10 results will lock the full result set during the sort. Whenever possible, do update calls in a separate nested transaction context using xdmp:invoke() with an option specifying “different-transaction”.
7. Measure Everything
Measure before. Measure during. Measure after. Measure at all levels. When you know what’s normal, you can isolate when something looks different. MarkLogic can internally capture “Monitoring History” to a Meters database. Many customers also use tools such as Cacti, Ganglia, Nagios, Graphite, and others.
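As a minimal sketch of "know what's normal," the Python below times a call several times to build a baseline and then flags a later sample that falls far outside it. The function names and the three-standard-deviation threshold are assumptions for illustration. This is not a replacement for Monitoring History or the tools listed above.

```python
import statistics
import time


def time_call(fn, repeats: int = 5) -> list:
    """Run fn several times and return the elapsed seconds for each run."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return samples


def looks_abnormal(baseline: list, sample: float, threshold: float = 3.0) -> bool:
    """True if sample is more than `threshold` standard deviations from the
    baseline mean. Needs at least two baseline values."""
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)
    if stdev == 0:
        return sample != mean
    return abs(sample - mean) / stdev > threshold
```

For example, capture a baseline with `time_call` before a change, then pass each later timing to `looks_abnormal` during and after the change.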
8. Have a Staging Environment Handy
A staging box (or cluster) means you can measure changes in isolation, including new application code, new indexes, new data models, MarkLogic upgrades, etc. If you’re running on a cluster, then stage on a cluster (because you’ll see the effects of distribution, like net traffic and 2-phase commits). With AWS or Azure it’s easier than ever to “spin up” a cluster to test something.
9. Read the Release Notes
With each new release of MarkLogic, be sure to look through the release notes, which contain important information about configuring, running, and optimizing MarkLogic–and by extension, your applications.
10. Read the White Paper
It’s so important that this is the second time in this article we’re mentioning it: be sure to download, read, and understand the guide to Understanding System Resources for a comprehensive discussion about making the most of your hardware and software. Periodically check for updates, as we revise this guidance from time to time as technology marches forward.
1. Don’t Miss Out on New Features
MarkLogic has plenty of features that help with performance, including Optic API, MLCP, tiered storage, and multi-model queries. With the MLCP fast-load option, you can perform forest assignments on the client, and directly insert to that forest. It’s really a sharp tool, so don’t use it if you’re changing forest topology or assignment policies. With tiered storage, you can use object storage such as AWS S3 or Azure Blob Storage as cheap mass storage of data that doesn’t need high performance. Remember, you can “partition” data (e.g., based on dates) and let it age to slower disks. Take the time to learn about Optic API, with which you have a whole new way to model your data, and which in many cases can produce multi-model queries that run faster than handwritten queries crafted against earlier versions of MarkLogic.
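The date-based partitioning idea can be sketched as a simple policy that maps a document's age to a storage tier. The tier names and age cutoffs below are hypothetical, and this Python models only the policy decision, not MarkLogic's tiered storage API:

```python
from datetime import date

# Hypothetical tiers: (maximum age in days, tier name), fastest first.
TIERS = [
    (90, "fast-ssd"),
    (365, "standard-disk"),
]
ARCHIVE_TIER = "object-storage"  # e.g. cheap mass storage for old data


def tier_for(doc_date: date, today: date) -> str:
    """Pick the storage tier for a document based on its age in days."""
    age_days = (today - doc_date).days
    for max_age, tier in TIERS:
        if age_days <= max_age:
            return tier
    return ARCHIVE_TIER
```

Keeping the cutoffs in one table makes the aging policy easy to review and adjust as data volumes change.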
2. Avoid Fragmentation
If you’ve heard of fragmentation, you might be a long-time user of MarkLogic. This feature is no longer recommended. Just avoid it, but if you must, then ask support first.
3. Make Markup Meaningful
If you can use meaningful markup (where the tags describe the content they hold) you get both prettier XML and JSON, as well as content that’s easier to integrate into your apps. Reading query plans will be easier as well.
4. Earlier Indexing is Better Indexing
Take advantage of TDE to configure load-time indexes. Adding an index after loading requires touching every document with data relating to that index. Turning off an index is instant, but no space will be reclaimed until the re-index occurs. A little thought into index settings before loading will save you time.
5. Taste Test
Load a bit of data early, so you can get an idea about rates, sizes, and loads. Different index settings will affect performance and sizes. Test at a few sizes because some things scale linearly, some logarithmically. Especially when incrementally refining your data and schemas, it is best to perfect your new index settings in a smaller test environment before applying them to production.
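One way to tell linear from logarithmic growth is to measure at a few sizes and see which shape fits better. The sketch below, in plain Python, compares least-squares residuals for a fit against the size and against the logarithm of the size. The function names are made up for illustration, and it assumes at least two distinct, positive sizes:

```python
import math


def fit_residual(xs: list, ys: list) -> float:
    """Sum of squared residuals for the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


def growth_shape(sizes: list, timings: list) -> str:
    """Report whether timings fit better against size or against log(size)."""
    linear = fit_residual(sizes, timings)
    logarithmic = fit_residual([math.log(s) for s in sizes], timings)
    return "linear" if linear <= logarithmic else "logarithmic"
```

Real measurements are noisy, so treat the answer as a hint about which costs will dominate at production scale rather than as a proof.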
That’s it! With these pro tips, you’ll be in good shape to understand the most important performance concepts. But, if you are still running into problems, don’t hesitate to contact support at email@example.com!