Performance Theory: Tales From MarkLogic Support

by Matt Allen

Posted on April 02, 2014

This post is an update, based on a talk given by Jason Hunter and Franklin Salonga, with updates from M. Joel Dubinko, Principal Engineer. Thanks to Matt Allen for working on the earlier version.

MarkLogic is built from the ground up for speed, yet many of our support cases have to do with performance. Often that’s because people are following historical wisdom that no longer applies. Today, it’s common to find big-memory systems with a 64-bit address space, plentiful CPU cores, and disks with plummeting latencies (though disk throughput hasn’t grown as much as disk capacity). MarkLogic lives natively in this new reality, and that means that paying attention to the right guidelines will pay off handsomely in the performance of your applications.

While not a replacement for the more thorough discussion in the guide to Understanding System Resources, this article will give you a flavor of the concepts important to high-performance apps the MarkLogic way.

The Top 10 Tips (plus a few bonus tips)

The following list of tips, compiled with the help of our performance geniuses, will help you realize optimal performance in your MarkLogic app.

1. Invest in Compute

MarkLogic is optimized for server-grade systems, those just to the left of the hockey-stick price jump. Today that means at least 16 cores, 128-256 GB of RAM, and 8-20 TB of disk, if not more. Make sure your disks have enough bandwidth and IOPS, not just capacity.

2. Aim for 100 KB Docs +/- 2 Orders of Magnitude

MarkLogic’s internal optimizer prefers documents around 100 KB (remember, in MarkLogic, each document should be one unit of query and should be seen more like relational rows than tables). Things will still work with documents down to 1 KB or less, but at this end of the scale relative per-document memory, disk, and lock overhead all start to creep up. Likewise, don’t worry about documents up to around 10 MB, but beyond that, the time to read documents off disk starts to be noticeable.
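To make the size guidance concrete, here is a minimal Python sketch that buckets document sizes against the 1 KB and 10 MB bounds mentioned in this tip. The function names, the bucket labels, and the use of binary units (1 KB = 1024 bytes) are assumptions for illustration only; this is not a MarkLogic API, just a quick way to sanity-check a sample of your own content before loading it.

```python
from collections import Counter

# Binary units are an assumption; the tip does not say which convention it uses.
KB = 1024
MB = 1024 * KB

# Bounds taken from the tip: 100 KB, plus or minus two orders of magnitude.
LOWER_BOUND = 1 * KB    # below this, per-document overhead starts to creep up
UPPER_BOUND = 10 * MB   # above this, read time off disk becomes noticeable


def classify_document_size(size_bytes):
    """Return a hypothetical bucket label for one document size in bytes."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    if size_bytes < LOWER_BOUND:
        return "small"
    if size_bytes > UPPER_BOUND:
        return "large"
    return "ok"


def summarize_sizes(sizes):
    """Count how many documents fall into each bucket."""
    return Counter(classify_document_size(s) for s in sizes)


if __name__ == "__main__":
    sample = [512, 4 * KB, 100 * KB, 2 * MB, 50 * MB]
    print(summarize_sizes(sample))
```

If most of a sample lands in the "small" or "large" bucket, that is a hint to revisit how the content is split into documents, not a hard failure.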

3. Think of MarkLogic Like an Only Child

On a busy system you may find 100 percent CPU utilization across multiple cores; that’s a feature, not a bug. MarkLogic assumes you want maximum performance given available resources. MarkLogic will also take advantage of however much RAM you have available for caching, so it’s best not to run other services on the same machines as MarkLogic Server. If you’re using shared resources like a SAN or virtual machine, you may want to impose restrictions that limit what MarkLogic can use.

4. The Best Filtering Is Avoiding Filtering

Indexes identify candidate documents, then filtering verifies the exact hits. That means that filtering ensures you get accurate results at the cost of examining lots of documents. For example, imagine a case-sensitive query without a case-sensitive index available: the index can only supply case-insensitive candidates, and filtering has to open each one to check the case. But watch out, as filtering can hide bad index settings! It’s much better to match up index settings with your application needs, since then you can use the “unfiltered” option on queries and still get accurate results, quickly.
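The trade-off between filtered and unfiltered searches can be illustrated with a toy model. The Python sketch below is not how MarkLogic implements its indexes; the dictionary-based index, the function names, and the sample documents are all assumptions made for illustration. It shows the pattern described in this tip: a case-insensitive index needs filtering to answer a case-sensitive query accurately, while a case-sensitive index gives the same answer unfiltered, without opening any documents.

```python
from collections import defaultdict


def build_index(docs, case_sensitive):
    """Map each word to the set of document URIs that contain it."""
    index = defaultdict(set)
    for uri, text in docs.items():
        for word in text.split():
            key = word if case_sensitive else word.lower()
            index[key].add(uri)
    return index


def search(docs, index, term, index_is_case_sensitive, filtered):
    """Run a case-sensitive word query.

    Returns (matching URIs, number of documents opened by the filter).
    """
    key = term if index_is_case_sensitive else term.lower()
    candidates = sorted(index.get(key, set()))
    if not filtered:
        # Unfiltered: trust the index and open nothing.
        return candidates, 0
    # Filtered: open every candidate and verify the exact, case-sensitive hit.
    hits = [uri for uri in candidates if term in docs[uri].split()]
    return hits, len(candidates)


if __name__ == "__main__":
    docs = {
        "/a.xml": "Polish sausage",
        "/b.xml": "shoe polish",
        "/c.xml": "polish the silver",
    }
    insensitive = build_index(docs, case_sensitive=False)
    sensitive = build_index(docs, case_sensitive=True)

    # Accurate, but three documents are opened to find one hit.
    print(search(docs, insensitive, "Polish", False, filtered=True))
    # Fast, but inaccurate: the index settings do not match the query.
    print(search(docs, insensitive, "Polish", False, filtered=False))
    # Fast and accurate: the index settings match the query.
    print(search(docs, sensitive, "Polish", True, filtered=False))
```

Note how the first search hides the mismatch between the index and the query: the results are correct, so the wasted work only shows up when you measure it.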

5. Don’t Try to Outsmart Merging

Contact support if you plan to change any of the advanced merge settings (max size, min size, min ratio, timeout periods). You usually shouldn’t tweak these. If you’re worrying about merge settings, you’re probably underprovisioned (see tip #1).

6. Large Reads Are for Queries, Not Updates

Hurrah! Using MVCC for transaction processing means lock-free reads. But to count as a “read,” your module can’t include any update calls. This is determined by static analysis in advance, so even if the update call is never made, its presence still changes how the module runs. Locks are cheap but they’re not free, and in an update any big search to find the top 10 results will lock the full result set during the sort. Whenever possible, do update calls in a separate nested transaction context using xdmp:invoke() with an option specifying “different-transaction”.

7. Measure Everything

Measure before. Measure during. Measure after. Measure at all levels. When you know what’s normal, you can isolate when something looks different. MarkLogic can internally capture “Monitoring History” to a Meters database. Many customers also use tools such as Cacti, Ganglia, Nagios, Graphite, and others.
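As a rough illustration of "know what’s normal," the Python sketch below flags a new measurement that sits far from a recorded baseline. It is independent of MarkLogic’s Monitoring History and of the tools listed above; the function name, the three-standard-deviation threshold, and the sample latencies are assumptions chosen for the example.

```python
import statistics


def is_anomalous(baseline, sample, threshold=3.0):
    """Return True if sample is more than `threshold` standard deviations
    away from the mean of the baseline measurements."""
    if len(baseline) < 2:
        raise ValueError("need at least two baseline measurements")
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)
    if stdev == 0:
        # A perfectly flat baseline: any change at all is different.
        return sample != mean
    return abs(sample - mean) / stdev > threshold


if __name__ == "__main__":
    # Hypothetical query latencies in milliseconds, captured before a change.
    baseline_ms = [102, 98, 101, 99, 100, 103, 97]
    for latency in (104, 140):
        print(latency, is_anomalous(baseline_ms, latency))
```

The point is less the statistics than the habit: without the "before" numbers, there is nothing to compare the "after" numbers against.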

8. Have a Staging Environment Handy

A staging box (or cluster) means you can measure changes in isolation, including new application code, new indexes, new data models, MarkLogic upgrades, etc. If you’re running on a cluster, then stage on a cluster (because you’ll see the effects of distribution, like network traffic and two-phase commits). With AWS or Azure it’s easier than ever to “spin up” a cluster to test something.

9. Read the Release Notes

With each new release of MarkLogic, be sure to look through the release notes, which contain important information about configuring, running, and optimizing MarkLogic and, by extension, your applications.

10. Read the White Paper

It’s so important that this is the second time in this article we’re mentioning it: be sure to download, read, and understand the guide to Understanding System Resources for a comprehensive discussion about making the most of your hardware and software. Periodically check for updates, as we revise this guidance from time to time as technology marches forward.

Bonus Tips

1. Don’t Miss Out on New Features

MarkLogic has plenty of features that help with performance, including Optic API, MLCP, tiered storage, and multimodal queries. With the MLCP fast-load option, you can perform forest assignments on the client and insert directly into that forest. It’s a really sharp tool, so don’t use it if you’re changing forest topology or assignment policies. With tiered storage, you can use object storage such as AWS S3 or Azure Blob Storage as cheap mass storage for data that doesn’t need high performance. Remember, you can “partition” data (e.g., based on dates) and let it age to slower disks. Take the time to learn about Optic API, which gives you a whole new way to model your data and in many cases can produce multimodal queries that run faster than handwritten queries crafted against earlier versions of MarkLogic.

2. Avoid Fragmentation

If you’ve heard of fragmentation, you might be a long-time user of MarkLogic. This feature is no longer recommended. Just avoid it, and if you must use it, ask support first.

3. Make Markup Meaningful

If you can use meaningful markup (where the tags describe the content they hold) you get both prettier XML and JSON, as well as content that’s easier to integrate into your apps. Reading query plans will be easier as well.

4. Earlier Indexing Is Better Indexing

Take advantage of TDE to configure load-time indexes. Adding an index after loading requires touching every document with data relating to that index. Turning off an index is instant, but no space will be reclaimed until the re-index occurs. A little thought into index settings before loading will save you time.

5. Taste Test

Load a bit of data early, so you can get an idea about rates, sizes, and loads. Different index settings will affect performance and sizes. Test at a few sizes because some things scale linearly, some logarithmically. Especially when incrementally refining your data and schemas, it is best to perfect your new index settings in a smaller test environment before applying them to production.
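One way to read the results of a taste test is to check which growth pattern your measurements resemble. The Python sketch below is a crude heuristic under stated assumptions, not a rigorous curve fit: it takes hypothetical (document count, cost) pairs, divides each cost by n and by log2(n), and reports whichever ratio stays more nearly constant. The function names and the sample numbers are invented for illustration.

```python
import math
import statistics


def relative_spread(values):
    """Population standard deviation divided by the mean (a unitless spread)."""
    return statistics.pstdev(values) / statistics.mean(values)


def guess_scaling(measurements):
    """Guess whether cost grows linearly or logarithmically with size.

    measurements: list of (n, cost) pairs with n > 1 and cost > 0,
    taken at a few different sizes.
    """
    if len(measurements) < 2:
        raise ValueError("measure at two or more sizes")
    per_item = [cost / n for n, cost in measurements]
    per_log = [cost / math.log2(n) for n, cost in measurements]
    # The model whose ratio is closest to constant is the better description.
    if relative_spread(per_item) < relative_spread(per_log):
        return "linear"
    return "logarithmic"


if __name__ == "__main__":
    # Hypothetical load times in seconds at three sizes.
    load_times = [(1_000, 2.0), (10_000, 20.0), (100_000, 200.0)]
    print(guess_scaling(load_times))
```

Three or four sizes spread over a couple of orders of magnitude are usually enough to tell the two patterns apart; real measurements are noisy, so treat the answer as a hint to investigate rather than a conclusion.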

That’s it! With these pro tips, you’ll be in good shape to understand the most important performance concepts. But, if you are still running into problems, don’t hesitate to contact support at support@marklogic.com!

Matt Allen

Matt Allen is a VP of Product Marketing Manager responsible for marketing all the features and benefits of MarkLogic across all verticals. In this role, Matt interfaces with the product and engineering team and with sales and marketing to create content and events that educate and inspire adoption of the technology. Matt is based at MarkLogic headquarters in San Carlos, CA and in his free time he is an artist who specializes in large oil paintings.
