Performance Theory: Tales From MarkLogic Support
This post is a snapshot of the talk that Jason Hunter and Franklin Salonga will give next week at MarkLogic World, also titled “Performance Theory: Tales From The MarkLogic Support Desk.” Jason Hunter is Chief Architect and Frank Salonga is Lead Engineer at MarkLogic. You can follow Jason on Twitter @hunterhacker.
MarkLogic is extremely well-designed, and it’s built from the ground up for speed, yet many of our support cases have to do with performance. Often that’s because people are following historical conventions that no longer apply. Today, there are big-memory systems using a 64-bit address space with lots of CPU cores, holding disks that are insanely fast (but that haven’t grown in speed as much as they have in size*), hooked together by high-bandwidth networks. MarkLogic lives natively in this new reality, and that changes the guidelines you want to follow for finding optimal performance in your database.
The Top 10 (Actually 16) Tips
Here are the top 16 tips for getting optimal performance from MarkLogic, all based on common problems encountered by the support desk (aka performance geniuses):
1. Buy Enough Iron
MarkLogic is optimized for server-grade systems, those just to the left of the hockey-stick price jump. Today that means 16 cores, 128-256 Gigs of RAM, 8-20 TB of disk, 2 disk controllers.
2. Aim for 100KB docs +/- 2 Orders of Magnitude
MarkLogic’s internal algorithms are optimized for documents around 100 KB (remember, in MarkLogic, each document should be one unit of query and should be seen more like relational rows than tables). You can go down to 1 KB but below that the memory/disk/lock overhead per document starts to be troublesome. And, you can go up to 10 MB but above that line the time to read it off disk starts to be noticeable.
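As a rough illustration of the size band described above, here is a minimal Python sketch that classifies a document by its serialized size. The function name `size_band` and the exact byte thresholds (1 KB taken as 1,024 bytes, 10 MB as 10 × 1,024² bytes) are assumptions made for this example; they are not MarkLogic settings.

```python
# Hypothetical helper: the thresholds mirror the 1 KB to 10 MB guideline above.
KB = 1024
MB = 1024 * KB

MIN_BYTES = 1 * KB    # below this, per-document overhead starts to hurt
MAX_BYTES = 10 * MB   # above this, read time off disk becomes noticeable


def size_band(num_bytes: int) -> str:
    """Return 'too-small', 'ok', or 'too-large' for a document size in bytes."""
    if num_bytes < 0:
        raise ValueError("size must be non-negative")
    if num_bytes < MIN_BYTES:
        return "too-small"
    if num_bytes > MAX_BYTES:
        return "too-large"
    return "ok"


if __name__ == "__main__":
    for size in (200, 100 * KB, 50 * MB):
        print(size, size_band(size))
```

Running a check like this over a sample of your content before loading can show whether documents should be split or combined to land nearer the 100 KB sweet spot.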
3. Avoid Fragmentation
Just avoid it, but if you must, then ask support first.
4. Think of MarkLogic Like an Only Child
It’s not a bug to use 100 percent of the CPU—that’s a feature. MarkLogic assumes you want maximum performance given available resources. If you’re using shared resources (a SAN, a virtual machine) you may want to impose restrictions that limit what MarkLogic can use.
5. Six Forests, Six Replicas
Every use case is different, but in general, deployments of MarkLogic 7 are proving optimal with 6 forests on each computer and (if doing High Availability) 6 replicas.
6. Earlier Indexing is Better Indexing
Adding an index after loading requires touching every document with data relating to that index. Turning off an index is instant, but no space will be reclaimed until the re-index occurs. A little thought into index settings before loading will save you time.
7. Filtering: Your Friend or Foe
Indexes isolate candidate documents, then filtering verifies the hits. Filtering lets you get accurate results even without accurate indexes (e.g., a case sensitive query without the case sensitive index). So, watch out, as filtering can hide bad index settings! If you really trust the indexes, you can use “unfiltered.” It is best to perfect your index settings in a small test environment, then apply them to production.
8. Use Meaningful Markup If You Can
If you can use meaningful markup (where the tags describe the content they hold) you get both prettier XML and XML that’s easier to write indexes against.
9. Don’t Try to Outsmart Merging
Contact support if you plan to change any of the advanced merge settings (max size, min size, min ratio, timeout periods). You shouldn’t usually tweak these. If you’re thinking about merge settings, you’re probably underprovisioned (See Recommendation #1).
10. Big Reads Go In Queries, Not Updates
Hurrah! Using MVCC for transaction processing means lock-free reads. But to count as a “read,” your module can’t include any update calls. This is determined by static analysis in advance, so even if the update call is never made, the module still runs as an update. Locks are cheap but they’re not free, and in an update any big search to find the top 10 results will lock the full result set during the sort. Whenever possible, do update calls in a separate nested transaction context using xdmp:invoke() with an option specifying “different-transaction”.
11. Taste Test
Load a bit of data early, so you can get an idea about rates, sizes, and loads. Different index settings will affect performance and sizes. Test at a few sizes because some things scale linearly, some logarithmically.
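One way to judge how a measurement scales is to compare it at the smallest and largest test sizes. The sketch below is only an illustration: the function name `growth_exponent` and the sample numbers are made up, and it assumes cost grows roughly like n to some power k. A result near 1 suggests linear scaling, while a result well below 1 suggests something closer to logarithmic.

```python
import math


def growth_exponent(samples):
    """Estimate k in cost ~ n**k from (n, cost) pairs.

    Uses only the smallest and largest n, so it is a crude estimate.
    """
    pts = sorted(samples)
    (n0, c0), (n1, c1) = pts[0], pts[-1]
    if n0 <= 0 or c0 <= 0 or c1 <= 0 or n1 == n0:
        raise ValueError("need two distinct positive sizes with positive costs")
    return math.log(c1 / c0) / math.log(n1 / n0)


if __name__ == "__main__":
    # Made-up timings: (documents loaded, seconds taken)
    load_times = [(1_000, 2.0), (10_000, 20.0), (100_000, 200.0)]
    print(round(growth_exponent(load_times), 2))
```

With only two or three test sizes this is a sanity check rather than a forecast, which is one more reason to test at several sizes.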
12. Measure
Measure before. Measure after. Measure at all levels. When you know what’s normal, you can isolate when something goes wrong. MarkLogic 7 can internally capture “Monitoring History” to a Meters database. There are also tools such as Cacti, Ganglia, Nagios, Graphite, and others.
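To make "know what’s normal" concrete, here is a minimal Python sketch that flags a new measurement when it strays too far from a recorded baseline. The function name `is_abnormal`, the three-standard-deviation threshold, and the sample values are all assumptions for illustration; the monitoring tools named above have their own alerting mechanisms.

```python
from statistics import mean, stdev


def is_abnormal(baseline, value, threshold=3.0):
    """Return True if value is more than `threshold` sample standard
    deviations away from the mean of the baseline measurements."""
    if len(baseline) < 2:
        raise ValueError("need at least two baseline samples")
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any change at all is a deviation.
        return value != mu
    return abs(value - mu) > threshold * sigma


if __name__ == "__main__":
    # Made-up query latencies in milliseconds, recorded during normal operation.
    normal_latency_ms = [100, 102, 98, 101, 99]
    print(is_abnormal(normal_latency_ms, 103))
    print(is_abnormal(normal_latency_ms, 150))
```

The point is less the statistics than the habit: without the baseline list, there is nothing to compare the new number against.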
13. Keep a Staging Box
A staging box (or cluster) means you can measure changes in isolation (new application code, new indexes, new data models, MarkLogic upgrades, etc.). If you’re running on a cluster, then stage on a cluster (because you’ll see the effects of distribution, like net traffic and 2-phase commits). With AWS it’s easier than ever to “spin up” a cluster to test something.
14. Adjust as Needed
You need to be measuring so you know what is normal and then know what you should adjust. So, what can you adjust?
- Code: Adjusting your code often provides the biggest bang
- Memory sizes: The defaults assume a combo E-node/D-node server
- Indexes: Best in advance, maybe during tasting. Or, try on staging
- Cluster size and forest distribution: This is much easier in MarkLogic 7
15. Follow Our Advice on Swap Space
Our release notes tell you:
- Windows: 2x the physical memory
- Linux: 1x the physical memory (minus any huge pages)
- Solaris: 1x-2x the physical memory
MarkLogic doesn’t intend to leverage swap space! But, for an OS to give memory to MarkLogic, it wants the swap space to exist. Remember, disk is 100x cheaper than RAM, and this helps us use the RAM.
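The swap guidance above can be turned into a quick calculation. The following Python sketch is an illustration only: the function name `recommended_swap_gb` is hypothetical, and it assumes that "1x the physical memory (minus any huge pages)" means physical memory minus the memory reserved for huge pages. Check the release notes for your version before sizing a real system.

```python
def recommended_swap_gb(os_name, physical_gb, huge_pages_gb=0.0):
    """Return a (minimum, maximum) swap size in GB based on the list above.

    Assumption: on Linux, memory reserved for huge pages is subtracted
    from physical memory before applying the 1x rule.
    """
    name = os_name.lower()
    if name == "windows":
        return (2 * physical_gb, 2 * physical_gb)
    if name == "linux":
        swap = max(physical_gb - huge_pages_gb, 0.0)
        return (swap, swap)
    if name == "solaris":
        return (physical_gb, 2 * physical_gb)
    raise ValueError(f"no guidance for OS: {os_name!r}")


if __name__ == "__main__":
    # Example: a 128 GB Linux host with 48 GB reserved for huge pages.
    print(recommended_swap_gb("linux", 128, huge_pages_gb=48))
```

For the 128 GB Linux host in the example, that works out to 80 GB of swap, which is cheap disk compared with the RAM it helps unlock.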
16. Don’t Forget New Features
MarkLogic has plenty of features that help with performance, including MLCP, tiered storage, and semantics. With the MLCP fast-load option, you can perform forest assignments on the client and insert directly into that forest. It’s a really sharp tool, but don’t use it if you’re changing forest topology or assignment policies. With tiered storage, you can use HDFS as cheap mass storage for data that doesn’t need high performance. Remember, you can “partition” data (e.g., based on dates) and let it age to slower disks. With semantics, you have a whole new way to model your data, which in many cases can produce easier-to-optimize queries.
That’s it! With these pro tips, you should be able to handle the most common performance issues. But, if you are still having performance issues, don’t hesitate to contact support at email@example.com!
*With regard to storage, as you add capacity, it is critical that you add throughput in order to maintain a fast system (http://tylermuth.wordpress.com/2011/11/02/a-little-hard-drive-history-and-the-big-data-problem/)