Scaling Memory in the MarkLogic Server
This not-too-technical article covers a number of questions about MarkLogic Server and its use of memory:
- How MarkLogic uses memory;
- Why you might need more memory;
- When you might need more memory;
- How you can add more memory.
Let’s say you have an existing MarkLogic environment that’s running acceptably well. You have made sure that it does not abuse the infrastructure on which it’s running. It meets your SLA (may be expressed as something like “99% of searches return within 2 seconds, with availability of 99.99%”). Several things about your applications have helped achieve this success:
- Your queries run as queries; there are no queries mistakenly running in update mode.
- The result sets from queries are of reasonable size, not “boiling the ocean”.
- Your queries are largely satisfied from indexes, typically able to run unfiltered.
- Updates affect a relatively small number of documents at any one time.
- Plus, your current infrastructure and practices are up to the task.
As such, your application’s performance is largely determined by the number of disk accesses required to satisfy any given query. Most of the processing involved is related to our major data structures:
Fulfilling a query can involve tens, hundreds or even thousands of accesses to these data structures, which reside on disk in files within stand directories. (The triple world especially tends to exhibit the greatest variability and computational burden.)Of course, MarkLogic is designed so that the great majority of these accesses do not need to access the on-disk structures. Instead, the server caches termlists, range indexes, triples, etc. which are kept in RAM in the following places:
- termlists are cached in the List Cache, which is allocated at startup time (according to values found in config files) and managed by MarkLogic Server. When a termlist is needed, the cache is first consulted to see whether the termlist in question is present. If so, no disk access is required. Otherwise, the termlist is read from disk involving files in the stand such as ListIndex and ListData.
- range indexes are held in memory-mapped areas of RAM and managed by the operating system’s virtual memory management system. MarkLogic allocates the space for the in-memory version of the range index, causes the file to be loaded in (either on-demand or via pre-load option), and thereafter treats it as an in-memory array structure. Any re-reading of previously paged-out data is performed transparently by the OS. Needless to say, this last activity slows down the operation of the server and should be kept to a minimum.
One key notion to keep in mind is that the in-memory operations (the “hit” cases above) operate at speeds of about a microsecond or so of computation. The go-to-disk penalty (the “miss” cases) cost at least one disk access which takes a handful of milliseconds plus even more computation than a hit case. This represents a difference in the order of 10,000 times slower. Nonetheless, you are running acceptably. Your business is succeeding and growing. However, there are a number of forces stealthily working against your enterprise continuing in this happy state.
- Your database is getting larger (more and perhaps larger documents).
- More users are accessing your applications.
- Your applications are gaining new or expanded capabilities.
- Your software is being updated on a regular basis.
- You are thinking about new operational procedures (e.g. encryption).
In the best of all worlds, you have been measuring your system diligently and can sense when your response time is starting to degrade. In the worst of all worlds, you perform some kind of operational/application/server/operating system upgrade and performance falls off a cliff. Let’s look under the hood and see how pressure is building on your infrastructure. Specifically, let’s look at the consumption of memory and the effectiveness of the key caching structures in the server. Recall that the response time of a MarkLogic application is driven predominantly by how many disk operations are needed to complete a query. This, in turn, is driven by how many termlist and range index requests are initiated by the application through MarkLogic Server and how many of those do not “hit” in the List Cache and in-memory Range Indexes. Each one of those “misses” generates disk activity, as well as a significant amount of additional computation. All the forces listed above contribute to decreasing cache efficiency, in large part because they all use more RAM. A fixed size cache can hold only a fraction of the on-disk structure that it attempts to optimize. If the on-disk size keeps growing (a good thing, right?) then the existing cache will be less effective at satisfying requests. If more users are accessing the system, they will ask in total for a wider range of data. As applications are enriched, new on-disk structures will be needed (additional range indexes, additional index types, etc.) And when did any software upgrade use LESS memory? There’s a caching concept from the early days of modern computing (the Sixties, before many of you were born) called “folding ratio”. You take the total size of a data structure and divide it by the size of the “cache” that sits in front of it. This yields a dimensionless number that serves as a rough indicator of cache efficiency (and lets you track changes to it). A way to compute this for your environment is to take the total on-disk size of your database and divide it by the total amount of RAM in your cluster. Let’s say each of your nodes has 128GB of RAM and 10 disks of about 1TB each that are about half full. So, the folding ratio of each node of (the shared-nothing approach of MarkLogic allows us to consider each node individually) this configuration at this moment is (10 x 1TB x 50%) / 128GB or about 40 to 1. This number by itself is neither good nor bad. It’s just a way to track changes in load. As the ratio gets larger, the cache hit ratio will decrease (or, more to the point, the cache miss ratio will increase) and response time will grow. Remember, the difference between a hit ratio of 98% versus a hit ratio of 92% (both seem pretty good, you say) is a factor of four in resulting disk accesses! That’s because one is a 2% miss ratio and the other is an 8% miss ratio. Consider the guidelines that MarkLogic provides regarding provisioning: 2 VCPUs and 8GB RAM to support a primary forest that is being updated and queried. The maximum recommended size of a single forest is about 400 GB, so the folding ratio of such a forest is 400GB / 8GB or about 50 to 1. This suggests that the configuration outlined a couple of paragraphs back is at about 80% of capacity. It would be time to think about growing RAM before too long. What will happen if you delay? Since MarkLogic is a shared-nothing architecture, the caches on any given node will behave independently from those on the other nodes. Each node will therefore exhibit its own measure of cache efficiency. Since a distributed system operates at the speed of its slowest component, it is likely that the node with the most misses will govern the response time of the cluster as a whole. At some point, response time degradation will become noticeable and it will become time to remedy the situation. The miss ratios on your List Cache and your page-in rate for your Range Indexes will grow to the point at which your SLA might no longer be met. Many installations get surprised by the rapidity of this degradation. But recall, the various forces mentioned above are all happening in parallel, and their effect is compounding. The load on your caches will grow more than linearly over time. So be vigilant and measure, measure, and measure! In the best of all possible worlds, you have a test system that mirrors your production environment that exhibits this behavior in advance of production. One approach is to experiment with reducing the memory on the test system by, say, configuring VMs for a given memory size (and adjusting huge pages and cache sizes proportionately) to see where things degrade unacceptably. You could measure:
- Response time: where does it degrade by 2x, say?
- List cache miss ratio: at what point does it double, say?
- Page-in rate: at what point does increase by 2x, say?
When you find the memory size at which things degraded unacceptably, use that to project the largest folding ratio that your workload can tolerate. Or you can be a bit clever and do the same additional calculations for ListCache and Anonymous memory:
- Compute the sum of the sizes of all ListIndex + ListData files in all stands and divide by the size of ListCache. This gives the folding ratio for this host of the termlist world.
- Similarly, compute the sum of the sizes of all RangeIndex files and divide by the size of anonymous memory. This gives the folding ratio for the range index world on this host. This is where encryption can bite you. At least for a period of time, both the encrypted and the unencrypted versions of a range index must be present in memory. This effectively doubles your folding ratio and can send you over the edge in a hurry. [Note: depending on your application, there may be additional in-memory derivatives of range indexes built to optimize for facets, and sorting of results, … all taking up additional RAM.]
[To be fair, on occasion a resource other than RAM can become oversubscribed (beyond the scope of this discussion):
- IOPs and I/O bandwidth (both at the host and storage level);
- Disk capacity (too full leads to slowness on some storage devices, or to inability to merge);
- Wide-area network bandwidth/latency/consistency (causes DR to push back and stall primary);
- CPU saturation (this is rare for traditional search-style applications, but shows up more in the world of SQL, SPARQL and Optic, often accompanied by memory pressure due to very large Join Tables. Check your query plans!);
- Intra-cluster network bandwidth (both at the host and switch/backbone layer, also rare)].
Alternatively, you may know you need to add RAM because you have an emergency on your hands: you observe that MarkLogic is issuing Low Memory warnings, you have evidence of heavy swap usage, your performance is often abysmal, and/or the operating system’s OOM (out of memory) killer is often taking down your MarkLogic instance. It is important to pay attention to the warnings that MarkLogic issues, above and beyond any that come from the OS. You need to tune your queries so as to avoid bad practices (see the discussion at the beginning of this article) that waste memory and other resources, and almost certainly add RAM to your installation. The tuning exercise can be labor-intensive and time-consuming; it is often best to throw lots of RAM at the problem to get past the emergency at hand. So, how to add more RAM to your cluster? There are three distinct techniques:
- Scale vertically: Just add more RAM to the hosts you already have.
- Scale horizontally: Add more nodes to your cluster and re-distribute the data
- Scale functionally: Convert your existing e/d-nodes into d-nodes and add new e-nodes
Each of these options has its pros and cons.
- Granularity: Say you want to increase RAM by 20%. Is there an option to do just this?
- Scope: Do you upgrade all nodes? Upgrade some nodes? Add additional nodes?
- Cost: Will there be unanticipated costs beyond just adding RAM (or nodes)?
- Operational impact: What downtime is needed? Will you need to re-balance?
- Timeliness: How can you get back to acceptable operation as quickly as possible?
Option 1: Scale Vertically
On the surface, this is the simplest way to go. Adding more RAM to each node requires upgrading all nodes. If you already have separate e- and d-nodes, then it is likely that just the d-nodes should get the increased RAM. In an on-prem (or, more properly, non-cloud) environment this is a bunch of procurement and IT work. In the worst case, your RAM is already maxed out so scaling vertically is not an option. In a cloud deployment, the cloud provider dictates what options you have. Adding RAM may drag along additional CPUs to all nodes also, which requires added MarkLogic licenses as well as larger payment to the cloud provider. The increased RAM tends to come in big chunks (only 1.5x or 2x options). It’s generally not easy to get just the 20% more RAM (say) that you want. But this may be premature cost optimization; it may be best just to add heaps of RAM, stabilize the situation, and then scale RAM back as feasible. Once you are past the emergency, you should begin to implement longer-term strategies. This approach also does not add any network bandwidth, storage bandwidth, and capacity in most cases, and runs the small risk of just moving the bottleneck away from RAM and onto something else.
Option 2: Scale Horizontally
This approach adds whole nodes to the existing complex. It has the net effect of adding RAM, CPU, bandwidth, and capacity. It requires added licenses, and payment to the cloud provider (or a capital procurement if on-prem). The granularity of expansion can be controlled; if you have an existing cluster of (2n+1) nodes, the smallest increment that makes sense in an HA context is 2 more nodes (to preserve quorum determination) giving (2n+3) nodes. In order to make use of the RAM in the new nodes, rebalancing will be required. When the rebalancing is complete, the new RAM will be utilized. This option tends to be optimal in terms of granularity, especially in already larger clusters. To add 20% of aggregate RAM to a 25-node cluster, you would add 6 nodes to make a 31-node cluster (maintaining the odd number of nodes for HA). You would be adding 24%, which is better than having to add 50% if you had to scale all 25 nodes by 50% because that was what your cloud provider offered.
Option 3: Scale Functionally
Scaling functionally means adding new nodes as e-nodes to cluster and reconfiguring existing e/d-nodes to be d-nodes. This frees up RAM on the d-side (specifically by dramatically reducing the need for Expanded Tree Cache and memory for query evaluation) which will go towards restoring a good folding ratio. Recent experience says about 15% of RAM could be affected in this manner. More licenses are again required, plus installation and admin work to reconfigure the cluster. You need to make sure that the network can handle increases in XDMP traffic from e-nodes to d-nodes, but this is not typically a problem. The resulting cluster tends to run more predictably. One of our largest production clusters typically runs its d-nodes at nearly 95% memory usage as reported by MarkLogic as the first number in an error log line. It can get away with running so full because it is a classical search application whose d-node RAM usage does not fluctuate much. Memory usage on e-nodes is a different story, especially when the application uses SQL or Optic. In such a situation, on-demand allocation of large Join Tables can cause an abrupt increase in memory usage. That’s why our advice on combined e/d nodes is to run below 80% to allow for query processing. Thereafter, the two groups of nodes can be scaled independently depending on how the workload evolves. Here are a few key takeaways from this discussion:
- Measure performance when it is acceptable, not just when it is poor.
- Do whatever it takes to stabilize in an emergency situation.
- Correlate metrics with acceptable/marginal performance to determine a usable folding ratio.
- If you have to make a guess, try to achieve no worse than a 50:1 ratio and go from there.
- Measure and project the growth rate of your database.
- Figure out how much RAM needs to be added to accommodate projected growth.
- Test this hypothesis if you can on your performance cluster.