I get asked this question at every conference and Meetup I attend, by different people in different variations — how is MarkLogic better? What does MarkLogic do differently? And so forth. I hope to answer everyone’s questions in one place, so welcome to the [un]official MarkLogic vs MongoDB showdown!
In order to do such a comparison, we need to look at a few different perspectives, ranging from technical to business considerations. We will review some significant technical differences between the two databases, and you will come to find out that technical differences can heavily impact business-level decisions.
Have you ever heard someone say “if you’re using NoSQL you can’t be using ACID”? Now you can tell them that you know one database that is of the “NoSQL kind” and has full support for Atomicity, Consistency, Isolation, and Durability, better known as ACID.
ACID defines a set of properties that guarantee reliability in the world of database transactions. Atomicity guarantees that if you have multiple statements in a transaction, either all of them succeed or none of them take effect (i.e., if any one of them fails, the whole transaction fails and is rolled back). Consistency represents the guarantee that the database will go from one valid state into another when a transaction commits. Isolation ensures that if you execute your transactions concurrently, those transactions are unaware of each other and the result is the same as if they had been executed serially. Finally, durability means that if a transaction commits (i.e., saves data to the database, does an update or deletes something) those changes are going to persist, even in the case of a system failure.
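To make atomicity concrete, here is a minimal sketch using SQLite from Python's standard library as a stand-in (it is not MarkLogic, and the `accounts` table and `transfer` helper are made up for illustration). The second transfer fails halfway through, and the half that already ran is rolled back.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Credit dst, then debit src; either both updates commit or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected a negative balance.
        return False

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
with conn:
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
    )

ok = transfer(conn, "alice", "bob", 30)       # both statements succeed
failed = transfer(conn, "bob", "alice", 500)  # credit runs, debit violates CHECK
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

After the failed transfer, the credit to alice is gone as well: the balances are exactly what the first, successful transfer left behind.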
There are two ways to implement ACID capabilities: locking or multiversioning. MarkLogic implements Multi-Version Concurrency Control (MVCC), which means that you can read a document without acquiring locks (that is, if a document is being written to, you can still read that document, and reading a document does not block writing).
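The idea behind multiversioning can be sketched in a few lines of Python. This is a toy model, not how MarkLogic implements it internally: the `MVCCStore` class and its method names are invented for illustration. Writers append a new timestamped version instead of overwriting, and a reader keeps seeing the version that was current when it started.

```python
class MVCCStore:
    """Toy multi-version store: writers append versions, readers never block."""

    def __init__(self):
        self._versions = {}  # key -> list of (commit_timestamp, value), ascending
        self._clock = 0      # timestamp of the latest commit

    def snapshot(self):
        """Return the timestamp a reader should use for a consistent view."""
        return self._clock

    def write(self, key, value):
        """Commit a new version; older versions stay readable."""
        self._clock += 1
        self._versions.setdefault(key, []).append((self._clock, value))
        return self._clock

    def read(self, key, as_of):
        """Return the newest version committed at or before `as_of`."""
        for timestamp, value in reversed(self._versions.get(key, [])):
            if timestamp <= as_of:
                return value
        return None

store = MVCCStore()
store.write("doc1", "v1")
reader_ts = store.snapshot()   # a reader starts here
store.write("doc1", "v2")      # a writer commits a newer version meanwhile

old_view = store.read("doc1", reader_ts)         # still the reader's snapshot
new_view = store.read("doc1", store.snapshot())  # a fresh reader sees the update
```

Neither side waits for the other: the write never touches the version the first reader is looking at.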
MarkLogic has full ACID support; however, MongoDB has what is being referred to as “eventual consistency”. But what does that mean? Imagine you have a cluster of MongoDB servers, and you do an update to one of your documents. Without ACID guarantees, MongoDB’s default settings do not guarantee that an update is durable before acknowledging that it was made, and it does not guarantee that an update will be replicated to a majority of servers before a read can take place.
In essence, there’s the potential to end up with inconsistent or stale data, even with the strictest consistency settings (where all reads occur from a single primary server using majority read concern and all writes use majority write concern). For example, let’s say that Update A is recorded to your document on the primary server, but the primary server fails before replicating that update to a majority of nodes. From the client’s perspective, the write has failed; in reality, Update A may have made it to the secondaries and could be available. However, the client would have no way of knowing if their write actually succeeded or not.
Sharding and clusters are important in the discussion of how the two databases scale out. Ideally, an enterprise-grade NoSQL database should be able to scale by just adding new servers to the cluster. It should also keep performing as the cluster grows, no matter the size, which MarkLogic already does. Even with an environment that has a mix of cloud and physical server deployments, you can still have high availability.
MongoDB, on the other hand, requires you to provision all the hardware for a highly available cluster. When setting up such a cluster you also need to decide on a sharding key. (A database shard is a horizontal partition of data.) Essentially, you specify a sharding key, and depending on this key, your data will be written to different servers in a distributed manner so that you can also optimize your read and write loads.
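A shard key's role can be illustrated with a small hash-based routing sketch. This is a generic illustration, not MongoDB's actual chunk-placement logic: the `shard_for` and `distribute` helpers and the sample documents are assumptions made for the example.

```python
import hashlib

def shard_for(shard_key_value, shard_count):
    """Map a shard key value to a shard number using a stable hash."""
    digest = hashlib.sha256(str(shard_key_value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count

def distribute(documents, shard_key, shard_count):
    """Place each document on the shard chosen by its shard key value."""
    shards = [[] for _ in range(shard_count)]
    for doc in documents:
        shards[shard_for(doc[shard_key], shard_count)].append(doc)
    return shards

customers = [
    {"customer_id": 1, "country": "UK"},
    {"customer_id": 2, "country": "UK"},
    {"customer_id": 3, "country": "HU"},
    {"customer_id": 4, "country": "US"},
]

by_id = distribute(customers, "customer_id", 3)
by_country = distribute(customers, "country", 3)
```

Because placement is derived from the key, picking a different key (here `country` instead of `customer_id`) means recomputing where every document lives, which is why the choice is so hard to revisit later.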
The issue is that once you have a shard key set up, you can’t change it. The only way to change it is by following a five-step process, starting with dumping all your data from MongoDB into an external format. Functionality also changes when MongoDB is sharded. Sharding breaks several critical features in MongoDB, including point-in-time recovery for a production system, in-document isolation, and several performance enhancement options (for instance, certain secondary indexes and operations). Users are instructed to anticipate this loss of functionality by ensuring their code and practices never use the features.
MarkLogic, however, provides its customers with auto-scaling tools designed to keep performance levels stable regardless of whether there are dozens of 3-node AWS clusters or a single fifty-node physical cluster. Because MarkLogic can be an application server, database, and search engine all at once, topologies are smaller, simpler, and easier to administer.
When you load a document into MarkLogic, the system automatically indexes the word tokens of that document (as well as the structure of the document). This index, often referred to as the Universal Index, gives you search out of the box. Of course, you can add additional indexes to your database, such as term list indexes that help you answer ‘yes or no’ type questions and return results based on relevancy. You can also enable term list indexes that support wildcard searches, along with a lot of other options. However, a term list index cannot answer inequality type questions, such as ‘show me all the documents where the price is < £25’. For these you need to enable range indexes. Range indexes can be added against XML elements, attributes or JSON properties. Furthermore, in MarkLogic you can define geospatial indexes as well as triple indexes (yes, MarkLogic is also a triplestore – this may be another key feature for those of you who use semantics or have standard documents as well as RDF triples).
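The difference between a term list and a range index is easier to see in a toy model. The sketch below is plain Python, not MarkLogic's index format; the document URIs, the `contains_word` and `price_below` helpers and the sample data are made up for illustration. A term list maps each word to the documents containing it, while a range index keeps values in sorted order so that inequalities become a simple cut.

```python
import bisect
from collections import defaultdict

docs = {
    "a.json": {"title": "red wine glass", "price": 12},
    "b.json": {"title": "crystal wine decanter", "price": 40},
    "c.json": {"title": "red table cloth", "price": 25},
}

# Term list: word -> set of document URIs that contain the word.
term_list = defaultdict(set)
for uri, doc in docs.items():
    for word in doc["title"].split():
        term_list[word].add(uri)

# Range index: (value, uri) pairs kept in sorted order.
price_index = sorted((doc["price"], uri) for uri, doc in docs.items())

def contains_word(word):
    """'Yes or no' question: which documents contain this word?"""
    return term_list.get(word, set())

def price_below(limit):
    """Inequality question: which documents have price < limit?"""
    # (limit,) sorts before every (limit, uri), so this finds the first
    # entry whose price is >= limit.
    cut = bisect.bisect_left(price_index, (limit,))
    return {uri for _, uri in price_index[:cut]}

red_docs = contains_word("red")
cheap_docs = price_below(25)
cheap_wine = contains_word("wine") & price_below(25)
```

The term list can only say whether a word is present; it has no notion of order, which is exactly what the sorted range index adds.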
By contrast, MongoDB indexes a single attribute at a time, and your queries would only be able to use two indexes at a time. More complex queries require you to have what is referred to as a compound index. Simply put, a compound index is a structure that holds references to multiple fields, and it should be built for frequently run queries. However, a major drawback is that queries using compound indexes have to respect the order of the index, which restricts how you can sort. Imagine having to sort on a compound index with three different keys. If you wanted to sort on two of those keys and reverse sort on another, you’d need to build another index. If you want to sort those keys in a different order, you’d need a third compound index. Interestingly enough, the advice from MongoDB’s tech support is to build compound indexes rather than query on two indexes.
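The sort restriction follows from how any ordered index works, and a small Python model shows it. This is a generic sketch rather than MongoDB code: the `index_can_serve` helper and the sample rows are invented, and the only assumption is that an ordered index can be scanned forwards or backwards.

```python
def index_can_serve(index_entries, wanted_order):
    """An ordered index can be walked forwards or backwards, nothing else."""
    return wanted_order == index_entries or wanted_order == index_entries[::-1]

rows = [("soup", 2), ("soup", 1), ("stew", 3), ("stew", 1)]

# Compound index on (name ascending, rating ascending).
compound_index = sorted(rows)

asc_asc = sorted(rows, key=lambda r: (r[0], r[1]))
desc_desc = sorted(rows, key=lambda r: (r[0], r[1]), reverse=True)
asc_desc = sorted(rows, key=lambda r: (r[0], -r[1]))  # mixed directions

served = {
    "name asc, rating asc": index_can_serve(compound_index, asc_asc),
    "name desc, rating desc": index_can_serve(compound_index, desc_desc),
    "name asc, rating desc": index_can_serve(compound_index, asc_desc),
}
```

The same index covers a sort and its exact reverse, but the mixed-direction sort matches neither scan, so it needs an index of its own.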
Storing relational data is a relatively straight-forward concept, as you simply store your data in tables. In the world of NoSQL, however, there needs to be another way to easily store and organize documents, especially with document-based NoSQL solutions. To do this, documents in a NoSQL database are added to collections, which act as a category label (or, a practical way of grouping your documents together).
MongoDB only allows you to store a document in a single collection. I personally find this to be extremely limiting; what if I’d like to have my document in two or even more collections? For example, let’s say you have a recipe for a lovely vegetarian dish. You could put that into a collection and name it ‘recipes’. Wouldn’t it be nice to also have this document be part of some other collections like ‘vegetarian’ and, say, ‘asian’ without actually having to update the document content? Collections in MarkLogic allow you to “slice and dice” your data as you see fit. Now I can query my data and retrieve all recipes, or only those that are in the vegetarian collection, and if I’m organizing a dinner for my friends and they all love Asian food, I can query for that collection.
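The recipe example can be modeled as a small Python sketch. It only illustrates the idea of collections as labels kept outside the document content; the `CollectionStore` class, its methods and the URIs are hypothetical and are not MarkLogic's API.

```python
from collections import defaultdict

class CollectionStore:
    """Toy store where collection membership lives outside the document."""

    def __init__(self):
        self.documents = {}              # uri -> document content
        self.members = defaultdict(set)  # collection name -> set of uris

    def insert(self, uri, content, collections=()):
        self.documents[uri] = content
        for name in collections:
            self.members[name].add(uri)

    def add_to_collection(self, uri, name):
        """Label an existing document without touching its content."""
        self.members[name].add(uri)

    def in_collections(self, *names):
        """Return the URIs that belong to every named collection."""
        sets = [self.members.get(name, set()) for name in names]
        return set.intersection(*sets) if sets else set()

store = CollectionStore()
store.insert("/recipes/pad-thai.json", {"title": "Tofu Pad Thai"},
             collections=["recipes", "vegetarian"])
store.insert("/recipes/goulash.json", {"title": "Beef Goulash"},
             collections=["recipes"])

before = dict(store.documents["/recipes/pad-thai.json"])
store.add_to_collection("/recipes/pad-thai.json", "asian")

all_recipes = store.in_collections("recipes")
veggie_asian = store.in_collections("vegetarian", "asian")
```

Adding the ‘asian’ label changes which queries find the recipe, while the recipe document itself stays exactly as it was.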
Search in MarkLogic has always been part of the core of the system; it wasn’t built around an existing solution. When you execute a query in MarkLogic, you’ll see your documents returned with what is called ‘relevancy ranking’, which displays the most relevant document first. There’s a complex algorithm that determines how relevant a document is for your query based on the entire document set.
There are multiple ways that you can utilize search to affect the relevancy of your documents, giving you the opportunity to have a fine level of control over what the final search result set looks like. MarkLogic’s search features give you a great set of functionality that will help you to build search applications. Would you like to work with geospatial data? You can! What about snippets? Facets? Highlighting of search results? Type ahead features? Multi-language support? Stemming? Yes yes yes and yes.
MongoDB’s query interface works like a standard database where statement; for full-text search you would need to deploy a third-party solution. As a result, you now have a more complex architecture requiring more skills to staff, on top of having to wait for the external index to catch up every time a change is made. The documents that you would see returned would not be ranked by relevancy; instead, they are returned in document order, and you can sort them based on a property in your documents (e.g., find all the documents where the name is ‘John’ and sort them by ‘lastname’). MongoDB also supports text search via text indexes and $text, but it has limited capabilities and questionable scalability.
MarkLogic has been Common Criteria certified since MarkLogic 8. Common Criteria is the most widely recognized security certification for IT products, mutually recognized by 25 countries, including the United States, Canada, India, Japan, Australia, Malaysia, and many countries in the EU. It is also a certification that customers specifically request. MarkLogic is one of only six DBMS vendors to receive this certification, and it is the only NoSQL company in this elite group. With data breaches being common in the news, concerns over security and privacy of data are at an all-time high. Customers need technologies that can help them protect their sensitive data. As an enterprise NoSQL database, MarkLogic has had security built in from day one, and we continuously invest in market-leading security features, standards support, and certifications so that customers are assured there is no better place for their mission-critical data.
Element-Level Security (ELS) is a concept that allows developers and administrators to hide parts of their documents – for example, we could specify that a JSON property in documents is visible (and searchable) only by users who have a certain role. Combining ELS with document level security (where we get to control whether a user with a certain role has access to the entire document) gives MarkLogic a flexible way of handling security of documents within the database. ELS is flexible because of the number of ways to hide data from certain users, and, once the data is hidden, it’s also hidden from searches (e.g., a search will not return a document that contains a sensitive piece of information).
You will often run into situations where you need to give data to another department, whether for testing, for staging, or for a team of data scientists. The data has to be real, but it cannot contain Personally Identifiable Information (PII), which rules out handing over data that contains information like social security numbers. MarkLogic overcomes these restrictions through redaction. Redaction works on exports and allows developers to set up rules and apply concealment or masking to values found in XML elements / attributes and JSON properties. There are multiple built-in functions that can be leveraged and there’s also room for custom functions.
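As a rough illustration of rule-based redaction, here is a Python sketch that applies masking and concealment to a copy of a document on its way out. It is not MarkLogic's redaction API: the `redact` function, the `DROP` sentinel, the rule table and the sample record are all assumptions made for the example.

```python
import re

DROP = object()  # sentinel: remove the property from the export entirely

def mask_ssn(value):
    """Masking: replace every digit except those in the last four characters."""
    return re.sub(r"\d", "#", value[:-4]) + value[-4:]

def conceal(_value):
    """Concealment: the property does not appear in the export at all."""
    return DROP

# Rules keyed by JSON property name.
RULES = {"ssn": mask_ssn, "email": conceal}

def redact(doc, rules):
    """Return a redacted copy of `doc`; the stored original is untouched."""
    exported = {}
    for key, value in doc.items():
        if isinstance(value, dict):
            exported[key] = redact(value, rules)
        elif key in rules:
            replacement = rules[key](value)
            if replacement is not DROP:
                exported[key] = replacement
        else:
            exported[key] = value
    return exported

record = {
    "name": "John",
    "ssn": "123-45-6789",
    "contact": {"email": "john@example.com", "city": "London"},
}

export = redact(record, RULES)
```

The rules run only when the data is exported, so the stored document keeps its real values while the copy that leaves the database does not.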
Encryption at rest
Encryption at rest refers to encrypting our data on the hard disk, protecting against situations where an attacker gains direct access to the servers and tries to copy data off the storage device (e.g., a “Snowden” attack). Encryption at rest can be used to encrypt data, log files and configuration files separately. This feature is enabled by a key management system and a keystore; by default, MarkLogic uses an embedded PKCS #11 secured wallet.
In the beginning of this post, I mentioned that a good technical understanding of a system can lead to business level decisions. I believe if you’ve read all the previous points, it’s clear why a business would require a database that has ACID support and allows you to search through your database looking for the right documents while maintaining a strong security model.
If you’re interested in MarkLogic, please download the latest version of the product, which is available for multiple operating systems and comes with a free developer license.
This article first appeared as a post on Tamas’ blog.