We believe we’re in the midst of a generational shift in the database industry. Change is difficult, though, and there is always resistance. With more than four decades in the market, relational databases are the most popular choice for running the world’s most important systems.
Then along came “NoSQL.” How can these newer NoSQL databases do what they claim to do? MarkLogic asserts it can speed up large-scale data integration projects by 4x. They claim to handle all of your data—not just the structured data—and that they can handle change faster and to scale easier.
It’s easy to question these claims—and doubts usually fall into two categories:
- Doubting there is a problem with relational databases…
- Doubting that NoSQL can really solve the problem…
In this two-part series, I’m going to debunk those doubts so you can better understand the challenges with relational databases. Let’s start with the questions that come up when doubting there is a problem with relational databases.
Doubting There Is a Problem With Relational
In a series of other blog posts, I explained at a high level that relational databases are not designed for change, for scale, for heterogeneous data, for mixed workloads, or for modern app development. A lot of people commented on those posts with many great questions. Rather than address each question one by one, I thought it would be better write this post to discuss the common objections that were brought up, and provide some answers that I hope will help (or at least generate more healthy discussion).
The real problem that causes projects to go more slowly or never finish is that they require you to model your data into a single schema—before you load any data. Yes, the schema can change, but you’re still locked into having that one über schema. That is extremely problematic when you get new data sources or want to ask new questions that you haven’t planned or modeled for.
Yes, an incremental change like adding a table or changing the data type for a column may be relatively easy. I admit to being hyperbolic when I said that it takes a million dollars to change a column (though that truly was for one of our large customers).
But what about other types of changes that are more complex?
One example of a more complex change is a fundamental “nature of the customer” question that deals with how disparate data changes over time. For example, what if you need to simultaneously query what your data about your customers, orders, and stores looked like last month and what the data looks like today. In a relational database, that information may be spread across multiple tables in multiple silos, and there may be multiple versions that represent different snapshots at different points in time. In a relational database, conceptual data modeling changes often break things in the database and through the middleware and application stack, so capturing changes and asking hard questions becomes impossible.
It’s not just about one change, though. Your data is constantly in flux and changes are being made all the time as new data comes in and new data goes out. In that environment, how would you ever build a single, unified conceptual and physical model with a static relational database? You can’t just shut down the database if it’s your central operational data hub that runs the business. You need to (cliché alert) swap the wings of the airplane in-flight and such a feat is not only expensive but usually impossible.
This may not be a huge problem for every use case and every organization, but it certainly is for many of the organizations we work with. These are the types of problems that take two to five years and millions of dollars to resolve. For example, one of the large banks we worked with needed to meet regulatory hurdles imposed on them. They also had dozens of different front office trading systems, all of which used different, changing data models. They spent over two years trying to build their über schema in Oracle and finally gave up. With MarkLogic, they put a new system into production in six months. It wasn’t because they wanted to “worry about the details later.”
The system is running their bank. The bank chose MarkLogic because they wanted to use an iterative, flexible approach to modeling their data. And, they wanted that data to be useful for many applications, not just one.
This is not an isolated example—organizations regularly see projects being completed 4x faster than with relational. In one instance at a Fortune 500 insurance company, the data modeling phase went 15x faster than with relational.
Gone are the days of single app databases. As MarkLogic product manager Justin Makeig says, “Applications are ephemeral—data is forever.”
This is the same scenario that has played out in every vertical where data integration is a huge challenge. If you have some time, you can watch the presentation from Mike Fillion at Aetna that goes into more depth about data modeling challenges with relational. The story is the same: Data modeling with relational was taking too long, and then they used MarkLogic and finished their project. With flexible data modeling, you avoid the risk of building the wrong thing and you have a better platform for making changes later.
If you’re interested in more on this topic, I discuss how MarkLogic makes data modeling easier and faster in another blog post focused on ETL.
Relational databases were built to store structured data in rows and columns, not document data like JSON and XML. If you keep up with database news, however, you may have heard the claims that relational vendors now support JSON.
When a relational database company says they support JSON, what they really mean is that you can put a JSON document into a column that is specified to accept JSON documents. The same goes for relational databases that claim to support XML.
Putting a JSON document into a column in a relational database doesn’t make it a document database. With that approach, the JSON is just a blob and the document model is stripped of its value. The document is not fully indexed and it cannot be searched, updated, or enriched in the same way it would be in a true document database. You lose all the richness and flexibility that made the document model desirable in the first place. You lose the value of the relational database too—no table and column statistics, no indexing features and scan types, and you are limited to only very simple operations.
If a relational database could store JSON or XML data in the same way as a document database, MarkLogic, MongoDB, and Couchbase probably wouldn’t be around. Oracle probably wouldn’t have acquired the company Sleepycat to get BerkeleyDB and develop their own Oracle NoSQL database, IBM wouldn’t have acquired Cloudant (i.e. CouchDB document database), and Microsoft wouldn’t have developed DocumentDB.
So what about relational databases that have graph support for RDF triples? A relational database seems like a great tool for storing the simple structure of RDF—you would just use columns for the subject, predicate, and object. Unfortunately, this would mean hundreds of recursive joins for triple queries. Just imagine the ugly performance of a single table that contains billions of rows where every update is executed on that table and its three indexes. Again, there’s a reason that Oracle has a separate triple index that is licensed as a separate product.
For these reasons, it’s accurate to say that relational databases really are not really designed to store all of your data. Yes, you may be able to stuff it all in there with enough time and money, but in practice that usually doesn’t happen. It’s much easier and better to pick a tool that’s built to store different types of data. That idea is now encapsulated in the idea of polyglot persistence and is what has led to the growing adoption of multi-model databases that is now happening.
In my earlier post, I made the point that relational databases are not designed for scale. There’s no shortage of other analysts and users that make the same point. A former Oracle DBA, Robert Schumacher, summed it up nicely in a talk he gave at a financial analyst conference in 2015:
It’s not that relational databases cannot scale. It’s just that they were not originally designed to scale and despite the many improvements, relational databases are still ill-suited for massive scaling.
In a post on Infoworld titled, “How do I freaking scale Oracle?” a developer with over 20-years of Oracle experience explains how to scale Oracle. It requires the following:
- Correct sharding algorithm designed by the developer
- Fault tolerance against server outages, storage outages, and data center outages (aka “disaster”)
- A way to balance load against servers and/or CPU cores as well as disks
To do all of that, you need to buy more products. One product, Oracle RAC, uses a shared disk architecture and addresses the challenge of load balancing and high availability. But, there’s still the problem of underlying storage as a single point of failure. For that, you need other products to do storage replication (SRDF) or virtualization (VPLEX). There are also other products to help with scaling such as GoldenGate and Data Guard that are very similar products designed for transaction replication.
Finally, you have to partition, or shard, your data across distributed nodes. For Oracle, partitioning allows a table, or index to be subdivided into smaller pieces or partitions that each have their own names and storage characteristics. This isn’t too different from many other NoSQL databases where you split up data among nodes based on a shard key. But, there are multiple approaches to how you partition data in Oracle. You can use TimesTen or a less-supported product, Coherence, to leverage in-memory caching to take some load off the database.
Sound complicated? It is.
Oracle knows their scalability was lacking, which is why they launched new support for automatic sharding in Oracle Database 12c, Release 2. Oracle spokesman Andrew Mendelsohn said that automatic sharding is a “really key component for any database as you move up to cloud scale… It gives you highly reliable, highly scalable infrastructure” (Forbes, “Oracle’s Latest Database: Right Technology, Right Time”, September 20, 2016).
Welcome to the scale-out club. MarkLogic has had automatic sharding for years and we’re happy to see relational vendors coming around to our product advantages. The market has spoken and Oracle and other relational vendors are now taking a page out of the NoSQL handbook. As the CTO at J.P. Morgan said in an interview, “We needed to get away from relational… MarkLogic offered us horizontal scalability, the ability to just add more and not have to do a big infrastructure replacement.”
Click here to read the next post in this two-part series about how NoSQL solves the problem, and the doubts people often have about whether it really can.
If you have more questions about the topics above, here are some additional recommended resources:
- Whitepaper, Rethink Data Modeling
- Webinar, What Is a Multi-model Database: Two Paths to Engineering
- Presentation, Approaching a Traditional Join Problem at Scale