When to Denormalize
When building an application with MarkLogic, we need to figure out how to model our data. We generally start by identifying the relevant entities, which leads to a common question: what should I denormalize from one entity type into another? Let’s illustrate with an example.
Suppose my database has books and authors. In the relational world, I might have a schema that looks like this:
Sample relational schema
I’ve left out some details, such as the publisher and address tables, but this is enough to explore with.
To bring this data into MarkLogic, we see that books and authors are our primary entities (of the tables shown). What goes into each type of document? Some of that is easy: a book will have a title, copyright, and ISBN. An author will have the person’s name and date of birth.
<book> <title>The Light of Other Days</title> <copyright>2000</copyright> <isbn>0-812-57640-3</isbn> </book> <author> <first-name>Arthur</first-name> <middle-initial>C</middle-initial> <last-name>Clark</last-name> <dob>1917-12-16</dob> </author>
For other tables, we need to decide how to represent their information: create new documents for them? denormalize them into the documents we already have? represent them as semantic triples?
Let’s start with an easy one: genre. This is a simple lookup table, where a book holds a genre_id, which points to the string that names that genre. In fact, the value could have been inserted into the book table in the first place, but it was moved to its own table to ensure the string value occurred in just one place.
There are two simple questions we can ask when we consider whether to denormalize a piece of information into entity documents (for instance, our book documents), versus looking it up in some other place (such as having genre documents).
- Will this piece of information be useful when searching for the entity?
- How likely is this piece of information to change (or how frequently will that occur)?
In the case of genre, these are pretty easy to answer. Including the label “Science Fiction” in the book record will make it possible to search by those terms, as well as supporting a simple facet on genre. On the other hand, a book’s genre is unlikely to change; likewise the list of genres themselves will not change often. Let’s update our structure.
<book> <title>The Light of Other Days</title> <copyright>2000</copyright> <isbn>0-812-57640-3</isbn> <genre>Science Fiction</genre> </book>
An author’s address is tied to an individual author and certainly belongs with those entities. Of course, an author (especially a successful one) may have multiple addresses. That’s simple enough in XML — we can just add
<address> elements as needed. (Likewise, if we’re working with JSON, we can set that up as an array.)
<author> <first-name>Arthur</first-name> <middle-initial>C</middle-initial> <last-name>Clark</last-name> <dob>1917-12-16</dob> <address> <street>...</street> <city>...</city> <state>...</state> <!-- etc --> </address> </author>
We still need to connect the books and the authors. We can do so either by adding some elements that refer from one entity type to another, or we can use semantic triples. While triples allow for some interesting queries, I’ll leave that for another post and look at a simple connection here: in each book document, we’ll add one or more elements pointing to the author(s). We can do that with the URI of the author documents, which gives us a simple connection that we can use to look up additional information, if we wish.
Let’s pause for a moment, though, and revisit the questions above. Are there parts of the author documents that 1) help us search for books and 2) be unlikely to change? A very common use case is looking for books by author’s name, so let’s support that. Let’s bring the author’s name into each book. For this case, I think there’s more value in treating the author’s name as a whole than in breaking it into pieces.
<book> <title>The Light of Other Days</title> <copyright>2000</copyright> <isbn>0-812-57640-3</isbn> <genre>Science Fiction</genre> <author-name uri="/authors/2358970.xml">Arthur C. Clarke</author-name> <author-name uri="/authors/3498075.xml">Stephen Baxter</author-name> </book>
Since MarkLogic will index the terms in
<author-name>, we’ll find this book when doing text searches for “Arthur C. Clark”, “Arthur C Clark”, “Arthur Clark”, or even just “Clark”. We can also (very easily) set up a facet on
author-name to get counts of how many books each author has written.
Remember the two questions above when thinking about whether to denormalize a piece of information from one entity to another. To see an example of when not to denormalize, see Kasey Alderete’s post on Keeping Reputation Data Consistent in Samplestack.