We’ve joined forces with Smartlogic to reveal smarter decisions—together.

When to Denormalize

When building an application with MarkLogic, we need to figure out how to model our data. We generally start by identifying the relevant entities, which leads to a common question: what should I denormalize from one entity type into another? Let’s illustrate with an example.

Suppose my database has books and authors. In the relational world, I might have a schema that looks like this:

Sample relational schema

I’ve left out some details, such as the publisher and address tables, but this is enough to explore with.

To bring this data into MarkLogic, we see that books and authors are our primary entities (of the tables shown). What goes into each type of document? Some of that is easy: a book will have a title, copyright, and ISBN. An author will have the person’s name and date of birth.

<book>
  <title>The Light of Other Days</title>
  <copyright>2000</copyright>
  <isbn>0-812-57640-3</isbn>
</book>

<author>
  <first-name>Arthur</first-name>
  <middle-initial>C</middle-initial>
  <last-name>Clark</last-name>
  <dob>1917-12-16</dob>
</author>

For other tables, we need to decide how to represent their information: create new documents for them? denormalize them into the documents we already have? represent them as semantic triples?

Let’s start with an easy one: genre. This is a simple lookup table, where a book holds a genre_id, which points to the string that names that genre. In fact, the value could have been inserted into the book table in the first place, but it was moved to its own table to ensure the string value occurred in just one place.

There are two simple questions we can ask when we consider whether to denormalize a piece of information into entity documents (for instance, our book documents), versus looking it up in some other place (such as having genre documents).

  1. Will this piece of information be useful when searching for the entity?
  2. How likely is this piece of information to change (or how frequently will that occur)?

In the case of genre, these are pretty easy to answer. Including the label “Science Fiction” in the book record will make it possible to search by those terms, as well as supporting a simple facet on genre. On the other hand, a book’s genre is unlikely to change; likewise the list of genres themselves will not change often. Let’s update our structure.

<book>
  <title>The Light of Other Days</title>
  <copyright>2000</copyright>
  <isbn>0-812-57640-3</isbn>
  <genre>Science Fiction</genre>
</book>

An author’s address is tied to an individual author and certainly belongs with those entities. Of course, an author (especially a successful one) may have multiple addresses. That’s simple enough in XML — we can just add <address> elements as needed. (Likewise, if we’re working with JSON, we can set that up as an array.)

<author>
  <first-name>Arthur</first-name>
  <middle-initial>C</middle-initial>
  <last-name>Clark</last-name>
  <dob>1917-12-16</dob>
  <address>
    <street>...</street>
    <city>...</city>
    <state>...</state>
    <!-- etc -->
  </address>
</author>

We still need to connect the books and the authors. We can do so either by adding some elements that refer from one entity type to another, or we can use semantic triples. While triples allow for some interesting queries, I’ll leave that for another post and look at a simple connection here: in each book document, we’ll add one or more elements pointing to the author(s). We can do that with the URI of the author documents, which gives us a simple connection that we can use to look up additional information, if we wish.

Let’s pause for a moment, though, and revisit the questions above. Are there parts of the author documents that 1) help us search for books and 2) be unlikely to change? A very common use case is looking for books by author’s name, so let’s support that. Let’s bring the author’s name into each book. For this case, I think there’s more value in treating the author’s name as a whole than in breaking it into pieces.

<book>
  <title>The Light of Other Days</title>
  <copyright>2000</copyright>
  <isbn>0-812-57640-3</isbn>
  <genre>Science Fiction</genre>
  <author-name uri="/authors/2358970.xml">Arthur C. Clarke</author-name>
  <author-name uri="/authors/3498075.xml">Stephen Baxter</author-name>
</book>

Since MarkLogic will index the terms in <author-name>, we’ll find this book when doing text searches for “Arthur C. Clark”, “Arthur C Clark”, “Arthur Clark”, or even just “Clark”. We can also (very easily) set up a facet on author-name to get counts of how many books each author has written.

Remember the two questions above when thinking about whether to denormalize a piece of information from one entity to another. To see an example of when not to denormalize, see Kasey Alderete’s post on Keeping Reputation Data Consistent in Samplestack.

Technical Community Manager

Start a discussion

Connect with the community

STACK OVERFLOW

EVENTS

GITHUB COMMUNITY

Most Recent

View All

Facts and What They Mean

In the digital era, data is cheap, interpretations are expensive. An agile semantic data platform combines facts and what they mean to create reusable organizational knowledge.
Read Article

Truth in ESG Labels

Managing a portfolio of investments for your client has never been simple - and doing so through an ESG lens raises the complexity to an almost mind-boggling level. Learn the signs your team has hit the wall with current tools - and how a semantic knowledge graph can help.
Read Article

4 Signs You’ve Got a Transaction Reconciliation Challenge

Many firms manage transaction reconciliation using smart people armed with spreadsheets - but that doesn't scale well. Learn what to look for, to know if you're creating new forms of risk for your firm.
Read Article
This website uses cookies.

By continuing to use this website you are giving consent to cookies being used in accordance with the MarkLogic Privacy Statement.