Building a Semantic Recommendation Engine: the Sequel

Since we discussed the movie business in my previous post on building a semantic recommendation engine, a sequel seemed appropriate.

First, some background. Our children are in their late teens, but over the years they’ve been taken to almost every animated film produced in the twenty-first century and earlier. While there have been some flops, as parents we’ve marveled at how often the writers manage to inject entertaining dialogue for adults, yet keep the young ones glued to their seats with simpler chatter and, of course, amazing visuals. They even manage to overlay multiple themes and “moral of the story” messaging appropriately targeted to kids and parents. Bravo to these super-talented folks!

Given that, let’s say a fictitious content provider named Netflux knows a consumer has kids and takes them to the animated hits. The easy thing to do is to recommend similar age-appropriate films. But let’s say they really want to hit the mark. Knowing the parent’s Twitter handle, they decide to leverage social media. Note: in prep for this post, the following tweets were sent:

The #testrec hashtag allowed for convenient gathering of these tweets in an array, but Netflux could just as well retrieve them via the consumer’s Twitter handle.

Since The Incredibles is a film about superheroes, a recommendation could be the Spider-Man series. However, using search techniques like synonym matching, co-occurrence, and stemming, along with custom semantic inferencing rules, Netflux can REALLY impress the consumer by recommending NOT Spider-Man but Monsters vs. Aliens. How would that work?

First, the creators of Monsters vs. Aliens would have to tag the film with descriptive metadata and share it with Netflux. Detailed tagging is table stakes for accurate media recommendation engines. Content providers are tagging not just title metadata but annotating each scene with information pertaining to characters, talents, locations, storylines, costumes, product placements and a variety of other attributes, which provide valuable media insight.

Next, the assumption is that Netflux tracks information about its consumers, e.g. SS#, bank accounts…just kidding. Using a Twitter handle, Netflux can grab a consumer’s tweets (see the code in Appendix 1, which can be run in MarkLogic’s Query Console) and process them in MarkLogic’s operational data hub (ODH) as follows:

  1. Load all tweets in a staging database as-is. The tweets are loaded into a structure that preserves the original content, but allows for incremental enrichments to be collected in other areas of the structure. The “envelope” pattern is used for this purpose, allowing semantic facts and other types of metadata to be collected.
  2. Harmonize the tweets by leveraging an enrichment service, a process that could tag movie titles and sentiment words and also generate semantic facts such as:
    1. @mmalgeri tweeted #banter
    2. @mmalgeri tweeted #repartee
    3. @mmalgeri tweeted #witticisms
    4. @mmalgeri tweeted “The Incredibles”
    5. @mmalgeri tweeted “Shrek”
    6. @mmalgeri tweeted “Finding Nemo”
  3. Further harmonize these tweets by associating words like banter with their synonyms and stems, such as repartee and witticism, and add these synonyms to the tweet document.
  4. Perform co-occurrence analysis on these documents to determine which sentiment words appear with movie titles.
  5. Create custom inferencing rules that conclude:
    1. If @mmalgeri has tweets about movies, and
    2. @mmalgeri tweets synonymous sentiment words about movies,
    3. then recommend a movie with the same or synonymous sentiment words
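
Steps 1 through 4 above can be sketched in plain JavaScript. This is a minimal illustration, not the data hub's actual implementation: the tweet text is invented (only the handle, hashtags, and film titles come from the facts listed above), the envelope field names (`headers`, `triples`, `instance`) are an assumption, and the title list, synonym table, and one-line stemmer stand in for a real enrichment service.

```javascript
// Hypothetical sample tweets: the wording is invented for illustration.
// Only the handle, hashtags, and film titles come from the facts above.
const rawTweets = [
  { user: "@mmalgeri", text: "The Incredibles has great #banter" },
  { user: "@mmalgeri", text: "Shrek is full of #repartee" },
  { user: "@mmalgeri", text: "Finding Nemo: so many #witticisms" },
];

// Assumed lookup tables standing in for an enrichment service and a thesaurus.
const KNOWN_TITLES = ["The Incredibles", "Shrek", "Finding Nemo"];
const SYNONYMS = {
  banter: ["repartee", "witticism"],
  repartee: ["banter", "witticism"],
  witticism: ["banter", "repartee"],
};

// Naive stemmer for this sketch only: lowercases and strips a trailing "s".
function stem(word) {
  return word.toLowerCase().replace(/s$/, "");
}

// Step 1: wrap the untouched tweet in an envelope with room for enrichments.
function toEnvelope(tweet) {
  return { headers: {}, triples: [], instance: tweet };
}

// Steps 2 and 3: tag titles and hashtags, emit facts, and add synonyms.
function harmonize(envelope) {
  const { user, text } = envelope.instance;
  const titles = KNOWN_TITLES.filter((t) => text.includes(t));
  const hashtags = text.match(/#\w+/g) || [];
  const sentiments = new Set(hashtags.map((h) => stem(h.slice(1))));
  [...sentiments].forEach((w) =>
    (SYNONYMS[w] || []).forEach((s) => sentiments.add(s))
  );
  const triples = [...hashtags, ...titles].map((object) => ({
    subject: user,
    predicate: "tweeted",
    object,
  }));
  return {
    ...envelope,
    headers: { titles, sentiments: [...sentiments].sort() },
    triples,
  };
}

// Step 4: count how often each sentiment word appears alongside each title.
function coOccurrence(envelopes) {
  const counts = {};
  for (const env of envelopes) {
    for (const title of env.headers.titles) {
      for (const word of env.headers.sentiments) {
        const key = title + "|" + word;
        counts[key] = (counts[key] || 0) + 1;
      }
    }
  }
  return counts;
}

const harmonized = rawTweets.map(toEnvelope).map(harmonize);
const counts = coOccurrence(harmonized);
console.log(JSON.stringify(harmonized[0], null, 2));
console.log(counts);
```

Because the original tweet stays under `instance`, the enrichments can be regenerated at any time without touching the source content, which is the point of the envelope pattern.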

In other words, a smart recommendation engine would realize that @mmalgeri might like animations about heroes for his kids, but…he REALLY likes snappy dialogue. The Spider-Man series would not likely be tagged with this kind of descriptive metadata because that’s not its main characteristic. However, Monsters vs. Aliens contains an abundance of clever dialogue and is hopefully properly tagged… “Fresno? Fresno! In what universe is Fresno better than Paris, Derek?”
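
The inferencing rule can be approximated as a simple tag match. The sketch below is plain JavaScript rather than a MarkLogic rule set, and everything in it is hypothetical: the catalog tags, the synonym group, and the `recommend` helper are invented to show the shape of the logic, not how either film is actually tagged.

```javascript
// Hypothetical catalog: the tags are invented to illustrate the matching rule.
const catalog = [
  { title: "Spider-Man", tags: ["superhero", "action"] },
  { title: "Monsters vs. Aliens", tags: ["animation", "repartee"] },
];

// Assumed synonym groups; a real system would use a thesaurus or ontology.
const SYNONYM_GROUPS = [["banter", "repartee", "witticism"]];

// Expand a word to every member of the synonym group that contains it.
function expand(word) {
  const group = SYNONYM_GROUPS.find((g) => g.includes(word));
  return group ? group : [word];
}

// Rule: if the user tweets about movies and uses sentiment words about them,
// recommend other films tagged with the same or synonymous words.
function recommend(profile, films) {
  if (profile.titles.length === 0 || profile.sentiments.length === 0) {
    return [];
  }
  const wanted = new Set(profile.sentiments.flatMap(expand));
  return films
    .filter((film) => !profile.titles.includes(film.title))
    .filter((film) => film.tags.some((tag) => wanted.has(tag)))
    .map((film) => film.title);
}

const profile = {
  user: "@mmalgeri",
  titles: ["The Incredibles", "Shrek", "Finding Nemo"],
  sentiments: ["banter"],
};
const picks = recommend(profile, catalog);
console.log(picks);
```

Note that the match succeeds only because "banter" is expanded to "repartee" before comparing against the film's tags; without the synonym step, neither film would be recommended.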

Content providers can leverage features such as multi-model NoSQL and semantic documents, sophisticated search and indexing, semantic facts contained in graphs, and semantic inferencing and combine them to create a smart recommendation engine. MarkLogic provides these features out of the box. Consider downloading the free developer’s version and while you’re at it, check out Monsters vs. Aliens…you’ll have fun.

Appendix 1 – JavaScript code to gather tweets in Query Console

// Requires MarkLogic 8 or higher
// This function gets tweets based on a query. 
// The code can be modified to pass in the query via a form
// Prior to running this module, the user should have acquired 
// an access token, which is used in the httpGet call, in the 
// Authorization header. 
// To get an access token, see 
// Note: the “query” here is “#testrec” but could be something like
// Nemo AND witty

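function getTweets () {

  var query = xdmp.urlEncode("#testrec");

  // The base URL of the Twitter search endpoint goes inside the quotes;
  // the encoded query is appended to it
  var twitterURL = "" + query;

  var accessToken = "replace with your token";

  // Call the endpoint, passing the token in the Authorization header
  var tweetItr = xdmp.httpGet(twitterURL,
      {
        "headers" : {
          "Authorization" : "Bearer " + accessToken
        },
        "format" : "json"
      });

  // Item 0 of the response holds the response metadata, item 1 the JSON body
  var tweetPackage = tweetItr.toObject();
  var theTweets = tweetPackage[1].toObject().statuses;

  // Collect the text of each tweet into an array
  return theTweets.map(function (tweet) {
    return tweet.text;
  });
}

getTweets();
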

Michael Malgeri

Michael Malgeri is a Principal Technologist with MarkLogic. He works with companies to match their business requirements with MarkLogic’s enterprise NoSQL database and semantic features. He helps organizations reduce costs, automate processes, find new opportunities and create applications that bring high value to businesses and their customers. Michael focuses on the media and entertainment industry, where content providers, distributors and related companies are seeking to leverage the power of data in order to capture new opportunities driven by expanding global information consumption.

Michael holds Master’s Degrees in Computer Science, Business and Mechanical Engineering. He's been a Certified Project Management Professional since 2011.
