Building a Semantic Recommendation Engine: the Sequel

Since we discussed the movie business in my previous post on building a semantic recommendation engine, a sequel seemed appropriate.

First, some background. Our children are in their late teens, but over the years they’ve been taken to almost every animated film produced in the twenty-first century and earlier. While there have been some flops, as parents we’ve marveled at how often the writers manage to inject entertaining dialogue for adults while keeping the young ones glued to their seats with simpler chatter and, of course, amazing visuals. They even manage to layer multiple themes and “moral of the story” messages targeted appropriately at kids and parents. Bravo to these super-talented folks!

Given that, let’s say a fictitious content provider named Netflux knows a consumer has kids and takes them to the animated hits. The easy thing to do is to recommend similar age-appropriate films. But let’s say they really want to hit the mark. Knowing the parent’s Twitter handle, they decide to leverage social media. Note: in preparation for this post, several test tweets were sent from the @mmalgeri account, each tagged #testrec.

The #testrec hashtag made it convenient to gather these tweets into an array, but Netflux could just as well retrieve them via the consumer’s Twitter handle.

Since The Incredibles is a film about superheroes, an obvious recommendation would be the Spider-Man series. However, using search techniques like synonym matching, co-occurrence and stemming, along with custom semantic inferencing rules, Netflux can REALLY impress the consumer by recommending NOT Spider-Man but Monsters vs. Aliens. How would that work?

First, the creators of Monsters vs. Aliens would have to tag the film with descriptive metadata and share it with Netflux. Detailed tagging is table stakes for accurate media recommendation engines. Content providers are tagging not just title metadata but annotating each scene with information pertaining to characters, talents, locations, storylines, costumes, product placements and a variety of other attributes, which provide valuable media insight.

Next, the assumption is that Netflux tracks information about its consumers, e.g. SS#, bank accounts…just kidding. Using a Twitter handle, Netflux can grab a consumer’s tweets (see the code in Appendix 1, which can be run in MarkLogic’s Query Console) and process them in MarkLogic’s operational data hub (ODH) as follows:

  1. Load all tweets into a staging database as-is. The tweets are loaded into a structure that preserves the original content but allows incremental enrichments to be collected in other areas of the structure. The “envelope” pattern is used for this purpose, allowing semantic facts and other types of metadata to be collected alongside the original tweet.
  2. Harmonize the tweets by leveraging an enrichment service, a process that could tag movie titles and sentiment words and also generate semantic facts such as:
    1. @mmalgeri tweeted #banter
    2. @mmalgeri tweeted #repartee
    3. @mmalgeri tweeted #witticisms
    4. @mmalgeri tweeted “The Incredibles”
    5. @mmalgeri tweeted “Shrek”
    6. @mmalgeri tweeted “Finding Nemo”
  3. Further harmonize these tweets by associating words like banter with their synonyms and stems, such as repartee and witticism, and add these synonyms to the tweet document.
  4. Perform co-occurrence analysis on these documents to determine which sentiment words appear with movie titles.
  5. Create custom inferencing rules that conclude:
    1. If @mmalgeri has tweets about movies, and
    2. @mmalgeri tweets synonymous sentiment words about movies,
    3. then recommend a movie with the same or synonymous sentiment words
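
Steps 1 through 4 can be sketched in plain JavaScript. This is only an illustration of the idea, not MarkLogic's data hub code: the function names (`toEnvelope`, `harmonize`, `coOccurrence`), the envelope field names, the hand-built synonym table and the trailing-"s" stemmer are all assumptions made for this sketch, and the sample tweets in the usage lines are made up. A real enrichment service would use a proper thesaurus, stemmer and entity extractor.

```javascript
// Hypothetical synonym table (assumption; a real system would use a thesaurus).
const SYNONYMS = new Map([
  ["banter", ["repartee", "witticism"]],
  ["repartee", ["banter", "witticism"]],
  ["witticism", ["banter", "repartee"]]
]);

// Movie titles the enrichment step knows how to spot (assumption).
const KNOWN_TITLES = ["The Incredibles", "Shrek", "Finding Nemo"];

// Naive stemmer: lowercases and strips one trailing "s". Stands in for real stemming.
function stem(word) {
  return word.toLowerCase().replace(/s$/, "");
}

// Step 1: wrap the raw tweet in an envelope so the original text is preserved.
function toEnvelope(handle, text) {
  return { headers: { handle: handle }, triples: [], instance: { text: text } };
}

// Steps 2 and 3: tag titles and sentiment words, emit facts, add synonyms.
function harmonize(envelope) {
  const handle = envelope.headers.handle;
  const text = envelope.instance.text;

  const titles = KNOWN_TITLES.filter(title => text.includes(title));
  const hashtags = (text.match(/#\w+/g) || []).map(tag => stem(tag.slice(1)));
  const sentiments = hashtags.filter(tag => SYNONYMS.has(tag));

  // Expand each sentiment word with its synonyms.
  const expanded = new Set(sentiments);
  sentiments.forEach(word => SYNONYMS.get(word).forEach(syn => expanded.add(syn)));

  envelope.headers.titles = titles;
  envelope.headers.sentiments = Array.from(expanded).sort();
  // Semantic facts as simple [subject, predicate, object] arrays.
  envelope.triples = [
    ...titles.map(title => [handle, "tweeted", title]),
    ...sentiments.map(word => [handle, "tweeted", "#" + word])
  ];
  return envelope;
}

// Step 4: count how often each sentiment word appears with each title.
function coOccurrence(envelopes) {
  const counts = {};
  for (const env of envelopes) {
    for (const title of env.headers.titles) {
      for (const word of env.headers.sentiments) {
        const key = title + "|" + word;
        counts[key] = (counts[key] || 0) + 1;
      }
    }
  }
  return counts;
}

// Usage with made-up tweets (not the tweets actually sent for this post).
const sampleTweets = [
  "Loved the #banter in The Incredibles #testrec",
  "Shrek is full of #witticisms #testrec"
];
const envelopes = sampleTweets.map(t => harmonize(toEnvelope("@mmalgeri", t)));
console.log(JSON.stringify(coOccurrence(envelopes), null, 2));
```

Keeping the original text under `instance` and the enrichments under `headers` and `triples` is what lets each harmonization pass add information without touching what was loaded in step 1.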

In other words, a smart recommendation engine would realize that @mmalgeri might like animations about heroes for his kids, but…he REALLY likes snappy dialogue. The Spider-Man series would not likely be tagged with this kind of descriptive metadata because that’s not its main characteristic. However, Monsters vs. Aliens contains an abundance of clever dialogue and is hopefully properly tagged… “Fresno? Fresno! In what universe is Fresno better than Paris, Derek?”
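
The inferencing rule from step 5 can be approximated with a small matching function. This is a hedged sketch rather than a semantic rule as MarkLogic would express it: the `recommend` function, the shape of the consumer profile and the catalog tags below are all hypothetical, invented only to show why tag overlap would favor Monsters vs. Aliens over Spider-Man.

```javascript
// Hypothetical catalog metadata; real tags would come from the content creators.
const catalog = [
  { title: "Spider-Man", tags: ["superhero", "action"] },
  { title: "Monsters vs. Aliens", tags: ["animation", "banter", "witticism"] }
];

// Hypothetical consumer profile built from harmonized tweets.
const profile = {
  titles: ["The Incredibles", "Shrek", "Finding Nemo"],
  sentiments: ["banter", "repartee", "witticism"]
};

// Rule: if the consumer tweets about movies and uses (synonymous) sentiment
// words, recommend unseen movies tagged with the same or synonymous words.
function recommend(consumer, movies) {
  if (consumer.titles.length === 0) return []; // premise 1 not met
  const liked = new Set(consumer.sentiments);
  return movies
    .filter(movie => !consumer.titles.includes(movie.title))
    .map(movie => ({
      title: movie.title,
      score: movie.tags.filter(tag => liked.has(tag)).length
    }))
    .filter(movie => movie.score > 0)
    .sort((a, b) => b.score - a.score)
    .map(movie => movie.title);
}

console.log(recommend(profile, catalog));
```

Because the profile's sentiment list already includes synonyms, a film tagged only with "witticism" would still match a consumer who only ever tweeted #banter.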

Content providers can leverage features such as multi-model NoSQL and semantic documents, sophisticated search and indexing, semantic facts contained in graphs, and semantic inferencing and combine them to create a smart recommendation engine. MarkLogic provides these features out of the box. Consider downloading the free developer’s version and while you’re at it, check out Monsters vs. Aliens…you’ll have fun.

Appendix 1 – JavaScript code to gather tweets in Query Console

// Requires MarkLogic 8 or higher
//
// This function gets tweets based on a query. 
// The code can be modified to pass in the query via a form
// Prior to running this module, the user should have acquired 
// an access token, which is used in the httpGet call, in the 
// Authorization header. 
// To get an access token, see 
// https://dev.twitter.com/oauth/overview/application-owner-access-tokens
// 
// Note: the “query” here is “#testrec” but could be something like
// Nemo AND witty

function getTweets () {

  var query = xdmp.urlEncode("#testrec");

  var twitterURL = "https://api.twitter.com/1.1/search/tweets.json?q=" + query;

  var accessToken = "replace with your token";

  var tweetItr = 
    xdmp.httpGet(twitterURL, 
      {
        "headers" : {
          "Authorization" : "Bearer " + accessToken
        }, 
        "format" : "json"
      });
  
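  // xdmp.httpGet returns two items: the response metadata first,
  // then the response body.
  var tweetPackage = tweetItr.toObject();
  // The second item is the JSON body; its "statuses" array holds the tweets.
  var theTweets = tweetPackage[1].toObject().statuses;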

  return theTweets.map(function(tweet){
    return tweet.text;
  });
}

getTweets();

 

Michael Malgeri - Principal Technologist | MarkLogic

Michael Malgeri is a Principal Technologist with MarkLogic. He works with companies to match their business requirements with MarkLogic’s enterprise NoSQL database and semantic features. He helps organizations reduce costs, automate processes, find new opportunities and create applications that bring high value to businesses and their customers. Michael focuses on the media and entertainment industry, where content providers, distributors and related companies are seeking to leverage the power of data in order to capture new opportunities driven by expanding global information consumption.

Michael holds Master’s Degrees in Computer Science, Business and Mechanical Engineering. He's been a Certified Project Management Professional since 2011.
