The List

I recently heard a rumor that the big guy up north was making some big changes to his gift-to-kid matching algorithm. As we all know, Santa got NoSQL long ago, using the flexibility of the schema and matching to streamline one of the biggest problems in all of computing: how to take a list of kids, sort them out, and see who gets the goodies and who gets the coal.

So what was happening up North? It turns out that Santa is, as usual, way ahead of the curve …

Through a secret Elf source on Santa’s data team, I learned that they had been looking at the list algorithm for a while. “The old system was really good at doing the big match, but was a little too black and white for today’s kids: you were either naughty or nice. But with everything kids have access to today, almost no one is totally good.” Also, the gift inventory wasn’t keeping up with the times: “We had a lot of gifts getting ignored because they didn’t match the kids’ real interests.” There is, apparently, nothing worse than when Santa finds out a gift didn’t work out. My source would only say that Santa gets “sad.”

The team has tried new approaches over the years, but nothing made it out of the prototype phase. The challenge was not just to get the complete picture of the kids with all their varying attributes and behaviors, but also to make a better match with a gift right for that kid. Coal would still be on the list, but Santa wanted to cut it way back to only the worst cases. Not only does he hate giving it out, it turns out the North Pole has been under some environmental pressure about coal consumption. Instead, Santa wanted to do ‘naughty’ gifts that would encourage kids to move up along the nice scale. One example my secret Elf source gave me was Santa’s move into accessories. A kid might get a Lego organizer instead of Lego, with a note from Santa to keep up the good work to see it filled. Santa also doesn’t think small: he wanted a system that would scale to the over 550 million kids (according to The Atlantic in 2011) out there but be precise for every single one.

Along with NoSQL, Santa and his team have long been into semantics, and my source refused to confirm or deny long-standing rumors of a Sir Tim Berners-Lee / Santa summit in the early 2000s. As a result, they knew the concepts were a good fit. If they could add triples to the kids database to record all the good and bad, they would be able to create a naughty/nice “signature” for each kid instead of the simple naughty/nice value. Similarly, the gifts could be classified along a much wider range of real-world attributes that could map onto the attributes of the kids.

But the team always had trouble putting it into action because of the scale of the Big List system and the need for it to be dynamic to make sure they didn’t miss any last-minute information. “If we get an update like helping a Grandma cross the street at the very last minute, we need to make sure that gets into the match so the kid can get the gift they deserve.” According to my source, this ruled out many common approaches because “the kids and gift database is built for speed and scale, and pre-calculating it or waiting for a match based on semantics running in a separate system just didn’t cut it.” When you throw in the fact that gifts needed to be sorted by geospatial coordinates, having multiple components was “like trimming a tree with the abominable snowman – you’re constantly putting it all back together and by the time you get done it’s too late.” Apparently there is some bad history there.

But the team didn’t give up, and early this year, while the big guy was on his usual post-holiday Hawaii trip, the data team started playing around with MarkLogic 7. They quickly got traction on a new idea to combine the kid and gift inventory using NoSQL with the naughty/nice attributes represented as triples. When they showed the first prototypes, Santa thought they had faked it (“Too many reindeer poop pranks,” said my source) and only believed them after they did the whole run in real time and showed how the gifts changed as the kids’ profiles were updated. From then on the team has been “flat out” bringing the system online and getting it ready for this year’s big run. “We’re up to a couple trillion triples and still get the list in a couple of minutes.” My source refused to comment on the infrastructure beyond “let’s just say that when we do the run a certain online retailer slows down quite a bit.”

Although the team is very busy I was able to catch a glimpse of how it works – and the changes are dramatic from the old simple match.

The first change is to go with a document and triples data model – this is an early example of the schema for kids:

<kid>
  <name>john</name>
  <sem:triple>
    <sem:subject>john</sem:subject>
    <sem:predicate>wants</sem:predicate>
    <sem:object>train</sem:object>
  </sem:triple>
  <sem:triple>
    <sem:subject>john</sem:subject>
    <sem:predicate>gooddeed</sem:predicate>
    <sem:object>10</sem:object>
  </sem:triple>
  <loc lat="40.77" lon="73.98"></loc>
</kid>

They later replaced ‘goodness’ with the actual deeds as well as the deed ranking, and the real data sets now have “hundreds of triples” per kid.
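To make the “signature” idea a bit more concrete, here is a minimal sketch in plain Python rather than MarkLogic. It is not the elves’ code. It assumes each deed is stored as a (subject, predicate, object) triple under a hypothetical `deed` predicate, and that a hypothetical `DEED_RANK` table scores each deed. Neither name appears in the schema above, and the post does not say how deeds are actually ranked.

```python
from collections import defaultdict

# Hypothetical deed rankings, invented for illustration only.
DEED_RANK = {"helped-grandma-cross-street": 5, "reindeer-poop-prank": -3}


def signature(triples):
    """Group (subject, predicate, object) triples into a per-kid profile."""
    profile = defaultdict(lambda: {"wants": [], "deeds": [], "score": 0})
    for subject, predicate, obj in triples:
        entry = profile[subject]
        if predicate == "wants":
            entry["wants"].append(obj)
        elif predicate == "deed":  # assumed predicate name
            entry["deeds"].append(obj)
            entry["score"] += DEED_RANK.get(obj, 0)  # unknown deeds count as 0
    return dict(profile)


triples = [
    ("john", "wants", "train"),
    ("john", "deed", "helped-grandma-cross-street"),
    ("john", "deed", "reindeer-poop-prank"),
]
print(signature(triples)["john"]["score"])  # 2
```

The point of keeping the individual deeds next to the running score is that a last-minute update is just one more triple, so nothing has to be recalculated ahead of time.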

The kids are matched to toys that also have values for overall goodness:

<toy>
   <name>train</name>
   <sem:triple>
     <sem:subject>wireless-train-set</sem:subject>
     <sem:predicate>http://www.w3.org/1999/02/22-rdf-syntax-ns#type</sem:predicate>
     <sem:object>Train</sem:object>
   </sem:triple>
   <sem:triple>
     <sem:subject>wireless-train-set</sem:subject>
     <sem:predicate>goodnessvalue</sem:predicate>
     <sem:object>10</sem:object>
   </sem:triple>
</toy>

With the entire inventory of toys and kids modeled out with both documents and triples, the team is now able to fine-tune the match query. This early example shows the power of the match, using both semantics and geospatial to spit out the list for a region with a fine-grained toys-to-kids match:

import module namespace geo = "http://marklogic.com/geospatial"
         at "/MarkLogic/geospatial/geospatial.xqy";
sem:sparql('
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX prop: <http://marklogic.com/prop#>
SELECT ?kid  ?toy ?value WHERE {
   ?kid prop:wants ?t.
   ?kid prop:goodness ?score.
   ?toy rdf:type  ?t.
   ?toy prop:value ?value.
   FILTER (?value = ?score)}
', (), (), 
  cts:or-query((cts:and-query((cts:collection-query("kids"),
                               cts:element-attribute-pair-geospatial-query(xs:QName("loc"),
                                                                           xs:QName("lat"), xs:QName("lon"), 
                                                                           cts:circle(20, cts:point(40.77,73.98))))),
               cts:collection-query("toys"))))

The geospatial part of this example takes a circle around a specific drop point. When I asked about using this method, my source would only say that they were always trying to “optimize the delivery system with innovation” and that he could not comment on Santa’s drone usage.
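If you don’t have a MarkLogic instance handy, the logic of the match query can be sketched in plain Python. This is only an illustration of the join, not how the database evaluates it. It keeps kids inside the drop circle, then pairs each kid with toys whose type matches the wish and whose value equals the kid’s score, mirroring the SPARQL `FILTER`. The sample records, the `match` and `miles_between` helpers, and the assumption that the radius of 20 is in miles are all mine.

```python
import math


def miles_between(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles, using the haversine formula."""
    earth_radius = 3958.8  # mean Earth radius in miles
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * earth_radius * math.asin(math.sqrt(a))


def match(kids, toys, center, radius_miles):
    """Yield (kid, toy, value) for kids in the circle whose score equals a toy's value."""
    for kid in kids:
        if miles_between(kid["lat"], kid["lon"], *center) > radius_miles:
            continue  # outside the drop circle, so this kid is skipped
        for toy in toys:
            if toy["type"] == kid["wants"] and toy["value"] == kid["goodness"]:
                yield kid["name"], toy["name"], toy["value"]


# Hypothetical sample data, loosely modeled on the documents shown earlier.
kids = [
    {"name": "john", "wants": "Train", "goodness": 10, "lat": 40.77, "lon": 73.98},
    {"name": "jane", "wants": "Train", "goodness": 10, "lat": 51.50, "lon": 0.12},
]
toys = [
    {"name": "wireless-train-set", "type": "Train", "value": 10},
    {"name": "wooden-train", "type": "Train", "value": 4},
]
print(list(match(kids, toys, center=(40.77, 73.98), radius_miles=20)))
# [('john', 'wireless-train-set', 10)]
```

Note that only the kids are filtered by location. Every toy stays eligible, which is what the `cts:or-query` does in the real query by combining the geo-restricted “kids” collection with the whole “toys” collection.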

This is the first year the team will be using the system, but the expectations are big. “We’re just starting to tap the potential of making the finer-grained matches – already we’re confident we can bring gift satisfaction way up, and next year we’ll be bringing that into the workshop to help tailor gift creation.” But what really gets the elves excited are all the new ways to get access to their data: “There is already a team working with Tableau and Santa won’t stop bugging them, and when I start to think about doing more geospatial optimization I just … get … so …. wow … amazing …” I’m not sure if you’ve ever seen an elf get excited, but they tend to both start flying AND sputtering. Once he calmed down, my source was finally able to finish with “it’s going to be great!”

I, for one, am looking forward to the big changes up north!

Matt
