The following blog pulls together thoughts expressed by my fellow colleagues, Charles Greer, John Snelson, Stephen Buxton and Philip Fennell during MarkLogic World.
LinkedIn is the “World’s largest professional network.” It also has some great apps that others covet: “People You May Know”, faceted search, InMail messaging system, and the analytics they run via Hadoop jobs. Many organizations want similar kinds of apps to connect their networks, provide better customer support, and develop better customer insights – but they don’t want to do all the heavy lifting and infrastructure-juggling that LinkedIn has done to make it all possible.
In fact, LinkedIn is very open about its infrastructure – giving presentations and even open-sourcing its technology. According to Bhaskar Ghosh, LinkedIn’s Senior Director of Engineering, LinkedIn uses three-phase data architecture comprised of online, offline and “nearline” systems. The online systems handle users’ real-time interactions; offline systems (Hadoop and a Teradata warehouse) handle analytic workloads; and “nearline” systems handle features such as People You May Know, search and the LinkedIn social graph.
The key components of LinkedIn’s tiered architecture include:
- Voldemort: Distributed key-value storage system, “basically just a big, distributed, persistent, fault-tolerant hash table” that is used for certain high-scalability storage problems where simple functional partitioning is not sufficient. Voldemort is like Amazon’s Dynamo database.
- Kafka: Distributed publish-subscribe messaging system. Kafka aims to unify offline and online processing by providing a mechanism for parallel load into Hadoop as well as the ability to partition real-time consumption over a cluster of machines.
- Azkaban: Batch job scheduler for mostly Hadoop jobs or ETL of some type. It allows the independent pieces to be declaratively assembled into a single workflow, and for that workflow to be scheduled to run periodically.
- Espresso: Horizontally scalable, indexed, timeline-consistent, document-oriented, highly available NoSQL data store (sound somewhat familiar?). Espresso was designed to replace legacy Oracle databases, being used first for LinkedIn’s InMail messaging service, and then for a variety of other features as well.
- Zoie and Bobo: Built on Apache Lucene to support fast real-time indexing and faceted search.
- Distributed Graph System: The only black box in LinkedIn’s infrastructure, this is the special-sauce that makes LinkedIn’s social graph possible, enabling features such as how related you are to another contact (ie., 1st, 2nd, 3rd degree connections) and the recommendation engine. This graph database handles hundreds of millions of members and their connections, and handles hundreds of thousands of queries per second.
That’s a lot of custom built systems that an organization would have to string together to get LinkedIn’s capabilities? … Or, maybe they could just use MarkLogic.
If LinkedIn was using MarkLogic, they would most likely be able to replace Espresso, Zoie, Bobo, and their graph database in one fell swoop. These are the systems that support “People You May Know”, the InMail messaging system, faceted search, and also the analytics they run via Hadoop jobs—all features similar to ones MarkLogic is already supporting for other customers.
MarkLogic is a document database not too different from Espresso. According to LinkedIn, they needed a NoSQL database that supports distributed transactions across a set of documents, maintains consistency to act as a single source-of-truth for user data, integrates with the entire data ecosystem as a platform for developers to build on, and has schema awareness and rich functionality such as indexing and search. Check, check, check, and check—MarkLogic has all of these features out-of-the-box. One example showing a similar capability is MarkMail, an application MarkLogic built in-house that is used to archive and search millions of emails.
MarkLogic can do more than LinkedIn’s Espresso. MarkLogic could act as a search engine to replace Zoie/Bobo because MarkLogic uses a universal index, with the ability to configure 25 different text indexes and three different range indexes. These indexes also enable faceted search. This all comes built-in with MarkLogic, not bolt-on like other solutions do with Lucene or Solr. One example of a search experience built on MarkLogic is Springer Images, a searchable database of 8.4 million scientific documents including journals, books, series, protocols and reference works. Not quite the 200+ million users that LinkedIn has, but definitely not too shabby. Besides Springer, MarkLogic has other, larger examples with LexisNexis and the Defense Intelligence Agency but they are not publicly accessible.
MarkLogic could also act as LinkedIn’s distributed graph system. MarkLogic has support for semantics in that MarkLogic can store RDF triples, using SPARQL as its query language. RDF is the language of the semantic web, allowing relationships with nodes and edges to be created and searched. This feature was introduced in MarkLogic 7 last year, and gives MarkLogic the unique ability among NoSQL databases to act as both a document and graph database. One early example of this feature in production is the LDS Church’s Gospel Topics Explorer. Another example is a proprietary tool called GoldminR, developed by a startup called FactGem.
MarkLogic integrates with Hadoop. Replacing Kafka and Azkaban would probably be a stretch, but MarkLogic might be able to tackle that as well. MarkLogic is great at combing the capabilities of operational and analytical systems, using tiered storage and integrating with Hadoop to support real-time analytics. This means that the batch jobs that Kafka is used for wouldn’t really be necessary because it would be done in real-time with MarkLogic. All that ETL work done by Azkaban—probably wouldn’t need to worry about that either since all the data would be in one place in MarkLogic.
New Features with MarkLogic. While LinkedIn is great, there are still some limitations to features such as search. For example, what about those hidden gems that some people put in their profile that you want to search for but aren’t captured as profile attributes? What about alerts for job postings? Or, what if you wanted to see what your network looked like at a certain point in time (ie. Who you knew when)? MarkLogic would make all of these features possible. Users could easily conduct rich search queries across unstructured profile information, get alerts created for job or other searches, and leverage bitemporal (new feature in MarkLogic 8) to view what data looked like at a historical time-stamp.
It’s really tempting to build everything from scratch and certainly LinkedIn has done an impressive job. But even Jay Kreps, Principal Staff Engineer at LinkedIn, has been quoted by GigaOM saying “the constant cycle of building new systems is ‘kind of a hamster wheel.’”
Sometimes, it’s nice to get off the hamster wheel and benefit from a tool that’s already built. Why not reimagine your data with MarkLogic? Just sayin.’