When MarkLogic first introduced semantics into our product offering in 2013, few organizations were familiar with the concept or the benefits it had to offer. Although semantics is now far more widely known and used, many developers still struggle with whether using this approach is the best way to solve their data integration issues.
I sat down with MarkLogic’s Senior Director of Product Management, Stephen Buxton, in advance of his presentation at MarkLogic World (May 7-10, 2018), to get his perspective on the evolution of semantics capabilities and use cases, and to ask him to share some of his favorite tips for developers.
Q: How have you seen “semantics” evolve in your time here at MarkLogic? What’s changed?
The biggest change in semantics at MarkLogic that I’ve seen is that it’s become more and more mainstream. When we first started talking about semantics, it was a little bit exotic, a little bit of an oddity. The technology had been around for a while but it had been used in very small, kind of niche applications.
I normally do one of those hand polls when I give talks. I say, “Raise your fingers to show me how much you know about semantics – where zero is ‘I can’t spell RDF’ and 10 is ‘I have a PhD in Ontology.’” I used to get lots of zeroes, and some twos, some threes. The last couple I’ve given, though, I’ve found two people who’ve had PhDs in Ontologies, lots of sixes and sevens – and even some 10s in the audience, which is very gratifying.
What it means for me is that when I talk to people about semantics, I don’t have to start from the very beginning every time. People have a general idea of what SPARQL and RDF are, what they do and what they’re useful for, and what graphs are and what they’re useful for. So, I can go straight into how MarkLogic does semantics, how that’s different, and how using a multi-model database – documents and triples (and SQL views, for that matter) – is really very different from using semantics stand-alone.
Q: There’s always a ton of interest in the semantics sessions at MarkLogic World… This year you’ll be focusing on the use of semantics for data integration – why pick that topic for this year?
The focus of MarkLogic right now is “MarkLogic is the best database in the world for integrating data from silos.” Rather than just talking about semantics in the abstract, as a technology, we thought we’d dive into that thought further. In other words, we say over and over again that MarkLogic is the best database in the world for integrating data from silos – but why is it the best database? What makes it the best?
We thought we’d just kind of double-click on that and say, “One of the fundamental reasons is that the architecture of MarkLogic is a multi-model NoSQL database – which means we can handle documents and triples, and of course a combination of documents and triples.”
To sum up: One of the reasons that MarkLogic is the best database for integration is because we use a multi-model approach, and part of that multi-model story is semantics. Rather than talk about semantics as an abstract technology, we’re going to dig into the data integration side to demonstrate all the things that semantics does to support data integration efforts.
Q: I think that will go over well with attendees, because you can get academic material in a lot of places, but focusing on a real-life, important use case will give them information they can use. So, what are some of the other use cases you’ve seen for applying semantics?
There are all kinds of use cases. Part of the challenge in talking about semantics – and MarkLogic Semantics in particular – is that there are so many.
For database folks and people who are used to relational databases, it’s a little bit like the ‘date’ data type. In every database, dates are treated differently. You can do different operations on them, and they have their uses – but you wouldn’t dream of pushing your dates out into a separate “date type database,” because that would mean that every time you want to talk about dates (as opposed to strings or numbers), you would have to go out to this separate database and then somehow join the results. You wouldn’t ever dream of doing that.
This may sound a bit off-topic, but it does relate to the question of use cases and an extreme version of the question people ask me all the time: “What can you use semantics for?”
The short answer is that you can use semantics for lots of things.
My personal favorite is semantic search or the idea that you can store a whole bunch of knowledge or context about the world in a semantic graph and use that to give you better searches.
Q: Can you give me some customer examples? What’s the coolest use of semantics that you’ve seen one of our customers deploy?
One of my very favorite customers is BSI, the British Standards Institution, who had a very real problem on their site. They sell standards that everybody must adhere to, so if you want to make a cardiac catheter, you go to their site, type in “cardiac catheter,” and then buy the standards that you must comply with in order to make a cardiac catheter. But of course, you can’t find those standards with a regular search.
An Elastic search or a Google search will simply not find the standards that you want. You’re out of luck because you don’t know what standards you have to comply with, and BSI is out of luck because they don’t get to sell you anything (and that’s how they make their living).
A semantic graph, however, will tell you that a catheter is a kind of implantable device, and that implantable devices have to be made in sterile environments. You will get all the standards that apply to implantable devices and sterile environments, just because you have this knowledge of the world backing up your search. The semantic search story is a great use case for semantics.
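To make the catheter example concrete, here is a minimal Python sketch of search expansion over a tiny graph. It is an illustration only, not how BSI or MarkLogic implements semantic search: the triples, the predicate names (`is_a`, `requires`), and the standard IDs are all invented for the example.

```python
# Hypothetical mini knowledge graph as (subject, predicate, object) triples.
TRIPLES = [
    ("cardiac catheter", "is_a", "implantable device"),
    ("implantable device", "requires", "sterile environment"),
]

# Hypothetical catalogue: standard ID -> topics it covers (IDs are invented).
STANDARDS = {
    "STD-001": {"implantable device"},
    "STD-002": {"sterile environment"},
    "STD-003": {"office furniture"},
}


def expand(term, triples):
    """Return the term plus every concept reachable from it in the graph."""
    seen = {term}
    frontier = [term]
    while frontier:
        current = frontier.pop()
        for subject, _predicate, obj in triples:
            if subject == current and obj not in seen:
                seen.add(obj)
                frontier.append(obj)
    return seen


def keyword_search(term):
    """Plain lookup: only standards tagged with the literal search term."""
    return sorted(sid for sid, topics in STANDARDS.items() if term in topics)


def semantic_search(term):
    """Lookup over the term and everything the graph says it implies."""
    concepts = expand(term, TRIPLES)
    return sorted(sid for sid, topics in STANDARDS.items() if topics & concepts)
```

With this toy data, `keyword_search("cardiac catheter")` finds nothing, while `semantic_search("cardiac catheter")` returns the two standards tagged with the concepts the graph connects to a catheter.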
Another great use case is intelligence. Again, we’re representing facts and relationships about the world in a graph – so, if you’re on a police force trying to find out who might be in danger, you can look at where that person lives, what their relations are (mother, father, brothers), and who else lives in that house. You can also look at who else lives in that house that has any kind of police record or alias, where that alias has a police record, and so on. You can just explore out the graph and look for connections that you wouldn’t really see with any other kind of a model – certainly not with a relational model, and not even with a document model.
So, that’s a really cool use case of building out your knowledge of the world – of people, places, maybe organizations depending on what your business is – and answering questions by traversing the graph and discovering links and relationships that you wouldn’t have seen any other way.
If your graph can include documents, that adds another dimension. You can now ask, “Who else lives in that house that has an alias, where that alias has a police record that mentions some kind of abuse in the recent past?” while having the ability to look at any police records and interview notes, and so on.
While we’re on this intelligence use case, of course there’s also an aspect of data integration here – integrating the information from police, social services, hospitals, schools, and so on, to find things you didn’t even know you could search for.
After 9/11, US intelligence agencies gathered around a table with printouts on paper of everything they knew about the attackers. When they laid out all those sheets of paper on the same table, it was clear they knew, collectively, what was going to happen – they just couldn’t put it all together. Semantics is very good at helping you pull all that information together and drawing relationships between the pieces. And documents are perfect for representing the lumpy bits – a person’s profile, a criminal record, an interview transcript. (By the way, I don’t know if that meeting actually took place, but it is a lovely story.)
As a side note, that intelligence use case is particularly interesting to me, as that’s the world I came from. I used to work for a company that did link analysis, and the database we used underneath was relational. I recall there were some issues with data management and with performance hits in querying.
One of the interesting things about the multi-model world is that people would always tell me about their favorite model (whether it be tables, or documents, or triples) because they could do anything in that model. The truth is, yes, you can do anything with triples, tables, or documents, but each of those models has a set of uses that are particularly suited to that model, where it’s natural to store that kind of data in that model. It’s easier to ingest, easier to manage, and easier to query – that’s a perfect example.
You can certainly store a graph as a table, but it’s very hard to lay it out and ingest it, and you’ve got problems with data types. Ideally, you’d have a column that has multiple data types, but that’s not allowed in a table.
Once you’ve got your tables set up, you can certainly store that information in tables, and query it in SQL, but your SQL queries wind up being massive numbers of joins – and worse than that, self-joins (joins across rows in the same table) – and relational databases are just not made for that kind of storage or that kind of query, so they tend to do very poorly.
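A small sketch shows how quickly the self-joins pile up. It uses Python’s built-in `sqlite3` module with an invented `edges` table and invented names; it is not taken from any real system. The question is the police example from earlier: who shares a house with a given person and has an alias with a police record?

```python
import sqlite3

# In-memory database with one generic table holding every edge of the graph.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO edges VALUES (?, ?, ?)",
    [
        ("alice", "lives_at", "house1"),
        ("bob", "lives_at", "house1"),
        ("bob", "has_alias", "robert"),
        ("robert", "has_record", "record42"),
        ("carol", "lives_at", "house2"),
    ],
)

# Each hop in the graph becomes another join of the table against itself:
# four copies of `edges` for a question that is only three hops long.
query = """
SELECT other.subject, rec.object
FROM edges AS me
JOIN edges AS other ON other.predicate = 'lives_at' AND other.object = me.object
JOIN edges AS alias ON alias.predicate = 'has_alias' AND alias.subject = other.subject
JOIN edges AS rec   ON rec.predicate = 'has_record' AND rec.subject = alias.object
WHERE me.subject = 'alice' AND me.predicate = 'lives_at'
"""
rows = conn.execute(query).fetchall()
```

Every additional hop adds one more copy of the same table to the query, and every value has to fit the single `object` column regardless of its natural data type.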
Q: Where do people tend to struggle when using semantics? Why is that?
What I generally tell people is that I spend half my life trying to convince people to use semantics, and the other half of my life trying to convince people not to use semantics. It goes back to using the right model for the right thing.
When people are brand new to semantics, it’s often a challenge to get them to look at the RDF data model and the SPARQL query language – as I said, they will have their favorite model already, and they tend to want to stay with those. Then you see them get intrigued by the idea of graphs and SPARQL queries, and then they often flip and go all-in on semantics. It’s just the same problem with a different model. Using RDF for everything is just as bad as using tables or documents for everything.
So that’s where people tend to struggle – either they don’t want to use RDF and SPARQL at all, or they want to use it for everything. The key is to use RDF and SPARQL for things that require atomic facts and relationships. In other words, you would use these in cases in which you are going to want to do a query that traverses the graph.
Instead of simply searching for customers who have an address in this postal code, you can ask for “all customers who have a postcode that’s in a town with a population of more than five million and in a country where EU regulations apply” – that kind of thing, where you can see a query going out across a graph as opposed to looking up something in a table or in a document, where you need to know how things are laid out ahead of time.
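As a rough illustration of a query “going out across a graph,” the following Python sketch walks from customer to postcode to town to country over a handful of in-memory triples. In practice this would be a SPARQL query against a triple store; the data, the predicate names, and the helper functions here are all assumptions made up for the example.

```python
# Hypothetical facts as (subject, predicate, object) triples.
TRIPLES = [
    ("cust1", "has_postcode", "PC-A"),
    ("cust2", "has_postcode", "PC-B"),
    ("cust3", "has_postcode", "PC-C"),
    ("PC-A", "in_town", "TownA"),
    ("PC-B", "in_town", "TownB"),
    ("PC-C", "in_town", "TownC"),
    ("TownA", "population", 8_000_000),
    ("TownB", "population", 50_000),
    ("TownC", "population", 9_000_000),
    ("TownA", "in_country", "CountryX"),
    ("TownB", "in_country", "CountryX"),
    ("TownC", "in_country", "CountryY"),
    ("CountryX", "eu_regulations_apply", True),
    ("CountryY", "eu_regulations_apply", False),
]


def objects(subject, predicate):
    """All objects of triples matching the given subject and predicate."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def matching_customers():
    """Customers whose postcode is in a big town in an EU-regulated country."""
    result = []
    for customer, predicate, postcode in TRIPLES:
        if predicate != "has_postcode":
            continue
        for town in objects(postcode, "in_town"):
            big = any(pop > 5_000_000 for pop in objects(town, "population"))
            eu = any(
                flag is True
                for country in objects(town, "in_country")
                for flag in objects(country, "eu_regulations_apply")
            )
            if big and eu:
                result.append(customer)
    return sorted(result)
```

With this toy data, only `cust1` qualifies: `cust2` lives in a small town, and `cust3` lives in a large town in a country where the EU flag is false. The query follows relationships one hop at a time and never needs to know a table or document layout in advance.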
Q: I think there’s still a lot of confusion in the marketplace relative to semantics and graph technology and terminology – at MarkLogic we use SPARQL and RDF, but I know there are other approaches to doing graphs – what’s your advice to people when they are evaluating a number of different technologies or approaches?
The nomenclature is somewhat confusing – and, as usual with naming, there is no king of the universe for names, and so we have to figure out as best we can what’s best practice for a particular name in a particular domain. As far as naming goes, a graph database is either a triple store or a property graph database. What MarkLogic does makes it a graph database, but it’s that subset of graph databases that are triple stores – because we deal in triples, RDF and SPARQL.
The other branch of graph database is the property graph database. It’s still a graph, but it persists the data differently and indexes it differently. So, rather than the query that we were talking about – “find me all customers that have a postcode that’s in a town that has a population of more than five million and is in a country where EU regulations apply” – rather than that graph traversal, which RDF and SPARQL are very good at, the property graph database is good at answering questions about the whole graph. For example, “which node in this graph has the most inputs and outputs” or “what’s the shortest weighted path from this node over here to that node over there.” We can answer those types of questions in MarkLogic, too, but we won’t do it from an index.
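For readers who have not met these whole-graph questions before, here is a minimal Python sketch of both of them, computed by brute force in memory. It says nothing about how a property graph database or MarkLogic answers them internally; the edge list and function names are invented, and the shortest-path routine is a textbook Dijkstra search.

```python
import heapq

# Hypothetical directed, weighted edges: (source, target, weight).
EDGES = [
    ("a", "b", 1.0),
    ("b", "c", 2.0),
    ("a", "c", 5.0),
    ("c", "d", 1.0),
]


def degree_counts(edges):
    """Count inputs plus outputs for every node."""
    counts = {}
    for src, dst, _weight in edges:
        counts[src] = counts.get(src, 0) + 1
        counts[dst] = counts.get(dst, 0) + 1
    return counts


def shortest_path_cost(edges, start, goal):
    """Dijkstra's algorithm; returns the lowest total weight, or None."""
    adjacency = {}
    for src, dst, weight in edges:
        adjacency.setdefault(src, []).append((dst, weight))
    best = {start: 0.0}
    queue = [(0.0, start)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == goal:
            return cost
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry; a cheaper route was already found
        for nxt, weight in adjacency.get(node, []):
            new_cost = cost + weight
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(queue, (new_cost, nxt))
    return None


counts = degree_counts(EDGES)
busiest = max(counts, key=counts.get)
```

Both functions have to look at the whole edge list rather than follow a few relationships from a known starting point, which is the difference from the traversal-style query above.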
So, those are the two different kinds of graph databases. A lot of the time people find that they can do most of what they want with a triple store – and they can do way more [with MarkLogic] than they can do with a regular triple store, because with MarkLogic they’ve got a triple store and a document store, and triples and documents intertwingled. The things that a property graph database is particularly good at, you can do with brute force in a triple store – but occasionally you’ll come across a use case where “no, I really want to do just graph analytics” and for that you need a special-purpose property graph database.
The other difference is – for those that care about these kinds of things – SPARQL and RDF are W3C standards, and so there are specs for them and everybody knows what they’re supposed to do, and there are tools around them and so on. With property graph databases, there is no standard governing either the data model or the query language, so they tend to be very much vendor-specific.
Q: What’s your favorite or most important tip that you can give about semantics to new developers?
Be open to the idea of semantics – learn about RDF and SPARQL – but don’t go all in and say, “Okay, now I can do everything in RDF and SPARQL.”
Look at the graph model and the document model, and then look at some of the things that MarkLogic does to let you use those two models – not only side by side, but MarkLogic also allows you to merge those models.
For example, we have an interesting way to derive triples from the information in a document. Rather than having some information in documents, some in triples, and then maybe going through your documents and doing some transformation – pulling those triples out and putting them into a different storage mechanism – you can put all your basic information in documents and then let MarkLogic derive the triples or the parts of that information that are useful, to be indexed as triples.
We’ll just put those in an index – we won’t make a copy, we won’t make you track which triples came from which document (we’ll store that in the index), we won’t make you manage the triples alongside the documents (we do all that for you as part of our indexing). This means you can store your information once, you can persist it as documents, and then have it indexed as documents, indexed as triples, you can even have it indexed as SQL (so you’ve got a SQL lens over that data). It makes it very simple, very easy to manage the information – and then you can write these very powerful queries that combine document search and graph query as well.
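The idea of storing once and indexing as triples can be pictured with a short Python sketch. This is a conceptual toy, not MarkLogic’s API or index format: the document URIs, the field-to-predicate “template,” and the `build_triple_index` function are all hypothetical.

```python
# Hypothetical documents, keyed by URI. These are the only stored copies.
DOCUMENTS = {
    "/customers/1.json": {"id": "cust1", "name": "Ada", "postcode": "PC-A"},
    "/customers/2.json": {"id": "cust2", "name": "Bo", "postcode": "PC-B"},
}

# Hypothetical template: which document fields to expose as triples,
# and under which predicate name.
TEMPLATE = [("name", "has_name"), ("postcode", "has_postcode")]


def build_triple_index(documents, template):
    """Derive triples from documents and record which document each came from."""
    index = {}
    for uri, doc in documents.items():
        for field, predicate in template:
            if field in doc:
                triple = (doc["id"], predicate, doc[field])
                index.setdefault(triple, set()).add(uri)
    return index


index = build_triple_index(DOCUMENTS, TEMPLATE)
```

Here the index maps each derived triple back to its source document, so nobody has to maintain a second copy of the data or track provenance by hand. Rebuilding the index from the documents is always possible because the documents remain the single source of truth.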
Q: What about for more advanced developers – people who may have already been using semantics for a while?
I would suggest digging deeper into how you can use not just documents as well as triples, not just documents alongside triples, but documents and triples intertwingled – which is my all-time favorite word.
Q: You’ve talked a lot about MarkLogic and semantics, but only a little about the role semantics plays in data integration. Can you say more about that?
I’ll be saying a lot more about using semantics for data integration in my talk at MarkLogic World…
Stephen is the product manager for Search and Semantics at MarkLogic, where he has been a member of the Products team since 2005. He is the co-author of “Querying XML” and a contributor to “Database Design,” a book in Morgan Kaufmann’s “Know It All” series. Before joining MarkLogic, Stephen was Director of Product Management for Text and XML at Oracle Corporation.
To learn more about MarkLogic Semantics:
- Come to MarkLogic World in San Francisco on May 7-10, 2018 and attend the session “Using MarkLogic Semantics for Data Integration,” which is part of the Advanced Technical Track. To register and attend for FREE, visit marklogic.com/world
- Watch the session “Getting the Most from MarkLogic Semantics” from last year’s MarkLogic World
- Visit the Semantics page on our website