Search is at the core of MarkLogic’s architecture, and our built-in search functionality is a big part of what makes MarkLogic so easy to work with. I sat down with Distinguished Engineer, Mary Holstege, in advance of her upcoming presentation at MarkLogic World (May 7-10, 2018), to get her perspective on the evolution (and some misconceptions!) of search (and query!) capabilities, and to ask her to share some of her favorite tips for developers.
Mary is the second-longest tenured engineer at MarkLogic (after our founder Christopher Lindblad) and has made significant contributions to the core server product over the years. Outside of MarkLogic, Mary has submitted papers and presented at industry conferences around the world, and has advanced the state of the art for markup, XQuery, and Search. (She is also a skilled doodler – evidence of which can be found here and here.)
Q: How have you seen “search” evolve in your time here at MarkLogic?
We did a lot of investment in search reasonably early on – in the version 3.2/4.0 timeframe – and the work we’ve done more recently is more tactical, in that it’s leveraging our architectural strength – search – to accomplish other ends. For example, Element Level Security is largely wired into Search. So, the more recent things we’ve done, I don’t think people think of as “search” at all – although I do.
Q: Do you feel that when you talk to people, they get too wrapped up in the word “search” and think about it in terms of, say, enterprise search – versus search being foundational for all these other things?
Yes, I do. And I think we sometimes do ourselves and our customers a bit of a disservice, because it leads people to think about how to use MarkLogic in ways that actually it isn’t best at. You’re going to get the best performance out of MarkLogic if you actually rely on the search indexes because search indexes are designed to scale and perform. And they do. But people tend to veer towards trying to use some of the other indexes, preferentially; they have their place too, but the Universal Index is always going to be the best bet.
Here’s an example: people often over-use range indexes, because they look more like relational columns – and people coming from a relational world find that familiar and comfortable. Range indexes absolutely have their place, but every range index lives in memory, and that’s a cost that can add up. I’ve seen people overuse triples, too, turning absolutely everything into a triple. Documents provide context for facts; you’re much better off using documents to represent entities with selected triples as needed, range or geo indexes for selected data items, and rows to project out basic facts. You’re going to get the best results if you use the totality of the platform, and the right tool for the job.
But I also think that when you say “search,” people think enterprise search and ask, “well, what does that have to do with my database?” I think that’s a good thing, because it starts those discussions. So it’s sort of a balancing act. You don’t want to scare away people who actually care about search. You don’t want to mislead people into misusing the platform – but you also don’t want to make people think that they aren’t dealing with a database.
Q: Where do you feel the real challenge is, then, in terms of understanding? If you had a megaphone, to reach out into the world, what would you shout into it?
An important thing to understand is that MarkLogic really is a search engine. Some people who join MarkLogic think at first, “I’m joining a database company – MarkLogic is a database – this search stuff doesn’t matter.” And I think it’s fine from a messaging point of view, to the world, to emphasize more of the “database” functionality – but it’s important to understand what’s really going on inside. This is especially true for the technical folks, because technically you can’t properly leverage a platform if you don’t understand why it’s doing what it’s doing, and how it’s doing what it’s doing.
When I talk to people outside the company – mostly the people who want to talk to me care about true “search” – and I find they’re more scared of the opposite. They’ll say things like, “you seem to be running away from search – do you not care about us anymore?” I get that a lot, even though it’s not true.
Building Better Search Applications
Q: The abstract for your talk mentions creating “magical” search experiences. What’s the coolest search experience you’ve seen one of our customers build?
It’s hard for me to pick out a specific one, but the ones that speak to me the most are where the interface of how you express queries is visual – so it’s sliders and timelines and maps and graphs – and the responses are likewise visual. People tend to think of search as “I typed in a query string, and I got back a set of links” – and certainly that’s everybody’s “Ur experience,” you know, you go to Google and that’s what you get. But, that’s actually not what you get half the time out of Google – you get facts, you get diagrams, you get maps. I think the applications that do that are the ones that are the most compelling. And you get that by mixing some of the modalities we have in the database – the semantics with the geospatial with the more full-text capabilities.
I’ve seen query builders that are just using graphs to create links – this is connected to this – is connected to this – and it turns into a query underneath. I like the sliders and map interfaces too. All the “mix-y” stuff.
Q: Where do people tend to struggle when building search apps?
I think they focus on extending their queries and making their interfaces nice – and they neglect their data. Often the best way to make your searches more effective is to just reshape your data a little bit. And I’ve seen people just build these horrific, complicated, underperforming queries – that if they had just tweaked their data just the tiniest little bit, it would have been much simpler and much more effective.
This will be a focus of my talk at MarkLogic World next month – there’s a triad. The “Search tripod.” There’s the query, the data, and the response – and you can work on each of them to improve the whole. And the data leg is the one that people neglect completely.
Reshaping Data to Improve Search
Q: Interesting! Can you give an example of what it looks like to ‘neglect your data’?
A pattern I see a lot is people have some fairly repetitive substructure – if you think in more business data terms, “items” within a larger entity – and they want to query a specific one of those out of the universe of all of them within a document. And you get these huge nestings of element queries – whereas if you just split those items out, each into its own little document, you could search for them directly. You would just pull over whatever metadata you needed from the larger entity, so you could query for them together without a join.
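To make that reshaping concrete, here is a minimal sketch in plain Python. It is not MarkLogic code and does not use any MarkLogic API; the order document, its field names, and the `split_items` helper are all hypothetical, chosen only to illustrate pulling each item into its own small document along with selected metadata from the parent.

```python
from copy import deepcopy


def split_items(order, shared_keys=("order_id", "customer", "order_date")):
    """Return one small document per item, each carrying selected parent metadata.

    The field names here are hypothetical; substitute whatever the larger
    entity actually contains.
    """
    docs = []
    for position, item in enumerate(order["items"], start=1):
        # Copy over only the parent metadata needed for querying.
        doc = {key: order[key] for key in shared_keys if key in order}
        doc["item_position"] = position
        doc["item"] = deepcopy(item)
        docs.append(doc)
    return docs


# Hypothetical parent entity with a repetitive "items" substructure.
order = {
    "order_id": "A-100",
    "customer": "Acme",
    "order_date": "2018-04-02",
    "items": [
        {"sku": "X1", "color": "red", "qty": 2},
        {"sku": "Y7", "color": "blue", "qty": 1},
    ],
}

item_docs = split_items(order)

# "A blue item ordered by Acme" is now a flat match against a single
# document, with no nested per-item scoping and no join to the parent.
matches = [
    d for d in item_docs
    if d["customer"] == "Acme" and d["item"]["color"] == "blue"
]
```

The trade-off in this sketch is some duplicated parent metadata in exchange for a much simpler query; which parent fields are worth copying depends on what you actually need to search on.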
A similar case, in more of the text search side, is where there may be different kinds of paragraphs that have metadata tags on them. You could search for “this paragraph that contains this, with this metadata tag and that metadata tag” – or you could just change the markup of the paragraph itself to say it’s a patent document, and this paragraph is a claim – and simplify your life completely.
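As a rough illustration of that markup change, the sketch below uses Python's standard `xml.etree.ElementTree` module rather than anything MarkLogic-specific. The `patent`, `p`, and `type` names, and the `promote_type_to_tag` helper, are assumptions made up for the example: the idea is simply to turn a generic paragraph with a metadata attribute into an element whose name says what it is.

```python
import xml.etree.ElementTree as ET

# Hypothetical source markup: generic paragraphs distinguished only by metadata.
SOURCE = """<patent>
  <p type="abstract">A widget for sorting.</p>
  <p type="claim">A widget comprising a lever.</p>
  <p type="claim">The widget of claim 1, with a spring.</p>
</patent>"""


def promote_type_to_tag(root, generic_tag="p", attr="type"):
    """Rename each generic element after the value of its metadata attribute."""
    # Materialize the list first so renaming does not interfere with iteration.
    for elem in list(root.iter(generic_tag)):
        kind = elem.attrib.pop(attr, None)
        if kind:
            elem.tag = kind
    return root


root = promote_type_to_tag(ET.fromstring(SOURCE))

# Finding claims is now a lookup by element name, not "paragraph plus tag" logic.
claims = [claim.text for claim in root.iter("claim")]
```

This assumes the attribute values are valid element names; real data would need a mapping step for values that are not.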
People tend to be more willing to work on their data in cases like “my dates have inconsistent formats, so I’ll create a normalized field that has the normalized value,” so you don’t have to do a big OR search to find the thing. People have been more willing to do that kind of work. But reshaping your data more broadly can really pay off. It just comes down to finding a way to identify the semantics of what you want to search for, and putting that in the mark-up. This is always more effective than trying to add it in after the fact.
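The normalized-date case might look something like the following sketch, again in plain Python rather than any MarkLogic API. The list of formats, the field names `date` and `date_normalized`, and both helper functions are assumptions for illustration; in particular, slash-separated dates are assumed to be month/day/year.

```python
from datetime import datetime

# Formats assumed to occur in the source data (hypothetical list).
# Slash-separated dates are treated as month/day/year.
KNOWN_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d %B %Y", "%B %d, %Y")


def normalize_date(raw):
    """Return the date as an ISO 8601 string, or None if no known format matches."""
    text = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def add_normalized_date(doc, source_key="date", target_key="date_normalized"):
    """Return a copy of the document with a normalized date field alongside the original."""
    enriched = dict(doc)
    enriched[target_key] = normalize_date(doc[source_key])
    return enriched


docs = [
    {"id": 1, "date": "2018-05-07"},
    {"id": 2, "date": "05/07/2018"},
    {"id": 3, "date": "7 May 2018"},
    {"id": 4, "date": "May 7, 2018"},
]
enriched = [add_normalized_date(d) for d in docs]

# One comparison on the normalized field replaces an OR over every format.
same_day = [d["id"] for d in enriched if d["date_normalized"] == "2018-05-07"]
```

Keeping the original value next to the normalized one means nothing is lost if a format turns out to have been guessed wrong.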
Search Best Practices
Q: What’s your favorite or most important tip for new developers?
There are so many tips… The biggest one is the overall “you’re often not doing field lookup, you’re doing full-text search – and just embrace that. Embrace the fact that we’re a document database, indexed to access document content. And that’s not a bad thing.” People want to try to interpret some of our query operators as if they were relational database operators – and they’re just not. Value queries are not equality against a value – that’s not what they are. And so, I think that it gets back to what I was saying: understand that it is really search indexes under there. And understand what that means, because that makes all the difference. They can be very effective for you, but you’ve got to understand the fire you’re playing with.
Q: It does seem like it’s quite a mind-shift for people coming from the relational world.
Yes, it is.
Q: What about for more advanced developers – some of the people who may be coming to your session this year at MarkLogic World?
I would say that – this is kind of a strange answer, but I think it may be true – a lot of people are looking for technical solutions to psychological problems. What I mean by that is great applications are solving psychological needs, for humans. And you need to understand the humans to know what you need to do. No amount of throwing cute techniques at it is going to solve the real problem if you haven’t understood what people are actually trying to do. I do see that a lot – where people jump through tremendous hoops and think “I need all these techniques or scoring algorithms” – when really what it is, is that they haven’t understood what their users are trying to get out of it.
From a technical point of view, I think we have a bunch of tools in MarkLogic that can be applied fairly effectively. For example, I think adding a dash of semantics to your search – versus a “pure” search or semantics approach – is very effective, and can make both sort of work better together. Whereas, if you try to go all the way one way or the other, it can be harder. It comes down to using all of the modalities of the platform – because they all work together, and they’re more effective if you balance them. So, not going overboard with range indexes, not going overboard with semantics, but using a little of them, in sort of key places.
Q: Ahh, so that’s that “pick the right tool for the job” thing.
Pick the right tool for the job. And we’ve got a lot of little tools in there, and it’s a question of, you know, don’t make life too hard for yourself.
Q: Your session at MarkLogic World is always one of the most popular ones. What are you most excited to talk to people about this year, either in your session or to attendees in general?
I’m always excited to talk about what I’ve actually been working on – this year, I’ve been working on ontology-driven entity extraction – which sounds very grand, but it’s just another little tool that can be used in various places. One of the things I’d like to talk about is how to do really smart query analysis – and this is one of the tools you can use to do that.
To hear more from Mary, come to MarkLogic World in San Francisco on May 7-10, 2018 and attend her session “Best Practices for World-Class Search,” which is part of the Advanced Technical Track. To register and attend for FREE, visit marklogic.com/world.
You can also view her presentation from last year’s MarkLogic World conference, Search from First Principles: Powering Better Results, to get an introduction to the core fundamental principles of search and indexing in MarkLogic, explained through concrete analogies.