(From “Star Trek Into Darkness,” 2013)
Imagine you’re an analyst at an intelligence agency, leading a task force tracking and trying to capture a suspected terrorist. Intelligence about this person’s movements is pouring in from all sides: human reports, signals (both human and machine communication), imagery, full-motion video, and social media. Timely synthesis and understanding of this all-source data can make the difference between capturing this person, potentially preventing the next terrorist attack, and failing to. How much time can you afford to waste?
Knowing your data, and knowing it on time, is crucial to your task. Do your systems allow you to search over new intel as soon as it hits the database? If not, there is a lag between the moment the intel arrives and the moment you become aware of it and can put it to use. A lot can happen in that time, and the price can sometimes be too high.
If you’re storing your intel in a relational database, you have the power of SQL at your disposal. Most relational databases update their indexes immediately so data is immediately available for querying.
However, SQL is not the language you want when searching all-source content. Instead of joining relational tables in your queries, you want to ask questions like “find me all emails or phone calls where Joe is the sender and Mo is the recipient, where any of the words ‘dinner’, ‘circus’ or ‘sleep’ is mentioned, and where Joe was no more than 3 miles from the airport.”
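To make that question concrete, here is a sketch of it as a structured, engine-neutral query. This is purely illustrative; it is not the syntax of any particular search product, and the field names and coordinates are hypothetical.

```python
# The compound question above, expressed as nested clauses that a
# search engine would evaluate against its value, full-text, and
# geospatial indexes, then intersect.
compound_query = {
    "and": [
        {"field": "sender", "equals": "Joe"},
        {"field": "recipient", "equals": "Mo"},
        {"doc_type": {"in": ["email", "phone_call"]}},
        {"text": {"any_of": ["dinner", "circus", "sleep"]}},
        {"geo": {
            "field": "sender_location",
            "within_miles": 3,
            # hypothetical airport coordinates
            "of": {"lat": 38.8512, "lon": -77.0402},
        }},
    ]
}

print(len(compound_query["and"]))  # 5 clauses, one per condition
```

Note that no single clause here is a relational join: the query mixes exact values, free text, document types, and geography, which is exactly what plain SQL handles poorly.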
Searching over emails, observation reports, and phone calls, combined with temporal and geospatial aspects, requires an advanced search engine. That software usually sits alongside the database that stores the intel. The intelligence data stored in the database arrives with high velocity, volume, and variety. It takes time for every new database record to be processed and indexed by the search engine.
Why is this?
Your intel is stored in a database but with most databases your search engine does not access it directly. Instead, data has to be transported from the database to the search engine.
The database and the search engine have different sets of APIs, and they support different data types and structures. To make your intel available for search, the database content has to be transformed and moved to the search engine. After every new piece of information or change is committed to the database, it has to be propagated to the search engine indexes, and very often portions of the index have to be rebuilt.
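Here is a minimal sketch of the kind of glue code this propagation requires: a polling pass that pulls changed records from the database, flattens each one into the search engine’s document shape, and pushes it to the index. Everything here is a toy stand-in with hypothetical names; real pipelines also handle batching, retries, deletions, and partial index rebuilds.

```python
class ToyDB:
    """Stand-in for the relational store: records carry a version number."""
    def __init__(self):
        self.records, self.version = [], 0

    def insert(self, rec_type, fields):
        self.version += 1
        self.records.append({"id": self.version, "type": rec_type,
                             "fields": fields, "version": self.version})

    def changes_since(self, v):
        return [r for r in self.records if r["version"] > v]


class ToySearchEngine:
    """Stand-in for the search engine: a trivial keyword lookup."""
    def __init__(self):
        self.docs = {}

    def index(self, doc):
        self.docs[doc["id"]] = doc

    def search(self, word):
        return [d["id"] for d in self.docs.values() if word in d["text"]]


def sync_changes(db, engine, last_seen=0):
    """One polling pass: transform changed records and push them to the index."""
    for record in db.changes_since(last_seen):
        # Flatten the record into the engine's document shape.
        doc = {"id": record["id"], "type": record["type"],
               "text": " ".join(str(v) for v in record["fields"].values())}
        engine.index(doc)
        last_seen = max(last_seen, record["version"])
    return last_seen


db, engine = ToyDB(), ToySearchEngine()
db.insert("phone_call", {"from": "Joe", "to": "Mo",
                         "transcript": "dinner at the circus"})
# Until this pass runs, the committed call is invisible to search.
cursor = sync_changes(db, engine)
print(engine.search("circus"))  # [1]
```

The lag the article describes lives in the gap between the `insert` and the next `sync_changes` pass; everything committed in between is stored but unsearchable.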
The code to orchestrate all of this is heroic. It takes the most talented technical staff, is complex and brittle, and requires never-ending modification and tuning as the nature of your data changes.
If you decide to track a new data source, like security camera feeds, you not only need to save that into your database but your developers need to transform it once again so that the search engine can digest it.
When a record in your database gets an update, say the current location of an observed person changes and that update hits the database, it will take some time before your search engine queries start returning the new location. In the meantime, the person of interest may have moved again.
As a consequence, the more intel you gather during your operation, the longer the lag before that data becomes available for search. The more progress you make collecting data, the more it slows you down.
Do you see how precious time is lost accommodating the needs of the technology? It should be the other way around – the technology should be serving you.
If you were observing the movements of a specific car in real time based on security camera feeds (assuming you had plate-recognition software providing the car’s location in real time), search engine latency in indexing the latest intel would produce outdated answers to your queries. By the time the search engine notifies you that the observed car is in the vicinity of the airport, it will be too late. That car may be carrying a bomb.
To battle the problem of stale data, so-called low-latency databases have been emerging in recent years. These databases promise that the time it takes to access data is reduced to zero or, more realistically, to a few milliseconds. They are often scalable in-memory databases that can handle huge amounts of data. Nothing is faster than accessing data stored in memory, and these databases keep that promise.
However, this performance comes with tradeoffs. Often, these databases guarantee only “eventual” consistency of data. This is much weaker than the consistency achieved with the ACID transactions that most relational databases provide. What does that mean for you? There’s no guarantee that these systems will not lose your intel.
Another tradeoff is an extremely limited ability to query the data in these databases. By comparison, even classical relational databases offer very rich querying capabilities, and we’ve seen that even SQL is not enough.
The latency between the database and the search engine can be lowered by indexing only a subset of your sources or by simplifying the representation of the content. That reduces the complexity of the transformation between the two systems and makes the data available for search faster.
However, the more you simplify or cut down on the data you can search, the greater the risk of missing a crucial piece of information. Let’s say you decide to expose for search only those telephone transcripts that contain certain words you know are code words. What if the codes change in the meantime, but the old codes continue to be used in meaningless conversations? You now have a useless or even misleading piece of intel in your search engine, while the useful information sits only in the database, where your advanced search capabilities cannot reach it.
How much lag or latency can you afford in your operation? Is it OK to learn about a crucial phone call 12 hours after it happened? No? It would be better to know about it immediately, right?
Is this possible? Sure it is, if you have the right technology.
What you need is a near real-time search indexing or, ideally, zero-latency indexing. Come again, what is zero-latency indexing?
It is the ability to index structured and unstructured data instantaneously, making it available for search as soon as it is committed to the database. Given the velocity, variety, and volume of the intel you are gathering, you will want a database that can store all those different types of data. That would be a non-relational (NoSQL) database.
But if you are going to store data, you need to be able to find it. And the solution to powerful finds is a powerful universal index – with zero latency. In the world of non-relational databases, few products combine zero-latency indexing with powerful search functionality. Once again, zero-latency indexing is the ability to save data into the database and update the search indexes within the same transaction. One such product is MarkLogic.
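The idea of “index within the same transaction” can be sketched in a few lines: a toy store whose commit updates both the document and an inverted index as one atomic step, so a document is searchable the instant the commit returns. This is a conceptual illustration only, not MarkLogic’s implementation, and it handles inserts of new documents, not re-indexing of updates.

```python
import threading

class ZeroLatencyStore:
    """Toy illustration: the document write and the index update happen
    inside one lock-protected 'transaction', so there is never a window
    where committed data is invisible to search."""
    def __init__(self):
        self._lock = threading.Lock()
        self._docs = {}    # uri -> text
        self._index = {}   # word -> set of uris (inverted index)

    def commit(self, uri, text):
        with self._lock:   # one atomic unit of work: store AND index
            self._docs[uri] = text
            for word in text.lower().split():
                self._index.setdefault(word, set()).add(uri)

    def search(self, word):
        with self._lock:
            return sorted(self._index.get(word.lower(), set()))


store = ZeroLatencyStore()
store.commit("/intel/report-1", "observed car near the airport")
print(store.search("airport"))  # ['/intel/report-1'] -- searchable immediately
```

Contrast this with the polling pipeline earlier: there is no separate sync pass to wait for, because the index is never allowed to fall behind the data.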
What makes MarkLogic unique is that it has ACID transactions and rich search capabilities, combining structured and unstructured search with geospatial and temporal aspects. MarkLogic is also a triple store, so you can combine your intel with semantic information stored as graphs and query it using SPARQL. Information ingested into MarkLogic is available for search and alerting on the next CPU cycle. With MarkLogic you can be sure your valuable data is saved quickly and efficiently into the database, that it doesn’t get lost, and that it is immediately available for the most complex search queries.
We all know that information is the new currency, right? Well, only if you know that you have it. Only then can you use it to your advantage. With MarkLogic, the search indexes are up to date as soon as your intel is saved into the database. Remember the car near the airport? MarkLogic can provide the correct answer to the question “Is the observed car near the airport?” at any time.
Not only that, it will also alert you as soon as the intel you’re interested in hits the database. You can define questions such as “Has Joe entered his car, and is he moving on highway T-16?” in advance. As soon as a matching observation report hits the database, you are alerted. MarkLogic’s zero-latency indexing is what makes real-time tracking of Joe and his car possible.
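This alerting pattern is sometimes called a reverse query: instead of running a query against stored documents, each incoming document is run against stored queries at commit time. A minimal sketch, with hypothetical names and a deliberately naive word-matching rule:

```python
class Alerter:
    """Toy reverse-query matcher: queries are registered up front, and
    every incoming document is checked against all of them."""
    def __init__(self):
        self._queries = {}  # query name -> set of required words

    def register(self, name, required_words):
        self._queries[name] = {w.lower() for w in required_words}

    def on_ingest(self, doc_text):
        """Called at commit time; returns the names of matching queries."""
        words = set(doc_text.lower().split())
        return sorted(name for name, required in self._queries.items()
                      if required <= words)


alerter = Alerter()
alerter.register("joe-on-t16", ["joe", "car", "t-16"])

report = "observation: Joe entered his car and is moving on highway T-16"
print(alerter.on_ingest(report))  # ['joe-on-t16']
```

Because the check runs as part of ingestion, the alert fires the moment the observation report is committed, rather than on the next polling pass of an external pipeline.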
Even if your job doesn’t include catching the bad guy – how much latency can your organization accept?