Graph Stores Alone — Not Enough for Tracing Drug Lot Genealogy

Pharmacovigilance, or drug safety, is of the utmost importance to everyone. Personally, I assume all medications in the marketplace are safe. When I hear otherwise, such as when the recalls for Tylenol or Vioxx happened, I pay attention!

So, too, do pharmaceutical companies: the costs to them in brand damage, regulatory fines, and lawsuits alone can easily reach into the billions of dollars.

Pharmacovigilance, which the World Health Organization defines as “the science and activities relating to the detection, assessment, understanding and prevention of adverse effects or any other drug-related problem,” is key to avoiding such risks. And for pharmaceutical companies, that means quickly and easily analyzing and detecting anomalous signals at every point of the drug life cycle — from manufacture to post-marketing.

This is not an easy task; these signals can be lost in the proliferation of data coming from pharmaceutical quality systems that include:

  • materials systems,
  • equipment and facilities,
  • production,
  • laboratories (clinical trial data, public domain and real-world evidence (RWE) related sources),
  • packaging and labeling, and
  • quality systems.

Pharmaceutical Quality Systems

Rigorous quality assurance requires stringent controls throughout the supply chain. The two biggest culprits in recalls are human error and defective raw materials. Experts say that increasing automation is necessary to reduce human error.

Materials systems alone generate complex data relationships, before even considering data from the other systems listed above. For example, drugs are made from materials that are tracked in lots, and lots can be combined to make other lots during production. Multiple lots can be combined into a single new lot, which can in turn fan out into multiple other lots. This is a many-to-many relationship, and all of this information is part of the “lot genealogy,” typically maintained in systems such as Oracle or SAP.
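
As a rough illustration of that many-to-many structure, here is a minimal Python sketch. It assumes invented lot IDs and a plain in-memory edge list, not any real Oracle or SAP schema. Each edge records that an input lot was consumed into an output lot, and a breadth-first walk in either direction recovers a lot's descendants or ancestors.

```python
from collections import defaultdict, deque

# Hypothetical genealogy edges: (input_lot, output_lot).
# All lot IDs are invented for illustration only.
EDGES = [
    ("RAW-A", "MIX-1"),
    ("RAW-B", "MIX-1"),   # two raw lots combined into one mix lot
    ("RAW-B", "MIX-2"),   # the same raw lot also feeds a second mix lot
    ("MIX-1", "TAB-10"),
    ("MIX-1", "TAB-11"),  # one mix lot fans out into two tablet lots
    ("MIX-2", "TAB-11"),
]


def build_index(edges):
    """Index edges in both directions so traversal can go up or down."""
    children = defaultdict(set)
    parents = defaultdict(set)
    for src, dst in edges:
        children[src].add(dst)
        parents[dst].add(src)
    return children, parents


def reachable(start, neighbors):
    """Breadth-first walk returning every lot reachable from start."""
    seen = set()
    queue = deque([start])
    while queue:
        lot = queue.popleft()
        for nxt in neighbors.get(lot, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


CHILDREN, PARENTS = build_index(EDGES)


def descendants(lot):
    """Every lot that contains material from the given lot."""
    return reachable(lot, CHILDREN)


def ancestors(lot):
    """Every lot whose material ended up in the given lot."""
    return reachable(lot, PARENTS)
```

Calling descendants("RAW-B") answers "where did this raw material end up?", while ancestors("TAB-11") answers "what went into this finished lot?". Keeping both indexes is a design choice that makes the two directions equally cheap.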

Harry Bakken, principal consultant with Avalon Consulting, LLC, which provides solutions on advanced platform technologies for life sciences companies, elaborated.

“There are a significant number of companies that supply pharmaceutical manufacturers with supplies, components and ingredients. All of their products need the same stringent record keeping and traceability. Products can be manufactured ‘within specifications’ and pass QA verification, and can later be called into question by end users or downstream manufacturers as problems surface. Defects in goods elsewhere in the supply chain are difficult to evaluate and document.”

Unfortunately, the underlying models of the relational database constrain the analyst who is examining these lots or introducing new information to be analyzed with this data. The logical model for these systems is pretty straightforward and can easily define the hierarchy of products within the assembly of a lot. The physical model, however, creates challenges around accessing information related to other information.

That is to say, when you want to query data related to other data, you need to look within the hierarchy, getting answers as you “traverse the tree in any direction.” For example, to query the hierarchy of lots, or associated attributes of the lots, or both at the same time so that you can do discovery and detect signals, you would need to search back and forth across the data set. In relational databases, this involves many queries and the responses would need to be combined — adding complexity and perhaps introducing error into the process. And it gets more complex as the relationships increase in size and complexity.


Semantics Help Tame Data Complexity

Some architects are turning to semantic and graph model databases to handle this complexity. For example, one customer wanted to analyze the genealogy data for these lots. The data was stored in SAP, and the customer extracted it into a Hadoop cluster, then loaded it from Hadoop into Neo4j to view the genealogy hierarchy. In Neo4j, the customer can view the hierarchy of lots that make up the lineage of a drug, which provides a very quick and performant view of that lineage. Users can now examine these lots, and the differing levels of lots, looking for deviations or signals that depart from the norm for the batches.

While the graph database does provide a hierarchical view of the data, the devil is in the details, and in this case the detail is comprehensive search. With a graph-only database, users need to know which lots to start their search from. But what if you want to know the lots from a manufacturer or supplier within a specific date range? And which lots were those materials used in? Or, if you have suspect lots, what characteristics do those suspect lots have in common? The graph database would need to be connected to the lots database to answer these questions easily. If it is not, users have to query several different databases before they can begin analyzing.
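
To make the two-step nature of that question concrete, here is a small Python sketch. It assumes hypothetical lot records and usage links held in memory; the supplier names, dates, and lot IDs are all invented. The attribute search (supplier plus date range) finds the starting lots, and only then can the graph walk find where those materials were used.

```python
from datetime import date

# Hypothetical lot attributes, as they might sit in a separate "lots" store.
LOTS = {
    "RAW-A": {"supplier": "Acme", "received": date(2023, 1, 10)},
    "RAW-B": {"supplier": "Acme", "received": date(2023, 3, 5)},
    "RAW-C": {"supplier": "Borealis", "received": date(2023, 1, 20)},
}

# Hypothetical genealogy links: lot -> lots it was used in.
USED_IN = {
    "RAW-A": ["MIX-1"],
    "RAW-B": ["MIX-2"],
    "RAW-C": ["MIX-1"],
    "MIX-1": ["TAB-10"],
    "MIX-2": ["TAB-11"],
}


def lots_from_supplier(supplier, start, end):
    """Step 1 (search): lots from a supplier received within a date range."""
    return sorted(
        lot_id
        for lot_id, rec in LOTS.items()
        if rec["supplier"] == supplier and start <= rec["received"] <= end
    )


def downstream(lot_id):
    """Step 2 (graph): every lot that the given lot ended up in."""
    found = set()
    stack = [lot_id]
    while stack:
        for nxt in USED_IN.get(stack.pop(), []):
            if nxt not in found:
                found.add(nxt)
                stack.append(nxt)
    return found


def affected_lots(supplier, start, end):
    """Combine both steps: starting lots mapped to their downstream lots."""
    return {
        lot: sorted(downstream(lot))
        for lot in lots_from_supplier(supplier, start, end)
    }
```

When the attributes and the genealogy live in separate systems, the join performed inside affected_lots has to be done by hand across databases, which is the friction described above.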

Or consider a situation where you have similar batches that are supposed to be the same, but different suppliers for the raw materials and different outcomes for the batches. What caused the delta between the batches? Or you have the same materials but batches produced at different plants. Again, how can one easily analyze all of the data involved in this process? The analysis is complex. It can start with specific lots, parameters, attributes, or customer complaints, and it draws on data from many systems: manufacturing systems, ERP systems, lab systems, sensors, raw material data, and so on. And the data volumes and sources keep growing. Think how much simpler analyzing all of this disparate data is if it’s all in one place with unified search capability.
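
Once the attributes of two batches are available side by side, finding the delta is simple. The Python sketch below assumes each batch has already been gathered into one flat record; the field names and values are hypothetical. In practice the hard part is assembling those records from many systems, not the comparison itself.

```python
def batch_delta(batch_a, batch_b):
    """Return {attribute: (value_a, value_b)} for every attribute that differs.

    An attribute missing from one record is reported with None on that side.
    """
    keys = set(batch_a) | set(batch_b)
    return {
        key: (batch_a.get(key), batch_b.get(key))
        for key in sorted(keys)
        if batch_a.get(key) != batch_b.get(key)
    }


# Hypothetical records for two batches that are supposed to be the same.
BATCH_1 = {
    "product": "Tablet-X",
    "plant": "Plant-1",
    "api_supplier": "Acme",
    "assay_result": "pass",
}
BATCH_2 = {
    "product": "Tablet-X",
    "plant": "Plant-1",
    "api_supplier": "Borealis",
    "assay_result": "fail",
    "humidity_alarm": True,  # sensor data present for only one batch
}

DELTA = batch_delta(BATCH_1, BATCH_2)
```

Here DELTA highlights the supplier, the assay result, and the humidity alarm as the only differences, which narrows down where an analyst should look first.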

Bakken agreed. “Seemingly innocuous data could be quickly brought into context when trying to understand why a particular lot is having a problem,” he said. “A local food manufacturer had spent months trying to figure out what caused freshness problems among lots manufactured in a particular plant over a few weeks one summer. They toiled through supplier information, finding no obvious culprit. Ingredient lots were found to be used in ‘good’ lots and there weren’t any deviations from SOPs. In the end, somehow it was traced to abnormally high humidity weather at that plant. Monitoring and sensor data hadn’t been considered beyond cooking temperatures. Other environment monitoring data confirmed the anecdotal theory that it might have been the weather.”


Combining Graph and Data in a Multi-Model Database

Multi-model, schema-agnostic databases provide a single platform that holds all of the data required for a 360-degree analysis. This single source of truth lives in one place rather than being distributed across multiple systems. It is also flexible when new data sources or enterprise systems enter an enterprise’s data ecosystem.

In the next phase, the firm wants to explore using semantic triples to establish important relationships that are difficult, if not impossible, to achieve with relational systems, and to integrate enterprise analytical tools. For instance, using the Resource Description Framework (RDF) to relate parts, suppliers, locations, test equipment, tests, and so on through real-world relationships such as “hasPart”, “connectedTo”, “suppliedBy”, “testedOn”, and “assembledAt” enables a much more flexible data model than relational databases offer. RDF lets you ask interesting questions that leverage MarkLogic’s multi-model capabilities, which include searching documents while doing semantic querying and inferencing with SPARQL.
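
The idea behind triples can be sketched in a few lines of plain Python. This is not MarkLogic's API and not a real RDF store; it is only an illustration of subject-predicate-object statements and pattern matching. It uses the predicate names mentioned above with invented lot, supplier, and location identifiers.

```python
# Hypothetical triples: (subject, predicate, object). All IDs are invented.
TRIPLES = {
    ("lot:TAB-10", "hasPart", "lot:MIX-1"),
    ("lot:MIX-1", "hasPart", "lot:RAW-A"),
    ("lot:MIX-1", "hasPart", "lot:RAW-C"),
    ("lot:RAW-A", "suppliedBy", "supplier:Acme"),
    ("lot:RAW-C", "suppliedBy", "supplier:Borealis"),
    ("lot:RAW-A", "testedOn", "equipment:E7"),
    ("lot:TAB-10", "assembledAt", "location:Plant-1"),
}


def match(s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t
        for t in TRIPLES
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def suppliers_of(lot):
    """Join two patterns: lot hasPart ?part, then ?part suppliedBy ?supplier."""
    found = set()
    for _, _, part in match(s=lot, p="hasPart"):
        for _, _, supplier in match(s=part, p="suppliedBy"):
            found.add(supplier)
    return sorted(found)
```

A new kind of relationship is just another triple with a new predicate, with no schema change required. In a SPARQL query, the join inside suppliers_of would be expressed as two triple patterns that share a variable.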

Drug recalls have a significant impact on public trust across the globe. It is therefore incumbent on organizations to give their drug safety stakeholders a platform that can flag signals as early as possible, wherever they occur in a very complex supply chain. A multi-model database that models all entity data as documents and associates those entities with triples will greatly reduce the analysis and discovery time for those potential signals.


For more information

Read: The Importance of Metadata to Life Sciences
What if you could run complex queries across all of your data and metadata—no need to shred it first—with lightning fast results? Learn how MarkLogic features powerful search and semantics capabilities that enable you to extract more value from your metadata.

Check out: How Can Enterprise NoSQL Advance Analytics in Life Sciences?
Adverse event data is often buried in mounds of information, requiring extensive filtering that can take hours just to find it. Learn how MarkLogic enables data scientists to quickly narrow down datasets to just the “features of interest,” reducing the time it takes to get data to the machine learning aspects of signal detection.

Hear: Using Big Data to Run Your Pharma Business
Pharmaceutical companies are pivoting from traditional service-oriented architectures to combining data from multiple operational silos in real time to run the business more efficiently.