It is well known that ‘anonymous data’ often isn’t that anonymous. There are a few well-publicized examples of ‘anonymous’ datasets being released that were quickly de-anonymized:
That was the end of releasing ‘anonymous’ data to the public. But the problem with anonymous data lives on within organizations.
In what cybersecurity professionals call a linkage attack, an adversary collects auxiliary information about an individual from multiple data sources and combines it to form a complete picture of the target, often including that individual’s personally identifiable information.
The common approach to mitigating linkage attacks is to anonymize data before exporting it by removing personally identifiable information (PII) such as IDs and phone numbers. Unfortunately, this is not enough.
A better approach to protecting against linkage attacks is to centralize sharing, share less raw data, and, if you do need to share data, create layers of abstraction or generalization by redacting parts of it.
How does a linkage attack work? Let me provide an example from the healthcare industry. Imagine that a care provider shares anonymized data about medical conditions with external researchers. The export contains “Gender,” “Postal code,” “Date of birth,” and “Description.” An attacker could easily cross-reference the patients against a public voter list that contains “Name,” “Gender,” “Postal code,” and “Date of birth,” and attach a name to each medical description.
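To make the mechanics concrete, here is a minimal sketch of that cross-reference in Python. The record types, field names, and sample people are all invented for illustration; the only thing taken from the example above is the set of shared columns (gender, postal code, date of birth).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatientRecord:
    """One row of the hypothetical 'anonymized' medical export."""
    gender: str
    postal_code: str
    date_of_birth: str
    description: str


@dataclass(frozen=True)
class VoterRecord:
    """One row of the hypothetical public voter list."""
    name: str
    gender: str
    postal_code: str
    date_of_birth: str


def link(patients, voters):
    """Return (name, description) pairs for patients whose shared
    columns match exactly one voter."""
    voters_by_key = {}
    for voter in voters:
        key = (voter.gender, voter.postal_code, voter.date_of_birth)
        voters_by_key.setdefault(key, []).append(voter)

    matches = []
    for patient in patients:
        key = (patient.gender, patient.postal_code, patient.date_of_birth)
        candidates = voters_by_key.get(key, [])
        if len(candidates) == 1:  # an unambiguous match re-identifies the patient
            matches.append((candidates[0].name, patient.description))
    return matches


patients = [
    PatientRecord("F", "94105", "1980-03-14", "asthma"),
    PatientRecord("M", "94107", "1975-11-02", "diabetes"),
]
voters = [
    VoterRecord("Alice Example", "F", "94105", "1980-03-14"),
    VoterRecord("Bob Example", "M", "94107", "1975-11-02"),
    VoterRecord("Carol Example", "F", "94110", "1990-07-21"),
]

reidentified = link(patients, voters)
print(reidentified)
```

Nothing in the export names a patient, yet a plain dictionary lookup on three columns is enough to put a name next to each description.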
In practice, the more you preserve the analytical utility of the dataset, such as keeping “Gender” and “Postal code” information in the export, the more susceptible you are to linkage attacks.
Many people think that if they just remove the PII from their data, it is okay to export. But it is not.
For example, let’s say you export credit card transactions after removing all PII. What is left is anonymized data that includes the user’s primary key, transaction date, and value. You give this export to a data analyst to calculate the average customer spend, find common behavior, etc.
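One rough way to see this trade-off is to count how many rows are unique on the columns you plan to keep. The sketch below is an assumption about how you might measure that, not a MarkLogic feature; the function name and sample rows are hypothetical.

```python
from collections import Counter


def unique_fraction(rows, columns):
    """Fraction of rows whose values on `columns` appear exactly once."""
    if not rows:
        return 0.0
    keys = [tuple(row[c] for c in columns) for row in rows]
    counts = Counter(keys)
    unique_rows = sum(1 for key in keys if counts[key] == 1)
    return unique_rows / len(rows)


# Hypothetical export rows with the PII columns already removed.
export_rows = [
    {"gender": "F", "postal_code": "94105", "date_of_birth": "1980-03-14"},
    {"gender": "F", "postal_code": "94105", "date_of_birth": "1980-09-30"},
    {"gender": "M", "postal_code": "94107", "date_of_birth": "1975-11-02"},
    {"gender": "M", "postal_code": "94107", "date_of_birth": "1975-01-19"},
]

# Keeping all three columns leaves every row unique.
full = unique_fraction(export_rows, ["gender", "postal_code", "date_of_birth"])
# Dropping the date of birth leaves no row unique in this sample.
coarse = unique_fraction(export_rows, ["gender", "postal_code"])
print(full, coarse)
```

A row that is unique on the retained columns is a row that a matching external dataset can single out.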
However, the data analyst has another idea in mind. He also has access to the call center database, which does have PII. The call center database has information about which products the customer purchased, a history of complaints, questions, disputes, etc.
Given a sufficiently large dataset, the analyst can find a customer in the credit card transactions dataset. While transactions may not uniquely identify a customer, the analyst can easily combine the transaction data with complaints, questions, and disputes to form the complete picture. For example, if a customer calls to complain about a duplicate charge on a particular day, the analyst can use this information to search the transactions and find potential matches. With time, he could uniquely identify large numbers of customers.
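The duplicate-charge scenario can be sketched in a few lines of Python. Everything here is hypothetical: the field names, the sample rows, and the rule that a "duplicate charge" means the same key and amount appearing more than once on the same day.

```python
from collections import defaultdict

# Hypothetical 'anonymized' export: primary key, date, and value only.
transactions = [
    {"customer_key": "k1", "date": "2024-05-01", "value_cents": 4250},
    {"customer_key": "k1", "date": "2024-05-01", "value_cents": 4250},
    {"customer_key": "k2", "date": "2024-05-01", "value_cents": 1999},
    {"customer_key": "k3", "date": "2024-05-02", "value_cents": 4250},
]

# Hypothetical call center records, which do contain PII.
complaints = [
    {"name": "Alice Example", "date": "2024-05-01", "issue": "duplicate charge"},
]


def keys_with_duplicate_charge(transactions, date):
    """Customer keys that show the same amount more than once on `date`."""
    seen = defaultdict(int)
    for t in transactions:
        if t["date"] == date:
            seen[(t["customer_key"], t["value_cents"])] += 1
    return {key for (key, _value), count in seen.items() if count > 1}


def link_complaints(transactions, complaints):
    """Map a caller's name to a customer key when exactly one key fits."""
    linked = {}
    for complaint in complaints:
        if complaint["issue"] != "duplicate charge":
            continue
        candidates = keys_with_duplicate_charge(transactions, complaint["date"])
        if len(candidates) == 1:
            linked[complaint["name"]] = next(iter(candidates))
    return linked


linked = link_complaints(transactions, complaints)
print(linked)
```

Once a name is tied to a key, every other transaction under that key is tied to the same person.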
Here, we described a complex attack by an internal adversary for three reasons:
To better protect the data exported against linkage attacks, we recommend that you centralize sharing, share less, create abstractions, and use the right protection.
It’s really hard to secure data across multiple data silos. As we have seen in the aforementioned examples, insiders conducting linkage attacks have access to an assortment of databases. These database silos all have different access controls and auditing, not to mention various data formats. These silos prevent implementation of a consistent policy to protect user information and privacy.
The best approach to address this problem is to use a centralized database to govern and secure the data. This approach makes securing applications easier and faster. Why rely on heuristic, probabilistic approaches to protection against re-identification attacks when you can have comprehensive auditing and policy execution, consistently implemented across your entire organization, and exposed via a rich set of APIs to access aggregate information?
The best database for centralizing all of your data is a multi-model database like MarkLogic. MarkLogic is built to flexibly store and manage all of an organization’s data, and enables consistent data governance across disparate data stores. MarkLogic has a lot of advanced features for securing data such as Document and Element Level Security, and all security can be controlled from a central location that serves different purposes and applies different access controls.
Our second recommendation is to bring the data analysis to the data. In other words, “give me your code.”
Most business users are looking for summaries (or aggregates) of information, not the data itself. It’s better not to share raw data.
In MarkLogic, you can use amped functions, which run internally at a higher privilege and can do things that the user cannot do directly, to calculate aggregates without giving access to the raw document data. For example, an amped function can calculate what customers spent per postal code while the user has no access to individual records.
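The sketch below illustrates only the shape of that idea in Python: callers get a function that returns totals, never the rows. It is not MarkLogic's amp mechanism and Python does not enforce any privilege boundary here; the data, field names, and function name are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical raw records. In the setup described above, only the
# privileged function would be able to read these.
_RAW_TRANSACTIONS = [
    {"customer_key": "k1", "postal_code": "94105", "spend_cents": 4250},
    {"customer_key": "k2", "postal_code": "94105", "spend_cents": 1999},
    {"customer_key": "k3", "postal_code": "94107", "spend_cents": 1000},
    {"customer_key": "k4", "postal_code": "94107", "spend_cents": 2500},
]


def spend_per_postal_code():
    """Return total spend per postal code; individual rows never leave."""
    totals = defaultdict(int)
    for record in _RAW_TRANSACTIONS:
        totals[record["postal_code"]] += record["spend_cents"]
    return dict(totals)


summary = spend_per_postal_code()
print(summary)
```

The result contains no customer keys and no per-transaction values, so there is nothing in it to join against another dataset row by row.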
This is a terrific approach to protect against linkage attacks. The challenge is that you need to know your questions a priori in order to create the functions. Therefore, this is a great approach for a report or portal that displays aggregates and calculations.
If you need to share data in its raw format with data scientists, consider adding a layer of abstraction or generalization.
Do you really need to share the full “Date of birth”? Or just “Year of birth”?
Do you really need the full “Postal Code”? Or, would “County” do?
For example, in MarkLogic you can use Redaction to mask the “day” and “month” out of “Date of birth.” You can use a dictionary to replace “Postal code” with “County.” Or, replace “Age” with “Age range.” Just keep in mind that although bigger abstractions provide more security, they also result in slightly less precise analytics.
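As a language-neutral illustration of those three generalizations, here is a small Python sketch. It does not use MarkLogic Redaction; the lookup table, the ten-year bucket width, and the record layout are assumptions chosen for the example.

```python
# Hypothetical dictionary mapping postal codes to counties.
POSTAL_TO_COUNTY = {
    "94105": "San Francisco",
    "94301": "Santa Clara",
}


def year_of_birth(date_of_birth):
    """Keep only the year from an ISO date string (YYYY-MM-DD)."""
    return date_of_birth[:4]


def age_range(age, width=10):
    """Replace an exact age with a bucket such as '40-49'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def generalize(record):
    """Return a copy of the record with coarser quasi-identifiers."""
    return {
        "gender": record["gender"],
        "year_of_birth": year_of_birth(record["date_of_birth"]),
        "county": POSTAL_TO_COUNTY.get(record["postal_code"], "UNKNOWN"),
        "age_range": age_range(record["age"]),
        "description": record["description"],
    }


original = {
    "gender": "F",
    "postal_code": "94105",
    "date_of_birth": "1980-03-14",
    "age": 44,
    "description": "asthma",
}
shared = generalize(original)
print(shared)
```

The shared record still supports analysis by county, birth year, and age band, but it no longer carries the exact values that made the earlier cross-reference possible.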
Oftentimes, when I talk to customers about this, they suggest protecting against linkage attacks using format-preserving encryption, homomorphic encryption, perturbation, and salting.
Not so fast!
These technologies protect against attacks such as dictionary, rainbow, and brute force attacks but not against linkage attacks. Linkage attacks do not target the encrypted fields; they exploit the data that is left in the clear for analysis, so those approaches do not help. If you are also concerned about dictionary and rainbow attacks against your data, MarkLogic can also protect you.
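A short Python sketch shows why. Here the customer identifier is replaced with a salted SHA-256 hash, yet the date and value columns kept for analysis are still enough to find the customer. The salt, field names, and sample rows are hypothetical.

```python
import hashlib


def pseudonymize(rows, salt):
    """Replace each customer_id with a salted SHA-256 digest."""
    protected = []
    for row in rows:
        digest = hashlib.sha256(salt + row["customer_id"].encode("utf-8")).hexdigest()
        protected.append({**row, "customer_id": digest})
    return protected


rows = [
    {"customer_id": "c-001", "date": "2024-05-01", "value_cents": 4250},
    {"customer_id": "c-002", "date": "2024-05-01", "value_cents": 1999},
]
export = pseudonymize(rows, b"hypothetical-salt")

# Auxiliary knowledge from another source: who paid what, and when.
known = {"name": "Alice Example", "date": "2024-05-01", "value_cents": 4250}

hits = [
    row for row in export
    if row["date"] == known["date"] and row["value_cents"] == known["value_cents"]
]
print(len(hits), hits[0]["customer_id"][:12])
```

The hash does its job, since the original identifier cannot be read from the export, but the match never needed the identifier in the first place.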
MarkLogic has advanced Encryption at Rest, which has multiple low-level encryption keys to minimize the impact of any breach, and multiple salting methods on redaction, to maximize exported data entropy.
Linkage attacks can be simple or very sophisticated. Protecting against them may involve simple forms of redaction, more sophisticated abstraction, or full computation at the data layer.
MarkLogic provides a set of capabilities to help you ensure that your data is safe, in a central location, and that you still can use it for analytics, operations, and business reporting.
MarkLogic helps you secure and govern your data:
To learn more, download the white paper, Developing Secure Applications on MarkLogic. For a quick summary, check out our Element Level Security and Redaction Datasheet.