Gone Data: Is Data That Was Supposed to Be Deleted — Really Gone?

EU GDPR is surfacing a host of issues, and one of them is this: if a client requests data erasure, how can you be sure the data is gone?

It sounds easy, but customer data is stored all over the place. There may be pictures, social media accounts and videos, all of which you know to be linked to Jane or John Smith. So why are your searches not returning them?

As more and more systems have been layered on top of each other over the years, no single system holds all the data in one place (or even knows about all the data), so there is no easy way to ensure these requests are fulfilled to the letter of the (international) law. This presents a tremendous data integration problem, and it brings us to a board meeting, happening right now, I am sure, where the Head of IT, a consultant or even a technology-savvy privacy lawyer has been brought in to discuss the problem:

CXO: So the next item on the agenda, EU GDPR. How are we going to respond to this?

IT PRO: Ahem, it’s difficult. The data we have is, well, a mess. There are three problems. The first, and this doesn’t affect just data privacy but everything we do, is that we have data on top of data. It is spread across multiple sites and multiple systems, and there is even some data that we are struggling to get a handle on.

Data Scattered Across Silos

When you think of Big Data, metadata, small data or any type of data-related role, what do you think that entails? Do you think that data scientists spend most of their time analyzing the data to get the best insights they can? Insights that can change the fortunes of a company, or form part of some great new innovation? This, I am sad to say, couldn’t be further from the truth. A recent study by CrowdFlower found that 60 percent of data scientists’ time is spent “wrangling” or “cleansing” the data to make it fit their database. Oh, and 80 percent of those who responded to the survey also said this is the least enjoyable part of their role. That is a tough pill to swallow: you are asking these people to spend most of their time on the part of the job they enjoy least.

Let’s return to the scene.

CXO: Ok, so what do you think we need?

IT PRO: We need a central database where we can bring all the data into one place, link it together, and ensure that we can find every document or other asset tied to the piece of data we have been asked to erase.

Schema-on-Read Approach

This is the second problem: bringing your data together and organizing it into a central hub. That is great if your database can ingest all the data as is and handle a schema-on-read approach. However, because many IT departments work with structured data, they think they should store all their data in a relational database or warehouse, and those systems definitely don’t let you load data as is. They require months and months of mapping and modeling. And if you want to include unstructured or semi-structured data, such as all the information in a contract, images from digital cameras, or text messages, well, the job just got harder.
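To make the schema-on-read idea concrete, here is a minimal sketch in Python. It is not how any particular database implements it; the record shapes, the `RAW_STORE` list and the `read_name` helper are all hypothetical, invented for illustration. The point is only that the documents are stored exactly as they arrive, and a schema is applied when you read them.

```python
import json

# Hypothetical raw records from three different silos, stored as is.
# None of these shapes come from a real system.
RAW_STORE = [
    '{"customer": "Jane Smith", "email": "jane@example.com"}',
    '{"name": {"first": "Jane", "last": "Smith"}, "photo": "img_001.jpg"}',
    '{"full_name": "John Smith", "sms": "Running late"}',
]


def read_name(doc):
    """Apply a schema at read time: find the person's name in whichever shape the document has."""
    if "customer" in doc:
        return doc["customer"]
    if "full_name" in doc:
        return doc["full_name"]
    if "name" in doc:
        return f'{doc["name"]["first"]} {doc["name"]["last"]}'
    return None


def find_by_name(name):
    """Return every stored document that resolves to the given name."""
    docs = (json.loads(raw) for raw in RAW_STORE)
    return [doc for doc in docs if read_name(doc) == name]


jane_docs = find_by_name("Jane Smith")
```

Nothing was remodeled before loading; if a fourth silo turns up with yet another shape, only `read_name` needs to learn about it.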

Once again back to the executive suites.

CXO: Ok, so I get that we need something new. Why can’t we use our existing tools and vendors?

IT PRO: We could, but we would need training, development and time. All three come with a price tag.

CXO: Hmm, so how much will this cost?

We’ll leave this scene there. You see, the other database providers he is referring to need the extended development time because they are trying to shoehorn unstructured data into a structured environment. All of this costs time and, more importantly, money, both of which are under enormous pressure, especially when you consider that this legislation is coming next year. What is needed for today’s and tomorrow’s data legislation (including EU GDPR) is a database that can handle all of your data, including structured data, text, JSON, XML, RDF triples, geospatial data and large binaries.

Those of you with a keen eye will have noticed that our IT/Consultant/Savvy Lawyer mentioned three problems that EU GDPR throws up. Solving them puts you on a stronger business footing too!

So you have all your data in one place: structured, unstructured, everything. This is where the real problem lies, and it is our third problem. You have created a database with all your data, but how are you going to comply with a request to be forgotten? How do you know you have caught everything in your system that needs to be deleted?

Centralized Operational Data Hub

Well, if you have a centralized data hub that creates semantic associations (metadata) between entities and assets, you can run a search and find all assets associated with a specific individual. All of the data, whether structured or unstructured, should be semantically linked on ingest. That ensures that all of Jane or John Smith’s character-centric files (alphanumeric or otherwise) and any other files (pictures, videos, even social media references) are captured and can be deleted with ease and in a timely manner.

Now when that deletion request comes through, you can be confident that you are in compliance.
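As a rough illustration of linking on ingest, the sketch below keeps a toy in-memory hub in Python. It is an assumption-laden model, not MarkLogic's API: the `DataHub` class, its method names and the sample asset IDs are all made up for this example. Every asset is tied to a data subject at the moment it is loaded, so an erasure request becomes a lookup followed by a delete.

```python
from collections import defaultdict


class DataHub:
    """Toy in-memory hub (hypothetical): each asset is linked to its subject on ingest."""

    def __init__(self):
        self.assets = {}                    # asset_id -> content
        self.by_subject = defaultdict(set)  # subject_id -> set of asset_ids

    def ingest(self, asset_id, content, subject_id):
        """Store the asset and record which person it belongs to."""
        self.assets[asset_id] = content
        self.by_subject[subject_id].add(asset_id)

    def find(self, subject_id):
        """Return the IDs of all assets linked to one person, sorted for stable output."""
        return sorted(self.by_subject.get(subject_id, set()))

    def erase(self, subject_id):
        """Delete every asset linked to the person and return what was removed."""
        removed = self.find(subject_id)
        for asset_id in removed:
            del self.assets[asset_id]
        self.by_subject.pop(subject_id, None)
        return removed


hub = DataHub()
hub.ingest("crm-17", {"name": "Jane Smith"}, subject_id="jane")
hub.ingest("photo-3", b"\x89PNG...", subject_id="jane")
hub.ingest("crm-18", {"name": "John Smith"}, subject_id="john")

erased = hub.erase("jane")
```

The list returned by `erase` doubles as a simple audit trail of what was deleted for that request.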

Is there a system that will semantically link on ingest? Actually, yes. Allan Donald, senior product manager at the BBC, described the BBC’s development of its program metadata API and why they chose (again) to partner with MarkLogic:

“After some abortive attempts to solve this problem in SQL, and a lengthy period of prototyping and testing alternatives with major database vendors, we settled on a NoSQL database from …MarkLogic. This had already been successfully used for the Olympics as part of the BBC’s Dynamic Semantic Publishing platform. Using [MarkLogic] we saw significant speed benefits. Some sample availability queries that took up to 20 seconds for SQL could be performed on NoSQL documents in around 20ms – a thousand times faster.”

EU GDPR comes into effect in May 2018. Failure to comply or to have sufficient measures in place by this date could result in organisations being fined up to 4 percent of annual global turnover or €20 million (whichever is greater).

Take that in for a minute: 4 percent of your turnover. To put that into context, at the time of writing, the company at the bottom of the FTSE 250, if found to be non-compliant, could be fined £15,800,000. Or, put another way, 16 percent of its cash reserves.
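The "whichever is greater" rule is easy to express in a few lines of Python. This is only a sketch of the ceiling as stated above; the function name is hypothetical, the turnover figures in the usage lines are invented round numbers, and it assumes turnover is given in whole euros.

```python
FLOOR_EUR = 20_000_000  # fixed amount quoted above
RATE_PERCENT = 4        # share of annual global turnover quoted above


def max_fine_eur(annual_turnover_eur):
    """Return the maximum fine: 4 percent of turnover or EUR 20 million, whichever is greater."""
    # Integer arithmetic avoids floating-point rounding on large amounts.
    percent_based = annual_turnover_eur * RATE_PERCENT // 100
    return max(percent_based, FLOOR_EUR)


# Invented example figures, not real companies.
large_company = max_fine_eur(1_000_000_000)
small_company = max_fine_eur(100_000_000)
```

For the larger company the percentage applies; for the smaller one the €20 million figure takes over, because 4 percent of its turnover would be less.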

Will you be ready? Will you be fast enough to ensure that you remain compliant? More importantly, will you ensure that you have the right database for the job?

For more information on this topic

EU GDPR: Beyond Compliance: blog post that outlines the key issues Data Protection Officers face, and how a 360-view of clients can help.

Schema-on-Read vs Schema-on-Write: blog post that defines the true strength of a NoSQL database, the schema-on-read approach, which allows you to load data as is and transform it later as you need it!

The Path To Compliance: 45-min webinar in which Christy Haragan joins Anastasia Olshanskaya to discuss the new data privacy rights individuals have and how they dramatically impact business. They leave you with a 5-step guide to EU GDPR compliance.

Philip Miller

Philip Miller is a Customer Success Manager for Progress | MarkLogic, looking after our International Standards Bodies and Publishing accounts. Philip also leads our customer webinar series Digital Acceleration and Progress | MarkLogic Vision events. He is always keen to advocate for his customers and provide a voice internally to improve and innovate the Progress | MarkLogic Data Platform, and he has been named a Top Influencer in Onalytica's Who's Who in Data Management. Outside of work, he's a father to two daughters, a fan of dogs, and an avid learner, trying to learn something new every day.
