The Staggering Impact of Dirty Data

5 minute read

Sometimes, costs sneak up on us. What might seem to be an everyday annoyance has been having staggering cost implications for years.

Dirty data—data that is inaccurate, incomplete or inconsistent—is one of these surprises. Experian reports that on average, companies across the globe feel that 26% of their data is dirty. This contributes to enormous losses. In fact, it costs the average business 15% to 25% of revenue, and the US economy over $3 trillion annually. Anybody who’s had to deal with dirty data knows how frustrating it can be, but when the numbers are added up, it can be difficult to wrap your head around its impact.

Since dirty data costs so much (a sobering understatement), it is critical to understand where it comes from, how it affects business and how it can be dealt with.

Where Does Dirty Data Come From?

According to Experian, human error influences over 60% of dirty data, and poor interdepartmental communication is involved in about 35% of inaccurate data records. Intuitively, it seems that a solid data strategy should mitigate these issues, but an inadequate data strategy is itself a factor in 28% of inaccurate data.

When different departments enter related data into separate data silos, even a good data strategy won’t keep that data from fouling downstream data warehouses, marts and lakes. Records can be duplicated with non-canonical data, such as different misspellings of names and addresses. Data silos with poor constraints can store dates, account numbers or personal information in different formats, which makes them difficult or impossible to reconcile automatically.
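To make the reconciliation problem concrete, here is a minimal Python sketch of how formatting variants can be normalized before records are compared. The silo names, field names and the list of date formats are assumptions made up for illustration, and the sketch only handles differences in formatting. True misspellings need fuzzy matching, and an ambiguous date such as 03/07/1985 is read here as month/day/year, which would need confirming for each source.

```python
from datetime import datetime

# Hypothetical date formats used by different departmental silos.
KNOWN_DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d %b %Y")

def normalize_date(raw):
    """Return the date as ISO 8601 text, or None if no known format matches."""
    for fmt in KNOWN_DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None

def normalize_name(raw):
    """Lowercase, drop punctuation and collapse whitespace so variants compare equal."""
    kept = "".join(ch for ch in raw.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())

def find_duplicates(records):
    """Group records that share a normalized (name, birth date) key."""
    groups = {}
    for rec in records:
        key = (normalize_name(rec["name"]), normalize_date(rec["dob"]))
        groups.setdefault(key, []).append(rec)
    return [group for group in groups.values() if len(group) > 1]

# Illustrative records, as three silos might hold them.
records = [
    {"source": "sales", "name": "Ann  O'Neil", "dob": "03/07/1985"},
    {"source": "support", "name": "ann oneil", "dob": "1985-03-07"},
    {"source": "billing", "name": "Bob Stone", "dob": "12 Jan 1990"},
]
duplicates = find_duplicates(records)
```

In this example the sales and support records collapse to the same key and are flagged as one duplicate group, while the billing record stands alone. Records whose dates cannot be parsed get a key with no date, so in practice they should be routed to manual review instead of being grouped automatically.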

Dirty data can remain hidden for years, which makes it even more difficult to detect and deal with when it is actually found. Unfortunately, 57% of businesses find out about dirty data when it’s reported by customers or prospects, a particularly poor way to track down and solve essential data issues.

Many organizations search for inconsistent and inaccurate data using manual processes because their data is too decentralized and too non-standard. These plans tend to fall into the same trap as the data: instead of consolidated planning, each department is responsible for its own data inaccuracies. While this may catch some instances, it also contributes to internal inconsistencies between department silos. The fix happens in one place but not in another, which just leads to more data problems.

The Impact of Dirty Data

Dirty data results in wasted resources, lost productivity, failed communication (both internal and external) and wasted marketing spending. In the US, it is estimated that 27% of revenue is wasted on inaccurate or incomplete customer and prospect data.

Productivity is impacted in several important areas. Data scientists spend around 60% of their time cleaning, normalizing and organizing data. Meanwhile, knowledge workers spend up to 50% of their time dealing with hidden and inaccurate data.

Dirty data lacks credibility, and that means that end-users who rely on that data spend extra time confirming its accuracy, further reducing speed and productivity. Introducing another manual process leads to more inaccuracies and mounting inconsistencies through growing numbers of dirty records.

In addition to the revenue loss, dirty data impacts businesses more insidiously. Only 16% of business executives are confident in the accuracy of the data that underlies their business decisions. Garbage in, garbage out: when you can’t rely on your own data, something needs to be done to increase data accuracy and reliability.

Dirty Data in Banking

Worldwide, inaccuracies in data cost a company between 15% and 25% of its revenue. With global revenues of over $2.2 trillion, this means that dirty data costs the global banking industry over $400 billion. Dirty data also leads to a number of risks that are unique to the banking industry.
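As a rough check on that figure, the range can be worked out directly. This is a back-of-the-envelope sketch that uses only the numbers quoted above; the variable names are illustrative.

```python
# Figures quoted in the text; revenue is expressed in billions of US dollars.
BANKING_REVENUE_BN = 2_200      # "over $2.2 trillion"
LOW_PCT, HIGH_PCT = 15, 25      # share of revenue lost to dirty data

low_bn = BANKING_REVENUE_BN * LOW_PCT / 100
high_bn = BANKING_REVENUE_BN * HIGH_PCT / 100
mid_bn = (low_bn + high_bn) / 2

print(f"Estimated cost: ${low_bn:.0f}bn to ${high_bn:.0f}bn (midpoint ${mid_bn:.0f}bn)")
```

The range runs from $330 billion to $550 billion, and its midpoint of about $440 billion is consistent with the "over $400 billion" estimate.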

Inconsistent information across data silos in an organization leads to transactional risks such as inaccurate or even fraudulent transactions. Fake and fraudulent accounts should be caught early by processes that clean or detect dirty data. When they aren’t, the bank is put at risk, and its reputation is damaged.

With so much dirty data and so few executives trusting the data they are using, poor strategic decisions are bound to follow. You can’t pick the right path if you don’t know where you are. Dirty data can lead to tremendous operational risks.

The constantly evolving regulatory landscape also creates a heavy burden for data management. Compliance teams are under significant pressure to provide more information about data, but when they don’t have clean data to work with, they are out of luck. The 2018 rollout of MiFID II regulations has been a painful example of this, with faltering compliance and increasingly strict regulators causing pain for many European financial firms.

Dealing with Dirty Data

The most challenging part of cleaning up dirty data is handling invalid entries and duplicate data. Careful error correction is needed to ensure not only that no data is lost while the consistency of existing valid data improves, but also that all of the metadata describing each correction is maintained alongside the integrated data itself.

Once data has been cleansed, it needs to be maintained. After the initial process of cleaning dirty data, only new or changed data should need to be checked for validity and consistency. In all cases, from old to newly entered data, the lineage of the data must be recorded. This ensures its validity and trustworthiness.
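One way to picture this maintenance loop is sketched below in Python: each record gets a content fingerprint, only records whose fingerprint has changed are re-validated, and every check is appended to a lineage log. The field names, the validation rule and the in-memory `seen` and `lineage` structures are all hypothetical; a real system would persist them and apply its own rules.

```python
import hashlib
import json
from datetime import datetime, timezone

def fingerprint(record):
    """Stable hash of a record's content, used to detect changes."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def is_valid(record):
    """Hypothetical rule: an account needs an id and a non-negative numeric balance."""
    balance = record.get("balance")
    return (
        bool(record.get("account_id"))
        and isinstance(balance, (int, float))
        and balance >= 0
    )

def check_incrementally(records, seen, lineage):
    """Validate only new or changed records and log what happened to each."""
    checked = []
    for rec in records:
        key = rec.get("account_id")
        digest = fingerprint(rec)
        if seen.get(key) == digest:
            continue  # unchanged since the last run, so skip re-validation
        seen[key] = digest
        lineage.append({
            "account_id": key,
            "fingerprint": digest,
            "valid": is_valid(rec),
            "checked_at": datetime.now(timezone.utc).isoformat(),
        })
        checked.append(key)
    return checked

# Illustrative run: the second pass only re-checks the record that changed.
seen, lineage = {}, []
batch = [{"account_id": "A1", "balance": 100}, {"account_id": "A2", "balance": -5}]
first_pass = check_incrementally(batch, seen, lineage)
batch[1]["balance"] = 5
second_pass = check_incrementally(batch, seen, lineage)
```

The lineage log ends up with three entries: one for each record on the first pass, plus one for the corrected record, so the history of the fix is kept alongside the current state.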

Best practices for cleaning dirty data and for data governance include the following:

  1. Harmonize by correlating the data across different siloed sources and harnessing metadata for data provenance and lineage.
  2. Leverage core smart mastering capabilities to match and merge entities in a single multi-model platform.
  3. Apply semantics to capture relationships between data and to ensure consistency.
  4. Create a 360-degree view by integrating all of your data sources.
  5. Find dirty data using natural language searching, data modeling and machine learning to identify patterns and anomalies.
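The match-and-merge idea in the list above can be illustrated with a small, generic Python sketch. It is not MarkLogic's smart mastering API; it uses the standard library's `difflib` for fuzzy name comparison, and the record fields, the similarity threshold and the merge rule (keep the first non-empty value, keep every source) are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

MATCH_THRESHOLD = 0.85  # assumed cutoff; it would need tuning against real data

def similarity(a, b):
    """Return a 0..1 similarity score for two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def is_match(a, b):
    """Treat records as the same entity on equal email or near-identical name."""
    if a.get("email") and a["email"].lower() == (b.get("email") or "").lower():
        return True
    return similarity(a["name"], b["name"]) >= MATCH_THRESHOLD

def merge(a, b):
    """Keep the first non-empty value per field and every contributing source."""
    merged = {field: a.get(field) or b.get(field) for field in ("name", "email")}
    merged["sources"] = sorted(set(a["sources"]) | set(b["sources"]))
    return merged

def master(records):
    """Fold each record into the first mastered entity it matches."""
    mastered = []
    for rec in records:
        for i, existing in enumerate(mastered):
            if is_match(existing, rec):
                mastered[i] = merge(existing, rec)
                break
        else:
            mastered.append(dict(rec))
    return mastered

# Illustrative records from two hypothetical silos.
records = [
    {"name": "Jonathan Smith", "email": "jsmith@example.com", "sources": ["crm"]},
    {"name": "Jonathon Smith", "email": "", "sources": ["billing"]},
    {"name": "Maria Garcia", "email": "mgarcia@example.com", "sources": ["crm"]},
]
entities = master(records)
```

Here the two spellings of the same name are merged into one entity that lists both `billing` and `crm` as sources, which preserves provenance. The greedy first-match loop is a simplification: it compares every record against every mastered entity, so larger data sets usually need blocking or indexing first.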

It is a lot, but it’s worth it. An organization that uses strong data governance in addition to data-cleansing practices can generate up to 70% more revenue.

Stop Letting Dirty Data Slow You Down

The business impact of dirty data is staggering, but an individual organization can avoid the morass. Modern techniques and technology can minimize the impact of dirty data. Clean, reliable data makes the business more agile and responsive while cutting down on wasted efforts by data scientists and knowledge workers.

Your business might already be planning to tackle its dirty-data problems. In fact, 84% of businesses are planning to implement data quality solutions soon, but many of these solutions are segmented across departments in the enterprise. Moreover, many data quality initiatives won’t address the core changes needed inside the database to effect positive change where it is needed the most. This will only lead to future problems with inconsistent data, exacerbating the current state as data proliferates. The effort needs to be global across the business and needs to address shortcomings at their source: inside the database. An operational data hub, such as one built on top of MarkLogic®, can help your business get the right start on cleaning its dirty data.

Learn how MarkLogic’s Operational Data Hub framework can help you improve data governance and increase the quality of your data assets.

Ed Downs

Ed Downs is responsible for customer solutions marketing at MarkLogic. He draws on his considerable experience, having delivered large-scale big data projects and operational and analytical solutions for public and private sector organizations, to drive awareness and accelerate adoption of the MarkLogic platform.
