
The Next Critical Step for AI: Eliminate Data Bias

4 minute read

Artificial Intelligence (AI) has a great capacity for good. I believe human-driven AI will probably be one of the greatest tools humanity has ever developed. But fulfilling that potential requires us to do the hard work—now. This begins with ensuring the data our systems ingest are comprehensive and free of bias. The good news is that technology can and should help.

Data Bias—A Real-World Example

The typical enterprise won’t gain much benefit from AI trained on data scraped randomly off the internet. Business value comes with AI trained on an organization’s own data, which is also where bias can creep in. Flawed data sets produce flawed AI decisions, and these can have drastic consequences:

A woman in the United States took sleeping tablets, following her doctor’s advice, which was based on the manufacturer’s own guidelines. The next morning, she rose and drove to work, but was pulled over and later arrested. The issue? The prior night’s medication was still in her system, leaving her driving under the influence. She fought the charges in court, where it was revealed that the manufacturer’s guidelines her physician had relied on were developed using data solely from male test subjects. Because men tend to have faster metabolisms, certain medicines exit men’s systems far faster than women’s. In this case, biased medical data led to bad medicine and a scary legal entanglement.

How to Avoid Biased Datasets

To avoid biased data, or at the very least mitigate its prevalence, companies should follow two important steps. First, they need to ingest the widest possible array of data. This includes vast amounts of their own proprietary raw data, structured and unstructured, drawn from every possible company source, such as documents, Excel files, research, financials, regulatory data, historical data and benchmarks. Second, they need controls, enabled by tagging the data with contextual metadata.

To accelerate this process, companies need a tool that enables the data to be ingested with the necessary context applied. This has historically been the role of subject matter experts. However, processing data at scale requires a rules-based engine to classify data with the proper taxonomies and ontologies, thus providing the context behind the data, which can so often expose the bias.
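As a rough illustration of what rules-based tagging might look like, here is a minimal Python sketch. The taxonomy terms, keyword patterns, the `TaggedDocument` type and the sample file path are all hypothetical, invented for this example; a real platform would draw on much richer taxonomies and ontologies curated by subject matter experts.

```python
import re
from dataclasses import dataclass, field

# Hypothetical taxonomy: each term maps to a keyword pattern that signals it.
TAXONOMY_RULES = {
    "finance": re.compile(r"\b(revenue|invoice|ledger)\b", re.IGNORECASE),
    "regulatory": re.compile(r"\b(compliance|regulation|audit)\b", re.IGNORECASE),
    "clinical": re.compile(r"\b(dosage|patient|trial)\b", re.IGNORECASE),
}


@dataclass
class TaggedDocument:
    text: str
    source: str  # where the data came from, kept as context
    tags: list = field(default_factory=list)


def tag_document(text, source):
    """Attach every taxonomy term whose rule matches the text."""
    tags = sorted(
        term for term, pattern in TAXONOMY_RULES.items() if pattern.search(text)
    )
    return TaggedDocument(text=text, source=source, tags=tags)


doc = tag_document(
    "Trial dosage guidelines passed the compliance audit.",
    source="research/report.txt",  # hypothetical path
)
print(doc.tags)  # ['clinical', 'regulatory']
```

Because the tags and the source travel with the document, a reviewer can later ask which kinds of data, from which sources, actually fed a model.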

This process enables businesses to consider not only the validity of the algorithm but also the source data used to train it. Oversight is where humans can help keep AI decision-making on track. For example, we wouldn’t teach an algorithm that 2+2=5. But that’s exactly what we’re doing if we don’t ensure the data we use for AI is clean, sensible and has the proper metadata context.

Infusing AI with internal data already shows great promise. BloombergGPT™ is reported to have been trained on data that is 52% proprietary or cleaned financial data. Bloomberg’s study found, “the BloombergGPT model outperforms existing open models of a similar size on financial tasks by large margins, while still performing on par or better on general natural language processing benchmarks.” This is just one example, but it shows how powerful integrating internally sourced data sets can be.

AI Still Needs Humans

Regardless of where the data comes from, AI lacks a moral compass and ethical context that human decisions organically include.

To compensate for this gap, we must ask the right questions and include those rationales in our data sets. AI algorithms also need to be trained across cultures, ages and genders, as well as a host of other parameters to account for bias. The cleaner the data points used, the more sound the decision.
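One simple check along these lines is to measure how well each group is represented before training. The Python sketch below is only an illustration: the function name, the `expected_groups` parameter and the 20% threshold are assumptions made for this example, not an established standard.

```python
from collections import Counter


def representation_report(records, attribute, expected_groups, min_share=0.2):
    """Report each group's share of the records and flag low shares.

    Groups listed in expected_groups but absent from the data are
    reported with a share of 0.0, so a missing group is not overlooked.
    """
    counts = Counter(r.get(attribute, "unknown") for r in records)
    total = sum(counts.values())
    report = {}
    for group in sorted(set(expected_groups) | set(counts)):
        share = counts[group] / total if total else 0.0
        report[group] = {"share": share, "under_represented": share < min_share}
    return report


# Hypothetical trial data: nine male subjects and one female subject.
subjects = [{"sex": "male"}] * 9 + [{"sex": "female"}]
report = representation_report(subjects, "sex", expected_groups=["female", "male"])
for group, stats in report.items():
    print(group, stats)
```

Applied to the sleeping tablet example earlier, a data set drawn solely from male test subjects would show a share of 0.0 for women and be flagged before it shaped any guidance.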

The “wisdom of crowds” theory puts forth, in brief, that the more data points you combine about a particular question, the more “right” your resulting answer. This holds even when crowd-sourced decisions are compared to those of experts. Stripped to its core, AI takes a reasonable guess based on the data it has. Accuracy, therefore, comes from aggregating the data points and balancing the wrong and the right to discern the most probable answer. But AI can’t govern itself. It takes diverse and critical thinking, weighing many factors, to ensure the decisions we get via AI’s advanced decision-making are for the good of the whole, rather than biased to the few.
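The aggregation idea can be sketched with a small simulation. It assumes each guess is the true value plus independent, unbiased noise; the function name and the numbers are hypothetical and chosen only for illustration.

```python
import random
import statistics


def crowd_vs_individual(true_value, n_guesses, noise_sd, seed=0):
    """Compare the error of the averaged guess with the average individual error."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    guesses = [rng.gauss(true_value, noise_sd) for _ in range(n_guesses)]
    crowd_error = abs(statistics.fmean(guesses) - true_value)
    individual_error = statistics.fmean([abs(g - true_value) for g in guesses])
    return crowd_error, individual_error


crowd_error, individual_error = crowd_vs_individual(
    true_value=100.0, n_guesses=500, noise_sd=15.0
)
print(f"crowd error: {crowd_error:.2f}")
print(f"average individual error: {individual_error:.2f}")
```

Averaging helps here only because the simulated errors point in different directions and cancel out. If every guess shared the same bias, the average would inherit it, which is why the diversity of the underlying data matters as much as its volume.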

A Transparent Way Forward

As the world of data grows, businesses need scalable solutions to process and manage it all. There is a limit to how much information a human brain can process, and repeatedly retaining subject matter experts is impractical. Achieving unbiased data requires an agile, transparent, rules-based data platform where data can be ingested, harmonised and curated for the AI tool. If businesses and their AI teams are to move forward responsibly, they need a replicable, scalable way to ensure AI algorithms are trained with clean, quality data. Preferably, their own proprietary data.

In my next blog, I am going to look at another feature that any data platform should have to help remove data bias and add further transparency to the data: bi-temporality. That piece will look at how it can be leveraged to provide data provenance and lineage throughout the life cycle of the data.

Data Bias Survey Results

For more information on the state of data bias in business today, and to gain insight into how to avoid and address data bias in your own organization, read the highlights from our data bias survey.


Philip Miller

Philip Miller is a Customer Success Manager for Progress | MarkLogic, looking after our International Standards Bodies and Publishing accounts. Philip also leads our customer webinar series Digital Acceleration and Progress | MarkLogic Vision events. He is always keen to advocate for his customers and provide a voice internally to improve and innovate the Progress | MarkLogic Data Platform. He was named a Top Influencer in Onalytica's Who's Who in Data Management. Outside of work, he's a father to two daughters, a fan of dogs, and an avid learner, trying to learn something new every day.
