Organizations can harness great benefits from data, but only by understanding the importance of data quality, building trust in their data and guarding against bias can they make sound decisions and create profit.
At a fundamental level, data trust is when an enterprise has confidence that the data it is using is accurate, usable, comprehensive and relevant to its intended purposes. On a bigger-picture level, data trust has to do with context, ethics and biases.
“The narrow definition is looking at how data is used in organizations to drive their mission,” said Danil Mikhailov, executive director at Data.org, a nonprofit backed by the Mastercard Center for Inclusive Growth and The Rockefeller Foundation.
Data.org promotes the use of data science to tackle society’s greatest challenges. One of its projects is the data maturity assessment, a free tool that helps organizations determine where they stand on the data journey.
That narrow definition of data trust often gets support from tools that assess the quality of that data or automatically monitor the data across key metrics, Mikhailov said.
“Once you tick all the boxes, the organization can trust the data more,” he said.
But that definition of data trust is limited because data is part of a broader context. Companies should consider other factors when evaluating data trust beyond the basic operational ones.
“Look not just at the specifics of data quality but who is the data for?” Mikhailov said. “Who is involved in the process of designing the systems, assessing the systems, using the data?”
The bigger picture is harder to quantify and operationalize but forgetting or ignoring it can lead to biases and failures, he added.
The cost of bad data
Organizations’ bottom line reflects the importance of data quality. Poor data quality costs organizations, on average, $13 million a year, according to a Gartner report in July 2021. It’s not just the immediate effect on revenue that’s at stake. Poor data quality increases the complexity of data ecosystems and leads to poor decision-making.
There’s a rule of thumb called the “1-10-100” rule of data that dates back to 1992: verifying a record at the point of entry costs $1, correcting a bad record later costs $10, and a bad record that is never fixed costs the business $100.
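Applied to a hypothetical batch of bad records, the rule's tenfold cost escalation looks like this (illustrative figures only, not a benchmark):

```python
# Sketch of the 1-10-100 rule: for every bad record, the cost
# escalates 10x at each stage it goes uncaught.
COST_VERIFY = 1     # caught and verified at entry
COST_CORRECT = 10   # caught and fixed downstream
COST_FAILURE = 100  # never fixed; cost of decisions made on bad data

bad_records = 500  # hypothetical batch
print("Verify at entry:", bad_records * COST_VERIFY)    # 500
print("Correct later:  ", bad_records * COST_CORRECT)   # 5000
print("Never fixed:    ", bad_records * COST_FAILURE)   # 50000
```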
Eighty-two percent of senior data executives said data quality concerns represent a barrier to data integration projects, and 80% find it challenging to consistently enrich data with proper context at scale, according to a survey by Corinium Intelligence in June 2021.
Executives don’t trust their own data
Assessing data quality and trusting data’s accuracy, consistency and completeness is a challenge for any executive. This is true even in organizations where the importance of data quality is literally a matter of life and death.
Only 20% of healthcare executives said that they fully trust their data, according to an October 2021 survey by consulting firm Sage Growth Partners and data technology company InterSystems.
Trust starts with the collection process. One mistake companies make is assuming data is good and safe just because it matches what the company wants to track or measure, said Servaas Verbiest, director of product field strategy at Sungard Availability Services.
“It’s all about understanding who provided the data, where it came from, why it was collected, how it was collected,” he said.
Diversification also helps.
“A single source of truth is a single point of failure,” Verbiest said. “That is a big task, but it is essential to prevent bias or adoption from being impacted by an individual’s preference versus the data bias required by the organization.”
It’s also important to follow the chain of custody of the data after collecting it to ensure that it’s not tampered with later. In addition, data may change over time, so quality control processes must be ongoing.
For example, Abe Gong, CEO of data collaboration company Superconductive, once built an algorithm to predict health outcomes. One critical variable was gender, coded as 1 for male and 2 for female. The data came from a healthcare company. Then a new batch of data arrived using 1, 2, 4 and 9.
The reason? People were now able to select “nonbinary” or “prefer not to say.” But the schema was coded for ones and twos, so the algorithm would have produced erroneous results, treating a person with code 9 as nine times more female and multiplying their associated health risks accordingly.
“The model would have made predictions about disease and hospitalization risk that made absolutely no sense,” Gong said.
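The distortion Gong describes follows directly from feeding a categorical code to a model as if it were a number. In a minimal, hypothetical linear risk model (the weights here are invented for illustration), a code of 9 contributes nine times the weight of a code of 1:

```python
# Hypothetical linear model that (wrongly) treats the gender code
# as a continuous feature rather than a category.
weight_gender = 0.3  # weight learned when codes were only 1 and 2
bias = 0.1

def risk_score(gender_code):
    # The code value directly scales the learned weight.
    return bias + weight_gender * gender_code

print(risk_score(2))  # 0.7 -- within the range seen in training
print(risk_score(9))  # 2.8 -- code 9 contributes 9x the weight of code 1
```

The standard fix is one-hot encoding, so each category gets its own weight instead of scaling a shared one.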
Fortunately, the company had tests in place to catch the problem and update the algorithms for the new data.
“In our open source library, they’re called data contracts or checkpoints,” he said. “As the new data comes in, it raises an alert that says the system was expecting only ones and twos, which gives us a heads up that something has fundamentally changed in the data.”
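A checkpoint of this kind can be approximated in plain Python. This sketch (function and field names are hypothetical, not Superconductive's API) flags any codes outside the set the downstream model expects:

```python
EXPECTED_GENDER_CODES = {1, 2}  # the schema the model was trained on

def check_value_set(batch, column, expected):
    """Data-contract style check: report any values in a column
    that fall outside the expected set."""
    observed = {row[column] for row in batch}
    unexpected = observed - expected
    return {
        "success": not unexpected,
        "unexpected_values": sorted(unexpected),
    }

# A new batch arrives using the expanded coding scheme.
batch = [{"gender": 1}, {"gender": 2}, {"gender": 4}, {"gender": 9}]
result = check_value_set(batch, "gender", EXPECTED_GENDER_CODES)
# result["success"] is False; result["unexpected_values"] is [4, 9]
```

In a real pipeline, a failed check would raise an alert or halt the run before the model ever scores the new codes.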
Superconductive is one of several commercial vendors offering data scoring platforms. Other vendors in this market include Talend, Informatica, Anomalo, Datafold, Metaplane, Soda and Bigeye.
Identifying biased data
It’s too simplistic to say that some data contains bias and some doesn’t.
“There are no unbiased data stores,” said Slater Victoroff, co-founder and CTO at Indico Data, an unstructured data management company. “In truth, it’s a spectrum.”
The best approach is to identify bias and then work to correct it.
“There’s a large number of techniques that can be used to mitigate that bias,” Victoroff said. “Many of these techniques are simple tweaks to sampling and representation, but in practice it’s important to remember that data can’t become unbiased in a vacuum.”
Companies may need to look for new data sources outside the traditional ones or set up differential outcomes for protected classes.
“It’s not enough to simply say: ‘remove bias from the data,'” Victoroff said. “We have to explicitly look at differential outcomes for protected classes, and maybe even look for new sources of data outside of the ones that have traditionally been considered.”
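One of the simple sampling tweaks Victoroff mentions is rebalancing representation across groups. A minimal sketch, assuming group membership is recorded per row (oversampling alone does not make data unbiased, but it illustrates the idea):

```python
import random

def rebalance_by_group(rows, group_key, seed=0):
    """Oversample underrepresented groups (with replacement)
    until every group matches the largest group's count."""
    rng = random.Random(seed)
    groups = {}
    for row in rows:
        groups.setdefault(row[group_key], []).append(row)
    target = max(len(members) for members in groups.values())
    balanced = []
    for members in groups.values():
        balanced.extend(members)
        # Draw extra samples with replacement to reach the target.
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced

data = [{"group": "A"}] * 90 + [{"group": "B"}] * 10
balanced = rebalance_by_group(data, "group")
# Both groups now appear 90 times each.
```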
Other techniques companies can use to reduce bias include separating the people building the models from the fairness committee, said Sagar Shah, client partner at AI technology company Fractal Analytics. Companies can also make sure that developers can’t see sensitive attributes so that they don’t accidentally use that data in their models.
As with data quality checks, bias checks must also be continual, Shah said.
How to build data trust
One of the biggest trends this year when it comes to data is the move to data fabrics. This approach helps break down data silos and uses advanced analytics to optimize the data integration process and create a single, compliant view of data.
Data fabrics can reduce data management efforts by up to 70%, according to Gartner, which recommends using technology such as artificial intelligence to reduce human error and decrease costs.
Seventy-nine percent of organizations have more than 100 data sources — and 30% have more than 1,000, according to a December 2021 IDC survey of global chief data officers. Meanwhile, most organizations haven’t standardized their data quality function and nearly two-thirds haven’t standardized data governance and privacy.
Organizations that optimize their data see numerous benefits. Operational efficiency was 117% higher, customer retention was 44% higher, profits were 36% higher and time to market was 33% faster, according to the IDC survey.