How to root out bias in your data

Analytics is a top priority for savvy CIOs. But if implicit biases are hiding within your most trusted data sets, your algorithms could be leading you to make bad decisions.


Human beings are inherently biased. So when companies began using computer algorithms to guide their critical business processes, many people believed the days of discriminatory hiring practices, emotion-fuelled performance reviews and partisan product development were coming to an end.

“That was an optimistic, naive hope,” says Matt Bencke, CEO of Spare5, a startup that helps companies train algorithms for artificial intelligence systems. “Of course, we would all like to remove bias from certain parts of our lives, but that’s incredibly difficult.”

Just ask city officials in Boston. As part of an effort to shore up the municipality’s aging infrastructure, city hall released a smartphone app that collects GPS data to help detect potholes. However, because people in lower income groups are less likely to own smartphones, the program didn’t include data from large segments of city neighborhoods.

Even Pokemon Go isn’t immune to big data bias. Recently, the Urban Institute called out the wildly popular game for featuring fewer Pokemon stops in primarily black neighborhoods than it did in white communities. The Washington-based think tank speculates that location-based data for Pokemon Go originally came from an earlier game, Ingress, which was popular among “younger, English-speaking men,” many of whom contributed relevant portal locations to the game’s database.


But potholes and Pokemon are the least of what makes data bias dangerous. These days, businesses rely on sophisticated computer algorithms to hire new employees, determine product inventory, shape marketing campaigns and predict consumer behavior. If algorithms can’t be trusted to provide honest and impartial insights, businesses could make misguided and discriminatory decisions.

It’s a danger that’s well understood by Solon Barocas. Barocas is a postdoctoral researcher in the Microsoft Research Lab in New York City who focuses on the ethics of machine learning. “In many cases, the problems that result from [data] discrimination could harm your bottom line in the sense that you’re overlooking potentially qualified job applicants or you’re incorrectly targeting ads,” he warns. “In a way, discrimination is a form of inaccurate decision-making.”

As reliance on data analytics grows, critical questions arise: What’s to blame for data bias—programmer subjectivity, poorly chosen data sources, misguided sampling or the challenges of machine learning? And perhaps more importantly, who is to blame? An unwitting programmer, a busy CIO or the C-level executive who signed off on a data-driven initiative?

Dangers lurking within

To answer questions like those, it helps to understand one of the most common causes of bias in computer algorithms: data selection.

“Data bias always starts from the point of view of the person selecting the data because he’s the one who believes the data is valuable,” says Mark Lack, manager of strategy for analytics and business intelligence at Mueller Inc. And the more data there is to choose from, the greater the potential for skewed results.

That’s a problem that Mueller is trying to address. The manufacturer of metal building and metal roofing products crunches everything from transactional and machine data to general ledger and revenue figures. “We have no shortage of data in either volume, velocity or variety,” says Lack.

Yet all of those bits and bytes increase the likelihood of bias creeping into algorithms, because data sets are typically first selected subjectively by people.

To illustrate the problem, Barocas points to the example of a company relying on a data set that includes a smaller number of female workers than what actually exists in the labor market. Because this data set “doesn’t proportionately represent the different parts of the population,” Barocas says, the model is more likely to dismiss certain job applicants because the algorithm either doesn’t believe the population exists or has limited information about the segment. “The result,” he says, “is that you can have computers unintentionally discriminate against people who just happen to be not well represented in the data.”
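One rough way to spot the kind of under-representation Barocas describes is to compare each group's share of a data set against a reference share for the wider population. The sketch below is illustrative only: the function name, the toy records and the labor-market shares are all assumptions made for the example, not figures from Barocas or any real data set.

```python
from collections import Counter


def representation_gaps(records, key, reference_shares):
    """Return (share in sample - reference share) for each reference group.

    A negative value means the group is under-represented in the sample.
    """
    counts = Counter(record[key] for record in records)
    total = sum(counts.values())
    return {
        group: counts.get(group, 0) / total - expected
        for group, expected in reference_shares.items()
    }


# Hypothetical training set: 3 of 10 records describe women.
training = [{"gender": "F"}] * 3 + [{"gender": "M"}] * 7

# Assumed reference shares, chosen only to illustrate the comparison.
labor_market = {"F": 0.47, "M": 0.53}

gaps = representation_gaps(training, "gender", labor_market)
for group, gap in sorted(gaps.items()):
    print(f"{group}: {gap:+.2f}")
```

A large negative gap does not prove a model will discriminate, but it flags a segment about which the algorithm will have limited information.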

Lack says Mueller is trying to minimize the risk of poor data selection with the help of a robust analytics system. Instead of having data science teams choose data sets for analysis, Mueller uses IBM Watson Analytics technology, which is capable of analyzing vast volumes of information, with no need for human intervention.

“Bias happens when you start to chop off data that you think may not be relevant,” says Lack. “What Watson Analytics does is include it all, allowing us to remove our own biases.” Today, Mueller uses Watson Analytics tools to correlate sales and location data, forecast revenue and enhance supply chain operations.

The fallout of familiarity

Another threat to impartiality is training. In the case of artificial intelligence, programmers build data sets that are used to train computer algorithms. “This is where the bias comes in,” says Bencke. “Algorithms have no sense of subjectivity until you train them. Training is just a fancy term for letting a computer know what we think. But if you don’t think about how to build unbiased training data, it doesn’t matter how powerful your infrastructure is, or how clever your models are. You’ve presented a biased and flawed set of results.”

For instance, Bencke says a programmer’s selection of training data may reflect what’s most familiar to him or her, like a particular demographic’s preferences for images or tweets written in British English rather than American slang.

While those may be innocuous examples, “when you get into more sophisticated and subjective domains, it can be really dangerous,” Bencke says. “It’s rarely, if ever, about bad intentions. Rather, it’s about naiveté and a failure to really think things through carefully.”

Pleasing business leaders

Programmers aren’t the only ones at risk of inserting bias into computer algorithms. Oftentimes, IT professionals are asked by business leaders to crunch certain sets of data with a clear directive: Make sure the results support my hypothesis.

Fortunately, there are ways to avoid conducting analytics by personal agenda. Allan Frank, chief strategist and co-founder of consultancy The Hackett Group and partner at LiquidHub, has developed a technique for avoiding confirmation bias—the propensity to interpret information in a way that confirms one’s pre-existing beliefs or hypotheses. “We introduce questions asking the same thing multiple ways so that effectively you’re minimizing the impact of bias,” says Frank.

[ Figure: 4 tips for avoiding unwanted data bias. Computerworld, January-February 2017. Image credit: Computerworld / Thinkstock ]

For example, Frank cites a recent experience with a retail client. Struggling to improve its supply chain, the company assumed its problems stemmed from poorly planned inventory. So the retailer came to Frank seeking a predictive model to answer questions about its future inventory needs. However, Frank persuaded the company to examine its customers’ demands instead of asking questions about its inventory and supply chain. With the change in focus, “everybody’s eyes lit up,” Frank says. “Asking the right questions is as important as anything.”

Breaking bad habits

Regardless of the cause or type of an underlying bias, it’s hard to break the cycle of bad data skewing results. That’s because, over time, human biases become encoded in data sets and IT practitioners have a hard time detecting them.

Consider, for example, Bloomberg, the financial data and media company.

Bloomberg provides business news, in-depth analysis and data to financial companies and other organizations around the world. Its goal is “to give our clients the most accurate, truthful representation of the world as possible,” says Gideon Mann, head of data science at Bloomberg. Nonetheless, “even in many of our data analyses, there are a lot of times when people’s bias legitimately creeps into their estimates and their own views of things to come,” he says.

As a result, “over time, you have all of these assessments by a particular individual of the return for a particular company or a particular estimate,” Mann says. “And as you might imagine, sometimes those estimates are accurate—and sometimes they’re not.”

To avoid perpetuating patterns of data discrimination, Mann says Bloomberg is taking steps to “de-bias” its analysts’ assessments by examining whether they are “consistently being over-optimistic or over-pessimistic with a particular company.” Other approaches include conducting deep evaluations of analyst assessments, taking an average score when working with multiple estimates and performing random spot checks on decisions by subjecting them to rigorous human review.
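Two of those ideas, checking whether an analyst is consistently too optimistic or too pessimistic about a company and averaging multiple estimates, can be sketched in a few lines. This is not Bloomberg's method, which the article does not detail; the record layout, the analyst and company names and the numbers are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def signed_bias(history):
    """Mean of (estimate - actual) for each (analyst, company) pair.

    Positive means consistently over-optimistic, negative over-pessimistic.
    """
    errors = defaultdict(list)
    for analyst, company, estimate, actual in history:
        errors[(analyst, company)].append(estimate - actual)
    return {pair: mean(errs) for pair, errs in errors.items()}


def consensus(estimates):
    """Average several estimates so no single analyst's lean dominates."""
    return mean(estimates)


# Hypothetical past records: (analyst, company, estimate, actual outcome).
history = [
    ("ana", "ACME", 12.0, 10.0),
    ("ana", "ACME", 11.0, 10.0),
    ("ben", "ACME", 9.0, 10.0),
    ("ben", "ACME", 11.0, 10.0),
]

bias = signed_bias(history)

# Subtract an analyst's historical lean from a new estimate.
new_estimate = 13.0
adjusted = new_estimate - bias[("ana", "ACME")]

print(bias)
print(adjusted, consensus([adjusted, 11.0]))
```

Signed error matters here: an analyst whose misses cancel out (like the second one above) shows no systematic lean, even though individual estimates were wrong.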

IT on the hot seat

Organizations should also hold IT professionals’ feet to the fire.

According to Bencke of Spare5, education and awareness can help eliminate data bias. “People creating artificial intelligence engines need to be aware of the danger of bias,” he says. For this reason, he calls for “training on how to avoid bias in artificial intelligence models” and says employers should “encourage people to say that they’ve been certified through some sort of course work.”

Frank of The Hackett Group, on the other hand, recommends that companies “build heterogeneous and multidisciplinary” teams to discuss data bias “so that you actually get different perspectives.” For example, a company could set up a center of excellence that encourages IT professionals and business leaders to work together to examine how poorly selected data sets and loaded hypotheses can contribute to misleading insights.

“CIOs need to recognize that they’re on a journey together with the business to build an insight-driven organization,” Frank says.

Lack agrees. “We need to understand our value proposition as IT professionals, data scientists and analytics users,” he says. “Our value is in providing information to users so they can make the right decisions at the right time.”

IT should serve as a business enabler by providing insights based on unbiased data sets for the best possible business outcomes, he says.

But while certification courses and multidisciplinary teams are good first steps toward fixing bad algorithms, some argue that IT professionals need to rethink how they measure the value of their models. Most practitioners have been trained to put models through rigorous verification and validation testing, “but they tend not to think of issues as questions of nondiscrimination or bias,” says Barocas. “Rather, they think of it as being questions of validity—how do we make sure this model is actually going to do a good job?”

That’s a mistake, according to Barocas, who suggests that if IT leaders put as much work into creating impartial algorithms as they put into building high-performance models, they’d “get a handle on potential bias and discrimination.”
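As a rough illustration of what testing for bias alongside validity might look like, the sketch below compares how often a model selects candidates from each group. It is one simple style of check, not a method attributed to Barocas, and the group labels and decisions are made up for the example.

```python
from collections import defaultdict


def selection_rates(decisions):
    """Share of positive decisions per group.

    decisions: iterable of (group, selected) pairs, where selected is a bool.
    """
    totals = defaultdict(int)
    chosen = defaultdict(int)
    for group, selected in decisions:
        totals[group] += 1
        chosen[group] += int(selected)
    return {group: chosen[group] / totals[group] for group in totals}


# Hypothetical model output: group A is selected 6 times in 10, group B 3 in 10.
decisions = (
    [("A", True)] * 6 + [("A", False)] * 4
    + [("B", True)] * 3 + [("B", False)] * 7
)

rates = selection_rates(decisions)

# Ratio of lowest to highest selection rate; values well below 1 merit review.
ratio = min(rates.values()) / max(rates.values())

print(rates, ratio)
```

A model can score well on overall accuracy and still show a wide gap here, which is why the two kinds of test answer different questions.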

No easy solution

Blind faith in robust analytics systems can also challenge an organization’s ability to make smart decisions. Tools capable of crunching vast volumes of data can minimize the need for data preparation and selection—activities that require IT involvement. But there’s no such thing as a silver bullet when it comes to eliminating human biases.
