4 key tests for your AI explainability toolkit

Enterprise-grade explainability solutions provide fundamental transparency into how machine learning models make decisions, as well as broader assessments of model quality and fairness. Is yours up to the job?


Until recently, explainability was largely seen as an important but narrowly scoped requirement near the end of the AI model development process. Now it is regarded as a multi-layered requirement that provides value throughout the machine learning lifecycle.

Beyond providing fundamental transparency into how machine learning models make decisions, explainability toolkits now also run broader assessments of model quality, covering robustness, fairness, conceptual soundness, and stability.

Given the increased importance of explainability, organizations hoping to adopt machine learning at scale, especially those with high-stakes or regulated use cases, must pay greater attention to the quality of their explainability approaches and solutions.

There are many open source options available to address specific aspects of the explainability problem. However, it is hard to stitch these tools together into a coherent, enterprise-grade solution that is robust, internally consistent, and performs well across models and development platforms.

An enterprise-grade explainability solution must meet four key tests:

  1. Does it explain the outcomes that matter?
  2. Is it internally consistent?
  3. Can it perform reliably at scale?
  4. Can it satisfy rapidly evolving expectations?

Does it explain the outcomes that matter?

As machine learning models are increasingly used to influence or determine outcomes of high importance in people’s lives, such as loan approvals, job applications, and school admissions, it is essential that explainability approaches provide reliable and trustworthy explanations as to how models arrive at their decisions.

Explaining a classification decision (a yes/no decision) is often very different from explaining a probability result or model risk score. “Why did Jane get denied a loan?” is a fundamentally different question from “Why did Jane receive a risk score of 0.63?”

While conditional methods like TreeSHAP are accurate for model scores, they can be extremely inaccurate for classification outcomes. As a result, while they can be handy for basic model debugging, they are unable to explain the “human understandable” consequences of the model score, such as classification decisions.

Instead of TreeSHAP, consider Quantitative Input Influence (QII). QII simulates breaking the correlations between model features in order to measure the resulting changes in the model outputs. This technique is more accurate for a broader range of results, including not only model scores and probabilities but also the more impactful classification outcomes.
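The intervention idea can be sketched in a few lines of Python. This is only an illustration of the general principle, not the QII algorithm or any vendor's implementation: the scoring model, the feature names (`debt_ratio`, `income_norm`), the 0.5 threshold, and the comparison group are all hypothetical. One feature at a time is replaced with values taken from a comparison group, which breaks its link to the applicant's other features, and the average change is measured both for the score and for the yes/no decision.

```python
# Hypothetical scoring model, for illustration only: higher score = higher risk.
def risk_score(applicant):
    return 0.5 * applicant["debt_ratio"] + 0.5 * (1.0 - applicant["income_norm"])

def denied(applicant, threshold=0.5):
    # Classification outcome derived from the score: 1.0 = denied, 0.0 = approved.
    return 1.0 if risk_score(applicant) >= threshold else 0.0

def influence(output_fn, applicant, feature, comparison_group):
    """Average change in output when `feature` is swapped for each group member's value."""
    base = output_fn(applicant)
    changes = [
        base - output_fn({**applicant, feature: member[feature]})
        for member in comparison_group
    ]
    return sum(changes) / len(changes)

# Hypothetical applicant and comparison group.
jane = {"debt_ratio": 0.66, "income_norm": 0.40}
comparison_group = [
    {"debt_ratio": 0.2, "income_norm": 0.7},
    {"debt_ratio": 0.3, "income_norm": 0.6},
    {"debt_ratio": 0.6, "income_norm": 0.9},
    {"debt_ratio": 0.5, "income_norm": 0.8},
]

# The same intervention, measured against two different outputs.
score_influence = {f: influence(risk_score, jane, f, comparison_group) for f in jane}
decision_influence = {f: influence(denied, jane, f, comparison_group) for f in jane}

print("influence on score:   ", score_influence)
print("influence on decision:", decision_influence)
```

In this toy setup, swapping in the comparison group's debt ratios lowers Jane's score by 0.13 on average and reverses the denial in half of the cases. The two numbers answer different questions, which is why the output being explained has to be chosen deliberately.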

Outcome-driven explanations are very important for questions surrounding unjust bias. For example, if a model is truly unbiased, the answer to the question “Why was Jane denied a loan compared to all approved women?” should not differ from “Why was Jane denied a loan compared to all approved men?”
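As a rough sketch of how such a comparison might be set up, the Python below explains the same denial against two different comparison groups and reports the per-feature gap. The model, the feature names, and both groups are hypothetical, and the influence measure is a simplified stand-in rather than any particular toolkit's method.

```python
# Hypothetical decision model, for illustration only: 1.0 = denied, 0.0 = approved.
def denied(applicant, threshold=0.5):
    score = 0.5 * applicant["debt_ratio"] + 0.5 * (1.0 - applicant["income_norm"])
    return 1.0 if score >= threshold else 0.0

def explain(applicant, group, features):
    """Per feature: average change in the decision when the group's values are swapped in."""
    base = denied(applicant)
    explanation = {}
    for feature in features:
        changes = [base - denied({**applicant, feature: member[feature]}) for member in group]
        explanation[feature] = sum(changes) / len(changes)
    return explanation

def explanation_gap(explanation_a, explanation_b):
    """Per-feature difference between two explanations of the same decision."""
    return {f: explanation_a[f] - explanation_b[f] for f in explanation_a}

features = ["debt_ratio", "income_norm"]
jane = {"debt_ratio": 0.66, "income_norm": 0.40}

# Hypothetical comparison groups; every member is approved by the model above.
approved_women = [
    {"debt_ratio": 0.2, "income_norm": 0.7},
    {"debt_ratio": 0.3, "income_norm": 0.6},
]
approved_men = [
    {"debt_ratio": 0.6, "income_norm": 0.9},
    {"debt_ratio": 0.5, "income_norm": 0.8},
]

vs_women = explain(jane, approved_women, features)
vs_men = explain(jane, approved_men, features)
gap = explanation_gap(vs_women, vs_men)

print("compared to approved women:", vs_women)
print("compared to approved men:  ", vs_men)
print("gap:", gap)
```

Here the two explanations disagree: relative to approved women the denial is driven mostly by debt ratio, while relative to approved men it is driven by income. A gap like this is a prompt to investigate how the model treats the two groups, not a verdict on its own.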

Is it internally consistent?

Open source offerings for AI explainability are often restricted in scope. The Alibi library, for example, builds directly on top of SHAP and thus is automatically limited to model scores and probabilities. In search of a broader solution, some organizations have cobbled together an amalgam of narrow open source techniques. However, this approach can lead to inconsistent tools and provide contradictory results for the same questions.

A coherent explainability approach must ensure consistency along three dimensions:

  1. Explanation scope (local vs. global): Deep model evaluation and debugging capabilities are critical to deploying trustworthy machine learning, and root cause analysis requires a consistent, well-founded explanation framework. If different techniques are used to generate local and global explanations, it becomes impossible to trace unexpected explanation behavior back to its root cause, which removes the opportunity to fix it.
  2. The underlying model type (traditional models vs. neural networks): A good explanation framework should ideally be able to work across machine learning model types — not just for decision trees/forests, logistic regression models, and gradient-boosted trees, but also for neural networks (RNNs, CNNs, transformers).
  3. The stage of the machine learning lifecycle (development, validation, and ongoing monitoring): Explanations need not be consigned to the last step of the machine learning lifecycle. They can act as the backbone of machine learning model quality checks in development and validation, and then also be used to continuously monitor models in production settings. Seeing how model explanations shift over time, for example, can act as an indication of whether the model is operating on new and potentially out-of-distribution samples. This makes it essential to have an explanation toolkit that can be consistently applied throughout the machine learning lifecycle.
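The monitoring idea in the third point can be sketched as follows. This is a minimal illustration under stated assumptions: the per-row attributions are taken as given (produced by whichever explanation method is in use), the feature names and numbers are made up, and the 0.2 alert threshold is an arbitrary placeholder that would need tuning in practice.

```python
def attribution_profile(attributions):
    """Share of total mean absolute attribution carried by each feature."""
    features = list(attributions[0].keys())
    mean_abs = {
        f: sum(abs(row[f]) for row in attributions) / len(attributions)
        for f in features
    }
    total = sum(mean_abs.values())
    if total == 0:
        raise ValueError("all attributions are zero; profile is undefined")
    return {f: value / total for f, value in mean_abs.items()}

def profile_shift(reference, current):
    """Total variation distance between two profiles: 0 = identical, 1 = disjoint."""
    return 0.5 * sum(abs(reference[f] - current[f]) for f in reference)

# Hypothetical per-row feature attributions from two time windows.
validation_attributions = [
    {"income": 0.3, "debt": -0.1},
    {"income": -0.3, "debt": 0.1},
]
production_attributions = [
    {"income": 0.1, "debt": 0.3},
    {"income": -0.1, "debt": -0.3},
]

ALERT_THRESHOLD = 0.2  # hypothetical value, chosen only for this example

reference_profile = attribution_profile(validation_attributions)
current_profile = attribution_profile(production_attributions)
shift = profile_shift(reference_profile, current_profile)
drifted = shift > ALERT_THRESHOLD

print("reference:", reference_profile)
print("current:  ", current_profile)
print("shift:", shift, "drifted:", drifted)
```

The comparison is only meaningful if both windows were explained with the same method and the same comparison group, which is the consistency argument made above.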

Can it perform reliably at scale?

Explanations, particularly those that estimate Shapley values, as SHAP and QII do, are always going to be approximations. All explanations (short of replicating the model itself) incur some loss in fidelity. All else being equal, faster explanation calculations enable more rapid development and deployment of a model.
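To make the approximation concrete, the sketch below computes Shapley values for a tiny hypothetical model two ways: exactly, by averaging over every feature ordering, and approximately, by sampling random orderings. It assumes a single all-zero baseline for "absent" features, which is a simplification; the model, input, and sample count are illustrative only. Exact enumeration grows factorially with the number of features, which is why practical tools rely on sampling or model-specific shortcuts.

```python
import itertools
import random

# Hypothetical model with an interaction between features 0 and 2.
def model(x):
    return x[0] + 2.0 * x[1] + x[0] * x[2]

def marginal_contributions(order, x, baseline):
    """Add features in the given order, recording each one's change to the output."""
    current = list(baseline)
    previous = model(current)
    contributions = [0.0] * len(x)
    for i in order:
        current[i] = x[i]
        updated = model(current)
        contributions[i] = updated - previous
        previous = updated
    return contributions

def exact_shapley(x, baseline):
    """Average marginal contributions over all n! orderings."""
    orderings = list(itertools.permutations(range(len(x))))
    totals = [0.0] * len(x)
    for order in orderings:
        for i, c in enumerate(marginal_contributions(order, x, baseline)):
            totals[i] += c
    return [t / len(orderings) for t in totals]

def sampled_shapley(x, baseline, n_samples, seed=0):
    """Monte Carlo estimate: average over randomly shuffled orderings."""
    rng = random.Random(seed)
    order = list(range(len(x)))
    totals = [0.0] * len(x)
    for _ in range(n_samples):
        rng.shuffle(order)
        for i, c in enumerate(marginal_contributions(order, x, baseline)):
            totals[i] += c
    return [t / n_samples for t in totals]

x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]

exact = exact_shapley(x, baseline)
approx = sampled_shapley(x, baseline, n_samples=2000)

print("exact:  ", exact)
print("sampled:", approx)
```

Both sets of values sum to the difference between the model output at `x` and at the baseline. The sampled estimate for the additive feature matches the exact value, while the features involved in the interaction are only approximated, and the error shrinks as more orderings are sampled.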

The QII framework can provably (and practically) deliver accurate explanations while still adhering to the principles of a good explanation framework. But scaling these computations across different forms of hardware and model frameworks requires significant infrastructure support.

Even with Shapley values as the foundation, implementing explanations correctly and at scale can be a significant challenge. Common implementation issues include problems with how correlated features are dealt with, how missing values are treated, and how the comparison group is selected. Subtle errors along these dimensions can lead to significantly different local or global explanations.

Can it satisfy rapidly evolving expectations?

The question of what constitutes a good explanation is evolving rapidly. On the one hand, the science of explaining machine learning models (and of conducting reliable assessments of model quality, such as bias, stability, and conceptual soundness) is still developing. On the other, regulators around the world are framing their expectations on the minimum standards for explainability and model quality. As machine learning models are rolled out in new industries and use cases, expectations around explanations also change.

Given this shifting baseline, it is essential that the explainability toolkit used by a firm remains dynamic. Having a dedicated R&D capability — to understand evolving needs and tailor or enhance the toolkit to meet them — is critical.

Explainability is central to building trust in machine learning models and ensuring large-scale adoption. Using a medley of diverse open source options to achieve that can appear attractive, but stitching them together into a coherent, consistent, and fit-for-purpose framework remains challenging. Firms looking to adopt machine learning at scale should spend the time and effort needed to find the right option for their needs.

Shayak Sen is the chief technology officer and co-founder of Truera. Sen started building production-grade machine learning models over 10 years ago and has conducted leading research in making machine learning systems more explainable, privacy compliant, and fair. He has a Ph.D. in computer science from Carnegie Mellon University and a BTech in computer science from the Indian Institute of Technology, Delhi.

Anupam Datta, professor of electrical and computer engineering at Carnegie Mellon University and chief scientist of Truera, and Divya Gopinath, research engineer at Truera, contributed to this article.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.

Copyright © 2021 IDG Communications, Inc.
