The Solvability of Interpretability Evaluation Metrics
Abstract
Feature attribution methods are popular for explaining neural network predictions, and they are often evaluated on metrics such as comprehensiveness and sufficiency. In this paper, we highlight an intriguing property of these metrics: their solvability. Concretely, we can define the problem of optimizing an explanation for a metric, which can be solved by beam search. This observation leads to the obvious yet unaddressed question: why do we use explainers (e.g., LIME) not based on solving the target metric, if the metric value represents explanation quality? We present a series of investigations showing strong performance of this beam search explainer and discuss its broader implication: a definition-evaluation duality of interpretability concepts. We implement the explainer and release the Python solvex package for models of text, image and tabular domains.
1 Introduction
For neural network models deployed in high stakes domains, the explanations for predictions are often as important as the predictions themselves. For example, a skin cancer detection model may work by detecting surgery markers (Winkler et al., 2019), and an explanation that reveals this spurious correlation is highly valuable. However, evaluating the correctness (or faithfulness) of explanations is fundamentally ill-posed: because explanations are used to help people understand the reasoning of the model, we cannot check them against the ground truth reasoning, as the latter is not available.
As a result, correctness evaluations typically employ certain alternative metrics. For feature attribution explanations, they work under a shared principle: changing an important feature should have a large impact on the model prediction. Thus, the quality of the explanation is defined by different formulations of the model prediction change, resulting in various metrics such as comprehensiveness and sufficiency (DeYoung et al., 2020). To develop new explanation methods (Fig. 1, left), people generally identify a specific notion of feature importance (e.g., local sensitivity), propose the corresponding explainer (e.g., gradient saliency (Simonyan et al., 2013)), evaluate it on one or more metrics, and claim its superiority based on favorable results vs. baseline explainers. We call these explainers heuristic as they are motivated by pre-defined notions of feature importance.

In this paper, we show that all these metrics are solvable, in that we can define an explanation as the one that optimizes a metric value and search for it. The obvious question is then: if we take a specific target metric to represent correctness, why don’t we just search for the metric-optimal explanation (Fig. 1, right) but take the more convoluted route of developing heuristic explanations and then evaluating them (Fig. 1, left)?

There are several possible reasons. First, the optimization problem may be so hard that we cannot find an explanation better than the heuristic ones. The bigger concern, however, is that of Goodhart’s Law. In other words, as soon as a metric is used in explicit optimization, it ceases to be a good metric. Concretely, the explanation may overfit to the particular metric and perform much worse on closely related ones (Chan et al., 2022), or overfit to the model and effectively adversarially attack the model when assigning word importance (Feng et al., 2018). It may also perform poorly on evaluations not based on such metrics, such as ground truth alignment (Zhou et al., 2022a).
We assess these concerns, taking the widely used comprehensiveness and sufficiency metrics (DeYoung et al., 2020) as the optimization target. Our findings, however, largely dispel every concern. A standard beam search produces explanations that greatly outperform existing ones such as LIME and SHAP on the target metric. On several other metrics, the search-based explainer also performs favorably on average. There is no strong evidence of it adversarially exploiting the model either, and it achieves competitive performance on a suite of ground truth-based evaluations.
Thus, we advocate for wider adoption of the explainer, which is domain-general and compatible with models on image and tabular data as well. As an engineering contribution, we release the Python solvex package (solvability-based explanation) and demonstrate its versatility in App. A.
More broadly, the solvability phenomenon is one facet of the definition-evaluation duality, which asserts an equivalence between definitions and evaluations. Solvability recognizes that for each evaluation metric, we can define an explainer that performs optimally on this metric. Conversely, for each explainer, we can also come up with an evaluation metric that ranks this explainer on top – a straightforward one would be the negative distance between the explanation under evaluation and the “reference explanation” generated by the explainer.
While the community has mostly agreed on a spectrum on which various interpretability concepts (Fig. 2) are located, duality allows every concept to be moved freely on the scale. We explored two particular movements as represented by the solid arrows, but the more general investigation of this operation could be of both theoretical and practical interest. In addition, given that definitions and evaluations are really two sides of the same coin, we need to reflect on how best to evaluate explanations. Sec. 6 argues for measuring their demonstrable utilities in downstream tasks, and presents potential ways and ideas to better align interpretability research with such goals.
2 Background and Related Work
In this section, we give a concise but unified introduction to the popular feature attribution explainers and evaluation metrics studied in this paper.
2.1 Feature Attribution Explainers
We focus on feature attribution explanations, which explain an input $x = (x_1, \dots, x_N)$ of $N$ features (e.g., words) by a vector $E = (e_1, \dots, e_N)$, where $e_i$ represents the “contribution” of $x_i$ to the prediction. Many different definitions for contribution have been proposed, and we consider the following five.
• Vanilla gradient (Grad) (Simonyan et al., 2013) is the gradient of the model prediction with respect to the input embedding, summarized by its magnitude.
• Integrated gradient (IntG) (Sundararajan et al., 2017) is the path integral of the embedding gradient along the line segment from the zero embedding value to the actual value.
• LIME (Ribeiro et al., 2016) is the coefficient of a linear regression in the local neighborhood.
• SHAP (Lundberg and Lee, 2017) is the (approximate) Shapley value (Roth, 1988) of each feature.
• Occlusion (Occl) (Li et al., 2016b) is the change in prediction when a word is removed from the input while all other words remain.
2.2 Feature Attribution Evaluations
Naturally, different definitions result in different explanation values. As findings (e.g., Adebayo et al., 2018; Nie et al., 2018) suggest that some explanations are not correct (i.e., faithfully reflecting the model’s reasoning process), many evaluations are proposed to quantify the correctness of different explanations. Not having access to the ground truth model working mechanism (which is what explanations seek to reveal in the first place), they are instead guided by one principle: changing an important feature (as judged by the explanation) should have a large impact on the prediction, and the magnitude of the impact is taken as explanation quality. However, there are different ways to quantify the impact, leading to different evaluations, and we consider six in this paper.
Let $f$ be the function that we want to explain, such as the predicted probability of the target class. For an input $x$ of $N$ words, according to an explanation $E$, we can create a sequence of input deletions $x^{(d)}_0, x^{(d)}_1, \dots, x^{(d)}_N$, where $x^{(d)}_k$ is the input but with the $k$ most important features removed. Thus, we have $x^{(d)}_0 = x$ and $x^{(d)}_N$ being the empty string.[1] ([1]: We define feature removal as the literal deletion of the word from the sentence, which is a popular practice. Other methods replace the token with [UNK], [MASK], or a zero embedding, or are more sophisticated, such as performing BERT mask filling (Kim et al., 2020). While our current approach could lead to out-of-distribution instances, we adopt it due to its popularity. A thorough investigation of the best strategy is orthogonal to our paper and beyond its scope.) The comprehensiveness (DeYoung et al., 2020) is defined as
$\mathrm{comp} = \frac{1}{N+1} \sum_{k=0}^{N} \big( f(x) - f(x^{(d)}_k) \big)$   (1)
It measures the deviation from the original model prediction when important features (according to ) are successively removed, and therefore a larger value is desirable. It was also proposed for computer vision models as the area over perturbation curve (AoPC) by Samek et al. (2016).
Analogously, we can define the sequence of input insertions $x^{(i)}_0, x^{(i)}_1, \dots, x^{(i)}_N$, where $x^{(i)}_k$ is the input with only the $k$ most important features present. Thus, $x^{(i)}_0$ is the empty string and $x^{(i)}_N = x$, but otherwise the sequences of input insertions and deletions do not mirror each other. The sufficiency (DeYoung et al., 2020) is defined as
$\mathrm{suff} = \frac{1}{N+1} \sum_{k=0}^{N} \big( f(x) - f(x^{(i)}_k) \big)$   (2)
It measures the gap to the original model prediction that remains (i.e., convergence to the model prediction) when features are successively inserted from the most important to the least. Therefore, a smaller value is desirable.
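For concreteness, below is a minimal sketch (not the released solvex implementation) of how comp and suff could be computed for a word-level importance ordering. Here `predict` is a hypothetical function returning the target-class probability of a (possibly partial) sentence, and `order` lists word indices from most to least important; both names are ours, not from the paper.

```python
from typing import Callable, List


def deletion_sequence(words: List[str], order: List[int]) -> List[str]:
    """x_k^(d): the input with the k most important words removed, k = 0..N."""
    seq = []
    for k in range(len(words) + 1):
        removed = set(order[:k])
        seq.append(" ".join(w for i, w in enumerate(words) if i not in removed))
    return seq


def insertion_sequence(words: List[str], order: List[int]) -> List[str]:
    """x_k^(i): the input with only the k most important words present, k = 0..N."""
    seq = []
    for k in range(len(words) + 1):
        kept = set(order[:k])
        seq.append(" ".join(w for i, w in enumerate(words) if i in kept))
    return seq


def comp_suff(predict: Callable[[str], float], words: List[str], order: List[int]):
    """Eq. 1 and 2: average prediction change over the deletion / insertion sequences."""
    full = predict(" ".join(words))
    n = len(words) + 1
    comp = sum(full - predict(x) for x in deletion_sequence(words, order)) / n
    suff = sum(full - predict(x) for x in insertion_sequence(words, order)) / n
    return comp, suff
```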
Another interpretation of prediction change just considers decision flips. Let $g$ be the function that outputs the most likely class of an input. The decision flip by removing the most important token (DFMIT) (Chrysostomou and Aletras, 2021) is defined as
$\mathrm{DFMIT} = \mathbb{1}\big[\, g(x^{(d)}_1) \neq g(x) \,\big]$   (3)
which measures whether removing the most important token changes the decision. Across a dataset, its average value gives the overall decision flip rate, and a higher value is desirable.
The fraction of token removals for decision flip (DFFrac) (Serrano and Smith, 2019) is defined as
$\mathrm{DFFrac} = \frac{1}{N}\, \min\big\{ k : g(x^{(d)}_k) \neq g(x) \big\}$   (4)
and we define $\mathrm{DFFrac} = 1$ if no value of $k$ leads to a decision flip. This metric represents the fraction of feature removals that is needed to flip the decision, and hence a lower value is desirable.
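A sketch of the two decision flip metrics under the same conventions as above; `predict_class` is a hypothetical function returning the argmax label, and `order` again lists word indices from most to least important.

```python
from typing import Callable, List, Tuple


def decision_flip_metrics(predict_class: Callable[[str], int],
                          words: List[str], order: List[int]) -> Tuple[int, float]:
    """DFMIT (Eq. 3) and DFFrac (Eq. 4) for one input and one importance ordering."""
    original = predict_class(" ".join(words))
    # DFMIT: does removing just the most important word flip the decision?
    without_top = " ".join(w for i, w in enumerate(words) if i != order[0])
    dfmit = int(predict_class(without_top) != original)
    # DFFrac: smallest fraction of most-important-first removals that flips the decision.
    dffrac = 1.0  # convention when no removal prefix flips the decision
    for k in range(1, len(words) + 1):
        removed = set(order[:k])
        rest = " ".join(w for i, w in enumerate(words) if i not in removed)
        if predict_class(rest) != original:
            dffrac = k / len(words)
            break
    return dfmit, dffrac
```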
Last, two metrics evaluate correlations between the model prediction and feature importance. For $x$ and $E$, we define the sequence of marginal feature deletions $x^{(m)}_1, \dots, x^{(m)}_N$ such that $x^{(m)}_k$ is the original input with only the $k$-th most important feature removed. The deletion rank correlation (RankDel) (Alvarez-Melis and Jaakkola, 2018b) is defined as
(5) | |||
(6) |
where $\rho(\cdot, \cdot)$ is the Spearman rank correlation coefficient between the two input vectors. Intuitively, this metric asserts that suppressing a more important feature should have a larger impact on the model prediction. A higher correlation is desirable.
The insertion rank correlation (RankIns) (Luss et al., 2021) is defined as
(7) | |||
(8) |
and recall that $x^{(d)}_N, x^{(d)}_{N-1}, \dots, x^{(d)}_0$ is the sequence of inputs with increasingly more important features inserted, starting from the empty string $x^{(d)}_N$ to the full input $x^{(d)}_0 = x$. This metric asserts that the model prediction on this sequence should increase monotonically to the original prediction. A higher correlation is also desirable.
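A sketch of the two rank correlation metrics, following the reconstruction of Eq. 5-8 above (in particular, the assumption that RankIns inserts words from least to most important); `predict` is the same hypothetical target-class probability function.

```python
from typing import Callable, List

from scipy.stats import spearmanr


def rank_del(predict: Callable[[str], float], words: List[str],
             attributions: List[float]) -> float:
    """Eq. 5-6: correlation between attributions and marginal deletion impacts."""
    full = predict(" ".join(words))
    impacts = [full - predict(" ".join(w for j, w in enumerate(words) if j != i))
               for i in range(len(words))]
    rho, _ = spearmanr(attributions, impacts)
    return rho


def rank_ins(predict: Callable[[str], float], words: List[str],
             order: List[int]) -> float:
    """Eq. 7-8: monotonicity of predictions as words are inserted, least important first."""
    n = len(words)
    preds = []
    for k in range(n + 1):
        kept = set(order[n - k:])  # the k least important words
        preds.append(predict(" ".join(w for i, w in enumerate(words) if i in kept)))
    rho, _ = spearmanr(list(range(n + 1)), preds)
    return rho
```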
Related to our proposed notion of solvability is the phenomenon that some metric values seem to favor some explainers (Pham et al., 2022; Ju et al., 2022). While it is often used to argue against the use of certain evaluations, we take this idea to the extreme, which culminates in the solvability property, and find that metric-solving (Def. 3.1) explanations from some metrics can be high-quality.
3 The Solvability of Evaluation Metrics
Now we establish the central observation of this paper: the solvability of these evaluation metrics. Observe that each evaluation metric $m$, e.g., comprehensiveness, is defined on the input $x$ and the explanation $E$, and its computation only uses the model prediction function $f$ (or $g$ derived from $f$ for the two decision flip metrics). In addition, the form of feature attribution explanation constrains $E$ to be a vector of the same length as $x$, i.e., $E \in \mathbb{R}^N$.
Without loss of generality, we assume that the metrics are defined such that a higher value means a better explanation (e.g., redefining the sufficiency to be the negative of its original form). We formalize the concept of solvability as follows:
Definition 3.1.
For a metric $m$ and an input $x$, an explanation $E^*$ solves the metric if $m(E^*, x) \geq m(E, x)$ for all $E \in \mathbb{R}^N$. We also call $E^*$ the $m$-solving explanation.
Notably, there are already two explanation-solving-metric cases among the ones in Sec. 2.
Theorem 1.
The occlusion explainer solves the DFMIT and RankDel metrics.
The proof follows from the definitions of the explainer and the two metrics. The occlusion explainer defines token importance as the prediction change when each token is individually removed; thus the most important token is the one that induces the largest change, which makes it the most likely to flip the decision under DFMIT. In addition, because token importance is defined exactly as the model prediction change under marginal deletion, its rank correlation with the latter (i.e., RankDel) is maximal at 1.0.
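For reference, a minimal occlusion explainer is sketched below (again with a hypothetical `predict` returning the target-class probability). By construction, the ranking it induces is exactly the ranking of marginal prediction changes, which is what DFMIT and RankDel reward.

```python
from typing import Callable, List


def occlusion_explanation(predict: Callable[[str], float], words: List[str]) -> List[float]:
    """Attribution of each word = prediction drop when that word alone is removed."""
    full = predict(" ".join(words))
    return [full - predict(" ".join(w for j, w in enumerate(words) if j != i))
            for i in range(len(words))]
```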
Thm. 1 highlights an important question: if we take DFMIT or RankDel as the metric (i.e., indicator) of explanation quality, why should we consider any other explanation, when the occlusion explanation provably achieves the optimum? A possible answer is that the metrics themselves are problematic. For example, one can argue that the DFMIT is too restrictive for overdetermined input: when redundant features (e.g., synonyms) are present, removing any individual one cannot change the prediction, such as for the sentiment classification input of “This movie is great, superb and beautiful.”
Nonetheless, the perceived quality of a metric can be loosely inferred from its adoption by the community, and the comprehensiveness and sufficiency metrics (DeYoung et al., 2020) are by far the most widely used. They overcome the issue of DFMIT by also considering inputs with more than one token removed. Since a metric is scalar-valued, we combine comprehensiveness and sufficiency into the comp-suff difference, denoted Diff and defined as (recall that a lower sufficiency value is better):
$\mathrm{Diff} = \mathrm{comp} - \mathrm{suff}$   (9)
Again, we face the same question: if Diff is solvable, why should any heuristic explainers be used instead of the Diff-solving $E^*$? In the next two sections, we seek to answer it by first proposing a beam search algorithm to (approximately) find $E^*$ and then exploring its various properties.
4 Solving Metrics with Beam Search
We first define two properties that are satisfied by some metrics: value agnosticity and additivity.
Definition 4.1.
For an input $x$ of length $N$ with explanation $E$, we define the ranked importance $r = r(E) \in \{1, \dots, N\}^N$ as the rank of each entry of $E$ in decreasing order of importance. In other words, the feature $x_i$ with $r_i = 1$ is the most important, and that with $r_i = N$ is the least. A metric $m$ is value-agnostic if for all $E_1$ and $E_2$ that induce the same ranked importance, we have
$m(E_1, x) = m(E_2, x)$   (10)
A value-agnostic metric has at most $N!$ unique values across all possible explanations for an input of length $N$. Thus, in theory, an exhaustive search over the permutations of the list $(1, 2, \dots, N)$ is guaranteed to find the $E^*$ that solves the metric.
Definition 4.2.
A metric $m$ is additive if it can be written in the form of
$m(E, x) = \sum_{k=0}^{N} u\big(E|_k, x\big)$   (11)
for some function $u$, where $E|_k$ reveals the attribution values of the $k$ most important features according to $E$ but keeps the rest inaccessible.
Theorem 2.
Comprehensiveness, sufficiency and their difference are value-agnostic and additive.
The proof is straightforward, by observing that both $x^{(d)}_k$ and $x^{(i)}_k$ can be created from $x$ and the ordering of $E$ (in particular, from the top-$k$ features revealed by $E|_k$). In fact, all metrics in Sec. 2 are value-agnostic (but only some are additive).
A metric satisfying these two properties admits an efficient beam search algorithm to approximately solve it. As $E|_k$ can be considered a partial explanation that only specifies the top-$k$ important features, we start with the empty partial explanation $E|_0$ and try each feature as the most important one to obtain candidate $E|_1$'s. With beam size $B$, if there are more than $B$ candidates, we keep the top-$B$ according to the partial sum of Eq. 11. This extension procedure continues until all features are added, and the top extension is then $E^*$. Alg. 1 documents the procedure, where the extension step takes a partial explanation and returns a set of explanations, each of which fills in one previously empty entry with the next importance value. Finally, note that the $E^*$ generated on Line 8 has entry values in $\{1, \dots, N\}$, but some features may contribute against the prediction (e.g., “slightly cursory” in “This movie is truly innovative although slightly cursory.”). Thus, we post-process $E^*$ by shifting all values by a constant $c$, such that the new values (in $\{1-c, \dots, N-c\}$) maximally satisfy the sign of the marginal contribution of each word (i.e., the sign of the occlusion saliency).
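A minimal sketch of the search follows (not the released solvex code, and not Alg. 1 verbatim): partial explanations are represented as importance orderings, `predict` is again a hypothetical target-class probability function, and additivity is used so that extending a beam only adds one new summand of Eq. 9 per step.

```python
from typing import Callable, List, Tuple


def beam_search_order(predict: Callable[[str], float], words: List[str],
                      beam_size: int = 100) -> List[int]:
    """Approximately maximize Diff = comp - suff over importance orderings."""
    n = len(words)
    full = predict(" ".join(words))
    beams: List[Tuple[float, List[int]]] = [(0.0, [])]  # (partial score, prefix order)
    for _ in range(n):
        candidates = []
        for score, order in beams:
            chosen = set(order)
            for i in range(n):
                if i in chosen:
                    continue
                new_order = order + [i]
                top = set(new_order)  # the k most important words so far
                deleted = " ".join(w for j, w in enumerate(words) if j not in top)
                inserted = " ".join(w for j, w in enumerate(words) if j in top)
                # k-th summand of Diff: (f(x) - f(x_k^d)) - (f(x) - f(x_k^i))
                gain = (full - predict(deleted)) - (full - predict(inserted))
                candidates.append((score + gain, new_order))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_size]
    return beams[0][1]  # word indices from most to least important
```

The returned ordering can then be converted into attribution values by the sign-aware shift described above.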
Without the additive property, beam search is not feasible due to the lack of partial metric values. However, Zhou et al. (2021) presented a simulated annealing algorithm (Kirkpatrick et al., 1983) to search for the optimal data acquisition order in active learning, and we can use a similar procedure to search for the optimal feature importance order. If the metric is value-sensitive, assuming differentiability with respect to the explanation value, methods such as gradient descent can be used. Since we focus on comprehensiveness and sufficiency in this paper, the development and evaluation of these approaches are left to future work.
5 Experiments
We investigate various properties of the beam search explainer vs. existing heuristic explainers, using the publicly available textattack/roberta-base-SST-2 model on the SST dataset (Socher et al., 2013) as a case study. The sentiment value for each sentence is a number between 0 (very negative) and 1 (very positive), which we binarize into negative and positive classes; sentences with near-neutral sentiment values in the middle are discarded. The average sentence length is 19, making exhaustive search infeasible. We use a beam size of 100 to search for the Diff-solving explanation E∗. All reported statistics are computed on the test set.
Fig. 3 presents two explanations, with additional ones in Fig. 11 of App. C. While we need more quantitative analyses (carried out below) for definitive conclusions on its various properties, the E∗ explanations at least look reasonable and are likely to help people understand the model by highlighting the high importance of sentiment-laden words.
A worthy tribute to a great humanitarian and her vibrant ‘ co-stars . ’
So stupid , so ill-conceived , so badly drawn , it created whole new levels of ugly .
5.1 Performance on the Target Metric
We compare E∗ to heuristic explainers on the Diff metric, with results shown in Tab. 1 along with the associated comp and suff. A random explanation baseline is included for reference. We can see that E∗ achieves the best Diff, often by a large margin. It also tops the ranking separately for comp and suff, which suggests that an explanation can be optimally comprehensive and sufficient at the same time.
Explainer | Comp | Suff | Diff | |
Grad | 0.327 | 0.108 | 0.218 | |
IntG | 0.525 | 0.044 | 0.481 | |
LIME | 0.682 | 0.033 | 0.649 | |
SHAP | 0.612 | 0.034 | 0.578 | |
Occl | 0.509 | 0.040 | 0.469 | |
E∗ | 0.740 | 0.020 | 0.720 |
Random | 0.218 | 0.212 | 0.006 |
To visually understand how the model prediction changes during feature removal and insertion, we plot in Fig. 4 the values of $f(x) - f(x^{(d)}_k)$ and $f(x) - f(x^{(i)}_k)$ (i.e., the summands in Eq. 1 and 2) as a function of $k$. The left panel shows the curves averaged across all test set instances, and the right panel shows those for a specific instance. Comp and suff are thus the areas under the solid and dashed curves, respectively. The curves for E∗ dominate the rest and, on individual inputs, are also much smoother than those for other explanations.

One concern for beam search is its efficiency, especially compared to explainers that only require a single pass of the model, such as the vanilla gradient. However, we note that explanations, unlike model predictions, are rarely used in real-time decision making. Instead, they are mostly used for debugging and auditing purposes, and incurring a longer generation time to obtain a higher-quality explanation is often a worthwhile trade-off. On a single RTX3080 GPU card without any in-depth code optimization, the metric values and time costs for various beam sizes are presented in Tab. 2, with statistics for the best heuristic explainer, LIME, also listed for comparison.
Beam size | 1 | 5 | 10 | 20 | 50 | 100 | LIME |
Comp | 0.717 | 0.731 | 0.734 | 0.736 | 0.739 | 0.740 | 0.682 |
Suff | 0.020 | 0.020 | 0.020 | 0.020 | 0.020 | 0.020 | 0.033 |
Diff | 0.697 | 0.711 | 0.714 | 0.716 | 0.719 | 0.720 | 0.649 |
Time | 0.38 | 0.77 | 1.15 | 1.72 | 2.85 | 4.37 | 4.75 |
Expectedly, the metric values increase with increasing beam size, but the improvement is meager after 10 beams. More importantly, beam search is not slow – it is still faster than LIME even with 100 beams, and the single-beam version outperforms LIME by a decent margin while being more than 10 times faster. Thus, these results establish that if we take comprehensiveness and sufficiency as the quality metrics, there is really no reason not to use the beam search explainer directly.
5.2 Performance on Other Metrics
Sec. 2 lists many metrics that all operationalize the same principle that changing important features should have large impact on model prediction, but in different ways. A potential argument against the explicit beam search optimization is the fulfillment of Goodhart’s Law: E∗ overfits to the metric by exploiting its realization (i.e., Eq. 1 and 2) of this principle and not truly reflecting its “spirit.”
To assess the legitimacy of this opposition, we evaluate all the explainers on the remaining four metrics in Sec. 2 and present the results in Tab. 3.
Explainer | DFMIT | DFFrac | RankDel | RankIns | |
Grad | 10.5% | 54.5% | 0.162 | 0.521 | |
IntG | 16.9% | 39.6% | 0.369 | 0.468 | |
LIME | 25.5% | 28.1% | 0.527 | 0.342 | |
SHAP | 23.0% | 36.1% | 0.369 | 0.458 | |
Occl | 26.4% | 40.6% | 1.000 | 0.396 | |
E∗ | 25.0% | 25.2% | 0.438 | 0.423 |
Random | 3.4% | 72.3% | 0.004 | 0.599 |
Since the occlusion explainer solves DFMIT and RankDel (Thm. 1), it ranks the best on these two metrics, as expected. Nonetheless, E∗ still ranks competitively on these two metrics and comes out ahead on DFFrac. The only exception is RankIns, on which the random explanation surprisingly performs the best. We carefully analyze it in App. D and identify a fundamental flaw in this metric.
Last, note that we can also incorporate any of these metrics into the objective function (which already contains two metrics: comp and suff), and search for the E∗ that performs the best overall, if so desired. We leave this investigation to future work.
5.3 Explainer “Attacking” the Model
Another concern is that the search procedure may overfit to the model. Specifically, removing a word $x_i$ from a particular partial sentence may drastically change the model prediction while not having the same effect for most other partial sentences. This makes E∗ assign $x_i$ an overly high attribution, as $x_i$ only happens to have a high impact in one particular case. By contrast, explainers like LIME and SHAP automatically avoid this issue by computing the average contribution of $x_i$ on many different partial sentences.
We test this concern by locally perturbing the explanation. If E∗ uses many such “adversarial attacks,” we should expect its metric values to degrade sharply under perturbation, as the high-importance words (according to E∗) will no longer be influential in different partial sentence contexts.
To perturb the explanation, we first convert each explanation to its ranked importance version $r$ using Def. 4.1, which does not affect any metric as they are all value-agnostic. Then we define the perturbed rank $\tilde r$ by adding to each entry of $r$ an independent Gaussian noise: $\tilde r_i = r_i + \varepsilon_i$ with $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$. Thus, two words $x_i$ and $x_j$ with $r_i < r_j$ have their ordering switched if $\tilde r_i > \tilde r_j$. A visualization of the switching with different $\sigma$ is in Fig. 12 of App. E.
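A sketch of this perturbation, assuming `rank` is the ranked importance vector (1 = most important) as a NumPy array:

```python
import numpy as np


def perturb_rank(rank: np.ndarray, sigma: float, seed: int = 0) -> np.ndarray:
    """Add i.i.d. Gaussian noise to each rank and re-rank, so that words with
    close ranks may switch order; a larger sigma induces more switches."""
    rng = np.random.default_rng(seed)
    noisy = rank + rng.normal(0.0, sigma, size=rank.shape)
    return np.argsort(np.argsort(noisy)) + 1  # back to ranks 1..N
```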

Fig. 5 plots the metrics under different $\sigma$ values (RankIns not shown due to its intrinsic issue discussed in App. D). Everything degrades to various extents. Although E∗ degrades slightly faster than the rest on Diff and DFFrac (and on par on the others), it still achieves the best results even at the largest $\sigma$, which induces many order switches (Fig. 12), and a faster degradation is expected anyway for metrics with better starting values (cf. occlusion on RankDel).
The evidence suggests that there is at most a slight model overfitting phenomenon, as E∗ remains comparable to other explainers under quite severe perturbation. Furthermore, we could incorporate perturbation robustness into metric solving to obtain an E∗ that degrades less, similar to adversarial training (Madry et al., 2018). We leave the exploration of this idea to future work.
App. F describes another assessment of model overfitting, though with a mild assumption and relying on word-level sentiment scores provided by the SST dataset. Similar conclusions are reached.
5.4 Ground Truth Recovery
Type | $\tilde y = 0$ | $\tilde y = 1$ |
Short | terrible, awful, disaster, worst, never | excellent, great, fantastic, brilliant, enjoyable |
Long | A total waste of time. Not worth the money! Is it even a real film? Overall it looks cheap. | I like this movie. This is a great movie! Such a beautiful work. Surely recommend it! |
Replacement word sets | $\tilde y = 0$ | $\tilde y = 1$ |
a, an, the | a | the |
in, on, at | in | on |
I, you | I | you |
he, she | he | she |
can, will, may | can | may |
could, would, might | could | might |
(all forms of be) | is | are |
(all punctuation marks) | (period) | (comma) |
Short Addition | Long Addition | Replacement | |||||||||||
Sym | Asym | Sym | Asym | Sym | Asym | ||||||||
Explainer | Pr | NR | Pr | NR | Pr | NR | Pr | NR | Pr | NR | Pr | NR | |
Grad | 0.91 | 0.06 | 0.51 | 0.08 | 0.70 | 0.37 | 0.77 | 0.30 | 0.50 | 0.75 | 0.51 | 0.74 | |
IntG | 0.82 | 0.10 | 0.60 | 0.21 | 0.60 | 0.76 | 0.70 | 0.55 | 0.49 | 0.74 | 0.48 | 0.74 | |
LIME | 1.00 | 0.06 | 1.00 | 0.06 | 0.72 | 0.60 | 0.84 | 0.32 | 0.63 | 0.65 | 0.54 | 0.71 | |
SHAP | 0.98 | 0.07 | 1.00 | 0.06 | 0.61 | 0.83 | 0.75 | 0.98 | 0.65 | 0.67 | 0.62 | 0.68 | |
Occl | 1.00 | 0.06 | 1.00 | 0.06 | 0.72 | 0.59 | 0.79 | 0.42 | 0.40 | 0.80 | 0.40 | 0.85 | |
E∗ | 1.00 | 0.06 | 1.00 | 0.06 | 0.67 | 0.64 | 0.92 | 0.38 | 0.60 | 0.66 | 0.54 | 0.73 |
Random | 0.06 | 0.54 | 0.07 | 0.53 | 0.25 | 0.89 | 0.24 | 0.88 | 0.27 | 0.85 | 0.28 | 0.85 |
For a model trained on a natural dataset, its ground truth working mechanism is rarely available – in fact, arguably the very purpose of interpretability methods is to uncover it. Thus, a series of work (e.g., Zhou et al., 2022a) proposed methods to modify the dataset such that a model trained on the new dataset has to follow a certain working mechanism to achieve high performance, which allows for evaluations against the known mechanism.
Ground Truth Definitions
We construct three types of ground truths: short additions, long additions, and replacements. First, we randomize the label to $\tilde y$ so that the original input features are not predictive (Zhou et al., 2022a).
For the two addition types, a word or a sentence is randomly inserted at either the beginning or the end of the input. The inserted text is randomly chosen from the sets in Tab. 4 according to $\tilde y$.
For the replacement type, each word in the input is checked against the list of replacement word sets in Tab. 5, and if the word belongs to one of the sets, it is changed according to the new label $\tilde y$. On average, 27% of input words are replaced.
We call these modifications symmetric, since inputs corresponding to both values of $\tilde y$ are modified. We also define the asymmetric modification, where only inputs with one value of $\tilde y$ are modified and those with the other value are left unchanged.
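As a sketch of the short-addition modification (the word sets follow Tab. 4; which label is left unmodified in the asymmetric variant is our assumption, not specified above):

```python
import random

NEGATIVE_WORDS = ["terrible", "awful", "disaster", "worst", "never"]
POSITIVE_WORDS = ["excellent", "great", "fantastic", "brilliant", "enjoyable"]


def short_addition(sentence: str, symmetric: bool = True):
    """Randomize the label, then insert a label-correlated word at a random end."""
    new_label = random.randint(0, 1)  # randomized label, independent of the text
    if not symmetric and new_label == 0:  # assumption: label-0 inputs stay unmodified
        return sentence, new_label
    word = random.choice(POSITIVE_WORDS if new_label == 1 else NEGATIVE_WORDS)
    if random.random() < 0.5:
        return word + " " + sentence, new_label
    return sentence + " " + word, new_label
```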
Metrics
We use the two metrics proposed by Bastings et al. (2022): precision and normalized rank. First, we define the ground truth correlated words. For the two addition types, they are the inserted words. In the asymmetric case, instances with the unmodified label do not have any words added, so we exclude them from the metric value computation.[2] ([2]: This also highlights an intrinsic limitation of feature attribution explanations: they cannot explain that the model predicts a class because certain features are not present.) For the replacement type, they are the words that are in the replacement sets (but not necessarily replaced).
Let $G$ be the set of ground truth correlated words. Using the ranked importance $r$ in Def. 4.1, precision (Pr) and normalized rank (NR) are defined as
$\mathrm{Pr} = \frac{1}{|G|}\,\big|\{ i \in G : r_i \leq |G| \}\big|, \qquad \mathrm{NR} = \frac{1}{N}\, \max_{i \in G} r_i.$
Precision is the fraction of ground truth words among the top-$|G|$ ranked words, and normalized rank is the lowest rank (i.e., the largest rank value) among the ground truth words, normalized by the length of the input. Both values are in $[0, 1]$, and higher precision values and lower normalized rank values are better.
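A sketch under this reading of the definitions (`rank` is the ranked importance with 1 = most important, and `gt` is the set of ground-truth word indices; both names are ours):

```python
from typing import List, Set, Tuple


def precision_and_nr(rank: List[int], gt: Set[int]) -> Tuple[float, float]:
    """Pr: fraction of ground-truth words among the top-|G| ranks.
       NR: worst (largest) rank among ground-truth words, divided by N."""
    n, g = len(rank), len(gt)
    top = {i for i in range(n) if rank[i] <= g}  # indices of the top-|G| words
    precision = len(top & gt) / g
    normalized_rank = max(rank[i] for i in gt) / n
    return precision, normalized_rank
```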
Results
Tab. 6 presents the test set Pr and NR values. Many explainers including E∗ score perfectly on short additions, but all struggle on the other types. Nonetheless, E∗ still ranks comparably or favorably to other methods. Its largest advantage shows on the asymmetric long addition, because this setup matches the computation of comp and suff: E∗ finds the most important words to remove/add to maximally change/preserve the original prediction, and those words are exactly the ground truth inserted ones. For replacement and symmetric addition, the search procedure does not “reconstruct” inputs of the other class, and E∗ fails to uncover the ground truth. This finding suggests a mismatch between the metric computation and certain ground truth types.
Conversely, the vanilla gradient performs decently on ground truth types other than short addition, yet ranks at the bottom on most quality metrics (Tab. 1 and 3), again likely due to the mismatch.

In fact, this evaluation is fundamentally different from the rest in its non-solvability, specifically due to its use of privileged information. To understand this point, let us first compare the evaluation of model prediction to that of model explanation, as illustrated in Fig. 6. The former runs the model on the input, receives the prediction, and compares it with the ground truth label, which is emphatically not available to the model under evaluation. By contrast, no such privileged information exists when computing interpretability metrics, allowing the explainer to directly solve them. In this ground truth recovery evaluation, we employ similar privileged information (i.e., induced ground truth model working mechanism) by dataset modification and model retraining. However, as discussed by Zhou et al. (2022a), such evaluations are limited to the range of ground truths that could be induced.
6 Discussion
Definition-Evaluation Duality
Our investigation demonstrates that some evaluation metrics can be used to find high-quality explanations, defined as the optimizers of the metrics. Conversely, we could also use any explanation definition as an evaluation metric $m$. A very simple one would be $m(E, x) = -d(E, E_{\mathrm{ref}})$, where $E$ is the explanation under evaluation, $E_{\mathrm{ref}}$ is the “reference explanation” generated by the explainer, and $d$ is a suitably chosen distance metric. It is obvious that $E_{\mathrm{ref}}$ itself achieves the optimal evaluation metric value.
Therefore, in theory, there should be no difference between using a concept as a definition vs. as an evaluation, but in practice, we almost always see some used mainly as definitions and others as evaluations (Fig. 2). A major reason for not using evaluations as definitions could be the presumed intractability of the optimization, which is experimentally refuted in this paper, as the beam search demonstrates both efficacy and efficiency.
Conversely, why do we not see more definitions (e.g., gradients and LIME) used as evaluations? Such an attempt may sound trivial yet unjustifiable at the same time: trivial because it is equivalent to claiming that the corresponding explainer definition is the best, and unjustifiable because that claim rests on seemingly circular logic.
More importantly, we motivate a new research direction opened up by the duality concept. Traditionally, definitions and evaluations have been considered and developed separately, but duality suggests that any interpretability concept can be used as both. Thus, we propose that we should focus on studying the intrinsic properties of these concepts, independent of their usage as one or another. For example, are some concepts inherently superior for model explanations than others? How can we measure the similarity between two concepts? What does the space of these concepts look like? None of these questions are currently answerable due to a complete lack of formalization, but research on them could lead to a much deeper understanding of local explanations.
Demonstrable Utility
Given the duality, how should we evaluate explanations? Fundamentally, local explanations are used for model understanding (Zheng et al., 2022; Zhou et al., 2022b), and we advocate for evaluating demonstrable utility: the presence of an explanation compared to its absence, or the newly proposed explanation compared to existing ones, should lead to a measurable difference in some practically beneficial aspect.
For example, people use explanations to identify spurious correlations during development, audit fairness before deployment, and assist human decision makers during deployment. However, recent findings cast doubt on the ability of model explanations to support any of these use cases (Bansal et al., 2021; Jia et al., 2022; Zhou et al., 2022a).
Demonstrating such utilities would bypass discussions of solvability and directly assert their usefulness (Chen et al., 2022). The examples listed here are by no means comprehensive, and a systematic taxonomy is valuable. Furthermore, it is likely that no single explainer is a one-size-fits-all solution. More refined knowledge of the strengths and weaknesses of each method in supporting different aspects of model understanding is highly desirable.
7 Conclusion
We study the relationship between definitions and evaluations of local explanations. We identify the solvability property of evaluation metrics, such that for each evaluation metric, there is an explicit search procedure to find the explanation that achieves the optimal metric value. In other words, every evaluation admits a definition that solves it.
Compared to the current practice of defining an explainer and then evaluating it on a metric, solvability allows us to directly find the explanation that optimizes the target metric and thus guarantee a very favorable evaluation outcome. In this paper, we investigate the feasibility of this process. First, we propose to use beam search to find the explanation E∗ that optimizes for comprehensiveness and sufficiency (DeYoung et al., 2020). Then, in a suite of evaluations, we find that E∗ performs comparably or favorably to existing explainers such as LIME.
Therefore, for practitioners, we recommend using the proposed explainer for computing local model explanations and provide the Python solvex package for easy adoption (App. A). For researchers, we propose a definition-evaluation duality inspired by solvability, which opens up many new research directions.
Limitations and Ethical Impact
The focus of our paper is to investigate the search-based explanation that explicitly optimizes a target quality metric. While the results suggest that it is comparable or favorable to existing heuristic explainers on various technical aspects, its societal properties have not been studied. For example, Ghorbani et al. (2019) showed that many heuristic explanations can be easily manipulated, and Slack et al. (2020) demonstrated that discriminative models can be carefully modified such that their discrimination is hidden by heuristic explanations. It is possible that the same issues exist for the search-based explanation, and thus we advise studying them carefully before deployment.
Another limitation of this approach is that the E∗ explainer only produces rankings of feature importance, rather than numerical values of feature importance. In other words, E∗ does not distinguish whether one feature is only slightly or significantly more important than another. By comparison, almost all heuristic explainers output numerical values (e.g., the magnitude of the gradient). Besides the greater ease of searching in the ranking space than in the numerical value space, we give three additional reasons. First, the utility of the actual values, beyond the induced rankings, has not been well studied in the literature. In addition, many popular explanation toolkits (e.g., Wallace et al., 2019) even default to top-$k$ visualization. Last, popular evaluation metrics rarely consider values either, suggesting that there is currently a lack of guiding principles and desiderata for these values. Moreover, if and when such value-aware metrics are widely adopted, we could augment our optimizer with them or incorporate them into a post-processing fix without affecting the ranking, similar to the shift operation done on Line 9 of Alg. 1.
References
- Adebayo et al. (2018) Julius Adebayo, Justin Gilmer, Michael Muelly, Ian Goodfellow, Moritz Hardt, and Been Kim. 2018. Sanity checks for saliency maps. In Advances in Neural Information Processing Systems (NeurIPS), pages 9505–9515.
- Alvarez-Melis and Jaakkola (2018a) David Alvarez-Melis and Tommi S Jaakkola. 2018a. On the robustness of interpretability methods. In ICML Workshop on Human Interpretability in Machine Learning.
- Alvarez-Melis and Jaakkola (2018b) David Alvarez-Melis and Tommi S Jaakkola. 2018b. Towards robust interpretability with self-explaining neural networks. In Advances in Neural Information Processing Systems (NeurIPS), volume 31.
- Arya et al. (2019) Vijay Arya, Rachel KE Bellamy, Pin-Yu Chen, Amit Dhurandhar, Michael Hind, Samuel C Hoffman, Stephanie Houde, Q Vera Liao, Ronny Luss, Aleksandra Mojsilović, et al. 2019. One explanation does not fit all: A toolkit and taxonomy of AI explainability techniques. arXiv:1909.03012.
- Asghar (2016) Nabiha Asghar. 2016. Yelp dataset challenge: Review rating prediction. arXiv:1605.05362.
- Bansal et al. (2021) Gagan Bansal, Tongshuang Wu, Joyce Zhou, Raymond Fok, Besmira Nushi, Ece Kamar, Marco Tulio Ribeiro, and Daniel Weld. 2021. Does the whole exceed its parts? the effect of AI explanations on complementary team performance. In ACM CHI Conference on Human Factors in Computing Systems (CHI), pages 1–16.
- Bastings et al. (2022) Jasmijn Bastings, Sebastian Ebert, Polina Zablotskaia, Anders Sandholm, and Katja Filippova. 2022. A protocol for evaluating the faithfulness of input salience methods for text classification. In Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics.
- Chan et al. (2022) Chun Sik Chan, Huanqi Kong, and Liang Guanqing. 2022. A comparative study of faithfulness metrics for model interpretability methods. In Annual Meeting of the Association for Computational Linguistics (ACL), pages 5029–5038. Association for Computational Linguistics.
- Chen et al. (2022) Valerie Chen, Jeffrey Li, Joon Sik Kim, Gregory Plumb, and Ameet Talwalkar. 2022. Interpretable machine learning: Moving from mythos to diagnostics. Queue, 19(6):28–56.
- Chrysostomou and Aletras (2021) George Chrysostomou and Nikolaos Aletras. 2021. Improving the faithfulness of attention-based explanations with task-specific information for text classification. In Annual Meeting of the Association for Computational Linguistics and International Joint Conference on Natural Language Processing (ACL-IJCNLP), pages 477–488. Association for Computational Linguistics.
- Dabkowski and Gal (2017) Piotr Dabkowski and Yarin Gal. 2017. Real time image saliency for black box classifiers. In Advances in Neural Information Processing Systems (NIPS), volume 30.
- Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255. IEEE.
- DeYoung et al. (2020) Jay DeYoung, Sarthak Jain, Nazneen Fatema Rajani, Eric Lehman, Caiming Xiong, Richard Socher, and Byron C. Wallace. 2020. ERASER: A benchmark to evaluate rationalized NLP models. In Annual Meeting of the Association for Computational Linguistics (ACL), pages 4443–4458. Association for Computational Linguistics.
- Feng et al. (2018) Shi Feng, Eric Wallace, Alvin Grissom II, Mohit Iyyer, Pedro Rodriguez, and Jordan Boyd-Graber. 2018. Pathologies of neural models make interpretations difficult. In Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3719–3728. Association for Computational Linguistics.
- Geirhos et al. (2020) Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A Wichmann. 2020. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11):665–673.
- Ghorbani et al. (2019) Amirata Ghorbani, Abubakar Abid, and James Zou. 2019. Interpretation of neural networks is fragile. In AAAI Conference on Artificial Intelligence (AAAI), volume 33, pages 3681–3688.
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778. IEEE.
- Jia et al. (2022) Yan Jia, John McDermid, Tom Lawton, and Ibrahim Habli. 2022. The role of explainability in assuring safety of machine learning in healthcare. IEEE Transactions on Emerging Topics in Computing (T-ETC), 10(4):1746–1760.
- Ju et al. (2022) Yiming Ju, Yuanzhe Zhang, Zhao Yang, Zhongtao Jiang, Kang Liu, and Jun Zhao. 2022. Logic traps in evaluating attribution scores. In Annual Meeting of the Association for Computational Linguistics (ACL), pages 5911–5922. Association for Computational Linguistics.
- Kim et al. (2020) Siwon Kim, Jihun Yi, Eunji Kim, and Sungroh Yoon. 2020. Interpretation of NLP models through input marginalization. In Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3154–3167. Association for Computational Linguistics.
- Kirkpatrick et al. (1983) Scott Kirkpatrick, C Daniel Gelatt Jr, and Mario P Vecchi. 1983. Optimization by simulated annealing. Science, 220(4598):671–680.
- Kohavi and Becker (1996) Ronny Kohavi and Barry Becker. 1996. UCI Adult data set. UCI Meachine Learning Repository.
- Li et al. (2016a) Jiwei Li, Xinlei Chen, Eduard Hovy, and Dan Jurafsky. 2016a. Visualizing and understanding neural models in NLP. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 681–691. Association for Computational Linguistics.
- Li et al. (2016b) Jiwei Li, Will Monroe, and Dan Jurafsky. 2016b. Understanding neural networks through representation erasure. arXiv:1612.08220.
- Lundberg and Lee (2017) Scott M Lundberg and Su-In Lee. 2017. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems (NIPS), pages 4765–4774.
- Luss et al. (2021) Ronny Luss, Pin-Yu Chen, Amit Dhurandhar, Prasanna Sattigeri, Yunfeng Zhang, Karthikeyan Shanmugam, and Chun-Chen Tu. 2021. Leveraging latent features for local explanations. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 1139–1149.
- Madry et al. (2018) Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. 2018. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations (ICLR).
- Nie et al. (2018) Weili Nie, Yang Zhang, and Ankit Patel. 2018. A theoretical explanation for perplexing behaviors of backpropagation-based visualizations. In International Conference on Machine Learning (ICML), pages 3809–3818.
- Petsiuk et al. (2018) Vitali Petsiuk, Abir Das, and Kate Saenko. 2018. RISE: Randomized input sampling for explanation of black-box models. In British Machine Vision Conference (BMVC).
- Pham et al. (2022) Thang Pham, Trung Bui, Long Mai, and Anh Nguyen. 2022. Double trouble: How to not explain a text classifier’s decisions using counterfactuals synthesized by masked language models? In Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and International Joint Conference on Natural Language Processing (AACL-IJCNLP), pages 12–31. Association for Computational Linguistics.
- Plumb et al. (2018) Gregory Plumb, Denali Molitor, and Ameet S Talwalkar. 2018. Model agnostic supervised local explanations. In Advances in Neural Information Processing Systems (NeurIPS).
- Ribeiro et al. (2016) Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why should I trust you?" explaining the predictions of any classifier. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 1135–1144.
- Roth (1988) Alvin E Roth. 1988. The Shapley Value: Essays in Honor of Lloyd S. Shapley. Cambridge University Press.
- Samek et al. (2016) Wojciech Samek, Alexander Binder, Grégoire Montavon, Sebastian Lapuschkin, and Klaus-Robert Müller. 2016. Evaluating the visualization of what a deep neural network has learned. IEEE Transactions on Neural Networks and Learning Systems (T-NNLS), 28(11):2660–2673.
- Serrano and Smith (2019) Sofia Serrano and Noah A. Smith. 2019. Is attention interpretable? In Annual Meeting of the Association for Computational Linguistics (ACL), pages 2931–2951. Association for Computational Linguistics.
- Simonyan et al. (2013) Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2013. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv:1312.6034.
- Slack et al. (2020) Dylan Slack, Sophie Hilgard, Emily Jia, Sameer Singh, and Himabindu Lakkaraju. 2020. Fooling LIME and SHAP: Adversarial attacks on post hoc explanation methods. In AAAI/ACM Conference on AI, Ethics, and Society (AIES), pages 180–186. Association for Computing Machinery.
- Smilkov et al. (2017) Daniel Smilkov, Nikhil Thorat, Been Kim, Fernanda Viégas, and Martin Wattenberg. 2017. SmoothGrad: Removing noise by adding noise. arXiv:1706.03825.
- Socher et al. (2013) Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1631–1642. Association for Computational Linguistics.
- Sundararajan et al. (2017) Mukund Sundararajan, Ankur Taly, and Qiqi Yan. 2017. Axiomatic attribution for deep networks. In International Conference on Machine Learning (ICML), pages 3319–3328.
- Wallace et al. (2019) Eric Wallace, Jens Tuyls, Junlin Wang, Sanjay Subramanian, Matt Gardner, and Sameer Singh. 2019. AllenNLP Interpret: A framework for explaining predictions of NLP models. In Conference on Empirical Methods in Natural Language Processing and International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations, pages 7–12. Association for Computational Linguistics.
- Winkler et al. (2019) Julia K Winkler, Christine Fink, Ferdinand Toberer, Alexander Enk, Teresa Deinlein, Rainer Hofmann-Wellenhof, Luc Thomas, Aimilios Lallas, Andreas Blum, Wilhelm Stolz, et al. 2019. Association between surgical skin markings in dermoscopic images and diagnostic performance of a deep learning convolutional neural network for melanoma recognition. JAMA Dermatology, 155(10):1135–1141.
- Zeiler and Fergus (2014) Matthew D Zeiler and Rob Fergus. 2014. Visualizing and understanding convolutional networks. In European Conference on Computer Vision (ECCV), pages 818–833. Springer.
- Zhang et al. (2019) Hao Zhang, Jiayi Chen, Haotian Xue, and Quanshi Zhang. 2019. Towards a unified evaluation of explanation methods without ground truth. arXiv preprint arXiv:1911.09017.
- Zheng et al. (2022) Yiming Zheng, Serena Booth, Julie Shah, and Yilun Zhou. 2022. The irrationality of neural rationale models. In 2nd Workshop on Trustworthy Natural Language Processing (TrustNLP). Association for Computational Linguistics.
- Zhou et al. (2022a) Yilun Zhou, Serena Booth, Marco Tulio Ribeiro, and Julie Shah. 2022a. Do feature attribution methods correctly attribute features? In AAAI Conference on Artificial Intelligence (AAAI).
- Zhou et al. (2021) Yilun Zhou, Adithya Renduchintala, Xian Li, Sida Wang, Yashar Mehdad, and Asish Ghoshal. 2021. Towards understanding the behaviors of optimal deep active learning algorithms. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1486–1494.
- Zhou et al. (2022b) Yilun Zhou, Marco Tulio Ribeiro, and Julie Shah. 2022b. ExSum: From local explanations to model understanding. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). Association for Computational Linguistics.
Appendix A The Python solvex Package
We release the Python solvex package, implementing the E∗ explainer in a model-agnostic manner. The project website at https://yilunzhou.github.io/solvability/ contains detailed tutorials and documentation. Here, we showcase three additional use cases of the explainer.
To explain long paragraphs, feature granularity at the level of sentences may be sufficient or even desired. solvex can use spaCy (https://spacy.io/) to split a paragraph into sentences and compute the sentence-level attribution explanation accordingly. As an example, Fig. 7 shows an explanation for the prediction on a test instance of the Yelp dataset (Asghar, 2016) made by the albert-base-v2-yelp-polarity model.
Contrary to other reviews, I have zero complaints about the service or the prices. I have been getting tire service here for the past 5 years now, and compared to my experience with places like Pep Boys, these guys are experienced and know what they’re doing. Also, this is one place that I do not feel like I am being taken advantage of, just because of my gender. Other auto mechanics have been notorious for capitalizing on my ignorance of cars, and have sucked my bank account dry. But here, my service and road coverage has all been well explained - and let up to me to decide. And they just renovated the waiting room. It looks a lot better than it did in previous years.
This package can explain image predictions with superpixel features (similar to LIME (Ribeiro et al., 2016)). Fig. 8 shows the explanation for the top prediction (Class 232: Border Collie, a dog breed) by the ResNet-50 (He et al., 2016) trained on ImageNet (Deng et al., 2009).

Last, it can also explain models trained on tabular datasets with both categorical and numerical features. For a random forest model trained on the Adult dataset (Kohavi and Becker, 1996), Fig. 9 shows the attribution of each feature contributing to class 0 (i.e., income less than or equal to $50K). Note that a more positive attribution value indicates that the feature (e.g., age or relationship) contributes more to the low income prediction.

Appendix B The Definition-Evaluation Spectrum and Its Various Concepts
We describe the reasoning of assigning each concept to its location on the definition-evaluation spectrum (Fig. 2, reproduced as Fig. 10 below), as currently perceived by the community according to our understanding. Note that the discussion is unavoidably qualitative, but we hope that it illustrates the general idea of this spectrum.

We start on the definition side, where the gradient saliency (Simonyan et al., 2013; Li et al., 2016a) is a classic feature attribution definition but, to the best of our knowledge, has never been used in any evaluation capacity. Moving towards the evaluation side, we have LIME (Ribeiro et al., 2016), which is again used mainly to define explanations (as linear regression coefficients), but the notion of local fidelity introduced by LIME has been occasionally used to evaluate other explainers as well (Plumb et al., 2018). Similar to LIME, SHAP (Lundberg and Lee, 2017) defines explanations as those that (approximately) satisfy the Shapley axioms (Roth, 1988), which can also be used to evaluate how well a certain explanation performs with respect to these axioms (Zhang et al., 2019). Next up we have the occlusion concept, which, as seen in Sec. 2, can be used as one explainer definition, Occl (Zeiler and Fergus, 2014; Li et al., 2016b), and two (not so popular) evaluations, DFMIT (Chrysostomou and Aletras, 2021) and RankDel (Alvarez-Melis and Jaakkola, 2018b).
Further on the evaluation side, we now encounter concepts that are more often used for evaluations than definitions. Robustness (Ghorbani et al., 2019) evaluates the similarity between explanations of similar inputs, and a higher degree of similarity is often more desirable (Alvarez-Melis and Jaakkola, 2018a). However, this robustness desideratum is incorporated explicitly into some explainers, such as via the noise aggregation in SmoothGrad (Smilkov et al., 2017). On the right-most end we have sufficiency and comprehensiveness (DeYoung et al., 2020), which evaluate whether keeping a small subset of features could lead to the original model prediction, or whether removing it could lead to a large drop in the model prediction. They are arguably the most popular evaluation metrics, and have been repeatedly proposed under different names such as the area over the perturbation curve (AoPC) (Samek et al., 2016) and the insertion/deletion metrics (Petsiuk et al., 2018). Using these two ideas for definitions is relatively rare, with one notable exception being the smallest sufficient/destroying regions (SSR/SDR) proposed by Dabkowski and Gal (2017).
Overall, it is clear that the community considers certain concepts more for definitions and others more for evaluations, which motivates the investigation for this paper and future work: can we swap the definition/evaluation roles, and if so, what are the implications?
Appendix C Additional Qualitative Examples of the E∗ Explanation
Fig. 11 presents more visualizations of E∗ explanations. These examples suggest that E∗ mostly focuses on words that convey strong sentiments, which is a plausible working mechanism of a sentiment classifier.
A triumph , relentless and beautiful in its downbeat darkness .
Ranks among Willams ’ best screen work .
Zany , exuberantly irreverent animated space adventure .
Behind the snow games and lovable Siberian huskies ( plus one sheep dog ) , the picture hosts a parka-wrapped dose of heart .
… a haunting vision , with images that seem more like disturbing hallucinations .
Suffocated at conception by its Munchausen-by-proxy mum .
It ’s an awfully derivative story .
A dreadful live-action movie .
Appendix D An Analysis on the RankIns Metric
As introduced in Sec. 2, RankIns evaluates the monotonicity of the model prediction curve when increasingly important features are successively inserted into an empty input. While this expectation seems reasonable, it suffers from a critical issue due to the convention in ranking features: if a feature contributes against the prediction, such as a word of sentiment opposite to the prediction (e.g., a positive prediction on “Other than the story plot being a bit boring, everything else is actually masterfully designed and executed.”), it should have a negative attribution, and the convention is to rank it lower (i.e., less important) than those that have zero contribution. This convention leads to the correct interpretation of all other metric values.
However, under this convention, the first few words added to the empty input should decrease the model prediction and then increase it, leading to a U-shaped curve. In fact, it is the comprehensiveness curve shown in Fig. 4, flipped both horizontally (because features are inserted rather than removed) and vertically (because the plotted quantity is the model prediction rather than change in prediction). Thus, a deeper U-shape should be preferred, but it is less monotonic. This also explains why the random attribution baseline achieves such a high ranking correlation: as we randomly add features from the empty string to the full input, on average the curve should be a more or less monotonic interpolation between model predictions on empty and full inputs, which has better monotonicity rank correlation than the U-shape.
It is not clear how to fix the metric. Previous works that proposed (Luss et al., 2021) or used (Chan et al., 2022) this metric often ignored the issue. One work (Arya et al., 2019) filtered out all features with negative attribution values and evaluated the rank correlation only on the rest. This, however, is easily manipulable by an adversary. Specifically, an explainer could shift all attribution values down such that only the most positive one has a non-negative value. This change results in a perfect correlation as long as removing the most positive feature induces a decrease in the model prediction – an especially easy requirement to satisfy. Empirically, we found that inserting features based on their (unsigned) magnitudes barely affects the results either. Thus, we argue that this metric is not a good measurement of explanation quality.
Appendix E Visualization of Perturbation Effects
Fig. 12 visually presents the random perturbation with different standard deviations $\sigma$ of the Gaussian noise. In each panel, the top row orders the features by their ranked importance, from least important on the left to most important on the right, and the bottom row orders the features by the perturbed ranked importance, with lines connecting each feature to its original position. For example, in the top panel (smallest $\sigma$), the perturbation swaps the relative order of the two least important features on the left.

Appendix F Another Assessment on the Explainer-Attacking-Model Behavior

We describe another experiment to assess whether the explanations exploit the adversarial vulnerability of the model. While it is possible that the model uses some shortcuts (Geirhos et al., 2020), we would expect it to predominantly rely on sentiment-conveying words, as it achieves high accuracy and no such shortcuts are known for the dataset. In this case, we should expect an explainer that does not adversarially exploit the model to give word attributions that correlate with the words' sentiment values, while an explainer that attacks the model would rate words that are “adversarial bugs” as more important.
Conveniently, the SST dataset provides human annotations of a polarity score between 0 and 1 for each word, where 0 means very negative, 1 means very positive, and 0.5 means neutral. We compute the alignment between the attribution values (for the positive class) and this score for each word. Given a sentence $x$ with explanation $E$ and word polarity scores $s$, the alignment is defined as the Spearman rank correlation coefficient $\rho(E, s)$. Since the vanilla gradient only produces non-negative values, it is impossible to identify whether a word contributes to or against the positive class, and we exclude it from this analysis.
Fig. 13 plots the distribution of rank correlations among the test set instances, with the average shown as the bar and also annotated on the horizontal axis. Although no method achieves very high alignment, E∗ is the second highest, after LIME. Thus, under our assumption that high-polarity words are indeed the genuine signals used by the model for making predictions, we can conclude that E∗ does not adversarially exploit the model's vulnerabilities any more severely than the heuristic explainers do.