
Shapley Sets: Feature Attribution via Recursive Function Decomposition

Torty Sivill
University of Bristol
The Alan Turing Institute
vs14980@bristol.ac.uk
Peter Flach
University of Bristol
The Alan Turing Institute
Abstract

Despite their ubiquitous use, Shapley value feature attributions can be misleading due to feature interaction in both model and data. We propose an alternative attribution approach, Shapley Sets, which awards value to sets of features. Shapley Sets decomposes the underlying model into non-separable variable groups using a recursive function decomposition algorithm with log-linear complexity in the number of variables. Shapley Sets attributes to each non-separable variable group their combined value for a particular prediction. We show that Shapley Sets is equivalent to the Shapley value over the transformed feature set and thus benefits from the same axioms of fairness. Shapley Sets is value function agnostic, and we show theoretically and experimentally how Shapley Sets avoids pitfalls associated with Shapley value based alternatives and is particularly advantageous for data types with complex dependency structure.

Keywords: Explainability, Feature Attribution, Shapley Value, Function Decomposition, Separability

1 The Shapley Value and Non-separable Functions

In co-operative game theory, one central question is that of fair division: if players form a coalition to achieve a common goal, how should they split the profits? Let $N$ be the set $\{1,2,\dots,n\}$ of players and $2^{N}$ the set of all coalitions of players. A function $v:2^{N}\rightarrow\mathbb{R}$ is the $n$-person game in characteristic form, such that $v(S)$, $S\subseteq N$, defines the worth of coalition $S$, where $v(\varnothing)=0$. A solution concept is a mapping assigning a vector $\mathbf{x}\in\mathbb{R}^{n}$ to the game $v$. The Shapley value [1] is the most widely known solution concept, which uniquely satisfies certain axioms of fairness: efficiency, dummy, symmetry and additivity. Please see [1] for definitions.

Definition 1.1 (Shapley Value).

For the game $v$ the Shapley value of player $i\in N$ is given as

$\phi_{i}(v)=\sum_{S\subseteq N\backslash\{i\}}\frac{|S|!(n-|S|-1)!}{n!}[v(S\cup\{i\})-v(S)]$ (1)

Under efficiency, the Shapley value decomposes the value of the grand coalition $v(N)-v(\varnothing)$ to attribute worth to each individual player. The Shapley value is a fully separable function (Definition 1.2) such that $v(N)-v(\varnothing)=\sum_{i=1}^{n}\phi_{i}(v)$.

Definition 1.2 (Additively Separable Function).

A function $f:\mathbb{R}^{n}\rightarrow\mathbb{R}$ with variable set $\mathbf{X}=\{X_{1},\dots,X_{n}\}$ is separable if it has the form $f(\mathbf{X})=\sum_{i=1}^{k}f_{i}(\mathbf{X}_{i})$, $1<k\leq n$,

where $\mathbf{X}_{1},\mathbf{X}_{2},\dots,\mathbf{X}_{k}$ are $k$ non-overlapping sub-vectors of $\mathbf{X}$.

Specifically, the function $f$ is called fully additively separable if $k=n$, and fully non-separable if $k=1$. While there are other forms of separability, in this paper we use the term separable to refer to additive separability.

The set function $v$ may not be fully separable. Within coalitional games this is due to interaction between players. Consider the following example for the game $v$ with player set $N=\{1,2,3\}$ and $v(1)=1$, $v(2)=0$, $v(3)=0$, $v(1,2)=1$, $v(1,3)=1$, $v(2,3)=2$, $v(1,2,3)=3$. Clearly the game is not fully separable, as $v(1)+v(2)+v(3)\neq v(1,2,3)$. The non-separable interaction effects within coalitional games are dealt with by solution concepts which map partially separable functions into fully separable ones, allowing an individual attribution of worth to each player. The Shapley value provides an attribution where each player receives an average of their marginal contribution over all coalitions.
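
To make Equation 1 concrete, the following minimal Python sketch (our own, not from any library) computes the Shapley values of the game above by direct enumeration; the interaction surplus is shared out so that each player receives $\phi_{i}(v)=1$ and $\sum_{i}\phi_{i}(v)=v(N)=3$.

from itertools import combinations
from math import factorial

# The three-player game in characteristic form, keyed by sorted coalitions.
v = {(): 0, (1,): 1, (2,): 0, (3,): 0,
     (1, 2): 1, (1, 3): 1, (2, 3): 2, (1, 2, 3): 3}

def shapley(v, players):
    """Exact Shapley values (Eq. 1) by enumerating all coalitions."""
    n = len(players)
    phi = {}
    for i in players:
        rest = [p for p in players if p != i]
        phi[i] = sum(
            factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
            * (v[tuple(sorted(S + (i,)))] - v[S])
            for r in range(n) for S in combinations(rest, r))
    return phi

print(shapley(v, [1, 2, 3]))  # {1: 1.0, 2: 1.0, 3: 1.0}; sums to v(N) = 3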

1.1 Interaction Effects For Feature Attribution

When applying the Shapley value to feature attribution, there are three functions to consider: the model to be explained, $f:\mathbb{R}^{n}\rightarrow\mathbb{R}$, which operates on a variable set $\mathbf{X}=\{X_{1},\dots,X_{n}\}$; the set function $v$, which takes as input a set of features $\mathbf{X}_{S}\subseteq\mathbf{X}$ and obtains $f$'s prediction on this coalition of features; and the Shapley value $\phi(v)$, which maps the set function $v$ into a fully separable function. Given a particular prediction to attribute, $f(\mathbf{x})$ where $\mathbf{x}=\{x_{1},\dots,x_{n}\}$, the value function $v(\mathbf{x},\mathbf{X}_{S})$ specifies how the subset of features $\mathbf{X}_{\bar{S}}=\mathbf{X}\backslash\mathbf{X}_{S}$ should be removed from $\mathbf{x}$.

In coalitional game theory, the Shapley value attributes the difference in value between the grand coalition and the empty set of players. For feature attribution, the Shapley value attributes the change in prediction between instance and baseline. Therefore, the value of the empty coalition, $v(\mathbf{x},\mathbf{X}_{\varnothing})$, is not guaranteed to be zero but is some uninformative baseline prediction, and thus the value function must account for the non-zero baseline: $v(\mathbf{x},\mathbf{X}_{S})=f(\mathbf{X}_{\bar{S}},\mathbf{X}_{S}=\mathbf{x}_{S})-f(\mathbf{X}_{\varnothing})$. Similarly to coalitional games, the Shapley value fairly allocates interaction effects to each feature. Feature interaction may occur in the data, e.g. $f(X_{1},X_{2},X_{3})=X_{1}+X_{2}+X_{3}$ where $X_{2}=\alpha X_{3}$, and/or in the model, e.g. $f(X_{1},X_{2},X_{3})=X_{1}+2X_{2}X_{3}$. The choice of value function, which acts as the interface between the Shapley value and the function $f$, determines the kind of interaction effects the Shapley value must allocate between features.

1.2 When Interaction Occurs in the Data

While the following ideas have previously been discussed [2, 3, 4], we re-frame them here within the context of separability, which allows us to motivate our proposed attribution method, Shapley Sets.

Example 1: Given the binary variable set $\mathbf{X}=\{X_{1},X_{2},X_{3}\}$ and function $f(\mathbf{X})=X_{1}+X_{3}$, where $X_{2}$ is the causal ancestor of $X_{3}$ such that $X_{3}=X_{2}$. It is clear that $X_{2}$ has no impact on $f(\mathbf{X})$ from the perspective of the model. However, from the perspective of the data distribution, $X_{3}$ is dependent on $X_{2}$: changing $X_{2}$ will result in a change in $X_{3}$, so setting $X_{2}$ to a value inconsistent with $X_{3}$ does not make sense. Whether to consider $X_{2}$ as a separate player in the game and attribute it value despite it having no direct influence on the model output is an open debate in the literature.

Off-manifold Value Functions There are those who argue that features with no impact on the model should receive no attribution [5, 6]. These methods break all statistical relationships between the inputs to the model by using a value function which calculates the impact of each feature on the model independently of its impact on the distribution of other features. This approach was formalised as $v_{marg}$ by [6]:

$v_{marg}(\mathbf{x},\mathbf{X}_{S})=f(\mathbf{X}_{S}=\mathbf{x}_{S},\mathbb{E}[\mathbf{X}_{\bar{S}}])-f(\mathbb{E}[\mathbf{X}])$ (2)

The expectation is usually taken over the input distribution $\mathbf{X}_{input}$. However, if this is replaced by an arbitrary distribution, $v_{marg}$ generalises to $v_{bs}$ [7], which uses an arbitrary baseline sample $\mathbf{z}$:

$v_{bs}(\mathbf{x},\mathbf{z},\mathbf{X}_{S})=f(\mathbf{X}_{S}=\mathbf{x}_{S},\mathbf{X}_{\bar{S}}=\mathbf{z}_{\bar{S}})-f(\mathbf{X}=\mathbf{z}).$ (3)
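
In code, both value functions amount to splicing the coalition's values from the instance into a reference point. A minimal sketch (our naming; `f` is any model callable), where $v_{marg}$ is recovered by taking the baseline $\mathbf{z}$ to be the feature-wise expectation:

import numpy as np

def v_bs(f, x, z, S):
    """Baseline value function (Eq. 3): features in S take their values
    from the instance x, all others from the baseline z."""
    hybrid = z.copy()
    S = list(S)
    hybrid[S] = x[S]
    return f(hybrid) - f(z)

def v_marg(f, x, X_input, S):
    """v_marg (Eq. 2), with the expectation over an empirical sample X_input."""
    return v_bs(f, x, X_input.mean(axis=0), S)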

There are those who argue that attributions independent of the statistical interactions in the data are inherently misleading [8, 4]. Firstly, from a causal perspective, if we consider Example 1, the Shapley value via $v_{marg}$ would assign zero importance to $X_{2}$. An attribution ignoring that $X_{2}$ is directly responsible for $X_{3}$ is misleading, especially if the attribution is used to recommend changes. Furthermore, $v_{marg}$ evaluates the model on out-of-distribution samples. If we break the causal relationship between $X_{2}$ and $X_{3}$ and use their independent expected values $\mathbb{E}[X_{2}]=\mathbb{E}[X_{3}]=0$ in $v_{marg}$, the model is evaluated on samples $(x_{1},1,0)$, which is a complete misrepresentation of the truth.

On-manifold Value Functions To combat this problem, on-manifold samples can be generated by use of the conditional value function, first introduced by [9], which does consider statistically related features as separate players in the game, allowing the distribution of out-of-coalition features to be impacted by the features in question:

$v_{cond}(\mathbf{x},\mathbf{X}_{S})=\mathbb{E}[f(\mathbf{X}_{S}=\mathbf{x}_{S},\mathbf{X}_{\bar{S}})|\mathbf{X}_{S}=\mathbf{x}_{S}]-\mathbb{E}[f(\mathbf{X})].$ (4)

$v_{cond}$ is often taken as the observational conditional probability, whereby the conditional expectation is calculated over $\mathbf{X}_{input}$. This generates on-manifold data samples which address the problems discussed above. Furthermore, features which have no direct impact on the model but an indirect impact through other features are assigned a non-zero importance, more accurately reflecting reality. However, $v_{cond}$ has two significant issues: its computational complexity, as it requires evaluating the model on $2^{N}$ multivariate conditional distributions, and the undesirable consequence of considering all features as players which, combined with the efficiency axiom, leads to the problems we explicate below.

In assigning non-marginal features a non-zero importance, $v_{cond}$ can give misleading explanations which indicate features to change despite their having zero impact on the outcome. This weakness of $v_{cond}$ has been formalised as a “violation of sensitivity” [6]: when the relevance $\phi_{i}$ is defined by $v_{cond}$, $\phi_{i}\neq 0$ does not imply that $f$ depends on $X_{i}$.

The failure of sensitivity exhibited by $v_{cond}$ leads to further issues with the generated attributions. Consider Example 1 again, where $X_{1},X_{2},X_{3}$ are binary variables and $X_{3}=X_{2}$. Given the input $\mathbf{x}=(x_{1},x_{2},x_{3})=(1,1,1)$ and $f(x_{1},x_{2},x_{3})=2$, under $v_{cond}$ the Shapley attributions for $X_{2}$ and $X_{3}$ would both be greater than the attribution for $X_{1}$. Clearly, the attribution to $X_{2}$ violates sensitivity. Now consider an alternative function trained on just the two features $X_{1},X_{3}$. As $X_{3}=X_{2}$, $f_{2}(X_{1},X_{3})=f(X_{1},X_{2},X_{3})$. However, now the Shapley values for $X_{1}$ and $X_{3}$ are equal. The relative apparent importances of $X_{1}$ and $X_{3}$ depend on whether $X_{2}$ is considered to be a third feature, even though the two functions are effectively the same.

[10] propose a solution to the failure of sensitivity exhibited by $v_{cond}$, following the intuition: if $X_{i}$ is known to be the deterministic causal ancestor of $X_{j}$, one might want to attribute all effect to $X_{i}$ and none to $X_{j}$. In contrast, [4] argue that the only way to remove the problems arising from the failure of sensitivity is to replace the observational $v_{cond}$ with the interventional conditional distribution. However, both the asymmetric and interventional attributions above require the specification of the causal structure of the phenomenon being modelled. It has been argued [2] that this requirement is a significant limiting factor in the adoption of either approach. In this paper, we propose an attribution approach which can be used with on- and off-manifold value functions. Under $v_{cond}$, our method generates on-manifold attributions which avoid the failure of sensitivity without requiring any knowledge of the causal structure of the underlying data distribution.

1.3 When Interaction Occurs in the Model

While off-manifold value functions ignore interaction in the data, both on- and off-manifold value functions recognise interaction in the model. It has been recognised, however, that the Shapley value generates misleading attributions in the presence of feature interaction in the model [11].

Example 2: Consider the function $f(X_{1},X_{2},X_{3})=X_{1}+2X_{2}X_{3}$ and assume that the three features are statistically independent, i.e. all interaction between features is defined entirely by the model. Furthermore, it is given that $\mathbb{E}[X_{1}]=\mathbb{E}[X_{2}]=\mathbb{E}[X_{3}]=0$ and that our sample to be explained is $\mathbf{x}=(1,1,1)$. The Shapley value under both on- and off-manifold value functions gives equal attributions to each feature. While this attribution makes sense from the perspective of how much each feature contributed to the change in prediction, it does not reflect the true behaviour of the model, where changing the value of $X_{2}$ or $X_{3}$ would have double the impact on the model output of changing $X_{1}$. In this paper, we propose a solution concept which would group $X_{2},X_{3}$ and, unlike the Shapley value, award them attribution together, resulting in attributions more faithful to the underlying model $f$ when used with on- or off-manifold value functions.

2 Shapley Sets of Non-Separable Variable Groups

The problems with Shapley value attributions discussed above occur because it assigns individual value to variables belonging to Non-Separable Variable Groups (NSVGs) with respect to the underlying partially separable function $f$ (Definition 1.2). Non-separable groups describe the variable groups $\{\mathbf{X}_{1},\dots,\mathbf{X}_{k}\}$ formed after a complete (or ideal) decomposition of $f$. An NSVG can also be defined as the minimal set of all interacting variables given the function $f$, which we explicate in Definition 2.1.

Definition 2.1 (Non-Separable Variable Group (NSVG)).

Let $\mathbf{X}=\{X_{1},X_{2},\dots,X_{n}\}$ be the set of decision variables and $f$ be a partially separable function $f:\mathbb{R}^{n}\rightarrow\mathbb{R}$ satisfying Definition 1.2. If there exist any two candidate decision vectors $\mathbf{x}=\{x_{1},\dots,x_{n}\}$ and $\mathbf{x}^{\prime}=\{x^{\prime}_{1},\dots,x^{\prime}_{n}\}$, sampled from the domain of $\mathbf{X}$, such that the following property holds for two mutually exclusive subsets $\mathbf{X}_{i},\mathbf{X}_{j}\subset\mathbf{X}$, $\mathbf{X}_{i}\cap\mathbf{X}_{j}=\varnothing$,

$f(\mathbf{x})_{\mathbf{X}_{i}\cup\mathbf{X}_{j}}-f(\mathbf{x})_{\mathbf{X}_{j}}\neq f(\mathbf{x})_{\mathbf{X}_{i}}-f(\mathbf{x})_{\mathbf{X}_{\varnothing}},$ (5)

then the sets $\mathbf{X}_{i},\mathbf{X}_{j}$ are said to interact. Here, $f(\mathbf{x})_{\mathbf{X}_{S}}=f(\mathbf{X}_{S}=\mathbf{x}_{S},\mathbf{X}_{\bar{S}}=\mathbf{x}^{\prime}_{\bar{S}})$ and $\mathbf{X}_{S}\cup\mathbf{X}_{\bar{S}}=\mathbf{X}$. As an NSVG refers to the minimal set of interacting variables, if $|\mathbf{X}_{i}|$ and $|\mathbf{X}_{j}|$ are minimised such that Equation 5 still holds, then $\mathbf{X}_{i}\cup\mathbf{X}_{j}$ is an NSVG. (For proof, see [12].)

Translating Definition 2.1 to feature attribution: given that $f(\mathbf{x})_{\mathbf{X}_{S}}$ is a function over the domain of all possible subsets $\mathbf{X}_{S}\subseteq\mathbf{X}$, we can rewrite Equation 5 in terms of $v(\mathbf{x},\mathbf{X}_{S})$, where $v$ could represent any of the value functions from the previous section; in this paper we restrict $v\in\{v_{cond},v_{bs}\}$. Setting $\mathbf{X}_{i}=\{X_{i}\}$ and $\mathbf{X}_{j}=\mathbf{X}_{S}$: for $v_{bs}$, given that $|\mathbf{X}_{S}|$ is minimised, if there exist any candidate vectors $\mathbf{x},\mathbf{x}^{\prime}$ such that

$v_{bs}(\mathbf{x},\mathbf{x}^{\prime},\{X_{i}\}\cup\mathbf{X}_{S})-v_{bs}(\mathbf{x},\mathbf{x}^{\prime},\mathbf{X}_{S})\neq v_{bs}(\mathbf{x},\mathbf{x}^{\prime},\{X_{i}\})$ (6)

then $\{X_{i}\}\cup\mathbf{X}_{S}$ is an NSVG.

For $v_{cond}$, given that $|\mathbf{X}_{S}|$ is minimised, if there exists any candidate vector $\mathbf{x}$ such that

$v_{cond}(\mathbf{x},\{X_{i}\}\cup\mathbf{X}_{S})-v_{cond}(\mathbf{x},\mathbf{X}_{S})\neq v_{cond}(\mathbf{x},\{X_{i}\})$ (7)

then $\{X_{i}\}\cup\mathbf{X}_{S}$ is an NSVG.

Given the partially separable function from Example 2, under $v_{bs}$, $\{X_{2},X_{3}\}$ is an NSVG, as $v_{bs}(\mathbf{x},\mathbf{x}^{\prime},\{X_{3},X_{2}\})-v_{bs}(\mathbf{x},\mathbf{x}^{\prime},\{X_{2}\})\neq v_{bs}(\mathbf{x},\mathbf{x}^{\prime},\{X_{3}\})$ for settings $\mathbf{x}=(1,1,1)$ and $\mathbf{x}^{\prime}=(0,0,0)$.

Given the partially separable function from Example 1, under $v_{cond}$, the set $\{X_{2},X_{3}\}$ is an NSVG, as $v_{cond}(\mathbf{x},\{X_{3},X_{2}\})-v_{cond}(\mathbf{x},\{X_{2}\})\neq v_{cond}(\mathbf{x},\{X_{3}\})$ for the setting $\mathbf{x}=(1,1,1)$.
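
The $v_{bs}$ condition in Equation 6 is straightforward to check numerically. A minimal sketch (helper names are ours), reusing the $v_{bs}$ implementation sketched after Equation 3, applied to Example 2:

import numpy as np

def v_bs(f, x, z, S):
    # As sketched after Eq. 3: splice coalition S of x into baseline z.
    hybrid = z.copy()
    S = list(S)
    hybrid[S] = x[S]
    return f(hybrid) - f(z)

def interacts_bs(f, i, S, x, z, eps=1e-9):
    """Equation 6: does feature i interact with the set S under v_bs?"""
    lhs = v_bs(f, x, z, {i} | set(S)) - v_bs(f, x, z, set(S))
    rhs = v_bs(f, x, z, {i})
    return abs(lhs - rhs) > eps

f = lambda a: a[0] + 2 * a[1] * a[2]   # Example 2, 0-indexed features
x, z = np.ones(3), np.zeros(3)
print(interacts_bs(f, 1, {2}, x, z))   # True:  {X2, X3} is an NSVG
print(interacts_bs(f, 0, {1}, x, z))   # False: X1 is separable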

In this paper, we propose an alternative attribution method which, unlike the Shapley value, does not separate NSVGs to assign attribution. We work under the intuition that interacting features, whether the interaction is in the model or in the data, should not be considered as separate players in the coalitional game but should be awarded value together. In both the examples above, $X_{2}$ and $X_{3}$ would receive joint attribution under our proposed method.

Given the partially separable function $f$ satisfying Definition 1.2, variable set $\mathbf{X}=\{X_{1},X_{2},\dots,X_{n}\}$, and a specified value function $v(\mathbf{x},\mathbf{X}_{S})$, $v\in\{v_{cond},v_{marg}\}$, our proposed solution concept $\varphi$, which we term Shapley Sets (SS), finds the optimal decomposition of $f$ into the set of $m>1$ NSVGs $\{\mathbf{X}_{1},\dots,\mathbf{X}_{m}\}$. The resulting variable grouping $\{\mathbf{X}_{1},\dots,\mathbf{X}_{m}\}$ satisfies Definition 1.2, and each variable group is composed solely of variables which satisfy Definition 2.1. From Definition 1.2, $f(\mathbf{x})=\sum_{i=1}^{m}v(\mathbf{x},\mathbf{X}_{i})$. Given a prediction to be attributed, $f(\mathbf{x})$, our proposed attribution $\varphi$ therefore returns the attribution for each variable group $\mathbf{X}_{i}$, $i=1,\dots,m$, given as:

$\varphi_{\mathbf{X}_{i}}=v(\mathbf{x},\mathbf{X}_{i})$ (8)
Proposition 2.2.

If we model each NSVG $\mathbf{X}_{i}\in\{\mathbf{X}_{1},\dots,\mathbf{X}_{m}\}$ as a super-feature $Z_{i}$, such that $\mathbf{Z}=\{Z_{1},\dots,Z_{m}\}$, $z_{i}=\mathbf{x}_{i}$ and $\mathbf{z}=\{z_{1},\dots,z_{m}\}$, then the Shapley value of each super-feature, $\phi_{Z_{i}}(v,\mathbf{z})$, is equivalent to $v(\mathbf{z},Z_{i})$.

Proof.
$\phi_{Z_{i}}(v,\mathbf{z})=\sum_{\mathbf{Z}_{S}\subseteq\mathbf{Z}\backslash\{Z_{i}\}}\alpha[v(\mathbf{z},\{Z_{i}\}\cup\mathbf{Z}_{S})-v(\mathbf{z},\mathbf{Z}_{S})]$

where $\alpha=\frac{|\mathbf{Z}_{S}|!(|\mathbf{Z}|-|\mathbf{Z}_{S}|-1)!}{|\mathbf{Z}|!}$.

Given that each $\mathbf{Z}_{i},\mathbf{Z}_{j}\subseteq\mathbf{Z}$ is an NSVG, from Definition 2.1 we know that $v(\mathbf{z},\mathbf{Z}_{i}\cup\mathbf{Z}_{j})-v(\mathbf{z},\mathbf{Z}_{j})=v(\mathbf{z},\mathbf{Z}_{i})$ for any $\mathbf{Z}_{i},\mathbf{Z}_{j}\subseteq\mathbf{Z}$. Therefore, $v(\mathbf{z},\{Z_{i}\}\cup\mathbf{Z}_{S})-v(\mathbf{z},\mathbf{Z}_{S})=v(\mathbf{z},\{Z_{i}\})$.

It follows, since $\sum_{\mathbf{Z}_{S}\subseteq\mathbf{Z}\backslash\{Z_{i}\}}\frac{|\mathbf{Z}_{S}|!(|\mathbf{Z}|-|\mathbf{Z}_{S}|-1)!}{|\mathbf{Z}|!}=1$, that

$\phi_{Z_{i}}(v,\mathbf{z})=\sum_{\mathbf{Z}_{S}\subseteq\mathbf{Z}\backslash\{Z_{i}\}}\alpha\,v(\mathbf{z},\{Z_{i}\})=v(\mathbf{z},\{Z_{i}\})=v(\mathbf{x},\mathbf{X}_{i})$

Proposition 2.2 shows that the attribution given by Shapley Sets (SS) to the variable group $\mathbf{X}_{i}$, $\varphi_{\mathbf{X}_{i}}(v,\mathbf{x})$, is equivalent to the Shapley value when played over the feature set $\mathbf{Z}$ containing the set of NSVGs $\{Z_{1},\dots,Z_{m}\}=\{\mathbf{X}_{1},\dots,\mathbf{X}_{m}\}$ for a given $v\in\{v_{cond},v_{bs}\}$. SS therefore satisfies the same axioms of fairness as the Shapley value (efficiency, dummy, additivity and symmetry) when played over this feature set. However, we have discussed how, despite its axioms, the Shapley value can generate misleading attributions in the presence of feature interaction. In Section 4 we therefore give practical advantages of SS over the Shapley value. First, however, we provide a method for finding the optimal decomposition of $f$ into its NSVGs.
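
As a concrete check of Proposition 2.2, take Example 2 under $v_{bs}$ with $\mathbf{x}=(1,1,1)$ and baseline $\mathbf{x}^{\prime}=(0,0,0)$, grouped as $Z_{1}=\{X_{1}\}$ and $Z_{2}=\{X_{2},X_{3}\}$. The two-player game over super-features is $v(\mathbf{z},Z_{1})=1$, $v(\mathbf{z},Z_{2})=2$, $v(\mathbf{z},\{Z_{1},Z_{2}\})=3$, giving $\phi_{Z_{1}}=\frac{1}{2}(1-0)+\frac{1}{2}(3-2)=1=v(\mathbf{z},Z_{1})$ and $\phi_{Z_{2}}=\frac{1}{2}(2-0)+\frac{1}{2}(3-1)=2=v(\mathbf{z},Z_{2})$, as claimed.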

3 Computing Shapley Sets

Determining the NSVGs of a function $f$ could be achieved manually by partitioning the variable set and testing for interaction over every possible candidate vector; however, this would be computationally intractable. Instead, there exists a large body of literature on function decomposition in global optimisation problems, within which automatic decomposition methods identify NSVGs. We therefore propose a method for calculating SS based on the Recursive Differential Grouping (RDG) algorithm introduced in [12].

To identify whether two sets of variables $\mathbf{X}_{i}$ and $\mathbf{X}_{j}$ interact, RDG uses a fitness measure based on Definition 2.1, with candidate vectors $\mathbf{x},\mathbf{x}^{\prime}$ taken as the lower and upper bounds of the domain of $\mathbf{X}$. If the difference between the left- and right-hand sides of Equation 5 exceeds a threshold $\epsilon=\alpha\min\{|f(\mathbf{x}_{1})|,\dots,|f(\mathbf{x}_{k})|\}$, where $\mathbf{x}_{1},\dots,\mathbf{x}_{k}$ are randomly selected candidate vectors, then $\mathbf{X}_{i}$ and $\mathbf{X}_{j}$ are deemed by RDG to interact. To adapt RDG to $v_{cond}$ and $v_{bs}$, we propose an alternative fitness measure (Definition 3.1), with candidate vectors $\mathbf{x},\mathbf{x}^{\prime}$ randomly sampled from $\mathbf{X}_{input}$, which can identify NSVGs in the data and/or in the model.

Definition 3.1 (Shapley Sets Fitness Measure).

Given two sets of variables $\mathbf{X}_{i},\mathbf{X}_{j}$ and a specified value function $v\in\{v_{bs},v_{cond}\}$: if $|v_{cond}(\mathbf{x},\mathbf{X}_{i}\cup\mathbf{X}_{j})-v_{cond}(\mathbf{x},\mathbf{X}_{j})-v_{cond}(\mathbf{x},\mathbf{X}_{i})|>\epsilon$, or if $|v_{bs}(\mathbf{x},\mathbf{x}^{\prime},\mathbf{X}_{i}\cup\mathbf{X}_{j})-v_{bs}(\mathbf{x},\mathbf{x}^{\prime},\mathbf{X}_{j})-v_{bs}(\mathbf{x},\mathbf{x}^{\prime},\mathbf{X}_{i})|>\epsilon$, then there is interaction between $\mathbf{X}_{i}$ and $\mathbf{X}_{j}$.

We substitute the SS fitness measure into the RDG algorithm, which identifies NSVGs by recursively identifying the variable sets $\mathbf{X}_{j}$ with which a given variable $X_{i}$ interacts. If $X_{i}$ and a single variable $X_{j}$ are found to interact, they are placed into the same NSVG, $\mathbf{X}_{1}$, at which point interaction between $\mathbf{X}_{1}$ and the remaining variables is identified. The algorithm iterates over every variable $X_{i}\in\mathbf{X}$ and returns the set of NSVGs. To compute the SS attributions for a given prediction $f(\mathbf{x})$, we compute $v(\mathbf{x},\mathbf{X}_{i})$ for each NSVG $\mathbf{X}_{i}$. Our full algorithm is shown in Algorithm 2. The runtime of SS is $O(n\log n)$, as proven in [12].

Algorithm 1 ValueInteract($\mathbf{X}_{1},\mathbf{X}_{2}$)
Require: $v\in\{v_{bs},v_{cond}\}$, $\mathbf{X}_{input}$, $\epsilon$
if $v=v_{bs}$ then
  Sample $\mathbf{x},\mathbf{x}^{\prime}$ from input distribution $\mathbf{X}_{input}$
  $\sigma_{1}=v(\mathbf{x},\mathbf{x}^{\prime},\mathbf{X}_{1}\cup\mathbf{X}_{2})-v(\mathbf{x},\mathbf{x}^{\prime},\mathbf{X}_{2})$
  $\sigma_{2}=v(\mathbf{x},\mathbf{x}^{\prime},\mathbf{X}_{1})$
end if
if $v=v_{cond}$ then
  Sample $\mathbf{x}$ from input distribution $\mathbf{X}_{input}$
  $\sigma_{1}=v(\mathbf{x},\mathbf{X}_{1}\cup\mathbf{X}_{2})-v(\mathbf{x},\mathbf{X}_{2})$
  $\sigma_{2}=v(\mathbf{x},\mathbf{X}_{1})$
end if
if $|\sigma_{1}-\sigma_{2}|>\epsilon$ then
  if $\mathbf{X}_{2}$ contains one variable then
    $\mathbf{X}_{1}=\mathbf{X}_{1}\cup\mathbf{X}_{2}$
  else
    Split $\mathbf{X}_{2}$ into two equal groups $G_{1},G_{2}$
    $\mathbf{X}^{1}_{1}$ = ValueInteract($\mathbf{X}_{1},G_{1}$)
    $\mathbf{X}^{2}_{1}$ = ValueInteract($\mathbf{X}_{1},G_{2}$)
    $\mathbf{X}_{1}=\mathbf{X}^{1}_{1}\cup\mathbf{X}^{2}_{1}$
  end if
end if
Return $\mathbf{X}_{1}$
Algorithm 2 Shapley Sets (Adapted from RDG [12])
Require: $v\in\{v_{cond},v_{bs}\}$, $\epsilon$, $\mathbf{x}_{inp}$, $\mathbf{x}_{ref}$ (if $v=v_{bs}$)
Initialise $seps$ and $nonseps$ as empty groups
Assign the first variable in $\mathbf{X}$ to $\mathbf{X}_{1}$
Assign the rest of the variables in $\mathbf{X}$ to $\mathbf{X}_{2}$
while $\mathbf{X}_{2}$ is not empty do
  $\mathbf{X}^{\prime}_{1}\leftarrow$ ValueInteract($\mathbf{X}_{1},\mathbf{X}_{2}$)
  if $\mathbf{X}^{\prime}_{1}$ is the same as $\mathbf{X}_{1}$ then
    if $\mathbf{X}_{1}$ contains one variable then
      $seps\leftarrow\mathbf{X}_{1}$
    else
      $nonseps\leftarrow\mathbf{X}_{1}$
    end if
    Empty $\mathbf{X}_{1}$ and $\mathbf{X}^{\prime}_{1}$
    Assign the first variable of $\mathbf{X}_{2}$ to $\mathbf{X}_{1}$
    Delete the first variable of $\mathbf{X}_{2}$
  else
    $\mathbf{X}_{1}=\mathbf{X}^{\prime}_{1}$
    Delete the variables of $\mathbf{X}_{1}$ from $\mathbf{X}_{2}$
  end if
end while
For each set of variables $\mathbf{X}_{i}\in seps\cup nonseps$, return $v_{cond}(\mathbf{x}_{inp},\mathbf{X}_{i})$ or $v_{bs}(\mathbf{x}_{inp},\mathbf{x}_{ref},\mathbf{X}_{i})$
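
For concreteness, the following is a minimal, self-contained Python sketch of Algorithms 1 and 2 under $v_{bs}$ (function and variable names are ours; for $v_{cond}$, substitute a conditional value function such as the Gaussian approximation described at the start of Section 6). It is a sketch of the recursion rather than a tuned implementation: it fixes a single candidate pair $(\mathbf{x},\mathbf{x}^{\prime})$ rather than resampling, and it returns separable and non-separable groups together rather than tracking $seps$ and $nonseps$ separately.

import numpy as np

def v_bs(f, x, z, S):
    """Baseline value function (Eq. 3): coalition S from x, rest from z."""
    hybrid = z.copy()
    S = list(S)
    hybrid[S] = x[S]
    return f(hybrid) - f(z)

def value_interact(f, X1, X2, x, z, eps):
    """Algorithm 1: recursively absorb into X1 the variables of X2 it
    interacts with, halving X2 at each step (Definition 3.1)."""
    s1 = v_bs(f, x, z, X1 | X2) - v_bs(f, x, z, X2)
    s2 = v_bs(f, x, z, X1)
    if abs(s1 - s2) > eps:
        if len(X2) == 1:
            return X1 | X2
        half = sorted(X2)
        g1, g2 = set(half[:len(half) // 2]), set(half[len(half) // 2:])
        return (value_interact(f, X1, g1, x, z, eps)
                | value_interact(f, X1, g2, x, z, eps))
    return X1

def shapley_sets(f, x, z, eps=1e-6):
    """Algorithm 2: decompose into variable groups, then attribute v_bs
    to each group (Eq. 8)."""
    groups, X1, X2 = [], {0}, set(range(1, len(x)))
    while X2:
        X1_new = value_interact(f, X1, X2, x, z, eps)
        if X1_new == X1:          # no further interaction: close the group
            groups.append(X1)
            X1 = {min(X2)}
            X2 = X2 - X1
        else:                     # group grew: remove absorbed variables
            X2 = X2 - X1_new
            X1 = X1_new
    groups.append(X1)
    return {tuple(sorted(g)): v_bs(f, x, z, g) for g in groups}

# Example 2 (0-indexed): recovers the groups {X1} -> 1 and {X2, X3} -> 2.
f = lambda a: a[0] + 2 * a[1] * a[2]
print(shapley_sets(f, np.ones(3), np.zeros(3)))  # {(0,): 1.0, (1, 2): 2.0}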

4 Motivating Shapley Sets

The selection of the value function $v$ determines the variable grouping generated. Used with $v_{bs}$, as interacting features are placed in the same NSVG, the attributions resulting from SS are more faithful to the underlying model. The SS attribution for $v_{bs}$ in Example 2 would be $\varphi_{X_{1}}=1$ and $\varphi_{X_{2},X_{3}}=2$.

Used with $v_{cond}$, as interacting features are placed in the same NSVG, the attributions resulting from SS do not suffer from the violation of sensitivity described in Section 1.2. Consider again Example 1: as $X_{2},X_{3}$ now belong to an NSVG, the SS attributions for $X_{1},X_{3}$ are now equal across both $f$ and $f_{2}$, and therefore robust to whether non-directly impacting features are included in the model. SS offers a further advantage when used to compare attributions under on- and off-manifold value functions. Consider again Example 2, but now with $X_{1}=\alpha X_{2}$. The SS attribution via $v_{marg}$ would be $\varphi_{X_{1}}=1$ and $\varphi_{X_{2},X_{3}}=2$. However, if SS were calculated via $v_{cond}$, $\varphi_{\{X_{1},X_{2},X_{3}\}}=3-\mathbb{E}[f(\mathbf{X})]$, indicating that $f$ is non-separable and all the features interact.

The comparison between on- and off-manifold SS therefore indicates where the feature interaction takes place. We have thus far provided an alternative attribution method to the Shapley value, SS, which can be computed in $O(n\log n)$ time, with $n$ the number of features. SS can be adapted to arbitrary value functions and offers several advantages over Shapley value based attributions when used with on- and off-manifold value functions. In Section 6 we empirically validate the theoretical claims made above, but first we discuss related work.

5 Related Work

As Shapley value based feature attribution has a rich literature, we differentiate SS from the three approaches closest in essence to ours. SS enforces a coalition structure on the Shapley value such that players cannot be considered in isolation from their coalitions. The Owen value is a solution concept for games with an existing coalition structure [13]. The Owen value is the result of a two-step procedure: first, the coalitions play a quotient game among themselves, and each coalition receives a payoff which, in turn, is shared among its individual players in an internal game. Both payoffs are given by applying the Shapley value. This approach is not equivalent to SS, which assumes no prior coalition structure and instead finds the optimal coalition structure, namely the decomposition of $v$ into its NSVGs.

Shapley Residuals [11] capture the extent to which a value function is inessential to a coalition of features. They show, for $v_{cond},v_{marg}$, that if the game function can be decomposed into $v(\mathbf{x},\mathbf{X}_{S})=v(\mathbf{X}_{T})+v(\mathbf{X}_{\bar{T}})$ for $\mathbf{X}_{T}\subset\mathbf{X}_{S}$, then the value function $v$ is inessential with respect to the coalition $\mathbf{X}_{T}$. In this way we can view a non-zero Shapley residual, $r_{S}\neq 0$, as an indication that a coalition is a non-separable variable group. However, Shapley residuals are built on the complex Hodge decomposition [14], are difficult to understand, and do not offer a better way of attributing value to features. In contrast, SS is built on the idea of additive separability, is easier to understand, is less computationally expensive, and proposes a solution to issues with the Shapley value analogous to those Shapley residuals were designed to identify.

Grouped Shapley Values Determining the Shapley value of grouped variables has previously been suggested in [15, 16], which identify interaction in the data (based on measures of correlation), partition the features into groups, and then calculate the Shapley value. Shapley Sets is distinct from these approaches in the following ways. Firstly, Shapley Sets is capable of uncovering interaction in the model as well as in the data. Secondly, Shapley Sets is designed to find the optimal grouping of the features such that the Shapley value provably reduces to the simple computation in Equation 8. The grouped attribution under Shapley Sets therefore requires linear time to compute (given the prior decomposition of the variable set in log-linear time), whereas the grouping proposed under grouped Shapley values [15, 16] still requires exponential computation to compute exactly (although this can be approximated). Shapley Sets is, to our knowledge, the first contribution to the feature attribution literature which automatically decomposes a function into the optimal variable set by which to award attribution.

6 Experimental Motivation of Shapley Sets

We begin with two synthetic experiments. The first motivates the use of SS in the presence of interaction in the model; the second motivates the use of SS in the presence of interaction in the data. We then compare SS to existing Shapley value (SV) based attribution methods on three benchmark datasets. First, however, we outline how the value functions $v_{bs},v_{cond}$ are computed in our experiments.

As discussed above, $v_{bs}$ takes as input arbitrary reference vectors. For our experiments we select the references such that $v_{bs}=v_{marg}$ (Equation 2), with the expectation taken over the empirical input distribution $\mathbf{X}_{input}$. For the calculation of $v_{cond}$ (Equation 4), as the true conditional probabilities of the underlying data distribution are unknown, we approximate $p(\mathbf{X}_{\bar{S}}|\mathbf{X}_{S}=\mathbf{x}_{S})$ from the underlying data distribution. Approximating conditional distributions can be achieved by directly sampling from the empirical data distribution. However, as noted in [3], this method of approximating $p(\mathbf{X}_{\bar{S}}|\mathbf{X}_{S}=\mathbf{x}_{S})$ suffers when $|\mathbf{X}_{S}|>2$, due to sparsity in the underlying empirical distribution. We therefore adopt the approach of [3], where, under the assumption that each $\mathbf{x}\in\mathbf{X}$ is sampled from a multivariate Gaussian with mean vector $\boldsymbol{\mu}$ and covariance matrix $\mathbf{\Sigma}$, the conditional distribution $p(\mathbf{X}_{\bar{S}}|\mathbf{X}_{S})$ is also multivariate Gaussian, such that $p(\mathbf{X}_{\bar{S}}|\mathbf{X}_{S}=\mathbf{x}_{S})=\mathcal{N}_{\bar{S}}(\boldsymbol{\mu}_{\bar{S}|S},\mathbf{\Sigma}_{\bar{S}|S})$, where $\boldsymbol{\mu}_{\bar{S}|S}=\boldsymbol{\mu}_{\bar{S}}+\mathbf{\Sigma}_{\bar{S}S}\mathbf{\Sigma}^{-1}_{SS}(\mathbf{x}_{S}-\boldsymbol{\mu}_{S})$ and $\mathbf{\Sigma}_{\bar{S}|S}=\mathbf{\Sigma}_{\bar{S}\bar{S}}-\mathbf{\Sigma}_{\bar{S}S}\mathbf{\Sigma}^{-1}_{SS}\mathbf{\Sigma}_{S\bar{S}}$. We can therefore sample from the conditional Gaussian distribution with expectation vector and covariance matrix given by $\boldsymbol{\mu}_{\bar{S}|S}$ and $\mathbf{\Sigma}_{\bar{S}|S}$, where $\boldsymbol{\mu}$ and $\mathbf{\Sigma}$ are estimated by the sample mean and covariance matrix of $\mathbf{X}_{input}$.
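
A minimal sketch of this Gaussian approximation of $v_{cond}$ (our naming; the Monte Carlo sample size is an illustrative choice):

import numpy as np

def v_cond_gaussian(f, x, S, mu, Sigma, n_samples=1000, rng=None):
    """Gaussian approximation of v_cond (Eq. 4), following [3]."""
    rng = np.random.default_rng(0) if rng is None else rng
    base = np.mean([f(s) for s in rng.multivariate_normal(mu, Sigma, n_samples)])
    S = sorted(S)
    Sbar = [i for i in range(len(x)) if i not in S]
    if not Sbar:                       # full coalition: nothing to sample
        return f(x) - base
    # Conditional mean and covariance of X_Sbar given X_S = x_S
    A = Sigma[np.ix_(Sbar, S)] @ np.linalg.inv(Sigma[np.ix_(S, S)])
    mu_c = mu[Sbar] + A @ (x[S] - mu[S])
    Sig_c = Sigma[np.ix_(Sbar, Sbar)] - A @ Sigma[np.ix_(S, Sbar)]
    # Monte Carlo estimate of E[f(x_S, X_Sbar) | X_S = x_S]
    draws = rng.multivariate_normal(mu_c, Sig_c, n_samples)
    full = np.tile(x, (n_samples, 1))
    full[:, Sbar] = draws
    return np.mean([f(row) for row in full]) - base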

6.1 Synthetic Experiment: Interaction in the Model

Table 1: Mean Average Error ± std for SS and SV attributions under $v_{marg}$ for the three functions outlined in Section 6.1. SS perfectly identifies NSVGs for all three functions.

        SS             Shapley Value
$f_1$   0.000 ± 0.000  0.335 ± 0.400
$f_2$   0.000 ± 0.000  1.143 ± 0.990
$f_3$   0.000 ± 0.000  0.540 ± 0.580

We first construct three functions with linear and non-linear feature interactions:

$f_{1}(\mathbf{X})=X_{0}+X_{1}/(2+X_{4})+2X_{2}X_{3}+\sin(2X_{5}+X_{6})$
$f_{2}(\mathbf{X})=2\,\mathrm{sgn}(X_{0})+\mathrm{sgn}(X_{1}X_{2}X_{3})+\mathrm{sgn}(X_{4}X_{5}X_{6})$
$f_{3}(\mathbf{X})=2X_{0}X_{2}X_{3}+4X_{4}X_{5}-3X_{1}^{2}-X_{6}$
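
For reference, a direct transcription of the three functions in Python (features 0-indexed as in the equations):

import numpy as np

def f1(X): return X[0] + X[1] / (2 + X[4]) + 2 * X[2] * X[3] + np.sin(2 * X[5] + X[6])
def f2(X): return 2 * np.sign(X[0]) + np.sign(X[1] * X[2] * X[3]) + np.sign(X[4] * X[5] * X[6])
def f3(X): return 2 * X[0] * X[2] * X[3] + 4 * X[4] * X[5] - 3 * X[1] ** 2 - X[6]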

We construct a synthetic dataset of seven features drawn independently from $\mathcal{N}(-1,1)$. For each of 100 randomly drawn samples we compute SS under $v_{marg}$. As $|\mathbf{X}|=7$, we are able to compute the true SVs under $v_{marg}$ for each feature without relying on a sampling algorithm. As we know the ground truth, we calculate the Mean Average Error across all features and samples as our evaluation metric,

$MAE=\frac{1}{k}\sum^{k}_{j=1}\frac{1}{n}\sum_{i=1}^{n}m(X_{ij})-gt(X_{ij}),$ (9)

where $m(X_{ij})$ is the attribution given by $m=SS$ or $m=SV$ to feature $i$ in sample $j$. As SS calculates an attribution for a set of features, $m_{SS}(X_{ij})=\varphi_{\mathbf{X}_{ij}}$, the ground truth attribution $gt(X_{ij})$ is the ground truth value of each NSVG. For example, given $f=2(X_{1}X_{2})$ and $\mathbf{x}_{j}=(1,1)$, $gt(X_{1,j})=2$ and $gt(X_{2,j})=2$.

Results are shown in Table 1. SS is successful in decomposing each function into its NSVGs, and the attributions awarded to each set match the ground truth of the function, giving an MAE of zero for all samples and functions. SV attributions deviate from the ground truth by dividing the value of each NSVG between its individual features, which results in misleading attributions, particularly in the presence of inverse relationships between features. For example, consider the sub-component $X_{1}/(1-X_{2})$ and a particular sample $\mathbf{x}=(1,0.2)$. SV gives $X_{1}$ a positive attribution but $X_{2}$'s attribution is negative. Under SS, $X_{1}$ and $X_{2}$ are considered non-separable and awarded a positive attribution together. From its SV attribution, a user may opt to change $X_{2}$ rather than $X_{1}$; however, as these features jointly move the outcome from the baseline to the target, the impact of changing $X_{2}$ in isolation could be cancelled out by the impact of $X_{1}$.

6.2 Synthetic Experiment: Interaction in the Data

Table 2: Mean Average Error ± std for SS under $v_{cond}$ and SV under $v_{cond}$ and $v_{marg}$ for the three experiments outlined in Section 6.2. SS has lower MAE than SV for all models.

        SS             Shap Marg      Shap Cond
$g_1$   0.204 ± 0.114  0.226 ± 0.121  0.211 ± 0.127
$g_2$   0.071 ± 0.031  0.082 ± 0.032  0.073 ± 0.031
$g_3$   0.074 ± 0.044  0.110 ± 0.068  0.150 ± 0.059

We adopt the approach of [8] and propose an underlying linear regression model $f(\mathbf{X})=X_{0}+0.5X_{1}+0.8X_{3}+0.2X_{2}+0.5X_{4}$. We construct a synthetic dataset comprising five features ($n=5$). $(X_{2},X_{3},X_{4})$ are modelled as i.i.d. and drawn independently from $\mathcal{N}(-1,1)$; $X_{0},X_{1}$, however, are modelled as dependent features where $X_{1}=\rho X_{0}$. We generate a synthetic dataset $\mathbf{X}_{train},\mathbf{X}_{test}$ consisting of $k=(2000,100)$ samples of each feature and obtain the ground truth labels $\mathbf{y}_{train},\mathbf{y}_{test}=f(\mathbf{X}_{train}),f(\mathbf{X}_{test})$. We next select a model $g$ which is trained on $\mathbf{X}_{train},\mathbf{y}_{train}$ to approximate $f$. We calculate the attributions for each sample in $\mathbf{X}_{test}$ generated by the SV under both $v_{marg}$ and $v_{cond}$, and the attributions from SS under $v_{cond}$. To evaluate attributions, we use the coefficients of the linear regression model as our ground truth attributions, $c=\{1,0.5,0.8,0.2,0.5\}$. We use MAE (Equation 9), where the ground truth for feature $i$ in sample $j$ is $gt_{X_{ij}}=c_{i}x_{i,j}$.

Off-manifold attributions in the presence of interaction in the data recover the ground truth attributions reliably when $g$ is a linear model; however, this breaks down when non-linear models are used as the approximating function $g$ [8]. We therefore compare attributions under $g_{1}$, a linear regression model, and $g_{2}$, an XGBoost model.

Results are shown in Table 2, where SS outperforms SV on both $g_{1}$ and $g_{2}$. Under $g_{1}$, the MAE is lower for SV Marginal than for SV Conditional, validating the findings in [8].

However, when the non-linear $g_{2}$ is used, the attributions from SS and SV under $v_{cond}$ outperform SV under $v_{marg}$. The attributions provided by SS outperform those generated by SV across both models. We now show experimentally that SS under an on-manifold value function avoids the issues related to sensitivity. To do this, we add a dummy variable $X_{5}=X_{0}$ to the dataset $\mathbf{X}$ such that $X_{5}$ is not used by $f$. We train another XGBoost model, $g_{3}$, using the new dataset and generate the three sets of attributions as before. Results are shown in Table 2. Under the influence of the dummy, the MAE of SV under $v_{cond}$ increases, as the attribution of each of the non-dummy variables moves further away from its true value to accommodate the attribution of the new feature, despite it having no effect on the true output. In contrast, SS includes this dummy feature in the non-separable set $\{X_{0},X_{1}\}$; the resulting attribution to the existing features is unchanged, and thus the MAE remains constant under the inclusion of dummy variables, demonstrating SS's robustness to how the underlying phenomenon is modelled.

Table 3: Average deletion ± std for the attributions generated by SS under $v_{marg}$ and $v_{cond}$, KS and TS for the Boston (B), Diabetes (D) and Correlation (C) datasets. SS attributions have the lowest deletion score across all datasets.

    SS Int         SS Cond        KS             TS
B   0.020 ± 0.022  0.007 ± 0.006  0.046 ± 0.047  0.047 ± 0.048
D   0.081 ± 0.075  0.050 ± 0.039  0.103 ± 0.085  0.010 ± 0.082
C   0.005 ± 0.007  0.033 ± 0.029  0.075 ± 0.057  0.072 ± 0.055

6.3 Shapley Sets of Real World Benchmarks

We now evaluate SS on real data: the Diabetes, Boston and Correlation datasets from the Shap library [16]. For each dataset we train either an XGBoost or Random Forest model on the provided train set, obtaining $R^{2}$ scores of 0.90 (RF), 0.89 (RF) and 0.86 (XGB) respectively. We compute SS attributions for 100 randomly selected samples from the test set under both $v_{marg}$ and $v_{cond}$. As the dimensionality of the datasets now exceeds that for which the true Shapley values can be computed, we compare the SS attributions with the most commonly used approximation techniques: Tree Shap (TS) [17] and Kernel Shap (KS) [16]. Under its original implementation, KS is an approximation of an off-manifold value function and breaks the relationship between input features and the data distribution. TS does not make this assumption and is presented as an on-manifold Shapley value approximation; however, in practice TS performs poorly when there is high dependence between features in the dataset [3]. To evaluate the attributions generated by SS, KS and TS in the absence of a ground truth attribution, we use modified versions of the deletion and sensitivity measures which have been used widely across the literature [18]. Deletion is built on the intuition that the magnitude of a feature's score should reflect its impact on the output. Our metric therefore measures the absolute distance between the baseline prediction $v(\mathbf{x},\mathbf{X}_{\varnothing})$ and the prediction of a given sample after the most important feature $X^{\prime}_{i}=x_{i}$, as determined by the attribution method under consideration $m$, has been removed:

$AD=\frac{1}{k}\sum_{j=1}^{k}|v(\mathbf{x}_{j},\varnothing)-v(\mathbf{x}_{j},N\backslash\{i\})|$ (10)
Table 4: Average sensitivity ± std for SS under $v_{marg}$ and $v_{cond}$, KS and TS for the Boston (B), Diabetes (D) and Correlation (C) datasets. SS results in the lowest sensitivity for B and C, yet KS achieves the lowest sensitivity for D.

    SS Int         SS Cond        KS             TS
B   0.015 ± 0.049  0.006 ± 0.031  0.029 ± 0.000  0.030 ± 0.000
D   0.021 ± 0.099  0.017 ± 0.067  0.004 ± 0.000  0.076 ± 0.000
C   0.000 ± 0.010  0.008 ± 0.020  0.001 ± 0.000  0.035 ± 0.000

A low AD indicates that the attribution technique has correctly identified an important feature to remove. As SS attributes to sets of features, we allow $\mathbf{X}^{\prime}$ to be a non-separable variable set as generated by SS. This may influence the reliability of AD due to a varying number of features being removed from an instance. We therefore also assess the sensitivity of the attribution technique, which is the difference between the sum of all the attributions given by the attribution technique and the prediction of the sample. Ideal attributions have a low sensitivity:

$AS=\frac{1}{k}\sum_{j=1}^{k}|v(\mathbf{x}_{j},N)-\sum_{i=1}^{n}m(X_{ij})|$ (11)
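
A minimal sketch of both metrics (names and data layout are ours: `v` is a value function closure over the model, `top_group` the most important feature or NSVG per sample, and `attr_sums` the per-sample sum of attributions):

import numpy as np

def avg_deletion(v, X_test, top_group):
    """Eq. 10: mean |v(x, empty) - v(x, N minus the top feature/group)|."""
    scores = []
    for x, g in zip(X_test, top_group):
        rest = [i for i in range(len(x)) if i not in g]
        scores.append(abs(v(x, []) - v(x, rest)))
    return np.mean(scores)

def avg_sensitivity(v, X_test, attr_sums):
    """Eq. 11: mean |v(x, N) - sum of all attributions for x|."""
    n = X_test.shape[1]
    return np.mean([abs(v(x, list(range(n))) - s)
                    for x, s in zip(X_test, attr_sums)])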

Tables 3 and 4 show that SS has lower (better) deletion than TS and KS across all three datasets. However, KS has the lowest sensitivity score on the Diabetes dataset; we note that, for this dataset, there is high variance in the sensitivity score for both SS attributions. This is largely explained by the sensitivity of SS to the setting of $\epsilon$, which is discussed further in Section 7.

Figure 1: Curves show the change in prediction of two individual samples from the Boston dataset as increasing numbers of features, sorted in order of importance by the attributions returned by SS (green) and KS (red), are removed from the instance. Original and target predictions are shown by the black and blue horizontal lines. An ideal attribution would result in a sharp increase or decrease towards the target. In both samples, SS results in a quicker and smoother transition from original to target prediction.

Figure 1 shows the advantage of set rather than individual attributions. The red and green curves (KS and SS respectively) show the change in prediction as each feature in the sorted attributions is masked consecutively from the input. By considering the effect of sets of interacting features rather than individual features, SS avoids the sub-optimal behaviour of KS which arises when the interaction effects between features in the model mask each other's importance. Figure 1 also validates the use of the deletion metric to compare individual and set attributions, as it is clear that masking more features does not guarantee a lower deletion score.

7 Conclusions, Limitations and Future Work

This paper has introduced Shapley Sets (SS), a novel method for feature attribution which automatically and optimally decomposes a function $f$ into a set of NSVGs by which to award attribution. We have shown how SS generates more faithful explanations in the presence of feature interaction, both in the data and in the model, than Shapley value based alternatives. To our knowledge, SS is the only method in the literature which automatically generates a grouped attribution vector. Below we explore some limitations of SS and ideas for future work.

Sensitivity to Parametrisation: In Algorithm 2, $\epsilon$ determines the degree to which two sets of variables are considered interacting. The original RDG algorithm recommends setting $\epsilon$ proportional to the magnitude of the objective space. This setting works well for SS Interventional. However, we noticed a large variation in the variable grouping generated by SS Conditional under this setting of $\epsilon$. This is not surprising, as it is known that $v_{cond}$ is sensitive to feature correlations in the data, and it is difficult to know how much correlational structure to allow before two features are considered to be causally linked. Future work should therefore look at alternative methods of function decomposition which are not so dependent on the parametrisation of $\epsilon$ [19].

Assumption of Partially Separable Model: SS assumes that the model to be explained is partially separable. If we consider the function $f(\mathbf{X})=X_{1}X_{2}X_{3}$, SS would result in a single attribution to all three features of $f(\mathbf{x})$. This is not useful from an explanation perspective, although it does inform us about the nature of the underlying model. Furthermore, the assumption of a partially separable function is also made by the Shapley value [2]. Future work should consider function decomposition under a wider class of separability, such as multiplicative separability, where associated algorithms decompose a function into its additive and multiplicative separable variable sets [19].

Acknowledgments

We would like to thank Giulia Occhini, Alexis Monks, Isobel Shaw and Jennifer Yates for their invaluable support during the writing of this paper. This work was supported by an Alan Turing Institute PhD Studentship funded under EPSRC grant EP/N510129/1.

References

  • [1] Lloyd S Shapley. A value for n-person games. Classics in game theory, 69, 1997.
  • [2] I Elizabeth Kumar, Suresh Venkatasubramanian, Carlos Scheidegger, and Sorelle Friedler. Problems with shapley-value-based explanations as feature importance measures. In International Conference on Machine Learning, pages 5491–5500. PMLR, 2020.
  • [3] Kjersti Aas, Martin Jullum, and Anders Løland. Explaining individual predictions when features are dependent: More accurate approximations to shapley values. Artificial Intelligence, 298:103502, 2021.
  • [4] Tom Heskes, Evi Sijben, Ioan Gabriel Bucur, and Tom Claassen. Causal shapley values: Exploiting causal knowledge to explain individual predictions of complex models. Advances in neural information processing systems, 33:4778–4789, 2020.
  • [5] Luke Merrick and Ankur Taly. The explanation game: Explaining machine learning models using shapley values. In International Cross-Domain Conference for Machine Learning and Knowledge Extraction, pages 17–38. Springer, 2020.
  • [6] Dominik Janzing, Lenon Minorics, and Patrick Blöbaum. Feature relevance quantification in explainable ai: A causal problem. In International Conference on artificial intelligence and statistics, pages 2907–2916. PMLR, 2020.
  • [7] Mukund Sundararajan and Amir Najmi. The many shapley values for model explanation. In International conference on machine learning, pages 9269–9278. PMLR, 2020.
  • [8] Giles Hooker, Lucas Mentch, and Siyu Zhou. Unrestricted permutation forces extrapolation: variable importance requires at least one more model, or there is no free variable importance. Statistics and Computing, 31(6):1–16, 2021.
  • [9] Erik Štrumbelj and Igor Kononenko. Explaining prediction models and individual predictions with feature contributions. Knowledge and information systems, 41(3):647–665, 2014.
  • [10] Christopher Frye, Colin Rowat, and Ilya Feige. Asymmetric shapley values: incorporating causal knowledge into model-agnostic explainability. Advances in Neural Information Processing Systems, 33:1229–1239, 2020.
  • [11] Indra Kumar, Carlos Scheidegger, Suresh Venkatasubramanian, and Sorelle Friedler. Shapley residuals: Quantifying the limits of the shapley value for explanations. Advances in Neural Information Processing Systems, 34:26598–26608, 2021.
  • [12] Yuan Sun, Michael Kirley, and Saman K Halgamuge. A recursive decomposition method for large scale continuous optimization. IEEE Transactions on Evolutionary Computation, 22(5):647–661, 2017.
  • [13] Guillermo Owen. Values of games with a priori unions. In Mathematical Economics and Game Theory, pages 76–88. Springer, 1977.
  • [14] Ari Stern and Alexander Tettenhorst. Hodge decomposition and the shapley value of a cooperative game. Games and Economic Behavior, 113:186–198, 2019.
  • [15] Martin Jullum, Annabelle Redelmeier, and Kjersti Aas. groupshapley: Efficient prediction explanation with shapley values for feature groups. arXiv preprint arXiv:2106.12228, 2021.
  • [16] Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 4765–4774. Curran Associates, Inc., 2017.
  • [17] Scott M. Lundberg, Gabriel Erion, Hugh Chen, Alex DeGrave, Jordan M. Prutkin, Bala Nair, Ronit Katz, Jonathan Himmelfarb, Nisha Bansal, and Su-In Lee. From local explanations to global understanding with explainable ai for trees. Nature Machine Intelligence, 2(1):2522–5839, 2020.
  • [18] Arne Gevaert, Axel-Jan Rousseau, Thijs Becker, Dirk Valkenborg, Tijl De Bie, and Yvan Saeys. Evaluating feature attribution methods in the image domain. arXiv preprint arXiv:2202.12270, 2022.
  • [19] Minyang Chen, Wei Du, Yang Tang, Yaochu Jin, and Gary G Yen. A decomposition method for both additively and non-additively separable problems. IEEE Transactions on Evolutionary Computation, 2022.