Learning to Receive Help:
Intervention-Aware Concept Embedding Models

David S. Hippocampus
Department of Computer Science
Cranberry-Lemon University
Pittsburgh, PA 15213
hippo@cs.cranberry-lemon.edu
Use footnote for providing further information about author (webpage, alternative address)—not for acknowledging funding agencies.

Abstract

TODO: Write this down. It is still in flux…

Concept-based explainable modelling has gained a lot of traction as a way of addressing the opacity of DNNs, where predictions are constructed and explained in terms of high-level human understandable concepts. Domain expert human interventions on concepts often lead to improved model performance. However, it is not clear how to choose concepts to intervene on that lead to best improved model performance, and once chosen, what intervention policy incentivises the model to actually use the chosen concepts to intervene on.

As artificial intelligence systems become ubiquitous in our day-to-day, it is crucial that they are interpretable and open to human feedback. Concept Embedding Models (CEMs) are neural architectures that achieve these goals by conditioning their task prediction on a set of human-understandable concept embeddings that can be modified interactively by humans. These interactions, referred to as concept interventions, have been shown to significantly improve a CEM’s test performance. Nevertheless, a CEM’s receptiveness to concept interventions comes with two critical limitations: First, CEMs are unable to specify which concepts’ labels would result in the best performance boost after an intervention, leading to high variance in the effect of interventions. Second, CEMs lack an explicit regularisation in their objective function that incentivises their performance to improve after an intervention, leading to scenarios where intervention gains are limited. To address these limitations, we propose a new CEM-inspired architecture, called Intervention-Aware Concept Embedding Models (IntCEMs), that incorporates an explicit incentive in its loss function to improve performance after receiving an intervention. IntCEMs learn an end-to-end concept intervention policy that can be used to sample meaningful interventions during training and specify which concepts’ ground truth labels may maximize its expected performance during testing. Our experiments show that IntCEMs outperform state-of-the-art concept-interpretable architectures when concept interventions are performed while remaining competitive against the same baselines when no interventions are performed. These results suggest that by exposing IntCEMs to interventions during training, we can bridge the gap between how concept-learning models are typically trained and how they are expected to be used in practice with a feedback-providing human user.

1 Introduction

TODO: [IMPORTANT NOTE] Everything is waaay too long right now but it will be halved if possible in the next iterations after the full manuscript is done.

Knowing when and how to ask for help is paramount to human success. The ability to easily request assistance from others has been instrumental in maintaining our modern mass collaborations [doan2010mass] while simultaneously allowing for misconception correction [gusukuma2018misconception] and improvements in our understanding of a topic [webb1995group]. When help is requested in the form of a query, it enables three things: (1) the interrogator may use the answer to the query to update their belief about their environment, leading to better-informed decision-making; (2) the receiver of the query may use its contents to calibrate their trust on the interrogator’s capabilities [rapid_trust_calibration, explanations_trust_calibration]; and (3) the query itself may be indicative of knowledge gaps in the interrogator, generating feedback on which skills may require future attention.

While the use of queries is ubiquitous in real-world decision-making, modern artificial intelligence (AI) systems, in particular deep neural networks (DNNs), are commonly deployed in isolation to experts whose knowledge base could, in theory, be queried in order to help the system perform better. Such “expert/system isolation” is partly because the decision process in DNNs is too complex for a single inspector to fully understand it [gunning2019xai, rudin2021interpretable]. This limits communication channels between the model and the user. Nevertheless, even for DNNs that are carefully designed to be easily interpretable by construction (e.g., Self-Explaining Neural Networks [senn] or Prototypical Part Networks [protopnet]), it is unclear how these models can query experts for assistance and how they could even use such feedback if it is available. This lack of expert-system interaction leaves a crucial gap in the current design of neural architectures, exacerbating the shift between how these systems are trained (i.e., commonly without any expert intervention) and how they are expected to be used in practice (i.e., in conjunction with a human user which may provide feedback).

Progress in this aspect has recently come from work in Explainable Artificial Intelligence (XAI) where researchers have proposed novel interpretable DNN architectures that permit the expression and uptake of expert feedback [cbm, cem, havasi2022addressing, magister2022encoding]. In particular, Concept Bottleneck Models (CBMs) [cbm] enable easy expert-model interactions by generating, as part of their inference process, explanations for their predictions using high-level units of information referred to as “concepts”. Such explanations allow experts to better understand why a model makes a decision based on human-understandable concepts (e.g., the bird is a crow because it is “black” and “medium-sized”) rather than low-level features (e.g., input pixels) while also enabling them to provide feedback to the model in the form of ground truth values of some of these concepts. This is done through what is commonly referred to as concept interventions [cbm, cem, closer_look_at_interventions], a process in which an expert analyses a concept-based explanation provided by a CBM and corrects mispredicted intermediate concepts so that the model can update its prediction with this new feedback. This process has been empirically shown to improve a CBM’s performance in a variety of tasks [cbm, cem, closer_look_at_interventions, coop].

While concept interventions in CBM-like models have shown promise, there are currently several limitations in need of further study. First, it is known that the order in which certain concepts are selected when intervening can have a significant effect on the task performance of the CBM. This has led to some non-parametric concept selection policies being proposed [closer_look_at_interventions, coop] to determine a greedy order in which concepts should be intervened when we have access to only a handful of interventions. Nevertheless, these policies have been shown to underperform with respect to an optimal oracle greedy policy [coop] and they can be costly to execute in practice [closer_look_at_interventions], leaving a significant performance and tractability gap to be filled by further research. Second, regardless of which intervention policy is selected, it is not necessarily the case that interventions always improve the CBM’s task performance. For example, previous work by cbm has shown that variations to the mode of intervention in a CBM (i.e., whether one intervenes by modifying its logistic bottleneck activations or their sigmoidal representations), as well as variations to a CBM’s concept predictive loss weight, can lead to interventions increasing a model’s test error. Similarly, metric_paper, cem empirically observed that interventions may have a detrimental effect on a CBM’s original performance if the CBM does not have enough capacity or if its set of concept annotations at training time is not a complete [concept_completeness] description of the downstream task. These results suggest a CBM’s receptiveness to interventions, and therefore its ability to properly learn correlations between output labels and predicted concepts, is neither guaranteed nor fully understood.

In this paper, we argue that the observed sensitivity of a CBM’s receptiveness to interventions with respect to changes in its training and architectural setup is not coincidental; it is, in fact, an artefact of the objective function used to train CBMs. Specifically, we argue that a CBM has no real training incentive to be receptive to interventions as it is neither exposed to interventions at training time nor optimised to perform well under interventions. We hypothesise that optimising for a linear combination of proxy losses for concept predictive accuracy and task predictive accuracy, as CBMs are currently trained, results in these models being receptive to interventions only when learning meaningful concept-to-output-label correlations is easier than exploiting possible leakage through the CBM’s continuous bottleneck [promises_and_pitfalls, metric_paper]. To address this, we propose a novel objective function and architecture built on top of Concept Embedding Models (CEMs) [cem] that addresses the two previously discussed limitations through an intervention-aware objective function. Our architecture, which we call Intervention-aware Concept Embedding Model (IntCEM), has two key distinctive features that have not been explored in previous concept-based models. First, it learns a greedy intervention policy in an end-to-end fashion which imposes a probability distribution over which concept to intervene on next to maximise the performance improvement after the intervention (e.g., which concepts we would like to “ask” an expert about). Second, its loss function incentivises the model to perform well when it is intervened on by adding a regularisation term that considers the model performance after rolling out the learnt intervention policy. The main contributions of this paper can therefore be summarised as follows:

•

We propose Intervention-aware Concept Embedding Models (IntCEMs), the first concept-interpretable neural architecture that learns not only to predict and explain their tasks using high-level concepts, but also learns a meaningful distribution over concepts that an expert may provide to improve the model’s performance.
•

We show that IntCEMs are able to perform competitively against state-of-the-art baselines when they receive no interventions while they significantly outperform all baselines when intervening on their concepts.
•

We qualitatively demonstrate that IntCEM is able to learn to identify possibly occluded or noisy concepts in a sample and use this information for requesting useful feedback from the user to improve its downstream performance.
•

We make our code and implementation, including an extensive suite of intervention policies for Concept Bottleneck Models, publicly available¹¹1Repository containing code and reproducibility steps will be made public upon publication..

2 Background and Previous Work

Explainable Artificial Intelligence

Early work in XAI focused on constructing explanations for a “black box” model (e.g., a DNN) by looking at how much each training sample [influence_functions_linear_regression, influence_functions_black_boxes] or each input feature [lime, shap, anchors] contributes to a model’s output. As DNNs gained traction in the field, the latter set of approaches was further developed for when the model being explained is a DNN, giving birth to what is now referred to as saliency map methods [og_saliency, smoothgrad, grad_cam]. These methods exploit the fact that DNNs are differentiable and make use of the model’s gradients to generate heatmaps to visually highlight how much a given feature contributes, either positively or negatively, to the output prediction. Although useful, these methods have been successful at identifying potential concerns in DNNs [barnett2021case] and accidental spurious correlations captured by different models [degrave2021ai] they have been shown to be susceptible to misinterpretations [fragile_saliency_maps] as well as perturbations to both weights [saliency_random_network] and input features [unreliability_saliency] while also being predisposed to adversarial attacks [saliency_manipulation].

Concept Learning and Concept Bottleneck Models

Following empirical evidence that DNNs may be learning detectors for human-interpretable high-level concepts as part of their latent space [network_dissection, net2vec], several works [tcav, cace, ace, cme, concept_completeness] have explored whether one can extract such concepts from trained DNNs to construct explanations for their predictions. Using high-level concepts rather than input features to construct explanations, as done by previous saliency or feature importance methods, allows users to reason about a model’s explanation using a much more concise and human-understandable set of values. Within this line of work, cbm, closely followed by concept_whitening, were amongst the first to propose designing neural architectures that can learn concept-based explanations as part of their predictions. Their method, Concept Bottleneck Models (CBMs), provides a general design framework for developing concept-based interpretable neural architectures which the work in this paper follows.

Given a set of concept-annotated training samples $\mathcal{D}:=\{(\mathbf{x},\mathbf{c},y)|\mathbf{x}\in\mathbb{R}^{n},\mathbf{c}\in\{0,1\}^{k},y\in\{1,\cdots,L\}$ , where each sample $\mathbf{x}$ (e.g., an X-ray scan of a knee) is annotated with a downstream task label $y$ (e.g., arthritis grade) and a set of $k$ binary concepts $\mathbf{c}$ (e.g., $c_{1}$ may be “has bone spurs”), a CBM can be defined as a pair of functions $(g:\mathbb{R}^{n}\rightarrow[0,1]^{k},f:[0,1]^{k}\rightarrow[0,1]^{L})$ whose composition $(f\circ g):\mathbb{R}^{n}\rightarrow[0,1]^{L}$ predicts a probability over output tasks for a given sample $\mathbf{x}$ . The first of these functions $f$ , called the concept encoder, learns a mapping between input features $\mathbf{x}\in\mathbb{R}^{n}$ and a set of concept activations $\hat{\mathbf{c}}=g(\mathbf{x})\in[0,1]^{k}$ , where $\hat{c}_{i}$ is incentivised to be close to $1$ whenever concept $c_{i}$ is active in $\mathbf{x}$ and close to $0$ otherwise. The second of these functions $f$ , called the label predictor, learns a mapping from the set of predicted concepts $\hat{\mathbf{c}}$ to a distribution $\hat{\mathbf{y}}=f(\hat{\mathbf{c}})$ over $L$ downstream task classes. When one parameterises both $g$ and $f$ using differentiable parametric models (e.g., DNNs), their composition $(f\circ g)$ results in an end-to-end differentiable model whose constituent parameters can be learnt (i) jointly, where the loss function is a linear combination of the concept predictive loss $\mathcal{L}_{\text{concept}}$ and the downstream task cross-entropy loss, (ii) sequentially, where we first $g$ to minimise the concept predictive loss $\mathcal{L}_{\text{concept}}$ and then we train $f$ to minimise the task cross-entropy loss from the $g$ (where $g$ ’s parameters are frozen), or (iii) independently, where we train both $g$ and $f$ independently of each other to minimise their respective losses. The resulting composition of $g$ and $f$ results in a concept-interpretable model as all the information that is available for $f$ to predict an output class is the “bottleneck” of predicted concepts $\hat{\mathbf{c}}$ . This allows one to interpret $\hat{c}_{i}$ as the probability $\hat{p}_{i}$ that the model believes that ground truth concept $c_{i}$ is active in the input sample $\mathbf{x}$ .

Concept Embedding Models

Although CBMs are capable of learning interpretable neural models when concept annotations are available at training time, their training paradigm makes them vulnerable to the provided concept annotations being incomplete [concept_completeness] w.r.t. the downstream task. Specifically, given that a CBM’s label predictor operates only on the concept scores predicted by the concept encoder, if the encoder’s set of concepts is not fully predictive of the downstream task of interest, then the CBM will be forced to leverage between being highly accurate at predicting concepts or being highly accurate at predicting downstream task labels. Addressing this trade-off between accuracy and interpretability in concept-incomplete training setups is the motivation of Concept Embedding Models (CEMs) [cem], a recently proposed CBM whose bottleneck allows for high-dimensional concept representations. These models address this limitation by allowing information of concepts not provided during training to flow into the label predictor via two high-dimensional representations (i.e., embeddings) for each concept $c_{i}$ : $\mathbf{c}^{+}_{i}\in\mathbb{R}^{m}$ , an $m$ -dimensional vector representing the training concept $c_{i}$ when it is active $c_{i}$ ( $c_{i}$ = 1), and $\mathbf{c}^{-}_{i}\in\mathbb{R}^{m}$ , an $m$ -dimensional vector representing training concept $c_{i}$ when it is inactive ( $c_{i}$ = 0).

For each training concept $c_{i}$ , a CEM generates its concept embeddings from an input a latent code $\mathbf{h}=\rho(\mathbf{x})$ learnt by a differentiable encoder module $\rho$ (e.g., a pre-trained ResNet [resnet] backbone) using learnable models $\phi_{i}^{+}(\mathbf{h})$ and $\phi_{i}^{-}(\mathbf{h})$ . These embeddings are then used to generate the predicted probability $\hat{p}_{i}=s([\hat{\mathbf{c}}^{+}_{i},\hat{\mathbf{c}}^{-}_{i}])$ of concept $c_{i}$ being active through a learnable scoring model $s:\mathbb{R}^{2m}\rightarrow[0,1]$ shared across all concepts. These probabilities allow one to define a CEM’s concept encoder $g(\mathbf{x})=\hat{\mathbf{c}}:=[\hat{\mathbf{c}}_{1},\cdots,\hat{\mathbf{c}}_{k}]\in\mathbb{R}^{km}$ by using the predicted probabilities to mix the positive and negative embeddings of each concept to generate a final concept embedding $\hat{\mathbf{c}}_{i}:=\hat{p}_{i}\hat{\mathbf{c}}^{+}_{i}+(1-\hat{p}_{i})\hat{\mathbf{c}}^{-}_{i}$ for each concept. The output of CEM’s concept encoder (i.e., its “bottleneck”) can be passed to a learnable label predictor $g(\hat{\mathbf{c}})$ which outputs a distribution over task labels $\{1,\cdots,L\}$ and can be explained via the concept activations predicted by the concept probabilities $\hat{\mathbf{p}}:=[\hat{p}_{1},\cdots,\hat{p}_{k}]^{T}$ .

Concept Interventions

A key property of both CBMs and CEMs is that they can improve their test-time performance through expert feedback in the form of concept interventions. These interventions work by allowing an expert to interact with the model at deployment time to correct one or more of its predicted concept probabilities $\hat{\mathbf{p}}$ using their own domain knowledge. By updating the predicted probability of concept $c_{i}$ (i.e., $\hat{p}_{i}$ ) so that it matches the ground truth concept label (i.e., setting $\hat{p}_{i}:=c_{i}$ ), these models can update their predicted bottleneck $\hat{\mathbf{c}}$ and propagate that change into their label predictor $g(\hat{\mathbf{x}})$ , leading to a potential change in the model’s output prediction.

Formally speaking, we can formulate intervening on a CEM as follows: assume we are given a mask $\mathbf{\mu}\in\{0,1\}^{k}$ of concepts we wish to intervene on, with $\mu_{i}=1$ if we will intervene on concept $c_{i}$ and $\mu_{i}=0$ otherwise. Furthermore, assume that we are given the ground truth values $\tilde{\mathbf{c}}$ of all the concepts we will intervene on. For notational simplicity, we can extend $\tilde{\mathbf{c}}$ to be in $\{0,1\}^{k}$ so that $\tilde{c}_{i}=c_{i}$ if $\mu_{i}=1$ and $\tilde{c}_{i}=0.5$ otherwise. In other words, this vector contains the ground truth concept labels for all concepts we are intervening on, and whose values we know from the expert, while it assigns an arbitrary value to all other concepts (we pick $0.5$ to express full uncertainty but, as it will become clear next, the value used for such extension is of no importance). We define an intervention as a process in which, for all concepts $c_{i}$ , the activation(s) in the CEM’s bottleneck $\hat{\mathbf{c}}$ corresponding to concept $c_{i}$ gets updated as follows:

\hat{\mathbf{c}}_{i}:=(\mu_{i}\tilde{c}_{i}+(\mu_{i}-1)\hat{p}_{i})\hat{\mathbf{c}}_{i}^{+}+\big{(}1-(\mu_{i}\tilde{c}_{i}+(\mu_{i}-1)\hat{p}_{i})\big{)}\hat{\mathbf{c}}_{i}^{-}

Where $\hat{\mathbf{c}}_{i}^{+}$ and $\hat{\mathbf{c}}_{i}^{-}$ are the positive and negative predicted concept embeddings for concept $c_{i}$ . Notice that this exact formulation can be generalised to be applicable to CBMs by letting $\mathbf{c}_{i}^{+}=[1]$ and $\mathbf{c}_{i}^{-}=[0]$ for all $1\leq i\leq k$ .²²2This formulation can also be extended to CBMs with logit bottlenecks by letting $\hat{\mathbf{c}}_{i}^{+}$ be the 95-th percentile value of the empirical training distribution of $\hat{c}_{i}$ and, similarly, letting $\hat{\mathbf{c}}_{i}^{-}$ be the 5-th percentile value of the same distribution. Intuitively, this update forces the mixing coefficient between the positive and negative concept embeddings to be the ground truth concept value $c_{i}$ when we intervene on concept $c_{i}$ (i.e., $\mu_{i}=1$ ) while maintaining it as $\hat{p}_{i}$ if we are not intervening on concept $c_{i}$ . Because the predicted positive and negative concept embeddings, as well as the predicted concept probabilities, are all functions of $\mathbf{x}$ , for simplicity we use $\tilde{g}(\mathbf{x},\mathbf{\mu},\tilde{\mathbf{c}})$ to represent the updated bottleneck of CBM $(g,f)$ when intervening with mask $\mathbf{\mu}$ and concept values $\tilde{\mathbf{c}}$ .

Intervention Policies

Recent work exploring applications of concept interventions has shown that, even if we intervene on a fixed number of concepts, the performance improvements we observe at test time vary a lot depending on which concepts we select to intervene on in the first place. Intuitively, this seems to align with the idea that some concepts may be highly informative of the output task yet hard to correctly predict, hence intervening on such concepts first may bring more mass to the probability of the correct class with fewer interventions. This approach was followed by cbm when they selected an intervention order based on the validation concept accuracy of the CBM they were intervening on. Nevertheless, such a static approach, in the sense that the intervention order is independent of the sample being intervened on, has two critical limitations. First, it assumes that the CBM is equally as bad at predicting a given concept. Yet, it may be the case that a concept that is hard to predict in the validation because, on average, the concept tends to be occluded in samples in that dataset. However, when it is visible, it is easy to detect by the CBM, therefore rendering it useless to intervene on this concept when the concept is accurately identified by the concept encoder. Second, such approaches assume that all concepts are equally easy to acquire from an expert, yet in practice, some concepts may be too costly to merit acquisition if they are not expected to bring large improvements to the model performance.

In order to address these limitations, several concurrent works [closer_look_at_interventions, uncertain_concepts_via_interventions, coop] have taken inspiration from the related field of Active Feature Acquisition (AFA) [shim2018joint, li2021active, strauss2022posterior], to develop dynamic concept acquisition policies. These policies take as an input a mask of previously intervened concepts $\mathbf{\mu}$ and a set of predicted concept probabilities $\hat{\mathbf{p}}$ and determine which concept should be requested next if we are to acquire one more ground-truth label from an expert. The crux behind the state-of-the-art in this set of approaches, in particular of both Cooperative Prediction (CooP) [coop] and Expected Change in Target Prediction (ECTP) [closer_look_at_interventions], is to select a concept that, when the CBM is intervened on, results in the greatest change of the currently predicted probability in expectation (with the expectation of the true value of $c_{i}$ taken over the distribution given by $p(c_{i}=1)=\hat{p}_{i}$ ). These policies, however, can be costly to compute as the number of concepts increases and have the shared limitation that by maximising the expected change in probability in the predicted class, they do not guarantee that the probability mass shifted towards the correct output class.

3 Intervention-Aware Concept Embedding Models

In the evaluation of test-time interventions for CEMs, cem observed that these models achieved only moderate intervention gains unless one randomly intervenes on some of CEMs concepts at training time. This process, which the authors named RandInt, proceeds by sampling an intervention mask $\mathbf{\mu}\sim\text{Bernoulli}(p_{\text{int}})$ at each training step and generating as an output prediction $\hat{\mathbf{y}}=f(\tilde{g}(\mathbf{x},\mathbf{\mu},\mathbf{c}))$ for current training sample $(\mathbf{x},\mathbf{c},y)$ . Intuitively, one can see that by performing random interventions at training time, the model may be conditioned to be receptive to such random interventions at test time. Furthermore, RandInt has the equally important effect of acting as a contrastive regulariser for the learnt concept embeddings: by forcing the gradient from the task loss to only update the weights generating the “correct” concept embedding for the intervened concepts, the model learns to separate the distribution of positive embeddings $\hat{\mathbf{c}}^{+}_{i}$ for concept $c_{i}$ from its corresponding distribution of negative embeddings $\hat{\mathbf{c}}^{-}_{i}$ .

Although beneficial in principle, RandInt has two crucial limitations. First, it assumes that a model should be equally penalised if it is unable to correctly predict $y$ after $k$ interventions than if it is unable to predict $y$ after $k^{\prime}<<k$ interventions. This generates no incentive for the model to perform better when given more expert feedback in the form of more interventions. Second, even if RandInt conditions the CEM to be more receptive to interventions at test time, it assumes that for a given sample, all concepts are equally likely, and more importantly equally as good, to be intervened on. However, as seen with dynamic policies such as CooP, selecting concepts to intervene as a function of the current sample can lead to drastically better performance boosts than selecting concepts to intervene on at random. This generates a shift between how this model may be intervened on in practice, with concepts selected dynamically as a function of the sample rather than randomly, and how it is intervened at training time.

In this section, we address these two limitations by proposing Intervention-Aware Concept Embedding Models (IntCEMs), a new CEM-based architecture and training framework. The novelty of our architecture lies both in its learnable concept intervention policy $\psi(\hat{\mathbf{c}},\mathbf{\mu})$ and in its objective function that penalises task prediction errors proportionally to how many concepts were intervened for a given prediction. By learning an intervention policy $\psi$ in conjunction to a concept encoder $g$ and a label predictor $f$ , IntCEM can simulate dynamic intervention steps at training time while concurrently learning a policy which enables the model to “ask” for help on specific concepts at test time. A visualisation of our architecture’s training procedure can be seen in Figure 2.

3.1 Architecture Description

Formally, we define an IntCEM as a tuple of parametric functions $(g,f,\psi)$ where: (i) $g$ , the concept encoder, is a CEM-like concept encoder mapping input features $\mathbf{x}$ to a bottleneck $\hat{\mathbf{c}}$ , (ii) $f:\mathbb{R}^{km}\rightarrow[0,1]^{L}$ is a label predictor function mapping the bottleneck $\hat{\mathbf{c}}$ to a probability distribution over $L$ output classes, and (iii) $\psi:\mathbb{R}^{km}\times\{0,1\}^{k}\rightarrow[0,1]^{k}$ , the concept intervention policy, is a function mapping $\hat{\mathbf{c}}$ and a mask of previously intervened concepts $\mathbf{\mu}$ to a probability distribution over which concepts one should intervene next to maximise the correct task label’s probability. As done in traditional CEMs, $g$ will compute the model’s bottleneck by first predicting a set of positive and negative concept embeddings $\{(\hat{\mathbf{c}}^{+}_{i},\hat{\mathbf{c}}^{-}_{i})\}_{i=1}^{k}$ and then generating a set of predicted concept probabilities $\hat{\mathbf{p}}$ from these embeddings from which one can construct a bottleneck $\hat{\mathbf{c}}:=[\hat{p}_{1}\hat{\mathbf{c}}^{+}_{1}+(1-\hat{p}_{1})\hat{\mathbf{c}}^{-}_{1},\cdots,\hat{p}_{k}\hat{\mathbf{c}}^{+}_{k}+(1-\hat{p}_{k})\hat{\mathbf{c}}^{-}_{k}]$ . For notational simplicity, we use $\zeta(\mathbf{x}):=\big{(}\{(\hat{\mathbf{c}}^{+}_{i},\hat{\mathbf{c}}^{-}_{i})\}_{i=1}^{k},\hat{\mathbf{p}}\big{)}$ to represent the CEM-like backbone that learns to generate a positive and negative embedding for each concept as well as a probability of activation for the same concept.

We highlight that at inference time, our model’s predictive process behaves exactly as that in traditional CEMs, with the only exception being that IntCEM also outputs a probability distribution $\psi(\hat{\mathbf{c}},\mathbf{\mu})$ with a high mass on concepts whose ground truth values, if provided by an expert intervention, could lead to significant performance improvements. Because of this, IntCEM can be intervened on using the same mechanism as in CEM where one updates the mixing coefficients between embeddings using the provided concept’s ground truth values. As we elaborate in the following section, it is during training that IntCEMs significantly differ from CEMs.

3.2 Training Procedure

During training, we condition IntCEMs to be receptive to test-time interventions by predisposing our model to train-time interventions rolled from IntCEM’s own concept selection policy. We proceed to do this by first sampling an initial intervention mask $\mathbf{\mu}$ from some prior distribution $p(\mathbf{\mu})$ at the beginning of each training step and then extending this mask by selecting the next concept we intervene on from the policy $\psi$ . The crux behind making these training-time rollouts of $\psi$ work is to adjust our loss function so that it incentivises greater test-time accuracy improvements when the number of concepts selected by $\mu$ is larger while encouraging $\psi$ to select concepts that result in the highest increase in the ground truth label’s probability when adding one more intervention to $\mathbf{\mu}$ . We achieve this by minimising an objective function $\mathcal{L}$ which can be thought of as a linear combination of three independent loss terms:

\mathcal{L}(\mathbf{x},\mathbf{c},y,\mathbf{\mu}):=\lambda_{\text{roll}}\mathcal{L}_{\text{roll}}(\mathbf{x},\mathbf{c},y,\mathbf{\mu})+\mathcal{L}_{\text{pred}}(\mathbf{x},\mathbf{c},y,\mathbf{\mu})+\lambda_{\text{concept}}\mathcal{L}_{\text{concept}}(\mathbf{c},\hat{\mathbf{p}})

Here $\lambda_{\text{roll}}$ and $\lambda_{\text{concept}}$ are hyperparameters that control how much one values learning an accurate concept selection policy and how much one values accurately predicting concept labels, respectively, compared to how much one values accurately predicting downstream task labels. Below, we motivate and define each of the terms in this objective function.

Rollout Mask Loss ( $\mathcal{L}_{\text{roll}}$ )

The purpose of the first term in our objective function, namely $\mathcal{L}_{\text{roll}}(\mathbf{x},\mathbf{c},y,\mathbf{\mu})$ , is to provide feedback to the concept selection policy $\psi$ regarding good choices for concepts to intervene next. Inspired by the “Skyline” optimal greedy policy proposed by coop, we design this loss term by first noticing that, given an intervention mask $\mu\in\{0,1\}^{k}$ with at least one unintervened concept in it, if we have an oracle that enables us to obtain any sample’s ground truth concepts $\mathbf{c}$ and task labels $y$ , then we could tractably compute the optimal next concept $c_{*}$ to intervene on given by calculating

c_{*}=\operatorname*{arg\,max}_{\begin{subarray}{c}1\leq i\leq k\\ \mu_{i}\neq 1\end{subarray}}f\big{(}\tilde{g}(\mathbf{x},\mathbf{\mu},\mathbf{c})\big{)}_{y}

where $f(\cdot)_{y}$ represents the probability of class $y$ predicted by the label predictor $f$ . Intuitively, $c_{*}$ corresponds to the concept whose ground-truth label and intervention would lead to the maximum probability in the ground-truth class $y$ after a single intervention. Notice that at test-time, it is impossible for one to determine $c_{*}$ as we do not know $y$ and the ground truth value of $c_{*}$ (this is what we are trying to request from an expert). Nevertheless, we can exploit the fact that these labels are available during training by providing meaningful feedback to $\psi$ through the following loss function:

\mathcal{L}_{\text{roll}}(\mathbf{x},\mathbf{c},y,\mathbf{\mu})=\text{CE}(\mathbf{\eta}^{*}(\mathbf{x},\mathbf{\mu},\mathbf{c},y),\psi(\tilde{g}(\mathbf{x},\mathbf{\mu},\mathbf{c}),\mathbf{\mu}))

In this term, we use $\text{CE}(\mathbf{p},\hat{\mathbf{p}})$ to mean the cross-entropy loss between ground truth distribution $\mathbf{p}$ and predicted distribution $\hat{\mathbf{p}}$ and use as a target distribution of the CE, i.e. $\mathbf{\eta}^{*}(\mathbf{x},\mathbf{\mu},\mathbf{c},y)$ , a distribution that captures how much the probability of the ground-truth output class changes as one intervenes on different concepts. If we let $\mathbf{\mu}\vee\mathds{1}_{i}$ represent the action of adding an intervention on concept $i$ to the mask of intervened concepts $\mathbf{\mu}$ , we can compute a meaningful target distribution by assigning $f\big{(}\tilde{g}(\mathbf{x},\mathbf{\mu}\vee\mathds{1}_{i},\mathbf{c})$ as a score to each concept $c_{i}$ and normalising the resulting vector of scores:

\mathbf{\eta}^{*}(\mathbf{x},\mathbf{\mu},\mathbf{c},y):=\text{Softmax}\Big{(}\Big{[}f\big{(}\tilde{g}(\mathbf{x},\mathbf{\mu}\vee\mathds{1}_{1},\mathbf{c})\big{)}_{y},\cdots,f\big{(}\tilde{g}(\mathbf{x},\mathbf{\mu}\vee\mathds{1}_{k},\mathbf{c})\big{)}_{y}\Big{]}\Big{)}

Intuitively, this can be thought of as a form of training-time behavioural cloning [behavioural_cloning, ross2011reduction], where $\psi$ is being trained to learn to predict $c_{*}$ given $\mathbf{\mu}$ and the current bottleneck $\tilde{g}(\mathbf{x},\mathbf{\mu},\mathbf{c})$ .

Task Predictive Loss ( $\mathcal{L}_{\text{task}}$ )

As in traditional CEMs, we include a loss term that penalises our model for not being able to predict the ground-truth label $y$ from $\hat{\mathbf{c}}$ . In contrast to how traditional CEMs or CBMs use such a loss, however, our task loss term $\mathcal{L}_{\text{task}}$ has two key components. The first component includes a cross-entropy term $\text{CE}\big{(}y,f(\tilde{g}(\mathbf{x},\mathbf{\mu},\mathbf{c}))$ that penalises our model for mispredicting $y$ after interventions $\mathbf{\mu}$ . This term, however, is scaled by a factor of $\gamma^{\sum_{i=1}^{k}\mu_{i}}$ (where $\gamma\geq 1$ is a user-selected hyperparameter close to $1$ ) whose purpose is to penalise a task misprediction proportionally to the number of intervened concepts. The second component of our loss works by sampling a next concept intervention $\mathbf{\eta}\in\{0,1\}^{k}$ from the probability distribution learnt by $\psi(\tilde{g}(\mathbf{x},\mathbf{\mu},\mathbf{c}),\mathbf{\mu})$ and compute the task cross-entropy error $\text{CE}\big{(}y,f(\tilde{g}(\mathbf{x},\mathbf{\mu},\mathbf{c}))$ following this extra interventions. As done for when we intervene only on concepts $\mathbf{\mu}$ , we scale this loss term by $\gamma^{1+\sum_{i=1}^{k}\mu_{i}}$ to take into account how many interventions we have performed up to that point. When we put everything together we obtain the following loss term:

\mathcal{L}_{\text{roll}}(\mathbf{x},\mathbf{c},y,\mathbf{\mu})=\text{CE}(\mathbf{\eta}^{*}(\mathbf{x},\mathbf{\mu},\mathbf{c},y),\psi(\tilde{g}(\mathbf{x},\mathbf{\mu},\mathbf{c}),\mathbf{\mu}))

Notice that in order for us to update the parameters in $\psi$ using feedback from the output class, we must be able to propagate gradients to the sampling operation performed to obtain $\mathbf{\eta}$ . In this work, we achieve this by using a differentiable Gumbel-Softmax [gumbel_softmax] categorical sampler which enables us to sample from a categorical distribution whose parameters are being generated by $\psi(\cdot)$ in a differentiable manner. Finally, for the sake of simplicity and tractability, unless specified otherwise, in practice we estimate the inner expectation with a single Monte Carlo sample from $\psi(\cdot)$ .

Concept Predictive Loss ( $\mathcal{L}_{\text{concept}}$ )

The last term in our loss incentivises accurate concept explanations for IntCEM’s predictions using the cross entropy loss averaged across all training concepts:

\mathcal{L}_{\text{concept}}(\mathbf{c},\hat{\mathbf{p}}):=\frac{1}{k}\sum_{i=1}^{k}\text{CE}(c_{i},\hat{p}_{i})=\frac{1}{k}\sum_{i=1}^{k}\big{(}-c_{i}\log{\hat{p}_{i}}-(1-c_{i})\log{(1-\hat{p}_{i})}\big{)}

Putting everything together

Using our defined objective function, we learn IntCEM’s parameters $\theta$ by minimising, via gradient descent, the expected value of $\mathcal{L}$ in the training set $\mathcal{D}$ while we sample interventions masks $\mathbf{\mu}$ from a prior distribution $p(\mathbf{\mu})$ :

\theta^{*}=\operatorname*{arg\,min}_{\theta}\mathbb{E}_{(\mathbf{x},\mathbf{c},y)\sim\mathcal{D}}\Big{[}\mathbb{E}_{\mathbf{\mu}\sim p(\mathbf{\mu})}\big{[}\mathcal{L}(\mathbf{x},\mathbf{c},y,\mathbf{\mu})\big{]}\Big{]}

(1)

Notice that the outer expectation can be estimated through Monte Carlo samples coming from our training set while the inner expectation can be estimated using $S_{\text{masks}}$ Monte Carlo samples of $p(\mathbf{\mu})$ . It is important to notice that the concept mask’s prior distribution $p(\mathbf{\mu})$ may incorporate domain-specific knowledge by considering possible correlations between concepts. For example, if one can partition the set of $k$ training concepts into $k^{\prime}$ subsets of mutually exclusive concept groups (as is the case for popular datasets used in concept learning such as CUB [cub]), then a useful prior may be constructed by first sampling a mask of concept groups $\mathbf{\mu}^{\prime}$ from $\text{Bernoulli}(p_{\text{int}})$ , for some user-selected hyperparameter $p_{\text{int}}\in[0,1]$ , and then generating a concept-level mask $\mathbf{\mu}$ by setting $\mu_{i}$ be $1$ if its parent concept group was selected by $\mathbf{\mu}^{\prime}$ . For the sake of simplicity, we follow such a prior in this work and leave exploring more interesting and domain-specific priors as future work. Similarly, for the sake of tractability and simplicity, unless specified otherwise we estimate the inner expectation in Eq. (1) using a single Monte Carlo sample of $p(\mathbf{\mu})$ (i.e., $S_{\text{masks}}$ = 1). TODO: We explore how our results change when we sample multiple masks per training sample in Appendix A.6.

4 Experiments

In this section, we evaluate our proposed method by exploring the following research questions:

•

Effects of Policies (Q1A): How does IntCEM’s performance change as we intervene using different concept selection policies?
•

Intervention Performance (Q1B): Are IntCEMs more receptive to test-time interventions than state-of-the-art CBM variants?
•

Unintervened Predictive Performance and Concept-Interpretability (Q2): In the absence of interventions, how does IntCEM’s task and concept predictive performance compare to other baselines?
•

Learnt Policy Interpretability (Q3): Can we learn useful insights about our model or task by inspecting IntCEM’s learnt policy?

We begin to explore these questions by first investigating how IntCEM’s test-time performance changes as we intervene on an increasing number of concepts following its learnt policy as well as standard concept selection policies proposed in previous works (Section 4.2). Then, we evaluate IntCEMs, as well as competing CBM-like baselines, in a variety of tasks when no interventions are provided to determine the effect that its modified loss function has on its task and concept performance (Section 4.3). Finally, we qualitatively dissect IntCEM’s learnt policy in a well-known concept annotated dataset and explore some of the insights one may obtain from looking at its learnt distribution ((Section 4.4).

4.1 Experiment Setup

Datasets and Tasks

We study how IntCEM performs as we intervene on more concepts by deploying our method on five vision tasks defined on three different datasets. First, inspired by the UMNIST dataset proposed by collins2023human, we define two simple tasks based on the MNIST dataset [mnist]. In our first task, which we call MNIST-Add, one is given a total of $12$ MNIST digits per sample (where each operand is confined to be within a fixed range of possible values) and needs to determine whether the sum of these digits is at least half of the maximum attainable sum. This binary task is provided as concept annotations the one-hot-encodings of each operand’s real value. The MNIST-based second task, which we call MNIST-Add-Incomp, is exactly the same as MNIST-Add with the exception that we provide as concept annotations the value of only 8/12 operands (the operands whose annotations are provided are selected at random before training commences). We explore these two variants of the same task to evaluate how IntCEM behaves when the set of concept annotations is both a complete and an incomplete description of the output task. Following this, we evaluate IntCEM on two tasks based on the CUB [cub] dataset. This dataset contains images of birds in the wild together with a downstream label representing the bird’s type and a set of binary annotations describing some of the attributes of a bird. The first task, which we simply call CUB, processes the CUB dataset as suggested by cbm and selects 112 of the bird’s attributes (grouped into 28 sets of mutually exclusive binary attributes) as concept annotations while using the bird type as the downstream task. The second CUB-based task, which we call CUB-Incomp, is constructed in the same line as MINST-Add-Incomp by providing only $25\%$ of all concept groups during training. Finally, we evaluate our method in a task defined on the Celebrity Attributes Dataset [celeba] (CelebA) using the same concept and task labels as cem. This results in a task with 256 classes and 6 binary concept annotations. For specific details of these datasets, including their size and properties, refer to Appendix A.1.

Baselines

For the sake of obtaining a fair and complete evaluation of IntCEM, we compare its performance against that of a traditional CEM with the exact same architecture. Similarly, given that state-of-the-art intervention policies were designed with CBMs in mind rather than CEMs, we include in our evaluation traditional CBMs that are trained jointly (Joint CBM), sequentially (Sequential CBM) and independently (Independent CBM). Furthermore, given the strong empirical evidence that the activation function used in a joint CBM can have significant consequences on how interventions affect the CBM’s performance [cbm, metric_paper], we also include as part of our baselines both a joint CBM with a Sigmoidal bottleneck (Joint CBM-Sigmoid) and a CBM whose bottleneck correspond to the log probabilities of the predicted concepts (Joint CBM-Logit). Throughout all of our experiments, we attempt to get as close as a fair comparison across baselines by using the same concept encoder and label predictor architectures as in IntCEM for the concept encoders and label predictors of all other baselines. Across all tasks, unless specified otherwise, we tune IntCEM’s $\lambda_{\text{roll}}$ hyperparameter by selecting the model with the best validation loss after we vary $\lambda_{\text{roll}}$ over $\{5,1,0.1\}$ . As for $\lambda_{concept}$ , we fix this regulariser’s strength to the same value (selected from $\{10,5,1\}$ using the validation error of a traditional CEM) across all methods for a given task. The used embedding size $m$ and the initial intervention mask probability $p_{\text{int}}$ are set to $m=16$ and $p_{\text{int}}=0.25$ following the suggestions by cem. For an ablation study on how these hyperparameters affect our observed results, refer to Appendix A.4. Similarly, for a complete set of details on the architectures and hyperparameters used to train each baseline, refer to Appendix A.2.

4.2 Intervention Performance (Q1)

In this subsection, we explore how concept interventions at test-time affect an IntCEM’s task performance in a variety of tasks while exploring a variety of concept selection policies. Specifically, we investigate two hypotheses: First, we start investigating research question Q1A and hypothesise that, if IntCEM’s policy is capturing meaningful information regarding the model and the concepts it operates on, then we should observe higher performance for IntCEM when we select concepts using $\psi$ than if we were to randomly select them. Then we explore research question Q1B and contrast IntCEM’s receptiveness to interventions compared to that of competing baselines. We hypothesise that, because IntCEM was exposed to interventions as part of its training process, and because it was penalised heavier for mispredicting task labels after performing more interventions, it should receive enough conditioning to outperform competing baselines even when we use a random policy intervention to select the order of interventions. To investigate both of these hypotheses, we evaluate IntCEMs against our baselines across all tasks as we intervene on an increasing number of concepts groups (i.e., as in [cbm] we intervene by setting groups of mutually exclusive concepts all at once). Our results discussed in this section highlight the importance of proper training-time conditioning and show that, by introducing a training objective that incorporates training interventions sampled from a carefully selected policy, the resulting model is more receptive to expert feedback at test time.

IntCEM learns a meaningful intervention policy, enabling significant performance boosts with a small number of interventions.

We explore an IntCEM receptiveness to interventions by computing its test performance while intervening following one of these policies: (1) IntCEM’s learnt policy $\psi$ , where we select the next intervention concept by choosing the unknown concept whose probability of selection, as given by $\psi(\cdot)$ , is the highest, (2) a Random intervention policy, where the next concept is selected, uniformly at random, from the set of unknown concepts, (3) the Uncertainty of Concept Prediction (UCP) [closer_look_at_interventions] policy, where the next concept is selected by choosing the concept $c_{i}$ whose predicted probability $\hat{p}_{i}$ has the highest uncertainty (measured by $1/|\hat{p}_{i}-0.5|$ ), (4) the Cooperative Policy (CooP) [coop], where we select the concept that maximises a linear combination of their predicted uncertainty (akin to UCP) and the expected change in the predicted label’s probability when intervening on that concept, and finally (5) Skyline, an oracle intervention policy which selects $c_{*}$ on every step, indicating an upper bound for how good a greedy policy may be for a given task. Furthermore, as IntCEM learns a policy on the fly that attempts to imitate the Skylne policy, we include a Behavioural Cloning [behavioural_cloning] policy which we train to predict $c_{*}$ from pairs of training samples $(\tilde{g}(\mathbf{x},\mathbf{\mu},\mathbf{c}),\mathbf{\mu})$ after IntCEM has been trained. For details on how each of these policy baselines is used, including how their hyperparameters are selected, see Appendix A.5.

Our intervention policy study, shown in Figure 3, underscores three crucial properties of IntCEM. First, it shows that IntCEM’s learnt policy $\psi$ is able to capture meaningful properties about the intervention process as it is able to outperform a random intervention policy across all tasks (by more than $\sim 10\%$ in some instances as in CUB. Second, we observe there is a strong incentive to learn $\psi$ in conjunction with $f$ and $g$ as its equivalent behavioural cloning policy, which is learnt after that IntCEM has been trained, performs significantly worse than our learnt policy, only being slightly better than the random policy in CUB (TODO: In the current CelebA plot the behavioural cloning policy is worse than the learnt policy but this will be fixed once I update these plots to include the latest results). Finally, we observe that (i) using CooP to intervene on top of IntCEM rather than $\psi$ can lead to even further performance boosts from interventions. This, and the significant difference between the optimal policy and RandInt’s learnt policy, suggests that future work could develop stronger inductive biases for $\psi$ to improve its predictive capabilities (TODO: The difference with CooP is something that may change very soon with the new results I am getting. However, for the sake of being conservative, I will frame it now as something worth exploring in the future). For further results showing similar trends in the rest of the tasks used for this paper, as well as showing how static policies fare against the dynamic intervention policies shown here, refer to TODO: Appendix A.5.

IntCEM significantly outperforms all other baselines when intervening on an increasing number of concepts.

Following our initial intervention evaluation of IntCEM using different intervention policies, we proceed to study how IntCEM, when intervened with its own learnt policy, fares against competing methods intervened with CooP. For the sake of being able to clearly study trends in each model’s receptiveness to interventions, we only show a comparison between IntCEM’s own policy and CooP in other baselines. We opt to present our results in this way as we empirically observe that CooP outperforms all other non-oracle policies for other methods (see Appendix A.5). Our results, seen in Figure 1(a), show that IntCEM’s task accuracy is significantly better than that of competing methods across all tasks. Specifically, we

4.3 Unintervened Task and Concept Performance (Q2)

Having seen that IntCEMs are able to significantly outperform competing baselines when one intervenes on them, in this section we explore how IntCEM’s objective function affects its test performance with respect to existing state-of-the-art models when no interventions are introduced. For this, we look at both the downstream task predictive test accuracy, or the area under the ROC for binary tasks like the MNIST-based tasks (AUC), and the mean concept test AUC across all tasks and baselines. To study how IntCEM reacts to changes in the weight $\lambda_{\text{roll}}$ assigned to its rolling policy’s loss, we evaluate several IntCEMs whose configurations differ only on the value of $\lambda_{\text{roll}}$ used during training. We summarise our findings in Table 1.

Table 1: Downstream accuracy (%) and mean concept AUC-ROC (%), together with their standard errors over 5 random initialisations for each method, across all tasks and baselines. For IntCEM, we show the results when varying

\lambda_{\text{roll}}

\{0.1,1,5\}

to show the observed trade-off between model performance with and without interventions. We bold the best results for each metric across each task, highlighting more than one result for a task if it is within one standard error from the highest baseline.

Task IntCEM( $\lambda_{\text{roll}}=5)$ IntCEM( $\lambda_{\text{roll}}=1)$ IntCEM( $\lambda_{\text{roll}}=0.1)$ CEM Joint CBM-Sigmoid Joint CBM-Logit Independent CBM Sequential CBM Task MNIST-ADD 89.05 ± 1.85 91.66 ± 0.41 91.25 ± 0.44 89.79 ± 0.30 68.57 ± 4.26 86.09 ± 0.28 72.88 ± TODO: 0.0 67.94 ± TODO: 0.0 MNIST-ADD-INCOMP 89.51 ± 0.25 89.40 ± 0.32 89.79 ± 0.36 87.11 ± 0.88 68.19 ± 2.18 84.15 ± 0.85 76.10 ± 2.41 74.89 ± 1.19 CUB 75.11 ± 0.40 77.79 ± 0.36 78.24 ± 0.22 79 ± 0.54 75.10 ± 0.59 78.24 ± 0.26 61.17 ± 3.41 54.85 ± 4.66 CUB-INCOMP 73.48 ± 0.79 74.41 ± 0.31 75.03 ± 0.61 75.52 ± 0.37 TODO: 15.17 ± 19.82 68.42 ± 7.91 44.82 ± 0.33 40.00 ± 1.59 CelebA 33.50 ± 0.71 31.46 ± 0.78 30.52 ± 1.15 30.88 ± 0.92 24.15 ± 0.43 23.87 ± 1.26 24.25 ± 0.26 24.53 ± 0.3 Concept MNIST-ADD 75.25 ± 5.37 84.15 ± 0.80 84.19 ± 0.11 85.67 ± 0.04 89.34 ± 0.15 89.34 ± 0.27 81.82 ± TODO: 0.0 81.82 ± TODO: 0.0 MNIST-ADD-INCOMP 86.40 ± 1.83 86.63 ± 1.16 87.14 ± 1.60 87.77 ± 1.0 90.14 ± 0.64 90.88 ± 0.13 87.58 ± 0.37 87.58 ± 0.37 CUB 88.99 ± 0.52 91.86 ± 0.16 93.17 ± 0.15 94.48 ± 0.05 93.77 ± 0.14 93.57 ± 0.2 90.06 ± 0.82 90.06 ± 0.82 CUB-INCOMP 90.17 ± 0.59 93.66 ± 0.44 94.54 ± 0.18 94.65 ± 0.10 TODO: 81.62 ± 7.32 92.87 ± 2.01 93.65 ± 0.25 93.65 ± 0.25 CelebA 71.05 ± 1.77 85.74 ± 0.34 85.37 ± 0.22 85.57 ± 0.16 81.14 ± 0.75 82.85 ± 0.44 82.85 ± 0.21 82.85 ± 0.21

IntCEM is competitive with respect to state-of-the-art models across several tasks.

Our test task accuracy evaluation, shown in the top half of Table 1, shows that throughout all but one of the tasks, IntCEMs are as accurate if not more accurate than all other baselines. It is only in the CUB task where we observe a small drop in performance (less than $\sim 0.5\%$ ) for the best-performing IntCEM variant with $\lambda_{\text{roll}}=0.1$ . Notice that from the five tasks we use during evaluation, three of them, namely MNIST-Add-Incomp, CUB-Incomp, and CelebA, do not have a complete set of concept annotations at training time. Therefore, our results suggest that just as in traditional CEMs, our model is able to maintain a high task performance even when the set of concept annotations is not fully descriptive of the downstream task, a highly desired property to have for deploying these models in real-world tasks.

IntCEM’s concept predictive performance is competitive with respect to traditional CEMs.

The second half of Table 1 shows that IntCEM’s mean concept AUC, particularly for low values of $\lambda_{\text{roll}}$ , is competitive against that of CEMs, falling behind it by at most $\sim 1.2\%$ at its worst. In the MNIST-based tasks, we observe that both IntCEMs and CEMs lag behind traditional joint CBMs in their concept of predictive AUC. Nevertheless, we note that the gains seen in concept accuracy for the joint CBMs over IntCEM do not merit their observed drop in task performance seen compared to IntCEMs in these two tasks. These results suggest that IntCEM’s concept explanations are as accurate as those provided by state-of-the-art CEMs, leading a set of predicted concepts by IntCEM to constitute a faithful explanation of its downstream prediction.

IntCEM rollout loss can be calibrated depending on its intended use.

One interesting effect we observe in both our task and concept accuracy results is that IntCEM is particularly sensitive to its $\lambda_{\text{roll}}$ parameter, as seen in particular for CUB. As we will explore in more detail in the next section, these observed differences seem to indicate there exists a tradeoff in IntCEM’s objective function between incentivising the model to be highly accurate without any interventions and incentivising the model to be highly accurate after a small number of interventions. This tradeoff can be calibrated depending on the intended application of the IntCEM: if one expects to deploy the model in a setting where interventions will be expected in the average case, then using a larger value of $\lambda_{\text{roll}}$ during training may condition the model to be very receptive to interventions, albeit its initial performance may be slightly lower than an equivalent CEM. On the other hand, if one expects to intervene on a trained IntCEM very infrequently, a smaller value of $\lambda_{\text{roll}}$ may lead to an IntCEM with state-of-the-art performance without interventions yet whose initial gain in performance once we intervene on it will not be as high as a model trained with a higher value of $\lambda_{\text{roll}}$ . While this hyperparameter adds complexity to IntCEM’s training process and hyperparameter section process, in practice we found that trying out values in $\lambda_{\text{roll}}\in\{0.1,1,5\}$ results in highly accurate CEMs.

4.4 Interpreting the Intervention Policy (Q3)

TODO: Write me

5 Discussion

Relation with imitation learning

TODO: Elaborate on how introducing interventions as part of our training objective is akin to learning from expert demonstrations in imitation learning.

Non-embedding Intervention-Aware Concept Bottleneck Models

TODO: Elaborate on why we could not apply the training objective to non-embedding CBMs. The short answer for this is that it is a gradient-blocking operation unless one enables trainable anchor values to be used for representing a concept’s true state.

Limitations and future work

TODO: We require fine-tuning of three loss hyperparameters ( $\lambda_{\text{concept}},\lambda_{\text{roll}}$ and $\gamma$ ). We do not consider different costs. We do not consider different competences between experts. We are still learning a greedy policy, which may ignore possible budgets. Future work can also consider handling human uncertainty in interventions [collins2023human].

[mention label-free [oikarinenlabel] and post-hoc cbm [yuksekgonul2022post]]

6 Conclusion

TODO: Write me

Appendix A Appendix

A.1 Dataset Details

TODO: Write me

A.2 Model Architecture and Baselines Details

TODO: Write me

A.3 Training Configurations

TODO: Write me

A.4 Exploring IntCEM’s Hyperparameter Sensitivity

TODO: Write me

A.5 More Results and Details on Intervention Policy Evaluation

TODO: Write me

A.6 Effect of Number of Rollouts

TODO: Write me

A.7 Software and Hardware Used

TODO: Write me

Learning to Receive Help: Intervention-Aware Concept Embedding Models

Abstract

1 Introduction

2 Background and Previous Work

Explainable Artificial Intelligence

Concept Learning and Concept Bottleneck Models

Concept Embedding Models

Concept Interventions

Intervention Policies

3 Intervention-Aware Concept Embedding Models

3.1 Architecture Description

3.2 Training Procedure

Rollout Mask Loss (ℒroll\mathcal{L}_{\text{roll}})

Task Predictive Loss (ℒtask\mathcal{L}_{\text{task}})

Concept Predictive Loss (ℒconcept\mathcal{L}_{\text{concept}})

Putting everything together

4 Experiments

4.1 Experiment Setup

Datasets and Tasks

Baselines

4.2 Intervention Performance (Q1)

IntCEM learns a meaningful intervention policy, enabling significant performance boosts with a small number of interventions.

IntCEM significantly outperforms all other baselines when intervening on an increasing number of concepts.

4.3 Unintervened Task and Concept Performance (Q2)

IntCEM is competitive with respect to state-of-the-art models across several tasks.

IntCEM’s concept predictive performance is competitive with respect to traditional CEMs.

IntCEM rollout loss can be calibrated depending on its intended use.

4.4 Interpreting the Intervention Policy (Q3)

5 Discussion

Relation with imitation learning

Non-embedding Intervention-Aware Concept Bottleneck Models

Limitations and future work

6 Conclusion

Appendix A Appendix

A.1 Dataset Details

A.2 Model Architecture and Baselines Details

A.3 Training Configurations

A.4 Exploring IntCEM’s Hyperparameter Sensitivity

A.5 More Results and Details on Intervention Policy Evaluation

A.6 Effect of Number of Rollouts

A.7 Software and Hardware Used

Learning to Receive Help:
Intervention-Aware Concept Embedding Models

Rollout Mask Loss ( $\mathcal{L}_{\text{roll}}$ )

Task Predictive Loss ( $\mathcal{L}_{\text{task}}$ )

Concept Predictive Loss ( $\mathcal{L}_{\text{concept}}$ )