
Heterogeneous Risk Minimization

Jiashuo Liu    Zheyuan Hu    Peng Cui    Bo Li    Zheyan Shen
Abstract

Machine learning algorithms with empirical risk minimization usually suffer from poor generalization performance due to the greedy exploitation of correlations among the training data, which are not stable under distributional shifts. Recently, some invariant learning methods for out-of-distribution (OOD) generalization have been proposed by leveraging multiple training environments to find invariant relationships. However, modern datasets are frequently assembled by merging data from multiple sources without explicit source labels. The resultant unobserved heterogeneity renders many invariant learning methods inapplicable. In this paper, we propose the Heterogeneous Risk Minimization (HRM) framework to achieve joint learning of the latent heterogeneity among the data and the invariant relationships, which leads to stable prediction under distributional shifts. We theoretically characterize the roles of the environment labels in invariant learning and justify our newly proposed HRM framework. Extensive experimental results validate the effectiveness of our HRM framework.


1 Introduction

The effectiveness of machine learning algorithms with empirical risk minimization (ERM) relies on the assumption that the testing and training data are independently drawn from the same distribution, known as the i.i.d. hypothesis. However, distributional shifts between testing and training data are usually inevitable due to data selection biases or unobserved confounders that widely exist in real data. Under such circumstances, machine learning algorithms with ERM usually suffer from poor generalization performance due to the greedy exploitation of correlations among the training data, which are not stable under distributional shifts. How to guarantee a machine learning algorithm out-of-distribution (OOD) generalization ability and stable performance under distributional shifts is of paramount significance, especially in high-stakes applications such as medical diagnosis, criminal justice, and financial analysis (Kukar, 2003; Berk et al., 2018; Rudin & Ustun, 2018).

There are mainly two branches of methods proposed to solve the OOD generalization problem, namely distributionally robust optimization (DRO) (Esfahani & Kuhn, 2018; Duchi & Namkoong, 2018; Sinha et al., 2018; Sagawa et al., 2019) and invariant learning (Arjovsky et al., 2019; Koyama & Yamaguchi, 2020; Chang et al., 2020). DRO methods aim to optimize the worst-case performance over a distribution set to ensure their OOD generalization performance. While DRO is a powerful family of methods, it is often criticized for its over-pessimism when the distribution set is large (Hu et al., 2018; Frogner et al., 2019). From another perspective, invariant learning methods propose to exploit the causally invariant correlations (rather than the varying spurious correlations) across multiple training environments, resulting in OOD-optimal predictors. However, the effectiveness of such methods relies heavily on the quality of the training environments, and the intrinsic role of environments in invariant learning remains theoretically vague. More importantly, modern big data are frequently assembled by merging data from multiple sources without explicit source labels. The resultant unobserved heterogeneity renders these invariant learning methods inapplicable.

In this paper, we propose Heterogeneous Risk Minimization (HRM), an optimization framework to achieve joint learning of the latent heterogeneity among the data and the invariant predictor, which leads to better generalization ability under distributional shifts. More specifically, we theoretically characterize the roles of the environment labels in invariant learning, which motivates us to design two modules in the framework corresponding to heterogeneity identification and invariant learning respectively. We provide theoretical justification for the mutual promotion of these two modules, which underpins the reciprocal joint optimization process. Extensive experiments on both synthetic and real-world datasets demonstrate the superiority of HRM in terms of average performance, stability, and worst-case performance under different settings of distributional shifts. We summarize our contributions as follows:

1. We propose the novel HRM framework for OOD generalization without environment labels, in which heterogeneity identification and invariant prediction are jointly optimized.

2. We theoretically characterize the role of environments in invariant learning from the perspective of heterogeneity, based on which we propose a novel clustering method for heterogeneity identification from heterogeneous data.

3. We theoretically establish the mutual promotion between heterogeneity identification and invariant learning, which justifies the joint optimization process in HRM.

2 Problem Formulation

2.1 OOD and Maximal Invariant Predictor

Following (Arjovsky et al., 2019; Chang et al., 2020), we consider a dataset $D=\{D^{e}\}_{e\in\mathrm{supp}(\mathcal{E}_{tr})}$, which is a mixture of data $D^{e}=\{(x_{i}^{e},y_{i}^{e})\}_{i=1}^{n_{e}}$ collected from multiple training environments $e\in\mathrm{supp}(\mathcal{E}_{tr})$, where $x_{i}^{e}\in\mathcal{X}$ and $y_{i}^{e}\in\mathcal{Y}$ are the $i$-th data point and label from environment $e$ respectively, and $n_{e}$ is the number of samples in environment $e$. Environment labels are unavailable, as in most real applications. $\mathcal{E}_{tr}$ is a random variable on the indices of training environments, and $P^{e}$ is the distribution of data and label in environment $e$.

The goal of this work is to find a predictor $f(\cdot):\mathcal{X}\rightarrow\mathcal{Y}$ with good out-of-distribution generalization performance, which can be formalized as:

$\arg\min_{f}\max_{e\in\mathrm{supp}(\mathcal{E})}\mathcal{L}(f|e)$ (1)

where $\mathcal{L}(f|e)=\mathbb{E}[l(f(X),Y)|e]=\mathbb{E}^{e}[l(f(X^{e}),Y^{e})]$ is the risk of predictor $f$ on environment $e$, and $l(\cdot,\cdot):\mathcal{Y}\times\mathcal{Y}\rightarrow\mathbb{R}^{+}$ is the loss function. $\mathcal{E}$ is the random variable on the indices of all possible environments such that $\mathrm{supp}(\mathcal{E})\supset\mathrm{supp}(\mathcal{E}_{tr})$. Usually, for $e\in\mathrm{supp}(\mathcal{E})\setminus\mathrm{supp}(\mathcal{E}_{tr})$, the data and label distribution $P^{e}(X,Y)$ can be quite different from that of the training environments $\mathcal{E}_{tr}$. Therefore, the problem in Equation 1 is referred to as the Out-of-Distribution (OOD) Generalization problem (Arjovsky et al., 2019).
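Concretely, the inner maximization in Equation 1 is simply the worst per-environment risk. The following minimal Python sketch evaluates it empirically, assuming per-environment datasets are available (all names here are illustrative assumptions, not from the paper's code):

```python
import numpy as np

def ood_risk(predict, loss, envs):
    """Worst-case risk max_e L(f|e) in Eq. (1), estimated over the
    available environments; `envs` is a list of (X_e, y_e) arrays."""
    return max(np.mean(loss(predict(X_e), y_e)) for X_e, y_e in envs)
```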

Without any prior knowledge or structural assumptions, it is impossible to solve the OOD generalization problem, since one cannot characterize the unseen latent environments in $\mathrm{supp}(\mathcal{E})$. A commonly used assumption in the invariant learning literature (Rojas-Carulla et al., 2015; Gong et al., 2016; Arjovsky et al., 2019; Kuang et al., 2020; Chang et al., 2020) is as follows:

Assumption 2.1.

There exists a random variable $\Phi^{*}(X)$ such that the following properties hold:

a. Invariance property: for all $e,e^{\prime}\in\mathrm{supp}(\mathcal{E})$, $P^{e}(Y|\Phi^{*}(X))=P^{e^{\prime}}(Y|\Phi^{*}(X))$ holds.

b. Sufficiency property: $Y=f(\Phi^{*})+\epsilon,\ \epsilon\perp X$.

This assumption indicates the invariance and sufficiency of $\Phi^{*}$ for predicting the target $Y$; such a $\Phi^{*}$ is known as the invariant covariates or representation, whose relationship with $Y$ is stable across environments $e\in\mathcal{E}$. In order to acquire the invariant predictor $\Phi^{*}(X)$, a branch of work on the maximal invariant predictor (Chang et al., 2020; Koyama & Yamaguchi, 2020) has been proposed, where the invariance set and the corresponding maximal invariant predictor are defined as:

Definition 2.1.

The invariance set $\mathcal{I}$ with respect to $\mathcal{E}$ is defined as:

$\mathcal{I}_{\mathcal{E}}=\{\Phi(X):Y\perp\mathcal{E}\,|\,\Phi(X)\}=\{\Phi(X):H[Y|\Phi(X)]=H[Y|\Phi(X),\mathcal{E}]\}$ (2)

where $H[\cdot]$ is the Shannon entropy of a random variable. The corresponding maximal invariant predictor (MIP) of $\mathcal{I}_{\mathcal{E}}$ is defined as:

$S=\arg\max_{\Phi\in\mathcal{I}_{\mathcal{E}}}I(Y;\Phi)$ (3)

where $I(\cdot;\cdot)$ measures the Shannon mutual information between two random variables.

Here we prove that the MIP $S$ can guarantee OOD optimality, as indicated in Theorem 2.1. The formal statement of Theorem 2.1 as well as its proof can be found in the appendix.

Theorem 2.1.

(Informal) For a predictor $\Phi^{*}(X)$ satisfying Assumption 2.1, $\Phi^{*}$ is the maximal invariant predictor with respect to $\mathcal{E}$, and the solution to the OOD problem in Equation 1 is $\mathbb{E}_{Y}[Y|\Phi^{*}]=\arg\min_{f}\sup_{e\in\mathrm{supp}(\mathcal{E})}\mathbb{E}[\mathcal{L}(f)|e]$.

Recently, some works suppose the availability of data from multiple environments with environment labels, from which they can find the MIP (Chang et al., 2020; Koyama & Yamaguchi, 2020). However, they rely on the underlying assumption that the invariance set $\mathcal{I}_{\mathcal{E}_{tr}}$ of $\mathcal{E}_{tr}$ is exactly the invariance set $\mathcal{I}_{\mathcal{E}}$ of all possible unseen environments $\mathcal{E}$, which cannot be guaranteed, as shown in Theorem 2.2.

Theorem 2.2.

$\mathcal{I}_{\mathcal{E}}\subseteq\mathcal{I}_{\mathcal{E}_{tr}}$.

Since $\mathcal{I}_{\mathcal{E}}\subseteq\mathcal{I}_{\mathcal{E}_{tr}}$ by Theorem 2.2, a predictor learned from $\mathcal{E}_{tr}$ is only invariant with respect to these limited environments and is not guaranteed to be invariant with respect to all possible environments $\mathcal{E}$.

Here we give a toy example in Table 1 to illustrate this. We consider a binary classification between cats and dogs, where each photo contains three features: an animal feature $X_{1}\in\{\text{cat},\text{dog}\}$, a background feature $X_{2}\in\{\text{on grass},\text{in water}\}$ and a photographer's signature feature $X_{3}\in\{\text{Irma},\text{Eric}\}$. Assume all possible testing environments are $\mathrm{supp}(\mathcal{E})=\{e_{1},e_{2},e_{3},e_{4},e_{5},e_{6}\}$ and the training environments are $\mathrm{supp}(\mathcal{E}_{tr})=\{e_{5},e_{6}\}$; then $\mathcal{I}_{\mathcal{E}}=\{\Phi\,|\,\Phi=\Phi(X_{1})\}$ while $\mathcal{I}_{\mathcal{E}_{tr}}=\{\Phi\,|\,\Phi=\Phi(X_{1},X_{2})\}$. The reason is that $e_{5},e_{6}$ only tell us that $X_{3}$ cannot be included in the invariance set, but they cannot exclude $X_{2}$. However, if $e_{5}$ and $e_{6}$ can be further divided into $e_{1},e_{2}$ and $e_{3},e_{4}$ respectively, the invariance set becomes $\mathcal{I}_{\mathcal{E}_{tr}}=\mathcal{I}_{\mathcal{E}}=\{\Phi(X_{1})\}$.

This example shows that manually labeled environments may not be sufficient to achieve the MIP, not to mention the cases where environment labels are not available at all. This limitation necessitates the study of how to exploit the latent intrinsic heterogeneity in training data (like $e_{5}$ and $e_{6}$ in the above example) to form more refined environments for OOD generalization. The environments need to be uncovered carefully: as indicated by Theorem 2.3, not all environments are helpful for tightening the invariance set.

Theorem 2.3.

Given a set of environments $\mathrm{supp}(\hat{\mathcal{E}})$, denote the corresponding invariance set $\mathcal{I}_{\hat{\mathcal{E}}}$ and the corresponding maximal invariant predictor $\hat{\Phi}$. For a newly-added environment $e_{new}$ with distribution $P^{new}(X,Y)$, if $P^{new}(Y|\hat{\Phi})=P^{e}(Y|\hat{\Phi})$ for all $e\in\mathrm{supp}(\hat{\mathcal{E}})$, then the invariance set constrained by $\mathrm{supp}(\hat{\mathcal{E}})\cup\{e_{new}\}$ is equal to $\mathcal{I}_{\hat{\mathcal{E}}}$.

Index | Class 0 (Cats) | Class 1 (Dogs)
      | $X_{1}$  $X_{2}$  $X_{3}$ | $X_{1}$  $X_{2}$  $X_{3}$
$e_{1}$ | Cats  Water  Irma | Dogs  Grass  Eric
$e_{2}$ | Cats  Grass  Eric | Dogs  Water  Irma
$e_{3}$ | Cats  Water  Eric | Dogs  Grass  Irma
$e_{4}$ | Cats  Grass  Irma | Dogs  Water  Eric
$e_{5}$ | Mixture: 90% data from $e_{1}$ and 10% data from $e_{2}$
$e_{6}$ | Mixture: 90% data from $e_{3}$ and 10% data from $e_{4}$
Table 1: A toy example of the difference between $\mathcal{I}_{\mathcal{E}}$ and $\mathcal{I}_{\mathcal{E}_{tr}}$.

2.2 Problem of Heterogeneous Risk Minimization

Besides Assumption 2.1, we make another assumption on the existence of heterogeneity in the training data:

Assumption 2.2.

Heterogeneity Assumption.
For the random variable pair $(X,\Phi^{*})$ with $\Phi^{*}$ satisfying Assumption 2.1, by the functional representation lemma (El Gamal & Kim, 2011) there exists a random variable $\Psi^{*}$ such that $X=X(\Phi^{*},\Psi^{*})$. We assume that $P^{e}(Y|\Psi^{*})$ can change arbitrarily across environments $e\in\mathrm{supp}(\mathcal{E})$.

The heterogeneity among the provided environments can be evaluated by the compactness $|\mathcal{I}_{\mathcal{E}}|$ of the corresponding invariance set. Specifically, a smaller $|\mathcal{I}_{\mathcal{E}}|$ indicates higher heterogeneity, since more variant features can be excluded. Based on this assumption, we formulate the problem of heterogeneity exploitation for OOD generalization.

Problem 1.

Heterogeneous Risk Minimization.
Given a heterogeneous dataset $D=\{D^{e}\}_{e\in\mathrm{supp}(\mathcal{E}_{latent})}$ without environment labels, the task is to generate environments $\mathcal{E}_{tr}$ with minimal $|\mathcal{I}_{\mathcal{E}_{tr}}|$ and to learn an invariant model under the learned $\mathcal{E}_{tr}$ with good OOD performance.

Theorem 2.3 together with Assumption 2.2 indicates that, to better constrain $\mathcal{I}_{\mathcal{E}_{tr}}$, the effective way is to generate environments with varying $P(Y|\Psi^{*}(X))$, which can exclude variant features from $\mathcal{I}_{\mathcal{E}_{tr}}$. Under this problem setting, we encounter a circular dependency: first, we need the variant $\Psi^{*}$ to generate heterogeneous environments $\mathcal{E}_{tr}$; then, we need $\mathcal{E}_{tr}$ to learn the invariant $\Phi^{*}$ as well as the variant $\Psi^{*}$. Furthermore, there exists positive feedback between these two steps. When acquiring $\mathcal{E}_{tr}$ with a tighter $\mathcal{I}_{\mathcal{E}_{tr}}$, a more invariant predictor $\Phi(X)$ (i.e., a better approximation of the MIP) can be found, which in turn gives a clearer picture of the variant parts and therefore promotes the generation of $\mathcal{E}_{tr}$. With this notion, we propose our framework for Heterogeneous Risk Minimization (HRM), which leverages the mutual promotion between the two steps and conducts joint optimization.

3 Method

In this work, we focus on a simple but general setting at the raw feature level, where $X=[\Phi^{*},\Psi^{*}]^{T}\in\mathbb{R}^{d}$ and $\Phi^{*},\Psi^{*}$ satisfy Assumption 2.1. Under this setting, our Heterogeneous Risk Minimization (HRM) framework contains two interactive parts: the frontend $\mathcal{M}_{c}$ for heterogeneity identification and the backend $\mathcal{M}_{p}$ for invariant prediction. The general framework is shown in Figure 1.

Figure 1: The framework of HRM.

Given the pooled heterogeneous data, HRM starts with the heterogeneity identification module $\mathcal{M}_{c}$, which leverages the learned variant representation $\Psi(X)$ to generate heterogeneous environments $\mathcal{E}_{learn}$. The learned environments are then used by the OOD prediction module $\mathcal{M}_{p}$ to learn the MIP $\Phi(X)$ as well as the invariant prediction model $f(\Phi(X))$. After that, we derive the variant part $\Psi(X)$ to further boost the module $\mathcal{M}_{c}$, which is supported by Theorem 2.3. As for the 'convert' step, under our setting we adopt feature selection, through which a better variant part $\Psi$ can be attained as a better invariant part $\Phi$ is learned. Specifically, the invariant predictor is generated as $\Phi(X)=M\odot X$ and the variant part as $\Psi(X)=(1-M)\odot X$, where $M\in\{0,1\}^{d}$ is the binary invariant feature selection mask. For instance, for Table 1 with $X=[X_{1},X_{2},X_{3}]$, the ground-truth binary mask is $M=[1,0,0]$. In this way, the better $\Phi$ is learned, the better $\Psi$ can be obtained. Note that in our algorithm we use a soft selection with $M\in[0,1]^{d}$, which is more flexible and general.
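To make the 'convert' step concrete, the following minimal NumPy sketch shows how a (soft) mask splits the raw features into invariant and variant parts; the mask values are illustrative:

```python
import numpy as np

# Soft invariant-feature mask M in [0, 1]^d. For the toy example in
# Table 1 with X = [X1, X2, X3], the ground-truth hard mask is [1, 0, 0].
M = np.array([0.9, 0.1, 0.05])

X = np.random.randn(5, 3)   # a batch of 5 samples with d = 3 features
Phi = M * X                 # invariant part Phi(X) = M ⊙ X
Psi = (1.0 - M) * X         # variant part Psi(X) = (1 - M) ⊙ X
```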

The whole framework is jointly optimized, so that the mutual promotion between heterogeneity identification and invariant learning can be fully leveraged.

3.1 Implementation of $\mathcal{M}_{p}$

Here we introduce our invariant prediction module $\mathcal{M}_{p}$, which takes training data $D=\{D^{e}\}_{e\in\mathrm{supp}(\mathcal{E}_{tr})}$ from multiple environments as input and, given the current environments $\mathcal{E}_{tr}$, outputs the corresponding invariant predictor $f$ and the indices $M$ of the invariant features. We combine feature selection with invariant learning under heterogeneous environments, so as to select the features whose correlations with the label are stable/invariant across $\mathcal{E}_{tr}$. Specifically, the feature selection module selects the most informative features with respect to the loss function, and the invariant learning module ensures that the selected features are invariant. Their combination ensures that $\mathcal{M}_{p}$ selects the most informative invariant features.

For invariant learning, we follow the variance penalty regularizer proposed in (Koyama & Yamaguchi, 2020) and simplify it for the feature selection scenario. The objective function of $\mathcal{M}_{p}$ with $M\in\{0,1\}^{d}$ is:

$\mathcal{L}^{e}(M\odot X,Y;\theta)=\mathbb{E}_{P^{e}}[\ell(M\odot X^{e},Y^{e};\theta)]$ (4)
$\mathcal{L}_{p}(M\odot X,Y;\theta)=\mathbb{E}_{\mathcal{E}_{tr}}[\mathcal{L}^{e}]+\lambda\,\mathrm{trace}(\mathrm{Var}_{\mathcal{E}_{tr}}(\nabla_{\theta}\mathcal{L}^{e}))$ (5)

However, as the optimization of hard feature selection with a binary mask $M$ suffers from high variance, we use soft feature selection with gates taking continuous values in $[0,1]$. Specifically, following (Yamada et al., 2020), we approximate each element of $M=[m_{1},\dots,m_{d}]^{T}$ by a clipped Gaussian random variable parameterized by $\mu=[\mu_{1},\dots,\mu_{d}]^{T}$ as

$m_{i}=\max\{0,\min\{1,\mu_{i}+\epsilon\}\}$ (6)

where $\epsilon$ is drawn from $\mathcal{N}(0,\sigma^{2})$. With this approximation, the objective function with soft feature selection can be written as:

$\mathcal{L}^{e}(\theta,\mu)=\mathbb{E}_{P^{e}}\mathbb{E}_{M}\left[\ell(M\odot X^{e},Y^{e};\theta)+\alpha\|M\|_{0}\right]$ (7)

where $M$ is a random vector with $d$ independent components $m_{i}$, $i\in[d]$. Under the approximation in Equation 6, $\|M\|_{0}$ is simply $\sum_{i\in[d]}P(m_{i}>0)$ and can be calculated as $\|M\|_{0}=\sum_{i\in[d]}\mathrm{CDF}(\mu_{i}/\sigma)$, where $\mathrm{CDF}$ is the standard Gaussian CDF. We formulate our objective as a risk minimization problem:

$\min_{\theta,\mu}\mathcal{L}_{p}(\theta;\mu)=\mathbb{E}_{\mathcal{E}_{tr}}[\mathcal{L}^{e}(\theta,\mu)]+\lambda\,\mathrm{trace}(\mathrm{Var}_{\mathcal{E}_{tr}}(\nabla_{\theta}\mathcal{L}^{e}))$ (8)

where

$\mathcal{L}^{e}(\theta,\mu)=\mathbb{E}_{P^{e}}\mathbb{E}_{M}\left[\ell(M\odot X^{e},Y^{e};\theta)+\alpha\sum_{i\in[d]}\mathrm{CDF}(\mu_{i}/\sigma)\right]$ (9)

Further, for linear models, we simply approximate the regularizer $\mathrm{trace}(\mathrm{Var}_{\mathcal{E}_{tr}}(\nabla_{\theta}\mathcal{L}^{e}))$ by $\|\mathrm{Var}_{\mathcal{E}_{tr}}(\nabla_{\theta}\mathcal{L}^{e})\odot M\|^{2}$. We then obtain $\Phi(X)$ and $\Psi(X)$ once we obtain $\mu$ and hence $M$. In Section 4, we theoretically prove that the prediction module $\mathcal{M}_{p}$ is able to learn the MIP with respect to the given environments $\mathcal{E}_{tr}$.
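To summarize this subsection, here is a minimal PyTorch-style sketch of the $\mathcal{M}_{p}$ objective in Equations 6-9 for a linear model, assuming the environments have already been provided by $\mathcal{M}_{c}$; the hyperparameter values and tensor names are illustrative assumptions, not the paper's released implementation:

```python
import torch
from torch.distributions import Normal

sigma, alpha, lam, d = 0.5, 0.1, 1.0, 10          # illustrative hyperparameters
mu = torch.full((d,), 0.5, requires_grad=True)    # gate means, Eq. (6)
theta = torch.zeros(d, requires_grad=True)        # linear model parameters
std_normal = Normal(0.0, 1.0)

def env_loss(X_e, y_e):
    # Sample soft gates m_i = clip(mu_i + eps, 0, 1) with eps ~ N(0, sigma^2)
    m = torch.clamp(mu + sigma * torch.randn(d), 0.0, 1.0)
    mse = (((m * X_e) @ theta - y_e) ** 2).mean()
    # E[||M||_0] = sum_i CDF(mu_i / sigma), the gate regularizer in Eq. (7)
    gate_reg = std_normal.cdf(mu / sigma).sum()
    return mse + alpha * gate_reg

def hrm_backend_objective(envs):
    losses = [env_loss(X_e, y_e) for X_e, y_e in envs]
    grads = torch.stack([torch.autograd.grad(L, theta, create_graph=True)[0]
                         for L in losses])        # |E_tr| x d gradient matrix
    # Eq. (8): mean risk plus the trace of the variance of per-env gradients
    return torch.stack(losses).mean() + lam * grads.var(dim=0, unbiased=False).sum()
```

In practice, `hrm_backend_objective(envs)` would be minimized over $(\theta,\mu)$ with a stochastic optimizer, and the mask $M$ is read off from the learned $\mu$ after training.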

3.2 Implementation of $\mathcal{M}_{c}$

Notation. $\Psi$ denotes the learned variant part $\Psi(X)$, $\Delta_{K}$ denotes the $K$-dimensional simplex, and $f_{\theta}(\cdot)$ denotes the function $f$ parameterized by $\theta$.

The heterogeneity identification module $\mathcal{M}_{c}$ takes a single dataset as input and outputs a multi-environment partition of the data for invariant prediction. We implement it with a clustering algorithm. As indicated by Theorem 2.3, the more diverse $P(Y|\Psi)$ is across the generated environments, the tighter the resulting invariance set $\mathcal{I}$. Therefore, we cluster the data points according to the relationship between $\Psi$ and $Y$, for which we use $P(Y|\Psi)$ as the cluster centre. Note that $\Psi$ is initialized as $\Psi(X)=X$ in our joint optimization.

Specifically, we assume the $j$-th cluster centre $P_{\Theta_{j}}(Y|\Psi)$, parameterized by $\Theta_{j}$, to be a Gaussian around $f_{\Theta_{j}}(\Psi)$, i.e. $\mathcal{N}(f_{\Theta_{j}}(\Psi),\sigma^{2})$:

$h_{j}(\Psi,Y)=P_{\Theta_{j}}(Y|\Psi)=\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(Y-f_{\Theta_{j}}(\Psi))^{2}}{2\sigma^{2}}\right)$ (10)

For the given $N=\sum_{e\in\mathcal{E}_{tr}}n_{e}$ empirical data samples $\mathcal{D}=\{\psi_{i}(x_{i}),y_{i}\}_{i=1}^{N}$, the empirical distribution is modeled as $\hat{P}_{N}=\frac{1}{N}\sum_{i=1}^{N}\delta_{i}(\Psi,Y)$ where

$\delta_{i}(\Psi,Y)=\begin{cases}1,&\mathrm{if\ }\Psi=\psi_{i}\ \mathrm{and}\ Y=y_{i}\\ 0,&\mathrm{otherwise}\end{cases}$ (11)

The target of our heterogeneous clustering is to find the distribution in $\mathcal{Q}=\{Q\,|\,Q=\sum_{j\in[K]}q_{j}h_{j}(\Psi,Y),\ \mathbb{q}\in\Delta_{K}\}$ that best fits the empirical distribution. Therefore, the objective function of our heterogeneous clustering is:

$\min_{Q\in\mathcal{Q}}D_{KL}(\hat{P}_{N}\|Q)$ (12)

The above objective can be further simplified to:

$\min_{\Theta,\mathbb{q}}\left\{\mathcal{L}_{c}=-\frac{1}{N}\sum_{i=1}^{N}\log\left[\sum_{j=1}^{K}q_{j}h_{j}(\psi_{i},y_{i})\right]\right\}$ (13)

For optimization, we use the EM algorithm to optimize the centre parameters $\Theta$ and the mixture weights $\mathbb{q}$. After optimizing Equation 13, to build $\mathcal{E}_{tr}$ we assign each data point to environment $e_{j}\in\mathcal{E}_{tr}$ with probability:

$P(e_{j}|\Psi,Y)=q_{j}h_{j}(\Psi,Y)\Big/\left(\sum_{i=1}^{K}q_{i}h_{i}(\Psi,Y)\right)$ (14)

In this way, $\mathcal{E}_{tr}$ is generated by $\mathcal{M}_{c}$.
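The following minimal NumPy sketch illustrates one plausible EM implementation of Equations 10-14, assuming linear cluster centres $f_{\Theta_{j}}(\psi)=\psi^{T}\Theta_{j}$ and a fixed noise scale $\sigma$; it is a hedged illustration of the procedure described above, not the paper's released code:

```python
import numpy as np

def heterogeneity_clustering(Psi, y, K, sigma=0.5, n_iter=50):
    N, d = Psi.shape
    rng = np.random.default_rng(0)
    Theta = rng.normal(size=(K, d))          # centre parameters
    q = np.full(K, 1.0 / K)                  # mixture weights on the simplex
    for _ in range(n_iter):
        # E-step: responsibilities r_ij ∝ q_j * h_j(psi_i, y_i), Eqs. (10), (14)
        resid = y[:, None] - Psi @ Theta.T               # N x K residuals
        log_r = np.log(q)[None, :] - resid ** 2 / (2 * sigma ** 2)
        log_r -= log_r.max(axis=1, keepdims=True)        # numerical stability
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: update weights and centres by weighted least squares
        q = r.mean(axis=0)
        for j in range(K):
            W = r[:, j][:, None]
            A = (Psi * W).T @ Psi + 1e-6 * np.eye(d)
            Theta[j] = np.linalg.solve(A, (Psi * W).T @ y)
    return r, q, Theta    # r[i, j] estimates P(e_j | psi_i, y_i), Eq. (14)
```

Environments $\mathcal{E}_{tr}$ are then formed by assigning each point $i$ to environment $e_{j}$ with probability $r[i,j]$.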

4 Theoretical Analysis

In this section, we theoretically analyze our proposed Heterogeneous Risk Minimization (HRM) method. We first analyze our proposed invariant learning module $\mathcal{M}_{p}$, and then justify the existence of the positive feedback in our HRM.

Justification of $\mathcal{M}_{p}$. We prove that, given training environments $\mathcal{E}_{tr}$, our invariant prediction module $\mathcal{M}_{p}$ can learn the maximal invariant predictor $\Phi(X)$ with respect to the corresponding invariance set $\mathcal{I}_{\mathcal{E}_{tr}}$.

Theorem 4.1.

Given $\mathcal{E}_{tr}$, the learned $\Phi(X)=M\odot X$ is the maximal invariant predictor of $\mathcal{I}_{\mathcal{E}_{tr}}$.

Justification of the Positive Feedback. The core of our HRM framework is the mechanism by which $\mathcal{M}_{c}$ and $\mathcal{M}_{p}$ mutually promote each other. Here we theoretically justify the existence of such positive feedback. In Assumption 2.1, we assume the invariance and sufficiency properties of the stable features $\Phi^{*}$, and assume that the relationship between the unstable part $\Psi^{*}$ and $Y$ can change arbitrarily. Here we make a more specific assumption on the heterogeneity across environments with respect to $\Phi^{*}$ and $\Psi^{*}$.

Assumption 4.1.

Assume the pooled training data is made up of heterogeneous data sources: $P_{tr}=\sum_{e\in\mathrm{supp}(\mathcal{E}_{tr})}w_{e}P^{e}$. For any $e_{i},e_{j}\in\mathcal{E}_{tr}$, $e_{i}\neq e_{j}$, we assume

$I^{c}_{i,j}(Y;\Phi^{*}|\Psi^{*})\geq\max\left(I_{i}(Y;\Phi^{*}|\Psi^{*}),I_{j}(Y;\Phi^{*}|\Psi^{*})\right)$ (15)

where $\Phi^{*}$ is the invariant feature and $\Psi^{*}$ the variant one. $I_{i}$ denotes mutual information under $P^{e_{i}}$, and $I^{c}_{i,j}$ denotes the cross mutual information between $P^{e_{i}}$ and $P^{e_{j}}$, which takes the form $I^{c}_{i,j}(Y;\Phi|\Psi)=H^{c}_{i,j}[Y|\Psi]-H^{c}_{i,j}[Y|\Phi,\Psi]$ with $H^{c}_{i,j}[Y]=-\int p^{e_{i}}(y)\log p^{e_{j}}(y)dy$.

Here we explain this assumption intuitively. Firstly, the mutual information $I_{i}(Y;\Phi^{*})=H_{i}[Y]-H_{i}[Y|\Phi^{*}]$ can be viewed as the error reduction when we use $\Phi^{*}$ to predict $Y$ rather than predicting with nothing. The cross mutual information $I_{i,j}(Y;\Phi^{*})$ can then be viewed as the error reduction when we use the predictor learned on $\Phi^{*}$ in environment $e_{j}$ to predict in environment $e_{i}$, rather than predicting with nothing. Therefore, the R.H.S. of Equation 15 measures how much prediction error can be reduced in environment $e_{i}$ if we add $\Phi^{*}$ for prediction rather than using only $\Psi^{*}$; the L.H.S. measures how much prediction error can be reduced by adding $\Phi^{*}$ when using predictors trained in $e_{i}$ to predict in $e_{j}$. Intuitively, Assumption 4.1 states that the invariant feature $\Phi^{*}$ provides more information for predicting $Y$ across environments than within one single environment, and correspondingly, the information provided by $\Psi^{*}$ shrinks considerably across environments, which indicates that the relationship between the variant feature $\Psi^{*}$ and $Y$ varies across environments. Based on this assumption, we first prove that the cluster centres are pulled apart when the invariant feature is excluded from clustering.
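To connect Assumption 4.1 with Theorem 4.2 below, it may help to recall the standard decomposition of cross entropy into entropy plus a KL term (an identity we state here for exposition only; it is not part of the paper's formal development):

$H^{c}_{i,j}[Y]=-\int p^{e_{i}}(y)\log p^{e_{j}}(y)dy=H_{i}[Y]+D_{KL}\left(p^{e_{i}}(Y)\,\|\,p^{e_{j}}(Y)\right)$

so the gap between each cross quantity and its within-environment counterpart is exactly a KL divergence between the two environments' distributions, which is the quantity compared in Theorem 4.2.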

Theorem 4.2.

For $e_{i},e_{j}\in\mathrm{supp}(\mathcal{E}_{tr})$, assume that $X=[\Phi^{*},\Psi^{*}]^{T}$ satisfies Assumption 2.1, where $\Phi^{*}$ is invariant and $\Psi^{*}$ variant. Then under Assumption 4.1, we have $\mathrm{D_{KL}}(P^{e_{i}}(Y|X)\|P^{e_{j}}(Y|X))\leq\mathrm{D_{KL}}(P^{e_{i}}(Y|\Psi^{*})\|P^{e_{j}}(Y|\Psi^{*}))$.

Theorem 4.2 indicates that the distance between cluster centres is larger when using only the variant features $\Psi^{*}$; therefore, it is more likely to obtain the desired heterogeneous environments, which explains why we use the learned variant part $\Psi(X)$ for clustering. Finally, we provide an optimality guarantee for our HRM.

Theorem 4.3.

Under Assumptions 2.1 and 4.1, for the proposed $\mathcal{M}_{c}$ and $\mathcal{M}_{p}$, we have the following conclusions: 1. Given environments $\mathcal{E}_{tr}$ such that $\mathcal{I}_{\mathcal{E}}=\mathcal{I}_{\mathcal{E}_{tr}}$, the $\Phi(X)$ learned by $\mathcal{M}_{p}$ is the maximal invariant predictor of $\mathcal{I}_{\mathcal{E}}$.

2. Given the maximal invariant predictor $\Phi^{*}$ of $\mathcal{I}_{\mathcal{E}}$, and assuming the pooled training data is made up of data from all environments in $\mathrm{supp}(\mathcal{E})$, there exists a split that achieves the minimum of the objective function while the resulting regularized invariance set is equal to $\mathcal{I}_{\mathcal{E}}$.

Intuitively, Theorem 4.3 proves that if either of $\mathcal{M}_{c}$ and $\mathcal{M}_{p}$ is optimal, then so is the other, which validates the existence of a global optimum of our algorithm.

5 Experiment

In this section, we validate the effectiveness of our method on simulation data and real-world data.

Baselines We compare our proposed HRM with the following methods:

  • Empirical Risk Minimization (ERM): $\min_{\theta}\mathbb{E}_{P_{0}}[\ell(\theta;X,Y)]$

  • Distributionally Robust Optimization (DRO (Sinha et al., 2018)): $\min_{\theta}\sup_{Q:W(Q,P_{0})\leq\rho}\mathbb{E}_{Q}[\ell(\theta;X,Y)]$

  • Environment Inference for Invariant Learning (EIIL (Creager et al., 2020)):

    $\min_{\Phi}\max_{u}\sum_{e\in\mathcal{E}}\frac{1}{N_{e}}\sum_{i}u_{i}(e)\ell(w\odot\Phi(x_{i}),y_{i})+\sum_{e\in\mathcal{E}}\lambda\left\|\nabla_{w|w=1.0}\frac{1}{N_{e}}\sum_{i}u_{i}(e)\ell(w\odot\Phi(x_{i}),y_{i})\right\|_{2}$ (16)
  • Invariant Risk Minimization (IRM (Arjovsky et al., 2019)) with environment labels $\mathcal{E}_{tr}$:

    $\min_{\Phi}\sum_{e\in\mathcal{E}_{tr}}\mathcal{L}^{e}+\lambda\left\|\nabla_{w|w=1.0}\mathcal{L}^{e}(w\odot\Phi)\right\|^{2}$ (17)

Further, for an ablation study, we also compare with HRMs, which runs HRM for only one iteration without the feedback loop. Note that IRM requires multiple training environments, so we provide the environment labels $\mathcal{E}_{tr}$ for IRM, while the other methods do not need environment labels.

Evaluation Metrics. To evaluate prediction performance across the testing environments $\mathcal{E}_{test}$, we use the mean error $\mathrm{Mean\_Error}=\frac{1}{|\mathcal{E}_{test}|}\sum_{e\in\mathcal{E}_{test}}\mathcal{L}^{e}$, the standard deviation of errors $\mathrm{Std\_Error}=\sqrt{\frac{1}{|\mathcal{E}_{test}|-1}\sum_{e\in\mathcal{E}_{test}}(\mathcal{L}^{e}-\mathrm{Mean\_Error})^{2}}$, and the worst-case error $\mathrm{Max\_Error}=\max_{e\in\mathcal{E}_{test}}\mathcal{L}^{e}$.
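As a concrete illustration, the three metrics can be computed from the per-environment test losses as follows (the loss values are illustrative):

```python
import numpy as np

env_losses = np.array([0.45, 0.47, 0.52, 0.61])   # L^e for each e in E_test

mean_error = env_losses.mean()
std_error = env_losses.std(ddof=1)   # uses the 1/(|E_test| - 1) normalization
max_error = env_losses.max()
```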

Imbalanced Mixture. It is a natural phenomenon that empirical data follow a power-law distribution, i.e. only a few environments/subgroups are common and the rest are rare (Shen et al., 2018; Sagawa et al., 2019, 2020). Therefore, we perform non-uniform sampling among different environments in the training set.

Table 2: Results of the selection bias simulation experiments for different methods with varying selection bias rate $r$ and dimensions $n_{b}$ and $d$ of the training data; each result is averaged over ten runs.
Scenario 1: varying selection bias rate $r$ ($d=10$, $n_{b}=1$)
$r$ | $r=1.5$ | $r=1.9$ | $r=2.3$
Methods | Mean_Error  Std_Error  Max_Error | Mean_Error  Std_Error  Max_Error | Mean_Error  Std_Error  Max_Error
ERM | 0.476  0.064  0.524 | 0.510  0.108  0.608 | 0.532  0.139  0.690
DRO | 0.467  0.046  0.516 | 0.512  0.111  0.625 | 0.535  0.143  0.746
EIIL | 0.477  0.057  0.543 | 0.507  0.102  0.613 | 0.540  0.139  0.683
IRM (with $\mathcal{E}_{tr}$ label) | 0.460  0.014  0.475 | 0.456  0.015  0.472 | 0.461  0.015  0.475
HRMs | 0.465  0.045  0.511 | 0.488  0.078  0.577 | 0.506  0.096  0.596
HRM | 0.447  0.011  0.462 | 0.449  0.010  0.465 | 0.447  0.011  0.463
Scenario 2: varying dimension $d$ ($r=1.9$, $n_{b}=0.1d$)
$d$ | $d=10$ | $d=20$ | $d=40$
Methods | Mean_Error  Std_Error  Max_Error | Mean_Error  Std_Error  Max_Error | Mean_Error  Std_Error  Max_Error
ERM | 0.510  0.108  0.608 | 0.533  0.141  0.733 | 0.528  0.175  0.719
DRO | 0.512  0.111  0.625 | 0.564  0.186  0.746 | 0.555  0.196  0.758
EIIL | 0.507  0.102  0.613 | 0.543  0.147  0.699 | 0.542  0.178  0.727
IRM (with $\mathcal{E}_{tr}$ label) | 0.456  0.015  0.472 | 0.484  0.014  0.489 | 0.500  0.051  0.540
HRMs | 0.488  0.078  0.577 | 0.486  0.069  0.555 | 0.477  0.081  0.553
HRM | 0.449  0.010  0.465 | 0.466  0.011  0.478 | 0.465  0.015  0.482

5.1 Simulation Data

We design two mechanisms to simulate varying correlations among covariates across environments, named selection bias and anti-causal effect.

Table 3: Prediction errors of the anti-causal effect experiment. We design two settings with different dimensions $n_{\phi}$ and $n_{\psi}$ of $\Phi^{*}$ and $\Psi^{*}$ respectively. The results are averaged over 10 runs.
Scenario 1: $n_{\phi}=9,\ n_{\psi}=1$
$e$ | Training environments: $e_{1}$-$e_{3}$ | Testing environments: $e_{4}$-$e_{10}$
Methods | $e_{1}$  $e_{2}$  $e_{3}$ | $e_{4}$  $e_{5}$  $e_{6}$  $e_{7}$  $e_{8}$  $e_{9}$  $e_{10}$
ERM | 0.290  0.308  0.376 | 0.419  0.478  0.538  0.596  0.626  0.640  0.689
DRO | 0.289  0.310  0.388 | 0.428  0.517  0.610  0.627  0.669  0.679  0.739
EIIL | 0.075  0.128  0.349 | 0.485  0.795  1.162  1.286  1.527  1.558  1.884
IRM (with $\mathcal{E}_{tr}$ label) | 0.306  0.312  0.325 | 0.328  0.343  0.358  0.365  0.374  0.377  0.392
HRMs | 1.060  1.085  1.112 | 1.130  1.207  1.280  1.325  1.340  1.371  1.430
HRM | 0.317  0.314  0.322 | 0.318  0.321  0.317  0.315  0.315  0.316  0.320
Scenario 2: $n_{\phi}=5,\ n_{\psi}=5$
$e$ | Training environments: $e_{1}$-$e_{3}$ | Testing environments: $e_{4}$-$e_{10}$
Methods | $e_{1}$  $e_{2}$  $e_{3}$ | $e_{4}$  $e_{5}$  $e_{6}$  $e_{7}$  $e_{8}$  $e_{9}$  $e_{10}$
ERM | 0.238  0.286  0.433 | 0.512  0.629  0.727  0.818  0.860  0.895  0.980
DRO | 0.237  0.294  0.452 | 0.529  0.651  0.778  0.859  0.911  0.950  1.028
EIIL | 0.043  0.145  0.521 | 0.828  1.237  1.971  2.523  2.514  2.506  3.512
IRM (with $\mathcal{E}_{tr}$ label) | 0.287  0.293  0.329 | 0.345  0.382  0.420  0.444  0.461  0.478  0.504
HRMs | 0.455  0.463  0.479 | 0.478  0.495  0.508  0.513  0.519  0.525  0.533
HRM | 0.316  0.315  0.315 | 0.330  0.320  0.317  0.326  0.330  0.333  0.335

Selection Bias. In this setting, the correlations between the variant covariates and the target are perturbed through a selection bias mechanism. According to Assumption 2.1, we assume $X=[\Phi^{*},\Psi^{*}]^{T}\in\mathbb{R}^{d}$ and $Y=f(\Phi^{*})+\epsilon$, and that $P(Y|\Phi^{*})$ remains invariant across environments while $P(Y|\Psi^{*})$ changes arbitrarily. For simplicity, we select data points according to a certain variable set $V_{b}\subset\Psi^{*}$:

$\hat{P}(x)=\prod_{v_{i}\in V_{b}}|r|^{-5|f(\phi^{*})-\mathrm{sign}(r)\cdot v_{i}|}$ (18)

where $|r|>1$, $V_{b}\in\mathbb{R}^{n_{b}}$, and $\hat{P}(x)$ denotes the probability of point $x$ being selected. Intuitively, $r$ controls the strength and direction of the spurious correlation between $V_{b}$ and $Y$ (e.g., if $r>0$, a data point whose $V_{b}$ is close to its $y$ is more likely to be selected). A larger $|r|$ means a stronger spurious correlation between $V_{b}$ and $Y$, and $r\geq 0$ means positive correlation and vice versa. Therefore, we use $r$ to define different environments.

In training, we generate $sum=2000$ data points, where $\kappa=95\%$ of the points come from environment $e_{1}$ with a predefined $r$ and $1-\kappa=5\%$ come from $e_{2}$ with $r=-1.1$. In testing, we generate data points for 10 environments with $r\in[-3,-2.7,-2.3,\dots,2.3,2.7,3.0]$. $\beta$ is set to 1.0. We compare our HRM with ERM, DRO, EIIL and IRM for linear regression. We conduct extensive experiments with different settings of $r$, $n_{b}$ and $d$. In each setting, we carry out the procedure 10 times and report the average results, which are shown in Table 2.

From the results, we have the following observations and analysis: ERM suffers from the distributional shifts in testing and yields poor performance in most settings. DRO surprisingly has the worst performance, which we attribute to the over-pessimism problem (Frogner et al., 2019). EIIL performs similarly to ERM, which indicates that its inferred environments cannot reveal the spurious correlations between $Y$ and $V_{b}$. IRM performs much better than the above baselines; however, as IRM depends on environment labels, it uses much more information than the other three methods. Compared to these baselines, our HRM achieves nearly perfect performance with respect to average performance and stability; in particular, the variance of losses across environments is close to 0, which reflects the effectiveness of our heterogeneous clustering as well as our invariant learning algorithm. Furthermore, HRM does not need environment labels, which verifies that our clustering algorithm can mine the latent heterogeneity inside the data and further shows our superiority over IRM.

Besides, we visualize the differences between environments using Task2Vec (Achille et al., 2019) in Figure 2, where a larger value means the two environments are more heterogeneous. The pooled training data are a mixture of environments with $r=1.9$ and $r=-1.1$, whose difference is shown in the yellow box. The red boxes show the differences between the environments learned by HRMs and HRM. The large improvement from $\mathcal{E}_{init}$ to $\mathcal{E}_{learn}$ verifies that HRM can exploit the heterogeneity inside data, as well as the existence of the positive feedback. Due to space limitations, results with varying $sum,\kappa,n_{b}$ as well as experimental details are left to the appendix.

Figure 2: Visualization of differences between environments in scenario 1 of the selection bias experiment ($r=1.9$). The left figure shows the initial clustering results using $X$, and the right one shows the learned $\mathcal{E}_{learn}$ using the learned variant part $\Psi(X)$.

Anti-causal Effect. Inspired by (Arjovsky et al., 2019), we induce spurious correlation via an anti-causal relationship from the target $Y$ to the variant covariates $\Psi^{*}$. In this experiment, we assume $X=[\Phi^{*},\Psi^{*}]^{T}\in\mathbb{R}^{d}$; we first sample $\Phi^{*}$ from a mixture of Gaussians $\sum_{i=1}^{k}z_{i}\mathcal{N}(\mu_{i},I)$ and set the target $Y=\theta_{\phi}^{T}\Phi^{*}+\beta\Phi_{1}\Phi_{2}\Phi_{3}+\mathcal{N}(0,0.3)$. Then the spurious correlations between $\Psi^{*}$ and $Y$ are generated by the anti-causal effect as

$\Psi^{*}=\theta_{\psi}Y+\mathcal{N}(0,\sigma(\mu_{i})^{2})$ (19)

where $\sigma(\mu_{i})$ means that the scale of the Gaussian noise added to $\Psi^{*}$ depends on which mixture component the invariant covariates $\Phi^{*}$ belong to. Intuitively, in different Gaussian components, the corresponding correlations between $\Psi^{*}$ and $Y$ vary due to the different values of $\sigma(\mu_{i})$: the larger $\sigma(\mu_{i})$ is, the weaker the correlation between $\Psi^{*}$ and $Y$. We use the mixture weights $Z=[z_{1},\dots,z_{k}]^{T}$ to define different environments, where different mixture weights represent different overall strengths of the effect of $Y$ on $\Psi^{*}$.
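A minimal NumPy sketch of this anti-causal generative process (Equation 19), assuming two Gaussian components and illustrative parameter values, is:

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_phi, n_psi, beta = 1000, 5, 5, 0.1
mus = [np.zeros(n_phi), 3.0 * np.ones(n_phi)]  # component means of Phi*
z = np.array([0.9, 0.1])                       # mixture weights defining one environment
sigma_psi = {0: 0.3, 1: 3.0}                   # noise scale depends on the component

comp = rng.choice(len(z), size=n, p=z)
Phi = np.stack([rng.normal(mus[c], 1.0) for c in comp])
theta_phi = rng.normal(size=n_phi)
Y = Phi @ theta_phi + beta * Phi[:, 0] * Phi[:, 1] * Phi[:, 2] + rng.normal(0, 0.3, n)

# Anti-causal step, Eq. (19): Psi* = theta_psi * Y + N(0, sigma(mu_i)^2)
theta_psi = rng.normal(size=n_psi)
noise = np.stack([rng.normal(0, sigma_psi[c], n_psi) for c in comp])
Psi = Y[:, None] * theta_psi[None, :] + noise
X = np.concatenate([Phi, Psi], axis=1)
```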

In this experiment, we set $\beta=0.1$ and build 10 environments with varying $\sigma$ and varying dimensions of $\Phi^{*},\Psi^{*}$, using the first three environments for training and the last seven for testing. We run the experiments 10 times and report the averaged results in Table 3. EIIL achieves the best training performance with respect to prediction errors on the training environments $e_{1}$, $e_{2}$, $e_{3}$, while its testing performance is poor. ERM suffers from distributional shifts in testing. DRO seeks overly conservative robustness and performs much worse. IRM performs much better as it learns invariant representations with the help of environment labels. HRM achieves nearly uniformly good performance across both training and testing environments, which validates the effectiveness of our method and demonstrates its excellent generalization ability.

Figure 3: Results on real-world datasets, including training and testing performance for five methods. (a) Training and testing accuracy for car insurance prediction; the left sub-figure shows the training results for 5 settings and the right shows the corresponding testing results. (b) Mis-classification rate for income prediction. (c) Prediction error for house price prediction, where RMSE refers to the root mean square error.

5.2 Real-world Data

We test our method on three real-world tasks, including car insurance prediction, people income prediction and house price prediction.

5.2.1 Settings

Car Insurance Prediction. In this task, we use a real-world dataset for car insurance prediction (Kaggle, https://www.kaggle.com/anmolkumar/health-insurance-cross-sell-prediction). It is a classification task to predict whether a person will buy car insurance based on related information, such as vehicle damage, annual premium, and vehicle age. We impose the selection bias mechanism on the correlation between the outcome (i.e., the label indicating whether insurance is bought) and the sex attribute to simulate multiple environments. Specifically, we simulate different strengths $|r|$ of the spurious correlation between sex and target in training, and reverse the direction of this correlation in testing ($+|r|$ in training and $-|r|$ in testing). For IRM, in each setting, we divide the training data into three training environments with $r_{1}=0.95,r_{2}=0.9,r_{3}=-0.8$, where different overall correlations $r$ correspond to different numbers of data points in $e_{1},e_{2},e_{3}$. We perform 5 experiments with varying $r$, and the results in both training and testing are shown in Figure 3(a).

People Income Prediction. In this task we use the Adult dataset (Dua & Graff, 2017) to predict whether personal income is above or below $50,000 per year based on personal details. We split the dataset into 10 environments according to the demographic attributes sex and race. In the training phase, all methods are trained on pooled data containing 693 points from environment 1 and 200 points from environment 2, and validated on 100 points sampled from both. For IRM, the ground-truth environment labels are provided. In the testing phase, we test all methods on the 10 environments and report the mis-classification rates in Figure 3(b).

House Price Prediction. In this experiment, we use a real-world regression dataset (Kaggle, https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data) of house sales prices from King County, USA. The target variable is the transaction price of the house, and each sample contains 17 predictive variables, such as the built year of the house, number of bedrooms, and square footage of the home. We simulate different environments according to the built year of the house, since it is fairly reasonable to assume that the correlations between covariates and the target may vary over time. Specifically, we split the dataset into 6 periods, where each period approximately covers a time span of two decades. All methods are trained on data from the first period ($[1900,1920)$) and tested on the other periods. For IRM, we further divide the training data into two environments with built year in $[1900,1910)$ and $[1910,1920)$ respectively. Results are shown in Figure 3(c).

5.2.2 Analysis

From the results of the three real-world tasks, we have the following observations and analysis: ERM achieves high accuracy in training while performing much worse in testing, indicating its inability to deal with OOD prediction. DRO's performance is not satisfactory, sometimes even worse than ERM; one plausible reason is its over-pessimistic nature, which leads to overly conservative predictors. Comparatively, the invariant learning methods perform better in testing. IRM outperforms ERM and DRO, which shows the usefulness of environment labels for OOD generalization and the possibility of learning an invariant predictor from multiple environments. EIIL performs inconsistently across tasks, possibly due to the instability of its environment inference method. In all tasks and almost all testing environments (16/18), HRM consistently achieves the best performance. HRM even significantly outperforms IRM in an unfair setting where we provide perfect environment labels for IRM. On one hand, this shows the limitation of manually labeled environments; on the other hand, it demonstrates that, by relieving the dependence on environment labels, HRM can effectively uncover and fully leverage the intrinsic heterogeneity in training data for invariant learning.

6 Related Works

There are mainly two branches of methods for the OOD generalization problem, namely distributionally robust optimization (DRO) (Esfahani & Kuhn, 2018; Duchi & Namkoong, 2018; Sinha et al., 2018; Sagawa et al., 2019) and invariant learning (Arjovsky et al., 2019; Koyama & Yamaguchi, 2020; Chang et al., 2020; Creager et al., 2020). DRO methods optimize the worst-case risk within an uncertainty set, which lies around the observed training distribution and characterizes the potential testing distributions. However, in real scenarios, the uncertainty set has to be fairly large to capture the testing distribution, which results in the over-pessimism problem of DRO methods (Hu et al., 2018; Frogner et al., 2019).

Realizing the difficulty of solving the OOD generalization problem without any prior knowledge or structural assumptions, invariant learning methods assume the existence of causally invariant relationships between some predictors $\Phi(X)$ and the target $Y$. (Arjovsky et al., 2019) and (Koyama & Yamaguchi, 2020) propose to learn an invariant representation through multiple training environments. (Chang et al., 2020) also proposes to select features whose predictive relationship with the target stays invariant across environments. However, their effectiveness relies on the quality of the given training environments, and the role of environments remains theoretically vague. Recently, (Creager et al., 2020) improved upon (Arjovsky et al., 2019) by relaxing its requirement for multiple environments. Specifically, (Creager et al., 2020) proposes a two-stage method that first infers the environment division with a pre-provided biased model, and then performs invariant learning on the inferred environments. However, the two stages cannot be jointly optimized, and the environment division relies on the given biased model and lacks theoretical guarantees.

7 Discussions

In this work, we theoretically analyze the role of environments in invariant learning and propose HRM for joint heterogeneity identification and invariant prediction, which relaxes the requirement for environment labels and opens a new direction for invariant learning. To our knowledge, this is the first work to both theoretically and empirically analyze how the quality of multiple environments affects invariant learning. This paper mainly focuses on the raw variable level with the assumption $X=[\Phi^{*},\Psi^{*}]^{T}$, which covers a broad spectrum of applications, e.g., healthcare, finance, and marketing, where the raw variables are informative enough.

However, our work has some limitations, which we hope to address in the future. Firstly, to achieve the mutual promotion, we should use the variant features $\Psi^{*}$ for heterogeneity identification rather than the invariant ones. However, the process of invariant prediction continuously discards the variant features $\Psi^{*}$ (in favor of invariant features or representations), which makes it quite hard to recover the variant features. To overcome this, we focus on the simple setting where $X=[\Phi^{*},\Psi^{*}]^{T}$, since we can directly obtain the variant features $\Psi^{*}$ once we have the invariant features $\Phi^{*}$. To further extend the power of HRM, we will consider incorporating representation learning from $X$ in future work. Secondly, our clustering algorithm in $\mathcal{M}_{c}$ lacks theoretical guarantees for its convergence. To the best of our knowledge, theoretically analyzing the convergence of a clustering algorithm requires a distance measure between data points. However, our clustering algorithm takes model parameters as cluster centres and aims to cluster data points $(X,Y)$ according to the relationship between $X$ and $Y$, whose dissimilarity cannot be easily measured, since this relationship is a statistical quantity and cannot be computed for an individual point. How to theoretically analyze the convergence of such clustering algorithms remains an open problem.

8 Acknowledgements

This work was supported in part by National Key R&D Program of China (No. 2018AAA0102004, No. 2020AAA0106300), National Natural Science Foundation of China (No. U1936219, 61521002, 61772304), Beijing Academy of Artificial Intelligence (BAAI), and a grant from the Institute for Guo Qiang, Tsinghua University. Bo Li’s research was supported by the Tsinghua University Initiative Scientific Research Grant, No. 2019THZWJC11; Technology and Innovation Major Project of the Ministry of Science and Technology of China under Grant 2020AAA0108400 and 2020AAA01084020108403; Major Program of the National Social Science Foundation of China (21ZDA036).

References

  • Achille et al. (2019) Achille, A., Lam, M., Tewari, R., Ravichandran, A., Maji, S., Fowlkes, C. C., Soatto, S., and Perona, P. Task2vec: Task embedding for meta-learning. In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, pp. 6429–6438. IEEE, 2019. doi: 10.1109/ICCV.2019.00653. URL https://doi.org/10.1109/ICCV.2019.00653.
  • Arjovsky et al. (2019) Arjovsky, M., Bottou, L., Gulrajani, I., and Lopez-Paz, D. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.
  • Berk et al. (2018) Berk, R., Heidari, H., Jabbari, S., Kearns, M., and Roth, A. Fairness in criminal justice risk assessments: The state of the art. Sociological Methods & Research, pp.  0049124118782533, 2018.
  • Chang et al. (2020) Chang, S., Zhang, Y., Yu, M., and Jaakkola, T. S. Invariant rationalization. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pp.  1448–1458. PMLR, 2020. URL http://proceedings.mlr.press/v119/chang20c.html.
  • Creager et al. (2020) Creager, E., Jacobsen, J.-H., and Zemel, R. Environment inference for invariant learning. In ICML Workshop on Uncertainty and Robustness, 2020.
  • Dua & Graff (2017) Dua, D. and Graff, C. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
  • Duchi & Namkoong (2018) Duchi, J. and Namkoong, H. Learning models with uniform performance via distributionally robust optimization. arXiv preprint arXiv:1810.08750, 2018.
  • El Gamal & Kim (2011) El Gamal, A. and Kim, Y.-H. Network information theory. Network Information Theory, 12 2011. doi: 10.1017/CBO9781139030687.
  • Esfahani & Kuhn (2018) Esfahani, P. M. and Kuhn, D. Data-driven distributionally robust optimization using the wasserstein metric: performance guarantees and tractable reformulations. Math. Program., 171(1-2):115–166, 2018. doi: 10.1007/s10107-017-1172-1. URL https://doi.org/10.1007/s10107-017-1172-1.
  • Frogner et al. (2019) Frogner, C., Claici, S., Chien, E., and Solomon, J. Incorporating unlabeled data into distributionally robust learning. arXiv preprint arXiv:1912.07729, 2019.
  • Gong et al. (2016) Gong, M., Zhang, K., Liu, T., Tao, D., Glymour, C., and Schölkopf, B. Domain adaptation with conditional transferable components. In Balcan, M. and Weinberger, K. Q. (eds.), Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, volume 48 of JMLR Workshop and Conference Proceedings, pp.  2839–2848. JMLR.org, 2016. URL http://proceedings.mlr.press/v48/gong16.html.
  • Hu et al. (2018) Hu, W., Niu, G., Sato, I., and Sugiyama, M. Does distributionally robust supervised learning give robust classifiers? In Dy, J. G. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pp.  2034–2042. PMLR, 2018. URL http://proceedings.mlr.press/v80/hu18a.html.
  • Koyama & Yamaguchi (2020) Koyama, M. and Yamaguchi, S. Out-of-distribution generalization with maximal invariant predictor. CoRR, abs/2008.01883, 2020. URL https://arxiv.org/abs/2008.01883.
  • Kuang et al. (2020) Kuang, K., Xiong, R., Cui, P., Athey, S., and Li, B. Stable prediction with model misspecification and agnostic distribution shift. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pp.  4485–4492. AAAI Press, 2020. URL https://aaai.org/ojs/index.php/AAAI/article/view/5876.
  • Kukar (2003) Kukar, M. Transductive reliability estimation for medical diagnosis. Artificial Intelligence in Medicine, 29(1-2):81–106, 2003.
  • Rojas-Carulla et al. (2015) Rojas-Carulla, M., Schölkopf, B., Turner, R., and Peters, J. Invariant models for causal transfer learning. Stats, 2015.
  • Rudin & Ustun (2018) Rudin, C. and Ustun, B. Optimized scoring systems: Toward trust in machine learning for healthcare and criminal justice. Interfaces, 48(5):449–466, 2018.
  • Sagawa et al. (2019) Sagawa, S., Koh, P. W., Hashimoto, T. B., and Liang, P. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. arXiv preprint arXiv:1911.08731, 2019.
  • Sagawa et al. (2020) Sagawa, S., Raghunathan, A., Koh, P. W., and Liang, P. An investigation of why overparameterization exacerbates spurious correlations. 2020.
  • Shen et al. (2018) Shen, Z., Cui, P., Kuang, K., Li, B., and Chen, P. Causally regularized learning with agnostic data selection bias. In 2018 ACM Multimedia Conference, 2018.
  • Sinha et al. (2018) Sinha, A., Namkoong, H., and Duchi, J. Certifying some distributional robustness with principled adversarial training. International Conference on Learning Representations, 2018.
  • Yamada et al. (2020) Yamada, Y., Lindenbaum, O., Negahban, S., and Kluger, Y. Feature selection using stochastic gates. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pp.  10648–10659. PMLR, 2020. URL http://proceedings.mlr.press/v119/yamada20a.html.

Appendix A Additional Simulation Results and Details

Selection Bias. In this setting, the correlations among covariates are perturbed through a selection bias mechanism. According to Assumption 2.1, we assume $X=[\Phi^{*},\Psi^{*}]^{T}\in\mathbb{R}^{d}$, where $\Phi^{*}=[\Phi^{*}_{1},\Phi^{*}_{2},\dots,\Phi^{*}_{n_{\phi}}]^{T}\in\mathbb{R}^{n_{\phi}}$ is independent of $\Psi^{*}=[\Psi^{*}_{1},\Psi^{*}_{2},\dots,\Psi^{*}_{n_{\psi}}]^{T}\in\mathbb{R}^{n_{\psi}}$, while the covariates within $\Phi^{*}$ are dependent on each other. We assume $Y=f(\Phi^{*})+\epsilon$, and that $P(Y|\Phi^{*})$ remains invariant across environments while $P(Y|\Psi^{*})$ can change arbitrarily.

Therefore, we generate training data points with the help of auxiliary variables $Z\in\mathbb{R}^{n_{\phi}+1}$ as follows:

$Z_{1},\dots,Z_{n_{\phi}+1}\stackrel{iid}{\sim}\mathcal{N}(0,1.0)$ (20)
$\Psi^{*}_{1},\dots,\Psi^{*}_{n_{\psi}}\stackrel{iid}{\sim}\mathcal{N}(0,1.0)$ (21)
$\Phi^{*}_{i}=0.8Z_{i}+0.2Z_{i+1}\quad\mathrm{for}\ i=1,\dots,n_{\phi}$ (22)

To induce model misspecification, we generate $Y$ as:

$Y=f(\Phi^{*})+\epsilon=\theta_{\phi}^{T}\Phi^{*}+\beta\,\Phi^{*}_{1}\Phi^{*}_{2}\Phi^{*}_{3}+\epsilon$   (23)

where $\theta_{\phi}=[\frac{1}{2},-1,1,-\frac{1}{2},1,-1,\dots]\in\mathbb{R}^{n_{\phi}}$ and $\epsilon\sim\mathcal{N}(0,0.3)$. As we assume that $P(Y|\Phi^{*})$ remains unchanged while $P(Y|\Psi^{*})$ can vary across environments, we design a data selection mechanism to induce this kind of distribution shift. For simplicity, we select data points according to a certain variable set $V_{b}\subset\Psi^{*}$:

$\hat{P}=\prod_{v_{i}\in V_{b}}|r|^{-5\,|f(\phi)-\mathrm{sign}(r)\,v_{i}|}$   (24)
$\mu\sim\mathrm{Uni}(0,1)$   (25)
$M(r;(x,y))=\begin{cases}1,&\mu\leq\hat{P}\\ 0,&\text{otherwise}\end{cases}$   (26)

where $|r|>1$ and $V_{b}\in\mathbb{R}^{n_{b}}$. Given a certain $r$, a data point $(x,y)$ is selected if and only if $M(r;(x,y))=1$.

Intuitively, $r$ controls the strength and direction of the spurious correlation between $V_{b}$ and $Y$ (i.e., if $r>0$, a data point whose $V_{b}$ is close to its $Y$ is more likely to be selected). A larger $|r|$ means a stronger spurious correlation between $V_{b}$ and $Y$, with $r>0$ yielding a positive correlation and vice versa. Therefore, we use $r$ to define different environments.

In training, we generate $sum$ data points in total, with $\kappa\cdot sum$ points drawn from environment $e_{1}$ with a predefined $r$ and $(1-\kappa)\cdot sum$ points from $e_{2}$ with $r=-1.1$. In testing, we generate data points for 10 environments with $r\in[-3,-2,-1.7,\dots,1.7,2,3]$. $\beta$ is set to 1.0.
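For concreteness, below is a minimal Python sketch of this generation process (Eqs. (20)–(26)). The function name, argument names, and default dimensions are ours for illustration and are not prescribed by the paper.

```python
import numpy as np

def generate_selection_bias_env(num, r, n_phi=5, n_psi=5, n_b=1, beta=1.0, seed=0):
    """Sample data points from one environment defined by the bias rate r (|r| > 1)."""
    rng = np.random.default_rng(seed)
    theta_phi = np.resize([0.5, -1.0, 1.0, -0.5, 1.0, -1.0], n_phi)  # pattern from the text
    X, Y = [], []
    while len(Y) < num:
        Z = rng.normal(0.0, 1.0, n_phi + 1)                  # Eq. (20)
        psi = rng.normal(0.0, 1.0, n_psi)                    # Eq. (21)
        phi = 0.8 * Z[:n_phi] + 0.2 * Z[1:]                  # Eq. (22)
        f_phi = theta_phi @ phi + beta * phi[0] * phi[1] * phi[2]
        y = f_phi + rng.normal(0.0, 0.3)                     # Eq. (23)
        # Selection probability, Eq. (24), with V_b taken as the first n_b variant covariates
        p_hat = np.prod(np.abs(r) ** (-5.0 * np.abs(f_phi - np.sign(r) * psi[:n_b])))
        if rng.uniform() <= p_hat:                           # Eqs. (25)-(26)
            X.append(np.concatenate([phi, psi]))
            Y.append(y)
    return np.array(X), np.array(Y)

# Training pool: kappa * sum points with a predefined r and (1 - kappa) * sum with r = -1.1
X1, Y1 = generate_selection_bias_env(int(0.95 * 2000), r=1.9)
X2, Y2 = generate_selection_bias_env(2000 - int(0.95 * 2000), r=-1.1, seed=1)
```

Note that since $|r|>1$ and the exponent in Eq. (24) is non-positive, $\hat{P}$ is always a valid acceptance probability in $[0,1]$.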

Apart from the two scenarios in the main body, we also conduct scenarios 3 and 4, with varying $\kappa$, sample size $sum$, and variant dimension $n_{b}$ respectively.

Table 4: Results in the selection bias simulation experiments for different methods with varying sample size $sum$, ratio $\kappa$, and variant dimension $n_{b}$ of the training data; each result is averaged over ten runs.

Scenario 3: varying ratio $\kappa$ and sample size $sum$ ($d=10$, $r=1.9$, $n_{b}=1$)
Methods | $\kappa=0.90$, $sum=1000$ | $\kappa=0.95$, $sum=2000$ | $\kappa=0.975$, $sum=4000$
(each cell: Mean_Error / Std_Error / Max_Error)
ERM | 0.477 / 0.061 / 0.530 | 0.510 / 0.108 / 0.608 | 0.547 / 0.150 / 0.687
DRO | 0.480 / 0.107 / 0.597 | 0.512 / 0.111 / 0.625 | 0.608 / 0.227 / 0.838
EIIL | 0.476 / 0.063 / 0.529 | 0.507 / 0.102 / 0.613 | 0.539 / 0.148 / 0.689
IRM (with $\mathcal{E}_{tr}$ label) | 0.455 / 0.015 / 0.471 | 0.456 / 0.015 / 0.472 | 0.456 / 0.015 / 0.472
HRM | 0.450 / 0.010 / 0.461 | 0.447 / 0.011 / 0.465 | 0.447 / 0.010 / 0.463

Scenario 4: varying variant dimension $n_{b}$ ($d=10$, $sum=2000$, $\kappa=0.95$, $r=1.9$)
Methods | $n_{b}=1$ | $n_{b}=3$ | $n_{b}=5$
(each cell: Mean_Error / Std_Error / Max_Error)
ERM | 0.510 / 0.108 / 0.608 | 0.468 / 0.110 / 0.583 | 0.445 / 0.112 / 0.567
DRO | 0.512 / 0.111 / 0.625 | 0.515 / 0.107 / 0.617 | 0.454 / 0.122 / 0.577
EIIL | 0.520 / 0.111 / 0.613 | 0.469 / 0.111 / 0.581 | 0.454 / 0.100 / 0.557
IRM (with $\mathcal{E}_{tr}$ label) | 0.456 / 0.015 / 0.472 | 0.432 / 0.014 / 0.446 | 0.414 / 0.061 / 0.475
HRM | 0.447 / 0.011 / 0.465 | 0.413 / 0.012 / 0.431 | 0.402 / 0.057 / 0.462

Anti-Causal Effect Inspired by (Arjovsky et al., 2019), in this setting we introduce the spurious correlation through an anti-causal relationship from the target $Y$ to the variant covariates $\Psi^{*}$.

We assume $X=[\Phi^{*},\Psi^{*}]^{T}\in\mathbb{R}^{d}$ with $\Phi^{*}=[\Phi^{*}_{1},\Phi^{*}_{2},\dots,\Phi^{*}_{n_{\phi}}]^{T}\in\mathbb{R}^{n_{\phi}}$ and $\Psi^{*}=[\Psi^{*}_{1},\Psi^{*}_{2},\dots,\Psi^{*}_{n_{\psi}}]^{T}\in\mathbb{R}^{n_{\psi}}$. The data generation process is as follows:

$\Phi^{*}\sim\sum_{i=1}^{k}z_{i}\,\mathcal{N}(\mu_{i},I)$   (27)
$Y=\theta_{\phi}^{T}\Phi^{*}+\beta\,\Phi^{*}_{1}\Phi^{*}_{2}\Phi^{*}_{3}+\mathcal{N}(0,0.3)$   (28)
$\Psi^{*}=\theta_{\psi}Y+\mathcal{N}(0,\sigma(\mu_{i})^{2})$   (29)

where $z_{i}$ with $\sum_{i=1}^{k}z_{i}=1$ and $z_{i}\geq 0$ are the mixture weights of the $k$ Gaussian components, $\sigma(\mu_{i})$ indicates that the scale of the Gaussian noise added to $\Psi^{*}$ depends on which component the invariant covariates $\Phi^{*}$ belong to, and $\theta_{\psi}\in\mathbb{R}^{n_{\psi}}$. Intuitively, the correlations between $\Psi^{*}$ and $Y$ vary across the Gaussian components due to the different values of $\sigma(\mu_{i})$: the larger $\sigma(\mu_{i})$ is, the weaker the correlation between $\Psi^{*}$ and $Y$. We use the mixture weight $Z=[z_{1},\dots,z_{k}]^{T}$ to define different environments, where different mixture weights represent different overall strengths of the effect of $Y$ on $\Psi^{*}$. In this experiment, we build 10 environments with varying $\sigma$ and dimensions of $\Phi^{*},\Psi^{*}$, using the first three for training and the last seven for testing. Specifically, we set $\beta=0.1$, $\mu_{1}=[0,0,0,1,1]^{T}$, $\mu_{2}=[0,0,0,1,-1]^{T}$, $\mu_{3}=[0,0,0,-1,1]^{T}$, $\mu_{4}=\mu_{5}=\dots=\mu_{10}=[0,0,0,-1,-1]^{T}$, $\sigma(\mu_{1})=0.2$, $\sigma(\mu_{2})=0.5$, $\sigma(\mu_{3})=1.0$, and $[\sigma(\mu_{4}),\sigma(\mu_{5}),\dots,\sigma(\mu_{10})]=[3.0,5.0,\dots,15.0]$. $\theta_{\phi},\theta_{\psi}$ are randomly sampled from $\mathcal{N}(1,I)$ and $\mathcal{N}(0.5,0.1I)$ respectively. We run each experiment 10 times and average the results.
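A minimal sketch of this anti-causal generation process (Eqs. (27)–(29)) follows; the helper and its signature are ours, and we read the second arguments of $\mathcal{N}(\cdot,\cdot)$ as noise scales for illustration.

```python
import numpy as np

def generate_anti_causal_env(n, z, mus, sigmas, theta_phi, theta_psi, beta=0.1, seed=0):
    """Sample one environment defined by mixture weights z over k Gaussian components."""
    rng = np.random.default_rng(seed)
    comps = rng.choice(len(z), size=n, p=z)                    # component assignment
    phi = rng.normal(loc=np.asarray(mus)[comps], scale=1.0)    # Eq. (27)
    y = phi @ theta_phi + beta * phi[:, 0] * phi[:, 1] * phi[:, 2] \
        + rng.normal(0.0, 0.3, n)                              # Eq. (28)
    noise_scale = np.asarray(sigmas)[comps][:, None]           # sigma(mu_i), per point
    psi = np.outer(y, theta_psi) \
        + rng.normal(0.0, 1.0, (n, len(theta_psi))) * noise_scale  # Eq. (29)
    return np.concatenate([phi, psi], axis=1), y

# Example: one training pool over the three training components (values from this appendix)
mus = [[0, 0, 0, 1, 1], [0, 0, 0, 1, -1], [0, 0, 0, -1, 1]]
sigmas = [0.2, 0.5, 1.0]
rng = np.random.default_rng(0)
theta_phi = rng.normal(1.0, 1.0, 5)          # theta_phi ~ N(1, I)
theta_psi = rng.normal(0.5, 0.1, 5)          # theta_psi ~ N(0.5, 0.1 I), 0.1 read as a scale
X, Y = generate_anti_causal_env(1000, [0.6, 0.2, 0.2], mus, sigmas, theta_phi, theta_psi)
```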

Appendix B Proofs

B.1 Proof of Theorem 2.1

First, we prove that a representation satisfying Assumption 2.1 is the maximal invariant predictor (MIP).

Theorem B.1.

A representation $\Phi^{*}\in\mathcal{I}$ satisfying Assumption 2.1 is the maximal invariant predictor.

Proof.

$\rightarrow$: We prove $\Phi^{*}=\arg\max_{Z\in\mathcal{I}}I(Y;Z)$. Suppose $\Phi^{*}$ is not the maximal invariant predictor, and let $\Phi^{\prime}=\arg\max_{Z\in\mathcal{I}}I(Y;Z)$. By the functional representation lemma applied to $(\Phi^{*},\Phi^{\prime})$, there exists a random variable $\Phi_{extra}$ such that $\Phi^{\prime}=\sigma(\Phi^{*},\Phi_{extra})$ and $\Phi^{*}\perp\Phi_{extra}$. Then $I(Y;\Phi^{\prime})\leq I(Y;\Phi^{*},\Phi_{extra})=I(f(\Phi^{*})+\epsilon;\Phi^{*},\Phi_{extra})=I(f(\Phi^{*})+\epsilon;\Phi^{*})=I(Y;\Phi^{*})$, so $\Phi^{*}$ attains the maximum.

$\leftarrow$: We prove that the maximal invariant predictor $\Phi^{*}$ satisfies the sufficiency property in Assumption 2.1.

The contrapositive is:

$Y\neq f(\Phi^{*})+\epsilon\ \Rightarrow\ \Phi^{*}\neq\arg\max_{Z\in\mathcal{I}}I(Y;Z)$   (30)

Suppose instead that $Y\neq f(\Phi^{*})+\epsilon$ while $\Phi^{*}=\arg\max_{Z\in\mathcal{I}}I(Y;Z)$, and suppose $Y=f(\Phi^{\prime})+\epsilon$ where $\Phi^{\prime}\neq\Phi^{*}$. Then we have:

$I(f(\Phi^{\prime});\Phi^{*})\leq I(f(\Phi^{\prime});\Phi^{\prime})$   (31)

Therefore $\Phi^{\prime}=\arg\max_{Z\in\mathcal{I}}I(Y;Z)$, i.e., $\Phi^{*}$ is not the maximal invariant predictor, which establishes the contrapositive (30). ∎

We then provide the proof of Theorem 2.1.

Theorem B.2.

Let $g$ be a strictly convex, differentiable function and let $D$ be the corresponding Bregman loss function. Let $\Phi^{*}$ be the maximal invariant predictor with respect to $\mathcal{I}_{\mathcal{E}}$, and put $h^{*}(X)=\mathbb{E}_{Y}[Y|\Phi^{*}]$. Under Assumption 2.2, we have:

$h^{*}=\arg\min_{h}\sup_{e\in\mathrm{supp}(\mathcal{E})}\mathbb{E}[D(h(X),Y)|e]$   (32)
Proof.

Firstly, according to Theorem B.1, $\Phi^{*}$ satisfies Assumption 2.1. Consider any function $h$; we would like to prove that for each distribution $P^{e}$ ($e\in\mathcal{E}$), there exists an environment $e^{\prime}$ such that:

$\mathbb{E}[D(h(X),Y)|e^{\prime}]\geq\mathbb{E}[D(h^{*}(X),Y)|e]$   (33)

For each $e\in\mathcal{E}$ with density $([\Phi,\Psi],Y)\mapsto P(\Phi,\Psi,Y)$, we construct an environment $e^{\prime}$ with density $Q(\Phi,\Psi,Y)$ that satisfies (omitting the superscript $*$ of $\Phi$ and $\Psi$ for simplicity):

$Q(\Phi,\Psi,Y)=P(\Phi,Y)Q(\Psi)$   (34)

Note that such an environment $e^{\prime}$ exists because of the heterogeneity property assumed in Assumption 2.2. Then we have:

$\int D(h(\phi,\psi),y)\,q(\phi,\psi,y)\,d\phi\,d\psi\,dy$   (35)
$=\int_{\psi}\int_{\phi,y}D(h(\phi,\psi),y)\,p(\phi,y)\,q(\psi)\,d\phi\,dy\,d\psi$   (36)
$=\int_{\psi}\Big[\int_{\phi,y}D(h(\phi,\psi),y)\,p(\phi,y)\,d\phi\,dy\Big]\,q(\psi)\,d\psi$   (37)
$\geq\int_{\psi}\Big[\int_{\phi,y}D(h^{*}(\phi,\psi),y)\,p(\phi,y)\,d\phi\,dy\Big]\,q(\psi)\,d\psi$   (38)
$=\int_{\psi}\Big[\int_{\phi,y}D(h^{*}(\phi),y)\,p(\phi,y)\,d\phi\,dy\Big]\,q(\psi)\,d\psi$   (39)
$=\int_{\phi,y}D(h^{*}(\phi),y)\,p(\phi,y)\,d\phi\,dy$   (40)
$=\int_{\phi,\psi,y}D(h^{*}(\phi),y)\,p(\phi,\psi,y)\,d\phi\,d\psi\,dy$   (41)

where (38) holds because the conditional mean $h^{*}$ minimizes the expected Bregman loss, and (39) uses that $h^{*}$ depends on $X$ only through $\phi$. The left-hand side of (35) is $\mathbb{E}[D(h(X),Y)|e^{\prime}]$ and the last line equals $\mathbb{E}[D(h^{*}(X),Y)|e]$, which establishes (33). ∎
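To make the construction of $e^{\prime}$ concrete, here is a small Monte-Carlo sketch (our own toy instantiation, with squared loss as the Bregman divergence): a predictor $h$ fitted on an environment where $\Psi$ is spuriously predictive incurs a larger risk on the constructed environment $e^{\prime}$, where $\Psi$ is drawn independently from $Q(\Psi)$, than $h^{*}(X)=\mathbb{E}[Y|\Phi]$ incurs on $e$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Environment e: Y = phi + eps, and psi is anti-causally generated from Y
phi = rng.normal(0.0, 1.0, n)
y = phi + rng.normal(0.0, 0.3, n)
psi = y + rng.normal(0.0, 0.1, n)
h, *_ = np.linalg.lstsq(np.stack([phi, psi], axis=1), y, rcond=None)  # h leans on psi

# Environment e' from Eq. (34): same P(phi, y), psi replaced by an independent Q(psi)
psi_indep = rng.normal(0.0, 1.0, n)
risk_h_eprime = np.mean((np.stack([phi, psi_indep], axis=1) @ h - y) ** 2)
risk_hstar_e = np.mean((phi - y) ** 2)  # h*(X) = E[Y | phi] = phi in this toy model

print(risk_h_eprime, risk_hstar_e)  # the first is far larger, matching Eq. (33)
```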

B.2 Proof of Theorem 2.2

Theorem B.3.

$\mathcal{I}_{\mathcal{E}}\subseteq\mathcal{I}_{\mathcal{E}_{tr}}$

Proof.

Since $\mathcal{E}_{tr}\subseteq\mathcal{E}$, any $S\in\mathcal{I}_{\mathcal{E}}$ is invariant across the smaller set $\mathcal{E}_{tr}$ as well, hence $S\in\mathcal{I}_{\mathcal{E}_{tr}}$. ∎

B.3 Proof of Theorem 2.3

Theorem B.4.

Given a set of environments $\mathrm{supp}(\hat{\mathcal{E}})$, denote the corresponding invariance set by $\mathcal{I}_{\hat{\mathcal{E}}}$ and the corresponding maximal invariant predictor by $\hat{\Phi}$. For a newly-added environment $e_{new}$ with distribution $P^{new}(X,Y)$, if $P^{new}(Y|\hat{\Phi})=P^{e}(Y|\hat{\Phi})$ for all $e\in\mathrm{supp}(\hat{\mathcal{E}})$, then the invariance set constrained by $\mathrm{supp}(\hat{\mathcal{E}})\cup\{e_{new}\}$ is equal to $\mathcal{I}_{\hat{\mathcal{E}}}$.

Proof.

Denote the invariance set with respect to $\mathrm{supp}(\hat{\mathcal{E}})\cup\{e_{new}\}$ by $\mathcal{I}_{new}$. For any $S\in\mathcal{I}_{\hat{\mathcal{E}}}$ we have $S\in\mathcal{I}_{new}$, since the newly-added environment cannot exclude any variables from the original invariance set. Combined with the reverse inclusion $\mathcal{I}_{new}\subseteq\mathcal{I}_{\hat{\mathcal{E}}}$, which holds because adding an environment can only add invariance constraints, we obtain $\mathcal{I}_{new}=\mathcal{I}_{\hat{\mathcal{E}}}$. ∎

B.4 Proof of Theorem 4.1

Theorem B.5.

Given $\mathcal{E}_{tr}$, the learned $\Phi(X)=M\odot X$ is the maximal invariant predictor of $\mathcal{I}_{\mathcal{E}_{tr}}$.

Proof.

The objective function for $\mathcal{M}_{p}$ is

$\mathcal{L}_{p}(M\odot X,Y;\theta)=\mathbb{E}_{\mathcal{E}_{tr}}[\mathcal{L}^{e}]+\lambda\,\mathrm{trace}(\mathrm{Var}_{\mathcal{E}_{tr}}(\nabla_{\theta}\mathcal{L}^{e}))$   (43)

Here we prove that the minimum of the objective function is achieved when $\Phi(X)=M\odot X$ is the maximal invariant predictor. According to Theorem B.1, $\Phi(X)$ satisfies Assumption 2.1, which indicates that $P^{e}(Y|\Phi(X))$ stays invariant.

From the proof in C.2 of (Koyama & Yamaguchi, 2020), $I(Y;\mathcal{E}|\Phi(X))=0$ indicates that $\mathrm{trace}(\mathrm{Var}_{\mathcal{E}_{tr}}(\nabla_{\theta}\mathcal{L}^{e}))=0$.

Further, from the sufficiency property, the minimum of $\mathcal{L}^{e}$ is achieved with $\Phi(X)$. Therefore, since $\lambda\geq 0$, $\mathbb{E}_{\mathcal{E}_{tr}}[\mathcal{L}^{e}]+\lambda\,\mathrm{trace}(\mathrm{Var}_{\mathcal{E}_{tr}}(\nabla_{\theta}\mathcal{L}^{e}))$ reaches its minimum when $\Phi(X)$ is the MIP. ∎
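As a concrete illustration of the objective in Eq. (43), the sketch below evaluates it for a plain linear model with squared loss in numpy; the function name is ours, and the feature mask $M$ and the learned networks of the full framework are deliberately omitted.

```python
import numpy as np

def objective_mp(theta, envs, lam=1.0):
    """E_e[L^e] + lam * trace(Var_e(grad_theta L^e)) for a linear model, cf. Eq. (43).

    envs: list of (X_e, y_e) pairs, one per (learned) training environment.
    """
    losses, grads = [], []
    for X, y in envs:
        resid = X @ theta - y
        losses.append(np.mean(resid ** 2))          # per-environment squared loss L^e
        grads.append(2.0 * X.T @ resid / len(y))    # grad_theta L^e in closed form
    G = np.stack(grads)                             # shape: (num_envs, dim)
    penalty = np.sum(np.var(G, axis=0))             # trace of the gradient covariance
    return np.mean(losses) + lam * penalty
```

Driving the penalty term to zero forces the per-environment gradients to agree, which is how the objective steers $\theta$ toward the invariant covariates.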

B.5 Proof of Theorem 4.2

Theorem B.6.

For $e_{i},e_{j}\in\mathrm{supp}(\mathcal{E}_{tr})$, assume that $X=[\Phi^{*},\Psi^{*}]^{T}$ satisfies Assumption 2.1, where $\Phi^{*}$ is invariant and $\Psi^{*}$ is variant. Then under Assumption 4.1, we have $\mathrm{D_{KL}}(P^{e_{i}}(Y|X)\|P^{e_{j}}(Y|X))\leq\mathrm{D_{KL}}(P^{e_{i}}(Y|\Psi^{*})\|P^{e_{j}}(Y|\Psi^{*}))$.

Proof.
$\mathrm{D_{KL}}(P^{e_{i}}(Y|X)\|P^{e_{j}}(Y|X))$   (44)
$=\mathrm{D_{KL}}(P^{e_{i}}(Y|\Phi^{*},\Psi^{*})\|P^{e_{j}}(Y|\Phi^{*},\Psi^{*}))$   (45)
$=\iiint p_{i}(y,\phi,\psi)\log\left[\frac{p_{i}(y|\phi,\psi)}{p_{j}(y|\phi,\psi)}\right]dy\,d\phi\,d\psi$   (46)

Then, computing the difference between the two divergences:

$\mathrm{D_{KL}}(P^{e_{i}}(Y|\Psi^{*})\|P^{e_{j}}(Y|\Psi^{*}))-\mathrm{D_{KL}}(P^{e_{i}}(Y|X)\|P^{e_{j}}(Y|X))$   (47)
$=\iiint p_{i}(y,\phi,\psi)\left(\log\frac{p_{i}(y|\psi)}{p_{j}(y|\psi)}-\log\frac{p_{i}(y|\phi,\psi)}{p_{j}(y|\phi,\psi)}\right)dy\,d\phi\,d\psi$   (48)
$=\iiint p_{i}(y,\phi,\psi)\left(\log\frac{p_{i}(y|\psi)}{p_{i}(y|\phi,\psi)}-\log\frac{p_{j}(y|\psi)}{p_{j}(y|\phi,\psi)}\right)dy\,d\phi\,d\psi$   (49)
$=I_{i,j}^{c}(Y;\Phi^{*}|\Psi^{*})-I_{i}(Y;\Phi^{*}|\Psi^{*})$   (50)

By Assumption 4.1, this difference is non-negative, and therefore

$\mathrm{D_{KL}}(P^{e_{i}}(Y|X)\|P^{e_{j}}(Y|X))\leq\mathrm{D_{KL}}(P^{e_{i}}(Y|\Psi^{*})\|P^{e_{j}}(Y|\Psi^{*}))$   (51)

∎
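A discrete toy example (our own construction, with $Y$ depending invariantly on $\Phi^{*}$ and $\Psi^{*}$ generated anti-causally from $Y$ with environment-specific noise) illustrates the inequality numerically:

```python
import numpy as np

# P(Y=1 | phi) is shared across environments; P(psi=1 | Y) differs between them.
P_Y1_GIVEN_PHI = {0: 0.1, 1: 0.9}
P_PSI1_GIVEN_Y = {"i": {0: 0.1, 1: 0.9}, "j": {0: 0.4, 1: 0.6}}

def joint(env):
    """Joint p(phi, psi, y) with phi ~ Bernoulli(0.5)."""
    p = {}
    for phi in (0, 1):
        for y in (0, 1):
            for psi in (0, 1):
                p_y = P_Y1_GIVEN_PHI[phi] if y else 1 - P_Y1_GIVEN_PHI[phi]
                p_psi = P_PSI1_GIVEN_Y[env][y] if psi else 1 - P_PSI1_GIVEN_Y[env][y]
                p[(phi, psi, y)] = 0.5 * p_y * p_psi
    return p

def cond_kl(p_i, p_j, coords):
    """E_{p_i}[ log p_i(y|c) / p_j(y|c) ], where c are the given coordinates of (phi, psi)."""
    def cond(p, key, y):
        num = sum(v for k, v in p.items() if tuple(k[c] for c in coords) == key and k[2] == y)
        den = sum(v for k, v in p.items() if tuple(k[c] for c in coords) == key)
        return num / den
    return sum(v * np.log(cond(p_i, tuple(k[c] for c in coords), k[2])
                          / cond(p_j, tuple(k[c] for c in coords), k[2]))
               for k, v in p_i.items())

p_i, p_j = joint("i"), joint("j")
print(cond_kl(p_i, p_j, (0, 1)))  # conditioning on X = (phi, psi): ~0.095
print(cond_kl(p_i, p_j, (1,)))    # conditioning on psi only:      ~0.226
```

Conditioning on the full $X$ yields a much smaller cross-environment divergence than conditioning on $\Psi^{*}$ alone, as Theorem 4.2 predicts.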

B.6 Proof of Theorem 4.3

Theorem B.7.

Under Assumptions 2.1 and 2.2, for the proposed $\mathcal{M}_{c}$ and $\mathcal{M}_{p}$, we have the following conclusions: 1. Given environments $\mathcal{E}_{tr}$ such that $\mathcal{I}_{\mathcal{E}}=\mathcal{I}_{\mathcal{E}_{tr}}$, the $\Phi(X)$ learned by $\mathcal{M}_{p}$ is the maximal invariant predictor of $\mathcal{I}_{\mathcal{E}}$. 2. Given the maximal invariant predictor $\Phi^{*}$ of $\mathcal{I}_{\mathcal{E}}$, assume the pooled training data is made up of data from all environments in $\mathrm{supp}(\mathcal{E})$; then the invariance set $\mathcal{I}_{\mathcal{E}_{tr}}$ regularized by the learned environments $\mathcal{E}_{tr}$ is equal to $\mathcal{I}_{\mathcal{E}}$.

Proof.

For 1, according to Theorem B.5, the $\Phi(X)$ learned by $\mathcal{M}_{p}$ is the maximal invariant predictor of $\mathcal{I}_{\mathcal{E}_{tr}}$. Therefore, if $\mathcal{I}_{\mathcal{E}}=\mathcal{I}_{\mathcal{E}_{tr}}$, $\Phi(X)$ is the true maximal invariant predictor.

For 2, assume that $P_{train}(X,Y)=\sum_{e\in\mathcal{E}}w_{e}P^{e}(X,Y)$. We would like to prove that $\mathrm{D_{KL}}(P_{train}(Y|\Psi^{*})\|Q)$ reaches its minimum when the components of the mixture distribution $Q$ correspond to the distributions $P^{e}$ for $e\in\mathcal{E}$. Since the $\Phi(X)$ learned by $\mathcal{M}_{p}$ is the maximal invariant predictor of $\mathcal{I}_{\mathcal{E}}$, the corresponding $\Psi(X)$ is exactly $\Psi^{*}(X)$. Then, taking $Q^{*}=\sum_{e\in\mathcal{E}}w_{e}P^{e}(Y|\Psi^{*})$, we have for all $Q\in\mathcal{Q}$:

$\mathrm{D_{KL}}(P_{train}(Y|\Psi^{*})\|Q^{*})\leq\mathrm{D_{KL}}(P_{train}(Y|\Psi^{*})\|Q)$   (52)

Therefore, the components in $Q^{*}$ correspond to $P^{e}$ for $e\in\mathcal{E}$, which makes $\mathcal{I}_{\mathcal{E}_{tr}}$ approach $\mathcal{I}_{\mathcal{E}}$. ∎