
Causal Conceptions of Fairness and their Consequences

Hamed Nilforoshan    Johann Gaebler    Ravi Shroff    Sharad Goel
Abstract

Recent work highlights the role of causality in designing equitable decision-making algorithms. It is not immediately clear, however, how existing causal conceptions of fairness relate to one another, or what the consequences are of using these definitions as design principles. Here, we first assemble and categorize popular causal definitions of algorithmic fairness into two broad families: (1) those that constrain the effects of decisions on counterfactual disparities; and (2) those that constrain the effects of legally protected characteristics, like race and gender, on decisions. We then show, analytically and empirically, that both families of definitions almost always—in a measure theoretic sense—result in strongly Pareto dominated decision policies, meaning there is an alternative, unconstrained policy favored by every stakeholder with preferences drawn from a large, natural class. For example, in the case of college admissions decisions, policies constrained to satisfy causal fairness definitions would be disfavored by every stakeholder with neutral or positive preferences for both academic preparedness and diversity. Indeed, under a prominent definition of causal fairness, we prove the resulting policies require admitting all students with the same probability, regardless of academic qualifications or group membership. Our results highlight formal limitations and potential adverse consequences of common mathematical notions of causal fairness.


1 Introduction

Imagine designing an algorithm to guide decisions for college admissions. To help ensure algorithms such as this are broadly equitable, a plethora of formal fairness criteria have been proposed in the machine learning community (Darlington, 1971; Cleary, 1968; Zafar et al., 2017b; Dwork et al., 2012; Chouldechova, 2017; Hardt et al., 2016; Kleinberg et al., 2017; Woodworth et al., 2017; Zafar et al., 2017a; Corbett-Davies et al., 2017; Chouldechova & Roth, 2020; Berk et al., 2021). For example, under the principle that fair algorithms should have comparable performance across demographic groups (Hardt et al., 2016), one might check that among applicants who were ultimately academically “successful” (e.g., who eventually earned a college degree, either at the institution in question or elsewhere), the algorithm would recommend admission for an equal proportion of candidates across race groups. Alternatively, following the principle that decisions should be agnostic to legally protected attributes like race and gender (cf. Corbett-Davies & Goel, 2018; Dwork et al., 2012), one might mandate that these features not be provided to the algorithm.

Recent scholarship has argued for extending equitable algorithm design by adopting a causal perspective, leading to myriad additional formal criteria for fairness (Coston et al., 2020; Imai & Jiang, 2020; Imai et al., 2020; Wang et al., 2019; Kusner et al., 2017; Nabi & Shpitser, 2018; Wu et al., 2019; Mhasawade & Chunara, 2021; Kilbertus et al., 2017; Zhang & Bareinboim, 2018; Zhang et al., 2017; Chiappa, 2019; Loftus et al., 2018; Galhotra et al., 2022; Carey & Wu, 2022). Less attention, however, has been given to understanding the potential downstream consequences of using these causal definitions of fairness as algorithmic design principles, leaving an important gap to fill if these criteria are to responsibly inform policy choices.

Here we synthesize and critically examine the statistical properties and concomitant consequences of popular causal approaches to fairness. We begin, in Section 2, by proposing a two-part taxonomy for causal conceptions of fairness that mirrors the illustrative, non-causal fairness principles described above. Our first category of definitions encompasses those that consider the effect of decisions on counterfactual disparities. For example, recognizing the causal effect of college admission on later success, one might demand that among applicants who would be academically successful if admitted to a particular college, the algorithm would recommend admission for an equal proportion of candidates across race groups. The second category of definitions encompasses those that seek to limit both the direct and indirect effects of one’s group membership on decisions. For example, because one’s race might impact earlier educational opportunities, and hence test scores, one might require that admissions decisions are robust to the effect of race along such causal paths.

We show, in Section 3, that when the distribution of causal effects is known (or can be estimated), one can efficiently compute utility-maximizing decision policies constrained to satisfy each of the causal fairness criteria we consider. However, for natural families of utility functions—for example, those that prefer both higher degree attainment and more student-body diversity—we prove in Section 4 that causal fairness constraints almost always lead to strongly Pareto dominated decision policies. To establish this result, we use the theory of prevalence (Christensen, 1972; Hunt et al., 1992; Anderson & Zame, 2001; Ott & Yorke, 2005), which extends the notion of full-measure sets to infinite-dimensional vector spaces. In particular, in our running college admissions example, adhering to any of the common conceptions of causal fairness would simultaneously result in a lower number of degrees attained and lower student-body diversity, relative to what one could achieve by explicitly tailoring admissions policies to achieve desired outcomes. In fact, under one prominent definition of causal fairness, we prove that the induced policies require simply admitting all applicants with equal probability, irrespective of one’s academic qualifications or group membership. These results, we hope, elucidate the structure—and limitations—of current causal approaches to equitable decision making.

2 Causal Approaches to Fair Decision Making

We describe two broad classes of causal notions of fairness: (1) those that consider outcomes when decisions are counterfactually altered; and (2) those that consider outcomes when protected attributes are counterfactually altered. We illustrate these definitions in the context of a running example of college admissions decisions.

2.1 Problem Setup

Consider a population of individuals with observed covariates $X$, drawn i.i.d. from a set $\mathcal{X}\subseteq\mathbb{R}^{n}$ with distribution $\mathcal{D}_{X}$. Further suppose that $A\in\mathcal{A}$ describes one or more discrete protected attributes, such as race or gender, which can be derived from $X$ (i.e., $A=\alpha(X)$ for some measurable function $\alpha$). Each individual is subject to a binary decision $D\in\{0,1\}$, determined by a (randomized) rule $d(x)\in[0,1]$, where $d(x)=\Pr(D=1\mid X=x)$ is the probability of receiving a positive decision (that is, $D=\mathbb{1}_{U_{D}\leq d(X)}$, where $U_{D}$ is an independent uniform random variable). Given a budget $b$ with $0<b<1$, we require the decision rule to satisfy $\mathbb{E}[D]\leq b$, limiting the expected proportion of positive decisions.

In our running example, we imagine a population of applicants to a particular college, where $d$ denotes an admissions rule and $D$ indicates a binary admissions decision. To simplify our exposition, we assume all admitted students attend the school. In our setting, the covariates $X$ consist of an applicant's test score and race $A\in\{a_{0},a_{1}\}$, where, for notational convenience, we consider two race groups. The budget $b$ bounds the expected proportion of admitted applicants.

Assuming there is no interference between units (Imbens & Rubin, 2015), we write $Y(1)$ and $Y(0)$ to denote potential outcomes of interest under each of the two possible binary decisions, where $Y=Y(D)$ is the realized outcome. We assume that $Y(1)$ and $Y(0)$ are drawn from a (possibly infinite) set $\mathcal{Y}\subseteq\mathbb{R}$, where $|\mathcal{Y}|>1$. In our admissions example, $Y$ is a binary variable that indicates college graduation (i.e., degree attainment), with $Y(1)$ and $Y(0)$ describing, respectively, whether an applicant would attain a college degree if admitted to or if rejected from the school we consider. Note that $Y(0)$ is not necessarily zero, as a rejected applicant may attend—and graduate from—a different university.
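To make the setup concrete, the short sketch below simulates a synthetic applicant pool consistent with this notation. The particular distributions (normally distributed test scores and logistic graduation probabilities) are illustrative assumptions for the sketch only, not the data-generating process used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
b = 0.25  # budget: at most 25% of applicants admitted in expectation

# Protected attribute A (two groups) and a single test-score covariate T.
A = rng.binomial(1, 0.5, size=n)               # 0 = a0, 1 = a1
T = rng.normal(loc=np.where(A == 1, -0.3, 0.3), scale=1.0, size=n)

# Potential outcomes: graduation if admitted, Y(1), or if rejected, Y(0).
p1 = 1 / (1 + np.exp(-(T + 0.5)))              # Pr(Y(1) = 1 | T)
p0 = 1 / (1 + np.exp(-(T - 0.5)))              # Pr(Y(0) = 1 | T): may graduate elsewhere
Y1 = rng.binomial(1, p1)
Y0 = rng.binomial(1, p0)

# A randomized decision rule d(x) in [0, 1], here admitting everyone with
# probability b, and the realized decisions D = 1{U_D <= d(X)}.
d = np.full(n, b)
D = (rng.uniform(size=n) <= d).astype(int)
Y = np.where(D == 1, Y1, Y0)                   # realized outcome

print("Admit rate:", D.mean(), " Graduation rate:", Y.mean())
```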

Given this setup, our goal is to construct decision policies $d$ that are broadly equitable, formalized in part by the causal notions of fairness described below. We focus on decisions that are made algorithmically, informed by historical data on applicants and subsequent outcomes.

2.2 Limiting the Effect of Decisions on Disparities

A popular class of non-causal fairness definitions requires that error rates (e.g., false positive and false negative rates) are equal across protected groups (Hardt et al., 2016; Corbett-Davies & Goel, 2018). Causal analogues of these definitions have recently been proposed (Coston et al., 2020; Imai & Jiang, 2020; Imai et al., 2020; Mishler et al., 2021), which require various conditional independence conditions to hold between the potential outcomes, protected attributes, and decisions. (In the literature on causal fairness, there is at times ambiguity between “predictions” $\hat{Y}\in\{0,1\}$ of $Y$ and “decisions” $D\in\{0,1\}$. Following past work (e.g., Corbett-Davies et al., 2017; Kusner et al., 2017; Wang et al., 2019), here we focus exclusively on decisions, with predictions implicitly impacting decisions but not explicitly appearing in our definitions.)

Below we list three representative examples of this class of fairness definitions: counterfactual predictive parity (Coston et al., 2020), counterfactual equalized odds (Mishler et al., 2021; Coston et al., 2020), and conditional principal fairness (Imai & Jiang, 2020). (Our subsequent analytical results extend in a straightforward manner to structurally similar variants of these definitions, e.g., requiring $Y(0)\perp\!\!\!\perp A\mid D=1$ or $D\perp\!\!\!\perp A\mid Y(0)$, variants of counterfactual predictive parity and counterfactual equalized odds, respectively.)

Definition 1.

Counterfactual predictive parity holds when

$$Y(1) \perp\!\!\!\perp A \mid D = 0. \qquad (1)$$

In our college admissions example, counterfactual predictive parity means that among rejected applicants, the proportion who would have attained a college degree, had they been accepted, is equal across race groups.

Definition 2.

Counterfactual equalized odds holds when

$$D \perp\!\!\!\perp A \mid Y(1). \qquad (2)$$

In our running example, counterfactual equalized odds is satisfied when two conditions hold: (1) among applicants who would graduate if admitted (i.e., $Y(1)=1$), students are admitted at the same rate across race groups; and (2) among applicants who would not graduate if admitted (i.e., $Y(1)=0$), students are again admitted at the same rate across race groups.

Definition 3.

Conditional principal fairness holds when

$$D \perp\!\!\!\perp A \mid Y(0), Y(1), W, \qquad (3)$$

where, for a measurable function $\omega$ on $\mathcal{X}$, $W=\omega(X)$ describes a reduced set of the covariates $X$. When $W$ is constant (or, equivalently, when we do not condition on $W$), this condition is called principal fairness.

In our example, conditional principal fairness means that “similar” applicants—where similarity is defined by the potential outcomes and covariates $W$—are admitted at the same rate across race groups.
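When the potential outcomes are observed, as in a simulation with oracle access to $Y(0)$ and $Y(1)$, the independence conditions in Definitions 1-3 can be checked by comparing conditional rates across groups. A minimal sketch, written as a standalone function; the arrays passed in (e.g., those produced by the simulation sketch above) are assumed to be binary for simplicity.

```python
import numpy as np

def check_causal_fairness(A, D, Y0, Y1):
    """Print the empirical conditional rates underlying Definitions 1-3,
    assuming binary groups, decisions, and outcomes (oracle potential outcomes)."""
    rate = lambda mask, v: v[mask].mean() if mask.any() else float("nan")

    # Counterfactual predictive parity (Def. 1): Y(1) independent of A given D = 0.
    for a in (0, 1):
        print(f"Pr(Y(1)=1 | D=0, A={a}) = {rate((D == 0) & (A == a), Y1):.3f}")

    # Counterfactual equalized odds (Def. 2): D independent of A given Y(1).
    for y in (0, 1):
        for a in (0, 1):
            print(f"Pr(D=1 | Y(1)={y}, A={a}) = {rate((Y1 == y) & (A == a), D):.3f}")

    # Principal fairness (Def. 3 with constant W): D independent of A given (Y(0), Y(1)).
    for y0 in (0, 1):
        for y1 in (0, 1):
            for a in (0, 1):
                m = (Y0 == y0) & (Y1 == y1) & (A == a)
                print(f"Pr(D=1 | Y(0)={y0}, Y(1)={y1}, A={a}) = {rate(m, D):.3f}")

# Example usage with the arrays A, D, Y0, Y1 from the earlier simulation sketch:
# check_causal_fairness(A, D, Y0, Y1)
```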

2.3 Limiting the Effect of Attributes on Decisions

An alternative causal framework for understanding fairness considers the effects of protected attributes on decisions (Wang et al., 2019; Kusner et al., 2017; Nabi & Shpitser, 2018; Wu et al., 2019; Mhasawade & Chunara, 2021; Kilbertus et al., 2017; Zhang & Bareinboim, 2018; Zhang et al., 2017). This approach, which can be understood as codifying the legal notion of disparate treatment (Goel et al., 2017; Zafar et al., 2017a), considers a decision rule to be fair if, at a high level, decisions for individuals are the same in “(a) the actual world and (b) a counterfactual world where the individual belonged to a different demographic group” (Kusner et al., 2017). Conceptualizing a general causal effect of an immutable characteristic such as race or gender is rife with challenges, the greatest of which is expressed by the mantra, “no causation without manipulation” (Holland, 1986). In particular, analyzing race as a causal treatment requires one to specify what exactly is meant by “changing an individual’s race” from, for example, white to Black (Gaebler et al., 2022; Hu & Kohler-Hausmann, 2020). Such difficulties can sometimes be addressed by considering a change in the perception of race by a decision maker (Greiner & Rubin, 2011)—for instance, by changing the name listed on an employment application (Bertrand & Mullainathan, 2004), or by masking an individual’s appearance (Goldin & Rouse, 2000; Grogger & Ridgeway, 2006; Pierson et al., 2020; Chohlas-Wood et al., 2021b).

In contrast to “fairness through unawareness”—in which race and other protected attributes are barred from being an explicit input to a decision rule (cf. Dwork et al., 2012; Corbett-Davies & Goel, 2018)—the causal versions of this idea consider both the direct and indirect effects of protected attributes on decisions. For example, even if decisions only directly depend on test scores, race may indirectly impact decisions through its effects on educational opportunities, which in turn influence test scores. This idea can be formalized by requiring that decisions remain the same in expectation even if one’s protected characteristics are counterfactually altered, a condition known as counterfactual fairness  (Kusner et al., 2017).

Definition 4.

Counterfactual fairness holds when

$$\mathbb{E}[D(a')\mid X]=\mathbb{E}[D\mid X], \qquad (4)$$

where $D(a')$ denotes the decision when one's protected attributes are counterfactually altered to be any $a'\in\mathcal{A}$.

In our running example, this means that for each group of observationally identical applicants (i.e., those with the same values of $X$, meaning identical race and test score), the proportion of students who are actually admitted is the same as the proportion who would be admitted if their race were counterfactually altered.

Counterfactual fairness aims to limit all direct and indirect effects of protected traits on decisions. In a generalization of this criterion—termed path-specific fairness (Chiappa, 2019; Nabi & Shpitser, 2018; Zhang et al., 2017; Wu et al., 2019)—one allows protected traits to influence decisions along certain causal paths but not others. For example, one may wish to allow the direct consideration of race by an admissions committee to implement an affirmative action policy, while also guarding against any indirect influence of race on admissions decisions that may stem from cultural biases in standardized tests (Williams, 1983).

Figure 1: A causal DAG illustrating a hypothetical process for college admissions, with nodes $A$ (Race), $E$ (Education), $T$ (Test Score), $M$ (Preparation), $D$ (Decision), and $Y$ (Graduation). Under path-specific fairness, one may require, for example, that race does not affect decisions along the path highlighted in red.

The formal definition of path-specific fairness requires specifying a causal DAG describing relationships between attributes (both observed covariates and latent variables), decisions, and outcomes. In our running example of college admissions, we imagine that each individual's observed covariates are the result of the process illustrated by the causal DAG in Figure 1. In this graph, an applicant's race $A$ influences the educational opportunities $E$ available to them prior to college; and educational opportunities in turn influence an applicant's level of college preparation, $M$, as well as their score on a standardized admissions test, $T$, such as the SAT. We assume the admissions committee only observes an applicant's race and test score so that $X=(A,T)$, and makes their decision $D$ based on these attributes. Finally, whether or not an admitted student subsequently graduates (from any college), $Y$, is a function of both their preparation and whether they were admitted. (In practice, the racial composition of an admitted class may itself influence degree attainment, if, for example, diversity provides a net benefit to students (Page, 2007). Here, for simplicity, we avoid consideration of such peer effects.)

To define path-specific fairness, we start by defining, for the decision $D$, path-specific counterfactuals, a general concept in causal DAGs (cf. Pearl, 2001). Suppose $\mathcal{G}=(\mathcal{V},\mathcal{U},\mathcal{F})$ is a causal model with nodes $\mathcal{V}$, exogenous variables $\mathcal{U}$, and structural equations $\mathcal{F}$ that define the value at each node $V_{j}$ as a function of its parents $\wp(V_{j})$ and its associated exogenous variable $U_{j}$. (See, for example, Pearl (2009a) for further details on causal DAGs.) Let $V_{1},\dots,V_{m}$ be a topological ordering of the nodes, meaning that $\wp(V_{j})\subseteq\{V_{1},\dots,V_{j-1}\}$ (i.e., the parents of each node appear in the ordering before the node itself). Let $\Pi$ denote a collection of paths from node $A$ to $D$. Now, for two possible values $a$ and $a'$ for the variable $A$, the path-specific counterfactuals $D_{\Pi,a,a'}$ for the decision $D$ are generated by traversing the list of nodes in topological order, propagating counterfactual values obtained by setting $A=a'$ along paths in $\Pi$, and otherwise propagating values obtained by setting $A=a$. (In Algorithm 1 in the Appendix, we formally define path-specific counterfactuals for an arbitrary node—or collection of nodes—in the DAG.)

To see this idea in action, we work out an illustrative example, computing path-specific counterfactuals for the decision $D$ along the single path $\Pi=\{A\rightarrow E\rightarrow T\rightarrow D\}$ linking race to the admissions committee's decision through test score, highlighted in red in Figure 1. In the system of equations below, the first column corresponds to draws $V^{*}$ for each node $V$ in the DAG, where we set $A$ to $a$, and then propagate that value as usual. The second column corresponds to draws $\overline{V}^{*}$ of path-specific counterfactuals, where we set $A$ to $a'$, and then propagate the counterfactuals only along the path $A\rightarrow E\rightarrow T\rightarrow D$. In particular, the value for the test score $\overline{T}^{*}$ is computed using the value of $\overline{E}^{*}$ (since the edge $E\rightarrow T$ is on the specified path) and the value of $M^{*}$ (since the edge $M\rightarrow T$ is not on the path). As a result of this process, we obtain a draw $\overline{D}^{*}$ from the distribution of $D_{\Pi,a,a'}$.

$$\begin{aligned}
A^{*} &= a, & \overline{A}^{*} &= a', \\
E^{*} &= f_{E}(A^{*}), & \overline{E}^{*} &= f_{E}(\overline{A}^{*}), \\
M^{*} &= f_{M}(E^{*}), & \overline{M}^{*} &= f_{M}(E^{*}), \\
T^{*} &= f_{T}(E^{*},M^{*}), & \overline{T}^{*} &= f_{T}(\overline{E}^{*},M^{*}), \\
D^{*} &= f_{D}(A^{*},T^{*}), & \overline{D}^{*} &= f_{D}(A^{*},\overline{T}^{*}).
\end{aligned}$$
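The two-column propagation above translates directly into code. In the sketch below, the structural equations $f_E$, $f_M$, $f_T$, and $f_D$ are placeholder assumptions chosen only to make the example runnable; the point is the control flow: counterfactual values are propagated along the path $A \rightarrow E \rightarrow T \rightarrow D$, while factual values are used everywhere else.

```python
import numpy as np

rng = np.random.default_rng(1)

# Placeholder structural equations (illustrative assumptions, not the paper's model).
f_E = lambda a, u: 1.0 * (a == 1) + u              # educational opportunities
f_M = lambda e, u: 0.8 * e + u                     # college preparation
f_T = lambda e, m, u: 0.5 * e + 0.5 * m + u        # test score
f_D = lambda a, t, u, d: int(u <= d(a, t))         # randomized admissions decision

def path_specific_decision_draw(a, a_prime, d):
    """One draw of (D*, D-bar*) for the single path A -> E -> T -> D."""
    u_E, u_M, u_T, u_D = rng.normal(), rng.normal(), rng.normal(), rng.uniform()
    E = f_E(a, u_E)
    E_bar = f_E(a_prime, u_E)          # A -> E is on the path: use a'
    M = f_M(E, u_M)                    # E -> M is off the path: M-bar* = M*
    T = f_T(E, M, u_T)
    T_bar = f_T(E_bar, M, u_T)         # E -> T on the path, M -> T off the path
    D = f_D(a, T, u_D, d)
    D_bar = f_D(a, T_bar, u_D, d)      # A -> D off the path, T -> D on the path
    return D, D_bar

# Example: an admissions rule d(a, t) thresholding on the test score.
d = lambda a, t: 1.0 if t > 0.5 else 0.1
print(path_specific_decision_draw(0, 1, d))
```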

Path-specific fairness formalizes the intuition that the influence of a sensitive attribute on a downstream decision may, in some circumstances, be considered legitimate (i.e., it may be acceptable for the attribute to affect decisions along certain paths in the DAG). For instance, an admissions committee may believe that the effect of race $A$ on admissions decisions $D$ which passes through college preparation $M$ is legitimate, whereas the effect of race along the path $A\rightarrow E\rightarrow T\rightarrow D$, which may reflect access to test prep or cultural biases of the tests, rather than actual academic preparedness, is illegitimate. In that case, the admissions committee may seek to ensure that the proportion of applicants they admit from a certain race group remains unchanged if one were to counterfactually alter the race of those individuals along the path $\Pi=\{A\rightarrow E\rightarrow T\rightarrow D\}$.

Definition 5.

Let $\Pi$ be a collection of paths, and, for a measurable function $\omega$ on $\mathcal{X}$, let $W=\omega(X)$ describe a reduced set of the covariates $X$. Path-specific fairness, also called $\Pi$-fairness, holds when, for any $a'\in\mathcal{A}$,

$$\mathbb{E}[D_{\Pi,A,a'}\mid W]=\mathbb{E}[D\mid W]. \qquad (5)$$

In the definition above, rather than a particular counterfactual level $a$, the baseline level of the path-specific effect is $A$, i.e., an individual's actual (non-counterfactually altered) group membership (e.g., their actual race). We have implicitly assumed that the decision variable $D$ is a descendant of the covariates $X$. In particular, without loss of generality, we assume $D$ is defined by the structural equation $f_{D}(x,u_{D})=\mathbb{1}_{u_{D}\leq d(x)}$, where the exogenous variable $u_{D}\sim\operatorname{Unif}(0,1)$, so that $\Pr(D=1\mid X=x)=d(x)$. If $\Pi$ is the set of all paths from $A$ to $D$, then $D_{\Pi,A,a'}=D(a')$, in which case, for $W=X$, path-specific fairness is the same as counterfactual fairness.

3 Constructing Causally Fair Policies

The definitions of causal fairness above constrain the set of decision policies one might adopt, but, in general, they do not yield a unique policy. For instance, a policy in which applicants are admitted randomly and independently with probability $b$—where $b$ is the specified budget—satisfies counterfactual equalized odds (Def. 2), conditional principal fairness (Def. 3), counterfactual fairness (Def. 4), and path-specific fairness (Def. 5). (A policy satisfying counterfactual predictive parity (Def. 1) is not guaranteed to exist. For example, if $b=0$—in which case $D=0$ a.s.—and $\mathbb{E}[Y(1)\mid A=a_{1}]\neq\mathbb{E}[Y(1)\mid A=a_{2}]$, then Eq. (1) cannot hold. Similar counterexamples can be constructed for $b\ll 1$.)

However, such a randomized policy may be sub-optimal in the eyes of decision-makers aiming to maximize outcomes such as class diversity or degree attainment. Past work has described multiple approaches to selecting a single policy from among those satisfying any given fairness definition, including maximizing concordance of the decision with the outcome variable (Nabi & Shpitser, 2018; Chiappa, 2019) or with an existing policy (Wang et al., 2019) (e.g., in terms of binary accuracy or KL-divergence).

Here, as we are primarily interested in the downstream consequences of various causal fairness definitions, we consider causally fair policies that maximize utility (Liu et al., 2018; Kasy & Abebe, 2021; Corbett-Davies et al., 2017; Cai et al., 2020; Chohlas-Wood et al., 2021a).

Suppose $u(x)$ denotes the utility of assigning a positive decision to individuals with observed covariate values $x$, relative to assigning them negative decisions. In our running example, we set

$$u(x)=\mathbb{E}[Y(1)\mid X=x]+\lambda\cdot\mathbb{1}_{\alpha(x)=a_{1}}, \qquad (6)$$

where $\mathbb{E}[Y(1)\mid X=x]$ denotes the likelihood the applicant would graduate if admitted, $\mathbb{1}_{\alpha(x)=a_{1}}$ indicates whether the applicant identifies as belonging to race group $a_{1}$ (e.g., $a_{1}$ may denote a group historically underrepresented in higher education), and $\lambda\geq 0$ is an arbitrary constant that balances preferences for both student graduation and racial diversity.

We seek decision policies that maximize expected utility, subject to satisfying a given definition of causal fairness, as well as the budget constraint. Specifically, letting $\mathcal{C}$ denote the family of all decision policies that satisfy one of the causal fairness definitions listed above, a utility-maximizing policy $d^{*}$ is given by

$$\begin{aligned}
d^{*}\in\arg\max_{d\in\mathcal{C}} &\quad \mathbb{E}[d(X)\cdot u(X)] \\
\text{s.t.} &\quad \mathbb{E}[d(X)]\leq b.
\end{aligned} \qquad (7)$$

Constructing optimal policies poses both statistical and computational challenges. One must, in general, estimate the joint distribution of covariates and potential outcomes—and, even more dauntingly, causal effects along designated paths for path-specific definitions of fairness. In some settings, it may be possible to obtain these estimates from observational analyses of historical data or randomized controlled trials, though both approaches typically involve substantial hurdles in practice.

We prove that if one has this statistical information, it is possible to efficiently compute causally fair utility-maximizing policies by solving either a single linear program or a series of linear programs (Appendix, Theorem B.1). In the case of counterfactual equalized odds, conditional principal fairness, counterfactual fairness, and path-specific fairness, we show that the definitions can be translated to linear constraints. For counterfactual predictive parity, the defining independence condition yields a quadratic constraint, which we show can be expressed as a linear constraint by further conditioning on one of the decision variables, and the optimization problem in turn can be solved through a series of linear programs.
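To illustrate how such a program looks in the simplest case, the sketch below solves Eq. (7) for a toy finite covariate space under the counterfactual equalized odds constraint, which reduces to linear equalities in the decision probabilities $d(x)$. The distribution, outcome probabilities, and utilities are made-up inputs, and the formulation is a simplified stand-in for the construction in Theorem B.1, not a reproduction of it.

```python
import numpy as np
from scipy.optimize import linprog

# Toy finite covariate space: each x is a (group, test-score bin) cell.
# p[x] = Pr(X = x); grp[x] = group label; q1[x] = Pr(Y(1) = 1 | X = x);
# u[x] = utility of admission (Eq. 6). All values are illustrative assumptions.
p   = np.array([0.15, 0.20, 0.15, 0.20, 0.15, 0.15])
grp = np.array([0, 0, 0, 1, 1, 1])
q1  = np.array([0.3, 0.5, 0.8, 0.2, 0.4, 0.7])
lam, b = 0.25, 0.3
u = q1 + lam * (grp == 1)

# Counterfactual equalized odds: Pr(D=1 | Y(1)=y, A=a) equal across groups for y = 0, 1.
A_eq, b_eq = [], []
for y in (0, 1):
    qy = q1 if y == 1 else 1 - q1
    row = np.zeros(len(p))
    for a, sign in ((0, 1.0), (1, -1.0)):
        mask = grp == a
        row[mask] = sign * p[mask] * qy[mask] / np.sum(p[mask] * qy[mask])
    A_eq.append(row)
    b_eq.append(0.0)

res = linprog(
    c=-(p * u),               # maximize E[d(X) u(X)]  <=>  minimize its negative
    A_ub=[p], b_ub=[b],       # budget constraint E[d(X)] <= b
    A_eq=A_eq, b_eq=b_eq,     # fairness constraints
    bounds=[(0, 1)] * len(p), # d(x) is a probability
)
print("Constrained utility-maximizing policy d(x):", np.round(res.x, 3))
```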

4 The Structure of Causally Fair Policies

Above, for each definition of causal fairness, we sketched how to construct utility-maximizing policies that satisfy the corresponding constraints. Now we explore the structural properties of causally fair policies. We show—both empirically and analytically, under relatively mild distributional assumptions—that policies constrained to be causally fair are disfavored by every individual in a natural class of decision makers with varying preferences for diversity. To formalize these results, we start by introducing some notation and then defining the concept of (strong) Pareto dominance.

4.1 Pareto Dominance and Consistent Utilities

For a real-valued utility function $u$ and decision policy $d$, we write $u(d)=\mathbb{E}[d(X)\cdot u(X)]$ to denote the utility of $d$ under $u$.

Definition 6.

For a budget $b$, we say a decision policy $d$ is feasible if $\mathbb{E}[d(X)]\leq b$.

Given a collection of utility functions encoding the preferences of different individuals, we say a decision policy $d$ is Pareto dominated if there exists a feasible alternative $d'$ such that none of the decision makers prefers $d$ over $d'$, and at least one decision maker strictly prefers $d'$ over $d$, a property formalized in Definition 7.

Definition 7.

Suppose $\mathcal{U}$ is a collection of utility functions. A decision policy $d$ is Pareto dominated if there exists a feasible alternative $d'$ such that $u(d')\geq u(d)$ for all $u\in\mathcal{U}$, and there exists $u'\in\mathcal{U}$ such that $u'(d')>u'(d)$. A policy $d$ is strongly Pareto dominated if there exists a feasible alternative $d'$ such that $u(d')>u(d)$ for all $u\in\mathcal{U}$. A policy $d$ is Pareto efficient if it is feasible and not Pareto dominated, and the Pareto frontier is the set of Pareto efficient policies.

To develop intuition about the structure of causally fair decision policies, we continue working through our illustrative example of college admissions. We consider a collection of decision makers with utilities $\mathcal{U}$ of the form in Eq. (6), for $\lambda\geq 0$. In this example, decision makers differ in their preferences for diversity (as determined by $\lambda$), but otherwise have similar preferences. We call such a collection of utilities consistent modulo $\alpha$.

Definition 8.

We say that a set of utilities $\mathcal{U}$ is consistent modulo $\alpha$ if, for any $u,u'\in\mathcal{U}$:

  1. For any $x$, $\operatorname{sign}(u(x))=\operatorname{sign}(u'(x))$;

  2. For any $x_{1}$ and $x_{2}$ such that $\alpha(x_{1})=\alpha(x_{2})$, $u(x_{1})>u(x_{2})$ if and only if $u'(x_{1})>u'(x_{2})$.

For consistent utilities, the Pareto frontier takes a particularly simple form, represented by (a subset of) group-specific threshold policies.

Proposition 1.

Suppose $\mathcal{U}$ is a set of utilities that is consistent modulo $\alpha$. Then any Pareto efficient decision policy $d$ is a multiple threshold policy. That is, for any $u\in\mathcal{U}$, there exist group-specific constants $t_{a}\geq 0$ such that, a.s.:

$$d(x)=\begin{cases}1 & u(x)>t_{\alpha(x)},\\ 0 & u(x)<t_{\alpha(x)}.\end{cases} \qquad (8)$$

The proof of Proposition 1 is in the Appendix. In the statement of the proposition, we do not specify what happens at the thresholds $u(x)=t_{\alpha(x)}$ themselves, as one can typically ignore the exact manner in which decisions are made at the threshold. Specifically, given a threshold policy $d$, we can construct a standardized threshold policy $d'$ that is constant within group at the threshold (i.e., $d'(x)=c_{\alpha(x)}$ when $u(x)=t_{\alpha(x)}$), and for which: (1) $\mathbb{E}[d'(X)\mid A]=\mathbb{E}[d(X)\mid A]$; and (2) $u(d')=u(d)$. In our running example, this means we can standardize threshold policies so that applicants at the threshold are admitted with the same group-specific probability.
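To illustrate the multiple threshold structure in Eq. (8), the sketch below sweeps the diversity weight $\lambda$ over a small grid: for each $\lambda$, admitting the top-$u(x)$ applicants up to the budget yields a policy that, within each group, is a threshold rule on the graduation probability $r(x)$. The simulated population is an illustrative assumption, and ties at the threshold are broken arbitrarily rather than standardized as described above.

```python
import numpy as np

rng = np.random.default_rng(2)
n, b = 10_000, 0.25
k = int(b * n)  # number of admits allowed by the budget

# Illustrative population: group a1 has a lower average graduation probability r(x).
grp = rng.binomial(1, 0.5, size=n)
r = rng.beta(np.where(grp == 1, 2.0, 3.0), 3.0)   # r(x) = Pr(Y(1) = 1 | X = x)

for lam in (0.0, 0.1, 0.25, 0.5, 1.0):
    u = r + lam * (grp == 1)                      # utility in Eq. (6)
    admit = np.zeros(n, dtype=bool)
    admit[np.argsort(-u)[:k]] = True              # admit the top-u applicants
    # Within each group, admitted applicants are exactly those above a
    # group-specific cutoff on r(x): a multiple threshold policy.
    cutoffs = [r[admit & (grp == a)].min() if (admit & (grp == a)).any() else np.nan
               for a in (0, 1)]
    print(f"lambda={lam:4.2f}  admits from a1={admit[grp == 1].sum():5d}  "
          f"expected graduates={r[admit].sum():7.1f}  cutoffs on r={np.round(cutoffs, 3)}")
```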

4.2 An Empirical Example

With these preliminaries in place, we now empirically explore the structure of causally fair decision policies in the context of our stylized example of college admissions, given by the causal DAG in Figure 1. In the hypothetical pool of 100,000 applicants we consider, applicants in the target race group $a_{1}$ have, on average, fewer educational opportunities than those applicants in group $a_{0}$, which leads to lower average academic preparedness, as well as lower average test scores. See Section C in the Appendix for additional details, including the specific structural equations we use.

Figure 2: Utility-maximizing policies for various definitions of causal fairness in an illustrative example of college admissions, with the Pareto frontier depicted by the solid purple curve. For path-specific fairness, we set $\Pi$ equal to the single path $A\rightarrow E\rightarrow T\rightarrow D$, and set $W=X$. For each causal fairness definition, the depicted constrained policies are strongly Pareto dominated, meaning there is an alternative feasible policy that simultaneously achieves greater student-body diversity and higher college degree attainment. Our analytical results show, more generally, that under mild distributional assumptions, every policy constrained to satisfy these causal fairness definitions is strongly Pareto dominated.

For the utility function in Eq. (6) with $\lambda=\tfrac{1}{4}$, we apply Theorem B.1 to compute utility-maximizing policies for each of the above causal definitions of fairness. We plot the results in Figure 2, where, for each policy, the horizontal axis shows the expected number of admitted applicants from the target race group, and the vertical axis shows the expected number of college graduates. Additionally, for the family of utilities $\mathcal{U}$ given by Eq. (6) for $\lambda\geq 0$, we depict the Pareto frontier by the solid purple curve, computed via Proposition 1. For all the cases we consider, the optimal policies admit the maximum proportion of students allowed under the budget $b$ (i.e., $\Pr(D=1)=b$). To compute the Pareto frontier in Figure 2, it is sufficient—by Proposition 1 and the standardization of threshold policies described above—to sweep over (standardized) group-specific threshold policies relative to the utility $u_{0}(x)=\mathbb{E}[Y(1)\mid X=x]$. For reference, the dashed purple line corresponds to max-utility policies constrained to satisfy the level of diversity indicated on the $x$-axis, though these policies are not on the Pareto frontier, as they result in fewer college graduates and lower diversity than the policy that maximizes graduation alone (indicated by the “max graduation” point in Figure 2).

For each fairness definition, the depicted policies are strongly Pareto dominated, meaning that there is an alternative feasible policy favored by all decision makers with preferences in $\mathcal{U}$. In particular, for each definition of causal fairness, there is an alternative feasible policy in which one simultaneously achieves more student-body diversity and more college graduates. In some instances, the efficiency gap is quite stark. Utility-maximizing policies constrained to satisfy either counterfactual fairness or path-specific fairness require one to admit each applicant independently with fixed probability $b$ (where $b$ is the budget), regardless of academic preparedness or group membership. (For path-specific fairness, we set $\Pi$ equal to the single path $A\rightarrow E\rightarrow T\rightarrow D$, and set $W=X$ in this example.) These results show that constraining decision-making algorithms to satisfy popular definitions of causal fairness can have unintended consequences, and may even harm the very groups they were ostensibly designed to protect.

4.3 The Statistical Structure of Causally Fair Policies

The patterns illustrated in Figure 2 and discussed above are not idiosyncrasies of our particular example, but rather hold quite generally. Indeed, Theorem 1 shows that for almost every joint distribution of $X$, $Y(0)$, and $Y(1)$ such that $u(X)$ has a density, any decision policy satisfying counterfactual equalized odds or conditional principal fairness is Pareto dominated. Similarly, for almost every joint distribution of $X$ and $X_{\Pi,A,a}$, we show that policies satisfying path-specific fairness (including counterfactual fairness) are Pareto dominated. (NB: The analogous statement for counterfactual predictive parity is not true, which we address in Proposition 2.)

The notion of almost every distribution that we use here was formalized by Christensen (1972), Hunt et al. (1992), Anderson & Zame (2001), and others (cf. Ott & Yorke, 2005, for a review). Suppose, for a moment, that combinations of covariates and outcomes take values in a finite set of size $m$. Then the space of joint distributions on covariates and outcomes can be represented by the unit $(m-1)$-simplex: $\Delta^{m-1}=\{p\in\mathbb{R}^{m}\mid p_{i}\geq 0\ \text{and}\ \sum_{i=1}^{m}p_{i}=1\}$. Since $\Delta^{m-1}$ is a subset of an $(m-1)$-dimensional hyperplane in $\mathbb{R}^{m}$, it inherits the usual Lebesgue measure on $\mathbb{R}^{m-1}$. In this finite-dimensional setting, almost every distribution means a subset of distributions that has full Lebesgue measure on the simplex. Given a property that holds for almost every distribution in this sense, that property holds almost surely under any probability distribution on the space of distributions that is described by a density on the simplex. We use a generalization of this basic idea that extends to infinite-dimensional spaces, allowing us to consider distributions with arbitrary support. (See the Appendix for further details.)

To prove this result, we make relatively mild restrictions on the set of distributions and utilities we consider to exclude degenerate cases, as formalized by Definition 9 below.

Definition 9.

Let $\mathcal{G}$ be a collection of functions from $\mathcal{Z}$ to $\mathbb{R}^{d}$ for some set $\mathcal{Z}$. We say that a distribution of $Z$ on $\mathcal{Z}$ is $\mathcal{G}$-fine if $g(Z)$ has a density for all $g\in\mathcal{G}$.

In particular, $\mathcal{U}$-fineness ensures that the distribution of $u(X)$ has a density. In the absence of $\mathcal{U}$-fineness, corner cases can arise in which an especially large number of policies may be Pareto efficient, in particular when $u(X)$ has large atoms and $X$ can be used to predict the potential outcomes $Y(0)$ and $Y(1)$ even after conditioning on $u(X)$. See Prop. E.7 for details. Our example of college admissions, where $\mathcal{U}$ is defined by Eq. (6), is $\mathcal{U}$-fine.

Theorem 1.

Suppose $\mathcal{U}$ is a set of utilities consistent modulo $\alpha$. Further suppose that for all $a\in\mathcal{A}$ there exist a $\mathcal{U}$-fine distribution of $X$ and a utility $u\in\mathcal{U}$ such that $\Pr(u(X)>0,A=a)>0$, where $A=\alpha(X)$. Then,

  • For almost every $\mathcal{U}$-fine distribution of $X$ and $Y(1)$, any decision policy satisfying counterfactual equalized odds is strongly Pareto dominated.

  • If $|\operatorname{Img}(\omega)|<\infty$ and there exists a $\mathcal{U}$-fine distribution of $X$ such that $\Pr(A=a,W=w)>0$ for all $a\in\mathcal{A}$ and $w\in\operatorname{Img}(\omega)$, where $W=\omega(X)$, then, for almost every $\mathcal{U}$-fine joint distribution of $X$, $Y(0)$, and $Y(1)$, any decision policy satisfying conditional principal fairness is strongly Pareto dominated.

  • If $|\operatorname{Img}(\omega)|<\infty$ and there exists a $\mathcal{U}$-fine distribution of $X$ such that $\Pr(A=a,W=w_{i})>0$ for all $a\in\mathcal{A}$ and some distinct $w_{0},w_{1}\in\operatorname{Img}(\omega)$, then, for almost every $\mathcal{U}^{\mathcal{A}}$-fine joint distribution of $A$ and the counterfactuals $X_{\Pi,A,a'}$, any decision policy satisfying path-specific fairness is strongly Pareto dominated. (Here, $u^{\mathcal{A}}:(x_{a})_{a\in\mathcal{A}}\mapsto(u(x_{a}))_{a\in\mathcal{A}}$ and $\mathcal{U}^{\mathcal{A}}$ is the set of $u^{\mathcal{A}}$ for $u\in\mathcal{U}$. In other words, the requirement is that the joint distribution of the $u(X_{\Pi,A,a})$ has a density.)

The proof of Theorem 1 is given in the Appendix. At a high level, the proof proceeds in three steps, which we outline below using the example of counterfactual equalized odds. First, we show that for almost every fixed $\mathcal{U}$-fine joint distribution $\mu$ of $X$ and $Y(1)$ there is at most one policy $d^{*}(x)$ satisfying counterfactual equalized odds that is not strongly Pareto dominated. To see why, note that for any specific $y_{0}$, since counterfactual equalized odds requires that $D\perp\!\!\!\perp A\mid Y(1)=y_{0}$, setting the threshold for one group determines the thresholds for all the others; the budget constraint then can be used to fix the threshold for the original group. Second, we construct a “slice” around $\mu$ such that for any distribution $\nu$ in the slice, $d^{*}(x)$ is still the only policy that can potentially lie on the Pareto frontier while satisfying counterfactual equalized odds. We create the slice by strategically perturbing $\mu$ only where $Y(1)=y_{1}$, for some $y_{1}\neq y_{0}$. This perturbation moves mass from one side of the thresholds of $d^{*}(x)$ to the other, consequently breaking the balance requirement $D\perp\!\!\!\perp A\mid Y(1)=y_{1}$ for almost every $\nu$ in the slice. This phenomenon is similar to the problem of infra-marginality (Simoiu et al., 2017; Ayres, 2002), which likewise afflicts non-causal notions of fairness (Corbett-Davies et al., 2017; Corbett-Davies & Goel, 2018). Finally, we appeal to the notion of prevalence to stitch the slices together, showing that for almost every distribution, any policy satisfying counterfactual equalized odds is strongly Pareto dominated. Analogous versions of this general argument apply to the cases of conditional principal fairness and path-specific fairness. (This argument does not depend in an essential way on the definitions being causal. In Corollary E.5, we show an analogous result for the non-counterfactual version of equalized odds.)

In some common settings, path-specific fairness with $W=X$ constrains decisions so severely that the only allowable policies are constant (i.e., $d(x_{1})=d(x_{2})$ for all $x_{1},x_{2}\in\mathcal{X}$). For instance, in our running example, path-specific fairness requires admitting all applicants with the same probability, irrespective of academic preparation or group membership. Thus, all applicants are admitted with probability $b$, where $b$ is the budget, under the optimal policy constrained to satisfy path-specific fairness.

To build intuition for this result, we sketch the argument for a finite covariate space $\mathcal{X}$. Given a policy $d$ that satisfies path-specific fairness, select $x^{*}\in\arg\max_{x\in\mathcal{X}}d(x)$.

By the definition of path-specific fairness, for any $a\in\mathcal{A}$,

$$d(x^{*}) = \mathbb{E}[D_{\Pi,A,a}\mid X=x^{*}] = \sum_{x\in\alpha^{-1}(a)} d(x)\cdot\Pr(X_{\Pi,A,a}=x\mid X=x^{*}). \qquad (9)$$

That is, the probability of an individual with covariates $x^{*}$ receiving a positive decision must be the average probability of the individuals with covariates $x$ in group $a$ receiving a positive decision, weighted by the probability that an individual with covariates $x^{*}$ in the real world would have covariates $x$ counterfactually.

Next, we suppose that there exists an $a'\in\mathcal{A}$ such that $\Pr(X_{\Pi,A,a'}=x\mid X=x^{*})>0$ for all $x\in\alpha^{-1}(a')$. In this case, because $d(x)\leq d(x^{*})$ for all $x\in\mathcal{X}$, Eq. (9) shows that in fact $d(x)=d(x^{*})$ for all $x\in\alpha^{-1}(a')$.

Now, let $x'$ be arbitrary. Again, by the definition of path-specific fairness, we have that

$$\begin{aligned}
d(x') &= \mathbb{E}[D_{\Pi,A,a'}\mid X=x'] \\
&= \sum_{x\in\alpha^{-1}(a')} d(x)\cdot\Pr(X_{\Pi,A,a'}=x\mid X=x') \\
&= \sum_{x\in\alpha^{-1}(a')} d(x^{*})\cdot\Pr(X_{\Pi,A,a'}=x\mid X=x') \\
&= d(x^{*}),
\end{aligned}$$

where the third equality uses the fact that $d(x)=d(x^{*})$ for all $x\in\alpha^{-1}(a')$, and the final equality uses the fact that $X_{\Pi,A,a'}$ is supported on $\alpha^{-1}(a')$.

Theorem 2 formalizes and extends this argument to more general settings, where $\Pr(X_{\Pi,A,a'}=x\mid X=x^{*})$ is not necessarily positive for all $x\in\alpha^{-1}(a')$. The proof of Theorem 2 is in the Appendix, along with extensions to continuous covariate spaces and a more complete characterization of $\Pi$-fair policies for finite $\mathcal{X}$.

Theorem 2.

Suppose $\mathcal{X}$ is finite and $\Pr(X=x)>0$ for all $x\in\mathcal{X}$. Suppose $Z=\zeta(X)$ is a random variable such that:

  1. $Z=Z_{\Pi,A,a'}$ for all $a'\in\mathcal{A}$,

  2. $\Pr(X_{\Pi,A,a'}=x'\mid X=x)>0$ for all $a'\in\mathcal{A}$ such that $\alpha(x)\neq a'$ and $x,x'\in\mathcal{X}$ such that $\zeta(x)=\zeta(x')$.

Then, for any $\Pi$-fair policy $d$, with $W=X$, there exists a function $f$ such that $d(X)=f(Z)$, i.e., $d$ is constant across individuals having the same value of $Z$.

The first condition of Theorem 2 holds for any reduced set of covariates $Z$ that is not causally affected by changes in $A$ (e.g., $Z$ is not a descendant of $A$). The second condition requires that among individuals with covariates $x$, a positive fraction have covariates $x'$ in a counterfactual world in which they belonged to another group $a'$. Because $\zeta(x)$ is the same in the real and counterfactual worlds—since $Z$ is unaffected by $A$, by the first condition—we only consider $x'$ such that $\zeta(x')=\zeta(x)$ in the second condition.

In our running example, the only non-race covariate is test score, which is downstream of race. Further, among students with a given test score, a positive fraction achieve any other test score in the counterfactual world in which their race is altered. As such, the empty set of reduced covariates—formally encoded by setting $\zeta$ to a constant function—satisfies the conditions of Theorem 2. The theorem then implies that under any $\Pi$-fair policy, every applicant is admitted with equal probability.
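This conclusion can be checked numerically on a toy finite covariate space. The sketch below posits two groups and three test-score bins, a made-up counterfactual kernel $\Pr(X_{\Pi,A,a'}=x'\mid X=x)$ with full support within the target group, and verifies that the $\Pi$-fairness constraints in Eq. (5) with $W=X$ admit only constant policies (the feasible set is one-dimensional and spanned by the constant vector).

```python
import numpy as np
from scipy.linalg import null_space

# Toy covariate space: x = (group, test-score bin), groups {0, 1}, bins {0, 1, 2}.
states = [(a, t) for a in (0, 1) for t in (0, 1, 2)]
m = len(states)

def kernel(a_prime):
    """Assumed counterfactual kernel Pr(X_{Pi,A,a'} = x' | X = x): applicants in the
    other group land in some score bin of group a', with full support (made-up weights)."""
    K = np.zeros((m, m))
    for i, (a, t) in enumerate(states):
        if a == a_prime:
            K[i, i] = 1.0                         # altering race to its actual value changes nothing
            continue
        for j, (a2, t2) in enumerate(states):
            if a2 == a_prime:
                K[i, j] = 1.0 + 0.5 * (t2 == t)   # unnormalized, strictly positive weight
        K[i] /= K[i].sum()
    return K

# Pi-fairness with W = X:  d(x) = sum_x' Pr(X_{Pi,A,a'} = x' | X = x) d(x')  for all x, a'.
constraints = np.vstack([np.eye(m) - kernel(a_prime) for a_prime in (0, 1)])
basis = null_space(constraints)

print("Dimension of the set of Pi-fair policies:", basis.shape[1])            # 1
print("Spanned by a constant vector:", np.allclose(basis[:, 0], basis[0, 0]))
```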

Even when decisions are not perfectly uniform lotteries, as in our admissions example, Theorem 2 suggests that enforcing $\Pi$-fairness can lead to unexpected outcomes. For instance, suppose we modify our admissions example to additionally include age as a covariate that is causally unconnected to race—as some past work has done. In that case, $\Pi$-fair policies would admit students based on their age alone, irrespective of test score or race. Although in some cases such restrictive policies might be desirable, this strong structural constraint implied by $\Pi$-fairness appears to be a largely unintended consequence of the mathematical formalism.

The conditions of Theorem 2 are relatively mild, but do not hold in every setting. Suppose that in our admissions example it were the case that $T_{\Pi,A,a_{0}}=T_{\Pi,A,a_{1}}+c$ for some constant $c$—that is, suppose the effect of intervening on race is a constant change to an applicant's test score. Then the second condition of Theorem 2 would no longer hold for a constant $\zeta$. Indeed, any multiple-threshold policy in which $t_{a_{0}}=t_{a_{1}}+c$ would be $\Pi$-fair. In practice, though, such deterministic counterfactuals would seem to be the exception rather than the rule. For example, it seems reasonable to expect that test scores would depend on race in complex ways that induce considerable heterogeneity.

Lastly, we note that $W\neq X$ in some variants of path-specific fairness (e.g., Zhang & Bareinboim, 2018; Nabi & Shpitser, 2018), in which case Theorem 2 does not apply. In that case, however, policies are typically still Pareto dominated, in accordance with Theorem 1.

We conclude our analysis by investigating counterfactual predictive parity, the least demanding of the causal notions of fairness we have considered, requiring only that $Y(1)\perp\!\!\!\perp A\mid D=0$. As such, it is in general possible to have a policy on the Pareto frontier that satisfies this condition. However, in Proposition 2, we show that this cannot happen in some common cases, including our example of college admissions.

In that setting, when the target group has lower average graduation rates—a pattern that often motivates efforts to actively increase diversity—decision policies constrained to satisfy counterfactual predictive parity are Pareto dominated. The proof of the proposition is in the Appendix.

Proposition 2.

Suppose $\mathcal{A}=\{a_{0},a_{1}\}$, and consider the family $\mathcal{U}$ of utility functions of the form

$$u(x)=r(x)+\lambda\cdot\mathbb{1}_{\alpha(x)=a_{1}},$$

indexed by $\lambda\geq 0$, where $r(x)=\mathbb{E}[Y(1)\mid X=x]$. Suppose the conditional distributions of $r(X)$ given $A$ are beta distributed, i.e.,

$$r(X)\mid A=a\sim\operatorname{Beta}(\mu_{a},v),$$

with $v>2$ and $\mu_{a_{0}}>\mu_{a_{1}}>1/v$. (Here we parameterize the beta distribution in terms of its mean $\mu$ and sample size $v$; in terms of the common, alternative $\alpha$-$\beta$ parameterization, $\mu=\alpha/(\alpha+\beta)$ and $v=\alpha+\beta$.) Then any policy satisfying counterfactual predictive parity is strongly Pareto dominated.
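For reference, the mean/sample-size parameterization used above maps to the usual shape parameters via $\alpha=\mu v$ and $\beta=(1-\mu)v$. A small sketch with illustrative values of $\mu_{a_0}$, $\mu_{a_1}$, and $v$ (chosen to satisfy the conditions of Proposition 2):

```python
import numpy as np

def beta_shape_params(mu, v):
    """Convert mean mu and sample size v to the usual (alpha, beta) shape parameters."""
    return mu * v, (1.0 - mu) * v

rng = np.random.default_rng(3)
mu_a0, mu_a1, v = 0.6, 0.4, 5.0   # illustrative: v > 2 and mu_a0 > mu_a1 > 1/v

for group, mu in (("a0", mu_a0), ("a1", mu_a1)):
    alpha, beta = beta_shape_params(mu, v)
    r = rng.beta(alpha, beta, size=100_000)   # draws of r(X) given A = group
    print(f"{group}: alpha={alpha:.1f}, beta={beta:.1f}, empirical mean={r.mean():.3f}")
```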

5 Discussion

We have worked to collect, synthesize, and investigate several causal conceptions of fairness that recently have appeared in the machine learning literature. These definitions formalize intuitively desirable properties—for example, minimizing the direct and indirect effects of race on decisions. But, as we have shown both analytically and with a synthetic example, they can, perhaps surprisingly, lead to policies with unintended downstream outcomes. In contrast to prior impossibility results (Kleinberg et al., 2017; Chouldechova, 2017), in which different formal notions of fairness are shown to be in conflict with each other, we demonstrate trade-offs between formal notions of fairness and resulting social welfare. For instance, in our running example of college admissions, enforcing various causal fairness definitions can lead to a student body that is both less academically prepared and less diverse than what one could achieve under natural alternative policies, potentially harming the very groups these definitions were ostensibly designed to protect. Our results thus highlight a gap between the goals and potential consequences of popular causal approaches to fairness.

What, then, is the role of causal reasoning in designing equitable algorithms? Under a consequentialist perspective on algorithm design (Chohlas-Wood et al., 2021a; Cai et al., 2020; Liang et al., 2021), one aims to construct policies with the most desirable expected outcomes, a task that inherently demands causal reasoning. Formally, this approach corresponds to solving the unconstrained optimization problem in Eq. (7), where preferences for diversity may be directly encoded in the utility function itself, rather than by constraining the class of policies, mitigating potentially problematic consequences. While conceptually appealing, this consequentialist approach still faces considerable practical challenges, including estimating the expected effects of decisions, and eliciting preferences over outcomes.

Our analysis illustrates some of the limitations of mathematical formalizations of fairness, reinforcing the need to explicitly consider the consequences of actions, particularly when decisions are automated and carried out at scale. Looking forward, we hope our work clarifies the ways in which causal reasoning can aid the equitable design of algorithms.

Acknowledgements

We thank Guillaume Basse, Jennifer Hill, and Ravi Sojitra for helpful conversations. H.N. was supported by a Stanford Knight-Hennessy Scholarship and the National Science Foundation Graduate Research Fellowship under Grant No. DGE-1656518. J.G. was supported by a Stanford Knight-Hennessy Scholarship. R.S. was supported by the NSF Program on Fairness in AI in Collaboration with Amazon under the award “FAI: End-to-End Fairness for Algorithm-in-the-Loop Decision Making in the Public Sector,” no. IIS-2040898. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or Amazon. Code to reproduce our results is available at https://github.com/stanford-policylab/causal-fairness.

References

  • Anderson & Zame (2001) Anderson, R. M. and Zame, W. R. Genericity with infinitely many parameters. Advances in Theoretical Economics, 1(1):1–62, 2001.
  • Ayres (2002) Ayres, I. Outcome tests of racial disparities in police practices. Justice Research and Policy, 4(1-2):131–142, 2002.
  • Benji (2020) Benji. The sum of an uncountable number of positive numbers. Mathematics Stack Exchange, 2020. URL https://math.stackexchange.com/q/20661. (version: 2020-05-29).
  • Berk et al. (2021) Berk, R., Heidari, H., Jabbari, S., Kearns, M., and Roth, A. Fairness in criminal justice risk assessments: The state of the art. Sociological Methods & Research, 50(1):3–44, 2021.
  • Bertrand & Mullainathan (2004) Bertrand, M. and Mullainathan, S. Are Emily and Greg more employable than Lakisha and Jamal? A field experiment on labor market discrimination. American Economic Review, 94(4):991–1013, 2004.
  • Billingsley (1995) Billingsley, P. Probability and Measure. Wiley Series in Probability and Mathematical Statistics. John Wiley & Sons, Inc., New York, third edition, 1995. ISBN 0-471-00710-2. A Wiley-Interscience Publication.
  • Brozius (2019) Brozius, H. Conditional expectation - $E[f(X,Y)\mid Y]$. Mathematics Stack Exchange, 2019. URL https://math.stackexchange.com/q/3247577. (Version: 2019-06-01).
  • Cai et al. (2020) Cai, W., Gaebler, J., Garg, N., and Goel, S. Fair allocation through selective information acquisition. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, pp.  22–28, 2020.
  • Carey & Wu (2022) Carey, A. N. and Wu, X. The causal fairness field guide: Perspectives from social and formal sciences. Frontiers in Big Data, 5, 2022.
  • Chiappa (2019) Chiappa, S. Path-specific counterfactual fairness. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp.  7801–7808, 2019.
  • Chohlas-Wood et al. (2021a) Chohlas-Wood, A., Coots, M., Brunskill, E., and Goel, S. Learning to be fair: A consequentialist approach to equitable decision-making. arXiv preprint arXiv:2109.08792, 2021a.
  • Chohlas-Wood et al. (2021b) Chohlas-Wood, A., Nudell, J., Yao, K., Lin, Z., Nyarko, J., and Goel, S. Blind justice: Algorithmically masking race in charging decisions. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, pp.  35–45, 2021b.
  • Chouldechova (2017) Chouldechova, A. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data, 5(2):153–163, 2017.
  • Chouldechova & Roth (2020) Chouldechova, A. and Roth, A. A snapshot of the frontiers of fairness in machine learning. Communications of the ACM, 63(5):82–89, 2020.
  • Christensen (1972) Christensen, J. P. R. On sets of Haar measure zero in Abelian Polish groups. Israel Journal of Mathematics, 13(3-4):255–260, 1972.
  • Cleary (1968) Cleary, T. A. Test bias: Prediction of grades of Negro and white students in integrated colleges. Journal of Educational Measurement, 5(2):115–124, 1968.
  • Corbett-Davies & Goel (2018) Corbett-Davies, S. and Goel, S. The measure and mismeasure of fairness: A critical review of fair machine learning. arXiv preprint arXiv:1808.00023, 2018.
  • Corbett-Davies et al. (2017) Corbett-Davies, S., Pierson, E., Feller, A., Goel, S., and Huq, A. Algorithmic decision making and the cost of fairness. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp.  797–806, 2017.
  • Coston et al. (2020) Coston, A., Mishler, A., Kennedy, E. H., and Chouldechova, A. Counterfactual risk assessments, evaluation, and fairness. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pp.  582–593, 2020.
  • Darlington (1971) Darlington, R. B. Another look at “cultural fairness”. Journal of Educational Measurement, 8(2):71–82, 1971.
  • Dwork et al. (2012) Dwork, C., Hardt, M., Pitassi, T., Reingold, O., and Zemel, R. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pp.  214–226, 2012.
  • Gaebler et al. (2022) Gaebler, J., Cai, W., Basse, G., Shroff, R., Goel, S., and Hill, J. A causal framework for observational studies of discrimination. Statistics and Public Policy, 2022.
  • Galhotra et al. (2022) Galhotra, S., Shanmugam, K., Sattigeri, P., and Varshney, K. R. Causal feature selection for algorithmic fairness. Proceedings of the 2022 International Conference on Management of Data (SIGMOD), 2022.
  • Goel et al. (2017) Goel, S., Perelman, M., Shroff, R., and Sklansky, D. A. Combatting police discrimination in the age of big data. New Criminal Law Review: An International and Interdisciplinary Journal, 20(2):181–232, 2017.
  • Goldin & Rouse (2000) Goldin, C. and Rouse, C. Orchestrating impartiality: The impact of “blind” auditions on female musicians. American Economic Review, 90(4):715–741, 2000.
  • Greiner & Rubin (2011) Greiner, D. J. and Rubin, D. B. Causal effects of perceived immutable characteristics. Review of Economics and Statistics, 93(3):775–785, 2011.
  • Grogger & Ridgeway (2006) Grogger, J. and Ridgeway, G. Testing for racial profiling in traffic stops from behind a veil of darkness. Journal of the American Statistical Association, 101(475):878–887, 2006.
  • Hardt et al. (2016) Hardt, M., Price, E., and Srebro, N. Equality of opportunity in supervised learning. Advances in Neural Information Processing Systems, 29:3315–3323, 2016.
  • Holland (1986) Holland, P. W. Statistics and causal inference. Journal of the American Statistical Association, 81(396):945–960, 1986.
  • Hu & Kohler-Hausmann (2020) Hu, L. and Kohler-Hausmann, I. What’s sex got to do with machine learning? In Proceedings of the 2020 ACM Conference on Fairness, Accountability, and Transparency, 2020.
  • Hunt et al. (1992) Hunt, B. R., Sauer, T., and Yorke, J. A. Prevalence: a translation-invariant “almost every” on infinite-dimensional spaces. Bulletin of the American Mathematical Society, 27(2):217–238, 1992.
  • Imai & Jiang (2020) Imai, K. and Jiang, Z. Principal fairness for human and algorithmic decision-making. arXiv preprint arXiv:2005.10400, 2020.
  • Imai et al. (2020) Imai, K., Jiang, Z., Greiner, J., Halen, R., and Shin, S. Experimental evaluation of algorithm-assisted human decision-making: Application to pretrial public safety assessment. arXiv preprint arXiv:2012.02845, 2020.
  • Imbens & Rubin (2015) Imbens, G. W. and Rubin, D. B. Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press, 2015.
  • Kasy & Abebe (2021) Kasy, M. and Abebe, R. Fairness, equality, and power in algorithmic decision-making. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pp.  576–586, 2021.
  • Kemeny & Snell (1976) Kemeny, J. G. and Snell, J. L. Finite Markov Chains. Undergraduate Texts in Mathematics. Springer-Verlag, New York-Heidelberg, 1976. Reprinting of the 1960 original.
  • Kilbertus et al. (2017) Kilbertus, N., Rojas-Carulla, M., Parascandolo, G., Hardt, M., Janzing, D., and Schölkopf, B. Avoiding discrimination through causal reasoning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp.  656–666, 2017.
  • Kleinberg et al. (2017) Kleinberg, J., Mullainathan, S., and Raghavan, M. Inherent trade-offs in the fair determination of risk scores. In 8th Innovations in Theoretical Computer Science Conference (ITCS). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2017.
  • Kusner et al. (2017) Kusner, M., Loftus, J., Russell, C., and Silva, R. Counterfactual fairness. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp.  4069–4079, 2017.
  • Liang et al. (2021) Liang, A., Lu, J., and Mu, X. Algorithmic design: Fairness versus accuracy. arXiv preprint arXiv:2112.09975, 2021.
  • Liu et al. (2018) Liu, L. T., Dean, S., Rolf, E., Simchowitz, M., and Hardt, M. Delayed impact of fair machine learning. In International Conference on Machine Learning, pp. 3150–3158. PMLR, 2018.
  • Loftus et al. (2018) Loftus, J. R., Russell, C., Kusner, M. J., and Silva, R. Causal reasoning for algorithmic fairness. arXiv preprint arXiv:1805.05859, 2018.
  • Mhasawade & Chunara (2021) Mhasawade, V. and Chunara, R. Causal multi-level fairness. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, pp.  784–794, 2021.
  • Mishler et al. (2021) Mishler, A., Kennedy, E. H., and Chouldechova, A. Fairness in risk assessment instruments: Post-processing to achieve counterfactual equalized odds. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pp.  386–400, 2021.
  • Nabi & Shpitser (2018) Nabi, R. and Shpitser, I. Fair inference on outcomes. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
  • Ott & Yorke (2005) Ott, W. and Yorke, J. Prevalence. Bulletin of the American Mathematical Society, 42(3):263–290, 2005.
  • Page (2007) Page, S. E. Making the difference: Applying a logic of diversity. Academy of Management Perspectives, 21(4):6–20, 2007.
  • Pearl (2001) Pearl, J. Direct and indirect effects. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pp.  411–420. Morgan Kaufmann, 2001.
  • Pearl (2009a) Pearl, J. Causal inference in statistics: An overview. Statistics Surveys, 3:96–146, 2009a.
  • Pearl (2009b) Pearl, J. Causality. Cambridge University Press, second edition, 2009b.
  • Pierson et al. (2020) Pierson, E., Simoiu, C., Overgoor, J., Corbett-Davies, S., Jenson, D., Shoemaker, A., Ramachandran, V., Barghouty, P., Phillips, C., Shroff, R., and Goel, S. A large-scale analysis of racial disparities in police stops across the United States. Nature Human Behaviour, 4(7):736–745, 2020.
  • Rao (2005) Rao, M. M. Conditional measures and applications, volume 271 of Pure and Applied Mathematics (Boca Raton). Chapman & Hall/CRC, Boca Raton, FL, second edition, 2005.
  • Rudin (1987) Rudin, W. Real and Complex Analysis. McGraw-Hill Book Co., New York, third edition, 1987. ISBN 0-07-054234-1.
  • Rudin (1991) Rudin, W. Functional Analysis. International Series in Pure and Applied Mathematics. McGraw-Hill, Inc., New York, second edition, 1991. ISBN 0-07-054236-8.
  • Silva (2008) Silva, C. E. Invitation to Ergodic Theory, volume 42 of Student Mathematical Library. American Mathematical Society, Providence, RI, 2008.
  • Simoiu et al. (2017) Simoiu, C., Corbett-Davies, S., and Goel, S. The problem of infra-marginality in outcome tests for discrimination. The Annals of Applied Statistics, 11(3):1193–1216, 2017.
  • Steele (2019) Steele, R. Space of vector measures equipped with the total variation norm is complete. Mathematics Stack Exchange, 2019. URL https://math.stackexchange.com/q/3197508. (Version: 2019-04-22).
  • Wang et al. (2019) Wang, Y., Sridhar, D., and Blei, D. M. Equal opportunity and affirmative action via counterfactual predictions. arXiv preprint arXiv:1905.10870, 2019.
  • Williams (1983) Williams, T. S. Some issues in the standardized testing of minority students. Journal of Education, pp.  192–208, 1983.
  • Woodworth et al. (2017) Woodworth, B., Gunasekar, S., Ohannessian, M. I., and Srebro, N. Learning non-discriminatory predictors. In Conference on Learning Theory, pp.  1920–1953. PMLR, 2017.
  • Wu et al. (2019) Wu, Y., Zhang, L., Wu, X., and Tong, H. PC-fairness: A unified framework for measuring causality-based fairness. Advances in Neural Information Processing Systems, 32, 2019.
  • Zafar et al. (2017a) Zafar, M. B., Valera, I., Gomez Rodriguez, M., and Gummadi, K. P. Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In Proceedings of the 26th International Conference on World Wide Web, pp.  1171–1180, 2017a.
  • Zafar et al. (2017b) Zafar, M. B., Valera, I., Rodriguez, M. G., Gummadi, K. P., and Weller, A. From parity to preference-based notions of fairness in classification. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp.  228–238, 2017b.
  • Zhang & Bareinboim (2018) Zhang, J. and Bareinboim, E. Fairness in decision-making—the causal explanation formula. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • Zhang et al. (2017) Zhang, L., Wu, Y., and Wu, X. A causal framework for discovering and removing direct and indirect discrimination. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pp.  3929–3935, 2017.

Appendix A Path-specific Counterfactuals

Constructing policies that satisfy path-specific fairness requires computing path-specific counterfactual values of features. In Algorithm 1, we describe the formal construction of path-specific counterfactuals $Z_{\Pi,a,a'}$ for an arbitrary variable $Z$ (or collection of variables) in the DAG. To generate a sample $Z^*_{\Pi,a,a'}$ from the distribution of $Z_{\Pi,a,a'}$, we first sample values $U_j^*$ for the exogenous variables. Then, in the first loop, we traverse the DAG in topological order, setting $A$ to $a$ and iteratively computing values $V_j^*$ of the other nodes from the structural equations in the usual fashion. In the second loop, we set $A$ to $a'$ and then iteratively compute values $\overline{V}_j^*$ for each node: $\overline{V}_j^*$ is computed using the structural equation at that node, with the value $\overline{V}_\ell^*$ for each parent connected to $V_j$ along a path in $\Pi$, and the value $V_\ell^*$ for all other parents. Finally, we set $Z^*_{\Pi,a,a'}$ to $\overline{Z}^*$.

Data: $\mathcal{G}$ (topologically ordered), $\Pi$, $a$, and $a'$
Result: A sample $Z^*_{\Pi,a,a'}$ from $Z_{\Pi,a,a'}$

Sample values $\{U_j^*\}$ for the exogenous variables
/* Compute counterfactuals by setting $A$ to $a$ */
for $j = 1, \ldots, m$ do
    if $V_j = A$ then
        $V_j^* \leftarrow a$
    else
        $\wp(V_j)^* \leftarrow \{V_\ell^* \mid V_\ell \in \wp(V_j)\}$
        $V_j^* \leftarrow f_{V_j}(\wp(V_j)^*, U_j^*)$
    end if
end for
/* Compute counterfactuals by setting $A$ to $a'$ and propagating values along paths in $\Pi$ */
for $j = 1, \ldots, m$ do
    if $V_j = A$ then
        $\overline{V}_j^* \leftarrow a'$
    else
        for $V_k \in \wp(V_j)$ do
            if edge $(V_k, V_j)$ lies on a path in $\Pi$ then
                $V_k^\dagger \leftarrow \overline{V}_k^*$
            else
                $V_k^\dagger \leftarrow V_k^*$
            end if
        end for
        $\wp(V_j)^\dagger \leftarrow \{V_\ell^\dagger \mid V_\ell \in \wp(V_j)\}$
        $\overline{V}_j^* \leftarrow f_{V_j}(\wp(V_j)^\dagger, U_j^*)$
    end if
end for
$Z^*_{\Pi,a,a'} \leftarrow \overline{Z}^*$

Algorithm 1 Path-specific counterfactuals
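
To make the construction concrete, the following Python sketch (ours, not part of the original paper; all function and argument names are hypothetical) implements Algorithm 1 for a structural causal model specified by its topological order, parent sets, structural equations, and exogenous samplers.

    import numpy as np

    def sample_path_specific(nodes, parents, f, sample_u, pi_edges, attr, a, a_prime, rng):
        """One draw of the path-specific counterfactuals (Algorithm 1).

        nodes:     endogenous variables in topological order
        parents:   dict mapping each node to its list of parent nodes
        f:         dict of structural equations, f[v](parent_values: dict, u) -> value
        sample_u:  dict of exogenous samplers, sample_u[v](rng) -> u
        pi_edges:  set of (parent, child) edges lying on a path in Pi
        attr:      name of the protected attribute node A
        """
        u = {v: sample_u[v](rng) for v in nodes}        # shared exogenous draws

        # First pass: set A to a and compute the values V*.
        v_star = {}
        for v in nodes:
            if v == attr:
                v_star[v] = a
            else:
                pa = {p: v_star[p] for p in parents[v]}
                v_star[v] = f[v](pa, u[v])

        # Second pass: set A to a' and propagate values only along edges in Pi.
        v_bar = {}
        for v in nodes:
            if v == attr:
                v_bar[v] = a_prime
            else:
                pa = {p: (v_bar[p] if (p, v) in pi_edges else v_star[p])
                      for p in parents[v]}
                v_bar[v] = f[v](pa, u[v])

        return v_bar

Calling the function repeatedly with fresh exogenous draws yields samples from the joint distribution of the path-specific counterfactuals, from which quantities such as the distribution of $X_{\Pi,A,a'}$ used in Appendix B can be estimated.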

Appendix B Constructing Causally Fair Policies

In order to construct causally fair policies, we prove that the optimization problem in Eq. (7) can be efficiently solved as a single linear program—in the case of counterfactual equalized odds, conditional principal fairness, counterfactual fairness, and path-specific fairness—or as a series of linear programs in the case of counterfactual predictive parity.

Theorem B.1.

Consider the optimization problem given in Eq. (7).

  1. If $\mathcal{C}$ is the class of policies satisfying counterfactual equalized odds or conditional principal fairness, and the distribution of $(X, Y(0), Y(1))$ is known and supported on a finite set of size $n$, then a utility-maximizing policy constrained to lie in $\mathcal{C}$ can be constructed via a linear program with $O(n)$ variables and constraints.

  2. If $\mathcal{C}$ is the class of policies satisfying path-specific fairness (including counterfactual fairness), and the distribution of $(X, X_{\Pi,A,a'})$ is known for each $a' \in \mathcal{A}$ and supported on a finite set of size $n$, then a utility-maximizing policy constrained to lie in $\mathcal{C}$ can be constructed via a linear program with $O(n)$ variables and constraints.

  3. Suppose $\mathcal{C}$ is the class of policies satisfying counterfactual predictive parity, that the distribution of $(X, Y(1))$ is known and supported on a finite set of size $n$, and that the optimization problem in Eq. (7) has a feasible solution. Further suppose $Y(1)$ is supported on $k$ points, and let $\Delta^{k-1} = \{p \in \mathbb{R}^k \mid p_i \ge 0 \ \text{and} \ \sum_{i=1}^{k} p_i = 1\}$ be the unit $(k-1)$-simplex. Then one can construct a family of linear programs $\mathcal{L} = \{L(v)\}_{v \in \Delta^{k-1}}$, each with $O(n)$ variables and constraints, such that the solution to one of the LPs in $\mathcal{L}$ is a utility-maximizing policy constrained to lie in $\mathcal{C}$.

Proof.

Let $\mathcal{X} = \{x_1, \ldots, x_m\}$; then we seek decision variables $d_i$, $i = 1, \ldots, m$, corresponding to the probability of making a positive decision for individuals with covariate value $x_i$. Accordingly, we require that $0 \le d_i \le 1$.

Letting $p_i = \Pr(X = x_i)$ denote the mass of $X$ at $x_i$, note that the objective function $\mathbb{E}[d(X) \cdot u(X)] = \sum_{i=1}^{m} d_i \cdot u(x_i) \cdot p_i$ and the budget constraint $\sum_{i=1}^{m} d_i \cdot p_i \le b$ are both linear in the decision variables.

We now show that each of the causal fairness definitions can be enforced via linear constraints. We do so in three parts, as listed in the theorem.

Theorem B.1 Part 1

First, we consider counterfactual equalized odds. A decision policy satisfies counterfactual equalized odds when $D \perp\!\!\!\perp A \mid Y(1)$. Since $D$ is binary, this condition is equivalent to the requirement that $\mathbb{E}[d(X) \mid A=a, Y(1)=y] = \mathbb{E}[d(X) \mid Y(1)=y]$ for all $a \in \mathcal{A}$ and $y \in \mathcal{Y}$ such that $\Pr(Y(1)=y) > 0$. Expanding this expression and replacing $d(x_j)$ by the corresponding decision variable $d_j$, we obtain

\[\sum_{i=1}^{m} d_i \cdot \Pr(X=x_i \mid A=a, Y(1)=y) = \sum_{i=1}^{m} d_i \cdot \Pr(X=x_i \mid Y(1)=y)\]

for each $a \in \mathcal{A}$ and each of the finitely many values $y \in \mathcal{Y}$ such that $\Pr(Y(1)=y) > 0$. These constraints are linear in the $d_i$ by inspection.

Next, we consider conditional principal fairness. A decision policy satisfies conditional principal fairness when $D \perp\!\!\!\perp A \mid Y(0), Y(1), W$, where $W = \omega(X)$ denotes a reduced set of the covariates $X$. Again, since $D$ is binary, this condition is equivalent to the requirement that $\mathbb{E}[d(X) \mid A=a, Y(0)=y_0, Y(1)=y_1, W=w] = \mathbb{E}[d(X) \mid Y(0)=y_0, Y(1)=y_1, W=w]$ for all $y_0$, $y_1$, and $w$ satisfying $\Pr(Y(0)=y_0, Y(1)=y_1, W=w) > 0$. As above, expanding this expression and replacing $d(x_j)$ by the corresponding decision variable $d_j$ yields linear constraints of the form

\[\sum_{i=1}^{m} d_i \cdot \Pr(X=x_i \mid A=a, S=s) = \sum_{i=1}^{m} d_i \cdot \Pr(X=x_i \mid S=s)\]

for each $a \in \mathcal{A}$ and each of the finitely many values $s = (y_0, y_1, w) \in \mathcal{Y} \times \mathcal{Y} \times \mathcal{W}$ of $S = (Y(0), Y(1), W)$ such that $\Pr(Y(0)=y_0, Y(1)=y_1, W=w) > 0$. Again, these constraints are linear by inspection.

Theorem B.1 Part 2

Suppose a decision policy satisfies path-specific fairness for a given collection of paths $\Pi$ and a (possibly) reduced set of covariates $W = \omega(X)$, meaning that for every $a' \in \mathcal{A}$, $\mathbb{E}[D_{\Pi,A,a'} \mid W] = \mathbb{E}[D \mid W]$.

Recall from the definition of path-specific counterfactuals that $D_{\Pi,A,a'} = f_D(X_{\Pi,A,a'}, U_D) = \mathbb{1}_{U_D \le d(X_{\Pi,A,a'})}$, where $U_D \perp\!\!\!\perp \{X_{\Pi,A,a'}, X\}$. Since $W = \omega(X)$, we also have $U_D \perp\!\!\!\perp \{X_{\Pi,A,a'}, W\}$, and it follows that

\begin{align*}
\mathbb{E}[D_{\Pi,A,a'} \mid W=w]
&= \sum_{i=1}^{m} \mathbb{E}[D_{\Pi,A,a'} \mid X_{\Pi,A,a'}=x_i, W=w] \cdot \Pr(X_{\Pi,A,a'}=x_i \mid W=w) \\
&= \sum_{i=1}^{m} \mathbb{E}[\mathbb{1}_{U_D \le d(X_{\Pi,A,a'})} \mid X_{\Pi,A,a'}=x_i, W=w] \cdot \Pr(X_{\Pi,A,a'}=x_i \mid W=w) \\
&= \sum_{i=1}^{m} d(x_i) \cdot \Pr(X_{\Pi,A,a'}=x_i \mid W=w) \\
&= \sum_{i=1}^{m} d_i \cdot \Pr(X_{\Pi,A,a'}=x_i \mid W=w).
\end{align*}

An analogous calculation yields $\mathbb{E}[D \mid W=w] = \sum_{i=1}^{m} d_i \cdot \Pr(X=x_i \mid W=w)$. Equating these expressions gives

\[\sum_{i=1}^{m} d_i \cdot \Pr(X=x_i \mid W=w) = \sum_{i=1}^{m} d_i \cdot \Pr(X_{\Pi,A,a'}=x_i \mid W=w)\]

for each $a' \in \mathcal{A}$ and each of the finitely many $w \in \mathcal{W}$ such that $\Pr(W=w) > 0$. Again, each of these constraints is linear by inspection.

Theorem B.1 Part 3

A decision policy satisfies counterfactual predictive parity if $Y(1) \perp\!\!\!\perp A \mid D=0$, or equivalently, $\Pr(Y(1)=y \mid A=a, D=0) = \Pr(Y(1)=y \mid D=0)$ for all $a \in \mathcal{A}$ and $y \in \mathcal{Y}$. We may rewrite this expression to obtain

\[\frac{\Pr(Y(1)=y, A=a, D=0)}{\Pr(A=a, D=0)} = C_y,\]

where $C_y = \Pr(Y(1)=y \mid D=0)$.

Expanding the numerator on the left-hand side of the above equation yields

\[\Pr(Y(1)=y, A=a, D=0) = \sum_{i=1}^{m} [1-d_i] \cdot \Pr(Y(1)=y, A=a, X=x_i),\]

and expanding the denominator yields

\[\Pr(A=a, D=0) = \sum_{i=1}^{m} [1-d_i] \cdot \Pr(A=a, X=x_i).\]

Therefore, for fixed values of the $C_y$, counterfactual predictive parity corresponds to

\[\sum_{i=1}^{m} [1-d_i] \cdot \Pr(Y(1)=y, A=a, X=x_i) = C_y \cdot \sum_{i=1}^{m} [1-d_i] \cdot \Pr(A=a, X=x_i) \tag{10}\]

for each $a \in \mathcal{A}$ and each of the finitely many $y \in \mathcal{Y}$. Again, these constraints are linear in the $d_i$ by inspection.

Consider the family of linear programs $\mathcal{L} = \{L(v)\}_{v \in \Delta^{k-1}}$, where the linear program $L(v)$ has the same objective function $\sum_{i=1}^{m} d_i \cdot u(x_i) \cdot p_i$ and budget constraint $\sum_{i=1}^{m} d_i \cdot p_i \le b$ as before, together with the additional constraints in Eq. (10) for each $a \in \mathcal{A}$ and $y \in \mathcal{Y}$, where $C_{y_i} = v_i$ for $i = 1, \ldots, k$.

By assumption, there exists a feasible solution to the optimization problem in Eq. (7), so the solution to at least one program in $\mathcal{L}$ is a utility-maximizing policy that satisfies counterfactual predictive parity. ∎
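
As an illustration of Part 1, the sketch below (our own, not from the paper; it assumes SciPy is available and that the joint distribution of $(A, X, Y(1))$ is supplied as a finite array) sets up the single linear program for counterfactual equalized odds and solves it with scipy.optimize.linprog.

    import numpy as np
    from scipy.optimize import linprog

    def cf_eq_odds_lp(p_axy, u, b):
        """Maximize sum_i d_i * u(x_i) * Pr(X=x_i) subject to 0 <= d_i <= 1, the
        budget sum_i d_i * Pr(X=x_i) <= b, and, for each (a, y), the constraint
        sum_i d_i * [Pr(X=x_i | A=a, Y(1)=y) - Pr(X=x_i | Y(1)=y)] = 0.

        p_axy: array of shape (n_groups, n_x, n_y), joint pmf of (A, X, Y(1))
        u:     array of shape (n_x,), utility u(x_i)
        b:     budget
        """
        n_a, n_x, n_y = p_axy.shape
        p_x = p_axy.sum(axis=(0, 2))                    # Pr(X = x_i)

        c = -(u * p_x)                                  # linprog minimizes
        A_ub, b_ub = [p_x], [b]                         # budget constraint

        p_xy = p_axy.sum(axis=0)                        # Pr(X = x_i, Y(1) = y)
        p_y = p_xy.sum(axis=0)                          # Pr(Y(1) = y)
        A_eq, b_eq = [], []
        for a in range(n_a):
            p_ay = p_axy[a].sum(axis=0)                 # Pr(A = a, Y(1) = y)
            for y in range(n_y):
                if p_y[y] <= 0 or p_ay[y] <= 0:
                    continue
                row = p_axy[a, :, y] / p_ay[y] - p_xy[:, y] / p_y[y]
                A_eq.append(row)
                b_eq.append(0.0)

        res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                      A_eq=np.array(A_eq) if A_eq else None,
                      b_eq=np.array(b_eq) if b_eq else None,
                      bounds=[(0, 1)] * n_x, method="highs")
        return res.x                                    # d_i for each x_i

The constraints for conditional principal fairness and path-specific fairness have the same structure, with the conditioning event replaced as in the proof above.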

Appendix C A Stylized Example of College Admissions

In the example we consider in Section 2.1, the exogenous variables $\mathscr{U} = \{U_A, U_D, U_E, U_M, U_T, U_Y\}$ in the DAG are independently distributed as follows:

\begin{align*}
U_A, U_D, U_Y &\sim \textsc{Unif}(0,1), \\
U_E, U_M, U_T &\sim \mathrm{N}(0,1).
\end{align*}

For fixed constants $\mu_A$, $\beta_{E,0}$, $\beta_{E,A}$, $\beta_{M,0}$, $\beta_{M,E}$, $\beta_{T,0}$, $\beta_{T,E}$, $\beta_{T,M}$, $\beta_{T,B}$, $\beta_{T,u}$, $\beta_{Y,0}$, and $\beta_{Y,D}$, we define the endogenous variables $\mathscr{V} = \{A, E, M, T, D, Y\}$ in the DAG by the following structural equations:

\begin{align*}
f_A(u_A) &= \begin{cases} a_1 & \text{if } u_A \le \mu_A \\ a_0 & \text{otherwise,} \end{cases} \\
f_E(a, u_E) &= \beta_{E,0} + \beta_{E,A} \cdot \mathbb{1}(a = a_1) + u_E, \\
f_M(e, u_M) &= \beta_{M,0} + \beta_{M,E} \cdot e + u_M, \\
f_T(e, m, u_T) &= \beta_{T,0} + \beta_{T,E} \cdot e + \beta_{T,M} \cdot m + \beta_{T,B} \cdot e \cdot m + \beta_{T,u} \cdot u_T, \\
f_D(x, u_D) &= \mathbb{1}(u_D \le d(x)), \\
f_Y(m, u_Y, \delta) &= \mathbb{1}\bigl(u_Y \le \operatorname{logit}^{-1}(\beta_{Y,0} + m + \beta_{Y,D} \cdot \delta)\bigr),
\end{align*}

where $\operatorname{logit}^{-1}(x) = (1 + \exp(-x))^{-1}$ and $d(x)$ is the decision policy. In our example, we use the constants $\mu_A = \tfrac{1}{3}$, $\beta_{E,0} = 1$, $\beta_{E,A} = -1$, $\beta_{M,0} = 0$, $\beta_{M,E} = 1$, $\beta_{T,0} = 50$, $\beta_{T,E} = 4$, $\beta_{T,M} = 4$, $\beta_{T,u} = 7$, $\beta_{T,B} = 1$, $\beta_{Y,0} = -\tfrac{1}{2}$, and $\beta_{Y,D} = \tfrac{1}{2}$. We also assume a budget $b = \tfrac{1}{2}$.
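
For concreteness, the following sketch (ours, not from the paper) simulates this data-generating process under the stated parameter values; the argument admit_prob stands in for an arbitrary decision policy $d(x)$.

    import numpy as np

    def simulate_admissions(n, admit_prob, rng=None):
        """Simulate the stylized admissions DAG of Appendix C.

        admit_prob: function mapping covariates (a, e, m, t) to an admission
        probability d(x) in [0, 1].
        """
        rng = rng or np.random.default_rng(0)

        # Exogenous variables.
        u_a, u_d, u_y = rng.uniform(size=(3, n))
        u_e, u_m, u_t = rng.normal(size=(3, n))

        # Structural equations with the constants from Appendix C.
        a = np.where(u_a <= 1 / 3, 1, 0)                    # group membership (a1 = 1)
        e = 1 - 1 * (a == 1) + u_e                          # educational opportunity
        m = 0 + 1 * e + u_m                                 # preparedness
        t = 50 + 4 * e + 4 * m + 1 * e * m + 7 * u_t        # test score
        d = (u_d <= admit_prob(a, e, m, t)).astype(int)     # admission decision
        y = (u_y <= 1 / (1 + np.exp(-(-0.5 + m + 0.5 * d)))).astype(int)  # outcome

        return dict(A=a, E=e, M=m, T=t, D=d, Y=y)

    # Example: admit every applicant with probability b = 1/2.
    data = simulate_admissions(10_000, lambda a, e, m, t: np.full_like(e, 0.5))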

Appendix D Proof of Proposition 1

We begin by more formally defining (multiple) threshold policies. Throughout, we assume, without loss of generality, that $\Pr(A=a) > 0$ for all $a \in \mathcal{A}$.

Definition D.1.

Let $u(x)$ be a utility function. We say that a policy $d(x)$ is a threshold policy with respect to $u$ if there exists some $t$ such that

\[d(x) = \begin{cases} 1 & u(x) > t, \\ 0 & u(x) < t, \end{cases}\]

and $d(x) \in [0,1]$ is arbitrary if $u(x) = t$. We say that $d(x)$ is a multiple threshold policy with respect to $u$ if there exist group-specific constants $t_a$ for $a \in \mathcal{A}$ such that

\[d(x) = \begin{cases} 1 & u(x) > t_{\alpha(x)}, \\ 0 & u(x) < t_{\alpha(x)}, \end{cases}\]

and $d(x) \in [0,1]$ is arbitrary if $u(x) = t_{\alpha(x)}$.
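
As a concrete illustration (ours, with hypothetical names), a multiple threshold policy over finitely many applicants can be written as follows, with an explicit tie-breaking probability at the threshold.

    import numpy as np

    def multiple_threshold_policy(u_x, groups, thresholds, tie_prob=0.5):
        """d(x) = 1 if u(x) > t_{alpha(x)}, 0 if u(x) < t_{alpha(x)}, and
        tie_prob (arbitrary in [0, 1]) if u(x) equals the group's threshold.

        u_x:        array of utilities u(x_i)
        groups:     array of group labels alpha(x_i)
        thresholds: dict mapping each group label a to its threshold t_a
        """
        t = np.array([thresholds[a] for a in groups])
        return np.where(u_x > t, 1.0, np.where(u_x < t, 0.0, tie_prob))

    # Example: group-specific thresholds over two groups.
    d = multiple_threshold_policy(
        u_x=np.array([0.3, -0.1, 0.2, 0.2]),
        groups=np.array(["a0", "a1", "a0", "a1"]),
        thresholds={"a0": 0.0, "a1": 0.2},
    )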

Remark 1.

In general, different thresholds may produce threshold policies that are almost surely equal. For instance, if $u(X) \sim \textsc{Bern}(\tfrac{1}{2})$, then the policies $\mathbb{1}_{u(X) > p}$ are almost surely equal for all $p \in [0,1)$. Nevertheless, we speak of the threshold associated with the threshold policy $d(X)$ unless there is ambiguity.

We first observe that if $\mathcal{U}$ is consistent modulo $\alpha$, then whether a decision policy $d(x)$ is a multiple threshold policy does not depend on our choice of $u \in \mathcal{U}$.

Lemma D.1.

Let $\mathcal{U}$ be a collection of utilities consistent modulo $\alpha$, and suppose $d : \mathcal{X} \to [0,1]$ is a decision rule. If $d(x)$ is a multiple threshold rule with respect to a utility $u^* \in \mathcal{U}$, then $d(x)$ is a multiple threshold rule with respect to every $u \in \mathcal{U}$. In particular, if $d(x)$ can be represented by non-negative thresholds with respect to $u^*$, it can be represented by non-negative thresholds with respect to any $u \in \mathcal{U}$.

Proof.

Suppose $d(x)$ is represented by thresholds $\{t_a^*\}_{a \in \mathcal{A}}$ with respect to $u^*$. We construct the thresholds $\{t_a\}_{a \in \mathcal{A}}$ explicitly.

Fix $a \in \mathcal{A}$ and suppose that there exists $x^* \in \alpha^{-1}(a)$ such that $u^*(x^*) = t_a^*$. Then set $t_a = u(x^*)$. Now, if $u(x) > t_a = u(x^*)$ then, by consistency modulo $\alpha$, $u^*(x) > u^*(x^*) = t_a^*$. Similarly, if $u(x) < t_a$ then $u^*(x) < t_a^*$. We also note that, by consistency modulo $\alpha$, $\operatorname{sign}(t_a) = \operatorname{sign}(u(x^*)) = \operatorname{sign}(u^*(x^*)) = \operatorname{sign}(t_a^*)$.

If there is no $x^* \in \alpha^{-1}(a)$ such that $u^*(x^*) = t_a^*$, then let

\[t_a = \inf_{x \in S_a} u(x),\]

where $S_a = \{x \in \alpha^{-1}(a) \mid u^*(x) > t_a^*\}$. Note that since $\operatorname{sign}(u(x)) = \operatorname{sign}(u^*(x))$ for all $x$ by consistency modulo $\alpha$, if $t_a^* \ge 0$, then $t_a \ge 0$ as well.

We must also show in this case that if $u(x) > t_a$ then $u^*(x) > t_a^*$, and if $u(x) < t_a$ then $u^*(x) < t_a^*$. To do so, let $x \in \alpha^{-1}(a)$ be arbitrary, and suppose $u(x) > t_a$. Then, by the definition of $t_a$, there exists $x' \in \alpha^{-1}(a)$ such that $u(x) > u(x') > t_a$ and $u^*(x') > t_a^*$, whence $u^*(x) > u^*(x') > t_a^*$ by consistency modulo $\alpha$. On the other hand, if $u(x) < t_a$, it follows from the definition of $t_a$ that $u^*(x) \le t_a^*$; since $u^*(x) \ne t_a^*$ by hypothesis, it follows that $u^*(x) < t_a^*$.

Therefore, in both cases, for $x \in \alpha^{-1}(a)$, if $u(x) > t_a$ then $u^*(x) > t_a^*$, and if $u(x) < t_a$ then $u^*(x) < t_a^*$. Therefore

\[d(x) = \begin{cases} 1 & \text{if } u(x) > t_{\alpha(x)}, \\ 0 & \text{if } u(x) < t_{\alpha(x)}, \end{cases}\]

i.e., $d(x)$ is a multiple threshold policy with respect to $u$. Moreover, as noted above, if $t_a^* \ge 0$ for all $a \in \mathcal{A}$, then $t_a \ge 0$ for all $a \in \mathcal{A}$. ∎

We now prove the following strengthening of Prop. 1.

Lemma D.2.

Let $\mathcal{U}$ be a collection of utilities consistent modulo $\alpha$. If $d(x)$ is a feasible decision policy that is not a.s. equal to a multiple threshold policy with non-negative thresholds with respect to $\mathcal{U}$, then $d(x)$ is strongly Pareto dominated.

Proof.

We prove the claim in two parts. First, we show that any policy that is not a multiple threshold policy is strongly Pareto dominated. Then, we show that any multiple threshold policy that cannot be represented with non-negative thresholds is strongly Pareto dominated.

If $d(x)$ is not a multiple threshold policy, then there exist $u \in \mathcal{U}$ and $a^* \in \mathcal{A}$ such that $d(x)$ is not a threshold policy with respect to $u$ when restricted to $\alpha^{-1}(a^*)$.

We will construct an alternative policy $d'(x)$ that attains strictly greater utility on $\alpha^{-1}(a^*)$ and is identical elsewhere. Thus, without loss of generality, we assume there is a single group, i.e., $\alpha(x) = a^*$. Heuristically, the proof proceeds by moving some of the mass below a threshold to above the threshold, creating a feasible policy with improved utility.

Let $b = \mathbb{E}[d(X)]$. Define

\begin{align*}
m^{\mathrm{Lo}}(t) &= \mathbb{E}[d(X) \cdot \mathbb{1}_{u(X) < t}], \\
m^{\mathrm{Up}}(t) &= \mathbb{E}[(1 - d(X)) \cdot \mathbb{1}_{u(X) > t}].
\end{align*}

We show that there exists $t^*$ such that $m^{\mathrm{Up}}(t^*) > 0$ and $m^{\mathrm{Lo}}(t^*) > 0$. For, if not, consider

\[\tilde{t} = \inf\{t \in \mathbb{R} : m^{\mathrm{Up}}(t) = 0\}.\]

Note that $d(X) \cdot \mathbb{1}_{u(X) > \tilde{t}} = \mathbb{1}_{u(X) > \tilde{t}}$ a.s. If $\tilde{t} = -\infty$, then by definition $d(X) = 1$ a.s., which is a threshold policy, violating our assumption on $d(X)$. If $\tilde{t} > -\infty$, then for any $t' < \tilde{t}$ we have, by definition, that $m^{\mathrm{Up}}(t') > 0$, and so by hypothesis $m^{\mathrm{Lo}}(t') = 0$. Therefore $d(X) \cdot \mathbb{1}_{u(X) < \tilde{t}} = 0$ a.s., and so, again, $d(X)$ is a threshold policy, contrary to hypothesis.

Now, with $t^*$ as above, write $m^{\mathrm{Up}} = m^{\mathrm{Up}}(t^*)$ and $m^{\mathrm{Lo}} = m^{\mathrm{Lo}}(t^*)$ for notational simplicity, and consider the alternative policy

\[d'(x) = \begin{cases} (1 - m^{\mathrm{Up}}) \cdot d(x) & u(x) < t^*, \\ d(x) & u(x) = t^*, \\ 1 - (1 - m^{\mathrm{Lo}}) \cdot (1 - d(x)) & u(x) > t^*. \end{cases}\]

Then it follows by construction that

\begin{align*}
\mathbb{E}[d'(X)] &= (1 - m^{\mathrm{Up}}) \cdot m^{\mathrm{Lo}} + \mathbb{E}[d(X) \cdot \mathbb{1}_{u(X) = t^*}] + \Pr(u(X) > t^*) - (1 - m^{\mathrm{Lo}}) \cdot m^{\mathrm{Up}} \\
&= m^{\mathrm{Lo}} + \mathbb{E}[d(X) \cdot \mathbb{1}_{u(X) = t^*}] + \Pr(u(X) > t^*) - m^{\mathrm{Up}} \\
&= \mathbb{E}[d(X) \cdot \mathbb{1}_{u(X) < t^*}] + \mathbb{E}[d(X) \cdot \mathbb{1}_{u(X) = t^*}] + \mathbb{E}[\mathbb{1}_{u(X) > t^*}] - \mathbb{E}[(1 - d(X)) \cdot \mathbb{1}_{u(X) > t^*}] \\
&= \mathbb{E}[d(X)] \\
&= b,
\end{align*}

so $d'(x)$ is feasible. However,

\[d'(x) - d(x) = m^{\mathrm{Lo}} \cdot (1 - d(x)) \cdot \mathbb{1}_{u(x) > t^*} - m^{\mathrm{Up}} \cdot d(x) \cdot \mathbb{1}_{u(x) < t^*},\]

and so

\begin{align*}
\mathbb{E}[(d'(X) - d(X)) \cdot u(X)] &= m^{\mathrm{Lo}} \cdot \mathbb{E}[(1 - d(X)) \cdot \mathbb{1}_{u(X) > t^*} \cdot u(X)] - m^{\mathrm{Up}} \cdot \mathbb{E}[d(X) \cdot \mathbb{1}_{u(X) < t^*} \cdot u(X)] \\
&> m^{\mathrm{Lo}} \cdot t^* \cdot \mathbb{E}[(1 - d(X)) \cdot \mathbb{1}_{u(X) > t^*}] - m^{\mathrm{Up}} \cdot t^* \cdot \mathbb{E}[d(X) \cdot \mathbb{1}_{u(X) < t^*}] \\
&= t^* \cdot m^{\mathrm{Lo}} \cdot m^{\mathrm{Up}} - t^* \cdot m^{\mathrm{Up}} \cdot m^{\mathrm{Lo}} \\
&= 0.
\end{align*}

Therefore

\[\mathbb{E}[d(X) \cdot u(X)] < \mathbb{E}[d'(X) \cdot u(X)].\]

It remains to show that $u'(d') > u'(d)$ for arbitrary $u' \in \mathcal{U}$. Let

\[t' = \inf\{u'(x) : d'(x) > d(x)\}.\]

Note that, by construction, for any $x, x' \in \mathcal{X}$, if $d'(x) > d(x)$ and $d'(x') < d(x')$, then $u(x) > t^* > u(x')$. It follows by consistency modulo $\alpha$ that $u'(x) \ge t' \ge u'(x')$ and, moreover, that at least one of the inequalities is strict. Without loss of generality, assume $u'(x) > t' \ge u'(x')$. Then $u(x) > t^*$ if and only if $u'(x) > t'$. Therefore, it follows that

\[\mathbb{E}[(d'(X) - d(X)) \cdot \mathbb{1}_{u'(X) > t'}] = m^{\mathrm{Lo}} \cdot m^{\mathrm{Up}} > 0.\]

Since $\mathbb{E}[d'(X) - d(X)] = 0$, we see that

\begin{align*}
\mathbb{E}[(d'(X) - d(X)) \cdot u'(X)] &= \mathbb{E}[(d'(X) - d(X)) \cdot \mathbb{1}_{u'(X) > t'} \cdot u'(X)] + \mathbb{E}[(d'(X) - d(X)) \cdot \mathbb{1}_{u'(X) \le t'} \cdot u'(X)] \\
&> t' \cdot \mathbb{E}[(d'(X) - d(X)) \cdot \mathbb{1}_{u'(X) > t'}] + t' \cdot \mathbb{E}[(d'(X) - d(X)) \cdot \mathbb{1}_{u'(X) \le t'}] \\
&= t' \cdot \mathbb{E}[d'(X) - d(X)] \\
&= 0,
\end{align*}

where in the inequality we have used the fact that if $d'(x) > d(x)$ then $u'(x) > t'$, and if $d'(x) < d(x)$ then $u'(x) \le t'$. Therefore

\[\mathbb{E}[d(X) \cdot u'(X)] < \mathbb{E}[d'(X) \cdot u'(X)],\]

i.e., $d'(x)$ strongly Pareto dominates $d(x)$.

Now, we prove the second claim, namely, that a multiple threshold policy $\tau(x)$ that cannot be represented with non-negative thresholds is strongly Pareto dominated. For, if $\tau(x)$ is such a policy, then, by Lemma D.1, for any $u \in \mathcal{U}$, $\mathbb{E}[\tau(X) \cdot \mathbb{1}_{u(X) < 0}] > 0$. It follows immediately that $\tau'(x) = \tau(x) \cdot \mathbb{1}_{u(x) > 0}$ satisfies $u(\tau') > u(\tau)$. By consistency modulo $\alpha$, the definition of $\tau'(x)$ does not depend on our choice of $u$, and so $u(\tau') > u(\tau)$ for every $u \in \mathcal{U}$, i.e., $\tau'(x)$ strongly Pareto dominates $\tau(x)$. ∎

The following results, which draw on Lemma D.2, are useful in the proof of Theorem 1.

Definition D.2.

We say that a decision policy $d(x)$ is budget-exhausting if

\[\min(b, \Pr(u(X) > 0)) \le \mathbb{E}[d(X)] \le \min(b, \Pr(u(X) \ge 0)).\]
Remark 2.

We note that if $\mathcal{U}$ is consistent modulo $\alpha$, then whether or not a decision policy $d(x)$ is budget-exhausting does not depend on the choice of $u \in \mathcal{U}$. Further, if $\Pr(u(X) = 0) = 0$ (e.g., if the distribution of $X$ is $\mathcal{U}$-fine), then the decision policy is budget-exhausting if and only if $\mathbb{E}[d(X)] = \min(b, \Pr(u(X) > 0))$.

Corollary D.1.

Let $\mathcal{U}$ be a collection of utilities consistent modulo $\alpha$. If $\tau(x)$ is a feasible policy that is not a budget-exhausting multiple threshold policy with non-negative thresholds, then $\tau(x)$ is strongly Pareto dominated.

Proof.

Suppose $\tau(x)$ is not strongly Pareto dominated. By Lemma D.2, it is a multiple threshold policy with non-negative thresholds.

Now, suppose toward a contradiction that $\tau(x)$ is not budget-exhausting. Then either $\mathbb{E}[\tau(X)] > \min(b, \Pr(u(X) \ge 0))$ or $\mathbb{E}[\tau(X)] < \min(b, \Pr(u(X) > 0))$.

In the first case, since $\tau(x)$ is feasible, it follows that $\mathbb{E}[\tau(X)] > \Pr(u(X) \ge 0)$, and hence that $\tau(x) \cdot \mathbb{1}_{u(x) < 0}$ is not almost surely zero. Therefore

\[\mathbb{E}[\tau(X) \cdot u(X)] < \mathbb{E}[\tau(X) \cdot \mathbb{1}_{u(X) > 0} \cdot u(X)],\]

and, by consistency modulo $\alpha$, this holds for any $u \in \mathcal{U}$. Therefore $\tau(x)$ is strongly Pareto dominated, contrary to hypothesis.

In the second case, consider

\[d(x) = \theta \cdot \mathbb{1}_{u(x) > 0} + (1 - \theta) \cdot \tau(x).\]

Since $\mathbb{E}[\tau(X)] < \min(b, \Pr(u(X) > 0))$ and

\[\mathbb{E}[d(X)] = \theta \cdot \Pr(u(X) > 0) + (1 - \theta) \cdot \mathbb{E}[\tau(X)],\]

there exists some $\theta > 0$ such that $d(x)$ is feasible.

For that $\theta$, a similar calculation shows that $u(d) > u(\tau)$ and, by consistency modulo $\alpha$, that $u'(d) > u'(\tau)$ for all $u' \in \mathcal{U}$. Therefore, again, $d(x)$ strongly Pareto dominates $\tau(x)$, contrary to hypothesis. ∎

Lemma D.3.

Given a utility $u$, there exists a mapping $T$ from $[0,1]^{\mathcal{A}}$ to $[-\infty, \infty]^{\mathcal{A}}$ taking sets of quantiles $\{q_a\}_{a \in \mathcal{A}}$ to thresholds $\{t_a\}_{a \in \mathcal{A}}$ such that:

  1. $T$ is monotonically non-increasing in each coordinate;

  2. For each set of quantiles, there is a multiple threshold policy $\tau : \mathcal{X} \to [0,1]$ with thresholds $T(\{q_a\})$ with respect to $u$ such that $\mathbb{E}[\tau(X) \mid A=a] = q_a$.

Proof.

Simply choose

\[t_a = \inf\{s \in \mathbb{R} : \Pr(u(X) > s \mid A=a) < q_a\}. \tag{11}\]

Then define

\[p_a = \begin{cases} \dfrac{q_a - \Pr(u(X) > t_a \mid A=a)}{\Pr(u(X) = t_a \mid A=a)} & \Pr(u(X) = t_a, A=a) > 0, \\ 0 & \Pr(u(X) = t_a, A=a) = 0. \end{cases}\]

Note that $\Pr(u(X) \ge t_a \mid A=a) \ge q_a$, since, by definition, $\Pr(u(X) > t_a - \epsilon \mid A=a) \ge q_a$ for all $\epsilon > 0$. Therefore,

\[\Pr(u(X) > t_a \mid A=a) + \Pr(u(X) = t_a \mid A=a) \ge q_a,\]

and so $p_a \le 1$. Further, since $\Pr(u(X) > t_a \mid A=a) \le q_a$, we have that $p_a \ge 0$.

Finally, let

\[d(x) = \begin{cases} 1 & u(x) > t_{\alpha(x)}, \\ p_{\alpha(x)} & u(x) = t_{\alpha(x)}, \\ 0 & u(x) < t_{\alpha(x)}, \end{cases}\]

and it follows immediately that $\mathbb{E}[d(X) \mid A=a] = q_a$. That $t_a$ is a monotonically non-increasing function of $q_a$ follows immediately from Eq. (11). ∎
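
For a finite, within-group distribution of $u(X)$, the construction in Eq. (11) can be computed directly; the sketch below (ours, with hypothetical names) returns the group-specific threshold $t_a$ and tie-breaking probability $p_a$ for a target selection rate $q_a$.

    import numpy as np

    def threshold_for_quantile(u_vals, weights, q):
        """Given the within-group distribution of u(X) (values and probabilities),
        return (t, p) such that selecting when u(X) > t, rejecting when u(X) < t,
        and selecting with probability p when u(X) == t selects mass exactly q."""
        order = np.argsort(u_vals)[::-1]                 # utilities, largest first
        u_sorted, w_sorted = u_vals[order], weights[order]
        cum = np.cumsum(w_sorted)

        # t is the infimum of s such that the mass strictly above s is < q (Eq. 11).
        k = np.searchsorted(cum, q)                      # first index where cum >= q
        t = u_sorted[min(k, len(u_sorted) - 1)]

        above = w_sorted[u_sorted > t].sum()             # Pr(u(X) > t | A = a)
        at = w_sorted[u_sorted == t].sum()               # Pr(u(X) = t | A = a)
        p = 0.0 if at == 0 else (q - above) / at
        return t, p

    # Example: four equally likely applicants in a group, target selection rate 0.6.
    t, p = threshold_for_quantile(np.array([2.0, 1.0, 1.0, -0.5]),
                                  np.full(4, 0.25), q=0.6)
    # Here t = 1.0 and p = 0.7, so expected selected mass is 0.25 + 0.5 * 0.7 = 0.6.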

We can further refine Cor. D.1 and Lemma D.3 as follows:

Lemma D.4.

Let $u$ be a utility. Then a feasible policy is utility maximizing if and only if it is a budget-exhausting threshold policy. Moreover, there exists at least one utility-maximizing policy.

Proof.

Let $\bar{\alpha}$ be a constant map, i.e., $\bar{\alpha} : \mathcal{X} \to \bar{\mathcal{A}}$, where $|\bar{\mathcal{A}}| = 1$. Then $\mathcal{U} = \{u\}$ is consistent modulo $\bar{\alpha}$, and so, by Cor. D.1, any Pareto efficient policy is a budget-exhausting multiple threshold policy relative to $\mathcal{U}$. Since $\mathcal{U}$ contains a single element, a policy is Pareto efficient if and only if it is utility maximizing. Since $\bar{\alpha}$ is constant, a policy is a multiple threshold policy relative to $\bar{\alpha}$ if and only if it is a threshold policy. Therefore, a policy is utility maximizing if and only if it is a budget-exhausting threshold policy. By Lemma D.3, such a policy exists, and so the maximum is attained. ∎

Appendix E Prevalence and the Proof of Theorem 1

The notion of a probabilistically “small” set—such as the event in which an idealized dart hits the exact center of a target—is, in finite-dimensional real vector spaces, typically encoded by the idea of a Lebesgue null set.

Here we prove that the set of distributions over which there exists a policy satisfying either counterfactual equalized odds, conditional principal fairness, or counterfactual fairness that is not strongly Pareto dominated is “small” in an analogous sense. The proof turns on the following intuition. Each of the fairness definitions imposes a number of constraints. By Lemma D.2, any policy that is not strongly Pareto dominated is a multiple threshold policy. By adjusting the group-specific thresholds of such a policy, one can potentially satisfy one constraint per group. If there are more constraints than groups, then one has no additional degrees of freedom that can be used to ensure that the remaining constraints are satisfied. If, by chance, those constraints are satisfied by the same threshold policy, they are not satisfied robustly: even a minor distribution shift, such as increasing the amount of mass above the threshold on the relevant subpopulation, will break them. Therefore, over a “typical” distribution, at most $|\mathcal{A}|$ of the constraints can simultaneously be satisfied by a Pareto efficient policy, meaning that typically no Pareto efficient policy fully satisfies all of the conditions of the fairness definitions.

Formalizing this intuition, however, requires considerable care. In Section E.1, we give a brief introduction to a popular generalization of null sets to infinite-dimensional vector spaces, drawing heavily on a review article by Ott & Yorke (2005). In Section E.2, we provide a roadmap of the proof itself. In Section E.3, we establish the main hypotheses necessary to apply the notion of prevalence to a convex set, in our case the set of $\mathcal{U}$-fine distributions. In Section E.4, we establish a number of technical lemmata used in the proof of Theorem 1, and we prove the theorem itself in Section E.5. In Section E.6, we show why the hypothesis of $\mathcal{U}$-fineness is important and how conspiracies between atoms in the distribution of $u(X)$ can lead to “robust” counterexamples.

E.1 Shyness and Prevalence

Lebesgue measure $\lambda_n$ on $\mathbb{R}^n$ has a number of desirable properties:

  • Local finiteness: For any point $v \in \mathbb{R}^n$, there exists an open set $U$ containing $v$ such that $\lambda_n[U] < \infty$;

  • Strict positivity: For any open set $U$, if $\lambda_n[U] = 0$, then $U = \emptyset$;

  • Translation invariance: For any $v \in \mathbb{R}^n$ and measurable set $E$, $\lambda_n[E + v] = \lambda_n[E]$.

No measure on an infinite-dimensional, separable Banach space, such as $L^1(\mathbb{R})$, can satisfy all three of these properties (Ott & Yorke, 2005). However, while there is no generalization of Lebesgue measure to infinite dimensions, there is a generalization of Lebesgue null sets, called shy sets, to the infinite-dimensional context that preserves many of their desirable properties.

Definition E.3 (Hunt et al. (1992)).

Let $V$ be a completely metrizable topological vector space. We say that a Borel set $E \subseteq V$ is shy if there exists a Borel measure $\mu$ on $V$ such that:

  1. There exists a compact set $C \subseteq V$ such that $0 < \mu[C] < \infty$;

  2. For all $v \in V$, $\mu[E + v] = 0$.

An arbitrary set $F \subseteq V$ is shy if there exists a shy Borel set $E \subseteq V$ containing $F$.

We say that a set is prevalent if its complement is shy.

Prevalence generalizes the concept of Lebesgue “full measure” or “co-null” sets (i.e., sets whose complements have null Lebesgue measure) in the following sense:

Proposition E.3 (Hunt et al. (1992)).

Let $V$ be a completely metrizable topological vector space. Then:

  • Any prevalent set is dense in $V$;

  • If $G \subseteq L$ and $G$ is prevalent, then $L$ is prevalent;

  • A countable intersection of prevalent sets is prevalent;

  • Every translate of a prevalent set is prevalent;

  • If $V = \mathbb{R}^n$, then $G \subseteq \mathbb{R}^n$ is prevalent if and only if $\lambda_n[\mathbb{R}^n \setminus G] = 0$.

As is conventional for sets of full measure in finite-dimensional spaces, if some property holds for every $v \in E$, where $E$ is prevalent, then we say that the property holds for almost every $v \in V$, or that it holds generically in $V$.

Prevalence can also be generalized from vector spaces to convex subsets of vector spaces, although additional care must be taken to ensure that a relative version of Prop. E.3 holds.

Definition E.4 (Anderson & Zame (2001)).

Let $V$ be a topological vector space and let $C \subseteq V$ be a convex subset that is completely metrizable in the subspace topology induced by $V$. We say that a universally measurable set $E \subseteq C$ is shy in $C$ at $c \in C$ if for each $\delta$ with $1 \ge \delta > 0$ and each neighborhood $U$ of $0$ in $V$, there is a regular Borel measure $\mu$ with compact support such that

\[\operatorname{Supp}(\mu) \subseteq \bigl(\delta(C - c) + c\bigr) \cap (U + c),\]

and $\mu[E + v] = 0$ for every $v \in V$.

We say that $E$ is shy in $C$, or shy relative to $C$, if $E$ is shy in $C$ at $c$ for every $c \in C$. An arbitrary set $F \subseteq V$ is shy in $C$ if there exists a universally measurable shy set $E \subseteq C$ containing $F$.

A set $G$ is prevalent in $C$ if $C \setminus G$ is shy in $C$.

Proposition E.4 (Anderson & Zame (2001)).

If $E$ is shy at some point $c \in C$, then $E$ is shy at every point in $C$, and hence is shy in $C$.

Sets that are shy in $C$ enjoy properties similar to those of sets that are shy in $V$.

Proposition E.5 (Anderson & Zame (2001)).

Let $V$ be a topological vector space and let $C \subseteq V$ be a convex subset completely metrizable in the subspace topology induced by $V$. Then:

  • Any set prevalent in $C$ is dense in $C$;

  • If $G \subseteq L$ and $G$ is prevalent in $C$, then $L$ is prevalent in $C$;

  • A countable intersection of sets prevalent in $C$ is prevalent in $C$;

  • If $G$ is prevalent in $C$, then $G + v$ is prevalent in $C + v$ for all $v \in V$;

  • If $V = \mathbb{R}^n$ and $C \subseteq V$ is a convex subset with non-empty interior, then $G \subseteq C$ is prevalent in $C$ if and only if $\lambda_n[C \setminus G] = 0$.

Sets that are shy in $C$ can often be identified by inspecting their intersections with a finite-dimensional subspace $W$ of $V$, a strategy we use to prove Theorem 1.

Definition E.5 (Anderson & Zame (2001)).

A universally measurable set $E \subseteq C$, where $C$ is convex and completely metrizable, is said to be $k$-shy in $C$ if there exists a $k$-dimensional subspace $W \subseteq V$ such that:

  1. A translate of the set $C$ has positive Lebesgue measure in $W$, i.e., $\lambda_W[C + v_0] > 0$ for some $v_0 \in V$;

  2. Every translate of the set $E$ is a Lebesgue null set in $W$, i.e., $\lambda_W[E + v] = 0$ for all $v \in V$.

Here $\lambda_W$ denotes $k$-dimensional Lebesgue measure supported on $W$. (Note that Lebesgue measure on $W$ is only defined up to a choice of basis; however, since $\lambda[T(A)] = |\det(T)| \cdot \lambda[A]$ for any linear automorphism $T$ and Lebesgue measure $\lambda$, whether a set has null measure does not depend on the choice of basis.) We refer to such a $W$ as a $k$-dimensional probe witnessing the $k$-shyness of $E$, and to an element $w \in W$ as a perturbation.

The following intuition motivates the use of probes to detect shy sets. By analogy with Fubini's theorem, one can imagine trying to determine whether a subset of a finite-dimensional vector space is large or small by looking at its cross sections parallel to some subspace $W \subseteq V$. If a set $E \subseteq V$ is small in each cross section (i.e., if $\lambda_W[E + v] = 0$ for all $v \in V$), then $E$ itself is small in $V$, i.e., $E$ has $\lambda_V$-measure zero.

Proposition E.6 (Anderson & Zame (2001)).

Every $k$-shy set in $C$ is shy in $C$.

E.2 Outline

To aid the reader in following the application of the theory in Section E.1 to the proof of Theorem 1, we provide the following outline of the argument.

In Section E.3, we establish the setting to which we apply the notion of relative shyness. In particular, we introduce the vector space $\mathbb{K}$ consisting of the totally bounded Borel measures on the state space $\mathcal{K}$, where $\mathcal{K}$ is $\mathcal{X} \times \mathcal{Y}$, $\mathcal{X} \times \mathcal{Y} \times \mathcal{Y}$, or $\mathcal{A} \times \mathcal{X}^{\mathcal{A}}$, depending on which notion of fairness is under consideration. We further isolate the subspace $\mathbf{K} \subseteq \mathbb{K}$ of $\mathcal{U}$-fine totally bounded Borel measures. Within this space, we are interested in the convex set $\mathbf{Q} \subseteq \mathbf{K}$ of $\mathcal{U}$-fine joint probability distributions of, respectively, $X$ and $Y(1)$; $X$, $Y(0)$, and $Y(1)$; or $A$ and the $X_{\Pi,A,a'}$. Within $\mathbf{Q}$, we identify the set $\mathbf{E} \subseteq \mathbf{Q}$ of $\mathcal{U}$-fine distributions on $\mathcal{K}$ over which there exists a policy satisfying the relevant fairness definition that is not strongly Pareto dominated. The claim of Theorem 1 is that $\mathbf{E}$ is shy relative to $\mathbf{Q}$.

To ensure that relative shyness generalizes Lebesgue null measure in the expected way (i.e., that Prop. E.5 holds), Definition E.4 has three technical requirements: (1) that the ambient vector space $V$ be a topological vector space; (2) that the convex set $C$ be completely metrizable; and (3) that the shy set $E$ be universally measurable. In Lemma E.7, we observe that $\mathbb{K}$ is a complete topological vector space under the total variation norm, and so is a Banach space. We extend this in Cor. E.2, showing that $\mathbf{K}$ is also a Banach space. We use this fact in Lemma E.11 to show that $\mathbf{Q}$ is a completely metrizable subset of $\mathbf{K}$, as well as convex. Lastly, in Lemma E.13, we show that the set $\mathbf{E}$ is closed, and therefore universally measurable.

In Section E.4, we develop the machinery needed to construct a probe $\mathbf{W}$ for the proof of Theorem 1 and prove several lemmata simplifying the eventual proof of the theorem. To build the probe, it is necessary to construct measures $\mu_{\max,a}$ with maximal support on the utility scale. This ensures that if any two threshold policies produce different decisions on any $\mu \in \mathbf{K}$, they will produce different decisions on typical perturbations. The construction of the $\mu_{\max,a}$ is carried out in Lemma E.14 and Cor. E.3. Next, we introduce the basic style of argument used to show that a subset of $\mathbf{Q}$ is shy in Lemma E.15 and Lemma E.16, in particular by showing that the set of $\mu \in \mathbf{Q}$ that give positive probability to an event $E$ is either prevalent or empty. We then use a technical lemma, Lemma E.17, to show, in effect, that a generic element of $\mathbf{Q}$ has support on the utility scale wherever a given fixed distribution $\mu \in \mathbf{Q}$ does. In Defn. E.12, we introduce the concept of overlapping and splitting utilities, and show in Lemma E.19 that this property is generic in $\mathbf{Q}$ unless there exists an $\omega$-stratum that contains no positive-utility observables $x$. Lastly, in Lemma E.20, we provide a mild simplification of the characterization of finitely shy sets that makes the proof of Theorem 1 more straightforward.

Finally, in Section E.5, we give the proof of Theorem 1. We divide the proof into three parts. In the first part, we restrict our attention to the case of counterfactual equalized odds and show in detail how to combine the lemmata of the previous section to construct the (at most) $2 \cdot |\mathcal{A}|$-dimensional probe $\mathbf{W}$. In the second part, we consider two distinct cases; the argument in both is conceptually parallel. First, we argue that the balance conditions of counterfactual equalized odds encoded by Eq. (2) must be broken by a typical perturbation in $\mathbf{W}$. In particular, we argue that for a given base distribution $\mu$, there can be at most one budget-exhausting multiple threshold policy that can (although need not) satisfy counterfactual equalized odds. We show that the form of this policy cannot be altered by an appropriate perturbation in $\mathbf{W}$, but that the conditional probability of a positive decision will, in general, be altered in such a way that Eq. (2) can hold only for a $\lambda_{\mathbf{W}}$-null set of perturbations. In the final part, we describe the modifications to the argument for counterfactual equalized odds needed to handle conditional principal fairness and path-specific fairness. In particular, we show how to construct the probe $\mathbf{W}$ in such a way that the additional conditioning on the reduced covariates $W = \omega(X)$ in Eqs. (3) and (5) does not affect the argument.

E.3 Convexity, Complete Metrizability, and Universal Measurability

In this section, we establish the background requirements of Prop. E.6 for the setting of Theorem 1. In particular, we exhibit the $\mathcal{U}$-fine distributions as a convex subset of a topological vector space, the space of totally bounded $\mathcal{U}$-fine Borel measures. We show that the $\mathcal{U}$-fine probability distributions form a completely metrizable subset in the topology inherited from the space of totally bounded measures. Lastly, we show that the set of regular distributions under which there exists a Pareto efficient policy satisfying one of the three fairness criteria is closed, and therefore universally measurable.

E.3.1 Background and notation

We begin by establishing some notational conventions. We let 𝒦\mathcal{K} denote the underlying state space over which the distributions in Theorem 1 range. Specifically, 𝒦=𝒳×𝒴\mathcal{K}=\mathcal{X}\times\mathcal{Y} in the case of counterfactual equalized odds; 𝒦=𝒳×𝒴×𝒴\mathcal{K}=\mathcal{X}\times\mathcal{Y}\times\mathcal{Y} in the case of conditional principal fairness; and 𝒦=𝒜×𝒳𝒜\mathcal{K}=\mathcal{A}\times\mathcal{X}^{\mathcal{A}} in the case of path-specific fairness. We note that since 𝒳k\mathcal{X}\subseteq\mathbb{R}^{k} for some kk and 𝒴\mathcal{Y}\subseteq\mathbb{R}, 𝒦\mathcal{K} may equivalently be considered a subset of n\mathbb{R}^{n} for some nn\in\mathbb{N}, with the subspace topology (and Borel sets) inherited from n\mathbb{R}^{n}. (In the case of path-specific fairness, we can equivalently think of 𝒜\mathcal{A} as a set of integers indexing the groups.)

We recall the definition of totally bounded measures.

Definition E.6.

Let \mathcal{M} be a σ\sigma-algebra on VV, and let μ\mu be a countably additive (V,)(V,\mathcal{M})-measure. Then, we define

|μ|[E]=supi=1|μ[Ei]||\mu|[E]=\sup\sum_{i=1}^{\infty}|\mu[E_{i}]| (12)

where the supremum is taken over all countable partitions {Ei}i\{E_{i}\}_{i\in\mathbb{N}}, i.e., collections such that i=1Ei=E\bigcup_{i=1}^{\infty}E_{i}=E and EiEj=E_{i}\cap E_{j}=\emptyset for jij\neq i. We call |μ||\mu| the total variation of μ\mu, and the total variation norm of μ\mu is |μ|[V]|\mu|[V].

We say that μ\mu is totally bounded if its total variation norm is finite, i.e., |μ|[V]<|\mu|[V]<\infty.
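For intuition, on a finite state space the supremum in Eq. (12) is attained by the partition into singletons, so the total variation can be computed directly. The following minimal sketch (a toy example with an invented signed measure, not part of the formal development) illustrates Definition E.6 and the inequality in Lemma E.5:

```python
import numpy as np
from itertools import combinations

# A signed measure on a finite state space V = {0, 1, 2, 3}, represented by its
# (possibly negative) point masses. The values are invented for illustration.
mu = np.array([0.4, -0.1, 0.3, -0.2])

# On a finite space, the supremum in Eq. (12) is attained by the partition of E
# into singletons, so |mu|[E] is the sum of absolute point masses over E.
abs_mu = np.abs(mu)
print(abs_mu.sum())  # total variation norm |mu|[V] = 1.0

# Lemma E.5: |mu[E]| <= |mu|[E] for every subset E, checked by brute force.
V = range(len(mu))
ok = all(
    abs(mu[list(E)].sum()) <= abs_mu[list(E)].sum() + 1e-12
    for r in range(len(mu) + 1)
    for E in combinations(V, r)
)
print(ok)  # True
```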

Lemma E.5.

If μ\mu is totally bounded, then |μ||\mu| is a finite positive measure on (V,)(V,\mathcal{M}), and |μ[E]||μ|[E]|\mu[E]|\leq|\mu|[E] for all EE\in\mathcal{M}.

See Theorem 6.2 in Rudin (1987) for proof.

We let 𝕂\mathbb{K} denote the set of totally bounded Borel measures on 𝒦\mathcal{K}. We note that, in the case of path-specific fairness, which involves the joint distributions of counterfactuals, XX is not defined directly. Rather, the joint distribution of the counterfactuals XΠ,A,aX_{\Pi,A,a^{\prime}} and AA defines the distribution of XX through consistency, i.e., what would have happened to someone if their group membership were changed to a𝒜a^{\prime}\in\mathcal{A} is what actually happens to them if their group membership is aa^{\prime}. More formally, Pr(XEA=a)=Pr(XΠ,A,aEA=a)\operatorname{Pr}(X\in E\mid A=a^{\prime})=\operatorname{Pr}(X_{\Pi,A,a^{\prime}}\in E\mid A=a^{\prime}) for all Borel sets E𝒳E\subseteq\mathcal{X}. (See § 3.6.3 in Pearl (2009b).)

For any μ𝕂\mu\in\mathbb{K}, we adopt the following notational conventions. If we say that a property holds μ\mu-a.s., then the subset of 𝒦\mathcal{K} on which the property fails has |μ||\mu|-measure zero. If E𝒦E\subseteq\mathcal{K} is a measurable set, then we denote by μE\mu\operatorname{\upharpoonright}_{E} the restriction of μ\mu to EE, i.e., the measure defined by the mapping Eμ[EE]E^{\prime}\mapsto\mu[E\cap E^{\prime}]. We let 𝔼μ[f]=𝒦fdμ\mathbb{E}_{\mu}[f]=\int_{\mathcal{K}}f\,\mathrm{d}\mu, and for measurable sets EE, Prμ(E)=μ[E]\operatorname{Pr}_{\mu}(E)=\mu[E]. (To state and prove our results in a notationally uniform way, we occasionally write Prμ(E)\operatorname{Pr}_{\mu}(E) even when μ\mu ranges over measures that may not be probability measures.) The fairness criteria we consider involve conditional independence relations. To make sense of conditional independence relations more generally, for Borel measurable ff we define 𝔼μ[f]\mathbb{E}_{\mu}[f\mid\mathcal{F}] to be the Radon-Nikodym derivative of the measure E𝔼μ[f𝟙E]E\mapsto\mathbb{E}_{\mu}[f\cdot\mathbb{1}_{E}] with respect to the measure μ\mu restricted to the sub–σ\sigma-algebra of Borel sets \mathcal{F}. (See § 34 in Billingsley (1995).) Similarly, we define 𝔼μ[fg]\mathbb{E}_{\mu}[f\mid g] to be 𝔼μ[fσ(g)]\mathbb{E}_{\mu}[f\mid\sigma(g)], where σ(g)\sigma(g) denotes the sub–σ\sigma-algebra of the Borel sets generated by gg. In cases where the condition can occur with non-zero probability, we can instead make use of the elementary definition of discrete conditional probability.

Lemma E.6.

Let gg be a Borel function on 𝒦\mathcal{K}, and suppose Prμ(g=c)0\operatorname{Pr}_{\mu}(g=c)\neq 0 for some constant cc\in\mathbb{R}. Then, we have that μ\mu-a.s., for any Borel function ff,

𝔼μ[fg]𝟙g=c=𝔼μ[f𝟙g=c]Prμ(g=c)𝟙g=c.\mathbb{E}_{\mu}[f\mid g]\cdot\mathbb{1}_{g=c}=\frac{\mathbb{E}_{\mu}[f\cdot\mathbb{1}_{g=c}]}{\operatorname{Pr}_{\mu}(g=c)}\cdot\mathbb{1}_{g=c}.

See Rao (2005) for proof.
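As a sanity check, the elementary formula in Lemma E.6 can be verified on an empirical distribution. The sketch below uses an arbitrary, invented choice of ff and gg purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Samples from an illustrative joint distribution mu.
x = rng.normal(size=100_000)
f = x ** 2                  # an arbitrary Borel function f
g = (x > 0).astype(int)     # a Borel function g; we condition on the event {g = 1}
c = 1

# Right-hand side of Lemma E.6: E_mu[f * 1{g = c}] / Pr_mu(g = c).
rhs = (f * (g == c)).mean() / (g == c).mean()

# Left-hand side, evaluated on {g = c}: the conditional expectation is just the
# average of f over the conditioning event.
lhs = f[g == c].mean()

print(lhs, rhs)  # identical on the empirical measure
```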

With these notational conventions in place, we turn to establishing the background conditions of Prop. E.6.

Lemma E.7.

The set of totally bounded measures on a measure space (V,)(V,\mathcal{M}) forms a complete normed vector space under the total variation norm, and hence a Banach space.

See, e.g., Steele (2019) for proof. It follows from this that 𝕂\mathbb{K} is a Banach space.

Remark 3.

Since 𝕂\mathbb{K} is a Banach space, it possesses a topology, and consequently a collection of Borel subsets. These Borel sets are to be distinguished from the Borel subsets of the underlying state space 𝒦\mathcal{K}, which the elements of 𝕂\mathbb{K} measure. The requirement that the subset EE of the convex set CC be universally measurable in Proposition E.6 is in reference to the Borel subsets of 𝕂\mathbb{K}; the requirement that μ𝕂\mu\in\mathbb{K} be a Borel measure is in reference to the Borel subsets of 𝒦\mathcal{K}.

Recall the definition of absolute continuity.

Definition E.7.

Let μ\mu and ν\nu be measures on a measure space (V,)(V,\mathcal{M}). We say that a measure ν\nu is absolutely continuous with respect to μ\mu—also written νμ\nu\Lt\mu—if, whenever μ[E]=0\mu[E]=0, ν[E]=0\nu[E]=0.

Absolute continuity is a closed property in the topology induced by the total variation norm.

Lemma E.8.

Consider the space of totally bounded measures on a measure space (V,)(V,\mathcal{M}) and fix μ\mu. The set of ν\nu such that νμ\nu\Lt\mu is closed.

Proof.

Let {νi}i\{\nu_{i}\}_{i\in\mathbb{N}} be a convergent sequence of measures absolutely continuous with respect to μ\mu. Let the limit of the νi\nu_{i} be ν\nu. We seek to show that νμ\nu\Lt\mu. Let EE\in\mathcal{M} be an arbitrary set such that μ[E]=0\mu[E]=0. Then, we have that

\nu[E]=\lim_{i\to\infty}\nu_{i}[E]=\lim_{i\to\infty}0=0,

since νiμ\nu_{i}\Lt\mu for all ii. Since EE was arbitrary, the result follows. ∎

Recall the definition of a pushforward measure.

Definition E.8.

Let f:(V,)(V,)f:(V,\mathcal{M})\to(V^{\prime},\mathcal{M}^{\prime}) be a measurable function. Let μ\mu be a measure on VV. We define the pushforward measure μf1\mu\circ f^{-1} on VV^{\prime} by the map Eμ[f1(E)]E^{\prime}\mapsto\mu[f^{-1}(E^{\prime})] for EE^{\prime}\in\mathcal{M}^{\prime}.

Within 𝕂\mathbb{K}, in the case of counterfactual equalized odds and conditional principal fairness, we define the subspace 𝐊\mathbf{K} to be the set of totally bounded measures μ\mu on 𝒦\mathcal{K} such that the pushforward measure μu1\mu\circ u^{-1} is absolutely continuous with respect to the Lebesgue measure λ\lambda on \mathbb{R} for all u𝒰u\in\mathcal{U}. By the Radon-Nikodym theorem, these pushforward measures arise from densities, i.e., for any μ𝐊\mu\in\mathbf{K}, there exists a unique fμL1()f_{\mu}\in L^{1}(\mathbb{R}) such that for any measurable subset EE of \mathbb{R}, we have

μu1[E]=Efμdλ.\mu\circ u^{-1}[E]=\int_{E}f_{\mu}\,\mathrm{d}\lambda.

In the case of path-specific fairness, we require the joint distributions of the counterfactual utilities to have a joint density. That is, we define the subspace 𝐊\mathbf{K} to be the set of totally bounded measures μ\mu on 𝒦\mathcal{K} such that the pushforward measure μ(u𝒜)1\mu\circ(u^{\mathcal{A}})^{-1} is absolutely continuous with respect to Lebesgue measure on 𝒜\mathbb{R}^{\mathcal{A}} for all u𝒰u\in\mathcal{U}. Here, we recall that

u𝒜:(a,(xa)a𝒜)(u(xa))a𝒜.u^{\mathcal{A}}:(a,(x_{a^{\prime}})_{a^{\prime}\in\mathcal{A}})\mapsto(u(x_{a^{\prime}}))_{a^{\prime}\in\mathcal{A}}.

As before, there exists a corresponding density fμL1(𝒜)f_{\mu}\in L^{1}(\mathbb{R}^{\mathcal{A}}).

We therefore see that 𝐊\mathbf{K} extends in a natural way the notion of a 𝒰\mathcal{U}- or 𝒰𝒜\mathcal{U}^{\mathcal{A}}-fine distribution, and so, by a slight abuse of notation, we refer to 𝐊\mathbf{K} as the set of 𝒰\mathcal{U}-fine measures on 𝒦\mathcal{K}.

Indeed, since Prμ(u(X)E,A=a)Prμ(u(X)E)\operatorname{Pr}_{\mu}(u(X)\in E,A=a)\leq\operatorname{Pr}_{\mu}(u(X)\in E), it also follows that, for a𝒜a\in\mathcal{A} such that Prμ(A=a)>0\operatorname{Pr}_{\mu}(A=a)>0, the conditional distributions of u(X)A=au(X)\mid A=a are also absolutely continuous with respect to Lebesgue measure, and so also have densities. For notational convenience, we set fμ,af_{\mu,a} to be the function satisfying

Prμ(u(X)E,A=a)=Efμ,adλ,\operatorname{Pr}_{\mu}(u(X)\in E,A=a)=\int_{E}f_{\mu,a}\,\mathrm{d}\lambda,

so that fμ=a𝒜fμ,af_{\mu}=\sum_{a\in\mathcal{A}}f_{\mu,a}.
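To make the decomposition fμ=Σa fμ,a concrete, the following sketch computes the group-level densities on the utility scale and checks that their sum integrates to one. The group weights and Gaussian utility distributions are invented for illustration and are not the paper's data:

```python
import numpy as np
from scipy import stats

# Invented example: group probabilities Pr(A = a) and conditional utility
# densities of u(X) given A = a.
p_a = {"a0": 0.6, "a1": 0.4}
cond = {"a0": stats.norm(loc=-0.5, scale=1.0), "a1": stats.norm(loc=0.7, scale=0.8)}

grid = np.linspace(-8.0, 8.0, 4001)
dx = grid[1] - grid[0]

# f_{mu,a} is the density of the sub-probability measure Pr(u(X) in ., A = a),
# i.e., Pr(A = a) times the conditional density of u(X) given A = a.
f_mu_a = {a: p_a[a] * cond[a].pdf(grid) for a in p_a}

# The density f_mu of the pushforward mu o u^{-1} is the sum over groups,
# and integrates to one since mu is a probability measure.
f_mu = sum(f_mu_a.values())
print((f_mu * dx).sum())  # ~1.0
```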

Since absolute continuity is a closed condition, it follows that 𝐊\mathbf{K} is a closed subspace of 𝕂\mathbb{K}. This leads to the following useful corollary of Lemma E.8.

Corollary E.2.

The collection of 𝒰\mathcal{U}-fine measures on 𝒦\mathcal{K} is a Banach space.

Proof.

It is straightforward to see that 𝐊\mathbf{K} is a subspace of 𝕂\mathbb{K}. Since 𝐊\mathbf{K} is a closed subset of 𝕂\mathbb{K} by Lemma E.8, it is complete, and therefore a Banach space. ∎

We note the following useful fact about elements of 𝐊\mathbf{K}.

Lemma E.9.

Consider the mapping μfμ\mu\mapsto f_{\mu} from 𝐊\mathbf{K} to L1()L^{1}(\mathbb{R}) given by associating a measure μ\mu with the Radon-Nikodym derivative of the pushforward measure μu1\mu\circ u^{-1}. This mapping is continuous. Likewise, the mapping μfμ,a\mu\mapsto f_{\mu,a} is continuous for all a𝒜a\in\mathcal{A}, and, in the case of path-specific fairness, the mapping of μ\mu to the Radon-Nikodym derivative of μ(u𝒜)1\mu\circ(u^{\mathcal{A}})^{-1} is continuous.

Proof.

We show only the first case. The others follow by virtually identical arguments.

Let ϵ>0\epsilon>0 be arbitrary. Choose μ𝐊\mu\in\mathbf{K}, and suppose that |μμ|[𝒦]<ϵ|\mu-\mu^{\prime}|[\mathcal{K}]<\epsilon. Then, let

EUp\displaystyle E^{\operatorname{{Up}}} ={x:fμ(x)>fμ(x)}\displaystyle=\{x\in\mathbb{R}:f_{\mu}(x)>f_{\mu^{\prime}}(x)\}
ELo\displaystyle E^{\operatorname{{Lo}}} ={x:fμ(x)<fμ(x)}.\displaystyle=\{x\in\mathbb{R}:f_{\mu}(x)<f_{\mu^{\prime}}(x)\}.

Then EUpE^{\operatorname{{Up}}} and ELoE^{\operatorname{{Lo}}} are disjoint, so we have that

fμfμL1()\displaystyle\|f_{\mu}-f_{\mu^{\prime}}\|_{L^{1}(\mathbb{R})} =|EUpfμfμdλ|\displaystyle=\left|\int_{E^{\operatorname{{Up}}}}f_{\mu}-f_{\mu^{\prime}}\,\mathrm{d}\lambda\right|
+|ELofμfμdλ|\displaystyle\hskip 28.45274pt+\left|\int_{E^{\operatorname{{Lo}}}}f_{\mu}-f_{\mu^{\prime}}\,\mathrm{d}\lambda\right|
=|(μμ)[u1(EUp)]|\displaystyle=|(\mu-\mu^{\prime})[u^{-1}(E^{\operatorname{{Up}}})]|
+|(μμ)[u1(ELo)]|\displaystyle\hskip 28.45274pt+|(\mu-\mu^{\prime})[u^{-1}(E^{\operatorname{{Lo}}})]|
<ϵ,\displaystyle<\epsilon,

where the second equality follows by the definition of pushforward measures and the inequality follows from Lemma E.5. Since ϵ\epsilon was arbitrary, the claim follows. ∎

Finally, we define 𝐐\mathbf{Q}. We let 𝐐\mathbf{Q} be the subset of 𝐊\mathbf{K} consisting of all 𝒰\mathcal{U}-fine probability measures, i.e., measures μ𝕂\mu\in\mathbb{K} such that:

1. The measure μ\mu is 𝒰\mathcal{U}-fine;

2. For all Borel sets E𝒦E\subseteq\mathcal{K}, μ[E]0\mu[E]\geq 0;

3. The measure of the whole space is unity, i.e., μ[𝒦]=1\mu[\mathcal{K}]=1.

We conclude the background and notation by observing that threshold policies are defined wholly by their thresholds for distributions in 𝐊\mathbf{K} and 𝐐\mathbf{Q}. Importantly, this observation does not hold when there are atoms on the utility scale—which measures in 𝐊\mathbf{K} lack—which can in turn lead to counterexamples to Theorem 1; see Appendix E.6.

Lemma E.10.

Let τ0(x)\tau_{0}(x) and τ1(x)\tau_{1}(x) be two multiple threshold policies. If τ0(x)\tau_{0}(x) and τ1(x)\tau_{1}(x) have the same thresholds, then for any μ𝐊\mu\in\mathbf{K}, τ0(X)=τ1(X)\tau_{0}(X)=\tau_{1}(X) μ\mu-a.s. Similarly, for μ𝐐\mu\in\mathbf{Q}, if

𝔼μ[τ0(X)A=a]=𝔼μ[τ1(X)A=a]\mathbb{E}_{\mu}[\tau_{0}(X)\mid A=a]=\mathbb{E}_{\mu}[\tau_{1}(X)\mid A=a]

for all a𝒜a\in\mathcal{A} such that Prμ(A=a)>0\operatorname{Pr}_{\mu}(A=a)>0, then τ0(X)=τ1(X)\tau_{0}(X)=\tau_{1}(X) μ\mu-a.s.

Moreover, for μ𝐊\mu\in\mathbf{K} in the case of path-specific fairness, if τ0(x)\tau_{0}(x) and τ1(x)\tau_{1}(x) have the same thresholds, then τ0(XΠ,A,a)=τ1(XΠ,A,a)\tau_{0}(X_{\Pi,A,a})=\tau_{1}(X_{\Pi,A,a}) μ\mu-a.s. for any a𝒜a\in\mathcal{A}. Similarly, for μ𝐐\mu\in\mathbf{Q} in the case of path-specific fairness, if

𝔼μ[τ0(XΠ,A,a)]=𝔼μ[τ1(XΠ,A,a)]\mathbb{E}_{\mu}[\tau_{0}(X_{\Pi,A,a})]=\mathbb{E}_{\mu}[\tau_{1}(X_{\Pi,A,a})]

then τ0(XΠ,A,a)=τ1(XΠ,A,a)\tau_{0}(X_{\Pi,A,a})=\tau_{1}(X_{\Pi,A,a}) μ\mu-a.s. as well.

Proof.

First, we show that threshold policies with the same thresholds are equal, then we show that threshold policies that distribute positive decisions across groups in the same way are equal.

Let {ta}a𝒜\{t_{a}\}_{a\in\mathcal{A}} denote the shared set of thresholds. It follows that if τ0(x)τ1(x)\tau_{0}(x)\neq\tau_{1}(x), then u(x)=tα(x)u(x)=t_{\alpha(x)}. Now,

Pr(u(X)=ta,A=a)=tatafμ,adλ=0,\operatorname{Pr}(u(X)=t_{a},A=a)=\int_{t_{a}}^{t_{a}}f_{\mu,a}\,\mathrm{d}\lambda=0,

so Prμ(τ0(X)τ1(X))=0\operatorname{Pr}_{\mu}(\tau_{0}(X)\neq\tau_{1}(X))=0. Next, suppose

𝔼μ[τ0(X)A=a]=𝔼μ[τ1(X)A=a].\mathbb{E}_{\mu}[\tau_{0}(X)\mid A=a]=\mathbb{E}_{\mu}[\tau_{1}(X)\mid A=a].

If the thresholds of the two policies agree for all a𝒜a\in\mathcal{A} such that Prμ(A=a)>0\operatorname{Pr}_{\mu}(A=a)>0, then we are done by the previous paragraph. Therefore, suppose ta0ta1t_{a}^{0}\neq t_{a}^{1} for some suitable a𝒜a\in\mathcal{A}, where tait_{a}^{i} represents the threshold for group a𝒜a\in\mathcal{A} under the policy τi(x)\tau_{i}(x). Without loss of generality, suppose ta0<ta1t_{a}^{0}<t_{a}^{1}. Then, it follows that

\int_{t_{a}^{0}}^{t_{a}^{1}}f_{\mu,a}\,\mathrm{d}\lambda=\operatorname{Pr}_{\mu}(A=a)\cdot\left(\mathbb{E}_{\mu}[\tau_{0}(X)\mid A=a]-\mathbb{E}_{\mu}[\tau_{1}(X)\mid A=a]\right)=0.

Since μ𝐐\mu\in\mathbf{Q}, μ=|μ|\mu=|\mu|, whence

\operatorname{Pr}_{|\mu|}(t_{a}^{0}\leq u(X)\leq t_{a}^{1}\mid A=a)=0.

Since this is true for all a𝒜a\in\mathcal{A} such that Prμ(A=a)>0\operatorname{Pr}_{\mu}(A=a)>0, τ0(X)=τ1(X)\tau_{0}(X)=\tau_{1}(X) μ\mu-a.s.

The proof in the case of path-specific fairness is almost identical. ∎
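The role of the no-atoms requirement in Lemma E.10 can be seen in a small Monte Carlo sketch with invented distributions. When u(X) has a density, distinct thresholds necessarily produce distinct acceptance rates, so acceptance rates pin down the policy; with an atom at the threshold, two policies with the same threshold can disagree on a set of positive probability, which is the kind of counterexample discussed in Appendix E.6:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Atomless case: u(X) has a density, so the mass strictly between two distinct
# thresholds is positive and the acceptance rates differ.
u = rng.normal(size=n)
t0, t1 = 0.4, 0.6
print((u > t0).mean() - (u > t1).mean())  # > 0

# Atom at the threshold: two "threshold policies" at t = 0.5 that differ only
# in how they treat the atom disagree with probability ~0.3.
u_atom = np.where(rng.random(n) < 0.3, 0.5, u)
tau0 = (u_atom >= 0.5)
tau1 = (u_atom > 0.5)
print((tau0 != tau1).mean())  # ~0.3
```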

E.3.2 Convexity, complete metrizability, and universal measurability

The set of regular 𝒰\mathcal{U}-fine probability measures 𝐐\mathbf{Q} is the set to which we wish to apply Prop. E.6. To do so, we must show that 𝐐\mathbf{Q} is a convex and completely metrizable subset of 𝐊\mathbf{K}.

Lemma E.11.

The set of regular probability measures 𝐐\mathbf{Q} is convex and completely metrizable.

Proof.

The proof proceeds in two pieces. First, we show that the 𝒰\mathcal{U}-fine probability distributions form a convex set, which can be verified by direct calculation. Then, we show that 𝐐\mathbf{Q} is closed and therefore complete in the original metric of 𝐊\mathbf{K}.

We begin by verifying convexity. Let μ,μ𝐐\mu,\mu^{\prime}\in\mathbf{Q} and let E𝒦E\subseteq\mathcal{K} be an arbitrary Borel subset of 𝒦\mathcal{K}. Then, choose θ[0,1]\theta\in[0,1], and note that

(θμ+[1θ]μ)[E]\displaystyle(\theta\cdot\mu+[1-\theta]\cdot\mu^{\prime})[E] =θμ[E]+[1θ]μ[E]\displaystyle=\theta\cdot\mu[E]+[1-\theta]\cdot\mu^{\prime}[E]
θ0+[1θ]0\displaystyle\geq\theta\cdot 0+[1-\theta]\cdot 0
=0,\displaystyle=0,

and, likewise, that

(θμ+[1θ]μ)[𝒦]\displaystyle(\theta\cdot\mu+[1-\theta]\cdot\mu^{\prime})[\mathcal{K}] =θμ[𝒦]+[1θ]μ[𝒦]\displaystyle=\theta\cdot\mu[\mathcal{K}]+[1-\theta]\cdot\mu^{\prime}[\mathcal{K}]
=θ1+[1θ]1\displaystyle=\theta\cdot 1+[1-\theta]\cdot 1
=1.\displaystyle=1.

It remains only to show that 𝐐\mathbf{Q} is completely metrizable. To prove this, it suffices to show that it is closed, since closed subsets of complete spaces are complete, and 𝐊\mathbf{K} is a Banach space by Cor. E.2, and therefore complete.

Suppose {μi}i\{\mu_{i}\}_{i\in\mathbb{N}} is a convergent sequence of probability measures in 𝐊\mathbf{K} with limit μ\mu. Then

μ[E]=limiμi[E]limi0=0\mu[E]=\lim_{i\to\infty}\mu_{i}[E]\geq\lim_{i\to\infty}0=0

and

μ[𝒦]=limiμi[𝒦]=limi1=1.\mu[\mathcal{K}]=\lim_{i\to\infty}\mu_{i}[\mathcal{K}]=\lim_{i\to\infty}1=1.

Therefore 𝐐\mathbf{Q} is closed, hence complete, and so is a convex, completely metrizable subset of 𝐊\mathbf{K}. ∎

Next, we prove that the set 𝐄\mathbf{E} of regular 𝒰\mathcal{U}-fine distributions over which there exists a policy that satisfies the relevant counterfactual fairness definition and is not strongly Pareto dominated is universally measurable.

Recall the definition of universal measurability.

Definition E.9.

Let VV be a complete topological space. Then EVE\subseteq V is universally measurable if EE is measurable by the completion of every finite Borel measure on VV, i.e., if for every finite Borel measure μ\mu, there exist Borel sets EE^{\prime} and SS such that EESE\ \triangle\ E^{\prime}\subseteq S and μ[S]=0\mu[S]=0.

We note that if a set is Borel, it is by definition universally measurable. Moreover, if a set is open or closed, it is by definition Borel.

To show that 𝐄\mathbf{E} is closed, we show that any convergent sequence in 𝐄\mathbf{E} has its limit in 𝐄\mathbf{E}. The technical complication of the argument stems from the fact that the fairness conditions, e.g., Eq. (4), involve conditional expectations, about which very little can be said in the absence of a density, and which are difficult to compare when taken across distinct measures.

To handle these difficulties, we begin with a technical lemma, Lemma E.12, which gives a coarse bound on how different the conditional expectations of the same variable can be with respect to a sub–σ\sigma-algebra \mathcal{F} over two different distributions, μ\mu and μ\mu^{\prime}, before applying the results to the proof of Lemma E.13.

Definition E.10.

Let μ\mu be a measure on a measure space (V,)(V,\mathcal{M}), and let ff be μ\mu-measurable. Consider the equivalence class of \mathcal{M}-measurable functions C={g:g=f μ-a.e.}C=\{g:g=f\text{ $\mu$-a.e.}\}. (Some authors define Lp(μ)L^{p}(\mu) spaces to consist of such equivalence classes, rather than the definition we use here.) We say that any gCg\in C is a version of ff, and that gCg\in C is a standard version if |g(v)|C|g(v)|\leq C for some constant CC and all vVv\in V.

Remark 4.

It is straightforward to see that for fL(μ)f\in L^{\infty}(\mu), a standard version always exists with C=fC=\|f\|_{\infty}.

Remark 5.

Note that in general, the conditional expectation 𝔼μ[f]\mathbb{E}_{\mu^{\prime}}[f\mid\mathcal{F}] is defined only μ\mu^{\prime}-a.e. If μ\mu is not assumed to be absolutely continuous with respect to μ\mu^{\prime}, it follows that

𝔼μ[f]𝔼μ[f]L1(μ)\|\mathbb{E}_{\mu}[f\mid\mathcal{F}]-\mathbb{E}_{\mu^{\prime}}[f\mid\mathcal{F}]\|_{L^{1}(\mu)} (13)

is not entirely well-defined, in that its value depends on what version of 𝔼μ[f]\mathbb{E}_{\mu^{\prime}}[f\mid\mathcal{F}] one chooses. For appropriate ff, however, one can nevertheless bound Eq. (13) for any standard version of 𝔼μ[f]\mathbb{E}_{\mu^{\prime}}[f\mid\mathcal{F}].

Lemma E.12.

Let μ\mu, μ\mu^{\prime} be totally bounded measures on a measure space (V,)(V,\mathcal{M}). Let fL(μ)L(μ)f\in L^{\infty}(\mu)\cap L^{\infty}(\mu^{\prime}). Let \mathcal{F} be a sub–σ\sigma-algebra of \mathcal{M}. Let

C=max(fL(μ),fL(μ)).C=\max(\|f\|_{L^{\infty}(\mu)},\|f\|_{L^{\infty}(\mu^{\prime})}).

Then, if gg is a standard version of 𝔼μ[f]\mathbb{E}_{\mu^{\prime}}[f\mid\mathcal{F}], we have that

V|𝔼μ[f]g|dμ4C|μμ|[V].\int_{V}|\mathbb{E}_{\mu}[f\mid\mathcal{F}]-g|\,\mathrm{d}\mu\leq 4C\cdot|\mu-\mu^{\prime}|[V]. (14)
Proof.

First, we note that both 𝔼μ[f]\mathbb{E}_{\mu}[f\mid\mathcal{F}] and gg are \mathcal{F}-measurable. Therefore, the sets

EUp={vV:𝔼μ[f](v)>g(v)}E^{\operatorname{{Up}}}=\{v\in V:\mathbb{E}_{\mu}[f\mid\mathcal{F}](v)>g(v)\}

and

ELo={vV:𝔼μ[f](v)<g(v)}E^{\operatorname{{Lo}}}=\{v\in V:\mathbb{E}_{\mu}[f\mid\mathcal{F}](v)<g(v)\}

are in \mathcal{F}. Now, note that

V|𝔼μ[f]g|dμ=EUp𝔼μ[f]gdμ+ELog𝔼μ[f]dμ.\int_{V}|\mathbb{E}_{\mu}[f\mid\mathcal{F}]-g|\,\mathrm{d}\mu=\int_{E^{\operatorname{{Up}}}}\mathbb{E}_{\mu}[f\mid\mathcal{F}]-g\,\mathrm{d}\mu\\ +\int_{E^{\operatorname{{Lo}}}}g-\mathbb{E}_{\mu}[f\mid\mathcal{F}]\,\mathrm{d}\mu.

First consider EUpE^{\operatorname{{Up}}}. Then, we have that

\int_{E^{\operatorname{{Up}}}}\mathbb{E}_{\mu}[f\mid\mathcal{F}]-g\,\mathrm{d}\mu
=\left(\int_{E^{\operatorname{{Up}}}}\mathbb{E}_{\mu}[f\mid\mathcal{F}]\,\mathrm{d}\mu-\int_{E^{\operatorname{{Up}}}}g\,\mathrm{d}\mu^{\prime}\right)+\left(\int_{E^{\operatorname{{Up}}}}g\,\mathrm{d}\mu^{\prime}-\int_{E^{\operatorname{{Up}}}}g\,\mathrm{d}\mu\right)
\leq\left|\int_{E^{\operatorname{{Up}}}}\mathbb{E}_{\mu}[f\mid\mathcal{F}]\,\mathrm{d}\mu-\int_{E^{\operatorname{{Up}}}}g\,\mathrm{d}\mu^{\prime}\right|+\int_{E^{\operatorname{{Up}}}}|g|\,\mathrm{d}|\mu-\mu^{\prime}|
\leq\left|\int_{E^{\operatorname{{Up}}}}f\,\mathrm{d}\mu-\int_{E^{\operatorname{{Up}}}}f\,\mathrm{d}\mu^{\prime}\right|+\int_{E^{\operatorname{{Up}}}}C\,\mathrm{d}|\mu-\mu^{\prime}|,

where in the final inequality, we have used the fact that, since gg is a standard version of 𝔼μ[f]\mathbb{E}_{\mu^{\prime}}[f\mid\mathcal{F}],

|g(v)|\leq\|\mathbb{E}_{\mu^{\prime}}[f\mid\mathcal{F}]\|_{L^{\infty}(\mu^{\prime})}\leq C

for all vVv\in V, and the fact that, by the definition of conditional expectation,

E𝔼ν[h]dν=Ehdν\int_{E}\mathbb{E}_{\nu}[h\mid\mathcal{F}]\,\mathrm{d}\nu=\int_{E}h\,\mathrm{d}\nu

for any EE\in\mathcal{F}.

Since ff is everywhere bounded by CC, applying Lemma E.5 yields that this final expression is less than or equal to 2C|μμ|[V]2C\cdot|\mu-\mu^{\prime}|[V]. An identical argument shows that

ELog𝔼μ[f]dμ2C|μμ|[V],\int_{E^{\operatorname{{Lo}}}}g-\mathbb{E}_{\mu}[f\mid\mathcal{F}]\,\mathrm{d}\mu\leq 2C\cdot|\mu-\mu^{\prime}|[V],

whence the result follows. ∎
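The bound in Lemma E.12 can also be checked numerically in a toy setting. The sketch below uses a finite state space, a sub–σ-algebra generated by a two-cell partition, and randomly generated probability measures with full support (so that the choice of version is immaterial); all specifics are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Finite state space V = {0,...,5}; F is generated by the two cells below.
cells = [np.array([0, 1, 2]), np.array([3, 4, 5])]
mu = rng.random(6); mu /= mu.sum()
mu_p = rng.random(6); mu_p /= mu_p.sum()
f = rng.uniform(-1.0, 1.0, size=6)
C = np.abs(f).max()

def cond_exp(measure):
    # E_measure[f | F] is constant on each cell of the partition, equal to the
    # measure-weighted average of f over that cell.
    out = np.empty_like(f)
    for cell in cells:
        out[cell] = (f[cell] * measure[cell]).sum() / measure[cell].sum()
    return out

lhs = (np.abs(cond_exp(mu) - cond_exp(mu_p)) * mu).sum()  # integral w.r.t. mu
rhs = 4.0 * C * np.abs(mu - mu_p).sum()                   # 4C * |mu - mu'|[V]
print(lhs <= rhs)  # True, as guaranteed by Lemma E.12
```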

Lemma E.13.

Let 𝐄𝐐\mathbf{E}\subseteq\mathbf{Q} denote the set of joint densities on 𝒦\mathcal{K} such that there exists a policy satisfying the relevant fairness definition that is not strongly Pareto dominated. Then, 𝐄\mathbf{E} is closed, and therefore universally measurable.

Proof.

For notational simplicity, we consider the case of counterfactual equalized odds. The proofs in the other two cases are virtually identical.

Suppose μiμ\mu_{i}\to\mu in 𝐊\mathbf{K}, where {μi}i𝐄\{\mu_{i}\}_{i\in\mathbb{N}}\subseteq\mathbf{E}. Then, by Lemma E.9, fμi,afμ,af_{\mu_{i},a}\to f_{\mu,a} in L1()L^{1}(\mathbb{R}). Moreover, by Lemma D.2, there exists a sequence of threshold policies {τi(x)}i\{\tau_{i}(x)\}_{i\in\mathbb{N}} such that both

\mathbb{E}_{\mu_{i}}[\tau_{i}(X)]=\min(b,\operatorname{Pr}_{\mu_{i}}(u(X)>0))

and

𝔼μi[τi(X)A,Y(1)]=𝔼μi[τi(X)Y(1)].\mathbb{E}_{\mu_{i}}[\tau_{i}(X)\mid A,Y(1)]=\mathbb{E}_{\mu_{i}}[\tau_{i}(X)\mid Y(1)].

Let {qa,i}a𝒜\{q_{a,i}\}_{a\in\mathcal{A}} be defined by

qa,i=𝔼μi[τi(X)A=a]q_{a,i}=\mathbb{E}_{\mu_{i}}[\tau_{i}(X)\mid A=a]

if Prμi(A=a)>0\operatorname{Pr}_{\mu_{i}}(A=a)>0, and qa,i=0q_{a,i}=0 otherwise.

Since [0,1]𝒜[0,1]^{\mathcal{A}} is compact, there exists a convergent subsequence {{qa,ni}a𝒜}i\{\{q_{a,n_{i}}\}_{a\in\mathcal{A}}\}_{i\in\mathbb{N}}. Let it converge to the collection of quantiles {qa}a𝒜\{q_{a}\}_{a\in\mathcal{A}} defining, by Lemma D.3, a multiple threshold policy τ(x)\tau(x) over μ\mu.

Because μiμ\mu_{i}\to\mu and {qa,ni}a𝒜{qa}a𝒜\{q_{a,n_{i}}\}_{a\in\mathcal{A}}\to\{q_{a}\}_{a\in\mathcal{A}}, we have that

\mathbb{E}_{\mu}[\tau_{n_{i}}(X)\mid A=a]\to\mathbb{E}_{\mu}[\tau(X)\mid A=a]

for all a𝒜a\in\mathcal{A} such that Prμ(A=a)>0\operatorname{Pr}_{\mu}(A=a)>0. Therefore, by Lemma E.9, τni(X)τ(X)\tau_{n_{i}}(X)\to\tau(X) in L1(μ)L^{1}(\mu).

Choose ϵ>0\epsilon>0 arbitrarily. Then, choose NN so large that for ii greater than NN,

|μμni|[𝒦]\displaystyle|\mu-\mu_{n_{i}}|[\mathcal{K}] <ϵ10,\displaystyle<\tfrac{\epsilon}{10}, τ(X)τni(X)L1(μ)\displaystyle\|\tau(X)-\tau_{n_{i}}(X)\|_{L^{1}(\mu)} ϵ10.\displaystyle\leq\tfrac{\epsilon}{10}.

Then, observe that τ(x),τi(x)1\tau(x),\tau_{i}(x)\leq 1, and recall that

𝔼μni[τni(X)A,Y(1)]=𝔼μni[τni(X)Y(1)].\mathbb{E}_{\mu_{n_{i}}}[\tau_{n_{i}}(X)\mid A,Y(1)]=\mathbb{E}_{\mu_{n_{i}}}[\tau_{n_{i}}(X)\mid Y(1)]. (15)

Therefore, let gi(x)g_{i}(x) be a standard version of 𝔼μni[τni(X)Y(1)]\mathbb{E}_{\mu_{n_{i}}}[\tau_{n_{i}}(X)\mid Y(1)] over μni\mu_{n_{i}}. By Eq. (15), gi(x)g_{i}(x) is also a standard version of 𝔼μni[τni(X)A,Y(1)]\mathbb{E}_{\mu_{n_{i}}}[\tau_{n_{i}}(X)\mid A,Y(1)] over μni\mu_{n_{i}}. Then, by Lemma E.12, we have that

\|\mathbb{E}_{\mu}[\tau(X)\mid A,Y(1)]-\mathbb{E}_{\mu_{n_{i}}}[\tau_{n_{i}}(X)\mid Y(1)]\|_{L^{1}(\mu)}
\leq\|\mathbb{E}_{\mu}[\tau(X)\mid A,Y(1)]-\mathbb{E}_{\mu}[\tau_{n_{i}}(X)\mid A,Y(1)]\|_{L^{1}(\mu)}
+\|\mathbb{E}_{\mu}[\tau_{n_{i}}(X)\mid A,Y(1)]-g_{i}(X)\|_{L^{1}(\mu)}
+\|g_{i}(X)-\mathbb{E}_{\mu}[\tau_{n_{i}}(X)\mid Y(1)]\|_{L^{1}(\mu)}
+\|\mathbb{E}_{\mu}[\tau_{n_{i}}(X)\mid Y(1)]-\mathbb{E}_{\mu}[\tau(X)\mid Y(1)]\|_{L^{1}(\mu)}
<\frac{\epsilon}{10}+\frac{4\epsilon}{10}+\frac{4\epsilon}{10}+\frac{\epsilon}{10}.

Since ϵ>0\epsilon>0 was arbitrary, it follows that, μ\mu-a.e.,

𝔼μ[τ(X)A,Y(1)]=𝔼μ[τ(X)Y(1)].\mathbb{E}_{\mu}[\tau(X)\mid A,Y(1)]=\mathbb{E}_{\mu}[\tau(X)\mid Y(1)].

Recall the standard fact that for independent random variables XX and UU,

𝔼[f(X,U)X]=f(X,u)dFU(u),\mathbb{E}[f(X,U)\mid X]=\int f(X,u)\mathop{}\!\mathrm{d}F_{U}(u),

where FUF_{U} is the distribution of UU. (For a proof of this fact see, e.g., Brozius (2019).) Further recall that D=𝟙UDτ(X)D=\mathbb{1}_{U_{D}\leq\tau(X)}, where UDX,Y(1)U_{D}\perp\!\!\!\perp X,Y(1). It follows that

Prμ(D=1X,Y(1))=01𝟙ud<τ(X)dλ(ud)=τ(X).\operatorname{Pr}_{\mu}(D=1\mid X,Y(1))=\int_{0}^{1}\mathbb{1}_{u_{d}<\tau(X)}\,\mathrm{d}\lambda(u_{d})=\tau(X).

Hence, by the law of iterated expectations,

Prμ(D=1\displaystyle\operatorname{Pr}_{\mu}(D=1 A,Y(1))\displaystyle\mid A,Y(1))
=𝔼μ[Prμ(D=1X,Y(1))A,Y(1)]\displaystyle=\mathbb{E}_{\mu}[\operatorname{Pr}_{\mu}(D=1\mid X,Y(1))\mid A,Y(1)]
=𝔼μ[τ(X)A,Y(1)]\displaystyle=\mathbb{E}_{\mu}[\tau(X)\mid A,Y(1)]
=𝔼μ[τ(X)Y(1)]\displaystyle=\mathbb{E}_{\mu}[\tau(X)\mid Y(1)]
=𝔼μ[Prμ(D=1X,Y(1))Y(1)]\displaystyle=\mathbb{E}_{\mu}[\operatorname{Pr}_{\mu}(D=1\mid X,Y(1))\mid Y(1)]
=Prμ(D=1Y(1)).\displaystyle=\operatorname{Pr}_{\mu}(D=1\mid Y(1)).

Therefore DAY(1)D\perp\!\!\!\perp A\mid Y(1) over μ\mu, i.e., counterfactual equalized odds holds for the decision policy τ(x)\tau(x) over the distribution μ\mu. Consequently μ𝐄\mu\in\mathbf{E}, and so 𝐄\mathbf{E} is closed and therefore universally measurable. ∎
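The last step of the proof, Prμ(D=1∣X,Y(1))=τ(X), is simply the statement that randomizing with an independent uniform UD implements the policy's acceptance probabilities. A quick Monte Carlo sketch, with an invented policy τ, illustrates it:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500_000

# An invented measurable policy tau(X) taking values in [0, 1].
x = rng.normal(size=n)
tau = 1.0 / (1.0 + np.exp(-x))

# D = 1{U_D <= tau(X)} with U_D ~ Uniform(0, 1) drawn independently of X.
u_d = rng.random(n)
d = (u_d <= tau)

# Pr(D = 1 | X near 0.5) should match the average of tau on that slice.
mask = np.abs(x - 0.5) < 0.05
print(d[mask].mean(), tau[mask].mean())  # approximately equal
```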

E.4 Shy Sets and Probes

We require a number of additional technical lemmata for the proof of Theorem 1. The probe must be constructed carefully, so that, on the utility scale, an arbitrary element of 𝐐\mathbf{Q} is absolutely continuous with respect to a typical perturbation. In addition, it is useful to show that a number of properties are generic to simplify certain aspects of the proof of Theorem 1. For instance, Lemma E.16 is used in Theorem 1 to show that a certain conditional expectation is generically well-defined, avoiding the need to separately treat certain corner cases.

Cor. E.3 concerns the construction of the probe used in the proof of Theorem 1. Lemmata E.17 to E.20 use Cor. E.3 to provide additional simplifications to the proof of Theorem 1.

E.4.1 Maximal support

First, to construct the probe used in the proof of Theorem 1, we require elements μ𝐐\mu\in\mathbf{Q} such that the densities fμf_{\mu} have “maximal” support. To produce such distributions, we use the following measure-theoretic construction.

Definition E.11.

Let {Eγ}γΓ\{E_{\gamma}\}_{\gamma\in\Gamma} be an arbitrary collection of μ\mu-measurable sets for some positive measure μ\mu on a measure space (M,)(M,\mathcal{M}). We say that EE is the measure-theoretic union of {Eγ}γΓ\{E_{\gamma}\}_{\gamma\in\Gamma} if μ[EγE]=0\mu[E_{\gamma}\setminus E]=0 for all γΓ\gamma\in\Gamma and E=i=1EγiE=\bigcup_{i=1}^{\infty}E_{\gamma_{i}} for some countable subcollection {γi}iΓ\{\gamma_{i}\}_{i\in\mathbb{N}}\subseteq\Gamma.

While measure-theoretic unions themselves are known (cf. Silva (2008), Rudin (1991)), for completeness, we include a proof of their existence, which, to the best of our knowledge, is not found in the literature.

Lemma E.14.

Let μ\mu be a finite positive measure on a measure space (V,)(V,\mathcal{M}). Then an arbitrary collection {Eγ}γΓ\{E_{\gamma}\}_{\gamma\in\Gamma} of μ\mu-measurable sets has a measure-theoretic union.

Proof.

For each countable subcollection ΓΓ\Gamma^{\prime}\subseteq\Gamma, consider the “error term”

r(Γ)=supγΓμ[EγγΓEγ]r(\Gamma^{\prime})=\sup_{\gamma\in\Gamma}\mu\left[E_{\gamma}\setminus\bigcup_{\gamma^{\prime}\in\Gamma^{\prime}}E_{\gamma^{\prime}}\right]

We claim that the infimum of r(Γ)r(\Gamma^{\prime}) over all countable subcollections ΓΓ\Gamma^{\prime}\subseteq\Gamma must be zero.

For, toward a contradiction, suppose it were greater than or equal to ϵ>0\epsilon>0. Choose any set Eγ1E_{\gamma_{1}} such that μ[Eγ1]ϵ\mu[E_{\gamma_{1}}]\geq\epsilon. Such a set must exist, since otherwise r()<ϵr(\emptyset)<\epsilon. Choose Eγ2E_{\gamma_{2}} such that μ[Eγ2Eγ1]>ϵ\mu[E_{\gamma_{2}}\setminus E_{\gamma_{1}}]>\epsilon. Again, some such set must exist, since otherwise r({γ1})<ϵr(\{\gamma_{1}\})<\epsilon. Continuing in this way, we construct a countable collection {Eγi}i\{E_{\gamma_{i}}\}_{i\in\mathbb{N}}.

Therefore, we see that

\mu[V]\geq\mu\left[\bigcup_{i=1}^{n}E_{\gamma_{i}}\right]=\sum_{i=1}^{n}\mu\left[E_{\gamma_{i}}\setminus\bigcup_{j=1}^{i-1}E_{\gamma_{j}}\right].

By construction, every term in the final sum is greater than or equal to ϵ\epsilon, contradicting the fact that μ[V]<\mu[V]<\infty.

Therefore, there exist countable collections {Γn}n\{\Gamma_{n}\}_{n\in\mathbb{N}} such that r(Γn)<1nr(\Gamma_{n})<\frac{1}{n}. It follows immediately that

r(nΓn)r(Γk)r\left(\bigcup_{n\in\mathbb{N}}\Gamma_{n}\right)\leq r(\Gamma_{k})

for any fixed kk\in\mathbb{N}. Consequently,

r(nΓn)=0,r\left(\bigcup_{n\in\mathbb{N}}\Gamma_{n}\right)=0,

and nΓn\bigcup_{n\in\mathbb{N}}\Gamma_{n} is countable. ∎

The construction of the “maximal” elements used to construct the probe in the proof of Theorem 1 follows as a corollary of Lemma E.14.

Corollary E.3.

There are measures μmax,a𝐐\mu_{\max,a}\in\mathbf{Q} such that for every a𝒜a\in\mathcal{A} and any μ𝐊\mu\in\mathbf{K},

λ[Supp(fμ,a)Supp(fμmax,a)]=0.\lambda[\operatorname{\textsc{Supp}}(f_{\mu,a})\setminus\operatorname{\textsc{Supp}}(f_{\mu_{\max},a})]=0.
Proof.

Consider the collection {Supp(fμ,a)}μ𝐊\{\operatorname{\textsc{Supp}}(f_{\mu,a})\}_{\mu\in\mathbf{K}}. By Lemma E.14, there exists a countable collection of measures {μi}i\{\mu_{i}\}_{i\in\mathbb{N}} such that for any μ𝐊\mu\in\mathbf{K},

λ[Supp(fμ,a)i=1Supp(fμi,a)]=0,\lambda\left[\operatorname{\textsc{Supp}}(f_{\mu,a})\setminus\bigcup_{i=1}^{\infty}\operatorname{\textsc{Supp}}(f_{\mu_{i},a})\right]=0,

where, without loss of generality, we may assume that λ[Supp(fμi,a)]>0\lambda[\operatorname{\textsc{Supp}}(f_{\mu_{i},a})]>0 for all ii\in\mathbb{N}. Such a sequence must exist, since, by the first hypothesis of Theorem 1, for every a𝒜a\in\mathcal{A}, there exists μ𝐐\mu\in\mathbf{Q} such that Prμ(A=a)>0\operatorname{Pr}_{\mu}(A=a)>0. Therefore, we can define the probability measure μmax,a\mu_{\max,a}, where

\mu_{\max,a}=\sum_{i=1}^{\infty}2^{-i}\cdot\frac{\left|\mu_{i}\operatorname{\upharpoonright}_{A=a}\right|}{\left|\mu_{i}\operatorname{\upharpoonright}_{A=a}\right|[\mathcal{K}]}.

It follows immediately by construction that

Supp(fμmax,a)=i=1Supp(fμi,a),\operatorname{\textsc{Supp}}(f_{\mu_{\max},a})=\bigcup_{i=1}^{\infty}\operatorname{\textsc{Supp}}(f_{\mu_{i},a}),

and that μmax,a𝐐\mu_{\max,a}\in\mathbf{Q}. ∎
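The geometric mixture used to define μmax,a can be imitated numerically. The sketch below truncates the countable family to four invented interval-supported densities on the utility scale (and renormalizes the truncated weights); the resulting density is supported on the union of the individual supports:

```python
import numpy as np

# Invented stand-ins for a countable family of utility-scale densities
# f_{mu_i, a}, each uniform on a different interval.
supports = [(-2.0, -1.0), (0.0, 1.0), (2.5, 3.0), (-0.5, 0.25)]

def f_i(i, t):
    lo, hi = supports[i]
    return np.where((t >= lo) & (t <= hi), 1.0 / (hi - lo), 0.0)

# Truncated geometric mixture: weights 2^{-i}, renormalized to sum to one.
weights = np.array([2.0 ** -(i + 1) for i in range(len(supports))])
weights /= weights.sum()

def f_max(t):
    return sum(w * f_i(i, t) for i, w in enumerate(weights))

grid = np.linspace(-3.0, 4.0, 7001)
dx = grid[1] - grid[0]
print((f_max(grid) * dx).sum())                                      # ~1.0
print(all(f_max(np.array([sum(s) / 2]))[0] > 0 for s in supports))   # True: union support
```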

For notational simplicity, we refer to Supp(fμmax,a)\operatorname{\textsc{Supp}}(f_{\mu_{\max,a}}) as SaS_{a} throughout.

In the case of conditional principal fairness and path-specific fairness, we need a mild refinement of the previous result that accounts for ω\omega.

Corollary E.4.

There are measures μmax,a,w𝐐\mu_{\max,a,w}\in\mathbf{Q} defined for every w𝒲=Img(ω)w\in\mathcal{W}=\operatorname{\textsc{Img}}(\omega) and any a𝒜a\in\mathcal{A} such that for some ν𝐊\nu\in\mathbf{K}, Prν(W=w,A=a)>0\operatorname{Pr}_{\nu}(W=w,A=a)>0. These measures have the property that for any μ𝐊\mu^{\prime}\in\mathbf{K},

λ[Supp(fμ,a,w)Supp(fμmax,a,w)]=0,\lambda[\operatorname{\textsc{Supp}}(f_{\mu^{\prime},a,w})\setminus\operatorname{\textsc{Supp}}(f_{\mu_{\max},a,w})]=0,

where fμ,a,wf_{\mu^{\prime},a,w} is the density of the pushforward measure (μW=w,A=a)u1(\mu^{\prime}\operatorname{\upharpoonright}_{W=w,A=a})\circ u^{-1}.

Recalling that |Img(ω)|<|\operatorname{\textsc{Img}}(\omega)|<\infty, the proof is the same as Cor. E.3, and we analogously refer to Supp(fμmax,a,w)\operatorname{\textsc{Supp}}(f_{\mu_{\max,a,w}}) as Sa,wS_{a,w}. Here, we have assumed without loss of generality—as we continue to assume in the sequel—that for all w𝒲w\in\mathcal{W}, there is some μ𝐊\mu\in\mathbf{K} such that Prμ(W=w)>0\operatorname{Pr}_{\mu}(W=w)>0.

Remark 6.

Because their support is maximal, the hypotheses of Theorem 1, in addition to implying that μmax,a\mu_{\max,a} is well-defined for all a𝒜a\in\mathcal{A}, also imply that Prμmax,a(u(X)>0)>0\operatorname{Pr}_{\mu_{\max,a}}(u(X)>0)>0. In the case of conditional principal fairness, they further imply that Prμmax,a(W=w)>0\operatorname{Pr}_{\mu_{\max,a}}(W=w)>0 for all w𝒲w\in\mathcal{W} and a𝒜a\in\mathcal{A}. Likewise, in the case of path-specific fairness, they further imply that Prμmax,a(W=wi)>0\operatorname{Pr}_{\mu_{\max,a}}(W=w_{i})>0 for i=0,1i=0,1 and some a𝒜a\in\mathcal{A}.

E.4.2 Shy sets and probes

In the following lemmata, we demonstrate that a number of useful properties are generic in 𝐐\mathbf{Q}. We also demonstrate a short technical lemma, Lemma E.20, which allows us to use these generic properties to simplify the proof of Theorem 1.

We begin with the following lemma, which is useful in verifying that certain subspaces of 𝐊\mathbf{K} form probes.

Lemma E.15.

Let 𝐖\mathbf{W} be a non-trivial finite dimensional subspace of 𝐊\mathbf{K} such that ν[𝒦]=0\nu[\mathcal{K}]=0 for all ν𝐖\nu\in\mathbf{W}. Then, there exists μ𝐊\mu\in\mathbf{K} such that λ𝐖[𝐐μ]>0\lambda_{\mathbf{W}}[\mathbf{Q}-\mu]>0.

Proof.

Set

\mu=\frac{1}{n}\sum_{i=1}^{n}\frac{|\nu_{i}|}{|\nu_{i}|[\mathcal{K}]},

where ν1,,νn\nu_{1},\ldots,\nu_{n} form a basis of 𝐖\mathbf{W}. Then, if |βi|1n|νi|[𝒦]|\beta_{i}|\leq\tfrac{1}{n\cdot|\nu_{i}|[\mathcal{K}]} for each ii, each summand 1n|νi|/|νi|[𝒦]+βiνi\tfrac{1}{n}\cdot|\nu_{i}|/|\nu_{i}|[\mathcal{K}]+\beta_{i}\cdot\nu_{i} is a positive measure and, since νi[𝒦]=0\nu_{i}[\mathcal{K}]=0, the total mass remains one, so it follows that

\mu+\sum_{i=1}^{n}\beta_{i}\cdot\nu_{i}\in\mathbf{Q}.

Since

\lambda_{n}\left[\prod_{i=1}^{n}\left[-\frac{1}{n\cdot|\nu_{i}|[\mathcal{K}]},\frac{1}{n\cdot|\nu_{i}|[\mathcal{K}]}\right]\right]>0,

it follows that λ𝐖[𝐐μ]>0\lambda_{\mathbf{W}}[\mathbf{Q}-\mu]>0. ∎
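The computation underlying Lemma E.15 is elementary and can be verified on a finite state space: starting from the probability measure μ built from the total variations of a basis of 𝐖, small perturbations along the basis stay inside 𝐐. The sketch below uses invented zero-mass signed measures as the basis:

```python
import numpy as np

rng = np.random.default_rng(3)
n_points, n = 8, 3

# Invented basis nu_1, ..., nu_n of the probe W: signed measures on a finite
# space with zero total mass, nu_i[K] = 0.
nus = []
for _ in range(n):
    v = rng.normal(size=n_points)
    nus.append(v - v.mean())

tv = [np.abs(v).sum() for v in nus]  # total variation norms |nu_i|[K]

# mu = (1/n) * sum_i |nu_i| / |nu_i|[K] is a probability measure.
mu = sum(np.abs(v) / t for v, t in zip(nus, tv)) / n
print(abs(mu.sum() - 1.0) < 1e-12)  # True

# Perturbations with |beta_i| <= 1 / (n * |nu_i|[K]) keep the measure
# nonnegative and of total mass one, i.e., inside Q.
beta = np.array([rng.uniform(-1.0, 1.0) / (n * t) for t in tv])
perturbed = mu + sum(b * v for b, v in zip(beta, nus))
print(perturbed.min() >= -1e-12, abs(perturbed.sum() - 1.0) < 1e-12)  # True True
```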

Next we show that, given a ν𝐐\nu\in\mathbf{Q}, a generic element of 𝐐\mathbf{Q} “sees” events to which ν\nu assigns non-zero probability. While Lemma E.18 alone in principle suffices for the proof of Theorem 1, we include Lemma E.16 both for conceptual clarity and to introduce at a high level the style of argument used in the subsequent lemmata and in the proof of Theorem 1 to show that a set is shy relative to 𝐐\mathbf{Q}.

Lemma E.16.

For a Borel set E𝒦E\subseteq\mathcal{K}, suppose there exists ν𝐐\nu\in\mathbf{Q} such that ν[E]>0\nu[E]>0. Then the set of μ𝐐\mu\in\mathbf{Q} such that μ[E]>0\mu[E]>0 is prevalent.

Proof.

First, we note that the set of μ𝐐\mu\in\mathbf{Q} such that μ[E]=0\mu[E]=0 is closed and therefore universally measurable. For, if {μi}i𝐐\{\mu_{i}\}_{i\in\mathbb{N}}\subseteq\mathbf{Q} is a convergent sequence with limit μ\mu and μi[E]=0\mu_{i}[E]=0 for all ii, then

\mu[E]=\lim_{i\to\infty}\mu_{i}[E]=\lim_{i\to\infty}0=0.

Now, if μ[E]>0\mu[E]>0 for all μ𝐐\mu\in\mathbf{Q}, there is nothing to prove. Therefore, suppose that there exists ν𝐐\nu^{\prime}\in\mathbf{Q} such that ν[E]=0\nu^{\prime}[E]=0.

Next, consider the measure ν~=νν\tilde{\nu}=\nu^{\prime}-\nu. Then, let 𝐖=Span(ν~)\mathbf{W}=\operatorname{\textsc{Span}}(\tilde{\nu}). Since ν~0\tilde{\nu}\neq 0 and

ν~[𝒦]=ν[𝒦]ν[𝒦]=0,\tilde{\nu}[\mathcal{K}]=\nu^{\prime}[\mathcal{K}]-\nu[\mathcal{K}]=0,

it follows by Lemma E.15 that λ𝐖[𝐐μ]>0\lambda_{\mathbf{W}}[\mathbf{Q}-\mu]>0 for some μ\mu.

Now, for arbitrary μ𝐐\mu\in\mathbf{Q}, note that if (μ+βν~)[E]=0(\mu+\beta\cdot\tilde{\nu})[E]=0, then

μ[E]βν[E]=0\mu[E]-\beta\cdot\nu[E]=0

i.e.,

β=μ[E]ν[E].\beta=\frac{\mu[E]}{\nu[E]}.

A singleton has null Lebesgue measure, and so the set of ν𝐖\nu\in\mathbf{W} such that (μ+ν)[E]=0(\mu+\nu)[E]=0 is λ𝐖\lambda_{\mathbf{W}}-null. Therefore, by Prop. E.6, the set of μ𝐐\mu\in\mathbf{Q} such that μ[E]=0\mu[E]=0 is shy relative to 𝐐\mathbf{Q}, as desired. ∎

While Lemma E.16 shows that a typical element of 𝐐\mathbf{Q} “sees” individual events, in the proof of Theorem 1, we require a stronger condition, namely, that a typical element of 𝐐\mathbf{Q} “sees” certain uncountable collections of events. To demonstrate this more complex property, we require the following technical result, which is closely related to the real analysis folk theorem that any convergent uncountable “sum” can contain only countably many non-zero terms. (See, e.g., Benji (2020).)

Lemma E.17.

Suppose μ\mu is a totally bounded measure on (V,)(V,\mathcal{M}), ff and gg are μ\mu-measurable real-valued functions, and g0g\neq 0 μ\mu-a.e. Then the set

Zβ={vV:f(v)+βg(v)=0}Z_{\beta}=\{v\in V:f(v)+\beta\cdot g(v)=0\}

has non-zero μ\mu measure for at most countably many β\beta\in\mathbb{R}.

Proof.

First, we show that for any countable collection {βi}i\{\beta_{i}\}_{i\in\mathbb{N}}\subseteq\mathbb{R}, the sum i=1μ[Zβi]\sum_{i=1}^{\infty}\mu[Z_{\beta_{i}}] converges. Then, we show how this implies that μ[Zβ]=0\mu[Z_{\beta}]=0 for all but countably many β\beta\in\mathbb{R}.

First, we note that for distinct β,β\beta,\beta^{\prime}\in\mathbb{R},

ZβZβ{vV:(ββ)g(v)=0}.Z_{\beta}\cap Z_{\beta^{\prime}}\subseteq\{v\in V:(\beta-\beta^{\prime})\cdot g(v)=0\}.

Now, by hypothesis,

μ[{vV:g(v)=0}]=0,\mu[\{v\in V:g(v)=0\}]=0,

and since ββ0\beta-\beta^{\prime}\neq 0, it follows that

μ[{vV:(ββ)g(v)=0}]=0\mu[\{v\in V:(\beta-\beta^{\prime})\cdot g(v)=0\}]=0

as well. Consequently, it follows that if {βi}i\{\beta_{i}\}_{i\in\mathbb{N}} is a countable collection of distinct elements of \mathbb{R}, then

i=1μ[Zβi]\displaystyle\sum_{i=1}^{\infty}\mu[Z_{\beta_{i}}] =μ[i=1Zβi]\displaystyle=\mu\left[\bigcup_{i=1}^{\infty}Z_{\beta_{i}}\right]
μ[V]\displaystyle\leq\mu[V]
<.\displaystyle<\infty.

To see that this implies that μ[Zβ]>0\mu[Z_{\beta}]>0 for only countably many β\beta\in\mathbb{R}, let GϵG_{\epsilon}\subseteq\mathbb{R} consist of those β\beta such that μ[Zβ]ϵ\mu[Z_{\beta}]\geq\epsilon. Then GϵG_{\epsilon} must be finite for all ϵ>0\epsilon>0, since otherwise we could form a collection {βi}iGϵ\{\beta_{i}\}_{i\in\mathbb{N}}\subseteq G_{\epsilon}, in which case

i=1μ[Zβi]i=1ϵ=,\sum_{i=1}^{\infty}\mu[Z_{\beta_{i}}]\geq\sum_{i=1}^{\infty}\epsilon=\infty,

contrary to what was just shown. Therefore,

{β:μ[Zβ]>0}=i=1G1/i\{\beta\in\mathbb{R}:\mu[Z_{\beta}]>0\}=\bigcup_{i=1}^{\infty}G_{1/i}

is countable. ∎

We now apply Lemma E.17 to the proof of the following lemma, which states, informally, that, under a generic element of 𝐐\mathbf{Q}, u(X)u(X) is supported everywhere it is supported under some particular fixed element of 𝐐\mathbf{Q}. For instance, Lemma E.17 can be used to show that for a generic element of 𝐐\mathbf{Q}, the density of u(X)A=au(X)\mid A=a is positive λSa\lambda\operatorname{\upharpoonright}_{S_{a}}-a.e.

Lemma E.18.

Let ν𝐐\nu\in\mathbf{Q} and suppose ν\nu is supported on EE, i.e., ν[𝒦E]=0\nu[\mathcal{K}\setminus E]=0. Then the set of μ𝐐\mu\in\mathbf{Q} such that νu1(μE)u1\nu\circ u^{-1}\Lt(\mu\operatorname{\upharpoonright}_{E})\circ u^{-1} is prevalent relative to 𝐐\mathbf{Q}.

Lemma E.18 states, informally, that for generic μ𝐐\mu\in\mathbf{Q}, fμEf_{\mu\operatorname{\upharpoonright}_{E}} is supported everywhere fνf_{\nu} is supported.

Proof.

We begin by showing that the set of μ𝐐\mu\in\mathbf{Q} such that νu1(μE)u1\nu\circ u^{-1}\Lt(\mu\operatorname{\upharpoonright}_{E})\circ u^{-1} is Borel, and therefore universally measurable. Then, we construct a probe 𝐖\mathbf{W} and use it to show that the complement of this collection in 𝐐\mathbf{Q} is finitely shy.

To begin, let UqU_{q} denote the set of μ𝐐\mu\in\mathbf{Q} such that

νu1[{|fμE|=0}]<q.\nu\circ u^{-1}[\{|f_{\mu\operatorname{\upharpoonright}_{E}}|=0\}]<q.

We note that UqU_{q} is open. For, if μUq\mu\in U_{q}, then there exists some r>0r>0 such that

νu1[{|fμE|<r}]<q.\nu\circ u^{-1}[\{|f_{\mu\operatorname{\upharpoonright}_{E}}|<r\}]<q.

Let

ϵ=qνu1[{|fμE|<r}].\epsilon=q-\nu\circ u^{-1}[\{|f_{\mu\operatorname{\upharpoonright}_{E}}|<r\}].

Now, since νu1λ\nu\circ u^{-1}\Lt\lambda, there exists a δ\delta such that if λ[E]<δ\lambda[E^{\prime}]<\delta, then νu1[E]<ϵ\nu\circ u^{-1}[E^{\prime}]<\epsilon. Choose μ\mu^{\prime} arbitrarily so that |μμ|[𝒦]<δr|\mu-\mu^{\prime}|[\mathcal{K}]<\delta\cdot r. Then, by Markov’s inequality, we have that

λ[{|fμEfμE|>r}]<δ,\lambda[\{|f_{\mu\operatorname{\upharpoonright}_{E}}-f_{\mu^{\prime}\operatorname{\upharpoonright}_{E}}|>r\}]<\delta,

i.e.,

\nu\circ u^{-1}[\{|f_{\mu\operatorname{\upharpoonright}_{E}}-f_{\mu^{\prime}\operatorname{\upharpoonright}_{E}}|>r\}]<\epsilon.

Now, we note that by the triangle inequality, wherever |fμE|=0|f_{\mu^{\prime}\operatorname{\upharpoonright}_{E}}|=0, either |fμE|<r|f_{\mu\operatorname{\upharpoonright}_{E}}|<r or |fμEfμE|>r|f_{\mu\operatorname{\upharpoonright}_{E}}-f_{\mu^{\prime}\operatorname{\upharpoonright}_{E}}|>r. Therefore

\nu\circ u^{-1}[\{|f_{\mu^{\prime}\operatorname{\upharpoonright}_{E}}|=0\}]\leq\nu\circ u^{-1}[\{|f_{\mu\operatorname{\upharpoonright}_{E}}-f_{\mu^{\prime}\operatorname{\upharpoonright}_{E}}|>r\}]+\nu\circ u^{-1}[\{|f_{\mu\operatorname{\upharpoonright}_{E}}|<r\}]<\epsilon+\nu\circ u^{-1}[\{|f_{\mu\operatorname{\upharpoonright}_{E}}|<r\}]<q.

We conclude that μUq\mu^{\prime}\in U_{q}, and so UqU_{q} is open.

Note that νu1(μE)u1\nu\circ u^{-1}\Lt(\mu\operatorname{\upharpoonright}_{E})\circ u^{-1} if and only if

λ[Supp(fν)Supp(fμE)]=0.\lambda[\operatorname{\textsc{Supp}}(f_{\nu})\setminus\operatorname{\textsc{Supp}}(f_{\mu\operatorname{\upharpoonright}_{E}})]=0.

By the definition of the support of a function, λSupp(fν)νu1\lambda\operatorname{\upharpoonright}_{\operatorname{\textsc{Supp}}(f_{\nu})}\Lt\nu\circ u^{-1}. Therefore, since νu1λ\nu\circ u^{-1}\Lt\lambda, it follows that

\lambda[\operatorname{\textsc{Supp}}(f_{\nu})\setminus\operatorname{\textsc{Supp}}(f_{\mu\operatorname{\upharpoonright}_{E}})]=0

if and only if

\nu\circ u^{-1}[\operatorname{\textsc{Supp}}(f_{\nu})\setminus\operatorname{\textsc{Supp}}(f_{\mu\operatorname{\upharpoonright}_{E}})]=0.

Then, it follows immediately that the set of μ𝐐\mu\in\mathbf{Q} such that νu1(μE)u1\nu\circ u^{-1}\Lt(\mu\operatorname{\upharpoonright}_{E})\circ u^{-1} is i=1U1/i\bigcap_{i=1}^{\infty}U_{1/i}, which is, by construction, Borel, and therefore universally measurable.

Now, since

Prν(u(X)<t)=tfνdλ\operatorname{Pr}_{\nu}(u(X)<t)=\int_{-\infty}^{t}f_{\nu}\,\mathrm{d}\lambda

is a continuous function of tt, by the intermediate value theorem, there exists tt such that Prν(u(X)S0)=Prν(u(X)S1)\operatorname{Pr}_{\nu}(u(X)\in S_{0})=\operatorname{Pr}_{\nu}(u(X)\in S_{1}), where S0=Supp(fν)(,t)S_{0}=\operatorname{\textsc{Supp}}(f_{\nu})\cap(-\infty,t) and S1=Supp(fν)[t,)S_{1}=\operatorname{\textsc{Supp}}(f_{\nu})\cap[t,\infty). Then, we let

ν~[E]=E𝟙u1(S0)𝟙u1(S1)dν.\tilde{\nu}[E^{\prime}]=\int_{E^{\prime}}\mathbb{1}_{u^{-1}(S_{0})}-\mathbb{1}_{u^{-1}(S_{1})}\,\mathrm{d}\nu.

Take 𝐖=Span(ν~)\mathbf{W}=\operatorname{\textsc{Span}}(\tilde{\nu}). Since ν~0\tilde{\nu}\neq 0 and ν~[𝒦]=0\tilde{\nu}[\mathcal{K}]=0, we have by Lemma E.15 that λ𝐖[𝐐μ]>0\lambda_{\mathbf{W}}[\mathbf{Q}-\mu]>0 for some μ\mu.

By construction, fν~=fν(𝟙S0𝟙S1)f_{\tilde{\nu}}=f_{\nu}\cdot(\mathbb{1}_{S_{0}}-\mathbb{1}_{S_{1}}), so fν~f_{\tilde{\nu}} is non-zero λ\lambda-a.e. on Supp(fν)\operatorname{\textsc{Supp}}(f_{\nu}), and hence non-zero (νu1)(\nu\circ u^{-1})-a.e. Moreover, since ν\nu is supported on EE, the density of ((μ+βν~)E)u1((\mu+\beta\cdot\tilde{\nu})\operatorname{\upharpoonright}_{E})\circ u^{-1} is fμE+βfν~f_{\mu\operatorname{\upharpoonright}_{E}}+\beta\cdot f_{\tilde{\nu}}. Therefore, by Lemma E.17, there exist only countably many β\beta\in\mathbb{R} such that this density equals zero on a set of positive νu1\nu\circ u^{-1}-measure. Since countable sets have λ\lambda-measure zero and μ\mu was arbitrary, the set of μ𝐐\mu\in\mathbf{Q} such that νu1(μE)u1\nu\circ u^{-1}\Lt(\mu\operatorname{\upharpoonright}_{E})\circ u^{-1} is prevalent relative to 𝐐\mathbf{Q} by Prop. E.6. ∎
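The intermediate value theorem step in the preceding proof, splitting the support of fν at a point t into two pieces of equal ν-mass, is easy to carry out numerically. The sketch below does so for an invented utility distribution:

```python
import numpy as np
from scipy import optimize, stats

# An invented distribution nu o u^{-1} on the utility scale.
nu_u = stats.norm(loc=0.3, scale=1.2)

# Find t with Pr_nu(u(X) < t) = Pr_nu(u(X) >= t) = 1/2; existence follows from
# the intermediate value theorem, since the CDF is continuous.
t = optimize.brentq(lambda s: nu_u.cdf(s) - 0.5, -10.0, 10.0)

# The signed perturbation nu_tilde assigns positive mass below t and negative
# mass above t, so it is non-zero but has total mass zero, as a probe requires.
mass_below = nu_u.cdf(t)
mass_above = 1.0 - nu_u.cdf(t)
print(t, mass_below - mass_above)  # total mass of nu_tilde: ~0.0
```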

The following definition and technical lemma are needed to extend Theorem 1 to the cases of conditional principal fairness and path-specific fairness, which involve additional conditioning on W=ω(X)W=\omega(X). In particular, one corner case we wish to avoid in the proof of Theorem 1 is when the decision policy is non-trivial (i.e., some individuals receive a positive decision and others do not) but from the perspective of each ω\omega-stratum, the policy is trivial (i.e., everyone in the stratum receives a positive or negative decision). Definition E.12 formalizes this pathology, and Lemma E.19 shows that this issue—under a mild hypothesis—does not arise for a generic element of 𝐐\mathbf{Q}.

Definition E.12.

We say that μ𝐐\mu\in\mathbf{Q} overlaps utilities when, for any budget-exhausting multiple threshold policy τ(x)\tau(x), if

0<𝔼μ[τ(X)]<1,0<\mathbb{E}_{\mu}[\tau(X)]<1,

then there exists w𝒲w\in\mathcal{W} such that

0<𝔼μ[τ(X)W=w]<1.0<\mathbb{E}_{\mu}[\tau(X)\mid W=w]<1.

If there exists a budget-exhausting multiple threshold policy τ(x)\tau(x) such that

0<𝔼μ[τ(X)]<1,0<\mathbb{E}_{\mu}[\tau(X)]<1,

but for all w𝒲w\in\mathcal{W},

𝔼μ[τ(X)W=w]{0,1},\mathbb{E}_{\mu}[\tau(X)\mid W=w]\in\{0,1\},

then we say that τ(x)\tau(x) splits utilities over μ\mu.

Informally, overlapping utilities prevents a budget-exhausting threshold policy from having thresholds that fall on the utility scale exactly between the strata induced by ω\omega—i.e., a threshold policy that splits utilities. Under a mild hypothesis, this is a generic condition in 𝐐\mathbf{Q}, as we show in Lemma E.19.

Lemma E.19.

Let 0<b<10<b<1. Suppose that for all w𝒲w\in\mathcal{W} there exists μ𝐐\mu\in\mathbf{Q} such that Prμ(u(X)>0,W=w)>0\operatorname{Pr}_{\mu}(u(X)>0,W=w)>0. Then almost every μ𝐐\mu\in\mathbf{Q} overlaps utilities.

Proof.

Our goal is to show that the set 𝐄\mathbf{E}^{\prime} of measures μ𝐐\mu\in\mathbf{Q} such that there exists a splitting policy τ(x)\tau(x) is shy. To simplify the proof, we divide and conquer, showing that the set 𝐄Γ\mathbf{E}_{\Gamma} of measures μ𝐐\mu\in\mathbf{Q} for which there exists a splitting policy that fully accepts the strata wΓ𝒲w\in\Gamma\subseteq\mathcal{W} and fully rejects the strata wΓw\notin\Gamma is Borel, before constructing a probe that shows that it is shy. Then, we argue that 𝐄=Γ𝒲𝐄Γ\mathbf{E}^{\prime}=\bigcup_{\Gamma\subseteq\mathcal{W}}\mathbf{E}_{\Gamma}, which shows that 𝐄\mathbf{E}^{\prime} is shy.

We begin by considering the linear map Φ:𝐊×𝒲\Phi:\mathbf{K}\to\mathbb{R}\times\mathbb{R}^{\mathcal{W}} given by

\Phi(\mu)=\left(\operatorname{Pr}_{\mu}(u(X)>0),\left(\operatorname{Pr}_{\mu}(W=w)\right)_{w\in\mathcal{W}}\right).

For any Γ𝒲\Gamma\subseteq\mathcal{W}, the sets

FUpΓ\displaystyle F^{\operatorname{{Up}}}_{\Gamma} ={x×𝒲:x0b,b=wΓxw},\displaystyle=\{x\in\mathbb{R}\times\mathbb{R}^{\mathcal{W}}:x_{0}\geq b,b=\sum_{w\in\Gamma}x_{w}\},
FLoΓ\displaystyle F^{\operatorname{{Lo}}}_{\Gamma} ={x×𝒲:x0b,x0=wΓxw},\displaystyle=\{x\in\mathbb{R}\times\mathbb{R}^{\mathcal{W}}:x_{0}\leq b,x_{0}=\sum_{w\in\Gamma}x_{w}\},

are closed by construction. Therefore, since Φ\Phi is continuous,

\mathbf{E}_{\Gamma}=\mathbf{Q}\cap\Phi^{-1}\left(F^{\operatorname{{Up}}}_{\Gamma}\cup F^{\operatorname{{Lo}}}_{\Gamma}\right) (16)

is closed, and therefore universally measurable.

Note that by our hypothesis and Cor. E.4, for all w𝒲w\in\mathcal{W} there exists some aw𝒜a_{w}\in\mathcal{A} such that

\operatorname{Pr}_{\mu_{\max,a_{w},w}}(u(X)>0)>0.

We use this to show that 𝐄Γ\mathbf{E}_{\Gamma} is shy. Pick w𝒲w^{*}\in\mathcal{W} arbitrarily, and consider the measures νw\nu_{w} for www\neq w^{*} defined by

νw\displaystyle\nu_{w} =μmax,aw,wu(X)>0Prμmax,aw,w(u(X)>0)\displaystyle=\frac{\mu_{\max,a_{w},w}\operatorname{\upharpoonright}_{u(X)>0}}{\operatorname{Pr}_{\mu_{\max,a_{w},w}}(u(X)>0)}
μmax,aw,wu(X)>0Prμmax,aw,w(u(X)>0).\displaystyle\hskip 28.45274pt-\frac{\mu_{\max,a_{w^{*}},w^{*}}\operatorname{\upharpoonright}_{u(X)>0}}{\operatorname{Pr}_{\mu_{\max,a_{w^{*}},w^{*}}}(u(X)>0)}.

We note that νw[𝒦]=0\nu_{w}[\mathcal{K}]=0 by construction. Therefore, if 𝐖w=Span(νw)\mathbf{W}_{w}=\operatorname{\textsc{Span}}(\nu_{w}), then λ𝐖w[𝐐μw]>0\lambda_{\mathbf{W}_{w}}[\mathbf{Q}-\mu_{w}]>0 for some μw\mu_{w} by Lemma E.15.

Moreover, we have that Prν(u(X)>0)=0\operatorname{Pr}_{\nu}(u(X)>0)=0 for all ν𝐖w\nu\in\mathbf{W}_{w}, i.e.,

Prμ(u(X)>0)=Prμ+ν(u(X)>0).\operatorname{Pr}_{\mu}(u(X)>0)=\operatorname{Pr}_{\mu+\nu}(u(X)>0).

Now, since 0<b<10<b<1 and ω\omega partitions 𝒳\mathcal{X}, it follows that

𝐄𝒲=𝐄=.\mathbf{E}_{\mathcal{W}}=\mathbf{E}_{\emptyset}=\emptyset.

Since λ𝐖[]=0\lambda_{\mathbf{W}}[\emptyset]=0 for any subspace 𝐖\mathbf{W}, we can assume without loss of generality that Γ𝒲,\Gamma\neq\mathcal{W},\emptyset.

In that case, there exists wΓ𝒲w_{\Gamma}\in\mathcal{W} such that if wΓw^{*}\in\Gamma, then wΓΓw_{\Gamma}\notin\Gamma, and vice versa. Without loss of generality, assume wΓΓw_{\Gamma}\in\Gamma and wΓw^{*}\notin\Gamma. It then follows that for arbitrary μ𝐐\mu\in\mathbf{Q},

Φ(μ+βνwΓ)=Φ(μ)+β𝐞wΓβ𝐞w,\Phi(\mu+\beta\cdot\nu_{w_{\Gamma}})=\Phi(\mu)+\beta\cdot\mathbf{e}_{w_{\Gamma}}-\beta\cdot\mathbf{e}_{w^{*}},

where 𝐞w\mathbf{e}_{w} is the basis vector corresponding to w𝒲w\in\mathcal{W}. From this, it follows immediately by Eq. (16) that

μ+βνwΓ𝐄Γ\mu+\beta\cdot\nu_{w_{\Gamma}}\in\mathbf{E}_{\Gamma}

only if

β=min(b,Prμ(u(X)>0))wΓPrμ(W=w).\beta=\min(b,\operatorname{Pr}_{\mu}(u(X)>0))-\sum_{w\in\Gamma}\operatorname{Pr}_{\mu}(W=w).

This is a measure zero subset of \mathbb{R}, and so it follows that

λ𝐖wΓ[𝐄Γμ]=0\lambda_{\mathbf{W}_{w_{\Gamma}}}[\mathbf{E}_{\Gamma}-\mu]=0

for all μ𝐊\mu\in\mathbf{K}. Therefore, by Prop. E.6, 𝐄Γ\mathbf{E}_{\Gamma} is shy in 𝐐\mathbf{Q}. Taking the union over Γ𝒲\Gamma\subseteq\mathcal{W}, it follows by Prop. E.5 that Γ𝒲𝐄Γ\bigcup_{\Gamma\subseteq\mathcal{W}}\mathbf{E}_{\Gamma} is shy.

Now, we must show that 𝐄=Γ𝒲𝐄Γ\mathbf{E}^{\prime}=\bigcup_{\Gamma\subseteq\mathcal{W}}\mathbf{E}_{\Gamma}. By construction, 𝐄Γ𝐄\mathbf{E}_{\Gamma}\subseteq\mathbf{E}^{\prime}, since the policy τ(x)=𝟙ω(x)Γ\tau(x)=\mathbb{1}_{\omega(x)\in\Gamma} is budget-exhausting and splits utilities. To see the reverse inclusion, suppose μ𝐄\mu\in\mathbf{E}^{\prime}, i.e., that there exists a budget-exhausting multiple threshold policy τ(x)\tau(x) that splits utilities over μ\mu. Then, let

Γμ={w𝒲:𝔼μ[τ(X)W=w]=1}.\Gamma_{\mu}=\{w\in\mathcal{W}:\mathbb{E}_{\mu}[\tau(X)\mid W=w]=1\}.

Since τ(x)\tau(x) is budget-exhausting, it follows immediately that μ𝐄Γμ\mu\in\mathbf{E}_{\Gamma_{\mu}}. Therefore, 𝐄=Γ𝒲𝐄Γ\mathbf{E}^{\prime}=\bigcup_{\Gamma\subseteq\mathcal{W}}\mathbf{E}_{\Gamma}, and so 𝐄\mathbf{E}^{\prime} is shy, as desired. ∎

We conclude our discussion of shyness and shy sets with the following general lemma, which simplifies relative prevalence proofs by showing that one can, without loss of generality, restrict one’s attention to the elements of the shy set itself in applying Prop. E.6.

Lemma E.20.

Suppose EE is a universally measurable subset of a convex, completely metrizable set CC in a topological vector space VV. Suppose that for some finite-dimensional subspace VV^{\prime}, λV[C+v0]>0\lambda_{V^{\prime}}[C+v_{0}]>0 for some v0Vv_{0}\in V. If, in addition, for all vEv\in E,

λV[{vV:v+vE}]=0,\lambda_{V^{\prime}}[\{v^{\prime}\in V^{\prime}:v+v^{\prime}\in E\}]=0, (17)

then it follows that EE is shy relative to CC.

Proof.

Let vv be arbitrary. Then, either (v+V)E(v+V^{\prime})\cap E is empty or not.

First, suppose it is empty. Since λV[]=0\lambda_{V^{\prime}}[\emptyset]=0 by definition, it follows immediately that in this case λV[Ev]=0\lambda_{V^{\prime}}[E-v]=0.

Next, suppose the intersection is not empty, and let v+vEv+v^{*}\in E for some fixed vVv^{*}\in V^{\prime}. It follows that

λV[Ev]\displaystyle\lambda_{V^{\prime}}[E-v] =λV[{vV:v+vE}]\displaystyle=\lambda_{V^{\prime}}[\{v^{\prime}\in V^{\prime}:v+v^{\prime}\in E\}]
=λV[{vV:(v+v)+vE}]\displaystyle=\lambda_{V^{\prime}}[\{v^{\prime}\in V^{\prime}:(v+v^{*})+v^{\prime}\in E\}]
=0,\displaystyle=0,

where the first equality follows by definition; the second equality follows by the translation invariance of λV\lambda_{V^{\prime}}, and the fact that v+V=Vv^{*}+V^{\prime}=V^{\prime}; and the final equality follows from Eq. (17).

Therefore λV[Ev]=0\lambda_{V^{\prime}}[E-v]=0 for arbitrary vv, and so EE is shy. ∎

E.5 Proof of Theorem 1

Using the lemmata above, we can prove Theorem 1. We briefly summarize what has been established so far:

  • Lemma E.7: The set 𝐊\mathbf{K} of 𝒰\mathcal{U}-fine distributions on 𝒦\mathcal{K} is a Banach space;

  • Lemma E.11: The subset 𝐐\mathbf{Q} of 𝒰\mathcal{U}-fine probability measures on 𝒦\mathcal{K} is a convex, completely metrizable subset of 𝐊\mathbf{K};

  • Lemma E.13: The subset 𝐄\mathbf{E} of 𝐐\mathbf{Q} is a universally measurable subset of 𝐊\mathbf{K}, where 𝐄\mathbf{E} is the set consisting of 𝒰\mathcal{U}-fine probability measures over which there exists a policy satisfying counterfactual equalized odds (resp., conditional principal fairness, or path-specific fairness) that is not strongly Pareto dominated.

Therefore, to apply Prop. E.6, what remains is to construct a probe 𝐖\mathbf{W} and show that λ𝐖[𝐐+μ0]>0\lambda_{\mathbf{W}}[\mathbf{Q}+\mu_{0}]>0 for some μ0𝐊\mu_{0}\in\mathbf{K} but λ𝐖[𝐄+μ]=0\lambda_{\mathbf{W}}[\mathbf{E}+\mu]=0 for all μ𝐊\mu\in\mathbf{K}.

Proof.

We divide the proof into three pieces. First, we illustrate how to construct the probe 𝐖\mathbf{W} from a particular collection of distributions {νaUp,νaLo}a𝒜\{\nu_{a}^{\operatorname{{Up}}},\nu_{a}^{\operatorname{{Lo}}}\}_{a\in\mathcal{A}}. Second, we show that λ𝐖[𝐄+μ]=0\lambda_{\mathbf{W}}[\mathbf{E}+\mu]=0 for all μ𝐊\mu\in\mathbf{K}. For notational and expository simplicity, we focus in these first two parts on the case of counterfactual equalized odds. In the third part, we then show how to generalize the argument to conditional principal fairness and path-specific fairness.

Construction of the probe

We will construct our probe to address two different cases. We recall that, by Cor. D.1, any policy that is not strongly Pareto dominated must be a budget-exhausting multiple threshold policy with non-negative thresholds. In the first case, we consider when the candidate budget-exhausting multiple threshold policy is 𝟙u(x)>0\mathbb{1}_{u(x)>0}. By perturbing the underlying distribution by ν𝐖Lo\nu\in\mathbf{W}^{\operatorname{{Lo}}}, we will be able to break the balance requirements implied by Eq. (2). In the second case, we treat the possibility that the candidate budget-exhausting multiple threshold policy has a non-trivial positive threshold for at least one group. By perturbing the underlying distribution by ν𝐖Up\nu\in\mathbf{W}^{\operatorname{{Up}}} for an alternative set of perturbations 𝐖Up\mathbf{W}^{\operatorname{{Up}}}, we will again be able to break the balance requirements.

More specifically, to construct our probe 𝐖=𝐖Up+𝐖Lo\mathbf{W}=\mathbf{W}^{\operatorname{{Up}}}+\mathbf{W}^{\operatorname{{Lo}}}, we want 𝐖Up\mathbf{W}^{\operatorname{{Up}}} and 𝐖Lo\mathbf{W}^{\operatorname{{Lo}}} to have a number of properties. In particular, for all ν𝐖\nu\in\mathbf{W}, perturbation by ν\nu should not affect whether the underlying distribution is a probability distribution, and should not affect how much of the budget is available to budget-exhausting policies. Specifically, for all ν𝐖\nu\in\mathbf{W},

𝒦1dν=0,\int_{\mathcal{K}}1\,\mathrm{d}\nu=0, (18)

and

𝒦𝟙u(X)>0dν=0.\int_{\mathcal{K}}\mathbb{1}_{u(X)>0}\,\mathrm{d}\nu=0. (19)

In fact, the amount of budget available to budget-exhausting policies will not change within group, i.e., for all a𝒜a\in\mathcal{A} and ν𝐖\nu\in\mathbf{W},

𝒦𝟙u(X)>0,A=adν=0.\int_{\mathcal{K}}\mathbb{1}_{u(X)>0,A=a}\,\mathrm{d}\nu=0. (20)

Additionally, for some distinguished y0,y1𝒴y_{0},y_{1}\in\mathcal{Y}, non-zero perturbations in νLo𝐖Lo\nu^{\operatorname{{Lo}}}\in\mathbf{W}^{\operatorname{{Lo}}} should move mass between y0y_{0} and y1y_{1}. That is, they should have the property that if Pr|νLo|(A=a)>0\operatorname{Pr}_{|\nu^{\operatorname{{Lo}}}|}(A=a)>0, then

𝒦𝟙u(X)<0,Y=yi,A=adνLo0.\int_{\mathcal{K}}\mathbb{1}_{u(X)<0,Y=y_{i},A=a}\,\mathrm{d}\nu^{\operatorname{{Lo}}}\neq 0. (21)

Finally, perturbations in 𝐖^{Up} should have the property that for any non-trivial t > 0, some mass is moved either above or below t. More precisely, for any μ ∈ 𝐐 and any t such that

0<Prμ(u(X)>tA=a)<1,0<\operatorname{Pr}_{\mu}(u(X)>t\mid A=a)<1,

if νUp𝐖Up\nu^{\operatorname{{Up}}}\in\mathbf{W}^{\operatorname{{Up}}} is such that Pr|νUp|(A=a)>0\operatorname{Pr}_{|\nu^{\operatorname{{Up}}}|}(A=a)>0, then

𝒦𝟙u(X)>t,A=adνUp0.\int_{\mathcal{K}}\mathbb{1}_{u(X)>t,A=a}\,\mathrm{d}\nu^{\operatorname{{Up}}}\neq 0. (22)

To carry out the construction, choose distinct y0,y1𝒴y_{0},y_{1}\in\mathcal{Y}. Then, since

μmax,au1[Sa[0,ra)]μmax,au1[Sa[ra,)]\mu_{\max,a}\circ u^{-1}[S_{a}\cap[0,r_{a})]-\mu_{\max,a}\circ u^{-1}[S_{a}\cap[r_{a},\infty)]

is a continuous function of r_a, it follows by the intermediate value theorem that we can choose r_a ≥ 0 so as to partition S_a into three pieces,

SaLo\displaystyle S_{a}^{\operatorname{{Lo}}} =Sa(,0),\displaystyle=S_{a}\cap(-\infty,0),
SUpa,0\displaystyle S^{\operatorname{{Up}}}_{a,0} =Sa[0,ra),\displaystyle=S_{a}\cap[0,r_{a}),
SUpa,1\displaystyle S^{\operatorname{{Up}}}_{a,1} =Sa[ra,),\displaystyle=S_{a}\cap[r_{a},\infty),

so that

Prμmax,a(u(X)SUpa,0)=Prμmax,a(u(X)SUpa,1).\operatorname{Pr}_{\mu_{\max,a}}\left(u(X)\in S^{\operatorname{{Up}}}_{a,0}\right)=\operatorname{Pr}_{\mu_{\max,a}}\left(u(X)\in S^{\operatorname{{Up}}}_{a,1}\right).

Recall that 𝒦=𝒳×𝒴\mathcal{K}=\mathcal{X}\times\mathcal{Y}. Let π𝒳:𝒦𝒳\pi_{\mathcal{X}}:\mathcal{K}\to\mathcal{X} denote projection onto 𝒳\mathcal{X}, and γy:𝒳𝒦\gamma_{y}:\mathcal{X}\to\mathcal{K} be the injection x(x,y)x\mapsto(x,y). We define

νaUp[E]\displaystyle\nu_{a}^{\operatorname{{Up}}}[E] =μmax,a(γy1π𝒳)1[Eu1(SUpa,1)],\displaystyle=\mu_{\max,a}\circ(\gamma_{y_{1}}\circ\pi_{\mathcal{X}})^{-1}\left[E\cap u^{-1}\left(S^{\operatorname{{Up}}}_{a,1}\right)\right],
μmax,a(γy1π𝒳)1[Eu1(SUpa,0)],\displaystyle\hskip 14.22636pt-\mu_{\max,a}\circ(\gamma_{y_{1}}\circ\pi_{\mathcal{X}})^{-1}\left[E\cap u^{-1}\left(S^{\operatorname{{Up}}}_{a,0}\right)\right],
νaLo[E]\displaystyle\nu_{a}^{\operatorname{{Lo}}}[E] =μmax,a(γy1π𝒳)1[Eu1(SLoa)]\displaystyle=\mu_{\max,a}\circ(\gamma_{y_{1}}\circ\pi_{\mathcal{X}})^{-1}\left[E\cap u^{-1}\left(S^{\operatorname{{Lo}}}_{a}\right)\right]
μmax,a(γy0π𝒳)1[Eu1(SLoa)].\displaystyle\hskip 14.22636pt-\mu_{\max,a}\circ(\gamma_{y_{0}}\circ\pi_{\mathcal{X}})^{-1}\left[E\cap u^{-1}\left(S^{\operatorname{{Lo}}}_{a}\right)\right].

By construction, νaUp\nu_{a}^{\operatorname{{Up}}} concentrates on

{y1}×u1(Sa[0,)),\{y_{1}\}\times u^{-1}(S_{a}\cap[0,\infty)),

while νaLo\nu_{a}^{\operatorname{{Lo}}} concentrates on

{y0,y1}×u1(Sa(,0)).\{y_{0},y_{1}\}\times u^{-1}(S_{a}\cap(-\infty,0)).

Moreover, if we set

𝐖Up\displaystyle\mathbf{W}^{\operatorname{{Up}}} =Span(νaUp)a𝒜,\displaystyle=\operatorname{\textsc{Span}}(\nu_{a}^{\operatorname{{Up}}})_{a\in\mathcal{A}},
𝐖Lo\displaystyle\mathbf{W}^{\operatorname{{Lo}}} =Span(νaLo)a𝒜,\displaystyle=\operatorname{\textsc{Span}}(\nu_{a}^{\operatorname{{Lo}}})_{a\in\mathcal{A}},

then it is easy to see that Eqs. (18) to (21) will hold. The only non-trivial case is Eq. (22). However, by Cor. E.3, the support of fμmax,af_{\mu_{\max,a}} is maximal. That is, for μ𝐐\mu\in\mathbf{Q}, if

0<Prμ(u(X)>tA=a,u(X)>0)<1,0<\operatorname{Pr}_{\mu}(u(X)>t\mid A=a,u(X)>0)<1,

then it follows that 0<t<supSa0<t<\sup S_{a}. Either trat\leq r_{a} or t>rat>r_{a}. First, assume trat\leq r_{a}; then, it follows by the construction of νaUp\nu_{a}^{\operatorname{{Up}}} that

νaUpu1[(t,)]\displaystyle\nu_{a}^{\operatorname{{Up}}}\circ u^{-1}[(t,\infty)] =rafmax,adλ\displaystyle=\int_{r_{a}}^{\infty}f_{\max,a}\,\mathrm{d}\lambda
trafmax,adλ\displaystyle\hskip 28.45274pt-\int_{t}^{r_{a}}f_{\max,a}\,\mathrm{d}\lambda
>rafmax,adλ\displaystyle>\int_{r_{a}}^{\infty}f_{\max,a}\,\mathrm{d}\lambda
0rafmax,adλ\displaystyle\hskip 28.45274pt-\int_{0}^{r_{a}}f_{\max,a}\,\mathrm{d}\lambda
=0.\displaystyle=0.

Similarly, if t>rat>r_{a},

νaUpu1[(t,)]\displaystyle\nu_{a}^{\operatorname{{Up}}}\circ u^{-1}[(t,\infty)] =tfmax,adλ\displaystyle=\int_{t}^{\infty}f_{\max,a}\,\mathrm{d}\lambda
>supSafmax,adλ\displaystyle>\int_{\sup S_{a}}^{\infty}f_{\max,a}\,\mathrm{d}\lambda
=0.\displaystyle=0.

Therefore Eq. (22) holds.

Since 𝐖 is non-trivial and ν[𝒦] = 0 for all ν ∈ 𝐖, it follows by Lemma E.15 that λ_𝐖[𝐐 − μ] > 0 for some μ ∈ 𝐊. (In general, some or all of the ν^{Lo} may be zero, depending on the λ-measure of S_a^{Lo}. However, as noted in Remark 6, the ν_a^{Up} cannot be zero, since Pr_{μ_{max,a}}(u(X) > 0) > 0 for all a ∈ 𝒜; therefore 𝐖 ≠ {0}.)
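To make the behavior of these perturbations concrete, the following discretized caricature (a sketch with arbitrary toy atoms and weights, not the 𝒰-fine construction used in the proof) checks the analogues of Eqs. (18) to (22): each ν_a^{Up} shifts mass between the two positive utility bands within outcome y_1, and each ν_a^{Lo} shifts mass between y_0 and y_1 below the zero-utility cutoff.

# Discretized caricature of the probe construction (toy atoms and weights
# chosen purely for illustration).
states = [(a, y, u) for a in ("a0", "a1") for y in ("y0", "y1")
          for u in (-1.0, 0.5, 1.5)]

def measure(weights):
    # Represent a signed measure on the toy state space as a dict.
    return {s: weights.get(s, 0.0) for s in states}

def nu_up(a, w=0.1):
    # Within group a and outcome y1, move weight w from the lower positive
    # band (u = 0.5, playing the role of S_{a,0}^Up) to the upper band
    # (u = 1.5, playing the role of S_{a,1}^Up).
    return measure({(a, "y1", 1.5): +w, (a, "y1", 0.5): -w})

def nu_lo(a, w=0.1):
    # Below the cutoff (u = -1), move weight w from outcome y0 to outcome y1.
    return measure({(a, "y1", -1.0): +w, (a, "y0", -1.0): -w})

def integrate(nu, pred):
    return sum(v for (a, y, u), v in nu.items() if pred(a, y, u))

for a in ("a0", "a1"):
    for nu in (nu_up(a), nu_lo(a)):
        # Analogue of Eq. (18): total mass zero.
        assert abs(integrate(nu, lambda *_: True)) < 1e-12
        # Analogues of Eqs. (19) and (20): mass above the cutoff is unchanged,
        # overall and within each group.
        for g in ("a0", "a1"):
            assert abs(integrate(nu, lambda a_, y, u: u > 0 and a_ == g)) < 1e-12
    # Analogue of Eq. (21): nu_a^Lo moves mass between y0 and y1 below the cutoff.
    assert integrate(nu_lo(a), lambda a_, y, u: u < 0 and y == "y1" and a_ == a) != 0
    # Analogue of Eq. (22): nu_a^Up moves mass across any threshold between the bands.
    assert integrate(nu_up(a), lambda a_, y, u: u > 1.0 and a_ == a) != 0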

Shyness

Recall that, by Prop. E.5, a set E is shy if and only if, for an arbitrary shy set E′, E∖E′ is shy. By Lemma E.16, a generic element μ ∈ 𝐐 satisfies Pr_μ(u(X) > 0, Y(1) = y_i, A = a) > 0 for i = 0, 1 and a ∈ 𝒜. Likewise, by Lemma E.18, a generic μ ∈ 𝐐 satisfies ν_a^{Up}∘u^{-1} ≪ (μ↾_{𝒳×{y_1}})∘u^{-1}. Therefore, to simplify our task and recalling Remark 6, we may instead demonstrate the shyness of the set of μ ∈ 𝐐 such that:

  • There exists a budget-exhausting multiple threshold policy τ(x)\tau(x) with non-negative thresholds satisfying counterfactual equalized odds over μ\mu;

  • For i=0,1i=0,1,

    Prμ(u(X)>0,A=a,Y(1)=yi)>0;\operatorname{Pr}_{\mu}(u(X)>0,A=a,Y(1)=y_{i})>0; (23)
  • For all a𝒜a\in\mathcal{A},

ν_a^{Up}∘u^{-1} ≪ (μ↾_{α^{-1}(a)×{y_1}})∘u^{-1}. (24)

By a slight abuse of notation, we continue to refer to this set as 𝐄. We note that, by the construction of 𝐖, Eq. (23) is not affected by perturbation by ν ∈ 𝐖, and Eq. (24) is not affected by perturbation by ν^{Lo} ∈ 𝐖^{Lo}.

In particular, by Lemma E.20, it suffices to show that λ𝐖[𝐄μ]=0\lambda_{\mathbf{W}}[\mathbf{E}-\mu]=0 for μ𝐄\mu\in\mathbf{E}.

Therefore, let μ𝐄\mu\in\mathbf{E} be arbitrary. Let the budget-exhausting multiple threshold policy satisfying counterfactual equalized odds over it be τ(x)\tau(x), so that

𝔼μ[τ(X)]=min(b,Prμ(u(X)>0)),\mathbb{E}_{\mu}[\tau(X)]=\min(b,\operatorname{Pr}_{\mu}(u(X)>0)),

with thresholds {ta}a𝒜\{t_{a}\}_{a\in\mathcal{A}}. We split into two cases based on whether τ(X)=𝟙u(X)>0\tau(X)=\mathbb{1}_{u(X)>0} μ\mu-a.s. or not.

In both cases, we make use of the following two useful observations.

First, note that as 𝐄𝐐\mathbf{E}\subseteq\mathbf{Q}, if μ+ν\mu+\nu is not a probability measure, then μ+ν𝐄\mu+\nu\notin\mathbf{E}. Therefore, without loss of generality, we assume throughout that μ+ν\mu+\nu is a probability measure.

Second, suppose τ′(x) is a policy satisfying counterfactual equalized odds over some μ ∈ 𝐐. Then, if 0 < 𝔼_μ[τ′(X)] < 1, it follows that for all a ∈ 𝒜,

0<𝔼μ[τ(X)A=a]<1.0<\mathbb{E}_{\mu}[\tau^{\prime}(X)\mid A=a]<1. (25)

For, suppose not. Then, without loss of generality, there must be a0,a1𝒜a_{0},a_{1}\in\mathcal{A} such that

𝔼μ[τ(X)A=a0]=0\mathbb{E}_{\mu}[\tau^{\prime}(X)\mid A=a_{0}]=0

and

𝔼μ[τ(X)A=a1]>0.\mathbb{E}_{\mu}[\tau^{\prime}(X)\mid A=a_{1}]>0.

But then, by the law of iterated expectation, there must be some 𝒴𝒴\mathcal{Y}^{\prime}\subseteq\mathcal{Y} such that μ[𝒳×𝒴]>0\mu[\mathcal{X}\times\mathcal{Y}^{\prime}]>0 and so,

𝟙𝒳×𝒴𝔼μ\displaystyle\mathbb{1}_{\mathcal{X}\times\mathcal{Y}^{\prime}}\cdot\mathbb{E}_{\mu} [τ(X)A=a1,Y(1)]\displaystyle[\tau^{\prime}(X)\mid A=a_{1},Y(1)]
>0\displaystyle>0
=𝟙𝒳×𝒴𝔼μ[τ(X)A=a0,Y(1)],\displaystyle=\mathbb{1}_{\mathcal{X}\times\mathcal{Y}^{\prime}}\cdot\mathbb{E}_{\mu}[\tau^{\prime}(X)\mid A=a_{0},Y(1)],

contradicting the fact that τ(x)\tau^{\prime}(x) satisfies counterfactual equalized odds over μ\mu. Therefore, in what follows, we can assume that Eq. (25) holds.

Our goal is to show that λ𝐖[𝐄μ]=0\lambda_{\mathbf{W}}[\mathbf{E}-\mu]=0.

Case 1 (τ(X)=𝟙u(X)>0\tau(X)=\mathbb{1}_{u(X)>0}).

We argue as follows. First, we show that 𝟙u(X)>0\mathbb{1}_{u(X)>0} is the unique budget-exhausting multiple threshold policy with non-negative thresholds over μ+ν\mu+\nu for all ν𝐖\nu\in\mathbf{W}. Then, we show that the set of ν𝐖\nu\in\mathbf{W} such that 𝟙u(x)>0\mathbb{1}_{u(x)>0} satisfies counterfactual equalized odds over μ+ν\mu+\nu is a λ𝐖\lambda_{\mathbf{W}}-null set.

We begin by observing that 𝐖^{Lo} ≠ {0}. For, if 𝐖^{Lo} were trivial, then Eq. (25) would not hold for τ(x).

Next, we note that by Eq. (19), for any ν𝐖\nu\in\mathbf{W},

Prμ+ν(u(X)>0)=Prμ(u(X)>0)\operatorname{Pr}_{\mu+\nu}(u(X)>0)=\operatorname{Pr}_{\mu}(u(X)>0)

and so

𝔼μ+ν[𝟙u(X)>0]=min(b,Prμ+ν(u(X)>0)).\mathbb{E}_{\mu+\nu}[\mathbb{1}_{u(X)>0}]=\min(b,\operatorname{Pr}_{\mu+\nu}(u(X)>0)).

If τ(x)\tau^{\prime}(x) is a feasible multiple threshold policy with non-negative thresholds and τ(X)𝟙u(X)>0\tau^{\prime}(X)\neq\mathbb{1}_{u(X)>0} (μ+ν)(\mu+\nu)-a.s., then, as a consequence,

𝔼μ+ν[τ(X)]<Prμ+ν(u(X)>0)b.\mathbb{E}_{\mu+\nu}[\tau^{\prime}(X)]<\operatorname{Pr}_{\mu+\nu}(u(X)>0)\leq b.

Therefore, it follows that 𝟙u(X)>0\mathbb{1}_{u(X)>0} is the unique budget-exhausting multiple threshold policy over μ+ν\mu+\nu with non-negative thresholds.

Now, note that if counterfactual equalized odds holds with decision policy τ(x)=𝟙u(x)>0\tau(x)=\mathbb{1}_{u(x)>0}, then, by Eq. (4) and Lemma E.6, we must have that

Prμ+ν(u(X)>0A=a,Y(1)=y1)=Prμ+ν(u(X)>0A=a,Y(1)=y1)\operatorname{Pr}_{\mu+\nu}(u(X)>0\mid A=a,Y(1)=y_{1})\\ =\operatorname{Pr}_{\mu+\nu}(u(X)>0\mid A=a^{\prime},Y(1)=y_{1})

for a, a′ ∈ 𝒜. (To ensure that both quantities are well-defined, here and throughout the remainder of the proof we use the fact that, by Eqs. (20) and (23), Pr_{μ+ν}(u(X) > 0, A = a, Y(1) = y_1) > 0.)

Now, we will show that a typical element of 𝐖 breaks this balance requirement. Choose a* such that ν_{a*}^{Lo} ≠ 0. For ν ∈ 𝐖, write β_{a*}^{Lo} for the coefficient of ν_{a*}^{Lo} in ν, and let ν′ = ν − β_{a*}^{Lo}·ν_{a*}^{Lo}. Let

p_a = Pr_{μ+ν′}(u(X) > 0 ∣ A = a, Y(1) = y_1).

Note that it cannot be the case that pa=0p_{a}=0 for all a𝒜a\in\mathcal{A}, since, by Eq. (23),

Prμ+ν(u(X)>0Y(1)=y1)>0.\operatorname{Pr}_{\mu+\nu^{\prime}}(u(X)>0\mid Y(1)=y_{1})>0.

Therefore, by the foregoing discussion, we can choose a′ ≠ a* as follows: if p_{a*} = 0, pick a′ such that p_{a′} > 0, which is possible since not every p_a is zero; otherwise, let a′ be any other group. Since the ν_a^{Lo} and ν_{a,i}^{Up} are all mutually singular, it follows that counterfactual equalized odds can only hold over μ+ν if

pa=Prμ+ν(u(X)>0A=a,Y(1)=y1).p_{a^{\prime}}=\operatorname{Pr}_{\mu+\nu}(u(X)>0\mid A=a^{*},Y(1)=y_{1}).

Now, we observe that, by Lemma E.6,

Prμ+ν(u(X)>0A=a,Y(1)=y1)=ηπ+βaLoρ\operatorname{Pr}_{\mu+\nu}(u(X)>0\mid A=a^{*},Y(1)=y_{1})=\frac{\eta}{\pi+\beta_{a^{*}}^{\operatorname{{Lo}}}\cdot\rho}

where

η\displaystyle\eta =Prμ(u(X)>0,A=a,Y(1)=y1)\displaystyle=\operatorname{Pr}_{\mu}(u(X)>0,A=a^{*},Y(1)=y_{1})
π\displaystyle\pi =Prμ(A=a,Y(1)=y1),\displaystyle=\operatorname{Pr}_{\mu}(A=a^{*},Y(1)=y_{1}),
ρ\displaystyle\rho =𝒦𝟙A=a,Y(1)=y1dνaLo.\displaystyle=\int_{\mathcal{K}}\mathbb{1}_{A=a^{*},Y(1)=y_{1}}\,\mathrm{d}\nu_{a^{*}}^{\operatorname{{Lo}}}.

since

0\displaystyle 0 =𝒦𝟙u(X)>0,A=a,Y(1)=y1dνaLo,\displaystyle=\int_{\mathcal{K}}\mathbb{1}_{u(X)>0,A=a^{*},Y(1)=y_{1}}\,\mathrm{d}\nu_{a^{*}}^{\operatorname{{Lo}}},
0\displaystyle 0 𝒦𝟙A=a,Y(1)=y1dνaLo.\displaystyle\neq\int_{\mathcal{K}}\mathbb{1}_{A=a^{*},Y(1)=y_{1}}\,\mathrm{d}\nu_{a^{*}}^{\operatorname{{Lo}}}.

Here, the equality follows by the fact that νLo\nu^{\operatorname{{Lo}}} is supported on SLoa×{y0,y1}S^{\operatorname{{Lo}}}_{a}\times\{y_{0},y_{1}\} and the inequality from Eq. (21).

Therefore, if, in the first case, pa>0p_{a^{\prime}}>0, then counterfactual equalized odds only holds if

β_{a*}^{Lo} = (η − p_{a′}·π)/(p_{a′}·ρ),

since, as noted above, ρ0\rho\neq 0 by Eq. (21). In the second case, if pa=0p_{a^{\prime}}=0, then counterfactual equalized odds can only hold if

η = p_{a*}·π = 0.

Since we chose a′ so that p_{a*} > 0 whenever p_{a′} = 0, and since π > 0 by Eq. (23), this is impossible.

In either case, we see that the set of β_{a*}^{Lo} ∈ ℝ such that there exists a budget-exhausting multiple threshold policy with non-negative thresholds satisfying counterfactual equalized odds over μ + ν′ + β_{a*}^{Lo}·ν_{a*}^{Lo} has λ-measure zero. That is,

λSpan(νaLo)[𝐄μν]=0.\lambda_{\operatorname{\textsc{Span}}(\nu_{a^{*}}^{\operatorname{{Lo}}})}[\mathbf{E}-\mu-\nu^{\prime}]=0.

Since ν\nu^{\prime} was arbitrary, it follows by Fubini’s theorem that λ𝐖[𝐄μ]=0\lambda_{\mathbf{W}}[\mathbf{E}-\mu]=0.
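To see the one-dimensional nature of this constraint numerically, the following sketch (with made-up values of η, π, ρ, and of the reference group's rate; none are derived from any particular μ) confirms that the balance requirement η/(π + β·ρ) = p_{a′} can be met by at most one perturbation coefficient β.

# Toy illustration of the Case 1 argument; all numbers are illustrative.
eta, pi_, rho = 0.12, 0.30, 0.05  # Pr(u>0, A=a*, Y(1)=y1), Pr(A=a*, Y(1)=y1), and the rate of change under nu_{a*}^Lo
p_ref = 0.35                      # rate in an unperturbed reference group

def rate(beta):
    # Group-a* rate after perturbing by beta * nu_{a*}^Lo.
    return eta / (pi_ + beta * rho)

# Unique solution of rate(beta) = p_ref (rho != 0 by Eq. (21)).
beta_star = (eta - p_ref * pi_) / (p_ref * rho)
assert abs(rate(beta_star) - p_ref) < 1e-12

# At any other beta, the balance requirement fails.
assert all(abs(rate(beta_star + delta) - p_ref) > 1e-9 for delta in (-0.5, 0.25, 1.0))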

Case 2 (τ(X)𝟙u(X)>0\tau(X)\neq\mathbb{1}_{u(X)>0}).

Our proof strategy is similar to the previous case. First, we show that, for a given fixed νLo𝐖Lo\nu^{\operatorname{{Lo}}}\in\mathbf{W}^{\operatorname{{Lo}}}, there is a unique candidate policy τ~(x)\tilde{\tau}(x) for being a budget-exhausting multiple threshold policy with non-negative thresholds and satisfying counterfactual equalized odds over μ+νLo+νUp\mu+\nu^{\operatorname{{Lo}}}+\nu^{\operatorname{{Up}}} for any νUp𝐖Up\nu^{\operatorname{{Up}}}\in\mathbf{W}^{\operatorname{{Up}}}. Then, we show that the set of νUp\nu^{\operatorname{{Up}}} such that τ~(X)\tilde{\tau}(X) satisfies counterfactual equalized odds has λ𝐖Up\lambda_{\mathbf{W}^{\operatorname{{Up}}}} measure zero. Finally, we argue that this in turn implies that the set of ν𝐖\nu\in\mathbf{W} such that there exists a Pareto efficient policy satisfying counterfactual equalized odds over μ+ν\mu+\nu has λ𝐖\lambda_{\mathbf{W}}-measure zero.

We seek to show that λ𝐖Up[𝐄(μ+νLo)]=0\lambda_{\mathbf{W}^{\operatorname{{Up}}}}[\mathbf{E}-(\mu+\nu^{\operatorname{{Lo}}})]=0. To begin, we note that since νa,iUp\nu_{a,i}^{\operatorname{{Up}}} concentrates on {y1}×𝒳\{y_{1}\}\times\mathcal{X} for all a𝒜a\in\mathcal{A}, it follows that

𝔼μ+νLo[d(X)A=a,Y(1)=y0]=𝔼μ+νLo+νUp[d(X)A=a,Y(1)=y0]\mathbb{E}_{\mu+\nu^{\operatorname{{Lo}}}}[d(X)\mid A=a,Y(1)=y_{0}]\\ =\mathbb{E}_{\mu+\nu^{\operatorname{{Lo}}}+\nu^{\operatorname{{Up}}}}[d(X)\mid A=a,Y(1)=y_{0}]

for any νUp𝐖Up\nu^{\operatorname{{Up}}}\in\mathbf{W}^{\operatorname{{Up}}}.

Now, suppose there exists some νUp𝐖Up\nu^{\operatorname{{Up}}}\in\mathbf{W}^{\operatorname{{Up}}} such that there exists a budget-exhausting multiple threshold policy τ~(x)\tilde{\tau}(x) with non-negative thresholds such that counterfactual equalized odds is satisfied over μ+νLo+νUp\mu+\nu^{\operatorname{{Lo}}}+\nu^{\operatorname{{Up}}}. (If not, then we are done and λ𝐖Up[𝐄(μ+νLo)]=0\lambda_{\mathbf{W}^{\operatorname{{Up}}}}[\mathbf{E}-(\mu+\nu^{\operatorname{{Lo}}})]=0, as the measure of the empty set is zero.) Let

p = 𝔼_{μ+ν^{Lo}}[τ̃(X) ∣ A = a, Y(1) = y_0], which, by counterfactual equalized odds, does not depend on a.

Suppose that τ̃′(x) is an alternative budget-exhausting multiple threshold policy with non-negative thresholds such that counterfactual equalized odds is satisfied. We seek to show that τ̃′(X) = τ̃(X) (μ+ν^{Lo}+ν^{Up})-a.e. for any ν^{Up} ∈ 𝐖^{Up}. Toward a contradiction, suppose that for some a_0 ∈ 𝒜,

𝔼μ+νLo[τ~(X)A=a0,Y(1)=y0]<p.\mathbb{E}_{\mu+\nu^{\operatorname{{Lo}}}}[\tilde{\tau}^{\prime}(X)\mid A=a_{0},Y(1)=y_{0}]<p.

Since, by Eq. (23), Prμ+νLo(A=a0,Y(1)=y0)>0\operatorname{Pr}_{\mu+\nu^{\operatorname{{Lo}}}}(A=a_{0},Y(1)=y_{0})>0, it follows that

𝔼μ+νLo[τ~(X)A=a0]<𝔼μ+νLo[τ~(X)A=a0].\mathbb{E}_{\mu+\nu^{\operatorname{{Lo}}}}[\tilde{\tau}^{\prime}(X)\mid A=a_{0}]<\mathbb{E}_{\mu+\nu^{\operatorname{{Lo}}}}[\tilde{\tau}(X)\mid A=a_{0}].

Therefore, since τ̃′(x) is budget exhausting, there must be some a_1 such that

𝔼μ+νLo[τ~(X)A=a1]>𝔼μ+νLo[τ~(X)A=a1].\mathbb{E}_{\mu+\nu^{\operatorname{{Lo}}}}[\tilde{\tau}^{\prime}(X)\mid A=a_{1}]>\mathbb{E}_{\mu+\nu^{\operatorname{{Lo}}}}[\tilde{\tau}(X)\mid A=a_{1}].

From this, it follows that τ̃′(x) can be represented by a threshold less than or equal to that of τ̃(x) on α^{-1}(a_1), and hence

𝔼μ+νLo\displaystyle\mathbb{E}_{\mu+\nu^{\operatorname{{Lo}}}} [τ~(X)A=a1,Y(1)=y0]\displaystyle[\tilde{\tau}^{\prime}(X)\mid A=a_{1},Y(1)=y_{0}]
𝔼μ+νLo[τ~(X)A=a0,Y(1)=y0]\displaystyle\geq\mathbb{E}_{\mu+\nu^{\operatorname{{Lo}}}}[\tilde{\tau}(X)\mid A=a_{0},Y(1)=y_{0}]
=p\displaystyle=p
>𝔼μ+νLo[τ~(X)A=a0,Y(1)=y0],\displaystyle>\mathbb{E}_{\mu+\nu^{\operatorname{{Lo}}}}[\tilde{\tau}^{\prime}(X)\mid A=a_{0},Y(1)=y_{0}],

contradicting the fact that τ~(x)\tilde{\tau}^{\prime}(x) satisfies counterfactual equalized odds.

By the preceding discussion, Lemma E.10, and the fact that νLo\nu^{\operatorname{{Lo}}} is supported on u1((,0])u^{-1}((-\infty,0]),

τ~(X)=τ~(X)(μ𝒳×{y0})-a.e.\tilde{\tau}(X)=\tilde{\tau}^{\prime}(X)\quad(\mu\operatorname{\upharpoonright}_{\mathcal{X}\times\{y_{0}\}})\text{-a.e.}

By Eq. (24), it follows that τ~(X)=τ~(X)\tilde{\tau}(X)=\tilde{\tau}^{\prime}(X) νUpa,i\nu^{\operatorname{{Up}}}_{a,i}-a.e. for i=0,1i=0,1. As a consequence,

τ~(X)=τ~(X)(μ+νLo+νup)-a.e.\tilde{\tau}(X)=\tilde{\tau}^{\prime}(X)\quad(\mu+\nu^{\operatorname{{Lo}}}+\nu^{up})\text{-a.e.}

for all νUp𝐖Up\nu^{\operatorname{{Up}}}\in\mathbf{W}^{\operatorname{{Up}}}. Therefore τ~(X)\tilde{\tau}(X) is, indeed, unique, as desired.

Now, we note that since τ(X) ≠ 𝟙_{u(X)>0}, it follows that 𝔼_μ[τ(X)] < Pr_μ(u(X) > 0). It follows that 𝔼_μ[τ(X)] = b, since τ(x) is budget exhausting. Therefore, by Eq. (19), for any budget-exhausting policy τ̃(X) over μ+ν, 𝔼_{μ+ν}[τ̃(X)] = b, and so τ̃(X) ≠ 𝟙_{u(X)>0} over μ+ν.

Therefore, fix νLo\nu^{\operatorname{{Lo}}} and τ~(X)\tilde{\tau}(X). By Eq. (25), there is some aa^{*} such that

0<Prμ+νLo(u(X)>t~aA=a)<1.0<\operatorname{Pr}_{\mu+\nu^{\operatorname{{Lo}}}}(u(X)>\tilde{t}_{a^{*}}\mid A=a^{*})<1.

Then, it follows by Eq. (22) that

𝒦𝟙u(X)>t~adνUpa0.\int_{\mathcal{K}}\mathbb{1}_{u(X)>\tilde{t}_{a^{*}}}\,\mathrm{d}\nu^{\operatorname{{Up}}}_{a^{*}}\neq 0.

As in the first case, write β_{a*}^{Up} for the coefficient of ν_{a*}^{Up} in ν, and fix ν′ = ν − β_{a*}^{Up}·ν_{a*}^{Up}. Then, for some a ≠ a*, set

p=𝔼μ+ν[τ~(X)A=a,Y(1)=y1].p^{*}=\mathbb{E}_{\mu+\nu^{\prime}}[\tilde{\tau}(X)\mid A=a,Y(1)=y_{1}].

Since the νaLo\nu_{a}^{\operatorname{{Lo}}}, νaUp\nu_{a}^{\operatorname{{Up}}} are all mutually singular, it follows that counterfactual equalized odds can only hold over μ+ν\mu+\nu if

p=Prμ+ν(u(X)>t~aA=a,Y(1)=y1).p^{*}=\operatorname{Pr}_{\mu+\nu}(u(X)>\tilde{t}_{a^{*}}\mid A=a^{*},Y(1)=y_{1}).

Now, we observe that, by Lemma E.6,

Pr_{μ+ν}(u(X) > t̃_{a*} ∣ A = a*, Y(1) = y_1) = (η + β_{a*}^{Up}·γ)/π (26)

where

η = Pr_{μ+ν^{Lo}}(u(X) > t̃_{a*}, A = a*, Y(1) = y_1),
π = Pr_{μ+ν^{Lo}}(A = a*, Y(1) = y_1),
γ = ∫_𝒦 𝟙_{u(X)>t̃_{a*}, A=a*, Y(1)=y_1} dν_{a*}^{Up},

and we note that

0 = ∫_𝒦 𝟙_{A=a*, Y(1)=y_1} dν_{a*}^{Up}.

Eq. (26) can be rearranged to

(pπη)βγ=0.(p^{*}\cdot\pi-\eta)-\beta\cdot\gamma=0.

This can only hold if

β=pπηγ,\beta=\frac{p^{*}\cdot\pi-\eta}{\gamma},

since by Eq. (22), γ0\gamma\neq 0. Since any countable subset of \mathbb{R} is a λ\lambda-null set,

λSpan(νaUp)[𝐄μν]=0.\lambda_{\operatorname{\textsc{Span}}(\nu_{a^{*}}^{\operatorname{{Up}}})}[\mathbf{E}-\mu-\nu^{\prime}]=0.

Since ν\nu^{\prime} was arbitrary, it follows by Fubini’s theorem that λ𝐖Up[𝐄μνLo]=0\lambda_{\mathbf{W}^{\operatorname{{Up}}}}[\mathbf{E}-\mu-\nu^{\operatorname{{Lo}}}]=0 in this case as well. Lastly, since νLo\nu^{\operatorname{{Lo}}} was also arbitrary, applying Fubini’s theorem a final time gives that λ𝐖[𝐄μ]=0\lambda_{\mathbf{W}}[\mathbf{E}-\mu]=0.

Conditional principal fairness and path-specific fairness

The extension of these results to conditional principal fairness and path-specific fairness is straightforward. All that is required is a minor modification of the probe.

In the case of conditional principal fairness, we set

νa,wUp[E]\displaystyle\nu_{a,w}^{\operatorname{{Up}}}[E] =μmax,a,w(γ(y1,y1)π𝒳)1[Eu1(SUpa,1)],\displaystyle=\mu_{\max,a,w}\circ(\gamma_{(y_{1},y_{1})}\circ\pi_{\mathcal{X}})^{-1}[E\cap u^{-1}(S^{\operatorname{{Up}}}_{a,1})],
μmax,a(γ(y1,y1)π𝒳)1[Eu1(SUpa,w)],\displaystyle\hskip 11.38092pt-\mu_{\max,a}\circ(\gamma_{(y_{1},y_{1})}\circ\pi_{\mathcal{X}})^{-1}[E\cap u^{-1}(S^{\operatorname{{Up}}}_{a,w})],
νa,wLo[E]\displaystyle\nu_{a,w}^{\operatorname{{Lo}}}[E] =μmax,a,w(γ(y1,y1)π𝒳)1[Eu1(SLoa)]\displaystyle=\mu_{\max,a,w}\circ(\gamma_{(y_{1},y_{1})}\circ\pi_{\mathcal{X}})^{-1}[E\cap u^{-1}(S^{\operatorname{{Lo}}}_{a})]
μmax,a(γ(y0,y0)π𝒳)1[Eu1(SLoa,w)],\displaystyle\hskip 11.38092pt-\mu_{\max,a}\circ(\gamma_{(y_{0},y_{0})}\circ\pi_{\mathcal{X}})^{-1}[E\cap u^{-1}(S^{\operatorname{{Lo}}}_{a,w})],

where γ(y,y):𝒳𝒦\gamma_{(y,y^{\prime})}:\mathcal{X}\to\mathcal{K} is the injection x(x,y,y)x\mapsto(x,y,y^{\prime}). Our probe is then given by

𝐖Up\displaystyle\mathbf{W}^{\operatorname{{Up}}} =Span(νa,wUp),\displaystyle=\operatorname{\textsc{Span}}(\nu_{a,w}^{\operatorname{{Up}}}),
𝐖Lo\displaystyle\mathbf{W}^{\operatorname{{Lo}}} =Span(νa,wLo),\displaystyle=\operatorname{\textsc{Span}}(\nu_{a,w}^{\operatorname{{Lo}}}),

almost as before.

The proof otherwise proceeds virtually identically, except for two points. First, recalling Remark 6, we use the fact that a generic element of 𝐐 satisfies Pr_μ(A = a, W = w) > 0 in place of Pr_μ(A = a) > 0 throughout. Second, we use the fact that ω overlaps utilities in place of Eq. (25). In particular, if ω does not overlap utilities for a generic μ ∈ 𝐐, then, by Lemma E.19, there exists w ∈ 𝒲 such that Pr_μ(u(X) > 0, W = w) = 0 for all μ ∈ 𝐐. If this occurs, we can show that no budget-exhausting multiple threshold policy with positive thresholds satisfies conditional principal fairness, exactly as we did to show Eq. (25).

In the case of path-specific fairness, we instead define

Sa,wLo\displaystyle S_{a,w}^{\operatorname{{Lo}}} =Sa,w(,ra,w),\displaystyle=S_{a,w}\cap(-\infty,r_{a,w}),
Sa,wUp\displaystyle S_{a,w}^{\operatorname{{Up}}} =Sa,w[ra,w,),\displaystyle=S_{a,w}\cap[r_{a,w},\infty),

where ra,wr_{a,w} is chosen so that

Prμmax,a,w(u(X)Sa,wLo)=Prμmax,a,w(u(X)Sa,wUp).\operatorname{Pr}_{\mu_{\max,a,w}}(u(X)\in S_{a,w}^{\operatorname{{Lo}}})=\operatorname{Pr}_{\mu_{\max,a,w}}(u(X)\in S_{a,w}^{\operatorname{{Up}}}).

Let πX\pi_{X} denote the projection from 𝒦=𝒜×𝒳𝒜\mathcal{K}=\mathcal{A}\times\mathcal{X}^{\mathcal{A}} given by

(a,(xa)a𝒜)xa.\left(a,(x_{a^{\prime}})_{a^{\prime}\in\mathcal{A}}\right)\mapsto x_{a}.

Let π_{a′} denote the projection onto the a′-th component. (That is, given μ ∈ 𝐊, the distribution of X_{Π,A,a′} over μ is given by μ∘π^{-1}_{a′}, and the distribution of X is given by μ∘π^{-1}_X.) Then, we let μ̃_{max,a,w} be the measure on 𝒳 given by

μ̃_{max,a,w}[E] = μ_{max,a,w}[E ∩ (u∘π_a)^{-1}(S_{a,w}^{Up})] − μ_{max,a,w}[E ∩ (u∘π_a)^{-1}(S_{a,w}^{Lo})].

Finally, let ϕ:𝒜𝒜\phi:\mathcal{A}\to\mathcal{A} be a permutation of the groups with no fixed points, i.e., so that aϕ(a)a^{\prime}\neq\phi(a^{\prime}) for all a𝒜a^{\prime}\in\mathcal{A}. Then, we define

νa=δa×μ~max,ϕ(a),w1×aϕ(a)μmax,a,w1πa1,\nu_{a^{\prime}}=\delta_{a^{\prime}}\times\tilde{\mu}_{\max,\phi(a^{\prime}),w_{1}}\times\prod_{a\neq\phi(a^{\prime})}\mu_{\max,a,w_{1}}\circ\pi_{a}^{-1},

where δa\delta_{a} is the measure on 𝒜\mathcal{A} given by δa[{a}]=𝟙a=a\delta_{a}[\{a^{\prime}\}]=\mathbb{1}_{a=a^{\prime}}. Then, simply let

𝐖 = Span(ν_{a′})_{a′∈𝒜}.

Since μ̃_{max,a,w}[𝒳] = 0 for all a ∈ 𝒜, it follows that ν_{a′}∘π_X^{-1} = 0, i.e.,

Prμ(XE)=Prμ+ν(XE)\operatorname{Pr}_{\mu}(X\in E)=\operatorname{Pr}_{\mu+\nu}(X\in E)

for any ν𝐖\nu\in\mathbf{W} and μ𝐐\mu\in\mathbf{Q}. Therefore Eqs. (18) and (19) hold. Moreover, the νa\nu_{a} satisfy the following strengthening of Eq. (22). Perturbations in 𝐖\mathbf{W} have the property that for any non-trivial tt—not necessarily positive—some of the mass of u(XΠ,A,a)u(X_{\Pi,A,a}) is moved either above or below tt. More precisely, for any μ𝐐\mu\in\mathbf{Q} and any tt such that

0<Prμ(u(X)>tA=a)<1,0<\operatorname{Pr}_{\mu}(u(X)>t\mid A=a)<1,

if ν𝐖\nu\in\mathbf{W} is such that Pr|ν|(A=ϕ1(a))>0\operatorname{Pr}_{|\nu|}(A=\phi^{-1}(a))>0, then

𝒦𝟙u(XΠ,A,a)>tdνa0.\int_{\mathcal{K}}\mathbb{1}_{u(X_{\Pi,A,a})>t}\,\mathrm{d}\nu_{a}\neq 0. (27)

This stronger property means that we need not separately treat the case where τ(X)=𝟙u(X)>0\tau(X)=\mathbb{1}_{u(X)>0} μ\mu-a.e.

Other than this difference the proof proceeds in the same way, except for two points. First, we again make use of the fact that ω\omega can be assumed to overlap utilities in place of Eq. (25), as in the case of conditional principal fairness. Second, w0w_{0} and w1w_{1} take the place of y0y_{0} and y1y_{1}. In particular, to establish the uniqueness of τ~(x)\tilde{\tau}(x) given μ\mu and νLo\nu^{\operatorname{{Lo}}} in the second case, instead of conditioning on y0y_{0}, we instead condition on w0w_{0}, where, following the discussion in Remark 6 and Lemma E.16, this conditioning is well-defined for a generic element of 𝐐\mathbf{Q}. ∎

We have focused on causal definitions of fairness, but the thrust of our analysis applies to non-causal conceptions of fairness as well. Below we show that policies constrained to satisfy (non-counterfactual) equalized odds (Hardt et al., 2016) are generically strongly Pareto dominated, a result that follows immediately from our proof above.

Definition E.13.

Equalized odds holds for a decision policy d(x)d(x) when

d(X)AY.d(X)\mathchoice{\mathrel{\hbox to0.0pt{$\displaystyle\perp$\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{$\textstyle\perp$\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{$\scriptstyle\perp$\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{$\scriptscriptstyle\perp$\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}A\mid Y. (28)

We note that YY in Eq. (28) does not depend on our choice of d(x)d(x), but rather represents the realized value of YY, e.g., under some status quo decision making rule.

Corollary E.5.

Suppose 𝒰\mathcal{U} is a set of utilities consistent modulo α\alpha. Further suppose that for all a𝒜a\in\mathcal{A} there exist a 𝒰\mathcal{U}-fine distribution of XX and a utility u𝒰u\in\mathcal{U} such that Pr(u(X)>0,A=a)>0\operatorname{Pr}(u(X)>0,A=a)>0, where A=α(X)A=\alpha(X). Then, for almost every 𝒰\mathcal{U}-fine distribution of XX and YY on 𝒳×𝒴\mathcal{X}\times\mathcal{Y}, any decision policy d(x)d(x) satisfying equalized odds is strongly Pareto dominated.

Proof.

Consider the following maps. Distributions of XX and Y(1)Y(1), i.e., probability measures on 𝒳×𝒴\mathcal{X}\times\mathcal{Y}, can be embedded in the space of joint distributions on XX, Y(0)Y(0), and Y(1)Y(1) via pushing forward by the map ι\iota, where ι:(x,y)(x,y,y)\iota:(x,y)\mapsto(x,y,y). Likewise, given a fixed decision policy D=d(X)D=d(X), joint distributions of XX, Y(0)Y(0), and Y(1)Y(1) can be projected onto the space of joint distributions of XX and YY by pushing forward by the map πd:(x,y0,y1)(x,yd(x))\pi_{d}:(x,y_{0},y_{1})\mapsto(x,y_{d(x)}). Lastly, we see that the composition of ι\iota and πd\pi_{d}—regardless of our choice of d(x)d(x)—is the identity, as shown in the diagram below.

[Commutative diagram: 𝒳×𝒴 →(ι) 𝒳×𝒴×𝒴 →(π_d) 𝒳×𝒴, with the composition π_d∘ι equal to the identity on 𝒳×𝒴.]

We note also that counterfactual equalized odds holds for μ\mu exactly when equalized odds holds for μ(πdι)1\mu\circ(\pi_{d}\circ\iota)^{-1}. The result follows immediately from this and Theorem 1. ∎
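As a concrete, if trivial, illustration of the maps above, the following sketch (using a hypothetical deterministic policy d) checks that π_d∘ι is the identity regardless of the choice of d.

# The embedding iota and projection pi_d from the proof; d is an arbitrary
# hypothetical deterministic policy mapping covariates to {0, 1}.
iota = lambda x, y: (x, y, y)                              # (X, Y) -> (X, Y(0), Y(1)) with Y(0) = Y(1)
pi_d = lambda d: (lambda x, y0, y1: (x, (y0, y1)[d(x)]))   # (X, Y(0), Y(1)) -> (X, Y(D))

d = lambda x: int(x > 0)
for x, y in [(-1.0, 0), (2.5, 1)]:
    assert pi_d(d)(*iota(x, y)) == (x, y)                  # pi_d . iota = id, whatever d is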

E.6 General Measures on 𝒦\mathcal{K}

Theorem 1 is restricted to 𝒰\mathcal{U}-fine and 𝒰𝒜\mathcal{U}^{\mathcal{A}}-fine distributions on the state space. The reason for this restriction is that when the distribution of XX induces atoms on the utility scale, threshold policies can possess additional—or even infinite—degrees of freedom when the threshold falls exactly on an atom. In particular circumstances, these degrees of freedom can be used to ensure causal fairness notions, such as counterfactual equalized odds, hold in a locally robust way. In particular, the generalization of Theorem 1 beyond 𝒰\mathcal{U}-fine measures to all totally bounded measures on the state space is false, as illustrated by the following proposition.

Proposition E.7.

Consider the set 𝐄𝕂\mathbf{E}^{\prime}\subseteq\mathbb{K} of distributions—not necessarily 𝒰\mathcal{U}-fine—on 𝒦=𝒳×𝒴\mathcal{K}=\mathcal{X}\times\mathcal{Y} over which there exists a Pareto efficient policy satisfying counterfactual equalized odds. There exist bb, 𝒳\mathcal{X}, 𝒴\mathcal{Y}, and 𝒰\mathcal{U} such that 𝐄\mathbf{E}^{\prime} is not relatively shy.

Proof.

We adopt the notational conventions of Section E.3. We note that by Prop. E.5, a set can only be shy if it has empty interior. Therefore, we will construct an example in which every distribution in an open ball (in the total variation norm) of distributions on 𝒦 admits a Pareto efficient policy satisfying counterfactual equalized odds, i.e., the ball is contained in 𝐄′.

Let b=34b=\tfrac{3}{4}, 𝒴={0,1}\mathcal{Y}=\{0,1\}, 𝒜={a0,a1}\mathcal{A}=\{a_{0},a_{1}\}, and 𝒳={0,1}×{a0,a1}×\mathcal{X}=\{0,1\}\times\{a_{0},a_{1}\}\times\mathbb{R}. Let α:𝒳𝒜\alpha:\mathcal{X}\to\mathcal{A} be given by α:(y,a,v)a\alpha:(y,a,v)\mapsto a for arbitrary (y,a,v)𝒳(y,a,v)\in\mathcal{X}. Likewise, let u:𝒳u:\mathcal{X}\to\mathbb{R} be given by u:(y,a,v)vu:(y,a,v)\mapsto v. Then, if 𝒰={u}\mathcal{U}=\{u\}, 𝒰\mathcal{U} is vacuously consistent modulo α\alpha. Consider the joint distribution μ\mu on 𝒦=𝒳×𝒴\mathcal{K}=\mathcal{X}\times\mathcal{Y} where for all y,y𝒴y,y^{\prime}\in\mathcal{Y}, a𝒜a\in\mathcal{A}, and uu\in\mathbb{R},

Prμ(X=(a,y,u),Y(1)=y)=14𝟙y=yPrμ(u(X)=u),\operatorname{Pr}_{\mu}(X=(a,y,u),Y(1)=y^{\prime})\\ =\frac{1}{4}\cdot\mathbb{1}_{y=y^{\prime}}\cdot\operatorname{Pr}_{\mu}(u(X)=u),

where, over μ, u(X) is distributed as a 1/2-mixture of Unif(1,2) and δ(1); that is, Pr(u(X) = 1) = 1/2 and Pr(s < u(X) < t) = (t − s)/2 for 1 ≤ s ≤ t ≤ 2.

We first observe that there exists a Pareto efficient threshold policy τ(x)\tau(x) such that counterfactual equalized odds is satisfied with respect to the decision policy τ(X)\tau(X). Namely, let

τ(a,y,u)={1u>1,12u=1,0u<1.\tau(a,y,u)=\begin{cases}1&u>1,\\ \frac{1}{2}&u=1,\\ 0&u<1.\end{cases}

Then, it immediately follows that 𝔼[τ(X)]=34=b\mathbb{E}[\tau(X)]=\tfrac{3}{4}=b. Since τ(x)\tau(x) is a threshold policy and exhausts the budget, it is utility maximizing by Lemma D.4. Moreover, if D=𝟙UDτ(X)D=\mathbb{1}_{U_{D}\leq\tau(X)} for some UDUnif(0,1)U_{D}\sim\operatorname{\textsc{Unif}}(0,1) independent of XX and Y(1)Y(1), then DAY(1)D\mathchoice{\mathrel{\hbox to0.0pt{$\displaystyle\perp$\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{$\textstyle\perp$\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{$\scriptstyle\perp$\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{$\scriptscriptstyle\perp$\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}A\mid Y(1). Since u(X)A,Y(1)u(X)\mathchoice{\mathrel{\hbox to0.0pt{$\displaystyle\perp$\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{$\textstyle\perp$\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{$\scriptstyle\perp$\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{$\scriptscriptstyle\perp$\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}A,Y(1), it follows that

Prμ(D=1\displaystyle\operatorname{Pr}_{\mu}(D=1 A=a,Y(1)=y)\displaystyle\mid A=a,Y(1)=y)
=Pr(UDτ(X)A=a,Y(1)=y)\displaystyle\hskip 14.22636pt=\operatorname{Pr}(U_{D}\leq\tau(X)\mid A=a,Y(1)=y)
=Pr(UDτ(X))\displaystyle\hskip 14.22636pt=\operatorname{Pr}(U_{D}\leq\tau(X))
=𝔼μ[τ(X)],\displaystyle\hskip 14.22636pt=\mathbb{E}_{\mu}[\tau(X)],

Therefore Eq. (2) is satisfied, i.e., counterfactual equalized odds holds. Now, using μ\mu, we construct an open ball of distributions over which we can construct similar threshold policies. In particular, suppose μ\mu^{\prime} is any distribution such that |μμ|[𝒦]<164|\mu-\mu^{\prime}|[\mathcal{K}]<\tfrac{1}{64}. Then, we claim that there exists a budget-exhausting threshold policy satisfying counterfactual equalized odds over μ\mu^{\prime}. For, we note that

Prμ(U>1)<Prμ(U>1)+164=3364,\displaystyle\operatorname{Pr}_{\mu^{\prime}}(U>1)<\operatorname{Pr}_{\mu}(U>1)+\frac{1}{64}=\frac{33}{64},
Prμ(U1)>Prμ(U1)164=6364,\displaystyle\operatorname{Pr}_{\mu^{\prime}}(U\geq 1)>\operatorname{Pr}_{\mu}(U\geq 1)-\frac{1}{64}=\frac{63}{64},

and so any threshold policy τ(x)\tau^{\prime}(x) satisfying 𝔼[τ(X)]=b=34\mathbb{E}[\tau^{\prime}(X)]=b=\tfrac{3}{4} must have t=1t=1 as its threshold.

We will now construct a threshold policy τ(x)\tau^{\prime}(x) satisfying counterfactual equalized odds over μ\mu^{\prime}. Consider a threshold policy of the form

τ(a,y,u)={1u>1,pa,yu=1,0u<1.\tau^{\prime}(a,y,u)=\begin{cases}1&u>1,\\ p_{a,y}&u=1,\\ 0&u<1.\end{cases}

For notational simplicity, let

qa,y\displaystyle q_{a,y} =Prμ(A=a,Y=y,U>1),\displaystyle=\operatorname{Pr}_{\mu^{\prime}}(A=a,Y=y,U>1),
ra,y\displaystyle r_{a,y} =Prμ(A=a,Y=y,U=1),\displaystyle=\operatorname{Pr}_{\mu^{\prime}}(A=a,Y=y,U=1),
πa,y\displaystyle\pi_{a,y} =Prμ(A=a,Y=y).\displaystyle=\operatorname{Pr}_{\mu^{\prime}}(A=a,Y=y).

Then, we have that

𝔼μ[τ(X)]\displaystyle\mathbb{E}_{\mu^{\prime}}[\tau^{\prime}(X)] =a,yqa,y+pa,yra,y,\displaystyle=\sum_{a,y}q_{a,y}+p_{a,y}\cdot r_{a,y},
𝔼μ[τ(X)A=a,Y=y]\displaystyle\mathbb{E}_{\mu^{\prime}}[\tau^{\prime}(X)\mid A=a,Y=y] =qa,y+pa,yra,yπa,y.\displaystyle=\frac{q_{a,y}+p_{a,y}\cdot r_{a,y}}{\pi_{a,y}}.

Therefore, the policy will be budget exhausting if

a,yqa,y+pa,yra,y=34,\sum_{a,y}q_{a,y}+p_{a,y}\cdot r_{a,y}=\tfrac{3}{4},

and it will satisfy counterfactual equalized odds if

πa1,0(qa0,0+pa0,0ra0,0)=πa0,0(qa1,0+pa1,0ra1,0),πa1,1(qa0,1+pa0,1ra0,1)=πa0,1(qa1,1+pa1,1ra1,1),\displaystyle\begin{split}\pi_{a_{1},0}\cdot(q_{a_{0},0}&+p_{a_{0},0}\cdot r_{a_{0},0})\\ &=\pi_{a_{0},0}\cdot(q_{a_{1},0}+p_{a_{1},0}\cdot r_{a_{1},0}),\\ \pi_{a_{1},1}\cdot(q_{a_{0},1}&+p_{a_{0},1}\cdot r_{a_{0},1})\\ &=\pi_{a_{0},1}\cdot(q_{a_{1},1}+p_{a_{1},1}\cdot r_{a_{1},1}),\end{split} (29)

since, as above,

Pr(D=1A=a,Y(1)=y)=𝔼[τ(X)A=a,Y(1)=y].\operatorname{Pr}(D=1\mid A=a,Y(1)=y)\\ =\mathbb{E}[\tau^{\prime}(X)\mid A=a,Y(1)=y].

Again, for notational simplicity, let

S=34Prμ(U>1)Prμ(U=1).S=\frac{\tfrac{3}{4}-\operatorname{Pr}_{\mu^{\prime}}(U>1)}{\operatorname{Pr}_{\mu^{\prime}}(U=1)}.

Then, a straightforward algebraic manipulation shows that Eq. (29) is solved by setting pa0,yp_{a_{0},y} to be

Sπa0,y(ra0,y+ra1,y)+πa0,yqa1,yπa1,yqa0,yra0,y(πa0,y+πa1,y),\frac{S\cdot\pi_{a_{0},y}\cdot(r_{a_{0},y}+r_{a_{1},y})+\pi_{a_{0},y}\cdot q_{a_{1},y}-\pi_{a_{1},y}\cdot q_{a_{0},y}}{r_{a_{0},y}\cdot(\pi_{a_{0},y}+\pi_{a_{1},y})},

and pa1,yp_{a_{1},y} to be

Sπa1,y(ra0,y+ra1,y)+πa1,yqa0,yπa0,yqa1,yra1,y(πa0,y+πa1,y).\frac{S\cdot\pi_{a_{1},y}\cdot(r_{a_{0},y}+r_{a_{1},y})+\pi_{a_{1},y}\cdot q_{a_{0},y}-\pi_{a_{0},y}\cdot q_{a_{1},y}}{r_{a_{1},y}\cdot(\pi_{a_{0},y}+\pi_{a_{1},y})}.

In order for τ(x)\tau^{\prime}(x) to be a well-defined policy, we need to show that pa,y[0,1]p_{a,y}\in[0,1] for all a𝒜a\in\mathcal{A} and y𝒴y\in\mathcal{Y}. To that end, note that

qa,y\displaystyle q_{a,y} =Prμ(A=a,Y=y,U>1),\displaystyle=\operatorname{Pr}_{\mu^{\prime}}(A=a,Y=y,U>1),
ra,y\displaystyle r_{a,y} =Prμ(A=a,Y=y,U=1),\displaystyle=\operatorname{Pr}_{\mu^{\prime}}(A=a,Y=y,U=1),
πa,y\displaystyle\pi_{a,y} =Prμ(A=a,Y=y),\displaystyle=\operatorname{Pr}_{\mu^{\prime}}(A=a,Y=y),
ra0,y+ra1,y\displaystyle r_{a_{0},y}+r_{a_{1},y} =Prμ(Y=y,U=1),\displaystyle=\operatorname{Pr}_{\mu^{\prime}}(Y=y,U=1),
πa0,y+πa1,y\displaystyle\pi_{a_{0},y}+\pi_{a_{1},y} =Prμ(Y=y),\displaystyle=\operatorname{Pr}_{\mu^{\prime}}(Y=y),
S\displaystyle S =34Prμ(U>1)Prμ(U=1).\displaystyle=\frac{\frac{3}{4}-\operatorname{Pr}_{\mu^{\prime}}(U>1)}{\operatorname{Pr}_{\mu^{\prime}}(U=1)}.

Now, we recall that |Prμ(E)Prμ(E)|<164|\operatorname{Pr}_{\mu^{\prime}}(E)-\operatorname{Pr}_{\mu}(E)|<\frac{1}{64} for any event EE by hypothesis. Therefore,

7/64 ≤ q_{a,y} ≤ 9/64,
7/64 ≤ r_{a,y} ≤ 9/64,
15/64 ≤ π_{a,y} ≤ 17/64,
15/64 ≤ r_{a_0,y} + r_{a_1,y} ≤ 17/64,
31/64 ≤ π_{a_0,y} + π_{a_1,y} ≤ 33/64,
15/33 ≤ S ≤ 17/31.

Using these bounds and the expressions for pa,yp_{a,y} derived above, we see that

199/1089 < p_{a,y} < 6401/6727,

and hence pa,y[0,1]p_{a,y}\in[0,1] for all a𝒜a\in\mathcal{A} and y𝒴y\in\mathcal{Y}.

Therefore, the policy τ(x)\tau^{\prime}(x) is well-defined, and, by construction, is budget-exhausting and therefore utility-maximizing by Lemma D.4. It also satisfies counterfactual equalized odds by construction. Since μ\mu^{\prime} was arbitrary, it follows that the set of distributions on 𝒦\mathcal{K} such that there exists a Pareto efficient policy satisfying counterfactual equalized odds contains an open ball, and hence is not shy. ∎
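As a numerical sanity check of this construction (an illustrative sketch; the particular perturbation below is an arbitrary choice that keeps the total variation distance under 1/64), the following code computes the randomization probabilities p_{a,y} from the expressions above and verifies that they lie in [0,1], that the policy exhausts the budget, and that the balance conditions in Eq. (29) hold.

# Baseline cell probabilities under mu: each (a, y) cell has mass 1/4, split
# evenly between the atom {u = 1} and the continuous part {u > 1}.
groups, outcomes = ("a0", "a1"), (0, 1)
q = {(a, y): 0.125 for a in groups for y in outcomes}   # Pr(A = a, Y = y, U > 1)
r = {(a, y): 0.125 for a in groups for y in outcomes}   # Pr(A = a, Y = y, U = 1)
pi = {(a, y): 0.250 for a in groups for y in outcomes}  # Pr(A = a, Y = y)

# Perturb mu to mu' by moving a little mass between {U = 1} and {U > 1};
# the total variation distance is 2 * (0.004 + 0.003) = 0.014 < 1/64.
q[("a0", 0)] += 0.004; r[("a0", 0)] -= 0.004
q[("a1", 1)] -= 0.003; r[("a1", 1)] += 0.003

b = 0.75
S = (b - sum(q.values())) / sum(r.values())

# Randomization probabilities at the threshold u = 1, per the formulas above.
p = {}
for y in outcomes:
    for a, other in (("a0", "a1"), ("a1", "a0")):
        num = (S * pi[(a, y)] * (r[("a0", y)] + r[("a1", y)])
               + pi[(a, y)] * q[(other, y)] - pi[(other, y)] * q[(a, y)])
        p[(a, y)] = num / (r[(a, y)] * (pi[("a0", y)] + pi[("a1", y)]))

assert all(0.0 <= v <= 1.0 for v in p.values())
assert abs(sum(q[k] + p[k] * r[k] for k in q) - b) < 1e-12   # budget exhausted
for y in outcomes:                                           # Eq. (29) balance
    lhs = pi[("a1", y)] * (q[("a0", y)] + p[("a0", y)] * r[("a0", y)])
    rhs = pi[("a0", y)] * (q[("a1", y)] + p[("a1", y)] * r[("a1", y)])
    assert abs(lhs - rhs) < 1e-12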

Appendix F Theorem 2 and Related Results

We first prove a variant of Theorem 2 for general, continuous covariates 𝒳\mathcal{X}. Then, we extend and generalize Theorem 2 using the theory of finite Markov chains, offering a proof of the theorem different from the sketch included in the main text.

F.1 Extension to Continuous Covariates

Here we follow the proof sketch in the main text for Theorem 2, which assumes a finite covariate-space 𝒳\mathcal{X}. In that case, we start with a point xx^{*} with maximum decision probability d(x)d(x^{*}), and then assume, toward a contradiction, that there exists a point with strictly lower decision probability. The general case is more involved since it is not immediately clear that the maximum value of d(x)d(x) is achieved with positive probability in 𝒳\mathcal{X}. We start with the lemma below before proving the main result.

Lemma F.21.

A decision policy d(x) satisfies path-specific fairness with W = X if and only if, for any a′ ∈ 𝒜,

𝔼[d(XΠ,A,a)X]=d(X).\mathbb{E}[d(X_{\Pi,A,a^{\prime}})\mid X]=d(X).
Proof.

First, suppose that d(x)d(x) satisfies path-specific fairness. To show the result, we use the standard fact that for independent random variables XX and UU,

𝔼[f(X,U)X]=f(X,u)dFU(u),\mathbb{E}[f(X,U)\mid X]=\int f(X,u)\mathop{}\!\mathrm{d}F_{U}(u), (30)

where FUF_{U} is the distribution of UU. (For a proof of this fact see, for example, Brozius, 2019)

Now, we have that

𝔼[DΠ,A,aXΠ,A,a]\displaystyle\mathbb{E}[D_{\Pi,A,a^{\prime}}\mid X_{\Pi,A,a^{\prime}}] =𝔼[𝟙UDd(XΠ,A,a)XΠ,A,a]\displaystyle=\mathbb{E}[\mathbb{1}_{U_{D}\leq d(X_{\Pi,A,a^{\prime}})}\mid X_{\Pi,A,a^{\prime}}]
=01𝟙ud(XΠ,A,a)du\displaystyle=\int_{0}^{1}\mathbb{1}_{u\leq d(X_{\Pi,A,a^{\prime}})}\mathop{}\!\mathrm{d}u
=d(XΠ,A,a),\displaystyle=d(X_{\Pi,A,a^{\prime}}),

where the first equality follows from the definition of DΠ,A,aD_{\Pi,A,a^{\prime}}, and the second from Eq. (30), since the exogenous variable UDUnif(0,1)U_{D}\sim\operatorname{\textsc{Unif}}(0,1) is independent of the counterfactual covariates XΠ,A,aX_{\Pi,A,a^{\prime}}. An analogous argument shows that 𝔼[DX]=d(X)\mathbb{E}[D\mid X]=d(X).

Finally, conditioning on XX, we have

𝔼[d(XΠ,A,a)X]\displaystyle\mathbb{E}[d(X_{\Pi,A,a^{\prime}})\mid X] =𝔼[𝔼[DΠ,A,aXΠ,A,a]X]\displaystyle=\mathbb{E}[\mathbb{E}[D_{\Pi,A,a^{\prime}}\mid X_{\Pi,A,a^{\prime}}]\mid X]
=𝔼[𝔼[DΠ,A,aXΠ,A,a,X]X]\displaystyle=\mathbb{E}[\mathbb{E}[D_{\Pi,A,a^{\prime}}\mid X_{\Pi,A,a^{\prime}},X]\mid X]
=𝔼[DΠ,A,aX]\displaystyle=\mathbb{E}[D_{\Pi,A,a^{\prime}}\mid X]
=𝔼[DX]\displaystyle=\mathbb{E}[D\mid X]
=d(X),\displaystyle=d(X),

where the second equality follows from the fact that DΠ,A,aXXΠ,A,aD_{\Pi,A,a^{\prime}}\mathchoice{\mathrel{\hbox to0.0pt{$\displaystyle\perp$\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{$\textstyle\perp$\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{$\scriptstyle\perp$\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{$\scriptscriptstyle\perp$\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}X\mid X_{\Pi,A,a^{\prime}}, the third from the law of iterated expectations, and the fourth from the definition of path-specific fairness.

Next, suppose that

𝔼[d(X_{Π,A,a′}) ∣ X] = d(X)

for all a𝒜a^{\prime}\in\mathcal{A}. Then, since W=XW=X and XUDX\mathchoice{\mathrel{\hbox to0.0pt{$\displaystyle\perp$\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{$\textstyle\perp$\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{$\scriptstyle\perp$\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{$\scriptscriptstyle\perp$\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}U_{D}, using Eq. (30), we have that for all a𝒜a^{\prime}\in\mathcal{A},

𝔼[DΠ,A,aX]\displaystyle\mathbb{E}[D_{\Pi,A,a^{\prime}}\mid X] =𝔼[𝔼[𝟙UDd(XΠ,A,a)XΠ,A,a,X]X]\displaystyle=\mathbb{E}[\mathbb{E}[\mathbb{1}_{U_{D}\leq d(X_{\Pi,A,a^{\prime}})}\mid X_{\Pi,A,a^{\prime}},X]\mid X]
=𝔼[𝔼[d(XΠ,A,a)XΠ,A,a,X]X]\displaystyle=\mathbb{E}[\mathbb{E}[d(X_{\Pi,A,a^{\prime}})\mid X_{\Pi,A,a^{\prime}},X]\mid X]
=𝔼[d(XΠ,A,a)X]\displaystyle=\mathbb{E}[d(X_{\Pi,A,a^{\prime}})\mid X]
=d(X)\displaystyle=d(X)
=𝔼[d(X)X]\displaystyle=\mathbb{E}[d(X)\mid X]
=𝔼[DX].\displaystyle=\mathbb{E}[D\mid X].

This is exactly Eq. (5), and so the result follows. ∎

We are now ready to prove a continuous variant of Theorem 2. The technical hypotheses of the theorem ensure that the conditional probability measures Pr(EX)\operatorname{Pr}(E\mid X) are “sufficiently” mutually non-singular distributions on 𝒳\mathcal{X} with respect to the distribution of XX—for example, the conditions ensure that the conditional distribution of XΠ,A,aXX_{\Pi,A,a}\mid X does not have atoms that XX itself does not have, and vice versa. For notational and conceptual simplicity, we only consider the case of trivial ζ\zeta, i.e., where ζ(x)=ζ(x)\zeta(x)=\zeta(x^{\prime}) for all x,x𝒳x,x^{\prime}\in\mathcal{X}.

Proposition F.8.

Suppose that

  1. 1.

    For all a𝒜a\in\mathcal{A} and any event SS satisfying Pr(XSA=a)>0\operatorname{Pr}(X\in S\mid A=a)>0, we have, a.s.,

    Pr(XΠ,A,aSA=aX)>0.\operatorname{Pr}(X_{\Pi,A,a}\in S\lor A=a\mid X)>0.
  2. 2.

    For all a𝒜a\in\mathcal{A} and ϵ>0\epsilon>0, there exists δ>0\delta>0 such that for any event SS satisfying Pr(XSA=a)<δ\operatorname{Pr}(X\in S\mid A=a)<\delta, we have, a.s.,

    Pr(XΠ,A,aS,AaX)<ϵ.\operatorname{Pr}(X_{\Pi,A,a}\in S,A\neq a\mid X)<\epsilon.

Then, for W=XW=X, any Π\Pi-fair policy d(x)d(x) is constant a.s. (i.e., d(X)=pd(X)=p a.s. for some 0p10\leq p\leq 1).

Proof.

Let dmax=d(x)d_{\max}=\|d(x)\|_{\infty}, the essential supremum of dd. To establish the theorem statement, we show that Pr(d(X)=dmaxA=a)=1\operatorname{Pr}(d(X)=d_{\max}\mid A=a)=1 for all a𝒜a\in\mathcal{A}. To do that, we begin by showing that there exists some a𝒜a\in\mathcal{A} such that Pr(d(X)=dmaxA=a)>0\operatorname{Pr}(d(X)=d_{\max}\mid A=a)>0.

Assume, toward a contradiction, that for all a𝒜a\in\mathcal{A},

Pr(d(X)=dmaxA=a)=0.\operatorname{Pr}(d(X)=d_{\max}\mid A=a)=0. (31)

Because 𝒜\mathcal{A} is finite, there must be some a0𝒜a_{0}\in\mathcal{A} such that

Pr(dmaxd(X)<ϵA=a0)>0\operatorname{Pr}(d_{\max}-d(X)<\epsilon\mid A=a_{0})>0 (32)

for all ϵ>0\epsilon>0.

Choose a1a0a_{1}\neq a_{0}. We show that for values of xx such that d(x)d(x) is close to dmaxd_{\max}, the distribution of d(XΠ,A,a1)X=xd(X_{\Pi,A,a_{1}})\mid X=x must be concentrated near dmaxd_{\max} with high probability to satisfy the definition of path-specific fairness, in Eq. (5). But, under the assumption in Eq. (31), we also show that the concentration occurs with low probability, by the continuity hypothesis in the statement of the theorem, establishing the contradiction.

Specifically, by Markov’s inequality, for any ρ>0\rho>0, a.s.,

Pr(dmax\displaystyle\operatorname{Pr}(d_{\max}- d(XΠ,A,a1)ρX)\displaystyle\ d(X_{\Pi,A,a_{1}})\geq\rho\mid X)
𝔼[dmaxd(XΠ,A,a1)X]ρ\displaystyle\leq\frac{\mathbb{E}[d_{\max}-d(X_{\Pi,A,a_{1}})\mid X]}{\rho}
=dmaxd(X)ρ,\displaystyle=\frac{d_{\max}-d(X)}{\rho},

where the final equality follows from Lemma F.21. Rearranging, it follows that for any ρ>0\rho>0, a.s.,

Pr(dmaxd(XΠ,A,a1)<ρX)1dmaxd(X)ρ.\displaystyle\begin{split}\operatorname{Pr}(d_{\max}&-d(X_{\Pi,A,a_{1}})<\rho\mid X)\\ &\hskip 34.14322pt\geq 1-\frac{d_{\max}-d(X)}{\rho}.\end{split} (33)

Now let S={x𝒳:dmaxd(x)<ρ}S=\{x\in\mathcal{X}:d_{\max}-d(x)<\rho\}. By the second hypothesis of the theorem, we can choose δ\delta sufficiently small that if

Pr(XSA=a1)<δ\operatorname{Pr}(X\in S\mid A=a_{1})<\delta

then, a.s.,

Pr(XΠ,A,a1S,Aa1X)<12.\operatorname{Pr}(X_{\Pi,A,a_{1}}\in S,A\neq a_{1}\mid X)<\tfrac{1}{2}.

In other words, we can choose δ such that if

Pr(dmaxd(X)<ρA=a1)<δ\operatorname{Pr}(d_{\max}-d(X)<\rho\mid A=a_{1})<\delta

then, a.s.,

Pr(dmaxd(XΠ,A,a1)<ρ,Aa1X)<12\operatorname{Pr}(d_{\max}-d(X_{\Pi,A,a_{1}})<\rho,A\neq a_{1}\mid X)<\tfrac{1}{2}

By Eq. (31), we can choose ϵ>0\epsilon>0 so small that

Pr(dmaxd(X)<ϵA=a1)<δ.\operatorname{Pr}(d_{\max}-d(X)<\epsilon\mid A=a_{1})<\delta.

Then, we have that

Pr(dmaxd(XΠ,A,a1)<ϵ,Aa1X)<12\operatorname{Pr}(d_{\max}-d(X_{\Pi,A,a_{1}})<\epsilon,A\neq a_{1}\mid X)<\tfrac{1}{2}

a.s. Further, by the definition of the essential supremum and a0a_{0}, and the fact that a0a1a_{0}\neq a_{1}, we have that

Pr(dmaxd(X)<ϵ2,Aa1)>0.\operatorname{Pr}(d_{\max}-d(X)<\tfrac{\epsilon}{2},A\neq a_{1})>0.

Therefore, with positive probability, we have that

1\displaystyle 1 dmaxd(X)ϵ\displaystyle-\frac{d_{\max}-d(X)}{\epsilon}
>1ϵ2ϵ\displaystyle\hskip 28.45274pt>1-\frac{\frac{\epsilon}{2}}{\epsilon}
=12\displaystyle\hskip 28.45274pt=\frac{1}{2}
>Pr(dmaxd(XΠ,A,a1)<ϵ,Aa1X).\displaystyle\hskip 28.45274pt>\operatorname{Pr}(d_{\max}-d(X_{\Pi,A,a_{1}})<\epsilon,A\neq a_{1}\mid X).

This contradicts Eq. (33), and so it cannot be the case that Pr(d(X)=dmaxA=a0)=0\operatorname{Pr}(d(X)=d_{\max}\mid A=a_{0})=0, meaning Pr(d(X)=dmaxA=a0)>0\operatorname{Pr}(d(X)=d_{\max}\mid A=a_{0})>0.

Now, we show that Pr(d(X)=dmaxA=a1)=1\operatorname{Pr}(d(X)=d_{\max}\mid A=a_{1})=1. Suppose, toward a contradiction, that

Pr(d(X)<dmaxA=a1)>0.\operatorname{Pr}(d(X)<d_{\max}\mid A=a_{1})>0.

Then, by the first hypothesis, a.s.,

Pr(d(XΠ,A,a1)<dmaxA=a1X)>0\operatorname{Pr}(d(X_{\Pi,A,a_{1}})<d_{\max}\lor A=a_{1}\mid X)>0

As a consequence,

dmax\displaystyle d_{\max} =𝔼[d(X)d(X)=dmax,A=a0]\displaystyle=\mathbb{E}[d(X)\mid d(X)=d_{\max},A=a_{0}]
=𝔼[𝔼[d(XΠ,A,a1)X]d(X)=dmax,A=a0]\displaystyle=\mathbb{E}[\mathbb{E}[d(X_{\Pi,A,a_{1}})\mid X]\mid d(X)=d_{\max},A=a_{0}]
<𝔼[𝔼[dmaxX]d(X)=dmax,A=a0]\displaystyle<\mathbb{E}[\mathbb{E}[d_{\max}\mid X]\mid d(X)=d_{\max},A=a_{0}]
=𝔼[dmaxd(X)=dmax,A=a0]\displaystyle=\mathbb{E}[d_{\max}\mid d(X)=d_{\max},A=a_{0}]
=dmax,\displaystyle=d_{\max},

where we can condition on the set {d(X)=dmax,A=a0}\{d(X)=d_{\max},A=a_{0}\} since Pr(d(X)=dmaxA=a0)>0\operatorname{Pr}(d(X)=d_{\max}\mid A=a_{0})>0; and the second equality above follows from Lemma F.21. This establishes the contradiction, and so Pr(d(X)=dmaxA=a1)=1\operatorname{Pr}(d(X)=d_{\max}\mid A=a_{1})=1.

Finally, we extend this equality to all a𝒜a\in\mathcal{A}. Since, Pr(d(X)dmaxA=a1)=0\operatorname{Pr}(d(X)\neq d_{\max}\mid A=a_{1})=0, we have, by the second hypothesis of the theorem, that, a.s.,

Pr(d(XΠ,A,a1)dmax,Aa1X)=0.\operatorname{Pr}(d(X_{\Pi,A,a_{1}})\neq d_{\max},A\neq a_{1}\mid X)=0.

Since, by definition, Pr(XΠ,A,a1=XA=a1)=1\operatorname{Pr}(X_{\Pi,A,a_{1}}=X\mid A=a_{1})=1, and Pr(d(X)=dmaxA=a1)=1\operatorname{Pr}(d(X)=d_{\max}\mid A=a_{1})=1, we can strengthen this to

Pr(d(XΠ,A,a1)dmaxX)=0.\operatorname{Pr}(d(X_{\Pi,A,a_{1}})\neq d_{\max}\mid X)=0.

Consequently, a.s.,

d(X) = 𝔼[d(X_{Π,A,a_1}) ∣ X] = 𝔼[d_max ∣ X] = d_max,

where the first equality follows from Lemma F.21, establishing the result. ∎

F.2 A Markov Chain Perspective

The theory of Markov chains illuminates, and allows us to extend, the proof of Theorem 2. Suppose 𝒳 = {x_1, …, x_n}. (Because of the technical difficulties associated with characterizing the long-run behavior of arbitrary infinite Markov chains, we restrict our attention in this section to Markov chains with finite state spaces.) For any a′ ∈ 𝒜, let P_{a′} = [p^{a′}_{i,j}], where p^{a′}_{i,j} = Pr(X_{Π,A,a′} = x_j ∣ X = x_i). Then P_{a′} is a stochastic matrix.

To motivate the subsequent discussion, we first note that this perspective conceptually simplifies some of our earlier results. Lemma F.21 can be recast as stating that when W=XW=X, a policy dd is Π\Pi-fair if and only if Pad=dP_{a^{\prime}}d=d—i.e., if and only if dd is a 1-eigenvector of PaP_{a^{\prime}}—for all a𝒜a^{\prime}\in\mathcal{A}.

The 1-eigenvectors of Markov chains have a particularly simple structure, which we derive here for completeness.

Lemma F.22.

Let S1,,SmS_{1},\ldots,S_{m} denote the recurrent classes of a finite Markov chain with transition matrix PP. If dd is a 1-eigenvector of PP, then dd takes a constant value pkp_{k} on each SkS_{k}, k=1,,mk=1,\ldots,m, and

di=k=1m[limnjSkPijn]pk.d_{i}=\sum_{k=1}^{m}\left[\lim_{n\to\infty}\sum_{j\in S_{k}}P_{ij}^{n}\right]\cdot p_{k}. (34)
Remark 7.

We note that limnjSkPni,j\lim_{n\to\infty}\sum_{j\in S_{k}}P^{n}_{i,j} always exists and is the probability that the Markov chain, beginning at state ii, is eventually absorbed by the recurrent class SkS_{k}.

Proof.

Note that, possibly by reordering the states, we can arrange that the stochastic matrix PP is in canonical form, i.e., that

P=[BRQ],P=\begin{bmatrix}B&\\ R^{\prime}&Q\end{bmatrix},

where Q is a sub-stochastic matrix, R′ is non-negative, and

B=[P1P2Pm]B=\begin{bmatrix}P_{1}&&&\\ &P_{2}&&\\ &&\ddots&\\ &&&P_{m}\end{bmatrix}

is a block-diagonal matrix with the stochastic matrix PiP_{i} corresponding to the transition probabilities on the recurrent set SiS_{i} in the ii-th position along the diagonal.

Now, consider a 1-eigenvector v = [v_1 v_2]^⊤ of P. We must have that Pv = v, i.e., Bv_1 = v_1 and R′v_1 + Qv_2 = v_2. Therefore v_1 is a 1-eigenvector of B. Since B is block diagonal, and each diagonal block is an irreducible stochastic matrix, it follows by the Perron-Frobenius theorem that the 1-eigenvectors of B are given by Span(𝟏_{S_i})_{i=1,…,m}, where 𝟏_{S_i} is the vector which is 1 at index j if j ∈ S_i and is 0 otherwise.

Now, for v1Span(𝟏Si)i=1,,mv_{1}\in\operatorname{\textsc{Span}}(\mathbf{1}_{S_{i}})_{i=1,\ldots,m}, we must find v2v_{2} such that Rv1+Qv2=v2R^{\prime}v_{1}+Qv_{2}=v_{2}.

Note that every finite Markov chain M can be canonically associated with an absorbing Markov chain M^{Abs}, where the set of states of M^{Abs} is exactly the union of the transient states of M and the recurrent sets of M. (In essence, one tracks which state of M the Markov chain is in until it is absorbed by one of the recurrent sets, at which point the entire recurrent set is treated as a single absorbing state.) The transition matrix P^{Abs} associated with M^{Abs} is given by

PAbs=[IRQ]P^{\operatorname{\textsc{Abs}}}=\begin{bmatrix}I&\\ R&Q\end{bmatrix}

where R=R[𝟏S1 1Sm]R=R^{\prime}[\mathbf{1}_{S_{1}}\ \ldots\ \mathbf{1}_{S_{m}}]. In particular, it follows that v=[v1v2]v=[v_{1}\ v_{2}]^{\top} is a 1-eigenvector of PP if and only if [Tv1v2][Tv_{1}\ v_{2}]^{\top} is a 1-eigenvector of PAbsP^{\operatorname{\textsc{Abs}}}, where T:𝟏Si𝐞iT:\mathbf{1}_{S_{i}}\mapsto\mathbf{e}_{i}.

Now, if $v$ is a 1-eigenvector of $P^{\operatorname{\textsc{Abs}}}$, then it is a 1-eigenvector of $(P^{\operatorname{\textsc{Abs}}})^{k}$ for all $k$. Since $Q$ is the sub-stochastic matrix of transitions among the transient states, $Q^{k}\to 0$, and so the series $\sum_{k=0}^{\infty}Q^{k}$ converges to $(I-Q)^{-1}$. Since

$$(P^{\operatorname{\textsc{Abs}}})^{k}=\begin{bmatrix}I&0\\ (I+Q+\cdots+Q^{k-1})R&Q^{k}\end{bmatrix},$$

it follows that

$$\lim_{k\to\infty}(P^{\operatorname{\textsc{Abs}}})^{k}=\begin{bmatrix}I&0\\ (I-Q)^{-1}R&0\end{bmatrix}.$$

Therefore, if $v=[v_{1}\ v_{2}]^{\top}$ is a 1-eigenvector of $P^{\operatorname{\textsc{Abs}}}$, we must have that $(I-Q)^{-1}Rv_{1}=v_{2}$. By Theorem 3.3.7 in Kemeny & Snell (1976), the $(i,k)$ entry of $(I-Q)^{-1}R$ is exactly the probability that, conditional on $X_{0}=x_{i}$, the Markov chain is eventually absorbed by the recurrent set $S_{k}$. This is, in turn, by the Chapman-Kolmogorov equations and the definition of $S_{k}$, equal to $\lim_{n\to\infty}\sum_{j\in S_{k}}P_{ij}^{n}$, and therefore the result follows. ∎
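The structure described by Eq. (34) is easy to verify numerically. The following sketch uses an illustrative transition matrix (not from the paper) in canonical form with two singleton recurrent classes and two transient states, computes the absorption probabilities $(I-Q)^{-1}R$, and confirms that the resulting vector is a 1-eigenvector.

# Sketch of Eq. (34) with made-up numbers: a 1-eigenvector is determined by
# its constant values p_k on the recurrent classes, weighted by the
# absorption probabilities (I - Q)^{-1} R.
import numpy as np

# States in canonical form: recurrent classes S1 = {0}, S2 = {1},
# followed by transient states {2, 3}.
P = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0],
              [0.3, 0.1, 0.4, 0.2],
              [0.2, 0.3, 0.1, 0.4]])

Q = P[2:, 2:]                                 # transitions among transient states
R = P[2:, :2]                                 # transient -> recurrent transitions
absorb = np.linalg.solve(np.eye(2) - Q, R)    # (I - Q)^{-1} R: absorption probabilities

p = np.array([0.2, 0.8])                      # constant values p_k on S1 and S2
d = np.concatenate([p, absorb @ p])           # Eq. (34)

print(np.allclose(P @ d, d))                  # True: d is a 1-eigenvector of P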

We arrive at the following simple necessary condition on $\Pi$-fair policies.

Corollary F.6.

Suppose $\mathcal{X}$ is finite, and define the stochastic matrix $P=\tfrac{1}{|\mathcal{A}|}\sum_{a^{\prime}\in\mathcal{A}}P_{a^{\prime}}$. If $d(x)$ is a $\Pi$-fair policy, then it is constant on the recurrent classes of $P$.

Proof.

By Lemma F.21, $d$ is $\Pi$-fair if and only if $P_{a^{\prime}}d=d$ for all $a^{\prime}\in\mathcal{A}$. Therefore,

$$\frac{1}{|\mathcal{A}|}\sum_{a^{\prime}\in\mathcal{A}}P_{a^{\prime}}d=\frac{1}{|\mathcal{A}|}\sum_{a^{\prime}\in\mathcal{A}}d=d,\qquad(35)$$

and so $d$ is a 1-eigenvector of $P$. Therefore it is constant on the recurrent classes of $P$ by Lemma F.22. ∎

We note that Theorem 2 follows immediately from this.

Proof of Theorem 2.

Note that $\tfrac{1}{|\mathcal{A}|}\sum_{a\in\mathcal{A}}P_{a}$ decomposes into a block-diagonal stochastic matrix, where each block corresponds to a single stratum of $\zeta$ and is irreducible. Consequently, each stratum forms a recurrent class, and the result follows. ∎
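Corollary F.6 suggests a simple computational recipe, sketched below with illustrative matrices: average the per-group transition matrices, identify the recurrent classes of the averaged chain (as strongly connected components from which no probability mass escapes), and note that a $\Pi$-fair policy must be constant on each such class.

# Sketch of Corollary F.6 with made-up matrices: find the recurrent classes
# of the averaged chain P = (1/|A|) * sum_a' P_{a'}.
import numpy as np
from scipy.sparse.csgraph import connected_components

P_a0 = np.array([[0.6, 0.4, 0.0],
                 [0.5, 0.5, 0.0],
                 [0.0, 0.0, 1.0]])
P_a1 = np.array([[0.9, 0.1, 0.0],
                 [0.3, 0.7, 0.0],
                 [0.0, 0.0, 1.0]])
P = (P_a0 + P_a1) / 2

# Strongly connected components of the transition graph; a component is a
# recurrent class if no probability leaves it (its rows sum to one).
n_comp, labels = connected_components(P > 0, directed=True, connection="strong")
recurrent = []
for c in range(n_comp):
    members = np.where(labels == c)[0]
    if np.isclose(P[np.ix_(members, members)].sum(axis=1), 1.0).all():
        recurrent.append(members.tolist())

print(recurrent)  # e.g., [[0, 1], [2]]: a Pi-fair policy must be constant on each class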

Appendix G Proof of Proposition 2

To prove the proposition, we must characterize the conditional tail risks of the beta distribution. Note that in the main text, we parameterize beta distributions in terms of their mean $\mu$ and sample size $v$; here, for mathematical simplicity, we parameterize them in terms of successes, $\alpha$, and failures, $\beta$, where $\mu=\tfrac{\alpha}{\alpha+\beta}$ and $v=\alpha+\beta$.
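A minimal sketch of this reparameterization (the function name is ours, introduced purely for illustration):

# Convert the mean/sample-size parameterization (mu, v) used in the main text
# to the successes/failures parameterization (alpha, beta) used below.
def mean_size_to_shape(mu, v):
    alpha = mu * v
    beta = (1 - mu) * v
    return alpha, beta

print(mean_size_to_shape(0.6, 10))  # (6.0, 4.0): mu = 6/10, v = 6 + 4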

Lemma G.23.

Suppose $Z_{i}\sim\operatorname{\textsc{Beta}}(\alpha_{i},\beta_{i})$ for $i=0,1$, and that $\alpha_{0}>\alpha_{1}>1$ and $1<\beta_{0}<\beta_{1}$. Then, for all $t\in(0,1]$, $\mathbb{E}[Z_{1}\mid Z_{1}<t]<\mathbb{E}[Z_{0}\mid Z_{0}<t]$.

Proof.

Let $Z(\alpha,\beta)\sim\operatorname{\textsc{Beta}}(\alpha,\beta)$. Then,

\begin{align*}
\mathbb{E}[Z(\alpha,\beta)\mid Z(\alpha,\beta)<t]&=\frac{\int_{0}^{t}x\cdot\frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)}\,\mathrm{d}x}{\int_{0}^{t}\frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)}\,\mathrm{d}x}\\
&=\frac{\int_{0}^{t}x^{\alpha}(1-x)^{\beta-1}\,\mathrm{d}x}{\int_{0}^{t}x^{\alpha-1}(1-x)^{\beta-1}\,\mathrm{d}x}.
\end{align*}

Since $\alpha>1$, we may take the partial derivative with respect to $\alpha$ by differentiating under the integral sign, which yields that $\tfrac{\partial}{\partial\alpha}\mathbb{E}[Z(\alpha,\beta)\mid Z(\alpha,\beta)<t]$ equals

$$\frac{\alpha\cdot I(t,\alpha,\beta)^{2}-[\alpha-1]\cdot I(t,\alpha+1,\beta)\cdot I(t,\alpha-1,\beta)}{I(t,\alpha,\beta)^{2}},$$

where $I(t,\alpha,\beta)=\int_{0}^{t}x^{\alpha-1}(1-x)^{\beta-1}\,\mathrm{d}x$. Rearranging gives that this is greater than zero when

\begin{align*}
0&<\alpha\cdot I(t,\alpha+1,\beta)\cdot\int_{0}^{t}(x^{\alpha-2}-x^{\alpha-1})(1-x)^{\beta}\,\mathrm{d}x\\
&\qquad+\alpha\cdot I(t,\alpha,\beta)\cdot\int_{0}^{t}(x^{\alpha-1}-x^{\alpha})(1-x)^{\beta}\,\mathrm{d}x\\
&\qquad+I(t,\alpha+1,\beta)\cdot I(t,\alpha-1,\beta).
\end{align*}

Since all of the integrands are positive, $\tfrac{\partial}{\partial\alpha}\mathbb{E}[Z(\alpha,\beta)\mid Z(\alpha,\beta)<t]>0$.

A virtually identical argument shows that $\tfrac{\partial}{\partial\beta}\mathbb{E}[Z(\alpha,\beta)\mid Z(\alpha,\beta)<t]<0$. Therefore, the result follows. ∎
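The monotonicity can also be spot-checked numerically. Writing the ratio of incomplete beta integrals in the first display above in terms of the beta CDF $F(\cdot\,;\alpha,\beta)$ gives $\mathbb{E}[Z\mid Z<t]=\tfrac{\alpha}{\alpha+\beta}\cdot F(t;\alpha+1,\beta)/F(t;\alpha,\beta)$; the sketch below evaluates this at illustrative parameter values (chosen by us, satisfying the lemma's hypotheses).

# Numerical spot-check of Lemma G.23 (illustrative parameter values): the
# conditional tail mean E[Z | Z < t] increases in alpha and decreases in beta.
from scipy.stats import beta as beta_dist

def tail_mean(alpha, b, t):
    """E[Z | Z < t] for Z ~ Beta(alpha, b), via the beta CDF identity."""
    mu = alpha / (alpha + b)
    return mu * beta_dist.cdf(t, alpha + 1, b) / beta_dist.cdf(t, alpha, b)

t = 0.4
print(tail_mean(3.0, 2.0, t) > tail_mean(2.0, 2.0, t))  # True: increasing in alpha
print(tail_mean(2.0, 3.0, t) < tail_mean(2.0, 2.0, t))  # True: decreasing in beta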

We use this lemma to prove a modest generalization of Prop. 2.

Lemma G.24.

Suppose $\mathcal{A}=\{a_{0},a_{1}\}$, and consider the family $\mathcal{U}$ of utility functions of the form

$$u(x)=r(x)+\lambda\cdot\mathbb{1}_{\alpha(x)=a_{1}},$$

indexed by $\lambda\geq 0$, where $r(x)=\mathbb{E}[Y(1)\mid X=x]$. Suppose the conditional distributions of $r(X)$ given $A$ are beta distributed, i.e.,

$$\mathcal{D}(r(X)\mid A=a)=\operatorname{\textsc{Beta}}(\alpha_{a},\beta_{a}),$$

with $1<\alpha_{a_{1}}<\alpha_{a_{0}}$ and $1<\beta_{a_{0}}<\beta_{a_{1}}$. Then any policy satisfying counterfactual predictive parity is strongly Pareto dominated.

Proof.

Suppose there were a Pareto efficient policy satisfying counterfactual predictive parity. Let $\lambda=0$. Then, by Prop. 1, we may without loss of generality assume that there exist thresholds $t_{a_{0}}$, $t_{a_{1}}$ and a constant $p$ such that a threshold policy $\tau(x)$ witnessing Pareto efficiency is given by

$$\tau(x)=\begin{cases}1&r(x)>t_{\alpha(x)},\\ 0&r(x)<t_{\alpha(x)}.\end{cases}$$

(Note that by our distributional assumption, $\operatorname{Pr}(u(X)=t)=0$ for all $t\in[0,1]$.) Since $\lambda\geq 0$, we must have that $t_{a_{0}}\geq t_{a_{1}}$. Since the budget satisfies $b<1$, $0<t_{a_{0}}$. Therefore,

\begin{align*}
\mathbb{E}[Y(1)\mid A=a_{0},\,D=0]&=\mathbb{E}[r(X)\mid A=a_{0},u(X)<t_{a_{0}}]\\
&\geq\mathbb{E}[r(X)\mid A=a_{0},u(X)<t_{a_{1}}]\\
&>\mathbb{E}[r(X)\mid A=a_{1},u(X)<t_{a_{1}}]\\
&=\mathbb{E}[Y(1)\mid A=a_{1},D=0],
\end{align*}

where the first equality follows from the law of iterated expectations, the first inequality from the fact that $t_{a_{1}}\leq t_{a_{0}}$, the strict inequality from our distributional assumption and Lemma G.23, and the final equality again from the law of iterated expectations. However, since counterfactual predictive parity is satisfied, $\mathbb{E}[Y(1)\mid A=a_{0},D=0]=\mathbb{E}[Y(1)\mid A=a_{1},D=0]$, which is a contradiction. Therefore, no such threshold policy exists. ∎
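The contradiction can be illustrated numerically. The sketch below reuses the tail-mean identity from the previous note, with hypothetical beta parameters and thresholds (chosen by us to satisfy the lemma's hypotheses and $t_{a_{0}}\geq t_{a_{1}}$), and confirms that the mean of $r(X)$ among rejected applicants differs across groups, so parity of $\mathbb{E}[Y(1)\mid A,D=0]$ cannot hold.

# Sketch of the contradiction in Lemma G.24 (illustrative parameters only):
# with group-specific thresholds t_{a0} >= t_{a1}, the rejected-applicant
# mean of r(X) is strictly larger in group a0 than in group a1.
from scipy.stats import beta as beta_dist

def rejected_mean(alpha, b, t):
    """E[r(X) | r(X) < t] when r(X) ~ Beta(alpha, b); see Lemma G.23."""
    mu = alpha / (alpha + b)
    return mu * beta_dist.cdf(t, alpha + 1, b) / beta_dist.cdf(t, alpha, b)

# Hypothetical groups: a0 with Beta(4, 2), a1 with Beta(2, 4); thresholds t_a0 >= t_a1.
t_a0, t_a1 = 0.6, 0.5
lhs = rejected_mean(4.0, 2.0, t_a0)   # plays the role of E[Y(1) | A = a0, D = 0]
rhs = rejected_mean(2.0, 4.0, t_a1)   # plays the role of E[Y(1) | A = a1, D = 0]
print(lhs > rhs)                      # True: counterfactual predictive parity fails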

After accounting for the difference in parameterization, Prop. 2 follows as a corollary.

Proof of Prop. 2.

Note that $\alpha_{a}=\mu_{a}\cdot v>v\cdot\tfrac{1}{v}=1$ and $\beta_{a}=v-\alpha_{a}=v\cdot(1-\mu_{a})>v\cdot(1-\tfrac{1}{v})>2-1=1$. Moreover, since $\mu_{a_{0}}>\mu_{a_{1}}$, $\alpha_{a_{0}}=v\cdot\mu_{a_{0}}>v\cdot\mu_{a_{1}}=\alpha_{a_{1}}$ and $\beta_{a_{0}}=v\cdot(1-\mu_{a_{0}})<v\cdot(1-\mu_{a_{1}})=\beta_{a_{1}}$. Therefore $1<\alpha_{a_{1}}<\alpha_{a_{0}}$ and $1<\beta_{a_{0}}<\beta_{a_{1}}$, and so, by Lemma G.24, the proposition follows. ∎