
Boosting Sortition via Proportional Representation

Soroush Ebadian1, Evi Micha2
(1University of Toronto   2Harvard University)
Abstract

Sortition is based on the idea of choosing randomly selected representatives for decision making. The main properties that make sortition particularly appealing are fairness — all citizens can be selected with the same probability — and proportional representation — a randomly selected panel is likely to reflect the composition of the whole population. When a population lies on a representation metric, we formally define proportional representation by using a notion called the core. A panel is in the core if no group of individuals is underrepresented in proportion to its size. While uniform selection is fair, it does not always return panels that are in the core. Thus, we ask if we can design a selection algorithm that satisfies fairness and the ex post core simultaneously. We answer this question affirmatively and present an efficient selection algorithm that is fair and provides a constant-factor approximation to the optimal ex post core. Moreover, we show that uniformly random selection satisfies a constant-factor approximation to the optimal ex ante core. We complement our theoretical results by conducting experiments with real data.

1 Introduction

Over the last few centuries, representative democracy has become synonymous with elections. However, this has not been the case throughout history. Since ancient Athens, the random selection of representatives from a given population has been proposed as a means of promoting democracy and equality Van Reybrouck (2016). Sortition has gained significant popularity in recent years, mainly because of its use for forming citizens’ assemblies, where a randomly selected panel of individuals deliberates on issues and makes recommendations. Currently, citizens’ assemblies are being implemented by more than 40 organizations in over 25 countries Flanigan et al. (2021a).

Recently, there has been a growing interest within the computer science research community in designing algorithms that select representative panels fairly and transparently Flanigan et al. (2020, 2021a, 2021b); Ebadian et al. (2022). Admittedly, a straightforward method for selecting a representative panel of size $k$ from a given population of size $n$ is to select $k$ individuals uniformly at random Engelstad (1989). We refer to this simple procedure as uniform selection. As highlighted by Flanigan et al. (2020), two main reasons make this method particularly appealing:

  1. Fairness: Each citizen is included in the panel with the same probability, satisfying the requirement of equal participation. Specifically, each citizen is selected with probability $k/n$.

  2. Proportional Representation: The selected panel is likely to mirror the structure of the population: if $x\%$ of the population has specific characteristics, then in expectation, $x\%$ of the panel will consist of individuals with these characteristics. For instance, if the female share of the population is $48\%$, then in expectation, $48\%$ of the panel will be female.

Indeed, uniform selection seems to achieve proportional representation ex ante (before the randomness is realized), since in expectation the selected panel reflects the composition of the population, especially when the size of the panel is very large. However, one of the critiques of this sampling procedure is that with non-zero probability, a panel that completely excludes certain demographic groups can be selected Engelstad (1989). For example, if the population is split evenly between college-educated and non-college-educated individuals, there’s a chance that uniform selection could result in a panel consisting solely of college-educated individuals. To address such extreme cases, various strategies have been proposed to ensure proportional representation ex post (after the randomness is realized) Martin and Carson (1999).

One common strategy is the use of stratified sampling Gąsiorowska (2023). The idea is that the individuals are partitioned into disjoint groups and then a proportional number of representatives is sampled uniformly at random from each group. For example, if the population comprises 49% college-educated individuals and 51% non-college-educated individuals, then we can choose 49% of the representatives from the first group and the remaining representatives from the other group. This idea can be extended to ensure proportional representation across intersectional features as well. For instance, in a population characterized by level of education and income, we can define four groups: college-educated low-income, college-educated high-income, non-college-educated low-income, and non-college-educated high-income, and then sample from each group separately. However, this approach becomes impractical when dealing with a large predefined set of features, as the number of possible groups can grow exponentially, and there may not be enough seats in the panel to represent all of them. A more general approach, extensively used in practice, is to set quotas over individual features or sets of features Flanigan et al. (2020); Vergne (2018). Similar to stratified sampling, when aiming for proportional representation across all intersectional features, the number of quotas can become exponential, making it infeasible to satisfy all of them concurrently. Alternatively, one may opt for setting quotas over a subset of intersectional features. For instance, quotas could be set for gender and race simultaneously, along with additional quotas for income. However, this might not ensure the representation of specific subgroups, such as high-income black women.
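For concreteness, the following is a minimal Python sketch of stratified sampling over predefined groups; the feature tuples, the largest-remainder rounding of seats, and the assumption that every group is large enough to fill its quota are illustrative choices rather than part of any method analyzed in this paper.

    import random
    from collections import defaultdict

    def stratified_sample(features, k, seed=None):
        # features[i] is a tuple of attributes, e.g. (education, income); seats are
        # split across groups in proportion to their sizes, with leftover seats
        # given to the groups with the largest fractional remainders
        rng = random.Random(seed)
        n = len(features)
        groups = defaultdict(list)
        for i, f in enumerate(features):
            groups[f].append(i)
        quotas = {g: (len(members) * k) // n for g, members in groups.items()}
        by_remainder = sorted(groups, key=lambda g: (len(groups[g]) * k) % n, reverse=True)
        for g in by_remainder[: k - sum(quotas.values())]:
            quotas[g] += 1
        panel = []
        for g, members in groups.items():
            panel += rng.sample(members, quotas[g])    # uniform within each group
        return panel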

The presence of the above challenges in existing strategies prompts a need for alternative approaches to ensuring proportional representation. This, in turn, highlights the necessity of first rigorously defining proportional representation. Our work starts from these observations, and we aim to address the following questions:

  1. What is a formal definition of proportional representation of a population?

  2. To what extent does uniform selection satisfy proportional representation?

  3. Is it possible to design selection algorithms that enhance representation guarantees while maintaining fairness?

1.1 Our Approach

Proportional Representation via Core.

We begin by tackling the first question posed above. Intuitively, a panel can be deemed proportionally representative if each group of size $s$ within a population of $n$ individuals is represented by $s/n\cdot k$ members in the panel, out of the total $k$ representatives selected. Motivated by this intuition, we borrow a notion of proportional representation used by recent works on multiwinner elections, fair allocation of public goods and clustering Aziz et al. (2017); Fain et al. (2018); Conitzer et al. (2019); Cheng et al. (2020); Chen et al. (2019), called the core. The main idea of the core is: every subset $S$ of the population is entitled to choose up to $|S|/n\cdot k$ representatives. Formally, a panel $P$ is called proportionally representative, or is said to be in the core, if there does not exist a subset $S$ of the population that could choose a panel $P^{\prime}$, with $|P^{\prime}|\leq|S|/n\cdot k$, under which all of them feel more represented. Note that this notion is not defined over predefined groups using particular features, but it provides proportional representation in the panel to every subset of the population.

Representation Metric Space.

A conceptual challenge is to quantify the extent to which a panel represents an individual. To address this, we take the same approach as Ebadian et al. (2022), in which it is assumed that the individuals lie in an underlying representation metric space. The representation metric space can be constructed as a function of features that are of particular interest for the application at hand, such as gender, age, ethnicity and education. Intuitively, the construction of such a metric space eliminates the necessity of partitioning individuals into groups that all share exactly the same characteristics. Instead, it serves as a means of detecting large groups of individuals that share similar characteristics and are eligible to be represented proportionally. For example, a 30-year-old single, low-income black woman might still feel close to a 35-year-old married, medium-income black woman, since they share many characteristics, even if they differ in some of them.
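As an illustration only, the sketch below shows one possible way to turn mixed demographic features into such a metric (a Gower-style average of per-feature distances, which satisfies the triangle inequality); the feature names and numeric ranges are hypothetical, and the paper does not prescribe any particular construction.

    def representation_distance(a, b, numeric_ranges):
        # a, b: dicts of features; numeric features are compared by a normalized
        # absolute difference, categorical ones by a 0/1 mismatch, and the result
        # is the average over all features
        total = 0.0
        for key in a:
            if key in numeric_ranges:                  # e.g. "age": (18, 90)
                lo, hi = numeric_ranges[key]
                total += abs(a[key] - b[key]) / (hi - lo)
            else:                                      # e.g. "gender", "ethnicity"
                total += 0.0 if a[key] == b[key] else 1.0
        return total / len(a)

    # the two women from the example above differ on only a couple of features,
    # so their distance is small relative to a pair that differs on everything
    p1 = {"age": 30, "gender": "F", "ethnicity": "Black", "income": "low"}
    p2 = {"age": 35, "gender": "F", "ethnicity": "Black", "income": "medium"}
    print(representation_distance(p1, p2, {"age": (18, 90)}))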

$q$-Cost.

Finally, to measure the degree to which an individual is represented by a panel, we again take the approach of Ebadian et al. (2022), following a recent work of Caragiannis et al. (2022) in multiwinner elections. Specifically, the cost of an individual for a panel is determined by her distance from the $q$-th closest member of the panel, for some $q\in[k]$. We find this choice of cost suitable for applications related to sortition for two main reasons. First, an individual may not care about her distance to all the representatives, but she may wish to ensure that there are a few with whom she can relate. For example, a woman may want to ensure that there are at least a few women on a panel to represent her, without necessarily requiring the entire panel to be composed of women, which would not be reasonable. Second, it effectively differentiates between panels containing representatives whom an individual can readily relate to and panels where representatives are more distant from her. For instance, consider an individual aged 40, a panel that includes two representatives aged 40, one aged 20, and one aged 60, and another panel consisting of two representatives aged 30 and two aged 50. The individual may feel represented by at least two people in the former panel, and therefore for $q=2$ her cost would be low, while for the second panel her cost would be higher, since no representative is that close to her. In contrast, natural alternatives such as the average distance would fail to capture this difference, since both panels have the same average distance from her. The choice of $q$ depends on the application at hand. However, in this work, we provide selection algorithms that do not require knowledge of the value of $q$ but offer guarantees for all values of it concurrently.

1.2 Our Contribution

Our primary conceptual contribution lies in introducing the core in the context of sortition. Before delving into our work, we discuss the relevant literature that has provided inspiration and insights for our research. The idea of using the core as a means of measuring the proportional representation that a panel provides to a population lying in a metric space was first introduced by Chen et al. (2019) in a clustering setting. In our terms, Chen et al. (2019) consider the case of $q=1$, i.e., each individual cares about her distance from her closest representative, while in this work we extend the notion of the core to the class of $q$-cost functions. They show that a solution in the core is not guaranteed to exist and define a multiplicative approximation of it with respect to the cost improvement of all individuals eligible to choose a different panel. They introduce an algorithm, called Greedy Capture, that returns a solution in the $(1+\sqrt{2})$-approximate core. Roughly speaking, the algorithm partitions the $n$ individuals into $k$ parts by smoothly increasing balls in the underlying metric space around each individual and greedily creating a part whenever a ball captures $n/k$ individuals that have not already been captured. The centers of the balls serve as the representatives.

In a sortition setting, in addition to proportional representation of all groups, it is important to ensure the fairness constraint, namely that all individuals have the same chance of being included in the panel. To ensure this, a selection algorithm should return a distribution over panels of size $k$, and not a deterministic panel as in the clustering setting. Therefore, in this work we ask for selection algorithms that are simultaneously in the ex post core, meaning that every panel the algorithm might return is in the core, and fair, meaning that each individual is included in the panel with probability equal to $k/n$.

In Section 3, as one would expect, we demonstrate that uniform selection, despite satisfying fairness by definition, falls short of achieving any reasonable approximation to the ex post core for almost every $q$, the only exception being $q=k$. This is due to the fact that when $q=k$, any panel inherently belongs to the $2$-approximate ex post core, as we show later. We then pose the question: Is there any selection algorithm that is fair and achieves an $O(1)$-approximation to the ex post core? The answer is affirmative. We introduce an efficient selection algorithm, denoted FairGreedyCapture, that is fair and is in the $6$-approximate ex post core for every value of $q\in[k]$. In some sense, this guarantees the best of both worlds, as we provide an algorithm that preserves the positive characteristic of uniform selection, namely fairness, and additionally ensures that any realized panel is in the $O(1)$-approximate ex post core. Again loosely speaking, FairGreedyCapture creates $k$ parts using Greedy Capture, which “opens” a ball in the metric space when sufficiently many individuals fall into it. In contrast to Greedy Capture, which selects the center of the ball as a representative, FairGreedyCapture assigns probabilities of selection to individuals within the ball, ensuring that the sum of these probabilities equals $1$. This ensures the selection of one representative from each ball. Additionally, to ensure fairness, a total fraction of $k/n$ is assigned to each individual across the $k$ balls. Then, leveraging Birkhoff’s decomposition algorithm, we find a distribution over panels of size $k$, where each panel contains at least one representative from each ball, and each individual is selected with probability $k/n$. We complement this result by showing that no fair selection algorithm provides an approximation better than $2$ to the ex post core.

In Section 4, we turn our attention to the question: Is uniform selection in the ex ante core? As previously mentioned, uniform selection seems to satisfy the ex ante core, at least for large panels, since, in expectation, a panel is proportionally representative. Here, we investigate whether this is true for all values of $k$ and $q$. In particular, we define a selection algorithm to be in the ex ante core if, for any panel $P$, the expected number of individuals who feel more represented by $P$ than by panels chosen from the selection algorithm is less than $|P|\cdot n/k$. This indicates that no other panel receives significant support, in expectation. First, we show that for $q=k$, uniform selection is in the ex ante core. However, for $q<k$, no fair selection algorithm is in the ex ante core. Therefore, as before, we define a multiplicative approximation with respect to the cost improvement. We demonstrate that uniform selection provides an approximation of $4$ to the ex ante core. On the other hand, we show that no fair selection algorithm provides an approximation better than $2$ to the ex ante core.

In Section 5, we explore the question of whether, given a panel $P$, there is any way to determine if it satisfies an approximation of the ex post core for a value of $q$. This can be useful when a panel has been sampled using a selection algorithm that does not provide any guarantees for the ex post core. We show that given a panel $P$, we can approximate, in polynomial time, how much it violates the core up to constant factors.

Finally, in Section 6, we empirically evaluate the approximation of uniform selection and FairGreedyCapture to the ex post core on metrics constructed from two demographic datasets. We notice that for large values of $q$, uniform selection achieves an approximation to the ex post core similar to the one that FairGreedyCapture achieves. For smaller values of $q$, when the individuals form cohesive parts, uniform selection very often has an unbounded approximation. However, when the individuals are well spread in the space, uniform selection achieves a good approximation of the ex post core. Thus, the decision to use uniform selection depends on the value of $q$ and the structure of the population.

1.3 Related Work

Ebadian et al. (2022) recently considered the same question of measuring, in a rigorous way, the representation that a panel or a selection algorithm achieves. As mentioned above, they also assume the existence of a representation metric space and use the distance to the $q$-th closest representative in the panel to measure to what degree a panel represents an individual. However, they use the social cost (i.e., the sum of individual costs) to measure how much a panel represents the whole population. In Appendix A, we show that this measure of representation may fail to capture the idea of proportional representation. Moreover, while a reasonable approximation of their notion of representation is, in some cases, incompatible with fairness (i.e., each individual being included in the panel with the same probability), in this work we show that there are selection algorithms that achieve a constant approximation of proportional representation and fairness simultaneously.

As we discussed above, a method that is used in practice for enforcing representation is setting quotas over features. However, a problem that arises is that only a few people volunteer to participate in a decision panel. As a result, the representatives are selected from a pool of volunteers which usually does not reflect the composition of the population, since, for example, highly educated people are usually more willing to participate in a decision panel than less educated people. Flanigan et al. (2021a) proposed selection algorithms that, given a biased pool of volunteers, find distributions that maximize the minimum selection probability of any volunteer over panels that satisfy the desired quotas. In this work, similar to Ebadian et al. (2022) and Benadè et al. (2019), we focus on the pivotal idea of a sortition-based democracy that relies on sampling representatives directly from the underlying population Gastil and Wright (2019). However, we later discuss how our approach can be modified to apply to biased pools of volunteers. Benadè et al. (2019) focused on the idea of stratified sampling and asked how this strategy may affect the variance of the representation of unknown groups. Flanigan et al. (2021b) studied how the selection algorithms can become transparent as well. In a more recent work, Flanigan et al. (2024) studied the manipulability of different selection algorithms, i.e., the incentives of individuals to misreport their features.

The representation of individuals as having an ideal point in a metric space has its roots in the spatial model of voting Arrow (1990); Enelow and Hinich (1984). As we mentioned above, the idea of using the core as a notion of proportional representation in a metric space was first introduced by Chen et al. (2019), and later revisited by Micha and Shah (2020), in a clustering setting. Proportional representation in clustering has also been studied by Aziz et al. (2023) and Kalayci et al. (2024). The definition by Aziz et al. (2023) is quite similar to the core, with the basic difference being that each dense group explicitly requires a sufficient number of representatives. Kalayci et al. (2024) consider a version of the core where an agent’s cost for the panel is the sum of her distances to the representatives, and a group is incentivized to deviate to another solution if the group as a whole can reduce the sum of costs. A drawback of both the definition of the core we use in this paper and Greedy Capture, which was mentioned by Aziz et al. (2023) and Kalayci et al. (2024), is that a dense group might end up being represented by just one individual. This happens because Greedy Capture keeps expanding opened balls, and when a new individual is captured by such a ball, it disregards that individual, implicitly assuming that she is already represented. We stress that while our notion of the core does not explicitly account for this problem, FairGreedyCapture does not expand balls that are already open, and thus it does not suffer from this weakness. More broadly, the implicit goal of clustering is to find a set of $k$ centers that represent all the data points in an underlying metric space. As discussed by Chen et al. (2019), the classic objectives, namely the $k$-center, $k$-means, and $k$-median objectives, are deemed incompatible with the core. Consequently, they do not align with the notion of proportional representation desired in this work. The literature has explored various notions of fairness in clustering Chhabra et al. (2021). Recently, Kellerhals and Peters (2023) established links among the numerous concepts related to fairness and proportionality in clustering.

Proportional representation through the core has been extensively studied in the context of multiwinner elections as well Aziz et al. (2017); Faliszewski et al. (2017); Lackner and Skowron (2023); Fain et al. (2018). The problem of selecting a representative panel can be framed as a committee election problem where the candidates are drawn from the same pool as the voters. While in these works the voters and the candidates do not lie in a metric space (instead, the voters hold rankings over the candidates), in our model such rankings could be derived from the underlying metric space. Due to impossibility results Cheng et al. (2020), relaxations of the core have been studied. The ex ante core, as defined here, was introduced by Cheng et al. (2020). They show that, without the fairness constraint, the ex ante core can be guaranteed. In this work, we show that by imposing this fairness constraint, an approximation to the ex ante $q$-core better than $2$ is impossible, for all $q\in[k-1]$.

2 Preliminaries

For $t\in\mathbb{N}$, let $[t]=\{1,\ldots,t\}$. We denote the population by $[n]$. A panel $P$ is defined as a subset of the population. The $n$ individuals lie in an underlying representation metric space with distance function $d$. The distance between individuals $i$ and $j$ is denoted as $d(i,j)$. We assume that the distances are symmetric, i.e., $d(i,j)=d(j,i)$, and satisfy the triangle inequality, i.e., $d(i,j)\leq d(i,\ell)+d(\ell,j)$. An instance of our problem is characterized by the individuals in the population and the distances among them. Henceforth, we simply refer to such an instance as $d$.

We consider a class of cost functions to measure the cost of an individual $i$ within a panel $P$. For $q\in[k]$, we define the $q$-cost of $i$ for $P$ as the distance to her $q$-th closest member in the panel, denoted by $c_{q}(i,P;d)$. When $q=1$, the cost of an individual is equal to her distance from her closest representative in the panel, and for $q=k$, the cost is equal to her distance from her furthest representative in the panel. We denote by $\operatorname{\mathsf{top}}_{q}(i,P;d)$ the set of the $q$ closest representatives of $i$ in a panel $P$ (with ties broken arbitrarily). Additionally, $B(i,r;d)$ represents the set of individuals captured by a ball centered at $i$ with radius $r$, i.e., $B(i,r;d)=\{i^{\prime}\in[n]:d(i,i^{\prime})\leq r\}$. We may omit $d$ from the notation when clear from the context.
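For concreteness, the following Python helpers sketch these three definitions over a dense distance matrix dist (a representation assumed here purely for illustration); they are reused in later sketches.

    import numpy as np

    def q_cost(i, panel, dist, q):
        # c_q(i, P; d): distance from i to her q-th closest member of panel P
        return np.sort(dist[i, list(panel)])[q - 1]

    def top_q(i, panel, dist, q):
        # top_q(i, P; d): the q members of P closest to i (ties broken arbitrarily)
        members = list(panel)
        order = np.argsort(dist[i, members])[:q]
        return {members[t] for t in order}

    def ball(i, r, dist):
        # B(i, r; d): every individual within distance r of i
        return set(np.where(dist[i] <= r)[0])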

A selection algorithm, denoted by $\mathcal{A}_{k}$, is parameterized by $k$, takes as input the metric $d$, and outputs a distribution over all panels of size $k$. We say that a panel is in the support of $\mathcal{A}_{k}$ if it is selected with positive probability under the distribution that $\mathcal{A}_{k}$ outputs. We pay special attention to the uniform selection algorithm, denoted by $\mathcal{U}_{k}$, which always outputs the uniform distribution over all subsets of the population of size $k$.

Fairness.  As mentioned above, one of the appealing properties of uniform selection is that each individual is included in the panel with the same probability. We call this property fairness and we say that a selection algorithm is fair if:

\[\forall i\in[n],\quad\Pr\nolimits_{P\sim\mathcal{A}_{k}}[i\in P]=k/n.\]

Core.  Another appealing property of sortition is proportional representation. Here, we utilize the idea of the core to measure the proportional representation of a panel and, by extension, of a selection algorithm. To do so, we first introduce the following definition: For $\alpha\geq 1$, the $\alpha$-$q$-preference count of $P$ with respect to $P^{\prime}$ is the number of individuals whose $q$-cost under $P$ is larger than $\alpha$ times their $q$-cost under $P^{\prime}$:

\[V_{q}(P,P^{\prime},\alpha)=\lvert\{i\in[n]:c_{q}(i,P)>\alpha\cdot c_{q}(i,P^{\prime})\}\rvert.\]

A panel $P$ is in the $\alpha$-$q$-core if for any panel $P^{\prime}$, $V_{q}(P,P^{\prime},\alpha)<|P^{\prime}|\cdot n/k$. For $\alpha=1$, we say that the panel is in the $q$-core. We define the $\alpha$-$q$-core for $\alpha>1$ since, even when $q=1$, a panel in the exact $q$-core is not guaranteed to exist (Chen et al., 2019; Micha and Shah, 2020).
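As a small illustration, the preference count and a (necessarily partial) core check can be written as follows, reusing the q_cost helper sketched in the preliminaries; since checking every possible deviating panel is exponential, the check below only audits a supplied collection of candidate panels.

    def preference_count(P, P_alt, dist, q, alpha):
        # V_q(P, P', alpha): individuals whose q-cost under P exceeds alpha
        # times their q-cost under the alternative panel P'
        n = dist.shape[0]
        return sum(q_cost(i, P, dist, q) > alpha * q_cost(i, P_alt, dist, q)
                   for i in range(n))

    def in_alpha_q_core(P, dist, q, k, alpha, candidate_panels):
        # partial alpha-q-core check of P against the given candidate panels P'
        n = dist.shape[0]
        return all(preference_count(P, P_alt, dist, q, alpha) < len(P_alt) * n / k
                   for P_alt in candidate_panels)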

Ex Post $q$-Core.

A selection algorithm $\mathcal{A}_{k}$ is in the ex post $\alpha$-$q$-core (or ex post $q$-core, for $\alpha=1$) if every panel $P$ in the support of $\mathcal{A}_{k}$ is in the $\alpha$-$q$-core, i.e., for all $P$ drawn from $\mathcal{A}_{k}$ and all $P^{\prime}$,

\[V_{q}(P,P^{\prime},\alpha)<|P^{\prime}|\cdot n/k.\]

Ex Ante $q$-Core.

A selection algorithm $\mathcal{A}_{k}$ is in the ex ante $\alpha$-$q$-core (or ex ante $q$-core, for $\alpha=1$) if for all $P^{\prime}$:

\[\mathbb{E}_{P\sim\mathcal{A}_{k}}[V_{q}(P,P^{\prime},\alpha)]<|P^{\prime}|\cdot\frac{n}{k}.\]

The idea of requiring a core-like property over the expected preference counts was introduced by Cheng et al. (2020) in a multiwinner election setting. Essentially, it states that for any panel $P^{\prime}$, if, for any realized panel $P$, we count the number of individuals that reduce their cost by a multiplicative factor of at least $\alpha$ under $P^{\prime}$, then, in expectation, this number is less than $|P^{\prime}|\cdot n/k$. Therefore, in expectation, they are not eligible to choose it.

It is easy to see that the ex post $\alpha$-$q$-core implies the ex ante $\alpha$-$q$-core, since if for each $P$ in the support of the distribution that $\mathcal{A}_{k}$ returns and each $P^{\prime}$, it holds that $V_{q}(P,P^{\prime},\alpha)<|P^{\prime}|\cdot n/k$, then $\mathbb{E}_{P\sim\mathcal{A}_{k}}[V_{q}(P,P^{\prime},\alpha)]<|P^{\prime}|\cdot n/k$.

3 Fairness and Ex Post Core

Input: $[n]$, $d$
Output: $P_{\ell}$ and $\lambda_{\ell}$, for $\ell\in[L]$, where each $P_{\ell}$ represents a panel of size $k$ and $\lambda_{\ell}$ its probability of being selected
/* Create a $(k/n)$-fractional allocation by distributing a $k/n$ fraction for each individual among $k$ balls, ensuring that each ball contains a total fractional amount equal to $1$. */
$X\leftarrow[0]^{k\times n}$; $\delta\leftarrow 0$; $j\leftarrow 1$; $\{y_{i}\leftarrow k/n\}_{i\in[n]}$;
while $\sum_{i\in[n]}y_{i}>0$ do
       Smoothly increase $\delta$;
       while $\exists i\in[n]$ such that $\sum_{i^{\prime}\in B(i,\delta)}y_{i^{\prime}}\geq 1$ do
             while $X_{j}=\sum_{i\in[n]}X_{j,i}<1$ do
                   Pick $i^{\prime}\in B(i,\delta)$ with $y_{i^{\prime}}>0$;
                   $X_{j,i^{\prime}}\leftarrow\min(1-X_{j},y_{i^{\prime}})$;
                   $y_{i^{\prime}}\leftarrow y_{i^{\prime}}-X_{j,i^{\prime}}$;
             end while
             $j\leftarrow j+1$;
       end while
end while
/* Apply Birkhoff’s decomposition */
$X^{\prime}\leftarrow[1/n]^{(n-k)\times n}$;
Let $Y=\begin{bmatrix}X\\ X^{\prime}\end{bmatrix}$;
Compute a decomposition $Y=\sum_{\ell=1}^{L}\lambda_{\ell}Y^{\ell}$ using Birkhoff’s decomposition (Theorem 2);
for $\ell=1$ to $L$ do
      $P_{\ell}\leftarrow\{i\in[n]\mid Y^{\ell}_{j,i}=1\text{ for some }j\leq k\}$
end for
return the distribution over $L$ panels $\{P_{\ell}\}_{\ell\in[L]}$ where $P_{\ell}$ is selected with probability $\lambda_{\ell}$
ALGORITHM 1: $\textsc{FairGreedyCapture}_{k}$

In this section, we investigate whether there are selection algorithms that are fair and, in addition, provide a constant approximation to the ex post $q$-core. Unsurprisingly, uniform selection may fail to provide any bounded approximation to the ex post $q$-core for $q\in[k-1]$. (For $q=k$, we show in Appendix B that all panels lie in the $2$-approximate $k$-core; hence, any algorithm, including uniform selection, is in the ex post $2$-$k$-core.) This happens because each panel has a nonzero probability of selection, and there may exist panels with arbitrarily large violations of the $q$-core objective.

Theorem 1.

For any $q\in[k-1]$ and $\lfloor n/k\rfloor\geq k$, there exists an instance such that uniform selection is not in the ex post $\alpha$-$q$-core for any bounded $\alpha$.

Proof.

Consider an instance in which there are $\lfloor n/k\rfloor$ individuals in group $A$ and the remaining individuals are in group $B$. Suppose that the distance between any two individuals in the same group is $0$, and the distance between any two individuals in different groups is $1$. Since $\lfloor n/k\rfloor\geq k$, uniform selection has a non-zero probability of returning a panel where all the representatives are from group $A$. In this scenario, for any $q\in[k-1]$, the $q$-cost of all the individuals in group $B$ is equal to $1$. However, the individuals in group $B$ are entitled to choose up to $k-1$ representatives among themselves, and if they do so, their $q$-cost becomes $0$, resulting in an unbounded improvement of their $q$-cost. Therefore, uniform selection is not in the ex post $\alpha$-$q$-core for any bounded $\alpha$. ∎
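As a quick numerical illustration of this construction (with small example values of $n$ and $k$ chosen only for readability), the probability that uniform selection returns a panel lying entirely inside group $A$ is tiny but strictly positive:

    from math import comb

    def prob_all_from_group_A(n, k):
        # probability that a uniformly random panel lies entirely in group A,
        # where |A| = floor(n/k), as in the instance from the proof of Theorem 1
        a = n // k
        assert a >= k, "the construction requires floor(n/k) >= k"
        return comb(a, k) / comb(n, k)

    # e.g. n = 20, k = 4: group B is excluded with positive probability, and each
    # of its members could then reduce her q-cost from 1 to 0 by deviating with B
    print(prob_all_from_group_A(20, 4))   # roughly 0.001 > 0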

Therefore, we ask: For every $q$, is there any selection algorithm that keeps the fairness guarantee of uniform selection and ensures that every panel in its support is in a constant approximation of the $q$-core? We answer this positively.

We present a selection algorithm, called $\textsc{FairGreedyCapture}_{k}$, that is fair and in the ex post $6$-$q$-core for every $q\in[k]$. We highlight that the algorithm does not need to know the value of $q$. Our algorithm leverages the basic idea of the Greedy Capture algorithm introduced by Chen et al. (2019), which returns a panel in the $(1+\sqrt{2})\approx 2.42$-approximation of the $1$-core. Note that this algorithm is deterministic and need not satisfy fairness. Briefly, Greedy Capture starts with an empty panel and grows a ball around every individual at the same rate. When a ball captures at least $\lceil n/k\rceil$ individuals for the first time, the center of the ball is included in the panel and all the captured individuals are disregarded. The algorithm keeps growing balls around all individuals, including the opened balls. As the opened balls continue to grow and capture more individuals, the newly captured ones are immediately disregarded as well. Note that the final panel can be of size less than $k$.

At a high level, $\textsc{FairGreedyCapture}_{k}$, as outlined in Algorithm 1, operates as follows: it greedily opens $k$ balls using the basic idea of the Greedy Capture algorithm, ensuring each ball contains sufficiently many individuals. In contrast to Greedy Capture, which selects the centers of the balls as the representatives, our algorithm probabilistically selects precisely one individual from each of the $k$ balls.

Before we describe the algorithm in more detail, we define a $(k/n)$-fractional allocation as a non-negative $k\times n$ matrix $X\in[0,1]^{k\times n}$ where the entries in each row sum to $1$ and the entries in each column sum to $k/n$, i.e., for each $i\in[n]$, $\sum_{j\in[k]}X_{j,i}=k/n$, and for each $j\in[k]$, $\sum_{i\in[n]}X_{j,i}=1$. The algorithm, during its execution, generates a $(k/n)$-fractional allocation $X$ of the individuals in $[n]$ into $k$ balls, where $X_{j,i}$ denotes the fraction of individual $i$ assigned to ball $j$. We say that an individual $i$ is assigned to ball $j$ if $X_{j,i}>0$. An individual can be assigned to more than one ball.

The $(k/n)$-fractional allocation $X$ is generated as follows. Denote the unallocated part of each individual $i$ by $y_{i}$. Start with $y_{i}=k/n$. This corresponds to the fairness criterion that we allocate a $k/n$ probability of selection to each individual. Algorithm 1 grows a ball around every individual in $[n]$ at the same rate. Suppose a ball captures individuals whose combined unallocated parts sum to at least $1$. Then, we open this ball and, from the individuals $i^{\prime}$ captured by this ball with $y_{i^{\prime}}>0$, we arbitrarily remove a total mass of exactly $1$ and assign it to the ball. This can be done in various ways, e.g., greedily pick an individual $i^{\prime}$ with positive $y_{i^{\prime}}$ and allocate a $\min\{1-\sum_{i\in[n]}X_{j,i},y_{i^{\prime}}\}$ fraction of it to the corresponding row (i.e., ball). This procedure terminates when the $k/n$ fraction of each individual is fully allocated. Note that since each time a ball opens, a total mass of $1$ is deducted from the $y_{i}$'s, and, for each $i\in[n]$, $y_{i}$ starts with a fraction of $k/n$, exactly $k$ balls are opened.
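The following Python sketch mimics this allocation phase. The continuous growth of $\delta$ is simulated by sweeping over the sorted pairwise distances, and the tolerance handling is an implementation convenience rather than part of the algorithm's specification.

    import numpy as np

    def fractional_allocation(dist, k, eps=1e-9):
        # Sketch of the (k/n)-fractional allocation phase of FairGreedyCapture.
        # dist: symmetric (n, n) distance matrix. Returns X of shape (k, n) whose
        # rows sum to 1 and whose columns sum to k/n (up to the tolerance eps).
        n = dist.shape[0]
        y = np.full(n, k / n)            # unallocated mass of each individual
        X = np.zeros((k, n))
        j = 0                            # index of the next ball to open
        for delta in np.unique(dist):    # "smoothly increasing" radius
            opened = True
            while opened and j < k:
                opened = False
                for i in range(n):
                    captured = np.where(dist[i] <= delta)[0]
                    if y[captured].sum() >= 1.0 - eps:
                        # open ball j and move (up to) one unit of mass into it
                        need = min(1.0, y[captured].sum())
                        for ip in captured:
                            take = min(need, y[ip])
                            X[j, ip] += take
                            y[ip] -= take
                            need -= take
                            if need <= eps:
                                break
                        j += 1
                        opened = True
                        if j == k:
                            break
            if j == k:
                break
        return X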

Sampling panels from the $(k/n)$-fractional allocation.

Next, we show a method for decomposing $X$, the $(k/n)$-fractional allocation, into a distribution over panels of size $k$ that each contain at least one representative from each ball. We employ Birkhoff’s decomposition (Birkhoff, 1946). This theorem applies to square matrices that are bistochastic. A matrix is bistochastic if every entry is nonnegative and the entries in each of its rows and columns sum to $1$.

Theorem 2 (Birkhoff-von Neumann).

Let $Y$ be an $n\times n$ bistochastic matrix. There exists a polynomial-time algorithm that computes a decomposition $Y=\sum_{\ell=1}^{L}\lambda_{\ell}Y^{\ell}$, with $L\leq n^{2}-n+2$, such that for each $\ell\in[L]$, $\lambda_{\ell}\in[0,1]$, $Y^{\ell}$ is a permutation matrix, and $\sum_{\ell=1}^{L}\lambda_{\ell}=1$.

We cannot directly apply the theorem above, since the $(k/n)$-fractional allocation $X$ is neither bistochastic nor square. However, we can complete $X$ into a square matrix $Y=\begin{bmatrix}X\\ X^{\prime}\end{bmatrix}$ by adding $n-k$ rows $X^{\prime}=[1/n]^{(n-k)\times n}$ in which all entries are $1/n$. Note that the resulting matrix $Y$ is bistochastic. Indeed, each row of both $X$ and $X^{\prime}$ sums to $1$ by definition; further, since each column of $X$ sums to $k/n$ and is followed by $n-k$ entries equal to $1/n$ in $X^{\prime}$, the columns also sum to $1$. Note that there are various choices of $X^{\prime}$ that make $Y$ a bistochastic matrix, but here we use the uniform matrix for simplicity. Then, the algorithm applies Theorem 2 and computes the decomposition $Y=\sum_{\ell=1}^{L}\lambda_{\ell}Y^{\ell}$. For each permutation matrix $Y^{\ell}$, we create a panel $P_{\ell}$ consisting of the individuals that have been assigned to the first $k$ rows, i.e., $P_{\ell}$ contains all $i$'s with $Y^{\ell}_{j,i}=1$ for some $j\leq k$. Finally, the algorithm returns the distribution that selects each panel $P_{\ell}$ with probability equal to $\lambda_{\ell}$.
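A sketch of the completion and decomposition step is shown below. Instead of a textbook Birkhoff–von Neumann routine, it peels permutation matrices off greedily using scipy's linear_sum_assignment, with a large penalty that keeps each matching inside the support of the remaining matrix; this is only one of several ways to implement Theorem 2, and the numerical guards are illustrative.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def birkhoff_decompose(Y, tol=1e-9):
        # greedy Birkhoff-von Neumann decomposition of a bistochastic matrix Y:
        # repeatedly extract a permutation lying in the support of the remainder
        n = Y.shape[0]
        R = Y.astype(float).copy()
        parts = []
        big = n + 1.0                # penalty that forbids (near-)zero entries of R
        while R.max() > tol:
            cost = np.where(R > tol, -R, big)
            rows, cols = linear_sum_assignment(cost)   # min-cost perfect matching
            perm = np.zeros_like(R)
            perm[rows, cols] = 1.0
            lam = R[rows, cols].min()                  # weight we can peel off
            if lam <= tol:                             # numerical guard
                break
            parts.append((lam, perm))
            R = R - lam * perm
        return parts

    def fair_greedy_capture_lottery(X):
        # complete the (k/n)-fractional allocation X (k x n) into a bistochastic
        # matrix Y with uniform padding rows, then turn Y into a lottery over
        # panels of size k (the individuals matched to the first k rows)
        k, n = X.shape
        Y = np.vstack([X, np.full((n - k, n), 1.0 / n)])
        lottery = []
        for lam, perm in birkhoff_decompose(Y):
            panel = sorted(np.where(perm[:k].sum(axis=0) > 0)[0])
            lottery.append((lam, panel))
        return lottery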

To prove that $\textsc{FairGreedyCapture}_{k}$ is fair and in the ex post $O(1)$-$q$-core, we need the next two lemmas.

Lemma 1.

Let $S\subseteq[n]$, $P^{\prime}$ be a panel, and $m=\lfloor|P^{\prime}|/q\rfloor$.

  1. There exists a partitioning of $S$ into $m$ disjoint sets $T_{1},\ldots,T_{m}$ and an individual $i^{*}_{\ell}\in T_{\ell}$ such that for all $\ell\in[m]$ and $i\in T_{\ell}$, $c_{q}(i,P^{\prime})\leq c_{q}(i^{*}_{\ell},P^{\prime})$ and $\operatorname{\mathsf{top}}_{q}(i,P^{\prime})\cap\operatorname{\mathsf{top}}_{q}(i^{*}_{\ell},P^{\prime})\neq\emptyset$.

  2. There exists a partitioning of $S$ into $m$ disjoint sets $T_{1},\ldots,T_{m}$ and an individual $i^{*}_{\ell}\in T_{\ell}$ such that for all $\ell\in[m]$ and $i\in T_{\ell}$, $c_{q}(i,P^{\prime})\geq c_{q}(i^{*}_{\ell},P^{\prime})$ and $\operatorname{\mathsf{top}}_{q}(i,P^{\prime})\cap\operatorname{\mathsf{top}}_{q}(i^{*}_{\ell},P^{\prime})\neq\emptyset$.

Proof.

We start by showing the first part. We partition all the individuals in $S$ into $m\leq\lfloor|P^{\prime}|/q\rfloor$ groups, denoted by $T_{1},\ldots,T_{m}$, iteratively as follows.

Suppose $i^{*}_{1}$ is the individual with the largest $q$-cost over $P^{\prime}$ (ties broken arbitrarily), i.e., $i^{*}_{1}=\operatorname*{arg\,max}_{i\in S}c_{q}(i,P^{\prime})$. Then, $T_{1}$ is the set of all the individuals whose $q$ closest representatives from $P^{\prime}$ include at least one member of $\operatorname{\mathsf{top}}_{q}(i^{*}_{1},P^{\prime})$, i.e.,

\[T_{1}=\{i\in S:\operatorname{\mathsf{top}}_{q}(i,P^{\prime})\cap\operatorname{\mathsf{top}}_{q}(i^{*}_{1},P^{\prime})\neq\emptyset\}.\]

Next, from the remaining individuals, suppose $i^{*}_{2}$ is the one with the largest $q$-cost over $P^{\prime}$, i.e., $i^{*}_{2}=\operatorname*{arg\,max}_{i\in S\setminus T_{1}}c_{q}(i,P^{\prime})$. Construct $T_{2}$ from $S\setminus T_{1}$ similarly by taking all the individuals at least one of whose $q$ closest representatives in $P^{\prime}$ is included in $\operatorname{\mathsf{top}}_{q}(i^{*}_{2},P^{\prime})$. We repeat this procedure, and in round $\ell$, we find $i^{*}_{\ell}\in S\setminus(\cup_{\ell^{\prime}=1}^{\ell-1}T_{\ell^{\prime}})$ that has the largest cost over $P^{\prime}$, and construct $T_{\ell}$ by assigning to it any individual in $S\setminus(\cup_{\ell^{\prime}=1}^{\ell-1}T_{\ell^{\prime}})$ at least one of whose $q$ closest representatives belongs to $\operatorname{\mathsf{top}}_{q}(i^{*}_{\ell},P^{\prime})$. Note that for any $\ell_{1},\ell_{2}\in[m]$ with $\ell_{1}<\ell_{2}$, $\operatorname{\mathsf{top}}_{q}(i^{*}_{\ell_{1}},P^{\prime})\cap\operatorname{\mathsf{top}}_{q}(i^{*}_{\ell_{2}},P^{\prime})=\emptyset$, as if at least one of the $q$ closest representatives of $i^{*}_{\ell_{2}}$ in $P^{\prime}$ were included in $\operatorname{\mathsf{top}}_{q}(i^{*}_{\ell_{1}},P^{\prime})$, then $i^{*}_{\ell_{2}}$ would have been assigned to $T_{\ell_{1}}$ and would not belong to $S\setminus(\cup_{\ell^{\prime}=1}^{\ell_{2}-1}T_{\ell^{\prime}})$. This means that in each round, we consider $q$ representatives that have not been considered before, and hence after $\lfloor|P^{\prime}|/q\rfloor$ rounds, fewer than $q$ representatives in $P^{\prime}$ may remain unconsidered. As a result, after at most $\lfloor|P^{\prime}|/q\rfloor$ rounds, all the individuals will have been assigned to some group, since at least one of their $q$ closest representatives has been considered.

The second part follows by simply setting $i^{*}_{\ell}$ to be equal to the individual in $S\setminus(\cup_{\ell^{\prime}=1}^{\ell-1}T_{\ell^{\prime}})$ that has the smallest cost over $P^{\prime}$, i.e., $i^{*}_{\ell}=\operatorname*{arg\,min}_{i\in S\setminus(\cup_{\ell^{\prime}=1}^{\ell-1}T_{\ell^{\prime}})}c_{q}(i,P^{\prime})$. All the remaining arguments remain the same. ∎

Lemma 2.

For any panel $P$ and any $i,i^{\prime}\in[n]$, it holds that $c_{q}(i,P)\leq d(i,i^{\prime})+c_{q}(i^{\prime},P)$.

Proof.

Consider a ball centered at $i^{\prime}$ with radius $c_{q}(i^{\prime},P)$. This ball contains at least $q$ representatives of $P$, and each of them is at distance at most $d(i,i^{\prime})+c_{q}(i^{\prime},P)$ from $i$. Hence, $c_{q}(i,P)\leq d(i,i^{\prime})+c_{q}(i^{\prime},P)$. ∎

Now, we are ready to prove the next theorem.

Theorem 3.

For every $q$, $\textsc{FairGreedyCapture}_{k}$ is fair and in the ex post $6$-$q$-core.

Proof.

Seeing that the algorithm is fair is straightforward. For a matrix $A$, let $A[1{:}k,:]$ be the submatrix induced by keeping its first $k$ rows. First, note that for each panel $P_{\ell}$ we choose the individuals that have been assigned to $Y^{\ell}[1{:}k,:]$ and, second, recall that $Y[1{:}k,:]=X$. The fairness of the algorithm follows from the facts that $Y[1{:}k,:]=X=\sum_{\ell=1}^{L}\lambda_{\ell}Y^{\ell}[1{:}k,:]$ and that for each $i\in[n]$, $\sum_{j=1}^{k}X_{j,i}=k/n$.

We proceed by showing that $\textsc{FairGreedyCapture}_{k}$ is in the ex post $6$-$q$-core for all $q\in[k]$. First, note that if an individual $i$ is assigned to a ball $j$ in some $Y^{\ell}$, then we must have $X_{j,i}>0$. Now, since in each permutation matrix $Y^{\ell}$ every ball $j\in[k]$ is assigned exactly one individual $i\in[n]$, we get that at least one individual is selected from each ball.

Let $P$ be any panel that the algorithm may return. Suppose for contradiction that there exists a panel $P^{\prime}$ such that $V_{q}(P,P^{\prime},6)\geq|P^{\prime}|\cdot n/k$. This means that there exists $S\subseteq[n]$, with $|S|\geq|P^{\prime}|\cdot n/k$, such that

\[\forall i\in S,\quad c_{q}(i,P)>6\cdot c_{q}(i,P^{\prime}).\tag{1}\]

Let $T_{1},\ldots,T_{m}$ be a partition of $S$ with respect to $P^{\prime}$, as given in the first part of Lemma 1. Since $m\leq\lfloor|P^{\prime}|/q\rfloor$ and $|S|\geq|P^{\prime}|\cdot n/k$, we conclude that there exists a part, say $T_{\ell}$, that has size at least $q\cdot n/k$. From Lemma 1, we know that there exists $i^{*}_{\ell}\in T_{\ell}$ such that for each $i\in T_{\ell}$ it holds that $c_{q}(i,P^{\prime})\leq c_{q}(i^{*}_{\ell},P^{\prime})$ and $\operatorname{\mathsf{top}}_{q}(i,P^{\prime})\cap\operatorname{\mathsf{top}}_{q}(i^{*}_{\ell},P^{\prime})\neq\emptyset$. Therefore, we can conclude that for each $i\in T_{\ell}$, $d(i^{*}_{\ell},i)\leq 2\cdot c_{q}(i^{*}_{\ell},P^{\prime})$, as follows: pick an arbitrary representative in $\operatorname{\mathsf{top}}_{q}(i,P^{\prime})\cap\operatorname{\mathsf{top}}_{q}(i^{*}_{\ell},P^{\prime})$ and denote it by $r_{i}$. Then,

\[d(i,i^{*}_{\ell})\leq d(i,r_{i})+d(r_{i},i^{*}_{\ell})\leq c_{q}(i,P^{\prime})+c_{q}(i^{*}_{\ell},P^{\prime})\leq 2\cdot c_{q}(i^{*}_{\ell},P^{\prime}).\]

This implies that the ball centered at $i^{*}_{\ell}$ with a radius of $2\cdot c_{q}(i^{*}_{\ell},P^{\prime})$ captures all individuals in $T_{\ell}$.

Now, consider all the balls that $\textsc{FairGreedyCapture}_{k}$ opens that contain individuals from $T_{\ell}$. Since $|T_{\ell}|\geq q\cdot n/k$ and each ball is assigned a total fraction of $1$, there are at least $q$ such balls. Next, we claim that at least $q$ of them have radius at most $2\cdot c_{q}(i^{*}_{\ell},P^{\prime})$. Suppose for contradiction that at most $q-1$ of them have radius at most $2\cdot c_{q}(i^{*}_{\ell},P^{\prime})$. This means that a total fraction of at least $1$ from the individuals in $T_{\ell}$ is assigned to balls with radius strictly larger than $2\cdot c_{q}(i^{*}_{\ell},P^{\prime})$. However, the ball centered at $i^{*}_{\ell}$ with radius $2\cdot c_{q}(i^{*}_{\ell},P^{\prime})$ would have captured this fraction, and therefore we reach a contradiction.

Next, denote by $B_{1},\ldots,B_{q}$, $q$ balls that are opened, each of which contains individuals from $T_{\ell}$ and has radius at most $2\cdot c_{q}(i^{*}_{\ell},P^{\prime})$. By the definition of $\textsc{FairGreedyCapture}_{k}$, each panel that is returned contains at least one representative from each ball. Therefore, each ball $B_{j}$ contains at least one representative, denoted by $r_{j}$. Now, since each $B_{j}$ contains at least one individual from $T_{\ell}$, denoted by $i_{j}$, we have that

\[\forall j\in[q],\quad d(i^{*}_{\ell},r_{j})\leq d(i^{*}_{\ell},i_{j})+d(i_{j},r_{j})\leq 6\cdot c_{q}(i^{*}_{\ell},P^{\prime}),\]

where the first inequality follows from the triangle inequality and the last inequality follows from the facts that for each $i\in T_{\ell}$, $d(i^{*}_{\ell},i)\leq 2\cdot c_{q}(i^{*}_{\ell},P^{\prime})$, and that each $B_{j}$ has radius at most $2\cdot c_{q}(i^{*}_{\ell},P^{\prime})$ and both $i_{j}$ and $r_{j}$ belong to this ball. Therefore, there are at least $q$ representatives in $P$ that have distance at most $6\cdot c_{q}(i^{*}_{\ell},P^{\prime})$ from $i^{*}_{\ell}$. But then, $c_{q}(i^{*}_{\ell},P)\leq 6\cdot c_{q}(i^{*}_{\ell},P^{\prime})$, which contradicts Equation 1. ∎

As we discussed above, the ex post $\alpha$-$q$-core implies the ex ante $\alpha$-$q$-core, which means that $\textsc{FairGreedyCapture}_{k}$ is also in the ex ante $6$-$q$-core for all $q\in[k]$. In the next section, we show that no fair algorithm provides an approximation better than $2$ to the ex ante $q$-core, for any $q$. Therefore, we get that no fair selection algorithm provides an approximation better than $2$ to the ex post $q$-core either. This means that $\textsc{FairGreedyCapture}_{k}$ is optimal up to a factor of $3$.

3.1 Ex Post Core and Quotas over Features

In our introduction, we discussed a common approach used to ensure proportional representation, which involves setting quotas based on individual features or groups of features. For instance, a quota might mandate that at least $45\%$ of representatives are female. While the concept of the core aims to achieve proportional representation across intersecting features, it may not guarantee the same across individual features. For instance, a panel comprising entirely men could still meet the core criteria, even if the overall population is 50% women. This raises the question of whether it is possible to achieve both types of representation to the degree that this is possible. We argue that this is feasible and show how the core requirement can be translated into a set of quotas.

As shown above, $\textsc{FairGreedyCapture}_{k}$ generates $k$ balls, with each individual assigned to one or more balls. The key condition for achieving an ex post $O(1)$-$q$-core is to have at least one representative from each ball. This condition can be transformed into quotas by introducing an additional feature, $b_{i}$, for each individual $i$, indicating the balls they belong to. Thus, $b_{i}$ can take values in $2^{[k]}\setminus\{\emptyset\}$, where $2^{[k]}$ represents the power set of $[k]$. We can then set quotas that require the panel to contain, for each $j\in[k]$, at least one representative $i$ that belongs to ball $j$, i.e., $j\in b_{i}$. In other words, we can think of each ball as a subpopulation from which we want to draw a representative. We can then utilize the methods proposed by Flanigan et al. (2021a) to identify panels that meet these quotas, along with others as much as possible, while maximizing fairness. We also note that this translation allows for sampling from a biased pool of representatives using the algorithm of the aforementioned paper, as long as the characteristics of the global population are known and the balls can be constructed based on them.
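As a small illustration of this translation, the derived feature and the resulting quota check could look as follows, where X is the $(k/n)$-fractional allocation computed above and the helper names are hypothetical.

    def ball_membership(X, eps=1e-12):
        # b_i: the set of balls j to which individual i is (fractionally) assigned
        k, n = X.shape
        return [{j for j in range(k) if X[j, i] > eps} for i in range(n)]

    def satisfies_ball_quotas(panel, b, k):
        # derived quotas: for every ball j, the panel has some member i with j in b_i
        return all(any(j in b[i] for i in panel) for j in range(k))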

4 Uniform Selection and Ex Ante Core

We have already discussed that uniform selection fails to provide any reasonable approximation to the ex post $q$-core for almost all values of $q$. However, as we mentioned in the introduction, it seems to satisfy the ex ante $q$-core, at least when $k$ is very large. In this section, we ask, in a rigorous way, whether uniform selection indeed satisfies a constant approximation of the ex ante $q$-core for all values of $q$ and $k$. We show that uniform selection is in the ex ante $4$-$q$-core for every $q$. (In fact, for $q=k$, uniform selection is in the ex ante $k$-core; see Appendix C. The main reason is that, for $q=k$, it suffices to show that the grand coalition does not deviate ex ante. Since each panel is selected with non-zero probability, the marginal probability of deviation is strictly less than one, and the ex ante $k$-core is satisfied.)

To show this result, we use the following form of the Chu–Vandermonde identity, which we prove in Appendix D for completeness.

Lemma 3 (Chu–Vandermonde identity).

For any $n,k,$ and $r$, with $0\leq r\leq k\leq n$, it holds that

\[\sum_{j=0}^{n}\binom{j}{r}\cdot\binom{n-j}{k-r}=\binom{n+1}{k+1}.\]
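The identity is easy to sanity-check numerically; the short snippet below verifies it for small parameter values using Python's math.comb.

    from math import comb

    def chu_vandermonde_holds(n, k, r):
        # Lemma 3: sum_{j=0}^{n} C(j, r) * C(n - j, k - r) == C(n + 1, k + 1)
        lhs = sum(comb(j, r) * comb(n - j, k - r) for j in range(n + 1))
        return lhs == comb(n + 1, k + 1)

    assert all(chu_vandermonde_holds(n, k, r)
               for n in range(1, 12) for k in range(n + 1) for r in range(k + 1))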

Now, we are ready to prove the following theorem.

Theorem 4.

For any $q$, uniform selection is in the ex ante $4$-$q$-core, i.e., for any panel $P^{\prime}$,

\[\mathbb{E}_{P\sim\mathcal{U}_{k}}\left[V_{q}(P,P^{\prime},4)\right]<|P^{\prime}|\cdot\frac{n}{k}.\]
Proof.

Let $P^{\prime}$ be any panel. By linearity of expectation, we have that

\[\mathbb{E}_{P\sim\mathcal{U}_{k}}\left[V_{q}(P,P^{\prime},4)\right]=\sum_{i\in[n]}\Pr\nolimits_{P\sim\mathcal{U}_{k}}\left[c_{q}(i,P)>4\cdot c_{q}(i,P^{\prime})\right].\]

Let $T_{1},\ldots,T_{m}$ be a partition of $[n]$ with respect to $P^{\prime}$, as given in the second part of Lemma 1. For each $\ell\in[m]$, we reorder the individuals in $T_{\ell}$ in increasing order of their distance from $i^{*}_{\ell}$, and relabel them as $i^{\ell}_{1},\ldots,i^{\ell}_{|T_{\ell}|}$. This way, $i^{\ell}_{1}$ and $i^{\ell}_{|T_{\ell}|}$ are the individuals in $T_{\ell}$ that have the smallest and the largest distance from $i^{*}_{\ell}$, respectively. Then, we get that

\[\sum_{i\in[n]}\Pr\nolimits_{P\sim\mathcal{U}_{k}}\left[c_{q}(i,P)>4\cdot c_{q}(i,P^{\prime})\right]=\sum_{\ell=1}^{m}\sum_{j=1}^{|T_{\ell}|}\Pr\nolimits_{P\sim\mathcal{U}_{k}}\left[c_{q}(i_{j}^{\ell},P)>4\cdot c_{q}(i_{j}^{\ell},P^{\prime})\right].\tag{2}\]

In the next lemma, we bound $\Pr_{P\sim\mathcal{U}_{k}}\left[c_{q}(i_{j}^{\ell},P)>4\cdot c_{q}(i^{\ell}_{j},P^{\prime})\right]$ for each $i_{j}^{\ell}$.

Lemma 4.

For each $\ell\in[m]$ and $j\in[|T_{\ell}|]$,

\[\Pr\nolimits_{P\sim\mathcal{U}_{k}}\left[c_{q}(i_{j}^{\ell},P)>4\cdot c_{q}(i^{\ell}_{j},P^{\prime})\right]\leq\sum_{r=0}^{q-1}\frac{1}{\binom{n}{k}}\cdot\binom{j}{r}\binom{n-j}{k-r}.\]
Proof.
[Figure 1: Diagram for the proof of Lemma 4.]

For each $i_{j}^{\ell}$, let $r^{\ell}_{j}$ be an arbitrary representative in $\operatorname{\mathsf{top}}_{q}(i^{\ell}_{j},P^{\prime})\cap\operatorname{\mathsf{top}}_{q}(i^{*}_{\ell},P^{\prime})$. Then, we get that

\[d(i_{j}^{\ell},i^{*}_{\ell})\leq d(i_{j}^{\ell},r^{\ell}_{j})+d(r^{\ell}_{j},i^{*}_{\ell})\leq c_{q}(i^{\ell}_{j},P^{\prime})+c_{q}(i^{*}_{\ell},P^{\prime})\leq 2\cdot c_{q}(i^{\ell}_{j},P^{\prime}),\tag{3}\]

where the last inequality follows from the fact that $i^{*}_{\ell}$ has the smallest cost over $P^{\prime}$ among all the individuals in $T_{\ell}$. Now, consider the ball that is centered at $i_{j}^{\ell}$ and has radius $4\cdot c_{q}(i^{\ell}_{j},P^{\prime})$. Note that this ball contains any individual $i^{\ell}_{j^{\prime}}$ with $j^{\prime}<j$. Indeed, for each $i^{\ell}_{j}$ and $i^{\ell}_{j^{\prime}}$ with $j^{\prime}<j$, we have that

\[d(i^{\ell}_{j},i^{\ell}_{j^{\prime}})\leq d(i^{\ell}_{j},i^{*}_{\ell})+d(i^{*}_{\ell},i^{\ell}_{j^{\prime}})\leq 2\cdot d(i^{\ell}_{j},i^{*}_{\ell})\leq 4\cdot c_{q}(i^{\ell}_{j},P^{\prime}),\]

where the second inequality follows from the fact that for each $j^{\prime},j\in[|T_{\ell}|]$ with $j^{\prime}<j$, $d(i^{\ell}_{j^{\prime}},i^{*}_{\ell})\leq d(i^{\ell}_{j},i^{*}_{\ell})$, and the last inequality follows from Equation 3. This argument is illustrated in Figure 1.

If $c_{q}(i_{j}^{\ell},P)>4\cdot c_{q}(i^{\ell}_{j},P^{\prime})$, then $|P\cap\{i^{\ell}_{1},\ldots,i^{\ell}_{j}\}|<q$, as otherwise there would exist at least $q$ members of $P$ in $B(i^{\ell}_{j},4\cdot c_{q}(i^{\ell}_{j},P^{\prime}))$, and $c_{q}(i_{j}^{\ell},P)$ would be at most $4\cdot c_{q}(i^{\ell}_{j},P^{\prime})$. Hence, we have that

\begin{align*}
\Pr_{P\sim\mathcal{U}_{k}}[c_{q}(i_{j}^{\ell},P)>4\cdot c_{q}(i^{\ell}_{j},P^{\prime})] &\leq\Pr_{P\sim\mathcal{U}_{k}}\left[\lvert P\cap\{i^{\ell}_{1},\ldots,i^{\ell}_{j}\}\rvert<q\right]\\
&=\Pr_{P\sim\mathcal{U}_{k}}\left[\bigcup_{r=0}^{q-1}\lvert P\cap\{i^{\ell}_{1},\ldots,i^{\ell}_{j}\}\rvert=r\right]\\
&\leq\sum_{r=0}^{q-1}\Pr_{P\sim\mathcal{U}_{k}}\left[\lvert P\cap\{i^{\ell}_{1},\ldots,i^{\ell}_{j}\}\rvert=r\right]=\sum_{r=0}^{q-1}\frac{1}{\binom{n}{k}}\cdot\binom{j}{r}\binom{n-j}{k-r},
\end{align*}

where the second inequality follows from the union bound and the last equality follows from the fact that uniform selection chooses $k$ out of $n$ individuals uniformly at random. ∎

Then, returning to Equation 2, we get that

\begin{align*}
\mathbb{E}_{P\sim\mathcal{U}_{k}}[V_{q}(P,P^{\prime},4)] &=\sum_{\ell=1}^{m}\sum_{j=1}^{|T_{\ell}|}\Pr\nolimits_{P\sim\mathcal{U}_{k}}\left[c_{q}(i_{j}^{\ell},P)>4\cdot c_{q}(i_{j}^{\ell},P^{\prime})\right]\\
&\leq\frac{1}{\binom{n}{k}}\cdot\sum_{\ell=1}^{m}\sum_{j=1}^{|T_{\ell}|}\sum_{r=0}^{q-1}\binom{j}{r}\binom{n-j}{k-r} && \text{(by Lemma 4)}\\
&=\frac{1}{\binom{n}{k}}\cdot\sum_{\ell=1}^{m}\sum_{r=0}^{q-1}\sum_{j=1}^{|T_{\ell}|}\binom{j}{r}\binom{n-j}{k-r} && \text{(swap summations)}\\
&\leq\frac{1}{\binom{n}{k}}\cdot\sum_{\ell=1}^{m}\sum_{r=0}^{q-1}\sum_{j=0}^{n}\binom{j}{r}\binom{n-j}{k-r} && (|T_{\ell}|\leq n)\\
&=\frac{1}{\binom{n}{k}}\cdot\sum_{\ell=1}^{m}\sum_{r=0}^{q-1}\binom{n+1}{k+1} && \text{(by Lemma 3)}\\
&=\sum_{\ell=1}^{m}\sum_{r=0}^{q-1}\frac{n+1}{k+1}=m\cdot q\cdot\frac{n+1}{k+1}<|P^{\prime}|\cdot\frac{n}{k},
\end{align*}

where the last inequality follows from the facts that $m\leq\lfloor|P^{\prime}|/q\rfloor$ and $\frac{n+1}{k+1}<\frac{n}{k}$ for $k<n$. ∎

In the next theorem, we show that for any $q<k$, no fair selection algorithm is guaranteed to achieve the ex ante $\alpha$-$q$-core with $\alpha<2$, and hence uniform selection is optimal up to a factor of $2$.

Theorem 5.

For any $q\in[k-1]$, when $n\geq 2k^{2}/(k-q)$, there exists an instance such that no fair selection algorithm is in the ex ante $\alpha$-$q$-core with $\alpha<2$.

Proof.

Consider a star graph with $n-q$ leaves and an internal node. Suppose $q$ individuals $I=\{i_{1},\ldots,i_{q}\}$ lie on the internal node, and exactly one individual lies on each of the $n-q$ leaves. Individuals in $I$ have a distance of $0$ from each other and a distance of $1$ from $[n]\setminus I$; the distance between a pair of individuals from $[n]\setminus I$ is equal to $2$. These distances satisfy the triangle inequality.

Let PP be an arbitrary panel of size kk that does not contain i1i_{1}. We show that for P=IP^{\prime}=I and every α<2\alpha<2, we have that Vq(P,P,α)nk.V_{q}(P,P^{\prime},\alpha)\geq n-k. For any iIi\in I, it holds that cq(i,P)=1c_{q}(i,P)=1 and cq(i,P)=0c_{q}(i,P^{\prime})=0, which is an unbounded improvement. For any individual ii in [n](IP)[n]\setminus(I\cup P), cq(i,P)=2c_{q}(i,P)=2, since PP contains at most q1q-1 individuals of II and hence the qqth closest representative of ii in PP lies on another leaf, while cq(i,P)=1c_{q}(i,P^{\prime})=1, which yields a factor-22 improvement. Therefore, Vq(P,P,α)|([n](IP))I|n|P|=nkV_{q}(P,P^{\prime},\alpha)\geq|([n]\setminus(I\cup P))\cup I|\geq n-|P|=n-k, for every α<2\alpha<2.

Let 𝒜k\mathcal{A}_{k} be any fair selection algorithm. Under 𝒜k\mathcal{A}_{k}, i1i_{1} is not included in the panel with probability 1k/n1-\nicefrac{{k}}{{n}}. Thus, we have that

𝔼P𝒜k[Vq(P,P,α)]PrP𝒜k[i1P](nk)=(1k/n)(nk)qn/k=|P|n/k,\displaystyle\mathbb{E}_{P\sim\mathcal{A}_{k}}[V_{q}(P,P^{\prime},\alpha)]\geq\Pr_{P\sim\mathcal{A}_{k}}[i_{1}\notin P]\cdot(n-k)=(1-k/n)\cdot(n-k)\geq q\cdot n/k=|P^{\prime}|\cdot n/k,

where the last inequality follows from the assumption that n2k2/(kq)n\geq 2k^{2}/(k-q). ∎
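As a sanity check of the construction above, the following Python sketch (our own illustration with hypothetical helper names) builds the star metric for concrete values satisfying n ≥ 2k²/(k−q), fixes a panel that excludes individual 0, and counts how many individuals improve by a factor larger than an α slightly below 2 when deviating to P′ = I.

def star_distance(a, b, q):
    # Individuals 0,...,q-1 lie on the internal node; the rest on distinct leaves.
    if a == b or (a < q and b < q):
        return 0
    if a < q or b < q:
        return 1
    return 2

def q_cost(i, panel, q):
    return sorted(star_distance(i, j, q) for j in panel)[q - 1]

n, k, q = 20, 4, 2                      # satisfies n >= 2*k**2 / (k - q) = 16
panel = list(range(q, q + k))           # a size-k panel that excludes individual 0
P_prime = list(range(q))                # P' = I, the q individuals on the internal node
alpha = 1.99                            # stands for any alpha < 2
improved = [i for i in range(n)
            if q_cost(i, panel, q) > alpha * q_cost(i, P_prime, q)]
print(len(improved), ">=", n - k)       # the deviating coalition has size at least n - k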

5 Auditing Ex Post Core

Input: PP, [n][n], dd, kk, qq
Output: α^\hat{\alpha}
for j[n]j\in[n] do
       P^j\hat{P}_{j}\leftarrow {j}q1\{j\}\cup q-1 closest neighbors of jj;
       α^j\hat{\alpha}_{j}\leftarrow the qn/k\left\lceil q\cdot n/k\right\rceil-th largest value among {cq(i,P)/cq(i,P^j)}i[n]\{c_{q}(i,P)/c_{q}(i,\hat{P}_{j})\}_{i\in[n]};
      
end for
return α^maxj[n]α^j\hat{\alpha}\leftarrow\max_{j\in[n]}\hat{\alpha}_{j}
ALGORITHM 2 Auditing Algorithm

In this section, we turn our attention to the following question: Given a panel PP, how much does it violate the qq-core, i.e. what is the maximum value of α\alpha such that there exists a panel PP^{\prime} with Vq(P,P,α)|P|n/kV_{q}(P,P^{\prime},\alpha)\geq|P^{\prime}|\cdot n/k? This auditing question can be very useful in practice for measuring the proportional representation of a panel formed using a method that does not guarantee any panel to be in the approximate core, such as uniform selection.

Chen et al. (2019) ask the same question for the case where the cost of an individual for a panel is equal to her distance from her closest representative in the panel, i.e. when q=1q=1. In this case, it suffices to restrict our attention to panels of size 11, which are subsets of the population that individuals may prefer to be represented by. In other words, given a panel PP, we can simply consider every individual as a potential representative and check if a sufficiently large subset of the population prefers this individual to be their representative over PP. Thus, we can find the maximum α\alpha such that there exists PP^{\prime} with Vq(P,P,α)n/kV_{q}(P,P^{\prime},\alpha)\geq n/k as follows: For each j[n]j\in[n], calculate αj\alpha_{j}, which is equal to the n/k\left\lceil n/k\right\rceil-th largest value among the set {cq(i,P)/cq(i,{j})}i[n]\{c_{q}(i,P)/c_{q}(i,\{j\})\}_{i\in[n]} of qq-cost ratios of PP to {j}\{j\}. Then, α\alpha is equal to the maximum value among all αj\alpha_{j}’s.

For q>1q>1, this question is more challenging. We show that the maximum α\alpha can be approximated by generalizing the above procedure as follows: For each j[n]j\in[n], let P^j\hat{P}_{j} be the panel that contains jj and its q1q-1 closest neighbors. Then, calculate α^j\hat{\alpha}_{j} as the qn/k\left\lceil q\cdot n/k\right\rceil-th largest value among the set {cq(i,P)/cq(i,P^j)}i[n]\{c_{q}(i,P)/c_{q}(i,\hat{P}_{j})\}_{i\in[n]}. Finally, we return the maximum value among all α^j\hat{\alpha}_{j}’s as α^\hat{\alpha}. Algorithm 2 executes this procedure. We show that the maximum α\alpha such that there exists a panel PP^{\prime} with Vq(P,P,α)|P|n/kV_{q}(P,P^{\prime},\alpha)\geq|P^{\prime}|\cdot n/k is at most 3α^+23\cdot\hat{\alpha}+2.
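The following Python sketch (our own, with hypothetical names; it is not the paper's implementation) mirrors Algorithm 2, taking the distances as an n×n matrix d and returning α̂.

import math

def q_cost(i, panel, d, q):
    # q-th smallest distance from individual i to the members of panel.
    return sorted(d[i][j] for j in panel)[q - 1]

def audit_core_violation(panel, d, k, q):
    # Sketch of Algorithm 2; by Theorem 6 the true maximum core violation
    # alpha satisfies alpha_hat <= alpha <= 3 * alpha_hat + 2.
    n = len(d)
    rank = math.ceil(q * n / k)          # take the ceil(q*n/k)-th largest ratio
    alpha_hat = 0.0
    for j in range(n):
        # P_hat_j: individual j together with its q - 1 closest neighbours.
        p_hat = sorted(range(n), key=lambda i: d[j][i])[:q]
        ratios = []
        for i in range(n):
            c_panel, c_hat = q_cost(i, panel, d, q), q_cost(i, p_hat, d, q)
            if c_hat == 0:
                # Unbounded improvement if the panel's cost is positive;
                # treat 0/0 as no improvement.
                ratios.append(math.inf if c_panel > 0 else 0.0)
            else:
                ratios.append(c_panel / c_hat)
        ratios.sort(reverse=True)
        alpha_hat = max(alpha_hat, ratios[rank - 1])
    return alpha_hat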

Theorem 6.

There exists an efficient algorithm that, for every panel PP and q[k]q\in[k], returns a value α^\hat{\alpha} satisfying α^α3α^+2\hat{\alpha}\leq\alpha\leq 3\hat{\alpha}+2, where α\alpha is the maximum amount of qq-core violation of PP.

Proof.

Suppose for contradiction that while the algorithm returns α^\hat{\alpha}, there exists S[n]S\subseteq[n] and P[n]P^{\prime}\subseteq[n], with |S||P|n/k|S|\geq|P^{\prime}|\cdot n/k, such that

iS,cq(i,P)>(3α^+2)cq(i,P).\displaystyle\forall i\in S,\quad\quad c_{q}(i,P)>(3\cdot\hat{\alpha}+2)\cdot c_{q}(i,P^{\prime}).

First, note that if the algorithm outputs α^\hat{\alpha}, this means that for every α>α^\alpha^{\prime}>\hat{\alpha} and j[n]j\in[n], it holds that

Vq(P,P^j,α)<|P^j|n/k,V_{q}(P,\hat{P}_{j},\alpha^{\prime})<|\hat{P}_{j}|\cdot n/k, (4)

as otherwise the algorithm would output a value strictly larger than α^\hat{\alpha}.

Let T1,,TmT_{1},\ldots,T_{m} be a partition of SS with respect to PP^{\prime}, as given in the first part of Lemma 1. Since m|P|/qm\leq\left\lfloor|P^{\prime}|/q\right\rfloor and |S||P|n/k|S|\geq|P^{\prime}|\cdot n/k, we conclude that there exists a part, say TT_{\ell}, that has size at least qn/kq\cdot n/k. Moreover, since there exists iTi^{*}_{\ell}\in T_{\ell} such that for each iTi\in T_{\ell}, it holds that cq(i,P)cq(i,P)c_{q}(i,P^{\prime})\leq c_{q}(i^{*}_{\ell},P^{\prime}) and 𝗍𝗈𝗉q(i,P)𝗍𝗈𝗉q(i,P)\operatorname{\mathsf{top}}_{q}(i,P^{\prime})\cap\operatorname{\mathsf{top}}_{q}(i^{*}_{\ell},P^{\prime})\neq\emptyset, we can conclude that d(i,i)2cq(i,P)d(i,i^{*}_{\ell})\leq 2\cdot c_{q}(i^{*}_{\ell},P^{\prime}), by considering a representative in 𝗍𝗈𝗉q(i,P)𝗍𝗈𝗉q(i,P)\operatorname{\mathsf{top}}_{q}(i,P^{\prime})\cap\operatorname{\mathsf{top}}_{q}(i^{*}_{\ell},P^{\prime}) and applying the triangle inequality, i.e. d(i,i)d(i,ri)+d(ri,i)2cq(i,P)d(i,i^{*}_{\ell})\leq d(i,r_{i})+d(r_{i},i^{*}_{\ell})\leq 2\cdot c_{q}(i^{*}_{\ell},P^{\prime}), where ri𝗍𝗈𝗉q(i,P)𝗍𝗈𝗉q(i,P)r_{i}\in\operatorname{\mathsf{top}}_{q}(i,P^{\prime})\cap\operatorname{\mathsf{top}}_{q}(i^{*}_{\ell},P^{\prime}). This means there exists a ball centered at ii^{*}_{\ell} that has radius 2cq(i,P)2\cdot c_{q}(i^{*}_{\ell},P^{\prime}) and captures all the individuals in TT_{\ell}. Now, note that there exists iTi^{\prime}\in T_{\ell} such that αcq(i,P^i)>cq(i,P)\alpha^{\prime}\cdot c_{q}(i^{\prime},\hat{P}_{i^{*}_{\ell}})>c_{q}(i^{\prime},P), since otherwise, for each iTi\in T_{\ell}, it would hold that αcq(i,P^i)cq(i,P)\alpha^{\prime}\cdot c_{q}(i,\hat{P}_{i^{*}_{\ell}})\leq c_{q}(i,P) and then Vq(P,P^i,α)qn/k=|P^i|n/kV_{q}(P,\hat{P}_{i^{*}_{\ell}},\alpha^{\prime})\geq q\cdot n/k=|\hat{P}_{i^{*}_{\ell}}|\cdot n/k, which contradicts Equation 4. Hence,

cq(i,P)d(i,i)+cq(i,P)<d(i,i)+αcq(i,P^i)\displaystyle c_{q}(i^{*}_{\ell},P)\leq d(i^{*}_{\ell},i^{\prime})+c_{q}(i^{\prime},P)<d(i^{*}_{\ell},i^{\prime})+\alpha^{\prime}\cdot c_{q}(i^{\prime},\hat{P}_{i^{*}_{\ell}}) d(i,i)+α(d(i,i)+cq(i,P^i))\displaystyle\leq d(i^{*}_{\ell},i^{\prime})+\alpha^{\prime}\cdot(d(i^{*}_{\ell},i^{\prime})+c_{q}(i^{*}_{\ell},\hat{P}_{i^{*}_{\ell}}))
(3α+2)cq(i,P).\displaystyle\leq(3\cdot\alpha^{\prime}+2)\cdot c_{q}(i^{*}_{\ell},P^{\prime}).

where the first and the third inequalities follow from Lemma 2, and the last inequality follows from the facts that for each iTi\in T_{\ell}, d(i,i)2cq(i,P)d(i,i^{*}_{\ell})\leq 2\cdot c_{q}(i^{*}_{\ell},P^{\prime}), and that cq(i,P^i)cq(i,P)c_{q}(i^{*}_{\ell},\hat{P}_{i^{*}_{\ell}})\leq c_{q}(i^{*}_{\ell},P^{\prime}) since P^i\hat{P}_{i^{*}_{\ell}} consists of ii^{*}_{\ell} and its q1q-1 closest neighbors. Therefore, cq(i,P)(3α^+2)cq(i,P)c_{q}(i^{*}_{\ell},P)\leq(3\cdot\hat{\alpha}+2)\cdot c_{q}(i^{*}_{\ell},P^{\prime}), which contradicts our initial assumption, and the theorem follows. ∎

6 Experiments

In previous sections, we examined uniform selection from a worst-case perspective and found that it cannot guarantee panels in the core for any bounded approximation ratio. But what about the average case? How much better is FairGreedyCapture than uniform selection in terms of their approximations to the ex post core in the average case? In this section, we aim to address these questions through empirical evaluations of both algorithms using real-world datasets.

6.1 Datasets

In accordance with the methodology proposed by Ebadian et al. (2022), we utilize the same two datasets used by the authors as a proxy for constructing the underlying metric space. These datasets capture various characteristics of populations across multiple observable features. It is reasonable to assume that individuals feel closer to others who share similar characteristics. Therefore, we construct a random metric space using these datasets.

Adult.

The first is the Adult dataset, extracted from the 1994 Current Population Survey by the US Census Bureau and available on the UCI Machine Learning Repository under a CC BY 4.0 license Kohavi and Becker (1996); Dua and Graff (2017). Our analysis focuses on five demographic features: sex, race, workclass, marital.status, and education.num. The dataset comprises 32,56132{,}561 data points, each with a sample weight attribute (fnlwgt). We identify 15131513 unique data points based on these features and treat the sum of the weights associated with each unique point as a distribution across them.

ESS.

The second dataset we analyze is the European Social Survey (ESS), available under a CC BY 4.0 license Report. (2021). Conducted biennially in Europe since 2001, the survey covers attitudes towards politics and society, social values, and well-being. We used the ESS Round 9 (2018) dataset, which has 46,27646{,}276 data points and 14511451 features across 28 countries. On average, each country has around 250250 features (after removing non-demographic and country-unrelated data), with country-specific data points ranging from 781781 to 27452745. Each ESS data point has a post-stratification weight (pspwght), which we use to represent the distribution of the data points. Our analysis focuses on the ESS data for the United Kingdom (ESS-UK), which includes 2204 data points.

6.2 Representation Metric Construction

In line with the work of Ebadian et al. (2022), we apply the same approach to generate synthetic metric preferences, which are used to measure the dissimilarity between individuals based on their feature values. Our datasets consist of two types of features: categorical features (e.g. sex, race, and marital status) and continuous features (e.g. income). We define the distance between individuals ii and jj with respect to feature ff as follows:

d(i,j;f){𝟙[f(i)f(j)], if f is a categorical feature;1maxi,j|f(i)f(j)||f(i)f(j)|, if f is a continuous feature,d(i,j;f)\coloneqq\begin{cases}\mathbbm{1}[f(i)\neq f(j)],&\text{$\quad$ if $f$ is a \emph{categorical} feature;}\\ {\frac{1}{{\max_{i^{\prime},j^{\prime}}|f(i^{\prime})-f(j^{\prime})|}}}\cdot|f(i)-f(j)|,&\text{$\quad$ if $f$ is a \emph{continuous} feature,}\end{cases}

where the normalization factor for continuous features ensures that d(i,j;f)[0,1]d(i,j;f)\in[0,1] for all ii, jj, and ff, and that the distances in different features are comparable. Next, we define the distance between two individuals as the weighted sum of the distances over different features, i.e. d(i,j)=fFwfd(i,j;f),d(i,j)=\displaystyle\sum\nolimits_{{f\in F}}\,w_{f}\cdot d(i,j;f), where the weights wfw_{f}’s are randomly generated. Each unique set of randomly generated feature weights results in a new representation metric.
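A minimal Python sketch of this construction is given below. It is our own illustration: the function names are hypothetical, and the uniform draw of the feature weights is an assumption, since the paper only states that the weights are randomly generated.

import random

def feature_distance(x_i, x_j, is_categorical, value_range):
    # Indicator distance for categorical features; normalized absolute
    # difference for continuous features, so that the result lies in [0, 1].
    if is_categorical:
        return 0.0 if x_i == x_j else 1.0
    return abs(x_i - x_j) / value_range

def representation_metric(data, is_categorical, weights):
    # data: list of feature vectors; returns an n x n distance matrix with
    # d(i, j) = sum_f w_f * d(i, j; f).
    n, num_features = len(data), len(is_categorical)
    ranges = [1.0 if is_categorical[f]
              else (max(row[f] for row in data) - min(row[f] for row in data)) or 1.0
              for f in range(num_features)]
    return [[sum(weights[f] * feature_distance(data[i][f], data[j][f],
                                               is_categorical[f], ranges[f])
                 for f in range(num_features))
             for j in range(n)] for i in range(n)]

num_features = 5  # e.g. the five demographic features we use from Adult
weights = [random.random() for _ in range(num_features)]  # one random instance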

We generate 100100 sets of randomly-assigned feature weights per dataset, calculate a representation metric for each set, and report the performance metrics averaged over 100100 instances. Given that our datasets are samples of a large population (i.e. millions) and represented through a relatively small number of unique data points (i.e. a few thousand), we assume that each data point represents a group of at least kk people, where kk takes a maximum value of 40 in our study. To empirically measure ex post core violation, for each of the 100100 instances, we sample one panel from an algorithm and compute the core violation using Algorithm 2. We note that this is not exactly equal to the worst-case core violation, but, by Theorem 6, it approximates it within a constant factor.
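Putting the pieces together, one trial of this protocol could look like the sketch below (our own; it reuses the hypothetical representation_metric and audit_core_violation helpers from the earlier sketches and, for simplicity, measures Uniform only).

import random

def uniform_trial(data, is_categorical, k, q):
    # One instance: draw fresh feature weights, build the metric, sample one
    # uniform panel, and audit its ex post core violation with Algorithm 2.
    weights = [random.random() for _ in range(len(is_categorical))]
    d = representation_metric(data, is_categorical, weights)
    panel = random.sample(range(len(data)), k)
    return audit_core_violation(panel, d, k, q)

# violations = [uniform_trial(data, is_categorical, k=40, q=1) for _ in range(100)]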

6.3 Results

Results for Ex Post Core Violation.

In the Adult dataset, we observe an unbounded ex post core violation for Uniform when q3q\leq 3. Specifically, for q{1,2,3}q\in\{1,2,3\}, we observed unbounded core violation in 84%84\%, 9%9\%, and 36%36\% of the instances, respectively. This happens because 8.3%{\sim}8.3\% of the population is mapped to a single data point and Uniform sometimes fails to select qq individuals from this group. When q3q\leq 3, we have q/k8.4%\nicefrac{{q}}{{k}}\leq 8.4\%, and this cohesive group is entitled to select at least qq members of the panel from themselves, which results in a qq-cost of 0 for them and an unbounded violation of the core. However, FairGreedyCapture captures this cohesive group and selects at least qq representatives from them. Furthermore, we see significantly higher ex post core violation for Uniform compared to FairGreedyCapture for smaller values of qq (up to 1212) and comparable performance for larger values of qq. This is expected as FairGreedyCapture tends to behave more similarly to Uniform as qq increases, because it selects from fewer yet larger groups (k/q+1\left\lfloor\nicefrac{{k}}{{q}}\right\rfloor+1 groups of size qn/k\nicefrac{{qn}}{{k}}).

We observe a similar pattern in ESS-UK: Uniform obtains worse ex post core violations when qq is smaller and performance similar to FairGreedyCapture for larger values of qq. However, in contrast to Adult, we do not observe unbounded violations for Uniform in ESS-UK. The reason is that ESS-UK consists of around 250 features (compared to the 55 we used from Adult) and any single data point represents at most 0.2%0.2\% of the population. Thus, no group is entitled to choose enough representatives from among themselves to significantly improve their cost or make it 0. The decline in core violation for q=kq=k happens because it measures the minimum improvement in cost over the whole population, which is more demanding than lower values of qq. Lastly, FairGreedyCapture performs consistently for all values of qq and achieves an ex post core violation of less than 1.61.6 and 1.251.25 in Adult and ESS-UK, respectively.

Figure 2: Ex post core violation of FairGreedyCapture and Uniform with k=40k=40. (a) Adult; (b) ESS-UK.

Figure 3: Approximation to the optimal social cost of FairGreedyCapture and Uniform with k=40k=40. (a) Adult; (b) ESS-UK.

Evaluating Approximation to Optimal Social Cost.

As we mentioned in the introduction, Ebadian et al. (2022) use a different approach to measure the representativeness of a panel by considering the social cost (sum of qq-costs) over a panel. In particular, they define the representativeness of an algorithm as the worst-case ratio between the optimal social cost and the (expected) social cost obtained by the algorithm. Ebadian et al. (2022), in their empirical analysis, measure the average approximation to the optimal social cost of an algorithm 𝒜\mathcal{A} over a set of instances \mathcal{I}, defined as 1||IminPi[n]cq(i,P)i[n]cq(i,𝒜(I))\frac{1}{|\mathcal{I}|}\sum\nolimits_{I\in\mathcal{I}}\frac{\min_{P}~{}\sum_{i\in[n]}c_{q}(i,P)}{\sum_{i\in[n]}c_{q}(i,\mathcal{A}(I))}. Since finding the optimal panel is a hard problem and the dataset and panel sizes are large, Ebadian et al. (2022) use a proxy for the minimum social cost, specifically, an implementation of the algorithm of Kumar and Raichel (2013) for the fault-tolerant kk-median problem that achieves a constant factor approximation of the optimal objective — which is equivalent to minimizing the qq-social cost. We use the same approach and report the average approximation to the optimal social cost of FairGreedyCapture and Uniform.
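For concreteness, the quantity averaged in this evaluation can be computed as in the short sketch below (our own names; proxy_optimal_panel stands for the panel produced by the fault-tolerant k-median heuristic used as a proxy for the optimum).

def q_social_cost(panel, d, q):
    # Sum of q-costs of all individuals with respect to the given panel.
    return sum(sorted(d[i][j] for j in panel)[q - 1] for i in range(len(d)))

def approx_to_optimal(panel, proxy_optimal_panel, d, q):
    # Ratio of the (proxy) optimal q-social cost to the panel's q-social cost.
    return q_social_cost(proxy_optimal_panel, d, q) / q_social_cost(panel, d, q)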

Figure 3 shows the performance of the two algorithms with respect to this objective. For ESS-UK, we observe similar behaviour from the two algorithms, while for Adult, FairGreedyCapture outperforms Uniform for q[3]q\in[3], which is again due to FairGreedyCapture capturing the cohesive group. Overall, we observe that FairGreedyCapture matches or exceeds the approximation to the optimal social cost achieved by Uniform, while achieving significantly better empirical core guarantees on the two datasets.

7 Discussion

This work introduces a notion of proportional representation, called the core, within the context of sortition. The core serves as a metric to ensure proportional representation across intersectional features. While uniform selection achieves an ex ante O(1)O(1)-qq-core, it fails to provide a reasonable approximation to the ex post qq-core. To address this, we propose a selection algorithm, FairGreedyCapture, which preserves the positive aspects of uniform selection, i.e. fairness and ex ante O(1)O(1)-qq-core, while also meeting the ex post O(1)O(1)-qq-core requirement. We also highlight that the use of FairGreedyCapture allows the translation of the core requirement into a set of quotas, which can be integrated with another set of quotas to ensure proportional representation across both individual and intersectional features.

It is worth emphasizing that the limitations of uniform selection in satisfying ex post guarantees arise because, with positive probability, it may return panels that are not proportionally representative. In Appendix E, we explore a natural variation where the core property is required to hold over the expected qq-costs of panels chosen by a selection algorithm. We demonstrate that this variation is incomparable with the ex post qq-core and, more importantly, that uniform selection fails to offer any meaningful multiplicative approximation to this variation as well.

There are many directions for future work. First, there are gaps between the lower and upper bounds we provide for both the ex ante and the ex post core. Closing these gaps and investigating whether there are fair selection algorithms that provide better guarantees for the ex ante and/or ex post core is an immediate interesting direction. Moreover, we show that FairGreedyCapture is in the ex post 66-qq-core, but we do not provide lower bounds indicating that this analysis is tight. In fact, in Appendix F, we show that for q=1q=1, FairGreedyCapture is in the ex post ((3+17)/23.57)((3+\sqrt{17})/2\approxeq 3.57)-11-core and that this is tight. Finding tight bounds for the general case is an open question. In addition, in Appendix G, we show that if qq is known for the application at hand and we wish to provide guarantees with respect to the ex post qq-core, a variation of FairGreedyCapture, called Augmented-FairGreedyCapture, provides an approximation of ((5+41)/2)5.70((5+\sqrt{41})/2)\approx 5.70, which is slightly better than the approximation of 66. Exploring whether this is tight as well is another interesting direction. Furthermore, Micha and Shah (2020) show that for q=1q=1, Greedy Capture (Chen et al., 2019) provides better guarantees in Euclidean spaces. So, another interesting question is whether FairGreedyCapture provides better guarantees when the metric dd is induced by standard norms such as L2L^{2}, L1L^{1} and LL^{\infty}.

References

  • Arrow [1990] K. Arrow. Advances in the spatial theory of voting. Cambridge University Press, 1990.
  • Aziz et al. [2017] H. Aziz, M. Brill, V. Conitzer, E. Elkind, R. Freeman, and T. Walsh. Justified representation in approval-based committee voting. Social Choice and Welfare, 48(2):461–485, 2017.
  • Aziz et al. [2023] H. Aziz, B. E Lee, and S. M. Chu. Proportionally representative clustering. arXiv preprint arXiv:2304.13917, 2023.
  • Benadè et al. [2019] G. Benadè, P. Gölz, and A. D. Procaccia. No stratification without representation. In Proceedings of the 20th ACM Conference on Economics and Computation (EC), pages 281–314, 2019.
  • Birkhoff [1946] G. Birkhoff. Three observations on linear algebra. Univ. Nac. Tacuman, Rev. Ser. A, 5:147–151, 1946.
  • Caragiannis et al. [2022] I. Caragiannis, N. Shah, and A. A. Voudouris. The metric distortion of multiwinner voting. Artificial Intelligence, 313:103802, 2022.
  • Chen et al. [2019] X. Chen, B. Fain, L. Lyu, and K. Munagala. Proportionally fair clustering. In International Conference on Machine Learning, pages 1032–1041, 2019.
  • Cheng et al. [2020] Y. Cheng, Z. Jiang, K. Munagala, and K. Wang. Group fairness in committee selection. ACM Transactions on Economics and Computation (TEAC), 8(4):1–18, 2020.
  • Chhabra et al. [2021] A. Chhabra, K. Masalkovaitė, and P. Mohapatra. An overview of fairness in clustering. IEEE Access, 9:130698–130720, 2021.
  • Conitzer et al. [2019] V. Conitzer, R. Freeman, N. Shah, and J. W. Vaughan. Group fairness for the allocation of indivisible goods. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence (AAAI), pages 1853–1860, 2019.
  • Dua and Graff [2017] D. Dua and C. Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
  • Ebadian et al. [2022] S. Ebadian, G. Kehne, E. Micha, A. D. Procaccia, and N. Shah. Is sortition both representative and fair? In Proceedings of the 36th Annual Conference on Neural Information Processing Systems (NeurIPS), pages 25720–25731, 2022.
  • Enelow and Hinich [1984] J. M. Enelow and M. J. Hinich. The spatial theory of voting: An introduction. CUP Archive, 1984.
  • Engelstad [1989] F. Engelstad. The assignment of political office by lot. Social Science Information, 28(1):23–50, 1989.
  • Fain et al. [2018] B. Fain, K. Munagala, and N. Shah. Fair allocation of indivisible public goods. In Proceedings of the 19th ACM Conference on Economics and Computation (EC), pages 575–592, 2018.
  • Faliszewski et al. [2017] P. Faliszewski, P. Skowron, A. Slinko, and N. Talmon. Multiwinner voting: A new challenge for social choice theory. In Trends in Computational Social Choice, pages 27–47, 2017.
  • Flanigan et al. [2020] B. Flanigan, P. Gölz, A. Gupta, and A. D. Procaccia. Neutralizing self-selection bias in sampling for sortition. In Proceedings of the 34th Annual Conference on Neural Information Processing Systems (NeurIPS), pages 6528–6539, 2020.
  • Flanigan et al. [2021a] B. Flanigan, P. Gölz, A. Gupta, B. Hennig, and A. D. Procaccia. Fair algorithms for selecting citizens’ assemblies. Nature, 596:548–552, 2021a.
  • Flanigan et al. [2021b] B. Flanigan, G. Kehne, and A. D. Procaccia. Fair sortition made transparent. In Proceedings of the 35th Annual Conference on Neural Information Processing Systems (NeurIPS), pages 25720–25731, 2021b.
  • Flanigan et al. [2024] B. Flanigan, J. Liang, A. D. Procaccia, and S. Wang. Manipulation-robust selection of citizens’ assemblies. In Proceedings of the 38th AAAI Conference on Artificial Intelligence (AAAI), 2024.
  • Gąsiorowska [2023] A. Gąsiorowska. Sortition and its principles: Evaluation of the selection processes of citizens’ assemblies. Journal of Deliberative Democracy, 19(1), 2023.
  • Gastil and Wright [2019] J. Gastil and E. O. Wright, editors. Legislature by Lot: Transformative Designs for Deliberative Governance. Verso, 2019.
  • Kalayci et al. [2024] Y. H. Kalayci, D. Kempe, and V. Kher. Proportional representation in metric spaces and low-distortion committee selection. In Proceedings of the 38th AAAI Conference on Artificial Intelligence (AAAI), pages 9815–9823, 2024.
  • Kellerhals and Peters [2023] L. Kellerhals and J. Peters. Proportional fairness in clustering: A social choice perspective. arXiv preprint arXiv:2310.18162, 2023.
  • Kohavi and Becker [1996] R. Kohavi and B. Becker. Adult data set. UCI machine learning repository, 5:2093, 1996.
  • Kumar and Raichel [2013] N. Kumar and B. Raichel. Fault tolerant clustering revisited. In Proceedings of the 25th Canadian Conference on Computational Geometry, CCCG 2013, Waterloo, Ontario, Canada, August 8-10, 2013. Carleton University, Ottawa, Canada, 2013.
  • Lackner and Skowron [2023] M. Lackner and P. Skowron. Multi-winner voting with approval preferences. Springer Nature, 9783031090158, 2023.
  • Martin and Carson [1999] B. Martin and L. Carson. Random selection in politics. 1999.
  • Micha and Shah [2020] E. Micha and N. Shah. Proportionally fair clustering revisited. In 47th International Colloquium on Automata, Languages, and Programming (ICALP), pages 85:1–85:16, Saarbrücken, Germany, 2020. Schloss Dagstuhl.
  • Report. [2021] ESS Round 9: European Social Survey (2021): ESS-9 2018 Documentation Report. Edition 3.1. Bergen: European Social Survey Data Archive, NSD - Norwegian Centre for Research Data for ESS ERIC, 2021. doi:10.21338/NSD-ESS9-2018. URL https://www.europeansocialsurvey.org/data/.
  • Van Reybrouck [2016] D. Van Reybrouck. Against Elections: The Case for Democracy. Random House, 2016.
  • Vergne [2018] A. Vergne. Citizens’ participation using sortition: A practical guide to using random selection to guarantee diverse democratic participation. 2018.
  • Yates [1948] F. Yates. Systematic sampling. Philosophical Transactions of the Royal Society of London. Series A, Mathematical and Physical Sciences, 241(834):345–377, 1948.

Appendix A Minimizing Social Cost Fails to Provide Proportional Representation

Example 7.

Let nn be even, k=3k=3 and q=1q=1. Assume that there are four groups of individuals, AA, BB, CC and DD. There is exactly one individual in group AA and exactly one individual in group BB, while there are n22\frac{n-2}{2} individuals in group CC and n22\frac{n-2}{2} individuals in group DD. The distances between individuals in different groups are specified in the following table.

AA BB CC D
AA 0 \infty \infty \infty
BB \infty 0 \infty \infty
CC \infty \infty 0 1010
DD \infty \infty 1010 0

It is not difficult to see that any panel with minimum social cost contains the single individuals in groups AA and BB and one individual from either group CC or group DD, as otherwise the social cost would be unbounded. This means that although group CC forms almost 50%50\% of the population, and so does group DD, in any panel with optimal social cost either group CC or group DD is not represented at all, while the two outlying individuals in groups AA and BB are always part of the panel.

Appendix B Uniform Selection is in the Ex Post 22-kk-Core

Next, we show that when q=kq=k, any panel is in the ex post 22-kk-core, which implies that any algorithm, including uniform selection, is in the ex post 22-kk-core. This is because, in this case, the ex post core is violated only if the grand coalition, i.e. all the agents, has an incentive to deviate.

Theorem 8.

Every panel is in the ex post 22-kk-core. Therefore, uniform selection is in the ex post 22-kk-core, and this is tight.

Proof.

Consider any panel PP. It suffices to show that for any arbitrary panel PP^{\prime} of size kk, there is at least one individual whose kk-cost does not improve by a factor greater than α=2\alpha=2.

Let i1i_{1} and i2i_{2} be the two individuals in the population with the maximum distance between them. Now, consider an arbitrary representative rr in panel PP^{\prime}. Without loss of generality, suppose that ck(i1,P)ck(i2,P)c_{k}(i_{1},P^{\prime})\leq c_{k}(i_{2},P^{\prime}). Then, we have

ck(i2,P)=maxjPd(i2,j)\displaystyle c_{k}(i_{2},P)={\max_{j\in P}}\;d(i_{2},j) d(i1,i2)\displaystyle\leq d(i_{1},i_{2}) (by the choice of i1i_{1} and i2i_{2})
d(i1,r)+d(r,i2)\displaystyle\leq d(i_{1},r)+d(r,i_{2}) (triangle inequality)
ck(i1,P)+ck(i2,P)\displaystyle\leq c_{k}(i_{1},P^{\prime})+c_{k}(i_{2},P^{\prime}) (as rPr\in P^{\prime})
2ck(i2,P).\displaystyle\leq 2\cdot c_{k}(i_{2},P^{\prime}).

This implies Vk(P,P,2)<|P|n/k=nV_{k}(P,P^{\prime},2)<|P^{\prime}|\cdot n/k=n, since the kk-cost of i2i_{2} does not improve by a factor of more than two. From this, we get that any panel PP is in the ex post 22-kk-core, and therefore uniform selection is in the ex post 22-kk-core.

Next, we show that there exists an instance such that uniform selection is not in the ex post α\alpha-kk-core for α<2\alpha<2. Consider the case that the individuals are assigned into three groups, AA, BB and CC, with k/2\left\lfloor k/2\right\rfloor, k/2\left\lceil k/2\right\rceil, and nkn-k individuals, respectively. The distances between individuals are as specified in the following table.

AA BB CC
AA 0 22 11
BB 22 0 11
CC 11 11 0

The panel PP that consists of all the kk people in groups AA and BB is in the support of uniform selection. Then, for iABi\in A\cup B, ck(i,P)=2c_{k}(i,P)=2 as the kk-th closest representative in PP lies in the other group. For iCi\in C, ck(i,P)=1c_{k}(i,P)=1. Now, consider the panel PP^{\prime} that consists of kk individuals from group CC. The kk-costs of all individuals improve by a factor of at least 22. Hence, 𝒰k\mathcal{U}_{k} violates the ex post α\alpha-kk-core for every α<2\alpha<2 in this example. ∎

Appendix C Uniform Selection is in the Ex Ante kk-Core

Proposition 9.

Uniform selection is in the ex ante kk-core.

Proof.

To satisfy the ex ante kk-core, for any panel PP^{\prime} of size kk, we should have

𝔼P𝒟k[Vq(P,P,α)]<|P|nk=n.\displaystyle\mathbb{E}_{P\sim\mathcal{D}_{k}}[V_{q}(P,P^{\prime},\alpha)]<|P^{\prime}|\cdot\frac{n}{k}=n.

Essentially, this means that the ex ante kk-core is violated only if the grand coalition, i.e. all the agents, has an incentive to deviate to PP^{\prime} in expectation. Since Vq(P,P,α)nV_{q}(P,P^{\prime},\alpha)\leq n for all PP^{\prime} by definition, it suffices to show that there exists a panel PP that is chosen with non-zero probability and for which Vq(P,P,α)<nV_{q}(P,P^{\prime},\alpha)<n. Since 𝒰k\mathcal{U}_{k} chooses any panel with non-zero probability, including PP^{\prime}, there is a non-zero probability that the realized panel is P=PP=P^{\prime}, for which Vq(P,P,α)=0V_{q}(P,P^{\prime},\alpha)=0, since no individual's qq-cost strictly improves. Thus, the expected value of Vq(P,P,α)V_{q}(P,P^{\prime},\alpha) under uniform selection is strictly less than nn for any panel PP^{\prime}, satisfying the ex ante kk-core. ∎

Appendix D Proof of Chu–Vandermonde Identity

Proof.

We give a combinatorial argument for this identity. Suppose we want to select k+1k+1 items out of a set of size n+1n+1, whose items we label 11 through n+1n+1. For i[1,n+1]i\in[1,n+1], let PiP_{i} be the number of such subsets in which the (r+1)(r+1)th smallest item is item ii. As each subset is counted exactly once among the PiP_{i}’s (according to its (r+1)(r+1)th smallest item), we have i=1n+1Pi=(n+1k+1)\sum_{i=1}^{n+1}P_{i}=\binom{n+1}{k+1}. Now, we calculate PiP_{i}. Suppose the (r+1)(r+1)th smallest item is ii. Then, rr items should be selected from the first i1i-1 items and k+1(r+1)=krk+1-(r+1)=k-r items should be selected from the last n+1in+1-i items. Therefore, Pi=(i1r)(n(i1)kr)P_{i}=\binom{i-1}{r}\cdot\binom{n-(i-1)}{k-r}. Then, we have

(n+1k+1)=i=1n+1Pi=i=1n+1(i1r)(n(i1)kr)=j=0n(jr)(njkr).\binom{n+1}{k+1}=\sum_{i=1}^{n+1}P_{i}=\sum_{i=1}^{n+1}\binom{i-1}{r}\cdot\binom{n-(i-1)}{k-r}=\sum_{j=0}^{n}\binom{j}{r}\cdot\binom{n-j}{k-r}.\qed
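The identity is also easy to verify numerically; the following one-off Python check (our own) confirms it for all small parameter values.

from math import comb

def identity_holds(n, k, r):
    return sum(comb(j, r) * comb(n - j, k - r) for j in range(n + 1)) == comb(n + 1, k + 1)

assert all(identity_holds(n, k, r)
           for n in range(1, 13) for k in range(n + 1) for r in range(k + 1))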

Appendix E qq-Core over Expected Cost

A variation of the demanding ex post qq-core is to ask the core property to hold with respect to the expected qq-costs, as given in the definition below.

Definition 10 (α\alpha-qq-Core over Expected Cost).

A selection algorithm 𝒜k\mathcal{A}_{k} is in the α\alpha-qq-core over expected cost (or in the qq-core over expected cost, for α=1\alpha=1) if there is no S[n]S\subseteq[n] and a panel PP^{\prime} with |P||S|/nk|P^{\prime}|\leq|S|/n\cdot k such that

iS,𝔼P𝒜k[cq(i,P)]>αcq(i,P).\displaystyle\forall i\in S,\ \mathbb{E}_{P\sim\mathcal{A}_{k}}[c_{q}(i,P)]>\alpha\cdot c_{q}(i,P^{\prime}).

We start by showing that the ex post qq-core and the qq-core over expected cost are incomparable.

Proposition 11.

For any q[k]q\in[k], ex post qq-core and qq-core over expected cost are incomparable.

Proof.

First, we show that the ex post qq-core does not imply the qq-core over expected cost. Assume that nn is divisible by kk and qq is divisible by 33. Consider an instance where there are five groups of individuals, AA, BB, CC, DD and EE. The first three groups contain (qn/kq)/3(q\cdot n/k-q)/3 individuals each, the fourth group contains qq individuals and the last group contains nqn/kn-q\cdot n/k individuals. The table below specifies the distances between individuals in the given groups.

AA BB CC DD EE
AA 0 22 22 11 \infty
BB 22 0 22 11 \infty
CC 22 22 0 11 \infty
DD 11 11 11 0 \infty
EE \infty \infty \infty \infty 0

Suppose that a selection algorithm 𝒜k\mathcal{A}_{k} returns with probability 1/31/3 a panel that contains qq individuals from group AA and the remaining representatives from group EE, with probability 1/31/3 a panel that contains qq individuals from group BB and the remaining representatives from group EE, and with probability 1/31/3 a panel that contains qq individuals from group CC and the remaining representatives from group EE. All these panels are in the ex post qq-core, since there is no sufficiently large group of individuals all of whom would reduce their cost by choosing another panel. Now, we see that for each ii in AA, BB or CC, it holds that

𝔼P𝒜k[cq(i,P)]=232=4/3\displaystyle\mathbb{E}_{P\sim\mathcal{A}_{k}}[c_{q}(i,P)]=\frac{2}{3}\cdot 2=4/3

while for each ii in DD, it holds that

𝔼P𝒜k[cq(i,P)]=1.\displaystyle\mathbb{E}_{P\sim\mathcal{A}_{k}}[c_{q}(i,P)]=1.

If all the individuals in AA, BB, CC and DD, a coalition of size qn/kq\cdot n/k, choose a panel PP^{\prime} that contains qq individuals from DD, then all of them reduce their expected cost by a factor of at least 4/34/3 (the individuals in DD reduce it to 0). Hence, 𝒜k\mathcal{A}_{k} is not in the qq-core over expected cost.

Next, we show that the qq-core over expected cost does not imply the ex post qq-core. Consider an instance where there are four groups of individuals, AA, BB, CC, DD. Group AA contains qn/kqq\cdot n/k-q individuals, group BB contains qq individuals, group CC contains qq individuals and group DD contains all the remaining individuals. The distance between individuals belonging to given groups is specified in the following table.

AA BB CC DD
AA 0 11 22 \infty
BB 11 0 11 \infty
CC 22 11 0 \infty
DD \infty \infty \infty 0

Suppose that a selection algorithm 𝒜k\mathcal{A}_{k} returns with probability 1/21/2 a panel P1P_{1} that contains qq individuals from group AA and kqk-q individuals from group DD, and with the remaining probability returns a panel P2P_{2} that contains qq individuals from group CC and kqk-q individuals from group DD. Then, for each ii in ACA\cup C, we have that

𝔼P𝒜k[cq(i,P)]=122=1\displaystyle\mathbb{E}_{P\sim\mathcal{A}_{k}}[c_{q}(i,P)]=\frac{1}{2}\cdot 2=1

while for each ii in BB, we have that

𝔼P𝒜k[cq(i,P)]=121+121=1.\displaystyle\mathbb{E}_{P\sim\mathcal{A}_{k}}[c_{q}(i,P)]=\frac{1}{2}\cdot 1+\frac{1}{2}\cdot 1=1.

Hence, this algorithm is in the qq-core over expected cost. But when the algorithm returns P2P_{2}, all the individuals in AA and BB can reduce their cost by a factor of at least 22 by choosing qq representatives in BB. ∎
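A concrete numerical check of this second construction is given below (our own sketch with n = 20, k = 10, q = 2; the group encoding is illustrative): the expected q-costs of A, B and C all equal 1, yet once P2 is realized, the coalition A together with B deviates profitably to q representatives from B.

import math

n, k, q = 20, 10, 2
groups = ['A'] * (q * n // k - q) + ['B'] * q + ['C'] * q
groups += ['D'] * (n - len(groups))
pair_dist = {('A', 'A'): 0, ('A', 'B'): 1, ('A', 'C'): 2, ('A', 'D'): math.inf,
             ('B', 'B'): 0, ('B', 'C'): 1, ('B', 'D'): math.inf,
             ('C', 'C'): 0, ('C', 'D'): math.inf, ('D', 'D'): 0}

def d(i, j):
    a, b = sorted((groups[i], groups[j]))
    return pair_dist[(a, b)]

def q_cost(i, panel):
    return sorted(d(i, j) for j in panel)[q - 1]

A = [i for i in range(n) if groups[i] == 'A']
B = [i for i in range(n) if groups[i] == 'B']
C = [i for i in range(n) if groups[i] == 'C']
D = [i for i in range(n) if groups[i] == 'D']
P1, P2 = A[:q] + D[:k - q], C[:q] + D[:k - q]

# Expected q-costs under the half-half lottery over P1 and P2 (all equal 1):
print([0.5 * q_cost(i, P1) + 0.5 * q_cost(i, P2) for i in (A[0], B[0], C[0])])

# Ex post, when P2 is realized, the coalition A u B (of size q*n/k = 4)
# deviates to q representatives from B; every member improves by a factor of
# at least 2 (unboundedly for B, whose new cost is 0):
P_dev = B[:q]
print(all(q_cost(i, P2) >= 2 * q_cost(i, P_dev) for i in A + B))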

Next, we show that as in the case of the ex post qq-core, uniform selection is in the 22-kk-core over expected cost.

Theorem 12.

For q=kq=k, uniform selection is in the 22-kk-core over expected cost.

Proof.

In the proof of Theorem 8, we show that for any panel PP and any panel PP^{\prime}, with |P|=|P|=k|P|=|P^{\prime}|=k, there exists i[n]i\in[n] such that ck(i,P)2ck(i,P)c_{k}(i,P)\leq 2\cdot c_{k}(i,P^{\prime}). Moreover, this individual ii (one of the two maximally distant individuals i1i_{1} and i2i_{2}) depends only on PP^{\prime} and not on PP, so the inequality holds for every panel PP in the support of uniform selection. This implies that 𝔼P𝒰k[ck(i,P)]2ck(i,P),\mathbb{E}_{P\sim\mathcal{U}_{k}}[c_{k}(i,P)]\leq 2\cdot c_{k}(i,P^{\prime}), which means that uniform selection is in the 22-kk-core over expected cost. This is because to violate the 22-kk-core over expected cost, the kk-cost of the entire population would have to improve by a factor of more than 22, which does not hold for individual ii. ∎

Again as in the case of the ex post qq-core, we show that uniform selection does not provide any bounded multiplicative approximation to the qq-core over expected cost, for q[k1]q\in[k-1].

Theorem 13.

For any q[k1]q\in[k-1] and n/kk\left\lfloor\nicefrac{{n}}{{k}}\right\rfloor\geq k, there exists an instance such that uniform selection is not in the α\alpha-qq-core over expected cost, for any bounded α\alpha.

Proof.

Consider the instance given in the proof of Theorem 1. As before, uniform selection may return a panel that consists only of individuals in group AA. Therefore, all the individuals in group BB have positive expected qq-cost under uniform selection, while if they choose a panel among themselves, they would all have a qq-cost of 0. Thus, uniform selection is not in the α\alpha-qq-core over expected cost for any bounded α\alpha. ∎

Lastly, we show that FairGreedyCapturek\textsc{FairGreedyCapture}_{k} is in the 66-qq-core over expected cost, for every qq.

Theorem 14.

For every qq, FairGreedyCapturek\textsc{FairGreedyCapture}_{k} is in the 66-qq-core over expected cost.

Proof.

Let 𝒟k\mathcal{D}_{k} be the distribution over panels induced by FairGreedyCapturek\textsc{FairGreedyCapture}_{k}. Suppose for contradiction that there exists S[n]S\subseteq[n] and P[n]P^{\prime}\subseteq[n], with |S||P|n/k|S|\geq|P^{\prime}|\cdot n/k, such that

iS,𝔼P𝒟k[cq(i,P)]>6cq(i,P).\displaystyle\forall i\in S,\quad\quad\mathbb{E}_{P\sim\mathcal{D}_{k}}[c_{q}(i,P)]>6\cdot c_{q}(i,P^{\prime}).

In the proof of Theorem 3, we show that there exists iSi^{*}_{\ell}\in S such that for every PP in the support of the algorithm, we have that cq(i,P)6cq(i,P)c_{q}(i^{*}_{\ell},P)\leq 6\cdot c_{q}(i^{*}_{\ell},P^{\prime}). This implies that 𝔼P𝒟k[cq(i,P)]6cq(i,P)\mathbb{E}_{P\sim\mathcal{D}_{k}}[c_{q}(i^{*}_{\ell},P)]\leq 6\cdot c_{q}(i^{*}_{\ell},P^{\prime}), which is a contradiction. ∎

Appendix F Analysis of FairGreedyCapturek\textsc{FairGreedyCapture}_{k} for the Ex Post 11-Core

Theorem 15.

FairGreedyCapturek\textsc{FairGreedyCapture}_{k} is in the ex post 3+172\frac{3+\sqrt{17}}{2}-11-core, and there exists an instance for which this bound is tight.

Proof.

Let PP be any panel that the algorithm may return. Suppose for contradiction that there exists a panel PP^{\prime} such that Vq(P,P,(3+17)/2)|P|n/kV_{q}(P,P^{\prime},(3+\sqrt{17})/2)\geq|P^{\prime}|\cdot n/k. This means that there exists S[n]S\subseteq[n], with |S||P|n/k|S|\geq|P^{\prime}|\cdot n/k, such that:

iS,cq(i,P)>(3+17)/2cq(i,P).\displaystyle{\forall i\in S,\quad\quad c_{q}(i,P)>(3+\sqrt{17})/2\cdot c_{q}(i,P^{\prime}).} (5)

If |P|>1|P^{\prime}|>1, we can partition SS into |P||P^{\prime}| groups by assigning each individual to their closest representative from PP^{\prime}, and at least one of these groups should have size at least n/kn/k. Therefore, without loss of generality, we can assume that |P|=1|P^{\prime}|=1 and |S|n/k|S|\geq n/k.

Let P={i}P^{\prime}=\{i^{*}\} and let ii^{\prime} be the individual in SS that has the largest distance from ii^{*}. Since there are sufficiently many individuals in the ball B(i,d(i,i))B(i^{*},d(i^{*},i^{\prime})), the algorithm may have opened this ball during its execution. If this happened, there is at least one representative in PP located within this ball. Then, the distance of ii^{\prime} from her closest representative in PP is at most the diameter of the ball, which is at most 2d(i,i)=2cq(i,P)2\cdot d(i^{\prime},i^{*})=2\cdot c_{q}(i^{\prime},P^{\prime}). Hence, ii^{\prime} cannot reduce her distance by a multiplicative factor larger than 22 by choosing PP^{\prime}, and we reach a contradiction.

On the other hand, if the algorithm did not open this ball during its execution, this means that some of the individuals in SS have been allocated to different balls before the ball centered at ii^{*} captures sufficiently many of them. Hence, some individuals in SS have been captured by a different ball with radius at most d(i,i)=cq(i,P)d(i^{\prime},i^{*})=c_{q}(i^{\prime},P^{\prime}). Suppose that i′′i^{\prime\prime} is the first individual in SS that was captured by such a ball. Then, within this ball there is one representative in PP. Hence cq(i′′,P)2d(i,i)c_{q}(i^{\prime\prime},P)\leq 2\cdot d(i^{\prime},i^{*}), since the distance of i′′i^{\prime\prime} from any other individual in this ball is at most equal to the diameter of the ball. We consider the minimum multiplicative improvement of both ii^{\prime} and i′′i^{\prime\prime}:

min(cq(i,P)cq(i,P),cq(i′′,P)cq(i′′,P))\displaystyle\min\left(\frac{c_{q}(i^{\prime},P)}{c_{q}(i^{\prime},P^{\prime})},\frac{c_{q}(i^{\prime\prime},P)}{c_{q}(i^{\prime\prime},P^{\prime})}\right)
=\displaystyle= min(cq(i,P)d(i,i),cq(i′′,P)d(i′′,i))\displaystyle\min\left(\frac{c_{q}(i^{\prime},P)}{d(i^{\prime},i^{*})},\frac{c_{q}(i^{\prime\prime},P)}{d(i^{\prime\prime},i^{*})}\right)
\displaystyle\leq min(d(i,i′′)+cq(i′′,P)d(i,i),cq(i′′,P)d(i′′,i))\displaystyle\min\left(\frac{d(i^{\prime},i^{\prime\prime})+c_{q}(i^{\prime\prime},P)}{d(i^{\prime},i^{*})},\frac{c_{q}(i^{\prime\prime},P)}{d(i^{\prime\prime},i^{*})}\right) (by Lemma 2)
\displaystyle\leq min(d(i,i)+d(i,i′′)+cq(i′′,P)d(i,i),cq(i′′,P)d(i′′,i))\displaystyle\min\left(\frac{d(i^{\prime},i^{*})+d(i^{*},i^{\prime\prime})+c_{q}(i^{\prime\prime},P)}{d(i^{\prime},i^{*})},\frac{c_{q}(i^{\prime\prime},P)}{d(i^{\prime\prime},i^{*})}\right)
\displaystyle\leq min(d(i,i)+d(i,i′′)+2d(i,i)d(i,i),2d(i,i)d(i′′,i))\displaystyle\min\left(\frac{d(i^{\prime},i^{*})+d(i^{*},i^{\prime\prime})+2\cdot d(i^{\prime},i^{*})}{d(i^{\prime},i^{*})},\frac{2\cdot d(i^{\prime},i^{*})}{d(i^{\prime\prime},i^{*})}\right) (as cq(i′′,P)2d(i,i)c_{q}(i^{\prime\prime},P)\leq 2\cdot d(i^{\prime},i^{*}) )
\displaystyle\leq maxz0min(3+1/z,2z)=(3+17)/2.\displaystyle\max_{z\geq 0}\min(3+1/z,2\cdot z)=(3+\sqrt{17})/2.
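For completeness, the value in the last step comes from balancing the decreasing term against the increasing one: writing z=d(i,i)/d(i′′,i)z=d(i^{\prime},i^{*})/d(i^{\prime\prime},i^{*}), the minimum of 3+1/z3+1/z and 2z2z is maximized where the two terms meet, since only the positive root is relevant for z0z\geq 0:

\[
2z = 3 + \frac{1}{z} \;\Longleftrightarrow\; 2z^{2} - 3z - 1 = 0 \;\Longleftrightarrow\; z = \frac{3+\sqrt{17}}{4},
\qquad \text{so}\quad \max_{z\geq 0}\min\left(3+\frac{1}{z},\, 2z\right) = 2z = \frac{3+\sqrt{17}}{2}.
\]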

To show that this bound is tight, consider the case where n=28n=28 and k=7k=7. Assume that the individuals form four isomorphic sets of 77 individuals each, such that each set is sufficiently far from all other sets. The distances between the individuals in one set are given in the table below.

a1a_{1} a2a_{2} a3a_{3} a4a_{4} a5a_{5} a6a_{6} a7a_{7}
a1a_{1} 0 11 22 1712\frac{\sqrt{17}-1}{2} 17+12ϵ\frac{\sqrt{17}+1}{2}-\epsilon 17+12ϵ\frac{\sqrt{17}+1}{2}-\epsilon 17+322ϵ\frac{\sqrt{17}+3}{2}-2\cdot\epsilon
a2a_{2} 11 0 11 1732\frac{\sqrt{17}-3}{2} 1712ϵ\frac{\sqrt{17}-1}{2}-\epsilon 1712ϵ\frac{\sqrt{17}-1}{2}-\epsilon 17+122ϵ\frac{\sqrt{17}+1}{2}-2\cdot\epsilon
a3a_{3} 22 11 0 1712\frac{\sqrt{17}-1}{2} 17+12ϵ\frac{\sqrt{17}+1}{2}-\epsilon 17+12ϵ\frac{\sqrt{17}+1}{2}-\epsilon 17+322ϵ\frac{\sqrt{17}+3}{2}-2\cdot\epsilon
a4a_{4} 1712\frac{\sqrt{17}-1}{2} 1732\frac{\sqrt{17}-3}{2} 1712\frac{\sqrt{17}-1}{2} 0 1ϵ1-\epsilon 1ϵ1-\epsilon 22ϵ2-2\epsilon
a5a_{5} 17+12ϵ\frac{\sqrt{17}+1}{2}-\epsilon 1712ϵ\frac{\sqrt{17}-1}{2}-\epsilon 17+12ϵ\frac{\sqrt{17}+1}{2}-\epsilon 1ϵ1-\epsilon 0 0 1ϵ1-\epsilon
a6a_{6} 17+12ϵ\frac{\sqrt{17}+1}{2}-\epsilon 1712ϵ\frac{\sqrt{17}-1}{2}-\epsilon 17+12ϵ\frac{\sqrt{17}+1}{2}-\epsilon 1ϵ1-\epsilon 0 0 1ϵ1-\epsilon
a7a_{7} 17+322ϵ\frac{\sqrt{17}+3}{2}-2\epsilon 17+122ϵ\frac{\sqrt{17}+1}{2}-2\epsilon 17+322ϵ\frac{\sqrt{17}+3}{2}-2\epsilon 22ϵ2-2\epsilon 1ϵ1-\epsilon 1ϵ1-\epsilon 0

Since k=7k=7 and there are four isomorphic groups, there exists a group that has at most one representative in some realized panel. Note that the algorithm first opens the balls that are centered at the copies of a5a_{5} and have radius equal to 1ϵ1-\epsilon. Assume that when this ball was opened in the group that has only one representative in the panel, the algorithm chose a7a_{7} to be included in the panel. Then, in this group, the individuals a1a_{1}, a2a_{2}, a3a_{3} and a4a_{4} form a coalition of size n/k=4n/k=4 that is eligible to choose a2a_{2}, and all of them reduce their distance by a multiplicative factor approaching (3+17)/2(3+\sqrt{17})/2 as ϵ\epsilon goes to zero. ∎

Appendix G Augmented-FairGreedyCapture with Known qq

Input: Individuals [n][n], metric dd, kk, qq
Output: Panel PP
R[n];δ0;PR\leftarrow[n];\delta\leftarrow 0;P\leftarrow\emptyset;
while |R|qn/k|R|\geq\left\lceil q\cdot n/k\right\rceil do
       Smoothly increase δ\delta;
       while jR\exists j\in R such that |B(j,δ)R|qn/k|B(j,\delta)\cap R|\geq\left\lceil q\cdot n/k\right\rceil do
             Sqn/kS\leftarrow\left\lceil q\cdot n/k\right\rceil individuals arbitrarily chosen from B(j,δ)B(j,\delta);
             P^\hat{P}\leftarrow pick qq individuals from SS uniformly at random;
             PPP^P\leftarrow P\cup\hat{P};
             RRSR\leftarrow R\setminus S;
            
       end while
      
end while
if |P|<k|P|<k then
       P^\hat{P}\leftarrow k|P|k-|P| individuals from [n]P[n]\setminus P by picking iRi\in R with probability k/nk/n and i[n](PR)i\in[n]\setminus(P\cup R) with probability k|P||R|k/nn|P||R|\frac{k-|P|-|R|\cdot k/n}{n-|P|-|R|};
       PPP^P\leftarrow P\cup\hat{P};
end if
ALGORITHM 3 Augmented-FairGreedyCapture

Here, we show that there exists a version of FairGreedyCapture such that if qq is known, it provides an approximation of ((5+41)/2)((5+\sqrt{41})/2) to the ex post qq-core. As before, our algorithm leverages the basic idea of the Greedy Capture algorithm.

Augmented-FairGreedyCapturek,q\textsc{Augmented-FairGreedyCapture}_{k,q}, in Algorithm 3, starts with an empty panel PP and grows a ball around every individual in [n][n] at the same rate. When a ball captures qn/k\left\lceil q\cdot n/k\right\rceil individuals (if more than qn/k\left\lceil q\cdot n/k\right\rceil individuals have been captured, it chooses exactly qn/k\left\lceil q\cdot n/k\right\rceil by arbitrarily excluding some points on the boundary), the algorithm selects qq of them uniformly at random, includes them in the panel PP, and disregards all the qn/k\left\lceil q\cdot n/k\right\rceil individuals. When this happens, we say that the algorithm detects this ball. Unlike Greedy Capture, we continue growing balls only around the individuals that are not yet disregarded, i.e. detected balls are frozen. When fewer than qn/k\left\lceil q\cdot n/k\right\rceil individuals remain, the algorithm selects the remaining representatives from among the individuals who have not yet been included in the panel as follows: each individual who has not been disregarded is selected with probability k/nk/n, and the remaining probability mass is allocated uniformly among the individuals who have been disregarded but not selected. This can be achieved through systematic sampling Yates [1948].
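One way to realize the last selection step with the required inclusion probabilities is systematic sampling; the Python sketch below is our own illustration (not the paper's implementation) and selects a set whose size equals the integer sum of the inclusion probabilities while including each index i with probability exactly probs[i].

import math
import random

def systematic_sample(probs):
    # probs[i] is the desired inclusion probability of index i; their sum must
    # be an integer, which is then the exact size of the returned sample.
    assert abs(sum(probs) - round(sum(probs))) < 1e-9
    u = random.uniform(0, 1)
    chosen, cum = [], 0.0
    for i, p in enumerate(probs):
        # Index i is selected iff one of the grid points u, u+1, u+2, ...
        # falls in the interval (cum, cum + p].
        if math.floor(cum + p - u) > math.floor(cum - u):
            chosen.append(i)
        cum += p
    return chosen

In the final step of Algorithm 3, probs would assign k/n to each individual in R and (k−|P|−|R|·k/n)/(n−|P|−|R|) to each individual in [n]∖(P∪R), matching the probabilities stated in the pseudocode.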

Theorem 16.

For every qq, Augmented-FairGreedyCapturek,q\textsc{Augmented-FairGreedyCapture}_{k,q} is fair and in the ex post ((5+41)/2)((5+\sqrt{41})/2)-qq-core.

Proof.

We start by showing that the algorithm is fair.

Lemma 5.

Augmented-FairGreedyCapturek,q\textsc{Augmented-FairGreedyCapture}_{k,q} is fair.

Proof.

Suppose that qn/kq\cdot n/k is an integer. Then, each individual that is disregarded in the while loop of the algorithm is included in the panel with probability exactly k/nk/n. Now, suppose that after the algorithm has detected tt balls, fewer than qn/kq\cdot n/k individuals remain non-disregarded. Then, when the algorithm exits the while loop, we have that |R|=ntqn/k|R|=n-t\cdot q\cdot n/k and k|P|=ktqk-|P|=k-t\cdot q. Since

|R|k/n=ktq,\displaystyle|R|\cdot k/n=k-t\cdot q,

we conclude that the remaining ktqk-t\cdot q representatives are chosen from RR, with each individual in RR included with probability k/nk/n. Thus, the algorithm returns a panel of size kk and each i[n]i\in[n] is chosen with probability k/nk/n.

Now, we focus on the case where qn/kq\cdot n/k is not an integer. In this case, note that in the while loop of the algorithm, fewer than kk representatives are included in the panel, since qq representatives are included in it every time that qn/k\left\lceil q\cdot n/k\right\rceil non-disregarded individuals are captured from a ball. Moreover, each individual that is disregarded is chosen with probability strictly less than k/nk/n. Now suppose that after exiting the while loop, there are individuals that have not been disregarded, i.e. |R|>0|R|>0. First, we show that the algorithm correctly chooses another k|P|k-|P| representatives and outputs a panel of size kk. The algorithm selects each individual in RR with probability k/nk/n and allocates the remaining probability — which is equal to k|P||R|k/nk-|P|-|R|\cdot k/n — uniformly among the n|P||R|n-|P|-|R| individuals that have been disregarded but not selected in PP. To satisfy fairness for the individuals in RR, it suffices to show that |R|k/n<k|P||R|\cdot k/n<k-|P|. Since for each individual i[n]Ri\in[n]\setminus R we have Pr[iP]=q/qn/k<k/n\Pr[i\in P]=q/\left\lceil q\cdot n/k\right\rceil<k/n, it follows that |P|=𝔼[|P|]=i[n]RPr[iP]<(n|R|)k/n|P|=\mathbb{E}[|P|]=\sum_{i\in[n]\setminus R}\Pr[i\in P]<(n-|R|)\cdot k/n. Thus, k|P|>k(n|R|)k/n=|R|k/n.k-|P|>k-(n-|R|)\cdot k/n=|R|\cdot k/n. Hence, the algorithm outputs panels of size kk.

It remains to show that each individual in [n]R[n]\setminus R, i.e. each individual that is disregarded in the while loop, is included in the panel with probability k/nk/n. First, note that all of them are included in the panel with the same probability. This holds since each is selected with probability q/qn/kq/\left\lceil q\cdot n/k\right\rceil from the ball that captured them in the while loop, and, when not selected in the while loop, they get an equal chance of selection of k|P||R|k/nn|P||R|\frac{k-|P|-|R|\cdot k/n}{n-|R|-|P|}. Since the size of the final panel returned by the algorithm is always kk, and by linearity of expectation, we have k=|R|k/n+i[n]RPr[iP]k=|R|\cdot k/n+\sum_{i\in[n]\setminus R}\Pr[i\in P]. By equality of the Pr[iP]\Pr[i\in P]’s, we conclude that each must be equal to k/nk/n, and hence each individual in [n][n] is included in the panel with probability k/nk/n. ∎

We proceed by showing that Augmented-FairGreedyCapturek,q\textsc{Augmented-FairGreedyCapture}_{k,q} is in the ex post ((5+41)/2)((5+\sqrt{41})/2)-qq-core. Let PP be any panel that the algorithm may return. Suppose for contradiction that there exists a panel PP^{\prime} such that Vq(P,P,(5+41)/2)|P|n/kV_{q}(P,P^{\prime},(5+\sqrt{41})/2)\geq|P^{\prime}|\cdot n/k. This means that there exists S[n]S\subseteq[n], with |S||P|n/k|S|\geq|P^{\prime}|\cdot n/k, such that:

iS,cq(i,P)>(5+41)/2cq(i,P).\displaystyle\forall i\in S,\quad\quad c_{q}(i,P)>(5+\sqrt{41})/2\cdot c_{q}(i,P^{\prime}). (6)

Let T1,,TmT_{1},\ldots,T_{m} be a partition of SS with respect to PP^{\prime}, as given in the first part of Lemma 1. In a similar way as in the proof of Theorem 3, we can conclude that there exists a part, say TT_{\ell}, of size at least qn/kq\cdot n/k, and a ball centered at ii^{*}_{\ell} that has radius 2cq(i,P)2\cdot c_{q}(i^{*}_{\ell},P^{\prime}) and captures all the individuals in TT_{\ell}. Since there are sufficiently many individuals in this ball, the algorithm may have detected this ball (or a ball nested within it) during its execution. If this happened, then there are qq representatives in PP that are located within the ball B(i,2cq(i,P))B(i^{*}_{\ell},2\cdot c_{q}(i^{*}_{\ell},P^{\prime})). Then, we get that cq(i,P)2cq(i,P)c_{q}(i^{*}_{\ell},P)\leq 2\cdot c_{q}(i^{*}_{\ell},P^{\prime}), which contradicts Equation 6.

On the other hand, if the algorithm did not detect this ball (or a ball nested within it) during its execution, this means that some of the individuals in TT_{\ell} have been disregarded before the ball centered at ii^{*}_{\ell} captures sufficiently many of them. Hence, some individuals in TT_{\ell} have been captured by a different ball with radius at most 2cq(i,P)2\cdot c_{q}(i^{*}_{\ell},P^{\prime}). Suppose that ii^{\prime} is the first individual in TT_{\ell} that was captured by such a ball. Then, qq representatives in PP are within this ball, which means that cq(i,P)4cq(i,P)c_{q}(i^{\prime},P)\leq 4\cdot c_{q}(i^{*}_{\ell},P^{\prime}), since the distance of ii^{\prime} from any individual in this ball is at most equal to the diameter of the ball. We consider the minimum multiplicative improvement of both ii^{*}_{\ell} and ii^{\prime}:

min(cq(i,P)cq(i,P),cq(i,P)cq(i,P))\displaystyle\min\left(\frac{c_{q}(i^{\prime},P)}{c_{q}(i^{\prime},P^{\prime})},\frac{c_{q}(i^{*}_{\ell},P)}{c_{q}(i^{*}_{\ell},P^{\prime})}\right)
\displaystyle\leq min(cq(i,P)cq(i,P),d(i,i)+cq(i,P)cq(i,P))\displaystyle\min\left(\frac{c_{q}(i^{\prime},P)}{c_{q}(i^{\prime},P^{\prime})},\frac{d(i^{*}_{\ell},i^{\prime})+c_{q}(i^{\prime},P)}{c_{q}(i^{*}_{\ell},P^{\prime})}\right) (by Lemma 2)
\displaystyle\leq min(cq(i,P)cq(i,P),d(i,ri)+d(i,ri)+cq(i,P)cq(i,P))\displaystyle\min\left(\frac{c_{q}(i^{\prime},P)}{c_{q}(i^{\prime},P^{\prime})},\frac{d(i^{*}_{\ell},r_{i^{\prime}})+d(i^{\prime},r_{i^{\prime}})+c_{q}(i^{\prime},P)}{c_{q}(i^{*}_{\ell},P^{\prime})}\right) (by Triangle Inequality)
\displaystyle\leq min(cq(i,P)cq(i,P),cq(i,P)+cq(i,P)+cq(i,P)cq(i,P))\displaystyle\min\left(\frac{c_{q}(i^{\prime},P)}{c_{q}(i^{\prime},P^{\prime})},\frac{c_{q}(i^{*}_{\ell},P^{\prime})+c_{q}(i^{\prime},P^{\prime})+c_{q}(i^{\prime},P)}{c_{q}(i^{*}_{\ell},P^{\prime})}\right) (as ri𝗍𝗈𝗉q(i,P)𝗍𝗈𝗉q(i,P))r_{i^{\prime}}\in\operatorname{\mathsf{top}}_{q}(i^{\prime},P^{\prime})\cap\operatorname{\mathsf{top}}_{q}(i^{*}_{\ell},P^{\prime}))
\displaystyle\leq min(4cq(i,P)cq(i,P),5+cq(i,P)cq(i,P))\displaystyle\min\left(\frac{4\cdot c_{q}(i^{*}_{\ell},P^{\prime})}{c_{q}(i^{\prime},P^{\prime})},5+\frac{c_{q}(i^{\prime},P^{\prime})}{c_{q}(i^{*}_{\ell},P^{\prime})}\right) (as cq(i,P)4cq(i,P))c_{q}(i^{\prime},P)\leq 4\cdot c_{q}(i^{*}_{\ell},P^{\prime}))
\displaystyle\leq maxz0min(4z,5+1/z)=(5+41)/2\displaystyle\max_{z\geq 0}\min(4\cdot z,5+1/z)=(5+\sqrt{41})/2

which contradicts Equation 6, and the theorem follows. ∎
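As in Appendix F, the constant in the last step of the chain above is obtained by balancing the two terms; writing z=cq(i,P)/cq(i,P)z=c_{q}(i^{*}_{\ell},P^{\prime})/c_{q}(i^{\prime},P^{\prime}) and taking the positive root:

\[
4z = 5 + \frac{1}{z} \;\Longleftrightarrow\; 4z^{2} - 5z - 1 = 0 \;\Longleftrightarrow\; z = \frac{5+\sqrt{41}}{8},
\qquad \text{so}\quad \max_{z\geq 0}\min\left(4z,\, 5+\frac{1}{z}\right) = 4z = \frac{5+\sqrt{41}}{2}.
\]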