
Network Flow Methods for the Minimum Covariates Imbalance Problem

Dorit S. Hochbaum (hochbaum@ieor.berkeley.edu), Department of Industrial Engineering and Operations Research, University of California, Berkeley. Research supported in part by NSF award No. CMMI-1760102.
Xu Rao (xrao@berkeley.edu), Department of Industrial Engineering and Operations Research, University of California, Berkeley.
Abstract

The problem of balancing covariates arises in observational studies where one is given a group of control samples and another group, disjoint from the control group, of treatment samples. Each sample, in either group, has several observed nominal covariates. The values, or categories, of each covariate partition the treatment and control samples into a number of subsets referred to as levels, where the samples at every level share the same covariate value. We address here the problem of selecting a subset of the control group so as to balance, to the best extent possible, the sizes of the levels between the treatment group and the selected subset of the control group, the min-imbalance problem. It is proved here that the min-imbalance problem, on two covariates, is solved efficiently with network flow techniques. We present an integer programming formulation of the problem where the constraint matrix is totally unimodular, implying that the linear programming relaxation of the problem has all basic solutions, and in particular the optimal solution, integral. This integer programming formulation is linked to a minimum cost network flow problem which is solvable in O(n\cdot(n^{\prime}+n\log n)) steps, for n the size of the treatment group and n^{\prime} the size of the control group. A more efficient algorithm is further devised based on an alternative, maximum flow, formulation of the two-covariate min-imbalance problem, that runs in O(n^{\prime 3/2}\log^{2}n) steps.

1 Introduction

The problem of balancing covariates arises in observational studies where one is given a group of control samples and another group, disjoint from the control group, of treatment samples. Each sample, in either group, has several observed nominal covariates. The values, or categories, of each covariate partition the treatment and control samples to a number of subsets referred to as levels where the samples at every level share the same covariate value.

When estimating causal effects using observational data, it is desirable to replicate a randomized experiment as closely as possible by obtaining treatment and control groups with similar covariate distributions. This goal can often be achieved by choosing well-matched samples of the original treatment and control groups, thereby reducing bias due to the covariates. The matching is typically one-to-one.

The bias due to the covariates can be completely removed if for each matched pair, the two samples belong to the same levels over all covariates. Satisfying this requirement, however, typically results in a very small selection from the treatment and control group, which is not desirable. To address this Rosenbaum et al. [1] introduced a relaxation requiring instead to match all treatment samples to a subset of the control samples so that the number of control and treatment samples in each level of each covariate are the same. This requirement is known in the literature as fine balance.

It is not always feasible to find a selection of the control samples that satisfies the fine balance requirement. To that end several papers considered the goal of minimizing the violation of this requirement, which we refer to as imbalance. Yang et al. [2] considered finding an optimal matching between the treatment samples and a selection of the control samples where that selection minimizes an imbalance objective, to be explained below. Bennett et al. [3] considered directly the minimum imbalance problem that finds a selection which minimizes the violation of the fine balance requirement.

Our focus here is the minimum imbalance problem, also referred to as the min-imbalance problem. The min-imbalance problem is trivial for a single covariate and NP-hard for three or more covariates, as discussed next. For the case of two covariates, we introduce here efficient network flow algorithms that solve the problem in polynomial time.

Let P be the number of covariates to be balanced. To introduce the min-imbalance problem, consider first the case of a single covariate, P=1, that partitions the control and treatment groups into, say, k levels each. Let the sizes of the levels in the treatment group be \ell_{1},...,\ell_{k} and the sizes of the levels in the control group be \ell^{\prime}_{1},...,\ell^{\prime}_{k}. The min-imbalance problem is to select a subset of the control group, called selection, such that the sizes of the levels in the selection approximate as closely as possible the sizes \ell_{1},...,\ell_{k}. The solution to the single covariate min-imbalance problem is trivial: the optimal selection is any \ell_{i} samples out of the \ell^{\prime}_{i} level i control samples if \ell^{\prime}_{i}\geq\ell_{i}; otherwise, select all \ell^{\prime}_{i} level i control samples. The value of the objective function corresponding to this selection is then \sum_{i=1}^{k}\max\{0,\ell_{i}-\ell^{\prime}_{i}\}, which is the optimal value for the single covariate min-imbalance problem.
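As a concrete illustration of this single-covariate rule, the following short Python sketch (the function name and data layout are our own, chosen purely for illustration) computes the optimal number of samples to take from each level and the optimal objective value.

def single_covariate_min_imbalance(treat_sizes, control_sizes):
    # treat_sizes[i] = l_i and control_sizes[i] = l'_i for level i
    select = [min(t, c) for t, c in zip(treat_sizes, control_sizes)]            # samples taken per level
    imbalance = sum(max(0, t - c) for t, c in zip(treat_sizes, control_sizes))  # unavoidable deficits
    return select, imbalance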

For the case of multiple covariates, covariate p partitions both treatment and control groups into k_{p} levels each. Denote the sizes of the levels in the treatment group under covariate p by \ell_{p,1},\ell_{p,2},...,\ell_{p,k_{p}}, and let the partition of the control group under covariate p be L^{\prime}_{p,1},L^{\prime}_{p,2},...,L^{\prime}_{p,k_{p}} of sizes \ell^{\prime}_{p,1},\ell^{\prime}_{p,2},...,\ell^{\prime}_{p,k_{p}}. The min-imbalance problem is to find a selection S, a subset of the control group, that minimizes the objective function: \sum_{p=1}^{P}\sum_{i=1}^{k_{p}}||S\cap L^{\prime}_{p,i}|-\ell_{p,i}|.

Let the number of treatment samples be n and the number of control samples be n^{\prime}. In applications of the min-imbalance problem it is often the case that the size of the selection S is required to be n, the size of the treatment group. This is assuming that n\leq n^{\prime}. Here we focus the presentation on the case where the size of S is required to be n. Yet, all our methods generalize to the min-imbalance problem for any specified size q, q\leq n^{\prime}, of the selection, as shown in Appendix B.

The min-imbalance problem was recently studied by Bennett et al. [3], where a mixed integer programming (MIP) formulation of the problem was presented, with the size of the selection equal to n, the size of the treatment group. They showed that the corresponding linear programming (LP) relaxation yields an integer solution when the number of covariates P\leq 2. Consequently, Bennett et al. [3] established that the min-imbalance problem on two covariates is solved in polynomial time via linear programming. For P=3 they presented an example where the LP relaxation's solution is not integral. It is further noted in [3] that the min-imbalance problem for P\geq 3 is NP-hard; however, no proof or reference is provided. Among other contributions, we fill this gap with a formal proof that the min-imbalance problem for three or more covariates is NP-hard.

A problem related to the min-imbalance problem that selects a subset of the control samples so as to achieve a certain type of balancing of the levels is called Optimal Matching with Minimal Deviation from Fine Balance, see Yang et al. [2]. We refer to this problem here for short as \kappa-Matching-Balance (MB). The MB problem is to optimize the assignment of each treatment sample to a number, \kappa, of control samples, based on a distance measure between each treatment and each control sample. This is subject to a constraint that the selection of the \kappa n control samples is an optimal solution to a problem related to min-imbalance. In other words, MB is defined in two stages: In the first stage the goal is to find a selection S of control samples of size \kappa n, where \kappa is an integer between 1 and \frac{n^{\prime}}{n}, and n is the size of the treatment group, so as to minimize the \kappa-imbalance, defined as \sum_{p=1}^{P}\sum_{i=1}^{k_{p}}||S\cap L^{\prime}_{p,i}|-\kappa\ell_{p,i}|. In the second stage, among all selections of control samples that attain the minimum \kappa-imbalance, one chooses the selection that minimizes the distance matching of each treatment sample to exactly \kappa selected control samples. Yang et al. [2] studied this MB problem for a single covariate and proposed two network flow algorithms. We observe here that, for any number of covariates, if the first stage problem of minimizing the \kappa-imbalance has a unique solution, in terms of the number of selected samples from each level-intersection (to be discussed in the next section, Section 2) of the control group, then an optimal solution to the second stage, and therefore to the MB problem, is easily attained by solving a network flow problem. As shown in Appendix C, the minimum \kappa-imbalance problem is equivalent to the min-imbalance problem, and therefore an optimal solution to a corresponding min-imbalance problem provides an optimal solution to the minimum \kappa-imbalance problem. It follows in particular that for a single covariate, the first stage problem of minimizing the \kappa-imbalance is trivial and has a unique solution in terms of the number of selected samples from each level of the control group. For two or more covariates, no polynomial running time algorithm is known for this problem. However, for two covariates, any of the algorithms shown here also solves the minimum \kappa-imbalance problem optimally. For any number of covariates, when the minimum \kappa-imbalance solution is unique (in terms of level-intersection sizes) and known, the MB problem is solved in polynomial time in the second stage (for details see Appendix C). In particular, for two covariates, if the solution to the minimum \kappa-imbalance problem is unique, then the problem is solved efficiently with network flow techniques.

Our main results here are efficient algorithms for the min-imbalance problem with two covariates. We present an integer programming formulation of the problem, related to that of Bennett et al. [3], and show that the constraint matrix is totally unimodular. That implies that the linear programming relaxation of the problem has all basic solutions, and in particular the optimal solution, integral. We then show that the two-covariate min-imbalance problem can be solved, with network flow techniques, much more efficiently than as a linear program. We show how to solve the problem as a minimum cost network flow problem with complexity O(n\cdot(n^{\prime}+n\log n)). We then devise a more efficient algorithm based on a maximum flow formulation of the two-covariate min-imbalance problem that runs in O(n^{\prime 3/2}\log^{2}n) steps. Another contribution here is a proof that for three or more covariates, the min-imbalance problem is NP-hard.

In summary our contributions here are:

  • An integer programming formulation of the min-imbalance problem with two covariates that has a totally unimodular constraint matrix.

  • A formulation of the two-covariate min-imbalance problem as a minimum cost network flow problem, solved in O(n\cdot(n^{\prime}+n\log n)) steps.

  • A maximum flow formulation of the two-covariate min-imbalance problem that is solved in O(n^{\prime 3/2}\log^{2}n) steps.

  • An NP-hardness proof of the min-imbalance problem with three or more covariates.

Paper overview: Section 2 provides the notation used here. Section 3 describes the integer programming formulation with a totally unimodular constraint matrix for the two-covariate min-imbalance problem. We describe the minimum cost network flow formulation of the two-covariate min-imbalance problem, and the algorithm used to solve it, in Section 4. In Section 5 we provide a maximum flow problem, the output of which is converted, in linear time, to an optimal solution to the two-covariate min-imbalance problem. We provide several concluding remarks in Section 6. The NP-hardness proof of the min-imbalance problem with three or more covariates is provided in Appendix A. Appendix B explains how our methods also apply to the min-imbalance problem for other specified sizes of the selection.

2 Preliminaries and notations

The goal of the min-imbalance problem is to identify a selection of control samples that will match, at each level of each covariate partition, as closely as possible, the number of treatment samples in that level with the number of selection samples in the same level. We denote by P the number of covariates and by k_{p} the number of levels in the partition induced by covariate p, for p=1,...,P. Let the levels of the treatment group under covariate p=1,\ldots,P be L_{p,1},L_{p,2},...,L_{p,k_{p}} of sizes \ell_{p,1},\ell_{p,2},...,\ell_{p,k_{p}}, and let the levels of the control group under covariate p be L^{\prime}_{p,1},L^{\prime}_{p,2},...,L^{\prime}_{p,k_{p}} of sizes \ell^{\prime}_{p,1},\ell^{\prime}_{p,2},...,\ell^{\prime}_{p,k_{p}}. For a selection S of control samples we define the discrepancy at level i under covariate p as,

dis(S,p,i)=|S\cap L^{\prime}_{p,i}|-\ell_{p,i}.

The discrepancy of a level can be positive or negative. If the discrepancy is positive we refer to it as excess, defined as e_{p,i}(S)=\max\{0,dis(S,p,i)\}, and if negative, we refer to it as deficit, d_{p,i}(S)=\max\{0,-dis(S,p,i)\}. With this notation the imbalance of a selection S, IM(S), is,

IM(S)=\sum_{p=1}^{P}\sum_{i=1}^{k_{p}}\left(e_{p,i}(S)+d_{p,i}(S)\right).

Notice that this quantity is identical to the imbalance form presented in the introduction: IM(S)=\sum_{p=1}^{P}\sum_{i=1}^{k_{p}}||S\cap L^{\prime}_{p,i}|-\ell_{p,i}|.
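For concreteness, the following Python sketch (the data layout and names are our own, chosen for illustration only) evaluates IM(S) for a given selection directly from this definition.

def imbalance(selection, control_levels, treat_sizes):
    # selection: set of selected control sample ids (the set S)
    # control_levels[p][i]: set of control sample ids in level i of covariate p (L'_{p,i})
    # treat_sizes[p][i]: number of treatment samples in level i of covariate p (l_{p,i})
    total = 0
    for p in range(len(control_levels)):
        for i in range(len(control_levels[p])):
            total += abs(len(selection & control_levels[p][i]) - treat_sizes[p][i])
    return total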

We now present the integer programming formulation that was given by Bennett et al. in [3] for the min-imbalance problem. That integer program involves two sets of decision variables:
z_{j}: a binary variable equal to 1 if control sample j is in the selection S, and 0 otherwise, for j=1,\ldots,n^{\prime};
y_{p,i}=|dis(S,p,i)|=||S\cap L^{\prime}_{p,i}|-\ell_{p,i}|: the absolute value of the discrepancy at level i under covariate p, for p=1,\ldots,P and i=1,\ldots,k_{p}. We note that this variable can only assume integer values, even though Bennett et al. refer to this as a mixed integer programming formulation.
With these variables the formulation is:

\min\ \sum_{p=1}^{P}\sum_{i=1}^{k_{p}}y_{p,i}   (1a)
s.t.\ \sum_{j\in L^{\prime}_{p,i}}z_{j}-\ell_{p,i}\leq y_{p,i},\quad p=1,\ldots,P,\ i=1,\ldots,k_{p}   (1b)
\ell_{p,i}-\sum_{j\in L^{\prime}_{p,i}}z_{j}\leq y_{p,i},\quad p=1,\ldots,P,\ i=1,\ldots,k_{p}   (1c)
\sum_{j=1}^{n^{\prime}}z_{j}=n   (1d)
z_{j}\in\{0,1\},\quad j=1,\ldots,n^{\prime}.   (1e)

For each pair p,i, p=1,\ldots,P and i=1,\ldots,k_{p}, constraints (1b) and (1c) ensure that y_{p,i} assumes the absolute value of the difference between the number of selected level i control samples and \ell_{p,i} at an optimal solution. These constraints also ensure that any feasible y_{p,i} is non-negative, and therefore a non-negativity constraint is not required for variable y_{p,i}. Constraint (1d) specifies that the size of the selected subset equals the size of the treatment group.
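The following Python sketch shows one way formulation (1) could be stated with an off-the-shelf modeling package; it assumes the PuLP package is available, and all function and variable names are our own illustrative choices rather than part of the paper's method.

import pulp

def min_imbalance_mip(control_levels, treat_sizes, n, n_prime):
    # control_levels[p][i]: list of control sample indices j in L'_{p,i}
    # treat_sizes[p][i]: l_{p,i};  n: treatment group size;  n_prime: control group size
    prob = pulp.LpProblem("min_imbalance", pulp.LpMinimize)
    z = [pulp.LpVariable(f"z_{j}", cat="Binary") for j in range(n_prime)]          # (1e)
    y = {(p, i): pulp.LpVariable(f"y_{p}_{i}", lowBound=0)
         for p in range(len(control_levels)) for i in range(len(control_levels[p]))}
    prob += pulp.lpSum(y.values())                                                 # objective (1a)
    for p, levels in enumerate(control_levels):
        for i, level in enumerate(levels):
            sel = pulp.lpSum(z[j] for j in level)
            prob += sel - treat_sizes[p][i] <= y[p, i]                             # (1b)
            prob += treat_sizes[p][i] - sel <= y[p, i]                             # (1c)
    prob += pulp.lpSum(z) == n                                                     # (1d)
    prob.solve()
    return [j for j in range(n_prime) if z[j].value() > 0.5]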

Bennett et al. [3] proved that any basic solution of the linear programming relaxation of (1) is integral for P=2. We provide in Section 3 a stronger result, showing that a slightly modified form of this integer programming formulation has a constraint matrix which is totally unimodular for P=2. That implies that every basic solution is integral, and furthermore that the constraint matrix is that of a minimum cost network flow problem. This implies that the problem can be solved significantly more efficiently than as a linear programming problem.

An optimal solution to formulation (1) specifies for each control sample whether or not it is in the selection. We observe however that the output to the min-imbalance problem, for any number of covariates, can be presented more compactly in terms of level-intersections. For P covariates the intersections of the level sets L^{\prime}_{1,i_{1}}\cap L^{\prime}_{2,i_{2}}\cap\ldots\cap L^{\prime}_{P,i_{P}}, i_{p}=1,\ldots,k_{p}, p=1,\ldots,P, form a partition of the control group. The number of non-empty sets in this partition is at most \min\{n^{\prime},\Pi_{p=1}^{P}k_{p}\}. Instead of specifying which sample belongs to the selection, it is sufficient to determine the number of selected control samples in each level intersection, since the identity of the specific selected samples has no effect on the imbalance. With this discussion, we have a theorem on the presentation of the solution to the min-imbalance problem in terms of the level-intersection sizes.

Theorem 1.

The level-intersection sizes s_{i_{1},i_{2},\ldots,i_{P}} are an optimal solution to the min-imbalance problem if there exists an optimal selection S such that s_{i_{1},i_{2},\ldots,i_{P}}=|S\cap L^{\prime}_{1,i_{1}}\cap L^{\prime}_{2,i_{2}}\cap\ldots\cap L^{\prime}_{P,i_{P}}|.

We will say that the solution to the problem is unique if for any optimal selection S, the numbers s_{i_{1},i_{2},\ldots,i_{P}}=|S\cap L^{\prime}_{1,i_{1}}\cap L^{\prime}_{2,i_{2}}\cap\ldots\cap L^{\prime}_{P,i_{P}}| are unique. In order to derive an optimal selection given the optimal level-intersection sizes, one selects any s_{i_{1},i_{2},\ldots,i_{P}} control samples from the intersection L^{\prime}_{1,i_{1}}\cap L^{\prime}_{2,i_{2}}\cap\ldots\cap L^{\prime}_{P,i_{P}} for i_{p}=1,\ldots,k_{p}, p=1,\ldots,P. Obviously, when s_{i_{1},i_{2},\ldots,i_{P}}<|L^{\prime}_{1,i_{1}}\cap L^{\prime}_{2,i_{2}}\cap\ldots\cap L^{\prime}_{P,i_{P}}| for at least one combination of i_{1},i_{2},\ldots,i_{P}, the optimal selection is never unique.
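As a small illustration of this compact representation (a sketch only; the data layout and function names are our own), the level-intersection sizes can be tabulated with a counter, and an optimal selection can be realized from given intersection sizes by picking arbitrary samples within each intersection.

from collections import Counter

def level_intersection_sizes(control_labels):
    # control_labels[j] = (i_1, ..., i_P): level of control sample j under each covariate
    return Counter(control_labels)

def realize_selection(sizes, control_labels):
    # sizes: optimal number of samples to pick from each level intersection
    remaining = dict(sizes)
    selection = []
    for j, key in enumerate(control_labels):
        if remaining.get(key, 0) > 0:
            selection.append(j)
            remaining[key] -= 1
    return selection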

Standard formulations of the minimum cost network flow (MCNF) and the maximum flow problems, as well as a discussion of why the MCNF constraint matrix is totally unimodular, are given in Appendix D.

3 A modified formulation with a totally unimodular constraint matrix for P=2

In our formulation, instead of using variables y_{p,i}, we use variables for excess and deficit. As discussed in Section 2, ||S\cap L^{\prime}_{p,i}|-\ell_{p,i}|=e_{p,i}(S)+d_{p,i}(S) for each p and i. We let the variable for excess for p and i be e_{p,i} and the variable for deficit be d_{p,i}. Note that y_{p,i}=e_{p,i}+d_{p,i}, where both e_{p,i} and d_{p,i} are non-negative variables.

For each p and i, |S\cap L^{\prime}_{p,i}|-\ell_{p,i}=e_{p,i}-d_{p,i}\Leftrightarrow|S\cap L^{\prime}_{p,i}|+d_{p,i}-e_{p,i}=\ell_{p,i}.

In the modified formulation shown below, the constraints (2b) and (2c) for the two covariates are separated to facilitate the identification of the total unimodularity property. Since L^{\prime}_{1,1},...,L^{\prime}_{1,k_{1}} is a partition of the control group, \sum_{i=1}^{k_{1}}|S\cap L^{\prime}_{1,i}|=|S|. Also, because \ell_{1,1},...,\ell_{1,k_{1}} are the sizes of the levels partition of the treatment group for the first covariate, it follows that \sum_{i=1}^{k_{1}}\ell_{1,i}=n. Therefore, \sum_{i=1}^{k_{1}}\left(e_{1,i}-d_{1,i}\right)=\sum_{i=1}^{k_{1}}\left(|S\cap L^{\prime}_{1,i}|-\ell_{1,i}\right)=|S|-n. So specifying |S|=n is equivalent to constraint (2d) in formulation (2) given below:

\min\ \sum_{p=1}^{2}\sum_{i=1}^{k_{p}}\left(e_{p,i}+d_{p,i}\right)   (2a)
s.t.\ \sum_{j\in L^{\prime}_{1,i}}z_{j}+d_{1,i}-e_{1,i}=\ell_{1,i},\quad i=1,...,k_{1}   (2b)
\sum_{j\in L^{\prime}_{2,i}}z_{j}+d_{2,i}-e_{2,i}=\ell_{2,i},\quad i=1,...,k_{2}   (2c)
-\sum_{i=1}^{k_{1}}d_{1,i}+\sum_{i=1}^{k_{1}}e_{1,i}=0   (2d)
e_{p,i},d_{p,i}\geq 0,\quad p=1,2,\ i=1,...,k_{p}   (2e)
z_{j}\in\{0,1\},\quad j=1,...,n^{\prime}.   (2f)
Lemma 1.

The constraint matrix of the LP relaxation of formulation (2) is totally unimodular.

Proof.

First, it is noted that in the constraint matrix of (2) each entry is 0, 1 or -1. Multiplying some rows of a totally unimodular matrix by -1 preserves total unimodularity. Consider the matrix resulting from multiplying the rows of constraint (2c) by -1. As shown next, each column in this new matrix has at most one 1 and at most one -1. Hence, by a well-known theorem (see Appendix D, Theorem 6) the matrix is totally unimodular.

  • (1)

    Both \{L^{\prime}_{1,1},...,L^{\prime}_{1,k_{1}}\} and \{L^{\prime}_{2,1},...,L^{\prime}_{2,k_{2}}\} are partitions of the control group, so L^{\prime}_{1,1},...,L^{\prime}_{1,k_{1}} are mutually disjoint, and L^{\prime}_{2,1},...,L^{\prime}_{2,k_{2}} are mutually disjoint. The column of each z_{j} has exactly one 1 in the rows corresponding to (2b), and one -1 (after multiplication) in the rows corresponding to (2c).

  • (2)

    For each i, the column of d_{1,i} has exactly one 1 in the rows corresponding to (2b) and exactly one -1 in the row corresponding to (2d); the column of e_{1,i} has exactly one -1 in the rows corresponding to (2b) and exactly one 1 in the row corresponding to (2d).

  • (3)

    For each i, the column of d_{2,i} has exactly one non-zero entry, 1 or -1, in the rows corresponding to (2c); the column of e_{2,i} has exactly one non-zero entry, 1 or -1, in the rows corresponding to (2c).

Since the new matrix has at most one 1 and at most one -1 in each column, it is totally unimodular. ∎

Formulation (2) is in fact also a minimum cost network flow (MCNF) formulation. See Appendix D for a generic formulation of MCNF. A generic MCNF formulation has exactly one 1 and one -1 in each column of the constraint matrix. To give formulation (2) this structure, we multiply all coefficients in constraint (2c) by -1, and add a redundant constraint \sum_{i=1}^{k_{2}}d_{2,i}-\sum_{i=1}^{k_{2}}e_{2,i}=0, which, like constraint (2d), is also equivalent to |S|=n. In the next section, we streamline this formulation to derive a more efficient network flow formulation for this problem.

4 Network flow formulation for P=2

Here we use the level-intersection sizes as variables, x_{i_{1},i_{2}} for i_{1}=1,...,k_{1}, i_{2}=1,...,k_{2}. These variables can also be written as x_{i_{1},i_{2}}=\sum_{j\in L^{\prime}_{1,i_{1}}\cap L^{\prime}_{2,i_{2}}}z_{j}. With these decision variables we get the following network flow formulation:

\min\ \sum_{p=1}^{2}\sum_{i=1}^{k_{p}}\left(e_{p,i}+d_{p,i}\right)   (3a)
s.t.\ \sum_{i_{2}=1}^{k_{2}}x_{i_{1},i_{2}}+d_{1,i_{1}}-e_{1,i_{1}}=\ell_{1,i_{1}},\quad i_{1}=1,...,k_{1}   (3b)
-\sum_{i_{1}=1}^{k_{1}}x_{i_{1},i_{2}}-d_{2,i_{2}}+e_{2,i_{2}}=-\ell_{2,i_{2}},\quad i_{2}=1,...,k_{2}   (3c)
-\sum_{i_{1}=1}^{k_{1}}d_{1,i_{1}}+\sum_{i_{1}=1}^{k_{1}}e_{1,i_{1}}=0   (3d)
\sum_{i_{2}=1}^{k_{2}}d_{2,i_{2}}-\sum_{i_{2}=1}^{k_{2}}e_{2,i_{2}}=0   (3e)
e_{p,i},d_{p,i}\geq 0,\quad p=1,2,\ i=1,...,k_{p}   (3f)
0\leq x_{i_{1},i_{2}}\leq u_{i_{1},i_{2}},\quad i_{1}=1,...,k_{1},\ i_{2}=1,...,k_{2}.   (3g)

Formulation (3) is a minimum cost network flow problem. The corresponding network is shown in Figure 1, where all capacity lower bounds are 0, and each arc has a cost per unit flow and an upper bound associated with it. Each node of type (1,i_{1}) has a supply of \ell_{1,i_{1}} and each node of type (2,i_{2}) has a demand of \ell_{2,i_{2}}. In Figure 1, for each i_{1} and i_{2}, the flow on the arc between node (1,i_{1}) and node (2,i_{2}) represents the variable x_{i_{1},i_{2}}; the arc from node 1 to node (1,i_{1}) represents the excess e_{1,i_{1}}; the arc to node 1 from any node (1,i_{1}) represents the deficit d_{1,i_{1}}; the arc from node 2 to any node (2,i_{2}) represents the deficit d_{2,i_{2}}; the arc to node 2 from any node (2,i_{2}) represents the excess e_{2,i_{2}}. So the per unit arc cost is 1 for arcs between node 1 or 2 and any node in \{(1,1),(1,2),...,(1,k_{1})\}\cup\{(2,1),(2,2),...,(2,k_{2})\}. All other arcs have cost 0. It is easy to verify that constraints (3b) correspond to the flow balance at nodes (1,i_{1}) for all i_{1}, and constraints (3c) correspond to the flow balance at nodes (2,i_{2}) for all i_{2}. Constraint (3d) corresponds to the flow balance at node 1, and constraint (3e) corresponds to the flow balance at node 2.

[Figure 1 depicts the network: nodes (1,1),...,(1,k_{1}) with supplies \ell_{1,i_{1}}, nodes (2,1),...,(2,k_{2}) with demands \ell_{2,i_{2}}, the auxiliary nodes 1 and 2, intersection arcs labeled (cost, upper bound) = (0,u_{i_{1},i_{2}}), and excess/deficit arcs labeled (1,\infty).]
Figure 1: MCNF graph corresponding to formulation (3).
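A minimal Python sketch of this construction, assuming the networkx package is available (node names such as ("hub", 1) and the data layout are our own illustrative choices): it builds the graph of Figure 1 and solves formulation (3) as a minimum cost flow.

import networkx as nx

def solve_two_covariate_min_imbalance(l1, l2, u):
    # l1[i1], l2[i2]: treatment level sizes under covariates 1 and 2
    # u[(i1, i2)]: number of control samples in the level intersection L'_{1,i1} ∩ L'_{2,i2}
    G = nx.DiGraph()
    hub1, hub2 = ("hub", 1), ("hub", 2)          # the nodes labeled 1 and 2 in Figure 1
    G.add_node(hub1, demand=0)
    G.add_node(hub2, demand=0)
    for i1, l in enumerate(l1):
        G.add_node((1, i1), demand=-l)           # supply l_{1,i1}
        G.add_edge(hub1, (1, i1), weight=1)      # excess e_{1,i1}
        G.add_edge((1, i1), hub1, weight=1)      # deficit d_{1,i1}
    for i2, l in enumerate(l2):
        G.add_node((2, i2), demand=l)            # demand l_{2,i2}
        G.add_edge(hub2, (2, i2), weight=1)      # deficit d_{2,i2}
        G.add_edge((2, i2), hub2, weight=1)      # excess e_{2,i2}
    for (i1, i2), cap in u.items():
        G.add_edge((1, i1), (2, i2), weight=0, capacity=cap)   # x_{i1,i2}
    flow = nx.min_cost_flow(G)
    x = {(i1, i2): flow[(1, i1)][(2, i2)] for (i1, i2) in u}   # optimal intersection sizes
    return x, nx.cost_of_flow(G, flow)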
Theorem 2.

The 2-covariate min-imbalance problem is solved as a minimum cost network flow problem in O(n\cdot(n^{\prime}+n\log n)) time.

Proof.

We choose to solve the network flow problem of the 2-covariate min-imbalance problem with the successive shortest paths algorithm, which is particularly efficient for an MCNF with "small" total supply.

The successive shortest path algorithm iteratively selects a node s with excess supply (supply not yet sent to some demand node) and a node t with unfulfilled demand, and sends flow from s to t along a shortest path in the residual network [4], [5], [6]. The algorithm terminates when the flow satisfies all the flow balance constraints. Since at each iteration the number of remaining units of supply to be sent is reduced by at least one unit, the number of iterations is bounded by the total amount of supply. For our problem the total supply is O(n).

At each iteration, the shortest path can be computed with Dijkstra's algorithm of complexity O(|A|+|V|\log|V|), where |V| is the number of nodes and |A| is the number of arcs [7], [8]. In our formulation, |V| is O(k_{1}+k_{2}), which is at most O(n). Since the number of nonempty sets L^{\prime}_{1,i_{1}}\cap L^{\prime}_{2,i_{2}} is at most \min\{n^{\prime},k_{1}k_{2}\}, the number of arcs |A| is O(\min\{n^{\prime},k_{1}k_{2}\}).

Hence, the total running time of applying the successive shortest path algorithm with node potentials on our formulation is O(n\cdot(n^{\prime}+n\log n)). ∎

5 Maximum flow formulation for P=2

Here we show a maximum flow (max-flow) formulation (see Appendix D for a generic formulation of the max-flow problem) for the min-imbalance problem with 2 covariates. Unlike the previous formulations, the maximum flow solution requires further manipulation in order to derive an optimal solution to the min-imbalance problem with 2 covariates. The max-flow graph is illustrated in Figure 2. The source node s can send at most \ell_{1,i_{1}} units of flow to node (1,i_{1}) for each i_{1}=1,...,k_{1}, the sink node can receive at most \ell_{2,i_{2}} units of flow from node (2,i_{2}) for each i_{2}=1,...,k_{2}, and there can be a flow from node (1,i_{1}) to node (2,i_{2}) with amount bounded by u_{i_{1},i_{2}}, for i_{1}=1,...,k_{1}, i_{2}=1,...,k_{2}.

[Figure 2 depicts the max-flow network: source s with arcs of capacity \ell_{1,i_{1}} into nodes (1,1),...,(1,k_{1}), intersection arcs of capacity u_{i_{1},i_{2}} into nodes (2,1),...,(2,k_{2}), and arcs of capacity \ell_{2,i_{2}} into the sink t; arc legend: upper bound.]
Figure 2: Maximum flow graph

Let the maximum flow value for the max-flow problem presented in Figure 2 be denoted by f^{*}, and let \mathbf{x}^{*} be the optimal flow vector, with x^{*}_{i_{1},i_{2}} denoting the flow amount between node (1,i_{1}) and node (2,i_{2}). It is obvious that \sum_{i_{1}=1}^{k_{1}}\sum_{i_{2}=1}^{k_{2}}x^{*}_{i_{1},i_{2}}=f^{*}\leq\sum_{i_{1}=1}^{k_{1}}\ell_{1,i_{1}}=n. That means an initial selection S^{\prime} generated by selecting the prescribed number of control samples as in the optimal max-flow solution, i.e., selecting x^{*}_{i_{1},i_{2}} control samples from L^{\prime}_{1,i_{1}}\cap L^{\prime}_{2,i_{2}}, is of size f^{*}. In order to get a feasible solution for the min-imbalance problem it is required to select n-f^{*} additional control samples. The selection S^{\prime} has no positive excess, only non-negative deficits with respect to the levels of both covariates. This is because \sum_{i_{2}=1}^{k_{2}}x^{*}_{i_{1},i_{2}}\leq\ell_{1,i_{1}} due to the upper bound of the arc from s to (1,i_{1}) for each i_{1}, and \sum_{i_{1}=1}^{k_{1}}x^{*}_{i_{1},i_{2}}\leq\ell_{2,i_{2}} due to the upper bound of the arc from (2,i_{2}) to t for each i_{2}. To recover an optimal solution for the min-imbalance problem from the initial set S^{\prime}, we add up to n-f^{*} unselected control samples, one at a time, each corresponding to a level with positive deficit under either covariate 1 or 2. This process is repeated until either n-f^{*} such control samples are found, or until no such control sample exists. In the latter case, to complete the size of the selection, any randomly selected control samples are added. Algorithm 1 is a formal statement of this process of recovering an optimal solution of the min-imbalance problem from the initial selection S^{\prime}.
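A minimal sketch of this first stage, assuming the networkx package is available (all names and the data layout below are our own illustrative choices): it builds the graph of Figure 2 and extracts f^{*} and the optimal intersection flows x^{*}.

import networkx as nx

def max_flow_stage(l1, l2, u):
    # l1, l2: treatment level sizes; u[(i1, i2)]: control level-intersection sizes
    G = nx.DiGraph()
    for i1, l in enumerate(l1):
        G.add_edge("s", (1, i1), capacity=l)
    for i2, l in enumerate(l2):
        G.add_edge((2, i2), "t", capacity=l)
    for (i1, i2), cap in u.items():
        G.add_edge((1, i1), (2, i2), capacity=cap)
    f_star, flow = nx.maximum_flow(G, "s", "t")
    x_star = {(i1, i2): flow[(1, i1)][(2, i2)] for (i1, i2) in u}
    return f_star, x_star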

Algorithm 1
Initialization step: Select x^{*}_{i_{1},i_{2}} control samples from L^{\prime}_{1,i_{1}}\cap L^{\prime}_{2,i_{2}} in set S^{\prime}.
while |S^{\prime}|<n do
     if there exists an unselected control sample j\notin S^{\prime} whose covariate 1 level is i_{1} and covariate 2 level is i_{2}, such that |S^{\prime}\cap L^{\prime}_{1,i_{1}}|<\ell_{1,i_{1}} or |S^{\prime}\cap L^{\prime}_{2,i_{2}}|<\ell_{2,i_{2}} then
          S^{\prime}\leftarrow S^{\prime}\cup\{j\}.
     else
          Let S^{\prime\prime}=S^{\prime} and let S^{+} be any n-|S^{\prime}| control samples \notin S^{\prime}. Set S^{\prime}\leftarrow S^{\prime}\cup S^{+}.
     end if
end while
Output S^{\prime}.
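The following Python sketch (our own illustrative implementation, working with per-intersection counts rather than individual sample ids) mirrors Algorithm 1: it starts from the max-flow counts x^{*}, greedily adds samples from intersections whose covariate 1 or covariate 2 level still has a deficit, and pads with arbitrary samples if no such sample remains.

def recover_selection(x_star, l1, l2, u, n):
    # x_star[(i1, i2)]: optimal max-flow values; u[(i1, i2)]: intersection sizes
    # l1, l2: treatment level sizes; n: required selection size
    sel = {key: x_star.get(key, 0) for key in u}        # selected count per intersection
    cov1 = [sum(sel[k] for k in sel if k[0] == i1) for i1 in range(len(l1))]
    cov2 = [sum(sel[k] for k in sel if k[1] == i2) for i2 in range(len(l2))]
    size = sum(sel.values())
    progress = True
    while size < n and progress:                        # the if-branch of Algorithm 1
        progress = False
        for (i1, i2), cap in u.items():
            if size < n and sel[(i1, i2)] < cap and (cov1[i1] < l1[i1] or cov2[i2] < l2[i2]):
                sel[(i1, i2)] += 1
                cov1[i1] += 1; cov2[i2] += 1; size += 1
                progress = True
    for (i1, i2), cap in u.items():                     # the else-branch: the padding set S+
        while size < n and sel[(i1, i2)] < cap:
            sel[(i1, i2)] += 1
            cov1[i1] += 1; cov2[i2] += 1; size += 1
    return sel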

To show that Algorithm 1 provides an optimal solution to the min-imbalance problem, we distinguish two cases of Algorithm 1: (1) S^{+}=\emptyset and (2) |S^{+}|\geq 1. In the first case, there is, at each iteration, at least one control sample that belongs to some level with positive deficit. In Theorem 3 we prove that the output S^{\prime} of Algorithm 1 is an optimal solution in this case.

Theorem 3.

If S^{+}=\emptyset then the output selection S^{\prime} of Algorithm 1 is optimal for the min-imbalance problem, with an optimal objective value of 2(n-f^{*}).

Proof.

First, we show that the total imbalance of the selection S^{\prime} is IM(S^{\prime})=2(n-f^{*}). At the initialization step the selection S^{\prime} has only deficits for all levels, with total deficit for covariate 1, \sum_{i_{1}=1}^{k_{1}}(\ell_{1,i_{1}}-\sum_{i_{2}=1}^{k_{2}}x^{*}_{i_{1},i_{2}})=n-f^{*}, and total deficit for covariate 2, \sum_{i_{2}=1}^{k_{2}}(\ell_{2,i_{2}}-\sum_{i_{1}=1}^{k_{1}}x^{*}_{i_{1},i_{2}})=n-f^{*}. At each iteration, there is an added control sample, say in L^{\prime}_{1,i_{1}}\cap L^{\prime}_{2,i_{2}}, such that either L^{\prime}_{1,i_{1}} or L^{\prime}_{2,i_{2}} has a positive deficit with respect to S^{\prime}. It is however impossible for both L^{\prime}_{1,i_{1}} and L^{\prime}_{2,i_{2}} to have a positive deficit with respect to S^{\prime} since otherwise there is an s,t-augmenting path, from s to node (1,i_{1}), to node (2,i_{2}), to t, along which the flow can be increased by at least one unit. This is in contradiction to the optimality of the max-flow solution \mathbf{x}^{*}. As a result, at each iteration where a control sample is added, the total deficit is reduced by one unit, and the total excess is increased by one unit. Thus, at each iteration of the if step, the sum of total deficit and excess remains the same, namely 2(n-f^{*}).

Suppose, by contradiction, that there exists a selection S^{*} for which the total imbalance is lower, IM(S^{*})<2(n-f^{*}). We repeat the following iterative procedure of removing samples from S^{*} until there is no positive excess remaining: while there is a level of either covariate with positive excess with respect to S^{*}, we remove one sample of S^{*} that belongs to this level. Each such iteration reduces the total excess by at least 1 unit and increases the total deficit by at most 1 unit, and therefore the sum of total deficit and excess does not increase. So when this iterative procedure ends, the total excess is zero and the total deficit is at most IM(S^{*}). Let x_{i_{1},i_{2}} be the number of samples remaining in S^{*}\cap L^{\prime}_{1,i_{1}}\cap L^{\prime}_{2,i_{2}} after this excess removing procedure. Since there is no positive excess, \mathbf{x} is a feasible solution for the max-flow problem with the flow between node (1,i_{1}) and node (2,i_{2}) equal to x_{i_{1},i_{2}}. The sum of deficits associated with this remaining set is n-\sum_{i_{1}=1}^{k_{1}}\sum_{i_{2}=1}^{k_{2}}x_{i_{1},i_{2}} for covariate 1 and n-\sum_{i_{1}=1}^{k_{1}}\sum_{i_{2}=1}^{k_{2}}x_{i_{1},i_{2}} for covariate 2, for a total of 2(n-\sum_{i_{1}=1}^{k_{1}}\sum_{i_{2}=1}^{k_{2}}x_{i_{1},i_{2}}), which is at most IM(S^{*}). Therefore, the total flow value, \sum_{i_{1}=1}^{k_{1}}\sum_{i_{2}=1}^{k_{2}}x_{i_{1},i_{2}}, is at least n-\frac{IM(S^{*})}{2}. Since n-\frac{IM(S^{*})}{2}>n-(n-f^{*})=f^{*}, it follows that the value of the feasible flow induced by the set S^{*} is greater than the maximum flow value f^{*}, which contradicts the optimality of f^{*}. ∎

We now address the second case where |S^{+}|\geq 1 and |S^{\prime\prime}|<n. In this case, the total imbalance of S^{\prime\prime} is, from the arguments in the proof of Theorem 3, IM(S^{\prime\prime})=2(n-f^{*}). Each one of the S^{+} samples selected adds 1 unit of excess to each covariate, resulting in the addition of 2 units of excess to the imbalance. Therefore, the total imbalance of the output solution is 2(n-f^{*})+2|S^{+}|. We next show what the value of |S^{+}| is, and then demonstrate that any feasible selection to the min-imbalance problem has total imbalance of at least 2(n-f^{*})+2|S^{+}|. This will prove that the output of Algorithm 1, S^{\prime}, is an optimal solution to the min-imbalance problem.

It will be useful to consider an equivalent form of Algorithm 1. For each level i of covariate p that has |S^{\prime}\cap L^{\prime}_{p,i}|<\ell_{p,i}, we add the largest number possible of available control samples in L^{\prime}_{p,i}, so long as the total does not exceed n. This number is \min\{\ell_{p,i}-|S^{\prime}\cap L^{\prime}_{p,i}|,\ell^{\prime}_{p,i}-|S^{\prime}\cap L^{\prime}_{p,i}|\}. Let \bar{\ell}_{p,i}=\min\{\ell_{p,i},\ell^{\prime}_{p,i}\}; then for each p,i that has |S^{\prime}\cap L^{\prime}_{p,i}|<\ell_{p,i} we add \bar{\ell}_{p,i}-|S^{\prime}\cap L^{\prime}_{p,i}| previously unselected control samples to S^{\prime}. The outcome of this equivalent procedure is exactly the same as that of Algorithm 1. In the case that |S^{+}|\geq 1, there is an insufficient number of control samples to add to S^{\prime} after the largest number possible has been added for all levels. Therefore, at the end of this process, the if step returns that another unselected control sample does not exist, and the total number of control samples of S^{\prime\prime}, for each level i of covariate p, is \bar{\ell}_{p,i}.

Lemma 2.

If |S^{+}|\geq 1 (and |S^{\prime\prime}|=n-|S^{+}|<n) then |S^{+}|=n-(\bar{\ell}_{1}+\bar{\ell}_{2}-f^{*}), where \bar{\ell}_{1}=\sum_{i_{1}=1}^{k_{1}}\bar{\ell}_{1,i_{1}} and \bar{\ell}_{2}=\sum_{i_{2}=1}^{k_{2}}\bar{\ell}_{2,i_{2}}.

Proof.

At the initialization step of Algorithm 1, |S^{\prime}|=f^{*} and the total deficit is 2(n-f^{*}). Each time a control sample is added to S^{\prime} in the if step, the total deficit is decreased by exactly 1 unit. So we can derive the value of |S^{\prime\prime}| when the algorithm terminates if we know the total deficit when the algorithm terminates. Note that the total excess may change, but we only consider here the deficit part of the imbalance.

From the discussion above, the total number of control samples of S^{\prime\prime}, for each level i of covariate p, is \bar{\ell}_{p,i}. We denote \bar{\ell}_{1}=\sum_{i_{1}=1}^{k_{1}}\bar{\ell}_{1,i_{1}}, and \bar{\ell}_{2}=\sum_{i_{2}=1}^{k_{2}}\bar{\ell}_{2,i_{2}}. Since \sum_{i_{1}=1}^{k_{1}}\ell_{1,i_{1}}=n and \sum_{i_{2}=1}^{k_{2}}\ell_{2,i_{2}}=n, the sum of deficits of set S^{\prime\prime} under covariate 1 is \sum_{i_{1}=1}^{k_{1}}\ell_{1,i_{1}}-\bar{\ell}_{1,i_{1}}=n-\bar{\ell}_{1}, and the sum of deficits under covariate 2 equals \sum_{i_{2}=1}^{k_{2}}\ell_{2,i_{2}}-\bar{\ell}_{2,i_{2}}=n-\bar{\ell}_{2}. It follows that the sum of deficits of S^{\prime\prime} is 2n-\bar{\ell}_{1}-\bar{\ell}_{2}.

Since the initial set S^{\prime}, which has a total deficit of 2(n-f^{*}), has its deficit reduced through Algorithm 1 to 2n-\bar{\ell}_{1}-\bar{\ell}_{2} in the set S^{\prime\prime}, the additional number of control samples in S^{\prime\prime} that were added to the initial f^{*} control samples is 2(n-f^{*})-(2n-\bar{\ell}_{1}-\bar{\ell}_{2})=\bar{\ell}_{1}+\bar{\ell}_{2}-2f^{*}. Therefore, the size of S^{\prime\prime} is f^{*}+(\bar{\ell}_{1}+\bar{\ell}_{2}-2f^{*})=\bar{\ell}_{1}+\bar{\ell}_{2}-f^{*}. This number is less than n and the size of S^{+} then satisfies |S^{+}|=n-(\bar{\ell}_{1}+\bar{\ell}_{2}-f^{*}). ∎

Corollary 1.

If |S^{+}|\geq 1 when Algorithm 1 terminates, the total imbalance of the output solution S^{\prime} is IM(S^{\prime})=4n-2\bar{\ell}_{1}-2\bar{\ell}_{2}.

Proof.

The imbalance of S^{\prime\prime}, as in the proof of Theorem 3, is equal to 2(n-f^{*}). Since each sample in S^{+} adds two units to the imbalance, the total imbalance of the output solution S^{\prime} is IM(S^{\prime})=2(n-f^{*})+2(n-(\bar{\ell}_{1}+\bar{\ell}_{2}-f^{*})), which is equal to 4n-2\bar{\ell}_{1}-2\bar{\ell}_{2}, as stated. ∎

Next, we prove that this is the minimum total imbalance achievable.

Theorem 4.

For any selection of size n, the total imbalance must be greater than or equal to 4n-2\bar{\ell}_{1}-2\bar{\ell}_{2}.

Proof.

For the optimal selection S^{*} of size n, let IM(S^{*}) be the total imbalance of S^{*}. We first classify the samples in S^{*} into three types, S_{1}, S_{2} and S_{3}, that form a partition of S^{*}, using the 3-type Classification Procedure.

In the procedure we use the variable dis(p,i) to denote the value of dis(S,p,i), the discrepancy for selection S at level i under covariate p. With this notation the excess of the corresponding level is e(S,p,i)=\max\{0,dis(p,i)\}, and the deficit is d(S,p,i)=\max\{0,-dis(p,i)\}.

3-type Classification Procedure
Initialize

      S\leftarrow S^{*},\ S_{1}\leftarrow\emptyset,\ S_{2}\leftarrow\emptyset,\ S_{3}\leftarrow\emptyset;
      Let dis(p,i)\leftarrow|S\cap L^{\prime}_{p,i}|-\ell_{p,i} for p=1,2,\ i=1,...,k_{p};
S_{1} selection:
      While there exists a sample j in S whose covariate 1 level is i_{1}, covariate 2 level is i_{2}, such that dis(1,i_{1})>0 and dis(2,i_{2})>0, pick j;
            S_{1}\leftarrow S_{1}\cup\{j\},\ S\leftarrow S-\{j\},\ dis(1,i_{1})\leftarrow dis(1,i_{1})-1,\ dis(2,i_{2})\leftarrow dis(2,i_{2})-1;
      end while;
S_{2} selection:
      While there exists a sample j in S whose covariate 1 level is i_{1}, covariate 2 level is i_{2}, such that dis(1,i_{1})>0 or dis(2,i_{2})>0, pick j;
            S_{2}\leftarrow S_{2}\cup\{j\},\ S\leftarrow S-\{j\},\ dis(1,i_{1})\leftarrow dis(1,i_{1})-1,\ dis(2,i_{2})\leftarrow dis(2,i_{2})-1;
      end while;
S_{3} selection:
      S_{3}\leftarrow S;
end.

The output S_{1},S_{2},S_{3} of the procedure is not unique, since it depends on the order in which samples are picked. However, the statements of the theorem hold for any output of the procedure. Note that in the procedure, whenever a sample is picked, any dis(p,i) can only go down. For that reason, once the S_{1} selection ends, there will not be another sample in S for which the discrepancy values of the corresponding levels under the two covariates are both positive. Furthermore, once the S_{1} and S_{2} selections are done, dis(p,i)\leq 0 for each p,i. That means |S_{3}\cap L^{\prime}_{p,i}|\leq\ell_{p,i} for all p,i.

Let the sizes of the three subsets be denoted by s_{1}=|S_{1}|,\ s_{2}=|S_{2}|,\ s_{3}=|S_{3}|.

We claim that the total imbalance of the samples in S_{3} is IM(S^{*})-2s_{1}. In the S_{1} selection part of the procedure, each sample picked into S_{1} reduces the total excess by 2. For each sample selected in the S_{2} selection part of the procedure, the excess is reduced by 1, and the deficit is increased by 1. So for each sample selected in the S_{2} selection step, the total imbalance does not change. Therefore, the total imbalance of the samples in S_{3} is IM(S^{*})-2s_{1}.

On the other hand, the total imbalance of S_{3}, which equals the sum of deficits over both covariates (all excesses equal zero as |S_{3}\cap L^{\prime}_{p,i}|\leq\ell_{p,i}), is 2(n-s_{3}). That says,

IM(S^{*})-2s_{1}=2(n-s_{3}).

Hence,

IM(S^{*})=2(n-s_{3})+2s_{1}=2(n-s_{3})+2(n-s_{2}-s_{3})=4n-2s_{2}-4s_{3}.

Here, the second equality comes from the fact that s_{1}+s_{2}+s_{3}=n.

Next, we show that s_{2}\leq(\bar{\ell}_{1}-s_{3})+(\bar{\ell}_{2}-s_{3}). Let the samples in S_{2} be ordered according to the order in which they were picked, j_{1},j_{2},...,j_{s_{2}}. We now add these samples to S_{3}, in the reverse order j_{s_{2}},...,j_{1}. When each sample j_{q} is added to S_{3}, the deficit is reduced by exactly 1 unit. Once all the S_{2} samples are added to S_{3}, the deficit at each level of S_{2}\cup S_{3} is zero, or alternatively, dis(S_{2}\cup S_{3},p,i)=|(S_{2}\cup S_{3})\cap L^{\prime}_{p,i}|-\ell_{p,i}\geq 0 for each p,i.

We now consider the total deficit of S_{3}: By the definition of \bar{\ell}_{1} and \bar{\ell}_{2}, the positive deficit of S_{3} under covariate 1 is at most \bar{\ell}_{1}-s_{3} and the positive deficit of S_{3} under covariate 2 is at most \bar{\ell}_{2}-s_{3}. That means the size of S_{2} is bounded by the amount of this deficit, s_{2}\leq(\bar{\ell}_{1}-s_{3})+(\bar{\ell}_{2}-s_{3}).

Now we have,

s_{2}\leq(\bar{\ell}_{1}-s_{3})+(\bar{\ell}_{2}-s_{3})\ \Leftrightarrow\ s_{2}+2s_{3}\leq\bar{\ell}_{1}+\bar{\ell}_{2}\ \Leftrightarrow\ IM(S^{*})=4n-2s_{2}-4s_{3}\geq 4n-2\bar{\ell}_{1}-2\bar{\ell}_{2}.

We conclude that the total imbalance IM(S^{*}) is at least 4n-2\bar{\ell}_{1}-2\bar{\ell}_{2}. That implies that the selection output of Algorithm 1, S^{\prime}, which has a total imbalance of 4n-2\bar{\ell}_{1}-2\bar{\ell}_{2}, is optimal. ∎

The conclusion from Corollary 1 and Theorem 4 is that for |S^{+}|\geq 1 when Algorithm 1 terminates, the output solution S^{\prime} is an optimal selection for the min-imbalance problem. Together with Theorem 3, we have that Algorithm 1 outputs an optimal selection for the min-imbalance problem using the max-flow solution to the flow problem in Figure 2 as input.

Theorem 5.

The maximum flow formulation of the 2-covariate min-imbalance problem is solved in O(n^{\prime}\cdot\min\{n^{\frac{2}{3}},n^{\prime\frac{1}{2}}\}\cdot\log^{2}n) time.

Proof.

We choose here the binary blocking flow algorithm of Goldberg and Rao [9] for solving the max-flow problem shown in Figure 2, because the complexity of this algorithm depends on the maximum arc capacity, which is a small quantity in our formulation.

The complexity of the binary blocking flow algorithm for a graph G=(V,A) is O(|A|\cdot\min\{|V|^{\frac{2}{3}},|A|^{\frac{1}{2}}\}\cdot\log\frac{|V|^{2}}{|A|}\log U), where |V| is the number of nodes, |A| is the number of arcs, and U is the maximum arc capacity. As argued earlier for the minimum cost network flow formulation, the number of nodes in the network |V| is O(k_{1}+k_{2}), which is no more than O(n); and the number of arcs is bounded by \min\{n^{\prime},k_{1}k_{2}\}. Although u_{i_{1},i_{2}} could be as large as n^{\prime}, a feasible flow in our maximum flow formulation cannot have more than \ell_{1,i_{1}} units of flow on the arc from node (1,i_{1}) to node (2,i_{2}). Thus the maximum arc capacity U is effectively O(n). The ratio \frac{|V|^{2}}{|A|}\leq\frac{(k_{1}+k_{2})^{2}}{k_{1}+k_{2}}\leq n. Hence, the running time of applying the binary blocking flow algorithm to our max-flow problem is O(n^{\prime}\cdot\min\{n^{\frac{2}{3}},n^{\prime\frac{1}{2}}\}\cdot\log^{2}n).

The complexity of Algorithm 1 is O(n), as the number of iterations is bounded by n, and each iteration takes O(1) steps.

Therefore, the running time of solving the min-imbalance problem as a max-flow problem is O(n^{\prime}\cdot\min\{n^{\frac{2}{3}},n^{\prime\frac{1}{2}}\}\cdot\log^{2}n). ∎

6 Conclusions

We present here new insights into the minimum imbalance problem, which involves a selection of control samples so that the excess and deficit with respect to the treatment samples' levels are minimized. We demonstrate that an integer programming formulation of the problem on two covariates has a totally unimodular constraint matrix. We then present two efficient approaches to solve the problem for two covariates. One is based on minimum cost network flow, and the other, which is in general more efficient still, is based on a maximum flow formulation of the problem. With these flow algorithms the problem is easily solved, in theory and in practice. We further provide, in Appendix A, a proof of NP-hardness of the problem on three covariates, which implies that, unless P=NP, no efficient algorithm exists for the minimum imbalance problem in the presence of three or more covariates.

References

  • [1] P. R. Rosenbaum, R. N. Ross, J. H. Silber, Minimum distance matched sampling with fine balance in an observational study of treatment for ovarian cancer, Journal of the American Statistical Association 102 (477) (2007) 75–83. doi:10.1198/016214506000001059.
  • [2] D. Yang, D. S. Small, J. H. Silber, P. R. Rosenbaum, Optimal matching with minimal deviation from fine balance in a study of obesity and surgical outcomes, Biometrics 68 (2) (2012) 628–636. doi:10.1111/j.1541-0420.2011.01691.x.
  • [3] M. Bennett, J. P. Vielma, J. R. Zubizarreta, Building representative matched samples with multi-valued treatments in large observational studies, Journal of Computational and Graphical Statistics 0 (0) (2020) 1–29. doi:10.1080/10618600.2020.1753532.
  • [4] W. S. Jewell, Optimal flow through networks, in: Operations Research, Vol. 6, 1958, pp. 633–633.
  • [5] M. Iri, A new method of solving transportation-network problems, Journal of the Operations Research Society of Japan 3 (1) (1960) 2.
  • [6] R. Busacker, P. J. Gowen, A procedure for determining minimal-cost network flow patterns, Tech. rep., ORO Technical Report 15, Operational Research Office, Johns Hopkins University (1961).
  • [7] N. Tomizawa, On some techniques useful for solution of transportation network problems, Networks 1 (2) (1971) 173–194. doi:10.1002/net.3230010206.
  • [8] J. Edmonds, R. M. Karp, Theoretical improvements in algorithmic efficiency for network flow problems, Journal of the ACM (JACM) 19 (2) (1972) 248–264. doi:10.1145/321694.321699.
  • [9] A. V. Goldberg, S. Rao, Beyond the flow decomposition barrier, Journal of the ACM (JACM) 45 (5) (1998) 783–797.
  • [10] R. M. Karp, Reducibility among combinatorial problems, in: Complexity of computer computations, Springer, 1972, pp. 85–103.
  • [11] M. R. Garey, D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman, New York, NY, USA, 1978.
  • [12] I. Heller, C. Tompkins, Integral boundary points of convex polyhedra, Linear inequalities and related systems 38 (1956) 223–246.
  • [13] I. Heller, C. Tompkins, An extension of a theorem of dantzig’s, Linear inequalities and related systems 38 (1956) 247–254.

Appendix A NP-hardness for P\geq 3

Bennett et al. [3] state that the min-imbalance problem is NP-hard when P\geq 3, but presented no proof. Here, we provide a proof of this complexity status. The implication of this result is that, unless P=NP, no efficient algorithm exists for the min-imbalance problem in the presence of three or more covariates.

The reduction is from the 3-Dimensional Matching problem, one of Karp’s 21 NP-complete problems [10][11].
3-Dimensional Matching: Given a finite set X and a set of triplets U\subset X\times X\times X, is there a subset M\subseteq U such that |M|=|X| and no two elements of M agree in any coordinate?

Lemma 3.

The min-imbalance problem is NP-hard even when P=3.

Proof.

Given an instance of the 3-dimensional matching problem, we define an instance of the min-imbalance problem with P=3, k_{p}=|X| for p=1,2,3, and \ell_{p,k}=1 for each p,k. For each triplet u in U, we have a corresponding control sample c in L^{\prime} (the control group) such that the 3 covariates' levels of c are defined by u. The decision problem for min-imbalance is stated as: is there a subset S\subseteq L^{\prime} such that |L^{\prime}_{p,k}\cap S|=\ell_{p,k} for all p and k?

We observe that if there exists a subset M\subseteq U such that |M|=|X| and no two elements of M agree in any coordinate, then in each coordinate, each element of X must be covered by M exactly once. That means, if we take S to be the subset of control samples corresponding to M, then |L^{\prime}_{p,k}\cap S|=1=\ell_{p,k} for all p and k.

Conversely, if there exists a subset S\subseteq L^{\prime} such that |L^{\prime}_{p,k}\cap S|=\ell_{p,k}=1 for all p and k, then the subset M that corresponds to S does not have two elements that agree in any coordinate, and |M|=|S|=|X|.

Therefore, the 3-dimensional matching problem reduces to the 3-covariate min-imbalance problem, and hence the min-imbalance problem is NP-hard even when P=3. ∎
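The mapping used in this reduction is simple enough to state as a short Python sketch (an illustration under our own data conventions, not part of the formal proof):

def reduce_3dm_to_min_imbalance(X, U):
    # X: list of elements; U: list of triplets (a, b, c) with entries from X
    # One control sample per triplet; its level under covariate p is the p-th coordinate.
    control_labels = [tuple(u) for u in U]
    # The treatment group is arranged so that every level of every covariate has size 1.
    treat_sizes = {p: {x: 1 for x in X} for p in range(3)}
    n = len(X)          # required selection size
    return control_labels, treat_sizes, n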

Appendix B Selection of size different from n

The complexity of the min-imbalance problem on two covariates where the required size of the selection is q<n^{\prime}, for q different from the size of the treatment group n, is the same as for the case where the required size is n. Both our methods, described in Section 4 and Section 5, apply with minor modifications as described next.

The minimum cost network flow method. Consider the network flow formulation (3). With the notation defined in Section 3, \sum_{i=1}^{k_{1}}\left(e_{1,i}-d_{1,i}\right)=\sum_{i=1}^{k_{1}}\left(|S\cap L^{\prime}_{1,i}|-\ell_{1,i}\right)=|S|-n. Therefore, if |S| is required to be q, constraint (3d) is changed to

-\sum_{i_{1}=1}^{k_{1}}d_{1,i_{1}}+\sum_{i_{1}=1}^{k_{1}}e_{1,i_{1}}=q-n.

Similarly, constraint (3e) is modified to

\sum_{i_{2}=1}^{k_{2}}d_{2,i_{2}}-\sum_{i_{2}=1}^{k_{2}}e_{2,i_{2}}=n-q.

In Figure 1, the supply of node 1 changes from 0 to q-n, and the supply of node 2 changes from 0 to n-q. Everything else remains the same. With these modifications it is easy to see that the optimal flow corresponds to an optimal solution to the min-imbalance problem with a selection of size q.
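In terms of the earlier networkx sketch given after Figure 1 (itself only an illustration), this modification amounts to adjusting the demands of the two hub nodes; the helper below is hypothetical and assumes that graph layout.

def set_selection_size(G, n, q):
    # networkx convention: a node with demand d must have inflow - outflow = d,
    # so a supply of q - n at hub node 1 is demand -(q - n), and similarly for hub node 2.
    G.nodes[("hub", 1)]["demand"] = -(q - n)
    G.nodes[("hub", 2)]["demand"] = -(n - q)
    return G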

The maximum flow method. The maximum flow problem used for q\neq n is the same as for the case of n. There is, however, a slight modification in recovering the optimal selection from the max-flow solution.

If the size q is less than the maximum flow amount f^{*}, then after the initialization step in Algorithm 1, any q samples out of the selection S^{\prime} form an optimal selection. This is because each sample selected in the initialization step reduces the total deficit by 2 units while keeping the total excess at zero, which is the best scenario possible.

If q\geq f^{*}, then Algorithm 1 with n replaced by q gives the optimal selection. This follows from the proofs in Section 5 with n replaced by q.

Appendix C The minimum \kappa-imbalance and \kappa-Matching-Balance problems

The minimum \kappa-imbalance problem is to find a selection S of control samples of size \kappa n, where \kappa is an integer between 1 and \frac{n^{\prime}}{n}, and n is the size of the treatment group, so as to minimize the \kappa-imbalance, defined as \sum_{p=1}^{P}\sum_{i=1}^{k_{p}}||S\cap L^{\prime}_{p,i}|-\kappa\ell_{p,i}|. For \kappa=1 the problem is the min-imbalance problem, and therefore it appears that the \kappa-imbalance problem is more general than the min-imbalance problem. We show here, however, that the minimum \kappa-imbalance problem is reducible to the min-imbalance problem and therefore the two problems are equivalent.

To reduce the minimum \kappa-imbalance problem to the min-imbalance problem we replace each treatment sample by \kappa copies. This new treatment group has a size of \kappa n and the size of level i under covariate p is \kappa\ell_{p,i} for each p and i. Hence the total imbalance defined for the new treatment group is the same as the \kappa-imbalance of the original treatment group. This implies in particular that, for the two-covariate minimum \kappa-imbalance problem, the methods in Section 4 and Section 5 provide an optimal solution in polynomial time. If this solution, given in terms of the sizes of the level-intersections (the numbers of control samples selected in each L^{\prime}_{1,i_{1}}\cap L^{\prime}_{2,i_{2}} for i_{1}=1,\ldots,k_{1} and i_{2}=1,\ldots,k_{2}), is unique, then the solution to the second stage of MB solves that problem optimally and in polynomial time.
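A one-line illustration of this reduction (the names are our own): scale the treatment level sizes by \kappa and set the required selection size to \kappa n; the two-covariate solvers sketched earlier can then be applied unchanged.

def kappa_to_min_imbalance(treat_sizes, n, kappa):
    # treat_sizes[p][i] = l_{p,i}; replicating each treatment sample kappa times
    # multiplies every level size by kappa and the selection size becomes kappa * n.
    scaled = [[kappa * l for l in sizes] for sizes in treat_sizes]
    return scaled, kappa * n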

For any number of covariates P, an optimal solution to the second stage of MB, for given sizes of level intersections s_{i_{1},i_{2},\ldots,i_{P}}, is obtained from the following minimum cost flow problem. First there is a bipartite graph with the treatment samples each represented by a node on one side, and the control samples each represented by a node on the other side. The cost on each arc between a treatment sample and a control sample is the "distance" metric between the two, and the arc capacity is 1. Each treatment sample has a demand of \kappa. For each non-zero level intersection there is a source node with supply of s_{i_{1},i_{2},\ldots,i_{P}}. This source node is connected to all control samples in the intersection L^{\prime}_{1,i_{1}}\cap L^{\prime}_{2,i_{2}}\cap\ldots\cap L^{\prime}_{P,i_{P}} with arcs of capacity 1 and cost of 0. The control sample nodes through which there is a positive flow (of 1 unit) are the ones selected and assigned to the respective treatment sample nodes to which they have a positive flow. This is a minimum cost network flow problem with a total demand (or supply) bounded by n^{\prime}, and O(n^{\prime}) arcs and O(n^{\prime}) nodes. Therefore the successive shortest paths algorithm, discussed in Section 4, solves this problem in O(n^{\prime}(n^{\prime}+n^{\prime}\log n^{\prime})) steps.

Appendix D The minimum cost network flow, the maximum flow problems and total unimodularity

We formulate here the minimum cost network flow problem (MCNF). The input to the problem is a graph G=(V,A) with a set of nodes V and a set of arcs A, where each arc (i,j)\in A is associated with a cost c_{ij}, capacity upper bound u_{ij}, and capacity lower bound l_{ij}. Each node i\in V has supply b_{i}, which is interpreted as demand if negative, and can be 0. Let x_{ij} be the amount of flow on arc (i,j)\in A. The flow vector \mathbf{x} is said to be feasible if it satisfies:
(1) Flow balance constraints: for every node k\in V, Outflow(k)-Inflow(k)=b_{k};
(2) Capacity constraints: for each arc (i,j)\in A, l_{ij}\leq x_{ij}\leq u_{ij}.

The linear programming formulation of the problem is:

(MCNF)\quad \min \sum_{(i,j)\in A}c_{ij}x_{ij}
subject to \sum_{(k,j)\in A}x_{kj}-\sum_{(i,k)\in A}x_{ik}=b_{k}\quad \forall k\in V
l_{ij}\leq x_{ij}\leq u_{ij},\quad \forall(i,j)\in A.
Theorem 6.

[12][13] A \{0,1,-1\}-matrix where each column (or row) has at most one 1 and at most one -1 is totally unimodular.

The addition of the capacity constraints retains the total unimodularity property. Note that the MCNF constraint matrix is linearly dependent (with the sum of all balance constraints leading to 0=00=0) and one constraint can be eliminated as it is the negative sum of the others.

We also make use of the maximum flow problem, max-flow, which is a special case of MCNF. The max-flow problem is defined on a directed graph G=(V,A), with arc capacities u_{ij} and two distinguished nodes: a source node s\in V and a sink node t\in V. The total outflow from s, or the total inflow into t, is called the value of the flow. The objective is to find a maximum value feasible flow leaving s and reaching t that satisfies the arc capacities. A linear programming formulation of max-flow is:

(max-flow)\quad \max f
subject to \sum_{(s,j)\in A}x_{sj}=f
\sum_{(k,j)\in A}x_{kj}-\sum_{(i,k)\in A}x_{ik}=0\quad \forall k\in V\setminus\{s,t\}
0\leq x_{ij}\leq u_{ij},\quad \forall(i,j)\in A.

As a special case of MCNF there are specialized algorithms for max-flow which are more efficient than algorithms for MCNF.