Supplement to “Causal Strategic Linear Regression”
1 Appendix
1.1 Agent Outcomes
Proof of Theorem 1.
Let’s walk through the steps of the algorithm, bounding the error that accumulates along the way.
In the first round we set in order to obtain an estimate for .
Since is a unit vector, the variance of is at most plus a constant (from the subgaussian noise).
By Chebyshev’s inequality, this means that samples suffice for the empirical estimator of to have no more than error with failure probability . We call the output of this estimator and let be the r-dimensional vector with in every coordinate.
Now we choose that form an orthonormal basis of the image of the diagonal matrix . For each we observe the reward , subtract out , and plug it into the empirical mean estimator. For each , let be the resulting coefficient. After samples, each coefficient has at most error with failure probability at most . Since we have computed estimators, each one with failure probability at most , a union bound gives us a total failure probability that is sub-constant.
We can now bound the total squared error between said coefficients and in the basis (noting that the choice of basis does not affect the magnitude of the error). We can break up the error into two components using the triangle inequality: the error due to and the error in the subsequent rounds. Each coordinate of has error of magnitude at most , so the total magnitude of the error in is at most . The same argument applies for the error in the coordinate estimates, leading to a total error of at most .
Recall that . Let . We can now bound the gap between the agent outcomes incentivized by and by .
∎
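To make the estimation loop analyzed above concrete, here is a minimal Python sketch. The oracle `deploy_and_observe`, the use of the zero vector as the reference rule in the first round, and all variable names are illustrative assumptions, since the symbols in the proof above are not rendered.

```python
import numpy as np

def estimate_effect_coefficients(deploy_and_observe, basis, n_baseline, n_per_direction):
    """Empirical-mean estimation loop from the proof of Theorem 1 (sketch).

    deploy_and_observe(omega, n): hypothetical oracle returning an array of n
    observed outcomes when the decision rule omega is deployed and agents game.
    basis: list of orthonormal directions along which effects are probed.
    """
    r = len(basis[0])

    # Round 1: deploy a reference rule (here the zero vector, an assumption)
    # and estimate the baseline outcome by an empirical mean.  By Chebyshev's
    # inequality, O(1/eps^2) samples give error at most eps with constant
    # failure probability.
    baseline = deploy_and_observe(np.zeros(r), n_baseline).mean()

    # Subsequent rounds: deploy each basis direction, average the observed
    # outcomes, and subtract the baseline to isolate that direction's effect.
    coefficients = []
    for omega in basis:
        rewards = deploy_and_observe(omega, n_per_direction)
        coefficients.append(rewards.mean() - baseline)

    # A union bound over the directions (and the baseline) keeps the total
    # failure probability sub-constant, as in the proof above.
    return np.array(coefficients)
```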
1.2 Prediction Risk
Proof of Lemma 1.
where the last line follows because and are uncorrelated. ∎
1.3 Parameter Estimation
In this section we describe how we recover in -distance when there exists an such that is full rank. Before we proceed, we make a couple of observations. When there is no way to make the above matrix full rank, we cannot hope to recover the optimal . If there is no natural variation in, e.g., the last two features, and furthermore no agent can act along those features, it is not possible to disentangle their potential effects on the outcome (see the worked example below). This also suggests that parameter recovery is a more substantive demand on the decision-maker than in the standard linear regression setting. To uncover this additional information, the decision-maker can incentivize agents to take actions that help it recover the true outcome-governing parameters.
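As a small worked example of this identifiability failure (in generic notation, since the symbols above are not rendered: features $x$, action matrix $A$, action $a$, parameters $\theta^*$), suppose the $r$-th feature is identically zero and no action can change it, i.e.
\[
x_r \equiv 0 \qquad\text{and}\qquad (Aa)_r = 0 \ \text{ for all feasible } a .
\]
Then the outcome $y = \langle \theta^*, x + Aa \rangle + \varepsilon$ does not depend on $\theta^*_r$ at all, so $\theta^*$ and $\theta^* + \delta e_r$ induce identical outcome distributions for every $\delta$, and no amount of data, gamed or not, can distinguish them.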
This motivates the algorithm we present in this section. It operates in two stages. First, it recovers the information necessary to identify the decision rule which will provide the most informative agent samples after those agents have gamed. Second, it collects data while incentivizing this action. Finally, it computes an estimate of using the collected data. We present the complete procedure in Algorithm 1.
The procedure in Algorithm 1 can be summarized as follows (a minimal sketch follows the list):

1. Estimate the first and second moments of the distribution of agents' features.
2. Estimate the Gramian of the action matrix .
3. Compute the most informative choice of .
4. Collect samples under the most informative and then return the output of OLS.
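The following Python sketch mirrors these four steps under an illustrative gaming model in which deploying a rule omega shifts each agent's features by G @ omega for an unknown matrix G (any normalization of the agents' response is ignored). The oracle `deploy_and_collect`, the helper `most_informative_omega` (a possible implementation is sketched after the next paragraph), and the use of the zero rule as a reference are all assumptions, not the paper's exact specification.

```python
import numpy as np

def algorithm1_sketch(deploy_and_collect, most_informative_omega, r, n_explore, n_estimate):
    """Two-stage parameter recovery (sketch of Algorithm 1's structure).

    deploy_and_collect(omega, n): hypothetical oracle returning gamed feature
    vectors Z (shape (n, r)) and outcomes y (shape (n,)) under decision rule omega.
    """
    # Step 1: estimate the first and second moments of the agents' features,
    # using a reference rule (here omega = 0, an assumption).
    Z0, _ = deploy_and_collect(np.zeros(r), n_explore)
    mu_hat = Z0.mean(axis=0)
    M_hat = (Z0.T @ Z0) / n_explore

    # Step 2: estimate the Gramian of the action matrix, one coordinate
    # direction per round (see the sketch after the proof of Lemma 3).
    G_hat = np.zeros((r, r))
    for i in range(r):
        Zi, _ = deploy_and_collect(np.eye(r)[i], n_explore)
        G_hat[:, i] = Zi.mean(axis=0) - mu_hat

    # Step 3: compute the most informative choice of omega from the estimates.
    omega_star = most_informative_omega(mu_hat, M_hat, G_hat)

    # Step 4: collect samples under omega_star and return the OLS estimate.
    Z, y = deploy_and_collect(omega_star, n_estimate)
    theta_hat, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return theta_hat
```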
Before we proceed to the proof of correctness of Algorithm 1, let us build some intuition for why it makes sense to choose a single and collect all samples under it. As we show later, the convergence of OLS for linear regression can be controlled by the minimum eigenvalue of the second moment matrix of the samples. Our algorithm finds the value of that, after agents game, maximizes this minimum eigenvalue in expectation. It turns out the minimum eigenvalue of the expected second moment matrix of post-gaming samples is convex with respect to the choice of . The convexity of the objective suggests that, a priori, when choosing s to obtain informative samples, the optimal strategy is to choose a single specific .
The main difficulty in the rest of the algorithm is achieving enough precision in the estimation to set up the above optimization problem and identify such an .
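To illustrate what "most informative" means here, the sketch below evaluates the minimum eigenvalue of the estimated expected second moment matrix of post-gaming samples and searches for the rule that maximizes it. The formula for the gamed second moment matrix comes from the illustrative gaming model of the previous sketch, and the crude random search stands in for the paper's optimization step; both are assumptions.

```python
import numpy as np

def min_eig_after_gaming(omega, mu_hat, M_hat, G_hat):
    """Minimum eigenvalue of the estimated second moment matrix of gamed
    features under the illustrative model z = x + G @ omega."""
    b = G_hat @ omega
    M_gamed = M_hat + np.outer(b, mu_hat) + np.outer(mu_hat, b) + np.outer(b, b)
    return np.linalg.eigvalsh(M_gamed)[0]  # eigvalsh returns ascending eigenvalues

def most_informative_omega(mu_hat, M_hat, G_hat, n_candidates=1000, seed=0):
    """Pick, among random unit-norm candidates, the omega that maximizes the
    minimum eigenvalue above (a crude stand-in for solving the convex program)."""
    rng = np.random.default_rng(seed)
    r = len(mu_hat)
    candidates = rng.normal(size=(n_candidates, r))
    candidates /= np.linalg.norm(candidates, axis=1, keepdims=True)
    scores = [min_eig_after_gaming(w, mu_hat, M_hat, G_hat) for w in candidates]
    return candidates[int(np.argmax(scores))]
```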
Theorem 3. When , the output of Algorithm 1 run with parameter satisfies with probability greater than .
The proof of Theorem 3 relies on several lemmas. First, in Lemma 1 we bound the error of OLS as a function of the empirical second moment matrix. Note that the usual bound for the convergence of OLS is distribution-dependent; that is, it shows the expected error is small.
Lemma 1.
Assume . Consider samples and . Let be the output of OLS . Then
The proof is elementary and a slight modification of the standard textbook proof (see, for example, (liangstat)).
The proof also requires that the optimization to choose the optimal is convex.
Lemma 2.
The minimum eigenvalue of the following matrix is convex with respect to for any values of .
Furthermore, when the following conditions are true, the minimum eigenvalue of the above is within a constant factor of the optimal value:

•
•
•
•

Finally, the above holds true even for an with distance at most from the optimum.
Finally, we use a minor lemma for the recovery of a random vector via the empirical mean estimator. Note that we treat the matrix as a vector.
Lemma 3.
Assume . Let be drawn from the distribution and be the empirical mean estimator computed from said ’s. Let be the expected second moment matrix of the s. Then
We proceed with the proof of Theorem 3 below.
Proof.
The first step of the algorithm recovers an estimate of and . Note that samples suffice to recover and such that:

•
•
The for loop recovers an estimate of . Via Lemma 3, the samples suffice to ensure that the following two conditions hold:
•
•
Then the algorithm computes an estimate of the optimal . Via Lemma 2, the minimum eigenvalue attained by this approximate solution is within a constant factor of the optimum.
This guarantees that samples suffice to ensure the recovery of within squared -distance of in expectation.
Finally, the bounds in expectation can be combined with Markov's inequality to ensure the algorithm succeeds with (arbitrarily high) constant probability. ∎
Now we prove the lemmas. We begin with Lemma 1. This proof is a slight modification of the textbook proof for the convergence of OLS.
Proof.
In this section we derive a bound on the convergence of the least squares estimator when a fixed design matrix is used. Note this is exactly the case we encounter, since the choice of lets us affect the entries of the design matrix. This is a standard, textbook result and not a main contribution of the paper.
In order to state the result more formally we have to introduce some notation. The goal of the procedure is to recover , given tuples where is 1-subgaussian. We aim to characterize , where is obtained from ordinary least squares. Let be the matrix with the 's in its columns. Let be the minimum eigenvalue of (the second moment matrix).
Below all expectations are taken only over the random noise. We assume the second moment matrix is full rank.
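For reference, here is a sketch of the standard fixed-design calculation in generic notation (design matrix $X$ whose rows are the covariate vectors, noise vector $\varepsilon$ with $\mathbb{E}[\varepsilon\varepsilon^\top]\preceq I$, feature dimension $r$); the exact constants in the paper's statement may differ:
\begin{align*}
\hat\theta &= (X^\top X)^{-1} X^\top y = \theta^* + (X^\top X)^{-1} X^\top \varepsilon,\\
\mathbb{E}\,\lVert \hat\theta - \theta^* \rVert_2^2
&= \mathbb{E}\big[\varepsilon^\top X (X^\top X)^{-2} X^\top \varepsilon\big]
= \operatorname{tr}\!\big((X^\top X)^{-2} X^\top \mathbb{E}[\varepsilon\varepsilon^\top] X\big)
\le \operatorname{tr}\!\big((X^\top X)^{-1}\big)
\le \frac{r}{n\,\lambda_{\min}\!\big(\tfrac{1}{n} X^\top X\big)} .
\end{align*}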
This motivates our procedure for parameter recovery. We do so in a fashion that attempts to maximize . Note that it is the minimum eigenvalue that determines the convergence rate. This is due to the fact that little variation along a dimension makes it hard to disentangle the features’ effect on the outcome via from the constant-variance noise . ∎
Lemma 2 is somewhat more involved. It is proven in three parts. The first is that the optimization problem is convex. The second is that approximate recovery of and suffices for approximately optimizing the original expression. The third is that an approximate solution suffices.
Proof.
In this section we describe how to choose the value of that maximizes the value of for the samples we obtain.
To do so, we examine the expectation of the second moment matrix and make several observations. Let denote the expected second moment matrix of (i.e., ). We have:

1. The minimum eigenvalue of the above expression is concave with respect to . This follows because is a linear operator, the minimum eigenvalue of a Gramian matrix is concave with respect to , and the expectation of a concave function is concave (boyd2004convex).
2. Since the agent attempts to maximize their motion in the direction, we want to ensure that we move them toward the direction that maximizes the minimum eigenvalue of .
However, we do not operate with exact knowledge of , etc. It turns out that even approximately solving this optimization problem with estimates for suffices for our purposes, as long as the we obtain from our optimization (using the estimates) results in a high value for the minimum eigenvalue of . Let be the maximizing argument for the estimated optimization problem and let be the maximizing argument for the original optimization problem. Let be the true maximized second moment matrix including gaming, and be the maximizing second moment matrix with gaming resulting from replacing the true with the estimates. In formal terms, we need to show the minimum eigenvalue of the following is large: . We note that when is within of for all in the unit ball, the minimum eigenvalues may differ by at most .
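The last observation is an instance of Weyl's inequality: for symmetric matrices, the minimum eigenvalues of two matrices differ by at most the operator norm of their difference. A quick numerical check of this fact (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
r = 5

B = rng.normal(size=(r, r))
A = B @ B.T                                      # a "true" PSD second moment matrix
E = rng.normal(size=(r, r)); E = (E + E.T) / 2   # a symmetric perturbation
A_hat = A + 0.01 * E                             # an estimate close to A

gap = abs(np.linalg.eigvalsh(A)[0] - np.linalg.eigvalsh(A_hat)[0])
bound = np.linalg.norm(A - A_hat, 2)             # operator norm of the difference
assert gap <= bound + 1e-12                      # Weyl's inequality
print(f"eigenvalue gap {gap:.2e} <= perturbation norm {bound:.2e}")
```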
And now we bound the norm of assuming the following:
1.
2.
3.
4.
We work out the bound below.
This means if we find an approximate solution to the system with the estimated values, we obtain a within of the optimal. ∎
Finally, we present the proof of Lemma 3:
Proof.
Recall that when the decision-maker fixes , it receives samples of the form . We note this can be used to recover the matrix . In particular, we show how rounds, each with samples, suffice to recover the matrix to squared Frobenius norm . Recall that the procedure we propose simply chooses , one-hot coordinate vectors, in each round. We first bound the error coordinate-wise: . A union bound across coordinates shows that samples suffice to recover within squared Frobenius norm . ∎
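A minimal Python sketch of the column-by-column recovery described above, under the same illustrative assumption as in the earlier sketches (deploying the i-th coordinate vector shifts the mean gamed features by the i-th column of the matrix being recovered; any normalization of the agents' response is ignored). The oracle and all names are hypothetical.

```python
import numpy as np

def recover_gramian(deploy_and_collect, mu_hat, r, n_per_round):
    """Recover an r x r matrix one column per round via empirical means."""
    G_hat = np.zeros((r, r))
    for i in range(r):
        Z, _ = deploy_and_collect(np.eye(r)[i], n_per_round)
        # The empirical mean of the gamed features minus the baseline mean
        # estimates the i-th column.  Each coordinate concentrates at rate
        # O(1/sqrt(n)); a union bound over all r^2 entries then controls the
        # total squared Frobenius error, as in the proof above.
        G_hat[:, i] = Z.mean(axis=0) - mu_hat
    return G_hat
```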