
Robust Estimation of Regression Models with Potentially Endogenous Outliers via a Modern Optimization Lens

Zhan Gao Department of Economics, University of Southern California. zhangao@usc.edu.    Hyungsik Roger Moon Department of Economics, University of Southern California. moonr@usc.edu.
Abstract

This paper addresses the robust estimation of linear regression models in the presence of potentially endogenous outliers. Through Monte Carlo simulations, we demonstrate that existing L_1-regularized estimation methods, including the Huber estimator and the least absolute deviation (LAD) estimator, exhibit significant bias when outliers are endogenous. Motivated by this finding, we investigate L_0-regularized estimation methods. We propose systematic heuristic algorithms, notably an iterative hard-thresholding algorithm and a local combinatorial search refinement, to solve the combinatorial optimization problem of the L_0-regularized estimation efficiently. Our Monte Carlo simulations yield two key results: (i) the local combinatorial search algorithm substantially improves solution quality compared to the initial projection-based hard-thresholding algorithm while offering greater computational efficiency than directly solving the mixed integer optimization problem; (ii) the L_0-regularized estimator demonstrates superior performance in terms of bias reduction, estimation accuracy, and out-of-sample prediction errors compared to L_1-regularized alternatives. We illustrate the practical value of our method through an empirical application to stock return forecasting.

Key words: Robust estimation, mixed integer optimization, outlier detection, endogeneity
JEL code: C13, C53, C61, C63

1 Introduction

Robust estimation of linear regression models in the presence of outliers has been extensively studied in econometrics and statistics. It is well known that the classical ordinary least squares (OLS) estimator is extremely sensitive to outliers. Various robust estimation methods have been proposed, for example, Huber's M-estimation (Huber, 1964), the least median of squares (LMS) estimator (Siegel, 1982), and the least trimmed squares (LTS) estimator (Rousseeuw, 1984, 1985), among others. More recently, Lee et al. (2012) propose regularization of the L_1-norm of case-specific parameters to achieve robust estimation against outliers, and She and Owen (2011) show that any regularized estimator can be formulated as an equivalent iterative thresholding procedure of outlier detection. Refer to Yu and Yao (2017) for a comprehensive survey and detailed comparison.

The traditional literature has focused on the breakdown point and efficiency of a robust estimator (Huber and Donoho, 1983). The finite sample breakdown point of an estimator measures the proportion of observations that can be arbitrarily contaminated before the estimation error diverges to infinity. Efficiency, on the other hand, measures the estimation efficiency of the robust estimator relative to OLS in the ideal scenario where the error term is normally distributed and there are no outliers. However, there is limited work investigating how robust estimators perform when the outliers are endogenous, in the sense that the contaminating shifts can be arbitrarily correlated with the observed regressors. In this scenario, an outlier not only introduces contamination noise of large magnitude but also induces model misspecification relative to the classical linear regression assumptions. The endogeneity arising from such outliers can lead to severe estimation bias.

In this paper, we first examine the widely adopted L_1-regularized estimator, which includes the Huber estimator and the least absolute deviation estimator as special cases, and demonstrate that it is subject to significant bias when outliers are endogenous. As shown in She and Owen (2011), L_1-regularized estimation is equivalent to an iterative soft-thresholding procedure with a data-driven thresholding parameter, and hence it does not completely eliminate the detected outliers.

Motivated by this finding, we turn to the L_0-regularized approach. By restricting the cardinality of the set of outliers, the L_0-regularized estimator searches for the best subset of observations and the regression coefficients that minimize the least squares objective function. This problem is equivalent to solving for the least trimmed squares (LTS) estimator, which is well known to be an NP-hard combinatorial optimization problem (Natarajan, 1995). Following Bertsimas et al. (2016) and Thompson (2022), the L_0-regularized estimator can be formulated as a mixed integer optimization (MIO) problem. With the continuous advancement of modern optimization software, such as gurobi (a commercial optimization solver specializing in mixed integer optimization, which over the past decade has become up to 75-fold faster for MIO problems independent of hardware advancement; Gurobi Optimization, LLC, 2024), and of computational infrastructure, it is now tractable to solve the L_0-regularized estimation problem with a verifiable globally optimal solution for datasets with up to hundreds of observations. However, the optimization routine relies heavily on the initial values because of the non-convex nature of the problem. In addition, the implementation is not scalable, and the computational burden grows quickly as the sample size and the fraction of outliers increase.

To address the computational challenges, we propose a scalable and efficient heuristic algorithm that combines the iterative hard-thresholding (IHT) algorithm and a local combinatorial search refinement inspired by Hazimeh and Mazumder (2020). Given the initial solution from the IHT algorithm, the local combinatorial search algorithm checks whether swapping a few observations, say one or two, between the estimated sets of inliers and outliers can improve the objective value by solving a small-scale mixed integer optimization problem. The heuristic algorithm is more computationally efficient and, according to the Monte Carlo experiments, improves the initial IHT solution to be as good as the solution obtained from solving the original L_0-regularized MIO problem in most cases.

Our contributions are twofold. First, through Monte Carlo simulations, we document that existing L_1-regularized methods are biased in the presence of endogenous outliers, whereas the L_0-regularized approach does not suffer from this bias. This finding extends the understanding of the properties of robust estimation methods to a new scenario. Second, we propose systematic heuristic algorithms that provide stable and high-quality solutions while being computationally efficient. To the best of our knowledge, this is the first work to apply the idea of local combinatorial search to the context of robust estimation and outlier detection.

To illustrate the practical value of our method, we apply it to an empirical application in stock return forecasting. The results demonstrate that our L_0-regularized estimator outperforms L_1-regularized alternatives in terms of out-of-sample prediction errors.


Notations. For a generic vector a = (a_1, a_2, \cdots, a_N)', \|a\| = (a'a)^{1/2}, \|a\|_1 = \sum_{i=1}^N |a_i|, \|a\|_\infty = \max_i |a_i|, and \|a\|_0 = \sum_{i=1}^N \mathbbm{1}\{a_i \neq 0\}. For a matrix A, we define the Frobenius norm as \|A\| = (\operatorname{tr}(A'A))^{1/2}. Generically, [N] = \{1, 2, \cdots, N\} for a positive integer N. For a \in \mathbb{R}, \lfloor a \rfloor is the integer part of the real number a. \iota_N = (1, 1, \cdots, 1) \in \mathbb{R}^N denotes the vector of ones.


The rest of the paper is organized as follows. Section 2 presents the model setup and motivating examples. Section 3 summarizes the existing L_1-regularized methods for robust estimation. Section 4 delineates the L_0-regularized robust methods and the detailed algorithms. The performance of the algorithms is evaluated through Monte Carlo experiments in Section 5. Section 6 illustrates the proposed method in an empirical application of stock return forecasting. Section 7 concludes.

2 Linear Regression with Potentially Endogenous Outliers

This section sets up the linear regression model with potentially endogenous outliers. Suppose we observe samples (y_i, x_i')' for individual units i \in [N]. Let \mathcal{O} \subset [N] denote the set of outliers and \mathcal{I} = [N] \setminus \mathcal{O} the set of inliers. Consider the following linear regression model:

y_i = \beta_0 + x_i'\beta_1 + \alpha_i + u_i, (2.1)

where \mathbb{E}(u_i | x_i) = 0 for all i \in [N]. The parameter \alpha_i represents the conditional mean shift such that

\alpha_i = 0 \quad \text{if } i \in \mathcal{I},
\mathbb{E}(\alpha_i | x_i) \neq 0 \quad \text{if } i \in \mathcal{O}.

We refer to the parameters \alpha_i as outlier fixed effects, which are allowed to be arbitrarily correlated with the regressors x_i. An alternative formulation of model (2.1) is:

y_i = \beta_0 + x_i'\beta_1 + \gamma_i \widetilde{\alpha}_i + u_i, (2.2)

where \gamma_i = \mathbbm{1}\{i \in \mathcal{O}\} is the outlier dummy and \widetilde{\alpha}_i is the latent outlier fixed effect.

Remark 1.

\gamma_i and \widetilde{\alpha}_i are generally allowed to be correlated. For example, \gamma_i = \mathbbm{1}(\widetilde{\alpha}_i > \overline{a}), i.e., the shock \widetilde{\alpha}_i affects the dependent variable only if it is large enough to exceed the threshold \overline{a}. Conversely, \gamma_i = \mathbbm{1}(\widetilde{\alpha}_i < \overline{a}) implies that the shock affects the dependent variable only if it is small enough to avoid detection during the data collection process.

The existence of \alpha_i introduces endogeneity to a subset of observations \mathcal{O}. The classical ordinary least squares (OLS) estimator is expected to be biased in the presence of endogenous outliers. The following motivating examples demonstrate possible sources of endogenous outliers.

Example 1 (Heterogeneous Coefficients).

Suppose that for i \in \mathcal{I},

y_i = \beta_0 + \beta_1' x_i + u_i,

while for i \in \mathcal{O},

y_i = \beta_{0,i} + \beta_{1,i}' x_i + u_i,

where the coefficients \beta_{0,i} and \beta_{1,i} are heterogeneous across i \in \mathcal{O} and potentially correlated with x_i. Define

\alpha_{0,i} := (\beta_{0,i} - \beta_0)\mathbbm{1}(i \in \mathcal{O}),
\alpha_{1,i} := (\beta_{1,i} - \beta_1)\mathbbm{1}(i \in \mathcal{O}).

Then, for i \in [N],

y_i = \beta_0 + \beta_1' x_i + \alpha_{0,i} + \alpha_{1,i}' x_i + u_i,

which can be reformulated as (2.1) with

\alpha_i = 0 for i \in \mathcal{I}, and \alpha_i = \alpha_{0,i} + \alpha_{1,i}' x_i for i \in \mathcal{O},

where \mathbb{E}(\alpha_i | x_i) = \mathbb{E}(\alpha_{0,i} | x_i) + \mathbb{E}(\alpha_{1,i} | x_i)' x_i \neq 0 for i \in \mathcal{O} in general.

Example 2 (Measurement Errors).

Suppose that y_i and x_i^* are the outcome variable and the regressors of interest, respectively. Assume that there exists a linear relationship between y_i and x_i^*:

y_i = \beta_0 + \beta_1' x_i^* + u_i,

where \mathbb{E}(u_i | x_i^*) = 0. However, measurement errors can contaminate the variables during data collection and processing. We observe x_i^* only for i \in \mathcal{I}; for i \in \mathcal{O}, we observe proxy variables contaminated by measurement errors, x_i := x_i^* + \alpha_{1,i}, where \alpha_{1,i} represents the measurement error. Then, the regression model becomes

y_i = \beta_0 + x_i'\beta_1 + \alpha_i + u_i,

where

x_i = x_i^* for i \in \mathcal{I} and x_i = x_i^* + \alpha_{1,i} for i \in \mathcal{O}, \quad \alpha_i = 0 for i \in \mathcal{I} and \alpha_i = -\alpha_{1,i}'\beta_1 for i \in \mathcal{O}.

In the case of nonclassical measurement errors, \mathbb{E}(\alpha_i | x_i) \neq 0 in general, which induces endogeneity in the subset of observations \mathcal{O}.

The objective is to accurately estimate \beta = (\beta_0, \beta_1')' in the presence of outliers in the sample.

3 Existing Methods: L_1-regularized Robust Regression

The most widely used robust estimators of \beta in the presence of outlier observations are Huber's M-estimator and the least absolute deviation (LAD) estimator (Huber, 1964). These two estimators can be understood as special cases of the L_1-regularized estimator.

Denote Y = (y_1, \ldots, y_N)', X = ((1, x_1'), \ldots, (1, x_N'))', \beta = (\beta_0, \beta_1')', \alpha = (\alpha_1, \ldots, \alpha_N)', and U = (u_1, \ldots, u_N)'. Then, we can write model (2.1) in matrix form as

Y = X\beta + \alpha + U. (3.1)

Let the least squares loss function be

L_N(\beta, \alpha) = \frac{1}{2}\|Y - X\beta - \alpha\|^2,

and consider the following L_1-regularized optimization problem,

\min_{\beta,\alpha} Q_N^{\psi}(\beta, \alpha) = L_N(\beta, \alpha) + \psi\|\alpha\|_1. (3.2)

Given \beta, define \hat{\alpha}^{\psi}(\beta) := \operatorname*{arg\,min}_{\alpha} Q_N^{\psi}(\beta, \alpha). The closed-form solution is

\hat{\alpha}_i^{\psi}(\beta) = \begin{cases} (y_i - x_i'\beta) - \psi & \text{if } y_i - x_i'\beta \geq \psi, \\ 0 & \text{if } |y_i - x_i'\beta| < \psi, \\ (y_i - x_i'\beta) + \psi & \text{if } y_i - x_i'\beta \leq -\psi. \end{cases} (3.3)

Then the profile L_1-regularized objective function becomes

Q_N^{\psi}(\beta, \hat{\alpha}^{\psi}(\beta)) = \sum_{i=1}^N \left[ \frac{1}{2}(y_i - x_i'\beta)^2 \mathbbm{1}\{|y_i - x_i'\beta| \leq \psi\} + \left(\psi|y_i - x_i'\beta| - \frac{\psi^2}{2}\right) \mathbbm{1}\{|y_i - x_i'\beta| > \psi\} \right]. (3.4)
Figure 1: The L_1-regularized loss function

In (3.3), the tuning parameter \psi plays the role of a soft-thresholding parameter. When \psi is fixed, (3.4) becomes the objective function of Huber regression,

Q_N^{\psi}(\beta, \hat{\alpha}^{\psi}(\beta)) = \sum_{i=1}^N \rho_{\psi}(y_i - x_i'\beta),

where

\rho_{\psi}(t) = \begin{cases} \frac{1}{2}t^2 & \text{if } |t| \leq \psi, \\ \psi|t| - \frac{1}{2}\psi^2 & \text{if } |t| > \psi, \end{cases} (3.5)

is the Huber loss function (Hastie et al., 2015, pp. 26-27). As illustrated in Figure 1, the cutoffs -\psi and \psi divide the Huber loss function into two regimes: it is a squared loss if |t| \leq \psi and a linear loss in the absolute deviation if |t| > \psi. If we let \psi = \psi_N such that \min_{i\in[N]} |y_i - x_i'\beta| \geq \psi_N > 0 given N and the sample (Y, X), then we have

(N\psi_N)^{-1} Q_N^{\psi}(\beta, \hat{\alpha}^{\psi}(\beta)) \to \frac{1}{N}\sum_{i=1}^N |y_i - x_i'\beta|,

as \psi_N \to 0, which shrinks the squared-loss region in Figure 1 so that the objective reduces to the LAD criterion.

The L_1-regularized estimator unifies the Huber and LAD estimators and offers greater flexibility by allowing the regularization parameter to be chosen in a data-driven manner. However, as a soft-thresholding algorithm, as shown in (3.3), it does not completely eliminate the detected outliers from the estimation procedure. This can lead to estimation bias when the outliers are endogenous.
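To make the soft-thresholding mechanism in (3.2)-(3.3) concrete, the following minimal sketch alternates between the closed-form update (3.3) for \alpha and an OLS update for \beta. It assumes Python with numpy; the function names, the fixed threshold psi, and the convergence tolerance are illustrative choices, not the exact implementation used in the paper.

import numpy as np

def soft_threshold(r, psi):
    # Closed-form update (3.3): shrink each residual toward zero by psi.
    return np.sign(r) * np.maximum(np.abs(r) - psi, 0.0)

def l1_regularized_fit(y, X, psi, max_iter=100, tol=1e-8):
    # Alternate between alpha (soft-thresholding of residuals) and beta (OLS on y - alpha).
    N = len(y)
    Z = np.column_stack([np.ones(N), X])           # add an intercept column
    beta = np.linalg.lstsq(Z, y, rcond=None)[0]    # OLS starting value
    for _ in range(max_iter):
        alpha = soft_threshold(y - Z @ beta, psi)
        beta_new = np.linalg.lstsq(Z, y - alpha, rcond=None)[0]
        if np.max(np.abs(beta_new - beta)) < tol:
            beta = beta_new
            break
        beta = beta_new
    return beta, alpha

With psi fixed, this procedure corresponds to Huber regression as in (3.4)-(3.5); letting psi shrink toward zero moves the fit toward LAD.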

As demonstrated in the Monte Carlo experiments in Section 5, the L_1-regularized estimator performs well in terms of estimation bias, root mean squared error (RMSE), and prediction error in DGP 1, where the outlier fixed effects are generated as exogenous random variables. However, in DGP 2 and DGP 3, where the outlier fixed effects are correlated with the regressors, the L_1-regularized estimator of the slope coefficients exhibits significant biases. This finding motivates the consideration of the hard-thresholding-based L_0-regularized estimator, discussed in Section 4.

4 L_0-regularized Robust Regression

We consider L_0-regularization on the outlier fixed effects in the least squares estimation,

\min_{\beta,\alpha} L_N(\beta, \alpha) \text{ s.t. } \|\alpha\|_0 \leq k, (4.1)

where \|\alpha\|_0 = \sum_{i=1}^N \mathbbm{1}\{\alpha_i \neq 0\} and the tuning parameter k \in [N] controls the exact sparsity of the outlier fixed effects \alpha.

Remark 2.

The Lagrangian form of the regularized estimation problem,

\min_{\beta,\alpha} L_N(\beta, \alpha) + P_{\psi}(\alpha), (4.2)

where P_{\psi}(\alpha) is the penalty function, is often studied in the literature, for example, by She and Owen (2011) and Lee et al. (2012). When P_{\psi}(\alpha) = \psi\|\alpha\|_1, (4.2) is equivalent to classical Huber regression or least absolute deviation (LAD) estimation, depending on the tuning parameter \psi, as detailed in Section 3. She and Owen (2011) show that the iterative procedure of outlier detection based on the hard-thresholding operator, \Theta_{\mathrm{hard}}^{\psi}(x) = 0 if |x| \leq \psi and \Theta_{\mathrm{hard}}^{\psi}(x) = x if |x| > \psi, gives a local minimum of the L_0-penalized optimization,

\min_{\beta,\alpha} L_N(\beta, \alpha) + \psi\|\alpha\|_0. (4.3)

Note that (4.1) and (4.3) are not equivalent, since it is not guaranteed that there exists a corresponding \psi > 0 for each k \in [N]. To see this, denote by \psi(k) = \min_{\beta, \|\alpha\|_0 \leq k} L_N(\beta, \alpha) the optimal objective value of (4.1), and let \delta(k) = |\psi(k+1) - \psi(k)|. Note that \psi(k) is strictly decreasing in k unless the model fits the data perfectly. Suppose \delta(k) is decreasing; then for any k, there exists \psi \in (\delta(k+1), \delta(k)) such that the corresponding L_0-penalized problem gives the same solution as (4.1). However, \delta(k) is not guaranteed to be decreasing. For any local minimum \tilde{k}, there is no corresponding \psi \in \mathbb{R}_+ that yields the same solution. Scenarios like cointegration can serve as examples. In this paper, we focus on the constrained form of L_0-regularization (4.1), which allows one to control the exact sparsity of \alpha.

As in Bertsimas et al. (2016) and Thompson (2022), (4.1) can be formulated as the following mixed integer optimization (MIO) problem,

\min_{\beta,\alpha,\gamma} L_N(\beta, \alpha)
s.t. \gamma_i \in \{0,1\}, \; i \in [N],
\sum_{i=1}^N \gamma_i \leq k, (4.4)
\{\alpha_i, 1 - \gamma_i\} \in \text{SOS-1}, \; i \in [N],
\|\alpha\|_\infty \leq M_\alpha, \; \|\alpha\|_1 \leq M_{\alpha,1},

in which we introduce a binary variable \gamma_i to model whether an observation is detected as an outlier. The constraint \sum_{i=1}^N \gamma_i \leq k corresponds to the cardinality constraint \|\alpha\|_0 \leq k. If \gamma_i = 1, i.e., observation i is labeled as an outlier, then \alpha_i may be nonzero; if \gamma_i = 0, i.e., observation i is labeled as an inlier, then \alpha_i must be exactly 0. This intuition translates into the constraint (1 - \gamma_i)\alpha_i = 0, which can be modeled in integer optimization using a Special Ordered Set of Type 1 (SOS-1), that is, a set that contains at most one nonzero variable. (Alternatively, (4.1) can be formulated as a MIO problem using the big-M method, \min_{\beta,\alpha,\gamma} L_N(\beta, \alpha) \text{ s.t. } \gamma_i \in \{0,1\}, \; \sum_{i=1}^N \gamma_i \leq k, \; -\gamma_i M_\alpha \leq \alpha_i \leq \gamma_i M_\alpha, where M_\alpha provides a sufficiently large bound for \alpha; in practice, we set M_\alpha = \tau\|\widehat{\alpha}\|_\infty with \tau > 1, where \widehat{\alpha} is the initial estimator from the heuristic algorithm.) M_\alpha and M_{\alpha,1} are user-defined bound parameters that help tighten the parameter space and improve computational performance.

The optimization problem (4.4) is a well-posed MIO problem, and provable global optimality can be achieved by optimization solvers such as gurobi (Gurobi Optimization, LLC, 2024). The computational efficiency and solution quality, however, are highly dependent on the initial values provided to the solver. Furthermore, as demonstrated in the Monte Carlo experiments in Table 1, directly solving (4.4) does not achieve provable optimality within the prespecified 5-minute time frame when the sample size N and the number of outliers k are large. In the following subsection, we propose systematic approximate algorithms to solve (4.1).
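For illustration, the following sketch passes the big-M variant of (4.4) (see the parenthetical remark above) to the Gurobi solver via gurobipy. It is a minimal sketch assuming a Gurobi license and the gurobipy package; the function name, the bound M_alpha, and the 5-minute time limit mirror the setup described in the paper but are otherwise illustrative.

import numpy as np
import gurobipy as gp
from gurobipy import GRB

def l0_robust_mio(y, X, k, M_alpha, time_limit=300):
    # Big-M MIO formulation of the L_0-regularized robust regression (4.1).
    N, p = X.shape
    Z = np.column_stack([np.ones(N), X])             # include an intercept
    m = gp.Model("l0_robust")
    beta  = m.addVars(p + 1, lb=-GRB.INFINITY, name="beta")
    alpha = m.addVars(N, lb=-M_alpha, ub=M_alpha, name="alpha")
    gamma = m.addVars(N, vtype=GRB.BINARY, name="gamma")
    m.addConstr(gamma.sum() <= k)                    # at most k outliers
    m.addConstrs(alpha[i] <=  M_alpha * gamma[i] for i in range(N))
    m.addConstrs(alpha[i] >= -M_alpha * gamma[i] for i in range(N))   # gamma_i = 0 forces alpha_i = 0
    resid = [y[i] - gp.quicksum(Z[i, j] * beta[j] for j in range(p + 1)) - alpha[i]
             for i in range(N)]
    m.setObjective(0.5 * gp.quicksum(r * r for r in resid), GRB.MINIMIZE)
    m.Params.TimeLimit = time_limit
    m.optimize()
    return (np.array([beta[j].X for j in range(p + 1)]),
            np.array([alpha[i].X for i in range(N)]))

As noted above, the solver benefits substantially from warm starts (e.g., via the variables' Start attribute) computed by the heuristics in Section 4.1.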

4.1 Heuristics

As with the best subset selection problem studied in Bertsimas et al. (2016), the L_0-regularized robust regression (4.1) cannot be solved in polynomial time, i.e., it is NP-hard (Natarajan, 1995). The computational cost of solving for the exact solution increases steeply with the sample size. Heuristic algorithms that find approximate solutions are therefore useful for parameter tuning and for providing warm starts to the solver. In this section, we follow Bertsimas et al. (2016), Hazimeh and Mazumder (2020), Thompson (2022), and Mazumder et al. (2022) to provide heuristic algorithms for (4.1) based on iterative hard-thresholding, local combinatorial search, and neighborhood search.

4.1.1 Iterative Hard-thresholding

Iterative hard-thresholding (IHT) based on projected gradient descent is a generic heuristic algorithm for general nonlinear programming with sparsity constraints (Beck and Eldar, 2013), and it has been successfully applied to best subset selection in Bertsimas et al. (2016). For problem (4.1), we propose the following iterative hard-thresholding algorithm, which is a simplified version of the projected block-coordinate gradient descent in Thompson (2022, Algorithm 1).

Define the hard-thresholding operator for c \in \mathbb{R}^N as

H_k(c) = \operatorname*{arg\,min}_{\|\alpha\|_0 \leq k} \|\alpha - c\|_2^2.

As shown in Bertsimas et al. (2016, Proposition 3), if \hat{\alpha} \in H_k(c), then \hat{\alpha} retains the k largest (in absolute value) elements of c and sets the rest to zero, i.e.,

\hat{\alpha}_i = c_i if i \in \{(1), (2), \cdots, (k)\}, and \hat{\alpha}_i = 0 otherwise,

where \{(1), (2), \cdots, (N)\} is an ordering of [N] such that |c_{(1)}| \geq |c_{(2)}| \geq \cdots \geq |c_{(N)}|.
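A minimal sketch of the hard-thresholding operator H_k in Python with numpy (illustrative, with ties broken arbitrarily by the sort):

import numpy as np

def hard_threshold(c, k):
    # H_k(c): keep the k largest-in-absolute-value entries of c, set the rest to zero.
    out = np.zeros_like(c, dtype=float)
    if k > 0:
        idx = np.argsort(np.abs(c))[-k:]    # positions of the k largest |c_i|
        out[idx] = c[idx]
    return out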

The iterative hard-thresholding algorithm starts with an initial robust estimator of \beta and iteratively updates \alpha and \beta by applying the hard-thresholding operator to the residuals and then running a least squares regression based on the support of the updated \alpha. The details are summarized in Algorithm 1.

input: An initial robust estimator \hat{\beta}^{(0)}, sparsity parameter k, and the maximum number of iterations M.
Initialize j = 0. Repeat:
  \hat{\alpha}^{(j)} \leftarrow H_k(Y - X\hat{\beta}^{(j)});
  \hat{\beta}^{(j+1)} \leftarrow \arg\min_{\beta} L_N(\beta, \hat{\alpha}^{(j)});
  j \leftarrow j + 1;
until \|\hat{\beta}^{(j)} - \hat{\beta}^{(j-1)}\|_2 = 0 or j > M.
output: \hat{\beta}^{\text{IHT}} = \hat{\beta}^{(j)} and \hat{\alpha}^{\text{IHT}} = \hat{\alpha}^{(j-1)}.
Algorithm 1: Iterative Hard-thresholding (IHT)
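The following is a minimal sketch of Algorithm 1, reusing the hard_threshold function from the sketch above; it assumes the design matrix X already contains an intercept column and that the initial estimator beta0 comes from, e.g., a Huber or LAD fit, as suggested in Remark 3 below.

import numpy as np

def iht(y, X, beta0, k, max_iter=100):
    # Algorithm 1 (sketch): alternate hard-thresholding of residuals and OLS given the support of alpha.
    beta = np.asarray(beta0, dtype=float)
    alpha = np.zeros(len(y))
    for _ in range(max_iter):
        alpha = hard_threshold(y - X @ beta, k)              # flag the k largest residuals as outliers
        beta_new = np.linalg.lstsq(X, y - alpha, rcond=None)[0]
        if np.allclose(beta_new, beta):                      # beta unchanged: converged
            beta = beta_new
            break
        beta = beta_new
    return beta, alpha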
Remark 3.

The initial estimator \hat{\beta}^{(0)} for Algorithm 1 can be obtained by Huber regression or least absolute deviation estimation. Algorithm 1 is a simplified version of the projected block-coordinate gradient descent in Thompson (2022, Algorithm 1), and the convergence of IHT follows directly from Thompson (2022). In addition, (\hat{\beta}^{\text{IHT}\prime}, \hat{\alpha}^{\text{IHT}\prime})' is automatically coordinate-wise optimal in the sense that optimizing with respect to one coordinate at a time, while keeping the others fixed, cannot improve the objective (Hazimeh and Mazumder, 2020).

4.1.2 Local Combinatorial Search

Although the iterative hard-thresholding algorithm is fast and easy to implement, it is only guaranteed to converge to a coordinate-wise optimal solution. In this section, we propose a local combinatorial search algorithm, similar to Hazimeh and Mazumder (2020), to further refine the solution.

Note that (4.1) is equivalent to

\min_{\mathcal{I} \subset [N], \, |\mathcal{I}| \geq N-k} \; \min_{\beta} \sum_{i\in\mathcal{I}} (y_i - x_i'\beta)^2.

For an estimate \hat{\alpha}, denote \hat{\mathcal{I}} = \{i \in [N] : \hat{\alpha}_i = 0\} and \hat{\mathcal{O}} = \{i \in [N] : \hat{\alpha}_i \neq 0\}. Following Hazimeh and Mazumder (2020), \hat{\alpha} is said to be swap-inescapable of order l if arbitrarily swapping up to l observations between \hat{\mathcal{I}} and \hat{\mathcal{O}} and then optimizing over the new support cannot improve the objective value. Formally,

\min_{\beta} \sum_{i\in\hat{\mathcal{I}}} (y_i - x_i'\beta)^2 = \min_{\mathcal{S}_1 \subseteq \hat{\mathcal{I}}, \, \mathcal{S}_2 \subseteq \hat{\mathcal{O}}, \, |\mathcal{S}_1| \leq |\mathcal{S}_2| \leq l} \; \min_{\beta} \sum_{i\in(\hat{\mathcal{I}}\setminus\mathcal{S}_1)\cup\mathcal{S}_2} (y_i - x_i'\beta)^2. (4.5)

If a solution \hat{\alpha} is swap-inescapable of order k, then \hat{\alpha}, together with the OLS estimate \hat{\beta} computed on the observations in \hat{\mathcal{I}}, is the exact global optimal solution to (4.1). By solving the local combinatorial search problem in (4.5) for small l < k, say l = 1 or 2, we improve on the IHT solution and verify local combinatorial exactness.

The problem in (4.5) can be formulated as a mixed integer optimization (MIO) problem given by

\min_{\beta,\gamma,\alpha} L_N(\beta, \alpha)
s.t. \gamma_i \in \{0,1\}, \; i \in [N],
\sum_{i=1}^N \gamma_i \leq k,
(\alpha_i, 1 - \gamma_i) \in \text{SOS-1}, \; i \in [N], (4.6)
\sum_{i\in\hat{\mathcal{I}}} \gamma_i \leq l,
\sum_{i\in\hat{\mathcal{O}}} \gamma_i \geq k - l,
\|\alpha\|_\infty \leq M_\alpha, \; \|\alpha\|_1 \leq M_{\alpha,1}.

The constraints \sum_{i\in\hat{\mathcal{I}}} \gamma_i \leq l and \sum_{i\in\hat{\mathcal{O}}} \gamma_i \geq k - l restrict the number of swaps between \hat{\mathcal{I}} and \hat{\mathcal{O}} to at most l. Problem (4.6) has a much smaller search space than the original problem (4.4) when l is small. (When l = 1, the problem can be solved by brute force: run a least squares regression for each of the k(N-k) possible one-for-one swaps between \hat{\mathcal{I}} and \hat{\mathcal{O}} and compare the resulting objective values, without invoking the MIO solver; a sketch of this check is given after Algorithm 2.) The heuristic combining IHT and local combinatorial search is summarized in Algorithm 2.

input: An initial robust estimator \hat{\beta}^{(0)}, sparsity parameter k, and local exactness level l.
for j = 0, 1, 2, ... do
  1. (\hat{\beta}^{(j+1)}, \hat{\alpha}^{(j+1)}) \leftarrow output of Algorithm 1 (IHT) initialized with \hat{\beta}^{(j)};
  2. Compute \hat{\mathcal{I}}^{(j+1)} = \{i \in [N] : \hat{\alpha}_i^{(j+1)} = 0\} and \hat{\mathcal{O}}^{(j+1)} = \{i \in [N] : \hat{\alpha}_i^{(j+1)} \neq 0\};
  3. (\tilde{\beta}, \tilde{\alpha}) \leftarrow solution to the local combinatorial search problem (4.6) given \hat{\mathcal{I}}^{(j+1)}, \hat{\mathcal{O}}^{(j+1)}, and l.
  if L_N(\tilde{\beta}, \tilde{\alpha}) < L_N(\hat{\beta}^{(j+1)}, \hat{\alpha}^{(j+1)}) then
    (\hat{\beta}^{(j+1)}, \hat{\alpha}^{(j+1)}) \leftarrow (\tilde{\beta}, \tilde{\alpha});
  else
    output: \hat{\beta} = \hat{\beta}^{(j+1)} and \hat{\alpha} = \hat{\alpha}^{(j+1)}.
  end if
end for
Algorithm 2: Local Combinatorial Search
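As a concrete illustration of the l = 1 case noted above, the following minimal sketch checks all one-for-one swaps between the estimated inlier and outlier sets by brute force and returns the best configuration found. It assumes numpy and an intercept column already included in X; the function name is illustrative.

import numpy as np

def lcs_one_swap(y, X, inliers, outliers):
    # Brute-force l = 1 local combinatorial search over all k(N - k) one-for-one swaps.
    def ssr(idx):
        b = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]
        return np.sum((y[idx] - X[idx] @ b) ** 2), b
    best_inliers = list(inliers)
    best_ssr, best_beta = ssr(best_inliers)
    for i in inliers:                      # candidate inlier to relabel as an outlier
        for j in outliers:                 # candidate outlier to relabel as an inlier
            idx = [m for m in inliers if m != i] + [j]
            s, b = ssr(idx)
            if s < best_ssr:
                best_ssr, best_beta, best_inliers = s, b, idx
    return best_beta, best_inliers, best_ssr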

4.1.3 Neighborhood Search

Algorithm 2 is guaranteed to provide a solution that is locally optimal in the sense of swap-inescapability of order l. However, the quality of the solution depends on the initial inputs due to the nonconvexity of the problem. As noted in Thompson (2022) and Mazumder et al. (2022), a neighborhood search procedure is a useful and systematic way of perturbing the initial inputs to improve the solution quality of the heuristic algorithm. Let [K] = \{1, 2, \cdots, K\} with K \leq \lfloor N/2 \rfloor be the set of candidate sparsity parameters.

Denote by b_{k,l}(\beta) and a_{k,l}(\beta) the outputs of Algorithm 2 initialized with \beta, k, and l. The neighborhood search procedure is summarized in Algorithm 3. As a by-product of the procedure, we obtain a solution for each k \in [K], which is useful for sensitivity analysis and for the parameter tuning discussed in Section 4.2.

input: An initial robust estimator \hat{\beta}^{(0)}, [K] = \{1, 2, \cdots, K\}, and local exactness level l.
for k = 1, 2, ..., K do
  \hat{\beta}_k^{(1)} \leftarrow b_{k,l}(\hat{\beta}^{(0)}); \hat{\alpha}_k^{(1)} \leftarrow a_{k,l}(\hat{\beta}^{(0)}).
end for
Initialize j = 1. While |\sum_{k=1}^K L_N(\hat{\beta}_k^{(j)}, \hat{\alpha}_k^{(j)}) - \sum_{k=1}^K L_N(\hat{\beta}_k^{(j-1)}, \hat{\alpha}_k^{(j-1)})| > 0, do:
  for k = 1, 2, ..., K do
    \tilde{\beta}_- \leftarrow b_{k,l}(\hat{\beta}_{k-1}^{(j)}); \tilde{\alpha}_- \leftarrow a_{k,l}(\hat{\beta}_{k-1}^{(j)});
    \tilde{\beta}_+ \leftarrow b_{k,l}(\hat{\beta}_{k+1}^{(j)}); \tilde{\alpha}_+ \leftarrow a_{k,l}(\hat{\beta}_{k+1}^{(j)});
    (\tilde{\beta}, \tilde{\alpha}) \leftarrow \operatorname*{arg\,min}_{(\beta,\alpha)\in\{(\hat{\beta}_k^{(j)}, \hat{\alpha}_k^{(j)}), (\tilde{\beta}_-, \tilde{\alpha}_-), (\tilde{\beta}_+, \tilde{\alpha}_+)\}} L_N(\beta, \alpha);
    \hat{\beta}_k^{(j+1)} \leftarrow \tilde{\beta}; \hat{\alpha}_k^{(j+1)} \leftarrow \tilde{\alpha}.
  end for
  j \leftarrow j + 1
End while.
output: \hat{\beta}_k \leftarrow \hat{\beta}_k^{(j)}, \hat{\alpha}_k \leftarrow \hat{\alpha}_k^{(j)}, for k \in [K].
Algorithm 3: Neighborhood Search

4.2 Tuning Parameter Choice

In both the L_1- and L_0-regularized estimation methods, the selection of the tuning parameter, \psi in (3.2) and k in (4.1), plays a critical role. We propose the BIC-type information criterion,

\text{BIC}^{\ast}(k) = N \log\left( \frac{\|Y - X\hat{\beta} - \hat{\alpha}\|_2^2}{N} \right) + k \log(N). (4.7)

For computational efficiency, we use Algorithm 3 to generate solutions for a grid of candidate tuning parameters and select the one that minimizes the BIC,

\hat{k} = \operatorname*{arg\,min}_{k\in\{1,2,\cdots,K\}} \text{BIC}^{\ast}(k), (4.8)

where K \leq \lfloor N/2 \rfloor is the maximum potential number of outliers.
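A minimal sketch of the selection rule (4.7)-(4.8), assuming the fits for each candidate k (e.g., produced by Algorithm 3) are available as a dictionary mapping k to the pair (beta_k, alpha_k); the names are illustrative.

import numpy as np

def select_k_by_bic(y, X, fits):
    # fits: dict mapping k -> (beta_k, alpha_k). Return the k minimizing BIC*(k) in (4.7).
    N = len(y)
    def bic(k, beta, alpha):
        rss = np.sum((y - X @ beta - alpha) ** 2)
        return N * np.log(rss / N) + k * np.log(N)
    return min(fits, key=lambda k: bic(k, *fits[k]))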

A widely used theoretical property for evaluating robust estimation methods is the finite sample breakdown point, proposed by Hampel (1971) and Huber and Donoho (1983). Suppose the original sample is (Y, X) and the contaminated sample is (\tilde{Y}_{(k_0)}, \tilde{X}_{(k_0)}), in which k_0 observations of the original sample are arbitrarily replaced by outliers. The finite sample breakdown point of an estimator T is defined as

B(T; (Y,X)) = \min_{k_0} \left\{ \frac{k_0}{N} : \sup_{\tilde{Y}_{(k_0)}, \tilde{X}_{(k_0)}} \left\| T(Y,X) - T(\tilde{Y}_{(k_0)}, \tilde{X}_{(k_0)}) \right\| = \infty \right\}.
Proposition 1.

Let (Y, X) be a sample of size N. T(Y, X) denotes the estimator of \beta obtained by solving (4.1) with \hat{k} selected by (4.8) based on (Y, X). For a fixed k, the estimator defined by (4.1) with that k is denoted T_k(Y, X). Then, T(Y, X) has the finite sample breakdown point

B(T; (Y,X)) = \frac{\lfloor N/2 \rfloor + 1}{N}.
Proof.

The proof is relegated to Appendix A.1. ∎

Remark 4.

For a fixed k, B(T_k; (Y,X)) = \frac{k+1}{N}, as shown in Thompson (2022). Proposition 1 extends this result to the estimation procedure with \mathrm{BIC}^{\ast}(k), which achieves the optimal breakdown point.

5 Monte Carlo Simulation

In this section, we examine the numerical performance of the proposed L_0-regularized estimation procedure. We compare its coefficient estimation accuracy and prediction error to those of the L_1-regularized estimation and the classical methods, LAD and OLS.

5.1 Setup

Following the setting in Section 2, we consider the following data generating processes (DGPs).

DGP 1 (Exogenous Outliers). Consider the linear regression model with outliers,

y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \gamma_i \alpha_i + u_i, \quad i = 1, 2, \cdots, N,

where \gamma_i is the outlier indicator and \alpha_i is the outlier fixed effect. We generate the regressors by x_{i,1} = (v_{i,1}^2 + v_{i,2}^2 - 2)/2 and x_{i,2} = x_{i,1} + v_{i,3}, where v_{i,j} \sim \text{i.i.d. } N(0,1) for j = 1, 2, 3, and the error term by u_i \sim \text{i.i.d. } N(0,1). Let p \in (0,1) denote the fraction of outliers, set k_0 = \lfloor pN \rfloor, and let \gamma_i = 1 for i \leq k_0 and \gamma_i = 0 for i > k_0. The outlier fixed effects \alpha_i \sim \text{i.i.d. } N(\mu_\alpha, \sigma_\alpha^2) are exogenous shocks, where (\mu_\alpha, \sigma_\alpha) \in \{(0,5), (5,5), (5,10)\}. The true coefficients are \beta_0 = 0.5 and (\beta_1, \beta_2) = (1,1).
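For concreteness, a minimal sketch of how one replication of DGP 1 could be simulated in Python with numpy; the function name and seed handling are illustrative.

import numpy as np

def simulate_dgp1(N, p, mu_a, sigma_a, seed=0):
    # One DGP 1 sample: exogenous outlier fixed effects on the first k0 observations.
    rng = np.random.default_rng(seed)
    v = rng.standard_normal((N, 3))
    x1 = (v[:, 0] ** 2 + v[:, 1] ** 2 - 2) / 2
    x2 = x1 + v[:, 2]
    u = rng.standard_normal(N)
    k0 = int(np.floor(p * N))
    gamma = (np.arange(N) < k0).astype(float)       # outlier indicator
    alpha = rng.normal(mu_a, sigma_a, size=N)
    y = 0.5 + x1 + x2 + gamma * alpha + u           # beta0 = 0.5, beta1 = beta2 = 1
    return y, np.column_stack([x1, x2])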

DGP 2 (Endogenous Outliers). This DGP deviates from DGP 1 by allowing the outlier fixed effects to be correlated with the regressors. Let \alpha_i = \rho(v_{i,1} + v_{i,2} + v_{i,3}) be a linear combination of the innovations, which makes the outlier fixed effects correlated with the regressors and creates endogeneity. The parameter \rho \in \{2, 5, 10\} controls the degree of correlation. The remaining components are the same as in DGP 1.

DGP 3 (Predictive Regression with Endogenous Shocks). Consider a linear predictive regression model as studied in Kostakis et al. (2015), Koo et al. (2020), and Lee et al. (2022). The dependent variable is generated as

y_{i+1} = \beta_0 + \beta_1 z_i + \sum_{l=1}^2 x_{i,l}^c \phi_l + \sum_{l=1}^2 x_{i,l} \eta_{l,N} + \gamma_i \alpha_i + u_{i+1},

where \beta_0 = 0.3, \beta_1 = 1, \phi = (1, -1), and \eta_N = (1/\sqrt{N}, -1/\sqrt{N}). The vector of stacked innovations \xi_i = (z_i, v_i', e_i', u_i)', with v_i, e_i \in \mathbb{R}^2, follows a VAR(1) process \xi_i = \Phi \xi_{i-1} + \varepsilon_i, where \varepsilon_i \sim \text{i.i.d. } N(0, \Sigma_\varepsilon), and \Phi and \Sigma_\varepsilon are empirically estimated from the Welch and Goyal (2008) data as in Lee et al. (2022, Supplement S1). x_i^c \in \mathbb{R}^2 is a vector I(1) process with cointegration rank 1 based on the VECM \Delta x_i^c = \Gamma' \Lambda x_{i-1}^c + v_i, where \Lambda = \begin{pmatrix} 1 & -1 \\ 0 & 1 \end{pmatrix} and \Gamma = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} are the cointegrating matrix and the loading matrix, respectively. The components (x_{i,l})_{l=1}^2 are random walks generated by x_{i,l} = x_{i-1,l} + e_{i,l}, l = 1, 2. Let p \in (0,1) be the fraction of outliers and define k_0 = \lfloor pN \rfloor. We impose two outlier periods. Let c_1 = \lfloor 0.25N \rfloor and c_2 = \lfloor 0.75N \rfloor be the positions in the sample where the outliers start. The outlier indicator \gamma_i = 1 for i = c_l + 1, c_l + 2, \cdots, c_l + \lfloor k_0/2 \rfloor, l = 1, 2, and 0 otherwise. The outlier shifts are \alpha_i = \rho(z_i + v_i'\iota_2), where \rho \in \{2, 5, 10\}.

To evaluate each heuristic algorithm in Section 4.1 against solving the full-scale mixed integer optimization problem (4.4), we report the measure Equal MIO, the frequency across replications with which the estimates obtained by the algorithm coincide with the MIO solution of (4.4). For each implementation that invokes the MIO solver, we report the relative optimality gap, defined as

\text{Relative Optimality Gap} = \frac{f^P - f^D}{f^D},

where f^P is the attained upper bound on the objective value of the MIO solution and f^D is the dual lower bound on the objective value delivered by the solver. (These values can be read from the solver output; details of the definition can be found at https://www.gurobi.com/documentation/current/refman/mipgap2.html.) The relative optimality gap measures the progress of the solver's optimality verification. The solution is verified to be globally optimal if this gap is less than a prespecified tolerance threshold of 10^{-4}.

For the estimation and out-of-sample prediction performance, we report the bias, root mean squared error (RMSE), and prediction errors. Generically, for a true parameter \beta and its estimates \hat{\beta}^{(r)} across R replications, the bias and RMSE are calculated as R^{-1}\sum_{r=1}^R (\hat{\beta}^{(r)} - \beta) and \sqrt{R^{-1}\sum_{r=1}^R (\hat{\beta}^{(r)} - \beta)^2}, respectively. To calculate the prediction error, we generate test data following the same DGP without outliers, \{\widetilde{y}_i, \widetilde{x}_i'\}_{i=1}^{N_t} with N_t = 1000. The prediction error is calculated as

\text{Prediction Error} = R^{-1}\sum_{r=1}^R \left( \frac{1}{N_t} \sum_{i=1}^{N_t} \left( \widetilde{y}_i^{(r)} - \widetilde{x}_i^{(r)\prime} \hat{\beta}^{(r)} \right)^2 \right).

In all experiments, we consider p \in \{5\%, 10\%, 20\%\} and N \in \{100, 200, 400\}. All experiments are carried out on a Linux machine with an Intel i9-13900K CPU using the Gurobi optimization solver, version 11.0. The time limit for the solver is set to 5 minutes.

5.2 Performance of the Heuristic Algorithms

We begin by evaluating the performance of the heuristic algorithms presented in Section 4.1 in comparison to solving the full-scale mixed integer optimization problem (4.4). We set the parameters (\mu_\alpha, \sigma_\alpha) = (5,5) for DGP 1 and \rho = 5 for DGPs 2 and 3, and run 100 replications for each setup. The parameter k is set to the true value k_0 for all algorithms. For the MIO problems (4.4) and (4.6), M_\alpha = \tau\|\hat{\alpha}\|_\infty and M_{\alpha,1} = \tau\|\hat{\alpha}\|_1, where \hat{\alpha} is the initial estimator and \tau = 1.5.

For each simulated dataset, we implement Algorithm 1, the iterative hard-thresholding (IHT) method. Subsequently, using the IHT estimate as the initial estimator, we apply the local combinatorial search (LCS) algorithm with local exactness levels l = 1 and l = 2, referred to as LCS-1 and LCS-2, respectively. Additionally, we solve the full-scale mixed integer optimization problem (4.4) using the IHT estimate as a warm start. In Table 1, we report the average CPU runtime (in seconds) for each algorithm, the frequency with which the IHT, LCS-1, and LCS-2 estimates equal the MIO solution, and the average relative optimality gap.

Table 1: Heuristic Algorithms Performance
CPU Time (Seconds) Equal MIO Optimality Gap
N p IHT LCS-1 LCS-2 MIO IHT LCS-1 LCS-2 LCS-1 LCS-2 MIO
DGP 1 100 0.05 0.0004 0.0969 0.2030 0.1596 0.91 1.00 1.00 0.0001 0.0000 0.0000
100 0.1 0.0004 0.1169 0.8459 5.2721 0.83 1.00 1.00 0.0001 0.0002 0.0000
100 0.2 0.0004 0.2433 1.4303 222.29 0.59 0.93 0.99 0.0000 0.0003 0.1298
200 0.05 0.0005 0.3674 5.2707 65.573 0.95 0.98 1.00 0.0000 0.0004 0.0131
200 0.1 0.0005 0.6265 12.093 304.92 0.78 0.94 1.00 0.0001 0.0009 0.3527
200 0.2 0.0005 0.7830 19.359 306.12 0.72 0.87 1.00 0.0002 0.0009 0.7530
400 0.05 0.0007 1.7538 105.28 302.69 0.92 1.00 1.00 0.0000 0.0010 0.4249
400 0.1 0.0007 4.3731 205.89 302.87 0.77 0.92 0.98 0.0006 0.0010 0.7960
400 0.2 0.0008 4.7983 299.60 302.56 0.55 0.83 0.98 0.0008 0.0045 0.9500
DGP 2 100 0.05 0.0004 0.1001 0.2034 0.2160 0.88 1.00 1.00 0.0001 0.0000 0.0000
100 0.1 0.0004 0.1264 1.0286 2.4994 0.87 1.00 1.00 0.0002 0.0002 0.0000
100 0.2 0.0004 0.1664 1.4692 76.835 0.65 0.94 0.99 0.0002 0.0004 0.0176
200 0.05 0.0005 0.4235 6.0045 134.65 0.91 1.00 1.00 0.0001 0.0006 0.0402
200 0.1 0.0005 0.6005 12.279 292.25 0.74 0.94 1.00 0.0001 0.0009 0.3163
200 0.2 0.0005 0.8669 19.165 305.50 0.71 0.89 0.98 0.0003 0.0010 0.6237
400 0.05 0.0006 1.9436 115.93 302.42 0.92 1.00 1.00 0.0002 0.0010 0.4884
400 0.1 0.0007 4.9341 212.42 302.54 0.81 0.92 0.96 0.0007 0.0010 0.8046
400 0.2 0.0007 5.2007 300.10 302.34 0.57 0.81 0.97 0.0008 0.0030 0.9354
DGP 3 100 0.05 0.0008 0.3549 0.2814 0.3953 0.86 1.00 1.00 0.0001 0.0000 0.0000
100 0.1 0.0009 0.4673 9.9887 13.779 0.86 0.99 1.00 0.0000 0.0005 0.0001
100 0.2 0.0010 0.5272 13.365 214.53 0.60 0.96 0.99 0.0001 0.0007 0.2078
200 0.05 0.0014 1.8395 38.356 135.95 0.79 0.97 1.00 0.0001 0.0008 0.0597
200 0.1 0.0013 2.2910 64.206 293.49 0.75 0.95 0.99 0.0002 0.0011 0.4392
200 0.2 0.0016 2.3277 61.629 302.97 0.63 0.84 0.95 0.0002 0.0009 0.8481
400 0.05 0.0019 7.4183 241.30 301.51 0.87 0.96 1.00 0.0003 0.0062 0.5299
400 0.1 0.0097 9.0912 265.04 301.38 0.80 0.88 0.95 0.0004 0.0075 0.8901
400 0.2 0.0063 10.169 253.46 301.07 0.51 0.78 0.93 0.0004 0.0103 0.9900

Note: The left panel reports the average computation time for each algorithm in seconds. The middle panel reports the frequency across replications with which the estimates obtained by the algorithm are the same as those of the MIO solution of (4.4). The right panel corresponds to the average, across replications, of the relative optimality gap, defined as (f^P - f^D)/f^D, where f^P is the attained upper bound on the objective value of the MIO solution and f^D is the dual lower bound on the objective value delivered by the solver. IHT refers to the iterative hard-thresholding in Algorithm 1. LCS-1 and LCS-2 are the local combinatorial search algorithm in Algorithm 2 with l = 1 and 2, respectively. MIO refers to solving (4.4) directly with the Gurobi solver.

The key findings presented in Table 1 highlight the performance differences between the heuristic algorithms and the mixed integer optimization (MIO) approach. Notably, within the 5-minute time limit, the MIO method fails to complete the optimality verification for cases larger than N = 200 and p = 0.1. In contrast, the iterative hard-thresholding (IHT) and the local combinatorial search with l = 1 (LCS-1) are significantly faster. Specifically, IHT runs in milliseconds and LCS-1 achieves optimality in seconds in most cases. LCS-2 takes less than a minute except when N = 400 and p = 0.1 or 0.2. In these exceptional cases, the relative optimality gap of LCS-2 remains below 1%, indicating near convergence. Furthermore, the initial estimator, IHT, provides solutions of the same quality as MIO in 50% to 90% of the cases. When advancing to LCS-1, the frequency of obtaining solutions equivalent to MIO increases to approximately 90%, at a significantly lower computational cost. LCS-2 matches the MIO solutions in nearly all replications. These findings underscore the effectiveness and computational efficiency of the heuristic algorithms. In particular, the local combinatorial search algorithm balances solution stability and computational cost, which is important when dealing with large sample sizes and many outliers.

5.3 Comparison between L_0- and L_1-Regularized Estimation

In this section, we compare the performance of the L_0- and L_1-regularized estimation methods in different scenarios. For the L_0- and L_1-regularized estimation, the tuning parameters k and \psi are selected using the information criterion (4.7). When choosing k, we rely on the neighborhood search algorithm, detailed in Algorithm 3, to generate the estimates for each candidate k \in [K]. Specifically, in the implementation of Algorithm 3, we set K = 2k_0 and the local exactness level l = 1. We report LCS-2 with \hat{k} selected by BIC as the L_0-regularized estimation results. In addition, we include ordinary least squares (OLS) and least absolute deviation (LAD) as benchmarks. Bias and root mean squared error (RMSE) are reported for the first slope parameter \beta_1 in all data generating processes (DGPs). For each setup, we run 1000 replications.

The key findings from the Monte Carlo simulations reported in Tables 2 to 4 reveal several important insights. In DGP 1, where the outlier fixed effects are exogenous and there is no endogeneity issue, the bias of the L_0 method is slightly larger than that of the alternatives. However, the L_0 method achieves the smallest RMSE and prediction error in most cases. Ordinary least squares (OLS) is not robust to outliers, and its RMSE and prediction error increase significantly as the magnitude of the outlier fixed effects becomes larger. The estimation accuracy of the L_1 method is unstable when the sample size and the outlier fraction are small, but it outperforms least absolute deviations (LAD) and OLS as N and p increase. In DGPs 2 and 3, where endogenous outlier fixed effects are introduced, the L_1-regularized method, LAD, and OLS exhibit severe estimation bias. In contrast, the L_0-regularized method is free from estimation bias and achieves the smallest RMSE and prediction error. The accuracy gap widens as \rho, the parameter controlling the degree of endogeneity, and the fraction of outliers increase. These findings demonstrate that the L_0 method is more robust to the presence of endogenous outliers and provides more accurate estimation.

Table 2: Bias, RMSE and Prediction Error Comparison: DGP 1
Bias RMSE Prediction Error
N p (\mu_\alpha,\sigma_\alpha) L0 L1 LAD OLS L0 L1 LAD OLS L0 L1 LAD OLS
100 0.05 (0,5) 0.0178 -0.0059 -0.0068 0.0044 0.1754 0.4331 0.2591 0.2172 1.0455 1.0754 1.0691 1.0708
100 0.05 (5,5) 0.0163 0.0061 -0.0013 -0.0119 0.1687 0.2567 0.2372 0.2618 1.0471 1.0513 1.0647 1.1550
100 0.05 (5,10) 0.0095 0.0174 -0.0084 -0.0001 0.1740 0.4375 0.2403 0.4806 1.0462 1.0899 1.0640 1.5507
100 0.1 (0,5) 0.0195 -0.0032 -0.0009 -0.0018 0.2057 0.3765 0.4099 0.2793 1.0634 1.0752 1.1189 1.1139
100 0.1 (5,5) 0.0197 -0.0129 0.0051 0.0071 0.1952 0.4613 0.2992 0.3372 1.0713 1.1072 1.1006 1.3960
100 0.1 (5,10) 0.0217 -0.0251 -0.0053 -0.0206 0.1907 0.5812 0.2687 0.6476 1.0726 1.1381 1.0992 2.5156
100 0.2 (0,5) 0.0598 0.0013 -0.0037 -0.0014 0.2537 0.2037 0.3693 0.3467 1.1117 1.0707 1.1242 1.1861
100 0.2 (5,5) 0.0657 -0.0023 0.0048 -0.0133 0.2930 0.4674 0.2858 0.4629 1.2384 1.2532 1.1423 2.2922
100 0.2 (5,10) 0.0886 0.0044 -0.0041 -0.0359 0.3756 0.3443 0.3889 0.8796 1.2692 1.5191 1.1856 6.0357
200 0.05 (0,5) 0.0055 -0.0011 -0.0005 -0.0020 0.1276 0.1125 0.1802 0.1504 1.0227 1.0185 1.0320 1.0338
200 0.05 (5,5) 0.0091 0.0069 0.0073 -0.0026 0.1226 0.2324 0.1887 0.1904 1.0218 1.0310 1.0340 1.1070
200 0.05 (5,10) 0.0074 0.0036 0.0107 0.0086 0.1159 0.1134 0.2214 0.3237 1.0249 1.0246 1.0448 1.3797
200 0.1 (0,5) 0.0146 0.0089 0.0060 0.0030 0.1363 0.2345 0.1631 0.1898 1.0254 1.0298 1.0293 1.0527
200 0.1 (5,5) 0.0166 0.0137 0.0047 0.0121 0.1278 0.2367 0.1525 0.2437 1.0385 1.0539 1.0449 1.3352
200 0.1 (5,10) 0.0140 0.0053 0.0117 -0.0005 0.1269 0.1355 0.1899 0.4503 1.0331 1.0522 1.0451 2.2763
200 0.2 (0,5) 0.0382 -0.0025 0.0039 -0.0155 0.1833 0.1406 0.2778 0.2525 1.0544 1.0321 1.0520 1.0980
200 0.2 (5,5) 0.0330 0.0049 0.0018 0.0001 0.1775 0.1669 0.2220 0.3259 1.0985 1.1675 1.0926 2.1352
200 0.2 (5,10) 0.0337 -0.0013 0.0064 -0.0189 0.1907 0.2270 0.2291 0.6181 1.0957 1.3804 1.1007 5.4837
400 0.05 (0,5) 0.0033 0.0071 0.0021 0.0039 0.0822 0.2162 0.0944 0.1065 1.0119 1.0207 1.0140 1.0194
400 0.05 (5,5) 0.0055 0.0028 0.0026 0.0024 0.0825 0.0789 0.0962 0.1294 1.0113 1.0154 1.0148 1.0867
400 0.05 (5,10) 0.0041 0.0031 0.0033 0.0089 0.0800 0.0807 0.1000 0.2409 1.0132 1.0173 1.0165 1.3312
400 0.1 (0,5) 0.0012 -0.0024 -0.0013 -0.0064 0.0872 0.0852 0.1042 0.1310 1.0122 1.0119 1.0153 1.0273
400 0.1 (5,5) 0.0052 -0.0007 0.0028 -0.0057 0.0857 0.0912 0.1288 0.1710 1.0158 1.0433 1.0277 1.2877
400 0.1 (5,10) 0.0069 0.0019 0.0073 -0.0055 0.0833 0.0911 0.1721 0.3137 1.0131 1.0351 1.0299 2.1209
400 0.2 (0,5) 0.0023 -0.0003 -0.0017 0.0019 0.1051 0.1033 0.1025 0.1779 1.0152 1.0153 1.0150 1.0465
400 0.2 (5,5) 0.0134 0.0040 0.0032 0.0063 0.1099 0.1130 0.2314 0.2254 1.0487 1.1439 1.0823 2.0636
400 0.2 (5,10) 0.0172 -0.0030 -0.0059 -0.0163 0.1215 0.1501 0.1191 0.4302 1.0602 1.3277 1.0741 5.2611

Note: L0, L1, LAD, and OLS refer to the local combinatorial search algorithm with l = 2, the L_1-regularized estimator defined in (3.2), the least absolute deviation estimator, and the ordinary least squares estimator, respectively. Generically, for a true parameter \beta and its estimates \hat{\beta}^{(r)} across R = 1000 replications, the bias and RMSE are calculated as R^{-1}\sum_{r=1}^R (\hat{\beta}^{(r)} - \beta) and \sqrt{R^{-1}\sum_{r=1}^R (\hat{\beta}^{(r)} - \beta)^2}, respectively. We report the bias and RMSE for \beta_1. The prediction error is calculated as R^{-1}\sum_{r=1}^R ( N_t^{-1} \sum_{i=1}^{N_t} ( \widetilde{y}_i^{(r)} - \widetilde{x}_i^{(r)\prime} \hat{\beta}^{(r)} )^2 ), where \{\widetilde{y}_i, \widetilde{x}_i'\}_{i=1}^{N_t} is generated from the same DGP without outliers and N_t = 1000.

Table 3: Bias, RMSE and Prediction Error Comparison: DGP 2
Bias RMSE Prediction Error
N p \rho L0 L1 LAD OLS L0 L1 LAD OLS L0 L1 LAD OLS
100 0.05 2 0.0136 -0.0425 -0.0213 -0.1039 0.1856 0.4412 0.2999 0.2418 1.0520 1.1012 1.0764 1.0753
100 0.05 5 0.0108 -0.0292 -0.0293 -0.2481 0.1721 0.3388 0.3421 0.4599 1.0454 1.0695 1.0981 1.2632
100 0.05 10 0.0024 -0.0455 -0.0375 -0.5083 0.1695 0.5241 0.2712 0.9381 1.0450 1.1111 1.0725 2.0180
100 0.1 2 0.0124 -0.0782 -0.0647 -0.2000 0.1985 0.2809 0.3104 0.3223 1.0584 1.0694 1.0793 1.1250
100 0.1 5 0.0177 -0.1038 -0.0633 -0.5106 0.1959 0.4527 0.3108 0.7374 1.0617 1.0868 1.0951 1.6280
100 0.1 10 0.0084 -0.1197 -0.0665 -0.9777 0.1932 0.3861 0.2889 1.4546 1.0652 1.1075 1.0973 3.4246
100 0.2 2 0.0470 -0.1924 -0.1443 -0.3914 0.2518 0.5470 0.3400 0.5008 1.0995 1.1424 1.1261 1.2899
100 0.2 5 0.0210 -0.2921 -0.1563 -1.0116 0.2536 0.4487 0.3351 1.2382 1.1436 1.1982 1.1369 2.6585
100 0.2 10 -0.0150 -0.4833 -0.1652 -1.9442 0.2628 0.6554 0.4466 2.4123 1.1598 1.5000 1.1400 7.3165
200 0.05 2 -0.0039 -0.0472 -0.0373 -0.1026 0.1263 0.1222 0.1418 0.1779 1.0203 1.0179 1.0231 1.0373
200 0.05 5 0.0019 -0.0439 -0.0357 -0.2472 0.1228 0.2360 0.1610 0.3677 1.0200 1.0288 1.0287 1.1555
200 0.05 10 0.0028 -0.0480 -0.0289 -0.5072 0.1209 0.1247 0.1675 0.7180 1.0243 1.0235 1.0350 1.6064
200 0.1 2 -0.0058 -0.0950 -0.0619 -0.2023 0.1371 0.2891 0.1853 0.2717 1.0284 1.0398 1.0369 1.0844
200 0.1 5 0.0066 -0.0861 -0.0538 -0.4758 0.1344 0.2542 0.2812 0.6015 1.0284 1.0424 1.0614 1.4231
200 0.1 10 0.0024 -0.1185 -0.0633 -1.0126 0.1321 0.1844 0.1817 1.2477 1.0283 1.0408 1.0515 2.6637
200 0.2 2 0.0132 -0.1876 -0.1343 -0.3975 0.1743 0.2374 0.2047 0.4621 1.0486 1.0667 1.0520 1.2298
200 0.2 5 0.0077 -0.2762 -0.1411 -1.0094 0.1801 0.3342 0.3001 1.1396 1.0629 1.1241 1.0809 2.3628
200 0.2 10 -0.0130 -0.4515 -0.1441 -2.0215 0.1803 0.5364 0.2658 2.2574 1.0698 1.3223 1.0785 6.3780
400 0.05 2 -0.0055 -0.0437 -0.0278 -0.1000 0.0815 0.2212 0.0907 0.1433 1.0099 1.0209 1.0117 1.0243
400 0.05 5 0.0010 -0.0513 -0.0274 -0.2519 0.0826 0.0952 0.1322 0.3206 1.0115 1.0136 1.0192 1.1145
400 0.05 10 -0.0005 -0.0543 -0.0338 -0.4964 0.0829 0.0981 0.1257 0.6225 1.0113 1.0136 1.0172 1.4262
400 0.1 2 -0.0112 -0.0978 -0.0607 -0.1992 0.0874 0.1306 0.1100 0.2350 1.0123 1.0219 1.0171 1.0623
400 0.1 5 -0.0001 -0.1024 -0.0568 -0.4896 0.0860 0.1349 0.2014 0.5579 1.0116 1.0225 1.0339 1.3413
400 0.1 10 -0.0017 -0.1138 -0.0648 -0.9929 0.0854 0.1473 0.1400 1.1230 1.0112 1.0249 1.0237 2.3312
400 0.2 2 -0.0307 -0.2113 -0.1373 -0.4016 0.1022 0.2347 0.1827 0.4328 1.0176 1.0613 1.0381 1.1956
400 0.2 5 -0.0047 -0.2648 -0.1350 -1.0076 0.1064 0.2933 0.2762 1.0737 1.0207 1.0937 1.0692 2.1861
400 0.2 10 -0.0052 -0.4260 -0.1478 -2.0080 0.1020 0.4701 0.1815 2.1353 1.0214 1.2335 1.0410 5.6672

Note: L0, L1, LAD and OLS refer to the local combinatorial search algorithm with l=2, the L_{1}-regularized estimator defined in (3.2), the least absolute deviation estimator, and the ordinary least squares estimator, respectively. Bias and RMSE are calculated as R^{-1}\sum_{r=1}^{R}\left(\hat{\beta}^{(r)}-\beta\right) and \sqrt{R^{-1}\sum_{r=1}^{R}\left(\hat{\beta}^{(r)}-\beta\right)^{2}}, respectively, where \beta is the true parameter and \hat{\beta}^{(r)} its estimate in the r-th of R=1000 replications. We report bias and RMSE for \beta_{1}. The prediction error is calculated as \text{Prediction Error}=R^{-1}\sum_{r=1}^{R}\left(\frac{1}{N_{t}}\sum_{i=1}^{N_{t}}\left(\widetilde{y}_{i}^{(r)}-\widetilde{x}_{i}^{(r)\prime}\hat{\beta}^{(r)}\right)^{2}\right), where \left\{\widetilde{y}_{i},\widetilde{x}_{i}^{\prime}\right\}_{i=1}^{N_{t}} is generated from the same DGP without outliers and N_{t}=1000.

Table 4: Bias, RMSE and Prediction Error Comparison: DGP 3
N  p  ρ  |  Bias: L0  L1  LAD  OLS  |  RMSE: L0  L1  LAD  OLS  |  Prediction Error: L0  L1  LAD  OLS
100 0.05 2 0.0133 0.0374 0.0305 0.0947 0.1181 0.1216 0.1343 0.1707 1.0929 1.1016 1.1199 1.1796
100 0.05 5 0.0028 0.0372 0.0288 0.2316 0.1120 0.1214 0.1335 0.3450 1.1013 1.0985 1.1204 1.3718
100 0.05 10 -0.0014 0.0400 0.0233 0.4303 0.1137 0.1283 0.1332 0.6133 1.1252 1.1627 1.1821 2.7509
100 0.1 2 0.0202 0.0912 0.0705 0.2313 0.1379 0.1571 0.1586 0.2903 1.2159 1.2227 1.2517 1.3784
100 0.1 5 0.0051 0.1177 0.0699 0.5802 0.1244 0.1799 0.1529 0.6761 1.1360 1.1734 1.1954 2.3094
100 0.1 10 0.0104 0.1829 0.0787 1.1178 0.1154 0.2515 0.1547 1.3130 1.1591 1.2737 1.1658 6.6681
100 0.2 2 0.0313 0.2201 0.1568 0.4539 0.1541 0.2676 0.2196 0.4995 1.1839 1.2791 1.2140 1.7171
100 0.2 5 0.0128 0.3870 0.1699 1.1336 0.1371 0.4517 0.2315 1.2259 1.1921 1.6585 1.2529 4.9567
100 0.2 10 0.0178 0.7165 0.1692 2.2973 0.1379 0.8299 0.2350 2.4777 1.1395 3.0481 1.2506 17.9627
200 0.05 2 0.0090 0.0459 0.0326 0.1128 0.0837 0.0911 0.0937 0.1493 1.1170 1.1022 1.1237 1.1233
200 0.05 5 0.0021 0.0481 0.0299 0.2854 0.0834 0.0992 0.0998 0.3453 1.0353 1.0283 1.0434 1.3253
200 0.05 10 -0.0005 0.0523 0.0292 0.5746 0.0736 0.0962 0.0924 0.6796 1.0445 1.0538 1.0657 2.2574
200 0.1 2 0.0122 0.0927 0.0650 0.2258 0.0927 0.1263 0.1145 0.2563 1.0816 1.0692 1.0776 1.1800
200 0.1 5 0.0066 0.1106 0.0712 0.5705 0.0815 0.1417 0.1155 0.6230 1.1258 1.1599 1.1389 2.1070
200 0.1 10 -0.0008 0.1536 0.0654 1.1162 0.0822 0.1931 0.1145 1.2159 1.0499 1.1218 1.0782 4.7269
200 0.2 2 0.0336 0.2183 0.1547 0.4549 0.1168 0.2405 0.1856 0.4767 1.0496 1.0919 1.0602 1.4486
200 0.2 5 0.0125 0.3576 0.1620 1.1350 0.0987 0.3921 0.1970 1.1828 1.0410 1.4231 1.1345 4.2417
200 0.2 10 0.0075 0.6543 0.1656 2.2731 0.0988 0.7136 0.2011 2.3656 1.0560 2.2161 1.1482 13.1995
400 0.05 2 0.0084 0.0527 0.0319 0.1139 0.0575 0.0760 0.0680 0.1335 1.0229 1.0125 1.0149 1.0271
400 0.05 5 0.0033 0.0571 0.0337 0.2849 0.0544 0.0800 0.0712 0.3138 1.0249 1.0412 1.0431 1.3004
400 0.05 10 0.0023 0.0568 0.0334 0.5652 0.0551 0.0812 0.0722 0.6180 1.0471 1.0697 1.0608 2.2000
400 0.1 2 0.0136 0.1077 0.0675 0.2276 0.0660 0.1241 0.0949 0.2433 1.0291 1.0416 1.0408 1.1401
400 0.1 5 0.0062 0.1134 0.0697 0.5633 0.0597 0.1299 0.0945 0.5890 1.0542 1.0938 1.0799 1.9397
400 0.1 10 0.0011 0.1469 0.0691 1.1311 0.0583 0.1676 0.0971 1.1816 1.1474 1.1870 1.1448 4.3749
400 0.2 2 0.0243 0.2288 0.1505 0.4515 0.0795 0.2405 0.1671 0.4632 1.0798 1.1893 1.1186 1.5341
400 0.2 5 0.0057 0.3392 0.1541 1.1227 0.0658 0.3560 0.1717 1.1470 0.9963 1.2354 1.0458 3.7024
400 0.2 10 0.0053 0.6241 0.1604 2.2530 0.0656 0.6515 0.1781 2.2978 1.0662 1.9160 1.1003 11.8648

Note: L0, L1, LAD and OLS refer to the local combinatorial search algorithm with l=2, the L_{1}-regularized estimator defined in (3.2), the least absolute deviation estimator, and the ordinary least squares estimator, respectively. Bias and RMSE are calculated as R^{-1}\sum_{r=1}^{R}\left(\hat{\beta}^{(r)}-\beta\right) and \sqrt{R^{-1}\sum_{r=1}^{R}\left(\hat{\beta}^{(r)}-\beta\right)^{2}}, respectively, where \beta is the true parameter and \hat{\beta}^{(r)} its estimate in the r-th of R=1000 replications. We report bias and RMSE for \beta_{1}. The prediction error is calculated as \text{Prediction Error}=R^{-1}\sum_{r=1}^{R}\left(\frac{1}{N_{t}}\sum_{i=1}^{N_{t}}\left(\widetilde{y}_{i}^{(r)}-\widetilde{x}_{i}^{(r)\prime}\hat{\beta}^{(r)}\right)^{2}\right), where \left\{\widetilde{y}_{i},\widetilde{x}_{i}^{\prime}\right\}_{i=1}^{N_{t}} is generated from the same DGP without outliers and N_{t}=1000.

6 Empirical Illustration: Stock Return Predictability

Linear predictive regression models have been extensively studied for stock return forecasting. For instance, Koo et al. (2020), Lee et al. (2022), and others have applied linear predictive regression to the Welch and Goyal (2008) dataset to investigate stock return predictability. Financial time series data often exhibit instability, particularly during periods such as the financial crisis, and including such periods in estimation and forecasting can substantially alter the results. As an illustration, we apply data-driven L_{0}- and L_{1}-regularized robust estimation methods to the Welch and Goyal (2008) dataset (retrieved from http://www.hec.unil.ch/agoyal/) and evaluate the out-of-sample stock return prediction performance.

We use monthly data from January 1990 to December 2023, which covers the dot-com bubble period, the 2007-09 financial crisis, and the Covid-19 pandemic. The dependent variable, excess return, is defined as the difference between the continuously compounded return on the S&P 500 index and the three-month Treasury bill rate, computed by

\text{ExReturn}_{i}=\log\left(\text{index}_{i}/\text{index}_{i-1}\right)-\log\left(1+\text{tbl}_{i}/12\right).
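
As an illustration of this construction, a minimal pandas sketch is given below; the column names "Index" and "tbl" are assumptions about how the monthly series are stored, not documentation of the dataset's layout.

import numpy as np
import pandas as pd

def excess_return(df: pd.DataFrame) -> pd.Series:
    # log return on the S&P 500 index minus the log of the gross monthly T-bill rate
    log_ret = np.log(df["Index"] / df["Index"].shift(1))
    return log_ret - np.log(1.0 + df["tbl"] / 12.0)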

Twelve financial and macroeconomic variables, denoted by x_{i}, are included in the model as predictors: Dividend Price Ratio, Dividend Yield, Earning Price Ratio, Term Spread, Default Yield Spread, Default Return Spread, Book-to-Market Ratio, Treasury Bill Rates, Long-Term Return, Net Equity Expansion, Stock Variance, and Inflation; Table A.1 in the appendix provides a detailed description of each variable. We apply the L_{0}- and L_{1}-regularized methods to estimate the model,

\text{ExReturn}_{i+1}=\beta_{0}+x_{i}^{\prime}\beta_{1}+\alpha_{i}+u_{i+1},

for i\in\left[N\right], and construct one-month-ahead forecasts \hat{f}_{N+1}=\hat{\beta}_{0}+x_{N}^{\prime}\hat{\beta}_{1} recursively with a 10-year rolling window for each month from January 2000 to December 2023. As in the Monte Carlo experiments, LAD and OLS are included as benchmarks.
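
A minimal sketch of this recursive exercise is shown below, with a generic fit(y, X) routine standing in for the L_{0}-regularized, L_{1}-regularized, LAD, or OLS fit; the function and variable names are illustrative assumptions, not the paper's code.

import numpy as np

def rolling_forecasts(y_next, X, fit, window=120):
    # y_next : (T,) array with y_next[t] = ExReturn_{t+1}
    # X      : (T, p) array with X[t] = x_t, the predictors observed at month t
    # fit    : callable (y, X) -> (b0, b1), one of the four estimators
    # window : 120 months, i.e. the 10-year rolling window
    preds = {}
    for t in range(window, len(y_next)):
        b0, b1 = fit(y_next[t - window:t], X[t - window:t])
        preds[t] = b0 + X[t] @ b1        # one-month-ahead forecast of ExReturn_{t+1}
    return preds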

Table 5 compares the mean prediction squared error (MPSE) of the different methods across forecasting periods. We consider the full forecast period from January 2000 to December 2023 and six subperiods based on the start and end dates of the dot-com bubble burst (Mar. 2000 to Nov. 2000), the financial crisis (Dec. 2007 to Jun. 2009), and the Covid-19 shocks (Feb. 2020 to Apr. 2020). The results show that the L_{0}-regularized method delivers the smallest MPSE in every period except the financial crisis subperiod, which highlights the forecast accuracy gains from robustness to potentially endogenous outliers.
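
Each MPSE entry in Table 5 is simply the average squared forecast error over the stated dates; a small sketch, assuming the squared errors are stored in a date-indexed pandas Series (the series name is illustrative):

import pandas as pd

def mpse(sq_err: pd.Series, start: str, end: str) -> float:
    # sq_err: squared forecast errors indexed by month-end dates (DatetimeIndex)
    return float(sq_err.loc[start:end].mean())

# e.g. mpse(sq_err, "2007-12", "2009-06") for the financial crisis row of Table 5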

Table 5: Mean Prediction Squared Error (MPSE) for Monthly Excess Return of S&P 500 Index
Forecast Period L0 L1 LAD OLS
All periods 0.00306 0.00325 0.00400 0.00341
(Jan. 2000 to Dec. 2023)
Dot-com Bubble Burst 0.00409 0.00455 0.00450 0.00474
(Mar. 2000 to Nov. 2000)
Dot-com - Financial Crisis 0.00169 0.00187 0.00207 0.00197
(Mar. 2001 to Jun. 2009)
Financial Crisis 0.00937 0.00906 0.01389 0.01082
(Dec. 2007 to Jun. 2009)
Financial Crisis - Covid 0.00128 0.00134 0.00148 0.00145
(Dec. 2007 to Feb. 2020)
Covid 0.06758 0.07493 0.11073 0.07988
(Feb. 2020 to Apr. 2020)
Post-Covid 0.00332 0.00378 0.00335 0.00340
(Apr. 2020 to Dec. 2023)

Notes: The mean prediction squared error (MPSE) is calculated by averaging the squared forecast loss over the corresponding period. The dates below each period name indicate the forecast period. L0, L1, LAD and OLS refer to the local combinatorial search algorithm with l=2, the L_{1}-regularized estimator defined in (3.2), the least absolute deviation estimator, and the ordinary least squares estimator, respectively. The smallest MPSE in each period is highlighted in bold.

Additionally, Figure 2 illustrates the outliers detected (\hat{\alpha}_{i}\neq 0) in each rolling window by the L_{0}- and L_{1}-regularized methods. In the grid plot, each row corresponds to a forecast period, and each column corresponds to a period that may be included in the estimation rolling window. Within each row, the highlighted cells represent the estimation window for the corresponding forecast target period. A cell is colored red if the corresponding period is detected as an outlier by both methods, blue if both methods treat the period as an inlier, purple if only the L_{0}-regularized method flags it as an outlier, and yellow if only the L_{1}-regularized method flags it as an outlier. As shown in the figure, the L_{0} method consistently detects periods around the dot-com bubble, the financial crisis, and the Covid-19 shocks across rolling windows. The L_{1} method detects a broadly similar pattern but also labels some periods outside these concentrated outlier regions as outliers.
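
The color coding described above reduces to a four-way comparison of two boolean outlier-indicator matrices; one possible construction is sketched below (the matrix names are illustrative, and cells outside each estimation window would be masked before plotting).

import numpy as np

def outlier_grid(flag_l0, flag_l1):
    # flag_l0, flag_l1: boolean arrays (rows = forecast periods, columns = sample periods),
    # True where \hat{alpha}_i != 0 in the corresponding rolling window.
    # Returns 0 = inlier for both (blue), 1 = outlier for both (red),
    #         2 = outlier only under L0 (purple), 3 = outlier only under L1 (yellow).
    grid = np.zeros(flag_l0.shape, dtype=int)
    grid[flag_l0 & flag_l1] = 1
    grid[flag_l0 & ~flag_l1] = 2
    grid[~flag_l0 & flag_l1] = 3
    return grid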

Figure 2: Outlier Detection with L_{0}- and L_{1}-regularized Methods

7 Conclusion

This paper addresses the robust estimation of linear regression models in the presence of potentially endogenous outliers. We demonstrate that existing L_{1}-regularized estimation methods exhibit significant bias when outliers are endogenous and develop L_{0}-regularized estimation methods to overcome this issue. We propose systematic heuristic algorithms, notably an iterative hard-thresholding algorithm and a local combinatorial search refinement, to efficiently solve the combinatorial optimization problem of the L_{0}-regularized estimation. The properties of the L_{0}- and L_{1}-regularized methods are examined through Monte Carlo simulations. We illustrate the practical value of our method with an empirical application to stock return forecasting.

Appendix

Appendix A.1 Proofs

Proof of Proposition 1.

Note that (4.1) is equivalent to

\Theta_{k}\left(Y,X\right)=\min_{\mathcal{I}\subset\mathcal{N},\left|\mathcal{I}\right|\geq N-k}\;\min_{\beta}\sum_{i\in\mathcal{I}}\left(y_{i}-x_{i}^{\prime}\beta\right)^{2}.

Denote

\Theta\left(Y,X\right)=\Theta_{\hat{k}}\left(Y,X\right)

where

\hat{k}=\arg\min_{1\leq k\leq\lfloor N/2\rfloor}\mathrm{BIC}^{\ast}\left(k;\left(Y,X\right)\right)=\arg\min_{1\leq k\leq\lfloor N/2\rfloor}\left\{N\log\left(\frac{\Theta_{k}\left(Y,X\right)}{N}\right)+k\log\left(N\right)\right\}.
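
For reference, this data-driven choice of k amounts to a one-dimensional search over the \mathrm{BIC}^{\ast} criterion; a minimal Python sketch, assuming a routine theta(k) that returns the minimized objective \Theta_{k}\left(Y,X\right) (this helper is hypothetical):

import numpy as np

def select_k(theta, N):
    # theta(k) should return Theta_k(Y, X), the minimized trimmed sum of squares
    ks = np.arange(1, N // 2 + 1)                    # admissible range 1 <= k <= floor(N/2)
    bic = [N * np.log(theta(k) / N) + k * np.log(N) for k in ks]
    return int(ks[int(np.argmin(bic))])              # k_hat minimizing BIC*(k)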

Note that B\left(T;\left(Y,X\right)\right)=B\left(\Theta;\left(Y,X\right)\right) since \Theta\left(Y,X\right) is a quadratic function of T\left(Y,X\right), so we can focus on \Theta\left(Y,X\right) in what follows.

We first show that B\left(\Theta;\left(Y,X\right)\right)>\lfloor N/2\rfloor/N by contradiction. Suppose k_{0}=\lfloor N/2\rfloor, and let \mathcal{I}_{0} denote the set of uncontaminated observations, with \left|\mathcal{I}_{0}\right|=N-\lfloor N/2\rfloor. The objective value computed on the sample in \mathcal{I}_{0} satisfies

\Theta_{\lfloor N/2\rfloor}\left(\tilde{Y}_{(k_{0})},\tilde{X}_{(k_{0})}\right)\leq\min_{\beta}\sum_{i\in\mathcal{I}_{0}}\left(y_{i}-x_{i}^{\prime}\beta\right)^{2}<\infty,

then \sup_{\tilde{Y}_{(k_{0})},\tilde{X}_{(k_{0})}}\Theta_{\lfloor N/2\rfloor}\left(\tilde{Y}_{(k_{0})},\tilde{X}_{(k_{0})}\right)<\infty, and

\sup_{\tilde{Y}_{(k_{0})},\tilde{X}_{(k_{0})}}\mathrm{BIC}^{\ast}\left(\lfloor N/2\rfloor;\left(\tilde{Y}_{(k_{0})},\tilde{X}_{(k_{0})}\right)\right)<\infty. (A.1)

Suppose

\sup_{\tilde{Y}_{(k_{0})},\tilde{X}_{(k_{0})}}\left\|\Theta\left(Y,X\right)-\Theta\left(\tilde{Y}_{(k_{0})},\tilde{X}_{(k_{0})}\right)\right\|=\infty,

then

\sup_{\tilde{Y}_{(k_{0})},\tilde{X}_{(k_{0})}}\Theta\left(\tilde{Y}_{(k_{0})},\tilde{X}_{(k_{0})}\right)=\sup_{\tilde{Y}_{(k_{0})},\tilde{X}_{(k_{0})}}\Theta_{\hat{k}}\left(\tilde{Y}_{(k_{0})},\tilde{X}_{(k_{0})}\right)=\infty.

This implies

\sup_{\tilde{Y}_{(k_{0})},\tilde{X}_{(k_{0})}}\mathrm{BIC}^{\ast}\left(k;\left(\tilde{Y}_{(k_{0})},\tilde{X}_{(k_{0})}\right)\right)=\infty,

for 1\leq k\leq\lfloor N/2\rfloor, which contradicts (A.1). As a result, B\left(\Theta;\left(Y,X\right)\right)>\lfloor N/2\rfloor/N.

Now suppose k_{0}=\lfloor N/2\rfloor+1. Then \Theta\left(\tilde{Y}_{(k_{0})},\tilde{X}_{(k_{0})}\right)=\min_{\beta}\sum_{i\in\mathcal{I}^{\ast}}\left(y_{i}-x_{i}^{\prime}\beta\right)^{2} for some \mathcal{I}^{\ast}\subset\mathcal{N}, and since \left|\mathcal{I}^{\ast}\right|\geq N-\hat{k}\geq N-\lfloor N/2\rfloor>N-k_{0}, the set \mathcal{I}^{\ast} must contain at least one arbitrarily contaminated observation. This leads to \sup_{\tilde{Y}_{(k_{0})},\tilde{X}_{(k_{0})}}\Theta\left(\tilde{Y}_{(k_{0})},\tilde{X}_{(k_{0})}\right)=\infty, and then

\sup_{\tilde{Y}_{(k_{0})},\tilde{X}_{(k_{0})}}\left\|\Theta\left(Y,X\right)-\Theta\left(\tilde{Y}_{(k_{0})},\tilde{X}_{(k_{0})}\right)\right\|=\infty.

Then B\left(\Theta;\left(Y,X\right)\right)\leq\frac{\lfloor N/2\rfloor+1}{N}, which completes the proof. ∎

Appendix A.2 Data Description

Table A.1 lists all the variables included, together with the first-order autocorrelation coefficient of each variable estimated over the whole sample period. As shown in the table, the excess return has an estimated first-order autocorrelation coefficient of 0.0444, indicating little persistence, as do the default return spread (dfr) and the long-term return of government bonds (ltr); stock variance (svar) and inflation (infl) show only moderate persistence. The remaining predictors are highly persistent, with AR(1) coefficients greater than 0.95.
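
One simple way to reproduce the persistence measures in Table A.1 is the first-order sample autocorrelation of each series; a sketch is given below (the DataFrame layout is an assumption, and the paper may instead estimate the AR(1) coefficient by regression).

import pandas as pd

def ar1_coefficients(df: pd.DataFrame) -> pd.Series:
    # first-order sample autocorrelation of each column over the full sample period
    return df.apply(lambda s: s.autocorr(lag=1))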

Table A.1: Variables and AR(1) Coefficients
Predictor Description AR(1) Coef
ExReturn Excess Return: the difference between the continuously compounded return on the S&P 500 index and the three-month Treasury Bill rate 0.0444
dp Dividend Price Ratio: the difference between the log of the 12-month moving sum dividends and the log of the S&P 500 index 0.9941
dy Dividend Yield: the difference between the log of the 12-month moving sum dividends and the log of lagged the S&P 500 index 0.9941
ep Earning Price Ratio: the difference between the log of the 12-month moving sum earnings and the log of the S&P 500 index 0.9904
tms Term Spread: the difference between the long-term government bond yield and the Treasury Bill rate 0.9576
dfy Default Yield Spread: the difference between Moody’s BAA- and AAA-rated corporate bond yields 0.9717
dfr Default Return Spread: the difference between the returns of long-term corporate bonds and long-term government bonds -0.0735
bm Book-to-Market Ratio: the ratio of the book value to market value for the Dow Jones Industrial Average 0.9927
tbl Treasury Bill Rates: the 3-month Treasury Bill rates 0.9905
ltr Long-Term Return: the rate of returns of long-term government bonds 0.0500
ntis Net Equity Expansion: the ratio of the 12-month moving sums of net issues by NYSE listed stocks over the total end-of-year market capitalization of NYSE stocks 0.9778
svar Stock Variance: the sum of the squared daily returns on the S&P 500 index 0.4714
infl Inflation: the log growth of the Consumer Price Index (all urban consumers) 0.4819

References

  • Beck, A. and Y. C. Eldar (2013). Sparsity constrained nonlinear optimization: Optimality conditions and algorithms. SIAM Journal on Optimization 23(3), 1480–1509.
  • Bertsimas, D., A. King, and R. Mazumder (2016). Best subset selection via a modern optimization lens. The Annals of Statistics 44(2), 813–852.
  • Gurobi Optimization, LLC (2024). Gurobi Optimizer Reference Manual.
  • Hampel, F. R. (1971). A general qualitative definition of robustness. The Annals of Mathematical Statistics 42(6), 1887–1896.
  • Hastie, T., R. Tibshirani, and M. Wainwright (2015). Statistical Learning with Sparsity: The Lasso and Generalizations. Chapman & Hall/CRC.
  • Hazimeh, H. and R. Mazumder (2020). Fast best subset selection: Coordinate descent and local combinatorial optimization algorithms. Operations Research 68(5), 1517–1537.
  • Huber, P. J. (1964). Robust estimation of a location parameter. The Annals of Mathematical Statistics 35(1), 73–101.
  • Huber, P. J. and D. Donoho (1983). The notion of breakdown point. In P. J. Bickel, K. Doksum, and J. L. Hodges (Eds.), A Festschrift for Erich L. Lehmann. Chapman and Hall/CRC.
  • Koo, B., H. M. Anderson, M. H. Seo, and W. Yao (2020). High-dimensional predictive regression in the presence of cointegration. Journal of Econometrics 219(2), 456–477.
  • Kostakis, A., T. Magdalinos, and M. P. Stamatogiannis (2015). Robust econometric inference for stock return predictability. The Review of Financial Studies 28(5), 1506–1553.
  • Lee, J. H., Z. Shi, and Z. Gao (2022). On lasso for predictive regression. Journal of Econometrics 229(2), 322–349.
  • Lee, Y., S. N. MacEachern, and Y. Jung (2012). Regularization of case-specific parameters for robustness and efficiency. Statistical Science 27(3), 350–372.
  • Mazumder, R., P. Radchenko, and A. Dedieu (2022). Subset selection with shrinkage: Sparse linear modeling when the SNR is low. Operations Research.
  • Natarajan, B. K. (1995). Sparse approximate solutions to linear systems. SIAM Journal on Computing 24(2), 227–234.
  • Rousseeuw, P. J. (1984). Least median of squares regression. Journal of the American Statistical Association 79(388), 871–880.
  • Rousseeuw, P. J. (1985). Multivariate estimation with high breakdown point. Mathematical Statistics and Applications 8(283-297), 37.
  • She, Y. and A. B. Owen (2011). Outlier detection using nonconvex penalized regression. Journal of the American Statistical Association 106(494), 626–639.
  • Siegel, A. F. (1982). Robust regression using repeated medians. Biometrika 69(1), 242–244.
  • Thompson, R. (2022). Robust subset selection. Computational Statistics & Data Analysis 169, 107415.
  • Welch, I. and A. Goyal (2008). A comprehensive look at the empirical performance of equity premium prediction. The Review of Financial Studies 21(4), 1455–1508.
  • Yu, C. and W. Yao (2017). Robust linear regression: A review and comparison. Communications in Statistics - Simulation and Computation 46(8), 6261–6282.