
New Highly Efficient High-Breakdown Estimator of Multivariate Scatter and Location for Elliptical Distributions

Justin A. Fishbone, Lamine Mili
Abstract

High-breakdown-point estimators of multivariate location and shape matrices, such as the MM-estimator with smooth hard rejection and the Rocke S-estimator, are generally designed to have high efficiency at the Gaussian distribution. However, many phenomena are non-Gaussian, and these estimators can therefore have poor efficiency. This paper proposes a new tunable S-estimator, termed the S-q estimator, for the general class of symmetric elliptical distributions, a class containing many common families such as the multivariate Gaussian, t-, Cauchy, Laplace, hyperbolic, and normal inverse Gaussian distributions. Across this class, the S-q estimator is shown to generally provide higher maximum efficiency than other leading high-breakdown estimators while maintaining the maximum breakdown point. Furthermore, its robustness is demonstrated to be on par with these leading estimators while also being more stable with respect to initial conditions. From a practical viewpoint, these properties make the S-q broadly applicable for practitioners. This is demonstrated with an example application—the minimum-variance optimal allocation of financial portfolio investments.

keywords:
Covariance matrix estimation, Robust estimation, S-estimator, S-q estimator, Shape matrix estimation
journal: arXiv. Copyright © 2021 Justin Fishbone
Affiliation: Bradley Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, 7054 Haycock Rd, Falls Church, VA 22043, USA

1 Introduction

Huber (1964) introduced what is now the most common class of robust estimators, M-estimators. Although M-estimators were originally defined for the location case, Maronna (1976) extended the definition to multivariate location and scatter. After the sample median, perhaps the most common robust M-estimators are those using general rho functions such as the Huber or Tukey bisquare functions.

However, the drawback of general rho functions is that they have limited efficiency when applied to parameter estimation for distributions other than the nominal one for which they were designed. To address this, various M-estimator approaches iteratively reweight the maximum likelihood estimator (MLE) based on the estimated probability density function (PDF) (e.g., Windham, 1995; Basu et al., 1998; Choi and Hall, 2000; Ferrari and Yang, 2010).

Even with these improvements, multivariate M-estimators inherently have limited robustness. For example, Maronna (1976) showed that the upper bound on the breakdown point of $p$-dimensional M-estimators is $(p+1)^{-1}$, which converges to zero as $p$ grows. To combat this weakness, Rousseeuw and Yohai (1984) introduced regression S-estimators, which Davies (1987) extended to multivariate location and scatter. Davies also showed that the asymptotic breakdown point of S-estimators can be set to 1/2, which is the theoretical maximum of any equivariant estimator.

In practical scenarios, however, estimators may exhibit large bias at contamination levels considerably lower than the breakdown point. For many years, the Tukey bisquare was the standard rho function for S-estimators (for example, see Lopuhaä, 1989; Rocke, 1996). However, in the context of multivariate S-estimators, the bisquare is not tunable, so its robustness degrades with increasing $p$. For this reason, Rocke (1996) introduced the tunable biflat and translated biweight rho functions. Maronna et al. (2006, sec. 6.4.4) slightly modified the biflat, proposing the Rocke rho function. The Rocke S-estimator (shortened here to S-Rocke) is currently the recommended high-breakdown estimator for large dimensions ($p\geq 15$) (Maronna and Yohai, 2017; Maronna et al., 2019, sec. 6.10). The recommended estimator for lower dimensions is the MM-estimator with the smoothed hard rejection function (MM-SHR).

There are two major shortcomings of the S-Rocke estimator that will be discussed in this paper. Firstly, it has low efficiency for small dimension $p$. Although this is an inherent disadvantage of all S-estimators, it is exceptionally acute for the S-Rocke. Secondly, the S-Rocke has poor efficiency for most common non-Gaussian distributions. This is a common problem for general-purpose estimators such as the Rocke and bisquare S-estimators, the MM-SHR, and the Huber and bisquare M-estimators. Examples of common phenomena that are frequently modeled by non-Gaussian distributions include stock returns, radar sea clutter, and speech signals, which approximately follow generalized hyperbolic (Konlack Socgnia and Wilcox, 2014), K- (Ward et al., 1990), and Laplace distributions (Gazor and Zhang, 2003), respectively.

This paper proposes and explores a new subclass of tunable, maximum-breakdown-point S-estimators that is applicable across common continuous elliptical distributions. This estimator, named the S-q estimator, uses a density-based reweighting to attain generally higher maximum efficiency across the elliptical class as compared to the S-Rocke and MM-SHR estimators. These estimators are compared from the viewpoints of statistical and computational efficiency, robustness, and stability.

Although the focus on elliptical distributions sounds limiting, as discussed in the next section, most common continuous multivariate distributions—such as the Gaussian, t-, Laplace, and hyperbolic distributions—fall into this class. As Frahm (2009) discussed, this assumption is “fundamental in multivariate analysis.”

This paper is organized as follows. Section 2 defines the new estimator and provides its functions for the most common elliptical distributions. Basic properties related to the consistency of the S-q estimator are summarized in Section 3. Section 4 provides the asymptotic distribution of the S-q estimator and compares the maximum achievable efficiencies of the S-q, S-Rocke, and MM-SHR estimators. In Section 5, the finite-sample breakdown point of the S-q is discussed, the theoretical influence functions of the estimators are compared, and the empirical finite-sample robustness of the estimators are briefly explored. Section 6 assesses two computational aspects of the estimators: computational efficiency, and stability with respect to initial estimates. A real-world example in Section 7 demonstrates the application of the estimators for the minimum-variance optimal allocation of financial portfolio investments. Finally, conclusions are summarized in Section 8.

2 Defining the S-q Estimator

This section builds the definition of the proposed S-q estimator. First, the elliptical class of distributions is reviewed. The multivariate S-estimator definition is then summarized, and finally, the S-q is defined.

2.1 Elliptical Distributions

The elliptical distribution is a general class of multivariate probability distributions encompassing many familiar subclasses such as the symmetric Gaussian, t-, Cauchy, Laplace, hyperbolic, variance gamma, and normal inverse Gaussian distributions. Table 1 summarizes the most common elliptical distributions (Fang et al., 1990, p. 69; Deng and Yao, 2018).

Table 1: Summary of Common Elliptical Distributions
Distribution Name    Generating Function $\phi(d)$
   {Range of Parameters}
Kotz type    $d^{N}\exp(-rd^{s})$
   $\left\{r>0,\,s>0,\,N>-\frac{p}{2}\right\}$
Gaussian (Kotz type with $N=0$, $s=1$, $r=1/2$)    $\exp\left(-\frac{d}{2}\right)$
Pearson type II    $(1-d)^{m},\quad d\in[0,1]$
   $\{m>0\}$
Pearson type VII    $(1+d/s)^{-N}$
   $\left\{N>p/2,\,s>0\right\}$
t (Pearson VII with $s=\nu$, $N=(\nu+p)/2$)    $(1+d/\nu)^{-(\nu+p)/2}$
   $\{\nu>0\}$
Cauchy (t with $\nu=1$)    $(1+d)^{-(1+p)/2}$
Generalized hyperbolic    $\left(\sqrt{\psi(\chi+d)}\right)^{\lambda-p/2}K_{\lambda-p/2}\left(\sqrt{\psi(\chi+d)}\right)$
   $\left\{\psi>0,\,[\chi>0,\,\lambda\in\mathbb{R}\ \text{or}\ \chi=0,\,\lambda>0]\right\}$
Variance gamma (Gen. hyperbolic with $\chi=0$)    $\left(\sqrt{\psi\,d}\right)^{\lambda-p/2}K_{\lambda-p/2}\left(\sqrt{\psi\,d}\right)$
   $\left\{\psi>0,\,\lambda>0\right\}$
Laplace (Variance gamma with $\psi=2$, $\lambda=1$)    $\left(\sqrt{2d}\right)^{1-p/2}K_{1-p/2}\left(\sqrt{2d}\right)$
Multivariate hyperbolic (Gen. hyperbolic with $\lambda=(p+1)/2$)    $\exp\left(-\sqrt{\psi(\chi+d)}\right)$
   $\left\{\psi>0,\,\chi\geq 0\right\}$
Hyperbolic with univariate marginals (Gen. hyperbolic with $\lambda=1$)    $\left(\sqrt{\psi(\chi+d)}\right)^{1-p/2}K_{1-p/2}\left(\sqrt{\psi(\chi+d)}\right)$
   $\left\{\psi>0,\,\chi\geq 0\right\}$
Normal inverse Gaussian (Gen. hyperbolic with $\lambda=-1/2$, $\chi>0$)    $\left(\sqrt{\psi(\chi+d)}\right)^{-(1+p)/2}K_{-(1+p)/2}\left(\sqrt{\psi(\chi+d)}\right)$
   $\left\{\psi>0,\,\chi>0\right\}$

Symmetric elliptical distributions are defined as being a function of the squared Mahalanobis distance (some texts define the Mahalanobis distance with the mean and covariance, but this more restrictive definition excludes thick-tailed distributions where these do not exist, such as Cauchy distributions), $d\left(\boldsymbol{x},\boldsymbol{\mu},\boldsymbol{\Sigma}\right)=\left(\boldsymbol{x}-\boldsymbol{\mu}\right)^{\operatorname{T}}\boldsymbol{\Sigma}^{-1}\left(\boldsymbol{x}-\boldsymbol{\mu}\right)$, where $\boldsymbol{x}\in\mathbb{R}^{p}$, the location $\boldsymbol{\mu}\in\mathbb{R}^{p}$, and the scatter matrix $\boldsymbol{\Sigma}$ is $p\times p$ positive definite symmetric, $\boldsymbol{\Sigma}\in\operatorname{PDS}(p)$. When the PDF is defined, it has the form $f_{X}\left(\boldsymbol{x}\right)=\alpha_{p}|\boldsymbol{\Sigma}|^{-1/2}\phi\left(d\left(\boldsymbol{x},\boldsymbol{\mu},\boldsymbol{\Sigma}\right)\right)$ for some generating function $\phi(d)$, where $\alpha_{p}$ is a constant that ensures $f_{X}\left(\boldsymbol{x}\right)$ integrates to one. Table 1 lists common generating functions. When the covariance exists, it is proportional to the scatter matrix $\boldsymbol{\Sigma}$. The corresponding shape matrix is commonly defined as

$$\boldsymbol{\Omega}=\boldsymbol{\Sigma}/|\boldsymbol{\Sigma}|^{1/p}. \qquad (1)$$

The PDF of $d\left(\boldsymbol{x},\boldsymbol{\mu},\boldsymbol{\Sigma}\right)$ is given by (Kelker, 1970)

$$f_{D}\left(d\right)=\beta_{p}\,d^{p/2-1}\phi\left(d\right), \qquad (2)$$

where $\beta_{p}=\alpha_{p}\pi^{p/2}/\Gamma(p/2)$. Hereafter, all densities $f(d)$ refer to the density of $d\left(\boldsymbol{x},\boldsymbol{\mu},\boldsymbol{\Sigma}\right)$ in (2), so the subscript $D$ will be omitted. It is also generally assumed that $p>2$.
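As a quick numerical check of (2) for the Gaussian case (an illustrative sketch added here, not part of the original derivation), the radial density with $\phi(d)=\exp(-d/2)$ and $\alpha_{p}=(2\pi)^{-p/2}$ reduces to the chi-squared density with $p$ degrees of freedom:

```python
import numpy as np
from scipy.stats import chi2
from scipy.special import gammaln

def radial_density_gaussian(d, p):
    """f_D(d) = beta_p * d^(p/2-1) * phi(d) with phi(d) = exp(-d/2), beta_p = 2^(-p/2)/Gamma(p/2)."""
    log_beta_p = -(p / 2) * np.log(2) - gammaln(p / 2)
    return np.exp(log_beta_p + (p / 2 - 1) * np.log(d) - d / 2)

p = 5
d = np.linspace(0.1, 20, 200)
# For Gaussian x, the squared Mahalanobis distance d(x, mu, Sigma) is chi-squared with p d.o.f.
assert np.allclose(radial_density_gaussian(d, p), chi2.pdf(d, df=p))
```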

2.2 S-Estimators

Given a set of $n$ $p$-dimensional samples, $\{\boldsymbol{x}_{1},\dots,\boldsymbol{x}_{n}\}$, S-estimators of location and shape are defined as (Maronna et al., 2006, Sec. 6.4.2)

$$\left(\widehat{\boldsymbol{\mu}},\widehat{\boldsymbol{\Omega}}\right)=\operatorname{arg\,min}\ \widehat{\sigma}\quad\text{subject to}\quad|\boldsymbol{\Omega}|=1,\quad\frac{1}{n}\sum_{i=1}^{n}\rho\left(\frac{d\left(\boldsymbol{x}_{i},\boldsymbol{\mu},\boldsymbol{\Omega}\right)}{\widehat{\sigma}}\right)=b, \qquad (3)$$

for some scalar rho function $\rho(t)$. A proper S-estimator rho function should be a continuously differentiable, nondecreasing function in $t\geq 0$ with $\rho(0)=0$, and for which there is a point $c$ such that $\rho(t)=\rho(\infty)$ for $t\geq c$. For simplicity, and without loss of generality, the rho functions will be normalized so that $\rho(\infty)=1$. The parameter $b$ is a scalar that affects the efficiency (see Section 4) and robustness of the estimator. The purpose of S-estimators is to achieve high robustness, so they are usually configured with $b=1/2-(p+1)/(2n)$, which achieves the maximum theoretical breakdown point that any affine equivariant estimator may have (see Section 5.1). To understand the derivation of the proposed estimator in the next section, note that $\widehat{\sigma}$ in the constraint is an M-estimator of the scale of $d\left(\boldsymbol{\mu},\boldsymbol{\Omega}\right)$. Local solutions of (3) can be found iteratively using the weighted sums $\sum_{i=1}^{n}w\left(d_{i}/\widehat{\sigma}\right)\left(\boldsymbol{x}_{i}-\widehat{\boldsymbol{\mu}}\right)=\boldsymbol{0}$ and $\sum_{i=1}^{n}w\left(d_{i}/\widehat{\sigma}\right)\left(\boldsymbol{x}_{i}-\widehat{\boldsymbol{\mu}}\right)\left(\boldsymbol{x}_{i}-\widehat{\boldsymbol{\mu}}\right)^{\operatorname{T}}\propto\widehat{\boldsymbol{\Omega}}$, where the weight function $w(t)=\rho^{\prime}(t)$, and where $\widehat{\boldsymbol{\Omega}}$ is re-normalized with each iteration. For the empirical results in this paper, the estimators are all solved using this weighted-sum algorithm.
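To make the weighted-sum iteration concrete, the following Python sketch shows one plausible implementation (our own illustration; the helper names and the root-finding M-scale step are assumptions, as the paper does not specify these details):

```python
import numpy as np
from scipy.optimize import brentq

def m_scale(d, rho, b, bracket=(1e-12, 1e12)):
    """M-estimate of the scale of d: solve (1/n) * sum rho(d_i / sigma) = b for sigma."""
    return brentq(lambda s: np.mean(rho(d / s)) - b, *bracket)

def s_estimator_weighted_sums(X, rho, w, b, mu0, Omega0, max_iter=100, tol=1e-10):
    """Iteratively reweighted updates for the S-estimator of location and shape in (3)."""
    n, p = X.shape
    mu, Omega = mu0.copy(), Omega0.copy()
    for _ in range(max_iter):
        diff = X - mu
        d = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Omega), diff)
        sigma = m_scale(d, rho, b)
        wt = w(d / sigma)                              # w(t) = rho'(t)
        mu_new = (wt[:, None] * X).sum(axis=0) / wt.sum()
        diff_new = X - mu_new
        S = (wt[:, None, None] * np.einsum('ij,ik->ijk', diff_new, diff_new)).sum(axis=0)
        Omega_new = S / np.linalg.det(S) ** (1 / p)    # re-normalize so that |Omega| = 1
        converged = (np.linalg.norm(mu_new - mu) < tol
                     and np.linalg.norm(Omega_new - Omega) < tol)
        mu, Omega = mu_new, Omega_new
        if converged:
            break
    return mu, Omega, sigma
```

Any rho/weight pair discussed in this paper (bisquare, Rocke, or S-q) could be passed as `rho` and `w`, provided the functions accept array arguments.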

To estimate the scatter matrix, a separate estimator of $|\boldsymbol{\Sigma}|^{1/p}$ can then be used to scale $\widehat{\boldsymbol{\Omega}}$ using (1). Maronna et al. (2006, p. 186) discussed a simple estimator to scale $\widehat{\boldsymbol{\Omega}}$ to $\widehat{\boldsymbol{\Sigma}}$. When $\boldsymbol{x}$ is normally distributed, $d$ has a chi-squared distribution with $p$ degrees of freedom. Therefore, they suggested using $\widehat{\boldsymbol{\Sigma}}=\operatorname{Median}\left\{d\left(\boldsymbol{x}_{1},\widehat{\boldsymbol{\mu}},\widehat{\boldsymbol{\Omega}}\right),\dots,d\left(\boldsymbol{x}_{n},\widehat{\boldsymbol{\mu}},\widehat{\boldsymbol{\Omega}}\right)\right\}\left(\chi^{2}_{p}(0.5)\right)^{-1}\widehat{\boldsymbol{\Omega}}$, where $\chi^{2}_{p}(0.5)$ is the 50th percentile of the chi-squared distribution. For the general case of elliptical distributions, we propose extending this to

$$\widehat{\boldsymbol{\Sigma}}=\frac{\operatorname{Median}\left\{d\left(\boldsymbol{x}_{1},\widehat{\boldsymbol{\mu}},\widehat{\boldsymbol{\Omega}}\right),\dots,d\left(\boldsymbol{x}_{n},\widehat{\boldsymbol{\mu}},\widehat{\boldsymbol{\Omega}}\right)\right\}}{F^{-1}(0.5)}\,\widehat{\boldsymbol{\Omega}},$$

where $F(d)$ is the distribution function corresponding to (2), and therefore $F^{-1}(0.5)$ is the 50th percentile of the distribution.
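A minimal sketch of this scaling step, assuming Gaussian data so that $F$ is the chi-squared distribution with $p$ degrees of freedom, might look like:

```python
import numpy as np
from scipy.stats import chi2

def scale_shape_to_scatter(d_hat, Omega_hat, median_of_F):
    """Scale a shape estimate (|Omega| = 1) to a scatter estimate via the median rule."""
    return np.median(d_hat) / median_of_F * Omega_hat

# Gaussian example: F^{-1}(0.5) is the chi-squared median with p degrees of freedom
# Sigma_hat = scale_shape_to_scatter(d_hat, Omega_hat, chi2.ppf(0.5, df=p))
```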

For the location and shape matrices, the S-estimator formulation in (3) is equivalent to the alternative one given by

$$\left(\widehat{\boldsymbol{\mu}},\widehat{\boldsymbol{\Sigma}}\right)=\operatorname{arg\,min}\ |\boldsymbol{\Sigma}|\quad\text{subject to}\quad\frac{1}{n}\sum_{i=1}^{n}\rho\left(\frac{d\left(\boldsymbol{x}_{i},\boldsymbol{\mu},\boldsymbol{\Sigma}\right)}{\sigma}\right)=b, \qquad (4)$$

which requires that $\sigma$ be defined such that $b=\operatorname{E}\left[\rho\left(d\left(\boldsymbol{x};\boldsymbol{\mu},\boldsymbol{\Sigma}\right)/\sigma\right)\right]$ for a consistent estimator of $\boldsymbol{\Sigma}$ at an assumed elliptical distribution (Rocke, 1996). While the first formulation is better for understanding the derivation of the proposed S-q estimator, this second formulation is better for defining and understanding its properties (for example, see Lopuhaä, 1989). The scale parameters in the two formulations are related asymptotically by $\sigma=|\boldsymbol{\Sigma}|^{-1/p}\operatorname{E}\left[\widehat{\sigma}\right]$ at the assumed distribution.

The two most common multivariate S-estimators are the bisquare and Rocke (Maronna et al., 2019, sec. 6.4.2, 6.4.4). The S-bisquare is given by $\rho_{\text{bisq}}(t)=\min\left\{1,1-(1-t)^{3}\right\}$ and $w_{\text{bisq}}(t)=3(1-t)^{2}\operatorname{I}(t\leq 1)$, which does not have a tuning parameter to control efficiency and robustness. The S-Rocke is given by

$$\rho_{\gamma}(t)=\begin{cases}0&\text{if}\;0\leq t\leq 1-\gamma\\ \frac{t-1}{4\gamma}\left[3-\left(\frac{t-1}{\gamma}\right)^{2}\right]+\frac{1}{2}&\text{if}\;1-\gamma<t<1+\gamma\\ 1&\text{if}\;1+\gamma\leq t,\end{cases}$$
$$w_{\gamma}(t)=\frac{3}{4\gamma}\left[1-\left(\frac{t-1}{\gamma}\right)^{2}\right]\operatorname{I}(1-\gamma\leq t\leq 1+\gamma),$$

where the parameter $\gamma\in(0,1]$ tunes the estimator's efficiency and robustness. The Rocke's maximum efficiency is generally limited at $\gamma=1$, which is extremely restricting for small $p$. Both $\rho_{\text{bisq}}(t)$ and $\rho_{\gamma}(t)$ are generic functions that do not depend on the underlying distribution. In the following section, an alternative S-estimator is defined that accounts for the underlying distribution and that generally has better performance across the most common elliptical distributions. It also does not have the same inherent restrictions for small $p$ as $\rho_{\gamma}(t)$.
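For reference, a direct vectorized transcription of the S-Rocke rho and weight functions above (a sketch written for this exposition, with function names of our choosing) is:

```python
import numpy as np

def rho_rocke(t, gamma):
    """Rocke rho function: 0 below 1-gamma, cubic ramp on (1-gamma, 1+gamma), 1 above 1+gamma."""
    t = np.asarray(t, dtype=float)
    ramp = (t - 1) / (4 * gamma) * (3 - ((t - 1) / gamma) ** 2) + 0.5
    return np.where(t <= 1 - gamma, 0.0, np.where(t < 1 + gamma, ramp, 1.0))

def w_rocke(t, gamma):
    """Derivative of rho_rocke, nonzero only on [1-gamma, 1+gamma]."""
    t = np.asarray(t, dtype=float)
    inside = (t >= 1 - gamma) & (t <= 1 + gamma)
    return np.where(inside, 3 / (4 * gamma) * (1 - ((t - 1) / gamma) ** 2), 0.0)
```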

2.3 Elliptical Density-Based S-q Estimator

The rho function corresponding to the maximum likelihood estimator, $\widehat{\sigma}$, of the scale of $d\left(\boldsymbol{\mu},\boldsymbol{\Omega}\right)$, or equivalently $d\left(\boldsymbol{\mu},\boldsymbol{\Sigma}\right)$, is $\rho_{\text{mle}}(t)=-t\,f^{\prime}\left(t\right)/f\left(t\right)$. We propose weighting this by the power transform of the density, $\tilde{\rho}_{q}\left(t\right)=f(t)^{1-q}\rho_{\text{mle}}(t)$, where the scalar $q\leq 1$ controls the estimator robustness, with $q=1$ corresponding to the maximum likelihood function, and with decreasing $q$ increasing the estimator robustness. In most cases, this rho function is not monotone, as required by S-estimators, so it is denoted with a tilde. This rho function is equivalent to the M-Lq and other M-estimators proposed, for example, by Windham (1995); Basu et al. (1998); Choi and Hall (2000); and Ferrari and Yang (2010). However, in this particular case of estimating the scale of the squared Mahalanobis distance of an elliptically distributed random vector, the density and rho function do not need to be regenerated with each numerical iteration $i$ based on the estimates $\widehat{\boldsymbol{\mu}}^{(i)}$ and $\widehat{\boldsymbol{\Omega}}^{(i)}$. Substituting the PDF from (2),

$$\tilde{\rho}_{q}\left(t\right)=-\left(\beta_{p}\phi(t)\right)^{s_{q}}t^{s_{p}s_{q}}\left(t\frac{\phi^{\prime}(t)}{\phi(t)}+s_{p}\right), \qquad (5)$$

where $s_{p}=p/2-1$ and $s_{q}=1-q$. Taking the derivative of $\tilde{\rho}_{q}(t)$, the corresponding weight function is given by

$$\tilde{w}_{q}(t)=-\left(\beta_{p}\phi(t)\right)^{s_{q}}t^{s_{p}s_{q}}\left(\frac{s_{q}s_{p}^{2}}{t}+\left(2s_{q}s_{p}+1\right)\frac{\phi^{\prime}(t)}{\phi(t)}-qt\left(\frac{\phi^{\prime}(t)}{\phi(t)}\right)^{2}+t\frac{\phi^{\prime\prime}(t)}{\phi(t)}\right). \qquad (6)$$

For simplicity, when $q<1$, the scalar $\beta_{p}$ can be dropped from the calculation of $\tilde{\rho}_{q}(t)$ and $\tilde{w}_{q}(t)$ in (5) and (6). When $\phi(t)$ is only positive over a finite domain (e.g., the Pearson type II distribution), we define $\tilde{\rho}_{q}(t)$ and $\tilde{w}_{q}(t)$ to be zero outside this domain.

For the common elliptical distributions listed in Table 1, $\tilde{\rho}_{q}(t)$ is monotone in its central region between its global extrema when using appropriate values for $q$ (defined below). The first extremum is the minimum, which we label point $a$, and the second is the maximum, labeled $c$. The distance between $a$ and $c$ varies monotonically with respect to $q$. We use this to define a tunable, double-hard-rejection S-estimator rho function. The value of $\tilde{\rho}_{q}(t)$ is held constant between zero and $a$ at the value $\tilde{\rho}_{q}(a)$, which hard rejects inliers, and the value of $\tilde{\rho}_{q}(t)$ is held constant above $c$ at the value $\tilde{\rho}_{q}(c)$, which hard rejects outliers. The resulting monotonic function is then scaled and shifted so it ranges from zero to one. This defines the S-q estimator.

Definition 1.

Assuming $\phi(t)$ is twice continuously differentiable over its region of support and $\frac{s_{q}s_{p}^{2}}{t}+\left(2s_{q}s_{p}+1\right)\frac{\phi^{\prime}(t)}{\phi(t)}-qt\left(\frac{\phi^{\prime}(t)}{\phi(t)}\right)^{2}+t\frac{\phi^{\prime\prime}(t)}{\phi(t)}$ has one or two zeros in $t\in(0,\infty)$ for $q<1$, the S-q estimator is the S-estimator with the rho function given by

$$\rho_{q}(t)=\begin{cases}0&\text{if $q<1$ and $t\leq a$}\\ s_{1}\left(\tilde{\rho}_{q}(t)-\tilde{\rho}_{q}(a)\right)&\text{if $q<1$ and $a<t<c$}\\ 1&\text{if $q<1$ and $t\geq c$}\\ \tilde{\rho}_{q}(t)&\text{if $q=1$,}\end{cases} \qquad (7)$$

where $s_{1}=\left(\tilde{\rho}_{q}(c)-\tilde{\rho}_{q}(a)\right)^{-1}$. The S-q estimator of Type I is the case with one zero (i.e., $a=0$), and the Type II S-q estimator is the case with two zeros.

For most distributions, $\lim_{q\rightarrow 1}c=\infty$, or at $q=1$, $\tilde{\rho}_{q}(t)$ is not bounded. Therefore, we do not scale or shift $\tilde{\rho}_{q}(t)$ in this case, and $\rho_{q}(t)$ is not a proper S-estimator rho function. However, when $q=1$ and $b=1$, the MLE of the scale of $d$ is obtained. The S-q weight function is the derivative of $\rho_{q}(t)$ and is given by

$$w_{q}(t)=\begin{cases}0&\text{if $q<1$ and $t\leq a$}\\ s_{1}\tilde{w}_{q}(t)&\text{if $q<1$ and $a<t<c$}\\ 0&\text{if $q<1$ and $t\geq c$}\\ \tilde{w}_{q}(t)&\text{if $q=1$.}\end{cases} \qquad (8)$$

Table 2 lists expressions for the inlier rejection point $a$ and the outlier rejection point $c$ for the common elliptical distributions in Table 1. For most of these distributions, the equation $\tilde{w}_{q}(t)=0$ is quadratic, which provides a closed-form solution for the values of $a$ and $c$.

Table 2: S-q Inlier and Outlier Rejection Points for Common Elliptical Distributions
Distribution    Inlier Rejection Point $a$ and Outlier Rejection Point $c$
Kotz type    $a,c=\left(\frac{s+2s_{q}N+2s_{p}s_{q}\mp\sqrt{s^{2}+4ss_{q}N+4ss_{p}s_{q}}}{2s_{q}rs}\right)^{1/s}$
Gaussian    $a,c=\frac{1+2s_{p}s_{q}\mp\sqrt{1+4s_{p}s_{q}}}{s_{q}}$
Pearson type II    $a,c=\frac{2s_{q}s_{p}^{2}+m\left(2s_{q}s_{p}+1\right)\mp\sqrt{m^{2}\left(4s_{q}s_{p}+1\right)+4ms_{q}s_{p}^{2}}}{2\left(s_{q}s_{p}^{2}+m\left(2s_{q}s_{p}+ms_{q}\right)\right)}$
Pearson type VII    $a,c=s\,\frac{2Ns_{q}s_{p}+N-2s_{q}s_{p}^{2}\mp\sqrt{4N^{2}s_{q}s_{p}-4Ns_{q}s_{p}^{2}+N^{2}}}{2s_{q}\left(s_{p}^{2}-2Ns_{p}+N^{2}\right)}$
Generalized hyperbolic    $a=\begin{cases}0&\text{when }\chi=0\text{ and }\lambda=1\\ \left\{t\,|\,\tilde{w}_{q}(t)=0\text{ and }t\in(0,c)\right\}&\text{otherwise}\end{cases}$
    $c=\left\{t\,|\,\tilde{w}_{q}(t)=0\text{ and }t\in(a,\infty)\right\}$
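As an illustration of (5)-(8) and the Gaussian row of Table 2, the following sketch (our own example code, not from the paper) evaluates the S-q rejection points and the normalized rho and weight functions for Gaussian data with $q<1$, dropping $\beta_{p}$ as permitted above:

```python
import numpy as np

def sq_functions_gaussian(p, q):
    """S-q rho/weight functions for phi(t) = exp(-t/2) (Gaussian), valid for q < 1 and t > 0."""
    sp, sq = p / 2 - 1, 1 - q
    # Gaussian row of Table 2: closed-form zeros of w~_q(t)
    root = np.sqrt(1 + 4 * sp * sq)
    a, c = (1 + 2 * sp * sq - root) / sq, (1 + 2 * sp * sq + root) / sq

    def rho_tilde(t):
        # Eq. (5) with phi'(t)/phi(t) = -1/2 and beta_p dropped
        return -np.exp(-sq * t / 2) * t ** (sp * sq) * (sp - t / 2)

    s1 = 1.0 / (rho_tilde(c) - rho_tilde(a))

    def rho_q(t):
        # Eq. (7): hard-rejected below a and above c, scaled/shifted to [0, 1] in between
        t = np.asarray(t, dtype=float)
        mid = s1 * (rho_tilde(t) - rho_tilde(a))
        return np.where(t <= a, 0.0, np.where(t >= c, 1.0, mid))

    def w_q(t):
        # Eq. (6)/(8) with phi'/phi = -1/2 and phi''/phi = 1/4
        t = np.asarray(t, dtype=float)
        inner = sq * sp ** 2 / t - (2 * sq * sp + 1) / 2 + sq * t / 4
        return np.where((t > a) & (t < c),
                        -s1 * np.exp(-sq * t / 2) * t ** (sp * sq) * inner, 0.0)

    return rho_q, w_q, a, c

# Example corresponding to Figure 1 (left panels): p = 5, q = 0.9
rho_q, w_q, a, c = sq_functions_gaussian(p=5, q=0.9)
```

For $q<1$ and $p>2$ this reproduces the Type II behavior shown in Figure 1: zero weight below $a$ and above $c$, with positive, density-shaped weights in between.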

The asymptotic rejection probability (ARP) is defined as $\Pr(d/\widehat{\sigma}\geq c)$ (Rocke, 1996). Table 2 can be used to determine $q$ from a desired ARP using $F^{-1}(\text{ARP})$. However, since $w_{q}(t)$ is very tapered (i.e., it applies little weight to values just below $c$), practitioners may choose alternative approaches to tuning that allow for higher estimator efficiencies. For example, the approach used in this paper, as well as in Maronna and Yohai (2017), is to tune the estimators to a desired expected efficiency, which is defined in the next section.

The general definition in (7) specifies that $q\leq 1$. In a few particular cases, however, there are some minor restrictions on $q$ (when $q<1$) in order to ensure that $a$ and $c$ are in the support of $f(d)$. Table 3 lists these restrictions.

Table 3: Restrictions on Parameter qq for Common Elliptical Distributions
Distribution    Valid Range of $q$
Kotz type    $q\leq 1$ unless $-1-s_{p}<N<-s_{p}$, then $1+\frac{s}{4\left(s_{p}+N\right)}<q\leq 1$
Gaussian    $q\leq 1$
Pearson type II    $q=1$ or $q<1-\frac{1}{m}$
Pearson type VII    $q\leq 1$
Generalized hyperbolic*    $q\leq 1$ unless $\chi=0$ and $\lambda<1$, then $\textit{unknown}<q\leq 1$
*Empirically inferred. Computational precision restricts $q\notin(0.998,1)$, approximately.

Figure 1 illustrates examples of the S-q functions $\tilde{\rho}_{q}(t)$, $\rho_{q}(t)$, and $w_{q}(t)$ for the five-dimensional Gaussian (S-q Type II) and Laplace (S-q Type I) distributions and for various values of $q$. As $q$ is decreased, the region of positive weights (the area between points $a$ and $c$) narrows, corresponding to increased robustness. The PDF is also plotted, illustrating how $w_{q}(t)$ roughly follows $f(t)$ in the central region.

Figure 1: Example S-q Rho and Weight Functions for Gaussian (Type II S-q) and Laplace (Type I S-q) Distributions. Rho functions $\rho_{q}(t)$ (top) and weight functions $w_{q}(t)$ (bottom) are plotted for the Gaussian distribution (left) and the Laplace distribution (right) for $q\in\{-2,0.5,0.9\}$ and $p=5$. On the top, the dotted lines depict the corresponding $\tilde{\rho}_{q}(t)$ functions, scaled and shifted to match $\rho_{q}(t)$. On the bottom, the dashed line depicts the density function. The solid dots indicate the points $a$, where $\tilde{\rho}_{q}(t)=0$, and $c$, where $\tilde{\rho}_{q}(t)=1$.

Figure 2 compares the S-q asymptotic weights with those of the MM-SHR, S-Rocke, S-bisquare, and maximum likelihood estimators, and with the corresponding PDF. The underlying model is a 10-dimensional standard Gaussian distribution. The MM-SHR and S-q estimators have been tuned to 80% asymptotic efficiency relative to the MLE. The S-Rocke estimator is tuned to its maximum efficiency, which is 77% in this instance. The estimators have been set to the maximum breakdown point, with $b=1/2$, which results in the shifts of the peaks of the weight curves relative to the PDF.

Figure 2: Example Comparison of Weight Functions for Various Estimators. For the 10-dimensional Gaussian distribution, the plot depicts the asymptotic weights for the S-q ($w_{q}$) and MM-SHR ($w_{\text{shr}}$) estimators tuned to 80% asymptotic relative efficiency, the S-Rocke ($w_{\gamma}$) estimator tuned to its maximum efficiency (77% for this case), and the non-tunable MLE ($w_{\text{mle}}$) and S-bisquare ($w_{\text{bisq}}$) estimators. The estimators are set to their maximum breakdown points.

From the figure, it is clear that the Gaussian MLE (i.e., the sample estimator) gives uniform weight to all samples, no matter how improbable. The S-Rocke has a quadratic weight function, which imposes a hard cutoff and cannot capture the tails of $f(d)$. The SHR weight function is cubic, and its shape better captures the shape of the right half of the PDF. However, the SHR function is designed to approximate a step function, which is poorly suited for many distributions (cf. $w_{\text{shr}}(t)$ in Figure 2 with the Laplace $f(d)$ in Figure 1). Only the S-q weight function follows the general shape of the PDF, giving less weight to less probable observations.

3 Consistency Properties of the S-q Estimator

As an S-estimator, the S-q estimator inherits properties from its parent class, such as affine equivariance. This section briefly summarizes properties related to its consistency. For more detailed discussion on these, see (Davies, 1987). Here, we use the alternative S-estimator formulation given by (4) under the assumptions (A1) that

(A1) $b=\operatorname{E}[\rho_{q}\left(d\left(\boldsymbol{x}_{i},\boldsymbol{\mu},\boldsymbol{\Sigma}\right)/\sigma\right)]$,
$\boldsymbol{x}\sim f_{X}\left(\boldsymbol{x},\boldsymbol{\mu},\boldsymbol{\Sigma},\phi(d)\right)$, where the $\boldsymbol{x}_{i}$ are i.i.d.,
$\phi(d)$ is non-increasing, and
$\phi(d)$ and $-\rho_{q}(d/\sigma)$ have common point(s) of decrease.
Theorem 1 (Uniqueness).

Given (A1), minimizing $|\widehat{\boldsymbol{\Sigma}}|$ subject to

$$\int_{0}^{\infty}\rho_{q}\left(\frac{d\left(\boldsymbol{x}_{i},\widehat{\boldsymbol{\mu}},\widehat{\boldsymbol{\Sigma}}\right)}{\sigma}\right)f_{X}\left(d\left(\boldsymbol{x}_{i},\boldsymbol{\mu},\boldsymbol{\Sigma}\right)\right)\operatorname{d}\!\boldsymbol{x}=b$$

has a unique solution $\left(\widehat{\boldsymbol{\mu}},\widehat{\boldsymbol{\Sigma}}\right)=\left(\boldsymbol{\mu},\boldsymbol{\Sigma}\right)$.

Proof.

See (Davies, 1987, Th. 1). ∎

Theorem 2 (Existence).

Given (A1) and $n\geq(p+1)/(1-b/\rho_{q}(\infty))$, the S-q estimator has at least one solution with probability one.

Proof.

See (Davies, 1987, Th. 2). ∎

Theorem 3 (Consistency).

Given (A1), $b=\operatorname{E}[\rho_{q}(d/\sigma)]$, and $p+1\leq n(1-b/\rho_{q}(\infty))$, then

$$\lim_{n\rightarrow\infty}\left(\widehat{\boldsymbol{\mu}}_{n},\widehat{\boldsymbol{\Sigma}}_{n}\right)=\left(\boldsymbol{\mu},\boldsymbol{\Sigma}\right).$$
Proof.

See (Davies, 1987, Th. 3). ∎

4 Asymptotic Distribution and Relative Efficiencies

In this section, the asymptotic distribution of the S-q estimator is provided. From this, measures of efficiency are then defined. Finally, the efficiency of the S-q estimator is compared with leading high-breakdown point estimators.

4.1 Asymptotic Distribution

For the asymptotic distribution of the S-q estimate, we continue to use the alternative S-estimator formulation given by (4). Lopuhaä (1997) derived the distribution of S-estimators under assumptions appropriate for the S-q estimator, that is,

(A2) $\phi^{\prime}_{p}(t)$ is decreasing with $\phi^{\prime}_{p}(d)<0$.

Here, we use the following notation. The matrix $\boldsymbol{I}_{p^{2}}$ is the $p^{2}\times p^{2}$ identity matrix, $\boldsymbol{K}_{p^{2}}$ is the $p^{2}\times p^{2}$ commutation matrix, $\otimes$ is the Kronecker product operator, and the operator $\operatorname{vec}\left(\boldsymbol{\Sigma}\right)$ stacks the columns of $\boldsymbol{\Sigma}$ into a column vector.

Theorem 4 (Asymptotic distribution).

Given (A1) and (A2), the asymptotic distribution of the S-q estimate $\left(\widehat{\boldsymbol{\mu}}_{n},\widehat{\boldsymbol{\Sigma}}_{n}\right)$ is given by $\sqrt{n}\left(\widehat{\boldsymbol{\mu}}_{n}-\boldsymbol{\mu},\widehat{\boldsymbol{\Sigma}}_{n}-\boldsymbol{\Sigma}\right)\overset{d}{\rightarrow}\left(\boldsymbol{a},\boldsymbol{B}\right)$, with $\boldsymbol{a}\perp\boldsymbol{B}$. The vector $\boldsymbol{a}\sim\mathcal{N}\left(\boldsymbol{0},\boldsymbol{\Gamma}_{\boldsymbol{\mu}}\right)$, where

$$\boldsymbol{\Gamma}_{\boldsymbol{\mu}}=\frac{\omega_{1}}{\omega_{2}^{2}}\boldsymbol{\Sigma}, \qquad (9)$$

with $\omega_{1}=p^{-1}\operatorname{E}\left[d\,w_{q}^{2}(d/\sigma)\right]$ and $\omega_{2}=-2\beta\int_{0}^{\infty}p^{-1}d^{p/2}w_{q}(d/\sigma)\phi^{\prime}(d)\,\operatorname{d}\!d$. The matrix $\boldsymbol{B}\sim\mathcal{N}\left(\boldsymbol{0},\boldsymbol{\Gamma}_{\boldsymbol{\Sigma}}\right)$, where

$$\boldsymbol{\Gamma}_{\boldsymbol{\Sigma}}=\zeta_{1}\left(\boldsymbol{I}_{p^{2}}+\boldsymbol{K}_{p^{2}}\right)\left(\boldsymbol{\Sigma}\otimes\boldsymbol{\Sigma}\right)+\zeta_{2}\operatorname{vec}\left(\boldsymbol{\Sigma}\right)\operatorname{vec}\left(\boldsymbol{\Sigma}\right)^{\operatorname{T}}, \qquad (10)$$

with $\zeta_{1}=\lambda_{1}^{-2}p(p+2)\operatorname{E}\left[(d/\sigma)^{2}w_{q}^{2}(d/\sigma)\right]$ and $\zeta_{2}=\lambda_{2}^{-2}\operatorname{E}\left[(\rho_{q}(d/\sigma)-b)^{2}\right]-2p^{-1}\zeta_{1}$, where $\lambda_{1}=-2\beta\int_{0}^{\infty}\sigma^{-1}d^{p/2+1}w_{q}(d/\sigma)\phi^{\prime}(d)\,\operatorname{d}\!d$ and $\lambda_{2}=-\beta\int_{0}^{\infty}d^{p/2}\left(\rho_{q}(d/\sigma)-b\right)\phi^{\prime}(d)\,\operatorname{d}\!d$.

Proof.

See (Lopuhaä, 1997, Corollary 2). ∎

Frahm (2009) derived the asymptotic distribution of shape matrix estimates for affine equivariant estimators. This enables us to state the asymptotic distribution of the S-q shape estimate, which is applicable using either S-estimator formulation, (3) or (4).

Theorem 5 (Shape asymptotic distribution).

Given (A1) and (A2), the asymptotic distribution of the S-q estimate of shape is given by $\sqrt{n}\left(\widehat{\boldsymbol{\Omega}}_{n}-\boldsymbol{\Omega}\right)\sim\mathcal{N}\left(\boldsymbol{0},\boldsymbol{\Gamma}_{\boldsymbol{\Omega}}\right)$, where

$$\boldsymbol{\Gamma}_{\boldsymbol{\Omega}}=\zeta_{1}\left(\boldsymbol{I}_{p^{2}}+\boldsymbol{K}_{p^{2}}\right)\left(\boldsymbol{\Omega}\otimes\boldsymbol{\Omega}\right)-\frac{2\zeta_{1}}{p}\operatorname{vec}\left(\boldsymbol{\Omega}\right)\operatorname{vec}\left(\boldsymbol{\Omega}\right)^{\operatorname{T}}, \qquad (11)$$

with $\zeta_{1}$ defined as in Theorem 4.

Proof.

See (Frahm, 2009, Corollary 1). ∎

4.2 Measures of Efficiency

The asymptotic efficiency of an estimator, at an assumed distribution, is defined as the ratio of the asymptotic variance of the maximum likelihood estimate to the variance of the estimator under consideration. For multivariate estimation, this definition of efficiency is of large dimension: $p\times p$ for location and $p^{2}\times p^{2}$ for shape and scatter. However, for affine equivariant estimation of location and scatter of elliptical distributions, the covariance of the estimate depends only on a scalar. Specifically, (9), (10), and (11) are general, with only the scalars $\omega_{1}/\omega_{2}^{2}$ (Bilodeau and Brenner, 1999), and $\zeta_{1}$ and $\zeta_{2}$ (Tyler, 1982) depending on the estimator and the generating function $\phi(d)$. Therefore, the asymptotic efficiency of the estimate $\widehat{\boldsymbol{\mu}}$ can be alternatively defined as

$$\operatorname{eff}_{\infty}\left(\widehat{\boldsymbol{\mu}}\right)=\frac{\omega_{1,\text{mle}}/\omega_{2,\text{mle}}^{2}}{\omega_{1,\widehat{\boldsymbol{\mu}}}/\omega_{2,\widehat{\boldsymbol{\mu}}}^{2}},$$

and the asymptotic efficiency of the estimate $\widehat{\boldsymbol{\Omega}}$ can alternatively be defined as

$$\operatorname{eff}_{\infty}\left(\widehat{\boldsymbol{\Omega}}\right)=\frac{\zeta_{1,\text{mle}}}{\zeta_{1,\widehat{\boldsymbol{\Omega}}}}. \qquad (12)$$

It is common to define asymptotic efficiency this way (for example, see Tyler, 1983; Frahm, 2009).

Comparing the S-q estimator's efficiency to that of another estimator can likewise be achieved analytically using, for example, $\zeta_{1,\gamma}/\zeta_{1,q}$ for the S-Rocke estimator; a quotient greater than one indicates that the S-q has higher asymptotic efficiency than the S-Rocke estimator. For other S-estimators, the asymptotic distribution parameters $\omega_{1}$, $\omega_{2}$, and $\zeta_{1}$ are calculated as in Theorems 4 and 5 but using their respective weight functions. MM-estimators have the same asymptotic variance and influence function as S-estimators (Rousseeuw and Hubert, 2013). For MM-estimators, however, $\sigma$ is effectively the tuning parameter, and it can be set accordingly.

In general, finite-sample performance measures are difficult to derive analytically. Instead, it is common to characterize finite-sample performance empirically through metrics derived from the Kullback-Leibler divergence between the estimated and true distributions (for example, see Huang et al., 2006; Ferrari and Yang, 2010). For t-distributions, which include the Gaussian distribution, the Kullback-Leibler divergence between $t_{\nu}\left(\boldsymbol{\mu},\boldsymbol{\Sigma}\right)$ and $t_{\nu}\left(\widehat{\boldsymbol{\mu}},\widehat{\boldsymbol{\Sigma}}\right)$ is given by (Abusev, 2015)

$$\operatorname{D}\left(\boldsymbol{\mu},\boldsymbol{\Sigma};\widehat{\boldsymbol{\mu}},\widehat{\boldsymbol{\Sigma}}\right)=\frac{1}{2}\left(\operatorname{Tr}\left(\boldsymbol{\Sigma}^{-1}\widehat{\boldsymbol{\Sigma}}\right)+\left(\boldsymbol{\mu}-\widehat{\boldsymbol{\mu}}\right)^{\operatorname{T}}\boldsymbol{\Sigma}^{-1}\left(\boldsymbol{\mu}-\widehat{\boldsymbol{\mu}}\right)-p-\log\frac{|\widehat{\boldsymbol{\Sigma}}|}{|\boldsymbol{\Sigma}|}\right).$$

Following Maronna and Yohai (2017), we then define the joint location and scatter finite-sample relative efficiency as

$$\operatorname{eff}_{n}\left(\widehat{\boldsymbol{\mu}},\widehat{\boldsymbol{\Sigma}};\widehat{\boldsymbol{\mu}}_{\text{mle}},\widehat{\boldsymbol{\Sigma}}_{\text{mle}}\right)=\frac{\operatorname{E}\left[\operatorname{D}\left(\boldsymbol{\mu},\boldsymbol{\Sigma};\widehat{\boldsymbol{\mu}}_{\text{mle}},\widehat{\boldsymbol{\Sigma}}_{\text{mle}}\right)\right]}{\operatorname{E}\left[\operatorname{D}\left(\boldsymbol{\mu},\boldsymbol{\Sigma};\widehat{\boldsymbol{\mu}},\widehat{\boldsymbol{\Sigma}}\right)\right]},$$

where $\widehat{\boldsymbol{\mu}}_{\text{mle}}$ and $\widehat{\boldsymbol{\Sigma}}_{\text{mle}}$ are the location and scatter matrices corresponding to the maximum likelihood estimate, and where the expectation is calculated empirically using the sample mean over $m$ Monte Carlo trials. The location and scatter finite-sample relative efficiencies are then respectively defined as $\operatorname{eff}_{n}\left(\widehat{\boldsymbol{\mu}},\boldsymbol{\Sigma};\widehat{\boldsymbol{\mu}}_{\text{mle}},\boldsymbol{\Sigma}\right)$ and $\operatorname{eff}_{n}\left(\boldsymbol{\mu},\widehat{\boldsymbol{\Sigma}};\boldsymbol{\mu},\widehat{\boldsymbol{\Sigma}}_{\text{mle}}\right)$. Likewise, we define the shape matrix finite-sample relative efficiency as

$$\operatorname{eff}_{n}\left(\widehat{\boldsymbol{\Omega}};\widehat{\boldsymbol{\Omega}}_{\text{mle}}\right)=\frac{\operatorname{E}\left[\operatorname{D}\left(\boldsymbol{\mu},\boldsymbol{\Omega};\boldsymbol{\mu},\widehat{\boldsymbol{\Omega}}_{\text{mle}}\right)\right]}{\operatorname{E}\left[\operatorname{D}\left(\boldsymbol{\mu},\boldsymbol{\Omega};\boldsymbol{\mu},\widehat{\boldsymbol{\Omega}}\right)\right]}. \qquad (13)$$
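As an illustration (placeholder code, with the estimator outputs assumed to be available from Monte Carlo trials), the divergence and the relative efficiencies above could be computed along these lines:

```python
import numpy as np

def kl_divergence(mu, Sigma, mu_hat, Sigma_hat):
    """D(mu, Sigma; mu_hat, Sigma_hat) as defined above."""
    p = len(mu)
    Sigma_inv = np.linalg.inv(Sigma)
    diff = mu - mu_hat
    return 0.5 * (np.trace(Sigma_inv @ Sigma_hat) + diff @ Sigma_inv @ diff
                  - p - np.log(np.linalg.det(Sigma_hat) / np.linalg.det(Sigma)))

def relative_efficiency(divergences_mle, divergences_est):
    """eff_n: ratio of Monte Carlo mean divergences of the MLE and the estimator under test."""
    return np.mean(divergences_mle) / np.mean(divergences_est)
```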

4.3 Comparison of Estimator Efficiency

Any estimator must provide a good estimate in the absence of contamination and when tuned to its maximum efficiency. This section compares the maximum achievable efficiencies of the S-q, S-Rocke, and MM-SHR estimators when set to their maximum breakdown point. The results below cover large swaths of the most common elliptical families in Table 1 for a moderate dimension of $p=20$. These swaths were specifically chosen to cover everyday distributions: the Gaussian, Cauchy, Laplace, hyperbolic, and normal inverse Gaussian distributions.

Robust scatter matrix estimation is generally “more difficult” than the estimation of location (Maronna et al., 2019), and as Maronna and Yohai (2017) demonstrated, divergence and efficiency metrics for scatter matrix estimators are generally much worse than for the corresponding estimators of location. Likewise, due to the high dimensionality of the estimate, the underlying shape matrix is the most difficult part of estimating the scatter matrix. Additionally, many practical applications such as multivariate regression, principal components analysis, linear discriminant analysis, and canonical correlation analysis only require the shape matrix, and not the full scatter or covariance matrices (Frahm, 2009). Therefore, unless otherwise noted, the performance results in this paper are for the shape matrix, with metrics given by (12), (13), and $\operatorname{D}\left(\boldsymbol{\mu},\boldsymbol{\Omega};\boldsymbol{\mu},\widehat{\boldsymbol{\Omega}}\right)$.

The maximum efficiencies of the S-q and S-Rocke generally occur when their parameters $q$ and $\gamma$ are set to one, although the maximum breakdown point of the S-q is only achieved when $q<1$. However, the maximum efficiency of the MM-SHR must be determined by a search, as depicted in Figure 3, which plots, as an example, asymptotic efficiency versus tuning parameter for the estimators at the 20-dimensional Cauchy distribution. In the limit, as the MM-SHR parameter is increased toward infinity, all samples receive equal weight, which is the MLE for the Gaussian distribution, but not for distributions such as the Cauchy. In general, for each tunable estimator, its efficiency decreases and its robustness increases as its parameter is decreased. At the lower limit of its parameter, its weight function is a delta function that may reject all the samples and may result in zero efficiency. At this point, the robustness is high, but the weighted-sum solution depends entirely on the initial estimates $\widehat{\boldsymbol{\mu}}^{(0)}$ and $\widehat{\boldsymbol{\Omega}}^{(0)}$.

Figure 3: Estimator Asymptotic Relative Efficiency versus Tuning Parameter. The relative efficiencies of the estimators are plotted as a function of the tuning parameter for the Cauchy distribution with $p=20$. All estimators are set to the maximum breakdown point. The S-Rocke parameter is in $[0,1]$, the MM-SHR parameter is in $(0,\infty)$, and the S-q parameter is in $(-\infty,1)$ for the maximum breakdown point.

It should be noted that although generally of high efficiency, the S-q estimate at its limit with $q=1$ is not necessarily the maximum likelihood estimate of location and scatter. The MLE weight function for location and scatter is given by (Tyler, 1982)

$$w_{\text{mle}}(t)=-2\frac{\phi^{\prime}(t)}{\phi(t)} \qquad (14)$$

whereas at $q=1$, (8) gives

$$w_{q=1}(t)=-\frac{\phi^{\prime}(t)}{\phi(t)}+t\left(\frac{\phi^{\prime}(t)}{\phi(t)}\right)^{2}-t\frac{\phi^{\prime\prime}(t)}{\phi(t)}. \qquad (15)$$
Theorem 6 (Relation to MLE efficiency).

Assuming $b=\operatorname{E}\left[\rho_{q}\left(d\left(\boldsymbol{x};\boldsymbol{\mu},\boldsymbol{\Sigma}\right)\right)\right]$, the asymptotic S-q estimate with $q=1$ is the maximum likelihood estimate of the location and scatter matrices for distributions where

$$t\frac{\phi^{\prime\prime}(t)}{\phi(t)}-t\left(\frac{\phi^{\prime}(t)}{\phi(t)}\right)^{2}=y\frac{\phi^{\prime}(t)}{\phi(t)}, \qquad (16)$$

for some value $y$. Therefore, the S-q estimator can asymptotically achieve the Cramér–Rao lower bound for these distributions.

Proof.

The S-estimator scaling $b=\operatorname{E}\left[\rho_{q}\left(d\left(\boldsymbol{x};\boldsymbol{\mu},\boldsymbol{\Sigma}\right)\right)\right]$ results in an estimate that is invariant to scaling of the weight function. Therefore, this theorem follows directly from proportionally equating (14) to (15). ∎

Remark 1.

Although this theorem inherently assumes the alternative S-estimator formulation given by (4), it still holds true for location and shape matrices using the primary S-estimator formulation in (3).

Remark 2.

If $\lim_{q\rightarrow 1}\tilde{\rho}_{q}(c)$ is finite, and $b\leq 1/2$, then both the Cramér–Rao lower bound (maximum efficiency) and the maximum breakdown point can be attained in the limit as $q\rightarrow 1$ (see Corollary 1 in the next section).

An example family that satisfies this theorem is the Kotz type with parameter $N=0$. Note, however, that $\lim_{q\rightarrow 1}\tilde{\rho}_{q}(c)=\infty$, so high breakdown cannot be achieved simultaneously. This is illustrated in Figure 4, which provides the estimators' maximum achievable asymptotic shape efficiencies for the Kotz type distribution with parameters $N=0$ and $r=1/2$ as a function of the parameter $s$. In this example, the S-q efficiency is plotted both for its maximum absolute efficiency with $q=1$ and for its approximate maximum high-breakdown efficiency with $q=0.99$. As seen in the figure, the high cost of high breakdown is particularly acute for large $s$.

Figure 4: Estimator Maximum Achievable Asymptotic Efficiency for the Kotz Type Distribution versus Parameter $s$. Maximum achievable asymptotic shape efficiencies at the maximum breakdown point are plotted for parameters $N=0$ and $r=1/2$. The maximum absolute asymptotic shape efficiency of the S-q estimator for $q=1$ is also shown.

The remainder of this paper will focus on maximum efficiency at the maximum breakdown point. Figure 4 also provides the S-Rocke and MM-SHR estimators' maximum efficiencies at their maximum breakdown points. The MM-SHR efficiency peaks at $s=1$, which is expected since this is the Gaussian distribution, and the S-Rocke efficiency peaks just above this point. Their efficiencies fall off precipitously for larger and smaller values of $s$. The efficiency of the S-q, conversely, increases toward unity for smaller $s$.

The estimators' maximum achievable asymptotic efficiencies for the t-distribution as a function of the distribution parameter $\nu$ are plotted on the left of Figure 5. When $\nu=1$, the t-distribution corresponds to the Cauchy distribution, and as $\nu\rightarrow\infty$, it corresponds to the Gaussian distribution. The S-q estimator offers the highest efficiency of the three estimators for thicker tails.

Figure 5: Estimator Maximum Achievable Efficiency for the t-Distribution versus Distribution Parameter $\nu$. Maximum achievable asymptotic (left) and small-sample ($n=3p$; right) efficiencies at the maximum breakdown point are plotted.

The maximum achievable small-sample relative efficiencies using $n=3p$ are plotted on the right of Figure 5. The initial estimates were made using the Peña and Prieto (2007) kurtosis plus specific directions (KSD) estimator, as recommended and provided by Maronna and Yohai (2017). Comparing these finite-sample results with the asymptotic ones on the left, it is seen that the relative results are similar. This general similarity implies that the relative performance of the asymptotic efficiencies can often be a good surrogate for the relative performance of the finite-sample efficiencies when there is no closed-form expression for the divergence in (13).

The estimators' maximum achievable asymptotic efficiencies for the variance gamma distribution with $\psi=2$ are plotted as a function of the parameter $\lambda$ on the left of Figure 6. The plots highlight the Laplace ($\lambda=1$) and multivariate hyperbolic ($\lambda=(p+1)/2$) distributions. The S-q exhibits good performance for the hyperbolic and remarkably good performance for the Laplace.

Figure 6: Estimator Maximum Achievable Asymptotic Efficiency for the Generalized Hyperbolic Distribution versus Parameter $\lambda$. Maximum achievable asymptotic efficiencies at the maximum breakdown point are plotted for the variance gamma distribution with parameter $\psi=2$ (left) and for the generalized hyperbolic distribution with parameters $\psi=2$ and $\chi=1$ (right).

The estimators' maximum achievable asymptotic efficiencies for the generalized hyperbolic distribution with $\psi=2$ and $\chi=1$ are plotted as a function of the parameter $\lambda$ on the right of Figure 6. The plots highlight the normal inverse Gaussian ($\lambda=-1/2$) and hyperbolic ($\lambda=(p+1)/2$) distributions. The S-q again exhibits good performance for the hyperbolic, and it exhibits remarkably good performance for the normal inverse Gaussian.

5 Robustness Analysis

The robustness of the S-q estimator is now explored. First, the breakdown point is provided. The influence function is then examined. Finally, finite-sample simulation results further illustrate the robustness of the high-breakdown estimators.

5.1 Breakdown Point

The finite-sample breakdown point of a multivariate estimator of location or scatter is defined in terms of the fraction of the $n$ samples that can be replaced in order to drive $\|\widehat{\boldsymbol{\mu}}\|$ to infinity or drive an eigenvalue of $\widehat{\boldsymbol{\Sigma}}$ to either zero or infinity. Unlike multivariate M-estimators, which only achieve an asymptotic breakdown point of $(p+1)^{-1}$ (Maronna, 1976), S-estimators are able to achieve the maximum possible finite-sample breakdown point that any affine equivariant estimator may have (Davies, 1987, Th. 6). For the following theorem, the term samples in general position means that no more than $p$ samples are contained in any hyperplane of dimension less than $p$.

Theorem 7 (Finite-sample breakdown point).

Assuming (A1) and $q<1$, when the $n$ samples are in general position and $n(1-2b)\geq p+1$, the breakdown point of the S-q estimator is $(\lfloor nb\rfloor+1)/n$.

Proof.

As discussed in Section 2.3, $q<1$ ensures a proper S-estimator with a finite value $c$ and a bounded rho function. See (Davies, 1987, Th. 5). ∎

Corollary 1.

The maximum breakdown point is $\lfloor(n-p+1)/2\rfloor/n$, which is achieved when $b=1/2-(p+1)/(2n)$. Asymptotically, this is $1/2$ at $b=1/2$.
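For example, with $p=20$ and $n=100p=2000$, the maximum breakdown point is $\lfloor(n-p+1)/2\rfloor/n=990/2000=0.495$, attained at $b=1/2-(p+1)/(2n)\approx 0.4948$, close to the asymptotic value of $1/2$.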

5.2 Influence Function

The influence function (IF) of an estimator characterizes its sensitivity to an infinitesimal point contamination at $\boldsymbol{z}\in\mathbb{R}^{p}$, standardized by the mass of the contamination, $\epsilon$. The influence function for estimator $\boldsymbol{T}$, at the nominal distribution $F$, is defined as

$$\boldsymbol{\operatorname{IF}}\left(\boldsymbol{z};\boldsymbol{T},F\right)=\lim_{\epsilon\rightarrow 0^{+}}\frac{\boldsymbol{T}((1-\epsilon)F+\epsilon\Delta_{\boldsymbol{z}})-\boldsymbol{T}(F)}{\epsilon}=\left.\frac{\partial}{\partial\epsilon}\boldsymbol{T}((1-\epsilon)F+\epsilon\Delta_{\boldsymbol{z}})\right|_{\epsilon=0},$$

where $\epsilon$ is the proportion of samples that are a point mass, $\Delta_{\boldsymbol{z}}$, located at $\boldsymbol{z}$.

Theorem 8 (Influence function).

Assuming (A1) and (A2), the influence functions of the S-q estimates $\widehat{\boldsymbol{\mu}}$ and $\widehat{\boldsymbol{\Sigma}}$ are given by

$$\boldsymbol{\operatorname{IF}}\left(\boldsymbol{z};\boldsymbol{\mu},F\right)=\frac{\sqrt{d_{z}}\,w_{q}(d_{z}/\sigma)}{\omega_{2}}\frac{\boldsymbol{z}_{c}}{\sqrt{d_{z}}},$$
$$\boldsymbol{\operatorname{IF}}\left(\boldsymbol{z};\boldsymbol{\Sigma},F\right)=\frac{\rho_{q}\left(d_{z}/\sigma\right)-b}{\lambda_{2}}\boldsymbol{\Sigma}+\frac{p(p+2)\left(d_{z}/\sigma\right)w_{q}\left(d_{z}/\sigma\right)}{\lambda_{1}}\left(\frac{\boldsymbol{z}_{c}\boldsymbol{z}_{c}^{\operatorname{T}}}{d_{z}}-\frac{1}{p}\boldsymbol{\Sigma}\right), \qquad (17)$$

where $\boldsymbol{z}_{c}=\boldsymbol{z}-\boldsymbol{\mu}$ and $d_{z}=\boldsymbol{z}_{c}^{\operatorname{T}}\boldsymbol{\Sigma}^{-1}\boldsymbol{z}_{c}$, and where the scalars $\omega_{2}$, $\lambda_{1}$, and $\lambda_{2}$ were defined in Theorem 4.

Proof.

See (Lopuhaä, 1989, Corollary 5.2) and (Lopuhaä, 1997, Remark 2). ∎

By the definition of S-estimators with a normalized rho function, the magnitude of the first term of (17) is clearly bounded by $\lambda_{2}^{-1}\boldsymbol{\Sigma}$. Therefore, to compare the influence functions of the S-q, S-Rocke, and MM-SHR estimators, we focus on the second term. From this term, define $\alpha_{\boldsymbol{\Sigma}}\left(d_{z}\right)=\lambda_{1}^{-1}p(p+2)(d_{z}/\sigma)w(d_{z}/\sigma)$ for each estimator. Figure 7 plots $\alpha_{\boldsymbol{\Sigma}}(d_{z})$ at the 10-dimensional Gaussian distribution for the estimators as configured in Figure 2.

Figure 7: Example Comparison of the Influence Function Parameter $\alpha_{\boldsymbol{\Sigma}}(d_{z})$. For the 10-dimensional Gaussian distribution, $\alpha_{\boldsymbol{\Sigma}}(d_{z})$ is plotted for the estimators depicted in Figure 2.

By definition, all highly robust estimators have bounded influence functions, and for the three estimators considered here, the influence functions are continuous. This means that small amounts of contamination have limited effects on their estimates. The gross-error sensitivity of an estimator is the maximum of $\boldsymbol{\operatorname{IF}}\left(\boldsymbol{z}\right)$, and in this example, the S-q demonstrates a lower gross-error sensitivity than the S-Rocke and MM-SHR estimators. By its definition, the MM-SHR has an inlier rejection point of zero, meaning inliers can negatively influence its estimates. However, proper Type II S-q functions have positive inlier rejection points, which provide robustness against inliers.

Relative to the S-Rocke and MM-SHR estimators, the S-q often has larger outlier rejection points. This is the cost of its generally higher efficiency and ability to reject inliers. However, due to its continuity, the influence near this point is still greatly attenuated.

5.3 Finite-Sample Robustness

To empirically compare the finite-sample robustness of the estimators, we employed the simulation method used by Maronna and Yohai (2017) and plot the shape matrix divergence, $\operatorname{D}\left(\boldsymbol{\mu},\boldsymbol{\Omega};\boldsymbol{\mu},\widehat{\boldsymbol{\Omega}}\right)$, versus the shift contamination value $k$. For a contamination proportion $\epsilon$, the first element of each of the $\lfloor\epsilon n\rfloor$ contaminated samples was replaced with the value $k$, that is, $x_{1}=k$. The initial estimates of the weighted algorithm were determined with the KSD estimator. Figure 8 provides divergence plots for normally distributed data with $\epsilon=10\%$ contamination, for dimensions $p=5$ and $p=20$, and for sample sizes $n=5p$ and $n=100p$. For the cases where $p=20$, the estimators were tuned to 90% uncontaminated relative efficiency. When $p=5$, the S-Rocke has poor maximum efficiency, so the estimators were tuned to match the maximum S-Rocke efficiency.

Figure 8: Estimator Divergence versus Contamination Value $k$ for the Gaussian Distribution. Gaussian shape matrix divergences are plotted for $p=5$ (left) and $p=20$ (right), and for small-sample ($n=5p$; top) and large-sample ($n=100p$; bottom) sizes.

These plots show that the robustness of the S-q is on par with that of the other two estimators. Consistent with the results in Maronna and Yohai (2017), the relative worst-case performance of the estimators varies with such factors as dimension, sample size, and contamination percentage. For example, the S-q performs best here for $p=5$, $n=100p$, but the MM-SHR is best for $p=5$, $n=5p$.

6 Computational Aspects

This section explores computational aspects of the S-q and other high-breakdown estimators when using their weighted-sum algorithms. The estimators' stabilities are first assessed by comparing their sensitivities to the initial estimates, $\widehat{\boldsymbol{\mu}}^{(0)}$ and $\widehat{\boldsymbol{\Omega}}^{(0)}$. The computational efficiency is then evaluated by comparing the computational convergence rates of the estimators.

6.1 Stability

The primary criticism of high-breakdown estimators is that their solutions are highly sensitive to the initial estimates $\widehat{\boldsymbol{\mu}}^{(0)}$ and $\widehat{\boldsymbol{\Omega}}^{(0)}$ due to the non-convexity of their objective functions. The S-q estimator helps mitigate this with a generally wider weight function (see, for example, Figure 2). To demonstrate that the S-q is more stable with respect to the initial estimates, $m$ Gaussian Monte Carlo simulation trials were run where, for each trial, $\boldsymbol{\Omega}$ was estimated twice for each estimator using different initializations. For the first estimate, $\widehat{\boldsymbol{\Omega}}^{(0)}$ was set to the MLE using all $n$ samples before contamination. For the second estimate, $\widehat{\boldsymbol{\Omega}}^{(0)}$ was set to the MLE using just 25% of the samples before contamination, resulting in a larger expected variance of $\widehat{\boldsymbol{\Omega}}^{(0)}$. The sample mean of the divergence between the two final estimates, $\bar{\operatorname{D}}\left(\widehat{\boldsymbol{\Omega}}_{1},\widehat{\boldsymbol{\Omega}}_{2}\right)$, was then calculated. The same values of $p$, $n$, and $\epsilon$ were used as in the Section 5.3 simulations. The contamination method was also the same, and the value of $k$ was set to the worst-case value for that estimator and for the values of $p$, $n$, and $\epsilon$ (see Figure 8).

The results are presented in the center of Table 4. It is seen that the S-q estimator was consistently the most stable of the three estimators, and the MM-SHR was generally the most sensitive. For the near-asymptotic cases, the S-q exhibited no measurable differences between the two estimates, unlike the MM-SHR. Like the S-q, the S-Rocke had no measurable differences between the two estimates for the uncontaminated near-asymptotic cases, but under contamination, its mean divergence was roughly on par with the MM-SHR.

Table 4: Estimator Stability and Computational Efficiency for Normally Distributed Data
Dim. pp   Samples nn   Contam. ϵ\epsilon   Mean Divergence (S-q / S-Rocke / MM-SHR)   Median No. of Iterations (S-q / S-Rocke / MM-SHR)
5 100p100p 0% 0 0 8e-5 13 14 14
5 100p100p 10% 0 7e-4 5e-4 14 15 15
20 100p100p 0% 0 0 5e-5 7 7 7
20 100p100p 10% 0 1e-3 3e-2 18 8 16
5 5p5p 0% 2e-1 4e-1 2e-0 32 14 15
5 5p5p 10% 3e-1 5e-1 2e-0 31 15 15
20 5p5p 0% 6e-5 4e-2 1e-0 28 18 37
20 5p5p 10% 1e-0 2e-0 3e-0 30 18 33
Note: Mean divergence values listed as “0” have simulated average divergences less than the numerical convergence criterion, D(𝛀^(i),𝛀^(i1))<1010.\operatorname{D}\left(\widehat{\boldsymbol{\Omega}}^{(i)},\widehat{\boldsymbol{\Omega}}^{(i-1)}\right)<10^{-10}.

6.2 Computational Efficiency

To compare the relative computational efficiencies of the high-breakdown estimators, we calculated the median number of iterations required for the estimators to converge on normally distributed data for various values of pp, nn, and ϵ\epsilon. All three estimators were set to use the same tight convergence criterion, D(𝛀^(i),𝛀^(i1))<1010.\operatorname{D}\left(\widehat{\boldsymbol{\Omega}}^{(i)},\widehat{\boldsymbol{\Omega}}^{(i-1)}\right)<10^{-10}. The initial estimates were determined with the KSD estimator, and the estimators were tuned as in the Section 5.3 simulations. The contamination method was also the same, and the value of kk was set to the worst-case value for that estimator.
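The iteration counting can be outlined as follows. The sketch below runs a generic weighted-sum scatter update until the divergence between successive shape estimates drops below 10^{-10}. It is a hedged illustration rather than any of the estimators compared here: the weight function w(d) = (p + ν)/(ν + d) is the multivariate t MLE weight, used only as a stand-in for the S-q, S-Rocke, and MM-SHR weight functions; the location is taken as known (zero); and `kl_shape_divergence` again plays the role of the divergence D.

```python
import numpy as np

def kl_shape_divergence(omega_a, omega_b):
    """KL-type divergence between zero-mean Gaussians (assumed stand-in for D)."""
    p = omega_a.shape[0]
    m = np.linalg.solve(omega_b, omega_a)
    return 0.5 * (np.trace(m) - p - np.log(np.linalg.det(m)))

def reweighted_scatter(X, omega0, nu=5.0, tol=1e-10, max_iter=500):
    """Generic weighted-sum scatter iteration with the convergence test D < tol."""
    n, p = X.shape
    omega = omega0.copy()
    for it in range(1, max_iter + 1):
        # Squared Mahalanobis distances under the current shape estimate.
        d = np.einsum('ij,jk,ik->i', X, np.linalg.inv(omega), X)
        w = (p + nu) / (nu + d)                 # stand-in weight function (t MLE)
        omega_new = (w[:, None] * X).T @ X / n  # weighted sum-of-squares update
        omega_new *= p / np.trace(omega_new)    # normalize the scale (shape matrix)
        if kl_shape_divergence(omega_new, omega) < tol:
            return omega_new, it
        omega = omega_new
    return omega, max_iter

rng = np.random.default_rng(1)
p, n = 5, 500
X = rng.standard_normal((n, p))                 # zero-location Gaussian data
omega_hat, n_iter = reweighted_scatter(X, np.eye(p))
print(n_iter)                                   # iterations to reach the criterion
```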

The results are presented on the right of Table 4. For the large-sample (n=100pn=100p) simulations, the S-q converges approximately as fast as the other two estimators (except for the one case where p=20,p=20, ϵ=10%\epsilon=10\%, in which the S-Rocke performs notably better). The S-Rocke estimator consistently converges fastest for all of the small-sample (n=5pn=5p) cases, and the small-sample convergence of the S-q estimator is relatively consistent, albeit at the upper end of the range. The small-sample convergence of the MM-SHR is on par with that of the S-Rocke for small p,p, but worse than the others for large p.p.

7 Application to Financial Portfolio Optimization

A common financial application of mean vectors and covariance matrices is in modern portfolio theory for the optimal allocation of portfolio investments. Under modern portfolio theory’s mean-variance framework, a minimum-variance portfolio aims to minimize the risk (i.e., variance) of the portfolio return subject to a desired expected return (Markowitz, 1952). Mathematically, this is expressed as

\min_{\boldsymbol{\alpha}}\quad \boldsymbol{\alpha}^{\operatorname{T}}\boldsymbol{\Omega}_{r}\boldsymbol{\alpha}
\text{subject to}\quad \boldsymbol{\alpha}^{\operatorname{T}}\boldsymbol{\mu}_{r}=\mu_{p},\;\boldsymbol{\alpha}^{\operatorname{T}}\boldsymbol{1}=1,

where 𝜶\boldsymbol{\alpha} is the normalized vector of portfolio allocations across assets, 𝛀r\boldsymbol{\Omega}_{r} is the shape (or, equivalently, covariance) matrix of the asset returns, 𝝁r\boldsymbol{\mu}_{r} is the vector of expected asset returns, μp\mu_{p} is the desired expected portfolio return, and 𝟏\boldsymbol{1} is a vector of ones. The solution is given by (Roy, 1952; Merton, 1972)

\boldsymbol{\alpha} = s_{r}\left(\boldsymbol{\mu}_{r}^{\operatorname{T}}\boldsymbol{\Omega}_{r}^{-1}\boldsymbol{\mu}_{r}\right)\boldsymbol{\Omega}_{r}^{-1}\boldsymbol{1} - s_{r}\left(\boldsymbol{1}^{\operatorname{T}}\boldsymbol{\Omega}_{r}^{-1}\boldsymbol{\mu}_{r}\right)\boldsymbol{\Omega}_{r}^{-1}\boldsymbol{\mu}_{r}
\quad + s_{r}\mu_{p}\left(\boldsymbol{1}^{\operatorname{T}}\boldsymbol{\Omega}_{r}^{-1}\boldsymbol{1}\right)\boldsymbol{\Omega}_{r}^{-1}\boldsymbol{\mu}_{r} - s_{r}\mu_{p}\left(\boldsymbol{\mu}_{r}^{\operatorname{T}}\boldsymbol{\Omega}_{r}^{-1}\boldsymbol{1}\right)\boldsymbol{\Omega}_{r}^{-1}\boldsymbol{1},   (18)

where srs_{r} is a scalar that ensures the elements of 𝜶\boldsymbol{\alpha} sum to one.
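As an illustration, the closed-form allocation (18) can be computed directly as in the sketch below; the numerical inputs are toy placeholders rather than estimates produced by any of the estimators discussed here.

```python
import numpy as np

def min_variance_weights(mu_r, omega_r, mu_p):
    """Minimum-variance allocation computed from the closed form in (18)."""
    ones = np.ones_like(mu_r)
    inv_mu = np.linalg.solve(omega_r, mu_r)    # Omega_r^{-1} mu_r
    inv_one = np.linalg.solve(omega_r, ones)   # Omega_r^{-1} 1
    alpha = ((mu_r @ inv_mu) * inv_one - (ones @ inv_mu) * inv_mu
             + mu_p * (ones @ inv_one) * inv_mu - mu_p * (mu_r @ inv_one) * inv_one)
    return alpha / alpha.sum()                 # s_r: rescale so the weights sum to one

# Toy placeholder inputs standing in for estimates of mu_r and Omega_r.
mu_r = np.array([0.0004, 0.0006, 0.0002])
omega_r = 1e-4 * np.array([[2.0, 0.5, 0.2],
                           [0.5, 1.5, 0.3],
                           [0.2, 0.3, 1.0]])
alpha = min_variance_weights(mu_r, omega_r, mu_p=0.00038)
print(alpha, alpha.sum(), alpha @ mu_r)        # weights, their sum (one), achieved expected return
```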

In this section, the performances of the MM-SHR, S-Rocke, and S-q estimators are compared for the optimal allocation of investment in the component stocks of the Dow Jones Industrial Average. For each estimator, the parameters 𝛀r\boldsymbol{\Omega}_{r} and 𝝁r\boldsymbol{\mu}_{r} were estimated for the daily returns of the component stocks. Then, using a desired portfolio daily return of μp=.038%\mu_{p}=.038\% (corresponding to a 10% annual return), the optimal allocations, 𝜶,\boldsymbol{\alpha}, were calculated using (18). Using the 𝜶\boldsymbol{\alpha} of each estimator, the portfolio return was then calculated for each business day of the verification period, assuming daily re-balancing of the investments. Finally, each estimator’s performance was characterized by the variance of these daily returns. This variance is a measure of the volatility of the portfolio.
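Given an allocation 𝜶 and a matrix of verification-period daily returns (one row per day, one column per asset), the daily re-balanced portfolio returns and their sample variance can be obtained as in this brief sketch, which uses simulated returns and an arbitrary fixed allocation in place of the Dow data and the estimated weights:

```python
import numpy as np

rng = np.random.default_rng(2)
n_days, n_assets = 35, 3
returns = rng.normal(0.0004, 0.01, size=(n_days, n_assets))  # simulated daily asset returns
alpha = np.array([0.5, 0.3, 0.2])       # placeholder allocation; in practice this comes from (18)

portfolio_returns = returns @ alpha     # portfolio return achieved each day under daily re-balancing
volatility = portfolio_returns.var(ddof=1)  # sample variance of daily returns, the metric reported here
print(volatility)
```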

For the S-q estimator, we noted that Konlack Socgnia and Wilcox (2014) showed that the generalized hyperbolic distribution is a good model for stock returns and, specifically, that the variance gamma subclass has good parameter stability over time. Although their analysis is for log returns, daily gross returns are generally close to one, so log returns closely approximate linear returns, and the variance gamma model should therefore also fit gross (i.e., linear) returns well. For the variance gamma S-q estimator, a density-weighted M-estimator was used to estimate the model parameters λ\lambda and ψ\psi.
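To make the log-versus-linear argument concrete, the snippet below compares daily log and linear returns for a short, made-up price series; because daily gross returns stay close to one, the two nearly coincide:

```python
import numpy as np

prices = np.array([100.0, 100.4, 99.9, 101.1])   # made-up daily closing prices
gross = prices[1:] / prices[:-1]                 # gross returns, all close to one
linear = gross - 1.0                             # linear (simple) returns
log_ret = np.log(gross)                          # log returns
print(np.max(np.abs(log_ret - linear)))          # tiny: log and linear returns nearly coincide
```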

To demonstrate the robustness of the S-q estimator, we begin by noting that the first quarter of 2020 contained a once-in-a-generation period of extremely high volatility due to the COVID-19 pandemic, as depicted in Figure 9. This volatility started on approximately February 21. Each estimator’s performance was assessed by estimating the parameters 𝛀r\boldsymbol{\Omega}_{r} and 𝝁r\boldsymbol{\mu}_{r} using all the returns from the first quarter, then comparing the variances of the daily portfolio returns for only the pre-pandemic period (prior to February 21). Each estimator was set to its maximum breakdown point and then tuned to its maximum asymptotic efficiency with respect to the variance gamma distribution, whose parameters were estimated by maximum likelihood from the daily returns for the years 2016–2019.

Figure 9: Daily Returns of Dow Jones Industrial Average and Component Stocks for 2016–2020. The 25 stocks that were components of the index throughout the depicted period are plotted in the background. The overall index value is plotted on top.

Table 5 summarizes the results, listing the variances of the daily returns. The S-q estimator performed best, with the lowest variance, indicating high robustness. The MM-SHR performed second best, followed by the S-Rocke. The sample estimator of mean and covariance was also included to demonstrate its poor robustness.

Table 5: Sample Variances of Achieved Daily Returns for 01-Jan-2020 – 20-Feb-2020
Estimator Variance
S-q 76
MM-SHR 119
S-Rocke 147
Sample 176

Next, to demonstrate estimator efficiency, variances of daily returns were compared for a non-volatile period: 2016 through 2019. Using the same methodology and configuration as before, for each year and each estimator, 𝛀r\boldsymbol{\Omega}_{r} and 𝝁r\boldsymbol{\mu}_{r} were estimated. Then, 𝜶\boldsymbol{\alpha} was calculated and applied to each day of that year. The sample variances of each year’s daily portfolio returns are listed in Table 6. The S-q estimator resulted in the lowest portfolio variance for three of the four years, and the lowest variance on average, indicating high estimator efficiency. On average, the performance of the MM-SHR estimator was behind that of the S-q estimator, and the S-Rocke demonstrated substantially worse performance.

Table 6: Sample Variances of Achieved Daily Returns by Year
Year S-q MM-SHR S-Rocke
2016 3.51 3.56 4.10
2017 1.18 1.24 1.34
2018 6.76 6.91 7.31
2019 3.60 3.50 4.49
Sample Mean 3.77 3.80 4.31

8 Conclusion

The S-q estimator has been introduced as a new tunable multivariate estimator of location, scatter, and shape matrices for elliptical probability distributions. This new estimator is a subclass of S-estimators, which can achieve the maximum theoretical breakdown point. The S-q estimator has been compared with the leading high-breakdown estimators. Across elliptical distributions, the S-q generally has higher efficiency and stability, and its robustness is on par with that of these other leading estimators. Additionally, the S-q provides a monotonic and upper-bounded efficiency tuning parameter, which makes it simpler to tune than the MM-SHR. The S-q therefore provides practitioners with a good general-purpose high-breakdown multivariate estimator that can be applied across a broad range of practical problems, such as the optimal portfolio example.

References

  • Abusev (2015) Abusev, R.A., 2015. On the Distances Between Certain Distributions in Multivariate Statistical Analysis. Journal of Mathematical Sciences 205, 2–6. doi:10.1007/s10958-015-2222-y.
  • Basu et al. (1998) Basu, A., Harris, I.R., Hjort, N.L., Jones, M.C., 1998. Robust and Efficient Estimation by Minimising a Density Power Divergence. Biometrika 85, 549–559.
  • Bilodeau and Brenner (1999) Bilodeau, M., Brenner, D., 1999. Theory of Multivariate Statistics. Springer-Verlag, New York.
  • Choi and Hall (2000) Choi, B.Y.E., Hall, P., 2000. Rendering Parametric Procedures More Robust by Empirically Tilting the Model. Biometrika 87, 453–465.
  • Davies (1987) Davies, P.L., 1987. Asymptotic Behaviour of S-Estimates of Multivariate Location Parameters and Dispersion Matrices. The Annals of Statistics 15, 1269–1292.
  • Deng and Yao (2018) Deng, X., Yao, J., 2018. On the property of multivariate generalized hyperbolic distribution and the Stein-type inequality. Communications in Statistics - Theory and Methods 47, 5346–5356. doi:10.1080/03610926.2017.1390134.
  • Fang et al. (1990) Fang, K.T., Kotz, S., Ng, K.W., 1990. Symmetric Multivariate and Related Distributions. Chapman and Hall, London.
  • Ferrari and Yang (2010) Ferrari, D., Yang, Y., 2010. Maximum Lq-likelihood estimation. The Annals of Statistics 38, 753–783. doi:10.1214/09-AOS687.
  • Frahm (2009) Frahm, G., 2009. Asymptotic distributions of robust shape matrices and scales. Journal of Multivariate Analysis 100, 1329–1337. doi:10.1016/j.jmva.2008.11.007.
  • Gazor and Zhang (2003) Gazor, S., Zhang, W., 2003. Speech probability distribution. IEEE Signal Processing Letters 10, 204–207. doi:10.1109/LSP.2003.813679.
  • Huang et al. (2006) Huang, J.Z., Liu, N., Pourahmadi, M., Liu, L., 2006. Covariance matrix selection and estimation via penalised normal likelihood. Biometrika 93, 85–98. doi:10.1093/biomet/93.1.85.
  • Huber (1964) Huber, P.J., 1964. Robust Estimation of a Location Parameter. The Annals of Mathematical Statistics 35, 73–101.
  • Kelker (1970) Kelker, D., 1970. Distribution Theory of Spherical Distributions and a Location-Scale Parameter Generalization. Sankhyā: The Indian Journal of Statistics, Series A (1961-2002) 32, 419–430.
  • Konlack Socgnia and Wilcox (2014) Konlack Socgnia, V., Wilcox, D., 2014. A comparison of generalized hyperbolic distribution models for equity returns. Journal of Applied Mathematics 2014. doi:10.1155/2014/263465.
  • Lopuhaä (1989) Lopuhaä, H.P., 1989. On the Relation between S-Estimators and M-Estimators of Multivariate Location and Covariance. The Annals of Statistics 17, 1662–1683.
  • Lopuhaä (1997) Lopuhaä, H.P., 1997. Asymptotic expansion of S-estimators of location and covariance. Statistica Neerlandica 51, 220–237. doi:10.1111/1467-9574.00051.
  • Markowitz (1952) Markowitz, H., 1952. Portfolio Selection. The Journal of Finance 7, 77–91.
  • Maronna (1976) Maronna, R.A., 1976. Robust M-Estimators of Multivariate Location and Scatter. The Annals of Statistics 4, 51–67.
  • Maronna et al. (2006) Maronna, R.A., Martin, R.D., Yohai, V.J., 2006. Robust Statistics: Theory and Methods. First ed., John Wiley & Sons.
  • Maronna et al. (2019) Maronna, R.A., Martin, R.D., Yohai, V.J., Salibián-Barrera, M., 2019. Robust Statistics Theory and Methods (with R). Second ed., John Wiley & Sons.
  • Maronna and Yohai (2017) Maronna, R.A., Yohai, V.J., 2017. Robust and efficient estimation of multivariate scatter and location. Computational Statistics and Data Analysis 109, 64–75. doi:10.1016/j.csda.2016.11.006.
  • Merton (1972) Merton, R.C., 1972. An Analytic Derivation of the Efficient Portfolio Frontier. The Journal of Financial and Quantitative Analysis 7, 1851–1872.
  • Peña and Prieto (2007) Peña, D., Prieto, F.J., 2007. Combining Random and Specific Directions for Outlier Detection and Robust Estimation in High-Dimensional Multivariate Data. Journal of Computational and Graphical Statistics 16, 228–254.
  • Rocke (1996) Rocke, D.M., 1996. Robustness Properties of S-Estimators of Multivariate Location and Shape in High Dimension. The Annals of Statistics 24, 1327–1345.
  • Rousseeuw and Hubert (2013) Rousseeuw, P., Hubert, M., 2013. High-Breakdown Estimators of Multivariate Location and Scatter, in: Becker, C., Fried, R., Kuhnt, S. (Eds.), Robustness and Complex Data Structures. Springer, New York. chapter 4, pp. 49–66. doi:10.1007/978-3-642-35494-6.
  • Rousseeuw and Yohai (1984) Rousseeuw, P., Yohai, V., 1984. Robust Regression by Means of S-Estimators, in: Robust and Nonlinear Time Series Analysis. Springer US, New York, NY, pp. 256–272. doi:10.1007/978-1-4615-7821-5_15.
  • Roy (1952) Roy, A.D., 1952. Safety First and the Holding of Assets. Econometrica 20, 431–449.
  • Tyler (1982) Tyler, D.E., 1982. Radial estimates and the test for sphericity. Biometrika 69, 429–436. doi:10.1093/biomet/69.2.429.
  • Tyler (1983) Tyler, D.E., 1983. Robustness and efficiency properties of scatter matrices. Biometrika 70, 411–420. doi:10.1093/biomet/71.3.656-a.
  • Ward et al. (1990) Ward, K.D., Baker, C.J., Watts, S., 1990. Maritime surveillance radar. Part 1. Radar scattering from the ocean surface. IEE Proceedings F (Radar and Signal Processing) 137, 51–62. doi:10.1049/ip-f-2.1990.0009.
  • Windham (1995) Windham, M.P., 1995. Robustifying Model Fitting. Journal of the Royal Statistical Society. Series B (Methodological) 57, 599–609.