
Robust mixture regression with Exponential Power distribution

CHEN XIAO Hong Kong Baptist University
(July 2020)
Abstract

Assuming an exponential power distribution for the errors is one way to deal with outliers in regression and clustering, and it can increase the robustness of the analysis. The Gaussian distribution is a special case of the exponential power distribution, and an exponential power distribution can be viewed as a scale mixture of normal distributions. Thus, model selection methods developed for Gaussian mixture models can be easily extended to exponential power mixture models. Moreover, Gaussian mixture models tend to select more components than exponential power mixture models in real-world cases, which makes exponential power mixture models easier to interpret. In this paper, we develop mixture regression models in which the errors are assumed to follow an exponential power distribution. The resulting model is robust to outliers, and model selection for it is easy to implement.

Keywords: Robust mixture regression, model selection, exponential power distribution

1 Introduction

Finite mixtures of regression models have been broadly used to deal with various heterogeneous data. In most cases, the errors are assumed to follow normal distributions and the parameters are estimated by the maximum likelihood estimator (MLE). However, such a model is very sensitive to outliers.

There are many methods for handling outliers in mixtures of linear regression models. The trimmed likelihood estimator (TLE) (Neykov et al., 2007) uses only the N\alpha observations with the largest log-likelihood contributions. The choice of \alpha is very important for the TLE: it should be neither too small nor too large, but it is not easy to choose. Yao et al. (2014) proposed a robust mixture of linear regression models by assuming the errors follow the t-distribution. Song et al. (2014) dealt with the same problem using the Laplace distribution.

Assuming an exponential power distribution is one way to deal with outliers in regression and clustering (Salazar et al., 2012; Liang et al., 2007; Ferreira and Salazar, 2014). The Gaussian and Laplace distributions are special cases of the exponential power distribution. Leptokurtic exponential power distributions have heavier tails than the Gaussian distribution, which provides protection against outliers. Moreover, an exponential power distribution is a scale mixture of normal distributions, which makes it easier to interpret.

Outliers are not the only problem: model selection is also a challenge for mixture models. Gaussian mixture models tend to select more components than exponential power mixture models in real-world cases, which greatly reduces interpretability [7]. One possible explanation is that the exponential power distribution is itself a scale mixture of Gaussian distributions.

In this paper, we introduce robust mixture regression with the exponential power distribution and a model selection method for it.

2 Exponential Power (EP) Distribution

The density function of the Exponential Power (EP) distribution with mean 0 and shape parameter p>0 is

f_{p}(e;0,\eta)=\frac{p\eta^{1/p}}{2\Gamma(1/p)}\exp\left\{-\eta|e|^{p}\right\}, \qquad (1)

where \Gamma(\cdot) is the Gamma function. When 0<p<2, the distribution is heavy-tailed, which provides protection against outliers. When p=1, Equation (1) defines the Laplace distribution; when p=2, it defines the Gaussian distribution.
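As a quick reference, the following sketch evaluates the density in Equation (1); the function name and the use of scipy.special.gamma are illustrative choices, not part of the paper.

import numpy as np
from scipy.special import gamma

def ep_density(e, eta, p):
    # Exponential Power density with mean 0, scale eta and shape p (Equation (1)).
    const = p * eta ** (1.0 / p) / (2.0 * gamma(1.0 / p))
    return const * np.exp(-eta * np.abs(e) ** p)

# p = 1 recovers the Laplace density and p = 2 the Gaussian density
# (up to the eta parametrization).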

3 Robust regression with mixture of exponential power distribution

We assume that (x_{i},y_{i}) follows the k-th of the following linear regression models with probability \pi_{k},

y_{i}=x^{\prime}_{i}\beta_{k}+e_{ik},\qquad k=1,\ldots,K \qquad (2)

where \sum_{k=1}^{K}\pi_{k}=1 and e_{ik} follows the Exponential Power (EP) distribution f_{p_{k}}\left(e_{ik};0,\eta_{k}\right).

Let \bm{\Theta}=\left\{\bm{\pi},\bm{\eta},\bm{\beta}\right\} denote the set of unknown parameters.

The complete likelihood function is

\mathbb{P}(\mathbf{y};\mathbf{x},\bm{\Theta})=\prod_{i=1}^{N}\prod_{k=1}^{K}\left[\pi_{k}f_{p_{k}}\left(y_{i}-x_{i}^{\prime}\beta_{k};0,\eta_{k}\right)\right]^{z_{ik}}

where z_{ik}\in\{0,1\} is an indicator variable denoting that y_{i} comes from the k-th linear regression model, and \sum_{k=1}^{K}z_{ik}=1.

And the complete log-likelihood function is

\log\mathbb{P}(\mathbf{y};\mathbf{x},\bm{\Theta})=\sum_{i=1}^{N}\sum_{k=1}^{K}z_{ik}\left[\log\pi_{k}+\log f_{p_{k}}\left(y_{i}-x_{i}^{\prime}\beta_{k};0,\eta_{k}\right)\right]

In practice, the number of components K has to be estimated, so we use a penalty function to select K automatically (Huang et al., 2017):

l(\Theta)=\log\mathbb{P}(\mathbf{y};\mathbf{x},\Theta)-P(\pi;\lambda)

where P(\pi;\lambda)=n\lambda\sum_{k=1}^{K}D_{k}\log\frac{\epsilon+\pi_{k}}{\epsilon},

\epsilon is a very small positive number, \lambda is a positive tuning parameter, and D_{k} is the number of free parameters in the k-th component.
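A minimal sketch of the penalty term P(\pi;\lambda), written for the common case where every component has the same number of free parameters D; the function name and vectorized form are assumptions.

import numpy as np

def penalty(pi, lam, D, n, eps=1e-6):
    # P(pi; lambda) = n * lambda * sum_k D_k * log((eps + pi_k) / eps)
    pi = np.asarray(pi)
    return n * lam * np.sum(D * np.log((eps + pi) / eps))

The penalized log-likelihood l(\Theta) is then the complete log-likelihood minus this quantity.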

4 Algorithm

We propose a generalized expectation-maximization (GEM) algorithm to fit the proposed model.

In the E step, the conditional expectation of z_{ik} given the current parameter estimates is computed by Bayes' rule:

\gamma_{ik}^{(t+1)}=\frac{\pi_{k}^{(t)}f_{p_{k}}\left(y_{i}-x_{i}^{\prime}\beta_{k}^{(t)}\mid 0,\eta_{k}^{(t)}\right)}{\sum_{l=1}^{K}\pi_{l}^{(t)}f_{p_{l}}\left(y_{i}-x_{i}^{\prime}\beta_{l}^{(t)}\mid 0,\eta_{l}^{(t)}\right)} \qquad (3)
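A sketch of the E step in Equation (3), reusing the ep_density helper sketched in Section 2; the array layout (an N-by-K matrix of responsibilities) is an assumption about the implementation.

import numpy as np

def e_step(y, X, pi, beta, eta, p):
    # Returns gamma, an (N, K) matrix with gamma[i, k] = gamma_ik^(t+1).
    N, K = len(y), len(pi)
    dens = np.empty((N, K))
    for k in range(K):
        resid = y - X @ beta[k]                      # y_i - x_i' beta_k^(t)
        dens[:, k] = pi[k] * ep_density(resid, eta[k], p[k])
    return dens / dens.sum(axis=1, keepdims=True)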

The penalized Q-function is then

Q\left(\bm{\Theta},\bm{\Theta}^{(t)}\right)=\sum_{i,k}\gamma_{ik}^{(t+1)}\left[\log f_{p_{k}}\left(y_{i}-x_{i}^{\prime}\beta_{k}\mid 0,\eta_{k}\right)+\log\pi_{k}\right]-n\lambda\sum_{k=1}^{K}D_{k}\log\frac{\epsilon+\pi_{k}}{\epsilon}

Then, in the M step, \bm{\Theta} is updated by maximizing Q\left(\bm{\Theta},\bm{\Theta}^{(t)}\right):

\pi_{k}^{(t+1)}=\max\left\{0,\frac{1}{1-\lambda KD_{k}}\left[\sum_{i}\gamma_{ik}^{(t+1)}-\lambda D_{k}\right]\right\} \qquad (4)
\eta_{k}^{(t+1)}=\frac{N_{k}}{p_{k}\sum_{i}\gamma_{ik}^{(t+1)}\left|y_{i}-x_{i}^{\prime}\beta_{k}\right|^{p_{k}}} \qquad (5)

where N_{k}=\sum_{i}\gamma_{ik}^{(t+1)}.
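A sketch of the updates (4) and (5). Averaging the responsibilities over i in the \pi update (rather than summing them) and renormalizing are assumptions made here so that the proportions stay on the probability scale; components trimmed to \pi_{k}=0 can then be dropped.

import numpy as np

def m_step_pi_eta(gamma, resid, p, lam, D):
    # gamma: (N, K) responsibilities; resid: (N, K) residuals y_i - x_i' beta_k.
    N, K = gamma.shape
    p = np.asarray(p, dtype=float)
    # Equation (4), with gamma averaged over i (assumption) and renormalized.
    pi = np.maximum(0.0, (gamma.mean(axis=0) - lam * D) / (1.0 - lam * K * D))
    pi = pi / pi.sum()
    # Equation (5): eta_k = N_k / (p_k * sum_i gamma_ik |y_i - x_i' beta_k|^{p_k})
    Nk = gamma.sum(axis=0)
    eta = Nk / (p * np.sum(gamma * np.abs(resid) ** p, axis=0))
    return pi, eta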

Ignoring the terms not involving \beta_{k}, we have

Q(\beta_{k})=-\sum_{i=1}^{N}\sum_{k=1}^{K}\gamma_{ik}^{(t+1)}\eta_{k}^{(t+1)}\left|y_{i}-x_{i}^{\prime}\beta_{k}\right|^{p_{k}} \qquad (6)

We then use the MM algorithm to update \beta_{k}, based on the majorization

\begin{split}\{(y_{i}-x^{\prime}_{i}\beta_{k})^{\prime}(y_{i}-x^{\prime}_{i}\beta_{k})\}^{\frac{p_{k}}{2}}\leq{}&\{(y_{i}-x^{\prime}_{i}\beta_{k}^{(t)})^{\prime}(y_{i}-x^{\prime}_{i}\beta_{k}^{(t)})\}^{\frac{p_{k}}{2}}\\&+\frac{p_{k}}{2}\{(y_{i}-x^{\prime}_{i}\beta_{k}^{(t)})^{\prime}(y_{i}-x^{\prime}_{i}\beta_{k}^{(t)})\}^{\frac{p_{k}}{2}-1}\left((y_{i}-x^{\prime}_{i}\beta_{k})^{\prime}(y_{i}-x^{\prime}_{i}\beta_{k})-(y_{i}-x^{\prime}_{i}\beta_{k}^{(t)})^{\prime}(y_{i}-x^{\prime}_{i}\beta_{k}^{(t)})\right)\end{split}

Thus

\hat{\beta}_{k}=\arg\min_{\beta_{k}}\sum_{i=1}^{N}\sum_{k=1}^{K}\gamma_{ik}^{(t+1)}\eta_{k}^{(t+1)}W_{ik}^{(t)}(y_{i}-x^{\prime}_{i}\beta_{k})^{\prime}(y_{i}-x^{\prime}_{i}\beta_{k}), \qquad (7)

where W_{ik}^{(t)}=\frac{p_{k}}{2}\{(y_{i}-x^{\prime}_{i}\beta_{k}^{(t)})^{\prime}(y_{i}-x^{\prime}_{i}\beta_{k}^{(t)})\}^{\frac{p_{k}}{2}-1}.

Equation (7) leads to the update

\beta_{k}^{(t+1)}=\left(\sum_{i=1}^{N}\sum_{k=1}^{K}\gamma_{ik}^{(t+1)}\eta_{k}^{(t+1)}W_{ik}^{(t)}\bm{x}_{i}\bm{x}^{\prime}_{i}\right)^{-1}\left(\sum_{i=1}^{N}\sum_{k=1}^{K}\gamma_{ik}^{(t+1)}\eta_{k}^{(t+1)}W_{ik}^{(t)}\bm{x}_{i}y_{i}\right). \qquad (8)

We can also use the Newton-Raphson method to update \beta.
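The update (8) can be read as one weighted least-squares solve per component; the sketch below follows that per-component reading. The small ridge term added for numerical stability (W_{ik}^{(t)} is unbounded when a residual is exactly zero and p_{k}<2) is an assumption.

import numpy as np

def update_beta(y, X, gamma, eta, beta_old, p, ridge=1e-8):
    # Returns a (K, d) array of updated coefficient vectors beta_k^(t+1).
    N, d = X.shape
    K = gamma.shape[1]
    beta_new = np.empty((K, d))
    for k in range(K):
        resid = y - X @ beta_old[k]
        w_mm = 0.5 * p[k] * (resid ** 2 + ridge) ** (p[k] / 2.0 - 1.0)  # W_ik^(t)
        w = gamma[:, k] * eta[k] * w_mm
        XtW = X.T * w                                     # weighted design, (d, N)
        beta_new[k] = np.linalg.solve(XtW @ X + ridge * np.eye(d), XtW @ y)
    return beta_new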

5 Simulation

In this section, we demonstrate the effectiveness of the proposed method through a simulation study.

Since the TLE performed worse than the robust mixture regression model based on the t-distribution [4], and the exponential power distribution covers the Laplace distribution as a special case, we only compare

1. the robust mixture regression model based on the t-distribution (MixT),

2. the robust mixture regression model based on the exponential power distribution (MixEP).

To assess the different methods, we record the mean squared error (MSE) and the bias of the parameter estimates. To handle the label-switching issue, we use the label permutation that minimizes the difference between the estimated parameters and the true parameter values.
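A sketch of this label-matching step: choose the permutation of the estimated components that minimizes the squared distance to the true coefficients. Enumerating all permutations is cheap here since K=2; this helper is illustrative, not from the paper.

import itertools
import numpy as np

def align_labels(beta_hat, beta_true):
    # beta_hat, beta_true: (K, d) arrays of regression coefficients.
    K = beta_true.shape[0]
    best = min(itertools.permutations(range(K)),
               key=lambda perm: np.sum((beta_hat[list(perm)] - beta_true) ** 2))
    return beta_hat[list(best)]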

Example 1.1. Suppose the independently and identically distributed samples \left\{\left(x_{1i},x_{2i},y_{i}\right),\ i=1,\ldots,n\right\} are sampled from the model

Y=\begin{cases}0+X_{1}+\epsilon_{1},&\text{ if }Z=1\\0-X_{1}+\epsilon_{2},&\text{ if }Z=2\end{cases}

where Z is a component indicator with P(Z=1)=0.5, X_{1}\sim N(0,1), X_{2}\sim N(0,1), and both \epsilon_{1} and \epsilon_{2} follow the same distribution as \epsilon. We consider the following four cases for the error density of \epsilon (Yao et al., 2014):

Case I: \epsilon\sim t_{1}, the t-distribution with 1 degree of freedom;

Case II: \epsilon\sim 0.95N(0,1)+0.05N(0,5^{2}), a contaminated normal mixture;

Case III: \epsilon\sim N(0,1), with 5% high-leverage outliers generated as X_{1}=2+\epsilon and Y=10+\epsilon;

Case IV: Y=\begin{cases}0+X_{1}+\gamma_{1}+\epsilon_{1}\\0-X_{1}+\gamma_{2}+\epsilon_{2}\end{cases}, where \epsilon\sim N(0,1) and P\left(\gamma_{1}\in(4,6)\right)=P\left(\gamma_{2}\in(4,6)\right)=0.1, i.e., standard normal errors with 10% outlier contamination.
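The following sketch shows how the simulation data might be generated; it covers only Cases I and II, and the generator setup and return format are assumptions.

import numpy as np

def simulate(n=400, case="I", rng=None):
    rng = np.random.default_rng() if rng is None else rng
    x1, x2 = rng.standard_normal(n), rng.standard_normal(n)
    z = rng.random(n) < 0.5                              # P(Z = 1) = 0.5
    if case == "I":
        eps = rng.standard_t(df=1, size=n)               # Case I: t_1 errors
    else:                                                # Case II: 0.95 N(0,1) + 0.05 N(0, 5^2)
        scale = np.where(rng.random(n) < 0.05, 5.0, 1.0)
        eps = scale * rng.standard_normal(n)
    y = np.where(z, x1, -x1) + eps
    X = np.column_stack([np.ones(n), x1, x2])            # intercept, X1, X2
    return X, y, z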

Case I: The number of samples is N = 400.

Figure 1: Simulation data for Case I.

Figure 1 shows the simulation data for Case I. We replicate this experiment 50 times, and the comparison of the different methods is shown in Figure 2 and Table 1.

Figure 2: Boxplots of the biases of the estimates of \beta for (a) MixEP and (b) MixT.
MSE
Method | \beta_{00}=0 | \beta_{01}=1 | \beta_{10}=0 | \beta_{11}=-1
MixT | 0.0169 | 0.01936 | 0.01491 | 0.0203
MixEP | 0.3973 | 0.2175 | 0.2594 | 0.4337
Bias
Method | \beta_{00}=0 | \beta_{01}=1 | \beta_{10}=0 | \beta_{11}=-1
MixT | -0.0169 | 0.01393 | 0.03333 | 0.001208
MixEP | 0.13677 | -0.2754 | -0.007769 | 0.2658
Table 1: Case I: The mean squared errors (MSE) and biases of the parameter estimates for each estimation method.

Case II: The number of samples is N = 400.

Figure 3: Simulation data for Case II.

Figure 3 shows the simulation data for Case II. We replicate this experiment 50 times, and the comparison of the different methods is shown in Figure 4 and Table 2.

Figure 4: Boxplots of the biases of the estimates of \beta for (a) MixEP and (b) MixT.
MSE
Method | \beta_{00}=0 | \beta_{01}=1 | \beta_{10}=0 | \beta_{11}=-1
MixT | 0.0111 | 0.0105 | 0.006161 | 0.01433
MixEP | 0.0189 | 0.02219 | 0.01943 | 0.02024
Bias
Method | \beta_{00}=0 | \beta_{01}=1 | \beta_{10}=0 | \beta_{11}=-1
MixT | 0.02289 | -0.009907 | -0.01731 | 0.02171
MixEP | -0.0116 | -0.0408 | 0.02623 | 0.03788
Table 2: Case II: The mean squared errors (MSE) and biases of the parameter estimates for each estimation method.

Case III: The number of samples is N = 400.

Figure 5: Simulation data for Case III.

Figure 5 shows the simulation data for Case III. We replicate this experiment 50 times, and the comparison of the different methods is shown in Figure 6 and Table 3.

Figure 6: Boxplots of the biases of the estimates of \beta for (a) MixEP and (b) MixT.
MSE
Method | \beta_{00}=0 | \beta_{01}=1 | \beta_{10}=0 | \beta_{11}=-1
MixT | 0.01188 | 0.01052 | 0.01857 | 0.02060
MixEP | 0.03807 | 0.03744 | 0.02973 | 0.02881
Bias
Method | \beta_{00}=0 | \beta_{01}=1 | \beta_{10}=0 | \beta_{11}=-1
MixT | -0.01858 | -0.00432 | -0.006022 | -0.008520
MixEP | 0.06676 | -0.08744 | -0.01664 | 0.03178
Table 3: Case III: The mean squared errors (MSE) and biases of the parameter estimates for each estimation method.

Case IV: The number of samples is N = 400.

Figure 7: Simulation data for Case IV.

Figure 7 shows the simulation data for Case IV. We replicate this experiment 50 times, and the comparison of the different methods is shown in Figure 8 and Table 4.

Figure 8: Boxplots of the biases of the estimates of \beta for (a) MixEP and (b) MixT.
MSE
Method | \beta_{00}=0 | \beta_{01}=1 | \beta_{10}=0 | \beta_{11}=-1
MixT | 3.6119 | 0.1376 | 2.4279 | 0.1662
MixEP | 0.0411 | 0.02742 | 0.06486 | 0.04627
Bias
Method | \beta_{00}=0 | \beta_{01}=1 | \beta_{10}=0 | \beta_{11}=-1
MixT | 0.8222 | -0.1153 | 0.6159 | 0.1979
MixEP | 0.08276 | -0.02965 | 0.1790 | 0.05284
Table 4: Case IV: The mean squared errors (MSE) and biases of the parameter estimates for each estimation method.

It is hard to choose the degrees of freedom for the mixture of t-distributions. In the experiments shown above, the degrees of freedom leading to the best results were used; however, the degrees of freedom selected by BIC could lead to poor results. In addition, selecting the number of components is also a problem for mixture regression with the t-distribution.

6 Conclusion

The adaptively trimmed version of MixT worked well when there are high-leverage outliers, but it cannot deal with drift. We also cannot select proper degrees of freedom for MixT using BIC when faced with drift.

Adaptively removing abnormal points can improve the estimation accuracy of MixEP. A natural idea is to build a model that covers the drift; this will be future work.

References

  • [1] Huang, T., Peng, H., and Zhang, K. (2017). Model selection for Gaussian mixture models. Statistica Sinica, 27(1), 147-169.
  • [2] Lange, K. (2016). MM Optimization Algorithms. Society for Industrial and Applied Mathematics.
  • [3] Song, W., Yao, W., and Xing, Y. (2014). Robust mixture regression model fitting by Laplace distribution. Computational Statistics and Data Analysis, 71, 128-137. https://doi.org/10.1016/j.csda.2013.06.022
  • [4] Yao, W., Wei, Y., and Yu, C. (2014). Robust mixture regression using the t-distribution. Computational Statistics and Data Analysis, 71, 116-127. https://doi.org/10.1016/j.csda.2013.07.019
  • [5] Neykov, N., Filzmoser, P., Dimova, R., and Neytchev, P. (2007). Robust fitting of mixtures using the trimmed likelihood estimator. Computational Statistics and Data Analysis, 52(1), 299-308. https://doi.org/10.1016/j.csda.2006.12.024
  • [6] Chang, W., Zhou, X., Zang, Y., Zhang, C., and Cao, S. (2020). A new algorithm using component-wise adaptive trimming for robust mixture regression. arXiv:2005.11599. https://arxiv.org/abs/2005.11599
  • [7] Cao, X., Zhao, Q., Meng, D., Chen, Y., and Xu, Z. (2016). Robust low-rank matrix factorization under general mixture noise distributions. IEEE Transactions on Image Processing, 25(10), 4677-4690. doi: 10.1109/TIP.2016.2593343
  • [8] Liang, F., Liu, C., and Wang, N. (2007). A robust sequential Bayesian method for identification of differentially expressed genes. Statistica Sinica, 17, 571-597.
  • [9] Salazar, E., Ferreira, M.A.R., and Migon, H.S. (2012). Objective Bayesian analysis for exponential power regression models. Sankhya Series B, 74, 107-125.
  • [10] Ferreira, M.A. and Salazar, E. (2014). Bayesian reference analysis for exponential power regression models. Journal of Statistical Distributions and Applications, 1, 12.