
Spatial Heterogeneity Automatic Detection and Estimation

Xin Wang Email: wangx172@miamioh.edu Department of Statistics, Miami University Zhengyuan Zhu Email: zhuz@iastate.edu Department of Statistics, Iowa State University Hao Helen Zhang Email: hzhang@math.arizona.edu Department of Mathematics, University of Arizona
Abstract

Spatial regression is widely used for modeling the relationship between a dependent variable and explanatory covariates. Oftentimes, the linear relationship varies across space, with some covariates having location-specific effects on the response. One fundamental question is how to detect the systematic variation in the model and identify which locations share common regression coefficients and which do not. Only a correct model structure can assure unbiased estimation of coefficients and valid inferences. In this work, we propose a new procedure, called Spatial Heterogeneity Automatic Detection and Estimation (SHADE), for automatically and simultaneously subgrouping and estimating covariate effects for spatial regression models. The SHADE employs a class of spatially-weighted fusion-type penalties on all pairs of observations, with location-specific weights adaptively constructed using spatial information, to cluster coefficients into subgroups. Under certain regularity conditions, the SHADE is shown to identify the true model structure with probability approaching one and to estimate the regression coefficients consistently. We develop an alternating direction method of multipliers (ADMM) algorithm to compute the SHADE efficiently. In numerical studies, we demonstrate the empirical performance of the SHADE using different choices of weights and compare their accuracy. The results suggest that spatial information can enhance subgroup structure analysis in challenging situations where the spatial variation among regression coefficients is small or the number of repeated measures is small. Finally, the SHADE is applied to model the relationship between a natural resource survey and a land cover data layer and to identify spatially interpretable groups.

Key words: Areal data; structure selection; penalization; repeated measures; spatial heterogeneity; subgroup analysis

1 Introduction

Spatial regression is commonly used to model the relationship between a response and explanatory variables. For complex problems, some covariates (we call them global covariates) may have constant effects across space, while other covariates (we call them local covariates) may have location-specific effects, i.e., their effects on the response variable vary across space. This problem has received wide attention in many fields such as environmental sciences (Hu and Bradley, 2018), biology (Zhang and Lawson, 2011), social science (Bradley et al., 2018), economics (Brunsdon et al., 1996), and biostatistics (Xu et al., 2019).

A motivating example concerns the relationship between two landcover data sources: the National Resources Inventory (NRI; Nusser and Goebel, 1997) survey conducted by the USDA Natural Resources Conservation Service (NRCS), and the Cropland Data Layer (CDL; Han et al., 2012) produced by the USDA National Agricultural Statistics Service (NASS). Accurate estimates of local landcover information from NRI are essential for developing conservation policies and land management plans. However, direct estimates in small geographical areas, such as at the county level, may not be accurate due to small sample sizes. Auxiliary information such as CDL can be used to improve the small area estimator in NRI (Wang et al., 2018). Traditional regression models used in small area estimation problems typically assume common regression coefficients over all domains, which may not be appropriate. For example, when we examined the linear relationship between the NRI and CDL estimates of different types of land cover at the county level, the regression coefficients in the Mountain states were quite different from those on the west coast and in the vast areas in the east. This is reflected in Figure 9 (a) in Section 5. One reason for this difference is the scope of the NRI survey, which only includes non-federal land in the US. Another contributing reason is that CDL is created by training separate machine learning models at the state level using only ground observations from that state, which creates variation among states. A common regression assumption would be too simple to capture the regional differences and would lead to biased estimators. This type of spatial heterogeneity is also known as structural instability. For linear models, it implies that the linear relationship changes geographically over space, and the linear regression coefficients may form subgroups. Identifying the correct grouping structure of the regression coefficients is an important and challenging problem, as only a correct model structure can lead to unbiased estimation of the regression coefficients and valid inference. In practice, ad hoc groupings of states into regions defined by tradition or for federal administrative purposes are sometimes used to address this issue. However, such groupings are not driven by the data in specific problems, and may not be appropriate or efficient. For example, the central region in Figure 9 (a) includes not only all the Mountain states, but also North and South Dakota, which are not traditionally considered Mountain states. Figure 9 (b) suggests a further division of the east into sub-regions, which does not align with any well-known administrative regions. One natural approach is to assume that the regression coefficients of nearby states are more likely to belong to the same subgroup than those of states farther apart, and to use both the estimated regression coefficients and the spatial structure to guide the clustering of states into subregions, which is what we propose to do in this paper.

The problem of taking spatial dependence structure into account in linear regression has long been studied in the literature. Classical works include spatial expansion methods (Casetti, 1972; Casetti and Jones, 1987), which treat the spatially varying regression coefficients as functions of expansion variables, typically using longitude and latitude coordinates as location variables. One popular approach to accounting for spatial variation in the model is to introduce an additive spatial random effect for each location, as done for linear models by Cressie (2015) and generalized linear models by Diggle et al. (1998). Another class of models in wide use is spatially varying coefficient models, including geographically weighted regression (GWR; Brunsdon et al., 1996) and its extensions to generalized linear models (Nakaya et al., 2005) and the Cox model for survival analysis (Xue et al., 2020; Hu and Huffer, 2020). There have also been developments in the Bayesian framework, such as Gelfand et al. (2003).

The methods mentioned above typically assume that the regression parameters are smooth functions of the location variables, which is reasonable in certain applications, but may not be appropriate when the covariate effects are constant over subregions defined by some unobserved hidden factors. In this work, we take a different perspective by grouping the covariate effects into spatially interpretable subgroups or clusters. As in the motivating example, different clusters have different patterns, which can be used to build more flexible estimators that improve the original direct estimates. A majority of the existing work on spatial cluster detection is based on hypothesis tests, including the scan statistic methods based on the likelihood ratio (Kulldorff and Nagarwalla, 1995; Jung et al., 2007; Cook et al., 2007) and the two-step spatial test methods under the GWR framework (Lee et al., 2017, 2020). Test-based methods are intuitive and useful in practice, but proper test statistics are often difficult to construct and the tests may have low power when the number of locations is large. In addition, these methods handle cluster detection and model estimation separately, making it difficult to study inferential properties of the final estimator. The main purpose of this article is to fill this gap by developing a unified framework to detect clusters of regression coefficients, estimate them consistently, and make valid inferences.

In the context of non-spatial data analysis, a variety of clustering methods have been proposed to identify homogeneous groups of either observations or regression coefficients. Chi and Lange (2015) developed a method for the convex clustering problem through the alternating direction method of multipliers (ADMM) (Boyd et al., 2011) with a pairwise $L_{p}\,(p\geq 1)$ penalty, where nonnegative weights are incorporated into the pairwise penalties to reduce bias. Fan et al. (2018) considered a clustering problem with an $l_{0}$ penalty on graphs. For clustering regression coefficients, Ma and Huang (2017) and Ma et al. (2019) proposed a concave fusion approach for estimating the group structure and the subgroup-specific effects, using the smoothly clipped absolute deviation (SCAD) penalty (Fan and Li, 2001) and the minimax concave penalty (MCP) (Zhang, 2010).

For spatial analysis, some interesting work on grouping regression coefficients in spatial regression has recently been proposed. Ma et al. (2020) proposed Bayesian heterogeneity pursuit regression models to detect clusters in the covariate effects based on the Dirichlet process. Hu et al. (2020) proposed a Bayesian method for clustering coefficients with auxiliary covariate random effects, based on a mixture of finite mixtures (MFM). Li and Sang (2019) proposed a penalized approach based on the minimum spanning tree. In the area of spatial boundary detection, Lu and Carlin (2005) and Lu et al. (2007) considered areal boundary detection using a Bayesian hierarchical model based on the conditional autoregressive model (Banerjee et al., 2014), where the boundaries are determined by the posterior distribution of the corresponding spatial process or spatial weights. These boundary detection methods focus on clustering observations instead of regression coefficients.

The main difficulty in clustering spatial covariate effects is how to estimate the number of clusters and the cluster memberships consistently while properly taking spatial neighborhood information into account. To our knowledge, none of the existing methods addresses this challenge. In this work, we fill the gap by proposing a new procedure, called Spatial Heterogeneity Automatic Detection and Estimation (SHADE), for automatically grouping and estimating local covariate effects simultaneously. The SHADE employs a class of spatially-weighted fusion-type penalties on all pairs of observations, with location-specific weights adaptively constructed using the geographical proximity of locations, and achieves spatial clustering consistency for spatial linear regression. In theory, we show that the oracle estimator based on weighted least squares is a local minimizer of the objective function with probability approaching 1 under certain regularity conditions, which implies that the number of clusters can be estimated consistently. We believe that this result is the first of its kind in the context of spatial data analysis. To implement the SHADE, an alternating direction method of multipliers (ADMM) algorithm is developed. To make good choices of spatial weights and understand their roles in practical applications, we consider a number of different schemes for choosing the pairwise weights and compare them numerically and theoretically. Our numerical examples suggest that the number of clusters and the group structure can be recovered with high probability, and that spatial information helps in spatial clustering analysis when the minimal group difference is small or the number of repeated measures is small.

The article is organized as follows. In Section 2, we describe the Spatial Heterogeneity Automatic Detection and Estimation (SHADE) model and the corresponding ADMM algorithm. In Section 3, we establish theoretical properties of the SHADE estimator. A simulation study is conducted in Section 4 under several scenarios to assess the performance of the proposed estimator. The proposed method is applied to an NRI small area estimation problem in Section 5 to illustrate the use of SHADE in real-data applications. Finally, some discussions are given in Section 6.

2 Methodology and Algorithm

2.1 Methodology: SHADE

Assume our spatial data consist of multiple measurements at each location or subject. Let $y_{ih}$ be the $h$th response for the $i$th subject observed at location $\bm{s}_{i}$, where $i=1,\dots,n$ and $h=1,\dots,n_{i}$. Based on their effects on the response variable, the covariates can be divided into two categories: “global” covariates, which have common effects on the response across all locations, and “local” covariates, which have location-specific effects on the response. Accordingly, let $\bm{z}_{ih}$ and $\bm{x}_{ih}$ denote the global and local covariate vectors, with dimensions $q$ and $p$, respectively. We consider the following linear regression model

yih=𝒛ihT𝜼+𝒙ihT𝜷i+ϵih,y_{ih}=\bm{z}_{ih}^{T}\bm{\eta}+\bm{x}_{ih}^{T}\bm{\beta}_{i}+\epsilon_{ih}, (1)

where $\bm{\eta}$ is the vector of common regression coefficients shared by the global effects, the $\bm{\beta}_{i}$'s are location-specific regression coefficients, and the $\epsilon_{ih}$'s are i.i.d. random errors with $E(\epsilon_{ih})=0$ and $Var(\epsilon_{ih})=\sigma^{2}$. Furthermore, some locations may have the same or similar location-specific effects, and grouping locations with the same effects can achieve dimension reduction and improve model prediction accuracy. Assume the $n$ location-specific effects belong to $K$ mutually exclusive subgroups: locations with a common $\bm{\beta}_{i}$ belong to the same group. Denote the corresponding partition of $\{1,\dots,n\}$ as $\mathcal{G}=\{\mathcal{G}_{1},\dots,\mathcal{G}_{K}\}$, where the index set $\mathcal{G}_{k}$ contains all the locations belonging to group $k$, for $k=1,\ldots,K$. For convenience, denote the regression coefficients associated with $\mathcal{G}_{k}$ as $\bm{\alpha}_{k}$. In practice, since neither $K$ nor the partition $\mathcal{G}_{k}$'s are known, the goal is to use the observed data $\{(y_{ih},\bm{z}_{ih},\bm{x}_{ih})\}$ to construct the estimator $\hat{K}$ and the partition $\hat{\mathcal{G}}=\{\hat{\mathcal{G}}_{1},\dots,\hat{\mathcal{G}}_{\hat{K}}\}$, where $\hat{\mathcal{G}}_{k}=\{i:\hat{\bm{\beta}}_{i}=\hat{\bm{\alpha}}_{k},1\leq i\leq n\}$.

To achieve this goal, we solve the following optimization problem: minimize the weighted least squares objective function augmented with a spatially-weighted pairwise penalty,

Qn(𝜼,𝜷;λ,ψ)=12i=1n1nih=1ni(yih𝒛ihT𝜼𝒙ihT𝜷i)2+1i<jnpγ(𝜷i𝜷j,cijλ),Q_{n}\left(\bm{\eta},\bm{\beta};\lambda,\psi\right)=\frac{1}{2}\sum_{i=1}^{n}\frac{1}{n_{i}}\sum_{h=1}^{n_{i}}\left(y_{ih}-\bm{z}_{ih}^{T}\bm{\eta}-\bm{x}_{ih}^{T}\bm{\beta}_{i}\right)^{2}+\sum_{1\leq i<j\leq n}p_{\gamma}\left(\left\|\bm{\beta}_{i}-\bm{\beta}_{j}\right\|,c_{ij}\lambda\right), (2)

where $\bm{\eta}=(\eta_{1},\ldots,\eta_{q})^{T}$, $\bm{\beta}=(\bm{\beta}_{1}^{T},\dots,\bm{\beta}_{n}^{T})^{T}$, $\|\cdot\|$ denotes the Euclidean norm, and $p_{\gamma}(\cdot,\lambda)$ is a penalty function imposed on all distinct pairs. In the penalty function, $\lambda\geq 0$ is a tuning parameter, $\gamma>0$ is a built-in constant, and different weights $c_{ij}$ are assigned to different pairs of locations $\bm{s}_{i}$ and $\bm{s}_{j}$ for $1\leq i<j\leq n$. One popular choice of penalty is the $L_{1}$ penalty (lasso) (Tibshirani, 1996) of the form $p_{\gamma}(t,\lambda)=\lambda|t|$. Since the $L_{1}$ penalty tends to produce too many groups, as shown in Ma and Huang (2017), we consider the SCAD penalty, defined as

pγ(t,λ)=λ0|t|min{1,(γx/λ)+/(γ1)}𝑑x.p_{\gamma}(t,\lambda)=\lambda\int_{0}^{|t|}\min\{1,(\gamma-x/\lambda)_{+}/(\gamma-1)\}dx. (3)

Here we treat $\gamma$ as a fixed value, as in Fan and Li (2001), Zhang (2010) and Ma et al. (2019).
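For concreteness, the SCAD integral in (3) can be evaluated in closed form: it equals $\lambda|t|$ for $|t|\leq\lambda$ and the constant $\lambda^{2}(\gamma+1)/2$ for $|t|\geq\gamma\lambda$, with a quadratic transition in between. A minimal Python sketch (the function name and vectorized layout are our own):

```python
import numpy as np

def scad_penalty(t, lam, gamma=3.0):
    """Closed form of the SCAD penalty p_gamma(t, lam) in (3)."""
    a = np.abs(np.asarray(t, dtype=float))
    return np.where(
        a <= lam,                                   # linear (lasso) part
        lam * a,
        np.where(
            a <= gamma * lam,                       # quadratic transition
            (2 * gamma * lam * a - a**2 - lam**2) / (2 * (gamma - 1)),
            lam**2 * (gamma + 1) / 2,               # constant tail
        ),
    )
```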

2.2 Choices of Spatial Weights

In (2), the values of the weights $c_{ij}$ are crucial, as they control the number of subgroups and the grouping results. Pairs $\|\bm{\beta}_{i}-\bm{\beta}_{j}\|$ with larger weights $c_{ij}\lambda$ are shrunk toward each other more strongly than pairs with smaller weights. For spatial data, reasonable choices of $c_{ij}$ should take two factors into account: locations with closer coefficient values are more likely to be grouped together, and locations closer to each other are more likely to form a subgroup, as they typically have similar trends. Since the true values of $\bm{\beta}$ are not available, we use their estimators $\tilde{\bm{\beta}}$ as surrogates. For example, we can define the weights $c_{ij}$ as

cij=exp(ψ𝒔i𝒔j𝜷~i𝜷~j),c_{ij}=\exp\left(-\psi\left\|\bm{s}_{i}-\bm{s}_{j}\right\|\cdot\left\|\tilde{\bm{\beta}}_{i}-\tilde{\bm{\beta}}_{j}\right\|\right),

where $\tilde{\bm{\beta}}_{i}$ is an initial estimate of $\bm{\beta}_{i}$, and $\psi$ is a scale parameter controlling the magnitudes of the weights. For areal data, we suggest three different ways of incorporating the spatial information into the weights:
(i) using both spatial and regression coefficients information:

cij=exp(ψ(1aij)𝜷~i𝜷~j),c_{ij}=\exp\left(\psi\left(1-a_{ij}\right)\cdot\left\|\tilde{\bm{\beta}}_{i}-\tilde{\bm{\beta}}_{j}\right\|\right), (4)

where $a_{ij}$ is the neighbor order between locations $\bm{s}_{i}$ and $\bm{s}_{j}$: if $i$ and $j$ are neighbors, $a_{ij}=1$; if $i$ and $j$ are not neighbors but share at least one common neighbor, $a_{ij}=2$; and so on, so that a neighbor order is defined for every pair of subjects or locations.
(ii) using regression coefficients information only:

cij=exp(ψ𝜷~i𝜷~j).c_{ij}=\exp\left(-\psi\left\|\tilde{\bm{\beta}}_{i}-\tilde{\bm{\beta}}_{j}\right\|\right). (5)

(iii) using spatial information only:

cij=exp(ψ(1aij)).c_{ij}=\exp\left(\psi(1-a_{ij})\right). (6)

The weights in (4) and (5) both involve the regression coefficients and therefore depend on the accuracy of $\tilde{\bm{\beta}}_{i}$. If the number of repeated measures is not large, the values of $\tilde{\bm{\beta}}_{i}$ may not reflect the real relationship between different locations, which can lead to very poor weights; this phenomenon can be observed in the simulation study. The weights used here are three special cases that use the information in the regression coefficients, the spatial neighbor orders, or both. Certainly, there are other ways to construct weights, such as using spatial distance to borrow spatial information. As long as the weights satisfy condition (C4) in Section 3, the theoretical results will hold; for example, the weights in (5) satisfy (C4) automatically if the $\tilde{\bm{\beta}}_{i}$'s are consistent estimators. That is, besides condition (C4), no other conditions are imposed on the form of the weight function. A minimal computational sketch of these weight schemes is given below.
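The following Python sketch implements the three weight schemes (4)-(6); the function name, argument layout, and the convention that `neighbor_order[i, j]` stores $a_{ij}$ are our own assumptions, not part of the paper.

```python
import numpy as np

def shade_weights(beta_init, neighbor_order=None, psi=1.0, scheme="reg_sp"):
    """Pairwise weights c_ij for the SHADE penalty.

    beta_init:      (n, p) initial estimates beta_tilde_i
    neighbor_order: (n, n) matrix of neighbor orders a_ij (1 = adjacent)
    Returns an (n, n) symmetric matrix; only entries with i < j are used.
    """
    n = beta_init.shape[0]
    # pairwise distances between initial coefficient estimates
    diff = np.linalg.norm(beta_init[:, None, :] - beta_init[None, :, :], axis=2)
    if scheme == "reg":       # (5): coefficient information only
        return np.exp(-psi * diff)
    if scheme == "sp":        # (6): spatial neighbor order only
        return np.exp(psi * (1.0 - neighbor_order))
    if scheme == "reg_sp":    # (4): both; note 1 - a_ij <= 0, so c_ij <= 1
        return np.exp(psi * (1.0 - neighbor_order) * diff)
    if scheme == "equal":
        return np.ones((n, n))
    raise ValueError(f"unknown scheme: {scheme}")
```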

2.3 Algorithm for SHADE

In this section, we describe the ADMM algorithm for solving (2) in Section 2.1. The algorithm shares the same spirit as that of Ma and Huang (2017), where an unweighted penalty is used in a non-spatial setting without repeated measures.

There are two tuning parameters, $\lambda$ and $\psi$, in the proposed method. We choose them adaptively using the tuning procedure discussed at the end of this section. For now, we fix them and present the computational algorithm for solving (2). Denote the solution as

(𝜼^,𝜷^)=argmin𝜼q,𝜷npQn(𝜼,𝜷,λ,ψ).\left(\hat{\bm{\eta}},\hat{\bm{\beta}}\right)=\operatorname*{arg\,min}_{\bm{\eta}\in\mathbb{R}^{q},\bm{\beta}\in\mathbb{R}^{np}}Q_{n}\left(\bm{\eta},\bm{\beta},\lambda,\psi\right). (7)

First, we introduce slack variables $\bm{\delta}_{ij}=\bm{\beta}_{i}-\bm{\beta}_{j}$ for all pairs $(i,j)$ with $1\leq i<j\leq n$. The problem is then equivalent to minimizing the following objective function with respect to $(\bm{\eta},\bm{\beta},\bm{\delta})$,

min𝜼,𝜷,𝜹L0(𝜼,𝜷,𝜹)\displaystyle\min_{\bm{\eta},\bm{\beta},\bm{\delta}}\leavevmode\nobreak\ \leavevmode\nobreak\ L_{0}\left(\bm{\eta},\bm{\beta},\bm{\delta}\right) =12i=1n1nih=1ni(yih𝒛ihT𝜼𝒙ihT𝜷i)2+1i<jnpγ(𝜹ij,cijλ),\displaystyle=\frac{1}{2}\sum_{i=1}^{n}\frac{1}{n_{i}}\sum_{h=1}^{n_{i}}\left(y_{ih}-\bm{z}_{ih}^{T}\bm{\eta}-\bm{x}_{ih}^{T}\bm{\beta}_{i}\right)^{2}+\sum_{1\leq i<j\leq n}p_{\gamma}\left(\left\|\bm{\delta}_{ij}\right\|,c_{ij}\lambda\right),
subject to𝜷i𝜷j𝜹ij=𝟎, 1i<jn,\displaystyle\text{subject to}\leavevmode\nobreak\ \leavevmode\nobreak\ \bm{\beta}_{i}-\bm{\beta}_{j}-\bm{\delta}_{ij}=\bm{0},\leavevmode\nobreak\ \leavevmode\nobreak\ 1\leq i<j\leq n,

where $\bm{\delta}=(\bm{\delta}^{T}_{ij},1\leq i<j\leq n)^{T}$. To handle the equality constraints in the optimization problem, we introduce the augmented Lagrangian

L(𝜼,𝜷,𝜹,𝒗)=L0(𝜼,𝜷,𝜹)+i<j𝒗ij,𝜷i𝜷j𝜹ij+ϑ2i<j𝜷i𝜷j𝜹ij2,L\left(\bm{\eta},\bm{\beta},\bm{\delta},\bm{v}\right)=L_{0}\left(\bm{\eta},\bm{\beta},\bm{\delta}\right)+\sum_{i<j}\left\langle\bm{v}_{ij},\bm{\beta}_{i}-\bm{\beta}_{j}-\bm{\delta}_{ij}\right\rangle+\frac{\vartheta}{2}\sum_{i<j}\left\|\bm{\beta}_{i}-\bm{\beta}_{j}-\bm{\delta}_{ij}\right\|^{2},

where 𝒗=(𝒗ijT,1i<jn)T\bm{v}=(\bm{v}^{T}_{ij},1\leq i<j\leq n)^{T} are Lagrange multipliers and ϑ>0\vartheta>0 is the penalty parameter.

To solve the problem, we use an iterative algorithm which updates 𝜷,𝜼,𝜹,𝒗\bm{\beta},\bm{\eta},\bm{\delta},\bm{v} sequentially, one at a time. At the (m+1)(m+1)th iteration, given their current values (𝜷(m),𝜼(m),𝜹(m),𝒗(m))(\bm{\beta}^{(m)},\bm{\eta}^{(m)},\bm{\delta}^{(m)},\bm{v}^{(m)}), the updates of 𝜼,𝜷,𝜹,𝒗\bm{\eta},\bm{\beta},\bm{\delta},\bm{v} are

(𝜼(m+1),𝜷(m+1))\displaystyle\left(\bm{\eta}^{(m+1)},\bm{\beta}^{(m+1)}\right) =argmin𝜼,𝜷L(𝜼,𝜷,𝜹(m),𝒗(m)),\displaystyle=\operatorname*{arg\,min}_{\bm{\eta},\bm{\beta}}L\left(\bm{\eta},\bm{\beta},\bm{\delta}^{{(m)}},\bm{v}^{{(m)}}\right),
𝜹(m+1)\displaystyle\bm{\delta}^{(m+1)} =argmin𝜹L(𝜼(m+1),𝜷(m+1),𝜹,𝒗(m)),\displaystyle=\operatorname*{arg\,min}_{\bm{\delta}}L\left(\bm{\eta}^{(m+1)},\bm{\beta}^{(m+1)},\bm{\delta},\bm{v}^{(m)}\right),
\bm{v}_{ij}^{(m+1)} =\bm{v}_{ij}^{(m)}+\vartheta\left(\bm{\beta}_{i}^{(m+1)}-\bm{\beta}_{j}^{(m+1)}-\bm{\delta}_{ij}^{(m+1)}\right). (8)

To update 𝜼\bm{\eta} and 𝜷\bm{\beta}, we minimize the following objective function

f(𝜷,𝜼)=𝛀1/2(𝒚𝒁𝜼𝑿𝜷)2+ϑ𝑨𝜷𝜹(m)+ϑ1𝒗(m)2,f\left(\bm{\beta},\bm{\eta}\right)=\left\|\bm{\Omega}^{1/2}\left(\bm{y}-\bm{Z}\bm{\eta}-\bm{X}\bm{\beta}\right)\right\|^{2}+\vartheta\left\|\bm{A}\bm{\beta}-\bm{\delta}^{(m)}+\vartheta^{-1}\bm{v}^{(m)}\right\|^{2},

where 𝒚=(y11,,y1n1,,yn1,,yn,nn)T\bm{y}=\left(y_{11},\dots,y_{1n_{1}},\dots,y_{n1},\dots,y_{n,n_{n}}\right)^{T}, 𝒁=(𝒛11,,𝒛1n1,,𝒛n1,,𝒛n,nn)T\bm{Z}=\left(\bm{z}_{11},\dots,\bm{z}_{1n_{1}},\dots,\bm{z}_{n1},\dots,\bm{z}_{n,n_{n}}\right)^{T}, 𝑿=diag(𝑿1,,𝑿n)\bm{X}=\text{diag}\left(\bm{X}_{1},\dots,\bm{X}_{n}\right) with 𝑿i=(𝒙i1,,𝒙i,ni)T\bm{X}_{i}=\left(\bm{x}_{i1},\dots,\bm{x}_{i,n_{i}}\right)^{T}, 𝛀=diag(1/n1𝑰n1,,1/nn𝑰nn)\bm{\Omega}=\text{diag}\left(1/n_{1}\bm{I}_{n_{1}},\dots,1/n_{n}\bm{I}_{n_{n}}\right) and 𝑨=𝑫𝑰p\bm{A}=\bm{D}\otimes\bm{I}_{p} with an [n(n1)/2]×n[n\left(n-1\right)/2]\times n matrix 𝑫={(𝒆i𝒆j),i<j}T\bm{D}=\left\{\left(\bm{e}_{i}-\bm{e}_{j}\right),i<j\right\}^{T}, where 𝒆i\bm{e}_{i} is an n×1n\times 1 vector with iith element 1 and other elements 0. Then the solutions for 𝜷\bm{\beta} and 𝜼\bm{\eta} are

𝜷(m+1)=(𝑿T𝑸Z,Ω𝑿+ϑ𝑨T𝑨)1[𝑿T𝑸Z,Ω𝒚+ϑvec((𝚫(m)ϑ1𝚼(m))𝑫)],\bm{\beta}^{(m+1)}=\left(\bm{X}^{T}\bm{Q}_{Z,\Omega}\bm{X}+\vartheta\bm{A}^{T}\bm{A}\right)^{-1}\left[\bm{X}^{T}\bm{Q}_{Z,\Omega}\bm{y}+\vartheta\text{vec}\left(\left(\bm{\Delta}^{(m)}-\vartheta^{-1}\bm{\Upsilon}^{(m)}\right)\bm{D}\right)\right], (9)
𝜼(m+1)=(𝒁T𝛀𝒁)1𝒁T𝛀(𝒚𝑿𝜷(m+1)),\bm{\eta}^{(m+1)}=\left(\bm{Z}^{T}\bm{\Omega}\bm{Z}\right)^{-1}\bm{Z}^{T}\bm{\Omega}\left(\bm{y}-\bm{X}\bm{\beta}^{(m+1)}\right), (10)

where 𝚫(m)=(𝜹ij(m),i<j)p×n(n1)/2\bm{\Delta}^{(m)}=\left(\bm{\delta}_{ij}^{(m)},i<j\right)_{p\times n\left(n-1\right)/2}, 𝚼(m)=(𝒗ij(m),i<j)p×n(n1)/2\bm{\varUpsilon}^{(m)}=\left(\bm{v}_{ij}^{(m)},i<j\right)_{p\times n\left(n-1\right)/2} and

𝑸Z,Ω=𝛀𝛀𝒁(𝒁T𝛀𝒁)1𝒁T𝛀.\bm{Q}_{Z,\Omega}=\bm{\Omega}-\bm{\Omega}\bm{Z}\left(\bm{Z}^{T}\bm{\Omega}\bm{Z}\right)^{-1}\bm{Z}^{T}\bm{\Omega}.
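The updates (9) and (10) are ordinary linear-algebra operations. A minimal numpy sketch follows; the function and variable names are our own, and for clarity it builds the dense matrices $\bm{X}$, $\bm{\Omega}$, $\bm{D}$ and $\bm{A}$ explicitly rather than exploiting their sparsity.

```python
import numpy as np
from scipy.linalg import block_diag

def build_D(n):
    """D: rows e_i - e_j over all pairs i < j, shape (n(n-1)/2, n)."""
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    D = np.zeros((len(pairs), n))
    for l, (i, j) in enumerate(pairs):
        D[l, i], D[l, j] = 1.0, -1.0
    return D

def update_beta_eta(y, Z, X_blocks, n_i, Delta, Upsilon, vartheta):
    """One ADMM step for (beta, eta), following (9) and (10).

    X_blocks: list of n matrices of shape (n_i, p)
    Delta, Upsilon: (p, n(n-1)/2) matrices whose columns are delta_ij, v_ij
    """
    n, p = len(X_blocks), X_blocks[0].shape[1]
    X = block_diag(*X_blocks)
    Omega = np.diag(np.repeat(1.0 / np.asarray(n_i, float), n_i))
    ZtOZ = Z.T @ Omega @ Z
    # Q_{Z,Omega} = Omega - Omega Z (Z' Omega Z)^{-1} Z' Omega
    Q = Omega - Omega @ Z @ np.linalg.solve(ZtOZ, Z.T @ Omega)
    D = build_D(n)
    A = np.kron(D, np.eye(p))
    # vec((Delta - Upsilon/vartheta) D) stacks columns, matching beta's blocks
    rhs = X.T @ Q @ y + vartheta * ((Delta - Upsilon / vartheta) @ D).flatten(order="F")
    beta = np.linalg.solve(X.T @ Q @ X + vartheta * A.T @ A, rhs)
    eta = np.linalg.solve(ZtOZ, Z.T @ Omega @ (y - X @ beta))
    return eta, beta.reshape(n, p)
```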

The update of the $\bm{\delta}_{ij}$'s is componentwise and is equivalent to minimizing the following objective function

\frac{\vartheta}{2}\left\|\bm{\varsigma}_{ij}^{(m+1)}-\bm{\delta}_{ij}\right\|^{2}+p_{\gamma}\left(\left\|\bm{\delta}_{ij}\right\|,c_{ij}\lambda\right),

where $\bm{\varsigma}_{ij}^{(m+1)}=\left(\bm{\beta}_{i}^{(m+1)}-\bm{\beta}_{j}^{(m+1)}\right)+\vartheta^{-1}\bm{v}_{ij}^{(m)}$. The update based on the SCAD penalty has the closed form

𝜹ij(m+1)={S(𝝇ij(m+1),λcij/ϑ)if 𝝇ij(m+1)λcij+λcij/ϑ,S(𝝇ij(m+1),γλcij/((γ1)ϑ))11/((γ1)ϑ)if λcij+λcij/ϑ<𝝇ij(m+1)γλcij,𝝇ij(m+1)if 𝝇ij(m+1)>γλcij,\bm{\delta}_{ij}^{(m+1)}=\begin{cases}S\left(\bm{\varsigma}_{ij}^{(m+1)},\lambda c_{ij}/\vartheta\right)&\text{if }\left\|\bm{\varsigma}_{ij}^{(m+1)}\right\|\leq\lambda c_{ij}+\lambda c_{ij}/\vartheta,\\ \frac{S\left(\bm{\varsigma}_{ij}^{(m+1)},\gamma\lambda c_{ij}/\left(\left(\gamma-1\right)\vartheta\right)\right)}{1-1/\left(\left(\gamma-1\right)\vartheta\right)}&\text{if }\lambda c_{ij}+\lambda c_{ij}/\vartheta<\left\|\bm{\varsigma}_{ij}^{(m+1)}\right\|\leq\gamma\lambda c_{ij},\\ \bm{\varsigma}_{ij}^{(m+1)}&\text{if }\left\|\bm{\varsigma}_{ij}^{(m+1)}\right\|>\gamma\lambda c_{ij},\end{cases} (11)

where $\gamma>c_{ij}+c_{ij}/\vartheta$, $S(\bm{w},t)=(1-t/\|\bm{w}\|)_{+}\bm{w}$, and $(t)_{+}=t$ if $t>0$ and $0$ otherwise.
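A direct Python transcription of the closed-form update (11) for a single pair is given below (function names are our own); `group_soft` is the groupwise soft-thresholding operator $S(\bm{w},t)$.

```python
import numpy as np

def group_soft(w, t):
    """S(w, t) = (1 - t/||w||)_+ w, the groupwise soft-thresholding operator."""
    nw = np.linalg.norm(w)
    return max(0.0, 1.0 - t / nw) * w if nw > 0 else w

def update_delta(varsigma, c, lam, gamma, vartheta):
    """SCAD update (11) for one pair: varsigma = beta_i - beta_j + v_ij/vartheta.
    Requires (gamma - 1) * vartheta > 1 so the middle-case denominator is positive."""
    nv = np.linalg.norm(varsigma)
    if nv <= lam * c * (1.0 + 1.0 / vartheta):          # soft-threshold region
        return group_soft(varsigma, lam * c / vartheta)
    if nv <= gamma * lam * c:                           # SCAD transition region
        return group_soft(varsigma, gamma * lam * c / ((gamma - 1) * vartheta)) / (
            1.0 - 1.0 / ((gamma - 1) * vartheta))
    return varsigma                                     # no shrinkage
```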

In summary, the computational algorithm can be described as follows.

Algorithm: ADMM algorithm
0: Initialize $\bm{\beta}^{(0)}$, $\bm{\delta}^{(0)}$ and $\bm{v}^{(0)}$.
1: for $m=0,1,2,\dots$ do
2:   Update $\bm{\beta}$ by (9).
3:   Update $\bm{\eta}$ by (10).
4:   Update $\bm{\delta}$ by (11).
5:   Update $\bm{v}$ by (8).
6:   if convergence criterion is met then
7:     Stop and return the estimates.
8:   else
9:     $m=m+1$
10:  end if
11: end for

If $n_{i}$ is reasonably large, say 10 or more, we construct the initial values $\bm{\beta}^{(0)}$ by fitting the linear regression model $y_{ih}=\bm{z}_{ih}^{T}\bm{\eta}+\bm{x}_{ih}^{T}\bm{\beta}_{i}+\epsilon_{ih}$ separately for each $i=1,\dots,n$. Then, set $\bm{\delta}_{ij}^{(0)}=\bm{\beta}_{i}^{(0)}-\bm{\beta}_{j}^{(0)}$ and $\bm{v}^{(0)}=\bm{0}$. If $n_{i}=1$ or is small, the initial values can be set using the procedure in Ma et al. (2019): a ridge fusion criterion with a small value of the tuning parameter is used, and the initial group structure is then obtained by assigning subjects into $K^{*}$ (a given value) groups by ranking the estimated $\bm{\beta}_{i}$'s.

If $\hat{\bm{\delta}}_{ij}=\bm{0}$, then locations $i$ and $j$ belong to the same group. Thus, we can obtain the corresponding estimated partition $\hat{\mathcal{G}}$ and the estimated number of groups $\hat{K}(\lambda,\psi)$. For each group, its group-specific parameter vector is estimated as $\hat{\bm{\alpha}}_{k}=|\hat{\mathcal{G}}_{k}|^{-1}\sum_{i\in\hat{\mathcal{G}}_{k}}\hat{\bm{\beta}}_{i}$ for $k=1,\ldots,\hat{K}$.
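Reading the group structure off the fitted $\hat{\bm{\delta}}_{ij}$'s amounts to merging locations connected by (numerically) zero differences. A small union-find sketch, where the names and the tolerance `tol` are our own choices:

```python
import numpy as np

def extract_groups(delta_hat, pairs, n, tol=1e-8):
    """Merge i and j whenever ||delta_hat_ij|| < tol; return integer labels."""
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i
    for (i, j), d in zip(pairs, delta_hat):
        if np.linalg.norm(d) < tol:
            parent[find(i)] = find(j)
    roots = np.array([find(i) for i in range(n)])
    _, labels = np.unique(roots, return_inverse=True)
    return labels                            # values in {0, ..., K_hat - 1}

def group_means(beta_hat, labels):
    """alpha_hat_k = average of beta_hat_i over the k-th recovered group."""
    return np.vstack([beta_hat[labels == k].mean(axis=0)
                      for k in range(labels.max() + 1)])
```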

Remark 1.

If there are no global covariates, the model is simplified as yih=𝐱ihT𝛃i+ϵihy_{ih}=\bm{x}_{ih}^{T}\bm{\beta}_{i}+\epsilon_{ih}. The algorithm will be simplified, that is, 𝐐Z,Ω\bm{Q}_{Z,\Omega} will become 𝛀\bm{\Omega}. The model we use in the application is the simplified model.

Remark 2.

The convergence criterion used is the same as Ma and Huang, (2017), which is based on the primal residual 𝐫(m+1)=𝐀𝛃(m+1)𝛅(m+1)\bm{r}^{(m+1)}=\bm{A}\bm{\beta}^{(m+1)}-\bm{\delta}^{(m+1)}. The algorithm is stopped if 𝐫(m+1)<ε\|\bm{r}^{(m+1)}\|<\varepsilon, where ε\varepsilon is a small positive value.

We need to select two tuning parameters, $\lambda$ and $\psi$, in the SHADE algorithm. In this paper, we use the modified Bayesian information criterion (BIC) (Wang et al., 2007) to determine the best tuning parameters adaptively from the data. In particular, we use

\text{BIC}\left(\lambda,\psi\right)=\log\left[\frac{1}{n}\sum_{i=1}^{n}\frac{1}{n_{i}}\sum_{h=1}^{n_{i}}\left(y_{ih}-\bm{z}_{ih}^{T}\hat{\bm{\eta}}(\lambda,\psi)-\bm{x}_{ih}^{T}\hat{\bm{\beta}}_{i}(\lambda,\psi)\right)^{2}\right]+C_{n}\frac{\log n}{n}\left(\hat{K}(\lambda,\psi)p+q\right), (12)

where $C_{n}$ is a positive number which can depend on $n$. Here we use $C_{n}=c_{0}\log(\log(np+q))$ with $c_{0}=0.2$, following Ma and Huang (2017). To select $\psi$, we choose the best value from a set of candidates, such as 0.1, 0.5, 1, and 3. For each given $\psi$, we use the warm start and continuation strategy of Ma et al. (2019) to select the tuning parameter $\lambda$: a grid of $\lambda$ values is predefined within $[\lambda_{\text{min}},\lambda_{\text{max}}]$, and for each $\lambda$ the initial values are the estimates from the previous fit. Denote the selected tuning parameters as $\hat{\lambda}$ and $\hat{\psi}$. Correspondingly, the estimated group number is $\hat{K}(\hat{\lambda},\hat{\psi})$, and the estimated regression coefficients are $\hat{\bm{\beta}}$ and $\hat{\bm{\eta}}$.
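The criterion (12) is straightforward to compute from per-location residuals; a minimal sketch with our own function name and input layout:

```python
import numpy as np

def modified_bic(resid_by_loc, K_hat, p, q, c0=0.2):
    """Modified BIC (12); resid_by_loc is a list of residual vectors, one per location."""
    n = len(resid_by_loc)
    # (1/n) * sum_i (1/n_i) * sum_h residual^2
    sse = sum((r**2).sum() / len(r) for r in resid_by_loc) / n
    Cn = c0 * np.log(np.log(n * p + q))
    return np.log(sse) + Cn * (np.log(n) / n) * (K_hat * p + q)
```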

3 Theoretical Properties of SHADE

In this section, we study theoretical properties of the proposed SHADE estimator. Assume the $\mathcal{G}_{k}$'s form the true partition of the location-specific regression coefficients. Let $|\mathcal{G}_{k}|$ be the number of subjects in group $\mathcal{G}_{k}$ for $k=1,\dots,K$, and let $|\mathcal{G}_{\min}|$ and $|\mathcal{G}_{\max}|$ be the minimum and maximum group sizes, respectively. Let $\widetilde{\bm{W}}$ be an $n\times K$ matrix with elements $w_{ik}$, where $w_{ik}=1$ if $i\in\mathcal{G}_{k}$ and $w_{ik}=0$ otherwise. Denote $\bm{W}=\widetilde{\bm{W}}\otimes\bm{I}_{p}$, which is an $np\times Kp$ matrix, and $\bm{U}=(\bm{Z},\bm{X}\bm{W})$. Define $\mathcal{M}_{\mathcal{G}}=\{\bm{\beta}\in\mathbb{R}^{np}:\,\bm{\beta}_{i}=\bm{\beta}_{j},\text{ for }i,j\in\mathcal{G}_{k},\,1\leq k\leq K\}$. Using these notations, we can express $\bm{\beta}$ as $\bm{\beta}=\bm{W}\bm{\alpha}$ if $\bm{\beta}\in\mathcal{M}_{\mathcal{G}}$, where $\bm{\alpha}=(\bm{\alpha}_{1}^{T},\dots,\bm{\alpha}_{K}^{T})^{T}$. For any positive sequences $x_{n}$ and $y_{n}$, $x_{n}\gg y_{n}$ means that $x_{n}^{-1}y_{n}=o(1)$. Define the scaled penalty function by

ργ(t)=λ1pγ(t,λ).\rho_{\gamma}(t)=\lambda^{-1}p_{\gamma}(t,\lambda). (13)

Below are our assumptions, where (C1) and (C3) follow those in Ma et al., (2019).

  1. (C1)

The function $\rho_{\gamma}(t)$ is symmetric, non-decreasing, and concave on $[0,\infty)$. It is constant for $t\geq a\lambda$ for some constant $a>0$, and $\rho_{\gamma}(0)=0$. Also, $\rho_{\gamma}^{\prime}(t)$ exists and is continuous except for a finite number of values of $t$, and $\rho_{\gamma}^{\prime}(0+)=1$.

  2. (C2)

There exist finite positive constants $M_{1},M_{2},M_{3}>0$ such that $|x_{ih,l}|\leq M_{1}$ and $|z_{ih,l}|\leq M_{1}$ for $h=1,\dots,n_{i}$ and $i=1,\dots,n$, and $M_{2}\leq\max_{i}n_{i}/\min_{i}n_{i}\leq M_{3}$. Also, assume that $\lambda_{\text{min}}(\bm{U}^{T}\bm{\Omega}\bm{U})\geq C_{1}|\mathcal{G}_{\text{min}}|$ and $\lambda_{\text{max}}(\bm{U}^{T}\bm{\Omega}\bm{U})\leq C_{1}^{\prime}n$ for some constants $0<C_{1}<\infty$ and $0<C_{1}^{\prime}<\infty$, where $\lambda_{\text{min}}$ and $\lambda_{\text{max}}$ denote the minimum and maximum eigenvalues, respectively. In addition, assume that $\sup_{i,h}\|\bm{x}_{ih}\|\leq C_{2}\sqrt{p}$ and $\sup_{i,h}\|\bm{z}_{ih}\|\leq C_{3}\sqrt{q}$ for some constants $0<C_{2}<\infty$ and $0<C_{3}<\infty$.

  3. (C3)

    The random error vector ϵ=(ϵ11,,ϵ1n1,ϵ21,,ϵ2n2,,ϵn1,,ϵnnn)T\bm{\epsilon}=\left(\epsilon_{11},\dots,\epsilon_{1n_{1}},\epsilon_{21},\dots,\epsilon_{2n_{2}},\dots,\epsilon_{n1},\dots,\epsilon_{nn_{n}}\right)^{T} has sub-Gaussian tails such that P(|𝒂Tϵ|>𝒂x)2exp(c1x2)P\left(\left|\bm{a}^{T}\bm{\epsilon}\right|>\left\|\bm{a}\right\|x\right)\leq 2\exp\left(-c_{1}x^{2}\right) for any vector 𝒂m\bm{a}\in\mathbb{R}^{m} and x>0x>0, where 0<c1<0<c_{1}<\infty and m=i=1nnim=\sum_{i=1}^{n}n_{i}.

  4. (C4)

    The pairwise weights cijc_{ij}’s are bounded away from zero if ii and jj are in the same group.

Conditions (C1) and (C3) are commonly used in high-dimensional penalized regression problems and are also adopted in Ma et al. (2019). Condition (C2) is similar to the corresponding condition in Ma et al. (2019), and additionally includes boundedness conditions on the covariates, as used in Huang et al. (2004). In general, weight functions that are strictly positive, rather than supported only on a finite range, yield $c_{ij}$'s that satisfy condition (C4).

First, we establish the properties of the oracle estimator, which is defined as the weighted least squares estimator assuming that the underlying group structure is known. Specifically, the oracle estimator of (𝜼,𝜶)(\bm{\eta},\bm{\alpha}) is

(𝜼^or,𝜶^or)\displaystyle\left(\hat{\bm{\eta}}^{or},\hat{\bm{\alpha}}^{or}\right) =argmin𝜼q,𝜶Kp12𝛀1/2(𝒚𝒁𝜼𝑿𝑾𝜶)2\displaystyle=\operatorname*{arg\,min}_{\bm{\eta}\in\mathbb{R}^{q},\bm{\alpha}\in\mathbb{R}^{Kp}}\frac{1}{2}\left\|\bm{\Omega}^{1/2}\left(\bm{y}-\bm{Z}\bm{\eta}-\bm{X}\bm{W\alpha}\right)\right\|^{2}
=(𝑼T𝛀𝑼)1𝑼T𝛀𝒚.\displaystyle=\left(\bm{U}^{T}\bm{\Omega}\bm{U}\right)^{-1}\bm{U}^{T}\bm{\Omega}\bm{y}. (14)

The corresponding oracle estimator of $\bm{\beta}$ is $\hat{\bm{\beta}}^{or}=\bm{W}\hat{\bm{\alpha}}^{or}$. Let $\bm{\alpha}_{k}^{0}$ be the true coefficient vector for group $k$, $k=1,\dots,K$, let $\bm{\alpha}^{0}=((\bm{\alpha}_{1}^{0})^{T},\dots,(\bm{\alpha}_{K}^{0})^{T})^{T}$, and let $\bm{\eta}^{0}$ be the true common coefficient vector. The following theorem shows the properties of the oracle estimator; a short computational sketch of (14) is given first.
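As an illustration of (14), the oracle estimator is a single weighted least squares fit once the membership matrix $\bm{W}$ is formed from the true labels; a numpy sketch under our own naming conventions:

```python
import numpy as np
from scipy.linalg import block_diag

def oracle_estimator(y, Z, X_blocks, n_i, labels, K):
    """Weighted least squares (14) with the true partition known."""
    p = X_blocks[0].shape[1]
    X = block_diag(*X_blocks)
    W = np.kron(np.eye(K)[labels], np.eye(p))       # W = W_tilde kron I_p
    U = np.hstack([Z, X @ W])                       # U = (Z, X W)
    Omega = np.diag(np.repeat(1.0 / np.asarray(n_i, float), n_i))
    coef = np.linalg.solve(U.T @ Omega @ U, U.T @ Omega @ y)
    q = Z.shape[1]
    return coef[:q], coef[q:].reshape(K, p)         # (eta_or, alpha_or)
```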

Theorem 1.

Suppose

|𝒢min|(q+Kp)1/2max(nmininilogn,(q+Kp)1/2).\left|\mathcal{G}_{\min}\right|\gg\left(q+Kp\right)^{1/2}\max\left(\sqrt{\frac{n}{\min_{i}n_{i}}\log n},\left(q+Kp\right)^{1/2}\right).

Under conditions (C1)-(C3), q=o(n)q=o(n) and Kp=o(n)Kp=o(n), we have with probability at least 12(q+Kp)n11-2(q+Kp)n^{-1},

(𝜼^or𝜼0𝜶^or𝜶0)ϕn,\left\|\left(\begin{array}[]{c}\hat{\bm{\eta}}^{or}-\bm{\eta}^{0}\\ \hat{\bm{\alpha}}^{or}-\bm{\alpha}^{0}\end{array}\right)\right\|\leq\phi_{n},

and

𝜷^or𝜷0|𝒢max|ϕn;supi𝜷^ior𝜷i0ϕn,\left\|\hat{\bm{\beta}}^{or}-\bm{\beta}^{0}\right\|\leq\sqrt{\left|\mathcal{G}_{\max}\right|}\phi_{n};\,\sup_{i}\left\|\hat{\bm{\beta}}_{i}^{or}-\bm{\beta}_{i}^{0}\right\|\leq\phi_{n},

where

ϕn=c11/2C11M1q+Kp|𝒢min|1nminnilogn.\phi_{n}=c_{1}^{-1/2}C_{1}^{-1}M_{1}\sqrt{q+Kp}\left|\mathcal{G}_{\min}\right|^{-1}\sqrt{\frac{n}{\min n_{i}}\log n}.

Furthermore, for any vector 𝐚nq+Kp\bm{a}_{n}\in\mathbb{R}^{q+Kp}, we have as nn\rightarrow\infty

σn(𝒂n)1𝒂nT((𝜼^or𝜼0)T,(𝜶^or𝜶0)T)T𝑑N(0,1),\sigma_{n}(\bm{a}_{n})^{-1}\bm{a}_{n}^{T}\left(\left(\hat{\bm{\eta}}^{or}-\bm{\eta}^{0}\right)^{T},\left(\hat{\bm{\alpha}}^{or}-\bm{\alpha}^{0}\right)^{T}\right)^{T}\overset{d}{\rightarrow}N(0,1), (15)

where

σn(𝒂n)=σ[𝒂nT(𝑼T𝛀𝑼)1𝑼T𝛀𝛀𝑼(𝑼T𝛀𝑼)1𝒂n]1/2.\sigma_{n}(\bm{a}_{n})=\sigma\left[\bm{a}_{n}^{T}\left(\bm{U}^{T}\bm{\Omega}\bm{U}\right)^{-1}\bm{U}^{T}\bm{\Omega}\bm{\Omega}\bm{U}\left(\bm{U}^{T}\bm{\Omega}\bm{U}\right)^{-1}\bm{a}_{n}\right]^{1/2}. (16)
Remark 3.

We do not impose any specific assumptions on $n_{i}$. If $\min n_{i}\ll\frac{n}{q+Kp}\log n$ or $\min n_{i}=O\left(\frac{n}{q+Kp}\log n\right)$, the requirement becomes $|\mathcal{G}_{\min}|\gg(q+Kp)^{1/2}\sqrt{\frac{n}{\min_{i}n_{i}}\log n}$. If $\min n_{i}\gg\frac{n}{q+Kp}\log n$, the requirement becomes $|\mathcal{G}_{\min}|\gg q+Kp$. In this case, if $q$, $p$ and $K$ are fixed, all we need is $1/|\mathcal{G}_{\min}|=o(1)$.

Remark 4.

The model considered in Ma et al. (2019) is a special case of our model, and their condition is a special case of ours with $n_{i}=1$.

Remark 5.

If $|\mathcal{G}_{\min}|=\delta n/K$ for some constant $0<\delta\leq 1$, then

ϕn=c11/2C11M1δ1Kq+Kplogn/(nminni).\phi_{n}=c_{1}^{-1/2}C_{1}^{-1}M_{1}\delta^{-1}K\sqrt{q+Kp}\sqrt{\log n/(n\min n_{i})}.

Moreover, if $q$, $p$ and $K$ are fixed, then $\phi_{n}=C^{*}\sqrt{\log n/(n\min_{i} n_{i})}$ for some constant $0<C^{*}<\infty$.

Next, we study the properties of our proposed estimator. Let

b_{n}=\min_{i\in\mathcal{G}_{k},\,j\in\mathcal{G}_{k^{\prime}},\,k\neq k^{\prime}}\left\|\bm{\beta}_{i}^{0}-\bm{\beta}_{j}^{0}\right\|=\min_{k\neq k^{\prime}}\left\|\bm{\alpha}_{k}^{0}-\bm{\alpha}_{k^{\prime}}^{0}\right\| (17)

be the minimal difference among different groups.

Theorem 2.

Suppose the conditions of Theorem 1 hold and (C4) holds. If $b_{n}>a\lambda$ and $\lambda\gg\phi_{n}$ for some constant $a>0$, then there exists a local minimizer $\left(\hat{\bm{\eta}}(\lambda,\psi)^{T},\hat{\bm{\beta}}(\lambda,\psi)^{T}\right)^{T}$ of the objective function $Q_{n}(\bm{\eta},\bm{\beta})$ given in (2) such that

P((𝜼^(λ,ψ)T,𝜷^(λ,ψ)T)T=((𝜼^or)T,(𝜷^or)T)T)1.P\left(\left(\hat{\bm{\eta}}(\lambda,\psi)^{T},\hat{\bm{\beta}}(\lambda,\psi)^{T}\right)^{T}=\left((\hat{\bm{\eta}}^{or})^{T},(\hat{\bm{\beta}}^{or})^{T}\right)^{T}\right)\rightarrow 1. (18)
Remark 6.

Theorem 2 implies that the true group structure can be recovered with probability approaching 1. It also implies that the estimated number of groups $\hat{K}$ satisfies $P\left(\hat{K}(\lambda,\psi)=K\right)\rightarrow 1.$

Let 𝜶^(λ,ψ)\hat{\bm{\alpha}}(\lambda,\psi) be the distinct group vectors of 𝜷^(λ,ψ)\hat{\bm{\beta}}(\lambda,\psi). According to Theorem 1 and Theorem 2, we have the following result.

Corollary 1.

Suppose the conditions in Theorem 2 hold. Then for any vector $\bm{a}_{n}\in\mathbb{R}^{q+Kp}$, we have, as $n\rightarrow\infty$,

σn(𝒂n)1𝒂nT((𝜼^(λ,ψ)𝜼0)T,(𝜶^(λ,ψ)𝜶0)T)T𝑑N(0,1).\sigma_{n}(\bm{a}_{n})^{-1}\bm{a}_{n}^{T}\left(\left(\hat{\bm{\eta}}(\lambda,\psi)-\bm{\eta}^{0}\right)^{T},\left(\hat{\bm{\alpha}}(\lambda,\psi)-\bm{\alpha}^{0}\right)^{T}\right)^{T}\overset{d}{\rightarrow}N(0,1). (19)
Remark 7.

The variance parameter σ2\sigma^{2} can be estimated by

\hat{\sigma}^{2}=\frac{1}{m-q-\hat{K}p}\sum_{i=1}^{n}\sum_{h=1}^{n_{i}}\left(y_{ih}-\bm{z}_{ih}^{T}\hat{\bm{\eta}}-\bm{x}_{ih}^{T}\hat{\bm{\beta}}_{i}\right)^{2}, (20)
where $m=\sum_{i=1}^{n}n_{i}$ as defined in condition (C3).

The algorithm is implemented in the R package Spgr, available at https://github.com/wangx23/Spgr.

4 Simulation Studies

In this section, we evaluate and compare the performance of the proposed SHADE estimator with different weight choices: equal weights $c_{ij}=1$ (denoted as “equal”), weights defined in (4) (denoted as “reg_sp”), weights defined in (5) (denoted as “reg”), and weights defined in (6) (denoted as “sp”).

The simulations are carried out as follows. Let $\bm{z}_{ih}=(z_{ih,1},\dots,z_{ih,5})^{T}$ with $z_{ih,1}=1$, and $(z_{ih,2},\dots,z_{ih,5})^{T}$ generated from a multivariate normal distribution with mean 0, variance 1, and pairwise correlation $\rho=0.3$. Define $\bm{x}_{ih}=(x_{ih,1},x_{ih,2})^{T}$, where $x_{ih,1}$ is simulated from a standard normal distribution and $x_{ih,2}$ is simulated from a centered and standardized Binomial$(n,0.7)$ distribution. Let $\bm{\eta}=(\eta_{1},\dots,\eta_{5})^{T}$, where the $\eta_{k}$'s are simulated from Uniform$[1,2]$, and the standard deviation of the error term is $\sigma=0.5$. We set $\vartheta=1$ and $\gamma=3$ and use the SCAD penalty function. The tuning parameters are chosen by the modified BIC defined in (12). We consider several simulation scenarios; the results are based on 100 simulation replicates.
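For reproducibility, a sketch of this data-generating mechanism in Python is given below; we read the “centered and standardized binomial” covariate as a standardized Bernoulli(0.7) draw, which is an assumption on our part, and all names are our own.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(beta, n_i=10, rho=0.3, sigma=0.5):
    """Generate {(y, z, x)} from model (1) under the Section 4 design.
    beta: (n, 2) array of location-specific coefficients."""
    n = beta.shape[0]
    eta = rng.uniform(1, 2, size=5)
    Sigma = rho + (1 - rho) * np.eye(4)              # unit variance, corr 0.3
    data = []
    for i in range(n):
        z = np.hstack([np.ones((n_i, 1)),
                       rng.multivariate_normal(np.zeros(4), Sigma, size=n_i)])
        b = rng.binomial(1, 0.7, size=n_i)           # assumed Bernoulli(0.7)
        x = np.column_stack([rng.standard_normal(n_i),
                             (b - 0.7) / np.sqrt(0.7 * 0.3)])
        y = z @ eta + x @ beta[i] + sigma * rng.standard_normal(n_i)
        data.append((y, z, x))
    return data, eta
```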

To evaluate the subgrouping performance of the proposed method, we report the estimated group number $\hat{K}$, the adjusted Rand index (ARI) (Rand, 1971; Hubert and Arabie, 1985; Vinh et al., 2010), and the root mean square error (RMSE) for estimating $\bm{\beta}$. For $\hat{K}$ over the 100 simulations, we report its average (denoted by “mean”), its standard error in parentheses, and the percentage of replicates with $\hat{K}=K$ (denoted by “per”). The ARI measures the degree of agreement between two partitions, taking values between 0 and 1: the larger the ARI, the stronger the agreement. We report the average ARI across the 100 simulations along with its standard error in parentheses. To evaluate the estimation accuracy of $\bm{\beta}$, we also report the average RMSE

1ni=1n𝜷^i𝜷i2.\sqrt{\frac{1}{n}\sum_{i=1}^{n}\|\hat{\bm{\beta}}_{i}-\bm{\beta}_{i}\|^{2}}. (21)
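Both metrics are easy to compute once the estimated labels are in hand; a short sketch using scikit-learn's adjusted_rand_score (the function name `evaluate` is ours):

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

def evaluate(beta_hat, beta_true, labels_hat, labels_true):
    """ARI between the two partitions and the RMSE in (21)."""
    ari = adjusted_rand_score(labels_true, labels_hat)
    rmse = np.sqrt(np.mean(np.sum((beta_hat - beta_true) ** 2, axis=1)))
    return ari, rmse
```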

4.1 Balanced group

We assume that there are K=3K=3 true groups 𝒢1,𝒢2\mathcal{G}_{1},\mathcal{G}_{2} and 𝒢3\mathcal{G}_{3}. Consider the two spatial settings, for which the group parameters are respectively given by:

Setting 1: 𝜷i=(1,1)T\bm{\beta}_{i}=(1,1)^{T} if i𝒢1i\in\mathcal{G}_{1}; 𝜷i=(1.5,1.5)T\bm{\beta}_{i}=(1.5,1.5)^{T} if i𝒢2i\in\mathcal{G}_{2}; 𝜷i=(2,2)T\bm{\beta}_{i}=(2,2)^{T} if i𝒢3i\in\mathcal{G}_{3}.

Setting 2: 𝜷i=(1,1)T\bm{\beta}_{i}=(1,1)^{T} if i𝒢1i\in\mathcal{G}_{1}; 𝜷i=(1.25,1.25)T\bm{\beta}_{i}=(1.25,1.25)^{T} if i𝒢2i\in\mathcal{G}_{2}; 𝜷i=(1.5,1.5)T\bm{\beta}_{i}=(1.5,1.5)^{T} if i𝒢3i\in\mathcal{G}_{3}.

Under each setting, we simulate data on two sizes of regular lattice, a $7\times 7$ grid (left) and a $10\times 10$ grid (right), as shown in Figure 1. Furthermore, for the $7\times 7$ grid with $n_{i}=10$, we also use 10-fold cross-validation to select the tuning parameters: the repeated measures at each location are divided into 10 parts, the $j$th part of each location is combined to form the validation set, and the remaining observations form the training set. The spatial weights (6) are used in this case, and the results are labeled “cv” in the tables. Note that “reg_sp” and “reg” were not computed for the $10\times 10$ grid.

Figure 1: Two spatial settings in the simulation studies: (a) the $7\times 7$ grid; (b) the $10\times 10$ grid.

Results for Setting 1: Tables 1, 2 and 3 show the estimated number of groups and the ARI. Figures 2 and 3 plot the RMSE of the estimates obtained using different weight choices. After estimating the group structure, one can also re-estimate the parameters $\bm{\eta}$ and $\bm{\beta}$ by treating the estimated group structure as known; the results are denoted as “refit”. Based on the numerical results, we make the following observations.

First, we summarize the results for the $7\times 7$ grid. In all the considered scenarios, the spatially weighted penalty outperforms the non-weighted penalty (“equal”). The upper panels of Tables 1 and 2 and the left plot in Figure 2 suggest that, when the number of repeated measurements is relatively small (say, $n_{i}=10$), the weights “reg_sp” and “sp” perform similarly and are the best in terms of estimating $K$, recovering the true subgroup structure (large ARI), and estimating the regression coefficients (small RMSE); the weights “equal” and “reg” are much worse. The lower panels of Tables 1 and 2 and the right plot in Figure 2 show that, when the number of repeated measurements gets larger (say, $n_{i}=30$), all the methods improve and there is not much difference among them. Cross-validation works well in terms of ARI and RMSE, but it tends to over-estimate the number of groups $K$. This is because cross-validation focuses more on prediction accuracy: the coefficient estimates of some groups are close to the true coefficients, but they are not shrunk together. In addition, refitting the model does not appear to further improve the accuracy of estimating $\bm{\beta}$.

Table 1: Summary of the estimate $\hat{K}$ for Setting 1 under the $7\times 7$ grid.
equal reg_sp reg sp cv
$n_i=10$ mean 3.34(0.054) 3.15(0.039) 3.33(0.051) 3.13(0.034) 3.82(0.13)
per 0.69 0.86 0.69 0.87 0.56
$n_i=30$ mean 3.00(0) 3.00(0) 3.00(0) 3.00(0)
per 1.00 1.00 1.00 1.00
Table 2: Average ARI for Setting 1 under the $7\times 7$ grid.
equal reg_sp reg sp cv
$n_i=10$ 0.80(0.011) 0.92(0.008) 0.82(0.01) 0.92(0.007) 0.95(0.007)
$n_i=30$ 0.998(0.001) 0.999(0.0006) 0.998(0.001) 0.999(0.0006)
Figure 2: RMSE for Setting 1 under the $7\times 7$ grid.

Next, we summarize the results for the $10\times 10$ grid. In this case, we consider the equal and spatial weights only. Again, the spatially-weighted penalty outperforms the non-weighted penalty (“equal”). Table 3 and Figure 3 suggest that, when the number of repeated measurements is relatively small (say, $n_{i}=10$), “sp” performs much better than “equal” in terms of grouping and estimating the regression coefficients; for a larger number of repeated measurements (say, $n_{i}=30$), they perform similarly.

Table 3: Summary of $\hat{K}$ and average ARI for Setting 1 under the $10\times 10$ grid.
$\hat{K}$ ARI
equal sp equal sp
$n_i=10$ mean 3.59(0.073) 3.37(0.065) 0.70(0.009) 0.97(0.003)
per 0.53 0.71 - -
$n_i=30$ mean 3(0) 3(0) 0.996(0.001) 1.00(0)
per 1.00 1.00 - -
Figure 3: RMSE for Setting 1 under the $10\times 10$ grid.

Results for Setting 2: In this setting, the group differences become smaller. Tables 4 and 5 and Figure 4 summarize the results for the $7\times 7$ grid. For both values of $n_{i}$, the “sp” weights perform best in terms of estimating the number of groups ($\hat{K}$), recovering the true group structure (ARI), and estimating the regression coefficients. In contrast to Setting 1, when the difference among groups becomes smaller, the model with the spatial weights is superior to the other models even with $n_{i}=30$.

Table 4: Summary of $\hat{K}$ for Setting 2 under the $7\times 7$ grid.
equal reg_sp reg sp
$n_i=10$ mean 3.25(0.119) 3.01(0.093) 3.14(0.107) 2.88(0.067)
per 0.34 0.45 0.33 0.60
$n_i=30$ mean 2.70(0.046) 2.90(0.030) 2.76(0.043) 2.95(0.022)
per 0.70 0.90 0.76 0.95
Table 5: Average ARI for Setting 2 under the $7\times 7$ grid.
equal reg_sp reg sp
$n_i=10$ 0.32(0.011) 0.50(0.023) 0.33(0.01) 0.61(0.026)
$n_i=30$ 0.72(0.018) 0.86(0.015) 0.75(0.017) 0.90(0.012)
Figure 4: RMSE for Setting 2 under the $7\times 7$ grid.

Table 6 and Figure 5 show the results for the $10\times 10$ grid, where again we only compare the “equal” and “sp” weights. The results suggest conclusions similar to those for the $7\times 7$ grid: the model with the spatial weights is superior, producing larger ARI and smaller RMSE even with a large number of repeated measurements ($n_{i}=30$).

Table 6: Summary of $\hat{K}$ and average ARI for Setting 2 under the $10\times 10$ grid.
$\hat{K}$ ARI
equal sp equal sp
$n_i=10$ mean 3.82(0.146) 3.35(0.078) 0.32(0.009) 0.81(0.022)
per 0.32 0.62 - -
$n_i=30$ mean 3.10(0.060) 3.00(0.0) 0.79(0.012) 0.94(0.005)
per 0.64 1.00 - -
Figure 5: RMSE for Setting 2 under the $10\times 10$ grid.

4.2 Unbalanced group setting

Here we consider an unbalanced group setting, as shown in Figure 6. There are four groups, denoted $\mathcal{G}_{1},\mathcal{G}_{2},\mathcal{G}_{3}$ and $\mathcal{G}_{4}$; two groups have 9 subjects each and the other two have 41 subjects each. The group parameters are $\bm{\beta}_{i}=(1,1)^{T}$ if $i\in\mathcal{G}_{1}$, $\bm{\beta}_{i}=(1.5,1.5)^{T}$ if $i\in\mathcal{G}_{2}$, $\bm{\beta}_{i}=(2,2)^{T}$ if $i\in\mathcal{G}_{3}$, and $\bm{\beta}_{i}=(2.5,2.5)^{T}$ if $i\in\mathcal{G}_{4}$.

Figure 6: Unbalanced group setting.

Table 7 and Figure 7 summarize $\hat{K}$, the ARI, and the RMSE for $\bm{\beta}$ when the number of repeated measurements is $n_{i}=10$. Overall, “reg_sp” and “sp” perform better than the other two types of weights; in particular, “sp” performs slightly better than “reg_sp”. The results are consistent with those in the balanced cases. We expect that, when the group differences become smaller, “sp” would still outperform the other weights even when the number of repeated measurements is large.

Table 7: Summary of $\hat{K}$ and average ARI for the unbalanced setting with $n_i=10$.
equal reg_sp reg sp
$\hat{K}$ mean 4.58(0.093) 4.23(0.049) 5.17(0.011) 4.35(0.059)
per 0.57 0.80 0.30 0.71
ARI mean 0.62(0.010) 0.94(0.061) 0.67(0.009) 0.96(0.004)
Figure 7: RMSE for the unbalanced setting.

4.3 Random group setting

We also consider a setting without a spatially contiguous group structure: each location is assigned to one of the three groups with equal probability. Table 8 summarizes $\hat{K}$ and the ARI for Setting 1 under the $7\times 7$ grid with 10 repeated measures, and Table 9 does the same for Setting 2 under the $7\times 7$ grid with 30 repeated measures. Figure 8 shows the RMSE results for both cases. The different weights perform similarly, suggesting that even without prior information on the existence of spatial groups, the “sp” weights can still produce results comparable to equal weights.

Table 8: Summary of $\hat{K}$ and average ARI for Setting 1 under the $7\times 7$ grid with $n_i=10$.
equal reg_sp reg sp
$\hat{K}$ mean 3.42(0.064) 3.45(0.063) 3.40(0.059) 3.45(0.063)
per 0.66 0.62 0.65 0.62
ARI mean 0.78(0.011) 0.82(0.010) 0.81(0.010) 0.82(0.011)
Table 9: Summary of $\hat{K}$ and average ARI for Setting 2 under the $7\times 7$ grid with $n_i=30$.
equal reg_sp reg sp
$\hat{K}$ mean 2.77(0.045) 2.77(0.045) 2.83(0.040) 2.73(0.047)
per 0.75 0.75 0.81 0.71
ARI mean 0.74(0.015) 0.76(0.016) 0.77(0.014) 0.74(0.017)
Figure 8: RMSE for random groups under the $7\times 7$ grid.

5 Application

In this section, we apply the SHADE method to modeling the National Resources Inventory (NRI) survey data (https://www.nrcs.usda.gov/wps/portal/nrcs/main/national/technical/nra/nri/) for the purpose of small area estimation. The NRI survey is one of the largest longitudinal natural resource surveys in the U.S., and its national and state level estimates of the status and change of land cover, land use, and soil erosion have been used by numerous federal, state, and local agencies over the past few decades. In recent years, there has been an increasing demand for NRI to provide various county level estimates, including estimates of different land covers such as cropland, pasture land, urban, and forest. Due to the limitation of sample size, the uncertainty of the NRI direct county level estimates is usually too large for local stakeholders to make policy decisions. To make the county level estimates more useful, it is necessary to incorporate auxiliary information and an appropriate model to reduce the uncertainty of the estimates. One such source of auxiliary covariates is the Cropland Data Layer (CDL), which is based on the classification of satellite image pixels into several mutually exclusive and exhaustive land cover categories. In this section, we model the relationship between the NRI forest proportion and the CDL forest proportion across 48 states. In NRI, forests on federal land, such as national parks, are not included in the forest category. For states with more federal forest land, NRI estimates can be substantially smaller than CDL estimates. Therefore, different states could have different relationships between these two proportions.

The model we consider is

yih=β0,i+β1,ixih+ϵihy_{ih}=\beta_{0,i}+\beta_{1,i}x_{ih}+\epsilon_{ih} (22)

where $y_{ih}$ is the NRI forest proportion of the $h$th county in the $i$th state, $x_{ih}$ is the corresponding CDL forest proportion, and $\beta_{0,i}$ and $\beta_{1,i}$ are unknown coefficients. Both $x$ and $y$ are standardized. Instead of using the estimated linear regression coefficients directly as initial values, we use five sets of initial values simulated from a multivariate normal distribution with the estimated coefficients as the mean vector and the estimated covariance matrix as the covariance matrix. The models with the smallest modified BIC values are selected for equal weights and spatial weights, respectively.
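A sketch of how such initial values could be produced is given below: per-state OLS fits of (22), with the five perturbed initial sets drawn from a normal distribution centered at the OLS estimates. All function names are hypothetical, and using the standard OLS covariance estimate is an assumption on our part.

```python
import numpy as np

def state_initial_values(states, rng, n_sets=5):
    """states: dict mapping state -> (y, x) arrays of standardized county
    forest proportions. Returns n_sets perturbed (beta0, beta1) per state."""
    inits = {}
    for s, (y, x) in states.items():
        X = np.column_stack([np.ones_like(x), x])    # intercept + CDL proportion
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ coef
        dof = max(len(y) - 2, 1)
        cov = (resid @ resid / dof) * np.linalg.inv(X.T @ X)  # OLS covariance
        inits[s] = rng.multivariate_normal(coef, cov, size=n_sets)
    return inits
```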

Figure 9 shows the estimated groups based on the 2011 NRI data. The left panel plots the estimated groups based on equal weights, and the right panel plots those based on the spatial weights in (6). The two weight choices give different estimated groups. Tables 10 and 11 report the corresponding estimates of the regression coefficients in the different groups.

Figure 9: Estimated groups based on (a) equal weights and (b) spatial weights.
Table 10: Estimated coefficients of different groups for equal weights.
group 1 2
$\beta_0$ -0.029(0.006) 0.003(0.008)
$\beta_1$ 0.885(0.011) 0.241(0.026)
Table 11: Estimated coefficients of different groups for spatial weights.
group 1 2 3 4 5 6 7
$\beta_0$ -0.041(0.016) -0.032(0.006) 0.003(0.007) 0.023(0.015) -0.108(0.293) 0.275(0.038) 0.376(0.309)
$\beta_1$ 1.018(0.028) 0.867(0.012) 0.241(0.024) 0.608(0.033) 1.148(0.377) 0.332(0.064) 0.341(0.384)

With equal weights, $\lambda$ is the only tuning parameter in the algorithm, and changing its value changes the number of groups. We therefore adjust $\lambda$ in the equal-weights algorithm so that the number of groups matches that selected under the spatial weights, namely 7 groups. Figure 10 shows the resulting group structure. In both Figure 10 and the left panel of Figure 9, “WA”, “OR” and “CA” are not separated from the majority group (the group with the largest size) under equal weights. Under spatial weights, these three states form group 4, which is separated from the majority group (group 2); this is more reasonable and intuitive given the estimated regression coefficients in Table 11. Moreover, these three states have more national parks than the states in group 2.

Figure 10: Estimated groups by changing the tuning parameter $\lambda$ with equal weights.

Alternatively, we also apply $K$-means clustering to the initial estimates to identify states with similar behavior. Figure 11 shows the maps based on 2-means and 7-means clustering, respectively. The 2-cluster map is almost the same as the map based on equal weights, while the 7-cluster map is much less interpretable than the result based on spatial weights. This suggests that the proposed procedure can produce more interpretable subgroup structures than $K$-means clustering.

Figure 11: Group clustering results based on $K$-means.

6 Discussion

In this article, we consider the problem of spatially clustering local covariate effects and develop a general framework, Spatial Heterogeneity Automatic Detection and Estimation (SHADE), for spatial areal data with repeated measures. Since locations near each other usually exhibit similar patterns, we incorporate spatial information into the pairwise penalty, assigning larger weights to closer locations to encourage stronger shrinkage. In the simulation study, we investigate and compare the performance of the procedure using different weights and find that spatial information helps improve grouping accuracy, especially when the minimal group difference is small or the number of repeated measures is small. We also establish theoretical properties of the proposed estimator, including its consistency in estimating the number of groups.

In the real data example, we treated states as locations and counties as repeated measures. Alternatively, one could treat counties as the individual units, since counties within a single state may follow different regression relationships. The algorithm would then require inverting a matrix of dimension greater than 3,000, which carries a substantially higher computational burden. A further study is needed to compare these two modeling choices for this application.

The proposed method does not account for spatial dependence in the regression errors when constructing the objective function. The basic idea of the algorithm can be extended to a general spatial clustering setup with spatially dependent errors. Specifically, the weighted least squares term in the objective function would be replaced by a generalized least squares term involving an estimated covariance matrix. The resulting algorithm would iterate between two steps: updating the regression coefficients to find clusters, and updating the covariance parameters. Additional simulation studies are needed to assess the performance of this two-step algorithm, and its theoretical properties need to be established. Both the theoretical and computational aspects of such an extension are nontrivial and will be considered in follow-up work.
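A minimal sketch of the two-step iteration described above is given here, assuming generic update routines `update_coef` (a penalized generalized least squares coefficient update) and `update_cov` (a covariance-parameter update); both are hypothetical placeholders, since the extension is left to future work.

```python
import numpy as np

def two_step_shade(y, design, weights, lam, update_coef, update_cov,
                   max_iter=50, tol=1e-6):
    """Alternate between (i) a penalized generalized least squares update of
    the regression coefficients and (ii) an update of the error covariance."""
    Sigma = np.eye(len(y))           # independence working covariance to start
    beta_old = None
    for _ in range(max_iter):
        # Step 1: cluster-and-estimate coefficients given the current covariance
        eta, beta = update_coef(y, design, weights, lam, Sigma)
        # Step 2: re-estimate covariance parameters from the current residuals
        Sigma = update_cov(y, design, eta, beta)
        if beta_old is not None and np.max(np.abs(beta - beta_old)) < tol:
            break
        beta_old = beta
    return eta, beta, Sigma
```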

Appendices

A Proof of Theorem 1

In this section, we prove Theorem 1. In the proof of the central limit theorem (CLT), we use the technique of Huang et al. (2004).

The oracle estimator is defined in (14) and has the form

\left(\begin{array}{c}\hat{\bm{\eta}}^{or}\\ \hat{\bm{\alpha}}^{or}\end{array}\right)=\left(\bm{U}^{T}\bm{\Omega}\bm{U}\right)^{-1}\bm{U}^{T}\bm{\Omega}\bm{y}.

Thus, we have

\left(\begin{array}{c}\hat{\bm{\eta}}^{or}-\bm{\eta}^{0}\\ \hat{\bm{\alpha}}^{or}-\bm{\alpha}^{0}\end{array}\right)=\left(\bm{U}^{T}\bm{\Omega}\bm{U}\right)^{-1}\bm{U}^{T}\bm{\Omega}\bm{\epsilon},

where $\bm{\epsilon}=(\bm{\epsilon}_{1}^{T},\dots,\bm{\epsilon}_{n}^{T})^{T}$ with $\bm{\epsilon}_{i}=(\epsilon_{i1},\dots,\epsilon_{i,n_{i}})^{T}$. Therefore,

\left\|\left(\begin{array}{c}\hat{\bm{\eta}}^{or}-\bm{\eta}^{0}\\ \hat{\bm{\alpha}}^{or}-\bm{\alpha}^{0}\end{array}\right)\right\|\leq\left\|\left(\bm{U}^{T}\bm{\Omega}\bm{U}\right)^{-1}\right\|_{2}\left\|\bm{U}^{T}\bm{\Omega}\bm{\epsilon}\right\|, (23)

where $\|\cdot\|_{2}$ is the matrix operator norm, defined for a matrix $\bm{A}$ by $\|\bm{A}\|_{2}=\sup_{\|\bm{x}\|=1}\|\bm{A}\bm{x}\|$.
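As a purely numerical aside (not part of the proof), this definition of $\|\cdot\|_{2}$ coincides with the largest singular value; the snippet below, assuming numpy is available, approximates the supremum over random unit vectors and compares it with the SVD value.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))

# ||A||_2 via the definition sup_{||x||=1} ||Ax||, approximated over random unit vectors
xs = rng.normal(size=(3, 100000))
xs /= np.linalg.norm(xs, axis=0)
sup_def = np.linalg.norm(A @ xs, axis=0).max()

# ||A||_2 as the largest singular value
sigma_max = np.linalg.svd(A, compute_uv=False)[0]

print(sup_def, sigma_max)  # the two values agree up to Monte Carlo error
```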

We know that

P\left(\left\|\bm{U}^{T}\bm{\Omega}\bm{\epsilon}\right\|_{\infty}>C\sqrt{\frac{n}{\min n_{i}}\log n}\right)\leq P\left(\left\|\left(\bm{X}\bm{W}\right)^{T}\bm{\Omega}\bm{\epsilon}\right\|_{\infty}>C\sqrt{\frac{n}{\min n_{i}}\log n}\right)+P\left(\left\|\bm{Z}^{T}\bm{\Omega}\bm{\epsilon}\right\|_{\infty}>C\sqrt{\frac{n}{\min n_{i}}\log n}\right), (24)

where $C$ is a finite positive constant and $\|\cdot\|_{\infty}$ is defined, for a vector $\bm{x}\in\mathbb{R}^{m}$, by $\|\bm{x}\|_{\infty}=\max_{1\leq i\leq m}|x_{i}|$. By condition (C2), we have

\sqrt{\sum_{i=1}^{n}\sum_{h=1}^{n_{i}}\frac{x_{ih,l}^{2}}{n_{i}^{2}}1\left\{i\in\mathcal{G}_{k}\right\}}\leq M_{1}\sqrt{\sum_{i=1}^{n}\frac{1}{n_{i}}1\left\{i\in\mathcal{G}_{k}\right\}}\leq M_{1}\sqrt{\sum_{i=1}^{n}\frac{1}{n_{i}}}\leq M_{1}\sqrt{\frac{n}{\min n_{i}}}.

Since

\left\|\left(\bm{X}\bm{W}\right)^{T}\bm{\Omega}\bm{\epsilon}\right\|_{\infty}=\sup_{k,l}\left|\sum_{i=1}^{n}\frac{1}{n_{i}}\sum_{h=1}^{n_{i}}x_{ih,l}\epsilon_{ih}1\left\{i\in\mathcal{G}_{k}\right\}\right|,

from condition (C3), it follows that

P\left(\left\|\left(\bm{X}\bm{W}\right)^{T}\bm{\Omega}\bm{\epsilon}\right\|_{\infty}>C\sqrt{\frac{n}{\min n_{i}}\log n}\right)
\leq \sum_{l=1}^{p}\sum_{k=1}^{K}P\left(\left|\sum_{i=1}^{n}\sum_{h=1}^{n_{i}}\frac{1}{n_{i}}x_{ih,l}\epsilon_{ih}1\left\{i\in\mathcal{G}_{k}\right\}\right|>C\sqrt{\frac{n}{\min n_{i}}\log n}\right)
\leq \sum_{l=1}^{p}\sum_{k=1}^{K}P\left(\left|\sum_{i=1}^{n}\sum_{h=1}^{n_{i}}\frac{1}{n_{i}}x_{ih,l}\epsilon_{ih}1\left\{i\in\mathcal{G}_{k}\right\}\right|>\sqrt{\sum_{i=1}^{n}\sum_{h=1}^{n_{i}}\frac{x_{ih,l}^{2}}{n_{i}^{2}}1\left\{i\in\mathcal{G}_{k}\right\}}\,\frac{C}{M_{1}}\sqrt{\log n}\right)
\leq 2Kp\exp\left(-c_{1}\frac{C^{2}}{M_{1}^{2}}\log n\right)=2Kpn^{-c_{1}C^{2}/M_{1}^{2}},

where the second inequality uses the bound in the preceding display.

Similarly, $\sum_{i=1}^{n}\sum_{h=1}^{n_{i}}\frac{z_{ih,l}^{2}}{n_{i}^{2}}\leq M_{1}^{2}\sum_{i=1}^{n}1/n_{i}\leq M_{1}^{2}\frac{n}{\min n_{i}}$. Again, by condition (C3), we have

P\left(\left\|\bm{Z}^{T}\bm{\Omega}\bm{\epsilon}\right\|_{\infty}>C\sqrt{\frac{n}{\min n_{i}}\log n}\right)
\leq \sum_{l=1}^{q}P\left(\left|\sum_{i=1}^{n}\sum_{h=1}^{n_{i}}\frac{1}{n_{i}}z_{ih,l}\epsilon_{ih}\right|>C\sqrt{\frac{n}{\min n_{i}}\log n}\right)
\leq \sum_{l=1}^{q}P\left(\left|\sum_{i=1}^{n}\sum_{h=1}^{n_{i}}\frac{1}{n_{i}}z_{ih,l}\epsilon_{ih}\right|>\sqrt{\sum_{i=1}^{n}\sum_{h=1}^{n_{i}}\frac{z_{ih,l}^{2}}{n_{i}^{2}}}\,\frac{C}{M_{1}}\sqrt{\log n}\right)
\leq 2q\exp\left(-c_{1}\frac{C^{2}}{M_{1}^{2}}\log n\right)=2qn^{-c_{1}C^{2}/M_{1}^{2}}.

Thus, (24) can be bounded by

P\left(\left\|\bm{U}^{T}\bm{\Omega}\bm{\epsilon}\right\|_{\infty}>C\sqrt{\frac{n}{\min n_{i}}\log n}\right)\leq 2\left(Kp+q\right)n^{-c_{1}C^{2}/M_{1}^{2}}.

Since $\left\|\bm{U}^{T}\bm{\Omega}\bm{\epsilon}\right\|\leq\sqrt{q+Kp}\left\|\bm{U}^{T}\bm{\Omega}\bm{\epsilon}\right\|_{\infty}$, we have

P\left(\left\|\bm{U}^{T}\bm{\Omega}\bm{\epsilon}\right\|>C\sqrt{q+Kp}\sqrt{\frac{n}{\min n_{i}}\log n}\right)\leq 2\left(Kp+q\right)n^{-c_{1}C^{2}/M_{1}^{2}}.

Taking $C=c_{1}^{-1/2}M_{1}$, we obtain

P\left(\left\|\bm{U}^{T}\bm{\Omega}\bm{\epsilon}\right\|>C\sqrt{q+Kp}\sqrt{\frac{n}{\min n_{i}}\log n}\right)\leq 2\left(Kp+q\right)n^{-1}. (25)

Also, according to condition (C2), we have

\left\|\left(\bm{U}^{T}\bm{\Omega}\bm{U}\right)^{-1}\right\|_{2}\leq C_{1}^{-1}\left|\mathcal{G}_{\min}\right|^{-1}. (26)

Combining (23), (25), and (26), with probability at least $1-2\left(Kp+q\right)n^{-1}$, we have

\left\|\left(\begin{array}{c}\hat{\bm{\eta}}^{or}-\bm{\eta}^{0}\\ \hat{\bm{\alpha}}^{or}-\bm{\alpha}^{0}\end{array}\right)\right\|\leq CC_{1}^{-1}\sqrt{q+Kp}\left|\mathcal{G}_{\min}\right|^{-1}\sqrt{\frac{n}{\min n_{i}}\log n}.

Let

\phi_{n}=c_{1}^{-1/2}C_{1}^{-1}M_{1}\sqrt{q+Kp}\left|\mathcal{G}_{\min}\right|^{-1}\sqrt{\frac{n}{\min n_{i}}\log n}.

Furthermore,

\left\|\hat{\bm{\beta}}^{or}-\bm{\beta}^{0}\right\|^{2}=\sum_{k=1}^{K}\sum_{i\in\mathcal{G}_{k}}\left\|\hat{\bm{\alpha}}_{k}^{or}-\bm{\alpha}_{k}^{0}\right\|^{2}\leq\left|\mathcal{G}_{\max}\right|\sum_{k=1}^{K}\left\|\hat{\bm{\alpha}}_{k}^{or}-\bm{\alpha}_{k}^{0}\right\|^{2}=\left|\mathcal{G}_{\max}\right|\left\|\hat{\bm{\alpha}}^{or}-\bm{\alpha}^{0}\right\|^{2}\leq\left|\mathcal{G}_{\max}\right|\phi_{n}^{2},

and

\sup_{i}\left\|\hat{\bm{\beta}}_{i}^{or}-\bm{\beta}_{i}^{0}\right\|=\sup_{k}\left\|\hat{\bm{\alpha}}_{k}^{or}-\bm{\alpha}_{k}^{0}\right\|\leq\left\|\hat{\bm{\alpha}}^{or}-\bm{\alpha}^{0}\right\|\leq\phi_{n}.

Next, we consider the central limit theorem. Let $\bm{U}=\left(\bm{U}_{1}^{T},\dots,\bm{U}_{n}^{T}\right)^{T}$ with $\bm{U}_{i}=(\bm{U}_{i1},\dots,\bm{U}_{i,n_{i}})^{T}$ for $i=1,\dots,n$. Consider

\bm{a}_{n}^{T}\left(\left(\hat{\bm{\eta}}^{or}-\bm{\eta}^{0}\right)^{T},\left(\hat{\bm{\alpha}}^{or}-\bm{\alpha}^{0}\right)^{T}\right)^{T}=\sum_{i=1}^{n}\bm{a}_{n}^{T}\left(\sum_{j=1}^{n}\bm{U}_{j}^{T}\bm{\Omega}_{j}\bm{U}_{j}\right)^{-1}\bm{U}_{i}^{T}\bm{\Omega}_{i}\bm{\epsilon}_{i},

where $\bm{\Omega}_{i}=n_{i}^{-1}\bm{I}_{n_{i}}$. By the assumption on $\bm{\epsilon}_{i}$ in model (1), we have

E\left[\bm{a}_{n}^{T}\left(\left(\hat{\bm{\eta}}^{or}-\bm{\eta}^{0}\right)^{T},\left(\hat{\bm{\alpha}}^{or}-\bm{\alpha}^{0}\right)^{T}\right)^{T}\right]=0.

The variance of $\bm{a}_{n}^{T}\left(\left(\hat{\bm{\eta}}^{or}-\bm{\eta}^{0}\right)^{T},\left(\hat{\bm{\alpha}}^{or}-\bm{\alpha}^{0}\right)^{T}\right)^{T}$ can be written as

\mathrm{Var}\left\{\bm{a}_{n}^{T}\left(\left(\hat{\bm{\eta}}^{or}-\bm{\eta}^{0}\right)^{T},\left(\hat{\bm{\alpha}}^{or}-\bm{\alpha}^{0}\right)^{T}\right)^{T}\right\}=\sigma^{2}\bm{a}_{n}^{T}\left(\bm{U}^{T}\bm{\Omega}\bm{U}\right)^{-1}\bm{U}^{T}\bm{\Omega}\bm{\Omega}\bm{U}\left(\bm{U}^{T}\bm{\Omega}\bm{U}\right)^{-1}\bm{a}_{n}=\sigma^{2}\bm{a}_{n}^{T}\left(\bm{U}^{T}\bm{\Omega}\bm{U}\right)^{-1}\sum_{i=1}^{n}\bm{U}_{i}^{T}\bm{\Omega}_{i}\bm{\Omega}_{i}\bm{U}_{i}\left(\bm{U}^{T}\bm{\Omega}\bm{U}\right)^{-1}\bm{a}_{n}.

Following the proof of Theorem 3 in Huang et al. (2004), $\bm{a}_{n}^{T}\left(\left(\hat{\bm{\eta}}^{or}-\bm{\eta}^{0}\right)^{T},\left(\hat{\bm{\alpha}}^{or}-\bm{\alpha}^{0}\right)^{T}\right)^{T}$ can be written as $\sum_{i=1}^{n}a_{i}\xi_{i}$ with

a_{i}^{2}=\bm{a}_{n}^{T}\left(\bm{U}^{T}\bm{\Omega}\bm{U}\right)^{-1}\bm{U}_{i}^{T}\bm{\Omega}_{i}\bm{\Omega}_{i}\bm{U}_{i}\left(\bm{U}^{T}\bm{\Omega}\bm{U}\right)^{-1}\bm{a}_{n},

where the $\xi_{i}$'s are independent with mean zero and variance one. If

\frac{\max_{i}a_{i}^{2}}{\sum_{i=1}^{n}a_{i}^{2}}\rightarrow 0,

then $\sum_{i=1}^{n}a_{i}\xi_{i}/\sqrt{\sum_{i=1}^{n}a_{i}^{2}}$ is asymptotically $N\left(0,1\right)$.

For any $\bm{\lambda}=\left(\lambda_{1},\dots,\lambda_{q+Kp}\right)^{T}$, we have

\bm{\lambda}^{T}\bm{U}_{i}^{T}\bm{\Omega}_{i}\bm{\Omega}_{i}\bm{U}_{i}\bm{\lambda}=\frac{1}{n_{i}^{2}}\bm{\lambda}^{T}\bm{U}_{i}^{T}\bm{U}_{i}\bm{\lambda}=\frac{1}{n_{i}^{2}}\sum_{h=1}^{n_{i}}\bm{\lambda}^{T}\bm{U}_{ih}\bm{U}_{ih}^{T}\bm{\lambda}=\frac{1}{n_{i}^{2}}\sum_{h=1}^{n_{i}}\left(\sum_{l=1}^{q+Kp}U_{ih,l}\lambda_{l}\right)^{2}\leq\frac{1}{n_{i}^{2}}\sum_{h=1}^{n_{i}}\left(\sum_{l=1}^{q+Kp}U_{ih,l}^{2}\right)\left(\sum_{l=1}^{q+Kp}\lambda_{l}^{2}\right)\leq\frac{M_{1}^{2}}{n_{i}}\left(q+Kp\right)\left\|\bm{\lambda}\right\|^{2},

and

\bm{\lambda}^{T}\left(\sum_{i=1}^{n}\bm{U}_{i}^{T}\bm{\Omega}_{i}\bm{\Omega}_{i}\bm{U}_{i}\right)\bm{\lambda}\geq\frac{1}{\max_{i}n_{i}}\bm{\lambda}^{T}\left(\sum_{i=1}^{n}\bm{U}_{i}^{T}\bm{\Omega}_{i}\bm{U}_{i}\right)\bm{\lambda}=\frac{1}{\max_{i}n_{i}}\bm{\lambda}^{T}\bm{U}^{T}\bm{\Omega}\bm{U}\bm{\lambda}\geq\frac{1}{\max_{i}n_{i}}C_{1}\left|\mathcal{G}_{\min}\right|\left\|\bm{\lambda}\right\|^{2},

where the last inequality is by condition (C2). So,

\frac{\max_{i}\bm{\lambda}^{T}\bm{U}_{i}^{T}\bm{\Omega}_{i}\bm{\Omega}_{i}\bm{U}_{i}\bm{\lambda}}{\bm{\lambda}^{T}\left(\sum_{i=1}^{n}\bm{U}_{i}^{T}\bm{\Omega}_{i}\bm{\Omega}_{i}\bm{U}_{i}\right)\bm{\lambda}}\leq\left(\max_{i}n_{i}\right)\left(\max_{i}\frac{1}{n_{i}}\right)M_{1}^{2}C_{1}^{-1}\left|\mathcal{G}_{\min}\right|^{-1}\left(q+Kp\right)=M_{1}^{2}C_{1}^{-1}\frac{\max_{i}n_{i}}{\min_{i}n_{i}}\left|\mathcal{G}_{\min}\right|^{-1}\left(q+Kp\right)\rightarrow 0, (27)

by assumption.

By (27), $\max_{i}a_{i}^{2}/\sum_{i=1}^{n}a_{i}^{2}\rightarrow 0$, and the asymptotic normality in (15) follows.
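As a numerical aside (not part of the proof), the following Python sketch illustrates the Lindeberg-type condition above: it simulates $\sum_{i}a_{i}\xi_{i}/\sqrt{\sum_{i}a_{i}^{2}}$ for non-Gaussian $\xi_{i}$ and weights whose maximal contribution vanishes; the specific weight construction is an assumption chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 2000, 5000

# weights with max_i a_i^2 / sum_i a_i^2 -> 0 as n grows (illustrative choice)
a = 1.0 / np.sqrt(np.arange(1, n + 1))
print("max a_i^2 / sum a_i^2:", (a**2).max() / (a**2).sum())

# xi_i: independent, mean zero, variance one (here: centered exponentials)
xi = rng.exponential(size=(reps, n)) - 1.0
stats = (xi * a).sum(axis=1) / np.sqrt((a**2).sum())

print("sample mean:", stats.mean())   # approximately 0
print("sample var:", stats.var())     # approximately 1
# the empirical distribution of `stats` is close to N(0, 1)
```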

B Proof of Theorem 2

In this section, we prove Theorem 2. As in Ma et al. (2019) and Ma and Huang (2017), we define $T:\,\mathcal{M}_{\mathcal{G}}\rightarrow\mathbb{R}^{Kp}$ to be the mapping such that $T\left(\bm{\beta}\right)=\bm{\alpha}$, and $T^{*}:\,\mathbb{R}^{np}\rightarrow\mathbb{R}^{Kp}$ to be the mapping such that $T^{*}\left(\bm{\beta}\right)=\left(\left|\mathcal{G}_{k}\right|^{-1}\sum_{i\in\mathcal{G}_{k}}\bm{\beta}_{i}^{T},\,k=1,\dots,K\right)^{T}$.
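As a concrete illustration of the mapping $T^{*}$ (a minimal sketch under an assumed partition, with made-up coefficient values), the following computes the group-averaged coefficient vectors:

```python
import numpy as np

# n = 5 locations, p = 2 local coefficients per location (illustrative values)
beta = np.array([[1.0, 0.5],
                 [1.1, 0.4],
                 [3.0, 2.0],
                 [2.9, 2.1],
                 [3.1, 1.9]])
groups = [np.array([0, 1]), np.array([2, 3, 4])]  # assumed partition G_1, G_2

# T*(beta): stack the within-group averages |G_k|^{-1} sum_{i in G_k} beta_i
alpha = np.concatenate([beta[g].mean(axis=0) for g in groups])
print(alpha)  # [1.05 0.45 3.   2.  ]
```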

Consider the following neighborhood of $\left(\bm{\eta}^{0},\bm{\beta}^{0}\right)$:

\Theta=\left\{\bm{\eta}\in\mathbb{R}^{q},\bm{\beta}\in\mathbb{R}^{np}:\,\left\|\bm{\eta}-\bm{\eta}^{0}\right\|\leq\phi_{n},\ \sup_{i}\left\|\bm{\beta}_{i}-\bm{\beta}_{i}^{0}\right\|\leq\phi_{n}\right\}.

By Theorem 1, there exists an event $E_{1}$, on which $\left\|\hat{\bm{\eta}}^{or}-\bm{\eta}^{0}\right\|\leq\phi_{n}$ and $\sup_{i}\left\|\hat{\bm{\beta}}_{i}^{or}-\bm{\beta}_{i}^{0}\right\|\leq\phi_{n}$, such that $P\left(E_{1}\right)\geq 1-2\left(q+Kp\right)n^{-1}$.

Recall that the objective function to be minimized is given in (2), which has the form

Q_{n}\left(\bm{\eta},\bm{\beta};\lambda,\psi\right)=\frac{1}{2}\sum_{i=1}^{n}\frac{1}{n_{i}}\sum_{h=1}^{n_{i}}\left(y_{ih}-\bm{z}_{ih}^{T}\bm{\eta}-\bm{x}_{ih}^{T}\bm{\beta}_{i}\right)^{2}+\sum_{1\leq i<j\leq n}p_{\gamma}\left(\left\|\bm{\beta}_{i}-\bm{\beta}_{j}\right\|,c_{ij}\lambda\right). (28)

We show that $\left((\hat{\bm{\eta}}^{or})^{T},(\hat{\bm{\beta}}^{or})^{T}\right)^{T}$ is a strict local minimizer of this objective function with probability approaching one, in two steps, as in Ma et al. (2019). The first step is to show that, on the event $E_{1}$, $Q_{n}(\bm{\eta},\bm{\beta}^{*})>Q_{n}(\hat{\bm{\eta}}^{or},\hat{\bm{\beta}}^{or})$ for any $(\bm{\eta}^{T},\bm{\beta}^{T})^{T}\in\Theta$ with $(\bm{\eta}^{T},\bm{\beta}^{*T})^{T}\neq((\hat{\bm{\eta}}^{or})^{T},(\hat{\bm{\beta}}^{or})^{T})^{T}$, where $\bm{\beta}^{*}=T^{-1}\left(T^{*}\left(\bm{\beta}\right)\right)$ and $\bm{\beta}\in\mathbb{R}^{np}$. The proof of this step is almost identical to the first step in Ma et al. (2019) and is omitted here.

In the second step, we show that there exists an event $E_{2}$ with $P\left(E_{2}\right)\geq 1-2n^{-1}$ such that, on $E_{1}\cap E_{2}$, there is a neighborhood $\Theta_{n}$ of $\left((\hat{\bm{\eta}}^{or})^{T},(\hat{\bm{\beta}}^{or})^{T}\right)^{T}$ in which $Q_{n}\left(\bm{\eta},\bm{\beta}\right)\geq Q_{n}\left(\bm{\eta},\bm{\beta}^{*}\right)$ for any $\left(\bm{\eta}^{T},\bm{\beta}^{T}\right)^{T}\in\Theta_{n}\cap\Theta$, for sufficiently large $n$.

Let $\Theta_{n}=\left\{\bm{\beta}:\sup_{i}\left\|\bm{\beta}_{i}-\hat{\bm{\beta}}_{i}^{or}\right\|\leq t_{n}\right\}$, where $t_{n}$ is a positive sequence with $t_{n}=o(1)$. By Taylor's expansion, for $\left(\bm{\eta}^{T},\bm{\beta}^{T}\right)^{T}\in\Theta_{n}\cap\Theta$,

Q_{n}\left(\bm{\eta},\bm{\beta}\right)-Q_{n}\left(\bm{\eta},\bm{\beta}^{*}\right)=\Gamma_{1}+\Gamma_{2}, (29)

where

\Gamma_{1}=-\left(\bm{y}-\bm{Z}\bm{\eta}-\bm{X}\bm{\beta}^{m}\right)^{T}\bm{\Omega}\bm{X}\left(\bm{\beta}-\bm{\beta}^{*}\right),
\Gamma_{2}=\sum_{i=1}^{n}\frac{\partial\left[\lambda\sum_{l<j}c_{lj}\rho_{\gamma}\left(\left\|\bm{\beta}_{l}^{m}-\bm{\beta}_{j}^{m}\right\|\right)\right]}{\partial\bm{\beta}_{i}^{T}}\left(\bm{\beta}_{i}-\bm{\beta}_{i}^{*}\right),

with $\bm{\beta}^{m}=\alpha\bm{\beta}+\left(1-\alpha\right)\bm{\beta}^{*}$ for some constant $\alpha\in\left(0,1\right)$.

Expanding the derivative, $\Gamma_{2}$ can be written as

\Gamma_{2}=\lambda\sum_{i<j}c_{ij}\rho_{\gamma}^{\prime}\left(\left\|\bm{\beta}_{i}^{m}-\bm{\beta}_{j}^{m}\right\|\right)\left\|\bm{\beta}_{i}^{m}-\bm{\beta}_{j}^{m}\right\|^{-1}\left(\bm{\beta}_{i}^{m}-\bm{\beta}_{j}^{m}\right)^{T}\left\{\left(\bm{\beta}_{i}-\bm{\beta}_{i}^{*}\right)-\left(\bm{\beta}_{j}-\bm{\beta}_{j}^{*}\right)\right\}.

For $i,j\in\mathcal{G}_{k}$, we have $\bm{\beta}_{i}^{*}=\bm{\beta}_{j}^{*}$ and $\bm{\beta}_{i}^{m}-\bm{\beta}_{j}^{m}=\alpha\left(\bm{\beta}_{i}-\bm{\beta}_{j}\right)$, so that

\Gamma_{2}=\lambda\sum_{k=1}^{K}\sum_{\left\{i,j\in\mathcal{G}_{k},i<j\right\}}c_{ij}\rho_{\gamma}^{\prime}\left(\left\|\bm{\beta}_{i}^{m}-\bm{\beta}_{j}^{m}\right\|\right)\left\|\bm{\beta}_{i}^{m}-\bm{\beta}_{j}^{m}\right\|^{-1}\left(\bm{\beta}_{i}^{m}-\bm{\beta}_{j}^{m}\right)^{T}\left(\bm{\beta}_{i}-\bm{\beta}_{j}\right)
+\lambda\sum_{k\neq k^{\prime}}\sum_{\left\{i\in\mathcal{G}_{k},j\in\mathcal{G}_{k^{\prime}}\right\}}c_{ij}\rho_{\gamma}^{\prime}\left(\left\|\bm{\beta}_{i}^{m}-\bm{\beta}_{j}^{m}\right\|\right)\left\|\bm{\beta}_{i}^{m}-\bm{\beta}_{j}^{m}\right\|^{-1}\left(\bm{\beta}_{i}^{m}-\bm{\beta}_{j}^{m}\right)^{T}\left\{\left(\bm{\beta}_{i}-\bm{\beta}_{i}^{*}\right)-\left(\bm{\beta}_{j}-\bm{\beta}_{j}^{*}\right)\right\}.

Since $\sup_{i}\left\|\bm{\beta}_{i}^{m}-\bm{\beta}_{i}^{0}\right\|\leq\phi_{n}$, for $k\neq k^{\prime}$, $i\in\mathcal{G}_{k}$, $j\in\mathcal{G}_{k^{\prime}}$,

\left\|\bm{\beta}_{i}^{m}-\bm{\beta}_{j}^{m}\right\|\geq\min_{i\in\mathcal{G}_{k},j\in\mathcal{G}_{k^{\prime}}}\left\|\bm{\beta}_{i}^{0}-\bm{\beta}_{j}^{0}\right\|-2\max_{i}\left\|\bm{\beta}_{i}^{m}-\bm{\beta}_{i}^{0}\right\|\geq b_{n}-2\phi_{n}>a\lambda.

Thus, $\rho_{\gamma}^{\prime}\left(\left\|\bm{\beta}_{i}^{m}-\bm{\beta}_{j}^{m}\right\|\right)=0$ by assumption (C1). Therefore,

\Gamma_{2}=\lambda\sum_{k=1}^{K}\sum_{\left\{i,j\in\mathcal{G}_{k},i<j\right\}}c_{ij}\rho_{\gamma}^{\prime}\left(\left\|\bm{\beta}_{i}^{m}-\bm{\beta}_{j}^{m}\right\|\right)\left\|\bm{\beta}_{i}-\bm{\beta}_{j}\right\|. (30)

Also, for $i,j\in\mathcal{G}_{k}$, $\left\|\bm{\beta}_{i}^{m}-\bm{\beta}_{j}^{m}\right\|\leq 4t_{n}$, so $\rho_{\gamma}^{\prime}\left(\left\|\bm{\beta}_{i}^{m}-\bm{\beta}_{j}^{m}\right\|\right)\geq\rho_{\gamma}^{\prime}\left(4t_{n}\right)$ by assumption (C1). Thus, we have

\Gamma_{2}\geq\sum_{k=1}^{K}\sum_{\left\{i,j\in\mathcal{G}_{k},i<j\right\}}\lambda c_{ij}\rho_{\gamma}^{\prime}\left(4t_{n}\right)\left\|\bm{\beta}_{i}-\bm{\beta}_{j}\right\|.

Let $\bm{Q}=\left(\bm{Q}_{1}^{T},\dots,\bm{Q}_{n}^{T}\right)^{T}=\left[\left(\bm{y}-\bm{Z}\bm{\eta}-\bm{X}\bm{\beta}^{m}\right)^{T}\bm{\Omega}\bm{X}\right]^{T}$ with

\bm{Q}_{i}=\frac{1}{n_{i}}\sum_{h=1}^{n_{i}}\left(y_{ih}-\bm{z}_{ih}^{T}\bm{\eta}-\bm{x}_{ih}^{T}\bm{\beta}_{i}^{m}\right)\bm{x}_{ih}.

We have,

\Gamma_{1}=-\left(\bm{y}-\bm{Z}\bm{\eta}-\bm{X}\bm{\beta}^{m}\right)^{T}\bm{\Omega}\bm{X}\left(\bm{\beta}-\bm{\beta}^{*}\right)=-\bm{Q}^{T}\left(\bm{\beta}-\bm{\beta}^{*}\right)=-\sum_{k=1}^{K}\sum_{\left\{i,j\in\mathcal{G}_{k},i<j\right\}}\frac{\left(\bm{Q}_{i}-\bm{Q}_{j}\right)^{T}\left(\bm{\beta}_{i}-\bm{\beta}_{j}\right)}{\left|\mathcal{G}_{k}\right|}. (31)

Moreover,

\bm{Q}_{i}=\frac{1}{n_{i}}\sum_{h=1}^{n_{i}}\left(\epsilon_{ih}+\bm{z}_{ih}^{T}\left(\bm{\eta}^{0}-\bm{\eta}\right)+\bm{x}_{ih}^{T}\left(\bm{\beta}_{i}^{0}-\bm{\beta}_{i}^{m}\right)\right)\bm{x}_{ih},

so

\sup_{i}\left\|\bm{Q}_{i}\right\|\leq\sup_{i,h}\left\|\bm{x}_{ih}\right\|\left(\left\|\bm{\xi}\right\|_{\infty}+\sup_{i,h}\left\|\bm{z}_{ih}\right\|\left\|\bm{\eta}^{0}-\bm{\eta}\right\|+\sup_{i,h}\left\|\bm{x}_{ih}\right\|\sup_{i}\left\|\bm{\beta}_{i}^{0}-\bm{\beta}_{i}^{m}\right\|\right)\leq C_{2}\sqrt{p}\left(\left\|\bm{\xi}\right\|_{\infty}+C_{3}\sqrt{q}\phi_{n}+C_{2}\sqrt{p}\phi_{n}\right),

where $\bm{\xi}=\left(\xi_{1},\dots,\xi_{n}\right)^{T}$ with $\xi_{i}=\frac{1}{n_{i}}\sum_{h=1}^{n_{i}}\epsilon_{ih}$. According to condition (C3),

P\left(\left\|\bm{\xi}\right\|_{\infty}>\sqrt{2c_{1}^{-1}}\sqrt{\log n/\min n_{i}}\right)
\leq \sum_{i=1}^{n}P\left(\left|\frac{1}{n_{i}}\sum_{h=1}^{n_{i}}\epsilon_{ih}\right|>\sqrt{2c_{1}^{-1}}\sqrt{\log n/\min n_{i}}\right)
\leq \sum_{i=1}^{n}P\left(\left|\frac{1}{n_{i}}\sum_{h=1}^{n_{i}}\epsilon_{ih}\right|>\sqrt{2c_{1}^{-1}}\sqrt{\log n/n_{i}}\right)
\leq 2\sum_{i=1}^{n}\exp\left\{-c_{1}\cdot 2c_{1}^{-1}\log n\right\}=2n^{-1}.

Thus, there exists an event $E_{2}$ such that $P\left(E_{2}\right)\geq 1-2n^{-1}$ and

\sup_{i}\left\|\bm{Q}_{i}\right\|\leq C_{2}\sqrt{p}\left(\sqrt{2c_{1}^{-1}}\sqrt{\log n/\min_{i}n_{i}}+C_{3}\sqrt{q}\phi_{n}+C_{2}\sqrt{p}\phi_{n}\right).

Thus,

\left|\frac{\left(\bm{Q}_{i}-\bm{Q}_{j}\right)^{T}\left(\bm{\beta}_{i}-\bm{\beta}_{j}\right)}{\left|\mathcal{G}_{k}\right|}\right|\leq 2\left|\mathcal{G}_{\min}\right|^{-1}\sup_{i}\left\|\bm{Q}_{i}\right\|\left\|\bm{\beta}_{i}-\bm{\beta}_{j}\right\|\leq 2C_{2}\left|\mathcal{G}_{\min}\right|^{-1}\sqrt{p}\left(\sqrt{2c_{1}^{-1}}\sqrt{\log n/\min_{i}n_{i}}+C_{3}\sqrt{q}\phi_{n}+C_{2}\sqrt{p}\phi_{n}\right)\left\|\bm{\beta}_{i}-\bm{\beta}_{j}\right\|. (32)

Combining (30), (31), and (32), it follows from (29) that

Q_{n}\left(\bm{\eta},\bm{\beta}\right)-Q_{n}\left(\bm{\eta},\bm{\beta}^{*}\right)\geq\sum_{k=1}^{K}\sum_{\left\{i,j\in\mathcal{G}_{k},i<j\right\}}\left\{\lambda c_{ij}\rho_{\gamma}^{\prime}\left(4t_{n}\right)-2C_{2}\left|\mathcal{G}_{\min}\right|^{-1}\sqrt{p}\left(\sqrt{2c_{1}^{-1}}\sqrt{\frac{\log n}{\min_{i}n_{i}}}+C_{3}\sqrt{q}\phi_{n}+C_{2}\sqrt{p}\phi_{n}\right)\right\}\left\|\bm{\beta}_{i}-\bm{\beta}_{j}\right\|.

As $t_{n}=o\left(1\right)$, $\rho_{\gamma}^{\prime}\left(4t_{n}\right)\rightarrow 1$. Since $\left|\mathcal{G}_{\min}\right|\gg\left(q+Kp\right)^{1/2}\max\left(\sqrt{\frac{n}{\min_{i}n_{i}}\log n},\left(q+Kp\right)^{1/2}\right)$, $p=o(n)$, and $q=o(n)$, we have $\left|\mathcal{G}_{\min}\right|^{-1}p=o\left(1\right)$ and $\left|\mathcal{G}_{\min}\right|^{-1}\sqrt{pq}=o\left(1\right)$. Thus, $\lambda\gg\left|\mathcal{G}_{\min}\right|^{-1}\sqrt{p}\sqrt{\frac{\log n}{\min n_{i}}}$, $\lambda\gg\left|\mathcal{G}_{\min}\right|^{-1}\sqrt{pq}\phi_{n}$, and $\lambda\gg\left|\mathcal{G}_{\min}\right|^{-1}p\phi_{n}$. Therefore, $Q_{n}\left(\bm{\eta},\bm{\beta}\right)-Q_{n}\left(\bm{\eta},\bm{\beta}^{*}\right)\geq 0$ for sufficiently large $n$, by the assumption in (C4) that the $c_{ij}$'s are bounded when $i$ and $j$ are in the same group.

Therefore, combining the two steps, $Q_{n}\left(\bm{\eta},\bm{\beta}\right)>Q_{n}(\hat{\bm{\eta}}^{or},\hat{\bm{\beta}}^{or})$ for any $\left(\bm{\eta}^{T},\bm{\beta}^{T}\right)^{T}\in\Theta_{n}\cap\Theta$ with $(\bm{\eta}^{T},\bm{\beta}^{T})^{T}\neq((\hat{\bm{\eta}}^{or})^{T},(\hat{\bm{\beta}}^{or})^{T})^{T}$. This shows that $((\hat{\bm{\eta}}^{or})^{T},(\hat{\bm{\beta}}^{or})^{T})^{T}$ is a strict local minimizer of the objective function (2) on $E_{1}\cap E_{2}$, with probability at least $1-2(q+Kp+1)n^{-1}$, for sufficiently large $n$.

References

  • Banerjee et al., (2014) Banerjee, S., Carlin, B. P., and Gelfand, A. E. (2014). Hierarchical modeling and analysis for spatial data. CRC Press.
  • Boyd et al., (2011) Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122.
  • Bradley et al., (2018) Bradley, J. R., Holan, S. H., and Wikle, C. K. (2018). Computationally efficient multivariate spatio-temporal models for high-dimensional count-valued data (with discussion). Bayesian Analysis, 13(1):253–310.
  • Brunsdon et al., (1996) Brunsdon, C., Fotheringham, A. S., and Charlton, M. E. (1996). Geographically weighted regression: a method for exploring spatial nonstationarity. Geographical Analysis, 28(4):281–298.
  • Casetti, (1972) Casetti, E. (1972). Generating models by the expansion method: applications to geographical research. Geographical Analysis, 4(1):81–91.
  • Casetti and Jones, (1987) Casetti, E. and Jones, J. P. (1987). Spatial aspects of the productivity slowdown: an analysis of US manufacturing data. Annals of the Association of American Geographers, 77(1):76–88.
  • Chi and Lange, (2015) Chi, E. C. and Lange, K. (2015). Splitting methods for convex clustering. Journal of Computational and Graphical Statistics, 24(4):994–1013.
  • Cook et al., (2007) Cook, A. J., Gold, D. R., and Li, Y. (2007). Spatial cluster detection for censored outcome data. Biometrics, 63(2):540–549.
  • Cressie, (2015) Cressie, N. (2015). Statistics for spatial data. John Wiley & Sons.
  • Diggle et al., (1998) Diggle, P. J., Tawn, J. A., and Moyeed, R. A. (1998). Model-based geostatistics. Journal of the Royal Statistical Society: Series C (Applied Statistics), 47(3):299–350.
  • Fan and Li, (2001) Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360.
  • Fan et al., (2018) Fan, Z., Guan, L., et al. (2018). Approximate $\ell_{0}$-penalized estimation of piecewise-constant signals on graphs. The Annals of Statistics, 46(6B):3217–3245.
  • Gelfand et al., (2003) Gelfand, A. E., Kim, H.-J., Sirmans, C., and Banerjee, S. (2003). Spatial modeling with spatially varying coefficient processes. Journal of the American Statistical Association, 98(462):387–396.
  • Han et al., (2012) Han, W., Yang, Z., Di, L., and Mueller, R. (2012). Cropscape: A Web service based application for exploring and disseminating US conterminous geospatial cropland data products for decision support. Computers and Electronics in Agriculture, 84:111–123.
  • Hu and Bradley, (2018) Hu, G. and Bradley, J. (2018). A Bayesian spatial-temporal model with latent multivariate log-gamma random effects with application to earthquake magnitudes. Stat, 7(1):e179.
  • Hu and Huffer, (2020) Hu, G. and Huffer, F. (2020). Modified Kaplan–Meier estimator and Nelson–Aalen estimator with geographical weighting for survival data. Geographical Analysis, 52(1):28–48.
  • Hu et al., (2020) Hu, G., Xue, Y., and Ma, Z. (2020). Bayesian clustered coefficients regression with auxiliary covariates assistant random effects. arXiv preprint arXiv:2004.12022.
  • Huang et al., (2004) Huang, J. Z., Wu, C. O., and Zhou, L. (2004). Polynomial spline estimation and inference for varying coefficient models with longitudinal data. Statistica Sinica, 14(3):763–788.
  • Hubert and Arabie, (1985) Hubert, L. and Arabie, P. (1985). Comparing partitions. Journal of Classification, 2(1):193–218.
  • Jung et al., (2007) Jung, I., Kulldorff, M., and Klassen, A. C. (2007). A spatial scan statistic for ordinal data. Statistics in Medicine, 26(7):1594–1607.
  • Kulldorff and Nagarwalla, (1995) Kulldorff, M. and Nagarwalla, N. (1995). Spatial disease clusters: detection and inference. Statistics in Medicine, 14(8):799–810.
  • Lee et al., (2017) Lee, J., Gangnon, R. E., and Zhu, J. (2017). Cluster detection of spatial regression coefficients. Statistics in Medicine, 36(7):1118–1133.
  • Lee et al., (2020) Lee, J., Sun, Y., and Chang, H. H. (2020). Spatial cluster detection of regression coefficients in a mixed-effects model. Environmetrics, 31(2):e2578.
  • Li and Sang, (2019) Li, F. and Sang, H. (2019). Spatial homogeneity pursuit of regression coefficients for large datasets. Journal of the American Statistical Association, 114(527):1050–1062.
  • Lu and Carlin, (2005) Lu, H. and Carlin, B. P. (2005). Bayesian areal wombling for geographical boundary analysis. Geographical Analysis, 37(3):265–285.
  • Lu et al., (2007) Lu, H., Reilly, C. S., Banerjee, S., and Carlin, B. P. (2007). Bayesian areal wombling via adjacency modeling. Environmental and Ecological Statistics, 14(4):433–452.
  • Ma and Huang, (2017) Ma, S. and Huang, J. (2017). A concave pairwise fusion approach to subgroup analysis. Journal of the American Statistical Association, 112(517):410–423.
  • Ma et al., (2019) Ma, S., Huang, J., Zhang, Z., and Liu, M. (2019). Exploration of heterogeneous treatment effects via concave fusion. The International Journal of Biostatistics.
  • Ma et al., (2020) Ma, Z., Xue, Y., and Hu, G. (2020). Heterogeneous regression models for clusters of spatial dependent data. Spatial Economic Analysis, pages 1–17.
  • Nakaya et al., (2005) Nakaya, T., Fotheringham, A. S., Brunsdon, C., and Charlton, M. (2005). Geographically weighted Poisson regression for disease association mapping. Statistics in Medicine, 24(17):2695–2717.
  • Nusser and Goebel, (1997) Nusser, S. M. and Goebel, J. J. (1997). The National Resources Inventory: a long-term multi-resource monitoring programme. Environmental and Ecological Statistics, 4(3):181–204.
  • Rand, (1971) Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846–850.
  • Tibshirani, (1996) Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1):267–288.
  • Vinh et al., (2010) Vinh, N. X., Epps, J., and Bailey, J. (2010). Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance. Journal of Machine Learning Research, 11:2837–2854.
  • Wang et al., (2007) Wang, H., Li, R., and Tsai, C.-L. (2007). Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika, 94(3):553–568.
  • Wang et al., (2018) Wang, X., Berg, E., Zhu, Z., Sun, D., and Demuth, G. (2018). Small area estimation of proportions with constraint for National Resources Inventory survey. Journal of Agricultural, Biological and Environmental Statistics, 23(4):509–528.
  • Xu et al., (2019) Xu, Z., Bradley, J. R., and Sinha, D. (2019). Latent multivariate log-gamma models for high-dimensional multi-type responses with application to daily fine particulate matter and mortality counts. arXiv preprint arXiv:1909.02528.
  • Xue et al., (2020) Xue, Y., Schifano, E. D., and Hu, G. (2020). Geographically weighted Cox regression for prostate cancer survival data in Louisiana. Geographical Analysis, 52(4):570–587.
  • Zhang, (2010) Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894–942.
  • Zhang and Lawson, (2011) Zhang, J. and Lawson, A. B. (2011). Bayesian parametric accelerated failure time spatial model and its application to prostate cancer. Journal of Applied Statistics, 38(3):591–603.