
Spatial Heterogeneity Automatic Detection and Estimation

Xin Wang Email: wangx172@miamioh.edu Department of Statistics, Miami University Zhengyuan Zhu Email: zhuz@iastate.edu Department of Statistics, Iowa State University Hao Helen Zhang Email: hzhang@math.arizona.edu Department of Mathematics, University of Arizona
Abstract

Spatial regression is widely used for modeling the relationship between a dependent variable and explanatory covariates. Oftentimes, the linear relationship varies across space, with some covariates having location-specific effects on the response. One fundamental question is how to detect the systematic variation in the model and identify which locations share common regression coefficients and which do not. Only a correct model structure can assure unbiased estimation of coefficients and valid inferences. In this work, we propose a new procedure, called Spatial Heterogeneity Automatic Detection and Estimation (SHADE), for automatically and simultaneously subgrouping and estimating covariate effects for spatial regression models. The SHADE employs a class of spatially-weighted fusion-type penalties on all pairs of observations, with location-specific weights adaptively constructed using spatial information, to cluster coefficients into subgroups. Under certain regularity conditions, the SHADE is shown to identify the true model structure with probability approaching one and to estimate the regression coefficients consistently. We develop an alternating direction method of multipliers (ADMM) algorithm to compute the SHADE efficiently. In numerical studies, we demonstrate the empirical performance of the SHADE using different choices of weights and compare their accuracy. The results suggest that spatial information can enhance subgroup structure analysis in challenging situations where the spatial variation among regression coefficients is small or the number of repeated measures is small. Finally, the SHADE is applied to model the relationship between a natural resource survey and a land cover data layer and to identify spatially interpretable groups.

Key words: Areal data; structure selection; penalization; repeated measures; spatial heterogeneity; subgroup analysis

1 Introduction

Spatial regression is commonly used to model the relationship between a response and explanatory variables. For complex problems, some covariates (we call them global covariates) may have constant effects across space, while other covariates (we call them local covariates) may have location-specific effects, i.e., their effects on the response variable vary across space. This problem has received wide attention in many fields such as environmental sciences (Hu and Bradley, 2018), biology (Zhang and Lawson, 2011), social science (Bradley et al., 2018), economics (Brunsdon et al., 1996), and biostatistics (Xu et al., 2019).

A motivating example concerns the relationship between two landcover data sources: the National Resources Inventory (NRI; Nusser and Goebel, 1997) survey conducted by the USDA Natural Resources Conservation Service (NRCS), and the Cropland Data Layer (CDL; Han et al., 2012) produced by the USDA National Agricultural Statistics Service (NASS). Accurate estimates of local landcover information from NRI are essential for developing conservation policies and land management plans. However, direct estimates in small geographical areas, such as at the county level, may not be accurate due to small sample sizes. Auxiliary information such as CDL can be used to improve the small area estimator in NRI (Wang et al., 2018). Traditional regression models used in small area estimation problems typically assume common regression coefficients over all domains, which may not be appropriate. For example, when we examined the linear relationship between the NRI and CDL estimates of different types of land cover at the county level, the regression coefficients in the Mountain states were quite different from those on the west coast and in the vast areas in the east. This is reflected in Figure 9 (a) in Section 5. One reason for this difference is the scope of the NRI survey, which only includes non-federal land in the US. Another contributing reason is that CDL is created by training separate machine learning models at the state level using only ground observations from that state, which creates variation among states. A common regression assumption would be too simple to capture the regional differences and would lead to biased estimators. This type of spatial heterogeneity is also known as structural instability. For linear models, it implies that the linear relationship changes geographically over space, and the linear regression coefficients may form subgroups. Identifying the correct grouping structure of the regression coefficients is an important and challenging problem, as only a correct model structure can lead to unbiased estimation of the regression coefficients and valid inference. In practice, ad hoc groupings of states into regions defined by tradition or for federal administrative purposes are sometimes used to address this issue. However, such groupings are not driven by the data in specific problems, and may not be appropriate or efficient. For example, the central region in Figure 9 (a) includes not only all the Mountain states, but also North and South Dakota, which are not traditionally considered Mountain states. Figure 9 (b) suggests a further division of the east into sub-regions, which does not align with any well-known administrative regions. One natural approach is to assume that the regression coefficients of nearby states are more likely to belong to the same subgroup than those of states farther apart, and to use both the estimated regression coefficients and the spatial structure to guide the clustering of states into subregions, which is what we propose to do in this paper.

The problem of taking spatial dependence structure into account in linear regression has long been studied in the literature. Classical works include spatial expansion methods (Casetti, 1972; Casetti and Jones, 1987), which treat the spatially varying regression coefficients as functions of expansion variables, typically using longitude and latitude coordinates as location variables. One popular approach to accounting for spatial variation in the model is to introduce an additive spatial random effect for each location, as done for linear models by Cressie (2015) and generalized linear models by Diggle et al. (1998). Another class of models in wide use is spatially varying coefficient models, including geographically weighted regression (GWR; Brunsdon et al., 1996) and its extensions to generalized linear models (Nakaya et al., 2005) and the Cox model for survival analysis (Xue et al., 2020; Hu and Huffer, 2020). There have also been developments in the Bayesian framework, such as Gelfand et al. (2003).

The methods mentioned above typically assume that the regression parameters are smooth functions of the location variables, which is reasonable in certain applications, but may not be appropriate when the covariate effects are constant over subregions defined by some unobserved hidden factors. In this work, we take a different perspective by grouping the covariate effects into spatially interpretable subgroups or clusters. As in the motivating example, different clusters have different patterns, which can be used to build more flexible estimators that improve the original direct estimates. A majority of the existing work on spatial cluster detection is based on hypothesis tests, including the scan statistic methods based on the likelihood ratio (Kulldorff and Nagarwalla, 1995; Jung et al., 2007; Cook et al., 2007) and the two-step spatial test methods under the GWR framework (Lee et al., 2017, 2020). Test-based methods are intuitive and useful in practice, but proper test statistics are often difficult to construct and the tests may have low power when the number of locations is large. In addition, these methods handle cluster detection and model estimation separately, making it difficult to study inferential properties of the final estimator. The main purpose of this article is to fill this gap by developing a unified framework to detect clusters of regression coefficients, estimate them consistently, and make valid inferences.

In the context of non-spatial data analysis, a variety of clustering methods have been proposed to identify homogeneous groups of either observations or regression coefficients. Chi and Lange (2015) developed a method for the convex clustering problem through the alternating direction method of multipliers (ADMM) (Boyd et al., 2011) with a pairwise $L_{p}\,(p\geq 1)$ penalty, where nonnegative weights are incorporated into the pairwise penalties to reduce bias. Fan et al. (2018) considered a clustering problem with an $l_{0}$ penalty on graphs. For clustering regression coefficients, Ma and Huang (2017) and Ma et al. (2019) proposed a concave fusion approach for estimating the group structure and the subgroup-specific effects, using the smoothly clipped absolute deviation (SCAD) penalty (Fan and Li, 2001) and the minimax concave penalty (MCP) (Zhang, 2010).

For spatial analysis, some interesting work on grouping regression coefficients in spatial regression has recently been proposed. Ma et al. (2020) proposed Bayesian heterogeneity pursuit regression models to detect clusters in the covariate effects based on the Dirichlet process. Hu et al. (2020) proposed a Bayesian method for clustering coefficients with auxiliary covariate random effects, based on a mixture of finite mixtures (MFM). Li and Sang (2019) proposed a penalized approach based on the minimum spanning tree. In the area of spatial boundary detection, Lu and Carlin (2005) and Lu et al. (2007) considered areal boundary detection using a Bayesian hierarchical model based on the conditional autoregressive model (Banerjee et al., 2014), where the boundaries are determined by the posterior distribution of the corresponding spatial process or spatial weights. These boundary detection methods focus on clustering observations instead of regression coefficients.

The main difficulty in clustering spatial covariate effects is how to estimate the number of clusters and the cluster memberships consistently while properly taking spatial neighborhood information into account. To our knowledge, none of the existing methods addresses this challenge. In this work, we fill the gap by proposing a new procedure, called Spatial Heterogeneity Automatic Detection and Estimation (SHADE), for automatically grouping and estimating local covariate effects simultaneously. The SHADE employs a class of spatially-weighted fusion-type penalties on all pairs of observations, with location-specific weights adaptively constructed using the geographical proximity of locations, and achieves spatial clustering consistency for spatial linear regression. In theory, we show that the oracle estimator based on weighted least squares is a local minimizer of the objective function with probability approaching 1 under certain regularity conditions, which implies that the number of clusters can be estimated consistently. We believe that this result is the first of its kind in the context of spatial data analysis. To implement the SHADE, an alternating direction method of multipliers (ADMM) algorithm is developed. To make good choices of spatial weights and understand their roles in practical applications, we consider a number of different schemes for choosing the pairwise weights and compare them numerically and theoretically. Our numerical examples suggest that the number of clusters and the group structure can be recovered with high probability, and that spatial information helps in spatial clustering analysis when the minimal group difference is small or the number of repeated measures is small.

The article is organized as follows. In Section 2, we describe the Spatial Heterogeneity Automatic Detection and Estimation (SHADE) model and the corresponding ADMM algorithm. In Section 3, we establish theoretical properties of the SHADE estimator. A simulation study is conducted in Section 4 under several scenarios to assess the performance of the proposed estimator. The proposed method is applied to an NRI small area estimation problem in Section 5 to illustrate the use of SHADE in real-data applications. Finally, some discussions are given in Section 6.

2 Methodology and Algorithm

2.1 Methodology: SHADE

Assume our spatial data consist of multiple measurements at each location or subject. Let $y_{ih}$ be the $h$th response for the $i$th subject observed at location $\bm{s}_{i}$, where $i=1,\dots,n$ and $h=1,\dots,n_{i}$. Based on their effects on the response variable, the covariates can be divided into two categories: “global” covariates, which have common effects on the response across all locations, and “local” covariates, which have location-specific effects on the response. Accordingly, let $\bm{z}_{ih}$ and $\bm{x}_{ih}$ denote the global and local covariate vectors, with dimensions $q$ and $p$, respectively. We consider the following linear regression model

yih=𝒛ihT𝜼+𝒙ihT𝜷i+ϵih,y_{ih}=\bm{z}_{ih}^{T}\bm{\eta}+\bm{x}_{ih}^{T}\bm{\beta}_{i}+\epsilon_{ih}, (1)

where $\bm{\eta}$ is the vector of common regression coefficients shared by the global effects, the $\bm{\beta}_{i}$'s are location-specific regression coefficients, and the $\epsilon_{ih}$'s are i.i.d. random errors with $E(\epsilon_{ih})=0$ and $Var(\epsilon_{ih})=\sigma^{2}$. Furthermore, some locations may have the same or similar location-specific effects, and grouping locations with the same effects can achieve dimension reduction and improve model prediction accuracy. Assume the $n$ location-specific effects belong to $K$ mutually exclusive subgroups: locations with a common $\bm{\beta}_{i}$ belong to the same group. Denote the corresponding partition of $\{1,\dots,n\}$ as $\mathcal{G}=\{\mathcal{G}_{1},\dots,\mathcal{G}_{K}\}$, where the index set $\mathcal{G}_{k}$ contains all the locations belonging to group $k$, for $k=1,\ldots,K$. For convenience, denote the regression coefficients associated with $\mathcal{G}_{k}$ as $\bm{\alpha}_{k}$. In practice, since neither $K$ nor the partition $\mathcal{G}_{k}$'s are known, the goal is to use the observed data $\{(y_{ih},\bm{z}_{ih},\bm{x}_{ih})\}$ to construct the estimator $\hat{K}$ and the partition $\hat{\mathcal{G}}=\{\hat{\mathcal{G}}_{1},\dots,\hat{\mathcal{G}}_{\hat{K}}\}$, where $\hat{\mathcal{G}}_{k}=\{i:\hat{\bm{\beta}}_{i}=\hat{\bm{\alpha}}_{k},1\leq i\leq n\}$.

To achieve this goal, we solve the following optimization problem: minimize the weighted least squares objective function augmented with a spatially-weighted pairwise penalty,

Qn(𝜼,𝜷;λ,ψ)=12i=1n1nih=1ni(yih𝒛ihT𝜼𝒙ihT𝜷i)2+1i<jnpγ(𝜷i𝜷j,cijλ),Q_{n}\left(\bm{\eta},\bm{\beta};\lambda,\psi\right)=\frac{1}{2}\sum_{i=1}^{n}\frac{1}{n_{i}}\sum_{h=1}^{n_{i}}\left(y_{ih}-\bm{z}_{ih}^{T}\bm{\eta}-\bm{x}_{ih}^{T}\bm{\beta}_{i}\right)^{2}+\sum_{1\leq i<j\leq n}p_{\gamma}\left(\left\|\bm{\beta}_{i}-\bm{\beta}_{j}\right\|,c_{ij}\lambda\right), (2)

where $\bm{\eta}=(\eta_{1},\ldots,\eta_{q})^{T}$, $\bm{\beta}=(\bm{\beta}_{1}^{T},\dots,\bm{\beta}_{n}^{T})^{T}$, $\|\cdot\|$ denotes the Euclidean norm, and $p_{\gamma}(\cdot,\lambda)$ is a penalty function imposed on all distinct pairs. In the penalty function, $\lambda\geq 0$ is a tuning parameter, $\gamma>0$ is a built-in constant, and different weights $c_{ij}$ are assigned to different pairs of locations $\bm{s}_{i}$ and $\bm{s}_{j}$ for $1\leq i<j\leq n$. One popular choice of penalty is the $L_{1}$ penalty (lasso) (Tibshirani, 1996) of the form $p_{\gamma}(t,\lambda)=\lambda|t|$. Since the $L_{1}$ penalty tends to produce too many groups, as shown in Ma and Huang (2017), we consider the SCAD penalty, defined as

pγ(t,λ)=λ0|t|min{1,(γx/λ)+/(γ1)}𝑑x.p_{\gamma}(t,\lambda)=\lambda\int_{0}^{|t|}\min\{1,(\gamma-x/\lambda)_{+}/(\gamma-1)\}dx. (3)

Here we treat $\gamma$ as a fixed value, as in Fan and Li (2001), Zhang (2010) and Ma et al. (2019).
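For concreteness, the SCAD integral in (3) can be evaluated in closed form: it equals $\lambda|t|$ for $|t|\leq\lambda$ and the constant $\lambda^{2}(\gamma+1)/2$ for $|t|\geq\gamma\lambda$, with a quadratic transition in between. A minimal Python sketch (the function name and vectorized layout are our own):

```python
import numpy as np

def scad_penalty(t, lam, gamma=3.0):
    """Closed form of the SCAD penalty p_gamma(t, lam) in (3)."""
    a = np.abs(np.asarray(t, dtype=float))
    return np.where(
        a <= lam,                                   # linear (lasso) part
        lam * a,
        np.where(
            a <= gamma * lam,                       # quadratic transition
            (2 * gamma * lam * a - a**2 - lam**2) / (2 * (gamma - 1)),
            lam**2 * (gamma + 1) / 2,               # constant tail
        ),
    )
```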

2.2 Choices of Spatial Weights

In (2), the values of the weights $c_{ij}$ are crucial, as they control the number of subgroups and the grouping results. Pairs $\|\bm{\beta}_{i}-\bm{\beta}_{j}\|$ with larger weights $c_{ij}\lambda$ are shrunk toward each other more strongly than pairs with smaller weights. For spatial data, reasonable choices of $c_{ij}$ should take two factors into account: locations with closer coefficient values are more likely to be grouped together, and locations closer to each other are more likely to form a subgroup, as they typically have similar trends. Since the true values of $\bm{\beta}$ are not available, we use their estimators $\tilde{\bm{\beta}}$ as surrogates. For example, we can define the weights $c_{ij}$ as

cij=exp(ψ𝒔i𝒔j𝜷~i𝜷~j),c_{ij}=\exp\left(-\psi\left\|\bm{s}_{i}-\bm{s}_{j}\right\|\cdot\left\|\tilde{\bm{\beta}}_{i}-\tilde{\bm{\beta}}_{j}\right\|\right),

where $\tilde{\bm{\beta}}_{i}$ is an initial estimate of $\bm{\beta}_{i}$, and $\psi$ is a scale parameter controlling the magnitudes of the weights. For areal data, we suggest three different ways of incorporating the spatial information into the weights:
(i) using both spatial and regression coefficients information:

cij=exp(ψ(1aij)𝜷~i𝜷~j),c_{ij}=\exp\left(\psi\left(1-a_{ij}\right)\cdot\left\|\tilde{\bm{\beta}}_{i}-\tilde{\bm{\beta}}_{j}\right\|\right), (4)

where $a_{ij}$ is the neighbor order between locations $\bm{s}_{i}$ and $\bm{s}_{j}$: if $i$ and $j$ are neighbors, $a_{ij}=1$; if $i$ and $j$ are not neighbors but share at least one common neighbor, $a_{ij}=2$; and so on, so that a neighbor order is defined for every pair of subjects or locations.
(ii) using regression coefficients information only:

cij=exp(ψ𝜷~i𝜷~j).c_{ij}=\exp\left(-\psi\left\|\tilde{\bm{\beta}}_{i}-\tilde{\bm{\beta}}_{j}\right\|\right). (5)

(iii) using spatial information only:

cij=exp(ψ(1aij)).c_{ij}=\exp\left(\psi(1-a_{ij})\right). (6)

The weights in (4) and (5) both involve the regression coefficients and therefore depend on the accuracy of $\tilde{\bm{\beta}}_{i}$. If the number of repeated measures is not large, the values of $\tilde{\bm{\beta}}_{i}$ may not reflect the real relationship between different locations, which can lead to very poor weights; this phenomenon can be observed in the simulation study. The weights used here are three special cases that use the information in the regression coefficients, the spatial neighbor orders, or both. Certainly, there are other ways to construct weights, such as using spatial distance to borrow spatial information. As long as the weights satisfy condition (C4) in Section 3, the theoretical results will hold; for example, the weights in (5) satisfy (C4) automatically if the $\tilde{\bm{\beta}}_{i}$'s are consistent estimators. That is, besides condition (C4), no other conditions are imposed on the form of the weight function. A minimal computational sketch of these weight schemes is given below.
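The following Python sketch implements the three weight schemes (4)-(6); the function name, argument layout, and the convention that `neighbor_order[i, j]` stores $a_{ij}$ are our own assumptions, not part of the paper.

```python
import numpy as np

def shade_weights(beta_init, neighbor_order=None, psi=1.0, scheme="reg_sp"):
    """Pairwise weights c_ij for the SHADE penalty.

    beta_init:      (n, p) initial estimates beta_tilde_i
    neighbor_order: (n, n) matrix of neighbor orders a_ij (1 = adjacent)
    Returns an (n, n) symmetric matrix; only entries with i < j are used.
    """
    n = beta_init.shape[0]
    # pairwise distances between initial coefficient estimates
    diff = np.linalg.norm(beta_init[:, None, :] - beta_init[None, :, :], axis=2)
    if scheme == "reg":       # (5): coefficient information only
        return np.exp(-psi * diff)
    if scheme == "sp":        # (6): spatial neighbor order only
        return np.exp(psi * (1.0 - neighbor_order))
    if scheme == "reg_sp":    # (4): both; note 1 - a_ij <= 0, so c_ij <= 1
        return np.exp(psi * (1.0 - neighbor_order) * diff)
    if scheme == "equal":
        return np.ones((n, n))
    raise ValueError(f"unknown scheme: {scheme}")
```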

2.3 Algorithm for SHADE

In this section, we describe the ADMM algorithm for solving (2) in Section 2.1. The algorithm shares the same spirit as that of Ma and Huang (2017), where an unweighted penalty is used in a non-spatial setting without repeated measures.

There are two tuning parameters, $\lambda$ and $\psi$, in the proposed method. We choose them adaptively using the tuning procedure discussed at the end of this section. For now, we fix them and present the computational algorithm for solving (2). Denote the solution as

(𝜼^,𝜷^)=argmin𝜼q,𝜷npQn(𝜼,𝜷,λ,ψ).\left(\hat{\bm{\eta}},\hat{\bm{\beta}}\right)=\operatorname*{arg\,min}_{\bm{\eta}\in\mathbb{R}^{q},\bm{\beta}\in\mathbb{R}^{np}}Q_{n}\left(\bm{\eta},\bm{\beta},\lambda,\psi\right). (7)

First, we introduce slack variables $\bm{\delta}_{ij}=\bm{\beta}_{i}-\bm{\beta}_{j}$ for all pairs $(i,j)$ with $1\leq i<j\leq n$. The problem is then equivalent to minimizing the following objective function with respect to $(\bm{\eta},\bm{\beta},\bm{\delta})$,

min𝜼,𝜷,𝜹L0(𝜼,𝜷,𝜹)\displaystyle\min_{\bm{\eta},\bm{\beta},\bm{\delta}}\leavevmode\nobreak\ \leavevmode\nobreak\ L_{0}\left(\bm{\eta},\bm{\beta},\bm{\delta}\right) =12i=1n1nih=1ni(yih𝒛ihT𝜼𝒙ihT𝜷i)2+1i<jnpγ(𝜹ij,cijλ),\displaystyle=\frac{1}{2}\sum_{i=1}^{n}\frac{1}{n_{i}}\sum_{h=1}^{n_{i}}\left(y_{ih}-\bm{z}_{ih}^{T}\bm{\eta}-\bm{x}_{ih}^{T}\bm{\beta}_{i}\right)^{2}+\sum_{1\leq i<j\leq n}p_{\gamma}\left(\left\|\bm{\delta}_{ij}\right\|,c_{ij}\lambda\right),
subject to𝜷i𝜷j𝜹ij=𝟎, 1i<jn,\displaystyle\text{subject to}\leavevmode\nobreak\ \leavevmode\nobreak\ \bm{\beta}_{i}-\bm{\beta}_{j}-\bm{\delta}_{ij}=\bm{0},\leavevmode\nobreak\ \leavevmode\nobreak\ 1\leq i<j\leq n,

where $\bm{\delta}=(\bm{\delta}^{T}_{ij},1\leq i<j\leq n)^{T}$. To handle the equality constraints in the optimization problem, we introduce the augmented Lagrangian

L(𝜼,𝜷,𝜹,𝒗)=L0(𝜼,𝜷,𝜹)+i<j𝒗ij,𝜷i𝜷j𝜹ij+ϑ2i<j𝜷i𝜷j𝜹ij2,L\left(\bm{\eta},\bm{\beta},\bm{\delta},\bm{v}\right)=L_{0}\left(\bm{\eta},\bm{\beta},\bm{\delta}\right)+\sum_{i<j}\left\langle\bm{v}_{ij},\bm{\beta}_{i}-\bm{\beta}_{j}-\bm{\delta}_{ij}\right\rangle+\frac{\vartheta}{2}\sum_{i<j}\left\|\bm{\beta}_{i}-\bm{\beta}_{j}-\bm{\delta}_{ij}\right\|^{2},

where 𝒗=(𝒗ijT,1i<jn)T\bm{v}=(\bm{v}^{T}_{ij},1\leq i<j\leq n)^{T} are Lagrange multipliers and ϑ>0\vartheta>0 is the penalty parameter.

To solve the problem, we use an iterative algorithm which updates 𝜷,𝜼,𝜹,𝒗\bm{\beta},\bm{\eta},\bm{\delta},\bm{v} sequentially, one at a time. At the (m+1)(m+1)th iteration, given their current values (𝜷(m),𝜼(m),𝜹(m),𝒗(m))(\bm{\beta}^{(m)},\bm{\eta}^{(m)},\bm{\delta}^{(m)},\bm{v}^{(m)}), the updates of 𝜼,𝜷,𝜹,𝒗\bm{\eta},\bm{\beta},\bm{\delta},\bm{v} are

(𝜼(m+1),𝜷(m+1))\displaystyle\left(\bm{\eta}^{(m+1)},\bm{\beta}^{(m+1)}\right) =argmin𝜼,𝜷L(𝜼,𝜷,𝜹(m),𝒗(m)),\displaystyle=\operatorname*{arg\,min}_{\bm{\eta},\bm{\beta}}L\left(\bm{\eta},\bm{\beta},\bm{\delta}^{{(m)}},\bm{v}^{{(m)}}\right),
𝜹(m+1)\displaystyle\bm{\delta}^{(m+1)} =argmin𝜹L(𝜼(m+1),𝜷(m+1),𝜹,𝒗(m)),\displaystyle=\operatorname*{arg\,min}_{\bm{\delta}}L\left(\bm{\eta}^{(m+1)},\bm{\beta}^{(m+1)},\bm{\delta},\bm{v}^{(m)}\right),
\bm{v}_{ij}^{(m+1)} =\bm{v}_{ij}^{(m)}+\vartheta\left(\bm{\beta}_{i}^{(m+1)}-\bm{\beta}_{j}^{(m+1)}-\bm{\delta}_{ij}^{(m+1)}\right). (8)

To update 𝜼\bm{\eta} and 𝜷\bm{\beta}, we minimize the following objective function

f(𝜷,𝜼)=𝛀1/2(𝒚𝒁𝜼𝑿𝜷)2+ϑ𝑨𝜷𝜹(m)+ϑ1𝒗(m)2,f\left(\bm{\beta},\bm{\eta}\right)=\left\|\bm{\Omega}^{1/2}\left(\bm{y}-\bm{Z}\bm{\eta}-\bm{X}\bm{\beta}\right)\right\|^{2}+\vartheta\left\|\bm{A}\bm{\beta}-\bm{\delta}^{(m)}+\vartheta^{-1}\bm{v}^{(m)}\right\|^{2},

where 𝒚=(y11,,y1n1,,yn1,,yn,nn)T\bm{y}=\left(y_{11},\dots,y_{1n_{1}},\dots,y_{n1},\dots,y_{n,n_{n}}\right)^{T}, 𝒁=(𝒛11,,𝒛1n1,,𝒛n1,,𝒛n,nn)T\bm{Z}=\left(\bm{z}_{11},\dots,\bm{z}_{1n_{1}},\dots,\bm{z}_{n1},\dots,\bm{z}_{n,n_{n}}\right)^{T}, 𝑿=diag(𝑿1,,𝑿n)\bm{X}=\text{diag}\left(\bm{X}_{1},\dots,\bm{X}_{n}\right) with 𝑿i=(𝒙i1,,𝒙i,ni)T\bm{X}_{i}=\left(\bm{x}_{i1},\dots,\bm{x}_{i,n_{i}}\right)^{T}, 𝛀=diag(1/n1𝑰n1,,1/nn𝑰nn)\bm{\Omega}=\text{diag}\left(1/n_{1}\bm{I}_{n_{1}},\dots,1/n_{n}\bm{I}_{n_{n}}\right) and 𝑨=𝑫𝑰p\bm{A}=\bm{D}\otimes\bm{I}_{p} with an [n(n1)/2]×n[n\left(n-1\right)/2]\times n matrix 𝑫={(𝒆i𝒆j),i<j}T\bm{D}=\left\{\left(\bm{e}_{i}-\bm{e}_{j}\right),i<j\right\}^{T}, where 𝒆i\bm{e}_{i} is an n×1n\times 1 vector with iith element 1 and other elements 0. Then the solutions for 𝜷\bm{\beta} and 𝜼\bm{\eta} are

𝜷(m+1)=(𝑿T𝑸Z,Ω𝑿+ϑ𝑨T𝑨)1[𝑿T𝑸Z,Ω𝒚+ϑvec((𝚫(m)ϑ1𝚼(m))𝑫)],\bm{\beta}^{(m+1)}=\left(\bm{X}^{T}\bm{Q}_{Z,\Omega}\bm{X}+\vartheta\bm{A}^{T}\bm{A}\right)^{-1}\left[\bm{X}^{T}\bm{Q}_{Z,\Omega}\bm{y}+\vartheta\text{vec}\left(\left(\bm{\Delta}^{(m)}-\vartheta^{-1}\bm{\Upsilon}^{(m)}\right)\bm{D}\right)\right], (9)
𝜼(m+1)=(𝒁T𝛀𝒁)1𝒁T𝛀(𝒚𝑿𝜷(m+1)),\bm{\eta}^{(m+1)}=\left(\bm{Z}^{T}\bm{\Omega}\bm{Z}\right)^{-1}\bm{Z}^{T}\bm{\Omega}\left(\bm{y}-\bm{X}\bm{\beta}^{(m+1)}\right), (10)

where 𝚫(m)=(𝜹ij(m),i<j)p×n(n1)/2\bm{\Delta}^{(m)}=\left(\bm{\delta}_{ij}^{(m)},i<j\right)_{p\times n\left(n-1\right)/2}, 𝚼(m)=(𝒗ij(m),i<j)p×n(n1)/2\bm{\varUpsilon}^{(m)}=\left(\bm{v}_{ij}^{(m)},i<j\right)_{p\times n\left(n-1\right)/2} and

𝑸Z,Ω=𝛀𝛀𝒁(𝒁T𝛀𝒁)1𝒁T𝛀.\bm{Q}_{Z,\Omega}=\bm{\Omega}-\bm{\Omega}\bm{Z}\left(\bm{Z}^{T}\bm{\Omega}\bm{Z}\right)^{-1}\bm{Z}^{T}\bm{\Omega}.
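The updates (9) and (10) are ordinary linear-algebra operations. A minimal numpy sketch follows; the function and variable names are our own, and for clarity it builds the dense matrices $\bm{X}$, $\bm{\Omega}$, $\bm{D}$ and $\bm{A}$ explicitly rather than exploiting their sparsity.

```python
import numpy as np
from scipy.linalg import block_diag

def build_D(n):
    """D: rows e_i - e_j over all pairs i < j, shape (n(n-1)/2, n)."""
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    D = np.zeros((len(pairs), n))
    for l, (i, j) in enumerate(pairs):
        D[l, i], D[l, j] = 1.0, -1.0
    return D

def update_beta_eta(y, Z, X_blocks, n_i, Delta, Upsilon, vartheta):
    """One ADMM step for (beta, eta), following (9) and (10).

    X_blocks: list of n matrices of shape (n_i, p)
    Delta, Upsilon: (p, n(n-1)/2) matrices whose columns are delta_ij, v_ij
    """
    n, p = len(X_blocks), X_blocks[0].shape[1]
    X = block_diag(*X_blocks)
    Omega = np.diag(np.repeat(1.0 / np.asarray(n_i, float), n_i))
    ZtOZ = Z.T @ Omega @ Z
    # Q_{Z,Omega} = Omega - Omega Z (Z' Omega Z)^{-1} Z' Omega
    Q = Omega - Omega @ Z @ np.linalg.solve(ZtOZ, Z.T @ Omega)
    D = build_D(n)
    A = np.kron(D, np.eye(p))
    # vec((Delta - Upsilon/vartheta) D) stacks columns, matching beta's blocks
    rhs = X.T @ Q @ y + vartheta * ((Delta - Upsilon / vartheta) @ D).flatten(order="F")
    beta = np.linalg.solve(X.T @ Q @ X + vartheta * A.T @ A, rhs)
    eta = np.linalg.solve(ZtOZ, Z.T @ Omega @ (y - X @ beta))
    return eta, beta.reshape(n, p)
```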

The update of the $\bm{\delta}_{ij}$'s is componentwise and is equivalent to minimizing the following objective function

\frac{\vartheta}{2}\left\|\bm{\varsigma}_{ij}^{(m+1)}-\bm{\delta}_{ij}\right\|^{2}+p_{\gamma}\left(\left\|\bm{\delta}_{ij}\right\|,c_{ij}\lambda\right),

where $\bm{\varsigma}_{ij}^{(m+1)}=\left(\bm{\beta}_{i}^{(m+1)}-\bm{\beta}_{j}^{(m+1)}\right)+\vartheta^{-1}\bm{v}_{ij}^{(m)}$. The update based on the SCAD penalty has the closed form

𝜹ij(m+1)={S(𝝇ij(m+1),λcij/ϑ)if 𝝇ij(m+1)λcij+λcij/ϑ,S(𝝇ij(m+1),γλcij/((γ1)ϑ))11/((γ1)ϑ)if λcij+λcij/ϑ<𝝇ij(m+1)γλcij,𝝇ij(m+1)if 𝝇ij(m+1)>γλcij,\bm{\delta}_{ij}^{(m+1)}=\begin{cases}S\left(\bm{\varsigma}_{ij}^{(m+1)},\lambda c_{ij}/\vartheta\right)&\text{if }\left\|\bm{\varsigma}_{ij}^{(m+1)}\right\|\leq\lambda c_{ij}+\lambda c_{ij}/\vartheta,\\ \frac{S\left(\bm{\varsigma}_{ij}^{(m+1)},\gamma\lambda c_{ij}/\left(\left(\gamma-1\right)\vartheta\right)\right)}{1-1/\left(\left(\gamma-1\right)\vartheta\right)}&\text{if }\lambda c_{ij}+\lambda c_{ij}/\vartheta<\left\|\bm{\varsigma}_{ij}^{(m+1)}\right\|\leq\gamma\lambda c_{ij},\\ \bm{\varsigma}_{ij}^{(m+1)}&\text{if }\left\|\bm{\varsigma}_{ij}^{(m+1)}\right\|>\gamma\lambda c_{ij},\end{cases} (11)

where $\gamma>c_{ij}+c_{ij}/\vartheta$, $S(\bm{w},t)=(1-t/\|\bm{w}\|)_{+}\bm{w}$, and $(t)_{+}=t$ if $t>0$ and $0$ otherwise.
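A direct Python transcription of the closed-form update (11) for a single pair is given below (function names are our own); `group_soft` is the groupwise soft-thresholding operator $S(\bm{w},t)$.

```python
import numpy as np

def group_soft(w, t):
    """S(w, t) = (1 - t/||w||)_+ w, the groupwise soft-thresholding operator."""
    nw = np.linalg.norm(w)
    return max(0.0, 1.0 - t / nw) * w if nw > 0 else w

def update_delta(varsigma, c, lam, gamma, vartheta):
    """SCAD update (11) for one pair: varsigma = beta_i - beta_j + v_ij/vartheta.
    Requires (gamma - 1) * vartheta > 1 so the middle-case denominator is positive."""
    nv = np.linalg.norm(varsigma)
    if nv <= lam * c * (1.0 + 1.0 / vartheta):          # soft-threshold region
        return group_soft(varsigma, lam * c / vartheta)
    if nv <= gamma * lam * c:                           # SCAD transition region
        return group_soft(varsigma, gamma * lam * c / ((gamma - 1) * vartheta)) / (
            1.0 - 1.0 / ((gamma - 1) * vartheta))
    return varsigma                                     # no shrinkage
```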

In summary, the computational algorithm can be described as follows.

Algorithm: ADMM algorithm
0: Initialize $\bm{\beta}^{(0)}$, $\bm{\delta}^{(0)}$ and $\bm{v}^{(0)}$.
1: for $m=0,1,2,\dots$ do
2:   Update $\bm{\beta}$ by (9).
3:   Update $\bm{\eta}$ by (10).
4:   Update $\bm{\delta}$ by (11).
5:   Update $\bm{v}$ by (8).
6:   if convergence criterion is met then
7:     Stop and return the estimates.
8:   else
9:     $m=m+1$
10:  end if
11: end for

If $n_{i}$ is reasonably large, say 10 or more, we construct the initial values $\bm{\beta}^{(0)}$ by fitting the linear regression model $y_{ih}=\bm{z}_{ih}^{T}\bm{\eta}+\bm{x}_{ih}^{T}\bm{\beta}_{i}+\epsilon_{ih}$ separately for each $i=1,\dots,n$. Then, set $\bm{\delta}_{ij}^{(0)}=\bm{\beta}_{i}^{(0)}-\bm{\beta}_{j}^{(0)}$ and $\bm{v}^{(0)}=\bm{0}$. If $n_{i}=1$ or is small, the initial values can be set using the procedure in Ma et al. (2019): a ridge fusion criterion with a small value of the tuning parameter is used, and the initial group structure is then obtained by assigning subjects into $K^{*}$ (a given value) groups by ranking the estimated $\bm{\beta}_{i}$'s.

If $\hat{\bm{\delta}}_{ij}=\bm{0}$, then locations $i$ and $j$ belong to the same group. Thus, we can obtain the corresponding estimated partition $\hat{\mathcal{G}}$ and the estimated number of groups $\hat{K}(\lambda,\psi)$. For each group, its group-specific parameter vector is estimated as $\hat{\bm{\alpha}}_{k}=|\hat{\mathcal{G}}_{k}|^{-1}\sum_{i\in\hat{\mathcal{G}}_{k}}\hat{\bm{\beta}}_{i}$ for $k=1,\ldots,\hat{K}$.
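Reading the group structure off the fitted $\hat{\bm{\delta}}_{ij}$'s amounts to merging locations connected by (numerically) zero differences. A small union-find sketch, where the names and the tolerance `tol` are our own choices:

```python
import numpy as np

def extract_groups(delta_hat, pairs, n, tol=1e-8):
    """Merge i and j whenever ||delta_hat_ij|| < tol; return integer labels."""
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i
    for (i, j), d in zip(pairs, delta_hat):
        if np.linalg.norm(d) < tol:
            parent[find(i)] = find(j)
    roots = np.array([find(i) for i in range(n)])
    _, labels = np.unique(roots, return_inverse=True)
    return labels                            # values in {0, ..., K_hat - 1}

def group_means(beta_hat, labels):
    """alpha_hat_k = average of beta_hat_i over the k-th recovered group."""
    return np.vstack([beta_hat[labels == k].mean(axis=0)
                      for k in range(labels.max() + 1)])
```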

Remark 1.

If there are no global covariates, the model is simplified as yih=𝐱ihT𝛃i+ϵihy_{ih}=\bm{x}_{ih}^{T}\bm{\beta}_{i}+\epsilon_{ih}. The algorithm will be simplified, that is, 𝐐Z,Ω\bm{Q}_{Z,\Omega} will become 𝛀\bm{\Omega}. The model we use in the application is the simplified model.

Remark 2.

The convergence criterion used is the same as Ma and Huang, (2017), which is based on the primal residual 𝐫(m+1)=𝐀𝛃(m+1)𝛅(m+1)\bm{r}^{(m+1)}=\bm{A}\bm{\beta}^{(m+1)}-\bm{\delta}^{(m+1)}. The algorithm is stopped if 𝐫(m+1)<ε\|\bm{r}^{(m+1)}\|<\varepsilon, where ε\varepsilon is a small positive value.

We need to select two tuning parameters, $\lambda$ and $\psi$, in the SHADE algorithm. In this paper, we use the modified Bayesian information criterion (BIC) (Wang et al., 2007) to determine the best tuning parameters adaptively from the data. In particular, we use

\text{BIC}\left(\lambda,\psi\right)=\log\left[\frac{1}{n}\sum_{i=1}^{n}\frac{1}{n_{i}}\sum_{h=1}^{n_{i}}\left(y_{ih}-\bm{z}_{ih}^{T}\hat{\bm{\eta}}(\lambda,\psi)-\bm{x}_{ih}^{T}\hat{\bm{\beta}}_{i}(\lambda,\psi)\right)^{2}\right]+C_{n}\frac{\log n}{n}\left(\hat{K}(\lambda,\psi)p+q\right), (12)

where $C_{n}$ is a positive number which can depend on $n$. Here we use $C_{n}=c_{0}\log(\log(np+q))$ with $c_{0}=0.2$, following Ma and Huang (2017). To select $\psi$, we choose the best value from a set of candidates, such as 0.1, 0.5, 1, and 3. For each given $\psi$, we use the warm start and continuation strategy of Ma et al. (2019) to select the tuning parameter $\lambda$: a grid of $\lambda$ values is predefined within $[\lambda_{\text{min}},\lambda_{\text{max}}]$, and for each $\lambda$ the initial values are the estimates from the previous fit. Denote the selected tuning parameters as $\hat{\lambda}$ and $\hat{\psi}$. Correspondingly, the estimated group number is $\hat{K}(\hat{\lambda},\hat{\psi})$, and the estimated regression coefficients are $\hat{\bm{\beta}}$ and $\hat{\bm{\eta}}$.
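The criterion (12) is straightforward to compute from per-location residuals; a minimal sketch with our own function name and input layout:

```python
import numpy as np

def modified_bic(resid_by_loc, K_hat, p, q, c0=0.2):
    """Modified BIC (12); resid_by_loc is a list of residual vectors, one per location."""
    n = len(resid_by_loc)
    # (1/n) * sum_i (1/n_i) * sum_h residual^2
    sse = sum((r**2).sum() / len(r) for r in resid_by_loc) / n
    Cn = c0 * np.log(np.log(n * p + q))
    return np.log(sse) + Cn * (np.log(n) / n) * (K_hat * p + q)
```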

3 Theoretical Properties of SHADE

In this section, we study theoretical properties of the proposed SHADE estimator. Assume the $\mathcal{G}_{k}$'s form the true partition of the location-specific regression coefficients. Let $|\mathcal{G}_{k}|$ be the number of subjects in group $\mathcal{G}_{k}$ for $k=1,\dots,K$, and let $|\mathcal{G}_{\min}|$ and $|\mathcal{G}_{\max}|$ be the minimum and maximum group sizes, respectively. Let $\widetilde{\bm{W}}$ be an $n\times K$ matrix with elements $w_{ik}$, where $w_{ik}=1$ if $i\in\mathcal{G}_{k}$ and $w_{ik}=0$ otherwise. Denote $\bm{W}=\widetilde{\bm{W}}\otimes\bm{I}_{p}$, which is an $np\times Kp$ matrix, and $\bm{U}=(\bm{Z},\bm{X}\bm{W})$. Define $\mathcal{M}_{\mathcal{G}}=\{\bm{\beta}\in\mathbb{R}^{np}:\,\bm{\beta}_{i}=\bm{\beta}_{j},\text{ for }i,j\in\mathcal{G}_{k},\,1\leq k\leq K\}$. Using these notations, we can express $\bm{\beta}$ as $\bm{\beta}=\bm{W}\bm{\alpha}$ if $\bm{\beta}\in\mathcal{M}_{\mathcal{G}}$, where $\bm{\alpha}=(\bm{\alpha}_{1}^{T},\dots,\bm{\alpha}_{K}^{T})^{T}$. For any positive sequences $x_{n}$ and $y_{n}$, $x_{n}\gg y_{n}$ means that $x_{n}^{-1}y_{n}=o(1)$. Define the scaled penalty function by

ργ(t)=λ1pγ(t,λ).\rho_{\gamma}(t)=\lambda^{-1}p_{\gamma}(t,\lambda). (13)

Below are our assumptions, where (C1) and (C3) follow those in Ma et al., (2019).

  1. (C1)

The function $\rho_{\gamma}(t)$ is symmetric, non-decreasing, and concave on $[0,\infty)$. It is constant for $t\geq a\lambda$ for some constant $a>0$, and $\rho_{\gamma}(0)=0$. Also, $\rho_{\gamma}^{\prime}(t)$ exists and is continuous except for a finite number of values of $t$, and $\rho_{\gamma}^{\prime}(0+)=1$.

  2. (C2)

There exist finite positive constants $M_{1},M_{2},M_{3}>0$ such that $|x_{ih,l}|\leq M_{1}$ and $|z_{ih,l}|\leq M_{1}$ for $h=1,\dots,n_{i}$ and $i=1,\dots,n$, and $M_{2}\leq\max_{i}n_{i}/\min_{i}n_{i}\leq M_{3}$. Also, assume that $\lambda_{\text{min}}(\bm{U}^{T}\bm{\Omega}\bm{U})\geq C_{1}|\mathcal{G}_{\text{min}}|$ and $\lambda_{\text{max}}(\bm{U}^{T}\bm{\Omega}\bm{U})\leq C_{1}^{\prime}n$ for some constants $0<C_{1}<\infty$ and $0<C_{1}^{\prime}<\infty$, where $\lambda_{\text{min}}$ and $\lambda_{\text{max}}$ denote the minimum and maximum eigenvalues, respectively. In addition, assume that $\sup_{i,h}\|\bm{x}_{ih}\|\leq C_{2}\sqrt{p}$ and $\sup_{i,h}\|\bm{z}_{ih}\|\leq C_{3}\sqrt{q}$ for some constants $0<C_{2}<\infty$ and $0<C_{3}<\infty$.

  3. (C3)

    The random error vector ϵ=(ϵ11,,ϵ1n1,ϵ21,,ϵ2n2,,ϵn1,,ϵnnn)T\bm{\epsilon}=\left(\epsilon_{11},\dots,\epsilon_{1n_{1}},\epsilon_{21},\dots,\epsilon_{2n_{2}},\dots,\epsilon_{n1},\dots,\epsilon_{nn_{n}}\right)^{T} has sub-Gaussian tails such that P(|𝒂Tϵ|>𝒂x)2exp(c1x2)P\left(\left|\bm{a}^{T}\bm{\epsilon}\right|>\left\|\bm{a}\right\|x\right)\leq 2\exp\left(-c_{1}x^{2}\right) for any vector 𝒂m\bm{a}\in\mathbb{R}^{m} and x>0x>0, where 0<c1<0<c_{1}<\infty and m=i=1nnim=\sum_{i=1}^{n}n_{i}.

  4. (C4)

    The pairwise weights cijc_{ij}’s are bounded away from zero if ii and jj are in the same group.

Conditions (C1) and (C3) are commonly used in high-dimensional penalized regression problems and are also adopted in Ma et al. (2019). Condition (C2) is similar to the corresponding condition in Ma et al. (2019), and additionally includes boundedness conditions on the covariates, as used in Huang et al. (2004). In general, weight functions that are strictly positive, rather than supported only on a finite range, yield $c_{ij}$'s that satisfy condition (C4).

First, we establish the properties of the oracle estimator, which is defined as the weighted least squares estimator assuming that the underlying group structure is known. Specifically, the oracle estimator of (𝜼,𝜶)(\bm{\eta},\bm{\alpha}) is

(𝜼^or,𝜶^or)\displaystyle\left(\hat{\bm{\eta}}^{or},\hat{\bm{\alpha}}^{or}\right) =argmin𝜼q,𝜶Kp12𝛀1/2(𝒚𝒁𝜼𝑿𝑾𝜶)2\displaystyle=\operatorname*{arg\,min}_{\bm{\eta}\in\mathbb{R}^{q},\bm{\alpha}\in\mathbb{R}^{Kp}}\frac{1}{2}\left\|\bm{\Omega}^{1/2}\left(\bm{y}-\bm{Z}\bm{\eta}-\bm{X}\bm{W\alpha}\right)\right\|^{2}
=(𝑼T𝛀𝑼)1𝑼T𝛀𝒚.\displaystyle=\left(\bm{U}^{T}\bm{\Omega}\bm{U}\right)^{-1}\bm{U}^{T}\bm{\Omega}\bm{y}. (14)

The corresponding oracle estimator of $\bm{\beta}$ is $\hat{\bm{\beta}}^{or}=\bm{W}\hat{\bm{\alpha}}^{or}$. Let $\bm{\alpha}_{k}^{0}$ be the true coefficient vector for group $k$, $k=1,\dots,K$, let $\bm{\alpha}^{0}=((\bm{\alpha}_{1}^{0})^{T},\dots,(\bm{\alpha}_{K}^{0})^{T})^{T}$, and let $\bm{\eta}^{0}$ be the true common coefficient vector. The following theorem shows the properties of the oracle estimator; a short computational sketch of (14) is given first.
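As an illustration of (14), the oracle estimator is a single weighted least squares fit once the membership matrix $\bm{W}$ is formed from the true labels; a numpy sketch under our own naming conventions:

```python
import numpy as np
from scipy.linalg import block_diag

def oracle_estimator(y, Z, X_blocks, n_i, labels, K):
    """Weighted least squares (14) with the true partition known."""
    p = X_blocks[0].shape[1]
    X = block_diag(*X_blocks)
    W = np.kron(np.eye(K)[labels], np.eye(p))       # W = W_tilde kron I_p
    U = np.hstack([Z, X @ W])                       # U = (Z, X W)
    Omega = np.diag(np.repeat(1.0 / np.asarray(n_i, float), n_i))
    coef = np.linalg.solve(U.T @ Omega @ U, U.T @ Omega @ y)
    q = Z.shape[1]
    return coef[:q], coef[q:].reshape(K, p)         # (eta_or, alpha_or)
```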

Theorem 1.

Suppose

|𝒢min|(q+Kp)1/2max(nmininilogn,(q+Kp)1/2).\left|\mathcal{G}_{\min}\right|\gg\left(q+Kp\right)^{1/2}\max\left(\sqrt{\frac{n}{\min_{i}n_{i}}\log n},\left(q+Kp\right)^{1/2}\right).

Under conditions (C1)-(C3), q=o(n)q=o(n) and Kp=o(n)Kp=o(n), we have with probability at least 12(q+Kp)n11-2(q+Kp)n^{-1},

(𝜼^or𝜼0𝜶^or𝜶0)ϕn,\left\|\left(\begin{array}[]{c}\hat{\bm{\eta}}^{or}-\bm{\eta}^{0}\\ \hat{\bm{\alpha}}^{or}-\bm{\alpha}^{0}\end{array}\right)\right\|\leq\phi_{n},

and

𝜷^or𝜷0|𝒢max|ϕn;supi𝜷^ior𝜷i0ϕn,\left\|\hat{\bm{\beta}}^{or}-\bm{\beta}^{0}\right\|\leq\sqrt{\left|\mathcal{G}_{\max}\right|}\phi_{n};\,\sup_{i}\left\|\hat{\bm{\beta}}_{i}^{or}-\bm{\beta}_{i}^{0}\right\|\leq\phi_{n},

where

ϕn=c11/2C11M1q+Kp|𝒢min|1nminnilogn.\phi_{n}=c_{1}^{-1/2}C_{1}^{-1}M_{1}\sqrt{q+Kp}\left|\mathcal{G}_{\min}\right|^{-1}\sqrt{\frac{n}{\min n_{i}}\log n}.

Furthermore, for any vector 𝐚nq+Kp\bm{a}_{n}\in\mathbb{R}^{q+Kp}, we have as nn\rightarrow\infty

σn(𝒂n)1𝒂nT((𝜼^or𝜼0)T,(𝜶^or𝜶0)T)T𝑑N(0,1),\sigma_{n}(\bm{a}_{n})^{-1}\bm{a}_{n}^{T}\left(\left(\hat{\bm{\eta}}^{or}-\bm{\eta}^{0}\right)^{T},\left(\hat{\bm{\alpha}}^{or}-\bm{\alpha}^{0}\right)^{T}\right)^{T}\overset{d}{\rightarrow}N(0,1), (15)

where

σn(𝒂n)=σ[𝒂nT(𝑼T𝛀𝑼)1𝑼T𝛀𝛀𝑼(𝑼T𝛀𝑼)1𝒂n]1/2.\sigma_{n}(\bm{a}_{n})=\sigma\left[\bm{a}_{n}^{T}\left(\bm{U}^{T}\bm{\Omega}\bm{U}\right)^{-1}\bm{U}^{T}\bm{\Omega}\bm{\Omega}\bm{U}\left(\bm{U}^{T}\bm{\Omega}\bm{U}\right)^{-1}\bm{a}_{n}\right]^{1/2}. (16)
Remark 3.

We do not impose any specific assumptions on $n_{i}$. If $\min n_{i}\ll\frac{n}{q+Kp}\log n$ or $\min n_{i}=O\left(\frac{n}{q+Kp}\log n\right)$, the requirement becomes $|\mathcal{G}_{\min}|\gg(q+Kp)^{1/2}\sqrt{\frac{n}{\min_{i}n_{i}}\log n}$. If $\min n_{i}\gg\frac{n}{q+Kp}\log n$, the requirement becomes $|\mathcal{G}_{\min}|\gg q+Kp$. In this case, if $q$, $p$ and $K$ are fixed, all we need is $1/|\mathcal{G}_{\min}|=o(1)$.

Remark 4.

The model considered in Ma et al. (2019) is a special case of our model, and their condition is a special case of ours with $n_{i}=1$.

Remark 5.

If $|\mathcal{G}_{\min}|=\delta n/K$ for some constant $0<\delta\leq 1$, then

ϕn=c11/2C11M1δ1Kq+Kplogn/(nminni).\phi_{n}=c_{1}^{-1/2}C_{1}^{-1}M_{1}\delta^{-1}K\sqrt{q+Kp}\sqrt{\log n/(n\min n_{i})}.

Moreover, if $q$, $p$ and $K$ are fixed, then $\phi_{n}=C^{*}\sqrt{\log n/(n\min_{i} n_{i})}$ for some constant $0<C^{*}<\infty$.

Next, we study the properties of our proposed estimator. Let

b_{n}=\min_{i\in\mathcal{G}_{k},\,j\in\mathcal{G}_{k^{\prime}},\,k\neq k^{\prime}}\left\|\bm{\beta}_{i}^{0}-\bm{\beta}_{j}^{0}\right\|=\min_{k\neq k^{\prime}}\left\|\bm{\alpha}_{k}^{0}-\bm{\alpha}_{k^{\prime}}^{0}\right\| (17)

be the minimal difference among different groups.

Theorem 2.

Suppose the conditions of Theorem 1 hold and (C4) holds. If $b_{n}>a\lambda$ and $\lambda\gg\phi_{n}$ for some constant $a>0$, then there exists a local minimizer $\left(\hat{\bm{\eta}}(\lambda,\psi)^{T},\hat{\bm{\beta}}(\lambda,\psi)^{T}\right)^{T}$ of the objective function $Q_{n}(\bm{\eta},\bm{\beta})$ given in (2) such that

P((𝜼^(λ,ψ)T,𝜷^(λ,ψ)T)T=((𝜼^or)T,(𝜷^or)T)T)1.P\left(\left(\hat{\bm{\eta}}(\lambda,\psi)^{T},\hat{\bm{\beta}}(\lambda,\psi)^{T}\right)^{T}=\left((\hat{\bm{\eta}}^{or})^{T},(\hat{\bm{\beta}}^{or})^{T}\right)^{T}\right)\rightarrow 1. (18)
Remark 6.

Theorem 2 implies that the true group structure can be recovered with probability approaching 1. It also implies that the estimated number of groups $\hat{K}$ satisfies $P\left(\hat{K}(\lambda,\psi)=K\right)\rightarrow 1.$

Let 𝜶^(λ,ψ)\hat{\bm{\alpha}}(\lambda,\psi) be the distinct group vectors of 𝜷^(λ,ψ)\hat{\bm{\beta}}(\lambda,\psi). According to Theorem 1 and Theorem 2, we have the following result.

Corollary 1.

Suppose the conditions in Theorem 2 hold. Then for any vector $\bm{a}_{n}\in\mathbb{R}^{q+Kp}$, we have, as $n\rightarrow\infty$,

σn(𝒂n)1𝒂nT((𝜼^(λ,ψ)𝜼0)T,(𝜶^(λ,ψ)𝜶0)T)T𝑑N(0,1).\sigma_{n}(\bm{a}_{n})^{-1}\bm{a}_{n}^{T}\left(\left(\hat{\bm{\eta}}(\lambda,\psi)-\bm{\eta}^{0}\right)^{T},\left(\hat{\bm{\alpha}}(\lambda,\psi)-\bm{\alpha}^{0}\right)^{T}\right)^{T}\overset{d}{\rightarrow}N(0,1). (19)
Remark 7.

The variance parameter σ2\sigma^{2} can be estimated by

\hat{\sigma}^{2}=\frac{1}{m-q-\hat{K}p}\sum_{i=1}^{n}\sum_{h=1}^{n_{i}}\left(y_{ih}-\bm{z}_{ih}^{T}\hat{\bm{\eta}}-\bm{x}_{ih}^{T}\hat{\bm{\beta}}_{i}\right)^{2}, (20)
where $m=\sum_{i=1}^{n}n_{i}$ as defined in condition (C3).

The algorithm is implemented in the R package Spgr, available at https://github.com/wangx23/Spgr.

4 Simulation Studies

In this section, we evaluate and compare the performance of the proposed SHADE estimator with different weight choices: equal weights $c_{ij}=1$ (denoted as “equal”), weights defined in (4) (denoted as “reg_sp”), weights defined in (5) (denoted as “reg”), and weights defined in (6) (denoted as “sp”).

The simulations are carried out as follows. Let $\bm{z}_{ih}=(z_{ih,1},\dots,z_{ih,5})^{T}$ with $z_{ih,1}=1$, and $(z_{ih,2},\dots,z_{ih,5})^{T}$ generated from a multivariate normal distribution with mean 0, variance 1, and pairwise correlation $\rho=0.3$. Define $\bm{x}_{ih}=(x_{ih,1},x_{ih,2})^{T}$, where $x_{ih,1}$ is simulated from a standard normal distribution and $x_{ih,2}$ is simulated from a centered and standardized Binomial$(n,0.7)$ distribution. Let $\bm{\eta}=(\eta_{1},\dots,\eta_{5})^{T}$, where the $\eta_{k}$'s are simulated from Uniform$[1,2]$, and the standard deviation of the error term is $\sigma=0.5$. We set $\vartheta=1$ and $\gamma=3$ and use the SCAD penalty function. The tuning parameters are chosen by the modified BIC defined in (12). We consider several simulation scenarios; the results are based on 100 simulation replicates.
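For reproducibility, a sketch of this data-generating mechanism in Python is given below; we read the “centered and standardized binomial” covariate as a standardized Bernoulli(0.7) draw, which is an assumption on our part, and all names are our own.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(beta, n_i=10, rho=0.3, sigma=0.5):
    """Generate {(y, z, x)} from model (1) under the Section 4 design.
    beta: (n, 2) array of location-specific coefficients."""
    n = beta.shape[0]
    eta = rng.uniform(1, 2, size=5)
    Sigma = rho + (1 - rho) * np.eye(4)              # unit variance, corr 0.3
    data = []
    for i in range(n):
        z = np.hstack([np.ones((n_i, 1)),
                       rng.multivariate_normal(np.zeros(4), Sigma, size=n_i)])
        b = rng.binomial(1, 0.7, size=n_i)           # assumed Bernoulli(0.7)
        x = np.column_stack([rng.standard_normal(n_i),
                             (b - 0.7) / np.sqrt(0.7 * 0.3)])
        y = z @ eta + x @ beta[i] + sigma * rng.standard_normal(n_i)
        data.append((y, z, x))
    return data, eta
```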

To evaluate the subgrouping performance of the proposed method, we report the estimated group number $\hat{K}$, the adjusted Rand index (ARI) (Rand, 1971; Hubert and Arabie, 1985; Vinh et al., 2010), and the root mean square error (RMSE) for estimating $\bm{\beta}$. For $\hat{K}$ over the 100 simulations, we report its average (denoted by “mean”), its standard error in parentheses, and the percentage of replicates with $\hat{K}=K$ (denoted by “per”). The ARI measures the degree of agreement between two partitions, taking values between 0 and 1: the larger the ARI, the stronger the agreement. We report the average ARI across the 100 simulations along with its standard error in parentheses. To evaluate the estimation accuracy of $\bm{\beta}$, we also report the average RMSE

1ni=1n𝜷^i𝜷i2.\sqrt{\frac{1}{n}\sum_{i=1}^{n}\|\hat{\bm{\beta}}_{i}-\bm{\beta}_{i}\|^{2}}. (21)
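Both metrics are easy to compute once the estimated labels are in hand; a short sketch using scikit-learn's adjusted_rand_score (the function name `evaluate` is ours):

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

def evaluate(beta_hat, beta_true, labels_hat, labels_true):
    """ARI between the two partitions and the RMSE in (21)."""
    ari = adjusted_rand_score(labels_true, labels_hat)
    rmse = np.sqrt(np.mean(np.sum((beta_hat - beta_true) ** 2, axis=1)))
    return ari, rmse
```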

4.1 Balanced group

We assume that there are K=3K=3 true groups 𝒢1,𝒢2\mathcal{G}_{1},\mathcal{G}_{2} and 𝒢3\mathcal{G}_{3}. Consider the two spatial settings, for which the group parameters are respectively given by:

Setting 1: 𝜷i=(1,1)T\bm{\beta}_{i}=(1,1)^{T} if i𝒢1i\in\mathcal{G}_{1}; 𝜷i=(1.5,1.5)T\bm{\beta}_{i}=(1.5,1.5)^{T} if i𝒢2i\in\mathcal{G}_{2}; 𝜷i=(2,2)T\bm{\beta}_{i}=(2,2)^{T} if i𝒢3i\in\mathcal{G}_{3}.

Setting 2: 𝜷i=(1,1)T\bm{\beta}_{i}=(1,1)^{T} if i𝒢1i\in\mathcal{G}_{1}; 𝜷i=(1.25,1.25)T\bm{\beta}_{i}=(1.25,1.25)^{T} if i𝒢2i\in\mathcal{G}_{2}; 𝜷i=(1.5,1.5)T\bm{\beta}_{i}=(1.5,1.5)^{T} if i𝒢3i\in\mathcal{G}_{3}.

Under each setting, we simulate data on two sizes of regular lattice, a $7\times 7$ grid (left) and a $10\times 10$ grid (right), as shown in Figure 1. Furthermore, for the $7\times 7$ grid with $n_{i}=10$, we also use 10-fold cross-validation to select the tuning parameters: the repeated measures at each location are divided into 10 parts, the $j$th part of each location is combined to form the validation set, and the remaining observations form the training set. The spatial weights (6) are used in this case, and the results are labeled “cv” in the tables. Note that “reg_sp” and “reg” were not computed for the $10\times 10$ grid.

Figure 1: Two spatial settings in the simulation studies: (a) the $7\times 7$ grid; (b) the $10\times 10$ grid.

Results for Setting 1: Tables 1, 2 and 3 show the estimated number of groups and the ARI. Figures 2 and 3 plot the RMSE of the estimates obtained using different weight choices. After estimating the group structure, one can also re-estimate the parameters $\bm{\eta}$ and $\bm{\beta}$ by treating the estimated group structure as known; the results are denoted as “refit”. Based on the numerical results, we make the following observations.

First, we summarize the results for the $7\times 7$ grid. In all the considered scenarios, the spatially weighted penalty outperforms the non-weighted penalty (“equal”). The upper panels of Tables 1 and 2 and the left plot in Figure 2 suggest that, when the number of repeated measurements is relatively small (say, $n_{i}=10$), the weights “reg_sp” and “sp” perform similarly and are the best in terms of estimating $K$, recovering the true subgroup structure (large ARI), and estimating the regression coefficients (small RMSE); the weights “equal” and “reg” are much worse. The lower panels of Tables 1 and 2 and the right plot in Figure 2 show that, when the number of repeated measurements gets larger (say, $n_{i}=30$), all the methods improve and there is not much difference among them. Cross-validation works well in terms of ARI and RMSE, but it tends to over-estimate the number of groups $K$. This is because cross-validation focuses more on prediction accuracy: the coefficient estimates of some groups are close to the true coefficients, but they are not shrunk together. In addition, refitting the model does not appear to further improve the accuracy of estimating $\bm{\beta}$.

Table 1: Summary of the estimate $\hat{K}$ for Setting 1 under the $7\times 7$ grid.
equal reg_sp reg sp cv
$n_i=10$ mean 3.34(0.054) 3.15(0.039) 3.33(0.051) 3.13(0.034) 3.82(0.13)
per 0.69 0.86 0.69 0.87 0.56
$n_i=30$ mean 3.00(0) 3.00(0) 3.00(0) 3.00(0)
per 1.00 1.00 1.00 1.00
Table 2: Average ARI for Setting 1 under the $7\times 7$ grid.
equal reg_sp reg sp cv
$n_i=10$ 0.80(0.011) 0.92(0.008) 0.82(0.01) 0.92(0.007) 0.95(0.007)
$n_i=30$ 0.998(0.001) 0.999(0.0006) 0.998(0.001) 0.999(0.0006)
Figure 2: RMSE for Setting 1 under the $7\times 7$ grid.

Next, we summarize the results for the $10\times 10$ grid. In this case, we consider the equal and spatial weights only. Again, the spatially-weighted penalty outperforms the non-weighted penalty (“equal”). Table 3 and Figure 3 suggest that, when the number of repeated measurements is relatively small (say, $n_{i}=10$), “sp” performs much better than “equal” in terms of grouping and estimating the regression coefficients; for a larger number of repeated measurements (say, $n_{i}=30$), they perform similarly.

Table 3: Summary of $\hat{K}$ and average ARI for Setting 1 under the $10\times 10$ grid.
$\hat{K}$ ARI
equal sp equal sp
$n_i=10$ mean 3.59(0.073) 3.37(0.065) 0.70(0.009) 0.97(0.003)
per 0.53 0.71 - -
$n_i=30$ mean 3(0) 3(0) 0.996(0.001) 1.00(0)
per 1.00 1.00 - -
Figure 3: RMSE for Setting 1 under the $10\times 10$ grid.

Results for Setting 2: In this setting, the group differences become smaller. Tables 4 and 5 and Figure 4 summarize the results for the $7\times 7$ grid. For both values of $n_{i}$, the “sp” weights perform best in terms of estimating the number of groups ($\hat{K}$), recovering the true group structure (ARI), and estimating the regression coefficients. In contrast to Setting 1, when the difference among groups becomes smaller, the model with the spatial weights is superior to the other models even with $n_{i}=30$.

Table 4: Summary of $\hat{K}$ for Setting 2 under the $7\times 7$ grid.
equal reg_sp reg sp
$n_i=10$ mean 3.25(0.119) 3.01(0.093) 3.14(0.107) 2.88(0.067)
per 0.34 0.45 0.33 0.60
$n_i=30$ mean 2.70(0.046) 2.90(0.030) 2.76(0.043) 2.95(0.022)
per 0.70 0.90 0.76 0.95
Table 5: Average ARI for Setting 2 under the $7\times 7$ grid.
equal reg_sp reg sp
$n_i=10$ 0.32(0.011) 0.50(0.023) 0.33(0.01) 0.61(0.026)
$n_i=30$ 0.72(0.018) 0.86(0.015) 0.75(0.017) 0.90(0.012)
Figure 4: RMSE for Setting 2 under the $7\times 7$ grid.

Table 6 and Figure 5 show the results for the $10\times 10$ grid, where again we only compare the “equal” and “sp” weights. The results suggest conclusions similar to those for the $7\times 7$ grid: the model with the spatial weights is superior, producing larger ARI and smaller RMSE even with a large number of repeated measurements ($n_{i}=30$).

Table 6: Summary of $\hat{K}$ and average ARI for Setting 2 under the $10\times 10$ grid.
$\hat{K}$ ARI
equal sp equal sp
$n_i=10$ mean 3.82(0.146) 3.35(0.078) 0.32(0.009) 0.81(0.022)
per 0.32 0.62 - -
$n_i=30$ mean 3.10(0.060) 3.00(0.0) 0.79(0.012) 0.94(0.005)
per 0.64 1.00 - -
Figure 5: RMSE for Setting 2 under the $10\times 10$ grid.

4.2 Unbalanced group setting

Here we consider an unbalanced group setting, as shown in Figure 6. There are four groups, denoted $\mathcal{G}_{1},\mathcal{G}_{2},\mathcal{G}_{3}$ and $\mathcal{G}_{4}$; two groups have 9 subjects each and the other two have 41 subjects each. The group parameters are $\bm{\beta}_{i}=(1,1)^{T}$ if $i\in\mathcal{G}_{1}$, $\bm{\beta}_{i}=(1.5,1.5)^{T}$ if $i\in\mathcal{G}_{2}$, $\bm{\beta}_{i}=(2,2)^{T}$ if $i\in\mathcal{G}_{3}$, and $\bm{\beta}_{i}=(2.5,2.5)^{T}$ if $i\in\mathcal{G}_{4}$.

Figure 6: Unbalanced group setting.

Table 7 and Figure 7 summarize $\hat{K}$, the ARI, and the RMSE for $\bm{\beta}$ when the number of repeated measurements is $n_{i}=10$. Overall, “reg_sp” and “sp” perform better than the other two types of weights; in particular, “sp” performs slightly better than “reg_sp”. The results are consistent with those in the balanced cases. We expect that, when the group differences become smaller, “sp” would still outperform the other weights even when the number of repeated measurements is large.

Table 7: Summary of $\hat{K}$ and average ARI for the unbalanced setting with $n_i=10$.
equal reg_sp reg sp
$\hat{K}$ mean 4.58(0.093) 4.23(0.049) 5.17(0.011) 4.35(0.059)
per 0.57 0.80 0.30 0.71
ARI mean 0.62(0.010) 0.94(0.061) 0.67(0.009) 0.96(0.004)
Figure 7: RMSE for the unbalanced setting.

4.3 Random group setting

We also consider a setting without a spatially contiguous group structure: each location is assigned to one of the three groups with equal probability. Table 8 summarizes $\hat{K}$ and the ARI for Setting 1 under the $7\times 7$ grid with 10 repeated measures, and Table 9 does the same for Setting 2 under the $7\times 7$ grid with 30 repeated measures. Figure 8 shows the RMSE results for both cases. The different weights perform similarly, suggesting that even without prior information on the existence of spatial groups, the “sp” weights can still produce results comparable to equal weights.

Table 8: Summary of $\hat{K}$ and average ARI for Setting 1 under the $7\times 7$ grid with $n_i=10$.
equal reg_sp reg sp
$\hat{K}$ mean 3.42(0.064) 3.45(0.063) 3.40(0.059) 3.45(0.063)
per 0.66 0.62 0.65 0.62
ARI mean 0.78(0.011) 0.82(0.010) 0.81(0.010) 0.82(0.011)
Table 9: Summary of $\hat{K}$ and average ARI for Setting 2 under the $7\times 7$ grid with $n_i=30$.
equal reg_sp reg sp
$\hat{K}$ mean 2.77(0.045) 2.77(0.045) 2.83(0.040) 2.73(0.047)
per 0.75 0.75 0.81 0.71
ARI mean 0.74(0.015) 0.76(0.016) 0.77(0.014) 0.74(0.017)
Figure 8: RMSE for random groups under the $7\times 7$ grid.

5 Application

In this section, we apply the SHADE method to modeling the National Resources Inventory (NRI) survey data (https://www.nrcs.usda.gov/wps/portal/nrcs/main/national/technical/nra/nri/) for the purpose of small area estimation. The NRI survey is one of the largest longitudinal natural resource surveys in the U.S., and its national and state level estimates of the status and change of land cover, land use, and soil erosion have been used by numerous federal, state, and local agencies over the past few decades. In recent years, there has been an increasing demand for NRI to provide various county level estimates, including estimates of different land covers such as cropland, pasture land, urban, and forest. Due to the limitation of sample size, the uncertainty of the NRI direct county level estimates is usually too large for local stakeholders to make policy decisions. To make the county level estimates more useful, it is necessary to incorporate auxiliary information and an appropriate model to reduce the uncertainty of the estimates. One such source of auxiliary covariates is the Cropland Data Layer (CDL), which is based on the classification of satellite image pixels into several mutually exclusive and exhaustive land cover categories. In this section, we model the relationship between the NRI forest proportion and the CDL forest proportion across 48 states. In NRI, forests on federal land, such as national parks, are not included in the forest category. For states with more federal forest land, NRI estimates can be substantially smaller than CDL estimates. Therefore, different states could have different relationships between these two proportions.

The model we consider is

yih=β0,i+β1,ixih+ϵihy_{ih}=\beta_{0,i}+\beta_{1,i}x_{ih}+\epsilon_{ih} (22)

where $y_{ih}$ is the NRI forest proportion of the $h$th county in the $i$th state, $x_{ih}$ is the corresponding CDL forest proportion, and $\beta_{0,i}$ and $\beta_{1,i}$ are unknown coefficients. Both $x$ and $y$ are standardized. Instead of using the estimated linear regression coefficients directly as initial values, we use five sets of initial values simulated from a multivariate normal distribution with the estimated coefficients as the mean vector and the estimated covariance matrix as the covariance matrix. The models with the smallest modified BIC values are selected for equal weights and spatial weights, respectively.
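A sketch of how such initial values could be produced is given below: per-state OLS fits of (22), with the five perturbed initial sets drawn from a normal distribution centered at the OLS estimates. All function names are hypothetical, and using the standard OLS covariance estimate is an assumption on our part.

```python
import numpy as np

def state_initial_values(states, rng, n_sets=5):
    """states: dict mapping state -> (y, x) arrays of standardized county
    forest proportions. Returns n_sets perturbed (beta0, beta1) per state."""
    inits = {}
    for s, (y, x) in states.items():
        X = np.column_stack([np.ones_like(x), x])    # intercept + CDL proportion
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ coef
        dof = max(len(y) - 2, 1)
        cov = (resid @ resid / dof) * np.linalg.inv(X.T @ X)  # OLS covariance
        inits[s] = rng.multivariate_normal(coef, cov, size=n_sets)
    return inits
```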

Figure 9 shows the estimated groups based on the 2011 NRI data. The left panel plots the estimated groups based on equal weights, and the right panel plots those based on the spatial weights in (6). The two weight choices give different estimated groups. Tables 10 and 11 report the corresponding estimates of the regression coefficients in the different groups.

Figure 9: Estimated groups based on (a) equal weights and (b) spatial weights.
Table 10: Estimated coefficients of different groups for equal weights.
group 1 2
$\beta_0$ -0.029(0.006) 0.003(0.008)
$\beta_1$ 0.885(0.011) 0.241(0.026)
Table 11: Estimated coefficients of different groups for spatial weights.
group 1 2 3 4 5 6 7
$\beta_0$ -0.041(0.016) -0.032(0.006) 0.003(0.007) 0.023(0.015) -0.108(0.293) 0.275(0.038) 0.376(0.309)
$\beta_1$ 1.018(0.028) 0.867(0.012) 0.241(0.024) 0.608(0.033) 1.148(0.377) 0.332(0.064) 0.341(0.384)

With equal weights, $\lambda$ is the only tuning parameter in the algorithm, and changing its value changes the number of groups. We therefore adjust $\lambda$ in the equal-weights algorithm so that the number of groups matches that selected under the spatial weights, namely 7 groups. Figure 10 shows the resulting group structure. In both Figure 10 and the left panel of Figure 9, “WA”, “OR” and “CA” are not separated from the majority group (the group with the largest size) under equal weights. Under spatial weights, these three states form group 4, which is separated from the majority group (group 2); this is more reasonable and intuitive given the estimated regression coefficients in Table 11. Moreover, these three states have more national parks than the states in group 2.

Figure 10: Estimated groups by changing the tuning parameter $\lambda$ with equal weights.

Alternatively, we also apply $K$-means clustering to the initial estimates to identify states with similar behavior. Figure 11 shows the maps based on 2-means and 7-means clustering, respectively. The 2-cluster map is almost the same as the map based on equal weights, while the 7-cluster map is much less interpretable than the result based on spatial weights. This suggests that the proposed procedure can produce more interpretable subgroup structures than $K$-means clustering.

Figure 11: Group clustering results based on $K$-means.

6 Discussion

In this article, we consider the problem of spatially clustering local covariate effects and develop a general framework, Spatial Heterogeneity Automatic Detection and Estimation (SHADE), for spatial areal data with repeated measures. Since locations near each other usually exhibit similar patterns, we incorporate spatial information into the pairwise penalty, assigning larger weights to closer locations to encourage stronger shrinkage. In the simulation study, we investigate and compare the performance of the procedure using different weights and find that spatial information helps improve grouping accuracy, especially when the minimal group difference is small or the number of repeated measures is small. We also establish theoretical properties of the proposed estimator, including its consistency in estimating the number of groups.

In the real data example, we treated states as locations and counties as repeated measures. Alternatively, one could treat counties as the individual units, since counties within a single state may follow different regression relationships. The algorithm would then require inverting a matrix of dimension greater than 3,000, which carries a substantially higher computational burden. A further study is needed to compare these two modeling choices for this application.

The proposed method does not account for spatial dependence in the regression errors when constructing the objective function. The basic idea of the algorithm can be extended to a general spatial clustering setup with spatially dependent errors. Specifically, the weighted least squares term in the objective function would be replaced by a generalized least squares term involving an estimated covariance matrix. The resulting algorithm would iterate between two steps: updating the regression coefficients to find clusters, and updating the covariance parameters. Additional simulation studies are needed to assess the performance of this two-step algorithm, and its theoretical properties need to be established. Both the theoretical and computational aspects of such an extension are nontrivial and will be considered in follow-up work.
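A minimal sketch of the two-step iteration described above is given here, assuming generic update routines `update_coef` (a penalized generalized least squares coefficient update) and `update_cov` (a covariance-parameter update); both are hypothetical placeholders, since the extension is left to future work.

```python
import numpy as np

def two_step_shade(y, design, weights, lam, update_coef, update_cov,
                   max_iter=50, tol=1e-6):
    """Alternate between (i) a penalized generalized least squares update of
    the regression coefficients and (ii) an update of the error covariance."""
    Sigma = np.eye(len(y))           # independence working covariance to start
    beta_old = None
    for _ in range(max_iter):
        # Step 1: cluster-and-estimate coefficients given the current covariance
        eta, beta = update_coef(y, design, weights, lam, Sigma)
        # Step 2: re-estimate covariance parameters from the current residuals
        Sigma = update_cov(y, design, eta, beta)
        if beta_old is not None and np.max(np.abs(beta - beta_old)) < tol:
            break
        beta_old = beta
    return eta, beta, Sigma
```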

Appendices

A Proof of Theorem 1

In this section, we prove Theorem 1. In the proof of the central limit theorem (CLT), we use the technique of Huang et al. (2004).

The oracle estimator is defined in (14) and has the form

\left(\begin{array}{c}\hat{\bm{\eta}}^{or}\\ \hat{\bm{\alpha}}^{or}\end{array}\right)=\left(\bm{U}^{T}\bm{\Omega}\bm{U}\right)^{-1}\bm{U}^{T}\bm{\Omega}\bm{y}.

Thus, we have

\left(\begin{array}{c}\hat{\bm{\eta}}^{or}-\bm{\eta}^{0}\\ \hat{\bm{\alpha}}^{or}-\bm{\alpha}^{0}\end{array}\right)=\left(\bm{U}^{T}\bm{\Omega}\bm{U}\right)^{-1}\bm{U}^{T}\bm{\Omega}\bm{\epsilon},

where $\bm{\epsilon}=(\bm{\epsilon}_{1}^{T},\dots,\bm{\epsilon}_{n}^{T})^{T}$ with $\bm{\epsilon}_{i}=(\epsilon_{i1},\dots,\epsilon_{i,n_{i}})^{T}$. Therefore,

\left\|\left(\begin{array}{c}\hat{\bm{\eta}}^{or}-\bm{\eta}^{0}\\ \hat{\bm{\alpha}}^{or}-\bm{\alpha}^{0}\end{array}\right)\right\|\leq\left\|\left(\bm{U}^{T}\bm{\Omega}\bm{U}\right)^{-1}\right\|_{2}\left\|\bm{U}^{T}\bm{\Omega}\bm{\epsilon}\right\|, (23)

where $\|\cdot\|_{2}$ is the matrix operator norm, defined for a matrix $\bm{A}$ by $\|\bm{A}\|_{2}=\sup_{\|\bm{x}\|=1}\|\bm{A}\bm{x}\|$.
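As a purely numerical aside (not part of the proof), this definition of $\|\cdot\|_{2}$ coincides with the largest singular value; the snippet below, assuming numpy is available, approximates the supremum over random unit vectors and compares it with the SVD value.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))

# ||A||_2 via the definition sup_{||x||=1} ||Ax||, approximated over random unit vectors
xs = rng.normal(size=(3, 100000))
xs /= np.linalg.norm(xs, axis=0)
sup_def = np.linalg.norm(A @ xs, axis=0).max()

# ||A||_2 as the largest singular value
sigma_max = np.linalg.svd(A, compute_uv=False)[0]

print(sup_def, sigma_max)  # the two values agree up to Monte Carlo error
```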

We know that

P\left(\left\|\bm{U}^{T}\bm{\Omega}\bm{\epsilon}\right\|_{\infty}>C\sqrt{\frac{n}{\min n_{i}}\log n}\right)\leq P\left(\left\|\left(\bm{X}\bm{W}\right)^{T}\bm{\Omega}\bm{\epsilon}\right\|_{\infty}>C\sqrt{\frac{n}{\min n_{i}}\log n}\right)+P\left(\left\|\bm{Z}^{T}\bm{\Omega}\bm{\epsilon}\right\|_{\infty}>C\sqrt{\frac{n}{\min n_{i}}\log n}\right), (24)

where $C$ is a finite positive constant and $\|\cdot\|_{\infty}$ is defined, for a vector $\bm{x}\in\mathbb{R}^{m}$, by $\|\bm{x}\|_{\infty}=\max_{1\leq i\leq m}|x_{i}|$. By condition (C2), we have

\sqrt{\sum_{i=1}^{n}\sum_{h=1}^{n_{i}}\frac{x_{ih,l}^{2}}{n_{i}^{2}}1\left\{i\in\mathcal{G}_{k}\right\}}\leq M_{1}\sqrt{\sum_{i=1}^{n}\frac{1}{n_{i}}1\left\{i\in\mathcal{G}_{k}\right\}}\leq M_{1}\sqrt{\sum_{i=1}^{n}\frac{1}{n_{i}}}\leq M_{1}\sqrt{\frac{n}{\min n_{i}}}.

Since

\left\|\left(\bm{X}\bm{W}\right)^{T}\bm{\Omega}\bm{\epsilon}\right\|_{\infty}=\sup_{k,l}\left|\sum_{i=1}^{n}\frac{1}{n_{i}}\sum_{h=1}^{n_{i}}x_{ih,l}\epsilon_{ih}1\left\{i\in\mathcal{G}_{k}\right\}\right|,

from condition (C3), it follows that

P\left(\left\|\left(\bm{X}\bm{W}\right)^{T}\bm{\Omega}\bm{\epsilon}\right\|_{\infty}>C\sqrt{\frac{n}{\min n_{i}}\log n}\right)
\leq \sum_{l=1}^{p}\sum_{k=1}^{K}P\left(\left|\sum_{i=1}^{n}\sum_{h=1}^{n_{i}}\frac{1}{n_{i}}x_{ih,l}\epsilon_{ih}1\left\{i\in\mathcal{G}_{k}\right\}\right|>C\sqrt{\frac{n}{\min n_{i}}\log n}\right)
\leq \sum_{l=1}^{p}\sum_{k=1}^{K}P\left(\left|\sum_{i=1}^{n}\sum_{h=1}^{n_{i}}\frac{1}{n_{i}}x_{ih,l}\epsilon_{ih}1\left\{i\in\mathcal{G}_{k}\right\}\right|>\sqrt{\sum_{i=1}^{n}\sum_{h=1}^{n_{i}}\frac{x_{ih,l}^{2}}{n_{i}^{2}}1\left\{i\in\mathcal{G}_{k}\right\}}\,\frac{C}{M_{1}}\sqrt{\log n}\right)
\leq 2Kp\exp\left(-c_{1}\frac{C^{2}}{M_{1}^{2}}\log n\right)=2Kpn^{-c_{1}C^{2}/M_{1}^{2}},

where the second inequality uses the bound in the preceding display.

Similarly, $\sum_{i=1}^{n}\sum_{h=1}^{n_{i}}\frac{z_{ih,l}^{2}}{n_{i}^{2}}\leq M_{1}^{2}\sum_{i=1}^{n}1/n_{i}\leq M_{1}^{2}\frac{n}{\min n_{i}}$. Again, by condition (C3), we have

P\left(\left\|\bm{Z}^{T}\bm{\Omega}\bm{\epsilon}\right\|_{\infty}>C\sqrt{\frac{n}{\min n_{i}}\log n}\right)
\leq \sum_{l=1}^{q}P\left(\left|\sum_{i=1}^{n}\sum_{h=1}^{n_{i}}\frac{1}{n_{i}}z_{ih,l}\epsilon_{ih}\right|>C\sqrt{\frac{n}{\min n_{i}}\log n}\right)
\leq \sum_{l=1}^{q}P\left(\left|\sum_{i=1}^{n}\sum_{h=1}^{n_{i}}\frac{1}{n_{i}}z_{ih,l}\epsilon_{ih}\right|>\sqrt{\sum_{i=1}^{n}\sum_{h=1}^{n_{i}}\frac{z_{ih,l}^{2}}{n_{i}^{2}}}\,\frac{C}{M_{1}}\sqrt{\log n}\right)
\leq 2q\exp\left(-c_{1}\frac{C^{2}}{M_{1}^{2}}\log n\right)=2qn^{-c_{1}C^{2}/M_{1}^{2}}.

Thus, (24) can be bounded by

P\left(\left\|\bm{U}^{T}\bm{\Omega}\bm{\epsilon}\right\|_{\infty}>C\sqrt{\frac{n}{\min n_{i}}\log n}\right)\leq 2\left(Kp+q\right)n^{-c_{1}C^{2}/M_{1}^{2}}.

Since $\left\|\bm{U}^{T}\bm{\Omega}\bm{\epsilon}\right\|\leq\sqrt{q+Kp}\left\|\bm{U}^{T}\bm{\Omega}\bm{\epsilon}\right\|_{\infty}$, we have

P\left(\left\|\bm{U}^{T}\bm{\Omega}\bm{\epsilon}\right\|>C\sqrt{q+Kp}\sqrt{\frac{n}{\min n_{i}}\log n}\right)\leq 2\left(Kp+q\right)n^{-c_{1}C^{2}/M_{1}^{2}}.

Taking $C=c_{1}^{-1/2}M_{1}$, we obtain

P\left(\left\|\bm{U}^{T}\bm{\Omega}\bm{\epsilon}\right\|>C\sqrt{q+Kp}\sqrt{\frac{n}{\min n_{i}}\log n}\right)\leq 2\left(Kp+q\right)n^{-1}. (25)

Also, according to condition (C2), we have

\left\|\left(\bm{U}^{T}\bm{\Omega}\bm{U}\right)^{-1}\right\|_{2}\leq C_{1}^{-1}\left|\mathcal{G}_{\min}\right|^{-1}. (26)

Combining (23), (25), and (26), with probability at least $1-2\left(Kp+q\right)n^{-1}$, we have

\left\|\left(\begin{array}{c}\hat{\bm{\eta}}^{or}-\bm{\eta}^{0}\\ \hat{\bm{\alpha}}^{or}-\bm{\alpha}^{0}\end{array}\right)\right\|\leq CC_{1}^{-1}\sqrt{q+Kp}\left|\mathcal{G}_{\min}\right|^{-1}\sqrt{\frac{n}{\min n_{i}}\log n}.

Let

\phi_{n}=c_{1}^{-1/2}C_{1}^{-1}M_{1}\sqrt{q+Kp}\left|\mathcal{G}_{\min}\right|^{-1}\sqrt{\frac{n}{\min n_{i}}\log n}.

Furthermore,

\left\|\hat{\bm{\beta}}^{or}-\bm{\beta}^{0}\right\|^{2}=\sum_{k=1}^{K}\sum_{i\in\mathcal{G}_{k}}\left\|\hat{\bm{\alpha}}_{k}^{or}-\bm{\alpha}_{k}^{0}\right\|^{2}\leq\left|\mathcal{G}_{\max}\right|\sum_{k=1}^{K}\left\|\hat{\bm{\alpha}}_{k}^{or}-\bm{\alpha}_{k}^{0}\right\|^{2}=\left|\mathcal{G}_{\max}\right|\left\|\hat{\bm{\alpha}}^{or}-\bm{\alpha}^{0}\right\|^{2}\leq\left|\mathcal{G}_{\max}\right|\phi_{n}^{2},

and

\sup_{i}\left\|\hat{\bm{\beta}}_{i}^{or}-\bm{\beta}_{i}^{0}\right\|=\sup_{k}\left\|\hat{\bm{\alpha}}_{k}^{or}-\bm{\alpha}_{k}^{0}\right\|\leq\left\|\hat{\bm{\alpha}}^{or}-\bm{\alpha}^{0}\right\|\leq\phi_{n}.

Next, we consider the central limit theorem. Let $\bm{U}=\left(\bm{U}_{1}^{T},\dots,\bm{U}_{n}^{T}\right)^{T}$ with $\bm{U}_{i}=(\bm{U}_{i1},\dots,\bm{U}_{i,n_{i}})^{T}$ for $i=1,\dots,n$. Consider

\bm{a}_{n}^{T}\left(\left(\hat{\bm{\eta}}^{or}-\bm{\eta}^{0}\right)^{T},\left(\hat{\bm{\alpha}}^{or}-\bm{\alpha}^{0}\right)^{T}\right)^{T}=\sum_{i=1}^{n}\bm{a}_{n}^{T}\left(\sum_{j=1}^{n}\bm{U}_{j}^{T}\bm{\Omega}_{j}\bm{U}_{j}\right)^{-1}\bm{U}_{i}^{T}\bm{\Omega}_{i}\bm{\epsilon}_{i},

where $\bm{\Omega}_{i}=n_{i}^{-1}\bm{I}_{n_{i}}$. By the assumption on $\bm{\epsilon}_{i}$ in model (1), we have

E\left[\bm{a}_{n}^{T}\left(\left(\hat{\bm{\eta}}^{or}-\bm{\eta}^{0}\right)^{T},\left(\hat{\bm{\alpha}}^{or}-\bm{\alpha}^{0}\right)^{T}\right)^{T}\right]=0.

The variance of $\bm{a}_{n}^{T}\left(\left(\hat{\bm{\eta}}^{or}-\bm{\eta}^{0}\right)^{T},\left(\hat{\bm{\alpha}}^{or}-\bm{\alpha}^{0}\right)^{T}\right)^{T}$ can be written as

\mathrm{Var}\left\{\bm{a}_{n}^{T}\left(\left(\hat{\bm{\eta}}^{or}-\bm{\eta}^{0}\right)^{T},\left(\hat{\bm{\alpha}}^{or}-\bm{\alpha}^{0}\right)^{T}\right)^{T}\right\}=\sigma^{2}\bm{a}_{n}^{T}\left(\bm{U}^{T}\bm{\Omega}\bm{U}\right)^{-1}\bm{U}^{T}\bm{\Omega}\bm{\Omega}\bm{U}\left(\bm{U}^{T}\bm{\Omega}\bm{U}\right)^{-1}\bm{a}_{n}=\sigma^{2}\bm{a}_{n}^{T}\left(\bm{U}^{T}\bm{\Omega}\bm{U}\right)^{-1}\sum_{i=1}^{n}\bm{U}_{i}^{T}\bm{\Omega}_{i}\bm{\Omega}_{i}\bm{U}_{i}\left(\bm{U}^{T}\bm{\Omega}\bm{U}\right)^{-1}\bm{a}_{n}.

Following the proof of Theorem 3 in Huang et al. (2004), $\bm{a}_{n}^{T}\left(\left(\hat{\bm{\eta}}^{or}-\bm{\eta}^{0}\right)^{T},\left(\hat{\bm{\alpha}}^{or}-\bm{\alpha}^{0}\right)^{T}\right)^{T}$ can be written as $\sum_{i=1}^{n}a_{i}\xi_{i}$ with

a_{i}^{2}=\bm{a}_{n}^{T}\left(\bm{U}^{T}\bm{\Omega}\bm{U}\right)^{-1}\bm{U}_{i}^{T}\bm{\Omega}_{i}\bm{\Omega}_{i}\bm{U}_{i}\left(\bm{U}^{T}\bm{\Omega}\bm{U}\right)^{-1}\bm{a}_{n},

where the $\xi_{i}$'s are independent with mean zero and variance one. If

\frac{\max_{i}a_{i}^{2}}{\sum_{i=1}^{n}a_{i}^{2}}\rightarrow 0,

then $\sum_{i=1}^{n}a_{i}\xi_{i}/\sqrt{\sum_{i=1}^{n}a_{i}^{2}}$ is asymptotically $N\left(0,1\right)$.

For any $\bm{\lambda}=\left(\lambda_{1},\dots,\lambda_{q+Kp}\right)^{T}$, we have

\bm{\lambda}^{T}\bm{U}_{i}^{T}\bm{\Omega}_{i}\bm{\Omega}_{i}\bm{U}_{i}\bm{\lambda}=\frac{1}{n_{i}^{2}}\bm{\lambda}^{T}\bm{U}_{i}^{T}\bm{U}_{i}\bm{\lambda}=\frac{1}{n_{i}^{2}}\sum_{h=1}^{n_{i}}\bm{\lambda}^{T}\bm{U}_{ih}\bm{U}_{ih}^{T}\bm{\lambda}=\frac{1}{n_{i}^{2}}\sum_{h=1}^{n_{i}}\left(\sum_{l=1}^{q+Kp}U_{ih,l}\lambda_{l}\right)^{2}\leq\frac{1}{n_{i}^{2}}\sum_{h=1}^{n_{i}}\left(\sum_{l=1}^{q+Kp}U_{ih,l}^{2}\right)\left(\sum_{l=1}^{q+Kp}\lambda_{l}^{2}\right)\leq\frac{M_{1}^{2}}{n_{i}}\left(q+Kp\right)\left\|\bm{\lambda}\right\|^{2},

and

\bm{\lambda}^{T}\left(\sum_{i=1}^{n}\bm{U}_{i}^{T}\bm{\Omega}_{i}\bm{\Omega}_{i}\bm{U}_{i}\right)\bm{\lambda}\geq\frac{1}{\max_{i}n_{i}}\bm{\lambda}^{T}\left(\sum_{i=1}^{n}\bm{U}_{i}^{T}\bm{\Omega}_{i}\bm{U}_{i}\right)\bm{\lambda}=\frac{1}{\max_{i}n_{i}}\bm{\lambda}^{T}\bm{U}^{T}\bm{\Omega}\bm{U}\bm{\lambda}\geq\frac{1}{\max_{i}n_{i}}C_{1}\left|\mathcal{G}_{\min}\right|\left\|\bm{\lambda}\right\|^{2},

where the last inequality is by condition (C2). So,

\frac{\max_{i}\bm{\lambda}^{T}\bm{U}_{i}^{T}\bm{\Omega}_{i}\bm{\Omega}_{i}\bm{U}_{i}\bm{\lambda}}{\bm{\lambda}^{T}\left(\sum_{i=1}^{n}\bm{U}_{i}^{T}\bm{\Omega}_{i}\bm{\Omega}_{i}\bm{U}_{i}\right)\bm{\lambda}}\leq\left(\max_{i}n_{i}\right)\left(\max_{i}\frac{1}{n_{i}}\right)M_{1}^{2}C_{1}^{-1}\left|\mathcal{G}_{\min}\right|^{-1}\left(q+Kp\right)=M_{1}^{2}C_{1}^{-1}\frac{\max_{i}n_{i}}{\min_{i}n_{i}}\left|\mathcal{G}_{\min}\right|^{-1}\left(q+Kp\right)\rightarrow 0, (27)

by assumption.

By (27), $\max_{i}a_{i}^{2}/\sum_{i=1}^{n}a_{i}^{2}\rightarrow 0$, and the asymptotic normality in (15) follows.
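As a numerical aside (not part of the proof), the following Python sketch illustrates the Lindeberg-type condition above: it simulates $\sum_{i}a_{i}\xi_{i}/\sqrt{\sum_{i}a_{i}^{2}}$ for non-Gaussian $\xi_{i}$ and weights whose maximal contribution vanishes; the specific weight construction is an assumption chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 2000, 5000

# weights with max_i a_i^2 / sum_i a_i^2 -> 0 as n grows (illustrative choice)
a = 1.0 / np.sqrt(np.arange(1, n + 1))
print("max a_i^2 / sum a_i^2:", (a**2).max() / (a**2).sum())

# xi_i: independent, mean zero, variance one (here: centered exponentials)
xi = rng.exponential(size=(reps, n)) - 1.0
stats = (xi * a).sum(axis=1) / np.sqrt((a**2).sum())

print("sample mean:", stats.mean())   # approximately 0
print("sample var:", stats.var())     # approximately 1
# the empirical distribution of `stats` is close to N(0, 1)
```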

B Proof of Theorem 2

In this section, we prove Theorem 2. As in Ma et al. (2019) and Ma and Huang (2017), we define $T:\,\mathcal{M}_{\mathcal{G}}\rightarrow\mathbb{R}^{Kp}$ to be the mapping such that $T\left(\bm{\beta}\right)=\bm{\alpha}$, and $T^{*}:\,\mathbb{R}^{np}\rightarrow\mathbb{R}^{Kp}$ to be the mapping such that $T^{*}\left(\bm{\beta}\right)=\left(\left|\mathcal{G}_{k}\right|^{-1}\sum_{i\in\mathcal{G}_{k}}\bm{\beta}_{i}^{T},\,k=1,\dots,K\right)^{T}$.
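As a concrete illustration of the mapping $T^{*}$ (a minimal sketch under an assumed partition, with made-up coefficient values), the following computes the group-averaged coefficient vectors:

```python
import numpy as np

# n = 5 locations, p = 2 local coefficients per location (illustrative values)
beta = np.array([[1.0, 0.5],
                 [1.1, 0.4],
                 [3.0, 2.0],
                 [2.9, 2.1],
                 [3.1, 1.9]])
groups = [np.array([0, 1]), np.array([2, 3, 4])]  # assumed partition G_1, G_2

# T*(beta): stack the within-group averages |G_k|^{-1} sum_{i in G_k} beta_i
alpha = np.concatenate([beta[g].mean(axis=0) for g in groups])
print(alpha)  # [1.05 0.45 3.   2.  ]
```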

Consider the following neighborhood of $\left(\bm{\eta}^{0},\bm{\beta}^{0}\right)$:

\Theta=\left\{\bm{\eta}\in\mathbb{R}^{q},\bm{\beta}\in\mathbb{R}^{np}:\,\left\|\bm{\eta}-\bm{\eta}^{0}\right\|\leq\phi_{n},\ \sup_{i}\left\|\bm{\beta}_{i}-\bm{\beta}_{i}^{0}\right\|\leq\phi_{n}\right\}.

By Theorem 1, there exists an event $E_{1}$, on which $\left\|\hat{\bm{\eta}}^{or}-\bm{\eta}^{0}\right\|\leq\phi_{n}$ and $\sup_{i}\left\|\hat{\bm{\beta}}_{i}^{or}-\bm{\beta}_{i}^{0}\right\|\leq\phi_{n}$, such that $P\left(E_{1}\right)\geq 1-2\left(q+Kp\right)n^{-1}$.

Recall that the objective function to be minimized is given in (2), which has the form

Q_{n}\left(\bm{\eta},\bm{\beta};\lambda,\psi\right)=\frac{1}{2}\sum_{i=1}^{n}\frac{1}{n_{i}}\sum_{h=1}^{n_{i}}\left(y_{ih}-\bm{z}_{ih}^{T}\bm{\eta}-\bm{x}_{ih}^{T}\bm{\beta}_{i}\right)^{2}+\sum_{1\leq i<j\leq n}p_{\gamma}\left(\left\|\bm{\beta}_{i}-\bm{\beta}_{j}\right\|,c_{ij}\lambda\right). (28)

We show that $\left((\hat{\bm{\eta}}^{or})^{T},(\hat{\bm{\beta}}^{or})^{T}\right)^{T}$ is a strict local minimizer of this objective function with probability approaching one, in two steps, as in Ma et al. (2019). The first step is to show that, on the event $E_{1}$, $Q_{n}(\bm{\eta},\bm{\beta}^{*})>Q_{n}(\hat{\bm{\eta}}^{or},\hat{\bm{\beta}}^{or})$ for any $(\bm{\eta}^{T},\bm{\beta}^{T})^{T}\in\Theta$ with $(\bm{\eta}^{T},\bm{\beta}^{*T})^{T}\neq((\hat{\bm{\eta}}^{or})^{T},(\hat{\bm{\beta}}^{or})^{T})^{T}$, where $\bm{\beta}^{*}=T^{-1}\left(T^{*}\left(\bm{\beta}\right)\right)$ and $\bm{\beta}\in\mathbb{R}^{np}$. The proof of this step is almost identical to the first step in Ma et al. (2019) and is omitted here.

In the second step, we show that there exists an event $E_{2}$ with $P\left(E_{2}\right)\geq 1-2n^{-1}$ such that, on $E_{1}\cap E_{2}$, there is a neighborhood $\Theta_{n}$ of $\left((\hat{\bm{\eta}}^{or})^{T},(\hat{\bm{\beta}}^{or})^{T}\right)^{T}$ in which $Q_{n}\left(\bm{\eta},\bm{\beta}\right)\geq Q_{n}\left(\bm{\eta},\bm{\beta}^{*}\right)$ for any $\left(\bm{\eta}^{T},\bm{\beta}^{T}\right)^{T}\in\Theta_{n}\cap\Theta$, for sufficiently large $n$.

Let $\Theta_{n}=\left\{\bm{\beta}:\sup_{i}\left\|\bm{\beta}_{i}-\hat{\bm{\beta}}_{i}^{or}\right\|\leq t_{n}\right\}$, where $t_{n}$ is a positive sequence with $t_{n}=o(1)$. By Taylor's expansion, for $\left(\bm{\eta}^{T},\bm{\beta}^{T}\right)^{T}\in\Theta_{n}\cap\Theta$,

Q_{n}\left(\bm{\eta},\bm{\beta}\right)-Q_{n}\left(\bm{\eta},\bm{\beta}^{*}\right)=\Gamma_{1}+\Gamma_{2}, (29)

where

\Gamma_{1}=-\left(\bm{y}-\bm{Z}\bm{\eta}-\bm{X}\bm{\beta}^{m}\right)^{T}\bm{\Omega}\bm{X}\left(\bm{\beta}-\bm{\beta}^{*}\right),
\Gamma_{2}=\sum_{i=1}^{n}\frac{\partial\left[\lambda\sum_{l<j}c_{lj}\rho_{\gamma}\left(\left\|\bm{\beta}_{l}^{m}-\bm{\beta}_{j}^{m}\right\|\right)\right]}{\partial\bm{\beta}_{i}^{T}}\left(\bm{\beta}_{i}-\bm{\beta}_{i}^{*}\right),

with $\bm{\beta}^{m}=\alpha\bm{\beta}+\left(1-\alpha\right)\bm{\beta}^{*}$ for some constant $\alpha\in\left(0,1\right)$.

Expanding the derivative, $\Gamma_{2}$ can be written as

\Gamma_{2}=\lambda\sum_{i<j}c_{ij}\rho_{\gamma}^{\prime}\left(\left\|\bm{\beta}_{i}^{m}-\bm{\beta}_{j}^{m}\right\|\right)\left\|\bm{\beta}_{i}^{m}-\bm{\beta}_{j}^{m}\right\|^{-1}\left(\bm{\beta}_{i}^{m}-\bm{\beta}_{j}^{m}\right)^{T}\left\{\left(\bm{\beta}_{i}-\bm{\beta}_{i}^{*}\right)-\left(\bm{\beta}_{j}-\bm{\beta}_{j}^{*}\right)\right\}.

For $i,j\in\mathcal{G}_{k}$, we have $\bm{\beta}_{i}^{*}=\bm{\beta}_{j}^{*}$ and $\bm{\beta}_{i}^{m}-\bm{\beta}_{j}^{m}=\alpha\left(\bm{\beta}_{i}-\bm{\beta}_{j}\right)$, so that

\Gamma_{2}=\lambda\sum_{k=1}^{K}\sum_{\left\{i,j\in\mathcal{G}_{k},i<j\right\}}c_{ij}\rho_{\gamma}^{\prime}\left(\left\|\bm{\beta}_{i}^{m}-\bm{\beta}_{j}^{m}\right\|\right)\left\|\bm{\beta}_{i}^{m}-\bm{\beta}_{j}^{m}\right\|^{-1}\left(\bm{\beta}_{i}^{m}-\bm{\beta}_{j}^{m}\right)^{T}\left(\bm{\beta}_{i}-\bm{\beta}_{j}\right)
+\lambda\sum_{k\neq k^{\prime}}\sum_{\left\{i\in\mathcal{G}_{k},j\in\mathcal{G}_{k^{\prime}}\right\}}c_{ij}\rho_{\gamma}^{\prime}\left(\left\|\bm{\beta}_{i}^{m}-\bm{\beta}_{j}^{m}\right\|\right)\left\|\bm{\beta}_{i}^{m}-\bm{\beta}_{j}^{m}\right\|^{-1}\left(\bm{\beta}_{i}^{m}-\bm{\beta}_{j}^{m}\right)^{T}\left\{\left(\bm{\beta}_{i}-\bm{\beta}_{i}^{*}\right)-\left(\bm{\beta}_{j}-\bm{\beta}_{j}^{*}\right)\right\}.

Since $\sup_{i}\left\|\bm{\beta}_{i}^{m}-\bm{\beta}_{i}^{0}\right\|\leq\phi_{n}$, for $k\neq k^{\prime}$, $i\in\mathcal{G}_{k}$, $j\in\mathcal{G}_{k^{\prime}}$,

\left\|\bm{\beta}_{i}^{m}-\bm{\beta}_{j}^{m}\right\|\geq\min_{i\in\mathcal{G}_{k},j\in\mathcal{G}_{k^{\prime}}}\left\|\bm{\beta}_{i}^{0}-\bm{\beta}_{j}^{0}\right\|-2\max_{i}\left\|\bm{\beta}_{i}^{m}-\bm{\beta}_{i}^{0}\right\|\geq b_{n}-2\phi_{n}>a\lambda.

Thus, $\rho_{\gamma}^{\prime}\left(\left\|\bm{\beta}_{i}^{m}-\bm{\beta}_{j}^{m}\right\|\right)=0$ by assumption (C1). Therefore,

\Gamma_{2}=\lambda\sum_{k=1}^{K}\sum_{\left\{i,j\in\mathcal{G}_{k},i<j\right\}}c_{ij}\rho_{\gamma}^{\prime}\left(\left\|\bm{\beta}_{i}^{m}-\bm{\beta}_{j}^{m}\right\|\right)\left\|\bm{\beta}_{i}-\bm{\beta}_{j}\right\|. (30)

Also, for $i,j\in\mathcal{G}_{k}$, $\left\|\bm{\beta}_{i}^{m}-\bm{\beta}_{j}^{m}\right\|\leq 4t_{n}$, so $\rho_{\gamma}^{\prime}\left(\left\|\bm{\beta}_{i}^{m}-\bm{\beta}_{j}^{m}\right\|\right)\geq\rho_{\gamma}^{\prime}\left(4t_{n}\right)$ by assumption (C1). Thus, we have

\Gamma_{2}\geq\sum_{k=1}^{K}\sum_{\left\{i,j\in\mathcal{G}_{k},i<j\right\}}\lambda c_{ij}\rho_{\gamma}^{\prime}\left(4t_{n}\right)\left\|\bm{\beta}_{i}-\bm{\beta}_{j}\right\|.

Let $\bm{Q}=\left(\bm{Q}_{1}^{T},\dots,\bm{Q}_{n}^{T}\right)^{T}=\left[\left(\bm{y}-\bm{Z}\bm{\eta}-\bm{X}\bm{\beta}^{m}\right)^{T}\bm{\Omega}\bm{X}\right]^{T}$ with

\bm{Q}_{i}=\frac{1}{n_{i}}\sum_{h=1}^{n_{i}}\left(y_{ih}-\bm{z}_{ih}^{T}\bm{\eta}-\bm{x}_{ih}^{T}\bm{\beta}_{i}^{m}\right)\bm{x}_{ih}.

We have,

\Gamma_{1}=-\left(\bm{y}-\bm{Z}\bm{\eta}-\bm{X}\bm{\beta}^{m}\right)^{T}\bm{\Omega}\bm{X}\left(\bm{\beta}-\bm{\beta}^{*}\right)=-\bm{Q}^{T}\left(\bm{\beta}-\bm{\beta}^{*}\right)=-\sum_{k=1}^{K}\sum_{\left\{i,j\in\mathcal{G}_{k},i<j\right\}}\frac{\left(\bm{Q}_{i}-\bm{Q}_{j}\right)^{T}\left(\bm{\beta}_{i}-\bm{\beta}_{j}\right)}{\left|\mathcal{G}_{k}\right|}. (31)

Moreover,

\bm{Q}_{i}=\frac{1}{n_{i}}\sum_{h=1}^{n_{i}}\left(\epsilon_{ih}+\bm{z}_{ih}^{T}\left(\bm{\eta}^{0}-\bm{\eta}\right)+\bm{x}_{ih}^{T}\left(\bm{\beta}_{i}^{0}-\bm{\beta}_{i}^{m}\right)\right)\bm{x}_{ih},

so

\sup_{i}\left\|\bm{Q}_{i}\right\|\leq\sup_{i,h}\left\|\bm{x}_{ih}\right\|\left(\left\|\bm{\xi}\right\|_{\infty}+\sup_{i,h}\left\|\bm{z}_{ih}\right\|\left\|\bm{\eta}^{0}-\bm{\eta}\right\|+\sup_{i,h}\left\|\bm{x}_{ih}\right\|\sup_{i}\left\|\bm{\beta}_{i}^{0}-\bm{\beta}_{i}^{m}\right\|\right)\leq C_{2}\sqrt{p}\left(\left\|\bm{\xi}\right\|_{\infty}+C_{3}\sqrt{q}\phi_{n}+C_{2}\sqrt{p}\phi_{n}\right),

where $\bm{\xi}=\left(\xi_{1},\dots,\xi_{n}\right)^{T}$ with $\xi_{i}=\frac{1}{n_{i}}\sum_{h=1}^{n_{i}}\epsilon_{ih}$. According to condition (C3),

P\left(\left\|\bm{\xi}\right\|_{\infty}>\sqrt{2c_{1}^{-1}}\sqrt{\log n/\min n_{i}}\right)
\leq \sum_{i=1}^{n}P\left(\left|\frac{1}{n_{i}}\sum_{h=1}^{n_{i}}\epsilon_{ih}\right|>\sqrt{2c_{1}^{-1}}\sqrt{\log n/\min n_{i}}\right)
\leq \sum_{i=1}^{n}P\left(\left|\frac{1}{n_{i}}\sum_{h=1}^{n_{i}}\epsilon_{ih}\right|>\sqrt{2c_{1}^{-1}}\sqrt{\log n/n_{i}}\right)
\leq 2\sum_{i=1}^{n}\exp\left\{-c_{1}\cdot 2c_{1}^{-1}\log n\right\}=2n^{-1}.

Thus, there exists an event $E_{2}$ such that $P\left(E_{2}\right)\geq 1-2n^{-1}$ and

\sup_{i}\left\|\bm{Q}_{i}\right\|\leq C_{2}\sqrt{p}\left(\sqrt{2c_{1}^{-1}}\sqrt{\log n/\min_{i}n_{i}}+C_{3}\sqrt{q}\phi_{n}+C_{2}\sqrt{p}\phi_{n}\right).

Thus,

\left|\frac{\left(\bm{Q}_{i}-\bm{Q}_{j}\right)^{T}\left(\bm{\beta}_{i}-\bm{\beta}_{j}\right)}{\left|\mathcal{G}_{k}\right|}\right|\leq 2\left|\mathcal{G}_{\min}\right|^{-1}\sup_{i}\left\|\bm{Q}_{i}\right\|\left\|\bm{\beta}_{i}-\bm{\beta}_{j}\right\|\leq 2C_{2}\left|\mathcal{G}_{\min}\right|^{-1}\sqrt{p}\left(\sqrt{2c_{1}^{-1}}\sqrt{\log n/\min_{i}n_{i}}+C_{3}\sqrt{q}\phi_{n}+C_{2}\sqrt{p}\phi_{n}\right)\left\|\bm{\beta}_{i}-\bm{\beta}_{j}\right\|. (32)

Combining (30), (31), and (32), it follows from (29) that

Q_{n}\left(\bm{\eta},\bm{\beta}\right)-Q_{n}\left(\bm{\eta},\bm{\beta}^{*}\right)\geq\sum_{k=1}^{K}\sum_{\left\{i,j\in\mathcal{G}_{k},i<j\right\}}\left\{\lambda c_{ij}\rho_{\gamma}^{\prime}\left(4t_{n}\right)-2C_{2}\left|\mathcal{G}_{\min}\right|^{-1}\sqrt{p}\left(\sqrt{2c_{1}^{-1}}\sqrt{\frac{\log n}{\min_{i}n_{i}}}+C_{3}\sqrt{q}\phi_{n}+C_{2}\sqrt{p}\phi_{n}\right)\right\}\left\|\bm{\beta}_{i}-\bm{\beta}_{j}\right\|.

As $t_{n}=o\left(1\right)$, $\rho_{\gamma}^{\prime}\left(4t_{n}\right)\rightarrow 1$. Since $\left|\mathcal{G}_{\min}\right|\gg\left(q+Kp\right)^{1/2}\max\left(\sqrt{\frac{n}{\min_{i}n_{i}}\log n},\left(q+Kp\right)^{1/2}\right)$, $p=o(n)$, and $q=o(n)$, we have $\left|\mathcal{G}_{\min}\right|^{-1}p=o\left(1\right)$ and $\left|\mathcal{G}_{\min}\right|^{-1}\sqrt{pq}=o\left(1\right)$. Thus, $\lambda\gg\left|\mathcal{G}_{\min}\right|^{-1}\sqrt{p}\sqrt{\frac{\log n}{\min n_{i}}}$, $\lambda\gg\left|\mathcal{G}_{\min}\right|^{-1}\sqrt{pq}\phi_{n}$, and $\lambda\gg\left|\mathcal{G}_{\min}\right|^{-1}p\phi_{n}$. Therefore, $Q_{n}\left(\bm{\eta},\bm{\beta}\right)-Q_{n}\left(\bm{\eta},\bm{\beta}^{*}\right)\geq 0$ for sufficiently large $n$, by the assumption in (C4) that the $c_{ij}$'s are bounded when $i$ and $j$ are in the same group.

Therefore, combining the two steps, $Q_{n}\left(\bm{\eta},\bm{\beta}\right)>Q_{n}(\hat{\bm{\eta}}^{or},\hat{\bm{\beta}}^{or})$ for any $\left(\bm{\eta}^{T},\bm{\beta}^{T}\right)^{T}\in\Theta_{n}\cap\Theta$ with $(\bm{\eta}^{T},\bm{\beta}^{T})^{T}\neq((\hat{\bm{\eta}}^{or})^{T},(\hat{\bm{\beta}}^{or})^{T})^{T}$. This shows that $((\hat{\bm{\eta}}^{or})^{T},(\hat{\bm{\beta}}^{or})^{T})^{T}$ is a strict local minimizer of the objective function (2) on $E_{1}\cap E_{2}$, with probability at least $1-2(q+Kp+1)n^{-1}$, for sufficiently large $n$.

References

  • Banerjee et al., (2014) Banerjee, S., Carlin, B. P., and Gelfand, A. E. (2014). Hierarchical modeling and analysis for spatial data. CRC Press.
  • Boyd et al., (2011) Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122.
  • Bradley et al., (2018) Bradley, J. R., Holan, S. H., and Wikle, C. K. (2018). Computationally efficient multivariate spatio-temporal models for high-dimensional count-valued data (with discussion). Bayesian Analysis, 13(1):253–310.
  • Brunsdon et al., (1996) Brunsdon, C., Fotheringham, A. S., and Charlton, M. E. (1996). Geographically weighted regression: a method for exploring spatial nonstationarity. Geographical Analysis, 28(4):281–298.
  • Casetti, (1972) Casetti, E. (1972). Generating models by the expansion method: applications to geographical research. Geographical Analysis, 4(1):81–91.
  • Casetti and Jones, (1987) Casetti, E. and Jones, J. P. (1987). Spatial aspects of the productivity slowdown: an analysis of US manufacturing data. Annals of the Association of American Geographers, 77(1):76–88.
  • Chi and Lange, (2015) Chi, E. C. and Lange, K. (2015). Splitting methods for convex clustering. Journal of Computational and Graphical Statistics, 24(4):994–1013.
  • Cook et al., (2007) Cook, A. J., Gold, D. R., and Li, Y. (2007). Spatial cluster detection for censored outcome data. Biometrics, 63(2):540–549.
  • Cressie, (2015) Cressie, N. (2015). Statistics for spatial data. John Wiley & Sons.
  • Diggle et al., (1998) Diggle, P. J., Tawn, J. A., and Moyeed, R. A. (1998). Model-based geostatistics. Journal of the Royal Statistical Society: Series C (Applied Statistics), 47(3):299–350.
  • Fan and Li, (2001) Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360.
  • Fan et al., (2018) Fan, Z., Guan, L., et al. (2018). Approximate $\ell_{0}$-penalized estimation of piecewise-constant signals on graphs. The Annals of Statistics, 46(6B):3217–3245.
  • Gelfand et al., (2003) Gelfand, A. E., Kim, H.-J., Sirmans, C., and Banerjee, S. (2003). Spatial modeling with spatially varying coefficient processes. Journal of the American Statistical Association, 98(462):387–396.
  • Han et al., (2012) Han, W., Yang, Z., Di, L., and Mueller, R. (2012). Cropscape: A Web service based application for exploring and disseminating US conterminous geospatial cropland data products for decision support. Computers and Electronics in Agriculture, 84:111–123.
  • Hu and Bradley, (2018) Hu, G. and Bradley, J. (2018). A Bayesian spatial-temporal model with latent multivariate log-gamma random effects with application to earthquake magnitudes. Stat, 7(1):e179.
  • Hu and Huffer, (2020) Hu, G. and Huffer, F. (2020). Modified Kaplan–Meier estimator and Nelson–Aalen estimator with geographical weighting for survival data. Geographical Analysis, 52(1):28–48.
  • Hu et al., (2020) Hu, G., Xue, Y., and Ma, Z. (2020). Bayesian clustered coefficients regression with auxiliary covariates assistant random effects. arXiv preprint arXiv:2004.12022.
  • Huang et al., (2004) Huang, J. Z., Wu, C. O., and Zhou, L. (2004). Polynomial spline estimation and inference for varying coefficient models with longitudinal data. Statistica Sinica, 14(3):763–788.
  • Hubert and Arabie, (1985) Hubert, L. and Arabie, P. (1985). Comparing partitions. Journal of Classification, 2(1):193–218.
  • Jung et al., (2007) Jung, I., Kulldorff, M., and Klassen, A. C. (2007). A spatial scan statistic for ordinal data. Statistics in Medicine, 26(7):1594–1607.
  • Kulldorff and Nagarwalla, (1995) Kulldorff, M. and Nagarwalla, N. (1995). Spatial disease clusters: detection and inference. Statistics in Medicine, 14(8):799–810.
  • Lee et al., (2017) Lee, J., Gangnon, R. E., and Zhu, J. (2017). Cluster detection of spatial regression coefficients. Statistics in Medicine, 36(7):1118–1133.
  • Lee et al., (2020) Lee, J., Sun, Y., and Chang, H. H. (2020). Spatial cluster detection of regression coefficients in a mixed-effects model. Environmetrics, 31(2):e2578.
  • Li and Sang, (2019) Li, F. and Sang, H. (2019). Spatial homogeneity pursuit of regression coefficients for large datasets. Journal of the American Statistical Association, 114(527):1050–1062.
  • Lu and Carlin, (2005) Lu, H. and Carlin, B. P. (2005). Bayesian areal wombling for geographical boundary analysis. Geographical Analysis, 37(3):265–285.
  • Lu et al., (2007) Lu, H., Reilly, C. S., Banerjee, S., and Carlin, B. P. (2007). Bayesian areal wombling via adjacency modeling. Environmental and Ecological Statistics, 14(4):433–452.
  • Ma and Huang, (2017) Ma, S. and Huang, J. (2017). A concave pairwise fusion approach to subgroup analysis. Journal of the American Statistical Association, 112(517):410–423.
  • Ma et al., (2019) Ma, S., Huang, J., Zhang, Z., and Liu, M. (2019). Exploration of heterogeneous treatment effects via concave fusion. The International Journal of Biostatistics.
  • Ma et al., (2020) Ma, Z., Xue, Y., and Hu, G. (2020). Heterogeneous regression models for clusters of spatial dependent data. Spatial Economic Analysis, pages 1–17.
  • Nakaya et al., (2005) Nakaya, T., Fotheringham, A. S., Brunsdon, C., and Charlton, M. (2005). Geographically weighted Poisson regression for disease association mapping. Statistics in Medicine, 24(17):2695–2717.
  • Nusser and Goebel, (1997) Nusser, S. M. and Goebel, J. J. (1997). The National Resources Inventory: a long-term multi-resource monitoring programme. Environmental and Ecological Statistics, 4(3):181–204.
  • Rand, (1971) Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846–850.
  • Tibshirani, (1996) Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1):267–288.
  • Vinh et al., (2010) Vinh, N. X., Epps, J., and Bailey, J. (2010). Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance. Journal of Machine Learning Research, 11:2837–2854.
  • Wang et al., (2007) Wang, H., Li, R., and Tsai, C.-L. (2007). Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika, 94(3):553–568.
  • Wang et al., (2018) Wang, X., Berg, E., Zhu, Z., Sun, D., and Demuth, G. (2018). Small area estimation of proportions with constraint for National Resources Inventory survey. Journal of Agricultural, Biological and Environmental Statistics, 23(4):509–528.
  • Xu et al., (2019) Xu, Z., Bradley, J. R., and Sinha, D. (2019). Latent multivariate log-gamma models for high-dimensional multi-type responses with application to daily fine particulate matter and mortality counts. arXiv preprint arXiv:1909.02528.
  • Xue et al., (2020) Xue, Y., Schifano, E. D., and Hu, G. (2020). Geographically weighted Cox regression for prostate cancer survival data in Louisiana. Geographical Analysis, 52(4):570–587.
  • Zhang, (2010) Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894–942.
  • Zhang and Lawson, (2011) Zhang, J. and Lawson, A. B. (2011). Bayesian parametric accelerated failure time spatial model and its application to prostate cancer. Journal of Applied Statistics, 38(3):591–603.