Mixed effects models for extreme value index regression

Koki Momoki¹ and Takuma Yoshida²

¹,²Graduate School of Science and Engineering, Kagoshima University
1-21-40 Korimoto, Kagoshima, Kagoshima, 890-8580, Japan
E-mail: k3499390@kadai.jp
E-mail: yoshida@sci.kagoshima-u.ac.jp

Abstract

Extreme value theory (EVT) provides an elegant mathematical tool for the statistical analysis of rare events. When data are collected from multiple population subgroups, some subgroups may have few observations available for extreme value analysis, so a natural goal is to improve the estimates obtained directly from each subgroup. To achieve this, we incorporate the mixed effects model (MEM) into the regression technique in EVT. In small area estimation, the MEM has attracted considerable attention as a primary tool for producing reliable estimates for subgroups with small sample sizes, i.e., “small areas.” The key idea of the MEM is to incorporate information from all subgroups into a single model and to borrow strength from all subgroups to improve estimates for each subgroup. Using this property, in extreme value analysis, the MEM may contribute to reducing the bias and variance of the direct estimates from each subgroup. This prompts us to evaluate the effectiveness of the MEM for EVT through theoretical studies and numerical experiments, including its application to the risk assessment of a number of stocks in the cryptocurrency market.

Keywords: Extreme value index regression; Extreme value theory; Mixed effects model; Pareto-type distribution; Risk assessment

1 Introduction

Statistical analysis of rare events is crucial for risk assessment in various fields, including meteorology, environment, seismology, finance, economics and insurance. Extreme value theory (EVT) provides an elegant mathematical tool for the analysis of rare events.

In the framework of univariate EVT, the generalized extreme value distribution (see, Fisher and Tippett 1928; Gumbel 1958) and generalized Pareto distribution (GPD, see, Davison and Smith 1990) are standard models for fitting extreme value data. In addition, the class of GPDs with unbounded tails is called the Pareto-type distribution, which has been recognized as an important model for analyzing heavy-tailed data (see, Beirlant et al. 2004). As an important direction of development in EVT, these models have been extended to regression models to incorporate covariate information into extreme value analysis (see, Davison and Smith 1990; Beirlant and Goegebeur 2003; Beirlant et al. 2004; Wang and Tsai 2009). However, few regression techniques have been developed for unit-level data in extreme value analysis.

The data category of interest is denoted by $\{(Y_{ij},{\bm{X}}_{ij}),\ i=1,2,\ldots,n_{j},\ j=1,2,\ldots,J\}$ , where $J$ is the number of population subgroups, $n_{j}$ is the sample size from the $j$ th population subgroup, $Y_{ij}$ is the response of interest, and ${\bm{X}}_{ij}$ is the vector of relevant covariates. In much of the literature on small area estimation (SAE), this data category is referred to as unit-level data, and each subgroup is then called an “area” (see, Rao and Molina 2015; Sugasawa and Kubokawa 2020; Molina, Corral, and Nguyen 2022). In particular, an area with a small sample size is the so-called “small area,” which is described in detail by Jiang (2017). Examples of an area include a geographic region such as a state, county or municipality, or a demographic group such as a specific age-sex-race group. From the definition on p. 173 of Rao and Molina (2015), unit-level data means that the number of areas $J$ is known, and each observation $(Y_{ij},{\bm{X}}_{ij})$ is explicitly assigned to one of the areas. This study aims to develop the regression technique of extreme value analysis for unit-level data.

Our purpose is not to pool data from multiple areas into a single area by clustering, nor to create a new set of areas with heterogeneous characteristics (see, Bottolo et al. 2003; Rohrbeck and Tawn 2021; de Carvalho, Huser, and Rubio 2023; Dupuis, Engelke, and Trapin 2023 and references therein). Unlike these approaches, our goal is to enhance the accuracy of extreme value analysis by using information from all areas simultaneously, instead of building models by area. However, if the heterogeneity between all areas is modeled as parameters, the number of parameters depends on the number of areas $J$ . Accordingly, for large $J$ , the fully parametric model can lead to a large bias in the parameter estimates (see, Section 4 of Ruppert, Wand, and Carroll 2003; Broström and Holmberg 2011). Thus, we want to develop a model for extreme value analysis that does not require many parameters for unit-level data. To this end, we incorporate the mixed effects model (MEM) into EVT. The MEM has been described in Jiang (2007), Wu (2009), Jiang (2017), and references therein. This model captures the heterogeneity between areas as a latent structure rather than as parameters. In SAE (see, Torabi 2019; Sugasawa and Kubokawa 2020), the efficiency of MEM is well known for its so-called “borrowing of strength” property (see, Dempster, Rubin, and Tsutakawa 1981). Dempster, Rubin, and Tsutakawa (1981) described the “borrowing of strength” as follows:

Using concepts of variability between and within units, one can improve estimates based on the data from a single unit by the appropriate use of data from the remaining units. (Dempster, Rubin, and Tsutakawa 1981, p. 341)

In other words, the “borrowing of strength” indicates that direct estimates based only on area-specific data are improved by using information from other areas (see, Section 1 of Diallo and Rao 2018). For small areas, the “borrowing of strength” would be particularly helpful because the accuracy of their direct estimates is not sufficiently guaranteed (see, Molina and Rao 2010; Diallo and Rao 2018). In extreme value analysis, we use only data with small or large values; hence, the effective sample size tends to be small for some areas. Thus, the MEM would be crucial to obtain more efficient estimators for extreme value analysis. However, to the best of our knowledge, there are no results for combining the MEM with EVT. Therefore, it would be important to show that the “borrowing of strength” is also valid for extreme value analysis. We will reveal such considerations theoretically and numerically.

In this study, we incorporate the MEM into the extreme value index (EVI) of the Pareto-type distribution. For this model, we first pick the extreme value data for each area using the peak-over-threshold method. Then, the regression parameters are estimated by the maximum likelihood method. In addition, the random components of the MEM, called random effects, which correspond to the latent area-wise differences in the regression coefficients, are predicted by the conditional mode. We investigate the asymptotic normality of the proposed estimator (see, Section 3.2). From this asymptotic normality, we find that the variance of the estimator improves as the number of areas $J$ increases. In other words, the proposed estimator is generally stable even when the effective sample size is small for certain areas. Owing to this property, the proposed estimator can reduce the severe bias of Pareto-type modeling by setting reasonably higher thresholds while achieving its stable behavior. Furthermore, we show numerically through the Monte Carlo simulation study that our estimates for each area, obtained by combining the proposed estimator and predictor, not only have significantly smaller variances than the direct estimates, but are also less biased (see, Section 1.3 of our supplementary material). Surprisingly, in the context of EVT, the “borrowing of strength” of the MEM contributes to reducing both bias and variance.

As an application, we analyze the returns of 413 cryptocurrencies. Since their risks as assets may vary by stock, an accurate risk assessment for each stock would be highly desirable. Along with some analysis, we will demonstrate the effectiveness of the MEM in analyzing the risks of many cryptocurrency stocks. We note that considering many stocks is equivalent to the mathematical situation that the number of areas $J$ is large. Thus, the theoretical study of the proposed estimator will be established under $J\to\infty$ .

The remainder of this article is organized as follows. Section 2 proposes the regression model using the MEM for the Pareto-type distribution. Section 3 examines its asymptotic properties. As a real data example, Section 4 analyzes a cryptocurrency dataset. Section 5 summarizes this study and discusses future research. The simulation studies for the proposed model and technical details of the asymptotic theory in Section 3 are described in our supplementary material.

2 Model and method

Let $\mathbb{R}^{+}$ be defined as the set of all positive real numbers. Throughout this article, we consider the unit-level data

\left\{(Y_{ij},{\bm{X}}_{ij})\in\mathbb{R}^{+}\times\mathbb{R}^{p},\ i=1,2,\ldots,n_{j},\ j=1,2,\ldots,J\right\},

(1)

where $J$ is the number of areas, $n_{j}$ is the sample size from the $j$ th area, $Y_{ij}$ is the continuous random variable corresponding to the response of interest, and ${\bm{X}}_{ij}$ is the random vector representing the associated predictors. Here, $(Y_{ij},{\bm{X}}_{ij})$ is regarded as the observation for the $i$ th unit in the $j$ th area. We denote the index sets by $\mathcal{N}_{j}\coloneqq\{1,2,\ldots,n_{j}\}$ and $\mathcal{J}\coloneqq\{1,2,\ldots,J\}$ . In the following Sections 2.1-2.4, we describe the proposed MEM and associated estimation and prediction methods.

2.1 Mixed effects model under Pareto-type distribution

Let ${\bm{X}}_{ij},\ i\in\mathcal{N}_{j},\ j\in\mathcal{J}$ be an independent and identically distributed (i.i.d.) random sample from some distribution. Subsequently, we assume that for each $i\in\mathcal{N}_{j}$ and $j\in\mathcal{J}$ , the response $Y_{ij}$ is conditionally independently obtained from a certain conditional distribution $F_{j}(y\mid{\bm{x}})=P(Y_{ij}\leq y\mid{\bm{X}}_{ij}={\bm{x}})$ for the given ${\bm{X}}_{ij}$ , where $F_{j}$ is determined for each $j\in\mathcal{J}$ . In this study, we are interested in the right tail behavior of each $F_{j}$ . Here, the right tail of each $F_{j}$ is modeled by the Pareto-type distribution as

F_{j}(y\mid{\bm{x}})=1-y^{-1/\gamma_{j}({\bm{x}})}\mathcal{L}_{j}(y;{\bm{x}}),\quad j\in\mathcal{J},

(2)

where $\gamma_{j}({\bm{x}})>0$ is a positive function called EVI, and $\mathcal{L}_{j}(y;{\bm{x}})$ is a conditional slowly varying function with respect to $y$ given ${\bm{x}}$ , i.e., for any ${\bm{x}}$ and $s>0$ , $\mathcal{L}_{j}(ys;{\bm{x}})/\mathcal{L}_{j}(y;{\bm{x}})\to 1$ as $y\to\infty$ . The EVI function $\gamma_{j}({\bm{x}})$ , which determines the heaviness of the right tail of $F_{j}$ , is assumed here to be the classical linear model formulated as follows:

\gamma_{j}({\bm{x}})=\exp\left[\left({\bm{\theta}}_{j{\rm{A}}}^{0}\right)^{\top}{\bm{x}}_{\rm{A}}+\left({\bm{\theta}}_{\rm{B}}^{0}\right)^{\top}{\bm{x}}_{\rm{B}}\right],\quad j\in\mathcal{J},

(3)

where ${\bm{x}}\coloneqq({\bm{x}}_{\rm{A}}^{\top},{\bm{x}}_{\rm{B}}^{\top})^{\top}\in\mathbb{R}^{p_{\rm{A}}}\times\mathbb{R}^{p_{\rm{B}}}$ , and ${\bm{\theta}}_{j{\rm{A}}}^{0}\in\mathbb{R}^{p_{\rm{A}}}$ and ${\bm{\theta}}_{\rm{B}}^{0}\in\mathbb{R}^{p_{\rm{B}}}$ are regression coefficient vectors. Note that ${\bm{\theta}}_{j{\rm{A}}}^{0}$ is different between areas, whereas ${\bm{\theta}}_{{\rm{B}}}^{0}$ is common to all areas. When $J=1$ , the above model is reduced to that of Wang and Tsai (2009). The purpose of the model (2) with (3) is to estimate the parameter vectors ${\bm{\theta}}_{1{\rm{A}}}^{0},{\bm{\theta}}_{2{\rm{A}}}^{0},\ldots,{\bm{\theta}}_{J{\rm{A}}}^{0}$ and ${\bm{\theta}}_{\rm{B}}^{0}$ . However, this model has $(J\times p_{\rm{A}}+p_{\rm{B}})$ parameters; hence, if $J$ is large, the associated estimators may be severely biased (see, Section 4 of Ruppert, Wand, and Carroll 2003; Broström and Holmberg 2011). To overcome this bias problem, we use the MEM instead of the fully parametric model (2) with (3).

For $p_{\rm{A}}\leq p$ , we introduce the random effects ${\bm{U}}_{j}\in\mathbb{R}^{p_{\rm{A}}},\ j\in\mathcal{J}$ such that

{\bm{U}}_{1},{\bm{U}}_{2},\ldots,{\bm{U}}_{J}\overset{{\rm{i.i.d.}}}{\sim}N({\bm{0}},{\bm{\Sigma}}_{0}),

(4)

where $N({\bm{0}},{\bm{\Sigma}}_{0})$ refers to the multivariate normal distribution with zero mean vector and unknown covariance matrix ${\bm{\Sigma}}_{0}$ . The MEM uses these random effects to express the differences in $F_{j}$ between areas as a latent structure. Let $F(y\mid{\bm{u}}_{j},{\bm{x}})\coloneqq P(Y_{ij}\leq y\mid{\bm{U}}_{j}={\bm{u}}_{j},{\bm{X}}_{ij}={\bm{x}})$ be the conditional distribution function of $Y_{ij}$ given ${\bm{U}}_{j}={\bm{u}}_{j}$ and ${\bm{X}}_{ij}={\bm{x}}$ . In this expression, the information about the differences in $F_{j}$ between areas is assigned to $F$ by ${\bm{u}}_{j}$ . Note that the random effects ${\bm{U}}_{j},\ j\in\mathcal{J}$ are not observed as data.

As an alternative model to (2) with (3), the Pareto-type distribution using the MEM is defined as follows:

F(y\mid{\bm{u}}_{j},{\bm{x}})=1-y^{-1/\gamma({\bm{u}}_{j},{\bm{x}})}\mathcal{L}(y;{\bm{u}}_{j},{\bm{x}}),\quad j\in\mathcal{J},

(5)

where $\mathcal{L}(y;{\bm{u}}_{j},{\bm{x}})$ conditional on ${\bm{U}}_{j}={\bm{u}}_{j}$ and ${\bm{X}}_{ij}={\bm{x}}$ is a slowly varying function with respect to $y$ . Then, as an extension of (3) to the MEM, the EVI function $\gamma({\bm{u}}_{j},{\bm{x}})$ is assumed to be

\gamma({\bm{u}}_{j},{\bm{x}})=\exp\left[\left({\bm{\theta}}_{\rm{A}}^{0}+{\bm{u}}_{j}\right)^{\top}{\bm{x}}_{\rm{A}}+\left({\bm{\theta}}_{\rm{B}}^{0}\right)^{\top}{\bm{x}}_{\rm{B}}\right],\quad j\in\mathcal{J}.

(6)

Compared to (3), the area-wise differences in the slopes of log-EVI with respect to ${\bm{X}}_{{\rm{A}}ij}$ are represented by ${\bm{u}}_{j},\ j\in\mathcal{J}$ as a latent structure. Thus, the total number of parameters in the model (5) using (6) is $p+p_{\rm{A}}(p_{\rm{A}}+1)/2$ , which is independent of $J$ and less than that of the fully parametric model (2) with (3) when $J$ is large. Here, $p_{\rm{A}}(p_{\rm{A}}+1)/2$ is the number of parameters included in ${\bm{\Sigma}}_{0}$ .
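The parameter counts quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch; the function names and the choice $p_{\rm{A}}=1$, $p_{\rm{B}}=2$ are illustrative assumptions, not values taken from the analysis in this article.

```python
def n_params_fixed(J, p_a, p_b):
    """Parameter count of the fully parametric model (2) with (3): J * p_A + p_B."""
    return J * p_a + p_b


def n_params_mem(p_a, p_b):
    """Parameter count of the MEM (5) with (6): p coefficients plus vech(Sigma_0)."""
    p = p_a + p_b
    return p + p_a * (p_a + 1) // 2


# Illustrative (assumed) dimensions: one area-specific and two common coefficients.
for J in (10, 100, 413):
    print(J, n_params_fixed(J, 1, 2), n_params_mem(1, 2))
```

The first count grows linearly in $J$, whereas the second stays fixed once $p_{\rm{A}}$ and $p_{\rm{B}}$ are chosen.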

The simplest case of (6) is the location-shifting MEM with $p_{\rm{A}}=1$ and ${\bm{X}}_{{\rm{A}}ij}\equiv 1$ . Writing ${\bm{\theta}}_{\rm{A}}^{0}$ and ${\bm{u}}_{j}$ as the scalars $\theta_{\rm{A}}^{0}$ and $u_{j}$ , this model is

\gamma(u_{j},{\bm{x}}_{\rm{B}})=\exp\left[\theta_{\rm{A}}^{0}+u_{j}+\left({\bm{\theta}}_{\rm{B}}^{0}\right)^{\top}{\bm{x}}_{\rm{B}}\right],\quad j\in\mathcal{J},

(7)

which can be regarded as an EVI regression version of the nested error regression model (see, Battese, Harter, and Fuller 1988). In the model (7), the intercept of $\log\gamma$ varies between areas, whereas each covariate has a common slope of $\log\gamma$ across all areas. The nested error regression model is useful for SAE (see, Diallo and Rao 2018; Sugasawa and Kubokawa 2020). The application in Section 4 demonstrates the analysis using the model (7) and verifies its effectiveness numerically. At the other extreme, the case $p=p_{\rm{A}}$ yields the most general model of (6), in which the slope of $\log\gamma$ with respect to every covariate varies across areas.
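To make the location-shifting model (7) concrete, the sketch below simulates unit-level data from it. It assumes the simplest Pareto-type case in which the slowly varying function is identically one, so that $Y=V^{-\gamma}$ with $V$ uniform on $(0,1]$ has the exact tail $P(Y>y)=y^{-1/\gamma}$. The function names, the single uniform covariate, and all numerical values are hypothetical choices made only for illustration.

```python
import math
import random


def evi(theta_a, u_j, theta_b, x_b):
    """EVI of model (7): exp(theta_A + u_j + theta_B' x_B)."""
    return math.exp(theta_a + u_j + sum(t * x for t, x in zip(theta_b, x_b)))


def simulate_area(rng, n, theta_a, u_j, theta_b):
    """Draw n pairs (Y, x_B) for one area, assuming an exact Pareto tail (L = 1)."""
    rows = []
    for _ in range(n):
        x_b = [rng.random()]            # assumed covariate distribution
        gamma = evi(theta_a, u_j, theta_b, x_b)
        v = 1.0 - rng.random()          # uniform on (0, 1]
        rows.append((v ** (-gamma), x_b))
    return rows


rng = random.Random(0)
sigma = 0.3                             # assumed standard deviation of the random effects
u = [rng.gauss(0.0, sigma) for _ in range(5)]
data = {j: simulate_area(rng, 200, -1.0, u[j], [0.5]) for j in range(5)}
```

Each area shares the slope of $\log\gamma$ with respect to the covariate, and only the intercept is shifted by the unobserved $u_{j}$.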

2.2 Approximate maximum likelihood estimator

In this section, we construct estimators of the unknown parameters $\{{\bm{\theta}}_{\rm{A}}^{0},{\bm{\theta}}_{\rm{B}}^{0},{\bm{\Sigma}}_{0}\}$ included in the model (5) with (6).

Let $F_{\omega_{j}}(y\mid{\bm{u}}_{j},{\bm{x}})\coloneqq P(Y_{ij}\leq y\mid{\bm{U}}_{j}={\bm{u}}_{j},{\bm{X}}_{ij}={\bm{x}},Y_{ij}>\omega_{j})$ be the conditional distribution function given ${\bm{U}}_{j}={\bm{u}}_{j}$ , ${\bm{X}}_{ij}={\bm{x}}$ and $Y_{ij}>\omega_{j}$ , where $\omega_{j}\in\mathbb{R}^{+},\ j\in\mathcal{J}$ are thresholds. In this paper, we assume that $\mathcal{L}(y;{\bm{u}}_{j},{\bm{x}})$ belongs to the Hall class (see, Hall 1982), which is mentioned in (A1) of Section 3.1. From this, we have

F_{\omega_{j}}(y\mid{\bm{u}}_{j},{\bm{x}})\approx 1-\left(\frac{y}{\omega_{j}}\right)^{-1/\gamma({\bm{u}}_{j},{\bm{x}})},\quad j\in\mathcal{J}

(8)

for sufficiently large $\omega_{j}$ . Using (8) instead of (5), we can remove ${\mathcal{L}}$ for the estimation of $\gamma$ . Similarly, from (8) and the assumption (A1), the density of $Y_{ij}$ given ${\bm{U}}_{j}={\bm{u}}_{j}$ , ${\bm{X}}_{ij}={\bm{x}}$ and $Y_{ij}>\omega_{j}$ is obtained as follows:

f_{\omega_{j}}(y\mid{\bm{u}}_{j},{\bm{x}})\approx\omega_{j}^{-1}\gamma({\bm{u}}_{j},{\bm{x}})^{-1}\left(\frac{y}{\omega_{j}}\right)^{-1/\gamma({\bm{u}}_{j},{\bm{x}})-1},\quad j\in\mathcal{J}.

(9)

Wang and Tsai (2009) considered a similar approximate distribution and density, corresponding to (8) and (9), in linear extreme value index regression. Estimation using only the data exceeding thresholds is the so-called peak-over-threshold method (see, Hill 1975; Wang and Tsai 2009).
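The quality of the approximation (8) can be illustrated numerically. The sketch below assumes a slowly varying function of the form $\mathcal{L}(y)=c_{0}+c_{1}y^{-\beta}$, i.e., a Hall-class function with no remainder term, and compares the exact conditional tail $P(Y>y\mid Y>\omega)$ with the Pareto approximation on a grid of ratios $y/\omega$. The constants $\gamma=0.5$, $c_{0}=1$, $c_{1}=0.5$, $\beta=1$ and the grid are assumptions chosen for illustration only.

```python
def hall_L(y, c0, c1, beta):
    """Assumed slowly varying function c0 + c1 * y^(-beta) (no remainder term)."""
    return c0 + c1 * y ** (-beta)


def exact_cond_survival(y, omega, gamma, c0, c1, beta):
    """P(Y > y | Y > omega) when 1 - F(y) = y^(-1/gamma) * hall_L(y)."""
    ratio = hall_L(y, c0, c1, beta) / hall_L(omega, c0, c1, beta)
    return (y / omega) ** (-1.0 / gamma) * ratio


def pareto_cond_survival(y, omega, gamma):
    """Right-hand side of (8) written as a survival function."""
    return (y / omega) ** (-1.0 / gamma)


def max_error(omega, gamma=0.5, c0=1.0, c1=0.5, beta=1.0):
    """Largest absolute gap between exact and approximate tails on a grid of y/omega."""
    ratios = [1.0 + 0.05 * k for k in range(1, 1001)]
    return max(
        abs(exact_cond_survival(omega * t, omega, gamma, c0, c1, beta)
            - pareto_cond_survival(omega * t, omega, gamma))
        for t in ratios
    )


errors = [max_error(omega) for omega in (2.0, 10.0, 50.0)]
print(errors)
```

Under these assumptions the gap shrinks as the threshold grows, which is the sense in which (8) holds for sufficiently large $\omega_{j}$.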

We assume that ${\bm{U}}_{j}$ and ${\bm{X}}_{ij}$ are independent for $i\in\mathcal{N}_{j}$ and $j\in\mathcal{J}$ . Furthermore, we assume that $Y_{ij}$ given ${\bm{U}}_{j}$ and ${\bm{X}}_{ij}$ is conditionally independent for $i\in\mathcal{N}_{j}$ and $j\in\mathcal{J}$ and has the distribution function (5) (see, Jiang, Wand, and Bhaskaran 2022). Using (9), we then define the likelihood of $({\bm{\theta}}_{\rm{A}}^{0},{\bm{\theta}}_{\rm{B}}^{0},{\bm{\Sigma}}_{0})$ as

L({\bm{\theta}}_{\rm{A}},{\bm{\theta}}_{\rm{B}},{\bm{\Sigma}})\coloneqq\prod_{j=1}^{J}E_{{\bm{U}}_{j}}\left[\prod_{i\in\mathcal{N}_{j}:Y_{ij}>\omega_{j}}f_{\omega_{j}}(Y_{ij}\mid{\bm{U}}_{j},{\bm{X}}_{ij})\right],

where $E_{{\bm{U}}_{j}}$ denotes the expectation over the random effects distribution, $({\bm{\theta}}_{\rm{A}}^{\top},{\bm{\theta}}_{\rm{B}}^{\top})^{\top}\in\mathbb{R}^{p_{\rm{A}}}\times\mathbb{R}^{p_{\rm{B}}}$ is any vector corresponding to $(({\bm{\theta}}_{\rm{A}}^{0})^{\top},({\bm{\theta}}_{\rm{B}}^{0})^{\top})^{\top}$ , and ${\bm{\Sigma}}\in\mathbb{R}^{p_{\rm{A}}\times p_{\rm{A}}}$ is any positive definite matrix corresponding to ${\bm{\Sigma}}_{0}$ . The above $L$ is derived from the standard definition of the likelihood for the MEM, because ${\bm{U}}_{j},\ j\in\mathcal{J}$ are unobserved random variables, unlike the data (1) (see, Section 2 of Wu 2009). From (4), (6), (9) and (A1), the log-likelihood of $({\bm{\theta}}_{\rm{A}}^{0},{\bm{\theta}}_{\rm{B}}^{0},{\bm{\Sigma}}_{0})$ can be expressed as

\begin{aligned}
\ell({\bm{\theta}}_{\rm{A}},{\bm{\theta}}_{\rm{B}},{\bm{\Sigma}})&\coloneqq\log L({\bm{\theta}}_{\rm{A}},{\bm{\theta}}_{\rm{B}},{\bm{\Sigma}})\\
&\approx\sum_{j=1}^{J}\log\int_{\mathbb{R}^{p_{\rm{A}}}}\phi({\bm{u}};{\bm{0}},{\bm{\Sigma}})\exp\Biggl(\sum_{i=1}^{n_{j}}\biggl\{-\left({\bm{\theta}}_{\rm{A}}+{\bm{u}}\right)^{\top}{\bm{X}}_{{\rm{A}}ij}-{\bm{\theta}}_{\rm{B}}^{\top}{\bm{X}}_{{\rm{B}}ij}\\
&\qquad-\exp\left[-\left({\bm{\theta}}_{\rm{A}}+{\bm{u}}\right)^{\top}{\bm{X}}_{{\rm{A}}ij}-{\bm{\theta}}_{\rm{B}}^{\top}{\bm{X}}_{{\rm{B}}ij}\right]\log\frac{Y_{ij}}{\omega_{j}}\biggr\}I(Y_{ij}>\omega_{j})\Biggr)d{\bm{u}}+C,
\end{aligned}

(10)

where $I(\cdot)$ is an indicator function that returns 1 if $Y_{ij}>\omega_{j}$ and 0 otherwise, $\phi(\cdot;{\bm{0}},{\bm{\Sigma}})$ is a density function of $N({\bm{0}},{\bm{\Sigma}})$ , and $C$ is a suitable constant independent of $({\bm{\theta}}_{\rm{A}},{\bm{\theta}}_{\rm{B}},{\bm{\Sigma}})$ . Again, because ${\bm{U}}_{j},\ j\in\mathcal{J}$ are not observed as data, the log-likelihood (10) includes the integral over the domain $\mathbb{R}^{p_{\rm{A}}}$ of the random effects. We denote the approximate maximum likelihood estimator of $({\bm{\theta}}_{\rm{A}}^{0},{\bm{\theta}}_{\rm{B}}^{0},{\bm{\Sigma}}_{0})$ by $(\hat{\bm{\theta}}_{\rm{A}},\hat{\bm{\theta}}_{\rm{B}},\hat{\bm{\Sigma}})$ , which is the maximizer of the right-hand side of (10).
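For the location-shifting model (7), the integral in (10) is one-dimensional and can be approximated numerically. The sketch below is one possible way to evaluate the right-hand side of (10) up to the constant $C$; it uses a simple trapezoidal rule on a grid of $u$ values together with a log-sum-exp step, which is an assumption of this illustration and not necessarily the numerical method used for the results in this article. The data layout (a list of pairs $(\log(Y_{ij}/\omega_{j}),{\bm{x}}_{\rm{B}})$ for the exceedances of each area) and all names are hypothetical.

```python
import math


def area_loglik(exceed, theta_a, theta_b, sigma, n_grid=801, width=8.0):
    """Log of the j-th integral in (10) for model (7), omitting the constant C.

    exceed: list of (log(Y_ij / omega_j), x_B) for the exceedances of area j.
    The integral over u is approximated on [-width*sigma, width*sigma].
    """
    h = 2.0 * width * sigma / (n_grid - 1)
    log_norm = -math.log(sigma * math.sqrt(2.0 * math.pi))
    log_terms = []
    for k in range(n_grid):
        u = -width * sigma + k * h
        s = 0.0
        for z, x_b in exceed:
            lin = theta_a + u + sum(t * x for t, x in zip(theta_b, x_b))
            s += -lin - math.exp(-lin) * z          # integrand exponent in (10)
        weight = 0.5 * h if k in (0, n_grid - 1) else h   # trapezoidal weights
        log_phi = log_norm - 0.5 * (u / sigma) ** 2
        log_terms.append(math.log(weight) + log_phi + s)
    m = max(log_terms)                               # log-sum-exp for stability
    return m + math.log(sum(math.exp(t - m) for t in log_terms))


def loglik(areas, theta_a, theta_b, sigma):
    """Sum of the area-wise contributions over j = 1, ..., J."""
    return sum(area_loglik(ex, theta_a, theta_b, sigma) for ex in areas)
```

Maximizing `loglik` over `theta_a`, `theta_b` and `sigma` with any general-purpose optimizer would then give an approximate maximum likelihood estimate in this special case.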

2.3 Prediction of random effects

In the proposed model (5) using (6), we are not only interested in estimating the parameters $\{{\bm{\theta}}_{\rm{A}}^{0},{\bm{\theta}}_{\rm{B}}^{0},{\bm{\Sigma}}_{0}\}$ , but also in predicting the random effects ${\bm{U}}_{j},\ j\in\mathcal{J}$ . Here, we propose the conditional mode method to predict these random effects (see, Santner and Duffy 1989; Section 11 of Wu 2009). Now, the conditional density function of $({\bm{U}}_{1},{\bm{U}}_{2},\ldots,{\bm{U}}_{J})$ given the data (1) is proportional to

\prod_{j=1}^{J}\left[\phi({\bm{u}}_{j};{\bm{0}},{\bm{\Sigma}}_{0})\prod_{i\in\mathcal{N}_{j}:Y_{ij}>\omega_{j}}f_{\omega_{j}}(Y_{ij}\mid{\bm{u}}_{j},{\bm{X}}_{ij})\right],

as a function of $({\bm{u}}_{1},{\bm{u}}_{2},\ldots,{\bm{u}}_{J})$ . Then, the predictor of ${\bm{U}}_{j}$ is defined as the mode of this conditional distribution, i.e.,

\tilde{\bm{u}}_{j}\coloneqq\operatorname*{argmax}_{{\bm{u}}_{j}\in\mathbb{R}^{p_{\rm{A}}}}\ \phi({\bm{u}}_{j};{\bm{0}},{\bm{\Sigma}}_{0})\prod_{i\in\mathcal{N}_{j}:Y_{ij}>\omega_{j}}f_{\omega_{j}}(Y_{ij}\mid{\bm{u}}_{j},{\bm{X}}_{ij}),\quad j\in\mathcal{J},

(11)

where, in practice, $f_{\omega_{j}}$ is replaced by the approximation (9), and the unknown $({\bm{\theta}}_{\rm{A}}^{0},{\bm{\theta}}_{\rm{B}}^{0},{\bm{\Sigma}}_{0})$ is replaced by the estimator $(\hat{\bm{\theta}}_{\rm{A}},\hat{\bm{\theta}}_{\rm{B}},\hat{\bm{\Sigma}})$ .
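In the location-shifting model (7), the objective in (11) is a concave function of the scalar $u_{j}$ on the log scale, so the conditional mode can be found by Newton's method. The sketch below illustrates this special case under the approximation (9); the data layout, the function names and the numerical values are hypothetical. For comparison, `direct_u` returns the maximizer obtained when the normal density of the random effect is dropped, i.e., when the area is treated in isolation with the regression coefficients held fixed.

```python
import math


def weighted_sum(exceed, theta_a, theta_b):
    """S = sum_i exp(-(theta_A + theta_B' x_Bi)) * log(Y_i / omega) over exceedances."""
    return sum(
        math.exp(-(theta_a + sum(t * x for t, x in zip(theta_b, x_b)))) * z
        for z, x_b in exceed
    )


def score(u, exceed, theta_a, theta_b, sigma2):
    """Derivative in u of the log of the objective in (11) for model (7)."""
    s = weighted_sum(exceed, theta_a, theta_b)
    return -u / sigma2 - len(exceed) + s * math.exp(-u)


def direct_u(exceed, theta_a, theta_b):
    """Maximizer without the random-effect density (no shrinkage)."""
    return math.log(weighted_sum(exceed, theta_a, theta_b) / len(exceed))


def predict_u(exceed, theta_a, theta_b, sigma2, tol=1e-12, max_iter=100):
    """Conditional mode (11) for model (7) by Newton's method."""
    if not exceed:
        return 0.0                      # no exceedances: the mode is the prior mean
    s = weighted_sum(exceed, theta_a, theta_b)
    u = min(0.0, direct_u(exceed, theta_a, theta_b))   # start left of the root
    for _ in range(max_iter):
        grad = -u / sigma2 - len(exceed) + s * math.exp(-u)
        hess = -1.0 / sigma2 - s * math.exp(-u)
        step = grad / hess
        u -= step
        if abs(step) < tol:
            break
    return u


ex = [(0.9, [0.2]), (1.4, [0.5]), (0.3, [0.1])]   # hypothetical exceedances
print(predict_u(ex, -1.0, [0.5], 0.09), direct_u(ex, -1.0, [0.5]))
```

In this sketch the predicted value always lies between zero and the direct value, which is one way to see how the random-effect density pulls area-specific estimates toward the common level.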

2.4 Threshold selection

The thresholds $\omega_{j},\ j\in\mathcal{J}$ in (10) are tuning parameters that balance the quality of the approximation (9) against the amount of data exceeding the thresholds. Setting higher thresholds generally reduces the estimation bias, but makes the estimator more unstable. Conversely, if we lower the thresholds, the estimator will behave more stably, but may be more biased. Therefore, these thresholds $\omega_{j},\ j\in\mathcal{J}$ should be chosen appropriately, considering this trade-off relationship. Here, for each area, we apply the discrepancy measure (see, Wang and Tsai 2009), which considers the goodness of fit of the model, to select the area-wise optimal threshold. In the simulation studies in Section 1.2 of our supplementary material, we verify the performance of the proposed estimator using this threshold selection method.
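The idea behind a goodness-of-fit based threshold choice can be sketched in the simplest setting of a single area without covariates. If (8) holds, then $(Y/\omega)^{-1/\gamma}$ is approximately uniform on $(0,1)$ for the exceedances, so a discrepancy between these transformed values and uniform plotting positions can be minimized over candidate thresholds. The code below is only loosely modeled on this idea: the Hill-type estimate of $\gamma$, the plotting positions $l/(n_{0}+1)$, and the candidate grid are assumptions of this illustration, and the exact definition of the discrepancy measure should be taken from Wang and Tsai (2009).

```python
import math
import random


def hill(sample, omega):
    """Mean of log(Y / omega) over exceedances (Hill-type estimate of gamma)."""
    z = [math.log(y / omega) for y in sample if y > omega]
    return sum(z) / len(z)


def discrepancy(sample, omega):
    """Mean squared gap between transformed exceedances and uniform positions."""
    gamma_hat = hill(sample, omega)
    # (Y / omega)^(-1 / gamma_hat) should look Uniform(0, 1) if (8) fits well.
    uu = sorted((y / omega) ** (-1.0 / gamma_hat) for y in sample if y > omega)
    n0 = len(uu)
    return sum((uu[l] - (l + 1) / (n0 + 1)) ** 2 for l in range(n0)) / n0


def select_threshold(sample, candidates):
    """Candidate threshold with the smallest discrepancy."""
    return min(candidates, key=lambda w: discrepancy(sample, w))


rng = random.Random(1)
gamma_true = 0.5                                     # assumed EVI for the toy data
sample = [(1.0 - rng.random()) ** (-gamma_true) for _ in range(5000)]
ordered = sorted(sample)
candidates = [ordered[int(q * len(ordered))] for q in (0.5, 0.6, 0.7, 0.8, 0.9, 0.95)]
omega_hat = select_threshold(sample, candidates)
print(omega_hat, hill(sample, omega_hat))
```

In the MEM setting this selection would be repeated area by area, with the area-specific fitted EVI in place of the Hill-type estimate.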

3 Asymptotic properties

We investigate the asymptotic properties of the proposed estimator $(\hat{\bm{\theta}}_{\rm{A}},\hat{\bm{\theta}}_{\rm{B}},\hat{\bm{\Sigma}})$ . In general, the following three types of asymptotic scenarios may be considered:

(i) $J$ remains finite while $n_{j},\ j\in\mathcal{J}$ tend to infinity.

(ii) $n_{j},\ j\in\mathcal{J}$ remain finite while $J$ tends to infinity.

(iii) $J$ and $n_{j},\ j\in\mathcal{J}$ tend to infinity.

In applications using the peak-over-threshold method, the sample size of threshold exceedances is often small for some areas. Therefore, we want to use as many related area sources as possible. Such a scenario can be expressed mathematically as $J\to\infty$ . Thus, (i) does not match the background of using the MEM for extreme value analysis. Meanwhile, if the thresholds $\omega_{j},\ j\in\mathcal{J}$ as well as the sample sizes exceeding the thresholds are fixed, the consistency of the proposed estimator would not be shown because the bias occurring from the approximation (9) cannot be improved. Ignoring such a bias is outside the concept of EVT (see, Theorems 2 and 4 of Wang and Tsai 2009). This implies that (ii) is also not realistic in our study. To evaluate the impact of the choice of thresholds and bias of the proposed estimator, we must consider the case where $\omega_{j}\to\infty,\ j\in\mathcal{J}$ and the sample sizes exceeding the thresholds also tend to infinity, which can be taken under $n_{j}\to\infty,\ j\in\mathcal{J}$ . Consequently, (iii) is most important for establishing EVT for the MEM, and we assume this case in the following Sections 3.1 and 3.2.

Nie (2007) and Jiang, Wand, and Bhaskaran (2022) showed the asymptotic normality of the maximum likelihood estimator of the generalized mixed effects model under (iii). Thus, we can say that the following Theorem 1 extends their results from the generalized mixed effects model to the MEM for EVT.

3.1 Conditions

Let $n_{j0}\coloneqq\sum_{i=1}^{n_{j}}I(Y_{ij}>\omega_{j})$ , which is the sample size exceeding the threshold $\omega_{j}$ for the $j$ th area. Additionally, we define $n_{0}\coloneqq J^{-1}\sum_{j=1}^{J}n_{j0}$ as the average of the effective sample sizes of all areas. Note that $n_{j0},\ j\in\mathcal{J}$ and $n_{0}$ are random variables, not constants. In the case (iii) defined above, we assume that for each $j\in\mathcal{J}$ , the threshold $\omega_{j}$ diverges to infinity in tandem with the sequence of $J$ and the $j$ th within-area sample size $n_{j}$ . Accordingly, we denote $\omega_{j}$ by $\omega_{(J,n_{j})}$ . The asymptotic properties of the proposed estimator $(\hat{\bm{\theta}}_{\rm{A}},\hat{\bm{\theta}}_{\rm{B}},\hat{\bm{\Sigma}})$ rely on the following assumptions (A1)-(A6):

(A1) $\mathcal{L}(y;{\bm{u}},{\bm{x}})$ in (5) belongs to the Hall class (see, Hall 1982), that is,

\mathcal{L}(y;{\bm{u}},{\bm{x}})=c_{0}({\bm{u}},{\bm{x}})+c_{1}({\bm{u}},{\bm{x}})y^{-\beta({\bm{u}},{\bm{x}})}+\lambda(y;{\bm{u}},{\bm{x}}),

(12)

where $c_{0}({\bm{u}},{\bm{x}})>0$ , $c_{1}({\bm{u}},{\bm{x}})$ , $\beta({\bm{u}},{\bm{x}})>0$ and $\lambda(y;{\bm{u}},{\bm{x}})$ are continuous and bounded. Furthermore, $\lambda(y;{\bm{u}},{\bm{x}})$ satisfies

\sup_{{\bm{u}}\in\mathbb{R}^{p_{\rm{A}}},\ {\bm{x}}\in\mathbb{R}^{p}}\left[y^{\beta({\bm{u}},{\bm{x}})}\lambda(y;{\bm{u}},{\bm{x}})\right]\to 0\quad{\rm{as}}\quad y\to\infty.

(A2) There exists a bounded and continuous function $\delta:\mathbb{R}^{p_{\rm{A}}}\times\mathbb{R}^{p}\to\mathbb{R}^{+}$ such that

\sup_{{\bm{u}}\in\mathbb{R}^{p_{\rm{A}}},\ {\bm{x}}\in\mathbb{R}^{p}}\left\lvert\frac{P(Y_{ij}>y\mid{\bm{U}}_{j}={\bm{u}},{\bm{X}}_{ij}={\bm{x}})}{P(Y_{ij}>y\mid{\bm{U}}_{j}={\bm{u}})}-\delta({\bm{u}},{\bm{x}})\right\rvert\to 0\quad{\rm{as}}\quad y\to\infty.

(A3) As $n_{j}\to\infty,\ j\in\mathcal{J}$ and $J\to\infty$ ,

\inf_{j\in\mathcal{J},\ {\bm{u}}\in\mathbb{R}^{p_{\rm{A}}}}n_{j}P(Y_{ij}>\omega_{(J,n_{j})}\mid{\bm{U}}_{j}={\bm{u}})\to\infty.

(A4) There exist some bounded and continuous functions $d_{j}:\mathbb{R}^{p_{\rm{A}}}\to\mathbb{R}^{+},\ j\in\mathcal{J}$ such that under given ${\bm{U}}_{j}={\bm{u}}$ , $n_{j0}/n_{0}\to^{P}d_{j}({\bm{u}})$ uniformly for all $j\in\mathcal{J}$ and ${\bm{u}}\in\mathbb{R}^{p_{\rm{A}}}$ as $n_{j}\to\infty,\ j\in\mathcal{J}$ and $J\to\infty$ , where the symbol “ $\to^{P}$ ” represents convergence in probability.

(A5) $n_{0}/J\to^{P}0$ as $n_{j}\to\infty,\ j\in\mathcal{J}$ and $J\to\infty$ .

(A6) There exist some bounded and continuous functions ${\bm{b}}_{{\rm{K}}j}:\mathbb{R}^{p_{\rm{A}}}\to\mathbb{R}^{p_{\rm{K}}},\ j\in\mathcal{J},\ {\rm{K}}\in\{{\rm{A}},{\rm{B}}\}$ such that

\sup_{j\in\mathcal{J},\ {\bm{u}}\in\mathbb{R}^{p_{\rm{A}}}}\left\lVert\frac{J^{1/2}n_{0}^{1/2}{E_{{\bm{X}}_{ij}}}\left[{\bm{X}}_{{\rm{K}}ij}\zeta_{j}({\bm{u}},{\bm{X}}_{ij})\right]}{P(Y_{ij}>\omega_{(J,n_{j})}\mid{\bm{U}}_{j}={\bm{u}})^{1/2}}-{\bm{b}}_{{\rm{K}}j}({\bm{u}})\right\rVert\to 0

as $n_{j}\to\infty,\ j\in\mathcal{J}$ and $J\to\infty$ , where

\zeta_{j}({\bm{u}},{\bm{x}})\coloneqq\frac{c_{1}({\bm{u}},{\bm{x}})\gamma({\bm{u}},{\bm{x}})\beta({\bm{u}},{\bm{x}})}{1+\gamma({\bm{u}},{\bm{x}})\beta({\bm{u}},{\bm{x}})}\omega_{(J,n_{j})}^{-1/\gamma({\bm{u}},{\bm{x}})-\beta({\bm{u}},{\bm{x}})}

and $\lVert\cdot\rVert$ refers to the Euclidean norm.

(A1) and (A2) regularize the tail behavior of the conditional response distribution (5) (see, Wang and Tsai 2009; Ma, Jiang, and Huang 2019). (A3)-(A6) impose the constraints on the divergence rates of the thresholds $\omega_{(J,n_{j})},\ j\in\mathcal{J}$ . (A3) implies that for each $j\in\mathcal{J}$ , the effective sample size $n_{j0}$ asymptotically diverges to infinity. Under (A4), $n_{j0},\ j\in\mathcal{J}$ are not critically different. Furthermore, (A5) means that the number of areas $J$ is relatively larger than the effective sample sizes $n_{j0},\ j\in\mathcal{J}$ . (A5) mathematically links the divergence rates of $\omega_{(J,n_{j})},\ j\in\mathcal{J}$ and $J$ . (A6) is related to the asymptotic bias of the proposed estimator. If (A6) fails, the consistency of the proposed estimator may not be guaranteed.

3.2 Asymptotic normality

Let ${\bm{M}}$ be a matrix of zeros and ones such that ${\bm{M}}{\rm{vech}}({\bm{A}})={\rm{vec}}({\bm{A}})$ for all symmetric matrices ${\bm{A}}\in\mathbb{R}^{p_{\rm{A}}\times p_{\rm{A}}}$ , where ${\rm{vec}}(\cdot)$ is a vector operator, and ${\rm{vech}}(\cdot)$ is a vector half operator that stacks the lower triangular half of a given $d\times d$ square matrix into the single vector of length $d(d+1)/2$ (see, Magnus and Neudecker 1988). The Moore-Penrose inverse of ${\bm{M}}$ is ${\bm{M}}_{*}\coloneqq({\bm{M}}^{\top}{\bm{M}})^{-1}{\bm{M}}^{\top}$ .
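The operators ${\rm{vec}}$ and ${\rm{vech}}$ and the matrix ${\bm{M}}$ can be made concrete with a small example. The sketch below builds ${\bm{M}}$ for a given dimension and computes ${\bm{M}}_{*}$ using the fact that each row of ${\bm{M}}$ contains exactly one nonzero entry, so that ${\bm{M}}^{\top}{\bm{M}}$ is diagonal. The function names and the $2\times 2$ example matrix are illustrative assumptions.

```python
def vec(A):
    """Stack the columns of a square matrix given as a list of rows."""
    d = len(A)
    return [A[i][j] for j in range(d) for i in range(d)]


def vech(A):
    """Stack the lower-triangular part of A column by column."""
    d = len(A)
    return [A[i][j] for j in range(d) for i in range(j, d)]


def duplication_matrix(d):
    """0-1 matrix M with M vech(A) = vec(A) for every symmetric d x d matrix A."""
    index = {}
    for j in range(d):
        for i in range(j, d):
            index[(i, j)] = len(index)       # position of A[i][j] in vech(A)
    M = [[0] * len(index) for _ in range(d * d)]
    for j in range(d):
        for i in range(d):
            M[j * d + i][index[(max(i, j), min(i, j))]] = 1
    return M


def moore_penrose(M):
    """M_* = (M'M)^{-1} M'; M'M is diagonal here, with the column sums on the diagonal."""
    n_rows, n_cols = len(M), len(M[0])
    counts = [sum(M[r][c] for r in range(n_rows)) for c in range(n_cols)]
    return [[M[r][c] / counts[c] for r in range(n_rows)] for c in range(n_cols)]


def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


A = [[2.0, 0.5], [0.5, 1.0]]                 # illustrative symmetric matrix
M = duplication_matrix(2)
print(matvec(M, vech(A)), matvec(moore_penrose(M), vec(A)))
```

For this example, ${\bm{M}}$ maps the three elements of ${\rm{vech}}({\bm{A}})$ to the four elements of ${\rm{vec}}({\bm{A}})$, and ${\bm{M}}_{*}$ maps them back.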

For the approximate maximum likelihood estimator $(\hat{\bm{\theta}}_{\rm{A}},\hat{\bm{\theta}}_{\rm{B}},\hat{\bm{\Sigma}})$ defined in Section 2.2, we obtain the following Theorem 1.

Theorem 1.

Suppose that (A1)-(A6) hold. Then, as $n_{j}\to\infty,\ j\in\mathcal{J}$ and $J\to\infty$ ,

\begin{bmatrix}J^{1/2}\left(\hat{\bm{\theta}}_{\rm{A}}-{\bm{\theta}}_{\rm{A}}^{0}\right)\\ J^{1/2}n_{0}^{1/2}\left(\hat{\bm{\theta}}_{\rm{B}}-{\bm{\theta}}_{\rm{B}}^{0}\right)\\ J^{1/2}{\rm{vech}}\left(\hat{\bm{\Sigma}}-{\bm{\Sigma}}_{0}\right)\end{bmatrix}+\begin{bmatrix}n_{0}^{-1/2}{\bm{b}}_{\rm{A}}\\ {\bm{b}}_{\rm{B}}\\ n_{0}^{-1/2}{\bm{b}}_{\rm{C}}\\ \end{bmatrix}\xrightarrow{D}N\left({\bm{0}},\begin{bmatrix}{\bm{\Delta}}_{\rm{A}}&{\bm{O}}&{\bm{O}}\\ {\bm{O}}&{\bm{\Delta}}_{\rm{B}}&{\bm{O}}\\ {\bm{O}}&{\bm{O}}&{\bm{\Delta}}_{\rm{C}}\end{bmatrix}\right),

where the symbol “ $\xrightarrow{D}$ ” denotes convergence in distribution, each ${\bm{O}}$ is a zero matrix of appropriate size, and ${\bm{b}}_{\rm{K}}$ and ${\bm{\Delta}}_{\rm{K}},\ {\rm{K}}\in\{{\rm{A}},{\rm{B}},{\rm{C}}\}$ are defined as follows:

\begin{aligned}
{\bm{b}}_{\rm{A}}&\coloneqq\lim_{J\to\infty}J^{-1}\sum_{j=1}^{J}E\left[d_{j}({\bm{U}}_{j})^{-1/2}{\bm{\Phi}}_{\rm{AA}}({\bm{U}}_{j})^{-1}{\bm{b}}_{{\rm{A}}j}({\bm{U}}_{j})\right],\\
{\bm{b}}_{\rm{B}}&\coloneqq\lim_{J\to\infty}J^{-1}\sum_{j=1}^{J}{\bm{\Delta}}_{\rm{B}}E\left[d_{j}({\bm{U}}_{j})^{1/2}\left[{\bm{b}}_{{\rm{B}}j}({\bm{U}}_{j})-{\bm{\Phi}}_{\rm{AB}}({\bm{U}}_{j})^{\top}{\bm{\Phi}}_{\rm{AA}}({\bm{U}}_{j})^{-1}{\bm{b}}_{{\rm{A}}j}({\bm{U}}_{j})\right]\right],\\
{\bm{b}}_{\rm{C}}&\coloneqq\lim_{J\to\infty}J^{-1}\sum_{j=1}^{J}{\bm{\Delta}}_{\rm{C}}{\bm{M}}_{*}\left({\bm{\Sigma}}_{0}\otimes{\bm{\Sigma}}_{0}\right)^{-1}\\
&\quad\times{\rm{vec}}\left(E\left[d_{j}({\bm{U}}_{j})^{-1/2}\left[{\bm{U}}_{j}{\bm{b}}_{{\rm{A}}j}({\bm{U}}_{j})^{\top}{\bm{\Phi}}_{\rm{AA}}({\bm{U}}_{j})^{-1}+{\bm{\Phi}}_{\rm{AA}}({\bm{U}}_{j})^{-1}{\bm{b}}_{{\rm{A}}j}({\bm{U}}_{j}){\bm{U}}_{j}^{\top}\right]\right]\right),\\
{\bm{\Delta}}_{\rm{A}}&\coloneqq{\bm{\Sigma}}_{0},\\
{\bm{\Delta}}_{\rm{B}}&\coloneqq E\left[{\bm{\Phi}}_{\rm{BB}}({\bm{U}}_{j})-{\bm{\Phi}}_{\rm{AB}}({\bm{U}}_{j})^{\top}{\bm{\Phi}}_{\rm{AA}}({\bm{U}}_{j})^{-1}{\bm{\Phi}}_{\rm{AB}}({\bm{U}}_{j})\right]^{-1}\quad{\text{and}}\\
{\bm{\Delta}}_{\rm{C}}&\coloneqq 2\left[{\bm{M}}_{*}\left({\bm{\Sigma}}_{0}\otimes{\bm{\Sigma}}_{0}\right)^{-1}{\bm{M}}_{*}^{\top}\right]^{-1},
\end{aligned}

where ${\bm{\Phi}}_{{\rm{K}}_{1}{\rm{K}}_{2}}({\bm{U}}_{j})\coloneqq E_{{\bm{X}}_{ij}}[\delta({\bm{U}}_{j},{\bm{X}}_{ij}){\bm{X}}_{{\rm{K}}_{1}ij}{\bm{X}}_{{\rm{K}}_{2}ij}^{\top}]$ for ${\rm{K}}_{1},{\rm{K}}_{2}\in\{{\rm{A}},{\rm{B}}\}$ , and $\otimes$ is the Kronecker product.

Remark 1.

From Theorem 1, $\hat{\bm{\theta}}_{\rm{A}}$ and $\hat{\bm{\Sigma}}$ are $\sqrt{J}$ -consistent, and $\hat{\bm{\theta}}_{\rm{B}}$ is $\sqrt{Jn_{0}}$ -consistent. Furthermore, $\hat{\bm{\theta}}_{\rm{A}}$ , $\hat{\bm{\theta}}_{\rm{B}}$ and $\hat{\bm{\Sigma}}$ are asymptotically independent. If $J$ and $n_{j},\ j\in\mathcal{J}$ are sufficiently large, the covariance matrix of the proposed estimator is obtained as

{\rm{cov}}\left[\hat{\bm{\theta}}_{\rm{A}}\right]\approx J^{-1}{\bm{\Delta}}_{\rm{A}},\;{\rm{cov}}\left[\hat{\bm{\theta}}_{\rm{B}}\right]\approx(Jn_{0})^{-1}{\bm{\Delta}}_{\rm{B}}\quad{\rm{and}}\quad{\rm{cov}}\left[{\rm{vech}}\left(\hat{\bm{\Sigma}}\right)\right]\approx J^{-1}{\bm{\Delta}}_{\rm{C}}.

Theorem 1 also reveals the asymptotic bias of the proposed estimator caused by the approximation (9). If $J$ and $n_{j},\ j\in\mathcal{J}$ are sufficiently large, it can be approximated as

	$\displaystyle E\left[\hat{\bm{\theta}}_{\rm{A}}\right]-{\bm{\theta}}_{\rm{A}}^{0}\approx\left(Jn_{0}\right)^{-1/2}{\bm{b}}_{\rm{A}},$
	$\displaystyle E\left[\hat{\bm{\theta}}_{\rm{B}}\right]-{\bm{\theta}}_{\rm{B}}^{0}\approx\left(Jn_{0}\right)^{-1/2}{\bm{b}}_{\rm{B}}\quad{\rm{and}}$
	$\displaystyle E\left[{\rm{vech}}\left(\hat{\bm{\Sigma}}\right)\right]-{\rm{vech}}\left({\bm{\Sigma}}_{0}\right)\approx\left(Jn_{0}\right)^{-1/2}{\bm{b}}_{\rm{C}}.$

As shown in (A6), ${\bm{b}}_{{\rm{K}}j}$ depends on the EVI function $\gamma({\bm{u}},{\bm{x}})$ , and the proposed estimator is more biased for larger $\gamma({\bm{u}},{\bm{x}})$ , that is, the heavier the right tail of the response distribution (5). Furthermore, ${\bm{b}}_{{\rm{K}}j}$ is also affected by $\beta({\bm{u}},{\bm{x}})$ defined in (12), and the proposed estimator is more biased for smaller $\beta({\bm{u}},{\bm{x}})$ . Meanwhile, $c_{0}({\bm{u}},{\bm{x}})$ in (12), which is the scaling constant to ensure that the upper bound of (5) is equal to one, is not related to the asymptotic bias of the proposed estimator.

Remark 2.

From Theorem 1, we can confirm the good compatibility between the MEM and EVT as follows. In extreme value analysis, we want to set the threshold as high as possible to ensure a good fit with the Pareto distribution, as shown in (9). However, the estimator may have a large variance because the amount of available data becomes small. Meanwhile, the variance of the proposed estimator for (5) with (6) depends strongly on the number of areas $J$ and improves as $J$ increases, as described in Remark 1. Note that the magnitude of $J$ is unaffected by the choice of thresholds $\omega_{j},\ j\in\mathcal{J}$ , unlike $n_{0}$ . Therefore, even if the threshold is high for some areas, the proposed estimator is expected to remain stable as long as $J$ is sufficiently large. Note that estimating the bias of the proposed estimator is a difficult problem because $\beta({\bm{u}},{\bm{x}})$ and $c_{1}({\bm{u}},{\bm{x}})$ in (12) must be estimated. However, if $J$ is sufficiently large, by setting reasonably high thresholds, we may avoid this bias estimation problem while ensuring the stability of the estimator. Such phenomena are numerically confirmed in the simulation study in Section 1.2 of our supplementary material.

Remark 3.

Theorem 1 is directly applicable to confidence interval construction and statistical hypothesis testing on the parameters ${\bm{\theta}}_{\rm{A}}^{0}$ , ${\bm{\theta}}_{\rm{B}}^{0}$ and ${\bm{\Sigma}}_{0}$ . The choice of covariates is crucial for obtaining efficient estimates. In particular, including too many irrelevant covariates in the model adversely affects the parameter estimates, and the “borrowing of strength” becomes ineffective. Therefore, we must check the relevance of the selected explanatory variables, and hypothesis testing is useful for this purpose. Such a test typically asks whether each component of ${\bm{\theta}}_{\rm{A}}^{0}$ and ${\bm{\theta}}_{\rm{B}}^{0}$ is significantly different from zero. To carry out this test, we must estimate ${\bm{\Delta}}_{\rm{B}}^{-1}$ , which can be naturally estimated by

\hat{\bm{\Delta}}_{\rm{B}}^{-1}\coloneqq J^{-1}\sum_{j=1}^{J}\left(\hat{\bm{\Phi}}_{{\rm{BB}}j}-\hat{\bm{\Phi}}_{{\rm{AB}}j}^{\top}\hat{\bm{\Phi}}_{{\rm{AA}}j}^{-1}\hat{\bm{\Phi}}_{{\rm{AB}}j}\right),

(13)

where $\hat{\bm{\Phi}}_{{\rm{K}}_{1}{\rm{K}}_{2}j}\coloneqq n_{j0}^{-1}\sum_{i=1}^{n_{j}}{\bm{X}}_{{\rm{K}}_{1}ij}{\bm{X}}_{{\rm{K}}_{2}ij}^{\top}I(Y_{ij}>\omega_{j}),\ {\rm{K}}_{1},{\rm{K}}_{2}\in\{{\rm{A}},{\rm{B}}\}$ . In Section 4.3, the hypothesis test on ${\bm{\theta}}_{\rm{B}}^{0}$ is demonstrated for a real dataset (see, Section 1.1 of our supplementary material).
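As a rough illustration of (13), the following Python sketch treats the nested error case ${\bm{X}}_{{\rm{A}}ij}\equiv 1$. It assumes that $n_{j0}$ is the number of exceedances $Y_{ij}>\omega_{j}$ in the $j$ th area; the function name `delta_b_inverse` and the data layout are hypothetical. Under this assumption, $\hat{\bm{\Phi}}_{{\rm{AA}}j}=1$ and $\hat{\bm{\Phi}}_{{\rm{AB}}j}$ is the mean of ${\bm{X}}_{{\rm{B}}ij}$ over the exceedances, so each summand in (13) reduces to the sample covariance matrix (with divisor $n_{j0}$) of ${\bm{X}}_{{\rm{B}}ij}$ among the exceedances.

```python
def delta_b_inverse(areas):
    """Sketch of (13) when X_A is identically 1.

    `areas` holds one (ys, xs, omega) triple per area j, where ys is a list
    of responses, xs a list of covariate vectors X_B, and omega the threshold.
    Assumption: n_{j0} is the number of exceedances {i : y > omega}.
    """
    p = len(areas[0][1][0])
    total = [[0.0] * p for _ in range(p)]
    for ys, xs, omega in areas:
        exceed = [x for y, x in zip(ys, xs) if y > omega]
        n_j0 = len(exceed)
        if n_j0 == 0:
            raise ValueError("each area needs at least one exceedance")
        # Phi_AB (a 1 x p row vector since X_A = 1): mean of X_B over exceedances
        phi_ab = [sum(x[k] for x in exceed) / n_j0 for k in range(p)]
        for k in range(p):
            for l in range(p):
                # (k, l) entry of Phi_BB: mean of X_B[k] * X_B[l] over exceedances
                phi_bb = sum(x[k] * x[l] for x in exceed) / n_j0
                # Phi_AA = 1, so the correction term is phi_ab[k] * phi_ab[l]
                total[k][l] += phi_bb - phi_ab[k] * phi_ab[l]
    n_areas = len(areas)
    return [[v / n_areas for v in row] for row in total]


# Toy data with one binary covariate and two areas (values are made up).
areas = [
    ([1.0, 3.0, 5.0, 7.0], [[0.0], [1.0], [0.0], [1.0]], 2.0),
    ([2.0, 4.0, 6.0], [[1.0], [1.0], [0.0]], 3.0),
]
est = delta_b_inverse(areas)
```

In the toy data, the two areas contribute the within-area variances of the binary covariate among their exceedances, and `est` is their average.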

As described in Section 2.1, an important example of (6) is the type of nested error regression model (7). For the model (7), Theorem 1 can be simplified to the following Corollary 1. We define $\sigma_{0}^{2}\coloneqq{\rm{var}}[U_{j}]$ for the random effects $U_{j},\ j\in\mathcal{J}$ and denote its proposed estimator as $\hat{\sigma}^{2}$ .

Corollary 1.

Suppose that (A1)-(A6) hold. Then, as $n_{j}\to\infty,\ j\in\mathcal{J}$ and $J\to\infty$ ,

\begin{bmatrix}J^{1/2}\left(\hat{\theta}_{\rm{A}}-\theta_{\rm{A}}^{0}\right)\\ J^{1/2}n_{0}^{1/2}\left(\hat{\bm{\theta}}_{\rm{B}}-{\bm{\theta}}_{\rm{B}}^{0}\right)\\ J^{1/2}\left(\hat{\sigma}^{2}-\sigma_{0}^{2}\right)\end{bmatrix}+\begin{bmatrix}n_{0}^{-1/2}v_{\rm{A}}\\ {\bm{v}}_{\rm{B}}\\ n_{0}^{-1/2}v_{\rm{C}}\\ \end{bmatrix}\xrightarrow{D}N\left({\bm{0}},\begin{bmatrix}\sigma_{0}^{2}&{\bm{O}}&{\bm{O}}\\ {\bm{O}}&{\bm{\Omega}}_{\rm{B}}&{\bm{O}}\\ {\bm{O}}&{\bm{O}}&2\left(\sigma_{0}^{2}\right)^{2}\end{bmatrix}\right),

where $v_{\rm{A}}$ , ${\bm{v}}_{\rm{B}}$ , $v_{\rm{C}}$ and ${\bm{\Omega}}_{\rm{B}}$ are defined as follows:

	$\displaystyle v_{\rm{A}}\coloneqq\lim_{J\to\infty}J^{-1}\sum_{j=1}^{J}E\left[d_{j}(U_{j})^{-1/2}v_{{\rm{A}}j}(U_{j})\right],$
	$\displaystyle{\bm{v}}_{\rm{B}}\coloneqq\lim_{J\to\infty}J^{-1}\sum_{j=1}^{J}{\bm{\Omega}}_{\rm{B}}E\left[d_{j}(U_{j})^{1/2}\left[{\bm{b}}_{{\rm{B}}j}({\bm{U}}_{j})-v_{{\rm{A}}j}(U_{j}){\bm{\Psi}}_{\rm{B}}(U_{j})\right]\right],$
	$\displaystyle v_{\rm{C}}\coloneqq\lim_{J\to\infty}J^{-1}\sum_{j=1}^{J}4{\rm{vec}}\left(E\left[d_{j}(U_{j})^{-1/2}v_{{\rm{A}}j}(U_{j})U_{j}\right]\right)\quad{\text{and}}$
	$\displaystyle{\bm{\Omega}}_{\rm{B}}\coloneqq E\left[{\bm{\Phi}}_{\rm{BB}}(U_{j})-{\bm{\Psi}}_{\rm{B}}(U_{j}){\bm{\Psi}}_{\rm{B}}(U_{j})^{\top}\right]^{-1},$

where $v_{{\rm{A}}j}(U_{j})$ is ${\bm{b}}_{{\rm{A}}j}(U_{j})$ with $p_{\rm{A}}=1$ and ${\bm{X}}_{{\rm{A}}ij}\equiv 1$ , ${\bm{\Psi}}_{\rm{B}}(U_{j})\coloneqq E_{{\bm{X}}_{{\rm{B}}ij}}[\delta(U_{j},{\bm{X}}_{{\rm{B}}ij}){\bm{X}}_{{\rm{B}}ij}]$ , and ${\bm{\Phi}}_{\rm{BB}}(U_{j})$ is defined in Theorem 1.

4 Application

4.1 Background

In this section, we analyze the returns of a large number of cryptocurrencies. Over the past decade, the cryptocurrency market has grown rapidly, led by Bitcoin. As of January 2024, the number of active cryptocurrency stocks is estimated to be over 9,000 (see, https://www.statista.com/statistics/863917/number-crypto-coins-tokens/). However, despite the large number of stocks, most studies are limited to a few major cryptocurrencies such as Bitcoin. Gkillas and Katsiampa (2018) studied the extreme value analysis of the returns of five cryptocurrencies, namely Bitcoin, Ethereum, Ripple, Litecoin and Bitcoin Cash. We extend their work to a larger number of stocks including these five cryptocurrencies. Our goal is to demonstrate the advantage of simultaneously analyzing many cryptocurrency stocks using the MEM. In Section 4.3, we report the interpretation of our model. In Section 4.4, we compare the performance of our method with that of stock-by-stock analysis using the method of Wang and Tsai (2009).

4.2 Dataset

The cryptocurrency dataset is available on CoinMarketCap (see, https://coinmarketcap.com/). We use this dataset from the last ten years, from January 1, 2014 to December 31, 2023. Then, we cover the 413 stocks that are in the top 500 on CoinMarketCap as of February 2, 2024 and have returns of 364 days or more. From the daily closing prices for each stock, the returns can be calculated as $\log P_{tj}-\log P_{(t-1)j}$ , where $P_{tj}$ refers to the closing price on the $t$ th day in the $j$ th cryptocurrency. Let the non-missing returns of the $j$ th cryptocurrency be denoted as $\{Y_{ij},\ i=1,2,\ldots,n_{j}\}$ for $j=1,2,\ldots,413$ . The sample size $n_{j}$ of the $j$ th cryptocurrency is primarily determined by the launch date. Note that the index $i$ does not represent the same date for all stocks.
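The return calculation above can be sketched in Python as follows. The helper name `log_returns` and the convention of marking missing prices by `None` are assumptions made for illustration only; a return is treated as missing whenever either of the two closing prices involved is missing.

```python
import math


def log_returns(prices):
    """Daily log returns log P_t - log P_{t-1}; None where a price is missing."""
    out = []
    for prev, cur in zip(prices, prices[1:]):
        if prev is None or cur is None or prev <= 0 or cur <= 0:
            out.append(None)  # return undefined without both closing prices
        else:
            out.append(math.log(cur) - math.log(prev))
    return out


# Toy closing prices for one stock with one missing day (values are made up).
prices = [100.0, 110.0, None, 121.0, 133.1]
returns = log_returns(prices)
# Non-missing returns play the role of {Y_ij, i = 1, ..., n_j}.
non_missing = [r for r in returns if r is not None]
```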

We examine the tail behavior of both the negative and positive returns for each stock. In many applications of EVT, the sample kurtosis has been used to check the suitability of the Pareto-type distribution (see, Wang and Tsai 2009; Ma, Jiang, and Huang 2019). In this study, the sample kurtosis of $\{Y_{ij}\}_{i\in\mathcal{N}_{j}}$ was greater than zero for each $j=1,2,\ldots,413$ , and the minimum sample kurtosis for the 413 stocks was 1.833, implying that the returns of each cryptocurrency are heavy-tailed. Therefore, we use the Pareto-type distribution instead of the GPD to analyze high threshold exceedances for both negative and positive returns. Sections 4.3 and 4.4 below describe our method only for positive returns. However, by replacing $\{Y_{ij}\}_{i\in\mathcal{N}_{j},j\in\mathcal{J}}$ with $\{-Y_{ij}\}_{i\in\mathcal{N}_{j},j\in\mathcal{J}}$ , the same method also applies to the analysis of extreme negative returns.
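For reference, a kurtosis check of this kind might look like the Python sketch below. It assumes that "sample kurtosis" means the moment-based excess kurtosis, for which zero corresponds to the normal distribution; this reading, and the function name `excess_kurtosis`, are assumptions rather than details taken from the analysis itself.

```python
def excess_kurtosis(values):
    """Moment-based excess kurtosis m4 / m2^2 - 3 (zero for a normal law)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3.0


# Made-up samples: evenly spread values versus values with two large outliers.
light = [-2.0, -1.0, 0.0, 1.0, 2.0]
heavy = [-10.0, -1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 10.0]
k_light = excess_kurtosis(light)
k_heavy = excess_kurtosis(heavy)
```

A positive value, as for `heavy`, points to tails heavier than those of the normal distribution, which is the pattern reported for all 413 stocks.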

For each pair of the 413 stocks, we estimated the tail dependence parameter from returns above the 95th percentile for both negative and positive returns (see, Reiss and Thomas 2007). The results showed that the percentage of pairs with a tail dependence parameter exceeding 0.7 was only 0.074% for negative returns and only 0.032% for positive returns. Therefore, in this analysis, we do not consider the dependence between each pair of stocks.

4.3 Analysis by our model

In many cryptocurrency applications, dummy variables such as year, month, and day of the week have often been utilized as covariates to account for the non-stationarity of returns. In Longin and Pagliardi (2016) and Gkillas and Katsiampa (2018), the returns were adjusted in terms of variance using some dummy variables and were analyzed under the assumption that the EVI of each cryptocurrency remained constant over the period, despite the dramatic growth of the market. However, in their approach, it is unclear which variables influence the returns and to what extent. Therefore, in our application, we employ EVI regression to examine the effects of dummy variables for year (9 variables), month (11 variables) and day of the week (6 variables) on the tail behavior of the distribution of returns. The second column of Table 1 shows the assignment of the 26 dummy variables. We denote the vector of these dummy variables as ${\bm{X}}_{ij}\in\prod_{k=1}^{26}\{0,1\}\subset\mathbb{R}^{26}$ for $i=1,2,\ldots,n_{j}$ and $j=1,2,\ldots,413$ . For our model in Section 2.1, we set ${\bm{X}}_{{\rm{A}}ij}\equiv 1$ and ${\bm{X}}_{{\rm{B}}ij}={\bm{X}}_{ij}$ and denote the random effects as $U_{1},U_{2},\ldots,U_{413}\stackrel{{\scriptstyle\rm{i.i.d.}}}{{\sim}}N(0,\sigma^{2})$ , where $\sigma^{2}>0$ is an unknown variance. Then, for the underlying Pareto-type distribution (5), the EVI (6) conditional on $U_{j}=u_{j}$ and ${\bm{X}}_{ij}={\bm{x}}$ is modeled as

\gamma(u_{j},{\bm{x}})=\exp\left(\theta_{\rm{A}}+u_{j}+{\bm{\theta}}_{\rm{B}}^{\top}{\bm{x}}\right),\quad j=1,2,\ldots,413,

(14)

where $\theta_{\rm{A}}\in\mathbb{R}$ and ${\bm{\theta}}_{\rm{B}}\in\mathbb{R}^{26}$ are unknown coefficients. Below, we analyze the returns of the 413 cryptocurrencies simultaneously using (14). For the implementation, we use the glmer() function in the package lme4 (see, Bates et al. 2015) within the R computing environment (see, R Core Team 2021). A more detailed explanation is given in Section 1 of our supplementary material.

Table 1: The estimation and test results for the cryptocurrency dataset. In the third and fourth columns, the values in parentheses show the widths of the $95\%$ confidence intervals. In the fifth and sixth columns, “1” indicates that the null hypothesis was rejected.

		Estimate				Rejection		p-value
		Negative		Positive		Negative	Positive	Negative	Positive
$\theta_{\rm{A}}$		$-1.337$	$(0.026)$	$-0.854$	$(0.025)$	$-$	$-$	$-$	$-$
${\bm{\theta}}_{\rm{B}}$	2014	$0.690$	$(0.167)$	$0.534$	$(0.164)$	1	1	$<10^{-3}$	$<10^{-3}$
	2015	$0.732$	$(0.168)$	$0.489$	$(0.167)$	1	1	$<10^{-3}$	$<10^{-3}$
	2016	$0.735$	$(0.133)$	$0.603$	$(0.121)$	1	1	$<10^{-3}$	$<10^{-3}$
	2017	$0.677$	$(0.070)$	$0.605$	$(0.060)$	1	1	$<10^{-3}$	$<10^{-3}$
	2018	$0.461$	$(0.049)$	$0.264$	$(0.050)$	1	1	$<10^{-3}$	$<10^{-3}$
	2019	$0.307$	$(0.049)$	$0.165$	$(0.046)$	1	1	$<10^{-3}$	$<10^{-3}$
	2020	$0.483$	$(0.040)$	$0.219$	$(0.035)$	1	1	$<10^{-3}$	$<10^{-3}$
	2021	$0.383$	$(0.032)$	$0.282$	$(0.027)$	1	1	$<10^{-3}$	$<10^{-3}$
	2022	$0.231$	$(0.029)$	$0.018$	$(0.027)$	1	0	$<10^{-3}$	$0.092$
	2023	$-$	$-$	$-$	$-$	$-$	$-$	$-$	$-$
	Jan	$0.189$	$(0.043)$	$0.050$	$(0.039)$	1	1	$<10^{-3}$	$0.006$
	Feb	$0.097$	$(0.043)$	$0.095$	$(0.039)$	1	1	$<10^{-3}$	$<10^{-3}$
	Mar	$0.141$	$(0.044)$	$0.075$	$(0.038)$	1	1	$<10^{-3}$	$<10^{-3}$
	Apr	$0.016$	$(0.043)$	$0.014$	$(0.042)$	0	0	$0.229$	$0.257$
	May	$0.352$	$(0.038)$	$0.087$	$(0.039)$	1	1	$<10^{-3}$	$<10^{-3}$
	Jun	$0.177$	$(0.039)$	$-0.076$	$(0.042)$	1	1	$<10^{-3}$	$<10^{-3}$
	Jul	$-0.056$	$(0.047)$	$-0.063$	$(0.041)$	1	1	$0.010$	$0.001$
	Aug	$-0.069$	$(0.044)$	$0.029$	$(0.041)$	1	0	$0.001$	$0.084$
	Sep	$0.188$	$(0.042)$	$-0.041$	$(0.042)$	1	0	$<10^{-3}$	$0.028$
	Oct	$-0.071$	$(0.048)$	$-0.039$	$(0.042)$	1	0	$0.002$	$0.033$
	Nov	$0.199$	$(0.040)$	$0.061$	$(0.037)$	1	1	$<10^{-3}$	$0.001$
	Dec	$-$	$-$	$-$	$-$	$-$	$-$	$-$	$-$
	Sun	$-0.032$	$(0.038)$	$-0.024$	$(0.034)$	0	0	$0.053$	$0.079$
	Mon	$0.062$	$(0.034)$	$0.022$	$(0.032)$	1	0	$<10^{-3}$	$0.089$
	Tue	$0.066$	$(0.036)$	$-0.013$	$(0.033)$	1	0	$<10^{-3}$	$0.212$
	Wed	$0.183$	$(0.035)$	$-0.026$	$(0.031)$	1	0	$<10^{-3}$	$0.05$
	Thu	$0.065$	$(0.035)$	$0.061$	$(0.032)$	1	1	$<10^{-3}$	$<10^{-3}$
	Fri	$0.013$	$(0.035)$	$-0.006$	$(0.032)$	0	0	$0.234$	$0.365$
	Sat	$-$	$-$	$-$	$-$	$-$	$-$	$-$	$-$
$\sigma^{2}$		$0.074$	$(0.010)$	$0.068$	$(0.009)$	$-$	$-$	$-$	$-$

First, we estimate the parameters $\{\theta_{\rm{A}},{\bm{\theta}}_{\rm{B}},\sigma^{2}\}$ in (14). The third and fourth columns of Table 1 show the maximum likelihood estimates of these parameters for negative returns and positive returns, respectively. In these columns, the values in parentheses show the widths of the $95\%$ confidence intervals calculated from Corollary 1. According to the estimates in Table 1, for any combination of our dummy variables, the EVI (14) was higher for positive returns than for negative returns. This result suggests that positive returns may have had a greater impact on the cryptocurrency market than negative returns, which seems to be consistent with the overall growth of the market. Meanwhile, the effects of the dummy variables for year reduced the EVI over the years, implying that many cryptocurrency stocks were less risky as assets than in the past.
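To make the use of Corollary 1 concrete, the Python sketch below computes $z_{0.975}$ times the approximate standard errors of $\hat{\theta}_{\rm{A}}$ and $\hat{\sigma}^{2}$, plugging $\hat{\sigma}^{2}$ in for $\sigma_{0}^{2}$ and ignoring the bias terms $v_{\rm{A}}$ and $v_{\rm{C}}$. The function names are hypothetical, and treating the bias terms as zero is an assumption. With $\hat{\sigma}^{2}=0.074$ (negative returns) and $J=413$, the two quantities are roughly 0.026 and 0.010, which is close to the parenthesized entries in the $\theta_{\rm{A}}$ and $\sigma^{2}$ rows of Table 1 if those entries are read as $z_{0.975}$ times the standard error.

```python
import math

Z_975 = 1.959964  # 97.5th percentile of N(0, 1)


def margin_theta_a(sigma2, n_areas):
    # Corollary 1: var(theta_A_hat) is approximately sigma^2 / J
    return Z_975 * math.sqrt(sigma2 / n_areas)


def margin_sigma2(sigma2, n_areas):
    # Corollary 1: var(sigma2_hat) is approximately 2 * (sigma^2)^2 / J
    return Z_975 * math.sqrt(2.0 * sigma2 ** 2 / n_areas)


# Plug-in values from Table 1 (negative returns) with J = 413 stocks.
m_a = margin_theta_a(0.074, 413)
m_s = margin_sigma2(0.074, 413)
```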

Second, we conduct Wald tests to check whether each of our covariates has a significant effect in the model (14). The hypothesis tests of interest are expressed as follows:

{\rm{H}}_{0k}:({\bm{\theta}}_{\rm{B}})_{k}=0\quad{\rm{vs.}}\quad{\rm{H}}_{1k}:({\bm{\theta}}_{\rm{B}})_{k}\neq 0

for $k=1,2,\ldots,26$ , where ${\rm{H}}_{0k}$ is the null hypothesis, ${\rm{H}}_{1k}$ is the alternative hypothesis, and $({\bm{\theta}}_{\rm{B}})_{k}$ is the $k$ th component of ${\bm{\theta}}_{\rm{B}}$ . From Corollary 1, we define the test statistic as

T_{k}\coloneqq({\bm{\Omega}}_{\rm{B}})^{-1/2}_{k}\left[(Jn_{0})^{1/2}(\hat{\bm{\theta}}_{\rm{B}})_{k}-({\bm{v}}_{\rm{B}})_{k}\right],\quad k=1,2,\ldots,26,

(15)

where $({\bm{\Omega}}_{\rm{B}})_{k}$ is the $(k,k)$ entry of ${\bm{\Omega}}_{\rm{B}}$ , and $(\hat{\bm{\theta}}_{\rm{B}})_{k}$ and $({\bm{v}}_{\rm{B}})_{k}$ are the $k$ th components of $\hat{\bm{\theta}}_{\rm{B}}$ and ${\bm{v}}_{\rm{B}}$ , respectively. In (15), we estimate ${\bm{\Omega}}_{\rm{B}}^{-1}$ by (13) and assume that ${\bm{v}}_{\rm{B}}$ is a zero vector. Under the null hypothesis ${\rm{H}}_{0k}$ , the distribution of $T_{k}$ can be approximated by $N(0,1)$ . Therefore, for a given significance level $\alpha$ , we reject the null hypothesis ${\rm{H}}_{0k}$ if $|T_{k}|>z_{1-\alpha/2}$ , where $z_{1-\alpha/2}$ is the $100(1-\alpha/2)$ th percentile of $N(0,1)$ . The fifth and sixth columns of Table 1 show the test results with $\alpha=0.05$ for our real dataset, where “1” indicates that the null hypothesis ${\rm{H}}_{0k}$ was rejected. In addition, the seventh and eighth columns show the associated p-values. From the fifth column of Table 1, which contains many “1” entries, we can see that for negative returns, many dummy variables for year, month, and day of the week cannot be ignored as covariates for predicting the EVI. The sixth column of Table 1 implies that for positive returns, the effects of days of the week may be negligible in our model (14). For both negative and positive returns, the dummy variables for year may be particularly important as covariates because their p-values were quite small.
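The decision rule based on (15) can be sketched in Python as follows, with ${\bm{v}}_{\rm{B}}$ set to zero as in the text. The function name `wald_test` is hypothetical, the inputs $J$, $n_{0}$ and the $(k,k)$ entry of ${\bm{\Omega}}_{\rm{B}}$ are taken as given, and the two-sided p-value $2\{1-\Phi(|T_{k}|)\}$ is the one implied by the rejection rule $|T_{k}|>z_{1-\alpha/2}$; the numerical inputs below are made up.

```python
import math


def wald_test(theta_hat_k, omega_kk, n_areas, n0, alpha=0.05):
    """Statistic (15) with v_B = 0, its two-sided p-value, and the decision."""
    t_k = math.sqrt(n_areas * n0) * theta_hat_k / math.sqrt(omega_kk)
    # 2 * (1 - Phi(|t|)) written via the complementary error function
    p_value = math.erfc(abs(t_k) / math.sqrt(2.0))
    return t_k, p_value, p_value < alpha


# Illustrative inputs only: J = 100 areas, n0 = 4, unit variance entry.
t_big, p_big, reject_big = wald_test(0.10, 1.0, 100, 4)
t_small, p_small, reject_small = wald_test(0.05, 1.0, 100, 4)
```

Rejecting when the p-value is below $\alpha$ is equivalent to the rule $|T_{k}|>z_{1-\alpha/2}$ stated above.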

Third, we predict the random effects $U_{j},\ j=1,2,\ldots,413$ in (14). Since listing all $\tilde{u}_{j}$ for $j=1,2,\ldots,413$ would be too voluminous, we only present the results for the major cryptocurrencies studied in Gkillas and Katsiampa (2018), namely Bitcoin, Ethereum, Ripple, Litecoin and Bitcoin Cash. Table 2 shows the predicted random effects $\tilde{u}_{j}$ for these five cryptocurrencies. Based on the model (14), the results in Table 2 mean that Bitcoin Cash had the largest EVI among the five stocks for both negative and positive returns. Thus, Bitcoin Cash may have been the riskiest of the above five cryptocurrencies in the sense that its returns were the heaviest-tailed.

Table 2: The predicted random effects for Bitcoin, Ethereum, Ripple, Litecoin and Bitcoin Cash.

		Bitcoin	Ethereum	Ripple (XRP)	Litecoin	Bitcoin Cash
$\tilde{u}_{j}$	Negative	$-0.319$	$-0.224$	$-0.319$	$-0.304$	$0.007$
$\tilde{u}_{j}$	Positive	$-0.351$	$-0.502$	$-0.011$	$-0.410$	$0.012$
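Combining Tables 1 and 2, the fitted EVI (14) for a given stock and date can be evaluated directly. The Python sketch below does this for positive returns on a hypothetical Monday in January 2017, using only the three active dummy variables; the function name `evi` and the choice of date are illustrative assumptions.

```python
import math


def evi(theta_a, u_j, theta_b, x):
    """EVI (14): exp(theta_A + u_j + theta_B' x)."""
    return math.exp(theta_a + u_j + sum(b * xi for b, xi in zip(theta_b, x)))


# Positive-return estimates from Table 1; only the active dummies are listed.
theta_a = -0.854
theta_b = [0.605, 0.050, 0.022]  # 2017, Jan, Mon
x = [1, 1, 1]

# Predicted random effects for positive returns from Table 2.
u_bitcoin = -0.351
u_bitcoin_cash = 0.012

gamma_bitcoin = evi(theta_a, u_bitcoin, theta_b, x)
gamma_bitcoin_cash = evi(theta_a, u_bitcoin_cash, theta_b, x)
```

Because the covariate part is shared, the ratio of the two fitted EVIs depends only on the difference between the predicted random effects.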

Finally, we evaluate the goodness of fit of the model using a method similar to that of Wang and Tsai (2009). From (9), the distribution of $S_{ij}\coloneqq\exp[-\gamma(U_{j},{\bm{X}}_{ij})^{-1}\log(Y_{ij}/\omega_{j})]$ conditional on $U_{j},{\bm{X}}_{ij}$ and $Y_{ij}>\omega_{j}$ can be approximated by the uniform distribution on $[0,1]$ (see, Wang and Tsai 2009). Let $\tilde{S}_{ij}$ be $S_{ij}$ with $\theta_{\rm{A}}=\hat{\theta}_{\rm{A}}$ , ${\bm{\theta}}_{\rm{B}}=\hat{\bm{\theta}}_{\rm{B}}$ and $U_{j}=\tilde{u}_{j}$ . Then, for each $j=1,2,\ldots,413$ , $\mathcal{S}_{j}\coloneqq\{\tilde{S}_{ij}:Y_{ij}>\omega_{j},\ i=1,2,\ldots,n_{j}\}$ can be regarded as an approximately uniformly distributed random sample. Figure 1 shows the Q-Q plots of $\mathcal{S}_{j}$ against equally divided points on $[0,1]$ for the five representative cryptocurrencies discussed above, where the bands are the 95% pointwise confidence bands constructed by the function geom_qq_band() with bandType="boot" in the package qqplotr (see, Almeida, Loy, and Hofmann 2018) within R. In each panel of Figure 1, the points were approximately aligned on a straight line with an intercept of 0 and slope of 1. This suggests that our model fits both negative and positive returns well for each of the five stocks.
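The transformation behind this check can be illustrated with a small Python simulation. The sketch below draws exceedances from an exact Pareto tail with a known EVI, so the scores are exactly uniform; the plotting positions $(i-0.5)/n$ are one possible choice of "equally divided points", and the function names and simulated data are assumptions, not part of the real-data analysis.

```python
import math
import random


def uniform_scores(ys, omega, gammas):
    """S = exp(-log(y / omega) / gamma) for the exceedances y > omega."""
    return [math.exp(-math.log(y / omega) / g)
            for y, g in zip(ys, gammas) if y > omega]


def qq_points(scores):
    """Pairs (plotting position, sorted score) for a uniform Q-Q plot."""
    n = len(scores)
    return [((i + 0.5) / n, s) for i, s in enumerate(sorted(scores))]


rng = random.Random(0)
gamma, omega, n = 0.5, 1.0, 5000
# Exact Pareto tail: P(Y > y) = (y / omega)^(-1 / gamma) for y > omega.
ys = [omega * (1.0 - rng.random()) ** (-gamma) for _ in range(n)]
scores = uniform_scores(ys, omega, [gamma] * n)
pairs = qq_points(scores)
# Largest vertical distance of the Q-Q points from the 45-degree line.
max_gap = max(abs(t - s) for t, s in pairs)
```

When the fitted EVI is far from the truth, the points in `pairs` bend away from the 45-degree line and `max_gap` grows.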

Figure 1: The Q-Q plots for Bitcoin, Ethereum, Ripple, Litecoin and Bitcoin Cash.

4.4 Comparison

In this section, we compare our model (14) with the method proposed by Wang and Tsai (2009), which applies EVI regression separately to each stock. To compare these two methods, we iteratively compute the following 2-fold cross-validation criterion. At each iteration step, the returns for each stock are randomly divided into training and test data. As a criterion, we adapt the discrepancy measure proposed by Wang and Tsai (2009). For both our model and that of Wang and Tsai (2009), the discrepancy measure for each stock is computed from the test data using the estimated parameters, predicted random effects, and selected thresholds from the training data. Note that our model and the model proposed by Wang and Tsai (2009) use the same threshold exceedances (see, Section 2.4). After 100 iterations, we take the average of the discrepancy measures obtained for each stock. Each panel of Figure 2 shows the scatter plot of averaged discrepancy measures between our model and the model of Wang and Tsai (2009) for the 413 stocks. We can see from the results that the discrepancy measures of our model are smaller than those of Wang and Tsai (2009) for many stocks for both negative and positive returns, suggesting that our model provides better performance.
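The two ingredients of this comparison, the random split and a uniformity-based discrepancy, can be sketched in Python as follows. The discrepancy used here, the mean squared gap between the sorted scores $\tilde{S}_{ij}$ and their empirical distribution function values, is one simple form assumed for illustration and is not necessarily the exact measure adapted in our analysis; the function names are hypothetical.

```python
import random


def two_fold_split(n, rng):
    """Randomly split the indices 0, ..., n-1 into training and test halves."""
    idx = list(range(n))
    rng.shuffle(idx)
    half = n // 2
    return sorted(idx[:half]), sorted(idx[half:])


def discrepancy(scores):
    """Mean squared gap between sorted scores and empirical CDF values i / n."""
    n = len(scores)
    return sum((s - (i + 1) / n) ** 2
               for i, s in enumerate(sorted(scores))) / n


rng = random.Random(1)
train_idx, test_idx = two_fold_split(11, rng)

# Scores that are nearly uniform versus scores pushed toward zero.
uniform_like = [(i + 0.5) / 200 for i in range(200)]
skewed = [v ** 3 for v in uniform_like]
d_good = discrepancy(uniform_like)
d_bad = discrepancy(skewed)
```

In a full comparison, the scores passed to `discrepancy` would be computed on the test half using the parameters, predicted random effects and thresholds obtained from the training half.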

5 Discussion

In this study, we investigated the MEM for the EVI in the Pareto-type distribution for unit-level data. In other words, this study incorporated the method of SAE into EVT. As explained in Section 2, the parameters of the proposed model were estimated by the maximum likelihood method, and the random effects were predicted by the conditional mode. In Section 3, we established the asymptotic normality of the estimator. Together with the simulation studies in Section 1 of our supplementary material and the real data example in Section 4, we can conclude the following advantages of using the MEM for EVI regression. First, in extreme value analysis, the sample size is generally small for some areas because only peaks over the threshold are used. However, as described in Section 1.3 of our supplementary material and Section 4, the common parametric part in the MEM can adequately account for the differences between areas. Interestingly, the “borrowing of strength” of the MEM is effective for EVI regression because it improves the bias and variance of the peak-over-threshold method (see, Section 1.3 of our supplementary material). Thus, the proposed model provides an efficient alternative to direct estimates from each area. Second, the proposed model is effective even when the number of areas is large. This is shown theoretically in Theorem 1 of Section 3.2, while Section 1.2 of our supplementary material confirms this property numerically. Furthermore, as a result supporting the use of the proposed model, in Section 1.3 of our supplementary material, the extreme value analysis using the MEM provided more reasonable results than the fully parametric model. Finally, in extreme value analysis, general EVI estimators sometimes have a large bias resulting from the peak-over-threshold approximation. However, from Theorem 1, we found that when $J$ is large, the proposed estimator can be designed to reduce the bias while maintaining a stable variance. This is a somewhat surprising result, because a large number of areas typically leads to poor performance of the estimator in the fully parametric model. Thus, the MEM may be one of the effective approaches to overcome the severe problem of bias in extreme value analysis.

We close by describing directions for future research on the MEM for extreme value analysis. The first topic of interest is the development of models such as the simultaneous autoregressive model and the conditional autoregressive model to explain the dependencies between areas (see, Rao and Molina 2015 and references therein). Such models may help provide better estimates in some applications, including spatial data. Second, it may be feasible to extend the MEM to other EVT models such as the generalized extreme value distribution and GPD. In this study, the methods derived from these models and their theoretical results were not clarified and thus require future detailed study. Third, we expect to extend the MEM to extreme quantile regression (see, Wang, Li, and He 2012; Wang and Li 2013). Finally, although this paper studied the MEM with Gaussian random effects, it may be important to consider other distributions of the random effects (see, Section 9.2 of Wu 2009 and Yavuz and Arslan 2018). The development of the MEM with non-Gaussian random effects is also an interesting future work in EVT.

Acknowledgments

We would like to thank Editage (https://www.editage.jp/) for the English language editing.

Data availability statement

The dataset analyzed during this study was processed from data obtained from CoinMarketCap (see, https://coinmarketcap.com/). This dataset and the associated R code (https://www.r-project.org/) are available from the corresponding author on reasonable request.

References

[1] Almeida, S., A. Loy, and H. Hofmann. 2018. ggplot2 compatible quantile-quantile plots in R. The R Journal 10 (2):248-61. doi:10.32614/RJ-2018-051.
[2] Bates, D., M. Mächler, B. Bolker, and S. Walker. 2015. Fitting linear mixed-effects models using lme4. Journal of Statistical Software 67 (1):1-48. doi:10.18637/jss.v067.i01.
[3] Battese, G., R. Harter, and W. Fuller. 1988. An error-components model for prediction of county crop areas using survey and satellite data. Journal of the American Statistical Association 83 (401):28-36. doi:10.1080/01621459.1988.10478561.
[4] Beirlant, J., and Y. Goegebeur. 2003. Regression with response distributions of Pareto-type. Computational Statistics & Data Analysis 42 (4):595-619. doi:10.1016/S0167-9473(02)00120-2.
[5] Beirlant, J., Y. Goegebeur, J. Teugels, and J. Segers. 2004. Statistics of extremes: Theory and applications. New Jersey: John Wiley & Sons.
[6] Bottolo, L., G. Consonni, P. Dellaportas, and A. Lijoi. 2003. Bayesian analysis of extreme values by mixture modeling. Extremes 6 (1):25-47. doi:10.1023/A:1026225113154.
[7] Broström, G., and H. Holmberg. 2011. Generalized linear models with clustered data: Fixed and random effects models. Computational Statistics & Data Analysis 55 (12):3123-34. doi:10.1016/j.csda.2011.06.011.
[8] Davison, A. C., and R. L. Smith. 1990. Models for exceedances over high thresholds. Journal of the Royal Statistical Society: Series B (Methodological) 52 (3):393-425. doi:10.1111/j.2517-6161.1990.tb01796.x.
[9] de Carvalho, M., R. Huser, and R. Rubio. 2023. Similarity-based clustering for patterns of extreme values. Stat 12 (1):e560. doi:10.1002/sta4.560.
[10] Dempster, A. P., D. B. Rubin, and R. K. Tsutakawa. 1981. Estimation in covariance components models. Journal of the American Statistical Association 76 (374):341-53. doi:10.1080/01621459.1981.10477653.
[11] Diallo, M., and J. N. K. Rao. 2018. Small area estimation of complex parameters under unit-level models with skew-normal errors. Scandinavian Journal of Statistics 45 (4):1092-116. doi:10.1111/sjos.12336.
[12] Dupuis, D. J., S. Engelke, and L. Trapin. 2023. Modeling panels of extremes. The Annals of Applied Statistics 17 (1):498-517. doi:10.1214/22-AOAS1639.
[13] Fisher, R., and L. Tippett. 1928. Limiting forms of the frequency distribution of the largest or smallest member of a sample. Mathematical Proceedings of the Cambridge Philosophical Society 24 (2):180-90. doi:10.1017/S0305004100015681.
[14] Gkillas, K., and P. Katsiampa. 2018. An application of extreme value theory to cryptocurrencies. Economics Letters 164:109-11. doi:10.1016/j.econlet.2018.01.020.
[15] Gumbel, E. J. 1958. Statistics of extremes. New York: Columbia University Press.
[16] Hall, P. 1982. On some simple estimates of an exponent of regular variation. Journal of the Royal Statistical Society: Series B (Methodological) 44 (1):37-42. doi:10.1111/j.2517-6161.1982.tb01183.x.
[17] Hill, B. M. 1975. A simple general approach to inference about the tail of a distribution. The Annals of Statistics 3 (5):1163-74. doi:10.1214/aos/1176343247.
[18] Jiang, J. 2007. Linear and generalized linear mixed models and their applications. New York: Springer.
[19] Jiang, J. 2017. Asymptotic analysis of mixed effects models: Theory, applications, and open problems. Boca Raton, Florida: CRC Press.
[20] Jiang, J., M. P. Wand, and A. Bhaskaran. 2022. Usable and precise asymptotics for generalized linear mixed model analysis and design. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 84 (1):55-82. doi:10.1111/rssb.12473.
[21] Longin, F., and G. Pagliardi. 2016. Tail relation between return and volume in the US stock market: An analysis based on extreme value theory. Economics Letters 145:252-4. doi:10.1016/j.econlet.2016.06.026.
[22] Ma, Y., Y. Jiang, and W. Huang. 2019. Tail index varying coefficient model. Communications in Statistics - Theory and Methods 48 (2):235-56. doi:10.1080/03610926.2017.1406519.
[23] Magnus, J. R., and H. Neudecker. 1988. Matrix differential calculus with applications in statistics and econometrics. New Jersey: John Wiley & Sons.
[24] Molina, I., P. Corral, and M. Nguyen. 2022. Estimation of poverty and inequality in small areas: Review and discussion. TEST 31 (4):1143-66. doi:10.1007/s11749-022-00822-1.
[25] Molina, I., and J. N. K. Rao. 2010. Small area estimation of poverty indicators. The Canadian Journal of Statistics 38 (3):369-85. https://www.jstor.org/stable/27896031.
[26] Nie, L. 2007. Convergence rate of MLE in generalized linear and nonlinear mixed-effects models: Theory and applications. Journal of Statistical Planning and Inference 137 (6):1787-804. doi:10.1016/j.jspi.2005.06.010.
[27] R Core Team. 2021. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.
[28] Rao, J. N. K., and I. Molina. 2015. Small area estimation. Hoboken, New Jersey: John Wiley & Sons.
[29] Reiss, R. D., and M. Thomas. 2007. Statistical analysis of extreme values: With applications to insurance, finance, hydrology and other fields. Basel, Switzerland: Birkhäuser Basel.
[30] Rohrbeck, C., and J. Tawn. 2021. Bayesian spatial clustering of extremal behavior for hydrological variables. Journal of Computational and Graphical Statistics 30 (1):91-105. doi:10.1080/10618600.2020.1777139.
[31] Ruppert, D., M. P. Wand, and R. J. Carroll. 2003. Semiparametric regression. New York: Cambridge University Press.
[32] Santner, T. J., and D. E. Duffy. 1989. The statistical analysis of discrete data. New York: Springer.
[33] Sugasawa, S., and T. Kubokawa. 2020. Small area estimation with mixed models: A review. Japanese Journal of Statistics and Data Science 3 (2):693-720. doi:10.1007/s42081-020-00076-x.
[34] Torabi, M. 2019. Spatial generalized linear mixed models in small area estimation. Canadian Journal of Statistics 47 (3):426-37. doi:10.1002/cjs.11502.
[35] Wang, H. J., and D. Li. 2013. Estimation of extreme conditional quantiles through power transformation. Journal of the American Statistical Association 108 (503):1062-74. doi:10.1080/01621459.2013.820134.
[36] Wang, H. J., D. Li, and X. He. 2012. Estimation of high conditional quantiles for heavy-tailed distributions. Journal of the American Statistical Association 107 (500):1453-64. doi:10.1080/01621459.2012.716382.
[37] Wang, H., and C. L. Tsai. 2009. Tail index regression. Journal of the American Statistical Association 104 (487):1233-40. doi:10.1198/jasa.2009.tm08458.
[38] Wu, L. 2009. Mixed effects models for complex data. Boca Raton, Florida: CRC Press.
[39] Yavuz, F. G., and O. Arslan. 2018. Linear mixed model with Laplace distribution (LLMM). Statistical Papers 59 (1):271-89. doi:10.1007/s00362-016-0763-x.

Supplemental online material for
“Mixed effects models for extreme value index regression”

This supplementary material supports our main article entitled “Mixed effects models for extreme value index regression” and is organized as follows. Section 1 provides some Monte Carlo simulation studies to verify the finite sample performance of the model proposed in Section 2 of our main article. Section 2 describes the technical details of the proof of Theorem 1 in Section 3.2 of our main article. Note that we use many of the symbols defined in our main article.

1 Simulation

From Eq. (9) of our main article, we can approximate the distribution of $\log(Y_{ij}/\omega_{j})$ conditional on ${\bm{U}}_{j}$ , ${\bm{X}}_{ij}$ and $Y_{ij}>\omega_{j}$ by the exponential distribution, which is a special case of the gamma distribution (see, Wang and Tsai 2009). Therefore, our estimator $(\hat{\bm{\theta}}_{\rm{A}},\hat{\bm{\theta}}_{\rm{B}},\hat{\bm{\Sigma}})$ and predictor $\tilde{\bm{u}}_{j}$ proposed in Section 2 of our main article can be easily implemented by using the function glmer() with family=Gamma(link="log") in the package lme4 (see, Bates et al. 2015) within the R computing environment (see, R Core Team 2021). In the function glmer(), when $p_{\rm{A}}=1$ , the integral in the log-likelihood defined in Eq. (10) of our main article is approximated by the adaptive Gauss-Hermite quadrature, and $\hat{\bm{\Sigma}}$ and $(\hat{\bm{\theta}}_{\rm{A}},\hat{\bm{\theta}}_{\rm{B}})$ are then optimized by “bobyqa” and “Nelder-Mead”, respectively.

In the following Sections 1.1-1.3, we investigate the performance of our estimator $(\hat{\bm{\theta}}_{\rm{A}},\hat{\bm{\theta}}_{\rm{B}},\hat{\bm{\Sigma}})$ and predictor $\tilde{\bm{u}}_{j}$ through some simulation studies using the above package.

1.1 Practicality of asymptotic normality of the estimator

In this section, we illustrate the applicability of our asymptotic normality constructed in Corollary 1 of our main article to finite samples. This is positioned as a preliminary study for hypothesis testing on a real data example in Section 4 of our main article.

We simulate the dataset $\{(Y_{ij},{\bm{X}}_{ij})\}_{i\in\mathcal{N}_{j},j\in\mathcal{J}}$ as follows. We denote ${\bm{X}}_{ij}=(X_{ij}^{(1)},X_{ij}^{(2)})^{\top}\in\mathbb{R}^{2}$ and set $X_{ij}^{(1)}\equiv 1$ for $i\in\mathcal{N}_{j}$ and $j\in\mathcal{J}$ . First, we independently generate $\{X_{ij}^{(2)}\}_{i\in\mathcal{N}_{j},j\in\mathcal{J}}$ from the standard normal distribution or the uniform distribution on $[-\sqrt{3},\sqrt{3}]$ . Note that in both covariate cases, $X_{ij}^{(2)}$ has zero mean and unit variance. In the next step, we obtain an independent sample $\{U_{j}\}_{j\in\mathcal{J}}$ from $N(0,\sigma^{2})$ with $\sigma^{2}=0.2$ . Finally, for each $i\in\mathcal{N}_{j}$ and $j\in\mathcal{J}$ , we generate $Y_{ij}$ using a given conditional response distribution $F(\cdot\mid U_{j},X_{ij}^{(2)})$ . The same data generation procedure will be used in Section 1.2. Here, to obtain $\{Y_{ij}\}_{i\in\mathcal{N}_{j},j\in\mathcal{J}}$ , we use the Pareto distribution

F(y\mid U_{j},X_{ij}^{(2)})=1-y^{-1/\gamma(U_{j},X_{ij}^{(2)})}

(16)

and apply the nested error regression type model formulated as Eq. (7) of our main article with ${\bm{X}}_{{\rm{A}}ij}=X_{ij}^{(1)}$ and ${\bm{X}}_{{\rm{B}}ij}=X_{ij}^{(2)}$ , i.e.,

\gamma(U_{j},X_{ij}^{(2)})=\exp\left(\theta_{\rm{A}}^{(1)}+U_{j}+\theta_{\rm{B}}^{(2)}X_{ij}^{(2)}\right),

(17)

where $(\theta_{\rm{A}}^{(1)},\theta_{\rm{B}}^{(2)})=(-0.5,0.2)$ . Let us denote the proposed estimator of $(\theta_{\rm{A}}^{(1)},\theta_{\rm{B}}^{(2)},\sigma^{2})$ by $(\hat{\theta}_{\rm{A}}^{(1)},\hat{\theta}_{\rm{B}}^{(2)},\hat{\sigma}^{2})$ . Because the Pareto distribution (16) satisfies Eq. (12) of our main article with $\beta(U_{j},X_{ij}^{(2)})=\infty$ , the estimator does not have the asymptotic bias described in Remark 1 in Section 3.2 of our main article. Therefore, we do not use the thresholds $\omega_{j},\ j\in\mathcal{J}$ , and thus the effective sample size $n_{j0}$ for each $j\in\mathcal{J}$ is unchanged from $n_{j}$ .
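The data generation described above can be sketched in a few lines. The following is a minimal illustration, not the code used for the reported results: it uses only the Python standard library, and the function name `simulate_pareto_areas` and its arguments are hypothetical. It draws $Y_{ij}$ from (16) by the inverse transform $Y_{ij}=V^{-\gamma}$ with $V$ uniform on $(0,1]$, so that $P(Y_{ij}>y)=y^{-1/\gamma}$ for $y\geq 1$.

```python
import math
import random


def simulate_pareto_areas(J, n, theta_a=-0.5, theta_b=0.2, sigma2=0.2,
                          covariate="normal", seed=0):
    """Simulate rows (j, x2, y) from the Pareto model (16) with index (17).

    Hypothetical helper: J areas, n units per area, no thresholding.
    """
    rng = random.Random(seed)
    data = []
    for j in range(J):
        # Random effect U_j ~ N(0, sigma^2), shared by all units of area j.
        u_j = rng.gauss(0.0, math.sqrt(sigma2))
        for _ in range(n):
            if covariate == "normal":
                x2 = rng.gauss(0.0, 1.0)
            else:
                # Uniform on [-sqrt(3), sqrt(3)]: zero mean, unit variance.
                x2 = rng.uniform(-math.sqrt(3.0), math.sqrt(3.0))
            gamma = math.exp(theta_a + u_j + theta_b * x2)  # Eq. (17)
            v = 1.0 - rng.random()  # uniform on (0, 1]
            y = v ** (-gamma)  # inverse transform for Eq. (16)
            data.append((j, x2, y))
    return data
```

Under this construction $\log Y_{ij}$ is exactly exponential with mean $\gamma(U_{j},X_{ij}^{(2)})$, which is why no thresholds are needed in this setting.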

Under the above model setups, from Corollary 1 of our main article, $J^{1/2}(\hat{\theta}_{\rm{A}}^{(1)}-\theta_{\rm{A}}^{(1)})/\sigma$ , $(Jn_{0})^{1/2}(\hat{\theta}_{\rm{B}}^{(2)}-\theta_{\rm{B}}^{(2)})$ and $J^{1/2}(\hat{\sigma}^{2}-\sigma^{2})/(\sqrt{2}\sigma^{2})$ are asymptotically distributed as $N(0,1)$ . Note that this simulation setting satisfies ${\bm{\Omega}}_{\rm{B}}=1$ in Corollary 1 of our main article. To obtain the empirical distributions of the above standardized estimators, we use 500 datasets and repeatedly estimate the parameters $\{\theta_{\rm{A}}^{(1)},\theta_{\rm{B}}^{(2)},\sigma^{2}\}$ from each dataset. Figures 3 and 4 show the Q-Q plots for the obtained standardized estimates against $N(0,1)$ for the normal covariate and uniform covariate, respectively. In these figures, $(J,n_{j0})$ varies by column as $(20,10)$ , $(40,10)$ , $(10,20)$ , $(20,20)$ , $(40,20)$ , $(80,20)$ , $(80,40)$ , and $(150,100)$ . Furthermore, the bands in each panel are the 95% pointwise confidence bands constructed by the function geom_qq_band() with bandType="boot" in the package qqplotr (see, Almeida, Loy, and Hofmann 2018) within R. If all generated $U_{1},U_{2},\ldots,U_{J}$ are close to each other, $\sigma^{2}$ may be estimated to be zero by glmer(). Thus, for $J=10$ and $J=20$ , the Q-Q plot contained several equal values for $\sigma^{2}$ . Comparing Figures 3 and 4, the type of the distribution of ${X}_{ij}^{(2)}$ did not significantly affect the results. We can see from Figures 3 and 4 that for $\theta_{\rm{A}}^{(1)}$ and $\sigma^{2}$ , the empirical distribution of the standardized estimators had heavier tails than $N(0,1)$ , but this tendency disappeared with increasing $J$ and $n_{j0}$ . In contrast to $\theta_{\rm{A}}^{(1)}$ and $\sigma^{2}$ , the Q-Q plot for $\theta_{\rm{B}}^{(2)}$ was good for all pairs of $(J,n_{j0})$ . Consequently, these results reflect the claims of Corollary 1 of our main article.

1.2 Behavior of the estimator for numerous areas

In this section, we examine the numerical performance of the estimator for large $J$ . To obtain the dataset $\{(Y_{ij},{\bm{X}}_{ij})\}_{i\in\mathcal{N}_{j},j\in\mathcal{J}}$ according to the procedure in Section 1.1, we use (17) and the following conditional distribution (a) or (b) instead of (16):

(a)

Student’s $t$ -distribution

$F(y\mid U_{j},X_{ij}^{(2)})=\int_{-\infty}^{y}\frac{\Gamma([\nu(U_{j},X_{ij}^{(2)})+1]/2)}{\sqrt{\nu(U_{j},X_{ij}^{(2)})\pi}\Gamma(\nu(U_{j},X_{ij}^{(2)})/2)}\left[1+\frac{t^{2}}{\nu(U_{j},X_{ij}^{(2)})}\right]^{-\left[\nu(U_{j},X_{ij}^{(2)})+1\right]/2}dt$

with $\nu(U_{j},X_{ij}^{(2)})\coloneqq\gamma(U_{j},X_{ij}^{(2)})^{-1}$ , where $\Gamma(\cdot)$ is the gamma function. In terms of Eq. (12) of our main article, this distribution is a Pareto-type distribution with $\beta(U_{j},X_{ij}^{(2)})\equiv 2$ .

(b)

Burr distribution

$F(y\mid U_{j},X_{ij}^{(2)})=1-\left[\frac{\eta}{\eta+y^{\tau(U_{j},X_{ij}^{(2)})}}\right]^{\kappa}$

with $\eta=1$ , $\kappa=1$ and $\tau(U_{j},X_{ij}^{(2)})\coloneqq\gamma(U_{j},X_{ij}^{(2)})^{-1}$ . The Burr distribution satisfies Eq. (12) of our main article with $\beta(U_{j},X_{ij}^{(2)})=\tau(U_{j},X_{ij}^{(2)})$ .

For (a), we obtain a sample directly from the $t$ -distribution for given non-integer degrees of freedom. As described in Remark 1 of our main article, the proposed estimator is more biased for smaller $\beta(U_{j},X_{ij}^{(2)})$ .
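Sampling from (a) and (b) can be illustrated as follows. This is a sketch under stated assumptions rather than the code behind the figures: the helper names are hypothetical, the $t$ variate is built as a standard normal divided by the square root of a scaled chi-square (which allows non-integer degrees of freedom), and the Burr variate is obtained by inverting $F$ in (b), i.e., $y=[\eta(V^{-1/\kappa}-1)]^{1/\tau}$ with $V$ uniform on $(0,1]$.

```python
import math
import random


def sample_student_t(nu, rng):
    """Draw from Student's t with (possibly non-integer) degrees of freedom nu."""
    z = rng.gauss(0.0, 1.0)
    # Chi-square with nu degrees of freedom = Gamma(shape nu/2, scale 2).
    chi2 = rng.gammavariate(nu / 2.0, 2.0)
    return z / math.sqrt(chi2 / nu)


def sample_burr(tau, rng, eta=1.0, kappa=1.0):
    """Draw from the Burr distribution in (b) by inverting its distribution function."""
    v = 1.0 - rng.random()  # uniform on (0, 1], plays the role of 1 - F(y)
    return (eta * (v ** (-1.0 / kappa) - 1.0)) ** (1.0 / tau)


rng = random.Random(0)
gamma = math.exp(-0.5 + 0.1 + 0.2 * 0.3)   # Eq. (17) with illustrative U_j = 0.1, X = 0.3
y_t = sample_student_t(1.0 / gamma, rng)    # (a): nu = 1 / gamma
y_burr = sample_burr(1.0 / gamma, rng)      # (b): tau = 1 / gamma
```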

In advance, we generate the unit-level data with 500 areas and $n_{j}=1000$ using the above procedure. Then, we use a part of the dataset according to the following rules. The number of areas $J$ is increased by a factor of 10 from 50 to 500. Furthermore, for the discrepancy measure described in Section 2.4 of our main article, we use the 10th to $T$ th largest responses in the $j$ th area as the candidates for the $j$ th threshold $\omega_{j}$ , where $T$ varies from 20 to 200 in increments of 20. Roughly speaking, a smaller $T$ means that higher thresholds $\omega_{j},\ j\in\mathcal{J}$ are chosen. For each $J$ and $T$ , we obtain the estimates $\{\hat{\theta}_{\rm{A}}^{(1)},\hat{\theta}_{\rm{B}}^{(2)},\hat{\sigma}^{2}\}$ . Following the above rules, we iterate the estimation for 100 sets of unit-level data. Figures 5-8 show the sample squared bias and variance of the estimator for each $J$ and $T$ . The details can be found in the description of each figure. Overall, we can see that our estimator remained stable when $J$ was sufficiently large, even when $T$ was small. The first row of each figure shows that our estimator did not suffer from a large bias for large $J$ . These results support the considerations in Remark 2 of Section 3.2 of our main article. As described in Remark 1 of our main article, from the second row of each figure, the variance of $\hat{\theta}_{\rm{B}}^{(2)}$ was strongly dependent on both $J$ and $T$ when $J$ was small, while the variances of $\hat{\theta}_{\rm{A}}^{(1)}$ and $\hat{\sigma}^{2}$ were almost unaffected by $T$ .
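The rule for the threshold candidates can be made concrete with a short sketch. The helper name `threshold_candidates` is hypothetical, and the discrepancy measure of Section 2.4 of our main article, which selects one candidate per area, is deliberately not reproduced here; the code only extracts the 10th to $T$ th largest responses of one area.

```python
def threshold_candidates(responses, T, first=10):
    """Return the first-th to T-th largest responses of one area, in decreasing order."""
    if not first <= T <= len(responses):
        raise ValueError("need first <= T <= number of responses")
    ordered = sorted(responses, reverse=True)
    # Positions are 1-based in the text, hence the shift by one.
    return ordered[first - 1:T]


# Grid of T used in the text: 20, 40, ..., 200.
T_grid = list(range(20, 201, 20))
```

A smaller $T$ leaves fewer and larger candidates, which matches the remark that a smaller $T$ corresponds to higher thresholds.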

1.3 Borrowing of strength for extremes

In this section, we compare our model, the fully parametric model defined in Eqs. (2) and (3) of our main article, and the model proposed by Wang and Tsai (2009). The use of the model proposed by Wang and Tsai (2009) implies that areas are analyzed separately. To obtain the estimates for the latter two parametric models, we use the function glm() with family=Gamma(link="log") within R.

Unlike Sections 1.1 and 1.2, the dataset $\{(Y_{ij},{\bm{X}}_{ij})\}_{i\in\mathcal{N}_{j},j\in\mathcal{J}}$ is simulated from the fully parametric model as follows. Let $J$ be divided into $J=J^{\dagger}+J^{\ddagger}$ and set $J^{\dagger}=130$ , $J^{\ddagger}=20$ , $n_{1}=\cdots=n_{J^{\dagger}}=100$ and $n_{J^{\dagger}+1}=\cdots=n_{J^{\dagger}+J^{\ddagger}}=20$ . We generate $\{X_{ij}^{(2)}\}_{i\in\mathcal{N}_{j},j\in\mathcal{J}}$ from $N(0,1)$ . To generate $\{Y_{ij}\}_{i\in\mathcal{N}_{j},j\in\mathcal{J}}$ , we then combine the two distributions

F_{j}^{\dagger}(y\mid X_{ij}^{(2)})=1-y^{-1/\gamma_{j}(X_{ij}^{(2)})},\quad j=1,2,\ldots,J^{\dagger}

(18)

and

F_{j}^{\ddagger}(y\mid X_{ij}^{(2)})=1-\frac{1.5y^{-1/\gamma_{j}(X_{ij}^{(2)})}}{1+0.5y^{-1/\gamma_{j}(X_{ij}^{(2)})}},\quad j=J^{\dagger}+1,J^{\dagger}+2,\ldots,J^{\dagger}+J^{\ddagger},

(19)

where $F_{j}^{\ddagger}$ is described in detail in Section 4.1 of Wang and Tsai (2009). Then, the extreme value index $\gamma_{j}(X_{ij}^{(2)})$ is modeled as $\gamma_{j}(X_{ij}^{(2)})=\exp(\theta_{j{\rm{A}}}^{(1)}+\theta_{\rm{B}}^{(2)}X_{ij}^{(2)})$ , where $\theta_{\rm{B}}^{(2)}=0.2$ and the parameters $\theta_{j{\rm{A}}}^{(1)},\ j\in\mathcal{J}$ are assigned in one of the following two ways. In the first, $\theta_{j{\rm{A}}}^{(1)}$ is defined as the $j/(J^{\dagger}+1)$ -th quantile of $N(0,1/12)$ for $j=1,2,\ldots,J^{\dagger}$ and the $(j-J^{\dagger})/(J^{\ddagger}+1)$ -th quantile of $N(0,1/12)$ for $j=J^{\dagger}+1,J^{\dagger}+2,\ldots,J^{\dagger}+J^{\ddagger}$ . In the second, $\theta_{j{\rm{A}}}^{(1)}$ is fixed as $-0.5+0.5(j-1)/(J^{\dagger}-1)$ for $j=1,2,\ldots,J^{\dagger}$ and $-0.5+0.5(j-J^{\dagger}-1)/(J^{\ddagger}-1)$ for $j=J^{\dagger}+1,J^{\dagger}+2,\ldots,J^{\dagger}+J^{\ddagger}$ . These two settings of $\theta_{j{\rm{A}}}^{(1)}$ are denoted by Normal type and Uniform type, respectively.
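The two intercept settings and the sampling from (18) and (19) can be sketched as follows. This is only an illustration with hypothetical helper names; it assumes that the second argument of $N(0,1/12)$ is a variance, as in $N(0,\sigma^{2})$ above. For (19), writing $s=y^{-1/\gamma}$ and $V=1-F_{j}^{\ddagger}(y)$ gives $1.5s/(1+0.5s)=V$, hence $s=V/(1.5-0.5V)$ and $y=s^{-\gamma}$.

```python
import math
import random
from statistics import NormalDist

J_DAG, J_DDAG = 130, 20  # J^dagger and J^ddagger from the text


def intercepts(kind):
    """Area-specific intercepts theta_{jA}^{(1)} for the Normal or Uniform type."""
    if kind == "normal":
        # Assumption: N(0, 1/12) is parameterized by its variance.
        dist = NormalDist(mu=0.0, sigma=math.sqrt(1.0 / 12.0))
        first = [dist.inv_cdf(j / (J_DAG + 1)) for j in range(1, J_DAG + 1)]
        second = [dist.inv_cdf(k / (J_DDAG + 1)) for k in range(1, J_DDAG + 1)]
    else:
        first = [-0.5 + 0.5 * (j - 1) / (J_DAG - 1) for j in range(1, J_DAG + 1)]
        second = [-0.5 + 0.5 * (k - 1) / (J_DDAG - 1) for k in range(1, J_DDAG + 1)]
    return first + second


def sample_response(gamma, mixture, rng):
    """Draw Y from Eq. (18) (mixture=False) or Eq. (19) (mixture=True)."""
    v = 1.0 - rng.random()  # uniform on (0, 1], plays the role of 1 - F(y)
    # For (19), solve 1.5 s / (1 + 0.5 s) = v for s = y^(-1/gamma).
    s = v / (1.5 - 0.5 * v) if mixture else v
    return s ** (-gamma)
```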

Table 3: Sample bias and variance of the estimator for the three models.

		Our model	Fully parametric model	Direct estimates
Normal type
$\theta_{j{\rm{A}}}^{(1)}$	$\text{Bias}^{\dagger}$	$-1.77\times 10^{-3}$	$-4.39\times 10^{-3}$	$-9.50\times 10^{-3}$
	$\text{Bias}^{\ddagger}$	$1.20\times 10^{-1}$	$1.77\times 10^{-1}$	$1.57\times 10^{-1}$
	$\text{Variance}^{\dagger}$	$7.89\times 10^{-3}$	$9.97\times 10^{-3}$	$1.01\times 10^{-2}$
	$\text{Variance}^{\ddagger}$	$1.69\times 10^{-2}$	$4.25\times 10^{-2}$	$4.59\times 10^{-2}$
$\theta_{\rm{B}}^{(2)}$	Bias	$2.00\times 10^{-1}$	$2.00\times 10^{-1}$	$2.00\times 10^{-1}$
$\theta_{\rm{B}}^{(2)}$	Variance	$6.78\times 10^{-5}$	$6.78\times 10^{-5}$	$1.23\times 10^{-4}$
Uniform type
$\theta_{j{\rm{A}}}^{(1)}$	$\text{Bias}^{\dagger}$	$-3.07\times 10^{-3}$	$-5.56\times 10^{-3}$	$-1.06\times 10^{-2}$
	$\text{Bias}^{\ddagger}$	$1.26\times 10^{-1}$	$1.80\times 10^{-1}$	$1.60\times 10^{-1}$
	$\text{Variance}^{\dagger}$	$8.07\times 10^{-3}$	$9.98\times 10^{-3}$	$1.01\times 10^{-2}$
	$\text{Variance}^{\ddagger}$	$1.73\times 10^{-2}$	$4.10\times 10^{-2}$	$4.50\times 10^{-2}$
$\theta_{\rm{B}}^{(2)}$	Bias	$2.00\times 10^{-1}$	$2.00\times 10^{-1}$	$1.99\times 10^{-1}$
$\theta_{\rm{B}}^{(2)}$	Variance	$7.32\times 10^{-5}$	$7.34\times 10^{-5}$	$1.13\times 10^{-4}$

Without introducing thresholds, we estimate $\theta_{j{\rm{A}}}^{(1)},\ j\in\mathcal{J}$ and $\theta_{\rm{B}}^{(2)}$ using each of our model, the fully parametric model expressed as Eqs. (2) and (3) of our main article, and the model proposed by Wang and Tsai (2009). In our model, estimates for $\theta_{j{\rm{A}}}^{(1)},\ j\in\mathcal{J}$ are obtained by combining the estimator and predictor proposed in Section 2 of our main article. For the model proposed by Wang and Tsai (2009), different estimates of $\theta_{\rm{B}}^{(2)}$ are obtained across the areas, and their mean is used as the estimate of $\theta_{\rm{B}}^{(2)}$ . Under the above simulation settings, if the areas were analyzed separately, the estimates for the areas from (19) would have large bias and variance. Therefore, we are interested in whether these direct estimates can be improved by simultaneously analyzing the areas using our model. To verify this, we evaluate the sample bias and variance of the estimator for each method. Table 3 shows the results based on 500 sets of unit-level data, where the superscripts ${\dagger}$ and ${\ddagger}$ in this table refer to the means for $j=1,2,\ldots,J^{\dagger}$ and for $j=J^{\dagger}+1,J^{\dagger}+2,\ldots,J^{\dagger}+J^{\ddagger}$ , respectively. According to Table 3, our model produced the most stable estimates of the three models. In particular, for $\theta_{j{\rm{A}}}^{(1)},\ j=J^{\dagger}+1,J^{\dagger}+2,\ldots,J^{\dagger}+J^{\ddagger}$ , the variance of the estimates was dramatically improved using our model rather than the direct estimates. In addition, we can see that in extreme value analysis, the “borrowing of strength” of the mixed effects model also contributes to reducing the bias of the estimator. In the fully parametric model, the bias of the estimator for the areas generated from (19) was larger than that of the direct estimates. Meanwhile, even when the assumption of normality of the random effects in Eq. (4) of our main article was not satisfied, our model was still favorable.

2 Proof of theorems

We give the proof of Theorem 1 in Section 3.2 of our main article. For convenience, we introduce some new symbols:

•

${\bm{\theta}}_{\rm{C}}\coloneqq{\rm{vech}}({\bm{\Sigma}})$ , ${\bm{\theta}}_{\rm{C}}^{0}\coloneqq{\rm{vech}}({\bm{\Sigma}}_{0})$ and $\hat{\bm{\theta}}_{\rm{C}}\coloneqq{\rm{vech}}(\hat{\bm{\Sigma}})$ , which are $p_{\rm{C}}\coloneqq p_{\rm{A}}(p_{\rm{A}}+1)/2$ -dimensional vectors.
•

${\bm{\theta}}\coloneqq({\bm{\theta}}_{\rm{A}}^{\top},{\bm{\theta}}_{\rm{B}}^{\top},{\bm{\theta}}_{\rm{C}}^{\top})^{\top}$ , ${\bm{\theta}}^{0}\coloneqq(({\bm{\theta}}_{\rm{A}}^{0})^{\top},({\bm{\theta}}_{\rm{B}}^{0})^{\top},({\bm{\theta}}_{\rm{C}}^{0})^{\top})^{\top}$ and $\hat{\bm{\theta}}\coloneqq(\hat{\bm{\theta}}_{\rm{A}}^{\top},\hat{\bm{\theta}}_{\rm{B}}^{\top},\hat{\bm{\theta}}_{\rm{C}}^{\top})^{\top}$ .

•

For each $j\in\mathcal{J}$ , we denote

\ell_{j}({\bm{\theta}})\coloneqq\log\int_{\mathbb{R}^{p_{\rm{A}}}}\phi({\bm{u}};{\bm{0}},{\bm{\theta}}_{\rm{C}})\exp\left(\sum_{i=1}^{n_{j}}\left\{-\left({\bm{\theta}}_{\rm{A}}+{\bm{u}}\right)^{\top}{\bm{X}}_{{\rm{A}}ij}-{\bm{\theta}}_{\rm{B}}^{\top}{\bm{X}}_{{\rm{B}}ij}-\exp\left[-\left({\bm{\theta}}_{\rm{A}}+{\bm{u}}\right)^{\top}{\bm{X}}_{{\rm{A}}ij}-{\bm{\theta}}_{\rm{B}}^{\top}{\bm{X}}_{{\rm{B}}ij}\right]\log\frac{Y_{ij}}{\omega_{(J,n_{j})}}\right\}I(Y_{ij}>\omega_{(J,n_{j})})\right)d{\bm{u}},

where $\phi({\bm{u}};{\bm{0}},{\bm{\theta}}_{\rm{C}})\coloneqq\phi({\bm{u}};{\bm{0}},{\bm{\Sigma}})$ . In Eq. (10) of our main article, the approximated log-likelihood $\ell({\bm{\theta}}_{\rm{A}},{\bm{\theta}}_{\rm{B}},{\bm{\Sigma}})$ can be redefined as $\ell({\bm{\theta}})\coloneqq\sum_{j=1}^{J}\ell_{j}({\bm{\theta}})$ .

•

For any smooth function $R_{1}:\mathbb{R}^{d}\to\mathbb{R};{\bm{z}}\mapsto R_{1}({\bm{z}})$ , we denote $\nabla R_{1}({\bm{z}})\coloneqq(\partial/\partial{\bm{z}})R_{1}({\bm{z}})\in\mathbb{R}^{d}$ and $\nabla^{2}R_{1}({\bm{z}})\coloneqq(\partial^{2}/\partial{\bm{z}}\partial{\bm{z}}^{\top})R_{1}({\bm{z}})\in\mathbb{R}^{d\times d}$ . In particular, we denote $\nabla_{{\bm{z}}_{0}}R_{1}({\bm{z}})\coloneqq(\partial/\partial{\bm{z}}_{0})R_{1}({\bm{z}})$ and $\nabla_{{\bm{z}}_{1}{\bm{z}}_{2}}^{2}R_{1}({\bm{z}})\coloneqq(\partial^{2}/\partial{\bm{z}}_{1}\partial{\bm{z}}_{2}^{\top})R_{1}({\bm{z}})$ , where ${\bm{z}}_{0}$ , ${\bm{z}}_{1}$ and ${\bm{z}}_{2}$ are part of ${\bm{z}}$ . As a special case, for any smooth real-valued function $R_{2}({\bm{\theta}})$ of ${\bm{\theta}}=({\bm{\theta}}_{\rm{A}}^{\top},{\bm{\theta}}_{\rm{B}}^{\top},{\bm{\theta}}_{\rm{C}}^{\top})^{\top}$ , we simply write $\nabla_{\rm{K}}R_{2}({\bm{\theta}})\coloneqq\nabla_{{\bm{\theta}}_{\rm{K}}}R_{2}({\bm{\theta}})$ and $\nabla_{{\rm{K}}_{1}{\rm{K}}_{2}}^{2}R_{2}({\bm{\theta}})\coloneqq\nabla_{{\bm{\theta}}_{{\rm{K}}_{1}}{\bm{\theta}}_{{\rm{K}}_{2}}}^{2}R_{2}({\bm{\theta}})$ for ${\rm{K}},{\rm{K}}_{1},{\rm{K}}_{2}\in\{{\rm{A}},{\rm{B}},{\rm{C}}\}$ .
•

For any column vector ${\bm{z}}$ , we denote ${\bm{z}}^{\otimes 2}\coloneqq{\bm{z}}{\bm{z}}^{\top}$ .
•

Let ${\bm{\Upsilon}}_{(J,n_{0})}$ be the $(p_{\rm{A}}+p_{\rm{B}}+p_{\rm{C}})$ -dimensional diagonal matrix with ${\rm{diag}}({\bm{\Upsilon}}_{(J,n_{0})})=(J^{1/2}{\bm{1}}_{p_{\rm{A}}}^{\top},J^{1/2}n_{0}^{1/2}{\bm{1}}_{p_{\rm{B}}}^{\top},J^{1/2}{\bm{1}}_{p_{\rm{C}}}^{\top})^{\top}$ , where ${\bm{1}}_{d}$ is the $d$ -dimensional vector with all elements equal to 1.

We denote $D_{j}(\bm{u})\coloneqq n_{j}P(Y_{ij}>\omega_{(J,n_{j})}\mid{\bm{U}}_{j}={\bm{u}})/n_{0}$ . For each $j\in\mathcal{J}$ , $D_{j}(\bm{u})$ conditional on ${\bm{U}}_{j}={\bm{u}}$ satisfies the following Lemma 1.

Lemma 1.

Suppose that (A3) and (A4) hold. Then, under given ${\bm{U}}_{j}={\bm{u}}$ , $D_{j}({\bm{u}})\xrightarrow{P}d_{j}({\bm{u}})$ as $n_{j}\to\infty,\ j\in\mathcal{J}$ and $J\to\infty$ .

Proof of Lemma 1.

We have

E\left[n_{j0}n_{j}^{-1}P(Y_{ij}>\omega_{(J,n_{j})}\mid{\bm{U}}_{j}={\bm{u}})^{-1}\ \middle|\ {\bm{U}}_{j}={\bm{u}}\right]=1

and

{\rm{cov}}\left[n_{j0}n_{j}^{-1}P(Y_{ij}>\omega_{(J,n_{j})}\mid{\bm{U}}_{j}={\bm{u}})^{-1}\ \middle|\ {\bm{U}}_{j}={\bm{u}}\right]=n_{j}^{-1}P(Y_{ij}>\omega_{(J,n_{j})}\mid{\bm{U}}_{j}={\bm{u}})^{-1}-n_{j}^{-1},

which converges to 0 as $J\to\infty$ and $n_{j}\to\infty$ from (A3). Accordingly, under given ${\bm{U}}_{j}={\bm{u}}$ , as $J\to\infty$ and $n_{j}\to\infty$ ,

n_{j0}n_{j}^{-1}P(Y_{ij}>\omega_{(J,n_{j})}\mid{\bm{U}}_{j}={\bm{u}})^{-1}\xrightarrow{P}1.

Thus, Lemma 1 is proved by combining this result with (A4). ∎
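The two moment identities used in this proof can be checked with a short calculation. The sketch below assumes, in line with the description of the effective sample size in our main article, that $n_{j0}=\sum_{i=1}^{n_{j}}I(Y_{ij}>\omega_{(J,n_{j})})$ and that the $Y_{ij}$ are conditionally independent and identically distributed given ${\bm{U}}_{j}={\bm{u}}$; the shorthand $p_{j}({\bm{u}})\coloneqq P(Y_{ij}>\omega_{(J,n_{j})}\mid{\bm{U}}_{j}={\bm{u}})$ is introduced only for this illustration.

```latex
\begin{aligned}
n_{j0}\mid{\bm{U}}_{j}={\bm{u}}&\sim{\rm{Binomial}}\left(n_{j},p_{j}({\bm{u}})\right),\\
E\left[\frac{n_{j0}}{n_{j}p_{j}({\bm{u}})}\ \middle|\ {\bm{U}}_{j}={\bm{u}}\right]&=\frac{n_{j}p_{j}({\bm{u}})}{n_{j}p_{j}({\bm{u}})}=1,\\
{\rm{var}}\left[\frac{n_{j0}}{n_{j}p_{j}({\bm{u}})}\ \middle|\ {\bm{U}}_{j}={\bm{u}}\right]&=\frac{n_{j}p_{j}({\bm{u}})\left[1-p_{j}({\bm{u}})\right]}{n_{j}^{2}p_{j}({\bm{u}})^{2}}=n_{j}^{-1}p_{j}({\bm{u}})^{-1}-n_{j}^{-1}.
\end{aligned}
```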

For each $j\in\mathcal{J}$ , we denote

H_{j}({\bm{u}})\coloneqq n_{0}^{-1}\sum_{i=1}^{n_{j}}h_{j}(Y_{ij},{\bm{u}},{\bm{X}}_{ij}),

(20)

where

h_{j}(y,{\bm{u}},{\bm{x}})\coloneqq\left[\log\gamma({\bm{u}},{\bm{x}})+\gamma({\bm{u}},{\bm{x}})^{-1}\log\frac{y}{\omega_{(J,n_{j})}}\right]I(y>\omega_{(J,n_{j})}),

which satisfies

\nabla_{\bm{u}}h_{j}(y,{\bm{u}},{\bm{x}})=\left[1-\gamma({\bm{u}},{\bm{x}})^{-1}\log\frac{y}{\omega_{(J,n_{j})}}\right]I(y>\omega_{(J,n_{j})}){\bm{x}}_{\rm{A}}

and

\nabla_{{\bm{u}}{\bm{u}}}^{2}h_{j}(y,{\bm{u}},{\bm{x}})=\gamma({\bm{u}},{\bm{x}})^{-1}\log\frac{y}{\omega_{(J,n_{j})}}I(y>\omega_{(J,n_{j})}){\bm{x}}_{\rm{A}}^{\otimes 2}.
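These two derivatives follow from the chain rule. The sketch below assumes the log-linear form $\gamma({\bm{u}},{\bm{x}})=\exp[({\bm{\theta}}_{\rm{A}}+{\bm{u}})^{\top}{\bm{x}}_{\rm{A}}+{\bm{\theta}}_{\rm{B}}^{\top}{\bm{x}}_{\rm{B}}]$ , which is the form appearing in the integrand of $\ell_{j}({\bm{\theta}})$ and in Eq. (17).

```latex
\begin{aligned}
\nabla_{\bm{u}}\log\gamma({\bm{u}},{\bm{x}})&={\bm{x}}_{\rm{A}},
\qquad
\nabla_{\bm{u}}\gamma({\bm{u}},{\bm{x}})^{-1}=-\gamma({\bm{u}},{\bm{x}})^{-1}{\bm{x}}_{\rm{A}},\\
\nabla_{\bm{u}}h_{j}(y,{\bm{u}},{\bm{x}})&=\left[{\bm{x}}_{\rm{A}}-\gamma({\bm{u}},{\bm{x}})^{-1}{\bm{x}}_{\rm{A}}\log\frac{y}{\omega_{(J,n_{j})}}\right]I(y>\omega_{(J,n_{j})}),\\
\nabla_{{\bm{u}}{\bm{u}}}^{2}h_{j}(y,{\bm{u}},{\bm{x}})&=-\left[\nabla_{\bm{u}}\gamma({\bm{u}},{\bm{x}})^{-1}\right]{\bm{x}}_{\rm{A}}^{\top}\log\frac{y}{\omega_{(J,n_{j})}}I(y>\omega_{(J,n_{j})})
=\gamma({\bm{u}},{\bm{x}})^{-1}\log\frac{y}{\omega_{(J,n_{j})}}I(y>\omega_{(J,n_{j})}){\bm{x}}_{\rm{A}}^{\otimes 2}.
\end{aligned}
```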

In the following Lemmas 2 and 3, we reveal the asymptotic properties of $H_{j},\ j\in\mathcal{J}$ .

Lemma 2.

Suppose that (A1)-(A4) and (A6) hold. Then, under given ${\bm{U}}_{j}={\bm{u}}$ , as $n_{j}\to\infty,\ j\in\mathcal{J}$ and $J\to\infty$ ,

n_{0}^{1/2}\nabla H_{j}({\bm{u}})\xrightarrow{D}N({\bm{0}},d_{j}({\bm{u}}){\bm{\Phi}}_{\rm{AA}}({\bm{u}})).

Proof of Lemma 2.

For each $j\in\mathcal{J}$ , $n_{0}^{1/2}\nabla H_{j}({\bm{u}})$ can be written as

$\displaystyle n_{0}^{1/2}\nabla H_{j}({\bm{u}})$	$\displaystyle=n_{0}^{-1/2}\sum_{i=1}^{n_{j}}\nabla_{\bm{u}}h_{j}(Y_{ij},{\bm{u}},{\bm{X}}_{ij})$
	$\displaystyle=D_{j}(\bm{u})^{1/2}n_{j}^{-1/2}\sum_{i=1}^{n_{j}}\frac{\nabla_{\bm{u}}h_{j}(Y_{ij},{\bm{u}},{\bm{X}}_{ij})-E\left[\nabla_{\bm{u}}h_{j}(Y_{ij},{\bm{u}},{\bm{X}}_{ij})\ \middle|\ {\bm{U}}_{j}={\bm{u}}\right]}{P(Y_{ij}>\omega_{(J,n_{j})}\mid{\bm{U}}_{j}={\bm{u}})^{1/2}}$	(21)
	$\displaystyle\quad+D_{j}(\bm{u})^{1/2}\frac{n_{j}^{1/2}E\left[\nabla_{\bm{u}}h_{j}(Y_{ij},{\bm{u}},{\bm{X}}_{ij})\ \middle|\ {\bm{U}}_{j}={\bm{u}}\right]}{P(Y_{ij}>\omega_{(J,n_{j})}\mid{\bm{U}}_{j}={\bm{u}})^{1/2}}.$	(22)

In the following Steps 1 and 2, we derive the asymptotic distributions of (21) and (22) conditional on ${\bm{U}}_{j}={\bm{u}}$ . By combining Lemma 1 and these steps, Lemma 2 holds from Slutsky’s theorem.

Step 1.

For (22), we show

\frac{J^{1/2}n_{j}^{1/2}E\left[\nabla_{\bm{u}}h_{j}(Y_{ij},{\bm{u}},{\bm{X}}_{ij})\ \middle|\ {\bm{U}}_{j}={\bm{u}}\right]}{P(Y_{ij}>\omega_{(J,n_{j})}\mid{\bm{U}}_{j}={\bm{u}})^{1/2}}\to{\bm{b}}_{{\rm{A}}j}({\bm{u}})

as $J\to\infty$ and $n_{j}\to\infty$ . Because ${\bm{U}}_{j}$ and ${\bm{X}}_{ij}$ are independent, we have

	$\displaystyle\frac{J^{1/2}n_{j}^{1/2}E\left[\nabla_{\bm{u}}h_{j}(Y_{ij},{\bm{u}},{\bm{X}}_{ij})\ \middle|\ {\bm{U}}_{j}={\bm{u}}\right]}{P(Y_{ij}>\omega_{(J,n_{j})}\mid{\bm{U}}_{j}={\bm{u}})^{1/2}}$
	$\displaystyle\quad=\frac{J^{1/2}n_{j}^{1/2}E_{{\bm{X}}_{ij}}\left[E\left[\nabla_{\bm{u}}h_{j}(Y_{ij},{\bm{u}},{\bm{X}}_{ij})\ \middle|\ {\bm{U}}_{j}={\bm{u}},{\bm{X}}_{ij}\right]\right]}{P(Y_{ij}>\omega_{(J,n_{j})}\mid{\bm{U}}_{j}={\bm{u}})^{1/2}}.$		(23)

By the integration by parts, we have

	$\displaystyle E\left[\nabla_{\bm{u}}h_{j}(Y_{ij},{\bm{u}},{\bm{x}})\ \middle|\ {\bm{U}}_{j}={\bm{u}},{\bm{X}}_{ij}={\bm{x}}\right]$
	$\displaystyle\quad=\left[\bar{F}(\omega_{(J,n_{j})}\mid{\bm{u}},{\bm{x}})-\gamma({\bm{u}},{\bm{x}})^{-1}\int_{0}^{\infty}\bar{F}(\omega_{(J,n_{j})}e^{s}\mid{\bm{u}},{\bm{x}})\,ds\right]{\bm{x}}_{\rm{A}},$

where $\bar{F}(\cdot\mid{\bm{u}},{\bm{x}})\coloneqq 1-F(\cdot\mid{\bm{u}},{\bm{x}})$ . Furthermore, from (A1), we have

	$\displaystyle\bar{F}(\omega_{(J,n_{j})}\mid{\bm{u}},{\bm{x}})-\gamma({\bm{u}},{\bm{x}})^{-1}\int_{0}^{\infty}\bar{F}(\omega_{(J,n_{j})}e^{s}\mid{\bm{u}},{\bm{x}})\,ds$
	$\displaystyle\quad=\left[\frac{c_{1}({\bm{u}},{\bm{x}})\gamma({\bm{u}},{\bm{x}})\beta({\bm{u}},{\bm{x}})}{1+\gamma({\bm{u}},{\bm{x}})\beta({\bm{u}},{\bm{x}})}\omega_{(J,n_{j})}^{-1/\gamma({\bm{u}},{\bm{x}})-\beta({\bm{u}},{\bm{x}})}\right]\left[1+o(1)\right].$

Accordingly, from (A6), (23) converges to ${\bm{b}}_{{\rm{A}}j}({\bm{u}})$ as $J\to\infty$ and $n_{j}\to\infty$ .
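As a sanity check of Step 1, consider the exact Pareto distribution (16), for which $\bar{F}(y\mid{\bm{u}},{\bm{x}})=y^{-1/\gamma({\bm{u}},{\bm{x}})}$ for $y\geq 1$ . The following worked case, written with $\gamma\coloneqq\gamma({\bm{u}},{\bm{x}})$ and $\omega\coloneqq\omega_{(J,n_{j})}\geq 1$ for brevity, suggests that the bracketed term vanishes identically, in agreement with the absence of asymptotic bias noted for (16) in Section 1.1.

```latex
\begin{aligned}
\bar{F}(\omega\mid{\bm{u}},{\bm{x}})-\gamma^{-1}\int_{0}^{\infty}\bar{F}(\omega e^{s}\mid{\bm{u}},{\bm{x}})\,ds
&=\omega^{-1/\gamma}-\gamma^{-1}\omega^{-1/\gamma}\int_{0}^{\infty}e^{-s/\gamma}\,ds\\
&=\omega^{-1/\gamma}-\gamma^{-1}\omega^{-1/\gamma}\gamma=0.
\end{aligned}
```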

Step 2.

For (21), we show that under given ${\bm{U}}_{j}={\bm{u}}$ ,

n_{j}^{-1/2}\sum_{i=1}^{n_{j}}\frac{\nabla_{\bm{u}}h_{j}(Y_{ij},{\bm{u}},{\bm{X}}_{ij})-E\left[\nabla_{\bm{u}}h_{j}(Y_{ij},{\bm{u}},{\bm{X}}_{ij})\ \middle|\ {\bm{U}}_{j}={\bm{u}}\right]}{P(Y_{ij}>\omega_{(J,n_{j})}\mid{\bm{U}}_{j}={\bm{u}})^{1/2}}\xrightarrow{D}N({\bm{0}},{\bm{\Phi}}_{\rm{AA}}({\bm{u}}))

(24)

as $J\to\infty$ and $n_{j}\to\infty$ . Because (24) is the sum of conditionally independent and identically distributed random vectors, we can apply the central limit theorem. Obviously, the conditional expectation of (24) is ${\bm{0}}$ . Moreover, we obtain

	$\displaystyle{\rm{cov}}\left[\frac{\nabla_{\bm{u}}h_{j}(Y_{ij},{\bm{u}},{\bm{X}}_{ij})}{P(Y_{ij}>\omega_{(J,n_{j})}\mid{\bm{U}}_{j}={\bm{u}})^{1/2}}\ \middle|\ {\bm{U}}_{j}={\bm{u}}\right]$
	$\displaystyle\quad=E\left[\frac{\nabla_{\bm{u}}h_{j}(Y_{ij},{\bm{u}},{\bm{X}}_{ij})^{\otimes 2}}{P(Y_{ij}>\omega_{(J,n_{j})}\mid{\bm{U}}_{j}={\bm{u}})}\ \middle|\ {\bm{U}}_{j}={\bm{u}}\right]$		(25)
	$\displaystyle\quad\quad-E\left[\frac{\nabla_{\bm{u}}h_{j}(Y_{ij},{\bm{u}},{\bm{X}}_{ij})}{P(Y_{ij}>\omega_{(J,n_{j})}\mid{\bm{U}}_{j}={\bm{u}})^{1/2}}\ \middle|\ {\bm{U}}_{j}={\bm{u}}\right]^{\otimes 2}.$		(26)

From Step 1, (26) converges to ${\bm{O}}$ as $J\to\infty$ and $n_{j}\to\infty$ . Therefore, we show that (25) converges to ${\bm{\Phi}}_{\rm{AA}}({\bm{u}})$ as $J\to\infty$ and $n_{j}\to\infty$ . From Eq. (9) of our main article, we have

{\bm{\xi}}_{j}({\bm{u}},{\bm{x}})\coloneqq E\left[\nabla_{\bm{u}}h_{j}(Y_{ij},{\bm{u}},{\bm{X}}_{ij})^{\otimes 2}\ \middle|\ {\bm{U}}_{j}={\bm{u}},{\bm{X}}_{ij}={\bm{x}},Y_{ij}>\omega_{(J,n_{j})}\right]\to{\bm{x}}_{\rm{A}}^{\otimes 2}

(27)

uniformly for all ${\bm{x}}\in\mathbb{R}^{p}$ as $J\to\infty$ and $n_{j}\to\infty$ . In addition, from (A2), we have

\delta_{j}({\bm{u}},{\bm{x}})\coloneqq\frac{P(Y_{ij}>\omega_{(J,n_{j})}\mid{\bm{U}}_{j}={\bm{u}},{\bm{X}}_{ij}={\bm{x}})}{P(Y_{ij}>\omega_{(J,n_{j})}\mid{\bm{U}}_{j}={\bm{u}})}\to\delta({\bm{u}},{\bm{x}})

(28)

uniformly for all ${\bm{x}}\in\mathbb{R}^{p}$ as $J\to\infty$ and $n_{j}\to\infty$ . Now, (25) can be written as

E\left[\frac{\nabla_{\bm{u}}h_{j}(Y_{ij},{\bm{u}},{\bm{X}}_{ij})^{\otimes 2}}{P(Y_{ij}>\omega_{(J,n_{j})}\mid{\bm{U}}_{j}={\bm{u}})}\ \middle|\ {\bm{U}}_{j}={\bm{u}}\right]=E_{{\bm{X}}_{ij}}\left[\delta_{j}({\bm{u}},{\bm{X}}_{ij}){\bm{\xi}}_{j}({\bm{u}},{\bm{X}}_{ij})\right].

From (27) and (28), (25) then converges to ${\bm{\Phi}}_{\rm{AA}}({\bm{u}})$ as $J\to\infty$ and $n_{j}\to\infty$ .

∎

Lemma 3.

Suppose that (A1)-(A4) hold. Then, under given ${\bm{U}}_{j}={\bm{u}}$ , as $n_{j}\to\infty,\ j\in\mathcal{J}$ and $J\to\infty$ ,

\nabla^{2}H_{j}({\bm{u}})\xrightarrow{P}d_{j}({\bm{u}}){\bm{\Phi}}_{\rm{AA}}({\bm{u}}).

Proof of Lemma 3.

For each $j\in\mathcal{J}$ , $\nabla^{2}H_{j}({\bm{u}})$ can be written as

	$\displaystyle\nabla^{2}H_{j}({\bm{u}})$	$\displaystyle=n_{0}^{-1}\sum_{i=1}^{n_{j}}\nabla_{\bm{uu}}^{2}h_{j}(Y_{ij},{\bm{u}},{\bm{X}}_{ij})$
		$\displaystyle=D_{j}(\bm{u})n_{j}^{-1}\sum_{i=1}^{n_{j}}\frac{\nabla_{\bm{uu}}^{2}h_{j}(Y_{ij},{\bm{u}},{\bm{X}}_{ij})}{P(Y_{ij}>\omega_{(J,n_{j})}\mid{\bm{U}}_{j}={\bm{u}})}.$

We show that under given ${\bm{U}}_{j}={\bm{u}}$ ,

n_{j}^{-1}\sum_{i=1}^{n_{j}}\frac{\nabla_{\bm{uu}}^{2}h_{j}(Y_{ij},{\bm{u}},{\bm{X}}_{ij})}{P(Y_{ij}>\omega_{(J,n_{j})}\mid{\bm{U}}_{j}={\bm{u}})}\xrightarrow{P}{\bm{\Phi}}_{\rm{AA}}({\bm{u}})

(29)

as $J\to\infty$ and $n_{j}\to\infty$ . From Eq. (9) of our main article, we have

{\bm{\xi}}_{j}^{(1)}({\bm{u}},{\bm{x}})\coloneqq E\left[\nabla_{\bm{uu}}^{2}h_{j}(Y_{ij},{\bm{u}},{\bm{x}})\ \middle|\ {\bm{U}}_{j}={\bm{u}},{\bm{X}}_{ij}={\bm{x}},Y_{ij}>\omega_{(J,n_{j})}\right]\to{\bm{x}}_{\rm{A}}^{\otimes 2}

and

{\bm{\xi}}_{j}^{(2)}({\bm{u}},{\bm{x}})\coloneqq E\left[{\rm{vec}}\left[\nabla_{\bm{uu}}^{2}h_{j}(Y_{ij},{\bm{u}},{\bm{x}})\right]^{\otimes 2}\ \middle|\ {\bm{U}}_{j}={\bm{u}},{\bm{X}}_{ij}={\bm{x}},Y_{ij}>\omega_{(J,n_{j})}\right]\to 2{\rm{vec}}\left({\bm{x}}_{\rm{A}}^{\otimes 2}\right)^{\otimes 2}

uniformly for all ${\bm{x}}\in\mathbb{R}^{p}$ as $J\to\infty$ and $n_{j}\to\infty$ . Now, (29) has the form of the sum of conditionally independent and identically distributed random vectors, and ${\bm{U}}_{j}$ and ${\bm{X}}_{ij}$ are independent. These facts yield that for (29),

	$\displaystyle E\left[n_{j}^{-1}\sum_{i=1}^{n_{j}}\frac{\nabla_{\bm{uu}}^{2}h_{j}(Y_{ij},{\bm{u}},{\bm{X}}_{ij})}{P(Y_{ij}>\omega_{(J,n_{j})}\mid{\bm{U}}_{j}={\bm{u}})}\ \middle|\ {\bm{U}}_{j}={\bm{u}}\right]$	$\displaystyle=E\left[\delta_{j}({\bm{u}},{\bm{X}}_{ij}){\bm{\xi}}_{j}^{(1)}({\bm{u}},{\bm{X}}_{ij})\right]$
		$\displaystyle\to{\bm{\Phi}}_{\rm{AA}}({\bm{u}})$

and

		$\displaystyle{\rm{cov}}\left[n_{j}^{-1}\sum_{i=1}^{n_{j}}\frac{{\rm{vec}}\left[\nabla_{\bm{uu}}^{2}h_{j}(Y_{ij},{\bm{u}},{\bm{X}}_{ij})\right]}{P(Y_{ij}>\omega_{(J,n_{j})}\mid{\bm{U}}_{j}={\bm{u}})}\ \middle|\ {\bm{U}}_{j}={\bm{u}}\right]$
	$\displaystyle\begin{split}&=n_{j}^{-1}P(Y_{ij}>\omega_{(J,n_{j})}\mid{\bm{U}}_{j}={\bm{u}})^{-1}E\left[\delta_{j}({\bm{u}},{\bm{X}}_{ij}){\bm{\xi}}_{j}^{(2)}({\bm{u}},{\bm{X}}_{ij})\right]\\ &\quad-n_{j}^{-1}E\left[\delta_{j}({\bm{u}},{\bm{X}}_{ij}){\rm{vec}}\left[{\bm{\xi}}_{j}^{(1)}({\bm{u}},{\bm{X}}_{ij})\right]\right]^{\otimes 2}\end{split}$
		$\displaystyle\to{\bm{O}}$

as $J\to\infty$ and $n_{j}\to\infty$ , where $\delta_{j}$ is defined in (28). Therefore, (29) holds. By combining Lemma 1 and (29), we then obtain Lemma 3. ∎
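The limits in (27) and in the two displays for ${\bm{\xi}}_{j}^{(1)}$ and ${\bm{\xi}}_{j}^{(2)}$ can be traced back to the moments of the unit exponential distribution. The sketch below assumes that the exponential approximation from Eq. (9) of our main article has mean $\gamma({\bm{u}},{\bm{x}})$ , so that the variable $Z$ , a shorthand introduced only here, is approximately ${\rm{Exp}}(1)$ given $Y_{ij}>\omega_{(J,n_{j})}$ .

```latex
\begin{aligned}
Z&\coloneqq\gamma({\bm{u}},{\bm{x}})^{-1}\log\frac{Y_{ij}}{\omega_{(J,n_{j})}},
\qquad E[Z]=1,\quad E[Z^{2}]=2,\quad E[(1-Z)^{2}]=1-2E[Z]+E[Z^{2}]=1,\\
\nabla_{\bm{u}}h_{j}^{\otimes 2}&=(1-Z)^{2}{\bm{x}}_{\rm{A}}^{\otimes 2},
\qquad
\nabla_{\bm{uu}}^{2}h_{j}=Z{\bm{x}}_{\rm{A}}^{\otimes 2},
\qquad
{\rm{vec}}\left[\nabla_{\bm{uu}}^{2}h_{j}\right]^{\otimes 2}=Z^{2}{\rm{vec}}\left({\bm{x}}_{\rm{A}}^{\otimes 2}\right)^{\otimes 2}.
\end{aligned}
```

Taking conditional expectations of the second line with the moments in the first line gives the limits ${\bm{x}}_{\rm{A}}^{\otimes 2}$ , ${\bm{x}}_{\rm{A}}^{\otimes 2}$ and $2{\rm{vec}}({\bm{x}}_{\rm{A}}^{\otimes 2})^{\otimes 2}$ , respectively.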

For each $j\in\mathcal{J}$ , we denote the minimizer of $H_{j}({\bm{u}})$ defined in (20) as $\dot{\bm{U}}_{j}$ , which satisfies the following Lemma 4.

Lemma 4.

Suppose that (A1)-(A4) and (A6) hold. Then, as $n_{j}\to\infty,\ j\in\mathcal{J}$ and $J\to\infty$ , $n_{0}^{1/2}(\dot{\bm{U}}_{j}-{\bm{U}}_{j})=O_{P}(1)$ uniformly for all $j\in\mathcal{J}$ .

Proof of Lemma 4.

We show that under given ${\bm{U}}_{j}={\bm{u}}$ , $n_{0}^{1/2}(\dot{\bm{U}}_{j}-{\bm{u}})=O_{P}(1)$ uniformly for all $j\in\mathcal{J}$ and ${\bm{u}}\in\mathbb{R}^{p_{\rm{A}}}$ . By the Taylor expansion of $H_{j}({\bm{u}})$ , we have

H_{j}(n_{0}^{-1/2}{\bm{s}}+{\bm{u}})=H_{j}({\bm{u}})+n_{0}^{-1}{\bm{s}}^{\top}\left[n_{0}^{1/2}\nabla H_{j}({\bm{u}})\right]+2^{-1}n_{0}^{-1}{\bm{s}}^{\top}\nabla^{2}H_{j}({\bm{u}}){\bm{s}}+o_{P}(1)

for any ${\bm{s}}\in\mathbb{R}^{p_{\rm{A}}}$ and $j\in\mathcal{J}$ . From Lemmas 2 and 3, we have that for any $\varepsilon>0$ , there exists a large constant $B>0$ such that for any $j\in\mathcal{J}$ and ${\bm{u}}\in\mathbb{R}^{p_{\rm{A}}}$ ,

\liminf_{n_{j}\to\infty,\ j\in\mathcal{J},\ J\to\infty}P\left(\inf_{{\bm{s}}\in\mathbb{R}^{p_{\rm{A}}}:\left\lVert{\bm{s}}\right\rVert=B}H_{j}(n_{0}^{-1/2}{\bm{s}}+{\bm{u}})>H_{j}({\bm{u}})\ \middle|\ {\bm{U}}_{j}={\bm{u}}\right)\geq 1-\varepsilon.

(30)

We assume that for all ${\bm{u}}\in\mathbb{R}^{p_{\rm{A}}}$ , $\nabla^{2}H_{j}({\bm{u}})$ is a positive definite matrix, which implies that $H_{j}({\bm{u}})$ is a strictly convex function. Therefore, $\dot{\bm{U}}_{j}$ is the unique global minimizer of $H_{j}({\bm{u}})$ . Then, we obtain Lemma 4 (see, the proof of Theorem 1 of Fan and Li 2001). ∎

To show Lemma 6 below, we use the result of the following Laplace approximation. The proof is described in (2.6) of Tierney, Kass, and Kadane (1989) and Appendix A of Miyata (2004).

Lemma 5.

For any smooth functions $g$ , $c$ and $h:\mathbb{R}^{d}\to\mathbb{R}$ ,

		$\displaystyle\frac{\int g({\bm{u}})c({\bm{u}})\exp\left[-nh({\bm{u}})\right]d{\bm{u}}}{\int c({\bm{u}})\exp\left[-nh({\bm{u}})\right]d{\bm{u}}}$
	$\displaystyle\begin{split}&=g(\dot{\bm{u}})+\frac{\nabla g(\dot{\bm{u}})^{\top}\nabla^{2}h(\dot{\bm{u}})^{-1}\nabla c(\dot{\bm{u}})}{nc(\dot{\bm{u}})}\\ &\quad+\frac{{\rm{tr}}\left[\nabla^{2}h(\dot{\bm{u}})^{-1}\nabla^{2}g(\dot{\bm{u}})\right]}{2n}-\frac{\nabla g(\dot{\bm{u}})^{\top}\nabla^{2}h(\dot{\bm{u}})^{-1}{\bm{a}}(\dot{\bm{u}})}{2n}+O(n^{-2}),\end{split}$

where $\dot{\bm{u}}$ is the minimizer of $h({\bm{u}})$ , ${\bm{a}}(\dot{\bm{u}})$ is the $d\times 1$ vector with the $k$ th entry equal to ${\rm{tr}}[\nabla^{2}h(\dot{\bm{u}})^{-1}\nabla^{3}h(\dot{\bm{u}})_{[k]}]$ , $\nabla^{3}h(\dot{\bm{u}})_{[k]}$ is the $d\times d$ matrix with the $(i,j)$ entry equal to the $(i,j,k)$ entry of $\nabla^{3}h(\dot{\bm{u}})$ , and $\nabla^{3}h({\bm{u}})$ denotes the $d\times d\times d$ array with the $(i,j,k)$ entry $(\partial^{3}/\partial u_{i}\partial u_{j}\partial u_{k})h({\bm{u}})$ .
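A one-dimensional numerical check may help to read Lemma 5. The sketch below is an illustration under our own choice of test functions, not part of the proof: we take $g(u)=u$ , $c(u)=e^{u}$ and $h(u)=\cosh(u)-1$ , so that $\dot{u}=0$ , $\nabla^{2}h(0)=1$ , $\nabla c(0)/c(0)=1$ , $\nabla^{2}g(0)=0$ and $\nabla^{3}h(0)=0$ . The expansion then reduces to $1/n+O(n^{-2})$ , which is compared with a Riemann-sum evaluation of the ratio. The function names are hypothetical.

```python
import math


def laplace_ratio_numeric(n, lo=-3.0, hi=3.0, m=60001):
    """Riemann-sum value of the ratio in Lemma 5 for g(u)=u, c(u)=exp(u), h(u)=cosh(u)-1."""
    step = (hi - lo) / (m - 1)
    num = 0.0
    den = 0.0
    for k in range(m):
        u = lo + k * step
        # Weight c(u) * exp(-n h(u)); the common step size cancels in the ratio.
        w = math.exp(u - n * (math.cosh(u) - 1.0))
        num += u * w
        den += w
    return num / den


def laplace_ratio_leading(n):
    """Leading terms of Lemma 5 for this choice: g(0) + g'(0) c'(0) / (n h''(0) c(0))."""
    return 0.0 + (1.0 * 1.0) / (n * 1.0 * 1.0)


# Gaps between the numerical ratio and the leading terms for a few values of n.
errors = {n: abs(laplace_ratio_numeric(n) - laplace_ratio_leading(n)) for n in (20, 50, 100)}
```

The entries of `errors` should shrink at the rate $n^{-2}$ as $n$ grows, which is the order of the remainder in Lemma 5.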

Lemma 6.

Suppose that (A1)-(A4) and (A6) hold. Then, as $n_{j}\to\infty,\ j\in\mathcal{J}$ and $J\to\infty$ ,

	$\displaystyle\nabla_{\rm{A}}\ell_{j}({\bm{\theta}}^{0})={\bm{\Sigma}}_{0}^{-1}\left[{\bm{U}}_{j}+n_{0}^{-1}d_{j}({\bm{U}}_{j})^{-1}{\bm{\Phi}}_{\rm{AA}}({\bm{U}}_{j})^{-1}{\bm{g}}_{{\rm{A}}j}({\bm{U}}_{j})\right]+O_{P}(n_{0}^{-1}),$		(31)
	$\displaystyle\nabla_{\rm{B}}\ell_{j}({\bm{\theta}}^{0})={\bm{g}}_{{\rm{B}}j}({\bm{U}}_{j})-{\bm{\Phi}}_{\rm{AB}}({\bm{U}}_{j})^{\top}{\bm{\Phi}}_{\rm{AA}}({\bm{U}}_{j})^{-1}{\bm{g}}_{{\rm{A}}j}({\bm{U}}_{j})+O_{P}(1)$		(32)

and

\displaystyle\begin{split}\nabla_{\rm{C}}\ell_{j}({\bm{\theta}}^{0})&=2^{-1}{\bm{M}}_{*}\left({\bm{\Sigma}}_{0}\otimes{\bm{\Sigma}}_{0}\right)^{-1}{\rm{vec}}\left[{\bm{U}}_{j}^{\otimes 2}-{\bm{\Sigma}}_{0}+n_{0}^{-1}d_{j}({\bm{U}}_{j})^{-1}{\bm{U}}_{j}{\bm{g}}_{{\rm{A}}j}({\bm{U}}_{j})^{\top}{\bm{\Phi}}_{\rm{AA}}({\bm{U}}_{j})^{-1}\right.\\ &\quad+\left.n_{0}^{-1}d_{j}({\bm{U}}_{j})^{-1}{\bm{\Phi}}_{\rm{AA}}({\bm{U}}_{j})^{-1}{\bm{g}}_{{\rm{A}}j}({\bm{U}}_{j}){\bm{U}}_{j}^{\top}\right]+O_{P}(n_{0}^{-1}),\end{split}

(33)

where

{\bm{g}}_{{\rm{K}}j}({\bm{u}})\coloneqq\sum_{i=1}^{n_{j}}\left[\gamma({\bm{u}},{\bm{X}}_{ij})^{-1}\log\frac{Y_{ij}}{\omega_{(J,n_{j})}}-1\right]I(Y_{ij}>\omega_{(J,n_{j})}){\bm{X}}_{{\rm{K}}ij}

for $j\in\mathcal{J}$ and ${\rm{K}}\in\{{\rm{A}},{\rm{B}}\}$ .

Proof of Lemma 6.

For each $j\in\mathcal{J}$ , $\nabla_{\rm{K}}\ell_{j}({\bm{\theta}}^{0}),\ {\rm{K}}\in\{{\rm{A}},{\rm{B}},{\rm{C}}\}$ can be written as

	$\displaystyle\nabla_{{\rm{A}}}\ell_{j}({\bm{\theta}}^{0})=\frac{\int_{\mathbb{R}^{p_{\rm{A}}}}{\bm{g}}_{{\rm{A}}j}({\bm{u}})c_{D}({\bm{u}})\exp\left[-n_{0}H_{j}({\bm{u}})\right]d{\bm{u}}}{\int_{\mathbb{R}^{p_{\rm{A}}}}c_{D}({\bm{u}})\exp\left[-n_{0}H_{j}({\bm{u}})\right]d{\bm{u}}},$
	$\displaystyle\nabla_{{\rm{B}}}\ell_{j}({\bm{\theta}}^{0})=\frac{\int_{\mathbb{R}^{p_{\rm{A}}}}{\bm{g}}_{{\rm{B}}j}({\bm{u}})c_{D}({\bm{u}})\exp\left[-n_{0}H_{j}({\bm{u}})\right]d{\bm{u}}}{\int_{\mathbb{R}^{p_{\rm{A}}}}c_{D}({\bm{u}})\exp\left[-n_{0}H_{j}({\bm{u}})\right]d{\bm{u}}}$

and

\nabla_{\rm{C}}\ell_{j}({\bm{\theta}}^{0})=\frac{\int_{\mathbb{R}^{p_{\rm{A}}}}{\bm{g}}_{{\rm{C}}j}({\bm{u}})c_{D}({\bm{u}})\exp\left[-n_{0}H_{j}({\bm{u}})\right]d{\bm{u}}}{\int_{\mathbb{R}^{p_{\rm{A}}}}c_{D}({\bm{u}})\exp\left[-n_{0}H_{j}({\bm{u}})\right]d{\bm{u}}}-2^{-1}{\bm{M}}_{*}{\rm{vec}}\left({\bm{\Sigma}}_{0}^{-1}\right),

where $c_{D}({\bm{u}})\coloneqq\exp(-2^{-1}{\bm{u}}^{\top}{\bm{\Sigma}}_{0}^{-1}{\bm{u}})$ and ${\bm{g}}_{{\rm{C}}j}({\bm{u}})\coloneqq 2^{-1}{\bm{M}}_{*}({\bm{\Sigma}}_{0}\otimes{\bm{\Sigma}}_{0})^{-1}{\rm{vec}}({\bm{u}}^{\otimes 2})$ . For each $j\in\mathcal{J}$ , ${\rm{K}}\in\{{\rm{A}},{\rm{B}},{\rm{C}}\}$ and $l\in\{1,2,\ldots,p_{\rm{K}}\}$ , we denote the $l$ th component of ${\bm{g}}_{{\rm{K}}j}({\bm{u}})$ as $g_{{\rm{K}}j}^{(l)}({\bm{u}})$ . From Lemma 5, we have

		$\displaystyle\frac{\int_{\mathbb{R}^{p_{\rm{A}}}}{\bm{g}}_{{\rm{K}}j}({\bm{u}})c_{D}({\bm{u}})\exp\left[-n_{0}H_{j}({\bm{u}})\right]d{\bm{u}}}{\int_{\mathbb{R}^{p_{\rm{A}}}}c_{D}({\bm{u}})\exp\left[-n_{0}H_{j}({\bm{u}})\right]d{\bm{u}}}$
	$\displaystyle\begin{split}&={\bm{g}}_{{\rm{K}}j}(\dot{\bm{U}}_{j})+\left[\frac{\nabla g_{{\rm{K}}j}^{(l)}(\dot{\bm{U}}_{j})^{\top}\nabla^{2}H_{j}(\dot{\bm{U}}_{j})^{-1}\nabla c_{D}(\dot{\bm{U}}_{j})}{n_{0}c_{D}(\dot{\bm{U}}_{j})}\right]_{p_{\rm{K}}\times 1,\ 1\leq l\leq p_{\rm{K}}}\\ &\quad+\left[\frac{{\rm{tr}}\left[\nabla^{2}H_{j}(\dot{\bm{U}}_{j})^{-1}\nabla^{2}g_{{\rm{K}}j}^{(l)}(\dot{\bm{U}}_{j})\right]}{2n_{0}}\right]_{p_{\rm{K}}\times 1,\ 1\leq l\leq p_{\rm{K}}}\\ &\quad-\left[\frac{\nabla g_{{\rm{K}}j}^{(l)}(\dot{\bm{U}}_{j})^{\top}\nabla^{2}H_{j}(\dot{\bm{U}}_{j})^{-1}{\bm{a}}_{j}(\dot{\bm{U}}_{j})}{2n_{0}}\right]_{p_{\rm{K}}\times 1,\ 1\leq l\leq p_{\rm{K}}}+O(n_{0}^{-2})\end{split}$			(34)

for $j\in\mathcal{J}$ and ${\rm{K}}\in\{{\rm{A}},{\rm{B}},{\rm{C}}\}$ . In the following Steps 1-3, we evaluate $\nabla_{\rm{K}}\ell_{j}({\bm{\theta}}^{0})$ based on (34).

Step 1.

We apply (34) to $\nabla_{\rm{A}}\ell_{j}({\bm{\theta}}^{0})$ . By straightforward calculation, we have ${\bm{g}}_{{\rm{A}}j}(\dot{\bm{U}}_{j})={\bm{0}}$ ,

\left[\frac{\nabla g_{{\rm{A}}j}^{(l)}(\dot{\bm{U}}_{j})^{\top}\nabla^{2}H_{j}(\dot{\bm{U}}_{j})^{-1}\nabla c_{D}(\dot{\bm{U}}_{j})}{n_{0}c_{D}(\dot{\bm{U}}_{j})}\right]_{p_{\rm{A}}\times 1,\ 1\leq l\leq p_{\rm{A}}}={\bm{\Sigma}}_{0}^{-1}\dot{\bm{U}}_{j}

and

\frac{{\rm{tr}}\left[\nabla^{2}H_{j}(\dot{\bm{U}}_{j})^{-1}\nabla^{2}g_{{\rm{A}}j}^{(l)}(\dot{\bm{U}}_{j})\right]}{2n_{0}}-\frac{\nabla g_{{\rm{A}}j}^{(l)}(\dot{\bm{U}}_{j})^{\top}\nabla^{2}H_{j}(\dot{\bm{U}}_{j})^{-1}{\bm{a}}_{j}(\dot{\bm{U}}_{j})}{2n_{0}}=0

for $l\in\{1,2,\ldots,p_{\rm{A}}\}$ . Furthermore, from Lemmas 3 and 4, by the Taylor expansion of $\nabla H_{j}({\bm{u}})$ , we obtain

\dot{\bm{U}}_{j}={\bm{U}}_{j}+n_{0}^{-1}d_{j}({\bm{U}}_{j})^{-1}{\bm{\Phi}}_{\rm{AA}}({\bm{U}}_{j})^{-1}{\bm{g}}_{{\rm{A}}j}({\bm{U}}_{j})+O_{P}(n_{0}^{-1}).

(35)

Consequently, we obtain (31).

Step 2.

In this step, (34) is applied to $\nabla_{\rm{B}}\ell_{j}({\bm{\theta}}^{0})$ . From Lemma 4, (35) and results similar to Lemma 3, by the Taylor expansion of ${\bm{g}}_{{\rm{B}}j}({\bm{u}})$ , we obtain

	$\displaystyle{\bm{g}}_{{\rm{B}}j}(\dot{\bm{U}}_{j})$	$\displaystyle={\bm{g}}_{{\rm{B}}j}({\bm{U}}_{j})-n_{0}d_{j}({\bm{U}}_{j}){\bm{\Phi}}_{\rm{AB}}({\bm{U}}_{j})^{\top}\left(\dot{\bm{U}}_{j}-{\bm{U}}_{j}\right)\left[1+o_{P}(1)\right]$
		$\displaystyle={\bm{g}}_{{\rm{B}}j}({\bm{U}}_{j})-{\bm{\Phi}}_{\rm{AB}}({\bm{U}}_{j})^{\top}{\bm{\Phi}}_{\rm{AA}}({\bm{U}}_{j})^{-1}{\bm{g}}_{{\rm{A}}j}({\bm{U}}_{j})+O_{P}(1).$

In addition, we have

\displaystyle\begin{split}&\frac{\nabla g_{{\rm{B}}j}^{(l)}(\dot{\bm{U}}_{j})^{\top}\nabla^{2}H_{j}(\dot{\bm{U}}_{j})^{-1}\nabla c_{D}(\dot{\bm{U}}_{j})}{n_{0}c_{D}(\dot{\bm{U}}_{j})}+\frac{{\rm{tr}}\left[\nabla^{2}H_{j}(\dot{\bm{U}}_{j})^{-1}\nabla^{2}g_{{\rm{B}}j}^{(l)}(\dot{\bm{U}}_{j})\right]}{2n_{0}}\\ &\quad-\frac{\nabla g_{{\rm{B}}j}^{(l)}(\dot{\bm{U}}_{j})^{\top}\nabla^{2}H_{j}(\dot{\bm{U}}_{j})^{-1}{\bm{a}}_{j}(\dot{\bm{U}}_{j})}{2n_{0}}=O_{P}(1)\end{split}

for $l\in\{1,2,\ldots,p_{\rm{B}}\}$ . Therefore, (32) is obtained.

Step 3.

In the last step, we calculate $\nabla_{\rm{C}}\ell_{j}({\bm{\theta}}^{0})$ according to (34). From (35), we have

	$\displaystyle{\bm{g}}_{{\rm{C}}j}(\dot{\bm{U}}_{j})$	$\displaystyle=2^{-1}{\bm{M}}_{*}\left({\bm{\Sigma}}_{0}\otimes{\bm{\Sigma}}_{0}\right)^{-1}{\rm{vec}}\left(\dot{\bm{U}}_{j}^{\otimes 2}\right)$
	$\displaystyle\begin{split}&=2^{-1}{\bm{M}}_{*}\left({\bm{\Sigma}}_{0}\otimes{\bm{\Sigma}}_{0}\right)^{-1}\\ &\quad\times{\rm{vec}}\left[{\bm{U}}_{j}^{\otimes 2}+n_{0}^{-1}d_{j}({\bm{U}}_{j})^{-1}{\bm{U}}_{j}{\bm{g}}_{{\rm{A}}j}({\bm{U}}_{j})^{\top}{\bm{\Phi}}_{\rm{AA}}({\bm{U}}_{j})^{-1}\right.\\ &\quad\left.+n_{0}^{-1}d_{j}({\bm{U}}_{j})^{-1}{\bm{\Phi}}_{\rm{AA}}({\bm{U}}_{j})^{-1}{\bm{g}}_{{\rm{A}}j}({\bm{U}}_{j}){\bm{U}}_{j}^{\top}\right]+O_{P}(n_{0}^{-1}).\end{split}$

Additionally, we have

\displaystyle\begin{split}&\frac{\nabla g_{{\rm{C}}j}^{(l)}(\dot{\bm{U}}_{j})^{\top}\nabla^{2}H_{j}(\dot{\bm{U}}_{j})^{-1}\nabla c_{D}(\dot{\bm{U}}_{j})}{n_{0}c_{D}(\dot{\bm{U}}_{j})}+\frac{{\rm{tr}}\left[\nabla^{2}H_{j}(\dot{\bm{U}}_{j})^{-1}\nabla^{2}g_{{\rm{C}}j}^{(l)}(\dot{\bm{U}}_{j})\right]}{2n_{0}}\\ &\quad-\frac{\nabla g_{{\rm{C}}j}^{(l)}(\dot{\bm{U}}_{j})^{\top}\nabla^{2}H_{j}(\dot{\bm{U}}_{j})^{-1}{\bm{a}}_{j}(\dot{\bm{U}}_{j})}{2n_{0}}=O_{P}(n_{0}^{-1})\end{split}

for $l\in\{1,2,\ldots,p_{\rm{C}}\}$ . Thus, (33) is shown.

∎

Lemma 6 leads to the following Lemmas 7-9.

Lemma 7.

Suppose that (A1)-(A6) hold. Then, as $n_{j}\to\infty,\ j\in\mathcal{J}$ and $J\to\infty$ ,

J^{-1/2}\nabla_{\rm{A}}\ell({\bm{\theta}}^{0})+n_{0}^{-1/2}{\bm{\Delta}}_{\rm{A}}^{-1}{\bm{b}}_{\rm{A}}\xrightarrow{D}N({\bm{0}},{\bm{\Delta}}_{\rm{A}}^{-1}).

Proof of Lemma 7.

We denote

{\bm{Z}}\coloneqq J^{-1/2}\sum_{j=1}^{J}n_{0}^{-1/2}d_{j}({\bm{U}}_{j})^{-1}{\bm{\Phi}}_{\rm{AA}}({\bm{U}}_{j})^{-1}{\bm{g}}_{{\rm{A}}j}({\bm{U}}_{j}).

From Lemma 6, we then have

J^{-1/2}\nabla_{\rm{A}}\ell({\bm{\theta}}^{0})=J^{-1/2}\sum_{j=1}^{J}{\bm{\Sigma}}_{0}^{-1}{\bm{U}}_{j}+n_{0}^{-1/2}{\bm{\Sigma}}_{0}^{-1}{\bm{Z}}\left[1+o_{P}(1)\right].

(36)

From the reproductive property of the normal distribution, the first term on the right-hand side of (36) converges to $N({\bm{0}},{\bm{\Sigma}}_{0}^{-1})$ in distribution as $J\to\infty$ . From the proof of Lemma 2, for the second term on the right-hand side of (36), we have

	$\displaystyle E\left[{\bm{Z}}\right]$	$\displaystyle=J^{-1}\sum_{j=1}^{J}E\left[d_{j}({\bm{U}}_{j})^{-1}{\bm{\Phi}}_{\rm{AA}}({\bm{U}}_{j})^{-1}E\left[J^{1/2}n_{0}^{-1/2}{\bm{g}}_{{\rm{A}}j}({\bm{U}}_{j})\ \middle\|\ {\bm{U}}_{j}\right]\right]$
		$\displaystyle=-\left(J^{-1}\sum_{j=1}^{J}E\left[d_{j}({\bm{U}}_{j})^{-1/2}{\bm{\Phi}}_{\rm{AA}}({\bm{U}}_{j})^{-1}{\bm{b}}_{{\rm{A}}j}({\bm{U}}_{j})\right]\right)\left[1+o(1)\right]$
		$\displaystyle=-{\bm{b}}_{\rm{A}}\left[1+o(1)\right]$

and ${\rm{cov}}[n_{0}^{-1/2}{\bm{Z}}]\to{\bm{O}}$ , which implies that the sum of $n_{0}^{-1/2}{\bm{\Sigma}}_{0}^{-1}{\bm{b}}_{\rm{A}}$ and the second term on the right-hand side of (36) converges to ${\bm{0}}$ in probability as $n_{j}\to\infty,\ j\in\mathcal{J}$ and $J\to\infty$ . Thus, Lemma 7 is shown. ∎
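The limit of the first term of (36) can be illustrated by simulation in the scalar case. The sketch below is an illustration under stated assumptions, not part of the proof: it takes $p_{\rm{A}}=1$, writes `sigma2` for ${\bm{\Sigma}}_{0}$, draws $U_{j}\sim N(0,\sigma^{2})$ independently, and checks that $J^{-1/2}\sum_{j}\sigma^{-2}U_{j}$ has variance close to $\sigma^{-2}$. The function names are hypothetical.

```python
import math
import random


def scaled_score_sum(J, sigma2, rng):
    """Return J^{-1/2} * sum_j U_j / sigma2 with U_j ~ N(0, sigma2) i.i.d."""
    sd = math.sqrt(sigma2)
    total = sum(rng.gauss(0.0, sd) for _ in range(J))
    return total / (sigma2 * math.sqrt(J))


def empirical_variance(J, sigma2, reps, seed=0):
    """Sample variance of the scaled sum over `reps` independent replications."""
    rng = random.Random(seed)
    draws = [scaled_score_sum(J, sigma2, rng) for _ in range(reps)]
    mean = sum(draws) / reps
    return sum((x - mean) ** 2 for x in draws) / (reps - 1)


if __name__ == "__main__":
    sigma2 = 2.0
    # The target value is 1 / sigma2 for every J, by the reproductive property.
    print(empirical_variance(J=50, sigma2=sigma2, reps=4000, seed=1), 1.0 / sigma2)
```

Because the $U_{j}$ are exactly normal, the scaled sum is exactly $N(0,\sigma^{-2})$ for every $J$, so the simulation only has Monte Carlo error.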

Lemma 8.

Suppose that (A1)-(A6) hold. Then, as $n_{j}\to\infty,\ j\in\mathcal{J}$ and $J\to\infty$ ,

J^{-1/2}n_{0}^{-1/2}\nabla_{\rm{B}}\ell({\bm{\theta}}^{0})\xrightarrow{D}N(-{\bm{\Delta}}_{\rm{B}}^{-1}{\bm{b}}_{\rm{B}},{\bm{\Delta}}_{\rm{B}}^{-1}).

Proof of Lemma 8.

We denote

{\bm{W}}_{j}({\bm{u}})\coloneqq{\bm{g}}_{{\rm{B}}j}({\bm{u}})-{\bm{\Phi}}_{\rm{AB}}({\bm{u}})^{\top}{\bm{\Phi}}_{\rm{AA}}({\bm{u}})^{-1}{\bm{g}}_{{\rm{A}}j}({\bm{u}}).

From Lemma 6, we have

J^{-1/2}n_{0}^{-1/2}\nabla_{\rm{B}}\ell({\bm{\theta}}^{0})=\left[J^{-1/2}\sum_{j=1}^{J}n_{0}^{-1/2}{\bm{W}}_{j}({\bm{U}}_{j})\right]\left[1+o_{P}(1)\right].

(37)

Now, the right-hand side of (37) can be written as

	$\displaystyle J^{-1/2}\sum_{j=1}^{J}n_{0}^{-1/2}{\bm{W}}_{j}({\bm{U}}_{j})$
	$\displaystyle=J^{-1/2}\sum_{j=1}^{J}n_{j0}^{1/2}n_{0}^{-1/2}\left[n_{j0}^{-1/2}{\bm{W}}_{j}({\bm{U}}_{j})-E\left[n_{j0}^{-1/2}{\bm{W}}_{j}({\bm{U}}_{j})\ \middle\|\ {\bm{U}}_{j}\right]\right]$		(38)
	$\displaystyle\quad+J^{-1}\sum_{j=1}^{J}E\left[J^{1/2}n_{0}^{-1/2}{\bm{W}}_{j}({\bm{U}}_{j})\ \middle\|\ {\bm{U}}_{j}\right].$		(39)

Similar to the proof of Lemma 7, (39) converges to $-{\bm{\Delta}}_{\rm{B}}^{-1}{\bm{b}}_{\rm{B}}$ in probability as $n_{j}\to\infty,\ j\in\mathcal{J}$ and $J\to\infty$ . As in Lemma 2, for (38), we have, conditional on ${\bm{U}}_{j}={\bm{u}}_{j}$ ,

	$\displaystyle n_{j0}^{-1/2}{\bm{W}}_{j}({\bm{u}}_{j})-E\left[n_{j0}^{-1/2}{\bm{W}}_{j}({\bm{u}}_{j})\ \middle\|\ {\bm{U}}_{j}={\bm{u}}_{j}\right]$
	$\displaystyle\quad\xrightarrow{D}N({\bm{0}},{\bm{\Phi}}_{\rm{BB}}({\bm{u}}_{j})-{\bm{\Phi}}_{\rm{AB}}({\bm{u}}_{j})^{\top}{\bm{\Phi}}_{\rm{AA}}({\bm{u}}_{j})^{-1}{\bm{\Phi}}_{\rm{AB}}({\bm{u}}_{j}))$

as $J\to\infty$ and $n_{j}\to\infty$ . Therefore, (38) is a weighted sum of independent and asymptotically identically distributed random vectors, to which the weighted central limit theorem can be applied (see, Weber 2006). As a result, (38) converges to $N({\bm{0}},{\bm{\Delta}}_{\rm{B}}^{-1})$ in distribution as $n_{j}\to\infty,\ j\in\mathcal{J}$ and $J\to\infty$ . Thus, the proof of Lemma 8 is completed. ∎

Lemma 9.

Suppose that (A1)-(A6) hold. Then, as $n_{j}\to\infty,\ j\in\mathcal{J}$ and $J\to\infty$ ,

J^{-1/2}\nabla_{\rm{C}}\ell({\bm{\theta}}^{0})+n_{0}^{-1/2}{\bm{\Delta}}_{\rm{C}}^{-1}{\bm{b}}_{\rm{C}}\xrightarrow{D}N({\bm{0}},{\bm{\Delta}}_{\rm{C}}^{-1}).

Proof of Lemma 9.

We denote

{\bm{V}}_{j}({\bm{u}})\coloneqq d_{j}({\bm{u}})^{-1}\left[{\bm{u}}{\bm{g}}_{{\rm{A}}j}({\bm{u}})^{\top}{\bm{\Phi}}_{\rm{AA}}({\bm{u}})^{-1}+{\bm{\Phi}}_{\rm{AA}}({\bm{u}})^{-1}{\bm{g}}_{{\rm{A}}j}({\bm{u}}){\bm{u}}^{\top}\right].

From Lemma 6, we obtain

	$\displaystyle J^{-1/2}\nabla_{\rm{C}}\ell({\bm{\theta}}^{0})$
	$\displaystyle=2^{-1}{\bm{M}}_{*}\left({\bm{\Sigma}}_{0}\otimes{\bm{\Sigma}}_{0}\right)^{-1}J^{-1/2}\sum_{j=1}^{J}{\rm{vec}}\left({\bm{U}}_{j}^{\otimes 2}-{\bm{\Sigma}}_{0}\right)$		(40)
	$\displaystyle\quad+2^{-1}{\bm{M}}_{*}\left({\bm{\Sigma}}_{0}\otimes{\bm{\Sigma}}_{0}\right)^{-1}n_{0}^{-1/2}\left(J^{-1/2}\sum_{j=1}^{J}n_{0}^{-1/2}{\rm{vec}}\left[{\bm{V}}_{j}({\bm{U}}_{j})\right]\right)\left[1+o_{P}(1)\right].$		(41)

The matrix ${\bm{U}}_{j}^{\otimes 2}$ follows a Wishart distribution with $E[{\bm{U}}_{j}^{\otimes 2}]={\bm{\Sigma}}_{0}$ and ${\rm{cov}}[{\rm{vec}}({\bm{U}}_{j}^{\otimes 2})]=2({\bm{\Sigma}}_{0}\otimes{\bm{\Sigma}}_{0})$ . Therefore, by the central limit theorem, (40) converges to $N({\bm{0}},{\bm{\Delta}}_{\rm{C}}^{-1})$ in distribution as $J\to\infty$ . Moreover, similar to the proof of Lemma 7, (41) is asymptotically equivalent to $-n_{0}^{-1/2}{\bm{\Delta}}_{\rm{C}}^{-1}{\bm{b}}_{\rm{C}}$ . Consequently, we obtain Lemma 9. ∎
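As a sanity check on the limit of (40), the scalar case can be worked out by hand. The following is a sketch under assumptions that are not stated in the text: $p_{\rm{A}}=1$, ${\bm{\Sigma}}_{0}=\sigma^{2}$, and ${\bm{M}}_{*}$ reducing to the scalar $1$. It uses only the normal moments $E[U_{j}^{2}]=\sigma^{2}$ and $E[U_{j}^{4}]=3\sigma^{4}$.

```latex
% Scalar sketch of term (40); \sigma^2 stands for \Sigma_0 when p_A = 1,
% and M_* = 1 is an assumption made here for illustration only.
\begin{align*}
\text{(40)} &= \frac{1}{2\sigma^{4}}\,J^{-1/2}\sum_{j=1}^{J}\left(U_{j}^{2}-\sigma^{2}\right),\\
E\left[U_{j}^{2}-\sigma^{2}\right] &= 0,\qquad
{\rm var}\left[U_{j}^{2}\right] = E\left[U_{j}^{4}\right]-\sigma^{4} = 3\sigma^{4}-\sigma^{4} = 2\sigma^{4},\\
{\rm var}\left[\text{(40)}\right] &= \frac{1}{4\sigma^{8}}\cdot 2\sigma^{4} = \frac{1}{2\sigma^{4}}.
\end{align*}
```

Under these assumptions the limiting variance $1/(2\sigma^{4})$ is the Fisher information for $\sigma^{2}$ in the $N(0,\sigma^{2})$ model, which is what one would expect ${\bm{\Delta}}_{\rm{C}}^{-1}$ to reduce to in the scalar case.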

Lemmas 7-9 above are summarized in the following two propositions.

Proposition 1.

Suppose that (A1)-(A6) hold. Then, as $n_{j}\to\infty,\ j\in\mathcal{J}$ and $J\to\infty$ ,

{\bm{\Upsilon}}_{(J,n_{0})}^{-1}\nabla\ell({\bm{\theta}}^{0})+\begin{bmatrix}n_{0}^{-1/2}{\bm{\Delta}}_{\rm{A}}^{-1}{\bm{b}}_{\rm{A}}\\ {\bm{\Delta}}_{\rm{B}}^{-1}{\bm{b}}_{\rm{B}}\\ n_{0}^{-1/2}{\bm{\Delta}}_{\rm{C}}^{-1}{\bm{b}}_{\rm{C}}\\ \end{bmatrix}\xrightarrow{D}N\left({\bm{0}},\begin{bmatrix}{\bm{\Delta}}_{\rm{A}}^{-1}&{\bm{O}}&{\bm{O}}\\ {\bm{O}}&{\bm{\Delta}}_{\rm{B}}^{-1}&{\bm{O}}\\ {\bm{O}}&{\bm{O}}&{\bm{\Delta}}_{\rm{C}}^{-1}\end{bmatrix}\right).

Proof of Proposition 1.

Similar to Lemmas 7-9, from Lemma 6, we have

	$\displaystyle{\rm{cov}}\left[J^{-1/2}\nabla_{\rm{A}}\ell({\bm{\theta}}^{0}),\ J^{-1/2}n_{0}^{-1/2}\nabla_{\rm{B}}\ell({\bm{\theta}}^{0})\right]\to{\bm{O}},$
	$\displaystyle{\rm{cov}}\left[J^{-1/2}\nabla_{\rm{A}}\ell({\bm{\theta}}^{0}),\ J^{-1/2}\nabla_{\rm{C}}\ell({\bm{\theta}}^{0})\right]\to{\bm{O}}$

and

{\rm{cov}}\left[J^{-1/2}n_{0}^{-1/2}\nabla_{\rm{B}}\ell({\bm{\theta}}^{0}),\ J^{-1/2}\nabla_{\rm{C}}\ell({\bm{\theta}}^{0})\right]\to{\bm{O}}

as $n_{j}\to\infty,\ j\in\mathcal{J}$ and $J\to\infty$ . By combining these results and Lemmas 7-9, we obtain Proposition 1. ∎

Proposition 2.

Suppose that (A1)-(A6) hold. Then, as $n_{j}\to\infty,\ j\in\mathcal{J}$ and $J\to\infty$ ,

{\bm{\Upsilon}}_{(J,n_{0})}^{-1}\nabla^{2}\ell({\bm{\theta}}^{0}){\bm{\Upsilon}}_{(J,n_{0})}^{-1}\xrightarrow{P}\begin{bmatrix}-{\bm{\Delta}}_{\rm{A}}^{-1}&{\bm{O}}&{\bm{O}}\\ {\bm{O}}&-{\bm{\Delta}}_{\rm{B}}^{-1}&{\bm{O}}\\ {\bm{O}}&{\bm{O}}&-{\bm{\Delta}}_{\rm{C}}^{-1}\end{bmatrix}.

Proof of Proposition 2.

From Lemma 5 of Nie (2007) and Lemma 5 above, the covariance matrix of ${\rm{vec}}({\bm{\Upsilon}}_{(J,n_{0})}^{-1}\nabla^{2}\ell({\bm{\theta}}^{0}){\bm{\Upsilon}}_{(J,n_{0})}^{-1})$ converges to ${\bm{O}}$ as $n_{j}\to\infty,\ j\in\mathcal{J}$ and $J\to\infty$ . Now, we have

E\left[{\bm{\Upsilon}}_{(J,n_{0})}^{-1}\nabla^{2}\ell({\bm{\theta}}^{0}){\bm{\Upsilon}}_{(J,n_{0})}^{-1}\right]=-E\left[{\bm{\Upsilon}}_{(J,n_{0})}^{-1}\nabla\ell({\bm{\theta}}^{0})^{\otimes 2}{\bm{\Upsilon}}_{(J,n_{0})}^{-1}\right].

(42)

By calculating the right-hand side of (42) using Lemma 6, Proposition 2 is obtained. ∎

Proof of Theorem 1 in Section 2.3 of our main article.

For any ${\bm{\theta}}\in\mathbb{R}^{p_{\rm{A}}+p_{\rm{B}}+p_{\rm{C}}}$ , the Taylor expansion of $\ell({\bm{\theta}})$ around ${\bm{\theta}}={\bm{\theta}}^{0}$ yields

	$\displaystyle\ell({\bm{\Upsilon}}_{(J,n_{0})}^{-1}{\bm{\theta}}+{\bm{\theta}}^{0})$
	$\displaystyle=\ell({\bm{\theta}}^{0})+{\bm{\theta}}^{\top}{\bm{\Upsilon}}_{(J,n_{0})}^{-1}\nabla\ell({\bm{\theta}}^{0})+2^{-1}{\bm{\theta}}^{\top}{\bm{\Upsilon}}_{(J,n_{0})}^{-1}\nabla^{2}\ell({\bm{\theta}}^{0}){\bm{\Upsilon}}_{(J,n_{0})}^{-1}{\bm{\theta}}+o_{P}(1).$

From Propositions 1 and 2, for any $\varepsilon>0$ , there exists a large constant $B>0$ such that

\liminf_{n_{j}\to\infty,\ j\in\mathcal{J},\ J\to\infty}P\left(\inf_{{\bm{\theta}}\in\mathbb{R}^{p_{\rm{A}}+p_{\rm{B}}+p_{\rm{C}}}:\left\lVert{\bm{\theta}}\right\rVert=B}-\ell({\bm{\Upsilon}}_{(J,n_{0})}^{-1}{\bm{\theta}}+{\bm{\theta}}^{0})>-\ell({\bm{\theta}}^{0})\right)\geq 1-\varepsilon

(43)

as $n_{j}\to\infty,\ j\in\mathcal{J}$ and $J\to\infty$ . We assume that for all ${\bm{\theta}}$ , $-\nabla^{2}\ell({\bm{\theta}})$ is positive definite, which implies that $-\ell({\bm{\theta}})$ is strictly convex. Therefore, $\hat{\bm{\theta}}$ is the unique global maximizer of $\ell({\bm{\theta}})$ . Then, (43) implies ${\bm{\Upsilon}}_{(J,n_{0})}(\hat{\bm{\theta}}-{\bm{\theta}}^{0})=O_{P}(1)$ (see, the proof of Theorem 1 of Fan and Li 2001). Because $\hat{\bm{\theta}}$ is the global maximizer of $\ell({\bm{\theta}})$ , we have $\nabla\ell(\hat{\bm{\theta}})={\bm{0}}$ . From the Taylor expansion of $\nabla\ell({\bm{\theta}})$ around ${\bm{\theta}}^{0}$ , we have

\displaystyle{\bm{\Upsilon}}_{(J,n_{0})}\left(\hat{\bm{\theta}}-{\bm{\theta}}^{0}\right)

\displaystyle=-\left[{\bm{\Upsilon}}_{(J,n_{0})}^{-1}\nabla^{2}\ell({\bm{\theta}}^{0}){\bm{\Upsilon}}_{(J,n_{0})}^{-1}\right]^{-1}{\bm{\Upsilon}}_{(J,n_{0})}^{-1}\nabla\ell({\bm{\theta}}^{0})+o_{P}(1).

Therefore, by applying Propositions 1 and 2, we obtain Theorem 1 of our main article. ∎
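The final step can be spelled out as follows. This is a sketch based only on Propositions 1 and 2 as stated above, so the resulting limit should be compared with the statement of Theorem 1 in the main article. The shorthand ${\bm{D}}$, ${\bm{\beta}}$ and ${\bm{\xi}}$ is introduced here for illustration, the subscript $(J,n_{0})$ on ${\bm{\Upsilon}}$ is dropped, and ${\bm{\Upsilon}}$ and ${\bm{D}}$ are assumed to be symmetric.

```latex
% Shorthand introduced for this sketch (not notation from the article):
%   D     = diag(\Delta_A, \Delta_B, \Delta_C)   (block diagonal),
%   \beta = bias vector appearing in Proposition 1,
%   \xi   = \Upsilon^{-1}\nabla\ell(\theta^0) + \beta, so \xi -> N(0, D^{-1}) in distribution.
\begin{align*}
{\bm{0}} = {\bm{\Upsilon}}^{-1}\nabla\ell(\hat{\bm{\theta}})
 &= {\bm{\Upsilon}}^{-1}\nabla\ell({\bm{\theta}}^{0})
  + \left[{\bm{\Upsilon}}^{-1}\nabla^{2}\ell({\bm{\theta}}^{0}){\bm{\Upsilon}}^{-1}\right]
    {\bm{\Upsilon}}\left(\hat{\bm{\theta}}-{\bm{\theta}}^{0}\right) + o_{P}(1),\\
{\bm{\Upsilon}}\left(\hat{\bm{\theta}}-{\bm{\theta}}^{0}\right)
 &= -\left[-{\bm{D}}^{-1}+o_{P}(1)\right]^{-1}\left({\bm{\xi}}-{\bm{\beta}}\right) + o_{P}(1)
  = {\bm{D}}{\bm{\xi}} - {\bm{D}}{\bm{\beta}} + o_{P}(1),\\
{\bm{D}}{\bm{\xi}} &\xrightarrow{D} N\left({\bm{0}},\,{\bm{D}}{\bm{D}}^{-1}{\bm{D}}\right) = N\left({\bm{0}},\,{\bm{D}}\right),
\qquad
{\bm{D}}{\bm{\beta}} =
\begin{bmatrix} n_{0}^{-1/2}{\bm{b}}_{\rm{A}}\\ {\bm{b}}_{\rm{B}}\\ n_{0}^{-1/2}{\bm{b}}_{\rm{C}} \end{bmatrix}.
\end{align*}
```

Under these assumptions, ${\bm{\Upsilon}}(\hat{\bm{\theta}}-{\bm{\theta}}^{0})+{\bm{D}}{\bm{\beta}}$ converges in distribution to $N({\bm{0}},{\bm{D}})$ by Slutsky's theorem, so each block of the estimator has asymptotic covariance ${\bm{\Delta}}_{\rm{K}}$ and asymptotic bias driven by ${\bm{b}}_{\rm{K}}$ , ${\rm{K}}\in\{{\rm{A}},{\rm{B}},{\rm{C}}\}$ .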

References

[1] Almeida, S., A. Loy, and H. Hofmann. 2018. ggplot2 compatible quantile-quantile plots in R. The R Journal 10 (2):248-61. doi:10.32614/RJ-2018-051.
[2] Bates, D., M. Mächler, B. Bolker, and S. Walker. 2015. Fitting linear mixed-effects models using lme4. Journal of Statistical Software 67 (1):1-48. doi:10.18637/jss.v067.i01.
[3] Fan, J., and R. Li. 2001. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96 (456):1348-60. doi:10.1198/016214501753382273.
[4] Miyata, Y. 2004. Fully exponential Laplace approximations using asymptotic modes. Journal of the American Statistical Association 99 (468):1037-49. doi:10.1198/016214504000001673.
[5] Nie, L. 2007. Convergence rate of MLE in generalized linear and nonlinear mixed-effects models: Theory and applications. Journal of Statistical Planning and Inference 137 (6):1787-804. doi:10.1016/j.jspi.2005.06.010.
[6] R Core Team. 2021. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.
[7] Tierney, L., R. E. Kass, and J. B. Kadane. 1989. Fully exponential Laplace approximations to expectations and variances of nonpositive functions. Journal of the American Statistical Association 84 (407):710-6. doi:10.1080/01621459.1989.10478824.
[8] Wang, H., and C. L. Tsai. 2009. Tail index regression. Journal of the American Statistical Association 104 (487):1233-40. doi:10.1198/jasa.2009.tm08458.
[9] Weber, M. 2006. A weighted central limit theorem. Statistics & Probability Letters 76 (14):1482-7. doi:10.1016/j.spl.2006.03.007.

	$\displaystyle{\rm{cov}}\left[\frac{\nabla_{\bm{u}}h_{j}(Y_{ij},{\bm{u}},{\bm{X}}_{ij})}{P(Y_{ij}>\omega_{(J,n_{j})}\mid{\bm{U}}_{j}={\bm{u}})^{1/2}}\ \middle\|\ {\bm{U}}_{j}={\bm{u}}\right]$
	$\displaystyle\quad=E\left[\frac{\nabla_{\bm{u}}h_{j}(Y_{ij},{\bm{u}},{\bm{X}}_{ij})^{\otimes 2}}{P(Y_{ij}>\omega_{(J,n_{j})}\mid{\bm{U}}_{j}={\bm{u}})}\ \middle\|\ {\bm{U}}_{j}={\bm{u}}\right]$		(25)
	$\displaystyle\quad\quad-E\left[\frac{\nabla_{\bm{u}}h_{j}(Y_{ij},{\bm{u}},{\bm{X}}_{ij})}{P(Y_{ij}>\omega_{(J,n_{j})}\mid{\bm{U}}_{j}={\bm{u}})^{1/2}}\ \middle\|\ {\bm{U}}_{j}={\bm{u}}\right]^{\otimes 2}.$		(26)