Sparse M-estimators in semi-parametric copula models
Abstract
We study the large sample properties of sparse M-estimators in the presence of pseudo-observations. Our framework covers a broad class of semi-parametric copula models, for which the marginal distributions are unknown and replaced by their empirical counterparts. It is well known that the latter modification significantly alters the limiting laws compared to usual M-estimation. We establish the consistency and the asymptotic normality of our sparse penalized M-estimator and we prove the asymptotic oracle property with pseudo-observations, possibly with a diverging number of parameters. Our framework allows us to manage copula-based loss functions that are potentially unbounded. Additionally, we state the weak limit of multivariate rank statistics for an arbitrary dimension and the weak convergence of empirical copula processes indexed by maps. We apply our inference method to Canonical Maximum Likelihood losses with Gaussian copulas, mixtures of copulas or conditional copulas. The theoretical results are illustrated by two numerical experiments.
Key words: Copulas; M-estimation; Pseudo-observations; Sparsity.
1 Introduction
In this paper, we consider the parsimonious estimation of copula models within the semi-parametric framework: margins are left unspecified and a parametric copula model is assumed. The sparsity assumption is motivated by the model complexity that occurs in copula modelling, where the parameterization may require the estimation of a large number of parameters. For instance, the variance-covariance matrix of a -dimensional Gaussian copula involves the estimation of parameters, the components of an unknown correlation matrix; in single-index copulas, the underlying conditional copula is parameterized through a link function that depends on a potentially large number of covariates and thus of parameters, which may not all be relevant for describing these conditional distributions. Since the seminal work of [12], a significant amount of literature dedicated to sparsity-based M-estimators has flourished in a broad range of settings. In contrast, the sparse estimation of copula-based M-estimators has received very limited attention so far. In [7], the authors considered a mixture of copulas with a joint estimation of the weight parameters and the copula parameters, while penalizing the former ones only. However, a strong limitation of their approach is the parametric assumption formulated for the marginals, which greatly simplifies the large sample inference. [37] specified a penalized estimating equation-based estimator for single-index copula models and derived the corresponding large sample properties, but assuming known margins. A theory covering sparse estimation for semi-parametric copulas is an important missing piece in the literature. One of the key difficulties is the treatment of the values close to the boundaries of , where some loss functions potentially “explode”. The latter situation is not pathological. It often occurs for the standard Canonical Maximum Likelihood method, or CML (see, e.g., [19, 33, 34]), and many usual copula log-densities, as pointed out in [32] in particular. Since the seminal works of [31, 30], the large sample properties of the empirical copula process were established by, e.g., [16], and such properties were applied to the asymptotic analysis of copula-based maximum likelihood estimators with pseudo-observations. In that case, the empirical copula process is indexed by the likelihood function: see [34, 8], who considered some regularity conditions on the likelihood function to manage the values close to the boundaries of . Similar conditions were stated in [8, 39, 22], among others, where some bracketing number conditions on a suitable class of functions are assumed. These works share a similar spirit with [36, 11], who considered a general framework of empirical processes indexed by classes of functions under entropy conditions. Thanks to a general integration by parts formula, [28] established conditions for the weak convergence of the empirical copula process indexed by a class of functions of bounded variation, the so-called Hardy–Krause variation. Their results do not require explicit entropy conditions on the class of functions. In the same vein, [6] assumed similar regularity conditions on the indexing functions but restricted their analysis to the two-dimensional copula case. It is worth mentioning that the techniques for the large sample analysis of semi-parametric copulas differ substantially from those of the fully parametric viewpoint, for which the classical M-estimation theory obviously applies.
The present paper is then motivated by the lack of links between sparsity and semi-parametric copulas. Our asymptotic analysis for sparse M-estimators in the context of semi-parametric copulas builds upon the theoretical frameworks of [6] and [28]. The contribution of our paper is fourfold: first, we provide the asymptotic theory (consistency, oracle property for variable selection and asymptotic normality in the same spirit as [12]) for general penalized semi-parametric copula models, where the margins are estimated by their empirical counterparts. In particular, our setting includes the Canonical Maximum Likelihood method. Second, these asymptotic results are extended for (a sequence of) copula models in large dimensions, a framework that corresponds to the diverging dimension case, as in, e.g., [13]. Third, we prove the asymptotic normality of multivariate-rank statistics for any arbitrary dimension , extending Theorem 3.3 of [6]. Fourth and finally, we prove the weak convergence of the empirical copula process indexed by functions of bounded variation, extending Theorem 5 of [28] to cover the prevailing situation of unbounded copula densities. We emphasize that our theory is not restricted to i.i.d. data and potentially covers the case of dependent observations, as in [28].
The rest of the paper is organized as follows. Section 2 details the framework and fixes our notation. The large sample properties of our penalized estimator are provided in Section 3. The case of conditional copulas is handled in Section 4. Section 5 discusses some examples and two simulated experiments to illustrate the relevance of our method. The theoretical results about multivariate rank statistics and empirical copula processes are stated in Appendix A. All the proofs, the theoretical results in the case of a diverging number of parameters and additional simulated experiments are provided in the Appendix.
2 The framework
This section details the sparse estimation framework for copula models when the marginal distributions are non-parametrically managed. We consider a sample of realizations of a random vector , . This sample is denoted as . The observations may be dependent or not. As usual in the copula world, we are more interested in the “reduced” random variables , , where denotes the cumulative distribution function (c.d.f.) of . Throughout this paper, we make the blanket assumption that all have continuous marginals. Then, the variables are uniformly distributed on and the joint law of is the uniquely defined copula of denoted by . To study the latter copula, it would be tempting to work with the sample instead of . Unfortunately, since the marginal c.d.f.s are unknown in general, so is the latter sample, and the marginal c.d.f.s have to be replaced by consistent estimates. Therefore, it is common to build a sample of pseudo-observations , , obtained from the initial sample . Here and as usual, set for every and every , using the -th re-scaled empirical c.d.f. We will denote by , , the empirical c.d.f. of the (unobservable) random variable , i.e. The empirical c.d.f. of is , i.e. for any . We denote by the usual empirical process associated with the sample , i.e.
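For illustration, a minimal Python sketch of this rank-based construction of the pseudo-observations reads as follows (the function name and the array conventions are ours and purely illustrative; the paper's replication code is in Matlab):

```python
import numpy as np

def pseudo_observations(x):
    """Rescaled empirical c.d.f. transform: the rank of each observation within
    its column, divided by n + 1. Ties are ignored since the margins are continuous."""
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    ranks = np.argsort(np.argsort(x, axis=0), axis=0) + 1  # ranks in {1, ..., n}
    return ranks / (n + 1.0)
```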
The natural estimator of the true underlying copula , i.e. the c.d.f. of , is the empirical copula map
(2.1)
and the associated empirical copula process is .
Hereafter, we select a parametric family of copulas , , and we assume it contains the true copula : there exists (the “true value” of the parameter) s.t. . We want to cover the usual case of semi-parametric dependence models, for which there is an orthogonality condition of the type for some family of loss functions . The dimension of the copula parameter will be fixed hereafter. In the Appendix, it will be allowed to tend to infinity with the sample size . The function is usually defined as a quadratic loss or as minus a log-likelihood function. Note that does not have to be defined on the boundaries of at this stage because the law of was assumed to be continuous. Moreover, an important contribution of the paper will be to deal with some maps that cannot be continuously extended to .
To estimate , let us specify a statistical criterion. Consider a global loss function from to . The value evaluates the quality of the “fit” given for every and under . Hereafter, we assume there exists a continuous function such that
(2.2)
for every and every in . As usual for the inference of semiparametric copula models, the empirical loss cannot be calculated since we do not observe the realizations of in practice. Therefore, invoking the “pseudo-sample” , the empirical loss will be approximated by , a quantity called the “pseudo-empirical” loss function.
Example 1.
A key example is the Canonical Maximum Likelihood method: the law of (i.e. the copula of ) belongs to a parametric family and . There, for i.i.d. data, , minus the log-copula density of w.r.t. the Lebesgue measure on .
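To fix ideas, here is a minimal Python sketch of such a pseudo-empirical CML loss, with a bivariate Clayton copula chosen purely as an example (the density formula is the standard Clayton one; the function names are illustrative):

```python
import numpy as np

def clayton_log_density(u, v, theta):
    """Log-density of the bivariate Clayton copula, theta > 0."""
    return (np.log(1.0 + theta)
            - (1.0 + theta) * (np.log(u) + np.log(v))
            - (2.0 + 1.0 / theta) * np.log(u ** (-theta) + v ** (-theta) - 1.0))

def cml_loss(theta, pseudo_u, pseudo_v):
    """Pseudo-empirical CML loss: average of minus the log copula density
    evaluated at the pseudo-observations."""
    return -np.mean(clayton_log_density(pseudo_u, pseudo_v, theta))
```

Note that such a loss diverges when a pseudo-observation approaches the boundary of the unit square, illustrating the potentially unbounded losses our framework is designed to handle.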
Now, we assume that the unknown parameter is sparse and we introduce a penalization term. Our criterion becomes
(2.3)
when such a minimizer exists. Here, , for , is a penalty function, where is a tuning parameter which depends on the sample size and enforces a particular type of sparse structure. Throughout the paper, we will implicitly work under the following assumption.
Assumption 0.
The parameter space is a borelian subset of . The function is uniquely minimized on at , and an open neighborhood of is contained in .
Note that only the uniqueness of is required, but not that of . In [12, 13], is assumed to be an open subset. Nonetheless, it may happen that belongs to the frontier of . This may be the case, e.g., for the estimation of weights in mixture models (see Example 3 below). Actually, our results will apply even if does not contain an open neighborhood of . Indeed, it is sufficient that there exists a convex open neighborhood of in : for some open ball centered at and with a radius and for every parameter , every element of the segment , belongs to and is the limit of a sequence of elements in , where is the interior of . Therefore, we will define the partial derivatives of any map at some point on the frontier of (in particular ) by continuity. For example, when the derivative of exists in the interior of , for some , it will be defined at by , assuming the latter limit exists. To lighten the presentation, we will keep Assumption 0 as above. By slightly strengthening some regularity assumptions, the case of on the boundary of will be straightforwardly obtained (essentially by imposing the continuity of some derivatives around ).
Some well-known penalty functions are the LASSO, where for every , and the non-convex SCAD and MCP. The SCAD penalty of [12] is defined as: for every ,
where . The MCP due to [38] is defined for as follows: for every ,
Note that when (resp. ), the SCAD (resp. MCP) penalty behaves as the LASSO penalty since all coefficients of are equally penalized. The idea of sparsity for copulas naturally applies to a broad range of situations. In such cases, the parameter value zero usually plays a particular role, possibly after reparameterization. This is in line with the usual penalties cited previously.
Example 2.
Consider a Gaussian copula model in dimension , whose parameter is a correlation matrix . The description of all the underlying dependencies between the components of is a rather painful task. Then, the sparsity of becomes a nice property. Indeed, the independence between two components of is equivalent to the nullity of their corresponding coefficients in .
Example 3.
The inference of mixtures of copulas may justify the application of a penalty function. Indeed, consider a set of known -dimensional copulas , . In practice, we could try to approximate the true underlying copula by a mixture , for every . Here, the underlying parameter is the vector of weights and is defined as . If a weight is estimated as zero, its corresponding copula does not contribute to the approximation of . The latter model is generally misspecified, but our theory will apply even in this case, interpreting in (2.3) as an estimator of a “pseudo-true” value . If we apply the CML method, is the log-copula density of . When some or all copulas depend on unknown parameters that need to be estimated in addition to the weights, the penalty function could also be applied to these copula parameters.
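A minimal Python sketch of the corresponding pseudo-empirical CML loss may help to fix ideas; the component copula densities are passed as generic callables since Example 3 leaves them unspecified, and all names are illustrative:

```python
import numpy as np

def mixture_cml_loss(weights, component_densities, pseudo_obs):
    """Pseudo-empirical CML loss for a mixture of known copula densities.

    weights: mixture weights (non-negative, summing to one)
    component_densities: list of callables, each mapping an (n, d) array of
        pseudo-observations to an (n,) array of copula density values
    pseudo_obs: (n, d) array of pseudo-observations in (0, 1)
    """
    dens = np.column_stack([c(pseudo_obs) for c in component_densities])  # (n, M)
    return -np.mean(np.log(dens @ np.asarray(weights)))
```

In practice, the minimization of the penalized version of this loss is carried out under the constraints on the weights recalled in Section 5.1.2.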
When dealing with conditional copulas ([17], e.g.), our framework is slightly modified. Now, the law of a random vector knowing the vector of covariates is given by a parametric conditional copula whose parameter depends on a known map of and is denoted , . Besides, the laws of the margins , , given are unknown and we assume they do not depend on . In other words, the conditional law of given is assumed to be
Therefore, as for the CML method and in the i.i.d. case, an estimator of would be
Surprisingly, and to the best of our knowledge, the asymptotic theory of such estimators has not been explicitly stated in the literature until now. This will be the topic of Section 4.
Example 4.
Sparsity naturally applies to single-index copulas (see, e.g., [18]). The function is now specified with respect to the underlying parameter. In other words, sparsity refers to the situation where only a (small) subset of the components is relevant to describe the dependencies between the components given . Consider the conditional Gaussian copulas, as in Example 4 of [18]. Here, the correlation matrix would be a function of . It may be rewritten , where denotes the conditional Kendall’s tau of given .
3 Asymptotic properties
To prove the asymptotic results, we consider two sets of assumptions: one is related to the loss function; the other one concerns the penalty function. First, define the support of the true parameter as . We will implicitly work under a sparsity assumption, i.e. the cardinality of is “significantly” smaller than .
Assumption 1.
The map is thrice differentiable on , for every . The parameter satisfies the first order condition . Moreover, and exist and are positive definite. Finally, for every , there exists a constant such that
Assumption 2.
For any , the copula partial derivative exists and is continuous on . For every couple , the second-order partial derivative exists and is continuous on . Moreover, there exists a positive constant such that
(3.1)
When does not belong to (i.e. when one of its components is zero or one), we have defined . It has been pointed out in [32] that Condition 2 is satisfied for bivariate Gaussian and bivariate extreme-value copulas in particular. We formally state this property for -dimensional Gaussian copulas in Section E of the Appendix.
Assumption 3.
We will denote by (resp. ) the first order (resp. second order) derivative of , for any .
Assumption 4.
Defining
assume that and when . Moreover, there exist constants and such that
for any real numbers such that .
Assumption 1 is a standard regularity condition for asymptotically normal M-estimators. Assumption 2 is a smoothness condition on the copula and is similar to Condition 4.1 of [32] and Condition 2.1 of [6]. It ensures that the second-order derivatives with respect to do not explode “too rapidly” when approaches the boundaries of . Assumption 3 is related to the indexing functions of the copula process, here given by the first, second and third order derivatives of the loss function . The -regularity of these functions ensures that they are of bounded Hardy–Krause variation (similar to Assumption F of [28]), together with some integrability conditions. Here, is some fixed number in entering the definition of a weight function , as specified in Definition A, point (ii). Such a weight function is related to the theory of weighted empirical processes applied in [6]. Assumption 4 is dedicated to the regularity of the penalty function, and includes conditions in the same vein as Assumption 3.1.1 of [13]. Note that the LASSO, SCAD and MCP penalties fulfill Assumption 4. Our first result establishes the existence of a consistent penalized M-estimator.
Theorem 3.1.
The proof is postponed to Section D of the Appendix. The factor could be replaced by any sequence that tends to infinity with , as in Corollary A.2. It has been chosen arbitrarily for convenience. We will apply this rule throughout the article.
We now show that the penalized estimator satisfies the oracle property in the sense of [12]: the true support can be recovered asymptotically and the non-zero estimated coefficients are asymptotically normal. Denote by the (random) support of our estimator. The following theorem states the main results of this section. It uses some standard notations for concatenation of sub-vectors, as recalled in Appendix A.
Theorem 3.2.
The proof is relegated to Section D of the Appendix. The existence and the meaning of the integrals defining come from Assumption 13. The proofs of the two latter theorems rely on a third-order Taylor expansion of our empirical loss w.r.t. its arguments. The negligible terms are managed thanks to a ULLN (Corollary A.2). An integration by parts formula (Theorem A.1) allows us to write the main terms as sums of integrals of the empirical copula process w.r.t. some “sufficiently regular” functions. The latter are deduced from the derivatives of the loss, justifying the concept of regularity and Assumption 3. A weak convergence result for the weighted empirical copula process concludes the asymptotic normality proof.
Note that the previous results apply with dependent observations . Indeed, in Theorem 3.2 (ii), we only require the weak convergence of the process to . In the i.i.d. case, is a Brownian bridge and is a Gaussian random vector. The latter assertion is still true when is a strongly stationary and geometrically alpha-mixing sequence, due to Proposition 4.4 in [6]. The existence and the meaning of the random variable come from Assumption 13. Our conditions on allow the use of the SCAD or the MCP penalty, but not the LASSO, because in the latter case. In other words, Theorem 3.1 may apply with the LASSO but not Theorem 3.2. The fact that the LASSO does not yield the oracle property has already been noted in the literature: see [40] and the references therein. Actually, consider any penalty such that does not depend on , when is sufficiently small, as in the SCAD and MCP cases. Then,
when is sufficiently large because , implying . Thus, Assumption 4 is satisfied and , for large enough. Therefore, for such penalty functions, the asymptotic law of becomes much simpler.
Corollary 3.3.
Theorem 3.2 establishes that the non-convex penalization procedure for semi-parametric copula models is an “oracle”: for a consistent estimator , the true sparse support is correctly identified in probability, and the non-zero estimated parameters have the same limiting law as if the support of were known. The limit is in the same vein as the one in Theorem 5 of [28]. However, in contrast to the latter result which assumes bounded indexing functions on , our framework covers the case of unbounded ones. In particular, our indexing functions are not constrained to be bounded on .
In the particular case of CML, the asymptotic law of our estimator has been stated under a set of regularity assumptions (particularly Assumption 3) that significantly differs from those proposed in the literature: see the seminal papers [34] (Assumption (A.1)) and [8] (Assumptions (A.2)-(A.4)). These competing sets of assumptions are apparently not nested, due to two very different proof techniques: a general integration by parts formula in our case, and some earlier results from [31, 30] in the latter cases. Our set of assumptions should most often be weaker. Indeed, we typically do not require that some expectations such as are finite for some (Assumption (A.3) in [8]). Moreover and importantly, our results may directly be applied to time series, contrary to the cited papers, which are restricted to i.i.d. observations.
Our proofs rely on the weak convergence of multivariate rank-order statistics: see Theorem A.1 in the Appendix, which is an extension of Theorem 3.3 of [6] to an arbitrary dimension. Note that some papers have already stated similar results, but under restrictive conditions: Theorem 6 in [15] assumed the existence of continuous partial derivatives of the true copula on the whole hypercube (contrary to our Assumption 2). Theorem 2.4 in [20] relies on numerous very technical assumptions and it is unclear whether the latter result is weaker or stronger than ours. Nonetheless, it “only” states weak convergence in the space of càdlàg functions equipped with the Skorohod topology.
Applying Theorem 3.2 (ii) or its corollary, it is possible to approximate the limiting law of by plug-in, for every fixed subset and assuming (an event whose probability tends to one): just replace the unknown quantities with their empirical counterparts. This is obvious concerning the matrices , and . Concerning the Gaussian random vector , an estimator of its variance is proposed in Section B of the Appendix.
4 Conditional copula models
In several situations of interest, multivariate models are defined through some conditional laws. In other words, there exists a random vector of covariates , for some borelian subset in , and we focus on the laws of given any value of . The natural framework is given by conditional copulas and the associated Sklar’s theorem ([17], e.g.). Generally speaking, conditional copula models have to manage conditional marginal distributions on one side, and conditional copulas on the other side. In our semiparametric approach, we do not specify the former ones. Indeed, this would be a source of complexity due to the need for kernel smoothing or other nonparametric techniques ([17, 2], among others). Here, we make the following simplifying assumption.
Assumption 5.
For every , the law of given does not depend on .
In other words, the conditional margins and the unconditional ones are identical. Even if the latter assumption may be considered as relatively strong, it is not implausible. For instance, in typical copula-GARCH models ([9]), the marginal dynamics are filtered in a first stage, and a parametric copula is then postulated between the residuals. It is well-known that systemic risk measures strongly depend on the economic cycle. Thus, some macroeconomic explanatory variables may have a significant impact on the latter copula. But, concerning the marginal conditional distributions, this effect could be hidden due to the first-order phenomenon of “volatility clustering”.
Under Assumption 5, our dependence model of interest will be related to the laws of given , , by keeping the same definition of as previously. Sparsity would then be related to the number of components of that are relevant to specify any copula of (or , equivalently) given . Typically, the latter copulas belong to a given parametric -dimensional family and the parameter depends on the underlying covariate: given , the law of is , for some known map , . The problem is now to evaluate the true “new parameter” , based on a sample . Compared to the previous sections, the focus has switched from to . In particular, we will assume that the new parameter set satisfies Assumption 0 instead of .
Under Assumption 5, we define the same pseudo-observations as before, and keep the notation . In addition, set . For example, the parameter may naturally be estimated without any penalty by CML as
(4.1)
Under sparsity and with penalties, the results of Sections 2 and 3 can be adapted to tackle this new problem and even more general ones. First of all, we need to distinguish the cases of known and/or unknown covariate distributions.
4.1 The marginal laws of the covariates are known.
Let us assume the law of is known, continuous, and denoted as , . To simplify and w.l.o.g., we can additionally impose that the joint law of is a copula.
Assumption 6.
The law of is uniform between and , for every .
If this is not the case, just replace any with . In the case of conditional copula models, the map would be replaced by . To ease notation, we will not distinguish between and . Extending (2.3), we now consider the penalized estimator
(4.2)
where is a penalty. Assume the new empirical loss is associated with a continuous function so that we can write
(4.3)
Since the margins of are uniform by assumption, the joint law of is a -dimensional copula denoted . Instead of the empirical copula related to the , we now focus on an empirical counterpart of the copula, i.e.
(4.4)
for every and . The associated “empirical copula” process becomes . Obviously, the weak behavior of is the same as the one of , where
See Appendix C in [28], for instance.
We would like to state some versions of Theorems 3.1 and 3.2 for an estimator given by (4.2). Obviously,
and the limiting law of will be deduced from the asymptotic behavior of . Broadly speaking, the two main components to state such results are an integration by parts formula and a weak convergence result for (or , equivalently). The former tool will be guaranteed by applying our Theorem A.1 in the appendix. The latter weak convergence result will be a consequence of the weak convergence of , now the empirical process associated with the sample :
(4.5)
Obviously, when our observations are i.i.d., the process weakly tends to a -Brownian bridge in . Instead of (see Theorem A.1 in the appendix), the approximated empirical copula process is here
using the notations detailed in our appendix. Note that the partial derivatives of the copula have to be considered w.r.t. the first components only, i.e. the components that correspond to pseudo-observations (and not -type components).
Now, let us state the new theoretical results related to semi-parametric inference in the presence of pseudo-observations and possible complete observations. Since they can be deduced from the previous sections and proofs, we omit the details.
Assumption 7.
The map is thrice differentiable on , for every . Moreover, and exist and are positive definite. Finally, for every , there exists a constant such that
Assumption 8.
For any and , the copula partial derivative exists and is continuous on uniformly w.r.t. . For every couple and , the second-order partial derivative exists and is continuous on . Moreover, there exists a positive constant such that
(4.6)
Assumption 9.
Theorem 4.1.
Theorem 4.2.
In addition to the assumptions of Theorem 4.1, assume that the penalty function satisfies . Moreover, assume , and . Then, the consistent estimator of Theorem 4.1 satisfies the following properties.
(i) Support recovery: , where and are related to the support of .

(ii) Asymptotic normality: assume the empirical process converges weakly in to a limit process which has continuous sample paths, almost surely. If, in addition, Assumption 13 in Appendix A (applied to instead of ) is fulfilled and , then
where , , and is a -dimensional centered random vector defined as
for , with , .
Remark 1.
In Theorems 4.1 and 4.2, it has been assumed that some -maps are regular w.r.t. (Assumption 9). Checking the latter property with may sometimes be painful, or even impossible. Besides, this task may have already been done for the same parametric copula family in the (usual) unconditional case, using the weights . Unfortunately, there is no order between and , and the randomness of the covariates matters in the more general situation (4.2). Fortunately, when some regularity properties are available uniformly w.r.t. , we can rely on instead of and checking the regularity properties becomes simpler: see Section 4.3 below.
By a careful inspection of the proof of Proposition 3.1 in [32], the weak convergence of can be easily stated under “minimal assumptions” in the i.i.d. case. Since this result is new and of interest per se, it is now precisely stated.
Theorem 4.3.
Assume the margins are continuous. If, for any , any and any , the partial derivative exists and is continuous on uniformly w.r.t. , then
when . Moreover, weakly tends to in .
4.2 The marginal laws of the covariates are unknown.
In this case, the laws are still continuous but unknown, and the covariates belong to some arbitrary subset . Introduce the random variables , , that are uniformly distributed between zero and one. Set . We can manage this situation when the loss function is a map of , instead of as previously: define pseudo-observations related to the covariates , where for every and every , using the -th re-scaled empirical c.d.f. The penalized estimator of interest is here defined as
Thus, we recover the standard situation that has been studied in Section 3. All the results of Section 3 directly apply, replacing the -dimensional copula by the -dimensional copula , replacing by , etc. The limiting law of will not be the same as in Theorem 4.2 (ii). Indeed, the process has now to be replaced by , due to the additional amount of randomness induced by the “pseudo-covariates” .
4.3 Practical considerations
Now, let us come back to the estimator given by (4.1). The regularity assumptions of Theorems 4.1 and 4.2 have to be checked on a case-by-case basis. Nonetheless, there are some situations where things become simpler. Indeed, assume
(i) the map and all its partial derivatives (up to order and w.r.t. the components of and respectively) are bounded in , denoting by a neighborhood of the true parameter in the space ;

(ii)

(iii) the latter conditions may be verified replacing the weight function by .
Note that, under (i) and (ii), the conditions of Theorems 3.1 and 3.2 are satisfied with the derivatives of the loss functions , for any fixed . Then, it can be easily checked that Theorems 4.1 and 4.2 apply. In particular, the influence of the covariates is “neutralized” through (i); moreover, noting that , (iii) is sufficient to manage the weight functions.
To illustrate, consider the case of Gumbel and Clayton copulas, for which is and respectively (under the usual parameterization of [26]). Assume that, given , the latter parameters can be rewritten for some in that satisfies Assumption 0. Moreover, assume the ranges of the maps from to are included in a compact subset of . Then, Theorems 4.1 and 4.2 apply: the associated estimators defined in (4.1) are consistent and weakly convergent. See Sections F and G of the Appendix for the technical details and for the proof that (i), (ii) and (iii) are indeed satisfied. This justifies the application of our results in the case of single-index models with Clayton/Gumbel copulas (Section 5.2 below).
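To fix ideas, a minimal Python sketch of such a conditional criterion in the Clayton case reads as follows; the link psi below is a purely hypothetical choice that keeps the copula parameter within a compact subset of its admissible range, as required above:

```python
import numpy as np

def clayton_log_density(u, v, theta):
    """Log-density of the bivariate Clayton copula, theta > 0 (elementwise)."""
    return (np.log(1.0 + theta)
            - (1.0 + theta) * (np.log(u) + np.log(v))
            - (2.0 + 1.0 / theta) * np.log(u ** (-theta) + v ** (-theta) - 1.0))

def psi(score, lower=0.1, upper=10.0):
    """Hypothetical link mapping z_i' beta to a Clayton parameter in [lower, upper]."""
    return lower + (upper - lower) / (1.0 + np.exp(-score))

def conditional_cml_loss(beta, pseudo_u, pseudo_v, covariates):
    """Average of minus the conditional Clayton log-density over the sample,
    with theta_i = psi(z_i' beta)."""
    theta = psi(covariates @ np.asarray(beta))
    return -np.mean(clayton_log_density(pseudo_u, pseudo_v, theta))
```

A penalty on the parameter vector is then added to this loss, as in (4.2).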
5 Applications
5.1 Examples
5.1.1 M-criterion for Gaussian copulas
An important application of the latter results is Maximum Likelihood Estimation with pseudo-observations, where we observe a sample from a -dimensional distribution whose parametric copula depends on some parameter . Equipped with pseudo-observations and using the same notations as above, our penalized estimator is defined as
denoting by the copula density. In the case of the Gaussian copula model, the parameter of interest is , where is the correlation matrix of such a copula. This yields
(5.1)
with , . Note that the are approximated realizations of the Gaussian random vector . The Gaussian copula exhibits discontinuous partial derivatives at the boundary of : see [32], Example 5.1. We have seen in Section 3 that is asymptotically normally distributed under suitable regularity conditions. In Section E of the Appendix, we check that all the conditions are satisfied so that Theorems 3.1 and 3.2 can be applied to Gaussian copula models and given by (5.1). Interestingly, the associated limiting law in Theorem 3.2 is simply .
The estimation of can be carried out using the least squares loss . In Section E of the Appendix, we verify that the latter loss satisfies the regularity conditions of Theorems 3.1 and 3.2. Our simulation experiments on the sparse Gaussian copula will be based on both the Gaussian loss and the least squares loss. Set , which approximates . Interestingly, our empirical loss is equal to and for the Gaussian CML and least squares cases respectively, apart from some constant terms that do not depend on . Indeed, for the least squares loss, we have
which is equal to plus some constant terms that do not depend on . In our simulation experiment for sparse Gaussian copulas, the implemented code relies intensively on and/or , quantities that can be quickly calculated through matrix manipulations, even when .
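A minimal Python sketch of the two empirical losses discussed above may be useful; the function names are ours, and additive terms that do not depend on the parameter are dropped, as in the text:

```python
import numpy as np
from scipy.stats import norm

def normal_scores_matrix(pseudo_obs):
    """Empirical second-moment matrix of the normal scores Phi^{-1}(pseudo-obs)."""
    z = norm.ppf(pseudo_obs)               # (n, d) array of normal scores
    return z.T @ z / z.shape[0]

def gaussian_cml_loss(sigma, sigma_hat):
    """Gaussian-copula CML loss, up to additive terms that do not depend on sigma."""
    _, logdet = np.linalg.slogdet(sigma)
    return 0.5 * (logdet + np.trace(np.linalg.solve(sigma, sigma_hat)))

def least_squares_loss(sigma, sigma_hat):
    """A natural least squares loss, up to additive terms that do not depend on sigma."""
    return np.sum((sigma_hat - sigma) ** 2)
```

Both losses only involve the candidate correlation matrix and the normal-scores matrix, which is why they can be evaluated quickly through matrix manipulations, even in large dimensions.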
5.1.2 M-criterion for mixtures of copulas
Mixing parametric copulas is a flexible way to build richly parameterized copulas. More precisely, a mixture-based copula is usually specified by its density , built from the family of copula densities . Each copula density depends on a vector of parameters , and are some unknown weights satisfying , , and . The parameter of interest is , with , with the notations of Example 3. Let be the dimension of any parameter . Then, with our CML criterion with pseudo-observations, an estimator of the true is defined as
Such a procedure fosters sparsity among and among : when , then the corresponding copula parameter can potentially be sparse. The latter criterion is similar to Criterion (3) in [7]; however, these authors treat the marginals as known quantities, which significantly simplifies their large sample analysis.
Assume that all parametric copula families , , satisfy the regularity conditions required to apply our Theorems 3.1 and 3.2. Unfortunately, in general, this does not imply that their mixture model will satisfy all these properties, in particular the -regularity. Therefore, it will be necessary to check them on a case-by-case basis. Nonetheless, in the particular case of mixtures of Gaussian copulas, our regularity conditions are satisfied by considering the least squares loss function, as in Section 5.1.1. Indeed, for this model, the variance-covariance matrix of is , denoting by the correlation matrix of parameters that is associated with , . Thus, the same arguments as for the Gaussian copula with the loss can easily be invoked. Alternatively, choosing the log-likelihood (CML) loss induces more difficulties. Nonetheless, it can be proved that our regularity conditions apply, at least when all the (true) weights are strictly positive. See Section E of the Appendix for details.
5.2 Simulated experiments
In this section, we assess the finite sample performance of our penalization procedure in the presence of pseudo-observations. To do so, we carry out simulated experiments for the sparse Gaussian copula model and some sparse conditional copulas. These experiments are meant to illustrate the ability of the penalization procedure to correctly identify the zero entries of the copula parameter with non-parametric marginals. First, let us briefly discuss the implementation procedure and the choice of .
5.2.1 Implementation and selection of
All the experiments are implemented in Matlab and run on a macOS Apple M1 Ultra machine with 20 cores and 128 GB of memory. A gradient descent type algorithm is implemented to solve the penalized Gaussian copula problem, a situation where closed-form gradient formulas can directly be applied. We employed the numerical optimization routine fmincon of Matlab to find the estimated parameter for sparse conditional copulas. The code for replication is available at https://github.com/Benjamin-Poignard/sparse-copula. The tuning parameter controls the model complexity and must be calibrated for each penalty function. To do so, we employ a -fold cross-validation procedure, in the same spirit as in Section 7.10 of [23]. To be specific, we divide the data into disjoint subgroups of roughly the same size, the so-called folds. Denote the indices of the observations that belong to the -th fold by , and the size of the -th fold by . The -fold cross-validation score is defined as
(5.2)
where is the non-penalized loss associated with the copula model and evaluated over the -th fold of size , which serves as the test set, and is our penalized estimator of the latter copula parameter based on the sample (the training set), using as the tuning parameter. The optimal tuning parameter is then selected according to . Then, is used to obtain the final estimate of over the whole sample. Here, the minimization of the cross-validation score is performed over , where is a grid (whose size is user-specified) of values set as , and dim denotes the number of parameters to estimate. The choice of the rate is standard in the sparse literature for M-estimators: see, e.g., Chapter 6 of [5] for the LASSO and [25] for non-convex penalization methods.
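For concreteness, a minimal Python sketch of this cross-validation scheme reads as follows; the penalized estimator and the non-penalized loss are passed as generic callables, and the weighting of the folds is one natural choice consistent with the description above:

```python
import numpy as np

def select_lambda(pseudo_obs, lambda_grid, fit_penalized, loss, n_folds=5, seed=0):
    """K-fold cross-validation for the tuning parameter.

    fit_penalized(data, lam): returns the penalized estimator computed on `data`
    loss(theta, data): non-penalized loss of `theta` evaluated on `data`
    """
    n = pseudo_obs.shape[0]
    folds = np.array_split(np.random.default_rng(seed).permutation(n), n_folds)
    scores = []
    for lam in lambda_grid:
        cv = 0.0
        for test_idx in folds:
            train_idx = np.setdiff1d(np.arange(n), test_idx)
            theta_hat = fit_penalized(pseudo_obs[train_idx], lam)
            cv += loss(theta_hat, pseudo_obs[test_idx]) * len(test_idx)
        scores.append(cv / n)
    return lambda_grid[int(np.argmin(scores))]
```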
5.2.2 Sparse Gaussian copula models
Our first application concerns the Gaussian copula. Here, sparsity is specified with respect to the variance-covariance matrix . Its diagonal elements are equal to one and its off-diagonal elements are , so that the number of distinct free parameters is . The parameter is defined as the column vector of the components located strictly below the main diagonal. Thus, can be considered as a function of : . We still denote by the true sparse support, where is sparse when some components of are independent. Our simulated experiment can be summarized as follows: we simulate a sparse true , generate the from the corresponding Gaussian copula with parameter for a given sample size , and calculate by minimizing our penalized procedure based on the pseudo-sample; this procedure is repeated for two hundred independent batches.
To be more specific, a sparse is randomly drawn for each batch as detailed below. Then, we generate a sample of vectors as follows: we draw , , from a Gaussian copula with parameter ; then we consider their rank-based transformation to obtain a non-parametric estimator of their marginal distribution, providing the pseudo-observations that enter the loss function. Then, we solve (2.3). The non-penalized loss is the Gaussian log-likelihood, as defined in (5.1). Alternatively, in (2.3), we consider the least squares criterion for which , where . In both cases, the penalized problem is solved by a gradient descent algorithm based on the updating formulas of Section 4.2 in [24], where the initial value is set as . The score for cross-validation purpose is defined in (5.2) with the Gaussian loss or the least squares loss. Concerning , we apply the SCAD, MCP and LASSO penalties. The non-convex SCAD and MCP ones require the calibration of and , respectively. We select , a “reference” value identified as optimal in [12] by cross-validated experiments. In the MCP case, the “reference” parameter is set as , following [24]. We investigate the sensitivity of these non-convex procedures with respect to their parameters . In particular, our results are also detailed with the values and . This case corresponds to “large” and , for which the corresponding penalty functions tend to the LASSO penalty.
We consider the dimensions , so that the dimension of is and , respectively. The cardinality of the true support is set arbitrarily as (resp. ) when (resp. ), so that the percentage of zero coefficients of is approximately (resp. ). As for the non-zero coefficients of , for each batch, they are generated from the uniform distribution , thus ensuring the minimum signal strength . As for the sample size, we consider . Note that, for each batch, the number of zero coefficients of remains unchanged but their locations may be different.
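A minimal Python sketch of this data-generating step is given below; the sparse correlation matrix is assumed to be given and positive definite, and its random construction is not detailed here. Since ranks are invariant to increasing marginal transformations, rank-transforming the Gaussian draws directly yields the same pseudo-observations as a Gaussian copula sample with arbitrary continuous margins.

```python
import numpy as np

def sample_gaussian_copula_pseudo_obs(sigma, n, seed=None):
    """Draw n observations from the Gaussian copula with correlation matrix sigma
    and return the rank-based pseudo-observations entering the loss."""
    rng = np.random.default_rng(seed)
    x = rng.multivariate_normal(np.zeros(sigma.shape[0]), sigma, size=n)
    ranks = np.argsort(np.argsort(x, axis=0), axis=0) + 1
    return ranks / (n + 1.0)
```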
We report the variable selection performance through the percentage of zero coefficients in that are correctly estimated, denoted by C1, and the percentage of non-zero coefficients in correctly identified as such, denoted by C2. The mean squared error (MSE), defined as , is reported as an estimation accuracy measure. These metrics are averaged over the two hundred batches and reported in Table 1. To clarify the reading of the figures, the first entry in the column “Gaussian” represents the percentage of the true zero coefficients correctly identified by the estimator deduced from the Gaussian loss function, with SCAD penalization when , for a sample size , a dimension , and averaged over two hundred batches; in the last MSE line, the value in the column “Least Squares” represents the MSE of the estimator deduced from the least squares loss function with MCP penalization when , for a sample size and , and averaged over two hundred batches. Our results highlight that, for a given loss, the SCAD/MCP-based penalization procedures provide better results in terms of support recovery than the LASSO for our reference values of . Furthermore, the SCAD-based and, even more so, the MCP-based estimators with the Gaussian loss provide better recovery performances than those based on the least squares loss. This is particularly true with the indicator C1, i.e., for the sake of identifying the zero coefficients. Interestingly, large values worsen the recovery ability. Indeed, such large values result in a LASSO-type behavior, which is biased, so that small will tend to be selected. Moreover, for any given penalty function, the Gaussian loss-based MSEs are always lower than the least squares loss-based MSEs, which suggests that the estimator deduced from the former loss is more efficient than the estimator obtained from the latter loss. Furthermore, larger values worsen the MSE performances: when , for a given loss function, the MSEs of SCAD/MCP are close to those of the LASSO. In Section I of the Appendix, we investigate further the sensitivity of the performance of the SCAD and MCP-based estimators with respect to and .
Table 1: Variable selection (C1, C2) and estimation accuracy (MSE) results for the sparse Gaussian copula experiment, under the Gaussian and least squares losses, averaged over the two hundred batches.
5.2.3 Conditional copulas
Our next application is dedicated to the sparse estimation of conditional copula models with known link functions and known marginal laws of the covariates: the experiment is an application of the penalized problem detailed in Subsection 4.1. We specify the law of , given some covariates , as a parametric copula with parameter , where and ( and , with our notations of Section 4). We assume the marginal distribution of , , is unknown and does not depend on . We focus on the Clayton and Gumbel copulas, and restrict ourselves to .
In the same vein as in the previous application to sparse Gaussian copulas, for each sample size , we draw two hundred independent batches of vectors as follows: in each batch, we simulate a sparse true ; then for every , we draw the covariates , , from a uniform distribution , independently of each other; then, for a given , we sample from the Clayton/Gumbel copula with parameter ; we consider their rank-based transformation to obtain the pseudo-observations , which are plugged into the penalized criterion. Here, the copula parameters are specified in terms of Kendall’s tau: for each , define the Kendall’s tau . Using the mappings of, e.g., [26], set for the Clayton copula and for the Gumbel copula. We consider the dimension , and set the cardinality of the true support as , so that approximately of the entries of are zero coefficients. For each batch, the non-zero entries are simulated from the uniform distribution , which ensures that the following copula parameter constraints are satisfied: (resp. ) for the Clayton (resp. Gumbel) copula. For each batch, the locations of the zero/non-zero entries in are arbitrary, but the size of remains fixed. Finally, we consider the sample size . For a given batch, our criterion becomes:
where is the log-density of the Clayton/Gumbel copula with parameter . As for the penalty, we consider the SCAD, MCP and LASSO functions to estimate and . Moreover, we choose and . To assess the finite sample performance of the penalization methods, as in Section 5.2.2, we report in Table 2 the percentage of zero coefficients correctly estimated (C1), the percentage of non-zero coefficients correctly identified (C2) and the mean squared error (MSE), averaged over the two hundred batches. For both models and small/large sample sizes, our results emphasize the poor performance of the LASSO penalization in terms of support recovery (correct identification of the zero coefficients). As for non-convex penalization, the trade-off between C1 and C2 is more pronounced than in the application to the Gaussian copula: small provide better C1 to the detriment of C2, which results in larger MSE since C2 worsens in that case. The MSE results are significantly improved for large . Mid-range values provide an optimal trade-off in terms of the combined C1, C2 and MSE metrics.
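For completeness, recall that the Kendall's tau parameterizations used in this experiment are the standard mappings of [26], namely
\[
\theta_{\mathrm{Clayton}}(\tau)=\frac{2\tau}{1-\tau}, \qquad \theta_{\mathrm{Gumbel}}(\tau)=\frac{1}{1-\tau}, \qquad \tau\in(0,1),
\]
so that positive Kendall's taus indeed yield admissible Clayton and Gumbel parameters, consistently with the constraints mentioned above.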
Table 2: Variable selection (C1, C2) and estimation accuracy (MSE) results for the conditional Gumbel and Clayton copula experiments, under the SCAD, MCP and LASSO penalties, averaged over the two hundred batches.
6 Conclusion
We studied the asymptotic properties of sparse M-estimators based on pseudo-observations, where the marginal distributions entering the loss function are treated as unknown, a common situation in copula inference. Our framework includes, among others, semi-parametric copula models and the CML inference method. We assume sparsity among the coefficients of the true copula parameter and apply a penalty function to recover the sparse underlying support. Our method is based on penalized M-estimation and accommodates data-dependent penalties, such as the LASSO, SCAD and MCP. We establish the consistency of the sparse M-estimator together with the oracle property for the SCAD and MCP cases, for both fixed and diverging dimensions of the vector of parameters. Because of the presence of non-parametric estimators of the marginal c.d.f.s and of potentially unbounded loss functions, it is difficult to exhibit simple regularity conditions and to derive the oracle property. This would make the large sample analysis intricate when and simultaneously diverge. We leave this as a future research direction. Among potential applications of our methodology, the (brute force) estimation of vine models ([10]) under sparsity seems to be particularly relevant. Nonetheless, checking our regularity assumptions for such highly nonlinear models would surely be challenging. In addition, it would be interesting to prove similar theoretical results in the case of conditional copulas whose conditional margins would depend on covariates.
Acknowledgements
J.D. Fermanian was supported by the labex Ecodec (reference project ANR-11-LABEX-0047) and B. Poignard by the Japanese Society for the Promotion of Science (Grant 22K13377).
References
- [1] Abadir, K.M. and J.R. Magnus (2005). Matrix algebra. Cambridge University Press.
- [2] Abegaz, F., Gijbels, I. and N. Veraverbeke (2012). Semiparametric estimation of conditional copulas. Journal of Multivariate Analysis, 110: 43–73.
- [3] Abramowitz, M. and I.A. Stegun (1972). Handbook of mathematical functions with formulas, graphs, and mathematical tables, Vol. 55, Reprint of the 1972 ed. A Wiley-Interscience Publication, John Wiley & Sons, New York.
- [4] Aistleitner, C. and J. Dick (2015). Functions of bounded variation, signed measures, and a general Koksma-Hlawka inequality. Acta Arithmetica, 167:143–171.
- [5] Bühlmann, P. and S. van de Geer (2011). Statistics for high-dimensional data: Methods, theory and applications. 1st ed. Springer Series in Statistics. Springer, Berlin.
- [6] Berghaus, B. and Bücher, A. and S. Volgushev (2017). Weak convergence of the empirical copula process with respect to weighted metrics. Bernoulli, 23(1):743–772.
- [7] Cai, Z. and X. Wang (2014). Selection of mixed copula model via penalized likelihood. Journal of the American Statistical Association, 109(506):788–801.
- [8] Chen, X. and Y. Fan (2005). Pseudo-likelihood ratio tests for semiparametric multivariate copula model selection. Canadian Journal of Statistics, 33(3):389–414.
- [9] Chen, X. and Y. Fan (2006). Estimation and model selection of semiparametric copula-based multivariate dynamic models under copula misspecification. Journal of Econometrics, 135(1-2):125–154.
- [10] Czado, C. (2019). Analyzing dependent data with vine copulas. 1st ed. Lecture Notes in Statistics. Springer, Cham.
- [11] Dehling, H., Durieu, O. and M. Tusche (2014). Approximating class approach for empirical processes of dependent sequences indexed by functions. Bernoulli, 20(3):1372–1403.
- [12] Fan, J. and R. Li (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360.
- [13] Fan, J. and H. Peng (2004). Nonconcave penalized likelihood with a diverging number of parameters. The Annals of Statistics, 32(3):928–961.
- [14] Fermanian, J.-D. (1998). Contributions à l’analyse nonparamétrique des fonctions de hasard sur données multivariées et censurées. PhD thesis, Paris 6.
- [15] Fermanian, J.-D., Radulović, D. and M. Wegkamp (2002). Weak convergence of empirical copula processes, Center for Research in Economics and Statistics, Working Paper, No. 2002-06.
- [16] Fermanian, J.-D. and D. Radulović and M. Wegkamp (2004). Weak convergence of empirical copula processes. Bernoulli, 10(5):847–860.
- [17] Fermanian, J.-D. and M. Wegkamp (2012). Time-dependent copulas. Journal of Multivariate Analysis, 110:19–29.
- [18] Fermanian, J.D. and O. Lopez (2018). Single-index copulas. Journal of Multivariate Analysis, 165:27–55.
- [19] Genest, C. and Ghoudi, K. and L.-P. Rivest (1995). A semi-parametric estimation procedure of dependence parameters in multivariate families of distributions. Biometrika, 82(3):543–552.
- [20] Ghoudi, K. and B. Rémillard (2004). Empirical processes based on pseudo-observations II: the multivariate case. Fields Institute Communications, 44:381–406.
- [21] Gijbels, I., Veraverbeke, N. and O. Marel (2011). Conditional copulas, association measures and their applications. Computational Statistics & Data Analysis, 55(5), 1919–1932.
- [22] Hamori, S., Motegi, K. and Z. Zhang (2019). Calibration estimation of semiparametric copula models with data missing at random. Journal of Multivariate Analysis, 173:85–109.
- [23] Hastie, T., Tibshirani, R. and J. Friedman (2009). The elements of statistical learning: Data mining, inference, and prediction. 2nd ed. Springer Series in Statistics. Springer, New York.
- [24] Loh, P.L and M.J. Wainwright (2015). Regularized M-estimators with non-convexity: statistical and algorithmic theory for local optima. Journal of Machine Learning Research, 16:559-616.
- [25] Loh, P.L and M.J. Wainwright (2017). Support recovery without incoherence: A case for nonconvex regularization. The Annals of Statistics, 45(6):2455–2482.
- [26] Nelsen, R.B. (2006). An introduction to copulas. 2nd ed. Springer Series in Statistics. Springer, New York.
- [27] Portnoy, S. (1985). Asymptotic behavior of M-estimators of p regression parameters when p²/n is large; II. Normal approximation. The Annals of Statistics, 13(4):1403–1417.
- [28] Radulović, D. and Wegkamp, M. and Y. Zhao (2017). Weak convergence of empirical copula processes indexed by functions. Bernoulli, 23(4B):3346–3384.
- [29] Rémillard, B. and O. Scaillet (2009). Testing for equality between two copulas. Journal of Multivariate Analysis, 100(3):377–386.
- [30] Ruymgaart, F.H. (1974). Asymptotic normality of nonparametric tests for independence. The Annals of Statistics, 2(5):892–910.
- [31] Ruymgaart, F.H. and Shorack, G.R. and W.R. Van Zwet (1972). Asymptotic normality of nonparametric tests for independence. The Annals of Mathematical Statistics, 43(4):1122–1135.
- [32] Segers, J. (2012). Asymptotics of empirical copula processes under non-restrictive smoothness assumptions. Bernoulli, 18(3):764–782.
- [33] Shih, J.H. and T.A. Louis (1995). Inferences on the association parameter in copula models for bivariate survival data. Biometrics, 51(4):1384–1399.
- [34] Tsukahara, H. (2005). Semi-parametric estimation in copula models. The Canadian Journal of Statistics, 33(3):357–375.
- [35] van der Vaart, A.W. and J. Wellner (1996). Weak convergence and empirical processes: with applications to statistics. 1st ed. Springer Series in Statistics. Springer, New York.
- [36] van der Vaart, A.W. and J.A. Wellner (2007). Empirical processes indexed by estimated functions. Asymptotics: Particles, Processes and Inverse Problems, 234–252, IMS Lecture Notes Monograph Series, 55.
- [37] Yang, B. and Hafner, C.M. and Liu, G. and W. Long (2021). Semi-parametric estimation and variable selection for single-index copula models. Journal of Applied Econometrics, 36(7):962–988.
- [38] Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894-942.
- [39] Zhang, S. and O. Okhrin and Q.M. Zhou and P.X.-K. Song (2016). Goodness-of-fit test for specification of semiparametric copula dependence models. Journal of Econometrics, 193(1):215–233.
- [40] Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429.
Appendix A Multivariate rank statistics and empirical copula processes indexed by functions
In this section, we prove some theoretical results about the asymptotic behavior of multivariate rank statistics
for a class of maps that will be of “locally” bounded variation and sufficiently regular. Such maps will be allowed to diverge when some of their arguments tend to zero or one, i.e. when their arguments are close to the boundaries of . We will prove the asymptotic normality of , extending Theorem 3.3 of [6] to any dimension . Moreover, we will state the weak convergence seen as an empirical process indexed by , for a convenient family of maps .
To be specific, consider a family of measurable maps . As in [6] and for any , define the weight function
When , check that . Moreover, if , then . To lighten notation and when there is no ambiguity, the map will simply be denoted as hereafter. For technical reasons, we will need the map for every .
Recall the process and for any . Therefore, may be considered as a process defined on . The maps may potentially be unbounded, particularly when their arguments tend to the boundaries of the hypercube . This is a common situation when is chosen as the log-density of many copula families. Moreover, we will need to apply an integration by parts trick that has proved its usefulness in several copula-related papers, particularly [28] and [6]. To this end, we introduce the following class of maps.
Definition. A map is of locally bounded Hardy–Krause variation, a property denoted by , if, for any sequence and , , , , the restriction of to is of bounded Hardy–Krause variation.
The concept of Hardy-Krause variation has become a standard extension of the usual concept of bounded variation for multivariate maps: see the Supplementary Material in [6] or Section 2 and Appendix A in [28], and the references therein.
Denote the box and its complement in , . Moreover, any sub-vector whose components are all equal to (resp. ) will be denoted as (resp. ). For any and a measurable map , the integral can be conveniently defined: see [6], Section 3.1 and its Supplementary Material.
In terms of notation, we use the same rules as [28], Section 1.1, to manage sub-vectors and the way of concatenating them. More precisely, for , denotes the cardinality of , and the unary minus refers to the complement with respect to so that . For , denotes a -tuple of real numbers whose elements are ; the vector typically belongs to . Now let and be two vectors in . The concatenation symbol “” is defined as follows: the vector denotes the point such that for and for . The vector is well defined for and when , even if and remain unspecified. We use this concatenation symbol to glue together more than two sets of components: let with mutually disjoint sets such that . Then is a well-defined vector in . Finally, for a function and a constant vector , the function denotes a lower-dimensional projection of onto . The integral of a function w.r.t. the measure induced by the latter map will be denoted as .
-
Definition.
A family of maps is said to be regular with respect to the weight function for some (or -regular, for short) if
-
(i)
every is and right-continuous;
-
(ii)
the map is bounded on ,
(A.1) and, for any partition of the set of indices with ,
(A.2) Moreover, the latter sequence tends to zero when .
Remark 1.
When is a singleton, one simply says that the map is -regular. Note that, if , then except when and simultaneously. In the latter case, .
Remark 2.
Consider a family of maps from to . Assume there exist subsets , s.t. every member can be written
for some maps , . Define for every . If every , is regular w.r.t. the weight function , then it is easy to see that is regular w.r.t. the weight function . This property may be invoked to prove the regularity of the Gaussian copula family, for instance (see Section E in the Appendix).
Remark 3.
Any family of maps defined on may formally be seen as a family of maps defined on a larger dimension, say , : every defines a map on by setting , , . It can be easily checked that, if is regular then this is still the case for .
Besides the regularity conditions on the family of maps , we will need the (standard) empirical process to be well-behaved. To this end, we recall the so-called Conditions 4.1, 4.2 and 4.3 in [6].
Assumption 10.
There exists such that, for all and all sequences , we have
Assumption 11.
There exists and such that, for any , any and any , we have
Assumption 12.
The empirical process converges weakly in to some limit process which has continuous sample paths, almost surely.
As pointed out in [6], Assumptions 10–12 are satisfied for i.i.d. data with , and . In the latter case, the limiting process is a -Brownian bridge, such that for any and in , with the usual notation . More generally, if the process is strongly stationary and geometrically -mixing, then Assumptions 10–12 are still satisfied, with the same choice , and (Proposition 4.4 in [6]). In the latter case, the covariance of the limiting process is more complex: .
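For the record, in the i.i.d. case the limiting $C$-Brownian bridge, say $\mathbb{B}_C$, has the usual covariance function
\[
\operatorname{Cov}\big\{\mathbb{B}_C(u), \mathbb{B}_C(v)\big\} = C(u \wedge v) - C(u)\,C(v), \qquad u, v \in [0,1]^d,
\]
where $u \wedge v$ denotes the component-wise minimum (a standard formula, stated here with our own notations).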
Assumption 13.
For any , , any that belongs to a regular family and for any continuous map , the sequence is convergent when . Its limit is denoted as , i.e. it is given by an integral w.r.t. a Borel measure on denoted as .
The latter regularity condition is required to get the weak convergence of our main statistic in Theorem A.1. Note that the map is not defined when one component of is one. Therefore, the way we write the limits in Assumption 13 is a slight abuse of notation. Typically, will be defined as the limit when tends to when such a limit exists. In other standard situations, there exists a measurable map such that . If it is possible to extend by continuity the map when tends to , simply set . But other, more complex situations may arise.
To get the weak convergence of the process indexed by the maps in , we will need to strengthen the latter assumption 13, so that it becomes true uniformly over .
Assumption 14.
For any and any continuous map ,
when .
Theorem A.1.
(i) Assume the assumptions 2,10 and 11 are satisfied and consider a family of maps that is -regular, for some . Then, for any , we have
(A.3)
where , and .
(ii) In addition, assume the conditions 12 and 13 apply. Then, for any function , the sequence of random variables tends in law to the centered Gaussian r.v.
(A.4)
where for any .
(iii) Under the latter assumptions of (i) and (ii), in addition to Assumption 14, weakly tends in to a Gaussian process.
Points (i) and (ii) of the latter theorem yield a generalization of Theorem 3.3 in [6] for any arbitrarily chosen dimension , and uniformly over a class of functions. Note that it is always possible to set , in which case we have proved the weak convergence of a single multivariate rank statistic. The proof is based on the integration by parts formula in [28]. Note that, in dimension , for any . Thus, in the bivariate case, the limiting law of is simply the law of , as stated in Theorem 3.3 in [6]. Nonetheless, this is no longer true in dimension , explaining the more complex limiting laws in our Theorem A.1. Finally, the sum in (A.4) can be restricted to the subsets such that . Indeed, when is a singleton, then is zero a.s.
Point (iii) of Theorem A.1 extends Theorem 5 in [28]. The latter was restricted to right-continuous maps of bounded Hardy–Krause variation and defined on the whole hypercube . For any of these maps , there exists a finite signed Borel measure bounded on such that for every (Theorem 3 in [4]). In particular, they are bounded on , an excessively demanding assumption in many cases. Indeed, for the inference of copulas, many families of relevant maps contain elements that are not of bounded variation or cannot be defined on as a whole, as pointed out by several authors, following [32]; see Section 3.4 in [28] too. This is in particular the case with the Canonical Maximum Likelihood method and Gaussian copulas. In such a case, and is the density of a Gaussian copula with correlation parameter . Therefore, we have preferred the less restrictive approach of [6], which can tackle unbounded maps (in particular copula log-densities), through the concept of locally bounded Hardy–Krause variation.
We deduce from Theorem A.1 a uniform law of large numbers too.
Corollary A.2.
Remark 4.
In the literature, some ULLNs for copula models have already been used, but without specifying the corresponding rates of convergence to zero. In semi-parametric models, some authors invoked properties of bracketing numbers (Lemma 1 in [8]; Th. 17 in the working paper version of [15]): if, for every , the bracketing number of (denoted in the literature) is finite, then tends to zero a.s.
Appendix B Asymptotic variance of
Here, we provide a plug-in estimator of the variance-covariance matrix of the vector that appeared in Theorem 3.2. The latter vector is centered Gaussian and, for every ,
In the latter formula, we will replace with . Denote the covariance function of the process as , i.e. , for every and in . Then, the covariance function of the process is
In the latter formula, every partial derivative of the copula could be empirically approximated, as in [29] for instance. Moreover, assume we have found an estimator of the map , denoted . With i.i.d. data, is obviously approximated by . This would yield an estimator of , for every , that can be plugged into (B). Taking all the pieces together yields an estimator of .
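To illustrate the first plug-in step, here is a minimal numerical sketch, assuming i.i.d. data. The finite-difference scheme and the bandwidth $1/\sqrt{n}$ below are common choices in this literature, and the function names are ours, not those of [29].

import numpy as np

def empirical_copula(U, u):
    # Empirical copula of the pseudo-observations U (an n x d array) at the point u (length d).
    return np.mean(np.all(U <= u, axis=1))

def copula_partial_derivative(U, u, j, h=None):
    # Central finite-difference estimate of the j-th first-order partial derivative
    # of the copula at u, with the difference clipped to the unit interval.
    n = U.shape[0]
    if h is None:
        h = 1.0 / np.sqrt(n)
    up, lo = u.copy(), u.copy()
    up[j] = min(u[j] + h, 1.0)
    lo[j] = max(u[j] - h, 0.0)
    return (empirical_copula(U, up) - empirical_copula(U, lo)) / (up[j] - lo[j])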
Appendix C Proofs of Theorem A.1 and Corollary A.2
C.1 Proof of Theorem A.1
To state (i), we follow the same lines as in the proof of Theorem 3.3 in [6]. For any , define . Note that, when , , but this property does not extend to larger dimensions. Any remainder term that tends to zero in probability uniformly w.r.t. will be denoted as . Note that, for any ,
Let us prove that . Indeed, a vector belongs to iff one of its components is smaller than or is strictly larger than . Thus, let us decompose as the disjoint union of “boxes” on such as
where and is a partition of . Note that, for any , we have
Since there exists a constant such that for every by -regularity, we have for any
that tends to zero with uniformly w.r.t. . Therefore, we have proven that .
Moreover, invoking the integration by parts formula (40) in [28], we get
In , the ‘+’ symbol within denotes the disjoint union. In other words, the summation is taken over all partitions of into three disjoint subsets. Moreover, we have used the usual notation that is the sum of component-wise differentials of over all the vertices of the hypercube . For instance, in dimension two,
By assumptions 2, 10 and 11, Theorem 4.5 in [6] holds. Then, the term can be rewritten as
(C.1)
Moreover, due to Lemma 4.10 in [6] and their theorem 4.5 again, this yields
The term is a finite sum of integrals as
where is not empty and is not equal to the whole set . By the first part of Theorem 4.5 and Lemma 4.10 in [6], we obtain
If , any argument of belongs to the subset . In such a case, for any , we have
The two latter terms tend to zero with , for any sufficiently small . Indeed, the first probability tends to zero with by Lemma 4.9 in [6], and the second one may be arbitrarily small by -regularity, choosing a sufficiently small . Therefore, all the terms of for which are negligible. Moreover, if , then . By the stochastic equicontinuity of the process (Lemma 4.10 in [6]), we have
invoking again (A.2), when . Re-indexing the subsets , check that yields the sum in (A.3) plus a negligible term.
The remaining term is a sum of terms. By -regularity, all these terms are smaller than a constant times , evaluated at a -vector whose components are or . This implies these terms are equal to with the same arguments (Th. 4.5 in [6]), plus a negligible term. Due to Lemma 4.9 in [6], all of the latter terms tend to zero in probability, and then . Therefore, we have proven (A.3) and point (i).
(ii) If, in addition, the process is weakly convergent, then is weakly convergent to in , by Theorem 2.2 in [6]. For a given , define the sequence of maps as
If a sequence of maps tends to in and is continuous on , let us prove that , where
The difference is a sum of differences between integrals that come from (C.1). The first one is managed as
that tends to zero by (A.1) and Assumption 13. The other terms of are indexed by a subset , and can be bounded similarly:
Therefore, apply the extended continuous mapping theorem (Theorem 1.11.1 in [35]) to obtain the weak convergence of (which is equal to in our case) towards in . Note that almost every trajectory of on is continuous. Since , this proves the announced weak convergence result (ii).
(iii) Our arguments are close to those invoked to prove Theorem 1 in [28]. Our point (ii) above yields the finite-dimensional convergence of in . For any (possibly random) map and any , set
Moreover, define
We have proved above (recall Equation (A.3) and (C.1)) that, for any ,
Therefore, we expect that the weak limit of in will be . It is sufficient to prove that weakly tends to the latter process.
To this end, we slightly adapt our notations to deal with functionals defined on . The weak limit of on is the Gaussian process , which is tight (Ex. 1.5.10 in [35]). Moreover, define the map as , where denotes the set of continuous maps on , endowed with the sup-norm. Similarly, define as . We now have to prove that weakly tends to on .
First, we prove that is continuous. Let be a sequence of maps in that tends to in the latter space. We want to prove that tends to in . The first term of that comes from the definition of is easily managed:
that tends to zero because of (A.1). The other terms are tackled similarly:
that tends to zero. We have used the fact that, due to (A.2) and Assumption 14, we have
As a consequence, tends to in . Therefore, by continuity, the expected weak limit of is tight on .
Then, the weak convergence of towards in is obtained if we prove that the bounded Lipschitz distance between the two processes tends to zero with (Th. 1.12.4 in [35]), i.e. if
with the supremum taken over all the uniformly bounded and Lipschitz maps , and for all . By the triangle inequality, we have
To deal with , note that
for some positive constant , due to (A.2). This proves that is Lipschitz, with a Lipschitz constant that does not depend on nor . Therefore, we get
that tends to zero because of the weak convergence of to in (Th. 4.5 in [6]).
To show that tends to zero, note that, for every , we have
Clearly, tends to zero, invoking (A.2) and the continuity of on . By Assumption 14, tends to zero a.s. Thus, we have proved that
tends to zero for almost every trajectory and when . Considering the bounded Lipschitz maps as in the definition of , deduce
Since for every , the sequence is bounded by two. And we have proved above that tends to zero a.s. Thus, the dominated convergence theorem implies that when , i.e. when .
To conclude, we have proved that and . Since the limit is tight, we get the weak convergence of to in , i.e. the weak convergence of indexed by , i.e. in , as announced.
C.2 Proof of Corollary A.2
By inspecting the proof of Theorem A.1 (i), it appears that it is sufficient to prove
for any , . For any constant , we have
(C.4)
Check that , for any sequence of positive numbers , and with . This yields
Note that when , and set . Thus, for sufficiently large, and we get
Since , then too because all partial derivatives , , belong to (Th. 2.2.7 in [26]). Therefore, the first term on the r.h.s. of (C.4) tends to zero. Finally, the second term on the r.h.s. of (C.4) may be arbitrarily small with a large , due to (A.2), proving the result.
Appendix D Additional proofs
D.1 Proof of Theorem 3.1
We denote and we would like to prove that, for any , there exists such that, for any , we have
(D.1)
Now, following the reasoning of Fan and Li (2001), Theorem 1, and denoting the penalized loss , we have
(D.2)
and we can always impose , our choice hereafter. If the r.h.s. of (D.2) is smaller than , there is a local minimum in the ball with a probability larger than . In other words, (D.1) is satisfied and . Now, by a Taylor expansion of the penalized loss function around the true parameter, we get
for some parameter such that . Note that we have used and the positiveness of the penalty. Thus, it is sufficient to prove there exists such that
(D.3)
Let us deal with the non-penalized quantities. First, for any , we have
due to the first-order conditions. For any , we have assumed that the family of maps is -regular. Moreover, Assumption 10 and the compactness of imply . Then, we can apply Corollary A.2, which yields
By Cauchy-Schwarz, we deduce
The empirical Hessian matrix can be expanded as
We have assumed that the maps , , belong to a -regular family. Therefore, applying Corollary A.2, we get
As a consequence, since , this yields
that is positive by assumption, when is sufficiently large and for a probability arbitrarily close to one. By similar arguments with the family of maps , we get
Let us now treat the penalty part as in [13] (proof of Theorem 1, equations (5.5) and (5.6)). By using exactly the same method, we obtain
and the latter term is dominated by , allowing to be large enough. Thus, for such , we have
Since the latter dominant term is larger than for large enough, where denotes the smallest eigenvalue of , we deduce (D.3) and finally .
D.2 Proof of Theorem 3.2
Point (i): The proof is performed in the same spirit as in Fan and Li (2001). Consider an estimator of such that , as in Theorem 3.1, with . Using our notations for vector concatenation, as detailed in Appendix A, the support recovery property holds asymptotically if
(D.4)
for any constant with a probability that tends to one with . Set . To prove (D.4), it is sufficient to show that, for any such that , we have with a probability that tends to one
(D.5)
for any . By a Taylor expansion of the partial derivative of the penalized loss around , we obtain
for some that satisfies . The family of maps is -regular and . As a consequence, by Corollary A.2,
As for the second order term, the maps are -regular by assumption, for any . Then, by Corollary A.2, we deduce
Finally, for the remaining third order term, since the family of maps is -regular by assumption, Corollary A.2 yields
that is bounded in probability by Assumption 1. Hence putting the pieces together and using the Cauchy-Schwarz inequality, we get
Under the assumptions that , and tends to infinity, the sign of the derivative is determined by the sign of . As a consequence, (D.5) is satisfied, implying our assertion (i). Indeed, all zero components of will be estimated as zero with high probability, and its non-zero components will be consistently estimated. Then, the probability that all the latter estimates are non-zero tends to one when .
Point (ii): We have proved that . Therefore, for any , the event in occurs with a probability larger than for large enough. Since we want to state a convergence in law result, we can consider that the latter event is always satisfied. By a Taylor expansion around the true parameter, the orthogonality conditions yield
where and , which simplifies into a diagonal matrix since the penalty function is coordinate-separable. Obviously, is a random parameter such that . Rearranging the terms and multiplying by , we deduce
where . First, under the -regularity conditions and by Corollary A.2, we have . Second, the third order term can be managed as follows:
is a vector of size whose -th component is
for any . Invoking Corollary A.2 and Assumption 1, for any . Then, since , we obtain
Regarding the gradient in , since belongs to our -regular family for any , apply Theorem A.1 (ii): for any , we have
We then conclude by Slutsky’s Theorem to deduce the asymptotic distribution
where is the -dimensional random vector defined in (D.2).
Remark 5.
It would be possible to state Theorem 3.2-(ii) under other sets of regularity conditions. Indeed, the latter result mainly requires a CLT (given by our Theorem A.1 (ii)) and a ULLN (given by our Corollary A.2). The latter one may be obtained with a condition on the bracketing numbers associated with the family of maps for some (small) : see Remark 4 at the end of the main text. This would provide an alternative way of managing the term . To deal with , a CLT for can be obtained under some regularity conditions on and its derivative w.r.t. , as introduced in [30] and [31]. See [34], Prop. 3 and Assumption (A.1), to be more specific. By contrast, the proof of Theorem 3.2-(i) (support recovery) requires Corollary A.2 and not only a usual ULLN. Indeed, an upper bound for the rate of convergence to zero of is here required to manage the penalty functions and sparsity.
Appendix E Regularity conditions for Gaussian copulas
Let us verify that the Gaussian copula family fulfills all regularity conditions that are required to apply Theorems 3.1 and 3.2. Here, the loss function is where . Since the true underlying copula is Gaussian, the random vector in is Gaussian . The vector of parameters is , whose “true value” will be . Note that , as a function of , is a quadratic form w.r.t. , and that
(E.1)
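For reference, writing $\zeta(u) = \big(\Phi^{-1}(u_1),\dots,\Phi^{-1}(u_d)\big)^\top$ and denoting by $\Sigma$ the correlation matrix, the Gaussian copula log-density underlying this loss has the textbook form
\[
\log c_\Sigma(u) = -\tfrac{1}{2}\log\det\Sigma \;-\; \tfrac{1}{2}\,\zeta(u)^\top\big(\Sigma^{-1} - I_d\big)\,\zeta(u), \qquad u\in(0,1)^d,
\]
which is indeed a quadratic form with respect to $\zeta(u)$. This display uses our own notations and may differ slightly from the conventions of (E.1).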
Assumption 1: when is invertible, and are positive definite. This is exactly the same situation as the Hessian matrix associated with the (usual) MLE for the centered Gaussian random vector . When belongs to a small neighborhood of , the associated correlation matrix is still invertible by continuity. Then, the third order partial derivatives of the loss are uniformly bounded in expectation in such a neighborhood, due to (E.1).
The first part of Assumption 2 is satisfied for Gaussian copulas, as noticed in [32], Example 5.1. Checking (3.1) is more complex. This is the topic of Lemma E.1 below.
Let us verify Assumption 3 for the Gaussian loss defined as , with : every member of the family has continuous and integrable partial derivatives on and then is . Moreover, any can be written as a quadratic form w.r.t. :
Note that, for every , for any and . Thus, for every , we clearly have
for every , proving the first condition for the regularity of . To check (A.1), note that is zero when . Otherwise, when , . Thus, to apply our theoretical results, it is sufficient to check Assumption 3 by replacing by , as if the loss and its derivatives were some functions of only, instead of . See the remark after the definition of the regularity too. Since (c.f. (E.2) in the proof of Lemma E.1), we obtain
for some constant , yielding (A.1). Note that we have used the inequality .
Now consider (A.2). We restrict ourselves to the case , because we are interested in the situation for which . Again, when the cardinality of is larger than two, the latter condition is satisfied because . When is a singleton, say , the absolute value of
is smaller than a constant times
We have used the identity . Note that when . Using , this yields
for some constant . The latter r.h.s. tends to zero, because ([3], 26.2.23). This reasoning can be led for every map . This proves (A.2). Importantly, we have proved that all integrals as tend to zero with . As a consequence, the limiting law in Theorem 3.2 is simply .
Assumption 13 is a direct consequence of the dominated convergence theorem and the upper bounds that have been exhibited just before.
Remark 6.
Alternatively to the Gaussian loss function, consider the least squares loss
Then, it is easy to check our regularity assumptions as above. In particular, every member of can be written as a quadratic form w.r.t. , as for the previous log-likelihood loss. Then the same techniques apply.
Remark 7.
For completeness, let us provide the gradient of the Gaussian and least squares losses w.r.t. the parameter . By the identification of the gradient following [1], the derivatives of the Gaussian and least squares functions are, respectively, given by
The Gaussian and least squares based Hessian matrices respectively follow as
and .
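As a point of comparison, by standard matrix calculus for the multivariate normal likelihood (and ignoring the symmetrization conventions, which depend on the chosen parameterization of $\Sigma$), the gradient of the Gaussian loss $-\log c_\Sigma(u)$ with respect to $\Sigma$ can be written
\[
\frac{\partial}{\partial \Sigma}\big\{-\log c_\Sigma(u)\big\} \;=\; \tfrac{1}{2}\Big(\Sigma^{-1} - \Sigma^{-1}\zeta(u)\,\zeta(u)^\top\Sigma^{-1}\Big),
\]
with $\zeta(u)$ as above. This is our own display, given for illustration only, and does not reproduce the exact expressions obtained through [1].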
Lemma E.1.
Let be a -dimensional Gaussian copula. Then, there exists a constant s.t., for every in and every ,
Obviously, the constant depends on the dimension and on the correlation matrix . We prove the latter property below. In dimension two, it had been stated in [32] and [6], but without the corresponding (non-trivial) technical details.
Proof.
First assume that and . Note that the random vector is Gaussian . The off-diagonal coefficient of is . Moreover, , where is the bivariate cdf of , and . With obvious notations, simple calculations provide
Since , we deduce
But there exists a constant such that
(E.2)
Indeed, when tends to zero, ([3], 26.2.23). Then, the map is bounded. As a consequence, . Similarly, by symmetry, , proving the announced inequality for crossed partial derivatives in the bivariate case.
Second, assume that and . With the same notations as above, simple calculations provide
Note that
implying
(E.3)
Indeed, it is easy to check that the map is bounded, because (resp. ) when (resp. ). Moreover,
Then, after an integration w.r.t. , we get
for some constant . This yields
(E.4)
Globally, (E.3) and (E.4) imply the announced result in this case.
Third, assume and choose the indices , w.l.o.g. By the definition of partial derivatives, we have
applying the previously proved result in dimension . Thus, the latter inequality extends to any dimension , at least concerning crossed partial derivatives.
Concerning the second-order derivative of w.r.t. (to fix the ideas) and when , we can mimic the bivariate case. For any , set with , . We obtain
Note that, with for every , we have
(E.5)
The latter upper bound does not depend on . Therefore, we get
(E.6)
for some constant . Moreover, for some constants and that depend on only, we have
(E.7)
It can be proved that there exist some constants and s.t.
(E.8)
where appears on the r.h.s. of (E.8). Indeed, the multiplicative factors , , inside the integral sum in (E.7) do not prevent the use of the same change of variable trick that had been used in the bivariate case for the treatment of above. Therefore, after integrations, the -function that would remain is the same as for , apart from a multiplicative factor. Since the latter quantity is due to Equation (E.5), this yields
(E.9)
Thus, (E.6) and (E.9) provide the result when , for the second-order derivatives of w.r.t. . ∎
Now, consider the case of a mixture of Gaussian copulas, i.e. the true underlying copula density is
where , and is a Gaussian copula density with a correlation matrix , . Here, is the concatenation of the first weights and the unknown parameters of every Gaussian copula. The latter ones are given by the lower triangular parts of the correlation matrices . Assume the true weights are strictly positive. The -order partial derivatives of the loss function w.r.t. and/or w.r.t. its arguments are linear combinations of maps of the form
(E.10)
where for every , and . Here, the derivatives of order have to be understood w.r.t. the corresponding parameters and/or the arguments of the copula , possibly multiple times. The latter derivatives may be written as
for some polynomials of the variables . When all the weights are positive, every map given by (E.10) with be (in absolute value) smaller than times a constant, that is itself a polynomial in terms of ’s components. Checking our Assumption 3, the single problematic one, is then reduced to checking this assumption for polynomials of . This task is easy and details are left to the reader. To conclude, the penalized CML method can be used with mixtures of Gaussian copulas, at least when the underlying true weights are strictly positive. In practice, penalization should then be restricted to the copulas parameters.
Appendix F Regularity conditions for Gumbel copulas
We now verify that the Gumbel copula family fulfills all regularity conditions that are required to apply Theorems 3.1 and 3.2 when the loss is chosen as the opposite of the log-copula density. A -dimensional Gumbel copula is defined by where , , for some parameter . Note that , . The associated density is
and the considered loss will be . Simple calculations show that for every , where
for some coefficients that depend on . Since , deduce
Assumption 1 is satisfied because all the derivatives of are nonzero and given by some polynomial maps of , of the quantities and , , or by the logarithm of such maps. The latter maps are clearly integrable, even uniformly w.r.t. in a small neighborhood of . This may be seen by doing the changes of variables , in the corresponding integrals.
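Before proceeding, recall for reference the standard form of the $d$-dimensional Gumbel copula used above (our own display):
\[
C_\theta(u) = \exp\Big[-\Big\{\sum_{j=1}^d (-\log u_j)^\theta\Big\}^{1/\theta}\Big], \qquad \theta \ge 1,\ u \in (0,1]^d,
\]
i.e. $C_\theta(u) = \psi\{\psi^{-1}(u_1)+\dots+\psi^{-1}(u_d)\}$ with the generator $\psi(t) = \exp(-t^{1/\theta})$.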
To state Assumption 2 and w.l.o.g., let us focus on the cross-derivatives of the true copula w.r.t. its first two components. Note that
setting , denoted also by when there is no ambiguity. Since for every and , we have Then, we deduce
Since , this easily yields
To check Assumption 3 (the regularity of the partial derivatives of the loss function), it is sufficient to verify this assumption for every term , .
Study of and Assumption 3: by simple calculations, we get
Let us prove that is -regular for any . For convenience, will simply be denoted as hereafter.
Since is bounded on , for any triplet of positive numbers, the map is bounded on for any positive .
To verify (A.1), it is necessary to calculate
By tedious calculations, it can be shown that
for some map that is a power function of the quantities , , , and . Therefore, by symmetry, we have
For any s.t. , for some map that is a power function of the quantities and . Therefore, after integrations w.r.t. successively, we get
for some constant . Note that . After another integration w.r.t. , there exists another constant s.t.
This means (A.1) is satisfied for .
The same technique can be applied to check (A.2), assuming . Set , . For every , , we have
in every case, except when and simultaneously. W.l.o.g., let us assume that the components indexed by are the first ones, i.e. are . Note that
(F.1)
for some map defined on that is a power function of the quantities , , , , and .
Let us first deal with the case and . Here,
when . The reasoning is then exactly the same as for checking (A.1), starting from (F.1), noting that and by doing integrations on instead of . In the calculation of the latter integrals, the only difference w.r.t. (A.1) comes from the terms instead of . This does not change the conclusion and we have established (A.2).
Incidentally, Assumption 13 is easily checked by applying the dominated convergence theorem. This statement is general and will not be repeated hereafter for the other terms.
When or , note that
Thus, after integration on , we obtain
for some constants and . The latter term on the r.h.s. is , proving (A.2) when or when .
With exactly the same techniques, it can be proved that and (the family given by the second and third order derivatives of ) are -regular for any , even when the parameter is free to belong to a neighborhood of .
Study of and Assumption 3: Since is the Laplace transform of a stable distribution, it is completely monotone and for every . Thus, , . By definition and by recursion, it can be proved that , which is not zero because . Moreover, . Therefore, there exists a positive constant s.t.
(F.2)
for any . Note that we can write , for some constants and . As above and to lighten notations, denote , also denoted simply. Simple calculations yield
Moreover, successive derivatives yield
and, by iteration, we get
Note that, for every positive integer , there exist some constants and s.t.
(F.3)
As before, we have by symmetry
Deduce from (F.2) and (F.3) that the first of the latter terms is smaller than a linear combination of integrals of the form
(F.4)
for some , , and . But, for any , there exist some positive constants and s.t.
(F.5)
Moreover, there exist some positive constants and s.t.
for any . In the “worst case”, the latter integral (F.4) is smaller than a constant times
(F.6)
after integrations w.r.t. successively. The r.h.s. of (F.6) is smaller than
By choosing , the latter integral is finite because
This proves that is finite.
Similarly, is smaller than a linear combination of integrals of the form
(F.7)
for some , , and . Recalling (F.5), note that
Therefore, the “worst situation” to manage is to evaluate
(F.8)
When , integrate the latter integral w.r.t. successively, and we obtain a scalar times the integral
that is finite because
for some sufficiently small constant .
When , first integrate (F.8) w.r.t. , but on instead of . This will yield an upper bound of the -type term (F.8). In such a case, note that
with obvious notations. Thus, the term in (F.7) can be replaced by that does not depend on . The new integral w.r.t. is then
and we will choose . To bound (F.8), we are reduced to the evaluation of
Now, integrate w.r.t. successively. We obtain a scalar times the integral
that is finite for any , because
for some sufficiently small constant . This means (A.1) is satisfied for .
The same technique can be applied to check (A.2) for , mimicking the reasoning used for .
With exactly the same techniques, it can be proved that and (here defined through ) are -regular for any . Actually, this is still the case for any higher-order -derivatives of the loss function. Indeed, the effect of such derivatives will be to add some multiplicative factors , , and such factors play no role in checking -regularity.
Since Assumption 3 is trivially satisfied with , we have proven the validity of this assumption for the Gumbel family.
Remark 8.
Note that we have proved the regularity assumptions as if the weight function were replaced by , implying a stronger requirement.
Appendix G Regularity conditions for Clayton copulas
We now verify that the Clayton copula family fulfills all regularity conditions that are required to apply Theorems 3.1 and 3.2 when the loss is chosen as the opposite of the log-copula density. A -dimensional Clayton copula is defined by where , , for some parameter . Note that , and
The associated density on is
and the considered loss will be the opposite of its logarithm. Note that
where is a positive map of only.
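For reference, the standard form of the $d$-dimensional Clayton copula used above is (our own display)
\[
C_\theta(u) = \Big(\sum_{j=1}^d u_j^{-\theta} - d + 1\Big)^{-1/\theta}, \qquad \theta > 0,\ u \in (0,1]^d.
\]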
Assumption 1 is satisfied because is nonzero and integrable for any , even uniformly w.r.t. in a small neighborhood of . This can be easily seen by noting that
for every because , , and is finite for every .
To state Assumption 2 and w.l.o.g., let us focus on the cross-derivative w.r.t. the first two components of the true copula. By simple calculations, we get
setting . Since , deduce
Let us check Assumption 3, i.e. the regularity of the partial derivatives of the loss function. We will do the task for only. The task for and , or even for higher-order derivatives of the loss w.r.t. , is handled by exactly the same reasoning. Simple calculations yield
First, let us show that the map is bounded on for any positive . We will replace by hereafter, to lighten notations. W.l.o.g., assume that . Then, . We deduce
that is a bounded function of .
Second, by simple calculations, it can be easily seen that , for some map that is a linear combination of the maps
for every .
Let us check (A.1) for all the latter maps. Concerning , this means we have to show
(G.1)
Indeed, by symmetry, we have
By an integration w.r.t. , the latter integral is smaller than a constant times
By integrating w.r.t. successively, we obtain a constant times
proving (G.1). To manage for any , note that, for any , there exist some positive constants and s.t.
Then, apply the same technique as for . The upper bound is here reduced to a constant times , that is finite by choosing . The same ideas apply to deal with : for some constant , and we recover the case. We have then proved (A.1) for .
The same technique can be applied to check (A.2), assuming . Denote , . W.l.o.g., let us assume that the components indexed by are the first ones, i.e. are . By simple calculations, it can be easily seen that , for some map whose absolute value is smaller than a linear combination of the maps
for every , by setting
We will check Assumption (A.2) for all latter maps.
For every , , note that
when or . For the moment, let us assume this is the case.
To deal with , we have by symmetry
First integrate w.r.t. between and . The absolute value of the latter integral w.r.t. can be bounded by a constant times
Then, successively integrate w.r.t. using the same type of upper bounds for every integral. We finally obtain an -integral of order
that is and then tends to zero with for any . Thus, (A.2) is proven in the case of the integrand . The terms are managed similarly by noting that when .
The task is similar with : after integrations w.r.t. , , etc, , we get
for any . The terms are managed similarly by noting that when .
Thus, it remains to consider the case and to check (A.2). Apply the same technique as above, invoking for every . Moreover, after every integration stage, it is possible to replace by in the denominators. For instance, in the case of , the integration w.r.t. yields
because in our case. After integration stages, we obtain
that is finite, as required. This is still the case for the term , obviously, when because . The terms are managed similarly because for every and . The same ideas apply with the terms , because . To conclude, we have checked (A.2) in every situation.
Since Assumption 3 is now satisfied with , we have proven the validity of our assumptions in the case of the Clayton family.
Remark 9.
As for the Gumbel copula family, we have proved the regularity assumptions for the Clayton family as if the weight function were replaced by , a stronger property.
Appendix H Asymptotic properties for parameters of varying dimensions
In this section, we deal with the case of copula parameters whose dimensions are functions of the sample size. More formally, we consider a sequence of parametric copula models , for some subsets , . Therefore, the number of unknown parameters may vary with the sample size . In particular, the sequence could tend to infinity with , i.e. when , but this is not mandatory.
As a consequence, in this section only, we introduce a sequence of loss functions , which hereafter enters in an associated map (whose notation remains unchanged, to simplify): for every , the map defines the “global loss” map
(H.1)
for every and every in .
Assumption 15.
For every , the parameter space is a borelian subset of . The function is uniquely minimized on at , and an open neighborhood of is contained in .
Exactly as detailed in the core of the main text (see the discussion after Assumption 2), our theory applies when belongs to the frontier of . Technical details are left to the reader. Let us illustrate the relevance of the diverging dimension case for copula selection.
Example 5 (Example 3 cont’d).
Consider an infinite sequence of given copulas in dimension . The true underlying copula will be estimated by a sequence of finite mixtures, given by the weighted sum of the first copulas , . If we choose the CML method, the loss function is the log-copula density associated with . In theory, we could extend the latter framework by assuming that every depends on an unknown parameter that has to be estimated in addition to the weights, but the estimation of the enlarged vector of unknown parameters would surely become numerically challenging.
Example 6.
A probably more relevant application is related to single-index copulas with a diverging number of underlying factors and a known link function : in the same spirit as Example 4, the conditional copula of given would be the -dimensional copula for some parameter to be estimated and a given parametric copula family .
We now have to consider a sequence of estimators defined as
(H.2)
We will focus on the distance between and .
Remark 10.
Note that we do not evaluate the distance between (or ) and a hypothetical “true parameter” , because they generally do not live in the same spaces. In some cases, it is possible to “dive” into and/or into a (correctly specified) parametric family of copulas. To illustrate, in Example 5, assume the true copula is an infinite sum of known copulas, i.e. . Setting , some identification constraints have to be found and they depend on the selected copulas . Therefore, it would be possible to compare the two infinite sequences and . Nonetheless, in such cases, the distance between and is strongly model-dependent. Since this is no longer an inference problem but rather a problem of model specification, we will not go further in this direction.
In terms of notations, for every , the sparse subset of parameters is , and will denote the cardinality of . Let us rewrite some of our previous assumptions, which have to be adapted to the new framework.
Assumption 16.
The map is thrice differentiable on , for every . Any pseudo-true value satisfies the first-order condition, i.e. . Moreover, and exist, are positive definite and . Denoting by the smallest eigenvalue of , there exists a positive constant such that for every . Finally, for every , there exists a positive constant such that
Assumption 17.
For some , the family of maps , , from to is -regular, with
for some constant .
Assumption 18.
Define
We assume that and when . Moreover, there exist some constants and such that for any real numbers such that .
This set of assumptions extends the regularity conditions of Section 3 to the diverging dimension case. In particular, our assumptions 16 and 18 are in the same vein as in [13], assumption (F) and condition 3.1.1. respectively. Note that our Assumption 2 in the main text does not need to be altered as remains fixed.
Theorem H.1.
Note that is allowed to diverge to infinity, but not too fast, since by assumption.
Proof.
The arguments are exactly the same as those given for the proof of Theorem 3.1. With the same notations, the dominant term in the expansion comes from and is larger than , for some vector , . By a careful inspection of the previous proof, the result follows if we satisfy
The latter conditions will be satisfied under our assumptions, choosing some vectors whose norm is sufficiently large and setting . ∎
As in the fixed dimension case, we establish the asymptotic oracle property, i.e. the conditions for which the true support is recovered and the non-zero coefficients are asymptotically normal. We denote by the estimated support of our estimator for the -th model, i.e. . For convenience and w.l.o.g., we assume that the supports are related to the first components of the true parameters, i.e. and . Therefore, every (true or estimated) parameter will be split as , where (resp. ) is related to the (resp. ) components. The statement of the asymptotic distribution with a diverging dimension requires the introduction of a sequence of deterministic real matrices , being of size , for some fixed . Denote . Define the sequences of maps , , from to by
In addition to Condition 17, the next assumption allows us to obtain the -regularity of the latter maps.
Assumption 19.
Moreover, we need to introduce a limit for the sequences of maps .
Assumption 20.
There exist maps that are -regular and such that
Denote . The use of the sequence of matrices is classical and inspired here by Theorem 2 of [13]. This technicality allows us to obtain convergence towards a finite -dimensional distribution. A similar technique was employed in Theorem 3.2 of [27], which established the normal distribution of the least squares-based M-estimator when the dimension diverges. Our regularity assumptions differ from the latter ones, due to different techniques of proofs. For instance, Theorem 2 of [13] assumes and imposes the convergence of . In our case, we need and the boundedness of the sequence of row-sum norms (Assumption 19). In light of our criterion, since Theorem A.1 will be applied to , , contrary to our Theorem 3.2, which applied the same corollary to , , this motivates Assumptions 19 and 20.
Theorem H.2.
Note that property (i) is related to the zero coefficients of the true parameters , as in [13]. There is a subtle difference with the oracle property established for a fixed dimension, where both true zero and non-zero coefficients are correctly identified, a property termed “support recovery”. Here, the so-called “sparsity property” actually does not preclude the possibility that, for every , some components of ’s support may be estimated as zero.
Remark 11.
Assume the minimum signal condition (H) of [13], i.e., as , is satisfied; this condition is standard for sparse estimation with diverging dimensions. Such a property is closely related to the unbiasedness property for non-convex penalization: see, e.g., Condition 2.2-(vii) of [25]. Then, if the penalty is SCAD or MCP, the quantities are zero when is sufficiently large. Therefore, the conclusion of Theorem H.2 (ii) becomes
Remark 12.
It can be checked that our assumptions in Theorem H.2 can be satisfied by some sequences . Restricting ourselves to some powers of , set , and , for some positive constants , and . The subset
yields an acceptable choice for .
Proof.
Point (i): The proof of (i) follows exactly the same lines as the proof of the first point in Theorem 3.2. Due to the diverging number of parameters and Theorem H.1, we now consider . By the same reasoning as above, particularly Equation (D.2), the result is proved if we satisfy
(H.3)
keeping in mind that the estimated parameters we consider satisfy . Note that we have invoked a uniform upper bound for the second and third order partial derivatives of the loss (Assumption 16). It can be easily checked that (H.3) is satisfied under our assumptions on the sequence . Indeed, (H.3) is satisfied if
Since and , then . It is then sufficient to check that
that is a direct consequence of our assumptions.
Point (ii): We proved the sparsity property . Therefore, for any , the event occurs with a probability larger than for large enough. Since we want to state a convergence in law result, we can consider that the latter event is satisfied everywhere. By a Taylor expansion around the true parameter, as in the proof of Theorem 3.2, and after multiplying by the matrix , we get
where . Obviously, is a random parameter such that . Due to Assumptions 17 and 19, the family of maps from to defined as
is -regular, with the same as in Assumption 17. Invoking Corollary A.2, we obtain
Second, the third order term is a vector of size whose -th component, , is
Conditions 17 and 19 imply that the maps
are -regular. Then, apply Corollary A.2 to the latter family. This yields , by setting
for every . By Assumption 16, we get
With our assumptions about , we obtain .
It remains to show that is asymptotically normal. To this end, note that
for any . Setting , this implies that , due to the first-order conditions.
Now, denote . Recall that is -regular (Assumption 20). Thus, the family of maps is -regular, due to the -regularity of (Assumption 17) and Assumption 19. Moreover, satisfies Assumption 14 (replacing with ). Therefore, all the conditions of application of Theorem A.1 (iii) are fulfilled and is weakly convergent in . By the stochastic equicontinuity of the process defined on , the random vector weakly tends to , with obvious notations.
We then conclude with an application of Slutsky’s Theorem to deduce the following asymptotic distribution:
where is the -dimensional random vector defined in the statement of the theorem. ∎
Appendix I Additional simulated experiment
In this subsection, we investigate how the calibration of and alters the performances of the SCAD and MCP penalty functions, respectively. Following the experiment on sparse Gaussian copula performed in the main text, we set and specify two true sparse correlation matrices . The true parameters , which stack the lower triangular part of , respectively, excluding the diagonal terms, belong to with . The number of zero coefficients in is , i.e., approximately of their total number of entries are zero coefficients. The non-zero entries are generated from the uniform distribution . For them, we have , ; , . Moreover, the true vector contains three non-zero entries smaller (in absolute value) than . The latter parameters and will be left fixed hereafter.
Then, for the sample size , we draw , , from the sparse Gaussian copula with parameter and apply the rank-based transformation to obtain the . Equipped with the pseudo-sample , , we solve the same penalized criteria as in the main text for the Gaussian copula (Gaussian loss and least squares loss) with SCAD and MCP penalty functions. We will use a grid of different and values: and . We repeat this procedure for independent batches. The same experiment is performed with the true Gaussian copula parameter .
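For concreteness, here is a minimal sketch of the data-generating and penalty ingredients of this experiment. The function names, the default SCAD constant and the use of NumPy/SciPy are ours; the actual minimization of the penalized criteria is not shown.

import numpy as np
from scipy.stats import norm, rankdata

def sample_gaussian_copula(n, Sigma, rng):
    # Draw from the Gaussian copula with correlation matrix Sigma:
    # Z ~ N(0, Sigma), then map every margin through the standard normal cdf.
    Z = rng.multivariate_normal(np.zeros(Sigma.shape[0]), Sigma, size=n)
    return norm.cdf(Z)

def pseudo_observations(X):
    # Rank-based transformation: U_hat[i, j] = rank(X[i, j]) / (n + 1).
    n = X.shape[0]
    return np.apply_along_axis(rankdata, 0, X) / (n + 1.0)

def scad_penalty(theta, lam, a=3.7):
    # Coordinate-wise SCAD penalty (Fan-Li form), evaluated at |theta|.
    t = np.abs(theta)
    p1 = lam * t
    p2 = (2 * a * lam * t - t ** 2 - lam ** 2) / (2 * (a - 1))
    p3 = lam ** 2 * (a + 1) / 2
    return np.where(t <= lam, p1, np.where(t <= a * lam, p2, p3))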
Figure 1 and Figure 2 display the metrics C1/C2 and MSE, respectively, averaged over these batches, when the true parameter is . Figure 3 and Figure 4 display the same metrics when the true parameter is . For instance, on the SCAD panel of Figure 1, the red solid line represents the percentage of the true zero coefficients in that are correctly recovered by when deduced from the Gaussian loss penalized by the SCAD penalty for different values of and averaged over the batches. Similarly, on the MCP panel of Figure 1, the blue dash-dotted line represents the percentage of the true non-zero coefficients in correctly recovered by when deduced from the least squares loss penalized by the MCP penalty for some values of and averaged over the batches.
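As an illustration, the metrics for one batch can be computed as follows (this is our own reading of C1, C2 and MSE; the numerical tolerance and the function name are ours).

import numpy as np

def recovery_metrics(theta_true, theta_hat, tol=1e-8):
    # C1: percentage of true zero coefficients estimated as (numerically) zero.
    # C2: percentage of true non-zero coefficients estimated as non-zero.
    # MSE: mean squared error over all coefficients.
    zero = np.abs(theta_true) < tol
    nonzero = ~zero
    c1 = 100 * np.mean(np.abs(theta_hat[zero]) < tol) if zero.any() else np.nan
    c2 = 100 * np.mean(np.abs(theta_hat[nonzero]) >= tol) if nonzero.any() else np.nan
    mse = np.mean((theta_hat - theta_true) ** 2)
    return c1, c2, mse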
The SCAD and MCP panels of Figure 1 highlight the existence of a trade-off between the recovery of the zero and non-zero coefficients, particularly for the Gaussian loss: smaller and provide better C1 but to the detriment of C2, whereas larger values imply worse C1, but better C2. Interestingly, a different recovery pattern of the true non-zero coefficients is displayed in the corresponding panels of Figure 3: the metric C2 is close to for all (excluding the values in the MCP case). Since contains a true minimum signal sufficiently large, the penalization procedure can correctly recover all its non-zero entries. On the other hand, includes three values smaller than . In that case, the penalization tends to estimate the true small non-zero entries as zero entries, thus worsening the C2 metric. The pattern of C1 for both penalty functions and loss functions is not much affected by . For both cases, and for both SCAD and MCP, the Gaussian loss-based criterion tends to generate more zero coefficients than the least squares-based criterion. Furthermore, the performance in terms of estimation accuracy (MSE) is in favor of the Gaussian loss-based criterion. Larger and values result in larger MSE, which is in line with the findings in the main text: large and values result in a LASSO penalization, a situation where the penalty increases linearly with respect to the absolute value of the coefficient, thus generating more bias. Moreover, the MSE metric is smaller when the true parameter is : compare Figure 2 and Figure 4. This is in line with the fact that the support recovery is better for than for .
A point worth mentioning is the poor performance of the MCP method for too small values of . In the latter case, the set is likely to be active, and the penalty becomes , so that the penalization is almost vanishing, which results in lower C1 and larger C2, as depicted in the MCP panel of Figure 1 when . But a low C1 implies that the penalization misses a large number of true zero entries, which induces a large MSE.
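Recall for reference the usual form of the MCP penalty of [38] (our own display; the parameterization of the tuning constant may differ slightly from the one used in the main text):
\[
p_{\lambda,\gamma}(t) = \Big(\lambda |t| - \frac{t^2}{2\gamma}\Big)\,\mathbf{1}\{|t| \le \gamma\lambda\} + \frac{\gamma\lambda^2}{2}\,\mathbf{1}\{|t| > \gamma\lambda\}.
\]
For a small $\gamma$, the penalty is thus constant beyond the small threshold $\gamma\lambda$, which is consistent with the almost vanishing penalization described above.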
This simulated experiment on the sparse Gaussian copula suggests the existence of an interval of “optimal” values of and , where optimality refers to the ideal situation of perfect recovery of the zero and non-zero coefficients with a low MSE. Such values should be neither too small nor too large: for the SCAD, provides an optimal compromise between C1 and C2 with low MSE; this is the case for the MCP when .
[Figures 1–4 about here: metrics C1/C2 (Figures 1 and 3) and MSE (Figures 2 and 4), averaged over the batches, for the SCAD and MCP penalties and the two true parameters.]