Regional Pooling in Extreme Event Attribution Studies: an Approach Based on Multiple Statistical Testing
Abstract
Statistical methods are proposed to select homogeneous locations when analyzing spatial block maxima data, such as in extreme event attribution studies. The methods are based on classical hypothesis testing using Wald-type test statistics, with critical values obtained from suitable parametric bootstrap procedures and corrected for multiplicity. A large-scale Monte Carlo simulation study finds that the methods are able to accurately identify homogeneous locations, and that pooling the selected locations improves the accuracy of subsequent statistical analyses. The approach is illustrated with a case study on precipitation extremes in Western Europe. The methods are implemented in an R package that allows easy application in future extreme event attribution studies.
Key words: Extreme event attribution; Extreme Value Statistics; Homogeneity Tests; Multiple Comparison Problem; Parametric Bootstrap; Max-Stable Processes.
1 Introduction
Extreme event attribution studies on precipitation extremes are typically motivated by the occurrence of an extreme event which causes major impacts such as damage to infrastructure and agriculture, or even fatalities, see, for instance, Van16; Van17; Ott18; Kre21. A key task for attributing the event to anthropogenic climate change consists of a statistical analysis of available observational data products at the location or region of interest (Phi20). Typically, the observed time period is short, often less than 100 years, which ultimately leads to large statistical uncertainties. One possibility to reduce those uncertainties is to incorporate observations from nearby locations/regions, provided that their meteorological characteristics are sufficiently similar to those of the region affected by the extreme event and are governed by the same underlying processes. The selection of surrounding areas for which these criteria are met can be based on expert knowledge of the meteorological characteristics and dynamics, for instance provided by experts from the national meteorological and hydrological service of the affected country, such as the Deutscher Wetterdienst in Germany. The expert knowledge-based suggestion may next be assessed statistically, which, to the best of our knowledge, has been done with ad hoc methods in the past. In this paper, we propose rigorous statistical methods that can complement the expert's knowledge and which are based on statistically evaluating observational data from the past. Once regions with sufficiently similar characteristics of the analysed variable, e.g., the yearly maximum of daily rainfall, have been identified, the time series of all identified regions can be combined, thereby extending the available time series for the benefit of a more efficient statistical analysis.
The building blocks for the new approach are classical Wald-type test statistics (Leh21) for testing the null hypothesis that the temporal dynamics at multiple locations of interest are the same. Unlike in the classical text-book case, and motivated by the fact that standard likelihood-based inference for extreme value distributions requires unreasonably large sample sizes for sufficient finite-sample accuracy, we employ a parametric bootstrap device to approximate the distribution of the test statistics under the null hypothesis. This approach is motivated by results in Lil22 for respective stationary extreme value models. Based on suitable decompositions of a global null hypothesis, we then propose to test carefully selected sub-hypotheses, possibly after correcting the individual tests' level for multiple comparisons.
The new methods are illustrated by a large-scale Monte Carlo simulation study and by an application to the severe flooding event in Western Europe during July 2021 for which spatial pooling was applied in an attribution study following the event (Kre21; Tra22). For the benefit of researchers who would like to use this spatial pooling approach, an implementation of the method in the statistical programming environment R (R) is publicly available as an R package called findpoolreg on GitHub (findpoolreg).
Attribution analysis of precipitation extremes is especially challenging due to short observational time series as well as their often limited spatial extent, which further complicates the detection of a trend and the estimation of return periods based on the limited time series (see Tra22 for a discussion). We therefore present the suggested approach for a heavy rainfall event in the following; the method could, however, equally be applied to other variables.
The remaining parts of this paper are organized as follows. Section 2 explains the mathematical concept of the proposed methods, starting with a detailed description of the underlying model assumptions and a strategy for the detection of a possible pooling region in Section 2.1; it is recommended to all readers. In Sections 2.2 and 2.3, mathematical details on the applied estimators and test statistics are given; they may be skipped by non-expert readers. Next, in Section 2.4, the ideas of the bootstrap procedures that allow drawing samples from the distribution under the null hypothesis are explained; again, this part may be skipped by non-statisticians. Section 2.5 goes into detail about the detection strategy for possible pooling regions and the treatment of the related multiple testing problem. This section is of interest to all readers who want to apply the methods, since it provides details on how to process a set of p-values obtained from testing multiple hypotheses. Next, Section 3 gives the results of the simulation study that was performed in order to evaluate the performance of the proposed methods. These results are of interest to all readers, and they serve as a basis for the case study conducted in Section 4. Section 5 then discusses several extensions of the proposed methods. In Section 5.1, we provide a method for estimating region-wise return periods once a pooling region has been chosen. Here, a region-wise return period of a given event is defined as the number of years that one has to wait on average until an event of the same or even greater magnitude is observed anywhere in the pooling region. Extensions to different model assumptions that suit, e.g., other variables such as temperature are discussed in Sections 5.2 and 5.3. Finally, we conclude in Section 6. Some mathematical details and further illustrations on the simulation study and the case study are postponed to a supplement.
2 Assessing spatial homogeneities for precipitation extremes
2.1 A homogeneous model for precipitation extremes
The observational data of interest consists of annual or seasonal maximal precipitation amounts (over some fixed time duration, e.g., a day) collected over various years and at various locations (in practice, each location may correspond to a spatial region; we separate these two terms from the outset to avoid misunderstandings: subsequently, a region shall be a set of locations). More precisely, we denote by the observed maximal precipitation amount in season and at location , with and . The location of primary interest shall be the one with index . Note that the choice of is made for illustrative purposes only and can be replaced by any index .
In view of the stochastic nature, we assume that is an observed value of some random variable . Since is generated by a maxima operation, standard extreme value theory (Col01) suggests to assume that follows the generalized extreme value (GEV) distribution, i.e.,
for some parameter vector, where the $\mathrm{GEV}(\mu, \sigma, \gamma)$ distribution with location parameter $\mu \in \mathbb{R}$, scale parameter $\sigma > 0$ and shape parameter $\gamma \in \mathbb{R}$ is defined by its cumulative distribution function
\[
G_{(\mu,\sigma,\gamma)}(x) = \exp\Big\{-\Big(1 + \gamma\,\frac{x-\mu}{\sigma}\Big)^{-1/\gamma}\Big\} \qquad (1)
\]
for all $x$ such that $1 + \gamma(x-\mu)/\sigma > 0$; for $\gamma = 0$, the expression is interpreted as its limit $\exp\{-\exp(-(x-\mu)/\sigma)\}$. Due to climate change, the temporal dynamics at a given location, which are primarily driven by the temporal evolution of the GEV parameters, are typically non-constant. Any proxy for climate change qualifies as a suitable temporal covariate, and a standard assumption in extreme event attribution studies, motivated by the Clausius–Clapeyron relation, postulates that
\[
\mu(t) = \mu_0 \exp\Big(\frac{\alpha\,\mathrm{GMST}'_t}{\mu_0}\Big), \qquad
\sigma(t) = \sigma_0 \exp\Big(\frac{\alpha\,\mathrm{GMST}'_t}{\mu_0}\Big), \qquad
\gamma(t) = \gamma_0 \qquad (2)
\]
for certain location-specific parameters $\mu_0, \sigma_0 > 0$ and $\gamma_0, \alpha \in \mathbb{R}$. Here, $\mathrm{GMST}'_t$ denotes the smoothed global mean surface temperature anomaly, see Phi20. Note that (2) implies that the ratio $\sigma(t)/\mu(t) = \sigma_0/\mu_0$ does not depend on $t$, hence the model may be identified as a temporal scaling model. It is further assumed that any temporal dependence at a given location is completely due to the covariate $\mathrm{GMST}'_t$, which we treat as deterministic and which implies that the observations at that location are stochastically independent across years. For the moment, the spatial dependence will be left unspecified.
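To make the temporal scaling concrete, the following minimal R sketch simulates one location's seasonal maxima under a scaling model of the form (2); all numerical values and the covariate series are illustrative choices and are not taken from the data analysed in this paper.

```r
library(evd)  # rgev()

set.seed(1)
n    <- 72                               # years, e.g. 1950-2021
gmst <- seq(-0.3, 1.2, length.out = n)   # illustrative smoothed GMST anomaly
mu0  <- 20; sigma0 <- 5; gamma0 <- 0.1; alpha <- 1.5

# time-varying location and scale; their ratio sigma(t)/mu(t) stays constant
mu_t    <- mu0    * exp(alpha * gmst / mu0)
sigma_t <- sigma0 * exp(alpha * gmst / mu0)

# GEV(mu_t, sigma_t, gamma0) margins via the location-scale representation
z <- rgev(n, loc = 0, scale = 1, shape = gamma0)
x <- mu_t + sigma_t * z
plot(1950:2021, x, type = "h", xlab = "year", ylab = "seasonal maximum")
```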
Recall that the location of interest is the one with , which is characterised by the four parameters . As described before, estimating those parameters based on the observations from location only may be unpleasantly inaccurate, which is why one commonly assumes that the locations have been carefully selected by experts to meet the following space-time homogeneity assumption:
(3) |
where and , and where the upper index stands for ‘equal distribution’, since, in short, Equation (3) states that the location-wise GEV parameters coincide for the locations.
In the subsequent sections, we aim at testing the validity of the expert’s hypothesis . Here, it is not only of interest to test the hypothesis for the whole set , but also to find a (maximal) subset with and on which the space-time homogeneity assumption holds. Here, for an arbitrary index set , the latter assumption may be expressed through
(4) |
with as in Equation (3) and , meaning that the location-wise GEV parameters coincide for all locations with index in the set , making the respective locations a possible pooling region.
Now, a maximal subset for which Equation (4) holds may be determined with the following strategy: Since we are interested in finding all locations that ‘match’ the location of primary interest with index , we test for each pair , whether the null hypothesis holds. This will provide us with a set of p-values based on which we can decide which locations to reject and which not to reject. Those locations that are not rejected can then be assumed to be sufficiently homogeneous and are thus included in the suggestion of a pooling region of maximal extent. For further details on this strategy and the impact of the induced multiple testing problem, see Section 2.5.
2.2 Coordinate-wise maximum likelihood estimation
The starting points for the subsequent test statistics are the coordinate-wise maximum likelihood estimators for the model specified in (2). Writing for brevity, the log-likelihood contribution of an observation is given by , where
(5) |
with the probability density function of the -distribution. The maximum likelihood estimator for at location is then defined as
(6) |
The arg-maximum cannot be calculated explicitly, but may be found by suitable numerical optimization routines. We denote the gradient and the Hessian matrix of by and , respectively. Under appropriate regularity assumptions, standard asymptotic expansions (Van98, see also BucSeg17 for the stationary GEV family) imply that is approximately Gaussian with mean and covariance , where is defined as
(7) |
with . See Appendix A.1 for details and Appendix A.2 for a suitable estimator for .
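As a concrete illustration, the following R sketch fits the four-parameter scale-GEV model at a single location by direct numerical maximization of the log-likelihood. The function names are illustrative, the parametrization follows the scaling form in (2), and the inverse of the numerically obtained Hessian is used as a simple stand-in for the covariance estimator of Equation (7).

```r
library(evd)  # dgev()

# negative log-likelihood of the scale-GEV model with parameters
# (mu0, sigma0, gamma0, alpha) at one location
neg_loglik <- function(par, x, gmst) {
  mu0 <- par[1]; sigma0 <- par[2]; gamma0 <- par[3]; alpha <- par[4]
  if (mu0 <= 0 || sigma0 <= 0) return(1e10)
  mu_t    <- mu0    * exp(alpha * gmst / mu0)
  sigma_t <- sigma0 * exp(alpha * gmst / mu0)
  # location-scale representation of the GEV(mu_t, sigma_t, gamma0) density
  ll <- dgev((x - mu_t) / sigma_t, shape = gamma0, log = TRUE) - log(sigma_t)
  if (any(!is.finite(ll))) return(1e10)
  -sum(ll)
}

fit_scalegev <- function(x, gmst) {
  start <- c(mean(x), sd(x), 0.1, 0)      # crude starting values
  optim(start, neg_loglik, x = x, gmst = gmst,
        method = "Nelder-Mead", hessian = TRUE)
}

# usage with a series x and covariate gmst (e.g. from the previous sketch):
# fit <- fit_scalegev(x, gmst)
# fit$par             # estimates of (mu0, sigma0, gamma0, alpha)
# solve(fit$hessian)  # approximate covariance matrix (observed information)
```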
2.3 Wald-type test statistics
We define test statistics which allow testing the sub-hypotheses of the global hypothesis from Equation (4). For that purpose, we propose to use classical Wald-type test statistics; see Section 14.4.2 in Leh21 for a general discussion and Lil22 for a similar approach in temporally stationary GEV models, i.e., with the trend parameter fixed to zero.
Write with and let be defined by
We may then write equivalently as
Hence, significant deviations of from 0 with from Section 2.2 provide evidence against . Such deviations may be measured by the Wald-type test statistic
(8) |
where denotes the Jacobian matrix of , which is a matrix with entries in that does not depend on . In view of the asymptotic normality of , see Section 2.2, the asymptotic distribution of under the null hypothesis is the chi-square distribution with degrees of freedom; see also Section 4 in Lil22. Hence, rejecting if exceeds the -quantile of the -distribution provides a statistical test of asymptotic level . In the stationary setting, however, the finite-sample level of the related test was found to be quite inaccurate (see Lil22). To overcome this issue, we propose a suitable bootstrap scheme in the next section.
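The following R sketch evaluates a pairwise Wald-type statistic of this form, assuming that the stacked estimate of the two location-wise parameter vectors and an estimate of their joint covariance matrix are available; the numbers are purely illustrative, and the chi-square reference uses 4 degrees of freedom, i.e. one equality constraint per parameter of the four-parameter model.

```r
# Wald-type statistic for the hypothesis that two locations share the same
# four scale-GEV parameters
wald_pair <- function(theta_hat, Sigma_hat) {
  A <- cbind(diag(4), -diag(4))                 # contrast: theta_1 - theta_2
  d <- A %*% theta_hat
  drop(t(d) %*% solve(A %*% Sigma_hat %*% t(A)) %*% d)
}

# illustrative inputs: stacked estimates and a (here diagonal) 8 x 8 covariance
theta_hat <- c(20, 5.0, 0.10, 1.5,   21, 5.5, 0.05, 1.2)
Sigma_hat <- diag(c(1, 0.5, 0.01, 0.6, 1, 0.5, 0.01, 0.6))

t_obs <- wald_pair(theta_hat, Sigma_hat)
# asymptotic chi-square p-value; the paper recommends a parametric bootstrap
# instead because of the poor finite-sample accuracy noted above
pchisq(t_obs, df = 4, lower.tail = FALSE)
```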
2.4 Parametric bootstrap devices for deriving p-values
Throughout this section, we propose two bootstrap devices that allow simulating approximate samples from the null distribution of the test statistic from Equation (8). Based on a suitably large set of such samples, one can compute a reliable p-value for testing the null hypothesis, even for small sample sizes.
The first method is based on a global fit of a max-stable process model to the entire region under consideration, while the second one is based on fitting multiple pairwise models. The main difference between the two approaches is that the first one can test the hypothesis for arbitrary subsets, while the second is restricted to testing the null hypothesis on subsets of cardinality two, i.e., it can only test whether a pair of locations is homogeneous. Depending on the question at hand, one or the other method may be preferable.
2.4.1 Global bootstrap based on max-stable process models
The subsequent bootstrap device is a modification of the parametric bootstrap procedure described in Section 5.3 of Lil22. Fix some large number , say , noting that larger numbers are typically better, but going beyond is usually not worth the extra computational effort.
The basic idea is as follows: for each , simulate artificial bootstrap samples
that have a sufficiently similar spatial dependence structure as the observed data and that satisfy the null hypothesis . For each fixed with , the test statistics computed on all bootstrap samples, say , are then compared to the observed test statistic . Since the bootstrap samples do satisfy , the observed test statistic should differ significantly from the bootstrapped test statistics in case is not satisfied on the observed data.
Here, for simulating the bootstrap samples, we assume that the spatial dependence structure of the observed data can be sufficiently captured by a max-stable process model. Max-stable processes provide a natural choice here, since they are the only processes that can arise, after proper affine transformation, as the limit of maxima of independent and identically distributed random fields (Col01, Section 9.3). Parametric models for max-stable processes are usually stated for unit Fréchet (i.e., ) margins. Therefore, the first steps in our algorithm below aim at transforming the margins of our observed data to be approximately unit Fréchet.
More precisely, the parametric bootstrap algorithm is defined as follows:
Algorithm 1 (Bootstrap based on max-stable processes).
(1) For each location, calculate the maximum likelihood estimate from Section 2.2.

(2) For each location, transform the observations to approximately i.i.d. Fréchet-distributed data, by letting
(9)

(3) Fit a set of candidate max-stable process models with standard Fréchet margins to the transformed observations and choose the best fit according to the composite likelihood information criterion (CLIC), which is a model selection criterion that is commonly applied when fitting max-stable process models. Throughout, we chose the following three models:
(a) Smith's model (3 parameters);
(b) Schlather's model with a powered exponential correlation function (3 parameters);
(c) the Brown-Resnick process (2 parameters).
For further details on max-stable processes, the mentioned models and the CLIC, see davisonModelSpatExt and Dav12. Respective functions are implemented in the R package SpatialExtremes (SpatialExtremes).

(4) For each bootstrap replication, simulate spatial data with unit Fréchet margins from the chosen max-stable process model.
Note that until now we have not used the particular hypothesis under consideration. Subsequently, fix a candidate set of locations containing at least two elements.

(5) Assume that the null hypothesis from Equation (4) is true for that set, and estimate the four-dimensional model parameter by (pseudo) maximum likelihood based on the pooled sample of observations from the candidate locations. Note that the resulting parameter vector should be close to each of the location-wise estimates if the null hypothesis is met.

(6) Transform the margins of the bootstrap samples to those of a GEV model satisfying the null hypothesis, by letting
(10)
for each bootstrap replication and each location in the candidate set. For each resulting bootstrap sample, compute the value of the test statistic from Equation (8). Note that the statistic only depends on the coordinates belonging to the candidate set.

(7) Compute the value of the test statistic from Equation (8) on the observed sample.

(8) Compute the bootstrapped p-value by comparing the observed test statistic with the bootstrapped test statistics.
In a classical test situation, one may now reject for a fixed set at significance level if . In the current pooling situation, we would need to apply the test to multiple pooling regions , which hence constitutes a multiple testing problem where standard approaches yield inflated levels. We discuss possible remedies in Section 2.5.
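The following self-contained R sketch mirrors steps (4)-(8) of Algorithm 1 for a single pair of toy locations. It is a simplified illustration only: the max-stable dependence model is reused instead of being refitted via fitmaxstab(), a stationary GEV fit via evd::fgev stands in for the pooled scale-GEV fit of step (5), the test statistic is a simple placeholder rather than the Wald-type statistic of Equation (8), and the final p-value formula is one common convention.

```r
library(SpatialExtremes)  # rmaxstab()
library(evd)              # fgev(), qgev()

set.seed(42)
n <- 72; B <- 100
coord <- cbind(x = c(0, 1), y = c(0, 0))            # two toy locations

sim_frech <- function()                              # unit-Frechet fields
  rmaxstab(n, coord, cov.mod = "brown", range = 1, smooth = 1)
frech2gev <- function(z, par)                        # margins to GEV(par)
  matrix(qgev(exp(-1 / z), par[1], par[2], par[3]), nrow(z), ncol(z))

# toy "observations" with identical GEV(20, 5, 0.1) margins, so H0 holds
obs <- frech2gev(sim_frech(), c(20, 5, 0.1))

# placeholder statistic: absolute difference of the fitted GEV location
# parameters of the two columns
test_stat <- function(x)
  abs(fgev(x[, 1])$estimate["loc"] - fgev(x[, 2])$estimate["loc"])

# step (5): pooled fit under the null hypothesis
pooled_par <- fgev(as.vector(obs))$estimate

# steps (4) and (6): bootstrap samples with the assumed dependence structure
# and margins satisfying the null hypothesis
t_boot <- replicate(B, test_stat(frech2gev(sim_frech(), pooled_par)))

# steps (7) and (8): observed statistic and bootstrapped p-value
t_obs   <- test_stat(obs)
p_value <- (1 + sum(t_boot >= t_obs)) / (B + 1)
```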
2.4.2 Pairwise bootstrap based on bivariate extreme value distributions
Recall that the location of primary interest is the one with index .
As stated in Section 2.1, it is of interest to test all bivariate hypotheses involving the location of primary interest. For that purpose, we may apply a modification of the bootstrap procedure from the previous section that makes use of bivariate extreme value models only. By doing so, we decrease the model risk implied by imposing a possibly restrictive global max-stable process model.
The modification only affects steps (3) and (4) from Algorithm 1. More precisely, for testing the hypothesis for a fixed pair of locations, we make the following modifications:
Algorithm 2 (Pairwise bootstrap based on bivariate extreme value distributions).
Perform steps (1) and (2) from Algorithm 1, with the full set of locations replaced by the pair under consideration.

(3a) Fit a set of bivariate extreme value distributions to the bivariate sample, assuming the marginal distributions to be unit Fréchet. Choose the best fit according to the Akaike information criterion (AIC), a model selection criterion that rewards a good fit of a model and penalises the model's complexity at the same time (Aka73). Possible models are:
(a) the Hüsler-Reiss model (1 parameter);
(b) the logistic model (1 parameter);
(c) the asymmetric logistic model (2 parameters).
Note that all models are implemented in the R package evd (evd).

(4a) For each bootstrap replication, simulate bivariate data with unit Fréchet margins from the chosen bivariate extreme value model.

Perform steps (5)-(8) from Algorithm 1 with the pair under consideration in place of the candidate set.
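A corresponding R sketch of steps (3a) and (4a), assuming the marginal transformation to unit Fréchet has already been carried out; the toy data, the fixed-margin specification and the manual AIC computation (deviance plus twice the number of free parameters) are illustrative choices rather than the exact implementation used in the paper.

```r
library(evd)  # rbvevd(), fbvevd()

set.seed(3)
# toy bivariate sample with unit-Frechet margins, i.e. GEV(1, 1, 1) margins
z <- rbvevd(72, dep = 0.6, model = "log", mar1 = c(1, 1, 1), mar2 = c(1, 1, 1))

# step (3a): fit candidate dependence models with the margins held fixed at
# unit Frechet and select the best fit by AIC
models <- c("hr", "log", "alog")
fits <- lapply(models, function(m)
  fbvevd(z, model = m,
         loc1 = 1, scale1 = 1, shape1 = 1,
         loc2 = 1, scale2 = 1, shape2 = 1))
aic  <- sapply(fits, function(f) f$deviance + 2 * length(f$estimate))
best <- which.min(aic)

# step (4a): simulate bootstrap data from the selected model, again with unit
# Frechet margins (shown here for the logistic case)
if (models[best] == "log") {
  dep_hat <- fits[[best]]$estimate["dep"]
  z_boot  <- rbvevd(72, dep = dep_hat, model = "log",
                    mar1 = c(1, 1, 1), mar2 = c(1, 1, 1))
}
```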
2.5 Combining test statistics
As already addressed at the end of Section 2.1, it is not only of interest to test the global hypothesis , since a possible rejection of gives no indication about which locations deviate from the one of primary interest. Instead, one might want to test hypotheses on several subsets and then pool those subsets for which no signal of heterogeneity was found. In this subsection, we provide the mathematical framework of testing sub-hypotheses and discuss how to deal with the induced multiple testing problem.
Mathematically, we propose to regard as a global hypothesis that is built up from elementary hypotheses of smaller dimension. A particularly useful decomposition is based on pairwise elementary hypotheses: recalling the notation from Equation (4), we clearly have
(11) |
i.e., the global hypothesis holds if and only if it holds locally for all pairs involving the location of primary interest. We may now either apply Algorithm 1 or Algorithm 2 to obtain a p-value, say , for testing each such pairwise hypothesis. Each p-value may be interpreted as a signal for heterogeneity between the respective location and the location of primary interest, with smaller values indicating stronger heterogeneity. The obtained raw list of p-values may hence be regarded as an exploratory tool for identifying possible heterogeneities.
Since we are now dealing with a multiple testing problem, it might be advisable to adjust for multiple comparison in order to control error rates. This can be done by interpreting the raw list based on classical statistical testing routines, in which p-values are compared with suitable critical values to declare a hypothesis significant. Several methods appear to be meaningful, and we discuss three of them in the following. For this, let denote a significance level, e.g., .
IM (Ignore multiplicity): reject homogeneity for all pairs for which . In doing so, we do not have any control over false rejections. In particular, in case is large, false rejections of some null hypotheses will be very likely. On the other hand, the procedure will have decent power properties, and will likely detect most alternatives. Hence, in a subsequent analysis based on the pooled sample of homogeneous locations, we can expect estimators to exhibit comparably little bias and large variance.
Holm (Control the family-wise error rate): apply Holm’s stepdown procedure (Hol79). For that purpose, sort the p-values with ; denote them by . Starting from , determine the smallest index such that
If , then reject no hypotheses. If no such index exists, then reject all hypotheses. Otherwise, if , reject the hypotheses that belong to the p-values .
The procedure can be equivalently expressed by adjusted p-values. Recursively defining and
for , we simply reject those hypotheses that belong to the adjusted p-values with .
Holm’s stepdown procedure is known to asymptotically control the family-wise error rate (FWER) at level , i.e.,
see Theorem 9.1.2 in Leh21.
In general, controlling the family-wise error rate will result in comparably little power, i.e., we might falsely identify some pairs of locations as homogeneous. Hence, in a subsequent analysis based on the pooled sample of homogeneous locations, we can expect estimators to exhibit comparably large bias and little variance.
BH (Control the false discovery rate): apply the Benjamini–Hochberg stepup procedure (Ben95). For that purpose, sort the p-values with ; denote them by . Starting from , determine the largest index such that
If no such index exists, then reject no hypotheses. Otherwise, if , reject the hypotheses that belong to the p-values .
Again, one can compute adjusted -values such that the procedure is equivalent to rejecting those hypotheses for which . For that purpose, let and recursively define, for ,
Under an additional assumption on the p-values that belong to the true null hypotheses (they must exhibit some positive dependence), the BH procedure is known to asymptotically control the false discovery rate (FDR) at level , i.e.,
see Theorem 9.3.3 in Leh21. Control of the FDR will be confirmed by the simulation experiments in Section 3.
If one were interested in guaranteed theoretical control of the FDR, one might alternatively apply the Benjamini–Yekutieli (BY) stepup procedure, see (Ben01) and Theorem 9.3.3 in Leh21. In view of the fact that this procedure is much more conservative than BH, we do not recommend its application in the current setting.
Concerning a subsequent analysis, estimators based on a pooled sample obtained from the BH procedure can be expected to exhibit bias and variance to be somewhere between the IM and Holm procedure.
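In R, all three decision rules can be applied to the vector of raw p-values with the base function p.adjust(); the numbers below are purely illustrative.

```r
# raw p-values from, e.g., 15 pairwise tests (illustrative values)
p_raw <- c(0.003, 0.210, 0.048, 0.760, 0.012, 0.330, 0.095, 0.540,
           0.002, 0.620, 0.150, 0.041, 0.880, 0.070, 0.260)
alpha <- 0.1

reject_IM   <- p_raw <= alpha                              # no multiplicity control
reject_Holm <- p.adjust(p_raw, method = "holm") <= alpha   # controls the FWER
reject_BH   <- p.adjust(p_raw, method = "BH")   <= alpha   # controls the FDR

# pairs that are not rejected form the suggested pooling region
which(!reject_BH)
```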
Remark 1.
The decomposition of into hypotheses of smaller dimensionality is not unique. For instance, we may alternatively write
(12) |
where denotes an increasing sequence of regions with (for instance, with ). In practice, the sequence is supposed to be derived from some expert knowledge of the region of interest; it shall represent a sequence of possible pooling regions where is constructed from by adding the locations which are a priori ‘most likely’ homogeneous to the locations in . Note that, necessarily, , which provides an upper bound on the number of hypotheses to be tested.
The derivation of respective testing methods is straightforward. In view of the facts that the choice of the sequence is fairly subjective and that the eventual results crucially depend on that choice, we do not pursue the method any further.
3 Simulation Study
A large-scale Monte Carlo simulation study was conducted to assess the performance of the proposed bootstrap procedures in finite sample situations. We aim at answering the following questions:
(a) Regarding the test's power: What percentage of locations that are heterogeneous w.r.t. the location of primary interest can be expected to be identified correctly?

(b) Regarding the test's error rates: What percentage of locations that are homogeneous w.r.t. the location of primary interest can be expected to be wrongly identified as heterogeneous (FDR)? What is the probability of wrongly identifying at least one location that is homogeneous w.r.t. the location of interest as heterogeneous (FWER)?

(c) Regarding the chosen pooling regions: How does return level (RL) estimation based on the pooling regions proposed by the bootstrap procedures compare to RL estimation based on the location of interest only or the whole (heterogeneous) region?
The data was generated in such a way that the temporal spatial dynamics from the case study in Section 4 are mimicked. To achieve this, we started by fitting the scale-GEV model from Equation (2) to annual block-maxima of observations from 1950–2021 at 16 spatial locations in Western Europe (i.e., and ) that are arranged in a grid; see Figure 1 and the additional explanations in Section 4. The locations correspond to the center points of the grid cells; the distance between the center points of two neighbouring grid cells is approximately 140 km. The location-wise GEV parameter estimates exhibit the following approximate ranges over : with a mean of , with a mean of 5.3, with a mean of 0.08 and with a mean of 1.5. Fitting the scale-GEV model to the full pooled sample of size , we obtained parameter estimates that were close to the means over the location-wise parameter estimates, with for location, scale, shape and trend parameter, respectively. Next, we transformed the margins to (approximate) unit Fréchet by applying the transformation from Equation (9), such that we can fit several max-stable process models to the transformed data. The best fit was Smith’s model with approximate dependence parameters ; see davisonModelSpatExt for details on the model.
[Figure 1: the 16 grid cells used in the simulation study and their partition into homogeneous and heterogeneous areas for the two heterogeneous scenarios.]
Based on these model fits, we chose to generate data with the following specifications: first, the sample size was fixed to 72 (matching the observational period 1950–2021) and the regional grid was chosen as described before, i.e., as a 4 × 4 grid of 16 locations. The grid cell/location labelled '10' is chosen as the one of primary interest. Further, the dependence structure is fixed to that of Smith's model with (approximately) those parameters that gave the best fit on the observed data. For simulating data, we first simulate from this max-stable process model (SpatialExtremes) and then transform the margins to scale-GEV distributions, either in a homogeneous or in a heterogeneous manner. Here, the globally homogeneous model is defined by fixing the marginal scale-GEV parameters to approximately the mean values of the location-wise GEV parameters obtained for the real observations, i.e.,
(13) |
for each .
Starting from this homogeneous model, we consider two different heterogeneous scenarios. In the first scenario, we fix as in Equation (13) for all , while
(14) |
for with . Note that this defines different heterogeneous models. In the second scenario, we consider the same construction with and . An illustration of the grid cells and their partition into homogeneous and non-homogeneous areas can be found in Figure 1. Overall, we obtain 448 different heterogeneous models and one homogeneous model.
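A sketch of this data-generating mechanism in R, using SpatialExtremes for the max-stable simulation; the Smith covariance parameters, the covariate series and the GEV location value below are placeholders rather than the fitted values referred to in the text.

```r
library(SpatialExtremes)  # rmaxstab()
library(evd)              # qgev()

set.seed(1)
n <- 72
coord <- as.matrix(expand.grid(x = 1:4, y = 1:4))   # 4 x 4 grid of cell centres
gmst  <- seq(-0.3, 1.2, length.out = n)             # illustrative covariate

# unit-Frechet fields from Smith's model ("gauss"); placeholder covariance
z <- rmaxstab(n, coord, cov.mod = "gauss", cov11 = 1, cov12 = 0, cov22 = 1)

# transform all margins to the same scale-GEV distribution (homogeneous model)
mu0 <- 20; sigma0 <- 5.3; gamma0 <- 0.08; alpha <- 1.5
mu_t    <- mu0    * exp(alpha * gmst / mu0)
sigma_t <- sigma0 * exp(alpha * gmst / mu0)
x <- qgev(exp(-1 / z), shape = gamma0) * sigma_t + mu_t
x <- matrix(x, nrow = n)                            # n x 16 homogeneous sample

# heterogeneous scenarios rescale the parameters of selected grid cells before
# the marginal transformation is applied
```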
For each of the 449 models, we now apply the following three bootstrap procedures, each carried out with bootstrap replications (recall that the grid cell of interest is the one labelled with 10):
(B1) The bootstrap procedure from Algorithm 1, applied to the full set of 16 locations.

(B2) The bootstrap procedure from Algorithm 1, applied to all pairs consisting of the location of interest and one of the remaining locations.

(B3) The bootstrap procedure from Algorithm 2, applied to all pairs consisting of the location of interest and one of the remaining locations.
Note that the second and third method both yield 15 raw p-values. Each procedure was applied to 500 simulated samples from all models under consideration.
Regarding (B1), we compute the percentage of rejections among the 500 replications, which represents the empirical type I error of the test under the homogeneous model and the empirical power under the heterogeneous models. The results can be found in Figure 2. The null hypothesis is met in the central square only, and we observe that the nominal level of is perfectly matched. All non-central squares correspond to different alternatives, and we observe decent power properties in both scenarios. Note that a rejection only implies that the entire region is not homogeneous; there is no information on possible smaller subgroups that are homogeneous to the location of interest.
Regarding (B2) and (B3), rejection decisions were obtained for each hypothesis by one of the three methods from Section 2.5. The empirical family-wise error rate is then the percentage of cases (over 500 replications) for which at least one null hypothesis was rejected. Likewise, for the false discovery rate, we calculate, for each replication, the number of false rejections and divide that by the total number of rejections (when the number of total rejections is 0, this ratio is set to 0). The empirical false discovery rate is obtained by taking the mean over all 500 replications. Similarly, for assessing the power properties, we calculate the empirical proportion of correct rejections (i.e., among the 2 or 7 locations that deviate, the proportion of detected heterogeneous locations) over all 500 replications.
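These summaries can be computed from a matrix of rejection decisions as in the following R sketch; the function and argument names are illustrative.

```r
# rejections:    logical matrix, one row per simulated sample and one column
#                per tested pair (e.g. 500 x 15)
# is_false_null: logical vector flagging the pairs that violate the null
summarise_errors <- function(rejections, is_false_null) {
  V <- rowSums(rejections[, !is_false_null, drop = FALSE])  # false rejections
  R <- rowSums(rejections)                                   # all rejections
  S <- rowSums(rejections[,  is_false_null, drop = FALSE])   # correct rejections
  c(FWER  = mean(V >= 1),
    FDR   = mean(ifelse(R > 0, V / R, 0)),
    power = mean(S) / max(1, sum(is_false_null)))
}
```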
Results for the false discovery and family-wise error rate are given in Table 1. We find that the p-value combination methods from Section 2.5 are sufficiently accurate: the BH method controls the false discovery rate, while Holm's method controls the family-wise error rate. This holds exactly for procedure (B3), where the maximal FDR (FWER) of the BH (Holm) method is 9.4% (8.7%), and approximately for (B2), where the maximal FDR (FWER) is 12.2% (12.6%). Further, we see that the IM procedure controls neither the FWER nor the FDR.
All values are in percent; within each cell, the first number refers to bootstrap procedure (B2) and the second to (B3).

Scenario 1:

| Method | min FDR | max FDR | mean FDR | min FWER | max FWER | mean FWER |
|---|---|---|---|---|---|---|
| BH | 7.3 / 5.6 | 12.2 / 9.4 | 9.4 / 7.5 | 9.1 / 6.9 | 21.2 / 19.0 | 14.4 / 11.8 |
| Holm | 3.0 / 2.3 | 11.7 / 8.3 | 7.1 / 5.1 | 6.9 / 5.0 | 12.6 / 8.7 | 9.5 / 6.6 |
| IM | 25.8 / 25.3 | 61.7 / 60.1 | 37.7 / 37.2 | 53.4 / 53.4 | 64.5 / 62.8 | 59.1 / 58.3 |

Scenario 2:

| Method | min FDR | max FDR | mean FDR | min FWER | max FWER | mean FWER |
|---|---|---|---|---|---|---|
| BH | 3.6 / 2.4 | 12.1 / 8.9 | 5.6 / 4.9 | 6.2 / 4.2 | 32.0 / 29.8 | 18.6 / 16.8 |
| Holm | 1.0 / 0.9 | 11.3 / 7.9 | 3.4 / 2.6 | 3.8 / 2.4 | 11.3 / 7.9 | 7.1 / 5.1 |
| IM | 7.9 / 7.8 | 61.5 / 60.1 | 16.0 / 15.8 | 40.9 / 40.9 | 61.5 / 60.1 | 47.3 / 46.4 |
The power properties for procedure (B2) combined with the BH method are shown in Figure 3. We see that the procedure is able to detect some of the deviations from the null hypothesis, with more correct rejections the stronger the deviation. The method is particularly powerful when the location and scale parameters deviate in opposite directions. There is no obvious pattern regarding the deviations of the shape and trend parameters. Further, we analogously show the power properties of the IM method with bootstrap (B2) in Figure 3. As expected, this method has more power against all alternatives under consideration. However, this comes at the cost of more false discoveries, as can be seen in Table 1.
The results for bootstrap scheme (B3) were very similar and are therefore not shown here, but can be found in Section B of the supplementary material. Likewise, we omit the results for the more conservative Holm procedure, which exhibits, as expected, less power against all alternatives. Further, we repeated the simulation study with an increased location-wise sample size of . As one would expect, the tests have more power in this case.
The results presented so far show that the proposed pooling methods work ‘as intended’, since the theoretical test characteristics are well approximated in finite sample situations, and since we observe decent power properties. In practical applications however, spatial pooling of locations is usually the starting point for subsequent analyses. For instance, one may be interested in estimating return levels at the location of interest based on the data from all locations that were identified as homogeneous. Moreover, the analysis of alternative data sets like climate model data may be based on the homogeneous locations identified within the analysis of observations.
This suggests that the methods should be examined with regard to their quality in subsequent analyses. For that purpose, we consider, as an example, the problem of return level estimation at the location of interest. The state-of-the-art method would consist of GEV fitting at the location of interest only, which results in (asymptotically) unbiased estimators that suffer from large variance. Basing the estimator on pooled regions will decrease the variance, but at the same time increase its bias if some heterogeneous locations have been wrongly identified as homogeneous.
In particular, pooling based on a conservative testing approach like the BH procedure leads to the acceptance of many locations and thus to a large pooling area and low estimation variance. Most likely, some of the chosen locations will be violating the null hypothesis though, which yields a rather large estimation bias. For pooling based on a more liberal rejection approach like the IM procedure, the estimation bias and variance behave exactly opposite: since the null hypotheses are more likely to be rejected, the resulting pooling sample is smaller (i.e., larger estimation variance) but ‘more accurate’ (i.e., smaller estimation bias).
For our comparison, we consider fitting the scale-GEV model based on pooled locations that have been obtained from one of the following eight methods
Here, LOI refers to considering the location of interest only (no pooling), full refers to pooling all available locations, and the last six methods encode pooling based on any combination of the proposed p-value combination methods and bootstrap approaches.
For each method, we compute the maximum likelihood estimate of the scale-GEV model parameters and transform this to an estimate of the -year return level (RL) in the reference climate of year by
where and and where is the cumulative distribution function of the GEV-distribution, see Equation (1). Now, in our simulation study, we know that the true value of the target RL is given by with
From the 500 replications we can therefore compute the empirical Mean Squared Error (MSE) of method as
where denotes the estimated RL obtained in the -th replication with method . Note that we have suppressed the MSE’s dependence on and from the notation.
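For concreteness, a return-level estimate and the empirical MSE can be computed as in the following R sketch, which assumes the scaling form discussed in Section 2.1; all names are illustrative.

```r
library(evd)  # qgev()

# return level of the scale-GEV model for a given return period, evaluated in
# the reference climate of a year with covariate value gmst_ref, given a
# fitted parameter vector (mu0, sigma0, gamma0, alpha)
return_level <- function(par, period = 100, gmst_ref) {
  mu_ref    <- par[1] * exp(par[4] * gmst_ref / par[1])
  sigma_ref <- par[2] * exp(par[4] * gmst_ref / par[1])
  qgev(1 - 1 / period, loc = mu_ref, scale = sigma_ref, shape = par[3])
}

# empirical MSE of one method over the simulation replications
mse <- function(rl_hat, rl_true) mean((rl_hat - rl_true)^2)
```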
In Figure 4 we compare MSEs of the 100-year RL with reference climate as in year 2021, by plotting differences of the MSEs obtained for selected pairs of methods. The plots reveal that both the MS BH and the MS IM method are superior to the LOI fit for almost all scenarios. Comparing the two methods to the full fit reveals that there are certain scenarios for which the full fit performs substantially worse, mostly when the shape and scale parameter deviate in the same direction under the alternatives. For those scenarios where the full fit outperforms the two methods, the discrepancy is not very large, with the BH method performing slightly better than the IM method.
A comparison between MS BH and MS IM is shown in Figure 5 for . The results reveal that the BH method slightly outperforms the IM method in the case for almost all alternative scenarios. In case , the results are quite mixed, with the IM method becoming clearly superior when the shape, scale and location parameters deviate jointly to the top. In all other scenarios, the differences are only moderate, sometimes favoring one method and sometimes the other. Corresponding results for the bootstrap methods based on bivariate extreme value distributions are very similar and therefore not shown. Further, the results were found to be robust against the choices of and that were made here for the return level estimation.
Overall, the results suggest the following practical recommendation: if the full sample is expected to be quite homogeneous a priori (for instance, because it was built based on expert knowledge), then estimation based on BH-based pooling is preferable over the other options (LOI, the full and the IM-based fit). If one expects to have a larger number of heterogeneous locations, it is advisable to apply the IM procedure (or any other liberal procedure), which likely rejects most of the heterogeneous locations and hence reduces the bias. In general, the liberal behavior of IM-based pooling suggests its use when it is of highest practical interest to obtain a pooled region that is as homogeneous as possible (as a trade-off, one has to accept that the region is probably much smaller than the initial full region).
4 Severe flooding in Western Europe during July 2021 revisited
We illustrate the new pooling methods in a case study by revisiting the extreme event attribution study for the heavy precipitation event that led to severe flooding in Western Europe during July 2021, see Kre21; Tra22. In that study, observational data were pooled together based on expert knowledge and on ad hoc tests, with the ultimate goal of assessing the influence of human-made climate change on the likelihood and severity of similar events in Western and Central Europe.
The full region under investigation in Kre21; Tra22 consists of sixteen grid cells reaching from the northern Alps to the Netherlands, see Figure 5 in Kre21 or the right-hand side of Figure 6. Two of the 16 locations were rejected in that study due to expert knowledge and too large deviations in fitted GEV parameters (regions 17 and 11 in Figure 6). Among other things, our illustrative application of the methods explained above will reveal that grid cell 11 has been rightfully dismissed, while grid cell 17 might have been considered homogeneous. Further, there is no clear evidence that any other grid cell that has been declared homogeneous should rather be considered non-homogeneous.
[Figure 6: the two initial areas of the case study: the larger grid of area (A) on the left and the 16 grid cells of area (B) from Kre21 on the right.]
For illustrative purposes, we apply our methods to two different initial areas:
(A) An area consisting of 36 grid cells covering a large part of Western/Central Europe, as shown in Figure 6 on the left.

(B) The original 16 grid cells from Kre21, as shown in Figure 6 on the right.
Note that homogeneity for the 20 grid cells at the boundary of the larger area in (A) has been dismissed based on expert knowledge in Kre21; the larger area is included here for illustrative purposes only.
The data used throughout the study consists of April-September block-maxima of tile-wise averaged 1-day accumulated precipitation amounts of the E-OBS data set (cornes2018ensemble, Version 23.1e). In both cases, the grid cell with label 21 is the one of primary interest, since it is the one containing the target location of the study, i.e., the region that accumulated the highest precipitation sum and experienced the largest impacts during the flooding of July 2021. The time series are shown in Figure C.1 in the supplementary material. There, we also plot values of obtained from different data sets: once from data of location 21 only, once from data of the respective location only, and once from the pooled data of the respective pair for .
We apply the two proposed bootstrap procedures to areas (A) and (B). Note that the raw p-values obtained with the bootstrap based on bivariate extreme value distributions should be very similar (or even identical when using the same seed for random number generation) for those grid cells that appear in both areas, while they may differ to a greater extent for the MS bootstrap. This is because the p-value for a given pair obtained with the bivariate bootstrap procedure only depends on the observations of the pair, while the respective p-value obtained with the MS bootstrap also depends on the spatial model that was fitted to the whole area. However, even if the raw p-value of a given pair obtained for setting (B) coincides with the raw p-value obtained for setting (A), the adjustment for multiple testing can lead to slightly different rejection decisions of the pair at a given level . The bootstrap procedures are applied with bootstrap replications.
We start by discussing the results of the application to the larger grid in (A). Recall that, for a given significance level, one rejects the null hypothesis for all grid cells whose p-value is smaller than that level. To visualise the results, we therefore shade the grid cells according to the magnitude of their (adjusted) p-value. Here, we divide the colour scale into three groups, namely p-values of at most 5%, between 5% and 10%, and above 10%, with a dark red tone assigned to the first group, a brighter red tone for the second group and an almost transparent shade for the third group. This allows us to read off the test decisions for significance levels of 5% and 10%: when the significance level is chosen as 10%, all tiles with a reddish shade are rejected, while when working with a level of 5% only tiles in the dark shade are rejected.
Results for both the conservative BH procedure and the liberal IM procedure are shown in Figure 7. For completeness, results on Holm’s method, which is even more conservative than BH, as well as the BH and IM p-values themselves can be found in the supplementary material, Tables C.2 and C.3. One can see that, for a given rejection method (i.e. BH or IM), the MS and bivariate procedures mostly agree on the rejection decisions that would be made at a level of 10% (compare the rows of Figure 7 to see this). The same holds when working with a significance level of 5%.
Further, as expected, the IM method rejects more hypotheses than the BH method. However, according to the results of the simulation study, it is quite likely that at least one of these rejections is a false discovery.
[Figure 7: shaded maps of the (adjusted) p-values for the larger grid of area (A), one panel per combination of rejection method (BH, IM) and bootstrap procedure (max-stable, bivariate).]
Analogous results for the grid in (B) are shown in Figure 8. As discussed above, except for the MS BH method, the results are consistent with those obtained for the larger grid in the sense that, for those locations which are contained in both grids, the locations with p-values of critical magnitude (below 10%) coincide (compare the plot in the upper right corner of Figure 8 to the plot in the upper right corner of Figure 7 to see this for the MS IM method, and similarly for the other methods). For the MS BH method, grid cells 10, 14, 15, and 16 are no longer significant at a level of 10%, but we recorded an adjusted p-value of 0.106 for those four grid cells, so this is a rather tight decision. The p-values obtained for the grid in (B) can be found in Table C.1 in the supplementary material.
[Figure 8: as Figure 7, but for the 16 grid cells of area (B).]
Let us now move on to the interpretation: considering the larger grid first, some grid cells for which the characteristics of extreme precipitation are different (according to expert opinion) from the grid cell of the target location are detected as heterogeneous. These rejected grid cells are located along the coast and in the mountainous terrain. Comparing the results with Kre21; Tra22, we observe that grid cell 11 has been rejected in their study based on expert knowledge. For grid cell 17, however, we do not detect any statistical evidence that the probabilistic behavior of extreme precipitation is different from the grid cell of the target location, even when applying the liberal IM procedure. We would like to stress though that non-rejection of a null hypothesis does not provide any evidence of the null hypothesis, even when ignoring the multiple testing issue. Hence, expert knowledge that leads to rejection should, in general, outweigh any statistical non-rejection. This particularly applies to the eastern (continental) grid cells in the larger 6 × 6 grid, which can be influenced by heavy precipitation caused by different synoptic situations compared to the target region.
Moreover, as the results for locations 10, 14, 15, and 16 showed some discrepancy across the different testing procedures, we suggest that the final decision on the exclusion or inclusion of these locations in a spatial pooling approach should be based on expert knowledge of the meteorological characteristics, and the willingness to trade possible bias for variance (with a possibly larger bias when including the locations – note that statistical evidence against homogeneity in the bivariate extreme value distribution-based bootstrap is only weak, and wrongly declaring the regions as homogeneous is possibly not too harmful). The same holds for locations 9, 20, 23 and 27 for which only the IM method yielded p-values between 5% and 10%. Again, these rather small p-values could be subject to false discoveries though, and since the heterogeneity signal is also not too strong, there is no clear evidence that these need to be excluded from pooling.
For a last evaluation of results from pairwise tests, we estimated the 100-year RLs in the reference climate of the year 2021, i.e. with reference value , on five different data sets obtained from the grid. Here, we use the data sets consisting of data from
• the location of interest only,
• pooling those grid cells suggested by the results of the case study (i.e., all cells but 11, or all cells but 10, 11, 14, 15, 16) or expert opinion (i.e., all cells but 11, 17),
• pooling all grid cells of the grid.
The results can be found in Table 2 and reveal that excluding cell 11 has a clear effect on the estimated RL. Excluding or including grid cell 17 once cell 11 is excluded does not have a large effect, while additionally excluding cells 10, 14, 15 and 16 has a moderate effect.
| cells excluded | location | scale | shape | trend | 100-year RL (2021) |
|---|---|---|---|---|---|
| none | 20.37 | 5.80 | 0.1039 | 1.50 | 58.43 |
| 11 | 20.01 | 5.44 | 0.0676 | 1.45 | 52.74 |
| 11, 17 | 20.01 | 5.40 | 0.0760 | 1.29 | 52.82 |
| 10, 11, 14, 15, 16 | 19.90 | 5.41 | 0.0484 | 1.79 | 51.93 |
| all but 21 | 21.92 | 6.08 | 0.0634 | | 54.37 |
Finally, we would like to mention that similar results were obtained when applying the BH test procedures to all triples containing the pair of grid cells , i.e., the extended target region considered in the study of Kre21; Tra22, consisting of those regions in Germany and Belgium affected worst by the July 2021 flooding.
5 Extensions
In this section, we discuss how to estimate region-wise return levels under homogeneity assumptions (Section 5.1). We also propose two possible extensions of the pooling approach from the previous sections to other hypotheses (Section 5.2) or other underlying model assumptions (Section 5.3).
5.1 Estimation of regional return levels and return periods
As pointed out in Kre21; Tra22 among others, an estimated return period (RP) of years for a given event and in a fixed reference climate (e.g., the preindustrial climate), obtained based on a fit of the GEV distribution to a pooled homogeneous sample, has the following interpretation: for each fixed location/tile within the region, one can expect one event of the same or larger magnitude within (imaginary) years of observing the reference climate. We refer to this quantity as the local return period. Obviously, one would expect more than one event of similar magnitude happening at at least one of the locations of the pooling region. Likewise, for a given , one would expect a higher -year return level for the whole region. The latter corresponds to the value that is expected to be exceeded only once in years at at least one of the locations.
Mathematically, using the notation from Section 2.1, the exceedance probability of value at at least one among locations in the reference climate corresponding to year is given by
such that the return period for event of the region is . Further, the -year return level of the region in the climate corresponding to year is the minimal value for which
holds. Both quantities could be computed (exactly) if one had access to the distribution of . For example, if the random variables were independent, could be further simplified to
where is the distribution function of the GEV distribution and where and denote the parameters at reference climate of year from Equation (2) under the homogeneity assumption from Equation (3).
The locations are, however, usually not independent in applications. In the following, we propose a simulation-based estimation method that involves max-stable process models to account for the spatial dependence. As before, the R package SpatialExtremes (SpatialExtremes) allows for fitting and simulating max-stable process models.
Algorithm 3.
(Simulation-based estimation of the regionwise RL and RP)
(1) Fit the scale-GEV parameters to the pooled homogeneous sample, resulting in an estimated parameter vector.

(2) Transform the margins of the pooled data to approximately unit Fréchet by applying the transformation from Equation (9) with the parameter estimate from step (1). Then fit several max-stable process models to the obtained data and choose the best fit according to the information criterion CLIC.

(3) Replicate the following steps a large number of times:
(i) Generate one random observation from the chosen max-stable process model.
(ii) Transform the margins to GEV margins by applying the transformation in (10) with the parameters estimated in step (1).
(iii) Compute the maximum of the resulting observation over all locations in the pooling region.

(4) The region-wise return level for a given return period, and the return period of an event of a given magnitude, can now be estimated from the empirical cumulative distribution function of the sample of simulated region-wise maxima (via its empirical quantiles and empirical exceedance probabilities, respectively).
In particular, once we have estimated the local 100-year RL, we can estimate the return period that this event has for the whole region. Likewise, when we have an estimate of the local return period of an event of a given magnitude, we can estimate the event magnitude corresponding to that return period for the whole region.
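A compact R sketch of Algorithm 3, with placeholder dependence parameters, GEV parameters and region size standing in for the quantities fitted in steps (1) and (2); the numbers it produces are therefore illustrative only.

```r
library(SpatialExtremes)  # rmaxstab()
library(evd)              # qgev()

set.seed(7)
N <- 2000                                           # simulation replications
coord <- as.matrix(expand.grid(x = 1:4, y = 1:4))   # pooling region (16 cells)
loc0 <- 20; scale0 <- 5.4; shape0 <- 0.07           # reference-climate GEV

# step (3): simulate fields, give them the fitted GEV margins and take the
# region-wise maximum of each replication
z <- rmaxstab(N, coord, cov.mod = "gauss", cov11 = 1, cov12 = 0, cov22 = 1)
x <- qgev(exp(-1 / z), loc = loc0, scale = scale0, shape = shape0)
region_max <- apply(matrix(x, nrow = N), 1, max)

# step (4): region-wise return period of a value v, and region-wise RL for a
# given return period
rp_region <- function(v) 1 / mean(region_max > v)
rl_region <- function(period) quantile(region_max, 1 - 1 / period)

rp_region(52.74)   # e.g. region-wise RP of a local 100-year return level
rl_region(100)     # region-wise 100-year return level
```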
We illustrate the estimators for the pooled data sets from Section 4. The estimates are based on simulation replications and are shown in Table 3. We see that the local 100-year return levels have substantially shorter region-wise return periods. In the region with 15 tiles (only cell 11 excluded), the estimated local 100-year RL at reference climate of 2021 can be expected to be exceeded once in approximately 19 years in at least one of the 15 tiles. We find a similar region-wise return period for the pooling region consisting of 14 tiles. In the pooling region consisting of 11 tiles, the local 100-year return level becomes a region-wise 33-year event. This comparably larger value arises from the smaller region that is considered: the smaller the region, the less likely it is that one of the locations exceeds a high threshold. Further, as expected, we find that the region-wise 100-year return levels at reference climate of 2021 are larger than their local counterparts. For the regions consisting of 15 and 14 tiles, this increase is approximately 26%, while it is approximately 17.3% for the region consisting of 11 tiles.
| cells excluded | local 100-year RL | region-wise RP of the local 100-year RL (years) | region-wise 100-year RL |
|---|---|---|---|
| 11 | 52.74 | 18.90 | 66.40 |
| 11, 17 | 52.82 | 18.32 | 67.08 |
| 10, 11, 14, 15, 16 | 51.93 | 32.76 | 60.93 |
5.2 A homogeneous scaling model with location-wise scaling factor
In this section, we maintain the temporal dynamics from the scale-GEV model from Equation (2). However, instead of testing for the homogeneity assumption from Equation (3), we additionally allow for a location-wise scaling factor under the null hypothesis. Such an approach can be useful when it is known that observations from different locations occur on different scales, but, apart from that, show a common probabilistic behaviour. In fact, a stationary version of the following model is commonly used in hydrology, where it is known as the Index Flood approach (dalrymple1960flood).
More precisely, suppose that
(15) |
where is a location-specific scaling factor that we may fix to 1 at the location of primary interest (say , i.e., ). Writing , the model in Equation (15) can be rewritten as
where
(16) |
Note that the parameters satisfy the relationships
for certain parameters ; in particular, can be retrieved from (note that the constraint on is not needed, but comes as a consequence of the first two relations). Fitting this model, instead of fitting the scale-GEV distribution to each location separately, has the advantage of substantially reducing the number of parameters that need to be estimated. Once the local scaling factors are identified, we can bring all observations to a common scale by dividing each location's observations by its location-specific scaling factor.
Now one can test whether such a local scaling model holds on a subset with cardinality , by testing the hypothesis
(17) |
with a Wald-type test statistic. In this case, the latter is defined as
(18) |
where is given by
with Jacobian matrix , since the hypothesis in Equation (17) may be rewritten as
When considering this kind of modification, the bootstrap algorithms from Section 2.4, steps (5)-(7), must be adapted accordingly. In step (5), one has to estimate the parameter under the constraint of the considered null hypothesis by adapting the log-likelihood. The estimated parameters are then used in the transformation step (6). Moreover, the test statistic in steps (6) and (7) is replaced by the statistic from (18). Further details are omitted for the sake of brevity.
5.3 General homogeneous models with smooth parametrization
In this section, we consider general GEV models in which the location, scale and shape parameters are allowed to depend in a (fixed) differentiable way on some parameter vector and some temporal covariate with . More precisely, suppose that and are (known) real-valued functions of and that are differentiable with respect to their first argument . We assume that, for each , there exists an unknown parameter such that with
The global null hypothesis of interest within this model is assumed to be expressible as for a differentiable function with .
An example is given by the linear shift model that is frequently considered when modelling temperature extremes in Extreme Event Attribution studies (see Phi20), where
A possible global null hypothesis of interest could be
where and .
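For orientation, the linear shift model referred to above is commonly written as follows; the symbols are chosen here for illustration and the stated hypothesis is only one example of the equality constraints one might impose:
\[
\mu(t) = \mu_0 + \alpha\,\mathrm{GMST}'_t, \qquad \sigma(t) = \sigma_0, \qquad \gamma(t) = \gamma_0,
\]
so that a global null hypothesis of space-time homogeneity again states that the location-wise parameter vectors $(\mu_0, \sigma_0, \gamma_0, \alpha)$ coincide across all locations under consideration.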
When considering this kind of extension, one has to adapt the maximum likelihood estimator as well as the estimator of its covariance matrix, hence steps (1)-(2) and (5)-(7) in the bootstrap algorithms are affected. Further details are omitted for the sake of brevity.
6 Conclusion
Extreme event attribution studies can build upon a GEV scaling model. Depending on the analysed variable, it may be useful to apply spatial pooling and fit the GEV distribution to a pooled sample of observations collected at sufficiently homogeneous spatial locations, as has been done in Kre21; Tra22; Vau15, among others. Here, we propose several statistical methods that enable the selection of a homogeneous pooling region from a larger initial region. The BH approach was found to be quite conservative, hence some heterogeneous locations are likely to be declared homogeneous. The IM approach is a more liberal alternative with a higher proportion of rejected locations, which may contain some homogeneous ones. In subsequent analyses, the selected pooling region typically results in a classical bias-variance trade-off: the larger the pooling region, the smaller the variance. At the same time, the bias may be larger, given that some heterogeneous regions may have been declared homogeneous. In practice, the tests should always be complemented by expert knowledge on the driving meteorological/climatological background processes.
To make the statistical approach for selecting homogeneous pooling regions for attribution studies, as described here, usable for the extreme event attribution community, we have developed a software package that can be freely downloaded and used by applied researchers (findpoolreg). The selection of spatial pooling regions for attribution studies may hence be based on a combination of expert knowledge and thorough statistical tests. The experts applying the methods can thereby decide between a liberal approach, which tends to reject more locations, and a conservative approach, which tends to accept more locations as being homogeneous. This decision depends on the a priori knowledge about the meteorology of the analysed area and the specific requirements of the study.
If the ultimate interest is estimation of, for example, return levels, one may, as an alternative to the classical approach based on pooling selected locations, consider penalized maximum likelihood estimators with a penalty on large heterogeneities (Buc21). A detailed investigation of the resulting bias-variance trade-off would be a worthwhile topic for future research.
Declaration
Ethical Approval
Not applicable.
Availability of supporting data
All methods are implemented in the R package findpoolreg (findpoolreg) that is publicly available at https://github.com/leandrazan/findpoolreg. The data used throughout the case study is derived from the E-OBS gridded data set, publicly available at https://www.ecad.eu/download/ensembles/download.php.
Competing interests
The authors declare that they have no conflict of interest.
Funding
This work has been supported by the integrated project “Climate Change and Extreme Events - ClimXtreme Module B - Statistics (subprojects B1.2, grant number: 01LP1902B and B3.3, grant number: 01LP1902L)” funded by the German Federal Ministry of Education and Research (BMBF).
Authors’ contributions
LZ and AB wrote the main manuscript and worked out the mathematical details. LZ implemented all methods and carried out the Monte Carlo simulation study and the case study. FK, PL and JT contributed by extensive discussions, provided the data for the case study and improved the text.
Acknowledgements
Computational infrastructure and support were provided by the Centre for Information and Media Technology at Heinrich Heine University Düsseldorf, which is gratefully acknowledged.
References
SUPPLEMENT TO THE PAPER:
“Regional Pooling in Extreme Event Attribution Studies: an Approach Based on Multiple Statistical Testing”
Leandra Zanger, Axel Bücher, Frank Kreienkamp,
Philip Lorenz, Jordis Tradowsky
This supplement contains mathematical details on the maximum likelihood estimator and the estimation of its covariance matrix, as well as additional simulation results and further details on the case study. References like (1) refer to equations from the main paper, while references like (A.1) or Figure B.2 refer to equations or Figures within this appendix.
Appendix A Mathematical Details
A.1 Maximum Likelihood estimation
Throughout this section, we provide mathematical details on the coordinate-wise maximum likelihood estimator from Equation (6). In particular, we motivate the approximate normality of the estimator, with mean equal to the true parameter vector and covariance matrix as defined in Equation (7). As in the stationary GEV model, the derivations require the shape parameters to be larger than $-1/2$, see BucSeg17.
We start with some explicit formulas for the functions appearing in Equations (6) and (7). For that purpose, let $\ell_{(\mu,\sigma,\gamma)}$ denote the log density of the plain GEV distribution with location $\mu$, scale $\sigma>0$ and shape $\gamma$ (see Appendix B in BucSeg17), i.e.,
\[
\ell_{(\mu,\sigma,\gamma)}(x) = -\log\sigma - \Big(1+\frac{1}{\gamma}\Big)\log\Big(1+\gamma\,\frac{x-\mu}{\sigma}\Big) - \Big(1+\gamma\,\frac{x-\mu}{\sigma}\Big)^{-1/\gamma} \tag{A.1}
\]
for all $x$ such that $1+\gamma(x-\mu)/\sigma>0$; here and in what follows, the case $\gamma=0$ is to be interpreted as the limit $\gamma\to 0$.
Then, writing the covariate-dependent GEV parameters as functions of the model parameters, the log-density from Equation (5) may be expressed as a composition of $\ell$ with these covariate-dependent parameters; we refer to this representation as Equation (A.2).
We next derive formulas for the gradient and the Hessian of the log-density with respect to the model parameters. For that purpose, let $\nabla\ell$ and $\nabla^2\ell$ denote the gradient and the Hessian, respectively, of the standard GEV log density (see Appendix B in BucSeg17 for precise formulas). Note that, since the GEV distribution is a location-scale family, the gradient and the Hessian of $\ell_{(\mu,\sigma,\gamma)}$ can be expressed in terms of $\nabla\ell$ and $\nabla^2\ell$ evaluated at the standardized observation $(x-\mu)/\sigma$, after suitable rescaling; we refer to these identities as Equations (A.3) and (A.4), respectively.
Next, consider the function that maps the model parameter vector (together with the covariate) to the covariate-dependent GEV parameters; its Jacobian with respect to the model parameters can be written down explicitly.
Then, in view of Equations (A.2) and (A.3), the multivariate chain rule yields an explicit expression for the gradient of the log-density with respect to the model parameters; we refer to this expression as Equation (A.5).
In view of the multivariate product rule, this further implies a corresponding expression for the Hessian, which we refer to as Equation (A.6); there, Equations (A.3) and (A.4) are used for the second equality, and an additional term involves the Jacobians of the rows of the aforementioned Jacobian matrix (i.e., second derivatives with respect to the model parameters). The latter can be derived explicitly by a tedious but straightforward calculation; we omit precise formulas for the sake of brevity.
We next motivate the claimed normal approximation. First of all, in view of the differentiability of the log-likelihood, the vector of maximum likelihood estimators is necessarily a zero of the gradient of the log-likelihood function, i.e., a solution of the score equations. Denoting the true parameter vector by $\theta_0$, a Taylor expansion of the score around $\theta_0$, with negligible higher-order terms, can be solved for the estimation error: the deviation of the maximum likelihood estimator from $\theta_0$ is approximately equal to the negative inverse Hessian of the log-likelihood at $\theta_0$, applied to the score at $\theta_0$.
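For concreteness, a schematic version of this standard expansion in generic notation, which may differ from the main paper's (here, $L_n$ denotes the joint log-likelihood, $\hat\theta_n$ the maximum likelihood estimator and $R_n$ the negligible remainder), reads
\[
0 = \nabla L_n(\hat\theta_n) = \nabla L_n(\theta_0) + \nabla^2 L_n(\theta_0)\,(\hat\theta_n-\theta_0) + R_n
\qquad\Longrightarrow\qquad
\hat\theta_n-\theta_0 \approx -\big[\nabla^2 L_n(\theta_0)\big]^{-1}\,\nabla L_n(\theta_0).
\]
The approximate normality then follows from a law of large numbers for the (suitably normalized) Hessian and a central limit theorem for the score, as discussed next.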
By Equation (A.6), each block matrix on the block-diagonal of the Hessian of the log-likelihood is a sum, over the observation years, of terms of the form given in Equation (A.7): a product of a suitable function of the temporal covariate and a suitable function of the standardized observation defined in Equation (A.8), where the standardized observations follow a standard GEV distribution and are independent over the years. Under suitable assumptions on the covariates, the variance of such a (normalized) sum is of the order of the inverse sample size. As a consequence, each block is close to its expectation, which is the corresponding quantity defined just after Equation (7). More precisely, in an asymptotic framework where the covariates are assumed to be generated by a continuous function of rescaled time, expressions as in Equation (A.7) converge to an integral of the expectation of the summands over the rescaled time interval (note that both the integral and the expectation exist).
Next, consider the score, i.e., the gradient of the log-likelihood at the true parameter. It suffices to argue that we may apply a suitable version of the central limit theorem, under suitable assumptions on the covariates. Similarly as for the Hessian, by Equation (A.1), each entry of the score is a sum of terms that factor into a function of the temporal covariate and a function of the standardized observation from Equation (A.8). In view of the independence over the years and the fact that the standardized observations follow a standard GEV distribution, we may for instance apply the Lyapunov CLT for independent triangular arrays, see Theorem 27.3 in Bil95. Since, for the functions of interest, the summands possess finite moments of order $2+\delta$ for sufficiently small $\delta>0$, a sufficient condition for its applicability is a Lyapunov-type moment condition on the weighted sums, which readily follows, for instance, if one assumes that the covariates are generated by a continuous function of rescaled time.
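For reference, the Lyapunov condition in its generic form (for a triangular array of row-wise independent random variables $X_{n,1},\dots,X_{n,n}$ with finite variances and $s_n^2=\sum_{t=1}^n\operatorname{Var}(X_{n,t})$; the symbols are illustrative and not the main paper's notation) reads
\[
\lim_{n\to\infty}\ \frac{1}{s_n^{2+\delta}}\sum_{t=1}^{n}\mathbb{E}\big[\,|X_{n,t}-\mathbb{E}[X_{n,t}]|^{2+\delta}\,\big]=0
\qquad\text{for some }\delta>0,
\]
in which case $s_n^{-1}\sum_{t=1}^n \big(X_{n,t}-\mathbb{E}[X_{n,t}]\big)$ converges in distribution to a standard normal random variable.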
A.2 Covariance Estimation
Throughout this section, we provide an estimator for the covariance matrix defined in Equation (7). First of all, consider an (approximate) Hessian of the location-wise log-likelihood function, evaluated at its maximizer, possibly obtained by numerical differentiation. Note that this matrix is routinely returned by standard implementations for numerical optimization; for instance, the optim function in R returns the Hessian of the objective function at the optimum in the output value hessian when called with hessian = TRUE.
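As a minimal illustration (a hypothetical example, not code from findpoolreg), the following R snippet minimizes a simple Gumbel negative log-likelihood with optim and retrieves the numerically differentiated Hessian at the optimum via hessian = TRUE:

```r
# Negative log-likelihood of the Gumbel distribution (GEV with shape 0)
gumbel_nll <- function(par, x) {
  mu <- par[1]; sigma <- par[2]
  if (sigma <= 0) return(Inf)            # keep the scale parameter positive
  z <- (x - mu) / sigma
  sum(log(sigma) + z + exp(-z))
}

set.seed(3)
x   <- 10 + 2 * (-log(-log(runif(75))))  # toy sample of 75 Gumbel block maxima
fit <- optim(c(mu = 10, sigma = 2), gumbel_nll, x = x, hessian = TRUE)

fit$hessian          # (approximate) Hessian of the negative log-likelihood
solve(fit$hessian)   # naive covariance estimate for the parameter estimates
```

In the setting of the main paper, the location-wise log-likelihood would of course involve the covariate-dependent GEV parameters; the mechanics of extracting the Hessian are the same.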
It remains to estimate the cross covariances between the score vectors at different locations. By Equation (A.1), the score at each location may be written as a sum, over the observation years, of covariate-dependent weights applied to the gradient of $\ell$ from Equation (A.1), evaluated at the standardized observations from Equation (A.8), which follow a standard GEV distribution in every year. Note that the cross covariance of these standardized score contributions does not depend on the year, and may hence be estimated empirically after replacing the true parameters by their estimators: we use the empirical cross covariance matrix of the two samples of estimated per-year score contributions at the two locations under consideration. The final estimator for the covariance matrix from Equation (7) is then obtained by combining the (inverse) Hessians from above with these estimated cross covariances.
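To illustrate the combination step, here is a schematic R sketch with hypothetical inputs (not the findpoolreg implementation, and possibly using a different normalization than the main paper): S1 and S2 contain the estimated per-year score contributions of two locations as rows, and H1, H2 are the Hessians of the respective negative log-likelihoods at their optima.

```r
# Off-diagonal covariance block for the estimators at two locations:
# sandwich-type combination of inverse Hessians and the empirical
# cross covariance of the per-year score contributions.
cross_cov_block <- function(S1, S2, H1, H2) {
  stopifnot(nrow(S1) == nrow(S2))
  n   <- nrow(S1)
  B12 <- cov(S1, S2)                      # empirical cross covariance (p x p)
  solve(H1) %*% (n * B12) %*% solve(H2)   # approximate Cov(theta_hat_1, theta_hat_2)
}

# Toy usage with made-up score matrices and Hessians:
set.seed(4)
S1 <- matrix(rnorm(75 * 3), ncol = 3)
S2 <- 0.5 * S1 + 0.5 * matrix(rnorm(75 * 3), ncol = 3)
H1 <- H2 <- diag(3) * 75
cross_cov_block(S1, S2, H1, H2)
```

The exact normalization (factors of the sample size) must of course be chosen to match the definition of the covariance matrix in Equation (7).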
Appendix B Additional results of the simulation study
B.1 Additional results for the record length of the main simulation study
We report the power properties obtained with the BH and IM method for procedure (B3) in Figures B.1 and B.2, respectively. Power properties obtained with the Holm method are shown in Figures B.3 (for (B2)) and B.4 (for (B3)).
B.2 Additional results for a larger record length
Since the bootstrap procedures implicitly rely on the asymptotic distribution of the test statistic, we repeated the simulation study with a larger sample size in order to investigate the sample size's impact on their performance. The location-wise sample size is increased compared to the main study. Again, the FDR and FWER are reported (Table B.1), as well as the power plots for BH and (B2) in Figure B.5 and for IM and (B2) in Figure B.6. As expected, the error rates are again controlled by those methods that claim to do so, while the power has substantially increased (on average by 50% for the BH method and by 18% for the IM method). The results for the other methods and for (B3) were again similar.
Table B.1: FDR and FWER (in %) for procedures (B2) and (B3), obtained with the larger record length.

Method | min FDR (B2) | min FDR (B3) | max FDR (B2) | max FDR (B3) | mean FDR (B2) | mean FDR (B3) | min FWER (B2) | min FWER (B3) | max FWER (B2) | max FWER (B3) | mean FWER (B2) | mean FWER (B3) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Scenario 1: | | | | | | | | | | | | |
BH | 6.9 | 5.2 | 12.4 | 11.4 | 9.4 | 7.8 | 8.7 | 6.8 | 23.9 | 20.4 | 15.8 | 13.4 |
Holm | 2.7 | 1.4 | 10.3 | 8.0 | 6.4 | 4.6 | 6.8 | 3.9 | 12.9 | 9.4 | 9.5 | 6.7 |
IM | 26.4 | 25.7 | 60.3 | 59.9 | 35.2 | 34.6 | 53.4 | 52.7 | 64.4 | 63.8 | 59.2 | 58.3 |
Scenario 2: | | | | | | | | | | | | |
BH | 4.0 | 2.7 | 10.7 | 8.3 | 5.6 | 5.0 | 6.3 | 5.1 | 32.4 | 32.6 | 21.6 | 19.9 |
Holm | 0.8 | 0.5 | 10.3 | 6.5 | 2.9 | 2.1 | 4.2 | 3.0 | 11.2 | 8.0 | 7.1 | 5.2 |
IM | 8.2 | 7.8 | 60.1 | 59.7 | 14.2 | 13.8 | 41.8 | 40.9 | 60.1 | 59.7 | 47.9 | 46.6 |
Appendix C Additional results for the case study
The complete results of the bootstrap procedures applied to the smaller grid of 16 locations can be found in Table C.1. For the larger grid of 36 locations, the complete results of the bootstrap based on bivariate extreme value distributions can be found in Table C.2, and the results for the bootstrap procedure based on max-stable processes in Table C.3.
The time series used throughout the case study are shown in Figure C.1. Along with the Apr-Sep maxima of 1950-2021, we plot the fitted time-varying location parameter, where the underlying model parameters are estimated on the data of location 21 only (blue line), on the data of the respective location only (red line), or on the pooled data of the pair (green line). Note that these three lines should not differ much when the homogeneity assumption holds. On the other hand, perfectly matching lines do not imply that the homogeneity assumption is true, since they do not carry any information on the scale and shape parameters of the distributions.
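A diagnostic plot of this kind can be produced along the following lines; the sketch uses toy data and hypothetical fitted location curves (loc_21, loc_d, loc_pool), which in practice would come from the respective GEV fits:

```r
years  <- 1950:2021
set.seed(1)
maxima <- 30 * exp(rnorm(length(years), sd = 0.3))   # toy Apr-Sep maxima
# Hypothetical fitted time-varying location parameters from the three fits:
loc_21   <- 28.0 + 0.050 * (years - 1950)
loc_d    <- 27.0 + 0.060 * (years - 1950)
loc_pool <- 27.5 + 0.055 * (years - 1950)

plot(years, maxima, type = "h", col = "grey60",
     xlab = "Year", ylab = "Apr-Sep maximum")
lines(years, loc_21,   col = "blue")     # fit based on location 21 only
lines(years, loc_d,    col = "red")      # fit based on the respective location
lines(years, loc_pool, col = "green3")   # fit based on the pooled pair
legend("topleft", lty = 1, col = c("blue", "red", "green3"), bty = "n",
       legend = c("location 21 only", "respective location", "pooled pair"))
```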
Table C.1: Results for all pairs on the smaller grid of 16 locations: observed test statistic and p-values (in %) obtained with the max-stable (MS) bootstrap (B2) and the bivariate bootstrap (B3), together with their BH- and Holm-adjusted versions.

Pair | Test statistic | p (B2) | p BH (B2) | p Holm (B2) | p (B3) | p BH (B3) | p Holm (B3) |
---|---|---|---|---|---|---|---|
(21, 11) | 78.8 | 0.00 | 0.00 | 0.00 | 0.05 | 0.75 | 0.75 |
(21, 16) | 15.6 | 1.60 | 10.64 | 22.39 | 1.45 | 9.90 | 20.29 |
(21, 15) | 15.2 | 2.50 | 10.64 | 32.48 | 2.50 | 9.90 | 32.48 |
(21, 10) | 13.6 | 3.40 | 10.64 | 40.78 | 3.30 | 9.90 | 36.58 |
(21, 14) | 13.7 | 3.55 | 10.64 | 40.78 | 3.05 | 9.90 | 36.58 |
(21, 23) | 12.1 | 5.30 | 13.24 | 52.97 | 6.30 | 15.74 | 62.97 |
(21, 20) | 10.4 | 7.15 | 15.09 | 64.32 | 9.85 | 16.36 | 78.76 |
(21, 27) | 10.6 | 8.05 | 15.09 | 64.37 | 8.10 | 16.36 | 72.86 |
(21, 9) | 9.8 | 10.00 | 15.59 | 69.97 | 10.44 | 16.36 | 78.76 |
(21, 22) | 9.4 | 10.39 | 15.59 | 69.97 | 11.69 | 16.36 | 78.76 |
(21, 8) | 9.0 | 13.04 | 17.79 | 69.97 | 11.99 | 16.36 | 78.76 |
(21, 26) | 7.2 | 20.34 | 25.42 | 81.36 | 22.04 | 27.55 | 88.16 |
(21, 17) | 4.4 | 46.08 | 53.17 | 100.00 | 45.73 | 52.76 | 100.00 |
(21, 28) | 3.9 | 52.17 | 55.90 | 100.00 | 50.92 | 54.56 | 100.00 |
(21, 29) | 2.8 | 66.92 | 66.92 | 100.00 | 65.77 | 65.77 | 100.00 |
Table C.2: Results of the bootstrap based on bivariate extreme value distributions (B3) for all pairs on the larger grid of 36 locations: observed test statistic and p-value (in %), together with its BH- and Holm-adjusted versions.

Pair | Test statistic | p-value | BH-adj. p-value | Holm-adj. p-value |
---|---|---|---|---|
(21, 3) | 41.5 | 0.10 | 0.50 | 3.50 |
(21, 4) | 87.3 | 0.10 | 0.50 | 3.50 |
(21, 6) | 46.0 | 0.10 | 0.50 | 3.50 |
(21, 11) | 78.8 | 0.10 | 0.50 | 3.50 |
(21, 12) | 75.2 | 0.10 | 0.50 | 3.50 |
(21, 13) | 33.7 | 0.10 | 0.50 | 3.50 |
(21, 25) | 38.5 | 0.10 | 0.50 | 3.50 |
(21, 19) | 28.6 | 0.20 | 0.87 | 5.59 |
(21, 31) | 33.3 | 0.30 | 1.17 | 8.09 |
(21, 7) | 16.1 | 1.50 | 5.24 | 38.96 |
(21, 5) | 15.7 | 1.70 | 5.40 | 42.46 |
(21, 16) | 15.6 | 2.00 | 5.83 | 47.95 |
(21, 10) | 13.6 | 2.70 | 6.99 | 62.04 |
(21, 15) | 15.2 | 2.80 | 6.99 | 62.04 |
(21, 14) | 13.8 | 4.10 | 9.56 | 86.01 |
(21, 33) | 12.5 | 4.70 | 10.27 | 93.91 |
(21, 23) | 12.1 | 6.89 | 14.19 | 100.00 |
(21, 20) | 10.4 | 8.39 | 16.19 | 100.00 |
(21, 27) | 10.6 | 8.79 | 16.19 | 100.00 |
(21, 36) | 10.5 | 9.89 | 16.98 | 100.00 |
(21, 34) | 9.9 | 10.19 | 16.98 | 100.00 |
(21, 2) | 9.6 | 11.19 | 17.03 | 100.00 |
(21, 9) | 9.8 | 11.19 | 17.03 | 100.00 |
(21, 22) | 9.4 | 13.69 | 19.44 | 100.00 |
(21, 8) | 9.0 | 13.89 | 19.44 | 100.00 |
(21, 35) | 8.2 | 15.88 | 21.38 | 100.00 |
(21, 26) | 7.2 | 19.78 | 25.64 | 100.00 |
(21, 30) | 6.1 | 29.17 | 36.46 | 100.00 |
(21, 32) | 5.8 | 33.07 | 39.91 | 100.00 |
(21, 17) | 4.4 | 46.75 | 54.55 | 100.00 |
(21, 28) | 3.9 | 52.75 | 59.55 | 100.00 |
(21, 29) | 2.8 | 66.13 | 72.33 | 100.00 |
(21, 24) | 2.7 | 70.23 | 74.49 | 100.00 |
(21, 1) | 2.4 | 73.13 | 75.28 | 100.00 |
(21, 18) | 1.8 | 83.82 | 83.82 | 100.00 |
Table C.3: Results of the bootstrap based on max-stable processes (B2) for all pairs on the larger grid of 36 locations: observed test statistic and p-value (in %), together with its BH- and Holm-adjusted versions.

Pair | Test statistic | p-value | BH-adj. p-value | Holm-adj. p-value |
---|---|---|---|---|
(21, 3) | 41.5 | 0.00 | 0.00 | 0.00 |
(21, 4) | 87.2 | 0.00 | 0.00 | 0.00 |
(21, 11) | 78.8 | 0.00 | 0.00 | 0.00 |
(21, 12) | 75.2 | 0.00 | 0.00 | 0.00 |
(21, 25) | 38.5 | 0.00 | 0.00 | 0.00 |
(21, 6) | 46.0 | 0.10 | 0.50 | 3.00 |
(21, 13) | 33.4 | 0.10 | 0.50 | 3.00 |
(21, 31) | 33.3 | 0.20 | 0.87 | 5.59 |
(21, 19) | 28.7 | 0.30 | 1.17 | 8.09 |
(21, 7) | 16.0 | 1.80 | 6.29 | 46.75 |
(21, 5) | 15.7 | 2.60 | 7.53 | 64.94 |
(21, 15) | 15.2 | 2.70 | 7.53 | 64.94 |
(21, 16) | 15.6 | 2.80 | 7.53 | 64.94 |
(21, 14) | 13.7 | 3.70 | 9.24 | 81.32 |
(21, 10) | 13.6 | 4.40 | 10.26 | 92.31 |
(21, 23) | 12.1 | 5.09 | 11.15 | 100.00 |
(21, 33) | 12.5 | 5.89 | 12.13 | 100.00 |
(21, 27) | 10.6 | 8.59 | 15.96 | 100.00 |
(21, 20) | 10.4 | 8.89 | 15.96 | 100.00 |
(21, 36) | 10.5 | 9.19 | 15.96 | 100.00 |
(21, 9) | 9.8 | 9.59 | 15.96 | 100.00 |
(21, 2) | 9.6 | 10.49 | 15.96 | 100.00 |
(21, 34) | 9.9 | 10.49 | 15.96 | 100.00 |
(21, 22) | 9.4 | 10.99 | 16.03 | 100.00 |
(21, 8) | 9.0 | 12.39 | 17.34 | 100.00 |
(21, 35) | 8.2 | 16.38 | 22.05 | 100.00 |
(21, 26) | 7.2 | 21.88 | 28.36 | 100.00 |
(21, 30) | 6.1 | 28.77 | 35.96 | 100.00 |
(21, 32) | 5.8 | 31.87 | 38.46 | 100.00 |
(21, 17) | 4.4 | 46.25 | 53.96 | 100.00 |
(21, 28) | 3.9 | 50.95 | 57.52 | 100.00 |
(21, 29) | 2.8 | 66.33 | 70.99 | 100.00 |
(21, 24) | 2.7 | 66.93 | 70.99 | 100.00 |
(21, 1) | 2.4 | 71.03 | 73.12 | 100.00 |
(21, 18) | 1.8 | 81.92 | 81.92 | 100.00 |