Regional Pooling in Extreme Event Attribution Studies: an Approach Based on Multiple Statistical Testing
Abstract
Statistical methods are proposed to select homogeneous locations when analyzing spatial block maxima data, such as in extreme event attribution studies. The methods are based on classical hypothesis testing using Wald-type test statistics, with critical values obtained from suitable parametric bootstrap procedures and corrected for multiplicity. A large-scale Monte Carlo simulation study finds that the methods are able to accurately identify homogeneous locations, and that pooling the selected locations improves the accuracy of subsequent statistical analyses. The approach is illustrated with a case study on precipitation extremes in Western Europe. The methods are implemented in an R package that allows easy application in future extreme event attribution studies.
Key words: Extreme event attribution; Extreme Value Statistics; Homogeneity Tests; Multiple Comparison Problem; Parametric Bootstrap; Max-Stable Processes.
1 Introduction
Extreme event attribution studies on precipitation extremes are typically motivated by the occurrence of an extreme event which causes major impacts such as damage to infrastructure and agriculture, or even fatalities, see, for instance, Van16; Van17; Ott18; Kre21. A key task for attributing the event to anthropogenic climate change consists of a statistical analysis of available observational data products at the location or region of interest (Phi20). Typically, the observed time period is short, often less than 100 years, which ultimately leads to large statistical uncertainties. One possibility to reduce those uncertainties is to incorporate observations from nearby locations/regions, provided that their meteorological characteristics are sufficiently similar to those of the region affected by the extreme event and are governed by the same underlying processes. The selection of surrounding areas for which these criteria are met can be based on expert knowledge of the meteorological characteristics and dynamics, for instance provided by experts from the national meteorological and hydrological service of the affected country, such as the Deutscher Wetterdienst in Germany. The expert knowledge-based suggestion may next be assessed statistically, which, to the best of our knowledge, has been done with ad hoc methods in the past. In this paper, we propose rigorous statistical methods that can complement the expert's knowledge and which are based on statistically evaluating observational data from the past. Once regions with sufficiently similar characteristics of the analysed variable, e.g., the yearly maximum of daily rainfall, have been identified, the time series of all identified regions can be combined, thereby extending the available time series for the benefit of a more efficient statistical analysis.
The building blocks for the new approach are classical Wald-type test statistics (Leh21) for testing the null hypothesis that the temporal dynamics at multiple locations of interest are the same. Unlike in the classical text-book case, and motivated by the fact that standard likelihood-based inference for extreme value distributions requires unreasonably large sample sizes for sufficient finite-sample accuracy, we employ a parametric bootstrap device to approximate the distribution of the test statistics under the null hypothesis. This approach is motivated by results in Lil22 for respective stationary extreme value models. Based on suitable decompositions of a global null hypothesis, we then propose to test carefully selected sub-hypotheses, possibly after correcting the individual tests' level for multiple comparisons.
The new methods are illustrated by a large-scale Monte Carlo simulation study and by an application to the severe flooding event in Western Europe during July 2021 for which spatial pooling was applied in an attribution study following the event (Kre21; Tra22). For the benefit of researchers who would like to use this spatial pooling approach, an implementation of the method in the statistical programming environment R (R) is publicly available as an R package called findpoolreg on GitHub (findpoolreg).
Attribution analysis of precipitation extremes is especially challenging due to short observational time series as well as their often limited spatial extent, which further complicates the detection of a trend and the estimation of return periods based on the limited time series (see Tra22 for a discussion). We therefore present the suggested approach for a heavy rainfall event in the following; the method could, however, equally be applied to other variables.
The remaining parts of this paper are organized as follows. Section 2 explains the mathematical concept of the proposed methods, starting with a detailed description of the underlying model assumptions and a strategy for the detection of a possible pooling region in Section 2.1; it is recommended to all readers. In Sections 2.2 and 2.3, mathematical details on the applied estimators and test statistics are given; they may be skipped by non-expert readers. Next, in Section 2.4, the ideas of the bootstrap procedures that allow drawing samples from the distribution under the null hypothesis are explained; again, this part may be skipped by non-statisticians. Section 2.5 goes into detail about the detection strategy for possible pooling regions and the treatment of the related multiple testing problem. This section is of interest to all readers who want to apply the methods, since it provides details on how to process a set of p-values obtained from testing multiple hypotheses. Next, Section 3 gives the results of the simulation study that was performed in order to evaluate the performance of the proposed methods. These results are of interest to all readers, and they serve as a basis for the case study conducted in Section 4. Section 5 then discusses several extensions of the proposed methods. In Section 5.1, we provide a method for estimating region-wise return periods once a pooling region has been chosen. Here, a region-wise return period of a given event is defined as the number of years that one has to wait on average until an event of the same or even greater magnitude is observed anywhere in the pooling region. Extensions to different model assumptions that suit, e.g., other variables such as temperature are discussed in Sections 5.2 and 5.3. Finally, we conclude in Section 6. Some mathematical details and further illustrations on the simulation study and the case study are postponed to a supplement.
2 Assessing spatial homogeneities for precipitation extremes
2.1 A homogeneous model for precipitation extremes
The observational data of interest consists of annual or seasonal maximal precipitation amounts (over some fixed time duration, e.g., a day) collected over various years and at various locations (in practice, each location may correspond to a spatial region; we separate these two terms from the outset to avoid misunderstandings: subsequently, a region shall be a set of locations). More precisely, we denote by the observed maximal precipitation amount in season and at location , with and . The location of primary interest shall be the one with index . Note that the choice of is made for illustrative purposes only and can be replaced by any index .
In view of the stochastic nature, we assume that is an observed value of some random variable . Since is generated by a maxima operation, standard extreme value theory (Col01) suggests to assume that follows the generalized extreme value (GEV) distribution, i.e.,
for some parameter vector, where the $\mathrm{GEV}(\mu, \sigma, \gamma)$ distribution with location parameter $\mu \in \mathbb{R}$, scale parameter $\sigma > 0$ and shape parameter $\gamma \in \mathbb{R}$ is defined by its cumulative distribution function
\[
G_{(\mu,\sigma,\gamma)}(x) = \exp\Big\{-\Big(1 + \gamma\,\frac{x-\mu}{\sigma}\Big)^{-1/\gamma}\Big\} \qquad (1)
\]
for all $x$ such that $1 + \gamma(x-\mu)/\sigma > 0$; for $\gamma = 0$, the expression is interpreted as its limit $\exp\{-\exp(-(x-\mu)/\sigma)\}$. Due to climate change, the temporal dynamics at a given location, which are primarily driven by the temporal evolution of the GEV parameters, are typically non-constant. Any proxy for climate change qualifies as a suitable temporal covariate, and a standard assumption in extreme event attribution studies, motivated by the Clausius–Clapeyron relation, postulates that
\[
\mu(t) = \mu_0 \exp\Big(\frac{\alpha\,\mathrm{GMST}'_t}{\mu_0}\Big), \qquad
\sigma(t) = \sigma_0 \exp\Big(\frac{\alpha\,\mathrm{GMST}'_t}{\mu_0}\Big), \qquad
\gamma(t) = \gamma_0 \qquad (2)
\]
for certain location-specific parameters $\mu_0, \sigma_0 > 0$ and $\gamma_0, \alpha \in \mathbb{R}$. Here, $\mathrm{GMST}'_t$ denotes the smoothed global mean surface temperature anomaly, see Phi20. Note that (2) implies that the ratio $\sigma(t)/\mu(t) = \sigma_0/\mu_0$ does not depend on $t$, hence the model may be identified as a temporal scaling model. It is further assumed that any temporal dependence at a given location is completely due to the covariate $\mathrm{GMST}'_t$, which we treat as deterministic and which implies that the observations at that location are stochastically independent across years. For the moment, the spatial dependence will be left unspecified.
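To make the temporal scaling concrete, the following minimal R sketch simulates one location's seasonal maxima under a scaling model of the form (2); all numerical values and the covariate series are illustrative choices and are not taken from the data analysed in this paper.

```r
library(evd)  # rgev()

set.seed(1)
n    <- 72                               # years, e.g. 1950-2021
gmst <- seq(-0.3, 1.2, length.out = n)   # illustrative smoothed GMST anomaly
mu0  <- 20; sigma0 <- 5; gamma0 <- 0.1; alpha <- 1.5

# time-varying location and scale; their ratio sigma(t)/mu(t) stays constant
mu_t    <- mu0    * exp(alpha * gmst / mu0)
sigma_t <- sigma0 * exp(alpha * gmst / mu0)

# GEV(mu_t, sigma_t, gamma0) margins via the location-scale representation
z <- rgev(n, loc = 0, scale = 1, shape = gamma0)
x <- mu_t + sigma_t * z
plot(1950:2021, x, type = "h", xlab = "year", ylab = "seasonal maximum")
```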
Recall that the location of interest is the one with , which is characterised by the four parameters . As described before, estimating those parameters based on the observations from location only may be unpleasantly inaccurate, which is why one commonly assumes that the locations have been carefully selected by experts to meet the following space-time homogeneity assumption:
(3) |
where and , and where the upper index stands for ‘equal distribution’, since, in short, Equation (3) states that the location-wise GEV parameters coincide for the locations.
In the subsequent sections, we aim at testing the validity of the expert’s hypothesis . Here, it is not only of interest to test the hypothesis for the whole set , but also to find a (maximal) subset with and on which the space-time homogeneity assumption holds. Here, for an arbitrary index set , the latter assumption may be expressed through
(4) |
with as in Equation (3) and , meaning that the location-wise GEV parameters coincide for all locations with index in the set , making the respective locations a possible pooling region.
Now, a maximal subset for which Equation (4) holds may be determined with the following strategy: Since we are interested in finding all locations that ‘match’ the location of primary interest with index , we test for each pair , whether the null hypothesis holds. This will provide us with a set of p-values based on which we can decide which locations to reject and which not to reject. Those locations that are not rejected can then be assumed to be sufficiently homogeneous and are thus included in the suggestion of a pooling region of maximal extent. For further details on this strategy and the impact of the induced multiple testing problem, see Section 2.5.
2.2 Coordinate-wise maximum likelihood estimation
The starting points for the subsequent test statistics are the coordinate-wise maximum likelihood estimators for the model specified in (2). Writing for brevity, the log-likelihood contribution of an observation is given by , where
(5) |
with the probability density function of the -distribution. The maximum likelihood estimator for at location is then defined as
(6) |
The arg-maximum cannot be calculated explicitly, but may be found by suitable numerical optimization routines. We denote the gradient and the Hessian matrix of by and , respectively. Under appropriate regularity assumptions, standard asymptotic expansions (Van98, see also BucSeg17 for the stationary GEV family) imply that is approximately Gaussian with mean and covariance , where is defined as
(7) |
with . See Appendix A.1 for details and Appendix A.2 for a suitable estimator for .
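As a concrete illustration, the following R sketch fits the four-parameter scale-GEV model at a single location by direct numerical maximization of the log-likelihood. The function names are illustrative, the parametrization follows the scaling form in (2), and the inverse of the numerically obtained Hessian is used as a simple stand-in for the covariance estimator of Equation (7).

```r
library(evd)  # dgev()

# negative log-likelihood of the scale-GEV model with parameters
# (mu0, sigma0, gamma0, alpha) at one location
neg_loglik <- function(par, x, gmst) {
  mu0 <- par[1]; sigma0 <- par[2]; gamma0 <- par[3]; alpha <- par[4]
  if (mu0 <= 0 || sigma0 <= 0) return(1e10)
  mu_t    <- mu0    * exp(alpha * gmst / mu0)
  sigma_t <- sigma0 * exp(alpha * gmst / mu0)
  # location-scale representation of the GEV(mu_t, sigma_t, gamma0) density
  ll <- dgev((x - mu_t) / sigma_t, shape = gamma0, log = TRUE) - log(sigma_t)
  if (any(!is.finite(ll))) return(1e10)
  -sum(ll)
}

fit_scalegev <- function(x, gmst) {
  start <- c(mean(x), sd(x), 0.1, 0)      # crude starting values
  optim(start, neg_loglik, x = x, gmst = gmst,
        method = "Nelder-Mead", hessian = TRUE)
}

# usage with a series x and covariate gmst (e.g. from the previous sketch):
# fit <- fit_scalegev(x, gmst)
# fit$par             # estimates of (mu0, sigma0, gamma0, alpha)
# solve(fit$hessian)  # approximate covariance matrix (observed information)
```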
2.3 Wald-type test statistics
We define test statistics which allow testing the sub-hypotheses of the global hypothesis from Equation (4). For that purpose, we propose to use classical Wald-type test statistics; see Section 14.4.2 in Leh21 for a general discussion and Lil22 for a similar approach in temporally stationary GEV models, i.e., with the trend parameter fixed to zero.
Write with and let be defined by
We may then write equivalently as
Hence, significant deviations of from 0 with from Section 2.2 provide evidence against . Such deviations may be measured by the Wald-type test statistic
(8) |
where denotes the Jacobian matrix of , which is a matrix with entries in that does not depend on . In view of the asymptotic normality of , see Section 2.2, the asymptotic distribution of under the null hypothesis is the chi-square distribution with degrees of freedom; see also Section 4 in Lil22. Hence, rejecting if exceeds the -quantile of the -distribution provides a statistical test of asymptotic level . In the stationary setting, however, the finite-sample level of the related test was found to be quite inaccurate (see Lil22). To overcome this issue, we propose a suitable bootstrap scheme in the next section.
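The following R sketch evaluates a pairwise Wald-type statistic of this form, assuming that the stacked estimate of the two location-wise parameter vectors and an estimate of their joint covariance matrix are available; the numbers are purely illustrative, and the chi-square reference uses 4 degrees of freedom, i.e. one equality constraint per parameter of the four-parameter model.

```r
# Wald-type statistic for the hypothesis that two locations share the same
# four scale-GEV parameters
wald_pair <- function(theta_hat, Sigma_hat) {
  A <- cbind(diag(4), -diag(4))                 # contrast: theta_1 - theta_2
  d <- A %*% theta_hat
  drop(t(d) %*% solve(A %*% Sigma_hat %*% t(A)) %*% d)
}

# illustrative inputs: stacked estimates and a (here diagonal) 8 x 8 covariance
theta_hat <- c(20, 5.0, 0.10, 1.5,   21, 5.5, 0.05, 1.2)
Sigma_hat <- diag(c(1, 0.5, 0.01, 0.6, 1, 0.5, 0.01, 0.6))

t_obs <- wald_pair(theta_hat, Sigma_hat)
# asymptotic chi-square p-value; the paper recommends a parametric bootstrap
# instead because of the poor finite-sample accuracy noted above
pchisq(t_obs, df = 4, lower.tail = FALSE)
```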
2.4 Parametric bootstrap devices for deriving p-values
Throughout this section, we propose two bootstrap devices that allow simulating approximate samples from the null distribution of the test statistic from Equation (8). Based on a suitably large set of such samples, one can compute a reliable p-value for testing the null hypothesis, even for small sample sizes.
The first method is based on a global fit of a max-stable process model to the entire region under consideration, while the second one is based on fitting multiple pairwise models. The main difference between the two approaches is that the first one can test the hypothesis for arbitrary subsets, while the second is restricted to testing the null hypothesis on subsets of cardinality two, i.e., it can only test whether a pair of locations is homogeneous. Depending on the question at hand, one or the other method may be preferable.
2.4.1 Global bootstrap based on max-stable process models
The subsequent bootstrap device is a modification of the parametric bootstrap procedure described in Section 5.3 of Lil22. Fix some large number , say , noting that larger numbers are typically better, but going beyond is usually not worth the extra computational effort.
The basic idea is as follows: for each , simulate artificial bootstrap samples
that have a sufficiently similar spatial dependence structure as the observed data and that satisfy the null hypothesis . For each fixed with , the test statistics computed on all bootstrap samples, say , are then compared to the observed test statistic . Since the bootstrap samples do satisfy , the observed test statistic should differ significantly from the bootstrapped test statistics in case is not satisfied on the observed data.
Here, for simulating the bootstrap samples, we assume that the spatial dependence structure of the observed data can be sufficiently captured by a max-stable process model. Max-stable processes provide a natural choice here, since they are the only processes that can arise, after proper affine transformation, as the limit of maxima of independent and identically distributed random fields (Col01, Section 9.3). Parametric models for max-stable processes are usually stated for unit Fréchet (i.e., ) margins. Therefore, the first steps in our algorithm below aim at transforming the margins of our observed data to be approximately unit Fréchet.
More precisely, the parametric bootstrap algorithm is defined as follows:
Algorithm 1 (Bootstrap based on max-stable processes).
(1) For each location, calculate the maximum likelihood estimate from Section 2.2.

(2) For each location, transform the observations to approximately i.i.d. Fréchet-distributed data, by letting
(9)

(3) Fit a set of candidate max-stable process models with standard Fréchet margins to the transformed observations and choose the best fit according to the composite likelihood information criterion (CLIC), which is a model selection criterion that is commonly applied when fitting max-stable process models. Throughout, we chose the following three models:
(a) Smith's model (3 parameters);
(b) Schlather's model with a powered exponential correlation function (3 parameters);
(c) the Brown-Resnick process (2 parameters).
For further details on max-stable processes, the mentioned models and the CLIC, see davisonModelSpatExt and Dav12. Respective functions are implemented in the R package SpatialExtremes (SpatialExtremes).

(4) For each bootstrap replication, simulate spatial data with unit Fréchet margins from the chosen max-stable process model.
Note that until now we have not used the particular hypothesis under consideration. Subsequently, fix a candidate set of locations containing at least two elements.

(5) Assume that the null hypothesis from Equation (4) is true for that set, and estimate the four-dimensional model parameter by (pseudo) maximum likelihood based on the pooled sample of observations from the candidate locations. Note that the resulting parameter vector should be close to each of the location-wise estimates if the null hypothesis is met.

(6) Transform the margins of the bootstrap samples to those of a GEV model satisfying the null hypothesis, by letting
(10)
for each bootstrap replication and each location in the candidate set. For each resulting bootstrap sample, compute the value of the test statistic from Equation (8). Note that the statistic only depends on the coordinates belonging to the candidate set.

(7) Compute the value of the test statistic from Equation (8) on the observed sample.

(8) Compute the bootstrapped p-value by comparing the observed test statistic with the bootstrapped test statistics.
In a classical test situation, one may now reject for a fixed set at significance level if . In the current pooling situation, we would need to apply the test to multiple pooling regions , which hence constitutes a multiple testing problem where standard approaches yield inflated levels. We discuss possible remedies in Section 2.5.
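The following self-contained R sketch mirrors steps (4)-(8) of Algorithm 1 for a single pair of toy locations. It is a simplified illustration only: the max-stable dependence model is reused instead of being refitted via fitmaxstab(), a stationary GEV fit via evd::fgev stands in for the pooled scale-GEV fit of step (5), the test statistic is a simple placeholder rather than the Wald-type statistic of Equation (8), and the final p-value formula is one common convention.

```r
library(SpatialExtremes)  # rmaxstab()
library(evd)              # fgev(), qgev()

set.seed(42)
n <- 72; B <- 100
coord <- cbind(x = c(0, 1), y = c(0, 0))            # two toy locations

sim_frech <- function()                              # unit-Frechet fields
  rmaxstab(n, coord, cov.mod = "brown", range = 1, smooth = 1)
frech2gev <- function(z, par)                        # margins to GEV(par)
  matrix(qgev(exp(-1 / z), par[1], par[2], par[3]), nrow(z), ncol(z))

# toy "observations" with identical GEV(20, 5, 0.1) margins, so H0 holds
obs <- frech2gev(sim_frech(), c(20, 5, 0.1))

# placeholder statistic: absolute difference of the fitted GEV location
# parameters of the two columns
test_stat <- function(x)
  abs(fgev(x[, 1])$estimate["loc"] - fgev(x[, 2])$estimate["loc"])

# step (5): pooled fit under the null hypothesis
pooled_par <- fgev(as.vector(obs))$estimate

# steps (4) and (6): bootstrap samples with the assumed dependence structure
# and margins satisfying the null hypothesis
t_boot <- replicate(B, test_stat(frech2gev(sim_frech(), pooled_par)))

# steps (7) and (8): observed statistic and bootstrapped p-value
t_obs   <- test_stat(obs)
p_value <- (1 + sum(t_boot >= t_obs)) / (B + 1)
```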
2.4.2 Pairwise bootstrap based on bivariate extreme value distributions
Recall that the location of primary interest is the one with index .
As stated in Section 2.1, it is of interest to test all bivariate hypotheses involving the location of primary interest. For that purpose, we may apply a modification of the bootstrap procedure from the previous section that makes use of bivariate extreme value models only. By doing so, we decrease the model risk implied by imposing a possibly restrictive global max-stable process model.
The modification only affects steps (3) and (4) from Algorithm 1. More precisely, for testing the hypothesis for a fixed pair of locations, we make the following modifications:
Algorithm 2 (Pairwise bootstrap based on bivariate extreme value distributions).
Perform steps (1) and (2) from Algorithm 1, with the full set of locations replaced by the pair under consideration.

(3a) Fit a set of bivariate extreme value distributions to the bivariate sample, assuming the marginal distributions to be unit Fréchet. Choose the best fit according to the Akaike information criterion (AIC), a model selection criterion that rewards a good fit of a model and penalises the model's complexity at the same time (Aka73). Possible models are:
(a) the Hüsler-Reiss model (1 parameter);
(b) the logistic model (1 parameter);
(c) the asymmetric logistic model (2 parameters).
Note that all models are implemented in the R package evd (evd).

(4a) For each bootstrap replication, simulate bivariate data with unit Fréchet margins from the chosen bivariate extreme value model.

Perform steps (5)-(8) from Algorithm 1 with the pair under consideration in place of the candidate set.
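A corresponding R sketch of steps (3a) and (4a), assuming the marginal transformation to unit Fréchet has already been carried out; the toy data, the fixed-margin specification and the manual AIC computation (deviance plus twice the number of free parameters) are illustrative choices rather than the exact implementation used in the paper.

```r
library(evd)  # rbvevd(), fbvevd()

set.seed(3)
# toy bivariate sample with unit-Frechet margins, i.e. GEV(1, 1, 1) margins
z <- rbvevd(72, dep = 0.6, model = "log", mar1 = c(1, 1, 1), mar2 = c(1, 1, 1))

# step (3a): fit candidate dependence models with the margins held fixed at
# unit Frechet and select the best fit by AIC
models <- c("hr", "log", "alog")
fits <- lapply(models, function(m)
  fbvevd(z, model = m,
         loc1 = 1, scale1 = 1, shape1 = 1,
         loc2 = 1, scale2 = 1, shape2 = 1))
aic  <- sapply(fits, function(f) f$deviance + 2 * length(f$estimate))
best <- which.min(aic)

# step (4a): simulate bootstrap data from the selected model, again with unit
# Frechet margins (shown here for the logistic case)
if (models[best] == "log") {
  dep_hat <- fits[[best]]$estimate["dep"]
  z_boot  <- rbvevd(72, dep = dep_hat, model = "log",
                    mar1 = c(1, 1, 1), mar2 = c(1, 1, 1))
}
```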
2.5 Combining test statistics
As already addressed at the end of Section 2.1, it is not only of interest to test the global hypothesis , since a possible rejection of gives no indication about which locations deviate from the one of primary interest. Instead, one might want to test hypotheses on several subsets and then pool those subsets for which no signal of heterogeneity was found. In this subsection, we provide the mathematical framework of testing sub-hypotheses and discuss how to deal with the induced multiple testing problem.
Mathematically, we propose to regard as a global hypothesis that is built up from elementary hypotheses of smaller dimension. A particularly useful decomposition is based on pairwise elementary hypotheses: recalling the notation from Equation (4), we clearly have
(11) |
i.e., the global hypothesis holds if and only if it holds locally for all pairs involving the location of primary interest. We may now either apply Algorithm 1 or Algorithm 2 to obtain a p-value, say , for testing each such pairwise hypothesis. Each p-value may be interpreted as a signal for heterogeneity between the respective location and the location of primary interest, with smaller values indicating stronger heterogeneity. The obtained raw list of p-values may hence be regarded as an exploratory tool for identifying possible heterogeneities.
Since we are now dealing with a multiple testing problem, it might be advisable to adjust for multiple comparison in order to control error rates. This can be done by interpreting the raw list based on classical statistical testing routines, in which p-values are compared with suitable critical values to declare a hypothesis significant. Several methods appear to be meaningful, and we discuss three of them in the following. For this, let denote a significance level, e.g., .
IM (Ignore multiplicity): reject homogeneity for all pairs for which . In doing so, we do not have any control over false rejections. In particular, in case is large, false rejections of some null hypotheses will be very likely. On the other hand, the procedure will have decent power properties, and will likely detect most alternatives. Hence, in a subsequent analysis based on the pooled sample of homogeneous locations, we can expect estimators to exhibit comparably little bias and large variance.
Holm (Control the family-wise error rate): apply Holm’s stepdown procedure (Hol79). For that purpose, sort the p-values with ; denote them by . Starting from , determine the smallest index such that
If , then reject no hypotheses. If no such index exists, then reject all hypotheses. Otherwise, if , reject the hypotheses that belong to the p-values .
The procedure can be equivalently expressed by adjusted p-values. Recursively defining and
for , we simply reject those hypotheses that belong to the adjusted p-values with .
Holm’s stepdown procedure is known to asymptotically control the family-wise error rate (FWER) at level , i.e.,
see Theorem 9.1.2 in Leh21.
In general, controlling the family-wise error rate will result in comparably little power, i.e., we might falsely identify some pairs of locations as homogeneous. Hence, in a subsequent analysis based on the pooled sample of homogeneous locations, we can expect estimators to exhibit comparably large bias and little variance.
BH (Control the false discovery rate): apply the Benjamini–Hochberg stepup procedure (Ben95). For that purpose, sort the p-values with ; denote them by . Starting from , determine the largest index such that
If no such index exists, then reject no hypotheses. Otherwise, if , reject the hypotheses that belong to the p-values .
Again, one can compute adjusted -values such that the procedure is equivalent to rejecting those hypotheses for which . For that purpose, let and recursively define, for ,
Under an additional assumption on the p-values that belong to the true null hypotheses (they must exhibit some positive dependence), the BH procedure is known to asymptotically control the false discovery rate (FDR) at level , i.e.,
see Theorem 9.3.3 in Leh21. Control of the FDR will be confirmed by the simulation experiments in Section 3.
If one were interested in guaranteed theoretical control of the FDR, one might alternatively apply the Benjamini–Yekutieli (BY) stepup procedure, see (Ben01) and Theorem 9.3.3 in Leh21. In view of the fact that this procedure is much more conservative than BH, we do not recommend its application in the current setting.
Concerning a subsequent analysis, estimators based on a pooled sample obtained from the BH procedure can be expected to exhibit bias and variance to be somewhere between the IM and Holm procedure.
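In R, all three decision rules can be applied to the vector of raw p-values with the base function p.adjust(); the numbers below are purely illustrative.

```r
# raw p-values from, e.g., 15 pairwise tests (illustrative values)
p_raw <- c(0.003, 0.210, 0.048, 0.760, 0.012, 0.330, 0.095, 0.540,
           0.002, 0.620, 0.150, 0.041, 0.880, 0.070, 0.260)
alpha <- 0.1

reject_IM   <- p_raw <= alpha                              # no multiplicity control
reject_Holm <- p.adjust(p_raw, method = "holm") <= alpha   # controls the FWER
reject_BH   <- p.adjust(p_raw, method = "BH")   <= alpha   # controls the FDR

# pairs that are not rejected form the suggested pooling region
which(!reject_BH)
```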
Remark 1.
The decomposition of into hypotheses of smaller dimensionality is not unique. For instance, we may alternatively write
(12) |
where denotes an increasing sequence of regions with (for instance, with ). In practice, the sequence is supposed to be derived from some expert knowledge of the region of interest; it shall represent a sequence of possible pooling regions where is constructed from by adding the locations which are a priori ‘most likely’ homogeneous to the locations in . Note that, necessarily, , which provides an upper bound on the number of hypotheses to be tested.
The derivation of respective testing methods is straightforward. In view of the facts that the choice of the sequence is fairly subjective and that the eventual results crucially depend on that choice, we do not pursue the method any further.
3 Simulation Study
A large-scale Monte Carlo simulation study was conducted to assess the performance of the proposed bootstrap procedures in finite sample situations. We aim at answering the following questions:
(a) Regarding the test's power: What percentage of locations that are heterogeneous w.r.t. the location of primary interest can be expected to be identified correctly?

(b) Regarding the test's error rates: What percentage of locations that are homogeneous w.r.t. the location of primary interest can be expected to be wrongly identified as heterogeneous (FDR)? What is the probability of wrongly identifying at least one location that is homogeneous w.r.t. the location of interest as heterogeneous (FWER)?

(c) Regarding the chosen pooling regions: How does return level (RL) estimation based on the pooling regions proposed by the bootstrap procedures compare to RL estimation based on the location of interest only or the whole (heterogeneous) region?
The data was generated in such a way that the temporal spatial dynamics from the case study in Section 4 are mimicked. To achieve this, we started by fitting the scale-GEV model from Equation (2) to annual block-maxima of observations from 1950–2021 at 16 spatial locations in Western Europe (i.e., and ) that are arranged in a grid; see Figure 1 and the additional explanations in Section 4. The locations correspond to the center points of the grid cells; the distance between the center points of two neighbouring grid cells is approximately 140 km. The location-wise GEV parameter estimates exhibit the following approximate ranges over : with a mean of , with a mean of 5.3, with a mean of 0.08 and with a mean of 1.5. Fitting the scale-GEV model to the full pooled sample of size , we obtained parameter estimates that were close to the means over the location-wise parameter estimates, with for location, scale, shape and trend parameter, respectively. Next, we transformed the margins to (approximate) unit Fréchet by applying the transformation from Equation (9), such that we can fit several max-stable process models to the transformed data. The best fit was Smith’s model with approximate dependence parameters ; see davisonModelSpatExt for details on the model.
[Figure 1: the 16 grid cells used in the simulation study and their partition into homogeneous and heterogeneous areas for the two heterogeneous scenarios.]
Based on these model fits, we chose to generate data with the following specifications: first, the sample size was fixed to 72 (matching the observational period 1950–2021) and the regional grid was chosen as described before, i.e., as a 4 × 4 grid of 16 locations. The grid cell/location labelled '10' is chosen as the one of primary interest. Further, the dependence structure is fixed to that of Smith's model with (approximately) those parameters that gave the best fit on the observed data. For simulating data, we first simulate from this max-stable process model (SpatialExtremes) and then transform the margins to scale-GEV distributions, either in a homogeneous or in a heterogeneous manner. Here, the globally homogeneous model is defined by fixing the marginal scale-GEV parameters to approximately the mean values of the location-wise GEV parameters obtained for the real observations, i.e.,
(13) |
for each .
Starting from this homogeneous model, we consider two different heterogeneous scenarios. In the first scenario, we fix as in Equation (13) for all , while
(14) |
for with . Note that this defines different heterogeneous models. In the second scenario, we consider the same construction with and . An illustration of the grid cells and their partition into homogeneous and non-homogeneous areas can be found in Figure 1. Overall, we obtain 448 different heterogeneous models and one homogeneous model.
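A sketch of this data-generating mechanism in R, using SpatialExtremes for the max-stable simulation; the Smith covariance parameters, the covariate series and the GEV location value below are placeholders rather than the fitted values referred to in the text.

```r
library(SpatialExtremes)  # rmaxstab()
library(evd)              # qgev()

set.seed(1)
n <- 72
coord <- as.matrix(expand.grid(x = 1:4, y = 1:4))   # 4 x 4 grid of cell centres
gmst  <- seq(-0.3, 1.2, length.out = n)             # illustrative covariate

# unit-Frechet fields from Smith's model ("gauss"); placeholder covariance
z <- rmaxstab(n, coord, cov.mod = "gauss", cov11 = 1, cov12 = 0, cov22 = 1)

# transform all margins to the same scale-GEV distribution (homogeneous model)
mu0 <- 20; sigma0 <- 5.3; gamma0 <- 0.08; alpha <- 1.5
mu_t    <- mu0    * exp(alpha * gmst / mu0)
sigma_t <- sigma0 * exp(alpha * gmst / mu0)
x <- qgev(exp(-1 / z), shape = gamma0) * sigma_t + mu_t
x <- matrix(x, nrow = n)                            # n x 16 homogeneous sample

# heterogeneous scenarios rescale the parameters of selected grid cells before
# the marginal transformation is applied
```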
For each of the 449 models, we now apply the following three bootstrap procedures, each carried out with bootstrap replications (recall that the grid cell of interest is the one labelled with 10):
(B1) The bootstrap procedure from Algorithm 1, applied to the full set of 16 locations.

(B2) The bootstrap procedure from Algorithm 1, applied to all pairs consisting of the location of interest and one of the remaining locations.

(B3) The bootstrap procedure from Algorithm 2, applied to all pairs consisting of the location of interest and one of the remaining locations.
Note that the second and third method both yield 15 raw p-values. Each procedure was applied to 500 simulated samples from all models under consideration.
Regarding (B1), we compute the percentage of rejections among the 500 replications, which represents the empirical type I error of the test under the homogeneous model and the empirical power under the heterogeneous models. The results can be found in Figure 2. The null hypothesis is met in the central square only, and we observe that the nominal level of is perfectly matched. All non-central squares correspond to different alternatives, and we observe decent power properties in both scenarios. Note that a rejection only implies that the entire region is not homogeneous; there is no information on possible smaller subgroups that are homogeneous to the location of interest.
Regarding (B2) and (B3), rejection decisions were obtained for each hypothesis by one of the three methods from Section 2.5. The empirical family-wise error rate is then the percentage of cases (over 500 replications) for which at least one null hypothesis was rejected. Likewise, for the false discovery rate, we calculate, for each replication, the number of false rejections and divide that by the total number of rejections (when the number of total rejections is 0, this ratio is set to 0). The empirical false discovery rate is obtained by taking the mean over all 500 replications. Similarly, for assessing the power properties, we calculate the empirical proportion of correct rejections (i.e., among the 2 or 7 locations that deviate, the proportion of detected heterogeneous locations) over all 500 replications.
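These summaries can be computed from a matrix of rejection decisions as in the following R sketch; the function and argument names are illustrative.

```r
# rejections:    logical matrix, one row per simulated sample and one column
#                per tested pair (e.g. 500 x 15)
# is_false_null: logical vector flagging the pairs that violate the null
summarise_errors <- function(rejections, is_false_null) {
  V <- rowSums(rejections[, !is_false_null, drop = FALSE])  # false rejections
  R <- rowSums(rejections)                                   # all rejections
  S <- rowSums(rejections[,  is_false_null, drop = FALSE])   # correct rejections
  c(FWER  = mean(V >= 1),
    FDR   = mean(ifelse(R > 0, V / R, 0)),
    power = mean(S) / max(1, sum(is_false_null)))
}
```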
Results for the false discovery and family-wise error rate are given in Table 1. We find that the p-value combination methods from Section 2.5 are sufficiently accurate: the BH method controls the false discovery rate, while Holm's method controls the family-wise error rate. This holds exactly for procedure (B3), where the maximal FDR (FWER) of the BH (Holm) method is 9.4% (8.7%), and approximately for (B2), where the maximal FDR (FWER) is 12.2% (12.6%). Further, we see that the IM procedure controls neither the FWER nor the FDR.
All values are in percent; within each cell, the first number refers to bootstrap procedure (B2) and the second to (B3).

Scenario 1:

| Method | min FDR | max FDR | mean FDR | min FWER | max FWER | mean FWER |
|---|---|---|---|---|---|---|
| BH | 7.3 / 5.6 | 12.2 / 9.4 | 9.4 / 7.5 | 9.1 / 6.9 | 21.2 / 19.0 | 14.4 / 11.8 |
| Holm | 3.0 / 2.3 | 11.7 / 8.3 | 7.1 / 5.1 | 6.9 / 5.0 | 12.6 / 8.7 | 9.5 / 6.6 |
| IM | 25.8 / 25.3 | 61.7 / 60.1 | 37.7 / 37.2 | 53.4 / 53.4 | 64.5 / 62.8 | 59.1 / 58.3 |

Scenario 2:

| Method | min FDR | max FDR | mean FDR | min FWER | max FWER | mean FWER |
|---|---|---|---|---|---|---|
| BH | 3.6 / 2.4 | 12.1 / 8.9 | 5.6 / 4.9 | 6.2 / 4.2 | 32.0 / 29.8 | 18.6 / 16.8 |
| Holm | 1.0 / 0.9 | 11.3 / 7.9 | 3.4 / 2.6 | 3.8 / 2.4 | 11.3 / 7.9 | 7.1 / 5.1 |
| IM | 7.9 / 7.8 | 61.5 / 60.1 | 16.0 / 15.8 | 40.9 / 40.9 | 61.5 / 60.1 | 47.3 / 46.4 |
The power properties for procedure (B2) combined with the BH method are shown in Figure 3. We see that the procedure is able to detect some of the deviations from the null hypothesis, with more correct rejections the stronger the deviation. The method is particularly powerful when the location and scale parameters deviate in opposite directions. There is no obvious pattern regarding the deviations of the shape and trend parameters. Further, we analogously show the power properties of the IM method with bootstrap (B2) in Figure 3. As expected, this method has more power against all alternatives under consideration. However, this comes at the cost of more false discoveries, as can be seen in Table 1.
The results for bootstrap scheme (B3) were very similar and are therefore not shown here, but can be found in Section B of the supplementary material. Likewise, we omit the results for the more conservative Holm procedure, which exhibits, as expected, less power against all alternatives. Further, we repeated the simulation study with an increased location-wise sample size of . As one would expect, the tests have more power in this case.
The results presented so far show that the proposed pooling methods work ‘as intended’, since the theoretical test characteristics are well approximated in finite sample situations, and since we observe decent power properties. In practical applications however, spatial pooling of locations is usually the starting point for subsequent analyses. For instance, one may be interested in estimating return levels at the location of interest based on the data from all locations that were identified as homogeneous. Moreover, the analysis of alternative data sets like climate model data may be based on the homogeneous locations identified within the analysis of observations.
This suggests that the methods should be examined with regard to their quality in subsequent analyses. For that purpose, we consider, as an example, the problem of return level estimation at the location of interest. The state-of-the-art method would consist of GEV fitting at the location of interest only, which results in (asymptotically) unbiased estimators that suffer from large variance. Basing the estimator on pooled regions will decrease the variance, but at the same time increase its bias if some heterogeneous locations have been wrongly identified as homogeneous.
In particular, pooling based on a conservative testing approach like the BH procedure leads to the acceptance of many locations and thus to a large pooling area and low estimation variance. Most likely, some of the chosen locations will be violating the null hypothesis though, which yields a rather large estimation bias. For pooling based on a more liberal rejection approach like the IM procedure, the estimation bias and variance behave exactly opposite: since the null hypotheses are more likely to be rejected, the resulting pooling sample is smaller (i.e., larger estimation variance) but ‘more accurate’ (i.e., smaller estimation bias).
For our comparison, we consider fitting the scale-GEV model based on pooled locations that have been obtained from one of the following eight methods
Here, LOI refers to considering the location of interest only (no pooling), full refers to pooling all available locations, and the last six methods encode pooling based on any combination of the proposed p-value combination methods and bootstrap approaches.
For each method, we compute the maximum likelihood estimate of the scale-GEV model parameters and transform this to an estimate of the -year return level (RL) in the reference climate of year by
where and and where is the cumulative distribution function of the GEV-distribution, see Equation (1). Now, in our simulation study, we know that the true value of the target RL is given by with
From the 500 replications we can therefore compute the empirical Mean Squared Error (MSE) of method as
where denotes the estimated RL obtained in the -th replication with method . Note that we have suppressed the MSE’s dependence on and from the notation.
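For concreteness, a return-level estimate and the empirical MSE can be computed as in the following R sketch, which assumes the scaling form discussed in Section 2.1; all names are illustrative.

```r
library(evd)  # qgev()

# return level of the scale-GEV model for a given return period, evaluated in
# the reference climate of a year with covariate value gmst_ref, given a
# fitted parameter vector (mu0, sigma0, gamma0, alpha)
return_level <- function(par, period = 100, gmst_ref) {
  mu_ref    <- par[1] * exp(par[4] * gmst_ref / par[1])
  sigma_ref <- par[2] * exp(par[4] * gmst_ref / par[1])
  qgev(1 - 1 / period, loc = mu_ref, scale = sigma_ref, shape = par[3])
}

# empirical MSE of one method over the simulation replications
mse <- function(rl_hat, rl_true) mean((rl_hat - rl_true)^2)
```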
In Figure 4 we compare MSEs of the 100-year RL with reference climate as in year 2021, by plotting differences of the MSEs obtained for selected pairs of methods. The plots reveal that both the MS BH and the MS IM method are superior to the LOI fit for almost all scenarios. Comparing the two methods to the full fit reveals that there are certain scenarios for which the full fit performs substantially worse, mostly when the shape and scale parameter deviate in the same direction under the alternatives. For those scenarios where the full fit outperforms the two methods, the discrepancy is not very large, with the BH method performing slightly better than the IM method.
A comparison between MS BH and MS IM is shown in Figure 5 for . The results reveal that the BH method slightly outperforms the IM method in the case for almost all alternative scenarios. In case , the results are quite mixed, with the IM method becoming clearly superior when the shape, scale and location parameters deviate jointly to the top. In all other scenarios, the differences are only moderate, sometimes favoring one method and sometimes the other. Corresponding results for the bootstrap methods based on bivariate extreme value distributions are very similar and therefore not shown. Further, the results were found to be robust against the choices of and that were made here for the return level estimation.
Overall, the results suggest the following practical recommendation: if the full sample is expected to be quite homogeneous a priori (for instance, because it was built based on expert knowledge), then estimation based on BH-based pooling is preferable over the other options (LOI, the full and the IM-based fit). If one expects to have a larger number of heterogeneous locations, it is advisable to apply the IM procedure (or any other liberal procedure), which likely rejects most of the heterogeneous locations and hence reduces the bias. In general, the liberal behavior of IM-based pooling suggests its use when it is of highest practical interest to obtain a pooled region that is as homogeneous as possible (as a trade-off, one has to accept that the region is probably much smaller than the initial full region).
4 Severe flooding in Western Europe during July 2021 revisited
We illustrate the new pooling methods in a case study by revisiting the extreme event attribution study for the heavy precipitation event that led to severe flooding in Western Europe during July 2021, see Kre21; Tra22. In that study, observational data were pooled together based on expert knowledge and on ad hoc tests, with the ultimate goal of assessing the influence of human-made climate change on the likelihood and severity of similar events in Western and Central Europe.
The full region under investigation in Kre21; Tra22 consists of sixteen grid cells reaching from the northern Alps to the Netherlands, see Figure 5 in Kre21 or the right-hand side of Figure 6. Two of the 16 locations were rejected in that study due to expert knowledge and too large deviations in fitted GEV parameters (regions 17 and 11 in Figure 6). Among other things, our illustrative application of the methods explained above will reveal that grid cell 11 has been rightfully dismissed, while grid cell 17 might have been considered homogeneous. Further, there is no clear evidence that any other grid cell that has been declared homogeneous should rather be considered non-homogeneous.
[Figure 6: the two initial areas of the case study: the larger grid of area (A) on the left and the 16 grid cells of area (B) from Kre21 on the right.]
For illustrative purposes, we apply our methods to two different initial areas:
(A) An area consisting of 36 grid cells covering a large part of Western/Central Europe, as shown in Figure 6 on the left.

(B) The original 16 grid cells from Kre21, as shown in Figure 6 on the right.
Note that homogeneity for the 20 grid cells at the boundary of the larger area in (A) has been dismissed based on expert knowledge in Kre21; the larger area is included here for illustrative purposes only.
The data used throughout the study consists of April-September block-maxima of tile-wise averaged 1-day accumulated precipitation amounts of the E-OBS data set (cornes2018ensemble, Version 23.1e). In both cases, the grid cell with label 21 is the one of primary interest, since it is the one containing the target location of the study, i.e., the region that accumulated the highest precipitation sum and experienced the largest impacts during the flooding of July 2021. The time series are shown in Figure C.1 in the supplementary material. There, we also plot values of obtained from different data sets: once from data of location 21 only, once from data of the respective location only, and once from the pooled data of the respective pair for .
We apply the two proposed bootstrap procedures to areas (A) and (B). Note that the raw p-values obtained with the bootstrap based on bivariate extreme value distributions should be very similar (or even identical when using the same seed for random number generation) for those grid cells that appear in both areas, while they may differ to a greater extent for the MS bootstrap. This is because the p-value for a given pair obtained with the bivariate bootstrap procedure only depends on the observations of the pair, while the respective p-value obtained with the MS bootstrap also depends on the spatial model that was fitted to the whole area. However, even if the raw p-value of a given pair obtained for setting (B) coincides with the raw p-value obtained for setting (A), the adjustment for multiple testing can lead to slightly different rejection decisions of the pair at a given level . The bootstrap procedures are applied with bootstrap replications.
We start by discussing the results of the application to the larger grid in (A). Recall that, for a given significance level, one rejects the null hypothesis for all grid cells whose p-value is smaller than that level. To visualise the results, we therefore shade the grid cells according to the magnitude of their (adjusted) p-value. Here, we divide the colour scale into three groups, namely p-values of at most 5%, between 5% and 10%, and above 10%, with a dark red tone assigned to the first group, a brighter red tone for the second group and an almost transparent shade for the third group. This allows us to read off the test decisions for significance levels of 5% and 10%: when the significance level is chosen as 10%, all tiles with a reddish shade are rejected, while when working with a level of 5% only tiles in the dark shade are rejected.
Results for both the conservative BH procedure and the liberal IM procedure are shown in Figure 7. For completeness, results on Holm’s method, which is even more conservative than BH, as well as the BH and IM p-values themselves can be found in the supplementary material, Tables C.2 and C.3. One can see that, for a given rejection method (i.e. BH or IM), the MS and bivariate procedures mostly agree on the rejection decisions that would be made at a level of 10% (compare the rows of Figure 7 to see this). The same holds when working with a significance level of 5%.
Further, as expected, the IM method rejects more hypotheses than the BH method. However, according to the results of the simulation study, it is quite likely that at least one of these rejections is a false discovery.
[Figure 7: shaded maps of the (adjusted) p-values for the larger grid of area (A), one panel per combination of rejection method (BH, IM) and bootstrap procedure (max-stable, bivariate).]
Analogous results for the grid in (B) are shown in Figure 8. As discussed above, except for the MS BH method, the results are consistent with those obtained for the larger grid in the sense that, for those locations which are contained in both grids, the locations with p-values of critical magnitude (below 10%) coincide (compare the plot in the upper right corner of Figure 8 to the plot in the upper right corner of Figure 7 to see this for the MS IM method, and similarly for the other methods). For the MS BH method, grid cells 10, 14, 15, and 16 are no longer significant at a level of 10%, but we recorded an adjusted p-value of 0.106 for those four grid cells, so this is a rather tight decision. The p-values obtained for the grid in (B) can be found in Table C.1 in the supplementary material.
[Figure 8: as Figure 7, but for the 16 grid cells of area (B).]
Let us now move on to the interpretation: considering the larger grid first, some grid cells for which the characteristics of extreme precipitation are different (according to expert opinion) from the grid cell of the target location are detected as heterogeneous. These rejected grid cells are located along the coast and in the mountainous terrain. Comparing the results with Kre21; Tra22, we observe that grid cell 11 has been rejected in their study based on expert knowledge. For grid cell 17, however, we do not detect any statistical evidence that the probabilistic behavior of extreme precipitation is different from the grid cell of the target location, even when applying the liberal IM procedure. We would like to stress though that non-rejection of a null hypothesis does not provide any evidence of the null hypothesis, even when ignoring the multiple testing issue. Hence, expert knowledge that leads to rejection should, in general, outweigh any statistical non-rejection. This particularly applies to the eastern (continental) grid cells in the larger 6 × 6 grid, which can be influenced by heavy precipitation caused by different synoptic situations compared to the target region.
Moreover, as the results for locations 10, 14, 15, and 16 showed some discrepancy across the different testing procedures, we suggest that the final decision on the exclusion or inclusion of these locations in a spatial pooling approach should be based on expert knowledge of the meteorological characteristics, and the willingness to trade possible bias for variance (with a possibly larger bias when including the locations – note that statistical evidence against homogeneity in the bivariate extreme value distribution-based bootstrap is only weak, and wrongly declaring the regions as homogeneous is possibly not too harmful). The same holds for locations 9, 20, 23 and 27 for which only the IM method yielded p-values between 5% and 10%. Again, these rather small p-values could be subject to false discoveries though, and since the heterogeneity signal is also not too strong, there is no clear evidence that these need to be excluded from pooling.
For a last evaluation of results from pairwise tests, we estimated the 100-year RLs in the reference climate of the year 2021, i.e. with reference value , on five different data sets obtained from the grid. Here, we use the data sets consisting of data from
• the location of interest only,
• pooling those grid cells suggested by the results of the case study (i.e., all cells but 11, or all cells but 10, 11, 14, 15, 16) or expert opinion (i.e., all cells but 11, 17),
• pooling all grid cells of the grid.
The results can be found in Table 2 and reveal that excluding cell 11 has a clear effect on the estimated RL. Excluding or including grid cell 17 once cell 11 is excluded does not have a large effect, while additionally excluding cells 10, 14, 15 and 16 has a moderate effect.
| cells excluded | location | scale | shape | trend | 100-year RL (2021) |
|---|---|---|---|---|---|
| none | 20.37 | 5.80 | 0.1039 | 1.50 | 58.43 |
| 11 | 20.01 | 5.44 | 0.0676 | 1.45 | 52.74 |
| 11, 17 | 20.01 | 5.40 | 0.0760 | 1.29 | 52.82 |
| 10, 11, 14, 15, 16 | 19.90 | 5.41 | 0.0484 | 1.79 | 51.93 |
| all but 21 | 21.92 | 6.08 | 0.0634 | | 54.37 |
Finally, we would like to mention that similar results were obtained when applying the BH test procedures to all triples containing the pair of grid cells , i.e., the extended target region considered in the study of Kre21; Tra22, consisting of those regions in Germany and Belgium affected worst by the July 2021 flooding.
5 Extensions
In this section, we discuss how to estimate region-wise return levels under homogeneity assumptions (Section 5.1). We also propose two possible extensions of the pooling approach from the previous sections to other hypotheses (Section 5.2) or other underlying model assumptions (Section 5.3).
5.1 Estimation of regional return levels and return periods
As pointed out in Kre21; Tra22 among others, an estimated return period (RP) of years for a given event and in a fixed reference climate (e.g., the preindustrial climate), obtained based on a fit of the GEV distribution to a pooled homogeneous sample, has the following interpretation: for each fixed location/tile within the region, one can expect one event of the same or larger magnitude within (imaginary) years of observing the reference climate. We refer to this quantity as the local return period. Obviously, one would expect more than one event of similar magnitude happening at at least one of the locations of the pooling region. Likewise, for a given , one would expect a higher -year return level for the whole region. The latter corresponds to the value that is expected to be exceeded only once in years at at least one of the locations.
Mathematically, using the notation from Section 2.1, the exceedance probability of value at at least one among locations in the reference climate corresponding to year is given by
such that the return period for event of the region is . Further, the -year return level of the region in the climate corresponding to year is the minimal value for which
holds. Both quantities could be computed (exactly) if one had access to the distribution of . For example, if the random variables were independent, could be further simplified to
where is the distribution function of the GEV distribution and where and denote the parameters at reference climate of year from Equation (2) under the homogeneity assumption from Equation (3).
The locations are, however, usually not independent in applications. In the following, we propose a simulation-based estimation method that involves max-stable process models to account for the spatial dependence. As before, the R package SpatialExtremes (SpatialExtremes) allows for fitting and simulating max-stable process models.
Algorithm 3.
(Simulation-based estimation of the regionwise RL and RP)
(1) Fit the scale-GEV parameters to the pooled homogeneous sample, resulting in an estimated parameter vector.

(2) Transform the margins of the pooled data to approximately unit Fréchet by applying the transformation from Equation (9) with the parameter estimate from step (1). Then fit several max-stable process models to the obtained data and choose the best fit according to the information criterion CLIC.

(3) Replicate the following steps a large number of times:
(i) Generate one random observation from the chosen max-stable process model.
(ii) Transform the margins to GEV margins by applying the transformation in (10) with the parameters estimated in step (1).
(iii) Compute the maximum of the resulting observation over all locations in the pooling region.

(4) The region-wise return level for a given return period, and the return period of an event of a given magnitude, can now be estimated from the empirical cumulative distribution function of the sample of simulated region-wise maxima (via its empirical quantiles and empirical exceedance probabilities, respectively).
In particular, once we have estimated the local 100-year RL, we can estimate the return period that this event has for the whole region. Likewise, when we have an estimate of the local return period of an event of a given magnitude, we can estimate the event magnitude corresponding to that return period for the whole region.
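A compact R sketch of Algorithm 3, with placeholder dependence parameters, GEV parameters and region size standing in for the quantities fitted in steps (1) and (2); the numbers it produces are therefore illustrative only.

```r
library(SpatialExtremes)  # rmaxstab()
library(evd)              # qgev()

set.seed(7)
N <- 2000                                           # simulation replications
coord <- as.matrix(expand.grid(x = 1:4, y = 1:4))   # pooling region (16 cells)
loc0 <- 20; scale0 <- 5.4; shape0 <- 0.07           # reference-climate GEV

# step (3): simulate fields, give them the fitted GEV margins and take the
# region-wise maximum of each replication
z <- rmaxstab(N, coord, cov.mod = "gauss", cov11 = 1, cov12 = 0, cov22 = 1)
x <- qgev(exp(-1 / z), loc = loc0, scale = scale0, shape = shape0)
region_max <- apply(matrix(x, nrow = N), 1, max)

# step (4): region-wise return period of a value v, and region-wise RL for a
# given return period
rp_region <- function(v) 1 / mean(region_max > v)
rl_region <- function(period) quantile(region_max, 1 - 1 / period)

rp_region(52.74)   # e.g. region-wise RP of a local 100-year return level
rl_region(100)     # region-wise 100-year return level
```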
We illustrate the estimators for the pooled data sets from Section 4. The estimates are based on simulation replications and are shown in Table 3. We see that the local 100-year return levels have substantially shorter region-wise return periods. In the region with 15 tiles (only cell 11 excluded), the estimated local 100-year RL at reference climate of 2021 can be expected to be exceeded once in approximately 19 years in at least one of the 15 tiles. We find a similar region-wise return period for the pooling region consisting of 14 tiles. In the pooling region consisting of 11 tiles, the local 100-year return level becomes a region-wise 33-year event. This comparably larger value arises from the smaller region that is considered: the smaller the region, the less likely it is that one of the locations exceeds a high threshold. Further, as expected, we find that the region-wise 100-year return levels at reference climate of 2021 are larger than their local counterparts. For the regions consisting of 15 and 14 tiles, this increase is approximately 26%, while it is approximately 17.3% for the region consisting of 11 tiles.
| cells excluded | local 100-year RL | region-wise RP of the local 100-year RL (years) | region-wise 100-year RL |
|---|---|---|---|
| 11 | 52.74 | 18.90 | 66.40 |
| 11, 17 | 52.82 | 18.32 | 67.08 |
| 10, 11, 14, 15, 16 | 51.93 | 32.76 | 60.93 |
5.2 A homogeneous scaling model with location-wise scaling factor
In this section, we maintain the temporal dynamics from the scale-GEV model from Equation (2). However, instead of testing for the homogeneity assumption from Equation (3), we additionally allow for a location-wise scaling factor under the null hypothesis. Such an approach can be useful when it is known that observations from different locations occur on different scales, but, apart from that, show a common probabilistic behaviour. In fact, a stationary version of the following model is commonly used in hydrology, where it is known as the Index Flood approach (dalrymple1960flood).
More precisely, suppose that
(15) |
where is a location-specific scaling factor that we may fix to 1 at the location of primary interest (say , i.e., ). Writing , the model in Equation (15) can be rewritten as
where
(16) |
Note that the parameters satisfy the relationships
for certain parameters ; in particular, can be retrieved from (note that the constraint on is not needed, but comes as a consequence of the first two relations). Fitting this model, instead of fitting the scale-GEV distribution to each location separately, has the advantage of substantially reducing the number of parameters that need to be estimated. Once the local scaling factors are identified, we can bring all observations to a common scale by dividing each location's observations by its location-specific scaling factor.
Now one can test whether such a local scaling model holds on a subset with cardinality , by testing the hypothesis
(17) |
with a Wald-type test statistic. In this case, the latter is defined as
(18) |
where is given by
with Jacobian matrix , since the hypothesis in Equation (17) may be rewritten as
When considering this kind of modification, the bootstrap algorithms from Section 2.4, steps (5)-(7), must be adapted accordingly. In step (5), one has to estimate the parameter under the constraint of the considered null hypothesis by adapting the log-likelihood. The estimated parameters are then used in the transformation step (6). Moreover, the test statistic in steps (6) and (7) is replaced by the statistic from (18). Further details are omitted for the sake of brevity.
5.3 General homogeneous models with smooth parametrization
In this section, we consider general GEV models in which the location, scale and shape parameters are allowed to depend in a (fixed) differentiable way on some parameter vector and some temporal covariate with . More precisely, suppose that and are (known) real-valued functions of and that are differentiable with respect to their first argument . We assume that, for each , there exists an unknown parameter such that with
The global null hypothesis of interest within this model is assumed to be expressible as for a differentiable function with .
An example is given by the linear shift model that is frequently considered when modelling temperature extremes in Extreme Event Attribution studies (see Phi20), where
A possible global null hypothesis of interest could be
where and .
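For orientation, the linear shift model referred to above is commonly written as follows; the symbols are chosen here for illustration and the stated hypothesis is only one example of the equality constraints one might impose:
\[
\mu(t) = \mu_0 + \alpha\,\mathrm{GMST}'_t, \qquad \sigma(t) = \sigma_0, \qquad \gamma(t) = \gamma_0,
\]
so that a global null hypothesis of space-time homogeneity again states that the location-wise parameter vectors $(\mu_0, \sigma_0, \gamma_0, \alpha)$ coincide across all locations under consideration.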
When considering this kind of extension, one has to adapt the maximum likelihood estimator as well as the estimator of its covariance matrix, hence steps (1)-(2) and (5)-(7) in the bootstrap algorithms are affected. Further details are omitted for the sake of brevity.
6 Conclusion
Extreme event attribution studies can build upon a GEV scaling model. Depending on the analysed variable, it may be useful to apply spatial pooling and fit the GEV distribution to a pooled sample of observations collected at sufficiently homogeneous spatial locations, as has been done in Kre21; Tra22; Vau15, among others. Here, we propose several statistical methods that enable the selection of a homogeneous pooling region from a larger initial region. The BH approach was found to be quite conservative, hence some heterogeneous locations are likely to be declared homogeneous. The IM approach is a more liberal alternative with a higher proportion of rejected locations, which may contain some homogeneous ones. In subsequent analyses, the selected pooling region typically results in a classical bias-variance trade-off: the larger the pooling region, the smaller the variance. At the same time, the bias may be larger, given that some heterogeneous regions may have been declared homogeneous. In practice, the tests should always be complemented by expert knowledge on the driving meteorological/climatological background processes.
To make the statistical approach for selecting homogeneous pooling regions for attribution studies, as described here, usable for the extreme event attribution community, we have developed a software package that can be freely downloaded and used by applied researchers (findpoolreg). The selection of spatial pooling regions for attribution studies may hence be based on a combination of expert knowledge and thorough statistical tests. The experts applying the methods can thereby decide between a liberal approach, which tends to reject more locations, and a conservative approach, which tends to accept more locations as being homogeneous. This decision depends on the a priori knowledge about the meteorology of the analysed area and the specific requirements of the study.
If the ultimate interest is estimation of, for example, return levels, one may, as an alternative to the classical approach based on pooling selected locations, consider penalized maximum likelihood estimators with a penalty on large heterogeneities (Buc21). A detailed investigation of the resulting bias-variance trade-off would be a worthwhile topic for future research.
Declaration
Ethical Approval
Not applicable.
Availability of supporting data
All methods are implemented in the R package findpoolreg (findpoolreg) that is publicly available at https://github.com/leandrazan/findpoolreg. The data used throughout the case study is derived from the E-OBS gridded data set, publicly available at https://www.ecad.eu/download/ensembles/download.php.
Competing interests
The authors declare that they have no conflict of interest.
Funding
This work has been supported by the integrated project “Climate Change and Extreme Events - ClimXtreme Module B - Statistics (subprojects B1.2, grant number: 01LP1902B and B3.3, grant number: 01LP1902L)” funded by the German Federal Ministry of Education and Research (BMBF).
Authors’ contributions
LZ and AB wrote the main manuscript and worked out the mathematical details. LZ implemented all methods and carried out the Monte Carlo simulation study and the case study. FK, PL and JT contributed by extensive discussions, provided the data for the case study and improved the text.
Acknowledgements
Computational infrastructure and support were provided by the Centre for Information and Media Technology at Heinrich Heine University Düsseldorf, which is gratefully acknowledged.
References
SUPPLEMENT TO THE PAPER:
“Regional Pooling in Extreme Event Attribution Studies: an Approach Based on Multiple Statistical Testing”
Leandra Zanger, Axel Bücher, Frank Kreienkamp,
Philip Lorenz, Jordis Tradowsky
This supplement contains mathematical details on the maximum likelihood estimator and the estimation of its covariance matrix, as well as additional simulation results and further details on the case study. References like (1) refer to equations from the main paper, while references like (A.1) or Figure B.2 refer to equations or Figures within this appendix.
Appendix A Mathematical Details
A.1 Maximum Likelihood estimation
Throughout this section, we provide mathematical details on the coordinate-wise maximum likelihood estimator from Equation (6). In particular, we motivate the approximate normality of the estimator, with mean equal to the true parameter vector and covariance matrix as defined in Equation (7). As in the stationary GEV model, the derivations require the shape parameters to be larger than $-1/2$, see BucSeg17.
We start with some explicit formulas for the functions appearing in Equations (6) and (7). For that purpose, let $\ell_{(\mu,\sigma,\gamma)}$ denote the log density of the plain GEV distribution with location $\mu$, scale $\sigma>0$ and shape $\gamma$ (see Appendix B in BucSeg17), i.e.,
\[
\ell_{(\mu,\sigma,\gamma)}(x) = -\log\sigma - \Big(1+\frac{1}{\gamma}\Big)\log\Big(1+\gamma\,\frac{x-\mu}{\sigma}\Big) - \Big(1+\gamma\,\frac{x-\mu}{\sigma}\Big)^{-1/\gamma} \tag{A.1}
\]
for all $x$ such that $1+\gamma(x-\mu)/\sigma>0$; here and in what follows, the case $\gamma=0$ is to be interpreted as the limit $\gamma\to 0$.
Then, writing the covariate-dependent GEV parameters as functions of the model parameters, the log-density from Equation (5) may be expressed as a composition of $\ell$ with these covariate-dependent parameters; we refer to this representation as Equation (A.2).
We next derive formulas for the gradient and the Hessian of the log-density with respect to the model parameters. For that purpose, let $\nabla\ell$ and $\nabla^2\ell$ denote the gradient and the Hessian, respectively, of the standard GEV log density (see Appendix B in BucSeg17 for precise formulas). Note that, since the GEV distribution is a location-scale family, the gradient and the Hessian of $\ell_{(\mu,\sigma,\gamma)}$ can be expressed in terms of $\nabla\ell$ and $\nabla^2\ell$ evaluated at the standardized observation $(x-\mu)/\sigma$, after suitable rescaling; we refer to these identities as Equations (A.3) and (A.4), respectively.
Next, consider the function that maps the model parameter vector (together with the covariate) to the covariate-dependent GEV parameters; its Jacobian with respect to the model parameters can be written down explicitly.
Then, in view of Equations (A.2) and (A.3), the multivariate chain rule yields an explicit expression for the gradient of the log-density with respect to the model parameters; we refer to this expression as Equation (A.5).
In view of the multivariate product rule, this further implies a corresponding expression for the Hessian, which we refer to as Equation (A.6); there, Equations (A.3) and (A.4) are used for the second equality, and an additional term involves the Jacobians of the rows of the aforementioned Jacobian matrix (i.e., second derivatives with respect to the model parameters). The latter can be derived explicitly by a tedious but straightforward calculation; we omit precise formulas for the sake of brevity.
We next motivate the claimed normal approximation. First of all, in view of the differentiability of the log-likelihood, the vector of maximum likelihood estimators is necessarily a zero of the gradient of the log-likelihood function, i.e., a solution of the score equations. Denoting the true parameter vector by $\theta_0$, a Taylor expansion of the score around $\theta_0$, with negligible higher-order terms, can be solved for the estimation error: the deviation of the maximum likelihood estimator from $\theta_0$ is approximately equal to the negative inverse Hessian of the log-likelihood at $\theta_0$, applied to the score at $\theta_0$.
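For concreteness, a schematic version of this standard expansion in generic notation, which may differ from the main paper's (here, $L_n$ denotes the joint log-likelihood, $\hat\theta_n$ the maximum likelihood estimator and $R_n$ the negligible remainder), reads
\[
0 = \nabla L_n(\hat\theta_n) = \nabla L_n(\theta_0) + \nabla^2 L_n(\theta_0)\,(\hat\theta_n-\theta_0) + R_n
\qquad\Longrightarrow\qquad
\hat\theta_n-\theta_0 \approx -\big[\nabla^2 L_n(\theta_0)\big]^{-1}\,\nabla L_n(\theta_0).
\]
The approximate normality then follows from a law of large numbers for the (suitably normalized) Hessian and a central limit theorem for the score, as discussed next.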
By Equation (A.6), each block matrix on the block-diagonal of the Hessian of the log-likelihood is a sum, over the observation years, of terms of the form given in Equation (A.7): a product of a suitable function of the temporal covariate and a suitable function of the standardized observation defined in Equation (A.8), where the standardized observations follow a standard GEV distribution and are independent over the years. Under suitable assumptions on the covariates, the variance of such a (normalized) sum is of the order of the inverse sample size. As a consequence, each block is close to its expectation, which is the corresponding quantity defined just after Equation (7). More precisely, in an asymptotic framework where the covariates are assumed to be generated by a continuous function of rescaled time, expressions as in Equation (A.7) converge to an integral of the expectation of the summands over the rescaled time interval (note that both the integral and the expectation exist).
Next, consider the score, i.e., the gradient of the log-likelihood at the true parameter. It suffices to argue that we may apply a suitable version of the central limit theorem, under suitable assumptions on the covariates. Similarly as for the Hessian, by Equation (A.1), each entry of the score is a sum of terms that factor into a function of the temporal covariate and a function of the standardized observation from Equation (A.8). In view of the independence over the years and the fact that the standardized observations follow a standard GEV distribution, we may for instance apply the Lyapunov CLT for independent triangular arrays, see Theorem 27.3 in Bil95. Since, for the functions of interest, the summands possess finite moments of order $2+\delta$ for sufficiently small $\delta>0$, a sufficient condition for its applicability is a Lyapunov-type moment condition on the weighted sums, which readily follows, for instance, if one assumes that the covariates are generated by a continuous function of rescaled time.
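For reference, the Lyapunov condition in its generic form (for a triangular array of row-wise independent random variables $X_{n,1},\dots,X_{n,n}$ with finite variances and $s_n^2=\sum_{t=1}^n\operatorname{Var}(X_{n,t})$; the symbols are illustrative and not the main paper's notation) reads
\[
\lim_{n\to\infty}\ \frac{1}{s_n^{2+\delta}}\sum_{t=1}^{n}\mathbb{E}\big[\,|X_{n,t}-\mathbb{E}[X_{n,t}]|^{2+\delta}\,\big]=0
\qquad\text{for some }\delta>0,
\]
in which case $s_n^{-1}\sum_{t=1}^n \big(X_{n,t}-\mathbb{E}[X_{n,t}]\big)$ converges in distribution to a standard normal random variable.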
A.2 Covariance Estimation
Throughout this section, we provide an estimator for the covariance matrix defined in Equation (7). First of all, consider an (approximate) Hessian of the location-wise log-likelihood function, evaluated at its maximizer, possibly obtained by numerical differentiation. Note that this matrix is routinely returned by standard implementations for numerical optimization; for instance, the optim function in R returns the Hessian of the objective function at the optimum in the output value hessian when called with hessian = TRUE.
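As a minimal illustration (a hypothetical example, not code from findpoolreg), the following R snippet minimizes a simple Gumbel negative log-likelihood with optim and retrieves the numerically differentiated Hessian at the optimum via hessian = TRUE:

```r
# Negative log-likelihood of the Gumbel distribution (GEV with shape 0)
gumbel_nll <- function(par, x) {
  mu <- par[1]; sigma <- par[2]
  if (sigma <= 0) return(Inf)            # keep the scale parameter positive
  z <- (x - mu) / sigma
  sum(log(sigma) + z + exp(-z))
}

set.seed(3)
x   <- 10 + 2 * (-log(-log(runif(75))))  # toy sample of 75 Gumbel block maxima
fit <- optim(c(mu = 10, sigma = 2), gumbel_nll, x = x, hessian = TRUE)

fit$hessian          # (approximate) Hessian of the negative log-likelihood
solve(fit$hessian)   # naive covariance estimate for the parameter estimates
```

In the setting of the main paper, the location-wise log-likelihood would of course involve the covariate-dependent GEV parameters; the mechanics of extracting the Hessian are the same.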
It remains to estimate the cross covariances between the score vectors at different locations. By Equation (A.1), the score at each location may be written as a sum, over the observation years, of covariate-dependent weights applied to the gradient of $\ell$ from Equation (A.1), evaluated at the standardized observations from Equation (A.8), which follow a standard GEV distribution in every year. Note that the cross covariance of these standardized score contributions does not depend on the year, and may hence be estimated empirically after replacing the true parameters by their estimators: we use the empirical cross covariance matrix of the two samples of estimated per-year score contributions at the two locations under consideration. The final estimator for the covariance matrix from Equation (7) is then obtained by combining the (inverse) Hessians from above with these estimated cross covariances.
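To illustrate the combination step, here is a schematic R sketch with hypothetical inputs (not the findpoolreg implementation, and possibly using a different normalization than the main paper): S1 and S2 contain the estimated per-year score contributions of two locations as rows, and H1, H2 are the Hessians of the respective negative log-likelihoods at their optima.

```r
# Off-diagonal covariance block for the estimators at two locations:
# sandwich-type combination of inverse Hessians and the empirical
# cross covariance of the per-year score contributions.
cross_cov_block <- function(S1, S2, H1, H2) {
  stopifnot(nrow(S1) == nrow(S2))
  n   <- nrow(S1)
  B12 <- cov(S1, S2)                      # empirical cross covariance (p x p)
  solve(H1) %*% (n * B12) %*% solve(H2)   # approximate Cov(theta_hat_1, theta_hat_2)
}

# Toy usage with made-up score matrices and Hessians:
set.seed(4)
S1 <- matrix(rnorm(75 * 3), ncol = 3)
S2 <- 0.5 * S1 + 0.5 * matrix(rnorm(75 * 3), ncol = 3)
H1 <- H2 <- diag(3) * 75
cross_cov_block(S1, S2, H1, H2)
```

The exact normalization (factors of the sample size) must of course be chosen to match the definition of the covariance matrix in Equation (7).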
Appendix B Additional results of the simulation study
B.1 Additional results for the record length of the main simulation study
We report the power properties obtained with the BH and IM method for procedure (B3) in Figures B.1 and B.2, respectively. Power properties obtained with the Holm method are shown in Figures B.3 (for (B2)) and B.4 (for (B3)).
B.2 Additional results for a larger record length
Since the bootstrap procedures implicitly rely on the asymptotic distribution of the test statistic, we repeated the simulation study with a larger sample size in order to investigate the sample size's impact on their performance. The location-wise sample size is increased compared to the main study. Again, the FDR and FWER are reported (Table B.1), as well as the power plots for BH and (B2) in Figure B.5 and for IM and (B2) in Figure B.6. As expected, the error rates are again controlled by those methods that claim to do so, while the power has substantially increased (on average by 50% for the BH method and by 18% for the IM method). The results for the other methods and for (B3) were again similar.
Table B.1: FDR and FWER (in %) for procedures (B2) and (B3), obtained with the larger record length.

Method | min FDR (B2) | min FDR (B3) | max FDR (B2) | max FDR (B3) | mean FDR (B2) | mean FDR (B3) | min FWER (B2) | min FWER (B3) | max FWER (B2) | max FWER (B3) | mean FWER (B2) | mean FWER (B3) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Scenario 1: | | | | | | | | | | | | |
BH | 6.9 | 5.2 | 12.4 | 11.4 | 9.4 | 7.8 | 8.7 | 6.8 | 23.9 | 20.4 | 15.8 | 13.4 |
Holm | 2.7 | 1.4 | 10.3 | 8.0 | 6.4 | 4.6 | 6.8 | 3.9 | 12.9 | 9.4 | 9.5 | 6.7 |
IM | 26.4 | 25.7 | 60.3 | 59.9 | 35.2 | 34.6 | 53.4 | 52.7 | 64.4 | 63.8 | 59.2 | 58.3 |
Scenario 2: | | | | | | | | | | | | |
BH | 4.0 | 2.7 | 10.7 | 8.3 | 5.6 | 5.0 | 6.3 | 5.1 | 32.4 | 32.6 | 21.6 | 19.9 |
Holm | 0.8 | 0.5 | 10.3 | 6.5 | 2.9 | 2.1 | 4.2 | 3.0 | 11.2 | 8.0 | 7.1 | 5.2 |
IM | 8.2 | 7.8 | 60.1 | 59.7 | 14.2 | 13.8 | 41.8 | 40.9 | 60.1 | 59.7 | 47.9 | 46.6 |
Appendix C Additional results for the case study
The complete results of the bootstrap procedures applied to the smaller grid of 16 locations can be found in Table C.1. For the larger grid of 36 locations, the complete results of the bootstrap based on bivariate extreme value distributions can be found in Table C.2, and the results for the bootstrap procedure based on max-stable processes in Table C.3.
The time series used throughout the case study are shown in Figure C.1. Along with the Apr-Sep maxima of 1950-2021, we plot the fitted time-varying location parameter, where the underlying model parameters are estimated on the data of location 21 only (blue line), on the data of the respective location only (red line), or on the pooled data of the pair (green line). Note that these three lines should not differ much when the homogeneity assumption holds. On the other hand, perfectly matching lines do not imply that the homogeneity assumption is true, since they do not carry any information on the scale and shape parameters of the distributions.
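A diagnostic plot of this kind can be produced along the following lines; the sketch uses toy data and hypothetical fitted location curves (loc_21, loc_d, loc_pool), which in practice would come from the respective GEV fits:

```r
years  <- 1950:2021
set.seed(1)
maxima <- 30 * exp(rnorm(length(years), sd = 0.3))   # toy Apr-Sep maxima
# Hypothetical fitted time-varying location parameters from the three fits:
loc_21   <- 28.0 + 0.050 * (years - 1950)
loc_d    <- 27.0 + 0.060 * (years - 1950)
loc_pool <- 27.5 + 0.055 * (years - 1950)

plot(years, maxima, type = "h", col = "grey60",
     xlab = "Year", ylab = "Apr-Sep maximum")
lines(years, loc_21,   col = "blue")     # fit based on location 21 only
lines(years, loc_d,    col = "red")      # fit based on the respective location
lines(years, loc_pool, col = "green3")   # fit based on the pooled pair
legend("topleft", lty = 1, col = c("blue", "red", "green3"), bty = "n",
       legend = c("location 21 only", "respective location", "pooled pair"))
```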
Table C.1: Results for all pairs on the smaller grid of 16 locations: observed test statistic and p-values (in %) obtained with the max-stable (MS) bootstrap (B2) and the bivariate bootstrap (B3), together with their BH- and Holm-adjusted versions.

Pair | Test statistic | p (B2) | p BH (B2) | p Holm (B2) | p (B3) | p BH (B3) | p Holm (B3) |
---|---|---|---|---|---|---|---|
(21, 11) | 78.8 | 0.00 | 0.00 | 0.00 | 0.05 | 0.75 | 0.75 |
(21, 16) | 15.6 | 1.60 | 10.64 | 22.39 | 1.45 | 9.90 | 20.29 |
(21, 15) | 15.2 | 2.50 | 10.64 | 32.48 | 2.50 | 9.90 | 32.48 |
(21, 10) | 13.6 | 3.40 | 10.64 | 40.78 | 3.30 | 9.90 | 36.58 |
(21, 14) | 13.7 | 3.55 | 10.64 | 40.78 | 3.05 | 9.90 | 36.58 |
(21, 23) | 12.1 | 5.30 | 13.24 | 52.97 | 6.30 | 15.74 | 62.97 |
(21, 20) | 10.4 | 7.15 | 15.09 | 64.32 | 9.85 | 16.36 | 78.76 |
(21, 27) | 10.6 | 8.05 | 15.09 | 64.37 | 8.10 | 16.36 | 72.86 |
(21, 9) | 9.8 | 10.00 | 15.59 | 69.97 | 10.44 | 16.36 | 78.76 |
(21, 22) | 9.4 | 10.39 | 15.59 | 69.97 | 11.69 | 16.36 | 78.76 |
(21, 8) | 9.0 | 13.04 | 17.79 | 69.97 | 11.99 | 16.36 | 78.76 |
(21, 26) | 7.2 | 20.34 | 25.42 | 81.36 | 22.04 | 27.55 | 88.16 |
(21, 17) | 4.4 | 46.08 | 53.17 | 100.00 | 45.73 | 52.76 | 100.00 |
(21, 28) | 3.9 | 52.17 | 55.90 | 100.00 | 50.92 | 54.56 | 100.00 |
(21, 29) | 2.8 | 66.92 | 66.92 | 100.00 | 65.77 | 65.77 | 100.00 |
Table C.2: Results of the bootstrap based on bivariate extreme value distributions (B3) for all pairs on the larger grid of 36 locations: observed test statistic and p-value (in %), together with its BH- and Holm-adjusted versions.

Pair | Test statistic | p-value | BH-adj. p-value | Holm-adj. p-value |
---|---|---|---|---|
(21, 3) | 41.5 | 0.10 | 0.50 | 3.50 |
(21, 4) | 87.3 | 0.10 | 0.50 | 3.50 |
(21, 6) | 46.0 | 0.10 | 0.50 | 3.50 |
(21, 11) | 78.8 | 0.10 | 0.50 | 3.50 |
(21, 12) | 75.2 | 0.10 | 0.50 | 3.50 |
(21, 13) | 33.7 | 0.10 | 0.50 | 3.50 |
(21, 25) | 38.5 | 0.10 | 0.50 | 3.50 |
(21, 19) | 28.6 | 0.20 | 0.87 | 5.59 |
(21, 31) | 33.3 | 0.30 | 1.17 | 8.09 |
(21, 7) | 16.1 | 1.50 | 5.24 | 38.96 |
(21, 5) | 15.7 | 1.70 | 5.40 | 42.46 |
(21, 16) | 15.6 | 2.00 | 5.83 | 47.95 |
(21, 10) | 13.6 | 2.70 | 6.99 | 62.04 |
(21, 15) | 15.2 | 2.80 | 6.99 | 62.04 |
(21, 14) | 13.8 | 4.10 | 9.56 | 86.01 |
(21, 33) | 12.5 | 4.70 | 10.27 | 93.91 |
(21, 23) | 12.1 | 6.89 | 14.19 | 100.00 |
(21, 20) | 10.4 | 8.39 | 16.19 | 100.00 |
(21, 27) | 10.6 | 8.79 | 16.19 | 100.00 |
(21, 36) | 10.5 | 9.89 | 16.98 | 100.00 |
(21, 34) | 9.9 | 10.19 | 16.98 | 100.00 |
(21, 2) | 9.6 | 11.19 | 17.03 | 100.00 |
(21, 9) | 9.8 | 11.19 | 17.03 | 100.00 |
(21, 22) | 9.4 | 13.69 | 19.44 | 100.00 |
(21, 8) | 9.0 | 13.89 | 19.44 | 100.00 |
(21, 35) | 8.2 | 15.88 | 21.38 | 100.00 |
(21, 26) | 7.2 | 19.78 | 25.64 | 100.00 |
(21, 30) | 6.1 | 29.17 | 36.46 | 100.00 |
(21, 32) | 5.8 | 33.07 | 39.91 | 100.00 |
(21, 17) | 4.4 | 46.75 | 54.55 | 100.00 |
(21, 28) | 3.9 | 52.75 | 59.55 | 100.00 |
(21, 29) | 2.8 | 66.13 | 72.33 | 100.00 |
(21, 24) | 2.7 | 70.23 | 74.49 | 100.00 |
(21, 1) | 2.4 | 73.13 | 75.28 | 100.00 |
(21, 18) | 1.8 | 83.82 | 83.82 | 100.00 |
Table C.3: Results of the bootstrap based on max-stable processes (B2) for all pairs on the larger grid of 36 locations: observed test statistic and p-value (in %), together with its BH- and Holm-adjusted versions.

Pair | Test statistic | p-value | BH-adj. p-value | Holm-adj. p-value |
---|---|---|---|---|
(21, 3) | 41.5 | 0.00 | 0.00 | 0.00 |
(21, 4) | 87.2 | 0.00 | 0.00 | 0.00 |
(21, 11) | 78.8 | 0.00 | 0.00 | 0.00 |
(21, 12) | 75.2 | 0.00 | 0.00 | 0.00 |
(21, 25) | 38.5 | 0.00 | 0.00 | 0.00 |
(21, 6) | 46.0 | 0.10 | 0.50 | 3.00 |
(21, 13) | 33.4 | 0.10 | 0.50 | 3.00 |
(21, 31) | 33.3 | 0.20 | 0.87 | 5.59 |
(21, 19) | 28.7 | 0.30 | 1.17 | 8.09 |
(21, 7) | 16.0 | 1.80 | 6.29 | 46.75 |
(21, 5) | 15.7 | 2.60 | 7.53 | 64.94 |
(21, 15) | 15.2 | 2.70 | 7.53 | 64.94 |
(21, 16) | 15.6 | 2.80 | 7.53 | 64.94 |
(21, 14) | 13.7 | 3.70 | 9.24 | 81.32 |
(21, 10) | 13.6 | 4.40 | 10.26 | 92.31 |
(21, 23) | 12.1 | 5.09 | 11.15 | 100.00 |
(21, 33) | 12.5 | 5.89 | 12.13 | 100.00 |
(21, 27) | 10.6 | 8.59 | 15.96 | 100.00 |
(21, 20) | 10.4 | 8.89 | 15.96 | 100.00 |
(21, 36) | 10.5 | 9.19 | 15.96 | 100.00 |
(21, 9) | 9.8 | 9.59 | 15.96 | 100.00 |
(21, 2) | 9.6 | 10.49 | 15.96 | 100.00 |
(21, 34) | 9.9 | 10.49 | 15.96 | 100.00 |
(21, 22) | 9.4 | 10.99 | 16.03 | 100.00 |
(21, 8) | 9.0 | 12.39 | 17.34 | 100.00 |
(21, 35) | 8.2 | 16.38 | 22.05 | 100.00 |
(21, 26) | 7.2 | 21.88 | 28.36 | 100.00 |
(21, 30) | 6.1 | 28.77 | 35.96 | 100.00 |
(21, 32) | 5.8 | 31.87 | 38.46 | 100.00 |
(21, 17) | 4.4 | 46.25 | 53.96 | 100.00 |
(21, 28) | 3.9 | 50.95 | 57.52 | 100.00 |
(21, 29) | 2.8 | 66.33 | 70.99 | 100.00 |
(21, 24) | 2.7 | 66.93 | 70.99 | 100.00 |
(21, 1) | 2.4 | 71.03 | 73.12 | 100.00 |
(21, 18) | 1.8 | 81.92 | 81.92 | 100.00 |