This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

A Nearly Similar Powerful Test for Mediation

Kees Jan van Garderen
Amsterdam School of Economics
University of Amsterdam
K.J.vanGarderen@uva.nl
   Noud van Giersbergen
Amsterdam School of Economics
University of Amsterdam
N.P.A.vanGiersbergen@uva.nl
Abstract

This paper derives a new powerful test for mediation that is easy to use. Testing for mediation is empirically very important in psychology, sociology, medicine, economics and business, generating over 100,000 citations to a single key paper. The no-mediation hypothesis H0:θ1θ2=0H_{0}:\theta_{1}\theta_{2}=0 also poses a theoretically interesting statistical problem since it defines a manifold that is non-regular in the origin where rejection probabilities of standard tests are extremely low. We prove that a similar test for mediation only exists if the size is the reciprocal of an integer. It is unique, but has objectionable properties. We propose a new test that is nearly similar with power close to the envelope without these abject properties and is easy to use in practice. Construction uses the general varying gg-method that we propose. We illustrate the results in an educational setting with gender role beliefs and in a trade union sentiment application.

Keywords: Varying gg-method, Mediation, Indirect Effect, Power Envelope, Similar Tests, Invariant Tests, Optimal Tests

Version October, 2021

1 Introduction

This paper derives a new powerful test for mediation that is easy to use. Testing for mediation effects is empirically extremely important in various scientific disciplines. A key paper in psychology, Baron and Kenny (1986) has more than 100,000 citations111 Cited by 106,782 on 13 October 2021, 90,147 on 15 January 2020, and 79,205 on 22 October 2018. and is used in many other fields. Mediation testing is important in accounting, e.g. Coletti et al. (2005), marketing, e.g. MacKenzie et al. (1986), sociology, e.g. Alwin and Hauser (1975) who used the expression indirect effect, a term commonly used in economics also.For a recent overview of mediation in economics see Huber (2020) and e.g. Heckman and Pinto (2015a, b) on treatment effects and production technology, and Imbens (2020) who extensively reviews connections between directed acyclic graphs (DAGs), potential outcomes, causal inference, instrumental variables, and mediation. Frewen et al. (2013) is exemplary for the increasing network literature, including DAGs, using mediation. The minimal selection here is hardly representative for the vast body of literature on mediation analysis. It only illustrates the breadth of its empirical relevance. Tests for mediation can have extremely low power, especially when the effect is small, or estimated with large variance. The primary purpose of this paper is to provide a new and more powerful test that is easy to use.

The aim of mediation testing is to discover if an independent variable (XX) causes a dependent variable (YY) via an intervening, or mediating variable (MM). The mediating variable is exogenous in the common experimental setting in psychology and other fields, but is also considered exogenous in other settings where assignments are random or constitute a natural experiment. The basic model is simply:

Y\displaystyle Y =τX+θ2M+u,\displaystyle=\tau X+\theta_{2}M+u, (1)
M\displaystyle M =θ1X+v,\displaystyle=\theta_{1}X+v, (2)

where all variables are taken in deviation from their means, or more generally after partialing out other exogenous effects. The disturbances uu and vv are assumed to be independent because of an experimental set up and more generally because no influence of YY on MM is assumed in this type of model. This independence is a crucial identification condition, since the parameter θ2\theta_{2} cannot be estimated consistently if MM is endogenous. We make a convenient distributional assumption: (ui,vi)IIN(0,diag(σ11,σ22))\left(u_{i},v_{i}\right)^{\prime}\sim II\emph{N}\left(0,diag\left(\sigma_{11},\sigma_{22}\right)\right), i=1,,n,i=1,\cdots,n, with nn the number of observations. This facilitates a likelihood analysis, but is not necessary for the asymptotic normality of the tt-statistics that will be used.

MacKinnon et al. (2002) give a literature review and compare 14 different methods for testing the effects of a mediation variable. These methods are based on standardized measures of the product of two coefficients θ1θ2\theta_{1}\theta_{2} or based on the difference of two related coefficients (ττ\tau^{\ast}-\tau) in equations (1) and (3):

Y=τX+u.Y=\tau^{\ast}X+u^{\ast}. (3)

If there is a mediation effect, then XX influences M,M, such that θ10,\theta_{1}\neq 0, and MM influences Y,Y, such that θ20.\theta_{2}\neq 0. If there is no mediation by M,M, then the effect of XX on YY is not altered by the inclusion of MM such that ττ=0.\tau^{\ast}-\tau=0.

Model (3) is a restricted version of (1) with θ2=0,\theta_{2}=0, and it is straightforward therefore to show that the OLS estimates for the three models satisfy τ^=τ^+θ^1θ^2\hat{\tau}^{\ast}=\hat{\tau}+\hat{\theta}_{1}\hat{\theta}_{2} and the relation ττ=θ1θ2\tau^{\ast}-\tau=\theta_{1}\theta_{2} also holds in model interpretation terms; see Appendix A.

Refer to caption
Figure 1: (No)-Mediation, Moderation, and Confoundedness

Figure 1 illustrates mediation, no mediation, and the related concept of moderation in terms of directed acyclic graphs. MM is a moderator if it changes the relation between XX and YY. In its basic form this is modeled by adding the interaction term between XX and MM to the model. When confounders WW are observed and included we have:

Y=τX+γXM+δW+u.Y=\tau^{\sharp}X+\gamma XM+\delta W+u^{\sharp}. (4)

WW can also be included in (1) - (3), but partialed out such that YY, X,X, and MM are residuals after regressions on W.W. Moderation and confoundedness can be tested by the ordinary tt or FF tests on γ\gamma and δ,\delta, but testing for mediation is less straightforward.

The best well known and commonly used test for mediation by Sobel (1982) is a Wald-type test of the form θ^1θ^2/SE(θ^1θ^2)\ \hat{\theta}_{1}\hat{\theta}_{2}/SE(\hat{\theta}_{1}\hat{\theta}_{2}), with SE(θ^1θ^2)SE(\hat{\theta}_{1}\hat{\theta}_{2}) an estimate of the standard error of the product θ^1θ^2.\hat{\theta}_{1}\hat{\theta}_{2}. It is available in standard statistical packages such as SAS, R, Stata, or SPSS. It has good properties when either θ1\theta_{1} or θ2\theta_{2} is large and the standard errors of θ^1\hat{\theta}_{1} and θ^2\hat{\theta}_{2} are small, but if the two tt-tests for testing θ1=0\theta_{1}=0 and θ2=0\theta_{2}=0 tend to be small, properties deteriorate. For parameter values under the null, the Null Rejection Probability (NRP) can be very close to zero and, under the alternative, power can fall far below the size (highest NRP) of 5%5\% that we use throughout. All tests considered in MacKinnon et al. (2002) suffer from these problems. The distributions of the test statistics considered in the literature depend on the value of the parameters under the null. As a consequence none of these tests is similar, meaning that rejection probabilities are not constant on the boundary of the null hypothesis. In fact rejection probabilities under alternatives close to the origin, i.e. power, can be much lower than size and these tests are biased.

Much effort in the literature has gone into improving well-known test statistics, such as the Wald statistic, without a satisfactory solution. The bootstrap is invalid (see Van Garderen and Van Giersbergen (2021)) and cannot salvage these statistics. The key step in our approach is to move away from test statistics and consider the critical region in the sample space directly and optimize the flexible boundary we define.

We make two main contributions. Our main theoretical contribution in Theorem 2 shows that a similar mediation test exists if and only if the level of the test is the reciprocal of an integer, i.e. 1/α,1/\alpha\in\mathbb{N}, or α=0\alpha=0. Hence, for practical levels such as 1%,1\%, 5%5\%, or 10%10\% an exact similar test exists. The proof is constructive and the test is unique within the class considered. Unfortunately, the critical region is objectionable in that it includes an area near the origin where both tt-statistics are arbitrarily close to zero. Such values do not provide overwhelming evidence against the null and Perlman and Wu (1999) coined the term “Emperor’s New Tests” for similar tests with undesirable properties like these. Insistence on similarity can render LR tests α\alpha-inadmissible, cf. Lehmann and Romano (2005, Section 6.7), but Perlman and Wu (1999) give examples where similar tests have extremely undesirable properties, yet inadmissible LR tests still provide reasonable answers. In the mediation setting the LR test is also inadmissible, but does not provide a satisfactory answer in this case. It is much better than the Wald test as we will see, but nevertheless suffers from extremely poor power properties for parameter values close to the origin.

So a better test is called for and our second, more important contribution therefore is practical. We construct a new simple test for mediation that is uniformly more powerful than the LR test without the undesirable properties sketched by Perlman and Wu (1999), and that is nearly similar and practically unbiased.

It is extremely simple to use:

  1. 1.

    Order the absolute values of the common tt-statistics from the basic OLS regressions (1) and (2) for testing θ1=0\theta_{1}=0 or θ2=0:\theta_{2}=0: |t|(1)=min{|t1|,|t2|}\left|t\right|_{(1)}=\min\left\{\left|t_{1}\right|,\left|t_{2}\right|\right\}, |t|(2)=max{|t1|,|t2|}\left|t\right|_{(2)}=\max\left\{\left|t_{1}\right|,\left|t_{2}\right|\right\}

  2. 2.

    Reject if |t|(1)>g(|t|(2))\left|t\right|_{(1)}>g(\left|t\right|_{(2)}) using Table 1 (or employing the code in Appendix E).

tt 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.1 0.1 0.10672 0.10672 0.10672 0.10672 0.10672 0.11672 0.12671 0.13670 0.14669
0.2 0.15669 0.16668 0.17667 0.18666 0.19666 0.20665 0.21664 0.22663 0.23663 0.24662
0.3 0.25661 0.26660 0.27660 0.28659 0.29658 0.30658 0.31657 0.32656 0.33655 0.34655
0.4 0.35654 0.36653 0.37652 0.38652 0.39651 0.40650 0.41649 0.42649 0.43648 0.44647
0.5 0.45646 0.46646 0.47645 0.48644 0.49643 0.50643 0.51642 0.52641 0.53640 0.54640
0.6 0.55639 0.56638 0.57637 0.58637 0.59636 0.60635 0.61634 0.62634 0.63633 0.64632
0.7 0.65631 0.66631 0.67630 0.68629 0.69628 0.70628 0.71627 0.72626 0.73625 0.74625
0.8 0.75624 0.76623 0.77622 0.78622 0.79621 0.80620 0.81620 0.82619 0.83618 0.84617
0.9 0.85617 0.86616 0.87615 0.88614 0.89614 0.90613 0.91612 0.92611 0.93611 0.94610
1.0 0.95609 0.96608 0.97608 0.98607 0.99606 1.00605 1.01605 1.02604 1.03603 1.04602
1.1 1.05602 1.06601 1.07600 1.08599 1.09599 1.10598 1.11597 1.12596 1.13596 1.14595
1.2 1.15594 1.16593 1.17593 1.18592 1.19591 1.20590 1.21590 1.22589 1.23588 1.24587
1.3 1.25587 1.26586 1.27585 1.28584 1.29584 1.30583 1.31286 1.31310 1.31310 1.31310
1.4 1.31310 1.31310 1.31310 1.31310 1.31310 1.31750 1.32750 1.33750 1.34750 1.35750
1.5 1.36750 1.37750 1.38750 1.39750 1.40750 1.41750 1.42750 1.43750 1.44750 1.45750
1.6 1.46750 1.47750 1.48750 1.49750 1.50750 1.51750 1.52750 1.53750 1.54750 1.55750
1.7 1.56750 1.57750 1.58750 1.59750 1.60750 1.61750 1.62750 1.63750 1.64750 1.65750
1.8 1.66750 1.67750 1.68750 1.69750 1.70750 1.71750 1.72750 1.73750 1.74750 1.75750
1.9 1.76750 1.77750 1.78750 1.79750 1.80750 1.81750 1.82750 1.83750 1.84750 1.85750
2.0 1.86750 1.87750 1.88750 1.89750 1.90750 1.91750 1.92750 1.93750 1.94750 1.95750
2.1 1.95996 1.95996 1.95996 1.95996 1.95996 1.95996 1.95996 1.95996 1.95996 1.95996
Table 1: gg-function. Entries are g(t)g(t) values for t=t= value first column+value first row. Reject if smallest absolute tt-statistic is larger than g(largest absolute t-statistic)g\left(\text{largest absolute }t\text{-statistic}\right) using ordinary tt-statistics from OLS regressions (1) and (2). E.g. if t1=2.052t_{1}=2.052 and t2=1.941t_{2}=-1.941 then 1.941>1.9175=g(2.05)1.941>1.9175=g(2.05) and reject. Linear interpolation results in |NRP5%|<0.00001.\left|NRP-5\%\right|<0.00001.

That the test can be based on elementary tt-statistics is important for ease of application, but is theoretically justified below by sufficiency and invariance arguments. The testing problem is invariant to permutations of parameters and statistics and sign changes. We show that the ordered absolute tt-statistic, (|t|(1),|t|(2)),(\left|t\right|_{(1)},\left|t\right|_{(2)}), is a maximal invariant. The critical region is a subset of the relevant sample space which is an octant in 2\mathbb{R}^{2}. This can also be justified asymptotically under very weak assumptions and different estimation methods.

The new test is constructed by varying the boundary of the critical region, defined as a function gg, such that the test is almost similar and minimizing the distance to the power envelope surface. It is based on a new general, so called varying-gg method that can be applied to other testing problems with nuisance parameters more generally to obtain near similar tests. It does not require a choice of mixture distribution, nor the construction of least favorable distributions, cf. Andrews and Ploberger (1994), Andrews et al. (2006, 2008), Elliott et al. (2015), Guggenberger et al. (2019). It can be given a random critical value interpretation as in Moreira and Mourão (2016) since any critical region in a higher-dimensional space has a boundary for one statistic in terms of the remaining statistics. The critical region that we construct is fixed however, not random, avoids simulation, and our approach appears to lend itself better to multivariate extensions, as will be shown for dimension three.

We use and develop numerical methods that avoid simulations and use numerical integration instead. With the required computing completed for the mediation problem, practitioners can simply use Table 1 or the computer code provided in the appendix. In fact no further reading on the motivation and derivation of the test is required for its implementation.

Section 5 shows the ease of implementation using an interesting application by Alan et al. (2018) on educational attainment. The test confirms that the negative effect on girls’ attainment of 1-year exposure to teachers with traditional attitudes, is mediated through students’ own gender role beliefs. Neither the procedure in Alan et al. (2018) nor the LR test reject the no mediation hypothesis in this case, but our new test does.

A further empirical illustration on union sentiment among southern nonunion textile workers is provided in Section 7. Different mediation channels are tested involving two mediating variables and requires an extension of our methods. We therefore consider general hypotheses of the form H0:θ1θK=0H_{0}:\theta_{1}\cdots\theta_{K}=0. We give the relevant distributions of maximal invariants that can be used to derive the critical regions that are nearly similar and do so explicitly for three dimensions.

For practitioners the major advantage of our test is that there is a better chance of formally showing that there is a mediation effect. Our test has better power, especially when the two channeling effects are small or less accurately estimated. Given the enormous interest in testing for mediation and the fact that our test can have close to 5%5\% more power than standard level 5%5\% tests, many unpublished examples will exist where it can now be concluded that there is a statistically significant mediation effect.

2 Theory

The joint density of (Y,M)\left(Y,M\right) given XX in equations (1) and (2) can be written as:

fY,M|X(y,m|x;λ)=fY|M,X(y|m,x;λ1)fM|X(m|x;λ2),f_{Y,M|X}\left(y,m|x;\lambda\right)=f_{Y|M,X}\left(y|m,x;\lambda_{1}\right)f_{M|X}\left(m|x;\lambda_{2}\right),

with λ1=(τ,θ2,σ11)\lambda_{1}=\left(\tau,\theta_{2},\sigma_{11}\right)^{\prime}, λ2=(θ1,σ22)\lambda_{2}=\left(\theta_{1},\sigma_{22}\right)^{\prime}, and λ=(λ1,λ2)\lambda=\left(\lambda_{1}^{\prime},\lambda_{2}^{\prime}\right)^{\prime}. The parameters λ1\lambda_{1}, and λ2\lambda_{2} vary freely as a result of the triangular structure of the model. The mediation variable is the endogenous variable in (2), but is exogenous for θ2\theta_{2} in (1) since YY is not causal for MM. For a sample of nn independent observations the loglikelihood equals the sum of two normal loglikelihoods corresponding to (1) and (2)222This can easily be extended to include more regressors/covariates. Instrumental variables can also be used, but note that XX and MM appear in both equations and in the standard setup uu and vv are independent because of the experimental interpretation of MM. All that is required is the asymptotic normality of the estimators and tt-statistics.:

(λ)12σ11i=1n(yiτxiθ2mi)2n2log(σ11)12σ22i=1n(miθ1xi)2n2log(σ22).\ell\left(\lambda\right)\propto-\frac{1}{2\sigma_{11}}\sum_{i=1}^{n}\left(y_{i}-\tau x_{i}-\theta_{2}m_{i}\right)^{2}-\frac{n}{2}\log\left(\sigma_{11}\right)-\frac{1}{2\sigma_{22}}\sum_{i=1}^{n}\left(m_{i}-\theta_{1}x_{i}\right)^{2}-\frac{n}{2}\log\left(\sigma_{22}\right). (5)

As a consequence the Maximum Likelihood Estimators (MLEs) for θ1\theta_{1} and θ2\theta_{2} are the basic OLS estimators for the two equations separately. Furthermore, both observed and expected Fisher information matrices will be block diagonal in terms of λ1\lambda_{1} and λ2\lambda_{2} as well as in (τ,θ2)\left(\tau,\theta_{2}\right)^{\prime}, σ11\sigma_{11}, θ1,\theta_{1}, and σ22.\sigma_{22}. As a result the standard tt-statistics T1T_{1} and T2T_{2} for θ1\theta_{1} and θ2\theta_{2} respectively are asymptotically independent and normally distributed with means μ1θ10/σθ1\mu_{1}\equiv\theta_{1}^{0}/\sigma_{\theta_{1}}, μ2θ20/σθ2\mu_{2}\equiv\theta_{2}^{0}/\sigma_{\theta_{2}} where θ10\theta_{1}^{0}, θ20\theta_{2}^{0} denote the true parameter values and σθ1\sigma_{\theta_{1}}, σθ2\sigma_{\theta_{2}} the standard deviations of the OLS estimators: (Tμ)𝑑N(0,I2)\ \left(T-\mu\right)\overset{d}{\rightarrow}\emph{N}\left(0,I_{2}\right), with T=(T1,T2).T=\left(T_{1},T_{2}\right)^{\prime}.

Restricting attention to TT can be justified by statistical sufficiency, since the MLE is minimal sufficient and complete, and by invariance since the values of σ11,σ22\sigma_{11},\sigma_{22} and τ\tau do not affect whether H0:H_{0}: θ1θ2=0\theta_{1}\theta_{2}=0 is true or not. Hillier et al. (2021) show that TT is a maximal invariant under a relevant group of transformations. The testing problem has two more obvious symmetries. The problem is not affected by sign changes or permutations. This also holds in higher dimensions, e.g. when mediation is through a chain of effects as in our empirical illustration. If XM(0)M(K1)Y,X\rightarrow M^{\left(0\right)}\rightarrow\cdots\rightarrow M^{\left(K-1\right)}\rightarrow Y,  then KK parameters are required to be non-zero for this channel to operate. In KK dimensions the null hypothesis that at least one parameter is zero and the alternative is all KK parameters non-zero:

H0\displaystyle H_{0} :θ1θ2θK=0\displaystyle:\theta_{1}\theta_{2}...\theta_{K}=0
H1\displaystyle H_{1} :θ1θ2θK0\displaystyle:\theta_{1}\theta_{2}...\theta_{K}\neq 0

There are many other hypotheses including collapsibility in contingency tables, testing indirect effects, channels in DAGs, see Dufour et al. (2017) for a range of examples.

In the multivariate setting we also assume that the estimator θ^\hat{\theta} is normally distributed with known covariance matrix Ω=diag(σ12,,σK2)\Omega=diag\left(\sigma_{1}^{2},...,\sigma_{K}^{2}\right). Hence, if we let T=Ω1/2θ^T=\Omega^{-1/2}\hat{\theta} and μ=Ω1/2θ0\mu=\Omega^{-1/2}\theta^{0} such that Tk=θ^k/σθkT_{k}=\hat{\theta}_{k}/\sigma_{\theta_{k}} are the tt-ratios and μk=θk/σθk\mu_{k}=\theta_{k}/\sigma_{\theta_{k}} are the non-centrality parameters, we assume:

Assumption 1

:TN(μ,IK).\qquad T\sim N\left(\mu,I_{K}\right).

The testing problem is invariant to reordering the parameters (permutations) and sign changes (reflections) of the KK parameters {θi}i=1K\left\{\theta_{i}\right\}_{i=1}^{K}. The group of permutations, 𝐆1\mathbf{G}_{1} say, has K!K! elements and the group of sign changes, 𝐆2\mathbf{G}_{2} say, has 2K2^{K} elements (two possible signs for each element). The groups 𝐆1\mathbf{G}_{1} and 𝐆2\mathbf{G}_{2} have only the identity element in common, but are otherwise non-overlapping. The full group 𝐆=𝐆1×𝐆2\mathbf{G=G}_{1}\mathbf{\times G}_{2} generated by 𝐆1\mathbf{G}_{1} and 𝐆2\mathbf{G}_{2} therefore has K!2KK!2^{K} elements. Exploiting the invariance and symmetry properties of the problem reduces the domain of integration by a factor K!2KK!2^{K}. This is important because all optimizations require probabilities calculated by numerical integration. The density after a sign change in TkT_{k} is obtained by a corresponding sign change in μk\mu_{k} and for a permutation of TT also μ\mu permutes accordingly. Hence for any element gg𝐆\in\mathbf{G} we have 𝐠T\mathbf{g}\cdot T\sim N(𝐠μ,IK)\emph{N}\left(\mathbf{g}\cdot\mu,I_{K}\right) or P𝐠μ[𝐠TA]=Pμ[TA]P_{\mathbf{g}\mu}\left[\mathbf{g}T\in A\right]=P_{\mu}\left[T\in A\right] so the distribution is invariant; see Lehmann and Romano (2005).

Theorem 1

The testing problem H0:μ1μ2μK=0H_{0}:\mu_{1}\mu_{2}\cdots\mu_{K}=0 is invariant under the group of transformations  𝐆=𝐆1×𝐆2\mathbf{G=G}_{1}\mathbf{\times G}_{2} acting on TT and μ\mu, given Assumption 1. The absolute order statistic (|T|(1),,|T|(K))\left(\left|T\right|_{\left(1\right)},...,\left|T\right|_{\left(K\right)}\right) with 0<|T|(1)<|T|(2)<<|T|(K)0<\left|T\right|_{\left(1\right)}<\left|T\right|_{\left(2\right)}<...<\left|T\right|_{\left(K\right)} is a maximal invariant statistic and the absolute order parameter (|μ|(1),,|μ|(K))\left(\left|\mu\right|_{\left(1\right)},...,\left|\mu\right|_{\left(K\right)}\right) with 0|μ|(1)|μ|(K)0\leq\left|\mu\right|_{\left(1\right)}\leq\cdots\leq\left|\mu\right|_{\left(K\right)} is a maximal invariant parameter under the group of transformations G=G1×G2.\mathbf{G}=\mathbf{G}_{1}\times\mathbf{G}_{2}. The distribution of (|T|(1),,|T|(K))\left(\left|T\right|_{\left(1\right)},...,\left|T\right|_{\left(K\right)}\right) depends only on (|μ|(1),,|μ|(K)).\left(\left|\mu\right|_{\left(1\right)},...,\left|\mu\right|_{\left(K\right)}\right).

The Wald and LR tests are functions of the maximal invariant and so will our new test. The density is required for probability calculations and optimizations for the new test. It is easily derived for arbitrary dimension using Equation (6) of Vaughan and Venables (1972):

Lemma 1

The probability density function of the absolute order statistic is given by:

f{|T|(1),,|T|(K)}(|t|(1),,|t|(K))=perm(χ(|t|(1),|μ|(1))χ(|t|(1),|μ|(K))χ(|t|(K),|μ|(1))χ(|t|(K),|μ|(K))),f_{\left\{\left|T\right|_{\left(1\right)},...,\left|T\right|_{\left(K\right)}\right\}}\left(\left|t\right|_{\left(1\right)},...,\left|t\right|_{\left(K\right)}\right)=perm\left(\begin{array}[]{ccc}\chi\left(\left|t\right|_{\left(1\right)},\left|\mu\right|_{\left(1\right)}\right)&\cdots&\chi\left(\left|t\right|_{\left(1\right)},\left|\mu\right|_{\left(K\right)}\right)\\ \vdots&&\vdots\\ \chi\left(\left|t\right|_{\left(K\right)},\left|\mu\right|_{\left(1\right)}\right)&\cdots&\chi\left(\left|t\right|_{\left(K\right)},\left|\mu\right|_{\left(K\right)}\right)\end{array}\right), (6)

with perm(A)perm\left(A\right) the permanent333The permanent is defined as perm(A)=σSni=1nai,σ(i)perm\left(A\right)=\sum_{\sigma\in S_{n}}\prod_{i=1}^{n}a_{i,\sigma\left(i\right)} with the sum over all permutations σ\sigma of the numbers 1,,n,1,...,n, akin the determinant but without the ±\pm signature of the permutation. of the square matrix AA and χ(x,μ)\chi\left(x,\mu\right) the noncentral Chi-distribution with one degree of freedom and noncentrality parameter μ>0\mu>0.

The noncentral Chi-distribution χ(t,μ)\chi\left(t,\mu\right) with one degree of freedom equals the folded normal distribution and if TN(μ,1),T\sim N\left(\mu,1\right), then the density of |T|\left|T\right| can be written as f|T|(t,μ)=2/πexp{12(t2+μ2)}cosh(μt),f_{|T|}(t,\mu)=\,\sqrt{2/\pi}\exp\{-\tfrac{1}{2}(t^{2}+\mu^{2})\}\cosh(\mu\ t), for t0t\geq 0. Substitution in Lemma 1 and simplifying gives the following result that is the basis for the numerical calculations that follow:

Lemma 2

The density of the ordered absolute tt-statistics for the mediation hypothesis is:

f{|T|(1),|T|(2)}(t1,t2;μ1,μ2)\displaystyle f_{\left\{\left|T\right|_{\left(1\right)},\left|T\right|_{\left(2\right)}\right\}}\left(t_{1},t_{2};\mu_{1},\mu_{2}\right) =\displaystyle= 2πexp{(t12+t22+μ12+μ12)/2}\displaystyle\frac{2}{\pi}\exp\left\{-(t_{1}^{2}+t_{2}^{2}+\mu_{1}^{2}+\mu_{1}^{2})/2\right\}
{cosh(μ1t1)cosh(μ2t2)+cosh(μ1t2)cosh(μ2t1)},\displaystyle\left\{\cosh(\mu_{1}t_{1})\cosh(\mu_{2}t_{2})+\cosh(\mu_{1}t_{2})\cosh(\mu_{2}t_{1})\right\},
 for t2t10and μ1=θ1/σθ1,μ2=θ2/σθ2.\displaystyle\text{ \ \ for }t_{2}\geq t_{1}\geq 0~\text{and }\mu_{1}=\theta_{1}/\sigma_{\theta_{1}},\mu_{2}=\theta_{2}/\sigma_{\theta_{2}}.

The ordered squared tt-statistic could also be used as maximal invariant. Lemma 1 would then lead to a density in terms of noncentral Chi-squared distributions.

2.1 Problems with Standard (Single) Mediation Test Statistics

Standard mediation test statistics used in practice have distributions that depend on the parameter values under the null. The rejection probabilities are therefore not constant and the tests are biased with power dropping below the size of the test, especially in a neighborhood of the origin. We illustrate the issue for the classic Wald and LR tests.

The null hypothesis θ1θ2=0\theta_{1}\theta_{2}=0 defines a manifold that is almost everywhere continuously differentiable, with the exception of the origin which is a so-called “double point” where the two restrictions θ1=0\theta_{1}=0 and θ2=0\theta_{2}=0, each defining a one-dimensional line, coincide. The widely used Sobel (1982) test equals the square root of the Wald test. Glonek (1993) derives the asymptotic distribution for the Wald test statistic:

W=T12T22T12+T22𝑑{χ12:if θ1=0 or θ2=0, but not both,14χ12:if θ1=θ2=0.W=\frac{T_{1}^{2}T_{2}^{2}}{T_{1}^{2}+T_{2}^{2}}\overset{d}{\rightarrow}\left\{\begin{array}[]{l}\ \chi_{1}^{2}:\text{if }\theta_{1}=0\text{ or }\theta_{2}=0,\text{ but not both,}\\ \ \frac{1}{4}\chi_{1}^{2}:\text{if }\theta_{1}=\theta_{2}=0.\end{array}\right. (7)

As a consequence the asymptotic 5%5\% critical value for WW when both θ1=0\theta_{1}=0 and θ2=0\theta_{2}=0 is 14χ12(0.95),\frac{1}{4}\chi_{1}^{2}(0.95), but jumps to χ12(0.95)\chi_{1}^{2}(0.95) i.e. the usual Chi-squared critical value for one restriction, for any other value. The discrete jump in the asymptotic distribution from the origin to any other fixed parameter is remarkable and shows explicitly that the distribution depends heavily on the parameter values under the null. This discontinuity in the asymptotic distribution and dependence on the parameter also invalidates bootstrap procedures and they are oversized. For an NRP of 5%5\% at the origin the critical value should be 0.960.96 but this would lead to over-rejection for other values under the null and the test would be oversized (size >30%>30\%) and invalid. One could consider drifting sequences of parameter values to investigate the behavior of the Wald statistic near the origin, but that would not solve the problem. The problematic behavior of the Wald test under the null with singularities is well documented by Drton and Xiao (2016) and Drton (2009). No satisfactory solution has been found in the preceding decades to salvage the Wald statistic, see e.g. Dufour et al. (2017). This prompted our investigation and to propose an alternative solution.

The LR test was shown by Van Giersbergen (2014) to equal:

LR=min{|T1|,|T2|}=|T|(1),LR=\min\left\{\left|T_{1}\right|,\left|T_{2}\right|\right\}=\left|T\right|_{\left(1\right)}, (8)

and rejects when both H0θ1:θ1=0H_{0}^{\theta_{1}}:\theta_{1}=0 and H0θ2:θ2=0H_{0}^{\theta_{2}}:\theta_{2}=0 are rejected by basic tt-tests. In MacKinnon et al. (2002) this is referred to as the test for joint significance, but not identified as the LR test. The rejection probability for critical value cvcv is:

P[LRcv]=P[|T1|cv|T2|cv]=P[|T1|cv]P[|T2|cv],P\left[LR\geq cv\right]=P\left[\ \left|T_{1}\right|\geq cv\ \cap\ \left|T_{2}\right|\geq cv\right]=P\left[\ \left|T_{1}\right|\geq cv\right]\ \cdot P\left[\ \left|T_{2}\right|\geq cv\right],

by independence of T1T_{1} and T2T_{2}. These rejection probabilities are monotonically increasing in the absolute values of θ1\theta_{1} and θ2\theta_{2}. Correct size is therefore obtained by choosing the critical value of the test by letting θ1\theta_{1}\rightarrow\infty when θ2=0,\theta_{2}=0, or θ2\theta_{2}\rightarrow\infty if θ1=0,\theta_{1}=0, to guarantee that the rejection probability under the null is always smaller than or equal to the nominal size. The asymptotic 5%5\% critical value is therefore the usual 1.961.96. The NRP will depend on the values of θ1\theta_{1} and θ2\theta_{2} and vary between the following two extremes:

P[LRz0.025]={0.05: if θ1 θ2=0,or θ2θ1=0,0.052=0.0025: if θ1=0θ2=0,P\left[LR\geq z_{0.025}\right]=\left\{\begin{array}[]{l}0.05:\text{ if }\theta_{1}\rightarrow\infty\text{ }\wedge\theta_{2}=0,\text{or }\theta_{2}\rightarrow\infty\wedge\theta_{1}=0,\\ 0.05^{2}=0.0025:\text{ if }\theta_{1}=0\wedge\theta_{2}=0,\end{array}\right.

where z0.025z_{0.025} is the upper 2.5%2.5\% percentile of the standard normal distribution. For an NRP of 5%5\% at the origin (θ1,θ2)=(0,0),\left(\theta_{1},\theta_{2}\right)=\left(0,0\right), the critical value should equal cvLR00=1.217cv_{LR00}=1.217. This leads to massive over-rejection if only one parameter is zero and the other much larger. The test with this critical value is oversized ((size >20%)>20\%) and invalid.

The third classic test, the Lagrange Multiplier (LM) or score test, is even more problematic because its definition depends on parameter values under the null and there are three different versions depending on which θ\theta, or both θ\theta’s are zero.

All these classic tests are functions of two tt-statistics. Their distributions, as well as their NRPs, clearly depend on the parameter values under the null and the tests are not similar. A test is called similar on the boundary of H0H_{0} if the probability of rejecting the null is constant for all parameter values on the boundary of H0H_{0} and H1H_{1}. For mediation, this boundary equals H0H_{0} itself and consists of the horizontal and vertical axes of the (θ1,θ2)\left(\theta_{1},\theta_{2}\right) space. None of the classic tests is similar and in a neighborhood of the origin the NRPs are close to zero. As a result the power in a neighborhood of the origin is also close to zero and far below the size of the test and the tests are biased since there are parameter values with probability of rejection under the alternative lower than under the null.

2.2 Critical Regions

The behavior and construction of the classic test statistics is problematic. Given that no satisfactory adjustments of classic test statistics have been found, despite considerable efforts over recent decades, a different approach is required.

In order to derive an alternative test procedure we shift the focus from the test statistic to the critical region (CR). A critical region defines a test statistic of course, but choosing a class of tests, such as Wald, LR, or LM tests, restricts the shape of the critical region. For the same reason the tests focusing on improving the standard error of θ^1θ^2\hat{\theta}_{1}\hat{\theta}_{2} or (τ^τ^)\left(\hat{\tau}^{\ast}-\hat{\tau}\right) analyzed in MacKinnon et al. (2002) restrict possible shapes of the critial region.

We construct a new test procedure by constructing the critical region directly by determining its shape in the two-dimensional sample space of the tt-statistics used in the construction of the tests. We consider critical regions that are bounded by a function gg and reject when |T|(1)>g(|T|(2)).\left|T\right|_{\left(1\right)}>g(\left|T\right|_{\left(2\right)}). We impose some weak regularity conditions. In particular we assume that gg is a càdlàg function from 0+\mathbb{R}_{0}^{+} (including 0) to 0+.\mathbb{R}_{0}^{+}. This will allow gg to have jumps, but limits the number of jumps to countably many. We also insist that gg is weakly increasing. This assures that a rejection (acceptance) for a realization (|t|(1),|t|(2))(\left|t\right|_{\left(1\right)},\left|t\right|_{\left(2\right)}) is not reversed when either |t|(1)\left|t\right|_{\left(1\right)} or |t|(2)\left|t\right|_{\left(2\right)} is increased (decreased). This reasoning also motivates a CR that is (topologically) simply connected. Denote the set of weakly increasing càdlàg functions by 𝔻(0+,0+)\mathbb{D}\left(\mathbb{R}_{0}^{+},\mathbb{R}_{0}^{+}\right), as in the common Skorokhod space notation, but here with weak monotonicity.

Definition 1

A function g𝔻(0+,0+)g\in\mathbb{D}\left(\mathbb{R}_{0}^{+},\mathbb{R}_{0}^{+}\right) is called the boundary function (of the critical region) when it defines:

Critical Region :CRg={(T1,T2)2|T|(1)>g(|T|(2))},\displaystyle:CR_{g}=\left\{\left(T_{1},T_{2}\right)\in\mathbb{R}^{2}\mid\left|T\right|_{\left(1\right)}>g(\left|T\right|_{\left(2\right)})\right\},
Acceptance Region :ARg={(T1,T2)2|T|(1)g(|T|(2))}.\displaystyle:AR_{g}=\left\{\left(T_{1},T_{2}\right)\in\mathbb{R}^{2}\mid\left|T\right|_{\left(1\right)}\leq g(\left|T\right|_{\left(2\right)})\right\}.

Justification for only using tt-statistics is by sufficiency and invariance. First, the MLE   λ^=(τ^,θ^2,σ^11,θ^1,σ^22)\hat{\lambda}=\left(\hat{\tau},\hat{\theta}_{2},\hat{\sigma}_{11},\hat{\theta}_{1},\hat{\sigma}_{22}\right)^{\prime}  is a complete minimal sufficient statistic. The model constitutes a full exponential model since the dimensions of the minimal sufficient statistic and the parameter space are equal; see Van Garderen (1997). Second, T1T_{1} and T2T_{2} have distributions under the null that are independent of the nuisance parameters τ,\tau, σ11,\sigma_{11}, and σ22.\sigma_{22}. Hillier et al. (2021) shows that, also in finite samples, T=(T1,T2)T=\left(T_{1},T_{2}\right)^{\prime} is a maximal invariant under an appropriate group of transformations generalizing the scale invariance of the tt-statistics. Theorem 1 shows that as a consequence of the permutation and reflection invariance only 1/8th1/8^{th} of the two-dimensional sample space of TT needs to be considered and we define the critical region in the first octant (east to north-east). The other seven octants in 2\mathbb{R}^{2} follow by symmetry. The test defined by CRgCR_{g} is indeed invariant to permutations, reflections, and scale transformations. The domain of g()g\left(\cdot\right) can therefore be restricted to the non-negative real line and bounded by the 4545^{\circ} line: g(x)x.g\left(x\right)\leq x.

We can put the definition of a similar test in terms of the boundary function g()g\left(\cdot\right), noting that H0H_{0} is itself the boundary of H0H_{0} and H1H_{1}:

Definition 2

g()g\left(\cdot\right) is said to be a similar boundary function if the probability of the critical region CRgCR_{g} defined by gg is constant under H0H_{0}:

P[TCRg(θ1,θ2)2withθ1θ2=0]=constant.P\left[T\in CR_{g}\mid\forall\left(\theta_{1},\theta_{2}\right)\in\mathbb{R}^{2}\ with\ \theta_{1}\theta_{2}=0\right]=constant.\

The boundary functions defined by the Wald (Sobel) and LR are not similar. Figures 2(a) and 2(b) show the boundary functions that define the critical region in terms of (T1,T2)\left(T_{1},T_{2}\right) for the Wald and LR test. We show the boundaries for two critical values: one such that for large |θ1|\left|\theta_{1}\right| or |θ2|\left|\theta_{2}\right| the NRP is 5%5\% asymptotically. This value is the usual 3.843.84 for the Wald test and 1.961.96 for the LR test. The second, smaller critical value is such that the NRP is 5%5\% when θ1=θ2=0.\theta_{1}=\theta_{2}=0. This value is 0.960.96 for the Wald test and 1.2171.217 for the LR test. The rejection probabilities are shown as a function of the noncentrality parameter μ2=θ2/σθ2\mu_{2}=\theta_{2}/\sigma_{\theta_{2}} for given μ1=0,\mu_{1}=0, such that H0H_{0} holds. For the LR test with critical value 1.961.96 the NRP goes down to 0.0025=0.0520.0025=0.05^{2} when θ1=θ2=0.\theta_{1}=\theta_{2}=0. In the second case, with the smaller critical value 1.2171.217, the NRP is 5%5\% by construction when θ1=θ2=0,\theta_{1}=\theta_{2}=0, but this is not a valid test since for other values the NRPs are much higher than the nominal size of 5%5\%. The same situation will occur when constructing point optimal invariant tests. The Wald test is considerably worse than the LR test with lower NRP over a wider range of μ1\mu_{1}, see figures 2(c) and 2(d).

Refer to caption
(a) CR Wald (Sobel) test.
Refer to caption
(b) CR LR test.
Refer to caption
(c) NRP Wald (Sobel) test.
Refer to caption
(d) NRP LR test.
Figure 2: Critical regions for Sobel (Wald) and LR tests and their rejection probabilities. White areas are the critical regions for the valid 5%5\% tests. For 5%5\% rejection in the origin (α,β)=(0,0)(\alpha,\beta)=(0,0), the dark areas are added to the critical regions which leads to over-sized tests. The LR test is poor, but the Sobel test is even worse.

The trinity of classic tests is clearly nonsimilar. The question is if we can do much better. Does there exist a similar test or is this a problem that is intrinsically unsolvable? The next theorem, and the main theoretical contribution, answers this question.

Theorem 2

A similar boundary function g()g\left(\cdot\right) exists for testing H0:θ1θ2=0H_{0}:\theta_{1}\theta_{2}=0 if and only if 1/α1/\alpha is an integer (or trivially α=0\alpha=0). If it exists, the boundary is unique in 𝔻(0+,0+).\mathbb{D}\left(\mathbb{R}_{0}^{+},\mathbb{R}_{0}^{+}\right).

For common significance levels, including 1%1\%, 5%5\%, and 10%10\%, the theorem proves that exact similar tests exist. The proof is given in Appendix C and exploits the symmetries of the problem and the completeness of the normal distribution. The proof is constructive showing that the gg function must be a step function and is unique within the class of weakly increasing càdlàg functions. For α=0.25\alpha=0.25 the exact similar boundary is shown in Figure 3.

Refer to caption
Figure 3: Boundary of the exact similar gg-test when α=0.25\alpha=0.25 in the first quadrant.

The CR shows a clear parallel with Figure 2 of Berger (1989). The step function starts horizontal and defines a CR with objectionable features. In particular, the CR will include all TT such that 0<|T|(2)<Φ1(12+α2)0<\left|T\right|_{\left(2\right)}<\Phi^{-1}\left(\frac{1}{2}+\frac{\alpha}{2}\right). For α=0.05\alpha=0.05 this corresponds to rejection when both tt-statistics are smaller than 0.06270.0627 in absolute value (but non-zero)  (for α=0.10\alpha=0.10 and 0.010.01 smaller than 0.12570.1257 and 0.01250.0125 respectively). Such tt-statistics close to zero cannot be characterized as strong evidence against the null, in favor of both θ1\theta_{1} and θ2\theta_{2} being non-zero. Nevertheless, the test is size correct and renders the LR test α\alpha-inadmissible. So it is an “Emperor’s New Tests” in the terminology of Perlman and Wu (1999) and does not provide a satisfactory solution to the problem of finding a better test. A second unattractive feature of this exact similar boundary is that for increasing values of the test statistics parallel and close to the diagonal, the decision alternates between rejection and acceptance, despite the fact that evidence against the null appears monotonically increasing.

Given the uniqueness of the similar test, we relax the strict similarity requirement and consider a class of near similar tests with NRPs that differ from α\alpha by no more than ϵ\epsilon. Within this class of so called ϵ\epsilon-similar tests given in Definition 3 below, we determine a test that avoids the objectionable properties of the exact similar test, but achieves good power properties. Within this class of near similar tests we determine the power envelope and determine the test that minimizes the distance between its power surface and this power envelope. This new test is therefore optimal in this sense. It is easy to implement using Table 1 or using the R-code in Appendix E.

The first step in the composition of this optimal test is a new general method for constructing near similar tests.

3 Near Similar Test Construction: Varying g-Method

A general method for constructing near similar tests involves three generic steps:

  1. 1.

    Define a flexible boundary gg for the critical region in the relevant sample space.

  2. 2.

    Define a criterion function Q(g)Q\left(g\right) that penalizes the deviation of the NRP from the level α\alpha for a grid of parameter values under the null (and possibly restrictions on gg and other aspects deemed relevant).

  3. 3.

    Systematically vary and determine gg such that it minimizes the criterion function and is therefore as close to similarity as possible in the metric defined by QQ.

The relevant sample space is determined by the particular testing problem at hand and may have been reduced by sufficiency, invariance, or other principles, to dimension kk, say. The boundary gg of the critical and acceptance region is then of dimension (k1)\left(k-1\right), but may consist of disjoint parts if the critical and/or acceptance region are not simply connected in a topological sense. There are various possibilities to define gg flexibly, but we will use splines.

The criterion function may include aspects other than similarity, for instance smoothness and monotonicity of gg, convexity of the critical or acceptance regions, or even rejection probabilities under alternatives. Consequently, Step 3 will generally be a constraint optimization problem. The systematic variation of gg is intended to be in line with the optimization routine used to minimize QQ as in, e.g. a Newton-Raphson-type procedure.

For mediation testing with α=0.05\alpha=0.05 an exact test exists, but the varying gg-method may not find it because (i) gg is not flexible enough (ii) Q(g)Q\left(g\right) includes criteria other than NRPs (iii) numerical difficulties determining the jumps (iv) and finally restrictions to purposely exclude objectionable CRs.

An explicit implementation of the varying gg-method to the mediation problem is given next.

3.1 Near Similar Mediation g-Test

Step 1 in the varying gg-method is to determine the relevant sample space for the testing problem. The first dimensional reduction is to the MLE which is minimal sufficient and complete. The second reduction to (T1,T2)\left(T_{1},T_{2}\right) follows from location-scale type invariance shown in Hillier et al. (2021). Permutation and reflection symmetries further reduce the relevant sample space to one octant according to Theorem 1. The maximal invariant is the ordered absolute tt-statistic (|T|(1),|T|(2))=(min(|T1|,|T2|),max(|T1|,|T2|))(\left|T\right|_{\left(1\right)},\left|T\right|_{\left(2\right)})=\left(\min\left(\left|T_{1}\right|,\left|T_{2}\right|\right),\max\left(\left|T_{1}\right|,\left|T_{2}\right|\right)\right) with density given in Lemma  2 . For general μ\mu, the rejection probability (RP) for the gg-method is given by:

πg(μ1,μ2)=100g(t2)f|T|(1),|T|(2)(t1,t2,μ1,μ2)𝑑t1𝑑t2.\pi_{g}\left(\mu_{1},\mu_{2}\right)=1-\int_{0}^{\infty}\int_{0}^{g(t_{2})}f_{|T|_{(1)},|T|_{(2)}}(t_{1},t_{2},\mu_{1},\mu_{2})dt_{1}dt_{2}. (9)

This two-dimensional integral can be reduced to a one-dimensional integral by expressing the inner integral in terms of the CDF of the standard normal distribution Φ()\Phi\left(\cdot\right):

0bf|T|(1),|T|(2)(t1,t2,μ1,μ2)𝑑t1\displaystyle\int_{0}^{b}f_{|T|_{(1)},|T|_{(2)}}(t_{1},t_{2},\mu_{1},\mu_{2})dt_{1}\, =\displaystyle= 2πexp(t22/2)\displaystyle\sqrt{\frac{2}{\pi}}\exp(-t_{2}^{2}/2)\cdot (10)
{exp(μ22/2)cosh(t2μ2)(Φ(bμ1)+Φ(b+μ1)1)+\displaystyle\{\exp(-\mu_{2}^{2}/2)\cosh(t_{2}\mu_{2})\left(\Phi(b-\mu_{1})+\Phi(b+\mu_{1})-1\right)+
exp(μ12/2)cosh(t2μ1)(Φ(bμ2)+Φ(b+μ2)1)}.\displaystyle\exp(-\mu_{1}^{2}/2)\cosh(t_{2}\mu_{1})\left(\Phi(b-\mu_{2})+\Phi(b+\mu_{2})-1\right)\}.

The RP thus simplifies to a one-dimensional integral by integrating formula (10) for b=g(t2)b=g(t_{2}) over t2[0,)t_{2}\in[0,\infty), which greatly improves efficiency and accuracy.

All NRPs in the paper were determined by numerical integration over t2[0,μ2+9]t_{2}\in[0,\mu_{2}+9]  under the null with μ1=0\mu_{1}=0. The numerical integration was performed in Julia, see Bezanson et al. (2017), using the function quadgk that is based on adaptive Gauss-Kronrod quadrature. For the final gg-function, the estimated upper bound on the absolute error for the calculated NRP is approximately 10810^{-8}.

The gg-boundary is generally determined by an algorithm and Appendix D shows the basic implementation of the varying gg-method using linear splines with J+2J+2 knots, with the first and last knots fixed. In spite of its simplicity, it leads to big improvements even for small values of JJ. Figure 4 illustrates the construction of the gg-function for a fixed number of grid points J=6J=6 and the resulting CRgCR_{g} in the sample space of (|T|(1),|T|(2))(\left|T\right|_{\left(1\right)},\left|T\right|_{\left(2\right)}), the East to North-East octant of 2\mathbb{R}^{2}.

Refer to caption
Figure 4: Construction of the basic gg-function with J=6J=6, so 88 knots in all and the resulting CR boundary in (|T|(2),|T|(1))(\left|T\right|_{\left(2\right)},\left|T\right|_{\left(1\right)}) space.

Figure 5 shows the NRPs of the test in comparison to the LR and Wald (Sobel) tests. There is a remarkable gain in the lowest NRP, and therefore local power, from 0.25%0.25\% to 4.999%4.999\%. Already for J=2J=2 there is a large improvement and after J=8J=8 gains are small, and beyond J=16J=16 there was hardly any improvement.

Refer to caption
Figure 5: NRPs gg-tests with J=2,4J=2,4 versus LR and Wald (Sobel) tests.

3.2 Power

Insistence on similarity can have negative consequences for the power in general, but not here. The power envelope with or without (near) similarity restriction are very close. The power surface of our optimal test is also close. Even the basic test with J=16J=16 has very good power in comparison with the Sobel and LR tests and is uniformly better for all values of the noncentrality parameter μ\mu. In a neighborhood of the origin with μ=0,\mu=0, it is essentially 5%5\% points higher.

The RP πg(μ1,μ2)\pi_{g}\left(\mu_{1},\mu_{2}\right) defined in Equation (9) is the NRP if μ1\mu_{1} and/or μ2\mu_{2} equal 0 (the null hypothesis). When both are non-zero, H0H_{0} is false and πg(μ1,μ2)\pi_{g}\left(\mu_{1},\mu_{2}\right) is the power of the test defined by CRgCR_{g}. Figure 6 illustrates the power in the 45,45^{\circ}, μ1=μ2,\mu_{1}=\mu_{2}, direction. Power in other directions is also superior to the Wald (Sobel) and LR tests.

Refer to caption
Figure 6: Power comparison basic gg-tests, LR- and W (Sobel) tests along μ1=μ2(=μ)\mu_{1}=\mu_{2}\ \ (=\mu).

There is a straightforward explanation for the additional power. The Wald and LR test both reject far less than 5%5\% near the origin. The critical region can therefore be extended and the power increased without failing the size condition. In the origin the NRPs are close to 0%0\% for the LR and Wald (Sobel) tests. By extending the critical region we can therefore gain almost 5%5\% power without violating the size condition. The LR test has some attractive features including that rejection for a particular (t1,t2)\left(t_{1},t_{2}\right) implies rejection for larger values of t1t_{1} and/or t2t_{2} also. This is intuitive since the evidence against the null is increasing. A disadvantage is however that one never rejects when either t1t_{1} or t2t_{2} is smaller than 1.961.96 and this causes extreme conservativeness that can be resolved by adding area to the critical region. It should also be noted that the relevant distribution of the order statistic (|T|(1),|T|(2))(\left|T\right|_{\left(1\right)},\left|T\right|_{\left(2\right)}) is quite different from the distribution of the product of two standard normals for (T1,T2)\left(T_{1},T_{2}\right) or their absolute values.

3.3 Power Envelope

Comparison to the Sobel and LR tests is limited because they both perform very poorly for small values of μ\mu. The absolute quality, or even near optimality, of the new gg-test can only be assessed by comparing the power surface of the test to the power envelope, or a tight upper bound thereof, for a class of tests that satisfy appropriate invariance-, size and almost similarity restrictions. Since the exact similar invariant test is objectionable when it exists, we introduce a class Γα,ϵ\Gamma_{\alpha,\epsilon} of near similar tests with NRPs that deviate less than ϵ\epsilon from the α\alpha level and an operational (super)class Γα,ϵ𝕄0Γα,ϵ\Gamma_{\alpha,\epsilon}^{\mathbb{M}_{0}}\supseteq\Gamma_{\alpha,\epsilon} as follows.

Definition 3

The class Γα,ϵ\Gamma_{\alpha,\epsilon} of near similar boundary functions with ϵ>0\epsilon>0 and given α\alpha is defined by:

Γα,ϵ={g𝔻(0+,0+)|supμ00P[CRg(0,μ0)]α and infμ00P[CRg(0,μ0)]αϵ}\Gamma_{\alpha,\epsilon}=\left\{g\in\mathbb{D}\left(\mathbb{R}_{0}^{+},\mathbb{R}_{0}^{+}\right)\left|\sup_{\mu_{0}\geq 0}P\left[CR_{g}\mid\left(0,\mu_{0}\right)\right]\leq\alpha\text{ and }\inf_{\mu_{0}\geq 0}P\left[CR_{g}\mid\left(0,\mu_{0}\right)\right]\geq\alpha-\epsilon\right.\right\}

The class Γα,ϵ𝕄0\Gamma_{\alpha,\epsilon}^{\mathbb{M}_{0}} with set 𝕄0={(0,μ0(ι))}ι=1Υ0\mathbb{M}_{0}=\left\{\left(0,\mu_{0}^{\left(\iota\right)}\right)\right\}_{\iota=1}^{\Upsilon_{0}} containing Υ0\Upsilon_{0} points under H0H_{0}, is defined by:

Γα,ϵ𝕄0={g𝔻(0+,0+)|sup(0,μ0)𝕄0P[CRg(0,μ0)]α and inf(0,μ0)𝕄0P[CRg(0,μ0)]αϵ}.\Gamma_{\alpha,\epsilon}^{\mathbb{M}_{0}}=\left\{g\in\mathbb{D}\left(\mathbb{R}_{0}^{+},\mathbb{R}_{0}^{+}\right)\left|\sup_{\left(0,\mu_{0}\right)\in\mathbb{M}_{0}}P\left[CR_{g}\mid\left(0,\mu_{0}\right)\right]\leq\alpha\text{ and }\inf_{\left(0,\mu_{0}\right)\in\mathbb{M}_{0}}P\left[CR_{g}\mid\left(0,\mu_{0}\right)\right]\geq\alpha-\epsilon\right.\right\}.

For ϵ=0\epsilon=0 the boundary functions in Γ0\Gamma_{0} would be similar. We consider α=0.05\alpha=0.05 in all our numerical examples, but for general 1/α1/\alpha\notin\mathbb{N} the set would be empty according to Theorem 2. For ϵ=α\epsilon=\alpha, on the other hand, Γα,α\Gamma_{\alpha,\alpha} contains all gg-based tests that satisfy the size condition. Our interest is in ϵ\epsilon close to 0, when Γα,ϵ\Gamma_{\alpha,\epsilon} contains boundaries that are almost similar. The minimum value of ϵ\epsilon for which Γα,ϵ\Gamma_{\alpha,\epsilon} is not empty is 0 for the mediation problem when 1/α1/\alpha\in\mathbb{N}, but in general depends on the testing problem considered and larger than 0 if no similar test exists.

The class Γα,ϵ𝕄0\Gamma_{\alpha,\epsilon}^{\mathbb{M}_{0}} can be thought of as a discretization of Γα,ϵ\Gamma_{\alpha,\epsilon} in the sense that a grid of points under the null is considered. It imposes less restrictions and enforces near similarity conditions on a finite number of points only. As a consequence it may contain boundaries that do not satisfy the size condition for points that are not in 𝕄0.\mathbb{M}_{0}. Obviously Γα,ϵ\Gamma_{\alpha,\epsilon} Γα,ϵ𝕄0\subseteq\Gamma_{\alpha,\epsilon}^{\mathbb{M}_{0}} since the size and NRP conditions also hold for the points in 𝕄0\mathbb{M}_{0}.

Within the class Γα,ϵ\Gamma_{\alpha,\epsilon} there is no unique solution. As a consequence one has to choose a boundary function from Γα,ϵ,\Gamma_{\alpha,\epsilon}, or in practice from Γα,ϵ𝕄0,\Gamma_{\alpha,\epsilon}^{\mathbb{M}_{0}}, to obtain an operational test. For the construction of the power envelope we can select the test in Γα,ϵ𝕄0\Gamma_{\alpha,\epsilon}^{\mathbb{M}_{0}} that maximizes the power against a particular point (μ1,μ2)\left(\mu_{1},\mu_{2}\right) in the alternative. This test is a Point Optimal Invariant Near Similar (POINS) test. The critical region of this test varies with (μ1,μ2)\left(\mu_{1},\mu_{2}\right) and no uniformly most powerful test exists within the class Γα,ϵ\Gamma_{\alpha,\epsilon}. It can be used however, to construct an upper bound for the power envelope.

Definition 4

The power envelope of a near similar invariant test with ϵ>0\epsilon>0 is defined as:

π(μ1,μ2)=maxgΓα,ϵP[CRg(μ1,μ2)].\pi\left(\mu_{1},\mu_{2}\right)=\max_{g\in\Gamma_{\alpha,\epsilon}}P\left[CR_{g}\mid\left(\mu_{1},\mu_{2}\right)\right].

For a given set of points 𝕄0={(0,μ0(ι))}ι=1Υ0\mathbb{M}_{0}=\left\{(0,\mu_{0}^{\left(\iota\right)})\right\}_{\iota=1}^{\Upsilon_{0}} an upper bound to the power envelope is:

π¯(μ1,μ2)=maxgΓα,ϵ𝕄0P[CRg(μ1,μ2)].\bar{\pi}\left(\mu_{1},\mu_{2}\right)=\max_{g\in\Gamma_{\alpha,\epsilon}^{\mathbb{M}_{0}}}P\left[CR_{g}\mid\left(\mu_{1},\mu_{2}\right)\right].

For notational simplicity we have suppressed the dependence on ϵ\epsilon and 𝕄0.\mathbb{M}_{0}. Since Γα,ϵ\Gamma_{\alpha,\epsilon} Γα,ϵ𝕄0\subseteq\Gamma_{\alpha,\epsilon}^{\mathbb{M}_{0}} and elements of Γα,ϵ𝕄0\Gamma_{\alpha,\epsilon}^{\mathbb{M}_{0}} do not necessarily satisfy the size condition for all parameter values it follows that π¯(μ1,μ2)π(μ1,μ2),\bar{\pi}\left(\mu_{1},\mu_{2}\right)\geq\pi\left(\mu_{1},\mu_{2}\right), because fewer conditions are imposed. Choosing a finer grid for 𝕄0\mathbb{M}_{0} will force π¯(μ1,μ2)\bar{\pi}\left(\mu_{1},\mu_{2}\right)\ closer to π(μ1,μ2)\pi\left(\mu_{1},\mu_{2}\right), at least in the additional points in 𝕄0\mathbb{M}_{0} where the size condition is now required to hold. Also note that the “point” optimal gg that maximizes power for the point (μ1,μ2),\left(\mu_{1},\mu_{2}\right), may have undesirable features such as including parts of the axes in the critical region, even though such observations are perfectly in line with the null hypothesis.

We determine π¯(μ1,μ2)\bar{\pi}\left(\mu_{1},\mu_{2}\right) numerically for α=0.05\alpha=0.05 by maximizing the power directly by selecting critical region points in the sample space that maximize the probability of rejection when the true density has parameter (μ1,μ2)\left(\mu_{1},\mu_{2}\right), under the side conditions that the NRP[0.05ϵ,0.05]\in\left[0.05-\epsilon,0.05\right] for all parameters (0,μ0)𝕄0\left(0,\mu_{0}\right)\in\mathbb{M}_{0}. The sample space is decomposed into 285,150285,150 squares and a modern optimization routine is used to determine which squares should be included in the critical or acceptance region in order to maximize the power while at the same time satisfying the approximate similarity condition. This is repeated for a grid of (μ1,μ2)\left(\mu_{1},\mu_{2}\right) points. So for each point on the grid the POINS critical region is determined and the power recorded. Appendix D gives details of the algorithm and the optimization routine that can deal with a large number of variables and side conditions.

By dropping the near similarity restriction (0.05ϵNRP)0.05-\epsilon\leq NRP) in the same algorithm, we can construct a power envelope for nonsimilar tests. The maximal difference from the (higher) nonsimilar power surface is 2%2\% points when power is around 40%,40\%, showing that the power loss due to the similarity requirement is small. The calculated π¯(μ1,μ2)\bar{\pi}\left(\mu_{1},\mu_{2}\right) surface enables us to construct a correctly sized optimal test derived in the next section.

4 The New Mediation Test

Having determined an upper bound to the power envelope, we can determine a gg-boundary function with a power surface as close as possible to this upper bound. This optimal test is found using the algorithm given in Appendix D. This function is given in Appendix E and R-code is also provided there. For ease of implementation we give values of g(t)g\left(t\right) in Table 1. Figure 7 shows the optimal gg-boundary test for the mediation problem.

Refer to caption
Figure 7: Optimal gg-boundary function. The dashed line is the LR boundary.
Refer to caption
Figure 8: NRP as a function of the noncentrality parameter μ\mu. The solid line is the optimal test and is strictly between 0.049999 (=0.05ϵ,=0.05-\epsilon, dotted line) and 0.05 (dashed line).

The optimal CRgCR_{g} includes a narrow region close to the 4545^{\circ} line where both tt-statistics are of the same magnitude. This is expedient for two reasons. First, because mediation requires both θ1\theta_{1} and θ2\theta_{2} to be non-zero. The best possibility of detecting this is along the 4545^{\circ} line as illustrated by the power surface in Figure 9 showing highest power on the diagonal. The optimal CRgCR_{g} does exclude both tt-statistics smaller than 0.10.1, unlike the unappealing region of the exact test. Second, the near similarity condition requires additional critical region area in the left corner of the octant because NRPs are particularly low for small parameter values. The increased power is naturally linked to the increase in Type I error, but correct size of a test by definition merely requires that this is not larger than 5%5\%. Nevertheless, size (NRP)/power trade-off exists as well as other compromises that can be assessed using critical region analysis. For instance, it may seem less intuitive that rejection is not monotonic in t1t_{1} and t2t_{2} since an increase in both t1t_{1} and t2t_{2} represents increased evidence against the null. The LR and Wald tests are monotonic in this sense, but lead to a reduction in power to nearly zero for small parameter values. No observed value tt of TT will ever lie on the horizontal or vertical axis. Any observed tt is therefore more likely given an alternative parameter value than a value under the null. It is therefore desirable to add area to the LR critical region even if this results in a non-convex critical region or acceptance region. One could cogitate about the very narrow region close to the diagonal and whether the acceptance should not continue along the 4545^{\circ} line further than 0.1,0.1, until e.g. 1.2171.217 as in Figure 2(b), but the new gg-boundary is the optimal solution to a well-defined problem.

The narrow region of the optimal CRgCR_{g} is a strict extension of the CRLR,CR_{LR}, which itself is strictly larger than the Sobel (Wald) CRWCR_{W}. Since the new test is constructed to satisfy the size condition we have the following:

Theorem 3

The Sobel/Wald test and the LR test are inadmissible.

Proof. CRWCRLRCRgCR_{W}\subset CR_{LR}\subset CR_{g} hence P[CRW]<P[CRLR]<P[CRg]0.05P[CR_{W}]<P[CR_{LR}]<P[CR_{g}]\leq 0.05. The optimal gg-test has uniformly higher power and is correctly sized by construction.   \square

The NRP as a function of the noncentrality parameter μ\mu is shown in Figure 8. The difference from 5%5\% is less than 10510^{-5} and so small that the scale had to be magnified greatly, to an extend that prevents comparison with the LR and Sobel tests in the same graph.

The power of the new gg-test is very close to the power envelope (upper bound). The maximal difference is 0.006140.00614. This has important implications. First, the upper bound is tight as claimed earlier. Second, the upper bound of the power envelope and power surface of the gg-test look almost identical when graphed. Figure 9 therefore shows only the power surface of the optimal gg-test. Finally, the new gg-test is optimal for all intents and purposes in a larger class of tests. It is optimal by construction within the class of near similar tests Γα,ϵ𝕄0\Gamma_{\alpha,\epsilon}^{\mathbb{M}_{0}}, but given the closeness of its power surface to the (non)similar power envelope, there cannot exist any near similar test that has additional power more than 0.006140.00614, even if construction is based on a different method. More generally, nonsimilar tests can have 2%2\% points more power, but at the possible cost of odd rejection regions and low power for other parameter values that are not used in the construction of the test. The optimal gg-test has good properties for all parameter values.

The power surface in Figure 9 shows only the first quadrant of the parameter space of (μ1,μ2).\left(\mu_{1},\mu_{2}\right). The other three quadrants follow by simple permutations and reflections of the parameters.

Refer to caption
Figure 9: Power surface optimal gg-test

5 Application: Educational Attainment and Gender

To illustrate the usage of the new test, we consider the mediation analysis in Alan et al. (2018) on the effect of elementary school teachers’ gender beliefs, either traditional or progressive, on student mathematical and verbal achievements. They exploit the unique institutional features of Turkish data that provides a natural experiment in the random allocation of teachers to schools. Information consists of approximately 4,000 third- and fourth-grade students and their 145 teachers. The data are available in their online appendix. Children are divided into three groups depending on the length of exposure to a participating teacher: “1-year exposure” (at most one year), “2-3 year exposure” (more than one year and at most three years) and “4-year exposure” (at most four years). Alan et al. (2018) consider three potential mediators, but we focus on students’ own gender role beliefs. In the notation of equations (1)-(3), XX is a dummy whether the teacher is classified as traditional or progressive. The mediating variable MM is the student’s own belief on gender roles. We focus on the verbal test scores as the dependent variable YY. Table 2 shows the estimates for the indirect effect for girls based on the full sample and the three exposure groups after controlling for school fixed effects, student characteristics, family characteristics, teacher characteristics, teacher styles and teacher effort.

The results for the full sample are similar to the values shown in Table 5 and Table 6 of Alan et al. (2018), although they use the approach by Imai et al. (2013). We consider three different exposure duration groups. The tt-ratios of θ^1\hat{\theta}_{1} are not significant at the 5%5\% level for more than 1-year exposure. For the 1-year exposure, however, it is significant, but the tt-ratio t2=1.941t_{2}=-1.941 of θ^2\hat{\theta}_{2} is not. So, the LR test would not find a significant effect and neither does the simulation based method Alan et al. (2018) use. Using the new test, however, we have |t|(1)=1.941\left|t\right|_{(1)}=1.941 and |t|(2)=2.052|t|_{(2)}=2.052, such that g(2.05)=1.9175g(2.05)=1.9175 and consequently |t|(1)>g(|t|(2))|t|_{(1)}>g(|t|_{(2)}) and the proposed test finds the mediation effect significant at the 5%5\% level. The R function in Appendix E can be used if greater precision is desired, e.g. g(2.052)=1.9195g(2.052)=1.9195.

Exposure: Full 1-Year 2-3 Year 4-Year
θ1^:\hat{\theta_{1}}: 0.199 0.256 0.109 0.064
tt-ratio: 3.140 2.052 1.065 0.513
θ2^:\hat{\theta_{2}}: -0.119 -0.097 -0.125 -0.113
tt-ratio: -5.343 -1.941 -4.163 -1.931
θ1^θ1^\hat{\theta_{1}}\cdot\hat{\theta_{1}}: -0.024 -0.025 -0.014 -0.007
Table 2: Mediation effect of students’ own gender role beliefs on verbal test scores by exposure to progressive or traditional teachers. Full-sample results are similar to the values of Table 5 (Gender Role Beliefs: Girls) and Table 6 (Panel B. Verbal Test Scores: Gender Role Beliefs) of Alan et al. (2018) who use the approach by Imai et al. (2010), which is slightly different, but the value of the indirect effect in the last row is numerically the same as ours based on the regressions (1) and (2) extended with the same controls.

6 Higher Dimensions

In a further empirical illustration below, mediation may be via channels that involve two mediating variables. This requires a multivariate extension of the new test. The necessary invariance and distribution theory was given in Section 2 but a further aspect is the coherency between solutions in different dimensions. Consider the null hypothesis H0:θ1θ2θK=0H_{0}:\theta_{1}\theta_{2}\cdots\theta_{K}=0 in KK dimensions. If it were known that θK0,\theta_{K}\neq 0, then the null hypothesis reduces to H0:θ1θ2θK1=0.H_{0}:\theta_{1}\theta_{2}\cdots\theta_{K-1}=0. This implies that the critical region for the K1K-1 corresponding tt-statistics should reduce to the solution found for K1K-1 dimensions when |μK|\left|\mu_{K}\right| is large. For large values of |TK|\left|T_{K}\right| (very small pp-values) it is essentially known that μK\mu_{K} and θK\theta_{K} are non-zero. The probability of rejection will effectively depend only on the K1K-1 other tt-values. In two dimensions this means that as tt\rightarrow\infty the boundary function g(t)1.96g\left(t\right)\rightarrow 1.96, which is the one-dimensional solution for testing H0:θ1=0H_{0}:\theta_{1}=0 when α=5%.\alpha=5\%. In three dimensions it means that the solution must reduce to the gg-test derived in Section 4. We make this requirement explicit in the definition below.

Definition 5

The gg-boundary in dimension KK for the test that rejects if |T|(1)>g(|T|(2),|T|(K))\newline \left|T\right|_{\left(1\right)}>g\left(\left|T\right|_{\left(2\right)},\cdots\left|T\right|_{\left(K\right)}\right) is dimensionally coherent if for each 2kK2\leq k\leq K

limtkg(t2,,tk1,tk)=g(t2,,tk1)\lim_{t_{k}\rightarrow\infty}g\left(t_{2},\cdots,t_{k-1},t_{k}\right)=g\left(t_{2},\cdots,t_{k-1}\right)

We used a multivariate spline generalization to implement the varying gg-method based on barycentric coordinates in three dimensions. It resulted in a maximum of 0.2%0.2\% points difference from 5%5\%. But imposing dimensional coherency was complicated and the problem suffers from the curse of dimensionality: the dimension of the integral increases with KK and the number of knots necessary to define gg impedes optimization. For practical purposes we therefore propose a simple solution that exploits the dimensional coherency inductively and weighs the LR test in dimension KK with the solution obtained in dimension K1K-1. First note that one could satisfy the coherency condition in three dimensions by simply rejecting when |T|(1)>g(|T|(2)),\left|T\right|_{\left(1\right)}>g(\left|T\right|_{\left(2\right)}), irrespective of |T|(3).\left|T\right|_{\left(3\right)}. This results in an invalid test however, because it is oversized with a maximum NRP of 7.2%7.2\% when K=3K=3. The LR test on the other hand is conservative, in particular near the origin. A practical solution therefore is to use a weighted average between the liberal and conservative boundary. We use weights that depend on the largest absolute tt-statistic. In particular for K=3K=3

g(t2,t3)=(1w(t3))gLR(t2)+w(t3)g(t2),g(t_{2},t_{3})=(1-w(t_{3}))g_{LR}(t_{2})+w(t_{3})g(t_{2}),\ \

with weight function w(t3)w(t_{3}) a linear spline with 0w(t3)1,0\leq w(t_{3})\leq 1, and limt3w(t3)=1\lim_{t_{3}\rightarrow\infty}w(t_{3})=1. The test rejects if |T|(1)>g(|T|(2),|T|(3)).\left|T\right|_{\left(1\right)}>g(\left|T\right|_{\left(2\right)},\left|T\right|_{\left(3\right)}). Minimizing deviation ofthe NRP from the significance level and imposing the size restriction, results in a spline with knots:

{(0,0)},{(1.35,0.959),(2.025,0.842),(2.7,1),(,1)},\{(0,0)\},\{(1.35,0.959),(2.025,0.842),(2.7,1),(\infty,1)\},

leading to a maximum of 0.13%0.13\% points difference in NRPs from 5%5\% and never exceeding 5%5\%. The solution is shown in Figure 10.

Refer to caption
Figure 10: The gg-boundary in 3D for the unsorted tt-statistics (T1,T2,T3)\left(T_{1},T_{2},T_{3}\right). CR is furthest removed from the origin and includes e.g. (4,4,4). The edges show the 2D solution since one tt-statistic is very large. If two tt-statistics are very large then it reduces to 1.96, the 1D solution.

7 Empirical Illustration

For a numerical illustration, we consider the recursive model of union sentiment among southern nonunion textile workers as used by Bollen and Stine (1990). The model:

[ym2m1]=[0β12β1300β23000][ym2m1]+[τ1100α220α32][x1x2]+[uv1v2],\left[\begin{array}[]{l}y\\ m_{2}\\ m_{1}\end{array}\right]=\left[\begin{array}[]{lll}0&\beta_{12}&\beta_{13}\\ 0&0&\beta_{23}\\ 0&0&0\end{array}\right]\left[\begin{array}[]{l}y\\ m_{2}\\ m_{1}\end{array}\right]+\left[\begin{array}[]{ll}\tau_{11}&0\\ 0&\alpha_{22}\\ 0&\alpha_{32}\end{array}\right]\left[\begin{array}[]{l}x_{1}\\ x_{2}\end{array}\right]+\left[\begin{array}[]{l}u\\ v_{1}\\ v_{2}\end{array}\right], (11)

is a simplified version of McDonald and Clelland (1984) and discussed in some detail by Bollen (1989, p. 82–93). It analyses the direct and indirect effects of tenure and age on union sentiment via deference and/or labor activism. Tenure x1x_{1} is measured in log of years working in a particular textile mill and age x2x_{2} is measured in years. The variables sentiment towards unions yy, deference/submissiveness to managers m1m_{1}, and support for labor activism m2m_{2}, are measures based on 7, 4, and 9 survey questions respectively. The disturbances (uu, v1v_{1} and v2v_{2}) are assumed to be uncorrelated across equations and individuals. When they are normally distributed, ML estimation of the system reduces to OLS applied to each equation separately due to the recursive structure.

Refer to caption
Figure 11: Union sentiment mediation graph

We use a selection of 100 observations out of the original 173 and focus on three alternative theories of the indirect effects from age to union sentiment: two competing parallel effects that the age effect is mediated by increased deference in which case i1=α32β13i_{1}=\alpha_{32}\beta_{13} quantifies the indirect effect. The alternative mediation channel is that activism mediates such that i2=α22β12i_{2}=\alpha_{22}\beta_{12} is the indirect effect. The third channel is a serial effect that age affects deference, which in turn affects activism, which in turn affects union sentiment such that i3=α32β12β23i_{3}=\alpha_{32}\beta_{12}\beta_{23} measures the indirect effect. Figure 11 illustrates the three mediation channels. The OLS estimates of the coefficients of the structural equations and their tt-statistics are shown in Table 3.

α32\alpha_{32} α22\alpha_{22} β23\beta_{23} β12\beta_{12} β13\beta_{13} τ11\tau_{11}
Estimate –0.050 0.057 –0.283 0.987 –0.215 0.720
tt-statistic –1.902 2.709 –3.582 7.120 –1.838 1.777
Table 3: OLS Estimates and tt-statistics for Union Sentiment Model (N=100N=100)

The point estimates of the indirect effects and their tt-statistics based on the delta method are shown in Table 4. For the gg-test we need the absolute order statistics, evaluate g,g, and compare. For H0i1:α32β13=0H_{0}^{i_{1}}:\alpha_{32}\beta_{13}=0 we observe |t(β^13)|=1.838>1.770=g(1.902)=g(|t(α^32)|),|t(\hat{\beta}_{13})|=1.838>1.770=g(1.902)=g(|t(\hat{\alpha}_{32})|), and hence reject. For H0i2:α22β12=0H_{0}^{i_{2}}:\alpha_{22}\beta_{12}=0  we have |t(α^22)|=2.709>1.960=g(7.120)=g(|t(β^12)|)|t(\hat{\alpha}_{22})|=2.709>1.960=g(7.120)=g(|t\left(\hat{\beta}_{12}\right)|) and also reject. Testing the last null hypothesis H0i3:α32β12β23=0H_{0}^{i_{3}}:\alpha_{32}\beta_{12}\beta_{23}=0 requires the three-dimensional solution given in Figure 10. We have |t(α^32)|=1.902<1.96=g2(3.582,7.120)=g2(|t(β^23|),|t(β^12)|)|t(\hat{\alpha}_{32})|=1.902<1.96=g_{2}(3.582,7.120)=g_{2}(|t\left(\hat{\beta}_{23}|\right),|t\left(\hat{\beta}_{12}\right)|) and do not reject.

Estimate Sobel tt-statistic gg-test
i1=α32β13i_{1}=\alpha_{32}\beta_{13} 0.01080.0108 1.3221.322 1.838>1.770=g(|1.902|)1.838^{\ast}>1.770=g(|-1.902|)
i2=α22β12i_{2}=\alpha_{22}\beta_{12} 0.05610.0561 2.5322.532^{\ast} 2.709>1.960=g(7.120)2.709^{\ast}>1.960=g(7.120)
i3=α32β12β23i_{3}=\alpha_{32}\beta_{12}\beta_{23} 0.01400.0140 1.6351.635 1.9021.960=g2(3.582,7.120)1.902\ngtr 1.960=g_{2}(3.582,7.120)
Table 4: Estimates, Sobel tt-statistics and gg-test. * indicates significance at 5%5\%

The Sobel test with critical value 1.961.96 concludes that i2i_{2} is significant but does not find enough evidence for the i1i_{1} mediation channel. The new gg-test in contrast, concludes that i1i_{1} is also significant. Both tt-values in this case are smaller than 1.961.96, so the LR test would not reject either. The two tt-values are of comparable magnitude and the gg-test finds a significant mediation effect. For implementation of the gg-test only the relevant tt-statistics are required. The absolute values are ordered and the smallest value compared with the value of the gg-function evaluated at the largest absolute tt-value. This can be looked up in Table 1 (possibly using linear interpolation) or one can use the R code provided in Appendix E.444The bootstrap is a popular alternative for testing mediation. Because of the asymmetry of the distribution involved this is carried out through alternative confidence intervals of the indirect effect. See e.g. MacKinnon et al. (2004) and Preacher and Hayes (2008). The bootstrap is not valid. Simulations we carried out showed that bootstrap tests for mediation based on generally preferred BCa confidence intervals can have sizes of 8% when n=100n=100 and higher for nn smaller.

For i3i_{3} both tests draw the same conclusion. The three tt-values involved are not of comparable magnitude and the tt-statistics for β12\beta_{12} and β23\beta_{23} are so large that rejecting the null H0:α32β12β23=0H_{0}:\alpha_{32}\beta_{12}\beta_{23}=0 essentially depends on whether α32\alpha_{32} is zero. The corresponding absolute tt-value of 1.901.90 is too small to warrant such conclusion.

8 Conclusion

This paper proposes a new near similar more powerful mediation test that is simple to used based on two ordinary tt-statistics. The mediation problem is empirically extremely important in many different fields. Theoretically we solve an interesting statistical problem which has generated results dating back to Craig (1936) and still continues today with contributions on poor performance of the Wald statistic, construction of similar tests, and hypotheses with singularities. A main theoretical contribution has been the derivation of an exact similar test that is unique within the general class considered (with CR that is topologically simply connected and has weak monotone càdlág boundary). This exact test has unattractive statistical properties, however, leading us to consider a class of nearly similar tests and choose an attractive test within it. By relaxing the strict similarity condition we are able to construct a near similar test that has superior power and avoids the disagreeable properties of the exact similar test and would please even statistically erudite emperors in Perlman and Wu (1999). The new test can also be justified asymptotically under much weaker conditions and other estimation methods.

The new test is constructed using a new general method we propose for constructing tests that are approximately similar. This varying-gg method considers a flexible critical region boundary and minimizes the difference from the level α\alpha of the rejection probabilities at a number of points on the boundary of the null hypothesis. Conceptually and practically this was very simple and straightforward to implement. It does not require, as in other approaches, a choice of mixture distribution, nor the construction of least favorable distributions. Numerically it is also attractive in terms of convergence properties and avoids the need for simulations. The appropriate distribution theory for the mediation case and higher dimensional extensions is explicitly given. All our calculations are done using numerical integration using the distribution of the maximal invariant.

The new method is applicable to many other testing problems with nuisance parameters. It is remarkable that this simple method works so well and can deliver substantial improvements. Even the simplest linear interpolation implementation for the mediation hypothesis increases power by almost 5%5\% points when mediation effects are small.

We have calculated a power envelope upper bound for the mediation testing problem that is very tight. Using this result, we were able to construct a test that is optimal within the class of near similar tests. It minimizes the total difference between its power surface and the power envelope bound. It results in a point wise difference less than 0.00620.0062 for all alternative parameter points considered. This implies that the test is practically optimal even if a more general class of possible test construction is considered. A power envelope for nonsimilar tests showed that power loss due to the similarity requirement is minimal since the maximum power loss is less than 2%2\% points (when maximum power is around 40%40\%).

The optimal gg-test satisfies the size condition. The critical region is strictly larger than the LR and Wald critical regions and is therefore strictly and uniformly more powerful. The classic tests are therefore not admissible and their bootstrapped variants are not valid. For large values of the standardized coefficients the power difference becomes negligible, but when mediation effects are small or have relatively large standard errors, the power can be close to 5%5\% points higher than these classic 5%5\%-level tests. This has important consequences for empirical work. It enables researcher to prove mediation effects earlier in circumstances that one could not show mediation before due to extreme conservativeness of standard tests near the origin.

Appendix A Theory

A.1 Elementary Relation

Let y=(y1,,yn)y=\left(y_{1},\cdots,y_{n}\right)^{\prime}, m=(m1,,mn)m=\left(m_{1},\cdots,m_{n}\right)^{\prime}, x=(x1,,xn)x=\left(x_{1},\cdots,x_{n}\right)^{\prime}, be vectors of observables in deviations from their means such that y¯=0\bar{y}=0, m¯=0\bar{m}=0, x¯=0\bar{x}=0 and disturbance vectors u=(u1,,un)u=\left(u_{1},\cdots,u_{n}\right)^{\prime}, v=(v1,,vn).v=\left(v_{1},\cdots,v_{n}\right)^{\prime}. The model is then:

yi\displaystyle y_{i} =τxi+θ2mi+ui,\displaystyle=\tau x_{i}+\theta_{2}m_{i}+u_{i}, (12)
mi\displaystyle m_{i} =θ1xi+vi,\displaystyle=\theta_{1}x_{i}+v_{i}, (13)

and the restricted version of Equation (12) with θ2=0\theta_{2}=0 equals:

yi=τxi+wi.y_{i}=\tau^{\ast}x_{i}+w_{i}. (14)

The claim τ^=τ^+θ^1θ^2\hat{\tau}^{\ast}=\hat{\tau}+\hat{\theta}_{1}\hat{\theta}_{2} follows from a standard exercise to relate restricted and unrestricted OLS estimators:

τ^=(xx)1xy=(xx)1x(xτ^+mθ^2+u^)=τ^+(xx)1xmθ^2+(xx)1xu^,\widehat{\tau^{\ast}}=\left(x^{\prime}x\right)^{-1}x^{\prime}y=\left(x^{\prime}x\right)^{-1}x^{\prime}\left(x\hat{\tau}+m\hat{\theta}_{2}+\hat{u}\right)=\hat{\tau}+\left(x^{\prime}x\right)^{-1}x^{\prime}m\hat{\theta}_{2}+\left(x^{\prime}x\right)^{-1}x^{\prime}\hat{u},

and θ^1=(xx)1xm\hat{\theta}_{1}=\left(x^{\prime}x\right)^{-1}x^{\prime}m is the OLS estimator in Equation (13) and xu^=0x^{\prime}\hat{u}=0 since u^\hat{u} are the OLS residuals from Equation (12) and orthogonal to x.x.
The parameter relation ττ=θ1θ2\tau^{\ast}-\tau=\theta_{1}\theta_{2} follows by substituting (13) in (12):

yi=τxi+θ2mi+ui=(τ+θ2θ1)xi+(θ2vi+ui)=τxi+wi.y_{i}=\tau x_{i}+\theta_{2}m_{i}+u_{i}=\left(\tau+\theta_{2}\theta_{1}\right)x_{i}+\left(\theta_{2}v_{i}+u_{i}\right)=\tau^{\ast}x_{i}+w_{i}.

It follows that H0:θ1θ2=0H0:τ=τ.H_{0}:\theta_{1}\theta_{2}=0\Leftrightarrow H_{0}:\tau^{\ast}=\tau.

A.2 Likelihood

The joint density of (y,m)\left(y,m\right) given xx is f(y,m|x)=f(y|m,x)f(m,x),f\left(\left.y,m\right|\ x\right)=f\left(\left.y\right|\ m,x\right)f\left(m,x\right), and according to the model:

yi|mi,xi\displaystyle\left.y_{i}\right|m_{i},x_{i} N(τxi+θ2mi,σ11),\displaystyle\sim N\left(\tau x_{i}+\theta_{2}m_{i},\sigma_{11}\right),
mi|xi\displaystyle\left.m_{i}\right|x_{i} N(θ1xi,σ22).\displaystyle\sim N\left(\theta_{1}x_{i},\sigma_{22}\right).

Hence the log-likelihood: =(τ,θ2,σ11,θ1,σ22)=logf(y|m,x;τ,θ2,σ11)+logf(m|x;θ1,σ22),\ell=\ell\left(\tau,\theta_{2},\sigma_{11},\theta_{1},\sigma_{22}\right)=\log f\left(y|m,x;\tau,\theta_{2},\sigma_{11}\right)+\log f\left(m|x;\theta_{1},\sigma_{22}\right), for nn independent observations equals Equation (5) which can be written as:

\displaystyle\ell 12σ11yy+τσ11yx+θ2σ11ym(τθ2σ11θ1σ22)xm12(θ22σ11+1σ22)mm+\displaystyle\propto-\frac{1}{2\sigma_{11}}y^{\prime}y+\frac{\tau}{\sigma_{11}}y^{\prime}x+\frac{\theta_{2}}{\sigma_{11}}y^{\prime}m-\left(\frac{\tau\theta_{2}}{\sigma_{11}}-\frac{\theta_{1}}{\sigma_{22}}\right)x^{\prime}m-\frac{1}{2}\left(\frac{\theta_{2}^{2}}{\sigma_{11}}+\frac{1}{\sigma_{22}}\right)m^{\prime}m+
(τ22σ11θ122σ22)xxn2log(σ11σ22)\displaystyle~~~\ \ \ ~~~\ \ \ ~~~\ \ \ ~~~\ \ \ ~~~\ \ \ ~~~\ \ \ ~~~\ \ \ ~~~\ \ \ ~~~\ \ \ ~~~\ \ \ -\left(\frac{\tau^{2}}{2\sigma_{11}}-\frac{\theta_{1}^{2}}{2\sigma_{22}}\right)x^{\prime}x-\frac{n}{2}\log\left(\sigma_{11}\sigma_{22}\right)
=ηrκ(η,xx),with:\displaystyle=\eta^{\prime}r-\kappa\left(\eta,x^{\prime}x\right),~~~\ \ \ with:
η\displaystyle\eta =(12σ11,τσ11,θ2σ11,(τθ2σ11θ1σ22),12(θ22σ11+1σ22)),\displaystyle=\left(-\frac{1}{2\sigma_{11}},\frac{\tau}{\sigma_{11}},\frac{\theta_{2}}{\sigma_{11}},-\left(\frac{\tau\theta_{2}}{\sigma_{11}}-\frac{\theta_{1}}{\sigma_{22}}\right),-\frac{1}{2}\left(\frac{\theta_{2}^{2}}{\sigma_{11}}+\frac{1}{\sigma_{22}}\right)\right)^{\prime},
r\displaystyle r =(yy,yx,ym,xm,mm),\displaystyle=\left(y^{\prime}y,y^{\prime}x,y^{\prime}m,x^{\prime}m,m^{\prime}m\right)^{\prime},

and κ\kappa some function of η\eta and xxx^{\prime}x which is fixed. Since dim(η)=dim(r)\dim\left(\eta\right)=\dim\left(r\right) the model is a full exponential model of dimension five following the Koopman-Fisher-Darmois theorem (see Van Garderen (1997)) and rr is a complete sufficient statistic. The score 𝐬(τ,θ2,σ11,θ1,σ22)=𝐬=(𝐬1,𝐬2)\mathbf{s}\left(\tau,\theta_{2},\sigma_{11},\theta_{1},\sigma_{22}\right)=\mathbf{s}=\left(\mathbf{s}_{1}^{\prime},\mathbf{s}_{2}^{\prime}\right)^{\prime} is analogues to the scores of the two separate regression models since (τ,θ2,σ11)\left(\tau,\theta_{2},\sigma_{11}\right) appears in the first equation only, and (θ1,σ22)\left(\theta_{1},\sigma_{22}\right) appears in the second equation only. So:

𝐬=((yτxθ2m)xσ11,(yτxθ2m)mσ11,(yτxθ2m)(yτxθ2m)2σ112n2σ11,(mθ1x)xσ22,(mθ1x)(mθ1x)2σ222n2σ22)\mathbf{s=}\left(\frac{\left(y-\tau x-\theta_{2}m\right)^{\prime}x}{\sigma_{11}},\frac{\left(y-\tau x-\theta_{2}m\right)^{\prime}m}{\sigma_{11}},\frac{\left(y-\tau x-\theta_{2}m\right)^{\prime}\left(y-\tau x-\theta_{2}m\right)}{2\sigma_{11}^{2}}-\frac{n}{2\sigma_{11}},\frac{\left(m-\theta_{1}x\right)^{\prime}x}{\sigma_{22}},\frac{\left(m-\theta_{1}x\right)^{\prime}\left(m-\theta_{1}x\right)}{2\sigma_{22}^{2}}-\frac{n}{2\sigma_{22}}\right)^{\prime}
and the Maximum Likelihood Estimator (MLE) equals the MLE for the two equations separately:

(τ^θ^2)\displaystyle\left(\begin{array}[]{c}\hat{\tau}\\ \hat{\theta}_{2}\end{array}\right) =((x:m)(x:m))1(x:m)y;σ^11=1nyMXy;\displaystyle=\left(\left(x:m\right)^{\prime}\left(x:m\right)\right)^{-1}\left(x:m\right)^{\prime}y;~~~\ \ \ \widehat{\sigma}_{11}=\frac{1}{n}y^{\prime}M_{X}y;
θ^1\displaystyle\hat{\theta}_{1} =(xx)1xm;σ^22=1nmMxm,\displaystyle=\left(x^{\prime}x\right)^{-1}x^{\prime}m;~~~\ \ \ \widehat{\sigma}_{22}=\frac{1}{n}m^{\prime}M_{x}m,

with MA=IA(AA)1AM_{A}=I-A\left(A^{\prime}A\right)^{-1}A^{\prime} and X=[x:m]X=\left[x:m\right] an n×2n\times 2 matrix. The MLE is minimal sufficient and complete because it is a bijective transformation of rr which is a minimal sufficient and complete statistic.

A.3 Classic Tests

Wald test. Under H0:θ1θ2=𝐫(θ1,θ2)=0H_{0}:\theta_{1}\theta_{2}=\mathbf{r}\left(\theta_{1},\theta_{2}\right)=0. Then R(θ1,θ2)=𝐫(θ1,θ2)(θ1,θ2)=(θ2,θ1)R\left(\theta_{1},\theta_{2}\right)=\frac{\partial\mathbf{r}\left(\theta_{1},\theta_{2}\right)}{\partial\left(\theta_{1},\theta_{2}\right)^{\prime}}=\left(\theta_{2},\theta_{1}\right)^{\prime} and evaluated at the (unrestricted) MLE equals R(θ^1,θ^2)=(θ^2,θ^1).R\left(\hat{\theta}_{1},\hat{\theta}_{2}\right)=\left(\hat{\theta}_{2},\hat{\theta}_{1}\right)^{\prime}. The Wald test therefore becomes:

W\displaystyle W =θ^1θ^2((θ^2θ^1)(σθ^1200σθ^22)(θ^2θ^1))1θ^1θ^2\displaystyle=\hat{\theta}_{1}\hat{\theta}_{2}\left(\left(\begin{array}[]{c}\hat{\theta}_{2}\\ \hat{\theta}_{1}\end{array}\right)^{\prime}\left(\begin{array}[]{ll}\sigma_{\hat{\theta}_{1}}^{2}&0\\ 0&\sigma_{\hat{\theta}_{2}}^{2}\end{array}\right)\left(\begin{array}[]{c}\hat{\theta}_{2}\\ \hat{\theta}_{1}\end{array}\right)\right)^{-1}\hat{\theta}_{1}\hat{\theta}_{2}
=θ^12θ^22θ^12σθ^22+θ^22σθ^12(σθ^12σθ^22)1(σθ^12σθ^22)1=T12T22T12+T22\displaystyle=\frac{\hat{\theta}_{1}^{2}\hat{\theta}_{2}^{2}}{\hat{\theta}_{1}^{2}\sigma_{\hat{\theta}_{2}}^{2}+\hat{\theta}_{2}^{2}\sigma_{\hat{\theta}_{1}}^{2}}\cdot\frac{\left(\sigma_{\hat{\theta}_{1}}^{2}\sigma_{\hat{\theta}_{2}}^{2}\right)^{-1}}{\left(\sigma_{\hat{\theta}_{1}}^{2}\sigma_{\hat{\theta}_{2}}^{2}\right)^{-1}}=\frac{T_{1}^{2}T_{2}^{2}}{T_{1}^{2}+T_{2}^{2}}

The Sobel test equals W\sqrt{W} and is usually expressed as the square root of the first term in the second line:θ^1θ^2θ^12σθ^22+θ^22σθ^12\frac{\hat{\theta}_{1}\hat{\theta}_{2}}{\sqrt{\hat{\theta}_{1}^{2}\sigma_{\hat{\theta}_{2}}^{2}+\hat{\theta}_{2}^{2}\sigma_{\hat{\theta}_{1}}^{2}}}.

LR test. The maximum value of the log-likelihood can be expressed in terms of the OLS residual sum of squares in the usual way for the first and second equation, RSS1RSS_{1} and RSS2RSS_{2} respectively:

(τ^,θ^2,σ^11,θ^1,σ^22)\displaystyle\ell\left(\hat{\tau},\hat{\theta}_{2},\hat{\sigma}_{11},\hat{\theta}_{1},\hat{\sigma}_{22}\right) 12σ^11i=1n(yiτ^xiθ^2mi)2n2log(σ^11)+\displaystyle\propto-\frac{1}{2\hat{\sigma}_{11}}\sum_{i=1}^{n}\left(y_{i}-\hat{\tau}x_{i}-\hat{\theta}_{2}m_{i}\right)^{2}-\frac{n}{2}\log\left(\hat{\sigma}_{11}\right)+
12σ^22i=1n(miθ^1xi)2n2log(σ^22)\displaystyle~~~\ \ \ ~~~\ \ \ ~~~\ \ \ ~~~\ \ \ ~~~\ \ \ -\frac{1}{2\hat{\sigma}_{22}}\sum_{i=1}^{n}\left(m_{i}-\hat{\theta}_{1}x_{i}\right)^{2}-\frac{n}{2}\log\left(\hat{\sigma}_{22}\right)
=n2n2log(RSS1/n)n2n2log(RSS2/n).\displaystyle=-\frac{n}{2}-\frac{n}{2}\log\left(RSS_{1}/n\right)-\frac{n}{2}-\frac{n}{2}\log\left(RSS_{2}/n\right).

Denote the restricted residual sums of squares by RSS~1\widetilde{RSS}_{1} when θ2=0\theta_{2}=0, RSS~2\widetilde{RSS}_{2} when θ1=0,\theta_{1}=0, and the restricted maximized log-likelihoods by:

θ1=0(τ^,θ^2,σ^11,0,σ~22)\displaystyle\ell_{\theta_{1}=0}\left(\hat{\tau},\hat{\theta}_{2},\hat{\sigma}_{11},0,\tilde{\sigma}_{22}\right) n2n2log(RSS1/n)n2n2log(RSS~2/n),\displaystyle\propto-\frac{n}{2}-\frac{n}{2}\log\left(RSS_{1}/n\right)-\frac{n}{2}-\frac{n}{2}\log\left(\widetilde{RSS}_{2}/n\right),
θ2=0(τ~,0,σ~11,θ^1,σ^22)\displaystyle\ell_{\theta_{2}=0}\left(\tilde{\tau},0,\tilde{\sigma}_{11},\hat{\theta}_{1},\hat{\sigma}_{22}\right) n2n2log(RSS~1/n)n2n2log(RSS2/n).\displaystyle\propto-\frac{n}{2}-\frac{n}{2}\log\left(\widetilde{RSS}_{1}/n\right)-\frac{n}{2}-\frac{n}{2}\log\left(RSS_{2}/n\right).

The LR test of the full model with five parameters, against the model with the single restriction θ2=0\theta_{2}=0 equals:

LRθ2=0\displaystyle LR_{\theta_{2}=0} =2(n2log(RSS1/n)+n2log(RSS~1/n))=nlog(1+1nT22) since\displaystyle=2\left(-\frac{n}{2}\log\left(RSS_{1}/n\right)+\frac{n}{2}\log\left(\widetilde{RSS}_{1}/n\right)\right)=n\log\left(1+\frac{1}{n}T_{2}^{2}\right)\text{ since}
RSS~1\displaystyle\widetilde{RSS}_{1} =θ^2m(mMxm)1mθ^2+RSS1 and\displaystyle=\hat{\theta}_{2}^{\prime}m\left(m^{\prime}M_{x}m\right)^{-1}m\hat{\theta}_{2}+RSS_{1}\text{ \ and}
T22\displaystyle T_{2}^{2} =θ^2m(mMxm)1mθ^2/σ^11=(RSS~1RSS1)/(RSS1/n).\displaystyle=\hat{\theta}_{2}^{\prime}m\left(m^{\prime}M_{x}m\right)^{-1}m\hat{\theta}_{2}/\hat{\sigma}_{11}=\left(\widetilde{RSS}_{1}-RSS_{1}\right)/\left(RSS_{1}/n\right).

Analogously the LR test for θ1=0\theta_{1}=0 equals:

LRθ1=0=nlog(1+1nT12).LR_{\theta_{1}=0}=n\log\left(1+\frac{1}{n}T_{1}^{2}\right).

The likelihood ratio test for H0:θ1=0H_{0}:\theta_{1}=0 and/or θ2=0\theta_{2}=0 uses the maximized log-likelihood under the alternative (the same in both cases) and under the null, which means minimizing over LRθ1=0LR_{\theta_{1}=0} and LRθ2=0LR_{\theta_{2}=0} and hence:

LR=min{LRθ1=0,LRθ2=0},LR=\min\left\{LR_{\theta_{1}=0},LR_{\theta_{2}=0}\right\},

which is equivalent to rejecting for large values of:

min{T12,T22} ormin{|T1|,|T2|}.\min\left\{T_{1}^{2},T_{2}^{2}\right\}\text{ }\text{or}\min\left\{\left|T_{1}\right|,\left|T_{2}\right|\right\}.

Appendix B Invariance

When testing the no-mediation hypothesis H0:θ1θ2=0,H_{0}:\theta_{1}\theta_{2}=0, the parameters τ,σ11,\tau,\sigma_{11}, σ22\sigma_{22} are nuisance parameters and their values have no influence on whether the null is true or not. Hillier et al. (2021) shows that TT is maximal invariant with respect to an appropriate group of transformations that leaves the testing problem invariant and provides justification for restricting attention to the two tt-statistics. These exact invariance results provide a strong justification for restricting attention to the two tt-statistics for any sample size, finite or asymptotically, since it is natural to restrict the problem to procedures that are scale invariant and do not depend on τ.\tau.

The testing problem has further invariance properties. The problem is invariant to changing the signs (reflections) of T1T_{1} and T2T_{2} or permuting them. This leads to maximal invariants with a sample and parameter space that is only part of K\mathbb{R}^{K}.

Proof. (of Theorem 1) {|T|(1),,|T|(K)}\left\{\left|T\right|_{\left(1\right)},...,\left|T\right|_{\left(K\right)}\right\} is obviously invariant to changes in sign and permutation as a consequence of the absolute values and subsequent sorting. It is a maximal invariant because any two TT and T~\tilde{T} such that {|T|(1),,|T|(K)}={|T~|(1),,|T~|(K)}\left\{\left|T\right|_{\left(1\right)},...,\left|T\right|_{\left(K\right)}\right\}=\left\{|\tilde{T}|_{\left(1\right)},...,|\tilde{T}|_{\left(K\right)}\right\} can only hold if T~\tilde{T} is a permutation of TT with a number of sign changes. Hence there will exist a transformation 𝐠=𝐠1𝐠2G1×G2\mathbf{g}=\mathbf{g}_{1}\cdot\mathbf{g}_{2}\in G_{1}\times G_{2} s.t. T~=𝐠T.\tilde{T}=\mathbf{g}\cdot T. The same argument holds for {|μ|(1),,|μ|(K)}\left\{\left|\mu\right|_{\left(1\right)},...,\left|\mu\right|_{\left(K\right)}\right\} since the group of transformations on the parameter space is the same as on the sample space. That the distribution of {|T|(1),,|T|(K)}\left\{\left|T\right|_{\left(1\right)},...,\left|T\right|_{\left(K\right)}\right\} depends only on {|μ|(1),,|μ|(K)}\left\{\left|\mu\right|_{\left(1\right)},...,\left|\mu\right|_{\left(K\right)}\right\} is a property of a maximal invariant.   \square

Lemma 1 gives an explicit expression that further shows that the distribution is invariant under the G1×G2G_{1}\times G_{2}.

Proof. (of Lemma 1) The absolute value of the normal variate TkT_{k} with mean μk\mu_{k} and variance 11 follows a noncentral Chi-distribution with one degree of freedom. The KK distributions χ(|t|(k),|μ|(k))\chi\left(\left|t\right|_{\left(k\right)},\left|\mu\right|_{\left(k\right)}\right) are independent. The result is then a direct application of Vaughan and Venables (1972, eq. 6).   \square

Appendix C Proof of Theorem 2

The proof exploits the symmetries of the problem and the completeness of the normal distribution. We therefore consider the sample space of TT in 2\mathbb{R}^{2}, rather than the octant that is the sample space of the maximal invariant. The proof is constructive. It shows how to construct a monotonic weakly increasing function g()g\left(\cdot\right) with 1/α\left\lfloor 1/\alpha\right\rfloor steps, when 1/α1/\alpha is an integer. If 1/α1/\alpha is not an integer then it cannot satisfy the condition that the final step equals the asymptotic value limtg(t),\lim_{t\rightarrow\infty}g\left(t\right), which is the normal critical value and determined by letting μ.\mu\rightarrow\infty. There are two trivial solutions. First, if g(t)=0g\left(t\right)=0 then CRg=2CR_{g}=\mathbb{R}^{2} and the test always rejects, leading to an NRP of 100%100\% for all parameter values. The other trivial solution is g(t)=tg\left(t\right)=t such that ARg=2AR_{g}=\mathbb{R}^{2} and the test would never reject and the NRP is 0% for all μ\mu.


Proof. Symmetry of the problem implies for the boundary function g(t)=g(t)=g(|t|)g\left(-t\right)=g\left(t\right)=g\left(\left|t\right|\right) and g(t)t.g\left(t\right)\leq t. So we only define g(t)g\left(t\right) for t0t\geq 0\, and g(0)=0.g\left(0\right)=0. The boundary function g(t)g\left(t\right) is weakly monotonically increasing by assumption, but may be a step function. If g(t)g\left(t\right) is a step function then it has no ordinary inverse but we can define its generalized inverse as:

g1(t)=inf{xg(x)>t}g^{-1}\left(t\right)=\inf\left\{x\mid g\left(x\right)>t\right\}

(cf. quantile function e.g. Van der Vaart (2000, p.304)). This definition of g1(t)g^{-1}\left(t\right) with strict inequality is chosen such that a necessary condition for similarity may hold.

  1. 1.

    For any finite constant cv1cv_{1} we have

    limμ1P[|T1|>cv1μ1]=1\lim_{\mu_{1}\rightarrow\infty}P\left[\left|T_{1}\right|>cv_{1}\mid\mu_{1}\right]=1

    and rejection only depends on T2T_{2} when μ1\mu_{1}\rightarrow\infty. Hence P[|T2|>c]=2P[T2>c]=2(1Φ(c))=α,P\left[\left|T_{2}\right|>c_{\infty}\right]=2P\left[T_{2}>c_{\infty}\right]=2(1-\Phi\left(c_{\infty}\right))=\alpha, such that:

    c=Φ1(1α2).c_{\infty}=\Phi^{-1}\left(1-\frac{\alpha}{2}\right).

    Monotonicity of gg implies g(t)limtg\left(t\right)\leq\lim_{t\rightarrow\infty} g(t)=Φ1(1α2)g\left(t\right)=\Phi^{-1}\left(1-\frac{\alpha}{2}\right) and we follow the convention that

    g1(t)=+ if tΦ1(1α2).g^{-1}\left(t\right)=+\infty\text{ if }t\geq\Phi^{-1}\left(1-\frac{\alpha}{2}\right).
  2. 2.

    The null rejection probability as a function of μ\mu equals:

    NRPα(μ1)\displaystyle NRP_{\alpha}\left(\mu_{1}\right) =\displaystyle= 2+ϕ(t1μ1)[g(t1)g1(t1)ϕ(t2)𝑑t2]𝑑t1\displaystyle 2\int_{-\infty}^{+\infty}\phi\left(t_{1}-\mu_{1}\right)\left[\int_{g\left(t_{1}\right)}^{g^{-1}\left(t_{1}\right)}\phi\left(t_{2}\right)dt_{2}\right]dt_{1}
    =\displaystyle= 2+ϕ(t1μ1)[Φ(g1(t1))Φ(g(t1))]𝑑t1.\displaystyle 2\int_{-\infty}^{+\infty}\phi\left(t_{1}-\mu_{1}\right)\left[\Phi\left(g^{-1}\left(t_{1}\right)\right)-\Phi\left(g\left(t_{1}\right)\right)\right]dt_{1}.
  3. 3.

    Similarity requires NRPα(μ)=αμNRP_{\alpha}\left(\mu\right)=\alpha\forall\mu\in\mathbb{R}. Hence

    α\displaystyle\alpha =\displaystyle= 2+ϕ(t1μ1)[Φ(g1(t1))Φ(g(t1))]𝑑t1\displaystyle 2\int_{-\infty}^{+\infty}\phi\left(t_{1}-\mu_{1}\right)\left[\Phi\left(g^{-1}\left(t_{1}\right)\right)-\Phi\left(g\left(t_{1}\right)\right)\right]dt_{1}\Rightarrow
    0\displaystyle 0 =\displaystyle= +ϕ(t1μ1)[Φ(g1(t1))Φ(g(t1))α2]𝑑t1\displaystyle\int_{-\infty}^{+\infty}\phi\left(t_{1}-\mu_{1}\right)\left[\Phi\left(g^{-1}\left(t_{1}\right)\right)-\Phi\left(g\left(t_{1}\right)\right)-\frac{\alpha}{2}\right]dt_{1}
    =\displaystyle= +ϕ(t1μ1)F(t)𝑑t1, with\displaystyle\int_{-\infty}^{+\infty}\phi\left(t_{1}-\mu_{1}\right)F\left(t\right)dt_{1},\text{\ \ with}
    F(t)\displaystyle F\left(t\right) =\displaystyle= Φ(g1(t))Φ(g(t))α2.\displaystyle\Phi\left(g^{-1}\left(t\right)\right)-\Phi\left(g\left(t\right)\right)-\frac{\alpha}{2}. (15)
  4. 4.

    Completeness of the normal distribution with mean μ1\mu_{1} implies the condition

    F(t)=0.F\left(t\right)=0. (16)

    We will show that this leads to a step function and the proof iteratively determines the stepping points and step sizes, illustrated in the Figure 3.

  5. 5.

    Starting at t=0t=0 we have g(0)=0g\left(0\right)=0 by definition, but for a generalized inverse g1(0)g^{-1}\left(0\right) there are two possibilities:

    1. (a)

      g1(0)=0=g(0).g^{-1}\left(0\right)=0=g\left(0\right). For t=0,t=0, condition (16) implies: F(0)=Φ(g1(0))Φ(g(0))α2=F\left(0\right)=\Phi\left(g^{-1}\left(0\right)\right)-\Phi\left(g\left(0\right)\right)-\frac{\alpha}{2}= Φ(0)Φ(0)α2=α2=0\Phi\left(0\right)-\Phi\left(0\right)-\frac{\alpha}{2}=-\frac{\alpha}{2}=0 only holds if α=0\alpha=0.  In this case F(t)=Φ(g1(t))Φ(g(t))=0F\left(t\right)=\Phi\left(g^{-1}\left(t\right)\right)-\Phi\left(g\left(t\right)\right)=0 can only occur if g1(t)=g(t)=tg^{-1}\left(t\right)=g\left(t\right)=t and the test never rejects.

    2. (b)

      g1(0)=c1>0g^{-1}\left(0\right)=c_{1}>0. In this case

      1. i.

        g1(t)=inf{xg(x)>t}g^{-1}\left(t\right)=\inf\left\{x\mid g\left(x\right)>t\right\}  and g1(0)=inf{xg(x)>0}g^{-1}\left(0\right)=\inf\left\{x\mid g\left(x\right)>0\right\} imply g(t)=0,0t<c1g\left(t\right)=0,\forall 0\leq t<c_{1} and g(c1)>0.g\left(c_{1}\right)>0.

      2. ii.

        g(t)=0 0t<c1g\left(t\right)=0\ \forall\ 0\leq t<c_{1} and F(0)F\left(0\right) imply g1(t)=c1g^{-1}\left(t\right)=c_{1}  0t<c1.\forall\ 0\leq t<c_{1}.

      3. iii.

        g1(t)=c1g^{-1}\left(t\right)=c_{1}  0t<c1\forall\ 0\leq t<c_{1} and g1(t)=inf{xg(x)>t}g^{-1}\left(t\right)=\inf\left\{x\mid g\left(x\right)>t\right\} in turn imply g(c1)c1,g\left(c_{1}\right)\geq c_{1}, but g(t)tg\left(t\right)\leq t so

      4. iv.

        g(c1)=c1.g\left(c_{1}\right)=c_{1}.

      5. v.

        c1=Φ1(12+α2)c_{1}=\Phi^{-1}\left(\frac{1}{2}+\frac{\alpha}{2}\right) because F(0)=Φ(c1)Φ(0)α2=0F\left(0\right)=\Phi\left(c_{1}\right)-\Phi\left(0\right)-\frac{\alpha}{2}=0 implies Φ(g1(t))=12+α2\Phi\left(g^{-1}\left(t\right)\right)=\frac{1}{2}+\frac{\alpha}{2} and the result follows by the uniqueness of the inverse of Φ\Phi.

  6. 6.

    This argument can now be repeated.

    1. (a)

      In Φ(g1(c1))Φ(g(c1))α2=0\Phi\left(g^{-1}\left(c_{1}\right)\right)-\Phi\left(g\left(c_{1}\right)\right)-\frac{\alpha}{2}=0 implies Φ(g1(c1))=(12+α2)+α2\Phi\left(g^{-1}\left(c_{1}\right)\right)=\left(\frac{1}{2}+\frac{\alpha}{2}\right)+\frac{\alpha}{2} and g1(c1)=Φ1(12+2α2)=c2.g^{-1}\left(c_{1}\right)=\Phi^{-1}\left(\frac{1}{2}+\frac{2\alpha}{2}\right)=c_{2}. As in 5.b.i) g(t)=c1c1t<c2g\left(t\right)=c_{1}\forall c_{1}\leq t<c_{2}\Rightarrow ii) g1(t)=c2g^{-1}\left(t\right)=c_{2} c1t<c2\forall\ c_{1}\leq t<c_{2} \Rightarrow iii) g(c2)=c2g\left(c_{2}\right)=c_{2} and g1(c2)=Φ1(12+3α2)=c3.g^{-1}\left(c2\right)=\Phi^{-1}\left(\frac{1}{2}+\frac{3\alpha}{2}\right)=c_{3}.

    2. (b)

      After JJ repetitions of this argument we obtain a step function:

      g(|t|)\displaystyle g\left(\left|t\right|\right) =\displaystyle= cj:cj1t<cj,j=1,..,J\displaystyle c_{j}:c_{j-1}\leq t<c_{j},\ \ \ j=1,..,J
      c0\displaystyle c_{0} =\displaystyle= 0,cj=Φ1(12+jα2)\displaystyle 0,c_{j}=\Phi^{-1}\left(\frac{1}{2}+j\frac{\alpha}{2}\right)
  7. 7.

    An exact similar test only exists if 1/α1/\alpha is an integer.

    1. (a)

      If RR is an integer such that α=1/R\alpha=1/R then cR=Φ1(12+Rα2)=Φ1(1)=+c_{R}=\Phi^{-1}\left(\frac{1}{2}+\frac{R\ \alpha}{2}\right)=\Phi^{-1}\left(1\right)=+\infty and cR1=Φ1(12+(R1)α2)=Φ1(1α2)=c_{R-1}=\Phi^{-1}\left(\frac{1}{2}+\frac{\left(R-1\right)\ \alpha}{2}\right)=\Phi^{-1}\left(1-\frac{\alpha}{2}\right)= limt\lim_{t\rightarrow\infty} g(t)g\left(t\right).  The resulting step-function g(t)g\left(t\right) is constructed such that F(t)=0F\left(t\right)=0 for all tt\in\mathbb{R}. The NRP equals 2+ϕ(t1μ)[Φ(g1(t1))Φ(g(t1))]𝑑t1=α2\int_{-\infty}^{+\infty}\phi\left(t_{1}-\mu\right)\left[\Phi\left(g^{-1}\left(t_{1}\right)\right)-\Phi\left(g\left(t_{1}\right)\right)\right]dt_{1}=\alpha for all μ.\mu.
      So an exact similar test exists if 1/α1/\alpha is an integer.

    2. (b)

      If 1/α1/\alpha is not an integer then define R=1/αR=\left\lfloor 1/\alpha\right\rfloor (entier or floor function) RR\in\mathbb{N} and 0<r<10<r<1 the remainder such that α=1/(R+r).\alpha=1/\left(R+r\right). So Rα<1R\alpha<1. After RR iterations of the argument in Step 5 gives Φ1(12+Rα2)=Φ1(12+(R+rr)α2)=Φ1(1rα2)>Φ1(1α2).\Phi^{-1}\left(\frac{1}{2}+\frac{R\ \alpha}{2}\right)=\Phi^{-1}\left(\frac{1}{2}+\frac{\left(R+r-r\right)\alpha}{2}\right)=\Phi^{-1}\left(1-\frac{r\alpha}{2}\right)>\Phi^{-1}\left(1-\frac{\alpha}{2}\right). Similarly R1R-1 iterations give: Φ1(12+(R1)α2)=Φ1(1α2rα2)<Φ1(1α2)\Phi^{-1}\left(\frac{1}{2}+\frac{\left(R-1\right)\ \alpha}{2}\right)=\Phi^{-1}\left(1-\frac{\alpha}{2}-\frac{r\alpha}{2}\right)<\Phi^{-1}\left(1-\frac{\alpha}{2}\right). Therefore the step-function g()g\left(\cdot\right) does not have limtg(t)=\lim_{t\rightarrow\infty}g\left(t\right)= Φ1(1α2)\Phi^{-1}\left(1-\frac{\alpha}{2}\right) which is the requirement for the NRP to equal α\alpha as μ1.\mu_{1}\rightarrow\infty. Further note that the next step after RR iterations has (1+(1r)α2)>1\left(1+\frac{(1-r)\alpha}{2}\right)>1 so we cannot apply Φ1\Phi^{-1} to obtain the next boundary point.
      So no similar gg-test exists if 1/α1/\alpha is not an integer.

\square

Appendix D Algorithms

The construction of the optimal gg-test is in two steps. The first step is a basic implementation of the general varying-gg method. This generates a near similar test that deviates less than 0.01%0.01\% points from 5%.5\%. We use this ϵ\epsilon as a starting value for determining an upper bound to the power envelope. The second step is using this upper bound to derive an optimal gg-test that minimizes the distance between the power surface and the power envelope for tests in Γα,ϵ𝕄0\Gamma_{\alpha,\epsilon}^{\mathbb{M}_{0}}. Implementation of the varying-gg method.

Basic gg-function algorithm

  1. 1.

    Define g()g\left(\cdot\right) nonparametrically as a linear spline defined by J+2J+2 knots {(t(j),g(j))}j=0J+1,\left\{\left(t^{\left(j\right)},g^{\left(j\right)}\right)\right\}_{j=0}^{J+1}, i.e. by J+2J+2 values g(j)g^{\left(j\right)} on a grid of points t(j).t^{\left(j\right)}. The first and last knots are fixed at (0,0)(0,0) and (2.5,z0.025)(2.5,z_{0.025}) respectively, so there are JJ knots to be chosen. One of those knots is chosen t=z0.025t=z_{0.025} such that the LR boundary can be constructed as initializing function. For points tt not on the grid, g(t)g\left(t\right) is obtained by linear interpolation, and g(t)=z0.0251.96g\left(t\right)=z_{0.025}\approx 1.96 for t>2.5.t>2.5.

  2. 2.

    The criterion function Q(g)Q\left(g\right) is the accumulated NRP deviation from 5%5\% as measured by a quadratic loss function over a grid of points {μ0(ι)}ι=1Υ0\left\{\mu_{0}^{\left(\iota\right)}\right\}_{\iota=1}^{\Upsilon_{0}} with Υ0>J\Upsilon_{0}>J and μ0(1)=0:\mu_{0}^{\left(1\right)}=0:

    Q(g)\displaystyle Q\left(g\right) =ι=1Υ0(NRPg(μ0(ι))0.05)2,\displaystyle=\sum_{\iota=1}^{\Upsilon_{0}}\left(NRP_{g}\left(\mu_{0}^{\left(\iota\right)}\right)-0.05\right)^{2},
    s.t.NRPg(μ0(ι))\displaystyle s.t.~~~NRP_{g}\left(\mu_{0}^{\left(\iota\right)}\right) 0.05with\displaystyle\leq 0.05\ \ \ \ with
    NRPg(μ)\displaystyle NRP_{g}\left(\mu\right) =P[TCRg|μ1=0,μ2=μ0]\displaystyle=P\left[T\in CR_{g}\left|\mu_{1}=0,\mu_{2}=\mu\geq 0\right.\right]
  3. 3.

    Minimize Q(g)Q\left(g\right) by varying g()g\left(\cdot\right):

    1. (a)

      Initialize g()g\left(\cdot\right) with knots {(0,0),(1.96,1.96),(2.5,1.96)}\left\{\left(0,0\right),\left(1.96,1.96\right),\left(2.5,1.96\right)\right\} which is the J=1J=1 function corresponding to the LR boundary. The first and last knot are fixed and the middle one is varied when optimizing Q(g)Q\left(g\right).

    2. (b)

      For given g()g\left(\cdot\right) calculate the NRPs by numerical integration for the grid of Υ0\Upsilon_{0} noncentrality parameter points {(0,μ(ι))}ι=1Υ0\{(0,\mu^{\left(\iota\right)})\}_{\iota=1}^{\Upsilon_{0}} under the null, with Υ0J\Upsilon_{0}\geq J and calculate Q(g).Q\left(g\right).

    3. (c)

      Vary g()g\left(\cdot\right) by changing JJ knots and minimize the criterion function Q(g)Q\left(g\right), subject to:

      1. i.

        0g(j+1)g(j)<δ0\leq g^{\left(j+1\right)}-g^{\left(j\right)}<\delta\ : monotonicity and limited increase

      2. ii.

        g(t)tg\left(t\right)\leq t\ : logical restriction since maximal invariant is absolute order statistic and |T|(1)|T|(2)\left|T\right|_{\left(1\right)}\leq\left|T\right|_{\left(2\right)}

      3. iii.

        g(J+1)=z0.025g^{\left(J+1\right)}=z_{0.025}\ : dimensional coherence requires reduction to one-dimensional solution (see Section 6)

    4. (d)

      Increase the number of knots JJ and iterate until convergence.

Comments

  1. 1.

    We set g(t)=z0.025g\left(t\right)=z_{0.025} for t>2.5,t>2.5, because for large enough |T|2\left|T\right|_{2} it is essentially known that θ20\theta_{2}\neq 0 and the rejection depends only on whether θ1=0\theta_{1}=0 is rejected. The corresponding 5%5\% critical value for |T|1\left|T\right|_{1} based on the normal distribution is the usual z0.0251.96z_{0.025}\approx 1.96 as |T|2.\left|T\right|_{2}\rightarrow\infty.

  2. 2.

    For JJ small there are big gains in reducing the deviation from 5%5\% by varying the knots {t(j),g(j)}j=1J\left\{t^{\left(j\right)},g^{\left(j\right)}\right\}_{j=1}^{J} and also by increasing J,J, see Figure 5.

  3. 3.

    The number Υ0\Upsilon_{0} of μ(ι)\mu^{\left(\iota\right)} points to check similarity was chosen to be Υ0=76>J\Upsilon_{0}=76>J: 60 points equally spaced between 0 and 6, and 16 points equally spaced between 6 and 20. This imposes 152 side conditions. Step 3(c) imposes a further 3J3J restrictions approximately for every choice of JJ, and about 100100 when J=32.J=32.

  4. 4.

    We have actually used as a side condition 0.05ϵNRPg(μ0(ι))0.050.05-\epsilon\leq NRP_{g}\left(\mu_{0}^{\left(\iota\right)}\right)\leq 0.05 and iterated to find the smallest ϵ\epsilon that yields a feasible solution.

Optimal gg-function

In order to find the optimal gg, we minimize the sum of differences between gg’s power surface and the power envelope on a grid of points, subject to the size and ϵ\epsilon-similarity conditions. We impose monotonicity g(t(i+1))g(t(i))g\left(t^{\left(i+1\right)}\right)\geq g\left(t^{\left(i\right)}\right) and, since by definition of the absolute order statistic |T|(1)|T|(2)\left|T\right|_{\left(1\right)}\leq\left|T\right|_{\left(2\right)},we logically restrict gg to 0g(t)t0\leq g\left(t\right)\leq t.

Optimal gg-function algorithm

  1. 1.

    Define g()g\left(\cdot\right) nonparametrically as a linear spline defined above.

  2. 2.

    Define the criterion function Qϵ(g)Q_{\epsilon}^{\ast}\left(g\right) as the accumulated power difference over the triangular grid of points 𝕄1={(μ1(γ,κ),μ2(γ,κ))}1γ<κΥ1:\mathbb{M}_{1}=\left\{\left(\mu_{1}^{\left(\gamma,\kappa\right)},\mu_{2}^{\left(\gamma,\kappa\right)}\right)\right\}_{1\leq\gamma<\kappa\leq\Upsilon_{1}}:

    Qϵ(g)=κ=1Υ1γκπ¯(μ1(γ,κ),μ2(γ,κ))P[CRg(μ1(γ,κ),μ2(γ,κ))]Q_{\epsilon}^{\ast}\left(g\right)=\sum_{\kappa=1}^{\Upsilon_{1}}\sum_{\gamma\leq\kappa}\bar{\pi}\left(\mu_{1}^{\left(\gamma,\kappa\right)},\mu_{2}^{\left(\gamma,\kappa\right)}\right)-P\left[CR_{g}\mid\left(\mu_{1}^{\left(\gamma,\kappa\right)},\mu_{2}^{\left(\gamma,\kappa\right)}\right)\right]
  3. 3.

    Minimize Qϵ(g)Q_{\epsilon}^{\ast}\left(g\right) by varying g()g\left(\cdot\right):

    1. (a)

      Initialize with g()g\left(\cdot\right) equal to the LR boundary or the previously determined basic gg-function.

    2. (b)

      For given g()g\left(\cdot\right) calculate Qϵ(g)Q_{\epsilon}^{\ast}\left(g\right) by numerical integration.

    3. (c)

      Vary g()g\left(\cdot\right) by changing JJ knots and minimize the criterion function Qϵ(g)Q_{\epsilon}^{\ast}\left(g\right), subject to:

      1. i.

        0.05ϵP[CRg(0,μ0(ι))]0.05,ι=1,,Υ00.05-\epsilon\leq P\left[CR_{g}\mid\left(0,\mu_{0}^{\left(\iota\right)}\right)\right]\leq 0.05,~~\forall\iota=1,\cdots,\Upsilon_{0} : near similarity and size restrictions

      2. ii.

        0=g(0)g(t(j))g(t(j+1))t(j+1)0=g\left(0\right)\leq g\left(t^{\left(j\right)}\right)\leq g\left(t^{\left(j+1\right)}\right)\leq t^{\left(j+1\right)} : monotonicity,

      3. iii.

        g(j+1)g(j)<t(j+1)t(j):g^{\left(j+1\right)}-g^{\left(j\right)}<t^{\left(j+1\right)}-t^{\left(j\right)}: limited increase and derivative,

      4. iv.

        g(t)tg\left(t\right)\leq t : logical restriction since argument is absolute order statistic,

      5. v.

        g(J+1)=z0.025g^{\left(J+1\right)}=z_{0.025} : dimensional coherence.

    4. (d)

      Increase the number of knots JJ and iterate until convergence.

The basic implementation algorithm solved the optimal gg-boundary by minimizing ϵ.\epsilon. Once ϵ\epsilon is determined, the current algorithm is akin to solving a dual problem that uses ϵ\epsilon for the inequality restrictions and maximizes power. It minimizes the total difference from the power envelope.


Power Envelopes

We calculate two power envelopes: one for near similar tests in Γα,ϵ\Gamma_{\alpha,\epsilon} and a second for nonsimilar tests. The algorithm for calculating the power envelope is related to Chiburis (2009) and implemented in Julia, see Bezanson et al. (2017), using Gurobi, an optimization package that can handle many side restrictions; see Gurobi Optimization (2021). We maximize power subject to size and near similarity restrictions on a grid of Υ0{\Upsilon}_{0} parameter points under the null: 0.05ϵNRP(μ(ι))0.050.05-\epsilon\leq NRP\left(\mu^{\left(\iota\right)}\right)\leq 0.05 for ι=1,,Υ0\iota={1,...,\Upsilon}_{0}. The upper bounds ensure correct size, at least for the points considered. The lower bounds constitute the near similarity restriction. The power envelope is obtained by repeating this maximization on a grid of points (μ1,μ2)\left(\mu_{1},\mu_{2}\right) under the alternative.

For the nonsimilar power envelope we can discard the lower bound restrictions 0.05ϵNRP(μ(ι))0.05-\epsilon\leq NRP\left(\mu^{(\iota)}\right). The power can only increase (or remain the same) and the difference between the two different power envelopes is the power loss one suffers from insisting on similarity. This turns out to be less than 2%2\% points and it should be stressed that this overstates the loss since no single test achieves the power envelope.

Denote the parameter space for the ordered absolute noncentrality parameter
Ξ={(μ1,μ2)+×+0μ1μ2}.\Xi=\left\{\left(\mu_{1},\mu_{2}\right)\in\mathbb{\mathbb{R}}^{+}\mathbb{\times\mathbb{R}}^{+}\mid 0\leq\mu_{1}\leq\mu_{2}\right\}.

We will use a bounded (triangular) subset of this octant Ξ\Xi defined as
Ξ¯={(μ1,μ2)+×+0μ1μ2μmax}\overline{\Xi}=\left\{\left(\mu_{1},\mu_{2}\right)\in\mathbb{\mathbb{R}}^{+}\mathbb{\times\mathbb{R}}^{+}\mid 0\leq\mu_{1}\leq\mu_{2}\leq\mu_{\max}\right\} and partitioned it into a null and alternative parameter set
Ξ¯0={(μ1,μ2)+×+0=μ1μ2μmax}\overline{\Xi}_{0}=\left\{\left(\mu_{1},\mu_{2}\right)\in\mathbb{\mathbb{R}}^{+}\mathbb{\times\mathbb{R}}^{+}\mid 0=\mu_{1}\leq\mu_{2}\leq\mu_{\max}\right\} and
Ξ¯1={(μ1,μ2)+×+0<μ1μ2μmax}\overline{\Xi}_{1}=\left\{\left(\mu_{1},\mu_{2}\right)\in\mathbb{\mathbb{R}}^{+}\mathbb{\times\mathbb{R}}^{+}\mid 0<\mu_{1}\leq\mu_{2}\leq\mu_{\max}\right\} respectively.

Analogously define the sample space of the maximal invariant/absolute order statistic as 𝕋={(t1,t2)0+×0+t1t2}\mathbb{T=}\left\{\left(t_{1},t_{2}\right)\in\mathbb{R}_{0}^{+}\mathbb{\times}\mathbb{R}_{0}^{+}\mid t_{1}\leq t_{2}\right\}. Very large values of t1t_{1} and t2t_{2} are of limited interest and for computational purposes we can restrict ourselves to a bounded triangular subset of the sample space: 𝕋¯={(t1,t2)0+×0+t1t2tmax}.\overline{\mathbb{T}}=\left\{\left(t_{1},t_{2}\right)\in\mathbb{R}_{0}^{+}\mathbb{\times}\mathbb{R}_{0}^{+}\mid t_{1}\leq t_{2}\leq t_{\max}\right\}.

Power Envelope Algorithm

  1. 1.

    Discretize Ξ¯0\overline{\Xi}_{0} into Υ0\Upsilon_{0} points under H0:H_{0}: 𝕄0={(0,μ0(ι))}ι=1Υ0.\mathbb{M}_{0}=\left\{\left(0,\mu_{0}^{\left(\iota\right)}\right)\right\}_{\iota=1}^{\Upsilon_{0}}.

  2. 2.

    Discretize Ξ¯1\overline{\Xi}_{1} by choosing a triangular array of 12Υ1(1+Υ1)\frac{1}{2}\Upsilon_{1}\left(1+\Upsilon_{1}\right) points under H1:H_{1}: 𝕄1={(μ1(γ,κ),μ2(γ,κ))}1γκΥ1\mathbb{M}_{1}=\left\{\left(\mu_{1}^{\left(\gamma,\kappa\right)},\mu_{2}^{\left(\gamma,\kappa\right)}\right)\right\}_{1\leq\gamma\leq\kappa\leq\Upsilon_{1}}

  3. 3.

    Partition 𝕋¯\overline{\mathbb{T}} into squares sijs_{ij} with 1ijN1\leq i\leq j\leq N such that 1ijNsij=𝕋¯\mathop{\displaystyle\bigcup}\limits_{1\leq i\leq j\leq N}s_{ij}=\overline{\mathbb{T}} and sijskl=(i,j)(k,l).s_{ij}\cap s_{kl}=\varnothing\ \ \forall(i,j)\neq\left(k,l\right).

  4. 4.

    Under H0H_{0} for ι=1,,Υ0\iota=1,\cdots,\Upsilon_{0} calculate pijι=P[(|T|(1),|T|(2))sij(0,μ0(ι))Ξ¯0]p_{ij}^{\iota}=P\left[\left(\left|T\right|_{\left(1\right)},\left|T\right|_{\left(2\right)}\right)\in s_{ij}\mid\left(0,\mu_{0}^{\left(\iota\right)}\right)\in\overline{\Xi}_{0}\right]

  5. 5.

    For each 1γ,κm1\leq\gamma,\kappa\leq m choose μ(γ,κ)=(μ1(γ,κ),μ2(γ,κ))𝕄1\mu^{\left(\gamma,\kappa\right)}=\left(\mu_{1}^{\left(\gamma,\kappa\right)},\mu_{2}^{\left(\gamma,\kappa\right)}\right)\in\mathbb{M}_{1}  under the alternative. For this μ\mu:

    1. (a)

      Calculate pijγκ=P[(|T|(1),|T|(2))sijμ(γ,κ)=(μ1(γ,κ),μ2(γ,κ))]p_{ij}^{\gamma\kappa}=P\left[\left(\left|T\right|_{\left(1\right)},\left|T\right|_{\left(2\right)}\right)\in s_{ij}\mid\mu^{\left(\gamma,\kappa\right)}=\left(\mu_{1}^{\left(\gamma,\kappa\right)},\mu_{2}^{\left(\gamma,\kappa\right)}\right)\right] for each sij𝕋¯s_{ij}\in\overline{\mathbb{T}}

    2. (b)

      Determine the critical region to maximize the power

      max{ϕijγκ,1ijN}1ijNpijγκϕijγκ\underset{\left\{\phi_{ij}^{\gamma\kappa},1\leq i\leq j\leq N\right\}}{\max}\sum_{1\leq i\leq j\leq N}p_{ij}^{\gamma\kappa}\phi_{ij}^{\gamma\kappa}

      by selecting indicators ϕijγκ=𝟏CR(sij)\phi_{ij}^{\gamma\kappa}=\mathbf{1}_{CR}\left(s_{ij}\right) equal to 1 if sijs_{ij} is part of the critical region, or 0 if part of the acceptance region, subject to the near similarity and size restrictions on the NRPs:

      0.05ϵ1ijNpijιϕijμ0.05 for ι=1,,Υ00.05-\epsilon\leq\sum_{1\leq i\leq j\leq N}p_{ij}^{\iota}\phi_{ij}^{\mu}\leq 0.05\text{ for }\iota={1,...,\Upsilon}_{0}

Comments. Optimizer: Gurobi: tmax=11,t_{\max}=11, each square sijs_{ij} has lengths 0.01.0.01. Hence cardinality of |𝕋¯|=285150|\overline{\mathbb{T}}|=285150. For power calculations we use for 𝕄1\mathbb{M}_{1} a regular grid with μ2{0.2,0.4,,4},μ1{0.2,0.4,,μ2}\mu_{2}\in\{0.2,0.4,\cdots,4\},\mu_{1}\in\{0.2,0.4,\cdots,\mu_{2}\}. For size and near similarity restrictions we use μ0{0.0,0.1,,7.5}\mu_{0}\in\{0.0,0.1,\cdots,7.5\} and for near similarity ϵ=105.\epsilon=10^{-5}.

Appendix E gg-Function R Code

R Code

g <- function(t){

tabs=abs(t)

x <<- c(0., 0.1, 0.11, 0.13, 0.14, 0.15, 1.35, 1.36, 1.37, 1.44, 1.45, 2.05, 2.06, 2.07, 2.08, 2.09, 2.1)

y <<- c(0., 0.1, 0.106723, 0.106723, 0.106724, 0.106724, 1.30583, 1.31286, 1.3131, 1.3131, 1.3175, 1.9175, 1.9275, 1.9375, 1.9475, 1.9575, 1.95996)

ifelse(tabs>>=2.1,1.95996,approx(x,y,xout=tabs)$y)

}

References

  • Alan et al. (2018) Alan, S., S. Ertac, and I. Mumcu (2018). Gender stereotypes in the classroom and effects on achievement. Review of Economics and Statistics 100(5), 876–890.
  • Alwin and Hauser (1975) Alwin, D. F. and R. M. Hauser (1975). The decomposition of effects in path analysis. American sociological review 40, 37–47.
  • Andrews et al. (2006) Andrews, D. W. K., M. J. Moreira, and J. H. Stock (2006). Optimal two-sided invariant similar tests for instrumental variables regression. Econometrica 74(3), 715–752.
  • Andrews et al. (2008) Andrews, D. W. K., M. J. Moreira, and J. H. Stock (2008). Efficient two-sided nonsimilar invariant tests in iv regression with weak instruments. Journal of Econometrics 146(2), 241–254.
  • Andrews and Ploberger (1994) Andrews, D. W. K. and W. Ploberger (1994). Optimal tests when a nuisance parameter is present only under the alternative. Econometrica 62(6), 1383–1414.
  • Baron and Kenny (1986) Baron, R. M. and D. A. Kenny (1986). The moderator–mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations. Journal of personality and social psychology 51(6), 1173.
  • Berger (1989) Berger, R. L. (1989). Uniformly more powerful tests for hypotheses concerning linear inequalities and normal means. Journal of the American Statistical Association 84(405), 192–199.
  • Bezanson et al. (2017) Bezanson, J., A. Edelman, S. Karpinski, and V. B. Shah (2017). Julia: A fresh approach to numerical computing. SIAM review 59(1), 65–98.
  • Bollen and Stine (1990) Bollen, K. and R. Stine (1990). Direct and indirect effects: Classical and bootstrap estimates of variability. Sociological methodology 20, 115–140.
  • Bollen (1989) Bollen, K. A. (1989). Structural equations with latent variables. John Wiley & Sons.
  • Chiburis (2009) Chiburis, R. C. (2009). Approximately most powerful tests for moment inequalities. In Essays on Treatment Effects and Moment Inequalities, Chapter 3. Ph.D. thesis, Department of Economics, Princeton University.
  • Coletti et al. (2005) Coletti, A. L., K. L. Sedatole, and K. L. Towry (2005). The effect of control systems on trust and cooperation in collaborative environments. The Accounting Review 80(2), 477–500.
  • Craig (1936) Craig, C. C. (1936). On the frequency function of xy. The Annals of Mathematical Statistics 7(1), 1–15.
  • Drton (2009) Drton, M. (2009). Likelihood ratio tests and singularities. The Annals of Statistics 37(2), 979–1012.
  • Drton and Xiao (2016) Drton, M. and H. Xiao (2016). Wald tests of singular hypotheses. Bernoulli 22(1), 38–59.
  • Dufour et al. (2017) Dufour, J.-M., E. Renault, and V. Zinde-Walsh (2017). Wald tests when restrictions are locally singular. Technical report, arxiv.org/abs/1312.0569v1.
  • Elliott et al. (2015) Elliott, G., U. K. Müller, and M. W. Watson (2015). Nearly optimal tests when a nuisance parameter is present under the null hypothesis. Econometrica 83(2), 771–811.
  • Frewen et al. (2013) Frewen, P. A., V. D. Schmittmann, L. F. Bringmann, and D. Borsboom (2013). Perceived causal relations between anxiety, posttraumatic stress and depression: extension to moderation, mediation, and network analysis. European journal of psychotraumatology 4(1), 20656.
  • Glonek (1993) Glonek, G. F. V. (1993). On the behaviour of wald statistics for the disjunction of two regular hypotheses. Journal of the Royal Statistical Society: Series B (Methodological) 55(3), 749–755.
  • Guggenberger et al. (2019) Guggenberger, P., F. Kleibergen, and S. Mavroeidis (2019). A more powerful subvector anderson rubin test in linear instrumental variables regression. Quantitative Economics 10(2), 487–526.
  • Gurobi Optimization (2021) Gurobi Optimization, L. (2021). Gurobi optimizer reference manual.
  • Heckman and Pinto (2015a) Heckman, J. and R. Pinto (2015a). Causal analysis after haavelmo. Econometric Theory 31(1), 115–151.
  • Heckman and Pinto (2015b) Heckman, J. and R. Pinto (2015b). Econometric mediation analyses: Identifying the sources of treatment effects from experimentally estimated production technologies with unmeasured and mismeasured inputs. Econometric reviews 34(1-2), 6–31.
  • Hillier et al. (2021) Hillier, G. H., K. J. Van Garderen, and N. P. A. Van Giersbergen (2021). Improved tests for mediation. Mimeo.
  • Huber (2020) Huber, M. (2020). Mediation analysis. Handbook of Labor, Human Resources and Population Economics, 1–38.
  • Imai et al. (2010) Imai, K., L. Keele, and T. Yamamoto (2010). Identification, inference and sensitivity analysis for causal mediation effects. Statistical science 25(1), 51–71.
  • Imai et al. (2013) Imai, K., D. Tingley, and T. Yamamoto (2013). Experimental designs for identifying causal mechanisms. Journal of the Royal Statistical Society: Series A (Statistics in Society) 176(1), 5–51.
  • Imbens (2020) Imbens, G. W. (2020). Potential outcome and directed acyclic graph approaches to causality: Relevance for empirical practice in economics. Journal of Economic Literature 58(4), 1129–79.
  • Lehmann and Romano (2005) Lehmann, E. L. and J. P. Romano (2005). Testing statistical hypotheses (3 ed.). Springer Science & Business Media.
  • MacKenzie et al. (1986) MacKenzie, S. B., R. J. Lutz, and G. E. Belch (1986). The role of attitude toward the ad as a mediator of advertising effectiveness: A test of competing explanations. Journal of marketing research 23(2), 130–143.
  • MacKinnon et al. (2002) MacKinnon, D. P., C. M. Lockwood, J. M. Hoffman, S. G. West, and V. Sheets (2002). A comparison of methods to test mediation and other intervening variable effects. Psychological methods 7(1), 83.
  • MacKinnon et al. (2004) MacKinnon, D. P., C. M. Lockwood, and J. Williams (2004). Confidence limits for the indirect effect: Distribution of the product and resampling methods. Multivariate behavioral research 39(1), 99–128.
  • McDonald and Clelland (1984) McDonald, J. A. and D. A. Clelland (1984). Textile workers and union sentiment. Social Forces 63(2), 502–521.
  • Moreira and Mourão (2016) Moreira, M. J. and R. Mourão (2016). A critical value function approach, with an application to persistent time-series. arXiv preprint arXiv:1606.03496.
  • Perlman and Wu (1999) Perlman, M. D. and L. Wu (1999). The emperor’s new tests. Statistical Science 14(4), 355–369.
  • Preacher and Hayes (2008) Preacher, K. J. and A. F. Hayes (2008). Asymptotic and resampling strategies for assessing and comparing indirect effects in multiple mediator models. Behavior research methods 40(3), 879–891.
  • Sobel (1982) Sobel, M. E. (1982). Asymptotic confidence intervals for indirect effects in structural equation models. Sociological methodology 13, 290–312.
  • Van der Vaart (2000) Van der Vaart, A. W. (2000). Asymptotic statistics, Volume 3. Cambridge university press.
  • Van Garderen (1997) Van Garderen, K. J. (1997). Curved exponential models in econometrics. Econometric Theory 13(6), 771–790.
  • Van Garderen and Van Giersbergen (2021) Van Garderen, K. J. and N. P. A. Van Giersbergen (2021). Bootstrapping mediation tests. Mimeo.
  • Van Giersbergen (2014) Van Giersbergen, N. P. A. (2014). Inference about the indirect effect: a likelihood approach. Technical report, University of Amsterdam, UvA-Econometrics Discussion Papers 2014/10.
  • Vaughan and Venables (1972) Vaughan, R. J. and W. N. Venables (1972). Permanent expressions for order statistic densities. Journal of the Royal Statistical Society: Series B (Methodological) 34(2), 308–310.