Estimation and Inference of Average Treatment Effect in Percentage Points under Heterogeneity

Ying Zeng, School of Economics and Wang Yanan Institute for Studies in Economics, Xiamen University, China.

Abstract

In semi-log regression models with heterogeneous treatment effects, the average treatment effect (ATE) in log points and its exponential transformation minus one underestimate the ATE in percentage points. I propose new estimation and inference methods for the ATE in percentage points, with inference utilizing the Fenton-Wilkinson approximation. These methods are particularly relevant for staggered difference-in-differences designs, where treatment effects often vary across groups and periods. I prove the methods’ large-sample properties and demonstrate their finite-sample performance through simulations, revealing substantial discrepancies between conventional and proposed measures. Two empirical applications further underscore the practical importance of these methods.

Keywords: Treatment effect heterogeneity, Semi-log regression, Average treatment effect, Difference-in-differences, Percentage point

JEL codes: C01, C21

1 Introduction

Semi-log regression models, where the dependent variable is in the natural logarithm form, are widely used in empirical studies. In these models, the coefficient of a binary treatment is often interpreted as approximating the average treatment effect (ATE) in percentage change when its magnitude is small (Hansen, 2022; Chen and Roth, 2024). This interpretation is based on the following logic: Let $Y_{1}$ and $Y_{0}$ denote the potential outcomes under treatment and non-treatment respectively. The treatment effect in log points is defined as $\tau=\ln(Y_{1})-\ln(Y_{0})$ , while the percentage change is given by $\rho=\left(Y_{1}-Y_{0}\right)/Y_{0}=\operatorname{exp}(\tau)-1$ . When $|\tau|$ is small, $\operatorname{exp}(\tau)-1\approx\tau$ , justifying the interpretation of $\tau$ as an approximate percentage effect. Alternatively, as suggested by Halvorsen and Palmquist (1980), researchers can directly use the exact formula $\operatorname{exp}(\tau)-1$ as the percentage effect of the treatment, which is applicable regardless of the magnitude of $\tau$ .

This paper argues, however, that neither $\tau$ nor $\operatorname{exp}(\tau)-1$ can be interpreted as the ATE in percentage points when treatment effects are heterogeneous across groups. (Footnote 1: Following existing discussions, this paper focuses on group-specific treatment effect heterogeneity, leaving within-group heterogeneity for future research.) The intuition is straightforward: with group-specific treatment effect heterogeneity, the ATE in log points is a weighted sum of treatment effects in log points, expressed as $\bar{\tau}=\sum_{g}w_{g}\tau_{g}$ , where $w_{g}$ is the share of subgroup $g$ in the population of interest, and $\tau_{g}$ is the log point effect for subgroup $g$ . Interpreting $\bar{\tau}$ as an approximation of the average percentage effect $\bar{\rho}=\sum_{g}w_{g}\rho_{g}=\sum_{g}w_{g}\left(\operatorname{exp}(\tau_{g})-1\right)$ requires $\tau_{g}$ to be close to zero for all $g$ , a condition that may fail even if $\bar{\tau}=0$ . Moreover, $\operatorname{exp}(\bar{\tau})-1$ is also a biased measure of $\bar{\rho}$ , as $\operatorname{exp}(\bar{\tau})-1\leqslant\bar{\rho}$ by Jensen’s inequality, with equality if and only if $\rho_{g}$ is constant across $g$ . Consequently, $\operatorname{exp}(\bar{\tau})-1\neq\bar{\rho}$ under treatment effect heterogeneity, with disparities between them increasing as treatment effects vary more widely.

The literature has largely overlooked the bias introduced by using the ATE in log points or its natural exponential minus one as the ATE in percentage points. This oversight may lead to misleading interpretations of results in empirical studies with heterogeneous treatment effects and log-transformed outcomes. A prominent example is the staggered difference-in-differences design, where different groups start receiving treatment at different times. When treatment effects are heterogeneous across groups or across time since treatment, two-way fixed effects models adopting a constant effect specification yield biased estimators for aggregate and dynamic treatment effects (de Chaisemartin and D’Haultfœuille, 2020; Goodman-Bacon, 2021; Sun and Abraham, 2021). Many heterogeneity-robust estimators have been proposed (de Chaisemartin and D’Haultfœuille, 2020; Callaway and Sant’Anna, 2021; Sun and Abraham, 2021; Borusyak et al., 2024), which amount to the ATE in log points when the outcome is log-transformed. Therefore, these estimators and their exponential minus one should not be interpreted as the ATE in percentage points.

Estimating and conducting inference on the ATE in percentage points $\bar{\rho}$ presents challenges. An obvious estimator of $\bar{\rho}$ is $\sum_{g}\hat{w}_{g}\operatorname{exp}(\hat{\tau}_{g})-1$ , where $\hat{w}_{g}$ and $\hat{\tau}_{g}$ are consistent estimators for $w_{g}$ and $\tau_{g}$ respectively. While this estimator is consistent, it can exhibit large small-sample bias because $E(\operatorname{exp}(\hat{\tau}_{g}))\neq\operatorname{exp}(\tau_{g})$ (Kennedy, 1981). Furthermore, inference on $\bar{\rho}$ is challenging because the distribution of $\sum_{g}\hat{w}_{g}\operatorname{exp}(\hat{\tau}_{g})$ has an unknown form even if both $\hat{w}_{g}$ and $\hat{\tau}_{g}$ are asymptotically normal. Nevertheless, knowledge of $\bar{\rho}$ is crucial for policy-making, as changes are typically discussed in terms of levels or percentages rather than on a logarithmic scale.

This paper highlights the often-overlooked problem of interpreting the ATE in log points as the ATE in percentage points in the context of heterogeneous treatment effects. Building on this insight, the paper develops estimation and inference procedures for the ATE in percentage points under a general econometric framework. This framework not only encompasses semi-log regression and semi-log staggered difference-in-differences models but also has the potential to be extended to Poisson regression models. Central to these procedures is a consistent estimator that does not require the ATE in log points or individual treatment effects in log points to be small, thus broadening its applicability across various empirical settings. To facilitate inference, the paper develops an approximate method using the Fenton-Wilkinson approximation of the sum of log-normal distributions (Fenton, 1960; Abu-Dayya and Beaulieu, 1994). The validity of these proposed methods is rigorously proven for large samples, and their performance in finite samples is demonstrated through comprehensive Monte Carlo simulations. To illustrate these methods’ practical relevance and applicability, the paper presents two empirical applications: one employing a semi-log regression model to study the causal impact of an education reform on wages, and another utilizing a staggered difference-in-differences design to examine the effect of minimum wage raises on teen employment. Through these contributions, this paper provides researchers with robust tools to more accurately estimate and interpret treatment effects in percentage terms, particularly in the presence of heterogeneity.

This paper connects to three strands of literature. First, it contributes to a growing body of work on treatment effect heterogeneity, which has shown that traditional estimates assuming a constant treatment effect are weighted averages of treatment effects, with weights possibly negative and not equal to the share of the subpopulation. Examples include ordinary least squares (OLS) estimators (Angrist, 1998; Gibbons et al., 2019; Słoczyński, 2022; Goldsmith-Pinkham et al., 2024), two-stage least squares (2SLS) estimators (Imbens and Angrist, 1994; Mogstad et al., 2021), and two-way fixed effects estimators for staggered difference-in-differences models (Goodman-Bacon, 2021; Sun and Abraham, 2021; de Chaisemartin and D’Haultfœuille, 2020; Borusyak et al., 2024). Numerous heterogeneity-robust estimators have been developed, typically by estimating subpopulation treatment effects and assigning proper weights to each subpopulation (e.g., Gibbons et al., 2019; Callaway and Sant’Anna, 2021; Sun and Abraham, 2021; Goldsmith-Pinkham et al., 2024). The results presented in this paper imply that when the outcome is log-transformed in these models, neither the traditional estimator nor the heterogeneity-robust estimators provide proper approximations of the ATE in percentage points. The estimator and inference methods proposed here can be combined with heterogeneity-robust estimators to obtain consistent estimation and inference of the ATE in percentage points.

Second, this study extends existing research on interpreting binary variables in semi-log regressions from the case of constant treatment effects to heterogeneous treatment effects. Halvorsen and Palmquist (1980) pointed out that in a regression like $\ln(y_{i})=\alpha+\tau d_{i}+\epsilon_{i}$ , where $d_{i}$ is a dummy variable, the percentage effect of $d$ should be $\operatorname{exp}(\tau)-1$ rather than $\tau$ . Kennedy (1981) argued that directly replacing $\tau$ in $\operatorname{exp}(\tau)-1$ with the OLS estimate $\hat{\tau}$ leads to a biased estimator, as $E(\operatorname{exp}(\hat{\tau}))\neq\operatorname{exp}(\tau)$ due to the non-linearity of the exponential function. He suggested using $\operatorname{exp}(\hat{\tau}-0.5\hat{\sigma}_{\tau}^{2})-1$ to correct the bias, observing that $E(\operatorname{exp}(\hat{\tau}))=\operatorname{exp}(\tau+0.5\sigma_{\tau}^{2})$ when $\hat{\tau}\sim\mathcal{N}(\tau,\sigma_{\tau}^{2})$ . This estimator is still biased as it uses the estimated value $\hat{\sigma}_{\tau}^{2}$ rather than $\sigma_{\tau}^{2}$ . Under the assumption that $\epsilon_{i}$ are i.i.d. normal, Giles (1982) and van Garderen and Shah (2002) proposed exact unbiased estimators. However, simulation results show that the difference between the exact unbiased estimator and the Kennedy (1981) estimator is negligible.
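The lognormal-mean identity behind Kennedy's correction can be checked numerically. The following is a minimal sketch, not part of the paper's method: the values of `tau` and `sigma` are hypothetical, the variance is treated as known rather than estimated, and the helper names are illustrative. It integrates $\operatorname{exp}(t)$ against a normal density with a midpoint rule and compares the result with $\operatorname{exp}(\tau+0.5\sigma_{\tau}^{2})$ .

```python
import math


def normal_pdf(t, mean, sd):
    """Density of N(mean, sd^2) evaluated at t."""
    z = (t - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def expected_exp(mean, sd, n_steps=20000, width=10.0):
    """Midpoint-rule approximation of E[exp(T)] for T ~ N(mean, sd^2)."""
    lo = mean - width * sd
    step = 2.0 * width * sd / n_steps
    total = 0.0
    for k in range(n_steps):
        t = lo + (k + 0.5) * step
        total += math.exp(t) * normal_pdf(t, mean, sd)
    return total * step


# Hypothetical values, chosen only for illustration.
tau, sigma = 0.1, 0.2

# E[exp(tau_hat)] when tau_hat ~ N(tau, sigma^2).
naive_mean = expected_exp(tau, sigma)
# Closed form quoted in the text: exp(tau + 0.5 * sigma^2).
closed_form = math.exp(tau + 0.5 * sigma ** 2)
# E[exp(tau_hat - 0.5 * sigma^2)], i.e. the correction with known variance.
corrected_mean = expected_exp(tau - 0.5 * sigma ** 2, sigma)

print(f"E[exp(tau_hat)]               = {naive_mean:.6f}")
print(f"exp(tau + 0.5 sigma^2)        = {closed_form:.6f}")
print(f"E[exp(tau_hat - 0.5 sigma^2)] = {corrected_mean:.6f}")
print(f"exp(tau)                      = {math.exp(tau):.6f}")
```

With a known variance the corrected quantity recovers $\operatorname{exp}(\tau)$ ; the residual bias discussed in the text arises only because $\hat{\sigma}_{\tau}^{2}$ replaces $\sigma_{\tau}^{2}$ in practice.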

Third, this work supplements ongoing research into the potential pitfalls of using logarithmic transformations in empirical analyses. For instance, Mullahy and Norton (2024) and Chen and Roth (2024) demonstrated that when dependent variables have many zero values, coefficients in regressions with log-like transformations such as $\ln(y+1)$ do not bear the interpretation of percentage effects. Roth and Sant’Anna (2023) showed that the parallel trends assumption in difference-in-differences models is sensitive to the log-transformation of outcome variables. Manning (1998) and Silva and Tenreyro (2006) highlighted issues of using log-linear regressions for estimating impact on the outcome’s mean and elasticity under heteroscedasticity. The results in this paper further demonstrate that in the presence of treatment effect heterogeneity, coefficients in semi-log regressions may not have an ATE in percentage points interpretation.

The remainder of the paper is structured as follows: Section 2 introduces the definition of ATE in percentage points and discusses potential issues of using ATEs in log points as approximations. Section 3 outlines the estimation and inference methods. Section 4 presents two empirical applications, while Section 5 reports the results of Monte Carlo simulations. Section 6 concludes the paper. The appendix contains results tables from both the empirical applications and Monte Carlo simulations. An online appendix provides proofs for all results stated in the main text, explores the extension of the proposed methods to Poisson regression, and presents additional Monte Carlo simulation results.

2 ATE in Percentage Points

2.1 Setup and Definitions

Consider a population partitioned into $G$ mutually exclusive and collectively exhaustive subgroups, indexed by $g=1,\ldots,G$ . Let $w_{g}$ denote the population share of subgroup $g$ , where $w_{g}>0$ and $\sum_{g=1}^{G}w_{g}=1$ . Formal assumptions on $w_{g}$ are presented in Section 3. I assume treatment effects are constant within subgroups but may vary across subgroups.

Define $d_{i}^{(g)}=1$ if individual $i$ belongs to subgroup $g$ , and 0 otherwise. Let $y_{i,0}$ and $y_{i,1}$ represent the potential outcomes for individual $i$ if not treated and if treated respectively. I assume that $y_{i,0}>0$ and $y_{i,1}>0$ for all $i$ , so that their natural logarithms are well defined.

Remark 1.

When outcome variables contain many zero values, alternative frameworks may be more appropriate. Specifically, Poisson regression models, as proposed by Chen and Roth (2024) and Mullahy and Norton (2024), can be employed to estimate the ATE in levels as a percentage of the baseline mean. I provide a detailed exposition on adapting the estimation and inference methods to this scenario in the online appendix.

The individual treatment effect in log points is defined as $\tau_{i}=\ln\left(y_{i,1}\right)-\ln\left(y_{i,0}\right)$ . I assume that $\tau_{i}=\tau_{g}$ if $d_{i}^{(g)}=1$ , implying homogeneous treatment effects in log points within subgroups. As a result, $\tau_{i}=\sum_{g=1}^{G}\tau_{g}d_{i}^{(g)}=d_{i}^{\prime}\tau$ , where $d_{i}=(d_{i}^{(1)},\ldots,d_{i}^{(G)})^{\prime}$ and $\tau=(\tau_{1},\ldots,\tau_{G})^{\prime}$ . The average treatment effect (ATE) in log points is $\bar{\tau}=\sum_{g=1}^{G}w_{g}\tau_{g}=w^{\prime}\tau$ , where $w=(w_{1},\ldots,w_{G})^{\prime}$ .

The individual treatment effect in percentage points is $\rho_{i}=\left(y_{i,1}-y_{i,0}\right)/y_{i,0}=\operatorname{exp}(\tau_{i})-1$ . Consequently $\rho_{i}=\rho_{g}$ when $d_{i}^{(g)}=1$ , ensuring homogeneity of percentage point effects within subgroups. The ATE in percentage points is defined as

$$\bar{\rho}=\sum_{g=1}^{G}w_{g}\rho_{g}=\sum_{g=1}^{G}w_{g}\operatorname{exp}(\tau_{g})-1. \tag{1}$$

The objective of this paper is to develop estimation and inference methods for $\bar{\rho}$ using consistent and asymptotic normal estimators of $w_{g}$ and $\tau_{g}$ .
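To make the definitions concrete, the sketch below evaluates $\bar{\tau}$ , $\operatorname{exp}(\bar{\tau})-1$ , and $\bar{\rho}$ from equation (1) for known population values. The weights and effects are hypothetical and the function names are illustrative; this is a population-level calculation, not the estimator developed later in the paper.

```python
import math


def ate_log_points(w, tau):
    """tau_bar = sum_g w_g * tau_g."""
    return sum(wg * tg for wg, tg in zip(w, tau))


def ate_percentage_points(w, tau):
    """rho_bar = sum_g w_g * exp(tau_g) - 1, assuming the weights sum to one."""
    return sum(wg * math.exp(tg) for wg, tg in zip(w, tau)) - 1.0


# Hypothetical two-subgroup population.
w = [0.3, 0.7]
tau = [0.5, -0.1]

tau_bar = ate_log_points(w, tau)
exp_tau_bar_minus_one = math.expm1(tau_bar)
rho_bar = ate_percentage_points(w, tau)

print(f"ATE in log points:        {tau_bar:.4f}")
print(f"exp(ATE in log points)-1: {exp_tau_bar_minus_one:.4f}")
print(f"ATE in percentage points: {rho_bar:.4f}")
```

In this hypothetical population the three quantities differ noticeably, even though each subgroup effect is individually well defined.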

2.2 Log Points Approximations

Empirical studies often approximate average effects in percentage points using average effects in log points. When treatment effects are homogeneous, this approximation is reasonably accurate for modest log point effects. Formally, when $\tau_{g}=\bar{\tau}$ and hence $\rho_{g}=\bar{\rho}$ for all $g$ , the approximation error is given by $\bar{\rho}-\bar{\tau}=\operatorname{exp}(\bar{\tau})-1-\bar{\tau}$ . This error function is strictly convex in $\bar{\tau}$ and reaches its minimum value of 0 when $\bar{\tau}=0$ . When $|\bar{\tau}|\leqslant 0.05$ , the approximation error is $0\leqslant\bar{\rho}-\bar{\tau}\leqslant 0.0013$ , i.e., at most 0.13 percentage points. (Footnote 2: The function $f(x)=\operatorname{exp}(x)-1-x$ has $f^{\prime}(x)=\operatorname{exp}(x)-1$ and $f^{\prime\prime}(x)=\operatorname{exp}(x)>0$ . With $f^{\prime}(0)=0$ , $f$ attains its global minimum at 0, so $f(x)\geqslant f(0)=0$ . Also, note that $f(0.05)=0.00127>f(-0.05)=0.00123$ . Hence $f(x)\leqslant 0.00127$ when $|x|\leqslant 0.05$ .) This justifies interpreting $\bar{\tau}$ as a percentage effect when $|\bar{\tau}|$ is small and treatment effects are constant. Furthermore, when treatment effects are homogeneous, the expression $\operatorname{exp}(\bar{\tau})-1$ provides the exact value of $\bar{\rho}$ , regardless of the magnitude of $\bar{\tau}$ (Halvorsen and Palmquist, 1980).
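The bound on the approximation error under homogeneous effects can be verified on a grid. This is a small numerical check under the stated homogeneity assumption; the grid spacing and the function name are arbitrary choices for illustration.

```python
import math


def approximation_error(tau_bar):
    """rho_bar - tau_bar under homogeneous effects: exp(tau_bar) - 1 - tau_bar."""
    return math.expm1(tau_bar) - tau_bar


# Grid over |tau_bar| <= 0.05 in steps of 0.001.
grid = [k / 1000.0 for k in range(-50, 51)]
errors = [approximation_error(t) for t in grid]

min_error = min(errors)
max_error = max(errors)

print(f"smallest error on the grid: {min_error:.6f}")
print(f"largest error on the grid:  {max_error:.6f}")
print(f"error at +0.05: {approximation_error(0.05):.5f}")
print(f"error at -0.05: {approximation_error(-0.05):.5f}")
```

Because the error function is convex with its minimum at zero, the largest error on the interval occurs at an endpoint, here at $\bar{\tau}=0.05$ .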

However, when treatment effects are heterogeneous across subgroups, the average effect in log points or its transformation $\operatorname{exp}(\bar{\tau})-1$ can differ substantially from $\bar{\rho}$ . Two types of log-based approximations are relevant in this context: (i) The weighted average of treatment effects in log points with arbitrary weights: $\tilde{\tau}=\sum_{g=1}^{G}\tilde{w}_{g}\tau_{g}$ , where $\tilde{w}_{g}\neq w_{g}$ generally and $\sum_{g=1}^{G}\tilde{w}_{g}=1$ . When $\tilde{w}_{g}\geqslant 0$ for all $g$ , $\tilde{\tau}$ is called a convex average of $\tau_{g}$ . (ii) The ATE in log points: $\bar{\tau}=\sum_{g=1}^{G}w_{g}\tau_{g}$ , which is a special type of $\tilde{\tau}$ with $\tilde{w}=w$ and $\tilde{w}_{g}>0$ . Below I discuss the potential pitfalls of each of them.

2.2.1 Weighted Average of Log Point Effects

Interpreting estimates of $\tilde{\tau}$ , the weighted average of treatment effects in log points, as approximations of $\bar{\rho}$ is common in empirical research. This practice is particularly prevalent when treatment effects in log-transformed outcomes are heterogeneous, yet estimation assumes a constant effect. For example, consider a semi-log regression model $\ln(y_{i})=\alpha+\tau s_{i}+x_{i}^{\prime}\beta+\epsilon_{i},$ where $s_{i}$ is the treatment dummy and $x_{i}$ is a column vector of covariates. Under the assumption $E\left[\ln(y_{0,i})|x_{i},s_{i}\right]=E(\ln(y_{0,i})|x_{i})$ , where $y_{0,i}$ is the potential outcome if not treated, the OLS estimator of $\tau$ is a convex average of covariate-specific treatment effects in $\ln(y_{i})$ (Angrist, 1998). Another example is the two-way fixed effects (TWFE) estimator for staggered difference-in-differences designs, where different cohorts start to get treated at different times. Treatment effects can be heterogeneous across cohorts and across event time (time since treatment). Consider the following TWFE model:

$$\ln(y_{it})=\alpha_{i}+\gamma_{t}+\tau s_{it}+x_{it}^{\prime}\beta+\epsilon_{it}, \tag{2}$$

where $y_{it}$ is the outcome of individual $i$ at time $t$ , $s_{it}$ is the indicator that $i$ has already been treated in period $t$ , and $\alpha_{i}$ , $\gamma_{t}$ and $x_{it}$ are individual fixed effects, time fixed effects and control variables respectively. Under parallel trends and no anticipation assumptions, the OLS estimator for $\tau$ is a weighted average of treatment effects in $\ln(y_{it})$ for different cohort-event time combinations. Notably, some of the weights can potentially be negative (de Chaisemartin and D’Haultfœuille, 2020; Goodman-Bacon, 2021). When outcomes are log-transformed, two-stage least squares (2SLS) estimators are also weighted averages of log point effects (see, e.g., Imbens and Angrist 1994; Mogstad et al. 2021).

In these examples, interpreting the coefficient of the treatment dummy or its exponential transformation minus one as the ATE in percentage points implicitly approximates $\bar{\rho}$ with weighted averages of log points effects $\tilde{\tau}$ or $\operatorname{exp}(\tilde{\tau})-1$ . Such approximations can lead to misleading interpretations due to several key issues.

First, if $\tilde{\tau}$ is a non-convex average of $\tau_{g}$ , as can occur in some staggered difference-in-differences settings, then $\tilde{\tau}$ and $\operatorname{exp}(\tilde{\tau})-1$ may fall outside the convex hull of $\rho_{g}$ . For example, if $(\tilde{w}_{1},\tilde{w}_{2})=(-0.1,1.1)$ and $(\tau_{1},\tau_{2})=(0.1,0.05)$ , we have $(\rho_{1},\rho_{2})=(0.105,0.051)$ , hence $\tilde{\tau}=0.045$ and $\operatorname{exp}(\tilde{\tau})-1=0.046$ are both outside the convex hull of $\rho_{g}$ .

Second, if $\tilde{\tau}$ is a convex average of $\tau_{g}$ , $\tilde{\tau}$ may still lie outside the convex hull of $\rho_{g}$ , although $\operatorname{exp}(\tilde{\tau})-1$ is in the convex hull of $\rho_{g}$ . One example of the former case is when $(\tau_{1},\tau_{2})=(0.09,0.1)$ and $\tilde{w}=(0.9,0.1)$ , then $\tilde{\tau}=0.091$ is not in the convex hull of $(\rho_{1},\rho_{2})=(0.094,0.105)$ . To see the latter, note that with $0\leqslant\tilde{w}_{g}\leqslant 1$ , $\min_{g}(\tau_{g})\leqslant\tilde{\tau}\leqslant\max_{g}(\tau_{g})$ , hence $\min_{g}(\rho_{g})=\min_{g}(\operatorname{exp}(\tau_{g})-1)\leqslant\operatorname{exp}(\tilde{\tau})-1\leqslant\max_{g}(\operatorname{exp}(\tau_{g})-1)=\max_{g}(\rho_{g})$ .
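The two numerical examples above can be reproduced with a few lines of code. The sketch uses the weights and effects quoted in the text; the helper names are illustrative, and with two subgroups the convex hull of $\rho_{g}$ is simply the interval between the smaller and the larger value.

```python
import math


def log_point_summary(weights, tau):
    """Return tau_tilde, exp(tau_tilde) - 1, and the interval spanned by rho_g."""
    rho = [math.expm1(t) for t in tau]
    tau_tilde = sum(w * t for w, t in zip(weights, tau))
    return tau_tilde, math.expm1(tau_tilde), (min(rho), max(rho))


def in_hull(value, hull):
    """True if value lies in the closed interval [hull[0], hull[1]]."""
    return hull[0] <= value <= hull[1]


# Non-convex weights (first issue) and convex weights (second issue).
nonconvex = log_point_summary([-0.1, 1.1], [0.10, 0.05])
convex = log_point_summary([0.9, 0.1], [0.09, 0.10])

for label, (tau_tilde, transformed, hull) in [("non-convex", nonconvex), ("convex", convex)]:
    print(f"{label}: hull of rho_g = [{hull[0]:.3f}, {hull[1]:.3f}]")
    print(f"  tau_tilde = {tau_tilde:.3f}, inside hull: {in_hull(tau_tilde, hull)}")
    print(f"  exp(tau_tilde)-1 = {transformed:.3f}, inside hull: {in_hull(transformed, hull)}")
```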

Third, even if $\tilde{\tau}$ and $\operatorname{exp}(\tilde{\tau})-1$ are in the convex hull of $\rho_{g}$ , they are different from $\bar{\rho}$ . Take $\tilde{\tau}$ for example,

$$\tilde{\tau}-\bar{\rho}=\sum_{g=1}^{G}\tilde{w}_{g}\tau_{g}-\sum_{g=1}^{G}w_{g}\rho_{g}=\sum_{g=1}^{G}w_{g}(\tau_{g}-\rho_{g})+\sum_{g=1}^{G}(\tilde{w}_{g}-w_{g})\tau_{g},$$

which depends on the disparities between $w_{g}$ and $\tilde{w}_{g}$ , as well as between $\tau_{g}$ and $\rho_{g}=\operatorname{exp}(\tau_{g})-1$ . The difference approaches 0 if $\tau_{g}$ is close to 0 for all $g$ , but can be large even if $\tilde{\tau}$ is close to 0. One example is when $(\tilde{w}_{1},\tilde{w}_{2})=(0.2,0.8)$ , $(w_{1},w_{2})=(0.8,0.2)$ and $(\tau_{1},\tau_{2})=(0.08,-0.02)$ , then $\tilde{\tau}=\operatorname{exp}(\tilde{\tau})-1=0$ but $\bar{\rho}=0.063$ , and the difference is $6.3$ percentage points.
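The decomposition of $\tilde{\tau}-\bar{\rho}$ can be checked term by term on the example just given. The sketch below uses the values quoted in the text; the names `approx_term` and `weight_term` are labels introduced here for the two sums, not terminology from the paper.

```python
import math


def decompose(w_tilde, w, tau):
    """Split tau_tilde - rho_bar into a log-approximation term and a weighting term."""
    rho = [math.expm1(t) for t in tau]
    tau_tilde = sum(wt * t for wt, t in zip(w_tilde, tau))
    rho_bar = sum(wg * r for wg, r in zip(w, rho))
    # sum_g w_g (tau_g - rho_g): gap between log points and percentage points.
    approx_term = sum(wg * (t - r) for wg, t, r in zip(w, tau, rho))
    # sum_g (w_tilde_g - w_g) tau_g: gap caused by using the wrong weights.
    weight_term = sum((wt - wg) * t for wt, wg, t in zip(w_tilde, w, tau))
    return tau_tilde, rho_bar, approx_term, weight_term


tau_tilde, rho_bar, approx_term, weight_term = decompose(
    w_tilde=[0.2, 0.8], w=[0.8, 0.2], tau=[0.08, -0.02]
)

print(f"tau_tilde = {tau_tilde:.4f}, rho_bar = {rho_bar:.4f}")
print(f"approximation term = {approx_term:.4f}")
print(f"weighting term     = {weight_term:.4f}")
print(f"sum of terms       = {approx_term + weight_term:.4f}")
```

In this example almost all of the gap comes from the weighting term, which illustrates that mis-weighting can dominate even when each $\tau_{g}$ is small.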

2.2.2 ATE in Log Points

The ATE in log points, denoted as $\bar{\tau}$ , can be estimated by explicitly accounting for treatment effect heterogeneity. This involves estimating $\tau_{g}$ and $w_{g}$ for each subgroup, then computing their weighted average. For example, in staggered difference-in-differences designs, Callaway and Sant’Anna (2021) develop estimates for aggregate treatment effects that essentially represent a set of ATEs in log points when outcomes are log-transformed. Their method employs doubly robust estimation to estimate each $\tau_{g}$ (i.e., treatment effect in log points for a specific cohort-event time combination), and then aggregates them using the respective weights. Another example is the interaction-weighted estimator developed by Gibbons et al. (2019) and applied to staggered difference-in-differences settings by Sun and Abraham (2021). With log-transformed outcomes, these methods estimate $\tau_{g}$ using OLS and then compute the weighted average using weights based on sample sizes.

Given that $w_{g}>0$ , $\bar{\tau}$ is a convex average of $\tau_{g}$ . As discussed in the second property of $\tilde{\tau}$ , $\operatorname{exp}(\bar{\tau})-1$ lies within the convex hull of $\rho_{g}$ , but $\bar{\tau}$ itself may not. Moreover, both $\bar{\tau}$ and $\operatorname{exp}(\bar{\tau})-1$ may be poor approximations of $\bar{\rho}$ . The approximation bias of $\bar{\tau}$ is $\bar{\tau}-\bar{\rho}=\sum_{g=1}^{G}w_{g}(\tau_{g}-\rho_{g})=\sum_{g=1}^{G}w_{g}(\tau_{g}-\operatorname{exp}(\tau_{g})+1)$ . This bias approaches $0$ if $\tau_{g}$ is close to 0 for all $g$ . However, it can be substantial if individual $\tau_{g}$ values are large, even when $\bar{\tau}=0$ . For instance, with $w_{1}=w_{2}=0.5$ and $(\tau_{1},\tau_{2})=(-0.2,0.2)$ , we obtain $\bar{\tau}=0$ but $\bar{\rho}=0.02$ . The approximation $\operatorname{exp}(\bar{\tau})-1$ yields a smaller error, as $\bar{\rho}\geqslant\operatorname{exp}(\bar{\tau})-1\geqslant\bar{\tau}$ . The first inequality follows from Jensen’s inequality: $\sum_{g=1}^{G}w_{g}\operatorname{exp}(\tau_{g})\geqslant\operatorname{exp}(\sum_{g=1}^{G}w_{g}\tau_{g})$ , with equality only when treatment effects are homogeneous ( $\tau_{g}=\tau$ for all $g$ ). The larger the heterogeneity in $\tau_{g}$ , the greater the bias of $\operatorname{exp}(\bar{\tau})-1$ . The second inequality stems from $\operatorname{exp}(x)-x-1\geqslant 0$ for any $x$ , with equality only when $\bar{\tau}=0$ (see footnote 2). Thus, $\operatorname{exp}(\bar{\tau})-1$ provides a better approximation for $\bar{\rho}$ as it always lies within the convex hull of $\rho_{g}$ and yields a smaller bias. However, $\operatorname{exp}(\bar{\tau})-1$ lacks a clear economic interpretation. It can be viewed as an alternative to the geometric mean of the percentage effect, as $\operatorname{exp}(\bar{\tau})=\operatorname{exp}\left(\sum_{g=1}^{G}w_{g}\ln(\rho_{g}+1)\right)$ represents the weighted geometric mean of $\rho_{g}+1$ , which is the ratio of the potential outcomes $y_{i,1}/y_{i,0}$ for subgroup $g$ .
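The ordering $\bar{\rho}\geqslant\operatorname{exp}(\bar{\tau})-1\geqslant\bar{\tau}$ and the geometric-mean reading of $\operatorname{exp}(\bar{\tau})$ can be illustrated numerically. The sketch below widens a symmetric spread $(-s,s)$ around zero with equal weights, which includes the $(-0.2,0.2)$ example from the text; the other spreads and the second population are hypothetical values chosen for illustration.

```python
import math


def three_measures(w, tau):
    """Return (tau_bar, exp(tau_bar) - 1, rho_bar) for known w and tau."""
    tau_bar = sum(wg * tg for wg, tg in zip(w, tau))
    rho_bar = sum(wg * math.expm1(tg) for wg, tg in zip(w, tau))
    return tau_bar, math.expm1(tau_bar), rho_bar


# Equal weights, effects (-s, s): tau_bar stays at zero while rho_bar grows with s.
rows = []
for spread in [0.0, 0.1, 0.2, 0.4]:
    tau_bar, geo, rho_bar = three_measures([0.5, 0.5], [-spread, spread])
    rows.append((spread, tau_bar, geo, rho_bar))
    print(f"s={spread:.1f}: tau_bar={tau_bar:.4f}, exp(tau_bar)-1={geo:.4f}, rho_bar={rho_bar:.4f}")

# Hypothetical population with a nonzero mean effect.
w2, tau2 = [0.25, 0.75], [0.3, -0.05]
tau_bar2, geo2, rho_bar2 = three_measures(w2, tau2)
# Weighted geometric mean of (1 + rho_g), which should equal exp(tau_bar).
geometric_mean = math.prod((1.0 + math.expm1(t)) ** wg for wg, t in zip(w2, tau2))
print(f"exp(tau_bar) = {math.exp(tau_bar2):.6f}, geometric mean of 1+rho_g = {geometric_mean:.6f}")
```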

While $\bar{\tau}$ or $\operatorname{exp}(\bar{\tau})-1$ improve upon $\tilde{\tau}$ or $\operatorname{exp}(\tilde{\tau})-1$ , they still fail to fully capture the ATE in percentage points. These limitations underscore the importance of developing consistent and, under certain circumstances, unbiased estimation and inference methods for $\bar{\rho}$ , which will be addressed in the subsequent section.

3 Estimation and Inference

This study proposes a general econometric framework for the estimation and inference of $\bar{\rho}$ , building upon consistent and asymptotically normal estimators for $w=(w_{1},\ldots,w_{G})^{\prime}$ and $\tau=(\tau_{1},\ldots,\tau_{G})^{\prime}$ . (Footnote 3: The methodology is implemented in an accompanying R package, forthcoming for public use.) The proposed methodology imposes no restrictions on the functional form or application context of these estimators, provided they meet the assumptions in Section 3.1. The framework is motivated by semi-log regression models with multiple treatment indicators, which naturally extend to semi-log difference-in-differences models with staggered treatment adoption. However, the framework’s design allows extension to diverse empirical settings beyond its primary motivation.

3.1 Estimators of $w$ and $\tau$

I impose the following assumptions on the parameters $w$ and $\tau$ .

Assumption 1.

There exist constants $c_{w}$ and $C_{w}$ such that $0<c_{w}\leqslant w_{g}\leqslant C_{w}\leqslant 1$ for all $g$ , and $\sum_{g=1}^{G}w_{g}=1$ .

Without loss of generality, I restrict $w_{g}$ to be strictly positive for all $g$ , excluding groups with zero share from the analysis. I allow $w_{g}=1$ to incorporate the case of $G=1$ , representing homogeneous treatment effects.

Assumption 2.

For $g=1,\ldots,G$ , $-C_{\tau}\leqslant\tau_{g}\leqslant C_{\tau}$ for some constant $0\leqslant C_{\tau}<\infty$ .

Assumption 2 bounds the value of $\tau_{g}$ . It allows $C_{\tau}=0$ , which implies $\tau_{g}=0$ and consequently $\rho_{g}=0$ for all $g$ . Given that $\rho_{g}=\operatorname{exp}(\tau_{g})-1$ , the assumption leads to $\operatorname{exp}(-C_{\tau})-1\leqslant\rho_{g}\leqslant\operatorname{exp}(C_{\tau})-1$ . Combining Assumptions 1 and 2 yields $-C_{\tau}\leqslant\bar{\tau}\leqslant C_{\tau}$ and $\operatorname{exp}(-C_{\tau})-1\leqslant\bar{\rho}\leqslant\operatorname{exp}(C_{\tau})-1$ , where $\bar{\tau}$ and $\bar{\rho}$ are ATEs in log points and percentage points respectively.

Let $\hat{w}=(\hat{w}_{1},\ldots,\hat{w}_{G})^{\prime}$ and $\hat{\tau}=(\hat{\tau}_{1},\ldots,\hat{\tau}_{G})^{\prime}$ denote the estimators of $w$ and $\tau$ respectively. I assume that $\hat{w}$ and $\hat{\tau}$ are consistent and asymptotically jointly normal.

Assumption 3.

(a) There exist $G\times G$ constant matrices $\bar{\Sigma}_{w}$ and $\bar{\Sigma}_{\tau}$ , such that $\bar{\Sigma}_{w}$ is positive semi-definite, $\bar{\Sigma}_{\tau}$ is positive definite, and as $N\rightarrow\infty$ ,

$$\sqrt{N}\begin{pmatrix}\hat{w}-w\\ \hat{\tau}-\tau\end{pmatrix}\xrightarrow{d}\mathcal{N}\left[0_{2G\times 1},\begin{pmatrix}\bar{\Sigma}_{w}&0\\ 0&\bar{\Sigma}_{\tau}\end{pmatrix}\right].$$

(b) Let $\hat{\bar{\Sigma}}_{w}$ and $\hat{\bar{\Sigma}}_{\tau}$ be the estimators of $\bar{\Sigma}_{w}$ and $\bar{\Sigma}_{\tau}$ respectively. As $N\rightarrow\infty$ , $\hat{\bar{\Sigma}}_{w}-\bar{\Sigma}_{w}\xrightarrow{p}0$ , and $\hat{\bar{\Sigma}}_{\tau}-\bar{\Sigma}_{\tau}\xrightarrow{p}0$ .

Assumption 3 is a standard property for estimators, except for the restriction of asymptotic joint normality. The asymptotic joint normality can hold trivially if $\hat{w}$ and $\hat{\tau}$ are independent, e.g., when $\hat{w}$ is estimated with sampling factors, and $\hat{\tau}$ is estimated with another set of variables independent of sampling. It can also hold when $d_{i}$ is a vector of dummies for sub-treatment groups and $\hat{\tau}$ is an OLS estimator for the treatment effects in log points, as will be illustrated in Example 1.

I define $\Sigma_{w}=\bar{\Sigma}_{w}/N$ and $\Sigma_{\tau}=\bar{\Sigma}_{\tau}/N$ , which are the asymptotic variances of $\hat{w}$ and $\hat{\tau}$ respectively. When $N$ is large enough, we have, approximately (Footnote 4: The statement is heuristic; see Wooldridge (2010), pp. 40-42, for related discussions),

$$\begin{pmatrix}\hat{w}-w\\ \hat{\tau}-\tau\end{pmatrix}\sim\mathcal{N}\left[0_{2G\times 1},\begin{pmatrix}\Sigma_{w}&0\\ 0&\Sigma_{\tau}\end{pmatrix}\right].$$

Denote the $g$ -th diagonal elements of $\Sigma_{\tau}=\bar{\Sigma}_{\tau}/N$ and $\hat{\Sigma}_{\tau}=\hat{\bar{\Sigma}}_{\tau}/N$ as $\sigma_{\tau,g}^{2}$ and $\hat{\sigma}_{\tau,g}^{2}$ respectively. Then $\sigma_{\tau,g}^{2}$ is the asymptotic variance of $\hat{\tau}_{g}$ and $\hat{\sigma}_{\tau,g}^{2}$ is its estimator, with $N(\hat{\sigma}_{\tau,g}^{2}-\sigma_{\tau,g}^{2})\xrightarrow{p}0$ under Assumption 3. Let $\hat{\Sigma}_{w}=\hat{\bar{\Sigma}}_{w}/N$ , and define $\sigma_{w,g}^{2}$ and $\hat{\sigma}_{w,g}^{2}$ in a similar manner; then $N(\hat{\sigma}_{w,g}^{2}-\sigma_{w,g}^{2})\xrightarrow{p}0$ .

To construct the exact unbiased estimator, I further assume:

Assumption 4.

(a) For $g=1,\ldots,G$ , $E\left(\hat{w}_{g}\right)=w_{g}$ .

(b) The estimator $\hat{w}_{g}$ is a function of some set of variables $Z$ , and we have $\hat{\tau}_{g}-\tau_{g}|Z\sim\mathcal{N}(0,\sigma_{\tau,g}^{2}(Z))$ , $\left(m\hat{\sigma}_{\tau,g}^{2}/\sigma_{\tau,g}^{2}(Z)\right)|Z\sim\chi^{2}(m)$ for some positive integer $m$ , and $\hat{\tau}_{g}$ and $\hat{\sigma}_{\tau,g}^{2}$ are independent conditional on $Z$ .

Assumption 4 can hold when $\hat{\tau}$ is the OLS estimator in a semi-log regression model with i.i.d. normal errors and independent sampling.

I use two examples to highlight how the general assumptions are met in practice and explain their formulation. The first example is the OLS estimator in a semi-log regression model with multiple treatments under random sampling. The second example, which can be viewed as a special case of the first, is a semi-log difference-in-differences model with staggered adoption.

3.1.1 Example 1: Semi-log Regression Model

This example demonstrates how $\hat{w}$ and $\hat{\tau}$ satisfying Assumption 3 can be obtained as OLS estimators, applicable in research designs such as (conditional) randomized controlled trials.

The objective is to estimate the average treatment effect on the treated (ATT) in percentage points. The population of interest is the entire treatment group, comprising $G$ sub-treatment groups, with homogeneous treatment effects within each sub-treatment group but potential heterogeneity across groups. In this setting, $d_{i}^{(g)}=1$ if $i$ is in sub-treatment group $g$ and $0$ otherwise. The indicator of treatment is $s_{i}=\sum_{g=1}^{G}d_{i}^{(g)}=\mathbf{1}_{G}^{\prime}d_{i}$ , where $\mathbf{1}_{G}$ is a $G\times 1$ vector of ones.

I obtain $\hat{\tau}$ as the OLS estimator from a semi-log regression model. Note that $s_{i}d_{i}=d_{i}$ as $d_{i}^{(g)}$ are mutually exclusive dummies that sum up to $s_{i}$ . Also recall that $\tau_{i}\equiv\ln(y_{1,i})-\ln(y_{0,i})=d_{i}^{\prime}\tau$ . Thus the observed log-transformed outcome is

$$\ln(y_{i})=\ln(y_{0,i})(1-s_{i})+\ln(y_{1,i})s_{i}=\ln(y_{0,i})+s_{i}\tau_{i}=\ln(y_{0,i})+d_{i}^{\prime}\tau.$$

Let $x_{i}$ denote a $k_{x}\times 1$ vector of observed covariates including a constant term. Assuming that $E\left(\ln(y_{0,i})\left|x_{i},d_{i}\right.\right)=x_{i}^{\prime}\beta$ , a semi-log linear regression model that allows for treatment effect heterogeneity is specified as

$$\ln(y_{i})=x_{i}^{\prime}\beta+d_{i}^{\prime}\tau+\epsilon_{i}, \tag{3}$$

where $\epsilon_{i}=\ln(y_{0,i})-E\left[\ln\left(y_{0,i}\right)|x_{i},d_{i}\right]$ and by construction $E(\epsilon_{i}|x_{i},d_{i})=0$ . (Footnote 5: The specification can also be justified by other assumptions, and alternative estimation approaches are available; see Goldsmith-Pinkham et al. (2024) for a comprehensive discussion of the estimation of heterogeneous treatment effects in linear regressions.)

Remark 2.

The specification in equation (3) imposes no restriction on the heterogeneity of $\ln(y_{0,i})$ and permits various group structures. The control group can be homogeneous or partitioned into $G$ sub-groups corresponding to treatment sub-groups. In the latter case, $x_{i}$ includes sub-group dummies, and $d_{i}^{(g)}$ represents sub-group-treatment interactions. For instance, consider a sample comprising $G$ cities, each containing both control and treatment units. Here, $x_{i}$ would include city dummies, and $d_{i}$ would be city-treatment interactions, capturing city-specific effects. In this specific scenario, the approach is analogous to the interacted weighted estimator in Gibbons et al. (2019), differing in the log-transformed outcome.

The matrix form of model (3) is $lnY=X\beta+D\tau+\epsilon,$ where $lnY=\left(\ln(y_{1}),\ldots,\ln(y_{N})\right)^{\prime}$ , $X=\left(x_{1},\ldots,x_{N}\right)^{\prime}$ , $D=(d_{1},\ldots,d_{N})^{\prime}$ , and $\epsilon=\left(\epsilon_{1},\ldots,\epsilon_{N}\right)^{\prime}$ . Let $\tilde{x}_{i}=\left(x_{i}^{\prime},d_{i}^{\prime}\right)^{\prime}$ , and $\tilde{X}=(\tilde{x}_{1},\ldots\tilde{x}_{N})^{\prime}=[X,D]$ . The OLS estimator of $(\beta^{\prime},\tau^{\prime})^{\prime}$ is given by $(\hat{\beta}^{\prime},\hat{\tau}^{\prime})^{\prime}=(\tilde{X}^{\prime}\tilde{X})^{-1}\tilde{X}^{\prime}lnY$ . The heteroscedasticity robust estimator for the asymptotic variance of $\sqrt{N}(\hat{\beta}^{\prime},\hat{\tau}^{\prime})^{\prime}$ is

\hat{\bar{\Sigma}}_{(\beta,\tau)}=N(\tilde{X}^{\prime}\tilde{X})^{-1}\left(\sum_{i=1}^{N}\tilde{x}_{i}\tilde{x}_{i}^{\prime}\hat{\epsilon}_{i}^{2}\right)(\tilde{X}^{\prime}\tilde{X})^{-1},

(4)

where $\hat{\epsilon}_{i}$ is the OLS residual.

I next discuss the estimation of $w$ . The population share of sub-treatment group $g$ in the whole treatment group is $w_{g}=E(d_{i}^{(g)}|s_{i}=1)=P(d_{i}^{(g)}=1|s_{i}=1)$ . Suppose we have a sample of $N$ individuals, with $N_{S}=\sum_{i=1}^{N}s_{i}$ in the treatment group. Define $N_{g}=\sum_{i=1}^{N}d_{i}^{(g)}$ as the size of sub-treatment group $g$ so that $N_{S}=\sum_{g=1}^{G}N_{g}$ .

The estimator of $w_{g}$ is $\hat{w}_{g}=N_{g}/N_{S}$ and of $w$ is $\hat{w}=(N_{1}/N_{S},\ldots,N_{G}/N_{S})^{\prime}$ . Denote the diagonal matrix with elements of vector $\alpha$ on the main diagonal as $\operatorname{diag}(\alpha)$ . Let

\hat{\bar{\Sigma}}_{w}=N/N_{S}\left(\operatorname{diag}(\hat{w})-\hat{w}\hat{w}^{\prime}\right)

(5)

be the estimator for the asymptotic variance of $\sqrt{N}\left(\hat{w}-w\right)$ . As shown in the proof of Lemma 1 in the online appendix, $\hat{w}_{g}$ is the OLS estimator for $w_{g}$ in the regression $d_{i}^{(g)}=w_{g}s_{i}+v_{i}$ and hence a function of $D=(d_{1},\ldots,d_{N})^{\prime}$ , and $\hat{\bar{\Sigma}}_{w}/N$ is the heteroscedasticity robust covariance matrix of $\hat{w}$ .
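As a minimal sketch of the weight estimator and of equation (5), the following computes $\hat{w}$ and $\hat{\bar{\Sigma}}_{w}$ from sub-treatment group sizes. The function name `weight_estimates` and the counts in the usage lines are hypothetical and chosen only for illustration.

```python
def weight_estimates(group_sizes, n_total):
    """Return (w_hat, sigma_w_bar_hat) from group sizes N_g and sample size N."""
    n_treated = sum(group_sizes)                       # N_S = sum_g N_g
    w_hat = [n_g / n_treated for n_g in group_sizes]   # w_hat_g = N_g / N_S
    scale = n_total / n_treated                        # N / N_S, as in equation (5)
    G = len(w_hat)
    # Equation (5): (N / N_S) * (diag(w_hat) - w_hat w_hat')
    sigma_w = [[scale * ((w_hat[g] if g == h else 0.0) - w_hat[g] * w_hat[h])
                for h in range(G)] for g in range(G)]
    return w_hat, sigma_w

# Hypothetical sample: N = 1000 units, of which 100 + 300 are treated.
w_hat, sigma_w = weight_estimates([100, 300], 1000)
```

Each row of the resulting matrix sums to zero because the weights sum to one, which is why $\bar{\Sigma}_{w}$ is only positive semi-definite.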

Lemma 1.

Assume that: (i) The data $\left\{\left(y_{i},x_{i},d_{i}\right),i=1,\ldots,N\right\}$ is an i.i.d. sample drawn from the population. (ii) Treatment probability is $E(s_{i})=p_{S}$ , where $0<p_{S}<1$ . (iii) $E\left(\tilde{x}_{i}\tilde{x}_{i}^{\prime}\right)$ is finite and nonsingular. (iv) $E(\epsilon_{i}|\tilde{x}_{i})=0$ and $E(\epsilon_{i}^{2}\tilde{x}_{i}\tilde{x}_{i}^{\prime})=\Sigma_{x}$ , where $\Sigma_{x}$ is finite and positive definite. Then Assumption 3 holds, specifically: (a)

\sqrt{N}\left(\begin{array}[]{c}\hat{\tau}-\tau\\ \hat{w}-w\end{array}\right)\xrightarrow{d}\mathcal{N}\left[0_{2G\times 1},\left(\begin{array}[]{cc}\bar{\Sigma}_{\tau}&0\\ 0&\bar{\Sigma}_{w}\end{array}\right)\right],

where $\bar{\Sigma}_{w}=p_{S}^{-1}\left(\operatorname{diag}(w)-ww^{\prime}\right)$ is positive semi-definite, $\bar{\Sigma}_{\tau}$ is the lower-right $G\times G$ sub-matrix of $\bar{\Sigma}_{\left(\beta,\tau\right)}=\left[E\left(\tilde{x}_{i}\tilde{x}_{i}^{\prime}\right)\right]^{-1}\Sigma_{x}\left[E\left(\tilde{x}_{i}\tilde{x}_{i}^{\prime}\right)\right]^{-1}$ and is positive definite.

(b) Let $\hat{\bar{\Sigma}}_{\tau}$ be the lower-right $G\times G$ sub-matrix of $\hat{\bar{\Sigma}}_{(\beta,\tau)}$ in (4), and $\hat{\bar{\Sigma}}_{w}$ be defined in (5), then $\hat{\bar{\Sigma}}_{w}-\bar{\Sigma}_{w}\xrightarrow{p}0$ and $\hat{\bar{\Sigma}}_{\tau}-\bar{\Sigma}_{\tau}\xrightarrow{p}0$ as $N\rightarrow\infty$ .

Remark 3.

In the special case when $G=1$ , $\hat{w}=w=1$ and $\hat{\bar{\Sigma}}_{w}=\bar{\Sigma}_{w}=0$ , which satisfy Assumption 3 trivially.

Under more restrictive conditions, Assumption 4 holds.

Lemma 2.

Suppose all conditions in Lemma 1 hold. Furthermore, $\epsilon_{i}$ in regression (3) are i.i.d. $\mathcal{N}(0,\sigma_{\epsilon}^{2})$ conditional on $\tilde{X}$ , and $\hat{\sigma}_{\tau,g}^{2}=z_{g}\hat{\sigma}_{\epsilon}^{2}$ , where $z_{g}$ is the $(k_{x}+g)$ -th diagonal element of $(\tilde{X}^{\prime}\tilde{X})^{-1}$ , $\hat{\sigma}_{\epsilon}^{2}=\hat{\epsilon}^{\prime}\hat{\epsilon}/(N-k_{x}-G)$ and $\hat{\epsilon}$ is the vector of OLS residuals. Then Assumption 4 holds. Specifically: (i) $\hat{w}$ is a function of $\tilde{X}$ . (ii) $\left(\hat{\tau}_{g}-\tau_{g}\right)|\tilde{X}\sim\mathcal{N}\left(0,\sigma_{\tau,g}^{2}(\tilde{X})\right)$ , where $\sigma_{\tau,g}^{2}(\tilde{X})=z_{g}\sigma_{\epsilon}^{2}$ . (iii) Conditional on $\tilde{X}$ , $m\hat{\sigma}_{\tau,g}^{2}/\sigma_{\tau,g}^{2}(\tilde{X})=m\hat{\sigma}_{\epsilon}^{2}/\sigma_{\epsilon}^{2}\sim\chi^{2}(m)$ where $m=N-k_{x}-G$ . (iv) $\hat{\tau}_{g}$ is independent of $\hat{\sigma}_{\tau,g}^{2}$ conditional on $\tilde{X}$ .

Proof.

Note that $\hat{w}$ is a function of $D$ and hence of $\tilde{X}=[X,D]$ . The lemma thus follows from the theory of the classical normal linear regression model. ∎

3.1.2 Example 2: Staggered Difference-in-differences Design

This example considers a staggered difference-in-differences design where multiple groups start receiving treatment at different times. I focus on the case with a never-treated group and no treatment exit and use panel data for illustration.

Let $c(i)$ denote the period when individual $i$ first receives treatment, with $c(i)=\infty$ for never-treated individuals. Cohort $c$ is defined as the group of individuals with $c(i)=c$ , and the event time $r_{it}=t-c(i)$ is time relative to treatment. Each $c,r$ combination is treated as a sub-group, with $\tau(c,r)$ denoting the treatment effect on the log-transformed outcome for cohort $c$ at event time $r$ . Recent literature has demonstrated that if treatment effects are heterogeneous across $c$ and $r$ , i.e., $\tau(c,r)\neq\tau(c^{\prime},r^{\prime})$ if $c\neq c^{\prime}$ or $r\neq r^{\prime}$ , traditional two-way fixed effects (TWFE) estimators can produce biased results (Goodman-Bacon, 2021; Sun and Abraham, 2021; de Chaisemartin and D’Haultfœuille, 2020; Borusyak et al., 2024). In response, researchers have proposed heterogeneity robust estimators (de Chaisemartin and D’Haultfœuille, 2020; Sun and Abraham, 2021; Callaway and Sant’Anna, 2021; Borusyak et al., 2024, etc.). With log-transformed outcomes, these estimators typically involve estimating $\tau(c,r)$ and the corresponding weights $w(c,r)$ , then computing the ATE in log points $\bar{\tau}=\sum_{c}\sum_{r}w(c,r)\tau(c,r)$ . As discussed in Section 2.2, $\bar{\tau}$ and $\operatorname{exp}(\bar{\tau})-1$ differ from the ATE in percentage points, $\bar{\rho}$ . However, the estimators of $w$ and $\tau$ in these studies usually satisfy Assumption 3,⁶ and therefore form the basis of the estimation and inference methods of $\bar{\rho}$ in this paper.

⁶ See, e.g., Theorems 2 and 3 in Callaway and Sant’Anna (2021) and Propositions 5 and 6 in Sun and Abraham (2021).

Below I discuss the estimators of $\tau(c,r)$ and $w(c,r)$ in Sun and Abraham (2021), which are adaptations of the OLS estimators in Example 1 to staggered difference-in-differences settings, if we view each $(c,r)$ combination as a sub-treatment group. For the estimation of $\tau(c,r)$ , consider the model

\ln(y_{it})=\alpha_{i}+\beta_{t}+x_{it}^{\prime}\gamma+\sum_{c\neq\infty}\sum_{r\neq-1}d_{it}(c,r)\tau(c,r)+\epsilon_{it},

(6)

where $y_{it}$ , $\alpha_{i}$ , $\beta_{t}$ and $x_{it}$ are defined as in equation (2), and $d_{it}(c,r)=1$ if individual $i$ belongs to cohort $c$ and $r_{it}=r$ . The never-treated group ( $c=\infty$ ) serves as the control group, and the period immediately preceding treatment ( $r=-1$ ) serves as the base period. Under assumptions of conditional parallel trends and no anticipation, OLS estimation of equation (6) yields an estimator of $\tau(c,r)$ that satisfies Assumption 3, see Propositions 5 and 6 in Sun and Abraham (2021).

The estimation of $w(c,r)$ depends on the parameter of interest. Define $\mathcal{A}\subset\{(c,r):c\neq\infty,r\neq-1\}$ as a subset of the sub-treatment groups ( $c,r$ combinations) for which to calculate the ATT in percentage points. Following Callaway and Sant’Anna (2021), I define $\mathcal{A}=\{(c,r):c\neq\infty,r=r^{\ast}\}$ for ATT of event time $r^{\ast}$ , $\mathcal{A}=\{(c,r):c=c^{\ast},r\geqslant 0\}$ for ATT of cohort $c^{\ast}$ , $\mathcal{A}=\{(c,r):c+r=t,r\geqslant 0\}$ for ATT of calendar time $t$ , and $\mathcal{A}=\{(c,r):c\neq\infty,r\geqslant 0\}$ for ATT of all treated units. Let $w^{\mathcal{A}}(c,r)$ be the share of sub-treatment group $(c,r)$ in $\mathcal{A}$ , with $w^{\mathcal{A}}(c,r)=0$ if $(c,r)\notin\mathcal{A}$ . Then $w^{\mathcal{A}}(c^{\ast},r^{\ast})=P\left[(c,r)=(c^{\ast},r^{\ast})|(c,r)\in\mathcal{A}\right]$ .

Treating each $(c,r)$ combination in $\mathcal{A}$ as a sub-treatment group, I estimate $w^{\mathcal{A}}(c,r)$ as in Example 1:

\hat{w}^{\mathcal{A}}(c,r)=\sum_{i}\sum_{t}d_{it}(c,r)/\left(\sum_{(c^{\ast},r^{\ast})\in\mathcal{A}}\sum_{i}\sum_{t}d_{it}(c^{\ast},r^{\ast})\right),

(7)

where the numerator is the sample size of cohort $c$ at event time $r$ , and the denominator is the sample size of all units in set $\mathcal{A}$ . Equivalently, $\hat{w}^{\mathcal{A}}(c,r)$ is the OLS estimator in the regression $d_{it}(c,r)=w^{\mathcal{A}}(c,r)s_{it}^{\mathcal{A}}+v_{it},$ where $s_{it}^{\mathcal{A}}=\sum_{(c,r)\in\mathcal{A}}d_{it}(c,r)$ is the indicator that individual $i$ in period $t$ belongs to set $\mathcal{A}$ . Let $\hat{w}^{\mathcal{A}}$ be the vector of $\hat{w}^{\mathcal{A}}(c,r)$ of all $(c,r)$ combinations in $\mathcal{A}$ . The asymptotic covariance matrix of $\sqrt{N}\left(\hat{w}^{\mathcal{A}}-w^{\mathcal{A}}\right)$ is estimated as $\hat{\bar{\Sigma}}_{w}^{\mathcal{A}}=N/N_{\mathcal{A}}\left(\operatorname{diag}(\hat{w}^{\mathcal{A}})-\hat{w}^{\mathcal{A}}\hat{w}^{\mathcal{A}\prime}\right)$ , where $N_{\mathcal{A}}$ is the total sample size of all units in $\mathcal{A}$ . The estimators $\hat{w}^{\mathcal{A}}$ and $\hat{\bar{\Sigma}}_{w}^{\mathcal{A}}$ satisfy Assumption 3 by Lemma 1 and are identical to those in Sun and Abraham (2021).
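To make the role of the set $\mathcal{A}$ in equation (7) concrete, the sketch below computes $\hat{w}^{\mathcal{A}}(c,r)$ from cell counts for three choices of $\mathcal{A}$. It assumes a balanced panel, so that every post-treatment $(c,r)$ cell of a cohort contains as many observations as the cohort has units; this assumption, the helper names `cell_counts` and `weights`, and the use of the cohort sizes from Section 4.2 as inputs are illustrative choices rather than part of the procedure described above.

```python
def cell_counts(cohort_sizes, last_period):
    """Observations in each post-treatment (c, r) cell of a balanced panel (assumption)."""
    return {(c, t - c): n
            for c, n in cohort_sizes.items()
            for t in range(c, last_period + 1)}   # only r = t - c >= 0 is generated

def weights(counts, in_A):
    """Equation (7): share of each (c, r) cell among all observations in the set A."""
    selected = {cell: n for cell, n in counts.items() if in_A(*cell)}
    total = sum(selected.values())                # N_A, the denominator of (7)
    return {cell: n / total for cell, n in selected.items()}

counts = cell_counts({2004: 102, 2006: 225, 2007: 590}, last_period=2007)
w_event0 = weights(counts, lambda c, r: r == 0)         # ATT at event time 0
w_cohort2004 = weights(counts, lambda c, r: c == 2004)  # ATT of cohort 2004
w_all = weights(counts, lambda c, r: r >= 0)            # ATT of all treated units
```

Cells outside $\mathcal{A}$ are simply absent from the returned dictionary, which corresponds to $w^{\mathcal{A}}(c,r)=0$ for $(c,r)\notin\mathcal{A}$.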

3.2 Estimation of $\bar{\rho}$

First, consider the estimators $\hat{\bar{\tau}}=\sum_{g=1}^{G}\hat{w}_{g}\hat{\tau}_{g}$ for $\bar{\tau}$ and $\hat{\rho}_{a}=\operatorname{exp}(\hat{\bar{\tau}})-1$ for $\rho_{a}=\operatorname{exp}(\bar{\tau})-1$ . Under Assumption 3, $\hat{\bar{\tau}}\xrightarrow{p}\bar{\tau}$ and $\hat{\rho}_{a}\xrightarrow{p}\rho_{a}$ by Slutsky’s theorem. Because $\bar{\tau}\neq\bar{\rho}$ and $\rho_{a}\neq\bar{\rho}$ under treatment effect heterogeneity, both $\hat{\bar{\tau}}$ and $\hat{\rho}_{a}$ are inconsistent estimators for $\bar{\rho}$ .

A straightforward estimator for $\bar{\rho}$ is:

\hat{\rho}_{b}=\sum_{g=1}^{G}\hat{w}_{g}\operatorname{exp}(\hat{\tau}_{g})-1.

(8)

By Slutsky’s theorem, $\hat{\rho}_{b}$ is consistent under Assumption 3. However, it exhibits bias in finite samples due to the convexity of the exponential function: when $E(\hat{\tau}_{g})=\tau_{g}$ , $\operatorname{exp}(\tau_{g})=\operatorname{exp}(E(\hat{\tau}_{g}))\neq E(\operatorname{exp}(\hat{\tau}_{g}))$ unless $var(\hat{\tau}_{g})=0$ . As sample size increases, $var(\hat{\tau}_{g})\rightarrow 0$ , and the difference between $\operatorname{exp}(E(\hat{\tau}_{g}))$ and $E(\operatorname{exp}(\hat{\tau}_{g}))$ diminishes. The bias is more pronounced with multiple sub-treatment groups, as with more subgroups the size of each group decreases and $var(\hat{\tau}_{g})$ increases.
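The convexity argument can be checked numerically. The sketch below draws a hypothetical normally distributed $\hat{\tau}_{g}$ with known mean and standard deviation (both values are arbitrary illustration choices), and compares the average of $\operatorname{exp}(\hat{\tau}_{g})$ with $\operatorname{exp}(\tau_{g})$. It also reports the average after subtracting half the (here known) variance inside the exponential.

```python
import math
import random

def jensen_demo(tau=0.1, sd=0.5, draws=20000, seed=0):
    """Compare exp(tau) with the simulated means of exp(tau_hat), with and without correction."""
    rng = random.Random(seed)                       # fixed seed for reproducibility
    tau_hats = [rng.gauss(tau, sd) for _ in range(draws)]
    naive = sum(math.exp(t) for t in tau_hats) / draws
    corrected = sum(math.exp(t - 0.5 * sd ** 2) for t in tau_hats) / draws
    return math.exp(tau), naive, corrected

target, naive, corrected = jensen_demo()
```

With a standard deviation this large the uncorrected mean lies visibly above the target, while the corrected mean is close to it; shrinking `sd` narrows the gap, mirroring the large-sample argument in the text.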

To address the finite sample bias, I adapt the bias correction approach in Kennedy (1981) for heterogeneous treatment effects:

\hat{\rho}_{c}=\sum_{g=1}^{G}\hat{w}_{g}\operatorname{exp}(\hat{\tau}_{g}-0.5\hat{\sigma}_{\tau,g}^{2})-1.

(9)

The estimator utilizes the mean of lognormal distribution: for $x\sim\mathcal{N}(\mu_{x},\sigma_{x}^{2})$ , so that $\operatorname{exp}(x)\sim Lognormal(\mu_{x},\sigma_{x}^{2})$ ,

E\left(\operatorname{exp}(x)\right)=\operatorname{exp}(\mu_{x}+0.5\sigma_{x}^{2}).

(10)

Under Assumptions 3 and 4, $\sum_{g=1}^{G}\hat{w}_{g}\operatorname{exp}(\hat{\tau}_{g}-0.5\sigma_{\tau,g}^{2})$ is an unbiased estimator for $\sum_{g=1}^{G}w_{g}\operatorname{exp}(\tau_{g})$ . The estimator $\hat{\rho}_{c}$ is obtained by replacing $\sigma_{\tau,g}^{2}$ with $\hat{\sigma}_{\tau,g}^{2}$ . While $\hat{\rho}_{c}$ remains biased as $E\left(\operatorname{exp}(-0.5\hat{\sigma}_{\tau,g}^{2})\right)\neq\operatorname{exp}\left(-0.5E(\hat{\sigma}_{\tau,g}^{2})\right)$ , the bias diminishes as the sample size increases. Furthermore, Monte Carlo simulations in van Garderen and Shah (2002) for the case of homogeneous effects and those in this paper demonstrate that $\hat{\rho}_{c}$ performs well in modest sample sizes.
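The point estimators discussed so far differ only in where the exponential and the variance correction enter. A minimal sketch, with a hypothetical function name and hypothetical two-group inputs, where `var_tau_hat` holds the estimated variances $\hat{\sigma}_{\tau,g}^{2}$ (squared standard errors of $\hat{\tau}_{g}$):

```python
import math

def rho_estimators(w_hat, tau_hat, var_tau_hat):
    """Return (tau_bar_hat, rho_a_hat, rho_b_hat, rho_c_hat)."""
    tau_bar = sum(w * t for w, t in zip(w_hat, tau_hat))            # ATE in log points
    rho_a = math.exp(tau_bar) - 1.0                                 # exp(tau_bar) - 1
    rho_b = sum(w * math.exp(t) for w, t in zip(w_hat, tau_hat)) - 1.0   # equation (8)
    rho_c = sum(w * math.exp(t - 0.5 * v)                           # equation (9)
                for w, t, v in zip(w_hat, tau_hat, var_tau_hat)) - 1.0
    return tau_bar, rho_a, rho_b, rho_c

# Hypothetical inputs: two equally weighted groups with opposite-signed effects.
tau_bar, rho_a, rho_b, rho_c = rho_estimators([0.5, 0.5], [-0.2, 0.3], [0.01, 0.01])
```

For heterogeneous $\hat{\tau}_{g}$ the values are ordered $\hat{\bar{\tau}}<\hat{\rho}_{a}<\hat{\rho}_{b}$, and $\hat{\rho}_{c}$ lies below $\hat{\rho}_{b}$ because the correction term is subtracted inside the exponential.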

Under Assumption 4, I propose an exact unbiased estimator of $\bar{\rho}$ :

\hat{\rho}_{d}=\sum_{g=1}^{G}\hat{w}_{g}\operatorname{exp}(\hat{\tau}_{g})\operatorname{{}_{0}F_{1}}(\frac{m}{2},-\frac{m}{2}\frac{\hat{\sigma}_{\tau,g}^{2}}{2})-1,

(11)

where $m$ denotes the degrees of freedom of the chi-square distribution of $\hat{\sigma}_{\tau,g}^{2}$ (in Example 1, the residual degrees of freedom in regression (3)), and $\operatorname{{}_{0}F_{1}}(a,b)$ is the confluent hypergeometric limit function, equivalently a generalized hypergeometric function with 0 parameter of type 1 and 1 parameter of type 2, defined as

\operatorname{{}_{0}F_{1}}(a,b)=\sum_{n=0}^{\infty}\frac{b^{n}}{(a)_{n}n!},

(12)

with $(a)_{n}$ denoting the rising factorial:

(a)_{n}=\begin{cases}1&n=0\\ a(a+1)\ldots(a+n-1)&n>0.\end{cases}

The estimator is adapted from the exact unbiased estimator under treatment effect homogeneity by van Garderen and Shah (2002).⁷

⁷ For a detailed discussion of hypergeometric functions, see Abadir (1999). Alternative forms of estimators can be developed using different approaches to estimate $\operatorname{exp}(\hat{\tau}_{g}-0.5\sigma_{g}^{2})$ ; see Zhang and Gou (2022) for a comprehensive study.

The estimator $\hat{\rho}_{d}$ involves more complex computation due to the confluent hypergeometric limit function. Unbiasedness of $\hat{\rho}_{d}$ requires substantially stronger assumptions than those for $\hat{\rho}_{c}$ , including i.i.d. normal errors as stipulated in Lemma 2. Furthermore, the difference between $\hat{\rho}_{c}$ and $\hat{\rho}_{d}$ becomes negligible as sample size increases, with convergence observed even for sample sizes as small as 20 under homogeneous treatment effects (van Garderen and Shah, 2002). Given the computational simplicity of $\hat{\rho}_{c}$ and the limited gains from using $\hat{\rho}_{d}$ , especially as sample size increases, $\hat{\rho}_{c}$ is generally preferred for estimation of $\bar{\rho}$ .
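Although $\operatorname{{}_{0}F_{1}}$ looks forbidding, the series in (12) can be summed directly because consecutive terms differ by the factor $b/((a+n)(n+1))$. The sketch below implements the series and equation (11) with plain floating-point arithmetic; the function names, the stopping rule, and the inputs (two groups, $m=30$) are hypothetical illustration choices, and a library routine may be preferable in production work.

```python
import math

def hyp0f1(a, b, tol=1e-16, max_terms=500):
    """Series (12): sum over n of b^n / ((a)_n n!), truncated once terms are negligible."""
    term, total = 1.0, 1.0                      # n = 0 term equals 1
    for n in range(max_terms):
        term *= b / ((a + n) * (n + 1))         # ratio of term n+1 to term n
        total += term
        if abs(term) <= tol * abs(total):
            break
    return total

def rho_d(w_hat, tau_hat, var_tau_hat, m):
    """Equation (11), with m the residual degrees of freedom."""
    return sum(w * math.exp(t) * hyp0f1(m / 2.0, -(m / 2.0) * (v / 2.0))
               for w, t, v in zip(w_hat, tau_hat, var_tau_hat)) - 1.0

def rho_c(w_hat, tau_hat, var_tau_hat):
    """Equation (9), for comparison."""
    return sum(w * math.exp(t - 0.5 * v)
               for w, t, v in zip(w_hat, tau_hat, var_tau_hat)) - 1.0

# Hypothetical inputs with m = 30 residual degrees of freedom.
d = rho_d([0.5, 0.5], [-0.2, 0.3], [0.01, 0.01], m=30)
c = rho_c([0.5, 0.5], [-0.2, 0.3], [0.01, 0.01])
```

For large $m$, $\operatorname{{}_{0}F_{1}}(m/2,-m\hat{\sigma}_{\tau,g}^{2}/4)$ approaches $\operatorname{exp}(-0.5\hat{\sigma}_{\tau,g}^{2})$, which is why the two estimators become practically indistinguishable.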

The properties of these estimators are formalized in the following theorem:

Theorem 1.

(a) Under Assumptions 1-3, $\hat{\rho}_{b}$ , $\hat{\rho}_{c}$ and $\hat{\rho}_{d}$ are all consistent estimators for $\bar{\rho}$ .

(b) Under Assumptions 1-4, $\hat{\rho}_{d}$ is an unbiased estimator of $\bar{\rho}$ .

The proof is in the online appendix.

3.3 Inference of $\bar{\rho}$

I first discuss inference of $\bar{\rho}$ based on the approximations $\bar{\tau}$ or $\rho_{a}=\operatorname{exp}(\bar{\tau})-1$ . Assume that $\hat{\bar{\tau}}\sim N(\bar{\tau},\sigma_{\bar{\tau}}^{2})$ approximately, which holds for various heterogeneity robust estimators (e.g., Callaway and Sant’Anna, 2021; Sun and Abraham, 2021; Gibbons et al., 2019). We can then test $H_{0}:\bar{\tau}=\rho_{0}$ using a z-test based on:

z_{\tau}=\left(\hat{\bar{\tau}}-\rho_{0}\right)/\sigma_{\bar{\tau}}.

(13)

The corresponding $(1-\alpha)$ confidence interval (CI) for $\bar{\tau}$ is

CI_{\tau}=\left[\hat{\bar{\tau}}+\sigma_{\bar{\tau}}z_{\alpha/2},\hat{\bar{\tau}}-\sigma_{\bar{\tau}}z_{\alpha/2}\right],

(14)

where $z_{\alpha/2}$ is the $\alpha/2$ quantile of the standard normal distribution. In practice, $\sigma_{\bar{\tau}}$ is replaced by a consistent estimator $\hat{\sigma}_{\bar{\tau}}$ . Using $\rho_{a}=\operatorname{exp}(\bar{\tau})-1$ to approximate $\bar{\rho}$ , the z-score for $H_{0}:\rho_{a}=\rho_{0}$ is

z_{a}=\left(\hat{\bar{\tau}}-\ln(\rho_{0}+1)\right)/\sigma_{\bar{\tau}},

(15)

as $\rho_{a}=\rho_{0}$ is equivalent to $\bar{\tau}=\ln(\rho_{0}+1)$ . The $(1-\alpha)$ CI of $\rho_{a}$ is

CI_{a}=\left[\operatorname{exp}\left(\hat{\bar{\tau}}+\sigma_{\bar{\tau}}z_{\alpha/2}\right)-1,\operatorname{exp}\left(\hat{\bar{\tau}}-\sigma_{\bar{\tau}}z_{\alpha/2}\right)-1\right].

(16)

Observe that when $\rho_{0}=0$ , we have $\rho_{0}=\ln(\rho_{0}+1)$ and consequently $z_{\tau}=z_{a}$ . Therefore, the tests for $H_{0}:\bar{\rho}=\rho_{0}$ based on the approximations $\bar{\tau}$ and $\operatorname{exp}(\bar{\tau})-1$ yield identical results for $\rho_{0}=0$ and similar results for $\rho_{0}$ close to 0. Given that $\bar{\rho}\geqslant\operatorname{exp}(\bar{\tau})-1\geqslant\bar{\tau}$ , the inference for $\bar{\rho}$ based on $\operatorname{exp}(\bar{\tau})-1$ is expected to outperform that based on $\bar{\tau}$ . Both methods yield reliable results when $\tau_{g}$ is small for all $g$ and hence $\bar{\tau}$ , $\operatorname{exp}(\bar{\tau})-1$ are both close to $\bar{\rho}$ , but may exhibit substantial bias when $\bar{\tau}$ or $\operatorname{exp}(\bar{\tau})-1$ differ significantly from $\bar{\rho}$ .
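The conventional procedures in (13) to (16) can be sketched in a few lines. The function name is hypothetical. The inputs in the usage line take $\hat{\bar{\tau}}=-0.0949$ from the event-time-2 result reported in Section 4.2, together with a standard error of 0.01296 that is merely backed out from the reported 95% CI under the normal approximation; it is an assumption for illustration, not a reported estimate.

```python
import math
from statistics import NormalDist

def conventional_inference(tau_bar_hat, se, rho0=0.0, alpha=0.05):
    """z-scores (13), (15) and confidence intervals (14), (16)."""
    z_q = NormalDist().inv_cdf(alpha / 2)                 # alpha/2 quantile (negative)
    z_tau = (tau_bar_hat - rho0) / se                     # equation (13)
    z_a = (tau_bar_hat - math.log(rho0 + 1.0)) / se       # equation (15)
    ci_tau = (tau_bar_hat + se * z_q, tau_bar_hat - se * z_q)        # equation (14)
    ci_a = (math.exp(ci_tau[0]) - 1.0, math.exp(ci_tau[1]) - 1.0)    # equation (16)
    return z_tau, z_a, ci_tau, ci_a

# Standard error backed out from a reported CI (assumption, see lead-in).
z_tau, z_a, ci_tau, ci_a = conventional_inference(-0.0949, 0.01296)
```

Because $\ln(1)=0$, the two z-scores coincide at $\rho_{0}=0$, while the interval for $\rho_{a}$ is shifted upward relative to the interval for $\bar{\tau}$ since $\operatorname{exp}(x)-1>x$ for $x\neq 0$.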

This paper proposes a novel approach for more accurate inference on $\mu_{0}=\ln\left(\sum_{g=1}^{G}w_{g}\operatorname{exp}(\tau_{g})\right)$ , which is equivalent to inference on $\bar{\rho}$ , as $\bar{\rho}=\operatorname{exp}(\mu_{0})-1$ is a strictly increasing function of $\mu_{0}$ . Note that $\mu_{0}=0$ if and only if $\bar{\rho}=0$ . When $G=1$ , $\mu_{0}=\tau$ is the homogeneous treatment effect in log points.

Exact inference on $\mu_{0}$ is infeasible due to the unknown distribution of $\bar{\rho}$ estimators. Even with known $w$ , $\sum_{g=1}^{G}w_{g}\operatorname{exp}(\hat{\tau}_{g})=\sum_{g=1}^{G}\operatorname{exp}\left(\ln(w_{g})+\hat{\tau}_{g}\right)$ is a sum of correlated log-normal variables, which has no known closed-form distribution. Approximating the distribution of a sum of log-normal variables is a well-known problem in statistics and is of particular interest to researchers in telecommunications; see, e.g., Mehta et al. (2007) for a review.

I propose an approximate inference method for $\mu_{0}$ , using the Fenton-Wilkinson method to approximate the distribution of $\sum_{g=1}^{G}\hat{w}_{g}\operatorname{exp}(\hat{\tau}_{g})$ , which is approximately the sum of correlated log-normals. The Fenton-Wilkinson method approximates the sum of log-normal variables with a single log-normal variable, by matching the first and second moments (Fenton, 1960; Abu-Dayya and Beaulieu, 1994). While primarily used in telecommunications, it has also found applications in economics (e.g., Kovak et al. 2021; Marone and Sabety 2022).

Let $\eta_{0,g}=\ln(w_{g})+\tau_{g},$ and $\eta_{0}=(\eta_{0,1},\ldots,\eta_{0,G})^{\prime}$ , then

\operatorname{exp}(\eta_{0})=\left(w_{1}\operatorname{exp}(\tau_{1}),\ldots,w_{G}\operatorname{exp}(\tau_{G})\right)^{\prime},

(17)

\mu_{0}=\ln\left[\sum_{g=1}^{G}\operatorname{exp}\left(\eta_{0,g}\right)\right]=\ln\left[\mathbf{1}_{G}^{\prime}\operatorname{exp}(\eta_{0})\right],

(18)

where $\mathbf{1}_{G}$ is a $G\times 1$ vector of ones. Define

\eta_{g}^{\ast}=\ln(\hat{w}_{g})+\hat{\tau}_{g}-\frac{1}{2}\frac{\sigma_{w,g}^{2}}{w_{g}^{2}}-\frac{1}{2}\sigma_{\tau,g}^{2}

(19)

as the estimate for $\eta_{0,g}$ , with $\eta^{\ast}=(\eta_{1}^{\ast},\ldots,\eta_{G}^{\ast})^{\prime}$ estimating $\eta_{0}$ . The bias correcting terms involving $\sigma_{w,g}^{2}$ and $\sigma_{\tau,g}^{2}$ are motivated by equation (10), as we shall see in the proof of Theorem 2. The estimator for $\mu_{0}$ is:

\mu^{\ast}=\ln\left(\sum_{g=1}^{G}\operatorname{exp}(\eta_{g}^{\ast})\right)=\ln\left[\mathbf{1}_{G}^{\prime}\operatorname{exp}(\eta^{\ast})\right].

(20)

Recall that $\operatorname{diag}(\alpha)$ is the diagonal matrix with elements of vector $\alpha$ on the main diagonal.

Theorem 2.

(a) Under Assumptions 1-3,

z_{\mu}=\frac{\mu^{\ast}+\frac{1}{2}\sigma_{\mu}^{\ast 2}-\mu_{0}}{\sigma_{\mu}^{\ast}}\sim\mathcal{N}(0,1)

(21)

approximately as $N\rightarrow\infty$ , where

\sigma_{\mu}^{\ast 2}=\ln\left(\frac{\operatorname{exp}(\eta_{0})^{\prime}\operatorname{exp}(\Sigma_{\eta})\operatorname{exp}(\eta_{0})}{\operatorname{exp}(\eta_{0})^{\prime}\mathbf{1}_{G}\mathbf{1}_{G}^{\prime}\operatorname{exp}(\eta_{0})}\right)

(22)

is the approximate variance of $\mu^{\ast}$ , with $\Sigma_{\eta}=\operatorname{diag}(w)^{-1}\Sigma_{w}\operatorname{diag}(w)^{-1}+\Sigma_{\tau}$ as the approximate variance of $\eta^{\ast}$ . Here $\operatorname{exp}(\cdot)$ of a vector or matrix is taken element-wise.

(b) The limit $\lim_{N\rightarrow\infty}N\sigma_{\mu}^{\ast 2}=\bar{\sigma}_{\mu}^{2}$ exists, and $0<\bar{\sigma}_{\mu}^{2}<\infty$ .

Remark 4.

In the case of homogeneous treatment effects, $w=1$ and $\Sigma_{w}=0$ , we have $\mu^{\ast}=\hat{\tau}-\frac{1}{2}\sigma_{\tau}^{2}$ , $\sigma_{\mu}^{\ast 2}=\sigma_{\tau}^{2}$ , and hence $z_{\mu}=\left(\hat{\tau}-\mu_{0}\right)/\sigma_{\tau}=z_{\tau}$ . Observe further that $\bar{\tau}=\tau$ under homogeneity and $\mu_{0}=\ln(\rho_{0}+1)$ , hence $z_{\mu}=z_{a}=z_{\tau}$ .

Proof of the theorem is in the online appendix. Here is a sketch of the intuition. By Taylor approximation, $\ln(\hat{w}_{g})\approx\ln(w_{g})+\left(\hat{w}_{g}-w_{g}\right)/w_{g}$ , which follows a normal distribution. Consequently $\eta_{g}^{\ast}$ is approximately normal with variance $\sigma_{w,g}^{2}/w_{g}^{2}+\sigma_{\tau,g}^{2}$ . In light of the mean of log-normal variables in (10), $E(\operatorname{exp}(\eta_{g}^{\ast}))=\operatorname{exp}(\eta_{0,g})$ and hence by (18) and (20), $E\left(\operatorname{exp}(\mu^{\ast})\right)=\operatorname{exp}(\mu_{0})$ . In addition, $\operatorname{exp}(\mu^{\ast})=\sum_{g=1}^{G}\operatorname{exp}(\eta_{g}^{\ast})$ represents the sum of log-normal variables, which can be approximated by a log-normal distribution using the Fenton-Wilkinson method. Thus, $\mu^{\ast}$ is approximately normal with mean $E(\mu^{\ast})$ and variance $\sigma_{\mu}^{\ast 2}$ . Applying (10) again, we obtain $\operatorname{exp}(\mu_{0})=E(\operatorname{exp}(\mu^{\ast}))=\operatorname{exp}\left[E(\mu^{\ast})+0.5\sigma_{\mu}^{\ast 2}\right]$ , which implies $\mu_{0}=E(\mu^{\ast})+0.5\sigma_{\mu}^{\ast 2}$ . We thus have $\mu^{\ast}+0.5\sigma_{\mu}^{\ast 2}\sim\mathcal{N}(\mu_{0},\sigma_{\mu}^{\ast 2})$ approximately.

Theorem 2 provides the basis for hypothesis testing and confidence interval construction for $\mu_{0}$ and, by extension, $\bar{\rho}$ . For testing $H_{0}:\mu_{0}=a$ , we can use the z-score $z_{\mu}$ defined in (21). The corresponding $1-\alpha$ confidence interval for $\mu_{0}$ is:

[\mu^{\ast}+0.5\sigma_{\mu}^{\ast 2}+z_{\alpha/2}\sigma_{\mu}^{\ast},\mu^{\ast}+0.5\sigma_{\mu}^{\ast 2}-z_{\alpha/2}\sigma_{\mu}^{\ast}],

where $z_{\alpha/2}$ is the $\alpha/2$ quantile of the standard normal distribution. The $1-\alpha$ confidence interval of $\bar{\rho}$ is

\left[\operatorname{exp}(\mu^{\ast}+0.5\sigma_{\mu}^{\ast 2}+z_{\alpha/2}\sigma_{\mu}^{\ast})-1,\operatorname{exp}(\mu^{\ast}+0.5\sigma_{\mu}^{\ast 2}-z_{\alpha/2}\sigma_{\mu}^{\ast})-1\right].

(23)

Let $\hat{\mu}$ and $\hat{\sigma}_{\mu}^{2}$ denote feasible estimates of $\mu^{\ast}$ and $\sigma_{\mu}^{\ast 2}$ , obtained by replacing $w_{g},\tau_{g},\sigma_{w,g}^{2}$ and $\sigma_{\tau,g}^{2}$ with their respective estimators. Lemma 3 below ensures that the asymptotic normality result holds when using these estimates in practice.

Lemma 3.

Suppose Assumptions 1-3 hold, then

\hat{z}_{\mu}=\frac{\hat{\mu}+0.5\hat{\sigma}_{\mu}^{2}-\mu_{0}}{\hat{\sigma}_{\mu}}\sim\mathcal{N}(0,1)

approximately as $N\to\infty$ .
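The feasible version of the procedure can be sketched as follows. This is an illustration under stated assumptions rather than the paper's implementation: `sigma_w` and `sigma_tau` are taken to be estimated finite-sample covariance matrices of $\hat{w}$ and $\hat{\tau}$ (that is, the asymptotic variance estimators divided by $N$), $\operatorname{exp}(\Sigma_{\eta})$ is applied element-wise, $\eta_{0}$ in (22) is replaced by $\ln(\hat{w}_{g})+\hat{\tau}_{g}$, and the function name and numerical inputs are hypothetical.

```python
import math
from statistics import NormalDist

def fw_inference(w_hat, tau_hat, sigma_w, sigma_tau, alpha=0.05):
    """Fenton-Wilkinson based estimate of mu_0 and confidence interval (23) for rho_bar."""
    G = len(w_hat)
    # Sigma_eta = diag(w)^-1 Sigma_w diag(w)^-1 + Sigma_tau
    sigma_eta = [[sigma_w[g][h] / (w_hat[g] * w_hat[h]) + sigma_tau[g][h]
                  for h in range(G)] for g in range(G)]
    # Equation (19): the two bias corrections sum to half the g-th diagonal of Sigma_eta
    eta = [math.log(w_hat[g]) + tau_hat[g] - 0.5 * sigma_eta[g][g] for g in range(G)]
    mu_hat = math.log(sum(math.exp(e) for e in eta))                 # equation (20)
    e0 = [w_hat[g] * math.exp(tau_hat[g]) for g in range(G)]         # plug-in exp(eta_0)
    num = sum(e0[g] * e0[h] * math.exp(sigma_eta[g][h])
              for g in range(G) for h in range(G))
    var_mu = max(0.0, math.log(num / sum(e0) ** 2))                  # equation (22)
    sd_mu = math.sqrt(var_mu)
    center = mu_hat + 0.5 * var_mu
    z_q = NormalDist().inv_cdf(alpha / 2)                            # negative quantile
    ci_rho = (math.exp(center + z_q * sd_mu) - 1.0,
              math.exp(center - z_q * sd_mu) - 1.0)                  # equation (23)
    return mu_hat, var_mu, ci_rho

# Hypothetical two-group inputs (400 treated units split evenly).
sigma_w = [[0.000625, -0.000625], [-0.000625, 0.000625]]
sigma_tau = [[0.01, 0.0], [0.0, 0.01]]
mu_hat, var_mu, ci_rho = fw_inference([0.5, 0.5], [-0.2, 0.3], sigma_w, sigma_tau)
```

With a single group, $\hat{w}=1$ and a zero weight variance, the interval reduces to $\operatorname{exp}(\hat{\tau}\pm z\hat{\sigma}_{\tau})-1$, in line with Remark 4.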

4 Empirical Applications

4.1 Education Reform and Earnings

The first application uses a semi-log regression model as discussed in Example 1. I replicate and extend the analysis presented in Table 2, Column 1 of Meghir and Palme (2005), which examines the impact of educational reform in Sweden on earnings. I utilize their original unbalanced panel dataset, comprising 19,230 individuals observed from 1985 to 1996, with a total sample size of 209,683. I apply the procedure for estimating $\tau$ and $w$ in Example 1.

I consider two specifications, both of which are semi-log regression models. The model assuming homogeneous treatment effects is:

\ln(earn_{it})=\alpha+\tau\times reform_{i}+x_{it}^{\prime}\beta+\epsilon_{it},

(24)

and the model allowing for heterogeneous treatment effects across genders is

\ln(earn_{it})=\alpha+\tau_{1}\times reform_{i}\times female_{i}+\tau_{2}\times reform_{i}\times male_{i}+x_{it}^{\prime}\beta+\epsilon_{it},

(25)

where $earn_{it}$ is the earnings of individual $i$ in year $t$ , $reform_{i}$ is an indicator that individual $i$ was affected by the reform during childhood, $x_{it}$ is a vector of controls including year and municipal fixed effects, a female indicator, as well as the following variables and their interactions with the female indicator: county fixed effects, father’s education level dummies, a cohort dummy, and 44 measures of individual abilities.⁸

⁸ The specification in Meghir and Palme (2005) differs slightly from model (24). While I incorporate municipal fixed effects directly, Meghir and Palme (2005) employ a demeaning approach, subtracting municipal means from all variables except for year dummies. If year dummies were also demeaned, the two approaches would be mathematically equivalent for the full sample analysis but still different for subsample analyses, unless demeaning is performed within each specific subsample. Nevertheless, estimates of $\tau$ from (24) closely align with those in Meghir and Palme (2005).

I estimate $\tau$ in (24) and (25) using OLS on five samples: the full sample, individuals with low-educated fathers ( $N=173,435$ ), those with low-educated fathers and low personal ability ( $N=92,473$ ), those with low-educated fathers and high personal ability ( $N=80,962$ ), and individuals with high-educated fathers ( $N=36,248$ ). Standard errors are clustered at the municipality level. I estimate $w$ and the variance $\Sigma_{w}$ as in Example 1. For regression (24), $\hat{w}=1$ and $\hat{\Sigma}_{w}=0$ . For regression (25), $w_{1}$ and $w_{2}$ are estimated by the share of females and males in all treated units in each sample. Using these estimates, I compute point estimates and construct confidence intervals for $\bar{\rho}$ as detailed in Sections 3.2 and 3.3.

Table 1 presents the results, with the top and bottom panels corresponding to models (24) and (25) respectively. Columns (1)-(4) report point estimates of $\bar{\rho}$ : $\hat{\bar{\tau}}=\sum_{g=1}^{G}\hat{w}_{g}\hat{\tau}_{g}$ , $\hat{\rho}_{a}=\operatorname{exp}(\hat{\bar{\tau}})-1$ , $\hat{\rho}_{b}$ from (8) and $\hat{\rho}_{c}$ from (9). The estimate $\hat{\rho}_{d}$ is suppressed as it is virtually identical to $\hat{\rho}_{c}$ when rounded to three decimal places. Columns (5)-(7) provide $95\%$ confidence intervals (CI) for $\bar{\tau}$ , $\rho_{a}$ and $\bar{\rho}$ , defined in (14), (16), and (23) respectively. In the case of homogeneous treatment effects, $\hat{\rho}_{a}=\hat{\rho}_{b}$ and $z_{a}=z_{\mu}$ , hence $\hat{\rho}_{a}$ and CI for $\rho_{a}$ are omitted. To facilitate interpretation, all estimates and CIs are scaled by 100.

The results for models (24) and (25) exhibit similar patterns. In what follows, I focus the discussion on the results from model (25). For the first four samples, $100(\hat{\tau}_{1},\hat{\tau}_{2})$ are $(1.86,0.96)$ , $(4.42,2.39)$ , $(3.65,1.48)$ , and $(6.09,2.60)$ respectively. The small magnitudes of $\hat{\tau}_{g}$ result in similar values across $\bar{\rho}$ estimates and confidence intervals. For example, for the whole sample, $\hat{\bar{\tau}}$ is $1.422$ ( $95\%$ CI: $[-0.318,3.163]$ ), which is similar to $\hat{\rho}_{c}=1.427$ ( $95\%$ CI: $[-0.321,3.213]$ ). For the fifth sample (individuals with high-educated fathers), $100(\hat{\tau}_{1},\hat{\tau}_{2})=(-10.30,-2.21)$ , indicating larger magnitudes and between-group heterogeneity of $\tau_{g}$ . We can thus observe a larger difference in the estimators and CIs: $\hat{\bar{\tau}}=-6.37\%$ ( $95\%$ CI: $[-10.14\%,-2.61\%]$ ) and $\hat{\rho}_{c}=-6.13\%$ ( $95\%$ CI: $[-9.57\%,-2.52\%]$ ).

4.2 Minimum Wage Policies and Teen Employment

The second application illustrates the applicability of my methodology to staggered difference-in-differences designs, as outlined in Example 2. It expands the analysis in panel B of Table 3 in Callaway and Sant’Anna (2021), examining the impact of minimum wage policies on teen employment.

This study utilizes a panel dataset covering 2,296 counties from 2001 to 2007.⁹ The sample comprises four groups based on minimum wage policy changes: 102 counties in states that increased minimum wages in 2004 (cohort 2004), 225 counties in cohort 2006, 590 counties in cohort 2007, and 1,379 counties with no minimum wage increases during the study period (never-treated group).

⁹ The sample size slightly exceeds that of Callaway and Sant’Anna (2021), which includes 2,284 counties. This discrepancy likely arises from subtle differences in the definition of the dependent variable and my approach to merging the datasets. Nevertheless, my estimates closely align with theirs, even when they employ the doubly robust estimator.

I use model (6) to estimate $\tau$ . The outcome variable $\ln(y_{it})$ is the natural logarithm of teen employment in county $i$ in year $t$ , obtained from Quarterly Workforce Indicators as the employment of individuals aged 14 to 18 in all private sectors at the end of the first quarter of each year. Treatment indicators, $\mathbf{1}(c(i)=c,t-c(i)=r)$ , denote that county $i$ belongs to cohort $c$ and year $t$ is the $r$ -th year post-treatment. The control variables $x_{it}$ include interaction terms between year dummies and the following county-level characteristics in the year 2000: population, percent of white residents, poverty rate, and log median income. These variables are sourced from the County and City Data Book 2000. For a comprehensive description of the minimum wage changes and the datasets, please refer to Dube et al. (2016) and Callaway and Sant’Anna (2021). The weights $w^{\mathcal{A}}$ are estimated using equation (7) for various $\mathcal{A}$ , as discussed in Example 2. Utilizing the estimators of $\tau$ and $w^{\mathcal{A}}$ , I implement my estimation and inference methods for $\bar{\rho}$ .

Table 2 presents the findings, with columns defined as in Table 1 and $\hat{\rho}_{d}$ omitted as it is indistinguishable from $\hat{\rho}_{c}$ when rounded to three digits. From top to bottom are estimates and CIs for the ATT in percentage points of all treated units, each cohort, each event time, and each calendar year.

My results reveal small but non-negligible differences between $\hat{\rho}_{c}$ and alternative estimators such as $\hat{\bar{\tau}}$ , $\hat{\rho}_{a}$ and $\hat{\rho}_{b}$ , and between the CIs of $\bar{\rho}$ and those of $\bar{\tau}$ and $\rho_{a}$ . For example, the ATT for event time $2$ is estimated at $-0.0949$ log points with a 95% CI of $[-0.1203,-0.0695]$ , and $-9.06$ percentage points with a 95% CI of $[-11.33\%,-6.72\%]$ . The discrepancy between $\hat{\bar{\tau}}$ and $\hat{\rho}_{c}$ is approximately 0.43 percentage points, while the difference between the lower bounds of the confidence intervals for $\bar{\tau}$ and $\bar{\rho}$ exceeds 0.7 percentage points. While these differences are modest in absolute terms, they are sufficiently large to warrant consideration in the interpretation of results.

5 Monte Carlo Experiment

This section presents a Monte Carlo experiment for the semi-log regression model in Example 1 to examine the finite sample properties of various estimators and inference methods.

I consider a setting with $G=4$ sub-treatment groups. Treatment probability is $p_{S}=0.8$, and each sub-treatment group has an equal weight $w_{g}=0.25$ in the whole treatment population. Consequently, the control group and each sub-treatment group each comprise $20\%$ of the total population. Observations are randomly assigned to either the control group or a sub-treatment group according to these probabilities. Sample size $N$ is selected from $\{20, 50, 100, 200, 500, 1000, 2000, 5000, 10^{4}, 10^{5}, 10^{6}\}$.

I generate samples using a simplified version of model (3): $\ln(y_{i})=1+x_{i}+\sum_{g=1}^{4}d_{i}^{(g)}\tau_{g}+\epsilon_{i}$, where the covariate $x_{i}$ and the error term $\epsilon_{i}$ are each i.i.d. $\mathcal{N}(0,1)$. Results using skew normal errors are analogous and presented in the online appendix. I examine two scenarios: large effects with $100(\rho_{1},\rho_{2},\rho_{3},\rho_{4})=(-16,-8,8,16)$ and small effects with $100(\rho_{1},\rho_{2},\rho_{3},\rho_{4})=(-8,-4,4,8)$. In both cases, the true value of $\bar{\rho}$ is $0\%$.
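The data-generating process above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the simulation code behind Tables 3 and 4: it draws each of the five groups (control plus four sub-treatment groups) with probability 0.2, fits the regression by OLS through the normal equations, and forms $\hat{\bar{\tau}}$ and $\hat{\rho}_{a}$ using sample shares among the treated as weights (an assumed choice for $\hat{w}_{g}$). The estimators $\hat{\rho}_{b}$, $\hat{\rho}_{c}$ and $\hat{\rho}_{d}$ are not implemented, and the function names, seed and sample size are hypothetical.

```python
import math
import random

RHO = [-0.16, -0.08, 0.08, 0.16]           # large-effect scenario
TAU = [math.log(1.0 + r) for r in RHO]     # implied log-point effects tau_g

def simulate(n, rng):
    """Draw one sample; group 0 is control, groups 1..4 are sub-treatment groups."""
    rows, y = [], []
    for _ in range(n):
        g = rng.randrange(5)               # each group has probability 0.2
        x = rng.gauss(0.0, 1.0)
        eps = rng.gauss(0.0, 1.0)
        d = [1.0 if g == k else 0.0 for k in range(1, 5)]
        rows.append([1.0, x] + d)          # regressors: constant, x, four dummies
        y.append(1.0 + x + sum(dk * tk for dk, tk in zip(d, TAU)) + eps)
    return rows, y

def ols(rows, y):
    """Solve the normal equations X'X b = X'y by Gauss-Jordan elimination."""
    k = len(rows[0])
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda i: abs(a[i][c]))   # partial pivoting
        a[c], a[p] = a[p], a[c]
        for i in range(k):
            if i != c:
                f = a[i][c] / a[c][c]
                a[i] = [ai - f * ac for ai, ac in zip(a[i], a[c])]
    return [a[i][k] / a[i][i] for i in range(k)]

rng = random.Random(0)                     # fixed seed, chosen arbitrarily
rows, y = simulate(20000, rng)
beta = ols(rows, y)

tau_hat = beta[2:]                         # coefficients on the four dummies
n_g = [sum(r[2 + k] for r in rows) for k in range(4)]
w_hat = [n / sum(n_g) for n in n_g]        # assumed weights: shares among treated
tau_bar_hat = sum(w * t for w, t in zip(w_hat, tau_hat))
rho_a_hat = math.exp(tau_bar_hat) - 1.0

print(f"tau_bar_hat = {100 * tau_bar_hat:.3f}, rho_a_hat = {100 * rho_a_hat:.3f}")
```

Repeating the draw many times and averaging the estimates would approximate the kind of summary reported in the Monte Carlo tables, although a single replication like this one is noisy.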

Tables 3 and 4 present results for large and small treatment effects respectively, with values scaled by 100. Each table reports the mean and standard errors (in parentheses) of estimators for $\bar{\rho}$ across 100,000 repetitions. True values of $\bar{\tau}$ , $\rho_{a}=\operatorname{exp}(\bar{\tau})-1$ and $\bar{\rho}$ are provided at the top of each table. The final two columns show empirical rejection rates of z-tests using $z_{\tau}$ for $H_{0}:\bar{\tau}=0$ and using $z_{\mu}$ for $H_{0}:\bar{\rho}=0$ at the 5% level. The empirical rejection rates are equivalent to one minus the coverage rate for $0$ of the confidence intervals in equation (14) and equation (23) respectively. The z-test using $z_{a}$ is omitted, as $z_{a}=z_{\tau}$ when $\rho_{0}=0$ .
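The true values listed at the top of each table follow directly from the design. As a minimal sketch, the Python snippet below converts each $\rho_{g}$ to $\tau_{g}=\ln(1+\rho_{g})$, averages with equal weights $w_{g}=0.25$, and computes $\bar{\tau}$, $\rho_{a}=\operatorname{exp}(\bar{\tau})-1$ and $\bar{\rho}$ for both scenarios. The function name is illustrative only.

```python
import math

def true_values(rho, w):
    """Return (tau_bar, rho_a, rho_bar) for group effects rho and weights w."""
    tau = [math.log(1.0 + r) for r in rho]            # tau_g = ln(1 + rho_g)
    tau_bar = sum(wg * tg for wg, tg in zip(w, tau))  # ATE in log points
    rho_a = math.exp(tau_bar) - 1.0                   # exp-transformed ATE in log points
    rho_bar = sum(wg * rg for wg, rg in zip(w, rho))  # ATE in percentage points
    return tau_bar, rho_a, rho_bar

w = [0.25] * 4
large = true_values([-0.16, -0.08, 0.08, 0.16], w)
small = true_values([-0.08, -0.04, 0.04, 0.08], w)

for name, vals in (("large", large), ("small", small)):
    print(name, [round(100 * v, 3) for v in vals])
```

Because the logarithm is concave, $\bar{\tau}$ and $\rho_{a}$ are negative in both scenarios even though $\bar{\rho}$ is exactly zero, which is the gap the simulations are designed to expose.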

The results show that the proposed estimator $\hat{\rho}_{c}$ and z-test using $z_{\mu}$ perform well even for modest sample sizes. For both large and small treatment effects, when $N\geqslant 50$ , $\hat{\rho}_{c}$ has a bias smaller than 0.1%. The empirical rejection rate of the z-test based on $z_{\mu}$ falls between 4.9% and 5.1% when $N\geqslant 200$ . For $50\leqslant N<200$ , the empirical rejection rate is slightly larger but remains below 5.6%.

For small treatment effects (Table 4), $\hat{\bar{\tau}}$ and $\hat{\rho}_{a}$ provide a reasonable approximation to $\bar{\rho}$, as the true values of $\bar{\tau}$ and $\rho_{a}$ are around $-0.2\%$, close to $\bar{\rho}=0$. The approximation bias converges to about $-0.2\%$ as the sample size increases. Tests of $\bar{\rho}=0$ based on $z_{\tau}$ maintain appropriate rejection rates for $N\leqslant 5000$ but can be significantly oversized for large samples, e.g., 12.5% when $N=100,000$. For large treatment effects (Table 3), the bias of $\hat{\bar{\tau}}$ and $\hat{\rho}_{a}$ is more pronounced and converges to about $-0.8\%$ in this particular case. The z-tests based on $z_{\tau}$ have empirical rejection rates far above the nominal rate of $5\%$; for example, when $N=10^{5}$, the empirical rejection rate is $17.5\%$.
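The over-rejection of the $z_{\tau}$ test has a simple explanation: it tests $\bar{\tau}=0$, while the true $\bar{\tau}$ is slightly negative, so the test gains power against its own null as the standard error shrinks. The sketch below gives a back-of-the-envelope check under two assumptions that are not part of the paper's procedure: $\hat{\bar{\tau}}$ is treated as exactly normal, and the Monte Carlo standard errors in Table 3 are used as its standard deviation. The function names are illustrative.

```python
import math

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def rejection_prob(theta, se, crit=1.959964):
    """P(|theta_hat / se| > crit) when theta_hat ~ N(theta, se^2)."""
    delta = theta / se
    return norm_cdf(-crit - delta) + norm_cdf(-crit + delta)

tau_bar = -0.809                     # true value from Table 3 (scaled by 100)
se_by_n = {10**5: 0.793, 10**6: 0.250}   # Monte Carlo s.e. of tau_bar_hat, Table 3

for n, se in se_by_n.items():
    print(n, round(100 * rejection_prob(tau_bar, se), 1))
```

Under these assumptions the implied rejection rates are close to the empirical rates of 17.465 and 89.760 reported in Table 3 for $N=10^{5}$ and $N=10^{6}$.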

The gain from $\hat{\rho}_{d}$ is modest overall. In smaller samples, $\hat{\rho}_{d}$ demonstrates slightly smaller root mean square deviations. However, the estimators $\hat{\rho}_{c}$ and $\hat{\rho}_{d}$ yield similar results when $N\geqslant 200$. While $\hat{\rho}_{b}$ is consistent and approaches $\hat{\rho}_{c}$ as the sample size grows, it yields much larger bias and standard errors when $N\leqslant 10^{4}$. This highlights the importance of correcting small sample bias using $\hat{\rho}_{c}$.

These findings underscore the importance of choosing appropriate estimators and inference methods, particularly when dealing with large treatment effects or very large sample sizes. The proposed estimator $\hat{\rho}_{c}$ and z-test using $z_{\mu}$ demonstrate robust performance across various scenarios, while traditional approaches may lead to biased estimates or inflated rejection rates under certain conditions.

6 Conclusion

This paper highlights the importance of correctly estimating and interpreting ATEs in percentage points when treatment effects are heterogeneous across groups. The discrepancies between ATEs in log points and percentage points can be substantial, especially when treatment effects are large or vary significantly across subgroups. Failing to account for these differences may lead to misinterpretation of results and potentially misguided policy recommendations.

My proposed methods provide researchers with tools to obtain more accurate estimates and conduct valid inference on ATEs in percentage points. The methods can be applied in a variety of settings, such as semi-log regression models. They are particularly relevant for research designs such as staggered difference-in-differences models, where treatment effect heterogeneity is common. By applying the methods to empirical studies on education reform and minimum wage policies, I demonstrate how accounting for heterogeneity can affect the interpretation of the ATE in percentage points in practice. My method has so far focused on group-specific treatment effect heterogeneity. Future research could explore the case of within-group heterogeneity. As empirical studies continue to grapple with complex treatment effect patterns, tools like those presented in this paper will become increasingly valuable for accurate estimation and inference.

References

Abadir (1999) Abadir, Karim M., “An Introduction to Hypergeometric Functions for Economists,” Econometric Reviews, January 1999, 18 (3), 287–330.
Abu-Dayya and Beaulieu (1994) Abu-Dayya, A.A. and N.C. Beaulieu, “Outage Probabilities in the Presence of Correlated Lognormal Interferers,” IEEE Transactions on Vehicular Technology, February 1994, 43 (1), 164–173.
Angrist (1998) Angrist, Joshua D., “Estimating the Labor Market Impact of Voluntary Military Service Using Social Security Data on Military Applicants,” Econometrica, 1998, 66 (2), 249–288.
Borusyak et al. (2024) Borusyak, Kirill, Xavier Jaravel, and Jann Spiess, “Revisiting Event-Study Designs: Robust and Efficient Estimation,” The Review of Economic Studies, February 2024, p. rdae007.
Callaway and Sant’Anna (2021) Callaway, Brantly and Pedro H. C. Sant’Anna, “Difference-in-Differences with Multiple Time Periods,” Journal of Econometrics, December 2021, 225 (2), 200–230.
Chen and Roth (2024) Chen, Jiafeng and Jonathan Roth, “Logs with Zeros? Some Problems and Solutions,” The Quarterly Journal of Economics, May 2024, 139 (2), 891–936.
de Chaisemartin and D’Haultfœuille (2020) de Chaisemartin, Clément and Xavier D’Haultfœuille, “Two-Way Fixed Effects Estimators with Heterogeneous Treatment Effects,” American Economic Review, September 2020, 110 (9), 2964–2996.
Dube et al. (2016) Dube, Arindrajit, T. William Lester, and Michael Reich, “Minimum Wage Shocks, Employment Flows, and Labor Market Frictions,” Journal of Labor Economics, July 2016, 34 (3), 663–704.
Fenton (1960) Fenton, L., “The Sum of Log-Normal Probability Distributions in Scatter Transmission Systems,” IRE Transactions on Communications Systems, March 1960, 8 (1), 57–67.
Gibbons et al. (2019) Gibbons, Charles E., Juan Carlos Suárez Serrato, and Michael B. Urbancic, “Broken or Fixed Effects?,” Journal of Econometric Methods, January 2019, 8 (1).
Giles (1982) Giles, David E. A., “The Interpretation of Dummy Variables in Semilogarithmic Equations: Unbiased Estimation,” Economics Letters, January 1982, 10 (1), 77–79.
Goldsmith-Pinkham et al. (2024) Goldsmith-Pinkham, Paul, Peter Hull, and Michal Kolesár, “Contamination Bias in Linear Regressions,” February 2024.
Goodman-Bacon (2021) Goodman-Bacon, Andrew, “Difference-in-Differences with Variation in Treatment Timing,” Journal of Econometrics, December 2021, 225 (2), 254–277.
Halvorsen and Palmquist (1980) Halvorsen, Robert and Raymond Palmquist, “The Interpretation of Dummy Variables in Semilogarithmic Equations,” American Economic Review, 1980, 70 (3), 474–475.
Hansen (2022) Hansen, Bruce, Econometrics, Princeton University Press, June 2022.
Imbens and Angrist (1994) Imbens, Guido W. and Joshua D. Angrist, “Identification and Estimation of Local Average Treatment Effects,” Econometrica, 1994, 62 (2), 467–475.
Kennedy (1981) Kennedy, Peter E., “Estimation with Correctly Interpreted Dummy Variables in Semilogarithmic Equations,” American Economic Review, 1981, 71 (4), 801–801.
Kovak et al. (2021) Kovak, Brian K., Lindsay Oldenski, and Nicholas Sly, “The Labor Market Effects of Offshoring by U.S. Multinational Firms,” The Review of Economics and Statistics, May 2021, 103 (2), 381–396.
Manning (1998) Manning, Willard G., “The Logged Dependent Variable, Heteroscedasticity, and the Retransformation Problem,” Journal of Health Economics, June 1998, 17 (3), 283–295.
Marone and Sabety (2022) Marone, Victoria R. and Adrienne Sabety, “When Should There Be Vertical Choice in Health Insurance Markets?,” American Economic Review, January 2022, 112 (1), 304–342.
Meghir and Palme (2005) Meghir, Costas and Mårten Palme, “Educational Reform, Ability, and Family Background,” American Economic Review, March 2005, 95 (1), 414–424.
Mehta et al. (2007) Mehta, Neelesh B., Jingxian Wu, Andreas F. Molisch, and Jin Zhang, “Approximating a Sum of Random Variables with a Lognormal,” IEEE Transactions on Wireless Communications, July 2007, 6 (7), 2690–2699.
Mogstad et al. (2021) Mogstad, Magne, Alexander Torgovitsky, and Christopher R. Walters, “The Causal Interpretation of Two-Stage Least Squares with Multiple Instrumental Variables,” American Economic Review, November 2021, 111 (11), 3663–3698.
Mullahy and Norton (2024) Mullahy, John and Edward C. Norton, “Why Transform Y? The Pitfalls of Transformed Regressions with a Mass at Zero*,” Oxford Bulletin of Economics and Statistics, 2024, 86 (2), 417–447.
Roth and Sant’Anna (2023) Roth, Jonathan and Pedro H. C. Sant’Anna, “When Is Parallel Trends Sensitive to Functional Form?,” Econometrica, 2023, 91 (2), 737–747.
Silva and Tenreyro (2006) Silva, J. M. C. Santos and Silvana Tenreyro, “The Log of Gravity,” The Review of Economics and Statistics, November 2006, 88 (4), 641–658.
Słoczyński (2022) Słoczyński, Tymon, “Interpreting OLS Estimands When Treatment Effects Are Heterogeneous: Smaller Groups Get Larger Weights,” The Review of Economics and Statistics, May 2022, 104 (3), 501–509.
Sun and Abraham (2021) Sun, Liyang and Sarah Abraham, “Estimating Dynamic Treatment Effects in Event Studies with Heterogeneous Treatment Effects,” Journal of Econometrics, December 2021, 225 (2), 175–199.
van Garderen (2001) van Garderen, Kees Jan, “Optimal Prediction in Loglinear Models,” Journal of Econometrics, August 2001, 104 (1), 119–140.
van Garderen and Shah (2002) van Garderen, Kees Jan and Chandra Shah, “Exact Interpretation of Dummy Variables in Semilogarithmic Equations,” The Econometrics Journal, June 2002, 5 (1), 149–159.
Wooldridge (2010) Wooldridge, Jeffrey M., Econometric Analysis of Cross Section and Panel Data, Second Edition, MIT Press, October 2010.
Zhang and Gou (2022) Zhang, Fengqing and Jiangtao Gou, “A Unified Framework for Estimation in Lognormal Models,” Journal of Business & Economic Statistics, October 2022, 40 (4), 1583–1595.

Appendix

Table 1: Impact of Education Reform on Earnings

	Estimates of $\bar{\rho}$				$95\%$ Confidence Intervals
	$\hat{\bar{\tau}}$	$\hat{\rho}_{a}$	$\hat{\rho}_{b}$	$\hat{\rho}_{c}$	$\bar{\tau}$	$\rho_{a}$	$\bar{\rho}$
Assuming Homogeneous Treatment Effects
All	1.422		1.433	1.429	[-0.322, 3.167]		[-0.321, 3.217]
low fedu,all abi	3.434		3.493	3.489	[1.635, 5.232]		[1.649, 5.371]
low fedu,low abi	2.569		2.602	2.593	[-0.030, 5.168]		[-0.030, 5.304]
low fedu,high abi	4.445		4.546	4.536	[1.756, 7.135]		[1.771, 7.396]
high fedu,all abi	-6.426		-6.224	-6.241	[-10.168, -2.683]		[-9.668, -2.648]
Allowing Treatment Effect Heterogeneity across Gender
All	1.422	1.433	1.434	1.427	[-0.318, 3.163]	[-0.318, 3.213]	[-0.321, 3.213]
low fedu,all abi	3.428	3.488	3.493	3.486	[1.633, 5.223]	[1.646, 5.362]	[1.647, 5.366]
low fedu,low abi	2.561	2.594	2.600	2.586	[-0.034, 5.157]	[-0.034, 5.292]	[-0.040, 5.298]
low fedu,high abi	4.429	4.529	4.545	4.528	[1.770, 7.089]	[1.785, 7.346]	[1.794, 7.354]
high fedu,all abi	-6.374	-6.175	-6.098	-6.129	[-10.140, -2.607]	[-9.643, -2.574]	[-9.573, -2.521]

1. This table replicates Table 2 column 1 of Meghir and Palme (2005), studying the ATE in percentage points of education reform on earnings. All values have been scaled by 100.
2. The first panel includes results based on regression (24), which assumes constant treatment effects. The second panel includes results based on regression (25), which allows treatment effects to vary by gender. For both regressions, standard errors are clustered by municipality of schooling. Each regression is run on 5 samples: (1) the full sample ($N=209,683$); (2) individuals with low father’s education ($N=173,435$); (3) individuals with low father’s education and low personal ability ($N=92,473$); (4) individuals with low father’s education and high personal ability ($N=80,962$); (5) individuals with high father’s education ($N=36,248$).
3. The left panel reports four estimates of $\bar{\rho}$: $\hat{\bar{\tau}}=\sum_{g}\hat{w}_{g}\hat{\tau}_{g}$, $\hat{\rho}_{a}=\operatorname{exp}(\hat{\bar{\tau}})-1$, $\hat{\rho}_{b}$ in equation (8), and $\hat{\rho}_{c}$ in equation (9). The estimator $\hat{\rho}_{d}$ is omitted as it is indistinguishable from $\hat{\rho}_{c}$ when rounded to three digits. The right panel reports 95% confidence intervals for $\bar{\tau}$ as in equation (14), for $\rho_{a}$ as in equation (16), and for $\bar{\rho}$ as in equation (23). In the homogeneous effects case, $\hat{\rho}_{a}=\hat{\rho}_{b}$, and the confidence intervals for $\rho_{a}$ are identical to those for $\bar{\rho}$; hence $\hat{\rho}_{a}$ and the CI for $\rho_{a}$ are omitted.

Table 2: Impact of Minimum Wage Raise on Teen Employment

	Estimates of $\bar{\rho}$				$95\%$ Confidence Intervals
	$\hat{\bar{\tau}}$	$\hat{\rho}_{a}$	$\hat{\rho}_{b}$	$\hat{\rho}_{c}$	$\bar{\tau}$	$\rho_{a}$	$\bar{\rho}$
All Treated Units
atet	-4.891	-4.774	-4.729	-4.738	[-6.603, -3.180]	[-6.389, -3.130]	[-6.536, -3.293]
Cohort
2004	-8.263	-7.931	-7.889	-7.896	[-10.293, -6.233]	[-9.781, -6.043]	[-10.032, -6.390]
2006	-4.479	-4.380	-4.349	-4.359	[-6.718, -2.239]	[-6.498, -2.214]	[-6.563, -2.308]
2007	-2.874	-2.833	-2.833	-2.844	[-5.752, 0.003]	[-5.589, 0.003]	[-5.589, 0.003]
Event Time
-6	-0.250	-0.250	-0.250	-0.340	[-8.583, 8.083]	[-8.225, 8.419]	[-8.225, 8.419]
-5	-3.214	-3.163	-3.147	-3.231	[-8.942, 2.514]	[-8.554, 2.546]	[-8.617, 2.437]
-4	-1.638	-1.625	-1.571	-1.640	[-6.815, 3.539]	[-6.588, 3.602]	[-6.606, 3.539]
-3	-0.192	-0.192	-0.170	-0.226	[-3.984, 3.600]	[-3.906, 3.666]	[-4.000, 3.513]
-2	0.887	0.891	0.902	0.888	[-1.222, 2.997]	[-1.215, 3.042]	[-1.329, 2.945]
0	-2.864	-2.823	-2.820	-2.830	[-4.878, -0.850]	[-4.761, -0.846]	[-4.868, -0.953]
1	-6.730	-6.508	-6.508	-6.516	[-8.667, -4.792]	[-8.302, -4.679]	[-8.440, -4.833]
2	-9.491	-9.055	-9.055	-9.062	[-12.029, -6.953]	[-11.334, -6.717]	[-11.334, -6.717]
3	-12.623	-11.859	-11.859	-11.873	[-16.065, -9.182]	[-14.841, -8.773]	[-14.841, -8.773]
Calendar Year
2004	-4.808	-4.694	-4.694	-4.697	[-6.361, -3.255]	[-6.163, -3.203]	[-6.163, -3.203]
2005	-6.130	-5.946	-5.946	-5.951	[-8.058, -4.203]	[-7.741, -4.116]	[-7.741, -4.116]
2006	-4.306	-4.215	-4.157	-4.166	[-6.662, -1.951]	[-6.445, -1.932]	[-6.517, -2.034]
2007	-4.971	-4.850	-4.801	-4.812	[-7.324, -2.618]	[-7.063, -2.584]	[-7.118, -2.639]

1. This table replicates Table 3 Panel B in Callaway and Sant’Anna (2021), exploring the impact of minimum wage on teen employment.
2. The four panels report results for the ATE in percentage points for all treated units, each cohort, each event time, and each calendar year. All values have been scaled by 100.
3. The left panel reports four estimates of $\bar{\rho}$: $\hat{\bar{\tau}}=\sum_{g}\hat{w}_{g}\hat{\tau}_{g}$, $\hat{\rho}_{a}=\operatorname{exp}(\hat{\bar{\tau}})-1$, $\hat{\rho}_{b}$ in equation (8), and $\hat{\rho}_{c}$ in equation (9). The estimator $\hat{\rho}_{d}$ is omitted as it is indistinguishable from $\hat{\rho}_{c}$ when rounded to three digits. The right panel reports 95% confidence intervals for $\bar{\tau}$ as in equation (14), for $\rho_{a}$ as in equation (16), and for $\bar{\rho}$ as in equation (23).

Table 3: Monte Carlo Results for Large $\rho$ and Normal Errors

	Estimates of $\bar{\rho}$					Empirical Rej. Rate
N	$\hat{\bar{\tau}}$	$\hat{\rho}_{a}$	$\hat{\rho}_{b}$	$\hat{\rho}_{c}$	$\hat{\rho}_{d}$	$z_{\tau}$	$z_{\mu}$
True Values
	-0.809	-0.806	0	0	0	5	5
Estimates
20	-0.721	20.288	34.375	0.677	0.040	5.961	6.797
	(61.636)	(85.942)	(97.127)	(71.749)	(71.331)
50	-0.743	6.396	11.435	0.039	0.011	5.133	5.508
	(37.165)	(41.656)	(43.832)	(39.077)	(39.065)
100	-0.813	2.493	5.286	-0.019	-0.022	5.075	5.228
	(25.599)	(26.675)	(27.485)	(26.086)	(26.086)
200	-0.832	0.770	2.543	-0.028	-0.028	5.058	5.121
	(17.882)	(18.165)	(18.526)	(18.062)	(18.062)
500	-0.831	-0.202	0.986	-0.025	-0.025	4.997	5.014
	(11.212)	(11.235)	(11.390)	(11.277)	(11.277)
1000	-0.836	-0.520	0.474	-0.029	-0.029	5.144	5.027
	(7.932)	(7.905)	(7.997)	(7.957)	(7.957)
2000	-0.846	-0.688	0.211	-0.039	-0.039	5.209	4.935
	(5.584)	(5.550)	(5.610)	(5.595)	(5.595)
5000	-0.817	-0.751	0.091	-0.009	-0.009	5.600	5.018
	(3.542)	(3.517)	(3.553)	(3.550)	(3.550)
10000	-0.825	-0.791	0.033	-0.017	-0.017	6.196	4.978
	(2.504)	(2.485)	(2.510)	(2.508)	(2.508)
100000	-0.805	-0.799	0.009	0.004	0.004	17.465	5.111
	(0.793)	(0.787)	(0.794)	(0.794)	(0.794)
1000000	-0.808	-0.805	0.001	0.001	0.001	89.760	5.042
	(0.250)	(0.248)	(0.250)	(0.250)	(0.250)

1. Monte Carlo simulation results for the case with $100\rho=(-16,-8,8,16)$ and normal errors. All results are scaled by 100.
2. The left panel reports the means and standard errors (in parentheses) across 100,000 replications of five estimates of $\bar{\rho}$: $\hat{\bar{\tau}}=\sum_{g}\hat{w}_{g}\hat{\tau}_{g}$, $\hat{\rho}_{a}=\operatorname{exp}(\hat{\bar{\tau}})-1$, $\hat{\rho}_{b}$ in equation (8), $\hat{\rho}_{c}$ in equation (9), and $\hat{\rho}_{d}$ in equation (11).
3. The right panel reports empirical rejection rates across 100,000 replications of the z-tests for $H_{0}:\bar{\tau}=0$ using $z_{\tau}$ in (13) and for $H_{0}:\bar{\rho}=0$ using $z_{\mu}$ in (21). Results of the z-test for $H_{0}:\rho_{a}=0$ using $z_{a}$ in (15) are equivalent to testing $H_{0}:\bar{\tau}=0$ using $z_{\tau}$ in (13) when $\rho_{0}=0$ and are therefore omitted.

Table 4: Monte Carlo Results for Small $\rho$