
Incorporating Prior Knowledge of Latent Group Structure in Panel Data Models

Boyuan Zhang
Amazon.com, Seattle, WA. Email: zhang.boyuan@hotmail.com. I am extremely grateful to my advisors Francis X. Diebold and Frank Schorfheide, and my dissertation committee, Xu Cheng and Minchul Shin, for their invaluable guidance and support. I would also like to thank Karun Adusumilli, Siddhartha Chib, Wayne Yuan Gao, Philippe Goulet Coulombe, and Daniel Lewis for helpful comments and suggestions. I further benefited from many helpful discussions with Econometrics Lunch seminar participants at the University of Pennsylvania, as well as participants at the 2022 North American Summer Meeting of the Econometric Society, the 2022 IAAE Annual Conference, the 42nd International Symposium on Forecasting, the 16th International Symposium on Econometric Theory and Applications, the 2022 Asian Meeting of the Econometric Society, the 2022 Australasia Meeting of the Econometric Society, the NBER-NSF Seminar on Bayesian Inference in Econometrics and Statistics, and the Philly Fed Young Scholars Conference on Machine Learning in Economics and Finance. This paper and its contents are not related to Amazon and do not reflect the position of the company or its subsidiaries. All remaining errors are my own.
Amazon
First Version: June 14, 2022
This Version: August 1, 2025
Abstract

The assumption of group heterogeneity has become popular in panel data models. We develop a constrained Bayesian grouped estimator that exploits researchers' prior beliefs on groups in the form of pairwise constraints, indicating whether a pair of units is likely to belong to the same group or different groups. We propose a prior to incorporate the pairwise constraints with varying degrees of confidence. The whole framework is built on the nonparametric Bayesian method, which implicitly specifies a distribution over the group partitions, so the posterior analysis takes the uncertainty of the latent group structure into account. Monte Carlo experiments reveal that adding prior knowledge yields more accurate coefficient estimates and scores predictive gains over alternative estimators. We apply our method to two empirical applications. In the first, forecasting U.S. CPI inflation, we illustrate that prior knowledge of groups improves density forecasts when the data are not entirely informative. The second revisits the relationship between a country's income and its democratic transition; we identify heterogeneous income effects on democracy, with five distinct groups over ninety countries.

JEL CLASSIFICATION: C11, C14, C23, E31

KEY WORDS: Grouped Heterogeneity; Bayesian Nonparametrics; Dirichlet Process Prior; Density Forecast; Inflation Rate Forecasting; Democracy and Development.

1 Introduction

Numerous studies have examined and demonstrated the important role of panel data models in empirical research throughout the social and business sciences, as the availability of panel data has increased. Using fixed-effects, panel data permit researchers to model unobserved heterogeneity across individuals, firms, regions, and countries, as well as possible structural changes over time. As individual heterogeneity is often empirically relevant, fixed-effects are an object of interest in numerous empirical studies. For example, teachers' fixed-effects are viewed as a measure of teacher quality in the literature on teacher value-added (Rockoff, 2004; Chetty et al., 2014), and heterogeneous coefficients are crucial for panel data forecasting (Liu et al., 2020; Pesaran et al., 2022). In practice, however, researchers may face a short panel where $N$ is large and $T$ is short and fixed. Applying the least squares estimator to the fixed-effects then produces a significant number of noisy estimates. To alleviate this issue, a popular and parsimonious assumption that has recently been used is to introduce a group pattern into the individual coefficients, so that units within each group have identical coefficients (Bonhomme and Manresa (2015, BM hereafter), Su et al. (2016), Bonhomme et al. (2022)).

To recover the group pattern, we essentially face a clustering problem, i.e., dividing $N$ units into several unknown groups. All existing methods for estimating group heterogeneity solve this clustering problem by assuming that units are exchangeable and treating all units equally a priori. In a cross-country application evaluating the impact of climate change on economic growth (Hsiang, 2016; Henseler and Schumacher, 2019; Kahn et al., 2021), for example, countries in different climatic zones are assumed to have equal probabilities of being grouped together. The assumption of exchangeability might not be reasonable, since correlations are common between observations at proximal locations, and researchers may have knowledge of the underlying group structure based on theories or empirical findings. For instance, Sweden and Finland, which share a border, an economic structure, and weather conditions, may have a higher chance of being in the same group than of being grouped with African countries. In such a scenario, it is preferable to use additional information to break the exchangeability between countries and facilitate grouping, as opposed to clustering based solely on observations in the sample. The availability of this information drives us to formalize such prior knowledge, which we wish to leverage to improve model performance.

In this paper, we focus on the group heterogeneity in the linear panel data model and develop a nonparametric Bayesian framework to incorporate prior knowledge of groups, which is considered additional information that does not enter the likelihood function. The prior knowledge aids in clustering units into groups and sharpens the inference of group-specific parameters, particularly when units are not well-separated.

The whole framework is built on the nonparametric Bayesian method, where we do not impose a restriction on the number of groups, and model selection is not required. The baseline model is a linear panel data model with an unknown group pattern in fixed-effects, slope coefficients, and cross-sectional error variances. We estimate the model using the stick-breaking representation (Sethuraman, 1994) of the Dirichlet process (DP) prior (Ferguson, 1973, 1974), a standard prior in nonparametric Bayesian inference. In this framework, the number of groups is considered a random variable and is subject to posterior inference. The number of groups and group membership are estimated together with the heterogeneous coefficients. Moreover, since the DP prior implicitly defines a prior distribution on the group partitionings, the posterior analysis takes the uncertainty of the latent group structure into account.

The derivation of the proposed prior starts with summarizing prior knowledge in the form of pairwise constraints, which describe a bilateral relationship between any two units. Inspired by the work of Wagstaff and Cardie (2000), we consider two types of constraints: positive-link and negative-link constraints, representing the preference of assigning two units to the same group or distinct groups. Instead of imposing these constraints dogmatically, each constraint is given a level of accuracy that shows how confident the researchers are in their choice. There is a hyperparameter that controls the overall strength of the prior knowledge: a small value partially recovers the exchangeability assumption on units, whereas a large value confines the prior distribution of group partitioning around group structure based on prior knowledge. We choose the optimal value for the hyperparameter by maximizing the marginal data density. Summarizing prior knowledge in the form of pairwise constraints is practical and flexible since it eliminates the need to predetermine the number of groups and focuses on the bilateral relationships within any subset of units.

The aforementioned pairwise constraints are used to modify the standard DP prior. In particular, the pairwise constraints are combined with the prior distribution of the group partitioning, shrinking the distribution toward our prior knowledge. We refer to the estimator using the proposed prior as the Bayesian group fixed-effects (BGFE) estimator.

We derive a posterior sampling algorithm for the framework with the modified DP prior. Adopting conjugate priors on group-specific coefficients allows for drawing directly from posteriors using a computationally efficient Gibbs sampler. With the newly proposed prior, it can be shown that, compared to the framework that uses a standard DP prior, all that is needed to implement pairwise constraints is a simple modification to the posterior of the group indices.

The pairwise constraint-based framework is closely related and applicable to other models where group structure plays a role. Although we concentrate primarily on the panel data model, the DP prior with pairwise constraints applies to models without the time dimension, such as the standard clustering problem and the estimation of heterogeneous treatment effects. The framework is also applicable to estimating panel VARs, which involve multiple dependent variables. The group structure is used to overcome overparameterization and overfitting issues by clustering the VAR coefficients into groups, and pairwise constraints add additional information to the highly parameterized model. Moreover, the proposed Gibbs sampler with pairwise constraints is connected to KMeans-type algorithms, motivating a frequentist counterpart of our estimator with a fixed $K$. Essentially, the assignment step in the Pairwise Constrained-KMeans algorithm (Basu et al., 2004a), a constrained version of the KMeans algorithm (MacQueen et al., 1967), is remarkably similar to the step of drawing a group membership indicator from its posterior. An exact equivalence can be achieved by applying small-variance asymptotics to the posterior densities under certain conditions. To obtain the frequentist analog of our pairwise constrained Bayesian estimator, one can apply the approach in BM with the Pairwise Constrained-KMeans algorithm.

We compare the performance of the BGFE estimator to alternative estimators using simulated data. The Monte Carlo simulation demonstrates that the BGFE estimator generates more accurate estimates of the group-specific parameters and the number of groups than the BGFE estimator that omits the constraints. The improved performance is mostly attributable to the precise estimation of the group structure. The BGFE estimator clearly dominates estimators that omit the group structure by assuming homogeneity or full heterogeneity. We also evaluate the performance of one-step-ahead point, set, and density forecasts. Unsurprisingly, the accurate estimates translate into predictive power for the underlying model; the BGFE estimator outperforms the rest of the estimators.

We apply the proposed method to two empirical applications. An application to forecasting the inflation of the U.S. CPI sub-indices demonstrates that the suggested predictor yields more accurate density predictions. The better forecasting performance is mostly attributable to three key characteristics: the nonparametric Bayesian prior, prior belief on group structure, and grouped cross-sectional heteroskedasticity. In a second application, we revisit the relationship between a country’s income and its democratic transition. This question was originally studied by Acemoglu et al. (2008), who demonstrate that the positive income effect on democracy disappears if country fixed effects are introduced into the model. The proposed framework recovers a group structure with a moderate number of groups. Each group has a clear and distinct path to democracy. In addition, we identify heterogeneous income effects on democracy and, contrary to the initial findings, show that a positive income effect persists in some groups of countries, though quantitatively small.

Literature. This paper relates to the econometric literature on clustering in panel data models. Early contributions include Sun (2005) and Buchinsky et al. (2005). Hahn and Moon (2010) provide economic and theoretical foundations for fixed effects with a finite support. Most recent works focus on linear panel data models with discrete unobserved group heterogeneity (see Wang and Su (2021) and Bonhomme et al. (2022), among others, for procedures to identify latent group structures in nonlinear panel data models). Lin and Ng (2012) and Sarafidis and Weber (2015) apply the KMeans algorithm to identify the unobserved group structure of slope coefficients. Bonhomme and Manresa (2015) also use the KMeans algorithm to recover the group pattern, but they assume group structure in the additive fixed effects. Bonhomme et al. (2022) modify this method and split the procedure into two steps: they first classify individuals into groups using the KMeans algorithm and then estimate the coefficients. Ando and Bai (2016) improve on BM's approach by allowing for group structure among the interactive fixed effects; the underlying factor structure in the interactive fixed effects is the key to forming groups. Su et al. (2016) develop a new variant of Lasso to shrink individual slope coefficients to unknown group-specific coefficients. This method is then extended by Su and Ju (2018) and Su et al. (2019). Freeman and Weidner (2022) consider two-way grouped fixed effects that allow for different group patterns in the time and cross-sectional dimensions. Okui and Wang (2021) and Lumsdaine et al. (2022) identify structural breaks in parameters along with grouped patterns. From the Bayesian perspective, Kim and Wang (2019), Zhang (2020), and Liu (2022) adopt the Dirichlet process prior to estimate grouped heterogeneous intercepts in linear panel data models in the semiparametric Bayesian framework. Moon (2023) incorporates a version of a spike and slab prior to recover one core group of units. Alternative methods, such as binary segmentation (Wang et al., 2018), and assumptions, such as a multiple latent groups structure (Cheng et al., 2019; Cytrynbaum, 2021), have also been explored, enriching the group heterogeneity literature.

Our work concerns prior knowledge. Bonhomme and Manresa (2015)'s grouped fixed-effects (GFE) estimator is able to include prior knowledge, but it is plagued by practical issues to some extent. They add a collection of individual group probabilities as a penalty term in the objective function, an $N \times K$ matrix describing the probability of assigning each unit to each potential group. This additional penalty term balances the respective weights attached to prior and data information in estimation. The main challenge is providing the set of individual group probabilities for each potential value of $K$, as the underlying KMeans algorithm requires model selection. It is rather cumbersome to assess these probabilities for each possible $K$ and to adjust for changes in reallocating probabilities across $K$.

To our knowledge, only Aguilar and Boot (2022) explore heterogeneous error variance; they extend BM's GFE estimator to allow for group-specific error variances. They modify the objective function to avoid the singularity issue in the pseudo-likelihood. Although their work paves the way for identifying groups in the error variance, their framework is not yet ready to satisfactorily incorporate prior knowledge, because they face the same issue as BM. Building on these works, we investigate the value of prior knowledge of group structure.

This paper also relates to the literature on constraint-based semi-supervised clustering in statistics and computer science. Pairwise constraints have been widely implemented in numerous models and have been shown to improve clustering performance. In the past two decades, various pairwise constrained KMeans algorithms using prior information have been suggested (Wagstaff et al., 2001; Basu et al., 2002, 2004a; Bilenko et al., 2004; Davidson and Ravi, 2005; Pelleg and Baras, 2007; Yoder and Priebe, 2017). Prior information has also been introduced in model-based methods. Shental et al. (2003) develop a framework to incorporate prior information for density estimation with Gaussian mixture models. The Dirichlet process mixture model with pairwise constraints is discussed in Vlachos et al. (2008, 2009, 2010), Orbanz and Buhmann (2008), and Ross and Dy (2013). Lu and Leen (2004), Lu (2007), and Lu and Leen (2007) assume the knowledge on constraints is incomplete and penalize the constraints in accordance with their weights. Law et al. (2004) extend Shental et al. (2003) to allow for soft constraints in the mixture model by adding another layer of latent variables for the group label. Nelson and Cohen (2007) propose a new framework that samples pairwise constraints given a set of probabilities related to the weights of constraints.

Our paper is closely related to Paganin et al. (2021), who address a similar problem using a novel Bayesian framework. Their proposed method shrinks the prior distribution of group partitioning toward a full target group structure, an initial clustering of all units provided by experts. This is demanding, since not every application has a full target group structure available, as their birth defect epidemiology study did. Our framework circumvents this problem by using pairwise constraints, which can be flexibly assigned to any two units. In addition, the induced shrinkage of their framework is produced by the distance function defined by the Variation of Information (Meilă, 2007). It can be demonstrated that a partition can readily become caught in local modes, preventing it from ever shrinking toward the prior partition. The use of pairwise relationships in this paper circumvents this issue as well: holding the group indices of other pairs fixed, our framework ensures that a partition in which a specific pair conforms to our prior belief has a higher prior probability than a partition in which that pair goes against our prior belief.

Outline. In Section 2, we present the specification of the dynamic panel data model with group patterns in slope coefficients and error variances and provide details on nonparametric Bayesian priors without prior knowledge, which are then extended to accommodate soft pairwise constraints. Section 3 focuses on the posterior analysis, where the posterior sampling algorithm is provided; we also highlight the posterior estimate of group structure and discuss the connection to constrained KMeans models. In Section 4, we present empirical analyses in which we forecast the inflation rates of U.S. CPI sub-indices and estimate a country's income effect on its democracy. We briefly discuss extensions of the baseline model in Section 5. Finally, we conclude in Section 6. Monte Carlo simulations, additional empirical results, and proofs are relegated to the appendix.

2 Model and Prior Specification

We begin our analysis by setting up a linear panel data model with group heterogeneity in intercepts, slope coefficients, and cross-sectional innovation variance. We then elaborate a nonparametric Bayesian prior for the unknown parameters that takes prior beliefs in the group pattern into account. We briefly highlight several key concepts of a standard nonparametric Bayesian prior as our proposed prior inherits some of its properties.

2.1 A Basic Linear Panel Data Model

We consider a panel with observations for cross-sectional units $i = 1, \ldots, N$ in periods $t = 1, \ldots, T$. Given the panel data set $(y_{it}, x_{it}')$, a basic linear panel data model with grouped heterogeneous slope coefficients and grouped heteroskedasticity takes the following form:

$y_{it} = \alpha_{g_i}' x_{it} + \varepsilon_{it}, \quad \varepsilon_{it} \sim N\left(0, \sigma_{g_i}^{2}\right),$ (2.1)

where $x_{it}$ is a $p \times 1$ vector of covariates, which may contain an intercept, lagged $y_{it}$, and other informative covariates; $\alpha_{g_i}$ denotes the group-specific slope coefficients (including intercepts); and $\sigma_{g_i}^{2}$ is the group-specific variance. $g_i \in \{1, \ldots, K\}$ is the latent group index with an unknown number of groups $K$. The $\varepsilon_{it}$ are idiosyncratic errors that are independent across $i$ and $t$ conditional on $g_i$. They feature zero mean and grouped heteroskedasticity $\sigma_{g_i}^{2}$, with cross-sectional homoskedasticity being a special case where $\sigma_{g_i}^{2} = \sigma^{2}$. This setting leads to a heterogeneous panel with the group pattern modeled through both $\alpha_{g_i}$ and $\sigma_{g_i}^{2}$.

It is convenient to reformulate the model in (2.1) in matrix form by stacking all observations for unit $i$:

$\boldsymbol{y_i} = \boldsymbol{x_i} \alpha_{g_i} + \boldsymbol{\varepsilon_i}, \quad \boldsymbol{\varepsilon_i} \sim N\left(\mathbf{0}, \sigma_{g_i}^{2} I_T\right),$ (2.2)

where $\boldsymbol{y_i} = [y_{i1}, y_{i2}, \ldots, y_{iT}]'$, $\boldsymbol{x_i} = [x_{i1}, x_{i2}, \ldots, x_{iT}]'$, $\boldsymbol{\varepsilon_i} = [\varepsilon_{i1}, \varepsilon_{i2}, \ldots, \varepsilon_{iT}]'$, and $G = [g_1, \ldots, g_N]$ is the vector of group indices.

Group structure is the key element in our approach. It can be represented either as a vector of group indices $G$ describing to which group each unit belongs, or as a collection of disjoint blocks $\mathcal{B} = \{B_1, B_2, \ldots, B_K\}$ induced by $G$, where $B_k$ contains all the units in the $k$-th group and $K$ is the number of groups in the sample of size $N$. $|B_k|$ denotes the cardinality of the set $B_k$, with $\sum_{k=1}^{K} |B_k| = N$.
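To make the two representations concrete, the following minimal Python sketch (not part of the paper; all parameter values are hypothetical) simulates the grouped panel in (2.1) and converts the group-index vector $G$ into the block representation $\mathcal{B}$:

```python
# Minimal sketch: simulate the grouped panel in (2.1) with hypothetical
# parameter values, then convert the index vector G into the blocks B_k.
import numpy as np

rng = np.random.default_rng(0)
N, T, K, p = 30, 10, 3, 2
alpha = np.array([[1.0, 0.5], [0.0, -0.3], [2.0, 0.8]])  # K x p group-specific slopes
sigma2 = np.array([0.5, 1.0, 2.0])                       # group-specific variances
G = rng.integers(0, K, size=N)                           # latent group indices g_i

x = rng.normal(size=(N, T, p))                           # covariates x_it
eps = rng.normal(size=(N, T)) * np.sqrt(sigma2[G])[:, None]
y = np.einsum("itp,ip->it", x, alpha[G]) + eps           # y_it = alpha_{g_i}' x_it + eps_it

# Block representation: B_k collects all units assigned to group k.
blocks = {k: np.flatnonzero(G == k) for k in np.unique(G)}
assert sum(len(b) for b in blocks.values()) == N         # the blocks partition the sample
```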

Remark 2.1.

Identification issues may arise with certain specifications. If the grouped fixed-effects in $\alpha_{g_i}$ are allowed to vary over time, for example, $\sigma_{g_i}^{2}$ cannot be identified when the group $g = g_i$ contains only one unit. Aguilar and Boot (2022) propose a solution: they suggest using the square-root objective function rather than the pseudo-log-likelihood function, which replaces the logarithm of $\sigma_i^{2}$ with the square root of $\sigma_i^{2}$, to avoid the singularity problem. This problem, however, is beyond the scope of this work.

Following Sun (2005), Lin and Ng (2012), and BM, we assume that the composition of groups does not change over time. In addition, for any groups $k \neq k'$, we assume that they have different slope coefficients, i.e., $\alpha_k \neq \alpha_{k'}$, and that no single unit can simultaneously belong to both groups: $B_k \cap B_{k'} = \emptyset$. Note that these assumptions are used to simplify the prior construction and are not necessary to incorporate prior knowledge. As we show in Section 5, both assumptions can be relaxed by using slightly different priors.

The primary objective of this paper is to estimate the group-specific slope coefficients $\alpha_{g_i}$, the group-specific variances $\sigma_{g_i}^{2}$, the group membership $G$, as well as the unknown number of groups $K$, using the full sample and prior knowledge of the group structure. Given estimates of the group-specific coefficients, we are able to offer point, set, and density forecasts of $y_{iT+h}$ for each unit $i$. Throughout this paper, we concentrate on the one-step-ahead forecast where $h = 1$. For multiple-step forecasting, the procedure can be extended by iterating $y_{iT+h}$ forward in accordance with (2.1) given the parameter estimates, or by estimating the model in the style of direct forecasting. The method proposed in this paper is applicable beyond forecasting. In certain applications, the heterogeneous parameters themselves are the objects of interest. For example, the technique developed here can be adapted to infer group-specific heterogeneous treatment effects.

2.2 Nonparametric Bayesian Prior with Knowledge on $G$

We propose a nonparametric Bayesian prior for the unknown parameters with prior beliefs on the group pattern. Figure 1 provides a preview of the procedure for introducing prior knowledge into the model. We propose to use pairwise constraints to summarize researchers' prior knowledge, with each constraint accompanied by a hyperparameter $W$ indicating the researchers' level of confidence in their choice. $W$ is then incorporated directly in the prior distribution of the group partition $G$, which is induced from a standard nonparametric Bayesian prior, yielding a new prior. We elaborate the details throughout this subsection and highlight the clustering properties of the underlying nonparametric Bayesian priors in Section 2.3.

Figure 1: Graphical Representation of the Proposed Prior
[Diagram: pairwise constraints summarizing prior knowledge construct the weights $W_{ij}$, which enter the prior $p(G)$ on the group partition $G$; combined with the nonparametric Bayesian prior on $\alpha_{g_i}$ and $\sigma^2_{g_i}$, this forms the full prior, which meets the data through the likelihood.]

2.2.1 A New Prior with Soft Pairwise Constraints

The derivation of the proposed prior starts from summarizing prior knowledge in the form of pairwise constraints, which describe a bilateral relationship between any two units. Inspired by the literature on semi-supervised learning (Wagstaff and Cardie, 2000), we consider two types of pairwise constraints: (1) positive-link (PL) constraints, $\mathcal{P}$, and (2) negative-link (NL) constraints, $\mathcal{N}$. (We essentially follow the idea of the pairwise constraints in Wagstaff and Cardie (2000); to better convey the beliefs attached to the constraints, we use the names positive-link and negative-link rather than must-link and cannot-link.) A positive-link constraint specifies that two units are more likely to be assigned to the same group, whereas a negative-link constraint indicates that the units are prone to be assigned to different groups.

Instead of imposing these constraints dogmatically, the constraint between units $i$ and $j$ is given a hyperparameter $W_{ij}$ which describes how confident the researchers are in their choice for each type of constraint. $W_{ij}$ is continuously valued on the real line, as depicted in Figure 2. On the one hand, the sign of $W_{ij}$ specifies the constraint type, with a positive (negative) value indicating a PL (NL) constraint between $i$ and $j$. On the other hand, the absolute value of $W_{ij}$ reflects the strength of the prior belief. We become increasingly confident in our prior belief on units $i$ and $j$ as $|W_{ij}| \to \infty$. If $|W_{ij}| = \infty$, we essentially impose the constraint, which is known as a hard PL/NL constraint. Otherwise, it is a soft PL/NL constraint with a nonzero and finite $W_{ij}$. $W_{ij} = 0$ if there is no prior belief on units $i$ and $j$.

Figure 2: Relationship Between $W_{ij}$ and Pairwise Constraints
[Number line: $W_{ij} = -\infty$ (hard NL); $W_{ij} < 0$ (soft NL); $W_{ij} = 0$ (no information); $W_{ij} > 0$ (soft PL); $W_{ij} = +\infty$ (hard PL).]

We assume the weight $W_{ij}$ is a logit function of two user-defined hyperparameters, accuracy $\psi_{ij}$ and type $T_{ij}$ (this parametric form is related to the penalized probabilistic clustering proposed by Lu and Leen (2004, 2007); see the detailed discussion in Appendix D.2):

$W_{ij} = T_{ij} \ln\left(\frac{\psi_{ij}}{1 - \psi_{ij}}\right).$ (2.3)

Accuracy, $\psi_{ij} \in [0.5, 1)$, describes the user-specified probability that the constraint assigned to units $i$ and $j$ is correct given our prior preference. Specifically, $\psi_{ij} \to 1$ implies the constraint between $i$ and $j$ is effectively imposed, since we are confident that it is accurate, while specifying $\psi_{ij} = 0.5$ is equivalent to a random guess, i.e., no information is provided. $\psi_{ij}$ is bounded below by 0.5, following the assumption that leaving the pair unrestricted is more rational than setting a less likely constraint. The type of constraint is denoted by $T_{ij}$: $T_{ij} = 1$ if units $i$ and $j$ are specified to be positive-linked, and $T_{ij} = -1$ for an NL constraint. If the pair $(i, j)$ does not involve any constraint, we set $T_{ij} = 0$.
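For concreteness, a small sketch of the mapping (2.3) from a user-specified (type, accuracy) pair to the weight $W_{ij}$; the numeric examples are hypothetical:

```python
import numpy as np

def constraint_weight(T_ij: int, psi_ij: float) -> float:
    """Weight W_ij in (2.3): the sign comes from the constraint type T_ij
    (+1 for PL, -1 for NL, 0 for none) and the magnitude from the log prior
    odds of the accuracy psi_ij in [0.5, 1)."""
    if T_ij == 0 or psi_ij == 0.5:
        return 0.0                               # no information on the pair
    return T_ij * np.log(psi_ij / (1.0 - psi_ij))

print(constraint_weight(+1, 0.9))                # confident PL: W ~  2.20
print(constraint_weight(-1, 0.6))                # weak NL:      W ~ -0.41
```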

To incorporate these constraints into the prior, we propose modifying the exchangeable partition probability function (EPPF), i.e., the prior distribution of group indices $p(G)$, of the baseline Dirichlet process, which we highlight in Section 2.3. A group partition receives a strictly higher (lower) probability if it is consistent (inconsistent) with the pairwise constraints. As a result, the induced prior on the group indices $G$ depends directly on the characteristics of the user-specified pairwise constraints and is able to increase or decrease the likelihood of a certain $G$.

In the presence of soft constraints, we modify the EPPF by multiplying it by a function of the characteristics of the constraints,

$p(G|\psi, T) \;\propto\; p(G)\,\pi(\psi, T|G) = p(G) \prod_{i,j} \left(\frac{\psi_{ij}}{1 - \psi_{ij}}\right)^{c T_{ij} \delta_{ij}(G)},$ (2.4)

where $\psi_{ij}/(1 - \psi_{ij})$ is the prior odds for the constraint between units $i$ and $j$, and $\delta_{ij}(G)$ is a transformed Kronecker delta function such that

$\delta_{ij}(G) = \begin{cases} 1 & \text{if } g_i = g_j \\ -1 & \text{if } g_i \neq g_j \end{cases},$ (2.5)

and $c$ is a positive number that controls the overall strength of the prior belief. For $c \to 0$, $p(G|\psi, T)$ corresponds to the baseline EPPF $p(G)$, while for $c \to \infty$, $p(G = G^*|\psi, T) \to 1$, where $G^*$ satisfies all pairwise constraints.

Remark 2.2.

Due to the presence of pairwise constraints, the partition probability function presented in (2.4) no longer satisfies the exchangeability assumption, as we now distinguish units within each group.

Remark 2.3.

The current framework also enables us to impose constraints exactly. A hard constraint is an extreme case of a soft constraint and is thus handy to implement, requiring only setting $\psi_{ij} \to 1$ for the pair $(i, j)$. Intuitively, any group partition violating the pairwise constraint between $i$ and $j$ (i.e., $T_{ij}\delta_{ij}(G) = -1$) will have zero probability, since for such a partition,

$\left(\frac{\psi_{ij}}{1 - \psi_{ij}}\right)^{c T_{ij} \delta_{ij}(G)} \to \left(\frac{1}{\infty}\right)^{c} = 0, \text{ which implies } p(G|\psi, T) = 0,$

and hence the constraint on $(i, j)$ is imposed and referred to as a hard constraint, as opposed to a soft constraint. By assigning proper $\psi_{ij}$ for the pairs $(i, j)$, we can flexibly combine soft and hard constraints inside a single specification.

Remark 2.4.

Soft pairwise constraints solve the transitivity issue that might be a problem for hard pairwise constraints. For instance, if we have $(1,2) \in \mathcal{P}$ and $(2,3) \in \mathcal{P}$, we can still have $(1,3) \in \mathcal{N}$ in the framework of soft pairwise constraints, since it preserves the possibility of violating any of these constraints. This is not the case with hard pairwise constraints, as $(1,2) \in \mathcal{P}$ and $(2,3) \in \mathcal{P}$ implies $(1,3) \in \mathcal{P}$ by transitivity.

With the definition of $W_{ij}$, we rewrite the partition probability function defined in (2.4) in terms of $W_{ij}$ to ease notation,

$p(G|\psi, T) = p(G|W) = \mathcal{C}(W, G, c)^{-1}\, p(G) \exp\left[c \sum_{i,j} W_{ij}\, \delta_{ij}(G)\right],$ (2.6)

where

$\mathcal{C}(W, G, c) = \sum_{G'} p(G') \exp\left[c \sum_{i,j} W_{ij}\, \delta_{ij}(G')\right],$ (2.7)

is a normalization constant; we use the prior $p(G|W)$ hereinafter. In practice, we first specify $(T_{ij}, \psi_{ij}) = (\text{type}, \text{accuracy})$ for the constraint between units $i$ and $j$ and then construct the corresponding weight $W_{ij}$ via equation (2.3).
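As an illustration, a sketch of the unnormalized log tilt in (2.6) for a candidate partition $G$, given a symmetric weight matrix $W$ with zero diagonal (the double sum is taken over ordered pairs $i \neq j$, consistent with the factor $4cW_{12}$ in (2.8) below):

```python
import numpy as np

def log_constraint_tilt(G: np.ndarray, W: np.ndarray, c: float) -> float:
    """Log of the factor exp[c * sum_{i,j} W_ij * delta_ij(G)] in (2.6),
    with delta_ij(G) = +1 if g_i = g_j and -1 otherwise, as in (2.5).
    W is symmetric with zero diagonal; the sum runs over ordered pairs."""
    delta = np.where(G[:, None] == G[None, :], 1.0, -1.0)
    np.fill_diagonal(delta, 0.0)          # no self-constraints
    return c * np.sum(W * delta)

# The relative prior weight of two partitions G1, G2 is
# p(G1) * exp(log_constraint_tilt(G1, W, c)) versus the same for G2;
# the normalization constant C(W, G, c) in (2.7) cancels in the comparison.
```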

Remark 2.5.

In the particular case where we do not have any constraint information, $\exp\left(c \sum_{i,j} W_{ij} \delta_{ij}\right)$ reduces to 1, as $W_{ij} = 0$ for all $i$ and $j$, and we recover the original DP prior. Hence, our method can cater to all levels of supervision, ranging from hard constraints to a complete lack of constraints.

2.2.2 The Effect of Constraints and Scaling Constant on Group Partitioning

The function $\pi(\psi, T|G)$ in equation (2.4) is crucial in shifting the prior probability of $G$. By design, $T_{ij}\delta_{ij} = 1$ when the constraint between $i$ and $j$ is met in the group partitioning defined by $G$; the prior probability for $G$ is therefore increased, since $\left[\psi_{ij}/(1 - \psi_{ij})\right]^{c} > 1$. Similarly, if a group partitioning $G$ violates the constraint between $i$ and $j$, then $T_{ij}\delta_{ij} = -1$ and the prior probability for $G$ drops, due to $\left[\psi_{ij}/(1 - \psi_{ij})\right]^{-c} < 1$. Therefore, with $\pi(\psi, T|G)$, the resulting group partition is shrunk toward our prior knowledge without imposing any constraint dogmatically.

To fix ideas, consider a simplified scenario with $N = 2$ units, where there are at most two groups. For illustrative purposes, we set the concentration parameter to $a = 1$ so that $\Pr(g_1 = g_2) = \Pr(g_1 \neq g_2) = 0.5$ if no constraint exists. (Antoniak (1974) provides analytical formulas for the probabilities of more general events with larger $N$; in this example, $\Pr(g_1 = g_2) = \frac{1}{a+1}$.) When $N = 2$, listing all partitions $G$ is possible, $G \in \{(1,1), (1,2), (2,1), (2,2)\}$, and we can calculate the probability of each $G$ using (2.6). As a result, we are able to derive the probability of units 1 and 2 belonging to the same or different groups, i.e., analytical formulae for $\Pr(g_1 = g_2)$ and $\Pr(g_1 \neq g_2)$, which neatly demonstrate the effect of $\psi$ and $c$ on group partitioning.

It is straightforward to express $\Pr(g_1 = g_2)$ as a function of $\psi_{12}$, $T_{12}$, and $c$:

$\Pr(g_1 = g_2) = \frac{1}{1 + \exp(-4cW_{12})} = \frac{1}{1 + \left(\frac{\psi_{12}}{1 - \psi_{12}}\right)^{-4cT_{12}}}.$ (2.8)

Figure 3 traces out equation (2.8) for a range of $c$ values. The left panel (a) displays the curve for a PL constraint. First, observe that when $c = 0$, $\Pr(g_1 = g_2)$ remains unchanged at 0.5 regardless of the value of $\psi$: this is the situation in which $c$ eliminates the constraint's effect on the prior. Next, given a particular $c$, $\Pr(g_1 = g_2)$ increases in $\psi$, which means that a stronger soft PL constraint between units 1 and 2 leads to a higher chance of assigning both units to the same group. When $\psi$ is fixed, increasing $c$ results in a higher $\Pr(g_1 = g_2)$, indicating that a larger $c$ magnifies the effect of the PL constraint. In contrast, panel (b) depicts the curve with an NL constraint. $\psi_{12}$ and $c$ clearly have the opposite effect on $\Pr(g_1 = g_2)$: $\Pr(g_1 = g_2)$ drops significantly as $\psi_{12}$ or $c$ increases. Notably, even with a large $\psi_{12}$, the soft constraint framework maintains the possibility of breaching the constraint, which is another important feature that preserves the chance of correctly assigning group indices even if the constraint is erroneous.

Figure 3: $\Pr(g_1 = g_2)$ as a Function of $\psi_{12}$ and $c$
[Two panels: (a) Positive-Link ($T_{12} = 1$); (b) Negative-Link ($T_{12} = -1$).]
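Equation (2.8) can also be verified numerically by enumerating the two events for $N = 2$ under the baseline $a = 1$; a sketch with hypothetical values of $c$ and $\psi_{12}$:

```python
import numpy as np

c, T12, psi12 = 0.8, +1, 0.9                 # hypothetical PL constraint
W12 = T12 * np.log(psi12 / (1 - psi12))      # weight from (2.3)

# With a = 1 the baseline EPPF puts mass 0.5 on {g1 = g2} and on {g1 != g2};
# tilt each event by exp[c * sum_{i,j} W_ij * delta_ij(G)] over ordered pairs.
w_same = 0.5 * np.exp(+2 * c * W12)          # delta_12 = delta_21 = +1
w_diff = 0.5 * np.exp(-2 * c * W12)          # delta_12 = delta_21 = -1
pr_same = w_same / (w_same + w_diff)

assert np.isclose(pr_same, 1 / (1 + np.exp(-4 * c * W12)))   # equation (2.8)
print(pr_same)                               # ~0.999 for these values
```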

In the general case where numerous PL and NL constraints are enforced, $c$ affects all constraints simultaneously. In other words, the value of $c$ determines the overall “strength” of the prior belief on $G$. If the prior belief is coherent with the true group partition, it is preferable to have a large $c$ to intensify the effect of the constraints, allowing prior information to take precedence over data information, and vice versa.

Remark 2.6.

We propose to find the optimal $c$ that maximizes the marginal data density using grid search; see the details in Appendix 4.1.2. Alternatively, the scaling constant $c$ can be pair-specific and data-driven. Basu et al. (2004b), for instance, assume that $c_{ij}$ is a function of observables, i.e., $c_{ij} = \psi(x_i, x_j; T_{ij})$, where $\psi(x_i, x_j; T_{ij})$ is monotonically increasing (decreasing) in the distance between units $i$ and $j$ if they are involved in positive-link (negative-link) constraints. This reflects the belief that if two more distant units are assumed to be positive-linked (negative-linked), their constraint should be given more (less) weight.

2.2.3 Specification of Soft Pairwise Constraints

In practice, it is convenient to establish soft pairwise constraints based on existing information on groups, even if that information does not reflect the genuine group partitioning. In the empirical analysis, for instance, we use the official expenditure categories of the CPI sub-indices to construct soft pairwise constraints. When information on the group partitioning is insufficient, especially when the number of units is large, such official expenditure categories may serve as a trustworthy starting point. Before formalizing the idea, we first introduce the prior similarity matrix $\underline{\pi}^S$, an $N \times N$ symmetric matrix describing the prior probability of any two units belonging to the same group, i.e., $\underline{\pi}^S_{ij} = \Pr(g_i = g_j)$ conditional on all hyperparameters in the prior.

The general idea is to derive soft pairwise constraints using existing information on a preliminary group partitioning $\underline{G}$, which is allowed to involve only a subset of units. We start with the type of constraint $T_{ij}$ between any two units. Given a preliminary group structure, such as the expenditure categories, we specify PL constraints for all pairs of units within the same group and NL constraints for all pairs of units from different groups. This means that we believe the preliminary group structure is correct a priori. Although more elaborate and subtle constraints might be implemented, this rough specification is usually a good starting point.

The accuracy $\psi_{ij}$ of the constraints is then specified. When our prior knowledge is limited or the number of units is large, we cannot specify $\psi_{ij}$ for all pairs with solid knowledge of them. Instead, one desirable yet simple choice is to base $\psi_{ij}$ again on the preliminary group partitioning $\underline{G}$. More specifically, all units in the same group are positive-linked with identical $\psi^{PL}_{ij}$, i.e., for units $i$ and $j$ from the same group $\underline{g}_i = \underline{g}_j = \underline{g}$, we have $\psi^{PL}_{ij} = c_{\underline{g}}$. Units from different groups are assumed to be negative-linked with identical $\psi^{NL}_{ij}$, i.e., for units $i$ and $j$ from distinct groups, we assume $\psi^{NL}_{ij} = c_{\underline{g}_i \underline{g}_j}$ with $c_{\underline{g}_i \underline{g}_j} = c_{\underline{g}_j \underline{g}_i}$. Following this strategy, $\psi_{ij}$ depends solely on $\underline{G}$, and hence two units from the same group have identical soft pairwise constraints with all other units. Notice that the number of possible distinct $\psi_{ij}$ reduces from $N(N-1)/2$ to $\bar{K}(\bar{K}+1)/2$, where $\bar{K}$ is the number of groups in $\underline{G}$ and $\bar{K} \ll N$.
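A sketch of this specification strategy, building the weight matrix $W$ from a preliminary partition $\underline{G}$; for simplicity it uses one common PL accuracy and one common NL accuracy rather than the group-pair-specific accuracies $c_{\underline{g}_i \underline{g}_j}$, and units outside $\underline{G}$ are marked with $-1$ and left unconstrained:

```python
import numpy as np

def weights_from_preliminary(G_bar, psi_pl=0.8, psi_nl=0.7):
    """Build W from a preliminary partition: PL constraints (accuracy psi_pl)
    within a preliminary group, NL constraints (accuracy psi_nl) across
    groups; entries of -1 in G_bar mark units with no prior information."""
    G_bar = np.asarray(G_bar)
    N = len(G_bar)
    W = np.zeros((N, N))
    for i in range(N):
        for j in range(i + 1, N):
            if G_bar[i] < 0 or G_bar[j] < 0:
                continue                                  # unconstrained pair
            if G_bar[i] == G_bar[j]:
                w = np.log(psi_pl / (1 - psi_pl))         # positive-link
            else:
                w = -np.log(psi_nl / (1 - psi_nl))        # negative-link
            W[i, j] = W[j, i] = w
    return W

# Units 0-1 share a preliminary group, 2-3 share another, unit 4 is unknown.
W = weights_from_preliminary([0, 0, 1, 1, -1])
```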

This framework permits having no prior belief on certain units. If at least one unit in a pair $(i, j)$ is not included in $\underline{G}$, we assume that this pair of units is free of constraints, and we set $T_{ij}$ to 0 (or $\psi_{ij}$ to 0.5) in the prior. Note that the absence of a constraint does not ensure that units $i$ and $j$ are completely unrelated. Instead, if both $i$ and $j$ are involved in constraints with a third unit $k$, or are connected through a series of $l$ constraints, $i \leftrightarrow k_1 \leftrightarrow \cdots \leftrightarrow k_l \leftrightarrow j$, then the prior probability of $i$ and $j$ belonging to the same group differs from the prior probability without any constraints. If we wish to prevent two units from linking a priori, they must not be subject to any constraints with the remaining units.

The aforementioned specification strategy induces a block prior similarity matrix, i.e., for a unit $i$, $\underline{\pi}^S_{ij} = \underline{\pi}^S_{ik}$ if $\underline{g}_j = \underline{g}_k$. Intuitively, if two units have identical soft pairwise constraints and hence posit an identical relationship with all other units, they are equivalent and exchangeable. As a result, these units should have an equal prior probability of sharing the same group index with any other unit. More formally,

Theorem 1 (Stochastic Equivalence).

Given two units $j, k$ from the same prior group, if $\psi_{jm} = \psi_{km}$ for all $m = 1, 2, \ldots, N$, then $\Pr(g_i = g_j) = \Pr(g_i = g_k)$ for every unit $i$ in the prior, given weights $W$.

Theorem 1 echoes the concept of stochastic equivalence (Nowicki and Snijders, 2001) in the stochastic block model (SBM) (Holland et al., 1983); for a more comprehensive review of the SBM, see Lee and Wilkinson (2019). In less technical terms, for nodes $p$ and $q$ in the same group, $p$ has the same (and independent) probability of connecting with a node $r$ as $q$ does. Interestingly, this relationship is not coincidental. The prior draw of group membership with the aforementioned specification of $T_{ij}$ and $\psi_{ij}$ can be viewed as a simulation of a simple SBM. A simple SBM has two essential components: a vector of group memberships and a block matrix, each element of which represents the edge probability of two nodes given their group memberships. In our case, the preliminary group structure serves as the group membership in the SBM, and the DP prior together with the weight (or $T_{ij}$ and $\psi_{ij}$) of each constraint induces a prior similarity probability comparable to the block matrix.

Consider an example of 90 units. The preliminary group structure divides the units into 3 groups, with groups 1, 2, and 3 containing 25, 30, and 35 units, respectively. Figure 4 shows the prior similarity matrix, which is based on the aforementioned specification strategy and thus becomes a block matrix with equal entries in each of the nine blocks. Units within the same group are stochastically equivalent, as their prior probabilities of being grouped not only with each other but also with units from other groups are the same. As a result, the similarity probability of each pair depends solely on the units' preliminary memberships (and $\psi$).

Figure 4: Prior Similarity Matrix under Stochastic Equivalence

2.2.4 Comparison to Existing Methods

Bonhomme and Manresa (2015) incorporate prior knowledge of group membership by adding a penalty term to the objective function. They assume that prior information comes in the form of probabilities $\omega^{(K)}_{ik}$, describing the prior probability of unit $i$ belonging to group $k$ when there are at most $K$ groups. Consequently, the estimated group index is given by:

$\widehat{g}_i(\beta, \alpha) = \underset{k \in \{1, \ldots, K\}}{\operatorname{argmin}} \sum_{t=1}^{T} \left(y_{it} - \beta^{(K)\prime} x_{it} - \alpha^{(K)}_{kt}\right)^2 - C \ln \omega^{(K)}_{ik},$ (2.9)

where $C > 0$ is a hyperparameter that needs to be tuned and $K$ is the predetermined number of groups.
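For comparison, a sketch of the penalized assignment step (2.9) for a single unit, under BM's specification with time-varying group effects $\alpha^{(K)}_{kt}$; the array shapes are assumptions made for illustration:

```python
import numpy as np

def assign_group_bm(y_i, x_i, beta, alpha, omega_i, C):
    """Penalized assignment step in (2.9): choose the group k minimizing the
    unit's sum of squared residuals minus C * log of the prior probability
    omega_i[k]. Assumed shapes: y_i (T,), x_i (T, p), beta (p,),
    alpha (T, K), omega_i (K,)."""
    resid = y_i - x_i @ beta                            # residuals of the common part
    ssr = ((resid[:, None] - alpha) ** 2).sum(axis=0)   # SSR per candidate group
    return int(np.argmin(ssr - C * np.log(omega_i)))
```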

The penalty determines the weights assigned to prior and data information in estimation. Because $K$ is frequently unknown in advance, this method requires model selection to determine the ideal number of groups. Suppose we have $n_K$ alternative options for $K$; for each given $K$, the prior information is an $N \times K$ matrix $\omega^{(K)} = \{\omega^{(K)}_{ik}\}$. To pick a model, we must therefore provide $n_K$ sets of prior probability matrices $\omega^{(K)}$, which is cumbersome and inconvenient. For instance, if $K$ ranges over $3, 5, 10$ and $N = 200$, there are 3,600 entries of $\omega$ to specify, none of which can be missing or undefined. In addition, the information criteria may be unreliable in finite samples, necessitating further care when selecting an appropriate variant for empirical application.

Summarizing prior knowledge through pairwise constraints is often more practical than the penalty function approach in BM and solves the aforementioned practical issues. Pairwise relationships can be derived intuitively from researchers' input without requiring in-depth knowledge of the underlying groups; researchers do not need to fix the number of groups $K$ or the group membership a priori. One only needs to focus on a pair of units at a time and specify the preference for assigning them to the same or different groups. Moreover, since the pairwise constraints are incorporated into the DP prior, which implicitly defines a prior distribution on the group partitions, model selection is not required, and the posterior analysis also takes the uncertainty of the latent group structure into account.

Paganin et al. (2021) offer a statistical framework for including concrete prior knowledge on the partition. Their proposed method aims to shrink the prior distribution towards a complete prior clustering structure, which is an initial clustering involving all units provided by experts. Specifically, they suggest a prior on group partition that is proportional to a baseline EPPF of the DP prior multiplied by a penalization term,

$p\left(G|G_0, m\right) \;\propto\; p(G)\, e^{-m\, d\left(G, G_0\right)},$ (2.10)

where $m > 0$ is a penalization parameter, $d(G, G_0)$ is a suitable distance measuring how far $G$ is from $G_0$, and $p(G)$ denotes the same EPPF as in (2.4). Because of the penalization term, the resulting group indices $G$ shrink toward the initial target $G_0$.

This framework is parsimonious and easy to implement, but it comes at a cost. The method cannot cope with an initial clustering of only a subset of the units under study, or with multiple plausible prior partitions; otherwise, the distance measure is not well-defined. In addition, the authors suggest utilizing the Variation of Information (Meilă, 2007) as the distance measure. It can be shown that the resulting partition can easily become trapped in local modes, leading the partition to never shrink toward $G_0$. They also argue that the other available distance measures have flaws. As a result, the penalty term does not function as anticipated.

Our proposed framework with pairwise constraints is more flexible than adopting a target partition. In fact, the target partition in Paganin et al. (2021) can be viewed as a special case of pairwise constraints in which every unit must be involved in at least one PL constraint. Our framework can manage partitions involving arbitrary subsets of the units by tactically specifying the bilateral relationships. Most importantly, when the group indices of other pairs are fixed, our framework ensures that the partition containing a specific pair that is consistent with our prior belief receives a strictly greater prior probability than the partition in which that pair is inconsistent with our prior belief. This guarantees that the generated $G$ shrinks in the direction of our prior belief.

2.3 Nonparametric Bayesian Prior

The baseline model contains the following parameters: $(\alpha, \sigma^2, \pi, \xi, a, \phi)$. We rely mostly on nonparametric Bayesian models; for a more comprehensive review of the nonparametric Bayesian literature, see Ghosal and Van der Vaart (2017) and Müeller et al. (2018). Bayesian nonparametric models have emerged as rigorous and principled paradigms that bypass the model selection problem of parametric models by introducing a nonparametric prior distribution on the unknown parameters. The prior assumes that the collection of $\alpha$ and $\sigma^2$ is drawn from the Dirichlet process prior. (Some empirical works use Dirichlet process models with panel data: a Dirichlet process mixture prior is specified for either the distribution of the innovations (Hirano, 2002) or the intercepts (Fisher and Jensen, 2022).) $\pi$ is the vector of mixture probabilities in the Dirichlet process, produced by the stick-breaking approach with stick lengths $\xi$. $a$ is the concentration parameter in the Dirichlet process, whereas $\phi$ is a collection of hyperparameters in the base measure $B_0$. We consider prior distributions in the partially separable form (the joint prior includes $\xi$ but not $\pi$: because the stick-breaking formulation makes $\pi$ a deterministic transformation of $\xi$, knowing $\xi$ is identical to knowing $\pi$),

$p(\alpha, \sigma^2 | a, \phi, \xi)\, p(\xi | a)\, p(a).$

We tentatively focus on the random coefficients model where, conditional on $G$, $\alpha_{g_i}$ and $\sigma^2_{g_i}$ are independent of the conditioning set, which includes the initial value $y_{i0}$ of each unit, the initial values of predetermined variables, and the whole history of exogenous variables. This assumption guarantees that $\alpha_{g_i}$ and $\sigma_{g_i}$ can be sampled separately and simplifies the inference on the underlying distribution of $\alpha_{g_i}$ and $\sigma_{g_i}$ to an unconditional density estimation problem, thereby lowering computational complexity. The joint distribution of the heterogeneous parameters can then be modeled as a function of the conditioning variables to extend the model to the correlated random coefficients model, which is outlined in Section 2.3.3. A full explanation and derivation for the correlated random coefficients model are provided in the online appendix.

2.3.1 Prior on Group-Specific Parameters

In the nonparametric Bayesian literature, the Dirichlet process (DP) prior (Ferguson, 1973, 1974; Sethuraman, 1994) is a canonical choice, notable for its capacity to construct group structure and accommodate an infinite number of possible group components (see Appendix A.1 for a brief overview of the DP and Appendix A.1.3 for its clustering properties). The DP mixture is also known as an “infinite” mixture model, since the data indicate a finite number of components but fresh data can uncover previously undetected components (Neal, 2000). When the model is estimated, it automatically chooses an appropriate subset of groups to characterize any finite data set. Therefore, there is no need to determine the “proper” number of groups.

The DP prior can be written as an infinite mixture of point mass with the probability mass function:

$\left(\alpha_i, \sigma_i^{2}\right) \sim \sum_{k=1}^{\infty} \pi_k\, \delta_{\left(\alpha_k, \sigma_k^{2}\right)} \text{ with } \left(\alpha_k, \sigma_k^{2}\right) \sim B_0(\phi),$ (2.11)

where $\delta_x$ denotes the Dirac-delta function concentrated at $x$ and $B_0$ is the base distribution. We adopt an Independent Normal Inverse-Gamma (INIG) distribution for the base distribution $B_0$:

$B_0(\phi) := INIG\left(\mu_\alpha, \Sigma_\alpha, \frac{\nu_\sigma}{2}, \frac{\delta_\sigma}{2}\right),$ (2.12)

with the set of hyperparameters $\phi = \left(\mu_\alpha, \Sigma_\alpha, \frac{\nu_\sigma}{2}, \frac{\delta_\sigma}{2}\right)$.

The group probabilities $\pi_k$ are constructed by an infinite-dimensional stick-breaking process (Sethuraman, 1994) governed by the concentration parameter $a$,

$\pi_k \equiv \xi_k \prod_{j<k} (1 - \xi_j) \text{ for } k > 1, \text{ and } \pi_1 = \xi_1,$ (2.13)

where the stick lengths $\xi_k$ are independent random variables drawn from the $Beta(1, a)$ distribution. (Recall that a $Beta(m, n)$ distribution is supported on the unit interval and has mean $m/(m+n)$.) The group probabilities are random but still satisfy $\sum_{k=1}^{\infty} \pi_k = 1$ almost surely.

Equation (2.13) is essential to understanding how the DP prior controls the number of groups. The construction of the group probabilities is analogous to sequentially breaking a stick of unit length, where the length of each break is assigned to the current value of $\pi_k$. As the number of groups increases, the probability created by the stochastic process decreases, because the remaining stick becomes shorter with each break. In practice, the number of groups does not increase as fast as $N$, due to this characteristic of the stick-breaking process, which drives the group probabilities toward zero.

Although in principle we do not restrict the maximum number of groups and allow the number to rise as $N$ increases, a finite number of observations will only occupy a finite number $K$ of components. The concentration parameter $a$ in the prior of $\xi_k$ determines the degree of discretization, i.e., the complexity of the mixture and, consequently, $K$, as also revealed in (A.3). As $a \to 0$, the realizations are all concentrated at a single value, whereas as $a \to \infty$, the realizations become continuously valued, following the base distribution. Specifically, Antoniak (1974) derives the relationship between $a$ and the number of unique groups,

$E\left(K|a\right) \approx a \log\left(\frac{a+N}{a}\right) \quad \text{and} \quad \operatorname{Var}\left(K|a\right) \approx a\left[\log\left(\frac{a+N}{a}\right) - 1\right],$

that is, the expected number of unique groups is increasing in both $a$ and the number of units $N$.
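Antoniak's approximation can be checked against a truncated stick-breaking simulation of (2.13); a sketch in which the truncation level K_max and all numeric values are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
N, a, n_rep, K_max = 200, 1.0, 2000, 500

def n_occupied_groups(a, N):
    # Draw group probabilities via the truncated stick-breaking process (2.13)
    # and count how many distinct groups N units actually occupy.
    xi = rng.beta(1, a, size=K_max)
    pi = xi * np.concatenate(([1.0], np.cumprod(1 - xi)[:-1]))
    g = rng.choice(K_max, size=N, p=pi / pi.sum())   # renormalize the truncation
    return len(np.unique(g))

sim_mean = np.mean([n_occupied_groups(a, N) for _ in range(n_rep)])
approx = a * np.log((a + N) / a)                     # Antoniak's approximation
print(sim_mean, approx)                              # same order of magnitude
```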

Escobar and West (1995) highlight the importance of specifying $a$ when imposing prior smoothness on an unknown density and demonstrate that the number of estimated groups under a DP prior is sensitive to $a$. This suggests that a data-driven estimate of $a$ is more reasonable. Moreover, Ascolani et al. (2022) emphasize the importance of introducing a prior for $a$, as it is crucial for learning the true number of groups as $N$ increases and hence for establishing posterior consistency. We define a gamma hyperprior for $a$ and update it based on the observed data in order to adapt the smoothness level. This step generates a posterior estimate of $a$, which indirectly determines the number of groups $K$ without reestimating the model with different group sizes. Essentially, this represents “automated” model selection.

Collectively, we specify a DP prior for $(\alpha_i, \sigma^2_i)$. The DP prior is a mixture of an infinite number of possible point masses, which can be constructed through the stick-breaking process. The discreteness of the underlying distribution is governed by the concentration parameter $a$. With a hyperprior on $a$, we permit the data to determine the number of groups $K$ present in the data, which can expand unboundedly along with the data.

2.3.2 Prior on Group Partitions

In a formal Bayesian formulation, a prior distribution is specified for the partition $\mathcal{B}$ with associated indices $G$. Although the DP prior does not specify this prior distribution explicitly, we can characterize it using the exchangeable partition probability function (EPPF) (Pitman, 1995). As briefly mentioned in the last subsection, the EPPF plays a significant role in connecting the prior belief on group structure to the DP prior, and it is included as part of our proposed prior distribution in equation (2.4).

The EPPF characterizes the distribution of a partition $\mathcal{B} = \{B_1, B_2, \ldots, B_K\}$ induced by $G$. As the generic Dirichlet process assumes units are exchangeable, any permutation of the units has no effect on the joint probability distribution of $G$; hence, the EPPF is determined entirely by the number of groups and the size of each group. Pitman (1995) demonstrates that the EPPF of the Dirichlet process has the closed form

p(G)=Γ(a)Γ(a+N)aKk=1KΓ(|Bk|),\displaystyle p(G)=\frac{\Gamma(a)}{\Gamma(a+N)}a^{K}\prod_{k=1}^{K}\Gamma\left(|B_{k}|\right), (2.14)

where aa is the concentration parameter and Γ(x)=(x1)!\Gamma(x)=(x-1)! denotes the Gamma function. Note that the partition \mathcal{B} is conceived as a random object and hence the group number KK is not predetermined, but rather is a function of GG, K=K(G)K=K(G).
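
Since (2.14) is available in closed form, it is straightforward to evaluate in log space. The short sketch below is a direct transcription; the helper name log_eppf is ours.

from math import lgamma, log
from collections import Counter

def log_eppf(a, g):
    # Log of the DP exchangeable partition probability (2.14).
    # a: concentration parameter; g: list of group indices, e.g. [0, 0, 1, 2, 1].
    N = len(g)
    sizes = Counter(g).values()          # |B_k| for each occupied group
    K = len(sizes)
    return lgamma(a) - lgamma(a + N) + K * log(a) + sum(lgamma(s) for s in sizes)

print(log_eppf(1.0, [0, 0, 1, 2, 1]))    # K = 3 groups of sizes 2, 2, 1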

Sethuraman (1994) and Pitman (1996) constructively show that group indices/partitions can be drawn from the EPPF for DP using the stick-breaking process defined in (2.13). As a result, the EPPF does not explicitly appear in the posterior analysis in the current setting so long as the priors for the stick lengths are included.

2.3.3 Correlated Random Coefficients Model

As suggested by Chamberlain (1980), allowing the individual effects to be correlated with the initial condition can eliminate omitted variable bias. This subsection presents a first attempt to introduce dependence between grouped effects and covariates in the presence of group structure in both heterogeneous slope coefficients and cross-sectional variances. The underlying theory, such as posterior consistency, and the performance of the framework are left for future study.

We primarily follow the framework proposed in Liu (2022) and utilize Mixtures of Gaussian Linear Regressions (MGLRx) for the group-specific parameters. The MGLRx prior is discussed in Pati et al. (2013) and can be viewed as a Dirichlet Process Mixture (DPM) prior that takes dependence on covariates into account. Notice that the correlated random coefficients model requires a DPM-type prior for αi\alpha_{i} and σi2\sigma_{i}^{2}. This is because αi\alpha_{i} and σi2\sigma_{i}^{2} are assumed to be correlated with each individual's covariates and, as such, are not identical within a group.

Following Liu (2022), we first transform σi2\sigma_{i}^{2} and define li=logσ¯2(σi2σ¯2)σ¯2σi2l_{i}=\log\frac{\bar{\sigma}^{2}\left(\sigma_{i}^{2}-\underline{\sigma}^{2}\right)}{\bar{\sigma}^{2}-\sigma_{i}^{2}}, where σ¯2(σ¯2)\underline{\sigma}^{2}\left(\bar{\sigma}^{2}\right) is some small (large) positive number. This transformation simplifies the prior for σi2\sigma_{i}^{2}, which is now dependent on covariates, and ensures that a similar prior structure can be applied to both αi\alpha_{i} and lil_{i}.

In the correlated random coefficients model, the DPM prior for αi\alpha_{i} or the transformed variance lil_{i} is an infinite mixture of Normal densities with the probability density function:

αik=1πk(wi0)N(μkα[1wi0],Ωkα),\displaystyle\alpha_{i}\sim\sum_{k=1}^{\infty}\pi_{k}(w_{i0})N\left(\mu^{\alpha}_{k}[1\;w^{\prime}_{i0}]^{\prime},\;\Omega^{\alpha}_{k}\right), (2.15)
lik=1πk(wi0)N(μkσ[1wi0],Ωkσ),\displaystyle l_{i}\sim\sum_{k=1}^{\infty}\pi_{k}(w_{i0})N\left(\mu^{\sigma}_{k}[1\;w^{\prime}_{i0}]^{\prime},\;\Omega^{\sigma}_{k}\right), (2.16)

where wi0=[1,yi0,xi,0:T]w_{i0}=[1,y_{i0},x_{i,0:T}]^{\prime} is the conditioning set at period 0, which includes the initial value of each unit yi0y_{i0}, the initial values of the predetermined variables, and the whole history of the exogenous variables. Notice that αi\alpha_{i} and lil_{i} share the same set of group probabilities πk(wi0)\pi_{k}(w_{i0}).

Similar but not identical to the DP prior, it is the component parameters (μkα,Ωkα)\left(\mu^{\alpha}_{k},\Omega^{\alpha}_{k}\right) or (μkσ,Ωkσ)\left(\mu^{\sigma}_{k},\Omega^{\sigma}_{k}\right) that are directly drawn from the base distribution G0G_{0}. G0G_{0} is assumed to be a conjugate Matrix-Variate-Normal-Inverse-Wishart distribution.

The group probabilities are now characterized by a probit stick-breaking process (Rodriguez and Dunson, 2011),

πk(wi0)=Φ(ζk(wi0))j<k(1Φ(ζj(wi0))),\displaystyle\pi_{k}(w_{i0})=\Phi\left(\zeta_{k}\left(w_{i0}\right)\right)\prod_{j<k}\left(1-\Phi\left(\zeta_{j}\left(w_{i0}\right)\right)\right), (2.17)

where the stochastic function ζk\zeta_{k} is drawn from a Gaussian process, ζkGP(0,Vk)\zeta_{k}\sim GP\left(0,V_{k}\right) for k=1,2,k=1,2,\cdots. The Gaussian process is assumed to have zero mean and covariance function VkV_{k}, defined as follows,

Vk(x,x)=τvexp(Akxx22),\displaystyle V_{k}(x,x^{\prime})=\tau_{v}\exp\left(-A_{k}\|x-x^{\prime}\|^{2}_{2}\right), (2.18)

where τvIG(νv2,νv2)\tau_{v}\sim IG(\frac{\nu_{v}}{2},\frac{\nu_{v}}{2}) and AkA_{k} has its own hyperprior; see Pati et al. (2013) for details.
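
A schematic implementation of the probit stick-breaking weights (2.17) together with the squared-exponential covariance (2.18) might look as follows; the truncation at five stick functions, the jitter added to the covariance, and all function names are illustrative assumptions rather than part of the estimation procedure.

import numpy as np
from scipy.stats import norm

def se_kernel(X, tau, A):
    # Squared-exponential covariance (2.18): V(x, x') = tau * exp(-A ||x - x'||^2)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return tau * np.exp(-A * d2)

def probit_stick_breaking(zeta):
    # Covariate-dependent weights (2.17); zeta[k, i] = zeta_k(w_{i0}).
    p = norm.cdf(zeta)                                   # Phi(zeta_k(w))
    leftover = np.vstack([np.ones(p.shape[1]),
                          np.cumprod(1.0 - p, axis=0)[:-1]])
    return p * leftover                                  # pi_k(w_{i0})

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))                             # 10 units, 3 covariates
V = se_kernel(X, tau=1.0, A=0.5) + 1e-8 * np.eye(10)     # jitter for stability
zeta = rng.multivariate_normal(np.zeros(10), V, size=5)  # 5 GP draws (truncation)
pi = probit_stick_breaking(zeta)                         # (5, 10) weight array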

3 Posterior Analysis

This section describes the procedure for analyzing posterior distributions for the baseline model described in (2.1) with the priors specified in Section 2.2.1. The joint posterior distribution of model parameters is

p(α,σ2,Ξ,a,G|Y,X,W,ϕ)\displaystyle p(\alpha,\sigma^{2},\Xi,a,G|Y,X,W,\phi)
\displaystyle\propto\; p(Y|X,α,σ2,G)p(α,σ2|ϕ)p(G|Ξ)p(W|G)p(Ξ|a)p(a),\displaystyle p(Y|X,\alpha,\sigma^{2},G)p(\alpha,\sigma^{2}|\phi)p(G|\Xi)p(W|G)p(\Xi|a)p(a), (3.1)

where p(Y|X,α,σ2,G)p(Y|X,\alpha,\sigma^{2},G) is the likelihood function given by equation (2.1) for an i.i.d. model conditional on the group indices GG, and p(W|G)p(W|G) is the additional term induced by the pairwise constraints, with the form p(W|G)=i=1Nj=1Nexp(cWijδij)p(W|G)=\prod_{i=1}^{N}\prod_{j=1}^{N}\exp\left(cW_{ij}\delta_{ij}\right).

3.1 Posterior Sampling

Draws from the joint posterior distribution can be obtained using blocked Gibbs sampling. The algorithm builds on Ishwaran and James (2001) and Walker (2007). Because it relies on a finite-dimensional prior and truncation, the method of Ishwaran and James (2001) cannot fully address our need to estimate the number of groups without a predetermined value or upper bound. We therefore employ the slice sampler (Walker, 2007), an exact blocked Gibbs sampler for posterior computation in infinite-dimensional Dirichlet process models, which modifies the blocked Gibbs sampler of Ishwaran and James (2001) to avoid truncation approximations. Walker (2007) augments the posterior distribution with a set of auxiliary variables consisting of i.i.d. standard uniform random variables, i.e., uiiidU(0,1)u_{i}\stackrel{{\scriptstyle iid}}{{\sim}}U(0,1) for i=1,2,,Ni=1,2,...,N. The augmented posterior is then represented as

p(α,σ2,Ξ,a,G,u|Y,X,W,ϕ)\displaystyle p(\alpha,\sigma^{2},\Xi,a,G,u|Y,X,W,\phi)
\displaystyle\propto\; p(Y|X,α,σ2,G)p(α,σ2|ϕ)p(W|G)p(Ξ|a)p(a)i𝟏(uiπgi).\displaystyle p(Y|X,\alpha,\sigma^{2},G)p(\alpha,\sigma^{2}|\phi)p(W|G)p(\Xi|a)p(a)\prod_{i}\mathbf{1}(u_{i}\leq\pi_{g_{i}}). (3.2)

where i𝟏(uiπgi)\prod_{i}\mathbf{1}(u_{i}\leq\pi_{g_{i}}) is substituted for p(G|Ξ)p(G|\Xi) in equation (3.1).

To see roughly how slice sampling works, recall that the group probabilities are constructed sequentially through the stick-breaking procedure, so the leftover of the stick after each break becomes smaller and smaller. Given a finite number of units, we can always find the smallest KK^{*} such that for all groups kKk\geq K^{*}, the minimum of uiu_{i} among all units exceeds πk\pi_{k}, which is bounded above by the length of the leftover after kk breaks, 1j=1kπj1-\sum_{j=1}^{k}\pi_{j}. More formally,

K=mink{u>1j=1kπj}, with u=min1iNui.\displaystyle K^{*}=\min_{k}\left\{u^{*}>1-\sum_{j=1}^{k}\pi_{j}\right\},\text{ with }u^{*}=\min_{1\leq i\leq N}u_{i}. (3.3)

As a result, all units receive strictly zero probability of being assigned to any group k=K+1,K+2,,Nk=K^{*}+1,K^{*}+2,...,N since the indicator function 𝟏(uiπk)\mathbf{1}(u_{i}\leq\pi_{k}) is zero.
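
The truncation rule (3.3) is a one-liner in practice. The sketch below is a literal transcription: if no index satisfies the condition, more sticks have to be generated first, which we signal here by returning None.

import numpy as np

def find_K_star(pi, u_star):
    # Smallest K* with u* > 1 - sum_{j<=K*} pi_j, as in (3.3).
    leftover = 1.0 - np.cumsum(pi)
    hits = np.nonzero(u_star > leftover)[0]
    return int(hits[0]) + 1 if hits.size else None   # None: draw more sticks

pi = np.array([0.4, 0.3, 0.2, 0.05])
print(find_K_star(pi, u_star=0.15))   # -> 3, since 1 - 0.9 = 0.1 < 0.15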

There are two advantages to incorporating the auxiliary variables uu into the model. First and foremost, uu directly determines the largest possible number of groups in each sampling iteration. This reduces the support of GG and 𝚵\boldsymbol{\Xi} to a finite space, enabling us to solve a problem of finite dimension without truncation. Furthermore, uu has no effect on the joint posterior of the other parameters because the original posterior can be restored by integrating out uiu_{i} for i=1,2,,Ni=1,2,...,N.

The Gibbs sampler is used to simulate the joint posterior distribution of (α,σ2,Ξ,u,a,G)\left(\alpha,\sigma^{2},\Xi,u,a,G\right). We break this vector into blocks and sequentially sample each block conditional on the current draws of the other parameters and the data. The full conditional distributions for each block are easily derived using the conjugate priors specified in Section 2.

For the group-specific parameters, we directly draw samples from their posterior densities as we adopt conjugate priors. The posterior inference with respect to (α,σ2)(\alpha,\sigma^{2}) becomes standard once we condition on the latent group indices GG; it is essentially a Bayesian panel data regression for each group. The conditional posterior for the stick lengths Ξ\Xi is a beta distribution given GG, and hence direct sampling is possible.

We follow Walker (2007) to derive the posterior of the auxiliary variables uu. As the uiu_{i} are standard uniformly distributed, the posterior is a uniform distribution on (0,πgi)\left(0,\pi_{g_{i}}\right), conditional on the group probabilities and group indices. For the concentration parameter aa, we use the two-step procedure proposed by Escobar and West (1995). Following their approach, we first draw a latent variable η\eta from Beta(a+1,N)Beta(a+1,N). Then, given η\eta and the number of groups KaK^{a} in the current iteration, we directly draw aa from a mixture of two Gamma distributions.
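
For concreteness, a minimal implementation of this two-step update might look as follows, assuming a Gamma hyperprior for aa with shape s1 and rate s2; the particular hyperparameter values are placeholders.

import numpy as np

def draw_concentration(a_old, K, N, s1=2.0, s2=2.0, rng=None):
    # Two-step update of the DP concentration parameter (Escobar and West, 1995)
    # under an assumed Gamma(s1, s2) hyperprior (shape s1, rate s2).
    rng = np.random.default_rng(rng)
    eta = rng.beta(a_old + 1.0, N)                      # latent variable
    odds = (s1 + K - 1.0) / (N * (s2 - np.log(eta)))    # mixture odds
    shape = s1 + K if rng.random() < odds / (1.0 + odds) else s1 + K - 1.0
    return rng.gamma(shape, 1.0 / (s2 - np.log(eta)))   # numpy's scale = 1 / rate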

It is worth noting that the steps for implementing the DP prior with or without soft pairwise constraints are the same for all parameters besides the group indices GG, because the soft pairwise constraints affect the other parameters only through the group indices. Sampling the group indices with soft pairwise constraints conditional on the other parameters is straightforward. The posterior probability of assigning unit ii to group kk includes the additional term p(Wi|G)=ji,gj=kexp(2cWijδij)p(W_{i}|G)=\prod_{j\neq i,g_{j}=k}\exp\left(2cW_{ij}\delta_{ij}\right), which rewards (penalizes) abidance by (violation of) the constraints,

π¯k=p(gi=k|α,σ2,G(i),u,Y,X,W)=p(yi|αk,σk2,Y,X)𝟏(uiπk)p(Wi|G)k=1Kp(yi|αk,σk2,Y,X)𝟏(uiπk)p(Wi|G),\displaystyle\bar{\pi}_{k}=p\left(g_{i}=k|\alpha,\sigma^{2},G^{(i)},u,Y,X,W\right)=\frac{p\left(y_{i}|\alpha_{k},\sigma_{k}^{2},Y,X\right)\mathbf{1}\left(u_{i}\leq\pi_{k}\right)p(W_{i}|G)}{\sum_{k^{\prime}=1}^{K^{*}}p\left(y_{i}|\alpha_{k^{\prime}},\sigma_{k^{\prime}}^{2},Y,X\right)\mathbf{1}\left(u_{i}\leq\pi_{k^{\prime}}\right)p(W_{i}|G)}, (3.4)

where KK^{*} is the maximal number of groups after generating potential new group-specific slope coefficients and variance. We then draw the group index for unit ii from a multinomial distribution:

gi=k, with probability π¯k.\displaystyle g_{i}=k,\text{ with probability }\bar{\pi}_{k}. (3.5)
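
The constrained assignment (3.4)–(3.5) can be sketched as below. This is a schematic transcription that works in log space for numerical stability; it assumes the weight matrix WW has a zero diagonal, and all argument names are ours.

import numpy as np

def draw_group_index(i, loglik, pi, u_i, c, W, delta, rng=None):
    # loglik: (K*,) log p(y_i | alpha_k, sigma_k^2); pi: (K*,) group probabilities;
    # delta[k, j] = 1 if g_j = k. Assumes W[i, i] = 0 so unit i ignores itself.
    rng = np.random.default_rng(rng)
    constr = 2.0 * c * (delta @ W[i])                 # log p(W_i | G) term per group
    logp = np.where(u_i <= pi, loglik + constr, -np.inf)
    logp -= logp.max()                                # at least one group is feasible
    p = np.exp(logp)
    return rng.choice(len(pi), p=p / p.sum())         # multinomial draw (3.5)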

Algorithm 1 below summarizes the proposed Gibbs sampler. For illustrative purposes, we focus primarily on the posterior densities of the major parameters and omit details on step (vii). In short, step (vii) creates potential groups by sampling new (αk,σk2)(\alpha_{k},\sigma_{k}^{2}) from the prior whenever the latest KK^{*}, based on the newly drawn uiu_{i} and πk\pi_{k}, exceeds the previous KK^{*}, indicating that the current iteration permits more groups. Detailed derivations and explanations of each step are provided in Appendix C.2.

Algorithm 1.

(Gibbs Sampler for Random Coefficients Model with Soft Pairwise Constraints)
For each iteration s=1,2,..,Nsims=1,2,..,N_{sim},

  1. (i)

    Calculate number of active groups: Ka=max1iNgi(s1).K^{a}=\max_{1\leq i\leq N}g_{i}^{(s-1)}.

  2. (ii)

    Group-specific slope coefficients: draw αk(s)\alpha_{k}^{(s)} from p(αk|σ2(s1),G(s1),Y,X)p\left(\alpha_{k}|\sigma^{2(s-1)},G^{(s-1)},Y,X\right) for k=1,2,,Kak=1,2,...,K^{a}.

  3. (iii)

    Group-specific variance: draw σk2(s)\sigma_{k}^{2^{(s)}} from p(σk2|α(s),G(s1),Y,X)p\left(\sigma_{k}^{2}|\alpha^{(s)},G^{(s-1)},Y,X\right) for k=1,2,,Kak=1,2,...,K^{a}.

  4. (iv)

Group “stick length”: draw ξk(s)\xi^{(s)}_{k} from p(ξk|a(s1),G(s1))p\left(\xi_{k}|a^{(s-1)},G^{(s-1)}\right) for k=1,2,,Kak=1,2,...,K^{a} and update the group probabilities in accordance with the stick-breaking procedure.

  5. (v)

    Auxiliary variable: draw ui(s)u^{(s)}_{i} from p(ui|Ξ(s),G(s1))p\left(u_{i}|\Xi^{(s)},G^{(s-1)}\right) for i=1,2,,Ni=1,2,...,N and calculate u=min1iNuiu^{*}=\min_{1\leq i\leq N}u_{i}.

  6. (vi)

    DP concentration parameter: draw a latent variable η\eta from Beta(a+1,N)\textit{Beta}(a+1,N) and draw a(s)a^{(s)} from p(a|η,Ka)p\left(a|\eta,K^{a}\right).

  7. (vii)

    Generate potential groups based on uu^{*} and find the maximal number of groups KK^{*}.

  8. (viii)

    Group indices: draw gig_{i} from p(gi=k|α(s),σ2(s),G(i),u,Y,X,W)p\left(g_{i}=k|\alpha^{(s)},\sigma^{2(s)},G^{(i)},u,Y,X,W\right) for i=1,2,,Ni=1,2,...,N and k=1,2,,Kk=1,2,...,K^{*}.

3.2 Determining Partition

In contrast to popular algorithms such as agglomerative hierarchical clustering or the KMeans algorithm, which return a single clustering solution, Bayesian nonparametric models provide a posterior over the entire space of partitions, enabling the assessment of statistical properties such as the uncertainty about the number of groups.

However, when the group structure is part of the main conclusion of an empirical analysis, a point estimate of the group structure becomes crucial. Wade and Ghahramani (2018) discuss in detail an appropriate point estimate of the group partition based on the posterior draws. From a decision-theoretic perspective, the point estimate GG^{*} minimizes the posterior expected loss,

G=argminG^𝔼[L(G,G^)|Y]=argminG^GL(G,G^)p(G|Y),\displaystyle G^{*}=\underset{\widehat{G}}{\operatorname{argmin}}\;\mathbb{E}\left[L(G,\widehat{G})|Y\right]=\underset{\widehat{G}}{\operatorname{argmin}}\sum_{G}L(G,\widehat{G})p\left(G|Y\right),

where the loss function L(G,G^)L(G,\widehat{G}) is the variation of information by Meilă (2007), which measures the amount of information lost and gained in changing from partition GG to G^\widehat{G}.121212Another possible loss function is the 0-1 loss function L(G,G^)=𝟏(G=G^)L(G,\widehat{G})=\mathbf{1}(G=\widehat{G}), which leads to GG^{*} being the posterior mode. This loss function is undesirable since it ignores similarity between two partitions. For instance, a partition that deviates from the truth in the allocation of only one unit is penalized the same as a partition that deviates from the truth in the allocation of numerous units. Furthermore, it is generally recognized that the mode may not accurately reflect the distribution’s center. The Variation of Information is based on the Shannon entropy H()H(\cdot), and can be computed as

VI(G,G^)=H(G)H(G^)+2H(GG^)=j=1KλjNlog(λjN)+l=1KλlNlog(λlN)2j=1Kl=1KλjlNlog(λjlN),\displaystyle\begin{aligned} \mathrm{VI}\left(G,\widehat{G}\right)&=-H(G)-H\left(\widehat{G}\right)+2H\left(G\wedge\widehat{G}\right)\\ &=\sum_{j=1}^{K}\frac{\lambda_{j}}{N}\log\left(\frac{\lambda_{j}}{N}\right)+\sum_{l=1}^{K^{\prime}}\frac{\lambda_{l}^{\prime}}{N}\log\left(\frac{\lambda_{l}^{\prime}}{N}\right)-2\sum_{j=1}^{K}\sum_{l=1}^{K^{\prime}}\frac{\lambda_{jl}^{\wedge}}{N}\log\left(\frac{\lambda_{jl}^{\wedge}}{N}\right),\end{aligned}

where log\log denotes log\log base 2, λj=|Bj|\lambda_{j}=\left|B_{j}\right| is the cardinality of group jj, and λjl\lambda_{jl}^{\wedge} is the size of a block of the intersection GG^G\wedge\widehat{G}, i.e., the number of indices in block jj under partition GG and block ll under G^\widehat{G}.

Wade and Ghahramani (2018) show that the optimal group partitioning can be identified based on the posterior similarity matrix,

g=argming^i=1Nlog(j=1N𝟏(g^j=g^i))2i=1Nlog(j=1NP(gj=gi|Y,X,W)𝟏(g^j=g^i))\displaystyle g^{*}=\underset{\widehat{g}}{\operatorname{argmin}}\sum_{i=1}^{N}\log\left(\sum_{j=1}^{N}\mathbf{1}\left(\widehat{g}_{j}=\widehat{g}_{i}\right)\right)-2\sum_{i=1}^{N}\log\left(\sum_{j=1}^{N}P\left(g_{j}=g_{i}|Y,X,W\right)\mathbf{1}\left(\widehat{g}_{j}=\widehat{g}_{i}\right)\right) (3.6)

where P(gj=gi|Y,X,W)P\left(g_{j}=g_{i}|Y,X,W\right) is the (i,j)(i,j) entry of the posterior similarity matrix. We refer to Wade and Ghahramani (2018) for additional properties and empirical evaluations.
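
A minimal sketch of this search, restricting the candidate partitions to the posterior draws themselves (a common practical shortcut, not necessarily the search strategy of Wade and Ghahramani (2018)), might look as follows.

import numpy as np

def posterior_similarity(draws):
    # draws: (S, N) sampled group indices; returns the (N, N) PSM.
    S = draws.shape[0]
    psm = sum((g[:, None] == g[None, :]).astype(float) for g in draws)
    return psm / S

def vi_criterion(g_hat, psm):
    # Criterion (3.6) for a candidate partition g_hat.
    same = (g_hat[:, None] == g_hat[None, :])
    return np.sum(np.log2(same.sum(1)) - 2.0 * np.log2((psm * same).sum(1)))

def point_estimate(draws):
    psm = posterior_similarity(draws)
    scores = [vi_criterion(g, psm) for g in draws]
    return draws[int(np.argmin(scores))]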

3.3 Connection to Constrained KMeans Algorithm

The Gibbs sampler with soft constraints in Algorithm 1 is closely related to constrained clustering in the computer science literature. In that parallel literature, constrained clustering refers to introducing prior knowledge to guide a clustering algorithm; for a subset of the data, the prior knowledge takes the form of constraints that supplement the information derived from the data via a distance metric. As we shall see below, under several simplifying assumptions, our framework reduces to a deterministic method for estimating group heterogeneity via a constrained KMeans algorithm. Though this deterministic method may address the practical issues in BM, it only works for certain restricted models and hence is not as general as our proposed framework.

We start with a brief review of the Pairwise Constrained KMeans (PC-KMeans) clustering algorithm of Basu et al. (2004a), a well-known algorithm in semi-supervised machine learning. It is a pairwise-constrained variant of the standard KMeans algorithm in which an augmented objective function is used in the assignment step. Given a collection of observations (y1,y2,,yN)\left(y_{1},y_{2},\ldots,y_{N}\right), a set of positive-link constraints 𝒫\mathcal{P}, a set of negative-link constraints 𝒩\mathcal{N}, the costs of violating constraints w={wijp,wijn}w=\{w^{p}_{ij},w^{n}_{ij}\}, and the number of groups KK, the PC-KMeans algorithm divides the NN observations into KK groups (the assignment step) so as to minimize the following objective function,

12k=1KiBkyiμk2within-cluster sum of squares+(i,j)𝒫wijp𝟏(gigj)+(i,j)𝒩wijn𝟏(gi=gj)cost of violation,\displaystyle\underbrace{\frac{1}{2}\sum_{k=1}^{K}\sum_{i\in B_{k}}\left\|y_{i}-\mu_{k}\right\|^{2}}_{\text{within-cluster sum of squares}}+\underbrace{\sum_{\left(i,j\right)\in\mathcal{P}}w^{p}_{ij}\mathbf{1}\left(g_{i}\neq g_{j}\right)+\sum_{\left(i,j\right)\in\mathcal{N}}w^{n}_{ij}\mathbf{1}\left(g_{i}=g_{j}\right)}_{\text{cost of violation}}, (3.7)

where μk\mu_{k} is the centroid of group kk, i.e., μk=1|Bk|iBkyi\mu_{k}=\frac{1}{\left|B_{k}\right|}\sum_{i\in B_{k}}y_{i}, BkB_{k} is the set of units assigned to group kk, and |Bk|\left|B_{k}\right| is the size of group kk. The first part is the objective function for the conventional KMeans algorithm, while the second part accounts for the incurred cost of violating either PL constraints (wijpw^{p}_{ij}) or NL constraints (wijnw^{n}_{ij}).

Similar to KMeans, PC-KMeans alternates between reassigning units to groups and recomputing the means. In the assignment step, it determines a disjoint KK partitioning that minimizes (3.7). Then the update step of the algorithm recalculates centroids of observations assigned to each cluster and updates μk\mu_{k} for all kk.
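
For illustration, a greedy version of the assignment sweep might be sketched as follows; Basu et al. (2004a) describe the exact algorithm, and the data structures below (violation costs stored in dictionaries keyed by pairs) are our own simplification.

import numpy as np

def pc_kmeans_assign(Y, mu, g, PL, NL, w_p, w_n):
    # One greedy assignment sweep minimizing (3.7), unit by unit.
    # Y: (N, d) observations; mu: (K, d) centroids; g: (N,) current assignments.
    K = mu.shape[0]
    for i in range(Y.shape[0]):
        cost = 0.5 * ((Y[i] - mu) ** 2).sum(axis=1)   # distance to each centroid
        for (a, b) in PL:                             # penalty if a PL pair is split
            if i in (a, b):
                j = b if i == a else a
                cost += w_p[(a, b)] * (np.arange(K) != g[j])
        for (a, b) in NL:                             # penalty if an NL pair is joined
            if i in (a, b):
                j = b if i == a else a
                cost += w_n[(a, b)] * (np.arange(K) == g[j])
        g[i] = int(np.argmin(cost))
    return g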

By applying small-variance asymptotics to the distributions within the model, we demonstrate linkages between the posterior sampler of our constrained BGFE estimator and KMeans-type algorithms in Theorem 2. Motivated by the asymptotic connection between the Gibbs sampler for the Dirichlet process mixture model and KMeans (Kulis and Jordan, 2011), we investigate small-variance asymptotics for the posterior densities and demonstrate that the Gibbs sampling algorithm for the constrained BGFE estimator with soft constraints encompasses the constrained clustering algorithm PC-KMeans in the limit.

Theorem 2.

(Equivalency between BGFE with Soft Constraints and PC-KMeans)
If the following conditions hold,

  1. (i)

    The group pattern appears in the fixed-effects but not in the slope coefficients, i.e., xit=1x_{it}=1. Other covariates may be introduced, but they cannot have grouped effects on yity_{it};

  2. (ii)

    The number of groups is fixed at KK;

  3. (iii)

    Homoscedasticity: σk2=σ2\sigma^{2}_{k}=\sigma^{2} for all k=1,2,,Kk=1,2,...,K;

  4. (iv)

    Constraint weights are scaled by the variance of the errors: WijWij/σ2W_{ij}\to W_{ij}/\sigma^{2};

then the proposed Gibbs sampling algorithm for the BGFE estimator with soft constraint embodies the PC-KMeans clustering algorithm in the limit as σ20\sigma^{2}\to 0. In particular, the posterior draw of group indices GG is the solution to the PC-KMeans algorithm.

We return to the world of grouped fixed-effects models. The clustering algorithm is essential for BM and Bonhomme et al. (2022), who use the KMeans algorithm to reveal the group pattern in the fixed-effects. The theorem above thus motivates a constrained version of BM's GFE estimator, and we show that it is straightforward to incorporate prior knowledge in the form of soft pairwise constraints into the GFE estimator. The soft pairwise constrained grouped fixed-effects (SPC-GFE) estimator is defined as the solution to the following minimization problem given the number of groups KK:

(θ^,α^,G^)=argminθ,α,Gi=1Nt=1T(yitxitθαgit)2+c[(i,j)𝒫wijp𝟏(gigj)+(i,j)𝒩wijn𝟏(gi=gj)],\displaystyle\left(\widehat{\theta},\widehat{\alpha},\widehat{G}\right)=\underset{\theta,\alpha,G}{\operatorname{argmin}}\sum_{i=1}^{N}\sum_{t=1}^{T}\left(y_{it}-x^{\prime}_{it}\theta-\alpha_{g_{i}t}\right)^{2}+c\left[\sum_{(i,j)\in\mathcal{P}}w^{p}_{ij}\mathbf{1}\left(g_{i}\neq g_{j}\right)+\sum_{(i,j)\in\mathcal{N}}w^{n}_{ij}\mathbf{1}\left(g_{i}=g_{j}\right)\right], (3.8)

where the minimum is taken over all possible partitions GG of the NN units into KK groups, common parameters θ\theta, and group-specific time effects α\alpha. wijpw^{p}_{ij} and wijnw^{n}_{ij} are the user-specified costs on PL and NL constraints.

For given values of θ\theta and α\alpha, the optimal group assignment for each individual unit is

g^i(θ,α)=argming{1,,K}i=1Nt=1T(yitxitθαgit)2+c[(i,j)𝒫wijp𝟏(gigj)+(i,j)𝒩wijn𝟏(gi=gj)],\displaystyle\widehat{g}_{i}(\theta,\alpha)=\underset{g\in\{1,\ldots,K\}}{\operatorname{argmin}}\sum_{i=1}^{N}\sum_{t=1}^{T}\left(y_{it}-x^{\prime}_{it}\theta-\alpha_{g_{i}t}\right)^{2}+c\left[\sum_{(i,j)\in\mathcal{P}}w^{p}_{ij}\mathbf{1}\left(g_{i}\neq g_{j}\right)+\sum_{(i,j)\in\mathcal{N}}w^{n}_{ij}\mathbf{1}\left(g_{i}=g_{j}\right)\right], (3.9)

where we essentially apply the PC-KMeans algorithm to get the group partition. The SPC-GFE estimator of (θ,α)(\theta,\alpha) in (3.8) can then be written as

(θ^,α^)=argminθ,αi=1Nt=1T(yitxitθαg^it)2,\displaystyle(\widehat{\theta},\widehat{\alpha})=\underset{\theta,\alpha}{\operatorname{argmin}}\sum_{i=1}^{N}\sum_{t=1}^{T}\left(y_{it}-x^{\prime}_{it}\theta-\alpha_{\widehat{g}_{i}t}\right)^{2}, (3.10)

where g^i=g^i(θ,α)\widehat{g}_{i}=\widehat{g}_{i}(\theta,\alpha) is given by (3.9). θ\theta and α\alpha are computed using an OLS regression that controls for interactions of group indices and time dummies. The SPC-GFE estimate of gig_{i} is then simply g^i(θ^,α^)\widehat{g}_{i}(\widehat{\theta},\widehat{\alpha}).

Remark 3.1.

While the SPC-GFE estimator implements soft constraints, it still requires a predetermined number of groups KK and model selection.

4 Empirical Applications

We apply our panel forecasting methods to the following two empirical applications: inflation of the U.S. CPI sub-indices and the income effect on democracy. The first application focuses mostly on predictive performance, whereas the second application focuses primarily on parameter estimation and group structure.

4.1 Model Estimation and Measures of Forecasting Performance

To accommodate richer model assumptions, we use variants of the baseline model in Equation (2.1) in this section, either by adding common regressors or by allowing for time-variation in the intercept. We use conjugate priors for all parameters; see Appendix B for details.

4.1.1 Specification of Constraints

In both applications, the prior knowledge of the latent group structure, or the pre-grouping structure, covers all units. In the first application, CPI sub-indices can be clustered by expenditure category, whereas countries in the second application may be grouped according to their geographic regions. We build positive-link and negative-link constraints from this prior knowledge: all units within the same group are presumed to be positive-linked, while units from different categories are believed to be negative-linked. In terms of the accuracy of constraints, ψijPL\psi^{PL}_{ij} and ψijNL\psi^{NL}_{ij} are equal for all constraints of the same type, following the strategy described in Section 2.2.3. In the applications below, we fix ψijPL=0.65\psi^{PL}_{ij}=0.65 and ψijNL=0.55\psi^{NL}_{ij}=0.55 to reflect the belief that PL constraints (attracting forces) play a slightly more important role than NL constraints (repelling forces), while NL constraints cannot be ignored. Finally, we construct the weights WijW_{ij} using (2.3). Notice that these choices of priors and hyperparameters serve as an example to showcase how the proposed framework works with real data. In practice, we may specify constraints for a subset of units with different levels of weights, either in a data-driven manner (for instance, highly correlated units may fall into the same group with a high level of confidence) or in a model-based manner (i.e., WW is a function of covariates).

4.1.2 Determining the Scaling Constant cc

Given that the dimension of the space of group partitions increases exponentially with the number of units NN, care must be taken when selecting cc across analyses with different NN. As suggested by Paganin et al. (2021), calibrating the modified prior is computationally intensive. We face a trade-off between investing time to get the prior “exactly right” and treating the constant cc as an estimated model parameter. As such, we propose to find the optimal cc that maximizes the marginal data density using a grid search.

In the Monte Carlo simulation, the value of cc is fixed for simplicity, but in the empirical applications, cc is determined by the marginal data density (MDD). We calculate the MDD using the harmonic mean estimator (Newton and Raftery, 1994), which is defined as

m^HM(y)=[1Sj=1S1p(y|θ(j))]1,\displaystyle\hat{m}^{HM}(y)=\left[\frac{1}{S}\sum_{j=1}^{S}\frac{1}{p\left(y|\theta^{(j)}\right)}\right]^{-1}, (4.1)

given a sample θ(j)\theta^{(j)} from the posterior p(θ|y)p(\theta|y). The simplicity of the harmonic mean estimator is its main advantage over other more specialized techniques: it uses only within-model posterior samples and likelihood evaluations, which are typically available as a by-product of posterior sampling. We then choose the value of cc that maximizes the MDD.
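
In logs, the harmonic mean estimator is essentially one line; the sketch below uses logsumexp for numerical stability and assumes the pointwise log-likelihood evaluations at the posterior draws are stored in an array.

import numpy as np
from scipy.special import logsumexp

def log_mdd_harmonic_mean(loglik):
    # Log MDD via the harmonic mean estimator (4.1):
    # log m = log S - logsumexp(-loglik), with loglik[j] = log p(y | theta^(j)).
    S = loglik.shape[0]
    return np.log(S) - logsumexp(-loglik)

The chosen cc is then the grid point with the largest estimated log MDD.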

4.1.3 Estimators

We consider six estimators in this section. The first three are our proposed Bayesian grouped fixed-effects (BGFE) estimators with different assumptions on the cross-sectional variance and pairwise constraints. The last three estimators ignore the group structure.

  1. (i)

    BGFE-he-cstr: group-specific slope coefficients and heteroskedasticity with constraints.

  2. (ii)

    BGFE-he: group-specific slope coefficients and heteroskedasticity without constraints.

  3. (iii)

    BGFE-ho: homoskedastic version of BGFE-he.

  4. (iv)

    Pooled OLS: fully homogeneous estimator.

  5. (v)

    AR-he: flat-prior estimator that assumes p(αi) 1p(\alpha_{i})\;\propto\;1; it corresponds to a standard AR model with the additional regressor utu_{t} in this environment.

  6. (vi)

    AR-he-PC: AR-he with the lagged value of the first principal component as additional regressor.

In the first application, we focus on inflation forecasting. For the most recent advances in this topic, Faust and Wright (2013) provide a comprehensive overview of a large set of traditional and recently developed forecasting methods. Among the many candidate methods, we choose the AR model as the benchmark and include it as the only alternative estimator in this exercise. This is because the AR model is relatively hard to beat and, notably, other popular methods, such as the Atkeson–Ohanian random walk model (Atkeson et al., 2001), UCSV (Stock and Watson, 2007), and TVP-VAR (Primiceri, 2005), generally perform about as well as the AR model, according to Faust and Wright (2013).

4.1.4 Posterior Predictive Densities

We generate one-step-ahead forecasts of yi,T+1y_{i,T+1} for i=1,,Ni=1,...,N conditional on the history of observations

Y\displaystyle Y =[y1,y2,,yN],yi=[yi1,yi2,,yiT],\displaystyle=[y_{1},y_{2},...,y_{N}],y_{i}=[y_{i1},y_{i2},...,y_{iT}]^{\prime},
X\displaystyle X =[x1,x2,,xN],xi=[xi1,xi2,,xiT],\displaystyle=[x_{1},x_{2},...,x_{N}],x_{i}=[x_{i1},x_{i2},...,x_{iT}]^{\prime},

and newly available variables xiT+1x_{iT+1} at T+1T+1.

The posterior predictive distribution for unit ii is given by

p(yiT+1|Y,X)=p(yiT+1|Y,X,Θ)p(Θ|Y,X)𝑑Θ,\displaystyle p(y_{iT+1}|Y,X)=\int p(y_{iT+1}|Y,X,\Theta)p(\Theta|Y,X)d\Theta, (4.2)

where Θ\Theta is a vector of parameters Θ=(αgi,σgi2,gi)\Theta=\left(\alpha_{g_{i}},\sigma^{2}_{g_{i}},g_{i}\right). This density is the posterior expectation of the following function:

p(yiT+1|Y,X,Θ)=k=1K(G)𝟏(gi=k)p(yiT+1|Y,X,Θ),\displaystyle p(y_{iT+1}|Y,X,\Theta)=\sum_{k=1}^{K(G)}\mathbf{1}(g_{i}=k)p\left(y_{iT+1}|Y,X,\Theta\right), (4.3)

which is invariant to relabeling of the mixture components, where K(G)K(G) is the number of groups in GG. Given SS posterior draws, the posterior predictive distribution estimated from the MCMC draws is

p^(yiT+1|Y,X)=1Sj=1S[k=1K(j)(G)𝟏(gi=k)p(yiT+1|Y,X,Θ(j))].\displaystyle\hat{p}(y_{iT+1}|Y,X)=\frac{1}{S}\sum_{j=1}^{S}\left[\sum_{k=1}^{K^{(j)}(G)}\mathbf{1}(g_{i}=k)p\left(y_{iT+1}|Y,X,\Theta^{(j)}\right)\right]. (4.4)

where

p(yiT+1|Y,X,Θ(j))=ϕ(yiT+1;αgi(j)xit+1+γ(j)zit+1,σgi(j)2).\displaystyle p\left(y_{iT+1}|Y,X,\Theta^{(j)}\right)=\phi\left(y_{iT+1};\alpha_{g_{i}}^{(j)^{\prime}}x_{it+1}+\gamma^{(j)}z_{it+1},\sigma^{(j)^{2}}_{g_{i}}\right). (4.5)

We can therefore draw samples from p^(yiT+1|Y,X)\hat{p}(y_{iT+1}|Y,X) by simulating (2.1) forward conditional on the posterior draws of Θ\Theta and the observations. Note that the MCMC approximation represents the full Bayesian predictive distribution, implicitly integrating over the entire underlying parameter space.
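
A minimal sketch of the mixture-of-normals evaluation in (4.4)–(4.5) follows; the data structure holding the per-draw conditional means and variances is our own simplification.

import numpy as np
from scipy.stats import norm

def predictive_density(y_grid, draws_i):
    # draws_i: iterable of (mean_j, sigma2_j) pairs for unit i, where
    # mean_j = alpha_{g_i}^(j)' x_{i,T+1} + gamma^(j) z_{i,T+1}.
    dens = np.zeros_like(y_grid, dtype=float)
    for mean_j, sigma2_j in draws_i:
        dens += norm.pdf(y_grid, loc=mean_j, scale=np.sqrt(sigma2_j))
    return dens / len(draws_i)            # equal-weight average as in (4.4)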

4.1.5 Point Forecasts

We evaluate the point forecasts via the real-time recursive out-of-sample Root Mean Squared Forecast Error (RMSFE) under a quadratic compound loss function averaged across units. Let y^iT+1|T\hat{y}_{iT+1|T} denote the predicted value conditional on the observed data up to period TT; the loss function is written as

L(y^1:N,T+1|T,y1:N,T+1)=1Ni=1N(y^iT+1|TyiT+1)2=1Ni=1Nε^iT+1|T2,\displaystyle L\left(\widehat{y}_{1:N,T+1|T},y_{1:N,T+1}\right)=\frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_{iT+1|T}-y_{iT+1}\right)^{2}=\frac{1}{N}\sum_{i=1}^{N}\hat{\varepsilon}_{iT+1|T}^{2}, (4.6)

where yi,T+1y_{i,T+1} is the realization at T+1T+1 and ε^iT+1|T\hat{\varepsilon}_{iT+1|T} denotes the forecast error.

The optimal posterior forecast under the quadratic loss function is obtained by minimizing the posterior risk,

y^1:N,T+1|T\displaystyle\hat{y}_{1:N,T+1|T} =argminy^NL(y^,y1:N,T+1)p(y1:N,T+1|Y,X)𝑑y1:N,T+1\displaystyle=\operatorname*{argmin}_{\hat{y}\in\mathbb{R}^{N}}\int_{-\infty}^{\infty}L\left(\hat{y},y_{1:N,T+1}\right)p(y_{1:N,T+1}|Y,X)dy_{1:N,T+1}
=argminy^N1Ni=1NE[(y^yiT+1)2|Y,X].\displaystyle=\operatorname*{argmin}_{\hat{y}\in\mathbb{R}^{N}}\frac{1}{N}\sum_{i=1}^{N}E\left[\left(\hat{y}-y_{iT+1}\right)^{2}|Y,X\right]. (4.7)

This implies that the optimal posterior forecast is the posterior mean,

y^i,T+1|T=E(yiT+1|Y,X), for i=1,,N.\displaystyle\hat{y}_{i,T+1|T}=E\left(y_{iT+1}|Y,X\right),\text{ for }i=1,\ldots,N. (4.8)

Conditional on the posterior draws of the parameters, the mean forecast can be approximated by Monte Carlo averaging,

y^i,T+1|T1Sj=1Sy^iT+1|T(j)=1Sj=1Sα^gi(j)xiT+1.\displaystyle\hat{y}_{i,T+1|T}\approx\frac{1}{S}\sum_{j=1}^{S}\hat{y}_{iT+1|T}^{(j)}=\frac{1}{S}\sum_{j=1}^{S}\hat{\alpha}^{(j)^{\prime}}_{g_{i}}x_{iT+1}. (4.9)

Finally, the RMSFE across units is given by

RMSFET+1=1Ni=1N(yi,T+1y^i,T+1|T)2.\displaystyle RMSFE_{T+1}=\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_{i,T+1}-\hat{y}_{i,T+1|T}\right)^{2}}. (4.10)
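
The Monte Carlo point forecast (4.9) and the cross-sectional RMSFE (4.10) can be computed as in the sketch below; the array layout for the posterior draws is an illustrative assumption.

import numpy as np

def point_forecasts(alpha_draws, g_draws, X_next):
    # alpha_draws: length-S list of (K_j, d) coefficient arrays;
    # g_draws: (S, N) group indices; X_next: (N, d) regressors at T+1.
    yhat = np.zeros(X_next.shape[0])
    for alpha, g in zip(alpha_draws, g_draws):
        yhat += np.einsum("nd,nd->n", X_next, alpha[g])   # alpha_{g_i}^(j)' x_{i,T+1}
    return yhat / len(alpha_draws)

def rmsfe(y_real, y_hat):
    # Cross-sectional RMSFE (4.10)
    return np.sqrt(np.mean((y_real - y_hat) ** 2))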

4.1.6 Density Forecasts

To compare the density forecasts of the various estimators, we report the average log predictive score (LPS), which assesses the performance of a density forecast in terms of the entire predictive distribution. As suggested in Geweke and Amisano (2010), the LPS for a panel reads as,

LPST+1=\displaystyle LPS_{T+1}= 1Ni=1Nlnp(yiT+1|Y,X,Θ)p(Θ|Y,X)𝑑Θ,\displaystyle-\frac{1}{N}\sum_{i=1}^{N}\ln\int p\left(y_{iT+1}|Y,X,\Theta\right)p(\Theta|Y,X)d\Theta, (4.11)

where the expectation can be approximated using posterior draws:

p(yiT+1|Y,X,Θ)p(Θ|Y,X)𝑑Θ\displaystyle\int p\left(y_{iT+1}|Y,X,\Theta\right)p(\Theta|Y,X)d\Theta 1Sj=1Sp(yiT+1|Y,X,Θ(j)).\displaystyle\approx\frac{1}{S}\sum_{j=1}^{S}p\left(y_{iT+1}|Y,X,\Theta^{(j)}\right). (4.12)

The following results are also robust to other metrics such as the continuous ranked probability score (Matheson and Winkler, 1976; Hersbach, 2000).
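
Because (4.11)–(4.12) average densities rather than log densities across draws, a numerically stable implementation again works through logsumexp; the sketch below assumes an (S, N) array of log-likelihood evaluations at the posterior draws.

import numpy as np
from scipy.special import logsumexp

def lps(loglik_draws):
    # loglik_draws[j, i] = log p(y_{i,T+1} | Y, X, Theta^(j)).
    S = loglik_draws.shape[0]
    log_pred = logsumexp(loglik_draws, axis=0) - np.log(S)   # log of (4.12)
    return -log_pred.mean()                                  # minus sign as in (4.11)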

4.2 Inflation of the U.S. CPI Sub-Indices

Policymakers and market participants are keenly interested in the ability to reliably predict future disaggregated inflation rates. Central banks predict future inflation trends to justify interest rate decisions and to control and maintain inflation around their targets. The Federal Reserve Board forecasts disaggregated price categories for short-term inflation forecasting (Bernanke, 2007), relying primarily on the bottom-up approach that estimates and forecasts price behavior for the various categories of goods and services that make up the aggregate price index. Moreover, private-sector investors in fixed-income markets wish to forecast future sectoral inflation in order to anticipate future trends in discounted real returns. Some private firms also need to predict specific inflation components in order to forecast price dynamics and reduce risks accordingly.

In this section, we demonstrate the use of constrained BGFE estimators with prior knowledge of the group pattern to forecast inflation rates for the sub-indices of the U.S. Consumer Price Index (CPI). We focus primarily on the one-step-ahead point and density forecasts. Due to space constraints, we only report the group partition for the most recent month in the main text.

4.2.1 Model Specification and Data

Model: We start by exploring the out-of-sample forecast performance of a simple, generic Phillips curve model. It is a panel autoregressive distributed lag (ADL) model with a group pattern in the intercept, the coefficients, and the error variance. The model is given by

yit+1=αgi+j=0p1ρgi,jyitj+βgiut+εit+1,εit+1N(0,σgi2),\displaystyle y_{it+1}=\alpha_{g_{i}}+\sum_{j=0}^{p-1}\rho_{g_{i},j}y_{it-j}+\beta_{g_{i}}u_{t}+\varepsilon_{it+1},\quad\varepsilon_{it+1}\sim N(0,\sigma_{g_{i}}^{2}), (4.13)

where yity_{it} is the year-over-year inflation rate, i.e., yit=log(priceit/priceit12)y_{it}=\log(\text{price}_{it}/\text{price}_{it-12}), and utu_{t} is the slack measure for the labor market, the unemployment gap. We fix pp at 3, the lag order at which the benchmark AR model attains its best predictive performance.


Data: We collect the sub-indices of the CPI for all urban consumers (CPI-U), including food and energy. The raw data are obtained from the U.S. Bureau of Labor Statistics (BLS) and are recorded on a monthly basis from January 1947 to August 2022. The CPI-U is a hierarchical composite index system that partitions all consumer goods and services into a hierarchy of increasingly detailed categories. It consists of eight major expenditure categories: (1) Apparel; (2) Education and Communication; (3) Food and Beverages; (4) Housing; (5) Medical Care; (6) Recreation; (7) Transportation; (8) Other Goods and Services. Each category is composed of finer and finer sub-indices until the most detailed levels, or “leaves,” are reached. This hierarchical structure can be represented as a tree, as shown in Figure 5. It is important to note that a parent series and its child series may be highly correlated and readily form a group, because parent series are aggregated from child series. For instance, Energy Services is expected to be correlated with its child series Utility gas service and Electricity. Given our focus on the group structure, it is vital to eliminate all parent series in order to prevent not just double-counting but also dubious grouping results. More details regarding the data are provided in Appendix F.1.

Figure 5: Hierarchical Structure of CPI

All Items
  Housing
    Shelter: Rent of primary residence; Owners’ equivalent rent of residences
    Fuels and utilities
      Water sewer and trash
      Household energy
        Fuel oil and other fuels: Fuel oil; Propane kerosene and firewood
        Energy services: Electricity; Utility gas service
  Transportation […]
  Food and beverages […]

Pre-grouping: The official expenditure categories are used to build PL and NL constraints: all units within the same categories are presumed to be positive-linked, while units from different categories are believed to be negative-linked.

We focus on the CPI sub-indices after January 1990 for two reasons: (1) the number of sub-indices before 1990 was relatively small, diminishing the importance of the group structure; and (2) consumption patterns changed and more expenditure series were introduced in the 1990s as a result of the popularity of electronic products, food care, etc. After the elimination of all parent nodes, the unbalanced panel consists of 156 sub-indices in eight major expenditure categories. We employ rolling estimation windows of 48 months131313The benchmark AR-he model scores the best overall performance with a window size of 48. and require each estimation sample to be balanced, removing individual series lacking a complete set of observations in a given window. Finally, we generate 329 samples, with the first forecast produced for April 1995.

4.2.2 Results

We begin the empirical analysis by comparing the performance of point and density forecasts across the 329 samples. Throughout the analysis, the AR-he estimator serves as the benchmark, as it essentially assumes fully individual-specific effects.

In Figure 6, we present the frequency with which each estimator attains the lowest RMSFE in panel (a) and the boxplot141414The boundaries of the whiskers are based on the 1.5 IQR value. All other points outside the boundary of the whiskers are plotted as outliers in red crosses. of the ratio of RMSFE relative to the AR-he estimator in panel (b). First, the AR-he and AR-he-PC estimators, which rely only on each unit's own past data, are not competitive in point forecasts and perform considerably worse than the others. This implies that exploiting cross-sectional information is highly advantageous for improving point forecasts. Moreover, the BGFE-he-cstr estimator scores the highest frequency of being the best estimator even though the BGFE-he-cstr, BGFE-he, BGFE-ho, and pooled OLS estimators all utilize cross-sectional information. Examining the box plot, we find that the BGFE-ho and pooled OLS estimators, which overlook heteroskedasticity, can achieve better performance in some samples but make poorer forecasts more often than the other estimators. The BGFE-he-cstr and BGFE-he estimators, on the other hand, typically outperform the others and the benchmark in terms of median RMSFE and the ability to produce forecasts with the lowest RMSFE without increasing the risk of generating the least accurate forecasts.

Figure 6: RMSFE - All Samples
(a) Freq. lowest RMSFE
Refer to caption
(b) Boxplot: RMSEm\text{RMSE}_{m} / RMSEARhe\text{RMSE}_{AR-he}
Refer to caption

The revealing patterns of the density forecasts are markedly distinct from those of the point forecasts. Figure 7 depicts the log predictive score (LPS) for the density forecasts. The most notable pattern in panel (a) is that the BGFE-he-cstr estimator, which incorporates prior knowledge, dominates and outperforms the rest in over 80% of the samples; it emerges as the clear winner in this case. Furthermore, when generating density forecasts, the BGFE-ho and pooled OLS estimators are not as accurate as they are in point forecasts: they never attain the lowest LPS across samples. This also confirms that heteroskedasticity151515We provide more results in Section G.1.2 to explore the importance of heteroskedasticity in density forecasts for inflation. is a well-known feature of inflation time series (Clark and Ravazzolo, 2015). In the boxplot, we omit BGFE-ho and pooled OLS and show the differences in LPS between the respective estimators and the AR-he estimator. As LPS differences represent percentage point differences, BGFE-he-cstr can provide density forecasts that are up to 22% more accurate than the benchmark model. Finally, despite the fact that the BGFE-he-cstr and BGFE-he estimators are based on essentially the same algorithm, the use of prior knowledge of the group pattern further enhances performance, resulting in the BGFE-he-cstr estimator having a lower LPS and scoring as the best model with the highest frequency.

Figure 7: Log Predictive Scores - All Samples
(a) Freq. lowest LPS
Refer to caption
(b) Boxplot: LPSm\text{LPS}_{m} - LPSAR\text{LPS}_{AR}
Refer to caption

Next, we assess the value of adding prior information about groups by comparing the performance of the BGFE-he-cstr and BGFE-he estimators exclusively. The solid black line in Figure 8 represents the ratio of RMSE between BGFE-he-cstr and BGFE-he. The periods during which BGFE-he-cstr, BGFE-he, and all other estimators achieve the lowest RMSE are indicated by pink, blue, and green shaded areas, respectively. Though the BGFE-he-cstr estimator is not always the best across samples, the prior information improves the performance of the Bayesian grouped estimator: the BGFE-he-cstr estimator performs better than the BGFE-he estimator in most samples, with an average improvement of 2%.

Figure 8: Relative RMSE, BGFE-he-cstr
Refer to caption

Adding prior information on groups substantially improves the accuracy of density forecasts. Figure 9 shows the comparison between BGFE-he-cstr and BGFE-he in terms of the difference in LPS. We find the prior information valuable, as BGFE-he-cstr outperforms BGFE-he in more than 98% of the samples. Clearly, the majority of the figure is covered by a pink background, showing that BGFE-he-cstr is typically the best choice. All of these facts demonstrate that adding prior information is favorable and essential, especially for density forecasting.

Figure 9: Relative LPS, BGFE-he-cstr
Refer to caption

Having specified pairwise constraints across sub-indices, we provide a prior on GG that shrinks the group structure toward the eight expenditure categories with equal accuracy for all pairs inside each category. As Theorem 1 suggests, our prior specification essentially assumes that the prior probability of any two units from the same expenditure category being in the same group is equal, so the prior group pattern is effectively the expenditure category. We now examine the posterior of the group structure to demonstrate how the distribution of GG is updated by the data. To do so, we construct a posterior similarity matrix (PSM), whose (i,j)(i,j)-element records the posterior probability of units ii and jj being in the same group. For illustrative purposes, we present the results for the last sample, in which we forecast CPI in August 2022. Figure 10 depicts the PSM generated by BGFE-he-cstr for the series in the categories of Food and Beverages and Transportation. A darker block indicates a higher posterior probability of being in one group. A common pattern emerges: even though some sub-indices are joined together frequently, as shown by the dark diagonal blocks, it is extremely unlikely that all series within the same category belong to the same group. Some series have relatively low or zero probabilities of being grouped together, as suggested by the white and gray off-diagonal blocks. This indicates that the group structure based on the official expenditure categories is not optimal, which may result in inaccurate forecasting. Instead, our proposed framework uses information from both prior beliefs and data to reinvent the group pattern, leading to improved forecasting performance.

Figure 10: Posterior Similarity Matrices for Selected Categories
(a) Food and beverages (Category 3)
Refer to caption
(b) Transportation (Category 7)
Refer to caption

Notes: The ordering of the series is rearranged so that clusters lie on the diagonal; this is not the direct output of the algorithm.

Finally, we restrict our analysis to the point estimate of the group partition, i.e., a single grouping solution, rather than the posterior over the whole universe of partitions. Figure 11 depicts the posterior point estimate of GG for the last sample, which ends in August 2022, derived using the approach described in Section 3.2. The eight expenditure categories are divided into twelve groups of varied sizes. Two different forms of groups emerge based on the arrangement of their components. Groups 2, 3, 4, and 5 contain sub-indices from a variety of categories, with no clear dominance. In contrast, the majority of the series in groups 1 and 8, for example, belong to a particular category: group 1 may be labeled a Food group, whereas group 8 is a Transportation group. The detailed components of this Transportation group are depicted in Figure 12. There are seven sub-indices from Transportation, including car and truck rentals, gasoline (regular, midgrade, and premium), other motor fuels, airline fares, and ship fares, and one series from Housing – fuel oil (for residential heating). Clearly, all of these sub-indices share a common trend and have a close relationship with energy and oil prices, which have increased since the pandemic. This example demonstrates that our proposed algorithm exploits cross-sectional information, not limited to our prior knowledge, and forms meaningful groups for forecasting.

Figure 11: Posterior Point Estimate of the Group Partition, August 2022
Refer to caption
Figure 12: Components in the Group #9, August 2022
Refer to caption

4.2.3 Impact of the Accuracy of Constraints

We examine how the accuracy of the pairwise constraints influences the point estimate of the group partition. For demonstration purposes, we restrict our analysis to PL constraints only, setting ψijNL=0.5\psi^{NL}_{ij}=0.5 and varying ψijPL\psi^{PL}_{ij}. We do not tune the constant cc in this setup, since doing so would offset the impact of PL constraints with different accuracy levels; instead, we fix cc at 0.5. Again, the PL constraints are derived from the official expenditure categories, with the assumption that all units within the same category are positive-linked with equal probability of being in the same group.

Figure 13: Impact of the Strength of Constraints
(a) Weaker PL (ψijPL=0.55)\left(\psi^{PL}_{ij}=0.55\right)
Refer to caption
(b) Stronger PL (ψijPL=0.95)\left(\psi^{PL}_{ij}=0.95\right)
Refer to caption

Figure 13 presents the point estimates of the group structure under two different levels of accuracy. The “weaker” PL constraints with ψijPL=0.55\psi^{PL}_{ij}=0.55, shown in panel (a), demonstrate a limited influence of the prior knowledge on the group structure. The eleven groups are composites of CPI sub-indices from various categories and differ from the official expenditure categories. Panel (b), on the other hand, illustrates the group structure with “stronger” PL constraints. By setting a high level of accuracy for the PL constraints, such as ψijPL=0.95\psi^{PL}_{ij}=0.95, the prior knowledge dominates and pushes the group structure toward the official expenditure categories. As anticipated, panel (b) shows fewer groups, and the majority of CPI sub-indices within each group belong to the same category, bringing the group structure closer to that of the prior.

4.3 Income and Democracy

It is well known in the literature that income per capita is strongly correlated with the level of democracy across countries. This strong empirical regularity is often known as “modernization theory” or the Lipset hypothesis (Lipset, 1959). The theory claims a causal relation: democratic regimes are created and consolidated in affluent societies (Lipset, 1959; Przeworski et al., 1995; Barro, 1999; Epstein et al., 2006).

In an influential paper, Acemoglu et al. (2008) challenge the causal effect of countries' income on the level of democracy. They argue that it is essential to take into account other factors that affect both economic and political development simultaneously. Their analysis, based on panel data, indicates that the positive relationship between income and democracy disappears when fixed effects are included in the regression. They suggest that this finding is due to historical events, such as the end of feudalism, industrialization, or colonization, which have led countries to follow distinct paths of development; the fixed effects are meant to capture these persistent events. The finding is robust, as it holds for different measures of democracy, various econometric specifications, and additional covariates. Another study, by Bonhomme and Manresa (2015), uses a different econometric model but arrives at the same conclusion. Their analysis highlights the presence of diverse group-specific paths of democratization in the data, consistent with the observation that regime types and transitions tend to cluster in time and space (Gleditsch and Ward, 2006; Ahlquist and Wibbels, 2012).

The seminal work of Acemoglu et al. (2008) has been subject to critical scrutiny by several recent works, including Moral-Benito and Bartolucci (2012), Benhabib et al. (2013), and Cervellati et al. (2014). Moral-Benito and Bartolucci (2012) contend that a nonlinear relationship between income and democracy exists, even after accounting for country-specific effects. They show that a positive income-democracy relationship holds only in countries with low levels of income. Benhabib et al. (2013) use panel estimation methods that adjust for the censoring of democracy measures at their upper and lower bounds and find that the positive relationship between income and democracy withstands the inclusion of country fixed effects. Cervellati et al. (2014) extend the linear estimation framework of Acemoglu et al. (2008) and unveil the presence of significant heterogeneity in the income effect on democracy across different subsamples. Specifically, they demonstrate that this effect exhibits an opposite sign for colonies and non-colonies, is substantially different from zero, and is of considerable magnitude. They also argue that the existence of a heterogeneous effect of income suggests that results from a linear framework, such as the finding of a zero effect, may lack robustness since they depend on the composition of the sample.

In this section, we contribute to the literature on the relationship between income and democracy by employing a novel grouped fixed-effects approach. Specifically, we expand on the econometric model proposed by Bonhomme and Manresa (2015) to incorporate group structure not only in time fixed-effects, but also in slope coefficients and the variance of errors. This more complex model allows for a more detailed analysis of the heterogeneous effects of income on democracy across different groups of countries. To identify these groups, we incorporate prior knowledge about the latent group structure by clustering countries based on either geographic location or initial levels of democracy score. By leveraging this information, we are able to identify a moderate number of groups, each of which exhibits a distinct path to democracy.

Our results indicate that the effect of income on democracy is highly varied across countries, a finding consistent with previous research by Cervellati et al. (2014). Furthermore, we find that a positive cumulative effect of income on democracy exists in groups of countries with a medium or relatively low level of income, in line with the findings of Moral-Benito and Bartolucci (2012). However, the effect can be relatively small for some groups, suggesting that factors beyond income may also play important roles in democratic development.

4.3.1 Model Specification and Data

Model: To accommodate richer model assumptions in the real-world applications, we extend the baseline model in Equation (2.1) by adding common regressors and allowing for time-variation in the fixed-effects. Time-variation is essential to this analysis, as it captures highly persistent historical shocks. Following Bonhomme and Manresa (2015), we introduce group-specific time patterns of heterogeneity αgit\alpha_{g_{i}t} and consider the following two specifications:

  • SP1:

    Time-varying GFE + grouped slope coefficients

    yit=αgit+ρgiyit1+βgixit1+εit,εitN(0,σgi2),\displaystyle y_{it}=\alpha_{g_{i}t}+\rho_{g_{i}}y_{it-1}+\beta_{g_{i}}x_{it-1}+\varepsilon_{it},\;\varepsilon_{it}\sim N(0,\sigma_{g_{i}}^{2}), (4.14)
  • SP2:

    Time-varying GFE

    yit=αgit+ρyit1+βxit1+εit,εitN(0,σgi2),\displaystyle y_{it}=\alpha_{g_{i}t}+\rho y_{it-1}+\beta x_{it-1}+\varepsilon_{it},\;\varepsilon_{it}\sim N(0,\sigma_{g_{i}}^{2}), (4.15)

where yity_{it} is the democracy score of country ii at time tt. The lagged democracy score yit1y_{it-1} is included to capture persistence in democracy as well as potentially mean-reverting dynamics (i.e., the tendency of the democracy score to return to some country-specific equilibrium value). The coefficient of main interest is βgi\beta_{g_{i}}, which reflects the effect of the lagged value of log\log income per capita xit1x_{it-1} on democracy. In addition, αgit\alpha_{g_{i}t} denotes a set of group-specific time fixed-effects, and εit\varepsilon_{it} is an error term with grouped variance σgi2\sigma_{g_{i}}^{2}, capturing additional transitory shocks to democracy and other omitted factors. We use conjugate priors for all parameters; see Appendix B for details.

Specification 2 in (4.15) nests the linear dynamic panel data model in BM as a special case: if we assume homoskedasticity, it reduces to Equation (22) in BM. This specification enables us to reproduce BM's results and provide fresh insight into their framework. Specification 1 in (4.14), on the other hand, generalizes specification 2 by introducing group-specific slope coefficients. As we demonstrate in the following section, specification 1 yields a more refined group structure and provides a clearer view of the income effects.


Data: We use the Freedom House (FH) Political Rights Index as a benchmark for measuring democracy. To standardize the index, we normalize it between 0 and 1, with higher scores indicating higher levels of democracy. FH assesses a country’s political rights based on a checklist of questions, such as the presence of free and fair elections, the role of elected officials, the existence of competitive parties or other political groupings, the power of the opposition, and the extent of self-government or participation by minority groups. We measure countries’ income using the logarithm of GDP per capita, which is adjusted for purchasing power parity (PPP) in 1996 prices, using data from the Penn World Tables 6.1 (Heston et al., 2002). Details on the data can be found in Section 1 of Acemoglu et al. (2008), and all data in this section are from the replication files of Bonhomme and Manresa (2015).161616 https://www.dropbox.com/s/ssjabvc2hxa5791/Bonhomme_Manresa_codes.zip?dl=0 Our analysis is based on a five-year panel dataset that includes all independent countries since the postwar period, with observations taken every fifth year from 1970 to 2000. We chose this period for comparability with previous studies. The final dataset consists of a balanced panel of 89 countries.


Prior group structure: We propose the two prior grouping strategies specified below. All units within the same prior group are presumed to be positive-linked, while units from different prior groups are believed to be negative-linked.

  1. (i)

    Given the countries available in the dataset, we form six groups according to their geographic locations:171717The regions are assigned by the Economist Intelligence Unit and may slightly differ from conventional classifications. (1) North America; (2) Europe; (3) Latin America and the Caribbean; (4) Asia and Australasia; (5) Sub-Saharan Africa; (6) Middle East and North Africa. We refer to this prior as the geo-prior.

  (ii) Alternatively, countries can be categorized according to their initial level of democracy in 1970. As the Freedom House Index takes six possible values, we cluster countries into three primary groups, each with a reasonable number of countries: (1) low democracy, $y_{i,1970}=$ 0 or 0.166; (2) medium democracy, $y_{i,1970}=$ 0.333, 0.5, or 0.667; (3) high democracy, $y_{i,1970}=$ 0.833 or 1. We refer to this prior as the dem-prior.
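As referenced above, here is a minimal sketch of how a prior grouping translates into pairwise constraints. The function name and the +1/-1 encoding are our own illustrative conventions, not the paper's notation.

```python
import numpy as np

def pairwise_constraints(prior_groups):
    """Encode a prior grouping as a pairwise constraint matrix T:
    T[i, j] = +1 (positive-linked) if units i and j share a prior group,
    T[i, j] = -1 (negative-linked) otherwise; the diagonal is left at 0."""
    z = np.asarray(prior_groups)
    T = np.where(z[:, None] == z[None, :], 1, -1)
    np.fill_diagonal(T, 0)
    return T

# Geo-prior style example: units 0-2 in region A, units 3-4 in region B.
T = pairwise_constraints([0, 0, 0, 1, 1])
```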

Figure 14 presents the world maps with countries colored differently according to their respective groups. The panel on the left illustrates the geographic groups, while the panel on the right depicts the democratic groups. All gray nations/regions are excluded from the dataset. We concentrate primarily on the first pre-grouping strategy, as it needs no country-specific knowledge beyond geographic information. We then compare the results using different pre-grouping strategies in Section 4.3.3.

Figure 14: Specifications of Prior Grouping. (a) Geographic Location (Geo-Prior); (b) Initial Level of Democracy (Dem-Prior).

4.3.2 Results

Specification 1: We begin with specification 1, where group-specific slope coefficients are allowed and new findings emerge. Table 1 reports the posterior probability of the number of groups under various estimators. BGFE-ho produces more than five groups in all posterior draws. Intriguingly, accounting for heteroskedasticity drastically reduces the number of groups: BGFE-he identifies four groups. Adding pairwise constraints based on geographic information increases the number of groups to five, whereas six groups are expected a priori.

Table 1: Probability for Number of Groups, Geo-Prior

            BGFE-he-cstr   BGFE-he
Pr(K < 4)   0.000          0.000
Pr(K = 4)   0.000          1.000
Pr(K = 5)   1.000          0.000
Pr(K > 5)   0.000          0.000

The marginal data density (MDD) of each estimator in Table 2 provides some insight into the different models. Among all the estimators, the BGFE-ho estimator has the lowest MDD; it is even lower than its MDD under specification 2. BGFE-he-cstr and BGFE-he, on the other hand, benefit from the introduction of group-specific slope coefficients, as both achieve substantially greater MDD than under specification 2. BGFE-he-cstr has the highest MDD, since the pairwise constraints give direction on grouping and help identify the ideal group structure, which BGFE-he cannot uncover without our prior knowledge.

Table 2: Marginal Data Density, Geo-Prior

       BGFE-he-cstr   BGFE-he   BGFE-ho
SP1    544.324        501.904   327.077
SP2    413.476        381.218   368.918

We concentrate on the BGFE-he-cstr estimator and use the approach outlined in Section 3.2 to identify the unique group partitioning $\widehat{G}$. The left panel of Figure 15 presents the world map colored by $\widehat{G}$, while the right panel presents the group-specific averages of the democracy index over time. The estimated group structure $\widehat{G}$ features five distinct groups, which we refer to as the "high-democracy", "low-democracy", "flawed-democracy", "late-transition", and "progressive-transition" groups, respectively. With the exception of the "flawed-democracy" and "progressive-transition" groups, the group-specific averages of the democracy index are comparable to those in BM. Compared to BM, BGFE-he-cstr does not identify an "early-transition" group but instead produces two new groups. Group 3 ("flawed-democracy") consists primarily of relatively democratic but not the most democratic nations, including India, Sri Lanka, and Venezuela, among others. Group 5 ("progressive-transition") contains 30 countries that have experienced a steady expansion of democracy, including Argentina, Greece, and Panama. Consequently, by incorporating group-specific slope coefficients, we recover a more refined group structure than that of BM.

Figure 15: Posterior Point Estimate of Group Partitioning and Average Democracy.

Table 3 presents the posterior mean and 90% credible set for each coefficient across all groups, with $G$ fixed at the point estimate $\widehat{G}$. The key feature of specification 1 is that it reveals distinct (cumulative) income effects across groups, as group-specific slope coefficients are allowed.

The effect of income on democracy is negligible for group 1 ("high-democracy") and group 4 ("late-transition"), as the posterior means of $\beta$ are close to 0 and the associated credible intervals contain 0. Group 1 mostly contains high-income, high-democracy countries. It includes the United States, Canada, the UK, most European countries, Australia, and New Zealand, but also Costa Rica and Uruguay. These countries kept their democracy index at the highest level throughout the sample, so income has no measurable effect on democracy. Group 4, the "late-transition" group, consists of Benin, the Central African Republic, Mali, Malawi, Niger, and Romania. The transition to democracy for countries in group 4 was primarily driven by historical events in the 1990s: Romania began a transition toward democracy after the 1989 Revolution, while the other countries were involved in the third wave of democratization in sub-Saharan Africa beginning in 1989. The impact of these historical events is primarily captured by the time fixed-effects, as the credible intervals for $\hat{\beta}$ and $\hat{\rho}$ comfortably cover zero.

Group 2 ("low-democracy") and group 3 ("flawed-democracy") both have stable Freedom House scores. Lagged democracy is highly significant, indicating a considerable degree of persistence in democracy. Log income per capita is also significant and illustrates the well-documented positive relationship between income and democracy. Though statistically significant, the effect of income is quantitatively small. For example, the coefficient of 0.055 for group 3 implies that a 10 percent increase in GDP per capita is associated with an increase in the Freedom House score of only 0.0055. Group 2 includes low-democracy countries, such as China, Singapore, Iran, and a fraction of African countries, whose Freedom House scores remain relatively low throughout the sample. Countries in group 3, by contrast, have Freedom House scores that stay at relatively high levels. The group covers countries that are almost fully democratic but exhibit minor flaws in certain aspects, including India, Japan, South Korea, Finland, Sweden, and Portugal, among others. The cumulative income effects for these two groups differ, however: the effect is negligible for group 2 (0.079) and modest for group 3 (0.249). Group 5 ("progressive-transition"), on the other hand, experiences a continuous increase in its Freedom House score from about 0.35 in 1970 to 0.7 in 2000. It has the largest positive income coefficient, although the cumulative income effect remains modest (0.157).
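For readers who want the algebra behind the "Income Effect" column, the long-run effect follows from the steady state of (4.14), in which $y_{it}=y^{*}$ and $x_{it}=x^{*}$; note that the ratio of posterior means (e.g., $0.055/(1-0.775)\approx 0.244$ for group 3) need not equal the posterior mean of the ratio reported in Table 3 (0.249):

y^{*}=\alpha_{g}+\rho_{g}\,y^{*}+\beta_{g}\,x^{*}\quad\Longrightarrow\quad\frac{\partial y^{*}}{\partial x^{*}}=\frac{\beta_{g}}{1-\rho_{g}}.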

Table 3: Coefficient Estimates across Groups, Geo-Prior

                 Lagged democracy (ρ)       Lagged income (β)         Income effect (β/(1-ρ))    Error variance (σ²)
                 Coef.   Cred. Set          Coef.   Cred. Set         Coef.   Cred. Set          Coef.   Cred. Set
Group 1 (16)     0.058   [-0.263, 0.360]    0.000   [-0.012, 0.012]   0.000   [-0.013, 0.013]    0.001   [0.001, 0.001]
Group 2 (18)     0.484   [ 0.354, 0.606]    0.041   [ 0.019, 0.062]   0.079   [ 0.044, 0.117]    0.010   [0.008, 0.011]
Group 3 (19)     0.775   [ 0.703, 0.850]    0.055   [ 0.031, 0.078]   0.249   [ 0.143, 0.352]    0.013   [0.011, 0.016]
Group 4 (6)     -0.178   [-0.468, 0.115]   -0.025   [-0.066, 0.017]  -0.020   [-0.054, 0.015]    0.008   [0.005, 0.011]
Group 5 (30)     0.206   [ 0.091, 0.310]    0.125   [ 0.090, 0.163]   0.157   [ 0.118, 0.199]    0.057   [0.048, 0.066]
Pooled OLS       0.667   [ 0.614, 0.717]    0.081   [ 0.064, 0.099]   0.244   [ 0.207, 0.281]    0.039   [0.035, 0.043]

Note: group sizes in parentheses; credible sets are 90% intervals.

Figure 16 depicts the historical average log income for each group of countries between 1970 and 2000. Except for the high-income group (group 1) and the low-income group (group 4), the three groups with medium or relatively low levels of income all exhibit a positive cumulative income effect, as indicated in Table 3. This finding is broadly in line with the results of Moral-Benito and Bartolucci (2012), who observe a positive effect in low-income nations. Note that their definition of a low-income country is quite broad, encompassing countries with GDP per capita below the 80th percentile of the empirical cross-sectional density. As a result, besides the six countries in group 4 that had the lowest income on average, the other countries fit within this definition, confirming that the positive income effect is not prevalent in high-income countries.

Figure 16: Average Log Income by Groups.

Specification 2: The results for specification 2 are reported in Appendix G.2. In short, they are comparable to the key findings in BM. BGFE-ho in specification 2 is identical to the main model in BM; it produces eight groups, which is consistent with the upper bound on the number of groups in BM based on BIC. BGFE-he-cstr, on the other hand, is preferable and has the highest marginal data density, as shown in Table 2. The point estimate of the group partitioning based on BGFE-he-cstr consists of four groups that all exhibit patterns similar to BM's group structure. This justifies BM's subjective choice of four groups. Regarding the estimated coefficients, there is moderate persistence and a positive effect of income on democracy, but the cumulative effect of income is quantitatively small: $\beta/(1-\rho)=0.08$.

4.3.3 Impacts of Different Pre-Grouping Strategies

All results presented thus far are based on pairwise constraints derived from geographic information. We now implement the alternative pre-grouping strategy based on the initial level of democracy, sticking with the BGFE-he-cstr estimator under specification 1.

Different pre-grouping strategies yield different estimates of the group pattern. We consider the BGFE-he estimator under three prior group structures: geo-prior, dem-prior, and no prior knowledge, with point estimates of the group partition presented in panels (a), (b), and (c) of Figure 17, respectively. The geo-prior and dem-prior produce comparable group structures, though certain nations are assigned to different groups. These are encircled in the black dashed rectangles and include Portugal, Spain, Romania, Mali, Niger, the Central African Republic, Benin, Malawi, Jordan, and South Korea. As depicted in panel (c), without any prior knowledge of groups, the group pattern is quite different, particularly for countries in Asia, Africa, and Latin America.

The differences in group patterns result in discrepancies in MDD, listed in Table 4. The BGFE-he estimator with the geo-prior has the highest MDD in both specifications, whereas the dem-prior is informative only in specification 1 relative to the BGFE-he estimator without prior knowledge.

Table 4: Marginal Data Density, Different Priors

       geo-prior   dem-prior   no prior
SP1    544.324     527.517     501.904
SP2    413.476     309.369     381.218
Figure 17: Posterior Point Estimate of Group Structure. (a) Geo-Prior; (b) Dem-Prior; (c) No Prior Knowledge.

Using the initial level of democracy as prior knowledge results in four groups, as indicated in panel (b) of Figure 17. The dem-prior has two major impacts on the group structure compared with the geo-prior. First, it merges the "late-transition" group (group 4 under the geo-prior) with the "progressive-transition" group (group 5 under the geo-prior) into a larger and broader "progressive-transition" group. Second, Portugal and Spain are no longer categorized as "flawed-democracy" countries but instead join this broader "progressive-transition" group. In terms of the posterior estimates, shown in Table 5, we observe similar values for the first three groups, since they are subject to only minor changes in the group structure. However, as the "late-transition" and "progressive-transition" groups are agglomerated under the dem-prior, the income effect for countries in the merged group becomes larger, forcing some countries to exhibit strong, positive effects even though they do not under the geo-prior.

Table 5: Coefficient Estimates across Groups, Dem-Prior

                 Lagged democracy (ρ)       Lagged income (β)         Income effect (β/(1-ρ))    Error variance (σ²)
                 Coef.   Cred. Set          Coef.   Cred. Set         Coef.   Cred. Set          Coef.   Cred. Set
Group 1 (16)     0.062   [-0.262, 0.369]    0.000   [-0.012, 0.012]   0.000   [-0.014, 0.013]    0.001   [0.001, 0.001]
Group 2 (19)     0.513   [ 0.395, 0.640]    0.043   [ 0.022, 0.064]   0.087   [ 0.047, 0.126]    0.010   [0.008, 0.012]
Group 3 (15)     0.802   [ 0.704, 0.904]    0.040   [ 0.016, 0.064]   0.223   [ 0.079, 0.337]    0.012   [0.009, 0.015]
Group 4 (39)     0.302   [ 0.205, 0.388]    0.120   [ 0.092, 0.151]   0.172   [ 0.136, 0.210]    0.054   [0.047, 0.062]
Pooled OLS       0.667   [ 0.614, 0.717]    0.081   [ 0.064, 0.099]   0.244   [ 0.207, 0.281]    0.039   [0.035, 0.043]

Note: group sizes in parentheses; credible sets are 90% intervals.

5 Extensions

Within the domain of panel data models, the proposed constraint-based BGFE framework can be extended in multiple directions to allow for more subtle group structures or more covariates. In addition, the DP prior with soft pairwise constraints applies to other related topics and models, such as clustering problems, heterogeneous treatment effects, and panel VARs.

5.1 Subtle Group Structure

Though the Dirichlet process defines a prior that possesses the clustering property and is flexible enough to incorporate pairwise constraints, the group structure itself is elementary. Aside from our prior beliefs about the groups, the group structure, which enters through all $\alpha_{i}$ and $\sigma_{i}^{2}$, is entirely governed by the stick-breaking process defined in Equation (2.13). The stick length $\xi_{k}$, on which we place a prior, is independent of any regressors and of time. Consequently, each unit is associated with a single group, and its membership remains constant over time.

To create an even more flexible and richer group structure, we sketch three possible extensions, each of which requires a more distinctive nonparametric prior: (1) overlapping groups, (2) time-varying groups, and (3) dependent groups.

Overlapping group structures allow for multi-dimensional grouping. This is a natural extension that does not require greatly modifying the proposed DP prior. Following Cheng et al. (2019), each of $\alpha_{i}$ and $\sigma_{i}^{2}$ may have its own group structure, with a separate Dirichlet process specified for each. As a result, units simultaneously belong to multiple groups, based on the heterogeneous effects of regressors or on cross-sectional heteroskedasticity.

Time-varying group structures allow group membership to change over time. We could replace the DP with variants of the hierarchical Dirichlet process (Teh et al., 2006) to achieve this feature. In short, the hierarchical Dirichlet process (HDP), a nonparametric Bayesian approach to clustering grouped data, then becomes the foundation of the prior. The time dimension naturally divides the panel data into $T$ groups, and a Dirichlet process is assumed for each group, with all Dirichlet processes sharing the same base distribution, which is itself distributed according to a global base distribution. The HDP allows each group to have its own clusters, but, most importantly, these clusters are shared across groups. This lays the groundwork for time-varying group structures, as it assumes that the number of clusters remains constant over time while cluster memberships are subject to change. Variants of the HDP have been proposed to capture time-persistence in group structures, including the dynamic HDP (Ren et al., 2008) and the sticky HDP (Fox et al., 2008, 2011). A closely related strand of the frequentist literature identifies structural breaks in parameters with grouped patterns; see Okui and Wang (2021) and Lumsdaine et al. (2022).

Dependent group structures allow the prior group probabilities to depend directly on a collection of characteristics. The dependence is introduced through a modification of the stick-breaking representation of the DP, in which the group probabilities vary with the characteristics. Rodriguez and Dunson (2011) introduced the probit stick-breaking (PSB) process, where the Beta random variables are replaced by normally distributed random variables transformed through the standard normal CDF. The PSB is defined by

\pi_{k}\left(w_{i}\right)=\Phi\left(\zeta_{k}\left(w_{i}\right)\right)\prod_{j<k}\left[1-\Phi\left(\zeta_{j}\left(w_{i}\right)\right)\right], \qquad (5.1)

where the stochastic function $\zeta_{k}$ is drawn from a Gaussian process, $\zeta_{k}\sim GP(0,V_{k})$ for $k=1,2,\cdots$, and $w_{i}$ is the set of characteristics that are informative about the latent group. Other forms of dependence are also available; see Quintana et al. (2022) for a comprehensive review. A caveat of this approach is that the analysis of the group structure is confined to the $w_{i}$ observed by the researcher: it requires researchers to know the possible key characteristics, be able to observe them, and ensure that they are informative. In many cases, however, such characteristics may be hard for researchers to justify.
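A minimal sketch of the PSB weights in (5.1) follows, assuming for simplicity that the Gaussian-process draws $\zeta_{k}(\cdot)$ are replaced by linear index functions; the function name psb_weights and the coefficient array gamma are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def psb_weights(w, gamma):
    """Probit stick-breaking weights pi_k(w_i) as in (5.1), truncated at
    K = gamma.shape[0] components. For illustration, the Gaussian-process
    draws zeta_k(.) are replaced by linear functions zeta_k(w) = w @ gamma[k]."""
    sticks = norm.cdf(w @ gamma.T)                          # Phi(zeta_k(w_i)), shape (N, K)
    pi = sticks.copy()
    pi[:, 1:] *= np.cumprod(1.0 - sticks, axis=1)[:, :-1]   # times prod_{j<k}(1 - Phi(zeta_j))
    return pi

rng = np.random.default_rng(1)
w = rng.normal(size=(5, 2))        # characteristics w_i for five units
gamma = rng.normal(size=(4, 2))    # hypothetical stick coefficients for four components
pi = psb_weights(w, gamma)         # each row sums to at most one (truncation)
```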

5.2 Beyond Panel Data Models

Although we concentrate on panel data models, our framework of the DP prior with soft pairwise constraints applies to other models in which the group structure is crucial.


Gaussian Mixture Model

If we ignore covariates and focus exclusively on group membership, we essentially face a classical clustering problem with an infinite-dimensional mixture model. A typical probabilistic model is the infinite Gaussian mixture model (Rasmussen, 1999), where the data are assumed to be drawn from a mixture of Gaussian components,

y_{i}\sim\sum_{k=1}^{\infty}\pi_{k}N(\mu_{k},\Sigma_{k}), \qquad (5.2)

where $\pi_{k}$ are the mixture weights. With soft pairwise constraints, observations are clustered in accordance with prior beliefs.
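A minimal sketch of a draw from (5.2) under a stick-breaking truncation; the concentration parameter, base measure, and truncation level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
a, K_max = 1.0, 50                            # DP concentration; truncation level

# Truncated stick-breaking weights pi_k for the mixture in (5.2)
zeta = rng.beta(1.0, a, size=K_max)
pi = zeta * np.concatenate(([1.0], np.cumprod(1.0 - zeta)[:-1]))

# Component parameters drawn from a hypothetical base measure
mu = rng.normal(0.0, 3.0, size=K_max)
sigma = 1.0 / np.sqrt(rng.gamma(2.0, 1.0, size=K_max))

# Draw 500 observations y_i from the (truncated) infinite mixture
z = rng.choice(K_max, size=500, p=pi / pi.sum())
y = rng.normal(mu[z], sigma[z])
```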


Heterogeneous Treatment Effects

Following the potential outcomes framework of Rubin (1974), we posit the existence of potential outcomes $y_{i}(1)$ and $y_{i}(0)$ corresponding, respectively, to the response the $i$-th subject would have experienced with and without the treatment, and define the treatment effect at $x$ as

\tau(x)=\mathbb{E}\left[y_{i}(1)-y_{i}(0)\,|\,x_{i}=x\right]. \qquad (5.3)

Previous work on Bayesian analysis of treatment effects includes Chib and Hamilton (2000, 2002), Chib and Jacobi (2007), Chib (2007), and Heckman et al. (2014), among others. Another strand of the literature proposes to estimate (5.3) using machine learning algorithms (Hill, 2011; Athey and Imbens, 2016; Wager and Athey, 2018). These methods are built on the idea that researchers search, among all possible subsamples defined by the values of $x_{i}$, for the subsamples across which the effect of the treatment differs. Instead of trying to discover valid subsets of the data, Shiraito (2016) directly models the outcome as a function of the treatment and pre-treatment covariates $x_{i}$ and estimates the distribution of conditional average treatment effects (CATE) across units by employing the Dirichlet process:

y_{i}=\tau_{g_{i}}T_{i}+\gamma^{\prime}_{g_{i}}x_{i}+\varepsilon_{i},\quad\varepsilon_{i}\stackrel{\text{iid}}{\sim}\mathcal{N}\left(0,\sigma_{g_{i}}^{2}\right), \qquad (5.4)

where $T_{i}$ is the binary treatment indicator. Our approach fits into these methods by adding side information on the treatment-effect groups into the prior.
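A minimal simulation sketch of the grouped outcome model (5.4); the group memberships, effect sizes, and dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
N, K = 400, 3
g = rng.integers(0, K, size=N)            # latent groups g_i
tau = np.array([0.0, 0.5, 2.0])           # group-specific treatment effects tau_g
gamma = rng.normal(size=(K, 2))           # group-specific covariate coefficients gamma_g
sigma = np.array([0.5, 1.0, 1.5])         # group-specific error s.d.

x = rng.normal(size=(N, 2))               # pre-treatment covariates x_i
T_i = rng.integers(0, 2, size=N)          # binary treatment indicator T_i
y = tau[g] * T_i + np.sum(gamma[g] * x, axis=1) + rng.normal(0.0, sigma[g])
```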


Panel VARs

Panel VARs (Holtz-Eakin et al., 1988; Canova and Ciccarelli, 2013, and references therein) have been widely used in macroeconomic analysis and policy evaluation to capture the interdependencies across sectors, markets, and countries. Nevertheless, the large dimension of panel VARs typically makes the curse of dimensionality a severe problem. Billio et al. (2019) propose nonparametric Bayesian priors that cluster the VAR coefficients and induce group-level shrinkage. Our paradigm of the DP prior with soft pairwise constraints is applicable to their method and injects prior information on groups into the underlying Granger causal networks.

Panel VARs have the same structure as VAR models, in the sense that all variables are assumed to be endogenous and interdependent, but a cross-sectional dimension is added to the representation. Let $Y_{t}$ be the stacked version of $y_{it}$, the vector of $J$ variables for each unit $i=1,\ldots,N$, i.e., $Y_{t}=\left(y_{1t}^{\prime},y_{2t}^{\prime},\ldots,y_{Nt}^{\prime}\right)^{\prime}$. Then a panel VAR is

Y_{t}=A_{0}+A_{1}Y_{t-1}+A_{2}Y_{t-2}+\cdots+A_{p}Y_{t-p}+u_{t}, \qquad (5.5)

where $u_{t}$ is an $NJ\times 1$ vector of idiosyncratic errors, $A_{0}$ is an $NJ\times 1$ vector of intercepts, and the $A_{j}$ are $NJ\times NJ$ matrices of coefficients.
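A minimal sketch simulating from (5.5) with $p=1$; the dimensions and coefficient scales are illustrative assumptions, with the coefficient matrix kept small so the system is stable.

```python
import numpy as np

rng = np.random.default_rng(4)
N, J, T = 4, 2, 100                       # units, variables per unit, periods
dim = N * J                               # dimension of the stacked vector Y_t

A0 = rng.normal(0.0, 0.1, size=dim)       # intercept vector A_0
A1 = rng.normal(0.0, 0.25 / np.sqrt(dim), size=(dim, dim))  # small entries for stability
u_sd = 0.1                                # s.d. of idiosyncratic errors u_t

Y = np.zeros((T, dim))
for t in range(1, T):
    Y[t] = A0 + A1 @ Y[t - 1] + rng.normal(0.0, u_sd, size=dim)  # panel VAR(1), cf. (5.5)
```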

The main feature of Billio et al. (2019) is a prior that blends the DP prior with a Lasso prior for each of $A_{0}$ and $A_{j}$, such that the VAR coefficients are either shrunk toward 0 or clustered at multiple non-zero locations. Our proposed DP prior with soft pairwise constraints fits into their framework by replacing the original DP prior, permitting a richer structure within each coefficient matrix. As the nonzero coefficients form Granger causal networks, equipping the prior with soft pairwise constraints may yield a more plausible network by taking researchers' expertise into account.

6 Concluding Remarks

This paper proposes a Bayesian framework for estimation and forecasting in panel data models when prior knowledge about groups is available and informative about the group pattern. We incorporate prior knowledge in the form of soft pairwise constraints into the Dirichlet process prior, which yields an intuitive and coherent prior. The proposed constrained grouped estimator accommodates both heteroskedasticity and heterogeneous slope coefficients and reveals the group structure endogenously. Our framework directly estimates the number of groups rather than relying on ex-post model selection, and the structure of the pairwise constraints circumvents the computational difficulties and limitations that afflict conventional approaches. In addition, under small-variance asymptotics, the proposed Gibbs sampler with pairwise constraints contains a clustering procedure comparable to that of the constrained K-means algorithm. In Monte Carlo simulations, we demonstrate that constrained Bayesian grouped estimators outperform conventional estimators even in the presence of incorrect prior knowledge. Our empirical application to forecasting sub-indices of CPI inflation rates demonstrates that incorporating prior knowledge of the latent group structure yields more accurate density forecasts. The improved forecasting performance is mostly attributable to two key features: the nonparametric Bayesian prior and the grouped cross-sectional variance. The method proposed in this paper is applicable beyond forecasting. In a second application, we revisit the relationship between a country's income and its democratic transition, where estimation of heterogeneous parameters is the object of interest. We recover a reasonable cluster pattern with a moderate number of groups and identify heterogeneous income effects on democracy.

The current work raises exciting questions for future research. First, it would be desirable to investigate overlapping group structures, in which a unit may belong to multiple groups; this would increase the flexibility of the panel data model and potentially enhance its predictive performance. Second, the assumption that an individual cannot change its group membership over the entire sample period can be relaxed, yielding an even more flexible specification. Third, our method is applicable to other econometric models, such as panel VARs with latent group structures in macroeconomic series.

References

  • Acemoglu et al. (2008) Acemoglu, D., S. Johnson, J. A. Robinson, and P. Yared (2008): “Income and Democracy,” American Economic Review, 98, 808–42.
  • Aguilar and Boot (2022) Aguilar, J. and T. Boot (2022): “Grouped Heterogeneity in Linear Panel Data Models with Heterogeneous Error Variances,” Available at SSRN 4031841.
  • Ahlquist and Wibbels (2012) Ahlquist, J. S. and E. Wibbels (2012): “Riding the Wave: World Trade and Factor-Based Models of Democratization,” American Journal of Political Science, 56, 447–464.
  • Aldous (1985) Aldous, D. J. (1985): “Exchangeability and Related Topics,” in École d’Été de Probabilités de Saint-Flour XIII—1983, Springer, 1–198.
  • Anderson and Hsiao (1982) Anderson, T. W. and C. Hsiao (1982): “Formulation and Estimation of Dynamic Models using Panel Data,” Journal of Econometrics, 18, 47–82.
  • Ando and Bai (2016) Ando, T. and J. Bai (2016): “Panel Data Models with Grouped Factor Structure under Unknown Group Membership,” Journal of Applied Econometrics, 31, 163–191.
  • Antoniak (1974) Antoniak, C. E. (1974): “Mixtures of Dirichlet Processes with Applications to Bayesian Nonparametric Problems,” The Annals of Statistics, 2, 1152–1174.
  • Ascolani et al. (2022) Ascolani, F., A. Lijoi, G. Rebaudo, and G. Zanella (2022): “Clustering Consistency with Dirichlet Process Mixtures,” arXiv preprint arXiv:2205.12924.
  • Athey and Imbens (2016) Athey, S. and G. Imbens (2016): “Recursive Partitioning for Heterogeneous Causal Effects,” Proceedings of the National Academy of Sciences, 113, 7353–7360.
  • Atkeson et al. (2001) Atkeson, A., L. E. Ohanian, et al. (2001): “Are Phillips Curves Useful for Forecasting Inflation?” Federal Reserve Bank of Minneapolis Quarterly Review, 25, 2–11.
  • Barro (1999) Barro, R. J. (1999): “Determinants of Democracy,” Journal of Political Economy, 107, S158–S183.
  • Basu et al. (2002) Basu, S., A. Banerjee, and R. Mooney (2002): “Semi-Supervised Clustering by Seeding,” in Proceedings of 19th International Conference on Machine Learning, Citeseer.
  • Basu et al. (2004a) Basu, S., A. Banerjee, and R. J. Mooney (2004a): “Active Semi-Supervision for Pairwise Constrained Clustering,” in Proceedings of the 2004 SIAM international conference on data mining, SIAM, 333–344.
  • Basu et al. (2004b) Basu, S., M. Bilenko, and R. J. Mooney (2004b): “A Probabilistic Framework for Semi-Supervised Clustering,” in Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 59–68.
  • Benhabib et al. (2013) Benhabib, J., A. Corvalan, and M. M. Spiegel (2013): “Income and Democracy: Evidence from Nonlinear Estimations,” Economics Letters, 118, 489–492.
  • Bernanke (2007) Bernanke, B. S. (2007): “Inflation Expectations and Inflation Forecasting,” in Speech at the Monetary Economics Workshop of the National Bureau of Economic Research Summer Institute, Cambridge, Massachusetts, vol. 10.
  • Bilenko et al. (2004) Bilenko, M., S. Basu, and R. J. Mooney (2004): “Integrating Constraints and Metric Learning in Semi-Supervised Clustering,” in Proceedings of the 21st International Conference on Machine Learning, 81–88.
  • Billio et al. (2019) Billio, M., R. Casarin, and L. Rossini (2019): “Bayesian Nonparametric Sparse VAR Models,” Journal of Econometrics, 212, 97–115.
  • Blackwell and MacQueen (1973) Blackwell, D. and J. B. MacQueen (1973): “Ferguson Distributions via Pólya Urn Schemes,” The Annals of Statistics, 1, 353–355.
  • Bonhomme et al. (2022) Bonhomme, S., T. Lamadon, and E. Manresa (2022): “Discretizing Unobserved Heterogeneity,” Econometrica, 90, 625–643.
  • Bonhomme and Manresa (2015) Bonhomme, S. and E. Manresa (2015): “Grouped Patterns of Heterogeneity in Panel Data,” Econometrica, 83, 1147–1184.
  • Buchinsky et al. (2005) Buchinsky, M., J. Hahn, and V. Hotz (2005): “Cluster Analysis: A Tool for Preliminary Structural Analysis,” Unpublished manuscript.
  • Canova and Ciccarelli (2013) Canova, F. and M. Ciccarelli (2013): “Panel Vector Autoregressive Models: A Survey,” in VAR Models in Macroeconomics–New Developments and Applications: Essays in Honor of Christopher A. Sims, Emerald Group Publishing Limited.
  • Cervellati et al. (2014) Cervellati, M., F. Jung, U. Sunde, and T. Vischer (2014): “Income and Democracy: Comment,” American Economic Review, 104, 707–719.
  • Chamberlain (1980) Chamberlain, G. (1980): “Analysis of Covariance with Qualitative Data,” The Review of Economic Studies, 47, 225–238.
  • Cheng et al. (2019) Cheng, X., F. Schorfheide, and P. Shao (2019): “Clustering for Multi-Dimensional Heterogeneity,” Manuscript.
  • Chetty et al. (2014) Chetty, R., J. N. Friedman, and J. E. Rockoff (2014): “Measuring the Impacts of Teachers I: Evaluating Bias in Teacher Value-Added Estimates,” American Economic Review, 104, 2593–2632.
  • Chib (2007) Chib, S. (2007): “Analysis of Treatment Response Data without the Joint Distribution of Potential Outcomes,” Journal of Econometrics, 140, 401–412.
  • Chib and Hamilton (2000) Chib, S. and B. H. Hamilton (2000): “Bayesian Analysis of Cross-Section and Clustered Data Treatment Models,” Journal of Econometrics, 97, 25–50.
  • Chib and Hamilton (2002) ——— (2002): “Semiparametric Bayes Analysis of Longitudinal Data Treatment Models,” Journal of Econometrics, 110, 67–89.
  • Chib and Jacobi (2007) Chib, S. and L. Jacobi (2007): “Modeling and Calculating the Effect of Treatment at Baseline from Panel Outcomes,” Journal of Econometrics, 140, 781–801.
  • Clark and Ravazzolo (2015) Clark, T. E. and F. Ravazzolo (2015): “Macroeconomic Forecasting Performance under Alternative Specifications of Time-Varying Volatility,” Journal of Applied Econometrics, 30, 551–575.
  • Cytrynbaum (2021) Cytrynbaum, M. (2021): “Blocked Clusterwise Regression,” arXiv preprint arXiv:2001.11130.
  • Davidson and Ravi (2005) Davidson, I. and S. Ravi (2005): “Clustering with Constraints: Feasibility Issues and the K-Means Algorithm,” in Proceedings of the 2005 SIAM International Conference on Data Mining, SIAM, 138–149.
  • Epstein et al. (2006) Epstein, D. L., R. Bates, J. Goldstone, I. Kristensen, and S. O’Halloran (2006): “Democratic Transitions,” American Journal of Political Science, 50, 551–569.
  • Escobar and West (1995) Escobar, M. D. and M. West (1995): “Bayesian Density Estimation and Inference using Mixtures,” Journal of the American Statistical Association, 90, 577–588.
  • Faust and Wright (2013) Faust, J. and J. H. Wright (2013): “Forecasting Inflation,” in Handbook of Economic Forecasting, Elsevier, vol. 2, 2–56.
  • Ferguson (1973) Ferguson, T. S. (1973): “A Bayesian Analysis of Some Nonparametric Problems,” The Annals of Statistics, 1, 209–230.
  • Ferguson (1974) ——— (1974): “Prior Distributions on Spaces of Probability Measures,” The Annals of Statistics, 2, 615–629.
  • Fisher and Jensen (2022) Fisher, M. and M. J. Jensen (2022): “Bayesian Nonparametric Learning of How Skill is Distributed across the Mutual Fund Industry,” Journal of Econometrics, 230, 131–153.
  • Fox et al. (2008) Fox, E. B., E. B. Sudderth, M. I. Jordan, and A. S. Willsky (2008): “An HDP-HMM for Systems with State Persistence,” in Proceedings of the 25th International Conference on Machine Learning, 312–319.
  • Fox et al. (2011) ——— (2011): “A Sticky HDP-HMM with Application to Speaker Diarization,” The Annals of Applied Statistics, 5, 1020–1056.
  • Freeman and Weidner (2022) Freeman, H. and M. Weidner (2022): “Linear Panel Regressions with Two-Way Unobserved Heterogeneity,” arXiv preprint arXiv:2109.11911.
  • Fruchterman and Reingold (1991) Fruchterman, T. M. and E. M. Reingold (1991): “Graph Drawing by Force-Directed Placement,” Software: Practice and Experience, 21, 1129–1164.
  • Geweke and Amisano (2010) Geweke, J. and G. Amisano (2010): “Comparing and Evaluating Bayesian Predictive Distributions of Asset Returns,” International Journal of Forecasting, 26, 216–230.
  • Ghosal and Van der Vaart (2017) Ghosal, S. and A. Van der Vaart (2017): Fundamentals of Nonparametric Bayesian Inference, vol. 44, Cambridge University Press.
  • Gleditsch and Ward (2006) Gleditsch, K. S. and M. D. Ward (2006): “Diffusion and the International Context of Democratization,” International Organization, 60, 911–933.
  • Hahn and Moon (2010) Hahn, J. and H. R. Moon (2010): “Panel Data Models with Finite Number of Multiple Equilibria,” Econometric Theory, 26, 863–881.
  • Hamilton (2018) Hamilton, J. D. (2018): “Why You Should Never Use the Hodrick-Prescott Filter,” Review of Economics and Statistics, 100, 831–843.
  • Heckman et al. (2014) Heckman, J. J., H. F. Lopes, and R. Piatek (2014): “Treatment Effects: A Bayesian Perspective,” Econometric Reviews, 33, 36–67.
  • Henseler and Schumacher (2019) Henseler, M. and I. Schumacher (2019): “The Impact of Weather on Economic Growth and its Production Factors,” Climatic Change, 154, 417–433.
  • Hersbach (2000) Hersbach, H. (2000): “Decomposition of the Continuous Ranked Probability Score for Ensemble Prediction Systems,” Weather and Forecasting, 15, 559–570.
  • Heston et al. (2002) Heston, A., R. Summers, and B. Aten (2002): “Penn World Table Version 6.1,” Center for International Comparisons at the University of Pennsylvania (CICUP).
  • Hill (2011) Hill, J. L. (2011): “Bayesian Nonparametric Modeling for Causal Inference,” Journal of Computational and Graphical Statistics, 20, 217–240.
  • Hirano (2002) Hirano, K. (2002): “Semiparametric Bayesian Inference in Autoregressive Panel Data Models,” Econometrica, 70, 781–799.
  • Holland et al. (1983) Holland, P. W., K. B. Laskey, and S. Leinhardt (1983): “Stochastic Blockmodels: First Steps,” Social Networks, 5, 109–137.
  • Holtz-Eakin et al. (1988) Holtz-Eakin, D., W. Newey, and H. S. Rosen (1988): “Estimating Vector Autoregressions with Panel Data,” Econometrica, 56, 1371–1395.
  • Hsiang (2016) Hsiang, S. (2016): “Climate Econometrics,” Annual Review of Resource Economics, 8, 43–75.
  • Ishwaran and James (2001) Ishwaran, H. and L. F. James (2001): “Gibbs Sampling Methods for Stick-Breaking Priors,” Journal of the American Statistical Association, 96, 161–173.
  • Kahn et al. (2021) Kahn, M. E., K. Mohaddes, R. N. Ng, M. H. Pesaran, M. Raissi, and J.-C. Yang (2021): “Long-Term Macroeconomic Effects of Climate Change: A Cross-Country Analysis,” Energy Economics, 104, 105624.
  • Kim and Wang (2019) Kim, J. and L. Wang (2019): “Hidden Group Patterns in Democracy Developments: Bayesian Inference for Grouped Heterogeneity,” Journal of Applied Econometrics, 34, 1016–1028.
  • Koop (2003) Koop, G. (2003): Bayesian Econometrics, John Wiley & Sons.
  • Kulis and Jordan (2011) Kulis, B. and M. I. Jordan (2011): “Revisiting K-Means: New Algorithms via Bayesian Nonparametrics,” arXiv preprint arXiv:1111.0352.
  • Laio and Tamea (2007) Laio, F. and S. Tamea (2007): “Verification Tools for Probabilistic Forecasts of Continuous Hydrological Variables,” Hydrology and Earth System Sciences, 11, 1267–1277.
  • Law et al. (2004) Law, M. H., A. Topchy, and A. K. Jain (2004): “Clustering with Soft and Group Constraints,” in Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), Springer, 662–670.
  • Lee and Wilkinson (2019) Lee, C. and D. J. Wilkinson (2019): “A Review of Stochastic Block Models and Extensions for Graph Clustering,” Applied Network Science, 4, 1–50.
  • Lin and Ng (2012) Lin, C.-C. and S. Ng (2012): “Estimation of Panel Data Models with Parameter Heterogeneity When Group Membership is Unknown,” Journal of Econometric Methods, 1, 42–55.
  • Lipset (1959) Lipset, S. M. (1959): “Some Social Requisites of Democracy: Economic Development and Political Legitimacy,” American Political Science Review, 53, 69–105.
  • Liu (2022) Liu, L. (2022): “Density Forecasts in Panel Data Models: A Semiparametric Bayesian Perspective,” Journal of Business & Economic Statistics, 0, 1–15.
  • Liu et al. (2020) Liu, L., H. R. Moon, and F. Schorfheide (2020): “Forecasting with Dynamic Panel Data Models,” Econometrica, 88, 171–201.
  • Lu (2007) Lu, Z. (2007): “Semi-Supervised Clustering with Pairwise Constraints: A Discriminative Approach,” in Artificial Intelligence and Statistics, PMLR, 299–306.
  • Lu and Leen (2004) Lu, Z. and T. K. Leen (2004): “Semi-Supervised Learning with Penalized Probabilistic Clustering,” in NIPS’04 Proceedings of the 17th International Conference on Neural Information Processing Systems, 849–856.
  • Lu and Leen (2007) ——— (2007): “Penalized Probabilistic Clustering,” Neural Computation, 19, 1528–1567.
  • Lumsdaine et al. (2022) Lumsdaine, R. L., R. Okui, and W. Wang (2022): “Estimation of Panel Group Structure Models with Structural Breaks in Group Memberships and Coefficients,” Journal of Econometrics, Forthcoming.
  • MacQueen et al. (1967) MacQueen, J. et al. (1967): “Some Methods for Classification and Analysis of Multivariate Observations,” in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Oakland, CA, USA, vol. 1, 281–297.
  • Matheson and Winkler (1976) Matheson, J. E. and R. L. Winkler (1976): “Scoring Rules for Continuous Probability Distributions,” Management Science, 22, 1087–1096.
  • Meilă (2007) Meilă, M. (2007): “Comparing Clusterings—An Information Based Distance,” Journal of Multivariate Analysis, 98, 873–895.
  • Miller (2019) Miller, J. W. (2019): “An Elementary Derivation of the Chinese Restaurant Process from Sethuraman’s Stick-Breaking Process,” Statistics & Probability Letters, 146, 112–117.
  • Moral-Benito and Bartolucci (2012) Moral-Benito, E. and C. Bartolucci (2012): “Income and Democracy: Revisiting the Evidence,” Economics Letters, 117, 844–847.
  • Müller et al. (2018) Müller, P., F. A. Quintana, and G. Page (2018): “Nonparametric Bayesian Inference in Applications,” Statistical Methods & Applications, 27, 175–206.
  • Neal (1992) Neal, R. M. (1992): “Bayesian Mixture Modeling,” in Maximum Entropy and Bayesian Methods, Springer, 197–211.
  • Neal (2000) ——— (2000): “Markov Chain Sampling Methods for Dirichlet Process Mixture Models,” Journal of Computational and Graphical Statistics, 9, 249–265.
  • Nelson and Cohen (2007) Nelson, B. and I. Cohen (2007): “Revisiting Probabilistic Models for Clustering with Pairwise Constraints,” in Proceedings of the 24th International Conference on Machine Learning, 673–680.
  • Newton and Raftery (1994) Newton, M. A. and A. E. Raftery (1994): “Approximate Bayesian Inference with the Weighted Likelihood Bootstrap,” Journal of the Royal Statistical Society: Series B (Methodological), 56, 3–26.
  • Nowicki and Snijders (2001) Nowicki, K. and T. A. B. Snijders (2001): “Estimation and Prediction for Stochastic Blockstructures,” Journal of the American Statistical Association, 96, 1077–1087.
  • Okui and Wang (2021) Okui, R. and W. Wang (2021): “Heterogeneous Structural Breaks in Panel Data Models,” Journal of Econometrics, 220, 447–473.
  • Orbanz and Buhmann (2008) Orbanz, P. and J. M. Buhmann (2008): “Nonparametric Bayesian Image Segmentation,” International Journal of Computer Vision, 77, 25–45.
  • Paganin et al. (2021) Paganin, S., A. H. Herring, A. F. Olshan, and D. B. Dunson (2021): “Centered Partition Processes: Informative Priors for Clustering,” Bayesian Analysis, 16, 301–370.
  • Pati et al. (2013) Pati, D., D. B. Dunson, and S. T. Tokdar (2013): “Posterior Consistency in Conditional Distribution Estimation,” Journal of multivariate analysis, 116, 456–472.
  • Pelleg and Baras (2007) Pelleg, D. and D. Baras (2007): “K-Means with Large and Noisy Constraint Sets,” in European Conference on Machine Learning, Springer, 674–682.
  • Pesaran et al. (2022) Pesaran, M. H., A. Pick, and A. Timmermann (2022): “Forecasting with Panel Data: Estimation Uncertainty versus Parameter Heterogeneity,” CEPR Discussion Paper No. DP17123.
  • Pitman (1995) Pitman, J. (1995): “Exchangeable and Partially Exchangeable Random Partitions,” Probability Theory and Related Fields, 102, 145–158.
  • Pitman (1996) ——— (1996): “Some Developments of the Blackwell-MacQueen Urn Scheme,” Lecture Notes-Monograph Series, 245–267.
  • Pitman and Yor (1997) Pitman, J. and M. Yor (1997): “The Two-Parameter Poisson-Dirichlet Distribution Derived from a Stable Subordinator,” The Annals of Probability, 25, 855–900.
  • Primiceri (2005) Primiceri, G. E. (2005): “Time Varying Structural Vector Autoregressions and Monetary Policy,” The Review of Economic Studies, 72, 821–852.
  • Przeworski et al. (1995) Przeworski, A., F. Limongi, and S. Giner (1995): “Political Regimes and Economic Growth,” in Democracy and Development, Springer, 3–27.
  • Quintana et al. (2022) Quintana, F. A., P. Müller, A. Jara, and S. N. MacEachern (2022): “The Dependent Dirichlet Process and Related Models,” Statistical Science, 37, 24–41.
  • Rasmussen (1999) Rasmussen, C. (1999): “The Infinite Gaussian Mixture Model,” in Advances in Neural Information Processing Systems, ed. by S. Solla, T. Leen, and K. Müller, MIT Press, vol. 12, 554–560.
  • Ren et al. (2008) Ren, L., D. B. Dunson, and L. Carin (2008): “The Dynamic Hierarchical Dirichlet Process,” in Proceedings of the 25th International Conference on Machine Learning, 824–831.
  • Rockoff (2004) Rockoff, J. E. (2004): “The Impact of Individual Teachers on Student Achievement: Evidence from Panel Data,” American Economic Review, 94, 247–252.
  • Rodriguez and Dunson (2011) Rodriguez, A. and D. B. Dunson (2011): “Nonparametric Bayesian Models through Probit Stick-Breaking Processes,” Bayesian Analysis, 6.
  • Ross and Dy (2013) Ross, J. and J. Dy (2013): “Nonparametric Mixture of Gaussian Processes with Constraints,” in International Conference on Machine Learning, PMLR, 1346–1354.
  • Rubin (1974) Rubin, D. B. (1974): “Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies,” Journal of Educational Psychology, 66, 688–701.
  • Sarafidis and Weber (2015) Sarafidis, V. and N. Weber (2015): “A Partially Heterogeneous Framework for Analyzing Panel Data,” Oxford Bulletin of Economics and Statistics, 77, 274–296.
  • Sethuraman (1994) Sethuraman, J. (1994): “A Constructive Definition of Dirichlet Priors,” Statistica Sinica, 4, 639–650.
  • Shental et al. (2003) Shental, N., A. Bar-Hillel, T. Hertz, and D. Weinshall (2003): “Computing Gaussian Mixture Models with EM using Equivalence Constraints,” Advances in Neural Information Processing Systems, 16, 465–472.
  • Shiraito (2016) Shiraito, Y. (2016): “Uncovering Heterogeneous Treatment Effects,” Manuscript, https://shiraito.github.io/research/files/jmp.pdf.
  • Smith (1973) Smith, A. F. (1973): “A General Bayesian Linear Model,” Journal of the Royal Statistical Society: Series B (Methodological), 35, 67–75.
  • Stock and Watson (2007) Stock, J. H. and M. W. Watson (2007): “Why has US Inflation Become Harder to Forecast?” Journal of Money, Credit and Banking, 39, 3–33.
  • Su and Ju (2018) Su, L. and G. Ju (2018): “Identifying Latent Grouped Patterns in Panel Data Models with Interactive Fixed Effects,” Journal of Econometrics, 206, 554–573.
  • Su et al. (2016) Su, L., Z. Shi, and P. C. Phillips (2016): “Identifying Latent Structures in Panel Data,” Econometrica, 84, 2215–2264.
  • Su et al. (2019) Su, L., X. Wang, and S. Jin (2019): “Sieve Estimation of Time-Varying Panel Data Models with Latent Structures,” Journal of Business & Economic Statistics, 37, 334–349.
  • Sun (2005) Sun, Y. (2005): “Estimation and Inference in Panel Structure Models,” Available at SSRN 794884.
  • Teh et al. (2006) Teh, Y. W., M. I. Jordan, M. J. Beal, and D. M. Blei (2006): “Hierarchical Dirichlet Processes,” Journal of the American Statistical Association, 101, 1566–1581.
  • Vlachos et al. (2010) Vlachos, A., Z. Ghahramani, and T. Briscoe (2010): “Active Learning for Constrained Dirichlet Process Mixture Models,” in Proceedings of the 2010 Workshop on Geometrical Models of Natural Language Semantics, 57–61.
  • Vlachos et al. (2008) Vlachos, A., Z. Ghahramani, and A. Korhonen (2008): “Dirichlet Process Mixture Models for Verb Clustering,” in Proceedings of the ICML Workshop on Prior Knowledge for Text and Language.
  • Vlachos et al. (2009) Vlachos, A., A. Korhonen, and Z. Ghahramani (2009): “Unsupervised and Constrained Dirichlet Process Mixture Models for Verb Clustering,” in Proceedings of the Workshop on Geometrical Models of Natural Language Semantics, 74–82.
  • Wade and Ghahramani (2018) Wade, S. and Z. Ghahramani (2018): “Bayesian Cluster Analysis: Point Estimation and Credible Balls (with Discussion),” Bayesian Analysis, 13, 559–626.
  • Wager and Athey (2018) Wager, S. and S. Athey (2018): “Estimation and Inference of Heterogeneous Treatment Effects using Random Forests,” Journal of the American Statistical Association, 113, 1228–1242.
  • Wagstaff and Cardie (2000) Wagstaff, K. and C. Cardie (2000): “Clustering with Instance-Level Constraints,” AAAI/IAAI Proceedings, 1097, 577–584.
  • Wagstaff et al. (2001) Wagstaff, K., C. Cardie, S. Rogers, S. Schroedl, et al. (2001): “Constrained K-Means Clustering with Background Knowledge,” in Proceedings of the 18th International Conference on Machine Learning, vol. 1, 577–584.
  • Walker (2007) Walker, S. G. (2007): “Sampling the Dirichlet Mixture Model with Slices,” Communications in Statistics - Simulation and Computation, 36, 45–54.
  • Wang and Dunson (2011) Wang, L. and D. B. Dunson (2011): “Fast Bayesian Inference in Dirichlet Process Mixture Models,” Journal of Computational and Graphical Statistics, 20, 196–216.
  • Wang et al. (2018) Wang, W., P. C. Phillips, and L. Su (2018): “Homogeneity Pursuit in Panel Data Models: Theory and Application,” Journal of Applied Econometrics, 33, 797–815.
  • Wang and Su (2021) Wang, W. and L. Su (2021): “Identifying Latent Group Structures in Nonlinear Panels,” Journal of Econometrics, 220, 272–295.
  • Yoder and Priebe (2017) Yoder, J. and C. E. Priebe (2017): “Semi-Supervised K-Means++,” Journal of Statistical Computation and Simulation, 87, 2597–2608.
  • Zhang (2020) Zhang, B. (2020): “Forecasting with Bayesian Grouped Random Effects in Panel Data,” arXiv preprint arXiv:2007.02435.

Supplemental Appendix to
“Unobserved Grouped Patterns in Panel Data
and Prior Wisdom”

Boyuan Zhang

Appendix A Definitions and Terminology

A.1 Dirichlet Process and Related Stochastic Processes

In Bayesian inference, all unknown quantities in a model must be assigned prior distributions. A nonparametric prior can be used to reflect uncertainty about the parametric form of the prior distribution. Because of its richness, computational tractability, and interpretability, the Dirichlet process (DP) is one of the most often used random probability measures. It can be used to model uncertainty about the functional form of the prior distribution of the parameters in a model.

The DP, first established using Kolmogorov consistency conditions (Ferguson, 1973), can be defined from a number of viewpoints. Ferguson (1973) shows that the DP can be obtained by normalizing a gamma process. Using exchangeability, the Pólya urn scheme leads to the DP (Blackwell and MacQueen, 1973). The Chinese restaurant process (CRP) (Aldous, 1985; Pitman, 1996), a distribution over partitions, is a closely related sequential process that produces the DP when each partition block is assigned an independent parameter drawn from a common distribution. Sethuraman (1994) provided a constructive definition of the DP, which led to its characterization as a stick-breaking prior (Ishwaran and James, 2001).

Constructions of the DP via the stick-breaking process or the gamma process represent the DP as a countably infinite sum of atomic measures. These approaches suggest that a Dirichlet process mixture (DPM) model can be seen as a mixture model with infinitely many components. The distribution over parameters induced by a DP can also be obtained as the limit of a parametric mixture model (Neal, 1992; Rasmussen, 1999; Neal, 2000). This view shows that a DPM can be defined as an extension of a parametric mixture model without the need for model selection to determine the number of components.

A.1.1 Dirichlet Process

Ferguson (1973) defines a DP with two parameters, a positive scalar $a$ and a probability measure $B_{0}$, referred to as the concentration parameter and the base measure, respectively. The base distribution $B_{0}$ is the parameter on which the nonparametric distribution is centered and can be thought of as the prior guess (Antoniak, 1974). The concentration parameter $a$ expresses the strength of belief in $B_{0}$. For small values of $a$, samples from a DP are likely to be composed of a small number of atomic measures with large weights. For large values, most draws are likely to be distinct, so the samples concentrate around $B_{0}$.

Technically, a nonparametric prior is a probability distribution on $\mathcal{P}$, the space of all probability measures (say, on the real line). Measurable sets in $\mathcal{P}$ are of the form $\{A\colon P(A)<1\}$. We can specify a prior distribution over $\left(P(A_{1}),P(A_{2}),\ldots,P(A_{K})\right)$, where $A_{1},A_{2},\ldots,A_{K}$ form a finite measurable partition of the measurable set $A$. Denote

P\sim DP\left(a,B_{0}\right);

then, for every partition $\left(A_{1},\cdots,A_{K}\right)$,

\left(P\left(A_{1}\right),\cdots,P\left(A_{K}\right)\right)\sim\operatorname{Dir}\left(aB_{0}\left(A_{1}\right),\cdots,aB_{0}\left(A_{K}\right)\right).

Here $\operatorname{Dir}(\cdot)$ stands for the Dirichlet distribution, with probability density function

p\left(x_{1},\cdots,x_{K};\eta_{1},\cdots,\eta_{K}\right)=\frac{\Gamma\left(\sum_{k=1}^{K}\eta_{k}\right)}{\prod_{k=1}^{K}\Gamma\left(\eta_{k}\right)}\prod_{k=1}^{K}x_{k}^{\eta_{k}-1},

where $x_{i}\in(0,1)$ and $\sum_{i=1}^{K}x_{i}=1$. The Dirichlet distribution is a multivariate generalization of the Beta distribution, and the Dirichlet process is its infinite-dimensional generalization.
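To illustrate the finite-dimensional definition above, the following sketch draws $\left(P(A_{1}),P(A_{2}),P(A_{3})\right)$ for a hypothetical three-set partition; the masses assigned by $B_{0}$ are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
a = 1.0                                   # concentration parameter a
B0_masses = np.array([0.2, 0.3, 0.5])     # B_0(A_1), B_0(A_2), B_0(A_3) (illustrative)

# (P(A_1), P(A_2), P(A_3)) ~ Dir(a*B_0(A_1), a*B_0(A_2), a*B_0(A_3))
P = rng.dirichlet(a * B0_masses)
assert np.isclose(P.sum(), 1.0)           # a valid random probability vector
```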

We next define a typical DP prior for $(\alpha,\sigma^{2})$:

(\alpha_{i},\sigma_{i}^{2})\sim B, \qquad B\sim\operatorname{DP}\left(a,B_{0}\right), \qquad (A.1)

where $B$ is a random distribution. There are two parameters: the base distribution $B_{0}$, which is a distribution over the same space as $B$ (for example, if $B_{0}$ is a distribution on the reals, then $B$ must be a distribution on the reals too), and the concentration parameter $a$, a positive scalar. One property of the DP is that the random distribution $B$ is discrete, placing its mass on a countably infinite collection of atoms drawn from $B_{0}$.

We adopt an independent Normal Inverse-Gamma (INIG) distribution for the base distribution $B_{0}$:

B_{0}(\phi):=INIG\left(\mu_{\alpha},\Sigma_{\alpha},\frac{\nu_{\sigma}}{2},\frac{\delta_{\sigma}}{2}\right), \qquad (A.2)

with a set of hyperparameters $\phi=\left(\mu_{\alpha},\Sigma_{\alpha},\frac{\nu_{\sigma}}{2},\frac{\delta_{\sigma}}{2}\right)$.

The form of the base distribution and the value of the concentration parameter are critical aspects of model selection that influence modeling performance. Given a vague prior distribution, the value of the concentration parameter can be learned from the data. Choosing the base distribution is more difficult, because the model's performance depends heavily on its parametric form, even when it is constructed hierarchically for robustness. The choice of the base distribution is therefore determined largely by mathematical and practical convenience; for computational ease, conjugate distributions are recommended.

A draw from a DP is, by definition, a discrete distribution. In this sense, given the baseline model, imposing a DP prior on the distribution of $\alpha_{g_{i}}$ entails limiting the intercepts to some discrete values and assuming that intercepts within a group are identical, which may not be appealing in some empirical applications. A natural extension is to suppose that $\alpha_{i}$ has a continuous parametric distribution $f(\alpha_{i};\theta)$, with parameters $\theta$, and to place a DP prior on the distribution of $\theta$. The parameters $\theta$ are then discrete and have a group structure, whereas the group heterogeneity $\alpha_{g_{i}}$ has a continuous distribution; that is, the $\alpha_{i}$ within a group can differ, but they are all drawn from the same distribution. This additional layer of mixing is the general idea of the Dirichlet process mixture (DPM) model.

A.1.2 Stick Breaking Process

A nonparametric prior can also be defined as the distribution of a random variable $P$ taking values in $\mathcal{P}$. A construction of the DP follows the stick-breaking process (Sethuraman, 1994):

P(A) =\sum_{k=1}^{\infty}\pi_{k}\mathbf{1}_{\alpha_{k}}\left(A\right),
\alpha_{k} \sim B_{0},\quad k=1,2,\cdots,
\pi_{k} =\begin{cases}\zeta_{1},&k=1\\ \prod_{j<k}\left(1-\zeta_{j}\right)\zeta_{k},&k=2,3,\cdots\end{cases}
\text{where }\zeta_{k}\sim\operatorname{Beta}(1,a),\quad k=1,2,\cdots

The stick-breaking process distinguishes the roles of $B_{0}$ and $a$: the former governs the component values $\alpha_{k}$, while the latter guides the component probabilities $\pi_{k}$. Roughly speaking, the DP concentration parameter $a$ is linked to the number of unique components in the mixture density and thus determines and reflects the flexibility of the mixture density.
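A minimal sketch of the stick-breaking construction, truncated at $K$ components and assuming a standard normal base distribution $B_{0}$; repeated values among the draws illustrate the discreteness of $P$.

```python
import numpy as np

rng = np.random.default_rng(6)
a, K = 2.0, 100                            # concentration parameter; truncation level

zeta = rng.beta(1.0, a, size=K)            # zeta_k ~ Beta(1, a)
pi = zeta * np.concatenate(([1.0], np.cumprod(1.0 - zeta)[:-1]))  # stick lengths pi_k
atoms = rng.normal(0.0, 1.0, size=K)       # alpha_k ~ B_0 = N(0, 1)

# P is discrete: sampling from it repeats atoms, which is the clustering property
draws = rng.choice(atoms, size=20, p=pi / pi.sum())
print(np.unique(draws).size, "unique values among 20 draws")
```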

A.1.3 Chinese Restaurant / Pólya Urn Process

Another widely used representation of the DP prior is the Chinese restaurant process (CRP). To set the stage, imagine a Chinese restaurant with infinitely many tables, each able to seat infinitely many customers. When a new customer, say the $n$-th, enters the restaurant, the probability that they sit at table $k$, currently occupied by $n_{k}$ other customers, is proportional to $n_{k}$, while the probability that they sit alone at a new table is governed by $a$ (the concentration parameter of the DP),

p\left((\alpha_{i},\sigma_{i}^{2})=(\alpha_{k},\sigma_{k}^{2})\,|\,\alpha_{1:i-1},\sigma_{1:i-1}^{2},a,B_{0}\right)\propto\begin{cases}n_{k}&\text{if }k\text{ is an existing table}\\ a&\text{if }k\text{ is a new table}.\end{cases}

Upon marginalizing out the random mixing measure $B$, one obtains the conditional distribution of $(\alpha_{i},\sigma_{i}^{2})$ given $(\alpha_{1:i-1},\sigma_{1:i-1}^{2})$, which follows a Pólya urn scheme,

\alpha_{i},\sigma_{i}^{2}\,|\,\alpha_{1:i-1},\sigma_{1:i-1}^{2},a,B_{0}\sim\frac{1}{a+i-1}\sum_{j=1}^{i-1}\delta_{(\alpha_{j},\sigma_{j}^{2})}+\frac{a}{a+i-1}B_{0}(\cdot). \qquad (A.3)

Equation (A.3) reveals the clustering property of the DP prior: there is a positive probability that each $(\alpha_{i},\sigma_{i}^{2})$ takes on the value of another $(\alpha_{j},\sigma_{j}^{2})$, leading some of the variables to share values. The probability of sharing a particular value is proportional to the number of variables that already hold it. This self-reinforcing property is sometimes described as "the rich get richer." The equation also reveals the roles of the scaling parameter $a$ and the base distribution $B_{0}$: the unique values contained in $(\alpha_{1:N},\sigma_{1:N}^{2})$ are drawn independently from $B_{0}$, and $a$ determines how likely $(\alpha_{i},\sigma_{i}^{2})$ is to be a newly drawn value from $B_{0}$ rather than one of the values in $(\alpha_{1:i-1},\sigma_{1:i-1}^{2})$.

The Chinese restaurant process shares the same characteristics as the Pólya urn process, which can be extended to the two-parameter Pitman-Yor process (Pitman and Yor, 1997). The basic idea of the Pólya urn process is as follows. Imagine an urn with possibly infinitely many colors, and let $a$ (again, the concentration parameter of the DP) be the initial number of balls of each color. The urn evolves in discrete time steps: at each step, one ball is sampled uniformly at random, its color is observed, and it is returned to the urn together with one additional ball of the same color. This process is then repeated.
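A minimal sketch of the CRP seating scheme described above; the function name crp and the chosen values of $N$ and $a$ are illustrative.

```python
import numpy as np

def crp(N, a, rng):
    """Simulate table (group) assignments for N customers under a CRP with
    concentration parameter a."""
    counts = []                            # n_k: number of customers at table k
    g = np.empty(N, dtype=int)
    for i in range(N):
        probs = np.array(counts + [a], dtype=float)
        k = rng.choice(len(probs), p=probs / probs.sum())
        if k == len(counts):
            counts.append(1)               # open a new table, w.p. proportional to a
        else:
            counts[k] += 1                 # join table k, w.p. proportional to n_k
        g[i] = k
    return g

g = crp(N=89, a=1.0, rng=np.random.default_rng(7))
print("number of occupied tables:", g.max() + 1)
```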

Equation (A.3) is also called the Blackwell-MacQueen prediction rule: the conditional distribution of θn\theta_{n} given previously sampled θ1:n1\theta_{1:n-1} from the Dirichlet process prior. It characterizes the Chinese restaurant process/Pólya urn process and serves as the key component of the Pólya urn Gibbs sampler (Ishwaran and James, 2001).

Prior literature establishes the equivalence between the Chinese restaurant process/Pólya urn process and the aforementioned representations. Blackwell and MacQueen (1973) present the equivalence between the Pólya urn process and the Dirichlet process, and Miller (2019) formally proves that the Chinese restaurant process is equivalent to the stick-breaking process.
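To make the seating scheme concrete, the following minimal Python sketch simulates CRP table assignments; the function name and parameter values are ours, and the snippet is an illustration rather than part of the estimation procedure.

```python
import numpy as np

def sample_crp(n, a, rng):
    """Simulate CRP table assignments for n customers with concentration a."""
    counts = []                       # counts[k] = number of customers at table k
    assignments = []
    for _ in range(n):
        # existing table k with prob. prop. to counts[k]; new table prop. to a
        probs = np.array(counts + [a], dtype=float)
        probs /= probs.sum()
        k = int(rng.choice(len(probs), p=probs))
        if k == len(counts):
            counts.append(1)          # open a new table
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments

rng = np.random.default_rng(0)
print(sample_crp(10, a=1.0, rng=rng))
```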

A.2 Exchangeable Partition Probability Function

A predominant feature of the exchangeable partition probability function (EPPF) in (2.14) is that it defines a prior distribution over GG. To obtain the prior from the EPPF, we must first identify all possible group partitions of NN units. This problem can be recast as a prototypical “balls and urns” problem: in how many ways can we put NN distinguishable balls into NN indistinguishable urns if empty urns are allowed?

Example A.1.

Consider a simple case in which N=3N=3 and a=1a=1. It is easy to show that there are five ways to group three units. The prior distribution over GG under the Dirichlet process is then given by,

Pr(g1=g2=g3=1)\displaystyle\Pr(g_{1}=g_{2}=g_{3}=1) =Γ(1)Γ(4)Γ(3)=13,\displaystyle=\frac{\Gamma(1)}{\Gamma(4)}\Gamma(3)=\frac{1}{3},
Pr(g1=g2=1,g3=2)\displaystyle\Pr(g_{1}=g_{2}=1,g_{3}=2) =Γ(1)Γ(4)Γ(2)Γ(1)=16,\displaystyle=\frac{\Gamma(1)}{\Gamma(4)}\Gamma(2)\Gamma(1)=\frac{1}{6},
Pr(g1=g3=1,g2=2)\displaystyle\Pr(g_{1}=g_{3}=1,g_{2}=2) =Γ(1)Γ(4)Γ(2)Γ(1)=16,\displaystyle=\frac{\Gamma(1)}{\Gamma(4)}\Gamma(2)\Gamma(1)=\frac{1}{6},
Pr(g2=g3=1,g1=2)\displaystyle\Pr(g_{2}=g_{3}=1,g_{1}=2) =Γ(1)Γ(4)Γ(2)Γ(1)=16,\displaystyle=\frac{\Gamma(1)}{\Gamma(4)}\Gamma(2)\Gamma(1)=\frac{1}{6},
Pr(g1=1,g2=2,g3=3)\displaystyle\Pr(g_{1}=1,g_{2}=2,g_{3}=3) =Γ(1)Γ(4)Γ(1)Γ(1)Γ(1)=16.\displaystyle=\frac{\Gamma(1)}{\Gamma(4)}\Gamma(1)\Gamma(1)\Gamma(1)=\frac{1}{6}.

Finding all solutions to the “balls and urns” problem for large NN is computationally infeasible in general. For a given number of groups KK, the number of ways to assign NN units to KK groups is given by the Stirling number of the second kind,

𝒮N,K=1K!j=0K(1)jCKj(Kj)N.\displaystyle\mathcal{S}_{N,K}=\frac{1}{K!}\sum_{j=0}^{K}(-1)^{j}C_{K}^{j}(K-j)^{N}. (A.4)

The sum of 𝒮N,K\mathcal{S}_{N,K} over all possible KK, known as the NN-th Bell number N=K=1N𝒮N,K\mathcal{B}_{N}=\sum_{K=1}^{N}\mathcal{S}_{N,K}, counts all possible partitions of NN balls. Owing to the rapid growth of this space, listing all feasible partitions quickly becomes intractable: moving from a moderate N=12N=12 to N=15N=15, the number of partitions increases from 12=4,213,597\mathcal{B}_{12}=4,213,597 to 15=1,382,958,545\mathcal{B}_{15}=1,382,958,545. Sethuraman (1994) and Pitman (1996) constructively show that group indices/partitions can be drawn from the EPPF for the DP using the stick-breaking process defined in (2.8). As a result, the EPPF does not explicitly appear in the posterior analysis in the current setting so long as the priors for the stick lengths are included.
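As a quick check of these counts, the Stirling and Bell numbers can be computed directly from (A.4); the following short Python sketch (function names ours) reproduces the Bell numbers for N=12 and N=15.

```python
from math import comb, factorial

def stirling2(n, k):
    """Stirling number of the second kind, via the formula in (A.4)."""
    return sum((-1) ** j * comb(k, j) * (k - j) ** n for j in range(k + 1)) // factorial(k)

def bell(n):
    """n-th Bell number: the number of all possible partitions of n units."""
    return sum(stirling2(n, k) for k in range(1, n + 1))

print(bell(12), bell(15))  # 4213597 1382958545
```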

In the example below, we demonstrate how pairwise constraints affect the prior density of the group partition using Equation (2.4):

p(G|ψ,T)p(G)i,j(ψij1ψij)cTijδij(G).\displaystyle p(G|\psi,T)\;\propto\;p(G)\prod_{i,j}\left(\frac{\displaystyle\psi_{ij}}{\displaystyle 1-\psi_{ij}}\right)^{cT_{ij}\delta_{ij}(G)}. (A.5)
Example A.2.

Consider again the three-unit case in Example A.1 with a=1a=1. Assume there is a positive-link constraint between units 1 and 2, and that it is the only constraint in this example. The prior probabilities of the five partitions are then adjusted to account for the effect of W12W_{12}:

Pr(g1=g2=g3=1)\displaystyle\Pr(g_{1}=g_{2}=g_{3}=1) =132exp(4cW12)exp(4cW12)+1>13,\displaystyle=\frac{1}{3}\cdot\frac{2\exp(4cW_{12})}{\exp(4cW_{12})+1}>\frac{1}{3},
Pr(g1=g2=1,g3=2)\displaystyle\Pr(g_{1}=g_{2}=1,g_{3}=2) =13exp(4cW12)exp(4cW12)+1>16,\displaystyle=\frac{1}{3}\cdot\frac{\exp(4cW_{12})}{\exp(4cW_{12})+1}>\frac{1}{6},
Pr(g1=g3=1,g2=2)\displaystyle\Pr(g_{1}=g_{3}=1,g_{2}=2) =131exp(4cW12)+1<16,\displaystyle=\frac{1}{3}\cdot\frac{1}{\exp(4cW_{12})+1}<\frac{1}{6},
Pr(g2=g3=1,g1=2)\displaystyle\Pr(g_{2}=g_{3}=1,g_{1}=2) =131exp(4cW12)+1<16,\displaystyle=\frac{1}{3}\cdot\frac{1}{\exp(4cW_{12})+1}<\frac{1}{6},
Pr(g1=1,g2=2,g3=3)\displaystyle\Pr(g_{1}=1,g_{2}=2,g_{3}=3) =131exp(4cW12)+1<16.\displaystyle=\frac{1}{3}\cdot\frac{1}{\exp(4cW_{12})+1}<\frac{1}{6}.

Note that c>0c>0 and W12>0W_{12}>0. Compared with the results in Example A.1, the probabilities of the first two partitions increase since both satisfy the PL constraint between units 1 and 2, while the remaining partitions violate the constraint and hence their probabilities drop.
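The adjusted probabilities above are easy to verify numerically. The following Python sketch normalizes the constraint-weighted DP prior weights over the five partitions; the values of c and W12 are arbitrary illustrative choices.

```python
import numpy as np

# Five partitions of N = 3 units under a = 1 (Example A.1), reweighted by the
# positive-link constraint between units 1 and 2; c and W12 are illustrative.
c, W12 = 1.0, 0.5
x = np.exp(4 * c * W12)                          # constraint factor when g1 = g2
base = np.array([1/3, 1/6, 1/6, 1/6, 1/6])       # unconstrained DP prior
lift = np.array([x, x, 1.0, 1.0, 1.0])           # first two partitions satisfy the PL
post = base * lift
post /= post.sum()
print(post)  # first two entries exceed 1/3 and 1/6; the others fall below 1/6
```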

A.3 Hierarchical Dirichlet Process

The hierarchical Dirichlet process (HDP) was developed by Teh et al. (2006). The HDP is a nonparametric Bayesian approach to clustering grouped data with known group membership. It assigns a Dirichlet process to each group of data, with the Dirichlet processes for all groups sharing a base distribution that is itself drawn from a Dirichlet process. This construction allows groups to share statistical strength by sharing clusters across groups. Drawing the base distribution from a Dirichlet process is important because draws from a Dirichlet process are atomic probability measures, and the atoms will appear in all group-level Dirichlet processes. Since each atom corresponds to a cluster, clusters are shared across all groups.

The HDP is parameterized by a base distribution HH that governs the prior distribution over data items, and a number of concentration parameters that govern the prior number of clusters and amount of sharing across groups. Assume that we have JJ groups of data, each consisting of NjN_{j} data points, yj1,yjNjy_{j1},\ldots y_{jN_{j}}. The process defines a set of random probability measures (Bj)j=1J(B_{j})_{j=1}^{J}, one for each group. The random probability measure BjB_{j} for the jj-th group is distributed as a Dirichlet process:

Bj|B0DP(α0,B0),\displaystyle B_{j}|B_{0}\sim\operatorname{DP}\left(\alpha_{0},B_{0}\right), (A.6)

where α0\alpha_{0} is the concentration parameter and B0B_{0} is the base distribution shared across all groups. The distribution of the global random probability measure B0B_{0} is given by,

B0DP(γ,H),\displaystyle B_{0}\sim\operatorname{DP}\left(\gamma,H\right), (A.7)

with concentration parameter γ\gamma and base distribution HH.

A hierarchical Dirichlet process can be used as the prior distribution over the parameters of grouped data. For each jj, let (ϕji)i=1Nj\left(\phi_{ji}\right)_{i=1}^{N_{j}} be i.i.d. random variables distributed as BjB_{j}. Each ϕji\phi_{ji} is a parameter corresponding to a single observation yjiy_{ji}. The likelihood is given by,

ϕji|BjBj,\displaystyle\phi_{ji}|B_{j}\sim B_{j},
yji|ϕjip(ϕji).\displaystyle y_{ji}|\phi_{ji}\sim p\left(\phi_{ji}\right). (A.8)

The resulting model is called an HDP mixture model, with the HDP referring to the hierarchically linked set of Dirichlet processes, and the mixture model referring to the way the Dirichlet processes are related to the data items.

To understand how the HDP implements a clustering model, and how clusters become shared across groups, recall that draws from a Dirichlet process are atomic probability measures with probability one. The base distribution B0B_{0} can be expressed using a stick-breaking representation,

B0=k=1βkδθk,\displaystyle B_{0}=\sum_{k=1}^{\infty}\beta_{k}\delta_{\theta_{k}}, (A.9)

where there are an infinite number of atoms, θkH\theta_{k}\sim H, k=1,2,k=1,2,\ldots. Each atom is associated with a mass βk\beta_{k}, where β=(βk)k=1Stick(γ)\beta=\left(\beta_{k}\right)_{k=1}^{\infty}\sim\operatorname{Stick}(\gamma), and the atoms and masses are mutually independent. Since B0B_{0} is the base distribution for the group-specific Dirichlet processes, each BjB_{j} has the same atoms as B0B_{0} and can be written in the form,

Bj=k=1πjkδθk.\displaystyle B_{j}=\sum_{k=1}^{\infty}\pi_{jk}\delta_{\theta_{k}}. (A.10)

Let πj=(πjk)k=1\pi_{j}=\left(\pi_{jk}\right)_{k=1}^{\infty}. Note that the weights πj\pi_{j} are independent given β\beta (since the BjB_{j} are independent given B0B_{0}). It can be shown that the connection between the weights πj\pi_{j} and the global weights β\beta is

πjα0,βDP(α0,β).\displaystyle\pi_{j}\mid\alpha_{0},\beta\sim\operatorname{DP}\left(\alpha_{0},\beta\right). (A.11)

Thus the set of atoms is shared across all groups, with each group having its own group-specific atom masses. Relating this representation back to the observed data, we see that each data item is described by a mixture model,

yji|Bjk=1πjkf(θk),\displaystyle y_{ji}|B_{j}\sim\sum_{k=1}^{\infty}\pi_{jk}f\left(\theta_{k}\right), (A.12)

where the atoms θk\theta_{k} play the role of the mixture component parameters, while the masses πjk\pi_{jk} play the role of the mixing proportions. As a result, each group of data is modeled using a mixture model, with mixture components shared across all groups and group-specific mixing weights.
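A truncated stick-breaking sketch of this construction may help; the truncation level K, the choice H = N(0,1), and all parameter values below are our assumptions. The finite Dirichlet draw with parameters α0β approximates πj ∼ DP(α0, β) at the truncation level.

```python
import numpy as np

def stick_breaking(a, K, rng):
    """Truncated stick-breaking weights with zeta_k ~ Beta(1, a)."""
    zeta = rng.beta(1.0, a, size=K)
    zeta[-1] = 1.0                              # close the stick at level K
    rest = np.concatenate([[1.0], np.cumprod(1.0 - zeta[:-1])])
    return zeta * rest

rng = np.random.default_rng(0)
K, gamma_, alpha0, J = 50, 2.0, 3.0, 4
beta = stick_breaking(gamma_, K, rng)           # global weights over shared atoms
theta = rng.normal(size=K)                      # shared atoms theta_k ~ H = N(0, 1)
pi = np.array([rng.dirichlet(alpha0 * beta) for _ in range(J)])  # pi_j ~ DP(alpha0, beta)
print(pi.shape, pi.sum(axis=1))                 # (4, 50); each row sums to 1
```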


Chinese Restaurant Franchise
Teh et al. (2006) have also described the marginal probabilities obtained from integrating over the random measures B0B_{0} and (Bj)j=1J(B_{j})_{j=1}^{J}. They show that these marginals can be described in terms of a Chinese restaurant franchise (CRF) that is an analog of the Chinese restaurant process.

Recall that ϕji\phi_{ji} are random variables with distribution BjB_{j}. In the following discussion, we will let θ1,,θK\theta_{1},\ldots,\theta_{K} denote KK i.i.d. random variables distributed according to HH, and, for each jj, we let ψj1,,ψjTj\psi_{j1},\ldots,\psi_{jT_{j}} denote TjT_{j} i.i.d. variables distributed according to B0B_{0}.

Each ϕji\phi_{ji} is associated with one ψjt\psi_{jt}, while each ψjt\psi_{jt} (table id) is associated with one θk\theta_{k}. Let tjit_{ji} be the index of the ψjt\psi_{jt} associated with ϕji\phi_{ji}, and let kjtk_{jt} (dish id) be the index of the θk\theta_{k} associated with ψjt\psi_{jt}. Let njtn_{jt} be the number of ϕji\phi_{ji}’s associated with ψjt\psi_{jt}, and let mjkm_{jk} be the number of ψjt\psi_{jt}’s associated with θk\theta_{k}. Define mk=jmjkm_{k}=\sum_{j}m_{jk} as the number of ψjt\psi_{jt}’s associated with θk\theta_{k} across all jj. Notice that while the values taken on by the ψjt\psi_{jt}’s need not be distinct, since they are distributed according to the discrete random probability measure B0DP(γ,H)B_{0}\sim\mathrm{DP}(\gamma,H), we denote them as distinct random variables.

First consider the conditional distribution of ϕji\phi_{ji} given ϕj1,,ϕji1\phi_{j1},\ldots,\phi_{ji-1} and B0B_{0}, with BjB_{j} integrated out:

ϕji|ϕj1,,ϕji1,α0,B0t=1Tjnjti1+α0δψjt+α0i1+α0B0.\displaystyle\phi_{ji}|\phi_{j1},\ldots,\phi_{ji-1},\alpha_{0},B_{0}\sim\sum_{t=1}^{T_{j}}\frac{n_{jt}}{i-1+\alpha_{0}}\delta_{\psi_{jt}}+\frac{\alpha_{0}}{i-1+\alpha_{0}}B_{0}. (A.13)

This is a mixture, and a draw from this mixture can be obtained by drawing from the terms on the right-hand side with probabilities given by the corresponding mixing proportions. If a term in the first summation is chosen, then we set ϕji=ψjt\phi_{ji}=\psi_{jt} and let tji=tt_{ji}=t for the chosen tt. If the second term is chosen, then we increment TjT_{j} by one, draw ψjTjB0\psi_{jT_{j}}\sim B_{0} and set ϕji=ψjTj\phi_{ji}=\psi_{jT_{j}} and tji=Tjt_{ji}=T_{j}.

Now we proceed to integrate out B0B_{0}. Notice that B0B_{0} appears only in its role as the distribution of the variables ψjt\psi_{jt}. Since B0B_{0} is distributed according to a Dirichlet process, we can integrate it out and write the conditional distribution of ψjt\psi_{jt} directly:

ψjt|ψ11,ψ12,,ψ21,,ψjt1,γ,Hk=1Kmkkmk+γδθk+γkmk+γH.\displaystyle\psi_{jt}|\psi_{11},\psi_{12},\ldots,\psi_{21},\ldots,\psi_{jt-1},\gamma,H\sim\sum_{k=1}^{K}\frac{m_{k}}{\sum_{k}m_{k}+\gamma}\delta_{\theta_{k}}+\frac{\gamma}{\sum_{k}m_{k}+\gamma}H. (A.14)

If we draw ψjt\psi_{jt} via choosing a term in the summation on the right-hand side of this equation, we set ψjt=θk\psi_{jt}=\theta_{k} and let kjt=kk_{jt}=k for the chosen kk. If the second term is chosen, we increment KK by one, draw θKH\theta_{K}\sim H and set ψjt=θK,kjt=K\psi_{jt}=\theta_{K},k_{jt}=K.

In short, the CRF comprises JJ restaurants, each corresponding to an HDP group, with an infinite buffet line of dishes shared across all restaurants. The process of seating customers at tables, however, is restaurant specific. Each customer is preassigned to a given restaurant determined by that customer’s group jj. Upon entering the jj-th restaurant in the CRF, customer yjiy_{ji} sits at a currently occupied table tjit_{ji} with probability proportional to the number of customers already seated there, or starts a new table Tj+1T_{j}+1 with probability proportional to α0\alpha_{0}. The first customer to sit at a table goes to the buffet line and picks a dish kjtk_{jt} for their table, choosing a dish with probability proportional to the number of times that dish has been picked previously, or ordering a new dish θK+1\theta_{K+1} with probability proportional to γ\gamma. The intuition behind this predictive distribution is that integrating over the global dish probabilities β\beta results in customers making decisions based on the observed popularity of the dishes throughout the entire franchise.

A.4 Random Effects vs. Fixed-Effects

Regarding the connection between the Bayesian and frequentist panel data models, according to Koop (2003), if we impose a

  1. (i)

non-hierarchical prior (such as a normal prior without hyperpriors) on the intercept αi0\alpha_{i0}, the resulting panel data model is equivalent to the frequentist fixed-effects model. This is essentially a Bayesian linear regression with standard priors on the parameters.

  2. (ii)

hierarchical prior on the intercept αi0\alpha_{i0}, the resulting panel data model is equivalent to the frequentist random effects model. A convenient hierarchical prior assumes that, for i=1,,Ni=1,\ldots,N,

    αi0N(μα,σα2).\alpha_{i0}\sim N\left(\mu_{\alpha},\sigma^{2}_{\alpha}\right).

The hierarchical structure of the prior arises if we treat μα\mu_{\alpha} and σα2\sigma^{2}_{\alpha} as unknown parameters that require their own priors. We assume μα\mu_{\alpha} and σα2\sigma^{2}_{\alpha} to be independent of one another with

    μαN(μ¯α,σ¯α2),\mu_{\alpha}\sim N\left(\underline{\mu}_{\alpha},\underline{\sigma}_{\alpha}^{2}\right),

    and

    σα2IG(τ¯α,ν¯α).\sigma^{2}_{\alpha}\sim IG\left(\underline{\tau}_{\alpha},\underline{\nu}_{\alpha}\right).

This is analogous to the random effects model, as the αi0\alpha_{i0} are essentially assumed to be drawn from an underlying distribution, and the data are used to update our prior on the hyperparameters of that distribution.

The discussion in Koop (2003) is in line with the hierarchical models discussed in Smith (1973). The panel data model equipped with a non-hierarchical prior is a two‐stage hierarchical model which results in a fixed effects model, while incorporating a hierarchical prior forms a three‐stage hierarchical model that corresponds to a random effects model.

Returning to our setting, if the baseline prior for αi0\alpha_{i0} is a DP (DPM) prior, then our proposed nonparametric Bayesian prior is a type of non-hierarchical (hierarchical) prior with a latent group structure in the intercepts, and hence we call our proposed estimator the constrained grouped fixed (random) effects estimator.

Appendix B Priors

Prior on Group-Specific Parameters

(αi,σi2)k=1πkδ(αk,σk2) with (αk,σk2)B0(ϕ).\displaystyle\left(\alpha_{i},\sigma_{i}^{2}\right)\sim\sum_{k=1}^{\infty}\pi_{k}\delta_{\left(\alpha_{k},\sigma_{k}^{2}\right)}\text{ with }\left(\alpha_{k},\sigma_{k}^{2}\right)\sim B_{0}(\phi). (B.1)

B0B_{0} is an Independent Normal Inverse-Gamma (INIG) distribution:

B0:=INIG(μα,Σα,νσ2,δσ2),\displaystyle B_{0}:=INIG\left(\mu_{\alpha},\Sigma_{\alpha},\frac{\nu_{\sigma}}{2},\frac{\delta_{\sigma}}{2}\right), (B.2)

with a set of hyperparameters ϕ=(μα,Σα,νσ2,δσ2)=(0,1,6,5)\phi=\left(\mu_{\alpha},\Sigma_{\alpha},\frac{\nu_{\sigma}}{2},\frac{\delta_{\sigma}}{2}\right)=\left(0,1,6,5\right).
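For reference, a draw from B0B_{0} under these hyperparameters can be sketched as follows (assuming a scalar αk\alpha_{k}; the inverse-gamma draw uses the reciprocal-gamma representation, and the function name is ours):

```python
import numpy as np

rng = np.random.default_rng(0)
mu_alpha, Sigma_alpha, nu_half, delta_half = 0.0, 1.0, 6.0, 5.0  # phi from above

def draw_from_B0(rng):
    """One draw (alpha_k, sigma2_k) from the INIG base distribution B0;
    alpha_k and sigma2_k are independent under the INIG form."""
    alpha_k = rng.normal(mu_alpha, np.sqrt(Sigma_alpha))
    sigma2_k = delta_half / rng.gamma(nu_half)   # IG(nu/2, delta/2) via 1/Gamma
    return alpha_k, sigma2_k

print(draw_from_B0(rng))
```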


Prior on Stick Lengths

ξkBeta(1,a),\displaystyle\xi_{k}\sim Beta\left(1,a\right), (B.3)

where aa is the concentration parameter.


Hyper-prior on Concentration Parameter

aGamma(m,n),\displaystyle a\sim Gamma\left(m,n\right), (B.4)

with (m,n)=(0.4,10)(m,n)=(0.4,10).


Prior on Common and Individual Slope Coefficients (if any)

Finally, the prior distribution for the common parameter ρ\rho is chosen to be a normal distribution to stay close to the linear regression framework,

ρN(0,σρ2) with σρ2=1.\displaystyle\rho\sim N(0,\sigma^{2}_{\rho})\text{ with }\sigma^{2}_{\rho}=1. (B.5)

The prior of the heterogeneous parameters βi\beta_{i} follows,

βiN(0,Σβ) with Σβ=1×𝐈p.\displaystyle\beta_{i}\sim N(0,\Sigma_{\beta})\text{ with }\Sigma_{\beta}=1\times\mathbf{I}_{p}. (B.6)

Appendix C Posterior Distributions and Algorithms

C.1 Blocked Gibbs Sampler and Algorithm

Initialization:

  1. (i)

    Preset the initial number of active groups as K0a=NK_{0}^{a}=N.

  2. (ii)

    Set concentration parameter aa to its prior mean.

  3. (iii)

In ignorance of group heterogeneity (K=1K=1) and heteroskedasticity, use the Anderson and Hsiao (1982) IV approach to obtain α^IV\hat{\alpha}_{IV} and Σ^α,IV\hat{\Sigma}_{\alpha,IV}. These IV estimates serve as the mean and covariance matrix in the related priors.

  4. (iv)

Generate K0aK_{0}^{a} random samples from the distribution N(α^IV,Σ^α,IV)N(\hat{\alpha}_{IV},\hat{\Sigma}_{\alpha,IV}).

  5. (v)

Initialize the group membership GG by assigning each unit to its own group: G(0)=[1,2,,N]G^{(0)}=[1,2,...,N].

For each iteration s=1,2,..,Nsims=1,2,..,N_{sim}

  1. (i)

    Number of active groups:

    Ka=max1iNgi(s1).\displaystyle K^{a}=\max_{1\leq i\leq N}g_{i}^{(s-1)}.
  2. (ii)

    Group “stick length”: for k=1,2,,Kak=1,2,...,K^{a}, draw ξk\xi_{k} from a Beta distribution in (C.14):

    ξk|a(s1),G(s1)Beta(|Bk|+1,a+j=1N𝟏(gj>k)),\displaystyle\xi_{k}|a^{(s-1)},G^{(s-1)}\sim\textit{Beta}\left(|B_{k}|+1,a+\sum_{j=1}^{N}\mathbf{1}(g_{j}>k)\right),

and calculate the group probabilities in accordance with (C.17).

  3. (iii)

    Group heterogeneity: for k=1,2,,Kak=1,2,...,K^{a}, draw αk(s)\alpha_{k}^{(s)} from a normal distribution in (C.12):

    αk|ρ(s1),β(s1),Σ(s1),G(s1),Y,XN(μ¯αk,Σ¯αk).\displaystyle\alpha_{k}|\rho^{(s-1)},\beta^{(s-1)},\Sigma^{(s-1)},G^{(s-1)},Y,X\sim\textit{N}\left(\bar{\mu}_{\alpha_{k}},\bar{\Sigma}_{\alpha_{k}}\right).
  4. (iv)

Group heteroscedasticity: for k=1,2,,Kak=1,2,...,K^{a}, draw σk2(s)\sigma_{k}^{2^{(s)}} from an inverse Gamma distribution in (C.13):

    σk2|ρ(s1),β(s1),α(s),G(s1),Y,XIG(v¯σ,k2,δ¯σ,k2).\displaystyle\sigma_{k}^{2}|\rho^{(s-1)},\beta^{(s-1)},\alpha^{(s)},G^{(s-1)},Y,X\sim\textit{IG}\left(\frac{\bar{v}_{\sigma,k}}{2},\frac{\bar{\delta}_{\sigma,k}}{2}\right).
  5. (v)

Auxiliary variables: for i=1,2,,Ni=1,2,...,N, draw uiu_{i} from a uniform distribution in (C.20):

ui|Ξ(s),G(s1)Unif(0,πgi(s)).\displaystyle u_{i}|\Xi^{(s)},G^{(s-1)}\sim Unif(0,\pi^{(s)}_{g_{i}}).

    Then calculate uu^{*} according to (C.9).

  6. (vi)

    DP concentration parameter:

    1. (a)

      Draw latent variable η\eta from a Beta distribution in (C.18):

      ηBeta(a+1,N)\displaystyle\eta\sim\textit{Beta}(a+1,N)
    2. (b)

      Draw concentration parameter aa from a mixture of Gamma distribution in (C.19):

      a|η,Ka{Gamma(m+Ka,nlog(η))with prob. πaGamma(m+Ka1,nlog(η))with prob. 1πa,a|\eta,K^{a}\sim\left\{\begin{array}[]{ll}\textit{Gamma}\left(m+K^{a},n-\log(\eta)\right)&\mbox{with prob. }\pi_{a}\\ \textit{Gamma}\left(m+K^{a}-1,n-\log(\eta)\right)&\mbox{with prob. }1-\pi_{a}\end{array}\right.,

      and πa\pi_{a} is defined as

      πa1πa=m+Ka1N(nlog(η)).\displaystyle\frac{\displaystyle\pi_{a}}{\displaystyle 1-\pi_{a}}=\frac{m+K^{a}-1}{N(n-\log(\eta))}.
  7. (vii)

    Potential groups: start with K~=Ka\tilde{K}=K^{a},

    1. (a)

      Group probabilities:

      1. (1)

        if j=1K~πj(s)>1u\sum_{j=1}^{\tilde{K}}\pi_{j}^{(s)}>1-u^{*}, set K=K~K^{*}=\tilde{K} and stop.

      2. (2)

otherwise, let K~=K~+1\tilde{K}=\tilde{K}+1, draw ξK~Beta(1,a(s)),\xi_{\tilde{K}}\sim Beta\left(1,a^{(s)}\right), update πK~=ξK~j<K~(1ξj)\pi_{\tilde{K}}=\xi_{\tilde{K}}\prod_{j<\tilde{K}}\left(1-\xi_{j}\right) and go to step (1)(1).

    2. (b)

Group parameters: for k=Ka+1,,Kk=K^{a}+1,\cdots,K^{*}, draw αk(s)\alpha_{k}^{(s)} and σk2(s)\sigma_{k}^{2(s)} from their prior distributions.

  8. (viii)

Group membership: for i=1,2,,Ni=1,2,...,N, draw gig_{i} from the multinomial distribution over k=1,2,,Kk=1,2,...,K^{*} given in (C.21).

C.2 Random Coefficients Model with Soft Constraints

We present the conditional posterior distributions of the parameters in the time-invariant random effects model with heteroscedasticity, positive-link constraints, and negative-link constraints, which is the most complicated scenario. For other models, such as the homoscedastic counterpart, adjustments can easily be made by assuming common error variances.

C.2.1 Derivation

Model:

yit\displaystyle y_{it} =αgixit+εit,εitN(0,σgi2).\displaystyle=\alpha^{\prime}_{g_{i}}x_{it}+\varepsilon_{it},\quad\varepsilon_{it}\sim N\left(0,\sigma_{g_{i}}^{2}\right). (C.1)

To facilitate derivation, we stack observations and parameters,

Dependent variable: Y\displaystyle\text{Dependent variable: }Y =[y1,y2,,yN],yi=[yi1,yi2,,yiT],\displaystyle=[y_{1},y_{2},\ldots,y_{N}],y_{i}=[y_{i1},y_{i2},\ldots,y_{iT}]^{\prime},
Covariates: X\displaystyle\text{Covariates: }X =[x1,x2,,xN],xi=[xi1,xi2,,xiT],\displaystyle=[x_{1},x_{2},\ldots,x_{N}],x_{i}=[x_{i1},x_{i2},\ldots,x_{iT}]^{\prime},
Group-specific parameters: α\displaystyle\text{Group-specific parameters: }\alpha =[α1,α2,,αN],\displaystyle=\left[\alpha_{1},\alpha_{2},\ldots,\alpha_{N}\right],
Error variance: Σ\displaystyle\text{Error variance: }\Sigma =[σ12,σ22,,σN2],\displaystyle=\left[\sigma^{2}_{1},\sigma^{2}_{2},\ldots,\sigma^{2}_{N}\right],
Stick length: Ξ\displaystyle\text{Stick length: }\Xi =[ξ1,ξ2,],\displaystyle=\left[\xi_{1},\xi_{2},\ldots\right],
Group indices: G\displaystyle\text{Group indices: }G =[g1,,gN],\displaystyle=\left[g_{1},\ldots,g_{N}\right],
Auxiliary variable: u\displaystyle\text{Auxiliary variable: }u =[u1,u2,,uN],\displaystyle=\left[u_{1},u_{2},...,u_{N}\right],
Hyper parameters: ϕ\displaystyle\text{Hyper parameters: }\phi =[μα,Σα,νσ,δσ].\displaystyle=\left[\mu_{\alpha},\Sigma_{\alpha},\nu_{\sigma},\delta_{\sigma}\right].

In order to write down the posterior of the unknown parameters given a set of pairwise constraints, a probabilistic model of how the constraint weights are obtained must be specified. Inspired by Shental et al. (2003), we make the following assumptions:

Assumption 3.

(Data)

  1. (i)

Data points are first sampled i.i.d. from the full probability distribution conditional on GG.

  2. (ii)

From this sample, pairs of points are randomly chosen according to a uniform distribution. If both points in a pair belong to the same source, a positive-link constraint is formed; a negative-link constraint is formed when they belong to different sources.

The posterior of unknown objects in the random coefficients model is,

p(α,σ2,Ξ,a,G|Y,X,W,ϕ)\displaystyle p(\alpha,\sigma^{2},\Xi,a,G|Y,X,W,\phi)
\displaystyle\propto\; p(Y|X,α,σ2,G)p(α,σ2|ϕ)p(G|Ξ,W)p(Ξ|a)p(a).\displaystyle p(Y|X,\alpha,\sigma^{2},G)p(\alpha,\sigma^{2}|\phi)p(G|\Xi,W)p(\Xi|a)p(a). (C.2)

All priors have been well-defined except for p(G|Ξ,W)p(G|\Xi,W), the prior for the group indices GG conditional on the stick lengths Ξ\Xi and the constraint weights WW.

Using Bayes’ rule, the modified prior for the group indices is

p(G|Ξ,W)=p(W|G)p(G|Ξ)Gp(W|G)p(G|Ξ)p(W|G)p(G|Ξ),\displaystyle p(G|\Xi,W)=\frac{p(W|G)p(G|\Xi)}{\sum_{G^{\prime}}p(W|G^{\prime})p(G^{\prime}|\Xi)}\;\propto\;p(W|G)p(G|\Xi), (C.3)

where the sum in the denominator is taken over all possible group partitions, and p(W|G)p(W|G) is a weighting function of the form:

p(W|G)=i,jexp(cWijδij),\displaystyle p(W|G)=\prod_{i,j}\exp\left(cW_{ij}\delta_{ij}\right),

and p(G|Ξ)p(G|\Xi) is the density of a categorical distribution with probabilities generated by the stick-breaking process.

From (C.3), the prior of gig_{i} conditional on the group indices of the other units, G(i)G^{(-i)}, is

p(gi|Ξ,Wi,G(i))p(Wi|G)p(gi|Ξ),\displaystyle p(g_{i}|\Xi,W_{i},G^{(-i)})\;\propto\;p\left(W_{i}|G\right)p(g_{i}|\Xi), (C.4)

where Wi={Wij|j=1,,N}W_{i}=\{W_{ij}|j=1,...,N\} and

p(Wi|G)=j=1Nexp(2cWijδij).\displaystyle p(W_{i}|G)=\prod_{j=1}^{N}\exp\left(2cW_{ij}\delta_{ij}\right). (C.5)

Given the expression of p(gi|Ξ,Wi,G(i))p(g_{i}|\Xi,W_{i},G^{(-i)}) and the DP prior specified in Appendix B, the posterior of unknown objects in the random coefficients model can be written as,

p(α,σ2,Ξ,a,G|Y,X,W,ϕ)\displaystyle p(\alpha,\sigma^{2},\Xi,a,G|Y,X,W,\phi)
\displaystyle\propto\; p(Y|X,α,σ2,G)p(α,σ2|ϕ)p(G|Ξ,W)p(Ξ|a)p(a)\displaystyle p(Y|X,\alpha,\sigma^{2},G)p(\alpha,\sigma^{2}|\phi)p(G|\Xi,W)p(\Xi|a)p(a)
\displaystyle\propto\; i=1Np(yi|xi,αgi,σgi2)j=1p(αj,σj2|ϕ)j=1p(ξj|a)i=1Np(gi|Ξ,Wi,G(i))p(a)\displaystyle\prod_{i=1}^{N}p\left(y_{i}|x_{i},\alpha_{g_{i}},\sigma^{2}_{g_{i}}\right)\prod_{j=1}^{\infty}p(\alpha_{j},\sigma^{2}_{j}|\phi)\prod_{j=1}^{\infty}p(\xi_{j}|a)\prod_{i=1}^{N}p\left(g_{i}|\Xi,W_{i},G^{(-i)}\right)p(a)
=\displaystyle=\; [i=1Np(yi|xi,αgi,σgi2)p(gi|Ξ,Wi,G(i))][j=1p(αj,σj2|ϕ)p(ξj|a)]p(a)\displaystyle\left[\prod_{i=1}^{N}p\left(y_{i}|x_{i},\alpha_{g_{i}},\sigma^{2}_{g_{i}}\right)p\left(g_{i}|\Xi,W_{i},G^{(-i)}\right)\right]\left[\prod_{j=1}^{\infty}p(\alpha_{j},\sigma^{2}_{j}|\phi)p(\xi_{j}|a)\right]p(a)
=\displaystyle=\; [i=1Np(yi|xi,αgi,σgi2)p(Wi|G)p(gi|Ξ)][j=1p(αj,σj2|ϕ)p(ξj|a)]p(a).\displaystyle\left[\prod_{i=1}^{N}p\left(y_{i}|x_{i},\alpha_{g_{i}},\sigma^{2}_{g_{i}}\right)p\left(W_{i}|G\right)p(g_{i}|\Xi)\right]\left[\prod_{j=1}^{\infty}p(\alpha_{j},\sigma^{2}_{j}|\phi)p(\xi_{j}|a)\right]p(a). (C.6)

In the following derivation and algorithm, we adopt the slice sampler (Walker, 2007), which avoids the approximation in Ishwaran and James (2001). Walker (2007) augments the posterior distribution with a set of auxiliary variables u=[u1,u2,,uN]u=[u_{1},u_{2},...,u_{N}], which are i.i.d. standard uniform random variables, i.e., uiiidU(0,1)u_{i}\stackrel{{\scriptstyle iid}}{{\sim}}U(0,1). The augmented posterior is then written as,

p(α,σ2,Ξ,a,G,u|Y,X,W,ϕ)\displaystyle p(\alpha,\sigma^{2},\Xi,a,G,u|Y,X,W,\phi)
\displaystyle\propto\; [i=1Np(yi|xi,αgi,σgi2)𝟏(ui<πgi)p(Wi|G)][j=1p(αj,σj2|ϕ)p(ξj|a)]p(a)\displaystyle\left[\prod_{i=1}^{N}p\left(y_{i}|x_{i},\alpha_{g_{i}},\sigma^{2}_{g_{i}}\right)\mathbf{1}(u_{i}<\pi_{g_{i}})p\left(W_{i}|G\right)\right]\left[\prod_{j=1}^{\infty}p(\alpha_{j},\sigma^{2}_{j}|\phi)p(\xi_{j}|a)\right]p(a)
=\displaystyle=\; [i=1Np(yi|xi,αgi,σgi2)p(ui|πgi)πgip(Wi|G)][j=1p(αj,σj2|ϕ)p(ξj|a)]p(a),\displaystyle\left[\prod_{i=1}^{N}p\left(y_{i}|x_{i},\alpha_{g_{i}},\sigma^{2}_{g_{i}}\right)p(u_{i}|\pi_{g_{i}})\pi_{g_{i}}p\left(W_{i}|G\right)\right]\left[\prod_{j=1}^{\infty}p(\alpha_{j},\sigma^{2}_{j}|\phi)p(\xi_{j}|a)\right]p(a), (C.7)

where πgi=p(gi|Ξ)\pi_{g_{i}}=p(g_{i}|\Xi), p(ui|πgi)p(u_{i}|\pi_{g_{i}}) is a uniform distribution defined on [0,πgi][0,\pi_{g_{i}}], and 𝟏()\mathbf{1}(\cdot) is the indicator function, which equals zero unless the specified condition is met. The original posterior can be recovered by integrating out uiu_{i} for i=1,2,,Ni=1,2,...,N. As we do not bound the number of groups from above, it is impossible to sample from the infinite-dimensional posterior density directly. The merit of slice sampling is that it reduces the problem to a manageable finite-dimensional one, as we will see below.

With a set of auxiliary variables u=[u1,u2,,uN]u=[u_{1},u_{2},...,u_{N}], we define the largest possible number of potential components as

K=mink{u>1j=1kπj},\displaystyle K^{*}=\min_{k}\left\{u^{*}>1-\sum_{j=1}^{k}\pi_{j}\right\}, (C.8)

where

u=min1iNui.\displaystyle u^{*}=\min_{1\leq i\leq N}u_{i}. (C.9)

Such a specification ensures that for any group k>Kk>K^{*} and any unit i{1,2,,N}i\in\{1,2,...,N\}, we have ui>πku_{i}>\pi_{k} (see the proof in Theorem 4). This crucial property limits the dimension of (αk,σk2)(\alpha_{k},\sigma^{2}_{k}) to KK^{*}, as the posterior densities of (αk,σk2)(\alpha_{k},\sigma^{2}_{k}) equal 0 for k>Kk>K^{*} due to 𝟏(ui<πk)=0\mathbf{1}(u_{i}<\pi_{k})=0, which will become clear in the subsequent posterior derivation. Intuitively, the latent variable uiu_{i} has the effect of “dynamically truncating” the number of groups that need to be sampled.
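A minimal sketch of computing KK^{*} from (C.8)-(C.9), with illustrative stick weights (function name and values ours):

```python
import numpy as np

def truncation_level(pi, u_star):
    """Smallest K* such that sum_{j<=K*} pi_j > 1 - u_star, per (C.8)."""
    csum = np.cumsum(pi)
    return int(np.searchsorted(csum, 1.0 - u_star, side="right")) + 1

pi = np.array([0.5, 0.3, 0.1, 0.05, 0.05])
print(truncation_level(pi, u_star=0.12))  # 3: the first three sticks cover > 0.88
```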

Next, we define the number of active groups

Ka=max1iNgi.\displaystyle K^{a}=\max_{1\leq i\leq N}g_{i}. (C.10)

It can be shown that KaKK^{a}\leq K^{*} (see the proof in Theorem 4).

As the base distribution B0B_{0} is the Independent Normal Inverse-Gamma distribution, the prior densities of αi\alpha_{i} and σi2\sigma^{2}_{i} are independent.

Conditional posterior of α\alpha (grouped coefficients).

p(α|σ2,G,Y,X,ϕ)\displaystyle p(\alpha|\sigma^{2},G,Y,X,\phi)\;\propto\; [i=1Np(yi|xi,αgi,σgi2)][j=1p(αj,σj2|ϕ)].\displaystyle\left[\prod_{i=1}^{N}p\left(y_{i}|x_{i},\alpha_{g_{i}},\sigma^{2}_{g_{i}}\right)\right]\left[\prod_{j=1}^{\infty}p(\alpha_{j},\sigma^{2}_{j}|\phi)\right].

For k=1,2,,Kak=1,2,...,K^{a}, define the set of units that belong to group kk,

Bk={i|gi=k,i{1,2,,N}},\displaystyle B_{k}=\left\{i|g_{i}=k,i\in\{1,2,...,N\}\right\}, (C.11)

then the posterior density for αk\alpha_{k} reads as

p(αk|σk2,G,Y,X,ϕ)\displaystyle p(\alpha_{k}|\sigma_{k}^{2},G,Y,X,\phi)
\displaystyle\propto\; [iBkp(yi|xi,αgi,σgi2)]p(αk|ϕ)\displaystyle\left[\prod_{i\in B_{k}}p(y_{i}|x_{i},\alpha_{g_{i}},\sigma^{2}_{g_{i}})\right]p(\alpha_{k}|\phi)
\displaystyle\propto\; exp[12σk2iBk(yixiαk)(yixiαk)]exp[12(αkμα)Σα1(αkμα)]\displaystyle\exp\left[-\frac{1}{2\sigma^{2}_{k}}\sum_{i\in B_{k}}\left(y_{i}-x_{i}\alpha_{k}\right)^{\prime}\left(y_{i}-x_{i}\alpha_{k}\right)\right]\exp\left[-\frac{1}{2}\left(\alpha_{k}-\mu_{\alpha}\right)^{\prime}\Sigma_{\alpha}^{-1}\left(\alpha_{k}-\mu_{\alpha}\right)\right]
\displaystyle\propto\; exp[12(αkμ¯αk)Σ¯αk1(αkμ¯αk)].\displaystyle\exp\left[-\frac{1}{2}\left(\alpha_{k}-\bar{\mu}_{\alpha_{k}}\right)^{\prime}\bar{\Sigma}_{\alpha_{k}}^{-1}\left(\alpha_{k}-\bar{\mu}_{\alpha_{k}}\right)\right].

Assuming an independent normal conjugate prior for αk\alpha_{k}, the posterior for αk\alpha_{k} is given by

αk|σk2,G,Y,X,ϕN(μ¯αk,Σ¯αk).\displaystyle\alpha_{k}|\sigma_{k}^{2},G,Y,X,\phi\sim N\left(\bar{\mu}_{\alpha_{k}},\bar{\Sigma}_{\alpha_{k}}\right). (C.12)

where

Σ¯αk\displaystyle\bar{\Sigma}_{\alpha_{k}} =(Σα1+σk2iBkxixi)1,\displaystyle=\left(\Sigma_{\alpha}^{-1}+\sigma_{k}^{-2}\sum_{i\in B_{k}}x_{i}^{\prime}x_{i}\right)^{-1},
μ¯αk\displaystyle\bar{\mu}_{\alpha_{k}} =Σ¯αk(Σα1μα+σk2iBkxiyiα),\displaystyle=\bar{\Sigma}_{\alpha_{k}}\left(\Sigma_{\alpha}^{-1}\mu_{\alpha}+\sigma_{k}^{-2}\sum_{i\in B_{k}}x_{i}^{\prime}y^{\alpha}_{i}\right),
yiα\displaystyle y^{\alpha}_{i} =yi,\displaystyle=y_{i}, given model (C.1); in the general model with common parameter ρ\rho and individual coefficients βi\beta_{i}, yiαy^{\alpha}_{i} is the corresponding partial residual, yiρy1,iziβiy_{i}-\rho y_{-1,i}-z_{i}\beta_{i}.
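A sketch of this conjugate draw in Python, assuming the simplified model (C.1) so that yiα=yiy^{\alpha}_{i}=y_{i}; function names and test values are ours:

```python
import numpy as np

def draw_alpha_k(y_list, x_list, sigma2_k, mu_alpha, Sigma_alpha, rng):
    """Draw alpha_k from the conditional posterior in (C.12), given the
    stacked data (y_i, x_i) of the units i in B_k and the group variance."""
    prec = np.linalg.inv(Sigma_alpha)
    xx = sum(x.T @ x for x in x_list)
    xy = sum(x.T @ y for x, y in zip(x_list, y_list))
    Sigma_bar = np.linalg.inv(prec + xx / sigma2_k)
    mu_bar = Sigma_bar @ (prec @ mu_alpha + xy / sigma2_k)
    return rng.multivariate_normal(mu_bar, Sigma_bar)

rng = np.random.default_rng(0)
T, p = 11, 1
x_list = [np.ones((T, p)) for _ in range(5)]        # B_k with 5 units, intercept only
y_list = [rng.normal(1.0, 1.0, size=T) for _ in range(5)]
print(draw_alpha_k(y_list, x_list, 1.0, np.zeros(p), np.eye(p), rng))
```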

Conditional posterior of σ2\sigma^{2} (grouped variance). Under the assumption of cross-sectional independence, for k=1,2,,Kak=1,2,...,K^{a},

p(σk2|αk,G,Y,X,ϕ)\displaystyle p(\sigma^{2}_{k}|\alpha_{k},G,Y,X,\phi)\;\propto\; [iBkp(yi|xi,αgi,σgi2)]p(σk2|ϕ).\displaystyle\left[\prod_{i\in B_{k}}p\left(y_{i}|x_{i},\alpha_{g_{i}},\sigma^{2}_{g_{i}}\right)\right]p(\sigma_{k}^{2}|\phi).

With an inverse-gamma prior σk2IG(vσ2,δσ2)\sigma_{k}^{2}\sim IG\left(\frac{v_{\sigma}}{2},\frac{\delta_{\sigma}}{2}\right), the posterior distribution of σk2\sigma_{k}^{2} is

p(σk2|αk,G,Y,X,ϕ)\displaystyle p(\sigma^{2}_{k}|\alpha_{k},G,Y,X,\phi)
\displaystyle\propto\; iBk[(σk2)T2exp(12σk2(yixiαk)(yixiαk))](1σk2)vσ2+1exp(δσ2σk2)\displaystyle\prod_{i\in B_{k}}\left[\left(\sigma_{k}^{2}\right)^{-\frac{T}{2}}\exp\left(-\frac{1}{2\sigma^{2}_{k}}\left(y_{i}-x_{i}\alpha_{k}\right)^{\prime}\left(y_{i}-x_{i}\alpha_{k}\right)\right)\right]\left(\frac{1}{\sigma_{k}^{2}}\right)^{\frac{v_{\sigma}}{2}+1}\exp\left(-\frac{\delta_{\sigma}}{2\sigma_{k}^{2}}\right)
=\displaystyle=\; (1σk2)vσ+T|Bk|2+1exp[δσ+iBk(yixiαk)(yixiαk)2σk2].\displaystyle\left(\frac{1}{\sigma_{k}^{2}}\right)^{\frac{v_{\sigma}+T|B_{k}|}{2}+1}\exp\left[-\frac{\delta_{\sigma}+\sum_{i\in B_{k}}\left(y_{i}-x_{i}\alpha_{k}\right)^{\prime}\left(y_{i}-x_{i}\alpha_{k}\right)}{2\sigma_{k}^{2}}\right].

This implies

σk2|αk,G,Y,X,ϕIG(v¯σ,k2,δ¯σ,k2),\displaystyle\sigma^{2}_{k}|\alpha_{k},G,Y,X,\phi\sim IG\left(\frac{\bar{v}_{\sigma,k}}{2},\frac{\bar{\delta}_{\sigma,k}}{2}\right), (C.13)

where

v¯σ,k\displaystyle\bar{v}_{\sigma,k} =vσ+T|Bk|,\displaystyle=v_{\sigma}+T|B_{k}|,
δ¯σ,k\displaystyle\bar{\delta}_{\sigma,k} =δσ+iBk(yixiαk)(yixiαk),\displaystyle=\delta_{\sigma}+\sum_{i\in B_{k}}\left(y_{i}-x_{i}\alpha_{k}\right)^{\prime}\left(y_{i}-x_{i}\alpha_{k}\right),
|Bk|\displaystyle|B_{k}| =# of units in group k.\displaystyle=\text{\# of units in group }k.
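The corresponding inverse-gamma draw, sketched under the same assumptions (the reciprocal-gamma representation is used for the IG draw; the test values νσ=12\nu_{\sigma}=12 and δσ=10\delta_{\sigma}=10 match the hyperparameters in Appendix B):

```python
import numpy as np

def draw_sigma2_k(y_list, x_list, alpha_k, nu_sigma, delta_sigma, rng):
    """Draw sigma2_k from the inverse-gamma posterior in (C.13)."""
    T = y_list[0].shape[0]
    ssr = sum(float((y - x @ alpha_k) @ (y - x @ alpha_k))
              for y, x in zip(y_list, x_list))
    nu_bar = nu_sigma + T * len(y_list)             # v_sigma + T * |B_k|
    delta_bar = delta_sigma + ssr
    return (delta_bar / 2.0) / rng.gamma(nu_bar / 2.0)   # IG via reciprocal Gamma

rng = np.random.default_rng(0)
T, p = 11, 1
x_list = [np.ones((T, p))] * 3
y_list = [rng.normal(size=T) for _ in range(3)]
print(draw_sigma2_k(y_list, x_list, np.zeros(p), 12.0, 10.0, rng))
```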

Conditional posterior of Ξ\Xi (stick length).

p(Ξ|a,G)\displaystyle p(\Xi|a,G)
\displaystyle\propto\; [i=1Np(ui|πgi)πgi][j=1p(ξj|a)]\displaystyle\left[\prod_{i=1}^{N}p(u_{i}|\pi_{g_{i}})\pi_{g_{i}}\right]\left[\prod_{j=1}^{\infty}p(\xi_{j}|a)\right]
\displaystyle\propto\; [i=1Np(ui|πgi)ξgil<gi(1ξl)][j=1p(ξj|a)].\displaystyle\left[\prod_{i=1}^{N}p(u_{i}|\pi_{g_{i}})\xi_{g_{i}}\prod_{l<g_{i}}(1-\xi_{l})\right]\left[\prod_{j=1}^{\infty}p(\xi_{j}|a)\right].

For k=1,2,,Kak=1,2,...,K^{a},

p(ξk|a,G)\displaystyle p(\xi_{k}|a,G)\;\propto\; (iBkξk)(1ξk)j=1N𝟏(gj>k)(1ξk)a1\displaystyle\left(\prod_{i\in B_{k}}\xi_{k}\right)(1-\xi_{k})^{\sum\limits_{j=1}^{N}\mathbf{1}(g_{j}>k)}(1-\xi_{k})^{a-1}
\displaystyle\propto\; ξk|Bk|(1ξk)a+j=1N𝟏(gj>k)1,\xi_{k}^{|B_{k}|}\left(1-\xi_{k}\right)^{a+\sum\limits_{j=1}^{N}\mathbf{1}(g_{j}>k)-1},

where BkB_{k} is the set of units that currently belong to group kk; see equation (C.11).

Therefore, the posterior distribution of ξk\xi_{k} is

ξk|a,GBeta(|Bk|+1,a+j=1N𝟏(gj>k)).\displaystyle\xi_{k}|a,G\sim\textit{Beta}\left(|B_{k}|+1,a+\sum_{j=1}^{N}\mathbf{1}(g_{j}>k)\right). (C.14)

Given Ξ=[ξ1,ξ2,,ξKa]\Xi=[\xi_{1},\xi_{2},...,\xi_{K^{a}}], update the group probabilities π1,π2,,πKa\pi_{1},\pi_{2},...,\pi_{K^{a}}:

πk|G,Ξ={ξ1,k=1ξkj<k(1ξj),k=2,,Ka.\displaystyle\pi_{k}|G,\Xi=\left\{\begin{array}[]{ll}{\xi_{1},}&{k=1}\\ {\xi_{k}\prod_{j<k}\left(1-\xi_{j}\right),}&{k=2,\ldots,K^{a}}\end{array}\right.. (C.17)
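A compact sketch of steps (C.14) and (C.17) together, assuming 1-based group labels (function name and test values ours):

```python
import numpy as np

def draw_sticks(G, a, K_active, rng):
    """Draw xi_k from (C.14) and convert to group probabilities via (C.17);
    G holds 1-based group labels."""
    G = np.asarray(G)
    xi = np.array([rng.beta(np.sum(G == k) + 1, a + np.sum(G > k))
                   for k in range(1, K_active + 1)])
    pi = xi * np.concatenate([[1.0], np.cumprod(1.0 - xi[:-1])])
    return xi, pi

rng = np.random.default_rng(0)
xi, pi = draw_sticks(G=[1, 1, 2, 3, 2, 1], a=0.5, K_active=3, rng=rng)
print(pi)
```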

Conditional posterior of aa (concentration parameter). Regarding the DP concentration parameter, the standard posterior derivation does not work due to the unrestricted number of components in the current sampler. Instead, we implement the two-step procedure proposed by Escobar and West (1995, pp. 8-9). Following their approach, we first draw a latent variable η\eta,

ηBeta (a+1,N).\displaystyle\eta\sim\textit{Beta }(a+1,N). (C.18)

Then, conditional on η\eta and KaK^{a}, we draw aa from a mixture of two Gamma distributions:

p(a|η,Ka)=πaGamma (m+Ka,nlog(η))+(1πa)Gamma (m+Ka1,nlog(η)),\displaystyle p(a|\eta,K^{a})=\pi_{a}\textit{Gamma }(m+K^{a},n-\log(\eta))+(1-\pi_{a})\textit{Gamma }(m+K^{a}-1,n-\log(\eta)), (C.19)

with the weights πa\pi_{a} defined by

πa1πa=m+Ka1N[nlog(η)].\displaystyle\frac{\displaystyle\pi_{a}}{\displaystyle 1-\pi_{a}}=\frac{m+K^{a}-1}{N[n-\log(\eta)]}.
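A sketch of this two-step update, using numpy's shape-scale Gamma parameterization (so the rate nlog(η)n-\log(\eta) enters as its reciprocal); function name and test values are ours:

```python
import numpy as np

def draw_concentration(a_old, K_active, N, m, n, rng):
    """Two-step Escobar-West update for a, per (C.18)-(C.19)."""
    eta = rng.beta(a_old + 1.0, N)                    # latent variable
    odds = (m + K_active - 1.0) / (N * (n - np.log(eta)))
    pi_a = odds / (1.0 + odds)
    shape = m + K_active if rng.random() < pi_a else m + K_active - 1.0
    return rng.gamma(shape, 1.0 / (n - np.log(eta)))  # numpy uses shape-scale

rng = np.random.default_rng(0)
print(draw_concentration(a_old=0.5, K_active=4, N=200, m=0.4, n=10.0, rng=rng))
```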

Conditional posterior of uu (auxiliary variable). Conditional on the group “stick lengths” ξk\xi_{k} and group indices GG, it is straightforward to show that the posterior density of uiu_{i} is a uniform distribution defined on (0,πgi)(0,\pi_{g_{i}}), that is

ui|Ξ,GUnif (0,πgi),\displaystyle u_{i}|\Xi,G\sim\textit{Unif }(0,\pi_{g_{i}}), (C.20)

where πgi=ξgij<gi(1ξj)\pi_{g_{i}}=\xi_{g_{i}}\prod_{j<g_{i}}(1-\xi_{j}). Moreover, it is worth noting that the values of KK^{*} and uu^{*} need to be updated according to equations (C.8) and (C.9) after this step.

Conditional posterior of GG (group indices). We derive the posterior distribution of gig_{i} conditional on G(i)G^{(-i)}, where G(i)G^{(-i)} is the set of all membership indices except gig_{i}, i.e., G(i)=G{gi}G^{(-i)}=G\setminus\{g_{i}\}. As a result, for k=1,2,,Kk=1,2,...,K^{*},

p(gi=k|yi,xi,αk,σk2,G(i),ui)\displaystyle p\left(g_{i}=k|y_{i},x_{i},\alpha_{k},\sigma^{2}_{k},G^{(-i)},u_{i}\right)
\displaystyle\propto\; p(yi|xi,αk,σk2)𝟏(ui<πk)p(Wi|G)\displaystyle p\left(y_{i}|x_{i},\alpha_{k},\sigma^{2}_{k}\right)\mathbf{1}(u_{i}<\pi_{k})p\left(W_{i}|G\right)
=\displaystyle=\; p(yi|xi,αk,σk2)𝟏(ui<πk)j=1Nexp(2cWijδij).\displaystyle p\left(y_{i}|x_{i},\alpha_{k},\sigma^{2}_{k}\right)\mathbf{1}(u_{i}<\pi_{k})\prod_{j=1}^{N}\exp\left(2cW_{ij}\delta_{ij}\right). (C.21)

Finally, we normalize these point masses to obtain a valid distribution.
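Schematically, this draw can be sketched as follows, assuming 1-based group labels, a symmetric weight matrix WW with zero diagonal, and the convention in (C.5); all names are ours, and the weights are normalized on the log scale for numerical stability:

```python
import numpy as np

def draw_g_i(i, G, y_i, x_i, alphas, sigma2s, pi, u_i, c, W, rng):
    """Draw g_i from the normalized point masses in (C.21).

    G: 1-based group labels of all N units; alphas[k], sigma2s[k], pi[k]
    hold the parameters of group k+1; W is the N x N weight matrix."""
    K_star = len(alphas)
    logw = np.full(K_star, -np.inf)
    for k in range(K_star):
        if u_i >= pi[k]:                  # slice restriction 1(u_i < pi_k)
            continue
        resid = y_i - x_i @ alphas[k]
        loglik = (-0.5 * resid @ resid / sigma2s[k]
                  - 0.5 * len(y_i) * np.log(2 * np.pi * sigma2s[k]))
        delta = (np.asarray(G) == k + 1).astype(float)   # delta_ij = 1(g_j = k+1)
        delta[i] = 0.0                    # no self-constraint
        logw[k] = loglik + 2.0 * c * np.sum(W[i] * delta)
    w = np.exp(logw - logw.max())         # normalize on the log scale
    return int(rng.choice(K_star, p=w / w.sum())) + 1
```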

Appendix D Technical Proofs

D.1 Slice Sampling

Theorem 4.

Suppose that we have a model with posterior as given in Appendix C.2. Given the definition of the number of potential components KK^{*} in (C.8), the minimum of the auxiliary variables uu^{*} in (C.9), and the number of active groups KaK^{a} in (C.10), we have

  1. (i)

ui>πku_{i}>\pi_{k} for all i=1,2,,Ni=1,2,...,N and all k>Kk>K^{*};

  2. (ii)

KaKK^{a}\leq K^{*}.

Proof.
  1. (i)

By definition, u=min1iNuiu^{*}=\min\limits_{1\leq i\leq N}u_{i}; then, for i=1,2,,Ni=1,2,...,N,

uiu>1j=1Kπj=j=K+1πjπk,k>K.\displaystyle u_{i}\geq u^{*}>1-\sum_{j=1}^{K^{*}}\pi_{j}=\sum_{j=K^{*}+1}^{\infty}\pi_{j}\geq\pi_{k},\quad\forall k>K^{*}.
  2. (ii)

Let ii^{\prime} be a unit such that gi=Kag_{i^{\prime}}=K^{a}. According to the posterior of GG, group KaK^{a} exists only if ui<πKau_{i^{\prime}}<\pi_{K^{a}}; otherwise p(gi=Ka|)=0p(g_{i}=K^{a}|\cdot)=0. Then by definition,

uui<πKa1u>1πKaj=1Ka1πj.\displaystyle u^{*}\leq u_{i^{\prime}}<\pi_{K^{a}}\Rightarrow 1-u^{*}>1-\pi_{K^{a}}\geq\sum_{j=1}^{K^{a}-1}\pi_{j}.

Since KK^{*} is the smallest number such that 1u<j=1Kπj1-u^{*}<\sum\limits_{j=1}^{K^{*}}\pi_{j}, it follows that KaKK^{a}\leq K^{*}.

D.2 Connection to Lu and Leen (2004) and Lu and Leen (2007)

In this section, we will first show the close connection between the modified prior in the presence of soft constraints defined in (2.4) and the framework of penalized probabilistic clustering proposed by Lu and Leen (2004) and Lu and Leen (2007). Then we will discuss the properties of the weights WijW_{ij}.

We start with joint prior odds in (2.4):

i,j(ψij1ψij)cTijδij=i,jexp[cδijlog(ψij1ψij)Tij].\displaystyle\prod_{i,j}\left(\frac{\displaystyle\psi_{ij}}{\displaystyle 1-\psi_{ij}}\right)^{cT_{ij}\delta_{ij}}=\prod_{i,j}\exp\left[c\delta_{ij}\log\left(\frac{\displaystyle\psi_{ij}}{\displaystyle 1-\psi_{ij}}\right)^{T_{ij}}\right]. (D.1)

Define the weight as Wij=log(ψij1ψij)TijW_{ij}=\log\left(\frac{\displaystyle\psi_{ij}}{\displaystyle 1-\psi_{ij}}\right)^{T_{ij}}. Then when Tij=1T_{ij}=1, we have

Wij=log(ψij1ψij)ψij=exp(Wij)1+exp(Wij).\displaystyle W_{ij}=\log\left(\frac{\displaystyle\psi_{ij}}{\displaystyle 1-\psi_{ij}}\right)\Leftrightarrow\psi_{ij}=\frac{\exp\left(W_{ij}\right)}{1+\exp\left(W_{ij}\right)}. (D.2)

When Tij=1T_{ij}=-1, we get

Wij=log(1ψijψij)1ψij=exp(Wij)1+exp(Wij).\displaystyle W_{ij}=\log\left(\frac{\displaystyle 1-\psi_{ij}}{\displaystyle\psi_{ij}}\right)\Leftrightarrow 1-\psi_{ij}=\frac{\exp\left(W_{ij}\right)}{1+\exp\left(W_{ij}\right)}. (D.3)

Combining (D.2) and (D.3) yields that

exp(Wij)1+exp(Wij)=ψij12(1+Tij)(1ψij)12(1Tij).\displaystyle\frac{\exp\left(W_{ij}\right)}{1+\exp\left(W_{ij}\right)}=\psi_{ij}^{\frac{1}{2}(1+T_{ij})}(1-\psi_{ij})^{\frac{1}{2}(1-T_{ij})}. (D.4)

This is exactly equation (7) in Lu and Leen (2007) with γij=ψij\gamma_{ij}=\psi_{ij} and Lij=12(Tij+1)L_{ij}=\frac{1}{2}\left(T_{ij}+1\right), which uniquely defines the expression for the weight associated with each pairwise constraint given γij\gamma_{ij} and LijL_{ij}. Since both LijL_{ij} and TijT_{ij} are indicators for the type of constraint, they do not affect the formula for WijW_{ij}; thus the following formula for the weights coincides with the one used in Lu and Leen (2007), and the two frameworks coincide,

Wij\displaystyle W_{ij} ={log(ψij1ψij) if Tij=1log(1ψijψij) if Tij=10 if Tij=0.\displaystyle=\left\{\begin{array}[]{ll}\log\left(\frac{\psi_{ij}}{1-\psi_{ij}}\right)&\text{ if }T_{ij}=1\\ \log\left(\frac{1-\psi_{ij}}{\psi_{ij}}\right)&\text{ if }T_{ij}=-1\vskip 2.84544pt\\ 0&\text{ if }T_{ij}=0.\end{array}\right. (D.8)

Accordingly, the prior defined in (D.1) can be rewritten in term of WijW_{ij} as

i,j(ψij1ψij)cTijδij=i,jexp(cWijδij).\displaystyle\prod_{i,j}\left(\frac{\displaystyle\psi_{ij}}{\displaystyle 1-\psi_{ij}}\right)^{cT_{ij}\delta_{ij}}=\prod_{i,j}\exp\left(cW_{ij}\delta_{ij}\right). (D.9)

The weight WijW_{ij} associated with the constraint between units ii and jj, as in (D.8), has the following properties:

  1. (a)

    Unboundedness: Wij(,)W_{ij}\in(-\infty,\infty);

  2. (b)

    Symmetry: Wij=WjiW_{ij}=W_{ji};

  3. (c)

Sign reflects the constraint’s type: If (i,j)𝒫(i,j)\in\mathcal{P} (i.e., Tij=1T_{ij}=1), then Wij=log(ψij1ψij)>0W_{ij}=\log\left(\frac{\psi_{ij}}{1-\psi_{ij}}\right)>0; if (i,j)𝒩(i,j)\in\mathcal{N} (i.e., Tij=1T_{ij}=-1), then Wij=log(1ψijψij)<0W_{ij}=\log\left(\frac{1-\psi_{ij}}{\psi_{ij}}\right)<0; if (i,j)(i,j) is not involved in any constraint (i.e., Tij=0T_{ij}=0), then Wij=0W_{ij}=0.

  4. (d)

    Absolute value reflects constraint’s accuracy:

    e|Wij|1+e|Wij|=ψij.\displaystyle\frac{e^{\left|W_{ij}\right|}}{1+e^{\left|W_{ij}\right|}}=\psi_{ij}.

    It is straightforward to show that |Wij|\left|W_{ij}\right| is monotonically increasing in ψij\psi_{ij}.
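To make the mapping and these properties concrete, a small sketch (function name ours; ψij(1/2,1)\psi_{ij}\in(1/2,1) as assumed for informative constraints):

```python
import numpy as np

def constraint_weight(psi_ij, T_ij):
    """Weight W_ij in (D.8): psi_ij in (1/2, 1) is the constraint's
    confidence, T_ij in {1, -1, 0} its type."""
    if T_ij == 0:
        return 0.0
    logit = np.log(psi_ij / (1.0 - psi_ij))
    return logit if T_ij == 1 else -logit

for psi in (0.6, 0.9, 0.99):
    # |W_ij| grows with psi_ij; the sign tracks the constraint type
    print(psi, constraint_weight(psi, 1), constraint_weight(psi, -1))
```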

D.3 Prior Similarity Matrix

Proof of Theorem 1.

Given equations (2.14) and (2.6), the prior probability of units ii and jj being in the same group is

Pr(gi=gj|W)\displaystyle\Pr(g_{i}=g_{j}|W)
=\displaystyle=\; G𝒢ij1Mp(G)exp(cm,nWmnδmn)\displaystyle\sum_{G\in\mathcal{G}_{ij}}\frac{1}{M}p(G)\exp\left(c\sum_{m,n}W_{mn}\delta_{mn}\right)
=\displaystyle=\; G𝒢ij1MΓ(a)Γ(a+N)[k=1KaΓ(|Bk|)]exp(cm,nWmnδmn)\displaystyle\sum_{G\in\mathcal{G}_{ij}}\frac{1}{M}\frac{\Gamma(a)}{\Gamma(a+N)}\left[\prod_{k=1}^{K}a\Gamma\left(|B_{k}|\right)\right]\exp\left(c\sum_{m,n}W_{mn}\delta_{mn}\right)
=\displaystyle=\; G𝒢ijA(G)exp(cm,nWmnδmn)\displaystyle\sum_{G\in\mathcal{G}_{ij}}A(G)\exp\left(c\sum_{m,n}W_{mn}\delta_{mn}\right) (D.10)

where 𝒢ij\mathcal{G}_{ij} is the set of all possible group indices that satisfies gi=gjg_{i}=g_{j} and MM is the normalization constant in (2.6).

𝒢ij\mathcal{G}_{ij} and 𝒢ik\mathcal{G}_{ik} are closely related. It is straightforward to see that the numbers of elements in 𝒢ij\mathcal{G}_{ij} and 𝒢ik\mathcal{G}_{ik} are equal, since both equal the number of partitions of N1N-1 objects (the merged pair and the remaining N2N-2 units). Moreover, as units jj and kk are exchangeable, 𝒢ik\mathcal{G}_{ik} can be constructed from 𝒢ij\mathcal{G}_{ij} by swapping the group indices of units jj and kk.

As a result, we can find a one-to-one mapping between 𝒢ij\mathcal{G}_{ij} and 𝒢ik\mathcal{G}_{ik}. That is, for any G𝒢ijG\in\mathcal{G}_{ij}, if we swap the group indices of units jj and kk, the resulting partition sjk(G)s_{jk}(G) belongs to 𝒢ik\mathcal{G}_{ik}, and vice versa. As the constant A(G)A(G) depends only on the sizes of the groups in the partition, we have A(G)=A(sjk(G))A(G)=A(s_{jk}(G)).

These properties of 𝒢ij\mathcal{G}_{ij} and 𝒢ik\mathcal{G}_{ik} enable us to compare the summands in Pr(gi=gj|W)\Pr(g_{i}=g_{j}|W) and Pr(gi=gk|W)\Pr(g_{i}=g_{k}|W) term by term. The difference between the two probabilities is

Pr(gi=gj|W)Pr(gi=gk|W)\displaystyle\Pr(g_{i}=g_{j}|W)-\Pr(g_{i}=g_{k}|W)
=\displaystyle\;=\; G𝒢ijA(G)exp(cm,nWmnδmn)G𝒢ikA(G)exp(cm,nWmnδmn)\displaystyle\sum_{G\in\mathcal{G}_{ij}}A(G)\exp\left(c\sum_{m,n}W_{mn}\delta_{mn}\right)-\sum_{G\in\mathcal{G}_{ik}}A(G)\exp\left(c\sum_{m,n}W_{mn}\delta_{mn}\right)
=\displaystyle\;=\; G𝒢ijA(G)exp(cm,nWmnδmn)A(sjk(G))exp(cm,nWmnδmn)\displaystyle\sum_{G\in\mathcal{G}_{ij}}A(G)\exp\left(c\sum_{m,n}W_{mn}\delta_{mn}\right)-A(s_{jk}(G))\exp\left(c\sum_{m,n}W_{mn}\delta^{\prime}_{mn}\right)
=\displaystyle\;=\; G𝒢ijA(G)[exp(cm,nWmnδmn)exp(cm,nWmnδmn)].\displaystyle\sum_{G\in\mathcal{G}_{ij}}A(G)\left[\exp\left(c\sum_{m,n}W_{mn}\delta_{mn}\right)-\exp\left(c\sum_{m,n}W_{mn}\delta^{\prime}_{mn}\right)\right]. (D.11)

where δmn\delta^{\prime}_{mn} is evaluated at sjk(G)s_{jk}(G).

We can classify a group partitioning GG into two cases:

  1. (i)

G=sjk(G)G=s_{jk}(G). This happens when units jj and kk are assigned to the same group. Swapping them does not affect the group partition, which implies that m,nWmnδmn=m,nWmnδmn\sum_{m,n}W_{mn}\delta_{mn}=\sum_{m,n}W_{mn}\delta^{\prime}_{mn}.

  2. (ii)

Gsjk(G)G\neq s_{jk}(G). These are the more common cases. We again compare m,nWmnδmn\sum_{m,n}W_{mn}\delta_{mn} with m,nWmnδmn\sum_{m,n}W_{mn}\delta^{\prime}_{mn}. WmnδmnW_{mn}\delta_{mn} and WmnδmnW_{mn}\delta^{\prime}_{mn} are equal when mj,km\neq j,k and nj,kn\neq j,k, as these terms remain unchanged regardless of the group indices of units jj and kk. For m=j,km=j,k, note that δjn=δkn\delta_{jn}=\delta^{\prime}_{kn} and δkn=δjn\delta_{kn}=\delta^{\prime}_{jn} for all n=1,2,..,Nn=1,2,..,N. Therefore, under the assumption that Wjn=WknW_{jn}=W_{kn} for all nn, we have,

    n=1NWjnδjn+n=1NWknδkn=n=1NWjnδkn+n=1NWknδjn=n=1NWknδkn+n=1NWjnδjn,\displaystyle\sum_{n=1}^{N}W_{jn}\delta_{jn}+\sum_{n=1}^{N}W_{kn}\delta_{kn}=\sum_{n=1}^{N}W_{jn}\delta^{\prime}_{kn}+\sum_{n=1}^{N}W_{kn}\delta^{\prime}_{jn}=\sum_{n=1}^{N}W_{kn}\delta^{\prime}_{kn}+\sum_{n=1}^{N}W_{jn}\delta^{\prime}_{jn}, (D.12)

    and hence

    m,nWmnδmn\displaystyle\sum_{m,n}W_{mn}\delta_{mn}
    =\displaystyle\;=\; m,n(j,k)Wmnδmn+2(n=1NWjnδjn+n=1NWknδkn)\displaystyle\sum_{m,n\not\in(j,k)}W_{mn}\delta_{mn}+2\left(\sum_{n=1}^{N}W_{jn}\delta_{jn}+\sum_{n=1}^{N}W_{kn}\delta_{kn}\right)
    =\displaystyle\;=\; m,n(j,k)Wmnδmn+2(n=1NWjnδjn+n=1NWknδkn)\displaystyle\sum_{m,n\not\in(j,k)}W_{mn}\delta^{\prime}_{mn}+2\left(\sum_{n=1}^{N}W_{jn}\delta^{\prime}_{jn}+\sum_{n=1}^{N}W_{kn}\delta^{\prime}_{kn}\right)
    =\displaystyle\;=\; m,nWmnδmn,\displaystyle\sum_{m,n}W_{mn}\delta^{\prime}_{mn},

where the first and third equalities use the facts that Wmn=WnmW_{mn}=W_{nm}, δmn=δnm\delta_{mn}=\delta_{nm}, and Wnn=0W_{nn}=0 for all n,mn,m. The second equality follows from the result in (D.12).

In both cases, we have m,nWmnδmn=m,nWmnδmn\sum_{m,n}W_{mn}\delta_{mn}=\sum_{m,n}W_{mn}\delta^{\prime}_{mn} for all G𝒢ijG\in\mathcal{G}_{ij} and therefore

Pr(gi=gj|W)Pr(gi=gk|W)=0.\displaystyle\Pr(g_{i}=g_{j}|W)-\Pr(g_{i}=g_{k}|W)=0. (D.13)

D.4 PC-KMeans

Proof of Theorem 2.

We start with a brief discussion of the PC-KMeans algorithm (Basu et al., 2004a). Given a set of observations (y1,y2,,yN)\left(y_{1},y_{2},\ldots,y_{N}\right), a set of positive-link constraints 𝒫\mathcal{P}, a set of negative-link constraints 𝒩\mathcal{N}, the costs of violating the constraints ω={ωijm,ωijc}\omega=\{\omega^{m}_{ij},\omega^{c}_{ij}\}, and the number of groups KK, the PC-KMeans algorithm partitions the NN units into KK groups so as to minimize the following objective function,

L(G)=\displaystyle L(G)= 12k=1KiBkyiμk2within-cluster sum of squares+(i,j)𝒫ωijm𝟏(gigj)+(i,j)𝒩ωijc𝟏(gi=gj)cost of violation,\underbrace{\frac{1}{2}\sum_{k=1}^{K}\sum_{i\in B_{k}}\left\|y_{i}-\mu_{k}\right\|^{2}}_{\text{within-cluster sum of squares}}+\underbrace{\sum_{\left(i,j\right)\in\mathcal{P}}\omega^{m}_{ij}\mathbf{1}\left(g_{i}\neq g_{j}\right)+\sum_{\left(i,j\right)\in\mathcal{N}}\omega^{c}_{ij}\mathbf{1}\left(g_{i}=g_{j}\right)}_{\text{cost of violation}}, (D.14)

where μk\mu_{k} is the centroid of group kk, i.e., μk=1|Bk|iBkyi\mu_{k}=\frac{1}{\left|B_{k}\right|}\sum_{i\in B_{k}}y_{i}, BkB_{k} is the set of units that are assigned to group kk, and |Bk|\left|B_{k}\right| is the size of group kk. Equation (D.14) can be rewritten as

L(G)=\displaystyle L(G)=\; 12i=1Nyiμgi2i,jcWijδij+Const,\displaystyle\frac{1}{2}\sum_{i=1}^{N}\left\|y_{i}-\mu_{g_{i}}\right\|^{2}-\sum_{i,j}cW_{ij}\delta_{ij}+Const, (D.15)

where Const=c((i,j)𝒫Wij(i,j)𝒩Wij)Const=c\left(\sum_{\left(i,j\right)\in\mathcal{P}}W_{ij}-\sum_{\left(i,j\right)\in\mathcal{N}}W_{ij}\right) is a constant, cc is the scaling constant introduced in (2.4), and

Wij=\displaystyle W_{ij}\;=\; {ωijm2c if (i,j)𝒫ωijc2c if (i,j)𝒩0 otherwise.\displaystyle\begin{cases}\frac{\omega^{m}_{ij}}{2c}&\text{ if }\left(i,j\right)\in\mathcal{P}\\ -\frac{\omega^{c}_{ij}}{2c}&\text{ if }\left(i,j\right)\in\mathcal{N}\\ 0&\text{ otherwise}.\end{cases} (D.16)

The clustering process includes minimizing the objective function over both group partition GG (assignment step) and the model parameters μ={μ1,μ2,,μK}\mu=\left\{\mu_{1},\mu_{2},\ldots,\mu_{K}\right\} (update step). Next, we will show that the PC-KMeans algorithm is embodied in our proposed Gibbs sampler with soft constraints.

Under assumption (i), we can rewrite the baseline model with a set of variables zitz_{it} that do not have group-heterogeneous effects on yity_{it},

yit\displaystyle y_{it} =αgixit+βizit+εit=αgi+βizit+εit,\displaystyle=\alpha^{\prime}_{g_{i}}x_{it}+\beta^{\prime}_{i}z_{it}+\varepsilon_{it}=\alpha_{g_{i}}+\beta^{\prime}_{i}z_{it}+\varepsilon_{it},

where the second equality holds because xit=1x_{it}=1 by assumption. The βi\beta_{i} may be equal across units, i.e., βi=β\beta_{i}=\beta.

Under assumption (ii), we fix the number of groups upfront and thus do not rely on slice sampling, in which KK is unknown and determined dynamically. Hereinafter, we focus on the posterior distribution without the auxiliary variables uiu_{i}. Notice that the indicator function 𝟏(ui<πgi)\mathbf{1}(u_{i}<\pi_{g_{i}}) in the posterior density is then replaced by πgi\pi_{g_{i}}.


Part 1: Assignment Step

Assume we have soft pairwise constraints with weights specified as in (D.16). Under assumptions (iii) and (iv), the posterior density of the group membership indicators GG is,

p(G|α,β,σ2,Y,X,Z,W)\displaystyle p(G|\alpha,\beta,\sigma^{2},Y,X,Z,W)
=\displaystyle\;=\; 1ZSi=1N[p(yi|βi,αgi,σgi2,xi,zi)πgi]p(W|G)\displaystyle\frac{1}{Z_{S}}\prod_{i=1}^{N}\left[p(y_{i}|\beta_{i},\alpha_{g_{i}},\sigma^{2}_{g_{i}},x_{i},z_{i})\pi_{g_{i}}\right]p(W|G)
=\displaystyle=\; 1ZSi=1Np(yi|βi,αgi,σgi2,xi,zi)πgii,j=1Nexp(cWijσ2δij)\displaystyle\frac{1}{Z_{S}}\prod_{i=1}^{N}p(y_{i}|\beta_{i},\alpha_{g_{i}},\sigma^{2}_{g_{i}},x_{i},z_{i})\pi_{g_{i}}\prod_{i,j=1}^{N}\exp\left(\frac{cW_{ij}}{\sigma^{2}}\delta_{ij}\right)
=\displaystyle=\; 1ZSi=1N(2πσ2)T2πgiexp[12σ2y~iαgi2]i,j=1Nexp(cWijσ2δij),\displaystyle\frac{1}{Z_{S}}\prod_{i=1}^{N}(2\pi\sigma^{2})^{-\frac{T}{2}}\pi_{g_{i}}\exp\left[-\frac{1}{2\sigma^{2}}\left\|\tilde{y}_{i}-\alpha_{g_{i}}\right\|^{2}\right]\prod_{i,j=1}^{N}\exp\left(\frac{cW_{ij}}{\sigma^{2}}\delta_{ij}\right), (D.17)

where y~i=yiβizi\tilde{y}_{i}=y_{i}-\beta_{i}^{\prime}z_{i}, zi=[zi1zi2ziT]z_{i}=[z_{i1}\;z_{i2}\ldots\;z_{iT}]^{\prime} and ZSZ_{S} is the normalization constant.

Next, we define the optimal group partition GG^{*} as the minimizer of the PC-KMeans objective function defined in (D.15), with the data points given by y~i\tilde{y}_{i} and the centroids by μk=αk\mu_{k}=\alpha_{k}, that is,

G\displaystyle G^{*}\equiv argminGL(G)\displaystyle\arg\min_{G}L(G)
=\displaystyle= argminG12i=1Ny~iαgi2i,jcWijδij.\displaystyle\arg\min_{G}\frac{1}{2}\sum_{i=1}^{N}\left\|\tilde{y}_{i}-\alpha_{g_{i}}\right\|^{2}-\sum_{i,j}cW_{ij}\delta_{ij}. (D.18)

Now we consider the asymptotic behavior of the posterior probability in (D.17). We will show that as σ2\sigma^{2} goes to 0, the posterior probability of GG approaches 0 for all group partitions except GG^{*}:

limσ20p(G|ρ,β,α,Σ,Y,X,W){1 if G=G;0otherwise.\displaystyle\lim_{\sigma^{2}\to 0}p(G|\rho,\beta,\alpha,\Sigma,Y,X,W)\to\begin{cases}1&\text{ if }G=G^{*};\\ 0&otherwise.\\ \end{cases}

We start with the log posterior density of GG in (D.17),

l(G)\displaystyle l(G)\equiv\; logp(G|ρ,β,α,Σ,Y,X,W)\displaystyle\log p(G|\rho,\beta,\alpha,\Sigma,Y,X,W)
=\displaystyle=\; 12σ2i=1Ny~iαgi2+i,j=1N(cWijσ2δij)\displaystyle-\frac{1}{2\sigma^{2}}\sum_{i=1}^{N}\left\|\tilde{y}_{i}-\alpha_{g_{i}}\right\|^{2}+\sum_{i,j=1}^{N}\left(\frac{cW_{ij}}{\sigma^{2}}\delta_{ij}\right)
NT2log(2πσ2)+i=1Nlog(πgi)logZS.\displaystyle-\frac{NT}{2}\log(2\pi\sigma^{2})+\sum_{i=1}^{N}\log(\pi_{g_{i}})-\log Z_{S}. (D.19)

The difference between two log posterior probabilities evaluated at GG^{*} and any other GG is

l(G)l(G)\displaystyle l(G^{*})-l(G)
=\displaystyle=\; 1σ2[(12i=1Ny~iαgi2i,j=1NcWijδij)(12i=1Ny~iαgi2i,j=1NcWijδij)]\displaystyle\frac{1}{\sigma^{2}}\left[\left(\frac{1}{2}\sum_{i=1}^{N}\left\|\tilde{y}_{i}-\alpha_{g_{i}}\right\|^{2}-\sum_{i,j=1}^{N}cW_{ij}\delta_{ij}\right)-\left(\frac{1}{2}\sum_{i=1}^{N}\left\|\tilde{y}_{i}-\alpha_{g^{*}_{i}}\right\|^{2}-\sum_{i,j=1}^{N}cW_{ij}\delta^{*}_{ij}\right)\right]
+i=1N[log(πgi)log(πgi)].\displaystyle+\sum_{i=1}^{N}\left[\log(\pi_{g^{*}_{i}})-\log(\pi_{g_{i}})\right]. (D.20)

The first term is strictly positive by the definition of GG^{*} in (D.18). For simplicity, we denote the expression within the square brackets by VV, with V>0V>0. The second term is finite since

|i=1N[log(πgi)log(πgi)]|N|max(log(πj))min(log(πj))|<+.\displaystyle\left|\sum_{i=1}^{N}\left[\log(\pi_{g^{*}_{i}})-\log(\pi_{g_{i}})\right]\right|\leq N\left|\max(\log(\pi_{j}))-\min(\log(\pi_{j}))\right|<+\infty.

Thus, for any GGG\neq G^{*}, in the limit as σ20\sigma^{2}\to 0, we have

limσ20l(G)l(G)=limσ20Vσ2+i=1N[log(πgi)log(πgi)]=+.\displaystyle\lim_{\sigma^{2}\to 0}l(G^{*})-l(G)=\lim_{\sigma^{2}\to 0}\frac{V}{\sigma^{2}}+\sum_{i=1}^{N}\left[\log(\pi_{g^{*}_{i}})-\log(\pi_{g_{i}})\right]=+\infty. (D.21)

This indicates that

limσ20p(G|α,σ2,Y,X,Z,W)p(G|α,σ2,Y,X,Z,W)=limσ20exp[l(G)l(G)]=exp()=0.\displaystyle\lim_{\sigma^{2}\to 0}\frac{p(G|\alpha,\sigma^{2},Y,X,Z,W)}{p(G^{*}|\alpha,\sigma^{2},Y,X,Z,W)}=\lim_{\sigma^{2}\to 0}\exp\left[l(G)-l(G^{*})\right]=\exp(-\infty)=0.

We take the sum over all possible group partitions and get,

limσ20Gp(G|α,σ2,Y,X,Z,W)p(G|α,σ2,Y,X,Z,W)\displaystyle\lim_{\sigma^{2}\to 0}\frac{\sum_{G^{\prime}}p(G^{\prime}|\alpha,\sigma^{2},Y,X,Z,W)}{p(G^{*}|\alpha,\sigma^{2},Y,X,Z,W)}
=\displaystyle=\; limσ20GGp(G|α,σ2,Y,X,Z,W)+p(G|α,σ2,Y,X,Z,W)p(G|α,σ2,Y,X,Z,W)\lim_{\sigma^{2}\to 0}\frac{\sum_{G^{\prime}\neq G^{*}}p(G^{\prime}|\alpha,\sigma^{2},Y,X,Z,W)+p(G^{*}|\alpha,\sigma^{2},Y,X,Z,W)}{p(G^{*}|\alpha,\sigma^{2},Y,X,Z,W)}
=\displaystyle=\; 1.\displaystyle 1.

Since Gp(G|α,σ2,Y,X,Z,W)=1\sum_{G^{\prime}}p(G^{\prime}|\alpha,\sigma^{2},Y,X,Z,W)=1, we have

limσ20p(G|α,σ2,Y,X,Z,W)=1.\displaystyle\lim_{\sigma^{2}\to 0}p(G^{*}|\alpha,\sigma^{2},Y,X,Z,W)=1. (D.22)

Therefore, when σ20\sigma^{2}\to 0, every posterior draw of GG from the proposed Gibbs sampler is the solution to the assignment step of the PC-KMeans algorithm, conditional on the posterior draws of other parameters.


Part 2: Update Step

During Gibbs sampling, once we have performed one complete set of Gibbs moves on the group assignments and non-group-specific parameters including βi\beta_{i} and σ2\sigma^{2}, we need to sample the αk\alpha_{k} conditioned on all assignments and observations.

Let |Bk||B_{k}| be the number of units assigned to group kk; then the posterior density for αk\alpha_{k} reads as

p(αk|β,σ2,Y,X,Z)exp[12(αkμ¯αk)Σ¯αk1(αkμ¯αk)],\displaystyle p(\alpha_{k}|\beta,\sigma^{2},Y,X,Z)\;\propto\;\exp\left[-\frac{1}{2}\left(\alpha_{k}-\bar{\mu}_{\alpha_{k}}\right)^{\prime}\bar{\Sigma}_{\alpha_{k}}^{-1}\left(\alpha_{k}-\bar{\mu}_{\alpha_{k}}\right)\right], (D.23)

where

Σ¯αk\displaystyle\bar{\Sigma}_{\alpha_{k}} =(Σα1+|Bk|σ2IT)1,\displaystyle=\left(\Sigma_{\alpha}^{-1}+|B_{k}|\sigma^{-2}I_{T}\right)^{-1},
μ¯αk\displaystyle\bar{\mu}_{\alpha_{k}} =Σ¯αk(Σα1μα+σ2iBky~i),\displaystyle=\bar{\Sigma}_{\alpha_{k}}\left(\Sigma_{\alpha}^{-1}\mu_{\alpha}+\sigma^{-2}\sum_{i\in B_{k}}\tilde{y}_{i}\right),
y~i\displaystyle\tilde{y}_{i} =yiziβi.\displaystyle=y_{i}-z_{i}\beta_{i}.

We can see that the mass of the posterior distribution becomes concentrated around the posterior group mean μ¯αk\bar{\mu}_{\alpha_{k}} as σ20\sigma^{2}\to 0. Meanwhile, the posterior group mean μ¯αk\bar{\mu}_{\alpha_{k}} equals the group “sample” mean in the limit:

\lim_{\sigma^{2}\to 0}\bar{\mu}_{\alpha_{k}}=\lim_{\sigma^{2}\to 0}\left(\Sigma_{\alpha}^{-1}+|B_{k}|\sigma^{-2}I_{T}\right)^{-1}\left(\Sigma_{\alpha}^{-1}\mu_{\alpha}+\sigma^{-2}\sum_{i\in B_{k}}\tilde{y}_{i}\right)
=\lim_{\sigma^{2}\to 0}\left(\sigma^{2}\Sigma_{\alpha}^{-1}+|B_{k}|I_{T}\right)^{-1}\left(\sigma^{2}\Sigma_{\alpha}^{-1}\mu_{\alpha}+\sum_{i\in B_{k}}\tilde{y}_{i}\right)
=|B_{k}|^{-1}\sum_{i\in B_{k}}\tilde{y}_{i}.

In other words, after we determine the assignment of units to groups, we update each group mean as the “sample” mean of the units in that group, which is exactly the standard KMeans update step. Of course, we need additional steps to draw $\beta_{i}$ and $\sigma^{2}$ before updating the group means. ∎
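To make the correspondence concrete, the following minimal Python sketch (variable names are ours; the paper's programs are in Matlab) evaluates the posterior moments in (D.23) and illustrates that the update collapses to the within-group sample mean as $\sigma^{2}\to 0$. Here y_tilde is assumed to already net out $\rho y_{-1,i}$ and $x_{i}\beta_{i}$.

```python
import numpy as np

def update_group_means(y_tilde, g, K, mu_alpha, Sigma_alpha, sigma2):
    # Posterior moments from (D.23): Sigma_bar and mu_bar for each group k.
    T = y_tilde.shape[1]
    Sigma_inv = np.linalg.inv(Sigma_alpha)
    means = []
    for k in range(K):
        members = y_tilde[g == k]          # units currently assigned to group k
        Bk = members.shape[0]              # |B_k|
        Sigma_bar = np.linalg.inv(Sigma_inv + (Bk / sigma2) * np.eye(T))
        mu_bar = Sigma_bar @ (Sigma_inv @ mu_alpha + members.sum(axis=0) / sigma2)
        means.append(mu_bar)
    return np.array(means)

rng = np.random.default_rng(0)
T, N, K = 5, 60, 3
y_tilde = rng.normal(size=(N, T))          # stand-in for y_i - rho*y_{-1,i} - x_i*beta_i
g = rng.integers(0, K, size=N)
mu_alpha, Sigma_alpha = np.zeros(T), np.eye(T)
for s2 in (1.0, 1e-10):
    mu = update_group_means(y_tilde, g, K, mu_alpha, Sigma_alpha, s2)
    # as s2 -> 0, mu[k] approaches y_tilde[g == k].mean(axis=0), the KMeans update
```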

Appendix E Monte Carlo Simulation

In this section, we conduct Monte Carlo simulations to examine the performance of various constrained BGFE estimators under different data generating processes (DGPs) and prior beliefs on $G$. Two sets of DGPs are considered. The simple DGPs introduce various group patterns in the fixed-effects only. The general DGPs, on the other hand, include more covariates with group-specific slope coefficients. Such designs enable us to investigate not only how our proposed estimators perform under various DGPs with specific features, but also the accuracy of estimating the number of groups.

We consider a short-panel environment in which the sample size is $N=200$ and the time span is $T=11$. As we focus on one-step-ahead forecasts, the last observation of each unit serves as the hold-out sample for evaluation. A similar framework can be applied to $H$-step-ahead forecasts by generating additional $H$ observations. The true number of groups is set to $K_{0}=4$. Given $N$ and $K_{0}$, we divide the entire sample into $K_{0}$ balanced blocks with $N/K_{0}$ units in each block (if $N/K_{0}$ is not an integer, we assign $\lfloor N/K_{0}\rfloor$ units to each of groups $1,2,\ldots,K_{0}-1$, and the last group contains the remainder). For each DGP, 1,000 datasets are generated, and we run the block Gibbs samplers for each dataset with $M=5{,}000$ iterations after a burn-in of 5,000 draws.

E.1 Data Generating Processes

E.1.1 Simple DGPs

We begin with a simple dynamic panel data model with a group pattern in the fixed-effects and no covariates or heteroskedasticity.

DGP 1 & 2:

\displaystyle y_{it}=\alpha_{g_{i}}+\rho y_{it-1}+\varepsilon_{it}, (E.1)

where $\rho=0.7$ and $\varepsilon_{it}\sim N(0,1)$. The distributions of the initial values are selected to ensure the simulated paths are stationary. The idiosyncratic errors $\varepsilon_{it}$ are independent across $i$ and $t$ and independent of all regressors.

We assume $\alpha_{k}$ has zero mean and takes the form $\alpha_{k}=m(k-2.5)$, where $m$ controls the cross-sectional dispersion of $\alpha_{i}$. Two sets of $\alpha_{k}$ are specified: $m=1.79$, corresponding to $V_{0}=1/4$, in DGP 1, and $m=0.51$, corresponding to $V_{0}=1/50$, in DGP 2, where $V_{0}$ is the dispersion constant defined in Appendix E.1.2. The difference in $\alpha_{k}$ between these two DGPs distinguishes their properties. As depicted in Figure E.1, the group pattern is readily apparent in DGP 1: different groups of units are perfectly divided, and the simulated paths are fairly flat. DGP 2 has a less visible group structure than DGP 1 because the differences between the group means $\alpha_{k}$ are smaller, so the simulated paths are considerably noisier and fluctuate around the unconditional mean.
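As an illustration, the following sketch simulates DGPs 1 and 2. It is a minimal Python transcription under the assumption that initial values are drawn from the implied stationary distribution, which is one reading of "selected to ensure stationarity":

```python
import numpy as np

def simulate_simple_dgp(N=200, T=11, K0=4, m=1.79, rho=0.7, seed=0):
    rng = np.random.default_rng(seed)
    g = np.repeat(np.arange(K0), N // K0)                # balanced groups
    alpha = m * (np.arange(1, K0 + 1) - (K0 + 1) / 2)    # zero-mean group effects, (E.3)
    y = np.empty((N, T + 1))
    # stationary initial condition (our assumption): N(alpha/(1-rho), 1/(1-rho^2))
    y[:, 0] = rng.normal(alpha[g] / (1 - rho), np.sqrt(1 / (1 - rho**2)))
    for t in range(1, T + 1):
        y[:, t] = alpha[g] + rho * y[:, t - 1] + rng.normal(size=N)
    return y, g

y, g = simulate_simple_dgp(m=1.79)   # DGP 1; set m=0.51 for DGP 2
```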

Figure E.1: Simulated Paths for Units in Different DGPs. (a) DGP 1: Sharp Group; (b) DGP 2: Noisy Group. [Figures omitted.]

E.1.2 Details of the Simple DGPs

We start with the mean of $\alpha_{k}$. Assume the values of $\alpha_{k}$ for group $k$ take the form $\alpha_{k}=m(k-c)$, where $c$ is a shifting constant and $m$ is a scaling constant. Without loss of generality, we fix the mean of $\alpha_{k}$ at 0,

\displaystyle\sum_{k=1}^{K_{0}}\alpha_{k}=m\sum_{k=1}^{K_{0}}\left(k-c\right)=0. (E.2)

It immediately follows that $c=\frac{K_{0}+1}{2}$ and

\displaystyle\alpha_{k}=m\left(k-\frac{K_{0}+1}{2}\right),\quad\text{ for }k=1,2,\ldots,K_{0}. (E.3)

Next, $m$ is the only unknown coefficient in the DGP. To find a reasonable value for $m$, we connect it to the variance of $\alpha_{i}$. As the $\alpha_{i}$ are assumed to be identical within a group, the sample variance of $\alpha_{i}$ is given by

\displaystyle V(\alpha_{i})=\frac{1}{N}\sum_{i=1}^{N}\alpha_{i}^{2}=\frac{1}{N}\frac{N}{K_{0}}\sum_{k=1}^{K_{0}}\alpha_{k}^{2}=\frac{1}{K_{0}}\sum_{k=1}^{K_{0}}\alpha_{k}^{2}.

Plugging in the expression for $\alpha_{k}$ in (E.3), we have

\displaystyle V(\alpha_{i};m,K_{0})=\frac{m^{2}}{K_{0}}\sum_{k=1}^{K_{0}}\left(k-\frac{K_{0}+1}{2}\right)^{2}. (E.4)

To make the DGPs comparable as more groups are considered, we assume $V(\alpha_{i};m,K_{0})$ is monotonically increasing in $K^{2}_{0}$, i.e., $V(\alpha_{i};m,K_{0})=V_{0}K^{2}_{0}$ for some constant $V_{0}$. As a result, we can deduce the value of $m$ from (E.4),

\displaystyle m(K_{0},V_{0})=\left[\frac{V_{0}K_{0}^{3}}{\sum_{k=1}^{K_{0}}\left(k-\frac{K_{0}+1}{2}\right)^{2}}\right]^{\frac{1}{2}}. (E.5)

It is straightforward to see that $V_{0}$ controls the dispersion of the underlying DGP: a larger $V_{0}$ indicates that the $\alpha_{k}$ are more separated, and hence the group pattern becomes sharper, and vice versa.
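A short sketch makes the mapping from $(K_{0},V_{0})$ to $m$ in (E.5) explicit; the printed values recover the constants used in DGPs 1 and 2:

```python
import numpy as np

def scale_constant(K0, V0):
    # m from (E.4) with the normalization V(alpha_i; m, K0) = V0 * K0**2
    k = np.arange(1, K0 + 1)
    return np.sqrt(V0 * K0**3 / np.sum((k - (K0 + 1) / 2) ** 2))

print(scale_constant(4, 1 / 4))    # ~1.79, DGP 1
print(scale_constant(4, 1 / 50))   # ~0.51, DGP 2
```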

E.1.3 General DGPs

The general DGP is based on the dynamic panel data model specified in (2.1) with an exogenous predictor $z_{it}$ that has a common effect for all units. This DGP incorporates group heterogeneity in the fixed-effects, the lagged term $x^{(1)}_{it}=y_{it-1}$, and an exogenous predictor $x^{(2)}_{it}$, as well as in the error variance $\sigma^{2}_{g_{i}}$.

DGP 3:

\displaystyle y_{it}=\alpha_{g_{i}}^{\prime}x_{it}+\gamma z_{it}+\sigma_{g_{i}}\varepsilon_{it}, (E.6)

where $x_{it}=[1,x^{(1)}_{it},x^{(2)}_{it}]^{\prime}$, $\gamma=1.5$, $y_{i0}\sim N(0,1)$ and $\varepsilon_{it}\sim N(0,1)$. For each $i$, the initial value is specified to guarantee that the time series $(y_{i0},y_{i1},\ldots,y_{iT})$ is strictly stationary. We assume there are $K_{0}=4$ balanced groups, with the true grouped coefficients summarized in Table E.1. The AR(1) coefficients represent different degrees of persistence. The exogenous variable $x^{(2)}_{it}$ is drawn from $N(0,1)$ and $z_{it}$ is drawn from $\mathrm{Gamma}(1,1)$, capped at 10.

Table E.1: True Grouped Coefficients in the General DGP
$\alpha_{0,k}$ $\alpha_{1,k}$ $\alpha_{2,k}$ $\sigma_{k}^{2}$
(FE) (lagged) (exo var.) (variance)
Group 1 -0.15 0.4 0.16 0.500
Group 2 -0.05 0.8 0.14 0.375
Group 3 0.05 0.5 0.12 0.250
Group 4 0.15 0.7 0.10 0.125

E.2 Construction of Soft Pairwise Constraints

Number of constraints: We set the numbers of constraints $N_{PL}$ and $N_{NL}$ as functions of $N$ and $K_{0}$ to facilitate performance comparisons across different settings and to ensure that the information in the constraints does not vanish as $N$ increases. Specifically, $N_{PL}$ and $N_{NL}$ are a predetermined proportion of the total number of correct constraints of each type, which are given by

N^{*}_{PL}(N,K)=KC^{2}_{N/K}=K\frac{N/K(N/K-1)}{2}=\frac{N(N-K)}{2K}, (E.7)
N^{*}_{NL}(N,K)=\left(\frac{N}{K}\right)^{2}C^{2}_{K}=\left(\frac{N}{K}\right)^{2}\frac{K(K-1)}{2}=\frac{N^{2}(K-1)}{2K}. (E.8)

In the setting with $(N,K_{0})=(200,4)$, we have $N^{*}_{PL}(200,4)=4{,}900$ and $N^{*}_{NL}(200,4)=15{,}000$. We randomly select 5% of these constraints, leading to $N_{PL}=245$ and $N_{NL}=750$.

Type of pairwise constraints: The pairwise constraints are generated randomly. Given the number of PL constraints $N_{PL}$, each PL constraint is generated by randomly selecting a group and uniformly selecting two units within that group to be positive-linked. Similarly, for each of the $N_{NL}$ NL constraints, two distinct groups are chosen at random and one unit is randomly selected from each. The remaining unselected units are unrestricted, and we impose no prior belief on them.

Accuracy of pairwise constraints: Each constraint is annotated with a level of accuracy $\psi$ generated from a transformed Beta distribution defined on $[0.5,1]$. We begin by drawing $\nu$ from a Beta distribution: if the constraint is correct, $\nu\sim\operatorname{Beta}(3,2)$; otherwise, $\nu\sim\operatorname{Beta}(2,3)$. The level of confidence is then $\psi=\frac{\nu}{2}+0.5$, so that its domain is $[0.5,1]$. We construct $\psi$ in this manner to reflect the assumption that an expert should have less certainty in erroneous constraints than in correct ones.

Perturbation in pairwise constraints: To examine the performance with soft constraints under inaccurate prior beliefs, we artificially add errors to the randomly generated constraints. A fraction $e$ of the constraints are mislabeled: a positive-link is mislabeled as a negative-link and vice versa. We turn $eN_{PL}$ true PL constraints into NL and $eN_{NL}$ true NL constraints into PL, with $e=20\%$.

All DGPs are equipped with the same set of pairwise constraints, i.e., we draw the pairwise constraints and construct the weights only once.
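The following sketch summarizes the constraint-generation recipe above. It is an illustrative Python implementation (variable names are ours), and it flips a fraction $e$ of all constraints rather than exactly $e$ of each type, a minor simplification:

```python
import numpy as np

def draw_soft_constraints(g, K0, frac=0.05, e=0.20, seed=0):
    rng = np.random.default_rng(seed)
    N = len(g)
    n_pl = int(frac * N * (N - K0) / (2 * K0))      # 5% of N*_PL in (E.7)
    n_nl = int(frac * N**2 * (K0 - 1) / (2 * K0))   # 5% of N*_NL in (E.8)
    pairs, labels = [], []
    for _ in range(n_pl):                            # positive links within a group
        k = rng.integers(K0)
        i, j = rng.choice(np.flatnonzero(g == k), size=2, replace=False)
        pairs.append((i, j)); labels.append(+1)
    for _ in range(n_nl):                            # negative links across two groups
        k1, k2 = rng.choice(K0, size=2, replace=False)
        pairs.append((rng.choice(np.flatnonzero(g == k1)),
                      rng.choice(np.flatnonzero(g == k2)))); labels.append(-1)
    labels = np.array(labels)
    flip = rng.random(len(labels)) < e               # mislabel a fraction e
    labels[flip] *= -1
    # confidence psi in [0.5, 1]: Beta(3,2) if correct, Beta(2,3) if mislabeled
    nu = np.where(flip, rng.beta(2, 3, len(labels)), rng.beta(3, 2, len(labels)))
    return np.array(pairs), labels, nu / 2 + 0.5
```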

E.3 Alternative Estimators

We explore various types of estimators that differ in the prior belief on $G$.

(i) BGFE: The baseline Bayesian grouped fixed-effects (BGFE) estimator is correctly specified, i.e., it assumes that the true model exhibits time-invariant grouped heterogeneity and that the variance of the error term is constant (varying) across units in the simple (general) DGPs. No prior belief on $G$ is available to this estimator.

(ii) BGFE-cstr: The baseline BGFE estimator that takes the pairwise constraints into consideration.

(iii) BGFE-oracle: A variant of the BGFE estimator equipped with the known true $G$.

We also evaluate other Bayesian estimators with different prior assumptions on $\alpha_{i}$ that do not model the group structure.

(iv) Pooled: The Bayesian pooled estimator views $\alpha_{i}$ as a common parameter, so all units share the same prior level of $\alpha_{i}$.

(v) Flat: The flat-prior estimator assumes $p(\alpha_{i})\propto 1$. There is no pooling across units in this case, and the $\alpha_{i}$ are estimated individually from each unit's own history. This amounts to sampling from a posterior whose mode is the MLE.

E.4 Posterior Predictive Densities and Performance Evaluation

E.4.1 Posterior Predictive Densities

Given $S$ posterior draws, the posterior predictive distribution estimated from the MCMC draws is

\displaystyle\hat{p}(y_{iT+1}|Y,X)=\frac{1}{S}\sum_{j=1}^{S}\left[\sum_{k=1}^{K^{(j)}(G)}\mathbf{1}(g_{i}=k)p\left(y_{iT+1}|Y,X,\Theta^{(j)}\right)\right]. (E.9)

We can therefore draw samples from $\hat{p}(y_{iT+1}|Y,X)$ by simulating (2.1) forward conditional on the posterior draws of $\Theta$ and the observations.
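A minimal sketch of this forward simulation, assuming the posterior draws are stored in arrays with one entry per Gibbs iteration (the array layout is hypothetical, and the error variance is a scalar here; grouped variances work the same way):

```python
import numpy as np

def predictive_draws(alpha_draws, g_draws, sigma2_draws, x_next, seed=0):
    # alpha_draws[j]: (K_j x P) group coefficients at Gibbs iteration j
    # g_draws[j]:     (N,) group indices; sigma2_draws[j]: error variance
    # x_next:         (N x P) regressors at the forecast origin
    rng = np.random.default_rng(seed)
    S, N = len(g_draws), x_next.shape[0]
    y_pred = np.empty((S, N))
    for j in range(S):
        mean = np.einsum('ip,ip->i', alpha_draws[j][g_draws[j]], x_next)
        y_pred[j] = mean + np.sqrt(sigma2_draws[j]) * rng.normal(size=N)
    return y_pred  # row j holds one draw from p_hat(y_{iT+1}|Y,X) per unit
```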

E.4.2 Point Forecasts

The optimal forecast under a quadratic loss function is obtained by minimizing the posterior risk, which yields the posterior mean. Conditional on the posterior draws of the parameters, the mean forecast can be approximated by Monte Carlo averaging,

\displaystyle\hat{y}_{i,T+1|T}\approx\frac{1}{S}\sum_{j=1}^{S}\hat{y}_{iT+1|T}^{(j)}=\frac{1}{S}\sum_{j=1}^{S}\hat{\alpha}^{(j)^{\prime}}_{g_{i}}x_{iT+1}, (E.10)

and the RMSFE across units is given by

\displaystyle RMSFE_{T+1}=\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_{i,T+1}-\hat{y}_{i,T+1}\right)^{2}}. (E.11)

E.4.3 Set Forecasts

We construct set forecasts $CS_{iT+1}$ from the posterior predictive distribution of each unit. In particular, we adopt a Bayesian approach and report the highest posterior density interval (HPDI), which is the narrowest connected interval with coverage probability $1-\alpha$. In other words, it requires that the probability of $y_{iT+1}\in CS_{iT+1}$, conditional on having observed the history $Y$, be at least $1-\alpha$, i.e.,

\displaystyle P(y_{iT+1}\in CS_{iT+1})\geq 1-\alpha,\quad\text{for all }i, (E.12)

and this interval is the shortest among all connected candidate sets. Let $\delta_{i}^{l}$ be the lower bound and $\delta_{i}^{u}$ the upper bound; then $CS_{iT+1}=\left[\delta_{i}^{l},\delta_{i}^{u}\right]$.

The assessment of set forecasts in simulation studies and empirical applications is based on two metrics: (1) the cross-sectional coverage frequency,

\displaystyle Cov_{T+1}=\frac{1}{N}\sum_{i=1}^{N}\mathbf{1}\left(y_{iT+1}\in CS_{iT+1}\right), (E.13)

and (2) the average length of the sets $CS_{iT+1}$,

\displaystyle AvgL_{T+1}=\frac{1}{N}\sum_{i=1}^{N}\left(\delta_{i}^{u}-\delta_{i}^{l}\right). (E.14)
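The following sketch computes the HPDI from predictive draws and then evaluates (E.13) and (E.14). It is an illustrative implementation, assuming y_pred stores one column of predictive draws per unit:

```python
import numpy as np

def hpdi(draws, alpha=0.05):
    # Shortest connected interval containing a (1-alpha) share of the draws.
    x = np.sort(draws)
    m = int(np.ceil((1 - alpha) * len(x)))
    widths = x[m - 1:] - x[:len(x) - m + 1]   # all candidate interval lengths
    lo = np.argmin(widths)
    return x[lo], x[lo + m - 1]

def set_forecast_metrics(y_pred, y_real, alpha=0.05):
    # Coverage frequency (E.13) and average length (E.14) across units.
    bounds = np.array([hpdi(y_pred[:, i], alpha) for i in range(y_pred.shape[1])])
    cov = np.mean((y_real >= bounds[:, 0]) & (y_real <= bounds[:, 1]))
    avg_len = np.mean(bounds[:, 1] - bounds[:, 0])
    return cov, avg_len
```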

E.4.4 Density Forecasts

To compare the performance of the density forecasts across estimators, we examine the continuous ranked probability score (CRPS) (Matheson and Winkler, 1976; Hersbach, 2000) across units. The CRPS is frequently used to assess the relative accuracy of two probabilistic forecasting models. It is a quadratic measure of the difference between the predictive cumulative distribution function, $F^{T+1|T}_{i}(y)$, and the empirical CDF of the observation:

CRPS_{T+1}=\frac{1}{N}\sum_{i=1}^{N}CRPS(F^{T+1|T}_{i},y_{iT+1})
=\frac{1}{N}\sum_{i=1}^{N}\int_{-\infty}^{\infty}\left[F^{T+1|T}_{i}(y)-\mathbf{1}\left(y_{iT+1}\leq y\right)\right]^{2}dy, (E.15)

where $y_{iT+1}$ is the realization at $T+1$.

In practice, the true predictive cumulative distribution function $F^{T+1|T}_{i}(y)$, and hence the PIT of $y_{iT+1}$, is not available. We approximate it via the empirical distribution function for each unit based on the posterior draws from the predictive density,

\displaystyle\hat{F}^{T+1|T}_{i}(y)=\frac{1}{S}\sum_{j=1}^{S}\mathbf{1}\left(y^{(j)}_{iT+1|T}\leq y\right). (E.16)

Based on the sorted posterior draws $\widetilde{y}_{iT+1}^{(j)}$, we can calculate the CRPS using the representation of Laio and Tamea (2007),

\displaystyle CRPS\left(\hat{F}^{T+1|T}_{i},y_{iT+1}\right)=\frac{2}{S^{2}}\sum_{j=1}^{S}\left(\widetilde{y}_{iT+1|T}^{(j)}-y_{iT+1}\right)\left(S\,\mathbf{1}\left\{y_{iT+1}<\widetilde{y}_{iT+1|T}^{(j)}\right\}-j+\frac{1}{2}\right). (E.17)
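A compact sketch of (E.17) for a single unit, assuming draws holds the $S$ posterior predictive draws:

```python
import numpy as np

def crps(draws, y):
    # CRPS via the Laio and Tamea (2007) representation with sorted draws.
    x = np.sort(draws)
    S = len(x)
    j = np.arange(1, S + 1)
    return (2 / S**2) * np.sum((x - y) * (S * (y < x) - j + 0.5))

# average CRPS across units, as in (E.15):
# crps_avg = np.mean([crps(y_pred[:, i], y_real[i]) for i in range(N)])
```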

Moreover, we report the average log predictive score (LPS) to assess the performance of the density forecast from the perspective of the probability density function. As suggested in Geweke and Amisano (2010), the LPS for a panel reads as

\displaystyle LPS_{T+1}=-\frac{1}{N}\sum_{i=1}^{N}\ln\int p\left(y_{iT+1}|Y,X,\Theta\right)p(\Theta|Y,X)d\Theta, (E.18)

where the expectation can be approximated using posterior draws,

\displaystyle\int p\left(y_{iT+1}|Y,X,\Theta\right)p(\Theta|Y,X)d\Theta\approx\frac{1}{S}\sum_{j=1}^{S}p\left(y_{iT+1}|Y,X,\Theta^{(j)}\right). (E.19)

E.5 Simulation Results

E.5.1 Simple Dynamic Panel Data

To evaluate the advantage of pooling units into groups, we report the RMSE, bias, standard deviation, average length of the 95% credible set, and frequentist coverage of the posterior estimate of $\rho$ across Monte Carlo repetitions. For the fixed-effects $\alpha$, we only present the average bias, as they are not of primary interest in most empirical studies.

The comparison across alternative estimators is shown in Table E.2. In DGP 1, the BGFE-cstr and BGFE estimators are as accurate as the oracle estimator. This is not surprising because the units are well-separated by design, and the data provide sufficient information for the BGFE estimator to determine the group pattern. In this situation, prior knowledge of $G$ or of the true group indices has only a marginal influence. The pooled estimator, on the other hand, erroneously pools all groups together, resulting in inaccurate estimates of $\alpha_{i}$ and $\rho$. Although the flat estimator treats units separately, it is still inferior to the BGFE-type estimators, because it cannot utilize cross-sectional information to estimate the parameters in this short panel and hence bears a much larger bias.

In DGP 2, where the group pattern is less apparent, the BGFE-cstr estimator is arguably the most accurate. In contrast to the standard BGFE estimator, it uses both the cross-sectional data and the pairwise constraints to determine the group pattern. These properties substantially reduce the biases of $\hat{\beta}$ and $\hat{\alpha}_{i}$, enabling the BGFE-cstr estimator to outperform the unconstrained estimator by a significant margin and to perform comparably to the oracle estimator. Recall that we manually add 20% incorrect constraints to the prior knowledge. Despite the presence of these misspecified constraints, the BGFE-cstr estimator is still able to extract relevant information from the constraints and enhance overall performance. The BGFE estimator, however, is unable to correctly reconstruct the group structure from the noisy data, which results in the algorithm improperly grouping the units and hence generating inaccurate estimates.

Table E.2: Monte Carlo: Estimates, Soft Constraint
$\hat{\rho}$ $\hat{\alpha}_{i}$ Group
RMSE Bias Std AvgL Cov Bias Avg K
DGP 1 BGFE-oracle 0.0104 0.0037 0.0072 0.0276 0.92 0.0371 4
BGFE-cstr 0.0102 0.0030 0.0072 0.0282 0.94 0.0369 4.92
BGFE 0.0103 0.0037 0.0071 0.0274 0.92 0.0377 4.4
Pooled 0.3543 0.3543 0.0032 0.0125 0 1.7889 -
Flat 0.1713 0.1711 0.0073 0.0283 0 0.8668 -
DGP 2 BGFE-oracle 0.0186 0.0030 0.0137 0.0527 0.95 0.0235 4
BGFE-cstr 0.0202 0.0058 0.0143 0.0557 0.93 0.0373 5.06
BGFE 0.0546 0.0443 0.0212 0.0809 0.66 0.1357 4.78
Pooled 0.2920 0.2919 0.0077 0.0298 0 0.5060 -
Flat 0.1170 0.0834 0.0131 0.0509 0.14 0.2344 -

Table E.3 summarizes the prediction performance of each estimator. In general, the conclusions for the one-step-ahead forecasts agree with those for the estimation. In DGP 1, the performance of the three BGFE estimators is quite similar, followed by the flat and pooled estimators. In DGP 2, the BGFE-cstr estimator, which utilizes the prior belief on $G$, beats the other feasible estimators in point, set, and density forecasts and is comparable to the oracle estimator.

Table E.3: Monte Carlo: Forecast, Soft Constraint
Point Forecast Set Forecast Density Forecast
RMSFE Error Std AvgL Cov LPS CRPS
DGP 1 BGFE-oracle 0.4989 0.0001 0.4989 1.9627 0.95 0.7254 0.2818
BGFE-cstr 0.4991 0.0004 0.4990 1.9666 0.95 0.7256 0.2819
BGFE 0.4990 0.0001 0.4990 1.9616 0.95 0.7255 0.2818
Pooled 0.6401 0.0006 0.6404 3.0114 0.98 1.0064 0.3657
Flat 0.5620 0.0003 0.5622 2.4265 0.97 0.8544 0.3184
DGP 2 BGFE-oracle 0.4990 0.0001 0.4989 1.9629 0.95 0.7254 0.2819
BGFE-cstr 0.5021 0.0001 0.5021 1.9790 0.95 0.7314 0.2836
BGFE 0.5186 0.0002 0.5187 2.0546 0.95 0.7633 0.2930
Pooled 0.5396 0.0005 0.5395 2.2444 0.96 0.8079 0.3052
Flat 0.5286 0.0002 0.5287 2.1165 0.95 0.7841 0.2987

E.5.2 General Panel Data

As the number of parameters increases for DGP 3, we present the RMSE and absolute bias of $\alpha_{g_{i}}=[\alpha_{1,g_{i}}\;\alpha_{2,g_{i}}\;\alpha_{3,g_{i}}]^{\prime}$ and $\gamma$, as well as metrics for point and density prediction. In addition, all BGFE estimators now account for heteroskedasticity because the cross-sectional variance in DGP 3 is informative about the group structure. As a result, we have the BGFE-he-oracle, BGFE-he-cstr, and BGFE-he estimators in this exercise, where "he" denotes heteroskedasticity. For comparison, we also include the BGFE-ho-cstr estimator, which assumes homoskedasticity, and the Flat-he estimator, which is the heteroskedastic flat estimator.

Table E.4: Results for Estimation, Point Forecast and Estimated KK
Estimates Forecast Group
$R(\hat{\alpha}_{0})$ $B(\hat{\alpha}_{0})$ $R(\hat{\alpha}_{1})$ $B(\hat{\alpha}_{1})$ $R(\hat{\alpha}_{2})$ $B(\hat{\alpha}_{2})$ $R(\hat{\gamma})$ $B(\hat{\gamma})$ RMSFE LPS AvgK PctK
Flat-he 0.258 0.199 0.131 0.098 0.200 0.149 0.026 0.014 0.667 1.108 - -
BGFE-he-oracle 0.126 0.137 0.092 0.101 0.119 0.132 0.569 0.602 0.840 0.706 4 1
BGFE-he-cstr 0.171 0.164 0.290 0.179 0.125 0.135 0.566 0.608 0.847 0.716 4.087 0.914
BGFE-ho-cstr 0.218 0.198 0.328 0.220 0.140 0.144 0.625 0.671 0.850 0.769 4.342 0.682
BGFE-he 0.303 0.317 0.560 0.482 0.137 0.147 0.601 0.640 0.871 0.756 3.575 0.534
Pooled 0.444 0.503 1.262 1.527 0.131 0.148 0.734 0.805 0.993 0.910 - -

Notes: The first line gives the levels of each metric for the Flat-he estimator, which is the benchmark model; the following lines in the columns headed "Estimates" and "Forecast" present ratios of the respective method relative to the Flat-he estimator. In the columns headed "Group", we show the average number of groups (AvgK) and the percentage of iterations in which the posterior sampler selects $K_{0}$ (PctK), averaged over 1,000 runs of the algorithm. $R(\cdot)$ is the RMSE of the posterior mean estimator. $B(\cdot)$ is the absolute bias of the posterior mean estimator.

Table E.4 presents the relative performance of estimation and forecasting for DGP 3. The benchmark model is Flat-he. Several findings arise. First, the gain from incorporating pairwise constraints is evident: comparing BGFE-he-cstr to BGFE-he, it reduces the RMSE and bias for all parameters and improves both point and density forecasts. The percentage of Gibbs iterations that visit the true number of groups $K_{0}$ grows considerably, from 53.4% to 91.4%. Even when pairwise constraints are taken into account, this percentage is only 68.2% if heteroskedasticity is ignored. Second, when we include the prior belief on $G$, the improvement in $\alpha_{1,g_{i}}$, the AR coefficient, is the greatest among the three grouped coefficients, with a bias reduction of more than 60%. This suggests that the AR coefficient may be more sensitive to the estimated group structure. Third, BGFE-he-cstr and BGFE-ho-cstr have comparable RMSFE values, but BGFE-he-cstr has a significantly lower LPS, showing that modeling heteroskedasticity in the current setting is favorable for the density forecast. The empirical results below confirm this finding. Lastly, all BGFE-type estimators generate similar estimates for the exogenous variables that do not have grouped effects on $y_{it}$, as the improvements in $\alpha_{2,g_{i}}$ and $\gamma$ are marginal when the prior belief on groups is included or when the true group structure is imposed.

E.5.3 Computational Time

We compare the running times of each estimator using the simulation with the general DGPs. For each estimator, we run the block Gibbs samplers with 5,000 iterations after a burn-in of 5,000 draws. We calculate the running time by averaging over 1,000 repetitions. All programs, including this exercise, the simulations above, and the empirical analysis in Section 4 were executed in Matlab 2020a running on a server with Intel Skylake CPUs and 32 GB of RAM.

Table E.5 shows the running time in seconds. It suggests that modeling the group structure does not considerably increase the computational load. The running time of the BGFE-he estimator is comparable to that of the Flat estimator when the number of observations is small. As $T$ increases, the BGFE-he estimators take slightly more time than the flat estimator. The BGFE-he-cstr estimator requires approximately 20% more time than the BGFE-he estimator, because BGFE-he-cstr incorporates the pairwise constraints and has an additional term to evaluate in the posterior distribution of the group indices. Notice that we fix the scaling constant $c$ in this experiment. If we instead find the optimal $c$ using a grid search, as proposed in Section 4.1.2, the running time of BGFE-he-cstr will be $m$ times as long as what is shown below, where $m$ is the size of the grid.

Table E.5: Running Time in Seconds
N T BGFE-he-cstr BGFE-he Flat Pooled
100 5 74.9 62.1 1.2 61.7
10 82.5 69.8 1.3 64.7
20 86.1 75.7 1.3 62.4
200 5 145.0 111.9 1.2 114.3
10 146.2 118.8 1.3 113.0
20 159.4 133.2 1.5 117.0

It is possible to speed up the algorithm, especially when selecting the optimal value of $c$. Wang and Dunson (2011) propose the sequential updating and greedy search (SUGS) algorithm, which essentially simplifies Algorithm 1 via approximation. The SUGS algorithm allows fast yet accurate approximate Bayesian inference under DP priors with just a single cycle of simple calculations for each unit, while also producing marginal likelihood estimates to be used in selecting the constant $c$. We leave this for future study.

Appendix F Data Description

F.1 Inflation of the U.S. CPI Sub-Indices

The seasonally adjusted series of the CPI for All Urban Consumers (CPI-U) for subcategories at all display levels are obtained from the BLS Online Databases (https://data.bls.gov/PDQWeb/cu). The raw data contain 318 series, which are recorded on a monthly basis and span the period from January 1947 to August 2022. Notice that the raw dataset does not include an indicator for the expenditure categories. We manually merge the raw dataset with the table of contents of CPI entry-level items (https://www.bls.gov/cpi/additional-resources/entry-level-item-descriptions.xlsx) by entry-level item (ELI) code, the series description, and universal classification codes (UCC), if necessary. (Some series are labeled by UCC rather than ELI; the concordance provided by the BLS can be found at https://www.bls.gov/cpi/additional-resources/ce-cpi-concordance.htm.)

Series can enter and exit the sample. The BLS discontinues and launches series on a regular basis owing to changes in source data and methodology; see, for example, the BLS post on updates to series since 2017. The measurement of certain subcategories was impacted by the pandemic and is hence missing: since the pandemic, the related activities and venues (sports events, bars, schools) were canceled or closed temporarily, affecting series such as admission to sporting events (SS62032), distilled spirits away from home (SS20053), food at elementary and secondary schools (SSFV031A), etc. We chose not to impute the missing values since there was no clear benchmark to compare with, especially given the depressed economic conditions.

The CPI-U consists of eight major expenditure categories: (1) Apparel; (2) Education and Communication; (3) Food and Beverages; (4) Housing; (5) Medical Care; (6) Recreation; (7) Transportation; (8) Other Goods and Services. Each major category contains multiple subcategories, resulting in a hierarchy of categories with increasing specificity. The BLS provides a detailed table (https://download.bls.gov/pub/time.series/cu/cu.item) that records the series code, series name, and display level. We resort to the display level to build the tree structure of the CPI sub-indices and eliminate the parent nodes, as illustrated in Figure F.1.

Figure F.1: Hierarchical Structure of CPI: Eliminating Parent Nodes

All Items
  Housing
    Shelter: Rent of primary residence; Owners' equivalent rent of residences
    Fuels and utilities: Water sewer and trash; Household energy
      Fuel oil and other fuels: Fuel oil; Propane kerosene and firewood
      Energy services: Electricity; Utility gas service
  Transportation […]
  Food and beverages […]

Because all CPI data are available on a monthly basis, we use the unemployment gap as the labor market slack measure in the Phillips curve model. We use the seasonally adjusted unemployment rate from FRED (https://fred.stlouisfed.org/series/UNRATE) and construct the “gap” measure using the Hamilton filter (Hamilton, 2018). The Hamilton filter has two parameters: the number of lags $p$ and the number of lookahead periods $h$. For monthly time series we follow Hamilton's suggestion and set $h=24$ and $p=12$, i.e., an AR(12) process, additionally lagged by 24 lookahead periods.
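For concreteness, a minimal sketch of the regression version of the Hamilton filter with $(h,p)=(24,12)$; the function name and interface are ours:

```python
import numpy as np

def hamilton_filter_gap(y, h=24, p=12):
    # Regress y_t on a constant and y_{t-h}, ..., y_{t-h-p+1};
    # the residual is the cyclical component ("gap").
    y = np.asarray(y, dtype=float)
    T = len(y)
    t = np.arange(h + p - 1, T)                        # usable observations
    X = np.column_stack([np.ones(len(t))] + [y[t - h - lag] for lag in range(p)])
    beta, *_ = np.linalg.lstsq(X, y[t], rcond=None)
    gap = np.full(T, np.nan)                           # NaN for the first h+p-1 periods
    gap[t] = y[t] - X @ beta
    return gap

# e.g., gap = hamilton_filter_gap(unrate, h=24, p=12) for the monthly UNRATE series
```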

F.2 Income and Democracy

All data in this section are taken from the replication files of BM (https://www.dropbox.com/s/ssjabvc2hxa5791/Bonhomme_Manresa_codes.zip?dl=0). The dataset contains a balanced panel of 89 countries over 7 periods at five-year intervals over 1970-2000. The main measure of democracy is the Freedom House Political Rights Index. A country receives the highest score if political rights come closest to the ideals suggested by a checklist of questions, beginning with whether there are free and fair elections, whether those who are elected rule, whether there are competitive parties or other political groupings, whether the opposition plays an important role and has actual power, and whether minority groups have reasonable self-government or can participate in the government through informal consensus. See Acemoglu et al. (2008), Section 1, for more details.

Table F.1 contains descriptive statistics for the main variables. The sample period is 1970-2000, and each observation corresponds to a five-year interval. The table shows these statistics for all countries and also for high- and low-income countries, split according to the median of the countries' average income. The comparison of high- and low-income countries in the middle and lower panels reveals the pattern that richer countries tend to be more democratic.

Table F.1: Summary Statistics for the Democracy Data Set
Mean Median S.E. Min Max
Full Sample
Democracy index 0.5535 0.5000 0.3727 0 1.0000
GDP per capita (in log) 8.2534 8.2444 1.0763 5.7739 10.4450
High-Income
Democracy index 0.7852 1.0000 0.2934 0 1.0000
GDP per capita (in log) 9.1490 9.1975 0.6079 7.4970 10.4450
Low-Income
Democracy index 0.3247 0.1667 0.2973 0 1.0000
GDP per capita (in log) 7.3576 7.3208 0.6051 5.7739 8.81969

Appendix G Additional Empirical Results

G.1 Inflation of the U.S. CPI Sub-Indices

Figure G.1 shows the posterior mean of the number of active groups (red) along with the total number of available CPI sub-indices (dark blue). Note that we choose not to show the credible set or the distribution of $K$ because the distributions are concentrated around a single number in most samples.

Figure G.1: Number of Active Groups, BGFE-he-cstr

Figure G.2 shows the selected scaling constant $c$ over time.

Figure G.2: Scaling Constant, BGFE-he-cstr

Tables G.1 and G.2 present the ratio of the RMSE between each estimator and AR-he for each expenditure category over six periods: 1995-1999, 2000-2004, 2005-2009, 2010-2014, 2015-2019, and 2020-2022. Similar results for the difference in LPS are shown in Tables G.3 and G.4.

Table G.1: Relative RMSE by Expenditure Category and Period
Full 95 - 99 00 - 04 05 - 09 10 - 14 15 - 19 20 - 22
Average of All Series
BGFE-he-cstr 0.97 0.97 0.97 0.99 0.96 0.97 0.94
BGFE-he 0.98 0.98 0.98 1.00 0.97 0.98 0.95
BGFE-ho 1.00 1.04 0.98 1.00 0.98 0.99 1.00
AR-he-PC 1.01 1.02 1.03 0.99 1.00 1.01 1.04
Pooled 1.00 1.04 0.98 1.03 1.01 0.99 0.94
Category 1: Apparel
BGFE-he-cstr 0.97 0.98 0.99 0.97 0.96 0.98 0.91
BGFE-he 0.97 0.98 0.99 0.98 0.97 0.98 0.92
BGFE-ho 0.98 0.99 1.00 0.98 0.98 0.99 0.93
AR-he-PC 1.02 1.02 1.00 1.03 1.01 1.01 1.08
Pooled 0.99 1.00 0.97 0.99 1.03 1.03 0.90
Category 2: Education and Communication
BGFE-he-cstr 0.96 0.93 0.99 0.95 0.94 0.97 0.93
BGFE-he 0.95 0.95 1.00 0.94 0.95 0.97 0.92
BGFE-ho 1.08 2.18 1.02 1.00 1.00 0.97 0.91
AR-he-PC 1.01 1.03 0.99 1.00 0.95 1.01 1.08
Pooled 1.16 2.24 1.21 1.05 1.12 1.03 0.97
Category 3: Food and Beverages
BGFE-he-cstr 0.98 0.99 0.99 1.00 0.97 0.98 0.95
BGFE-he 0.98 0.99 1.00 1.00 0.97 0.98 0.95
BGFE-ho 0.99 1.03 0.99 0.97 0.97 0.98 0.95
AR-he-PC 1.00 1.01 1.02 0.99 1.00 1.01 0.98
Pooled 1.01 1.03 1.01 1.05 1.01 0.99 0.93
Category 4: Housing
BGFE-he-cstr 0.97 0.96 0.95 1.03 0.94 0.96 0.96
BGFE-he 0.96 0.97 0.93 1.03 0.94 0.96 0.96
BGFE-ho 0.98 1.14 0.93 1.02 0.96 0.97 0.95
AR-he-PC 1.03 1.03 1.04 1.04 0.99 1.01 1.07
Pooled 0.99 1.14 0.92 1.06 0.97 0.97 0.94

Notes: Benchmark model = AR-he.

Table G.2: Relative RMSE by Expenditure Category and Period, cont.
Full 95 - 99 00 - 04 05 - 09 10 - 14 15 - 19 20 - 22
Category 5: Medical Care
BGFE-he-cstr 0.95 0.93 0.98 0.95 1.01 0.96 0.87
BGFE-he 0.95 0.93 0.98 0.95 1.01 0.96 0.85
BGFE-ho 1.06 1.32 1.15 1.07 1.09 0.99 0.86
AR-he-PC 1.03 1.04 0.99 1.02 1.03 1.02 1.07
Pooled 1.15 1.30 1.38 1.23 1.15 1.04 0.92
Category 6: Recreation
BGFE-he-cstr 0.97 0.97 0.99 0.99 0.98 0.96 0.94
BGFE-he 0.97 0.97 1.00 0.99 0.99 0.96 0.92
BGFE-ho 1.02 1.18 1.09 1.05 1.00 0.97 0.94
AR-he-PC 1.03 1.03 1.03 1.02 1.01 1.01 1.08
Pooled 1.12 1.17 1.24 1.31 1.17 0.99 0.96
Category 7: Transportation
BGFE-he-cstr 0.99 0.97 1.01 0.99 0.99 1.00 0.96
BGFE-he 0.99 0.97 1.01 0.98 0.99 1.00 0.96
BGFE-ho 1.03 1.07 1.02 1.00 1.01 1.02 1.07
AR-he-PC 1.02 1.04 1.05 0.97 1.01 0.99 1.07
Pooled 0.99 1.07 1.02 0.99 0.97 0.96 0.97
Category 8: Other Goods and Services
BGFE-he-cstr 0.97 0.98 0.99 1.02 0.89 0.96 0.93
BGFE-he 0.95 0.94 0.95 1.05 0.89 0.97 0.91
BGFE-ho 0.96 0.99 0.97 0.99 0.89 0.97 0.91
AR-he-PC 1.02 1.01 1.02 1.02 0.97 1.00 1.11
Pooled 0.98 0.99 1.04 0.99 0.89 0.99 0.94

Notes: Benchmark model = AR-he.

Table G.3: Relative LPS, by Expenditure Category and Period
Full 95 - 99 00 - 04 05 - 09 10 - 14 15 - 19 20 - 22
Average of All Series
BGFE-he-cstr -0.08 -0.08 -0.07 -0.08 -0.09 -0.07 -0.08
BGFE-he -0.06 -0.06 -0.05 -0.06 -0.07 -0.05 -0.08
BGFE-ho 0.64 0.77 0.72 0.77 0.48 0.53 0.45
AR-he-PC 0.01 0.02 0.02 0.01 0.01 0.01 0.01
Pooled OLS 0.66 0.84 0.73 0.79 0.53 0.52 0.44
Category 1: Apparel
BGFE-he-cstr -0.05 -0.03 -0.02 -0.05 -0.05 -0.02 -0.14
BGFE-he -0.04 -0.02 -0.02 -0.04 -0.05 -0.01 -0.14
BGFE-ho 0.10 0.25 0.07 0.07 0.11 0.06 0.03
AR-he-PC 0.03 0.03 0.01 0.03 0.02 0.01 0.09
Pooled 0.12 0.25 0.07 0.10 0.18 0.11 -0.05
Category 2: Education and Communication
BGFE-he-cstr -0.12 -0.06 -0.07 -0.19 -0.17 -0.10 -0.11
BGFE-he -0.13 -0.08 -0.11 -0.20 -0.16 -0.09 -0.12
BGFE-ho 0.92 1.44 1.04 0.87 0.91 0.56 0.56
AR-he-PC 0.01 0.00 0.01 0.02 0.00 0.01 0.07
Pooled 0.98 1.45 1.09 0.95 0.99 0.62 0.62
Category 3: Food and Beverages
BGFE-he-cstr -0.04 -0.03 -0.02 -0.06 -0.05 -0.05 -0.07
BGFE-he -0.04 -0.04 -0.02 -0.05 -0.05 -0.05 -0.08
BGFE-ho 0.37 0.66 0.36 0.34 0.27 0.39 0.03
AR-he-PC 0.00 0.01 0.03 -0.01 0.01 0.00 -0.08
Pooled 0.44 0.84 0.41 0.44 0.33 0.41 0.03
Category 4: Housing
BGFE-he-cstr -0.12 -0.13 -0.14 -0.10 -0.11 -0.12 -0.07
BGFE-he -0.11 -0.12 -0.13 -0.10 -0.11 -0.12 -0.07
BGFE-ho 0.81 0.96 1.13 0.83 0.60 0.60 0.70
AR-he-PC 0.02 0.00 0.03 0.04 0.01 0.01 0.06
Pooled 0.82 0.97 1.13 0.88 0.63 0.60 0.68

Notes: Benchmark model = AR-he.

Table G.4: Relative LPS by Expenditure Category and Period, cont.
Full 95 - 99 00 - 04 05 - 09 10 - 14 15 - 19 20 - 22
Category 5: Medical Care
BGFE-he-cstr -0.15 -0.29 -0.14 -0.14 -0.16 -0.05 -0.09
BGFE-he -0.15 -0.29 -0.13 -0.14 -0.16 -0.05 -0.09
BGFE-ho 1.16 1.64 1.34 1.11 1.14 0.78 0.78
AR-he-PC 0.02 0.02 0.01 0.03 0.02 0.01 0.04
Pooled 1.21 1.64 1.39 1.20 1.20 0.84 0.84
Category 6: Recreation
BGFE-he-cstr -0.04 -0.04 -0.05 -0.02 -0.02 -0.05 -0.05
BGFE-he -0.03 -0.03 -0.05 -0.01 -0.01 -0.04 -0.07
BGFE-ho 0.55 0.91 0.79 0.58 0.49 0.25 0.12
AR-he-PC 0.02 0.03 0.03 0.02 0.01 0.01 0.06
Pooled 0.61 0.91 0.84 0.70 0.58 0.29 0.13
Category 7: Transportation
BGFE-he-cstr -0.04 -0.10 -0.03 -0.05 -0.07 0.00 0.00
BGFE-he -0.03 -0.10 0.00 -0.04 -0.06 0.00 0.00
BGFE-ho 1.59 0.98 1.56 2.83 0.79 1.51 2.11
AR-he-PC 0.01 0.03 0.03 -0.02 -0.01 0.00 0.07
Pooled 1.42 0.98 1.41 2.38 0.75 1.24 1.98
Category 8: Other Goods and Services
BGFE-he-cstr -0.03 0.22 -0.09 -0.07 -0.14 -0.06 -0.05
BGFE-he -0.03 0.25 -0.09 -0.06 -0.13 -0.06 -0.06
BGFE-ho 0.64 0.59 0.66 0.69 0.83 0.53 0.49
AR-he-PC 0.01 0.02 0.00 0.03 0.00 0.01 0.04
Pooled 0.69 0.61 0.71 0.77 0.89 0.57 0.53

Notes: Benchmark model = AR-he.

G.1.1 Network Visualization of Posterior Similarity Matrix

In our empirical work, we estimate the posterior similarity matrix (PSM) among 154 series for the last sample (August 2022). Presenting and examining the $154\times 154=23{,}716$ estimated pairwise posterior probabilities in the PSM would be thoroughly uninformative. Hence we characterize the estimated PSM graphically as network graphs, which contain node names, node colors, and link sizes (one per link, since the network is undirected).

  • Node name shows the item names of CPI sub-indices.

  • Node color indicates the group structure used in the prior, e.g., expenditure category.

  • Link size represents the pairwise probabilities in the PSM.

We use the qgraph package in R for network visualization. Node locations are determined by a modified version of the force-directed algorithm proposed by Fruchterman and Reingold (1991). Figure G.3 shows the full-sample CPI sub-index network graph.
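For readers working in Python, the following sketch builds an analogous network with networkx instead of qgraph (the pruning threshold is an arbitrary illustration); nx.spring_layout implements the Fruchterman-Reingold force-directed layout:

```python
import networkx as nx  # the paper uses the qgraph package in R; this is a Python analogue

def psm_network(psm, names, groups, threshold=0.2):
    # Nodes are CPI sub-indices; edge weights are the pairwise posterior
    # probabilities from the PSM; weak links are pruned for readability.
    G = nx.Graph()
    for name, grp in zip(names, groups):
        G.add_node(name, group=grp)                      # node color <- prior group
    n = len(names)
    for i in range(n):
        for j in range(i + 1, n):
            if psm[i][j] > threshold:
                G.add_edge(names[i], names[j], weight=psm[i][j])
    pos = nx.spring_layout(G, weight="weight", seed=0)   # Fruchterman-Reingold layout
    return G, pos
```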

Figure G.3: Individual CPI Sub-Index Network Graph based on Posterior Similarity Matrix, August 2022


G.1.2 Heteroskedasticity vs. Homoskedasticity

We conclude by examining how grouped heteroskedasticity impacts forecast accuracy and why this is important. For illustrative purposes, we focus on the two BGFE estimators, BGFE-he and BGFE-ho, that do not involve pairwise constraints.

A distinguishing characteristic between BGFE-he and BGFE-ho is the estimated number of groups. Figure G.4 depicts the number of groups across samples. The BGFE-he estimator forms nine groups at the beginning of the sample and increases the number during the Great Recession and the pandemic. The estimated number of groups for BGFE-ho, however, is rather low in the 1990s and progressively increases to around seven by the end of the sample. Noticeably, when heteroskedasticity is allowed, there are more groups than when it is not. This is intuitive: two groups can have comparable estimates of the grouped fixed-effects and slope coefficients but vastly different error variances. As a result, allowing for heteroskedasticity yields a more refined group structure and increases the overall number of groups.

Figure G.4: Number of Groups

As seen in Figures 6 and 7, grouped heteroskedasticity improves the density forecast but not the point forecast. Figure G.5 provides a clearer perspective by showing the performance of the point and density forecasts through time. In panel (a), the ratio of RMSEs is generally around one over the whole sample, meaning that heteroskedasticity does not improve the point forecast in general. In panel (b), the difference in LPS is consistently negative. This demonstrates that the improved density forecast performance is not a fluke: allowing for heteroskedasticity improves the density forecast regardless of the sample. This is in line with the simulation results presented in Table E.4.

Figure G.5: Results of BGFE-he vs. BGFE-ho. (a) $\text{RMSE}_{he}/\text{RMSE}_{ho}$; (b) $\text{LPS}_{he}-\text{LPS}_{ho}$. [Figures omitted.]

Density forecasts vary substantially across categories. We pick three typical sub-categories and plot their posterior predictive densities for August 2022 in Figure G.6. The vertical dashed black lines represent the actual values. Several insights emerge from comparing these three sub-categories. First, BGFE-he and BGFE-ho provide comparable posterior means for all three sub-categories: the posterior predictive densities concentrate around similar price levels. This explains why BGFE-he and BGFE-ho have comparable results in point forecasting. Second, BGFE-he reveals different predictive variances. As the rolling sample size is set to four years, all observations throughout the pandemic are included, and the price levels of elementary and high school (basic education) tuition and fees, major appliances, and airline fares can be expected to respond differently to the shock. Intuitively, education tuition and fees should not fluctuate as much as other prices, while airline fares have been strongly influenced by fluctuating oil prices since the beginning of the pandemic. Accounting for heteroskedasticity successfully captures this characteristic, such that tuition and fees have a smaller predictive variance than under BGFE-ho, while airline fares have a greater predictive variance.

Combining these two observations reveals why BGFE-he delivers a better density forecast: the capacity to cluster units according to the error variance and to accommodate heteroskedasticity. For elementary and high school tuition and fees, provided that both BGFE-he and BGFE-ho yield accurate posterior means, BGFE-he produces a much lower predictive variance, decreasing the LPS dramatically. Both BGFE-he and BGFE-ho underestimate the inflation rate for airline fares, but BGFE-he subtly creates a greater predictive variance to account for the wide probable swings in this sub-category and hence reduces the LPS significantly. Major appliances is an example in which BGFE-he and BGFE-ho generate comparable density forecasts.

Figure G.6: Predictive Posteriors for Selected Series: BGFE-he vs. BGFE-ho. (a) Basic Edu. Tuition and Fees; (b) Major Appliances; (c) Airline Fares. [Figures omitted.]

G.2 Income and Democracy

G.2.1 Results of Specification 1

We start our analysis with specification 1 in (4.15). Table G.5 reports the posterior probability of the number of groups for various estimators. Notably, BGFE-ho in this specification is identical to the primary model in BM, allowing us to evaluate the optimal number of groups. BGFE-ho creates eight groups in all posterior draws, which is consistent with BM's conclusion from using BIC that the upper bound on the true number of groups is 10. Although BM are unable to validate the ideal number of groups for their study, our BGFE-ho estimator provides an estimate of it. Intriguingly, accounting for heteroskedasticity drastically reduces the number of groups, with BGFE-he identifying three groups in 92.9% of posterior draws. Adding pairwise constraints based on geographic information increases the number of groups: two-thirds of the posterior draws from BGFE-he-cstr generate five groups.

Table G.5: Probability for the number of groups
BGFE-he-cstr BGFE-he BGFE-ho
$Pr(K<3)$ 0.000 0.000 0.000
$Pr(K=3)$ 0.000 0.929 0.000
$Pr(K=4)$ 0.344 0.071 0.000
$Pr(K=5)$ 0.656 0.000 0.000
$Pr(K>5)$ 0.000 0.000 1.000

The marginal data density (MDD) of each estimator in Table G.6 provides some insight into the different models. BGFE-ho produces eight groups and has a tendency to overfit; its MDD is the lowest of the three estimators. BGFE-he, with fewer groups, is superior to BGFE-ho and has a higher MDD. BGFE-he-cstr has the highest MDD because the pairwise constraints give direction on grouping and identify the ideal group structure, which BGFE-he cannot uncover without our prior knowledge.

Table G.6: Marginal Data Density
BGFE-he-cstr BGFE-he BGFE-ho
425.690 381.218 368.918

We focus on the BGFE-he-cstr estimator and use the approach outlined in Section 3.2 to identify the unique group partitioning $\widehat{G}$. The left panel of Figure G.7 presents the world map colored by $\widehat{G}$, while the right panel presents the group-specific averages of the democracy index over time. The estimated group structure $\widehat{G}$ features four distinct groups, which coincides with the choice of BM. As in BM, we refer to groups 1-4 as the “high-democracy”, “low-democracy”, “early transition”, and “late transition” groups, respectively. With the exception of the “early transition” group, which is slightly at odds with its counterpart in BM, the group-specific averages of the democracy index for all other groups are relatively similar to those in BM. Notice that BM manually set the number of groups to four, whereas we discover that four is the optimal number. Consequently, by employing model specification 1 and accounting for heteroskedasticity, we find support for BM's main results.

Figure G.7: Point Estimation of Group Partitioning and Average Democracy

Table G.7 shows the posterior mean and 90% credible set for each coefficient, with $G$ fixed at the point estimate $\widehat{G}$. Compared to the pooled OLS estimates, both $\hat{\rho}$ and $\hat{\beta}$ decline once we incorporate the group-specific time patterns. The results are essentially consistent with the conclusion in BM: there is modest persistence and a positive effect of income on democracy, but the cumulative income effect $\beta/(1-\rho)=0.08$ is quantitatively small.

Table G.7: Coefficient estimates across groups
Lagged democracy ($\rho$) Lagged income ($\beta$)
Coef. Cred. Set Coef. Cred. Set
BGFE-he-cstr 0.499 [0.438, 0.558] 0.040 [0.027, 0.053]
Pooled OLS 0.665 [0.616, 0.718] 0.082 [0.065, 0.100]

G.2.2 Network Visualization of Posterior Similarity Matrix

Figures G.8 and G.9 show the full-sample country network graphs for specifications 1 and 2, respectively. Node color reflects the geographic information used to construct the group structure in the prior.

Figure G.8: Individual Country Network Graph based on Posterior Similarity Matrix, Specification 1


Figure G.9: Individual Country Network Graph based on Posterior Similarity Matrix, Specification 2
