Supplement to “Markov Chain Monte Carlo Based on Deterministic Transformations”
Throughout, we refer to our main article Dutta and Bhattacharya (2013) as DB.
S-1 Proof of detailed balance for TMCMC
The detailed balance condition is proved as follows: Suppose , then . Hence, the kernel satisfies,
and
where satisfies
S-2 General TMCMC algorithm based on a single
Algorithm S-2.1
General TMCMC algorithm based on a single .
- Input: Initial value , and number of iterations .
- For
  1. Generate and an index independently. Again, actual simulation from the high-dimensional multinomial distribution is not necessary; see Section 3.1 of DB.
  2.
  3. Set
- End for
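As a concrete illustration of the loop above, the following is a minimal runnable sketch of the additive special case of Algorithm S-2.1. It is an illustration under assumed choices, not the authors' exact pseudo-code: the target density is known up to a constant, a single epsilon is drawn as the absolute value of a Gaussian, and each coordinate independently receives a forward or backward move-type with probability 1/2, so that (with unit Jacobian and equal move probabilities) the acceptance ratio reduces to the ratio of target densities.

```python
import numpy as np

def additive_tmcmc(log_pi, x0, n_iter, scale=1.0, seed=None):
    """Additive TMCMC with a single epsilon (illustrative sketch).

    Each iteration draws one eps >= 0 and independent signs b_i in {-1, +1};
    the proposal is x' = x + eps * b.  With equal forward/backward move-type
    probabilities the acceptance ratio reduces to pi(x') / pi(x).
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    chain = np.empty((n_iter, x.size))
    n_acc = 0
    for t in range(n_iter):
        eps = abs(rng.normal(0.0, scale))          # single epsilon for all coordinates
        b = rng.choice([-1.0, 1.0], size=x.size)   # forward/backward move-type per coordinate
        prop = x + eps * b
        if np.log(rng.uniform()) < log_pi(prop) - log_pi(x):
            x = prop
            n_acc += 1
        chain[t] = x
    return chain, n_acc / n_iter

# Usage: a 10-dimensional standard normal target (assumed for illustration).
log_pi = lambda x: -0.5 * np.dot(x, x)
chain, acc = additive_tmcmc(log_pi, np.zeros(10), 5000, scale=0.5, seed=0)
```

Note that only one scalar epsilon is simulated per iteration, whatever the dimension; this is the feature exploited in Section S-6.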
S-3 Convergence properties of additive TMCMC
In this section we prove some convergence properties of TMCMC in the case of the additive transformation. Before presenting our main result, we first recall some definitions from the MCMC literature.
Definition 1 (Irreducibility)
A Markov transition kernel is irreducible, where is a nontrivial measure, if for every and for every measurable set of with , there exists , such that
Definition 2 (Small set)
A measurable subset of is said to be small if there is an , a constant , possibly depending on , and a finite measure such that
Definition 3 (Aperiodicity)
A Markov kernel is said to be periodic with period if the state-space can be partitioned into disjoint subsets with
and .
A Markov kernel is aperiodic if there is no for which it is periodic with period .
S-3.1 Additive transformation with singleton
Consider now the case where , and where, for , , and . In this case . Suppose that is a density on .
Theorem 1
Suppose that is bounded and positive on every compact subset of and that is positive on every compact subset of . Then the chain is -irreducible and aperiodic. Moreover, every nonempty compact subset of is small.
Proof 1
Without loss of generality we may assume that . For notational convenience we shall prove the theorem for . The general case can be seen to hold with suitably defined ‘rotational’ matrices on , similar to (S-3.1).
Suppose is a nonempty compact subset of . Let be a compact rectangle whose sides are parallel to the diagonals and containing such that . We shall show that is small, i.e., such that
It is clear that the points reachable from in two steps are of the form
Thus, if we define the matrices
(S-3.1)
then the points reachable from in two steps, other than the points lying on the diagonals passing through itself, are of the form
Define
where is the length of the diagonal of the rectangle (actually suffices). Fix an element . For any set , let and define,
(S-3.2)
The need for defining such sets is illustrated in the following example: to make a transition from the state to a state in in two steps, first making a forward transition in both coordinates and then a forward transition in the first coordinate and a backward transition in the second coordinate is the same as applying the transformation for some in two steps, i.e., first
Also note that for any , implies that the intermediate point , and similarly for . Now, with and denoting the minimum and maximum of the move probabilities,
(S-3.3)
Since , it follows that . Now notice that, if we define for
and
then,
since the ’s are pairwise disjoint, and for . It follows from (S-3.3) that
where .
This completes the proof that is small.
That the chain is irreducible follows easily: for any , the set is compact, and for a measurable set with we may choose in the first part of the proof such that . Now,
Also aperiodicity follows trivially from the observation that any set with positive -measure can be accessed in at most 2 steps.
S-4 General TMCMC algorithm with single and dependent
Let , be the specified joint distributions of and induced by the Gaussian distributions of , and let denote the conditional probability of given , where is the conditional probability of given and . Then the general TMCMC algorithm with singleton and dependent is given as follows.
Algorithm S-4.1
General TMCMC algorithm based on single and dependent .
- Input: Initial value , and number of iterations .
- For
  1. Generate , , and .
  2. For , set , , and .
  3. Generate and an index independently.
  4.
  5. Set
- End for
S-5 Proof of detailed balance for TMCMC with dependent
Let , then . The kernel satisfies,
and
S-6 Improved acceptance rates of additive TMCMC with singleton compared to joint updating using RWMH
The joint RWMH algorithm generates independently from , and then uses the transformation of the form , where are appropriate scaling constants. For large , the so-called “curse of dimensionality” can force the acceptance rate to be close to zero. On the other hand, the additive-transformation based TMCMC also updates simultaneously in a single block, but instead of using different , it uses a single for updating all the variables. In other words, for TMCMC based on additive transformation is of the form , where . Thus, relative to RWMH, the dimension in the TMCMC case is effectively reduced to 1, avoiding the curse of dimensionality. Thus, it is expected that additive TMCMC will have a much higher acceptance rate than RWMH. In this section we formalize and compare the issues related to acceptance rates of additive TMCMC and RWMH.
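The contrast described above can be observed empirically. The sketch below is an illustration under assumed choices (an i.i.d. standard normal target and the same per-coordinate step scale for both samplers), not the optimal scalings of Dey and Bhattacharya (2013); it simply measures the empirical acceptance rates of RWMH and additive TMCMC in a moderately high dimension.

```python
import numpy as np

def acceptance_rate(propose, log_pi, x0, n_iter, rng):
    """Empirical acceptance rate of a Metropolis chain with a symmetric proposal."""
    x = np.asarray(x0, dtype=float).copy()
    acc = 0
    for _ in range(n_iter):
        prop = propose(x, rng)
        if np.log(rng.uniform()) < log_pi(prop) - log_pi(x):
            x, acc = prop, acc + 1
    return acc / n_iter

d = 100
log_pi = lambda x: -0.5 * np.dot(x, x)   # i.i.d. N(0, 1) target (assumed)
s = 2.4 / np.sqrt(d)                     # same per-coordinate scale for both samplers

# RWMH: d independent Gaussian increments, one per coordinate.
rwmh = lambda x, rng: x + rng.normal(0.0, s, size=d)
# Additive TMCMC: one scalar epsilon, signs chosen independently per coordinate.
tmcmc = lambda x, rng: x + abs(rng.normal(0.0, s)) * rng.choice([-1.0, 1.0], size=d)

r_rwmh = acceptance_rate(rwmh, log_pi, np.zeros(d), 5000, np.random.default_rng(1))
r_tmcmc = acceptance_rate(tmcmc, log_pi, np.zeros(d), 5000, np.random.default_rng(2))
```

With this set-up the TMCMC chain accepts markedly more often than the RWMH chain, in line with the 0.439 versus 0.234 optimal acceptance rates discussed below.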
S-6.1 Discussion on optimal scaling and optimal acceptance rate of additive TMCMC and RWMH
A reasonable approach to compare the acceptance rates of additive TMCMC and RWMH is to develop the optimal scaling theory for additive TMCMC, obtain the optimal acceptance rate, and then compare the latter with the optimal acceptance rates for RWMH, which are already established in the MCMC literature. Indeed, optimal scaling and optimal acceptance rate of additive TMCMC and comparison with those of RWMH is the subject of Dey and Bhattacharya (2013), where it is shown that additive TMCMC has a much higher optimal acceptance rate compared to RWMH. Before we summarize the results of Dey and Bhattacharya (2013) we first provide a brief overview of optimal scaling and optimal acceptance rate of RWMH.
S-6.1.1 Brief overview of optimal scaling and optimal acceptance rate for RWMH
Roughly, the optimal random walk proposal variance, represented as an inverse function of the dimension , is the one that maximizes the speed of convergence to the stationary distribution of a relevant diffusion process to which a ‘sped-up’ version of RWMH weakly converges as the dimension increases to infinity. The optimal acceptance rate corresponds to the optimal proposal variance. Under various assumptions on the form of the target distribution , ranging from the assumption (Roberts et al. (1997)), through independent but non-identical set-up (Bedard (2007)), to a more general dependent structure (Mattingly et al. (2011)), the optimal acceptance rate turns out to be 0.234.
S-6.1.2 Optimal scaling and optimal acceptance rate for additive TMCMC
In Dey and Bhattacharya (2013) it has been proved, in the case of additive TMCMC and assuming , that the optimal acceptance rate, as , is 0.439 under the set-ups (, independent but non-identical, and dependent) for which the optimal acceptance rate for RWMH has been studied and established to be 0.234. Thus, the optimal acceptance rate for additive TMCMC is much higher than that of RWMH. The optimal scalings, that is, the optimal values of the scales, are also available from the optimal scaling theory. As shown in Dey and Bhattacharya (2013), all these results for additive TMCMC and RWMH remain true in all the aforementioned set-ups even if some of the co-ordinates of are updated at random, conditioning on the remaining co-ordinates.
S-6.2 Comparison between the asymptotic forms of the acceptance rates of additive TMCMC and RWMH for strongly log-concave target densities
The results on optimal scaling and optimal acceptance rate discussed in Sections S-6.1.1 and S-6.1.2 are available only for special forms of the target distribution . In this section we obtain the asymptotic forms of the acceptance rates associated with RWMH and additive TMCMC assuming that the target density is strongly log-concave. In particular, under suitable conditions we show that as the dimension increases, the acceptance rate of RWMH converges to zero at a much faster rate than that of additive TMCMC.
Assuming without loss of generality that the marginal variances of the target density are all unity (achieved after suitable scaling perhaps), for RWMH we consider the following proposed value given the current value : , where . On the other hand, for additive TMCMC, we consider , where and the components of take values with probability each.
To proceed we consider the following form of acceptance rate for our asymptotic framework. Letting denote the acceptance probability of given the current value , and letting , the acceptance rate is given by
(S-6.1)
In the above formula for acceptance rate note that as and as . Hence, given any , we can choose sufficiently small such that and . Hence, re-writing (S-6.1) as
we find that the first and the third terms on the right hand side are negligible for any algorithm. So, for the purpose of comparing algorithms with respect to their acceptance rates, we consider only the middle term; in all that follows we denote
(S-6.2)
For our purpose, we consider a target density of variables that is strongly log-concave, that is,
(S-6.3)
where we assume that , with for every . We further assume that , and the sequence is such that as . Then clearly, , meaning as . In fact, we assume that approaches 1 at a sufficiently fast rate, so that . For our purpose we assume that and , so that . It is easy to verify that these choices satisfy the above conditions.
It is important to note that our assumption need not hold for all strongly log-concave distributions. For instance, when is the product of standard normals, that is, when under , . In this case, for every . In general, even if and remains finite as grows to infinity, our proofs remain valid provided that and . The case of being an product of standard normals clearly satisfies the above conditions.
S-6.2.1 Asymptotic form of the acceptance rate for RWMH
Let denote the mode of the target density . Then for every
Thus, from the assumption in (S-6.3), and noting that , it follows that
(S-6.4)
so that
(S-6.5)
Hence, using (S-6.2) it can be seen that the acceptance rate is bounded above and below as follows
(S-6.6)
where
and
Now, note that for some depending upon and ,
so that the inequalities related to strong convexity, given by (S-6.3) yield
(S-6.8)
Using the lower bound of given by (S-6.8) and Fubini’s theorem, the lower bound of the acceptance rate given by (S-6.6) can be further bounded below as
(S-6.9)
where must be calculated with respect to , and independently, .
Similarly, using the upper bound of given by (S-6.8) the upper bound of the acceptance rate given by (S-6.6) can be further bounded above as
(S-6.10)
The probability must be calculated with respect to , and independently, . Thus, we have
(S-6.11)
We first focus on the lower bound in (S-6.11).
As ,
(S-6.12)
where denotes asymptotic normal with mean and variance . From (S-6.12) it follows that
(S-6.13)
Combining (S-6.9) and (S-6.13) we obtain
(S-6.14)
Now focusing our attention on the upper bound of we similarly obtain
In other words,
Since , it is easy to see that
Hence, it follows that
(S-6.17)
S-6.2.2 Asymptotic bounds of the acceptance rate for additive TMCMC
Next let us obtain lower and upper bounds of associated with TMCMC with the additive transformation. Recall that in this case , where and the components of take values with probability each. In this set-up (S-6.5) becomes
(S-6.18)
Now notice that, under the lower bound of provided in (S-6.8), as ,
and
Similarly, under the upper bound of in (S-6.8), the above hold with replaced by . From these it follows that the asymptotic forms of the lower and the upper bounds of (S-6.18) are given by
and
Using the above results, it follows as in the case of that
Substituting the infimum and supremum over we obtain
Since and , it follows that
Hence,
For comparing (S-6.19) with (S-6.17), where , it can easily be verified using L’Hospital’s rule that for any , ,
(S-6.20)
The above result will continue to hold if instead of , , where is some constant. Hence, converges to zero at a much slower rate compared to .
S-7 Comparison of TMCMC with HMC
Motivated by Hamiltonian dynamics, Duane et al. (1987) introduced HMC, an MCMC algorithm with deterministic proposals based on approximations of the Hamiltonian equations. We will show that this algorithm is a special case of TMCMC, but first we provide a brief overview of HMC. More details can be found in Liu (2001), Cheung and Beck (2009) and the references therein.
S-7.1 Overview of HMC
If is the target distribution, a fictitious dynamical system may be considered, where can be thought of as the -dimensional position vector of a body of particles at time . If is the velocity vector of the particles, is its acceleration vector, and is the force exerted on the particle, then, by Newton’s law of motion, , where is a mass vector. The momentum vector, , often used in classical mechanics, can be thought of as a vector of auxiliary variables brought in to facilitate simulation from . The kinetic energy of the system is defined as , being the mass matrix. Usually, is taken as .
The target density is linked to the dynamical system via the potential energy field of the system, defined as . The total energy (Hamiltonian function), is given by . A joint distribution over the phase-space is then considered, given by
(S-7.1)
Since the marginal density of is , it now remains to provide a joint proposal mechanism for simulating jointly; ignoring yields marginally from .
For the joint proposal mechanism, HMC makes use of Newton’s law of motion, derived from the law of conservation of energy, and often written in the form of Hamiltonian equations, given by
where . The Hamiltonian equations can be approximated by the commonly used leap-frog algorithm (Hockney (1970)), given by,
(S-7.2)
(S-7.3)
Given choices of , , and , the HMC is then given by the following algorithm:
Algorithm S-7.1
HMC
- Initialise and draw .
- Assuming the current state to be , do the following:
  1. Generate ;
  2. Letting , run the leap-frog algorithm for time steps, to yield ;
  3. Accept with probability
(S-7.4)
and accept with the remaining probability.
In the above algorithm, it is not required to store simulations of . Next we show that HMC is a special case of TMCMC.
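The leap-frog scheme (S-7.2)–(S-7.3) and Algorithm S-7.1 can be sketched as follows. This is an illustrative implementation under assumed choices (mass matrix equal to the identity, a standard normal target), not the authors' code; with the identity mass matrix the Hamiltonian is the potential energy plus half the squared norm of the momentum.

```python
import numpy as np

def leapfrog(grad_U, x, p, eps, L):
    """Leap-frog discretisation of the Hamiltonian equations; M = I assumed."""
    x, p = x.copy(), p.copy()
    p = p - 0.5 * eps * grad_U(x)        # initial momentum half-step
    for i in range(L):
        x = x + eps * p                  # full position step
        # full momentum step, except a closing half-step on the last leg
        p = p - (eps if i < L - 1 else 0.5 * eps) * grad_U(x)
    return x, p

def hmc_step(U, grad_U, x, eps, L, rng):
    """One HMC transition: refresh p ~ N(0, I), integrate, accept w.p. min(1, exp(-dH))."""
    p = rng.normal(size=x.size)
    x_new, p_new = leapfrog(grad_U, x, p, eps, L)
    dH = (U(x_new) + 0.5 * p_new @ p_new) - (U(x) + 0.5 * p @ p)
    if np.log(rng.uniform()) < -dH:
        return x_new, 1
    return x, 0

# Usage on a 5-dimensional standard normal target: U(x) = x'x/2, grad U(x) = x.
U = lambda x: 0.5 * np.dot(x, x)
grad_U = lambda x: x
rng = np.random.default_rng(0)
x, n_acc = np.zeros(5), 0
for _ in range(300):
    x, a = hmc_step(U, grad_U, x, eps=0.1, L=20, rng=rng)
    n_acc += a
```

Because the leap-frog map nearly conserves the Hamiltonian, the acceptance probability in (S-7.4) stays close to one for small step sizes.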
S-7.2 HMC is a special case of TMCMC
To see that HMC is a special case of TMCMC, note that the leap-frog step of the HMC algorithm (Algorithm S-7.1) is actually a deterministic transformation of the form (see Liu (2001)). This transformation satisfies the following: if , then .
The Jacobian of this transformation is 1 because of the volume preservation property, which says that if is a subset of the phase space, and if , then the volume . As a result, the Jacobian does not feature in the HMC acceptance probability (S-7.4).
For any dimension, there is only one move type defined for HMC, which is the forward transformation . Hence, this move type has probability one of selection, and all other move types, which we defined in general terms in connection with TMCMC, have zero probability of selection. As a result, the corresponding TMCMC acceptance ratio needs slight modification—it must be made free of the move-type probabilities, which is exactly the case in (S-7.4).
The momentum vector can be likened to of TMCMC, but note that must always be of the same dimensionality as ; this is, of course, permitted by TMCMC as a special case.
S-7.3 Comparison of acceptance rate for with RWMH and TMCMC
For , the proposal corresponding to HMC is given by (see Cheung and Beck (2009))
(S-7.5)
where (S-7.5) is a normal distribution with mean and variance given, respectively, by the following:
(S-7.6)
(S-7.7)
Assuming diagonal with being the -th diagonal element, the proposal can be re-written in the following more convenient manner: for ,
(S-7.8)
where denotes the -th component of , and . Assuming, as is usual, that for each , it follows that
(S-7.9)
where is a non-central distribution with degrees of freedom and non-centrality parameter . Since, as either or ,
(S-7.10)
assuming the same strong log-concavity conditions on the target density as provided in Section S-6.2, it follows as in Section S-6.2.1 that,
If as , it follows as in Section S-6.2.1 that
(S-7.12)
which is of the same asymptotic form as (S-6.17), corresponding to the RWMH acceptance rate. On the other hand, if as , then it follows that
(S-7.13)
which clearly tends to zero at a much faster rate compared to (S-7.12).
To summarize, if as , then both HMC and RWMH have the same asymptotic acceptance rate, tending to zero much faster than that of additive TMCMC. On the other hand, if as , the acceptance rate of HMC tends to zero much faster than that of RWMH, while that of additive TMCMC maintains its slowest convergence rate to zero. Also observe that the above conclusions will continue to hold if and tend to finite positive constants satisfying and as .
S-8 Generalized Gibbs/Metropolis approaches and comparisons with TMCMC
It is important to make it clear at the outset of this discussion that the goals of TMCMC and generalized Gibbs/Metropolis methods are different, even though both use moves based on transformations. While the strength of the latter lies in improving mixing of the standard Gibbs/MH algorithms by adding transformation-based steps to the underlying collection of usual Gibbs/MH steps, TMCMC is an altogether general method of simulating from the target distribution which does not require any underlying step of Gibbs or MH.
The generalized Gibbs/MH methods work in the following manner. Suppose that an underlying Gibbs or MH algorithm for exploring a target distribution has poor mixing properties. Then, in order to improve mixing, one may consider some suitable transformation of the random variables being updated such that mixing is improved under the transformation. Such a transformation needs to be chosen carefully, since it is important to ensure that invariance of the Markov chain is preserved under the transformation. It is convenient to begin with an overview of the generalized Gibbs method with a sequential updating scheme and then proceed to the discussion on the issues and the importance of the block updating idea in the context of improving mixing of standard Gibbs/MH methods.
Liu and Sabatti (2000) (see also Liu and Yu (1999)) propose simulating a transformation from some appropriate probability distribution and then applying it to the co-ordinate to be updated. For example, in a -dimensional target distribution, for updating to using an additive transformation, one can select from some appropriate distribution and set . Similarly, if a scale transformation is desired, then one can set , where must be sampled from some suitable distribution. The suitable distributions of and are chosen such that the target distribution is invariant with respect to the move ; the forms of these distributions are provided in Liu and Sabatti (2000). For instance, if denotes the target distribution, then for the additive transformation may be sampled from , and for the multiplicative transformation one may sample from . Since direct sampling from such distributions may be impossible, Liu and Sabatti (2000) suggest a Metropolis-type move with respect to a transformation-invariant transition kernel.
Thus, in the generalized Gibbs method, all the variables must be updated sequentially, unlike TMCMC, where all the variables can be updated simultaneously in a single block. Here we note that, owing to irreducibility issues, the generalized Gibbs approach is not suitable for updating the variables blockwise using some transformation that acts on all the variables in a given block. To consider a simple example, with , say, and a single block consisting of both the variables, if one considers the additive transformation, then starting with , where , one can never reach , where . This is because and , for some , and implies and , which is a contradiction. The scale transformation implies the move . If one initializes the Markov chain with all components positive, for instance, then in every iteration all the variables will have the same sign. The spaces where some variables are positive and some negative will never be visited, even if those spaces have positive (in fact, high) probabilities under the target distribution. This shows that the Markov chain is not irreducible. In fact, with the aforementioned approach, no transformation, whatever distribution it is generated from, can guarantee irreducibility in general if blockwise updates based on the transformation strategy of generalized Gibbs are used.
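The confinement argument above can be checked directly. The toy loop below (an illustration, not from DB) applies the blockwise additive move repeatedly and verifies that the difference of the two coordinates never changes, whatever distribution the increment is drawn from, so states with a different difference are unreachable.

```python
import random

# Blockwise additive move on two coordinates: (x1, x2) -> (x1 + e, x2 + e).
# The difference x1 - x2 is an exact invariant of this move, which is why the
# resulting chain cannot be irreducible on the plane.
random.seed(0)
x1, x2 = 0.0, 5.0
d0 = x1 - x2
for _ in range(1000):
    e = random.gauss(0.0, 1.0)   # any increment distribution gives the same conclusion
    x1, x2 = x1 + e, x2 + e
# up to floating-point rounding, x1 - x2 still equals its starting value d0
```

The same one-line check with a positive scale factor shows that the multiplicative block move preserves the signs of all coordinates.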
Although blockwise transformations are proposed in Liu and Sabatti (2000) (see also Kou et al. (2005) who propose a MH-based rule for blockwise transformation), they are meant for a different purpose than that discussed above. The strength of such blockwise transformations lies in improving the mixing behaviour of standard Gibbs or MH algorithms. Suppose that an underlying Gibbs or MH algorithm for exploring a target distribution has poor mixing properties. Then in order to improve mixing, one may consider some suitable transformation of the set of random variables being updated such that mixing is improved under the transformation. This additional step involving transformation of the block of random variables can be obtained by selecting a transformation from the appropriate probability distribution provided in Liu and Sabatti (2000). This “appropriate” probability distribution guarantees that stationarity of the transformed block of random variables is preserved. Examples reported in Liu and Sabatti (2000), Müller and Czado (2005), Kou et al. (2005), etc. demonstrate that this transformation also improves the mixing behaviour of the chain, as desired.
Thus, to improve mixing using the methods of Liu and Sabatti (2000) or Kou et al. (2005) one needs to run the usual Gibbs/MH steps, with an additional step involving transformations as discussed above. This additional step induces more computational burden compared to the standard Gibbs/MH steps, but improved mixing may compensate for the extra computational labour. In very high dimensions, of course, this need not be a convenient approach since computational complexity usually makes standard Gibbs/MH approaches infeasible. Since the additional transformation-based step works on the samples generated by standard Gibbs/MH, impracticality of the latter implies that the extra transformation-based step of Liu and Sabatti (2000) for improving mixing is of little value in such cases.
It is important to point out that the generalized Gibbs/MH methods can be usefully employed even by TMCMC to further improve its mixing properties. In other words, a step of generalized Gibbs/MH can be added to the computationally fast TMCMC. This additional step can significantly improve the mixing properties of TMCMC. That TMCMC is much faster computationally than standard Gibbs/MH methods implies that, even in very high-dimensional situations, the generalized Gibbs/MH step can be very successful while working in conjunction with TMCMC.
Flight no. | Failure | Temp | Flight no. | Failure | Temp |
---|---|---|---|---|---
14 | 1 | 53 | 2 | 1 | 70 |
9 | 1 | 57 | 11 | 1 | 70 |
23 | 1 | 58 | 6 | 0 | 72 |
10 | 1 | 63 | 7 | 0 | 73 |
1 | 0 | 66 | 16 | 0 | 75 |
5 | 0 | 67 | 21 | 1 | 75 |
13 | 0 | 67 | 19 | 0 | 76 |
15 | 0 | 67 | 22 | 0 | 76 |
4 | 0 | 68 | 12 | 0 | 78 |
3 | 0 | 69 | 20 | 0 | 79 |
8 | 0 | 70 | 18 | 0 | 81 |
17 | 0 | 70 |
S-9 Examples of TMCMC for discrete state spaces
The ideas developed in this paper are not confined to continuous target distributions, but apply also to discrete cases. For the sake of illustration, we consider two examples below.
- (i) Consider an Ising model, where, for , the discrete random variable takes the value or with positive probabilities. We then have . To implement TMCMC, consider the forward transformation with probability , and choose the backward transformation as with probability . Here accordingly as or , and . Note the difference with the continuous cases. Here, even though neither of the transformations is 1-to-1 or onto, TMCMC works because of discreteness; the algorithm can easily be seen to satisfy detailed balance, irreducibility and aperiodicity. However, if with being the only variable, then, if , it is possible to choose, with probability one, the backward move-type, yielding . On the other hand, if , with probability one, we can choose the forward move-type, yielding . Only move-types are necessary for the -dimensional case for one-step irreducibility. In discrete cases, however, there will be no Jacobian of transformation, thereby simplifying the acceptance ratio.
- (ii) For discrete state spaces like , (), the additive transformation with a single epsilon does not work. For example, with , if the starting state is , then the chain will never reach any state where and have the same parity (i.e., both even or both odd), resulting in a reducible Markov chain. Thus in this case we need more move-types than . For example, with some positive probability (say ) we may select a random coordinate and update it, leaving the other coordinates unchanged. With the remaining probability (i.e., ) we may apply the analogous version of the additive transformation: let . Then one can choose the forward transformation for each coordinate as and the backward transformation as , where denotes the largest integer not exceeding . This chain is clearly ergodic, and we still need only one epsilon to update the states.
However, in discrete cases, TMCMC reduces to Metropolis-Hastings with a mixture proposal. But it is important to note that the implementation is much more efficient and computationally cheap when the TMCMC-based methodologies developed in this paper are used.
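A sketch of the mixture proposal of example (ii) on the two-dimensional integer lattice, under assumed choices (an illustrative discrete target and probability 1/2 for the single-coordinate move): the block move assigns each coordinate an independent +1 or -1 step, which preserves the parity of the coordinate sum, while the occasional single-coordinate move changes that parity and so restores irreducibility.

```python
import numpy as np

def discrete_tmcmc_step(log_pi, x, p_single, rng):
    """One step of the mixture proposal on Z^2 (illustrative sketch).

    With probability p_single one randomly chosen coordinate moves by +/-1;
    otherwise each coordinate independently gets a +/-1 step, the discrete
    analogue of the single-epsilon additive transformation.  The proposal is
    symmetric and there is no Jacobian, so the ratio is just pi(prop)/pi(x).
    """
    prop = x.copy()
    if rng.uniform() < p_single:
        prop[rng.integers(2)] += rng.choice([-1, 1])
    else:
        prop += rng.choice([-1, 1], size=2)
    return prop if np.log(rng.uniform()) < log_pi(prop) - log_pi(x) else x

rng = np.random.default_rng(0)
log_pi = lambda x: -0.05 * float(x @ x)   # a roughly Gaussian target on Z^2 (assumed)
x = np.array([0, 0])
parities = set()
for _ in range(2000):
    x = discrete_tmcmc_step(log_pi, x, 0.5, rng)
    parities.add(int(x[0] + x[1]) % 2)
```

Dropping the single-coordinate move (setting p_single to zero) leaves the chain stuck on one parity class, which is exactly the reducibility described in example (ii).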
References
- Bedard (2007) M. Bedard. Weak Convergence of Metropolis Algorithms for Non-i.i.d. Target Distributions. The Annals of Applied Probability, 17:1222–1244, 2007.
- Cheung and Beck (2009) S. H. Cheung and J. L. Beck. Bayesian Model Updating Using Hybrid Monte Carlo Simulation with Application to Structural Dynamic Models with Many Uncertain Parameters. Journal of Engineering Mechanics, 135:243–255, 2009.
- Dey and Bhattacharya (2013) K. K. Dey and S. Bhattacharya. On Optimal Scaling of Additive Transformation Based Markov Chain Monte Carlo. Technical report, Indian Statistical Institute, 2013.
- Duane et al. (1987) S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth. Hybrid Monte Carlo. Physics Letters B, 195:216–222, 1987.
- Dutta and Bhattacharya (2013) S. Dutta and S. Bhattacharya. Markov Chain Monte Carlo Based on Deterministic Transformations, 2013. Submitted.
- Hockney (1970) R. W. Hockney. The Potential Calculation and some Applications. Methods in Computational Physics, 9:136–211, 1970.
- Kou et al. (2005) S. C. Kou, X. S. Xie, and J. S. Liu. Bayesian Analysis of Single-Molecule Experimental Data. Applied Statistics, 54:469–506, 2005.
- Liu (2001) J. Liu. Monte Carlo Strategies in Scientific Computing. Springer-Verlag, New York, 2001.
- Liu and Sabatti (2000) J. S. Liu and C. Sabatti. Generalized Gibbs Sampler and Multigrid Monte Carlo for Bayesian Computation. Biometrika, 87:353–369, 2000.
- Liu and Yu (1999) J. S. Liu and Y. N. Yu. Parameter Expansion for Data Augmentation. Journal of the American Statistical Association, 94:1264–1274, 1999.
- Mattingly et al. (2011) J. C. Mattingly, N. S. Pillai, and A. M. Stuart. Diffusion Limits of the Random Walk Metropolis Algorithm in High Dimensions. The Annals of Applied Probability, 22:881–930, 2011.
- Müller and Czado (2005) G. Müller and C. Czado. An Autoregressive Ordered Probit Model with Application to High-Frequency Financial Data. Journal of Computational and Graphical Statistics, 14:320–338, 2005.
- Roberts et al. (1997) G. O. Roberts, A. Gelman, and W. R. Gilks. Weak Convergence and Optimal Scaling of Random Walk Metropolis Algorithms. The Annals of Applied Probability, 7:110–120, 1997.