
Concavity of entropy under thinning

Yaming Yu Department of Statistics
University of California
Irvine, CA 92697, USA
Email: yamingy@uci.edu
   Oliver Johnson Department of Mathematics
University of Bristol
University Walk, Bristol, BS8 1TW, UK
Email: o.johnson@bristol.ac.uk
Abstract

Building on the recent work of Johnson (2007) and Yu (2008), we prove that entropy is a concave function with respect to the thinning operation $T_{\alpha}$. That is, if $X$ and $Y$ are independent random variables on $\mathbf{Z}_{+}$ with ultra-log-concave probability mass functions, then

H(T_{\alpha}X+T_{1-\alpha}Y)\geq\alpha H(X)+(1-\alpha)H(Y),\quad 0\leq\alpha\leq 1,

where $H$ denotes the discrete entropy. This is a discrete analogue of the inequality ($h$ denotes the differential entropy)

h(\sqrt{\alpha}X+\sqrt{1-\alpha}Y)\geq\alpha h(X)+(1-\alpha)h(Y),\quad 0\leq\alpha\leq 1,

which holds for continuous $X$ and $Y$ with finite variances and is equivalent to Shannon’s entropy power inequality. As a consequence we establish a special case of a conjecture of Shepp and Olkin (1981). Possible extensions are also discussed.

Index Terms:
binomial thinning; convolution; entropy power inequality; Poisson distribution; ultra-log-concavity.

I Introduction

This paper considers information-theoretic properties of the thinning map, an operation on the space of discrete random variables based on random summation.

Definition 1 (Rényi, [10])

For a discrete random variable $X$ on $\mathbf{Z}_{+}=\{0,1,\ldots\}$, the thinning operation $T_{\alpha}$ is defined by

T_{\alpha}X=\sum_{i=1}^{X}B_{i}

where $B_{i}$ are (i) independent of each other and of $X$ and (ii) identically distributed Bernoulli$(\alpha)$ random variables, i.e., $\Pr(B_{i}=1)=1-\Pr(B_{i}=0)=\alpha$ for each $i$.

Equivalently, if the probability mass function (pmf) of $X$ is $f$, then the pmf of $T_{\alpha}X$ is

(T_{\alpha}f)_{i}\equiv\Pr(T_{\alpha}X=i)=\sum_{j\geq i}bi(i;j,\alpha)f_{j},

where $bi(i;j,\alpha)=\binom{j}{i}\alpha^{i}(1-\alpha)^{j-i}$ is the binomial pmf. (Note that we write $T_{\alpha}$ for the map acting on the pmf as well as acting on the random variable.)
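For illustration, the binomial mixture above can be evaluated directly when $f$ has finite support; the following minimal Python sketch (hypothetical helper name, pmfs represented as lists indexed from 0) computes the thinned pmf.

    from math import comb

    def thin_pmf(f, alpha):
        """Pmf of T_alpha X when X has pmf f on {0, ..., len(f)-1}:
        (T_alpha f)_i = sum_{j >= i} C(j, i) alpha^i (1 - alpha)^(j - i) f_j."""
        n = len(f)
        return [sum(comb(j, i) * alpha ** i * (1 - alpha) ** (j - i) * f[j]
                    for j in range(i, n))
                for i in range(n)]

    # Example: thinning a Binomial(3, 1/2) pmf by alpha = 0.4 gives Binomial(3, 0.2),
    # reducing the mean from 1.5 to 0.6.
    print(thin_pmf([1/8, 3/8, 3/8, 1/8], 0.4))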

We briefly mention other notation used in this paper. We use ${\rm Po}(\lambda)$ to denote the Poisson distribution with mean $\lambda$, i.e., the pmf is $po(\lambda)=\{po(i;\lambda),\ i=0,1,\ldots\}$, $po(i;\lambda)=\lambda^{i}e^{-\lambda}/i!$. The entropy of a discrete random variable $X$ with pmf $f$ is defined as

H(X)=H(f)=\sum_{i}-f_{i}\log f_{i},

and the relative entropy between $X$ (with pmf $f$) and $Y$ (with pmf $g$) is defined as

D(X||Y)=D(f||g)=\sum_{i}f_{i}\log(f_{i}/g_{i}).

For convenience we write $D(X)=D(X||po(\lambda))$ where $\lambda=EX$.
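These quantities are straightforward to evaluate for finite-support pmfs; a minimal Python sketch (natural logarithms throughout, helper names hypothetical):

    from math import exp, factorial, log

    def entropy(f):
        """H(f) = -sum_i f_i log f_i, in nats."""
        return -sum(p * log(p) for p in f if p > 0)

    def poisson_pmf(i, lam):
        """po(i; lambda) = lambda^i e^{-lambda} / i!."""
        return lam ** i * exp(-lam) / factorial(i)

    def rel_entropy_poisson(f):
        """D(f) = D(f || Po(lambda)), with lambda equal to the mean of f."""
        lam = sum(i * p for i, p in enumerate(f))
        return sum(p * log(p / poisson_pmf(i, lam)) for i, p in enumerate(f) if p > 0)

    f = [1/8, 3/8, 3/8, 1/8]                   # Binomial(3, 1/2)
    print(entropy(f), rel_entropy_poisson(f))  # D(f) > 0 since f is not Poisson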

The thinning operation is intimately associated with the Poisson distribution and Poisson convergence theorems. It plays a significant role in the derivation of a maximum entropy property for the Poisson distribution (Johnson [7]). Recently there has been evidence that, in a number of problems related to information theory, the operation $T_{\alpha}$ is the discrete counterpart of the operation of scaling a random variable by $\sqrt{\alpha}$; see [5, 6, 7, 14]. Since scaling arguments can give simple proofs of results such as the Entropy Power Inequality, we believe that improved understanding of the thinning operation could lead to discrete analogues of such results.

For example, thinning lies at the heart of the following result (see [5, 6, 14]), which is a Poisson limit theorem with an information-theoretic interpretation.

Theorem 1 (Law of Thin Numbers)

Let $f$ be a pmf on $\mathbf{Z}_{+}$ with mean $\lambda<\infty$. Denote by $f^{*n}$ the $n$th convolution of $f$, i.e., the pmf of $\sum_{i=1}^{n}X_{i}$ where $X_{i}$ are independent and identically distributed (i.i.d.) with pmf $f$. Then

  1. $T_{1/n}(f^{*n})$ converges point-wise to ${\rm Po}(\lambda)$ as $n\to\infty$;

  2. $H(T_{1/n}(f^{*n}))$ tends to $H(po(\lambda))$ as $n\to\infty$;

  3. as $n\to\infty$, $D(T_{1/n}(f^{*n}))$ monotonically decreases to zero, if it is ever finite;

  4. if $f$ is ultra-log-concave, then $H(T_{1/n}(f^{*n}))$ increases in $n$.

For Part (4), we recall that a random variable $X$ on $\mathbf{Z}_{+}$ is called ultra-log-concave, or ULC, if its pmf $f$ is such that the sequence $i!f_{i},\ i=0,1,\ldots,$ is log-concave. Examples of ULC random variables include the binomial and the Poisson. In general, a sum of independent (but not necessarily identically distributed) Bernoulli random variables is ULC. Informally, a ULC random variable is less “spread out” than a Poisson with the same mean. Note that in Part (4) the ULC assumption is natural since, among ULC distributions with a fixed mean, the Poisson achieves maximum entropy ([7, 14]).
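The ratio characterization of ultra-log-concavity (see Section II) is easy to check mechanically; a small Python sketch (hypothetical helper, assuming the pmf has no internal zeros in its support):

    def is_ulc(f):
        """Check ultra-log-concavity of a pmf f on {0, ..., len(f)-1}:
        equivalently, (i+1) * f[i+1] / f[i] must be non-increasing on the support."""
        ratios = [(i + 1) * f[i + 1] / f[i] for i in range(len(f) - 1) if f[i] > 0]
        return all(r >= s - 1e-12 for r, s in zip(ratios, ratios[1:]))

    print(is_ulc([1/8, 3/8, 3/8, 1/8]))   # Binomial(3, 1/2): True
    print(is_ulc([0.5, 0.1, 0.4]))        # heavy upper tail: False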

Parts (2) and (3) of Theorem 1 (see [5, 6]) resemble the entropic central limit theorem of Barron [2], in that convergence in relative entropy, rather than the usual weak convergence, is established. The monotonicity statements in Parts (3) and (4), proved in [14], can be seen as the discrete analogue of the monotonicity of entropy in the central limit theorem, conjectured by Shannon and proved much later by Artstein et al. [1].
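The monotone decrease in Part (3) is easy to observe numerically: convolve a small ULC pmf with itself, thin by $1/n$, and track the relative entropy to the Poisson limit. A minimal Python sketch with an arbitrary example pmf (an illustration only, not part of the proofs):

    from math import comb, exp, factorial, log

    def thin(f, a):
        return [sum(comb(j, i) * a ** i * (1 - a) ** (j - i) * f[j] for j in range(i, len(f)))
                for i in range(len(f))]

    def conv(f, g):
        h = [0.0] * (len(f) + len(g) - 1)
        for i, x in enumerate(f):
            for j, y in enumerate(g):
                h[i + j] += x * y
        return h

    def D(f):
        """Relative entropy to the Poisson distribution with the same mean."""
        lam = sum(i * p for i, p in enumerate(f))
        po = lambda i: lam ** i * exp(-lam) / factorial(i)
        return sum(p * log(p / po(i)) for i, p in enumerate(f) if p > 0)

    f = [0.5, 0.4, 0.1]                   # a ULC pmf with mean 0.6
    fn = f
    for n in range(1, 9):
        if n > 1:
            fn = conv(fn, f)              # fn is the n-fold convolution f^{*n}
        print(n, D(thin(fn, 1 / n)))      # decreases toward zero, as in Part (3)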

In this work we further explore the behavior of entropy under thinning. Our main result is the following concavity property.

Theorem 2

If $X$ and $Y$ are independent random variables on $\mathbf{Z}_{+}$ with ultra-log-concave pmfs, then

H(T_{\alpha}X+T_{\beta}Y)\geq\alpha H(X)+\beta H(Y),\quad\alpha,\,\beta\geq 0,\ \alpha+\beta\leq 1. (1)
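A quick numerical sanity check of (1) on small ULC examples (arbitrary choices of $f$ and $g$; an illustration, not a proof):

    from math import comb, log

    def thin(f, a):
        return [sum(comb(j, i) * a ** i * (1 - a) ** (j - i) * f[j] for j in range(i, len(f)))
                for i in range(len(f))]

    def conv(f, g):
        h = [0.0] * (len(f) + len(g) - 1)
        for i, x in enumerate(f):
            for j, y in enumerate(g):
                h[i + j] += x * y
        return h

    def H(f):
        return -sum(p * log(p) for p in f if p > 0)

    f = [1/8, 3/8, 3/8, 1/8]    # Binomial(3, 1/2), ULC
    g = [0.3, 0.6, 0.1]         # another ULC pmf
    gaps = [H(conv(thin(f, k / 10), thin(g, 1 - k / 10)))
            - (k / 10 * H(f) + (1 - k / 10) * H(g)) for k in range(11)]
    print(min(gaps))            # nonnegative (up to rounding), as (1) predicts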

Theorem 2 is interesting on two counts. Firstly, it can be seen as an analogue of the inequality

h(\sqrt{\alpha}X+\sqrt{1-\alpha}Y)\geq\alpha h(X)+(1-\alpha)h(Y) (2)

where $X$ and $Y$ are continuous random variables with finite variances and $h$ denotes the differential entropy. The difference between thinning by $\alpha$ in (1) and scaling by $\sqrt{\alpha}$ in (2) is required to control different moments. In the discrete case, the law of small numbers [5] and the corresponding maximum entropy property [7] both require control of the mean, which is achieved by this thinning factor. In the continuous case, the central limit theorem [2] requires control of the variance, which is achieved by this choice of scaling. It is well-known that (2) is a reformulation of Shannon’s entropy power inequality ([12, 3]). Thus Theorem 2 may be regarded as a first step towards a discrete entropy power inequality (see Section IV for further discussion).

Secondly, Theorem 2 is closely related to an open problem of Shepp and Olkin [11] concerning Bernoulli sums. With a slight abuse of notation let $H(a_{1},\ldots,a_{n})$ denote the entropy of the sum $\sum_{i=1}^{n}X_{i}$, where $X_{i}$ is an independent Bernoulli random variable with parameter $a_{i},\ i=1,\ldots,n$.

Conjecture 1 ([11])

The function $H(a_{1},\ldots,a_{n})$ is concave in $(a_{1},\ldots,a_{n})$, i.e.,

H\left(\alpha a_{1}+(1-\alpha)b_{1},\ldots,\alpha a_{n}+(1-\alpha)b_{n}\right)
 \geq\alpha H(a_{1},\ldots,a_{n})+(1-\alpha)H(b_{1},\ldots,b_{n}) (3)

for all $0\leq\alpha\leq 1$ and $a_{i},b_{i}\in[0,1]$.
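For intuition, $H(a_{1},\ldots,a_{n})$ can be computed by convolving Bernoulli pmfs, and (3) can be probed numerically along the segment joining two parameter vectors; a small Python sketch (randomly chosen parameters, illustration only):

    import random
    from math import log

    def bernoulli_sum_pmf(params):
        """Pmf of X_1 + ... + X_n with X_i independent Bernoulli(a_i)."""
        pmf = [1.0]
        for a in params:
            new = [0.0] * (len(pmf) + 1)
            for k, p in enumerate(pmf):
                new[k] += p * (1 - a)
                new[k + 1] += p * a
            pmf = new
        return pmf

    def H(params):
        return -sum(p * log(p) for p in bernoulli_sum_pmf(params) if p > 0)

    random.seed(0)
    a = [random.random() for _ in range(5)]
    b = [random.random() for _ in range(5)]
    for k in range(11):
        t = k / 10
        mix = [t * x + (1 - t) * y for x, y in zip(a, b)]
        print(t, H(mix) - (t * H(a) + (1 - t) * H(b)))   # nonnegative if (3) holds here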

As noted by Shepp and Olkin [11], $H(a_{1},\ldots,a_{n})$ is concave in each $a_{i}$ and is concave in the special case where $a_{1}=\ldots=a_{n}$ and $b_{1}=\ldots=b_{n}$. We provide further evidence supporting Conjecture 1 by proving another special case, which is a consequence of Theorem 2 when applied to Bernoulli sums.

Corollary 1

Relation (3) holds if $a_{i}b_{i}=0$ for all $i$.

Conjecture 1 remains open. We are hopeful, however, that the techniques introduced here could help resolve this long-standing problem.

In Section II we collect some basic properties of thinning and ULC distributions, which are used in the proof of Theorem 2 in Section III. Possible extensions are discussed in Section IV.

II Preliminary observations

Basic properties of thinning include the semigroup relation ([7])

T_{\alpha}(T_{\beta}f)=T_{\alpha\beta}f (4)

and the commuting relation (* denotes convolution)

T_{\alpha}(f*g)=(T_{\alpha}f)*(T_{\alpha}g). (5)

It is (5) that allows us to deduce Corollary 1 from Theorem 2 easily.
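Both identities are simple to confirm numerically; a minimal Python sketch (arbitrary example pmfs, hypothetical helper names):

    from math import comb

    def thin(f, a):
        return [sum(comb(j, i) * a ** i * (1 - a) ** (j - i) * f[j] for j in range(i, len(f)))
                for i in range(len(f))]

    def conv(f, g):
        h = [0.0] * (len(f) + len(g) - 1)
        for i, x in enumerate(f):
            for j, y in enumerate(g):
                h[i + j] += x * y
        return h

    f, g = [0.2, 0.5, 0.3], [0.6, 0.4]
    a, b = 0.7, 0.5
    # Semigroup relation (4): T_a(T_b f) = T_{ab} f
    print(max(abs(x - y) for x, y in zip(thin(thin(f, b), a), thin(f, a * b))))
    # Commuting relation (5): T_a(f * g) = (T_a f) * (T_a g)
    print(max(abs(x - y) for x, y in zip(thin(conv(f, g), a), conv(thin(f, a), thin(g, a)))))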

Concerning the ULC property, three important observations ([7]) are

  1. a pmf $f$ is ULC if and only if the ratio $(i+1)f_{i+1}/f_{i}$ is a decreasing function of $i$;

  2. if $f$ is ULC, then so is $T_{\alpha}f$;

  3. if $f$ and $g$ are ULC, then so is their convolution $f*g$ (both closure properties are illustrated numerically below).
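A small Python check of the two closure properties, using the ratio criterion from observation 1 (arbitrary example pmfs; an illustration, not a proof):

    from math import comb

    def thin(f, a):
        return [sum(comb(j, i) * a ** i * (1 - a) ** (j - i) * f[j] for j in range(i, len(f)))
                for i in range(len(f))]

    def conv(f, g):
        h = [0.0] * (len(f) + len(g) - 1)
        for i, x in enumerate(f):
            for j, y in enumerate(g):
                h[i + j] += x * y
        return h

    def is_ulc(f):
        ratios = [(i + 1) * f[i + 1] / f[i] for i in range(len(f) - 1) if f[i] > 0]
        return all(r >= s - 1e-12 for r, s in zip(ratios, ratios[1:]))

    f = [1/8, 3/8, 3/8, 1/8]    # Binomial(3, 1/2), ULC
    g = [0.3, 0.6, 0.1]         # another ULC pmf
    print(is_ulc(thin(f, 0.4)), is_ulc(conv(f, g)))   # True True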

A key tool for deriving Theorem 2 and related results ([7, 13]) is Chebyshev’s rearrangement theorem, which states that the covariance of two increasing functions of the same random variable is non-negative. In other words, if $X$ is a scalar random variable, and $g$ and $\tilde{g}$ are increasing functions, then (assuming the expectations are finite)

E[g(X)\tilde{g}(X)]\geq Eg(X)E\tilde{g}(X).
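For a pmf with finite support this can be checked by direct computation of the expectations; a tiny Python sketch (arbitrary pmf and functions, illustration only):

    f = [0.1, 0.2, 0.3, 0.25, 0.15]      # pmf of X on {0, ..., 4}
    g = lambda x: x * x                  # increasing on the support
    gt = lambda x: 2 * x + 1             # another increasing function
    E = lambda h: sum(p * h(i) for i, p in enumerate(f))
    print(E(lambda x: g(x) * gt(x)) - E(g) * E(gt))   # nonnegative: Cov(g(X), g~(X)) >= 0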

III Proof of Theorem 2

The basic idea is to use the decomposition

H(X)=-D(X)-L(X)

where as before $D(X)=D(X||po(\lambda))$ with $\lambda=EX$, and $L(X)=E\log(po(X;\lambda))$.
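The decomposition is an algebraic identity: the $\log po(i;\lambda)$ terms in $D(X)$ and $L(X)$ cancel, so that $D(X)+L(X)=\sum_{i}f_{i}\log f_{i}=-H(X)$. A small numerical confirmation in Python (arbitrary pmf, illustration only):

    from math import exp, factorial, log

    f = [0.2, 0.5, 0.2, 0.1]             # an arbitrary pmf with mean lambda = 1.2
    lam = sum(i * p for i, p in enumerate(f))
    po = lambda i: lam ** i * exp(-lam) / factorial(i)
    H = -sum(p * log(p) for p in f if p > 0)
    D = sum(p * log(p / po(i)) for i, p in enumerate(f) if p > 0)
    L = sum(p * log(po(i)) for i, p in enumerate(f))
    print(H + D + L)                     # 0 (up to rounding), i.e. H = -D - L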

The behavior of the relative entropy $D(X)$ under thinning is fairly well understood. In particular, by differentiating $D(T_{\alpha}X)$ with respect to $\alpha$ and then using a data-processing argument, Yu [14] shows that

D(T_{\alpha}X)\leq\alpha D(X). (6)

Further, for any independent $U$ and $V$, the data-processing inequality shows that $D(U+V)\leq D(U)+D(V)$. By taking $U=T_{\alpha}X$ and $V=T_{1-\alpha}Y$, one concludes that

D(T_{\alpha}X+T_{1-\alpha}Y) \leq D(T_{\alpha}X)+D(T_{1-\alpha}Y)
 \leq\alpha D(X)+(1-\alpha)D(Y).

Therefore we only need to prove the corresponding result for $L$, that is

L(T_{\alpha}X+T_{1-\alpha}Y)\leq\alpha L(X)+(1-\alpha)L(Y). (7)

Unfortunately, matters are more complicated because there is no equivalent of the data-processing inequality, i.e., the inequality $L(U+V)\leq L(U)+L(V)$ does not always hold. (Consider for example $U$ and $V$ i.i.d. Bernoulli with parameter $p\in(0,1)$. This inequality then reduces to $2p\leq p^{2}$, which clearly fails for all such $p$.)

Nevertheless, it is possible to establish (7) directly. We illustrate the strategy with a related but simpler result, which involves the equivalent of Equation (6) for $L$.

Proposition 1

For any pmf $f$ on $\mathbf{Z}_{+}$ with mean $\lambda<\infty$, we have $H(T_{\alpha}f)\geq\alpha H(f)$.

Proof:

Let us assume that the support of $f$ is finite; the general case follows by a truncation argument ([14]). In view of (6), we only need to show $l(\alpha)\leq\alpha l(1)$, where

l(\alpha)=L(T_{\alpha}f)=\sum_{i\geq 0}(T_{\alpha}f)_{i}\log\left(po(i;\alpha\lambda)\right).

By substituting $f(\alpha)=0$ in Equation (8) of [7], we obtain that

\frac{{\rm d}(T_{\alpha}f)_{i}}{{\rm d}\alpha}=\frac{i(T_{\alpha}f)_{i}-(i+1)(T_{\alpha}f)_{i+1}}{\alpha},

and hence, using summation by parts,

l^{\prime}(\alpha) =\lambda\log(\alpha\lambda)-\sum_{i\geq 0}\frac{{\rm d}(T_{\alpha}f)_{i}}{{\rm d}\alpha}\log i!
 =\lambda\log(\alpha\lambda)-\frac{1}{\alpha}\sum_{i\geq 0}(i+1)(T_{\alpha}f)_{i+1}\log\left(i+1\right).

In a similar way, using the inequality $\log(1+u)\leq u,\ u>-1$,

l^{\prime\prime}(\alpha) =\frac{\lambda}{\alpha}-\frac{1}{\alpha^{2}}\sum_{i\geq 0}(T_{\alpha}f)_{i+2}(i+2)(i+1)\log\frac{i+2}{i+1}
 \geq\frac{\lambda}{\alpha}-\frac{1}{\alpha^{2}}\sum_{i\geq 0}(T_{\alpha}f)_{i+2}(i+2)(i+1)\frac{1}{i+1}
 =\frac{\lambda}{\alpha}-\frac{1}{\alpha^{2}}\sum_{i\geq 0}(T_{\alpha}f)_{i+2}(i+2)\geq 0.

The last inequality holds since $\sum_{s=0}^{\infty}s(T_{\alpha}f)_{s}=\lambda\alpha$.

Having established the convexity of $l(\alpha)$, and noting that $l(0)=0$ (so that convexity gives $l(\alpha)\leq\alpha l(1)$), we can now deduce the full Proposition using (6). ∎
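The convexity of $l(\alpha)$ can also be observed numerically from second differences on a grid; a short Python sketch with an arbitrary finite-support pmf (an illustration of the claim, not part of the proof):

    from math import comb, exp, factorial, log

    def thin(f, a):
        return [sum(comb(j, i) * a ** i * (1 - a) ** (j - i) * f[j] for j in range(i, len(f)))
                for i in range(len(f))]

    def ell(f, a):
        """l(alpha) = sum_i (T_alpha f)_i log po(i; alpha * lambda)."""
        lam = sum(i * p for i, p in enumerate(f))
        return sum(p * log((a * lam) ** i * exp(-a * lam) / factorial(i))
                   for i, p in enumerate(thin(f, a)) if p > 0)

    f = [0.2, 0.5, 0.2, 0.1]            # an arbitrary finitely supported pmf
    grid = [k / 100 for k in range(1, 100)]
    vals = [ell(f, a) for a in grid]
    # Convexity: second differences on a uniform grid should be nonnegative.
    print(min(vals[k + 1] - 2 * vals[k] + vals[k - 1] for k in range(1, len(vals) - 1)))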

Before proving Theorem 2, we note that although (1) is stated for $\alpha+\beta\leq 1$, only the case $\alpha+\beta=1$ needs to be considered. Indeed, if (1) holds for $\alpha+\beta=1$, then for general $\alpha,\ \beta\geq 0$ such that $\alpha+\beta=\gamma\leq 1$, we have

H(T_{\alpha}X+T_{\beta}Y) =H(T_{\gamma}(T_{\alpha/\gamma}X+T_{\beta/\gamma}Y))
 \geq\gamma H(T_{\alpha/\gamma}X+T_{\beta/\gamma}Y) (8)
 \geq\alpha H(X)+\beta H(Y),

where (4) and (5) are used in the equality, and Proposition 1 is used in (8).

Proof:

Assume $\beta=1-\alpha$, and let $f$ and $g$ denote the pmfs of $X$ and $Y$ respectively. Assume $\lambda=EX>0$ and $\mu=EY>0$ to avoid the trivial case. As noted before, we only need to show that

l(\alpha)=\sum_{i\geq 0}(T_{\alpha}f*T_{\beta}g)_{i}\log po(i;\alpha\lambda+\beta\mu)

is convex in $\alpha$ (where $\beta=1-\alpha$). The calculations are similar to (but more involved than) those for Proposition 1, and we omit the details. The key is to express $l^{\prime\prime}(\alpha)$ in the following form suitable for applying Chebyshev’s rearrangement theorem.

l^{\prime\prime}(\alpha)=\frac{(\lambda-\mu)^{2}}{\alpha\lambda+\beta\mu}+A+B

where

A=\sum_{i\geq 1,j\geq 0}(T_{\alpha}f)_{i}(T_{\beta}g)_{j}\,i\,a(i,j),
B=\sum_{i\geq 0,j\geq 1}(T_{\alpha}f)_{i}(T_{\beta}g)_{j}\,j\,b(i,j),

and

a(i,j) =\left(\frac{i+j-1}{\alpha^{2}}-\frac{\beta\mu j}{(\alpha\lambda+\beta\mu)\alpha^{2}\beta^{2}}\right)\log\frac{i+j-1}{i+j},
b(i,j) =\left(\frac{i+j-1}{\beta^{2}}-\frac{\alpha\lambda i}{(\alpha\lambda+\beta\mu)\alpha^{2}\beta^{2}}\right)\log\frac{i+j-1}{i+j}.

Ultra-log-concavity and dominated convergence permit differentiating term-by-term.

For each fixed $j$, since $(i+j-1)\log((i+j-1)/(i+j))$ decreases in $i$ and $\log((i+j-1)/(i+j))$ increases in $i$, we know that $a(i,j)$ decreases in $i$. Since $T_{\alpha}f$ is ULC, the ratio $i(T_{\alpha}f)_{i}/(T_{\alpha}f)_{i-1}$ is decreasing in $i$. Hence we may apply Chebyshev’s rearrangement theorem to the sum over $i$ and obtain

A =\sum_{i\geq 1,j\geq 0}(T_{\alpha}f)_{i-1}(T_{\beta}g)_{j}\left(\frac{i(T_{\alpha}f)_{i}}{(T_{\alpha}f)_{i-1}}\right)a(i,j)
 \geq\alpha\lambda\sum_{i\geq 1,j\geq 0}(T_{\alpha}f)_{i-1}(T_{\beta}g)_{j}a(i,j)
 =\alpha\lambda\sum_{i,j\geq 0}(T_{\alpha}f)_{i}(T_{\beta}g)_{j}a(i+1,j). (9)

Similarly, considering the sum over $j$, since $b(i,j)$ is decreasing in $j$ for any fixed $i$,

B \geq\beta\mu\sum_{i,j\geq 0}(T_{\alpha}f)_{i}(T_{\beta}g)_{j}b(i,j+1). (10)

Adding up (9) and (10), and noting that

\alpha\lambda a(i+1,j)+\beta\mu b(i,j+1)=\frac{(\lambda-\mu)^{2}}{\alpha\lambda+\beta\mu}(i+j)\log\frac{i+j}{i+j+1},

we get

l^{\prime\prime}(\alpha)\geq \frac{(\lambda-\mu)^{2}}{\alpha\lambda+\beta\mu}
 +\sum_{i,j\geq 0}(T_{\alpha}f)_{i}(T_{\beta}g)_{j}\frac{(\lambda-\mu)^{2}}{\alpha\lambda+\beta\mu}(i+j)\log\frac{i+j}{i+j+1},

which is nonnegative, in view of the inequality $u\log(u/(u+1))\geq-1,\ u\geq 0$. ∎

IV Towards a discrete Entropy Power Inequality

In the continuous case, (2) is quickly shown (see [4]) to be equivalent to Shannon’s entropy power inequality

\exp(2h(X+Y))\geq\exp(2h(X))+\exp(2h(Y)), (11)

valid for independent $X$ and $Y$ with finite variances, with equality if and only if $X$ and $Y$ are normal. We aim to formulate a discrete analogue of (11), with the Poisson distribution playing the same role as the normal since it has the corresponding infinite divisibility and maximum entropy properties.

Observe that the function $\exp(2t)$ appearing in (11) is (proportional to) the inverse of the entropy of the normal with variance $t$. That is, if we write $e(t)=h(N(0,t))=\log(\sqrt{2\pi et})$ then the entropy power $v(X)=e^{-1}(h(X))=\exp(2h(X))/(2\pi e)$, so Equation (11) can be written as

v(X+Y)\geq v(X)+v(Y).

Although there does not exist a corresponding closed-form expression for the entropy of a Poisson random variable, we can denote ${\cal E}(t)=H(po(t))$. Then ${\cal E}(t)$ is increasing and concave. (The proof of Proposition 1, when specialized to the Poisson case, implies this concavity.) Define

V(X)={\cal E}^{-1}(H(X)).

That is, $H(po(V(X)))=H(X)$. It is tempting to conjecture that the natural discrete analogue of Equation (11) is

V(X+Y)\geq V(X)+V(Y),

for independent discrete random variables $X$ and $Y$, with equality if and only if $X$ and $Y$ are Poisson. However, this is not true. A counterexample, provided by an anonymous referee, is the case where $X$ and $Y$ both have the pmf $p(0)=1/6$, $p(1)=2/3$, $p(2)=1/6$. Since this pmf even lies within the ULC class, the conjecture still fails when restricted to this class.
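Since ${\cal E}$ is increasing, $V$ can be computed by bisection, and the counterexample can be examined numerically; a Python sketch (truncated Poisson support, hypothetical helper names, illustration only):

    from math import exp, log

    def entropy(f):
        return -sum(p * log(p) for p in f if p > 0)

    def conv(f, g):
        h = [0.0] * (len(f) + len(g) - 1)
        for i, x in enumerate(f):
            for j, y in enumerate(g):
                h[i + j] += x * y
        return h

    def poisson_entropy(t, terms=200):
        """E(t) = H(Po(t)), computed on a truncated support."""
        p, h = exp(-t), 0.0
        for i in range(terms):
            if p > 0:
                h -= p * log(p)
            p *= t / (i + 1)
        return h

    def V(f, lo=1e-9, hi=20.0):
        """V(X) = E^{-1}(H(X)), found by bisection since E is increasing."""
        target = entropy(f)
        for _ in range(100):
            mid = (lo + hi) / 2
            lo, hi = (mid, hi) if poisson_entropy(mid) < target else (lo, mid)
        return (lo + hi) / 2

    p = [1/6, 2/3, 1/6]               # the referee's counterexample (a ULC pmf)
    print(V(conv(p, p)), 2 * V(p))    # V(X + Y) falls short of V(X) + V(Y)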

We believe that the discrete counterpart of the entropy power inequality should involve the thinning operation described above. If so, the natural conjecture is the following, which we refer to as the thinned entropy power inequality.

Conjecture 2

If $X$ and $Y$ are independent random variables with ULC pmfs on $\mathbf{Z}_{+}$, then ($0<\alpha<1$)

V(T_{\alpha}X+T_{1-\alpha}Y)\geq\alpha V(X)+(1-\alpha)V(Y). (12)

In a similar way to the continuous case, (12) easily yields the concavity of entropy, Equation (1), as a corollary. Indeed, by (12) and the concavity of ${\cal E}(t)$, we have

H(T_{\alpha}X+T_{1-\alpha}Y) \geq{\cal E}(\alpha V(X)+(1-\alpha)V(Y))
 \geq\alpha{\cal E}(V(X))+(1-\alpha){\cal E}(V(Y))
 =\alpha H(X)+(1-\alpha)H(Y)

and (1) follows.

Unlike the continuous case, (1) does not easily yield (12). The key issue is the question of scaling. That is, in the continuous case, the entropy power $v(X)$ satisfies $v(\sqrt{\alpha}X)=\alpha v(X)$ for all $\alpha$ and $X$. It is this result that allows Dembo et al. [4] to deduce (11) from (2).

Such an identity does not hold for thinned random variables. However, we conjecture that

V(T_{\alpha}X)\geq\alpha V(X) (13)

for all $\alpha$ and ULC $X$. Note that this Equation (13), which we refer to as the restricted thinned entropy power inequality (RTEPI), is simply the case $Y=0$ of the full thinned entropy power inequality (12). If (13) holds, we can use the argument provided by [4] to deduce the following result, which is in some sense close to the full thinned entropy power inequality, although $\beta+\gamma<1$ in general.

Proposition 2

Consider independent ULC random variables $X$ and $Y$. For any $\beta,\ \gamma\in(0,1)$ such that

\frac{\beta}{1-\gamma}\leq\frac{V(Y)}{V(X)}\leq\frac{1-\beta}{\gamma},

if the RTEPI (13) holds then

V(T_{\beta}X+T_{\gamma}Y)\geq\beta V(X)+\gamma V(Y).
Proof:

Note that an equivalent formulation of the RTEPI (13) is that if $X^{\prime}$ is Poisson with $H(X)=H(X^{\prime})$ then for any $\alpha\in(0,1)$, $H(T_{\alpha}X)\geq H(T_{\alpha}X^{\prime})$. Given $X$ and $Y$ we define $X^{\prime}$ and $Y^{\prime}$ to be Poisson with $H(X)=H(X^{\prime})$ and $H(Y)=H(Y^{\prime})$.

Given $\beta$ and $\gamma$, we pick $\alpha$ such that $\beta\leq\alpha$ and $\gamma\leq 1-\alpha$ so that:

H(T_{\beta}X+T_{\gamma}Y) =H(T_{\alpha}(T_{\beta/\alpha}X)+T_{1-\alpha}(T_{\gamma/(1-\alpha)}Y)) (14)
 \geq\alpha H(T_{\beta/\alpha}X)+(1-\alpha)H(T_{\gamma/(1-\alpha)}Y)
 \geq\alpha H(T_{\beta/\alpha}X^{\prime})+(1-\alpha)H(T_{\gamma/(1-\alpha)}Y^{\prime})
 =\alpha{\mathcal{E}}(\beta V(X)/\alpha)+(1-\alpha){\mathcal{E}}(\gamma V(Y)/(1-\alpha))

where the first inequality in (14) follows by Theorem 2 and the second inequality follows by the reformulated RTEPI.

Now making the (optimal) choice

\alpha=\beta V(X)/(\beta V(X)+\gamma V(Y))

this inequality becomes

H(T_{\beta}X+T_{\gamma}Y)\geq{\mathcal{E}}(\beta V(X)+\gamma V(Y)).

The result follows by applying ${\mathcal{E}}^{-1}$ to both sides. Note that the restrictions on $\beta$ and $\gamma$ are required to ensure $\beta\leq\alpha$ and $\gamma\leq 1-\alpha$. ∎

Again assuming (13), Proposition 2 yields the following special case of (12). The reason this argument works is that, as in [4], if $X$ is Poisson then (13) holds with equality for all $\alpha$.

Corollary 2

If RTEPI (13) holds then (12) holds in the special case where $X$ is ULC and $Y$ is Poisson with mean $\mu$ such that $\mu\leq V(X)$.

Proof:

For $\gamma\in(0,1)$ let $Z$ be Poisson with mean $\mu(1-\alpha)/\gamma$. Then $V(Z)=\mu(1-\alpha)/\gamma$. The condition $\mu\leq V(X)$ ensures that we can choose $\gamma$ small enough such that

\frac{\alpha}{1-\gamma}\leq\frac{V(Z)}{V(X)}\leq\frac{1-\alpha}{\gamma}.

By Proposition 2,

V(T_{\alpha}X+T_{\gamma}Z)\geq\alpha V(X)+\gamma V(Z).

The claim follows by noting that $T_{\gamma}Z$ has the same Poisson distribution as $T_{1-\alpha}Y$. ∎

We hope to report progress on (12) in future work. Given the fundamental importance of (11), it would also be interesting to see potential applications of (12) (if true) and (1). For example, Oohama [9] used the entropy power inequality (11) to solve the multi-terminal source coding problem. This showed the rate at which information could be transmitted from $L$ sources, producing correlated Gaussian signals but unable to collaborate or communicate with each other, under the addition of Gaussian noise. It would be of interest to know whether (12) could lead to a corresponding result for discrete channels.

Note: Since the submission of this paper to ISIT09, we have found a proof of the restricted thinned entropy power inequality, i.e., Equation (13). The proof, based on [7], is somewhat technical and will be presented in a future work.

References

  • [1] S. Artstein, K. M. Ball, F. Barthe and A. Naor, “Solution of Shannon’s problem on the monotonicity of entropy,” J. Amer. Math. Soc., vol. 17, no. 4, pp. 975–982 (electronic), 2004.
  • [2] A. R. Barron, “Entropy and the central limit theorem,” Ann. Probab., vol. 14, no. 1, pp. 336–342, 1986.
  • [3] N. M. Blachman, “The convolution inequality for entropy powers,” IEEE Trans. Inform. Theory, vol. 11, pp. 267–271, 1965.
  • [4] A. Dembo, T. M. Cover and J. A. Thomas, “Information theoretic inequalities,” IEEE Trans. Inform. Theory, vol. 37, no. 6, pp. 1501–1518, 1991.
  • [5] P. Harremoës, O. Johnson, and I. Kontoyiannis, “Thinning and the law of small numbers,” IEEE Symp. Inf. Theory, Jun. 2007.
  • [6] P. Harremoës, O. Johnson, and I. Kontoyiannis, “Thinning and information projections,” IEEE Symp. Inf. Theory, Jul. 2008.
  • [7] O. Johnson, “Log-concavity and the maximum entropy property of the Poisson distribution,” Stochastic Processes and their Applications, vol. 117, no. 6, pp. 791–802, 2007.
  • [8] I. Kontoyiannis, P. Harremoës, and O. T. Johnson, “Entropy and the law of small numbers,” IEEE Trans. Inform. Theory, vol. 51, no. 2, pp. 466–472, Feb. 2005.
  • [9] Y. Oohama, “Rate-distortion theory for Gaussian multiterminal source coding systems with several side informations at the decoder,” IEEE Trans. Inform. Theory, vol. 51, no. 7, pp. 2577–2593, July 2005.
  • [10] A. Rényi, “A characterization of Poisson processes,” Magyar Tud. Akad. Mat. Kutató Int. Közl., vol. 1, pp. 519–527, 1956.
  • [11] L. A. Shepp and I. Olkin, “Entropy of the sum of independent Bernoulli random variables and of the multinomial distribution,” in Contributions to probability, pp. 201–206, Academic Press, New York, 1981.
  • [12] A. J. Stam, “Some inequalities satisfied by the quantities of information of Fisher and Shannon,” Inform. Contr., vol. 2, no. 2, pp. 101–112, Jun. 1959.
  • [13] Y. Yu, “On the maximum entropy properties of the binomial distribution,” IEEE Trans. Inform. Theory, vol. 54, no. 7, pp. 3351–3353, Jul. 2008.
  • [14] Y. Yu, “Monotonic convergence in an information-theoretic law of small numbers,” Technical report, Department of Statistics, University of California, Irvine, 2008. http://arxiv.org/abs/0810.5203