
On Bahadur Efficiency of Power Divergence Statistics

Peter Harremoës and Igor Vajda

Manuscript submitted February 2010. P. Harremoës is with Copenhagen Business College, Copenhagen, Denmark. I. Vajda is with the Institute of Information Theory and Automation, Prague, Czech Republic.
Abstract

It is proved that the information divergence statistic is infinitely more Bahadur efficient than the power divergence statistics of orders $\alpha>1$ as long as the sequence of alternatives is contiguous with respect to the sequence of null-hypotheses and the number of observations per bin increases to infinity at a rate that is not too slow. This improves the former result in Harremoës and Vajda (2008), where the sequence of null-hypotheses was assumed to be uniform and the restrictions on the numbers of observations per bin were sharper. Moreover, this paper also evaluates the Bahadur efficiency of the power divergence statistics of the remaining positive orders $0<\alpha\leq 1$. The statistics of these orders are mutually Bahadur-comparable and all of them are more Bahadur efficient than the statistics of orders $\alpha>1$. A detailed discussion of the technical definitions and conditions is given, some unclear points are resolved, and the results are illustrated by examples.

Index Terms:
Bahadur efficiency, consistency, power divergence, Rényi divergence.

I Introduction

Problems of detection, classification and identification are often solved by the method of testing statistical hypotheses. Consider signals $Y_{1},Y_{2},\ldots,Y_{n}$ collected from a random source independently at time instants $i=1,2,\ldots,n$. Signal processing usually requires digitization based on appropriate quantization. Quantization of the signal space $\mathcal{Y}$ into $k$ disjoint cells (or bins) $\mathcal{Y}_{n1},\mathcal{Y}_{n2},\ldots,\mathcal{Y}_{nk}$ reduces the signals $Y_{1},Y_{2},\ldots,Y_{n}$ to simple $k$-valued indicators $I_{n}(Y_{1}),I_{n}(Y_{2}),\ldots,I_{n}(Y_{n})$ of their covering cells. Various hypotheses about the data source, represented by probability measures $Q_{n}$ on $\mathcal{Y}$, are transformed by the quantization into discrete probability distributions

Qn=(qn1=Q(𝒴n1),,qnk=Q(𝒴nk))Q_{n}=\left(q_{n1}=Q(\mathcal{Y}_{n1}),...,q_{nk}=Q(\mathcal{Y}_{nk})\right)

on the quantization cells, where $q_{nj}>0$ for every quantization cell. These hypothetical distributions need not be the same as the true distributions $P_{n}=(p_{n1}=P(\mathcal{Y}_{n1}),\ldots,p_{nk}=P(\mathcal{Y}_{nk}))$. The latter distributions are usually unknown but, by the law of large numbers, they can be approximated by the empirical distributions (vectors of relative cell frequencies)

$\hat{P}_{n}=\left(\hat{p}_{n1}=\frac{X_{n1}}{n},\ldots,\hat{p}_{nk}=\frac{X_{nk}}{n}\right)=\frac{\mathbf{X}_{n}}{n}$ (1)

where $X_{nj}$ is the number of the signals $Y_{1},Y_{2},\ldots,Y_{n}$ falling in $\mathcal{Y}_{nj}$. Formally,

Xnj=i=1n1{Yi𝒴nj}=i=1n1{In(Yi)=j}, 1jkX_{nj}={\textstyle\sum\limits_{i=1}^{n}}1_{\left\{Y_{i}\in\mathcal{Y}_{nj}\right\}}={\textstyle\sum\limits_{i=1}^{n}}1_{\left\{I_{n}(Y_{i})=j\right\}},\text{ \ \ }1\leq j\leq k (2)

where 1A1_{A} denotes the indicator of the event AA. The problem is to decide whether the signals Y1,Y2,,YnY_{1},Y_{2},...,Y_{n} are generated by the source (𝒴,Q)(\mathcal{Y},Q) on the basis of the distributions P^n,Qn\hat{P}_{n},Q_{n}. A classical method for solving this problem is the method of testing statistical hypotheses in the spirit of Fisher, Neyman and Pearson. In our case the hypothesis is

:Pn=Qn{\mathcal{H}}:P_{n}=Q_{n} (3)

and the decision is based either on the likelihood ratio statistic

T^1,n=2j=1kXnjlnXnjnqnj\hat{T}_{1,n}=2{\textstyle\sum\limits_{j=1}^{k}}X_{nj}\ln\frac{X_{nj}}{nq_{nj}} (4)

or the Pearson χ2\chi^{2}-statistic

T^2,n=j=1k(Xnjnqnj)2nqnj\hat{T}_{2,n}={\textstyle\sum\limits_{j=1}^{k}}\frac{(X_{nj}-nq_{nj})^{2}}{nq_{nj}} (5)

in the sense that the hypothesis is rejected when the statistic is large, where "large" depends on the required decision error or risk [1].
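To make the counting step (2) and the two classical statistics (4), (5) concrete, here is a minimal numerical sketch (our illustration, not part of the original development); the equal-width partition of $[0,1)$ and the beta-distributed signals are arbitrary choices made only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setting: signals on [0, 1), quantized into k equal-width cells.
n, k = 2000, 20
edges = np.linspace(0.0, 1.0, k + 1)      # boundaries of the cells Y_n1, ..., Y_nk
Y = rng.beta(1.2, 1.0, size=n)            # signals from some source under test

# Counting step (2): X_nj = number of signals falling into cell j.
X = np.histogram(Y, bins=edges)[0]

Q = np.full(k, 1.0 / k)                   # hypothetical cell probabilities q_nj > 0

# Likelihood ratio statistic (4); empty cells contribute 0 by the 0 ln 0 = 0 convention.
m = X > 0
T1 = 2.0 * np.sum(X[m] * np.log(X[m] / (n * Q[m])))

# Pearson chi-square statistic (5).
T2 = np.sum((X - n * Q) ** 2 / (n * Q))

print(f"T1 = {T1:.3f}, T2 = {T2:.3f}")    # H is rejected when these are large
```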

It is easy to see (cf. (13), (14) below) that the classical test statistics (4), (5) are of the form

T^α,n=2nD^α,n=def2nDα(P^n,Qn),α{1,2}\hat{T}_{\alpha,n}=2n\hat{D}_{\alpha,n}\overset{\text{def}}{=}2nD_{\alpha}\left(\hat{P}_{n},Q_{n}\right),\quad\alpha\in\{1,2\} (6)

where Dα(P,Q)D_{\alpha}\left(P,Q\right) for arbitrary α>0\alpha>0 and distributions P=(p1,,pk)P=(p_{1},...,p_{k}), Q=(q1,,qk)Q=(q_{1},...,q_{k}) denotes the divergence Dϕα(P,Q)D_{\phi_{\alpha}}\left(P,Q\right) of Csiszár [2] for the power function

ϕα(t)=tαα(t1)1α(α1) when α1\phi_{\alpha}(t)={\frac{t^{\alpha}-\alpha(t-1)-1}{\alpha(\alpha-1)}}\text{ \ when \ \ }\alpha\neq 1 (7)

and

ϕ1(t)=limα1ϕα(t)=tlntt+1.\phi_{1}(t)=\lim_{\alpha\rightarrow 1}\phi_{\alpha}(t)=t\ln t-t+1. (8)

The power divergences

Dα(P,Q)=1α(α1)(j=1kpjαqj1α1) α1D_{\alpha}\left(P,Q\right)=\frac{1}{\alpha(\alpha-1)}\left({\textstyle\sum\limits_{j=1}^{k}}p_{j}^{\alpha}q_{j}^{1-\alpha}-1\right)\text{ \ \ }\alpha\neq 1 (9)

or the one-one related Rényi divergences [3]

Dα(PQ)=1α1lnj=1kpjαqj1α α1D_{\alpha}\left(P\|Q\right)=\frac{1}{\alpha-1}\ln{\textstyle\sum\limits_{j=1}^{k}}p_{j}^{\alpha}q_{j}^{1-\alpha}\text{ \ \ }\alpha\neq 1 (10)

with the common information divergence limit

D1(P,Q)=D1(PQ)=j=1kpjlnpjqjD_{1}\left(P,Q\right)=D_{1}\left(P\|Q\right)={\textstyle\sum\limits_{j=1}^{k}}p_{j}\ln\frac{p_{j}}{q_{j}} (11)

are often applied in various areas of information theory. In the present context of detection and identification one can mention e.g. the work of Kailath [4], who used the Bhattacharyya distance

B(P,Q)=lnj=1k(pjqj)1/2=12D1/2(PQ)B\left(P,Q\right)=-\ln{\textstyle\sum\limits_{j=1}^{k}}\left(p_{j}q_{j}\right)^{1/2}=\frac{1}{2}D_{1/2}\left(P\|Q\right)

which is one-one related to the Hellinger divergence.
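These divergences are straightforward to evaluate numerically. The sketch below (our illustration; the function names and the toy distributions are ours) computes the power divergences (9), (11) and the Rényi divergences (10), and checks the stated one-one relation between the Bhattacharyya distance and $D_{1/2}(P\|Q)$.

```python
import numpy as np

def power_divergence(P, Q, alpha):
    """Power divergence D_alpha(P, Q) of (9), with the information
    divergence (11) as the limit at alpha = 1; assumes all q_j > 0."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    if alpha == 1.0:
        m = P > 0
        return float(np.sum(P[m] * np.log(P[m] / Q[m])))
    return float(np.sum(P ** alpha * Q ** (1.0 - alpha)) - 1.0) / (alpha * (alpha - 1.0))

def renyi_divergence(P, Q, alpha):
    """Rényi divergence D_alpha(P||Q) of (10), with the same limit at alpha = 1."""
    if alpha == 1.0:
        return power_divergence(P, Q, 1.0)
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    return float(np.log(np.sum(P ** alpha * Q ** (1.0 - alpha)))) / (alpha - 1.0)

P = np.array([0.5, 0.3, 0.2])
Q = np.array([1 / 3, 1 / 3, 1 / 3])

# Bhattacharyya distance B(P, Q) = -ln sum_j (p_j q_j)^(1/2) = D_{1/2}(P||Q) / 2.
B = -np.log(np.sum(np.sqrt(P * Q)))
assert np.isclose(B, 0.5 * renyi_divergence(P, Q, 0.5))

# The values are nondecreasing in alpha (cf. Lemma 10 below).
print([round(renyi_divergence(P, Q, a), 5) for a in (0.5, 1.0, 2.0)])
```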

In practical applications it is important to use the statistic $\hat{D}_{\alpha_{\text{opt}},n}$ which is optimal in a sufficiently wide class of divergence statistics $\hat{D}_{\alpha,n}$ containing the standard statistical proposals $\hat{D}_{1,n}$ and $\hat{D}_{2,n}$ appearing in (6). We addressed this problem previously [5, 6, 7]. Our solution confirmed the classical statistical result of Quine and Robinson [8], who proved that the likelihood ratio statistic $\hat{D}_{1,n}$ is more efficient in the Bahadur sense than the $\chi^{2}$-statistic $\hat{D}_{2,n}$, and extended the results of Beirlant et al. [9] and Györfi et al. [10] dealing with the Bahadur efficiency of several selected power divergence statistics. Namely, we evaluated the Bahadur efficiencies of the statistics $\hat{D}_{\alpha,n}$ in the domain $\alpha\geq 1$ for numbers $k=k_{n}$ of quantization cells slowly increasing with $n$, when the hypothetical distributions $Q_{n}$ are uniform and the alternative distributions $P_{n}$ are contiguous in the sense that $\lim_{n\rightarrow\infty}D_{\alpha}\left(P_{n},Q_{n}\right)$ exists, and identifiable in the sense that this limit is positive. We found that the Bahadur efficiencies decrease with the power parameter in the whole domain $\alpha\geq 1$. In the present paper we sharpen this result by relaxing the conditions on the rate of $k_{n}$ and extend it considerably by admitting non-uniform hypothetical distributions $Q_{n}$ and by evaluating the Bahadur efficiencies also in the domain $0<\alpha\leq 1$.

II Basic model

Let M(k)M(k) denote the set of all probability distributions P=(pj:1jk)P=(p_{j}:1\leq j\leq k) and

M(k|n)={PM(k):nP{0,1,}k}M(k|n)=\left\{P\in M(k):nP\in\{0,1,\ldots\}^{k}\right\}

its subset called the set of types in information theory. We consider hypothetical distributions Qn=(qnj:1jk)M(k)Q_{n}=(q_{nj}:1\leq j\leq k)\in M(k) restricted by the condition qnj>0q_{nj}>0 and arbitrary alternative distributions Pn=(pnj:1jk)M(k).P_{n}=(p_{nj}:1\leq j\leq k)\in M(k). The {0,1,}k\{0,1,\ldots\}^{k}-valued frequency counts 𝐗n\mathbf{X}_{n} with coordinates introduced in (2) are multinomially distributed in the sense

𝐗nMultk(n,Pn),n=1,2,.\mathbf{X}_{n}\sim\mbox{Mult}_{k}(n,P_{n}),n=1,2,\ldots. (12)

Important components of the model are the empirical distributions $\widehat{P}_{n}\in M(k|n)$ defined by (1). Finally, for arbitrary $P\in M(k)$ and arbitrary $Q\in M(k)$ with positive coordinates we consider the power divergences (9) and (11). For their properties we refer to [11, 12, 13]. In particular, for the empirical and hypothetical distributions $\hat{P}_{n},Q_{n}$ we consider the power divergence statistics $\widehat{D}_{\alpha,n}=D_{\alpha}\left(\hat{P}_{n},Q_{n}\right)$ (cf. (6)) defined by (9) and (11) for all $\alpha>0$.

Example 1

For α=2,\alpha=2, α=1\alpha=1 and α=1/2\alpha=1/2 we get the special power divergence statistics

$\widehat{D}_{2,n}={\frac{1}{2}}\sum_{j=1}^{k}{\frac{(\widehat{p}_{nj}-q_{nj})^{2}}{q_{nj}}}={\frac{1}{2n}}\hat{T}_{2,n},$ (13)
$\widehat{D}_{1,n}=\sum_{j=1}^{k}\widehat{p}_{nj}\ln{\frac{\widehat{p}_{nj}}{q_{nj}}}={\frac{1}{2n}}\hat{T}_{1,n},$ (14)
$\widehat{D}_{1/2,n}=2\sum_{j=1}^{k}\left(\widehat{p}_{nj}^{1/2}-q_{nj}^{1/2}\right)^{2}.$ (15)

For testing the hypothesis $\mathcal{H}$ of (3) one usually uses the re-scaled versions

T^α,n=2nD^α,n\widehat{T}_{\alpha,n}=2n\widehat{D}_{\alpha,n} (16)

distributed under $\mathcal{H}$ asymptotically $\chi^{2}$ with $k-1$ degrees of freedom if $k$ is constant, and asymptotically normal if $k=k_{n}$ slowly increases to infinity [14, 15, and references therein]. The statistics (13) and (14) rescaled in this manner were already mentioned in (5) and (4). The Hellinger divergence statistic in (15), rescaled by $2n$, is known as the Freeman–Tukey statistic

T^1/2,n=2nD^1/2,n=4j=1k((Xnj)1/2(nqnj)1/2)2.\widehat{T}_{1/2,n}=2n\widehat{D}_{1/2,n}=4\sum_{j=1}^{k}(\left(X_{nj}\right)^{1/2}-\left(nq_{nj}\right)^{1/2})^{2}. (17)
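As a quick numerical sanity check (ours, not from the paper), the following sketch verifies on a toy count vector that the rescaled statistic (16) reproduces the Pearson form (5) for $\alpha=2$ and the Freeman–Tukey form (17) for $\alpha=1/2$.

```python
import numpy as np

def T_hat(X, Q, alpha):
    """Rescaled statistic (16): T_alpha = 2 n D_alpha(P_hat, Q); X are cell counts."""
    n = X.sum()
    P = X / n
    if alpha == 1.0:                      # likelihood ratio statistic (4)
        m = X > 0
        return 2.0 * np.sum(X[m] * np.log(X[m] / (n * Q[m])))
    return 2.0 * n * (np.sum(P ** alpha * Q ** (1 - alpha)) - 1) / (alpha * (alpha - 1))

X = np.array([12.0, 7.0, 20.0, 11.0])     # toy counts, n = 50
Q = np.full(4, 0.25)
n = X.sum()

assert np.isclose(T_hat(X, Q, 2.0), np.sum((X - n * Q) ** 2 / (n * Q)))                 # (5)
assert np.isclose(T_hat(X, Q, 0.5), 4.0 * np.sum((np.sqrt(X) - np.sqrt(n * Q)) ** 2))   # (17)
print(T_hat(X, Q, 0.5), T_hat(X, Q, 1.0), T_hat(X, Q, 2.0))
```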

Convention

Unless the hypothesis {\mathcal{H}} is explicitly assumed, the random variables, convergences and asymptotic relations are considered under the alternative 𝒜{\mathcal{A}}. Further, unless otherwise explicitly stated, the asymptotic relations are considered for nn\longrightarrow\infty and the symbols of the type

sns and sn(𝐗n)𝑝ss_{n}\longrightarrow s\text{ \ \ and \ \ }s_{n}(\mathbf{X}_{n})\overset{p}{\longrightarrow}s

denote the ordinary numerical convergence and the stochastic convergence in probability for nn\longrightarrow\infty.

In this paper we consider the following assumptions.

A1:

The number of cells k=knnk=k_{n}\leq n of the distributions from M(k),M(k|n)M(k),\,M(k|n) depends on the sample size nn and increases to infinity. In the rest of the paper the subscript nn is suppressed in the symbols containing kk.

A2:

The hypothetical distributions Qn=(qnj>0:1jk)Q_{n}=(q_{nj}>0:1\leq j\leq k) are regular in the sense that maxjqnj0\max_{j}q_{nj}\rightarrow 0 for nn\rightarrow\infty and that there exists ϱ>0\varrho>0 such that

qnj>ϱk for all 1jk and n=1,2, .q_{nj}>\frac{\varrho}{k}\text{ \ for all }1\leq j\leq k\text{\ and }n=1,2,\ldots\text{ .} (18)
A3α\alpha:

The alternative $\mathcal{A}:(P_{n}:n=1,2,\ldots)$ is identifiable in the sense that there exists $0<\Delta_{\alpha}<\infty$ such that

Dα,n=defDα(Pn,Qn)Δα under 𝒜.D_{\alpha,n}\overset{def}{=}D_{\alpha}(P_{n},Q_{n})\longrightarrow\Delta_{\alpha}\text{\ \ under }{\mathcal{A}}. (19)

Under A2

lnqnj<lnkϱ and ln2qnj<ln2kϱ.-\ln q_{nj}<\ln\frac{k}{\varrho}\text{ \ \ and \ \ }\ln^{2}q_{nj}<\ln^{2}\frac{k}{\varrho}. (20)

Further, the logical complement to the hypothesis $\mathcal{H}$ is the alternative, denoted by $\mathcal{A}$. By (3), under $\mathcal{A}$ the alternative distributions $P_{n}$ differ from $Q_{n}$. Assumption A3$\alpha$ means that the alternative distributions are neither too close to nor too distant from $Q_{n}$ in the sense of the $D_{\alpha}$-divergence for the given $\alpha>0$. Since for all $n=1,2,\ldots$

Dα,n=Dα(Qn,Qn)0so that Δα=0 under D_{\alpha,n}=D_{\alpha}(Q_{n},Q_{n})\equiv 0\ \ \text{so\ that }\Delta_{\alpha}=0\text{\ \ under }{\mathcal{H}}

it is clear that the alternative $\mathcal{A}$ is under A1, A2, A3$\alpha$ distinguished from the hypothesis $\mathcal{H}$ by achieving a positive $D_{\alpha}$-divergence limit $\Delta_{\alpha}$. In what follows we use the abbreviated notations

$\text{A}(\alpha)=\left\{\text{A1},\text{A2},\text{A3}\alpha\right\},$ (21)
$\text{A}(\alpha_{1},\alpha_{2})=\left\{\text{A1},\text{A2},\text{A3}\alpha_{1},\text{A3}\alpha_{2}\right\}$ (22)

for the combinations of assumptions.

Definition 2

Under A(α\alpha) we say that the statistic D^α,n\widehat{D}_{\alpha,n} is consistent with parameter Δα\Delta_{\alpha} appearing in (19) if

D^α,n𝑝Δα under 𝒜\widehat{D}_{\alpha,n}\overset{p}{\longrightarrow}\Delta_{\alpha}\text{ \ \ under }{\mathcal{A}} (23)

and

D^α,n𝑝0 under \widehat{D}_{\alpha,n}\overset{p}{\longrightarrow}0\text{ \ \ under }{\mathcal{H}} (24)

i.e. if $\widehat{D}_{\alpha,n}\overset{p}{\longrightarrow}\Delta_{\alpha}$ under both $\mathcal{A}$ and $\mathcal{H}$ (with $\Delta_{\alpha}=0$ under $\mathcal{H}$). If (24) is replaced by the stronger condition that the expectation $\mathsf{E}\,\widehat{D}_{\alpha,n}$ tends to zero under $\mathcal{H}$, in symbols

E[D^α,n|]0,\text{{E}}\left[\left.\widehat{D}_{\alpha,n}\right|{\mathcal{H}}\right]\longrightarrow 0, (25)

then $\widehat{D}_{\alpha,n}$ is said to be strongly consistent.

Definition 3

We say that the statistic $\widehat{D}_{\alpha,n}$ is Bahadur stable if there is a continuous function $\varrho_{\alpha}:\left]0,\infty\right[^{2}\rightarrow\left]0,\infty\right[$ such that the probability of error function

𝖾α,n(Δ)=𝖯(D^α,n>Δ|), Δ>0\mathsf{e}_{\alpha,n}(\Delta)=\mathsf{P}\left(\left.\widehat{D}_{\alpha,n}>\Delta\right|{\mathcal{H}}\right),\text{ \ \ }\Delta>0 (26)

corresponding to the test rejecting {\mathcal{H}} when D^α,n>Δ\widehat{D}_{\alpha,n}>\Delta satisfies for all Δ1,Δ2>0\Delta_{1},\Delta_{2}>0 the relation

ln𝖾α,n(Δ1)ln𝖾α,n(Δ2)ϱα(Δ1,Δ2).\frac{\ln\mathsf{e}_{\alpha,n}(\Delta_{1})}{\ln\mathsf{e}_{\alpha,n}(\Delta_{2})}\longrightarrow\varrho_{\alpha}\left(\Delta_{1},\Delta_{2}\right).

If this condition holds then ϱα\varrho_{\alpha} is called the Bahadur relative function.

Obviously, the Bahadur relative functions are multiplicative in the sense

ϱα(Δ1,Δ2)ϱα(Δ2,Δ3)=ϱα(Δ1,Δ3).\varrho_{\alpha}\left(\Delta_{1},\Delta_{2}\right)\varrho_{\alpha}\left(\Delta_{2},\Delta_{3}\right)=\varrho_{\alpha}\left(\Delta_{1},\Delta_{3}\right).

Statistics that are Bahadur stable have the nice property that the asymptotic behavior of the error function $\mathsf{e}_{\alpha,n}(\Delta)$ is determined by its behavior for just a single argument $\Delta^{\ast}>0$. Indeed, if $\widehat{D}_{\alpha,n}$ is Bahadur stable and if we define for a fixed $\Delta^{\ast}>0$ the sequence

cα(n)=nln𝖾α,n(Δ)c_{\alpha}^{\ast}(n)=-\frac{n}{\ln\mathsf{e}_{\alpha,n}(\Delta^{\ast})} (27)

then for all Δ>0\Delta>0

cα(n)nln𝖾α,n(Δ)ϱα(Δ,Δ) for all Δ>0.-{\frac{c_{\alpha}^{\ast}(n)}{n}}\ln\mathsf{e}_{\alpha,n}(\Delta)\longrightarrow\varrho_{\alpha}\left(\Delta,\Delta^{\ast}\right)\text{ \ \ for all }\Delta>0.

Moreover, if the expressions $-{c_{\alpha}(n)/n}\,\ln\mathsf{e}_{\alpha,n}(\Delta)$ converge for a sequence $c_{\alpha}(n)$, then the ratio $c_{\alpha}(n)/c_{\alpha}^{\ast}(n)$ tends to a constant.
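The behavior that Bahadur stability formalizes can be watched by simulation. The sketch below (our illustration; $k=5$, uniform $Q_{n}$ and $\Delta=0.02$ are arbitrary toy choices) estimates the error function $\mathsf{e}_{1,n}(\Delta)$ of (26) by Monte Carlo and prints $-\frac{1}{n}\ln\mathsf{e}_{1,n}(\Delta)$, which should settle toward a limit as $n$ grows for fixed $\Delta$.

```python
import numpy as np

rng = np.random.default_rng(1)

def error_prob(n, k, Delta, reps=50_000):
    """Monte Carlo estimate of e_{1,n}(Delta) = P(D_hat_{1,n} > Delta | H), cf. (26)."""
    Q = np.full(k, 1.0 / k)
    P = rng.multinomial(n, Q, size=reps) / n        # empirical distributions under H
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(P > 0, P * np.log(P / Q), 0.0)
    D = terms.sum(axis=1)                           # D_hat_{1,n} for each replication
    return (D > Delta).mean()

for n in (100, 200, 400):
    e = error_prob(n, k=5, Delta=0.02)
    print(n, e, -np.log(e) / n if e > 0 else float("nan"))
```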

Motivation of the next definition

Suppose that condition A($\alpha_{1},\alpha_{2}$) holds and denote, for each $\alpha\in\{\alpha_{1},\alpha_{2}\}$ and $n=1,2,\ldots$, by $\Delta_{\alpha}+\varepsilon_{\alpha,n}$ the critical value of the statistic $\widehat{D}_{\alpha,n}$ leading to the rejection of $\mathcal{H}$ with a fixed power $0<\mathsf{p}<1$. In other words, let

𝗉=P(D^α,n>Δα+εα,n) for all n=1,2, \mathsf{p}=\text{{P}}\left(\widehat{D}_{\alpha,n}>\Delta_{\alpha}+\varepsilon_{\alpha,n}\right)\text{ \ \ for all }n=1,2,\ldots\text{ }

where the sequence $\varepsilon_{\alpha,n}=\varepsilon_{\alpha,n}(\mathsf{p})$ depends on the fixed $\mathsf{p}$. Since the assumed consistency of $\widehat{D}_{\alpha,n}$ implies that $\varepsilon_{\alpha,n}$ tends to zero, the corresponding error probabilities $\mathsf{e}_{\alpha,n}(\Delta_{\alpha}+\varepsilon_{\alpha,n})=\mathsf{P}\left(\left.\widehat{D}_{\alpha,n}>\Delta_{\alpha}+\varepsilon_{\alpha,n}\right|\mathcal{H}\right)$ can be approximated by $\mathsf{e}_{\alpha,n}(\Delta_{\alpha})=\mathsf{P}\left(\left.\widehat{D}_{\alpha,n}>\Delta_{\alpha}\right|\mathcal{H}\right)$. By (33) below,

cα(n)nln𝖾α,n(Δα)gα(Δα).-{\frac{c_{\alpha}(n)}{n}}\ln\mathsf{e}_{\alpha,n}(\Delta_{\alpha})\longrightarrow g_{\alpha}(\Delta_{\alpha}).

Hence the error 𝖾α1,n(Δα1)\mathsf{e}_{\alpha_{1},n}(\Delta_{\alpha_{1}}) of the statistic D^α1,n\widehat{D}_{\alpha_{1},n} tends to zero with the same exponential rate as 𝖾α2,mn(Δα2)\mathsf{e}_{\alpha_{2},m_{n}}(\Delta_{\alpha_{2}}) achieved by D^α2,mn\widehat{D}_{\alpha_{2},m_{n}} for possibly different sample sizes mnnm_{n}\neq n with the property mnm_{n}\longrightarrow\infty if the corresponding error exponents

gα1(Δα1)ncα1(n) and gα2(Δα2)mncα2(mn)g_{\alpha_{1}}(\Delta_{\alpha_{1}})\frac{n}{c_{\alpha_{1}}(n)}\text{ \ \ and \ \ }g_{\alpha_{2}}(\Delta_{\alpha_{2}})\frac{m_{n}}{c_{\alpha_{2}}(m_{n})} (28)

tend to infinity with the same rate in the sense

mncα2(mn)=gα1(Δα1)gα2(Δα2).ncα1(n)(1+o(1)).\frac{m_{n}}{c_{\alpha_{2}}(m_{n})}=\frac{g_{\alpha_{1}}(\Delta_{\alpha_{1}})}{g_{\alpha_{2}}(\Delta_{\alpha_{2}})}.\frac{n}{c_{\alpha_{1}}(n)}\left(1+o(1)\right). (29)

The sample sizes mnm_{n} and nn needed by the statistics D^α2,n\widehat{D}_{\alpha_{2},n} and D^α1,n\widehat{D}_{\alpha_{1},n} to achieve the same rate of convergence of errors are thus mutually related by the formula

mnn= gα1(Δα1)gα2(Δα2).cα2(mn)cα1(n)(1+o(1)).{\frac{m_{n}}{n}=}\text{ }\frac{g_{\alpha_{1}}(\Delta_{\alpha_{1}})}{g_{\alpha_{2}}(\Delta_{\alpha_{2}})}.\frac{c_{\alpha_{2}}(m_{n})}{c_{\alpha_{1}}(n)}\left(1+o(1)\right). (30)

Obviously, the statistic D^α1,n\widehat{D}_{\alpha_{1},n} is asymptotically less or more efficient than D^α2,n\widehat{D}_{\alpha_{2},n} if the ratio mn/nm_{n}/n of sample sizes needed to achieve the same rate of convergence of errors to zero tends to a constant larger or smaller than 11. This motivates the following definition which refers to the typical convergent situation

cα2(mn)cα1(n) cα2/α1 for some 0cα2/α1.{\frac{c_{\alpha_{2}}(m_{n})}{c_{\alpha_{1}}(n)}\longrightarrow}\text{ }c_{\alpha_{2}/\alpha_{1}}\text{ \ \ for some }0\leq c_{\alpha_{2}/\alpha_{1}}\leq\infty.\vskip 6.0pt plus 2.0pt minus 2.0pt (31)
Definition 4

If there is a continuous function

gα:]0,[]0,[g_{\alpha}:\left]0,\infty\right[\mathcal{\ }{\mathcal{\rightarrow}}\left]0,\infty\right[

and a sequence $c_{\alpha}(n)$ such that the error function

$\mathsf{e}_{\alpha,n}(x)=\mathsf{P}\left(\left.\widehat{D}_{\alpha,n}>x\right|\mathcal{H}\right),\quad x>0$ (32)

satisfies for all x>0x>0 the relation

cα(n)nln𝖾α,n(x)gα(x)-{\frac{c_{\alpha}(n)}{n}}\ln\mathsf{e}_{\alpha,n}(x)\longrightarrow g_{\alpha}(x) (33)

then $g_{\alpha}$ is called the Bahadur function of the statistic $\widehat{D}_{\alpha,n}$ generated by $c_{\alpha}(n)$. If (33) is replaced by the condition

cα(n)nln𝖾α,n(x+εn)gα(x) for arbitrary εn0-{\frac{c_{\alpha}(n)}{n}}\ln\mathsf{e}_{\alpha,n}(x+\varepsilon_{n})\longrightarrow g_{\alpha}(x)\text{ \ \ for arbitrary }\varepsilon_{n}\longrightarrow 0 (34)

then the function gαg_{\alpha} is strongly Bahadur.

Definition 5

Let us assume that A($\alpha_{1},\alpha_{2}$) holds and that for each $\alpha\in\{\alpha_{1},\alpha_{2}\}$ the statistic $\widehat{D}_{\alpha,n}$ is consistent with parameter $\Delta_{\alpha}$ and has a Bahadur function $g_{\alpha}$ generated by a sequence $c_{\alpha}(n)$ such that (31) is satisfied. Then the Bahadur efficiency of $\widehat{D}_{\alpha_{1},n}$ with respect to $\widehat{D}_{\alpha_{2},n}$ is the number from the interval $[0,\infty]$ defined by the formula

BE(D^α1,n; D^α2,n)=gα1(Δα1)gα2(Δα2).cα2/α1.\mbox{BE}\left(\widehat{D}_{\alpha_{1},n}\,;\text{ }\widehat{D}_{\alpha_{2},n}\right)={\frac{g_{\alpha_{1}}(\Delta_{\alpha_{1}})}{g_{\alpha_{2}}(\Delta_{\alpha_{2}})}}.c_{\alpha_{2}/\alpha_{1}}. (35)

Hereafter we shall consider also the slightly modified concept of Bahadur efficiency.

Definition 6

Let, in addition to the assumptions of Definition 5, the statistics $\widehat{D}_{\alpha_{1},n}$, $\widehat{D}_{\alpha_{2},n}$ be strongly consistent and the functions $g_{\alpha_{1}}$, $g_{\alpha_{2}}$ be strongly Bahadur. Then the Bahadur efficiency (35) is said to be Bahadur efficiency in the strong sense.

Motivation of Definition 6

Let the assumptions of this definition hold. Then for each $\alpha\in\{\alpha_{1},\alpha_{2}\}$ and $u>0$ the function

$L_{\alpha,n}(u)=\mathsf{P}\left(\left.{\widehat{T}}_{\alpha,n}-\mathsf{E}\left[\left.{\widehat{T}}_{\alpha,n}\right|{\mathcal{H}}\right]\geq u\right|{\mathcal{H}}\right)$ (cf. (26))

denotes the level of the error of the statistic

T^α,n𝖤[T^α,n|]2n(D^α,n𝖤[D^α,n|]){\widehat{T}}_{\alpha,n}-\mathsf{E}\left[\left.{\widehat{T}}_{\alpha,n}\right|{\mathcal{H}}\right]\equiv 2n\left(\widehat{D}_{\alpha,n}-\mathsf{E}\left[\left.\widehat{D}_{\alpha,n}\right|{\mathcal{H}}\right]\right)\

for critical value u>0u>0. By the assumed strong consistency of D^α,n,\widehat{D}_{\alpha,n},

$\frac{\mathsf{E}\left[\left.\widehat{T}_{\alpha,n}\right|{\mathcal{H}}\right]}{2n}\longrightarrow 0$ (cf. (25)).

This means that the sequence cα(n)c_{\alpha}(n) generating the strongly Bahadur gαg_{\alpha} satisfies for all t>0t>0 the relation

cα(n)nlnP(T^α,nE[T^α,n|]+2nt|)gα(t) .-{\frac{c_{\alpha}(n)}{n}}\ln\text{{P}}\left(\left.{\widehat{T}}_{\alpha,n}\geq\text{{E}}\left[\left.{\widehat{T}}_{\alpha,n}\right|{\mathcal{H}}\right]+2nt\right|{\mathcal{H}}\right)\longrightarrow g_{\alpha}(t)\text{ \ \ .} (cf. (34))

Consequently, by the argument of Quine and Robinson [8, p. 732],

$-{\frac{c_{\alpha}(n)}{n}}\ln L_{\alpha,n}({\widehat{T}}_{\alpha,n})\overset{p}{\longrightarrow}g_{\alpha}(\Delta_{\alpha}).$

Hence [8], the error level $L_{\alpha_{1},n}({\widehat{T}}_{\alpha_{1},n})$ of the statistic ${\widehat{T}}_{\alpha_{1},n}=2n\widehat{D}_{\alpha_{1},n}$ is asymptotically equivalent to the error level $L_{\alpha_{2},m_{n}}({\widehat{T}}_{\alpha_{2},m_{n}})$ of the statistic ${\widehat{T}}_{\alpha_{2},m_{n}}=2m_{n}\widehat{D}_{\alpha_{2},m_{n}}$ achieved with a sample size $m_{n}$ if the comparability (29) takes place, in other words, if the sample sizes $n$ and $m_{n}$ are mutually related by (30). Thus the concept of Bahadur efficiency introduced in this paper coincides, under the stronger assumptions of Definition 6, with the Bahadur efficiency of Quine and Robinson [8].

Harremoës and Vajda [5] assumed the same strong consistency as in Definition 6 but introduced the Bahadur efficiency by the slightly different formula

BE(D^α1,n; D^α2,n)=gα1(Δα1)gα2(Δα2).c¯α2/α1\mbox{BE}\left(\widehat{D}_{\alpha_{1},n}\,;\text{ }\widehat{D}_{\alpha_{2},n}\right)={\frac{g_{\alpha_{1}}(\Delta_{\alpha_{1}})}{g_{\alpha_{2}}(\Delta_{\alpha_{2}})}}.\bar{c}_{\alpha_{2}/\alpha_{1}} (36)

where¹

¹ Due to a misprint, $\alpha_{1}$ and $\alpha_{2}$ were interchanged behind the limit in [5, Eq. 30], but the formula was used in the correct form (36). In the Appendix we prove that the conclusions made on the basis of the original formula (36) hold unchanged under the present more precise formula (35).

c¯α2/α1=limncα2(n)cα1(n).\bar{c}_{\alpha_{2}/\alpha_{1}}=\lim_{n\longrightarrow\infty}\frac{c_{\alpha_{2}}(n)}{c_{\alpha_{1}}(n)}. (37)

III Consistency

In this section we study the consistency of the class of power divergence statistics Dα(P^n,Qn),D_{\alpha}(\widehat{P}_{n},Q_{n}), α>0.\alpha>0. In the domain α<0\alpha<0 this consistency was studied in the particular case of uniform QQ by Harremoës and Vajda [6].

Theorem 7

Let the distributions $Q_{n}\in M\left(k\right)$ satisfy assumption A($\alpha$). Assume that the convex function $f$ is uniformly continuous on $[0,\infty[$. Then the statistic $D_{f}(\widehat{P}_{n},Q_{n})$ is strongly consistent provided

nk.\frac{n}{k}\longrightarrow\infty. (38)
Proof:

Under \mathcal{H} we have Df(Pn,Qn)=Df(Qn,Qn)=0.D_{f}(P_{n},Q_{n})=D_{f}(Q_{n},Q_{n})=0. Hence it suffices to prove

|Λα,n|𝑝0 under both and 𝒜\left|\Lambda_{\alpha,n}\right|\overset{p}{\longrightarrow}0\text{ \ \ under both }\mathcal{H}\ \text{and }\mathcal{A} (39)

for Λα,n=Df(P^n,Qn)Df(Pn,Qn)\Lambda_{\alpha,n}=D_{f}(\widehat{P}_{n},Q_{n})-D_{f}(P_{n},Q_{n}). For simplicity we skip the subscript nn in the symbols P^n,Pn,\widehat{P}_{n},P_{n}, and QnQ_{n}, i.e. we substitute

P^n=P^=(p^j:1jk), Pn=P=(pj:1jk).\widehat{P}_{n}=\widehat{P}=(\widehat{p}_{j}:1\leq j\leq k),\text{ \ \ }P_{n}=P=(p_{j}:1\leq j\leq k). (40)

This leads to the simplified formula $\Lambda_{\alpha,n}=D_{f}(\widehat{P},Q)-D_{f}(P,Q)$. We can without loss of generality assume that $D_{f}(P,Q)$ is constant not only under $\mathcal{H}$ (where the constant is automatically 0) but also under $\mathcal{A}$ (where the assumed identifiability implies the convergence $D_{f}(P,Q)\longrightarrow\Delta_{\alpha}$ for $0<\Delta_{\alpha}<\infty$). In this asymptotic sense we use the equalities

Df(P,Q)=qj(pjqj)α1α(α1)=ΔαD_{f}(P,Q)=\frac{\sum q_{j}\left(\frac{p_{j}}{q_{j}}\right)^{\alpha}-1}{\alpha\left(\alpha-1\right)}=\Delta_{\alpha} (41)

and

$\Lambda_{\alpha,n}=D_{f}(\widehat{P},Q)-\Delta_{\alpha}.$ (42)

Choose some 0<s<10<s<1 and define

fs(t)={f(t)for ts,f(s)+f+(s)(ts)for 0t<s.f^{s}\left(t\right)=\left\{\begin{array}[c]{ll}f\left(t\right)&\text{for }t\geq s,\\ f\left(s\right)+f_{+}^{\prime}\left(s\right)\left(t-s\right)&\text{for }0\leq t<s.\end{array}\right.

Then

0f(t)fs(t)f(0)fs(0)0\leq f\left(t\right)-f^{s}\left(t\right)\leq f\left(0\right)-f^{s}\left(0\right)

so that (9) implies

0Df(P,Q)Dfs(P,Q)f(0)fs(0).0\leq D_{f}\left(P,Q\right)-D_{f^{s}}\left(P,Q\right)\leq f\left(0\right)-f^{s}\left(0\right).

The function $f^{s}$ is Lipschitz with Lipschitz constant $\lambda=\max\left\{\left|f_{+}^{\prime}\left(s\right)\right|,\left|f^{\prime}\left(\infty\right)\right|\right\}$, i.e. $\left|f^{s}\left(t_{1}\right)-f^{s}\left(t_{2}\right)\right|\leq\lambda\left|t_{1}-t_{2}\right|$ for all $t_{1},t_{2}\geq 0$. Then

$\left|D_{f^{s}}(\widehat{P}_{n},Q)-D_{f^{s}}(P_{n},Q)\right|=\left|\sum_{j=1}^{k}q_{j}\,f^{s}\left({\frac{\widehat{p}_{j}}{q_{j}}}\right)-\sum_{j=1}^{k}\,q_{j}f^{s}\left({\frac{p_{j}}{q_{j}}}\right)\right|\leq\sum_{j=1}^{k}q_{j}\,\left|f^{s}\left({\frac{\widehat{p}_{j}}{q_{j}}}\right)-f^{s}\left({\frac{p_{j}}{q_{j}}}\right)\right|\leq\sum_{j=1}^{k}q_{j}\lambda\,\left|{\frac{\widehat{p}_{j}}{q_{j}}}-{\frac{p_{j}}{q_{j}}}\right|=\lambda\sum_{j=1}^{k}\,\left|\widehat{p}_{j}-p_{j}\right|\leq\lambda\left(\sum_{j=1}^{k}\frac{\left(\widehat{p}_{j}-p_{j}\right)^{2}}{p_{j}}\right)^{1/2}$

where in the last step we used the Schwarz inequality. Since

𝖤[(p^jpj)2]=pj(1pj)/npj/n\mathsf{E}\left[\left(\widehat{p}_{j}-p_{j}\right)^{2}\right]=p_{j}(1-p_{j})/n\leq\ p_{j}/n (43)

it holds

𝖤|Dfs(P^n,Q)Dfs(Pn,Q)|λ(𝖤[j=1k(p^jpj)2pj])1/2λ(kn)1/2.\mathsf{E}\left|D_{f^{s}}(\widehat{P}_{n},Q)-D_{f^{s}}(P_{n},Q)\right|\\ \leq\lambda\left(\mathsf{E}\left[\sum_{j=1}^{k}\frac{\left(\widehat{p}_{j}-p_{j}\right)^{2}}{p_{j}}\right]\right)^{1/2}\leq\lambda\left(\frac{k}{n}\right)^{1/2}.

Consequently,

𝖤|Df(P^n,Q)Df(Pn,Q)|2(f(0)fs(0))+λ(k/n)1/2\mathsf{E}\left|D_{f}(\widehat{P}_{n},Q)-D_{f}(P_{n},Q)\right|\\ \leq 2\left(f\left(0\right)-f^{s}\left(0\right)\right)+\lambda\left(k/n\right)^{1/2}

so that under (38)

limsup n𝖤|Df(P^n,Qn)Df(Pn,Qn)|2(f(0)fs(0)).\underset{n\rightarrow\infty}{\lim\sup\text{ }}\mathsf{E}\left|D_{f}(\widehat{P}_{n},Q_{n})-D_{f}(P_{n},Q_{n})\right|\\ \leq 2\left(f\left(0\right)-f^{s}\left(0\right)\right).

This holds for all s>0.s>0. Since f(0)fs(0)0f\left(0\right)-f^{s}\left(0\right)\longrightarrow 0 for s0,s\downarrow 0, we see that in this case (38) implies (39). ∎

The interpretation of condition (38) is that the mean number of observations per bin should tend to infinity under $\mathcal{H}$. Note that this condition does not exclude that we will observe empty cells.
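Theorem 7 can also be watched at work numerically. In the sketch below (our illustration, using an alternative of the type studied in Example 14 below: $P_{n}$ uniform on the first half of the bins) the number of cells grows like $n^{1/2}$, so that $n/k\rightarrow\infty$ as (38) requires, and $\widehat{D}_{1,n}$ settles at $\Delta_{1}=\ln 2$.

```python
import numpy as np

rng = np.random.default_rng(2)

def D1(P, Q):
    m = P > 0
    return np.sum(P[m] * np.log(P[m] / Q[m]))

for n in (10**3, 10**4, 10**5):
    k = 2 * max(2, int(n**0.5) // 2)      # even k with n/k -> infinity, cf. (38)
    Q = np.full(k, 1.0 / k)               # hypothetical distribution
    P = np.zeros(k)
    P[: k // 2] = 2.0 / k                 # alternative: uniform on half of the bins
    X = rng.multinomial(n, P)             # counts under the alternative A
    print(n, k, D1(X / n, Q), np.log(2))  # D_hat_{1,n} approaches ln 2
```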

Our results are concentrated in Theorem 9 below. Its proof uses the following auxiliary result.

Lemma 8

For x,y0x,y\geq 0 and 1α21\leq\alpha\leq 2 it holds

Lα(x,y)ϕα(y)ϕα(x)Uα(x,y)L_{\alpha}\left(x,y\right)\leq\phi_{\alpha}(y)-\phi_{\alpha}(x)\leq U_{\alpha}\left(x,y\right) (44)

where

Lα(x,y)=(yx)ϕα(x)L_{\alpha}\left(x,y\right)=(y-x)\phi_{\alpha}^{\prime}(x) (45)

and

Uα(x,y)=Lα(x,y)+1αxα2(yx)2.U_{\alpha}\left(x,y\right)=L_{\alpha}\left(x,y\right)+\frac{1}{\alpha}x^{\alpha-2}\left(y-x\right)^{2}.\vskip 6.0pt plus 2.0pt minus 2.0pt (46)

Proof:

First assume 1<α<2.1<\alpha<2. Since 1αxα2(yx)2\frac{1}{\alpha}x^{\alpha-2}\left(y-x\right)^{2} is nonnegative, it suffices to prove

ϕα(y)ϕα(x)+ϕα(x)(yx)\phi_{\alpha}(y)\geq\phi_{\alpha}(x)+\phi_{\alpha}^{\prime}(x)\left(y-x\right) (47)

and

ϕα(y)ϕα(x)+ϕα(x)(yx)+1αxα2(yx)2.\phi_{\alpha}(y)\leq\phi_{\alpha}(x)+\phi_{\alpha}^{\prime}(x)\left(y-x\right)+\frac{1}{\alpha}x^{\alpha-2}\left(y-x\right)^{2}. (48)

But Inequality (47) is evident since the function yϕα(y)y\rightarrow\phi_{\alpha}(y) is convex. We shall prove that the function

f(y)=ϕα(y)(ϕα(x)+ϕα(x)(yx)+1αxα2(yx)2)f\left(y\right)=\\ \phi_{\alpha}\left(y\right)-\left(\phi_{\alpha}\left(x\right)+\phi_{\alpha}^{\prime}(x)\left(y-x\right)+\frac{1}{\alpha}x^{\alpha-2}\left(y-x\right)^{2}\right)

is non-positive. First we observe that f(0)=f(x)=0f(0)=f(x)=0. By differentiating f(y)f\left(y\right) we get

f(y)=yα11α1(ϕα(x)+2αxα2(yx))f^{\prime}\left(y\right)={\frac{y^{\alpha-1}-1}{\alpha-1}-}\left(\phi_{\alpha}^{\prime}(x)+\frac{2}{\alpha}x^{\alpha-2}\left(y-x\right)\right)

so that f(x)=0.f^{\prime}\left(x\right)=0. Differentiating once more we get

f′′(y)=yα22αxα2.f^{\prime\prime}\left(y\right)=y^{\alpha-2}{-}\frac{2}{\alpha}x^{\alpha-2}.

Thus $f^{\prime\prime}(y)>0$ for $y<x_{\alpha}\overset{def}{=}\left(\alpha/2\right)^{\frac{1}{2-\alpha}}x$ and $f^{\prime\prime}(y)<0$ for $y>x_{\alpha}$. Since $x_{\alpha}<x$ and $f(y)$ is concave on $[x_{\alpha},\infty[$, it is maximized on this interval at $y=x$ where $f(x)=0$. Thus $f\left(y\right)\leq 0$ on this interval and in particular $f(x_{\alpha})\leq 0$. This together with $f(0)=0$ and the convexity of $f\left(y\right)$ on the interval $[0,x_{\alpha}]$ implies $f\left(y\right)\leq 0$ for $y\in\left[0,x_{\alpha}\right]$, and hence for all $y\geq 0$. This completes the proof of the non-positivity of $f\left(y\right)$, i.e. the proof of (48). The cases $\alpha=2$ and $\alpha=1$ follow by continuity. ∎
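Lemma 8 is also easy to probe numerically; the following grid check (an illustration, not a substitute for the proof) verifies the bounds (44)-(46) for $\alpha=1.5$.

```python
import numpy as np

def phi(t, a):
    """Power function (7) for alpha = a with a != 1."""
    return (t ** a - a * (t - 1) - 1) / (a * (a - 1))

def dphi(t, a):
    """Derivative of (7): (t^(a-1) - 1)/(a - 1)."""
    return (t ** (a - 1) - 1) / (a - 1)

a = 1.5
for x in np.linspace(0.05, 3.0, 60):
    y = np.linspace(0.0, 3.0, 61)
    L = (y - x) * dphi(x, a)                       # lower bound (45)
    U = L + (y - x) ** 2 * x ** (a - 2) / a        # upper bound (46)
    d = phi(y, a) - phi(x, a)
    assert np.all(L - 1e-12 <= d) and np.all(d <= U + 1e-12)
print("Lemma 8 bounds hold on the grid.")
```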

The main result of this section is the following theorem.

Theorem 9

Let distributions QnM(k)Q_{n}\in M\left(k\right) satisfy the assumption A(α\alpha). Then the statistic Dα(P^n,Qn)D_{\alpha}(\widehat{P}_{n},Q_{n}) is strongly consistent provided

$0<\alpha\leq 2$ \ \ and \ \ ${\frac{n}{k}}\longrightarrow\infty$ (49)

and consistent provided

$\alpha>2$ \ \ and \ \ ${\frac{n}{k\log k}}\longrightarrow\infty$. (50)
Proof:

We shall use the same notation as in the proof of Theorem 7. In the proof we treat separately the cases

(i) $0<\alpha<1$,
(ii) $1<\alpha\leq 2$,
(iii) $\alpha=1$,
(iv) $\alpha>2$.
Case i (0<α<10<\alpha<1)

This follows from Theorem 7 because xϕα(x)x\rightarrow\phi_{\alpha}\left(x\right) is uniformly continuous.

Case ii (1<α21<\alpha\leq 2)

Here we get from (42)

Λα,n=j=1kqj(ϕα(p^jqj)ϕα(pjqj))\Lambda_{\alpha,n}=\sum_{j=1}^{k}q_{j}\left(\phi_{\alpha}\left(\frac{\widehat{p}_{j}}{q_{j}}\right)-\phi_{\alpha}\left(\frac{p_{j}}{q_{j}}\right)\right) (51)

so that Lemma 8 implies

j=1kqjLα(pjqj,p^jqj)Λα,nj=1kqjLα(pjqj,p^jqj)+j=1kqj1α(pjqj)α2(p^jqjpjqj)2\sum_{j=1}^{k}q_{j}L_{\alpha}\left(\frac{p_{j}}{q_{j}},\frac{\widehat{p}_{j}}{q_{j}}\right)\leq\Lambda_{\alpha,n}\leq\\ \\ \sum_{j=1}^{k}q_{j}L_{\alpha}\left(\frac{p_{j}}{q_{j}},\frac{\widehat{p}_{j}}{q_{j}}\right)+\sum_{j=1}^{k}q_{j}\frac{1}{\alpha}\left(\frac{p_{j}}{q_{j}}\right)^{\alpha-2}\left(\frac{\widehat{p}_{j}}{q_{j}}-\frac{p_{j}}{q_{j}}\right)^{2}

and

|Λα,n||j=1k(p^jpj)ϕα(pjqj)|+j=1kpjα2qjα1(p^jpj)2α.\left|\Lambda_{\alpha,n}\right|\leq\left|\sum_{j=1}^{k}(\widehat{p}_{j}-p_{j})\phi_{\alpha}^{\prime}\left(\frac{p_{j}}{q_{j}}\right)\right|+\sum_{j=1}^{k}\frac{p_{j}^{\alpha-2}}{q_{j}^{\alpha-1}}\frac{\left(\widehat{p}_{j}-p_{j}\right)^{2}}{\alpha}.

We take the mean and get

E|Λα,n|𝖤|j=1k(p^jpj)ϕα(pjqj)|+j=1kpjα2αqjα1𝖤[(p^jpj)2]E\left|\Lambda_{\alpha,n}\right|\leq\\ \mathsf{E}\left|\sum_{j=1}^{k}(\widehat{p}_{j}-p_{j})\phi_{\alpha}^{\prime}\left(\frac{p_{j}}{q_{j}}\right)\right|+\sum_{j=1}^{k}\frac{p_{j}^{\alpha-2}}{\alpha q_{j}^{\alpha-1}}\mathsf{E}\left[\left(\widehat{p}_{j}-p_{j}\right)^{2}\right]

The terms on the right hand side are treated separately.

j=1kpjα2qjα1𝖤[(p^jpj)2]α\displaystyle\sum_{j=1}^{k}\frac{p_{j}^{\alpha-2}}{q_{j}^{\alpha-1}}\frac{\mathsf{E}\left[\left(\widehat{p}_{j}-p_{j}\right)^{2}\right]}{\alpha} =j=1kpjα2qjα1𝖤[(np^jnpj)2]αn2\displaystyle=\sum_{j=1}^{k}\frac{p_{j}^{\alpha-2}}{q_{j}^{\alpha-1}}\frac{\mathsf{E}\left[\left(n\widehat{p}_{j}-np_{j}\right)^{2}\right]}{\alpha n^{2}}
=j=1kpjα2qjα1npj(1pj)αn2\displaystyle=\sum_{j=1}^{k}\frac{p_{j}^{\alpha-2}}{q_{j}^{\alpha-1}}\frac{np_{j}\left(1-p_{j}\right)}{\alpha n^{2}}
1αnj=1kpjα1(ρk)α1\displaystyle\leq\frac{1}{\alpha n}\sum_{j=1}^{k}\frac{p_{j}^{\alpha-1}}{\left(\frac{\rho}{k}\right)^{\alpha-1}}
kα1αnρα1j=1kpjα1.\displaystyle\leq\frac{k^{\alpha-1}}{\alpha n\rho^{\alpha-1}}\sum_{j=1}^{k}p_{j}^{\alpha-1}.

The function $P\rightarrow\sum_{j=1}^{k}p_{j}^{\alpha-1}$ is concave and symmetric in the coordinates of $P$, so it attains its maximum at $P=\left(1/k,1/k,\cdots,1/k\right)$. Therefore

j=1kpjα2qjα1𝖤[(p^jpj)2]α\displaystyle\sum_{j=1}^{k}\frac{p_{j}^{\alpha-2}}{q_{j}^{\alpha-1}}\frac{\mathsf{E}\left[\left(\widehat{p}_{j}-p_{j}\right)^{2}\right]}{\alpha} kα1αnρα1k(1k)α1\displaystyle\leq\frac{k^{\alpha-1}}{\alpha n\rho^{\alpha-1}}k\left(\frac{1}{k}\right)^{\alpha-1}
=1αρα1kn.\displaystyle=\frac{1}{\alpha\rho^{\alpha-1}}\frac{k}{n}.

Next we bound the first term.

𝖤|j=1k(p^jpj)ϕα(pjqj)|𝖤[(j=1k(p^jpj)ϕα(pjqj))2]1/2=(i,j=1kCov(p^i,p^j)ϕα(piqi)ϕα(pjqj))1/2=1n(i,j=1kCov(np^i,np^j)ϕα(piqi)ϕα(pjqj))1/2=1n(i=1kVar(np^i)(ϕα(piqi))2+ijCov(np^i,np^j)ϕα(piqi)ϕα(pjqj))1/2.\mathsf{E}\left|\sum_{j=1}^{k}(\widehat{p}_{j}-p_{j})\phi_{\alpha}^{\prime}\left(\frac{p_{j}}{q_{j}}\right)\right|\\ \leq\mathsf{E}\left[\left(\sum_{j=1}^{k}(\widehat{p}_{j}-p_{j})\phi_{\alpha}^{\prime}\left(\frac{p_{j}}{q_{j}}\right)\right)^{2}\right]^{1/2}\\ =\left(\sum_{i,j=1}^{k}Cov\left(\widehat{p}_{i},\widehat{p}_{j}\right)\phi_{\alpha}^{\prime}\left(\frac{p_{i}}{q_{i}}\right)\phi_{\alpha}^{\prime}\left(\frac{p_{j}}{q_{j}}\right)\right)^{1/2}\\ =\frac{1}{n}\left(\sum_{i,j=1}^{k}Cov\left(n\widehat{p}_{i},n\widehat{p}_{j}\right)\phi_{\alpha}^{\prime}\left(\frac{p_{i}}{q_{i}}\right)\phi_{\alpha}^{\prime}\left(\frac{p_{j}}{q_{j}}\right)\right)^{1/2}\\ =\frac{1}{n}\left(\begin{array}[c]{c}\sum_{i=1}^{k}Var\left(n\widehat{p}_{i}\right)\left(\phi_{\alpha}^{\prime}\left(\frac{p_{i}}{q_{i}}\right)\right)^{2}\\ +\sum_{i\neq j}Cov\left(n\widehat{p}_{i},n\widehat{p}_{j}\right)\phi_{\alpha}^{\prime}\left(\frac{p_{i}}{q_{i}}\right)\phi_{\alpha}^{\prime}\left(\frac{p_{j}}{q_{j}}\right)\end{array}\right)^{1/2}.

This equals

1n(i=1knpi(1pi)(ϕα(piqi))2+ijnpipjϕα(piqi)ϕα(pjqj))1/21n1/2(i=1kpi(ϕα(piqi))2+i,jpipjϕα(piqi)ϕα(pjqj))1/2.\frac{1}{n}\left(\begin{array}[c]{c}\sum_{i=1}^{k}np_{i}\left(1-p_{i}\right)\left(\phi_{\alpha}^{\prime}\left(\frac{p_{i}}{q_{i}}\right)\right)^{2}\\ +\sum_{i\neq j}np_{i}p_{j}\phi_{\alpha}^{\prime}\left(\frac{p_{i}}{q_{i}}\right)\phi_{\alpha}^{\prime}\left(\frac{p_{j}}{q_{j}}\right)\end{array}\right)^{1/2}\\ \leq\frac{1}{n^{1/2}}\left(\begin{array}[c]{c}\sum_{i=1}^{k}p_{i}\left(\phi_{\alpha}^{\prime}\left(\frac{p_{i}}{q_{i}}\right)\right)^{2}\\ +\sum_{i,j}p_{i}p_{j}\phi_{\alpha}^{\prime}\left(\frac{p_{i}}{q_{i}}\right)\phi_{\alpha}^{\prime}\left(\frac{p_{j}}{q_{j}}\right)\end{array}\right)^{1/2}.

This can be bounded as

1n1/2(i=1kpi(ϕα(piqi))2+(ipiϕα(piqi))2)1/2(2ni=1kpi(ϕα(piqi))2)1/2.\frac{1}{n^{1/2}}\left(\begin{array}[c]{c}\sum_{i=1}^{k}p_{i}\left(\phi_{\alpha}^{\prime}\left(\frac{p_{i}}{q_{i}}\right)\right)^{2}\\ +\left(\sum_{i}p_{i}\phi_{\alpha}^{\prime}\left(\frac{p_{i}}{q_{i}}\right)\right)^{2}\end{array}\right)^{1/2}\\ \leq\left(\frac{2}{n}\sum_{i=1}^{k}p_{i}\left(\phi_{\alpha}^{\prime}\left(\frac{p_{i}}{q_{i}}\right)\right)^{2}\right)^{1/2}.

These bounds can be combined into

𝖤|Λα,n|1αρα1kn+(2ni=1kpi(ϕα(piqi))2)1/2.\mathsf{E}\left|\Lambda_{\alpha,n}\right|\leq\frac{1}{\alpha\rho^{\alpha-1}}\frac{k}{n}+\left(\frac{2}{n}\sum_{i=1}^{k}p_{i}\left(\phi_{\alpha}^{\prime}\left(\frac{p_{i}}{q_{i}}\right)\right)^{2}\right)^{1/2}. (52)

Under (49) the first term tends to zero as n.n\rightarrow\infty. The last term does the same, which is seen from the inequalities

2ni=1kpi((piqi)α11α1)22ni=1kpi(piqi)2α2+1(α1)2=2n(α1)2(i=1kqi(piqi)α(piqi)α1+1)2n(α1)2i=1kqi(piqi)α(1ρ/k)α1+2n(α1)2=kα1n2(α1)2ρα1i=1kqi(piqi)α+2n(α1)2=kα1n2(α(α1)Δ+1)(α1)2ρα1+2n(α1)2.\frac{2}{n}\sum_{i=1}^{k}p_{i}\left(\frac{\left(\frac{p_{i}}{q_{i}}\right)^{\alpha-1}-1}{\alpha-1}\right)^{2}\\ \leq\frac{2}{n}\sum_{i=1}^{k}p_{i}\frac{\left(\frac{p_{i}}{q_{i}}\right)^{2\alpha-2}+1}{\left(\alpha-1\right)^{2}}\\ =\frac{2}{n\left(\alpha-1\right)^{2}}\left(\sum_{i=1}^{k}q_{i}\left(\frac{p_{i}}{q_{i}}\right)^{\alpha}\left(\frac{p_{i}}{q_{i}}\right)^{\alpha-1}+1\right)\\ \leq\frac{2}{n\left(\alpha-1\right)^{2}}\sum_{i=1}^{k}q_{i}\left(\frac{p_{i}}{q_{i}}\right)^{\alpha}\left(\frac{1}{\rho/k}\right)^{\alpha-1}+\frac{2}{n\left(\alpha-1\right)^{2}}\\ =\frac{k^{\alpha-1}}{n}\frac{2}{\left(\alpha-1\right)^{2}\rho^{\alpha-1}}\sum_{i=1}^{k}q_{i}\left(\frac{p_{i}}{q_{i}}\right)^{\alpha}+\frac{2}{n\left(\alpha-1\right)^{2}}\\ =\frac{k^{\alpha-1}}{n}\frac{2\left(\alpha\left(\alpha-1\right)\Delta+1\right)}{\left(\alpha-1\right)^{2}\rho^{\alpha-1}}+\frac{2}{n\left(\alpha-1\right)^{2}}.
Case iii (α=1\alpha=1)

Setting $\alpha=1$ in inequality (52) we get

𝖤|Λ1,n|kn+(2ni=1kpi(lnpiqi)2)1/2.\mathsf{E}\left|\Lambda_{1,n}\right|\leq\frac{k}{n}+\left(\frac{2}{n}\sum_{i=1}^{k}p_{i}\left(\ln\frac{p_{i}}{q_{i}}\right)^{2}\right)^{1/2}.

Using $\ln p_{i}\ln q_{i}\geq 0$ and (20) we find that the last term on the right satisfies the relations

1nj=1kpi(ln2pi2lnpilnqi+ln2qi)\displaystyle\frac{1}{n}\sum_{j=1}^{k}p_{i}\left(\ln^{2}p_{i}-2\ln p_{i}\ln q_{i}+\ln^{2}q_{i}\right) (53)
=1nj=1kpiln2pi2nj=1kpilnpilnqi+1nj=1kpiln2qi\displaystyle=\frac{1}{n}\sum_{j=1}^{k}p_{i}\ln^{2}p_{i}-\frac{2}{n}\sum_{j=1}^{k}p_{i}\ln p_{i}\ln q_{i}+\frac{1}{n}\sum_{j=1}^{k}p_{i}\ln^{2}q_{i}
1nj=1kpiln2pi+ln2kϱnj=1kpi1nj=1kpiln2pi+ln2kϱn.\displaystyle\leq\frac{1}{n}\sum_{j=1}^{k}p_{i}\ln^{2}p_{i}+\frac{\ln^{2}\frac{k}{\varrho}}{n}\sum_{j=1}^{k}p_{i}\leq\frac{1}{n}\sum_{j=1}^{k}p_{i}\ln^{2}p_{i}+\frac{\ln^{2}\frac{k}{\varrho}}{n}.

The function $x\rightarrow x\ln^{2}x$ is concave on the interval $\left[0,e^{-1}\right]$ and convex on the interval $\left[e^{-1},1\right]$. Therefore we can apply the method of [16] to verify that $\sum_{i=1}^{k}p_{i}\ln^{2}p_{i}$ attains its maximum for a mixture of uniform distributions on $k$ points and on a subset of $k-1$ of these points. Thus

$\frac{1}{n}\sum_{i=1}^{k}p_{i}\ln^{2}p_{i}\leq\frac{1}{n}\sum_{i=1}^{k}\frac{1}{k-1}\ln^{2}\left(\frac{1}{k}\right)=\frac{k\ln^{2}k}{n\left(k-1\right)}\leq\frac{2\ln^{2}k}{n}$ (54)

and we can conclude that under (49) the first term in (53) tends to zero as $n$ tends to infinity. Obviously, under (49) also the second term in (53) and the leading term $k/n$ tend to zero, so that the desired relation (39) holds.
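The bound leading to (54) can also be probed numerically; the sketch below (our illustration with Dirichlet-sampled distributions) compares $\sum_{i}p_{i}\ln^{2}p_{i}$ against $k\ln^{2}k/(k-1)$ for several $k$.

```python
import numpy as np

rng = np.random.default_rng(3)

# Empirical check of the bound used in (54): sum_i p_i ln^2 p_i <= k ln^2 k / (k-1).
for k in (5, 50, 500):
    best = 0.0
    for _ in range(2000):
        P = rng.dirichlet(np.full(k, 0.2))        # random distribution on k points
        m = P > 0
        best = max(best, float(np.sum(P[m] * np.log(P[m]) ** 2)))
    print(k, best, k * np.log(k) ** 2 / (k - 1))
```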

Case iv (α>2\alpha>2)

By A2,

Dα(P,Q)\displaystyle D_{\alpha}(P,Q) =1α(α1)(j=1kpjαqj1α1)\displaystyle=\frac{1}{\alpha(\alpha-1)}\left(\sum_{j=1}^{k}p_{j}^{\alpha}q_{j}^{1-\alpha}-1\right)
1α(α1)((kϱ)α1j=1kpjα1)\displaystyle\geq\frac{1}{\alpha(\alpha-1)}\left(\left(\frac{k}{\varrho}\right)^{\alpha-1}\sum_{j=1}^{k}p_{j}^{\alpha}-1\right)

so that

j=1kpjα(α(α1)Δ+1)(ϱk)α1\sum_{j=1}^{k}p_{j}^{\alpha}\leq\left(\alpha(\alpha-1)\Delta+1\right)\left(\frac{\varrho}{k}\right)^{\alpha-1} (55)

where we replaced Dα(P,Q)D_{\alpha}(P,Q) by Δ=Δα\Delta=\Delta_{\alpha} in the sense of (41). Further, by the Taylor formula

p^jα=pjα+αpjα1(p^jpj)+α(α1)2ξjα2(p^jpj)2\widehat{p}_{j}^{\alpha}=p_{j}^{\alpha}+\alpha\,p_{j}^{\alpha-1}(\widehat{p}_{j}-p_{j})+{\frac{\alpha(\alpha-1)}{2}}\,\xi_{j}^{\alpha-2}(\widehat{p}_{j}-p_{j})^{2} (56)

where ξj\xi_{j} is between pjp_{j} and p^j.\widehat{p}_{j}. We shall look for a highly probable upper bound on p^j\widehat{p}_{j}. Choose any b>1b>1 and consider the random event

Enj(b)={p^jbmax{pj,qj}}.E_{nj}(b)=\{\widehat{p}_{j}\geq b\max\left\{p_{j},q_{j}\right\}\}.

We shall prove that under (50) it holds

𝝅n(b)=def𝖯(jEnj(b))0.\boldsymbol{\pi}_{n}(b)\overset{def}{=}\mathsf{P}\left(\cup_{j}E_{nj}(b)\right)\longrightarrow 0. (57)

The components Xj=XnjX_{j}=X_{nj} of the observation vector 𝐗n\boldsymbol{X}_{n} defined in Section 1 are approximately Poisson distributed, Po(npj),\left(np_{j}\right), so that

𝖯(p^jbmax{pj,qj})=𝖯(Xjnbmax{pj,qj})exp{D1(Po(bmax{npj,nqj}), Po(npj))}\mathsf{P}\left(\widehat{p}_{j}\geq b\max\left\{p_{j},q_{j}\right\}\right)=\mathsf{P}\left(X_{j}\geq nb\max\left\{p_{j},q_{j}\right\}\right)\\ \leq\exp\{-D_{1}\left(\text{Po}\left(b\max\left\{np_{j},nq_{j}\right\}\right),\text{ Po}\left(np_{j}\right)\right)\}

for the divergence $D_{1}(P,Q)$ defined by (11) with $P,Q$ replaced by the corresponding Poisson distributions. But

$D_{1}\left(\text{Po}\left(bnq_{j}\right),\text{ Po}\left(nq_{j}\right)\right)=nq_{j}\phi_{1}\left(b\right)$ (58)

for the logarithmic function $\phi_{1}\geq 0$ introduced in (8). Since for all $0\leq p_{j},q_{j}\leq 1$

$\phi_{1}\left(\frac{b\max\left\{p_{j},q_{j}\right\}}{p_{j}}\right)\geq\phi_{1}\left(b\right)>0\text{ \ for }b>1,$

it holds

D1(Po(bmax{npj,nqj}), Po(npj))D1(Po(bnqj), Po(nqj)).D_{1}\left(\text{Po}\left(b\max\left\{np_{j},nq_{j}\right\}\right),\text{ Po}\left(np_{j}\right)\right)\\ \geq D_{1}\left(\text{Po}\left(bnq_{j}\right),\text{ Po}\left(nq_{j}\right)\right).

Consequently,

𝝅n(b)\displaystyle\boldsymbol{\pi}_{n}(b) j𝖯(p^jbmax{pj,qj})\displaystyle\leq\sum_{j}\mathsf{P}\left(\widehat{p}_{j}\geq b\max\left\{p_{j},q_{j}\right\}\right)
jexp{D1(Po(bmax{npj,nqj}), Po(npj))}\displaystyle\leq\sum_{j}\exp\{-D_{1}\left(\text{Po}\left(b\max\left\{np_{j},nq_{j}\right\}\right),\text{ Po}\left(np_{j}\right)\right)\}
jexp{D1(Po(bnqj), Po(nqj))}\displaystyle\leq\sum_{j}\exp\left\{-D_{1}\left(\text{Po}\left(bnq_{j}\right),\text{ Po}\left(nq_{j}\right)\right)\right\}
=jexp{nqjϕ1(b)} (cf. (58))\displaystyle=\sum_{j}\exp\left\{-nq_{j}\phi_{1}\left(b\right)\right\}\text{ \ \ \ \ (cf. (\ref{Poisson}))}
kexp{nϱkϕ1(b)}=k1nklogkϱϕ1(b).\displaystyle\leq k\exp\left\{-n\frac{\varrho}{k}\phi_{1}\left(b\right)\right\}=k^{1-\frac{n}{k\log k}\varrho\phi_{1}\left(b\right)}. (59)

Assumption (50) implies that the exponent in (59) tends to -\infty so that (57) holds. Therefore it suffices to prove (39) under the condition that for all sufficiently large nn the random events jEnj(b)\cup_{j}E_{nj}(b) fail to take place, i.e. that

p^j<bmax{pj,qj} for all 1jk.\widehat{p}_{j}<b\max\left\{p_{j},q_{j}\right\}\text{ \ \ for all \ }1\leq j\leq k. (60)

Let us start with the fact that under (60) it holds $\xi_{j}\leq\max\left\{bp_{j},bq_{j}\right\}$ and then

ξjα2(max{bpj,bqj})α2bα2pjα2+bα2ϱα2kα2.\xi_{j}^{\alpha-2}\leq\left(\max\left\{bp_{j},bq_{j}\right\}\right)^{\alpha-2}\leq b^{\alpha-2}p_{j}^{\alpha-2}+b^{\alpha-2}\frac{\varrho^{\alpha-2}}{k^{\alpha-2}}. (61)

Applying this in the Taylor formula (56) we obtain

|p^jαpjα|\displaystyle\left|\widehat{p}_{j}^{\alpha}-p_{j}^{\alpha}\right| αpjα1|p^jpj|\displaystyle\leq\alpha\,p_{j}^{\alpha-1}\left|\widehat{p}_{j}-p_{j}\right|
+α(α1)bα22(pjα2+ϱα2kα2)(p^jpj)2.\displaystyle+{\frac{\alpha(\alpha-1)b^{\alpha-2}}{2}}\,\left(p_{j}^{\alpha-2}+\frac{\varrho^{\alpha-2}}{k^{\alpha-2}}\right)(\widehat{p}_{j}-p_{j})^{2}.

Hence under (60) we get from (9), (56) and (61)

$\left|\Lambda_{\alpha,n}\right|\leq\frac{k^{\alpha-1}}{\alpha\left(\alpha-1\right)}\sum_{j=1}^{k}\alpha\,p_{j}^{\alpha-1}\left|\widehat{p}_{j}-p_{j}\right|+\frac{k^{\alpha-1}}{\alpha\left(\alpha-1\right)}\sum_{j=1}^{k}{\frac{\alpha(\alpha-1)b^{\alpha-2}}{2}}\,\left(p_{j}^{\alpha-2}+\frac{\varrho^{\alpha-2}}{k^{\alpha-2}}\right)(\widehat{p}_{j}-p_{j})^{2}.$ (62)

Applying (55) and using Jensen’s inequality and the expectation bound (43), we upper bound 𝐄|Λα,n|\mathbf{E}\left|\Lambda_{\alpha,n}\right| by

(α(α1)Δ+1)1/2α(α1)(pjα1k1αn)1/2\displaystyle\frac{\left(\alpha\left(\alpha-1\right)\Delta+1\right)^{1/2}}{\alpha\left(\alpha-1\right)}\left(\frac{\sum p_{j}^{\alpha-1}}{k^{1-\alpha}n}\right)^{1/2}
+bα2kα12j=1k(pjα2+ϱα2kα2)𝖤[(p^jpj)2]\displaystyle+{\frac{b^{\alpha-2}k^{\alpha-1}}{2}}\sum_{j=1}^{k}\left(p_{j}^{\alpha-2}+\frac{\varrho^{\alpha-2}}{k^{\alpha-2}}\right)\mathsf{E}\left[(\widehat{p}_{j}-p_{j})^{2}\right]\vskip 6.0pt plus 2.0pt minus 2.0pt
(α(α1)Δ+1)1/2α(α1)(pjα1k1αn)1/2\displaystyle\leq\frac{\left(\alpha\left(\alpha-1\right)\Delta+1\right)^{1/2}}{\alpha\left(\alpha-1\right)}\left(\frac{\sum p_{j}^{\alpha-1}}{k^{1-\alpha}n}\right)^{1/2}
+bα2kα12j=1k(pjα2+ϱα2kα2)pjn\displaystyle+{\frac{b^{\alpha-2}k^{\alpha-1}}{2}}\sum_{j=1}^{k}\left(p_{j}^{\alpha-2}+\frac{\varrho^{\alpha-2}}{k^{\alpha-2}}\right)\frac{p_{j}}{n}\vskip 6.0pt plus 2.0pt minus 2.0pt
=(α(α1)Δ+1)1/2α(α1)(pjα1k1αn)1/2\displaystyle=\frac{\left(\alpha\left(\alpha-1\right)\Delta+1\right)^{1/2}}{\alpha\left(\alpha-1\right)}\left(\frac{\sum p_{j}^{\alpha-1}}{k^{1-\alpha}n}\right)^{1/2} (63)
+bα22kα1j=1kpjα1n+bα2ϱα22kn.\displaystyle+{\frac{b^{\alpha-2}}{2}}\frac{k^{\alpha-1}\sum_{j=1}^{k}p_{j}^{\alpha-1}}{n}+\frac{b^{\alpha-2}\varrho^{\alpha-2}}{2}\frac{k}{n}.

Obviously, under (60) the desired relation (39) holds if the assumption (50) implies the convergence

pjα1k1αn0.\frac{\sum p_{j}^{\alpha-1}}{k^{1-\alpha}n}\rightarrow 0. (64)

However, by Jensen's inequality and (55),

j=1kpjα1=j=1kpj(pjα1)(α2)/(α1)(j=1kpjpjα1)(α2)/(α1)=(j=1kpjα)(α2)/(α1)((α(α1)Δ+1)ϱα1kα1)(α2)/(α1)=ϱα2(α(α1)Δ+1)(α2)/(α1)kα2\sum_{j=1}^{k}\,p_{j}^{\alpha-1}=\sum_{j=1}^{k}p_{j}\,\left(p_{j}^{\alpha-1}\right)^{(\alpha-2)/(\alpha-1)}\vskip 6.0pt plus 2.0pt minus 2.0pt\\ \leq\,\left(\sum_{j=1}^{k}p_{j}p_{j}^{\alpha-1}\right)^{(\alpha-2)/(\alpha-1)}=\left(\sum_{j=1}^{k}p_{j}^{\alpha}\right)^{(\alpha-2)/(\alpha-1)}\vskip 6.0pt plus 2.0pt minus 2.0pt\\ \leq\left(\left(\alpha\left(\alpha-1\right)\Delta+1\right)\frac{\varrho^{\alpha-1}}{k^{\alpha-1}}\right)^{(\alpha-2)/(\alpha-1)}\\ =\frac{\varrho^{\alpha-2}\left(\alpha\left(\alpha-1\right)\Delta+1\right)^{(\alpha-2)/(\alpha-1)}}{k^{\alpha-2}}

so that the validity of (39) under (50) is obvious and the proof is complete.

Condition (50) is stronger than condition (38) and implies that, for any fixed number $a>0$, eventually every bin will contain more than $a$ observations with probability tending to one.

IV Bahadur efficiency

In this section we study the Bahadur efficiency in the class of power divergence statistics D^α,n=Dα(P^n,Qn)\hat{D}_{\alpha,n}=D_{\alpha}(\hat{P}_{n},Q_{n}), α>0\alpha>0. As before, we use the simplified notations

Pn=P, Qn=Q and kn=k.P_{n}=P,\text{ }Q_{n}=Q\text{ \ \ and \ }k_{n}=k.

The results are concentrated in Theorem 13 below. Its proof is based on the following lemmas. The first two of them make use of the Rényi divergences of orders α>0\alpha>0

Dα(PQ)\displaystyle D_{\alpha}\left(P\|Q\right) =1α1lnj=1kpjαqj1α,\displaystyle={\frac{1}{\alpha-1}}\ln\sum_{j=1}^{k}p_{j}^{\alpha}q_{j}^{1-\alpha},\ \
D1(PQ)\displaystyle D_{1}\left(P\|Q\right) =limα1Dα(PQ)=D(PQ)\displaystyle=\lim_{\alpha\rightarrow 1}D_{\alpha}\left(P\|Q\right)=D\left(P\|Q\right)

where D(PQ)D\left(P\|Q\right) is the classical information divergence denoted above by D1(P,Q)D_{1}\left(P,Q\right). There is a monotone relationship between the Rényi and power divergences given by the formula

Dα(PQ)\displaystyle D_{\alpha}\left(P\|Q\right) =1α1ln(1+α(α1)Dα(P,Q)),\displaystyle={\frac{1}{\alpha-1}}\ln\left(1+\alpha\left(\alpha-1\right)D_{\alpha}\left(P,Q\right)\right),\text{ \ \ } (65)
D1(PQ)\displaystyle D_{1}\left(P\|Q\right) =D1(P,Q).\displaystyle=D_{1}\left(P,Q\right). (66)
Lemma 10

Let PP and QQ be probability vectors on the set 𝒳\mathcal{X}. If α<β\alpha<\beta then

Dα(PQ)Dβ(PQ).D_{\alpha}\left(P\|Q\right)\leq D_{\beta}\left(P\|Q\right).

with equality if and only if there exists a subset $A\subseteq\mathcal{X}$ such that $P=Q\left(\cdot\mid A\right)$.

Proof:

By Jensen’s inequality

Dα(PQ)\displaystyle D_{\alpha}\left(P\|Q\right) =1α1lnj=1kpjαqj1α\displaystyle={\frac{1}{\alpha-1}}\ln\sum_{j=1}^{k}p_{j}^{\alpha}q_{j}^{1-\alpha}
=1α1lnj=1kpj((pjqj)β1)α1β1\displaystyle={\frac{1}{\alpha-1}}\ln\sum_{j=1}^{k}p_{j}\left(\left(\frac{p_{j}}{q_{j}}\right)^{\beta-1}\right)^{\frac{\alpha-1}{\beta-1}}
1α1ln(j=1kpj(pjqj)β1)α1β1\displaystyle\leq{\frac{1}{\alpha-1}}\ln\left(\sum_{j=1}^{k}p_{j}\left(\frac{p_{j}}{q_{j}}\right)^{\beta-1}\right)^{\frac{\alpha-1}{\beta-1}}
1β1lnj=1kpj(pjqj)β1\displaystyle\leq{\frac{1}{\beta-1}}\ln\sum_{j=1}^{k}p_{j}\left(\frac{p_{j}}{q_{j}}\right)^{\beta-1}
=Dβ(PQ).\displaystyle=D_{\beta}\left(P\|Q\right).

The equality takes place if and only if $\left(\frac{p_{j}}{q_{j}}\right)^{\beta-1}$ is constant $P$-almost surely. Therefore $\frac{p_{j}}{q_{j}}$ is constant on the support of $P$, which we shall denote by $A$. Then $P$ equals $Q$ conditioned on $A$. ∎

Lemma 11

Let 0<α1.0<\alpha\leq 1. If

${\frac{n}{k\ln n}}\longrightarrow\infty$ (67)

and $q_{\max}\rightarrow 0$ as $n\rightarrow\infty$, then the statistic $\widehat{D}_{\alpha,n}$ is Bahadur stable and consistent, and the constant sequence $c_{\alpha}(n)\equiv 1$ generates the Bahadur function

$g_{\alpha}(\Delta)=\begin{cases}\dfrac{\ln\left(1+\alpha(\alpha-1)\,\Delta\right)}{\alpha-1},\quad\Delta>0,&\text{when }0<\alpha<1,\\[6pt]\displaystyle\lim_{\alpha\rightarrow 1}g_{\alpha}(\Delta)=\Delta,\quad\Delta>0,&\text{when }\alpha=1.\end{cases}$ (68)
Proof:

Let us first consider $0<\alpha<1$. By Lemma 10, the minimum of $D_{1}(P,Q)$ given $D_{\alpha}(P\|Q)\geq\Delta$ is lower bounded by $\Delta$. Let $\varepsilon>0$ be given. If $q_{\max}$ is sufficiently small, there exist sets $A_{-}\subseteq A_{+}$ such that

lnQ(A+)ΔlnQ(A)Δ+ε.-\ln Q\left(A_{+}\right)\leq\Delta\leq-\ln Q\left(A_{-}\right)\leq\Delta+\varepsilon.

Let PsP_{s} denote the mixture (1s)Q(A+)+sQ(A).\left(1-s\right)Q\left(\cdot\mid A_{+}\right)+sQ\left(\cdot\mid A_{-}\right). Then sDα(PsQ)s\rightarrow D_{\alpha}\left(P_{s}\|Q\right) is a continuous function satisfying

Dα(P0Q)\displaystyle D_{\alpha}\left(P_{0}\|Q\right) Δ,\displaystyle\leq\Delta,
Dα(P1Q)\displaystyle D_{\alpha}\left(P_{1}\|Q\right) Δ.\displaystyle\geq\Delta.

In particular there exists $s\in\left[0,1\right]$ such that $D_{\alpha}\left(P_{s}\|Q\right)=\Delta$. For this $s$ we have

D1(Ps,Q)(1s)D1(Q(A+),Q)+sD1(Q(A),Q)=(1s)(lnQ(A+))+s(lnQ(A))(1s)Δ+s(Δ+ε)=Δ+ε.D_{1}\left(P_{s},Q\right)\\ \leq\left(1-s\right)D_{1}\left(Q\left(\cdot\mid A_{+}\right),Q\right)+sD_{1}\left(Q\left(\cdot\mid A_{-}\right),Q\right)\\ =\left(1-s\right)\left(-\ln Q\left(A_{+}\right)\right)+s\left(-\ln Q\left(A_{-}\right)\right)\\ \leq\left(1-s\right)\Delta+s\left(\Delta+\varepsilon\right)=\Delta+\varepsilon.

Hence

ΔinfD1(P,Q)Δ+ε\Delta\leq\inf D_{1}\left(P,Q\right)\leq\Delta+\varepsilon

where the infimum is taken over all PP satisfying Dα(PQ)=ΔD_{\alpha}\left(P\|Q\right)=\Delta and where nn is sufficiently large. This holds for all ε>0\varepsilon>0 so the Bahadur function of the statistic Dα(P^Q)D_{\alpha}\left(\hat{P}\|Q\right) is g(Δ)=Δ.g\left(\Delta\right)=\Delta. The Bahadur function of the power divergence statistics Dα(P^,Q)D_{\alpha}\left(\hat{P},Q\right) can be calculated using Equality 65. ∎

Lemma 12

Let $\alpha>1$. If assumption A($\alpha$) holds for the uniform distributions $Q_{n}=U$ and the sequence

cα(n)=k(α1)/αlnkc_{\alpha}(n)={\frac{\,k^{(\alpha-1)/\alpha}}{\ln k}} (69)

satisfies the condition

ncα(n)klnn{\frac{n}{c_{\alpha}(n)\,k\ln n}}\longrightarrow\infty (70)

then the statistic D^α,n=Dα(P^n,Qn)\widehat{D}_{\alpha,n}=D_{\alpha}(\hat{P}_{n},Q_{n}) is consistent and the sequence (69) generates the Bahadur function

gα(Δ)=(α(α1)Δ)1/α,Δ>0.g_{\alpha}(\Delta)=\left(\alpha(\alpha-1)\,\Delta\right)^{1/\alpha},\quad\Delta>0.\vskip 6.0pt plus 2.0pt minus 2.0pt (71)
Proof:

If the sequence (69) satisfies (70) then Theorem 9 implies the consistency of $\widehat{D}_{\alpha,n}$. Formula (71) was already mentioned above with a reference to Harremoës and Vajda [5]. ∎

Theorem 13

Let the assumption A(α1,α2\alpha_{1},\alpha_{2}) hold where 0<α1<α2.0<\alpha_{1}<\alpha_{2}. If

klnnn0{\frac{k\ln n}{n}}\longrightarrow 0 (72)

then the statistics

D^α1,n\displaystyle\widehat{D}_{\alpha_{1},n} =Dα1(P^n,Qn),\displaystyle=D_{\alpha_{1}}(\hat{P}_{n},Q_{n}),
D^α2,n\displaystyle\widehat{D}_{\alpha_{2},n} =Dα2(P^n,Qn)\displaystyle=D_{\alpha_{2}}(\hat{P}_{n},Q_{n})

satisfy the relation

$\mbox{BE}\left(\widehat{D}_{\alpha_{1},n};\widehat{D}_{\alpha_{2},n}\right)=\begin{cases}\dfrac{\alpha_{2}-1}{\alpha_{1}-1}\cdot\dfrac{\ln\left(1+\alpha_{1}(\alpha_{1}-1)\Delta_{\alpha_{1}}\right)}{\ln\left(1+\alpha_{2}(\alpha_{2}-1)\,\Delta_{\alpha_{2}}\right)}&\text{for }\alpha_{2}<1,\\[8pt]\dfrac{1}{\alpha_{1}-1}\cdot\dfrac{\ln\left(1+\alpha_{1}(\alpha_{1}-1)\Delta_{\alpha_{1}}\right)}{\Delta_{\alpha_{2}}}&\text{for }\alpha_{2}=1.\end{cases}$ (73)

If

k21/α2lnnn0{\frac{k^{2-1/\alpha_{2}}\ln n}{n}}\longrightarrow 0 (74)

then the statistics D^α1,n=Dα1(P^n,U)\widehat{D}_{\alpha_{1},n}=D_{\alpha_{1}}(\hat{P}_{n},U) and D^α2,n=Dα2(P^n,U)\widehat{D}_{\alpha_{2},n}=D_{\alpha_{2}}(\hat{P}_{n},U) satisfy the relation

BE(D^α1,n;D^α2,n)= for α2>1.\mbox{BE}\left(\widehat{D}_{\alpha_{1},n};\widehat{D}_{\alpha_{2},n}\right)=\infty\text{ \ \ \ }\mbox{for}\text{ }\alpha_{2}>1.\vskip 6.0pt plus 2.0pt minus 2.0pt (75)
Proof:

By Lemma 11, the assumptions of Definition 5 hold. The first assertion then follows directly from Definition 5 since, by Lemma 11,

$\frac{g_{\alpha_{1}}(\Delta_{\alpha_{1}})}{g_{\alpha_{2}}(\Delta_{\alpha_{2}})}=\begin{cases}\dfrac{\alpha_{2}-1}{\alpha_{1}-1}\cdot\dfrac{\ln\left(1+\alpha_{1}(\alpha_{1}-1)\Delta_{\alpha_{1}}\right)}{\ln\left(1+\alpha_{2}(\alpha_{2}-1)\,\Delta_{\alpha_{2}}\right)}&\text{when }\alpha_{2}<1,\\[8pt]\dfrac{1}{\alpha_{1}-1}\cdot\dfrac{\ln\left(1+\alpha_{1}(\alpha_{1}-1)\Delta_{\alpha_{1}}\right)}{\Delta_{\alpha_{2}}}&\text{when }\alpha_{2}=1.\end{cases}$ (76)

The second assertion was deduced for $\alpha_{1}=1$ in Section 2 from the lemmas presented there. The argument was based on the fact that $c_{\alpha_{1}}(n)=1$ for $\alpha_{1}=1$. But $c_{\alpha}(n)=1$ for all $0<\alpha\leq 1$, so the extension from $\alpha_{1}=1$ to $0<\alpha_{1}<1$ is straightforward. ∎

Example 14

Let

Pn=(pnj=def1{1jk/2}k/2), n=1,2,P_{n}=\left(p_{nj}\overset{def}{=}\frac{1_{\left\{1\leq j\leq k/2\right\}}}{\left\lfloor k/2\right\rfloor}\right),\text{ \ \ }n=1,2,\ldots (77)

where 1A1_{A} is the indicator function, \left\lfloor\cdot\right\rfloor stands for the integer part (floor function) and, as before,

U=(uj=def1/k:1jk).U=\left(u_{j}\overset{def}{=}1/k:1\leq j\leq k\right).

Then for $\alpha\neq 0,1$

D_{\alpha}(P_{n},U) ={\frac{\sum_{1}^{k}u_{j}\left(\left(p_{nj}/u_{j}\right)^{\alpha}-\alpha\left(p_{nj}/u_{j}-1\right)-1\right)}{\alpha(\alpha-1)}}
={\frac{\sum_{1}^{k}p_{nj}^{\alpha}u_{j}^{1-\alpha}-\alpha\sum_{1}^{k}(p_{nj}-u_{j})-\sum_{1}^{k}u_{j}}{\alpha(\alpha-1)}}
={\frac{k^{\alpha-1}\sum_{1}^{\left\lfloor k/2\right\rfloor}\left\lfloor k/2\right\rfloor^{-\alpha}-1}{\alpha(\alpha-1)}}
={\frac{k^{\alpha-1}\left\lfloor k/2\right\rfloor/\left\lfloor k/2\right\rfloor^{\alpha}-1}{\alpha(\alpha-1)}}
={\frac{\left(k/\left\lfloor k/2\right\rfloor\right)^{\alpha-1}-1}{\alpha(\alpha-1)}}.

Therefore the identifiability condition (19) takes on the form

D_{\alpha}(P_{n},U)\longrightarrow\left\{\begin{array}[c]{ll}\displaystyle{\frac{2^{\alpha-1}-1}{\alpha(\alpha-1)}}\overset{def}{=}\Delta_{\alpha}&\mbox{if}\ \alpha>0,\ \alpha\neq 1\\[6pt]\displaystyle\ln 2\overset{def}{=}\Delta_{1}&\mbox{if}\ \alpha=1.\end{array}\right.
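This limit is easy to confirm numerically. The following sketch (an illustration only, with hypothetical values of $k$ and $\alpha$) evaluates $D_{\alpha}(P_{n},U)$ directly from the definition of the power divergence and compares it with $\Delta_{\alpha}$:

```python
import math

def power_divergence(p, q, alpha):
    # D_alpha(P,Q) = sum_j q_j((p_j/q_j)^alpha - alpha(p_j/q_j - 1) - 1) / (alpha(alpha-1))
    s = sum(qj * ((pj / qj) ** alpha - alpha * (pj / qj - 1) - 1)
            for pj, qj in zip(p, q))
    return s / (alpha * (alpha - 1))

for k in [10, 100, 1000, 10000]:
    m = k // 2                              # floor(k/2) cells carry the mass
    p = [1.0 / m] * m + [0.0] * (k - m)     # the alternative (77)
    u = [1.0 / k] * k                       # the uniform null hypothesis
    for alpha in [0.5, 2.0, 3.0]:
        delta = (2 ** (alpha - 1) - 1) / (alpha * (alpha - 1))
        print(k, alpha, power_divergence(p, u, alpha), delta)
# For each order alpha the computed divergence approaches Delta_alpha.
```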

If $0<\alpha\leq 1$ then Lemma 11 implies

gα(Δ)=ln(1+α(α1)Δ)/(α1)g_{\alpha}(\Delta)=\ln\left(1+\alpha(\alpha-1)\,\Delta\right)/(\alpha-1)

when $0<\alpha<1$, and $g_{1}(\Delta)=\Delta$ when $\alpha=1$. If moreover (72) holds, then under the alternatives (77)

\frac{g_{\alpha}(\Delta_{\alpha})}{g_{1}(\Delta_{1})} ={\frac{\ln\left(1+\alpha(\alpha-1)\,\frac{2^{\alpha-1}-1}{\alpha(\alpha-1)}\right)}{(\alpha-1)\ln 2}}
={\frac{\ln\left(1+2^{\alpha-1}-1\right)}{(\alpha-1)\ln 2}}=1.

Hence, by Definition 4, the likelihood ratio statistic $\widehat{D}_{1,n}$ is as Bahadur efficient as any $\widehat{D}_{\alpha,n}$ with $0<\alpha<1$. If $\alpha>1$ then Lemma 12 implies

{\frac{g_{\alpha}(\Delta_{\alpha})}{g_{1}(\Delta_{1})}}={\frac{(2^{\alpha-1}-1)^{1/\alpha}}{\ln 2}},

which exceeds $1$ for all sufficiently large $\alpha$ (already at $\alpha=2$ the ratio is $1/\ln 2\approx 1.44$, although it stays below $1$ for $\alpha$ close to $1$). However, contrary to this prevalence of $g_{\alpha}(\Delta_{\alpha})$ over $g_{1}(\Delta_{1})$ for large orders, Theorem 13 implies that $\widehat{D}_{1,n}$ is infinitely more Bahadur efficient than $\widehat{D}_{\alpha,n}$.
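The two ratio computations above can also be verified numerically; the sketch below (illustrative only) uses the Bahadur functions from Lemmas 11 and 12 together with the limits $\Delta_{\alpha}$ found above:

```python
import math

def g_ratio(alpha):
    # g_alpha(Delta_alpha) / g_1(Delta_1) under the alternatives (77)
    delta = (2 ** (alpha - 1) - 1) / (alpha * (alpha - 1))
    if alpha < 1.0:
        g = math.log(1 + alpha * (alpha - 1) * delta) / (alpha - 1)  # order 0 < alpha < 1
    else:
        g = (alpha * (alpha - 1) * delta) ** (1.0 / alpha)           # order alpha > 1
    return g / math.log(2)                                           # g_1(Delta_1) = ln 2

for a in [0.25, 0.5, 0.75, 1.5, 2.0, 4.0]:
    print(a, g_ratio(a))
# Orders 0 < alpha < 1 give the ratio exactly 1; orders alpha > 1 give a
# ratio exceeding 1 only once alpha is large enough (roughly alpha > 1.63).
```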

Example 15

Let us now consider the truncated geometric distribution

P_{n}=(p_{n0},p_{n1},\ldots,p_{nk})=c_{k}(p)\,(1,p,\ldots,p^{k})

with parameter p=pn]0,1[p=p_{n}\in]0,1[. Since

1+p+p2+=11p and pk+1+pk+2+=pk+11p,1+p+p^{2}+\ldots={\frac{1}{1-p}}\text{ \ \ and \ \ }p^{k+1}+p^{k+2}+\ldots={\frac{p^{k+1}}{1-p},}

we obtain

1+p+\cdots+p^{k}={\frac{1}{1-p}}-{\frac{p^{k+1}}{1-p}}={\frac{1-p^{k+1}}{1-p}}=\frac{1}{c_{k}(p)}.

Hence for all α0,1\alpha\neq 0,1

\alpha(\alpha-1)\,D_{\alpha,n}+1 ={\frac{1}{k}}\sum_{j=0}^{k}\left({\frac{p_{nj}}{1/k}}\right)^{\alpha}
={\frac{1}{k}}\sum_{j=0}^{k}{\frac{k^{\alpha}(1-p)^{\alpha}p^{\alpha j}}{(1-p^{k+1})^{\alpha}}}
={\frac{\left(k(1-p)\right)^{\alpha}}{k(1-p^{k+1})^{\alpha}}}\sum_{j=0}^{k}(p^{\alpha})^{j}
={\frac{\left(k(1-p)\right)^{\alpha}}{k(1-p^{k+1})^{\alpha}}}\cdot{\frac{1-p^{\alpha(k+1)}}{1-p^{\alpha}}}
={\frac{\left(k(1-p)\right)^{\alpha}}{k(1-p^{\alpha})}}\cdot{\frac{1-p^{\alpha(k+1)}}{(1-p^{k+1})^{\alpha}}}.
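The closed form obtained in the last line can be checked against the direct sum; a small sketch with hypothetical parameter values:

```python
def lhs_direct(k, p, alpha):
    # (1/k) * sum_{j=0}^{k} (k p_nj)^alpha with p_nj = c_k(p) p^j
    c = (1 - p) / (1 - p ** (k + 1))        # the normalizing constant c_k(p)
    return sum((k * c * p ** j) ** alpha for j in range(k + 1)) / k

def rhs_closed(k, p, alpha):
    # The closed form derived above for alpha(alpha-1) D_{alpha,n} + 1
    return ((k * (1 - p)) ** alpha / (k * (1 - p ** alpha))
            * (1 - p ** (alpha * (k + 1))) / (1 - p ** (k + 1)) ** alpha)

k, p, alpha = 50, 0.9, 2.5                  # arbitrary test values
print(lhs_direct(k, p, alpha), rhs_closed(k, p, alpha))  # the two numbers agree
```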

In the particular case $p=1-x/k$ for fixed $x>0$ we get $k(1-p)=x$ and

k(1-p^{\alpha}) =k\left(1-\left(1-{\frac{\alpha x}{k}}+o\left({\frac{x}{k}}\right)\right)\right)\longrightarrow\alpha x,
p^{\alpha(k+1)} =\left(1-{\frac{x}{k}}\right)^{\alpha(k+1)}\longrightarrow e^{-x\alpha},
p^{k+1} =\left(1-{\frac{x}{k}}\right)^{k+1}\longrightarrow e^{-x}.

Therefore

\alpha(\alpha-1)\,D_{\alpha,n}+1 ={\frac{x^{\alpha}}{k\left({\frac{\alpha x}{k}}+o\left({\frac{x}{k}}\right)\right)}}\cdot{\frac{1-e^{-x\alpha}}{(1-e^{-x})^{\alpha}}}
={\frac{x^{\alpha}}{\alpha x+o(x)}}\cdot{\frac{e^{x\alpha}-1}{(e^{x}-1)^{\alpha}}}.

Consequently,

α(α1)Δα+1=xα1αexα1(ex1)α\alpha(\alpha-1)\,\Delta_{\alpha}+1={\frac{x^{\alpha-1}}{\alpha}}\cdot{\frac{e^{x\alpha}-1}{(e^{x}-1)^{\alpha}}}

i.e.,

Δα=xα1(exα1)α(ex1)αα2(α1)(ex1)α for α0,1.\Delta_{\alpha}=\frac{x^{\alpha-1}(e^{x\alpha}-1)-\alpha(e^{x}-1)^{\alpha}}{\alpha^{2}(\alpha-1)(e^{x}-1)^{\alpha}}\text{ \ \ for }\alpha\neq 0,1.

By L’Hôpital’s rule,

\Delta_{1} =\ln\frac{x}{e(e^{x}-1)}+\frac{xe^{x}}{e^{x}-1},
\Delta_{0} =\ln\frac{e^{x}-1}{x}-\frac{x}{2}.

From here one can deduce that if x0x\rightarrow 0 then

Δα0 for all α.\Delta_{\alpha}\longrightarrow 0\text{ \ \ for all \ \ }\alpha\in\mathbb{R}.

If $x=1$ then

\Delta_{\alpha}={\frac{e^{\alpha}-1-\alpha(e-1)^{\alpha}}{\alpha^{2}(\alpha-1)\,(e-1)^{\alpha}}}\text{ \ \ for }\alpha\neq 0,1

and

\Delta_{1} =\frac{1-(e-1)\ln(e-1)}{e-1}\approx 0.041,
\Delta_{0} =\ln(e-1)-\frac{1}{2}\approx 0.041.
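These singular-order values can be double-checked by evaluating $\Delta_{\alpha}$ at orders close to $1$ and $0$; the sketch below (illustrative only) does this at the test point $x=1$:

```python
import math

def delta_alpha(x, alpha):
    # Delta_alpha = (x^(alpha-1)(e^(x alpha) - 1) - alpha(e^x - 1)^alpha)
    #               / (alpha^2 (alpha - 1)(e^x - 1)^alpha),  alpha != 0, 1
    num = x ** (alpha - 1) * math.expm1(x * alpha) - alpha * math.expm1(x) ** alpha
    den = alpha ** 2 * (alpha - 1) * math.expm1(x) ** alpha
    return num / den

x = 1.0
d1 = math.log(x / (math.e * math.expm1(x))) + x * math.exp(x) / math.expm1(x)
d0 = math.log(math.expm1(x) / x) - x / 2
print(delta_alpha(x, 1.001), d1)   # both approximately 0.041
print(delta_alpha(x, 0.001), d0)   # both approximately 0.041
```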

Using Lemma 12 and Theorem 13 in a similar manner as in the previous example, we find that here $\widehat{D}_{1,n}$ is more Bahadur efficient than any $\widehat{D}_{\alpha,n}$ with $0<\alpha<\infty$, $\alpha\neq 1$.

V Contiguity

In this paper we proved that the statistics $\hat{D}_{\alpha,n}$ of orders $\alpha>1$ are less Bahadur efficient than those of the orders $0<\alpha\leq 1$, and that the latter are mutually comparable in the Bahadur sense. One may have expected $\hat{D}_{1,n}$ to be much more Bahadur efficient than $\hat{D}_{\alpha,n}$ for $0<\alpha<1$. In order to understand why this is not the case, we have to examine the assumptions of our theory more closely.

Recall that, given a sequence of pairs of probability measures $\left(P_{n},Q_{n}\right)_{n\in\mathbb{N}}$, the sequence $\left(P_{n}\right)_{n\in\mathbb{N}}$ is said to be contiguous with respect to $\left(Q_{n}\right)_{n\in\mathbb{N}}$ if, for every sequence of sets $\left(A_{n}\right)_{n\in\mathbb{N}}$, $Q_{n}\left(A_{n}\right)\rightarrow 0$ as $n\rightarrow\infty$ implies $P_{n}\left(A_{n}\right)\rightarrow 0$ as $n\rightarrow\infty$. When $\left(P_{n}\right)_{n\in\mathbb{N}}$ is contiguous with respect to $\left(Q_{n}\right)_{n\in\mathbb{N}}$ we write $P_{n}\vartriangleleft Q_{n}$. Let $P$ and $Q$ be probability measures on the same set $\mathcal{X}$ and let $\left(\mathcal{F}_{n}\right)_{n\in\mathbb{N}}$ be an increasing sequence of finite sub-$\sigma$-algebras on $\mathcal{X}$ that generates the full $\sigma$-algebra on $\mathcal{X}$. If $P_{n}=P_{\mid\mathcal{F}_{n}}$ and $Q_{n}=Q_{\mid\mathcal{F}_{n}}$ then $P_{n}\vartriangleleft Q_{n}$ if and only if $P\ll Q$, where $\ll$ denotes absolute continuity. For completeness we give the proof of the following simple proposition.

Proposition 16

Let $\left(P_{n},Q_{n}\right)_{n\in\mathbb{N}}$ denote a sequence of pairs of probability measures and assume that the sequence $D_{1}\left(P_{n},Q_{n}\right)$ is bounded. Then $P_{n}\vartriangleleft Q_{n}$.

Proof:

Assume that the proposition is false. Then there exist $\varepsilon>0$ and a subsequence of sets $\left(A_{n_{k}}\right)_{k\in\mathbb{N}}$ such that $Q_{n_{k}}\left(A_{n_{k}}\right)\rightarrow 0$ for $k\rightarrow\infty$ and $P_{n_{k}}\left(A_{n_{k}}\right)\geq\varepsilon$ for all $k\in\mathbb{N}$. By the data processing inequality, $D_{1}\left(P_{n_{k}},Q_{n_{k}}\right)$ is bounded below by the divergence of the restrictions to the two-element partition $\{A_{n_{k}},A_{n_{k}}^{c}\}$, i.e. by

P_{n_{k}}(A_{n_{k}})\ln\frac{P_{n_{k}}(A_{n_{k}})}{Q_{n_{k}}(A_{n_{k}})}+\left(1-P_{n_{k}}(A_{n_{k}})\right)\ln\frac{1-P_{n_{k}}(A_{n_{k}})}{1-Q_{n_{k}}(A_{n_{k}})}.

For $k$ large enough the first term is at least $\varepsilon\ln\left(\varepsilon/Q_{n_{k}}(A_{n_{k}})\right)$, which tends to infinity, while the second term is bounded below by $-1/e$. Hence $D_{1}\left(P_{n_{k}},Q_{n_{k}}\right)\rightarrow\infty$, contradicting the assumed boundedness. ∎
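The mechanism of this proof is easily visualized: the binary divergence lower bound blows up as soon as the $Q$-probabilities of the sets shrink while the $P$-probabilities stay bounded away from zero. A small illustrative sketch (the value of $\varepsilon$ and the shrinking probabilities are arbitrary assumptions):

```python
import math

def binary_kl(p, q):
    # d(p||q) = p ln(p/q) + (1-p) ln((1-p)/(1-q)); by the data processing
    # inequality this is a lower bound for D_1(P,Q) whenever P(A)=p, Q(A)=q.
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

eps = 0.1
for q in [1e-1, 1e-3, 1e-6, 1e-12]:
    print(q, binary_kl(eps, q))
# As Q_n(A_n) -> 0 with P_n(A_n) >= eps fixed, the bound diverges,
# so D_1(P_n, Q_n) cannot remain bounded.
```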

In general, a large power $\alpha$ makes the power divergence $D_{\alpha}\left(P,Q\right)$ sensitive to large values of $dP/dQ$. Therefore the statistics $\widehat{D}_{\alpha,n}$ with large $\alpha$ should be used when the sequence of alternatives $P_{n}$ may not be contiguous with respect to the sequence of hypotheses $Q_{n}$. Conversely, a small power $\alpha$ makes $D_{\alpha}\left(P,Q\right)$ sensitive to small values of $dP/dQ$. Therefore $\widehat{D}_{\alpha,n}$ with small $\alpha$ should be used when the sequence of hypotheses $Q_{n}$ is not contiguous with respect to the sequence of alternatives $P_{n}$. Our conditions guarantee $P_{n}\vartriangleleft Q_{n}$ but not the reversed contiguity $Q_{n}\vartriangleleft P_{n}$. We see that a substantial modification of the conditions is needed in order to guarantee that $\hat{D}_{1,n}$ dominates the divergence statistics $\hat{D}_{\alpha,n}$ of the orders $0<\alpha<1$ in the Bahadur sense.

VI Appendix: Relations to previous results

As mentioned at the end of Section II, Harremoës and Vajda [5] assumed the same strong consistency as in Definition 4+ but introduced the Bahadur efficiency by the formula (36). The next four lemmas help to clarify the relation between this and the present precise concept of Bahadur efficiency (35).

Under the assumptions of Definition 4, [5] considered the following conditions.

C1:

The limit c¯α2/α1\bar{c}_{\alpha_{2}/\alpha_{1}} considered in (37) exists.

C2:

Both statistics D^αi,n\widehat{D}_{\alpha_{i},n} are strongly consistent and both functions gαig_{\alpha_{i}} are strongly Bahadur.

Lemma 17

Let the assumptions of Definition 5 hold. Under C1 the Bahadur efficiency (36) coincides with the present Bahadur efficiency (35). If moreover C2 holds then (36) is the Bahadur efficiency in the strong sense.

Proof:

The first assertion is clear from (36) and (35). Under C2 the assumptions of Definition 3+ hold. Hence the second assertion follows from Definition 6. ∎

Lemma 18

Let the assumptions of Definition 3 hold, let $b:\mathcal{I}\longrightarrow]0,1[$ be increasing, and let $d_{\alpha}$ be an arbitrary positive function of $\alpha$ on an interval $\mathcal{I}$ covering $\{\alpha_{1},\alpha_{2}\}$. If for every $\alpha\in\{\alpha_{1},\alpha_{2}\}$ the sequence $c_{\alpha}(n)$ generating the Bahadur function $g_{\alpha}$ satisfies the asymptotic condition

cα(n)=nb(α)(dα+o(1))c_{\alpha}(n)=n^{b(\alpha)}(d_{\alpha}+o(1)) (78)

then (31) holds for cα2/α1=c_{\alpha_{2}/\alpha_{1}}=\infty and condition C1 is satisfied.

Proof:

Under (78) it suffices to prove that (31) holds for cα2/α1=,c_{\alpha_{2}/\alpha_{1}}=\infty, i.e.

limncα2(mn)cα1(n)=\lim_{n\longrightarrow\infty}\frac{c_{\alpha_{2}}(m_{n})}{c_{\alpha_{1}}(n)}=\infty (79)

for mnm_{n} defined by (29). By (78),

cα2(mn)=mnb(α2)(dα2+o(1))c_{\alpha_{2}}(m_{n})=m_{n}^{b(\alpha_{2})}(d_{\alpha_{2}}+o(1))

and

cα1(n)=nb(α1)(dα1+o(1))c_{\alpha_{1}}(n)=n^{b(\alpha_{1})}(d_{\alpha_{1}}+o(1))

so that (29) implies

mn1b(α2)=n1b(α1)(γδ+o(1))m_{n}^{1-b(\alpha_{2})}=n^{1-b(\alpha_{1})}(\gamma\delta+o(1))

for the finite positive constants

\delta=\frac{d_{\alpha_{2}}}{d_{\alpha_{1}}}\text{ \ \ and \ \ }\gamma={\frac{g_{\alpha_{1}}(\Delta_{\alpha_{1}})}{g_{\alpha_{2}}(\Delta_{\alpha_{2}})}}.

Hence (30) implies

\frac{c_{\alpha_{2}}(m_{n})}{c_{\alpha_{1}}(n)} =\frac{m_{n}}{n}(\gamma^{-1}+o(1))
=\frac{n^{\frac{1-b(\alpha_{1})}{1-b(\alpha_{2})}}}{n}\left((\gamma\delta)^{\frac{1}{1-b(\alpha_{2})}}\gamma^{-1}+o(1)\right)
=n^{\frac{b(\alpha_{2})-b(\alpha_{1})}{1-b(\alpha_{2})}}\left(\gamma^{\frac{b(\alpha_{2})}{1-b(\alpha_{2})}}\delta^{\frac{1}{1-b(\alpha_{2})}}+o(1)\right)

so that (79) holds. ∎
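The conclusion (79) can be illustrated numerically: pick any increasing $b$ and positive constants $d_{\alpha}$ (the concrete numbers below are arbitrary test choices), solve (29) for $m_{n}$ in closed form, and watch the ratio grow:

```python
# Illustrative check of (79) under assumption (78): c_alpha(n) = n^b(alpha) * d_alpha.
b1, b2 = 0.3, 0.6      # b(alpha_1) < b(alpha_2), b increasing (arbitrary choices)
d1, d2 = 2.0, 5.0      # arbitrary positive constants d_alpha
gamma = 1.7            # assumed value of g_{alpha_1}(Delta_1)/g_{alpha_2}(Delta_2)

def m_of_n(n):
    # Solve (29): m / (m^b2 * d2) = gamma * n / (n^b1 * d1) for m.
    return (gamma * (d2 / d1) * n ** (1 - b1)) ** (1.0 / (1 - b2))

for n in [10**2, 10**4, 10**6]:
    m = m_of_n(n)
    print(n, (m ** b2 * d2) / (n ** b1 * d1))  # c_{alpha_2}(m_n)/c_{alpha_1}(n)
# The ratio grows like n^((b2-b1)/(1-b2)), i.e. without bound, as the proof shows.
```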

Lemma 19

Let the assumptions of Definition 5 hold and let for every α{α1,α2}\alpha\in\{\alpha_{1},\alpha_{2}\} the sequence cα(n)c_{\alpha}(n) generating the Bahadur function gαg_{\alpha} satisfy the asymptotic condition

cα(n)=αnb(α)lnnc_{\alpha}(n)=\frac{\alpha n^{b(\alpha)}}{\ln n} (80)

for some increasing function $b:\mathcal{I}\longrightarrow]0,1[$ on an interval $\mathcal{I}$ covering $\{\alpha_{1},\alpha_{2}\}$. Then (31) holds for $c_{\alpha_{2}/\alpha_{1}}=\infty$ and condition C1 is satisfied.

Proof:

Similarly as before, it suffices to prove the relation (79) for mnm_{n} defined by (29). By (80),

cα2(mn)=α2mnb(α2)lnmn and cα1(n)=α1nb(α1)lnnc_{\alpha_{2}}(m_{n})=\frac{\alpha_{2}m_{n}^{b(\alpha_{2})}}{\ln m_{n}}\text{ \ \ and \ \ }c_{\alpha_{1}}(n)=\frac{\alpha_{1}n^{b(\alpha_{1})}}{\ln n}

so that (29) implies

α2mn1b(α2)lnmn=α1n1b(α1)lnn(γ+o(1))\frac{\alpha_{2}m_{n}^{1-b(\alpha_{2})}}{\ln m_{n}}=\frac{\alpha_{1}n^{1-b(\alpha_{1})}}{\ln n}(\gamma+o(1))

for the same γ\gamma as in the previous proof. Since 1b(α2)<1b(α1)1-b(\alpha_{2})<1-b(\alpha_{1}), this implies the asymptotic relation

\frac{m_{n}}{n}\longrightarrow\infty. (81)

Similarly as in the previous proof, we get from (30)

\frac{c_{\alpha_{2}}(m_{n})}{c_{\alpha_{1}}(n)}=\frac{m_{n}}{n}(\gamma^{-1}+o(1))=\left(\frac{\alpha_{1}\ln m_{n}}{\alpha_{2}\ln n}\right)^{\frac{1}{1-b(\alpha_{2})}}n^{\frac{b(\alpha_{2})-b(\alpha_{1})}{1-b(\alpha_{2})}}\left(\gamma^{\frac{b(\alpha_{2})}{1-b(\alpha_{2})}}+o(1)\right)\geq\left(\frac{\alpha_{1}}{\alpha_{2}}\right)^{\frac{1}{1-b(\alpha_{2})}}n^{\frac{b(\alpha_{2})-b(\alpha_{1})}{1-b(\alpha_{2})}}\left(\gamma^{\frac{b(\alpha_{2})}{1-b(\alpha_{2})}}+o(1)\right),

where the inequality uses $\ln m_{n}\geq\ln n$ for all sufficiently large $n$, which follows from (81).

Therefore the desired relation (79) holds. ∎
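An analogous numerical sketch for the logarithmic sequences (80) (again with arbitrary test values; $m_{n}$ is now found by bisection since (29) has no closed-form solution here):

```python
import math

a1, a2, b1, b2, gamma = 1.2, 2.0, 0.3, 0.6, 1.7   # arbitrary test values

def lhs(m):
    # m / c_{alpha_2}(m) with c_alpha(n) = alpha * n^b(alpha) / ln n, cf. (80)
    return m * math.log(m) / (a2 * m ** b2)

def m_of_n(n):
    # Solve (29): lhs(m) = gamma * n / c_{alpha_1}(n), by bisection.
    target = gamma * n * math.log(n) / (a1 * n ** b1)
    lo, hi = 2.0, 4.0
    while lhs(hi) < target:
        hi *= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        lo, hi = (mid, hi) if lhs(mid) < target else (lo, mid)
    return lo

for n in [10**2, 10**4, 10**6]:
    m = m_of_n(n)
    print(n, (a2 * m ** b2 / math.log(m)) / (a1 * n ** b1 / math.log(n)))
# The ratio c_{alpha_2}(m_n)/c_{alpha_1}(n) again diverges, confirming (79).
```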

Lemma 20

Let the assumptions of Definition 3 hold and let for every α{α1,α2}\alpha\in\{\alpha_{1},\alpha_{2}\} the sequence cα(n)c_{\alpha}(n) generating the Bahadur function gαg_{\alpha} satisfy the asymptotic condition

cα(n)=αkb(α)lnkc_{\alpha}(n)=\frac{\alpha k^{b(\alpha)}}{\ln k} (82)

where $k=k_{n}\longrightarrow\infty$ is the sequence considered above and $b:\mathcal{I}\longrightarrow]0,\infty[$ is increasing on an interval $\mathcal{I}$ covering $\{\alpha_{1},\alpha_{2}\}$. Then (31) holds for $c_{\alpha_{2}/\alpha_{1}}=\infty$ and condition C1 is satisfied.

Proof:

It suffices to apply Lemma 19 to the sequences

cα1(k)=α1kb(α1)lnk and cα2(mk)=α2mkb(α2)lnmkc_{\alpha_{1}}(k)=\frac{\alpha_{1}k^{b(\alpha_{1})}}{\ln k}\text{ \ \ and \ \ }c_{\alpha_{2}}(m_{k})=\frac{\alpha_{2}m_{k}^{b(\alpha_{2})}}{\ln m_{k}}

for mkm_{k} defined by the condition

\frac{m_{k}}{c_{\alpha_{2}}(m_{k})}=\frac{g_{\alpha_{1}}(\Delta_{\alpha_{1}})}{g_{\alpha_{2}}(\Delta_{\alpha_{2}})}\cdot\frac{k}{c_{\alpha_{1}}(k)}\left(1+o(1)\right)\text{ \ \ (cf. (29)).} (83) ∎

Example 21

Let the assumptions of Definition 5 hold for $\alpha_{1}=1$ and $\alpha_{2}=\alpha>1$, and let

kb(α)+1lnnn0forb(α)=(α1)/α.{\frac{k^{b(\alpha)+1}\ln n}{n}}\longrightarrow 0\quad\mbox{for}\ \ \ b(\alpha)=(\alpha-1)/\alpha. (84)

By [5, Eqs. 51, 76 and 79] and (84), the sequences

c1(n)=1andcα(n)=αkb(α)lnkc_{1}(n)=1\quad\mbox{and}\quad c_{\alpha}(n)={\frac{\alpha k^{b(\alpha)}}{\ln k}} (85)

generate the Bahadur functions

g1(Δ)=Δandgα(Δ)=(α(α1)Δ)1/α, Δ>0.g_{1}(\Delta)=\Delta\quad\mbox{and}\quad g_{\alpha}(\Delta)=\left(\alpha(\alpha-1)\,\Delta\right)^{1/\alpha},\text{ \ \ }\Delta>0. (86)

Here we cannot apply Lemma 18 since $c_{1}(n)$ is not a special case of $c_{\alpha}(n)$ for $\alpha=1$. An alternative direct approach can be based on the observation that (29) cannot hold if $\liminf_{n}m_{n}<\infty$. In the opposite case $m_{n}\rightarrow\infty$, which obviously implies

cα/1=deflimncα(mn)c1(n)=c_{\alpha/1}\overset{def}{=}\lim\nolimits_{n}{\frac{c_{\alpha}(m_{n})}{c_{1}(n)}}=\infty

so that C1 holds with $\bar{c}_{\alpha_{2}/\alpha_{1}}\equiv c_{\alpha/1}=\infty$. Hence Lemma 17 implies that the Bahadur efficiency $\mbox{BE}\left(\widehat{D}_{1,n}\,;\,\widehat{D}_{\alpha,n}\right)=\infty$ obtained previously by Harremoës and Vajda [5, Eq. 81] coincides with the Bahadur efficiency of $\widehat{D}_{1,n}$ with respect to $\widehat{D}_{\alpha,n}$ in the present precise sense of (35). Under a stronger condition on $k$ than (84), Harremoës and Vajda also established the strong consistency of the statistics $\widehat{D}_{1,n}$ and $\widehat{D}_{\alpha,n}$. One can verify that the functions in (86) are strongly Bahadur, so that C2 holds as well. Hence, by Lemma 17, we deal here with the Bahadur efficiency in the strong sense.

Example 22

Let the assumptions of Definition 5 hold for $\alpha_{1}>1$ and let the function $b(\alpha)$ be defined by (84) for all $\alpha\geq 1$. Harremoës and Vajda [5] proved that if the sequence $k$ satisfies the condition (84) with $\alpha=\alpha_{2}$ then for all $\alpha\in\{\alpha_{1},\alpha_{2}\}$ the function $g_{\alpha}(\Delta)$ given by the second formula in (86) is the Bahadur function of the statistics $\widehat{D}_{\alpha,n}$ generated by the sequences $c_{\alpha}(n)$ from the second formula in (85). Thus in this case the assumptions of Lemma 20 hold. From Lemmas 20 and 17 we conclude that the Bahadur efficiency

\mbox{BE}\left(\widehat{D}_{\alpha_{1},n}\,;\,\widehat{D}_{\alpha_{2},n}\right)=\infty\quad\mbox{for all}\ \ 0<\alpha_{1}<\alpha_{2}<\infty

obtained in [5, Eq. 81] coincides with the Bahadur efficiency in the present precise sense. Similarly as in the previous example, we arrive at the conclusion that this is the Bahadur efficiency in the strong sense.

Acknowledgement. This research was supported by the European Network of Excellence and by the GAČR grant 202/10/0618.

References

  • [1] E. L. Lehmann and J. P. Romano, Testing Statistical Hypotheses, 3rd ed. New York: Springer, 2005.
  • [2] I. Csiszár, “Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten,” Publ. Math. Inst. Hungar. Acad. Sci., vol. 8, pp. 95–108, 1963.
  • [3] A. Rényi, “On measures of entropy and information,” in Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 547–561, 1961.
  • [4] T. Kailath, “The divergence and Bhattacharyya distance measures in signal selection,” IEEE Trans. Comm., vol. 15, pp. 52–60, 1967.
  • [5] P. Harremoës and I. Vajda, “On the Bahadur-efficient testing of uniformity by means of the entropy,” IEEE Trans. Inform. Theory, vol. 54, pp. 321–331, Jan. 2008.
  • [6] P. Harremoës and I. Vajda, “Consistence of various φ-divergence statistics,” Research Report ÚTIA AV ČR 2218, Institute of Information Theory and Automation, Praha, March 2008.
  • [7] P. Harremoës and I. Vajda, “Efficiency of entropy testing,” in International Symposium on Information Theory, pp. 2639–2643, IEEE, July 2008.
  • [8] M. P. Quine and J. Robinson, “Efficiencies of chi-square and likelihood ratio goodness-of-fit tests,” Ann. Statist., vol. 13, pp. 727–742, 1985.
  • [9] J. Beirlant, L. Devroye, L. Györfi, and I. Vajda, “Large deviations of divergence measures on partitions,” J. Statist. Planning and Infer., vol. 93, pp. 1–16, 2001.
  • [10] L. Györfi, G. Morvai, and I. Vajda, “Information-theoretic methods in testing the goodness-of-fit,” in Proc. International Symposium on Information Theory, Sorrento, Italy, June 25–30, p. 28, 2000.
  • [11] F. Liese and I. Vajda, Convex Statistical Distances. Leipzig: Teubner, 1987.
  • [12] F. Liese and I. Vajda, “On divergences and informations in statistics and information theory,” IEEE Trans. Inform. Theory, vol. 52, pp. 4394–4412, Oct. 2006.
  • [13] T. R. C. Read and N. Cressie, Goodness of Fit Statistics for Discrete Multivariate Data. Berlin: Springer, 1988.
  • [14] C. Morris, “Central limit theorems for multinomial sums,” Ann. Statist., vol. 3, pp. 165–188, 1975.
  • [15] L. Györfi and I. Vajda, “Asymptotic distributions for goodness-of-fit statistics in a sequence of multinomial models,” Stat. Probab. Letters, vol. 56, no. 1, pp. 57–67, 2002.
  • [16] P. Harremoës and F. Topsøe, “Inequalities between entropy and index of coincidence derived from information diagrams,” IEEE Trans. Inform. Theory, vol. 47, pp. 2944–2960, Nov. 2001.