
Continuity and Additivity Properties
of Information Decompositions

Johannes Rauh (Institut für Qualität und Transparenz im Gesundheitswesen (IQTiG), Berlin), Pradeep Kr. Banerjee (Max-Planck-Institut für Mathematik in den Naturwissenschaften (MiS), Leipzig), Eckehard Olbrich (MiS, Leipzig), Guido Montúfar (MiS, Leipzig, and University of California, Los Angeles), Jürgen Jost (MiS, Leipzig)
Abstract

Information decompositions quantify how the Shannon information about a given random variable is distributed among several other random variables. Various requirements have been proposed that such a decomposition should satisfy, leading to different candidate solutions. Curiously, however, only two of the original requirements that determined the Shannon information have been considered, namely monotonicity and normalization. Two other important properties, continuity and additivity, have not been considered. In this contribution, we focus on the mutual information of two finite variables Y, Z about a third finite variable S and check which of the decompositions satisfy these two properties. While most of them satisfy continuity, only one of them is both continuous and additive.

1 Introduction

The fundamental concept of Shannon information is uniquely determined by four simple requirements: continuity, strong additivity, monotonicity, and a normalization (Shannon, 1948). We refer to Csiszár (2008) for a discussion of axiomatic characterizations. Continuity implies that small perturbations of the underlying probability distribution have only small effects on the information measure, and this is of course very appealing. Strong additivity refers to the requirement that the chain rule H(YZ) = H(Y) + H(Z|Y) holds. Similar conditions are also satisfied, mutatis mutandis, for the derived concepts of conditional and mutual information, as well as for other information measures, such as interaction information/co-information (McGill, 1954; Bell, 2003) or total correlation/multi-information (Watanabe, 1960; Studený and Vejnarová, 1998).

Williams and Beer (2010) proposed to decompose the mutual information that several random variables Y_1, \dots, Y_k have about a target variable S into various components that quantify how much information these variables possess individually, how much they share and how much they need to combine to become useful. That is, one wants to disentangle how information about S is distributed over the Y_1, \dots, Y_k. Again, various requirements can be imposed, with varying degrees of plausibility, upon such a decomposition. There are several candidate solutions, and not all of them satisfy all those requirements. Curiously, however, previous considerations did not include continuity and strong additivity. While Bertschinger et al. (2013) did consider chain rule-type properties, none of the information measures defined within the context of information decompositions satisfies any of these chain rule properties (Rauh et al., 2014).

In this contribution, we evaluate which of the various proposed decompositions satisfy continuity and additivity. Here, additivity (without the qualifier "strong") is required only for independent variables (see Definition 8 below). Additivity (together with other properties) may replace strong additivity when defining Shannon information axiomatically (see Csiszár (2008) for an overview). The importance of additivity is also discussed by Matveev and Portegies (2017).

We consider the case where all random variables are finite, and we restrict to the bivariate case k=2. We think that this simplest possible setting is the most important one to understand conceptually and in practical applications. Already here there are important differences between the measures that have been proposed in the literature. A bivariate information decomposition consists of three functions SI, UI and CI that depend on the joint distribution of three variables S, Y, Z, and that satisfy:

I(S;YZ) = \underbrace{SI(S;Y,Z)}_{\text{shared}} + \underbrace{CI(S;Y,Z)}_{\text{complementary}} + \underbrace{UI(S;Y\backslash Z)}_{\text{unique ($Y$ wrt $Z$)}} + \underbrace{UI(S;Z\backslash Y)}_{\text{unique ($Z$ wrt $Y$)}},   (1)
I(S;Y) = SI(S;Y,Z) + UI(S;Y\backslash Z),
I(S;Z) = SI(S;Y,Z) + UI(S;Z\backslash Y).

Hence, I(S;YZ) is decomposed into a shared part that is contained in both Y and Z, a complementary (or synergistic) part that is only available from (Y,Z) together, and unique parts contained exclusively in either Y or Z. The different terms I, SI, CI, UI are functions of the joint probability distribution of the three random variables S, Y, Z, commonly written with suggestive arguments as in (1).

To define a bivariate information decomposition in this sense, it suffices to define either of SI, UI or CI. The other functions are then determined from (1). The linear system (1) consists of three equations in four unknowns, where the two unknowns UI(S;Y\backslash Z) and UI(S;Z\backslash Y) are related. Thus, when starting with a function UI to define an information decomposition, the following consistency condition must be satisfied:

I(S;Y) + UI(S;Z\backslash Y) = I(S;Z) + UI(S;Y\backslash Z).   (2)

If this consistency condition fails, one may try to adjust the proposed measure of unique information to enforce consistency, using a construction from Banerjee et al. (2018a) (see Section 2).

As mentioned above, several bivariate information decompositions have been proposed (see Section 2 for a list). However, there are still holes in our understanding of the properties of those decompositions that have been proposed so far. This paper investigates the continuity and additivity properties of some of these decompositions.

Continuity is understood with respect to the canonical topology of the set of joint distributions of finite variables of fixed sizes. When P_n is a sequence of joint distributions with P_n \to P, does SI_{P_n}(S;Y,Z) \to SI_P(S;Y,Z)? Most, but not all, proposed information decompositions are continuous (i.e. SI, UI and CI are all continuous). If an information decomposition is continuous, one may ask whether it is differentiable, at least at probability distributions of full support. Among the information decompositions that we consider, only the decomposition I_{\operatorname{IG}} (Niu and Quinn, 2019) is differentiable. Continuity and smoothness are discussed in detail in Section 3.

The second property that we focus on is additivity, by which we mean that SI, UI and CI behave additively when a system can be decomposed into (marginally) independent subsystems (see Definition 8 in Section 4). This property corresponds to the notion of extensivity as used in thermodynamics. Only the information decomposition I_{\operatorname{BROJA}} (Bertschinger et al., 2014) in our list satisfies this property. A weak form of additivity, the identity axiom proposed by Harder et al. (2013), is well-studied and is satisfied by other bivariate information decompositions.

2 Proposed information decompositions

We now list the bivariate information decompositions that we want to investigate. The last paragraph mentions further related information measures. We denote information decompositions by I, with sub- or superscripts. The corresponding measures SI, UI and CI inherit these decorations.

We use the following notation: S, Y, Z are random variables with finite state spaces \mathcal{S}, \mathcal{Y}, \mathcal{Z}. The set of all probability distributions on a set \mathcal{S} (i.e. the probability simplex over \mathcal{S}) is denoted by \mathbb{P}_{\mathcal{S}}. The joint distribution P of S, Y, Z is then an element of \mathbb{P}_{\mathcal{S}\times\mathcal{Y}\times\mathcal{Z}}.

• I_{\min}

Together with the information decomposition framework, Williams and Beer (2010) also defined an information decomposition I_{\min}. Let

I(S=s;Y) = \sum_{y\in\mathcal{Y}} P(y|s)\log\frac{P(s|y)}{P(s)},
I(S=s;Z) = \sum_{z\in\mathcal{Z}} P(z|s)\log\frac{P(s|z)}{P(s)}

be the specific information of the outcome S=s about Y and Z, respectively. Then

SI_{\min}(S;Y,Z) = \sum_{s\in\mathcal{S}} P(s)\min\big\{I(S=s;Y),\, I(S=s;Z)\big\}.

I_{\min} has been criticized because it assigns relatively large values of shared information, conflating "the same amount of information" with "the same information" (Harder et al., 2013; Griffith and Koch, 2014).
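As a concrete illustration, here is a minimal numpy sketch (our own, not code from any of the cited works; the function name si_min, the array convention P[s, y, z] and the use of base-2 logarithms are our choices) that evaluates the specific informations and SI_{\min}:

```python
import numpy as np

def si_min(P, eps=1e-12):
    """SI_min for a joint pmf P[s, y, z] (a numpy array summing to 1), in bits."""
    Psy = P.sum(axis=2)                   # P(s, y)
    Psz = P.sum(axis=1)                   # P(s, z)
    Ps = Psy.sum(axis=1)                  # P(s)
    Py = Psy.sum(axis=0)                  # P(y)
    Pz = Psz.sum(axis=0)                  # P(z)
    total = 0.0
    for s in range(P.shape[0]):
        if Ps[s] <= eps:
            continue
        # specific informations I(S=s;Y) and I(S=s;Z)
        spec_y = sum((Psy[s, y] / Ps[s]) * np.log2(Psy[s, y] / (Py[y] * Ps[s]))
                     for y in range(P.shape[1]) if Psy[s, y] > eps)
        spec_z = sum((Psz[s, z] / Ps[s]) * np.log2(Psz[s, z] / (Pz[z] * Ps[s]))
                     for z in range(P.shape[2]) if Psz[s, z] > eps)
        total += Ps[s] * min(spec_y, spec_z)
    return total

# Example: AND gate, S = Y AND Z with independent uniform inputs.
P = np.zeros((2, 2, 2))
for y in (0, 1):
    for z in (0, 1):
        P[y & z, y, z] = 0.25
print(si_min(P))   # approx. 0.311 bits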

• I_{\operatorname{MMI}}

A related information decomposition is the minimum mutual information (MMI) decomposition given by

SI_{\operatorname{MMI}}(S;Y,Z) = \min\big\{I(S;Y),\, I(S;Z)\big\}.

Even more severely than I_{\min}, this information decomposition conflates "the same amount of information" with "the same information." Still, formally, this definition produces a valid bivariate information decomposition and thus serves as a useful benchmark. The axioms imply that SI(S;Y,Z) \leq SI_{\operatorname{MMI}}(S;Y,Z) for any other bivariate information decomposition. For multivariate Gaussian variables, many information decompositions actually agree with I_{\operatorname{MMI}} (Barrett, 2015).
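A corresponding sketch for SI_{\operatorname{MMI}} (again our own illustration with hypothetical helper names) is even shorter; by the inequality above, it also provides a quick numerical upper bound on any other proposed SI:

```python
import numpy as np

def mutual_information(Pxy, eps=1e-12):
    """I(X;Y) in bits for a joint pmf given as a 2-D numpy array."""
    Px = Pxy.sum(axis=1, keepdims=True)
    Py = Pxy.sum(axis=0, keepdims=True)
    mask = Pxy > eps
    return float((Pxy[mask] * np.log2(Pxy[mask] / (Px @ Py)[mask])).sum())

def si_mmi(P):
    """SI_MMI(S;Y,Z) = min{I(S;Y), I(S;Z)} for a joint pmf P[s, y, z]."""
    return min(mutual_information(P.sum(axis=2)),   # I(S;Y)
               mutual_information(P.sum(axis=1)))   # I(S;Z)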

• I_{\operatorname{red}}

To address the criticism of I_{\min}, Harder et al. (2013) introduced a bivariate information decomposition I_{\operatorname{red}} based on a notion of redundant information as follows. Let \mathcal{Z}' := \{z\in\mathcal{Z} : P(Z=z)>0\} be the support of Z, and let

P_{S|y\searrow Z} = \operatorname*{arg\,min}_{Q\in\operatorname{conv}\{P(S|z):z\in\mathcal{Z}'\}} D(P(S|y)\|Q),
I_S(y\searrow Z) = D(P(S|y)\|P(S)) - D(P(S|y)\|P_{S|y\searrow Z}),
I_S(Y\searrow Z) = \sum_{y\in\mathcal{Y}} P(y)\, I_S(y\searrow Z).   (3)

Then

SI_{\operatorname{red}}(S;Y,Z) = \min\big\{I_S(Y\searrow Z),\, I_S(Z\searrow Y)\big\}.

Banerjee et al. (2018a) showed that the quantity I(S;Y) - I_S(Y\searrow Z) has a decision-theoretic interpretation by way of channel deficiencies (Raginsky, 2011).
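The optimization in (3) is a KL-projection onto the convex hull of the conditional distributions P(S|z). The sketch below is our own illustration, under the assumption that a generic scipy solver is accurate enough; it parametrizes the projection as a mixture of the conditionals and is not a substitute for a dedicated implementation near the boundary of the simplex:

```python
import numpy as np
from scipy.optimize import minimize

def kl(p, q, eps=1e-12):
    """D(p || q) in bits, summed over the support of p."""
    mask = p > eps
    return float((p[mask] * np.log2(p[mask] / np.maximum(q[mask], eps))).sum())

def kl_projection(p, conditionals):
    """Mixture of the given conditionals minimizing D(p || .), cf. P_{S|y->Z} in (3)."""
    A = np.stack(conditionals)                       # rows are the P(S|z), z with P(z) > 0
    k = A.shape[0]
    res = minimize(lambda w: kl(p, w @ A), np.full(k, 1.0 / k),
                   bounds=[(0, 1)] * k,
                   constraints=[{'type': 'eq', 'fun': lambda w: w.sum() - 1}],
                   method='SLSQP')
    return res.x @ A

def redundancy(P):
    """I_S(Y -> Z) of Eq. (3) for a joint pmf P[s, y, z]."""
    Psy, Psz = P.sum(axis=2), P.sum(axis=1)
    Ps, Py, Pz = Psy.sum(axis=1), Psy.sum(axis=0), Psz.sum(axis=0)
    conds_z = [Psz[:, z] / Pz[z] for z in range(P.shape[2]) if Pz[z] > 0]
    total = 0.0
    for y in range(P.shape[1]):
        if Py[y] > 0:
            p_s_given_y = Psy[:, y] / Py[y]
            total += Py[y] * (kl(p_s_given_y, Ps)
                              - kl(p_s_given_y, kl_projection(p_s_given_y, conds_z)))
    return total

def si_red(P):
    """SI_red(S;Y,Z) = min of the two redundancy directions."""
    return min(redundancy(P), redundancy(P.transpose(0, 2, 1)))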

• I_{\operatorname{BROJA}}

Motivated from decision-theoretic considerations, Bertschinger et al. (2014) introduced the bivariate information decomposition I_{\operatorname{BROJA}} (eponymously named after the authors in (Bertschinger et al., 2014)). Given P \in \mathbb{P}_{\mathcal{S}\times\mathcal{Y}\times\mathcal{Z}}, let \Delta_P denote the set of joint distributions of (S,Y,Z) that have the same marginals on (S,Y) and (S,Z) as P. Then define the unique information that Y conveys about S with respect to Z as

UI_{\operatorname{BROJA}}(S;Y\backslash Z) := \min_{Q\in\Delta_P} I_Q(S;Y|Z),

where the subscript Q in I_Q denotes the joint distribution on which the function is computed. Computation of this decomposition was investigated by Banerjee et al. (2018b). I_{\operatorname{BROJA}} leads to a concept of synergy that agrees with the synergy measure defined by Griffith and Koch (2014).
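The definition of UI_{\operatorname{BROJA}} is a convex program over the polytope \Delta_P. The following sketch is our own illustration using a generic scipy solver started from the feasible point P(S,Y)P(S,Z)/P(S); it is not the dedicated algorithm of Banerjee et al. (2018b), and for larger state spaces a specialized solver is preferable:

```python
import numpy as np
from scipy.optimize import minimize

def cond_mi(Q, eps=1e-12):
    """I_Q(S;Y|Z) in bits for a joint pmf Q[s, y, z]."""
    Qsz, Qyz = Q.sum(axis=1), Q.sum(axis=0)
    Qz = Qyz.sum(axis=0)
    val = 0.0
    for (s, y, z), q in np.ndenumerate(Q):
        if q > eps:
            val += q * np.log2(q * Qz[z] / (Qsz[s, z] * Qyz[y, z]))
    return val

def ui_broja(P):
    """min over Q in Delta_P of I_Q(S;Y|Z), via a generic constrained solver."""
    Psy, Psz = P.sum(axis=2), P.sum(axis=1)      # the marginals that define Delta_P
    Ps = Psy.sum(axis=1)
    # feasible starting point: Q0(s,y,z) = P(s,y) P(s,z) / P(s)
    Q0 = np.einsum('sy,sz->syz', Psy, Psz) / np.maximum(Ps, 1e-15)[:, None, None]
    shape = P.shape
    cons = [
        {'type': 'eq', 'fun': lambda x: (x.reshape(shape).sum(axis=2) - Psy).ravel()},
        {'type': 'eq', 'fun': lambda x: (x.reshape(shape).sum(axis=1) - Psz).ravel()},
    ]
    res = minimize(lambda x: cond_mi(x.reshape(shape)), Q0.ravel(),
                   bounds=[(0, 1)] * P.size, constraints=cons, method='SLSQP')
    return cond_mi(res.x.reshape(shape))

# Example: XOR gate, S = Y xor Z with independent uniform inputs.
P = np.zeros((2, 2, 2))
for y in (0, 1):
    for z in (0, 1):
        P[y ^ z, y, z] = 0.25
print(ui_broja(P))   # close to 0: all information about an XOR target is synergistic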

• I_{\operatorname{dep}}

James et al. (2018) define the following bivariate decomposition: Given the joint distribution P \in \mathbb{P}_{\mathcal{S}\times\mathcal{Y}\times\mathcal{Z}} of (S,Y,Z), let P_{Y-S-Z} = P(S,Y)P(S,Z)/P(S) be the probability distribution in \mathbb{P}_{\mathcal{S}\times\mathcal{Y}\times\mathcal{Z}} that maximizes the entropy among all distributions Q with Q(S,Y)=P(S,Y) and Q(S,Z)=P(S,Z). Similarly, let P_{\Delta} be the probability distribution in \mathbb{P}_{\mathcal{S}\times\mathcal{Y}\times\mathcal{Z}} that maximizes the entropy among all distributions Q with Q(S,Y)=P(S,Y), Q(S,Z)=P(S,Z) and Q(Y,Z)=P(Y,Z) (unlike for P_{Y-S-Z}, we do not have an explicit formula for P_{\Delta}). Then

UI_{\operatorname{dep}}(S;Y\setminus Z) = \min\big\{I_{P_{Y-S-Z}}(S;Y|Z),\, I_{P_{\Delta}}(S;Y|Z)\big\}.

This definition is motivated in terms of a lattice of all sensible marginal constraints when maximizing the entropy, as in the definition of P_{Y-S-Z} and P_{\Delta} (see (James et al., 2018) for the details).
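As a rough numerical illustration (ours, not from (James et al., 2018)): P_{Y-S-Z} has the explicit formula above, while P_{\Delta} is approximated here by iterative proportional fitting started from the uniform distribution, which converges to the required maximum-entropy distribution when a full-support distribution with the prescribed pairwise marginals exists:

```python
import numpy as np

def cond_mi(Q, eps=1e-12):
    """I_Q(S;Y|Z) in bits for a joint pmf Q[s, y, z]."""
    Qsz, Qyz = Q.sum(axis=1), Q.sum(axis=0)
    Qz = Qyz.sum(axis=0)
    val = 0.0
    for (s, y, z), q in np.ndenumerate(Q):
        if q > eps:
            val += q * np.log2(q * Qz[z] / (Qsz[s, z] * Qyz[y, z]))
    return val

def maxent_pairwise(P, iterations=2000, eps=1e-12):
    """Approximate P_Delta: the maximum-entropy pmf with the (S,Y), (S,Z) and (Y,Z)
    marginals of P, via iterative proportional fitting from the uniform distribution."""
    Psy, Psz, Pyz = P.sum(axis=2), P.sum(axis=1), P.sum(axis=0)
    Q = np.full(P.shape, 1.0 / P.size)
    for _ in range(iterations):
        Q *= (Psy / np.maximum(Q.sum(axis=2), eps))[:, :, None]
        Q *= (Psz / np.maximum(Q.sum(axis=1), eps))[:, None, :]
        Q *= (Pyz / np.maximum(Q.sum(axis=0), eps))[None, :, :]
    return Q

def ui_dep(P):
    """UI_dep(S;Y\\Z) = min{ I_{P_{Y-S-Z}}(S;Y|Z), I_{P_Delta}(S;Y|Z) }."""
    Psy, Psz = P.sum(axis=2), P.sum(axis=1)
    Ps = Psy.sum(axis=1)
    P_ysz = np.einsum('sy,sz->syz', Psy, Psz) / np.maximum(Ps, 1e-15)[:, None, None]
    return min(cond_mi(P_ysz), cond_mi(maxent_pairwise(P)))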

• I_{\cap}^{*}, I_{\cap}^{\wedge} and I_{\cap}^{\operatorname{GH}}

The information decompositions I_{\cap}^{\wedge} (Griffith et al., 2014), I_{\cap}^{\operatorname{GH}} (Griffith and Ho, 2015) and I_{\cap}^{*} (Kolchinsky, 2022) are motivated from the notion of common information due to Gács and Körner (1973) and present three different approaches that try to represent the shared information in terms of a random variable Q:

SI_{\cap}^{\wedge}(S;Y,Z) = \max\big\{I(Q;S) : Q = f(Y) = f'(Z) \text{ a.s.}\big\},   (4)
SI_{\cap}^{\operatorname{GH}}(S;Y,Z) = \max\big\{I(Q;S) : I(S;Q|Y) = I(S;Q|Z) = 0\big\},   (5)
SI_{\cap}^{*}(S;Y,Z) = \max\big\{I(Q;S) : P(s,q) = \sum_{y} P(s,y)\lambda_{q|y} = \sum_{z} P(s,z)\lambda'_{q|z}\big\},   (6)

where the optimization runs over all pairs of (deterministic) functions f, f' (for SI_{\cap}^{\wedge}), all joint distributions of four random variables S, Y, Z, Q that extend the joint distribution of S, Y, Z (for SI_{\cap}^{\operatorname{GH}}), and all pairs of stochastic matrices \lambda_{q|y}, \lambda'_{q|z} (for SI_{\cap}^{*}), respectively. One can show that SI_{\cap}^{\wedge}(S;Y,Z) \leq SI_{\cap}^{\operatorname{GH}}(S;Y,Z) \leq SI_{\cap}^{*}(S;Y,Z) (Kolchinsky, 2022).
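Of the three measures, SI_{\cap}^{\wedge} is the most directly computable: as used in the proof of Theorem 13 below, the maximizer in (4) is the Gács-Körner common random variable of Y and Z, i.e. the connected-component label of the bipartite graph that links y and z whenever P(y,z)>0. A small sketch (our own, with illustrative names):

```python
import numpy as np

def mutual_information(Pxy, eps=1e-12):
    Px = Pxy.sum(axis=1, keepdims=True)
    Py = Pxy.sum(axis=0, keepdims=True)
    mask = Pxy > eps
    return float((Pxy[mask] * np.log2(Pxy[mask] / (Px @ Py)[mask])).sum())

def si_wedge(P, eps=1e-12):
    """SI^wedge(S;Y,Z) = I(S;Q), with Q the Gács-Körner common random variable of Y and Z."""
    nS, nY, nZ = P.shape
    Pyz = P.sum(axis=0)
    parent = list(range(nY + nZ))         # union-find over the nY + nZ graph nodes
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for y in range(nY):
        for z in range(nZ):
            if Pyz[y, z] > eps:
                parent[find(y)] = find(nY + z)
    labels = sorted({find(y) for y in range(nY)})
    # P(s, q): collapse Y onto its connected component q
    Psq = np.zeros((nS, len(labels)))
    for y in range(nY):
        Psq[:, labels.index(find(y))] += P[:, y, :].sum(axis=1)
    return mutual_information(Psq)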

The I_{\cap}^{*}-decomposition draws motivation from considerations of channel preorders, in a similar spirit as Banerjee et al. (2018a), and it is related to ideas from Bertschinger and Rauh (2014). Kolchinsky (2022) shows that there is an analogy between I_{\cap}^{*} and I_{\operatorname{BROJA}}.

• I_{\operatorname{IG}}

Niu and Quinn (2019) presented a bivariate information decomposition I_{\operatorname{IG}} based on information geometric ideas. While their construction is very elegant, it only works for joint distributions P of full support (i.e. P(s,y,z)>0 for all s,y,z). It is unknown whether it can be extended meaningfully to all joint distributions. Numerical evidence exists that a unique continuous extension is possible at least to some joint distributions with restricted support (see examples in (Niu and Quinn, 2019)).

For any t\in\mathbb{R}, consider the joint distribution

P^{(t)}(s,y,z) = \frac{1}{c_t}\, P_{S-Y-Z}^{t}(s,y,z)\, P_{S-Z-Y}^{1-t}(s,y,z) = \frac{1}{c_t}\, P(y,z)\, P(s|y)^{t}\, P(s|z)^{1-t},

where c_t is a normalizing constant, and let

P^{*} = \operatorname*{arg\,min}_{t\in\mathbb{R}} D(P\|P^{(t)}).

Then

CI_{\operatorname{IG}}(S;Y,Z) = D(P\|P^{*}),
UI_{\operatorname{IG}}(S;Y\setminus Z) = D(P^{*}\|P_{S-Z-Y}).

An interesting aspect of these definitions is that, by the Generalized Pythagorean Theorem (see (Amari, 2018)), D(P\|P_{S-Z-Y}) = D(P\|P^{*}) + D(P^{*}\|P_{S-Z-Y}).
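Since the minimization runs over a single real parameter t, the decomposition is easy to evaluate numerically for full-support distributions. A sketch (our own illustration, using base-2 logarithms):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def kl(P, Q, eps=1e-12):
    """D(P || Q) in bits."""
    mask = P > eps
    return float((P[mask] * np.log2(P[mask] / np.maximum(Q[mask], eps))).sum())

def ig_decomposition(P):
    """CI_IG(S;Y,Z) and UI_IG(S;Y\\Z) for a full-support joint pmf P[s, y, z]."""
    Psy, Psz, Pyz = P.sum(axis=2), P.sum(axis=1), P.sum(axis=0)
    Py, Pz = Pyz.sum(axis=1), Pyz.sum(axis=0)
    P_s_given_y = Psy / Py[None, :]                              # P(s|y)
    P_s_given_z = Psz / Pz[None, :]                              # P(s|z)
    def P_t(t):
        Q = Pyz[None, :, :] * P_s_given_y[:, :, None] ** t * P_s_given_z[:, None, :] ** (1 - t)
        return Q / Q.sum()                                       # division by c_t
    t_star = minimize_scalar(lambda t: kl(P, P_t(t))).x
    P_star = P_t(t_star)
    P_szy = Pyz[None, :, :] * P_s_given_z[:, None, :]            # P_{S-Z-Y}(s,y,z) = P(y,z) P(s|z)
    return kl(P, P_star), kl(P_star, P_szy)                      # CI_IG, UI_IG(S;Y\Z)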

• The UI construction

Given an information measure that captures some aspect of unique information but that fails to satisfy the consistency condition (2), one may construct a corresponding bivariate information decomposition as follows:

Lemma 1.

Let \delta : \mathbb{P}_{\mathcal{S}\times\mathcal{Y}\times\mathcal{Z}} \to \mathbb{R} be a non-negative function that satisfies

\delta(S;Y\setminus Z) \leq \min\{I(S;Y),\, I(S;Y|Z)\}.

Then a bivariate information decomposition is given by

UI_{\delta}(S;Y\setminus Z) = \max\big\{\delta(S;Y\setminus Z),\ \delta(S;Z\setminus Y) + I(S;Y) - I(S;Z)\big\},
UI_{\delta}(S;Z\setminus Y) = \max\big\{\delta(S;Z\setminus Y),\ \delta(S;Y\setminus Z) + I(S;Z) - I(S;Y)\big\},
SI_{\delta}(S;Z,Y) = \min\big\{I(S;Y) - \delta(S;Y\setminus Z),\ I(S;Z) - \delta(S;Z\setminus Y)\big\},
CI_{\delta}(S;Z,Y) = \min\big\{I(S;Y|Z) - \delta(S;Y\setminus Z),\ I(S;Z|Y) - \delta(S;Z\setminus Y)\big\}.
Proof.

The proof follows just as the proof of (Banerjee and Montúfar, 2020, Proposition 9). ∎

We call the construction of Lemma 1 the UI construction. The unique information UI_{\delta} returned by the UI construction is the smallest UI-function of any bivariate information decomposition with UI \geq \delta.
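In code, the UI construction is plain arithmetic on information values. The following hypothetical helper (our own illustration) spells out the bookkeeping of Lemma 1 and checks the consistency condition (2):

```python
def ui_construction(delta_y, delta_z, i_sy, i_sz, i_sy_given_z, i_sz_given_y):
    """Lemma 1 applied to one fixed joint distribution.
    delta_y = delta(S;Y\\Z), delta_z = delta(S;Z\\Y); the remaining arguments are
    I(S;Y), I(S;Z), I(S;Y|Z), I(S;Z|Y)."""
    ui_y = max(delta_y, delta_z + i_sy - i_sz)                 # UI_delta(S;Y\Z)
    ui_z = max(delta_z, delta_y + i_sz - i_sy)                 # UI_delta(S;Z\Y)
    si = min(i_sy - delta_y, i_sz - delta_z)                   # SI_delta
    ci = min(i_sy_given_z - delta_y, i_sz_given_y - delta_z)   # CI_delta
    # the construction always restores the consistency condition (2)
    assert abs((i_sy + ui_z) - (i_sz + ui_y)) < 1e-9
    return ui_y, ui_z, si, ci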

As Banerjee et al. (2018a) show, the decomposition I_{\operatorname{red}} is an example of this construction. As another example, as Banerjee et al. (2018a) and Rauh et al. (2019) suggested, the UI construction can be used to obtain bivariate information decompositions from the one- or two-way secret key rates and related information functions that have been defined as bounds on the secret key rates, such as the intrinsic information (Maurer and Wolf, 1997), the reduced intrinsic information (Renner and Wolf, 2003), or the minimum intrinsic information (Gohari and Anantharam, 2010).

• Other decompositions

Several other measures have been proposed that are motivated by the framework of Williams and Beer (2010) but that leave the framework. Ince (2017) defines a decomposition I_{\operatorname{ccs}}, which satisfies (1), but in which SI_{\operatorname{ccs}}, UI_{\operatorname{ccs}} and CI_{\operatorname{ccs}} may take negative values. The SPAM decomposition of Finn and Lizier (2018) consists of non-negative information measures that decompose the mutual information, but this decomposition has a different structure, with alternating signs and twice as many terms. Both approaches construct "pointwise" decompositions, in the sense that SI, UI and CI can be naturally expressed as expectations, in a similar way that entropy and mutual information can be written as expectations (see (Finn and Lizier, 2018) for details). Recent works have proposed other decompositions based on a different lattice (Ay et al., 2020) or singling out features of the target variable (Magri, 2021).

Since these measures do not lie in our direct focus, we omit their definitions. Nevertheless, one can ask the same questions: Are the corresponding information measures continuous, and are they additive? For the constructions in (Finn and Lizier, 2018), both continuity and additivity (as a consequence of a chain rule) are actually postulated. The decomposition in (Ay et al., 2020) is additive. On the other hand, I_{\operatorname{ccs}} is neither continuous (as can be seen from its definition) nor additive (since it does not satisfy the identity property).

3 Continuity

Most of the information decompositions that we consider are continuous. Moreover, the UI construction preserves continuity: if \delta is continuous, then UI_{\delta} is continuous.

The notable exceptions to continuity are I_{\operatorname{red}} and the I_{\cap} decompositions (see Lemmas 2 and 4 below). For SI_{\operatorname{red}}, this is due to its definition in terms of conditional probabilities. Thus, SI_{\operatorname{red}} is continuous when restricted to probability distributions of full support. For SI_{\cap}^{*}, discontinuities also appear for sequences P_n \to P where the support does not change.

For the SI_{\operatorname{IG}} information decomposition, one should keep in mind that it is only defined for probability distributions with full support. It is currently unknown whether it can be continuously extended to all probability distributions.

Clearly, continuity is a desirable property, but is it essential? A discontinuous information measure might still be useful if the discontinuity is not too severe. For example, the Gács-Körner common information C(Y\wedge Z) (Gács and Körner, 1973) is an information measure that vanishes except on a set of measure zero (certain distributions that do not have full support). Clearly, such an information measure is difficult to estimate. The I_{\cap} decompositions are related to C(Y\wedge Z), and so their discontinuity is almost as severe (see Lemma 4). Similarly, the I_{\operatorname{red}}-decomposition is continuous at distributions of full support. If the discontinuity is well-behaved and well understood, then such a decomposition may still be useful for certain applications. Still, a discontinuous information decomposition challenges the intuition, and any discontinuity must be interpreted, just as the discontinuity of C(Y\wedge Z) can be explained and interpreted (Gács and Körner, 1973).

If an information decomposition is continuous, one may ask whether it is differentiable, at least at probability distributions of full support. For almost all information decompositions that we consider, the answer is no. This is easy to see for those information decompositions that involve a minimum of finitely many smooth functions (SI_{\min}, SI_{\operatorname{MMI}}, SI_{\operatorname{red}}, SI_{\operatorname{dep}}). For SI_{\operatorname{BROJA}}, we refer to Rauh et al. (2021). Only SI_{\operatorname{IG}} is differentiable for distributions of full support (personal communication with the authors, Niu and Quinn (2019)).

Lemma 2.

SI_{\operatorname{red}} is not continuous.

Proof.

I_S(Y\searrow Z) and I_S(Z\searrow Y) are defined in terms of the conditional probabilities P(S|Y=y) and P(S|Z=z), which are only defined for those y, z with P(Y=y)>0 and P(Z=z)>0. Therefore, I_S(Y\searrow Z) and I_S(Z\searrow Y) are discontinuous when probabilities tend to zero. A concrete example is given below. ∎

Example 3 (SI_{\operatorname{red}} is not continuous).

For 0 \leq a \leq 1, suppose that the joint distribution of S, Y, Z has the following marginal distributions:

s   y   P_a(s,y)
1   0   a/2
1   1   1/2 - a/2
0   1   1/4
0   2   1/4

s   z   P_a(s,z)
0   0   a/2
0   1   1/2 - a/2
1   1   1/4
1   2   1/4

Observe the symmetry between Y and Z. For a>0, the conditional distributions of S given Y and Z are, respectively:

y   P_a(S|y)
0   (0, 1)
1   (1/(3-2a), (2-2a)/(3-2a))
2   (1, 0)

z   P_a(S|z)
0   (1, 0)
1   ((2-2a)/(3-2a), 1/(3-2a))
2   (0, 1)

Therefore, I_S(Y\searrow Z) = I(S;Y) = I(S;Z) = I_S(Z\searrow Y). The first equality follows from the definition of I_S(Y\searrow Z) in (3), noting that \operatorname{conv}\{P(S|z) : P(z)>0\} includes all probability distributions on \{0,1\} and hence D(P(S|y)\|P_{S|y\searrow Z}) = 0. The second equality holds because the marginal distribution of (S,Y) is equal to that of (S,Z) up to relabeling of the states of S. The third equality follows by similar considerations as the first.

For a=0, the conditional distributions P(S|Y=0) and P(S|Z=0) are not defined. In this case we have I_S(Y\searrow Z) = I_S(Z\searrow Y) < I(S;Y) = I(S;Z). As before, the two equalities hold because the marginal distribution of (S,Y) is equal to that of (S,Z) up to relabeling of the states of S. The inequality holds because now \operatorname{conv}\{P(S|z) : P(z)>0\} does not include all probability distributions on \{0,1\}, and D(P(S|y)\|P_{S|y\searrow Z}) > 0 for y=2. In total,

\lim_{a\to 0+} SI_{\operatorname{red}}(S;Y,Z) = \lim_{a\to 0+} I(S;Y) > I_S(Y\searrow Z) = SI_{\operatorname{red}}(S;Y,Z).
Lemma 4.

I_{\cap}^{*}, I_{\cap}^{\wedge} and I_{\cap}^{\operatorname{GH}} are not continuous.

Proof.

By Kolchinsky (2022, Theorem 6), for all three measures, SI_{\cap}(YZ;Y,Z) equals the Gács-Körner common information C(Y\wedge Z) (Gács and Körner, 1973), which is not continuous. ∎

A concrete example is given below.

Example 5 (SI_{\cap}^{*} is not continuous).

Suppose that the joint distribution of S, Y, Z has the following marginal distributions, for -1 \leq a \leq 1:

s   y   P_a(s,y)
0   0   1/3
1   0   1/6 - a/6
1   1   1/6 + a/6
2   1   1/3

s   z   P_a(s,z)
0   0   1/3
1   0   1/6
1   1   1/6
2   1   1/3

Recall the definition of SI_{\cap}^{*}(S;Y,Z) in (6). For a=0, the marginal distributions of the pairs (S,Y) and (S,Z) are identical, whence SI_{\cap}^{*}(S;Y,Z) = I(S;Y) = I(S;Z).

Now let a \neq 0. According to the definition of SI_{\cap}^{*}, we need to find stochastic matrices \lambda_{q|y}, \lambda'_{q|z} that satisfy the condition

P(s,q) = \sum_{y} P(s,y)\lambda_{q|y} = \sum_{z} P(s,z)\lambda'_{q|z}.   (7)

For s=0 and s=2, this condition implies \lambda_{q|0} = \lambda'_{q|0} and \lambda_{q|1} = \lambda'_{q|1}. For s=1, the same condition gives

a(\lambda_{q|1} - \lambda_{q|0}) = 0.

In the case a \neq 0, this implies that \lambda_{q|1} = \lambda_{q|0} and thus that Q is independent of Y and S. Therefore, I(Q;S) = 0 and SI_{\cap}^{*}(S;Y,Z) = 0 for a \neq 0.

Asymptotic continuity and locking

We discuss two further related properties, namely asymptotic continuity and locking (Banerjee et al., 2018a; Rauh et al., 2019), which we explain shortly below. I_{\operatorname{BROJA}} is asymptotically continuous and does not exhibit locking. It is not known if other information decompositions satisfy these properties.

Operational quantities in information theory such as channel capacities and compression rates are usually defined in the spirit of Shannon (1948), that is, in the asymptotic regime of many independent uses of the channel or many independent realizations of the underlying source distribution. In the asymptotic regime, real-valued functionals of distributions that are asymptotically continuous are especially useful, as they often provide lower or upper bounds for operational quantities of interest (Cerf et al., 2002; Banerjee et al., 2018a; Chitambar and Gour, 2019).

Asymptotic continuity is a stronger notion of continuity that considers convergence relative to the dimension of the underlying state space (Synak-Radtke and Horodecki, 2006; Chitambar and Gour, 2019; Fannes, 1973; Winter, 2016). Concretely, a function f is said to be asymptotically continuous if

|f(P) - f(P')| \leq C\epsilon\log|\mathcal{S}| + \zeta(\epsilon)

for all joint distributions P, P' \in \mathbb{P}_{\mathcal{S}}, where C is some constant, \epsilon = \tfrac{1}{2}\|P-P'\|_1, and \zeta : [0,1] \to \mathbb{R}_{+} is any continuous function converging to zero as \epsilon \to 0 (Chitambar and Gour, 2019).

As an example, entropy is asymptotically continuous (see, e.g., (Csiszár and Körner, 2011, Lemma 2.7)): For any P, P' \in \mathbb{P}_{\mathcal{S}}, if \tfrac{1}{2}\|P-P'\|_1 \leq \epsilon, then

|H_P(S) - H_{P'}(S)| \leq \epsilon\log|\mathcal{S}| + h(\epsilon),

where h(\cdot) is the binary entropy function, h(p) = -p\log p - (1-p)\log(1-p) for p\in(0,1) and h(0)=h(1)=0. Likewise, the conditional mutual information satisfies asymptotic continuity in the following sense (Renner and Wolf, 2003; Christandl and Winter, 2004): For any P, P' \in \mathbb{P}_{\mathcal{S}\times\mathcal{Y}\times\mathcal{Z}}, if \tfrac{1}{2}\|P-P'\|_1 \leq \epsilon, then

|I_P(S;Y|Z) - I_{P'}(S;Y|Z)| \leq \epsilon\log\min\{|\mathcal{S}|,|\mathcal{Y}|\} + 2h(\epsilon).

Note that the right-hand side of the above inequality does not depend explicitly on the cardinality of Z.

As Banerjee et al. (2018a) show, UI_{\operatorname{BROJA}} is asymptotically continuous:

Lemma 6.

For any P, P' \in \mathbb{P}_{\mathcal{S}\times\mathcal{Y}\times\mathcal{Z}} and \epsilon \in [0,1], if \|P-P'\|_1 = \epsilon, then

|UI_P(S;Y\backslash Z) - UI_{P'}(S;Y\backslash Z)| \leq \tfrac{5}{2}\epsilon\log\min\{|\mathcal{S}|,|\mathcal{Y}|\} + \zeta(\epsilon)

for some bounded, continuous function \zeta : [0,1] \to \mathbb{R}_{+} that converges to zero as \epsilon \to 0.

Locking is motivated from the following property of the conditional mutual information (Renner and Wolf, 2003; Christandl et al., 2007): For arbitrary discrete random variables S, Y, Z, and U,

I(S;Y|ZU) \geq I(S;Y|Z) - H(U).   (8)

The conditional mutual information does not exhibit "locking" in the sense that any additional side information U accessible to Z cannot reduce the conditional mutual information by more than the entropy of the side information.

As Rauh et al. (2019) show, UI_{\operatorname{BROJA}} does not exhibit locking:

Lemma 7.

For jointly distributed random variables (S,Y,Z,U),

UI_{\operatorname{BROJA}}(S;Y\backslash ZU) \geq UI_{\operatorname{BROJA}}(S;Y\backslash Z) - H(U).   (9)

This property is useful, for example, in a cryptographic context (Rauh et al., 2019), where it ensures that the unique information that Y has about S w.r.t. an adversary Z cannot "unlock", i.e., drop by an arbitrarily large amount on giving away a bit of information to Z.

4 Additivity

Definition 8.

An information measure I(X_1,\dots,X_n) (i.e. a function of the joint distribution of n random variables) is additive if and only if the following holds: If (X_1,\dots,X_n) is independent of (Y_1,\dots,Y_n), then

I(X_1Y_1, X_2Y_2, \dots, X_nY_n) = I(X_1,\dots,X_n) + I(Y_1,\dots,Y_n).

The information measure is superadditive if, under the same assumptions,

I(X_1Y_1, X_2Y_2, \dots, X_nY_n) \geq I(X_1,\dots,X_n) + I(Y_1,\dots,Y_n).
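As a numerical illustration of Definition 8 (our own toy example, anticipating Lemma 10 and Theorem 12): taking the independent product of two systems in which the roles of Y and Z are swapped shows that SI_{\operatorname{MMI}} is strictly superadditive, hence not additive.

```python
import numpy as np

def mutual_information(Pxy, eps=1e-12):
    Px = Pxy.sum(axis=1, keepdims=True)
    Py = Pxy.sum(axis=0, keepdims=True)
    mask = Pxy > eps
    return float((Pxy[mask] * np.log2(Pxy[mask] / (Px @ Py)[mask])).sum())

def si_mmi(P):
    return min(mutual_information(P.sum(axis=2)), mutual_information(P.sum(axis=1)))

def independent_product(P1, P2):
    """Joint pmf of (S1S2, Y1Y2, Z1Z2) when (S1,Y1,Z1) is independent of (S2,Y2,Z2)."""
    Q = np.einsum('abc,def->adbecf', P1, P2)
    return Q.reshape(P1.shape[0] * P2.shape[0],
                     P1.shape[1] * P2.shape[1],
                     P1.shape[2] * P2.shape[2])

# System 1: S1 = Y1 is a uniform bit, Z1 an independent uniform bit.
P1 = np.zeros((2, 2, 2))
for s in (0, 1):
    for z in (0, 1):
        P1[s, s, z] = 0.25
# System 2: same, with the roles of Y and Z swapped (S2 = Z2).
P2 = P1.transpose(0, 2, 1)

print(si_mmi(P1), si_mmi(P2), si_mmi(independent_product(P1, P2)))   # 0.0 0.0 1.0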

The I_{\operatorname{BROJA}} decomposition is additive:

Lemma 9.

I_{\operatorname{BROJA}} is additive.

Proof.

This is (Bertschinger et al., 2014, Lemma 19). ∎

The information decompositions motivated from the Gács-Körner common information as defined in (4), (5) and (6) are additive (Theorem 13). All other information decompositions that we consider are not additive. However, in all information decompositions that we consider, SI is superadditive and UI is subadditive (Theorem 12).

Again, additivity is a desirable property, but is it essential? As in the case of continuity, we argue that non-additivity challenges the intuition, and any non-additivity must be interpreted. Why is it plausible that the shared information contained in two independent pairs is more than the sum of the individual shared informations, and how can one explain that the unique information is subadditive?

A related weaker property is additivity under i.i.d. sequences, i.e. when, in the definition of additivity, the vectors (X_1,\ldots,X_n) and (Y_1,\ldots,Y_n) are identically distributed. One can show that I_{\operatorname{red}}, I_{\operatorname{MMI}}, I_{\operatorname{dep}} and I_{\operatorname{IG}} (and, of course, I_{\operatorname{BROJA}}) are additive under i.i.d. sequences, but not I_{\min}. The UI construction gives additivity of I_{\delta} under i.i.d. sequences if \delta is additive under i.i.d. sequences. The proof of these statements is similar to the proof for additivity (given below) and is omitted. For the I_{\cap} decompositions, this is not as easy to see, and so we currently do not know whether additivity under i.i.d. sequences holds.

Lemma 10.
  1. If I_1 and I_2 are superadditive, then \min\{I_1, I_2\} is superadditive.

  2. If, in addition, there exist distributions P, Q with I_1(P) < I_2(P) and I_1(Q) > I_2(Q), then \min\{I_1, I_2\} is not additive.

Proof.
  1. With X_1,\dots,X_n, Y_1,\dots,Y_n as in the definition of superadditivity,

     \min\big\{I_1(X_1Y_1, X_2Y_2, \dots, X_nY_n),\, I_2(X_1Y_1, X_2Y_2, \dots, X_nY_n)\big\}
       \geq \min\big\{I_1(X_1,\dots,X_n) + I_1(Y_1,\dots,Y_n),\, I_2(X_1,\dots,X_n) + I_2(Y_1,\dots,Y_n)\big\}
       \geq \min\big\{I_1(X_1,\dots,X_n),\, I_2(X_1,\dots,X_n)\big\} + \min\big\{I_1(Y_1,\dots,Y_n),\, I_2(Y_1,\dots,Y_n)\big\}.

  2. In this inequality, if X_1,\dots,X_n \sim P and Y_1,\dots,Y_n \sim Q, then the right-hand side equals I_1(X_1,\dots,X_n) + I_2(Y_1,\dots,Y_n), which makes the second inequality strict. ∎

As a consequence:

Lemma 11.

If \delta is subadditive, then UI_{\delta} is subadditive, SI_{\delta} is superadditive, but neither is additive.

Theorem 12.

The shared information measures SI_{\min}, SI_{\operatorname{MMI}}, SI_{\operatorname{red}}, SI_{\operatorname{dep}}, and SI_{\operatorname{IG}} are superadditive, but not additive.

Proof.

For I_{\operatorname{MMI}}, the claim follows directly from Lemma 10. The same is true for I_{\operatorname{dep}}, since I_{P_{Y-S-Z}}(S;Y|Z) and I_{P_{\Delta}}(S;Y|Z) are additive, and also for I_{\operatorname{red}}, since I_S(Y\searrow Z) and I_S(Z\searrow Y) are superadditive. For I_{\min}, the same argument as in the proof of Lemma 10 applies, since the specific information is additive, in the sense that

I(S_1S_2 = s_1s_2; Y_1Y_2) = I(S_1=s_1;Y_1) + I(S_2=s_2;Y_2).

Next, consider I_{\operatorname{IG}}. For i=1,2, let

P_i^{(t)}(s_i,y_i,z_i) = \frac{1}{c_{i,t}}\, P(y_i,z_i)\, P(s_i|y_i)^{t}\, P(s_i|z_i)^{1-t}.

Then

P^{(t)}(s_1s_2, y_1y_2, z_1z_2) = P_1^{(t)}(s_1,y_1,z_1)\, P_2^{(t)}(s_2,y_2,z_2)

and

D(P\|P^{(t)}) = D(P_1\|P_1^{(t)}) + D(P_2\|P_2^{(t)}),

where P_i denotes the marginal distribution of S_i, Y_i, Z_i for i=1,2. It follows that

CI_{\operatorname{IG}}(S_1S_2;Y_1Y_2,Z_1Z_2) = \min_{t\in\mathbb{R}} D(P\|P^{(t)})
  \geq \min_{t\in\mathbb{R}} D(P_1\|P_1^{(t)}) + \min_{t\in\mathbb{R}} D(P_2\|P_2^{(t)})
  = CI_{\operatorname{IG}}(S_1;Y_1,Z_1) + CI_{\operatorname{IG}}(S_2;Y_2,Z_2).

If \operatorname*{arg\,min}_{t\in\mathbb{R}} D(P_1\|P_1^{(t)}) \neq \operatorname*{arg\,min}_{t\in\mathbb{R}} D(P_2\|P_2^{(t)}), then strict inequality holds. Since SI_{\operatorname{IG}}(S;Y,Z) = CI_{\operatorname{IG}}(S;Y,Z) + I(S;Y) - I(S;Y|Z) by (1), and mutual information and conditional mutual information are additive, the same conclusion holds for SI_{\operatorname{IG}}. ∎

Theorem 13.

I_{\cap}^{\wedge}, I_{\cap}^{\operatorname{GH}} and I_{\cap}^{*} are additive.

Proof.

In the following, to slightly simplify notation, we write I_{\wedge}, I_{\operatorname{GH}} and I_{*} for the decompositions defined in (4), (5) and (6), respectively.

  • First, consider I_{\wedge}. As Griffith et al. (2014) show, SI_{\wedge}(S_1S_2;Y_1Y_2,Z_1Z_2) = I(S_1S_2;Q), where Q is the common random variable (Gács and Körner, 1973), which satisfies Q = Q_1Q_2, where Q_j is the common random variable of Y_j and Z_j. Therefore,

    SI_{\wedge}(S_1S_2;Y_1Y_2,Z_1Z_2) = I(S_1S_2;Q_1Q_2)
      = I(S_1;Q_1) + I(S_2;Q_2)
      = SI_{\wedge}(S_1;Y_1,Z_1) + SI_{\wedge}(S_2;Y_2,Z_2).
  • To see that SI_{\operatorname{GH}} is superadditive, suppose that SI_{\operatorname{GH}}(S_j;Y_j,Z_j) = I(Q_j;S_j). The joint distribution of S_1, S_2, Q_1, Q_2 defined by P(s_1s_2q_1q_2) = P(s_1q_1)P(s_2q_2) is feasible for the optimization problem in the definition of SI_{\operatorname{GH}}(S_1S_2;Y_1Y_2,Z_1Z_2). Therefore,

    SI_{\operatorname{GH}}(S_1S_2;Y_1Y_2,Z_1Z_2) \geq I(S_1S_2;Q_1Q_2)
      = I(S_1;Q_1) + I(S_2;Q_2)
      = SI_{\operatorname{GH}}(S_1;Y_1,Z_1) + SI_{\operatorname{GH}}(S_2;Y_2,Z_2).

    To prove subadditivity, let Q be as in the definition of SI_{\operatorname{GH}}(S_1S_2;Y_1Y_2,Z_1Z_2) in (5), with (S_1,Y_1,Z_1), (S_2,Y_2,Z_2) as in Definition 8. The chain rule implies

    I(S_1S_2;Q) = I(S_1;Q) + I(S_2;Q|S_1),

    where I(S_2;Q|S_1) = \sum_{s_1} P(S_1=s_1)\, I(S_2;Q|S_1=s_1). Choose an element s_1^{*} in \operatorname*{arg\,max}_{s_1} I(S_2;Q|S_1=s_1). Construct two random variables Q_1, Q_2 as follows: Q_1 is independent of S_2, Y_2, Z_2 and satisfies P(Q_1|S_1,Y_1,Z_1) = P(Q|S_1,Y_1,Z_1). Q_2 is independent of S_1, Y_1, Z_1 and satisfies P(Q_2|S_2,Y_2,Z_2) = P(Q|S_2,Y_2,Z_2,S_1=s_1^{*}). By construction, Q_1Q_2 is independent of S_1S_2 given Y_1Y_2, and Q_1Q_2 is independent of S_1S_2 given Z_1Z_2. The statement follows from

    SI_{\operatorname{GH}}(S_1;Y_1,Z_1) + SI_{\operatorname{GH}}(S_2;Y_2,Z_2)
      \geq I(S_1;Q_1) + I(S_2;Q_2) = I(S_1;Q) + I(S_2;Q|S_1=s_1^{*})
      \geq I(S_1;Q) + I(S_2;Q|S_1) = I(S_1S_2;Q) = SI_{\operatorname{GH}}(S_1S_2;Y_1Y_2,Z_1Z_2).
  • The proof of superadditivity for I_{*} follows line by line the proof for I_{\operatorname{GH}}. To prove subadditivity for I_{*}, we claim that for all random variables S, Y, Z there exist random variables S' = f_S(S,Y,Z), Y' = f_Y(S,Y,Z), Z' = f_Z(S,Y,Z) with P(S,Y) = P(S',Y'), P(S,Z) = P(S',Z') and SI_{*}(S;Y,Z) = SI_{\operatorname{GH}}(S';Y',Z'). This correspondence can be chosen such that

    (S_1S_2)' = f_S(S_1S_2, Y_1Y_2, Z_1Z_2) = (f_S(S_1,Y_1,Z_1), f_S(S_2,Y_2,Z_2)) = S_1'S_2',
    (Y_1Y_2)' = f_Y(S_1S_2, Y_1Y_2, Z_1Z_2) = (f_Y(S_1,Y_1,Z_1), f_Y(S_2,Y_2,Z_2)) = Y_1'Y_2',
    (Z_1Z_2)' = f_Z(S_1S_2, Y_1Y_2, Z_1Z_2) = (f_Z(S_1,Y_1,Z_1), f_Z(S_2,Y_2,Z_2)) = Z_1'Z_2',

    where S_1'Y_1'Z_1' is independent of S_2'Y_2'Z_2'. Thus,

    SI_{*}(S_1S_2;Y_1Y_2,Z_1Z_2) = SI_{\operatorname{GH}}\big((S_1S_2)';(Y_1Y_2)',(Z_1Z_2)'\big)
      = SI_{\operatorname{GH}}(S_1';Y_1',Z_1') + SI_{\operatorname{GH}}(S_2';Y_2',Z_2')
      \leq SI_{*}(S_1';Y_1',Z_1') + SI_{*}(S_2';Y_2',Z_2').

    To prove the claim, suppose that SI_{*}(S;Y,Z) = I(S;Q), with Q as in the definition of SI_{*} in (6). Define random variables S', Y', Z', Q' such that

    P(S'Y'Z'Q' = syzq) = P(SQ = sq)\, P(Y=y|SQ=sq)\, P(Z=z|SQ=sq).

    Then P(S'Y' = sy) = P(SY = sy) and P(S'Z' = sz) = P(SZ = sz). Since SI_{*} only depends on the (SY)- and (SZ)-marginals, SI_{*}(S;Y,Z) = SI_{*}(S';Y',Z'). Moreover,

    SI_{*}(S;Y,Z) = I(S;Q) = I(S';Q') \leq SI_{\operatorname{GH}}(S';Y',Z') \leq SI_{*}(S';Y',Z'),

    where the first inequality follows from (5) and the second one was discussed following (6). The claim follows from this. ∎

5 Conclusions

We have studied measures that have been defined for bivariate information decompositions, asking whether they are continuous and/or additive. The only information decomposition that is both continuous and additive is I_{\operatorname{BROJA}}.

While there are many continuous information decompositions, it seems difficult to construct differentiable information decompositions: Currently, the only differentiable example is I_{\operatorname{IG}} (which, however, is only defined in the interior of the probability simplex). It would be interesting to know which other continuity-type properties, such as asymptotic continuity and locking, are satisfied by the proposed information decompositions.

It also seems to be difficult to construct additive information decompositions, with I_{\operatorname{BROJA}} and the Gács-Körner-based measures I_{\cap}^{\wedge}, I_{\cap}^{\operatorname{GH}} and I_{\cap}^{*} being the only ones. In contrast, many known information decompositions are additive under i.i.d. sequences. In the other direction, it would be worthwhile to have another look at stronger versions of additivity, such as chain rule-type properties. Bertschinger et al. (2013) concluded that such chain rules prevent a straightforward extension of decompositions to the non-bivariate case along the lines of Williams and Beer (2010). It has been argued (see, e.g., Rauh (2017)) that a general information decomposition likely needs a structure that differs from the proposal of Williams and Beer (2010), whence another look at chain rules may be worthwhile. Recent work (Ay et al., 2020) has proposed an additive decomposition based on a different lattice.

Acknowledgement

PB and GM have been supported by the ERC under the European Union’s Horizon 2020 research and innovation programme (grant agreement no 757983).

References

  • Amari (2018) S. Amari. Information Geometry and Its Applications. Springer Publishing Company, Incorporated, 2018.
  • Ay et al. (2020) N. Ay, D. Polani, and N. Virgo. Information decomposition based on cooperative game theory. Kybernetika, 56(5):979–1014, 2020.
  • Banerjee and Montúfar (2020) P. K. Banerjee and G. Montúfar. The variational deficiency bottleneck. In 2020 International Joint Conference on Neural Networks (IJCNN), pages 1–8, 2020.
  • Banerjee et al. (2018a) P. K. Banerjee, E. Olbrich, J. Jost, and J. Rauh. Unique informations and deficiencies. In 56th Annual Allerton Conference on Communication, Control, and Computing, pages 32–38, 2018a.
  • Banerjee et al. (2018b) P. K. Banerjee, J. Rauh, and G. Montúfar. Computing the unique information. In IEEE International Symposium on Information Theory (ISIT), pages 141–145, June 2018b.
  • Barrett (2015) A. B. Barrett. Exploration of synergistic and redundant information sharing in static and dynamical Gaussian systems. Phys. Rev. E, 91:052802, 2015.
  • Bell (2003) A. J. Bell. The co-information lattice. In Proc. Fourth Int. Symp. Independent Component Analysis and Blind Signal Separation (ICA 03), 2003.
  • Bertschinger and Rauh (2014) N. Bertschinger and J. Rauh. The Blackwell relation defines no lattice. In IEEE International Symposium on Information Theory (ISIT), pages 2479–2483, 2014.
  • Bertschinger et al. (2013) N. Bertschinger, J. Rauh, E. Olbrich, and J. Jost. Shared information — new insights and problems in decomposing information in complex systems. In Proc. ECCS 2012, pages 251–269. Springer, 2013.
  • Bertschinger et al. (2014) N. Bertschinger, J. Rauh, E. Olbrich, J. Jost, and N. Ay. Quantifying unique information. Entropy, 16(4):2161–2183, 2014.
  • Cerf et al. (2002) N. J. Cerf, S. Massar, and S. Schneider. Multipartite classical and quantum secrecy monotones. Physical Review A, 66(4):042309, 2002.
  • Chitambar and Gour (2019) E. Chitambar and G. Gour. Quantum resource theories. Reviews of Modern Physics, 91(2):025001, 2019.
  • Christandl and Winter (2004) M. Christandl and A. Winter. “Squashed entanglement” - An additive entanglement measure. Journal of Mathematical Physics, 45(3):829–840, 2004.
  • Christandl et al. (2007) M. Christandl, A. Ekert, M. Horodecki, P. Horodecki, J. Oppenheim, and R. Renner. Unifying classical and quantum key distillation. In 4th Theory of Cryptography Conference (TCC), pages 456–478, 2007.
  • Csiszár and Körner (2011) I. Csiszár and J. Körner. Information theory: Coding theorems for discrete memoryless systems. Cambridge University Press, 2011.
  • Csiszár (2008) I. Csiszár. Axiomatic characterizations of information measures. Entropy, 10:261–273, 2008.
  • Fannes (1973) M. Fannes. A continuity property of the entropy density for spin lattice systems. Communications in Mathematical Physics, 31:291–294, 1973.
  • Finn and Lizier (2018) C. Finn and J. T. Lizier. Pointwise partial information decomposition using the specificity and ambiguity lattices. Entropy, 20(4), 2018.
  • Gács and Körner (1973) P. Gács and J. Körner. Common information is far less than mutual information. Problems of Control and Information Theory, 2(2):149–162, 1973.
  • Gohari and Anantharam (2010) A. A. Gohari and V. Anantharam. Information-theoretic key agreement of multiple terminals-Part I. IEEE Transactions on Information Theory, 56(8):3973–3996, 2010.
  • Griffith and Ho (2015) V. Griffith and T. Ho. Quantifying redundant information in predicting a target random variable. Entropy, 17(7):4644–4653, 2015.
  • Griffith and Koch (2014) V. Griffith and C. Koch. Quantifying synergistic mutual information. In M. Prokopenko, editor, Guided Self-Organization: Inception, volume 9, pages 159–190. Springer Berlin Heidelberg, 2014.
  • Griffith et al. (2014) V. Griffith, E. K. P. Chong, R. G. James, C. J. Ellison, and J. P. Crutchfield. Intersection information based on common randomness. Entropy, 16(4):1985–2000, 2014.
  • Harder et al. (2013) M. Harder, C. Salge, and D. Polani. A bivariate measure of redundant information. Phys. Rev. E, 87:012130, Jan 2013.
  • Ince (2017) R. Ince. Measuring multivariate redundant information with pointwise common change in surprisal. Entropy, 19(7):318, 2017.
  • James et al. (2018) R. James, J. Emenheiser, and J. Crutchfield. Unique information via dependency constraints. Journal of Physics A, 52(1):014002, 2018.
  • Kolchinsky (2022) A. Kolchinsky. A novel approach to the partial information decomposition. Entropy, 24(3):403, 2022.
  • Magri (2021) C. Magri. On shared and multiple information (version 4). arXiv:2107.11032, 2021.
  • Matveev and Portegies (2017) R. Matveev and J. W. Portegies. Tropical limits of probability spaces, Part I: The intrinsic Kolmogorov-Sinai distance and the asymptotic equipartition property for configurations. arXiv:1704.00297, 2017.
  • Maurer and Wolf (1997) U. Maurer and S. Wolf. The intrinsic conditional mutual information and perfect secrecy. In IEEE International Symposium on Information Theory (ISIT), 1997.
  • McGill (1954) W. McGill. Multivariate information transmission. IRE Transactions on Information Theory, 4(4):93–111, 1954.
  • Niu and Quinn (2019) X. Niu and C. Quinn. A measure of synergy, redundancy, and unique information using information geometry. In IEEE International Symposium on Information Theory (ISIT), 2019.
  • Raginsky (2011) M. Raginsky. Shannon meets Blackwell and Le Cam: Channels, codes, and statistical experiments. In IEEE International Symposium on Information Theory, pages 1220–1224, July 2011.
  • Rauh (2017) J. Rauh. Secret sharing and shared information. Entropy, 19(11):601, 2017.
  • Rauh et al. (2014) J. Rauh, N. Bertschinger, E. Olbrich, and J. Jost. Reconsidering unique information: Towards a multivariate information decomposition. In IEEE International Symposium on Information Theory (ISIT), pages 2232–2236, 2014.
  • Rauh et al. (2019) J. Rauh, P. K. Banerjee, E. Olbrich, and J. Jost. Unique information and secret key decompositions. In IEEE International Symposium on Information Theory (ISIT), pages 3042–3046, 2019.
  • Rauh et al. (2021) J. Rauh, M. Schünemann, and J. Jost. Properties of unique information. Kybernetika, 57:383–403, 2021.
  • Renner and Wolf (2003) R. Renner and S. Wolf. New bounds in secret-key agreement: The gap between formation and secrecy extraction. In Advances in Cryptology - EUROCRYPT, pages 562–577, 2003.
  • Shannon (1948) C. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:379–423 and 623–656, 1948.
  • Studený and Vejnarová (1998) M. Studený and J. Vejnarová. The multiinformation function as a tool for measuring stochastic dependence. In Learning in graphical models, pages 261–297. Springer, 1998.
  • Synak-Radtke and Horodecki (2006) B. Synak-Radtke and M. Horodecki. On asymptotic continuity of functions of quantum states. Journal of Physics A: Mathematical and General, 39(26):L423, 2006.
  • Watanabe (1960) S. Watanabe. Information theoretical analysis of multivariate correlation. IBM Journal of Research and Development, 4(1):66–82, 1960.
  • Williams and Beer (2010) P. Williams and R. Beer. Nonnegative decomposition of multivariate information. arXiv:1004.2515v1, 2010.
  • Winter (2016) A. Winter. Tight uniform continuity bounds for quantum entropies: Conditional entropy, relative entropy distance and energy constraints. Communications in Mathematical Physics, 347:291–313, 2016.