
Information-Theoretic Lower Bounds for Distributed Function Computation

Aolin Xu and Maxim Raginsky

This work was supported by the NSF under grant CCF-1017564, CAREER award CCF-1254041, by the Center for Science of Information (CSoI), an NSF Science and Technology Center, under grant agreement CCF-0939370, and by ONR under grant N00014-12-1-0998. The material in this paper was presented in part at the IEEE International Symposium on Information Theory (ISIT), Honolulu, HI, July 2014. The authors are with the Department of Electrical and Computer Engineering and the Coordinated Science Laboratory, University of Illinois, Urbana, IL 61801, USA. E-mail: {aolinxu2,maxim}@illinois.edu.
Abstract

We derive information-theoretic converses (i.e., lower bounds) for the minimum time required by any algorithm for distributed function computation over a network of point-to-point channels with finite capacity, where each node of the network initially has a random observation and aims to compute a common function of all observations to a given accuracy with a given confidence by exchanging messages with its neighbors. We obtain the lower bounds on computation time by examining the conditional mutual information between the actual function value and its estimate at an arbitrary node, given the observations in an arbitrary subset of nodes containing that node. The main contributions include: 1) A lower bound on the conditional mutual information via so-called small ball probabilities, which captures the dependence of the computation time on the joint distribution of the observations at the nodes, the structure of the function, and the accuracy requirement. For linear functions, the small ball probability can be expressed by Lévy concentration functions of sums of independent random variables, for which tight estimates are available that lead to strict improvements over existing lower bounds on computation time. 2) An upper bound on the conditional mutual information via strong data processing inequalities, which complements and strengthens existing cutset-capacity upper bounds. 3) A multi-cutset analysis that quantifies the loss (dissipation) of the information needed for computation as it flows across a succession of cutsets in the network. This analysis is based on reducing a general network to a line network with bidirectional links and self-links, and the results highlight the dependence of the computation time on the diameter of the network, a fundamental parameter that is missing from most of the existing lower bounds on computation time.

Index Terms:
Distributed function computation, computation time, small ball probability, Lévy concentration function, strong data processing inequality, cutset bound, multi-cutset analysis

I Introduction and preview of results

I-A Model and problem formulation

The problem of distributed function computation arises in such applications as inference and learning in networks and consensus or coordination of multiple agents. Each node of the network has an initial random observation and aims to compute a common function of the observations of all the nodes by exchanging messages with its neighbors over discrete memoryless point-to-point channels and by performing local computations. A problem of theoretical and practical interest is to determine the fundamental limits on the computation time, i.e., the minimum number of steps needed by any distributed computation algorithm to guarantee that, when the algorithm terminates, each node has an accurate estimate of the function value with high probability.

Formally, a network consisting of nodes connected by point-to-point channels is represented by a directed graph $G=(\mathcal{V},\mathcal{E})$, where $\mathcal{V}$ is a finite set of nodes and $\mathcal{E}\subseteq\mathcal{V}\times\mathcal{V}$ is a set of edges. Node $u$ can send messages to node $v$ only if $(u,v)\in\mathcal{E}$. Accordingly, to each edge $e\in\mathcal{E}$ we associate a discrete memoryless channel with finite input alphabet $\mathsf{X}_e$, finite output alphabet $\mathsf{Y}_e$, and stochastic transition law $K_e$ that specifies the transition probabilities $K_e(y_e|x_e)$ for all $(x_e,y_e)\in\mathsf{X}_e\times\mathsf{Y}_e$. The channels corresponding to different edges are assumed to be independent. Initially, each node $v$ has access to an observation given by a random variable (r.v.) $W_v$ taking values in some space $\mathsf{W}_v$. We assume that the joint probability law $\mathbb{P}_W$ of $W\triangleq(W_v)_{v\in\mathcal{V}}$ is known to all the nodes. Given a function $f:\prod_{v\in\mathcal{V}}\mathsf{W}_v\to\mathsf{Z}$, each node aims to estimate the value $Z=f(W)$ via local communication and computation. For example, when $f$ is the identity mapping, i.e., $Z=W$, the goal of each node is to estimate the observations of all other nodes in the network.

The operation of the network is synchronized, and takes place in discrete time. A $T$-step algorithm $\mathcal{A}$ is a collection of deterministic encoders $(\varphi_{v,t})$ and estimators $(\psi_v)$, for all $v\in\mathcal{V}$ and $t\in\{1,\ldots,T\}$, given by mappings

\[
\varphi_{v,t}:\mathsf{W}_v\times\mathsf{Y}^{t-1}_{v\leftarrow}\to\mathsf{X}_{v\rightarrow},\qquad \psi_v:\mathsf{W}_v\times\mathsf{Y}^T_{v\leftarrow}\to\mathsf{Z},
\]

where $\mathsf{X}_{v\rightarrow}=\prod_{u\in\mathcal{N}_{v\rightarrow}}\mathsf{X}_{(v,u)}$ and $\mathsf{Y}_{v\leftarrow}=\prod_{u\in\mathcal{N}_{v\leftarrow}}\mathsf{Y}_{(u,v)}$. Here, $\mathcal{N}_{v\leftarrow}\triangleq\{u\in\mathcal{V}:(u,v)\in\mathcal{E}\}$ and $\mathcal{N}_{v\rightarrow}\triangleq\{u\in\mathcal{V}:(v,u)\in\mathcal{E}\}$ are, respectively, the in-neighborhood and the out-neighborhood of node $v$. The algorithm operates as follows: at each step $t$, each node $v$ computes $X_{v,t}\triangleq(X_{(v,u),t})_{u\in\mathcal{N}_{v\rightarrow}}=\varphi_{v,t}\big(W_v,Y^{t-1}_v\big)\in\mathsf{X}_{v\rightarrow}$, and then transmits each message $X_{(v,u),t}$ along the edge $(v,u)\in\mathcal{E}$. For each $(u,v)\in\mathcal{E}$, the received message $Y_{(u,v),t}$ at each $t$ is related to the transmitted message $X_{(u,v),t}$ via the stochastic transition law $K_{(u,v)}$. At step $T$, each node $v$ computes $\widehat{Z}_v=\psi_v(W_v,Y^T_v)$ as an estimate of $Z$, where $Y_{v,t}\triangleq(Y_{(u,v),t})_{u\in\mathcal{N}_{v\leftarrow}}\in\mathsf{Y}_{v\leftarrow}$ for $t\in\{1,\ldots,T\}$.

Given a nonnegative distortion function $d:\mathsf{Z}\times\mathsf{Z}\to\mathbb{R}^+$, we use the excess distortion probability $\mathbb{P}\big[d(Z,\widehat{Z}_v)>\varepsilon\big]$ to quantify the computation fidelity of the algorithm at node $v$. A key fundamental limit of distributed function computation is the $(\varepsilon,\delta)$-computation time:

\[
T(\varepsilon,\delta)\triangleq\inf\Big\{T\in\mathbb{N}: \exists\text{ a $T$-step algorithm $\mathcal{A}$ such that } \max_{v\in\mathcal{V}}\mathbb{P}\big[d(Z,\widehat{Z}_v)>\varepsilon\big]\leq\delta\Big\}. \tag{1}
\]

If an algorithm $\mathcal{A}$ has the property that

\[
\max_{v\in\mathcal{V}}\mathbb{P}\big[d(Z,\widehat{Z}_v)>\varepsilon\big]\leq\delta,
\]

then we say that it achieves accuracy $\varepsilon$ with confidence $1-\delta$. Thus, $T(\varepsilon,\delta)$ is the minimum number of time steps needed by any algorithm to achieve accuracy $\varepsilon$ with confidence $1-\delta$. The objective of this paper is to derive general lower bounds on $T(\varepsilon,\delta)$ for arbitrary network topologies, discrete memoryless channel models, continuous or discrete observations, and functions $f$.
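To make the model concrete, the following short Python sketch (not part of the original development) simulates one simple $T$-step scheme on the two-node BSC network that reappears in Example 1, and estimates its excess-distortion probability by Monte Carlo; the repetition/majority-vote scheme and the values $p=0.3$, $\delta=0.05$ are illustrative assumptions, not claims about the optimal algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def bsc(bit, p):
    """Pass one bit through a BSC(p)."""
    return int(bit) ^ int(rng.random() < p)

def run_once(T, p):
    """One run of a simple (hypothetical) T-step scheme on a two-node network:
    each node repeats its own bit T times, the other node majority-votes,
    and both estimate Z = W1 xor W2."""
    w1, w2 = (int(b) for b in rng.integers(0, 2, size=2))
    rx_at_2 = [bsc(w1, p) for _ in range(T)]       # node 1 -> node 2
    rx_at_1 = [bsc(w2, p) for _ in range(T)]       # node 2 -> node 1
    z = w1 ^ w2
    z_hat_1 = w1 ^ int(sum(rx_at_1) * 2 > T)       # majority vote, then xor
    z_hat_2 = w2 ^ int(sum(rx_at_2) * 2 > T)
    return (z_hat_1 != z) or (z_hat_2 != z)        # excess-distortion event (eps = 0)

def error_prob(T, p, n_trials=5000):
    return np.mean([run_once(T, p) for _ in range(n_trials)])

# Empirical upper bound on T(0, delta): smallest odd T whose simulated
# error probability falls below delta.
p, delta = 0.3, 0.05
for T in range(1, 40, 2):
    if error_prob(T, p) <= delta:
        print("achievable T for delta = %.2f: about %d" % (delta, T))
        break
```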

Previously, this problem (for real-valued functions and quadratic distortion) has been studied by Ayaso et al. [1] and by Como and Dahleh [2] using information-theoretic techniques. This problem is also related to the study of communication complexity of distributed computing over noisy channels. In that context, Goyal et al. [3] studied the problem of computing Boolean functions in complete graphs, where each pair of nodes communicates over a pair of independent binary symmetric channels (BSCs), and obtained tight lower bounds on the number of serial broadcasts using an approach tailored to that special problem. The technique used in [3] has been extended to random planar networks by Dutta et al. [4]. Other related, but differently formulated, problems include communication complexity and information complexity in distributed computing over noiseless channels, surveyed in [5]; minimum communication rates for distributed computing [6, 7, 8], compression, or estimation based on infinite sequences of observations, surveyed in [9, Chap. 21]; and distributed computing in wireless networks, surveyed in [10]. Some achievability results for specific distributed function computation problems can be found in [11, 12, 13, 1, 14, 15, 16, 17, 18].

I-B Method of analysis and summary of main results

Our analysis builds upon the information-theoretic framework proposed by Ayaso et al. [1] and Como and Dahleh [2]. The underlying idea is rather natural and exploits a fundamental trade-off between the minimal amount of information any good algorithm must necessarily extract about the function value $Z$ when it terminates and the maximal amount of information any algorithm is able to obtain due to time and communication constraints. To be more precise, given any set of nodes $\mathcal{S}\subseteq\mathcal{V}$, let $W_{\mathcal{S}}\triangleq(W_v)_{v\in\mathcal{S}}$ denote the vector of observations at all the nodes in $\mathcal{S}$. The quantity that plays a key role in the analysis is the conditional mutual information $I(Z;\widehat{Z}_v|W_{\mathcal{S}})$ between the actual function value $Z$ and the estimate $\widehat{Z}_v$ at an arbitrary node $v$, given the observations in an arbitrary subset of nodes $\mathcal{S}$ containing $v$.

Consider an arbitrary $T$-step algorithm $\mathcal{A}$ that achieves accuracy $\varepsilon$ with confidence $1-\delta$. Then, as we show in Lemma 1 of Sec. II-A, this mutual information can be lower-bounded by

\[
I(Z;\widehat{Z}_v|W_{\mathcal{S}}) \geq (1-\delta)\log\frac{1}{\mathbb{E}[L(W_{\mathcal{S}},\varepsilon)]}-h_2(\delta), \tag{2}
\]

where $h_2(\delta)\triangleq-\delta\log\delta-(1-\delta)\log(1-\delta)$ is the binary entropy function, and

\[
L(w_{\mathcal{S}},\varepsilon) \triangleq \sup_{z\in\mathsf{Z}}\mathbb{P}[d(Z,z)\leq\varepsilon\,|\,W_{\mathcal{S}}=w_{\mathcal{S}}] = \sup_{z\in\mathsf{Z}}\mathbb{P}[d(f(W),z)\leq\varepsilon\,|\,W_{\mathcal{S}}=w_{\mathcal{S}}]
\]

is the conditional small ball probability of $Z=f(W)$ given $W_{\mathcal{S}}=w_{\mathcal{S}}$. The conditional small ball probability quantifies the difficulty of localizing the value of $Z=f(W)$ in a "distortion ball" of size $\varepsilon$ given partial knowledge about the value of $W$, namely $W_{\mathcal{S}}=w_{\mathcal{S}}$. For example, as discussed in Sec. IV, when $f$ is a linear function of the observations $W$, the conditional small ball probability can be expressed in terms of so-called Lévy concentration functions [19], for which tight estimates are available under various regularity conditions.
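As a concrete, purely illustrative instance of these definitions, the sketch below evaluates the conditional small ball probability and the right-hand side of (2) for a hypothetical setup in which the observations are i.i.d. standard Gaussian, $f$ is their sum, and $d$ is the absolute distortion; none of these choices (nor the numerical values) are prescribed by the paper.

```python
import math

# Toy instance (my own): n nodes with i.i.d. N(0,1) observations,
# Z = sum of all observations, d(z, zhat) = |z - zhat|,
# and a cut leaving n_outside nodes outside S.
def small_ball(eps, n_outside):
    """L(w_S, eps): given W_S, Z is Gaussian with variance n_outside, so the
    best distortion ball of radius eps is centered at the conditional mean
    and has probability 2*Phi(eps/sigma) - 1 (independent of w_S)."""
    sigma = math.sqrt(n_outside)
    return math.erf(eps / (sigma * math.sqrt(2)))

def info_lower_bound(eps, delta, n_outside):
    """Right-hand side of (2), in nats."""
    L = small_ball(eps, n_outside)
    h2 = -delta * math.log(delta) - (1 - delta) * math.log(1 - delta)
    return (1 - delta) * math.log(1.0 / L) - h2

for eps in (1.0, 0.1, 0.01):
    print(eps, info_lower_bound(eps, delta=0.05, n_outside=8))
# The bound grows like log(1/eps) as eps -> 0, since L(w_S, eps) = O(eps).
```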

Figure 1: A four-node network with a cut defined by $\mathcal{S}=\{2,3\}$ and $\mathcal{S}^c=\{1,4\}$. The cutset $\mathcal{E}_{\mathcal{S}}$ consists of edges $(1,2)$ and $(4,3)$, marked in blue.

On the other hand, if $\mathcal{A}$ is a $T$-step algorithm, then the amount of information any node $v$ has about $Z$ once $\mathcal{A}$ terminates can be upper-bounded by a quantity that increases with $T$ and also depends on the network topology and on the information transmission capabilities of the channels connecting the nodes. To quantify this amount of information, we consider a cut of the network, i.e., a partition of the set of nodes $\mathcal{V}$ into two disjoint subsets $\mathcal{S}$ and $\mathcal{S}^c\triangleq\mathcal{V}\backslash\mathcal{S}$, such that $v\in\mathcal{S}$. The underlying intuition is that any information that nodes in $\mathcal{S}$ receive about $W_{\mathcal{S}^c}$ must flow across the edges from nodes in $\mathcal{S}^c$ to nodes in $\mathcal{S}$. The set of these edges, denoted by $\mathcal{E}_{\mathcal{S}}$, is referred to as the cutset induced by $\mathcal{S}$. Figure 1 illustrates these concepts on a simple four-node network. We then have the following upper bound [1, 2] (see also Lemma 2 in Sec. II-B):

\[
I(Z;\widehat{Z}_v|W_{\mathcal{S}}) \leq T C_{\mathcal{S}}. \tag{3}
\]

The quantity $C_{\mathcal{S}}$, referred to as the cutset capacity, is the sum of the Shannon capacities of all the channels located on the edges in the cutset $\mathcal{E}_{\mathcal{S}}$. Thus, if there exists a cut $(\mathcal{S},\mathcal{S}^c)$ with a small value of $C_{\mathcal{S}}$, then the amount of information gained by the nodes in $\mathcal{S}$ about $Z$ will also be small. Note that the cutset upper bound grows linearly with $T$. However, when the initial observations $W$ are discrete, we also know that

\[
I(Z;\widehat{Z}_v|W_{\mathcal{S}}) \leq I(W_{\mathcal{S}^c};\widehat{Z}_v|W_{\mathcal{S}}) \leq H(W_{\mathcal{S}^c}|W_{\mathcal{S}}),
\]

where $H(W_{\mathcal{S}^c}|W_{\mathcal{S}})$ is the conditional entropy of $W_{\mathcal{S}^c}$ given $W_{\mathcal{S}}$, which does not depend on $T$. In fact, we sharpen this bound by showing in Lemma 5 in Sec. II-D that

\[
I(Z;\widehat{Z}_v|W_{\mathcal{S}}) \leq \big(1-(1-\eta_v)^T\big)H(W_{\mathcal{S}^c}|W_{\mathcal{S}}). \tag{4}
\]

Here, $\eta_v$ is defined as

\[
\eta_v=\sup\frac{I(U;Y_v)}{I(U;X_v)},
\]

where the supremum is over all triples $(U,X_v,Y_v)$ of r.v.'s such that $U$ takes values in an arbitrary alphabet, $U\to X_v\to Y_v$ is a Markov chain, $X_v$ takes values in $\mathsf{X}_{v\leftarrow}$, $Y_v$ takes values in $\mathsf{Y}_{v\leftarrow}$, and the conditional probability law $\mathbb{P}_{Y_v|X_v}$ is equal to the product of all the channels entering $v$. As we discuss in detail in Sec. II-C, this constant is related to so-called strong data processing inequalities (SDPIs) [20], and quantifies the information transmission capabilities of the channels entering $v$. When $\eta_v<1$, the upper bound (4) is strictly smaller than $H(W_{\mathcal{S}^c}|W_{\mathcal{S}})$. With the upper bound (4), we can strengthen the cutset bound to the following:

\[
I(Z;\widehat{Z}_v|W_{\mathcal{S}}) \leq \min\big\{TC_{\mathcal{S}},\,\big(1-(1-\eta_v)^T\big)H(W_{\mathcal{S}^c}|W_{\mathcal{S}})\big\}. \tag{5}
\]

Combining the bounds in (2) and (5), we conclude that, if there exists a $T$-step algorithm $\mathcal{A}$ that achieves accuracy $\varepsilon$ with confidence $1-\delta$, then

\[
T \geq \max\Bigg\{\frac{1}{C_{\mathcal{S}}}\left((1-\delta)\log\frac{1}{\mathbb{E}[L(W_{\mathcal{S}},\varepsilon)]}-h_2(\delta)\right),\;
\frac{\log\left(1-\frac{1}{H(W_{\mathcal{S}^c}|W_{\mathcal{S}})}\left((1-\delta)\log\frac{1}{\mathbb{E}[L(W_{\mathcal{S}},\varepsilon)]}-h_2(\delta)\right)\right)}{\log(1-\eta_v)}\Bigg\}; \tag{6}
\]

moreover, this inequality holds for all choices of $\mathcal{S}\subset\mathcal{V}$ and $v\in\mathcal{S}$. The precise statements of the resulting lower bounds on $T(\varepsilon,\delta)$ are given in Theorem 1 and Theorem 2.
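The bound (6) is straightforward to evaluate once the quantities appearing in it are known. The sketch below does so for placeholder values of $C_{\mathcal{S}}$, $H(W_{\mathcal{S}^c}|W_{\mathcal{S}})$, $\eta_v$, and $\mathbb{E}[L(W_{\mathcal{S}},\varepsilon)]$ that are my own illustrative assumptions, not taken from the paper.

```python
import math

def lower_bound_T(ell, C_S, H_cond, eta_v):
    """Evaluate the two terms in (6), given
    ell = (1-delta) log(1/E[L]) - h2(delta), the cutset capacity C_S,
    H_cond = H(W_{S^c} | W_S), and the SDPI constant eta_v (all in nats)."""
    t_cutset = ell / C_S
    # The SDPI term is only meaningful when ell < H(W_{S^c} | W_S).
    t_sdpi = (math.log(1 - ell / H_cond) / math.log(1 - eta_v)
              if ell < H_cond else float("inf"))
    return max(t_cutset, t_sdpi)

# Illustrative placeholder numbers.
delta, E_L = 0.01, 0.25
h2 = -delta * math.log(delta) - (1 - delta) * math.log(1 - delta)
ell = (1 - delta) * math.log(1 / E_L) - h2
print(lower_bound_T(ell, C_S=0.5, H_cond=2 * math.log(2), eta_v=0.16))
```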

The lower bound in (6) accounts for the difficulty of estimating the value of $Z=f(W)$ given only a subset of observations $W_{\mathcal{S}}$ through the small ball probability $L(W_{\mathcal{S}},\varepsilon)$, and for the communication bottlenecks in the network through the cutset capacity $C_{\mathcal{S}}$ and the constants $\eta_v$. The presence of $L(W_{\mathcal{S}},\varepsilon)$ in the bound ensures the correct scaling of $T(\varepsilon,\delta)$ in the high-accuracy limit $\varepsilon\to 0$. In particular, when the function $f$ is real-valued and the probability distribution of $Z=f(W)$ has a density, it is not hard to see that $L(W_{\mathcal{S}},\varepsilon)=O(\varepsilon)$, and therefore $T(\varepsilon,\delta)$ grows without bound at the rate of $\log(1/\varepsilon)$ as $\varepsilon\to 0$. By contrast, the bounds of Ayaso et al. [1] saturate at a finite constant even when no computation error is allowed, i.e., when $\varepsilon=0$. A detailed comparison with existing bounds is given in Sec. IV, where we particularize our lower bounds to the computation of linear functions. Moreover, in certain cases our lower bound on $T(\varepsilon,\delta)$ tends to infinity in the high-confidence regime $\delta\to 0$. By contrast, existing lower bounds that rely on cutset capacity estimates remain bounded regardless of how small we make $\delta$.

Throughout the paper, we provide several concrete examples that illustrate the tightness of the general lower bound in (6). In particular, Example 1 in Sec. II-E concerns the problem of computing the mod-2 sum of two independent ${\rm Bern}(\tfrac{1}{2})$ random variables in a network of two nodes communicating over binary symmetric channels (BSCs). For that problem, we obtain a lower bound on $T(0,\delta)$ that matches an achievable upper bound within a factor of 2. In Example 2 in Sec. II-E, we consider the case where the nodes aim to distribute their discrete observations to all other nodes, and obtain a lower bound on $T(0,\delta)$ that captures the conductance of the network, which plays a prominent role in the previously published bounds of Ayaso et al. [1]. In Sec. V, we study two more examples: computing a sum of independent Rademacher random variables in a dumbbell network of BSCs, and distributed averaging of real-valued observations in an arbitrary network of binary erasure channels (BECs). Our lower bound for the former example precisely captures the dependence of the computation time on the number of nodes in the network, while for the latter example it captures the correct dependence of the computation time on the accuracy parameter $\varepsilon$.

Figure 2: (a) A six-node network partitioned into three sets, $\mathcal{S}_1=\{1,4\}$, $\mathcal{S}_2=\{2,5\}$, and $\mathcal{S}_3=\{3,6\}$. Here, $\mathcal{P}_1=\{1,4\}$, $\mathcal{P}_2=\{1,2,4,5\}$, and the cutsets $\mathcal{E}_{\mathcal{P}_1}=\{(2,1),(2,4)\}$, $\mathcal{E}_{\mathcal{P}_2}=\{(3,2),(6,5)\}$, $\mathcal{E}_{\mathcal{P}^c_1}=\{(1,5),(4,5)\}$, and $\mathcal{E}_{\mathcal{P}^c_2}=\{(2,3),(5,6)\}$ are disjoint. Observe that nodes in $\mathcal{S}_1$ communicate only with nodes in $\mathcal{S}_1$ and $\mathcal{S}_2$, nodes in $\mathcal{S}_2$ communicate only with nodes in $\mathcal{S}_1,\mathcal{S}_2,\mathcal{S}_3$, and nodes in $\mathcal{S}_3$ communicate only with nodes in $\mathcal{S}_2,\mathcal{S}_3$. (b) The bidirected chain reduced from this network.

A significant limitation of the analysis based on a single cut $(\mathcal{S},\mathcal{S}^c)$ of the network is that it only captures the flow of information across the cutset $\mathcal{E}_{\mathcal{S}}$, but does not account for the time it takes the algorithm to disseminate this information to all the nodes in $\mathcal{S}$. We address this limitation in Sec. III through a multi-cutset analysis. The main idea is to partition the set of nodes $\mathcal{V}$ into several subsets $\mathcal{S}_1,\ldots,\mathcal{S}_n$, such that, for all $\mathcal{P}_i\triangleq\mathcal{S}_1\cup\ldots\cup\mathcal{S}_i$, the cutsets $\mathcal{E}_{\mathcal{P}_1},\ldots,\mathcal{E}_{\mathcal{P}_{n-1}},\mathcal{E}_{\mathcal{P}^c_1},\ldots,\mathcal{E}_{\mathcal{P}^c_{n-1}}$ are disjoint, and to analyze the flow of information across this sequence of cutsets. Once such a partition is selected, the analysis is based on a network reduction argument (Lemma 7), which lumps all the nodes in each $\mathcal{S}_i$ into a single virtual "supernode." The construction of the partition ensures that each supernode $i$ only communicates with supernodes $i-1$ and $i+1$, and can also send noisy messages to itself (this is needed to simulate noisy communication among the nodes within $\mathcal{S}_i$ in the original network). Thus, the reduced network takes the form of a chain with $n$ nodes communicating with their nearest neighbors over bidirectional noisy links and, in addition, sending noisy messages to themselves. We refer to this network as a bidirected chain of length $n-1$. Figure 2a shows the partition of a six-node network, and the bidirected chain reduced from this network is shown in Fig. 2b.

Once this reduction is carried out, we can convert any $T$-step algorithm $\mathcal{A}$ running on the original network into a randomized $T$-step algorithm $\mathcal{A}'$ running on the reduced network with the same accuracy and confidence guarantees as $\mathcal{A}$. Consequently, it suffices to analyze distributed function computation in bidirected chains. The key quantitative statement that emerges from this analysis can be informally stated as follows: for any bidirected chain with $n>3$ nodes, there exists a constant $\eta\in[0,1]$ that plays the same role as $\eta_v$ in (4) and quantifies the information transmission capabilities of the channels in the chain, such that, for any algorithm $\mathcal{A}$ that runs on this chain and takes time $T=O(n/\eta)$, the conditional mutual information between the function value $Z$ and its estimate $\widehat{Z}_n$ at the rightmost node $n$, given the observations of nodes 2 through $n$, is upper-bounded by

\[
I(Z;\widehat{Z}_n|W_{2:n})=O\left(\frac{C_{(1,2)}n^2}{\eta}e^{-2n\eta^2}\right), \tag{7}
\]

where $C_{(1,2)}$ is the Shannon capacity of the channel from node 1 to node 2. The precise statement is given in Lemma 8 in Sec. III-A. Intuitively, this shows that, unless the algorithm uses $\Omega(n/\eta)$ steps, the information about $W_1$ will dissipate at an exponential rate by the time it propagates through the chain from node 1 to node $n$. Combining (7) with the lower bound on $I(Z;\widehat{Z}_n|W_{2:n})$ based on small ball probabilities, we can obtain lower bounds on the computation time $T(\varepsilon,\delta)$. The precise statement is given in Theorem 3. Moreover, as we show, it is always possible to reduce an arbitrary network with bidirectional point-to-point channels between the nodes to a bidirected chain whose length is equal to the diameter of the original network, which implies that, for networks with sufficiently large diameter and for sufficiently small values of $\varepsilon,\delta$,

\[
T(\varepsilon,\delta)=\Omega\left(\frac{{\rm diam}(G)}{\eta}\right), \tag{8}
\]

where ${\rm diam}(G)$ denotes the diameter. This dependence on ${\rm diam}(G)$, which cannot be captured by the single-cutset analysis, is missing in almost all of the existing lower bounds on computation time. An exception is the paper by Rajagopalan and Schulman [13], which gives an asymptotic lower bound on the time required to broadcast a single bit over a chain of unidirectional BSCs. Our multi-cutset analysis applies to both discrete and continuous observations, and to general network topologies. It can be straightforwardly particularized to specific networks, such as bidirected chains, rings, trees, and grids, as discussed in Sec. III-B. We note that techniques involving multiple (though not necessarily disjoint) cutsets have also been proposed in the study of multi-party communication complexity by Tiwari [21] and more recently by Chattopadhyay et al. [22]; our concern, by contrast, is the influence of network topology and channel noise on the computation time.
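To get a feel for the scaling in (7), the following sketch evaluates its right-hand side with all hidden constants set to one (an assumption made only for illustration; the precise constants appear in Lemma 8) for a chain of BSC($p$) links, using $\eta=(1-2p)^2$ and $C_{(1,2)}=\log 2-h_2(p)$ nats.

```python
import math

def dissipation_bound(n, eta, C12):
    """Order-of-magnitude evaluation of the right-hand side of (7),
    with constants suppressed, for a bidirected chain of n nodes with
    per-link SDPI constant eta and first-link capacity C12 (nats)."""
    return C12 * n**2 / eta * math.exp(-2 * n * eta**2)

# Chain of BSC(p) links (illustrative choice of p).
p = 0.3
eta = (1 - 2 * p) ** 2
h2 = -p * math.log(p) - (1 - p) * math.log(1 - p)
C12 = math.log(2) - h2
for n in (50, 100, 200, 400):
    print(n, dissipation_bound(n, eta, C12))
# For large n the exponential factor dominates, so any algorithm with
# T = O(n/eta) retains very little information about W_1 at node n.
```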

I-C Organization of the paper

The remainder of the paper is structured as follows. We start with the single-cutset analysis in Sec. II. The lower bound on the conditional mutual information via the conditional small ball probability is presented in Sec. II-A. The cutset upper bound and the SDPI upper bound on the conditional mutual information are presented in Sec. II-B and Sec. II-D, respectively. An introduction to SDPIs is given in Sec. II-C. The lower bounds on computation time are given in Sec. II-E, along with two concrete examples. Sec. III is devoted to the multi-cutset analysis, where we first present the network reduction argument in Sec. III-A, and then derive general lower bounds on computation time and particularize the results to special networks in Sec. III-B. In Sec. IV, we discuss lower bounds for computing linear functions, where we relate the conditional small ball probability to Lévy concentration functions and evaluate them in a number of special cases. We also make detailed comparisons of our results with existing lower bounds in Sec. IV-D. In Sec. V, we compare the lower bounds on computation time with achievable upper bounds for two more examples: computing a sum of independent Rademacher random variables in a dumbbell network of BSCs, and distributed averaging of real-valued observations in an arbitrary network of binary erasure channels (BECs). We conclude the paper and point out future research directions in Sec. VI. A couple of lengthy technical proofs are relegated to a series of appendices.

II Single-cutset analysis

We start by deriving information-theoretic lower bounds on the computation time $T(\varepsilon,\delta)$ based on a single cutset in the network. Recall that a cutset associated to a partition of $\mathcal{V}$ into two disjoint sets $\mathcal{S}$ and $\mathcal{S}^c\triangleq\mathcal{V}\setminus\mathcal{S}$ consists of all edges that connect a node in $\mathcal{S}^c$ to a node in $\mathcal{S}$:

\[
\mathcal{E}_{\mathcal{S}}\triangleq\big\{(u,v)\in\mathcal{E}:u\in\mathcal{S}^c,v\in\mathcal{S}\big\}\equiv(\mathcal{S}^c\times\mathcal{S})\cap\mathcal{E}.
\]

When $\mathcal{S}$ is a singleton, i.e., $\mathcal{S}=\{v\}$, we will write $\mathcal{E}_v$ instead of the more clunky $\mathcal{E}_{\{v\}}$. As the discussion in Sec. I-B indicates, our analysis revolves around the conditional mutual information $I(Z;\widehat{Z}_v|W_{\mathcal{S}})$ for an arbitrary set of nodes $\mathcal{S}\subset\mathcal{V}$ and for an arbitrary node $v\in\mathcal{S}$. The lower bound on $I(Z;\widehat{Z}_v|W_{\mathcal{S}})$ expresses quantitatively the intuition that any algorithm that achieves

\[
\max_{v\in\mathcal{V}}\mathbb{P}\big[d(Z,\widehat{Z}_v)>\varepsilon\big]\leq\delta
\]

must necessarily extract a sufficient amount of information about the value of $Z=f(W)=f(W_{\mathcal{S}},W_{\mathcal{S}^c})$. On the other hand, the upper bounds on $I(Z;\widehat{Z}_v|W_{\mathcal{S}})$ formalize the idea that this amount cannot be too large, since any information that nodes in $\mathcal{S}$ receive about $W_{\mathcal{S}^c}$ must flow across the edges in the cutset $\mathcal{E}_{\mathcal{S}}$ (cf. [23, Sec. 15.10] for a typical illustration of this type of cutset argument). We capture this information limitation in two ways: via channel capacity and via SDPI constants.

The remainder of this section is organized as follows. We first present the conditional mutual information lower bound in Sec. II-A. Then we state the upper bound based on cutset capacity in Sec. II-B. After a brief detour to introduce the SDPIs in Sec. II-C, we state the SDPI-based upper bounds in Sec. II-D. Finally, we combine the lower and upper bounds to derive lower bounds on $T(\varepsilon,\delta)$ in Sec. II-E.

II-A Lower bound on $I(Z;\widehat{Z}_v|W_{\mathcal{S}})$

For any $\varepsilon\geq 0$, $\mathcal{S}\subset\mathcal{V}$, and $w_{\mathcal{S}}\in\prod_{v\in\mathcal{S}}\mathsf{W}_v$, define the conditional small ball probability of $Z$ given $W_{\mathcal{S}}=w_{\mathcal{S}}$ as

\[
L(w_{\mathcal{S}},\varepsilon)\triangleq\sup_{z\in\mathsf{Z}}\mathbb{P}[d(Z,z)\leq\varepsilon\,|\,W_{\mathcal{S}}=w_{\mathcal{S}}]. \tag{9}
\]

This quantity measures how well the conditional distribution of $Z$ given $W_{\mathcal{S}}=w_{\mathcal{S}}$ concentrates in a small region of size $\varepsilon$, as measured by $d(\cdot,\cdot)$. The following lower bound on $I(Z;\widehat{Z}_v|W_{\mathcal{S}})$ in terms of the conditional small ball probability is essential for proving lower bounds on $T(\varepsilon,\delta)$.

Lemma 1.

If an algorithm $\mathcal{A}$ achieves

\[
\max_{v\in\mathcal{V}}\mathbb{P}\big[d(Z,\widehat{Z}_v)>\varepsilon\big]\leq\delta\leq 1/2, \tag{10}
\]

then for any set $\mathcal{S}\subset\mathcal{V}$ and any node $v\in\mathcal{S}$,

\[
I(Z;\widehat{Z}_v|W_{\mathcal{S}}) \geq (1-\delta)\log\frac{1}{\mathbb{E}[L(W_{\mathcal{S}},\varepsilon)]}-h_2(\delta), \tag{11}
\]

where $h_2(\delta)=-\delta\log\delta-(1-\delta)\log(1-\delta)$ is the binary entropy function.

Proof:

Fix an arbitrary $\mathcal{S}\subset\mathcal{V}$ and an arbitrary $v\in\mathcal{S}$. Consider the probability distributions $\mathbb{P}=\mathbb{P}_{W_{\mathcal{S}},Z,\widehat{Z}_v}$ and $\mathbb{Q}=\mathbb{P}_{W_{\mathcal{S}}}\otimes\mathbb{P}_{Z|W_{\mathcal{S}}}\otimes\mathbb{P}_{\widehat{Z}_v|W_{\mathcal{S}}}$. Define the indicator random variable $\Upsilon\triangleq\mathbf{1}\{d(Z,\widehat{Z}_v)\leq\varepsilon\}$. Then from (10) it follows that $\mathbb{P}[\Upsilon=1]\geq 1-\delta$. On the other hand, since $Z\to W_{\mathcal{S}}\to\widehat{Z}_v$ form a Markov chain under $\mathbb{Q}$, by Fubini's theorem,

\[
\begin{aligned}
\mathbb{Q}[\Upsilon=1]
&=\int_{\mathsf{W}_{\mathcal{S}}}\int_{\mathsf{Z}}\int_{\mathsf{Z}}\mathbf{1}\big\{d(z,\widehat{z}_v)\leq\varepsilon\big\}\,\mathbb{P}({\rm d}z|w_{\mathcal{S}})\,\mathbb{P}({\rm d}\widehat{z}_v|w_{\mathcal{S}})\,\mathbb{P}({\rm d}w_{\mathcal{S}})\\
&=\int_{\mathsf{W}_{\mathcal{S}}}\int_{\mathsf{Z}}\mathbb{P}\big[d(Z,\widehat{z}_v)\leq\varepsilon\,\big|\,W_{\mathcal{S}}=w_{\mathcal{S}}\big]\,\mathbb{P}({\rm d}\widehat{z}_v|w_{\mathcal{S}})\,\mathbb{P}({\rm d}w_{\mathcal{S}})\\
&\leq\int_{\mathsf{W}_{\mathcal{S}}}\sup_{\widehat{z}_v\in\mathsf{Z}}\mathbb{P}\big[d(Z,\widehat{z}_v)\leq\varepsilon\,\big|\,W_{\mathcal{S}}=w_{\mathcal{S}}\big]\,\mathbb{P}({\rm d}w_{\mathcal{S}})\\
&=\mathbb{E}[L(W_{\mathcal{S}},\varepsilon)]. \qquad (12)
\end{aligned}
\]

Consequently,

\[
\begin{aligned}
I(Z;\widehat{Z}_v|W_{\mathcal{S}})&=D(\mathbb{P}\|\mathbb{Q})\\
&\overset{\rm(a)}{\geq}d_2(\mathbb{P}[\Upsilon=1]\,\|\,\mathbb{Q}[\Upsilon=1])\\
&\overset{\rm(b)}{\geq}\mathbb{P}[\Upsilon=1]\log\frac{1}{\mathbb{Q}[\Upsilon=1]}-h_2(\mathbb{P}[\Upsilon=1])\\
&\overset{\rm(c)}{\geq}(1-\delta)\log\frac{1}{\mathbb{E}[L(W_{\mathcal{S}},\varepsilon)]}-h_2(\delta),
\end{aligned}
\]

where

(a) follows from the data processing inequality for divergence, where $d_2(p\|q)\triangleq p\log(p/q)+(1-p)\log((1-p)/(1-q))$ is the binary divergence function;

(b) follows from the fact that $d_2(p\|q)\geq p\log(1/q)-h_2(p)$;

(c) follows from the facts that $\mathbb{P}[\Upsilon=1]\geq 1-\delta\geq 1/2$ by (10), and $\mathbb{Q}[\Upsilon=1]\leq\mathbb{E}[L(W_{\mathcal{S}},\varepsilon)]$ by (12).

For a fixed $\varepsilon$, Lemma 1 captures the intuition that the more spread out the conditional distribution $\mathbb{P}_{Z|W_{\mathcal{S}}}$ is, the more information we need about $Z$ to achieve the required accuracy; similarly, for a fixed $\mathbb{P}_{Z|W_{\mathcal{S}}}$, the smaller the accuracy parameter $\varepsilon$, the more information is necessary. In Section IV, we provide explicit expressions and upper bounds for the conditional small ball probability $L(w_{\mathcal{S}},\varepsilon)$ in the context of computing linear functions of real-valued r.v.'s with absolutely continuous probability distributions. We show that, in such cases, $L(w_{\mathcal{S}},\varepsilon)=O(\varepsilon)$, which implies that the lower bound of Lemma 1 grows at least as fast as $\log(1/\varepsilon)$ in the high-accuracy limit $\varepsilon\to 0$.
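A quick Monte Carlo check of this $O(\varepsilon)$ scaling, for the hypothetical case $Z\sim N(0,1)$ with absolute distortion, is sketched below; it is an illustration of the claim, not part of the paper's development.

```python
import numpy as np

rng = np.random.default_rng(1)

# Z ~ N(0,1), d(z, zhat) = |z - zhat|, so L(eps) = sup_z P(|Z - z| <= eps).
z_samples = rng.standard_normal(1_000_000)

def small_ball_est(eps, centers=np.linspace(-3, 3, 61)):
    # crude sup over a grid of candidate centers z
    return max(np.mean(np.abs(z_samples - z) <= eps) for z in centers)

for eps in (0.4, 0.2, 0.1, 0.05):
    print(eps, small_ball_est(eps))
# Halving eps roughly halves the estimate, consistent with L(eps) = O(eps),
# so the bound of Lemma 1 grows like log(1/eps) as eps -> 0.
```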

II-B Upper bound on $I(Z;\widehat{Z}_v|W_{\mathcal{S}})$ via cutset capacity

Our first upper bound involves the cutset capacity $C_{\mathcal{S}}$, defined as

\[
C_{\mathcal{S}}\triangleq\sum_{e\in\mathcal{E}_{\mathcal{S}}}C_e.
\]

Here, $C_e$ denotes the Shannon capacity of the channel $K_e$.
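As an illustration, the sketch below computes $C_{\mathcal{S}}$ for the four-node network of Fig. 1 under an assumed assignment of BSC crossover probabilities to the edges; the figure itself does not specify the channels, so both the topology details and the numbers are hypothetical.

```python
import math

def bsc_capacity(p):
    """Shannon capacity of a BSC(p), in bits."""
    h2 = 0.0 if p in (0.0, 1.0) else -p*math.log2(p) - (1-p)*math.log2(1-p)
    return 1.0 - h2

# Hypothetical channel assignment for the four-node network of Fig. 1:
# each directed edge is a BSC with the crossover probability listed below.
edges = {(1, 2): 0.1, (2, 1): 0.1, (2, 3): 0.2, (3, 2): 0.2,
         (3, 4): 0.1, (4, 3): 0.1, (1, 4): 0.3, (4, 1): 0.3}

def cutset_capacity(S):
    """C_S = sum of capacities of edges (u, v) with u outside S and v in S."""
    return sum(bsc_capacity(p) for (u, v), p in edges.items()
               if u not in S and v in S)

print(cutset_capacity({2, 3}))   # cutset {(1,2), (4,3)}, as in Fig. 1
```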

Lemma 2.

For any set $\mathcal{S}\subset\mathcal{V}$, let $\widehat{Z}_{\mathcal{S}}\triangleq(\widehat{Z}_v)_{v\in\mathcal{S}}$. Then, for any $T$-step algorithm $\mathcal{A}$ and for any $v\in\mathcal{S}$,

\[
I(Z;\widehat{Z}_v|W_{\mathcal{S}})\leq I(Z;\widehat{Z}_{\mathcal{S}}|W_{\mathcal{S}})\leq TC_{\mathcal{S}}.
\]

Proof:

The first inequality follows from the data processing inequality for mutual information. The second inequality has been obtained in [1] and [2] as well, but the proof in [1] relies heavily on differential entropy. Our proof is more general, as it only uses the properties of mutual information.

For a set of nodes $\mathcal{S}\subset\mathcal{V}$, let $X_{\mathcal{S},t}\triangleq(X_{v,t})_{v\in\mathcal{S}}$ and $Y_{\mathcal{S},t}\triangleq(Y_{v,t})_{v\in\mathcal{S}}$. For two subsets $\mathcal{S}_1$ and $\mathcal{S}_2$ of $\mathcal{V}$, define $X_{(\mathcal{S}_1,\mathcal{S}_2),t}\triangleq\big(X_{(u,v),t}:u\in\mathcal{S}_1,v\in\mathcal{S}_2,(u,v)\in\mathcal{E}\big)$ as the messages sent from nodes in $\mathcal{S}_1$ to nodes in $\mathcal{S}_2$ at step $t$, and $Y_{(\mathcal{S}_1,\mathcal{S}_2),t}\triangleq\big(Y_{(u,v),t}:u\in\mathcal{S}_1,v\in\mathcal{S}_2,(u,v)\in\mathcal{E}\big)$ as the messages received by nodes in $\mathcal{S}_2$ from nodes in $\mathcal{S}_1$ at step $t$. We will be using this notation in the proofs that follow as well.

If $T=0$, then for any $v\in\mathcal{S}$, $\widehat{Z}_v=\psi_v(W_v)$; hence $I(Z;\widehat{Z}_{\mathcal{S}}|W_{\mathcal{S}})\leq I(Z;W_{\mathcal{S}}|W_{\mathcal{S}})=0$. For $T\geq 1$, we start with the following chain of inequalities:

\[
\begin{aligned}
I(Z;\widehat{Z}_{\mathcal{S}}|W_{\mathcal{S}})
&\overset{\rm(a)}{\leq}I(W_{\mathcal{S}},W_{\mathcal{S}^c};W_{\mathcal{S}},Y_{\mathcal{S}}^T|W_{\mathcal{S}})\\
&=I(W_{\mathcal{S}^c};Y_{\mathcal{S}}^T|W_{\mathcal{S}})\\
&=\sum_{t=1}^T I(W_{\mathcal{S}^c};Y_{\mathcal{S},t}|W_{\mathcal{S}},Y_{\mathcal{S}}^{t-1})\\
&\overset{\rm(b)}{=}\sum_{t=1}^T I(W_{\mathcal{S}^c};Y_{\mathcal{S},t}|W_{\mathcal{S}},Y_{\mathcal{S}}^{t-1},X_{\mathcal{S},t})\\
&\leq\sum_{t=1}^T I(W_{\mathcal{S}^c},X_{\mathcal{S}^c,t};Y_{\mathcal{S},t}|W_{\mathcal{S}},Y_{\mathcal{S}}^{t-1},X_{\mathcal{S},t})\\
&=\sum_{t=1}^T\Big(I(X_{\mathcal{S}^c,t};Y_{\mathcal{S},t}|W_{\mathcal{S}},Y_{\mathcal{S}}^{t-1},X_{\mathcal{S},t})
+I(W_{\mathcal{S}^c};Y_{\mathcal{S},t}|W_{\mathcal{S}},Y_{\mathcal{S}}^{t-1},X_{\mathcal{S},t},X_{\mathcal{S}^c,t})\Big)\\
&\overset{\rm(c)}{=}\sum_{t=1}^T I(X_{\mathcal{S}^c,t};Y_{\mathcal{S},t}|W_{\mathcal{S}},Y_{\mathcal{S}}^{t-1},X_{\mathcal{S},t})\\
&\overset{\rm(d)}{\leq}\sum_{t=1}^T I(X_{\mathcal{S}^c,t};Y_{\mathcal{S},t}|X_{\mathcal{S},t}), \qquad (13)
\end{aligned}
\]

where

(a) follows from the data processing inequality, and the fact that $Z=f(W_{\mathcal{S}},W_{\mathcal{S}^c})$ and $\widehat{Z}_v=\psi_v(W_v,Y_v^T)$;

(b) follows from the fact that $X_{v,t}=\varphi_{v,t}(W_v,Y_v^{t-1})$;

(c) follows from the memorylessness of the channels, hence the Markov chain $(W_{\mathcal{S}^c},W_{\mathcal{S}},Y_{\mathcal{S}}^{t-1})\to(X_{\mathcal{S},t},X_{\mathcal{S}^c,t})\to Y_{\mathcal{S},t}$, and the weak union property of conditional independence [24, p. 25];

(d) follows from the Markov chain

\[
(W_{\mathcal{S}},Y_{\mathcal{S}}^{t-1})\to(X_{\mathcal{S},t},X_{\mathcal{S}^c,t})\to Y_{\mathcal{S},t},
\]

together with the fact that, if $X\to(A,B)\to C$ form a Markov chain, then

\[
I(A;C|X,B)\leq I(A;C|B).
\]

To prove this, we expand $I(A,X;C|B)$ in two ways to get

\[
I(A,X;C|B)=I(X;C|B)+I(A;C|X,B)=I(A;C|B)+I(X;C|A,B).
\]

The claim follows because $I(X;C|A,B)=0$.

From now on we drop the step index $t$ and write $X_{\mathcal{S}_1\mathcal{S}_2}$ for $X_{(\mathcal{S}_1,\mathcal{S}_2),t}$ to simplify the notation. Note that $X_{\mathcal{S}}=(X_{\mathcal{S}\mathcal{S}},X_{\mathcal{S}\mathcal{S}^c})$ and $Y_{\mathcal{S}}=(Y_{\mathcal{S}\mathcal{S}},Y_{\mathcal{S}^c\mathcal{S}})$. We have

\[
\begin{aligned}
I(X_{\mathcal{S}^c};Y_{\mathcal{S}}|X_{\mathcal{S}})
&=I(X_{\mathcal{S}^c};Y_{\mathcal{S}^c\mathcal{S}},Y_{\mathcal{S}\mathcal{S}}|X_{\mathcal{S}})\\
&=I(X_{\mathcal{S}^c};Y_{\mathcal{S}^c\mathcal{S}}|X_{\mathcal{S}})+I(X_{\mathcal{S}^c};Y_{\mathcal{S}\mathcal{S}}|X_{\mathcal{S}},Y_{\mathcal{S}^c\mathcal{S}})\\
&\overset{\rm(a)}{=}I(X_{\mathcal{S}^c\mathcal{S}},X_{\mathcal{S}^c\mathcal{S}^c};Y_{\mathcal{S}^c\mathcal{S}}|X_{\mathcal{S}})\\
&=I(X_{\mathcal{S}^c\mathcal{S}};Y_{\mathcal{S}^c\mathcal{S}}|X_{\mathcal{S}})+I(X_{\mathcal{S}^c\mathcal{S}^c};Y_{\mathcal{S}^c\mathcal{S}}|X_{\mathcal{S}},X_{\mathcal{S}^c\mathcal{S}})\\
&\overset{\rm(b)}{\leq}I(X_{\mathcal{S}^c\mathcal{S}};Y_{\mathcal{S}^c\mathcal{S}})\\
&\overset{\rm(c)}{\leq}\sum_{e\in\mathcal{E}_{\mathcal{S}}}C_e, \qquad (14)
\end{aligned}
\]

where

(a) follows from the Markov chain $(X_{\mathcal{S}^c},Y_{\mathcal{S}^c\mathcal{S}})\to X_{\mathcal{S}}\to Y_{\mathcal{S}\mathcal{S}}$ and the weak union property of conditional independence;

(b) follows from the Markov chains $X_{\mathcal{S}}\to X_{\mathcal{S}^c\mathcal{S}}\to Y_{\mathcal{S}^c\mathcal{S}}$ and $(X_{\mathcal{S}^c\mathcal{S}^c},X_{\mathcal{S}})\to X_{\mathcal{S}^c\mathcal{S}}\to Y_{\mathcal{S}^c\mathcal{S}}$, and the weak union property of conditional independence;

(c) follows from the fact that the channels associated with $\mathcal{E}_{\mathcal{S}}$ are independent, and the fact that the capacity of a product channel is at most the sum of the capacities of the constituent channels [25].

Then the statement of Lemma 2 follows from (13) and (14). ∎

II-C Preliminaries on strong data processing inequalities

In Sec. II-D, we will upper-bound $I(Z;\widehat{Z}_v|W_{\mathcal{S}})$ using so-called strong data processing inequalities (SDPIs) for discrete channels (cf. [20] and references therein). Here we provide the necessary background. A discrete memoryless channel is specified by a triple $(\mathsf{X},\mathsf{Y},K)$, where $\mathsf{X}$ is the input alphabet, $\mathsf{Y}$ is the output alphabet, and $K=\big(K(y|x)\big)_{(x,y)\in\mathsf{X}\times\mathsf{Y}}$ is the stochastic transition law. We say that the channel $(\mathsf{X},\mathsf{Y},K)$ satisfies an SDPI at input distribution $\mathbb{P}_X$ with constant $c\in[0,1)$ if $D(\mathbb{Q}_Y\|\mathbb{P}_Y)\leq cD(\mathbb{Q}_X\|\mathbb{P}_X)$ for any other input distribution $\mathbb{Q}_X$. Here $\mathbb{P}_Y$ and $\mathbb{Q}_Y$ denote the marginal distributions of the channel output when the input has distribution $\mathbb{P}_X$ and $\mathbb{Q}_X$, respectively. Define the SDPI constant of $K$ as

\[
\eta(K)\triangleq\sup_{\mathbb{P}_X}\sup_{\mathbb{Q}_X\neq\mathbb{P}_X}\frac{D(\mathbb{Q}_Y\|\mathbb{P}_Y)}{D(\mathbb{Q}_X\|\mathbb{P}_X)}.
\]

The SDPI constants of some common discrete channels have closed-form expressions. For example, for a binary symmetric channel (BSC) with crossover probability $p$, $\eta({\rm BSC}(p))=(1-2p)^2$ [26], and for a binary erasure channel (BEC) with erasure probability $p$, $\eta({\rm BEC}(p))=1-p$. It can be shown that $\eta(K)$ is also the maximum mutual information contraction ratio in a Markov chain $U\to X\to Y$ with $\mathbb{P}_{Y|X}=K$ [27]:

\[
\eta(K)=\sup_{\mathbb{P}_{U,X}}\frac{I(U;Y)}{I(U;X)}
\]

(see [28, App. B] for a proof of this formula in the setting of abstract alphabets). Consequently, for any such Markov chain,

\[
I(U;Y)\leq\eta(K)I(U;X).
\]

This is a stronger result than the ordinary data processing inequality for mutual information, as it quantitatively captures the amount by which the information contracts after passing through a channel. We will also need a conditional version of the SDPI:

Lemma 3.

For any Markov chain $(U,V)\to X\to Y$ with $\mathbb{P}_{Y|X}=K$,

\[
I(U;Y|V)\leq\eta(K)I(U;X|V).
\]

For binary channels, this result was first proved by Evans and Schulman [29, Corollary 1]. A proof for the general case is included in [30, Lemma 2.7]. Finally, we will need a bound on the SDPI constant of a product channel. The tensor product of two channels $(\mathsf{X}_1,\mathsf{Y}_1,K_1)$ and $(\mathsf{X}_2,\mathsf{Y}_2,K_2)$ is a channel $(\mathsf{X}_1\times\mathsf{X}_2,\mathsf{Y}_1\times\mathsf{Y}_2,K_1\otimes K_2)$ with

\[
K_1\otimes K_2(y_1,y_2|x_1,x_2)\triangleq K_1(y_1|x_1)K_2(y_2|x_2)
\]

for all $(x_1,x_2)\in\mathsf{X}_1\times\mathsf{X}_2$, $(y_1,y_2)\in\mathsf{Y}_1\times\mathsf{Y}_2$. The extension to more than two channels is obvious. The following lemma is a special case of Corollary 2 of Polyanskiy and Wu [31], obtained using the method of Evans and Schulman [29]. We give the proof, since we adapt the underlying technique at several points in this paper.

Lemma 4.

For a product channel $K=\bigotimes_{i=1}^m K_i$, if the constituent channels satisfy $\eta(K_i)\leq\eta$ for $i\in\{1,\ldots,m\}$, then

\[
\eta(K)\leq 1-(1-\eta)^m.
\]

Proof:

Let $X^m$ and $Y^m$ be the input and output of the product channel $K=K_1\otimes\ldots\otimes K_m$. Let $U$ be an arbitrary random variable such that $U\to X^m\to Y^m$ form a Markov chain. It suffices to show that

\[
I(U;Y^m)\leq\big(1-(1-\eta)^m\big)I(U;X^m). \tag{15}
\]

From the chain rule,

\[
I(U;Y^m)=I(U;Y^{m-1})+I(U;Y_m|Y^{m-1}).
\]

Since $(U,Y^{m-1})\to X_m\to Y_m$ form a Markov chain, and $\mathbb{P}_{Y_m|X_m}=K_m$, Lemma 3 gives

\[
I(U;Y_m|Y^{m-1})\leq\eta(K_m)I(U;X_m|Y^{m-1})\leq\eta I(U;X_m|Y^{m-1}).
\]

It follows that

\[
\begin{aligned}
I(U;Y^m)&\leq I(U;Y^{m-1})+\eta I(U;X_m|Y^{m-1})\\
&=(1-\eta)I(U;Y^{m-1})+\eta I(U;Y^{m-1},X_m)\\
&\leq(1-\eta)I(U;Y^{m-1})+\eta I(U;X^m),
\end{aligned}
\]

where the last step follows from the ordinary data processing inequality and the Markov chain $U\to X^m\to(Y^{m-1},X_m)$. Unrolling the above recursive upper bound on $I(U;Y^m)$ and noting that $I(U;Y_1)\leq\eta I(U;X_1)$, we get

\[
\begin{aligned}
I(U;Y^m)&\leq(1-\eta)^{m-1}\eta I(U;X_1)+\ldots+(1-\eta)\eta I(U;X^{m-1})+\eta I(U;X^m)\\
&\leq\big((1-\eta)^{m-1}+\ldots+(1-\eta)+1\big)\eta I(U;X^m)\\
&=\big(1-(1-\eta)^m\big)I(U;X^m),
\end{aligned}
\]

which proves (15) and hence Lemma 4. ∎
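A simple numerical sanity check of the contraction-ratio characterization of $\eta(K)$ is sketched below: it randomly samples joint distributions $\mathbb{P}_{U,X}$ for a BSC($p$) and compares the best ratio $I(U;Y)/I(U;X)$ found with the closed form $(1-2p)^2$. This is a crude estimate that approaches the true constant from below, offered only as an illustration of the definition.

```python
import numpy as np

def mi(pxy):
    """Mutual information (nats) of a joint pmf given as a 2-D array."""
    px, py = pxy.sum(1, keepdims=True), pxy.sum(0, keepdims=True)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

def eta_contraction(K, trials=20000, rng=np.random.default_rng(0)):
    """Crude estimate of eta(K) = sup I(U;Y)/I(U;X) over binary U and
    randomly sampled joint distributions P_{U,X}."""
    best = 0.0
    for _ in range(trials):
        pux = rng.dirichlet(np.ones(2 * K.shape[0])).reshape(2, K.shape[0])
        puy = pux @ K          # joint of (U, Y), since Y | X ~ K
        ix = mi(pux)
        if ix > 1e-9:
            best = max(best, mi(puy) / ix)
    return best

p = 0.3
bsc = np.array([[1 - p, p], [p, 1 - p]])
print(eta_contraction(bsc), (1 - 2 * p) ** 2)  # estimate approaches 0.16 from below
```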

II-D Upper bound on $I(Z;\widehat{Z}_v|W_{\mathcal{S}})$ via SDPI

Having the necessary background at hand, we can now state our upper bounds based on SDPI constants. Let $K_v\triangleq\bigotimes_{e\in\mathcal{E}_v}K_e$ be the overall transition law of the channels across the cutset $\mathcal{E}_v$. Define

\[
\eta_v\triangleq\eta(K_v)
\]

as the SDPI constant of $K_v$, and

\[
\eta^*_v\triangleq\max_{e\in\mathcal{E}_v}\eta(K_e)
\]

as the largest SDPI constant among all the channels across $\mathcal{E}_v$. Our second upper bound on $I(Z;\widehat{Z}_v|W_{\mathcal{S}})$ involves these SDPI constants and the conditional entropy of $W_{\mathcal{S}^c}$ given $W_{\mathcal{S}}$.

Lemma 5.

For any set $\mathcal{S}\subset\mathcal{V}$, any node $v\in\mathcal{S}$, and any $T$-step algorithm $\mathcal{A}$,

\[
I(Z;\widehat{Z}_v|W_{\mathcal{S}}) \leq \big(1-(1-\eta_v)^T\big)H(W_{\mathcal{S}^c}|W_{\mathcal{S}}) \leq \big(1-(1-\eta^*_v)^{|\mathcal{E}_v|T}\big)H(W_{\mathcal{S}^c}|W_{\mathcal{S}}).
\]
Proof:

We adapt the proof of Lemma 4. For any $v$ and $t$, define the shorthand $X_{v\leftarrow,t}\triangleq X_{(\mathcal{N}_{v\leftarrow},v),t}$. If $T=0$, then for any $v\in\mathcal{S}$, $\widehat{Z}_v=\psi_v(W_v)$; hence $I(Z;\widehat{Z}_v|W_{\mathcal{S}})\leq I(Z;W_v|W_{\mathcal{S}})=0$. If $T\geq 1$, then for any $v\in\mathcal{S}$,

\[
\begin{aligned}
I(Z;\widehat{Z}_v|W_{\mathcal{S}})
&\leq I(W_{\mathcal{S}},W_{\mathcal{S}^c};W_v,Y_v^T|W_{\mathcal{S}})\\
&=I(W_{\mathcal{S}^c};Y_v^T|W_{\mathcal{S}})\\
&=I(W_{\mathcal{S}^c};Y_v^{T-1}|W_{\mathcal{S}})+I(W_{\mathcal{S}^c};Y_{v,T}|W_{\mathcal{S}},Y_v^{T-1})\\
&\overset{\rm(a)}{\leq}I(W_{\mathcal{S}^c};Y_v^{T-1}|W_{\mathcal{S}})+\eta_v I(W_{\mathcal{S}^c};X_{v\leftarrow,T}|W_{\mathcal{S}},Y_v^{T-1})\\
&=(1-\eta_v)I(W_{\mathcal{S}^c};Y_v^{T-1}|W_{\mathcal{S}})+\eta_v I(W_{\mathcal{S}^c};Y_v^{T-1},X_{v\leftarrow,T}|W_{\mathcal{S}}),
\end{aligned}
\]

where (a) follows from the conditional SDPI (Lemma 3) and the fact that $(W_{\mathcal{S}^c},W_{\mathcal{S}},Y_v^{t-1})\to X_{v\leftarrow,t}\to Y_{v,t}$ form a Markov chain for $t\in\{1,\ldots,T\}$. Unrolling the above recursive upper bound on $I(W_{\mathcal{S}^c};Y_v^T|W_{\mathcal{S}})$, and noting that $I(W_{\mathcal{S}^c};Y_{v,1}|W_{\mathcal{S}})\leq\eta_v I(W_{\mathcal{S}^c};X_{v\leftarrow,1}|W_{\mathcal{S}})$, we get

\[
\begin{aligned}
I(W_{\mathcal{S}^c};Y_v^T|W_{\mathcal{S}})
&\leq(1-\eta_v)^{T-1}\eta_v I(W_{\mathcal{S}^c};X_{v\leftarrow,1}|W_{\mathcal{S}})+\ldots
+(1-\eta_v)\eta_v I(W_{\mathcal{S}^c};Y_v^{T-2},X_{v\leftarrow,T-1}|W_{\mathcal{S}})
+\eta_v I(W_{\mathcal{S}^c};Y_v^{T-1},X_{v\leftarrow,T}|W_{\mathcal{S}})\\
&\leq\big((1-\eta_v)^{T-1}+\ldots+(1-\eta_v)+1\big)\eta_v H(W_{\mathcal{S}^c}|W_{\mathcal{S}})\\
&=\big(1-(1-\eta_v)^T\big)H(W_{\mathcal{S}^c}|W_{\mathcal{S}}).
\end{aligned}
\]

The weakened upper bound follows from the fact that $\eta_v\leq 1-(1-\eta^*_v)^{|\mathcal{E}_v|}$, due to Lemma 4. This completes the proof of Lemma 5. ∎

Comparing Lemma 2 and Lemma 5, we note that the upper bound in Lemma 2 captures the communication constraints through the cutset capacity alone, in accordance with the fact that the communication constraints do not depend on $W$ or $Z$. The bound applies when $W$ is either discrete or continuous; however, it grows linearly with $T$. By contrast, the upper bound in Lemma 5 builds on the fact that $I(Z;\widehat{Z}_v|W_{\mathcal{S}})$ is upper-bounded by $H(W_{\mathcal{S}^c}|W_{\mathcal{S}})$, and goes a step further by capturing the communication constraint through a multiplicative contraction of $H(W_{\mathcal{S}^c}|W_{\mathcal{S}})$. It never exceeds $H(W_{\mathcal{S}^c}|W_{\mathcal{S}})$ as $T$ increases. However, it is useful only when the conditional entropy $H(W_{\mathcal{S}^c}|W_{\mathcal{S}})$ is well-defined and finite (e.g., when $W$ is discrete). We give an explicit comparison of Lemma 2 and Lemma 5 in the following example:

Example 1.

Consider a two-node network, where the nodes are connected by BSCs. The problem is for the two nodes to compute the mod-2 sum of their one-bit observations. Formally, we have $G=(\mathcal{V},\mathcal{E})$ with $\mathcal{V}=\{1,2\}$, $\mathcal{E}=\{(1,2),(2,1)\}$, $K_{(1,2)}=K_{(2,1)}={\rm BSC}(p)$, $W_1$ and $W_2$ are independent ${\rm Bern}(\frac{1}{2})$ r.v.'s, $Z=W_1\oplus W_2$, and $d(z,\widehat{z})=\mathbf{1}\{z\neq\widehat{z}\}$.

Choosing $\mathcal{S}=\{2\}$, Lemma 2 gives

\[
I(Z;\widehat{Z}_2|W_2)\leq(1-h_2(p))T, \tag{16}
\]

whereas Lemma 5, together with the fact that $\eta({\rm BSC}(p))=(1-2p)^2$, gives

\[
I(Z;\widehat{Z}_2|W_2)\leq 1-(4p\bar{p})^T, \tag{17}
\]

where, for $p\in[0,1]$, $\bar{p}\triangleq 1-p$. For this example, the cutset-capacity upper bound is always tighter for small $T$, since

\[
\frac{\partial\big(1-(4p\bar{p})^T\big)}{\partial T}\Big|_{T=0}=\log\frac{1}{4p\bar{p}}\geq 1-h_2(p),\quad p\in[0,1].
\]

Fig. 3 shows the two upper bounds with $p=0.3$: the cutset-capacity upper bound is tighter when $T<5$.

Figure 3: Comparison of the upper bounds in Lemma 2 and Lemma 5 for computing the mod-2 sum in a two-node network.
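The comparison in Fig. 3 can be reproduced directly from (16) and (17); the sketch below tabulates both bounds (in bits) for $p=0.3$ and confirms the crossover at $T=5$.

```python
import math

# Cutset-capacity bound (16) vs. SDPI bound (17) for Example 1, p = 0.3.
p = 0.3
h2 = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
capacity = 1 - h2                      # capacity of BSC(p), bits/use
eta = (1 - 2 * p) ** 2                 # SDPI constant of BSC(p)

for T in range(0, 11):
    cutset_bound = capacity * T                 # (16)
    sdpi_bound = 1 - (4 * p * (1 - p)) ** T     # (17), i.e., 1 - (1 - eta)^T
    print(T, round(cutset_bound, 3), round(sdpi_bound, 3))
# The cutset bound is smaller (tighter) up to T = 4; the SDPI bound, which
# saturates at H(W1) = 1 bit, takes over from T = 5 on.
```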

II-E Lower bounds on computation time

We now proceed to derive lower bounds on the computation time $T(\varepsilon,\delta)$ based on the previously derived lower and upper bounds on the conditional mutual information $I(Z;\widehat{Z}_v|W_{\mathcal{S}})$. Define the shorthand notation

\[
\ell(\mathcal{S},\varepsilon,\delta)\triangleq(1-\delta)\log\frac{1}{\mathbb{E}[L(W_{\mathcal{S}},\varepsilon)]}-h_2(\delta), \tag{18}
\]

which is the lower bound on $I(Z;\widehat{Z}_v|W_{\mathcal{S}})$ in Lemma 1.

II-E1 Cutset-capacity bounds

Combined with the conditional small ball probability lower bound in Lemma 1, the cutset-capacity upper bound in Lemma 2 leads to a lower bound on $T(\varepsilon,\delta)$:

Theorem 1.

For an arbitrary network, for any $\varepsilon\geq 0$ and $\delta\in[0,1/2]$,

\[
T(\varepsilon,\delta)\geq\max_{\mathcal{S}\subset\mathcal{V}}\frac{\ell(\mathcal{S},\varepsilon,\delta)}{C_{\mathcal{S}}}.
\]

From an operational point of view, the lower bound of Theorem 1 reflects the fact that the problem of distributed function computation is, in a certain sense, a joint source-channel coding (JSCC) problem with possibly noisy feedback. In particular, the lower bound on $I(Z;\widehat{Z}_v|W_{\mathcal{S}})$ from Lemma 1, which is used to prove Theorem 1, can be interpreted in terms of a reduction of JSCC to generalized list decoding [32, Sec. III.B]. Given any algorithm $\mathcal{A}$ and any node $v\in\mathcal{V}$, we may construct a "list decoder" as follows: given the estimate $\widehat{Z}_v$, we generate a "list" $\{z\in\mathsf{Z}:d(z,\widehat{Z}_v)\leq\varepsilon\}$. If we fix a set $\mathcal{S}\subset\mathcal{V}$ and allow all the nodes in $\mathcal{S}$ to share their observations $W_{\mathcal{S}}$, then $\mathbb{E}[L(W_{\mathcal{S}},\varepsilon)]$ is an upper bound on the $\mathbb{P}_W$-measure of the list of any node $v\in\mathcal{S}$. Therefore, $\ell(\mathcal{S},\varepsilon,\delta)$ is a lower bound on the total amount of information that is necessary for the JSCC problem. The complementary cutset upper bound on $I(Z;\widehat{Z}_v|W_{\mathcal{S}})$ bounds the amount of information that can be accumulated with each channel use. The lower bound on $T(\varepsilon,\delta)$ can thus be interpreted as a lower bound on the blocklength of the JSCC problem.

As we will demonstrate in Section IV, based on Theorem 1, it is possible to exploit structural properties of the function $f$ (such as linearity) and of the probability law $\mathbb{P}_W$ (such as log-concavity) to derive lower bounds on the computation time that are often tighter than existing bounds.

II-E2 SDPI bounds

Combining the lower bound of Lemma 1 with the SDPI upper bound of Lemma 5, we get the following:

Theorem 2.

For an arbitrary network, for any ε0\varepsilon\geq 0 and δ[0,1/2]\delta\in[0,1/2],

T(ε,δ)max𝒮𝒱maxv𝒮log(1(𝒮,ε,δ)H(W𝒮c|W𝒮))1|v|log(1ηv)1\displaystyle T(\varepsilon,\delta)\geq\max_{{\mathcal{S}}\subset{\mathcal{V}}}\max_{v\in{\mathcal{S}}}\frac{\log\big{(}1-\frac{\ell({\mathcal{S}},\varepsilon,\delta)}{H(W_{{\mathcal{S}}^{c}}|W_{\mathcal{S}})}\big{)}^{-1}}{|{\mathcal{E}}_{v}|\log(1-\eta^{*}_{v})^{-1}} (19)

where ηvmaxevη(Ke)\eta^{*}_{v}\triangleq\max_{e\in{\mathcal{E}}_{v}}\eta(K_{e}).
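For comparison, the sketch below (again an added illustration with hypothetical inputs) evaluates the right-hand side of (19) for one cut 𝒮 and one node v, given ℓ = ℓ(𝒮,ε,δ), H = H(W_{𝒮^c}|W_𝒮), the in-degree |ℰ_v|, and η*_v; the ratio of logarithms makes the choice of base irrelevant.

```python
import math

def sdpi_lower_bound(ell_val, H, in_degree, eta_star):
    """Right-hand side of (19) for a single cut S and a node v in S."""
    numerator = math.log2(1.0 / (1.0 - ell_val / H))
    denominator = in_degree * math.log2(1.0 / (1.0 - eta_star))
    return numerator / denominator

if __name__ == "__main__":
    # hypothetical inputs; requires ell_val < H and 0 < eta_star < 1
    print(sdpi_lower_bound(ell_val=0.8, H=1.0, in_degree=2, eta_star=0.3))
```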

The lower bounds in Theorem 1 and Theorem 2 can behave quite differently. To illustrate this, we compare them in two cases:

When H(W𝒮c|W𝒮)log1𝔼[L(W𝒮,ε)]H(W_{{\mathcal{S}}^{c}}|W_{\mathcal{S}})\gg\log\frac{1}{\mathbb{E}[L(W_{\mathcal{S}},\varepsilon)]}, Theorem 2 gives

T(ε,δ)\displaystyle T(\varepsilon,\delta) max𝒮𝒱maxv𝒮log(1(𝒮,ε,δ)H(W𝒮c|W𝒮))1|v|log(1ηv)1\displaystyle\geq\max_{{\mathcal{S}}\subset{\mathcal{V}}}\max_{v\in{\mathcal{S}}}\frac{\log\big{(}1-\frac{\ell({\mathcal{S}},\varepsilon,\delta)}{H(W_{{\mathcal{S}}^{c}}|W_{\mathcal{S}})}\big{)}^{-1}}{|{\mathcal{E}}_{v}|\log(1-\eta^{*}_{v})^{-1}}
max𝒮𝒱maxv𝒮(𝒮,ε,δ)logeH(W𝒮c|W𝒮)|v|log(1ηv)1,\displaystyle\approx\max_{{\mathcal{S}}\subset{\mathcal{V}}}\max_{v\in{\mathcal{S}}}\frac{\ell({\mathcal{S}},\varepsilon,\delta)\log e}{H(W_{{\mathcal{S}}^{c}}|W_{\mathcal{S}})|{\mathcal{E}}_{v}|\log(1-\eta^{*}_{v})^{-1}},

which has essentially the same dependence on (𝒮,ε,δ)\ell({\mathcal{S}},\varepsilon,\delta) as the lower bound given by Theorem 1. In this case, Theorem 1 gives more useful lower bounds as long as C𝒮H(W𝒮c|W𝒮)C_{\mathcal{S}}\ll H(W_{{\mathcal{S}}^{c}}|W_{\mathcal{S}}), especially when WW is continuous.

When H(W𝒮c|W𝒮)log1𝔼[L(W𝒮,ε)]H(W_{{\mathcal{S}}^{c}}|W_{\mathcal{S}})\approx\log\frac{1}{\mathbb{E}[L(W_{\mathcal{S}},\varepsilon)]} and δ\delta is small, H(W𝒮c|W𝒮)H(W_{{\mathcal{S}}^{c}}|W_{\mathcal{S}}) serves as a sharp proxy of (𝒮,ε,δ)\ell({\mathcal{S}},\varepsilon,\delta). Theorem 1 in this case gives

T(ε,δ)max𝒮𝒱(𝒮,ε,δ)C𝒮max𝒮𝒱H(W𝒮c|W𝒮)C𝒮,\displaystyle T(\varepsilon,\delta)\geq\max_{{\mathcal{S}}\subset{\mathcal{V}}}\frac{\ell({\mathcal{S}},\varepsilon,\delta)}{C_{\mathcal{S}}}\approx\max_{{\mathcal{S}}\subset{\mathcal{V}}}\frac{H(W_{{\mathcal{S}}^{c}}|W_{\mathcal{S}})}{C_{\mathcal{S}}},

while Theorem 2 gives

T(ε,δ)\displaystyle T(\varepsilon,\delta) max𝒮𝒱maxv𝒮log(1(𝒮,ε,δ)H(W𝒮c|W𝒮))1|v|log(1ηv)1\displaystyle\geq\max_{{\mathcal{S}}\subset{\mathcal{V}}}\max_{v\in{\mathcal{S}}}\frac{\log\big{(}1-\frac{\ell({\mathcal{S}},\varepsilon,\delta)}{H(W_{{\mathcal{S}}^{c}}|W_{\mathcal{S}})}\big{)}^{-1}}{|{\mathcal{E}}_{v}|\log(1-\eta^{*}_{v})^{-1}}
max𝒮𝒱maxv𝒮logH(W𝒮c|W𝒮)+log1h2(δ)|v|log(1ηv)1\displaystyle\approx\max_{{\mathcal{S}}\subset{\mathcal{V}}}\max_{v\in{\mathcal{S}}}\frac{\log{H(W_{{\mathcal{S}}^{c}}|W_{\mathcal{S}})}+\log\frac{1}{h_{2}(\delta)}}{|{\mathcal{E}}_{v}|\log(1-\eta^{*}_{v})^{-1}}

where in the last step we have used the fact that log(δ+h2(δ)H(W𝒮c|W𝒮))log(h2(δ)H(W𝒮c|W𝒮))\log\left(\delta+\frac{h_{2}(\delta)}{H(W_{{\mathcal{S}}^{c}}|W_{{\mathcal{S}}})}\right)\sim\log\left(\frac{h_{2}(\delta)}{H(W_{{\mathcal{S}}^{c}}|W_{\mathcal{S}})}\right) as δ0\delta\rightarrow 0. Theorem 1 in this case is sharper in capturing the dependence of T(ε,δ)T(\varepsilon,\delta) on the amount of information contained in ZZ, in that the lower bound is proportional to H(W𝒮c|W𝒮)H(W_{{\mathcal{S}}^{c}}|W_{\mathcal{S}}), whereas the lower bound given by Theorem 2 depends on H(W𝒮c|W𝒮)H(W_{{\mathcal{S}}^{c}}|W_{\mathcal{S}}) only through logH(W𝒮c|W𝒮)\log H(W_{{\mathcal{S}}^{c}}|W_{\mathcal{S}}). On the other hand, Theorem 2 in this case is much sharper in capturing the dependence of T(ε,δ)T(\varepsilon,\delta) on the confidence parameter δ\delta, since logh2(δ)\log h_{2}(\delta) grows without bound as δ0\delta\rightarrow 0, while the lower bound given by Theorem 1 remains bounded. We consider two examples for this case.

The first is Example 1 in Section II-D, for the two-node mod-22 sum problem. We have L(w2,ε)=maxz{0,1}[W1W2=z|W2=w2]=12L(w_{2},\varepsilon)=\max_{z\in\{0,1\}}{\mathbb{P}}[W_{1}\oplus W_{2}=z|W_{2}=w_{2}]=\frac{1}{2}, and (𝒮,0,δ)=1δh2(δ)\ell({\mathcal{S}},0,\delta)=1-\delta-h_{2}(\delta). Theorems 1 and 2 imply the following:

Corollary 1.

For the problem in Example 1, for δ[0,1/2]\delta\in[0,1/2], the (0,δ)(0,\delta)-computation time satisfies

T(0,δ)\displaystyle T(0,\delta) max{1δh2(δ)1h2(p),log(δ+h2(δ))1log(4pp¯)1},\displaystyle\geq\max\Big{\{}\frac{1-\delta-h_{2}(\delta)}{1-h_{2}(p)},\frac{\log(\delta+h_{2}(\delta))^{-1}}{\log(4p\bar{p})^{-1}}\Big{\}}, (20)

where the first lower bound is given by Theorem 1, and the second one is given by Theorem 2.

To obtain an achievable upper bound on T(0,δ)T(0,\delta) in Example 1, we consider the algorithm where each node uses a length-TT repetition code to send its one-bit observation to the other node. Using the Chernoff bound, as in [33], it can be shown that the probability of decoding error at each node is upper-bounded by (4pp¯)T/2(4p\bar{p})^{T/2}, and therefore this algorithm achieves accuracy ε=0\varepsilon=0 with confidence parameter δ(4pp¯)T/2\delta\leq(4p\bar{p})^{T/2}. This gives the upper bound

T(0,δ)\displaystyle T(0,\delta) 2logδ1log(4pp¯)1.\displaystyle\leq\frac{2\log\delta^{-1}}{\log(4p\bar{p})^{-1}}. (21)

Comparing (21) with the second lower bound in (20), we see that they asymptotically differ only by a factor of 22 as δ0\delta\rightarrow 0, as limδ0log(δ+h2(δ))/log(δ)=1\lim_{\delta\rightarrow 0}\log(\delta+h_{2}(\delta))/\log(\delta)=1. Thus, for the problem in Example 1, the converse lower bound on T(0,δ)T(0,\delta) obtained from the SDPI closely matches the achievable upper bound on T(0,δ)T(0,\delta).
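This comparison is easy to carry out numerically. The script below (a sketch added here; the values of p and δ are hypothetical) evaluates both lower bounds in (20) and the repetition-code upper bound (21), showing the asymptotic factor-of-2 gap between the SDPI lower bound and the upper bound as δ → 0.

```python
import math

def h2(d):
    if d in (0.0, 1.0):
        return 0.0
    return -d * math.log2(d) - (1 - d) * math.log2(1 - d)

def mod2_bounds(p, delta):
    pbar = 1.0 - p
    cutset_lb = (1 - delta - h2(delta)) / (1 - h2(p))                                   # first bound in (20)
    sdpi_lb = math.log2(1.0 / (delta + h2(delta))) / math.log2(1.0 / (4 * p * pbar))    # second bound in (20)
    repetition_ub = 2 * math.log2(1.0 / delta) / math.log2(1.0 / (4 * p * pbar))        # upper bound (21)
    return cutset_lb, sdpi_lb, repetition_ub

if __name__ == "__main__":
    for delta in (1e-2, 1e-4, 1e-8):
        print(delta, mod2_bounds(p=0.1, delta=delta))
```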

The second example concerns the problem of disseminating all of the observations through an arbitrary network:

Example 2.

Consider the problem where WvW_{v}’s are i.i.d. samples from the uniform distribution over {1,,M}\{1,\ldots,M\}, Z=WZ=W, and d(z,z^)=𝟏{zz^}d(z,{\widehat{z}})=\mathbf{1}\{z\neq{\widehat{z}}\}. In other words, the goal of the nodes is to distribute their observations to all other nodes.

In this example, H(W𝒮c|W𝒮)=|𝒮c|logMH(W_{{\mathcal{S}}^{c}}|W_{{\mathcal{S}}})=|{\mathcal{S}}^{c}|\log M, and (𝒮,0,δ)=(1δ)|𝒮c|logMh2(δ)\ell({\mathcal{S}},0,\delta)=(1-\delta)|{\mathcal{S}}^{c}|\log M-h_{2}(\delta). Following Ayaso et al. [1, Def. III.4], we define the conductance of the network GG as

Φ(G)min𝒮𝒱:|𝒱|/2<|𝒮|<|𝒱|C𝒮|𝒮c|.\Phi(G)\triangleq\min_{{\mathcal{S}}\subset{\mathcal{V}}:|{\mathcal{V}}|/2<|{\mathcal{S}}|<|{\mathcal{V}}|}\frac{C_{{\mathcal{S}}}}{|{\mathcal{S}}^{c}|}.

Then we have the following corollary:

Corollary 2.

For the problem in Example 2, Theorem 1 gives

T(0,δ)\displaystyle T(0,\delta) max𝒮𝒱(1δ)|𝒮c|logMh2(δ)C𝒮\displaystyle\geq\max_{{\mathcal{S}}\subset{\mathcal{V}}}\frac{(1-\delta)|{\mathcal{S}}^{c}|\log M-h_{2}(\delta)}{C_{\mathcal{S}}} (22)
logMΦ(G)as δ0,\displaystyle\gtrsim\frac{\log M}{\Phi(G)}\qquad\text{as $\delta\rightarrow 0$}, (23)

whereas Theorem 2 gives

T(0,δ)\displaystyle T(0,\delta) max𝒮𝒱maxv𝒮log(|𝒮c|logM)+logh2(δ)1|v|log(1ηv)1\displaystyle\gtrsim\max_{{\mathcal{S}}\subset{\mathcal{V}}}\max_{v\in{\mathcal{S}}}\frac{\log\big{(}|{\mathcal{S}}^{c}|\log M\big{)}+\log h_{2}(\delta)^{-1}}{|{\mathcal{E}}_{v}|\log(1-\eta^{*}_{v})^{-1}} (24)

as δ0\delta\rightarrow 0.

Again, we see that the lower bound obtained from SDPI is much sharper for capturing the dependence of T(0,δ)T(0,\delta) on δ\delta, since logh2(δ)1+\log h_{2}(\delta)^{-1}\to+\infty as δ0\delta\to 0. On the other hand, the lower bound obtained from the cutset capacity upper bound is tighter in its dependence on MM, and can also capture the dependence on the conductance of the network.
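For small networks, the conductance Φ(G) in Corollary 2 can be computed by brute force. The sketch below is an added illustration; it assumes unit-capacity links, so that C_𝒮 simply counts the edges crossing from 𝒮^c into 𝒮, and the 4-node ring is a made-up example.

```python
from itertools import combinations

def conductance(nodes, edges):
    """Brute-force Phi(G): min over |V|/2 < |S| < |V| of C_S / |S^c|,
    with unit capacities, so C_S is the number of edges from S^c into S."""
    n = len(nodes)
    best = float("inf")
    for size in range(n // 2 + 1, n):
        for subset in combinations(nodes, size):
            S = set(subset)
            C_S = sum(1 for (u, v) in edges if u not in S and v in S)
            best = min(best, C_S / (n - size))
    return best

if __name__ == "__main__":
    nodes = [1, 2, 3, 4]   # a made-up 4-node bidirectional ring
    edges = {(1, 2), (2, 1), (2, 3), (3, 2), (3, 4), (4, 3), (4, 1), (1, 4)}
    print(conductance(nodes, edges))   # prints 2.0 for this example
```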

Finally, we point out that Theorem 1 gives the correct lower bound T(ε,δ)=+T(\varepsilon,\delta)=+\infty when the network graph GG is disconnected (assuming ff depends on the observations of all nodes): If 𝒱{\mathcal{V}} consists of two disconnected components 𝒮{\mathcal{S}} and 𝒮c{\mathcal{S}}^{c}, then C𝒮=0C_{\mathcal{S}}=0, which results in T(ε,δ)=+T(\varepsilon,\delta)=+\infty. Despite the sharp dependence of the lower bounds of Theorems 1 and 2 on ε\varepsilon and δ\delta, they have the same limitation as all previously known bounds obtained via single-cutset arguments: they examine only the flow of information across a cutset 𝒮{\mathcal{E}}_{\mathcal{S}}, but not within 𝒮{\mathcal{S}}; hence they cannot capture the dependence of computation time on the diameter of the network. We address this limitation in the following section.

III Multi-cutset analysis

We now extend the techniques of Section II to a multi-cutset analysis, to address the limitation of the results obtained from the single-cutset analysis. In particular, the new results are able to quantify the dissipation of information as it flows across a succession of cutsets in the network. As briefly sketched in Sec. I-B, we accomplish this by partitioning a general network using multiple disjoint cutsets, such that the operation of any algorithm on the network can be simulated by another algorithm running on a chain of bidirectional noisy links. We then derive tight mutual information upper bounds for such chains, which in turn can be used to lower-bound the computation time for the original network.

III-A Network reduction

Consider an arbitrary network G=(𝒱,)G=({\mathcal{V}},{\mathcal{E}}). If there exists a collection of nested subsets 𝒫1𝒫n1{\mathcal{P}}_{1}\subset\ldots\subset{\mathcal{P}}_{n-1} of 𝒱{\mathcal{V}}, such that the associated cutsets 𝒫1,,𝒫n1{\mathcal{E}}_{{\mathcal{P}}_{1}},\ldots,{\mathcal{E}}_{{\mathcal{P}}_{n-1}} are disjoint, and the cutsets 𝒫1c,,𝒫n1c{\mathcal{E}}_{{\mathcal{P}}_{1}^{c}},\ldots,{\mathcal{E}}_{{\mathcal{P}}_{n-1}^{c}} are also disjoint, then we say that GG is successively partitioned according to 𝒫1,,𝒫n1{\mathcal{P}}_{1},\ldots,{\mathcal{P}}_{n-1} into nn subsets 𝒮1,,𝒮n{\mathcal{S}}_{1},\ldots,{\mathcal{S}}_{n}, where 𝒮i=𝒫i𝒫i1{\mathcal{S}}_{i}={\mathcal{P}}_{i}\setminus{\mathcal{P}}_{i-1}, with 𝒫0{\mathcal{P}}_{0}\triangleq\varnothing and 𝒫n𝒱{\mathcal{P}}_{n}\triangleq{\mathcal{V}}. For i{2,,n}i\in\{2,\ldots,n\}, a node in 𝒮i{\mathcal{S}}_{i} is called a left-bound node of 𝒮i{\mathcal{S}}_{i} if there is an edge from it to a node in 𝒮i1{\mathcal{S}}_{i-1}. The set of left-bound nodes of 𝒮i{\mathcal{S}}_{i} is denoted by 𝒮i\overset{{}_{\leftarrow}}{\partial}{\mathcal{S}}_{i}. For 𝒮1{\mathcal{S}}_{1}, define 𝒮1={v}\overset{{}_{\leftarrow}}{\partial}{\mathcal{S}}_{1}=\{v\} for an arbitrary v𝒮1v\in{\mathcal{S}}_{1}. In addition, for i{2,,n}i\in\{2,\ldots,n\}, let

di|𝒫i1c|+|𝒫i|+|{(𝒮i×𝒮i)}|\displaystyle d_{i}\triangleq|{\mathcal{E}}_{{\mathcal{P}}_{i-1}^{c}}|+|{\mathcal{E}}_{{\mathcal{P}}_{i}}|+|\{{\mathcal{E}}\cap({\mathcal{S}}_{i}\times\overset{{}_{\leftarrow}}{\partial}{\mathcal{S}}_{i})\}| (25)

be the number of edges entering 𝒮i{\mathcal{S}}_{i} from its neighbors 𝒮i1{\mathcal{S}}_{i-1} and 𝒮i+1{\mathcal{S}}_{i+1}, plus the number of edges entering 𝒮i\overset{{}_{\leftarrow}}{\partial}{\mathcal{S}}_{i} from 𝒮i{\mathcal{S}}_{i} itself. For example, Fig. 2a in Sec. I-B illustrates a successive partition of a six-node network into three subsets 𝒮1={1,4}{\mathcal{S}}_{1}=\{1,4\}, 𝒮2={2,5}{\mathcal{S}}_{2}=\{2,5\} and 𝒮3={3,6}{\mathcal{S}}_{3}=\{3,6\}, with 𝒮1={4}\overset{{}_{\leftarrow}}{\partial}{\mathcal{S}}_{1}=\{4\}, 𝒮2={2}\overset{{}_{\leftarrow}}{\partial}{\mathcal{S}}_{2}=\{2\} and 𝒮3={3,6}\overset{{}_{\leftarrow}}{\partial}{\mathcal{S}}_{3}=\{3,6\}. In addition, d2=5d_{2}=5 and d3=4d_{3}=4. As another example, the network in Fig. 4a, where each undirected edge represents a pair of channels with opposite directions, can be successively partitioned into 𝒮1={1}{\mathcal{S}}_{1}=\{1\}, 𝒮2={2,7}{\mathcal{S}}_{2}=\{2,7\}, 𝒮3={3,6,8,9}{\mathcal{S}}_{3}=\{3,6,8,9\}, 𝒮4={4,10}{\mathcal{S}}_{4}=\{4,10\}, and 𝒮5={5}{\mathcal{S}}_{5}=\{5\}, with 𝒮1={1}\overset{{}_{\leftarrow}}{\partial}{\mathcal{S}}_{1}=\{1\}, 𝒮2={2,7}\overset{{}_{\leftarrow}}{\partial}{\mathcal{S}}_{2}=\{2,7\}, 𝒮3={3,8}\overset{{}_{\leftarrow}}{\partial}{\mathcal{S}}_{3}=\{3,8\}, 𝒮4={4,10}\overset{{}_{\leftarrow}}{\partial}{\mathcal{S}}_{4}=\{4,10\}, and 𝒮5={5}\overset{{}_{\leftarrow}}{\partial}{\mathcal{S}}_{5}=\{5\}. In addition, d2=6d_{2}=6, d3=7d_{3}=7, d4=6d_{4}=6, and d5=2d_{5}=2.
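To fix ideas, here is a small helper (an added sketch, not from the paper) that computes d_i from (25) for a given successive partition; it relies on the fact, established later in this subsection, that in a successive partition edges only run between adjacent subsets or within a subset. The edge set in the example is made up and is not the network of Fig. 2a.

```python
def d_i(edges, parts, i):
    """d_i from (25): parts = [S_1, ..., S_n] as a list of sets, i in {2, ..., n}.
    Edges are directed pairs (u, v); left-bound nodes of S_i have an edge into S_{i-1}."""
    S_i, S_prev = parts[i - 1], parts[i - 2]
    S_next = parts[i] if i < len(parts) else set()
    left_bound = {u for u in S_i if any((u, v) in edges for v in S_prev)}
    from_prev = sum(1 for (u, v) in edges if u in S_prev and v in S_i)    # |E_{P_{i-1}^c}|
    from_next = sum(1 for (u, v) in edges if u in S_next and v in S_i)    # |E_{P_i}|
    internal = sum(1 for (u, v) in edges if u in S_i and v in left_bound)  # edges into left-bound nodes
    return from_prev + from_next + internal

if __name__ == "__main__":
    # made-up 6-node example partitioned into S_1 = {1,4}, S_2 = {2,5}, S_3 = {3,6}
    parts = [{1, 4}, {2, 5}, {3, 6}]
    edges = {(1, 2), (2, 1), (4, 5), (5, 4), (2, 3), (3, 2), (5, 6), (6, 5), (2, 5), (5, 2)}
    print(d_i(edges, parts, 2), d_i(edges, parts, 3))
```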

Figure 4: (a) A successive partition of a network; (b) the chain reduced according to it.

Figure 5: (a) Another successive partition (using the construction in the proof of Lemma 6); (b) the chain reduced according to it.

Formally, a network GG has bidirectional links if, for any pair of nodes u,v𝒱u,v\in{\mathcal{V}}, (u,v)(u,v)\in{\mathcal{E}} if and only if (v,u)(v,u)\in{\mathcal{E}}. A path between uu and vv is a sequence of edges {(vi,vi+1)}i=1k1\{(v_{i},v_{i+1})\}^{k-1}_{i=1}, such that v1=uv_{1}=u and vk=vv_{k}=v (if GG is connected, there is at least one path between any pair of nodes). The graph distance between uu and vv, denoted by dG(u,v)d_{G}(u,v), is the length of a shortest path between uu and vv (shortest paths are not necessarily unique). The diameter of GG is then defined by

diam(G)maxu𝒱maxv𝒱dG(u,v).{\rm diam}(G)\triangleq\max_{u\in{\mathcal{V}}}\max_{v\in{\mathcal{V}}}d_{G}(u,v).

The following lemma states that any such network GG can be successively partitioned into n=diam(G)+1n={\rm diam}(G)+1 subsets:

Lemma 6.

Any network G=(𝒱,)G=({\mathcal{V}},{\mathcal{E}}) with bidirectional links (i.e., (u,v)(u,v)\in{\mathcal{E}} if and only if (v,u)(v,u)\in{\mathcal{E}}) admits a successive partition into subsets 𝒮1,,𝒮n{\mathcal{S}}_{1},\ldots,{\mathcal{S}}_{n} with n=diam(G)+1n={\rm diam}(G)+1.

Proof:

For any v𝒱v\in{\mathcal{V}} and any r{0:diam(G)}r\in\{0:{\rm diam}(G)\}, we define the sets

𝔹G(v,r){u𝒱:dG(v,u)r}\displaystyle{\mathbb{B}}_{G}(v,r)\triangleq\left\{u\in{\mathcal{V}}:d_{G}(v,u)\leq r\right\}

and

SSG(v,r){u𝒱:dG(v,u)=r},\displaystyle\SS_{G}(v,r)\triangleq\left\{u\in{\mathcal{V}}:d_{G}(v,u)=r\right\},

i.e., the ball and the sphere of radius rr centered at vv. In particular, 𝔹G(v,r)=𝔹G(v,r1)SSG(v,r){\mathbb{B}}_{G}(v,r)={\mathbb{B}}_{G}(v,r-1)\cup\SS_{G}(v,r).

We now construct the desired successive partition. Let n=diam(G)+1n={\rm diam}(G)+1, and pick any pair of nodes v0,v1𝒱v_{0},v_{1}\in{\mathcal{V}} that achieve the maximum in the definition of diam(G){\rm diam}(G). With this, we take

𝒫i=𝔹G(v0,i1),i=1,,n.{\mathcal{P}}_{i}={\mathbb{B}}_{G}(v_{0},i-1),\qquad i=1,\ldots,n.

Clearly, 𝒫1={v0}𝒫2𝒫n=𝒱{\mathcal{P}}_{1}=\{v_{0}\}\subset{\mathcal{P}}_{2}\subset\ldots\subset{\mathcal{P}}_{n}={\mathcal{V}}, and moreover

𝒮i=SSG(v0,i1),i=1,,n.{\mathcal{S}}_{i}=\SS_{G}(v_{0},i-1),\qquad i=1,\ldots,n.

From this construction, we see that

𝒫i={(u,v):u𝒮i+1,v𝒮i}{\mathcal{E}}_{{\mathcal{P}}_{i}}=\left\{(u,v)\in{\mathcal{E}}:u\in{\mathcal{S}}_{i+1},\,v\in{\mathcal{S}}_{i}\right\}

and

𝒫ic={(u,v):u𝒮i,v𝒮i+1}.{\mathcal{E}}_{{\mathcal{P}}^{c}_{i}}=\left\{(u,v)\in{\mathcal{E}}:u\in{\mathcal{S}}_{i},\,v\in{\mathcal{S}}_{i+1}\right\}.

The pairwise disjointness of the cutsets 𝒫i{\mathcal{E}}_{{\mathcal{P}}_{i}}, as well as of the cutsets 𝒫ic{\mathcal{E}}_{{\mathcal{P}}^{c}_{i}}, is immediate. ∎

Remarks:

  • Using the construction underlying the proof, we can also show that, for any two nodes u,vu,v in GG, we can successively partition GG into n=dG(u,v)+1n=d_{G}(u,v)+1 subsets.

  • For the successive partition constructed in the proof, all nodes in 𝒮i{\mathcal{S}}_{i} are left-bound nodes, and did_{i} is the sum of the in-degrees of the nodes in 𝒮i{\mathcal{S}}_{i}.

As an example, Fig. 5a shows the successive partition of the network in Fig. 4a using the construction in the proof, where 𝒮1={1}{\mathcal{S}}_{1}=\{1\}, 𝒮2={2,7}{\mathcal{S}}_{2}=\{2,7\}, 𝒮3={3,8}{\mathcal{S}}_{3}=\{3,8\}, 𝒮4={4,6,9}{\mathcal{S}}_{4}=\{4,6,9\}, 𝒮5={5,10}{\mathcal{S}}_{5}=\{5,10\}, with 𝒮i=𝒮i\overset{{}_{\leftarrow}}{\partial}{\mathcal{S}}_{i}={\mathcal{S}}_{i}, i{1,,5}i\in\{1,\ldots,5\}, and d2=6d_{2}=6, d3=6d_{3}=6, d4=9d_{4}=9, and d5=5d_{5}=5.
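The construction in the proof of Lemma 6 amounts to a breadth-first search from an endpoint of a diameter-achieving pair. A minimal sketch (added here; the adjacency list is a made-up graph with bidirectional links) is given below.

```python
from collections import deque

def bfs_layers(adj, v0):
    """Return [S_1, S_2, ...], where S_i is the sphere of radius i-1 around v0."""
    dist = {v0: 0}
    queue = deque([v0])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    layers = [set() for _ in range(max(dist.values()) + 1)]
    for v, d in dist.items():
        layers[d].add(v)
    return layers

if __name__ == "__main__":
    # made-up connected graph with bidirectional links; node 1 is an endpoint of a
    # diameter-achieving pair, so the number of layers equals diam(G) + 1
    adj = {1: [2], 2: [1, 3, 6], 3: [2, 4], 4: [3, 5], 5: [4], 6: [2]}
    print(bfs_layers(adj, 1))   # [{1}, {2}, {3, 6}, {4}, {5}]
```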

The successive partition of GG ensures that nodes in 𝒮i{\mathcal{S}}_{i} only communicate with nodes in 𝒮i1{\mathcal{S}}_{i-1} and 𝒮i+1{\mathcal{S}}_{i+1}, as well as among themselves. Indeed, suppose that the network graph GG includes an edge e=(u,v)e=(u,v)\in{\mathcal{E}} with u𝒮iu\in{\mathcal{S}}_{i} and v𝒮jv\in{\mathcal{S}}_{j}, where i>j+1i>j+1. By construction of the successive partition, u𝒫j+1c𝒫jcu\in{\mathcal{P}}^{c}_{j+1}\subset{\mathcal{P}}^{c}_{j} and v𝒫j𝒫j+1v\in{\mathcal{P}}_{j}\subset{\mathcal{P}}_{j+1}. Therefore, ee belongs to both 𝒫j{\mathcal{E}}_{{\mathcal{P}}_{j}} and 𝒫j+1{\mathcal{E}}_{{\mathcal{P}}_{j+1}}. However, the cutsets 𝒫j{\mathcal{E}}_{{\mathcal{P}}_{j}} and 𝒫j+1{\mathcal{E}}_{{\mathcal{P}}_{j+1}} are disjoint, so we arrive at a contradiction. Likewise, we can use the disjointness of the cutsets 𝒫ic{\mathcal{E}}_{{\mathcal{P}}^{c}_{i}} and 𝒫jc{\mathcal{E}}_{{\mathcal{P}}^{c}_{j}} to show that the network graph contains no edges (u,v)(u,v) with u𝒮iu\in{\mathcal{S}}_{i}, v𝒮jv\in{\mathcal{S}}_{j}, and j>i+1j>i+1.

In view of this, we can associate to the partition {𝒮i}\{{\mathcal{S}}_{i}\} a bidirected chain G=(𝒱,)G^{\prime}=({\mathcal{V}}^{\prime},{\mathcal{E}}^{\prime}), i.e., a network with vertex set 𝒱={1,,n}{\mathcal{V}}^{\prime}=\{1^{\prime},\ldots,n^{\prime}\}, edge set

={(i,(i1))}i=2n{(i,(i+1))}i=1n1{(i,i)}i=1n,\displaystyle{\mathcal{E}}^{\prime}=\big{\{}(i^{\prime},(i-1)^{\prime})\big{\}}_{i=2}^{n}\cup\big{\{}(i^{\prime},(i+1)^{\prime})\big{\}}_{i=1}^{n-1}\cup\big{\{}(i^{\prime},i^{\prime})\big{\}}_{i=1}^{n},

and channel transition laws

K(i,(i1))\displaystyle K_{(i^{\prime},(i-1)^{\prime})} =(u,v):u𝒮i,v𝒮i1K(u,v)\displaystyle=\bigotimes_{(u,v)\in{\mathcal{E}}:u\in{\mathcal{S}}_{i},v\in{\mathcal{S}}_{i-1}}K_{(u,v)} (26)
K(i,(i+1))\displaystyle K_{(i^{\prime},(i+1)^{\prime})} =(u,v):u𝒮i,v𝒮i+1K(u,v)\displaystyle=\bigotimes_{(u,v)\in{\mathcal{E}}:u\in{\mathcal{S}}_{i},v\in{\mathcal{S}}_{i+1}}K_{(u,v)} (27)
K(i,i)\displaystyle K_{(i^{\prime},i^{\prime})} =(u,v):u𝒮i,v𝒮iK(u,v),\displaystyle=\bigotimes_{(u,v)\in{\mathcal{E}}:u\in{\mathcal{S}}_{i},v\in\overset{{}_{\leftarrow}}{\partial}{\mathcal{S}}_{i}}K_{(u,v)}, (28)

where node ii^{\prime} in GG^{\prime} observes

Wi=W𝒮i.\displaystyle W_{i^{\prime}}=W_{{\mathcal{S}}_{i}}.

In other words, the subset 𝒮i{\mathcal{S}}_{i} in GG is reduced to node ii^{\prime} in GG^{\prime}; the channels across the subsets in GG are reduced to the channels between the nodes in GG^{\prime}; and the channels from 𝒮i{\mathcal{S}}_{i} to 𝒮i\overset{{}_{\leftarrow}}{\partial}{\mathcal{S}}_{i} in GG are reduced to a self-loop at node ii^{\prime} in GG^{\prime}. The channels from 𝒮i{\mathcal{S}}_{i} to 𝒮i𝒮i{\mathcal{S}}_{i}\setminus\overset{{}_{\leftarrow}}{\partial}{\mathcal{S}}_{i} in GG are not included in GG^{\prime}, and will be simulated by node ii^{\prime} using private randomness. For the network in Fig. 2a in Sec. I-B, according to the illustrated partition, it can be reduced to a 33-node bidirected chain in Fig. 2b, with K(1,1)=K(1,4)K_{(1^{\prime},1^{\prime})}=K_{(1,4)}, K(2,2)=K(5,2)K_{(2^{\prime},2^{\prime})}=K_{(5,2)}, and K(3,3)=K(3,6)K(6,3)K_{(3^{\prime},3^{\prime})}=K_{(3,6)}\otimes K_{(6,3)}. For the network in Fig. 4a, according to the illustrated partition, it can be reduced to a 55-node bidirected chain in Fig. 4b, with K(2,2)=K(2,7)K(7,2)K_{(2^{\prime},2^{\prime})}=K_{(2,7)}\otimes K_{(7,2)}, K(3,3)=K(6,3)K(6,8)K(9,8)K_{(3^{\prime},3^{\prime})}=K_{(6,3)}\otimes K_{(6,8)}\otimes K_{(9,8)}, and K(4,4)=K(4,10)K(10,4)K_{(4^{\prime},4^{\prime})}=K_{(4,10)}\otimes K_{(10,4)}. According to the partition in Fig. 5a, the same network can be reduced to a 55-node bidirected chain in Fig. 5b, with K(2,2)=K(2,7)K(7,2)K_{(2^{\prime},2^{\prime})}=K_{(2,7)}\otimes K_{(7,2)}, K(4,4)=K(6,9)K(9,6)K_{(4^{\prime},4^{\prime})}=K_{(6,9)}\otimes K_{(9,6)}, and K(5,5)=K(5,10)K(10,5)K_{(5^{\prime},5^{\prime})}=K_{(5,10)}\otimes K_{(10,5)}.

For the bidirected chain GG^{\prime} reduced from GG, we consider a class of randomized TT-step algorithms that run on GG^{\prime} and are of a more general form compared to the deterministic algorithms considered so far. Such a randomized algorithm operates as follows: at step t{1,,T}t\in\{1,\ldots,T\}, node ii^{\prime} computes the outgoing messages X(i,(i1)),t=φi,t(Wi,Yit1)X_{(i^{\prime},(i-1)^{\prime}),t}=\overset{{}_{\leftarrow}}{\varphi}_{i^{\prime},t}(W_{i^{\prime}},Y^{t-1}_{i^{\prime}}), X(i,(i+1)),t=φi,t(Wi,Yit1,Uit1)X_{(i^{\prime},(i+1)^{\prime}),t}=\overset{{}_{\rightarrow}}{\varphi}_{i^{\prime},t}(W_{i^{\prime}},Y^{t-1}_{i^{\prime}},U^{t-1}_{i^{\prime}}), and X(i,i),t=φ̊i,t(Wi,Yit1,Uit1)X_{(i^{\prime},i^{\prime}),t}=\mathring{\varphi}_{i^{\prime},t}(W_{i^{\prime}},Y^{t-1}_{i^{\prime}},U^{t-1}_{i^{\prime}}), and computes the private message Ui,t=ϑi,t(Wi,Yit1,Uit1,Ri,t)U_{i^{\prime},t}=\vartheta_{i^{\prime},t}(W_{i^{\prime}},Y_{i^{\prime}}^{t-1},U^{t-1}_{i^{\prime}},R_{i^{\prime},t}), where Ri,tR_{i^{\prime},t} is the private randomness held by node ii^{\prime}, uniformly distributed on [0,1][0,1] and independent across i𝒱i^{\prime}\in{\mathcal{V}}^{\prime} and t{1,,T}t\in\{1,\ldots,T\}. At step TT, node ii^{\prime} computes the final estimate Z^i=ψi(Wi,YiT){\widehat{Z}}_{i^{\prime}}=\psi_{i^{\prime}}(W_{i^{\prime}},Y^{T}_{i^{\prime}}) of ZZ. These randomized algorithms have the feature that the message sent to the node on the left and the final estimate of a node are computed solely based on the node’s initial observation and received messages, whereas the messages sent to the node on the right and to itself are computed based on the node’s initial observation, received messages, as well as private messages, and the computation of the private messages involves the node’s private randomness. Define

T(ε,δ)=inf{\displaystyle T^{\prime}(\varepsilon,\delta)=\inf\Big{\{} T: a randomized T-step algorithm 𝒜\displaystyle T\in{\mathbb{N}}:\exists\text{ a randomized $T$-step algorithm }{\mathcal{A}}^{\prime}
such that maxi𝒱[d(Z,Z^i)>ε]δ}\displaystyle\text{ such that }\max_{i^{\prime}\in{\mathcal{V}}^{\prime}}{\mathbb{P}}\big{[}d(Z,{\widehat{Z}}_{i^{\prime}})>\varepsilon\big{]}\!\leq\delta\Big{\}} (29)

as the (ε,δ)(\varepsilon,\delta)-computation time for ZZ on GG^{\prime} using the randomized algorithms described above. The following lemma indicates that we can obtain lower bounds on T(ε,δ)T(\varepsilon,\delta) by lower-bounding T(ε,δ)T^{\prime}(\varepsilon,\delta).

Lemma 7.

Consider an arbitrary network GG that can be successively partitioned into 𝒮1,,𝒮n{\mathcal{S}}_{1},\ldots,{\mathcal{S}}_{n}, such that 𝒮i\overset{{}_{\leftarrow}}{\partial}{\mathcal{S}}_{i}’s are all nonempty. Let G=(𝒱,)G^{\prime}=({\mathcal{V}}^{\prime},{\mathcal{E}}^{\prime}) be the bidirected chain constructed from GG according to the partition. Then, given any TT-step algorithm on GG that achieves maxv𝒱[d(Z,Z^v)>ε]δ\max_{v\in{\mathcal{V}}}{\mathbb{P}}[d(Z,{\widehat{Z}}_{v})>\varepsilon]\leq\delta, we can construct a randomized TT-step algorithm 𝒜{\mathcal{A}}^{\prime} on GG^{\prime}, such that maxi𝒱[d(Z,Z^i)>ε]δ\max_{i^{\prime}\in{\mathcal{V}}^{\prime}}{\mathbb{P}}[d(Z,{\widehat{Z}}_{i^{\prime}})>\varepsilon]\leq\delta. Consequently, T(ε,δ)T(\varepsilon,\delta) for computing ZZ on GG is lower bounded by T(ε,δ)T^{\prime}(\varepsilon,\delta) defined in (29).

Proof:

Appendix A. ∎

Remark: In the network reduction, we can alternatively map all the channels from 𝒮i{\mathcal{S}}_{i} to 𝒮i{\mathcal{S}}_{i} (instead of only mapping the channels from 𝒮i{\mathcal{S}}_{i} to 𝒮i\overset{{}_{\leftarrow}}{\partial}{\mathcal{S}}_{i}) in the original network GG to the self-loop at node ii^{\prime} of the reduced chain GG^{\prime}. By doing so, to simulate the operation of an algorithm 𝒜{\mathcal{A}} that runs on GG, the algorithm 𝒜{\mathcal{A}}^{\prime} that runs on GG^{\prime} no longer needs to generate private messages using the nodes’ private randomness, since all the channels in GG are preserved in GG^{\prime}. In other words, under this alternative reduction, any TT-step algorithm 𝒜{\mathcal{A}} that runs on GG can be simulated by a TT-step algorithm 𝒜{\mathcal{A}}^{\prime} of the same deterministic type as 𝒜{\mathcal{A}} that runs on GG^{\prime}. However, this alternative reduction increases the information transmission capability of the self-loops in GG^{\prime}, and will result in a looser lower bound on T(ε,δ)T(\varepsilon,\delta), as will be discussed in the remark following Theorem 3.

In light of Lemma 7, in order to lower-bound T(ε,δ)T(\varepsilon,\delta) for computing ZZ on GG, we just need to lower-bound T(ε,δ)T^{\prime}(\varepsilon,\delta) defined in (29). To this end, we derive upper bounds on the conditional mutual information for bidirected chains by extending the techniques behind Lemma 2 and Lemma 5:

Lemma 8.

Consider an nn-node bidirected chain with vertex set 𝒱={1,,n}{\mathcal{V}}=\{1,\ldots,n\} and edge set

={(i,i1)}i=2n{(i,i+1)}i=1n1{(i,i)}i=1n,\displaystyle{\mathcal{E}}=\big{\{}(i,i-1)\big{\}}_{i=2}^{n}\cup\big{\{}(i,i+1)\big{\}}_{i=1}^{n-1}\cup\big{\{}(i,i)\big{\}}_{i=1}^{n},

and an arbitrary randomized TT-step algorithm 𝒜{\mathcal{A}}^{\prime} that runs on this chain. Let ηiη(Ki)\eta_{i}\triangleq\eta(K_{i}) denote the SDPI constant of the channel Kij:(j,i)K(j,i)K_{i}\triangleq\bigotimes_{j:\,(j,i)\in{\mathcal{E}}}K_{(j,i)}, and let ηmaxi=1,,nηi\eta\triangleq\max_{i=1,\ldots,n}\eta_{i}. If Tn2T\leq n-2, then

I(Z;Z^n|W2:n)=0.\displaystyle I(Z;{\widehat{Z}}_{n}|W_{2:n})=0.

If Tn1T\geq n-1, then

I(Z;Z^n|W2:n)\displaystyle I(Z;{\widehat{Z}}_{n}|W_{2:n})\leq
H(W1|W2:n)ηi=1Tn+2(Ti,n2,η),\displaystyle H(W_{1}|W_{2:n})\eta\sum_{i=1}^{T-n+2}{\mathcal{B}}(T-i,n-2,\eta), ​​​​​​​​​ n2n\geq 2\qquad (30)
C(1,2)ηi=1Tn+2(Ti1,n3,η)i,\displaystyle C_{(1,2)}\eta\sum_{i=1}^{T-n+2}{\mathcal{B}}(T-i-1,n-3,\eta)i, ​​​​​​​​​ n3n\geq 3\qquad (31)

with (m,k,p)(mk)pk(1p)mk{\mathcal{B}}(m,k,p)\triangleq{m\choose k}p^{k}(1-p)^{m-k}. For n2n\geq 2, the above upper bounds can be weakened to

I(Z;Z^n|W2:n)\displaystyle\quad I(Z;{\widehat{Z}}_{n}|W_{2:n})\leq
H(W1|W2:n)(1(1η)Tn+2)n1,\displaystyle\!\!\!H(W_{1}|W_{2:n})\big{(}1-(1-\eta)^{T-n+2}\big{)}^{n-1}, (32)
C(1,2)(Tn+2)(1(1η)Tn+2)n2.\displaystyle\!\!\!C_{(1,2)}(T-n+2)\big{(}1-(1-\eta)^{T-n+2}\big{)}^{n-2}. (33)

Moreover, if n4n\geq 4 and

n1T2+(n3)γηn-1\leq T\leq 2+\frac{(n-3)\gamma}{\eta}

for some γ(0,1)\gamma\in(0,1), then

I(Z;Z^n|W2:n)\displaystyle I(Z;{\widehat{Z}}_{n}|W_{2:n})\leq
C(1,2)(n3)2γ2ηexp(2(ηγη)2(n3)).\displaystyle\qquad C_{(1,2)}\frac{(n-3)^{2}\gamma^{2}}{\eta}\exp\left(-2\left(\frac{\eta}{\gamma}-\eta\right)^{2}(n-3)\right). (34)
Proof:

Appendix B. ∎

Equation (30) is reminiscent of a result of Rajagopalan and Schulman [13] on the evolution of mutual information in broadcasting a bit over a unidirectional chain of BSCs. The result in [13] is obtained by solving a system of recursive inequalities on the mutual information involving suboptimal SDPI constants. Our results apply to chains of general bidirectional links and to the computation of general functions. We arrive at a system of inequalities similar to the one in [13], which can be solved in a similar manner and gives (30) and (31). We also obtain weakened upper bounds in (32) and (33), which show that, for a fixed TT, the conditional mutual information decays at least exponentially fast in nn. The upper bound in (34) provides another weakening of (30) and (31), and shows explicitly the dependence of the upper bound on nn.

Assuming for simplicity that H(W1|W2:n)=1H(W_{1}|W_{2:n})=1, Fig. 6 compares (30) with the weakened upper bound in (32). We can see that the gap can be large when nn is large and TT is much larger than nn. Nevertheless, the weakened upper bounds in (32) and (33) allow us to derive lower bounds on computation time that are non-asymptotic in nn, and explicit in ε\varepsilon, δ\delta, and channel properties.
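The comparison in Fig. 6 can be reproduced numerically. The sketch below (an added illustration) evaluates (30) and its weakened form (32) with H(W_1|W_{2:n}) = 1 for hypothetical values of the chain length n, the SDPI constant η, and the number of steps T.

```python
from math import comb

def binom_pmf(m, k, p):
    """B(m, k, p) = C(m, k) p^k (1 - p)^{m - k}."""
    if k < 0 or k > m:
        return 0.0
    return comb(m, k) * p**k * (1 - p)**(m - k)

def bound_30(T, n, eta, H=1.0):
    """Right-hand side of (30)."""
    return H * eta * sum(binom_pmf(T - i, n - 2, eta) for i in range(1, T - n + 3))

def bound_32(T, n, eta, H=1.0):
    """Right-hand side of the weakened bound (32)."""
    return H * (1 - (1 - eta)**(T - n + 2))**(n - 1)

if __name__ == "__main__":
    n, eta = 10, 0.2           # hypothetical chain length and SDPI constant
    for T in (n - 1, 2 * n, 5 * n, 20 * n):
        print(T, bound_30(T, n, eta), bound_32(T, n, eta))
```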

Figure 6: Upper bound in (30) (solid line) vs. the weakened one in (32) (dashed line) for chains.

III-B Lower bounds on computation time

We now build on the results presented above to obtain lower bounds on T(ε,δ)T(\varepsilon,\delta) by reducing the original problem to function computation over bidirected chains. We first provide the result for an arbitrary network, and then particularize it to several specific topologies (namely, chains, rings, grids, and trees).

III-B1 Lower bound for an arbitrary network

Theorem 3 below contains general lower bounds on computation time for an arbitrary network. The statement of the theorem is somewhat lengthy, but can be parsed as follows: Given an arbitrary connected network with bidirectional links, any reduction of that network to a bidirected chain gives rise to a system of inequalities that must be satisfied by the computation time T(ε,δ)T(\varepsilon,\delta). These inequalities, presented in (35), are nonasymptotic in nature and involve explicitly computable parameters of the network, but cannot be solved in closed form. The first inequality follows from an SDPI-based analysis analogous to Theorem 2, while the second inequality is a cutset bound in the spirit of Theorem 1. Explicit but weaker expressions that lower-bound T(ε,δ)T(\varepsilon,\delta) in terms of network parameters appear below as (36) and (37), together with asymptotic expressions for large nn (the size of the reduced bidirected chain). Both of these bounds state that T(ε,δ)T(\varepsilon,\delta) is lower-bounded by the size of the bidirected chain plus a correction term that accounts for the effect of channel noise (via channel capacities and SDPI constants). Finally, (38) and (39) provide the precise version of the bound in (8): asymptotically, the computation time T(ε,δ)T(\varepsilon,\delta) scales as Ω(n/η~)\Omega(n/\tilde{\eta}), where η~\tilde{\eta} is the worst-case SDPI constant of the reduced network. By Lemma 6, it is always possible to reduce the network to a bidirected chain of length diam(G)+1{\rm diam}(G)+1, so the main message of Theorem 3 is that the computation time T(ε,δ)T(\varepsilon,\delta) scales at least linearly in the network diameter. Thus, the main advantage of the multi-cutset analysis over the usual single-cutset analysis is that it can capture this dependence on the network diameter.

Theorem 3.

Assume the following:

  • The network graph G=(𝒱,)G=({\mathcal{V}},{\mathcal{E}}) is connected, the capacities of all edge links are upper-bounded by CC, and the SDPI constants of edge links are upper-bounded by η\eta.

  • GG admits a successive partition into 𝒮1,,𝒮n{\mathcal{S}}_{1},\ldots,{\mathcal{S}}_{n}, such that 𝒮i\overset{{}_{\leftarrow}}{\partial}{\mathcal{S}}_{i}’s are all nonempty.

Let

Δmaxi{2:n}di\Delta\triangleq\max_{i\in\{2:n\}}d_{i}

where

di=|𝒫i1c|+|𝒫i|+|{(𝒮i×𝒮i)}|d_{i}=|{\mathcal{E}}_{{\mathcal{P}}_{i-1}^{c}}|+|{\mathcal{E}}_{{\mathcal{P}}_{i}}|+|\{{\mathcal{E}}\cap({\mathcal{S}}_{i}\times\overset{{}_{\leftarrow}}{\partial}{\mathcal{S}}_{i})\}|

as defined in (25), and let

η~=1(1η)Δ.\tilde{\eta}=1-(1-\eta)^{\Delta}.

Then for ε0\varepsilon\geq 0 and δ(0,1/2]\delta\in(0,1/2], the (ε,δ)(\varepsilon,\delta)-computation time T(ε,δ)T(\varepsilon,\delta) must satisfy the inequalities

(𝒮1c,ε,δ)\displaystyle\ell({\mathcal{S}}_{1}^{c},\varepsilon,\delta)\leq
{H(W𝒮1|W𝒮1c)η~i=1T(ε,δ)n+2(T(ε,δ)i,n2,η~),n2C𝒮1cη~i=1T(ε,δ)n+2(T(ε,δ)i1,n3,η~)i,n3.\displaystyle\begin{cases}H(W_{{\mathcal{S}}_{1}}|W_{{\mathcal{S}}_{1}^{c}})\tilde{\eta}\displaystyle\sum\limits_{i=1}^{T(\varepsilon,\delta)-n+2}{\mathcal{B}}(T(\varepsilon,\delta)-i,n-2,\tilde{\eta}),&\!\!n\geq 2\\ C_{{\mathcal{S}}_{1}^{c}}\tilde{\eta}\displaystyle\sum\limits_{i=1}^{T(\varepsilon,\delta)-n+2}{\mathcal{B}}(T(\varepsilon,\delta)-i-1,n-3,\tilde{\eta})i,&\!\!n\geq 3.\end{cases} (35)

The above results can be weakened to

T(ε,δ)\displaystyle T(\varepsilon,\delta) log(1((𝒮1c,ε,δ)H(W𝒮1|W𝒮1c))1n1)1Δlog(1η)1+n2\displaystyle\geq\frac{\log\left(1-\big{(}\frac{\ell({\mathcal{S}}_{1}^{c},\varepsilon,\delta)}{H(W_{{\mathcal{S}}_{1}}|W_{{\mathcal{S}}_{1}^{c}})}\big{)}^{\frac{1}{n-1}}\right)^{-1}}{\Delta\log(1-\eta)^{-1}}+n-2 (36)
log(n1)+log(1(𝒮1c,ε,δ)H(W𝒮1|W𝒮1c))1Δlog(1η)1+n2,\displaystyle\sim\frac{\log(n-1)+\log\big{(}1-\frac{\ell\left({\mathcal{S}}_{1}^{c},\varepsilon,\delta\right)}{H(W_{{\mathcal{S}}_{1}}|W_{{\mathcal{S}}_{1}^{c}})}\big{)}^{-1}}{\Delta\log(1-\eta)^{-1}}+n-2,

as nn\rightarrow\infty, and

T(ε,δ)(𝒮1c,ε,δ)C𝒮1c+n2.\displaystyle T(\varepsilon,\delta)\geq\frac{\ell\left({\mathcal{S}}_{1}^{c},\varepsilon,\delta\right)}{C_{{\mathcal{S}}_{1}^{c}}}+n-2. (37)

Moreover, if the partition size nn is large enough, so that n4n\geq 4 and

C|𝒱|2(n3)24ηexp(2η2(n3))<(𝒮1c,ε,δ),\displaystyle\frac{C|{\mathcal{V}}|^{2}(n-3)^{2}}{4\eta}\exp\left(-2\eta^{2}(n-3)\right)<\ell({\mathcal{S}}^{c}_{1},\varepsilon,\delta), (38)

then

T(ε,δ)>2+n32η~2+n32Δη.\displaystyle T(\varepsilon,\delta)>2+\frac{n-3}{2\tilde{\eta}}\geq 2+\frac{n-3}{2\Delta\eta}. (39)
Proof:

In light of Lemma 7, it suffices to show that the lower bounds in Theorem 3 need to be satisfied by T(ε,δ)T^{\prime}(\varepsilon,\delta) for the bidirected chain GG^{\prime}, to which GG reduces according to the partition {𝒮i}\{{\mathcal{S}}_{i}\}.

Consider any randomized TT-step algorithm 𝒜{\mathcal{A}}^{\prime} that achieves maxi𝒱[d(Z,Z^i)>ε]δ\max_{i^{\prime}\in{\mathcal{V}}^{\prime}}{\mathbb{P}}[d(Z,{\widehat{Z}}_{i^{\prime}})>\varepsilon]\leq\delta on GG^{\prime}. From Lemma 1,

I(Z;Z^n|W2:n)({2:n},ε,δ).I(Z;{\widehat{Z}}_{n^{\prime}}|W_{2^{\prime}:n^{\prime}})\geq\ell(\{2^{\prime}:n^{\prime}\},\varepsilon,\delta).

Then from Lemma 8 and the fact that

ηi\displaystyle\eta_{i^{\prime}} =η(K((i1),i)K((i+1),i)Ki,i)\displaystyle=\eta(K_{((i-1)^{\prime},i^{\prime})}\otimes K_{((i+1)^{\prime},i^{\prime})}\otimes K_{i^{\prime},i^{\prime}})
1(1η)di\displaystyle\leq 1-(1-\eta)^{d_{i}}
1(1η)Δ,\displaystyle\leq 1-(1-\eta)^{\Delta}, (40)

we have

({2:n},ε,δ)\displaystyle\ell(\{2^{\prime}:n^{\prime}\},\varepsilon,\delta)\leq
{H(W1|W2:n)η~i=1Tn+2(Ti,n2,η~),n2C(1,2)η~i=1Tn+2(Ti1,n3,η~)i,n3,\displaystyle\begin{cases}H(W_{1^{\prime}}|W_{2^{\prime}:n^{\prime}})\tilde{\eta}\displaystyle\sum\limits_{i=1}^{T-n+2}{\mathcal{B}}(T-i,n-2,\tilde{\eta}),&\quad n\geq 2\\ C_{(1^{\prime},2^{\prime})}\tilde{\eta}\displaystyle\sum\limits_{i=1}^{T-n+2}{\mathcal{B}}(T-i-1,n-3,\tilde{\eta})i,&\quad n\geq 3,\end{cases}

and for n2n\geq 2,

({2:n},ε,δ)\displaystyle\ell(\{2^{\prime}:n^{\prime}\},\varepsilon,\delta)\leq
{H(W1|W2:n)i=2n(1(1η)di(Tn+2))C(1,2)(Tn+2)i=3n(1(1η)di(Tn+2)).\displaystyle\begin{cases}H(W_{1^{\prime}}|W_{2^{\prime}:n^{\prime}})\displaystyle\prod\limits_{i=2}^{n}\big{(}1-(1-\eta)^{d_{i}(T-n+2)}\big{)}\\ C_{(1^{\prime},2^{\prime})}(T-n+2)\displaystyle\prod\limits_{i=3}^{n}\big{(}1-(1-\eta)^{d_{i}(T-n+2)}\big{)}.\end{cases} (41)

Since ({2:n},ε,δ)=(𝒮1c,ε,δ)\ell(\{2^{\prime}:n^{\prime}\},\varepsilon,\delta)=\ell({\mathcal{S}}_{1}^{c},\varepsilon,\delta), H(W1|W2:n)=H(W𝒮1|W𝒮1c)H(W_{1^{\prime}}|W_{2^{\prime}:n^{\prime}})=H(W_{{\mathcal{S}}_{1}}|W_{{\mathcal{S}}_{1}^{c}}), and C(1,2)=C𝒮1cC_{(1^{\prime},2^{\prime})}=C_{{\mathcal{S}}_{1}^{c}}, we see that T(ε,δ)T^{\prime}(\varepsilon,\delta) must satisfy (35) in Theorem 3.

Using (40), (41) can be weakened to

(𝒮1c,ε,δ)\displaystyle\ell({\mathcal{S}}_{1}^{c},\varepsilon,\delta)\leq
{H(W𝒮1|W𝒮1c)(1(1η)Δ(Tn+2))n1C𝒮1c(Tn+2)(1(1η)Δ(Tn+2))n2.\displaystyle\begin{cases}H(W_{{\mathcal{S}}_{1}}|W_{{\mathcal{S}}_{1}^{c}})\big{(}1-(1-\eta)^{\Delta(T-n+2)}\big{)}^{n-1}\\ C_{{\mathcal{S}}_{1}^{c}}(T-n+2)\big{(}1-(1-\eta)^{\Delta(T-n+2)}\big{)}^{n-2}\end{cases}. (42)

The first line of (42) leads to

T(ε,δ)\displaystyle T^{\prime}(\varepsilon,\delta) log(1((𝒮1c,ε,δ)H(W𝒮1|W𝒮1c))1n1)1Δlog(1η)1+n2\displaystyle\geq\frac{\log\left(1-\big{(}\frac{\ell({\mathcal{S}}_{1}^{c},\varepsilon,\delta)}{H(W_{{\mathcal{S}}_{1}}|W_{{\mathcal{S}}_{1}^{c}})}\big{)}^{\frac{1}{n-1}}\right)^{-1}}{\Delta\log(1-\eta)^{-1}}+n-2
log(n1)+log(1(𝒮1c,ε,δ)H(W𝒮1|W𝒮1c))1Δlog(1η)1+n2,\displaystyle\sim\frac{\log(n-1)+\log\big{(}1-\frac{\ell\left({\mathcal{S}}_{1}^{c},\varepsilon,\delta\right)}{H(W_{{\mathcal{S}}_{1}}|W_{{\mathcal{S}}_{1}^{c}})}\big{)}^{-1}}{\Delta\log(1-\eta)^{-1}}+n-2,

where the last step follows from the fact that log(1p1n)1logn1p\log\big{(}1-p^{\frac{1}{n}}\big{)}^{-1}\sim\log\frac{n}{1-p} as nn\rightarrow\infty for p(0,1)p\in(0,1). The second line of (42) leads to

T(ε,δ)(𝒮1c,ε,δ)C𝒮1c+n2.\displaystyle T^{\prime}(\varepsilon,\delta)\geq\frac{\ell\left({\mathcal{S}}_{1}^{c},\varepsilon,\delta\right)}{C_{{\mathcal{S}}_{1}^{c}}}+n-2.

Finally, we prove that T(ε,δ)=Ω(n/η~)T^{\prime}(\varepsilon,\delta)=\Omega(n/\tilde{\eta}) under the assumption that (38) holds. Suppose that T(ε,δ)2+(n3)/2η~T^{\prime}(\varepsilon,\delta)\leq 2+(n-3)/2\tilde{\eta}. Then, from (34) in Lemma 8, we have

(𝒮1c,ε,δ)C𝒮1c(n3)24η~exp(2η~2(n3)),if n4.\displaystyle\ell({\mathcal{S}}_{1}^{c},\varepsilon,\delta)\leq C_{{\mathcal{S}}_{1}^{c}}\frac{(n-3)^{2}}{4\tilde{\eta}}\exp\left(-2{\tilde{\eta}}^{2}(n-3)\right),\quad\text{if $n\geq 4$}.

Note that Δ1\Delta\geq 1 by the assumption that GG is connected, thus η~=1(1η)Δη\tilde{\eta}=1-(1-\eta)^{\Delta}\geq\eta. Moreover, C𝒮1cC||C|𝒱|2C_{{\mathcal{S}}_{1}^{c}}\leq C|{\mathcal{E}}|\leq C|{\mathcal{V}}|^{2}. As a result,

(𝒮1c,ε,δ)\displaystyle\ell({\mathcal{S}}_{1}^{c},\varepsilon,\delta) C|𝒱|2(n3)24ηexp(2η2(n3)),if n4,\displaystyle\leq\frac{C|{\mathcal{V}}|^{2}(n-3)^{2}}{4\eta}\exp\left(-2\eta^{2}(n-3)\right),\quad\text{if $n\geq 4$,}

which contradicts the assumption that (38) holds. Thus,

T(ε,δ)>2+n32η~2+n32ΔηT^{\prime}(\varepsilon,\delta)>2+\frac{n-3}{2\tilde{\eta}}\geq 2+\frac{n-3}{2\Delta\eta}

Theorem 3 then follows from Lemma 7. ∎

Remarks:

  • We call a node in 𝒮i{\mathcal{S}}_{i} a boundary node if there is an edge (either inward or outward) between it and a node in 𝒮i1{\mathcal{S}}_{i-1} or 𝒮i+1{\mathcal{S}}_{i+1}. Denote the set of boundary nodes of 𝒮i{\mathcal{S}}_{i} by 𝒮i\partial{\mathcal{S}}_{i}. The results in Theorem 3 can be weakened by replacing did_{i} with

    di=v𝒮i|v|,\displaystyle\partial d_{i}=\sum_{v\in\partial{\mathcal{S}}_{i}}|{\mathcal{E}}_{v}|,

    namely the summation of the in-degrees of boundary nodes of 𝒮i{\mathcal{S}}_{i}, since didid_{i}\leq\partial d_{i} for i{2,,n}i\in\{2,\ldots,n\}.

  • As discussed in the remark following Lemma 7, an alternative network reduction is to map all the channels from 𝒮i{\mathcal{S}}_{i} to 𝒮i{\mathcal{S}}_{i} (instead of only mapping the channels from 𝒮i{\mathcal{S}}_{i} to 𝒮i\overset{{}_{\leftarrow}}{\partial}{\mathcal{S}}_{i}) in the original network GG to the self-loop at node ii^{\prime} of the reduced chain GG^{\prime}. Using the same proof strategy with this alternative reduction, we can obtain lower bounds on T(ε,δ)T(\varepsilon,\delta) of the same form as the results in Theorem 3, but with did_{i}’s replaced by

    d~i|𝒫i1c|+|𝒫i|+|{(𝒮i×𝒮i)}|.\tilde{d}_{i}\triangleq|{\mathcal{E}}_{{\mathcal{P}}_{i-1}^{c}}|+|{\mathcal{E}}_{{\mathcal{P}}_{i}}|+|\{{\mathcal{E}}\cap({\mathcal{S}}_{i}\times{\mathcal{S}}_{i})\}|.

    Since didid~id_{i}\leq\partial d_{i}\leq\tilde{d}_{i} for i{2,,n}i\in\{2,\ldots,n\}, the lower bounds on T(ε,δ)T(\varepsilon,\delta) obtained by this alternative network reduction are weaker than the results in Theorem 3, and are even weaker than the results obtained by replacing did_{i}’s with di\partial d_{i}’s.

  • Due to Lemma 6, for a network GG with bidirectional links, we can always find a successive partition of GG such that nn in Theorem 3 is equal to diam(G)+1{\rm diam}(G)+1. By contrast, the diameter cannot be captured in general by the theorems in Section II.

  • Choosing a successive partition of GG with n=2n=2 is equivalent to choosing a single cutset. In that case, we see that (37) recovers Theorem 1, while (36) recovers a weakened version of Theorem 2 (in (36), Δ=d2\Delta=d_{2} is at least the sum of the in-degrees of the left-bound nodes of 𝒮2{\mathcal{S}}_{2}, while Theorem 2 involves the in-degree of only one node in 𝒮2{\mathcal{S}}_{2}).
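As a quick numerical illustration (added here, with made-up parameters), the closed-form bounds (36) and (37) can be evaluated as follows; ell stands for ℓ(𝒮_1^c, ε, δ), H for H(W_{𝒮_1}|W_{𝒮_1^c}), and C_cut for C_{𝒮_1^c}.

```python
import math

def bound_36(ell, H, n, Delta, eta):
    """Right-hand side of (36); the ratio of logarithms makes the base irrelevant."""
    ratio = (ell / H) ** (1.0 / (n - 1))
    return math.log(1.0 / (1.0 - ratio)) / (Delta * math.log(1.0 / (1.0 - eta))) + n - 2

def bound_37(ell, C_cut, n):
    """Right-hand side of (37)."""
    return ell / C_cut + n - 2

if __name__ == "__main__":
    # made-up parameters for a reduction into n = 20 subsets
    print(bound_36(ell=0.9, H=1.0, n=20, Delta=4, eta=0.3))
    print(bound_37(ell=0.9, C_cut=2.0, n=20))
```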

We now apply Theorem 3 to networks with specific topologies. We assume that nodes communicate via bidirectional links. Thus, any such network will be represented by an undirected graph, where each undirected edge represents a pair of channels with opposite directions.

III-B2 Chains

For chains, the proof of Theorem 3 already contains lower bounds on T(ε,δ)T^{\prime}(\varepsilon,\delta). These lower bounds apply to T(ε,δ)T(\varepsilon,\delta) as well, since the class of TT-step algorithms on a chain is a subcollection of randomized TT-step algorithms on the same chain. We thus have the following corollary.

Corollary 3.

Consider an nn-node bidirected chain without self-loops, where the SDPI constants of all channels are upper bounded by η\eta. Then for ε0\varepsilon\geq 0 and δ(0,1/2]\delta\in(0,1/2], T(ε,δ)T(\varepsilon,\delta) must satisfy the inequalities in Theorem 3 with 𝒮1={1}{\mathcal{S}}_{1}=\{1\} and di=2d_{i}=2 for all i{1,,n}i\in\{1,\ldots,n\}. In particular, if all channels are BSC(p){\rm BSC}(p), then

T(ε,\displaystyle T(\varepsilon, δ)max{(𝒱{1},ε,δ)1h2(p),\displaystyle\delta)\geq\max\bigg{\{}\frac{\ell\big{(}{\mathcal{V}}\setminus\{1\},\varepsilon,\delta\big{)}}{1-h_{2}(p)},
log(n1)+log(1(𝒱{1},ε,δ)H(W1|W𝒱{1}))12log(4pp¯)1}+n2\displaystyle\frac{\log(n-1)+\log\big{(}1-\frac{\ell\big{(}{\mathcal{V}}\setminus\{1\},\varepsilon,\delta\big{)}}{H(W_{1}|W_{{\mathcal{V}}\setminus\{1\}})}\big{)}^{-1}}{2\log(4p\bar{p})^{-1}}\bigg{\}}+n-2

for all sufficiently large nn.

Here and below, the estimates for a network of bidirectional BSCs are obtained using the bounds (16) and (17).

III-B3 Rings

Consider a ring with 2n22n-2 nodes, where the nodes are labeled clockwise from 11 to 2n22n-2. The diameter is equal to n1n-1. According to the successive partition in the proof of Lemma 6, this ring can be partitioned into 𝒮1={1}{\mathcal{S}}_{1}=\{1\}, 𝒮i={i,2ni}{\mathcal{S}}_{i}=\{i,2n-i\}, i{2,,n1}i\in\{2,\ldots,n-1\}, and 𝒮n={n}{\mathcal{S}}_{n}=\{n\}. As an example, Fig. 7a shows a 66-node ring and Fig. 7b shows the chain reduced from it.

Figure 7: (a) A ring network; (b) the chain reduced from it.

With this partition, we can apply Theorem 3 and get the following corollary.

Corollary 4.

Consider a (2n2)(2n-2)-node ring, where the SDPI constants of all channels are upper bounded by η\eta. Then for ε0\varepsilon\geq 0 and δ(0,1/2]\delta\in(0,1/2], T(ε,δ)T(\varepsilon,\delta) must satisfy the inequalities in Theorem 3 with 𝒮1={1}{\mathcal{S}}_{1}=\{1\} and di=4d_{i}=4 for all i{1,,n}i\in\{1,\ldots,n\}. In particular, if all channels are BSC(p){\rm BSC}(p), then

T(ε,δ)\displaystyle T(\varepsilon,\delta) max{(𝒱{1},ε,δ)2(1h2(p)),\displaystyle\geq\max\bigg{\{}\frac{\ell\big{(}{\mathcal{V}}\setminus\{1\},\varepsilon,\delta\big{)}}{2(1-h_{2}(p))},
log(n1)+log(1(𝒱{1},ε,δ)H(W1|W𝒱{1}))14log(4pp¯)1}+n2\displaystyle\frac{\log(n-1)+\log\big{(}1-\frac{\ell\left({\mathcal{V}}\setminus\{1\},\varepsilon,\delta\right)}{H(W_{1}|W_{{\mathcal{V}}\setminus\{1\}})}\big{)}^{-1}}{4\log(4p\bar{p})^{-1}}\bigg{\}}+n-2

for all sufficiently large nn.

III-B4 Grids

Consider an n+12×n+12\frac{n+1}{2}\times\frac{n+1}{2} grid (where we assume nn is odd), which has diameter n1n-1. Figure 8a shows a successive partition of an n+12×n+12\frac{n+1}{2}\times\frac{n+1}{2} grid into n+12\frac{n+1}{2} subsets, with Δ=maxi{2:n}di=2n\Delta=\max_{i\in\{2:n\}}d_{i}=2n. Figure 8b shows the successive partition in the proof of Lemma 6, which partitions the network into nn subsets, with Δ=maxi{2:n}di=2(n1)\Delta=\max_{i\in\{2:n\}}d_{i}=2(n-1), thus resulting in strictly tighter lower bounds on computation time compared to the ones obtained from the partition in Fig. 8a. With the latter partition, we get the following corollary.

Figure 8: Successive partitions (a) and (b) of a 4×4 (n=7) grid network. The length of the labeled path is the diameter of the network.
Corollary 5.

Consider an n+12×n+12\frac{n+1}{2}\times\frac{n+1}{2} grid, where 1n1-\ldots-n is one of the longest paths. Assume that the SDPI constants of all channels are upper bounded by η\eta. Then for ε0\varepsilon\geq 0 and δ(0,1/2]\delta\in(0,1/2], T(ε,δ)T(\varepsilon,\delta) must satisfy the inequalities in Theorem 3 with 𝒮1={1}{\mathcal{S}}_{1}=\{1\}, di=dn+1i=4(i2)+6d_{i}=d_{n+1-i}=4(i-2)+6, i{1,,n12}i\in\{1,\ldots,\frac{n-1}{2}\}, and d(n+1)/2=2(n1)d_{{(n+1)}/{2}}=2(n-1). In particular, if all channels are BSC(p){\rm BSC}(p), then

T(ε,\displaystyle T(\varepsilon, δ)max{(𝒱{1},ε,δ)2(1h2(p)),\displaystyle\delta)\geq\max\bigg{\{}\frac{\ell\big{(}{\mathcal{V}}\setminus\{1\},\varepsilon,\delta\big{)}}{2(1-h_{2}(p))},
log(n1)+log(1(𝒱{1},ε,δ)H(W1|W𝒱{1}))12(n1)log(4pp¯)1}+n2\displaystyle\frac{\log(n-1)+\log\big{(}1-\frac{\ell\left({\mathcal{V}}\setminus\{1\},\varepsilon,\delta\right)}{H(W_{1}|W_{{\mathcal{V}}\setminus\{1\}})}\big{)}^{-1}}{2(n-1)\log(4p\bar{p})^{-1}}\bigg{\}}+n-2

for all sufficiently large nn.

III-B5 Trees

Consider a tree, whose nodes are numbered in such a way that 1n1-\ldots-n is one of the longest paths. Then the diameter of the tree is n1n-1, and nodes 11 and nn are necessarily leaf nodes. The tree can be viewed as being rooted at node 11. Let 𝒟i{\mathcal{D}}_{i} be the union of node ii and its descendants in the rooted tree, and let 𝒮i=𝒟i𝒟i+1{\mathcal{S}}_{i}={\mathcal{D}}_{i}\setminus{\mathcal{D}}_{i+1}, i{1,,n}i\in\{1,\ldots,n\}. The tree can then be successively partitioned into 𝒮1,,𝒮n{\mathcal{S}}_{1},\ldots,{\mathcal{S}}_{n}. In the nn-node bidirected chain reduced according to this partition, the edges between nodes ii^{\prime} and (i+1)(i+1)^{\prime} are the pair of channels between nodes ii and i+1i+1 in the tree, and the self-loop of node ii^{\prime}, i{2,,n1}i\in\{2,\ldots,n-1\}, is the channel from 𝒮i{i}{\mathcal{S}}_{i}\setminus\{i\} to node ii in the tree.

Figure 9: Successive partitions (a) and (b) of a tree network.

As an example, Fig. 9a shows this partition of a tree network, and the chain reduced from it has the same form as the one in Fig. 4b. With this partition, we get the following corollary.

Corollary 6.

Consider a dd-regular tree network where 1n1-\ldots-n is one of the longest paths. Assume that the SDPI constants of all channels are upper bounded by η\eta. Then for ε0\varepsilon\geq 0 and δ(0,1/2]\delta\in(0,1/2], T(ε,δ)T(\varepsilon,\delta) must satisfy the inequalities in Theorem 3 with 𝒮1={1}{\mathcal{S}}_{1}=\{1\} and di=dd_{i}=d for all i{1,,n}i\in\{1,\ldots,n\}. In particular, if all channels are BSC(p){\rm BSC}(p), then

T(ε,\displaystyle T(\varepsilon, δ)max{(𝒱{1},ε,δ)1h2(p),\displaystyle\delta)\geq\max\bigg{\{}\frac{\ell\big{(}{\mathcal{V}}\setminus\{1\},\varepsilon,\delta\big{)}}{1-h_{2}(p)},
log(n1)+log(1(𝒱{1},ε,δ)H(W1|W𝒱{1}))1dlog(4pp¯)1}+n2\displaystyle\frac{\log(n-1)+\log\big{(}1-\frac{\ell\left({\mathcal{V}}\setminus\{1\},\varepsilon,\delta\right)}{H(W_{1}|W_{{\mathcal{V}}\setminus\{1\}})}\big{)}^{-1}}{d\log(4p\bar{p})^{-1}}\bigg{\}}+n-2

for all sufficiently large nn.

If we use the successive partition in the proof of Lemma 6 on a dd-regular tree with diameter n1n-1, then the tree will be reduced to an nn-node bidirected chain without self-loops. Figure 9b shows such an example. However, with this partition, Δ=maxi{2:n}di\Delta=\max_{i\in\{2:n\}}d_{i} increases with nn, which renders the resulting lower bound on computation time looser than the one in Corollary 6. This means that, although the partition in the proof of Lemma 6 always captures the diameter of a network, it may not always give the best lower bound on computation time among all possible successive partitions.

IV Small ball probability estimates for computation of linear functions

The bounds stated in the preceding sections involve the conditional small ball probability, defined in (9). In this section, we provide estimates for this quantity in the context of a distributed computation problem of wide interest — the computation of linear functions. Specifically, we assume that the observations Wv,v𝒱W_{v},v\in{\mathcal{V}}, are independent real-valued random variables, and the objective is to compute a linear function

Z=f(W)=v𝒱avWv\displaystyle Z=f(W)=\sum_{v\in{\mathcal{V}}}a_{v}W_{v} (43)

for a fixed vector of coefficients (av)v𝒱|𝒱|(a_{v})_{v\in{\mathcal{V}}}\in\mathbb{R}^{|{\mathcal{V}}|}, subject to the absolute error criterion d(z,z^)=|zz^|d(z,{\widehat{z}})=|z-{\widehat{z}}|. We will use the following shorthand notation: for any set 𝒮𝒱{\mathcal{S}}\subset{\mathcal{V}}, let a𝒮=(av)v𝒮a_{\mathcal{S}}=(a_{v})_{v\in{\mathcal{S}}} and a𝒮,W𝒮=v𝒮avWv\langle a_{\mathcal{S}},W_{\mathcal{S}}\rangle=\sum_{v\in{\mathcal{S}}}a_{v}W_{v}.

The independence of the WvW_{v}’s and the additive structure of ff allow us to express the conditional small ball probability L(w𝒮,ε)L(w_{\mathcal{S}},\varepsilon) defined in (9) in terms of so-called Lévy concentration functions of random sums [19]. The Lévy concentration function of a real-valued r.v. UU (also known as the “small ball probability”) is defined as

(U,ρ)=supu[|Uu|ρ],ρ>0.\displaystyle{\mathcal{L}}(U,\rho)=\sup_{u\in\mathbb{R}}{\mathbb{P}}\left[|U-u|\leq\rho\right],\qquad\rho>0. (44)

If we fix a subset 𝒮𝒱{\mathcal{S}}\subset{\mathcal{V}}, and consider a specific realization W𝒮=w𝒮W_{\mathcal{S}}=w_{\mathcal{S}} of the observations of the nodes in 𝒮{\mathcal{S}}, then

L(w𝒮,ε)\displaystyle L(w_{\mathcal{S}},\varepsilon) =supz[|v𝒱avWvz|ε|W𝒮=w𝒮]\displaystyle=\sup_{z\in\mathbb{R}}{\mathbb{P}}\Bigg{[}\Big{|}\sum_{v\in{\mathcal{V}}}a_{v}W_{v}-z\Big{|}\leq\varepsilon\Bigg{|}W_{\mathcal{S}}=w_{\mathcal{S}}\Bigg{]}
=supz[|v𝒮cavWv+v𝒮avwvz|ε]\displaystyle=\sup_{z\in\mathbb{R}}{\mathbb{P}}\left[\Big{|}\sum_{v\in{\mathcal{S}}^{c}}a_{v}W_{v}+\sum_{v\in{\mathcal{S}}}a_{v}w_{v}-z\Big{|}\leq\varepsilon\right]
=supz[|v𝒮cavWvz|ε]\displaystyle=\sup_{z\in\mathbb{R}}{\mathbb{P}}\Bigg{[}\Big{|}\sum_{v\in{\mathcal{S}}^{c}}a_{v}W_{v}-z\Big{|}\leq\varepsilon\Bigg{]}
=(a𝒮c,W𝒮c,ε),\displaystyle={\mathcal{L}}\left(\langle a_{{\mathcal{S}}^{c}},W_{{\mathcal{S}}^{c}}\rangle,\varepsilon\right), (45)

where in the second line we have used the fact that the WvW_{v}’s are independent r.v.’s, while in the third line we have used the fact that for any function g:g:\mathbb{R}\to\mathbb{R} and any aa\in\mathbb{R}, supzg(z)=supzg(z+a)\sup_{z}g(z)=\sup_{z}g(z+a). In other words, for a fixed 𝒮{\mathcal{S}}, the quantity L(w𝒮,ε)L(w_{\mathcal{S}},\varepsilon) is independent of the boundary condition w𝒮w_{\mathcal{S}}, and is controlled by the probability law of the random sum a𝒮c,W𝒮c\langle a_{{\mathcal{S}}^{c}},W_{{\mathcal{S}}^{c}}\rangle, i.e., the part of the function ff that depends on the observations of the nodes in 𝒮c{\mathcal{S}}^{c}.
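A quick Monte Carlo check of (45) (an added sketch, with made-up Gaussian observations and coefficients) is given below; it also compares the empirical value against the closed-form Gaussian estimate used in the next subsection.

```python
import numpy as np

def levy_concentration(samples, rho, grid_size=1000):
    """Crude estimate of sup_u P[|U - u| <= rho] from i.i.d. samples of U."""
    grid = np.linspace(samples.min(), samples.max(), grid_size)
    return max(np.mean(np.abs(samples - u) <= rho) for u in grid)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a_Sc = np.array([1.0, -0.5, 2.0])             # made-up coefficients of the nodes in S^c
    W_Sc = rng.standard_normal((50000, a_Sc.size))
    U = W_Sc @ a_Sc                               # the residual sum <a_{S^c}, W_{S^c}>
    eps = 0.1
    print(levy_concentration(U, eps))
    print(np.sqrt(2 / np.pi) * eps / np.linalg.norm(a_Sc))   # Gaussian small ball estimate
```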

The problem of estimating Lévy concentration functions of sums of independent random variables has a long history in the theory of probability — for random variables with densities, some of the first results go back at least to Kolmogorov [34], while for discrete random variables it is closely related to the so-called Littlewood–Offord problem [35]. We provide a few examples to illustrate how one can exploit available estimates for Lévy concentration functions under various regularity conditions to obtain tight lower bounds on the computation time for linear functions. The examples are illustrated through Theorem 1, as it tightly captures the dependence of computation time on (𝒮,ε,δ)\ell({\mathcal{S}},\varepsilon,\delta). (However, since the results of Theorems 2 and 3 also involve the quantity (𝒮,ε,δ)\ell({\mathcal{S}},\varepsilon,\delta), the estimates for Lévy concentration functions can be applied there as well.)

IV-A Computing linear functions of continuous observations

IV-A1 Gaussian sums

Suppose that the local observations WvW_{v}, v𝒱v\in{\mathcal{V}}, are i.i.d. standard Gaussian random variables. Then, for any 𝒮𝒱{\mathcal{S}}\subseteq{\mathcal{V}}, a𝒮,W𝒮\langle a_{{\mathcal{S}}},W_{{\mathcal{S}}}\rangle is a zero-mean Gaussian r.v. with variance a𝒮22=v𝒮av2\|a_{{\mathcal{S}}}\|^{2}_{2}=\sum_{v\in{\mathcal{S}}}a^{2}_{v} (here, 2\|\cdot\|_{2} is the usual Euclidean 2\ell_{2} norm). A simple calculation shows that

L(w𝒮,ε)\displaystyle L(w_{\mathcal{S}},\varepsilon) =(N(0,a𝒮c22),ε)2πεa𝒮c2.\displaystyle={\mathcal{L}}\left(N\left(0,\|a_{{\mathcal{S}}^{c}}\|^{2}_{2}\right),\varepsilon\right)\leq\sqrt{\frac{2}{\pi}}\frac{\varepsilon}{\|a_{{\mathcal{S}}^{c}}\|_{2}}.

Using this in Theorem 1, we get the following result.

Corollary 7.

For the problem of computing a linear function in (43), where (Wv)i.i.d.N(0,1)(W_{v})\overset{{\rm i.i.d.}}{\sim}N(0,1), suppose that the coefficients ava_{v} are all nonzero. Then for ε0\varepsilon\geq 0 and δ(0,1/2]\delta\in(0,1/2],

T(ε,δ)max𝒮𝒱1C𝒮(1δ2logπa𝒮c222ε2h2(δ)).\displaystyle T(\varepsilon,\delta)\geq\max_{{\mathcal{S}}\subset{\mathcal{V}}}\frac{1}{C_{\mathcal{S}}}\left(\frac{1-\delta}{2}\log\frac{\pi\|a_{{\mathcal{S}}^{c}}\|^{2}_{2}}{2\varepsilon^{2}}-h_{2}(\delta)\right).

Thus, the lower bound on the computation time for (43) depends on the vector of coefficients aa only through its 2\ell_{2} norm.
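A direct numerical evaluation of the Corollary 7 bound (an added sketch, assuming capacities and logarithms are measured in bits; all inputs are hypothetical) looks as follows.

```python
import math

def h2(d):
    if d in (0.0, 1.0):
        return 0.0
    return -d * math.log2(d) - (1 - d) * math.log2(1 - d)

def corollary7_bound(a_norm_sq, eps, delta, C_S):
    """Corollary 7 lower bound for one cut S, with a_norm_sq = ||a_{S^c}||_2^2."""
    return ((1 - delta) / 2 * math.log2(math.pi * a_norm_sq / (2 * eps**2)) - h2(delta)) / C_S

if __name__ == "__main__":
    print(corollary7_bound(a_norm_sq=100.0, eps=0.01, delta=0.05, C_S=2.0))
```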

IV-A2 Sums of independent r.v.’s with log-concave distributions

Another instance in which sharp bounds on the Lévy concentration function are available is when the observations of the nodes are independent random variables with log-concave distributions (we recall that a real-valued r.v. UU is said to have a log-concave distribution if it has a density of the form pU(u)=eF(u)p_{U}(u)=e^{-F(u)}, where F:(,+]F:\mathbb{R}\to(-\infty,+\infty] is a convex function; this includes Gaussian, Laplace, uniform, etc.). The following result was obtained recently by Bobkov and Chistyakov [36, Theorem 1.1]: Let U1,,UkU_{1},\ldots,U_{k} be independent random variables with log-concave distributions, and let Sk=U1++UkS_{k}=U_{1}+\ldots+U_{k}. Then, for any ρ0\rho\geq 0,

13ρVar(Sk)+ρ2/3(Sk,ρ)2ρVar(Sk)+ρ2/3.\displaystyle\!\!\!\!\frac{1}{\sqrt{3}}\frac{\rho}{\sqrt{{\rm Var}(S_{k})+{\rho^{2}}/{3}}}\leq{\mathcal{L}}(S_{k},\rho)\leq\!\frac{2\rho}{\sqrt{{\rm Var}(S_{k})+{\rho^{2}}/{3}}}. (46)
Corollary 8.

For the problem of computing a linear function in (43), where the WvW_{v}’s are independent random variables with log-concave distributions and with variances at least σ2\sigma^{2}, suppose that the coefficients ava_{v} are all nonzero. Then for ε0\varepsilon\geq 0 and δ(0,1/2]\delta\in(0,1/2],

T(ε,δ)max𝒮𝒱1C𝒮(1δ2log(σ2a𝒮c224ε2+112)h2(δ)).\displaystyle\!T(\varepsilon,\delta)\geq\max_{{\mathcal{S}}\subset{\mathcal{V}}}\frac{1}{C_{\mathcal{S}}}\left(\frac{1-\delta}{2}\log\left(\frac{\sigma^{2}\|a_{{\mathcal{S}}^{c}}\|^{2}_{2}}{4\varepsilon^{2}}+\frac{1}{12}\right)-h_{2}(\delta)\right).
Proof:

For each v𝒱v\in{\mathcal{V}}, avWva_{v}W_{v} also has a log-concave distribution, and, for any 𝒮𝒱{\mathcal{S}}\subset{\mathcal{V}},

Var(a𝒮c,W𝒮c)=v𝒮c|av|2Var(Wv)a𝒮c22σ2.\displaystyle{\rm Var}(\langle{a_{{\mathcal{S}}^{c}}},W_{{\mathcal{S}}^{c}}\rangle)=\sum_{v\in{\mathcal{S}}^{c}}|a_{v}|^{2}{\rm Var}(W_{v})\geq\|a_{{\mathcal{S}}^{c}}\|^{2}_{2}\sigma^{2}.

The lower bound follows from Theorem 1 and from (46). ∎

IV-A3 Sums of independent r.v.’s with bounded third moments

It is known that random variables with log-concave distributions have bounded moments of any order. Under the much weaker assumption that the local observations WvW_{v}, v𝒱v\in{\mathcal{V}}, have bounded third moments, we can prove the following result.

Corollary 9.

Consider the problem of computing the linear function in (43), where the WvW_{v}’s are independent zero-mean r.v.’s with variances at least 11 and with third moments bounded by BB, and the coefficients ava_{v} satisfy the constraint K1|av|K2K_{1}\leq|a_{v}|\leq K_{2} for some K1,K2>0K_{1},K_{2}>0. Then for ε0\varepsilon\geq 0 and δ(0,1/2]\delta\in(0,1/2],

T(ε,δ)max𝒮𝒱1C𝒮(1δ2log|𝒱𝒮|M2(ε)h2(δ))\displaystyle T(\varepsilon,\delta)\geq\max_{{\mathcal{S}}\subset{\mathcal{V}}}\frac{1}{C_{\mathcal{S}}}\Bigg{(}\frac{1-\delta}{2}\log\frac{|{\mathcal{V}}\setminus{\mathcal{S}}|}{M^{2}(\varepsilon)}-h_{2}(\delta)\Bigg{)}

where M(ε)c(ε/K1+B(K2/K1)3)M(\varepsilon)\triangleq c\big{(}\varepsilon/K_{1}+B(K_{2}/K_{1})^{3}\big{)} with some absolute constant cc.

Proof:

Under the conditions of the theorem, a small ball estimate due to Rudelson and Vershynin [37, Corollary 2.10] can be used to show that, for any 𝒮𝒱{\mathcal{S}}\subset{\mathcal{V}},

(a𝒮,W𝒮,ε)M(ε)|𝒮|.\displaystyle{\mathcal{L}}(\langle a_{{\mathcal{S}}},W_{{\mathcal{S}}}\rangle,\varepsilon)\leq\frac{M(\varepsilon)}{\sqrt{|{\mathcal{S}}|}}.

The desired conclusion follows immediately. ∎

IV-B Linear vector-valued functions

Similar to the Lévy concentration function of a real-valued random variable, the Lévy concentration function of a random vector UU taking values in n\mathbb{R}^{n} can be defined as

(U,ρ)=supun[Uu2ρ],ρ>0.\displaystyle{\mathcal{L}}(U,\rho)=\sup_{u\in\mathbb{R}^{n}}{\mathbb{P}}\left[\|U-u\|_{2}\leq\rho\right],\qquad\rho>0.

Consider the case where each node observes an independent real-valued random variable WvW_{v}, and the observations form a |𝒱|×1|{\mathcal{V}}|\times 1 vector W𝒱W_{\mathcal{V}}. Suppose the nodes wish to compute a linear transform of W𝒱W_{\mathcal{V}},

Z=AW𝒱\displaystyle Z=AW_{\mathcal{V}} (47)

with some fixed n×|𝒱|n\times|{\mathcal{V}}| matrix AA, subject to the Euclidean-norm distortion criterion d(z,z^)=zz^2d(z,{\widehat{z}})=\|z-{\widehat{z}}\|_{2}. In this case

L(w𝒮,ε)\displaystyle L(w_{\mathcal{S}},\varepsilon) =supzn[AW𝒱z2ε|W𝒮=w𝒮]\displaystyle=\sup_{z\in\mathbb{R}^{n}}{\mathbb{P}}[\|AW_{\mathcal{V}}-z\|_{2}\leq\varepsilon|W_{\mathcal{S}}=w_{\mathcal{S}}]
=supzn[A𝒮cW𝒮c+A𝒮w𝒮z2ε]\displaystyle=\sup_{z\in\mathbb{R}^{n}}{\mathbb{P}}[\|A_{{\mathcal{S}}^{c}}W_{{\mathcal{S}}^{c}}+A_{{\mathcal{S}}}w_{\mathcal{S}}-z\|_{2}\leq\varepsilon]
=supzn[A𝒮cW𝒮cz2ε]\displaystyle=\sup_{z\in\mathbb{R}^{n}}{\mathbb{P}}[\|A_{{\mathcal{S}}^{c}}W_{{\mathcal{S}}^{c}}-z\|_{2}\leq\varepsilon]
=(A𝒮cW𝒮c,ε)\displaystyle={\mathcal{L}}(A_{{\mathcal{S}}^{c}}W_{{\mathcal{S}}^{c}},\varepsilon)

where A𝒮cA_{{\mathcal{S}}^{c}} is the submatrix formed by the columns of AA with indices in 𝒮c{\mathcal{S}}^{c}. We will need the following result, due to Rudelson and Vershynin [38]. Let sj(A𝒮c)s_{j}(A_{{\mathcal{S}}^{c}}), j=1,,min{n,|𝒮c|}j=1,\ldots,\min\{n,|{\mathcal{S}}^{c}|\}, denote the singular values of A𝒮cA_{{\mathcal{S}}^{c}} arranged in non-increasing order, and define the stable rank of A𝒮cA_{{\mathcal{S}}^{c}} by

r(A𝒮c)=A𝒮cHS2A𝒮c2\displaystyle r(A_{{\mathcal{S}}^{c}})=\Bigg{\lfloor}\frac{\|A_{{\mathcal{S}}^{c}}\|_{\rm HS}^{2}}{\|A_{{\mathcal{S}}^{c}}\|^{2}}\Bigg{\rfloor}

where A𝒮cHS=(j=1min{n,|𝒮c|}sj(A𝒮c)2)1/2\|A_{{\mathcal{S}}^{c}}\|_{\rm HS}=\big{(}\sum_{j=1}^{\min\{n,|{\mathcal{S}}^{c}|\}}s_{j}(A_{{\mathcal{S}}^{c}})^{2}\big{)}^{1/2} is the Hilbert-Schmidt norm of A𝒮cA_{{\mathcal{S}}^{c}}, and A𝒮c=s1(A𝒮c)\|A_{{\mathcal{S}}^{c}}\|=s_{1}(A_{{\mathcal{S}}^{c}}) is the spectral norm of A𝒮cA_{{\mathcal{S}}^{c}}. (Note that for any nonzero matrix A𝒮cA_{{\mathcal{S}}^{c}}, 1r(A𝒮c)rank(A𝒮c)1\leq r(A_{{\mathcal{S}}^{c}})\leq{\rm rank}(A_{{\mathcal{S}}^{c}}).) Then, provided

(Wv,ε/A𝒮cHS)p\displaystyle{\mathcal{L}}(W_{v},\varepsilon/\|A_{{\mathcal{S}}^{c}}\|_{\rm HS})\leq p

for all v𝒮cv\in{\mathcal{S}}^{c}, we will have

(A𝒮cW𝒮c,ε)(cp)0.9r(A𝒮c)\displaystyle{\mathcal{L}}(A_{{\mathcal{S}}^{c}}W_{{\mathcal{S}}^{c}},\varepsilon)\leq(cp)^{0.9r(A_{{\mathcal{S}}^{c}})}

where cc is an absolute constant [38, Theorem 1.4]. This result relates the Lévy concentration function of a linear transform of a random vector to the Lévy concentration functions of its individual coordinates. Combining it with Theorem 1, we get a lower bound on T(ε,δ)T(\varepsilon,\delta) for computing linear vector-valued functions.
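The stable rank and the resulting small ball bound are straightforward to compute numerically. The sketch below (an aside) does this for a random matrix, treating the absolute constant c and the per-coordinate level p as given inputs; the numerical values of c and p are placeholders, since [38] does not specify the constant.

import numpy as np

def stable_rank(A):
    # r(A) = floor(||A||_HS^2 / ||A||^2): squared Hilbert-Schmidt norm over squared spectral norm
    s = np.linalg.svd(A, compute_uv=False)
    return int(np.floor(np.sum(s**2) / s[0]**2))

rng = np.random.default_rng(1)
A_Sc = rng.standard_normal((5, 12))   # stand-in for the submatrix A_{S^c}
r = stable_rank(A_Sc)                 # always between 1 and rank(A_Sc)
c, p = 1.0, 0.1                       # placeholder values for the absolute constant c and the level p
print(r, (c * p)**(0.9 * r))          # stable rank and the bound (c p)^{0.9 r(A_{S^c})}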

Corollary 10.

For the problem of computing a linear transform of the observations defined in (47), where WvW_{v}’s are independent real-valued r.v.s, suppose the rows of AA are nonzero vectors. Then for ε0\varepsilon\geq 0 and δ(0,1/2]\delta\in(0,1/2],

T(ε,δ)\displaystyle T(\varepsilon,\delta)\geq max𝒮𝒱1C𝒮(0.9(1δ)r(A𝒮c)\displaystyle\max_{{\mathcal{S}}\subset{\mathcal{V}}}\frac{1}{C_{\mathcal{S}}}\Bigg{(}0.9(1-\delta)r(A_{{\mathcal{S}}^{c}})
log1cmaxv𝒮c(Wv,ε/A𝒮cHS)h2(δ))\displaystyle\log\frac{1}{c\max_{v\in{\mathcal{S}}^{c}}{\mathcal{L}}(W_{v},\varepsilon/\|A_{{\mathcal{S}}^{c}}\|_{\rm HS})}-h_{2}(\delta)\Bigg{)}

for some absolute constant cc.

IV-C Linear function of discrete observations

Finally, we consider the case where the local observations WvW_{v} have discrete distributions. Specifically, let the WvW_{v}’s be i.i.d. Rademacher random variables, i.e., each WvW_{v} takes values ±1\pm 1 with equal probability. We still use the absolute distortion function d(z,z^)=|zz^|d(z,{\widehat{z}})=|z-{\widehat{z}}| to quantify the estimation error. In this case, the Lévy concentration function (a𝒮,W𝒮,ε){\mathcal{L}}(\langle a_{\mathcal{S}},W_{\mathcal{S}}\rangle,\varepsilon) is highly sensitive to the direction of the vector a𝒮a_{{\mathcal{S}}}, not just to its norm. For example, consider the extreme case where av=|𝒱|a_{v}=|{\mathcal{V}}| for a single node v𝒮v\in{\mathcal{S}} and all other coefficients are zero. Then (a𝒮,W𝒮,0)=(|𝒱|Wv,0)=1/2{\mathcal{L}}(\langle a_{\mathcal{S}},W_{\mathcal{S}}\rangle,0)={\mathcal{L}}(|{\mathcal{V}}|W_{v},0)=1/2. On the other hand, if av=1a_{v}=1 for all v𝒱v\in{\mathcal{V}} and |𝒮||{\mathcal{S}}| is even, then

(a𝒮,W𝒮,0)=2|𝒮|(|𝒮||𝒮|/2)2π|𝒮|as |𝒮|\displaystyle{\mathcal{L}}(\langle a_{{\mathcal{S}}},W_{{\mathcal{S}}}\rangle,0)=2^{-|{\mathcal{S}}|}{|{\mathcal{S}}|\choose|{\mathcal{S}}|/2}\sim\sqrt{\frac{2}{\pi|{\mathcal{S}}|}}\quad\text{as $|{\mathcal{S}}|\rightarrow\infty$}

where the last step is due to Stirling’s approximation. Moreover, a celebrated result due to Littlewood and Offord, improved later by Erdős [39], says that, if |av|1|a_{v}|\geq 1 for all vv, then

(a𝒮,W𝒮,1)2|𝒮|(|𝒮||𝒮|/2)2π|𝒮|as |𝒮|,\displaystyle{\mathcal{L}}(\langle a_{{\mathcal{S}}},W_{{\mathcal{S}}}\rangle,1)\leq 2^{-|{\mathcal{S}}|}{|{\mathcal{S}}|\choose\lfloor|{\mathcal{S}}|/2\rfloor}\sim\sqrt{\frac{2}{\pi|{\mathcal{S}}|}}\quad\text{as $|{\mathcal{S}}|\rightarrow\infty$,}

which translates into a lower bound on the (1,δ)(1,\delta)-computation time that is of the same order as the lower bound on the zero-error computation time.
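The central binomial asymptotics invoked above are easy to verify numerically; the short sketch below (an aside) compares 2^{-k} C(k, k/2) with the Stirling approximation sqrt(2/(pi k)) for a few even values of k.

from math import comb, pi, sqrt

for k in (10, 100, 1000):
    exact = comb(k, k // 2) / 2**k     # 2^{-k} * binomial(k, k/2)
    approx = sqrt(2.0 / (pi * k))      # Stirling approximation
    print(k, exact, approx)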

Corollary 11.

For the problem of computing the linear function in (43), where the WvW_{v}’s are independent Rademacher random variables, suppose that |av|1|a_{v}|\geq 1 for all vv, and δ<1/2\delta<{1}/{2}. Then

T(0,δ)T(1,δ)max𝒮𝒱1C𝒮(1δ2logπ|𝒱𝒮|2h2(δ))\displaystyle T(0,\delta)\geq T(1,\delta)\gtrsim\max_{{\mathcal{S}}\subset{\mathcal{V}}}\frac{1}{C_{\mathcal{S}}}\left(\frac{1-\delta}{2}\log\frac{\pi|{\mathcal{V}}\setminus{\mathcal{S}}|}{2}-h_{2}(\delta)\right)

as |𝒮||{\mathcal{S}}|\rightarrow\infty.

IV-D Comparison with existing results

We illustrate the utility of the above bounds through comparison with some existing results. For example, Ayaso et al. [1] derive lower bounds on a related quantity

T~(ε,δ)inf{T\displaystyle\tilde{T}(\varepsilon,\delta)\triangleq\inf\Big{\{}T : a T-step algorithm 𝒜 such that\displaystyle\in{\mathbb{N}}:\exists\text{ a $T$-step algorithm }{\mathcal{A}}\text{ such that }
maxv𝒱[Z^v[(1ε)Z,(1+ε)Z]]<δ}.\displaystyle\max_{v\in{\mathcal{V}}}{\mathbb{P}}\big{[}{\widehat{Z}}_{v}\notin\left[(1-\varepsilon)Z,(1+\varepsilon)Z\right]\big{]}<\delta\Big{\}}.

One of their results is as follows: if Z=f(W)Z=f(W) is a linear function of the form (43) and (Wv)i.i.d.Uniform([1,1+B])\left(W_{v}\right)\overset{{\rm i.i.d.}}{\sim}\text{Uniform}([1,1+B]) for some B>0B>0, then

T~(ε,δ)max𝒮𝒱|𝒮|2C𝒮log1Bε2+κδ+(1/B)2/|𝒱|\displaystyle\tilde{T}(\varepsilon,\delta)\geq\max_{{\mathcal{S}}\subset{\mathcal{V}}}\frac{|{\mathcal{S}}|}{2C_{\mathcal{S}}}\log\frac{1}{B\varepsilon^{2}+\kappa\delta+(1/B)^{2/|{\mathcal{V}}|}} (48)

for all sufficiently small ε,δ>0\varepsilon,\delta>0, where κ>0\kappa>0 is a fixed constant [1, Theorem III.5]. Let us compare (48) with what we can obtain using our techniques. It is not hard to show that

T~(ε,δ)T(a1(1+B)ε,δ)\displaystyle\tilde{T}(\varepsilon,\delta)\geq T\big{(}\|a\|_{1}(1+B)\varepsilon,\delta\big{)} (49)

where a1=v𝒱|av|\|a\|_{1}=\sum_{v\in{\mathcal{V}}}|a_{v}| is the 1\ell_{1} norm of aa. Moreover, since any r.v. uniformly distributed on a bounded interval of the real line has a log-concave distribution, we can use Corollary 8 to lower-bound the right-hand side of (49). This gives

T~(ε,δ)max𝒮𝒱1C𝒮(1δ2logB2a𝒮c2248(B+1)2a12ε2h2(δ))\displaystyle\tilde{T}(\varepsilon,\delta)\geq\max_{{\mathcal{S}}\subset{\mathcal{V}}}\frac{1}{C_{\mathcal{S}}}\left(\frac{1-\delta}{2}\log\frac{B^{2}\|a_{{\mathcal{S}}^{c}}\|^{2}_{2}}{48(B+1)^{2}\|a\|^{2}_{1}\varepsilon^{2}}-h_{2}(\delta)\right) (50)

for all sufficiently small ε,δ>0\varepsilon,\delta>0. We immediately see that this bound is tighter than the one in (48). In particular, the right-hand side of (48) remains bounded for vanishingly small ε\varepsilon and δ\delta, and in the limit of ε,δ0\varepsilon,\delta\to 0 tends to

max𝒮𝒱|𝒮|C𝒮logB|𝒱|logBmin𝒮𝒱C𝒮.\displaystyle\max_{{\mathcal{S}}\subset{\mathcal{V}}}\frac{|{\mathcal{S}}|}{C_{\mathcal{S}}}\frac{\log B}{|{\mathcal{V}}|}\leq\frac{\log B}{\min_{{\mathcal{S}}\subset{\mathcal{V}}}C_{\mathcal{S}}}.

By contrast, as ε,δ0\varepsilon,\delta\to 0, the right-hand side of (50) grows without bound as log(1/ε)\log(1/\varepsilon).
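The contrast between (48) and (50) can also be seen numerically. The sketch below evaluates both right-hand sides for a single cutset with |S| = |S^c| = |V|/2, all a_v = 1, δ = ε, and logarithms to base 2; the values of |V|, B, κ, and C_S are placeholder choices of ours, not taken from [1]. As ε shrinks, the (48) column saturates near |S| log_2(B)/(|V| C_S), while the (50) column keeps growing like log(1/ε).

import numpy as np

V, B, kappa, C_S = 20, 4.0, 1.0, 1.0
S, Sc = V // 2, V // 2                 # |S| and |S^c| for the chosen cutset
a_Sc_sq, a_1 = float(Sc), float(V)     # ||a_{S^c}||_2^2 and ||a||_1 when all a_v = 1
for eps in (1e-3, 1e-4, 1e-5, 1e-6):
    delta = eps
    rhs48 = S / (2.0 * C_S) * np.log2(1.0 / (B * eps**2 + kappa * delta + (1.0 / B)**(2.0 / V)))
    h2 = -delta * np.log2(delta) - (1 - delta) * np.log2(1 - delta)
    rhs50 = ((1 - delta) / 2.0 * np.log2(B**2 * a_Sc_sq / (48.0 * (B + 1)**2 * a_1**2 * eps**2)) - h2) / C_S
    print(f"eps = {eps:.0e}:  (48) -> {rhs48:6.2f}   (50) -> {rhs50:6.2f}")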

Another lower bound on the (ε,δ)(\varepsilon,\delta)-computation time T(ε,δ)T(\varepsilon,\delta) was obtained by Como and Dahleh [2]. Their starting point is the following continuum generalization of Fano’s inequality [2, Lemma 2] in terms of conditional differential entropy: if Z,Z^Z,{\widehat{Z}} are two jointly distributed real-valued r.v.’s, such that 𝔼Z2<\mathbb{E}Z^{2}<\infty, then, for any ε>0\varepsilon>0,

h(Z|Z^)[|ZZ^|ε]logε+12log(16πe𝔼Z2).\displaystyle h(Z|{\widehat{Z}})\leq{\mathbb{P}}\big{[}|Z-{\widehat{Z}}|\leq\varepsilon\big{]}\log\varepsilon+\frac{1}{2}\log\big{(}16\pi e\mathbb{E}Z^{2}\big{)}. (51)

If we use (51) instead of Lemma 1 to lower-bound I(Z;Z^v|W𝒮)I(Z;{\widehat{Z}}_{v}|W_{\mathcal{S}}), then we get

T(ε,δ)max𝒮𝒱1C𝒮(\displaystyle T(\varepsilon,\delta)\geq\max_{{\mathcal{S}}\subset{\mathcal{V}}}\frac{1}{C_{\mathcal{S}}}\Bigg{(} 1δ2log1ε2+h(Z|W𝒮)\displaystyle\frac{1-\delta}{2}\log\frac{1}{\varepsilon^{2}}+h(Z|W_{\mathcal{S}})
12log(16πe𝔼Z2)).\displaystyle-\frac{1}{2}\log\big{(}16\pi e\mathbb{E}Z^{2}\big{)}\Bigg{)}. (52)

Again, let us consider the case where Z=f(W)Z=f(W) is a linear function of the form (43) with all ava_{v} nonzero and with (Wv)i.i.d.N(0,1)(W_{v})\overset{{\rm i.i.d.}}{\sim}N(0,1). Then (52) becomes

T(ε,δ)max𝒮𝒱1C𝒮(1δ2log1ε2+12loga𝒮c228a22).\displaystyle T(\varepsilon,\delta)\geq\max_{{\mathcal{S}}\subset{\mathcal{V}}}\frac{1}{C_{\mathcal{S}}}\left(\frac{1-\delta}{2}\log\frac{1}{\varepsilon^{2}}+\frac{1}{2}\log\frac{\|a_{{\mathcal{S}}^{c}}\|^{2}_{2}}{8\|a\|^{2}_{2}}\right). (53)

The lower bound of our Corollary 7 will be tighter than (53) for all ε>0\varepsilon>0 as long as

1δ2logπa𝒮c222h2(δ)12loga𝒮c228a22,𝒮𝒱.\displaystyle\frac{1-\delta}{2}\log\frac{\pi\|a_{{\mathcal{S}}^{c}}\|^{2}_{2}}{2}-h_{2}(\delta)\geq\frac{1}{2}\log\frac{\|a_{{\mathcal{S}}^{c}}\|^{2}_{2}}{8\|a\|^{2}_{2}},\quad\forall{\mathcal{S}}\subset{\mathcal{V}}.

Note that the quantity on the right-hand side is nonpositive. More generally, for observations with log-concave distributions, the result of Lemma 1 can be weakened to get a lower bound involving the conditional differential entropy h(Z|W𝒮)h(Z|W_{\mathcal{S}}), which is tighter than similar results obtained in [2].

Corollary 12.

If the observations WvW_{v}, v𝒱v\in{\mathcal{V}}, have log-concave distributions, then for computing the sum Z=v𝒱WvZ=\sum_{v\in{\mathcal{V}}}W_{v} subject to the absolute error criterion d(z,z^)=|zz^|d(z,{\widehat{z}})=|z-{\widehat{z}}|, for ε0\varepsilon\geq 0 and δ(0,1/2]\delta\in(0,1/2],

T(ε,δ)max𝒮𝒱1C𝒮((1δ)(h(Z|W𝒮)+log12eε)h2(δ)).\displaystyle T(\varepsilon,\delta)\geq\max_{{\mathcal{S}}\subset{\mathcal{V}}}\frac{1}{C_{\mathcal{S}}}\left(\!(1-\delta)\!\left(\!h(Z|W_{\mathcal{S}})+\log\frac{1}{2e\varepsilon}\!\right)-h_{2}(\delta)\!\right).
Proof:

Let p𝒮(z)p_{\mathcal{S}}(z) denote the probability density of v𝒮cWv\sum_{v\in{\mathcal{S}}^{c}}W_{v}. Then from (45),

L(w𝒮,ε)=supzzεz+εp𝒮(z′)dz′2εp𝒮\displaystyle L(w_{\mathcal{S}},\varepsilon)=\sup_{z\in\mathbb{R}}\int^{z+\varepsilon}_{z-\varepsilon}p_{\mathcal{S}}(z^{\prime}){\rm d}z^{\prime}\leq 2\varepsilon\|p_{\mathcal{S}}\|_{\infty} (54)

for all w𝒮v𝒮𝖶vw_{\mathcal{S}}\in\prod_{v\in{\mathcal{S}}}{\mathsf{W}}_{v}, where p𝒮\|p_{\mathcal{S}}\|_{\infty} is the sup norm of p𝒮p_{\mathcal{S}}. By a result of Bobkov and Madiman [40, Proposition I.2], if UU is a real-valued r.v. with a log-concave density pp, then the differential entropy h(U)h(U) is upper-bounded by loge+logp1\log e+\log\|p\|^{-1}_{\infty}. Using this fact together with (54), the log-concavity of p𝒮p_{\mathcal{S}}, and the fact that the WvW_{v}’s are mutually independent, we can write

log1𝔼[L(W𝒮,ε)]\displaystyle\log\frac{1}{\mathbb{E}[L(W_{\mathcal{S}},\varepsilon)]} log12ε+log1p𝒮\displaystyle\geq\log\frac{1}{2\varepsilon}+\log\frac{1}{\|p_{\mathcal{S}}\|_{\infty}}
log12eε+h(v𝒮cWv)\displaystyle\geq\log\frac{1}{2e\varepsilon}+h\Big{(}\sum_{v\in{\mathcal{S}}^{c}}W_{v}\Big{)}
=log12eε+h(Z|W𝒮).\displaystyle=\log\frac{1}{2e\varepsilon}+h(Z|W_{\mathcal{S}}).

Using this estimate in Theorem 1, we get the desired lower bound on T(ε,δ)T(\varepsilon,\delta). ∎
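The Bobkov–Madiman bound h(U) <= log e + log(1/||p||_inf) used in this proof can be sanity-checked on two standard log-concave densities (an aside; entropies are in nats here):

import numpy as np

sigma = 2.0                                              # N(0, sigma^2)
h_gauss = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
bound_gauss = 1.0 + np.log(np.sqrt(2 * np.pi) * sigma)   # ||p||_inf = 1/(sqrt(2*pi)*sigma)

lam = 0.5                                                # Exponential(lam): the bound holds with equality
h_exp = 1.0 - np.log(lam)
bound_exp = 1.0 + np.log(1.0 / lam)                      # ||p||_inf = lam

print(h_gauss, bound_gauss)   # approximately 2.11 <= 2.61
print(h_exp, bound_exp)       # approximately 1.69 == 1.69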

V Comparison with upper bounds on computation time

For the two-node mod-22 sum problem in Example 1, we have shown in Corollary 1 that the lower bound on computation time given by Theorem 2 can tightly match the upper bound. In this section, we provide two more examples in which our lower bounds on computation time are tight. In the first example, our lower bound precisely captures the dependence of the computation time on the number of nodes in the network. In the second example, our lower bound tightly captures the dependence of the computation time on the accuracy parameter ε\varepsilon.

V-A Rademacher sum over a dumbbell network

Example 3.

Consider a dumbbell network of bidirectional BSCs with the same crossover probability. Formally, suppose |𝒱||{\mathcal{V}}| is even, and let the nodes be indexed from 11 to |𝒱||{\mathcal{V}}|. Nodes 11 to |𝒱|/2|{\mathcal{V}}|/2 form a clique (i.e., each pair of nodes is connected by a pair of BSCs), while nodes |𝒱|/2+1|{\mathcal{V}}|/2+1 to |𝒱||{\mathcal{V}}| form another clique. The two cliques are connected by a pair of BSCs between nodes |𝒱|/2|{\mathcal{V}}|/2 and |𝒱|/2+1|{\mathcal{V}}|/2+1. Each node initially observes a Bern(12){\rm Bern}(\frac{1}{2}) (or Rademacher) r.v. The goal is for the nodes to compute the sum of the observations of all nodes. The distortion function is d(z,z^)=|zz^|d(z,{\widehat{z}})=|z-{\widehat{z}}|.

By choosing the cutset as the pair of BSCs that joins the two cliques, our lower bound for random Rademacher sums in Corollary 11 gives the following lower bound on computation time.

Corollary 13.

For the problem in Example 3, for δ(0,1/2)\delta\in(0,1/2),

T(0,δ)1C(1δ2logπ|𝒱|4h2(δ))as |𝒱|,\displaystyle T(0,\delta)\gtrsim\frac{1}{C}\left(\frac{1-\delta}{2}\log\frac{\pi|{\mathcal{V}}|}{4}-h_{2}(\delta)\right)\quad\text{as $|{\mathcal{V}}|\rightarrow\infty$},

which implies

T(0,δ)=Ω(log|𝒱|).\displaystyle T(0,\delta)=\Omega\left(\log|{\mathcal{V}}|\right).

Now we show that the above lower bound matches the upper bound on the computation time, which turns out to be

T(0,δ)=O(log|𝒱|).T(0,\delta)=O\left(\log|{\mathcal{V}}|\right).

As shown by Gallager [11], for a fixed success probability, nodes |𝒱|/2|{\mathcal{V}}|/2 and |𝒱|/2+1|{\mathcal{V}}|/2+1 can learn the partial sum of the observations in their respective cliques in O(loglog|𝒱|)O\big{(}\log\log|{\mathcal{V}}|\big{)} steps. These two nodes then exchange their partial sum estimates using binary block codes. Each partial sum can take |𝒱|/2+1|{\mathcal{V}}|/2+1 values, and can be encoded losslessly with log(|𝒱|/2+1)\log(|{\mathcal{V}}|/2+1) bits. The blocklength needed for transmission of the encoded partial sums is thus O(log(|𝒱|/2+1))O\big{(}\log(|{\mathcal{V}}|/2+1)\big{)}, where the hidden factor depends on the required success probability and the channel crossover probability, but not on |𝒱||{\mathcal{V}}|. Having learned the partial sum of the other clique, nodes |𝒱|/2|{\mathcal{V}}|/2 and |𝒱|/2+1|{\mathcal{V}}|/2+1 then broadcast this partial sum to the other nodes in their own cliques. This takes another O(log(|𝒱|/2+1))O\big{(}\log(|{\mathcal{V}}|/2+1)\big{)} steps. In total, all nodes can learn the sum of all observations in O(loglog|𝒱|)+2O(log(|𝒱|/2+1))=O(log|𝒱|)O\big{(}\log\log|{\mathcal{V}}|\big{)}+2O\big{(}\log(|{\mathcal{V}}|/2+1)\big{)}=O(\log|{\mathcal{V}}|) steps, for any prescribed success probability. This shows that T(0,δ)=O(log|𝒱|)T(0,\delta)=O\left(\log|{\mathcal{V}}|\right).

V-B Distributed averaging over discrete noisy channels

Example 4.

Consider a network where the nodes are connected by binary erasure channels with the same erasure probability. Each node initially observes a log-concave r.v. The goal is for the nodes to compute the average of the observations of all nodes.

For this example, Carli et al. [14] define the computation time as

T~(ε)\displaystyle\tilde{T}(\varepsilon) inf{T:1|𝒱|v𝒱𝔼[(ZZ^v(t))2]ε,tT}\displaystyle\triangleq\inf\Big{\{}T\in{\mathbb{N}}:\frac{1}{|{\mathcal{V}}|}\sum_{v\in{\mathcal{V}}}\mathbb{E}\big{[}(Z-{\widehat{Z}}_{v}(t))^{2}\big{]}\leq\varepsilon,\,\forall t\geq T\Big{\}}

and show that

T~(ε)c1+c2log3ε1log2ρ1\displaystyle\tilde{T}(\varepsilon)\leq c_{1}+c_{2}\frac{\log^{3}\varepsilon^{-1}}{\log^{2}\rho^{-1}} (55)

where ρ\rho is the second largest singular value of the consensus matrix adapted to the network, and c1c_{1} and c2c_{2} are positive constants depending only on the channel erasure probability. It can be shown that the above upper bound still holds (with different constants) when the channels are BSCs.

We use Corollary 12 to derive the following lower bound on T~(ε)\tilde{T}(\varepsilon).

Corollary 14.

For the problem in Example 4,

T~(ε)max𝒮𝒱12C𝒮(h(Z|W𝒮)+log14e|𝒱|+12log1ε2).\displaystyle\tilde{T}(\varepsilon)\geq\max_{{\mathcal{S}}\subset{\mathcal{V}}}\frac{1}{2C_{\mathcal{S}}}\left(\!h(Z|W_{\mathcal{S}})+\log\frac{1}{4e|{\mathcal{V}}|}+\frac{1}{2}\log\frac{1}{\varepsilon}-2\!\right). (56)
Proof:

Using Jensen’s inequality twice, we can write

1|𝒱|v𝒱𝔼[(ZZ^v(T))2]\displaystyle\frac{1}{|{\mathcal{V}}|}\sum_{v\in{\mathcal{V}}}\mathbb{E}\big{[}(Z-{\widehat{Z}}_{v}(T))^{2}\big{]} 1|𝒱|v𝒱(𝔼|ZZ^v(T)|)2\displaystyle\geq\frac{1}{|{\mathcal{V}}|}\sum_{v\in{\mathcal{V}}}\big{(}\mathbb{E}|Z-{\widehat{Z}}_{v}(T)|\big{)}^{2}
(1|𝒱|v𝒱𝔼|ZZ^v(T)|)2.\displaystyle\geq\left(\frac{1}{|{\mathcal{V}}|}\sum_{v\in{\mathcal{V}}}\mathbb{E}|Z-{\widehat{Z}}_{v}(T)|\right)^{2}.

Therefore, |𝒱|1v𝒱𝔼[(ZZ^v(T))2]ε|{\mathcal{V}}|^{-1}\sum_{v\in{\mathcal{V}}}\mathbb{E}\big{[}(Z-{\widehat{Z}}_{v}(T))^{2}\big{]}\leq\varepsilon implies that 𝔼|ZZ^v(T)||𝒱|ε\mathbb{E}|Z-{\widehat{Z}}_{v}(T)|\leq|{\mathcal{V}}|\sqrt{\varepsilon} for all v𝒱v\in{\mathcal{V}}, and

[|ZZ^v(T)||𝒱|εδ]δ,v𝒱,δ(0,1/2]{\mathbb{P}}\left[|Z-{\widehat{Z}}_{v}(T)|\geq\frac{|{\mathcal{V}}|\sqrt{\varepsilon}}{\delta}\right]\leq\delta,\qquad\forall v\in{\mathcal{V}},\delta\in(0,1/2]

by Markov’s inequality. Then by Corollary 12,

T~(ε)T(|𝒱|εδ,δ)\displaystyle\tilde{T}(\varepsilon)\geq T\left(\frac{|{\mathcal{V}}|\sqrt{\varepsilon}}{\delta},\delta\right)
max𝒮𝒱1C𝒮((1δ)(h(Z|W𝒮)+logδ2e|𝒱|ε)h2(δ)).\displaystyle\geq\max_{{\mathcal{S}}\subset{\mathcal{V}}}\frac{1}{C_{\mathcal{S}}}\left((1-\delta)\left(h(Z|W_{\mathcal{S}})+\log\frac{\delta}{2e|{\mathcal{V}}|\sqrt{\varepsilon}}\right)-h_{2}(\delta)\right).

Choosing δ=1/2\delta=1/2, we obtain (56). ∎
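As a consistency check on the last step (δ = 1/2, with logarithms in bits so that h_2(1/2) = 1), the two expressions below should coincide for any choice of h(Z|W_S), |V|, ε, and C_S; the numerical values are placeholders.

import numpy as np

def cor12_with_markov(h, V, eps, delta, C_S):
    # Corollary 12 evaluated at accuracy |V|*sqrt(eps)/delta (logs in bits)
    h2 = -delta * np.log2(delta) - (1 - delta) * np.log2(1 - delta)
    return ((1 - delta) * (h + np.log2(delta / (2 * np.e * V * np.sqrt(eps)))) - h2) / C_S

def rhs_56(h, V, eps, C_S):
    return (h + np.log2(1.0 / (4 * np.e * V)) + 0.5 * np.log2(1.0 / eps) - 2.0) / (2 * C_S)

h, V, eps, C_S = 10.0, 25, 1e-6, 0.5
print(cor12_with_markov(h, V, eps, 0.5, C_S), rhs_56(h, V, eps, C_S))  # the two values agree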

The lower bound given by (56) states that T~(ε)\tilde{T}(\varepsilon) is necessarily logarithmic in ε1\varepsilon^{-1}, which tightly matches the poly-logarithmic dependence on ε1\varepsilon^{-1} in the upper bound given by (55). As pointed out in Carli et al. [41], it is possible to prove that a computation time logarithmic in ε1\varepsilon^{-1} is achievable by embedding a quantized consensus algorithm for noiseless networks into the simulation framework developed by Rajagopalan and Schulman for noisy networks in [13].

VI Conclusion and future research directions

We have studied the fundamental time limits of distributed function computation from an information-theoretic perspective. The computation time depends on the amount of information about the function value needed by each node and the rate for the nodes to accumulate such an amount of information. The small ball probability lower bound on conditional mutual information reveals how much information is necessary, while the cutset-capacity upper bound and the SDPI upper bound capture the bottleneck on the rate for the information to be accumulated. The multi-cutset analysis provides a more refined characterization of the information dissipation in a network.

Here are some questions that are worth considering in the future:

  • In the multi-cutset analysis, the purpose of introducing self-loops when reducing the network to a chain is to establish necessary Markov relations for proving upper bounds on I(Z;Z^n|W𝒮)I(Z;{\widehat{Z}}_{n}|W_{{\mathcal{S}}}) in bidirected chains, and the reason for considering left-bound nodes is to improve the lower bounds on computation time. We could have included all channels from 𝒮i{\mathcal{S}}_{i} to 𝒮i{\mathcal{S}}_{i} in the self-loop at node ii^{\prime} in GG^{\prime}, but this would result in looser lower bounds on computation time (cf. the remark after Theorem 3). However, there might be other network reduction methods, e.g., different ways to construct the bidirected chain, that would yield even tighter lower bounds on computation time than our proposed method.

  • In the first step of the derivation of Lemma 2 and Lemma 5, we have upper-bounded I(Z;Z^v|W𝒮)I(Z;{\widehat{Z}}_{v}|W_{{\mathcal{S}}}) using the ordinary data processing inequality as

    I(Z;Z^v|W𝒮)\displaystyle I(Z;{\widehat{Z}}_{v}|W_{{\mathcal{S}}}) I(W𝒮c;Z^v|W𝒮).\displaystyle\leq I(W_{{\mathcal{S}}^{c}};{\widehat{Z}}_{v}|W_{{\mathcal{S}}}).

    One may wonder whether we can tighten this step by a judicious use of SDPIs. The answer is negative. It can be shown that

    I(Z;Z^v|W𝒮)I(W𝒮c;Z^v|W𝒮)supw𝒮v𝒮𝖶vη(W𝒮c|W𝒮=w𝒮,Z|W𝒮c,W𝒮=w𝒮)\displaystyle I(Z;{\widehat{Z}}_{v}|W_{{\mathcal{S}}})\leq I(W_{{\mathcal{S}}^{c}};{\widehat{Z}}_{v}|W_{{\mathcal{S}}})\cdot\sup_{w_{\mathcal{S}}\in\prod_{v\in{\mathcal{S}}}{\mathsf{W}}_{v}}\eta\big{(}{\mathbb{P}}_{W_{{\mathcal{S}}^{c}}|W_{\mathcal{S}}=w_{\mathcal{S}}},{\mathbb{P}}_{Z|W_{{\mathcal{S}}^{c}},W_{\mathcal{S}}=w_{\mathcal{S}}}\big{)}

    where the contraction coefficient depends on the joint distribution of the observations W{\mathbb{P}}_{W} and the function Z=f(W)Z=f(W). However,

    η(W𝒮c|W𝒮=w𝒮,Z|W𝒮c,W𝒮=w𝒮)=1\displaystyle\eta\big{(}{\mathbb{P}}_{W_{{\mathcal{S}}^{c}}|W_{\mathcal{S}}=w_{\mathcal{S}}},{\mathbb{P}}_{Z|W_{{\mathcal{S}}^{c}},W_{\mathcal{S}}=w_{\mathcal{S}}}\big{)}=1

    for both discrete and continuous observations. For discrete observations, this is a consequence of the fact that η(X,Y|X)<1\eta({\mathbb{P}}_{X},{\mathbb{P}}_{Y|X})<1 if and only if the graph {(x,y):X(x)>0,Y|X(y|x)>0}\big{\{}(x,y):{\mathbb{P}}_{X}(x)>0,{\mathbb{P}}_{Y|X}(y|x)>0\big{\}} is connected [26], and the fact that, for any Y|X{\mathbb{P}}_{Y|X} induced by a deterministic function f:𝖷𝖸f:{\mathsf{X}}\rightarrow{\mathsf{Y}}, this graph is always disconnected. This condition can be extended to continuous alphabets [42]. It would be interesting to see whether nonlinear SDPI’s, e.g., of the sort recently introduced by Polyanskiy and Wu [28], can be somehow applied here to tighten the upper bounds.

  • If the function to be computed is the identity mapping, i.e., Z=WZ=W, then the goal of the nodes is to distribute their observations to all other nodes in the network. In this case, our results on the computation time can provide non-asymptotic lower bounds on the blocklength of the codes for the source-channel coding problems in multi-terminal networks. In Example 2, we have considered one such case with discrete observations, and obtained lower bounds in Corollary 2 based on the single cutset analysis. It would be interesting to apply the multi-cutset analysis to the source-channel coding problems in multi-terminal, multi-hop networks.

Acknowledgment

The authors would like to thank the Associate Editor Prof. Chandra Nair and two anonymous referees for numerous constructive suggestions on how to improve the flow and the structure of the paper.

Appendix A Proof of Lemma 7

The goal of this proof is to show that, given any TT-step algorithm 𝒜{\mathcal{A}} running on GG, we can construct a randomized TT-step algorithm 𝒜{\mathcal{A}}^{\prime} running on GG^{\prime} that simulates 𝒜{\mathcal{A}}. Fix any TT-step algorithm 𝒜{\mathcal{A}} that runs on GG. For each tt, we can factor the conditional distribution of the messages Xt(Xv,t)v𝒱X_{t}\triangleq(X_{v,t})_{v\in{\mathcal{V}}} given W,Xt1,Yt1W,X^{t-1},Y^{t-1} as follows:

Xt|W,Xt1,Yt1(xt|w,xt1,yt1)\displaystyle\quad\,\,{\mathbb{P}}_{X_{t}|W,X^{t-1},Y^{t-1}}(x_{t}|w,x^{t-1},y^{t-1})
=v𝒱Xv,t|Wv,Yvt1(xv,t|wv,yvt1)\displaystyle=\prod_{v\in{\mathcal{V}}}{\mathbb{P}}_{X_{v,t}|W_{v},Y^{t-1}_{v}}(x_{v,t}|w_{v},y^{t-1}_{v})
=i=1nv𝒮iXv,t|Wv,Yvt1(xv,t|wv,yvt1)\displaystyle=\prod^{n}_{i=1}\prod_{v\in{\mathcal{S}}_{i}}{\mathbb{P}}_{X_{v,t}|W_{v},Y^{t-1}_{v}}\Big{(}x_{v,t}\Big{|}w_{v},y^{t-1}_{v}\Big{)}
=i=1nX𝒮i,t|W𝒮i,Y𝒮it1(x𝒮i,t|w𝒮i,y𝒮it1).\displaystyle=\prod^{n}_{i=1}{\mathbb{P}}_{X_{{\mathcal{S}}_{i},t}|W_{{\mathcal{S}}_{i}},Y^{t-1}_{{\mathcal{S}}_{i}}}\Big{(}x_{{\mathcal{S}}_{i},t}\Big{|}w_{{\mathcal{S}}_{i}},y^{t-1}_{{\mathcal{S}}_{i}}\Big{)}. (A.1)

Likewise, the conditional distribution of the received messages Yt(Yv,t)v𝒱Y_{t}\triangleq(Y_{v,t})_{v\in{\mathcal{V}}} given W,Xt,Yt1W,X^{t},Y^{t-1} can be factored as

Yt|W,Xt,Yt1(yt|w,xt,yt1)\displaystyle\quad\,\,{\mathbb{P}}_{Y_{t}|W,X^{t},Y^{t-1}}(y_{t}|w,x^{t},y^{t-1})
=eYe,t|Xe,t(ye,t|xe,t)\displaystyle=\prod_{e\in{\mathcal{E}}}{\mathbb{P}}_{Y_{e,t}|X_{e,t}}(y_{e,t}|x_{e,t})
=eKe(ye,t|xe,t)\displaystyle=\prod_{e\in{\mathcal{E}}}K_{e}(y_{e,t}|x_{e,t})
=i=1nu𝒮iv𝒱:(u,v)K(u,v)(y(u,v),t|x(u,v),t).\displaystyle=\prod^{n}_{i=1}\prod_{u\in{\mathcal{S}}_{i}}\prod_{v\in{\mathcal{V}}:\,(u,v)\in{\mathcal{E}}}K_{(u,v)}(y_{(u,v),t}|x_{(u,v),t}). (A.2)

Since the successive partition of GG ensures that nodes in 𝒮i{\mathcal{S}}_{i} can communicate with nodes in 𝒮j{\mathcal{S}}_{j} only if |ij|1|i-j|\leq 1, the messages originating from 𝒮i{\mathcal{S}}_{i} at step tt can be decomposed as

X𝒮i,t\displaystyle X_{{\mathcal{S}}_{i},t} =(X(𝒮i,𝒮i1),t,X(𝒮i,𝒮i+1),t,X(𝒮i,𝒮i),t)\displaystyle=(X_{({\mathcal{S}}_{i},{\mathcal{S}}_{i-1}),t},X_{({\mathcal{S}}_{i},{\mathcal{S}}_{i+1}),t},X_{({\mathcal{S}}_{i},{\mathcal{S}}_{i}),t})
=(X(𝒮i,𝒮i1),t,X(𝒮i,𝒮i+1),t,X(𝒮i,𝒮i),t,X(𝒮i,𝒮i𝒮i),t),\displaystyle=(X_{({\mathcal{S}}_{i},{\mathcal{S}}_{i-1}),t},X_{({\mathcal{S}}_{i},{\mathcal{S}}_{i+1}),t},X\!{\raisebox{-3.0pt}{$\scriptstyle{({\mathcal{S}}_{i},\overset{{}_{\leftarrow}}{\partial}{\mathcal{S}}_{i}),t}$}},X\!{\raisebox{-3.0pt}{$\scriptstyle{({\mathcal{S}}_{i},{\mathcal{S}}_{i}\setminus\overset{{}_{\leftarrow}}{\partial}{\mathcal{S}}_{i}),t}$}}),

and the messages received by nodes in 𝒮i{\mathcal{S}}_{i} at step tt can be decomposed as

Y𝒮i,t\displaystyle Y_{{\mathcal{S}}_{i},t} =(Y(𝒮i1,𝒮i),t,Y(𝒮i+1,𝒮i),t,Y(𝒮i,𝒮i),t)\displaystyle=(Y_{({\mathcal{S}}_{i-1},{\mathcal{S}}_{i}),t},Y_{({\mathcal{S}}_{i+1},{\mathcal{S}}_{i}),t},Y_{({\mathcal{S}}_{i},{\mathcal{S}}_{i}),t})
=(Y(𝒮i1,𝒮i),t,Y(𝒮i+1,𝒮i),t,Y(𝒮i,𝒮i),t,Y(𝒮i,𝒮i𝒮i),t).\displaystyle=(Y_{({\mathcal{S}}_{i-1},{\mathcal{S}}_{i}),t},Y_{({\mathcal{S}}_{i+1},{\mathcal{S}}_{i}),t},Y\!{\raisebox{-3.0pt}{$\scriptstyle{({\mathcal{S}}_{i},\overset{{}_{\leftarrow}}{\partial}{\mathcal{S}}_{i}),t}$}},Y\!{\raisebox{-3.0pt}{$\scriptstyle{({\mathcal{S}}_{i},{\mathcal{S}}_{i}\setminus\overset{{}_{\leftarrow}}{\partial}{\mathcal{S}}_{i}),t}$}}). (A.3)

According to the operation of algorithm 𝒜{\mathcal{A}}, for each (u,v)(u,v)\in{\mathcal{E}} there exists a mapping φ(u,v),t\varphi_{(u,v),t}, such that X(u,v),t=φ(u,v),t(Wu,Yut1)X_{(u,v),t}=\varphi_{(u,v),t}(W_{u},Y_{u}^{t-1}). By the definition of 𝒮i\overset{{}_{\leftarrow}}{\partial}{\mathcal{S}}_{i}, we can write

X(𝒮i,𝒮i1),t\displaystyle X_{({\mathcal{S}}_{i},{\mathcal{S}}_{i-1}),t} =(φ(u,v),t(Wu,Yut1):\displaystyle=\big{(}\varphi_{(u,v),t}(W_{u},Y_{u}^{t-1}):
(u,v),u𝒮i,v𝒮i1).\displaystyle\quad(u,v)\in{\mathcal{E}},u\in\overset{{}_{\leftarrow}}{\partial}{\mathcal{S}}_{i},v\in{\mathcal{S}}_{i-1}\big{)}.

Thus, there exists a mapping φ𝒮i,t\overset{{}_{\leftarrow}}{\varphi}_{{\mathcal{S}}_{i},t}, such that

X(𝒮i,𝒮i1),t\displaystyle X_{({\mathcal{S}}_{i},{\mathcal{S}}_{i-1}),t} =φ𝒮i,t(W𝒮i,Yt1𝒮i)\displaystyle=\overset{{}_{\leftarrow}}{\varphi}_{{\mathcal{S}}_{i},t}(W\thinspace{\raisebox{-4.0pt}{$\scriptstyle{\overset{{}_{\leftarrow}}{\partial}{\mathcal{S}}_{i}}$}},Y^{t-1}\thinspace{\raisebox{-5.0pt}{$\scriptstyle{\overset{{}_{\leftarrow}}{\partial}{\mathcal{S}}_{i}}$}}) (A.4)

where

Y𝒮i,t=(Y(𝒮i1,𝒮i),t,Y(𝒮i+1,𝒮i),t,Y(𝒮i,𝒮i),t).\displaystyle Y\thinspace{\raisebox{-3.0pt}{$\scriptstyle{\overset{{}_{\leftarrow}}{\partial}{\mathcal{S}}_{i},t}$}}=\big{(}Y\thinspace{\raisebox{-3.0pt}{$\scriptstyle{({\mathcal{S}}_{i-1},\overset{{}_{\leftarrow}}{\partial}{\mathcal{S}}_{i}),t}$}},Y\thinspace{\raisebox{-3.0pt}{$\scriptstyle{({\mathcal{S}}_{i+1},\overset{{}_{\leftarrow}}{\partial}{\mathcal{S}}_{i}),t}$}},Y\thinspace{\raisebox{-3.0pt}{$\scriptstyle{({\mathcal{S}}_{i},\overset{{}_{\leftarrow}}{\partial}{\mathcal{S}}_{i}),t}$}}\big{)}. (A.5)

By the same token, there exist mappings φ𝒮i,t\overset{{}_{\rightarrow}}{\varphi}_{{\mathcal{S}}_{i},t}, φ̊𝒮i,t\mathring{\varphi}_{{\mathcal{S}}_{i},t} and φ¯𝒮i,t\bar{\varphi}_{{\mathcal{S}}_{i},t}, such that

X(𝒮i,𝒮i+1),t\displaystyle X_{({\mathcal{S}}_{i},{\mathcal{S}}_{i+1}),t} =φ𝒮i,t(W𝒮i,Y𝒮it1),\displaystyle=\overset{{}_{\rightarrow}}{\varphi}_{{\mathcal{S}}_{i},t}(W_{{\mathcal{S}}_{i}},Y^{t-1}_{{\mathcal{S}}_{i}}), (A.6)
X(𝒮i,𝒮i),t\displaystyle X\!{\raisebox{-3.0pt}{$\scriptstyle{({\mathcal{S}}_{i},\overset{{}_{\leftarrow}}{\partial}{\mathcal{S}}_{i}),t}$}} =φ̊𝒮i,t(W𝒮i,Y𝒮it1),\displaystyle=\mathring{\varphi}_{{\mathcal{S}}_{i},t}(W_{{\mathcal{S}}_{i}},Y^{t-1}_{{\mathcal{S}}_{i}}), (A.7)
X(𝒮i,𝒮i𝒮i),t\displaystyle X\!{\raisebox{-3.0pt}{$\scriptstyle{({\mathcal{S}}_{i},{\mathcal{S}}_{i}\setminus\overset{{}_{\leftarrow}}{\partial}{\mathcal{S}}_{i}),t}$}} =φ¯𝒮i,t(W𝒮i,Y𝒮it1).\displaystyle=\bar{\varphi}_{{\mathcal{S}}_{i},t}(W_{{\mathcal{S}}_{i}},Y^{t-1}_{{\mathcal{S}}_{i}}). (A.8)

Define the random variables

Wi\displaystyle W_{i} W𝒮i,\displaystyle\triangleq W_{{\mathcal{S}}_{i}},
Xi,t\displaystyle X_{i,t} =(X(i,i1),t,X(i,i+1),t,X(i,i),t)\displaystyle=(X_{(i,i-1),t},X_{(i,i+1),t},X_{(i,i),t})
(X(𝒮i,𝒮i1),t,X(𝒮i,𝒮i+1),t,X(𝒮i,𝒮i),t),\displaystyle\triangleq(X_{({\mathcal{S}}_{i},{\mathcal{S}}_{i-1}),t},X_{({\mathcal{S}}_{i},{\mathcal{S}}_{i+1}),t},X\!{\raisebox{-3.0pt}{$\scriptstyle{({\mathcal{S}}_{i},\overset{{}_{\leftarrow}}{\partial}{\mathcal{S}}_{i}),t}$}}),
Yi,t\displaystyle Y_{i,t} =(Y(i1,i),t,Y(i+1,i),t,Y(i,i),t)\displaystyle=(Y_{(i-1,i),t},Y_{(i+1,i),t},Y_{(i,i),t})
(Y(𝒮i1,𝒮i),t,Y(𝒮i+1,𝒮i),t,Y(𝒮i,𝒮i),t),\displaystyle\triangleq(Y_{({\mathcal{S}}_{i-1},{\mathcal{S}}_{i}),t},Y_{({\mathcal{S}}_{i+1},{\mathcal{S}}_{i}),t},Y\!{\raisebox{-3.0pt}{$\scriptstyle{({\mathcal{S}}_{i},\overset{{}_{\leftarrow}}{\partial}{\mathcal{S}}_{i}),t}$}}),
Ui,t\displaystyle U_{i,t} (X(𝒮i,𝒮i𝒮i),t,Y(𝒮i,𝒮i𝒮i),t).\displaystyle\triangleq(X\!{\raisebox{-3.0pt}{$\scriptstyle{({\mathcal{S}}_{i},{\mathcal{S}}_{i}\setminus\overset{{}_{\leftarrow}}{\partial}{\mathcal{S}}_{i}),t}$}},Y\!{\raisebox{-3.0pt}{$\scriptstyle{({\mathcal{S}}_{i},{\mathcal{S}}_{i}\setminus\overset{{}_{\leftarrow}}{\partial}{\mathcal{S}}_{i}),t}$}}).

From the decomposition of Y𝒮i,tY_{{\mathcal{S}}_{i},t} in (A.3), we know that (Yit1,Uit1)(Y_{i}^{t-1},U_{i}^{t-1}) contains Y𝒮it1Y_{{\mathcal{S}}_{i}}^{t-1}; while from the decomposition of Y𝒮i,tY\thinspace{\raisebox{-3.0pt}{$\scriptstyle{\overset{{}_{\leftarrow}}{\partial}{\mathcal{S}}_{i},t}$}} in (A.5), we know that Yit1Y_{i}^{t-1} contains Yt1𝒮iY^{t-1}\thinspace{\raisebox{-5.0pt}{$\scriptstyle{\overset{{}_{\leftarrow}}{\partial}{\mathcal{S}}_{i}}$}}. Therefore, from Eqs. (A.4) and (A.6)-(A.8), we deduce the existence of mappings φi,t\overset{{}_{\leftarrow}}{\varphi}_{i,t}, φi,t\overset{{}_{\rightarrow}}{\varphi}_{i,t}, φ̊i,t\mathring{\varphi}_{i,t}, and φ¯i,t\bar{\varphi}_{i,t}, such that the messages transmitted by nodes in 𝒮i{\mathcal{S}}_{i} at time tt can be generated as

X(i,i1),t\displaystyle X_{(i,i-1),t} =φi,t(Wi,Yit1),\displaystyle=\overset{{}_{\leftarrow}}{\varphi}_{i,t}(W_{i},Y_{i}^{t-1}), (A.9)
X(i,i+1),t\displaystyle X_{(i,i+1),t} =φi,t(Wi,Yit1,Uit1),\displaystyle=\overset{{}_{\rightarrow}}{\varphi}_{i,t}(W_{i},Y_{i}^{t-1},U_{i}^{t-1}), (A.10)
X(i,i),t\displaystyle X_{(i,i),t} =φ̊i,t(Wi,Yit1,Uit1),\displaystyle=\mathring{\varphi}_{i,t}(W_{i},Y_{i}^{t-1},U_{i}^{t-1}), (A.11)
X(𝒮i,𝒮i𝒮i),t\displaystyle X\!{\raisebox{-3.0pt}{$\scriptstyle{({\mathcal{S}}_{i},{\mathcal{S}}_{i}\setminus\overset{{}_{\leftarrow}}{\partial}{\mathcal{S}}_{i}),t}$}} =φ¯i,t(Wi,Yit1,Uit1).\displaystyle=\bar{\varphi}_{i,t}(W_{i},Y^{t-1}_{i},U^{t-1}_{i}). (A.12)

Note that the computation of X(i,i1),tX_{(i,i-1),t} does not involve Uit1U_{i}^{t-1}. Next, the messages received by nodes in 𝒮i{\mathcal{S}}_{i} at step tt are related to the transmitted messages as

X(i1,i),t\displaystyle X_{(i-1,i),t} K(i1,i)Y(i1,i),t,\displaystyle\xrightarrow{K_{(i-1,i)}}Y_{(i-1,i),t},
X(i+1,i),t\displaystyle X_{(i+1,i),t} K(i+1,i)Y(i+1,i),t,\displaystyle\xrightarrow{K_{(i+1,i)}}Y_{(i+1,i),t},
X(i,i),t\displaystyle X_{(i,i),t} K(i,i)Y(i,i),t,\displaystyle\xrightarrow{K_{(i,i)}}Y_{(i,i),t},

where the stochastic transition laws have the same form as those in Eqs. (26) to (28). In addition, since X(𝒮i,𝒮i𝒮i),tX\!{\raisebox{-3.0pt}{$\scriptstyle{({\mathcal{S}}_{i},{\mathcal{S}}_{i}\setminus\overset{{}_{\leftarrow}}{\partial}{\mathcal{S}}_{i}),t}$}} and Y(𝒮i,𝒮i𝒮i),tY\!{\raisebox{-3.0pt}{$\scriptstyle{({\mathcal{S}}_{i},{\mathcal{S}}_{i}\setminus\overset{{}_{\leftarrow}}{\partial}{\mathcal{S}}_{i}),t}$}} are related through the channels from 𝒮i{\mathcal{S}}_{i} to 𝒮i𝒮i{\mathcal{S}}_{i}\setminus\overset{{}_{\leftarrow}}{\partial}{\mathcal{S}}_{i}, there exists a mapping κi,t\kappa_{i,t} such that Y(𝒮i,𝒮i𝒮i),tY\!{\raisebox{-3.0pt}{$\scriptstyle{({\mathcal{S}}_{i},{\mathcal{S}}_{i}\setminus\overset{{}_{\leftarrow}}{\partial}{\mathcal{S}}_{i}),t}$}} can be realized as

Y(𝒮i,𝒮i𝒮i),t\displaystyle Y\!{\raisebox{-3.0pt}{$\scriptstyle{({\mathcal{S}}_{i},{\mathcal{S}}_{i}\setminus\overset{{}_{\leftarrow}}{\partial}{\mathcal{S}}_{i}),t}$}} =κi,t(X(𝒮i,𝒮i𝒮i),t,Ri,t),\displaystyle=\kappa_{i,t}(X\!{\raisebox{-3.0pt}{$\scriptstyle{({\mathcal{S}}_{i},{\mathcal{S}}_{i}\setminus\overset{{}_{\leftarrow}}{\partial}{\mathcal{S}}_{i}),t}$}},R_{i,t}), (A.13)

where Ri,tR_{i,t} can be taken as a random variable uniformly distributed over [0,1][0,1] and independent of everything else. From (A.12) and (A.13), we know that Ui,tU_{i,t} can be realized by a mapping ϑi,t\vartheta_{i,t} as

Ui,t\displaystyle U_{i,t} =ϑi,t(Wi,Yit1,Uit1,Ri,t).\displaystyle=\vartheta_{i,t}(W_{i},Y_{i}^{t-1},U_{i}^{t-1},R_{i,t}). (A.14)

Taking all of this into account, we can rewrite the factorization (A.1) as follows:

Xt|W,Xt1,Yt1(xt|w,xt1,yt1)\displaystyle\quad\,\,{\mathbb{P}}_{X_{t}|W,X^{t-1},Y^{t-1}}(x_{t}|w,x^{t-1},y^{t-1})
=i=1n𝟏{x(i1,i),t=φi,t(wi,yit1)}\displaystyle=\prod^{n}_{i=1}\mathbf{1}\big{\{}x_{(i-1,i),t}=\overset{{}_{\leftarrow}}{\varphi}_{i,t}(w_{i},y_{i}^{t-1})\big{\}}
𝟏{x(i,i+1),t=φi,t(wi,yit1,uit1)}\displaystyle\quad\cdot\mathbf{1}\big{\{}x_{(i,i+1),t}=\overset{{}_{\rightarrow}}{\varphi}_{i,t}(w_{i},y_{i}^{t-1},u_{i}^{t-1})\big{\}}
𝟏{x(i,i),t=φ̊i,t(wi,yit1,uit1)}\displaystyle\quad\cdot\mathbf{1}\big{\{}x_{(i,i),t}=\mathring{\varphi}_{i,t}(w_{i},y_{i}^{t-1},u_{i}^{t-1})\big{\}}
𝟏{x(𝒮i,𝒮i𝒮i),t=φ¯i,t(wi,yit1,uit1)},\displaystyle\quad\cdot\mathbf{1}\big{\{}x\!{\raisebox{-3.0pt}{$\scriptstyle{({\mathcal{S}}_{i},{\mathcal{S}}_{i}\setminus\overset{{}_{\leftarrow}}{\partial}{\mathcal{S}}_{i}),t}$}}=\bar{\varphi}_{i,t}(w_{i},y^{t-1}_{i},u^{t-1}_{i})\big{\}}, (A.15)

and we can rewrite the factorization (A.2) as

Yt|W,Xt,Yt1(yt|w,xt,yt1)\displaystyle\quad\,\,{\mathbb{P}}_{Y_{t}|W,X^{t},Y^{t-1}}(y_{t}|w,x^{t},y^{t-1})
=i=1nK(i1,i)(y(i1,i),t|x(i1,i),t)\displaystyle=\prod^{n}_{i=1}K_{(i-1,i)}(y_{(i-1,i),t}|x_{(i-1,i),t})
K(i+1,i)(y(i+1,i),t|x(i+1,i),t)K(i,i)(y(i,i),t|x(i,i),t)\displaystyle\quad\cdot K_{(i+1,i)}(y_{(i+1,i),t}|x_{(i+1,i),t})\cdot K_{(i,i)}(y_{(i,i),t}|x_{(i,i),t})
(u,v):u𝒮i,v𝒮i𝒮iK(u,v)(y(𝒮i,𝒮i𝒮i),t|x(𝒮i,𝒮i𝒮i),t)\displaystyle\quad\cdot\bigotimes_{(u,v)\in{\mathcal{E}}:u\in{\mathcal{S}}_{i},v\in{\mathcal{S}}_{i}\setminus\overset{{}_{\leftarrow}}{\partial}{\mathcal{S}}_{i}}K_{(u,v)}(y\!{\raisebox{-3.0pt}{$\scriptstyle{({\mathcal{S}}_{i},{\mathcal{S}}_{i}\setminus\overset{{}_{\leftarrow}}{\partial}{\mathcal{S}}_{i}),t}$}}|x\!{\raisebox{-3.0pt}{$\scriptstyle{({\mathcal{S}}_{i},{\mathcal{S}}_{i}\setminus\overset{{}_{\leftarrow}}{\partial}{\mathcal{S}}_{i}),t}$}}) (A.16)

where the channel (u,v):u𝒮i,v𝒮i𝒮iK(u,v)\bigotimes_{(u,v)\in{\mathcal{E}}:u\in{\mathcal{S}}_{i},v\in{\mathcal{S}}_{i}\setminus\overset{{}_{\leftarrow}}{\partial}{\mathcal{S}}_{i}}K_{(u,v)} can be realized by the mapping κi,t\kappa_{i,t} with the r.v. Ri,tR_{i,t}.

To summarize: the mappings defined in (A.9) to (A.11) and (A.14) specify a randomized TT-step algorithm 𝒜{\mathcal{A}}^{\prime} that runs on GG^{\prime} and simulates the TT-step algorithm 𝒜{\mathcal{A}} that runs on GG. Specifically, using these mappings, each node ii^{\prime} in GG^{\prime} can generate all the transmitted and received messages of 𝒮i{\mathcal{S}}_{i} in 𝒜{\mathcal{A}} as (XiT,YiT,UiT)(X_{i^{\prime}}^{T},Y_{i^{\prime}}^{T},U_{i^{\prime}}^{T}). Moreover, from (A.15) and (A.16) we see that the random objects

(W𝒮i,X𝒮iT,Y𝒮iT:i{1,,n})\big{(}W_{{\mathcal{S}}_{i}},X_{{\mathcal{S}}_{i}}^{T},Y_{{\mathcal{S}}_{i}}^{T}:i\in\{1,\ldots,n\}\big{)}

and

(Wi,XiT,YiT,UiT:i{1,,n})\big{(}W_{i^{\prime}},X_{i^{\prime}}^{T},Y_{i^{\prime}}^{T},U_{i^{\prime}}^{T}:i^{\prime}\in\{1,\ldots,n\}\big{)}

have the same joint distribution.

Finally, as we have assumed that 𝒮i\overset{{}_{\leftarrow}}{\partial}{\mathcal{S}}_{i}’s are all nonempty, we can define

Z^iZ^v=ψv(Wv,YvT)\displaystyle{\widehat{Z}}_{i}\triangleq{\widehat{Z}}_{v}=\psi_{v}(W_{v},Y_{v}^{T})

with an arbitrary v𝒮iv\in\overset{{}_{\leftarrow}}{\partial}{\mathcal{S}}_{i}. From the definition of Yi,tY_{i,t} and the fact that YiTY_{i}^{T} contains YvTY_{v}^{T}, it follows that there exists a mapping ψi\psi_{i} such that

Z^i=ψi(Wi,YiT).\displaystyle{\widehat{Z}}_{i}=\psi_{i}(W_{i},Y_{i}^{T}).

Using this mapping, node ii^{\prime} in GG^{\prime} can generate the final estimate of the chosen v𝒮iv\in\overset{{}_{\leftarrow}}{\partial}{\mathcal{S}}_{i} in 𝒜{\mathcal{A}} as Z^i{\widehat{Z}}_{i^{\prime}}, such that (Z,Z^i:i{1,,n})(Z,{\widehat{Z}}_{i}:i\in\{1,\ldots,n\}) and (Z,Z^i:i{1,,n})(Z,{\widehat{Z}}_{i^{\prime}}:i\in\{1,\ldots,n\}) have the same joint distribution. This guarantees that

maxi𝒱[d(Z,Z^i)>ε]\displaystyle\max_{i^{\prime}\in{\mathcal{V}}^{\prime}}{\mathbb{P}}[d(Z,{\widehat{Z}}_{i^{\prime}})>\varepsilon] =maxi{1:n}[d(Z,Z^i)>ε]\displaystyle=\max_{i\in\{1:n\}}{\mathbb{P}}[d(Z,{\widehat{Z}}_{i})>\varepsilon]
maxv𝒱[d(Z,Z^v)>ε]\displaystyle\leq\max_{v\in{\mathcal{V}}}{\mathbb{P}}[d(Z,{\widehat{Z}}_{v})>\varepsilon]
δ.\displaystyle\leq\delta.

The claim that T(ε,δ)T(\varepsilon,\delta) for computing ZZ on GG is lower bounded by T(ε,δ)T^{\prime}(\varepsilon,\delta) for computing ZZ on GG^{\prime} then follows from the definition of T(ε,δ)T^{\prime}(\varepsilon,\delta) in (III-A). This proves Lemma 7.

Appendix B Proof of Lemma 8

Recall that, for any randomized TT-step algorithm 𝒜{\mathcal{A}}^{\prime}, at step t{1,,T}t\in\{1,\ldots,T\}, node i{1,,n}i\in\{1,\ldots,n\} computes the outgoing messages X(i,i1),t=φi,t(Wi,Yit1)X_{(i,i-1),t}=\overset{{}_{\leftarrow}}{\varphi}_{i,t}(W_{i},Y^{t-1}_{i}), X(i,i+1),t=φi,t(Wi,Yit1,Uit1)X_{(i,i+1),t}=\overset{{}_{\rightarrow}}{\varphi}_{i,t}(W_{i},Y^{t-1}_{i},U^{t-1}_{i}), and X(i,i),t=φ̊i,t(Wi,Yit1,Uit1)X_{(i,i),t}=\mathring{\varphi}_{i,t}(W_{i},Y^{t-1}_{i},U^{t-1}_{i}), and the private message Ui,t=ϑi,t(Wi,Yit1,Uit1,Ri,t)U_{i,t}=\vartheta_{i,t}(W_{i},Y_{i}^{t-1},U^{t-1}_{i},R_{i,t}), where Ri,tR_{i,t} is the private randomness of node ii. At step TT, node ii computes Z^i=ψi(Wi,YiT){\widehat{Z}}_{i}=\psi_{i}(W_{i},Y^{T}_{i}). We will use the Bayesian network formed by all the relevant variables and the d-separation criterion [24, Theorem 3.3] to find conditional independences among these variables. To simplify the Bayesian network, we merge some of the variables by defining

U~i,t(X(i,i),t,X(i,i+1),t,Ui,t)\tilde{U}_{i,t}\triangleq(X_{(i,i),t},X_{(i,i+1),t},U_{i,t})

and

Y~i,t(Y(i,i),t,Y(i+1,i),t)\tilde{Y}_{i,t}\triangleq(Y_{(i,i),t},Y_{(i+1,i),t})

for i{1,,n}i\in\{1,\ldots,n\}. The joint distribution of the variables can then be factored as

W,XT,UT,YT(w,xT,uT,yT)\displaystyle\quad\,\,{\mathbb{P}}_{W,X^{T},U^{T},Y^{T}}(w,x^{T},u^{T},y^{T})
=W(w)t=1Ti=1n𝟏{x(i,i1),t=φi,t(wi,yit1)}\displaystyle={\mathbb{P}}_{W}(w)\prod_{t=1}^{T}\prod_{i=1}^{n}\mathbf{1}\big{\{}x_{(i,i-1),t}=\overset{{}_{\leftarrow}}{\varphi}_{i,t}(w_{i},y_{i}^{t-1})\big{\}}
U~i,t|Wi,Yit1,U~it1(u~i,t|wi,yit1,u~it1)\displaystyle\quad\cdot{\mathbb{P}}_{\tilde{U}_{i,t}|W_{i},Y_{i}^{t-1},\tilde{U}_{i}^{t-1}}(\tilde{u}_{i,t}|w_{i},y_{i}^{t-1},\tilde{u}_{i}^{t-1})
i=1nY(i1,i),t|U~i1,t(y(i1,i),t|u~i1,t)\displaystyle\quad\cdot\prod_{i=1}^{n}{\mathbb{P}}_{Y_{(i-1,i),t}|\tilde{U}_{i-1,t}}(y_{(i-1,i),t}|\tilde{u}_{i-1,t})
Y~i,t|U~i,t,X(i+1,i),t(y~i,t|u~i,t,x(i+1,i),t).\displaystyle\quad\cdot{\mathbb{P}}_{\tilde{Y}_{i,t}|\tilde{U}_{i,t},X_{(i+1,i),t}}(\tilde{y}_{i,t}|\tilde{u}_{i,t},x_{(i+1,i),t}). (B.1)

The Bayesian network corresponding to this factorization for n=4n=4 and T=4T=4 is shown in Fig. 10.

If T=0T=0, then Z^n=ψn(Wn){\widehat{Z}}_{n}=\psi_{n}(W_{n}), hence I(Z;Z^n|W2:n)I(Z;Wn|W2:n)=0I(Z;{\widehat{Z}}_{n}|W_{2:n})\leq I(Z;W_{n}|W_{2:n})=0. For T1T\geq 1, we prove the upper bounds in the following steps, where we assume n4n\geq 4. The case n=3n=3 can be proved by skipping Step 2, and the case n=2n=2 can be proved by skipping Step 1 and Step 2.

Step 1:
For any ii and tt, define the shorthand Xi,tX(𝒩i,i),tX_{i\leftarrow,t}\triangleq X_{({\mathcal{N}}_{i\leftarrow},i),t}, where 𝒩i{\mathcal{N}}_{i\leftarrow} is the in-neighborhood of node ii. From the Markov chain W,YnT1Xn,TYn,TW,Y_{n}^{T-1}\rightarrow X_{n\leftarrow,T}\rightarrow Y_{n,T} and Lemma 3, we follow the same argument as the one used for proving Lemma 5 to show that

I(Z;Z^n|W2:n)\displaystyle I(Z;{\widehat{Z}}_{n}|W_{2:n}) I(W1;YnT|W2:n)\displaystyle\leq I(W_{1};Y_{n}^{T}|W_{2:n})
(1ηn)I(W1;YnT1|W2:n)\displaystyle\leq(1-\eta_{n})I(W_{1};Y_{n}^{T-1}|W_{2:n})
+ηnI(W1;YnT1,Xn,T|W2:n).\displaystyle\quad+\eta_{n}I(W_{1};Y_{n}^{T-1},X_{{n\leftarrow},T}|W_{2:n}).

Applying the d-separation criterion to the Bayesian network corresponding to (B.1) (see Fig. 10 for an illustration), we can read off the Markov chain

W1W2:n,Yn1t1Ynt1,U~n1,t,U~n,t\displaystyle{W_{1}\rightarrow W_{2:n},Y_{n-1}^{t-1}\rightarrow Y_{n}^{t-1},\tilde{U}_{n-1,t},\tilde{U}_{n,t}}

for t{1,,T}t\in\{1,\ldots,T\}, since all trails from W1W_{1} to (Ynt1,U~n1,t,U~n,t)(Y_{n}^{t-1},\tilde{U}_{n-1,t},\tilde{U}_{n,t}) are blocked by (W2:n,Y(n2,n1)t1)(W_{2:n},Y_{(n-2,n-1)}^{t-1}), and all trails from (Ynt1,U~n1,t,U~n,t)(Y_{n}^{t-1},\tilde{U}_{n-1,t},\tilde{U}_{n,t}) to W1W_{1} are blocked by (W2:n,Y~n1t1)(W_{2:n},\tilde{Y}_{n-1}^{t-1}). This implies the Markov chain W1W2:n,Yn1T1YnT1,Xn,TW_{1}\rightarrow W_{2:n},Y_{n-1}^{T-1}\rightarrow Y_{n}^{T-1},X_{n\leftarrow,T}, since X(n1,n),TX_{(n-1,n),T} is included in U~n1,T\tilde{U}_{n-1,T} and X(n,n),TX_{(n,n),T} is included in U~n,T\tilde{U}_{n,T}. Consequently (this follows from the ordinary DPI and from the fact that, if XA,BCX\to A,B\to C is a Markov chain, then XBCX\to B\to C is a Markov chain conditioned on A=aA=a),

I(W1;YnT|W2:n)\displaystyle I(W_{1};Y_{n}^{T}|W_{2:n}) (1ηn)I(W1;YnT1|W2:n)\displaystyle\leq(1-\eta_{n})I(W_{1};Y_{n}^{T-1}|W_{2:n})
+ηnI(W1;Yn1T1|W2:n).\displaystyle\quad+\eta_{n}I(W_{1};Y_{n-1}^{T-1}|W_{2:n}). (B.2)

Also note that I(W1;Yn,1|W2:n)I(W1;Xn,1|W2:n)I(W1;W𝒩n|W2:n)=0I(W_{1};Y_{n,1}|W_{2:n})\leq I(W_{1};X_{n\leftarrow,1}|W_{2:n})\leq I(W_{1};W_{{\mathcal{N}}_{n\leftarrow}}|W_{2:n})=0.

Step 2:
For i{1,,n3}i\in\{1,\ldots,n-3\}, from the Markov chain W,YniTi1X(ni),TiYni,TiW,Y_{n-i}^{T-i-1}\rightarrow X_{{{(n-i)}\leftarrow},T-i}\rightarrow Y_{n-i,T-i} and Lemma 3,

I(W1;YniTi|\displaystyle I(W_{1};Y_{n-i}^{T-i}| W2:n)(1ηni)I(W1;YniTi1|W2:n)\displaystyle W_{2:n})\leq(1-\eta_{n-i})I(W_{1};Y_{n-i}^{T-i-1}|W_{2:n})
+ηniI(W1;YniTi1,X(ni),Ti|W2:n)\displaystyle+\eta_{n-i}I(W_{1};Y_{n-i}^{T-i-1},X_{({n-i)}\leftarrow,T-i}|W_{2:n})

From the Bayesian network corresponding to (B.1), we can read off the Markov chain

W1\displaystyle W_{1} W2:n,Yni1t1\displaystyle\rightarrow W_{2:n},Y_{n-i-1}^{t-1}
Ynit1,U~ni1,t,U~ni,t,X(ni+1,ni),t\displaystyle\rightarrow Y_{n-i}^{t-1},\tilde{U}_{n-i-1,t},\tilde{U}_{n-i,t},X_{(n-i+1,n-i),t}

for t{1,,Ti}t\in\{1,\ldots,T-i\}, since all trails from W1W_{1} to

(Ynit1,U~ni1,t,U~ni,t,X(ni+1,ni),t)(Y_{n-i}^{t-1},\tilde{U}_{n-i-1,t},\tilde{U}_{n-i,t},X_{(n-i+1,n-i),t})

are blocked by (W2:n,Y(ni2,ni1)t1)(W_{2:n},Y_{(n-i-2,n-i-1)}^{t-1}), and all trails from

(Ynit1,U~ni1,t,U~ni,t,X(ni+1,ni),t)(Y_{n-i}^{t-1},\tilde{U}_{n-i-1,t},\tilde{U}_{n-i,t},X_{(n-i+1,n-i),t})

to W1W_{1} are blocked by (W2:n,Y~ni1t1)(W_{2:n},\tilde{Y}_{n-i-1}^{t-1}). This implies the Markov chain

W1W2:n,Yni1Ti1YniTi1,X(ni),Ti,W_{1}\rightarrow W_{2:n},Y_{n-i-1}^{T-i-1}\rightarrow Y_{n-i}^{T-i-1},X_{(n-i)\leftarrow,T-i},

since X(ni1,ni),TiX_{(n-i-1,n-i),T-i} is included in U~ni1,Ti\tilde{U}_{n-i-1,T-i} and X(ni,ni),TiX_{(n-i,n-i),T-i} is included in U~ni,Ti\tilde{U}_{n-i,T-i}. Therefore,

I(W1;YniTi|W2:n)\displaystyle I(W_{1};Y_{n-i}^{T-i}|W_{2:n}) (1ηni)I(W1;YniTi1|W2:n)\displaystyle\leq(1-\eta_{n-i})I(W_{1};Y_{n-i}^{T-i-1}|W_{2:n})
+ηniI(W1;Yni1Ti1|W2:n)\displaystyle\quad+\eta_{n-i}I(W_{1};Y_{n-i-1}^{T-i-1}|W_{2:n}) (B.3)

for i{1,,n3}i\in\{1,\ldots,n-3\}. Also note that

I(W1;Yni,1|W2:n)\displaystyle I(W_{1};Y_{n-i,1}|W_{2:n}) I(W1;X(ni),1|W2:n)\displaystyle\leq I(W_{1};X_{(n-i)\leftarrow,1}|W_{2:n})
I(W1;W𝒩(ni)|W2:n)\displaystyle\leq I(W_{1};W_{{\mathcal{N}}_{(n-i)\leftarrow}}|W_{2:n})
=0.\displaystyle=0.

Step 3:
Finally, we upper-bound I(W1;Y2Tn+2|W2:n)I(W_{1};Y_{2}^{T-n+2}|W_{2:n}) for Tn1T\geq n-1. From the Markov chain W,Y2t1X2,tY2,tW,Y_{2}^{t-1}\rightarrow X_{2\leftarrow,t}\rightarrow Y_{2,t} and Lemma 3,

I(W1;Y2Tn+2|W2:n)\displaystyle I(W_{1};Y_{2}^{T-n+2}|W_{2:n}) (1η2)I(W1;Y2Tn+1|W2:n)\displaystyle\leq(1-\eta_{2})I(W_{1};Y_{2}^{T-n+1}|W_{2:n})
+η2H(W1|W2:n).\displaystyle\quad+\eta_{2}H(W_{1}|W_{2:n}). (B.4)

This upper bound is useful only when H(W1|W2:n)H(W_{1}|W_{2:n}) is finite. If the observations are continuous r.v.’s, we can upper bound I(W1;Y2Tn+2|W2:n)I(W_{1};Y_{2}^{T-n+2}|W_{2:n}) in terms of the channel capacity C(1,2)C_{(1,2)}:

I(W1;Y2Tn+2|W2:n)\displaystyle\quad\,\,I(W_{1};Y_{2}^{T-n+2}|W_{2:n})
=t=1Tn+2I(W1;Y2,t|W2:n,Y2t1)\displaystyle=\sum_{t=1}^{T-n+2}I(W_{1};Y_{2,t}|W_{2:n},Y_{2}^{t-1})
=(a)t=1Tn+2(I(W1;Y(1,2),t|W2:n,Y2t1)\displaystyle\overset{\rm(a)}{=}\sum_{t=1}^{T-n+2}\Big{(}I(W_{1};Y_{(1,2),t}|W_{2:n},Y_{2}^{t-1})
+I(W1;Y~2,t|W2:n,Y2t1,Y(1,2),t))\displaystyle\qquad\qquad+I(W_{1};\tilde{Y}_{2,t}|W_{2:n},Y_{2}^{t-1},Y_{(1,2),t})\Big{)}
(b)t=1Tn+2I(X(1,2),t;Y(1,2),t|W2:n,Y2t1)\displaystyle\overset{\rm(b)}{\leq}\sum_{t=1}^{T-n+2}I(X_{(1,2),t};Y_{(1,2),t}|W_{2:n},Y_{2}^{t-1})
(c)t=1Tn+2I(X(1,2),t;Y(1,2),t)\displaystyle\overset{\rm(c)}{\leq}\sum_{t=1}^{T-n+2}I(X_{(1,2),t};Y_{(1,2),t})
C(1,2)(Tn+2),\displaystyle\leq C_{(1,2)}(T-n+2), (B.5)

where we have used the Markov chain W1W2:n,Y2t1,Y(1,2),tY~2,tW_{1}\rightarrow W_{2:n},Y_{2}^{t-1},Y_{(1,2),t}\rightarrow\tilde{Y}_{2,t} for t{1,,Tn+2}t\in\{1,\ldots,T-n+2\}, which follows by applying the d-separation criterion to the Bayesian network corresponding to the factorization in (B.1), so that the second term in (a) is zero; the Markov chain W,Y2t1X(1,2),tY(1,2),tW,Y_{2}^{t-1}\rightarrow X_{(1,2),t}\rightarrow Y_{(1,2),t}, which also implies the Markov chain W1X(1,2),t,W2:n,Y2t1Y(1,2),tW_{1}\rightarrow X_{(1,2),t},W_{2:n},Y_{2}^{t-1}\rightarrow Y_{(1,2),t} by the weak union property of conditional independence, hence (b) and (c); and the fact that I(X(1,2),t;Y(1,2),t)C(1,2)I(X_{(1,2),t};Y_{(1,2),t})\leq C_{(1,2)}.

Step 4:
Define Ii,t=I(W1;Yit|W2:n)I_{i,t}=I(W_{1};Y_{i}^{t}|W_{2:n}) for i2i\geq 2 and t1t\geq 1. From (B.2), (B.3), (B.4), and (B.5), we can write, for n3n\geq 3, Tn1T\geq n-1, and i{0,,n3}i\in\{0,\ldots,n-3\},

Ini,Ti\displaystyle I_{n-i,T-i} η¯niIni,Ti1+ηniIni1,Ti1\displaystyle\leq\bar{\eta}_{n-i}I_{n-i,T-i-1}+\eta_{n-i}I_{n-i-1,T-i-1} (B.6)

where η¯ni=1ηni\bar{\eta}_{n-i}=1-\eta_{n-i}, and Ini,1=0I_{n-i,1}=0. In addition, for Tn1T\geq n-1,

I2,Tn+2\displaystyle I_{2,T-n+2} {η¯2I2,Tn+1+η2H(W1|W2:n)C(1,2)(Tn+2),\displaystyle\leq\begin{cases}\bar{\eta}_{2}I_{2,T-n+1}+\eta_{2}H(W_{1}|W_{2:n})\\ C_{(1,2)}(T-n+2)\end{cases}, (B.7)

and I2,0=0I_{2,0}=0.

An upper bound on I(W1;YnT|W2:n)I(W_{1};Y_{n}^{T}|W_{2:n}) can be obtained by solving this set of recursive inequalities with the specified boundary conditions. It can be checked by induction that I(W1;YnT|W2:n)=0I(W_{1};Y_{n}^{T}|W_{2:n})=0 if Tn2T\leq n-2. For Tn1T\geq n-1, if ηiη~\eta_{i}\leq\tilde{\eta} for all i{1,,n}i\in\{1,\ldots,n\}, then the above inequalities continue to hold with ηi\eta_{i}’s replaced with η~\tilde{\eta}. The resulting set of inequalities is similar to the one obtained by Rajagopalan and Schulman [13] for the evolution of mutual information in broadcasting a bit over a unidirectional chain of BSCs. With

(m,k,p)(mk)pk(1p)mk,{\mathcal{B}}(m,k,p)\triangleq{m\choose k}p^{k}(1-p)^{m-k},

the exact solution is given by

I(W1;YnT|W2:n)\displaystyle\quad\,\,I(W_{1};Y_{n}^{T}|W_{2:n})
H(W1|W2:n)η~i=1Tn+2η~n2(1η~)Tin+2(Tin2)\displaystyle\leq H(W_{1}|W_{2:n})\tilde{\eta}\sum_{i=1}^{T-n+2}\tilde{\eta}^{n-2}(1-\tilde{\eta})^{T-i-n+2}{T-i\choose n-2}
=H(W1|W2:n)η~i=1Tn+2(Ti,n2,η~)\displaystyle=H(W_{1}|W_{2:n})\tilde{\eta}\sum_{i=1}^{T-n+2}{\mathcal{B}}(T-i,n-2,\tilde{\eta})

for n2n\geq 2, and

I(W1;YnT|W2:n)\displaystyle\quad\,\,I(W_{1};Y_{n}^{T}|W_{2:n})
C(1,2)η~i=1Tn+2η~n3(1η~)Tin+2(Ti1n3)i\displaystyle\leq C_{(1,2)}\tilde{\eta}\sum_{i=1}^{T-n+2}\tilde{\eta}^{n-3}(1-\tilde{\eta})^{T-i-n+2}{T-i-1\choose n-3}i
=C(1,2)η~i=1Tn+2(Ti1,n3,η~)i\displaystyle=C_{(1,2)}\tilde{\eta}\sum_{i=1}^{T-n+2}{\mathcal{B}}(T-i-1,n-3,\tilde{\eta})i

for n3n\geq 3. This proves (30) and (31).
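The first closed-form expression above (the one involving H(W_1|W_{2:n})) can be cross-checked by iterating the recursion (B.6)–(B.7) directly, with all contraction coefficients equal to a common value and the H(W_1|W_{2:n})-branch of (B.7) taken with equality; the following sketch (an aside, with arbitrary test values) does exactly that.

from math import comb

def iterate_recursion(n, T, eta, H):
    # Run (B.6)-(B.7) at equality with all eta_i = eta, using the H(W_1|W_{2:n}) branch of (B.7)
    q = 1.0 - eta
    I_prev = [H * (1.0 - q**s) for s in range(T + 1)]   # I_{2,s}, s = 0,...,T
    for j in range(3, n + 1):                           # build I_{j,s} for j = 3,...,n
        I_cur = [0.0] * (T + 1)
        for s in range(1, T + 1):
            I_cur[s] = q * I_cur[s - 1] + eta * I_prev[s - 1]
        I_prev = I_cur
    return I_prev[T]

def closed_form(n, T, eta, H):
    # H * eta * sum_{i=1}^{T-n+2} B(T-i, n-2, eta)
    return H * eta * sum(comb(T - i, n - 2) * eta**(n - 2) * (1.0 - eta)**(T - i - n + 2)
                         for i in range(1, T - n + 3))

n, T, eta, H = 5, 12, 0.3, 1.0
print(iterate_recursion(n, T, eta, H), closed_form(n, T, eta, H))  # the two values agree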

For general ηi\eta_{i}’s, we obtain a suboptimal upper bound by unrolling the first term in (B.6) for each ii and using the fact that Ini,t=0I_{n-i,t}=0 for tni2t\leq n-i-2, getting

Ini,Ti\displaystyle I_{n-i,T-i} η¯niTn+1ηniIni1,ni2+\displaystyle\leq\bar{\eta}_{n-i}^{T-n+1}\eta_{n-i}I_{n-i-1,n-i-2}+\ldots
+η¯niηniIni1,Ti2+ηniIni1,Ti1\displaystyle\quad+\bar{\eta}_{n-i}\eta_{n-i}I_{n-i-1,T-i-2}+\eta_{n-i}I_{n-i-1,T-i-1}
(η¯niTn+1++η¯ni+1)ηniIni1,Ti1\displaystyle\leq\big{(}\bar{\eta}_{n-i}^{T-n+1}+\ldots+\bar{\eta}_{n-i}+1\big{)}\eta_{n-i}I_{n-i-1,T-i-1}
=(1η¯niTn+2)Ini1,Ti1.\displaystyle=\big{(}1-\bar{\eta}_{n-i}^{T-n+2}\big{)}I_{n-i-1,T-i-1}.

Iterating over ii, and noting that

I2,Tn+2\displaystyle\quad\,\,I_{2,T-n+2}
min{H(W1|W2:n)(1η¯2Tn+2),C(1,2)(Tn+2)},\displaystyle\leq\min\big{\{}H(W_{1}|W_{2:n})(1-\bar{\eta}_{2}^{T-n+2}),C_{(1,2)}(T-n+2)\big{\}},

we get for n2n\geq 2 and Tn1T\geq n-1,

I(W1;YnT|W2:n)\displaystyle I(W_{1};Y_{n}^{T}|W_{2:n})\leq
{H(W1|W2:n)i=2n(1(1ηi)Tn+2)C(1,2)(Tn+2)i=3n(1(1ηi)Tn+2).\displaystyle\begin{cases}H(W_{1}|W_{2:n})\prod_{i=2}^{n}\big{(}1-(1-\eta_{i})^{T-n+2}\big{)}\\ C_{(1,2)}(T-n+2)\prod_{i=3}^{n}\big{(}1-(1-\eta_{i})^{T-n+2}\big{)}\end{cases}. (B.8)

The weakened upper bounds in (32) and (33) are obtained by replacing ηi\eta_{i} in (B.8) with

ηmaxi=1,,nηi.\eta\triangleq\max_{i=1,\ldots,n}\eta_{i}.

Finally, we show (8) using an argument similar to the one in [13]. If n4n\geq 4 and T2+(n3)γ/ηT\leq 2+(n-3)\gamma/\eta for some γ(0,1)\gamma\in(0,1), then

η<ηγn3T2n2T11\displaystyle\eta<\frac{\eta}{\gamma}\leq\frac{n-3}{T-2}\leq\frac{n-2}{T-1}\leq 1

where the last inequality follows from the assumption that Tn1T\geq n-1, since otherwise I(Z;Z^n|W2:n)=0I(Z;{\widehat{Z}}_{n}|W_{2:n})=0. The upper bounds in (30) and (31) can be weakened to

I(Z;Z^n|W2:n)\displaystyle\quad\,\,I(Z;{\widehat{Z}}_{n}|W_{2:n})
(a){H(W1|W2:n)η(Tn+2)(T1,n2,η)C(1,2)η(Tn+2)2(T2,n3,η)\displaystyle\overset{\rm(a)}{\leq}\begin{cases}H(W_{1}|W_{2:n})\eta(T-n+2){\mathcal{B}}(T-1,n-2,\eta)\\ C_{(1,2)}\eta(T-n+2)^{2}{\mathcal{B}}(T-2,n-3,\eta)\end{cases}
(b)min{H(W1|W2:n),C(1,2)}η(Tn+2)2(T2,n3,η)\displaystyle\overset{\rm(b)}{\leq}\min\big{\{}H(W_{1}|W_{2:n}),C_{(1,2)}\big{\}}\eta(T-n+2)^{2}{\mathcal{B}}(T-2,n-3,\eta)
(c)C(1,2)η(Tn+2)2exp(2(n3T2η)2(T2))\displaystyle\overset{\rm(c)}{\leq}C_{(1,2)}\eta(T-n+2)^{2}\exp\left(-2\left(\frac{n-3}{T-2}-\eta\right)^{2}(T-2)\right)
(d)C(1,2)(n3)2γ2ηexp(2(ηγη)2(n3))\displaystyle\overset{\rm(d)}{\leq}C_{(1,2)}\frac{(n-3)^{2}\gamma^{2}}{\eta}\exp\left(-2\left(\frac{\eta}{\gamma}-\eta\right)^{2}(n-3)\right)

where

  • (a) and (b) follow from monotonicity properties of the binomial distribution;
  • (c) follows from the Chernoff–Hoeffding bound;
  • (d) follows from the fact that the channels associated with 𝒮{\mathcal{E}}_{\mathcal{S}} are independent, and from the assumption that n4n\geq 4 and n1T2+(n3)γ/ηn-1\leq T\leq 2+(n-3)\gamma/\eta. A small numerical illustration of the resulting bound is given below.
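The final bound decays exponentially in n-3 once T stays below 2+(n-3)γ/η. The small sketch below evaluates it for placeholder values of C_{(1,2)}, γ, and η chosen by us.

import numpy as np

C12, gamma, eta = 1.0, 0.5, 0.4
for n in (10, 20, 40, 80):
    bound = C12 * (n - 3)**2 * gamma**2 / eta * np.exp(-2.0 * (eta / gamma - eta)**2 * (n - 3))
    print(n, bound)   # decays roughly like exp(-2*(eta/gamma - eta)^2 * n)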

Figure 10: Bayesian network of (W,XT,UT,YT)(W,X^{T},U^{T},Y^{T}) for the randomized algorithm 𝒜{\mathcal{A}}^{\prime} on a 44-node bidirected chain with T=4T=4. (W1:4W_{1:4} are arbitrarily correlated, and not all edges emanating from W2:4W_{2:4} are shown.)

References

  • [1] O. Ayaso, D. Shah, and M. Dahleh, “Information-theoretic bounds for distributed computation over networks of point-to-point channels,” IEEE Trans. Inform. Theory, vol. 56, no. 12, pp. 6020–6039, 2010.
  • [2] G. Como and M. Dahleh, “Lower bounds on the estimation error in problems of distributed computation,” in Proc. Inform. Theory and Applications Workshop, 2009, pp. 70–76.
  • [3] N. Goyal, G. Kindler, and M. Saks, “Lower bounds for the noisy broadcast problem,” SIAM Journal on Computing, vol. 37, no. 6, pp. 1806–1841, 2008.
  • [4] C. Dutta, Y. Kanoria, D. Manjunath, and J. Radhakrishnan, “A tight lower bound for parity in noisy communication networks,” in Proc. ACM Symposium on Discrete Algorithms (SODA), 2014, pp. 1056–1065.
  • [5] M. Braverman, “Interactive information and coding theory,” in Proc. Int. Congress Math., 2014.
  • [6] A. Orlitsky and J. Roche, “Coding for computing,” IEEE Trans. Inform. Theory, vol. 47, no. 3, pp. 903–917, 2001.
  • [7] J. Körner and K. Marton, “How to encode the modulo-two sum of binary sources,” IEEE Trans. Inform. Theory, vol. 25, no. 2, pp. 219–221, 1979.
  • [8] A. B. Wagner, S. Tavildar, and P. Viswanath, “Rate region of the quadratic Gaussian two-encoder source-coding problem,” IEEE Trans. Inform. Theory, vol. 54, no. 5, pp. 1938–1961, 2008.
  • [9] A. El Gamal and Y.-H. Kim, Network Information Theory. Cambridge Univ. Press, 2011.
  • [10] A. Giridhar and P. Kumar, “Toward a theory of in-network computation in wireless sensor networks,” IEEE Communications Magazine, vol. 44, no. 4, pp. 98–107, April 2006.
  • [11] R. Gallager, “Finding parity in a simple broadcast network,” IEEE Trans. Inform. Theory, vol. 34, no. 2, pp. 176–180, 1988.
  • [12] L. Schulman, “Coding for interactive communication,” IEEE Trans. Inform. Theory, vol. 42, no. 6, pp. 1745–1756, 1996.
  • [13] S. Rajagopalan and L. Schulman, “A coding theorem for distributed computation,” in ACM Symposium on Theory of Computing, 1994.
  • [14] R. Carli, G. Como, P. Frasca, and F. Garin, “Distributed averaging on digital erasure networks,” Automatica, vol. 47, pp. 115–121, 2011.
  • [15] S. Kar and J. Moura, “Distributed consensus algorithms in sensor networks with imperfect communication: Link failures and channel noise,” IEEE Trans. Signal Process., vol. 57, no. 1, pp. 355–369, 2009.
  • [16] N. Noorshams and M. Wainwright, “Non-asymptotic analysis of an optimal algorithm for network-constrained averaging with noisy links,” IEEE J. Sel. Top. Sign. Proces., vol. 5, no. 4, pp. 833–844, 2011.
  • [17] L. Ying, R. Srikant, and G. Dullerud, “Distributed symmetric function computation in noisy wireless sensor networks with binary data,” in International Symposium on Modeling and Optimization in Mobile, Ad-Hoc and Wireless networks (WiOpt), 2006.
  • [18] S. Deb, M. Medard, and C. Choute, “Algebraic gossip: a network coding approach to optimal multiple rumor mongering,” IEEE Trans. Inform. Theory, vol. 52, no. 6, pp. 2486–2507, 2006.
  • [19] V. V. Petrov, Sums of Independent Random Variables. Berlin: Springer-Verlag, 1975.
  • [20] M. Raginsky, “Strong data processing inequalities and Φ\Phi-Sobolev inequalities for discrete channels,” IEEE Trans. Inform. Theory, vol. 62, no. 6, pp. 3355–3389, 2016.
  • [21] P. Tiwari, “Lower bounds on communication complexity in distributed computer networks,” J. ACM, vol. 34, no. 4, pp. 921–938, Oct. 1987.
  • [22] A. Chattopadhyay, J. Radhakrishnan, and A. Rudra, “Topology matters in communication,” in Proc. IEEE Annu. Symp. on Foundations of Comp. Sci. (FOCS), Oct 2014, pp. 631–640.
  • [23] T. Cover and J. Thomas, Elements of Information Theory, 2nd ed. New York: Wiley, 2006.
  • [24] D. Koller and N. Friedman, Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
  • [25] Y. Polyanskiy and Y. Wu, “Lecture Notes on Information Theory,” Lecture Notes for ECE563 (UIUC) and 6.441 (MIT), 2012-2016. [Online]. Available: http://people.lids.mit.edu/yp/homepage/data/itlectures_v4.pdf
  • [26] R. Ahlswede and P. Gács, “Spreading of sets in product spaces and hypercontraction of the Markov operator,” Ann. Probab., vol. 4, no. 6, pp. 925–939, 1976.
  • [27] V. Anantharam, A. Gohari, S. Kamath, and C. Nair, “On maximal correlation, hypercontractivity, and the data processing inequality studied by Erkip and Cover,” arXiv preprint, 2013. [Online]. Available: http://arxiv.org/abs/1304.6133
  • [28] Y. Polyanskiy and Y. Wu, “Dissipation of information in channels with input constraints,” IEEE Trans. Inform. Theory, vol. 62, no. 1, pp. 35–55, 2016.
  • [29] W. Evans and L. Schulman, “Signal propagation and noisy circuits,” IEEE Trans. Inform. Theory, vol. 45, no. 7, pp. 2367–2373, 1999.
  • [30] A. Xu, “Information-theoretic limitations of distributed information processing,” Ph.D. dissertation, University of Illinois at Urbana-Champaign, 2016.
  • [31] Y. Polyanskiy and Y. Wu, “Strong data-processing inequalities for channels and Bayesian networks,” arXiv preprint, 2015. [Online]. Available: http://arxiv.org/abs/1508.06025
  • [32] V. Kostina and S. Verdú, “Lossy joint source-channel coding in the finite blocklength regime,” IEEE Trans. Inform. Theory, vol. 59, no. 5, pp. 2545–2575, 2013.
  • [33] R. Gallager, Information Theory and Reliable Communication. New York: Wiley, 1968.
  • [34] A. Kolmogorov, “Sur les propriétés des fonctions de concentrations de M. P. Lévy,” Ann. Inst. H. Poincaré, vol. 16, pp. 27–34, 1958.
  • [35] H. H. Nguyen and V. H. Vu, “Small ball probability, inverse theorems, and applications,” in Erdős Centennial, ser. Bolyai Society Mathematical Studies. Springer, 2013, vol. 25. [Online]. Available: http://arxiv.org/abs/1301.0019
  • [36] S. G. Bobkov and G. P. Chistyakov, “On concentration functions of random variables,” J. Theor. Probab., vol. 28, no. 3, pp. 976–988, 2015, published online.
  • [37] M. Rudelson and R. Vershynin, “The Littlewood–Offord problem and invertibility of random matrices,” Adv. Math., vol. 218, pp. 600–633, 2008.
  • [38] M. Rudelson and R. Vershynin, “Small ball probabilities for linear images of high dimensional distributions,” arXiv preprint arXiv:1402.4492, Feb. 2014. [Online]. Available: https://arxiv.org/abs/1402.4492
  • [39] P. Erdős, “On a lemma of Littlewood and Offord,” Bull. Amer. Math. Soc., vol. 51, pp. 898–902, 1945.
  • [40] S. Bobkov and M. Madiman, “The entropy per coordinate of a random vector is highly constrained under convexity conditions,” IEEE Trans. Inform. Theory, vol. 57, no. 8, pp. 4940–4954, 2011.
  • [41] R. Carli, G. Como, P. Frasca, and F. Garin, “Average consensus on digital noisy networks,” 1st IFAC Workshop on Estimation and Control of Networked Systems, 2009.
  • [42] H. S. Witsenhausen, “On sequences of pairs of dependent random variables,” SIAM J. Appl. Math., vol. 28, no. 1, pp. 100–113, Jan. 1975.