
Semantic Compression with Side Information: A Rate-Distortion Perspective

Tao Guo, Member, IEEE, Yizhu Wang, Jie Han, Huihui Wu, Member, IEEE,
Bo Bai, Senior Member, IEEE, and Wei Han, Member, IEEE
Abstract

We consider the semantic rate-distortion problem motivated by task-oriented video compression. The semantic information corresponding to the task is not observable to the encoder and affects the observations through a joint probability distribution. The similarities among segments within a frame and across frames in video compression are formulated as side information available at both the encoder and the decoder. The decoder is interested in recovering the observation and making an inference of the semantic information under certain distortion constraints.

We establish the information-theoretic limits for the tradeoff between compression rates and distortions by fully characterizing the rate-distortion function. We further evaluate the rate-distortion function under specific Markov conditions for three scenarios: i) both the task and the observation are binary sources; ii) the task is a binary classification of an integer observation as even or odd; iii) the task and the observation are correlated Gaussian sources. We also illustrate through numerical results that recovering only the semantic information can reduce the coding rate compared with recovering the source observation.

Index Terms:
Semantic communication, inference, video compression, rate-distortion function.

I Introduction

The rate limit for lossless compression of memoryless sources is the entropy, as shown by Shannon in his landmark paper [1]. Lossy source coding under a given fidelity criterion was also introduced in the same paper. The Shannon rate-distortion function, characterizing the optimal tradeoff between compression rates and distortion measures in terms of mutual information, was then established in [2].

Thereafter, the rate-distortion function was investigated when side information is available at the encoder and/or the decoder; see [3, 4, 5, 6, 7, 8, 9, 10, 11] and references therein. When the side information is available only at the encoder, no rate reduction can be achieved. When the side information is available only at the decoder, the corresponding rate-distortion function was characterized by Wyner and Ziv in [5], with extensions discussed in [9, 10, 11, 6, 7, 8]. Finally, if both the encoder and decoder have access to the same side information, the optimal tradeoff is called the conditional rate-distortion function, which was given by [3] and [4].

Lossy source coding theory finds applications in establishing information-theoretic limits for the practical compression of speech signals, images, videos, etc. [12, 13, 14, 15]. Practical techniques for video compression have been explored for decades [16, 17, 18]. Currently, popular standards such as HEVC, VP9, VVC, and AV1 are based on partitioning a picture/frame into coding tree units, which typically correspond to 64x64 or 128x128 pixel areas. Each coding tree unit is then partitioned into coding blocks (segments) [19, 20] following a recursive coding tree representation. These compression schemes exploit both the intra-correlation within one frame and the inter-correlation between consecutive frames.

Nowadays, with the development of high-definition video, 5G communication systems, and the industrial Internet of Things, communication overhead and storage demand have been growing exponentially. As a result, higher compression rates are required, but they cannot be obtained by simply compressing a given source (e.g., a video or image) itself, in light of the rate-distortion limits.

Semantic or task-oriented compression [21, 22, 23, 24], which aims to compress sources efficiently according to a specific communication task (e.g., video detection, inference, classification, decision making), has been viewed as a promising technique for future 6G systems due to its effectiveness. In particular, the goal of semantic compression is to recover only the semantic information necessary for the task instead of every transmitted bit as in Shannon's communication setup, which leads to a significant reduction of coding rates.

In addition to the semantic information of interest, the original sources are also required in some cases, such as video surveillance, for the purpose of evidence storage and verification. An effective approach is to save only the most important or relevant segments of a video; related work on semantic-based image/video segmentation can be found in [25, 26, 27]. Most recently, the classical indirect source coding problem [28] was revisited from the semantic point of view, and the corresponding rate-distortion framework for semantic information was developed in [29, 30, 31, 32].

[Figure: the source $S^n$ passes through $p(x_1|s)$ to produce $(X_1^n,X_2^n)$; the encoder maps the observations to $W$ for storage; the decoder outputs $\hat{X}_1^n(D_1)$ and $\hat{X}_2^n(D_2)$, and a classifier produces $\hat{S}^n(D_s)$; side information $Y^n$ is available at both encoder and decoder.]
Figure 1: Illustration of system model with side information.

The current paper introduces side information into the framework of [29] and completely characterizes the semantic rate-distortion problem with side information. Motivated by task-oriented video compression, the semantic information corresponding to the task is not observable to the encoder. In light of video segmentation, the observed source is partitioned into two parts, and the semantic information influences only the more important part. Moreover, the intra-correlation and inter-correlation are viewed as side information, available at both the encoder and decoder to help compression. The decoder needs to reconstruct the whole source subject to separate distortion constraints on the two parts. The semantic information is then recovered from the source reconstructions at the decoder. Our main contributions are summarized as follows:

  1) We fully characterize the optimal rate-distortion tradeoff. It is further shown that separately compressing the two source parts is optimal if they are independent conditioned on the side information.

  2) The rate-distortion function is evaluated for the inference of a binary source under some specific Markov chains.

  3) We further evaluate the rate-distortion function for binary classification of an integer source. The numerical results show that recovering only the semantic information can reduce the coding rate compared with recovering the source message.

  4) The rate-distortion function for Gaussian sources is also evaluated, which may provide more insights for future simulations of real video compression.

This paper is mainly concerned with the information-theoretic aspects; future work on the limits of real video compression is under investigation.

The rest of the paper is organized as follows. We first formulate the problem and present some preliminary results in Section II. In Section III, we characterize the rate-distortion function and some useful properties. Evaluations of the rate-distortion function for binary, integer, and Gaussian sources with Hamming/mean squared error distortions are presented in Sections IV, V, and VI, respectively. We present and analyze plots of these evaluations in Section VII. The paper is concluded in Section VIII. Some essential proofs can be found in the appendices.

II Problem Formulation and Preliminaries

II-A Problem Formulation

Consider the system model for video detection (inference) that also requires evidence storage, depicted in Fig. 1. The problem is defined as follows. A collection of discrete memoryless sources (DMS) is described by generic random variables $(S,X_1,X_2,Y)$ taking values in finite alphabets $\mathcal{S}\times\mathcal{X}_1\times\mathcal{X}_2\times\mathcal{Y}$ according to the probability distribution $p(x_1,x_2,y)p(s|x_1)$. In particular, this indicates the Markov chain $S-X_1-(X_2,Y)$. We interpret $S$ as a latent variable, which is not observable by the encoder. It can be viewed as the semantic information (e.g., the state of a system), which describes the features of the system. We assume that the observation of the system consists of two parts:

  • $X_1$ varies according to the semantic information $S$ and captures the “appearance” of the features, e.g., the vehicle and red lights in a frame that captures a violation at the crossing;

  • $X_2$ is the background information irrelevant to the features, e.g., buildings in the frame capturing the violation.

$Y$ is side information that can help compression, such as previous frames in the video. For length-$n$ source sequences $(S^n,X_1^n,X_2^n,Y^n)$, the encoder has access only to the observed sequences $(X_1^n,X_2^n,Y^n)$ and encodes them as $W$, which is stored at the server. Upon observing the local information $Y^n$ and receiving $W$, the decoder reconstructs the source sequences as $(\hat{X}_1^n,\hat{X}_2^n)$, taking values in $\hat{\mathcal{X}}_1\times\hat{\mathcal{X}}_2$, within distortions $D_1$ and $D_2$. Given the reconstructions, the classifier is required to recover the semantic information as $\hat{S}^n$ from the alphabet $\hat{\mathcal{S}}$ with distortion constraint $D_s$. Here, for simplicity, we assume a perfect classifier, i.e., it is equivalent to recovering $\hat{S}^n$ directly at the decoder, as illustrated in Fig. 2.

[Figure: $(X_1^n,X_2^n)$ → Encoder → $W$ → Decoder → $\hat{X}_1^n(D_1)$, $\hat{X}_2^n(D_2)$, $\hat{S}^n(D_s)$, with side information $Y^n$ at both encoder and decoder.]
Figure 2: The equivalent system model.

Formally, an $(n,2^{nR})$ code is defined by the encoding function

En:\mathcal{X}_1^n\times\mathcal{X}_2^n\times\mathcal{Y}^n\rightarrow\{1,2,\cdots,2^{nR}\}

and the decoding function

De:\{1,2,\cdots,2^{nR}\}\times\mathcal{Y}^n\rightarrow\hat{\mathcal{X}}_1^n\times\hat{\mathcal{X}}_2^n\times\hat{\mathcal{S}}^n.

Let $\mathbb{R}^+$ be the set of nonnegative real numbers. We consider bounded per-letter distortion functions $d_1:\mathcal{X}_1\times\hat{\mathcal{X}}_1\rightarrow\mathbb{R}^+$, $d_2:\mathcal{X}_2\times\hat{\mathcal{X}}_2\rightarrow\mathbb{R}^+$, and $d_s:\mathcal{S}\times\hat{\mathcal{S}}\rightarrow\mathbb{R}^+$. The distortions between length-$n$ sequences are defined by

d_1(x_1^n,\hat{x}_1^n)\triangleq\frac{1}{n}\sum_{i=1}^n d_1(x_{1,i},\hat{x}_{1,i}),
d_2(x_2^n,\hat{x}_2^n)\triangleq\frac{1}{n}\sum_{i=1}^n d_2(x_{2,i},\hat{x}_{2,i}),
d_s(s^n,\hat{s}^n)\triangleq\frac{1}{n}\sum_{i=1}^n d_s(s_i,\hat{s}_i).

A nonnegative rate-distortion tuple $(R,D_1,D_2,D_s)$ is said to be achievable if for sufficiently large $n$, there exists an $(n,2^{nR})$ code such that

\lim_{n\rightarrow\infty}\mathbb{E}d_1(X_1^n,\hat{X}_1^n)\leq D_1,
\lim_{n\rightarrow\infty}\mathbb{E}d_2(X_2^n,\hat{X}_2^n)\leq D_2,
\lim_{n\rightarrow\infty}\mathbb{E}d_s(S^n,\hat{S}^n)\leq D_s.

The rate-distortion function $R(D_1,D_2,D_s)$ is the infimum of coding rates $R$ such that the rate-distortion tuple $(R,D_1,D_2,D_s)$ is achievable for distortions $(D_1,D_2,D_s)$. Our goal is to characterize the rate-distortion function.

II-B Preliminaries

II-B1 Conditional rate-distortion function

The rate-distortion function was investigated and fully characterized in [2]. Assume the length-$n$ source sequence $X^n$ is independent and identically distributed (i.i.d.) over $\mathcal{X}$ with generic random variable $X$, and let $d:\mathcal{X}\times\hat{\mathcal{X}}\rightarrow\mathbb{R}^+$ be a bounded per-letter distortion measure. The rate-distortion function for a given distortion criterion $D$ is given by

R(D)=\min_{p(\hat{x}|x):\,\mathbb{E}d(X,\hat{X})\leq D}I(X;\hat{X}). (1)

It was proved in [2] and also presented in [33, 34, 35, 36, 37] that $R(D)$ is a non-increasing and convex function of $D$.
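The curve in (1) can also be traced numerically for a general finite-alphabet source by the standard Blahut-Arimoto algorithm. The following Python sketch is our own illustration (function and variable names are ours, not from the paper): it sweeps a Lagrange multiplier to produce $(D,R)$ points; replacing the distortion matrix by the modified measure $d_s'$ introduced later in (14) evaluates the semantic rate-distortion function in the same way.

```python
import numpy as np

def blahut_arimoto(p_x, d, slope, n_iter=500):
    """Blahut-Arimoto iterations for the rate-distortion function (1).

    p_x   : source pmf, shape (|X|,)
    d     : distortion matrix d(x, x_hat), shape (|X|, |X_hat|)
    slope : Lagrange multiplier s < 0; sweeping it traces the R(D) curve
    Returns one (D, R) point on the curve, R in bits.
    Assumes all resulting probabilities stay strictly positive.
    """
    q = np.full(d.shape[1], 1.0 / d.shape[1])        # reproduction pmf q(x_hat)
    for _ in range(n_iter):
        # optimal test channel for the current q
        w = q[None, :] * np.exp(slope * d)           # shape (|X|, |X_hat|)
        p_xhat_given_x = w / w.sum(axis=1, keepdims=True)
        q = p_x @ p_xhat_given_x                     # induced reproduction pmf
    joint = p_x[:, None] * p_xhat_given_x
    D = np.sum(joint * d)
    R = np.sum(joint * np.log2(p_xhat_given_x / q[None, :]))
    return D, R

# Example: Bernoulli(1/2) source with Hamming distortion; R(D) should equal 1 - hb(D).
p_x = np.array([0.5, 0.5])
d = np.array([[0.0, 1.0], [1.0, 0.0]])
for s in [-8.0, -4.0, -2.0]:
    D, R = blahut_arimoto(p_x, d, s)
    print(round(D, 3), round(R, 3))
```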

If both the encoder and decoder are allowed to observe side information $Y^n$ (with generic variable $Y$ over $\mathcal{Y}$ jointly distributed with $X$), as depicted in Fig. 3, then the optimal tradeoff is called the conditional rate-distortion function [33, 3, 4], which is characterized as

R_{X|Y}(D)=\min_{p(\hat{x}|x,y):\,\mathbb{E}d(X,\hat{X})\leq D}I(X;\hat{X}|Y). (2)
[Figure: $X^n$ → Encoder → $W$ → Decoder → $\hat{X}^n(D)$, with side information $Y^n$ at both encoder and decoder.]
Figure 3: Conditional Rate-distortion model.

It is shown in [3] that the conditional rate-distortion function can also be obtained as a weighted sum of the marginal rate-distortion functions of sources with distributions $P_{X|Y}(\cdot|y),\,y\in\mathcal{Y}$, i.e.,

R_{X|Y}(D)=\min_{\{D_y:\,y\in\mathcal{Y}\}:\,\sum_{y\in\mathcal{Y}}p(y)D_y\leq D}\;\sum_{y\in\mathcal{Y}}p(y)R(D_y), (3)

where for any $y\in\mathcal{Y}$, $R(D_y)$ is obtained from (1) by replacing the source distribution with $P_{X|Y}(\cdot|y)$. This property will be useful for evaluating conditional rate-distortion functions of given source distributions. If $(X,Y)$ is a doubly symmetric binary source (DSBS) with parameter $p_0$, i.e.,

p(x,y)=\left[\begin{array}{cc}\frac{1-p_0}{2}&\frac{p_0}{2}\\ \frac{p_0}{2}&\frac{1-p_0}{2}\end{array}\right], (4)

then the conditional rate-distortion function is given in [4] by

R_{X|Y}(D)=\left[h_b(p_0)-h_b(D)\right]\cdot\mathds{1}_{0\leq D\leq p_0}, (5)

where $h_b(q)=-q\log q-(1-q)\log(1-q)$ is the entropy of a Bernoulli($q$) distribution and $\mathds{1}_A$ is the indicator function of the event $A$.
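As a quick numerical companion to (5), the sketch below (helper names are ours; rates in bits) evaluates the conditional rate-distortion function of a DSBS($p_0$) pair under Hamming distortion.

```python
import numpy as np

def hb(q):
    """Binary entropy in bits; hb(0) = hb(1) = 0."""
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * np.log2(q) - (1 - q) * np.log2(1 - q)

def R_X_given_Y_dsbs(D, p0):
    """Conditional rate-distortion function (5) for a DSBS(p0) pair
    under Hamming distortion: hb(p0) - hb(D) for 0 <= D <= p0, else 0."""
    if D < 0:
        raise ValueError("distortion must be nonnegative")
    return 0.0 if D >= p0 else hb(p0) - hb(D)

# Example: p0 = 0.25
for D in [0.0, 0.1, 0.25, 0.4]:
    print(D, round(R_X_given_Y_dsbs(D, 0.25), 4))
```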

II-B2 Rate-distortion function with two constraints

The scenario where we wish to describe the i.i.d. source sequence $X^n$ at rate $R$ and produce two reconstructions $\hat{X}_a^n$ and $\hat{X}_b^n$ with distortion criteria $\mathbb{E}d_a(X^n,\hat{X}_a^n)\leq D_a$ and $\mathbb{E}d_b(X^n,\hat{X}_b^n)\leq D_b$, respectively, was discussed in [38, 37]. The rate-distortion function is given by

R_{\text{2d}}(D_a,D_b)=\min_{\substack{p(\hat{x}_a,\hat{x}_b|x):\\ \mathbb{E}d_a(X,\hat{X}_a)\leq D_a\\ \mathbb{E}d_b(X,\hat{X}_b)\leq D_b}}I(X;\hat{X}_a,\hat{X}_b). (6)

Comparing (1) and (6), we easily see that

\max\{R(D_a),R(D_b)\}\leq R_{\text{2d}}(D_a,D_b)\leq R(D_a)+R(D_b).

For the special case where $\hat{\mathcal{X}}_a=\hat{\mathcal{X}}_b$ and $d_a(x,\hat{x})=d_b(x,\hat{x})$ for all $x\in\mathcal{X}$ and $\hat{x}\in\hat{\mathcal{X}}_a$, it suffices to recover only one sequence $\hat{X}_a^n=\hat{X}_b^n$ with distortion $\min\{D_a,D_b\}$. Then both distortion constraints are satisfied since

\mathbb{E}d_a(X^n,\hat{X}_a^n)=\min\{D_a,D_b\}\leq D_a

and

\mathbb{E}d_b(X^n,\hat{X}_b^n)=\min\{D_a,D_b\}\leq D_b.

This implies

R_{\text{2d}}(D_a,D_b)=R(\min\{D_a,D_b\})=\max\{R(D_a),R(D_b)\}, (7)

where the second equality follows from the non-increasing property of $R(D)$.

When side information is available at the decoder for only one of the two reconstructions, e.g., $\hat{X}_b$, it was proved in [10, 7] that successive encoding (first $\hat{X}_a$, then $\hat{X}_b$) is optimal. For the case where the two reconstructions have access to different side information, the rate-distortion tradeoff was characterized in [10, 11, 9, 36].

II-B3 Rate-distortion function of two sources

The problem of compressing two i.i.d. source sequences $X_a^n$ and $X_b^n$ at the same encoder is considered in [37, Problem 10.14]. The rate-distortion function given therein is

R_{\text{2s}}(D_a,D_b)=\min_{\substack{p(\hat{x}_a,\hat{x}_b|x_a,x_b):\\ \mathbb{E}d_a(X_a,\hat{X}_a)\leq D_a\\ \mathbb{E}d_b(X_b,\hat{X}_b)\leq D_b}}I(X_a,X_b;\hat{X}_a,\hat{X}_b). (8)

It is also shown that for two independent sources, compressing simultaneously is the same as compressing separately in terms of the rate and distortions, i.e.,

R_{\text{2s}}(D_a,D_b)=R(D_a)+R(D_b). (9)

If the two sources are dependent, the equality in (9) can fail; the Slepian-Wolf rate region [39] indicates that the joint entropy of the two source variables is sufficient and optimal for lossless reconstruction. Taking distortions into account, Gray showed via an example in [4] that the compression rate can in general be strictly larger than $R(D_a)+R_{X_b|X_a}(D_b)$. Finally, some related results on compressing compound sources can be found in [40].

III Optimal Rate-distortion Tradeoff

III-A The Rate-distortion Function

Theorem 1.

The rate-distortion function for compression and inference with side information is given as the solution to the following optimization problem

R(D_1,D_2,D_s)=\min I(X_1,X_2;\hat{X}_1,\hat{X}_2,\hat{S}|Y) (10)
\text{s.t. }\mathbb{E}d_1(X_1,\hat{X}_1)\leq D_1 (11)
\mathbb{E}d_2(X_2,\hat{X}_2)\leq D_2 (12)
\mathbb{E}d_s'(X_1,\hat{S})\leq D_s, (13)

where the minimum is taken over all conditional pmfs $p(\hat{x}_1,\hat{x}_2,\hat{s}|x_1,x_2,y)$ and the modified distortion measure is defined by

d_s'(x_1,\hat{s})=\frac{1}{p(x_1)}\sum_{s\in\mathcal{S}}p(x_1,s)d_s(s,\hat{s}). (14)
Proof.

We can interpret the problem as the combination of rate-distortion with two sources (X1X_{1} and X2X_{2}), rate-distortion with two constraints (X1X_{1} is recovered with two constraints D1D_{1} and DsD_{s}), and conditional rate-distortion (conditioning on YY). Then the theorem can be obtained informally by combining the rate-distortion functions in (2), (6), and (8). For completeness, we provide a rigorous technical proof in Appendix A. ∎
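To make the modified distortion measure (14) concrete, the following sketch (illustrative only; the array layout and names are ours) computes $d_s'(x_1,\hat{s})$ from a joint pmf $p(s,x_1)$ and a distortion matrix $d_s(s,\hat{s})$. For the DSBS pair in (15) with Hamming distortion, it reproduces the values derived in (33) of Appendix C.

```python
import numpy as np

# Hypothetical example: joint pmf p(s, x1) for the DSBS(p) pair in (15), p = 0.1,
# and Hamming distortion d_s(s, s_hat).
p = 0.1
p_s_x1 = np.array([[(1 - p) / 2, p / 2],
                   [p / 2, (1 - p) / 2]])      # rows: s, columns: x1
d_s = np.array([[0.0, 1.0],
                [1.0, 0.0]])                   # Hamming distortion d_s(s, s_hat)

def modified_distortion(p_s_x1, d_s):
    """Evaluate d_s'(x1, s_hat) = (1/p(x1)) * sum_s p(x1, s) d_s(s, s_hat), cf. (14)."""
    p_x1 = p_s_x1.sum(axis=0)                  # marginal of x1
    return (p_s_x1.T @ d_s) / p_x1[:, None]    # entry [x1, s_hat]

print(modified_distortion(p_s_x1, d_s))
# For this symmetric example, d_s'(x1, s_hat) = p if s_hat == x1 and 1 - p otherwise,
# matching (33) in Appendix C.
```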

III-B Some Properties

Similar to the rate-distortion function in (1), we collect some properties in the following lemma. The proof simply follows the same procedure as that for (1) in [2, 33, 34, 35, 37, 36]. We omit the details here.

Lemma 2.

The rate-distortion function $R(D_1,D_2,D_s)$ is non-increasing and convex in $(D_1,D_2,D_s)$.

Recall from (9) that compressing two independent sources simultaneously is the same as compressing them separately. One may then ask whether the optimality of separate compression still holds here. We answer this question in the following lemma.

Lemma 3.

If $X_1-Y-X_2$ forms a Markov chain, then

R(D_1,D_2,D_s)=R_{\text{2d},X_1|Y}(D_1,D_s)+R_{X_2|Y}(D_2),

where the conditional rate-distortion function with two constraints is given by

R_{\text{2d},X_1|Y}(D_1,D_s)=\min_{\substack{p(\hat{x}_1,\hat{s}|x_1,y):\\ \mathbb{E}d_1(X_1,\hat{X}_1)\leq D_1\\ \mathbb{E}d_s'(X_1,\hat{S})\leq D_s}}I(X_1;\hat{X}_1,\hat{S}|Y)

and the conditional rate-distortion function is given in (2) and can be written as

R_{X_2|Y}(D_2)=\min_{p(\hat{x}_2|x_2,y):\,\mathbb{E}d_2(X_2,\hat{X}_2)\leq D_2}I(X_2;\hat{X}_2|Y).
Proof.

The proof is given in Appendix B. ∎

Remark 1.

Compared to the independence assumption for the equality in (9), we have the conditional independence $I(X_1;X_2|Y)=0$ in Lemma 3. This is intuitive since, compared to the setting for (9), we have additional side information available to the encoder and decoder.

III-C Rate-distortion Function for Semantic Information

The indirect rate-distortion problem can be viewed as a special case of Theorem 1 in which only the semantic information $S$ is recovered, i.e., $X_2$ and $Y$ are constants and $D_1=\infty$. Denote the minimum achievable rate for a given distortion constraint $D_s$ by $R_s(D_s)$.

Consider binary sources and assume $S$ and $X_1$ follow the doubly symmetric binary distribution, i.e.,

p(s,x_1)=\left[\begin{matrix}\frac{1-p}{2}&\frac{p}{2}\\ \frac{p}{2}&\frac{1-p}{2}\end{matrix}\right]. (15)

The transition probability $p(x_1|s)$ can also be described by the binary symmetric channel (BSC) in Fig. 4.

[Figure: BSC with input $S\in\{0,1\}$ and output $X_1\in\{0,1\}$; crossover probability $p$, i.e., $P(X_1\neq S)=p$.]
Figure 4: Transition probability from $S$ to $X_1$, in terms of a BSC.

Assume $p\leq 0.5$, which means that $X_1$ is more likely to take the same value as $S$. Let $d_s:\mathcal{S}\times\hat{\mathcal{S}}\rightarrow\{0,1\}$ be the Hamming distortion measure. Then the evaluation of $R_s(D_s)$ is given in the following lemma.

Let $R_{d_s'}(\cdot)$ be the ordinary rate-distortion function in (1) under the distortion measure $d_s'$ (cf. (14)). For notational simplicity, for $D_s\geq p$, define

D_s^0\triangleq\frac{D_s-p}{1-2p}. (16)
Lemma 4.

For binary sources in (15) and Hamming distortion, the rate-distortion function for semantic information is

R_s(D_s)=R_{d_s'}(D_s), (17)

where R_{d_s'}(D_s)=R\left(D_s^0\right)=\left[1-h_b\left(\frac{D_s-p}{1-2p}\right)\right]\cdot\mathds{1}_{p\leq D_s\leq 0.5}.

Proof.

The evaluation of $R_s(D_s)$ was given in [33, 41]. A simpler proof can be found in Appendix C. ∎

Remark 2.

By the properties of the rate-distortion function in (1) and the linearity between $D_s^0$ and $D_s$, we see that $R_s(D_s)$ is also non-increasing and convex in $D_s$.

Remark 3.

It is easy to check that $\frac{D-p}{1-2p}<D$ for $D<0.5$. This implies that $R_s(D)>R(D)$ for $D<0.5$, where $R(D)$ is the ordinary rate-distortion function in (1). The inequality is intuitive from the data processing inequality: under the same distortion constraint $D$, recovering $S$ directly (with rate $R(D)$) is easier than recovering it from the observation $X_1$ (with rate $R_s(D)$). Moreover, we see from the lemma that $D_s\geq p$, which means that the semantic information can never be losslessly recovered for $p>0$. This follows from the fact that even if we know $X_1$ completely, the best achievable distortion for reconstructing $S$ is the distortion between $S$ and $X_1$, which equals $p$. The rate-distortion functions $R_s(D)$ and $R(D)$ are illustrated in Fig. 5 for $p=0.1$, which verifies the above observations. For a general source and distortion measure, we have $R_s(D)\geq R(D)$, where equality holds only when $X_1$ determines $S$. This can be easily proved by the data processing inequality, and we omit the details here.

Figure 5: Comparison of rate-distortion functions $R_s(D)$ and $R(D)$ for $p=0.1$.
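The comparison in Fig. 5 can be reproduced with a few lines of Python; the sketch below (log base 2, function names ours) evaluates $R(D)=1-h_b(D)$ and $R_s(D)$ from (17) for $p=0.1$.

```python
import numpy as np

def hb(q):
    return 0.0 if q <= 0 or q >= 1 else -q*np.log2(q) - (1-q)*np.log2(1-q)

def R_direct(D):
    """Ordinary rate-distortion function of a Bernoulli(1/2) source, Hamming distortion."""
    return max(1.0 - hb(min(D, 0.5)), 0.0)

def R_semantic(D, p):
    """Rate-distortion function (17) for the semantic information; defined for D >= p."""
    if D < p:
        return float('inf')          # S can never be recovered with distortion below p
    return max(1.0 - hb(min((D - p) / (1 - 2*p), 0.5)), 0.0)

p = 0.1
for D in [0.1, 0.2, 0.3, 0.4, 0.5]:
    print(D, round(R_direct(D), 3), round(R_semantic(D, p), 3))
# As in Fig. 5, R_semantic(D, p) >= R_direct(D), with equality only at D = 0.5.
```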
Remark 4.

We can imagine that $d_s'$ measures the distortion between the observation and the reconstruction of the semantic information. Furthermore, it was shown in [28, 29, 33] that $d_s$ and $d_s'$ measure equivalent distortions, i.e.,

\mathbb{E}d_s'(X_1,\hat{S})=\mathbb{E}d_s(S,\hat{S}),
\mathbb{E}d_s'(X_1^n,\hat{S}^n)=\mathbb{E}d_s(S^n,\hat{S}^n).

Then we can regard the system of compressing $X_1^n$ and reconstructing $\hat{S}^n$ as the ordinary rate-distortion problem with distortion measure $d_s'$. Thus, $R_s(D_s)$ is equivalent to the ordinary rate-distortion function in (1) under distortion measure $d_s'$, which rigorously proves (17).

IV Case Study: Binary Sources

Assume $S$ and $X_1$ are doubly symmetric binary sources with the distribution in (15), and $X_2$ and $Y$ are both Bernoulli($\frac{1}{2}$) sources. The reconstructions are all binary, i.e., $\hat{\mathcal{X}}_1=\hat{\mathcal{X}}_2=\hat{\mathcal{S}}=\{0,1\}$. The distortion measures $d_1$, $d_2$, and $d_s$ are all assumed to be Hamming distortion. We further assume that any two of $X_1$, $X_2$, and $Y$ are doubly symmetric binary distributed (cf. (4)) with parameters $p_1$, $p_2$, and $p_3$, respectively. Specifically, $(X_1,X_2)\sim\text{DSBS}(p_1)$, $(X_1,Y)\sim\text{DSBS}(p_2)$, and $(X_2,Y)\sim\text{DSBS}(p_3)$.

Consider the following two examples that only differ in the source distributions.

IV-A Conditionally Independent Sources

Assume the Markov chain $X_1-Y-X_2$ holds,¹ i.e., $X_1$ and $X_2$ are independent conditioned on $Y$. This assumption coincides with the intuitive understanding of $X_1$ and $X_2$ in Section II-A, namely that the semantic feature can be independent of the background. Then, by Lemma 3, compressing $X_1^n$ and $X_2^n$ simultaneously is the same as compressing them separately in terms of the optimal compression rate and distortions, which implies the following theorem.

¹Note that the Markov chain $X_1-Y-X_2$ indicates $p_1=p_2\star p_3\triangleq p_2(1-p_3)+p_3(1-p_2)$.

Theorem 5.

The rate-distortion function for the above conditionally independent sources is given by

R(D_1,D_2,D_s)=\big[h_b(p_3)-h_b(D_2)\big]\cdot\mathds{1}_{0\leq D_2\leq p_3}+\left[h_b(p_2)-h_b\big(\min\{D_1,D_s^0\}\big)\right]\cdot\mathds{1}_{0\leq\min\{D_1,D_s^0\}\leq p_2},

where $D_s^0=\frac{D_s-p}{1-2p}$ is defined in (16).

Proof.

The rate-distortion function in Theorem 1 satisfies

R(D_1,D_2,D_s)
=R_{\text{2d},X_1|Y}(D_1,D_s)+R_{X_2|Y}(D_2)
=\min_{\substack{\mathbb{E}d_1(X_1,\hat{X}_1)\leq D_1\\ \mathbb{E}d_s'(X_1,\hat{S})\leq D_s}}I(X_1;\hat{X}_1,\hat{S}|Y)+\min_{\mathbb{E}d_2(X_2,\hat{X}_2)\leq D_2}I(X_2;\hat{X}_2|Y)
=\left[h_b(p_2)-h_b\big(\min\{D_1,D_s^0\}\big)\right]\cdot\mathds{1}_{0\leq\min\{D_1,D_s^0\}\leq p_2}+\big[h_b(p_3)-h_b(D_2)\big]\cdot\mathds{1}_{0\leq D_2\leq p_3},

where the last step follows from the rate-distortion functions in (5), (7) and (17). ∎
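As an illustration of Theorem 5, the following sketch (parameter values chosen only as an example) evaluates the closed-form expression for conditionally independent binary sources; rates are in bits.

```python
import numpy as np

def hb(q):
    return 0.0 if q <= 0 or q >= 1 else -q*np.log2(q) - (1-q)*np.log2(1-q)

def rate_cond_indep(D1, D2, Ds, p, p2, p3):
    """Theorem 5: R(D1, D2, Ds) for conditionally independent binary sources
    (X1 - Y - X2), with (X1,Y) ~ DSBS(p2), (X2,Y) ~ DSBS(p3), and (S,X1) ~ DSBS(p)."""
    Ds0 = (Ds - p) / (1 - 2*p)            # effective distortion (16); requires Ds >= p
    d1_eff = min(D1, Ds0)
    r1 = hb(p2) - hb(d1_eff) if 0 <= d1_eff <= p2 else 0.0
    r2 = hb(p3) - hb(D2) if 0 <= D2 <= p3 else 0.0
    return r1 + r2

# Hypothetical example: p = p2 = p3 = 0.25
print(round(rate_cond_indep(D1=0.05, D2=0.1, Ds=0.3, p=0.25, p2=0.25, p3=0.25), 4))
```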

IV-B Correlated Sources

As in [41], evaluating the rate-distortion function for general correlated sources can be extremely difficult. Thus, we assume the Markov chain $Y-X_1-X_2$, motivated by the intuition that the side information $Y$ helps more for the semantic-related source $X_1$.

Without the conditional independence of $X_1$ and $X_2$, the optimality of separate compression in Lemma 3 may not hold. The rate-distortion function $R(D_1,D_2,D_s)$ in Theorem 1 can be calculated as follows. Recall from (16) that $D_s^0\triangleq\frac{D_s-p}{1-2p}$. For simplicity, we consider only small distortions in the set

\mathcal{D}_0=\big\{(D_1,D_2,D_s):0\leq\min\{D_1,D_s^0\}\leq p_1p_2\text{ and }0\leq D_2\leq p_1\big\}. (18)
Theorem 6.

For $(D_1,D_2,D_s)\in\mathcal{D}_0$, the rate-distortion function for the above correlated sources is

R(D_1,D_2,D_s)=h_b(p_1)+h_b(p_2)-h_b\left(\min\left\{D_1,D_s^0\right\}\right)-h_b(D_2), (19)

where $D_s^0=\frac{D_s-p}{1-2p}$ is defined in (16).

Proof.

The proof is given in Appendix D. ∎

From the distributions of $(X_1,X_2)$ and $(X_1,Y)$ and the Markov chain $Y-X_1-X_2$, it is easy to check that $(X_2,Y)$ is doubly symmetric with parameter $p_3=p_1\star p_2\triangleq p_1(1-p_2)+p_2(1-p_1)$. Then, comparing (19) with the rate of separate compression, we have for $(D_1,D_2,D_s)\in\mathcal{D}_0$ that

R_{\text{2d},X_1|Y}(D_1,D_s)+R_{X_2|Y}(D_2)
=\min_{\substack{\mathbb{E}d_1(X_1,\hat{X}_1)\leq D_1\\ \mathbb{E}d_s'(X_1,\hat{S})\leq D_s}}I(X_1;\hat{X}_1,\hat{S}|Y)+\min_{\mathbb{E}d_2(X_2,\hat{X}_2)\leq D_2}I(X_2;\hat{X}_2|Y)
=\big[h_b(p_2)-h_b(\min\{D_1,D_s^0\})\big]+\big[h_b(p_1\star p_2)-h_b(D_2)\big]
\geq R(D_1,D_2,D_s), (20)

where the last inequality follows from the fact that $h_b(\cdot)$ is increasing on $[0,0.5]$ and $p_1\star p_2\geq p_1$ for $0\leq p_1\leq 0.5$.

Note that equality in (20) holds only for $p_1=0.5$, which together with $Y-X_1-X_2$ implies the Markov chain $X_1-Y-X_2$ in Lemma 3. Then the problem reduces to that in Theorem 5. For $p_1<0.5$, compressing $X_1$ and $X_2$ simultaneously is strictly better than separate compression.
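The gap between joint and separate compression quantified by (19) and (20) is easy to evaluate numerically; the sketch below (example parameters only, names ours) prints both rates for a point in $\mathcal{D}_0$.

```python
import numpy as np

def hb(q):
    return 0.0 if q <= 0 or q >= 1 else -q*np.log2(q) - (1-q)*np.log2(1-q)

def star(a, b):
    """Binary convolution a*b = a(1-b) + b(1-a)."""
    return a*(1-b) + b*(1-a)

def rate_joint(D1, D2, Ds, p, p1, p2):
    """Theorem 6, expression (19), valid for small distortions in D0."""
    Ds0 = (Ds - p) / (1 - 2*p)
    return hb(p1) + hb(p2) - hb(min(D1, Ds0)) - hb(D2)

def rate_separate(D1, D2, Ds, p, p1, p2):
    """Upper bound (20): separate compression of X1 and X2 given Y."""
    Ds0 = (Ds - p) / (1 - 2*p)
    return hb(p2) - hb(min(D1, Ds0)) + hb(star(p1, p2)) - hb(D2)

args = dict(D1=0.03, D2=0.02, Ds=0.28, p=0.25, p1=0.25, p2=0.25)
print(round(rate_joint(**args), 4), round(rate_separate(**args), 4))
# The gap hb(p1*p2) - hb(p1) > 0 for p1 < 0.5 shows joint compression is strictly better.
```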

V Case Study: Binary Classification of Integers

Consider classifying integers as even or odd. Let $X_1$ be uniformly distributed over $\mathcal{X}_1=[1:N]$ with $N\geq 4$ even. The semantic information $S$ is a binary random variable that probabilistically indicates whether $X_1$ is even or odd. The transition probability can be defined by the BSC in Fig. 6, which is similar to that in Fig. 4 with the values of $X_1$ replaced by “even” and “odd”.

[Figure: BSC with input $S\in\{0,1\}$ and output $X_1\in\{\text{even},\text{odd}\}$; crossover probability $p$.]
Figure 6: Transition probability from the binary semantic information $S$ to the integer $X_1$, in terms of a BSC.

The binary side information $Y$ is correlated with $X_1$ and also indicates its parity (even/odd), similar to Fig. 6 but with parameter $p_2$. Assume the Markov chain $X_1-Y-X_2$ holds and that the Bernoulli($\frac{1}{2}$) source $X_2$ is independent of $Y$. We can verify that $X_2$ is independent of $(X_1,Y)$. By Lemma 3, compressing $X_1^n$ and $X_2^n$ simultaneously is the same as compressing them separately. For simplicity, we consider only small distortions in the set

\mathcal{D}_1=\Big\{(D_1,D_2,D_s):0\leq\min\{D_1,D_s^0\}\leq\frac{2(N-1)p_2}{N}\text{ and }0\leq D_2\leq 0.5\Big\}. (21)
Theorem 7.

For $(D_1,D_2,D_s)\in\mathcal{D}_1$, the rate-distortion function for integer classification is

R(D_1,D_2,D_s)=\big[h_b(p_2)+\log(N/2)-h_b(\min\{D_1,D_s^0\})-D_1\log(N-1)\big]+\big[1-h_b(D_2)\big],

where $D_s^0=\frac{D_s-p}{1-2p}$ is defined in (16).

Proof.

The proof is given in Appendix E. ∎
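For reference, the closed form in Theorem 7 can be evaluated as follows (rates in bits; the parameter values are only an example matching the setting of Fig. 9).

```python
import numpy as np

def hb(q):
    return 0.0 if q <= 0 or q >= 1 else -q*np.log2(q) - (1-q)*np.log2(1-q)

def rate_classification(D1, D2, Ds, p, p2, N):
    """Theorem 7, valid for (D1, D2, Ds) in the small-distortion set (21)."""
    Ds0 = (Ds - p) / (1 - 2*p)
    r1 = hb(p2) + np.log2(N/2) - hb(min(D1, Ds0)) - D1*np.log2(N - 1)
    r2 = 1.0 - hb(D2)
    return r1 + r2

# Hypothetical example: p = p2 = 0.25, N = 8, D2 = 0.5 as in Fig. 9
print(round(rate_classification(D1=0.05, D2=0.5, Ds=0.3, p=0.25, p2=0.25, N=8), 4))
```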

VI Case Study: Gaussian Sources

Assume $S$ and $X_1$ are jointly Gaussian sources with zero mean and covariance matrix

\begin{bmatrix}\sigma_S&\sigma_{SX_1}\\ \sigma_{SX_1}&\sigma_{X_1}\end{bmatrix}. (22)

Similarly, we assume the Markov chain $X_1-Y-X_2$, where $X_2$ and $Y$ are jointly Gaussian sources with zero mean and covariance matrix

\begin{bmatrix}\sigma_{X_2}&\sigma_{X_2Y}\\ \sigma_{X_2Y}&\sigma_Y\end{bmatrix}. (23)

Thus $X_1$ is conditionally independent of $X_2$ given $Y$. Let the covariance of $X_1$ and $Y$ be $\sigma_{X_1Y}$. The reconstructions are real scalars, i.e., $\hat{\mathcal{X}}_1=\hat{\mathcal{X}}_2=\hat{\mathcal{S}}=\mathbb{R}$. The distortion measures are squared error.

We see from Lemma 3 that compressing $X_1^n$ and $X_2^n$ simultaneously is the same as compressing them separately in terms of the optimal compression rate and distortions. Then we have the following theorem.

Theorem 8.

For the Gaussian sources, if the Markov chain $X_1-Y-X_2$ holds, the rate-distortion function is

R(D_1,D_2,D_s)=\frac{1}{2}\left(\log\frac{\sigma_{X_2}-\frac{\sigma_{X_2Y}^2}{\sigma_Y}}{D_2}\right)^++\frac{1}{2}\left[\log\max\left(\frac{\sigma_{X_1}-\frac{\sigma_{X_1Y}^2}{\sigma_Y}}{D_1},\frac{\sigma_{SX_1}^2\left(\sigma_{X_1}-\frac{\sigma_{X_1Y}^2}{\sigma_Y}\right)}{\sigma_{X_1}^2\left(D_s-\mathrm{mmse}\right)}\right)\right]^+, (24)

where $\mathrm{mmse}$ is the minimum mean squared error for estimating $S$ from $X_1$, given by

\mathrm{mmse}=\sigma_S-\frac{\sigma_{SX_1}^2}{\sigma_{X_1}}. (25)
Proof.

The proof is given in Appendix F. ∎
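The Gaussian expression (24) is straightforward to evaluate; the sketch below (rates in nats, names ours) also reproduces the minimum rate of about 0.20 quoted in Section VII-C for the parameters used there, provided $D_s>\mathrm{mmse}$.

```python
import numpy as np

def rate_gaussian(D1, D2, Ds, var_S, var_X1, var_X2, var_Y,
                  cov_SX1, cov_X1Y, cov_X2Y):
    """Theorem 8 (rate in nats), assuming the Markov chain X1 - Y - X2 and Ds > mmse."""
    mmse = var_S - cov_SX1**2 / var_X1                 # MMSE of estimating S from X1, (25)
    err_X2 = var_X2 - cov_X2Y**2 / var_Y               # conditional variance of X2 given Y
    err_X1 = var_X1 - cov_X1Y**2 / var_Y               # conditional variance of X1 given Y
    r2 = 0.5 * max(np.log(err_X2 / D2), 0.0)
    term_s = cov_SX1**2 * err_X1 / (var_X1**2 * (Ds - mmse))
    r1 = 0.5 * max(np.log(max(err_X1 / D1, term_s)), 0.0)
    return r1 + r2

# Parameters of Section VII-C: all variances 2, all covariances 1, D2 = 1.
print(round(rate_gaussian(D1=0.5, D2=1.0, Ds=2.0,
                          var_S=2, var_X1=2, var_X2=2, var_Y=2,
                          cov_SX1=1, cov_X1Y=1, cov_X2Y=1), 3))
# The D2 term alone equals 0.5*log(1.5), about 0.20, the minimum rate quoted in Section VII-C.
```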

VII Numerical Results

In this section, we plot the rate-distortion curves evaluated in the previous sections.

VII-A Correlated Binary Sources

Figure 7: Rate-distortion function for correlated binary sources: $p=p_1=p_2=0.25$ and $D_2=0.5$ (3-D plot).
Figure 8: Rate-distortion function for correlated binary sources: $p=p_1=p_2=0.25$ and $D_2=0.5$; (a) $D_1=0.03$, (b) $D_1=0.05$.

Consider the rate-distortion function for correlated binary sources in Theorem 6 with $p=p_1=p_2=0.25$ and $D_2=0.5$. Fig. 7 shows the 3-D plot of the optimal tradeoff between the coding rate $R$ and the distortions $(D_1,D_s)$. We can see that the rate-distortion function is decreasing and convex in $(D_1,D_s)$ for distortions in $\mathcal{D}_0$ (cf. (18)).

The truncated curves with $D_1=0.03$ and $D_1=0.05$ are shown in Fig. 8. We see that the rate is decreasing in $D_s$ until it achieves the minimum rate, which is determined by $D_1$. Similar curves can also be obtained by truncating at some constant $D_s$.

VII-B Binary Classification of Integers

The rate-distortion function for integer classification in Theorem 7 with $p=p_1=p_2=0.25$, $D_2=0.5$, and $N=8$ is illustrated in Fig. 9. Note that in both Theorem 6 and Theorem 7, $D_2=0.5$ indicates that $X_2$ can be recovered by random guessing, which further implies that $X_2$ can also be regarded as side information at both sides.

Comparing the rates along the $D_1$ and $D_s$ axes in Fig. 9, we see that recovering only the semantic information can reduce the coding rate compared with recovering the source message.

Comparing Fig. 9 with Fig. 7, we see that the rate for integer classification decreases faster as $D_1$ increases (which is clearer at the minimum $D_s=p$). This implies that $D_1$ is more dominant in determining the rate here, which is intuitive since the integer source has a larger alphabet and recovering it under different distortions requires a larger range of rates.

Figure 9: Rate-distortion function for integer classification: $p=p_1=p_2=0.25$, $D_2=0.5$, and $N=8$.

VII-C Gaussian Sources

Consider the rate-distortion function for Gaussian sources in Theorem 8. Let all of the variances be $2$, all of the covariances be $1$, and $D_2=1$.

The 3-D plot of the optimal tradeoff between the coding rate $R$ and the distortions $(D_1,D_s)$ is illustrated in Fig. 10. We can see that the rate-distortion function is decreasing and convex in $(D_1,D_s)$. The minimum rate is equal to

R_{X_2|Y}(D_2)=\frac{1}{2}\left(\log\frac{\sigma_{X_2}-\frac{\sigma_{X_2Y}^2}{\sigma_Y}}{D_2}\right)^+=0.20.

The contour plot of the rate-distortion function is shown in Fig. 11. The slanted line denotes the situations where $R_{X_1|Y}(D_1)=R_{S|Y}(D_s)$. We see that when $D_1$ is more dominant (the region above the slanted line), the rate only needs to meet the distortion constraint for reconstructing $X_1$. On the contrary, when $D_s$ is more dominant (the region below the slanted line), the rate only needs to meet the distortion constraint for reconstructing $S$.

Figure 10: Rate-distortion function for Gaussian sources: $\sigma_S=\sigma_{X_1}=\sigma_{X_2}=\sigma_Y=2$, $\sigma_{SX_1}=\sigma_{X_1Y}=\sigma_{X_2Y}=1$, and $D_2=1$.
Figure 11: The contour plot of the rate-distortion function.

VIII Conclusion

In this paper, we studied the semantic rate-distortion problem with side information, motivated by task-oriented video compression. The general rate-distortion function was characterized. We also evaluated several cases with specific sources and distortion measures. It would be desirable to derive the rate-distortion function for real video sources, which is more challenging due to the high complexity of real source models and the choice of meaningful distortion measures. This part of the work is under investigation.

Appendix A Proof of Theorem 1

The achievability part is a straightforward extension of the joint typicality coding scheme for lossy source coding. We briefly present the coding idea and analysis as follows. Fix a conditional pmf $p(\hat{x}_1,\hat{x}_2,\hat{s}|x_1,x_2,y)$ such that the distortion constraints are satisfied: $\mathbb{E}d_1(X_1,\hat{X}_1)\leq D_1$, $\mathbb{E}d_2(X_2,\hat{X}_2)\leq D_2$, and $\mathbb{E}d_s'(X_1,\hat{S})\leq D_s$. Let $p(\hat{x}_1,\hat{x}_2,\hat{s}|y)=\sum_{x_1,x_2}p(x_1,x_2|y)p(\hat{x}_1,\hat{x}_2,\hat{s}|x_1,x_2,y)$. Randomly and independently generate $2^{nR}$ sequence triples $(\hat{x}_1^n,\hat{x}_2^n,\hat{s}^n)$ indexed by $m\in[1:2^{nR}]$, each according to $p(\hat{x}_1,\hat{x}_2,\hat{s}|y)$. The whole codebook $\mathcal{C}$, consisting of these sequence triples, is revealed to both the encoder and decoder. Upon observing the source sequences $(x_1^n,x_2^n,y^n)$, the encoder finds an index $m$ such that its indexed triple $(\hat{x}_1^n,\hat{x}_2^n,\hat{s}^n)$ satisfies $(x_1^n,x_2^n,y^n,\hat{x}_1^n,\hat{x}_2^n,\hat{s}^n)\in\mathcal{T}_\epsilon^n$. If there is more than one such index, it randomly chooses one of them; if there is no such index, it sets $m=1$. Upon receiving the index $m$, the decoder reconstructs the messages and the inference by choosing the codeword $(\hat{x}_1^n,\hat{x}_2^n,\hat{s}^n)$ indexed by $m$. By the law of large numbers, the source sequences are jointly typical with probability approaching 1 as $n\rightarrow\infty$. We define the “encoding error” event as

\mathcal{E}=\left\{\left(X_1^n,X_2^n,Y^n,\hat{X}_1^n,\hat{X}_2^n,\hat{S}^n\right)\notin\mathcal{T}_\epsilon^n,~\forall m\in\left[1:2^{nR}\right]\right\}. (26)

Then we can bound the error probability as follows

P(\mathcal{E})
=P\left\{\left(x_1^n,x_2^n,y^n,\hat{X}_1^n,\hat{X}_2^n,\hat{S}^n\right)\notin\mathcal{T}_\epsilon^n,~\forall m\in\left[1:2^{nR}\right]\right\}
=\prod_{m=1}^{2^{nR}}P\left\{\left(x_1^n,x_2^n,y^n,\hat{X}_1^n,\hat{X}_2^n,\hat{S}^n\right)\notin\mathcal{T}_\epsilon^n\right\}
=\left(1-P\left\{\left(x_1^n,x_2^n,y^n,\hat{X}_1^n,\hat{X}_2^n,\hat{S}^n\right)\in\mathcal{T}_\epsilon^n\right\}\right)^{2^{nR}}
\leq\sum_{(x_1^n,x_2^n,y^n)\in\mathcal{T}_\epsilon^n}p(x_1^n,x_2^n,y^n)\left(1-2^{-n[I(X_1,X_2;\hat{X}_1,\hat{X}_2,\hat{S}|Y)+\delta(\epsilon)]}\right)^{2^{nR}}
\leq\exp\left(-2^{n[R-I(X_1,X_2;\hat{X}_1,\hat{X}_2,\hat{S}|Y)-\delta(\epsilon)]}\right),

where $\delta(\epsilon)\rightarrow 0$ as $n\rightarrow\infty$, the first inequality follows from the joint typicality lemma in [36], and the last inequality follows from the fact that $(1-z)^t\leq\exp(-tz)$ for $z\in[0,1]$ and $t\geq 0$. We see that $P(\mathcal{E})\rightarrow 0$ as $n\rightarrow\infty$ if $R>I(X_1,X_2;\hat{X}_1,\hat{X}_2,\hat{S}|Y)+\delta(\epsilon)$. If the error event does not happen, i.e., the reconstruction is jointly typical with the source sequences, then by the distortion constraints assumed for the conditional pmf, the expected distortions can achieve $D_1$, $D_2$, and $D_s$, respectively. This proves the achievability.

Define $R_I(D_1,D_2,D_s)$ as the rate-distortion function characterized by Theorem 1. For the converse part, we show that

nR\geq H(W)\geq H(W|Y^n)\geq I(X_1^n,X_2^n;W|Y^n)
\geq I(X_1^n,X_2^n;\hat{X}_1^n,\hat{X}_2^n,\hat{S}^n|Y^n)
=I(X_1^n,X_2^n,Y^n;\hat{X}_1^n,\hat{X}_2^n,\hat{S}^n)-I(Y^n;\hat{X}_1^n,\hat{X}_2^n,\hat{S}^n)
=\sum_{i=1}^n\left[I(X_{1,i},X_{2,i},Y_i;\hat{X}_1^n,\hat{X}_2^n,\hat{S}^n|X_{1,1}^{i-1},X_{2,1}^{i-1},Y_1^{i-1})-I(Y_i;\hat{X}_1^n,\hat{X}_2^n,\hat{S}^n|Y_1^{i-1})\right]
=\sum_{i=1}^n\left[I(X_{1,i},X_{2,i},Y_i;\hat{X}_1^n,\hat{X}_2^n,\hat{S}^n,X_{1,1}^{i-1},X_{2,1}^{i-1},Y_1^{i-1})-I(Y_i;\hat{X}_1^n,\hat{X}_2^n,\hat{S}^n,Y_1^{i-1})\right]
=\sum_{i=1}^n\left[I(X_{1,i},X_{2,i};\hat{X}_{1,i},\hat{X}_{2,i},\hat{S}_i|Y_i)+I(X_{1,i},X_{2,i},Y_i;\hat{X}_{1,1}^{i-1},\hat{X}_{1,i+1}^n,\hat{X}_{2,1}^{i-1},\hat{X}_{2,i+1}^n,\hat{S}_1^{i-1},\hat{S}_{i+1}^n,X_{1,1}^{i-1},X_{2,1}^{i-1},Y_1^{i-1}|\hat{X}_{1,i},\hat{X}_{2,i},\hat{S}_i)-I(Y_i;\hat{X}_{1,1}^{i-1},\hat{X}_{1,i+1}^n,\hat{X}_{2,1}^{i-1},\hat{X}_{2,i+1}^n,\hat{S}_1^{i-1},\hat{S}_{i+1}^n,Y_1^{i-1}|\hat{X}_{1,i},\hat{X}_{2,i},\hat{S}_i)\right]
=\sum_{i=1}^n\left[I(X_{1,i},X_{2,i};\hat{X}_{1,i},\hat{X}_{2,i},\hat{S}_i|Y_i)+I(X_{1,i},X_{2,i},Y_i;X_{1,1}^{i-1},X_{2,1}^{i-1}|\hat{X}_1^n,\hat{X}_2^n,\hat{S}^n,Y_1^i)+I(X_{1,i},X_{2,i};\hat{X}_{1,1}^{i-1},\hat{X}_{1,i+1}^n,\hat{X}_{2,1}^{i-1},\hat{X}_{2,i+1}^n,\hat{S}_1^{i-1},\hat{S}_{i+1}^n,Y_1^{i-1}|\hat{X}_{1,i},\hat{X}_{2,i},\hat{S}_i,Y_i)\right]
\geq\sum_{i=1}^n I(X_{1,i},X_{2,i};\hat{X}_{1,i},\hat{X}_{2,i},\hat{S}_i|Y_i) (27)
\geq\sum_{i=1}^n R_I\left(\mathbb{E}d_1(X_{1,i},\hat{X}_{1,i}),\mathbb{E}d_2(X_{2,i},\hat{X}_{2,i}),\mathbb{E}d_s'(X_{1,i},\hat{S}_i)\right) (28)
\geq nR_I\left(\mathbb{E}d_1(X_1^n,\hat{X}_1^n),\mathbb{E}d_2(X_2^n,\hat{X}_2^n),\mathbb{E}d_s'(X_1^n,\hat{S}^n)\right) (29)
\geq nR_I\left(\mathbb{E}d_1(X_1^n,\hat{X}_1^n),\mathbb{E}d_2(X_2^n,\hat{X}_2^n),\mathbb{E}d_s(S^n,\hat{S}^n)\right) (30)
\geq nR_I(D_1,D_2,D_s), (31)

where (27) follows from the nonnegativity of mutual information, (28) follows from the definition of $R_I(D_1,D_2,D_s)$, (29) follows from the convexity of $R_I(D_1,D_2,D_s)$, (30) follows from $\mathbb{E}d_s'(X_1^n,\hat{S}^n)=\mathbb{E}d_s(S^n,\hat{S}^n)$, which is proved in [29], and the last inequality follows from the non-increasing property of $R_I(D_1,D_2,D_s)$. This completes the converse proof.

Appendix B Proof of Lemma 3

The Markov chain $X_1-Y-X_2$ indicates that

H(X_2|X_1,Y)=H(X_2|Y). (32)

Then the mutual information in (10) can be bounded by

I(X_1,X_2;\hat{X}_1,\hat{X}_2,\hat{S}|Y)
=H(X_1,X_2|Y)-H(X_1,X_2|\hat{X}_1,\hat{X}_2,\hat{S},Y)
=H(X_1|Y)+H(X_2|Y)-H(X_1|\hat{X}_1,\hat{X}_2,\hat{S},Y)-H(X_2|X_1,\hat{X}_1,\hat{X}_2,\hat{S},Y)
\geq H(X_1|Y)+H(X_2|Y)-H(X_1|\hat{X}_1,\hat{S},Y)-H(X_2|\hat{X}_2,Y)
=I(X_1;\hat{X}_1,\hat{S}|Y)+I(X_2;\hat{X}_2|Y),

where the inequality follows from the fact that conditioning does not increase entropy. Now, we have

R(D_1,D_2,D_s)
=\min_{\substack{p(\hat{x}_1,\hat{x}_2,\hat{s}|x_1,x_2,y):\\ \mathbb{E}d_1(X_1,\hat{X}_1)\leq D_1\\ \mathbb{E}d_2(X_2,\hat{X}_2)\leq D_2\\ \mathbb{E}d_s'(X_1,\hat{S})\leq D_s}}I(X_1,X_2;\hat{X}_1,\hat{X}_2,\hat{S}|Y)
\geq\min_{\substack{p(\hat{x}_1,\hat{x}_2,\hat{s}|x_1,x_2,y):\\ \mathbb{E}d_1(X_1,\hat{X}_1)\leq D_1\\ \mathbb{E}d_2(X_2,\hat{X}_2)\leq D_2\\ \mathbb{E}d_s'(X_1,\hat{S})\leq D_s}}\left[I(X_1;\hat{X}_1,\hat{S}|Y)+I(X_2;\hat{X}_2|Y)\right]
=\min_{\substack{p(\hat{x}_1,\hat{s}|x_1,y):\\ \mathbb{E}d_1(X_1,\hat{X}_1)\leq D_1\\ \mathbb{E}d_s'(X_1,\hat{S})\leq D_s}}I(X_1;\hat{X}_1,\hat{S}|Y)+\min_{\substack{p(\hat{x}_2|x_2,y):\\ \mathbb{E}d_2(X_2,\hat{X}_2)\leq D_2}}I(X_2;\hat{X}_2|Y)
=R_{\text{2d},X_1|Y}(D_1,D_s)+R_{X_2|Y}(D_2).

For the other direction, we show that the rate-distortion quadruple $\big(R_{\text{2d},X_1|Y}(D_1,D_s)+R_{X_2|Y}(D_2),D_1,D_2,D_s\big)$ is achievable. To see this, let $p^*(\hat{x}_1,\hat{s}|x_1,y)$ and $p^*(\hat{x}_2|x_2,y)$ be the optimal distributions that achieve the rate-distortion tuples $\big(R_{\text{2d},X_1|Y}(D_1,D_s),D_1,D_s\big)$ and $\big(R_{X_2|Y}(D_2),D_2\big)$, respectively. Now consider the distribution $p^*(x_1,x_2,\hat{x}_1,\hat{x}_2,\hat{s}|y)\triangleq p^*(x_1,\hat{x}_1,\hat{s}|y)p^*(x_2,\hat{x}_2|y)$, which requires the Markov chain $(X_1,\hat{X}_1,\hat{S})-Y-(X_2,\hat{X}_2)$ and is consistent with the condition $X_1-Y-X_2$. Then the corresponding random variables satisfy

I(X_1,X_2;\hat{X}_1,\hat{X}_2,\hat{S}|Y)
=H(X_1|Y)+H(X_2|Y)-H(X_1|\hat{X}_1,\hat{X}_2,\hat{S},Y)-H(X_2|X_1,\hat{X}_1,\hat{X}_2,\hat{S},Y)
=H(X_1|Y)+H(X_2|Y)-H(X_1|\hat{X}_1,\hat{S},Y)-H(X_2|\hat{X}_2,Y)
=I(X_1;\hat{X}_1,\hat{S}|Y)+I(X_2;\hat{X}_2|Y)
=R_{\text{2d},X_1|Y}(D_1,D_s)+R_{X_2|Y}(D_2),

where the first equality follows from (32), the second equality follows from the Markov chain $(X_1,\hat{X}_1,\hat{S})-Y-(X_2,\hat{X}_2)$, and the last equality follows from the optimality of $p^*(\hat{x}_1,\hat{s}|x_1,y)$ and $p^*(\hat{x}_2|x_2,y)$. Lastly, by the minimization in the expression of the rate-distortion function in (10), we conclude that $R(D_1,D_2,D_s)\leq R_{\text{2d},X_1|Y}(D_1,D_s)+R_{X_2|Y}(D_2)$, which completes the proof of the lemma.

Appendix C Proof of Lemma 4

Since we are in the binary Hamming setting, we first calculate the values of $d_s'(x_1,\hat{s})$ (cf. (14)):

d_s'(0,0)=\frac{1}{p(x_1=0)}\sum_{s=0,1}p(x_1=0,s)d_s(s,0)
=\frac{1}{p(x_1=0)}\big[p(x_1=0,s=0)d_s(0,0)+p(x_1=0,s=1)d_s(1,0)\big]
=\frac{p(x_1=0,s=1)}{p(x_1=0)}
=p(s=1|x_1=0)
=p.

The other values follow similarly, and we obtain the distortion function

d_s'(x_1,\hat{s})=\begin{cases}p,&\text{if }\hat{s}=x_1\\ 1-p,&\text{if }\hat{s}\neq x_1.\end{cases} (33)

Then

\mathbb{E}d_s'(X_1,\hat{S})=\sum_{x_1,\hat{s}}p(x_1,\hat{s})d_s'(x_1,\hat{s})
=P(X_1\neq\hat{S})\cdot(1-p)+P(X_1=\hat{S})\cdot p
=P(X_1\neq\hat{S})(1-2p)+p.

The distortion constraint $\mathbb{E}d_s'(X_1,\hat{S})\leq D_s$ for $D_s\geq p$ implies $P(X_1\neq\hat{S})\leq\frac{D_s-p}{1-2p}$. Now we can follow the rate-distortion evaluation for a Bernoulli source with Hamming distortion in [37, 35], only changing the probability $P(X_1\neq\hat{S})$, and obtain

R_s(D_s)=\left[1-h_b\left(\frac{D_s-p}{1-2p}\right)\right]\cdot\mathds{1}_{p\leq D_s\leq 0.5}. (34)

This proves the lemma.

Appendix D Proof of Theorem 6

Note that separately compressing correlated sources is not optimal in general, i.e., the statement in Lemma 3 does not hold here. We therefore need to evaluate the mutual information in (10) over joint distributions, and we have

I(X_1,X_2;\hat{X}_1,\hat{X}_2,\hat{S}|Y)
=H(X_1,X_2|Y)-H(X_1,X_2|\hat{X}_1,\hat{X}_2,\hat{S},Y)
=H(X_1)+H(X_2|X_1)+H(Y|X_1)-H(Y)-H(X_1,X_2|\hat{X}_1,\hat{X}_2,\hat{S},Y)
=h_b(p_1)+h_b(p_2)-H(X_1\oplus\hat{X}_1,X_2\oplus\hat{X}_2|\hat{X}_1,\hat{X}_2,\hat{S},Y) (35)
\geq h_b(p_1)+h_b(p_2)-H(X_1\oplus\hat{X}_1,X_2\oplus\hat{X}_2)
\geq h_b(p_1)+h_b(p_2)-H(X_1\oplus\hat{X}_1)-H(X_2\oplus\hat{X}_2)
\geq h_b(p_1)+h_b(p_2)-h_b(D_1)-h_b(D_2),

where $\oplus$ denotes modulo-2 addition, the second equality follows from the Markov chain $Y-X_1-X_2$, the first inequality follows from the fact that conditioning does not increase entropy, and the last inequality follows from $P(X_1\oplus\hat{X}_1=1)=P(X_1\neq\hat{X}_1)\leq D_1$, $P(X_2\oplus\hat{X}_2=1)=P(X_2\neq\hat{X}_2)\leq D_2$, and the fact that $h_b(D)$ is increasing in $D$ for $0\leq D\leq 0.5$. By switching the roles of $\hat{X}_1$ and $\hat{S}$, i.e., replacing the conditional entropy in (35) by $H(X_1\oplus\hat{S},X_2\oplus\hat{X}_2|\hat{X}_1,\hat{X}_2,\hat{S},Y)$, we similarly obtain

I(X_1,X_2;\hat{X}_1,\hat{X}_2,\hat{S}|Y)\geq h_b(p_1)+h_b(p_2)-h_b\left(D_s^0\right)-h_b(D_2).

Thus, the rate-distortion function is lower bounded by

R(D_1,D_2,D_s)\geq h_b(p_1)+h_b(p_2)-h_b(\min\{D_1,D_s^0\})-h_b(D_2). (36)

We now show that the lower bound is tight by finding a joint distribution $p(x_1,x_2,y,\hat{x}_1,\hat{x}_2,\hat{s})$ that meets the distortion constraints and achieves the above lower bound.

[Figure: test channel diagram from $\hat{Z}_1\hat{Z}_2\in\{00,01,10,11\}$ to $Z_1Z_2\in\{00,01,10,11\}$.]
Figure 12: Test channel from $\hat{Z}_1\hat{Z}_2$ to $Z_1Z_2$: $Z_1\sim$ Ber($p_1$), $Z_2\sim$ Ber($p_2$), $Z_1,Z_2$ mutually independent, $\hat{Z}_1\hat{Z}_2\sim(q_1,q_2,q_3,q_4)$, and the transition probability $p(z_1,z_2|\hat{z}_1,\hat{z}_2)$ is given by (37).

Let $Z_i=Y\oplus X_i$ and $\hat{Z}_i=Y\oplus\hat{X}_i$ for $i=1,2$. Then there is a one-to-one correspondence between $p(x_1,x_2,y,\hat{x}_1,\hat{x}_2,\hat{s})$ and $p(z_1,z_2,y,\hat{z}_1,\hat{z}_2,\hat{s})$. For $Z_1,Z_2$ generated from the source distribution $p(x_1,x_2,y)$, it is easy to check that $(Z_1,Z_2)\sim[(1-p_1)(1-p_2),\,p_1(1-p_2),\,p_1p_2,\,(1-p_1)p_2]$, as shown in Fig. 12.

Next, we construct the desired joint distribution using the test channel in Fig. 12 as follows. For $p_1,p_2\in[0,0.5]$, $(D_1,D_2,D_s)\in\mathcal{D}_0$ (cf. (18)), and $D_1\leq D_s^0$, consider the joint distribution $p(x_1,x_2,y,\hat{x}_1,\hat{x}_2,\hat{s})$ defined by the following conditions:

  i. $\hat{S}=\hat{X}_{1}$;

  ii. the Markov chain $Y-(\hat{Z}_{1}\hat{Z}_{2})-(Z_{1}Z_{2})$;

  iii. the test channel in Fig. 12 with the conditional probability $p(z_{1},z_{2}|\hat{z}_{1},\hat{z}_{2})$ given by

    p(z_{1},z_{2}|\hat{z}_{1},\hat{z}_{2})=
    \begin{cases}
    (1-D_{1})(1-D_{2}), & \text{if }z_{1}=\hat{z}_{1}\text{ and }z_{2}=\hat{z}_{2}\\
    (1-D_{1})D_{2}, & \text{if }z_{1}=\hat{z}_{1}\text{ and }z_{2}\neq\hat{z}_{2}\\
    D_{1}(1-D_{2}), & \text{if }z_{1}\neq\hat{z}_{1}\text{ and }z_{2}=\hat{z}_{2}\\
    D_{1}D_{2}, & \text{if }z_{1}\neq\hat{z}_{1}\text{ and }z_{2}\neq\hat{z}_{2};
    \end{cases}   (37)

  iv. in order for $Z_{1},Z_{2}$ to follow the independent Bernoulli distributions, we choose the distribution of $\hat{Z}_{1}\hat{Z}_{2}$ as (38)-(41):

    q_{1}=\frac{[(1-p_{2})-D_{1}][(1-p_{1}-p_{2}+2p_{1}p_{2})-D_{2}]+p_{2}(1-2p_{1})(1-p_{2})}{(1-2D_{1})(1-2D_{2})}   (38)
    q_{2}=\frac{[(1-p_{2})-D_{1}][(p_{1}+p_{2}-2p_{1}p_{2})-D_{2}]+p_{2}(2p_{1}-1)(1-p_{2})}{(1-2D_{1})(1-2D_{2})}   (39)
    q_{3}=\frac{(p_{2}-D_{1})[(1-p_{1}-p_{2}+2p_{1}p_{2})-D_{2}]+p_{2}(2p_{1}-1)(1-p_{2})}{(1-2D_{1})(1-2D_{2})}   (40)
    q_{4}=\frac{(p_{2}-D_{1})[(p_{1}+p_{2}-2p_{1}p_{2})-D_{2}]+p_{2}(1-2p_{1})(1-p_{2})}{(1-2D_{1})(1-2D_{2})}   (41)

    We can verify that $q_{i}\geq 0$, $i=1,2,3,4$, for $(D_{1},D_{2},D_{s})\in\mathcal{D}_{0}$.

Now it remains to verify that the above distribution achieves the rate in (36) and the distortions $D_{1}$, $D_{2}$, and $D_{s}$. From conditions i and ii, we have

I(X_{1},X_{2};\hat{X}_{1},\hat{X}_{2},\hat{S}|Y)
=H(X_{1},X_{2}|Y)-H(X_{1},X_{2}|\hat{X}_{1},\hat{X}_{2},Y)
=H(X_{1},X_{2}|Y)-H(X_{1},X_{2}|\hat{X}_{1},\hat{X}_{2})
=h_{b}(p_{1})+h_{b}(p_{2})-h_{b}(D_{1})-h_{b}(D_{2}).

From (37) and (33), it is easy to calculate the expected distortions as

\mathbb{E}d_{1}(X_{1},\hat{X}_{1})=D_{1}   (42)
\mathbb{E}d_{2}(X_{2},\hat{X}_{2})=D_{2}   (43)
\mathbb{E}d_{s}^{\prime}(X_{1},\hat{S})=(1-D_{1})p+D_{1}(1-p)\leq D_{s}.   (44)

On the other hand, if $D_{1}\geq D_{s}^{0}$, we can construct the joint distribution $p(x_{1},x_{2},y,\hat{x}_{1},\hat{x}_{2},\hat{s})$ through four analogous conditions by switching the roles of $(\hat{X}_{1},D_{1})$ and $(\hat{S},D_{s}^{0})$. The corresponding rate and distortions then follow accordingly, which proves the theorem.

Appendix E Proof of Theorem 7

Since Lemma 3 holds here, we first calculate the rate-distortion function for $X_{2}$, which is

R_{X_{2}|Y}(D_{2})=\big[1-h_{b}(D_{2})\big]\cdot\mathds{1}_{\{0\leq D_{2}\leq 0.5\}}.   (45)

Now it remains to calculate $R_{\text{2d},X_{1}|Y}(D_{1},D_{s})$. We first consider the case $D_{1}\leq D_{s}^{0}$ and provide a lower bound on the mutual information as follows:

I(X_{1};\hat{X}_{1},\hat{S}|Y)
\geq I(X_{1};\hat{X}_{1}|Y)
=H(X_{1}|Y)-H(X_{1}|\hat{X}_{1},Y)
\geq H(X_{1}|Y)-H(X_{1}|\hat{X}_{1})
=h_{b}(p_{2})+\log(N/2)-H(X_{1}|\hat{X}_{1})
\geq h_{b}(p_{2})+\log(N/2)-h_{b}(D_{1})-D_{1}\log(N-1),

where the last inequality follows from $P(\hat{X}_{1}\neq X_{1})\leq D_{1}$, the fact that $h_{b}(x)$ is increasing for $x\in[0,0.5]$, and the fact that the uniform distribution maximizes entropy. (Note that the above lower bound can also be obtained by directly applying the log-sum inequality.)

Next, we show that the lower bound is tight by finding a joint distribution that achieves the above rate and the distortions $D_{1}$ and $D_{s}$. For $0\leq D_{1}\leq\frac{2(N-1)p_{2}}{N}$, we choose $\hat{S}=\hat{X}_{1}$ and construct $(X_{1},Y,\hat{X}_{1})$ via the test channel $p(x_{1}|\hat{x}_{1})$ in Fig. 13 and the Markov chain $Y-\hat{X}_{1}-X_{1}$. The conditional probability of the test channel in Fig. 13 is given as

p(x_{1}|\hat{x}_{1})=\begin{cases}1-D_{1}, & \text{if }x_{1}=\hat{x}_{1}\\ \frac{D_{1}}{N-1}, & \text{if }x_{1}\neq\hat{x}_{1}.\end{cases}   (46)
Figure 13: Test channel from $\hat{X}_{1}$ to $X_{1}$: $X_{1}\sim\mathrm{Uniform}[1:N]$ (probability $\frac{1}{N}$ for each value), $\hat{X}_{1}\sim(q_{1},\ldots,q_{N})$, and the transition probability $p(x_{1}|\hat{x}_{1})$ is given by (46).

Solving the equations

q_{i}(1-D_{1})+\Big(\sum_{j\in[1:N],\,j\neq i}q_{j}\Big)\frac{D_{1}}{N-1}=\frac{1}{N},\qquad i\in[1:N],

we obtain $q_{i}=\frac{1}{N}$ for $i\in[1:N]$, i.e., $\hat{X}_{1}$ is also uniformly distributed over $[1:N]$. For the joint distribution of $Y$ and $\hat{X}_{1}$, we define

p(y|\hat{x}_{1})=\begin{cases}\dfrac{p_{2}-\frac{ND_{1}}{2(N-1)}}{1-\frac{ND_{1}}{N-1}}, & \text{if }y=0,\ \hat{x}_{1}\text{ is odd, or }y=1,\ \hat{x}_{1}\text{ is even}\\[2ex] \dfrac{1-p_{2}-\frac{ND_{1}}{2(N-1)}}{1-\frac{ND_{1}}{N-1}}, & \text{if }y=0,\ \hat{x}_{1}\text{ is even, or }y=1,\ \hat{x}_{1}\text{ is odd}.\end{cases}

We see that $p(y|\hat{x}_{1})\geq 0$ for any $D_{1}\leq\frac{2(N-1)p_{2}}{N}$, i.e., $(D_{1},D_{2},D_{s})\in\mathcal{D}_{1}$. Then, using $p(y|x_{1})=\sum_{\hat{x}_{1}}p(y|x_{1},\hat{x}_{1})p(\hat{x}_{1}|x_{1})=\sum_{\hat{x}_{1}}p(y|\hat{x}_{1})p(\hat{x}_{1}|x_{1})$ (by the Markov chain $Y-\hat{X}_{1}-X_{1}$), we can verify that the above distribution induces the same conditional probability $p(y|x_{1})\in\{p_{2},1-p_{2}\}$ as defined at the beginning of this section. Thus, we have constructed a feasible $p(x_{1},y,\hat{x}_{1})$ that achieves expected distortion $\mathbb{E}d_{1}(X_{1},\hat{X}_{1})=D_{1}$ (which also satisfies the semantic distortion constraint, since $D_{1}\leq D_{s}^{0}$) and mutual information

I(X_{1};\hat{X}_{1},\hat{S}|Y)
=I(X_{1};\hat{X}_{1}|Y)
=H(X_{1}|Y)-H(X_{1}|\hat{X}_{1},Y)
=H(X_{1}|Y)-H(X_{1}|\hat{X}_{1})
=h_{b}(p_{2})+\log(N/2)-h_{b}(D_{1})-D_{1}\log(N-1),

where the first equality follows from $\hat{S}=\hat{X}_{1}$, the third equality follows from the Markov chain $Y-\hat{X}_{1}-X_{1}$, and the last equality follows from the joint distribution of $(X_{1},Y)$ and the distribution in (46).
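The following is a minimal numerical sanity check of this construction (not part of the proof), assuming hypothetical example values $N=8$, $p_{2}=0.3$, $D_{1}=0.1$; all variable names are ours. It builds the joint distribution of $(\hat{X}_{1},X_{1},Y)$ from the uniform $q$, the test channel (46), and $p(y|\hat{x}_{1})$ above, then confirms that $X_{1}$ is uniform, that the induced $p(y|x_{1})$ equals $p_{2}$ or $1-p_{2}$ according to the parity of $x_{1}$, and that $I(X_{1};\hat{X}_{1}|Y)$ matches $h_{b}(p_{2})+\log(N/2)-h_{b}(D_{1})-D_{1}\log(N-1)$ with logarithms base 2.

```python
import numpy as np

# Hypothetical example parameters: N even, 0 <= D1 <= 2(N-1)p2/N.
N, p2, D1 = 8, 0.3, 0.1

def hb(p):
    return 0.0 if p in (0.0, 1.0) else float(-p*np.log2(p) - (1-p)*np.log2(1-p))

def H(pmf):
    """Entropy (bits) of a pmf given as an array of probabilities."""
    pmf = np.asarray(pmf).ravel()
    pmf = pmf[pmf > 0]
    return float(-(pmf * np.log2(pmf)).sum())

# Test channel (46): rows indexed by x1_hat in [1:N], columns by x1 in [1:N].
P_x1_given_x1hat = np.full((N, N), D1 / (N - 1)) + np.eye(N) * (1 - D1 - D1 / (N - 1))

# Solve the linear system q^T P = (1/N, ..., 1/N); the solution is uniform.
q = np.linalg.solve(P_x1_given_x1hat.T, np.full(N, 1.0 / N))
assert np.allclose(q, 1.0 / N)

# p(y | x1_hat) as defined above ("odd"/"even" refer to the value x1_hat in [1:N]).
a = (p2 - N * D1 / (2 * (N - 1))) / (1 - N * D1 / (N - 1))      # y = 0, x1_hat odd
b = (1 - p2 - N * D1 / (2 * (N - 1))) / (1 - N * D1 / (N - 1))  # y = 0, x1_hat even
P_y_given_x1hat = np.array([[a, 1 - a] if (i + 1) % 2 else [b, 1 - b] for i in range(N)])

# Joint p(x1_hat, x1, y) from q, (46), and the Markov chain Y - X1_hat - X1.
joint = q[:, None, None] * P_x1_given_x1hat[:, :, None] * P_y_given_x1hat[:, None, :]

p_x1 = joint.sum(axis=(0, 2))
p_y_given_x1 = joint.sum(axis=0) / p_x1[:, None]
assert np.allclose(p_x1, 1.0 / N)                       # X1 is uniform
assert np.allclose(p_y_given_x1[0::2, 0], p2)           # odd x1:  P(Y=0|x1) = p2
assert np.allclose(p_y_given_x1[1::2, 0], 1 - p2)       # even x1: P(Y=0|x1) = 1-p2
assert np.isclose((q * np.diag(P_x1_given_x1hat)).sum(), 1 - D1)  # Hamming distortion = D1

# I(X1; X1_hat | Y) from the joint distribution, versus the closed-form expression.
I = H(joint.sum(axis=0)) + H(joint.sum(axis=1)) - H(joint.sum(axis=(0, 1))) - H(joint)
formula = hb(p2) + np.log2(N / 2) - hb(D1) - D1 * np.log2(N - 1)
print(I, formula)   # the two values should coincide
```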

For the other case, $D_{1}\geq D_{s}^{0}$, we only need to switch the roles of $(\hat{X}_{1},D_{1})$ and $(\hat{S},D_{s}^{0})$; the corresponding rate and distortions then follow accordingly, which proves the theorem.

Appendix F Proof of Theorem 8

The rate-distortion function in Theorem 1 satisfies

R(D_{1},D_{2},D_{s})=R_{\text{2d},X_{1}|Y}(D_{1},D_{s})+R_{X_{2}|Y}(D_{2}).   (47)

The second term is the solution to the quadratic Gaussian source coding problem with side information [36, Chapter 11], given as

R_{X_{2}|Y}(D_{2})=\frac{1}{2}\left(\log\frac{\sigma_{X_{2}|Y}}{D_{2}}\right)^{+},   (48)

where $\sigma_{X_{2}|Y}$ is the conditional variance of $X_{2}$ given $Y$. Note that $X_{2}$ is Gaussian conditioned on $Y$, i.e., $X_{2}|Y\sim\mathcal{N}\big(\frac{\sigma_{X_{2}Y}}{\sigma_{Y}}Y,\ \sigma_{X_{2}}-\frac{\sigma_{X_{2}Y}^{2}}{\sigma_{Y}}\big)$, which implies $\sigma_{X_{2}|Y}=\sigma_{X_{2}}-\frac{\sigma_{X_{2}Y}^{2}}{\sigma_{Y}}$.
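This conditional-variance identity can be checked with a quick Monte Carlo sketch; the covariance values below are hypothetical, and the $\sigma$ symbols denote variances/covariances as in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical covariance parameters: var_* are variances, cov_x2y a covariance.
var_x2, var_y, cov_x2y = 2.0, 1.5, 0.9

# Sample jointly Gaussian (X2, Y) and compare the residual variance of the linear
# MMSE estimate of X2 from Y with the closed-form conditional variance.
cov = np.array([[var_x2, cov_x2y], [cov_x2y, var_y]])
x2, y = rng.multivariate_normal([0.0, 0.0], cov, size=1_000_000).T

residual = x2 - (cov_x2y / var_y) * y
print(residual.var())                    # empirical sigma_{X2|Y}
print(var_x2 - cov_x2y**2 / var_y)       # formula: sigma_{X2} - sigma_{X2Y}^2 / sigma_Y
```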

For the first term, note that $R_{\text{2d},X_{1}|Y}(D_{1},D_{s})$ is lower bounded by both

R_{X_{1}|Y}(D_{1})=\min_{\mathbb{E}d_{1}(X_{1},\hat{X}_{1})\leq D_{1}}I(X_{1};\hat{X}_{1}|Y),   (49)

and

R_{S|Y}(D_{s})=\min_{\mathbb{E}d_{S}^{\prime}(X_{1},\hat{S})\leq D_{s}}I(X_{1};\hat{S}|Y).   (50)

Clearly, (49) is the quadratic Gaussian source coding problem with side information and is evaluated in the same way as (48).

We proceed to calculate (50), which is in fact the semantic rate-distortion function of indirect source coding with side information. Given $(X_{1},Y)$, $S$ is conditionally Gaussian with $S|(X_{1},Y)\sim\mathcal{N}\big(\frac{\sigma_{SX_{1}}}{\sigma_{X_{1}}}X_{1},\ \sigma_{S}-\frac{\sigma_{SX_{1}}^{2}}{\sigma_{X_{1}}}\big)$. It is shown in [42] that the semantic distortion can be rewritten as

\mathbb{E}d_{S}^{\prime}(X_{1},\hat{S})=\mathbb{E}\big[(S-\tilde{S}_{\mathrm{MMSE}})^{2}\big]+\mathbb{E}\big[(\tilde{S}_{\mathrm{MMSE}}-\hat{S})^{2}\big],   (51)

where $\tilde{S}_{\mathrm{MMSE}}=\frac{\sigma_{SX_{1}}}{\sigma_{X_{1}}}X_{1}$ is the MMSE estimate of $S$ upon observing $X_{1}$ and $Y$, and the first term on the right-hand side is the corresponding minimum mean squared error ($\mathrm{mmse}$, cf. (25)), i.e.,

\mathbb{E}\big[(S-\tilde{S}_{\mathrm{MMSE}})^{2}\big]=\mathrm{mmse}=\sigma_{S}-\frac{\sigma_{SX_{1}}^{2}}{\sigma_{X_{1}}}.   (52)
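The decomposition (51)-(52) can likewise be illustrated numerically. The sketch below is not part of the proof: it assumes hypothetical values of $\sigma_{S}$, $\sigma_{X_{1}}$, $\sigma_{SX_{1}}$ and an arbitrary reconstruction $\hat{S}$ formed from $X_{1}$; since the MMSE error is orthogonal to any function of the observation, the two sides of (51) agree.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical variances/covariance of the jointly Gaussian pair (S, X1).
var_s, var_x1, cov_sx1 = 1.0, 2.0, 0.8

cov = np.array([[var_s, cov_sx1], [cov_sx1, var_x1]])
s, x1 = rng.multivariate_normal([0.0, 0.0], cov, size=1_000_000).T

s_mmse = (cov_sx1 / var_x1) * x1     # MMSE estimate of S from X1
s_hat = 0.5 * x1 + 0.1               # an arbitrary reconstruction built from X1

lhs = np.mean((s - s_hat) ** 2)                                     # E d'_S(X1, S_hat)
rhs = np.mean((s - s_mmse) ** 2) + np.mean((s_mmse - s_hat) ** 2)   # right side of (51)
print(lhs, rhs)                                                     # agree up to MC error
print(np.mean((s - s_mmse) ** 2), var_s - cov_sx1**2 / var_x1)      # empirical vs (52)
```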

Then we consider a specific encoder which first estimates the semantic information using the MMSE estimator and then compresses the estimate, with side information, under a mean squared error distortion constraint of $D_{s}-\mathrm{mmse}$. The resulting achievable rate provides an upper bound on $R_{S|Y}(D_{s})$:

R_{S|Y}(D_{s})\leq\frac{1}{2}\left(\log\frac{\sigma_{\tilde{S}_{\mathrm{MMSE}}|Y}}{D_{s}-\mathrm{mmse}}\right)^{+}
=\frac{1}{2}\left(\log\frac{\sigma_{SX_{1}}^{2}\sigma_{X_{1}|Y}}{\sigma_{X_{1}}^{2}(D_{s}-\mathrm{mmse})}\right)^{+}
=\frac{1}{2}\left(\log\frac{\sigma_{SX_{1}}^{2}\left(\sigma_{X_{1}}-\frac{\sigma_{X_{1}Y}^{2}}{\sigma_{Y}}\right)}{\sigma_{X_{1}}^{2}(D_{s}-\mathrm{mmse})}\right)^{+}.   (53)

Furthermore, for $D_{s}\geq\mathrm{mmse}+\frac{\sigma_{SX_{1}}^{2}\sigma_{X_{1}|Y}}{\sigma_{X_{1}}^{2}}$, we have $R_{S|Y}(D_{s})=0$, which can be obtained by setting $\hat{S}=\mathbb{E}\big[\tilde{S}_{\mathrm{MMSE}}|Y\big]=\frac{\sigma_{SX_{1}}}{\sigma_{X_{1}}}\mathbb{E}[X_{1}|Y]$. For $D_{s}<\mathrm{mmse}+\frac{\sigma_{SX_{1}}^{2}\sigma_{X_{1}|Y}}{\sigma_{X_{1}}^{2}}$, we derive a lower bound on $R_{S|Y}(D_{s})$ as follows:

R_{S|Y}(D_{s})\geq I(X_{1};\hat{S}|Y)=H(X_{1}|Y)-H(X_{1}|\hat{S},Y)
=\frac{1}{2}\log\left(2\pi e\sigma_{X_{1}|Y}\right)-H\!\left(X_{1}-\tfrac{\sigma_{X_{1}}}{\sigma_{SX_{1}}}\hat{S}\,\middle|\,\hat{S},Y\right)
\geq\frac{1}{2}\log\left(2\pi e\sigma_{X_{1}|Y}\right)-H\!\left(X_{1}-\tfrac{\sigma_{X_{1}}}{\sigma_{SX_{1}}}\hat{S}\right)
\geq\frac{1}{2}\log\left(2\pi e\sigma_{X_{1}|Y}\right)-\frac{1}{2}\log\left(2\pi e\,\mathbb{E}\!\left[\left(X_{1}-\tfrac{\sigma_{X_{1}}}{\sigma_{SX_{1}}}\hat{S}\right)^{2}\right]\right)   (54)
\geq\frac{1}{2}\log\left(2\pi e\sigma_{X_{1}|Y}\right)-\frac{1}{2}\log\left(2\pi e\,\frac{\sigma_{X_{1}}^{2}(D_{s}-\mathrm{mmse})}{\sigma_{SX_{1}}^{2}}\right)   (55)
=\frac{1}{2}\log\frac{\sigma_{SX_{1}}^{2}\sigma_{X_{1}|Y}}{\sigma_{X_{1}}^{2}(D_{s}-\mathrm{mmse})}
=\frac{1}{2}\log\frac{\sigma_{SX_{1}}^{2}\left(\sigma_{X_{1}}-\frac{\sigma_{X_{1}Y}^{2}}{\sigma_{Y}}\right)}{\sigma_{X_{1}}^{2}(D_{s}-\mathrm{mmse})},   (56)

where (54) is due to the fact that the Gaussian distribution maximizes differential entropy for a given variance, and (55) follows from (51) and the semantic distortion constraint. Combining the upper bound in (53) and the lower bound in (56), we obtain

R_{S|Y}(D_{s})=\frac{1}{2}\left(\log\frac{\sigma_{SX_{1}}^{2}\left(\sigma_{X_{1}}-\frac{\sigma_{X_{1}Y}^{2}}{\sigma_{Y}}\right)}{\sigma_{X_{1}}^{2}(D_{s}-\mathrm{mmse})}\right)^{+}.   (57)

Thus we have

R(D_{1},D_{2},D_{s})\geq\max\big\{R_{X_{1}|Y}(D_{1}),R_{S|Y}(D_{s})\big\}+R_{X_{2}|Y}(D_{2})
=\frac{1}{2}\left(\log\frac{\sigma_{X_{2}}-\frac{\sigma_{X_{2}Y}^{2}}{\sigma_{Y}}}{D_{2}}\right)^{+}+\frac{1}{2}\left[\log\max\left(\frac{\sigma_{X_{1}}-\frac{\sigma_{X_{1}Y}^{2}}{\sigma_{Y}}}{D_{1}},\ \frac{\sigma_{SX_{1}}^{2}\left(\sigma_{X_{1}}-\frac{\sigma_{X_{1}Y}^{2}}{\sigma_{Y}}\right)}{\sigma_{X_{1}}^{2}(D_{s}-\mathrm{mmse})}\right)\right]^{+}.
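Before turning to achievability, the following sketch simply evaluates this lower bound for hypothetical covariance and distortion values (all names are ours; logarithms are taken base 2, and $D_{s}>\mathrm{mmse}$ is required for the semantic term).

```python
import numpy as np

def pos(x):
    """(x)^+ = max(x, 0)."""
    return max(x, 0.0)

# Hypothetical variances/covariances (sigma_* as in the text) and distortion targets.
var_s, var_x1, var_x2, var_y = 1.0, 2.0, 1.5, 1.2
cov_sx1, cov_x1y, cov_x2y = 0.8, 0.7, 0.6
D1, D2, Ds = 0.3, 0.4, 0.8

mmse = var_s - cov_sx1**2 / var_x1               # (52); Ds > mmse is assumed below
var_x1_given_y = var_x1 - cov_x1y**2 / var_y
var_x2_given_y = var_x2 - cov_x2y**2 / var_y

r_x2 = 0.5 * pos(np.log2(var_x2_given_y / D2))                            # (48)
r_x1 = 0.5 * pos(np.log2(var_x1_given_y / D1))                            # (49)
r_s  = 0.5 * pos(np.log2(cov_sx1**2 * var_x1_given_y
                         / (var_x1**2 * (Ds - mmse))))                    # (57)

# max{(log a)^+, (log b)^+} = (log max{a, b})^+, so this equals the displayed bound.
rate_lb = max(r_x1, r_s) + r_x2
print(f"R(D1, D2, Ds) >= {rate_lb:.4f} bits per symbol")
```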

To show the achievability, consider the following two cases.

  • For $\frac{D_{s}-\mathrm{mmse}}{\sigma_{SX_{1}}^{2}}\geq\frac{D_{1}}{\sigma_{X_{1}}^{2}}$, we first reconstruct $\hat{X}_{1}$ and $\hat{X}_{2}$ subject to the distortion constraints $D_{1}$ and $D_{2}$, and hence achieve $R_{X_{1}|Y}(D_{1})+R_{X_{2}|Y}(D_{2})$. Then we recover the semantic information by $\hat{S}=\frac{\sigma_{SX_{1}}}{\sigma_{X_{1}}}\hat{X}_{1}$ (a numerical sketch of this mapping follows the list), and the semantic distortion satisfies

    \mathbb{E}d_{S}^{\prime}(X_{1},\hat{S})=\mathrm{mmse}+\mathbb{E}\left[\left(\frac{\sigma_{SX_{1}}}{\sigma_{X_{1}}}X_{1}-\frac{\sigma_{SX_{1}}}{\sigma_{X_{1}}}\hat{X}_{1}\right)^{2}\right]
    \leq\mathrm{mmse}+\frac{\sigma_{SX_{1}}^{2}}{\sigma_{X_{1}}^{2}}D_{1}
    \leq D_{s}.
  • For $\frac{D_{s}-\mathrm{mmse}}{\sigma_{SX_{1}}^{2}}<\frac{D_{1}}{\sigma_{X_{1}}^{2}}$, we first reconstruct $\hat{S}$ and $\hat{X}_{2}$ subject to the distortion constraints $D_{s}$ and $D_{2}$, and hence achieve $R_{S|Y}(D_{s})+R_{X_{2}|Y}(D_{2})$. Then we recover $\hat{X}_{1}=\frac{\sigma_{X_{1}}}{\sigma_{SX_{1}}}\hat{S}$, and the distortion satisfies

    \mathbb{E}d_{1}(X_{1},\hat{X}_{1})=\mathbb{E}\left[\left(X_{1}-\frac{\sigma_{X_{1}}}{\sigma_{SX_{1}}}\hat{S}\right)^{2}\right]
    =\frac{\sigma_{X_{1}}^{2}}{\sigma_{SX_{1}}^{2}}\mathbb{E}\left[\left(\tilde{S}_{\mathrm{MMSE}}-\hat{S}\right)^{2}\right]
    =\frac{\sigma_{X_{1}}^{2}}{\sigma_{SX_{1}}^{2}}\left(\mathbb{E}d_{S}^{\prime}(X_{1},\hat{S})-\mathrm{mmse}\right)
    \leq\frac{\sigma_{X_{1}}^{2}}{\sigma_{SX_{1}}^{2}}(D_{s}-\mathrm{mmse})
    <D_{1}.
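As referred to in the first case above, here is a minimal Monte Carlo sketch of the mapping $\hat{S}=\frac{\sigma_{SX_{1}}}{\sigma_{X_{1}}}\hat{X}_{1}$; the covariance values and the Gaussian forward test channel used to generate $\hat{X}_{1}$ are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical parameters for jointly Gaussian (S, X1) and a target distortion D1.
var_s, var_x1, cov_sx1, D1 = 1.0, 2.0, 0.8, 0.3

cov = np.array([[var_s, cov_sx1], [cov_sx1, var_x1]])
s, x1 = rng.multivariate_normal([0.0, 0.0], cov, size=1_000_000).T

# Stand-in reconstruction with E[(X1 - X1_hat)^2] = D1 (Gaussian forward test channel).
x1_hat = (1 - D1 / var_x1) * x1 + rng.normal(0.0, np.sqrt(D1 * (1 - D1 / var_x1)), x1.size)

# First-case mapping: recover the semantic information from X1_hat.
s_hat = (cov_sx1 / var_x1) * x1_hat

mmse = var_s - cov_sx1**2 / var_x1
lhs = np.mean((s - s_hat) ** 2)                                          # E d'_S(X1, S_hat)
rhs = mmse + (cov_sx1**2 / var_x1**2) * np.mean((x1 - x1_hat) ** 2)      # first display above
print(np.mean((x1 - x1_hat) ** 2))   # ~ D1
print(lhs, rhs)                      # agree; both at most mmse + (sigma_SX1/sigma_X1)^2 * D1
```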

This establishes the achievability and thus completes the proof.

References

  • [1] C. E. Shannon, “A mathematical theory of communication,” The Bell System Technical Journal, vol. 27, no. 3, pp. 379–423, 1948.
  • [2] ——, “Coding theorems for a discrete source with a fidelity criterion,” IRE Nat. Conv. Rec., pt. 4, vol. 7, pp. 142–163, 1959.
  • [3] R. M. Gray, “Conditional rate-distortion theory,” Stanford Electron. Lab., Stanford, CA, USA, Tech. Rep. 6502-2, Oct. 1972.
  • [4] ——, “A new class of lower bounds to information rates of stationary sources via conditional rate-distortion functions,” IEEE Transactions on Information Theory, vol. 19, no. 4, pp. 480–489, 1973.
  • [5] A. Wyner and J. Ziv, “The rate-distortion function for source coding with side information at the decoder,” IEEE Transactions on Information Theory, vol. 22, no. 1, pp. 1–10, 1976.
  • [6] A. D. Wyner, “The rate-distortion function for source coding with side information at the decoder—II: General sources,” Information and Control, vol. 38, no. 1, pp. 60–80, 1978.
  • [7] A. Kaspi, “Rate-distortion function when side-information may be present at the decoder,” IEEE Transactions on Information Theory, vol. 40, no. 6, pp. 2031–2034, 1994.
  • [8] H. Permuter and T. Weissman, “Source coding with a side information “vending machine”,” IEEE Transactions on Information Theory, vol. 57, no. 7, pp. 4530–4544, 2011.
  • [9] S. Watanabe, “The rate-distortion function for product of two sources with side-information at decoders,” IEEE Transactions on Information Theory, vol. 59, no. 9, pp. 5678–5691, 2013.
  • [10] C. Heegard and T. Berger, “Rate distortion when side information may be absent,” IEEE Transactions on Information Theory, vol. 31, no. 6, pp. 727–734, 1985.
  • [11] A. Kimura and T. Uyematsu, “Multiterminal source coding with complementary delivery,” in 2006 IEEE International Symposium on Information Theory and its Applications (ISITA), Seoul, Korea, Oct. 2006.
  • [12] M. Tasto and P. Wintz, “A bound on the rate-distortion function and application to images,” IEEE Transactions on Information Theory, vol. 18, no. 1, pp. 150–159, 1972.
  • [13] A. Aaron, S. Rane, R. Zhang, and B. Girod, “Wyner-Ziv coding for video: Applications to compression and error resilience,” in Proc. Data Compression Conference (DCC), 2003, pp. 93–102.
  • [14] J. D. Gibson and J. Hu, Rate Distortion Bounds for Voice and Video.   Foundations and Trends in Communications and Information Theory, Jan. 2014, vol. 10, no. 4.
  • [15] Y. Wang, A. Reibman, and S. Lin, “Multiple description coding for video delivery,” Proceedings of the IEEE, vol. 93, no. 1, pp. 57–70, 2005.
  • [16] T.-C. Chen, P. Fleischer, and K.-H. Tzou, “Multiple block-size transform video coding using a subband structure,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 1, no. 1, pp. 59–71, 1991.
  • [17] B. Zeng, Introduction to Digital Image and Video Compression and Processing, ser. The Morgan Kaufmann Series in Multimedia Information and Systems.   San Francisco, CA, USA: Morgan Kaufmann, 2002.
  • [18] A. Habibian, T. V. Rozendaal, J. Tomczak, and T. Cohen, “Video compression with rate-distortion autoencoders,” in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 7032–7041.
  • [19] A. Habibi and P. Wintz, “Image coding by linear transformation and block quantization,” IEEE Transactions on Communication Technology, vol. 19, no. 1, pp. 50–62, 1971.
  • [20] J. Hu and J. D. Gibson, “Rate distortion bounds for blocking and intra-frame prediction in videos,” in 2009 Conference Record of the Forty-Third Asilomar Conference on Signals, Systems and Computers, 2009, pp. 573–577.
  • [21] B. Güler, A. Yener, and A. Swami, “The semantic communication game,” IEEE Transactions on Cognitive Communications and Networking, vol. 4, no. 4, pp. 787–802, 2018.
  • [22] Y. Blau and T. Michaeli, “Rethinking lossy compression: The rate-distortion-perception tradeoff,” in Proceedings of the 36th International Conference on Machine Learning (ICML), vol. 97, Long Beach, CA, USA, Jun. 2019, pp. 675–685.
  • [23] H. Xie and Z. Qin, “A lite distributed semantic communication system for internet of things,” IEEE Journal on Selected Areas in Communications, vol. 39, no. 1, pp. 142–153, 2021.
  • [24] Z. Weng and Z. Qin, “Semantic communication systems for speech transmission,” IEEE Journal on Selected Areas in Communications, vol. 39, no. 8, pp. 2434–2444, 2021.
  • [25] W. Wang, J. Wang, and J. Chen, “Adaptive block-based compressed video sensing based on saliency detection and side information,” Entropy, vol. 23, no. 9, 2021.
  • [26] Y. Guo, Y. Liu, T. Georgiou, and M. S. Lew, “A review of semantic segmentation using deep neural networks,” International Journal of Multimedia Information Retrieval, vol. 7, pp. 87–93, 2017.
  • [27] B. Zhao, J. Feng, X. Wu, and S. Yan, “A survey on deep learning-based fine-grained object classification and semantic segmentation,” International Journal of Automation and Computing, vol. 14, pp. 119–135, 2017.
  • [28] H. Witsenhausen, “Indirect rate distortion problems,” IEEE Transactions on Information Theory, vol. 26, no. 5, pp. 518–521, 1980.
  • [29] J. Liu, W. Zhang, and H. V. Poor, “A rate-distortion framework for characterizing semantic information,” in 2021 IEEE International Symposium on Information Theory (ISIT), Melbourne, Australia, 2021, pp. 2894–2899.
  • [30] P. A. Stavrou and M. Kountouris, “A rate distortion approach to goal-oriented communication,” in 2022 IEEE International Symposium on Information Theory (ISIT), Espoo, Finland, Jun. 2022, pp. 778–783.
  • [31] J. Liu, W. Zhang, and H. V. Poor, “An indirect rate-distortion characterization for semantic sources: General model and the case of Gaussian observation,” arXiv preprint, 2022. [Online]. Available: https://arxiv.org/abs/2201.12477
  • [32] P. A. Stavrou and M. Kountouris, “The role of fidelity in goal-oriented semantic communication: A rate distortion approach,” TechRxiv. Preprint, 2022. [Online]. Available: https://doi.org/10.36227/techrxiv.20098970.v1
  • [33] T. Berger, Rate Distortion Theory: A Mathematical Basis for Data Compression.   Englewood Cliffs, NJ, USA: Prentice-Hall, 1971.
  • [34] ——, “Multiterminal source coding,” in The Information Theory Approach to Communications (CISM Courses and Lectures, no. 229), G. Longo, Ed.   Vienna/New York: Springer-Verlag, 1978.
  • [35] R. W. Yeung, Information Theory and Network Coding.   New York, NY, USA: Springer, 2008.
  • [36] A. El Gamal and Y.-H. Kim, Network Information Theory.   Cambridge University Press, 2011.
  • [37] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. (Wiley Series in Telecommunications and Signal Processing).   Hoboken, NJ, USA: Wiley-Interscience, 2006.
  • [38] A. El Gamal and T. Cover, “Achievable rates for multiple descriptions,” IEEE Transactions on Information Theory, vol. 28, no. 6, pp. 851–857, 1982.
  • [39] D. Slepian and J. Wolf, “Noiseless coding of correlated information sources,” IEEE Transactions on Information Theory, vol. 19, no. 4, pp. 471–480, 1973.
  • [40] M. Carter, “Source coding of composite sources,” Ph.D. Dissertation, Department of Computer Information and Control Engineering, University of Michigan, Ann Arbor, MI, 1984.
  • [41] A. Kipnis, S. Rini, and A. J. Goldsmith, “The indirect rate-distortion function of a binary i.i.d. source,” in 2015 IEEE Information Theory Workshop - Fall (ITW), 2015, pp. 352–356.
  • [42] J. Wolf and J. Ziv, “Transmission of noisy information to a noisy receiver with minimum distortion,” IEEE Transactions on Information Theory, vol. 16, no. 4, pp. 406–411, 1970.