
Semantic Compression with Side Information: A Rate-Distortion Perspective

Tao Guo, Member, IEEE, Yizhu Wang, Jie Han, Huihui Wu, Member, IEEE,
Bo Bai, Senior Member, IEEE, and Wei Han, Member, IEEE
Abstract

We consider the semantic rate-distortion problem motivated by task-oriented video compression. The semantic information corresponding to the task is not observable to the encoder and affects the observations through a joint probability distribution. The similarities among segments within a frame and across frames in video compression are formulated as side information available at both the encoder and the decoder. The decoder is interested in recovering the observation and making an inference of the semantic information under certain distortion constraints.

We establish the information-theoretic limits for the tradeoff between compression rates and distortions by fully characterizing the rate-distortion function. We further evaluate the rate-distortion function under specific Markov conditions for three scenarios: i) both the task and the observation are binary sources; ii) the task is a binary classification of an integer observation as even or odd; iii) the task and the observation are correlated Gaussian sources. We also illustrate through numerical results that recovering only the semantic information can reduce the coding rate compared with recovering the source observation.

Index Terms:
Semantic communication, inference, video compression, rate-distortion function.

I Introduction

The rate limit for lossless compression of memoryless sources is the entropy, as shown by Shannon in his landmark paper [1]. Lossy source coding under a given fidelity criterion was also introduced in the same paper. The Shannon rate-distortion function, characterizing the optimal tradeoff between compression rates and distortion measures in terms of mutual information, was then established in [2].

Thereafter, the rate-distortion function was investigated when side information is available at the encoder and/or the decoder; see [3, 4, 5, 6, 7, 8, 9, 10, 11] and references therein. When the side information is available only at the encoder, no rate reduction can be achieved. When the side information is available only at the decoder, the corresponding rate-distortion function was characterized by Wyner and Ziv in [5], with extensions discussed in [9, 10, 11, 6, 7, 8]. Finally, if both the encoder and decoder have access to the same side information, the optimal tradeoff is called the conditional rate-distortion function, which was given by [3] and [4].

Lossy source coding theory finds applications in establishing information-theoretic limits for the practical compression of speech signals, images, videos, etc. [12, 13, 14, 15]. Practical techniques for video compression have been explored for decades [16, 17, 18]. Currently, popular standards such as HEVC, VP9, VVC, and AV1 are based on partitioning a picture/frame into coding tree units, which typically correspond to 64x64 or 128x128 pixel areas. Each coding tree unit is then partitioned into coding blocks (segments) [19, 20] following a recursive coding tree representation. These compression schemes exploit both the intra-correlation within one frame and the inter-correlation between consecutive frames.

Nowadays, with the development of high-definition video, 5G communication systems, and the industrial Internet of Things, communication overhead and storage demand have been growing exponentially. As a result, higher compression rates are required, but they cannot be obtained by simply compressing a given source (e.g., a video or image) itself, in light of the rate-distortion limits.

Semantic or task-oriented compression [21, 22, 23, 24], which aims to compress sources efficiently according to a specific communication task (e.g., video detection, inference, classification, decision making), has been viewed as a promising technique for future 6G systems due to its effectiveness. In particular, the goal of semantic compression is to recover only the semantic information necessary for the task instead of every transmitted bit as in Shannon's communication setup, which leads to a significant reduction of coding rates.

In addition to the semantic information of interest, the original sources are also required in some cases, such as video surveillance, for the purpose of evidence storage and verification. An effective approach is to save only the most important or relevant segments of a video; related work on semantic-based image/video segmentation can be found in [25, 26, 27]. Most recently, the classical indirect source coding problem [28] was revisited from the semantic point of view, and the corresponding rate-distortion framework for semantic information was developed in [29, 30, 31, 32].

[Figure: the source $S^n$ passes through $p(x_1|s)$ to produce $(X_1^n,X_2^n)$; the encoder maps the observations to $W$ for storage; the decoder outputs $\hat{X}_1^n(D_1)$ and $\hat{X}_2^n(D_2)$, and a classifier produces $\hat{S}^n(D_s)$; side information $Y^n$ is available at both encoder and decoder.]
Figure 1: Illustration of system model with side information.

The current paper introduces side information into the framework of [29] and completely characterizes the semantic rate-distortion problem with side information. Motivated by task-oriented video compression, the semantic information corresponding to the task is not observable to the encoder. In light of video segmentation, the observed source is partitioned into two parts, and the semantic information influences only the more important part. Moreover, the intra-correlation and inter-correlation are viewed as side information, available at both the encoder and decoder to help compression. The decoder needs to reconstruct the whole source subject to separate distortion constraints on the two parts. The semantic information is then recovered from the source reconstructions at the decoder. Our main contributions are summarized as follows:

  1) We fully characterize the optimal rate-distortion tradeoff. It is further shown that separately compressing the two source parts is optimal if they are independent conditioned on the side information.

  2) The rate-distortion function is evaluated for the inference of a binary source under some specific Markov chains.

  3) We further evaluate the rate-distortion function for binary classification of an integer source. The numerical results show that recovering only the semantic information can reduce the coding rate compared with recovering the source message.

  4) The rate-distortion function for Gaussian sources is also evaluated, which may provide more insights for future simulations of real video compression.

This paper is mainly concerned with the information-theoretic aspects; future work on the limits of real video compression is under investigation.

The rest of the paper is organized as follows. We first formulate the problem and present some preliminary results in Section II. In Section III, we characterize the rate-distortion function and some useful properties. Evaluations of the rate-distortion function for binary, integer, and Gaussian sources with Hamming/mean squared error distortions are presented in Sections IV, V, and VI, respectively. We present and analyze plots of these evaluations in Section VII. The paper is concluded in Section VIII. Some essential proofs can be found in the appendices.

II Problem Formulation and Preliminaries

II-A Problem Formulation

Consider the system model for video detection (inference) that also requires evidence storage, depicted in Fig. 1. The problem is defined as follows. A collection of discrete memoryless sources (DMS) is described by generic random variables $(S,X_1,X_2,Y)$ taking values in finite alphabets $\mathcal{S}\times\mathcal{X}_1\times\mathcal{X}_2\times\mathcal{Y}$ according to the probability distribution $p(x_1,x_2,y)p(s|x_1)$. In particular, this indicates the Markov chain $S-X_1-(X_2,Y)$. We interpret $S$ as a latent variable, which is not observable by the encoder. It can be viewed as the semantic information (e.g., the state of a system), which describes the features of the system. We assume that the observation of the system consists of two parts:

  • $X_1$ varies according to the semantic information $S$ and captures the “appearance” of the features, e.g., the vehicle and red lights in a frame that captures a violation at the crossing;

  • $X_2$ is the background information irrelevant to the features, e.g., buildings in the frame capturing the violation.

$Y$ is side information that can help compression, such as previous frames in the video. For length-$n$ source sequences $(S^n,X_1^n,X_2^n,Y^n)$, the encoder has access only to the observed sequences $(X_1^n,X_2^n,Y^n)$ and encodes them as $W$, which is stored at the server. Upon observing the local information $Y^n$ and receiving $W$, the decoder reconstructs the source sequences as $(\hat{X}_1^n,\hat{X}_2^n)$, taking values in $\hat{\mathcal{X}}_1\times\hat{\mathcal{X}}_2$, within distortions $D_1$ and $D_2$. Given the reconstructions, the classifier is required to recover the semantic information as $\hat{S}^n$ from the alphabet $\hat{\mathcal{S}}$ with distortion constraint $D_s$. Here, for simplicity, we assume a perfect classifier, i.e., it is equivalent to recovering $\hat{S}^n$ directly at the decoder, as illustrated in Fig. 2.

[Figure: $(X_1^n,X_2^n)$ → Encoder → $W$ → Decoder → $\hat{X}_1^n(D_1)$, $\hat{X}_2^n(D_2)$, $\hat{S}^n(D_s)$, with side information $Y^n$ at both encoder and decoder.]
Figure 2: The equivalent system model.

Formally, an $(n,2^{nR})$ code is defined by the encoding function

En:\mathcal{X}_1^n\times\mathcal{X}_2^n\times\mathcal{Y}^n\rightarrow\{1,2,\cdots,2^{nR}\}

and the decoding function

De:\{1,2,\cdots,2^{nR}\}\times\mathcal{Y}^n\rightarrow\hat{\mathcal{X}}_1^n\times\hat{\mathcal{X}}_2^n\times\hat{\mathcal{S}}^n.

Let $\mathbb{R}^+$ be the set of nonnegative real numbers. We consider bounded per-letter distortion functions $d_1:\mathcal{X}_1\times\hat{\mathcal{X}}_1\rightarrow\mathbb{R}^+$, $d_2:\mathcal{X}_2\times\hat{\mathcal{X}}_2\rightarrow\mathbb{R}^+$, and $d_s:\mathcal{S}\times\hat{\mathcal{S}}\rightarrow\mathbb{R}^+$. The distortions between length-$n$ sequences are defined by

d_1(x_1^n,\hat{x}_1^n)\triangleq\frac{1}{n}\sum_{i=1}^n d_1(x_{1,i},\hat{x}_{1,i}),
d_2(x_2^n,\hat{x}_2^n)\triangleq\frac{1}{n}\sum_{i=1}^n d_2(x_{2,i},\hat{x}_{2,i}),
d_s(s^n,\hat{s}^n)\triangleq\frac{1}{n}\sum_{i=1}^n d_s(s_i,\hat{s}_i).

A nonnegative rate-distortion tuple $(R,D_1,D_2,D_s)$ is said to be achievable if for sufficiently large $n$, there exists an $(n,2^{nR})$ code such that

\lim_{n\rightarrow\infty}\mathbb{E}d_1(X_1^n,\hat{X}_1^n)\leq D_1,
\lim_{n\rightarrow\infty}\mathbb{E}d_2(X_2^n,\hat{X}_2^n)\leq D_2,
\lim_{n\rightarrow\infty}\mathbb{E}d_s(S^n,\hat{S}^n)\leq D_s.

The rate-distortion function $R(D_1,D_2,D_s)$ is the infimum of coding rates $R$ such that the rate-distortion tuple $(R,D_1,D_2,D_s)$ is achievable for distortions $(D_1,D_2,D_s)$. Our goal is to characterize the rate-distortion function.

II-B Preliminaries

II-B1 Conditional rate-distortion function

The rate-distortion function was investigated and fully characterized in [2]. Assume the length-$n$ source sequence $X^n$ is independent and identically distributed (i.i.d.) over $\mathcal{X}$ with generic random variable $X$, and let $d:\mathcal{X}\times\hat{\mathcal{X}}\rightarrow\mathbb{R}^+$ be a bounded per-letter distortion measure. The rate-distortion function for a given distortion criterion $D$ is given by

R(D)=\min_{p(\hat{x}|x):\,\mathbb{E}d(X,\hat{X})\leq D}I(X;\hat{X}). (1)

It was proved in [2] and also presented in [33, 34, 35, 36, 37] that $R(D)$ is a non-increasing and convex function of $D$.
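The curve in (1) can also be traced numerically for a general finite-alphabet source by the standard Blahut-Arimoto algorithm. The following Python sketch is our own illustration (function and variable names are ours, not from the paper): it sweeps a Lagrange multiplier to produce $(D,R)$ points; replacing the distortion matrix by the modified measure $d_s'$ introduced later in (14) evaluates the semantic rate-distortion function in the same way.

```python
import numpy as np

def blahut_arimoto(p_x, d, slope, n_iter=500):
    """Blahut-Arimoto iterations for the rate-distortion function (1).

    p_x   : source pmf, shape (|X|,)
    d     : distortion matrix d(x, x_hat), shape (|X|, |X_hat|)
    slope : Lagrange multiplier s < 0; sweeping it traces the R(D) curve
    Returns one (D, R) point on the curve, R in bits.
    Assumes all resulting probabilities stay strictly positive.
    """
    q = np.full(d.shape[1], 1.0 / d.shape[1])        # reproduction pmf q(x_hat)
    for _ in range(n_iter):
        # optimal test channel for the current q
        w = q[None, :] * np.exp(slope * d)           # shape (|X|, |X_hat|)
        p_xhat_given_x = w / w.sum(axis=1, keepdims=True)
        q = p_x @ p_xhat_given_x                     # induced reproduction pmf
    joint = p_x[:, None] * p_xhat_given_x
    D = np.sum(joint * d)
    R = np.sum(joint * np.log2(p_xhat_given_x / q[None, :]))
    return D, R

# Example: Bernoulli(1/2) source with Hamming distortion; R(D) should equal 1 - hb(D).
p_x = np.array([0.5, 0.5])
d = np.array([[0.0, 1.0], [1.0, 0.0]])
for s in [-8.0, -4.0, -2.0]:
    D, R = blahut_arimoto(p_x, d, s)
    print(round(D, 3), round(R, 3))
```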

If both the encoder and decoder are allowed to observe side information $Y^n$ (with generic variable $Y$ over $\mathcal{Y}$ jointly distributed with $X$), as depicted in Fig. 3, then the optimal tradeoff is called the conditional rate-distortion function [33, 3, 4], which is characterized as

R_{X|Y}(D)=\min_{p(\hat{x}|x,y):\,\mathbb{E}d(X,\hat{X})\leq D}I(X;\hat{X}|Y). (2)
[Figure: $X^n$ → Encoder → $W$ → Decoder → $\hat{X}^n(D)$, with side information $Y^n$ at both encoder and decoder.]
Figure 3: Conditional Rate-distortion model.

It is shown in [3] that the conditional rate-distortion function can also be obtained as a weighted sum of the marginal rate-distortion functions of sources with distributions $P_{X|Y}(\cdot|y),\,y\in\mathcal{Y}$, i.e.,

R_{X|Y}(D)=\min_{\{D_y:\,y\in\mathcal{Y}\}:\,\sum_{y\in\mathcal{Y}}p(y)D_y\leq D}\;\sum_{y\in\mathcal{Y}}p(y)R(D_y), (3)

where for any $y\in\mathcal{Y}$, $R(D_y)$ is obtained from (1) by replacing the source distribution with $P_{X|Y}(\cdot|y)$. This property will be useful for evaluating conditional rate-distortion functions of given source distributions. If $(X,Y)$ is a doubly symmetric binary source (DSBS) with parameter $p_0$, i.e.,

p(x,y)=\left[\begin{array}{cc}\frac{1-p_0}{2}&\frac{p_0}{2}\\ \frac{p_0}{2}&\frac{1-p_0}{2}\end{array}\right], (4)

then the conditional rate-distortion function is given in [4] by

R_{X|Y}(D)=\left[h_b(p_0)-h_b(D)\right]\cdot\mathds{1}_{0\leq D\leq p_0}, (5)

where $h_b(q)=-q\log q-(1-q)\log(1-q)$ is the entropy of a Bernoulli($q$) distribution and $\mathds{1}_A$ is the indicator function of the event $A$.
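As a quick numerical companion to (5), the sketch below (helper names are ours; rates in bits) evaluates the conditional rate-distortion function of a DSBS($p_0$) pair under Hamming distortion.

```python
import numpy as np

def hb(q):
    """Binary entropy in bits; hb(0) = hb(1) = 0."""
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * np.log2(q) - (1 - q) * np.log2(1 - q)

def R_X_given_Y_dsbs(D, p0):
    """Conditional rate-distortion function (5) for a DSBS(p0) pair
    under Hamming distortion: hb(p0) - hb(D) for 0 <= D <= p0, else 0."""
    if D < 0:
        raise ValueError("distortion must be nonnegative")
    return 0.0 if D >= p0 else hb(p0) - hb(D)

# Example: p0 = 0.25
for D in [0.0, 0.1, 0.25, 0.4]:
    print(D, round(R_X_given_Y_dsbs(D, 0.25), 4))
```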

II-B2 Rate-distortion function with two constraints

The scenario where we wish to describe the i.i.d. source sequence $X^n$ at rate $R$ and produce two reconstructions $\hat{X}_a^n$ and $\hat{X}_b^n$ with distortion criteria $\mathbb{E}d_a(X^n,\hat{X}_a^n)\leq D_a$ and $\mathbb{E}d_b(X^n,\hat{X}_b^n)\leq D_b$, respectively, was discussed in [38, 37]. The rate-distortion function is given by

R_{\text{2d}}(D_a,D_b)=\min_{\substack{p(\hat{x}_a,\hat{x}_b|x):\\ \mathbb{E}d_a(X,\hat{X}_a)\leq D_a\\ \mathbb{E}d_b(X,\hat{X}_b)\leq D_b}}I(X;\hat{X}_a,\hat{X}_b). (6)

Comparing (1) and (6), we easily see that

\max\{R(D_a),R(D_b)\}\leq R_{\text{2d}}(D_a,D_b)\leq R(D_a)+R(D_b).

For the special case where $\hat{\mathcal{X}}_a=\hat{\mathcal{X}}_b$ and $d_a(x,\hat{x})=d_b(x,\hat{x})$ for all $x\in\mathcal{X}$ and $\hat{x}\in\hat{\mathcal{X}}_a$, it suffices to recover only one sequence $\hat{X}_a^n=\hat{X}_b^n$ with distortion $\min\{D_a,D_b\}$. Then both distortion constraints are satisfied since

\mathbb{E}d_a(X^n,\hat{X}_a^n)=\min\{D_a,D_b\}\leq D_a

and

\mathbb{E}d_b(X^n,\hat{X}_b^n)=\min\{D_a,D_b\}\leq D_b.

This implies

R_{\text{2d}}(D_a,D_b)=R(\min\{D_a,D_b\})=\max\{R(D_a),R(D_b)\}, (7)

where the second equality follows from the non-increasing property of $R(D)$.

When side information is available at the decoder for only one of the two reconstructions, e.g., $\hat{X}_b$, it was proved in [10, 7] that successive encoding (first $\hat{X}_a$, then $\hat{X}_b$) is optimal. For the case where the two reconstructions have access to different side information, the rate-distortion tradeoff was characterized in [10, 11, 9, 36].

II-B3 Rate-distortion function of two sources

The problem of compressing two i.i.d. source sequences $X_a^n$ and $X_b^n$ at the same encoder is considered in [37, Problem 10.14]. The rate-distortion function given therein is

R_{\text{2s}}(D_a,D_b)=\min_{\substack{p(\hat{x}_a,\hat{x}_b|x_a,x_b):\\ \mathbb{E}d_a(X_a,\hat{X}_a)\leq D_a\\ \mathbb{E}d_b(X_b,\hat{X}_b)\leq D_b}}I(X_a,X_b;\hat{X}_a,\hat{X}_b). (8)

It is also shown that for two independent sources, compressing simultaneously is the same as compressing separately in terms of the rate and distortions, i.e.,

R_{\text{2s}}(D_a,D_b)=R(D_a)+R(D_b). (9)

If the two sources are dependent, the equality in (9) can fail; the Slepian-Wolf rate region [39] indicates that the joint entropy of the two source variables is sufficient and optimal for lossless reconstruction. Taking distortions into account, Gray showed via an example in [4] that the compression rate can in general be strictly larger than $R(D_a)+R_{X_b|X_a}(D_b)$. Finally, some related results on compressing compound sources can be found in [40].

III Optimal Rate-distortion Tradeoff

III-A The Rate-distortion Function

Theorem 1.

The rate-distortion function for compression and inference with side information is given as the solution to the following optimization problem

R(D_1,D_2,D_s)=\min I(X_1,X_2;\hat{X}_1,\hat{X}_2,\hat{S}|Y) (10)
\text{s.t. }\mathbb{E}d_1(X_1,\hat{X}_1)\leq D_1 (11)
\mathbb{E}d_2(X_2,\hat{X}_2)\leq D_2 (12)
\mathbb{E}d_s'(X_1,\hat{S})\leq D_s, (13)

where the minimum is taken over all conditional pmfs $p(\hat{x}_1,\hat{x}_2,\hat{s}|x_1,x_2,y)$ and the modified distortion measure is defined by

d_s'(x_1,\hat{s})=\frac{1}{p(x_1)}\sum_{s\in\mathcal{S}}p(x_1,s)d_s(s,\hat{s}). (14)
Proof.

We can interpret the problem as the combination of rate-distortion with two sources (X1X_{1} and X2X_{2}), rate-distortion with two constraints (X1X_{1} is recovered with two constraints D1D_{1} and DsD_{s}), and conditional rate-distortion (conditioning on YY). Then the theorem can be obtained informally by combining the rate-distortion functions in (2), (6), and (8). For completeness, we provide a rigorous technical proof in Appendix A. ∎
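To make the modified distortion measure (14) concrete, the following sketch (illustrative only; the array layout and names are ours) computes $d_s'(x_1,\hat{s})$ from a joint pmf $p(s,x_1)$ and a distortion matrix $d_s(s,\hat{s})$. For the DSBS pair in (15) with Hamming distortion, it reproduces the values derived in (33) of Appendix C.

```python
import numpy as np

# Hypothetical example: joint pmf p(s, x1) for the DSBS(p) pair in (15), p = 0.1,
# and Hamming distortion d_s(s, s_hat).
p = 0.1
p_s_x1 = np.array([[(1 - p) / 2, p / 2],
                   [p / 2, (1 - p) / 2]])      # rows: s, columns: x1
d_s = np.array([[0.0, 1.0],
                [1.0, 0.0]])                   # Hamming distortion d_s(s, s_hat)

def modified_distortion(p_s_x1, d_s):
    """Evaluate d_s'(x1, s_hat) = (1/p(x1)) * sum_s p(x1, s) d_s(s, s_hat), cf. (14)."""
    p_x1 = p_s_x1.sum(axis=0)                  # marginal of x1
    return (p_s_x1.T @ d_s) / p_x1[:, None]    # entry [x1, s_hat]

print(modified_distortion(p_s_x1, d_s))
# For this symmetric example, d_s'(x1, s_hat) = p if s_hat == x1 and 1 - p otherwise,
# matching (33) in Appendix C.
```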

III-B Some Properties

Similar to the rate-distortion function in (1), we collect some properties in the following lemma. The proof simply follows the same procedure as that for (1) in [2, 33, 34, 35, 37, 36]. We omit the details here.

Lemma 2.

The rate-distortion function $R(D_1,D_2,D_s)$ is non-increasing and convex in $(D_1,D_2,D_s)$.

Recall from (9) that compressing two independent sources simultaneously is the same as compressing them separately. One may then ask whether the optimality of separate compression still holds here. We answer this question in the following lemma.

Lemma 3.

If $X_1-Y-X_2$ forms a Markov chain, then

R(D_1,D_2,D_s)=R_{\text{2d},X_1|Y}(D_1,D_s)+R_{X_2|Y}(D_2),

where the conditional rate-distortion function with two constraints is given by

R_{\text{2d},X_1|Y}(D_1,D_s)=\min_{\substack{p(\hat{x}_1,\hat{s}|x_1,y):\\ \mathbb{E}d_1(X_1,\hat{X}_1)\leq D_1\\ \mathbb{E}d_s'(X_1,\hat{S})\leq D_s}}I(X_1;\hat{X}_1,\hat{S}|Y)

and the conditional rate-distortion function is given in (2) and can be written as

R_{X_2|Y}(D_2)=\min_{p(\hat{x}_2|x_2,y):\,\mathbb{E}d_2(X_2,\hat{X}_2)\leq D_2}I(X_2;\hat{X}_2|Y).
Proof.

The proof is given in Appendix B. ∎

Remark 1.

Compared to the independence assumption for the equality in (9), we have the conditional independence $I(X_1;X_2|Y)=0$ in Lemma 3. This is intuitive since, compared to the setting for (9), we have additional side information available to the encoder and decoder.

III-C Rate-distortion Function for Semantic Information

The indirect rate-distortion problem can be viewed as a special case of Theorem 1 in which only the semantic information $S$ is recovered, i.e., $X_2$ and $Y$ are constants and $D_1=\infty$. Denote the minimum achievable rate for a given distortion constraint $D_s$ by $R_s(D_s)$.

Consider binary sources and assume $S$ and $X_1$ follow the doubly symmetric binary distribution, i.e.,

p(s,x_1)=\left[\begin{matrix}\frac{1-p}{2}&\frac{p}{2}\\ \frac{p}{2}&\frac{1-p}{2}\end{matrix}\right]. (15)

The transition probability $p(x_1|s)$ can also be described by the binary symmetric channel (BSC) in Fig. 4.

[Figure: BSC with input $S\in\{0,1\}$ and output $X_1\in\{0,1\}$; crossover probability $p$, i.e., $P(X_1\neq S)=p$.]
Figure 4: Transition probability from $S$ to $X_1$, in terms of a BSC.

Assume $p\leq 0.5$, which means that $X_1$ is more likely to take the same value as $S$. Let $d_s:\mathcal{S}\times\hat{\mathcal{S}}\rightarrow\{0,1\}$ be the Hamming distortion measure. Then the evaluation of $R_s(D_s)$ is given in the following lemma.

Let $R_{d_s'}(\cdot)$ be the ordinary rate-distortion function in (1) under the distortion measure $d_s'$ (cf. (14)). For notational simplicity, for $D_s\geq p$, define

D_s^0\triangleq\frac{D_s-p}{1-2p}. (16)
Lemma 4.

For binary sources in (15) and Hamming distortion, the rate-distortion function for semantic information is

R_s(D_s)=R_{d_s'}(D_s), (17)

where R_{d_s'}(D_s)=R\left(D_s^0\right)=\left[1-h_b\left(\frac{D_s-p}{1-2p}\right)\right]\cdot\mathds{1}_{p\leq D_s\leq 0.5}.

Proof.

The evaluation of $R_s(D_s)$ was given in [33, 41]. A simpler proof can be found in Appendix C. ∎

Remark 2.

By the properties of the rate-distortion function in (1) and the linearity between $D_s^0$ and $D_s$, we see that $R_s(D_s)$ is also non-increasing and convex in $D_s$.

Remark 3.

It is easy to check that $\frac{D-p}{1-2p}<D$ for $D<0.5$. This implies that $R_s(D)>R(D)$ for $D<0.5$, where $R(D)$ is the ordinary rate-distortion function in (1). The inequality is intuitive from the data processing inequality: under the same distortion constraint $D$, recovering $S$ directly (with rate $R(D)$) is easier than recovering it from the observation $X_1$ (with rate $R_s(D)$). Moreover, we see from the lemma that $D_s\geq p$, which means that the semantic information can never be losslessly recovered for $p>0$. This follows from the fact that even if we know $X_1$ completely, the best achievable distortion for reconstructing $S$ is the distortion between $S$ and $X_1$, which equals $p$. The rate-distortion functions $R_s(D)$ and $R(D)$ are illustrated in Fig. 5 for $p=0.1$, which verifies the above observations. For a general source and distortion measure, we have $R_s(D)\geq R(D)$, where equality holds only when $X_1$ determines $S$. This can be easily proved by the data processing inequality, and we omit the details here.

Figure 5: Comparison of rate-distortion functions $R_s(D)$ and $R(D)$ for $p=0.1$.
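The comparison in Fig. 5 can be reproduced with a few lines of Python; the sketch below (log base 2, function names ours) evaluates $R(D)=1-h_b(D)$ and $R_s(D)$ from (17) for $p=0.1$.

```python
import numpy as np

def hb(q):
    return 0.0 if q <= 0 or q >= 1 else -q*np.log2(q) - (1-q)*np.log2(1-q)

def R_direct(D):
    """Ordinary rate-distortion function of a Bernoulli(1/2) source, Hamming distortion."""
    return max(1.0 - hb(min(D, 0.5)), 0.0)

def R_semantic(D, p):
    """Rate-distortion function (17) for the semantic information; defined for D >= p."""
    if D < p:
        return float('inf')          # S can never be recovered with distortion below p
    return max(1.0 - hb(min((D - p) / (1 - 2*p), 0.5)), 0.0)

p = 0.1
for D in [0.1, 0.2, 0.3, 0.4, 0.5]:
    print(D, round(R_direct(D), 3), round(R_semantic(D, p), 3))
# As in Fig. 5, R_semantic(D, p) >= R_direct(D), with equality only at D = 0.5.
```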
Remark 4.

We can imagine that $d_s'$ measures the distortion between the observation and the reconstruction of the semantic information. Furthermore, it was shown in [28, 29, 33] that $d_s$ and $d_s'$ measure equivalent distortions, i.e.,

\mathbb{E}d_s'(X_1,\hat{S})=\mathbb{E}d_s(S,\hat{S}),
\mathbb{E}d_s'(X_1^n,\hat{S}^n)=\mathbb{E}d_s(S^n,\hat{S}^n).

Then we can regard the system of compressing $X_1^n$ and reconstructing $\hat{S}^n$ as the ordinary rate-distortion problem with distortion measure $d_s'$. Thus, $R_s(D_s)$ is equivalent to the ordinary rate-distortion function in (1) under distortion measure $d_s'$, which rigorously proves (17).

IV Case Study: Binary Sources

Assume $S$ and $X_1$ are doubly symmetric binary sources with the distribution in (15), and $X_2$ and $Y$ are both Bernoulli($\frac{1}{2}$) sources. The reconstructions are all binary, i.e., $\hat{\mathcal{X}}_1=\hat{\mathcal{X}}_2=\hat{\mathcal{S}}=\{0,1\}$. The distortion measures $d_1$, $d_2$, and $d_s$ are all assumed to be Hamming distortion. We further assume that any two of $X_1$, $X_2$, and $Y$ are doubly symmetric binary distributed (cf. (4)) with parameters $p_1$, $p_2$, and $p_3$, respectively. Specifically, $(X_1,X_2)\sim\text{DSBS}(p_1)$, $(X_1,Y)\sim\text{DSBS}(p_2)$, and $(X_2,Y)\sim\text{DSBS}(p_3)$.

Consider the following two examples that only differ in the source distributions.

IV-A Conditionally Independent Sources

Assume the Markov chain $X_1-Y-X_2$ holds,¹ i.e., $X_1$ and $X_2$ are independent conditioned on $Y$. This assumption coincides with the intuitive understanding of $X_1$ and $X_2$ in Section II-A, namely that the semantic feature can be independent of the background. Then, by Lemma 3, compressing $X_1^n$ and $X_2^n$ simultaneously is the same as compressing them separately in terms of the optimal compression rate and distortions, which implies the following theorem.

¹Note that the Markov chain $X_1-Y-X_2$ indicates $p_1=p_2\star p_3\triangleq p_2(1-p_3)+p_3(1-p_2)$.

Theorem 5.

The rate-distortion function for the above conditionally independent sources is given by

R(D_1,D_2,D_s)=\big[h_b(p_3)-h_b(D_2)\big]\cdot\mathds{1}_{0\leq D_2\leq p_3}+\left[h_b(p_2)-h_b\big(\min\{D_1,D_s^0\}\big)\right]\cdot\mathds{1}_{0\leq\min\{D_1,D_s^0\}\leq p_2},

where $D_s^0=\frac{D_s-p}{1-2p}$ is defined in (16).

Proof.

The rate-distortion function in Theorem 1 satisfies

R(D_1,D_2,D_s)
=R_{\text{2d},X_1|Y}(D_1,D_s)+R_{X_2|Y}(D_2)
=\min_{\substack{\mathbb{E}d_1(X_1,\hat{X}_1)\leq D_1\\ \mathbb{E}d_s'(X_1,\hat{S})\leq D_s}}I(X_1;\hat{X}_1,\hat{S}|Y)+\min_{\mathbb{E}d_2(X_2,\hat{X}_2)\leq D_2}I(X_2;\hat{X}_2|Y)
=\left[h_b(p_2)-h_b\big(\min\{D_1,D_s^0\}\big)\right]\cdot\mathds{1}_{0\leq\min\{D_1,D_s^0\}\leq p_2}+\big[h_b(p_3)-h_b(D_2)\big]\cdot\mathds{1}_{0\leq D_2\leq p_3},

where the last step follows from the rate-distortion functions in (5), (7) and (17). ∎
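As an illustration of Theorem 5, the following sketch (parameter values chosen only as an example) evaluates the closed-form expression for conditionally independent binary sources; rates are in bits.

```python
import numpy as np

def hb(q):
    return 0.0 if q <= 0 or q >= 1 else -q*np.log2(q) - (1-q)*np.log2(1-q)

def rate_cond_indep(D1, D2, Ds, p, p2, p3):
    """Theorem 5: R(D1, D2, Ds) for conditionally independent binary sources
    (X1 - Y - X2), with (X1,Y) ~ DSBS(p2), (X2,Y) ~ DSBS(p3), and (S,X1) ~ DSBS(p)."""
    Ds0 = (Ds - p) / (1 - 2*p)            # effective distortion (16); requires Ds >= p
    d1_eff = min(D1, Ds0)
    r1 = hb(p2) - hb(d1_eff) if 0 <= d1_eff <= p2 else 0.0
    r2 = hb(p3) - hb(D2) if 0 <= D2 <= p3 else 0.0
    return r1 + r2

# Hypothetical example: p = p2 = p3 = 0.25
print(round(rate_cond_indep(D1=0.05, D2=0.1, Ds=0.3, p=0.25, p2=0.25, p3=0.25), 4))
```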

IV-B Correlated Sources

As in [41], evaluating the rate-distortion function for general correlated sources can be extremely difficult. Thus, we assume the Markov chain $Y-X_1-X_2$, motivated by the intuition that the side information $Y$ helps more for the semantic-related source $X_1$.

Without the conditional independence of $X_1$ and $X_2$, the optimality of separate compression in Lemma 3 may not hold. The rate-distortion function $R(D_1,D_2,D_s)$ in Theorem 1 can be calculated as follows. Recall from (16) that $D_s^0\triangleq\frac{D_s-p}{1-2p}$. For simplicity, we consider only small distortions in the set

\mathcal{D}_0=\big\{(D_1,D_2,D_s):0\leq\min\{D_1,D_s^0\}\leq p_1p_2\text{ and }0\leq D_2\leq p_1\big\}. (18)
Theorem 6.

For $(D_1,D_2,D_s)\in\mathcal{D}_0$, the rate-distortion function for the above correlated sources is

R(D_1,D_2,D_s)=h_b(p_1)+h_b(p_2)-h_b\left(\min\left\{D_1,D_s^0\right\}\right)-h_b(D_2), (19)

where $D_s^0=\frac{D_s-p}{1-2p}$ is defined in (16).

Proof.

The proof is given in Appendix D. ∎

From the distributions of $(X_1,X_2)$ and $(X_1,Y)$ and the Markov chain $Y-X_1-X_2$, it is easy to check that $(X_2,Y)$ is doubly symmetric with parameter $p_3=p_1\star p_2\triangleq p_1(1-p_2)+p_2(1-p_1)$. Then, comparing (19) with the rate of separate compression, we have for $(D_1,D_2,D_s)\in\mathcal{D}_0$ that

R_{\text{2d},X_1|Y}(D_1,D_s)+R_{X_2|Y}(D_2)
=\min_{\substack{\mathbb{E}d_1(X_1,\hat{X}_1)\leq D_1\\ \mathbb{E}d_s'(X_1,\hat{S})\leq D_s}}I(X_1;\hat{X}_1,\hat{S}|Y)+\min_{\mathbb{E}d_2(X_2,\hat{X}_2)\leq D_2}I(X_2;\hat{X}_2|Y)
=\big[h_b(p_2)-h_b(\min\{D_1,D_s^0\})\big]+\big[h_b(p_1\star p_2)-h_b(D_2)\big]
\geq R(D_1,D_2,D_s), (20)

where the last inequality follows from the fact that $h_b(\cdot)$ is increasing on $[0,0.5]$ and $p_1\star p_2\geq p_1$ for $0\leq p_1\leq 0.5$.

Note that equality in (20) holds only for $p_1=0.5$, which together with $Y-X_1-X_2$ implies the Markov chain $X_1-Y-X_2$ in Lemma 3. Then the problem reduces to that in Theorem 5. For $p_1<0.5$, compressing $X_1$ and $X_2$ simultaneously is strictly better than separate compression.
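The gap between joint and separate compression quantified by (19) and (20) is easy to evaluate numerically; the sketch below (example parameters only, names ours) prints both rates for a point in $\mathcal{D}_0$.

```python
import numpy as np

def hb(q):
    return 0.0 if q <= 0 or q >= 1 else -q*np.log2(q) - (1-q)*np.log2(1-q)

def star(a, b):
    """Binary convolution a*b = a(1-b) + b(1-a)."""
    return a*(1-b) + b*(1-a)

def rate_joint(D1, D2, Ds, p, p1, p2):
    """Theorem 6, expression (19), valid for small distortions in D0."""
    Ds0 = (Ds - p) / (1 - 2*p)
    return hb(p1) + hb(p2) - hb(min(D1, Ds0)) - hb(D2)

def rate_separate(D1, D2, Ds, p, p1, p2):
    """Upper bound (20): separate compression of X1 and X2 given Y."""
    Ds0 = (Ds - p) / (1 - 2*p)
    return hb(p2) - hb(min(D1, Ds0)) + hb(star(p1, p2)) - hb(D2)

args = dict(D1=0.03, D2=0.02, Ds=0.28, p=0.25, p1=0.25, p2=0.25)
print(round(rate_joint(**args), 4), round(rate_separate(**args), 4))
# The gap hb(p1*p2) - hb(p1) > 0 for p1 < 0.5 shows joint compression is strictly better.
```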

V Case Study: Binary Classification of Integers

Consider classifying integers as even or odd. Let $X_1$ be uniformly distributed over $\mathcal{X}_1=[1:N]$ with $N\geq 4$ even. The semantic information $S$ is a binary random variable that probabilistically indicates whether $X_1$ is even or odd. The transition probability can be defined by the BSC in Fig. 6, which is similar to that in Fig. 4 with the values of $X_1$ replaced by “even” and “odd”.

[Figure: BSC with input $S\in\{0,1\}$ and output $X_1\in\{\text{even},\text{odd}\}$; crossover probability $p$.]
Figure 6: Transition probability from the binary semantic information $S$ to the integer $X_1$, in terms of a BSC.

The binary side information $Y$ is correlated with $X_1$ and also indicates its parity (even/odd), similar to Fig. 6 but with parameter $p_2$. Assume the Markov chain $X_1-Y-X_2$ holds and that the Bernoulli($\frac{1}{2}$) source $X_2$ is independent of $Y$. We can verify that $X_2$ is independent of $(X_1,Y)$. By Lemma 3, compressing $X_1^n$ and $X_2^n$ simultaneously is the same as compressing them separately. For simplicity, we consider only small distortions in the set

\mathcal{D}_1=\Big\{(D_1,D_2,D_s):0\leq\min\{D_1,D_s^0\}\leq\frac{2(N-1)p_2}{N}\text{ and }0\leq D_2\leq 0.5\Big\}. (21)
Theorem 7.

For $(D_1,D_2,D_s)\in\mathcal{D}_1$, the rate-distortion function for integer classification is

R(D_1,D_2,D_s)=\big[h_b(p_2)+\log(N/2)-h_b(\min\{D_1,D_s^0\})-D_1\log(N-1)\big]+\big[1-h_b(D_2)\big],

where $D_s^0=\frac{D_s-p}{1-2p}$ is defined in (16).

Proof.

The proof is given in Appendix E. ∎
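For reference, the closed form in Theorem 7 can be evaluated as follows (rates in bits; the parameter values are only an example matching the setting of Fig. 9).

```python
import numpy as np

def hb(q):
    return 0.0 if q <= 0 or q >= 1 else -q*np.log2(q) - (1-q)*np.log2(1-q)

def rate_classification(D1, D2, Ds, p, p2, N):
    """Theorem 7, valid for (D1, D2, Ds) in the small-distortion set (21)."""
    Ds0 = (Ds - p) / (1 - 2*p)
    r1 = hb(p2) + np.log2(N/2) - hb(min(D1, Ds0)) - D1*np.log2(N - 1)
    r2 = 1.0 - hb(D2)
    return r1 + r2

# Hypothetical example: p = p2 = 0.25, N = 8, D2 = 0.5 as in Fig. 9
print(round(rate_classification(D1=0.05, D2=0.5, Ds=0.3, p=0.25, p2=0.25, N=8), 4))
```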

VI Case Study: Gaussian Sources

Assume $S$ and $X_1$ are jointly Gaussian sources with zero mean and covariance matrix

\begin{bmatrix}\sigma_S&\sigma_{SX_1}\\ \sigma_{SX_1}&\sigma_{X_1}\end{bmatrix}. (22)

Similarly, we assume the Markov chain $X_1-Y-X_2$, where $X_2$ and $Y$ are jointly Gaussian sources with zero mean and covariance matrix

\begin{bmatrix}\sigma_{X_2}&\sigma_{X_2Y}\\ \sigma_{X_2Y}&\sigma_Y\end{bmatrix}. (23)

Thus $X_1$ is conditionally independent of $X_2$ given $Y$. Let the covariance of $X_1$ and $Y$ be $\sigma_{X_1Y}$. The reconstructions are real scalars, i.e., $\hat{\mathcal{X}}_1=\hat{\mathcal{X}}_2=\hat{\mathcal{S}}=\mathbb{R}$. The distortion measures are squared error.

We see from Lemma 3 that compressing $X_1^n$ and $X_2^n$ simultaneously is the same as compressing them separately in terms of the optimal compression rate and distortions. Then we have the following theorem.

Theorem 8.

For the Gaussian sources, if the Markov chain $X_1-Y-X_2$ holds, the rate-distortion function is

R(D_1,D_2,D_s)=\frac{1}{2}\left(\log\frac{\sigma_{X_2}-\frac{\sigma_{X_2Y}^2}{\sigma_Y}}{D_2}\right)^++\frac{1}{2}\left[\log\max\left(\frac{\sigma_{X_1}-\frac{\sigma_{X_1Y}^2}{\sigma_Y}}{D_1},\frac{\sigma_{SX_1}^2\left(\sigma_{X_1}-\frac{\sigma_{X_1Y}^2}{\sigma_Y}\right)}{\sigma_{X_1}^2\left(D_s-\mathrm{mmse}\right)}\right)\right]^+, (24)

where $\mathrm{mmse}$ is the minimum mean squared error for estimating $S$ from $X_1$, given by

\mathrm{mmse}=\sigma_S-\frac{\sigma_{SX_1}^2}{\sigma_{X_1}}. (25)
Proof.

The proof is given in Appendix F. ∎
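The Gaussian expression (24) is straightforward to evaluate; the sketch below (rates in nats, names ours) also reproduces the minimum rate of about 0.20 quoted in Section VII-C for the parameters used there, provided $D_s>\mathrm{mmse}$.

```python
import numpy as np

def rate_gaussian(D1, D2, Ds, var_S, var_X1, var_X2, var_Y,
                  cov_SX1, cov_X1Y, cov_X2Y):
    """Theorem 8 (rate in nats), assuming the Markov chain X1 - Y - X2 and Ds > mmse."""
    mmse = var_S - cov_SX1**2 / var_X1                 # MMSE of estimating S from X1, (25)
    err_X2 = var_X2 - cov_X2Y**2 / var_Y               # conditional variance of X2 given Y
    err_X1 = var_X1 - cov_X1Y**2 / var_Y               # conditional variance of X1 given Y
    r2 = 0.5 * max(np.log(err_X2 / D2), 0.0)
    term_s = cov_SX1**2 * err_X1 / (var_X1**2 * (Ds - mmse))
    r1 = 0.5 * max(np.log(max(err_X1 / D1, term_s)), 0.0)
    return r1 + r2

# Parameters of Section VII-C: all variances 2, all covariances 1, D2 = 1.
print(round(rate_gaussian(D1=0.5, D2=1.0, Ds=2.0,
                          var_S=2, var_X1=2, var_X2=2, var_Y=2,
                          cov_SX1=1, cov_X1Y=1, cov_X2Y=1), 3))
# The D2 term alone equals 0.5*log(1.5), about 0.20, the minimum rate quoted in Section VII-C.
```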

VII Numerical Results

In this section, we plot the rate-distortion curves evaluated in the previous sections.

VII-A Correlated Binary Sources

Figure 7: Rate-distortion function for correlated binary sources: $p=p_1=p_2=0.25$ and $D_2=0.5$ (3-D plot).
Figure 8: Rate-distortion function for correlated binary sources: $p=p_1=p_2=0.25$ and $D_2=0.5$; (a) $D_1=0.03$, (b) $D_1=0.05$.

Consider the rate-distortion function for correlated binary sources in Theorem 6 with $p=p_1=p_2=0.25$ and $D_2=0.5$. Fig. 7 shows the 3-D plot of the optimal tradeoff between the coding rate $R$ and the distortions $(D_1,D_s)$. We can see that the rate-distortion function is decreasing and convex in $(D_1,D_s)$ for distortions in $\mathcal{D}_0$ (cf. (18)).

The truncated curves with $D_1=0.03$ and $D_1=0.05$ are shown in Fig. 8. We see that the rate is decreasing in $D_s$ until it achieves the minimum rate, which is determined by $D_1$. Similar curves can also be obtained by truncating at some constant $D_s$.

VII-B Binary Classification of Integers

The rate-distortion function for integer classification in Theorem 7 with $p=p_1=p_2=0.25$, $D_2=0.5$, and $N=8$ is illustrated in Fig. 9. Note that in both Theorem 6 and Theorem 7, $D_2=0.5$ indicates that $X_2$ can be recovered by random guessing, which further implies that $X_2$ can also be regarded as side information at both sides.

Comparing the rates along the $D_1$ and $D_s$ axes in Fig. 9, we see that recovering only the semantic information can reduce the coding rate compared with recovering the source message.

Comparing Fig. 9 with Fig. 7, we see that the rate for integer classification decreases faster as $D_1$ increases (which is clearer at the minimum $D_s=p$). This implies that $D_1$ is more dominant in determining the rate here, which is intuitive since the integer source has a larger alphabet and recovering it under different distortions requires a larger range of rates.

Figure 9: Rate-distortion function for integer classification: $p=p_1=p_2=0.25$, $D_2=0.5$, and $N=8$.

VII-C Gaussian Sources

Consider the rate-distortion function for Gaussian sources in Theorem 8. Let all of the variances be $2$, all of the covariances be $1$, and $D_2=1$.

The 3-D plot of the optimal tradeoff between the coding rate $R$ and the distortions $(D_1,D_s)$ is illustrated in Fig. 10. We can see that the rate-distortion function is decreasing and convex in $(D_1,D_s)$. The minimum rate is equal to

R_{X_2|Y}(D_2)=\frac{1}{2}\left(\log\frac{\sigma_{X_2}-\frac{\sigma_{X_2Y}^2}{\sigma_Y}}{D_2}\right)^+=0.20.

The contour plot of the rate-distortion function is shown in Fig. 11. The slanted line denotes the situations where $R_{X_1|Y}(D_1)=R_{S|Y}(D_s)$. We see that when $D_1$ is more dominant (the region above the slanted line), the rate only needs to meet the distortion constraint for reconstructing $X_1$. On the contrary, when $D_s$ is more dominant (the region below the slanted line), the rate only needs to meet the distortion constraint for reconstructing $S$.

Figure 10: Rate-distortion function for Gaussian sources: $\sigma_S=\sigma_{X_1}=\sigma_{X_2}=\sigma_Y=2$, $\sigma_{SX_1}=\sigma_{X_1Y}=\sigma_{X_2Y}=1$, and $D_2=1$.
Figure 11: The contour plot of the rate-distortion function.

VIII Conclusion

In this paper, we studied the semantic rate-distortion problem with side information, motivated by task-oriented video compression. The general rate-distortion function was characterized. We also evaluated several cases with specific sources and distortion measures. It would be desirable to derive the rate-distortion function for real video sources, which is more challenging due to the high complexity of real source models and the choice of meaningful distortion measures. This part of the work is under investigation.

Appendix A Proof of Theorem 1

The achievability part is a straightforward extension of the joint typicality coding scheme for lossy source coding. We briefly present the coding idea and analysis as follows. Fix a conditional pmf $p(\hat{x}_1,\hat{x}_2,\hat{s}|x_1,x_2,y)$ such that the distortion constraints are satisfied: $\mathbb{E}d_1(X_1,\hat{X}_1)\leq D_1$, $\mathbb{E}d_2(X_2,\hat{X}_2)\leq D_2$, and $\mathbb{E}d_s'(X_1,\hat{S})\leq D_s$. Let $p(\hat{x}_1,\hat{x}_2,\hat{s}|y)=\sum_{x_1,x_2}p(x_1,x_2|y)p(\hat{x}_1,\hat{x}_2,\hat{s}|x_1,x_2,y)$. Randomly and independently generate $2^{nR}$ sequence triples $(\hat{x}_1^n,\hat{x}_2^n,\hat{s}^n)$ indexed by $m\in[1:2^{nR}]$, each according to $p(\hat{x}_1,\hat{x}_2,\hat{s}|y)$. The whole codebook $\mathcal{C}$, consisting of these sequence triples, is revealed to both the encoder and decoder. Upon observing the source sequences $(x_1^n,x_2^n,y^n)$, the encoder finds an index $m$ such that its indexed triple $(\hat{x}_1^n,\hat{x}_2^n,\hat{s}^n)$ satisfies $(x_1^n,x_2^n,y^n,\hat{x}_1^n,\hat{x}_2^n,\hat{s}^n)\in\mathcal{T}_\epsilon^n$. If there is more than one such index, it randomly chooses one of them; if there is no such index, it sets $m=1$. Upon receiving the index $m$, the decoder reconstructs the messages and the inference by choosing the codeword $(\hat{x}_1^n,\hat{x}_2^n,\hat{s}^n)$ indexed by $m$. By the law of large numbers, the source sequences are jointly typical with probability approaching 1 as $n\rightarrow\infty$. We define the “encoding error” event as

\mathcal{E}=\left\{\left(X_1^n,X_2^n,Y^n,\hat{X}_1^n,\hat{X}_2^n,\hat{S}^n\right)\notin\mathcal{T}_\epsilon^n,~\forall m\in\left[1:2^{nR}\right]\right\}. (26)

Then we can bound the error probability as follows

P(\mathcal{E})
=P\left\{\left(x_1^n,x_2^n,y^n,\hat{X}_1^n,\hat{X}_2^n,\hat{S}^n\right)\notin\mathcal{T}_\epsilon^n,~\forall m\in\left[1:2^{nR}\right]\right\}
=\prod_{m=1}^{2^{nR}}P\left\{\left(x_1^n,x_2^n,y^n,\hat{X}_1^n,\hat{X}_2^n,\hat{S}^n\right)\notin\mathcal{T}_\epsilon^n\right\}
=\left(1-P\left\{\left(x_1^n,x_2^n,y^n,\hat{X}_1^n,\hat{X}_2^n,\hat{S}^n\right)\in\mathcal{T}_\epsilon^n\right\}\right)^{2^{nR}}
\leq\sum_{(x_1^n,x_2^n,y^n)\in\mathcal{T}_\epsilon^n}p(x_1^n,x_2^n,y^n)\left(1-2^{-n[I(X_1,X_2;\hat{X}_1,\hat{X}_2,\hat{S}|Y)+\delta(\epsilon)]}\right)^{2^{nR}}
\leq\exp\left(-2^{n[R-I(X_1,X_2;\hat{X}_1,\hat{X}_2,\hat{S}|Y)-\delta(\epsilon)]}\right),

where $\delta(\epsilon)\rightarrow 0$ as $n\rightarrow\infty$, the first inequality follows from the joint typicality lemma in [36], and the last inequality follows from the fact that $(1-z)^t\leq\exp(-tz)$ for $z\in[0,1]$ and $t\geq 0$. We see that $P(\mathcal{E})\rightarrow 0$ as $n\rightarrow\infty$ if $R>I(X_1,X_2;\hat{X}_1,\hat{X}_2,\hat{S}|Y)+\delta(\epsilon)$. If the error event does not happen, i.e., the reconstruction is jointly typical with the source sequences, then by the distortion constraints assumed for the conditional pmf, the expected distortions can achieve $D_1$, $D_2$, and $D_s$, respectively. This proves the achievability.

Define $R_I(D_1,D_2,D_s)$ as the rate-distortion function characterized by Theorem 1. For the converse part, we show that

nR\geq H(W)\geq H(W|Y^n)\geq I(X_1^n,X_2^n;W|Y^n)
\geq I(X_1^n,X_2^n;\hat{X}_1^n,\hat{X}_2^n,\hat{S}^n|Y^n)
=I(X_1^n,X_2^n,Y^n;\hat{X}_1^n,\hat{X}_2^n,\hat{S}^n)-I(Y^n;\hat{X}_1^n,\hat{X}_2^n,\hat{S}^n)
=\sum_{i=1}^n\left[I(X_{1,i},X_{2,i},Y_i;\hat{X}_1^n,\hat{X}_2^n,\hat{S}^n|X_{1,1}^{i-1},X_{2,1}^{i-1},Y_1^{i-1})-I(Y_i;\hat{X}_1^n,\hat{X}_2^n,\hat{S}^n|Y_1^{i-1})\right]
=\sum_{i=1}^n\left[I(X_{1,i},X_{2,i},Y_i;\hat{X}_1^n,\hat{X}_2^n,\hat{S}^n,X_{1,1}^{i-1},X_{2,1}^{i-1},Y_1^{i-1})-I(Y_i;\hat{X}_1^n,\hat{X}_2^n,\hat{S}^n,Y_1^{i-1})\right]
=\sum_{i=1}^n\left[I(X_{1,i},X_{2,i};\hat{X}_{1,i},\hat{X}_{2,i},\hat{S}_i|Y_i)+I(X_{1,i},X_{2,i},Y_i;\hat{X}_{1,1}^{i-1},\hat{X}_{1,i+1}^n,\hat{X}_{2,1}^{i-1},\hat{X}_{2,i+1}^n,\hat{S}_1^{i-1},\hat{S}_{i+1}^n,X_{1,1}^{i-1},X_{2,1}^{i-1},Y_1^{i-1}|\hat{X}_{1,i},\hat{X}_{2,i},\hat{S}_i)-I(Y_i;\hat{X}_{1,1}^{i-1},\hat{X}_{1,i+1}^n,\hat{X}_{2,1}^{i-1},\hat{X}_{2,i+1}^n,\hat{S}_1^{i-1},\hat{S}_{i+1}^n,Y_1^{i-1}|\hat{X}_{1,i},\hat{X}_{2,i},\hat{S}_i)\right]
=\sum_{i=1}^n\left[I(X_{1,i},X_{2,i};\hat{X}_{1,i},\hat{X}_{2,i},\hat{S}_i|Y_i)+I(X_{1,i},X_{2,i},Y_i;X_{1,1}^{i-1},X_{2,1}^{i-1}|\hat{X}_1^n,\hat{X}_2^n,\hat{S}^n,Y_1^i)+I(X_{1,i},X_{2,i};\hat{X}_{1,1}^{i-1},\hat{X}_{1,i+1}^n,\hat{X}_{2,1}^{i-1},\hat{X}_{2,i+1}^n,\hat{S}_1^{i-1},\hat{S}_{i+1}^n,Y_1^{i-1}|\hat{X}_{1,i},\hat{X}_{2,i},\hat{S}_i,Y_i)\right]
\geq\sum_{i=1}^n I(X_{1,i},X_{2,i};\hat{X}_{1,i},\hat{X}_{2,i},\hat{S}_i|Y_i) (27)
\geq\sum_{i=1}^n R_I\left(\mathbb{E}d_1(X_{1,i},\hat{X}_{1,i}),\mathbb{E}d_2(X_{2,i},\hat{X}_{2,i}),\mathbb{E}d_s'(X_{1,i},\hat{S}_i)\right) (28)
\geq nR_I\left(\mathbb{E}d_1(X_1^n,\hat{X}_1^n),\mathbb{E}d_2(X_2^n,\hat{X}_2^n),\mathbb{E}d_s'(X_1^n,\hat{S}^n)\right) (29)
\geq nR_I\left(\mathbb{E}d_1(X_1^n,\hat{X}_1^n),\mathbb{E}d_2(X_2^n,\hat{X}_2^n),\mathbb{E}d_s(S^n,\hat{S}^n)\right) (30)
\geq nR_I(D_1,D_2,D_s), (31)

where (27) follows from the nonnegativity of mutual information, (28) follows from the definition of $R_I(D_1,D_2,D_s)$, (29) follows from the convexity of $R_I(D_1,D_2,D_s)$, (30) follows from $\mathbb{E}d_s'(X_1^n,\hat{S}^n)=\mathbb{E}d_s(S^n,\hat{S}^n)$, which is proved in [29], and the last inequality follows from the non-increasing property of $R_I(D_1,D_2,D_s)$. This completes the converse proof.

Appendix B Proof of Lemma 3

The Markov chain $X_1-Y-X_2$ indicates that

H(X_2|X_1,Y)=H(X_2|Y). (32)

Then the mutual information in (10) can be bounded by

I(X_1,X_2;\hat{X}_1,\hat{X}_2,\hat{S}|Y)
=H(X_1,X_2|Y)-H(X_1,X_2|\hat{X}_1,\hat{X}_2,\hat{S},Y)
=H(X_1|Y)+H(X_2|Y)-H(X_1|\hat{X}_1,\hat{X}_2,\hat{S},Y)-H(X_2|X_1,\hat{X}_1,\hat{X}_2,\hat{S},Y)
\geq H(X_1|Y)+H(X_2|Y)-H(X_1|\hat{X}_1,\hat{S},Y)-H(X_2|\hat{X}_2,Y)
=I(X_1;\hat{X}_1,\hat{S}|Y)+I(X_2;\hat{X}_2|Y),

where the inequality follows from the fact that conditioning does not increase entropy. Now, we have

R(D_1,D_2,D_s)
=\min_{\substack{p(\hat{x}_1,\hat{x}_2,\hat{s}|x_1,x_2,y):\\ \mathbb{E}d_1(X_1,\hat{X}_1)\leq D_1\\ \mathbb{E}d_2(X_2,\hat{X}_2)\leq D_2\\ \mathbb{E}d_s'(X_1,\hat{S})\leq D_s}}I(X_1,X_2;\hat{X}_1,\hat{X}_2,\hat{S}|Y)
\geq\min_{\substack{p(\hat{x}_1,\hat{x}_2,\hat{s}|x_1,x_2,y):\\ \mathbb{E}d_1(X_1,\hat{X}_1)\leq D_1\\ \mathbb{E}d_2(X_2,\hat{X}_2)\leq D_2\\ \mathbb{E}d_s'(X_1,\hat{S})\leq D_s}}\left[I(X_1;\hat{X}_1,\hat{S}|Y)+I(X_2;\hat{X}_2|Y)\right]
=\min_{\substack{p(\hat{x}_1,\hat{s}|x_1,y):\\ \mathbb{E}d_1(X_1,\hat{X}_1)\leq D_1\\ \mathbb{E}d_s'(X_1,\hat{S})\leq D_s}}I(X_1;\hat{X}_1,\hat{S}|Y)+\min_{\substack{p(\hat{x}_2|x_2,y):\\ \mathbb{E}d_2(X_2,\hat{X}_2)\leq D_2}}I(X_2;\hat{X}_2|Y)
=R_{\text{2d},X_1|Y}(D_1,D_s)+R_{X_2|Y}(D_2).

For the other direction, we show that the rate-distortion quadruple $\big(R_{\text{2d},X_1|Y}(D_1,D_s)+R_{X_2|Y}(D_2),D_1,D_2,D_s\big)$ is achievable. To see this, let $p^*(\hat{x}_1,\hat{s}|x_1,y)$ and $p^*(\hat{x}_2|x_2,y)$ be the optimal distributions that achieve the rate-distortion tuples $\big(R_{\text{2d},X_1|Y}(D_1,D_s),D_1,D_s\big)$ and $\big(R_{X_2|Y}(D_2),D_2\big)$, respectively. Now consider the distribution $p^*(x_1,x_2,\hat{x}_1,\hat{x}_2,\hat{s}|y)\triangleq p^*(x_1,\hat{x}_1,\hat{s}|y)p^*(x_2,\hat{x}_2|y)$, which requires the Markov chain $(X_1,\hat{X}_1,\hat{S})-Y-(X_2,\hat{X}_2)$ and is consistent with the condition $X_1-Y-X_2$. Then the corresponding random variables satisfy

I(X_1,X_2;\hat{X}_1,\hat{X}_2,\hat{S}|Y)
=H(X_1|Y)+H(X_2|Y)-H(X_1|\hat{X}_1,\hat{X}_2,\hat{S},Y)-H(X_2|X_1,\hat{X}_1,\hat{X}_2,\hat{S},Y)
=H(X_1|Y)+H(X_2|Y)-H(X_1|\hat{X}_1,\hat{S},Y)-H(X_2|\hat{X}_2,Y)
=I(X_1;\hat{X}_1,\hat{S}|Y)+I(X_2;\hat{X}_2|Y)
=R_{\text{2d},X_1|Y}(D_1,D_s)+R_{X_2|Y}(D_2),

where the first equality follows from (32), the second equality follows from the Markov chain $(X_1,\hat{X}_1,\hat{S})-Y-(X_2,\hat{X}_2)$, and the last equality follows from the optimality of $p^*(\hat{x}_1,\hat{s}|x_1,y)$ and $p^*(\hat{x}_2|x_2,y)$. Lastly, by the minimization in the expression of the rate-distortion function in (10), we conclude that $R(D_1,D_2,D_s)\leq R_{\text{2d},X_1|Y}(D_1,D_s)+R_{X_2|Y}(D_2)$, which completes the proof of the lemma.

Appendix C Proof of Lemma 4

Since we are in the binary Hamming setting, we first calculate the values of $d_s'(x_1,\hat{s})$ (cf. (14)):

d_s'(0,0)=\frac{1}{p(x_1=0)}\sum_{s=0,1}p(x_1=0,s)d_s(s,0)
=\frac{1}{p(x_1=0)}\big[p(x_1=0,s=0)d_s(0,0)+p(x_1=0,s=1)d_s(1,0)\big]
=\frac{p(x_1=0,s=1)}{p(x_1=0)}
=p(s=1|x_1=0)
=p.

The other values follow similarly, and we obtain the distortion function

d_s'(x_1,\hat{s})=\begin{cases}p,&\text{if }\hat{s}=x_1\\ 1-p,&\text{if }\hat{s}\neq x_1.\end{cases} (33)

Then

\mathbb{E}d_s'(X_1,\hat{S})=\sum_{x_1,\hat{s}}p(x_1,\hat{s})d_s'(x_1,\hat{s})
=P(X_1\neq\hat{S})\cdot(1-p)+P(X_1=\hat{S})\cdot p
=P(X_1\neq\hat{S})(1-2p)+p.

The distortion constraint $\mathbb{E}d_s'(X_1,\hat{S})\leq D_s$ for $D_s\geq p$ implies $P(X_1\neq\hat{S})\leq\frac{D_s-p}{1-2p}$. Now we can follow the rate-distortion evaluation for a Bernoulli source with Hamming distortion in [37, 35], only changing the probability $P(X_1\neq\hat{S})$, and obtain

R_s(D_s)=\left[1-h_b\left(\frac{D_s-p}{1-2p}\right)\right]\cdot\mathds{1}_{p\leq D_s\leq 0.5}. (34)

This proves the lemma.

Appendix D Proof of Theorem 6

Note that separately compressing correlated sources is not optimal in general, i.e., the statement in Lemma 3 does not hold here. We therefore need to evaluate the mutual information in (10) over joint distributions, and we have

I(X_1,X_2;\hat{X}_1,\hat{X}_2,\hat{S}|Y)
=H(X_1,X_2|Y)-H(X_1,X_2|\hat{X}_1,\hat{X}_2,\hat{S},Y)
=H(X_1)+H(X_2|X_1)+H(Y|X_1)-H(Y)-H(X_1,X_2|\hat{X}_1,\hat{X}_2,\hat{S},Y)
=h_b(p_1)+h_b(p_2)-H(X_1\oplus\hat{X}_1,X_2\oplus\hat{X}_2|\hat{X}_1,\hat{X}_2,\hat{S},Y) (35)
\geq h_b(p_1)+h_b(p_2)-H(X_1\oplus\hat{X}_1,X_2\oplus\hat{X}_2)
\geq h_b(p_1)+h_b(p_2)-H(X_1\oplus\hat{X}_1)-H(X_2\oplus\hat{X}_2)
\geq h_b(p_1)+h_b(p_2)-h_b(D_1)-h_b(D_2),

where $\oplus$ denotes modulo-2 addition, the second equality follows from the Markov chain $Y-X_1-X_2$, the first inequality follows from the fact that conditioning does not increase entropy, and the last inequality follows from $P(X_1\oplus\hat{X}_1=1)=P(X_1\neq\hat{X}_1)\leq D_1$, $P(X_2\oplus\hat{X}_2=1)=P(X_2\neq\hat{X}_2)\leq D_2$, and the fact that $h_b(D)$ is increasing in $D$ for $0\leq D\leq 0.5$. By switching the roles of $\hat{X}_1$ and $\hat{S}$, i.e., replacing the conditional entropy in (35) by $H(X_1\oplus\hat{S},X_2\oplus\hat{X}_2|\hat{X}_1,\hat{X}_2,\hat{S},Y)$, we similarly obtain

I(X_1,X_2;\hat{X}_1,\hat{X}_2,\hat{S}|Y)\geq h_b(p_1)+h_b(p_2)-h_b\left(D_s^0\right)-h_b(D_2).

Thus, the rate-distortion function is lower bounded by

R(D_1,D_2,D_s)\geq h_b(p_1)+h_b(p_2)-h_b(\min\{D_1,D_s^0\})-h_b(D_2). (36)

We now show that the lower bound is tight by finding a joint distribution $p(x_1,x_2,y,\hat{x}_1,\hat{x}_2,\hat{s})$ that meets the distortion constraints and achieves the above lower bound.

[Figure: test channel diagram from $\hat{Z}_1\hat{Z}_2\in\{00,01,10,11\}$ to $Z_1Z_2\in\{00,01,10,11\}$.]
Figure 12: Test channel from $\hat{Z}_1\hat{Z}_2$ to $Z_1Z_2$: $Z_1\sim$ Ber($p_1$), $Z_2\sim$ Ber($p_2$), $Z_1,Z_2$ mutually independent, $\hat{Z}_1\hat{Z}_2\sim(q_1,q_2,q_3,q_4)$, and the transition probability $p(z_1,z_2|\hat{z}_1,\hat{z}_2)$ is given by (37).

Let $Z_i=Y\oplus X_i$ and $\hat{Z}_i=Y\oplus\hat{X}_i$ for $i=1,2$. Then there is a one-to-one correspondence between $p(x_1,x_2,y,\hat{x}_1,\hat{x}_2,\hat{s})$ and $p(z_1,z_2,y,\hat{z}_1,\hat{z}_2,\hat{s})$. For $Z_1,Z_2$ generated from the source distribution $p(x_1,x_2,y)$, it is easy to check that $(Z_1,Z_2)\sim[(1-p_1)(1-p_2),\,p_1(1-p_2),\,p_1p_2,\,(1-p_1)p_2]$, as shown in Fig. 12.

Next, we construct the desired joint distribution using the test channel in Fig. 12 as follows. For $p_1,p_2\in[0,0.5]$, $(D_1,D_2,D_s)\in\mathcal{D}_0$ (cf. (18)), and $D_1\leq D_s^0$, consider the joint distribution $p(x_1,x_2,y,\hat{x}_1,\hat{x}_2,\hat{s})$ defined by the following conditions:

  i. $\hat{S}=\hat{X}_{1}$;

  ii. the Markov chain $Y-(\hat{Z}_{1}\hat{Z}_{2})-(Z_{1}Z_{2})$;

  iii. the test channel in Fig. 12 with the conditional probability $p(z_{1},z_{2}|\hat{z}_{1},\hat{z}_{2})$ given by

    p(z_{1},z_{2}|\hat{z}_{1},\hat{z}_{2})=
    \begin{cases}
    (1-D_{1})(1-D_{2}), & \text{if }z_{1}=\hat{z}_{1}\text{ and }z_{2}=\hat{z}_{2}\\
    (1-D_{1})D_{2}, & \text{if }z_{1}=\hat{z}_{1}\text{ and }z_{2}\neq\hat{z}_{2}\\
    D_{1}(1-D_{2}), & \text{if }z_{1}\neq\hat{z}_{1}\text{ and }z_{2}=\hat{z}_{2}\\
    D_{1}D_{2}, & \text{if }z_{1}\neq\hat{z}_{1}\text{ and }z_{2}\neq\hat{z}_{2};
    \end{cases}   (37)

  iv. in order for $Z_{1},Z_{2}$ to follow the independent Bernoulli distributions, we choose the distribution of $\hat{Z}_{1}\hat{Z}_{2}$ as (38)-(41):

    q_{1}=\frac{[(1-p_{2})-D_{1}][(1-p_{1}-p_{2}+2p_{1}p_{2})-D_{2}]+p_{2}(1-2p_{1})(1-p_{2})}{(1-2D_{1})(1-2D_{2})}   (38)
    q_{2}=\frac{[(1-p_{2})-D_{1}][(p_{1}+p_{2}-2p_{1}p_{2})-D_{2}]+p_{2}(2p_{1}-1)(1-p_{2})}{(1-2D_{1})(1-2D_{2})}   (39)
    q_{3}=\frac{(p_{2}-D_{1})[(1-p_{1}-p_{2}+2p_{1}p_{2})-D_{2}]+p_{2}(2p_{1}-1)(1-p_{2})}{(1-2D_{1})(1-2D_{2})}   (40)
    q_{4}=\frac{(p_{2}-D_{1})[(p_{1}+p_{2}-2p_{1}p_{2})-D_{2}]+p_{2}(1-2p_{1})(1-p_{2})}{(1-2D_{1})(1-2D_{2})}   (41)

    We can verify that $q_{i}\geq 0$, $i=1,2,3,4$, for $(D_{1},D_{2},D_{s})\in\mathcal{D}_{0}$.

Now it remains to verify that the above distribution achieves the rate in (36) and the distortions $D_{1}$, $D_{2}$, and $D_{s}$. From conditions i and ii, we have

I(X_{1},X_{2};\hat{X}_{1},\hat{X}_{2},\hat{S}|Y)
=H(X_{1},X_{2}|Y)-H(X_{1},X_{2}|\hat{X}_{1},\hat{X}_{2},Y)
=H(X_{1},X_{2}|Y)-H(X_{1},X_{2}|\hat{X}_{1},\hat{X}_{2})
=h_{b}(p_{1})+h_{b}(p_{2})-h_{b}(D_{1})-h_{b}(D_{2}).

From (37) and (33), it is easy to calculate the expected distortions as

\mathbb{E}d_{1}(X_{1},\hat{X}_{1})=D_{1}   (42)
\mathbb{E}d_{2}(X_{2},\hat{X}_{2})=D_{2}   (43)
\mathbb{E}d_{s}^{\prime}(X_{1},\hat{S})=(1-D_{1})p+D_{1}(1-p)\leq D_{s}.   (44)

On the other hand, if $D_{1}\geq D_{s}^{0}$, we can construct the joint distribution $p(x_{1},x_{2},y,\hat{x}_{1},\hat{x}_{2},\hat{s})$ through four analogous conditions by switching the roles of $(\hat{X}_{1},D_{1})$ and $(\hat{S},D_{s}^{0})$. The corresponding rate and distortions then follow accordingly, which proves the theorem.

Appendix E Proof of Theorem 7

Since Lemma 3 holds here, we first calculate the rate-distortion function for $X_{2}$, which is

R_{X_{2}|Y}(D_{2})=\big[1-h_{b}(D_{2})\big]\cdot\mathds{1}_{\{0\leq D_{2}\leq 0.5\}}.   (45)

Now it remains to calculate $R_{\text{2d},X_{1}|Y}(D_{1},D_{s})$. We first consider the case $D_{1}\leq D_{s}^{0}$ and provide a lower bound on the mutual information as follows:

I(X_{1};\hat{X}_{1},\hat{S}|Y)
\geq I(X_{1};\hat{X}_{1}|Y)
=H(X_{1}|Y)-H(X_{1}|\hat{X}_{1},Y)
\geq H(X_{1}|Y)-H(X_{1}|\hat{X}_{1})
=h_{b}(p_{2})+\log(N/2)-H(X_{1}|\hat{X}_{1})
\geq h_{b}(p_{2})+\log(N/2)-h_{b}(D_{1})-D_{1}\log(N-1),

where the last inequality follows from $P(\hat{X}_{1}\neq X_{1})\leq D_{1}$, the fact that $h_{b}(x)$ is increasing for $x\in[0,0.5]$, and the fact that the uniform distribution maximizes entropy. (Note that the above lower bound can also be obtained by directly applying the log-sum inequality.)

Next, we show that the lower bound is tight by finding a joint distribution that achieves the above rate and the distortions $D_{1}$ and $D_{s}$. For $0\leq D_{1}\leq\frac{2(N-1)p_{2}}{N}$, we choose $\hat{S}=\hat{X}_{1}$ and construct $(X_{1},Y,\hat{X}_{1})$ via the test channel $p(x_{1}|\hat{x}_{1})$ in Fig. 13 and the Markov chain $Y-\hat{X}_{1}-X_{1}$. The conditional probability of the test channel in Fig. 13 is given as

p(x_{1}|\hat{x}_{1})=\begin{cases}1-D_{1}, & \text{if }x_{1}=\hat{x}_{1}\\ \frac{D_{1}}{N-1}, & \text{if }x_{1}\neq\hat{x}_{1}.\end{cases}   (46)
Figure 13: Test channel from $\hat{X}_{1}$ to $X_{1}$: $X_{1}\sim\mathrm{Uniform}[1:N]$ (probability $\frac{1}{N}$ for each value), $\hat{X}_{1}\sim(q_{1},\ldots,q_{N})$, and the transition probability $p(x_{1}|\hat{x}_{1})$ is given by (46).

Solving the equations

q_{i}(1-D_{1})+\Big(\sum_{j\in[1:N],\,j\neq i}q_{j}\Big)\frac{D_{1}}{N-1}=\frac{1}{N},\qquad i\in[1:N],

we obtain $q_{i}=\frac{1}{N}$ for $i\in[1:N]$, i.e., $\hat{X}_{1}$ is also uniformly distributed over $[1:N]$. For the joint distribution of $Y$ and $\hat{X}_{1}$, we define

p(y|\hat{x}_{1})=\begin{cases}\dfrac{p_{2}-\frac{ND_{1}}{2(N-1)}}{1-\frac{ND_{1}}{N-1}}, & \text{if }y=0,\ \hat{x}_{1}\text{ is odd, or }y=1,\ \hat{x}_{1}\text{ is even}\\[2ex] \dfrac{1-p_{2}-\frac{ND_{1}}{2(N-1)}}{1-\frac{ND_{1}}{N-1}}, & \text{if }y=0,\ \hat{x}_{1}\text{ is even, or }y=1,\ \hat{x}_{1}\text{ is odd}.\end{cases}

We see that $p(y|\hat{x}_{1})\geq 0$ for any $D_{1}\leq\frac{2(N-1)p_{2}}{N}$, i.e., $(D_{1},D_{2},D_{s})\in\mathcal{D}_{1}$. Then, using $p(y|x_{1})=\sum_{\hat{x}_{1}}p(y|x_{1},\hat{x}_{1})p(\hat{x}_{1}|x_{1})=\sum_{\hat{x}_{1}}p(y|\hat{x}_{1})p(\hat{x}_{1}|x_{1})$ (by the Markov chain $Y-\hat{X}_{1}-X_{1}$), we can verify that the above distribution induces the same conditional probability $p(y|x_{1})\in\{p_{2},1-p_{2}\}$ as defined at the beginning of this section. Thus, we have constructed a feasible $p(x_{1},y,\hat{x}_{1})$ that achieves expected distortion $\mathbb{E}d_{1}(X_{1},\hat{X}_{1})=D_{1}$ (which also satisfies the semantic distortion constraint, since $D_{1}\leq D_{s}^{0}$) and mutual information

I(X_{1};\hat{X}_{1},\hat{S}|Y)
=I(X_{1};\hat{X}_{1}|Y)
=H(X_{1}|Y)-H(X_{1}|\hat{X}_{1},Y)
=H(X_{1}|Y)-H(X_{1}|\hat{X}_{1})
=h_{b}(p_{2})+\log(N/2)-h_{b}(D_{1})-D_{1}\log(N-1),

where the first equality follows from $\hat{S}=\hat{X}_{1}$, the third equality follows from the Markov chain $Y-\hat{X}_{1}-X_{1}$, and the last equality follows from the joint distribution of $(X_{1},Y)$ and the distribution in (46).
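The following is a minimal numerical sanity check of this construction (not part of the proof), assuming hypothetical example values $N=8$, $p_{2}=0.3$, $D_{1}=0.1$; all variable names are ours. It builds the joint distribution of $(\hat{X}_{1},X_{1},Y)$ from the uniform $q$, the test channel (46), and $p(y|\hat{x}_{1})$ above, then confirms that $X_{1}$ is uniform, that the induced $p(y|x_{1})$ equals $p_{2}$ or $1-p_{2}$ according to the parity of $x_{1}$, and that $I(X_{1};\hat{X}_{1}|Y)$ matches $h_{b}(p_{2})+\log(N/2)-h_{b}(D_{1})-D_{1}\log(N-1)$ with logarithms base 2.

```python
import numpy as np

# Hypothetical example parameters: N even, 0 <= D1 <= 2(N-1)p2/N.
N, p2, D1 = 8, 0.3, 0.1

def hb(p):
    return 0.0 if p in (0.0, 1.0) else float(-p*np.log2(p) - (1-p)*np.log2(1-p))

def H(pmf):
    """Entropy (bits) of a pmf given as an array of probabilities."""
    pmf = np.asarray(pmf).ravel()
    pmf = pmf[pmf > 0]
    return float(-(pmf * np.log2(pmf)).sum())

# Test channel (46): rows indexed by x1_hat in [1:N], columns by x1 in [1:N].
P_x1_given_x1hat = np.full((N, N), D1 / (N - 1)) + np.eye(N) * (1 - D1 - D1 / (N - 1))

# Solve the linear system q^T P = (1/N, ..., 1/N); the solution is uniform.
q = np.linalg.solve(P_x1_given_x1hat.T, np.full(N, 1.0 / N))
assert np.allclose(q, 1.0 / N)

# p(y | x1_hat) as defined above ("odd"/"even" refer to the value x1_hat in [1:N]).
a = (p2 - N * D1 / (2 * (N - 1))) / (1 - N * D1 / (N - 1))      # y = 0, x1_hat odd
b = (1 - p2 - N * D1 / (2 * (N - 1))) / (1 - N * D1 / (N - 1))  # y = 0, x1_hat even
P_y_given_x1hat = np.array([[a, 1 - a] if (i + 1) % 2 else [b, 1 - b] for i in range(N)])

# Joint p(x1_hat, x1, y) from q, (46), and the Markov chain Y - X1_hat - X1.
joint = q[:, None, None] * P_x1_given_x1hat[:, :, None] * P_y_given_x1hat[:, None, :]

p_x1 = joint.sum(axis=(0, 2))
p_y_given_x1 = joint.sum(axis=0) / p_x1[:, None]
assert np.allclose(p_x1, 1.0 / N)                       # X1 is uniform
assert np.allclose(p_y_given_x1[0::2, 0], p2)           # odd x1:  P(Y=0|x1) = p2
assert np.allclose(p_y_given_x1[1::2, 0], 1 - p2)       # even x1: P(Y=0|x1) = 1-p2
assert np.isclose((q * np.diag(P_x1_given_x1hat)).sum(), 1 - D1)  # Hamming distortion = D1

# I(X1; X1_hat | Y) from the joint distribution, versus the closed-form expression.
I = H(joint.sum(axis=0)) + H(joint.sum(axis=1)) - H(joint.sum(axis=(0, 1))) - H(joint)
formula = hb(p2) + np.log2(N / 2) - hb(D1) - D1 * np.log2(N - 1)
print(I, formula)   # the two values should coincide
```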

For the other case, $D_{1}\geq D_{s}^{0}$, we only need to switch the roles of $(\hat{X}_{1},D_{1})$ and $(\hat{S},D_{s}^{0})$; the corresponding rate and distortions then follow accordingly, which proves the theorem.

Appendix F Proof of Theorem 8

The rate-distortion function in Theorem 1 satisfies

R(D_{1},D_{2},D_{s})=R_{\text{2d},X_{1}|Y}(D_{1},D_{s})+R_{X_{2}|Y}(D_{2}).   (47)

The second term is the solution to the quadratic Gaussian source coding problem with side information [36, Chapter 11], given as

R_{X_{2}|Y}(D_{2})=\frac{1}{2}\left(\log\frac{\sigma_{X_{2}|Y}}{D_{2}}\right)^{+},   (48)

where $\sigma_{X_{2}|Y}$ is the conditional variance of $X_{2}$ given $Y$. Note that $X_{2}$ is Gaussian conditioned on $Y$, i.e., $X_{2}|Y\sim\mathcal{N}\big(\frac{\sigma_{X_{2}Y}}{\sigma_{Y}}Y,\ \sigma_{X_{2}}-\frac{\sigma_{X_{2}Y}^{2}}{\sigma_{Y}}\big)$, which implies $\sigma_{X_{2}|Y}=\sigma_{X_{2}}-\frac{\sigma_{X_{2}Y}^{2}}{\sigma_{Y}}$.
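This conditional-variance identity can be checked with a quick Monte Carlo sketch; the covariance values below are hypothetical, and the $\sigma$ symbols denote variances/covariances as in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical covariance parameters: var_* are variances, cov_x2y a covariance.
var_x2, var_y, cov_x2y = 2.0, 1.5, 0.9

# Sample jointly Gaussian (X2, Y) and compare the residual variance of the linear
# MMSE estimate of X2 from Y with the closed-form conditional variance.
cov = np.array([[var_x2, cov_x2y], [cov_x2y, var_y]])
x2, y = rng.multivariate_normal([0.0, 0.0], cov, size=1_000_000).T

residual = x2 - (cov_x2y / var_y) * y
print(residual.var())                    # empirical sigma_{X2|Y}
print(var_x2 - cov_x2y**2 / var_y)       # formula: sigma_{X2} - sigma_{X2Y}^2 / sigma_Y
```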

For the first term, note that $R_{\text{2d},X_{1}|Y}(D_{1},D_{s})$ is lower bounded by both

R_{X_{1}|Y}(D_{1})=\min_{\mathbb{E}d_{1}(X_{1},\hat{X}_{1})\leq D_{1}}I(X_{1};\hat{X}_{1}|Y),   (49)

and

R_{S|Y}(D_{s})=\min_{\mathbb{E}d_{S}^{\prime}(X_{1},\hat{S})\leq D_{s}}I(X_{1};\hat{S}|Y).   (50)

Clearly, (49) is the quadratic Gaussian source coding problem with side information and is evaluated in the same way as (48).

We proceed to calculate (50), which is in fact the semantic rate-distortion function of indirect source coding with side information. Given $(X_{1},Y)$, $S$ is conditionally Gaussian with $S|(X_{1},Y)\sim\mathcal{N}\big(\frac{\sigma_{SX_{1}}}{\sigma_{X_{1}}}X_{1},\ \sigma_{S}-\frac{\sigma_{SX_{1}}^{2}}{\sigma_{X_{1}}}\big)$. It is shown in [42] that the semantic distortion can be rewritten as

\mathbb{E}d_{S}^{\prime}(X_{1},\hat{S})=\mathbb{E}\big[(S-\tilde{S}_{\mathrm{MMSE}})^{2}\big]+\mathbb{E}\big[(\tilde{S}_{\mathrm{MMSE}}-\hat{S})^{2}\big],   (51)

where $\tilde{S}_{\mathrm{MMSE}}=\frac{\sigma_{SX_{1}}}{\sigma_{X_{1}}}X_{1}$ is the MMSE estimate of $S$ upon observing $X_{1}$ and $Y$, and the first term on the right-hand side is the corresponding minimum mean squared error ($\mathrm{mmse}$, cf. (25)), i.e.,

\mathbb{E}\big[(S-\tilde{S}_{\mathrm{MMSE}})^{2}\big]=\mathrm{mmse}=\sigma_{S}-\frac{\sigma_{SX_{1}}^{2}}{\sigma_{X_{1}}}.   (52)
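The decomposition (51)-(52) can likewise be illustrated numerically. The sketch below is not part of the proof: it assumes hypothetical values of $\sigma_{S}$, $\sigma_{X_{1}}$, $\sigma_{SX_{1}}$ and an arbitrary reconstruction $\hat{S}$ formed from $X_{1}$; since the MMSE error is orthogonal to any function of the observation, the two sides of (51) agree.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical variances/covariance of the jointly Gaussian pair (S, X1).
var_s, var_x1, cov_sx1 = 1.0, 2.0, 0.8

cov = np.array([[var_s, cov_sx1], [cov_sx1, var_x1]])
s, x1 = rng.multivariate_normal([0.0, 0.0], cov, size=1_000_000).T

s_mmse = (cov_sx1 / var_x1) * x1     # MMSE estimate of S from X1
s_hat = 0.5 * x1 + 0.1               # an arbitrary reconstruction built from X1

lhs = np.mean((s - s_hat) ** 2)                                     # E d'_S(X1, S_hat)
rhs = np.mean((s - s_mmse) ** 2) + np.mean((s_mmse - s_hat) ** 2)   # right side of (51)
print(lhs, rhs)                                                     # agree up to MC error
print(np.mean((s - s_mmse) ** 2), var_s - cov_sx1**2 / var_x1)      # empirical vs (52)
```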

Then we consider a specific encoder which first estimates the semantic information using the MMSE estimator and then compresses the estimate, with side information, under a mean squared error distortion constraint of $D_{s}-\mathrm{mmse}$. The resulting achievable rate provides an upper bound on $R_{S|Y}(D_{s})$:

R_{S|Y}(D_{s})\leq\frac{1}{2}\left(\log\frac{\sigma_{\tilde{S}_{\mathrm{MMSE}}|Y}}{D_{s}-\mathrm{mmse}}\right)^{+}
=\frac{1}{2}\left(\log\frac{\sigma_{SX_{1}}^{2}\sigma_{X_{1}|Y}}{\sigma_{X_{1}}^{2}(D_{s}-\mathrm{mmse})}\right)^{+}
=\frac{1}{2}\left(\log\frac{\sigma_{SX_{1}}^{2}\left(\sigma_{X_{1}}-\frac{\sigma_{X_{1}Y}^{2}}{\sigma_{Y}}\right)}{\sigma_{X_{1}}^{2}(D_{s}-\mathrm{mmse})}\right)^{+}.   (53)

Furthermore, for $D_{s}\geq\mathrm{mmse}+\frac{\sigma_{SX_{1}}^{2}\sigma_{X_{1}|Y}}{\sigma_{X_{1}}^{2}}$, we have $R_{S|Y}(D_{s})=0$, which can be obtained by setting $\hat{S}=\mathbb{E}\big[\tilde{S}_{\mathrm{MMSE}}|Y\big]=\frac{\sigma_{SX_{1}}}{\sigma_{X_{1}}}\mathbb{E}[X_{1}|Y]$. For $D_{s}<\mathrm{mmse}+\frac{\sigma_{SX_{1}}^{2}\sigma_{X_{1}|Y}}{\sigma_{X_{1}}^{2}}$, we derive a lower bound on $R_{S|Y}(D_{s})$ as follows:

R_{S|Y}(D_{s})\geq I(X_{1};\hat{S}|Y)=H(X_{1}|Y)-H(X_{1}|\hat{S},Y)
=\frac{1}{2}\log\left(2\pi e\sigma_{X_{1}|Y}\right)-H\!\left(X_{1}-\tfrac{\sigma_{X_{1}}}{\sigma_{SX_{1}}}\hat{S}\,\middle|\,\hat{S},Y\right)
\geq\frac{1}{2}\log\left(2\pi e\sigma_{X_{1}|Y}\right)-H\!\left(X_{1}-\tfrac{\sigma_{X_{1}}}{\sigma_{SX_{1}}}\hat{S}\right)
\geq\frac{1}{2}\log\left(2\pi e\sigma_{X_{1}|Y}\right)-\frac{1}{2}\log\left(2\pi e\,\mathbb{E}\!\left[\left(X_{1}-\tfrac{\sigma_{X_{1}}}{\sigma_{SX_{1}}}\hat{S}\right)^{2}\right]\right)   (54)
\geq\frac{1}{2}\log\left(2\pi e\sigma_{X_{1}|Y}\right)-\frac{1}{2}\log\left(2\pi e\,\frac{\sigma_{X_{1}}^{2}(D_{s}-\mathrm{mmse})}{\sigma_{SX_{1}}^{2}}\right)   (55)
=\frac{1}{2}\log\frac{\sigma_{SX_{1}}^{2}\sigma_{X_{1}|Y}}{\sigma_{X_{1}}^{2}(D_{s}-\mathrm{mmse})}
=\frac{1}{2}\log\frac{\sigma_{SX_{1}}^{2}\left(\sigma_{X_{1}}-\frac{\sigma_{X_{1}Y}^{2}}{\sigma_{Y}}\right)}{\sigma_{X_{1}}^{2}(D_{s}-\mathrm{mmse})},   (56)

where (54) is due to the fact that the Gaussian distribution maximizes differential entropy for a given variance, and (55) follows from (51) and the semantic distortion constraint. Combining the upper bound in (53) and the lower bound in (56), we obtain

R_{S|Y}(D_{s})=\frac{1}{2}\left(\log\frac{\sigma_{SX_{1}}^{2}\left(\sigma_{X_{1}}-\frac{\sigma_{X_{1}Y}^{2}}{\sigma_{Y}}\right)}{\sigma_{X_{1}}^{2}(D_{s}-\mathrm{mmse})}\right)^{+}.   (57)

Thus we have

R(D_{1},D_{2},D_{s})\geq\max\big\{R_{X_{1}|Y}(D_{1}),R_{S|Y}(D_{s})\big\}+R_{X_{2}|Y}(D_{2})
=\frac{1}{2}\left(\log\frac{\sigma_{X_{2}}-\frac{\sigma_{X_{2}Y}^{2}}{\sigma_{Y}}}{D_{2}}\right)^{+}+\frac{1}{2}\left[\log\max\left(\frac{\sigma_{X_{1}}-\frac{\sigma_{X_{1}Y}^{2}}{\sigma_{Y}}}{D_{1}},\ \frac{\sigma_{SX_{1}}^{2}\left(\sigma_{X_{1}}-\frac{\sigma_{X_{1}Y}^{2}}{\sigma_{Y}}\right)}{\sigma_{X_{1}}^{2}(D_{s}-\mathrm{mmse})}\right)\right]^{+}.
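Before turning to achievability, the following sketch simply evaluates this lower bound for hypothetical covariance and distortion values (all names are ours; logarithms are taken base 2, and $D_{s}>\mathrm{mmse}$ is required for the semantic term).

```python
import numpy as np

def pos(x):
    """(x)^+ = max(x, 0)."""
    return max(x, 0.0)

# Hypothetical variances/covariances (sigma_* as in the text) and distortion targets.
var_s, var_x1, var_x2, var_y = 1.0, 2.0, 1.5, 1.2
cov_sx1, cov_x1y, cov_x2y = 0.8, 0.7, 0.6
D1, D2, Ds = 0.3, 0.4, 0.8

mmse = var_s - cov_sx1**2 / var_x1               # (52); Ds > mmse is assumed below
var_x1_given_y = var_x1 - cov_x1y**2 / var_y
var_x2_given_y = var_x2 - cov_x2y**2 / var_y

r_x2 = 0.5 * pos(np.log2(var_x2_given_y / D2))                            # (48)
r_x1 = 0.5 * pos(np.log2(var_x1_given_y / D1))                            # (49)
r_s  = 0.5 * pos(np.log2(cov_sx1**2 * var_x1_given_y
                         / (var_x1**2 * (Ds - mmse))))                    # (57)

# max{(log a)^+, (log b)^+} = (log max{a, b})^+, so this equals the displayed bound.
rate_lb = max(r_x1, r_s) + r_x2
print(f"R(D1, D2, Ds) >= {rate_lb:.4f} bits per symbol")
```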

To show the achievability, consider the following two cases.

  • For $\frac{D_{s}-\mathrm{mmse}}{\sigma_{SX_{1}}^{2}}\geq\frac{D_{1}}{\sigma_{X_{1}}^{2}}$, we first reconstruct $\hat{X}_{1}$ and $\hat{X}_{2}$ subject to the distortion constraints $D_{1}$ and $D_{2}$, and hence achieve $R_{X_{1}|Y}(D_{1})+R_{X_{2}|Y}(D_{2})$. Then we recover the semantic information by $\hat{S}=\frac{\sigma_{SX_{1}}}{\sigma_{X_{1}}}\hat{X}_{1}$ (a numerical sketch of this mapping follows the list), and the semantic distortion satisfies

    \mathbb{E}d_{S}^{\prime}(X_{1},\hat{S})=\mathrm{mmse}+\mathbb{E}\left[\left(\frac{\sigma_{SX_{1}}}{\sigma_{X_{1}}}X_{1}-\frac{\sigma_{SX_{1}}}{\sigma_{X_{1}}}\hat{X}_{1}\right)^{2}\right]
    \leq\mathrm{mmse}+\frac{\sigma_{SX_{1}}^{2}}{\sigma_{X_{1}}^{2}}D_{1}
    \leq D_{s}.
  • For $\frac{D_{s}-\mathrm{mmse}}{\sigma_{SX_{1}}^{2}}<\frac{D_{1}}{\sigma_{X_{1}}^{2}}$, we first reconstruct $\hat{S}$ and $\hat{X}_{2}$ subject to the distortion constraints $D_{s}$ and $D_{2}$, and hence achieve $R_{S|Y}(D_{s})+R_{X_{2}|Y}(D_{2})$. Then we recover $\hat{X}_{1}=\frac{\sigma_{X_{1}}}{\sigma_{SX_{1}}}\hat{S}$, and the distortion satisfies

    \mathbb{E}d_{1}(X_{1},\hat{X}_{1})=\mathbb{E}\left[\left(X_{1}-\frac{\sigma_{X_{1}}}{\sigma_{SX_{1}}}\hat{S}\right)^{2}\right]
    =\frac{\sigma_{X_{1}}^{2}}{\sigma_{SX_{1}}^{2}}\mathbb{E}\left[\left(\tilde{S}_{\mathrm{MMSE}}-\hat{S}\right)^{2}\right]
    =\frac{\sigma_{X_{1}}^{2}}{\sigma_{SX_{1}}^{2}}\left(\mathbb{E}d_{S}^{\prime}(X_{1},\hat{S})-\mathrm{mmse}\right)
    \leq\frac{\sigma_{X_{1}}^{2}}{\sigma_{SX_{1}}^{2}}(D_{s}-\mathrm{mmse})
    <D_{1}.
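As referred to in the first case above, here is a minimal Monte Carlo sketch of the mapping $\hat{S}=\frac{\sigma_{SX_{1}}}{\sigma_{X_{1}}}\hat{X}_{1}$; the covariance values and the Gaussian forward test channel used to generate $\hat{X}_{1}$ are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical parameters for jointly Gaussian (S, X1) and a target distortion D1.
var_s, var_x1, cov_sx1, D1 = 1.0, 2.0, 0.8, 0.3

cov = np.array([[var_s, cov_sx1], [cov_sx1, var_x1]])
s, x1 = rng.multivariate_normal([0.0, 0.0], cov, size=1_000_000).T

# Stand-in reconstruction with E[(X1 - X1_hat)^2] = D1 (Gaussian forward test channel).
x1_hat = (1 - D1 / var_x1) * x1 + rng.normal(0.0, np.sqrt(D1 * (1 - D1 / var_x1)), x1.size)

# First-case mapping: recover the semantic information from X1_hat.
s_hat = (cov_sx1 / var_x1) * x1_hat

mmse = var_s - cov_sx1**2 / var_x1
lhs = np.mean((s - s_hat) ** 2)                                          # E d'_S(X1, S_hat)
rhs = mmse + (cov_sx1**2 / var_x1**2) * np.mean((x1 - x1_hat) ** 2)      # first display above
print(np.mean((x1 - x1_hat) ** 2))   # ~ D1
print(lhs, rhs)                      # agree; both at most mmse + (sigma_SX1/sigma_X1)^2 * D1
```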

This establishes the achievability and thus completes the proof.

References

  • [1] C. E. Shannon, “A mathematical theory of communication,” The Bell System Technical Journal, vol. 27, no. 3, pp. 379–423, 1948.
  • [2] ——, “Coding theorems for a discrete source with a fidelity criterion,” IRE Nat. Conv. Rec., pt. 4, vol. 7, pp. 142–163, 1959.
  • [3] R. M. Gray, “Conditional rate-distortion theory,” Stanford Electron. Lab., Stanford, CA, USA, Tech. Rep. 6502-2, Oct. 1972.
  • [4] ——, “A new class of lower bounds to information rates of stationary sources via conditional rate-distortion functions,” IEEE Transactions on Information Theory, vol. 19, no. 4, pp. 480–489, 1973.
  • [5] A. Wyner and J. Ziv, “The rate-distortion function for source coding with side information at the decoder,” IEEE Transactions on Information Theory, vol. 22, no. 1, pp. 1–10, 1976.
  • [6] A. D. Wyner, “The rate-distortion function for source coding with side information at the decoder—II: General sources,” Information and Control, vol. 38, no. 1, pp. 60–80, 1978.
  • [7] A. Kaspi, “Rate-distortion function when side-information may be present at the decoder,” IEEE Transactions on Information Theory, vol. 40, no. 6, pp. 2031–2034, 1994.
  • [8] H. Permuter and T. Weissman, “Source coding with a side information “vending machine”,” IEEE Transactions on Information Theory, vol. 57, no. 7, pp. 4530–4544, 2011.
  • [9] S. Watanabe, “The rate-distortion function for product of two sources with side-information at decoders,” IEEE Transactions on Information Theory, vol. 59, no. 9, pp. 5678–5691, 2013.
  • [10] C. Heegard and T. Berger, “Rate distortion when side information may be absent,” IEEE Transactions on Information Theory, vol. 31, no. 6, pp. 727–734, 1985.
  • [11] A. Kimura and T. Uyematsu, “Multiterminal source coding with complementary delivery,” in 2006 IEEE International Symposium on Information Theory and its Applications (ISITA), Seoul, Korea, Oct. 2006.
  • [12] M. Tasto and P. Wintz, “A bound on the rate-distortion function and application to images,” IEEE Transactions on Information Theory, vol. 18, no. 1, pp. 150–159, 1972.
  • [13] A. Aaron, S. Rane, R. Zhang, and B. Girod, “Wyner-Ziv coding for video: Applications to compression and error resilience,” in Proc. Data Compression Conference (DCC), 2003, pp. 93–102.
  • [14] J. D. Gibson and J. Hu, Rate Distortion Bounds for Voice and Video.   Foundations and Trends in Communications and Information Theory, Jan. 2014, vol. 10, no. 4.
  • [15] Y. Wang, A. Reibman, and S. Lin, “Multiple description coding for video delivery,” Proceedings of the IEEE, vol. 93, no. 1, pp. 57–70, 2005.
  • [16] T.-C. Chen, P. Fleischer, and K.-H. Tzou, “Multiple block-size transform video coding using a subband structure,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 1, no. 1, pp. 59–71, 1991.
  • [17] B. Zeng, Introduction to Digital Image and Video Compression and Processing, ser. The Morgan Kaufmann Series in Multimedia Information and Systems.   San Francisco, CA, USA: Morgan Kaufmann, 2002.
  • [18] A. Habibian, T. V. Rozendaal, J. Tomczak, and T. Cohen, “Video compression with rate-distortion autoencoders,” in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 7032–7041.
  • [19] A. Habibi and P. Wintz, “Image coding by linear transformation and block quantization,” IEEE Transactions on Communication Technology, vol. 19, no. 1, pp. 50–62, 1971.
  • [20] J. Hu and J. D. Gibson, “Rate distortion bounds for blocking and intra-frame prediction in videos,” in 2009 Conference Record of the Forty-Third Asilomar Conference on Signals, Systems and Computers, 2009, pp. 573–577.
  • [21] B. Güler, A. Yener, and A. Swami, “The semantic communication game,” IEEE Transactions on Cognitive Communications and Networking, vol. 4, no. 4, pp. 787–802, 2018.
  • [22] Y. Blau and T. Michaeli, “Rethinking lossy compression: The rate-distortion-perception tradeoff,” in Proceedings of the 36th International Conference on Machine Learning (ICML), vol. 97, Long Beach, CA, USA, Jun. 2019, pp. 675–685.
  • [23] H. Xie and Z. Qin, “A lite distributed semantic communication system for internet of things,” IEEE Journal on Selected Areas in Communications, vol. 39, no. 1, pp. 142–153, 2021.
  • [24] Z. Weng and Z. Qin, “Semantic communication systems for speech transmission,” IEEE Journal on Selected Areas in Communications, vol. 39, no. 8, pp. 2434–2444, 2021.
  • [25] W. Wang, J. Wang, and J. Chen, “Adaptive block-based compressed video sensing based on saliency detection and side information,” Entropy, vol. 23, no. 9, 2021.
  • [26] Y. Guo, Y. Liu, T. Georgiou, and M. S. Lew, “A review of semantic segmentation using deep neural networks,” International Journal of Multimedia Information Retrieval, vol. 7, pp. 87–93, 2017.
  • [27] B. Zhao, J. Feng, X. Wu, and S. Yan, “A survey on deep learning-based fine-grained object classification and semantic segmentation,” International Journal of Automation and Computing, vol. 14, pp. 119–135, 2017.
  • [28] H. Witsenhausen, “Indirect rate distortion problems,” IEEE Transactions on Information Theory, vol. 26, no. 5, pp. 518–521, 1980.
  • [29] J. Liu, W. Zhang, and H. V. Poor, “A rate-distortion framework for characterizing semantic information,” in 2021 IEEE International Symposium on Information Theory (ISIT), Melbourne, Australia, 2021, pp. 2894–2899.
  • [30] P. A. Stavrou and M. Kountouris, “A rate distortion approach to goal-oriented communication,” in 2022 IEEE International Symposium on Information Theory (ISIT), Espoo, Finland, Jun. 2022, pp. 778–783.
  • [31] J. Liu, W. Zhang, and H. V. Poor, “An indirect rate-distortion characterization for semantic sources: General model and the case of Gaussian observation,” arXiv preprint, 2022. [Online]. Available: https://arxiv.org/abs/2201.12477
  • [32] P. A. Stavrou and M. Kountouris, “The role of fidelity in goal-oriented semantic communication: A rate distortion approach,” TechRxiv. Preprint, 2022. [Online]. Available: https://doi.org/10.36227/techrxiv.20098970.v1
  • [33] T. Berger, Rate Distortion Theory: A Mathematical Basis for Data Compression.   Englewood Cliffs, NJ, USA: Prentice-Hall, 1971.
  • [34] ——, “Multiterminal source coding,” in The Information Theory Approach to Communications (CISM Courses and Lectures, no. 229), G. Longo, Ed.   Vienna/New York: Springer-Verlag, 1978.
  • [35] R. W. Yeung, Information Theory and Network Coding.   New York, NY, USA: Springer, 2008.
  • [36] A. El Gamal and Y.-H. Kim, Network Information Theory.   Cambridge University Press, 2011.
  • [37] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. (Wiley Series in Telecommunications and Signal Processing).   Hoboken, NJ, USA: Wiley-Interscience, 2006.
  • [38] A. El Gamal and T. Cover, “Achievable rates for multiple descriptions,” IEEE Transactions on Information Theory, vol. 28, no. 6, pp. 851–857, 1982.
  • [39] D. Slepian and J. Wolf, “Noiseless coding of correlated information sources,” IEEE Transactions on Information Theory, vol. 19, no. 4, pp. 471–480, 1973.
  • [40] M. Carter, “Source coding of composite sources,” Ph.D. Dissertation, Department of Computer Information and Control Engineering, University of Michigan, Ann Arbor, MI, 1984.
  • [41] A. Kipnis, S. Rini, and A. J. Goldsmith, “The indirect rate-distortion function of a binary i.i.d. source,” in 2015 IEEE Information Theory Workshop - Fall (ITW), 2015, pp. 352–356.
  • [42] J. Wolf and J. Ziv, “Transmission of noisy information to a noisy receiver with minimum distortion,” IEEE Transactions on Information Theory, vol. 16, no. 4, pp. 406–411, 1970.