
Tractable Computation of Expected Kernels (Supplementary material)

Wenzhe Li (Tsinghua University), Zhe Zeng, Antonio Vergari, Guy Van den Broeck (University of California, Los Angeles)
scott.wenzhe.li@gmail.com, {zhezeng, aver, guyvdb}@cs.ucla.edu
Authors contributed equally. This research was performed while W.L. was visiting UCLA remotely.

1 Proofs

We first present another hardness result for the computation of expected kernels, complementing Theorem LABEL:thm:_hardness_for_expected_kernels.

Theorem 1.1.

There exist representations of distributions $p$ and $q$ that are smooth and compatible, yet computing the expected kernel is already #P-hard even for a kernel $k$ as simple as the Kronecker delta.

Proof.

(an alternative proof to the one in Section LABEL:sec:_tractable_computation_of_expected_kernels) Consider the case when the positive definite kernel $k$ is a Kronecker delta function, defined by $k(\mathbf{x},\mathbf{x}^{\prime})=1$ if and only if $\mathbf{x}=\mathbf{x}^{\prime}$. Moreover, assume that the probabilistic circuit $p$ is smooth and decomposable, and that $q=p$. Then computing the expected kernel is equivalent to computing the power of the probabilistic circuit $p$, that is, $M_{k}(p,q)=\sum_{\mathbf{x}\in\mathcal{X}}p^{2}(\mathbf{x})$, with $\mathcal{X}$ being the domain of variables $\mathbf{X}$. vergari2021compositional proves that computing $\sum_{\mathbf{x}\in\mathcal{X}}p^{2}(\mathbf{x})$ is #P-hard even when the PC $p$ is smooth and decomposable, which concludes our proof. ∎

Proposition LABEL:pro:_recursive-sum-nodes

Let $p_{n}$ and $q_{m}$ be two compatible probabilistic circuits over variables $\mathbf{X}$ whose output units $n$ and $m$ are sum units, denoted by $p_{n}(\mathbf{X})=\sum_{i\in\mathsf{in}(n)}\theta_{i}p_{i}(\mathbf{X})$ and $q_{m}(\mathbf{X})=\sum_{j\in\mathsf{in}(m)}\delta_{j}q_{j}(\mathbf{X})$ respectively. Let $k_{l}$ be a kernel circuit whose output unit is a sum unit $l$, denoted by $k_{l}(\mathbf{X},\mathbf{X}^{\prime})=\sum_{c\in\mathsf{in}(l)}\gamma_{c}k_{c}(\mathbf{X},\mathbf{X}^{\prime})$. Then it holds that

M_{k_{l}}(p_{n},q_{m})=\sum_{i\in\mathsf{in}(n)}\theta_{i}\sum_{j\in\mathsf{in}(m)}\delta_{j}\sum_{c\in\mathsf{in}(l)}\gamma_{c}\,M_{k_{c}}(p_{i},q_{j}). \qquad (1)
Proof.

$M_{k_{l}}(p_{n},q_{m})$ can be expanded as

\begin{align*}
M_{k_{l}}(p_{n},q_{m}) &= \sum_{\mathbf{x}}\sum_{\mathbf{x}^{\prime}}p_{n}(\mathbf{x})q_{m}(\mathbf{x}^{\prime})k_{l}(\mathbf{x},\mathbf{x}^{\prime})\\
&= \sum_{\mathbf{x}}\sum_{\mathbf{x}^{\prime}}\sum_{i\in\mathsf{in}(n)}\theta_{i}p_{i}(\mathbf{x})\sum_{j\in\mathsf{in}(m)}\delta_{j}q_{j}(\mathbf{x}^{\prime})\sum_{c\in\mathsf{in}(l)}\gamma_{c}k_{c}(\mathbf{x},\mathbf{x}^{\prime})\\
&= \sum_{i\in\mathsf{in}(n)}\theta_{i}\sum_{j\in\mathsf{in}(m)}\delta_{j}\sum_{c\in\mathsf{in}(l)}\gamma_{c}\,M_{k_{c}}(p_{i},q_{j}). \qquad ∎
\end{align*}
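The recursion for sum units can be checked numerically by brute force. The sketch below (plain Python) compares both sides of Equation 1 on a tiny domain of two binary variables; all distributions, mixture weights, and kernels here are made-up toy values, not from the paper.

```python
# A brute-force sanity check of the sum-unit recursion (Eq. 1) on a tiny
# discrete domain; all distributions and kernels below are toy choices.
import itertools

X = list(itertools.product([0, 1], repeat=2))  # domain of two binary variables

# input distributions of the sum units (arbitrary, normalized)
p1 = {x: v for x, v in zip(X, [0.1, 0.2, 0.3, 0.4])}
p2 = {x: v for x, v in zip(X, [0.4, 0.3, 0.2, 0.1])}
q1 = {x: v for x, v in zip(X, [0.25, 0.25, 0.25, 0.25])}
q2 = {x: v for x, v in zip(X, [0.7, 0.1, 0.1, 0.1])}
theta, delta = (0.6, 0.4), (0.3, 0.7)   # sum-unit weights of p_n and q_m

# two toy kernels and a sum-unit kernel circuit k_l = 0.5*k1 + 0.5*k2
k1 = lambda x, y: 1.0 if x == y else 0.0
k2 = lambda x, y: 1.0 / (1 + sum(a != b for a, b in zip(x, y)))
gamma = (0.5, 0.5)

def M(k, p, q):  # expected kernel by explicit double sum over the domain
    return sum(p[x] * q[y] * k(x, y) for x in X for y in X)

p_n = {x: theta[0]*p1[x] + theta[1]*p2[x] for x in X}
q_m = {x: delta[0]*q1[x] + delta[1]*q2[x] for x in X}
k_l = lambda x, y: gamma[0]*k1(x, y) + gamma[1]*k2(x, y)

lhs = M(k_l, p_n, q_m)                   # left-hand side of Eq. 1
rhs = sum(t * d * g * M(kc, pi, qj)      # right-hand side of Eq. 1
          for t, pi in zip(theta, (p1, p2))
          for d, qj in zip(delta, (q1, q2))
          for g, kc in zip(gamma, (k1, k2)))
assert abs(lhs - rhs) < 1e-12
```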

Proposition LABEL:pro:_recursive-product-nodes

Let $p_{n}$ and $q_{m}$ be two compatible probabilistic circuits over variables $\mathbf{X}$ whose output units $n$ and $m$ are product units, denoted by $p_{n}(\mathbf{X})=p_{n_{\mathsf{L}}}(\mathbf{X}_{\mathsf{L}})\,p_{n_{\mathsf{R}}}(\mathbf{X}_{\mathsf{R}})$ and $q_{m}(\mathbf{X})=q_{m_{\mathsf{L}}}(\mathbf{X}_{\mathsf{L}})\,q_{m_{\mathsf{R}}}(\mathbf{X}_{\mathsf{R}})$. Let $k$ be a kernel circuit that is kernel-compatible with the circuit pair $p_{n}$ and $q_{m}$, with its output unit being a product unit, denoted by $k(\mathbf{X},\mathbf{X}^{\prime})=k_{\mathsf{L}}(\mathbf{X}_{\mathsf{L}},\mathbf{X}_{\mathsf{L}}^{\prime})\,k_{\mathsf{R}}(\mathbf{X}_{\mathsf{R}},\mathbf{X}_{\mathsf{R}}^{\prime})$. Then it holds that

M_{k}(p_{n},q_{m})=M_{k_{\mathsf{L}}}(p_{n_{\mathsf{L}}},q_{m_{\mathsf{L}}})\cdot M_{k_{\mathsf{R}}}(p_{n_{\mathsf{R}}},q_{m_{\mathsf{R}}}).
Proof.

$M_{k}(p_{n},q_{m})$ can be expanded as

\begin{align*}
M_{k}(p_{n},q_{m}) &= \sum_{\mathbf{x}}\sum_{\mathbf{x}^{\prime}}p_{n}(\mathbf{x})q_{m}(\mathbf{x}^{\prime})k(\mathbf{x},\mathbf{x}^{\prime})\\
&= \sum_{\mathbf{x}}\sum_{\mathbf{x}^{\prime}}p_{n_{\mathsf{L}}}(\mathbf{x}_{\mathsf{L}})p_{n_{\mathsf{R}}}(\mathbf{x}_{\mathsf{R}})\,q_{m_{\mathsf{L}}}(\mathbf{x}^{\prime}_{\mathsf{L}})q_{m_{\mathsf{R}}}(\mathbf{x}^{\prime}_{\mathsf{R}})\,k_{\mathsf{L}}(\mathbf{x}_{\mathsf{L}},\mathbf{x}^{\prime}_{\mathsf{L}})k_{\mathsf{R}}(\mathbf{x}_{\mathsf{R}},\mathbf{x}^{\prime}_{\mathsf{R}})\\
&= M_{k_{\mathsf{L}}}(p_{n_{\mathsf{L}}},q_{m_{\mathsf{L}}})\cdot M_{k_{\mathsf{R}}}(p_{n_{\mathsf{R}}},q_{m_{\mathsf{R}}}). \qquad ∎
\end{align*}
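The product-unit factorization can likewise be verified by brute force. The sketch below (plain Python, toy distributions and kernels chosen arbitrarily) checks that the double sum over the joint domain of two binary variables $(X_{\mathsf{L}}, X_{\mathsf{R}})$ equals the product of the two per-scope expected kernels.

```python
# Brute-force check of the product-unit factorization on two binary variables
# X = (X_L, X_R); the factors below are arbitrary toy distributions/kernels.
pL = {0: 0.3, 1: 0.7}; pR = {0: 0.6, 1: 0.4}   # p_n = pL * pR
qL = {0: 0.5, 1: 0.5}; qR = {0: 0.2, 1: 0.8}   # q_m = qL * qR
kL = lambda a, b: 1.0 if a == b else 0.25       # k = kL * kR
kR = lambda a, b: 1.0 if a == b else 0.5

def M1(k, p, q):  # expected kernel over a single variable
    return sum(p[a] * q[b] * k(a, b) for a in p for b in q)

# left-hand side: explicit double sum over the joint domain
lhs = sum(pL[xl]*pR[xr] * qL[yl]*qR[yr] * kL(xl, yl)*kR(xr, yr)
          for xl in (0, 1) for xr in (0, 1)
          for yl in (0, 1) for yr in (0, 1))
# right-hand side: product of the per-scope expected kernels
rhs = M1(kL, pL, qL) * M1(kR, pR, qR)
assert abs(lhs - rhs) < 1e-12
```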

Corollary LABEL:cor:_tractable_mmd.

Following the assumptions in Theorem LABEL:thm:_double_sum_complexity, the squared maximum mean discrepancy $\mathit{MMD}^{2}[\mathcal{H},p,q]$ in the RKHS $\mathcal{H}$ associated with kernel $k$, as defined in gretton2012kernel, can be tractably computed.

Proof.

This is an immediate result of Theorem LABEL:thm:_double_sum_complexity, obtained by rewriting the MMD as defined in gretton2012kernel as a linear combination of expected kernels, that is, $\mathit{MMD}^{2}[\mathcal{H},p,q]=M_{k}(p,p)+M_{k}(q,q)-2M_{k}(p,q)$. ∎
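On a finite domain, where the expected kernel is the bilinear form $p^{\top}Kq$ over a kernel matrix $K$, this decomposition can be checked numerically. The sketch below (toy data: a random PSD matrix standing in for the kernel, random normalized distributions) verifies that $(p-q)^{\top}K(p-q)$ equals the stated combination of expected kernels.

```python
# Numeric check, on a small domain, that squared MMD equals the combination
# of expected kernels: (p-q)^T K (p-q) = M_k(p,p) + M_k(q,q) - 2 M_k(p,q).
import numpy as np

rng = np.random.default_rng(0)
D = 5                                   # toy domain of 5 states
p = rng.random(D); p /= p.sum()         # random distribution p
q = rng.random(D); q /= q.sum()         # random distribution q
A = rng.random((D, D))
K = A @ A.T                             # a random PSD kernel matrix

def M(K, p, q):                         # expected kernel as a bilinear form
    return float(p @ K @ q)

mmd2 = float((p - q) @ K @ (p - q))     # squared MMD between p and q
assert abs(mmd2 - (M(K, p, p) + M(K, q, q) - 2 * M(K, p, q))) < 1e-10
```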

Corollary LABEL:cor:_tractable_kdsd.

Following the assumptions in Theorem LABEL:thm:_double_sum_complexity, if the probabilistic circuit $p$ further satisfies determinism, the kernelized discrete Stein discrepancy (KDSD) $\mathbb{D}^{2}(q\parallel p)=\mathbb{E}_{\mathbf{x},\mathbf{x}^{\prime}\sim q}[k_{p}(\mathbf{x},\mathbf{x}^{\prime})]$ in the RKHS associated with kernel $k$, as defined in yang2018goodness, can be tractably computed.

Before proving Corollary LABEL:cor:_tractable_kdsd, we first give the definitions needed to define the KDSD, so as to be self-contained.

Definition 1.2 (Cyclic permutation).

For a finite set $\mathcal{X}$ with $D=|\mathcal{X}|$, a cyclic permutation $\neg:\mathcal{X}\rightarrow\mathcal{X}$ is a bijective function such that, for some ordering $a_{1},a_{2},\cdots,a_{D}$ of the elements in $\mathcal{X}$, $\neg a_{i}=a_{(i\bmod D)+1}$ for all $i=1,2,\cdots,D$.

Definition 1.3 (Partial difference operator).

For any function $f:\mathcal{X}\rightarrow\mathbb{R}$ over variables $\mathbf{X}=(X_{1},\cdots,X_{D})$, the partial difference operator is defined as

\Delta^{*}_{i}f(\mathbf{X}):=f(\mathbf{X})-f(\neg_{i}\mathbf{X}),\quad\forall i=1,\cdots,D, \qquad (2)

with $\neg_{i}\mathbf{X}:=(X_{1},\cdots,\neg X_{i},\cdots,X_{D})$. Moreover, the difference operator is defined as $\Delta^{*}f(\mathbf{X}):=(\Delta^{*}_{1}f(\mathbf{X}),\cdots,\Delta^{*}_{D}f(\mathbf{X}))$. Similarly, let $\neg^{-1}$ be the inverse permutation of $\neg$, and let $\Delta$ denote the difference operator defined with respect to $\neg^{-1}$, i.e.,

\Delta_{i}f(\mathbf{X}):=f(\mathbf{X})-f(\neg^{-1}_{i}\mathbf{X}),\quad i=1,\cdots,D.
Definition 1.4 (Difference score function).

The (difference) score function is defined as $\bm{s}_{p}(\mathbf{X}):=\frac{\Delta^{*}p(\mathbf{X})}{p(\mathbf{X})}$, a vector-valued function over the $D$ variables with its $i$-th dimension being

\bm{s}_{p,i}(\mathbf{X}):=\frac{\Delta^{*}_{i}p(\mathbf{X})}{p(\mathbf{X})}=1-\frac{p(\neg_{i}\mathbf{X})}{p(\mathbf{X})},\quad i=1,2,\cdots,D. \qquad (3)
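For binary variables the only cyclic permutation is the bit flip $\neg x = 1-x$, which makes the score function easy to compute exactly. The sketch below (plain Python; the distribution $p$ is a made-up toy table over three binary variables) evaluates $\bm{s}_{p,i}(\mathbf{x}) = 1 - p(\neg_{i}\mathbf{x})/p(\mathbf{x})$ per Equation 3.

```python
# The cyclic permutation and difference score on binary variables, where the
# cyclic permutation is the bit flip ¬x = 1 - x; p below is a toy table.
import itertools

X = list(itertools.product([0, 1], repeat=3))
p = {x: w for x, w in zip(X, [0.05, 0.10, 0.15, 0.20, 0.05, 0.10, 0.15, 0.20])}

def flip(x, i):                     # ¬_i x: flip the i-th coordinate only
    return tuple(1 - v if j == i else v for j, v in enumerate(x))

def score(p, x):                    # s_{p,i}(x) = 1 - p(¬_i x) / p(x), Eq. 3
    return [1 - p[flip(x, i)] / p[x] for i in range(len(x))]

s = score(p, (0, 1, 0))             # the D-dimensional score vector at one state
```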

Given the above definitions, the discrete Stein discrepancy between two distributions $p$ and $q$ is defined as

\mathbb{D}(q\parallel p):=\sup_{\bm{f}\in\mathcal{H}}\mathbb{E}_{\mathbf{x}\sim q(\mathbf{X})}[\mathcal{T}_{p}\bm{f}(\mathbf{x})], \qquad (4)

where $\bm{f}:\mathcal{X}\rightarrow\mathbb{R}^{D}$ is a test function belonging to some function space $\mathcal{H}$, and $\mathcal{T}_{p}$ is the so-called Stein difference operator, defined as

\mathcal{T}_{p}\bm{f}(\mathbf{x})=\bm{s}_{p}(\mathbf{x})\bm{f}(\mathbf{x})^{\top}-\Delta\bm{f}(\mathbf{x}). \qquad (5)

If the function space $\mathcal{H}$ is a reproducing kernel Hilbert space (RKHS) on $\mathcal{X}$ equipped with a kernel function $k(\cdot,\cdot)$, then the kernelized discrete Stein discrepancy (KDSD) is defined and admits a closed-form representation as

\mathbb{S}(q\parallel p):=\mathbb{D}^{2}(q\parallel p)=\mathbb{E}_{\mathbf{x},\mathbf{x}^{\prime}\sim q}[k_{p}(\mathbf{x},\mathbf{x}^{\prime})]. \qquad (6)

Here, the kernel function $k_{p}$ is defined as

\begin{align*}
k_{p}(\mathbf{x},\mathbf{x}^{\prime}) &= \bm{s}_{p}(\mathbf{x})^{\top}k(\mathbf{x},\mathbf{x}^{\prime})\bm{s}_{p}(\mathbf{x}^{\prime})-\bm{s}_{p}(\mathbf{x})^{\top}\Delta^{\mathbf{x}^{\prime}}k(\mathbf{x},\mathbf{x}^{\prime})\\
&\quad-\Delta^{\mathbf{x}}k(\mathbf{x},\mathbf{x}^{\prime})^{\top}\bm{s}_{p}(\mathbf{x}^{\prime})+\mathit{tr}(\Delta^{\mathbf{x},\mathbf{x}^{\prime}}k(\mathbf{x},\mathbf{x}^{\prime})),
\end{align*}

where the difference operator Δ𝐱\Delta^{\mathbf{x}} is as in Definition 1.3. The superscript 𝐱\mathbf{x} specifies the variables that it operates on.
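A useful sanity check on the Stein kernel $k_{p}$ is that the KDSD of $p$ against itself vanishes, $\mathbb{E}_{\mathbf{x},\mathbf{x}^{\prime}\sim p}[k_{p}(\mathbf{x},\mathbf{x}^{\prime})]=0$. The brute-force sketch below (plain Python) assumes binary variables, where $\neg$ is the bit flip and is its own inverse; the distribution $p$ and base kernel $k$ are toy choices, and $k_{p}$ is expanded per dimension using the score ratios.

```python
# Sanity check of the closed-form Stein kernel k_p on binary variables:
# the KDSD of p against itself, E_{x,x'~p}[k_p(x,x')], must vanish.
import itertools

X = list(itertools.product([0, 1], repeat=2))
p = {x: w for x, w in zip(X, [0.1, 0.2, 0.3, 0.4])}           # toy distribution
k = lambda x, y: 2.0 ** (-sum(a != b for a, b in zip(x, y)))  # toy base kernel
D = 2                                                          # two variables

def flip(x, i):                 # ¬_i x: flip the i-th coordinate (involution)
    return tuple(1 - v if j == i else v for j, v in enumerate(x))

def k_p(x, y):                  # per-dimension expansion of the Stein kernel
    total = 0.0
    for i in range(D):
        rx, ry = p[flip(x, i)] / p[x], p[flip(y, i)] / p[y]
        total += (rx * ry * k(x, y) - rx * k(x, flip(y, i))
                  - ry * k(flip(x, i), y) + k(flip(x, i), flip(y, i)))
    return total

# brute-force expectation under p ⊗ p over the whole domain
kdsd = sum(p[x] * p[y] * k_p(x, y) for x in X for y in X)
assert abs(kdsd) < 1e-12
```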

Proof.

[Proof of Corollary LABEL:cor:_tractable_kdsd] By the definition of the difference score function, the closed form of the KDSD can be further rewritten as follows.

\begin{align*}
&\mathbb{E}_{\mathbf{x},\mathbf{x}^{\prime}\sim q}[k_{p}(\mathbf{x},\mathbf{x}^{\prime})]\\
&=\sum_{i=1}^{D}\mathbb{E}_{\mathbf{x},\mathbf{x}^{\prime}\sim q}\Big[\frac{p(\neg_{i}\mathbf{x})p(\neg_{i}\mathbf{x}^{\prime})}{p(\mathbf{x})p(\mathbf{x}^{\prime})}k(\mathbf{x},\mathbf{x}^{\prime})-\frac{p(\neg_{i}\mathbf{x})}{p(\mathbf{x})}k(\mathbf{x},\neg_{i}\mathbf{x}^{\prime})\\
&\qquad-\frac{p(\neg_{i}\mathbf{x}^{\prime})}{p(\mathbf{x}^{\prime})}k(\neg_{i}\mathbf{x},\mathbf{x}^{\prime})+k(\neg_{i}\mathbf{x},\neg_{i}\mathbf{x}^{\prime})\Big]\\
&=\sum_{i=1}^{D}\Big[M_{k}\big(q\tfrac{\tilde{p}_{i}}{p},q\tfrac{\tilde{p}_{i}}{p}\big)-M_{k}\big(q\tfrac{\tilde{p}_{i}}{p},\tilde{q}_{i}\big)-M_{k}\big(\tilde{q}_{i},q\tfrac{\tilde{p}_{i}}{p}\big)+M_{k}\big(\tilde{q}_{i},\tilde{q}_{i}\big)\Big] \qquad (7)
\end{align*}

where $D$ denotes the number of variables in $\mathbf{X}$, and the distributions $\tilde{p}_{i}(\mathbf{X}):=p(\neg_{i}\mathbf{X})$ and $\tilde{q}_{i}(\mathbf{X}):=q(\neg_{i}\mathbf{X})$. Notice that the cyclic permutation $\neg_{i}$ operates on an individual variable, and thus the resulting PCs $\tilde{p}_{i}$ and $\tilde{q}_{i}$ retain the same structural properties as the PCs $p$ and $q$ respectively. To prove that the KDSD can be tractably computed, it suffices to prove that the expected kernel terms in Equation 7 can be tractably computed.

For a deterministic and structured-decomposable PC $p$, since the PC $\tilde{p}_{i}$ retains the same structure, the ratio $\tilde{p}_{i}/p$ is again a smooth circuit compatible with $p$ by vergari2021compositional. Moreover, since the PCs $p$ and $q$ are compatible, the circuit $\tilde{p}_{i}/p$ is compatible with the PC $q$. Thus, the product $q\frac{\tilde{p}_{i}}{p}$ is a circuit that is smooth and compatible with both $p$ and $q$ by Theorem B.2, and thus compatible with $\tilde{q}_{i}$. By similar arguments, we can verify that all the circuit pairs in the expected kernel terms in Equation 7 satisfy the assumptions in Theorem LABEL:thm:_double_sum_complexity, and thus they are amenable to the tractable computation we propose in Algorithm LABEL:alg:_double-sum, which finishes our proof. ∎

Proposition (convergence of Categorical BBIS).

Let $f(\mathbf{x})$ be a test function. Assume that $f-\mathbb{E}_{p}[f]\in\mathcal{H}_{p}$, with $\mathcal{H}_{p}$ being the RKHS associated with the kernel function $k_{p}$, and that $\sum_{i}w_{i}=1$. Then it holds that

\left|\sum_{n=1}^{N}w_{n}f(\mathbf{x}^{(n)})-\mathbb{E}_{p}f\right|\leq C_{f}\sqrt{\mathbb{S}(\{\mathbf{x}^{(n)},w_{n}\}\parallel p)},

where $C_{f}:=\|f-\mathbb{E}_{p}f\|_{\mathcal{H}_{p}}$. Moreover, the convergence rate is $\mathcal{O}(N^{-1/2})$.

Proof.

Let $\hat{f}(\mathbf{x}):=f(\mathbf{x})-\mathbb{E}_{p}f$; then it holds that

\begin{align*}
\left|\sum_{n=1}^{N}w_{n}f(\mathbf{x}^{(n)})-\mathbb{E}_{p}f\right| &= \left|\sum_{n=1}^{N}w_{n}\hat{f}(\mathbf{x}^{(n)})\right|\\
&= \left|\sum_{n=1}^{N}w_{n}\langle\hat{f},k_{p}(\cdot,\mathbf{x}^{(n)})\rangle_{\mathcal{H}_{p}}\right|\\
&= \left|\Big\langle\hat{f},\sum_{n=1}^{N}w_{n}k_{p}(\cdot,\mathbf{x}^{(n)})\Big\rangle_{\mathcal{H}_{p}}\right|\\
&\leq \|\hat{f}\|_{\mathcal{H}_{p}}\cdot\Big\|\sum_{n=1}^{N}w_{n}k_{p}(\cdot,\mathbf{x}^{(n)})\Big\|_{\mathcal{H}_{p}}\\
&= \|\hat{f}\|_{\mathcal{H}_{p}}\cdot\sqrt{\mathbb{S}(\{\mathbf{x}^{(n)},w_{n}\}\parallel p)}.
\end{align*}

We further prove the convergence rate of the estimation error by using the importance weights as reference weights. Let $v_{n}^{*}=\frac{1}{N}\,p(\mathbf{x}^{(n)})/q(\mathbf{x}^{(n)})$. Then $\mathbb{S}(\{\mathbf{x}^{(n)},v_{n}^{*}\}\parallel p)$ is a degenerate V-statistic [liu2016black] and it holds that $\mathbb{S}(\{\mathbf{x}^{(n)},v_{n}^{*}\}\parallel p)=\mathcal{O}(N^{-1})$. Moreover, we have that $\sum_{n=1}^{N}v^{*}_{n}=1+\mathcal{O}(N^{-1/2})$, which we denote by $Z$, i.e., $Z=\sum_{n=1}^{N}v^{*}_{n}$. Let $w^{*}_{n}=v^{*}_{n}/Z$; then it holds that

\mathbb{S}(\{\mathbf{x}^{(n)},w_{n}^{*}\}\parallel p)=\frac{\mathbb{S}(\{\mathbf{x}^{(n)},v_{n}^{*}\}\parallel p)}{Z^{2}}=\mathcal{O}(N^{-1}).

Therefore, since the weights $w_{n}$ returned by the BBIS optimization minimize the discrepancy,
\begin{align*}
\left|\sum_{n=1}^{N}w_{n}f(\mathbf{x}^{(n)})-\mathbb{E}_{p}f\right| &\leq \|\hat{f}\|_{\mathcal{H}_{p}}\cdot\sqrt{\mathbb{S}(\{\mathbf{x}^{(n)},w_{n}\}\parallel p)}\\
&\leq \|\hat{f}\|_{\mathcal{H}_{p}}\cdot\sqrt{\mathbb{S}(\{\mathbf{x}^{(n)},w_{n}^{*}\}\parallel p)}\\
&= \mathcal{O}(N^{-1/2}). \qquad ∎
\end{align*}

Proposition LABEL:pro:_tractable_conditonal_kernel_function.

Let $p(\mathbf{X_{c}}\mid\mathbf{x_{s}})$ be a PC that encodes a conditional distribution over variables $\mathbf{X_{c}}$ conditioned on $\mathbf{X_{s}}=\mathbf{x_{s}}$, and let $k$ be a KC. If the PCs $p(\mathbf{X_{c}}\mid\mathbf{x_{s}})$ and $p(\mathbf{X_{c}}\mid\mathbf{x_{s}}^{\prime})$ are compatible, and $k$ is kernel-compatible with this PC pair for any $\mathbf{x_{s}}$, $\mathbf{x_{s}}^{\prime}$, then the conditional kernel function $k_{p,\mathbf{s}}$ as defined in Proposition LABEL:pro:_kdsd can be tractably computed.

Proof.

From Proposition LABEL:pro:_kdsd, $k_{p,\mathbf{s}}$ can be written as

k_{p,\mathbf{s}}(\mathbf{x},\mathbf{x}^{\prime})=\sum_{i=1}^{D}\mathbb{E}_{\mathbf{x}_{\mathbf{c}}\sim p(\mathbf{X}_{\mathbf{c}}\mid\mathbf{x}_{\mathbf{s}}),\,\mathbf{x}^{\prime}_{\mathbf{c}}\sim p(\mathbf{X}_{\mathbf{c}}\mid\mathbf{x}^{\prime}_{\mathbf{s}})}[k_{p,i}(\mathbf{x},\mathbf{x}^{\prime})],

where $k_{p,i}$ can be expanded as follows:

\begin{align*}
k_{p,i}(\mathbf{x},\mathbf{x}^{\prime})=&\ \frac{p(\neg_{i}\mathbf{x})p(\neg_{i}\mathbf{x}^{\prime})}{p(\mathbf{x})p(\mathbf{x}^{\prime})}k(\mathbf{x},\mathbf{x}^{\prime})-\frac{p(\neg_{i}\mathbf{x})}{p(\mathbf{x})}k(\mathbf{x},\neg_{i}\mathbf{x}^{\prime})\\
&-\frac{p(\neg_{i}\mathbf{x}^{\prime})}{p(\mathbf{x}^{\prime})}k(\neg_{i}\mathbf{x},\mathbf{x}^{\prime})+k(\neg_{i}\mathbf{x},\neg_{i}\mathbf{x}^{\prime}).
\end{align*}

For any $i\in\mathbf{c}$, given that none of the variables in $\mathbf{X}_{\mathbf{s}}$ is flipped in the above formulation, the kernel $k_{p,i}$ can be further written as

\begin{align*}
k_{p,i}(\mathbf{x},\mathbf{x}^{\prime})=&\ \frac{p(\neg_{i}\mathbf{x}_{\mathbf{c}}\mid\mathbf{x}_{\mathbf{s}})p(\neg_{i}\mathbf{x}^{\prime}_{\mathbf{c}}\mid\mathbf{x}^{\prime}_{\mathbf{s}})}{p(\mathbf{x}_{\mathbf{c}}\mid\mathbf{x}_{\mathbf{s}})p(\mathbf{x}^{\prime}_{\mathbf{c}}\mid\mathbf{x}^{\prime}_{\mathbf{s}})}k(\mathbf{x},\mathbf{x}^{\prime})-\frac{p(\neg_{i}\mathbf{x}_{\mathbf{c}}\mid\mathbf{x}_{\mathbf{s}})}{p(\mathbf{x}_{\mathbf{c}}\mid\mathbf{x}_{\mathbf{s}})}k(\mathbf{x},\neg_{i}\mathbf{x}^{\prime})\\
&-\frac{p(\neg_{i}\mathbf{x}^{\prime}_{\mathbf{c}}\mid\mathbf{x}^{\prime}_{\mathbf{s}})}{p(\mathbf{x}^{\prime}_{\mathbf{c}}\mid\mathbf{x}^{\prime}_{\mathbf{s}})}k(\neg_{i}\mathbf{x},\mathbf{x}^{\prime})+k(\neg_{i}\mathbf{x},\neg_{i}\mathbf{x}^{\prime}).
\end{align*}

By substituting this expression for $k_{p,i}$ into the expected kernel, the expectation of $k_{p,i}$ with respect to the conditional distributions simplifies to a constant zero, that is,

\mathbb{E}_{\mathbf{x}_{\mathbf{c}}\sim p(\mathbf{X}_{\mathbf{c}}\mid\mathbf{x}_{\mathbf{s}}),\,\mathbf{x}^{\prime}_{\mathbf{c}}\sim p(\mathbf{X}_{\mathbf{c}}\mid\mathbf{x}^{\prime}_{\mathbf{s}})}[k_{p,i}(\mathbf{x},\mathbf{x}^{\prime})]=0.

Thus, $k_{p,\mathbf{s}}$ can be expanded as

\begin{align*}
k_{p,\mathbf{s}}(\mathbf{x},\mathbf{x}^{\prime}) &= \mathbb{E}_{\mathbf{x}_{\mathbf{c}}\sim p(\mathbf{X}_{\mathbf{c}}\mid\mathbf{x}_{\mathbf{s}}),\,\mathbf{x}^{\prime}_{\mathbf{c}}\sim p(\mathbf{X}_{\mathbf{c}}\mid\mathbf{x}^{\prime}_{\mathbf{s}})}\Big[\sum_{i\in\mathbf{s}}k_{p,i}(\mathbf{x},\mathbf{x}^{\prime})\Big]\\
&= \sum_{i\in\mathbf{s}}\Big[\frac{p(\neg_{i}\mathbf{x}_{\mathbf{s}})p(\neg_{i}\mathbf{x}^{\prime}_{\mathbf{s}})}{p(\mathbf{x}_{\mathbf{s}})p(\mathbf{x}^{\prime}_{\mathbf{s}})}\cdot M_{k(\cdot,\cdot)}(p(\cdot\mid\neg_{i}\mathbf{x}_{\mathbf{s}}),p(\cdot\mid\neg_{i}\mathbf{x}^{\prime}_{\mathbf{s}}))\\
&\qquad-\frac{p(\neg_{i}\mathbf{x}_{\mathbf{s}})}{p(\mathbf{x}_{\mathbf{s}})}\cdot M_{k(\cdot,\neg_{i}\cdot)}(p(\cdot\mid\neg_{i}\mathbf{x}_{\mathbf{s}}),p(\cdot\mid\mathbf{x}^{\prime}_{\mathbf{s}}))\\
&\qquad-\frac{p(\neg_{i}\mathbf{x}^{\prime}_{\mathbf{s}})}{p(\mathbf{x}^{\prime}_{\mathbf{s}})}\cdot M_{k(\neg_{i}\cdot,\cdot)}(p(\cdot\mid\mathbf{x}_{\mathbf{s}}),p(\cdot\mid\neg_{i}\mathbf{x}^{\prime}_{\mathbf{s}}))\\
&\qquad+M_{k(\neg_{i}\cdot,\neg_{i}\cdot)}(p(\cdot\mid\mathbf{x}_{\mathbf{s}}),p(\cdot\mid\mathbf{x}^{\prime}_{\mathbf{s}}))\Big].
\end{align*}

As Theorem LABEL:thm:_double_sum_complexity has shown that $M_{k}(p,q)$ can be computed exactly in time linear in the size of each PC, $k_{p,\mathbf{s}}(\mathbf{x},\mathbf{x}^{\prime})$ can also be computed exactly in time $\mathcal{O}(|p_{1}||p_{2}||k|)$, where $p_{1}$ and $p_{2}$ denote circuits representing the conditional probability distributions given the index set, i.e., $p(\cdot\mid\mathbf{x}_{\mathbf{s}})$ or $p(\cdot\mid\neg_{i}\mathbf{x}_{\mathbf{s}})$. ∎

2 Algorithms

Algorithm 1 summarizes how to perform the BBIS scheme we propose for Categorical distributions, and generate a set of weighted samples.

Algorithm 1 CategoricalBBIS($p,q,k,n$)

Input: target distribution $p$ over variables $\mathbf{X}$, a black-box mechanism $q$, a kernel function $k$, and the number of samples $n$
Output: weighted samples $\{(\mathbf{x}^{(i)},w^{*}_{i})\}_{i=1}^{n}$

1:Sample {𝐱(i)}i=1n\{\mathbf{x}^{(i)}\}_{i=1}^{n} from qq
2:for  i=1,,ni=1,\ldots,n do
3:     for  j=1,,nj=1,\ldots,n do
4:         [𝑲p]ij=kp(𝐱(i),𝐱(j))[\bm{K}_{p}]_{ij}=k_{p}(\mathbf{x}^{(i)},\mathbf{x}^{(j)}) \triangleright cf. LABEL:eq:_kernel-p      
5:𝒘=argmin𝒘{𝒘𝑲𝒑𝒘|i=1nwi=1,wi0}\bm{w}^{*}=\operatorname*{arg\,min}_{\bm{w}}\left\{\bm{w}^{\top}\bm{K_{p}}\bm{w}\,\middle|\,\sum_{i=1}^{n}w_{i}=1,\leavevmode\nobreak\ w_{i}\geq 0\right\}
6:return {(𝐱(i),wi)}i=1n\{(\mathbf{x}^{(i)},w^{*}_{i})\}_{i=1}^{n}
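The steps of Algorithm 1 can be sketched in code. The snippet below is a minimal sketch, not the paper's implementation: it assumes the Stein kernel matrix $\bm{K}_p$ has already been computed (a random PSD matrix stands in for it here), and it solves the simplex-constrained quadratic program of line 5 with scipy's SLSQP solver, one of several possible choices.

```python
# A minimal sketch of Algorithm 1 (CategoricalBBIS), assuming a precomputed
# Stein kernel matrix K_p; the quadratic program over the simplex is solved
# with scipy's SLSQP method (a stand-in for any QP solver).
import numpy as np
from scipy.optimize import minimize

def categorical_bbis_weights(K_p):
    n = K_p.shape[0]
    w0 = np.full(n, 1.0 / n)                      # start from uniform weights
    res = minimize(
        lambda w: w @ K_p @ w,                    # objective w^T K_p w (line 5)
        w0,
        jac=lambda w: 2 * K_p @ w,                # gradient of the objective
        method="SLSQP",
        bounds=[(0.0, 1.0)] * n,                  # w_i >= 0
        constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],
    )
    return res.x

# usage with a random PSD matrix standing in for [K_p]_ij = k_p(x^(i), x^(j))
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8))
w = categorical_bbis_weights(A @ A.T)
```

The returned weights lie on the probability simplex (up to solver tolerance) and would be paired with the samples drawn in line 1 of the algorithm.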