
Tractable Computation of Expected Kernels (Supplementary material)

Wenzhe Li (Tsinghua University), Zhe Zeng, Antonio Vergari, Guy Van den Broeck (University of California, Los Angeles)
scott.wenzhe.li@gmail.com, {zhezeng, aver, guyvdb}@cs.ucla.edu
Authors contributed equally. This research was performed while W.L. was visiting UCLA remotely.

1 Proofs

We first present another hardness result for the computation of expected kernels, complementing Theorem LABEL:thm:_hardness_for_expected_kernels.

Theorem 1.1.

There exist representations of distributions $p$ and $q$ that are smooth and compatible, yet computing the expected kernel is already #P-hard even for a kernel $k$ as simple as the Kronecker delta.

Proof.

(an alternative proof to the one in Section LABEL:sec:_tractable_computation_of_expected_kernels) Consider the case when the positive definite kernel $k$ is a Kronecker delta function, defined by $k(\mathbf{x},\mathbf{x}^{\prime})=1$ if and only if $\mathbf{x}=\mathbf{x}^{\prime}$. Moreover, assume that the probabilistic circuit $p$ is smooth and decomposable, and that $q=p$. Then computing the expected kernel is equivalent to computing the power of the probabilistic circuit $p$, that is, $M_{k}(p,q)=\sum_{\mathbf{x}\in\mathcal{X}}p^{2}(\mathbf{x})$, with $\mathcal{X}$ being the domain of variables $\mathbf{X}$. vergari2021compositional proves that computing $\sum_{\mathbf{x}\in\mathcal{X}}p^{2}(\mathbf{x})$ is #P-hard even when the PC $p$ is smooth and decomposable, which concludes our proof. ∎

Proposition LABEL:pro:_recursive-sum-nodes

Let $p_{n}$ and $q_{m}$ be two compatible probabilistic circuits over variables $\mathbf{X}$ whose output units $n$ and $m$ are sum units, denoted by $p_{n}(\mathbf{X})=\sum_{i\in\mathsf{in}(n)}\theta_{i}p_{i}(\mathbf{X})$ and $q_{m}(\mathbf{X})=\sum_{j\in\mathsf{in}(m)}\delta_{j}q_{j}(\mathbf{X})$ respectively. Let $k_{l}$ be a kernel circuit whose output unit is a sum unit $l$, denoted by $k_{l}(\mathbf{X},\mathbf{X}^{\prime})=\sum_{c\in\mathsf{in}(l)}\gamma_{c}k_{c}(\mathbf{X},\mathbf{X}^{\prime})$. Then it holds that

M_{k_{l}}(p_{n},q_{m})=\sum_{i\in\mathsf{in}(n)}\theta_{i}\sum_{j\in\mathsf{in}(m)}\delta_{j}\sum_{c\in\mathsf{in}(l)}\gamma_{c}\,M_{k_{c}}(p_{i},q_{j}). \qquad (1)
Proof.

$M_{k_{l}}(p_{n},q_{m})$ can be expanded as

\begin{align*}
M_{k_{l}}(p_{n},q_{m}) &= \sum_{\mathbf{x}}\sum_{\mathbf{x}^{\prime}}p_{n}(\mathbf{x})q_{m}(\mathbf{x}^{\prime})k_{l}(\mathbf{x},\mathbf{x}^{\prime})\\
&= \sum_{\mathbf{x}}\sum_{\mathbf{x}^{\prime}}\sum_{i\in\mathsf{in}(n)}\theta_{i}p_{i}(\mathbf{x})\sum_{j\in\mathsf{in}(m)}\delta_{j}q_{j}(\mathbf{x}^{\prime})\sum_{c\in\mathsf{in}(l)}\gamma_{c}k_{c}(\mathbf{x},\mathbf{x}^{\prime})\\
&= \sum_{i\in\mathsf{in}(n)}\theta_{i}\sum_{j\in\mathsf{in}(m)}\delta_{j}\sum_{c\in\mathsf{in}(l)}\gamma_{c}\,M_{k_{c}}(p_{i},q_{j}). \qquad ∎
\end{align*}
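The recursion for sum units can be checked numerically by brute force. The sketch below (plain Python) compares both sides of Equation 1 on a tiny domain of two binary variables; all distributions, mixture weights, and kernels here are made-up toy values, not from the paper.

```python
# A brute-force sanity check of the sum-unit recursion (Eq. 1) on a tiny
# discrete domain; all distributions and kernels below are toy choices.
import itertools

X = list(itertools.product([0, 1], repeat=2))  # domain of two binary variables

# input distributions of the sum units (arbitrary, normalized)
p1 = {x: v for x, v in zip(X, [0.1, 0.2, 0.3, 0.4])}
p2 = {x: v for x, v in zip(X, [0.4, 0.3, 0.2, 0.1])}
q1 = {x: v for x, v in zip(X, [0.25, 0.25, 0.25, 0.25])}
q2 = {x: v for x, v in zip(X, [0.7, 0.1, 0.1, 0.1])}
theta, delta = (0.6, 0.4), (0.3, 0.7)   # sum-unit weights of p_n and q_m

# two toy kernels and a sum-unit kernel circuit k_l = 0.5*k1 + 0.5*k2
k1 = lambda x, y: 1.0 if x == y else 0.0
k2 = lambda x, y: 1.0 / (1 + sum(a != b for a, b in zip(x, y)))
gamma = (0.5, 0.5)

def M(k, p, q):  # expected kernel by explicit double sum over the domain
    return sum(p[x] * q[y] * k(x, y) for x in X for y in X)

p_n = {x: theta[0]*p1[x] + theta[1]*p2[x] for x in X}
q_m = {x: delta[0]*q1[x] + delta[1]*q2[x] for x in X}
k_l = lambda x, y: gamma[0]*k1(x, y) + gamma[1]*k2(x, y)

lhs = M(k_l, p_n, q_m)                   # left-hand side of Eq. 1
rhs = sum(t * d * g * M(kc, pi, qj)      # right-hand side of Eq. 1
          for t, pi in zip(theta, (p1, p2))
          for d, qj in zip(delta, (q1, q2))
          for g, kc in zip(gamma, (k1, k2)))
assert abs(lhs - rhs) < 1e-12
```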

Proposition LABEL:pro:_recursive-product-nodes

Let $p_{n}$ and $q_{m}$ be two compatible probabilistic circuits over variables $\mathbf{X}$ whose output units $n$ and $m$ are product units, denoted by $p_{n}(\mathbf{X})=p_{n_{\mathsf{L}}}(\mathbf{X}_{\mathsf{L}})\,p_{n_{\mathsf{R}}}(\mathbf{X}_{\mathsf{R}})$ and $q_{m}(\mathbf{X})=q_{m_{\mathsf{L}}}(\mathbf{X}_{\mathsf{L}})\,q_{m_{\mathsf{R}}}(\mathbf{X}_{\mathsf{R}})$. Let $k$ be a kernel circuit that is kernel-compatible with the circuit pair $p_{n}$ and $q_{m}$, with its output unit being a product unit, denoted by $k(\mathbf{X},\mathbf{X}^{\prime})=k_{\mathsf{L}}(\mathbf{X}_{\mathsf{L}},\mathbf{X}_{\mathsf{L}}^{\prime})\,k_{\mathsf{R}}(\mathbf{X}_{\mathsf{R}},\mathbf{X}_{\mathsf{R}}^{\prime})$. Then it holds that

M_{k}(p_{n},q_{m})=M_{k_{\mathsf{L}}}(p_{n_{\mathsf{L}}},q_{m_{\mathsf{L}}})\cdot M_{k_{\mathsf{R}}}(p_{n_{\mathsf{R}}},q_{m_{\mathsf{R}}}).
Proof.

$M_{k}(p_{n},q_{m})$ can be expanded as

\begin{align*}
M_{k}(p_{n},q_{m}) &= \sum_{\mathbf{x}}\sum_{\mathbf{x}^{\prime}}p_{n}(\mathbf{x})q_{m}(\mathbf{x}^{\prime})k(\mathbf{x},\mathbf{x}^{\prime})\\
&= \sum_{\mathbf{x}}\sum_{\mathbf{x}^{\prime}}p_{n_{\mathsf{L}}}(\mathbf{x}_{\mathsf{L}})p_{n_{\mathsf{R}}}(\mathbf{x}_{\mathsf{R}})\,q_{m_{\mathsf{L}}}(\mathbf{x}^{\prime}_{\mathsf{L}})q_{m_{\mathsf{R}}}(\mathbf{x}^{\prime}_{\mathsf{R}})\,k_{\mathsf{L}}(\mathbf{x}_{\mathsf{L}},\mathbf{x}^{\prime}_{\mathsf{L}})k_{\mathsf{R}}(\mathbf{x}_{\mathsf{R}},\mathbf{x}^{\prime}_{\mathsf{R}})\\
&= M_{k_{\mathsf{L}}}(p_{n_{\mathsf{L}}},q_{m_{\mathsf{L}}})\cdot M_{k_{\mathsf{R}}}(p_{n_{\mathsf{R}}},q_{m_{\mathsf{R}}}). \qquad ∎
\end{align*}
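The product-unit factorization can likewise be verified by brute force. The sketch below (plain Python, toy distributions and kernels chosen arbitrarily) checks that the double sum over the joint domain of two binary variables $(X_{\mathsf{L}}, X_{\mathsf{R}})$ equals the product of the two per-scope expected kernels.

```python
# Brute-force check of the product-unit factorization on two binary variables
# X = (X_L, X_R); the factors below are arbitrary toy distributions/kernels.
pL = {0: 0.3, 1: 0.7}; pR = {0: 0.6, 1: 0.4}   # p_n = pL * pR
qL = {0: 0.5, 1: 0.5}; qR = {0: 0.2, 1: 0.8}   # q_m = qL * qR
kL = lambda a, b: 1.0 if a == b else 0.25       # k = kL * kR
kR = lambda a, b: 1.0 if a == b else 0.5

def M1(k, p, q):  # expected kernel over a single variable
    return sum(p[a] * q[b] * k(a, b) for a in p for b in q)

# left-hand side: explicit double sum over the joint domain
lhs = sum(pL[xl]*pR[xr] * qL[yl]*qR[yr] * kL(xl, yl)*kR(xr, yr)
          for xl in (0, 1) for xr in (0, 1)
          for yl in (0, 1) for yr in (0, 1))
# right-hand side: product of the per-scope expected kernels
rhs = M1(kL, pL, qL) * M1(kR, pR, qR)
assert abs(lhs - rhs) < 1e-12
```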

Corollary LABEL:cor:_tractable_mmd.

Following the assumptions in Theorem LABEL:thm:_double_sum_complexity, the squared maximum mean discrepancy $\mathit{MMD}^{2}[\mathcal{H},p,q]$ in the RKHS $\mathcal{H}$ associated with kernel $k$, as defined in gretton2012kernel, can be tractably computed.

Proof.

This is an immediate result of Theorem LABEL:thm:_double_sum_complexity, obtained by rewriting the MMD as defined in gretton2012kernel as a linear combination of expected kernels, that is, $\mathit{MMD}^{2}[\mathcal{H},p,q]=M_{k}(p,p)+M_{k}(q,q)-2M_{k}(p,q)$. ∎
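On a finite domain, where the expected kernel is the bilinear form $p^{\top}Kq$ over a kernel matrix $K$, this decomposition can be checked numerically. The sketch below (toy data: a random PSD matrix standing in for the kernel, random normalized distributions) verifies that $(p-q)^{\top}K(p-q)$ equals the stated combination of expected kernels.

```python
# Numeric check, on a small domain, that squared MMD equals the combination
# of expected kernels: (p-q)^T K (p-q) = M_k(p,p) + M_k(q,q) - 2 M_k(p,q).
import numpy as np

rng = np.random.default_rng(0)
D = 5                                   # toy domain of 5 states
p = rng.random(D); p /= p.sum()         # random distribution p
q = rng.random(D); q /= q.sum()         # random distribution q
A = rng.random((D, D))
K = A @ A.T                             # a random PSD kernel matrix

def M(K, p, q):                         # expected kernel as a bilinear form
    return float(p @ K @ q)

mmd2 = float((p - q) @ K @ (p - q))     # squared MMD between p and q
assert abs(mmd2 - (M(K, p, p) + M(K, q, q) - 2 * M(K, p, q))) < 1e-10
```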

Corollary LABEL:cor:_tractable_kdsd.

Following the assumptions in Theorem LABEL:thm:_double_sum_complexity, if the probabilistic circuit $p$ further satisfies determinism, the kernelized discrete Stein discrepancy (KDSD) $\mathbb{D}^{2}(q\parallel p)=\mathbb{E}_{\mathbf{x},\mathbf{x}^{\prime}\sim q}[k_{p}(\mathbf{x},\mathbf{x}^{\prime})]$ in the RKHS associated with kernel $k$, as defined in yang2018goodness, can be tractably computed.

Before proving Corollary LABEL:cor:_tractable_kdsd, we first give the definitions needed to define the KDSD, so as to be self-contained.

Definition 1.2 (Cyclic permutation).

For a finite set $\mathcal{X}$ with $D=|\mathcal{X}|$, a cyclic permutation $\neg:\mathcal{X}\rightarrow\mathcal{X}$ is a bijective function such that, for some ordering $a_{1},a_{2},\cdots,a_{D}$ of the elements in $\mathcal{X}$, $\neg a_{i}=a_{(i\bmod D)+1}$ for all $i=1,2,\cdots,D$.

Definition 1.3 (Partial difference operator).

For any function $f:\mathcal{X}\rightarrow\mathbb{R}$ over variables $\mathbf{X}=(X_{1},\cdots,X_{D})$, the partial difference operator is defined as

\Delta^{*}_{i}f(\mathbf{X}):=f(\mathbf{X})-f(\neg_{i}\mathbf{X}),\quad\forall i=1,\cdots,D, \qquad (2)

with $\neg_{i}\mathbf{X}:=(X_{1},\cdots,\neg X_{i},\cdots,X_{D})$. Moreover, the difference operator is defined as $\Delta^{*}f(\mathbf{X}):=(\Delta^{*}_{1}f(\mathbf{X}),\cdots,\Delta^{*}_{D}f(\mathbf{X}))$. Similarly, let $\neg^{-1}$ be the inverse permutation of $\neg$, and let $\Delta$ denote the difference operator defined with respect to $\neg^{-1}$, i.e.,

\Delta_{i}f(\mathbf{X}):=f(\mathbf{X})-f(\neg^{-1}_{i}\mathbf{X}),\quad i=1,\cdots,D.
Definition 1.4 (Difference score function).

The (difference) score function is defined as $\bm{s}_{p}(\mathbf{X}):=\frac{\Delta^{*}p(\mathbf{X})}{p(\mathbf{X})}$, a vector-valued function over the $D$ variables with its $i$-th dimension being

\bm{s}_{p,i}(\mathbf{X}):=\frac{\Delta^{*}_{i}p(\mathbf{X})}{p(\mathbf{X})}=1-\frac{p(\neg_{i}\mathbf{X})}{p(\mathbf{X})},\quad i=1,2,\cdots,D. \qquad (3)
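For binary variables the only cyclic permutation is the bit flip $\neg x = 1-x$, which makes the score function easy to compute exactly. The sketch below (plain Python; the distribution $p$ is a made-up toy table over three binary variables) evaluates $\bm{s}_{p,i}(\mathbf{x}) = 1 - p(\neg_{i}\mathbf{x})/p(\mathbf{x})$ per Equation 3.

```python
# The cyclic permutation and difference score on binary variables, where the
# cyclic permutation is the bit flip ¬x = 1 - x; p below is a toy table.
import itertools

X = list(itertools.product([0, 1], repeat=3))
p = {x: w for x, w in zip(X, [0.05, 0.10, 0.15, 0.20, 0.05, 0.10, 0.15, 0.20])}

def flip(x, i):                     # ¬_i x: flip the i-th coordinate only
    return tuple(1 - v if j == i else v for j, v in enumerate(x))

def score(p, x):                    # s_{p,i}(x) = 1 - p(¬_i x) / p(x), Eq. 3
    return [1 - p[flip(x, i)] / p[x] for i in range(len(x))]

s = score(p, (0, 1, 0))             # the D-dimensional score vector at one state
```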

Given the above definitions, the discrete Stein discrepancy between two distributions $p$ and $q$ is defined as

\mathbb{D}(q\parallel p):=\sup_{\bm{f}\in\mathcal{H}}\mathbb{E}_{\mathbf{x}\sim q(\mathbf{X})}[\mathcal{T}_{p}\bm{f}(\mathbf{x})], \qquad (4)

where $\bm{f}:\mathcal{X}\rightarrow\mathbb{R}^{D}$ is a test function belonging to some function space $\mathcal{H}$, and $\mathcal{T}_{p}$ is the so-called Stein difference operator, defined as

\mathcal{T}_{p}\bm{f}(\mathbf{x})=\bm{s}_{p}(\mathbf{x})\bm{f}(\mathbf{x})^{\top}-\Delta\bm{f}(\mathbf{x}). \qquad (5)

If the function space $\mathcal{H}$ is a reproducing kernel Hilbert space (RKHS) on $\mathcal{X}$ equipped with a kernel function $k(\cdot,\cdot)$, then the kernelized discrete Stein discrepancy (KDSD) is defined and admits a closed-form representation as

\mathbb{S}(q\parallel p):=\mathbb{D}^{2}(q\parallel p)=\mathbb{E}_{\mathbf{x},\mathbf{x}^{\prime}\sim q}[k_{p}(\mathbf{x},\mathbf{x}^{\prime})]. \qquad (6)

Here, the kernel function $k_{p}$ is defined as

\begin{align*}
k_{p}(\mathbf{x},\mathbf{x}^{\prime}) &= \bm{s}_{p}(\mathbf{x})^{\top}k(\mathbf{x},\mathbf{x}^{\prime})\bm{s}_{p}(\mathbf{x}^{\prime})-\bm{s}_{p}(\mathbf{x})^{\top}\Delta^{\mathbf{x}^{\prime}}k(\mathbf{x},\mathbf{x}^{\prime})\\
&\quad-\Delta^{\mathbf{x}}k(\mathbf{x},\mathbf{x}^{\prime})^{\top}\bm{s}_{p}(\mathbf{x}^{\prime})+\mathit{tr}(\Delta^{\mathbf{x},\mathbf{x}^{\prime}}k(\mathbf{x},\mathbf{x}^{\prime})),
\end{align*}

where the difference operator Δ𝐱\Delta^{\mathbf{x}} is as in Definition 1.3. The superscript 𝐱\mathbf{x} specifies the variables that it operates on.
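A useful sanity check on the Stein kernel $k_{p}$ is that the KDSD of $p$ against itself vanishes, $\mathbb{E}_{\mathbf{x},\mathbf{x}^{\prime}\sim p}[k_{p}(\mathbf{x},\mathbf{x}^{\prime})]=0$. The brute-force sketch below (plain Python) assumes binary variables, where $\neg$ is the bit flip and is its own inverse; the distribution $p$ and base kernel $k$ are toy choices, and $k_{p}$ is expanded per dimension using the score ratios.

```python
# Sanity check of the closed-form Stein kernel k_p on binary variables:
# the KDSD of p against itself, E_{x,x'~p}[k_p(x,x')], must vanish.
import itertools

X = list(itertools.product([0, 1], repeat=2))
p = {x: w for x, w in zip(X, [0.1, 0.2, 0.3, 0.4])}           # toy distribution
k = lambda x, y: 2.0 ** (-sum(a != b for a, b in zip(x, y)))  # toy base kernel
D = 2                                                          # two variables

def flip(x, i):                 # ¬_i x: flip the i-th coordinate (involution)
    return tuple(1 - v if j == i else v for j, v in enumerate(x))

def k_p(x, y):                  # per-dimension expansion of the Stein kernel
    total = 0.0
    for i in range(D):
        rx, ry = p[flip(x, i)] / p[x], p[flip(y, i)] / p[y]
        total += (rx * ry * k(x, y) - rx * k(x, flip(y, i))
                  - ry * k(flip(x, i), y) + k(flip(x, i), flip(y, i)))
    return total

# brute-force expectation under p ⊗ p over the whole domain
kdsd = sum(p[x] * p[y] * k_p(x, y) for x in X for y in X)
assert abs(kdsd) < 1e-12
```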

Proof.

[Proof of Corollary LABEL:cor:_tractable_kdsd] By the definition of the difference score function, the closed form of the KDSD can be further rewritten as follows.

\begin{align*}
&\mathbb{E}_{\mathbf{x},\mathbf{x}^{\prime}\sim q}[k_{p}(\mathbf{x},\mathbf{x}^{\prime})]\\
&=\sum_{i=1}^{D}\mathbb{E}_{\mathbf{x},\mathbf{x}^{\prime}\sim q}\Big[\frac{p(\neg_{i}\mathbf{x})p(\neg_{i}\mathbf{x}^{\prime})}{p(\mathbf{x})p(\mathbf{x}^{\prime})}k(\mathbf{x},\mathbf{x}^{\prime})-\frac{p(\neg_{i}\mathbf{x})}{p(\mathbf{x})}k(\mathbf{x},\neg_{i}\mathbf{x}^{\prime})\\
&\qquad-\frac{p(\neg_{i}\mathbf{x}^{\prime})}{p(\mathbf{x}^{\prime})}k(\neg_{i}\mathbf{x},\mathbf{x}^{\prime})+k(\neg_{i}\mathbf{x},\neg_{i}\mathbf{x}^{\prime})\Big]\\
&=\sum_{i=1}^{D}\Big[M_{k}\big(q\tfrac{\tilde{p}_{i}}{p},q\tfrac{\tilde{p}_{i}}{p}\big)-M_{k}\big(q\tfrac{\tilde{p}_{i}}{p},\tilde{q}_{i}\big)-M_{k}\big(\tilde{q}_{i},q\tfrac{\tilde{p}_{i}}{p}\big)+M_{k}\big(\tilde{q}_{i},\tilde{q}_{i}\big)\Big] \qquad (7)
\end{align*}

where $D$ denotes the number of variables in $\mathbf{X}$, and the distributions $\tilde{p}_{i}(\mathbf{X}):=p(\neg_{i}\mathbf{X})$ and $\tilde{q}_{i}(\mathbf{X}):=q(\neg_{i}\mathbf{X})$. Notice that the cyclic permutation $\neg_{i}$ operates on an individual variable, and thus the resulting PCs $\tilde{p}_{i}$ and $\tilde{q}_{i}$ retain the same structural properties as the PCs $p$ and $q$ respectively. To prove that the KDSD can be tractably computed, it suffices to prove that the expected kernel terms in Equation 7 can be tractably computed.

For a deterministic and structured-decomposable PC $p$, since the PC $\tilde{p}_{i}$ retains the same structure, the ratio $\tilde{p}_{i}/p$ is again a smooth circuit compatible with $p$ by vergari2021compositional. Moreover, since the PCs $p$ and $q$ are compatible, the circuit $\tilde{p}_{i}/p$ is compatible with the PC $q$. Thus, the product $q\frac{\tilde{p}_{i}}{p}$ is a circuit that is smooth and compatible with both $p$ and $q$ by Theorem B.2, and thus compatible with $\tilde{q}_{i}$. By similar arguments, we can verify that all the circuit pairs in the expected kernel terms in Equation 7 satisfy the assumptions in Theorem LABEL:thm:_double_sum_complexity, and thus they are amenable to the tractable computation we propose in Algorithm LABEL:alg:_double-sum, which finishes our proof. ∎

Proposition (convergence of Categorical BBIS).

Let $f(\mathbf{x})$ be a test function. Assume that $f-\mathbb{E}_{p}[f]\in\mathcal{H}_{p}$, with $\mathcal{H}_{p}$ being the RKHS associated with the kernel function $k_{p}$, and that $\sum_{i}w_{i}=1$. Then it holds that

\left|\sum_{n=1}^{N}w_{n}f(\mathbf{x}^{(n)})-\mathbb{E}_{p}f\right|\leq C_{f}\sqrt{\mathbb{S}(\{\mathbf{x}^{(n)},w_{n}\}\parallel p)},

where $C_{f}:=\|f-\mathbb{E}_{p}f\|_{\mathcal{H}_{p}}$. Moreover, the convergence rate is $\mathcal{O}(N^{-1/2})$.

Proof.

Let $\hat{f}(\mathbf{x}):=f(\mathbf{x})-\mathbb{E}_{p}f$; then it holds that

\begin{align*}
\left|\sum_{n=1}^{N}w_{n}f(\mathbf{x}^{(n)})-\mathbb{E}_{p}f\right| &= \left|\sum_{n=1}^{N}w_{n}\hat{f}(\mathbf{x}^{(n)})\right|\\
&= \left|\sum_{n=1}^{N}w_{n}\langle\hat{f},k_{p}(\cdot,\mathbf{x}^{(n)})\rangle_{\mathcal{H}_{p}}\right|\\
&= \left|\Big\langle\hat{f},\sum_{n=1}^{N}w_{n}k_{p}(\cdot,\mathbf{x}^{(n)})\Big\rangle_{\mathcal{H}_{p}}\right|\\
&\leq \|\hat{f}\|_{\mathcal{H}_{p}}\cdot\Big\|\sum_{n=1}^{N}w_{n}k_{p}(\cdot,\mathbf{x}^{(n)})\Big\|_{\mathcal{H}_{p}}\\
&= \|\hat{f}\|_{\mathcal{H}_{p}}\cdot\sqrt{\mathbb{S}(\{\mathbf{x}^{(n)},w_{n}\}\parallel p)}.
\end{align*}

We further prove the convergence rate of the estimation error by using the importance weights as reference weights. Let $v_{n}^{*}=\frac{1}{N}\,p(\mathbf{x}^{(n)})/q(\mathbf{x}^{(n)})$. Then $\mathbb{S}(\{\mathbf{x}^{(n)},v_{n}^{*}\}\parallel p)$ is a degenerate V-statistic [liu2016black] and it holds that $\mathbb{S}(\{\mathbf{x}^{(n)},v_{n}^{*}\}\parallel p)=\mathcal{O}(N^{-1})$. Moreover, we have that $\sum_{n=1}^{N}v^{*}_{n}=1+\mathcal{O}(N^{-1/2})$, which we denote by $Z$, i.e., $Z=\sum_{n=1}^{N}v^{*}_{n}$. Let $w^{*}_{n}=v^{*}_{n}/Z$; then it holds that

\mathbb{S}(\{\mathbf{x}^{(n)},w_{n}^{*}\}\parallel p)=\frac{\mathbb{S}(\{\mathbf{x}^{(n)},v_{n}^{*}\}\parallel p)}{Z^{2}}=\mathcal{O}(N^{-1}).

Therefore, since the weights $w_{n}$ returned by the BBIS optimization minimize the discrepancy,
\begin{align*}
\left|\sum_{n=1}^{N}w_{n}f(\mathbf{x}^{(n)})-\mathbb{E}_{p}f\right| &\leq \|\hat{f}\|_{\mathcal{H}_{p}}\cdot\sqrt{\mathbb{S}(\{\mathbf{x}^{(n)},w_{n}\}\parallel p)}\\
&\leq \|\hat{f}\|_{\mathcal{H}_{p}}\cdot\sqrt{\mathbb{S}(\{\mathbf{x}^{(n)},w_{n}^{*}\}\parallel p)}\\
&= \mathcal{O}(N^{-1/2}). \qquad ∎
\end{align*}

Proposition LABEL:pro:_tractable_conditonal_kernel_function.

Let $p(\mathbf{X_{c}}\mid\mathbf{x_{s}})$ be a PC that encodes a conditional distribution over variables $\mathbf{X_{c}}$ conditioned on $\mathbf{X_{s}}=\mathbf{x_{s}}$, and let $k$ be a KC. If the PCs $p(\mathbf{X_{c}}\mid\mathbf{x_{s}})$ and $p(\mathbf{X_{c}}\mid\mathbf{x_{s}}^{\prime})$ are compatible, and $k$ is kernel-compatible with this PC pair for any $\mathbf{x_{s}}$, $\mathbf{x_{s}}^{\prime}$, then the conditional kernel function $k_{p,\mathbf{s}}$ as defined in Proposition LABEL:pro:_kdsd can be tractably computed.

Proof.

From Proposition LABEL:pro:_kdsd, $k_{p,\mathbf{s}}$ can be written as

k_{p,\mathbf{s}}(\mathbf{x},\mathbf{x}^{\prime})=\sum_{i=1}^{D}\mathbb{E}_{\mathbf{x}_{\mathbf{c}}\sim p(\mathbf{X}_{\mathbf{c}}\mid\mathbf{x}_{\mathbf{s}}),\,\mathbf{x}^{\prime}_{\mathbf{c}}\sim p(\mathbf{X}_{\mathbf{c}}\mid\mathbf{x}^{\prime}_{\mathbf{s}})}[k_{p,i}(\mathbf{x},\mathbf{x}^{\prime})],

where $k_{p,i}$ can be expanded as follows:

\begin{align*}
k_{p,i}(\mathbf{x},\mathbf{x}^{\prime})=&\ \frac{p(\neg_{i}\mathbf{x})p(\neg_{i}\mathbf{x}^{\prime})}{p(\mathbf{x})p(\mathbf{x}^{\prime})}k(\mathbf{x},\mathbf{x}^{\prime})-\frac{p(\neg_{i}\mathbf{x})}{p(\mathbf{x})}k(\mathbf{x},\neg_{i}\mathbf{x}^{\prime})\\
&-\frac{p(\neg_{i}\mathbf{x}^{\prime})}{p(\mathbf{x}^{\prime})}k(\neg_{i}\mathbf{x},\mathbf{x}^{\prime})+k(\neg_{i}\mathbf{x},\neg_{i}\mathbf{x}^{\prime}).
\end{align*}

For any $i\in\mathbf{c}$, given that none of the variables in $\mathbf{X}_{\mathbf{s}}$ is flipped in the above formulation, the kernel $k_{p,i}$ can be further written as

\begin{align*}
k_{p,i}(\mathbf{x},\mathbf{x}^{\prime})=&\ \frac{p(\neg_{i}\mathbf{x}_{\mathbf{c}}\mid\mathbf{x}_{\mathbf{s}})p(\neg_{i}\mathbf{x}^{\prime}_{\mathbf{c}}\mid\mathbf{x}^{\prime}_{\mathbf{s}})}{p(\mathbf{x}_{\mathbf{c}}\mid\mathbf{x}_{\mathbf{s}})p(\mathbf{x}^{\prime}_{\mathbf{c}}\mid\mathbf{x}^{\prime}_{\mathbf{s}})}k(\mathbf{x},\mathbf{x}^{\prime})-\frac{p(\neg_{i}\mathbf{x}_{\mathbf{c}}\mid\mathbf{x}_{\mathbf{s}})}{p(\mathbf{x}_{\mathbf{c}}\mid\mathbf{x}_{\mathbf{s}})}k(\mathbf{x},\neg_{i}\mathbf{x}^{\prime})\\
&-\frac{p(\neg_{i}\mathbf{x}^{\prime}_{\mathbf{c}}\mid\mathbf{x}^{\prime}_{\mathbf{s}})}{p(\mathbf{x}^{\prime}_{\mathbf{c}}\mid\mathbf{x}^{\prime}_{\mathbf{s}})}k(\neg_{i}\mathbf{x},\mathbf{x}^{\prime})+k(\neg_{i}\mathbf{x},\neg_{i}\mathbf{x}^{\prime}).
\end{align*}

By substituting this expression for $k_{p,i}$ into the expected kernel, the expectation of $k_{p,i}$ with respect to the conditional distributions simplifies to a constant zero, that is,

\mathbb{E}_{\mathbf{x}_{\mathbf{c}}\sim p(\mathbf{X}_{\mathbf{c}}\mid\mathbf{x}_{\mathbf{s}}),\,\mathbf{x}^{\prime}_{\mathbf{c}}\sim p(\mathbf{X}_{\mathbf{c}}\mid\mathbf{x}^{\prime}_{\mathbf{s}})}[k_{p,i}(\mathbf{x},\mathbf{x}^{\prime})]=0.

Thus, $k_{p,\mathbf{s}}$ can be expanded as

\begin{align*}
k_{p,\mathbf{s}}(\mathbf{x},\mathbf{x}^{\prime}) &= \mathbb{E}_{\mathbf{x}_{\mathbf{c}}\sim p(\mathbf{X}_{\mathbf{c}}\mid\mathbf{x}_{\mathbf{s}}),\,\mathbf{x}^{\prime}_{\mathbf{c}}\sim p(\mathbf{X}_{\mathbf{c}}\mid\mathbf{x}^{\prime}_{\mathbf{s}})}\Big[\sum_{i\in\mathbf{s}}k_{p,i}(\mathbf{x},\mathbf{x}^{\prime})\Big]\\
&= \sum_{i\in\mathbf{s}}\Big[\frac{p(\neg_{i}\mathbf{x}_{\mathbf{s}})p(\neg_{i}\mathbf{x}^{\prime}_{\mathbf{s}})}{p(\mathbf{x}_{\mathbf{s}})p(\mathbf{x}^{\prime}_{\mathbf{s}})}\cdot M_{k(\cdot,\cdot)}(p(\cdot\mid\neg_{i}\mathbf{x}_{\mathbf{s}}),p(\cdot\mid\neg_{i}\mathbf{x}^{\prime}_{\mathbf{s}}))\\
&\qquad-\frac{p(\neg_{i}\mathbf{x}_{\mathbf{s}})}{p(\mathbf{x}_{\mathbf{s}})}\cdot M_{k(\cdot,\neg_{i}\cdot)}(p(\cdot\mid\neg_{i}\mathbf{x}_{\mathbf{s}}),p(\cdot\mid\mathbf{x}^{\prime}_{\mathbf{s}}))\\
&\qquad-\frac{p(\neg_{i}\mathbf{x}^{\prime}_{\mathbf{s}})}{p(\mathbf{x}^{\prime}_{\mathbf{s}})}\cdot M_{k(\neg_{i}\cdot,\cdot)}(p(\cdot\mid\mathbf{x}_{\mathbf{s}}),p(\cdot\mid\neg_{i}\mathbf{x}^{\prime}_{\mathbf{s}}))\\
&\qquad+M_{k(\neg_{i}\cdot,\neg_{i}\cdot)}(p(\cdot\mid\mathbf{x}_{\mathbf{s}}),p(\cdot\mid\mathbf{x}^{\prime}_{\mathbf{s}}))\Big].
\end{align*}

As Theorem LABEL:thm:_double_sum_complexity has shown that $M_{k}(p,q)$ can be computed exactly in time linear in the size of each PC, $k_{p,\mathbf{s}}(\mathbf{x},\mathbf{x}^{\prime})$ can also be computed exactly in time $\mathcal{O}(|p_{1}||p_{2}||k|)$, where $p_{1}$ and $p_{2}$ denote circuits representing the conditional probability distributions given the index set, i.e., $p(\cdot\mid\mathbf{x}_{\mathbf{s}})$ or $p(\cdot\mid\neg_{i}\mathbf{x}_{\mathbf{s}})$. ∎

2 Algorithms

Algorithm 1 summarizes how to perform the BBIS scheme we propose for Categorical distributions, and generate a set of weighted samples.

Algorithm 1 CategoricalBBIS($p,q,k,n$)

Input: target distribution $p$ over variables $\mathbf{X}$, a black-box mechanism $q$, a kernel function $k$, and the number of samples $n$
Output: weighted samples $\{(\mathbf{x}^{(i)},w^{*}_{i})\}_{i=1}^{n}$

1:Sample {𝐱(i)}i=1n\{\mathbf{x}^{(i)}\}_{i=1}^{n} from qq
2:for  i=1,,ni=1,\ldots,n do
3:     for  j=1,,nj=1,\ldots,n do
4:         [𝑲p]ij=kp(𝐱(i),𝐱(j))[\bm{K}_{p}]_{ij}=k_{p}(\mathbf{x}^{(i)},\mathbf{x}^{(j)}) \triangleright cf. LABEL:eq:_kernel-p      
5:𝒘=argmin𝒘{𝒘𝑲𝒑𝒘|i=1nwi=1,wi0}\bm{w}^{*}=\operatorname*{arg\,min}_{\bm{w}}\left\{\bm{w}^{\top}\bm{K_{p}}\bm{w}\,\middle|\,\sum_{i=1}^{n}w_{i}=1,\leavevmode\nobreak\ w_{i}\geq 0\right\}
6:return {(𝐱(i),wi)}i=1n\{(\mathbf{x}^{(i)},w^{*}_{i})\}_{i=1}^{n}
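The steps of Algorithm 1 can be sketched in code. The snippet below is a minimal sketch, not the paper's implementation: it assumes the Stein kernel matrix $\bm{K}_p$ has already been computed (a random PSD matrix stands in for it here), and it solves the simplex-constrained quadratic program of line 5 with scipy's SLSQP solver, one of several possible choices.

```python
# A minimal sketch of Algorithm 1 (CategoricalBBIS), assuming a precomputed
# Stein kernel matrix K_p; the quadratic program over the simplex is solved
# with scipy's SLSQP method (a stand-in for any QP solver).
import numpy as np
from scipy.optimize import minimize

def categorical_bbis_weights(K_p):
    n = K_p.shape[0]
    w0 = np.full(n, 1.0 / n)                      # start from uniform weights
    res = minimize(
        lambda w: w @ K_p @ w,                    # objective w^T K_p w (line 5)
        w0,
        jac=lambda w: 2 * K_p @ w,                # gradient of the objective
        method="SLSQP",
        bounds=[(0.0, 1.0)] * n,                  # w_i >= 0
        constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],
    )
    return res.x

# usage with a random PSD matrix standing in for [K_p]_ij = k_p(x^(i), x^(j))
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8))
w = categorical_bbis_weights(A @ A.T)
```

The returned weights lie on the probability simplex (up to solver tolerance) and would be paired with the samples drawn in line 1 of the algorithm.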