Tractable Computation of Expected Kernels (Supplementary material)
1 Proofs
We first present another hardness result for the computation of expected kernels, in addition to Theorem LABEL:thm:_hardness_for_expected_kernels.
Theorem 1.1.
There exist representations of distributions $p$ and $q$ that are smooth and compatible, yet computing the expected kernel $\mathbb{E}_{p,q}[k]$ of a simple kernel $k$ that is the Kronecker delta is already #P-hard.
Proof.
(An alternative proof to the one in Section LABEL:sec:_tractable_computation_of_expected_kernels.) Consider the case where the positive definite kernel $k$ is a Kronecker delta function, defined as $k(\mathbf{x}, \mathbf{x}') = 1$ if and only if $\mathbf{x} = \mathbf{x}'$, and $0$ otherwise. Moreover, assume that the probabilistic circuit $p$ is smooth and decomposable, and that $q = p$. Then computing the expected kernel is equivalent to computing the power of the probabilistic circuit $p$, that is, $\mathbb{E}_{p,q}[k] = \sum_{\mathbf{x} \in \mathsf{val}(\mathbf{X})} p(\mathbf{x})^2$, with $\mathsf{val}(\mathbf{X})$ being the domain of variables $\mathbf{X}$. vergari2021compositional proves that the task of computing $\sum_{\mathbf{x}} p(\mathbf{x})^2$ is #P-hard even when the PC $p$ is smooth and decomposable, which concludes our proof. ∎
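As a concrete sanity check of the reduction above, the following sketch verifies by brute force that the expected kernel under a Kronecker delta equals the power $\sum_{\mathbf{x}} p(\mathbf{x})^2$; the factorized toy distribution is a hypothetical stand-in for a PC.

```python
import itertools

# A toy fully factorized distribution over 3 binary variables (hypothetical).
def p(x):
    theta = [0.2, 0.5, 0.7]  # independent Bernoulli parameters
    prob = 1.0
    for xi, t in zip(x, theta):
        prob *= t if xi == 1 else 1 - t
    return prob

def kronecker_delta(x, x2):
    return 1.0 if x == x2 else 0.0

states = list(itertools.product([0, 1], repeat=3))

# Expected kernel E_{x~p, x'~p}[delta(x, x')], computed by brute force ...
expected_kernel = sum(p(x) * p(x2) * kronecker_delta(x, x2)
                      for x in states for x2 in states)

# ... collapses to the power sum_x p(x)^2, the quantity that is #P-hard in general.
power = sum(p(x) ** 2 for x in states)
assert abs(expected_kernel - power) < 1e-12
```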
Proposition LABEL:pro:_recursive-sum-nodes
Let $p$ and $q$ be two compatible probabilistic circuits over variables $\mathbf{X}$ whose output units $n$ and $m$ are sum units, denoted by $n = \sum_{i \in \mathsf{in}(n)} \theta_i n_i$ and $m = \sum_{j \in \mathsf{in}(m)} \theta'_j m_j$ respectively. Let $k$ be a kernel circuit with its output unit being a sum unit $c$, denoted by $c = \sum_{l \in \mathsf{in}(c)} \theta''_l c_l$. Then it holds that
$\mathbb{E}_{n,m}[c] = \sum_{i \in \mathsf{in}(n)} \sum_{j \in \mathsf{in}(m)} \sum_{l \in \mathsf{in}(c)} \theta_i\, \theta'_j\, \theta''_l\; \mathbb{E}_{n_i, m_j}[c_l].$ (1)
Proof.
$\mathbb{E}_{n,m}[c]$ can be expanded as
$\mathbb{E}_{n,m}[c] = \sum_{\mathbf{x}, \mathbf{x}'} n(\mathbf{x})\, m(\mathbf{x}')\, c(\mathbf{x}, \mathbf{x}') = \sum_{\mathbf{x}, \mathbf{x}'} \Big(\sum_{i} \theta_i n_i(\mathbf{x})\Big) \Big(\sum_{j} \theta'_j m_j(\mathbf{x}')\Big) \Big(\sum_{l} \theta''_l c_l(\mathbf{x}, \mathbf{x}')\Big) = \sum_{i,j,l} \theta_i\, \theta'_j\, \theta''_l\; \mathbb{E}_{n_i, m_j}[c_l].$
∎
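The bilinearity argument in this proof can be checked numerically; the sketch below uses random lookup tables over a toy domain as hypothetical stand-ins for the sub-circuits $n_i$, $m_j$, $c_l$.

```python
import itertools
import random

random.seed(0)
states = list(itertools.product([0, 1], repeat=2))

def rand_fn():
    table = {s: random.random() for s in states}
    return lambda x, t=table: t[x]

def rand_kernel():
    table = {(s, s2): random.random() for s in states for s2 in states}
    return lambda x, x2, t=table: t[(x, x2)]

n_in = [rand_fn() for _ in range(2)]      # inputs of the sum unit n
m_in = [rand_fn() for _ in range(2)]      # inputs of the sum unit m
c_in = [rand_kernel() for _ in range(2)]  # inputs of the kernel sum unit c
th_n, th_m, th_c = [0.4, 0.6], [0.3, 0.7], [0.5, 0.5]

def expected_kernel(f, g, k):
    # brute-force double sum over the toy domain
    return sum(f(x) * g(x2) * k(x, x2) for x in states for x2 in states)

# Left-hand side: expectation under the full sum units.
n = lambda x: sum(t * fi(x) for t, fi in zip(th_n, n_in))
m = lambda x: sum(t * fj(x) for t, fj in zip(th_m, m_in))
c = lambda x, x2: sum(t * cl(x, x2) for t, cl in zip(th_c, c_in))
lhs = expected_kernel(n, m, c)

# Right-hand side: the triple sum over input units from the recursion.
rhs = sum(ti * tj * tl * expected_kernel(ni, mj, cl)
          for ti, ni in zip(th_n, n_in)
          for tj, mj in zip(th_m, m_in)
          for tl, cl in zip(th_c, c_in))
assert abs(lhs - rhs) < 1e-9
```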
Proposition LABEL:pro:_recursive-product-nodes
Let $p$ and $q$ be two compatible probabilistic circuits over variables $\mathbf{X}$ whose output units $n$ and $m$ are product units, denoted by $n = n_1 \cdot n_2$ and $m = m_1 \cdot m_2$, decomposing over the same partition $(\mathbf{X}_1, \mathbf{X}_2)$. Let $k$ be a kernel circuit that is kernel-compatible with the circuit pair $(n, m)$ and with its output unit being a product unit denoted by $c = c_1 \cdot c_2$. Then it holds that
$\mathbb{E}_{n,m}[c] = \mathbb{E}_{n_1, m_1}[c_1] \cdot \mathbb{E}_{n_2, m_2}[c_2].$
Proof.
$\mathbb{E}_{n,m}[c]$ can be expanded as
$\mathbb{E}_{n,m}[c] = \sum_{\mathbf{x}, \mathbf{x}'} n(\mathbf{x})\, m(\mathbf{x}')\, c(\mathbf{x}, \mathbf{x}') = \Big(\sum_{\mathbf{x}_1, \mathbf{x}'_1} n_1(\mathbf{x}_1)\, m_1(\mathbf{x}'_1)\, c_1(\mathbf{x}_1, \mathbf{x}'_1)\Big) \Big(\sum_{\mathbf{x}_2, \mathbf{x}'_2} n_2(\mathbf{x}_2)\, m_2(\mathbf{x}'_2)\, c_2(\mathbf{x}_2, \mathbf{x}'_2)\Big) = \mathbb{E}_{n_1, m_1}[c_1] \cdot \mathbb{E}_{n_2, m_2}[c_2],$
where the factorization follows from the shared partition $(\mathbf{X}_1, \mathbf{X}_2)$ over which $n$, $m$ and $c$ decompose.
∎
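The factorization over the partition $(\mathbf{X}_1, \mathbf{X}_2)$ can likewise be checked on a toy example, with random univariate tables as hypothetical stand-ins for $n_1, n_2, m_1, m_2, c_1, c_2$.

```python
import itertools
import random

random.seed(1)
vals = [0, 1]  # domain of each univariate scope

def rand_fn():
    t = {v: random.random() for v in vals}
    return lambda x, t=t: t[x]

def rand_kernel():
    t = {(v, w): random.random() for v in vals for w in vals}
    return lambda x, x2, t=t: t[(x, x2)]

n1, n2, m1, m2 = (rand_fn() for _ in range(4))
c1, c2 = rand_kernel(), rand_kernel()

states = list(itertools.product(vals, repeat=2))

# Left-hand side: double sum over the joint domain of X = (X1, X2).
lhs = sum(n1(x[0]) * n2(x[1]) * m1(y[0]) * m2(y[1])
          * c1(x[0], y[0]) * c2(x[1], y[1])
          for x in states for y in states)

# Right-hand side: the product of per-scope expected kernels.
e1 = sum(n1(a) * m1(b) * c1(a, b) for a in vals for b in vals)
e2 = sum(n2(a) * m2(b) * c2(a, b) for a in vals for b in vals)
assert abs(lhs - e1 * e2) < 1e-9
```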
Corollary LABEL:cor:_tractable_mmd.
Following the assumptions in Theorem LABEL:thm:_double_sum_complexity, the squared maximum mean discrepancy between $p$ and $q$ in the RKHS associated with kernel $k$, as defined in gretton2012kernel, can be tractably computed.
Proof.
This is an immediate result following Theorem LABEL:thm:_double_sum_complexity by rewriting the squared MMD as defined in gretton2012kernel in the form of a linear combination of expected kernels, that is, $\mathrm{MMD}^2(p, q) = \mathbb{E}_{p,p}[k] - 2\, \mathbb{E}_{p,q}[k] + \mathbb{E}_{q,q}[k]$. ∎
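The decomposition used in this proof can be checked numerically over a small finite domain, where each expected kernel reduces to a quadratic form in the Gram matrix (all quantities below are hypothetical stand-ins).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # size of a small finite domain (hypothetical)

# Two toy distributions and a random positive semi-definite Gram matrix.
p = rng.random(d); p /= p.sum()
q = rng.random(d); q /= q.sum()
A = rng.random((d, d))
K = A @ A.T  # K[i, j] = k(x_i, x_j), PSD by construction

# Over a finite domain, each expected kernel is a quadratic form: E_{p,q}[k] = p^T K q.
def expected_kernel(f, g):
    return f @ K @ g

mmd2 = expected_kernel(p, p) - 2 * expected_kernel(p, q) + expected_kernel(q, q)

# The same quantity as a single quadratic form; nonnegative since K is PSD.
assert abs(mmd2 - (p - q) @ K @ (p - q)) < 1e-9
assert mmd2 >= -1e-12
```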
Corollary LABEL:cor:_tractable_kdsd.
Following the assumptions in Theorem LABEL:thm:_double_sum_complexity, if the probabilistic circuit $p$ further satisfies determinism, the kernelized discrete Stein discrepancy (KDSD) in the RKHS associated with kernel $k$, as defined in yang2018goodness, can be tractably computed.
Before showing the proof of Corollary LABEL:cor:_tractable_kdsd, we first give the definitions necessary for defining the KDSD, so that the presentation is self-contained.
Definition 1.2 (Cyclic permutation).
For a finite set $\mathcal{X}$ and $x \in \mathcal{X}$, a cyclic permutation $\neg: \mathcal{X} \to \mathcal{X}$ is a bijective function such that, for some ordering $x^{(1)}, \ldots, x^{(|\mathcal{X}|)}$ of the elements in $\mathcal{X}$, $\neg x^{(i)} = x^{(i+1)}$ for $i < |\mathcal{X}|$ and $\neg x^{(|\mathcal{X}|)} = x^{(1)}$.
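A minimal implementation of Definition 1.2 (the element names are arbitrary):

```python
# A cyclic permutation on a finite set: fix an ordering of the elements and
# map each one to its successor, wrapping around at the end.
def make_cyclic_permutation(ordered_elements):
    n = len(ordered_elements)
    succ = {ordered_elements[i]: ordered_elements[(i + 1) % n] for i in range(n)}
    return lambda x: succ[x]

neg = make_cyclic_permutation(["a", "b", "c"])
assert neg("a") == "b" and neg("c") == "a"

# Applying the permutation |X| times returns every element to itself (bijectivity).
x = "b"
for _ in range(3):
    x = neg(x)
assert x == "b"
```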
Definition 1.3 (Partial difference operator).
For any function $f: \mathcal{X}^d \to \mathbb{R}$ with $d \in \mathbb{N}$, the partial difference operator $\Delta_{x_i}$ is defined as
$\Delta_{x_i} f(\mathbf{x}) := f(\mathbf{x}) - f(\neg_i \mathbf{x}),$ (2)
with $\neg_i \mathbf{x} := (x_1, \ldots, \neg x_i, \ldots, x_d)$. Moreover, the difference operator $\Delta$ is defined as $\Delta f(\mathbf{x}) := (\Delta_{x_1} f(\mathbf{x}), \ldots, \Delta_{x_d} f(\mathbf{x}))^\top$. Similarly, let $\tilde\neg$ be the inverse permutation of $\neg$, and let $\Delta^*$ denote the difference operator defined with respect to $\tilde\neg$, i.e., $\Delta^*_{x_i} f(\mathbf{x}) := f(\mathbf{x}) - f(\tilde\neg_i \mathbf{x})$.
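On a binary domain the cyclic permutation is just the bit flip, which is its own inverse, so $\Delta$ and $\Delta^*$ coincide; a minimal sketch under that assumption:

```python
# Partial difference operator on {0, 1}^d, instantiating the cyclic
# permutation as the bit flip (a hypothetical toy instantiation).
def flip(x, i):
    y = list(x)
    y[i] = 1 - y[i]
    return tuple(y)

def partial_difference(f, x, i):
    # Delta_{x_i} f(x) = f(x) - f(neg_i x)
    return f(x) - f(flip(x, i))

def difference(f, x):
    # Delta f(x): the vector of all partial differences.
    return [partial_difference(f, x, i) for i in range(len(x))]

f = lambda x: x[0] + 2 * x[1]  # a toy function on {0,1}^2
assert difference(f, (1, 0)) == [1, -2]
```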
Definition 1.4 (Difference score function).
The (difference) score function $\mathbf{s}_p := \Delta p / p$ is defined on the domain $\{\mathbf{x} \in \mathcal{X}^d : p(\mathbf{x}) > 0\}$, a vector-valued function with its $i$-th dimension being
$(\mathbf{s}_p(\mathbf{x}))_i = \frac{\Delta_{x_i} p(\mathbf{x})}{p(\mathbf{x})} = 1 - \frac{p(\neg_i \mathbf{x})}{p(\mathbf{x})}.$ (3)
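A toy instantiation of the score function on binary variables, with a product-of-Bernoullis $p$ as a hypothetical stand-in for a PC and the bit flip as the cyclic permutation:

```python
def flip(x, i):
    y = list(x)
    y[i] = 1 - y[i]
    return tuple(y)

def score(p, x):
    # (s_p(x))_i = (p(x) - p(neg_i x)) / p(x) = 1 - p(neg_i x) / p(x)
    return [1.0 - p(flip(x, i)) / p(x) for i in range(len(x))]

# Toy distribution: independent Bernoulli(0.8) coordinates.
def p(x):
    prob = 1.0
    for xi in x:
        prob *= 0.8 if xi == 1 else 0.2
    return prob

# For x = (1, 1), flipping either coordinate divides p by 4, so each entry is 1 - 1/4.
assert all(abs(s - 0.75) < 1e-12 for s in score(p, (1, 1)))
```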
Given the above definitions, the discrete Stein discrepancy between two distributions $p$ and $q$ is defined as
$\mathrm{DSD}(q \,\|\, p) := \sup_{f \in \mathcal{F}} \mathbb{E}_{\mathbf{x} \sim q}[\mathrm{tr}(\mathcal{A}_p f(\mathbf{x}))],$ (4)
where $f$ is a test function belonging to some function space $\mathcal{F}$, and $\mathcal{A}_p$ is the so-called Stein difference operator, which is defined as
$\mathcal{A}_p f(\mathbf{x}) := \mathbf{s}_p(\mathbf{x})\, f(\mathbf{x})^\top - \Delta f(\mathbf{x}).$ (5)
If the function space $\mathcal{F}$ is a reproducing kernel Hilbert space (RKHS) on $\mathcal{X}^d$ equipped with a kernel function $k$, then the kernelized discrete Stein discrepancy (KDSD) is defined and admits a closed-form representation as
$\mathrm{KDSD}(q \,\|\, p) = \mathbb{E}_{\mathbf{x}, \mathbf{x}' \sim q}[\kappa_p(\mathbf{x}, \mathbf{x}')].$ (6)
Here, the kernel function $\kappa_p$ is defined as
$\kappa_p(\mathbf{x}, \mathbf{x}') := \mathbf{s}_p(\mathbf{x})^\top k(\mathbf{x}, \mathbf{x}')\, \mathbf{s}_p(\mathbf{x}') - \mathbf{s}_p(\mathbf{x})^\top \Delta^*_{\mathbf{x}'} k(\mathbf{x}, \mathbf{x}') - (\Delta^*_{\mathbf{x}} k(\mathbf{x}, \mathbf{x}'))^\top \mathbf{s}_p(\mathbf{x}') + \mathrm{tr}(\Delta^*_{\mathbf{x}, \mathbf{x}'} k(\mathbf{x}, \mathbf{x}')),$
where the difference operator $\Delta^*$ is as in Definition 1.3, and its subscript specifies the variables it operates on.
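To make the closed form concrete, the sketch below brute-forces $\kappa_p$ and the KDSD on binary variables, using the bit flip as the cyclic permutation (its own inverse, so $\Delta^* = \Delta$) and an exponentiated Hamming kernel as a hypothetical choice of $k$; the Stein identity implies $\mathrm{KDSD}(p \,\|\, p) = 0$.

```python
import itertools
import math

d = 2
states = list(itertools.product([0, 1], repeat=d))

def flip(x, i):
    y = list(x); y[i] = 1 - y[i]
    return tuple(y)

def kernel(x, y):
    # exponentiated Hamming kernel, positive definite on {0,1}^d
    return math.exp(-sum(a != b for a, b in zip(x, y)))

def bern(theta):
    def p(x):
        prob = 1.0
        for xi in x:
            prob *= theta if xi == 1 else 1 - theta
        return prob
    return p

def score(p, x):
    return [1.0 - p(flip(x, i)) / p(x) for i in range(d)]

def kappa(p, x, y):
    # the four terms of the Stein kernel; on binary domains Delta* = Delta
    sx, sy = score(p, x), score(p, y)
    t1 = sum(sx[i] * sy[i] for i in range(d)) * kernel(x, y)
    t2 = sum(sx[i] * (kernel(x, y) - kernel(x, flip(y, i))) for i in range(d))
    t3 = sum((kernel(x, y) - kernel(flip(x, i), y)) * sy[i] for i in range(d))
    t4 = sum(kernel(x, y) - kernel(flip(x, i), y) - kernel(x, flip(y, i))
             + kernel(flip(x, i), flip(y, i)) for i in range(d))
    return t1 - t2 - t3 + t4

def kdsd(q, p):
    # brute-force E_{x, x' ~ q}[kappa_p(x, x')]
    return sum(q(x) * q(y) * kappa(p, x, y) for x in states for y in states)

p, q = bern(0.7), bern(0.3)
assert abs(kdsd(p, p)) < 1e-10  # the discrepancy vanishes when q = p
assert kdsd(q, p) > 0           # and is positive when q differs from p
```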
Proof.
[Corollary LABEL:cor:_tractable_kdsd] By the definition of the difference score function, the closed form of the KDSD can be further rewritten as follows.
(7)
where $|\mathcal{X}_i|$ denotes the cardinality of the domain of variable $X_i$, and $\tilde p$ and $\tilde q$ denote the probabilities $\tilde p(\mathbf{x}) := p(\neg_i \mathbf{x})$ and $\tilde q(\mathbf{x}) := q(\neg_i \mathbf{x})$. Notice that the cyclic permutation operates on individual variables, and the resulting PCs $\tilde p$ and $\tilde q$ retain the same structural properties as the PCs $p$ and $q$ respectively. To prove that the KDSD can be tractably computed, it suffices to prove that the expected kernel terms in Equation 7 can be tractably computed.
For a deterministic and structured-decomposable PC $p$, since the PC $\tilde p$ retains the same structure, the resulting ratio $\tilde p / p$ is again a smooth circuit compatible with $p$ by vergari2021compositional. Moreover, since the PCs $p$ and $q$ are compatible, the circuit $\tilde p / p$ is compatible with the PC $q$. Thus, the resulting product $q \cdot (\tilde p / p)$ is a circuit that is smooth and compatible with both $p$ and $q$ by Theorem B.2, and thus compatible with $\tilde q$. By similar arguments, we can verify that all the circuit pairs in the expected kernel terms in Equation 7 satisfy the assumptions in Theorem LABEL:thm:_double_sum_complexity, and thus they are amenable to the tractable computation we propose in Algorithm LABEL:alg:_double-sum, which finishes our proof.
∎
Proposition (convergence of Categorical BBIS).
Let $f$ be a test function. Assume that $f \in \mathcal{H}$, with $\mathcal{H}$ being the RKHS associated with the kernel function $\kappa_p$, and let $\{\mathbf{x}_i\}_{i=1}^n$ be the samples generated by the black-box mechanism; then it holds that
$\sum_{i=1}^{n} w_i f(\mathbf{x}_i) \to \mathbb{E}_{\mathbf{x} \sim p}[f(\mathbf{x})],$
where $\mathbf{w} = \arg\min_{\mathbf{w}} \{ \mathbf{w}^\top \mathbf{K}_p \mathbf{w} \mid \sum_{i=1}^n w_i = 1 \}$ with $(\mathbf{K}_p)_{ij} = \kappa_p(\mathbf{x}_i, \mathbf{x}_j)$. Moreover, the convergence rate is $O(n^{-1/2})$.
Proof.
Let , then it holds that
We further prove the convergence rate of the estimation error by using the importance weights as reference weights. Let . Then it is a degenerate V-statistic [liu2016black] and it holds that . Moreover, we have that , which we denote by , i.e., . Let , then it holds that
Therefore,
∎
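The weights in the proposition solve a quadratic program over the Stein Gram matrix; under the simplifying assumption that the nonnegativity constraint is dropped, the equality-constrained minimizer has the closed form $\mathbf{w} \propto \mathbf{K}_p^{-1}\mathbf{1}$. A sketch with a random positive definite stand-in for $\mathbf{K}_p$:

```python
import numpy as np

# BBIS weights solve  min_w w^T K_p w  s.t.  sum(w) = 1; with the simplex
# constraint relaxed, the closed form is w = K_p^{-1} 1 / (1^T K_p^{-1} 1).
rng = np.random.default_rng(0)
n = 8
A = rng.standard_normal((n, n))
K = A @ A.T + n * np.eye(n)  # symmetric positive definite stand-in for K_p

ones = np.ones(n)
Kinv1 = np.linalg.solve(K, ones)
w = Kinv1 / (ones @ Kinv1)

assert abs(w.sum() - 1.0) < 1e-10

# Optimality check: any other feasible weight vector has a larger objective.
obj = w @ K @ w
for _ in range(100):
    delta = rng.standard_normal(n)
    delta -= delta.mean()  # keep the perturbation feasible: sum stays 1
    w2 = w + 0.1 * delta
    assert w2 @ K @ w2 >= obj - 1e-10
```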
Proposition LABEL:pro:_tractable_conditonal_kernel_function.
Let $p$ be a PC that encodes a conditional distribution over variables $\mathbf{X}$ conditioned on variables $\mathbf{Y}$, and let $k$ be a KC. If the PCs $p(\cdot \mid \mathbf{y})$ and $p(\cdot \mid \mathbf{y}')$ are compatible and $k$ is kernel-compatible with this PC pair for any $\mathbf{y}$, $\mathbf{y}'$, then the conditional kernel function as defined in Proposition LABEL:pro:_kdsd can be tractably computed.
Proof.
From Proposition LABEL:pro:_kdsd, the conditional kernel function can be written as
where the term can be expanded as follows.
For any $\mathbf{y}$, $\mathbf{y}'$, given that none of the variables in $\mathbf{Y}$ is flipped in the above formulation, the kernel can be further written as
By substituting this into the expected kernel, the expectation with respect to the conditional distributions can be simplified to a constant zero, that is,
Thus, the conditional kernel function can be expanded as
Since Theorem LABEL:thm:_double_sum_complexity has shown that expected kernels can be computed exactly in time linear in the size of each PC, the conditional kernel function can also be computed exactly in time bilinear in the sizes of the circuits $p(\cdot \mid \mathbf{y})$ and $p(\cdot \mid \mathbf{y}')$ that represent the conditional probability distributions given the index set. ∎
2 Algorithms
Algorithm 1 summarizes how to perform the BBIS scheme we propose for Categorical distributions to generate a set of weighted samples.
Input: a target distribution $p$ over variables $\mathbf{X}$, a black-box sampling mechanism, a kernel function $k$, and the number of samples $n$
Output: weighted samples $\{(\mathbf{x}_i, w_i)\}_{i=1}^{n}$
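A brute-force, end-to-end sketch of this scheme (the toy target $p$, the uniform black-box mechanism, and the exponentiated Hamming kernel are all hypothetical stand-ins; the paper's algorithm instead evaluates the Stein kernel with circuit operations, and the full scheme keeps the weights on the simplex):

```python
import math
import numpy as np

rng = np.random.default_rng(0)
d = 3

def flip(x, i):
    y = list(x); y[i] = 1 - y[i]
    return tuple(y)

def p(x):
    # toy target: independent Bernoulli(0.8) coordinates
    prob = 1.0
    for xi in x:
        prob *= 0.8 if xi == 1 else 0.2
    return prob

def score(x):
    return [1.0 - p(flip(x, i)) / p(x) for i in range(d)]

def k(x, y):
    return math.exp(-sum(a != b for a, b in zip(x, y)))

def kappa(x, y):
    # Stein kernel; on binary domains the inverse flip equals the flip
    sx, sy = score(x), score(y)
    out = sum(sx[i] * sy[i] for i in range(d)) * k(x, y)
    for i in range(d):
        out -= sx[i] * (k(x, y) - k(x, flip(y, i)))
        out -= (k(x, y) - k(flip(x, i), y)) * sy[i]
        out += (k(x, y) - k(flip(x, i), y)
                - k(x, flip(y, i)) + k(flip(x, i), flip(y, i)))
    return out

# Step 1: draw samples from the black-box mechanism (uniform here).
n = 40
samples = [tuple(int(v) for v in rng.integers(0, 2, size=d)) for _ in range(n)]

# Step 2: assemble the Stein Gram matrix K_p[i, j] = kappa(x_i, x_j).
Kp = np.array([[kappa(x, y) for y in samples] for x in samples])
Kp += 1e-6 * np.eye(n)  # jitter: duplicated samples make Kp singular

# Step 3: weights minimizing w^T Kp w subject to sum(w) = 1.
ones = np.ones(n)
v = np.linalg.solve(Kp, ones)
w = v / (ones @ v)

assert abs(w.sum() - 1.0) < 1e-8
```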