
Learning Implicit Generative Models Using
Differentiable Graph Tests

Josip Djolonga
ETH Zürich
josipd@inf.ethz.ch
   Andreas Krause
ETH Zürich
krausea@ethz.ch
Abstract

Recently, there has been a growing interest in the problem of learning rich implicit models — those from which we can sample, but cannot evaluate their density. These models apply some parametric function, such as a deep network, to a base measure, and are learned end-to-end using stochastic optimization. One strategy for devising a loss function is through the statistics of two-sample tests — if we can fool a statistical test, the learned distribution should be a good model of the true data. However, not all tests can easily fit into this framework, as they might not be differentiable with respect to the data points, and hence with respect to the parameters of the implicit model. Motivated by this problem, in this paper we show how two such classical tests, the Friedman-Rafsky and $k$-nearest neighbour tests, can be effectively smoothed using ideas from undirected graphical models — the matrix tree theorem and cardinality potentials. Moreover, as we show experimentally, smoothing can significantly increase the power of the test, which may be of independent interest. Finally, we apply our method to learn implicit models.

1 Introduction

The main motivation for our work is that of learning implicit models, i.e., those from which we can easily sample, but cannot evaluate their density. Formally, we can generate a sample from an implicit distribution $Q$ by first drawing $\mathbf{z}$ from some known and fixed distribution $Q_{0}$, typically Gaussian or uniform, and then passing it through some differentiable function $f_{\bm{\theta}}$ parametrized by some vector ${\bm{\theta}}$ to generate $\mathbf{x}=f_{\bm{\theta}}(\mathbf{z})\sim Q$. The goal is then to optimize the parameters ${\bm{\theta}}$ of the mapping $f_{\bm{\theta}}$ so that $Q$ is as close as possible to some target distribution $P$, which we can access only via iid samples. The approach that we undertake in this paper is that of defeating statistical two-sample tests. These tests operate in the following setting — given two sets of iid samples, $X_{1}=\{\mathbf{x}_{1},\mathbf{x}_{2},\ldots,\mathbf{x}_{n_{1}}\}$ from $P$, and $X_{2}=\{\mathbf{x}_{n_{1}+1},\mathbf{x}_{n_{1}+2},\ldots,\mathbf{x}_{n_{1}+n_{2}}\}$ from $Q$, we have to distinguish between the following hypotheses

H_{0}\colon P=Q\quad\textrm{ vs }\quad H_{1}\colon P\neq Q.

The tests that we consider start by defining a function $T\colon(\mathbb{R}^{d})^{n_{1}}\times(\mathbb{R}^{d})^{n_{2}}\to\mathbb{R}$ that should result in a low value if the two samples come from different distributions. Then, the hypothesis $H_{0}$ is rejected at significance level $\alpha\in[0,1]$ if $T(X_{1},X_{2})$ is lower than some threshold $t_{\alpha}$, which is computed using a permutation test, as explained in Section 2. Going back to the original problem, one intuitive approach would be to maximize the expected statistic $\mathbb{E}_{\mathbf{x}_{i}\sim P,\mathbf{z}_{i}\sim Q_{0}}[T(\{\mathbf{x}_{i}\}_{i=1}^{n_{1}},\{f_{\bm{\theta}}(\mathbf{z}_{i})\}_{i=n_{1}+1}^{n_{1}+n_{2}})]$ using stochastic optimization over the parameters of the mapping $f_{\bm{\theta}}$. However, this requires the availability of the derivatives $\partial T/\partial\mathbf{x}_{i}$, which is unfortunately not always possible. For example, the Friedman-Rafsky (FR) and $k$-nearest neighbours ($k$-NN) tests, which have very desirable statistical properties (including consistency and convergence of their statistics to $f$-divergences), cannot be cast in the above framework as they use the output of a combinatorial optimization problem. Our main contribution is the development of differentiable versions of these tests that remedy the above problem by smoothing their statistics. We moreover show, similarly to these classical tests, that our tests are asymptotically normal under certain conditions, and derive the corresponding $t$-statistic, which can be evaluated with minimal additional complexity. Our smoothed tests can have more power than their classical variants, as we showcase with numerical experiments. Finally, we experimentally learn implicit models in Section 5.

Related work.

The problem of two-sample testing for distributional equality has received significant interest in statistics. For example, the celebrated Kolmogorov-Smirnov test compares two one-dimensional distributions by taking the maximal difference of the empirical CDFs. Another one-dimensional test is the runs test of Wald and Wolfowitz [1], which has been extended to the multivariate case by Friedman and Rafsky [2] (FR). It is exactly this test, together with the $k$-NN test originally suggested in [3], that we analyze. These tests have been analyzed in more detail by Henze and Penrose [4], and by Henze [5] and Schilling [6], respectively. Their asymptotic efficiency has been discussed by Bhattacharya [7]. Chen and Zhang [8] considered the problem of tie breaking when applying the FR test to discrete data and suggested averaging over all minimal spanning trees, which can be seen as a special case of our test in the low-temperature setting. A very prominent test that has been developed more recently is the kernel maximum mean discrepancy (MMD) test of Gretton et al. [9], which we compare with in Section 5. Its test statistic is differentiable and has been used for learning implicit models by Li et al. [10] and Dziugaite et al. [11]. Sutherland et al. [12] consider the problem of learning the kernel by creating a $t$-statistic using a variance estimator. Moreover, they also pioneered the idea of using tests for model criticism — for two fixed distributions, one optimizes over the parameters of the test (the kernel used). The energy test of Székely and Rizzo [13], a special case of the MMD test, has been used by Bellemare et al. [14].

Other approaches for learning implicit models that do not depend on two-sample tests have been developed as well. For example, one approach is to estimate the log-ratio of the distributions [15]. Another approach, which has recently sparked significant interest and can also be seen as estimating the log-ratio of the distributions, is the generative adversarial network (GAN) of Goodfellow et al. [16], who pose the problem as a two-player game. One can, as done in [12], combine GANs with two-sample tests by using the tests as feature matchers at some layer of the generating network [17]. Nowozin et al. [18] minimize an arbitrary $f$-divergence [19] using a GAN framework, which can be related to our approach, because our test statistics converge in the limit to specific $f$-divergences, as explained in Section 2. For an overview of various approaches to learning implicit models we direct the reader to Mohamed and Lakshminarayanan [20].

2 Classical Graph Tests

Let us start by introducing some notation. For any set $X=\{\mathbf{x}_{1},\mathbf{x}_{2},\ldots,\mathbf{x}_{n}\}$ of points in $\mathbb{R}^{d}$, we will denote by $\mathcal{G}(X)=(X,E)$ the complete directed graph defined over the vertex set $X$ with edges $E$ (for the FR test we will arbitrarily choose one of the two edges for each pair of nodes). We will moreover weigh this graph using some function $d\colon\mathbb{R}^{d}\times\mathbb{R}^{d}\to[0,\infty)$; a natural choice would be $d(\mathbf{x},\mathbf{x}^{\prime})=\|\mathbf{x}-\mathbf{x}^{\prime}\|$. Similarly, we will use $d(e)$ for the weight of the edge $e$ under $d(\cdot,\cdot)$. For any labelling of the vertices $\pi\colon X\to\{1,2\}$, and any edge $e\in E$ with adjacent vertices $i$ and $j$, we define $\Delta_{\pi}(e)=\llbracket\pi(i)\neq\pi(j)\rrbracket$, where $\llbracket S\rrbracket$ is the Iverson bracket that evaluates to 1 if $S$ is true and 0 otherwise; i.e., $\Delta_{\pi}(e)$ indicates whether the endpoints of $e$ have different labels under $\pi$. Remember that we are given $n_{1}$ points $X_{1}=\{\mathbf{x}_{1},\mathbf{x}_{2},\ldots,\mathbf{x}_{n_{1}}\}$ from $P$, and $n_{2}$ points $X_{2}=\{\mathbf{x}_{n_{1}+1},\mathbf{x}_{n_{1}+2},\ldots,\mathbf{x}_{n_{1}+n_{2}}\}$ from $Q$. In the remainder of the paper we will use $n=n_{1}+n_{2}$ for the total number of points. The tests are based on the following four-step strategy.

  (i) Pool the samples $X_{1}$ and $X_{2}$ together into $X=X_{1}\cup X_{2}=\{\mathbf{x}_{1},\mathbf{x}_{2},\ldots,\mathbf{x}_{n_{1}+n_{2}}\}$, and create the graph $\mathcal{G}(X)$. Define the mapping $\pi^{*}\colon X\to\{1,2\}$ evaluating to 1 on $X_{1}$ and to 2 on $X_{2}$.

  (ii) Using some well-defined algorithm $\mathcal{A}$, choose a subset $U^{*}=\mathcal{A}(\mathcal{G}(X))$ of the edges of this graph, with the underlying motivation that it defines some neighbourhood structure.

  (iii) Count how many edges in $U^{*}$ connect points from $X_{1}$ with points from $X_{2}$, i.e., compute the statistic $T_{\pi^{*}}(U^{*})=\sum_{e\in U^{*}}\Delta_{\pi^{*}}(e)$.

  (iv) Reject $H_{0}$ for small values of $T_{\pi^{*}}(U^{*})$.

These tests condition on the data and are executed as permutation tests, so that the critical value in step (iv) is computed using the quantiles of $T_{\pi}(U^{*})$ under $\pi\sim H_{0}$, where $\pi\colon X\to\{1,2\}$ is drawn uniformly at random from the set of ${n_{1}+n_{2}\choose n_{1}}$ labellings that map exactly $n_{1}$ points from $X$ to 1. Formally, the $p$-value is given as $\mathbb{E}_{\pi\sim H_{0}}[\llbracket T_{\pi^{*}}(U^{*})\geq T_{\pi}(U^{*})\rrbracket]$. We are now ready to introduce the two tests that we consider in this paper, which are obtained by using a different neighbourhood selection algorithm $\mathcal{A}$ in step (ii).
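For concreteness, a Monte Carlo version of this permutation test can be sketched as follows (the function name and the dense 0/1 adjacency layout for $U^{*}$ are ours, and the exact $p$-value would average over all ${n_{1}+n_{2}\choose n_{1}}$ labellings rather than a random subset):

import numpy as np

def permutation_p_value(adj, labels, n_perms=1000, rng=None):
    """Monte Carlo estimate of the p-value E_{pi~H0}[[T_{pi*}(U*) >= T_pi(U*)]].

    adj:    (n, n) 0/1 adjacency matrix of the chosen edge set U*,
    labels: length-n array holding pi* (1 for X1, 2 for X2).
    The classical tests reject H0 for small values of T, so we count the
    random labellings whose statistic is at most the observed one.
    """
    rng = np.random.default_rng() if rng is None else rng
    labels = np.asarray(labels)

    def statistic(lab):
        # T_pi(U*): number of edges in U* whose endpoints get different labels
        return (adj * (lab[:, None] != lab[None, :])).sum()

    t_obs = statistic(labels)
    hits = sum(statistic(rng.permutation(labels)) <= t_obs for _ in range(n_perms))
    return hits / n_perms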

Friedman-Rafsky (FR).

This test, developed by Friedman and Rafsky [2], uses the minimum-spanning tree (MST) of 𝒢(X)\mathcal{G}(X) as the neighbourhood structure UU^{*}, which can be computed using the classical algorithms of Prim [21] and Kruskal [22] in time O(n2logn)O(n^{2}\log n). If we use d(𝐱i,𝐱j)=𝐱i𝐱jd(\mathbf{x}_{i},\mathbf{x}_{j})=\|\mathbf{x}_{i}-\mathbf{x}_{j}\|, the problem is also known as the Euclidean spanning tree problem, and in this case Henze and Penrose [4] have proven that the test is consistent and has the following asymptotic limit.

Theorem 1 ([4]).

If d(𝐱,𝐱)=𝐱𝐱d(\mathbf{x},\mathbf{x}^{\prime})=\|\mathbf{x}-\mathbf{x}^{\prime}\| and n1/(n1+n2)α(0,1)n_{1}/(n_{1}+n_{2})\to\alpha\in(0,1), then it almost surely holds that

Tπ(U)n1+n22α(1α)p(𝐱)q(𝐱)αp(𝐱)+(1α)q(𝐱)𝑑𝐱,\frac{T_{\pi^{*}}(U^{*})}{n_{1}+n_{2}}\to 2\alpha(1-\alpha)\int\frac{p(\mathbf{x})q(\mathbf{x})}{\alpha p(\mathbf{x})+(1-\alpha)q(\mathbf{x})}d\mathbf{x},

where pp and qq are the densities of PP and QQ.

As noted by Berisha and Hero [23], after some algebraic manipulation of the right hand side of the above equation, we obtain that 1Tπ(U)n1+n22n1n21-T_{\pi^{*}}(U^{*})\frac{n_{1}+n_{2}}{2n_{1}n_{2}} converges almost surely to the following ff-divergence [19]

DαFR(PQ)\displaystyle D^{\textrm{FR}}_{\alpha}(P\,\|\,Q) =14α(1α)(αp(𝐱)(1α)q(𝐱))2αp(𝐱)+(1α)q(𝐱)𝑑𝐱\displaystyle=\frac{1}{4\alpha(1-\alpha)}\int\frac{(\alpha p(\mathbf{x})-(1-\alpha)q(\mathbf{x}))^{2}}{\alpha p(\mathbf{x})+(1-\alpha)q(\mathbf{x})}d\mathbf{x}
(2α1)24α(1α).\displaystyle-\frac{(2\alpha-1)^{2}}{4\alpha(1-\alpha)}.

In [23] it is also noted that if $n_{1}=n_{2}$, then $\alpha=1/2$; since in that case $1/(4\alpha(1-\alpha))=1$ and $(2\alpha-1)^{2}=0$, $D^{\textrm{FR}}_{1/2}$ is equal to $\frac{1}{2}\int\frac{(p(\mathbf{x})-q(\mathbf{x}))^{2}}{p(\mathbf{x})+q(\mathbf{x})}d\mathbf{x}$, which is, up to a constant factor, the symmetric $\chi^{2}$ divergence.

[Plot of the functions $f(x)$ generating $D^{\textrm{FR}}_{1/2}$ and $D^{\textrm{NN}}_{1/2}$.]
Figure 1: The functions generating the ff-divergences.

kk-nearest-neighbours (kk-NN).

Perhaps the most intuitive way to construct a neighbourhood structure is to connect each point $\mathbf{x}_{j}\in X$ to its $k$ nearest neighbours. Specifically, we will add the edge $\mathbf{x}_{i}\to\mathbf{x}_{j}$ to $U^{*}$ iff $\mathbf{x}_{i}$ is one of the $k$ closest neighbours of $\mathbf{x}_{j}$ as measured by $d(\mathbf{x},\mathbf{x}^{\prime})$. If one uses the Euclidean norm, then the asymptotic distribution and the consistency of the test have been proven by Schilling [6]. These results have been extended to arbitrary norms by Henze [5], who also proved the limiting behaviour of the statistic as $n\to\infty$.

Theorem 2 ([5]).

If n1/(n1+n2)α(0,1)n_{1}/(n_{1}+n_{2})\to\alpha\in(0,1), then 1Tπ(U)(n1+n2)k1-\frac{T_{\pi^{*}}(U^{*})}{(n_{1}+n_{2})k} converges in probability to

DαNN(PQ)α2p2(𝐱)+(1α)2q2(𝐱)αp(𝐱)+(1α)q(𝐱)𝑑𝐱,D^{\textrm{NN}}_{\alpha}(P\,\|\,Q)\equiv\int\frac{\alpha^{2}p^{2}(\mathbf{x})+(1-\alpha)^{2}q^{2}(\mathbf{x})}{\alpha p(\mathbf{x})+(1-\alpha)q(\mathbf{x})}d\mathbf{x},

where pp and qq are the continuous densities of PP and QQ.

As for the FR test, we can also re-write the limit as an $f$-divergence corresponding to $f(t)=(\alpha^{2}t^{2}+(1-\alpha)^{2})/(\alpha t+(1-\alpha))$; this $f$ does not vanish at one, but we can simply shift it. Moreover, if we compare the integrands in $D^{\textrm{FR}}_{\alpha}$ and $D^{\textrm{NN}}_{\alpha}$, we see that they are closely related: they differ by the term $2\alpha(1-\alpha)p(\mathbf{x})q(\mathbf{x})$ in the numerator. This can also be seen from Figure 1, where we plot the corresponding $f$-functions for the $n_{1}=n_{2}$ case.
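Indeed, expanding the square in the FR numerator makes the relation explicit:

(\alpha p(\mathbf{x})-(1-\alpha)q(\mathbf{x}))^{2}=\alpha^{2}p^{2}(\mathbf{x})+(1-\alpha)^{2}q^{2}(\mathbf{x})-2\alpha(1-\alpha)p(\mathbf{x})q(\mathbf{x}),

so, up to the constant $1/(4\alpha(1-\alpha))$ in front of $D^{\textrm{FR}}_{\alpha}$, the two integrands share the same denominator and their numerators differ exactly by $2\alpha(1-\alpha)p(\mathbf{x})q(\mathbf{x})$.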

3 Differentiable Graph Tests

While the tests from the previous section have been studied from a statistical perspective, we can not use them to train implicit models because the derivatives T/𝐱i\partial T/\partial\mathbf{x}_{i} are either zero or do not exist, as TT takes on finitely many values. The strategy that we undertake in this paper is to smooth them into continuously differentiable functions by relaxing them to expectations in natural probabilistic models. To motivate the models we will introduce, note that for both the kk-NN and the FR test, the optimal neighbourhood is the solution to the following optimization problem

U=argminUEeUd(e)s.t. ν(U)=1,U^{*}=\operatorname*{arg\,min}_{U\subseteq E}\sum_{e\in U}d(e)\;\textrm{s.t.\ }\nu(U)=1, (1)

where ν:2E{0,1}\nu\colon 2^{E}\to\{0,1\} indicates if the set of edges is valid, i.e., if every vertex has exactly kk neighbours in the kk-NN case, or if the set of edges forms a poly-tree in the MST case. Moreover, note that once we fix n1n_{1} and n2n_{2}, the optimization problem (1) depends only on the edge weights d(e)d(e), which we will concatenate in an arbitrary order and store in the vector 𝐝|E|\mathbf{d}\in\mathbb{R}^{|E|}. We want to design a probability distribution over UU that focuses on those configurations UU that are both feasible and have a low cost for problem (1). One such natural choice is the following Gibbs measure

P(U𝐝/λ)=eeUd(e)/λA(𝐝/λ)ν(U),P(U\mid\mathbf{d}/\lambda)=e^{-\sum_{e\in U}d(e)/\lambda-A(-\mathbf{d}/\lambda)}\nu(U), (2)

where λ\lambda is the so-called temperature parameter, and A(𝐝/λ)A(-\mathbf{d}/\lambda) is the log-partition function that ensures that the distribution is normalized. Note that UU^{*} is a MAP configuration of this distribution (2), and the distribution will concentrate on the MAP configurations as λ0\lambda\to 0. Once we have fixed the model, the strategy is clear — replace the statistic Tπ(U)T_{\pi^{*}}(U^{*}) with its expectation 𝔼U[Tπ(U)]{\mathbb{E}}_{U}[T_{\pi^{*}}(U)], which results in the following smooth statistic

T_{\pi^{*}}(U^{*})\;\longrightarrow\;T_{\pi^{*}}^{\lambda}\equiv\mathbb{E}_{U\sim P(\cdot\mid\mathbf{d}/\lambda)}[T_{\pi^{*}}(U)]=\sum_{e\in E}\Delta_{\pi^{*}}(e)\,\bm{\mu}(\mathbf{d}/\lambda)_{e},

where 𝝁(𝐝/λ)\bm{\mu}(\mathbf{d}/\lambda) are the marginal probabilities of the edges, i.e., [𝝁(𝐝/λ)]e=𝔼P(U𝐝/λ)[eU][\bm{\mu}(\mathbf{d}/\lambda)]_{e}=\mathbb{E}_{P(U\mid\mathbf{d}/\lambda)}[\llbracket e\in U\rrbracket]. Hence, we can compute the statistic as long as we can perform inference in (2). To compute its derivatives we can use the fact that (2) is a member of the exponential family. Namely, leveraging the classical properties of the log-partition function [24, Prop. 3.1], we obtain the following identities

\bm{\mu}(\mathbf{d}/\lambda)=\nabla A(-\mathbf{d}/\lambda),\textrm{ and} \qquad (3)
\frac{\partial\bm{\mu}(\mathbf{d}/\lambda)_{e}}{\partial(-d(e^{\prime})/\lambda)}=\mathbb{E}_{P(U\mid\mathbf{d}/\lambda)}[\llbracket\{e,e^{\prime}\}\subseteq U\rrbracket]-\bm{\mu}(\mathbf{d}/\lambda)_{e}\,\bm{\mu}(\mathbf{d}/\lambda)_{e^{\prime}}.

Thus, if we can compute both first- and second-order moments under (2), we get both the smoothed statistic and its derivative. We show how to do this for the kk-NN and FR tests in Section 4.
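For concreteness, once the marginals are available, the smoothed statistic is just a weighted count of cross-edges; a small NumPy sketch (the dense matrix layout and function name are ours):

import numpy as np

def smoothed_statistic(mu, labels):
    """T^lambda_{pi*} = sum_e Delta_{pi*}(e) mu(d/lambda)_e.

    mu:     (n, n) array with the edge marginals mu(d/lambda)_{i->j}
            (zero where there is no edge),
    labels: length-n array with entries 1 or 2 (the pooled labelling pi*).
    """
    labels = np.asarray(labels)
    delta = (labels[:, None] != labels[None, :]).astype(float)  # Delta_{pi*}(e)
    return (mu * delta).sum()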

A smooth pp-value.

Even though one can directly use the smoothed test statistic $T_{\pi^{*}}^{\lambda}$ as an objective when learning implicit models, higher values of this statistic do not necessarily correspond to higher $p$-values. Remember that to compute a $p$-value, one has to run a permutation test by computing quantiles of $T_{\pi}^{\lambda}$ under random draws of the labelling $\pi\sim H_{0}$. However, as this procedure is not smooth and can be costly to compute, we suggest the following $t$-statistic as an alternative that does not suffer from these problems

t^{\lambda}_{\pi^{*}}=\frac{T_{\pi^{*}}^{\lambda}-\mathbb{E}_{\pi\sim H_{0}}[T_{\pi}^{\lambda}]}{\sqrt{\mathbb{V}_{\pi\sim H_{0}}[T_{\pi}^{\lambda}]}}. \qquad (4)

The same strategy has been undertaken for the FR and $k$-NN tests in [2, 4, 6]. Before we show how to compute the first two moments under $H_{0}$, we need to define the matrix $\Pi$ holding the second moments of the variables $\Delta_{\pi}(e)$.

Lemma 1 ([2]).

The matrix Π|E|×|E|\Pi\in\mathbb{R}^{|E|\times|E|} with entries Πe,e=𝔼πH0[Δπ(e)Δπ(e)]\Pi_{e,e^{\prime}}={\mathbb{E}}_{\pi\sim H_{0}}[\Delta_{\pi}(e)\Delta_{\pi}(e^{\prime})] is equal to

\Pi_{e,e^{\prime}}=\begin{cases}\frac{2n_{1}n_{2}}{n(n-1)}&\textrm{ if }\delta(e)=\delta(e^{\prime}),\\ \frac{n_{1}n_{2}}{n(n-1)}&\textrm{ if }|\delta(e)\cap\delta(e^{\prime})|=1,\\ \frac{4n_{1}n_{2}(n_{1}-1)(n_{2}-1)}{n(n-1)(n-2)(n-3)}&\textrm{ if }\delta(e)\cap\delta(e^{\prime})=\emptyset,\end{cases}

where δ(e)\delta(e) is the set of vertices incident to the edge eEe\in E.

Theorem 3.

Assume that all valid configurations $U$ satisfy $|U|=m$, i.e., that $\nu(U)\neq 0$ implies $|U|=m$ (note that we have $m=kn$ for $k$-NN and $m=n-1$ for FR). Then, the first two moments of the statistic under $H_{0}$ are

\mathbb{E}_{\pi\sim H_{0}}[T^{\lambda}_{\pi}]=2mn_{1}n_{2}/n(n-1),\textrm{ and}
\mathbb{V}_{\pi\sim H_{0}}[T^{\lambda}_{\pi}]=\bm{\mu}(\mathbf{d}/\lambda)^{T}\Pi\,\bm{\mu}(\mathbf{d}/\lambda)-\frac{4n_{1}^{2}n_{2}^{2}}{n^{2}(n-1)^{2}}m^{2}.

While the computation of the mean is trivial, it seems that the computation of the variance needs O(|E|2)O(|E|^{2}) operations. However, we can simplify its computation to O(|E|)O(|E|) using the following result.

Lemma 2.

Define χ1=n1n2n(n1)\chi_{1}=\frac{n_{1}n_{2}}{n(n-1)} and χ2=4(n11)(n21)(n2)(n3)\chi_{2}=\frac{4(n_{1}-1)(n_{2}-1)}{(n-2)(n-3)}. Then, the variance can be computed as

\sigma^{2}=\chi_{1}(1-\chi_{2})\sum_{v}\Big(\sum_{e\in\delta(v)}\mu_{e}\Big)^{2}+\chi_{1}\chi_{2}\sum_{e\|e^{\prime}}\mu_{e}\mu_{e^{\prime}}+\chi_{1}(\chi_{2}-4\chi_{1})m^{2},

where $\sum_{e\|e^{\prime}}$ sums over all pairs of parallel edges, i.e., those connecting the same end-points (including the pairs $e^{\prime}=e$ and $e^{\prime}=\overline{e}$).
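A small NumPy sketch of the resulting $O(|E|)$ computation of the null mean and variance (the dense layout and function name are ours; per the proof in Appendix A, the parallel-edge sum includes the pairs $e^{\prime}=e$ and $e^{\prime}=\overline{e}$):

import numpy as np

def null_moments(mu, n1, n2, m):
    """Mean and variance of T^lambda under H0 (Theorem 3 / Lemma 2) in O(|E|).

    mu: (n, n) array of edge marginals mu_{i->j} with a zero diagonal.
    """
    n = n1 + n2
    chi1 = n1 * n2 / (n * (n - 1))
    chi2 = 4 * (n1 - 1) * (n2 - 1) / ((n - 2) * (n - 3))
    mean = 2 * m * chi1
    node_sums = mu.sum(axis=0) + mu.sum(axis=1)        # sum_{e in delta(v)} mu_e per vertex
    parallel = (mu ** 2).sum() + (mu * mu.T).sum()     # pairs e' = e and e' = reverse of e
    var = (chi1 * (1 - chi2) * (node_sums ** 2).sum()
           + chi1 * chi2 * parallel
           + chi1 * (chi2 - 4 * chi1) * m ** 2)
    return mean, var

The smoothed $t$-statistic (4) is then simply (smoothed_statistic(mu, labels) - mean) / np.sqrt(var).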

Approximate normality of tπλt_{\pi^{*}}^{\lambda}.

To better motivate the use of a $t$-statistic, we can, similarly to the arguments in [2, 4, 6], show that it is close to a normal distribution by casting it as a generalized correlation coefficient [25, 3]. Namely, these are tests whose statistics are of the form $\kappa=\sum_{i=1}^{n}\sum_{j=1}^{n}\overline{\mu}_{i,j}b_{i,j}$, and whose critical values are computed using the distribution of $\sum_{i=1}^{n}\sum_{j=1}^{n}\overline{\mu}_{i,j}b_{\pi(i),\pi(j)}$, where $\pi$ is a random permutation on $\{1,2,\ldots,n\}$. It is easily seen that we can fit the suggested tests in this framework if we set $\overline{\mu}_{i,j}=\frac{1}{2}(\bm{\mu}(\mathbf{d}/\lambda)_{i\to j}+\bm{\mu}(\mathbf{d}/\lambda)_{j\to i})$ and $b_{i,j}=\Delta_{\pi^{*}}(\{i,j\})$. Then, using the conditions of Barbour and Eagleson [26], we obtain the following bound on the deviation from normality.

Theorem 4.

Let n1/(n1+n2)α(0,1)n_{1}/(n_{1}+n_{2})\to\alpha\in(0,1), and define

  • S2=i,j,kμ¯i,jμ¯i,kS_{2}=\sum_{i,j,k}\overline{\mu}_{i,j}\overline{\mu}_{i,k}, i.e., the expected number of edges sharing a vertex,

  • S3=i,j,k,mμ¯i,jμ¯i,kμ¯i,mS_{3}=\sum_{i,j,k,m}\overline{\mu}_{i,j}\overline{\mu}_{i,k}\overline{\mu}_{i,m}, i.e., the expected number of 3 stars, and

  • L4=i,j,k,mμ¯i,jμ¯j,kμ¯k,mL_{4}=\sum_{i,j,k,m}\overline{\mu}_{i,j}\overline{\mu}_{j,k}\overline{\mu}_{k,m}, i.e., the expected number of paths with 4 nodes.

Then, the Wasserstein distance between the permutation null distribution of $t_{\pi}^{\lambda}$ and the standard normal is of order $O\big((nk^{3}+kS_{2}+S_{3}+L_{4})/\sigma^{3}\big)$.

Let us analyze the above bound in the setting in which we will use it — when $n_{1}=n_{2}$. First, let us look at the variance, as formulated in Lemma 2. The last term can be ignored as it is always non-negative, because $\chi_{2}\geq 4\chi_{1}$ (shown in the appendix). Because $\sum_{e\in\delta(v)}\mu_{e}\geq 1$, it follows that the variance grows as $\Omega(n)$. Thus, without any additional assumption on the growth of the neighbourhoods, we have asymptotic normality as $n\to\infty$ if the numerator is of order $o(n^{1.5})$. For example, this would be satisfied if the largest neighbourhood $\max_{i}\sum_{e\in\delta(i)}\overline{\mu}_{e}$ grows as $o(n^{1/6})$. Note that in the low temperature setting (when $\lambda\to 0$), the coordinates of $\bm{\mu}$ will be very close to either zero or one. As observed by Friedman and Rafsky [2], in this case the degrees of the nodes of both the $k$-NN and MST graphs are bounded by a constant independent of $n$ as $n\to\infty$ [27], so that $S_{2}$, $S_{3}$ and $L_{4}$ grow only linearly with $n$. We also observe experimentally in Section 5 that the distribution gets closer to normality as $\lambda$ decreases.

4 The Differentiable kk-NN and Friedman-Rafsky Tests

In this section, we discuss these two tests in more detail and show how to efficiently compute their statistics. Remember that to compute and optimize both $T_{\pi^{*}}^{\lambda}$ and $t_{\pi^{*}}^{\lambda}$ we have to be able to perform inference in the model $P(U)=\exp(-\sum_{e\in U}d(e)/\lambda-A(-\mathbf{d}/\lambda))\nu(U)$ by computing the first- and second-order moments of the edge indicator variables. We stress that, in the learning setting that we consider, $n$ refers to the number of data points in a mini-batch.

kk-NN.

The constraint $\nu(\cdot)$ in this case requires the number of edges in $U$ incoming at each node to be exactly $k$. First, note that the problem completely separates per node, i.e., the marginals of edges with different target vertices are independent. Formally, if we denote by $U_{i}$ the set of edges incoming at vertex $i$, then $U_{i}$ and $U_{j}$ are independent for $i\neq j$. Hence, for each node $i$ separately, we have to perform inference in

P(U_{i})\propto\exp\Big(-\sum_{j\in U_{i}}d(\mathbf{x}_{i},\mathbf{x}_{j})/\lambda\Big)\llbracket|U_{i}|=k\rrbracket,

which is a special case of the cardinality potentials considered by Tarlow et al. [28], Swersky et al. [29]. Swersky et al. [29] consider the same model, and note that we can compute all marginals in time O(nk)O(nk) using the algorithm in [28], which works by re-writing the model as a chain CRF and running the classical forward-backward algorithm. Hence, the total time complexity to compute the vector 𝝁(𝐝/λ)\bm{\mu}(\mathbf{d}/\lambda) is O(n2k)O(n^{2}k). Moreover, as marginalization requires only simple operations, we can compute the derivatives with any automatic differentiation software, and we thus do not provide formulas for the second-order moments. In [29] the authors provide an approximation for the Jacobian, which we did not use in our experiments, but instead we differentiate through the messages of the forward-backward algorithm.
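As an illustration, the per-node marginals can also be computed exactly with the following recursion over elementary symmetric polynomials — a minimal NumPy sketch that is mathematically equivalent to, but not the same implementation as, the chain-CRF forward-backward of [28, 29]; it is numerically naive (no log-domain computations) and the function name is ours:

import numpy as np

def exact_k_subset_marginals(neg_energies, k):
    """Marginals of P(S) proportional to exp(sum_{j in S} neg_energies[j]) * [|S| = k].

    neg_energies: length-m array (here, -d(x_i, x_l)/lambda for l != i, with m >= k).
    Runs in O(mk): mu_j = w_j * e_{k-1}(w without j) / e_k(w), where e_c is the
    c-th elementary symmetric polynomial of the weights w = exp(neg_energies).
    """
    w = np.exp(np.asarray(neg_energies, dtype=float))
    m = len(w)
    # forward[t, c] = e_c(w_1, ..., w_t), backward[t, c] = e_c(w_{t+1}, ..., w_m)
    forward = np.zeros((m + 1, k + 1))
    backward = np.zeros((m + 1, k + 1))
    forward[:, 0] = 1.0
    backward[:, 0] = 1.0
    for t in range(1, m + 1):
        for c in range(1, k + 1):
            forward[t, c] = forward[t - 1, c] + w[t - 1] * forward[t - 1, c - 1]
    for t in range(m - 1, -1, -1):
        for c in range(1, k + 1):
            backward[t, c] = backward[t + 1, c] + w[t] * backward[t + 1, c - 1]
    partition = forward[m, k]                                  # e_k of all weights
    mu = np.empty(m)
    for j in range(m):
        e_rest = sum(forward[j, c] * backward[j + 1, k - 1 - c] for c in range(k))
        mu[j] = w[j] * e_rest / partition
    return mu

For node $i$ one calls it with neg_energies set to $-d(\mathbf{x}_{i},\mathbf{x}_{l})/\lambda$ for all $l\neq i$; the returned marginals are non-negative and sum to $k$.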

As a concrete example, let us work out the simplest case — the kk-NN test with k=1k=1. In this case, the smoothed statistic reduces to

Tπλ(𝐱1,,𝐱n)=i=1nj=1π(i)π(j)nsi(𝐱1,,𝐱n)j,T^{\lambda}_{\pi^{*}}(\mathbf{x}_{1},\ldots,\mathbf{x}_{n})=\sum_{i=1}^{n}\sum_{\begin{subarray}{c}j=1\\ \pi^{*}(i)\neq\pi^{*}(j)\end{subarray}}^{n}s_{i}(\mathbf{x}_{1},\ldots,\mathbf{x}_{n})_{j},

where $s_{i}(\mathbf{x}_{1},\ldots,\mathbf{x}_{n})=\texttt{softmax}\big((-\|\mathbf{x}_{i}-\mathbf{x}_{l}\|/\lambda)_{l\neq i}\big)$. In other words, for each $i$ one computes the softmax of the negative scaled distances to all other points using $s_{i}$, and then sums up only those positions that correspond to points from the other sample. One interpretation of the loss is the following — maximize the number of incorrect predictions if we are to estimate the label $\pi^{*}(i)$ from $\mathbf{x}_{i}$ using a soft $1$-nearest neighbour approach.
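A minimal PyTorch sketch of this $k=1$ statistic (the function name and the use of torch.cdist are ours; it assumes the Euclidean distance and a pooled sample with a 0/1 labelling):

import torch

def smoothed_1nn_statistic(x, labels, lam):
    """Differentiable 1-NN statistic T^lambda_{pi*} for the pooled sample.

    x:      (n, d) tensor of pooled samples (the generated half carries gradients),
    labels: (n,) tensor with entries 0 or 1 marking the two samples,
    lam:    smoothing temperature lambda.
    """
    n = x.shape[0]
    dist = torch.cdist(x, x)                                   # pairwise Euclidean distances
    eye = torch.eye(n, dtype=torch.bool, device=x.device)
    dist = dist.masked_fill(eye, float('inf'))                 # exclude self-edges
    s = torch.softmax(-dist / lam, dim=1)                      # s[i, j]: soft neighbour weights
    cross = (labels.unsqueeze(0) != labels.unsqueeze(1)).float()
    return (s * cross).sum()                                   # mass on edges joining the samples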

Furthermore, we can also make a clear connection between the smooth 11-NN test and neighbourhood component analysis (NCA) [30]. Namely, we can see NCA as learning a mapping h:𝐱A𝐱h\colon\mathbf{x}\to A\mathbf{x} so that the test distinguishes (by minimizing TπλT^{\lambda}_{\pi^{*}}) the two samples as best as possible after applying hh on them. The extension of NCA to kk-NN [31] can be also seen as minimizing the test statistic for a particular instance of their loss function.

Friedman-Rafsky.

The model that we have to perform inference in for this test seems extremely complicated and intractable at first, because the constraint has the form $\nu(U)=\llbracket U\textrm{ forms a spanning tree}\rrbracket$. First, note that if $\mathbf{d}/\lambda$ had all entries equal to a constant $\gamma$, we would have $A(-\mathbf{d}/\lambda)=(1-n)\gamma+\log c_{\mathcal{G}(X)}$, where $c_{\mathcal{G}(X)}$ is the number of spanning trees in the graph $\mathcal{G}(X)$, which can be computed using Kirchhoff's (also known as the matrix-tree) theorem. To treat the weighted case, we use the approach of Lyons [32], who showed that the above model is a determinantal point process (DPP), so that marginalization can be done exactly as follows. First, create the incidence matrix $A\in\{-1,0,+1\}^{(n-1)\times|E|}$ of the graph $\mathcal{G}(X)$ after removing an arbitrary vertex, and construct its Laplacian $L=A\,\texttt{diag}\big[\exp(-\mathbf{d}/\lambda)\big]A^{T}$. Then, if we compute $H=L^{-1/2}A\,\texttt{diag}\big[\exp(-\mathbf{d}/(2\lambda))\big]$, the distribution $P(U)$ is a DPP with kernel matrix $K=H^{T}H$, implying that for every $W\subseteq E$

𝔼P(U𝐝/λ)[WU]=detKW,\mathbb{E}_{P(U\mid\mathbf{d}/\lambda)}[\llbracket W\subseteq U\rrbracket]=\det K_{W},

where KWK_{W} is the |W|×|W||W|\times|W| submatrix of KK formed by the rows and columns indexed by WW. Thus, we can easily compute all marginals and the smoothed test statistic and its derivatives using (3) as

\mu_{i\to j}=e^{-d(\mathbf{x}_{i},\mathbf{x}_{j})/\lambda}\,(\mathbf{u}_{i}-\mathbf{u}_{j})^{T}L^{-1}(\mathbf{u}_{i}-\mathbf{u}_{j}),\textrm{ and}
\frac{\partial\mu_{i\to j}}{\partial(d(\mathbf{x}_{k},\mathbf{x}_{l})/\lambda)}=e^{-\frac{d(\mathbf{x}_{i},\mathbf{x}_{j})+d(\mathbf{x}_{k},\mathbf{x}_{l})}{\lambda}}\big((\mathbf{u}_{i}-\mathbf{u}_{j})^{T}L^{-1}(\mathbf{u}_{k}-\mathbf{u}_{l})\big)^{2},

where $\mathbf{u}_{i}$ is the vector with all coordinates equal to zero, except the $i$-th coordinate, which is one, and the second identity holds for distinct edges $(i\to j)\neq(k\to l)$. Note that if we first compute the inverse $L^{-1}$, all quantities of the form $L^{-1}(\mathbf{u}_{i}-\mathbf{u}_{j})$ can be computed in time $O(n)$ as the vectors $\mathbf{u}_{i}$ have a single non-zero entry, for a total complexity of $O(n^{3})$.
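A dense NumPy sketch of this computation (a hypothetical helper, not the authors' implementation; it grounds the first vertex and returns the symmetric matrix of marginals, so each unordered pair appears twice):

import numpy as np

def fr_edge_marginals(x, lam):
    """Edge marginals of the spanning-tree Gibbs measure used by the smoothed FR test.

    Implements mu_{i->j} = exp(-d_ij/lam) (u_i - u_j)^T L^{-1} (u_i - u_j), with the
    Laplacian L built on the weights exp(-d/lam) and one vertex removed; O(n^3).
    """
    n = x.shape[0]
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    w = np.exp(-d / lam)
    np.fill_diagonal(w, 0.0)
    L = np.diag(w.sum(axis=1)) - w                      # full graph Laplacian on exp(-d/lam)
    Linv = np.linalg.inv(L[1:, 1:])                     # ground (remove) vertex 0
    G = np.zeros((n, n))                                # pad back: vertex 0 gets zero rows/cols
    G[1:, 1:] = Linv
    diag = np.diag(G)
    resistance = diag[:, None] + diag[None, :] - 2 * G  # (u_i - u_j)^T L^{-1} (u_i - u_j)
    mu = w * resistance
    np.fill_diagonal(mu, 0.0)
    return mu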

To speed up this computation we can leverage the existing theory on fast solvers for Laplacian systems. Let us first create from $\mathcal{G}(X)$ the graph $e^{\mathcal{G}}(X)$ that has the same structure as $\mathcal{G}(X)$, but with edge weights $e^{-d(e)/\lambda}$ instead of $d(e)$. Hence, in this graph, a large weight between $\mathbf{x}$ and $\mathbf{x}^{\prime}$ indicates that these two points are similar to one another. In $e^{\mathcal{G}}(X)$, the marginals $\bm{\mu}_{e}$ are, up to the edge weights, the effective resistances (for additional properties of the effective resistances see [33]). Spielman and Srivastava [34] provide a method to compute all marginals at once in time that is $\tilde{O}(rn^{2}/\varepsilon^{2})$, where $\varepsilon$ is the desired relative precision and $r=\frac{1}{\lambda}(\max_{e}d(e)-\min_{e}d(e))$. The idea is to first solve for $Z^{T}=L^{-1}A\,\texttt{diag}\big[\exp(-\mathbf{d}/2\lambda)\big]R$, where $R\in\{-1/\sqrt{p},+1/\sqrt{p}\}^{|E|\times p}$ is a random projection matrix with elements chosen uniformly from $\{-1/\sqrt{p},+1/\sqrt{p}\}$ and $p=O(\log n/\varepsilon^{2})$. The suggested approximation is then $\mu_{i\to j}\approx e^{-d(\mathbf{x}_{i},\mathbf{x}_{j})/\lambda}\|Z(\mathbf{u}_{i}-\mathbf{u}_{j})\|^{2}$. While computing $Z$ naïvely would take $O(n^{3}+n^{2}p)$, one achieves the claimed bound with the Laplacian solver of Spielman and Teng [35].

As an extra benefit, the above connection provides an alternative interpretation of the smoothed FR test. Namely, assume that we want to create a spectral sparsifier [36] of $e^{\mathcal{G}}(X)$, which should contain significantly fewer edges, but be a good summary of the graph by having a similar spectrum. Spielman and Srivastava [34] provide a strategy to create such a sparsifier by sampling edges randomly, where edge $e$ is sampled proportionally to $\mu_{e}$. Hence, by optimizing $T^{\lambda}_{\pi^{*}}$ we are encouraging the constructed sparsifier of $e^{\mathcal{G}}(X)$ to have in expectation as many edges as possible connecting points from $X_{1}$ with points from $X_{2}$.

5 Experiments

We implemented our methods in Python using the PyTorch library. For the kk-NN test, we have adapted the code accompanying [29]. Throughout this section we used a 10 dimensional normal as Q0Q_{0}, drew samples of equal size n1=n2n_{1}=n_{2}, and used the 2\ell_{2} norm d(𝐱,𝐱)=𝐱𝐱2d(\mathbf{x},\mathbf{x}^{\prime})=\|\mathbf{x}-\mathbf{x}^{\prime}\|_{2} as a weighting function. We provide additional details in Appendix B.

Power as a function of λ\lambda and dd.

In our first experiment we analyze the effect of the smoothing strength on the power of our differentiable tests. In addition to the classical FR and $k$-NN tests, we have considered the unbiased MMD test [9] with the squared exponential kernel (as implemented in Shogun [37] using the code from [12]), and the energy test [13]. The problem that we consider, which is challenging in high dimensions, is that of differentiating the distribution $\mathcal{N}(\mathbf{0},I)$ from $\mathcal{N}((\mu,0,\ldots,0),\texttt{diag}(\sigma^{2},1,\ldots,1))$. This setting was considered to be fair in [38], as the KL divergence between the distributions is constant irrespective of the dimension. To set the smoothing strength and the bandwidth of the MMD kernel (in addition to the median heuristic) we used the same strategy as in [38] by setting $\lambda=d^{\gamma}$ for varying $\gamma\in[0,1]$. The results are presented in Figure 2, where we can observe that (i) our tests perform similarly to MMD for shift alternatives, while performing significantly better for scale alternatives, and (ii) by varying the smoothing parameter we can significantly increase the power of the test. In the third column we present only the best performing MMD, while we present the remaining results in Appendix B. Note that we expect the power to go to zero as the dimension increases [7, 38].

Learning.

As we have already hinted in the introduction, we stochastically optimize

\textrm{max.}_{\bm{\theta}}\,\mathbb{E}_{\mathbf{x}_{i}\sim P,\mathbf{z}_{i}\sim Q_{0}}\big[t_{\pi^{*}}^{\lambda}(\{\mathbf{x}_{i}\}_{i=1}^{n_{1}},\{f_{\bm{\theta}}(\mathbf{z}_{i})\}_{i=n_{1}+1}^{n_{1}+n_{2}})\big]

using the Adam [39] optimizer. To optimize, we draw at each round $n_{1}$ samples from the true distribution $P$ and $n_{2}=n_{1}$ samples from the base measure $Q_{0}$, and then plug them into the smoothed $t$-statistic.
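A minimal PyTorch sketch of this training loop (it assumes a hypothetical smoothed_t_statistic helper implementing $t_{\pi^{*}}^{\lambda}$, e.g. assembled from the marginal and null-moment computations of Sections 3 and 4, and a sample_p callable that returns real data):

import torch

def train_implicit_model(f_theta, sample_p, noise_dim, n1=256, lam=1.0,
                         steps=10_000, lr=1e-4):
    opt = torch.optim.Adam(f_theta.parameters(), lr=lr)
    for _ in range(steps):
        x_real = sample_p(n1)                              # n1 samples from P
        z = torch.randn(n1, noise_dim)                     # n2 = n1 samples from Q0
        x_fake = f_theta(z)
        x = torch.cat([x_real, x_fake], dim=0)             # pooled sample
        labels = torch.cat([torch.zeros(n1), torch.ones(n1)])
        t = smoothed_t_statistic(x, labels, lam)           # hypothetical: differentiable t^lambda
        loss = -t                                          # maximize the smoothed t-statistic
        opt.zero_grad()
        loss.backward()
        opt.step()
    return f_theta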

The first experiment we perform, with the goal of understanding the effects of λ\lambda, is on the toy two moons dataset from scikit-learn [40]. We show the results in Figure 3. From the second row, showing the estimated pp-value versus the correct one (from 1000 random permutations) at several points during training, we can indeed see that the permutation null gets closer to normality as λ\lambda decreases. Most importantly, note that the relationship is monotone, so that we would expect the optimization to not be significantly harmed if we use the approximation. Qualitatively, we can observe that the solutions have the general structure of PP, and that they improve as we decrease λ\lambda — the symmetry is better captured and the two moons get better separated.

MNIST.

Finally, we have trained several models on the MNIST [41] dataset, which we present in Figure 4. We can observe that despite the high (784) dimensional data and the fact that we use the distance directly on the pixels, the learned models generate digits that look mostly realistic and are competitive with those obtained using MMD [10, 11].

(a) Power against the alternative $(\mu=0.5,\sigma=1)$ from $n_{1}=n_{2}=128$ samples.
(b) Power against the alternative $(\mu=0,\sigma=3)$ from $n_{1}=n_{2}=128$ samples.
(c) Power against the alternative $(\mu=0,\sigma=3)$ from $n_{1}=n_{2}=256$ samples.
Figure 2: Test power when comparing two normal distributions. In the first two columns we present the 33-NN and FR tests as we vary λ\lambda — we use fr-γ\gamma for λ=dγ\lambda=d^{\gamma}, and fr-ct for the classical test (analogously for 33-NN). The legends presented in the first row are consistent across the respective columns. The last column compares the best performing of these tests with the best performing MMD tests (the remaining MMD plots are provided in Appendix B). Note that our smoothed tests have the largest power, and they significantly outperform their classical counterparts.
(a) Original data.
(b) $1$-NN with $\lambda=10$.
(c) $1$-NN with $\lambda=1$.
(d) $1$-NN with $\lambda=0.05$.
Figure 3: The effect of varying λ\lambda on the learned model and the normality of the null statistic. Note that with decreasing λ\lambda we get closer to normality, and the learned distribution better models the true one.
(a) $1$-NN with $\lambda=10$ and $n_{1}=256$.
(b) $1$-NN with $\lambda=10$ and $n_{1}=512$.
(c) FR with $\lambda=10$ and $n_{1}=128$.
(d) FR with $\lambda=5$ and $n_{1}=128$.
Figure 4: Four different models trained on MNIST.

6 Conclusion

We have developed smoothed two-sample graph tests that can be used for learning implicit models. These tests moreover outperform their classical equivalents on the problem of two sample testing. We have shown how to compute them by performing inference in undirected models, and presented alternative interpretations by drawing connections to neighbourhood component analysis and spectral graph sparsifiers. In the last section we have experimentally showcased the benefits of our approach, and presented results from a learned model.

Acknowledgements.

The research was partially supported by ERC StG 307036 and a Google European PhD Fellowship.

References

  • Wald and Wolfowitz [1940] Abraham Wald and Jacob Wolfowitz. On a test whether two samples are from the same population. Annals of Mathematical Statistics, 11(2):147–162, 1940.
  • Friedman and Rafsky [1979] Jerome H Friedman and Lawrence C Rafsky. Multivariate generalizations of the wald-wolfowitz and smirnov two-sample tests. Annals of Statistics, pages 697–717, 1979.
  • Friedman and Rafsky [1983] Jerome H Friedman and Lawrence C Rafsky. Graph-theoretic measures of multivariate association and prediction. Annals of Statistics, pages 377–391, 1983.
  • Henze and Penrose [1999] Norbert Henze and Mathew D Penrose. On the multivariate runs test. Annals of Statistics, pages 290–298, 1999.
  • Henze [1988] Norbert Henze. A multivariate two-sample test based on the number of nearest neighbor type coincidences. Annals of Statistics, pages 772–783, 1988.
  • Schilling [1986] Mark F Schilling. Multivariate two-sample tests based on nearest neighbors. Journal of the American Statistical Association, 81(395):799–806, 1986.
  • Bhattacharya [2015] Bhaswar B Bhattacharya. Power of graph-based two-sample tests. arXiv preprint arXiv:1508.07530, 2015.
  • Chen and Zhang [2013] Hao Chen and Nancy R Zhang. Graph-based tests for two-sample comparisons of categorical data. Statistica Sinica, pages 1479–1503, 2013.
  • Gretton et al. [2012] Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.
  • Li et al. [2015] Yujia Li, Kevin Swersky, and Rich Zemel. Generative moment matching networks. In International Conference on Machine Learning (ICML), 2015.
  • Dziugaite et al. [2015] Gintare Karolina Dziugaite, Daniel M. Roy, and Zoubin Ghahramani. Training generative neural networks via maximum mean discrepancy optimization. In Uncertainty in Artificial Intelligence (UAI), 2015.
  • Sutherland et al. [2016] Dougal J Sutherland, Hsiao-Yu Tung, Heiko Strathmann, Soumyajit De, Aaditya Ramdas, Alex Smola, and Arthur Gretton. Generative models and model criticism via optimized maximum mean discrepancy. In International Conference on Learning Representations (ICLR), 2016.
  • Székely and Rizzo [2013] Gábor J Székely and Maria L Rizzo. Energy statistics: A class of statistics based on distances. Journal of Statistical Planning and Inference, 143(8):1249–1272, 2013.
  • Bellemare et al. [2017] Marc G Bellemare, Ivo Danihelka, Will Dabney, Shakir Mohamed, Balaji Lakshminarayanan, Stephan Hoyer, and Rémi Munos. The cramer distance as a solution to biased wasserstein gradients. arXiv preprint arXiv:1705.10743, 2017.
  • Sugiyama et al. [2012] Masashi Sugiyama, Taiji Suzuki, and Takafumi Kanamori. Density ratio estimation in machine learning. Cambridge University Press, 2012.
  • Goodfellow et al. [2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS), pages 2672–2680, 2014.
  • Salimans et al. [2016] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Advances in Neural Information Processing Systems (NIPS), pages 2234–2242, 2016.
  • Nowozin et al. [2016] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. ff-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems (NIPS), pages 271–279, 2016.
  • Ali and Silvey [1966] Syed Mumtaz Ali and Samuel D Silvey. A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society. Series B (Methodological), pages 131–142, 1966.
  • Mohamed and Lakshminarayanan [2016] Shakir Mohamed and Balaji Lakshminarayanan. Learning in implicit generative models. arXiv preprint arXiv:1610.03483, 2016.
  • Prim [1957] Robert Clay Prim. Shortest connection networks and some generalizations. Bell Labs Technical Journal, 36(6):1389–1401, 1957.
  • Kruskal [1956] Joseph B Kruskal. On the shortest spanning subtree of a graph and the traveling salesman problem. Proceedings of the American Mathematical society, 7(1):48–50, 1956.
  • Berisha and Hero [2015] Visar Berisha and Alfred O Hero. Empirical non-parametric estimation of the fisher information. IEEE Signal Processing Letters, 22(7):988–992, 2015.
  • Wainwright and Jordan [2008] Martin J Wainwright and Michael I Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1(1-2), 2008.
  • Daniels [1944] Henry E Daniels. The relation between measures of correlation in the universe of sample permutations. Biometrika, 33(2):129–135, 1944.
  • Barbour and Eagleson [1986] AD Barbour and GK Eagleson. Random association of symmetric arrays. Stochastic Analysis and Applications, 4(3):239–281, 1986.
  • Yukich [2006] Joseph E Yukich. Probability theory of classical Euclidean optimization problems. Springer, 2006.
  • Tarlow et al. [2012] Daniel Tarlow, Kevin Swersky, Richard S Zemel, Ryan Prescott Adams, and Brendan J Frey. Fast exact inference for recursive cardinality models. Uncertainty in Artificial Intelligence (UAI), 2012.
  • Swersky et al. [2012] Kevin Swersky, Ilya Sutskever, Daniel Tarlow, Richard S Zemel, Ruslan R Salakhutdinov, and Ryan P Adams. Cardinality restricted boltzmann machines. In Advances in Neural Information Processing Systems (NIPS), pages 3293–3301, 2012.
  • Goldberger et al. [2005] Jacob Goldberger, Geoffrey E Hinton, Sam T Roweis, and Ruslan R Salakhutdinov. Neighbourhood components analysis. In Advances in Neural Information Processing Systems (NIPS), pages 513–520, 2005.
  • Tarlow et al. [2013] Daniel Tarlow, Kevin Swersky, Laurent Charlin, Ilya Sutskever, and Rich Zemel. Stochastic k-neighborhood selection for supervised and unsupervised learning. In International Conference on Machine Learning, pages 199–207, 2013.
  • Lyons [2003] Russell Lyons. Determinantal probability measures. Publications mathématiques de l’IHÉS, 98(1):167–212, 2003.
  • Chandra et al. [1996] Ashok K Chandra, Prabhakar Raghavan, Walter L Ruzzo, Roman Smolensky, and Prasoon Tiwari. The electrical resistance of a graph captures its commute and cover times. Computational Complexity, 6(4):312–340, 1996.
  • Spielman and Srivastava [2011] Daniel A Spielman and Nikhil Srivastava. Graph sparsification by effective resistances. SIAM Journal on Computing, 40(6):1913–1926, 2011.
  • Spielman and Teng [2014] Daniel A Spielman and Shang-Hua Teng. Nearly linear time algorithms for preconditioning and solving symmetric, diagonally dominant linear systems. SIAM Journal on Matrix Analysis and Applications, 35(3):835–885, 2014.
  • Spielman and Teng [2011] Daniel A Spielman and Shang-Hua Teng. Spectral sparsification of graphs. SIAM Journal on Computing, 40(4):981–1025, 2011.
  • Sonnenburg et al. [2010] Sören Sonnenburg, Sebastian Henschel, Christian Widmer, Jonas Behr, Alexander Zien, Fabio de Bona, Alexander Binder, Christian Gehl, Vojtěch Franc, et al. The shogun machine learning toolbox. Journal of Machine Learning Research, 11(Jun):1799–1802, 2010.
  • Ramdas et al. [2015] Aaditya Ramdas, Sashank Jakkam Reddi, Barnabás Póczos, Aarti Singh, and Larry A Wasserman. On the decreasing power of kernel and distance based nonparametric hypothesis tests in high dimensions. In AAAI, 2015.
  • Kingma and Ba [2015] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
  • Pedregosa et al. [2011] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
  • LeCun et al. [1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Appendix A Proofs

Proof of Theorem 3.

The expectation of the statistic under H0H_{0} is (when π\pi is a uniformly random labelling)

eEμ(𝐝/λ)e𝔼π[Δπ(e)]2n1n2/n(n1)=2mn1n2/n(n1),\sum_{e\in E}\mu(\mathbf{d}/\lambda)_{e}\underbrace{{\mathbb{E}}_{\pi}[\Delta_{\pi}(e)]}_{2n_{1}n_{2}/n(n-1)}=2mn_{1}n_{2}/n(n-1),

where the inner expectation 𝔼π[Δπ(e)]{\mathbb{E}}_{\pi}[\Delta_{\pi}(e)] has been computed in [2]. We can also easily compute the variance as

e,eECovπH0[μeΔπ(e),μeΔπ(e)]\displaystyle\sum_{e,e^{\prime}\in E}{\mathrm{Cov}}_{\pi\sim H_{0}}[\mu_{e}\Delta_{\pi}(e),\mu_{e^{\prime}}\Delta_{\pi}(e^{\prime})] =e,eEμeμe𝔼πH0[Δπ(e)Δπ(e)]Πe,e4n12n22n2(n1)2m2(𝔼πH0[Tπλ])2.\displaystyle=\sum_{e,e^{\prime}\in E}\mu_{e}\mu_{e^{\prime}}\underbrace{{\mathbb{E}}_{\pi\sim H_{0}}[\Delta_{\pi}(e)\Delta_{\pi}(e^{\prime})]}_{\Pi_{e,e^{\prime}}}-\underbrace{\frac{4n_{1}^{2}n_{2}^{2}}{n^{2}(n-1)^{2}}m^{2}}_{({\mathbb{E}}_{\pi\sim H_{0}}[T^{\lambda}_{\pi^{*}}])^{2}}. (5)

Proof of Lemma 2.

We can split the sum in the variance formula over all edge pairs into three groups as follows

eeen1n2n(n1)χ1μeμe+eee4n1n2(n11)(n21)n(n1)(n2)(n3)χ1χ2μeμe+en1n2n(n1)χ1(μe2+μeμe¯),\sum_{e}\sum_{e^{\prime}\sim e}\underbrace{\frac{n_{1}n_{2}}{n(n-1)}}_{\chi_{1}}\mu_{e}\mu_{e^{\prime}}+\sum_{e}\sum_{e^{\prime}\perp e}\underbrace{\frac{4n_{1}n_{2}(n_{1}-1)(n_{2}-1)}{n(n-1)(n-2)(n-3)}}_{\chi_{1}\chi_{2}}\mu_{e}\mu_{e^{\prime}}+\sum_{e}\underbrace{\frac{n_{1}n_{2}}{n(n-1)}}_{\chi_{1}}(\mu_{e}^{2}+\mu_{e}\mu_{\overline{e}}), (6)

where $\sum_{e^{\prime}\sim e}$ sums over all edges $e^{\prime}$ that share at least one vertex with $e$, $\sum_{e^{\prime}\perp e}$ sums over those edges that share no vertex with $e$, and $\overline{e}$ denotes the reverse edge of $e$ (if it exists; the corresponding term is zero otherwise). Note that each term $\mu_{e}\mu_{e^{\prime}}$ appears twice if $e\neq e^{\prime}$, as in the formula for the variance (5). Moreover, note that if $\delta(e)=\delta(e^{\prime})$, then in the above formula the term $\mu_{e}\mu_{e^{\prime}}$ (same for $\mu_{e^{\prime}}\mu_{e}$) gets multiplied by $2\chi_{1}=\Pi_{e,e^{\prime}}$, as it appears in both the first and the third term. Given the assumption that $|U|=m$ under $\nu(\cdot)$, we also know that

m2=(eμe)2=eeμeμe=eeeμeμe+eeeμeμe,m^{2}=(\sum_{e}\mu_{e})^{2}=\sum_{e}\sum_{e^{\prime}}\mu_{e}\mu_{e^{\prime}}=\sum_{e}\sum_{e^{\prime}\sim e}\mu_{e}\mu_{e^{\prime}}+\sum_{e}\sum_{e^{\prime}\perp e}\mu_{e}\mu_{e^{\prime}},

so that eq. (6) can be simplified to

χ1eeeμeμe+χ1χ2(m2eeeμeμe)+χ1e(μe2+μeμe¯),\chi_{1}\sum_{e}\sum_{e^{\prime}\sim e}\mu_{e}\mu_{e^{\prime}}+\chi_{1}\chi_{2}(m^{2}-\sum_{e}\sum_{e^{\prime}\sim e}\mu_{e}\mu_{e^{\prime}})+\chi_{1}\sum_{e}(\mu_{e}^{2}+\mu_{e}\mu_{\overline{e}}),

which can be further simplified to

χ1(1χ2)eeeμeμe+χ1e(μe2+μeμe¯)+χ1χ2m2.\chi_{1}(1-\chi_{2})\sum_{e}\sum_{e^{\prime}\sim e}\mu_{e}\mu_{e^{\prime}}+\chi_{1}\sum_{e}(\mu_{e}^{2}+\mu_{e}\mu_{\overline{e}})+\chi_{1}\chi_{2}m^{2}.

Now the result follows by observing that

v(eδ(v)μe)2=eeeμeμe+eμe2+eμeμe¯.\sum_{v}(\sum_{e\in\delta(v)}\mu_{e})^{2}=\sum_{e}\sum_{e^{\prime}\sim e}\mu_{e}\mu_{e^{\prime}}+\sum_{e}\mu_{e}^{2}+\sum_{e}\mu_{e}\mu_{\overline{e}}.

To understand why this holds, let us count how many times each term μeμe\mu_{e}\mu_{e^{\prime}} appears on both sides of the equality if we expand the lhs. If eee\neq e^{\prime} and they share exactly one vertex, then the lhs will have two μeμe\mu_{e}\mu_{e^{\prime}} terms, as μe\mu_{e} and μe\mu_{e^{\prime}} will be multiplied only at the term corresponding to the shared vertex. On the other hand, if e=ee=e^{\prime} we will again have two μeμe=μe2\mu_{e}\mu_{e^{\prime}}=\mu_{e}^{2} terms, as we get one contribution from each end-point of ee. Finally, if e=e¯e^{\prime}=\overline{e}, we have a total of four μeμe\mu_{e}\mu_{e^{\prime}} terms, as we get two μeμe\mu_{e}\mu_{e^{\prime}} from each end-point. Thus, eq. (6) is equal to

χ1(1χ2)(v(eδ(v)μe)2eμe2eμeμe¯)+χ1e(μe2+μeμe¯)+χ1χ2m2.\chi_{1}(1-\chi_{2})\big{(}\sum_{v}(\sum_{e\in\delta(v)}\mu_{e})^{2}-\sum_{e}\mu_{e}^{2}-\sum_{e}\mu_{e}\mu_{\overline{e}}\big{)}+\chi_{1}\sum_{e}(\mu_{e}^{2}+\mu_{e}\mu_{\overline{e}})+\chi_{1}\chi_{2}m^{2}.

Finally, if we subtract 4χ12m24\chi_{1}^{2}m^{2} and simplify the expression we have

χ1(1χ2)v(eδ(v)μe)2+χ1χ2eμe2+χ1χ2eμeμe¯+χ1(χ24χ1)m2,\chi_{1}(1-\chi_{2})\sum_{v}(\sum_{e\in\delta(v)}\mu_{e})^{2}+\chi_{1}\chi_{2}\sum_{e}\mu_{e}^{2}+\chi_{1}\chi_{2}\sum_{e}\mu_{e}\mu_{\overline{e}}+\chi_{1}(\chi_{2}-4\chi_{1})m^{2},

which is exactly what is claimed in the theorem, if we observe that ee and e¯\overline{e} are the only edges parallel to ee. ∎

Proof that χ24χ10\chi_{2}-4\chi_{1}\geq 0 when n1=n2=n/2n_{1}=n_{2}=n/2.

First, note that n1nn11n2\frac{n_{1}}{n}\leq\frac{n_{1}-1}{n-2}, if and only if n1n2n1nn1nn_{1}n-2n_{1}\leq nn_{1}-n, which is equivalent to n112nn_{1}\geq\frac{1}{2}n. Similarly, we have n2n1n21n3\frac{n_{2}}{n-1}\leq\frac{n_{2}-1}{n-3} iff nn23n2nn2nn2+1nn_{2}-3n_{2}\leq nn_{2}-n-n_{2}+1, which can be re-written as 2n2n+1-2n_{2}\leq-n+1, i.e., n2n212n_{2}\geq\frac{n}{2}-\frac{1}{2}. Combining these two inequalities proves the result.

Proof of Theorem 4.

Let us compute an upper bound on the quantities in [26].

a1\displaystyle a_{1} =1n(n1)i,jμ¯i,j=kn\displaystyle=\frac{1}{n(n-1)}\sum_{i,j}\overline{\mu}_{i,j}=\frac{k}{n} b1\displaystyle b_{1} =2n(n1)n2n1=Θ(1)\displaystyle=\frac{2}{n(n-1)}n_{2}n_{1}=\Theta(1)
a2\displaystyle a_{2} =1n(n1)(n2)i,j,kμ¯i,jμ¯i,kS2\displaystyle=\frac{1}{n(n-1)(n-2)}\underbrace{\sum_{i,j,k}\overline{\mu}_{i,j}\overline{\mu}_{i,k}}_{S_{2}} b2\displaystyle b_{2} =n2n12+n1n22n(n1)(n2)=Θ(1)\displaystyle=\frac{n_{2}n_{1}^{2}+n_{1}n_{2}^{2}}{n(n-1)(n-2)}=\Theta(1)
a3\displaystyle a_{3} =1n(n1)(n2)(n3)i,j,k,mμ¯i,jμ¯i,kμ¯i,mS3\displaystyle=\frac{1}{n(n-1)(n-2)(n-3)}\underbrace{\sum_{i,j,k,m}\overline{\mu}_{i,j}\overline{\mu}_{i,k}\overline{\mu}_{i,m}}_{S_{3}} b3\displaystyle b_{3} =n2n13+n1n23n(n1)(n2)(n3)=Θ(1)\displaystyle=\frac{n_{2}n_{1}^{3}+n_{1}n_{2}^{3}}{n(n-1)(n-2)(n-3)}=\Theta(1)
a4\displaystyle a_{4} =1n(n1)(n2)(n3)i,j,k,mμ¯k,iμ¯i,jμ¯j,mL4\displaystyle=\frac{1}{n(n-1)(n-2)(n-3)}\underbrace{\sum_{i,j,k,m}\overline{\mu}_{k,i}\overline{\mu}_{i,j}\overline{\mu}_{j,m}}_{L_{4}} b4\displaystyle b_{4} =2n22n12n(n1)(n2)(n3)=Θ(1)\displaystyle=2\frac{n_{2}^{2}n_{1}^{2}}{n(n-1)(n-2)(n-3)}=\Theta(1)
a5\displaystyle a_{5} =1n(n1)(n2)i,j,kμ¯i,j2μ¯i,k=O(a2)\displaystyle=\frac{1}{n(n-1)(n-2)}\sum_{i,j,k}\overline{\mu}_{i,j}^{2}\overline{\mu}_{i,k}=O(a_{2}) b5\displaystyle b_{5} =b2\displaystyle=b_{2}
a6\displaystyle a_{6} =1n(n1)i,jμ¯i,j3=O(a1)\displaystyle=\frac{1}{n(n-1)}\sum_{i,j}\overline{\mu}_{i,j}^{3}=O(a_{1}) b6\displaystyle b_{6} =b1\displaystyle=b_{1}
a7\displaystyle a_{7} =1n(n1)(n2)i,j,k,mμ¯i,jμ¯i,kμ¯j,k\displaystyle=\frac{1}{n(n-1)(n-2)}\sum_{i,j,k,m}\overline{\mu}_{i,j}\overline{\mu}_{i,k}\overline{\mu}_{j,k} b7\displaystyle b_{7} =n2n1n2+n1n2n1n(n1)(n2)=Θ(1)\displaystyle=\frac{n_{2}n_{1}n_{2}+n_{1}n_{2}n_{1}}{n(n-1)(n-2)}=\Theta(1)
a8\displaystyle a_{8} =1n(n1)i,jμ¯i,j2=O(a1)\displaystyle=\frac{1}{n(n-1)}\sum_{i,j}\overline{\mu}_{i,j}^{2}=O(a_{1}) b8\displaystyle b_{8} =b1.\displaystyle=b_{1}.

Then, the upper bound has the form

1σ3[\displaystyle\frac{1}{\sigma^{3}}\big{[} n4(a13k3/n3+a1a2O(kS2/n4)+a3O(S3/n4)+a4O(L4/n4))(b13+b1b2+b3+b4)O(1)+\displaystyle n^{4}(\underbrace{a_{1}^{3}}_{k^{3}/n^{3}}+\underbrace{a_{1}a_{2}}_{O(kS_{2}/n^{4})}+\underbrace{a_{3}}_{O(S_{3}/n^{4})}+\underbrace{a_{4}}_{O(L_{4}/n^{4})})\underbrace{(b_{1}^{3}+b_{1}b_{2}+b_{3}+b_{4})}_{O(1)}+
n3(a5O(S2/n3)+a1a8O(k2/n2))(b5+b1b8)O(1)+n2a6O(k/n)b6O(1)],\displaystyle n^{3}(\underbrace{a_{5}}_{O(S_{2}/n^{3})}+\underbrace{a_{1}a_{8}}_{O(k^{2}/n^{2})})\underbrace{(b_{5}+b_{1}b_{8})}_{O(1)}+n^{2}\underbrace{a_{6}}_{O(k/n)}\underbrace{b_{6}}_{O(1)}\big{]},

which can be simplified to

O(1σ3[nk3+kS2+S3+L4+S2+nk2+k/n])=O(1σ3(nk3+kS2+S3+L4)),O(\frac{1}{\sigma^{3}}\big{[}nk^{3}+kS_{2}+S_{3}+L_{4}+S_{2}+nk^{2}+k/n\big{]})=O\big{(}\frac{1}{\sigma^{3}}(nk^{3}+kS_{2}+S_{3}+L_{4})\big{)},

which is what is claimed in the theorem.

Appendix B Experiments

B.1 MMD

(a) $\mu=0.5,\sigma=1,n_{1}=128$.
(b) $\mu=0,\sigma=3,n_{1}=128$.
(c) $\mu=0,\sigma=3,n_{1}=256$.
Figure 5: The different MMD tests on the three setups in Figure 2. The legend is consistent across the panels.

B.2 Architecture

We have used the same architecture as in [10, 12], which, using the modules from PyTorch, can be written as follows.

import torch.nn as nn

# noise_dim: dimension of the base measure Q0 (10 in our experiments);
# ambient_dim: dimension of the data (784 for MNIST, 2 for the two moons data).
noise_dim, ambient_dim = 10, 784

generator = nn.Sequential(
    nn.Linear(noise_dim, 64),
    nn.ReLU(),
    nn.Linear(64, 256),
    nn.ReLU(),
    nn.Linear(256, 256),
    nn.ReLU(),
    nn.Linear(256, 1024),
    nn.ReLU(),
    nn.Linear(1024, ambient_dim))

For MNIST we have also added a terminal nn.Tanh layer.
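A batch from the implicit model can then be drawn by pushing base-measure noise through this network, e.g. (a small sketch using the generator defined above):

import torch

z = torch.randn(256, noise_dim)       # a mini-batch from the base measure Q0
x_fake = generator(z)                 # 256 samples from the implicit model Q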

B.3 Data

We have used the MNIST data as packaged by torchvision, with the additional processing of scaling the output to $[-1,1]$ as we are using a final Tanh layer. For the two moons data, we have used a noise level of 0.05.

B.4 Optimization

All details are provided in the table below. In some cases we have optimized with a larger step for a number of epochs, and then reduced it for the remaining epochs — in the table below these are separated by commas.

Model         Step size               Batch size   Epochs
Figure 3(b)   $10^{-4}$               256          500
Figure 3(c)   $10^{-4}$               256          500
Figure 3(d)   $10^{-4}$               256          500
Figure 4(a)   $10^{-3}$, $10^{-4}$    256          500, 500
Figure 4(b)   $10^{-3}$, $10^{-4}$    512          500, 500
Figure 4(c)   $10^{-3}$, $10^{-4}$    128          100, 100
Figure 4(d)   $10^{-4}$, $10^{-4}$    128          100, 100