
How Does GAN-based
Semi-supervised Learning Work?

Xuejiao Liu, Qian Xuesen Laboratory of Space Technology, China Academy of Space Technology
Xueshuang Xiang (corresponding author: xiangxueshuang@qxslab.cn), Qian Xuesen Laboratory of Space Technology, China Academy of Space Technology
Abstract

Generative adversarial networks (GANs) have been widely used and have achieved competitive results in semi-supervised learning. This paper theoretically analyzes how GAN-based semi-supervised learning (GAN-SSL) works. We first prove that, given a fixed generator, optimizing the discriminator of GAN-SSL is equivalent to optimizing that of supervised learning. Thus, the optimal discriminator in GAN-SSL is expected to be perfect on labeled data. Then, if the perfect discriminator can further cause the optimization objective to reach its theoretical maximum, the optimal generator will match the true data distribution. Since it is impossible to reach the theoretical maximum in practice, one cannot expect to obtain a perfect generator for generating data, which is apparently different from the objective of GANs. Furthermore, if the labeled data can traverse all connected subdomains of the data manifold, which is reasonable in semi-supervised classification, we additionally expect the optimal discriminator in GAN-SSL to also be perfect on unlabeled data. In conclusion, the minimax optimization in GAN-SSL will theoretically output a perfect discriminator on both labeled and unlabeled data by unexpectedly learning an imperfect generator, i.e., GAN-SSL can effectively improve the generalization ability of the discriminator by leveraging unlabeled information.

1 Introduction

Deep neural networks have repeatedly achieved results comparable to or beyond those of humans on certain supervised classification tasks based on large numbers of labeled samples. In the real world, due to the high cost of labeling and the lack of expert knowledge, the dataset we obtain usually contains a large number of unlabeled samples and only a small number of labeled samples. Although unlabeled samples carry no label information, they originate from the same data source as the labeled data. Semi-supervised learning (SSL) strives to exploit model assumptions about the distribution of unlabeled data, thereby greatly reducing a task's need for labeled data.

Early semi-supervised learning methods include self-training, transductive learning, generative models and other approaches. With the rapid development of deep learning in recent years, semi-supervised learning has gradually been combined with neural networks, and corresponding methods have emerged [17, 11, 10, 3, 19]. At the same time, the deep generative model has become a powerful framework for modeling complex high-dimensional datasets [9, 7, 18]. As one of the earlier works, [8] uses a deep generative model in SSL by maximizing the variational lower bound of the unlabeled data log-likelihood and treats the classification label as an additional latent variable in the directed generative model. In the same period, the well-known generative adversarial networks (GANs) [7] were proposed as a new framework for estimating generative models via an adversarial process. The true distribution of unsupervised data can be learned by the generator of GANs. Recently, GANs have been widely used and have obtained competitive results for semi-supervised learning [15, 17, 4, 12, 5]. The early classic GAN-based semi-supervised learning (GAN-SSL) [15] presented a variety of new architectural features and training procedures, such as feature matching and minibatch discrimination, to encourage the convergence of GANs. By labeling the generated samples as a new "generated" class, these techniques were introduced into semi-supervised tasks and achieved the best results at that time.

Despite the great empirical success achieved using GANs to improve semi-supervised classification performance, many theoretical issues remain unresolved. For example, [15] observed that, compared with minibatch discrimination, semi-supervised learning with feature matching performs better but generates images of relatively poorer quality. A basic theoretical issue underlying the above phenomenon is as follows: how does GAN-based semi-supervised learning work? The theoretical work of [4] claimed that good semi-supervised learning requires a "bad" generator, which is necessary to overcome the generalization difficulty in the low-density areas of the data manifold. Thus, they further proposed a "complement" generator to enhance the effect of the "bad" generator. Although the complement generator is empirically helpful for improving the generalization performance, we find that their theoretical analysis was not rigorous; hence, their judgment regarding the requirement of a "bad" generator is somewhat unreasonable. This paper re-examines this theoretical issue and obtains some theoretical observations inconsistent with those in [4].

First, we study the relationship between the optimal discriminator in GAN-SSL and the one in the corresponding supervised learning. For a fixed generator, we prove that maximizing the GAN-SSL objective is indeed equivalent to maximizing the supervised learning objective. That is, the optimal discriminator in GAN-SSL is expected to be perfect in the corresponding supervised learning, i.e., the optimal discriminator can make a correct decision on all labeled data. Thus, we can initially state that GAN-SSL can at least obtain a good performance on labeled data.

Second, for the optimal discriminator, we investigate the behavior of the optimal generator of GAN-SSL. We find that the error of the optimal generator distribution relative to the true data distribution highly depends on the optimal discriminator. Given an optimal discriminator, the minimax game in GAN-SSL for the generator is equivalent to minimizing $-KL(p||p-\epsilon p_{G})+2JS(p||p_{G})$, where $p$ is the true data distribution, $p_{G}$ is the generator distribution and $\epsilon$ represents the error between the output of the perfect discriminator and its theoretical maximum. Once the optimal discriminator can not only make a correct decision on all labeled data but also cause the objective of the supervised learning part in GAN-SSL to reach its theoretical maximum (i.e., $\epsilon=0$), minimizing the generator objective is equivalent to minimizing $JS(p||p_{G})$; hence, the optimal generator is perfect, i.e., the optimal generator distribution is indeed the true data distribution. However, since the theoretical maximum is impossible to reach in practice, the optimal generator will always be inconsistent with the true data distribution, i.e., the optimal generator is imperfect. Notably, although our observation is that the optimal generator is imperfect in practice, it is not as "bad" as claimed in [4]. Here, we can state that GAN-SSL will always output an imperfect generator.

Furthermore, if the labeled data can traverse all connected subdomains of the data manifold, which is a reasonable assumption in semi-supervised classification, we will additionally have that the optimal discriminator of GAN-SSL can be perfect on all unlabeled data, i.e., it can make a correct decision on all unlabeled data. Thus, we can state that GAN-SSL can also obtain a good performance on unlabeled data.

Overall, with our theoretical analysis, we can answer the above question: the optimal discriminator in GAN-SSL can be expected to be perfect on both labeled and unlabeled data by learning an imperfect generator. This theoretical result means that GAN-SSL can effectively improve the generalization ability of the discriminator by leveraging unlabeled information.

2 Related Work

Based on the classic GAN-SSL [15], the design of additional neural networks and the corresponding objectives have received much interest. The work of [13] presents Triple-GAN, which consists of three players, i.e., a generator, a discriminator and a classifier, to address the problem in which the generator and discriminator (i.e., the classifier) in classic GAN-SSL may not be optimal at the same time. Later, a more complex architecture consisting of two generators and two discriminators was developed [6], which performs better than Triple-GAN under certain conditions. To address the issue caused by incorrect pseudo labels, based on the margin theory of classifiers, MarginGAN [5], which also adopts a three-player architecture, was proposed.

As mentioned in GAN-SSL [15], feature matching works very well for GAN-SSL, but the interaction between the discriminator $D$ and the generator $G$ is not yet understood, especially from a theoretical perspective. To the best of our knowledge, few theoretical investigations on this topic exist, with one exception being the work of [4], which focused on the analysis of the optimal $G$ and $D$. Our paper is also developed from this aspect. In addition, some theoretical works in the area of GANs, regarding the design of the objective [16, 14] or the convergence of the adversarial process [1, 2], can be applied to the area of GAN-SSL. Since GAN-SSL has an additional supervised learning objective, one should further develop theoretical techniques to handle this problem. These open aspects should be studied in the future.

3 Theoretical Analysis

We first introduce some classic notations and definitions. We refer to [15] for a detailed discussion. Consider a standard classifier for classifying a data point $x$ into one of $K$ possible classes. The output of the classifier is a $K$-dimensional vector of logits that can be turned into class probabilities by applying softmax. The classic GAN-SSL [15] is implemented by labeling samples from the GAN generator $G$ with a new "generated" class $K+1$. We use $P_{D}(y\leq K|x)$ and $P_{D}(K+1|x)$ to denote the probability that $x$ is true or fake, respectively. The model $D$ is both a discriminator and a classifier, correspondingly increasing the dimension of the output from $K$ to $K+1$. Thus, the discriminator $D$ is defined as $P_{D}(k|x)=\frac{\exp(\omega_{k}^{T}f(x))}{\sum_{i=1}^{K+1}\exp(\omega_{i}^{T}f(x))}$, where $f(x)$ is a nonlinear vector-valued function, $\omega_{k}$ is the weight vector for class $k$ and $\{\omega_{1}^{T}f(x),\ldots,\omega_{K+1}^{T}f(x)\}$ is the $(K+1)$-dimensional vector of logits. Since a discriminator with $K+1$ outputs is over-parameterized, $\omega_{K+1}$ is fixed as a zero vector. We also denote a discriminator by $D:=(\omega,f)$. As in traditional GANs, in GAN-SSL, $D$ and $G$ play the following two-player minimax game with the value function $J_{GD}$:

$$\min_{G}\max_{D}J_{GD}=\min_{G}\max_{D}(L_{D}+U_{GD})\qquad(1)$$

where

$$L_{D}=\mathbb{E}_{(x,y)\sim p(x,y)}\log P_{D}(y|x,y\leq K)$$
$$U_{GD}=\mathbb{E}_{x\sim p(x)}\log P_{D}(y\leq K|x)+\mathbb{E}_{x\sim p_{G}(x)}\log P_{D}(K+1|x)$$

where $L_{D}$ is the supervised learning objective on the labeled data and $U_{GD}$ is the unsupervised objective, i.e., the traditional GAN objective, in which $p$ is the true data distribution and $p_{G}$ is the generator distribution. When $G$ is fixed, the objectives $J_{GD}$ and $U_{GD}$ become $J_{D}$ and $U_{D}$, respectively. After we obtain a satisfactory $D$ by optimizing $J_{GD}$, we use $\mathrm{argmax}\,P_{D}(k|x,k\leq K)$ to determine the class of the input data $x$. Similar to other related works on GAN-SSL, we use $L_{D}$ as the supervised learning objective by applying the softmax operator only to the first $K$ dimensions of the output of $D$.
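To make the parameterization concrete, the following is a minimal PyTorch-style sketch of the losses implied by objective (1), assuming the $K$-logit output described above with $\omega_{K+1}=0$; it is not the original implementation of [15], which additionally trains the generator by feature matching, and the function names and batch variables are hypothetical.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of the losses implied by objective (1), assuming the
# discriminator outputs K logits w_k^T f(x) and the (K+1)-th "generated" logit
# is fixed to 0 (omega_{K+1} = 0). Batch tensors and names are illustrative.

def discriminator_loss(logits_lab, labels, logits_unl, logits_fake):
    # L_D: mean log P_D(y | x, y <= K) over labeled data (K-class cross-entropy).
    L_D = -F.cross_entropy(logits_lab, labels)

    # With the implicit zero logit, log P_D(y <= K | x) = lse - softplus(lse)
    # and log P_D(K+1 | x) = -softplus(lse), where lse = logsumexp over K logits.
    lse_unl = torch.logsumexp(logits_unl, dim=1)
    lse_fake = torch.logsumexp(logits_fake, dim=1)
    U_D = (lse_unl - F.softplus(lse_unl)).mean() - F.softplus(lse_fake).mean()

    # The discriminator maximizes J_D = L_D + U_D, so minimize the negative.
    return -(L_D + U_D)

def generator_loss(logits_fake):
    # In the minimax game (1), G minimizes E_{x~p_G} log P_D(K+1 | x);
    # in practice, [15] replaces this direct loss with feature matching.
    return -F.softplus(torch.logsumexp(logits_fake, dim=1)).mean()
```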

Similar to the theoretical analysis of GANs, we consider a nonparametric setting, e.g., we represent a model with infinite capacity and study the optimal discriminator $D$ and generator $G$ in the space of probability density functions. The theoretical proofs, including Proposition 1 - Proposition 3, are provided in the supplementary material. We show the proof of Lemma 1, since it is the key point of our paper.

3.1 Optimal discriminator on labeled data

First, we consider the optimal discriminator $D$ for any given generator $G$. Motivated by the proof of Proposition 1 in [4], this subsection proves that for fixed $G$, if the discriminator has infinite capacity, then regardless of whether the generator is perfect, i.e., $p_{G}=p$ or $p_{G}\neq p$, maximizing the above GAN-SSL objective $J_{D}$ is equivalent to maximizing the supervised learning objective $L_{D}$. We first introduce a basic Lemma 1 and give its proof.

Lemma 1.

For any given generator $G$, if the discriminator has infinite capacity, then for any solution $D=(\omega,f)$ of the GAN-SSL objective $J_{D}$, there exists another solution $D^{*}=(\omega^{*},f^{*})$ such that $L_{D^{*}}=L_{D}$ and $U_{D^{*}}\geq U_{D}$.

Proof.

For any given generator $G$ and any solution $D=(\omega,f)$ of the GAN-SSL objective $J_{D}$, because the discriminator has infinite capacity, there exists $D^{*}=(\omega^{*},f^{*})$ such that for all $x$ and $k\leq K$,

$$\exp(\omega_{k}^{*T}f^{*}(x))=\frac{\exp(\omega_{k}^{T}f(x))}{\sum_{i=1}^{K}\exp(\omega_{i}^{T}f(x))}\cdot\frac{p(x)}{p_{G}(x)}$$

For all $x$,

$$P_{D^{*}}(y|x,y\leq K)=\frac{\exp(\omega_{y}^{*T}f^{*}(x))}{\sum_{i=1}^{K}\exp(\omega_{i}^{*T}f^{*}(x))}=\frac{\exp(\omega_{y}^{T}f(x))}{\sum_{i=1}^{K}\exp(\omega_{i}^{T}f(x))}=P_{D}(y|x,y\leq K)$$

Then $L_{D^{*}}=L_{D}$, and

$$P_{D^{*}}(K+1|x)=\frac{\exp(\omega_{K+1}^{*T}f^{*}(x))}{\sum_{i=1}^{K+1}\exp(\omega_{i}^{*T}f^{*}(x))}=\frac{1}{1+\sum_{i=1}^{K}\exp(\omega_{i}^{*T}f^{*}(x))}=\frac{p_{G}(x)}{p(x)+p_{G}(x)}$$

For the following unsupervised objective function

$$U_{D}=\mathbb{E}_{x\sim p(x)}\log(1-P_{D}(K+1|x))+\mathbb{E}_{x\sim p_{G}(x)}\log P_{D}(K+1|x)$$
$$=\int_{x}\left[p(x)\log(1-P_{D}(K+1|x))+p_{G}(x)\log P_{D}(K+1|x)\right]dx$$

when $G$ is fixed, for each $x$ the integrand has the form $a\log(1-t)+b\log t$ with $a=p(x)$, $b=p_{G}(x)$ and $t=P_{D}(K+1|x)$, which attains its maximum at $t=\frac{b}{a+b}=\frac{p_{G}(x)}{p(x)+p_{G}(x)}$, i.e., exactly at $P_{D^{*}}(K+1|x)$; hence $D^{*}$ maximizes it. Therefore, $L_{D^{*}}=L_{D}$ and $U_{D^{*}}\geq U_{D}$. This completes the proof. ∎
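As an illustrative numerical sanity check, not part of the original proof, the construction of $D^{*}$ above can be verified for arbitrary logits; the values of $K$, the logits and the densities $p(x)$, $p_{G}(x)$ below are hypothetical.

```python
import numpy as np

# Check that the D* constructed in Lemma 1 keeps the class-conditional
# probabilities P(y | x, y <= K) unchanged and yields P(K+1 | x) = p_G / (p + p_G).
rng = np.random.default_rng(0)
K = 5
logits = rng.normal(size=K)       # original w_k^T f(x) for k <= K (class K+1 logit is 0)
p, p_g = 0.7, 0.3                 # hypothetical values of p(x) and p_G(x)

def probs(lg):
    # softmax over the K+1 logits, the last logit being fixed to 0
    full = np.concatenate([lg, [0.0]])
    e = np.exp(full - full.max())
    return e / e.sum()

# D*: exp(new logit_k) = softmax_K(logits)_k * p / p_G, as in the proof above
softmax_K = np.exp(logits) / np.exp(logits).sum()
new_logits = np.log(softmax_K * p / p_g)

old, new = probs(logits), probs(new_logits)
assert np.allclose(old[:K] / old[:K].sum(), new[:K] / new[:K].sum())   # L_{D*} = L_D
assert np.isclose(new[K], p_g / (p + p_g))                             # P_{D*}(K+1|x)
```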

Lemma 1 states that for any given generator and due to the infinite capacity of the discriminator, under the condition that the supervised objective remains unchanged, we can always increase the unsupervised objective until the extreme value is reached. Then, based on Lemma 1, we can present the theoretical results of the optimal discriminator $D$ for any given generator $G$.

Proposition 1.

Given the conditions in Lemma 1, we can obtain the following:
(1) for any optimal solution $D_{L}=(\omega,f)$ of the supervised learning objective $L_{D}$, there exists $D^{*}=(\omega^{*},f^{*})$ such that $D^{*}$ maximizes the GAN-SSL objective $J_{D}$ and that, for all $x$,

$$P_{D^{*}}(y|x,y\leq K)=P_{D_{L}}(y|x,y\leq K);$$

(2) for any optimal solution $D^{*}=(\omega^{*},f^{*})$ of the above GAN-SSL objective $J_{D}$, $D^{*}$ is an optimal solution of the supervised objective $L_{D}$;
(3) the optimal discriminator $D^{*}$ of the above GAN-SSL objective satisfies

$$P_{D^{*}}(y\leq K|x)=\frac{p(x)}{p(x)+p_{G}(x)},\qquad P_{D^{*}}(K+1|x)=\frac{p_{G}(x)}{p(x)+p_{G}(x)}.$$

Proposition 1 (1) and (2) jointly indicate that given a fixed generator, optimizing the discriminator of GAN-SSL is equivalent to optimizing that of supervised learning. Thus, the optimal discriminator in GAN-SSL is expected to be perfect on labeled data. We should note that Proposition 1 of [4] gives a similar theoretical result, claiming only that the optimal discriminator of $L_{D}$ can lead to an optimal discriminator of GAN-SSL, i.e., the first claim of the above Proposition 1, but under the condition that $p_{G}=p$. Our theoretical investigation instead shows that the condition $p_{G}=p$ is unnecessary and that any optimal discriminator of GAN-SSL is an optimal discriminator of supervised learning.

By carefully checking the statements in Proposition 1 of [4] and comparing their results with ours, one may realize that their results are not rigorous. They claimed that good semi-supervised learning requires a bad generator because they observed that the semi-supervised objective shares the same generalization error with the supervised objective given a perfect generator. First, from our theoretical results, a perfect generator is not necessary for this observation, so the requirement of a bad generator in their claim is unreasonable. Moreover, the fact that the semi-supervised objective shares the same generalization error with the supervised objective is insufficient for claiming that GAN-SSL will reduce the generalization ability of supervised learning. In addition, if we can claim only that the optimal discriminator of $L_{D}$ leads to an optimal discriminator of GAN-SSL, one cannot expect to obtain a perfect discriminator on labeled data by GAN-SSL, such that Assumption 1 (1) of [4] is over-assumed. Our result, i.e., Proposition 1 (2), overcomes this shortcoming, such that our later Assumption 1 (1) is indeed reasonable.

3.2 Optimal generator

Next, for a fixed optimal discriminator, we discuss the behavior of the optimal generator of GAN-SSL. By the definition of the discriminator $D$, we have $P_{D}(K+1|x)=1/(1+\sum_{i=1}^{K}\exp(\omega_{i}^{T}f(x)))$. Based on Proposition 1 (3), $P_{D}(K+1|x)=p_{G}(x)/(p(x)+p_{G}(x))$. Namely,

$$\frac{1}{1+\sum_{i=1}^{K}\exp(\omega_{i}^{T}f(x))}=\frac{p_{G}(x)}{p(x)+p_{G}(x)}$$

implying that $\sum_{i=1}^{K}\exp(\omega_{i}^{T}f(x))=p(x)/p_{G}(x)$. Now, the objective of $G$ becomes $C(G)=L_{DG}+U_{G}$, with

$$L_{DG}=\mathbb{E}_{(x,y)\sim p(x,y)}\log P_{D}(y|x,y\leq K)=\mathbb{E}_{(x,y)\sim p(x,y)}\log\left(\exp(\omega_{y}^{T}f(x))\frac{p_{G}(x)}{p(x)}\right)$$
$$U_{G}=\mathbb{E}_{x\sim p}\log P_{D}(y\leq K|x)+\mathbb{E}_{x\sim p_{G}}\log P_{D}(K+1|x)=\mathbb{E}_{x\sim p}\log\frac{p(x)}{p(x)+p_{G}(x)}+\mathbb{E}_{x\sim p_{G}}\log\frac{p_{G}(x)}{p(x)+p_{G}(x)}$$

The aim of the discriminator is to classify the labeled samples into the correct class, the unlabeled samples into the "true" class (any of the first $K$ classes) and the generated samples into the "fake" class, i.e., the $(K+1)$-th class. In this paper, the so-called perfect discriminator means that for any $(x,y)\in p(x,y)$, $P_{D}(y|x,y\leq K)>P_{D}(k|x,k\leq K,k\neq y)$, i.e., the discriminator can make a correct decision on labeled data. We say that the discriminator reaches its theoretical maximum if $P_{D}(y|x,y\leq K)=1$ and, for any other class $k\neq y$, $P_{D}(k|x,k\leq K)=0$. Obviously, it is impossible to reach the theoretical maximum in practice, i.e., $P_{D}(y|x,y\leq K)$ can only be close to $1$, and for any other class $k\neq y$, $\exp(\omega_{k}^{T}f(x))\rightarrow 0$. That is, because the discriminator has infinite capacity, we can expect that there exists $0<\epsilon\ll 1$ such that $(K-1)\exp(\omega_{k}^{T}f(x))\leq\epsilon$ for any other class $k\neq y$. Here, $\epsilon$ represents the error between the output of the perfect discriminator and its theoretical maximum. Now, we show the relationship between the true data distribution and the generator distribution in GAN-based semi-supervised tasks through the following Proposition 2.

Proposition 2.

Given the conditions in Proposition 1, for a fixed optimal discriminator $D$, suppose there exists $0<\epsilon\ll 1$ such that the other logit output ($k\neq y$) of $D$ satisfies $(K-1)\exp(\omega_{k}^{T}f(x))\leq\epsilon$. Then, the generator objective $C(G)=-KL(p||p-\epsilon p_{G})+2JS(p||p_{G})-2\log 2$.

Proposition 2 indicates that minimizing the generator objective $C(G)$ is equivalent to simultaneously minimizing $JS(p||p_{G})$ and maximizing $KL(p||p-\epsilon p_{G})$. To minimize $JS(p||p_{G})$, we need to train $p_{G}$ to be close to $p$; however, to maximize $KL(p||p-\epsilon p_{G})$, we need to increase the difference between $p$ and $p-\epsilon p_{G}$. If the optimal solution $D$ of GAN-SSL for the supervised objective can reach the theoretical maximum, i.e., $\epsilon=0$, then minimizing $C(G)$ is equivalent to minimizing $JS(p||p_{G})$. Since the Jensen-Shannon divergence between two distributions is always non-negative and zero iff they are equal, $C^{*}=-2\log 2$ is the global minimum of $C(G)$, and the only solution is $p_{G}=p$, i.e., the generator distribution perfectly replicates the true data distribution, such that the optimal generator is perfect. However, since it is impossible to reach the theoretical maximum, $\epsilon\neq 0$. If we assume that $\epsilon$ is sufficiently small, i.e., $0<\epsilon\ll 1$, the optimal generator $p_{G}$ will not be $p$. Otherwise, if $p_{G}=p$, $JS(p||p_{G})$ reaches its minimum, and $KL(p||p-\epsilon p_{G})=KL(p||p(1-\epsilon))$ becomes very small, which contradicts maximizing $KL(p||p-\epsilon p_{G})$ in the objective $C(G)$.

Different from the theoretical result of [4] that good semi-supervised learning requires a bad generator, Proposition 2 shows that the error of the optimal generator distribution relative to the true data distribution largely depends on the optimal discriminator. First, when the perfect discriminator for the supervised objective can reach its theoretical maximum, the global optimal generator of GAN-SSL is $p_{G}=p$, i.e., the optimal generator $G$ is perfect. However, since the theoretical maximum cannot be reached in practice, one cannot expect to learn a perfect generator, i.e., the optimal generator $G$ is imperfect. In addition, we note that although the optimal generator is imperfect in actual situations, it can be similar to the original data distribution. As shown in the work of [15], samples generated by the excellent GAN-SSL generator are not perfect but are also not completely different from the original data distribution, which is apparently inconsistent with the requirement of a "bad" generator in [4].

Case Study on Synthetic Data. To obtain a more intuitive understanding of Proposition 2, we conduct a case study on a 1D synthetic dataset, where we can easily verify our theoretical analysis by visualizing the model behaviors. The training dataset is sampled from the Gaussian distribution $p=\mathcal{N}(0,0.4^{2})$. Based on Proposition 2, we study the relationship between the original data distribution $p$ and the generator distribution $p_{G}$ obtained by directly training the following generator objective function: $C(G)=-KL(p||p-\epsilon p_{G})+2JS(p||p_{G})$. We use a fully connected generator with ReLU activations and architecture 1-100-100-1. Figure 1 compares $p$ and $p_{G}$ for $\epsilon=0$, $\epsilon=0.1$ and $\epsilon=0.2$, respectively. As theoretically analyzed above, for $\epsilon=0$, minimizing $C(G)$ is equivalent to minimizing $JS(p||p_{G})$, and the optimal solution is $p_{G}=p$ (as shown in Figure 1 (a)). As shown in Figure 1 (b) and (c), when $\epsilon\neq 0$, $p_{G}$ is imperfect, and the greater $\epsilon$ is, the larger the gap between $p_{G}$ and $p$. Although there is a certain gap between the generator distribution and the data distribution, $p_{G}$ is similar to $p$, and most of their supports overlap. Therefore, in this case, different from GANs, the optimal generator cannot be used to generate samples.
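The following is a minimal PyTorch-style sketch of this case study, not the original experimental code: it assumes that $p_{G}$ is approximated by a differentiable Gaussian kernel density estimate over generated samples and that the divergences are evaluated by Riemann sums on a fixed grid; the bandwidth, grid, learning rate and batch size are illustrative choices.

```python
import math
import torch
import torch.nn as nn

# Sketch of directly minimizing C(G) = -KL(p || p - eps*p_G) + 2 JS(p || p_G)
# for p = N(0, 0.4^2) and a 1-100-100-1 ReLU generator (all other details assumed).

def kde(samples, grid, bandwidth=0.05):
    # Differentiable Gaussian kernel density estimate of p_G on the grid.
    diff = grid[None, :] - samples[:, None]
    kernel = torch.exp(-0.5 * (diff / bandwidth) ** 2) / (bandwidth * math.sqrt(2 * math.pi))
    return kernel.mean(dim=0) + 1e-12

def objective(p, p_g, dx, eps):
    # Riemann-sum approximations of the two divergence terms.
    m = 0.5 * (p + p_g)
    js = 0.5 * (p * torch.log(p / m) + p_g * torch.log(p_g / m)).sum() * dx
    kl = (p * torch.log(p / (p - eps * p_g).clamp_min(1e-8))).sum() * dx
    return -kl + 2 * js

generator = nn.Sequential(nn.Linear(1, 100), nn.ReLU(),
                          nn.Linear(100, 100), nn.ReLU(),
                          nn.Linear(100, 1))
opt = torch.optim.Adam(generator.parameters(), lr=1e-3)

grid = torch.linspace(-2.0, 2.0, 400)
dx = grid[1] - grid[0]
p = torch.exp(-0.5 * (grid / 0.4) ** 2) / (0.4 * math.sqrt(2 * math.pi))  # N(0, 0.4^2)

eps = 0.1
for step in range(5000):
    z = torch.randn(512, 1)
    samples = generator(z).squeeze(1)
    loss = objective(p, kde(samples, grid), dx, eps)
    opt.zero_grad()
    loss.backward()
    opt.step()
```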

Figure 1: Comparison of the data distribution $p$ (blue) and the generator distribution $p_{G}$ (red) for $\epsilon=0$, $\epsilon=0.1$ and $\epsilon=0.2$, respectively. Here, $\epsilon$ represents the error between the output of the perfect discriminator and its theoretical maximum.

3.3 Optimal discriminator on unlabeled data

Finally, to study the behavior of the optimal discriminator of GAN-SSL on unlabeled data, we first present the assumption on the convergence conditions of the optimal discriminator and generator and the manifold conditions on the dataset. Let $\Omega_{\mathcal{L}}$ be the set of labeled data. Let $\Omega$ and $\Omega_{\mathcal{G}}$ be the data manifold and the generated data manifold, respectively. Obviously, we have $\Omega_{\mathcal{L}}\subset\Omega$, and the unlabeled data should be sampled from $\Omega\setminus\Omega_{\mathcal{L}}$. Denote $\Omega=\cup_{k=1}^{K}\Omega^{k}$, where $\Omega^{k}$ is the data manifold of class $k$. Denote $\Omega^{k}=\cup_{l}\Omega^{kl}$, where $\Omega^{kl}$ is a connected subdomain.

Assumption 1.

Suppose $D,G$ are optimal solutions of the GAN-SSL objective (1). We assume that:
(1) for any $(x,y)\in\Omega_{\mathcal{L}}$, we have $\omega_{y}^{T}f(x)>\omega_{k}^{T}f(x)$ for any other class $k\neq y$;
(2) for any $x\in\Omega$, $\max_{k=1}^{K}\omega_{k}^{T}f(x)>0$; for any $x\in\Omega_{\mathcal{G}}\setminus\bar{\Omega}$, $\max_{k=1}^{K}\omega_{k}^{T}f(x)<0$;
(3) each $\Omega^{jl}$ contains at least one labeled data point;
(4) for any $x_{k}\in\Omega^{k}$ and $j\neq k$, there exist a connected subdomain $\Omega^{jl}$, a labeled data point $x_{j}\in\Omega^{jl}$, a generated data point $x_{g}\in\Omega_{\mathcal{G}}\setminus\bar{\Omega}$, and $0<\alpha<1$, such that $f(x_{g})=\alpha f(x_{k})+(1-\alpha)f(x_{j})$.

The Reasonableness of Assumption 1 (1). According to the theoretical analysis for the optimal discriminator (Proposition 1), we found that optimizing the discriminator of GAN-SSL is equivalent to optimizing that of supervised learning. Because $D$ is an optimal solution of the GAN-SSL objective, $D$ is an optimal solution of the supervised learning objective. Naturally, Assumption 1 (1) holds, i.e., the discriminator has a correct decision boundary for labeled data given that the discriminator has infinite capacity.

The Reasonableness of Assumption 1 (2). Proposition 2 implies that the optimal generator we obtain in practice is not a perfect generator, such that we assume $\Omega_{\mathcal{G}}\setminus\bar{\Omega}\neq\emptyset$. We ignore the case of $\Omega_{\mathcal{G}}\subset\Omega$, $\Omega\setminus\bar{\Omega}_{\mathcal{G}}\neq\emptyset$, since it is almost impossible to ensure that our optimal generator outputs only true data but with different probabilities in practice, unless we obtain the worst model with mode collapse. Similar to [4], we also make a strong assumption regarding the true-fake correctness of the true data ($\Omega$) and fake data ($\Omega_{\mathcal{G}}\setminus\bar{\Omega}$). In other words, we assume that the sampling of unlabeled data is good enough to achieve the best generalization ability on the true data manifold.

The Reasonableness of Assumption 1 (3). Intuitively, one cannot expect to achieve a good classification performance on a connected subdomain with no label information, since we can optionally set this connected subdomain to any class but with no influence on the objective.

The Reasonableness of Assumption 1 (4). Apparently, Assumption 1 (3) is a sufficient condition for this assumption. In addition, since the optimal generator is imperfect and we need only its existence, one can expect to be able to achieve this condition.

Now, based on Assumption 1, we give the main result of this subsection.

Proposition 3.

Given the conditions in Assumption 1, for all classes $k\leq K$ and all data space points $x_{k}\in\Omega^{k}$, we have $\omega_{k}^{T}f(x_{k})>\omega_{j}^{T}f(x_{k})$ for any $j\neq k$.

We remark that the proof of Proposition 3 is similar to that in [4] but with different conditions; see the supplementary material for more details. Proposition 3 guarantees that, given the convergence conditions of the optimal discriminator and generator, which are induced by Propositions 1 and 2, respectively, if the labeled data can traverse all subdomains of the data manifold, then the discriminator $D$ will be perfect on the data manifold, i.e., it can also learn correct decision boundaries for all unlabeled data. Now, we discuss the difference between our Assumption 1 and Assumption 1 in [4].

1. As discussed above in subsection 3.1, Assumption 1 (1) in [4] is over-assumed, since they cannot expect to obtain a perfect discriminator by optimizing the GAN-SSL objective. However, our Proposition 1 can support our Assumption 1 (1).

2. Their Assumption 1 (2) and Assumption 1 (3) correspond to our Assumption 1 (2). We all assume that the optimal $D$ can result in a perfect decision regarding true-fake correctness. However, since our optimal $G$ is not a bad one, as a result of Proposition 2, we assume only that the data in $\Omega_{\mathcal{G}}\setminus\bar{\Omega}$ are fake.

3. Unlike [4], we additionally give a reasonable assumption on the data manifold, i.e., Assumption 1 (3). By a careful check, one can find that the authors of [4] ignore this assumption because they implicitly assume that, after the embedding $f$, the feature space of each class is a connected domain. Otherwise, the proof of their Proposition 2 is not correct, since one can set a connected subdomain of the feature space $F_{k}$ without label information to any class by a similar contradiction argument.

4. Similar to our Assumption 1 (4), [4] makes a stronger assumption before the definition of their Assumption 1 by setting $G$ as a complement generator. Given a complement generator, our Assumption 1 (4) can clearly be satisfied. Note that under their definition, if the optimal generator $G$ is a complement one, the feature space of the generated data will fill the complement set of the feature space of the true data. In other words, they hope that the "bad" generator can not only generate fake data but also generate all the existing fake data. Apparently, this assumption is too strong and requires the generator to offer a high representation ability, since the true data manifold is always low-dimensional. However, we need only the existence of a generated fake data point such that its feature can be linearly represented by those of two true data points, with at least one labeled data point among them. This existence has a high probability of being satisfied given only a connected subdomain $\Omega^{jl}$ and an imperfect generator. Here, we actually relax the requirement for a complement generator.

The Reasonableness of Assumption 1 (3) on Synthetic Data. We design a binary classification problem with $\Omega=\Omega^{1}\cup\Omega^{2}$, where $\Omega^{1}=\Omega^{11}\cup\Omega^{12}$ and $\Omega^{11}$, $\Omega^{12}$ and $\Omega^{2}$ are three bounded 2D discs with no intersection. The unlabeled data are uniformly sampled from $\Omega$. If Assumption 1 (3) is satisfied, each connected subdomain contains at least one labeled data point, as shown in Figure 2 (a1). Then, as shown in Figure 2 (a4), we can easily train the discriminator to learn the correct decision boundary. However, if Assumption 1 (3) is not satisfied, for the data in class 1, we sample labeled data only in subdomain $\Omega^{11}$, as shown in Figure 2 (b1). In this case, it is difficult for the discriminator to learn a correct decision boundary. As shown in Figure 2 (b4), (b6), and (b8), we show three kinds of results obtained under several training runs. For the unlabeled data in subdomain $\Omega^{12}$, the classification results are uncertain. Hence, a comparison between Figure 2 (a4), (b4), (b6), and (b8) shows that the condition of Assumption 1 (3) is necessary to obtain a perfect discriminator on unlabeled data. Furthermore, as shown in Figure 2 (a3), (b3), (b5) and (b7), a comparison of the generated samples shows that there is indeed a certain gap between the generator distribution and the data distribution, while the optimal generator is not a complement one, as claimed in [4].
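As an illustration, the following is a hypothetical data-generation sketch for this synthetic setup, assuming specific centers and radii for the three disjoint discs (not specified above) and illustrative sample sizes.

```python
import numpy as np

# Three disjoint 2D discs: Omega^11 and Omega^12 form class 1, Omega^2 is class 2.
# Centers, radii and sample counts below are illustrative assumptions.

def sample_disc(center, radius, n, rng):
    # Uniform samples inside a 2D disc (sqrt of the radial CDF gives uniform area density).
    theta = rng.uniform(0.0, 2.0 * np.pi, n)
    r = radius * np.sqrt(rng.uniform(0.0, 1.0, n))
    return np.stack([center[0] + r * np.cos(theta),
                     center[1] + r * np.sin(theta)], axis=1)

rng = np.random.default_rng(0)
discs = {"omega11": ((-2.0, 0.0), 0.8),   # class 1, first connected subdomain
         "omega12": (( 2.0, 0.0), 0.8),   # class 1, second connected subdomain
         "omega2":  (( 0.0, 2.0), 0.8)}   # class 2

# Unlabeled data: uniformly sampled from the whole manifold Omega.
unlabeled = np.concatenate([sample_disc(c, r, 200, rng) for c, r in discs.values()])

# Assumption 1 (3) satisfied: at least one labeled point in every connected subdomain.
labeled_ok = {name: sample_disc(c, r, 5, rng) for name, (c, r) in discs.items()}

# Assumption 1 (3) violated: class-1 labels drawn only from omega11, none from omega12.
labeled_bad = {name: sample_disc(c, r, 5, rng)
               for name, (c, r) in discs.items() if name != "omega12"}
```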

Figure 2: Results of a binary classification problem on synthetic data when Assumption 1 (3) is satisfied (a) or not satisfied (b). Labeled and unlabeled data are denoted by crosses and points, respectively, in (a1) and (b1). Different colors indicate different classes in (a2), (a4), (b2), (b4), (b6), and (b8), where (a2) and (b2) are the ground truths on the test data and (a4), (b4), (b6), and (b8) are our results. Generated and real data are denoted by blue points and gray crosses, respectively, in (a3), (b3), (b5), and (b7).

4 Conclusions

Via a theoretical analysis, this paper answers the question of how GAN-SSL works. In conclusion, semi-supervised learning based on GANs will yield a perfect discriminator on both labeled (Proposition 1) and unlabeled data (Proposition 3) by learning an imperfect generator (Proposition 2), i.e., GAN-SSL can effectively improve the generalization ability in semi-supervised classification. In the future, the theoretical problems of more complex models, such as Triple-GAN and other methods, will be studied. In addition, the existence of Assumption 1 (4) will undergo further theoretical and empirical investigations.

Acknowledgements

This work was supported in part by the Innovation Foundation of Qian Xuesen Laboratory of Space Technology, and in part by Beijing Nova Program of Science and Technology under Grant Z191100001119129.

References

  • [1] Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang. Generalization and equilibrium in generative adversarial nets (GANs). In International Conference on Machine Learning, pages 224–232, 2017.
  • [2] Yu Bai, Tengyu Ma, and Andrej Risteski. Approximability of discriminators implies diversity in GANs. In International Conference on Learning Representations, 2019.
  • [3] David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin A Raffel. MixMatch: A holistic approach to semi-supervised learning. In Neural Information Processing Systems, pages 5049–5059, 2019.
  • [4] Zihang Dai, Zhilin Yang, Fan Yang, William W. Cohen, and Ruslan Salakhutdinov. Good semi-supervised learning that requires a bad GAN. In Neural Information Processing Systems, 2017.
  • [5] Jinhao Dong and Tong Lin. MarginGAN: Adversarial training in semi-supervised learning. In Neural Information Processing Systems, pages 10440–10449, 2019.
  • [6] Zhe Gan, Liqun Chen, Weiyao Wang, Yuchen Pu, Yizhe Zhang, Hao Liu, Chunyuan Li, and Lawrence Carin. Triangle generative adversarial networks. In Neural Information Processing Systems, pages 5247–5256, 2017.
  • [7] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Neural Information Processing Systems, pages 2672–2680, 2014.
  • [8] Diederik P. Kingma, Danilo J. Rezende, Shakir Mohamed, and Max Welling. Semi-supervised learning with deep generative models. In Neural Information Processing Systems, pages 3581–3589, 2014.
  • [9] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In International Conference on Learning Representations, 2014.
  • [10] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, 2017.
  • [11] Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. In International Conference on Learning Representations, 2017.
  • [12] Bruno Lecouat, Chuan Sheng Foo, Houssam Zenati, and Vijay Chandrasekhar. Manifold regularization with GANs for semi-supervised learning. arXiv preprint arXiv:1807.04307, 2018.
  • [13] Chongxuan Li, Taufik Xu, Jun Zhu, and Bo Zhang. Triple generative adversarial nets. In Neural Information Processing Systems, pages 4088–4098, 2017.
  • [14] Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for GANs do actually converge? In International Conference on Machine Learning, 2018.
  • [15] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Neural Information Processing Systems, 2016.
  • [16] Tim Salimans, Han Zhang, Alec Radford, and Dimitris Metaxas. Improving GANs using optimal transport. In International Conference on Learning Representations, 2018.
  • [17] Jost Tobias Springenberg. Unsupervised and semi-supervised learning with categorical generative adversarial networks. In International Conference on Learning Representations, 2016.
  • [18] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alexander Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. SSW, page 125, 2016.
  • [19] Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, and Quoc V. Le. Unsupervised data augmentation for consistency training. arXiv preprint arXiv:1904.12848, 2019.

Appendix

Proof of Proposition 1

Proof.

Proof of Proposition 1 (1). Similar to the proof of Lemma 1, given an optimal solution $D_{L}=(\omega,f)$ of the supervised objective $L_{D}$, since the discriminator has infinite capacity, there exists $D^{*}=(\omega^{*},f^{*})$ such that for all $x$ and $k\leq K$,

$$\exp(\omega_{k}^{*T}f^{*}(x))=\frac{\exp(\omega_{k}^{T}f(x))}{\sum_{i=1}^{K}\exp(\omega_{i}^{T}f(x))}\cdot\frac{p(x)}{p_{G}(x)}\qquad(2)$$

For all $x$,

$$P_{D^{*}}(y|x,y\leq K)=\frac{\exp(\omega_{y}^{*T}f^{*}(x))}{\sum_{i=1}^{K}\exp(\omega_{i}^{*T}f^{*}(x))}=\frac{\exp(\omega_{y}^{T}f(x))}{\sum_{i=1}^{K}\exp(\omega_{i}^{T}f(x))}=P_{D_{L}}(y|x,y\leq K)$$

Then $L_{D^{*}}=L_{D_{L}}$. Based on the definition and Eq. (2), we can obtain

$$P_{D^{*}}(K+1|x)=\frac{1}{1+\sum_{i=1}^{K}\exp(\omega_{i}^{*T}f^{*}(x))}=\frac{p_{G}}{p+p_{G}}.$$

By the proof of Lemma 1, $D^{*}$ is an optimal solution of $U_{D}$. Because $D_{L}$ maximizes $L_{D}$, $D^{*}$ also maximizes $L_{D}$. It follows that $D^{*}$ maximizes $J_{D}$.

Proof of Proposition 1 (2). First, we note that if $D^{*}$ maximizes the GAN-SSL objective $J_{D}$, then $D^{*}$ maximizes $U_{D}$. Otherwise, based on Lemma 1, there exists another solution $D^{\prime}=(\omega^{\prime},f^{\prime})$ such that $L_{D^{\prime}}=L_{D^{*}}$ and $U_{D^{\prime}}>U_{D^{*}}$, i.e., $J_{D^{\prime}}>J_{D^{*}}$, leading to a contradiction. That is to say, for any optimal solution $D^{*}=(\omega^{*},f^{*})$ of the GAN-SSL objective $J_{D}$, $U_{D^{*}}$ reaches its extreme value. Then, $D^{*}$ is also an optimal solution of $L_{D}$. Otherwise, due to the infinite capacity of the discriminator, there exists an optimal solution $D_{1}$ of the supervised objective $L_{D}$. Thus, based on Proposition 1 (1), there exists $D_{1}^{*}=(\omega_{1}^{*},f_{1}^{*})$ such that $L_{D_{1}^{*}}=L_{D_{1}}>L_{D^{*}}$ and $U_{D_{1}^{*}}=U_{D^{*}}$. Therefore, $J_{D_{1}^{*}}>J_{D^{*}}$, leading to a contradiction, i.e., for any optimal solution $D^{*}=(\omega^{*},f^{*})$ of $J_{D}$, $D^{*}$ is an optimal solution of $L_{D}$.

Based on Proposition 1 (1) and (2), we can obtain that maximizing $J_{D}$ is equivalent to maximizing the quantities $L_{D}$ and $U_{D}$ simultaneously. Then, the optimal solution of $J_{D}$ must also be the optimal solution of $U_{D}$. Similar to the theoretical results of [7], for $G$ fixed, the optimal discriminator $D^{*}$ of the GAN-SSL objective is

$$P_{D^{*}}(K+1|x)=\frac{p_{G}(x)}{p(x)+p_{G}(x)},$$

and

$$P_{D^{*}}(y\leq K|x)=1-P_{D^{*}}(K+1|x)=\frac{p(x)}{p(x)+p_{G}(x)}.$$

This completes the proof. ∎

Proof of Proposition 2

Proof.

By the definition of the discriminator $D$, $P_{D}(K+1|x)=\frac{1}{1+\sum_{i=1}^{K}\exp(\omega_{i}^{T}f(x))}$, and based on Proposition 1 (3), $P_{D^{*}}(K+1|x)=\frac{p_{G}(x)}{p(x)+p_{G}(x)}$; then, for the optimal discriminator $D=(\omega,f)$, $\frac{1}{1+\sum_{i=1}^{K}\exp(\omega_{i}^{T}f(x))}=\frac{p_{G}(x)}{p(x)+p_{G}(x)}$, such that

$$\sum_{i=1}^{K}\exp(\omega_{i}^{T}f(x))=\exp(\omega_{y}^{T}f(x))+\sum_{i\neq y}^{K}\exp(\omega_{i}^{T}f(x))=p(x)/p_{G}(x)$$

For a fixed optimal discriminator $D$, suppose there exists $0<\epsilon\ll 1$ such that the other logit output ($k\neq y$) of $D$ satisfies $(K-1)\exp(\omega_{k}^{T}f(x))\leq\epsilon$; then, $\exp(\omega_{y}^{T}f(x))\geq p/p_{G}-\epsilon$. If the minimum is achieved, i.e., $\exp(\omega_{y}^{T}f(x))=p/p_{G}-\epsilon$, then

$$L_{DG}=\mathbb{E}_{(x,y)\sim p(x,y)}\log P_{D}(y|x,y\leq K)=\mathbb{E}_{(x,y)\sim p(x,y)}\log\left(\frac{\exp(\omega_{y}^{T}f(x))}{\sum_{i=1}^{K}\exp(\omega_{i}^{T}f(x))}\right)$$
$$=\mathbb{E}_{x\sim p(x)}\log\left(\left(\frac{p(x)}{p_{G}(x)}-\epsilon\right)\frac{p_{G}(x)}{p(x)}\right)=-\int_{x}p(x)\log\frac{p(x)}{p(x)-\epsilon p_{G}(x)}dx$$

Then,

$$C(G)=L_{DG}+U_{G}=-\int_{x}p(x)\log\frac{p(x)}{p(x)-\epsilon p_{G}(x)}dx+\int_{x}\left[p(x)\log\frac{p(x)}{p(x)+p_{G}(x)}+p_{G}(x)\log\frac{p_{G}(x)}{p(x)+p_{G}(x)}\right]dx$$
$$=-KL(p||p-\epsilon p_{G})+2JS(p||p_{G})-2\log 2$$

This completes the proof. ∎

Proof of Proposition 3

Proof.

First, if $x_{k}$ is a labeled data point, based on Assumption 1 (1), we have $\omega_{k}^{T}f(x_{k})>\omega_{j}^{T}f(x_{k})$ for any $j\neq k$. Then, we consider the case where $x_{k}\in\Omega^{k}$ is an unlabeled data point. Without loss of generality, suppose $j=\arg\max_{j\neq k}\omega_{j}^{T}f(x_{k})$. Now, we prove the claim by contradiction.

Suppose there exists a data space point $x_{k}\in\Omega^{k}$ and a class $j\neq k$ such that

$$\omega_{k}^{T}f(x_{k})-\omega_{j}^{T}f(x_{k})\leq 0\qquad(3)$$

By Assumption 1 (3) and Assumption 1 (4), there exist a connected subdomain $\Omega^{jl}$, a labeled data point $x_{j}\in\Omega^{jl}$, and a generated data point $x_{g}\in\Omega_{\mathcal{G}}\setminus\bar{\Omega}$ such that $f(x_{g})=\alpha f(x_{k})+(1-\alpha)f(x_{j})$ with $0<\alpha<1$. Based on Assumption 1 (2), $\omega_{j}^{T}f(x_{g})<0$. Thus,

$$\omega_{j}^{T}f(x_{g})=\alpha\omega_{j}^{T}f(x_{k})+(1-\alpha)\omega_{j}^{T}f(x_{j})<0$$

By Assumption 1 (1) and Assumption 1 (2), for any $(x_{j},j)\in\Omega_{\mathcal{L}}$, $\omega_{j}^{T}f(x_{j})=\max_{k}\omega_{k}^{T}f(x_{j})>0$. Moreover, by Eq. (3) and Assumption 1 (2), for any $x_{k}\in\Omega^{k}\subset\Omega$, $\omega_{j}^{T}f(x_{k})=\max_{i=1}^{K}\omega_{i}^{T}f(x_{k})>0$. Then, $\alpha\omega_{j}^{T}f(x_{k})+(1-\alpha)\omega_{j}^{T}f(x_{j})>0$, leading to a contradiction. In summary, for all data space points $x_{k}\in\Omega^{k}$, we have $\omega_{k}^{T}f(x_{k})>\omega_{j}^{T}f(x_{k})$ for any $j\neq k$. ∎