
How Does GAN-based
Semi-supervised Learning Work?

Xuejiao Liu, Qian Xuesen Laboratory of Space Technology, China Academy of Space Technology
Xueshuang Xiang (corresponding author: xiangxueshuang@qxslab.cn), Qian Xuesen Laboratory of Space Technology, China Academy of Space Technology
Abstract

Generative adversarial networks (GANs) have been widely used and have achieved competitive results in semi-supervised learning. This paper theoretically analyzes how GAN-based semi-supervised learning (GAN-SSL) works. We first prove that, given a fixed generator, optimizing the discriminator of GAN-SSL is equivalent to optimizing that of supervised learning. Thus, the optimal discriminator in GAN-SSL is expected to be perfect on labeled data. Then, if the perfect discriminator can further cause the optimization objective to reach its theoretical maximum, the optimal generator will match the true data distribution. Since it is impossible to reach the theoretical maximum in practice, one cannot expect to obtain a perfect generator for generating data, which is apparently different from the objective of GANs. Furthermore, if the labeled data can traverse all connected subdomains of the data manifold, which is reasonable in semi-supervised classification, we additionally expect the optimal discriminator in GAN-SSL to also be perfect on unlabeled data. In conclusion, the minimax optimization in GAN-SSL will theoretically output a perfect discriminator on both labeled and unlabeled data by unexpectedly learning an imperfect generator, i.e., GAN-SSL can effectively improve the generalization ability of the discriminator by leveraging unlabeled information.

1 Introduction

Deep neural networks have repeatedly achieved results comparable to or beyond those of humans on certain supervised classification tasks based on large numbers of labeled samples. In the real world, due to the high cost of labeling and the lack of expert knowledge, the dataset we obtain usually contains a large number of unlabeled samples and only a small number of labeled samples. Although unlabeled samples carry no label information, they originate from the same data source as the labeled data. Semi-supervised learning (SSL) strives to exploit model assumptions about the distribution of unlabeled data, thereby greatly reducing a task's need for labeled data.

Early semi-supervised learning methods include self-training, transductive learning, generative models and other approaches. With the rapid development of deep learning in recent years, semi-supervised learning has gradually been combined with neural networks, and corresponding methods have emerged [17, 11, 10, 3, 19]. At the same time, the deep generative model has become a powerful framework for modeling complex high-dimensional datasets [9, 7, 18]. As one of the earlier works, [8] uses a deep generative model in SSL by maximizing the variational lower bound of the unlabeled data log-likelihood and treats the classification label as an additional latent variable in the directed generative model. In the same period, the well-known generative adversarial networks (GANs) [7] were proposed as a new framework for estimating generative models via an adversarial process. The true distribution of unsupervised data can be learned by the generator of GANs. Recently, GANs have been widely used and have obtained competitive results for semi-supervised learning [15, 17, 4, 12, 5]. The early classic GAN-based semi-supervised learning (GAN-SSL) [15] presented a variety of new architectural features and training procedures, such as feature matching and minibatch discrimination, to encourage the convergence of GANs. By labeling the generated samples as a new "generated" class, these techniques were introduced into semi-supervised tasks and achieved the best results at that time.

Despite the great empirical success achieved using GANs to improve semi-supervised classification performance, many theoretical issues remain unresolved. For example, [15] observed that, compared with minibatch discrimination, semi-supervised learning with feature matching performs better but generates images of relatively poorer quality. A basic theoretical issue underlying the above phenomenon is as follows: how does GAN-based semi-supervised learning work? The theoretical work of [4] claimed that good semi-supervised learning requires a "bad" generator, which is necessary to overcome the generalization difficulty in the low-density areas of the data manifold. Thus, they further proposed a "complement" generator to enhance the effect of the "bad" generator. Although the complement generator is empirically helpful for improving the generalization performance, we find that their theoretical analysis was not rigorous; hence, their judgment regarding the requirement of a "bad" generator is somewhat unreasonable. This paper re-examines this theoretical issue and obtains some theoretical observations inconsistent with those in [4].

First, we study the relationship between the optimal discriminator in GAN-SSL and the one in the corresponding supervised learning. For a fixed generator, we prove that maximizing the GAN-SSL objective is indeed equivalent to maximizing the supervised learning objective. That is, the optimal discriminator in GAN-SSL is expected to be perfect in the corresponding supervised learning, i.e., the optimal discriminator can make a correct decision on all labeled data. Thus, we can initially state that GAN-SSL can at least obtain a good performance on labeled data.

Second, for the optimal discriminator, we investigate the behavior of the optimal generator of GAN-SSL. We find that the error of the optimal generator distribution relative to the true data distribution highly depends on the optimal discriminator. Given an optimal discriminator, the minimax game in GAN-SSL for the generator is equivalent to minimizing $-KL(p||p-\epsilon p_{G})+2JS(p||p_{G})$, where $p$ is the true data distribution, $p_{G}$ is the generator distribution and $\epsilon$ represents the error between the output of the perfect discriminator and its theoretical maximum. Once the optimal discriminator can not only make a correct decision on all labeled data but also cause the objective of the supervised learning part in GAN-SSL to reach its theoretical maximum (i.e., $\epsilon=0$), minimizing the generator objective is equivalent to minimizing $JS(p||p_{G})$; hence, the optimal generator is perfect, i.e., the optimal generator distribution is indeed the true data distribution. However, since the theoretical maximum is impossible to reach in practice, the optimal generator will always be inconsistent with the true data distribution, i.e., the optimal generator is imperfect. Notably, although our observation is that the optimal generator is imperfect in practice, it is not as "bad" as claimed in [4]. Here, we can state that GAN-SSL will always output an imperfect generator.

Furthermore, if the labeled data can traverse all connected subdomains of the data manifold, which is a reasonable assumption in semi-supervised classification, we will additionally have that the optimal discriminator of GAN-SSL can be perfect on all unlabeled data, i.e., it can make a correct decision on all unlabeled data. Thus, we can state that GAN-SSL can also obtain a good performance on unlabeled data.

Overall, with our theoretical analysis, we can answer the above question: the optimal discriminator in GAN-SSL can be expected to be perfect on both labeled and unlabeled data by learning an imperfect generator. This theoretical result means that GAN-SSL can effectively improve the generalization ability of the discriminator by leveraging unlabeled information.

2 Related Work

Based on the classic GAN-SSL [15], the design of additional neural networks and the corresponding objectives have received much interest. The work of [13] presents Triple-GAN, which consists of three players, i.e., a generator, a discriminator and a classifier, to address the problem in which the generator and discriminator (i.e., the classifier) in classic GAN-SSL may not be optimal at the same time. Later, a more complex architecture consisting of two generators and two discriminators was developed [6], which performs better than Triple-GAN under certain conditions. To address the issue caused by incorrect pseudo labels, based on the margin theory of classifiers, MarginGAN [5], which also adopts a three-player architecture, was proposed.

As mentioned in GAN-SSL [15], feature matching works very well for GAN-SSL, but the interaction between the discriminator $D$ and the generator $G$ is not yet understood, especially from a theoretical perspective. To the best of our knowledge, few theoretical investigations on this topic exist, with one exception being the work of [4], which focused on the analysis of the optimal $G$ and $D$. Our paper is also developed from this aspect. In addition, some theoretical works in the area of GANs, regarding the design of the objective [16, 14] or the convergence of the adversarial process [1, 2], can be applied to the area of GAN-SSL. Since GAN-SSL has an additional supervised learning objective, one should further develop theoretical techniques to handle this problem. These open aspects should be studied in the future.

3 Theoretical Analysis

We first introduce some classic notations and definitions. We refer to [15] for a detailed discussion. Consider a standard classifier for classifying a data point $x$ into one of $K$ possible classes. The output of the classifier is a $K$-dimensional vector of logits that can be turned into class probabilities by applying softmax. The classic GAN-SSL [15] is implemented by labeling samples from the GAN generator $G$ with a new "generated" class $K+1$. We use $P_{D}(y\leq K|x)$ and $P_{D}(K+1|x)$ to denote the probability that $x$ is true or fake, respectively. The model $D$ is both a discriminator and a classifier, correspondingly increasing the dimension of the output from $K$ to $K+1$. Thus, the discriminator $D$ is defined as $P_{D}(k|x)=\frac{\exp(\omega_{k}^{T}f(x))}{\sum_{i=1}^{K+1}\exp(\omega_{i}^{T}f(x))}$, where $f(x)$ is a nonlinear vector-valued function, $\omega_{k}$ is the weight vector for class $k$ and $\{\omega_{1}^{T}f(x),\ldots,\omega_{K+1}^{T}f(x)\}$ is the $(K+1)$-dimensional vector of logits. Since a discriminator with $K+1$ outputs is over-parameterized, $\omega_{K+1}$ is fixed as a zero vector. We also denote a discriminator by $D:=(\omega,f)$. As in traditional GANs, in GAN-SSL, $D$ and $G$ play the following two-player minimax game with the value function $J_{GD}$:

$$\min_{G}\max_{D}J_{GD}=\min_{G}\max_{D}(L_{D}+U_{GD})\qquad(1)$$

where

$$L_{D}=\mathbb{E}_{(x,y)\sim p(x,y)}\log P_{D}(y|x,y\leq K)$$
$$U_{GD}=\mathbb{E}_{x\sim p(x)}\log P_{D}(y\leq K|x)+\mathbb{E}_{x\sim p_{G}(x)}\log P_{D}(K+1|x)$$

where $L_{D}$ is the supervised learning objective on the labeled data and $U_{GD}$ is the unsupervised objective, i.e., the traditional GAN objective, in which $p$ is the true data distribution and $p_{G}$ is the generator distribution. When $G$ is fixed, the objectives $J_{GD}$ and $U_{GD}$ become $J_{D}$ and $U_{D}$, respectively. After we obtain a satisfactory $D$ by optimizing $J_{GD}$, we use $\mathrm{argmax}\,P_{D}(k|x,k\leq K)$ to determine the class of the input data $x$. Similar to other related works on GAN-SSL, we use $L_{D}$ as the supervised learning objective by applying the softmax operator only to the first $K$ dimensions of the output of $D$.
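To make the parameterization concrete, the following is a minimal PyTorch-style sketch of the losses implied by objective (1), assuming the $K$-logit output described above with $\omega_{K+1}=0$; it is not the original implementation of [15], which additionally trains the generator by feature matching, and the function names and batch variables are hypothetical.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of the losses implied by objective (1), assuming the
# discriminator outputs K logits w_k^T f(x) and the (K+1)-th "generated" logit
# is fixed to 0 (omega_{K+1} = 0). Batch tensors and names are illustrative.

def discriminator_loss(logits_lab, labels, logits_unl, logits_fake):
    # L_D: mean log P_D(y | x, y <= K) over labeled data (K-class cross-entropy).
    L_D = -F.cross_entropy(logits_lab, labels)

    # With the implicit zero logit, log P_D(y <= K | x) = lse - softplus(lse)
    # and log P_D(K+1 | x) = -softplus(lse), where lse = logsumexp over K logits.
    lse_unl = torch.logsumexp(logits_unl, dim=1)
    lse_fake = torch.logsumexp(logits_fake, dim=1)
    U_D = (lse_unl - F.softplus(lse_unl)).mean() - F.softplus(lse_fake).mean()

    # The discriminator maximizes J_D = L_D + U_D, so minimize the negative.
    return -(L_D + U_D)

def generator_loss(logits_fake):
    # In the minimax game (1), G minimizes E_{x~p_G} log P_D(K+1 | x);
    # in practice, [15] replaces this direct loss with feature matching.
    return -F.softplus(torch.logsumexp(logits_fake, dim=1)).mean()
```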

Similar to the theoretical analysis of GANs, we consider a nonparametric setting, e.g., we represent a model with infinite capacity and study the optimal discriminator $D$ and generator $G$ in the space of probability density functions. The theoretical proofs, including Proposition 1 - Proposition 3, are provided in the supplementary material. We show the proof of Lemma 1, since it is the key point of our paper.

3.1 Optimal discriminator on labeled data

First, we consider the optimal discriminator $D$ for any given generator $G$. Motivated by the proof of Proposition 1 in [4], this subsection proves that for fixed $G$, if the discriminator has infinite capacity, then regardless of whether the generator is perfect, i.e., $p_{G}=p$ or $p_{G}\neq p$, maximizing the above GAN-SSL objective $J_{D}$ is equivalent to maximizing the supervised learning objective $L_{D}$. We first introduce a basic Lemma 1 and give its proof.

Lemma 1.

For any given generator $G$, if the discriminator has infinite capacity, then for any solution $D=(\omega,f)$ of the GAN-SSL objective $J_{D}$, there exists another solution $D^{*}=(\omega^{*},f^{*})$ such that $L_{D^{*}}=L_{D}$ and $U_{D^{*}}\geq U_{D}$.

Proof.

For any given generator $G$ and any solution $D=(\omega,f)$ of the GAN-SSL objective $J_{D}$, because the discriminator has infinite capacity, there exists $D^{*}=(\omega^{*},f^{*})$ such that for all $x$ and $k\leq K$,

$$\exp(\omega_{k}^{*T}f^{*}(x))=\frac{\exp(\omega_{k}^{T}f(x))}{\sum_{i=1}^{K}\exp(\omega_{i}^{T}f(x))}\cdot\frac{p(x)}{p_{G}(x)}$$

For all $x$,

$$P_{D^{*}}(y|x,y\leq K)=\frac{\exp(\omega_{y}^{*T}f^{*}(x))}{\sum_{i=1}^{K}\exp(\omega_{i}^{*T}f^{*}(x))}=\frac{\exp(\omega_{y}^{T}f(x))}{\sum_{i=1}^{K}\exp(\omega_{i}^{T}f(x))}=P_{D}(y|x,y\leq K)$$

Then $L_{D^{*}}=L_{D}$, and

$$P_{D^{*}}(K+1|x)=\frac{\exp(\omega_{K+1}^{*T}f^{*}(x))}{\sum_{i=1}^{K+1}\exp(\omega_{i}^{*T}f^{*}(x))}=\frac{1}{1+\sum_{i=1}^{K}\exp(\omega_{i}^{*T}f^{*}(x))}=\frac{p_{G}(x)}{p(x)+p_{G}(x)}$$

For the following unsupervised objective function

$$U_{D}=\mathbb{E}_{x\sim p(x)}\log(1-P_{D}(K+1|x))+\mathbb{E}_{x\sim p_{G}(x)}\log P_{D}(K+1|x)$$
$$=\int_{x}\left[p(x)\log(1-P_{D}(K+1|x))+p_{G}(x)\log P_{D}(K+1|x)\right]dx$$

when $G$ is fixed, for each $x$ the integrand has the form $a\log(1-t)+b\log t$ with $a=p(x)$, $b=p_{G}(x)$ and $t=P_{D}(K+1|x)$, which attains its maximum at $t=\frac{b}{a+b}=\frac{p_{G}(x)}{p(x)+p_{G}(x)}$, i.e., exactly at $P_{D^{*}}(K+1|x)$; hence $D^{*}$ maximizes it. Therefore, $L_{D^{*}}=L_{D}$ and $U_{D^{*}}\geq U_{D}$. This completes the proof. ∎
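As an illustrative numerical sanity check, not part of the original proof, the construction of $D^{*}$ above can be verified for arbitrary logits; the values of $K$, the logits and the densities $p(x)$, $p_{G}(x)$ below are hypothetical.

```python
import numpy as np

# Check that the D* constructed in Lemma 1 keeps the class-conditional
# probabilities P(y | x, y <= K) unchanged and yields P(K+1 | x) = p_G / (p + p_G).
rng = np.random.default_rng(0)
K = 5
logits = rng.normal(size=K)       # original w_k^T f(x) for k <= K (class K+1 logit is 0)
p, p_g = 0.7, 0.3                 # hypothetical values of p(x) and p_G(x)

def probs(lg):
    # softmax over the K+1 logits, the last logit being fixed to 0
    full = np.concatenate([lg, [0.0]])
    e = np.exp(full - full.max())
    return e / e.sum()

# D*: exp(new logit_k) = softmax_K(logits)_k * p / p_G, as in the proof above
softmax_K = np.exp(logits) / np.exp(logits).sum()
new_logits = np.log(softmax_K * p / p_g)

old, new = probs(logits), probs(new_logits)
assert np.allclose(old[:K] / old[:K].sum(), new[:K] / new[:K].sum())   # L_{D*} = L_D
assert np.isclose(new[K], p_g / (p + p_g))                             # P_{D*}(K+1|x)
```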

Lemma 1 states that for any given generator and due to the infinite capacity of the discriminator, under the condition that the supervised objective remains unchanged, we can always increase the unsupervised objective until the extreme value is reached. Then, based on Lemma 1, we can present the theoretical results of the optimal discriminator $D$ for any given generator $G$.

Proposition 1.

Given the conditions in Lemma 1, we can obtain the following:
(1) for any optimal solution $D_{L}=(\omega,f)$ of the supervised learning objective $L_{D}$, there exists $D^{*}=(\omega^{*},f^{*})$ such that $D^{*}$ maximizes the GAN-SSL objective $J_{D}$ and that, for all $x$,

$$P_{D^{*}}(y|x,y\leq K)=P_{D_{L}}(y|x,y\leq K);$$

(2) for any optimal solution $D^{*}=(\omega^{*},f^{*})$ of the above GAN-SSL objective $J_{D}$, $D^{*}$ is an optimal solution of the supervised objective $L_{D}$;
(3) the optimal discriminator $D^{*}$ of the above GAN-SSL objective satisfies

$$P_{D^{*}}(y\leq K|x)=\frac{p(x)}{p(x)+p_{G}(x)},\qquad P_{D^{*}}(K+1|x)=\frac{p_{G}(x)}{p(x)+p_{G}(x)}.$$

Proposition 1 (1) and (2) jointly indicate that given a fixed generator, optimizing the discriminator of GAN-SSL is equivalent to optimizing that of supervised learning. Thus, the optimal discriminator in GAN-SSL is expected to be perfect on labeled data. We should note that Proposition 1 of [4] gives a similar theoretical result, claiming only that the optimal discriminator of $L_{D}$ can lead to an optimal discriminator of GAN-SSL, i.e., the first claim of the above Proposition 1, but under the condition that $p_{G}=p$. Our theoretical investigation instead shows that the condition $p_{G}=p$ is unnecessary and that any optimal discriminator of GAN-SSL is an optimal discriminator of supervised learning.

By carefully checking the statements in Proposition 1 of [4] and comparing their results with ours, one may realize that their results are not rigorous. They claimed that good semi-supervised learning requires a bad generator because they observed that the semi-supervised objective shares the same generalization error with the supervised objective given a perfect generator. First, from our theoretical results, a perfect generator is not necessary for this observation, so the requirement of a bad generator in their claim is unreasonable. Moreover, the fact that the semi-supervised objective shares the same generalization error with the supervised objective is insufficient for claiming that GAN-SSL will reduce the generalization ability of supervised learning. In addition, if we can claim only that the optimal discriminator of $L_{D}$ leads to an optimal discriminator of GAN-SSL, one cannot expect to obtain a perfect discriminator on labeled data by GAN-SSL, such that Assumption 1 (1) of [4] is over-assumed. Our result, i.e., Proposition 1 (2), overcomes this shortcoming, such that our later Assumption 1 (1) is indeed reasonable.

3.2 Optimal generator

Next, for a fixed optimal discriminator, we discuss the behavior of the optimal generator of GAN-SSL. By the definition of the discriminator $D$, we have $P_{D}(K+1|x)=1/(1+\sum_{i=1}^{K}\exp(\omega_{i}^{T}f(x)))$. Based on Proposition 1 (3), $P_{D}(K+1|x)=p_{G}(x)/(p(x)+p_{G}(x))$. Namely,

$$\frac{1}{1+\sum_{i=1}^{K}\exp(\omega_{i}^{T}f(x))}=\frac{p_{G}(x)}{p(x)+p_{G}(x)}$$

implying that $\sum_{i=1}^{K}\exp(\omega_{i}^{T}f(x))=p(x)/p_{G}(x)$. Now, the objective of $G$ becomes $C(G)=L_{DG}+U_{G}$, with

$$L_{DG}=\mathbb{E}_{(x,y)\sim p(x,y)}\log P_{D}(y|x,y\leq K)=\mathbb{E}_{(x,y)\sim p(x,y)}\log\left(\exp(\omega_{y}^{T}f(x))\frac{p_{G}(x)}{p(x)}\right)$$
$$U_{G}=\mathbb{E}_{x\sim p}\log P_{D}(y\leq K|x)+\mathbb{E}_{x\sim p_{G}}\log P_{D}(K+1|x)=\mathbb{E}_{x\sim p}\log\frac{p(x)}{p(x)+p_{G}(x)}+\mathbb{E}_{x\sim p_{G}}\log\frac{p_{G}(x)}{p(x)+p_{G}(x)}$$

The aim of the discriminator is to classify the labeled samples into the correct class, the unlabeled samples into the "true" class (any of the first $K$ classes) and the generated samples into the "fake" class, i.e., the $(K+1)$-th class. In this paper, the so-called perfect discriminator means that for any $(x,y)\in p(x,y)$, $P_{D}(y|x,y\leq K)>P_{D}(k|x,k\leq K,k\neq y)$, i.e., the discriminator can make a correct decision on labeled data. We say that the discriminator reaches its theoretical maximum if $P_{D}(y|x,y\leq K)=1$ and, for any other class $k\neq y$, $P_{D}(k|x,k\leq K)=0$. Obviously, it is impossible to reach the theoretical maximum in practice, i.e., $P_{D}(y|x,y\leq K)$ can only be close to $1$, and for any other class $k\neq y$, $\exp(\omega_{k}^{T}f(x))\rightarrow 0$. That is, because the discriminator has infinite capacity, we can expect that there exists $0<\epsilon\ll 1$ such that $(K-1)\exp(\omega_{k}^{T}f(x))\leq\epsilon$ for any other class $k\neq y$. Here, $\epsilon$ represents the error between the output of the perfect discriminator and its theoretical maximum. Now, we show the relationship between the true data distribution and the generator distribution in GAN-based semi-supervised tasks through the following Proposition 2.

Proposition 2.

Given the conditions in Proposition 1, for a fixed optimal discriminator $D$, suppose there exists $0<\epsilon\ll 1$ such that the other logit output ($k\neq y$) of $D$ satisfies $(K-1)\exp(\omega_{k}^{T}f(x))\leq\epsilon$. Then, the generator objective $C(G)=-KL(p||p-\epsilon p_{G})+2JS(p||p_{G})-2\log 2$.

Proposition 2 indicates that minimizing the generator objective $C(G)$ is equivalent to simultaneously minimizing $JS(p||p_{G})$ and maximizing $KL(p||p-\epsilon p_{G})$. To minimize $JS(p||p_{G})$, we need to train $p_{G}$ to be close to $p$; however, to maximize $KL(p||p-\epsilon p_{G})$, we need to increase the difference between $p$ and $p-\epsilon p_{G}$. If the optimal solution $D$ of GAN-SSL for the supervised objective can reach the theoretical maximum, i.e., $\epsilon=0$, then minimizing $C(G)$ is equivalent to minimizing $JS(p||p_{G})$. Since the Jensen-Shannon divergence between two distributions is always non-negative and zero iff they are equal, $C^{*}=-2\log 2$ is the global minimum of $C(G)$, and the only solution is $p_{G}=p$, i.e., the generator distribution perfectly replicates the true data distribution, such that the optimal generator is perfect. However, since it is impossible to reach the theoretical maximum, $\epsilon\neq 0$. If we assume that $\epsilon$ is sufficiently small, i.e., $0<\epsilon\ll 1$, the optimal generator $p_{G}$ will not be $p$. Otherwise, if $p_{G}=p$, $JS(p||p_{G})$ reaches its minimum, and $KL(p||p-\epsilon p_{G})=KL(p||p(1-\epsilon))$ becomes very small, which contradicts maximizing $KL(p||p-\epsilon p_{G})$ in the objective $C(G)$.

Different from the theoretical result of [4] that good semi-supervised learning requires a bad generator, Proposition 2 shows that the error of the optimal generator distribution relative to the true data distribution largely depends on the optimal discriminator. First, when the perfect discriminator for the supervised objective can reach its theoretical maximum, the global optimal generator of GAN-SSL is $p_{G}=p$, i.e., the optimal generator $G$ is perfect. However, since the theoretical maximum cannot be reached in practice, one cannot expect to learn a perfect generator, i.e., the optimal generator $G$ is imperfect. In addition, we note that although the optimal generator is imperfect in actual situations, it can be similar to the original data distribution. As shown in the work of [15], samples generated by the excellent GAN-SSL generator are not perfect but are also not completely different from the original data distribution, which is apparently inconsistent with the requirement of a "bad" generator in [4].

Case Study on Synthetic Data. To obtain a more intuitive understanding of Proposition 2, we conduct a case study on a 1D synthetic dataset, where we can easily verify our theoretical analysis by visualizing the model behaviors. The training dataset is sampled from the Gaussian distribution $p=\mathcal{N}(0,0.4^{2})$. Based on Proposition 2, we study the relationship between the original data distribution $p$ and the generator distribution $p_{G}$ obtained by directly training the following generator objective function: $C(G)=-KL(p||p-\epsilon p_{G})+2JS(p||p_{G})$. We use a fully connected generator with ReLU activations and architecture 1-100-100-1. Figure 1 compares $p$ and $p_{G}$ for $\epsilon=0$, $\epsilon=0.1$ and $\epsilon=0.2$, respectively. As theoretically analyzed above, for $\epsilon=0$, minimizing $C(G)$ is equivalent to minimizing $JS(p||p_{G})$, and the optimal solution is $p_{G}=p$ (as shown in Figure 1 (a)). As shown in Figure 1 (b) and (c), when $\epsilon\neq 0$, $p_{G}$ is imperfect, and the greater $\epsilon$ is, the larger the gap between $p_{G}$ and $p$. Although there is a certain gap between the generator distribution and the data distribution, $p_{G}$ is similar to $p$, and most of their supports overlap. Therefore, in this case, different from GANs, the optimal generator cannot be used to generate samples.
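The following is a minimal PyTorch-style sketch of this case study, not the original experimental code: it assumes that $p_{G}$ is approximated by a differentiable Gaussian kernel density estimate over generated samples and that the divergences are evaluated by Riemann sums on a fixed grid; the bandwidth, grid, learning rate and batch size are illustrative choices.

```python
import math
import torch
import torch.nn as nn

# Sketch of directly minimizing C(G) = -KL(p || p - eps*p_G) + 2 JS(p || p_G)
# for p = N(0, 0.4^2) and a 1-100-100-1 ReLU generator (all other details assumed).

def kde(samples, grid, bandwidth=0.05):
    # Differentiable Gaussian kernel density estimate of p_G on the grid.
    diff = grid[None, :] - samples[:, None]
    kernel = torch.exp(-0.5 * (diff / bandwidth) ** 2) / (bandwidth * math.sqrt(2 * math.pi))
    return kernel.mean(dim=0) + 1e-12

def objective(p, p_g, dx, eps):
    # Riemann-sum approximations of the two divergence terms.
    m = 0.5 * (p + p_g)
    js = 0.5 * (p * torch.log(p / m) + p_g * torch.log(p_g / m)).sum() * dx
    kl = (p * torch.log(p / (p - eps * p_g).clamp_min(1e-8))).sum() * dx
    return -kl + 2 * js

generator = nn.Sequential(nn.Linear(1, 100), nn.ReLU(),
                          nn.Linear(100, 100), nn.ReLU(),
                          nn.Linear(100, 1))
opt = torch.optim.Adam(generator.parameters(), lr=1e-3)

grid = torch.linspace(-2.0, 2.0, 400)
dx = grid[1] - grid[0]
p = torch.exp(-0.5 * (grid / 0.4) ** 2) / (0.4 * math.sqrt(2 * math.pi))  # N(0, 0.4^2)

eps = 0.1
for step in range(5000):
    z = torch.randn(512, 1)
    samples = generator(z).squeeze(1)
    loss = objective(p, kde(samples, grid), dx, eps)
    opt.zero_grad()
    loss.backward()
    opt.step()
```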

Figure 1: Comparison of the data distribution $p$ (blue) and the generator distribution $p_{G}$ (red) for $\epsilon=0$, $\epsilon=0.1$ and $\epsilon=0.2$, respectively. Here, $\epsilon$ represents the error between the output of the perfect discriminator and its theoretical maximum.

3.3 Optimal discriminator on unlabeled data

Finally, to study the behavior of the optimal discriminator of GAN-SSL on unlabeled data, we first present the assumption on the convergence conditions of the optimal discriminator and generator and the manifold conditions on the dataset. Let $\Omega_{\mathcal{L}}$ be the set of labeled data. Let $\Omega$ and $\Omega_{\mathcal{G}}$ be the data manifold and the generated data manifold, respectively. Obviously, we have $\Omega_{\mathcal{L}}\subset\Omega$, and the unlabeled data should be sampled from $\Omega\setminus\Omega_{\mathcal{L}}$. Denote $\Omega=\cup_{k=1}^{K}\Omega^{k}$, where $\Omega^{k}$ is the data manifold of class $k$. Denote $\Omega^{k}=\cup_{l}\Omega^{kl}$, where $\Omega^{kl}$ is a connected subdomain.

Assumption 1.

Suppose $D,G$ are optimal solutions of the GAN-SSL objective (1). We assume that:
(1) for any $(x,y)\in\Omega_{\mathcal{L}}$, we have $\omega_{y}^{T}f(x)>\omega_{k}^{T}f(x)$ for any other class $k\neq y$;
(2) for any $x\in\Omega$, $\max_{k=1}^{K}\omega_{k}^{T}f(x)>0$; for any $x\in\Omega_{\mathcal{G}}\setminus\bar{\Omega}$, $\max_{k=1}^{K}\omega_{k}^{T}f(x)<0$;
(3) each $\Omega^{jl}$ contains at least one labeled data point;
(4) for any $x_{k}\in\Omega^{k}$ and $j\neq k$, there exist a connected subdomain $\Omega^{jl}$, a labeled data point $x_{j}\in\Omega^{jl}$, a generated data point $x_{g}\in\Omega_{\mathcal{G}}\setminus\bar{\Omega}$, and $0<\alpha<1$, such that $f(x_{g})=\alpha f(x_{k})+(1-\alpha)f(x_{j})$.

The Reasonableness of Assumption 1 (1). According to the theoretical analysis for the optimal discriminator (Proposition 1), we found that optimizing the discriminator of GAN-SSL is equivalent to optimizing that of supervised learning. Because $D$ is an optimal solution of the GAN-SSL objective, $D$ is an optimal solution of the supervised learning objective. Naturally, Assumption 1 (1) holds, i.e., the discriminator has a correct decision boundary for labeled data given that the discriminator has infinite capacity.

The Reasonableness of Assumption 1 (2). Proposition 2 implies that the optimal generator we obtain in practice is not a perfect generator, such that we assume $\Omega_{\mathcal{G}}\setminus\bar{\Omega}\neq\emptyset$. We ignore the case of $\Omega_{\mathcal{G}}\subset\Omega$, $\Omega\setminus\bar{\Omega}_{\mathcal{G}}\neq\emptyset$, since it is almost impossible to ensure that our optimal generator outputs only true data but with different probabilities in practice, unless we obtain the worst model with mode collapse. Similar to [4], we also make a strong assumption regarding the true-fake correctness of the true data ($\Omega$) and fake data ($\Omega_{\mathcal{G}}\setminus\bar{\Omega}$). In other words, we assume that the sampling of unlabeled data is good enough to achieve the best generalization ability on the true data manifold.

The Reasonableness of Assumption 1 (3). Intuitively, one cannot expect to achieve a good classification performance on a connected subdomain with no label information, since we can optionally set this connected subdomain to any class but with no influence on the objective.

The Reasonableness of Assumption 1 (4). Apparently, Assumption 1 (3) is a sufficient condition for this assumption. In addition, since the optimal generator is imperfect and we need only its existence, one can expect to be able to achieve this condition.

Now, based on Assumption 1, we give the main result of this subsection.

Proposition 3.

Given the conditions in Assumption 1, for all classes $k\leq K$ and all data space points $x_{k}\in\Omega^{k}$, we have $\omega_{k}^{T}f(x_{k})>\omega_{j}^{T}f(x_{k})$ for any $j\neq k$.

We remark that the proof of Proposition 3 is similar to that in [4] but with different conditions; see the supplementary material for more details. Proposition 3 guarantees that, given the convergence conditions of the optimal discriminator and generator, which are induced by Propositions 1 and 2, respectively, if the labeled data can traverse all subdomains of the data manifold, then the discriminator $D$ will be perfect on the data manifold, i.e., it can also learn correct decision boundaries for all unlabeled data. Now, we discuss the difference between our Assumption 1 and Assumption 1 in [4].

1. As discussed above in subsection 3.1, Assumption 1 (1) in [4] is over-assumed, since they cannot expect to obtain a perfect discriminator by optimizing the GAN-SSL objective. However, our Proposition 1 can support our Assumption 1 (1).

2. Their Assumption 1 (2) and Assumption 1 (3) correspond to our Assumption 1 (2). We all assume that the optimal $D$ can result in a perfect decision regarding true-fake correctness. However, since our optimal $G$ is not a bad one, as a result of Proposition 2, we assume only that the data in $\Omega_{\mathcal{G}}\setminus\bar{\Omega}$ are fake.

3. Unlike [4], we additionally give a reasonable assumption on the data manifold, i.e., Assumption 1 (3). By a careful check, one can find that the authors of [4] ignore this assumption because they implicitly assume that, after the embedding $f$, the feature space of each class is a connected domain. Otherwise, the proof of their Proposition 2 is not correct, since one can set a connected subdomain of the feature space $F_{k}$ without label information to any class by a similar contradiction argument.

4. Similar to our Assumption 1 (4), [4] makes a stronger assumption before the definition of their Assumption 1 by setting $G$ as a complement generator. Given a complement generator, our Assumption 1 (4) can clearly be satisfied. Note that under their definition, if the optimal generator $G$ is a complement one, the feature space of the generated data will fill the complement set of the feature space of the true data. In other words, they hope that the "bad" generator can not only generate fake data but also generate all the existing fake data. Apparently, this assumption is too strong and requires the generator to offer a high representation ability, since the true data manifold is always low-dimensional. However, we need only the existence of a generated fake data point such that its feature can be linearly represented by those of two true data points, with at least one labeled data point among them. This existence has a high probability of being satisfied given only a connected subdomain $\Omega^{jl}$ and an imperfect generator. Here, we actually relax the requirement for a complement generator.

The Reasonableness of Assumption 1 (3) on Synthetic Data. We design a binary classification problem with $\Omega=\Omega^{1}\cup\Omega^{2}$, where $\Omega^{1}=\Omega^{11}\cup\Omega^{12}$ and $\Omega^{11}$, $\Omega^{12}$ and $\Omega^{2}$ are three bounded 2D discs with no intersection. The unlabeled data are uniformly sampled from $\Omega$. If Assumption 1 (3) is satisfied, each connected subdomain contains at least one labeled data point, as shown in Figure 2 (a1). Then, as shown in Figure 2 (a4), we can easily train the discriminator to learn the correct decision boundary. However, if Assumption 1 (3) is not satisfied, for the data in class 1, we sample labeled data only in subdomain $\Omega^{11}$, as shown in Figure 2 (b1). In this case, it is difficult for the discriminator to learn a correct decision boundary. As shown in Figure 2 (b4), (b6), and (b8), we show three kinds of results obtained under several training runs. For the unlabeled data in subdomain $\Omega^{12}$, the classification results are uncertain. Hence, a comparison between Figure 2 (a4), (b4), (b6), and (b8) shows that the condition of Assumption 1 (3) is necessary to obtain a perfect discriminator on unlabeled data. Furthermore, as shown in Figure 2 (a3), (b3), (b5) and (b7), a comparison of the generated samples shows that there is indeed a certain gap between the generator distribution and the data distribution, while the optimal generator is not a complement one, as claimed in [4].
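As an illustration, the following is a hypothetical data-generation sketch for this synthetic setup, assuming specific centers and radii for the three disjoint discs (not specified above) and illustrative sample sizes.

```python
import numpy as np

# Three disjoint 2D discs: Omega^11 and Omega^12 form class 1, Omega^2 is class 2.
# Centers, radii and sample counts below are illustrative assumptions.

def sample_disc(center, radius, n, rng):
    # Uniform samples inside a 2D disc (sqrt of the radial CDF gives uniform area density).
    theta = rng.uniform(0.0, 2.0 * np.pi, n)
    r = radius * np.sqrt(rng.uniform(0.0, 1.0, n))
    return np.stack([center[0] + r * np.cos(theta),
                     center[1] + r * np.sin(theta)], axis=1)

rng = np.random.default_rng(0)
discs = {"omega11": ((-2.0, 0.0), 0.8),   # class 1, first connected subdomain
         "omega12": (( 2.0, 0.0), 0.8),   # class 1, second connected subdomain
         "omega2":  (( 0.0, 2.0), 0.8)}   # class 2

# Unlabeled data: uniformly sampled from the whole manifold Omega.
unlabeled = np.concatenate([sample_disc(c, r, 200, rng) for c, r in discs.values()])

# Assumption 1 (3) satisfied: at least one labeled point in every connected subdomain.
labeled_ok = {name: sample_disc(c, r, 5, rng) for name, (c, r) in discs.items()}

# Assumption 1 (3) violated: class-1 labels drawn only from omega11, none from omega12.
labeled_bad = {name: sample_disc(c, r, 5, rng)
               for name, (c, r) in discs.items() if name != "omega12"}
```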

Figure 2: Results of a binary classification problem on synthetic data when Assumption 1 (3) is satisfied (a) or not satisfied (b). Labeled and unlabeled data are denoted by crosses and points, respectively, in (a1) and (b1). Different colors indicate different classes in (a2), (a4), (b2), (b4), (b6), and (b8), where (a2) and (b2) are the ground truths on the test data and (a4), (b4), (b6), and (b8) are our results. Generated and real data are denoted by blue points and gray crosses, respectively, in (a3), (b3), (b5), and (b7).

4 Conclusions

Via a theoretical analysis, this paper answers the question of how GAN-SSL works. In conclusion, semi-supervised learning based on GANs will yield a perfect discriminator on both labeled (Proposition 1) and unlabeled data (Proposition 3) by learning an imperfect generator (Proposition 2), i.e., GAN-SSL can effectively improve the generalization ability in semi-supervised classification. In the future, the theoretical problems of more complex models, such as Triple-GAN and other methods, will be studied. In addition, the existence of Assumption 1 (4) will undergo further theoretical and empirical investigations.

Acknowledgements

This work was supported in part by the Innovation Foundation of Qian Xuesen Laboratory of Space Technology, and in part by Beijing Nova Program of Science and Technology under Grant Z191100001119129.

References

  • [1] Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang. Generalization and equilibrium in generative adversarial nets (GANs). In International Conference on Machine Learning, pages 224–232, 2017.
  • [2] Yu Bai, Tengyu Ma, and Andrej Risteski. Approximability of discriminators implies diversity in GANs. In International Conference on Learning Representations, 2019.
  • [3] David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin A Raffel. MixMatch: A holistic approach to semi-supervised learning. In Neural Information Processing Systems, pages 5049–5059, 2019.
  • [4] Zihang Dai, Zhilin Yang, Fan Yang, William W. Cohen, and Ruslan Salakhutdinov. Good semi-supervised learning that requires a bad GAN. In Neural Information Processing Systems, 2017.
  • [5] Jinhao Dong and Tong Lin. MarginGAN: Adversarial training in semi-supervised learning. In Neural Information Processing Systems, pages 10440–10449, 2019.
  • [6] Zhe Gan, Liqun Chen, Weiyao Wang, Yuchen Pu, Yizhe Zhang, Hao Liu, Chunyuan Li, and Lawrence Carin. Triangle generative adversarial networks. In Neural Information Processing Systems, pages 5247–5256, 2017.
  • [7] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Neural Information Processing Systems, pages 2672–2680, 2014.
  • [8] Diederik P. Kingma, Danilo J. Rezende, Shakir Mohamed, and Max Welling. Semi-supervised learning with deep generative models. In Neural Information Processing Systems, pages 3581–3589, 2014.
  • [9] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In International Conference on Learning Representations, 2014.
  • [10] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, 2017.
  • [11] Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. In International Conference on Learning Representations, 2017.
  • [12] Bruno Lecouat, Chuan Sheng Foo, Houssam Zenati, and Vijay Chandrasekhar. Manifold regularization with GANs for semi-supervised learning. arXiv preprint arXiv:1807.04307, 2018.
  • [13] Chongxuan Li, Taufik Xu, Jun Zhu, and Bo Zhang. Triple generative adversarial nets. In Neural Information Processing Systems, pages 4088–4098, 2017.
  • [14] Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for GANs do actually converge? In International Conference on Machine Learning, 2018.
  • [15] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Neural Information Processing Systems, 2016.
  • [16] Tim Salimans, Han Zhang, Alec Radford, and Dimitris Metaxas. Improving GANs using optimal transport. In International Conference on Learning Representations, 2018.
  • [17] Jost Tobias Springenberg. Unsupervised and semi-supervised learning with categorical generative adversarial networks. In International Conference on Learning Representations, 2016.
  • [18] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alexander Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. SSW, page 125, 2016.
  • [19] Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, and Quoc V. Le. Unsupervised data augmentation for consistency training. arXiv preprint arXiv:1904.12848, 2019.

Appendix

Proof of Proposition 1

Proof.

Proof of Proposition 1 (1). Similar to the proof of Lemma 1, given an optimal solution $D_{L}=(\omega,f)$ of the supervised objective $L_{D}$, since the discriminator has infinite capacity, there exists $D^{*}=(\omega^{*},f^{*})$ such that for all $x$ and $k\leq K$,

$$\exp(\omega_{k}^{*T}f^{*}(x))=\frac{\exp(\omega_{k}^{T}f(x))}{\sum_{i=1}^{K}\exp(\omega_{i}^{T}f(x))}\cdot\frac{p(x)}{p_{G}(x)}\qquad(2)$$

For all $x$,

$$P_{D^{*}}(y|x,y\leq K)=\frac{\exp(\omega_{y}^{*T}f^{*}(x))}{\sum_{i=1}^{K}\exp(\omega_{i}^{*T}f^{*}(x))}=\frac{\exp(\omega_{y}^{T}f(x))}{\sum_{i=1}^{K}\exp(\omega_{i}^{T}f(x))}=P_{D_{L}}(y|x,y\leq K)$$

Then $L_{D^{*}}=L_{D_{L}}$. Based on the definition and Eq. (2), we can obtain

$$P_{D^{*}}(K+1|x)=\frac{1}{1+\sum_{i=1}^{K}\exp(\omega_{i}^{*T}f^{*}(x))}=\frac{p_{G}}{p+p_{G}}.$$

By the proof of Lemma 1, $D^{*}$ is an optimal solution of $U_{D}$. Because $D_{L}$ maximizes $L_{D}$, $D^{*}$ also maximizes $L_{D}$. It follows that $D^{*}$ maximizes $J_{D}$.

Proof of Proposition 1 (2). First, we note that if $D^{*}$ maximizes the GAN-SSL objective $J_{D}$, then $D^{*}$ maximizes $U_{D}$. Otherwise, based on Lemma 1, there exists another solution $D^{\prime}=(\omega^{\prime},f^{\prime})$ such that $L_{D^{\prime}}=L_{D^{*}}$ and $U_{D^{\prime}}>U_{D^{*}}$, i.e., $J_{D^{\prime}}>J_{D^{*}}$, leading to a contradiction. That is to say, for any optimal solution $D^{*}=(\omega^{*},f^{*})$ of the GAN-SSL objective $J_{D}$, $U_{D^{*}}$ reaches its extreme value. Then, $D^{*}$ is also an optimal solution of $L_{D}$. Otherwise, due to the infinite capacity of the discriminator, there exists an optimal solution $D_{1}$ of the supervised objective $L_{D}$. Thus, based on Proposition 1 (1), there exists $D_{1}^{*}=(\omega_{1}^{*},f_{1}^{*})$ such that $L_{D_{1}^{*}}=L_{D_{1}}>L_{D^{*}}$ and $U_{D_{1}^{*}}=U_{D^{*}}$. Therefore, $J_{D_{1}^{*}}>J_{D^{*}}$, leading to a contradiction, i.e., for any optimal solution $D^{*}=(\omega^{*},f^{*})$ of $J_{D}$, $D^{*}$ is an optimal solution of $L_{D}$.

Based on Proposition 1 (1) and (2), we can obtain that maximizing $J_{D}$ is equivalent to maximizing the quantities $L_{D}$ and $U_{D}$ simultaneously. Then, the optimal solution of $J_{D}$ must also be the optimal solution of $U_{D}$. Similar to the theoretical results of [7], for $G$ fixed, the optimal discriminator $D^{*}$ of the GAN-SSL objective is

$$P_{D^{*}}(K+1|x)=\frac{p_{G}(x)}{p(x)+p_{G}(x)},$$

and

$$P_{D^{*}}(y\leq K|x)=1-P_{D^{*}}(K+1|x)=\frac{p(x)}{p(x)+p_{G}(x)}.$$

This completes the proof. ∎

Proof of Proposition 2

Proof.

By the definition of the discriminator $D$, $P_{D}(K+1|x)=\frac{1}{1+\sum_{i=1}^{K}\exp(\omega_{i}^{T}f(x))}$, and based on Proposition 1 (3), $P_{D^{*}}(K+1|x)=\frac{p_{G}(x)}{p(x)+p_{G}(x)}$; then, for the optimal discriminator $D=(\omega,f)$, $\frac{1}{1+\sum_{i=1}^{K}\exp(\omega_{i}^{T}f(x))}=\frac{p_{G}(x)}{p(x)+p_{G}(x)}$, such that

$$\sum_{i=1}^{K}\exp(\omega_{i}^{T}f(x))=\exp(\omega_{y}^{T}f(x))+\sum_{i\neq y}^{K}\exp(\omega_{i}^{T}f(x))=p(x)/p_{G}(x)$$

For a fixed optimal discriminator $D$, suppose there exists $0<\epsilon\ll 1$ such that the other logit output ($k\neq y$) of $D$ satisfies $(K-1)\exp(\omega_{k}^{T}f(x))\leq\epsilon$; then, $\exp(\omega_{y}^{T}f(x))\geq p/p_{G}-\epsilon$. If the minimum is achieved, i.e., $\exp(\omega_{y}^{T}f(x))=p/p_{G}-\epsilon$, then

$$L_{DG}=\mathbb{E}_{(x,y)\sim p(x,y)}\log P_{D}(y|x,y\leq K)=\mathbb{E}_{(x,y)\sim p(x,y)}\log\left(\frac{\exp(\omega_{y}^{T}f(x))}{\sum_{i=1}^{K}\exp(\omega_{i}^{T}f(x))}\right)$$
$$=\mathbb{E}_{x\sim p(x)}\log\left(\left(\frac{p(x)}{p_{G}(x)}-\epsilon\right)\frac{p_{G}(x)}{p(x)}\right)=-\int_{x}p(x)\log\frac{p(x)}{p(x)-\epsilon p_{G}(x)}dx$$

Then,

$$C(G)=L_{DG}+U_{G}=-\int_{x}p(x)\log\frac{p(x)}{p(x)-\epsilon p_{G}(x)}dx+\int_{x}\left[p(x)\log\frac{p(x)}{p(x)+p_{G}(x)}+p_{G}(x)\log\frac{p_{G}(x)}{p(x)+p_{G}(x)}\right]dx$$
$$=-KL(p||p-\epsilon p_{G})+2JS(p||p_{G})-2\log 2$$

This completes the proof. ∎

Proof of Proposition 3

Proof.

First, if $x_{k}$ is a labeled data point, based on Assumption 1 (1), we have $\omega_{k}^{T}f(x_{k})>\omega_{j}^{T}f(x_{k})$ for any $j\neq k$. Then, we consider the case where $x_{k}\in\Omega^{k}$ is an unlabeled data point. Without loss of generality, suppose $j=\arg\max_{j\neq k}\omega_{j}^{T}f(x_{k})$. Now, we prove the claim by contradiction.

Suppose there exists a data space point $x_{k}\in\Omega^{k}$ and a class $j\neq k$ such that

$$\omega_{k}^{T}f(x_{k})-\omega_{j}^{T}f(x_{k})\leq 0\qquad(3)$$

By Assumption 1 (3) and Assumption 1 (4), there exist a connected subdomain $\Omega^{jl}$, a labeled data point $x_{j}\in\Omega^{jl}$, and a generated data point $x_{g}\in\Omega_{\mathcal{G}}\setminus\bar{\Omega}$ such that $f(x_{g})=\alpha f(x_{k})+(1-\alpha)f(x_{j})$ with $0<\alpha<1$. Based on Assumption 1 (2), $\omega_{j}^{T}f(x_{g})<0$. Thus,

$$\omega_{j}^{T}f(x_{g})=\alpha\omega_{j}^{T}f(x_{k})+(1-\alpha)\omega_{j}^{T}f(x_{j})<0$$

By Assumption 1 (1) and Assumption 1 (2), for any $(x_{j},j)\in\Omega_{\mathcal{L}}$, $\omega_{j}^{T}f(x_{j})=\max_{k}\omega_{k}^{T}f(x_{j})>0$. Moreover, by Eq. (3) and Assumption 1 (2), for any $x_{k}\in\Omega^{k}\subset\Omega$, $\omega_{j}^{T}f(x_{k})=\max_{i=1}^{K}\omega_{i}^{T}f(x_{k})>0$. Then, $\alpha\omega_{j}^{T}f(x_{k})+(1-\alpha)\omega_{j}^{T}f(x_{j})>0$, leading to a contradiction. In summary, for all data space points $x_{k}\in\Omega^{k}$, we have $\omega_{k}^{T}f(x_{k})>\omega_{j}^{T}f(x_{k})$ for any $j\neq k$. ∎