
Conditional Generative Modeling via Learning the Latent Space

Sameera Ramasinghe
Australian National University & CSIRO, Data61
sameera.ramasinghe@anu.edu.au
Kanchana Ranasinghe
University of Moratuwa
kahnchana@gmail.com
Salman Khan
Mohamed Bin Zayed University of Artificial Intelligence
salman.khan@mbzuai.ac.ae
Nick Barnes
Australian National University
nick.barnes@anu.edu.au
Stephen Gould
Australian National University
stephen.gould@anu.edu.au
Abstract

Although deep learning has achieved appealing results on several machine learning tasks, most models are deterministic at inference, limiting their application to single-modal settings. We propose a novel general-purpose framework for conditional generation in multimodal spaces, which uses latent variables to model generalizable learning patterns while minimizing a family of regression cost functions. At inference, the latent variables are optimized to find optimal solutions corresponding to multiple output modes. Compared to existing generative solutions, our approach demonstrates faster and more stable convergence, and can learn better representations for downstream tasks. Importantly, it provides a simple generic model that can beat highly engineered pipelines tailored using domain expertise on a variety of tasks, while generating diverse outputs. Our code will be released.

1 Introduction

Conditional generative models provide a natural mechanism to jointly learn a data distribution and optimize predictions. In contrast, discriminative models improve predictions by modeling the label distribution. Learning to model the data distribution allows generating novel samples and is considered a preferred way to understand the real world. Existing conditional generative models have generally been explored in single-modal settings, where a one-to-one mapping between input and output domains exists (Nalisnick et al., 2019; Fetaya et al., 2020). Here, we investigate continuous multimodal (CMM) spaces for generative modeling, where one-to-many mappings exist between input and output domains. This is critical since many real-world situations are inherently multimodal, e.g., humans can imagine several outcomes for a given occluded image. In a discrete setting, this problem becomes relatively easy to tackle using techniques such as maximum likelihood estimation, since the output distribution can be predicted as a vector over discrete bins (Zhang et al., 2016), which is not possible in continuous domains. Consequently, generative modeling in CMM spaces remains a challenging task.

To model CMM spaces, a prominent approach in the literature is to use a combination of reconstruction and adversarial losses (Isola et al., 2017; Zhang et al., 2016; Pathak et al., 2016). However, this entails key shortcomings. 1) The goals of adversarial and reconstruction losses are contradictory (Sec. 4); hence model engineering and numerous regularizers are required to support convergence (Lee et al., 2019; Mao et al., 2019), resulting in less-generic models tailored for specific applications (Zeng et al., 2019; Vitoria et al., 2020). 2) Adversarial-loss-based models are notorious for difficult convergence due to the challenge of finding the Nash equilibrium of a non-convex min-max game in high dimensions (Barnett, 2018; Chu et al., 2020; Kodali et al., 2017). 3) The convergence is heavily dependent on the architecture, hence such models lack scalability (Thanh-Tung et al., 2019; Arora & Zhang, 2017). 4) The promise of assisting downstream tasks remains unfulfilled, with a large performance gap between generative modeling approaches and their discriminative counterparts (Grathwohl et al., 2020; Jing & Tian, 2020).

In this work, we propose a general-purpose framework for modeling CMM spaces using a set of domain-agnostic regression cost functions instead of the adversarial loss. This improves stability and eliminates the incompatibility between the adversarial and reconstruction losses, allowing more precise outputs while maintaining diversity. The underlying notion is to learn the ‘behaviour of the latent variables’ in minimizing these cost functions while converging to an optimum mode during the training phase, and to mimic the same behaviour at inference. Despite being a novel direction, the proposed framework showcases promising attributes by: (a) achieving state-of-the-art results on a diverse set of tasks using a generic model, implying generalizability, (b) rapid convergence to optimal modes despite architectural changes, (c) learning useful features for downstream tasks, and (d) producing diverse outputs via traversal through multiple output modes at inference.

2 Proposed Methodology

We define a family of cost functions $\{E_{i,j}=d(y^{g}_{i,j},\mathcal{G}(x_{j},w))\}\in\xi$, where $x_{j}\sim\chi$ is the input, $y^{g}_{i,j}\sim\Upsilon$ is the $i^{th}$ ground-truth mode for $x_{j}$, $\mathcal{G}$ is a generator function with weights $w$, and $d(\cdot,\cdot)$ is a distance function. Note that the number of cost functions $E_{(\cdot,j)}$ for a given $x_{j}$ can vary over $\chi$. Our aim is to obtain a generator function $\mathcal{G}(x_{j},w)$ that can minimize each $E_{i,j},\,\forall i$ as $\mathcal{G}(x_{j},w)\rightarrow y^{g}_{i,j}$. However, since $\mathcal{G}$ is a deterministic function ($x$ and $w$ are both fixed at inference), it can only produce a single output. Therefore, we introduce a latent vector $z$ to the generator function, which can be used to converge $\bar{y}_{i,j}=\mathcal{G}(x_{j},w,z_{i,j})$ towards a $y^{g}_{i,j}$ at inference, and possibly towards multiple solutions. Formally, the family of cost functions now becomes $\{\hat{E}_{i,j}=d(y^{g}_{i,j},\mathcal{G}(x_{j},w,z_{i,j}))\},\,\forall z_{i,j}\sim\zeta$. Then, our training objective can be defined as finding a set of optimal $z_{i}^{*}\in\zeta$ and $w^{*}\in\omega$ by minimizing $\mathbb{E}_{i\sim I}[\hat{E}_{i}]$, where $I$ is the number of possible solutions for $x_{j}$. Note that $w^{*}$ is fixed for all $i$ and a different $z_{i}^{*}$ exists for each $i$. Considering all the training samples $x_{j}\sim\chi$, our training objective becomes,

$$\{\{z_{i,j}^{*}\},w^{*}\}=\underset{z_{i,j}\in\zeta,\,w\in\omega}{\arg\min}\;\mathbb{E}_{i\in I,j\in J}[\hat{E}_{i,j}]. \qquad (1)$$

Eq. 1 can be optimized via Algorithm 1 (proof in App. 2.2). Intuitively, the goal of Eq. 1 is to obtain a family of optimal latent codes $\{z^{*}_{i,j}\}$, each inducing a global minimum in the corresponding $\hat{E}_{i,j}$ as $y^{g}_{i,j}=\mathcal{G}(x_{j},w,z^{*}_{i,j})$. Consequently, at inference, we can optimize $\bar{y}_{i,j}$ to converge to an optimal mode in the output space by varying $z$. Therefore, we predict an estimated $\bar{z}_{i,j}$ at inference,

$$\bar{z}_{i,j}\approx\underset{z}{\arg\min}\,\hat{E}_{i,j}, \qquad (2)$$

for each $y^{g}_{i,j}$, which in turn can be used to obtain the prediction $\mathcal{G}(x_{j},\bar{z}_{i,j},w)\approx y^{g}_{i,j}$. In other words, for a selected $x_{j}$, let $\bar{y}^{t}_{i,j}$ be the initial estimate of $\bar{y}_{i,j}$. At inference, $z$ can traverse gradually towards an optimum point in the latent space, forcing $\bar{y}_{i,j}^{t+n}\rightarrow y^{g}_{i,j}$ in a finite number of steps $n$.

However, a critical problem remains: Eq. 2 depends on $y^{g}_{i,j}$, which is not available at inference. As a remedy, we enforce Lipschitz constraints on $\mathcal{G}$ over $(x_{j},z_{i,j})$, which bound the gradient norm as,

$$\frac{\lVert\mathcal{G}(x_{j},w^{*},z^{*}_{i,j})-\mathcal{G}(x_{j},w^{*},z_{0})\rVert}{\lVert z^{*}_{i,j}-z_{0}\rVert}\leq\int\lVert\nabla_{z}\mathcal{G}(x_{j},w^{*},\gamma(t))\rVert\,dt\leq C, \qquad (3)$$

where $z_{0}\sim\zeta$ is an arbitrary random initialization, $C$ is a constant, and $\gamma(\cdot)$ is a straight path from $z_{0}$ to $z^{*}_{i,j}$ (proof in App. 2.1). Intuitively, Eq. 3 implies that the gradients $\nabla_{z}\mathcal{G}$ along the path $\gamma(\cdot)$ do not tend to vanish or explode; hence, finding the path to the optimal $z^{*}_{i,j}$ in the space $\zeta$ becomes a fairly straightforward regression problem. Moreover, enforcing the Lipschitz constraint encourages meaningful structuring of the latent space: suppose $z^{*}_{1,j}$ and $z^{*}_{2,j}$ are two optimal codes corresponding to two ground-truth modes for a particular input. Since $\|z^{*}_{2,j}-z^{*}_{1,j}\|$ is lower bounded by $\frac{\lVert\mathcal{G}(x_{j},w^{*},z^{*}_{2,j})-\mathcal{G}(x_{j},w^{*},z^{*}_{1,j})\rVert}{L}$, where $L$ is the Lipschitz constant, the minimum distance between the two latent codes is proportional to the difference between the corresponding ground-truth modes. In practice, we observed that this encourages the optimum latent codes to be placed sparsely (visual illustration in App. 2), which helps a network learn distinctive paths towards different modes.
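The mechanism we use to enforce this constraint is described in App. 2.4, which is not reproduced here. The snippet below is therefore only a minimal sketch of one common way to realize a soft Lipschitz constraint on $\mathcal{G}$ with respect to $z$, namely a gradient penalty; the generator signature G(x, z) and the penalty target are our assumptions.

```python
import torch

def z_gradient_penalty(G, x, z, target=1.0):
    """Soft Lipschitz regularizer on G w.r.t. the latent z.

    A minimal sketch (our assumption, not necessarily the scheme of App. 2.4):
    penalize deviations of ||dG/dz|| from a target constant so that gradients
    along latent-space paths neither vanish nor explode (cf. Eq. 3)."""
    z = z.detach().requires_grad_(True)
    y = G(x, z)
    grad = torch.autograd.grad(
        outputs=y, inputs=z,
        grad_outputs=torch.ones_like(y),
        create_graph=True, retain_graph=True)[0]
    grad_norm = grad.flatten(start_dim=1).norm(dim=1)
    return ((grad_norm - target) ** 2).mean()

# Hypothetical usage, added to the main objective with a small weight:
# loss = reconstruction_loss + lambda_gp * z_gradient_penalty(G, x, z)
```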

2.1 Convergence at inference

We formulate finding the convergence path of $z$ at inference as a regression problem, i.e., $z_{t+1}=r(z_{t},x_{j})$. We implement $r(\cdot)$ as a recurrent neural network (RNN). The series of predicted values $\{z_{t+k}:k=1,2,\dots,N\}$ can be modeled as a first-order Markov chain, requiring no memory for the RNN. We observe that enforcing Lipschitz continuity on $\mathcal{G}$ over $z$ leads to smooth trajectories even in high-dimensional settings; hence, memorizing more than one step into the past is redundant. However, $z_{t+1}$ is not a state variable, i.e., the existence of multiple modes for the output prediction $\bar{y}$ leads to multiple possible solutions for $z_{t+1}$. In contrast, $\mathbb{E}[z_{t+1}]$ is a state variable w.r.t. the state $(z_{t},x)$, which can be used as an approximation to reach the optimal $z^{*}$ at inference. Therefore, instead of directly learning $r(\cdot)$, we learn a simplified version $r^{\prime}(z_{t},x)=\mathbb{E}[z_{t+1}]$. Intuitively, the whole process can be understood as observing the behavior of $z$ on a smooth surface during training, and predicting its movement at inference. A key aspect of $r^{\prime}(z_{t},x)$ is that the model is capable of converging to multiple possible optimum modes at inference, depending on the initial position of $z$.

2.2 Momentum as a supplementary aid

Based on Sec. 2.1, $z$ can now traverse to an optimal position $z^{*}$ during inference. However, there can exist rare symmetrical positions in $\zeta$ where $\mathbb{E}[z_{t+1}]-z_{t}\approx 0$ although far away from $\{z^{*}\}$, forcing $z_{t+1}\approx z_{t}$. Simply put, this can occur when, from such a position, $z$ has traveled towards different modes in opposing (non-orthogonal) directions during training, so that the vector sum of the observed updates is $\approx 0$. This can fool the system into falsely identifying convergence points, forming phantom optimum point distributions amongst the true distribution (see Fig. 2). To avoid such behavior, we consider $\vec{v}(z_{t},x_{j})=(z_{t+1}-z_{t})_{x_{j}}$. Then, we learn the expected momentum $\mathbb{E}[\rho(z_{t},x_{j})]=\alpha\mathbb{E}[|\vec{v}(z_{t},x_{j})|]$ at each $(z_{t},x_{j})$ during the training phase, where $\alpha$ is an empirically chosen scalar. In practice, $\mathbb{E}[\rho(z_{t},x_{j})]\rightarrow 0$ as $z_{t+1},z_{t}\rightarrow\{z^{*}\}$. Thus, to avoid phantom distributions, we improve the $z$ update as,

$$z_{t+1}=z_{t}+\mathbb{E}[\rho(z_{t},x_{j})]\left[\frac{r^{\prime}(z_{t},x_{j})-z_{t}}{\lVert r^{\prime}(z_{t},x_{j})-z_{t}\rVert}\right]. \qquad (4)$$

Since both $\mathbb{E}[\rho(z_{t},x_{j})]$ and $r^{\prime}(z_{t},x_{j})$ are functions of $(z_{t},x_{j})$, we jointly learn these two functions using a single network $\mathcal{Z}(z_{t},x_{j})$. Note that the coefficient $\mathbb{E}[\rho(z_{t},x_{j})]$ serves two practical purposes: 1) it slows down the movement of $z$ near the true distributions, and 2) it pushes $z$ out of the phantom distributions.
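For concreteness, a minimal sketch of the momentum-corrected update of Eq. 4 follows; we assume that $\mathcal{Z}$ returns both the predicted next code $r^{\prime}(z_{t},x)$ and the expected momentum $\mathbb{E}[\rho]$ as a per-sample scalar.

```python
import torch

def update_latent(Z, z_t, h_x, eps=1e-8):
    """One inference step of Eq. 4.

    Z is assumed to map (z_t, h_x) -> (r_prime, rho), where r_prime
    approximates E[z_{t+1}] and rho approximates the expected momentum
    E[rho(z_t, x)] with shape (batch, 1) so it broadcasts over z."""
    r_prime, rho = Z(z_t, h_x)
    direction = r_prime - z_t
    direction = direction / (direction.norm(dim=-1, keepdim=True) + eps)
    return z_t + rho * direction
```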

3 Overall Design

Figure 1: Training and inference process: (a) training, (b) inference. Refer to Algorithm 1 for the training process. At inference, $z$ is iteratively updated using the predictions of $\mathcal{Z}$ and fed to $\mathcal{G}$ to obtain increasingly fine-tuned outputs (see Sec. 3).
sample inputs $\{x_{1},x_{2},\dots,x_{J}\}\in\chi$; sample outputs $\{y_{1},y_{2},\dots,y_{J}\}\in\Upsilon$;
for $k$ epochs do
       for $x$ in $\chi$ do
             for $l$ steps do
                   update $z=\{z_{1},z_{2},\dots,z_{J}\}$ using $\nabla_{z}\hat{E}$   ▷ freeze $\mathcal{H},\mathcal{G},\mathcal{Z}$ and update $z$
                   update $\mathcal{Z}$ using $\nabla_{w}L_{1}[(z_{t+1},\rho),\mathcal{Z}(z_{t},\mathcal{H}(x))]$   ▷ freeze $\mathcal{H},\mathcal{G},z$ and update $\mathcal{Z}$
             update $\mathcal{G},\mathcal{H}$ using $\nabla_{w}\hat{E}$   ▷ freeze $\mathcal{Z},z$ and update $\mathcal{H},\mathcal{G}$
Algorithm 1: Training algorithm
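A minimal PyTorch-style sketch of Algorithm 1 follows; the module interfaces, the per-sample latent bank, and the use of the latent step length as the momentum target are our assumptions (the exact training details are in App. 3.1).

```python
import torch
import torch.nn.functional as F

def train(H, G, Z, loader, z_bank, opt_hg, opt_Z, epochs, l_steps, beta):
    """Sketch of Algorithm 1. z_bank is a tensor holding one latent code per
    training sample (initialized on the unit sphere, see Sec. 3)."""
    for _ in range(epochs):
        for x, y, idx in loader:                 # idx indexes this batch's latent codes
            for _ in range(l_steps):
                # (i) freeze H, G, Z and update z by a gradient step on E_hat
                z = z_bank[idx].detach().requires_grad_(True)
                e_hat = F.l1_loss(G(H(x), z), y)
                grad_z, = torch.autograd.grad(e_hat, z)
                z_next = z.detach() - beta * grad_z
                rho = (z_next - z.detach()).norm(dim=-1, keepdim=True)  # momentum target (up to the scale alpha)
                z_bank[idx] = z_next

                # (ii) freeze H, G, z and train Z to predict (z_{t+1}, rho)
                r_pred, rho_pred = Z(z.detach(), H(x).detach())
                loss_Z = F.l1_loss(r_pred, z_next) + F.l1_loss(rho_pred, rho)
                opt_Z.zero_grad(); loss_Z.backward(); opt_Z.step()

            # (iii) freeze Z, z and update H, G on E_hat
            loss_hg = F.l1_loss(G(H(x), z_bank[idx]), y)
            opt_hg.zero_grad(); loss_hg.backward(); opt_hg.step()
```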

The proposed model consists of three major blocks, as shown in Fig. 1: an encoder $\mathcal{H}$, a generator $\mathcal{G}$, and a latent-update network $\mathcal{Z}$. Note that for the derivations in Sec. 2, we used $x$ instead of $h=\mathcal{H}(x)$, as $h$ is a high-level representation of $x$. The training process is illustrated in Algorithm 1. At each optimization step $z_{t+1}=z_{t}-\beta\nabla_{z_{t}}[\hat{E}_{i,j}]$, $\mathcal{Z}$ is trained separately to approximate $(z_{t+1},\rho)$. At inference, $x$ is fed to $\mathcal{H}$, and then $\mathcal{Z}$ optimizes the output $\bar{y}$ by updating $z$ for a pre-defined number of iterations of Eq. 4. For $\hat{E}(\cdot,\cdot)$, we use the $L_{1}$ loss. Furthermore, it is important to limit the search space for $z_{t+1}$ to improve the performance of $\mathcal{Z}$. To this end, we sample $z$ from the surface of the $n$-dimensional unit sphere ($\mathbb{S}^{n}$). Moreover, to ensure faster convergence of the model, we enforce Lipschitz continuity on both $\mathcal{Z}$ and $\mathcal{G}$ (App. 2.4). For hyper-parameters and training details, see App. 3.1.
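Below is a corresponding sketch of the inference procedure in Fig. 1(b), under the same assumptions as the training sketch above: $z$ is initialized on the unit sphere $\mathbb{S}^{n}$ and refined for a fixed number of iterations with $\mathcal{Z}$ before the final generator pass.

```python
import torch

def sample_on_sphere(batch, dim, device="cpu"):
    """Sample z uniformly from the surface of the n-dimensional unit sphere S^n."""
    z = torch.randn(batch, dim, device=device)
    return z / z.norm(dim=-1, keepdim=True)

@torch.no_grad()
def infer(H, G, Z, x, z_dim, n_iters=20):
    """Iteratively refine z with Z (Eq. 4) and decode the final output with G."""
    h = H(x)
    z = sample_on_sphere(x.shape[0], z_dim, device=x.device)
    for _ in range(n_iters):
        z = update_latent(Z, z, h)   # momentum-corrected step (see the Sec. 2.2 sketch)
    return G(h, z)
```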

4 Motivation

Here, we explain the drawbacks of conditional GAN methods and illustrate our idea via a toy example.

Incompatibility of adversarial and reconstruction losses: cGANs use a combination of adversarial and reconstruction losses. We note that this combination is suboptimal for modeling CMM spaces.
Remark: Consider a generator $G(x,z)$ and a discriminator $D(x,z)$, where $x$ and $z$ are the input and the noise vector, respectively. Then, consider an arbitrary input $x_{j}$ and the corresponding set of ground-truths $\{y^{g}_{i,j}\},\,i=1,2,\dots,N$. Further, let us define the optimal generator $G^{*}(x_{j},z)=\hat{y},\,\hat{y}\in\{y^{g}_{i,j}\}$, $L_{GAN}=\mathbb{E}_{i}[\log D(y^{g}_{i,j})]+\mathbb{E}_{z}[\log(1-D(G(x_{j},z)))]$, and $L_{\ell}=\mathbb{E}_{i,z}[|y^{g}_{i,j}-G(x_{j},z)|]$. Then, $G^{*}\neq\hat{G}^{*}$, where $\hat{G}^{*}=\arg\min_{G}\max_{D}\,L_{GAN}+\lambda L_{\ell}$, $\forall\lambda\neq 0$. (Proof in App. 2.3.)

Generalizability: The incompatibility of the above loss functions demands domain-specific design choices from models that target high realism in CMM settings. This hinders generalizability across different tasks (Vitoria et al., 2020; Zeng et al., 2019). We further argue that, due to this discrepancy, cGANs learn sub-optimal features that are less useful for downstream tasks (Sec. 5.3).

Convergence and sensitivity to the architecture: The difficulty of converging GANs to the Nash equilibrium of a non-convex min-max game in high-dimensional spaces is well explored (Barnett, 2018; Chu et al., 2020; Kodali et al., 2017). Goodfellow et al. (2014b) show that if the discriminator has enough capacity and is optimal at every step of the GAN algorithm, then the generated distribution converges to the real distribution; neither condition can be guaranteed in a practical scenario. In fact, Arora et al. (2018) confirmed that the adversarial objective can easily approach an equilibrium even if the generated distribution has very low support and, further, that the number of training samples required to avoid mode collapse can be in the order of $\exp(d)$ ($d$ is the data dimension).

Multimodality: The ability to generate diverse outputs, i.e., to converge to multiple modes in the output space, is an important requirement. Despite the typical noise input, cGANs generally lack the ability to generate diverse outputs (Lee et al., 2019). Pathak et al. (2016) and Iizuka et al. (2016) even state that better results are obtained when the noise is completely removed. Further, variants of cGANs that target diversity often face a trade-off between realism and diversity (He et al., 2018), as they have to compromise between the reconstruction and adversarial losses.

Figure 2: Toy example: plots generated for each dimension of the CMM space $\Upsilon$ ($y_{1}=4x$, $y_{2}=4x^{2}$, $y_{3}=4x^{3}$). (a) Ground-truth distributions. (b) Model outputs for the $L_{1}$ loss. (c) Output when trained with the proposed objective (without $\rho$ correction); note the phantom distribution identified by the model. (d) $\mathbb{E}[\rho]$ as a heatmap over $(x,y)$; $\mathbb{E}[\rho]$ is lower near the true distribution and higher otherwise. (e) Model outputs after $\rho$ correction.

A toy example: Here, we experiment with the formulations in Sec. 2. Consider a 3D CMM space $y=\pm 4(x,x^{2},x^{3})$. We construct three-layer multi-layer perceptrons (MLPs) to represent each of the functions $\mathcal{H}$, $\mathcal{G}$, and $\mathcal{Z}$, and compare the proposed method against the $L_{1}$ loss. Figure 2 illustrates the results. As expected, the $L_{1}$ loss generates the line $y=0$ and is inadequate to model the multimodal space. As explained in Sec. 2.2, without momentum correction, the network is fooled by a phantom distribution where $\mathbb{E}[z_{t+1}]-z_{t}\approx 0$ at training time. However, the momentum push removes the phantom distribution and refines the output to closely resemble the input distribution. As implied in Sec. 2.2, the momentum is minimized near the true distribution and maximized otherwise.
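The toy CMM space can be reproduced in a few lines. The sketch below only builds the one-to-many dataset $y=\pm 4(x,x^{2},x^{3})$; the input range and sampling scheme are our assumptions.

```python
import numpy as np

def toy_cmm_dataset(n=10_000, seed=0):
    """One-to-many toy data (Sec. 4): each x maps to y = +4*(x, x^2, x^3) or
    y = -4*(x, x^2, x^3) with equal probability."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, size=(n, 1))          # assumed input range
    sign = rng.choice([-1.0, 1.0], size=(n, 1))
    y = sign * 4.0 * np.concatenate([x, x**2, x**3], axis=1)
    return x.astype(np.float32), y.astype(np.float32)

# Regressing y from x alone with an L1 (or L2) loss averages the two symmetric
# modes, which is why the plain-L1 baseline in Fig. 2(b) degenerates to y = 0.
```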

5 Experiments and discussions

The distribution of natural images lies on a high-dimensional manifold, making the task of modeling it extremely challenging. Moreover, conditional image generation poses an additional challenge with its constrained multimodal output space (a single input may correspond to multiple outputs, while not all of them are available for training). In this section, we experiment on several such tasks. For a fair comparison with a similar-capacity GAN, we use the encoder and decoder architectures of Pathak et al. (2016) for $\mathcal{H}$ and $\mathcal{G}$, respectively. We make two minor modifications: the channel-wise fully connected (FC) layers are removed and U-Net-style skip connections are added (see App. 3.1). We train the existing models for a maximum of 200 epochs where pretrained weights are not provided, and demonstrate the generalizability of our theoretical framework in diverse practical settings by using a generic network for all the experiments. Models used for comparisons are denoted as follows: PN (Zeng et al., 2019), CA (Yu et al., 2018b), DSGAN (Yang et al., 2019), CIC (Zhang et al., 2016), Chroma (Vitoria et al., 2020), P2P (Isola et al., 2017), Izuka (Iizuka et al., 2016), CE (Pathak et al., 2016), CRN (Chen & Koltun, 2017a), and B-GAN (Zhu et al., 2017b).

Figure 3: Performance with 20% corrupted data (rows: GT, Input, $L_{1}$, CE, Ours). Our model demonstrates better convergence compared to the $L_{1}$ loss and a similar-capacity GAN (Pathak et al., 2016).
Figure 4: With $>30\%$ alternate-mode data, our model can converge to both input modes (columns: GT 1 (70%), GT 2 (30%), Input, Output 1, Output 2).
Figure 5: The prediction quality increases as $z$ traverses to an optimum position at inference (iterations 0, 5, 10, 15, 20).
Method    User study (STL)    User study (ImageNet)    Turing test (ImageNet)
Izuka     21.89               32.28                    -
Chroma    32.40               31.67                    -
Ours      45.71               36.05                    31.66
Table 1: Colorization: psychophysical study and Turing test results. All performances are in %.
                 STL                                    ImageNet
Method           LPIP↓   PieAPP↓   SSIM↑   PSNR↑        LPIP↓   PieAPP↓   SSIM↑   PSNR↑
Izuka            0.18    2.37      0.81    24.30        0.17    2.47      0.87    18.43
P2P              1.21    2.69      0.73    17.80        2.01    2.80      0.87    18.43
CIC              0.18    2.81      0.71    22.04        0.19    2.56      0.71    19.11
Chroma           0.16    2.06      0.91    25.57        0.16    2.13      0.90    23.33
Ours             0.12    1.47      0.95    27.03        0.16    2.04      0.92    24.51
Ours (w/o ρ)     0.16    1.90      0.89    25.02        0.20    2.11      0.88    23.21
Table 2: Colorization: quantitative analysis of our method against the state-of-the-art. Ours performs better on a variety of metrics.

5.1 Corrupted Image Recovery

We design this task as image completion, i.e., given a masked image as input, our goal is to recover the masked area. Interestingly, we observed that the MNIST dataset, in its original form, does not exhibit multimodal behaviour, i.e., a fraction of an input image maps to only a single output. Therefore, we modify the training data as follows: first, we overlap the top half of an input image with the top half of another randomly sampled image. We carry out this corruption for 20% of the training data. Corrupted samples are not fixed across epochs. Then, we apply a random-sized mask to the top half and ask the network to predict the missing pixels. We choose two competitive baselines here: our network with the $L_{1}$ loss, and CE. Fig. 3 illustrates the predictions. As shown, our model converges to the most probable non-corrupted mode without any ambiguity, while the other baselines give sub-optimal results. In the next experiment, we add a small white box to the top part of the ground-truth images at different rates. At inference, our model was able to converge to both modes (Fig. 4), depending on the initial position of $z$, once the probability of the alternate mode reaches 0.3.
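A sketch of the corruption procedure described above follows; the mask-size ranges are our assumptions, while the 20% corruption rate and the top-half overlap follow the text.

```python
import numpy as np

def corrupt_batch(images, rng, corrupt_prob=0.2):
    """Make MNIST multimodal as in Sec. 5.1: for ~20% of the samples, overwrite
    the top half with the top half of another randomly drawn image, then mask a
    random-sized region in the top half for the network to recover.

    images: (N, H, W) array in [0, 1]. Returns (inputs, targets, masks)."""
    imgs = images.copy()
    n, h, w = imgs.shape
    half = h // 2

    # 1) mode corruption: blend in the top half of a random other image
    pick = rng.random(n) < corrupt_prob
    donors = rng.integers(0, n, size=n)
    imgs[pick, :half] = images[donors[pick], :half]

    # 2) random-sized rectangular mask on the top half (assumed size range)
    inputs, masks = imgs.copy(), np.zeros_like(imgs, dtype=bool)
    for i in range(n):
        mh = rng.integers(4, half)            # mask height (assumption)
        mw = rng.integers(4, w // 2)          # mask width (assumption)
        r, c = rng.integers(0, half - mh + 1), rng.integers(0, w - mw + 1)
        inputs[i, r:r + mh, c:c + mw] = 0.0
        masks[i, r:r + mh, c:c + mw] = True
    return inputs, imgs, masks
```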

5.2 Automatic image colorization

Deep models have tackled this problem using semantic priors (Iizuka et al., 2016; Vitoria et al., 2020), adversarial and L1L_{1} losses (Isola et al., 2017; Zhu et al., 2017a; Lee et al., 2019), or by conversion to a discrete form through binning of color values (Zhang et al., 2016). Although these methods provide compelling results, several inherent limitations exist: (a) use of semantic priors results in complex models, (b) adversarial loss suffers from drawbacks (see Sec. 4), and (c) discretization reduces the precision. In contrast, we achieve better results using a simpler model.

The input and output of the network are the $l$ and $(a,b)$ planes, respectively (LAB color space). However, since the color distributions of the $a$ and $b$ channels are highly imbalanced over a natural dataset (Zhang et al., 2016), we add another constraint to the cost function $E$ to push the predicted $a$ and $b$ colors towards a uniform distribution: $E=\|a_{gt}-a\|+\|b_{gt}-b\|+\lambda(loss_{kl,a}+loss_{kl,b})$, where $loss_{kl,\cdot}=\mathrm{KL}(\cdot\,\|\,u(0,1))$. Here, $\mathrm{KL}(\cdot\|\cdot)$ is the KL divergence and $u(0,1)$ is a uniform distribution (see App. 3.3). Fig. 6 and Table 2 depict our qualitative and quantitative results, respectively. We demonstrate the superior performance of our method on four metrics: LPIP, PieAPP, SSIM and PSNR (App. 3.2). Fig. 9 depicts examples of multimodality captured by our model (more examples in App. 3.4). Fig. 5 shows the colorization behaviour as $z$ converges during inference.
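A minimal sketch of this colorization objective, assuming the $(a,b)$ channels are normalized to $[-1,1]$; the soft-histogram construction, the bin count, the temperature, and $\lambda$ are our assumptions (the exact formulation is in App. 3.3).

```python
import torch
import torch.nn.functional as F

def soft_histogram(x, bins=32, lo=-1.0, hi=1.0, tau=0.02):
    """Differentiable soft histogram of the values in x via RBF bin assignment."""
    centers = torch.linspace(lo, hi, bins, device=x.device)
    w = torch.softmax(-((x.reshape(-1, 1) - centers) ** 2) / tau, dim=1)
    p = w.mean(dim=0)
    return p / p.sum()

def colorization_loss(a_pred, b_pred, a_gt, b_gt, lam=0.1, bins=32):
    """L1 reconstruction on the (a, b) planes plus a KL term that pushes the
    predicted colour distribution towards uniform (Sec. 5.2 sketch)."""
    recon = F.l1_loss(a_pred, a_gt) + F.l1_loss(b_pred, b_gt)
    u = torch.full((bins,), 1.0 / bins, device=a_pred.device)
    kl = 0.0
    for pred in (a_pred, b_pred):
        p = soft_histogram(pred, bins=bins)
        kl = kl + torch.sum(p * (torch.log(p + 1e-8) - torch.log(u)))  # KL(p || uniform)
    return recon + lam * kl
```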

User study: We also conduct two user studies to further validate the quality of the generated samples (Table 1). a) In the psychophysical study, we present volunteers with batches of 3 images, each generated with a different method. A batch is displayed for 5 seconds and the user has to pick the most realistic image; after 5 seconds, the next batch is displayed. b) We conduct a Turing test to validate our output quality against the ground-truth, following the setting proposed by Zhang et al. (2016). The volunteers are presented with a series of paired images (ground-truth and our output). The images are visible for 1 second, and then the user has unlimited time to pick the image they believe is real.

Figure 6: Qualitative comparison against the state-of-the-art (rows: GT, Izuka, P2P, Chroma, CIC, Ours) on the ImageNet (left 5 columns) and STL (right 5 columns) datasets. Our model generally produces more vibrant and balanced color distributions.

Figure 7: Image completion on the Celeb-HQ (left) and Facades (right) datasets (rows: GT, Input, Ours). We used fixed center masks and random irregular masks (Liu et al., 2018) for the Celeb-HQ and Facades datasets, respectively.
Figure 8: Qualitative comparison for image completion with 25% missing data (columns: GT, Input, P2P, CA, PN, Ours); models trained with random-sized square masks.
                 10% corruption                        15% corruption                        25% corruption
Method           LPIP↓   PieAPP↓   PSNR↑   SSIM↑       LPIP↓   PieAPP↓   PSNR↑   SSIM↑       LPIP↓   PieAPP↓   PSNR↑   SSIM↑
DSGAN            0.101   1.577     20.13   0.67        0.189   2.970     18.45   0.55        0.213   3.54      16.44   0.49
PN               0.045   0.639     27.11   0.88        0.084   0.680     20.50   0.71        0.147   0.764     19.41   0.63
CE               0.092   1.134     22.34   0.71        0.134   2.134     19.11   0.63        0.189   2.717     17.44   0.51
P2P              0.074   0.942     22.33   0.79        0.101   1.971     19.34   0.70        0.185   2.378     17.81   0.57
CA               0.048   0.731     26.45   0.83        0.091   0.933     20.12   0.72        0.166   0.822     21.43   0.72
Ours (w/o ρ)     0.053   0.799     27.77   0.83        0.085   0.844     23.22   0.76        0.141   0.812     22.31   0.74
Ours             0.051   0.727     27.83   0.89        0.080   0.740     26.43   0.80        0.129   0.760     24.16   0.77
Table 3: Image completion: quantitative analysis of our method against the state-of-the-art on a variety of metrics.
Figure 9: Multiple colorization modes predicted by our model for a single input (two modes per input). Best viewed in color.
Figure 10: Multi-modality of our predictions on the Celeb-HQ dataset. (Best viewed with zoom.)
Figure 11: Translation from hand-drawn handbag sketches to images.
Figure 12: Translation from hand-drawn shoe sketches to images.
Figure 13: Map to aerial image translation. From left: GT, Input, Output. Also see App. 5.2.

5.3 Image completion

Here, we show that our generic model outperforms a similar-capacity GAN (CE) as well as task-specific GANs. In contrast to the task-specific models, we do not use any domain-specific modifications to make our outputs perceptually pleasing. We observe that with random irregular and fixed-size masks, all the models perform well, and we were not able to visually observe a considerable difference (Fig. 7; see App. 3.11 for more results). Therefore, we presented the models with a more challenging task: train with random-sized square masks and evaluate against masks of varying sizes. Fig. 8 illustrates qualitative results of the models with 25% masked data. As evident, our model recovers details more accurately compared to the state-of-the-art. Notably, all models produce comparable results when trained with a fixed-size center mask, but find this random-mask setting more challenging. Table 3 includes a quantitative comparison. Observe that for smaller masks, PN performs slightly better than ours, but worse otherwise. We also evaluate the learned features of the models on a downstream classification task (Table 5). First, we train all the models on Facades (Tyleček & Šára, 2013) with random masks, then apply the trained models on CIFAR10 (Krizhevsky et al., 2009) to extract bottleneck features, and finally pass these through an FC layer for classification (App. 3.7). We compare PN and ours against an oracle (AlexNet features pre-trained on ImageNet) and show that our model performs closer to the oracle.
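The downstream evaluation follows a standard linear-probe protocol; the sketch below reflects our assumptions (frozen encoder, a single FC layer, Adam), with the exact setup in App. 3.7.

```python
import torch
import torch.nn as nn

def linear_probe(H, feat_dim, num_classes, loader, epochs=10, lr=1e-3, device="cpu"):
    """Freeze the pretrained encoder H, extract bottleneck features on CIFAR10,
    and train a single FC layer on top (Table 5 protocol, our sketch)."""
    H = H.to(device).eval()
    clf = nn.Linear(feat_dim, num_classes).to(device)
    opt = torch.optim.Adam(clf.parameters(), lr=lr)
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            with torch.no_grad():
                feats = H(x).flatten(start_dim=1)   # bottleneck features
            loss = ce(clf(feats), y)
            opt.zero_grad(); loss.backward(); opt.step()
    return clf
```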

Figure 14: Diversity: quantitative comparisons.

Figure 15: Translation from facial landmarks to faces.
Figure 16: Translation from surface normals to pet faces.
Figure 17: Scalability: we subsequently add layers to the architecture to be trained on increasingly high-resolution inputs (columns: Input, 32×32, 64×64, 128×128, 256×256).
Figure 18: Qualitative comparison of 3D spectral denoising (columns: GT, Input, CE, cVAE, Ours). The results are converted to the spatial domain for clear visualization.

5.3.1 Diversity and other compelling attributes

We also experiment on a diverse set of image translation tasks to demonstrate our generalizability. Figs. 11, 12, 13, 15 and 16 illustrate qualitative results on the sketch-to-handbag, sketch-to-shoes, map-to-aerial, landmarks-to-faces and surface-normals-to-pets tasks. Figs. 9, 10, 11, 12, 15 and 16 show the ability of our model to converge to multiple modes, depending on the $z$ initialization. Fig. 14 shows the quantitative comparison against other models. See App. 3.4 for further details on these experiments. Another appealing feature of our model is its strong convergence properties irrespective of the architecture, and hence its scalability to different input sizes. Fig. 17 shows examples from image completion and colorization for varying input sizes. We add layers to the architecture to train on increasingly high-resolution inputs, and our model was able to converge to optimal modes at each scale (App. 3.8). Fig. 19 demonstrates our faster and more stable convergence.

Figure 19: Convergence on image completion (Paris view). Our model exhibits rapid and stable convergence compared to the state-of-the-art (PN, CE, P2P, CA).
Method                       M10       M40
Sharma et al. (2016)         80.5%     75.5%
Han et al. (2019)            92.2%     90.2%
Achlioptas et al. (2017)     95.3%     85.7%
Yang et al. (2018)           94.4%     88.4%
Sauder & Sievers (2019)      94.5%     90.6%
Ramasinghe et al. (2019c)    93.1%     -
Khan et al. (2019)           92.2%     -
Ours                         92.4%     90.9%
Table 4: Downstream 3D object classification results on ModelNet10 and ModelNet40 using features learned in an unsupervised manner. All results in % accuracy.
Method      Pretext           Acc. (%)
ResNet*     ImageNet Cls.     74.2
PN          Im. Completion    40.3
Ours        Im. Completion    62.5
Table 5: Comparison on a downstream task (CIFAR10 classification). (*) denotes the oracle case.
Method    M10      M40
CE        10.3     4.6
cVAE      8.7      4.2
Ours      84.2     79.4
Table 6: Reconstruction mAP of 3D spectral denoising.

5.4 Denoising of 3D objects in spectral space

Spectral moments of 3D objects provide a compact representation and help build lightweight networks (Ramasinghe et al., 2020, 2019b; Cohen et al., 2018; Esteves et al., 2018). However, spectral information of 3D objects has not been used before for self-supervised learning, a key reason being the difficulty of learning representations in the spectral domain due to its complex structure and unbounded spectral coefficients. Here, we present an efficient pretext task conducted in the spectral domain: denoising 3D spectral maps. We use two types of spectral spaces: spherical harmonics and Zernike polynomials (App. 4). We first convert the 3D point clouds to spherical harmonic coefficients, arrange the values as a 2D map, and mask or add noise to a portion of the map (App. 3.12). The goal is to recover the original spectral map. Fig. 18 and Table 6 depict our qualitative and quantitative results. We perform favorably against the other methods. To evaluate the learned features, we use Zernike polynomials, as they are more discriminative than spherical harmonics (Ramasinghe et al., 2019a). We first train the network on the 55k ShapeNet objects by denoising spectral maps, and then apply the trained network on ModelNet10 and ModelNet40. The features are extracted from the bottleneck (similar to Sec. 5.3) and fed to an FC classifier (Table 4). We achieve state-of-the-art results on ModelNet40 with a simple pretext task.
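As a rough illustration of the pretext-task input, the sketch below approximates spherical-harmonic coefficients of a point cloud, treated as a radial function over the sphere, and arranges them as a 2D map. The Monte-Carlo projection and the (l, m) layout are our assumptions; the authors' exact construction (including the Zernike variant) is described in App. 4 and App. 3.12.

```python
import numpy as np
from scipy.special import sph_harm

def spectral_map(points, l_max=16):
    """Crude sketch: approximate spherical-harmonic coefficients of a point
    cloud r(theta, phi), arranged as a complex 2D map of shape
    (l_max + 1, 2 * l_max + 1). Assumes roughly uniform angular coverage."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.sqrt(x**2 + y**2 + z**2) + 1e-12
    theta = np.arctan2(y, x) % (2 * np.pi)          # azimuth in [0, 2*pi)
    phi = np.arccos(np.clip(z / r, -1.0, 1.0))      # polar angle in [0, pi]

    coeffs = np.zeros((l_max + 1, 2 * l_max + 1), dtype=np.complex128)
    for l in range(l_max + 1):
        for m in range(-l, l + 1):
            Y = sph_harm(m, l, theta, phi)          # scipy: sph_harm(m, n, azimuth, polar)
            coeffs[l, m + l_max] = (4 * np.pi / len(r)) * np.sum(r * np.conj(Y))
    return coeffs

# A denoising pretext task can then mask or perturb a region of this map
# (e.g., stacked real/imaginary parts) and train the model to recover it.
```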

6 Conclusion

Conditional generation in multimodal domains is a challenging task due to its ill-posed nature. In this paper, we propose a novel generative framework that minimizes a family of cost functions during training. Further, it observes the convergence patterns of the latent variables during training and applies this knowledge at inference to traverse to multiple output modes. Despite using a simple and generic architecture, we show impressive results on a diverse set of tasks. The proposed approach demonstrates faster convergence, scalability, generalizability, diversity and superior representation learning capability for downstream tasks.

References

  • Achlioptas et al. (2017) Panos Achlioptas, Olga Diamanti, Ioannis Mitliagkas, and Leonidas Guibas. Representation learning and adversarial generation of 3d point clouds. arXiv preprint arXiv:1707.02392, 2017.
  • Arora & Zhang (2017) Sanjeev Arora and Yi Zhang. Do gans actually learn the distribution? an empirical study. arXiv preprint arXiv:1706.08224, 2017.
  • Arora et al. (2018) Sanjeev Arora, Andrej Risteski, and Yi Zhang. Do GANs learn the distribution? some theory and empirics. In International Conference on Learning Representations, 2018.
  • Bansal et al. (2017a) Aayush Bansal, Xinlei Chen, Bryan Russell, Abhinav Gupta, and Deva Ramanan. Pixelnet: Representation of the pixels, by the pixels, and for the pixels. arXiv preprint arXiv:1702.06506, 2017a.
  • Bansal et al. (2017b) Aayush Bansal, Yaser Sheikh, and Deva Ramanan. Pixelnn: Example-based image synthesis. arXiv preprint arXiv:1708.05349, 2017b.
  • Bao et al. (2017) Jianmin Bao, Dong Chen, Fang Wen, Houqiang Li, and Gang Hua. Cvae-gan: Fine-grained image generation through asymmetric training. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  • Barnett (2018) Samuel A Barnett. Convergence problems with generative adversarial networks (gans). arXiv preprint arXiv:1806.11382, 2018.
  • Chen & Koltun (2017a) Qifeng Chen and Vladlen Koltun. Photographic image synthesis with cascaded refinement networks. In Proceedings of the IEEE international conference on computer vision, pp.  1511–1520, 2017a.
  • Chen & Koltun (2017b) Qifeng Chen and Vladlen Koltun. Photographic image synthesis with cascaded refinement networks. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017b.
  • Chu et al. (2020) Casey Chu, Kentaro Minami, and Kenji Fukumizu. Smoothness and stability in gans. arXiv preprint arXiv:2002.04185, 2020.
  • Cohen et al. (2018) Taco S Cohen, Mario Geiger, Jonas Köhler, and Max Welling. Spherical cnns. arXiv preprint arXiv:1801.10130, 2018.
  • Deshpande et al. (2017) Aditya Deshpande, Jiajun Lu, Mao-Chuang Yeh, Min Jin Chong, and David Forsyth. Learning diverse image colorization. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • Driscoll & Healy (1994) James R Driscoll and Dennis M Healy. Computing fourier transforms and convolutions on the 2-sphere. Advances in applied mathematics, 15(2):202–250, 1994.
  • Du & Mordatch (2019) Yilun Du and Igor Mordatch. Implicit generation and modeling with energy based models. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d Alché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems 32, pp.  3608–3618. Curran Associates, Inc., 2019.
  • Esteves et al. (2018) Carlos Esteves, Christine Allen-Blanchette, Ameesh Makadia, and Kostas Daniilidis. Learning so (3) equivariant representations with spherical cnns. In Proceedings of the European Conference on Computer Vision (ECCV), pp.  52–68, 2018.
  • Fetaya et al. (2020) Ethan Fetaya, Jörn-Henrik Jacobsen, Will Grathwohl, and Richard Zemel. Understanding the limitations of conditional generative models. In International Conference on Learning Representations, 2020.
  • Ghosh et al. (2018) Arnab Ghosh, Viveka Kulharia, Vinay P. Namboodiri, Philip H.S. Torr, and Puneet K. Dokania. Multi-agent diverse generative adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • Goodfellow et al. (2014a) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems 27, pp. 2672–2680. 2014a.
  • Goodfellow et al. (2014b) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014b.
  • Grathwohl et al. (2020) Will Grathwohl, Kuan-Chieh Wang, Joern-Henrik Jacobsen, David Duvenaud, Mohammad Norouzi, and Kevin Swersky. Your classifier is secretly an energy based model and you should treat it like one. In International Conference on Learning Representations, 2020.
  • Han et al. (2019) Zhizhong Han, Mingyang Shang, Yu-Shen Liu, and Matthias Zwicker. View inter-prediction gan: Unsupervised representation learning for 3d shapes by learning global shape memories to support local view predictions. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp.  8376–8384, 2019.
  • He et al. (2018) Yang He, Bernt Schiele, and Mario Fritz. Diverse conditional image generation by stochastic regression with latent drop-out codes. In Proceedings of the European Conference on Computer Vision (ECCV), pp.  406–421, 2018.
  • Huang et al. (2018) Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz. Multimodal unsupervised image-to-image translation. In The European Conference on Computer Vision (ECCV), September 2018.
  • Iizuka et al. (2016) Satoshi Iizuka, Edgar Simo-Serra, and Hiroshi Ishikawa. Let there be color! joint end-to-end learning of global and local image priors for automatic image colorization with simultaneous classification. ACM Transactions on Graphics (ToG), 35(4):1–11, 2016.
  • Isola et al. (2017) Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  1125–1134, 2017.
  • Jing & Tian (2020) Longlong Jing and Yingli Tian. Self-supervised visual feature learning with deep neural networks: A survey. IEEE transactions on pattern analysis and machine intelligence, 2020.
  • Khan et al. (2019) Salman H Khan, Yulan Guo, Munawar Hayat, and Nick Barnes. Unsupervised primitive discovery for improved 3d generative modeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.  9739–9748, 2019.
  • Kingma & Welling (2014) Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. CoRR, abs/1312.6114, 2014.
  • Kodali et al. (2017) Naveen Kodali, Jacob Abernethy, James Hays, and Zsolt Kira. On convergence and stability of gans. arXiv preprint arXiv:1705.07215, 2017.
  • Krizhevsky et al. (2009) Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. Cifar-10 (canadian institute for advanced research). 2009. URL http://www.cs.toronto.edu/~kriz/cifar.html.
  • Lee et al. (2018) Hsin-Ying Lee, Hung-Yu Tseng, Jia-Bin Huang, Maneesh Singh, and Ming-Hsuan Yang. Diverse image-to-image translation via disentangled representations. In The European Conference on Computer Vision (ECCV), September 2018.
  • Lee et al. (2019) Soochan Lee, Junsoo Ha, and Gunhee Kim. Harmonizing maximum likelihood with GANs for multimodal conditional generation. In International Conference on Learning Representations, 2019.
  • Liu et al. (2018) Guilin Liu, Fitsum A Reda, Kevin J Shih, Ting-Chun Wang, Andrew Tao, and Bryan Catanzaro. Image inpainting for irregular holes using partial convolutions. In Proceedings of the European Conference on Computer Vision (ECCV), pp.  85–100, 2018.
  • Loquercio et al. (2019) Antonio Loquercio, Mattia Segù, and Davide Scaramuzza. A general framework for uncertainty estimation in deep learning. arXiv preprint arXiv:1907.06890, 2019.
  • Maaløe et al. (2019) Lars Maaløe, Marco Fraccaro, Valentin Liévin, and Ole Winther. Biva: A very deep hierarchy of latent variables for generative modeling. In Advances in neural information processing systems, pp. 6548–6558, 2019.
  • Mao et al. (2019) Qi Mao, Hsin-Ying Lee, Hung-Yu Tseng, Siwei Ma, and Ming-Hsuan Yang. Mode seeking generative adversarial networks for diverse image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.  1429–1437, 2019.
  • Mathieu et al. (2015) Michaël Mathieu, Camille Couprie, and Yann LeCun. Deep multi-scale video prediction beyond mean square error. CoRR, abs/1511.05440, 2015.
  • Mirza & Osindero (2014) Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. ArXiv, abs/1411.1784, 2014.
  • Nalisnick et al. (2019) Eric Nalisnick, Akihiro Matsukawa, Yee Whye Teh, Dilan Gorur, and Balaji Lakshminarayanan. Do deep generative models know what they don’t know? International Conference on Learning Representations, 2019.
  • Parkhi et al. (2012) Omkar M. Parkhi, Andrea Vedaldi, Andrew Zisserman, and C. V. Jawahar. Cats and dogs. In IEEE Conference on Computer Vision and Pattern Recognition, 2012.
  • Pathak et al. (2016) Deepak Pathak, Philipp Krähenbühl, Jeff Donahue, Trevor Darrell, and Alexei Efros. Context encoders: Feature learning by inpainting. 2016.
  • Perraudin et al. (2019) Nathanaël Perraudin, Michaël Defferrard, Tomasz Kacprzak, and Raphael Sgier. Deepsphere: Efficient spherical convolutional neural network with healpix sampling for cosmological applications. Astronomy and Computing, 27:130–146, 2019.
  • Prashnani et al. (2018) Ekta Prashnani, Hong Cai, Yasamin Mostofi, and Pradeep Sen. Pieapp: Perceptual image-error assessment through pairwise preference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.  1808–1817, 2018.
  • Ramasinghe et al. (2019a) Sameera Ramasinghe, Salman Khan, and Nick Barnes. Volumetric convolution: Automatic representation learning in unit ball. arXiv preprint arXiv:1901.00616, 2019a.
  • Ramasinghe et al. (2019b) Sameera Ramasinghe, Salman Khan, Nick Barnes, and Stephen Gould. Representation learning on unit ball with 3d roto-translational equivariance. International Journal of Computer Vision, pp.  1–23, 2019b.
  • Ramasinghe et al. (2019c) Sameera Ramasinghe, Salman Khan, Nick Barnes, and Stephen Gould. Spectral-gans for high-resolution 3d point-cloud generation. arXiv preprint arXiv:1912.01800, 2019c.
  • Ramasinghe et al. (2020) Sameera Ramasinghe, Salman Khan, Nick Barnes, and Stephen Gould. Blended convolution and synthesis for efficient discrimination of 3d shapes. In The IEEE Winter Conference on Applications of Computer Vision, pp.  21–31, 2020.
  • Robbins (2007) Herbert E. Robbins. A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407, 2007.
  • Sagong et al. (2019) Min-cheol Sagong, Yong-goo Shin, Seung-wook Kim, Seung Park, and Sung-jea Ko. Pepsi : Fast image inpainting with parallel decoding network. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • Salimans et al. (2016) Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Advances in neural information processing systems, pp. 2234–2242, 2016.
  • Sauder & Sievers (2019) Jonathan Sauder and Bjarne Sievers. Self-supervised deep learning on point clouds by reconstructing space. In Advances in Neural Information Processing Systems, pp. 12942–12952, 2019.
  • Sharma et al. (2016) Abhishek Sharma, Oliver Grau, and Mario Fritz. Vconv-dae: Deep volumetric shape learning without object labels. In European Conference on Computer Vision, pp.  236–250. Springer, 2016.
  • Sohn et al. (2015) Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems 28. 2015.
  • Thanh-Tung et al. (2019) Hoang Thanh-Tung, Truyen Tran, and Svetha Venkatesh. Improving generalization and stability of generative adversarial networks. arXiv preprint arXiv:1902.03984, 2019.
  • Tyleček & Šára (2013) Radim Tyleček and Radim Šára. Spatial pattern templates for recognition of objects with regular structure. In Proc. GCPR, Saarbrucken, Germany, 2013.
  • Vitoria et al. (2020) Patricia Vitoria, Lara Raad, and Coloma Ballester. Chromagan: Adversarial picture colorization with semantic class distribution. In The IEEE Winter Conference on Applications of Computer Vision, pp.  2445–2454, 2020.
  • Wang et al. (2018) Yi Wang, Xin Tao, Xiaojuan Qi, Xiaoyong Shen, and Jiaya Jia. Image inpainting via generative multi-column convolutional neural networks. In Advances in Neural Information Processing Systems 31. 2018.
  • Wang et al. (2003) Zhou Wang, Eero P Simoncelli, and Alan C Bovik. Multiscale structural similarity for image quality assessment. In The Thrity-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, volume 2, pp.  1398–1402. Ieee, 2003.
  • Wang et al. (2004) Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
  • Xie et al. (2018) You Xie, Erik Franz, Mengyu Chu, and Nils Thuerey. tempoGAN: A Temporally Coherent, Volumetric GAN for Super-resolution Fluid Flow. ACM Transactions on Graphics (TOG), 37(4):95, 2018.
  • Yang et al. (2019) Dingdong Yang, Seunghoon Hong, Yunseok Jang, Tianchen Zhao, and Honglak Lee. Diversity-sensitive conditional generative adversarial networks. arXiv preprint arXiv:1901.09024, 2019.
  • Yang et al. (2018) Yaoqing Yang, Chen Feng, Yiru Shen, and Dong Tian. Foldingnet: Point cloud auto-encoder via deep grid deformation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.  206–215, 2018.
  • Yu et al. (2018a) Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S. Huang. Generative image inpainting with contextual attention. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018a.
  • Yu et al. (2018b) Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S Huang. Generative image inpainting with contextual attention. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  5505–5514, 2018b.
  • Zeng et al. (2019) Yanhong Zeng, Jianlong Fu, Hongyang Chao, and Baining Guo. Learning pyramid-context encoder network for high-quality image inpainting. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.  1486–1494, 2019.
  • Zhang et al. (2011) Lin Zhang, Lei Zhang, Xuanqin Mou, and David Zhang. Fsim: A feature similarity index for image quality assessment. IEEE transactions on Image Processing, 20(8):2378–2386, 2011.
  • Zhang et al. (2016) Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In European conference on computer vision, pp.  649–666. Springer, 2016.
  • Zhang et al. (2018) Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
  • Zhang & Qi (2017) Song-Yang Zhang, Zhifei and Hairong Qi. Age progression/regression by conditional adversarial autoencoder. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017.
  • Zhang et al. (2020) Xian Zhang, Xin Wang, Bin Kong, Youbing Yin, Qi Song, Siwei Lyu, Jiancheng Lv, Canghong Shi, and Xiaojie Li. Domain embedded multi-model generative adversarial networks for image-based face inpainting. ArXiv, abs/2002.02909, 2020.
  • Zhu et al. (2016) Jun-Yan Zhu, Philipp Krähenbühl, Eli Shechtman, and Alexei A. Efros. Generative visual manipulation on the natural image manifold. In Proceedings of European Conference on Computer Vision (ECCV), 2016.
  • Zhu et al. (2017a) Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Darrell, Alexei A Efros, Oliver Wang, and Eli Shechtman. Toward multimodal image-to-image translation. In Advances in Neural Information Processing Systems 30. 2017a.
  • Zhu et al. (2017b) Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Darrell, Alexei A Efros, Oliver Wang, and Eli Shechtman. Toward multimodal image-to-image translation. In Advances in neural information processing systems, pp. 465–476, 2017b.

Appendix

1 Related work

Conditional Generative Modeling. Conditional generation involves modeling the data distribution given a set of conditioning variables that control the modes of the generated samples. With the success of VAEs (Kingma & Welling, 2014) and GANs (Goodfellow et al., 2014a) in standard generative modeling tasks, their conditioned counterparts (Sohn et al., 2015; Mirza & Osindero, 2014) have recently dominated conditional generative tasks (Vitoria et al., 2020; Zhang et al., 2016; Isola et al., 2017; Pathak et al., 2016; Lee et al., 2019; Zhu et al., 2017a; Bao et al., 2017; Lee et al., 2018; Zeng et al., 2019). While probabilistic latent variable models such as VAEs generate relatively low-quality samples and poor likelihood estimates at inference (Maaløe et al., 2019), GAN-based models perform significantly better on high-dimensional distributions like natural images but demonstrate unstable training behaviour. A distinct feature of GANs is their mapping of points from a random noise distribution to the various modes of the output distribution. However, in the conditional case, where an additional loss is incorporated to enforce the conditioning on the input, the significantly better performance of GANs is achieved at the expense of multimodality; the conditioning loss pushes the GAN to learn to mostly ignore its noise distribution. In fact, some works intentionally ignore the noise input in order to achieve more stable training (Isola et al., 2017; Pathak et al., 2016; Mathieu et al., 2015; Xie et al., 2018).

Multimodality. Conditional VAE-GANs are one popular approach for generating multimodal outputs (Bao et al., 2017; Zhu et al., 2017a), using the VAE's ability to enforce diversity through its latent variable representation and the GAN's ability to enforce output fidelity through its learnt discriminator model. Mixture models (Chen & Koltun, 2017b; Ghosh et al., 2018; Deshpande et al., 2017) that discretize the output space are another approach. Domain specific disentangled representations (Lee et al., 2018; Huang et al., 2018) and explicit encoding of multiple modes as inputs (Zhu et al., 2016; Isola et al., 2017) have also been successful in generating diverse outputs. Sampling-based loss functions that enforce similarity at a distribution level (Lee et al., 2019) have also succeeded in multimodal generative tasks. Further, the use of additional specialized reconstruction losses (often using higher-level features extracted from the data distribution) and attention mechanisms achieves multimodality through intricate model architectures in domain specific cases (Zeng et al., 2019; Chen & Koltun, 2017b; Vitoria et al., 2020; Zhang et al., 2016; Iizuka et al., 2016; Zhang et al., 2020; Yu et al., 2018a; Sagong et al., 2019; Wang et al., 2018).

We propose a simpler direction through our domain-independent energy function based approach, which is also capable of learning generic representations that better support downstream tasks. Notably, our work differs from energy based models previously investigated for likelihood modeling: while such models are attractive for their simplicity, they are notoriously difficult to train, especially on high-dimensional spaces (Du & Mordatch, 2019).

2 Theoretical results

2.1 Proof for Eq. 3

\left\lVert\mathcal{G}(x_{j},w^{*},z_{i,j}^{*})-\mathcal{G}(x_{j},w^{*},z_{0})\right\rVert=\left\lVert\int_{z_{0}}^{z_{i,j}^{*}}\nabla_{z}\mathcal{G}(x_{j},w^{*},z)\,dz\right\rVert \quad (5)

Let $\gamma(t)$ be a straight path from $z_{0}$ to $z^{*}_{i,j}$, where $\gamma(0)=z_{0}$ and $\gamma(1)=z^{*}_{i,j}$. Then,

=\left\lVert\int_{0}^{1}\nabla_{z}\mathcal{G}(x_{j},w^{*},\gamma(t))\frac{d\gamma}{dt}\,dt\right\rVert \quad (6)
=\left\lVert\int_{0}^{1}\nabla_{z}\mathcal{G}(x_{j},w^{*},\gamma(t))(z^{*}_{i,j}-z_{0})\,dt\right\rVert \quad (7)
=\left\lVert(z^{*}_{i,j}-z_{0})\int_{0}^{1}\nabla_{z}\mathcal{G}(x_{j},w^{*},\gamma(t))\,dt\right\rVert \quad (8)
\leq\left\lVert z^{*}_{i,j}-z_{0}\right\rVert\left\lVert\int_{0}^{1}\nabla_{z}\mathcal{G}(x_{j},w^{*},\gamma(t))\,dt\right\rVert \quad (9)

On the other hand, the Lipschitz constraint ensures,

\left\lVert\nabla_{z}\mathcal{G}(x_{j},w^{*},\gamma(t))\right\rVert\leq\lim_{\epsilon\to 0}\frac{\left\lVert\mathcal{G}(x_{j},w^{*},\gamma(t))-\mathcal{G}(x_{j},w^{*},\gamma(t+\epsilon))\right\rVert}{\left\lVert z_{t}-z_{t+\epsilon}\right\rVert}\leq C, \quad (10)

where $C$ is a constant. Combining Eqs. 9 and 10, we get,

\frac{\left\lVert\mathcal{G}(x_{j},w^{*},z_{i,j}^{*})-\mathcal{G}(x_{j},w^{*},z_{0})\right\rVert}{\left\lVert z_{i,j}^{*}-z_{0}\right\rVert}\leq\int_{0}^{1}\left\lVert\nabla_{z}\mathcal{G}(x_{j},w^{*},\gamma(t))\right\rVert dt\leq C. \quad (11)

2.2 Convergence of the training algorithm.

Proof: Let us consider a particular input $x_{j}$ and an associated ground truth $y^{g}_{i,j}$. Then, for this particular case, we denote our cost function as $\hat{E}_{i,j}=d(w,z)$. Further, a family of cost functions can be defined as,

f_{w}(z)=d(w,z), \quad (12)

for each $w\sim\omega$. Further, let us consider an arbitrary initial setting $(z_{init},w_{init})$. Then, with enough iterations, gradient descent using $\nabla_{z}f_{w}(z)$ converges $z_{init}$ to,

\bar{z}=\arg\inf_{z\in\zeta}f_{w}(z). \quad (13)

Next, with enough iterations, gradient descent using $\nabla_{w}f_{w}(\bar{z})$ converges $w$ to,

\bar{w}=\arg\inf_{w\in\omega}f_{w}(\bar{z}). \quad (14)

Observe that $f_{\bar{w}}(\bar{z})\leq f_{w_{init}}(z_{init})$, where equality occurs when $\nabla_{z}f_{w}(z)=\nabla_{w}f_{w}(\bar{z})=0$. If $f_{w}(z)$ has a unique global minimum, repeating Eqs. 13 and 14 converges to that global minimum, giving $\{z^{*}_{i,j},w^{*}_{i,j}\}$. It is straightforward to see that using a small number of iterations (usually one in our case) per sample set in Eq. 14, i.e., stochastic gradient descent, gives us,

\{z_{i,j}^{*},w^{*}\}=\arg\min_{z_{i,j}\in\zeta,\,w\in\omega}\mathbb{E}_{i\in I,j\in J}[\hat{E}_{i,j}], \quad (15)

where $w^{*}$ is fixed for all samples and modes (Robbins, 2007). Note that the proof is valid only for the convex case; in general, we rely on stochastic gradient descent converging to at least a good local minimum, as is commonly done in many deep learning settings.
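For concreteness, a minimal PyTorch sketch of this alternating scheme is given below. The names (alternating_step, model, cost_fn) and the step sizes are illustrative assumptions rather than our released implementation: the latent code is refined by several gradient steps with the weights held fixed (Eq. 13), followed by a single stochastic gradient step on the weights (Eqs. 14-15).

```python
import torch

def alternating_step(model, w_optimizer, x, y_g, z_init, cost_fn, z_steps=20, z_lr=1e-2):
    # Inner loop (Eq. 13): refine the latent code z while the weights are held fixed.
    z = z_init.clone().detach().requires_grad_(True)
    z_optimizer = torch.optim.SGD([z], lr=z_lr)
    for _ in range(z_steps):
        z_optimizer.zero_grad()
        cost_fn(model(x, z), y_g).backward()
        z_optimizer.step()
    # Outer step (Eqs. 14-15): one stochastic gradient step on the network weights.
    w_optimizer.zero_grad()
    loss = cost_fn(model(x, z.detach()), y_g)
    loss.backward()
    w_optimizer.step()
    return z.detach(), loss.item()
```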

2.3 Proof for Remark

Remark:

Consider a generator $G(x,z)$ and a discriminator $D(x,z)$ with finite capacity, where $x$ and $z$ are the input and the noise vector, respectively. Then, consider an arbitrary input $x_{j}$ and the corresponding set of ground truths $\{y^{g}_{i,j}\},\ i=1,2,\ldots,N$. Further, let us define the optimal generator $G^{*}(x_{j},z)=\hat{y},\ \hat{y}\in\{y^{g}_{i,j}\}$, $L_{GAN}=\mathbb{E}_{i}[\log D(y^{g}_{i,j})]+\mathbb{E}_{z}[\log(1-D(G(x_{j},z)))]$ and $L_{\ell}=\mathbb{E}_{i,z}[|y^{g}_{i,j}-G(x_{j},z)|]$. Then, $G^{*}\neq\hat{G}^{*}$, where $\hat{G}^{*}=\arg\min_{G}\max_{D}L_{GAN}+\lambda L_{\ell}$, $\forall\lambda\neq 0$.

Proof.

It is straightforward to derive the equilibrium point of $\arg\min_{G}\max_{D}L_{GAN}$ from the original GAN formulation. However, for clarity, we show some steps here.

Let,

V(G,D)=\arg\min_{G}\max_{D}\;\mathbb{E}_{i}[\log D(y^{g}_{i,j})]+\mathbb{E}_{z}[\log(1-D(G(x_{j},z)))] \quad (16)

Let $p(\cdot)$ denote the probability distribution. Then,

V(G,D)=\arg\min_{G}\max_{D}\int_{\Upsilon}p(y^{g}_{\cdot,j})\log D(y^{g}_{\cdot,j})+p(\bar{y}_{\cdot,j})\log(1-D(G(x_{j},z)))\,dy \quad (17)
V(G,D)=\arg\min_{G}\max_{D}\;\mathbb{E}_{y\sim y^{g}_{\cdot,j}}[\log D(y)]+\mathbb{E}_{y\sim\bar{y}_{\cdot,j}}[\log(1-D(y))] \quad (18)

Consider the inner loop. It is straightforward to see that $V(G,D)$ is maximized w.r.t. $D$ when $D(y)=\frac{p(y^{g}_{\cdot,j})}{p(y^{g}_{\cdot,j})+p(\bar{y}_{\cdot,j})}$. Then,

C(G)=V(G,D)=\arg\min_{G}\;\mathbb{E}_{y\sim y^{g}_{\cdot,j}}\Big[\log\frac{p(y^{g}_{\cdot,j})}{p(y^{g}_{\cdot,j})+p(\bar{y}_{\cdot,j})}\Big]+\mathbb{E}_{y\sim\bar{y}_{\cdot,j}}\Big[\log\frac{p(\bar{y}_{\cdot,j})}{p(y^{g}_{\cdot,j})+p(\bar{y}_{\cdot,j})}\Big] \quad (19)

Then, following Theorem 1 of Goodfellow et al. (2014b), it can be shown that the global minimum of the virtual training criterion $C(G)$ is achieved if and only if $p(y^{g}_{\cdot,j})=p(\bar{y}_{\cdot,j})$.

Next, consider the $L_{1}$ loss for $x_{j}$,

L_{1}=\frac{1}{N}\sum_{i}\left\lvert y^{g}_{i,j}-G(x_{j},z,w)\right\rvert \quad (20)
\nabla_{w}L_{1}=-\frac{1}{N}\sum_{i}\mathrm{sgn}(y^{g}_{i,j}-G(x_{j},z,w))\,\nabla_{w}G(x_{j},z,w) \quad (21)

For $L_{1}$ to approach a minimum, $\nabla_{w}L_{1}\rightarrow 0$. Since $\{y^{g}_{i,j}\}$ is not a singleton, $L_{1}$ cannot reach zero, and as $L_{1}$ is minimized, $G(x_{j},z,w)\neq\hat{y}\in\{y^{g}_{i,j}\}$ in general.

Now, let us consider the $L_{2}$ loss,

L_{2}=\frac{1}{N}\sum_{i}\left\lVert y^{g}_{i,j}-G(x_{j},z,w)\right\rVert^{2} \quad (22)
\nabla_{w}L_{2}=-\frac{2}{N}\sum_{i}(y^{g}_{i,j}-G(x_{j},z,w))\,\nabla_{w}G(x_{j},z,w) \quad (23)

For $\nabla_{w}L_{2}\rightarrow 0$, $G(x_{j},z,w)\rightarrow\frac{1}{N}\sum_{i}y^{g}_{i,j}$. Omitting the very specific case where $\frac{1}{N}\sum_{i}y^{g}_{i,j}\in\{y^{g}_{i,j}\}$, which is highly unlikely in a complex distribution, as $L_{2}$ is minimized, $G(x_{j},z,w)\neq\hat{y}\in\{y^{g}_{i,j}\}$. Therefore, the goals of $\arg\min_{G}\max_{D}L_{GAN}$ and $\lambda L_{\ell}$ are contradictory and $G^{*}\neq\hat{G}^{*}$. We do not extend the proof to higher-order $L_{\ell}$ losses, as the extension is straightforward.

2.4 Lipschitz continuity and structuring of the latent space

Enforcing the Lipschitz constraint encourages meaningful structuring of the latent space: suppose $z^{*}_{1,j}$ and $z^{*}_{2,j}$ are two optimal codes corresponding to two ground truth modes for a particular input. Since $\lVert z^{*}_{2,j}-z^{*}_{1,j}\rVert$ is lower bounded by $\frac{\lVert\mathcal{G}(x_{j},w^{*},z^{*}_{2,j})-\mathcal{G}(x_{j},w^{*},z^{*}_{1,j})\rVert}{L}$, where $L$ is the Lipschitz constant, the minimum distance between the two latent codes is proportional to the difference between the corresponding ground truth modes. In practice, we also observed that this encourages the optimal latent codes to be placed sparsely. Fig. 20 illustrates a visualization from the toy example. As the training progresses, the optimal $\{z^{*}\}$ corresponding to minima of $\hat{E}$ are identified and placed sparsely. Note that, as expected, at the $10^{th}$ epoch the distance between the two optimal $z^{*}$ increases as $x$ goes from $0$ to $1$, in other words, as $\lVert 4(x,x^{2},x^{3})-(-4(x,x^{2},x^{3}))\rVert$ increases.

Practical implementation is as follows: during the training phase, a small noise $e$ is injected into the inputs of $\mathcal{Z}$ and $\mathcal{G}$, and the networks are penalized for any resulting difference in output. More formally, $L_{\mathcal{Z}}$ and $\hat{E}$ become $L_{1}[z_{t+1},\mathcal{Z}(z_{t},h)]+\alpha L_{1}[\mathcal{Z}(z_{t}+e,h+e),\mathcal{Z}(z_{t},h)]$ and $L_{1}[y^{g},\mathcal{G}(h,z)]+\alpha L_{1}[\mathcal{G}(h+e,z+e),\mathcal{G}(h,z)]$, respectively. Fig. 24 illustrates the procedure.
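A minimal sketch of this penalty for the generator term, assuming a PyTorch callable G(h, z) and hypothetical helper names, is given below; the noise scale and $\alpha$ are placeholders.

```python
import torch

def lipschitz_regularised_loss(G, h, z, y_g, alpha=0.1, noise_std=1e-3):
    # Sketch of the noise-injection penalty described above:
    # reconstruct the ground truth and penalise output changes under small perturbations.
    l1 = torch.nn.L1Loss()
    e_h = noise_std * torch.randn_like(h)   # small noise on the conditioning input
    e_z = noise_std * torch.randn_like(z)   # small noise on the latent code
    y = G(h, z)
    reconstruction = l1(y, y_g)              # L1[y^g, G(h, z)]
    smoothness = l1(G(h + e_h, z + e_z), y)  # L1[G(h+e, z+e), G(h, z)]
    return reconstruction + alpha * smoothness
```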

Refer to caption
Figure 20: The behaviour of the cost heatmaps $\hat{E}$ against $(x,z)$ as the training progresses (toy example), shown at (a) 0 epochs, (b) 5 epochs, and (c) 10 epochs. The latent space gets increasingly structured as $w\rightarrow w^{*}$. Also, in (c) the network intelligently places the optimal latent codes further apart as the distance between the two ground truth modes ($m=4$ and $m=-4$) keeps increasing.

2.5 Towards a measurement of uncertainty

In Bayesian approaches, the uncertainty is represented using the distribution of the network parameters $\omega$. Since a network output is unique for a fixed $\bar{w}\sim\omega$, sampling from the output is equivalent to sampling from $\omega$. Often, $\omega$ is modeled as a parametric distribution or obtained through sampling, and at inference, the model uncertainty can be estimated as $\mathbb{VAR}_{p(y|x)}(y)$. One intuition behind this is that for more confident inputs, $p(y|x,w)$ will showcase less variance over the distribution of $\omega$ (hence a lower $\mathbb{VAR}_{p(y|x)}(y)$), as the network parameters have learned redundant information (Loquercio et al., 2019).

As opposed to sampling from the distribution of network parameters, we model the optimal $z^{*}$ for a particular input as a probability distribution $p(z^{*})$, and measure $\mathbb{VAR}_{p(y|x)}(y)$, where $p(y|x)=\int p(y|x,z^{*})p(z^{*}|x)dz^{*}$. Our intuition is that in the vicinity of well observed data $\mathbb{VAR}_{p(y|x)}(y)$ is lower, since for training data 1) we enforce the Lipschitz constraint on $\mathcal{G}(x,z)$ over $(x,z)$, and 2) $\hat{E}(y^{g},\mathcal{G}(x,z))$ resides in a relatively stable local minimum with respect to $z^{*}$ for observed data, as in practice $z^{*}=\mathbb{E}_{epochs}[z^{*}]+\epsilon$ for a given $x$, where $\epsilon$ is some random noise which is susceptible to change at each epoch. Further, let $(x,z^{*})$ and $y^{g}$ be the inputs to a network $\mathcal{G}$ and the corresponding ground truth label, respectively.

Formally, let $p(y^{g}|x,z^{*})=\mathcal{N}(y^{g};\mathcal{G}(x,z^{*}),\alpha\mathbb{I})$ and $z^{*}\sim\mathcal{U}(|z^{*}-\mathbb{E}(z^{*})|<\delta)$, where $\alpha$ is a variable describing the noise in the input $x$ and $\delta$ is a small positive scalar. Then,

\mathbb{COV}_{p(y^{g}|x)}(y^{g})\approx\frac{1}{K}\sum_{k=1}^{K}[\alpha_{k}\mathbb{I}]+\overline{\mathbb{COV}}(\mathcal{G}(x,z^{*})), \quad (24)

where $\overline{\mathbb{COV}}$ is the sample covariance.

Proof: \mathbb{E}_{p(y^{g}|x)}(y^{g})=\int y^{g}\,p(y^{g}|x)\,dy^{g}

=\int y^{g}\Big[\int\mathcal{N}(y^{g};\mathcal{G}(x,z^{*}),\alpha\mathbb{I})\,p(z^{*}|x)\,dz^{*}\Big]dy^{g}

=\int\Big[\int y^{g}\,\mathcal{N}(y^{g};\mathcal{G}(x,z^{*}),\alpha\mathbb{I})\,p(z^{*}|x)\,dy^{g}\Big]dz^{*}

=\int\Big[\int y^{g}\,\mathcal{N}(y^{g};\mathcal{G}(x,z^{*}),\alpha\mathbb{I})\,dy^{g}\Big]p(z^{*}|x)\,dz^{*}

=\int\mathcal{G}(x,z^{*})\,p(z^{*}|x)\,dz^{*}

Let $\pi\delta^{2}=A$ and $p(z^{*}|x)\approx\frac{1}{A}$. Then, by Monte-Carlo approximation,

\approx\frac{1}{K}\sum_{k=1}^{K}\mathcal{G}(x,z^{*}_{k}).

Next, consider,

\mathbb{COV}_{p(y^{g}|x)}(y^{g})=\mathbb{E}_{p(y^{g}|x)}\big((y^{g})(y^{g})^{T}\big)-\mathbb{E}_{p(y^{g}|x)}(y^{g})\,\mathbb{E}_{p(y^{g}|x)}(y^{g})^{T}

=\int\!\!\int(y^{g})(y^{g})^{T}\,p(y^{g}|x,z^{*})\,p(z^{*}|x)\,dz^{*}dy^{g}-\mathbb{E}_{p(y^{g}|x)}(y^{g})\,\mathbb{E}_{p(y^{g}|x)}(y^{g})^{T}

=\int\Big[\mathbb{COV}_{p(y^{g}|x,z^{*})}(y^{g})+\mathbb{E}_{p(y^{g}|x,z^{*})}(y^{g})\,\mathbb{E}_{p(y^{g}|x,z^{*})}(y^{g})^{T}\Big]p(z^{*}|x)\,dz^{*}-\mathbb{E}_{p(y^{g}|x)}(y^{g})\,\mathbb{E}_{p(y^{g}|x)}(y^{g})^{T}

\approx\frac{1}{K}\sum_{k=1}^{K}\Big[\alpha_{k}\mathbb{I}+\mathcal{G}(x,z^{*}_{k})\,\mathcal{G}(x,z^{*}_{k})^{T}\Big]-\frac{1}{K^{2}}\Big[\Big(\sum_{k=1}^{K}\mathcal{G}(x,z^{*}_{k})\Big)\Big(\sum_{k=1}^{K}\mathcal{G}(x,z^{*}_{k})\Big)^{T}\Big]

=\frac{1}{K}\sum_{k=1}^{K}[\alpha_{k}\mathbb{I}]+\overline{\mathbb{COV}}(\mathcal{G}(x,z^{*})).

Note that, similar to Bayesian uncertainty estimation, where an approximate distribution $q(w)$ is used to estimate $p(w|D)$, with $D$ being the data, our model samples from an empirical distribution $p(z^{*}|x)$. In practice, we treat $\alpha_{k}$ as a constant over all the samples (and hence omit it from the calculation) and use stochastic forward passes to obtain Eq. 24. Then, the diagonal entries are used to calculate the uncertainty in each dimension of the output. We test this hypothesis on the toy example and the colorization task, as shown in Fig. 21 and Fig. 22, respectively.
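The stochastic forward passes can be sketched as follows (hypothetical helper; a Gaussian perturbation of $z^{*}$ is used here as a stand-in for sampling from the uniform neighbourhood, and $\alpha_{k}$ is dropped as described above).

```python
import torch

@torch.no_grad()
def predictive_uncertainty(G, x, z_star, K=20, delta=0.05):
    # Monte-Carlo sketch of Eq. 24: perturb the optimal latent code, run K stochastic
    # forward passes, and use the per-dimension sample variance (the diagonal of the
    # sample covariance) as the uncertainty estimate.
    samples = torch.stack([G(x, z_star + delta * torch.randn_like(z_star)) for _ in range(K)])
    prediction = samples.mean(dim=0)
    uncertainty = samples.var(dim=0)
    return prediction, uncertainty
```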

Refer to caption
Figure 21: Uncertainty measurement illustrated on the toy example (left column: ground truth, right column: prediction). We train the model with $x\in[0,0.5]$ and test with $x\in[0,1.5]$. During testing, we add a small Gaussian noise to $z^{*}$ at each $x$ and obtain stochastic outputs. As illustrated, the sample variance (the uncertainty measurement) increases as $x$ deviates from the observed data portion.
Refer to caption
Figure 22: Colorization predictions for models trained with and without monkey class. Output images are shown side by side with corresponding uncertainty maps. For models trained without monkey data, high uncertainty is predicted for pixels belonging to the monkey portion (intensity is higher for high uncertainty).

3 Experiments

3.1 Experimental architectures

For the experiments on images, we mainly use $128\times 128$ inputs. However, to demonstrate scalability, we use several different architectures and show that the proposed framework is capable of converging irrespective of the architecture. Fig. 23 shows the architectures for different input sizes.

For training, we use the Adam optimizer with hyper-parameters $\beta_{1}=0.9$, $\beta_{2}=0.999$, $\epsilon=1\times 10^{-8}$, and a learning rate of $1\times 10^{-5}$. We use batch normalization after each convolution layer and leaky ReLU as the activation, except for the last layer, where we use $\tanh$. All the weights are initialized from a random normal distribution with mean $0$ and standard deviation $0.5$. Furthermore, we use a batch size of 20 for training, although we did not observe much change in performance for different batch sizes. We choose the dimension of $z$ to be $10$, $16$, $32$ and $64$ for $32\times 32$, $64\times 64$, $128\times 128$ and $256\times 256$ input sizes, respectively. An important aspect to note here is that the dimension of $z$ should not be increased too much, as it would unnecessarily enlarge the search space for $z$. While training, $z$ is updated 20 times for a single $\mathcal{G},\mathcal{H}$ update. Similarly, at inference, we use 20 update steps for $z$ in order to converge to an optimal solution. All the values are chosen empirically.
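For illustration, the inference procedure can be sketched as follows, mapping $\mathcal{H}$, $\mathcal{Z}$ and $\mathcal{G}$ to hypothetical PyTorch callables H, Z and G; the exact interfaces in our implementation may differ.

```python
import torch

@torch.no_grad()
def predict(H, Z, G, x, z0, steps=20):
    # Inference sketch: the encoder H produces the conditioning features, the latent
    # network Z iteratively refines the code, and the generator G produces the output.
    h = H(x)
    z = z0
    for _ in range(steps):   # 20 latent update steps, as used in our experiments
        z = Z(z, h)
    return G(h, z)
```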

Refer to caption
Figure 23: The model architectures for various input sizes. The same general structure is maintained, with minimal changes to accommodate the changing input size.


Refer to caption
Figure 24: We enforce the Lipschitz continuity on both 𝒢\mathcal{G} and 𝒵\mathcal{Z}.

3.2 Evaluation metrics

Although heavily used in the literature, per-pixel metrics such as PSNR do not effectively capture the perceptual quality of an image. To overcome this shortcoming, more perceptually motivated metrics have been proposed, such as SSIM (Wang et al., 2004), MSSIM (Wang et al., 2003), and FSIM (Zhang et al., 2011). However, the similarity of two images is largely context dependent and may not be captured by the aforementioned metrics. As a solution, two deep-feature based perceptual metrics, LPIP (Zhang et al., 2018) and PieAPP (Prashnani et al., 2018), were recently proposed, which coincide well with human judgement. To cover all these aspects, we evaluate our experiments against four metrics: LPIP, PieAPP, PSNR and SSIM.

3.3 Unbalanced color distributions

The color distribution of a natural image dataset in the $a$ and $b$ planes (LAB space) is strongly biased towards low values. If not taken into account, the loss function can be dominated by these desaturated values. Zhang et al. (2016) addressed this problem by rebalancing class weights according to the probability of color occurrence. However, this is only possible when the output domain is discretized. To tackle this problem in the continuous domain, we push the output color distribution towards a uniform distribution, as explained in Sec. 5.2 of the main paper.

3.4 Multimodality

An appealing attribute of our network is its ability to converge to multiple optimal modes at inference. A few such examples are shown in Fig. 25, Fig. 26, Fig. 27, Fig. 28, Fig. 29, Fig. 30 and Fig. 31. For the facial-landmarks-to-faces experiment, we used the UTKFace dataset (Zhang & Qi, 2017). For the surface-normals-to-pets experiment, we used the Oxford Pet dataset (Parkhi et al., 2012). In order to obtain the surface normal images, we follow Bansal et al. (2017b): first, we crop the bounding boxes of pet faces and then apply PixelNet (Bansal et al., 2017a) to extract surface normals. For the maps-to-aerial and edges-to-photos experiments, we used the datasets provided by Isola et al. (2017).

For measuring diversity, we adopt the following procedure: 1) generate 20 random samples from the model; 2) calculate the mean pixel value $\mu_{i}$ of each sample; 3) pick the sample $s_{m}$ closest to the average of all the mean pixel values, $\lambda=\frac{1}{20}\sum_{i=1}^{20}\mu_{i}$; 4) pick the 10 samples with the maximum mean-pixel distance from $s_{m}$; 5) calculate the mean standard deviation of the 10 samples picked in step 4; 6) repeat the experiment 5 times for each model and report the expected standard deviation.
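A minimal sketch of this measurement is given below, with hypothetical helper names and assuming the generator returns a single image tensor per call; step 6 corresponds to repeating the call and averaging the results.

```python
import torch

@torch.no_grad()
def diversity_score(G, x, sample_z, n=20, k=10):
    # Sketch of steps 1-5 of the diversity procedure; sample_z() draws a random latent code.
    samples = torch.stack([G(x, sample_z()) for _ in range(n)])   # 1) n random samples
    mu = samples.mean(dim=(1, 2, 3))                              # 2) mean pixel value per sample
    s_m = samples[(mu - mu.mean()).abs().argmin()]                # 3) sample closest to the average
    dists = (mu - s_m.mean()).abs()                               # mean-pixel distance to s_m
    far = samples[dists.topk(k).indices]                          # 4) k most distant samples
    return far.std(dim=0).mean().item()                           # 5) mean standard deviation
```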

Refer to caption
Figure 25: Multimodal predictions of our model in colorization.
Refer to caption
Figure 26: Multimodal predictions of our model in colorization.
Refer to caption
Figure 27: Multimodal predictions of our model in landmarks-to-faces.
Refer to caption
Figure 28: Multimodal predictions of our model in face inpainting.
Refer to caption
Figure 29: Multimodal predictions of our model in surface-normals-to-pet-faces. Note that this is generally a difficult task due to the diverse textures.
Refer to caption
Figure 30: Multimodal predictions of our model in sketch-to-shoes translation.
Refer to caption
Figure 31: Multimodal predictions of our model in sketch-to-bag translation.

3.5 Colorization on STL dataset

Additional colorization examples on the STL dataset are shown in Fig. 33. We also compare the color distributions of the predicted $a,b$ planes with the state-of-the-art. The results are shown in Fig. 32 and Table 7. As evident, our method predicts the color distribution closest to the ground truth.

Method a b
Chroma 0.71 0.78
Iizuka 0.68 0.63
Ours 0.82 0.80
Table 7: IoU of the predicted color distributions ($a$ and $b$ planes) against the ground truth. Our method shows the best agreement.
Refer to caption
Figure 32: Color distribution comparison of the $a,b$ planes. Our method produces the distribution closest to the ground truth.
Refer to caption
Figure 33: Qualitative results of our model in the colorization task on STL dataset.

3.6 Colorization on ImageNet dataset

Additional colorization examples on the ImageNet dataset are shown in Fig. 34.

Refer to caption
Figure 34: Qualitative results of our model in the colorization task on ImageNet dataset.

3.7 Self-supervised learning setup

Here we evaluate the performance of our model on downstream tasks, using three distinct setups involving bottleneck features of trained models. The bottleneck-layer features (of models trained on one dataset) are fed to a fully-connected layer that is trained on a different dataset.

The baseline experiment uses the output of the penultimate layer of a ResNet-50 trained on ImageNet for classification as the bottleneck features. The comparison against the state-of-the-art involves Zeng et al. (2019), where the five outputs of its multi-scale decoder are max-pooled and concatenated to form the bottleneck features. We also experimented with the outputs of earlier layers, and the highest performance was obtained for these selected features. In our network, the output of the encoder is used as the bottleneck features.
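The shared evaluation protocol can be sketched as follows (hypothetical helper; the feature dimension, optimizer and schedule are placeholders): the pretrained encoder is frozen and only a single fully-connected head is trained on its bottleneck features.

```python
import torch
import torch.nn as nn

def linear_probe(encoder, feat_dim, num_classes, loader, epochs=10, lr=1e-3):
    # Freeze the pretrained encoder and train a single fully-connected layer on top.
    encoder.eval()
    for p in encoder.parameters():
        p.requires_grad_(False)
    head = nn.Linear(feat_dim, num_classes)
    optimizer = torch.optim.Adam(head.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in loader:
            with torch.no_grad():
                features = encoder(images).flatten(1)   # bottleneck features
            loss = criterion(head(features), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return head
```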

3.8 Scalability

One promising attribute of the proposed method compared to the state-of-the-art is its scalability. In other words, we propose a generic framework that is not bound to a particular architecture; hence, the model can be scaled to different input sizes without affecting the convergence behaviour. To demonstrate this, we use four different architectures and train them on four different input sizes ($32\times 32$, $64\times 64$, $128\times 128$, $256\times 256$) on the same tasks: image completion and colorization. The different architectures are shown in Fig. 23.

3.9 Ablation study on the dimension of $z$

To demonstrate the effect of the dimension of $z$ on model accuracy, we conduct an ablation study on the colorization task with $128\times 128$ inputs. Table 8 shows the results. The quality of the outputs increases up to $dim(z)=32$, and then decreases. This is intuitive: when the search space of $z$ becomes unnecessarily large, it is difficult for $\mathcal{Z}$ to learn the paths to the optimal modes, due to its limited capacity.

Dimensionality LPIP PieAPP Diversity
5 1.05 3.40 0.01
10 0.58 2.91 0.018
16 0.14 1.89 0.021
32 0.12 1.47 0.043
64 0.27 1.71 0.048
128 0.69 2.12 0.043
Table 8: Ablation study against the dimension of $z$ for the colorization task ($128\times 128$ inputs).

3.10 User studies

Evaluation of synthesized images is an open problem (Salimans et al., 2016). Although recent metrics such as LPIP (Zhang et al., 2018) and PieAPP (Prashnani et al., 2018) coincide closely with human judgement, perceptual user studies remain the preferred evaluation method. Therefore, to evaluate the quality of our synthesized images in the colorization task, we conduct two types of user studies: a Turing test and a psychophysical study. In the Turing test, we show the users a series of image pairs, the ground truth and our prediction, and ask the users to pick the more realistic image. Here, following Zhang et al. (2016), we display each image for 1 second and then give the users an unlimited amount of time to make their choice. For the psychophysical study, we choose the two best performing methods according to the LPIP metric: Vitoria et al. (2020) and Iizuka et al. (2016). We create a series of batches of three images, from Vitoria et al. (2020), Iizuka et al. (2016) and ours, and ask the users to pick the image of the best quality. In this case, each batch is shown to the users for 5 seconds, and the users have to make their decision within that time. We conduct the Turing test on ImageNet, and the psychophysical study on both the ImageNet and STL datasets. For each test, we use 500 randomly sampled batches and $\sim$15 users.

We also conduct Turing tests to evaluate the image completion tasks on Facades and Celeb-HQ datasets. The results are shown in Table 9.

Dataset Celeb-HQ Facades
GT 59.11% 55.75%
Ours 40.89% 44.25%
Table 9: Turing test results for GT vs. ours on the Celeb-HQ and Facades datasets.

3.11 Image completion

Additional image completion examples are provided in Figs. 35 and 36. Our Turing test results on Celeb-HQ and Facades are shown in Table 9.

Refer to caption
Figure 35: Qualitative results of our model in the image completion task on Celeb-HQ dataset.
Refer to caption
Figure 36: Qualitative results of our model in the image completion task on Facades dataset.

3.12 3D spectral map denoising

In this experiment, we use two types of spectral moments: spherical harmonics and Zernike polynomials (see App. 4). The minimum number of sample points required to accurately represent a finite-energy function in a particular function space depends on the sampling theorem used. According to Driscoll and Healy's theorem (Driscoll & Healy, 1994), $4N^{2}$ equiangular sample points are needed to represent a function on $\mathbb{S}^{2}$ using spherical moments up to a maximum degree $N$. Therefore, we compute the first $16384$ spherical moments of 3D objects with $l\leq 128$ by sampling $256\times 256$ equiangular points in the $\theta$ and $\phi$ directions, where $0\leq\theta\leq 2\pi$ and $0\leq\phi\leq\pi$. Afterwards, we arrange the spherical moments as a $128\times 128$ feature map and convolve it with a $2\times 2$ kernel with stride $2$ to downsample the feature map to $64\times 64$. The output is then fed to the $64$-size architecture. We add Gaussian noise and mask portions of the spectral map to corrupt it; the model is then trained to denoise the input.
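As an illustration of this pre-processing, a sketch is given below. It assumes the moments are supplied as a real-valued flat tensor, that the $2\times 2$, stride-2 convolution is a learned layer of the model passed in as downsample, and it uses a random element-wise mask as a stand-in for the exact masking scheme.

```python
import torch
import torch.nn as nn

def corrupted_spectral_input(moments, downsample, mask_ratio=0.25, noise_std=0.1):
    # downsample is assumed to be nn.Conv2d(1, 1, kernel_size=2, stride=2).
    fmap = moments.view(1, 1, 128, 128)                     # arrange moments as a 128x128 map
    corrupted = fmap + noise_std * torch.randn_like(fmap)   # additive Gaussian noise
    mask = (torch.rand_like(fmap) > mask_ratio).float()     # randomly mask entries of the map
    corrupted = corrupted * mask
    return downsample(corrupted)                            # -> 1 x 1 x 64 x 64
```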

For Zernike polynomials, we compute the first $100$ moments of each 3D object with $n\leq 9$, and arrange the moments as a $10\times 10$ feature map. Then, the feature map is upsampled using a transposed convolution with a $5\times 5$ kernel and stride $3$. The upsampled feature map is fed to the $32$-size network and trained end-to-end to denoise. We first train the network on 55k objects from ShapeNet, and then apply the trained network to ModelNet10 and ModelNet40 to extract the bottleneck features. These features are then fed to a single fully connected layer for classification.

4 Spectral domain representation of 3D objects

Spherical harmonics and Zernike polynomials are orthogonal and complete function families on $\mathbb{S}^{2}$ and $\mathbb{B}^{3}$, respectively; hence, 3D point clouds can be represented by a set of coefficients corresponding to a linear combination of these functions (Perraudin et al., 2019; Ramasinghe et al., 2019a, c).

4.1 Spherical harmonics

Spherical harmonics are complete and orthogonal functions defined on the unit sphere ($\mathbb{S}^{2}$) as,

Y_{l,m}(\theta,\phi)=(-1)^{m}\sqrt{\frac{2l+1}{4\pi}\frac{(l-m)!}{(l+m)!}}\,P_{l}^{m}(\cos\phi)\,e^{im\theta}, \quad (25)

where $\theta\in[0,2\pi]$ is the azimuth angle, $\phi\in[0,\pi]$ is the polar angle, $l\in\mathbb{Z}^{+}$, $m\in\mathbb{Z}$, and $|m|\leq l$. Here, $P_{l}^{m}(\cdot)$ is the associated Legendre function defined as,

P_{l}^{m}(x)=(-1)^{m}\frac{(1-x^{2})^{m/2}}{2^{l}l!}\frac{d^{l+m}}{dx^{l+m}}(x^{2}-1)^{l}. \quad (26)

Spherical harmonics satisfy the following orthogonality property,

\int_{0}^{2\pi}\int_{0}^{\pi}Y_{l}^{m}(\theta,\phi)\,Y_{l^{\prime}}^{m^{\prime}}(\theta,\phi)^{\dagger}\sin\phi\,d\phi\,d\theta=\delta_{l,l^{\prime}}\delta_{m,m^{\prime}}, \quad (27)

where $\dagger$ denotes the complex conjugate and,

\delta_{m,m^{\prime}}=\begin{cases}1,&\text{if }m=m^{\prime}\\ 0,&\text{otherwise}.\end{cases} \quad (28)

Since spherical harmonics are complete in $\mathbb{S}^{2}$, any function $f\colon\mathbb{S}^{2}\rightarrow\mathbb{R}$ with finite energy can be rewritten as,

f(\theta,\phi)=\sum_{l}\sum_{m=-l}^{l}\hat{f}(l,m)\,Y_{l,m}(\theta,\phi), \quad (29)

where,

\hat{f}(l,m)=\int_{0}^{\pi}\int_{0}^{2\pi}f(\theta,\phi)\,Y_{l}^{m}(\theta,\phi)^{\dagger}\sin\phi\,d\theta\,d\phi. \quad (30)
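As a reading aid, Eq. 30 can be discretised on an equiangular grid as sketched below (illustrative Python using SciPy's sph_harm; the paper's setting of $l\leq 128$ on a $256\times 256$ grid is noted in the comments, but a smaller default degree is used here to keep the sketch cheap).

```python
import numpy as np
from scipy.special import sph_harm

def sh_coefficients(f, l_max=16, n=256):
    # Discretisation of Eq. 30 on an n x n equiangular grid; f is a callable f(theta, phi).
    # In the paper, l <= 128 with a 256 x 256 grid is used.
    theta = np.linspace(0, 2 * np.pi, n, endpoint=False)   # azimuth
    phi = np.linspace(0, np.pi, n, endpoint=False)         # polar angle
    T, P = np.meshgrid(theta, phi, indexing="ij")
    F = f(T, P)
    dA = (2 * np.pi / n) * (np.pi / n)                     # d_theta * d_phi
    coeffs = {}
    for l in range(l_max + 1):
        for m in range(-l, l + 1):
            Y = sph_harm(m, l, T, P)                       # Y_l^m(theta, phi) in SciPy's convention
            coeffs[(l, m)] = np.sum(F * np.conj(Y) * np.sin(P)) * dA
    return coeffs
```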

4.2 3D Zernike polynomials

3D Zernike polynomials are complete and orthogonal on $\mathbb{B}^{3}$ and defined as,

Z_{n,l,m}(r,\theta,\phi)=R_{n,l}(r)\,Y_{l,m}(\theta,\phi), \quad (31)

where,

R_{n,l}(r)=\sum_{v=0}^{(n-l)/2}q^{v}_{nl}\,r^{2v+l}, \quad (32)

and $q^{v}_{nl}$ is a scalar defined as,

q^{v}_{nl}=\frac{(-1)^{\frac{n-l}{2}}}{2^{n-l}}\sqrt{\frac{2n+3}{3}}\binom{n-l}{\frac{n-l}{2}}(-1)^{v}\,\frac{\binom{\frac{n-l}{2}}{v}\binom{2(\frac{n-l}{2}+l+v)+1}{n-l}}{\binom{\frac{n-l}{2}+l+v}{\frac{n-l}{2}}}. \quad (33)

Here, $Y_{l,m}(\theta,\phi)$ is the spherical harmonics function, $n\in\mathbb{Z}^{+}$, $l\in[0,n]$, $m\in[-l,l]$, and $n-l$ is even. 3D Zernike polynomials also satisfy the orthogonality property,

\int_{0}^{1}\int_{0}^{2\pi}\int_{0}^{\pi}Z_{n,l,m}(\theta,\phi,r)\,Z^{\dagger}_{n^{\prime},l^{\prime},m^{\prime}}\,r^{2}\sin\phi\,dr\,d\phi\,d\theta=\frac{4\pi}{3}\delta_{n,n^{\prime}}\delta_{l,l^{\prime}}\delta_{m,m^{\prime}}. \quad (34)

Since Zernike polynomials are complete in $\mathbb{B}^{3}$, any function $f\colon\mathbb{B}^{3}\rightarrow\mathbb{R}$ with finite energy can be rewritten as,

f(\theta,\phi,r)=\sum_{n=0}^{\infty}\sum_{l=0}^{n}\sum_{m=-l}^{l}\Omega_{n,l,m}(f)\,Z_{n,l,m}(\theta,\phi,r), \quad (35)

where $\Omega_{n,l,m}(f)$ can be obtained using,

\Omega_{n,l,m}(f)=\int_{0}^{1}\int_{0}^{2\pi}\int_{0}^{\pi}f(\theta,\phi,r)\,Z^{\dagger}_{n,l,m}\,r^{2}\sin\phi\,dr\,d\phi\,d\theta. \quad (36)
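For reference, a direct Python transcription of Eqs. 32 and 33 is sketched below. The function names are hypothetical, and the upper summation limit is taken as $(n-l)/2$, for which all binomial coefficients are well defined.

```python
import math

def q_coeff(n, l, v):
    # Scalar q_{nl}^v from Eq. 33 (n - l assumed even), transcribed term by term.
    k = (n - l) // 2
    return ((-1) ** k / 2 ** (n - l)) * math.sqrt((2 * n + 3) / 3) \
        * math.comb(n - l, k) * (-1) ** v \
        * math.comb(k, v) * math.comb(2 * (k + l + v) + 1, n - l) \
        / math.comb(k + l + v, k)

def zernike_radial(n, l, r):
    # Radial part R_{nl}(r) of the 3D Zernike polynomial (Eq. 32).
    return sum(q_coeff(n, l, v) * r ** (2 * v + l) for v in range((n - l) // 2 + 1))
```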

5 Image-to-Image translation

5.1 Sketch-to-shoes qualitative results

Additional qualitative results of the sketch-to-shoe translation task are shown in Fig. 37.

5.2 Map-to-photo qualitative results

Additional qualitative results of the map-to-photo translation task are shown in Fig. 38.

Refer to caption
Figure 37: Qualitative results of our model in sketch-to-shoe translation.
Refer to caption
Figure 38: Qualitative results of our model in map-to-photo translation.

6 Convergence at inference

A key aspect of our method is the optimization of the predictions at inference. Fig. 39 and Fig. 40 demonstrate this behaviour on the MNIST image completion and STL colorization tasks, respectively.

Refer to caption
Figure 39: The output improves as $z$ traverses towards the optimum position at inference. The left column is the input. The five right columns show outputs at iterations 2, 4, 6, 8 and 10 (from left to right).
Refer to caption
Figure 40: Output quality increases as $z\rightarrow z^{*}$ at inference.