Bi-level Feature Alignment for Versatile Image Translation and Manipulation

Max Planck Institute for Informatics, Germany
Nanyang Technological University, Singapore
fzhan@mpi-inf.mpg.de, {shijian.lu,ascymiao}@ntu.edu.sg
{yingchen001,ronglian001,jiahui003,kaiwen001,aoran.xiao}@e.ntu.edu.sg
Abstract
Generative adversarial networks (GANs) have achieved great success in image translation and manipulation. However, high-fidelity image generation with faithful style control remains a grand challenge in computer vision. This paper presents a versatile image translation and manipulation framework that achieves accurate semantic and style guidance in image generation by explicitly building a correspondence. To handle the quadratic complexity incurred by building dense correspondences, we introduce a bi-level feature alignment strategy that adopts a top-$k$ operation to rank block-wise features followed by dense attention between block features, which reduces the memory cost substantially. As the top-$k$ operation involves index swapping which precludes gradient propagation, we approximate the non-differentiable top-$k$ operation with a regularized earth mover's problem so that its gradient can be effectively back-propagated. In addition, we design a novel semantic position encoding mechanism that builds up a coordinate system for each individual semantic region to preserve texture structures while building correspondences. Further, we design a novel confidence feature injection module which mitigates the mismatch problem by fusing features adaptively according to the reliability of the built correspondences. Extensive experiments show that our method achieves superior performance qualitatively and quantitatively as compared with the state-of-the-art.
1 Introduction
Image translation and manipulation aim to generate and edit photo-realistic images conditioned on certain inputs such as semantic segmentation [32, 43], key points [39, 5] and layouts [19]. They have been studied intensively in recent years thanks to their wide spectrum of applications in various tasks [35, 30, 42]. However, achieving high-fidelity image translation and manipulation with faithful style control remains a grand challenge due to the high complexity of natural image styles. A typical approach to controlling image styles is to encode image features into a latent space with certain regularization (e.g., a Gaussian distribution) on the latent feature distribution. For example, Park et al. [32] utilize a VAE [4] to regularize the distribution of encoded features for faithful style control. However, VAEs struggle to encode the complex distribution of natural image styles and often suffer from posterior collapse [25], which leads to degraded style control. Another strategy is to encode reference images into style codes [3, 65] to provide style guidance in image generation, but style codes often capture only the global or regional style without explicit guidance for generating texture details.

To achieve more accurate style guidance and preserve details from the exemplar, Zhang et al. [59] build cross-domain correspondences with cosine similarity to achieve exemplar-based image translation. Zhou et al. [63] propose a GRU-assisted PatchMatch [1] method to build high-resolution correspondences efficiently. Since the textures within a semantic region share identical semantic information, these methods tend to build correspondences based on semantic coherence alone, without considering the structure coherence within each semantic region. Warping exemplars with such purely semantic correspondences may destroy the texture patterns in the warped exemplars and consequently result in inaccurate guidance for image generation.
This paper presents RABIT, a Ranking and Attention scheme with Bi-level feature alignment for versatile Image Translation and manipulation. To mitigate the quadratic computational complexity of building dense correspondences between conditional inputs (semantic guidance) and exemplars (style guidance), we design a bi-level alignment strategy with a Ranking and Attention Scheme (RAS) which builds feature correspondences efficiently at two levels: 1) a top-$k$ ranking operation that dynamically generates block-wise ranking matrices; 2) a dense attention module that builds dense correspondences between features within blocks, as illustrated in Fig. 1. RAS enables building high-resolution correspondences and reduces the memory cost from $\mathcal{O}(N^2)$ to $\mathcal{O}((N/b)^2 + Nkb)$ ($N$ is the number of features for alignment, $b$ is the block size, and $k \ll N$ is the number of retrieved blocks). However, the top-$k$ operation involves index swapping whose gradient cannot be back-propagated in networks. To address this issue, we approximate the top-$k$ ranking operation with a regularized Earth Mover's problem [34] which enables effective gradient back-propagation.
As in [59, 63], building correspondences based on semantic information alone often leads to the loss of texture structures and patterns in warped exemplars. Thus, spatial information should also be incorporated to preserve the texture structures and patterns and yield more accurate feature correspondences. A vanilla method to encode position information is to concatenate the semantic features with the corresponding feature coordinates via coordconv [22]. However, this vanilla position encoding builds a single coordinate system for the whole image and ignores the position information within each semantic region. Instead, we design a semantic position encoding (SPE) mechanism that builds a dedicated coordinate system for each semantic region, which outperforms the vanilla position encoding significantly.
In addition, conditional inputs and exemplars are seldom perfectly matched, e.g., a conditional input could contain several semantic classes that do not exist in the exemplar image. Under such circumstances, the built correspondences often contain errors which lead to inaccurate exemplar warping and further deteriorate the generated images. We tackle this problem by designing a CONfidence Feature Injection (CONFI) module that fuses the features of conditional inputs and warped exemplars according to the reliability of the built correspondences. Although the warped exemplar may not be reliable, the conditional input always provides accurate semantic guidance in image generation. The CONFI module thus assigns higher weights to the conditional input when the built correspondence (or warped exemplar) is unreliable. Experiments show that CONFI helps to generate faithful yet high-fidelity images consistently.
The contributions of this work can be summarized in three aspects. First, we propose a versatile image translation and manipulation framework which introduces a Ranking and Attention Scheme for bi-level feature alignment that greatly reduces the memory cost of building the correspondence between conditional inputs and exemplars. Second, we introduce a semantic position encoding mechanism that encodes region-level position information to preserve texture structures and patterns. Third, we design a confidence feature injection module that provides reliable feature guidance in image translation and manipulation.
2 Related Work
2.1 Image-to-Image Translation
Image translation has achieved remarkable progress in learning the mapping between images of different domains. It can be applied to various tasks such as style transfer [10, 7, 20], image super-resolution [16, 21, 15, 58], domain adaptation [36, 30, 8, 41, 49], image composition [57, 48, 55, 51, 54], etc. To achieve high-fidelity and flexible translation, existing work uses different conditional inputs such as semantic segmentation [12, 43, 32, 53, 56], scene layouts [38, 60, 19, 52], key points [27, 29, 5, 50], and edge maps [12, 6]. However, effective style control remains a challenging task in image translation.
Style control has attracted increasing attention in image translation and generation. Earlier works such as [14] regularize the latent feature distribution to control the generation outcome. However, they struggle to capture the complex textures of natural images. Style encoding has been studied to address this issue. For example, [11] and [26] transfer style codes from exemplars to source images via adaptive instance normalization (AdaIN) [10]. [3] employs a style encoder for style consistency between exemplars and translated images. [65] designs semantic region-adaptive normalization (SEAN) to control the style of each semantic region individually. However, encoding style exemplars tends to capture the overall image style and ignores the texture details in local regions. To achieve accurate style guidance for each local region, Zhang et al. [59] build dense semantic correspondences between conditional inputs and exemplars with cosine similarity to capture accurate exemplar details. To mitigate the quadratic complexity issue and enable high-resolution correspondence building, Zhou et al. [63] introduce GRU-assisted PatchMatch to efficiently establish high-resolution correspondences.
2.2 Semantic Image Editing
The rise of generative adversarial networks (GANs) has brought revolutionary advances to image editing [64, 9, 31, 2, 33, 45, 44, 46]. As one of the most intuitive representations for image editing, semantic information has been extensively investigated in conditional image synthesis. For example, Park et al. [32] introduce spatially-adaptive normalization (SPADE) to inject guided features into image generation. MaskGAN [17] exploits a dual-editing consistency as auxiliary supervision for robust face image manipulation. Instead of directly learning a label-to-pixel mapping, Hong et al. [9] propose a semantic manipulation framework, HIM, which generates images guided by a predicted semantic layout. Building on this work, Ntavelis et al. [31] propose SESAME, which requires only local semantic maps to achieve image manipulation. However, the aforementioned methods either learn only a global feature without local focus (e.g., MaskGAN [17]) or ignore the features in the editing regions of the original image (e.g., HIM [9], SESAME [31]). To better utilize the fine features in the original image, Zheng et al. [61] adapt the exemplar-based image synthesis framework CoCosNet [59] for semantic image manipulation by building a high-resolution correspondence between the original image and the edited semantic map.

3 Proposed Method
The proposed RABIT consists of an alignment network and a generation network that are inter-connected as shown in Fig. 2. The alignment network learns the correspondence between a conditional input and an exemplar for warping the exemplar to be aligned with the conditional input. The generation network produces the final image under the guidance of the warped exemplar and the conditional input. RABIT is directly applicable to conditional image translation with an extra exemplar as style guidance. It is also applicable to image manipulation by treating the exemplar as the original image to be edited and the conditional input as the edited semantic map. The detailed loss functions can be found in the supplementary material.
3.1 Alignment Network
The alignment network aims to build the correspondence between conditional inputs and exemplars, and accordingly provide accurate style guidance by warping the exemplars to be aligned with the conditional inputs. As shown in Fig. 2, the conditional input $X$ and exemplar $Z$ are fed to feature extractors $F_X$ and $F_Z$ to extract two sets of feature vectors $\mathbf{X} \in \mathbb{R}^{N \times C}$ and $\mathbf{Z} \in \mathbb{R}^{N \times C}$, where $N$ and $C$ denote the number and dimension of feature vectors, respectively. Then $\mathbf{X}$ and $\mathbf{Z}$ can be aligned by building a dense correspondence matrix $\mathcal{A} \in \mathbb{R}^{N \times N}$, where each entry denotes the cosine similarity between the corresponding feature vectors in $\mathbf{X}$ and $\mathbf{Z}$.
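For illustration, a minimal PyTorch-style sketch of such a dense cosine-similarity correspondence (the tensor layout and function name are our assumptions, not the paper's implementation):

```python
import torch
import torch.nn.functional as F

def dense_correspondence(feat_x: torch.Tensor, feat_z: torch.Tensor) -> torch.Tensor:
    """Dense cosine-similarity correspondence between two feature sets.

    feat_x, feat_z: (N, C) features extracted from the conditional input and
    the exemplar. Returns an (N, N) correspondence matrix; its quadratic size
    is exactly the cost that the bi-level scheme below reduces.
    """
    x = F.normalize(feat_x, dim=-1)  # L2-normalize so dot products are cosines
    z = F.normalize(feat_z, dim=-1)
    return x @ z.t()
```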
Semantic Position Encoding. Existing works [59, 63] mainly rely on semantic features to establish the correspondences. However, as textures within a semantic region share the same semantic feature, a purely semantic correspondence fails to preserve the texture structures or patterns within each semantic region. Thus, the position information of features can be exploited to preserve the texture structures and patterns. A vanilla method to encode position information is to employ a simple coordconv [22] that builds a global coordinate system for the full image. However, this vanilla position encoding mechanism uses a single coordinate system for the whole image, ignoring region-wise semantic differences. To preserve the fine texture pattern within each semantic region, we design a semantic position encoding (SPE) mechanism that builds a dedicated coordinate system for each semantic region as shown in Fig. 3. Specifically, SPE treats the center of each semantic region as the origin of its coordinate system, and the coordinates within each semantic region are normalized to [-1, 1]. The proposed SPE outperforms the vanilla position encoding significantly, as shown in Fig. 6 and evaluated in experiments.
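A minimal sketch of SPE as described above, assuming a single-channel integer label map and using each region's bounding-box center and extent for normalization (the paper's exact normalization may differ in detail):

```python
import numpy as np

def semantic_position_encoding(seg: np.ndarray) -> np.ndarray:
    """Per-region coordinates normalized to [-1, 1] within each semantic region.

    seg: (H, W) integer semantic label map.
    Returns: (H, W, 2) coordinates whose origin is the center of each region.
    """
    H, W = seg.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    coords = np.zeros((H, W, 2), dtype=np.float32)
    for label in np.unique(seg):
        mask = seg == label
        y0, y1 = ys[mask].min(), ys[mask].max()
        x0, x1 = xs[mask].min(), xs[mask].max()
        cy, cx = (y0 + y1) / 2.0, (x0 + x1) / 2.0          # region center = origin
        sy = max((y1 - y0) / 2.0, 1.0)                      # half-extent for scaling
        sx = max((x1 - x0) / 2.0, 1.0)
        coords[mask, 0] = (ys[mask] - cy) / sy              # values lie in [-1, 1]
        coords[mask, 1] = (xs[mask] - cx) / sx
    return coords
```

The resulting two-channel map can then be concatenated with the semantic features before building correspondences, analogous to coordconv but with the coordinate system reset per region.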

Bi-level Feature Alignment. On the other hand, building dense correspondences has quadratic complexity which incurs large memory and computation costs. Most existing studies thus work with low-resolution exemplar images (e.g., 64 × 64 in CoCosNet [59]) and often struggle to generate realistic images with fine texture details. In this work, we propose a bi-level alignment strategy via a novel ranking and attention scheme (RAS) that greatly reduces the computational cost and allows building correspondences with high-resolution images as shown in Fig. 6. Instead of building correspondences between features directly, the bi-level alignment strategy builds correspondences at two levels: the first level introduces top-$k$ ranking to generate block-wise ranking matrices dynamically, and the second level performs dense attention between the features within blocks. As Fig. 2 shows, neighboring local features are grouped into blocks of size $b$, so the features of the conditional input and the exemplar are each partitioned into $M$ blocks ($M = N/b$), denoted by $\{\mathbf{X}_1, \dots, \mathbf{X}_M\}$ and $\{\mathbf{Z}_1, \dots, \mathbf{Z}_M\}$. In the first level of top-$k$ ranking, each block of the conditional input serves as a query to retrieve the top-$k$ blocks from the exemplar according to the cosine similarity between blocks. In the second level of local attention, the features in each query block further attend to the features in the top-$k$ retrieved blocks to build local attention matrices within the block features. The correspondence between the exemplar and the conditional input can thus be built much more efficiently by combining such inter-block ranking and intra-block attention.
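The two levels can be sketched as follows, using a simplified, hard top-$k$ for clarity (the mean-pooled block descriptors, softmax temperature, and warping by attention-weighted sum are our illustrative assumptions; the differentiable top-$k$ replacement is described next):

```python
import torch
import torch.nn.functional as F

def ras_warp(x_feat, z_feat, block=4, k=3, tau=100.0):
    """Bi-level ranking-and-attention sketch (hard top-k for clarity).

    x_feat, z_feat: (N, C) features of the conditional input / exemplar,
    arranged so that every `block` consecutive rows form one block.
    Returns exemplar features warped toward the conditional input, (N, C).
    """
    N, C = x_feat.shape
    M = N // block
    xb = F.normalize(x_feat, dim=-1).view(M, block, C)
    zb = F.normalize(z_feat, dim=-1).view(M, block, C)

    # Level 1: block ranking. Pool each block into one descriptor and keep
    # the top-k exemplar blocks per query block -- only an (M, M) problem.
    scores = xb.mean(dim=1) @ zb.mean(dim=1).t()            # (M, M)
    topk_idx = scores.topk(k, dim=-1).indices               # (M, k)

    # Level 2: dense attention restricted to the k retrieved blocks.
    z_sel = zb[topk_idx].reshape(M, k * block, C)           # (M, k*b, C)
    attn = torch.softmax((xb @ z_sel.transpose(1, 2)) * tau, dim=-1)
    warped = attn @ z_sel                                    # (M, b, C)
    return warped.reshape(N, C)
```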


The ranking and attention scheme employs a top-$k$ operation to rank the correlated blocks. However, the original top-$k$ operation involves index swapping whose gradient cannot be computed, so it cannot be integrated into end-to-end network training. Inspired by Xie et al. [47], we tackle this issue by formulating top-$k$ ranking as a regularized earth mover's problem which allows gradient computation via implicit differentiation. The earth mover's problem aims to find a transport plan that minimizes the total cost of transforming one distribution into another. Consider two discrete distributions $U$ and $V$ defined on supports $\mathcal{A} = \{a_1, \dots, a_n\}$ and $\mathcal{B} = \{b_1, \dots, b_m\}$, with probabilities (or amounts of earth) $u_i$ and $v_j$. We define $C$ as the cost matrix where $C_{ij}$ denotes the cost of transportation between $a_i$ and $b_j$, and $T$ as a transport plan where $T_{ij}$ denotes the amount of earth transported between $a_i$ and $b_j$. An earth mover's (EM) problem can be formulated as
$$\min_{T \ge 0} \ \langle C, T \rangle \quad \text{s.t.} \quad T \mathbf{1}_m = U, \ \ T^{\top} \mathbf{1}_n = V,$$
where $\mathbf{1}$ denotes a vector of ones and $\langle \cdot, \cdot \rangle$ denotes the inner product. By treating the correlation scores between a query block and the $n$ key blocks as the support $\mathcal{A}$, and defining $\mathcal{B} = \{0, 1\}$, $U = \mathbf{1}_n / n$ and $V = [(n-k)/n, \ k/n]$, it can be proved that solving the earth mover's problem is equivalent to selecting the $k$ largest elements from the correlation scores. The detailed proof and optimization of the earth mover's problem are provided in the supplementary material. Fig. 5 illustrates the earth mover's problem and the transport plan which indicates the top-$k$ elements.
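A minimal, unrolled Sinkhorn-style sketch of this relaxation, following the construction above; the cost design, regularization strength, and iteration count are illustrative assumptions on our part (the paper's actual solver and proof are in its supplementary material):

```python
import torch

def soft_topk(scores: torch.Tensor, k: int, eps: float = 0.1, iters: int = 100):
    """Differentiable (soft) top-k via an entropy-regularized earth mover's problem.

    scores: (n,) correlation scores of one query block against n key blocks.
    Returns gamma: (n,) soft top-k membership (approximately sums to k);
    gradients flow through the unrolled Sinkhorn iterations.
    """
    n = scores.numel()
    s = (scores - scores.min()) / (scores.max() - scores.min() + 1e-8)
    # Supports B = {0, 1}: cheap to send high scores to 1, low scores to 0.
    C = torch.stack([s, 1.0 - s], dim=1)                              # (n, 2) cost
    mu = torch.full((n,), 1.0 / n, device=scores.device, dtype=scores.dtype)
    nu = torch.tensor([(n - k) / n, k / n], device=scores.device, dtype=scores.dtype)
    K = torch.exp(-C / eps)                                           # Gibbs kernel
    b = torch.ones(2, device=scores.device, dtype=scores.dtype)
    for _ in range(iters):                                            # Sinkhorn updates
        a = mu / (K @ b)
        b = nu / (K.t() @ a)
    T = a[:, None] * K * b[None, :]                                   # transport plan
    return T[:, 1] * n                                                # mass sent toward "1"
```

In training, such soft memberships can gate the retrieved blocks per query; at inference, the hard top-$k$ can be used directly.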
Complexity Analysis. The vanilla dense correspondence has a self-attention memory complexity of $\mathcal{O}(N^2)$, where $N$ is the input sequence length. For our bi-level alignment strategy, the memory complexities of building the block ranking matrices and the local attention matrices are $\mathcal{O}((N/b)^2)$ and $\mathcal{O}(Nkb)$, where $b$, $M$ ($M = N/b$) and $k$ denote the block size, the block number, and the number of top-$k$ selections, respectively. Thus, the overall memory complexity is $\mathcal{O}((N/b)^2 + Nkb)$.
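As a rough worked example with hypothetical values chosen by us only to illustrate the arithmetic (not quoted from the paper), $N = 128 \times 128 = 16384$, $b = 4$ and $k = 3$ give
$$N^2 \approx 2.7 \times 10^{8} \qquad \text{vs.} \qquad \Big(\frac{N}{b}\Big)^2 + Nkb = 4096^2 + 16384 \cdot 3 \cdot 4 \approx 1.7 \times 10^{7},$$
i.e., roughly a 16× reduction in correspondence memory.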

3.2 Generation Network
The generation network aims to synthesize images under the semantic guidance of conditional inputs and style guidance of exemplars. The overall architecture of the generation network is similar to SPADE [32]. Please refer to supplementary material for details of the network structure.
The state-of-the-art approach [59] simply concatenates the warped exemplar and the conditional input to guide the image generation process. However, the warped input image and the edited semantic map could be structurally aligned yet semantically different, especially when they have a severe semantic discrepancy. Such unreliably warped exemplars can serve as false guidance and heavily deteriorate the generation performance. Therefore, a mechanism is required to identify the semantic reliability of the warped exemplar so as to provide reliable guidance for the generation network. To this end, we propose a CONfidence Feature Injection (CONFI) module that adaptively weights the features of the conditional input and the warped exemplar according to the reliability of the feature matching.
Confidence Feature Injection. Intuitively, when the feature correspondence is less reliable, we should assign a relatively lower weight to the warped exemplar, which provides unreliable style guidance, and a higher weight to the conditional input, which consistently provides accurate semantic guidance.
As illustrated in Fig. 5, the proposed CONFI fuses the features of the conditional input and the warped exemplar based on a confidence map (CMAP) that captures the reliability of the feature correspondence. To derive the confidence map, we first obtain a block-wise correlation map $\mathcal{A}^{blk}$ of size $M \times M$ by computing the cosine similarity between the block features of $\mathbf{X}$ and $\mathbf{Z}$. For a block $\mathbf{X}_i$, its correlation score with an exemplar block $\mathbf{Z}_j$ is denoted by $\mathcal{A}^{blk}_{ij}$. As higher correlation scores indicate more reliable feature matching, we treat the peak value of the $i$-th row $\mathcal{A}^{blk}_{i}$ as the confidence score of $\mathbf{X}_i$. Proceeding similarly for the other blocks, we obtain the confidence map (CMAP) with one confidence score per block, which captures the semantic reliability of all blocks. The features of the conditional input and the exemplar (both of size $H \times W \times C$ after passing through convolution layers) can thus be fused via a weighted sum based on CMAP:
$$F = \mathrm{CMAP} \odot (\mathcal{A}\mathbf{Z}) + (1 - \mathrm{CMAP}) \odot \mathbf{X},$$
where $\mathcal{A}$ is the built correspondence matrix (so $\mathcal{A}\mathbf{Z}$ is the warped exemplar). As the confidence map contains only one channel, the above feature fusion is conducted spatially but ignores fusion along the channel dimension. To achieve thorough feature fusion across all channels, we feed the initial fusion $F$ to convolution layers to generate a multi-channel confidence map (Multi-CMAP) of size $H \times W \times C$. The conditional input and warped exemplar are then thoroughly fused via a full channel-weighted summation according to the Multi-CMAP. The final fused feature is further injected into the generation process via spatial de-normalization [32] to provide accurate semantic and style guidance.
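A simplified sketch of the first, single-channel fusion stage (the square block-grid layout, sigmoid squashing, and nearest-neighbor upsampling are our assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def confi_fuse(x_feat, warped_z, block_scores):
    """Confidence feature injection, initial single-channel stage (sketch).

    x_feat, warped_z: (C, H, W) features of the conditional input and the
    warped exemplar. block_scores: (M, M) block-wise correlation map, with
    the M query blocks assumed to lie on a sqrt(M) x sqrt(M) grid.
    """
    C, H, W = x_feat.shape
    m = int(block_scores.shape[0] ** 0.5)                  # blocks per side
    # Peak correlation of each query block serves as its confidence score.
    cmap = block_scores.max(dim=-1).values.view(1, 1, m, m)
    cmap = torch.sigmoid(cmap)                             # squash to (0, 1)
    cmap = F.interpolate(cmap, size=(H, W), mode="nearest").squeeze(0)
    # Low confidence -> rely on the conditional input's semantic guidance.
    return cmap * warped_z + (1.0 - cmap) * x_feat
```

The fused map would then pass through convolutions to predict the multi-channel Multi-CMAP for the full channel-wise fusion described above.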
4 Loss Functions
The alignment network and the generation network are jointly optimized. For clarity, we still denote the conditional input and exemplar as $X$ and $Z$, the ground truth as $Y$, the generated image as $\hat{Y}$, the feature extractors for the conditional input and exemplar as $F_X$ and $F_Z$, and the generator and discriminator in the generation network as $G$ and $D$.
Alignment Network. First, the warping should be cycle consistent, i.e., the exemplar should be recoverable from the warped exemplar. We thus employ a cycle-consistency loss as follows:
$$\mathcal{L}_{cyc} = \left\| \mathcal{A}^{\top} (\mathcal{A}\mathbf{Z}) - \mathbf{Z} \right\|_{1},$$
where $\mathcal{A}$ is the correspondence matrix. The feature extractors $F_X$ and $F_Z$ aim to extract invariant semantic information across domains, i.e., the features extracted from the conditional input $X$ and its ground truth $Y$ should be consistent. A feature consistency loss can thus be formulated as follows:
$$\mathcal{L}_{feat} = \left\| F_X(X) - F_Z(Y) \right\|_{1}.$$
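A short sketch of these two alignment losses, assuming the formulations written above (tensor layout and mean reduction are our choices):

```python
import torch

def alignment_losses(corr, z_feat, x_feat, y_feat):
    """Cycle-consistency and feature-consistency losses (sketch).

    corr: (N, N) soft correspondence matrix A (rows sum to 1).
    z_feat: (N, C) exemplar features.
    x_feat, y_feat: (N, C) features of the conditional input and its ground
    truth, produced by the two feature extractors.
    """
    warped = corr @ z_feat                        # exemplar warped to the input
    cycled = corr.t() @ warped                    # warped back to the exemplar
    loss_cyc = (cycled - z_feat).abs().mean()     # L1 cycle consistency
    loss_feat = (x_feat - y_feat).abs().mean()    # cross-domain feature consistency
    return loss_cyc, loss_feat
```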
Generation Network. The generation network employs several losses for high-fidelity synthesis with a style consistent with the exemplar and semantics consistent with the conditional input. As the generated image $\hat{Y}$ should be semantically consistent with the ground truth $Y$, we employ a perceptual loss [13] to penalize their semantic discrepancy as below:
$$\mathcal{L}_{perc} = \sum_{l} \left\| \phi_{l}(\hat{Y}) - \phi_{l}(Y) \right\|_{1}, \qquad (1)$$
where $\phi_{l}$ is the activation of layer $l$ in the pre-trained VGG-19 [37] model. To ensure statistical consistency between the generated image $\hat{Y}$ and the exemplar $Z$, a contextual loss [28] is adopted:
$$\mathcal{L}_{cxt} = -\log\Big( \frac{1}{n_l} \sum_{i} \max_{j} \mathrm{CX}\big( \phi_{l}^{i}(\hat{Y}), \phi_{l}^{j}(Z) \big) \Big), \qquad (2)$$
where $i$ and $j$ are the indexes of the feature maps in layer $l$, $n_l$ is the number of features in layer $l$, and $\mathrm{CX}$ denotes the contextual similarity [28]. Besides, a pseudo-pairs loss as described in [59] is included in training.
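For reference, a compact sketch of the contextual similarity term on a single VGG layer, following Mechrez et al. [28] (the bandwidth, feature centering, and direction of the max are implementation choices on our side):

```python
import torch
import torch.nn.functional as F

def contextual_loss(feat_gen, feat_ex, h=0.5, eps=1e-5):
    """Contextual loss between generated-image and exemplar features (sketch).

    feat_gen, feat_ex: (N, C) features from one VGG-19 layer.
    """
    g = F.normalize(feat_gen - feat_ex.mean(0), dim=-1)
    e = F.normalize(feat_ex - feat_ex.mean(0), dim=-1)
    d = 1.0 - g @ e.t()                                   # cosine distances (N, N)
    d_norm = d / (d.min(dim=-1, keepdim=True).values + eps)
    w = torch.exp((1.0 - d_norm) / h)                     # convert to affinities
    cx = w / w.sum(dim=-1, keepdim=True)                  # contextual similarity
    return -torch.log(cx.max(dim=-1).values.mean() + eps)
```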
The discriminator $D$ is employed to drive adversarial generation with an adversarial loss $\mathcal{L}_{adv}$ [12]. The full network is thus optimized with the following objective:
$$\mathcal{L} = \lambda_{cyc}\mathcal{L}_{cyc} + \lambda_{feat}\mathcal{L}_{feat} + \lambda_{perc}\mathcal{L}_{perc} + \lambda_{cxt}\mathcal{L}_{cxt} + \lambda_{pse}\mathcal{L}_{pse} + \lambda_{adv}\mathcal{L}_{adv}, \qquad (3)$$
where the weights $\lambda$ balance the losses in the objective and $\mathcal{L}_{pse}$ denotes the pseudo-pairs loss.
5 Experiments
5.1 Experimental Settings
Datasets. We evaluate and benchmark our method over multiple datasets for image translation & manipulation tasks.
ADE20K [62] is adopted for image translation conditioned on semantic segmentation. For image manipulation, we apply object-level affine transformations on the test set to acquire paired data (150 images) for evaluations as in [61].
CelebA-HQ [24] is used for two translation tasks by using face semantics and face edges as conditional inputs. We use 2993 face images for translation evaluations as in [59], and manually edit 100 randomly selected semantic maps for image manipulation evaluations.
DeepFashion [23] is used for image translation conditioned on key points.
Implementation Details: The default resolution for correspondence computation is 128 × 128 with a block size of 2 × 2. The number $k$ in top-$k$ ranking is set to 3 by default in our experiments. The default size of generated images is 256 × 256.
5.2 Image Translation Experiments
We compare RABIT with several state-of-the-art image translation methods.
 | ADE20K | | | CelebA-HQ (Semantic) | | | DeepFashion | | | CelebA-HQ (Edge) | |
---|---|---|---|---|---|---|---|---|---|---|---|---
Methods | FID | SWD | LPIPS | FID | SWD | LPIPS | FID | SWD | LPIPS | FID | SWD | LPIPS |
Pix2pixHD[43] | 81.80 | 35.70 | N/A | 43.69 | 34.82 | N/A | 25.20 | 16.40 | N/A | 42.70 | 33.30 | N/A |
StarGAN v2[3] | 98.72 | 65.47 | 0.551 | 53.20 | 41.87 | 0.324 | 43.29 | 30.87 | 0.296 | 48.63 | 41.96 | 0.214 |
SPADE[32] | 33.90 | 19.70 | 0.344 | 39.17 | 29.78 | 0.254 | 36.20 | 27.80 | 0.231 | 31.50 | 26.90 | 0.207 |
SelectionGAN[40] | 35.10 | 21.82 | 0.382 | 42.41 | 30.32 | 0.277 | 38.31 | 28.21 | 0.223 | 34.67 | 27.34 | 0.191 |
SMIS[66] | 42.17 | 22.67 | 0.476 | 28.21 | 24.65 | 0.301 | 22.23 | 23.73 | 0.240 | 23.71 | 22.23 | 0.201 |
SEAN[65] | 24.84 | 10.42 | 0.499 | 17.66 | 14.13 | 0.285 | 16.28 | 17.52 | 0.251 | 16.84 | 14.94 | 0.203 |
CoCosNet[59] | 26.40 | 10.50 | 0.580 | 21.83 | 12.13 | 0.292 | 14.40 | 17.20 | 0.272 | 14.30 | 15.30 | 0.208 |
RABIT | 24.35 | 9.893 | 0.571 | 20.44 | 11.18 | 0.307 | 12.58 | 16.03 | 0.284 | 11.67 | 14.22 | 0.209 |


Quantitative Results. In quantitative experiments, all methods translate images with the same exemplars except Pix2pixHD [43], which does not support style injection from exemplars. LPIPS is calculated over images generated with randomly selected exemplars: each compared method generates three images per conditional input with three different exemplars, and the final LPIPS is obtained by averaging the LPIPS between every pair of generated images.
Table 1 shows the experimental results. It can be seen that RABIT outperforms all compared methods over most metrics and tasks consistently. By building explicit yet accurate correspondences between conditional inputs and exemplars, RABIT enables direct and accurate guidance from the exemplar and achieves better translation quality (in FID and SWD) and diversity (in LPIPS) as compared with regularization-based methods such as SPADE [32] and SMIS [66], and style-encoding methods such as StarGAN v2 [3] and SEAN [65]. Compared with the correspondence-based method CoCosNet [59], the proposed bi-level alignment allows RABIT to build correspondences and warp exemplars at higher resolutions (e.g., 128 × 128), which offers more detailed guidance in the generation process and helps to achieve better FID and SWD. Compared with CoCosNet v2 [63], the proposed semantic position encoding preserves the texture structures and patterns, thus yielding more accurate warped exemplars as guidance. Besides generation quality, RABIT achieves the best generation diversity in LPIPS except for StarGAN v2 [3], which sacrifices generation quality with much worse FID and SWD.
ADE20K [62] | | | | CelebA-HQ [24] | | |
---|---|---|---|---|---|---|---
Models | FID | PSNR | SSIM | Models | FID | SWD | LPIPS |
SPADE [32] | 120.2 | 13.11 | 0.334 | SPADE [32] | 105.1 | 41.90 | 0.376 |
HIM [9] | 59.89 | 18.23 | 0.667 | SEAN [65] | 96.31 | 35.90 | 0.351 |
SESAME [31] | 52.51 | 18.67 | 0.691 | MaskGAN [17] | 80.89 | 23.86 | 0.271 |
CoCosNet [59] | 41.03 | 20.30 | 0.744 | CoCosNet [59] | 68.70 | 22.90 | 0.224 |
RABIT | 26.61 | 23.08 | 0.823 | RABIT | 60.87 | 21.07 | 0.176 |

5.3 Image Manipulation Experiment
RABIT manipulates images by treating input images as exemplars and edited semantic guidance as conditional inputs. We compare RABIT with several state-of-the-art image manipulation methods including 1) SPADE [32], 2) SEAN [65], 3) MaskGAN [18], 4) Hierarchical Image Manipulation (HIM) [9], 5) SESAME [31], 6) CoCosNet [59].

Quantitative Results: In quantitative experiments, all compared methods manipulate images with the same input image and edited semantic label map. The left side of Table 2 shows experimental results over the synthesized test set of ADE20K [62]. It can be observed that RABIT outperforms state-of-the-art methods over all evaluation metrics consistently. The right side of Table 2 shows experimental results over the CelebA-HQ dataset with manually edited semantic maps, where RABIT again outperforms the state-of-the-art methods by large margins in all perceptual quality metrics.
Qualitative Evaluation: Fig. 9 shows visual comparisons with state-of-the-art manipulation methods on ADE20K. Fig. 10 shows the editing capacity of RABIT with various types of manipulation on semantic labels. We also compare RABIT with MaskGAN [17] on CelebA-HQ [18] in Fig. 12.

6 User Study
We conduct crowdsourcing user studies through Amazon Mechanical Turk (AMT) to evaluate image translation & manipulation in terms of generation quality and style consistency. Specifically, each compared method generates 100 images with the same conditional inputs and exemplars. The generated images, together with the conditional inputs and exemplars, were then presented to 10 users for assessment. For the evaluation of image quality, the users were instructed to pick the best-quality images. For the evaluation of style consistency, the users were instructed to select the images with the best style relevance to the exemplar. The final AMT score of each method is the average number of times it was selected as producing the best quality or the best style relevance.
Fig. 11 shows AMT results on multiple datasets. It can be observed that RABIT outperforms state-of-the-art methods consistently in image quality and style consistency on both image translation & image manipulation tasks.

7 Conclusions
This paper presents RABIT, a versatile conditional image translation & manipulation framework that adopts a novel bi-level alignment strategy with a ranking and attention scheme (RAS) to align the features between conditional inputs and exemplars efficiently. A semantic position encoding mechanism is designed to incorporate semantic-level position information and preserve the texture patterns of the exemplars. To handle semantic mismatches between the conditional inputs and warped exemplars, a novel confidence feature injection module is proposed to achieve multi-channel feature fusion based on the matching reliability of the warped exemplars. Quantitative and qualitative experiments over multiple datasets show that RABIT achieves high-fidelity image translation and manipulation while preserving semantics consistent with the conditional input and styles faithful to the exemplar.
Acknowledgement. This study is supported under the RIE2020 Industry Alignment Fund – Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contribution from the industry partner(s).
References
- [1] Barnes, C., Shechtman, E., Finkelstein, A., Goldman, D.B.: Patchmatch: A randomized correspondence algorithm for structural image editing. ACM Trans. Graph. 28(3), 24 (2009)
- [2] Choi, Y., Choi, M., Kim, M., Ha, J.W., Kim, S., Choo, J.: Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 8789–8797 (2018)
- [3] Choi, Y., Uh, Y., Yoo, J., Ha, J.W.: Stargan v2: Diverse image synthesis for multiple domains. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8188–8197 (2020)
- [4] Doersch, C.: Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908 (2016)
- [5] Dong, H., Liang, X., Gong, K., Lai, H., Zhu, J., Yin, J.: Soft-gated warping-gan for pose-guided person image synthesis. arXiv preprint arXiv:1810.11610 (2018)
- [6] Fu, Y., Ma, J., Ma, L., Guo, X.: Edit: Exemplar-domain aware image-to-image translation. arXiv preprint arXiv:1911.10520 (2019)
- [7] Gatys, L.A., Ecker, A.S., Bethge, M.: Image style transfer using convolutional neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2414–2423 (2016)
- [8] Hoffman, J., Tzeng, E., Park, T., Zhu, J.Y., Isola, P., Saenko, K., Efros, A., Darrell, T.: Cycada: Cycle-consistent adversarial domain adaptation. In: International conference on machine learning. pp. 1989–1998. PMLR (2018)
- [9] Hong, S., Yan, X., Huang, T., Lee, H.: Learning hierarchical semantic image manipulation through structured representations. In: NIPS (2018)
- [10] Huang, X., Belongie, S.: Arbitrary style transfer in real-time with adaptive instance normalization. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1501–1510 (2017)
- [11] Huang, X., Liu, M.Y., Belongie, S., Kautz, J.: Multimodal unsupervised image-to-image translation. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 172–189 (2018)
- [12] Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1125–1134 (2017)
- [13] Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution. In: European conference on computer vision. pp. 694–711. Springer (2016)
- [14] Kingma, D.P., Welling, M.: Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013)
- [15] Lai, W.S., Huang, J.B., Ahuja, N., Yang, M.H.: Deep laplacian pyramid networks for fast and accurate super-resolution. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 624–632 (2017)
- [16] Ledig, C., Theis, L., Huszár, F., Caballero, J., Cunningham, A., Acosta, A., Aitken, A., Tejani, A., Totz, J., Wang, Z., et al.: Photo-realistic single image super-resolution using a generative adversarial network. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4681–4690 (2017)
- [17] Lee, C.H., Liu, Z., Wu, L., Luo, P.: Maskgan: Towards diverse and interactive facial image manipulation. In: CVPR. pp. 5549–5558 (2020)
- [18] Lee, C.H., Liu, Z., Wu, L., Luo, P.: Maskgan: Towards diverse and interactive facial image manipulation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2020)
- [19] Li, Y., Cheng, Y., Gan, Z., Yu, L., Wang, L., Liu, J.: Bachgan: High-resolution image synthesis from salient object layout. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2020)
- [20] Li, Y., Fang, C., Yang, J., Wang, Z., Lu, X., Yang, M.H.: Universal style transfer via feature transforms. arXiv preprint arXiv:1705.08086 (2017)
- [21] Lim, B., Son, S., Kim, H., Nah, S., Mu Lee, K.: Enhanced deep residual networks for single image super-resolution. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops. pp. 136–144 (2017)
- [22] Liu, R., Lehman, J., Molino, P., Petroski Such, F., Frank, E., Sergeev, A., Yosinski, J.: An intriguing failing of convolutional neural networks and the coordconv solution. In: Advances in Neural Information Processing Systems (2018)
- [23] Liu, Z., Luo, P., Qiu, S., Wang, X., Tang, X.: Deepfashion: Powering robust clothes recognition and retrieval with rich annotations. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1096–1104 (2016)
- [24] Liu, Z., Luo, P., Wang, X., Tang, X.: Deep learning face attributes in the wild. In: Proceedings of the IEEE international conference on computer vision. pp. 3730–3738 (2015)
- [25] Lucas, J., Tucker, G., Grosse, R., Norouzi, M.: Don’t blame the elbo! a linear vae perspective on posterior collapse. arXiv preprint arXiv:1911.02469 (2019)
- [26] Ma, L., Jia, X., Georgoulis, S., Tuytelaars, T., Van Gool, L.: Exemplar guided unsupervised image-to-image translation with semantic consistency. In: International Conference on Learning Representations (2018)
- [27] Ma, L., Jia, X., Sun, Q., Schiele, B., Tuytelaars, T., Van Gool, L.: Pose guided person image generation. In: Advances in neural information processing systems. pp. 406–416 (2017)
- [28] Mechrez, R., Talmi, I., Zelnik-Manor, L.: The contextual loss for image transformation with non-aligned data. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 768–783 (2018)
- [29] Men, Y., Mao, Y., Jiang, Y., Ma, W.Y., Lian, Z.: Controllable person image synthesis with attribute-decomposed gan. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5084–5093 (2020)
- [30] Murez, Z., Kolouri, S., Kriegman, D., Ramamoorthi, R., Kim, K.: Image to image translation for domain adaptation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4500–4509 (2018)
- [31] Ntavelis, E., Romero, A., Kastanis, I., Van Gool, L., Timofte, R.: Sesame: Semantic editing of scenes by adding, manipulating or erasing objects. In: ECCV. pp. 394–411. Springer (2020)
- [32] Park, T., Liu, M.Y., Wang, T.C., Zhu, J.Y.: Semantic image synthesis with spatially-adaptive normalization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2337–2346 (2019)
- [33] Pumarola, A., Agudo, A., Martinez, A.M., Sanfeliu, A., Moreno-Noguer, F.: Ganimation: Anatomically-aware facial animation from a single image. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 818–833 (2018)
- [34] Rubner, Y., Tomasi, C., Guibas, L.J.: The earth mover’s distance as a metric for image retrieval. IJCV (2000)
- [35] Shrivastava, A., Pfister, T., Tuzel, O., Susskind, J., Wang, W., Webb, R.: Learning from simulated and unsupervised images through adversarial training. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2107–2116 (2017)
- [36] Shrivastava, A., Pfister, T., Tuzel, O., Susskind, J., Wang, W., Webb, R.: Learning from simulated and unsupervised images through adversarial training. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2107–2116 (2017)
- [37] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
- [38] Sun, W., Wu, T.: Image synthesis from reconfigurable layout and style. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 10531–10540 (2019)
- [39] Tang, H., Xu, D., Liu, G., Wang, W., Sebe, N., Yan, Y.: Cycle in cycle generative adversarial networks for keypoint-guided image generation. In: Proceedings of the 27th ACM International Conference on Multimedia. pp. 2052–2060 (2019)
- [40] Tang, H., Xu, D., Sebe, N., Wang, Y., Corso, J.J., Yan, Y.: Multi-channel attention selection gan with cascaded semantic guidance for cross-view image translation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2417–2426 (2019)
- [41] Tsai, Y.H., Hung, W.C., Schulter, S., Sohn, K., Yang, M.H., Chandraker, M.: Learning to adapt structured output space for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 7472–7481 (2018)
- [42] Wan, Z., Zhang, B., Chen, D., Zhang, P., Chen, D., Liao, J., Wen, F.: Bringing old photos back to life. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2747–2757 (2020)
- [43] Wang, T.C., Liu, M.Y., Zhu, J.Y., Tao, A., Kautz, J., Catanzaro, B.: High-resolution image synthesis and semantic manipulation with conditional gans. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 8798–8807 (2018)
- [44] Wu, R., Lu, S.: Leed: Label-free expression editing via disentanglement. In: European Conference on Computer Vision. pp. 781–798. Springer (2020)
- [45] Wu, R., Zhang, G., Lu, S., Chen, T.: Cascade ef-gan: Progressive facial expression editing with local focuses. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5021–5030 (2020)
- [46] Xia, W., Yang, Y., Xue, J.H., Wu, B.: Tedigan: Text-guided diverse face image generation and manipulation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2256–2265 (2021)
- [47] Xie, Y., Dai, H., Chen, M., Dai, B., Zhao, T., Zha, H., Wei, W., Pfister, T.: Differentiable top-k with optimal transport. Advances in Neural Information Processing Systems 33 (2020)
- [48] Zhan, F., Lu, S., Zhang, C., Ma, F., Xie, X.: Adversarial image composition with auxiliary illumination. In: Proceedings of the Asian Conference on Computer Vision (2020)
- [49] Zhan, F., Xue, C., Lu, S.: Ga-dan: Geometry-aware domain adaptation network for scene text detection and recognition. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 9105–9115 (2019)
- [50] Zhan, F., Yu, Y., Cui, K., Zhang, G., Lu, S., Pan, J., Zhang, C., Ma, F., Xie, X., Miao, C.: Unbalanced feature transport for exemplar-based image translation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2021)
- [51] Zhan, F., Yu, Y., Wu, R., Zhang, C., Lu, S., Shao, L., Ma, F., Xie, X.: Gmlight: Lighting estimation via geometric distribution approximation. arXiv preprint arXiv:2102.10244 (2021)
- [52] Zhan, F., Yu, Y., Wu, R., Zhang, J., Lu, S.: Multimodal image synthesis and editing: A survey. arXiv preprint arXiv:2112.13592 (2021)
- [53] Zhan, F., Yu, Y., Wu, R., Zhang, J., Lu, S., Zhang, C.: Marginal contrastive correspondence for guided image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10663–10672 (2022)
- [54] Zhan, F., Zhang, C., Hu, W., Lu, S., Ma, F., Xie, X., Shao, L.: Sparse needlets for lighting estimation with spherical transport loss. arXiv preprint arXiv:2106.13090 (2021)
- [55] Zhan, F., Zhang, C., Yu, Y., Chang, Y., Lu, S., Ma, F., Xie, X.: Emlight: Lighting estimation via spherical distribution approximation. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 35, pp. 3287–3295 (2021)
- [56] Zhan, F., Zhang, J., Yu, Y., Wu, R., Lu, S.: Modulated contrast for versatile image synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 18280–18290 (2022)
- [57] Zhan, F., Zhu, H., Lu, S.: Spatial fusion gan for image synthesis. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3653–3662 (2019)
- [58] Zhang, J., Lu, S., Zhan, F., Yu, Y.: Blind image super-resolution via contrastive representation learning. arXiv preprint arXiv:2107.00708 (2021)
- [59] Zhang, P., Zhang, B., Chen, D., Yuan, L., Wen, F.: Cross-domain correspondence learning for exemplar-based image translation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5143–5153 (2020)
- [60] Zhao, B., Meng, L., Yin, W., Sigal, L.: Image generation from layout. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 8584–8593 (2019)
- [61] Zheng, H., Lin, Z., Lu, J., Cohen, S., Zhang, J., Xu, N., Luo, J.: Semantic layout manipulation with high-resolution sparse attention. arXiv preprint arXiv:2012.07288 (2020)
- [62] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ade20k dataset. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 633–641 (2017)
- [63] Zhou, X., Zhang, B., Zhang, T., Zhang, P., Bao, J., Chen, D., Zhang, Z., Wen, F.: Cocosnet v2: Full-resolution correspondence learning for image translation. In: CVPR (2021)
- [64] Zhu, J.Y., Krähenbühl, P., Shechtman, E., Efros, A.A.: Generative visual manipulation on the natural image manifold. In: ECCV. pp. 597–613. Springer (2016)
- [65] Zhu, P., Abdal, R., Qin, Y., Wonka, P.: Sean: Image synthesis with semantic region-adaptive normalization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5104–5113 (2020)
- [66] Zhu, Z., Xu, Z., You, A., Bai, X.: Semantically multi-modal image synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5467–5476 (2020)