Domain Generalization via Discrete Codebook Learning
Abstract
Domain generalization (DG) strives to address distribution shifts across diverse environments to enhance the model’s generalizability. Current DG approaches are confined to acquiring robust representations with continuous features, specifically training at the pixel level. However, this DG paradigm may struggle to mitigate distribution gaps when dealing with a large space of continuous features, rendering it susceptible to pixel details that exhibit spurious correlations or noise. In this paper, we first theoretically demonstrate that the domain gaps in continuous representation learning can be reduced by the discretization process. Based on this inspiring finding, we introduce a novel learning paradigm for DG, termed Discrete Domain Generalization (DDG). DDG proposes to use a codebook to quantize the feature map into discrete codewords, aligning semantic-equivalent information in a shared discrete representation space that prioritizes semantic-level information over pixel-level intricacies. By learning at the semantic level, DDG diminishes the number of latent features, optimizing the utilization of the representation space and alleviating the risks associated with the wide-ranging space of continuous features. Extensive experiments across widely employed benchmarks in DG demonstrate DDG’s superior performance compared to state-of-the-art approaches, underscoring its potential to reduce the distribution gaps and enhance the model’s generalizability.
Index Terms:
Transfer learning, domain generalization, computer vision.
I Introduction

Domain generalization (DG) has garnered significant attention recently, aiming to alleviate the adverse effects of distribution shifts. DG focuses on leveraging data solely from source domains to capture essential semantic information across domains, thus enhancing the model’s generalizability in target domains. Existing DG methods have focused primarily on attaining robust features through continuous representation learning, where features are represented in a continuous vector space, i.e., learning at the pixel level, such as domain alignment [1, 2, 3], data augmentation [4, 5, 6, 7, 8], disentanglement [9, 10], contrastive learning [11, 12], flatness-aware strategies [13, 14], mixture-of-experts learning [15, 16], and test-time feature shifting [17, 18].
Despite notable progress in acquiring robust representations for DG, these methods struggle to handle distribution shifts due to the vast space of continuous features inherent in various scenarios, particularly when pixel-level correlations or noise add complexity. As depicted in Fig. 1(a), each vector in the continuous representation space derived by neural networks corresponds to specific input data. This illustrates the expansive nature of the continuous feature space and highlights the ability of this expressive space to capture nuanced details at the pixel level, even within semantically similar data. This presents two risks in learning continuous representations for DG: (1) Minor input perturbations can cause large feature variations, distorting semantics and hindering accurate predictions, ultimately reducing the model’s generalizability. (2) Aligning distributions in the expansive representation space induced by pixel perturbations or intricate pixel details is challenging for models with limited parameters. Furthermore, interpreting continuous representations is difficult, as multiple features may map to the same semantics.
As an effective paradigm implied by language, discrete representation learning, which encodes information in discrete codewords, has demonstrated its efficacy in generation tasks [19, 20]. Moreover, the language modality has been proven to improve the performance of vision tasks [21], where diverse images with the same semantics could be effectively described through a consistent text. This observation suggests that discrete representation learning may be inherently well-suited for various modalities that display diverse distributions. Furthermore, the use of discrete representation for inference and prediction emerges as a compelling choice, as illustrated by Fig. 1(b), where, for instance, an animal exhibiting a “long neck” and “long legs” is likely to be identified as a “giraffe”.
Motivated by the preceding analysis, we introduce a novel DG paradigm that focuses on preserving key semantic information while minimizing focus on imperceptible pixel details. This paradigm shift aims to map semantically equivalent continuous features to the same latent variable, i.e., learning at the semantic level. By reducing the number of latent variables and retaining essential features, the new paradigm optimizes the representation space and mitigates the risks associated with continuous representation learning in DG. It leverages the strengths of discrete representation and uses finite latent variables to promote domain alignment, even without domain labels, enhancing the model’s generalizability.
In this study, we theoretically elucidate that distribution gaps across domains can be further reduced by mapping continuous feature representations into discrete codeword representations. Based on this finding, we propose an innovative approach for DG to address distribution shifts, named Discrete Domain Generalization (DDG), which uses a discrete representation codebook to quantize feature maps extracted by the feature encoder into discrete codewords, offering a more stable way to capture semantic information compared to continuous representations that may also capture detrimental pixel details. As shown in Fig. 1(c), our DDG understands the semantic information exhibiting diverse distributions with consistent codewords, e.g., codeword 201 for long neck and codeword 255 for long legs. The codebook serves as a quantized space of the continuous representation space and is end-to-end learnable alongside the feature encoder and classifier. Fig. 1(d) demonstrates the strong results of our DDG compared with state-of-the-art (SOTA) DG methods. Our contributions can be succinctly summarized as follows:
• We propose a novel learning perspective for DG that shifts focus away from noise and imperceptible pixel details through discrete representation learning, which first unveils the potential of discrete representations for enhancing the model’s generalizability.
• We theoretically illustrate that the domain gaps of continuous representations can be reduced by discretization. Inspired by this, we introduce a discrete codebook-based approach for DG, named Discrete Domain Generalization (DDG), which quantizes feature maps into discrete codewords, emphasizing semantics over pixel details.
• We conduct comprehensive experiments on widely used DG benchmarks to showcase the superiority of our DDG over SOTA approaches. Besides, in-depth analyses verify the efficacy of discrete representations in mitigating distribution shifts and enhancing the model’s generalizability.

II Methodology
Fig. 2 illustrates the complete pipeline of our Discrete Domain Generalization (DDG) during the end-to-end training procedure. Our framework comprises a teacher model, a student model, and a discrete codebook. The codebook is devised to quantize the continuous features generated by the encoder into discrete codewords. In the forward computation, the quantized embedding $z_q$, rather than the original feature map $z$, is forwarded to the classifier. During the backward process, gradients are directly copied from the quantized feature $z_q$ to the original feature map $z$ in a straight-through manner. In the inference phase, only the student model and the codebook are retained.
II-A Analysis on Distribution Gaps
Gaining a theoretical understanding of risks associated with the prevailing continuous representation learning in DG is crucial. In order to illuminate this and provide guidance for improving the generalizability, we present the following theorem from the perspective of distribution gaps across domains.
Theorem 1
Let $\mathcal{F}$ denote a family of functions $f:\mathcal{X}\rightarrow[0,1]$. For two domains characterized by continuous representation distributions $\mathbb{P}_1$ and $\mathbb{P}_2$ over $\mathcal{X}$, respectively, denote the type-1 Wasserstein distance [22] as $W_1(\mathbb{P}_1,\mathbb{P}_2)=\sup_{f\in\mathcal{F}}\left|\mathbb{E}_{x\sim\mathbb{P}_1}[f(x)]-\mathbb{E}_{x\sim\mathbb{P}_2}[f(x)]\right|$. Consider a discretization function $Q$ that maps each $x$ into the centroid of its interval, where the intervals are uniformly partitioned. Denote the discrete representation distributions as $\mathbb{Q}_1$ and $\mathbb{Q}_2$ over $Q(\mathcal{X})$, respectively; then the following inequality holds:
\[
W_1(\mathbb{Q}_1, \mathbb{Q}_2) \;\le\; W_1(\mathbb{P}_1, \mathbb{P}_2). \tag{1}
\]
Proof 1
To facilitate the proof, consider an interval $I_j = [a_j, a_{j+1})$ of the uniform partition, and denote $\Delta p(x) = p_1(x) - p_2(x)$, where $p_1$ and $p_2$ are the density functions of $\mathbb{P}_1$ and $\mathbb{P}_2$, respectively. Then the Wasserstein distance over this interval can be expressed as:
\[
W_1^{I_j}(\mathbb{P}_1, \mathbb{P}_2) = \sup_{f \in \mathcal{F}} \left| \int_{I_j} f(x)\, \Delta p(x)\, \mathrm{d}x \right| \;\ge\; \left| \int_{I_j} \Delta p(x)\, \mathrm{d}x \right|. \tag{2}
\]
The equality holds when $\Delta p(x)$ maintains the same sign for $x \in I_j$. This suggests that reducing distribution gaps across domains may be possible under specific constraints. However, achieving these conditions through neural network learning is challenging, as obtaining features with such distribution patterns is not straightforward.
To reduce the domain gaps indicated in Eq. (2), we propose adopting discrete representations. Here, each $x \in I_j$ is discretized as $Q(x) = c_j$, the centroid of $I_j$. Consequently, the discrete representation distributions in this interval are $\mathbb{Q}_1(c_j) = \int_{I_j} p_1(x)\, \mathrm{d}x$ and $\mathbb{Q}_2(c_j) = \int_{I_j} p_2(x)\, \mathrm{d}x$. The corresponding Wasserstein distance can be computed as:
\[
W_1^{I_j}(\mathbb{Q}_1, \mathbb{Q}_2) = \sup_{f \in \mathcal{F}} \left| f(c_j) \big( \mathbb{Q}_1(c_j) - \mathbb{Q}_2(c_j) \big) \right| = \left| \int_{I_j} \Delta p(x)\, \mathrm{d}x \right|. \tag{3}
\]
As observed, the Wasserstein distance of our constructed discretization reaches the optimal (smallest) domain gap indicated in Eq. (2). These principles can similarly be applied to the other intervals. As a result, combining Eq. (2) and Eq. (3) concludes the proof.
Remark 1. Theorem 1 suggests that the prevalent continuous representation learning in DG may not be optimal for reducing distribution gaps to obtain robust features. This could be due to the difficulty of mitigating domain gaps when learning at the pixel level, given the vast space of continuous representations. As a result, focusing on learning semantics rather than pixels, and thereby reducing the representation space, is a promising direction to mitigate distribution shifts. To this end, we propose discretizing continuous features into discrete vectors to learn at the semantic level instead of pixel level, consequently decreasing the upper bound of the distribution gaps across domains. In this way, the upper bound of the target risk would be reduced according to the principles in [23], thereby promoting generalizability. These insights point to the potential of introducing discrete representation codebook learning for DG to effectively address distribution shifts.
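As a quick, self-contained sanity check of Theorem 1 (our own illustration, not part of the original analysis), the following sketch compares the empirical 1-Wasserstein distance between two synthetic one-dimensional feature distributions before and after mapping every sample to the centroid of its uniform interval; the bin range and count are arbitrary assumptions.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

# Two "domains" with shifted continuous feature distributions.
p1 = rng.normal(loc=0.0, scale=1.0, size=50_000)
p2 = rng.normal(loc=0.3, scale=1.2, size=50_000)

def discretize(x, low=-5.0, high=5.0, n_bins=16):
    """Map each sample to the centroid of its uniform interval."""
    edges = np.linspace(low, high, n_bins + 1)
    centroids = (edges[:-1] + edges[1:]) / 2
    idx = np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)
    return centroids[idx]

w_cont = wasserstein_distance(p1, p2)
w_disc = wasserstein_distance(discretize(p1), discretize(p2))
# For these smooth, overlapping distributions the discretized gap does not
# exceed the continuous one, in line with the spirit of Eq. (1).
print(f"continuous W1: {w_cont:.4f}, discretized W1: {w_disc:.4f}")
```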
II-B Discrete Representation Codebook Learning
As indicated by Theorem 1, models trained at the pixel level struggle with the broad representation space and may inadvertently incorporate spurious correlations or noise, compromising the model’s generalizability. As a potential remedy, we attempt to curtail the number of latent variables via a discrete representation codebook, a promising alternative capable of mitigating the influence of redundant pixel details while preserving crucial semantic content. To address distribution gaps between distinct domains, we advocate using the codebook to vector quantize (VQ) features, facilitating the alignment of representations across domains. The finite codewords within the codebook can be construed as encapsulating underlying semantics common to data across domains. This codebook-driven approach diminishes the space of latent variables and thereby discards uninformative pixel-level information. Consequently, it serves as a pivotal bridge fostering alignment across domains.
In Theorem 1, the optimally reduced distribution gap is achieved when the features in an interval are mapped to its centroid. This finding suggests that the discretization process for DG can be realized by mapping continuous features to the centroids of feature groups that exhibit similar semantic information. Specifically, each continuous feature should be mapped to its nearest vector in the discrete codebook, with both encoding similar semantic factors. Formally, we present the proposed discretization process as follows.
Denote the discrete representation codebook as $\mathcal{C}=\{c_k\}_{k=1}^{K}\subset\mathbb{R}^{D}$, where $D$ is the dimension of the codewords and $K$ is the total number of codewords. For a given feature map $z\in\mathbb{R}^{H\times W\times C}$, where $H$, $W$, and $C$ represent the height, width, and number of channels, respectively, the VQ operation is applied to $z$ to generate a discrete feature map. To ensure seamless VQ implementation, the dimension of the codewords aligns with the number of channels, i.e., $D=C$. The VQ operation serves to map the latent variables to discrete codewords:
\[
z_q^{(i,j)} = c_{k^\ast}, \qquad k^\ast = \operatorname*{arg\,min}_{k \in \{1, \dots, K\}} \left\| z^{(i,j)} - c_k \right\|_2, \tag{4}
\]
where $z^{(i,j)} \in \mathbb{R}^{C}$ denotes the latent variable at spatial location $(i, j)$, and $z_q$ denotes the quantized discrete latent variables of $z$. In practice, the optimization of the codebook is achieved by minimizing the following objective:
\[
\mathcal{L}_{codebook} = \underbrace{\left\| \operatorname{sg}[z] - z_q \right\|_2^2}_{\mathcal{L}_{vq}} \;+\; \beta \underbrace{\left\| z - \operatorname{sg}[z_q] \right\|_2^2}_{\mathcal{L}_{commit}}, \tag{5}
\]
where $\beta$ is fixed at 0.25 for all experiments unless otherwise specified, and $\operatorname{sg}[\cdot]$ signifies the stop-gradient operation. The VQ loss ($\mathcal{L}_{vq}$) is used to optimize the codebook, and the commitment loss ($\mathcal{L}_{commit}$) anchors the encoder output to the codewords, thereby focusing on the semantic information and suppressing extraneous information.
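To make Eqs. (4) and (5) concrete, the following PyTorch-style sketch shows the nearest-codeword lookup, the straight-through gradient copy described in Fig. 2, and the codebook objective. Tensor shapes, function names, and the channel-last layout are our assumptions rather than the authors’ released implementation.

```python
import torch
import torch.nn.functional as F

def vector_quantize(z, codebook):
    """z: encoder feature map (B, H, W, C); codebook: (K, C) codewords.
    Returns the raw quantized map, its straight-through copy, and the
    codeword index selected at every spatial location (Eq. (4))."""
    B, H, W, C = z.shape
    flat = z.reshape(-1, C)                                   # (B*H*W, C)
    # squared L2 distance to every codeword: ||z||^2 - 2 z·c + ||c||^2
    dist = (flat.pow(2).sum(1, keepdim=True)
            - 2 * flat @ codebook.t()
            + codebook.pow(2).sum(1))                         # (B*H*W, K)
    indices = dist.argmin(dim=1)                              # nearest codeword
    z_q = codebook[indices].reshape(B, H, W, C)               # discrete feature map
    z_q_st = z + (z_q - z).detach()                           # straight-through copy
    return z_q, z_q_st, indices.reshape(B, H, W)

def codebook_objective(z, z_q, beta=0.25):
    """Eq. (5): the VQ term pulls codewords toward the frozen encoder output,
    and the commitment term anchors the encoder output to frozen codewords.
    z_q must be the raw lookup (not the straight-through copy) so that the
    VQ term back-propagates into the codebook."""
    vq_loss = F.mse_loss(z_q, z.detach())                     # optimizes the codebook
    commit_loss = F.mse_loss(z, z_q.detach())                 # optimizes the encoder
    return vq_loss + beta * commit_loss
```

When the codebook is instead refreshed by the EMA rule introduced next, only the commitment term needs to be kept in the loss.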
In practice, we leverage the Exponential Moving Average (EMA) of the representations to substitute for the role played by the VQ loss ($\mathcal{L}_{vq}$), thereby making the codewords evolve smoothly and enhancing the robustness of the codebook. Specifically, the update policy of the codebook at each iteration can be formulated as:
\[
N_k^{(t)} = \gamma\, N_k^{(t-1)} + (1 - \gamma)\, n_k^{(t)}, \qquad m_k^{(t)} = \gamma\, m_k^{(t-1)} + (1 - \gamma) \sum_{i=1}^{n_k^{(t)}} z_{k,i}^{(t)}, \tag{6}
\]
then the codeword is updated as $c_k \leftarrow m_k^{(t)} / N_k^{(t)}$, where $n_k^{(t)}$ is the number of encoder outputs assigned to the codeword $c_k$ at iteration $t$, $z_{k,i}^{(t)}$ denotes the $i$-th such output, and $x^{(t)}$ denotes the value that replaces $x^{(t-1)}$. The decay factor $\gamma$ is set to 0.99.
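A minimal sketch of the EMA bookkeeping in Eq. (6), following the standard VQ-VAE-style moving-average update; the buffer names and the small epsilon added for numerical stability are our assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_codebook_update(codebook, cluster_size, embed_sum, flat_z, indices,
                        gamma=0.99, eps=1e-5):
    """codebook: (K, C) codewords c_k; cluster_size: (K,) running N_k;
    embed_sum: (K, C) running m_k; flat_z: (N, C) encoder outputs of the
    current batch; indices: (N,) their assigned codeword ids."""
    K = codebook.shape[0]
    one_hot = F.one_hot(indices, K).type_as(flat_z)                # (N, K)
    n_k = one_hot.sum(dim=0)                                       # assignments per codeword
    z_sum = one_hot.t() @ flat_z                                   # (K, C) summed features

    cluster_size.mul_(gamma).add_(n_k, alpha=1 - gamma)            # N_k update in Eq. (6)
    embed_sum.mul_(gamma).add_(z_sum, alpha=1 - gamma)             # m_k update in Eq. (6)
    codebook.copy_(embed_sum / (cluster_size.unsqueeze(1) + eps))  # c_k = m_k / N_k
```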
Additionally, a teacher model is introduced to supervise the student’s output along with the true labels, with its parameters updated as a moving average of the student model’s.
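The teacher can be maintained as an exponential moving average of the student’s weights, for instance as in the sketch below; the momentum value is an assumption.

```python
import torch

@torch.no_grad()
def update_teacher(teacher, student, momentum=0.999):
    """Move every teacher parameter toward its student counterpart."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1 - momentum)
```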
In summary, the complete objective is formulated as:
\[
\mathcal{L}_{total} = \mathcal{L}_{cls} + \lambda_1\, \mathcal{L}_{commit} + \lambda_2\, \mathcal{L}_{cons}, \tag{7}
\]
where $\lambda_1$ and $\lambda_2$ control the relative importance of each loss, $\mathcal{L}_{cls}$ denotes the classification loss, and $\mathcal{L}_{cons}$ denotes the consistency loss between the student and teacher outputs, which are softened by the softmax activation $\sigma(\cdot)$ with temperature $\tau$.
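A hedged sketch of one training step implementing Eq. (7): the encoder/classifier attributes, the KL form of the consistency term between temperature-scaled softmax outputs, and the default weights are our assumptions, not the authors’ exact recipe.

```python
import torch
import torch.nn.functional as F

def ddg_training_step(student, teacher, quantizer, x, y,
                      lam1=0.1, lam2=2.0, tau=10.0):
    """One step of the sketched DDG objective in Eq. (7). `quantizer(z)` is
    assumed to return the straight-through quantized map and the commitment
    loss, with its codewords refreshed via EMA."""
    z = student.encoder(x)                                    # continuous features
    z_q, commit_loss = quantizer(z)
    logits = student.classifier(z_q)
    cls_loss = F.cross_entropy(logits, y)                     # supervised term

    with torch.no_grad():                                     # teacher = EMA of student
        z_q_t, _ = quantizer(teacher.encoder(x))
        t_logits = teacher.classifier(z_q_t)
    cons_loss = F.kl_div(F.log_softmax(logits / tau, dim=1),
                         F.softmax(t_logits / tau, dim=1),
                         reduction="batchmean")               # consistency with teacher

    return cls_loss + lam1 * commit_loss + lam2 * cons_loss
```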
Method | PACS (%) | Terra (%) | VLCS (%) | Avg. (%)
VREx [24] (ICML’2021) | 80.97 | 38.60 | 76.62 | 65.40 |
MTL [25] (JMLR’2021) | 80.60 | 40.55 | 75.38 | 65.51 |
SagNet [26] (CVPR’2021) | 81.55 | 38.75 | 76.24 | 65.51 |
ARM [27] (NeurIPS’2021) | 80.98 | 37.47 | 76.61 | 65.02 |
SAM [28] (ICLR’2021) | 82.35 | 41.76 | 76.45 | 66.85 |
FACT [6] (CVPR’2021) | 83.07 | 43.87 | 77.00 | 67.98 |
SWAD [13] (NeurIPS’2021) | 83.11 | 42.93 | 76.60 | 67.55 |
MIRO [29] (ECCV’2022) | 79.28 | 42.63 | 76.38 | 66.10 |
PCL [11] (CVPR’2022) | 82.63 | 43.21 | 76.32 | 67.39 |
AdaNPC [30] (ICML’2023) | 82.50 | 41.35 | 75.98 | 66.61 |
DandelionNet [31] (ICCV’2023) | 82.34 | 41.98 | 74.08 | 66.13 |
iDAG [32] (ICCV’2023) | 83.21 | 41.93 | 74.34 | 66.49 |
SAGM [14](CVPR’2023) | 81.34 | 40.66 | 75.83 | 65.94 |
GMDG [33] (CVPR’2024) | 81.71 | 43.34 | 75.81 | 66.95 |
DDG (ours) | 84.47 ± 0.18 | 46.63 ± 0.25 | 78.36 ± 0.24 | 69.82
III Experiments

III-A Experiment Setup
Following the common practice in DG [34, 4, 15, 35], we conducted experiments on widely used benchmarks: PACS [36], TerraIncognita (Terra) [37], and VLCS [38], which encompass images sourced from diverse media [39]. The ImageNet [40] pre-trained ResNet-18 [41] serves as the backbone for all experiments. Our approach involves training for 5k iterations using SGD, with a batch size of 32 and a weight decay of 5e-4. The learning rates are initially set to 0.004 for PACS and Terra, and 0.001 for VLCS, with a 0.1 decay at 80% of the total iterations. Our DDG is built on FACT [6]. The consistency weight $\lambda_2$ is set to 200 for Terra and VLCS, and 2 for PACS, while the commitment weight $\lambda_1$ remains fixed at 0.1, and the temperature $\tau$ is set to 10 across all experiments. We report performance following the training-domain validation protocol [34, 2]. In this work, the total number of codewords $K$ is fixed at 256.
III-B Experiment Results


Comparisons. The comparison with SOTA methods is reported in TABLE I. As shown, our DDG maintains enhancements of 9.6%, 6.4%, and 2.3% over SOTA methods on PACS, Terra, and VLCS, respectively, consistently attaining the highest average recognition accuracy across diverse benchmarks. This underscores its supremacy in capturing essential semantic information. Despite the heightened difficulty posed by scene-centric images in Terra and VLCS for DG, DDG achieves the top performance. These outcomes underscore that our DDG, utilizing vector quantization, enhances generalizability by prioritizing semantic information over pixel-level details.
Ablation Study. To assess the contributions of each component in the discretization, we perform ablation studies on PACS. The results in TABLE II underscore the efficacy and indispensability of both the VQ loss and the commitment loss for learning the powerful codewords. Moreover, optimizing the VQ loss using EMA achieves a 2.0% improvement over the gradient descent strategy. This outcome supports the claim that incorporating EMA fosters a more resilient codebook, thereby improving the model’s generalizability.
ID | $\mathcal{L}_{commit}$ | $\mathcal{L}_{vq}$ (via SGD) | $\mathcal{L}_{vq}$ (via EMA) | Art | Cartoon | Photo | Sketch | Avg. (%)
I | - | - | - | 79.93 | 75.43 | 92.28 | 79.16 | 81.71
II | ✓ | - | - | 80.76 | 76.92 | 92.63 | 79.38 | 82.42
III | - | ✓ | - | 81.98 | 77.60 | 93.35 | 76.23 | 82.40
IV | - | - | ✓ | 83.06 | 78.50 | 92.63 | 79.56 | 83.44
V | ✓ | ✓ | - | 82.61 | 75.81 | 92.16 | 80.73 | 82.83
VI | ✓ | - | ✓ | 82.81 | 78.41 | 94.07 | 82.62 | 84.48
Visualization for Distribution Discrepancy. The finite codewords in the codebook constrain the feature space, facilitating domain alignment compared to continuous representation learning. Fig. 4 displays t-SNE embeddings [42] of features from source and target domains, showing that the discrete codebook results in a closer alignment of the distributions. Although DDG lacks an explicit alignment strategy, a noticeable reduction in distribution gaps is observed, suggesting that DDG captures essential semantic features rather than pixel details, reducing the difficulty of mitigating distribution gaps.
Visualization for Codeword Semantics. To understand the semantics of codewords and their role in domain alignment, we track codeword selections for each patch (an image is processed into 7 × 7 patches) and visualize the discrete features across domains from PACS. Fig. 3 shows that the learned codewords for patches with similar semantics are consistent across domains, despite differences in styles and shapes. For instance, codeword 138 represents dogs’ ears and foreheads, while codewords 92 and 135 capture dogs’ faces. This codeword consistency between similar semantics across domains simplifies distribution alignment, demonstrating DDG’s efficacy in learning at the semantic level rather than the pixel level and reducing distribution gaps across domains.
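For reference, the per-patch codeword map used in this visualization can be extracted as in the sketch below (our illustration; for a 224 × 224 input, the ResNet-18 feature map is 7 × 7, so each index corresponds to one patch).

```python
import torch

def codeword_index_map(feature_map, codebook):
    """feature_map: (B, H, W, C) encoder output (H = W = 7 here);
    codebook: (K, C). Returns a (B, H, W) tensor holding the codeword
    index selected for every patch, which can be overlaid on the image
    to inspect which semantics each codeword captures."""
    flat = feature_map.reshape(-1, feature_map.shape[-1])
    indices = torch.cdist(flat, codebook).argmin(dim=1)
    return indices.reshape(feature_map.shape[:-1])
```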
Method | PACS (↓) | Terra (↓) | VLCS (↓) | Avg. (↓)
FACT [6] (CVPR’2021) | 8.32 | 10.51 | 14.33 | 11.05 |
SWAD [13] (NeurIPS’2021) | 9.65 | 11.51 | 15.44 | 12.20 |
PCL [11] (CVPR’2022) | 9.92 | 9.28 | 14.87 | 11.36 |
MIRO [29] (ECCV’2022) | 13.38 | 11.70 | 15.23 | 13.44
DandelionNet [31] (ICCV’2023) | 9.40 | 11.31 | 14.11 | 11.61 |
iDAG [32] (ICCV’2023) | 9.66 | 12.00 | 14.52 | 12.06 |
SAGM [14] (CVPR’2023) | 10.14 | 12.92 | 14.78 | 12.61 |
GMDG [33] (CVPR’2024) | 12.60 | 8.85 | 14.46 | 11.97 |
DDG (ours) | 6.71 | 8.62 | 14.68 | 10.08 |
Generalization Stability. The robustness of generalizability across diverse scenarios is an essential metric to discern whether a model overfits to easy-to-transfer domains and struggles in hard-to-transfer domains [43]. Following the protocol in [43], we utilize the generalization stability ($\mathcal{S}$) metric to assess the robustness of models’ generalizability and report it in TABLE III. The lowest $\mathcal{S}$ values on PACS and TerraIncognita underscore the superiority of our DDG in enhancing generalization stability. Although the value of DDG on VLCS does not beat the SOTA method, the difference is marginal.
For the definition of the metric and additional experimental results, please refer to the supplementary material.
IV Conclusion
This paper introduces a pioneering paradigm for DG by approaching it through the lens of discrete representation codebook learning. We theoretically illustrate the excellence of discrete codewords in reducing distribution gaps compared to prevailing continuous representation learning approaches. Motivated by this insight, our proposed framework, named Discrete Domain Generalization (DDG), quantizes continuous features into discrete codewords, aiming to capture essential semantic features at a semantic level rather than the conventional pixel level. This approach reduces the number of latent variables and aids in domain alignment. Comprehensive experiments on commonly used benchmarks demonstrate the effectiveness and superiority of our DDG, highlighting a new direction for enhancing the model’s generalizability through discrete representation learning.
References
- [1] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky, “Domain-adversarial training of neural networks,” JMLR, vol. 17, no. 59, pp. 1–35, 2016.
- [2] Shaocong Long, Qianyu Zhou, Chenhao Ying, Lizhuang Ma, and Yuan Luo, “Rethinking domain generalization: Discriminability and generalizability,” TCSVT, pp. 11783–11797, 2024.
- [3] Shanshan Zhao, Mingming Gong, Tongliang Liu, Huan Fu, and Dacheng Tao, “Domain generalization via entropy regularization,” NeurIPS, vol. 33, pp. 16096–16107, 2020.
- [4] Kaiyang Zhou, Yongxin Yang, Yu Qiao, and Tao Xiang, “Mixstyle neural networks for domain generalization and adaptation,” IJCV, vol. 132, no. 3, pp. 822–836, 2024.
- [5] Yuyang Zhao, Zhun Zhong, Na Zhao, Nicu Sebe, and Gim Hee Lee, “Style-hallucinated dual consistency learning: A unified framework for visual domain generalization,” IJCV, vol. 132, no. 3, pp. 837–853, 2024.
- [6] Qinwei Xu, Ruipeng Zhang, Ya Zhang, Yanfeng Wang, and Qi Tian, “A fourier-based framework for domain generalization,” in CVPR, 2021, pp. 14383–14392.
- [7] Hao Yang, Qianyu Zhou, Haijia Sun, Xiangtai Li, Fengqi Liu, Xuequan Lu, Lizhuang Ma, and Shuicheng Yan, “Pointdgmamba: Domain generalization of point cloud classification via generalized state space model,” in AAAI, 2025.
- [8] Qianyu Zhou, Ke-Yue Zhang, Taiping Yao, Xuequan Lu, Ran Yi, Shouhong Ding, and Lizhuang Ma, “Instance-aware domain generalization for face anti-spoofing,” in CVPR, 2023, pp. 20453–20463.
- [9] Zhuo Wang, Zezheng Wang, Zitong Yu, Weihong Deng, Jiahong Li, Tingting Gao, and Zhongyuan Wang, “Domain generalization via shuffled style assembly for face anti-spoofing,” in CVPR, 2022, pp. 4123–4133.
- [10] Hanlin Zhang, Yi-Fan Zhang, Weiyang Liu, Adrian Weller, Bernhard Schölkopf, and Eric P Xing, “Towards principled disentanglement for domain generalization,” in CVPR, 2022, pp. 8024–8034.
- [11] Xufeng Yao, Yang Bai, Xinyun Zhang, Yuechen Zhang, Qi Sun, Ran Chen, Ruiyu Li, and Bei Yu, “Pcl: Proxy-based contrastive learning for domain generalization,” in CVPR, 2022, pp. 7097–7107.
- [12] Daehee Kim, Youngjun Yoo, Seunghyun Park, Jinkyu Kim, and Jaekoo Lee, “Selfreg: Self-supervised contrastive regularization for domain generalization,” in ICCV, 2021, pp. 9619–9628.
- [13] Junbum Cha, Sanghyuk Chun, Kyungjae Lee, Han-Cheol Cho, Seunghyun Park, Yunsung Lee, and Sungrae Park, “Swad: Domain generalization by seeking flat minima,” NeurIPS, vol. 34, pp. 22405–22418, 2021.
- [14] Pengfei Wang, Zhaoxiang Zhang, Zhen Lei, and Lei Zhang, “Sharpness-aware gradient matching for domain generalization,” in CVPR, 2023, pp. 3769–3778.
- [15] Bo Li, Yifei Shen, Jingkang Yang, Yezhen Wang, Jiawei Ren, Tong Che, Jun Zhang, and Ziwei Liu, “Sparse mixture-of-experts are domain generalizable learners,” in ICLR, 2023.
- [16] Qianyu Zhou, Ke-Yue Zhang, Taiping Yao, Ran Yi, Shouhong Ding, and Lizhuang Ma, “Adaptive mixture of experts learning for generalizable face anti-spoofing,” in ACMMM, 2022, pp. 6009–6018.
- [17] Jincen Jiang, Qianyu Zhou, Yuhang Li, Xuequan Lu, Meili Wang, Lizhuang Ma, Jian Chang, and Jian Jun Zhang, “Dg-pic: Domain generalized point-in-context learning for point cloud understanding,” in ECCV. Springer, 2024, pp. 455–474.
- [18] Qianyu Zhou, Ke-Yue Zhang, Taiping Yao, Xuequan Lu, Shouhong Ding, and Lizhuang Ma, “Test-time domain generalization for face anti-spoofing,” in CVPR, 2024, pp. 175–187.
- [19] Aaron Van Den Oord, Oriol Vinyals, et al., “Neural discrete representation learning,” NeurIPS, vol. 30, pp. 6306–6315, 2017.
- [20] Patrick Esser, Robin Rombach, and Bjorn Ommer, “Taming transformers for high-resolution image synthesis,” in CVPR, 2021, pp. 12873–12883.
- [21] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al., “Learning transferable visual models from natural language supervision,” in ICML, 2021, pp. 8748–8763.
- [22] Bharath K. Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Bernhard Schölkopf, and Gert R. G. Lanckriet, “On the empirical estimation of integral probability metrics,” Electronic Journal of Statistics, vol. 6, pp. 1550 – 1599, 2012.
- [23] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan, “A theory of learning from different domains,” Machine Learning, vol. 79, no. 1, pp. 151–175, 2010.
- [24] David Krueger, Ethan Caballero, Joern-Henrik Jacobsen, Amy Zhang, Jonathan Binas, Dinghuai Zhang, Remi Le Priol, and Aaron Courville, “Out-of-distribution generalization via risk extrapolation (rex),” in ICML, 2021, pp. 5815–5826.
- [25] Gilles Blanchard, Aniket Anand Deshmukh, Ürun Dogan, Gyemin Lee, and Clayton Scott, “Domain generalization by marginal transfer learning,” JMLR, vol. 22, no. 1, pp. 46–100, 2021.
- [26] Hyeonseob Nam, HyunJae Lee, Jongchan Park, Wonjun Yoon, and Donggeun Yoo, “Reducing domain gap by reducing style bias,” in CVPR, 2021, pp. 8690–8699.
- [27] Marvin Zhang, Henrik Marklund, Nikita Dhawan, Abhishek Gupta, Sergey Levine, and Chelsea Finn, “Adaptive risk minimization: Learning to adapt to domain shift,” NeurIPS, vol. 34, pp. 23664–23678, 2021.
- [28] Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur, “Sharpness-aware minimization for efficiently improving generalization,” in ICLR, 2021.
- [29] Junbum Cha, Kyungjae Lee, Sungrae Park, and Sanghyuk Chun, “Domain generalization by mutual-information regularization with pre-trained models,” in ECCV, 2022, pp. 440–457.
- [30] Yifan Zhang, Xue Wang, Kexin Jin, Kun Yuan, Zhang Zhang, Liang Wang, Rong Jin, and Tieniu Tan, “Adanpc: Exploring non-parametric classifier for test-time adaptation,” in ICML, 2023, pp. 41647–41676.
- [31] Lanqing Hu, Meina Kan, Shiguang Shan, and Xilin Chen, “Dandelionnet: Domain composition with instance adaptive classification for domain generalization,” in ICCV, 2023, pp. 19050–19059.
- [32] Zenan Huang, Haobo Wang, Junbo Zhao, and Nenggan Zheng, “idag: Invariant dag searching for domain generalization,” in ICCV, 2023, pp. 19169–19179.
- [33] Zhaorui Tan, Xi Yang, and Kaizhu Huang, “Rethinking multi-domain generalization with a general learning objective,” in CVPR, 2024, pp. 23512–23522.
- [34] Ishaan Gulrajani and David Lopez-Paz, “In search of lost domain generalization,” in ICLR, 2020.
- [35] Shaocong Long, Qianyu Zhou, Xiangtai Li, Xuequan Lu, Chenhao Ying, Yuan Luo, Lizhuang Ma, and Shuicheng Yan, “Dgmamba: Domain generalization via generalized state space model,” in ACMMM, 2024, pp. 3607–3616.
- [36] Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M. Hospedales, “Deeper, broader and artier domain generalization,” in ICCV, 2017, pp. 5543–5551.
- [37] Sara Beery, Grant Van Horn, and Pietro Perona, “Recognition in terra incognita,” in ECCV, 2018, pp. 456–473.
- [38] Chen Fang, Ye Xu, and Daniel N Rockmore, “Unbiased metric learning: On the utilization of multiple datasets and web images for softening bias,” in ICCV, 2013, pp. 1657–1664.
- [39] Han Xie, Zhifeng Shen, Shicai Yang, Weijie Chen, and Luojun Lin, “Adapt then generalize: A simple two-stage framework for semi-supervised domain generalization,” in ICME, 2023, pp. 540–545.
- [40] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in CVPR, 2009, pp. 248–255.
- [41] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in CVPR, 2016, pp. 770–778.
- [42] Laurens Van der Maaten and Geoffrey Hinton, “Visualizing data using t-sne,” JMLR, vol. 9, no. 11, pp. 2579–2605, 2008.
- [43] Shaocong Long, Qianyu Zhou, Chenhao Ying, Lizhuang Ma, and Yuan Luo, “Diverse target and contribution scheduling for domain generalization,” arXiv preprint arXiv:2309.16460, 2023.
- [44] Mengzhu Wang, Jianlong Yuan, Qi Qian, Zhibin Wang, and Hao Li, “Semantic data augmentation based distance metric learning for domain generalization,” in ACMMM, 2022, pp. 3214–3223.
- [45] Kaiyang Zhou, Yongxin Yang, Timothy Hospedales, and Tao Xiang, “Learning to generate novel domains for domain generalization,” in ECCV. Springer, 2020, pp. 561–578.
- [46] Ya Li, Mingming Gong, Xinmei Tian, Tongliang Liu, and Dacheng Tao, “Domain generalization via conditional invariant representations,” in AAAI, 2018.
- [47] YiFan Zhang, xue wang, Jian Liang, Zhang Zhang, Liang Wang, Rong Jin, and Tieniu Tan, “Free lunch for domain adversarial training: Environment label smoothing,” in ICLR, 2023.
- [48] Chang Liu, Lichen Wang, Kai Li, and Yun Fu, “Domain generalization via feature variation decorrelation,” in ACMMM, 2021, pp. 1683–1691.
- [49] Yufei Wang, Haoliang Li, Hao Cheng, Bihan Wen, Lap-Pui Chau, and Alex Kot, “Variational disentanglement for domain generalization,” TMLR, 2022.
- [50] Zeyi Huang, Andy Zhou, Zijian Ling, Mu Cai, Haohan Wang, and Yong Jae Lee, “A sentence speaks a thousand images: Domain generalization through distilling clip with language guidance,” in ICCV, 2023, pp. 11685–11695.
- [51] Divyat Mahajan, Shruti Tople, and Amit Sharma, “Domain generalization using causal matching,” in ICML. PMLR, 2021, pp. 7313–7324.
- [52] Xu Chu, Yujie Jin, Wenwu Zhu, Yasha Wang, Xin Wang, Shanghang Zhang, and Hong Mei, “DNA: Domain generalization with diversified neural averaging,” in ICML, 2022, pp. 4010–4034.
- [53] Maryam Sultana, Muzammal Naseer, Muhammad Haris Khan, Salman Khan, and Fahad Shahbaz Khan, “Self-distilled vision transformer for domain generalization,” in ACCV, 2022, pp. 3068–3085.
- [54] Jintao Guo, Lei Qi, Yinghuan Shi, and Yang Gao, “Seta: Semantic-aware edge-guided token augmentation for domain generalization,” TIP, 2024.
In this supplementary material, we present additional content to further demonstrate the advantages of our proposed DDG, including an overview of related work and supplementary comparison experiments.
Appendix A Related Work
Numerous research endeavors in DG have been devoted to augmenting model generalizability in novel scenarios.
Mainstream approaches in DG have primarily centered on maintaining domain-invariant representations with the aim of achieving powerful expressive representations. Data augmentation [4, 44, 45, 5] augments the source domains with generated data exhibiting diverse styles, aiming to expose the model to a broader range of scenarios. Distribution alignment [1, 46, 3, 47] employs domain adversarial training to remove domain-specific information. As an effective way to capture genuine semantic features, disentanglement techniques [48, 49, 9, 10] seek to separate features into semantic and non-semantic information. Contrastive learning [11, 12, 50, 51] introduces a contrastive loss to regularize the acquired features to be close to those with the same label. Inspired by the generalization performance of flatness-aware strategies, stochastic weight averaging [13, 14, 52] attempts to find flatter minima in the loss landscape. To alleviate the challenges associated with learning expressive representations through a single expert and to complement the domain-shared information, methods employing a mixture-of-experts paradigm [15, 16] have been explored. These methods aim to mine sufficient and fine-grained information that may be absent in a single expert, relaxing the constraints that a single expert faces when dealing with the substantial variability of data.
Nevertheless, existing approaches to addressing distribution shifts in DG tend to acquire robust representations with continuous features and train at the pixel level, so they grapple with challenges posed by the expansive scope of continuous features. In contrast, our approach introduces discrete representation learning to obtain potent representations at the semantic level, with the goal of addressing the predicament posed by the vast space of continuous features while preserving crucial semantic features.
Appendix B Experiments
B-A Implementation Details
Following the common practice in DG research [34, 4, 15], we conducted experiments on widely used benchmarks, namely, PACS, TerraIncognita, and VLCS. These datasets encompass images sourced from diverse media, including hand-drawn illustrations, software-composited images, object-centered photographs, and scene-centered shots, thereby exhibiting substantial distribution shifts. Specifically, PACS includes 9991 images categorized into 7 classes, each exhibiting 4 diverse styles. TerraIncognita contains 24330 photographs of 10 kinds of wild animals captured at 4 distinct locations. VLCS comprises 4 sub-datasets, collectively consisting of 10729 images in 5 classes.
The $\mathcal{S}$ metric is defined as $\mathcal{S} = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} \left( A_i - \bar{A} \right)^2}$, where $A_i$ denotes the generalization performance on domain $i$, and $\bar{A}$ represents the average generalization performance across all $N$ target domains.
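Concretely, under this definition the metric is the sample standard deviation of the per-domain accuracies; the small sketch below reproduces, e.g., the value of 14.33 reported for FACT on VLCS from its per-domain results in TABLE VIII.

```python
import numpy as np

def generalization_stability(per_domain_acc):
    """Sample standard deviation of per-domain generalization performance."""
    acc = np.asarray(per_domain_acc, dtype=float)
    return acc.std(ddof=1)   # ddof=1: divide by N - 1

# FACT on VLCS (Caltech, LabelMe, SUN, PASCAL):
print(round(generalization_stability([97.10, 63.25, 72.67, 74.97]), 2))  # 14.33
```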
B-B Additional Experiments


Method (ResNet-50) | Venue | Avg. (%) | Method (DeiT-S) | Venue | Avg. (%)
GMDG [33] | CVPR’2024 | 85.60 | SDViT [53] | ACCV’2022 | 86.3
SETA [54] | TIP’2024 | 87.18 | GMoE [15]† | ICLR’2023 | 86.7
DDG (ours) | - | 87.29 | DDG (ours) | - | 87.1 |
Visualization for Class Separation. To empirically substantiate the efficacy of the proposed DDG in encapsulating pivotal semantic features and acquiring robust representations, we utilize t-SNE embeddings [42] to visualize the acquired features of different classes both before and after the introduced vector quantization process. As illustrated in Fig. 5, the separation among representations of distinct classes becomes discernibly larger after the discretization of features, such as the distinction between ‘elephant’, ‘giraffe’, and ‘horse’. Besides, the intra-class compactness, which could reflect domain gaps, becomes higher after the application of our DDG, as seen in the compactness of ‘person’ and ‘giraffe’. These observations underscore the effectiveness of DDG in prioritizing the capture of semantic information over imperceptible pixel details, thereby acquiring powerful representations with high intra-class compactness and inter-class separation.
Method | Art | Cartoon | Photo | Sketch | Avg. (%)
VREx [24] (ICML’2021) | 78.44 | ||||
MTL [25] (JMLR’2021) | |||||
SagNet [26] (CVPR’2021) | 81.15 | 75.38 | |||
ARM [27] (NeurIPS’2021) | |||||
SAM [28] (ICLR’2021) | 93.86 | ||||
FACT [6] (CVPR’2021) | 84.08 | 75.30 | 94.31 | 78.57 | 83.07 |
SWAD [13] (NeurIPS’2021) | 96.56 | ||||
MIRO [29] (ECCV’2022) | 82.43 | 73.19 | 96.33 | 65.17 | 79.28 |
PCL [11] (CVPR’2022) | 83.53 | 73.61 | |||
AdaNPC [30] (ICML’2023) | 82.70 | 76.80 | 92.80 | 77.70 | 82.50 |
DandelionNet [31] (ICCV’2023) | 83.16 | 74.36 | 95.28 | 76.56 | 82.34 |
iDAG [32] (ICCV’2023) | 82.18 | 78.20 | 97.08 | 75.38 | 83.21 |
SAGM [14](CVPR’2023) | 81.76 | 74.68 | 95.51 | 73.41 | 81.34 |
GMDG [33] (CVPR’2024) | 83.77 | 75.64 | 97.38 | 67.91 | 81.71 |
DDG (ours) | 82.75 | 78.71 | 93.83 | 82.62 | 84.47 |
Comparisons across Diverse Backbones. In this section, we present additional comparisons with state-of-the-art models using various backbones, namely ResNet50 and ViT, as shown in Table IV. The results demonstrate that our DDG consistently outperforms these models across different backbone architectures.
Complexity Analysis. We provide the complexity comparison with ERM in TABLE VI. As observed, the additional overhead of our DDG is negligible.
Backbone | Method | Params | GFLOPs
ResNet50 | ERM | 23.5M | 4.1G
DDG (ours) | 23.6M | 4.3G
B-C Full Results
Method | L100 | L38 | L43 | L46 | Avg. (%)
VREx [24] (ICML’2021) | 33.72 | ||||
MTL [25] (JMLR’2021) | |||||
SagNet [26] (CVPR’2021) | 47.25 | 29.67 | 52.87 | 25.22 | |
ARM [27] (NeurIPS’2021) | | | | |
SAM [28] (ICLR’2021) | 55.66 | 27.92 | 51.51 | 31.93 | 41.76 |
FACT [6] (CVPR’2021) | 52.90 | 38.66 | 52.32 | 31.58 | 43.87 |
SWAD [13] (NeurIPS’2021) | 55.57 | ||||
MIRO [29] (ECCV’2022) | 53.78 | 31.88 | 51.67 | 33.21 | 42.63 |
PCL [11] (CVPR’2022) | 39.98 | ||||
AdaNPC [30] (ICML’2023) | 50.60 | 38.60 | 42.20 | 34.00 | 41.35 |
DandelionNet [31] (ICCV’2023) | 52.78 | 32.80 | 50.69 | 31.63 | 41.98 |
iDAG [32] (ICCV’2023) | 53.78 | 34.82 | 50.28 | 28.85 | 41.93 |
SAGM [14] (CVPR’2023) | 50.20 | 27.54 | 53.21 | 31.70 | 40.66 |
GMDG [33] (CVPR’2024) | 50.70 | 34.78 | 51.26 | 36.63 | 43.34 |
DDG (ours) | 56.10 | 42.67 | 51.03 | 36.82 | 46.63 |
Method | Caltech | LabelMe | SUN | PASCAL | Avg. (%)
VREx [24] (ICML’2021) | 73.65 | 73.68 | |||
MTL [25] (JMLR’2021) | |||||
SagNet [26] (CVPR’2021) | 97.09 | 75.42 | 76.24 | ||
ARM [27] (NeurIPS’2021) | |||||
SAM [28] (ICLR’2021) | 98.15 | 60.52 | 71.25 | 75.90 | 76.45 |
FACT [6] (CVPR’2021) | 97.10 | 63.25 | 72.67 | 74.97 | 77.00 |
SWAD [13] (NeurIPS’2021) | 97.70 | 61.27 | 70.72 | 76.71 | 76.60 |
MIRO [29] (ECCV’2022) | 97.79 | 61.98 | 71.21 | 74.53 | 76.38 |
PCL [11] (CVPR’2022) | 97.09 | 62.07 | 71.06 | 75.05 | 76.32 |
AdaNPC [30] (ICML’2023) | 98.00 | 60.20 | 69.10 | 76.60 | 75.98 |
DandelionNet [31] (ICCV’2023) | 94.61 | 63.06 | 67.17 | 71.49 | 74.08 |
iDAG [32] (ICCV’2023) | 94.44 | 59.88 | 70.18 | 72.86 | 74.34 |
SAGM [14] (CVPR’2023) | 96.03 | 60.99 | 70.64 | 75.68 | 75.83 |
GMDG [33] (CVPR’2024) | 96.56 | 63.53 | 69.35 | 73.83 | 75.81 |
DDG (ours) | 99.08 | 63.69 | 73.87 | 76.78 | 78.36 |
Results on PACS. TABLE V reports the generalization performance on PACS, demonstrating the superior generalization performance achieved by our proposed DDG. Specifically, our DDG outperforms the state-of-the-art method iDAG by 1.5% in terms of the model’s average generalizability. Notably, on hard-to-transfer domains where the distribution variance is substantial and existing methods exhibit poor performance, such as ‘Sketch’ within PACS, DDG markedly improves generalization performance by 9.6% compared to the state-of-the-art method iDAG, signifying its efficacy in avoiding overfitting to tasks that are already near performance saturation (e.g., ‘Photo’) while preserving crucial semantic features. These findings underscore the superiority of our proposed DDG in capturing genuine semantic information rather than pixel-level details.
Results on TerraIncognita. We summarize the results on TerraIncognita in TABLE VII. As observed, our DDG achieves the top performance in three out of the four scenarios, with substantial improvements of 6.2%, 10.4%, and 16.6% on L100, L38, and L46, respectively, compared to the state-of-the-art method FACT. Besides, our DDG obtains a performance gain of 6.4% over FACT in terms of average generalization performance, highlighting the efficacy of our DDG in tackling distribution shifts across domains.
Results on VLCS. The generalization performance on VLCS is summarized in TABLE VIII. Notably, the proposed DDG achieves the highest generalization performance across all scenarios, with improvements of 2.0%, 1.3%, 3.3%, and 2.4% on Caltech, LabelMe, SUN, and PASCAL, respectively. As a result, our DDG maintains an enhancement of 2.3% over the state-of-the-art approach in average generalization performance. These outcomes collectively underscore that our proposed DDG, employing vector quantization, enhances generalization performance by prioritizing semantic information over pixel-level details.