Domain Adaptation via Maximizing Surrogate Mutual Information

Haiteng Zhao, Chang Ma, Qinyu Chen & Zhi-Hong Deng
Peking University
{zhaohaiteng,changma,chenqinyu,zhdeng}@pku.edu.cn
Abstract

Unsupervised domain adaptation (UDA) aims to predict labels for unlabeled data from the target domain with access to labeled data from the source domain. In this work, we propose a novel framework called SIDA (Surrogate Mutual Information Maximization Domain Adaptation) with strong theoretical guarantees. Specifically, SIDA implements adaptation by maximizing mutual information (MI) between features. In the framework, a surrogate joint distribution models the underlying joint distribution of the unlabeled target domain. Our theoretical analysis validates SIDA by bounding the expected risk on the target domain with MI and the surrogate distribution bias. Experiments show that our approach is comparable with state-of-the-art unsupervised adaptation methods on standard UDA tasks.

1 Introduction

Inspired by human beings’ ability to transfer knowledge across domains and tasks, transfer learning leverages knowledge from a source domain and task to improve performance on a target domain and task. In practice, however, labeled data are often limited on target domains. To address this situation, unsupervised domain adaptation (UDA), a category of transfer learning methods Long et al. (2015, 2017b); Ganin et al. (2016), attempts to enhance knowledge transfer from a labeled source domain to a target domain by leveraging unlabeled target-domain data.

Most previous work is based on the data shift assumption, i.e., the label space remains the same across domains, but the data distribution conditioned on labels varies. Under this hypothesis, domain alignment and class-level methods are used to improve generalization across source and target feature distributions. Domain alignment minimizes the discrepancy between the feature distributions of the two domains Long et al. (2015); Ganin et al. (2016); Long et al. (2017b), while class-level methods work on conditional distributions. Conditional alignment aligns conditional distributions and uses pseudo-labels to estimate the conditional distribution on the target domain Long et al. (2018); Li et al. (2020c); Chen et al. (2020a). However, the conditional distributions of different categories tend to mix together, leading to a performance drop. Contrastive learning based methods resolve this issue by discriminating features from different classes Luo et al. (2020), but still face the problem of pseudo-label precision. In addition, most class-level methods lack solid theoretical explanations for the relationship between cross-domain generalization and their objectives. Some works Chen et al. (2019); Xie et al. (2018) yield some intuition for conditional alignment and contrastive learning, but the relation between their training objectives and cross-domain error remains unclear.

In this work, we aim to address the generalization problem in domain adaptation from an information-theoretic perspective. In failed cases of domain adaptation, as shown in Figure 1, features from the same class do not represent each other well, which inspires us to use mutual information to reduce this confusion. Our motivation is to find more representative features for both domains by maximizing mutual information between features of the same class (on both source and target domains). Therefore, if our classifier can accurately predict features on the source domain, it should also function well on the target domain, where features share enough information with the source features.

Based on the above motivation, we propose Surrogate Information Domain Adaptation (SIDA), a general domain adaptation framework with strong theoretical guarantees. SIDA achieves adaptation by maximizing the mutual information (MI) between features within the same class, which improves the generalization of the model to the target domain. Furthermore, a surrogate distribution is constructed to approximate the unlabeled target distribution, which improves flexibility in selecting data and assists MI estimation. Our theoretical analysis also directly establishes a bound relating the MI of features to the target expected risk, which proves that our model can improve generalization across domains.

Our novelties and contributions are summarized as follows:

  • We propose a novel framework to achieve domain adaptation by maximizing surrogate MI.

  • We establish an expected risk upper bound based on feature MI and surrogate distribution bias for UDA. This provides a theoretical guarantee for our framework.

  • Experimental results on three challenging benchmarks demonstrate that our method performs favorably against state-of-the-art class-level UDA models.

Figure 1: Our motivation is that higher intra-class feature mutual information implies better generalization ability across domains. As shown, unsuccessful domain adaptation implies lower intra-class feature mutual information. This is reflected in the fact that features from the same class do not represent each other well because of confusion with features from other classes.

2 Related Work

Domain Adaptation  Prior works are based on two major assumptions: (1) the label shift hypothesis, where the label distribution changes, and (2) a more common data shift hypothesis where we only study the shift in conditional distribution under the premise that the label distribution is fixed. Our work focuses on the data shift hypothesis, and previous work following this line can be divided into two major categories: domain alignment methods which align marginal distributions, and class-level methods addressing the alignment of conditional distributions.

Domain alignment methods minimize the difference between feature distributions of source and target domains with various metrics, e.g. maximum mean discrepancy (MMD) Long et al. (2015), JS divergence Ganin et al. (2016) estimated by adversarial discriminator, Wasserstein metric and others. Maximum Mean Discrepancy (MMD) is applied to measure the discrepancy in marginal distributions Long et al. (2015, 2017b). Adversarial domain adaptation plays a mini-max game to learn domain-invariant features Ganin et al. (2016); Li et al. (2020b).

Class-level methods align the conditional distribution based on pseudo-labels Li et al. (2020c); Chen et al. (2020a); Luo et al. (2020); Li et al. (2020a); Tang and Jia (2020); Xu et al. (2019). Conditional alignment methods Xie et al. (2018); Long et al. (2018) minimize the discrepancy between conditional distributions. In class-level methods, conditional distributions are assigned by pseudo-labels. The accuracy of pseudo-labels greatly influences performance, and later works construct more accurate pseudo-labels Chen et al. (2020a). However, the major problem with this approach is that errors in conditional alignment lead to overlapping distributions of features from different classes, resulting in low discriminability on the target domain. Contrastive learning addresses this problem by maximizing the discrepancy between different classes Luo et al. (2020); Li et al. (2020a). However, the performance of contrastive learning also relies on pseudo-labeling.

In addition, previous class-level works provide weak theoretical support for cross-domain generalization. Prior works mainly focus on domain alignment Ben-David et al. (2007); Redko et al. (2020). Some works Chen et al. (2019); Xie et al. (2018) consider optimal classification on both domains, and yield some intuitive explanation for conditional alignment and contrastive learning, but the relation between their objective function and theoretical cross-domain error remains unclear.

Information Maximization Principle  Recently, mutual information maximization (InfoMax) for representation learning has attracted lots of attention Chen et al. (2020b); Hjelm et al. (2018); Khosla et al. (2020). The intuition is that two features belonging to different classes should be discriminable while features of the same class should resemble each other. The InfoMax principle provides a general framework for learning informative representations, and provides consistent boosts in various downstream tasks.

We facilitate domain adaptation with MI maximization, i.e. maximizing the MI between features of the same class. Some works address the domain adaptation problem with information-theoretic methods Thota and Leontidis (2021); Chen and Liu (2020); Park et al. (2020), which maximize MI using InfoNCE estimation Poole et al. (2019). As far as we know, we are the first to provide a theoretical guarantee for the target domain expected risk based on MI. Compared with InfoNCE, the variational lower bound of MI we use is tighter Poole et al. (2019). We also construct a surrogate distribution as a substitute for the unlabeled target domain, which is more suitable for MI estimation.

Figure 2: Overview of SIDA framework for training. Only encoder and classifier are involved in inference. The dashed arrow shows the path of the gradient backpropagation.

3 Preliminaries

3.1 Notations and Problem Setting

Let $\mathcal{X}$ be the data space and $\mathcal{Y}$ be the label space. In UDA, there is a source distribution $P_{S}(X,Y)$ and a target distribution $P_{T}(X,Y)$ on $\mathcal{X}\times\mathcal{Y}$. Note that distributions are also referred to as domains in UDA. Our work is based on the data shift hypothesis, which assumes that $P_{S}(X,Y)$ and $P_{T}(X,Y)$ satisfy the following properties: $P_{T}(Y)=P_{S}(Y)$ and $P_{T}(X|Y)\not=P_{S}(X|Y)$.

In our work, we focus on classification tasks. Under this setting, an algorithm has access to $n_{S}$ labeled samples $\{(x_{S}^{i},y_{S}^{i})\}_{i=1}^{n_{S}}\sim P_{S}(X,Y)$ and $n_{T}$ unlabeled samples $\{(x_{T}^{i})\}_{i=1}^{n_{T}}\sim P_{T}(X)$, and outputs a hypothesis composed of an encoder $G$ and a classifier $F$. Let $\mathcal{Z}$ be the feature space. The encoder maps data to the feature space, denoted by $G:\mathcal{X}\rightarrow\mathcal{Z}$. Then the classifier maps the feature to a corresponding class, $F:\mathcal{Z}\rightarrow\mathcal{Y}$.

For brevity, given an encoder $G$ and a data-label distribution $P(X,Y)$, denote the distribution of the $G$-encoded feature and label by $P^{G}$, i.e. $P^{G}(z,y)=P(x=G^{-1}(z),y)$.

Let $F$ be a hypothesis, and $P$ be the distribution of feature and label. The expected risk of $F$ w.r.t. $P$ is denoted as

$$\epsilon_{P}(F)\triangleq\mathbb{E}_{P(z)}|\delta_{F(z)}-P(y|z)|_{1}, \tag{1}$$

where $\delta_{F(z)}(y)$ equals 1 if $y=F(z)$ and 0 otherwise. Our objective is to minimize the expected risk of $F$ on the target feature distribution encoded by $G$,

$$\min_{G,F}\epsilon_{P^{G}_{T}}(F). \tag{2}$$
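
As a small illustration of Eq. 1, the sketch below evaluates the expected risk on a hypothetical discrete feature space with three feature values and two labels. The arrays `p_z`, `p_y_given_z` and the hypothesis `F` are made-up toy values, not quantities from the experiments. Because $\delta_{F(z)}$ is one-hot, the $L_1$ distance reduces to $2(1-P(y=F(z)|z))$, so the risk is twice the misclassification probability.

```python
import numpy as np

def expected_risk(p_z, p_y_given_z, F):
    """Eq. 1 on a discrete feature space: E_{P(z)} |delta_{F(z)} - P(y|z)|_1."""
    n_z, n_y = p_y_given_z.shape
    delta = np.zeros((n_z, n_y))
    delta[np.arange(n_z), F] = 1.0            # one-hot prediction delta_{F(z)}
    l1 = np.abs(delta - p_y_given_z).sum(axis=1)
    return float(np.dot(p_z, l1))

# Toy values (assumptions for illustration only).
p_z = np.array([0.5, 0.3, 0.2])               # P(z)
p_y_given_z = np.array([[0.9, 0.1],           # P(y|z), one row per feature value
                        [0.2, 0.8],
                        [0.5, 0.5]])
F = np.array([0, 1, 0])                       # predicted label for each z

risk = expected_risk(p_z, p_y_given_z, F)
# Probability that the true label differs from the prediction F(z).
error_prob = float(np.dot(p_z, 1.0 - p_y_given_z[np.arange(3), F]))
```

In this toy case `risk` is exactly `2 * error_prob`, which is why minimizing Eq. 2 amounts to minimizing the target classification error.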

4 Methodology

4.1 Overview

In the UDA task, the model needs to generalize across different domains with varying distributions; thus the encoder needs to extract appropriate features that are transferable across domains. The challenges of class-level adaptation are twofold: learning transferable features, and modeling $P^{G}_{T}(Z|Y)$ without label information.

To solve the first problem, we use MI based methods. Following the InfoMax principle, we maximize the mutual information between features from the same class on the target and source mixture distribution. This encourages the features of the source domain to carry more information about the features of the same class in target domain, and thus provides opportunities for transferring classifier across domains.

As for the second challenge, we first revisit the data shift hypothesis. The distribution of labels $P(Y)$ remains independent of domains; therefore the key is to model the conditional distribution $P(Z|Y)$ on the target domain. However, modeling $P(Z|Y)$ is intractable, since labels on the target domain are inaccessible. To tackle this problem, we model a surrogate distribution $Q(Z|Y)$ instead.

We introduce the goal of maximizing MI in Section 4.2, and theoretically explain how MI affects the domain adaptation risk. In Section 4.3, we introduce the model in detail, including the variational estimation of MI, the modeling of the surrogate distribution, and the optimization of the loss function of the model.

4.2 Mutual Information Maximization

MI measures the degree to which two variables can predict each other. Inspired by the InfoMax principle Hjelm et al. (2018), we maximize the MI between features within the same class. This encourages features from different classes to be discriminable from each other.

We maximize MI between features on both the source and target domains, regardless of which domain they come from. We therefore introduce the mixture distribution $S+T$ of both domains,

$$P_{S+T}(x,y)\triangleq\frac{1}{2}(P_{S}(x,y)+P_{T}(x,y)). \tag{3}$$

Note that because $P_{S}(y)=P_{T}(y)=P_{S+T}(y)$, we have $P_{S+T}(x|y)=\frac{1}{2}P_{S}(x|y)+\frac{1}{2}P_{T}(x|y)$. Define the distribution of features from the same class as

$$\begin{aligned}
P_{S+T}^{G}(z_{1},z_{2}|y)&\triangleq P_{S+T}^{G}(z_{1}|y)P_{S+T}^{G}(z_{2}|y),\\
P_{S+T}^{G}(z_{1},z_{2})&=\sum_{y}P_{S+T}^{G}(y)P_{S+T}^{G}(z_{1},z_{2}|y),
\end{aligned} \tag{4}$$

which means that the features $z_{1}$ and $z_{2}$ are sampled independently from the conditional distribution of the same class, with equal probability from the source domain and the target domain.

MI between features is maximized within the mixture distribution, as formalized below:

$$\begin{aligned}
\arg\max_{G}\;&I^{G}_{S+T}(Z_{1};Z_{2})\\
=\;&\int P_{S+T}^{G}(z_{1},z_{2})\log\frac{P_{S+T}^{G}(z_{1},z_{2})}{P_{S+T}^{G}(z_{1})P_{S+T}^{G}(z_{2})}dz_{1}dz_{2}.
\end{aligned} \tag{5}$$

However, due to the lack of target domain labels, $P^{G}_{S+T}$ is hard to model, and it is therefore infeasible to estimate $I^{G}_{S+T}$ directly. To address this problem, we propose a surrogate joint distribution $Q(Z,Y)$ as a substitute for the target domain $P^{G}_{T}$. The mixture distribution then becomes $P_{S+Q}^{G}=\frac{1}{2}(P^{G}_{S}+Q)$, and the objective becomes maximizing $I_{S+Q}^{G}(Z_{1};Z_{2})$. The construction and optimization of the surrogate joint distribution are explained in Section 4.3.2.
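
To make the same-class pair distribution of Eq. 4 and the MI of Eq. 5 concrete, the following sketch computes both on a hypothetical discrete feature space. The arrays `separated` and `confused` are invented class-conditional feature distributions (standing in for the mixture conditional), chosen only to contrast well-separated classes with overlapping ones.

```python
import numpy as np

def same_class_mi(p_y, p_z_given_y):
    """MI of (Z1, Z2) drawn independently from P(z|y) for a shared label y."""
    # Eq. 4: P(z1, z2) = sum_y P(y) P(z1|y) P(z2|y); rows of p_z_given_y index y.
    joint = np.einsum("y,ya,yb->ab", p_y, p_z_given_y, p_z_given_y)
    marg = joint.sum(axis=1)                  # P(z1) equals P(z2) by symmetry
    prod = np.outer(marg, marg)
    mask = joint > 0                          # treat 0 * log 0 as 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / prod[mask])))

def entropy(p):
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

p_y = np.array([0.5, 0.5])
separated = np.array([[0.5, 0.5, 0.0, 0.0],   # class 0 uses feature values {0, 1}
                      [0.0, 0.0, 0.5, 0.5]])  # class 1 uses feature values {2, 3}
confused = np.array([[0.4, 0.3, 0.2, 0.1],    # the two classes overlap everywhere
                     [0.1, 0.2, 0.3, 0.4]])

mi_separated = same_class_mi(p_y, separated)
mi_confused = same_class_mi(p_y, confused)
h_y = entropy(p_y)
```

With disjoint class supports the MI reaches $H(Y)$, while overlapping classes give a strictly smaller value, which matches the intuition illustrated in Figure 1.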

4.2.1 Theoretical Motivation for MI Maximization

We use a theoretical bound to motivate MI maximization. Our theoretical results prove that minimizing the expected risk on the target domain can be naturally transformed into MI maximization and expected risk minimization on the source domain, which explains why MI maximization is pivotal to our framework. The proofs are in the appendix.

Definition 1 ($\mathcal{H}\Delta\mathcal{H}$-Divergence).

Let $F_{1}\in\mathcal{H},F_{2}\in\mathcal{H}$ be two hypotheses in the hypothesis space $\mathcal{H}:\mathcal{Z}\rightarrow\mathcal{Y}$. Define $\epsilon_{P}(F_{1},F_{2})$ as the disagreement between hypotheses $F_{1},F_{2}$ w.r.t. distribution $P$ on $\mathcal{Z}$, $\epsilon_{P}\left(F_{1},F_{2}\right)\triangleq\mathbb{E}_{z\sim P}\left|\delta_{F_{1}(z)}-\delta_{F_{2}(z)}\right|$. The $\mathcal{H}\Delta\mathcal{H}$-divergence, which is the discrepancy of two distributions $P_{1},P_{2}$ w.r.t. any pair of hypotheses $F_{1},F_{2}\in\mathcal{H}$, is defined as $d_{\mathcal{H}\Delta\mathcal{H}}(P_{1},P_{2})\triangleq 2\sup_{F_{1},F_{2}\in\mathcal{H}}|\epsilon_{P_{1}}(F_{1},F_{2})-\epsilon_{P_{2}}(F_{1},F_{2})|$.

Theorem 1 (Bound of Target Domain Expected Risk).

The expected risk on the target domain can be upper-bounded by the negative MI between features and the $\mathcal{H}\Delta\mathcal{H}$-divergence between the features of the two domains:

$$\begin{aligned}
\epsilon_{P^{G}_{T}}(F)\leq\;&\epsilon_{P^{G}_{S}}(F)-4I_{S+T}^{G}(Z_{1};Z_{2})\\
&+\frac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}(P^{G}_{S}(Z),P^{G}_{T}(Z))+4H(Y).
\end{aligned} \tag{6}$$

The proof is in the appendix. We now explain when the upper bound holds with equality. $I^{G}_{S+T}(Z_{1};Z_{2})$ is a lower bound of $I^{G}_{S+T}(Z;Y)$, which measures how much uncertainty about $Y$ is reduced by knowing the feature. The latter equals $H(Y)$ if and only if $P^{G}_{S+T}(Y|Z)$ is deterministic, i.e., $P^{G}_{S+T}(Y|Z)$ is a $\delta$ distribution, which means $P^{G}_{S}(Y|Z)=P^{G}_{T}(Y|Z)=\delta_{Y(Z)}$. Thus if the $\mathcal{H}\Delta\mathcal{H}$-divergence is zero, i.e., $P^{G}_{S}(Z)=P^{G}_{T}(Z)$, then it is ensured that $P^{G}_{S}(Z,Y)=P^{G}_{T}(Z,Y)$, and $\epsilon_{P^{G}_{T}}(F)=\epsilon_{P^{G}_{S}}(F)$.

This upper bound decomposes the cross-domain generalization error into the divergence of the feature marginal distributions and the MI of features. It emphasizes that, in addition to the divergence of the feature marginal distributions, an MI term alone is enough for knowledge transfer across domains.

In this work, we minimize the expected risk on the source domain and maximize MI in order to minimize the upper bound of the expected risk on the target domain. Due to the lack of labels on the target domain, we estimate MI based on the surrogate distribution $Q$. The expected risk upper bound based on surrogate MI is further derived as follows.

Definition 2 ($L_{1}$-distance).

Define the $L_{1}$-distance of $P_{1},P_{2}$ as $d_{1}\left(P_{1},P_{2}\right)\triangleq 2\sup_{B\in\mathbf{B}}\left|\operatorname{Pr}_{P_{1}}[B]-\operatorname{Pr}_{P_{2}}[B]\right|$, where $\mathbf{B}$ is the set of measurable subsets under $P_{1}$ and $P_{2}$.
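
On a finite space, the supremum in Definition 2 can be checked by brute force. The sketch below enumerates every subset $B$ for two hypothetical distributions `p1` and `p2` (toy values) and compares the result with the familiar closed form $\sum_{z}|P_{1}(z)-P_{2}(z)|$, which holds in the discrete case.

```python
import itertools
import numpy as np

def l1_distance_bruteforce(p1, p2):
    """Definition 2 on a finite space: 2 * sup over subsets B of |P1(B) - P2(B)|."""
    n = len(p1)
    best = 0.0
    for r in range(n + 1):
        for subset in itertools.combinations(range(n), r):
            idx = np.array(subset, dtype=int)     # indices of the elements in B
            best = max(best, abs(p1[idx].sum() - p2[idx].sum()))
    return 2.0 * best

# Toy distributions over four outcomes (assumed values).
p1 = np.array([0.5, 0.3, 0.2, 0.0])
p2 = np.array([0.25, 0.25, 0.25, 0.25])

d_brute = l1_distance_bruteforce(p1, p2)
d_closed = float(np.abs(p1 - p2).sum())           # sum of absolute differences
```

The maximizing subset collects the outcomes where `p1` exceeds `p2`, so both computations agree.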

Theorem 2 (Bound Estimation with Surrogate Distribution).

Let $B\triangleq d_{1}(P_{T}^{G}(Z),Q(Z))+\epsilon_{P^{G}_{T}}(Q(Y|Z))$ be the bias of the surrogate distribution $Q$ w.r.t. the target distribution. The expected risk on the target domain can be upper-bounded by the negative surrogate MI between features, the $\mathcal{H}\Delta\mathcal{H}$-divergence between the source and target domains, and the additional bias of the surrogate domain:

$$\begin{aligned}
\epsilon_{P^{G}_{T}}(F)\leq\;&\epsilon_{P^{G}_{S}}(F)-4I_{S+Q}^{G}(Z_{1};Z_{2})+B\\
&+\frac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}(P^{G}_{S}(Z),P^{G}_{T}(Z))+4H(Y).
\end{aligned} \tag{7}$$

The proof is in the appendix. This theorem supports the feasibility of domain adaptation via maximizing the surrogate MI $I_{S+Q}^{G}(Z_{1};Z_{2})$. The bias of the surrogate distribution is expressed by the terms $d_{1}(P_{T}^{G}(Z),Q(Z))+\epsilon_{P^{G}_{T}}(Q(Y|Z))$, where the first term is the distance between the surrogate and target feature marginal distributions, and the second term is the risk of the conditional label surrogate distribution. To minimize the upper bound, the bias of the surrogate distribution should be small.

The bias equals zero if and only if the surrogate feature distribution and conditional label distribution are the same as those of the target distribution, i.e., $P_{T}^{G}=Q$, in which case the surrogate distribution does not introduce errors.

4.3 SIDA Framework

We employ MI maximization and surrogate distribution in our SIDA framework, as shown in Figure 2. During training, a surrogate distribution is first built from target and source data via optimizing w.r.t. Laplacian and MI. Then a mixture data distribution is created by encoding source data to features and sampling target features from the surrogate distribution. The encoder is optimized by maximizing MI, and minimizing classification error. The overall loss is:

$$L_{model}=L_{Classify}+\alpha_{1}L_{MI}+\alpha_{2}L_{Auxiliary}+L_{Laplacian}. \tag{8}$$

We elaborate on each module in the following sections, and introduce the optimization of the surrogate distribution in the last section.

4.3.1 Mutual Information Estimation

Several MI estimation and optimization methods are proposed in deep learning Poole et al. (2019). In this work, we use the following variational lower bound of MI as proposed in Nguyen et al. (2010):

$$\begin{aligned}
I(Z_{1};Z_{2})\geq\;&\mathbb{E}_{P(z_{1},z_{2})}[f(z_{1},z_{2})]\\
&-e^{-1}\mathbb{E}_{P(z_{1})}[\mathbb{E}_{P(z_{2})}[e^{f(z_{1},z_{2})}]],
\end{aligned} \tag{9}$$

where $f$ is a score function $\mathcal{Z}\times\mathcal{Z}\rightarrow\mathbb{R}$. Equality holds when $\frac{e^{f(z_{1},z_{2})}}{\mathbb{E}_{P(z_{1})}e^{f(z_{1},z_{2})}}=\frac{P(z_{1}|z_{2})}{P(z_{1})}$ and $\mathbb{E}_{P(z_{1})}\mathbb{E}_{P(z_{2})}e^{f(z_{1},z_{2})}=e$. The proof is in the appendix. Therefore maximizing MI can be transformed into maximizing its lower bound, and the loss is:

$$\begin{aligned}
L_{MI}=\;&-\mathbb{E}_{P_{S+Q}^{G}(y)}\mathbb{E}_{P_{S+Q}^{G}(z_{1}|y)}\mathbb{E}_{P_{S+Q}^{G}(z_{2}|y)}[f(z_{1},z_{2})]\\
&+e^{-1}\mathbb{E}_{P_{S+Q}^{G}(z_{1})}[\mathbb{E}_{P_{S+Q}^{G}(z_{2})}[e^{f(z_{1},z_{2})}]],
\end{aligned} \tag{10}$$

where $f(z_{1},z_{2})$ is constructed as $T_{m_{1}}^{m_{2}}(|z_{1}-z_{2}|_{2})$. $T_{m_{1}}^{m_{2}}$ is a threshold function, i.e., $T_{m_{1}}^{m_{2}}(a)=\max(m_{1},\min(m_{2},a))$.
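
The variational bound of Eq. 9 and its equality conditions can be checked numerically on a small discrete joint distribution. In the sketch below, `joint` is a made-up $2\times 2$ table, `f_opt` is a score table chosen to satisfy the stated equality conditions, and `f_zero` is an arbitrary uninformative score. The helper `threshold` implements $T_{m_{1}}^{m_{2}}$ exactly as written above. None of this reproduces the neural estimator used in training.

```python
import numpy as np

def nwj_bound(joint, f):
    """Right-hand side of Eq. 9 for a discrete joint P(z1, z2) and score table f."""
    p1 = joint.sum(axis=1)
    p2 = joint.sum(axis=0)
    first = np.sum(joint * f)                               # E_{P(z1,z2)}[f]
    second = np.exp(-1.0) * np.sum(np.outer(p1, p2) * np.exp(f))
    return float(first - second)

def mutual_information(joint):
    """Exact MI of a strictly positive discrete joint distribution."""
    p1 = joint.sum(axis=1)
    p2 = joint.sum(axis=0)
    return float(np.sum(joint * np.log(joint / np.outer(p1, p2))))

def threshold(a, m1, m2):
    """T_{m1}^{m2}(a) = max(m1, min(m2, a)), applied elementwise."""
    return np.maximum(m1, np.minimum(m2, a))

joint = np.array([[0.30, 0.10],
                  [0.15, 0.45]])                            # toy P(z1, z2)
mi = mutual_information(joint)

ratio = joint / np.outer(joint.sum(axis=1), joint.sum(axis=0))
f_opt = 1.0 + np.log(ratio)      # satisfies both equality conditions of Eq. 9
f_zero = np.zeros_like(joint)    # an uninformative score

bound_opt = nwj_bound(joint, f_opt)
bound_zero = nwj_bound(joint, f_zero)
clipped = threshold(np.array([-1.0, 0.5, 9.0]), 0.0, 1.0)
```

With `f_opt` the bound coincides with the exact MI, whereas `f_zero` gives the loose value $-e^{-1}$, so the quality of the score function determines how tight the estimate is.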

4.3.2 Surrogate Distribution Construction

We decompose the surrogate distribution $Q(Z,Y)$ into two factors, $Q(Z,Y)=Q(Y)Q(Z|Y)$, and describe the construction of the two factors individually.

According to the data shift assumption, $P_{T}(Y)$ is similar to $P_{S}(Y)$, thus $Q(Y)$ should be similar to $P_{S}(Y)$. However, the source distribution may suffer from the class imbalance problem, which will harm the performance on classes with fewer data. A common solution to this problem is class-balanced sampling, which samples data from each class uniformly. In this work, for balance across different classes, the marginal distributions $P_{S}(Y)$ and $Q(Y)$ are both considered to be uniform.

As for the second factor, the conditional surrogate distribution $Q(Z|Y)$ is constructed by a weighted sampling method. We need $Q(Z|Y)$ to calculate Eq. 10, which takes the form of an expectation and only needs samples from $Q(Z|Y)$ to estimate. Instead of explicitly modeling $Q(Y|Z)$, we use the idea of importance sampling. For each class, the surrogate conditional distribution $Q(Z|y_{j})$ is constructed by weighted sampling from target features. Thus $Q(Z|Y)$ is a distribution on the target features $\{G(x_{T}^{i})\}_{i=1}^{n_{T}}$, parameterized by $W\in\mathbb{R}^{n_{T}\times n_{Y}}$, where $n_{Y}$ is the number of labels:

$$Q(G(x_{T}^{i})|y_{j})=W_{ij},\text{ s.t. }W_{ij}\in[0,1],\;\sum_{i}W_{ij}=1,\;\forall j. \tag{11}$$

Compared with pseudo-labeling, our estimation method has the following advantages: (1) The surrogate marginal distribution of features $Q(Z)=\sum_{Y}Q(Z|Y)P(Y)$ is not fixed, which enables us to select features more flexibly. (2) The construction process of the surrogate distribution makes the estimation of the MI $I(Z_{1};Z_{2})$ more convenient. Our surrogate distribution $Q(Z|Y)$ provides weights so that weighted sampling can be performed directly.
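
A minimal sketch of the parameterization in Eq. 11 follows, assuming a toy problem with six target features and two classes. The weights are random placeholders rather than optimized values, and `sample_surrogate` is a hypothetical helper name. Each column of `W` is a distribution over target features, so drawing from $Q(Z|y_{j})$ is a weighted draw of feature indices.

```python
import numpy as np

rng = np.random.default_rng(0)

n_t, n_y = 6, 2                                   # toy sizes (assumed)
w_raw = rng.random((n_t, n_y))
W = w_raw / w_raw.sum(axis=0, keepdims=True)      # Eq. 11: every column sums to 1

def sample_surrogate(W, label, size, rng):
    """Draw indices of target features from Q(Z | y_label) = W[:, label]."""
    return rng.choice(W.shape[0], size=size, p=W[:, label])

p_y = np.full(n_y, 1.0 / n_y)                     # uniform label marginal
q_z = W @ p_y                                     # Q(Z) = sum_y Q(Z|y) P(y)

idx = sample_surrogate(W, label=1, size=4, rng=rng)
```

Because `q_z` is derived from `W`, changing the weights changes which target features are emphasized, which is the flexibility referred to in advantage (1).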

The challenge is to optimize the sampling probability weights $W_{ij}$ so as to minimize the bias of the surrogate distribution. We propose to optimize this distribution via Laplacian regularization as well as MI, which is explained in detail in the following section.

4.3.3 Surrogate Distribution Loss

Inspired by semi-supervised learning, we expect that the surrogate distribution is consistent with the clustering structure of the feature distribution, based on the assumption that the feature is well-structured and clustered according to class, regardless of domains. We employ Laplacian regularization to capture the manifold clustering structure of feature distribution.

Let $A\in\mathbb{R}^{n_{T}\times n_{T}}$ be the adjacency matrix of target features, where the entry $A_{ij}$ measures how similar $G(x_{T}^{i})$ and $G(x_{T}^{j})$ are, and $D=\text{Diag}(A\boldsymbol{1})$ is the degree matrix, i.e. $D_{ii}=\sum_{j}A_{ij}$ and $D_{ij}=0,\forall i\neq j$. We construct $A$ as a K-nearest-neighbor graph on target features, and the Laplacian regularization of $W$ is defined as

$$\begin{aligned}
L_{Laplacian}&=Tr(W^{T}LW)\\
&=\frac{1}{2}\sum_{k}\sum_{i,j}A_{ij}\left(\frac{W_{ik}}{\sqrt{D_{ii}}}-\frac{W_{jk}}{\sqrt{D_{jj}}}\right)^{2},
\end{aligned} \tag{12}$$

where $L$ is the normalized Laplacian matrix $L=I-D^{-\frac{1}{2}}AD^{-\frac{1}{2}}$. This regularization encourages $W_{ik}$ and $W_{jk}$ to be similar if feature $G(x_{T}^{i})$ is similar to $G(x_{T}^{j})$. It also enables the conditional surrogate distribution to spread uniformly over a connected region.
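
The Laplacian regularizer can be sketched numerically as follows. The features are random toy data, and symmetrizing the K-nearest-neighbor graph with an elementwise maximum is an assumption of this sketch, since the text does not specify how the graph is symmetrized. For a symmetric $A$ and the normalized Laplacian $L=I-D^{-\frac{1}{2}}AD^{-\frac{1}{2}}$, the trace form equals a pairwise form in which each weight is scaled by the square root of its node degree, and the code checks this identity.

```python
import numpy as np

def knn_adjacency(feats, k):
    """Symmetric 0/1 K-nearest-neighbor graph (symmetrization is an assumption)."""
    d = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                   # exclude self-loops
    nn = np.argsort(d, axis=1)[:, :k]             # k closest neighbors per node
    A = np.zeros((len(feats), len(feats)))
    A[np.repeat(np.arange(len(feats)), k), nn.ravel()] = 1.0
    return np.maximum(A, A.T)

def laplacian_loss(W, A):
    """Tr(W^T L W) with L = I - D^{-1/2} A D^{-1/2}."""
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    L = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    return float(np.trace(W.T @ L @ W))

def pairwise_form(W, A):
    """0.5 * sum_k sum_ij A_ij (W_ik / sqrt(D_ii) - W_jk / sqrt(D_jj))^2."""
    S = W / np.sqrt(A.sum(axis=1))[:, None]
    diff = S[:, None, :] - S[None, :, :]          # shape (n, n, n_classes)
    return float(0.5 * np.sum(A[:, :, None] * diff ** 2))

rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 3))                   # toy target features
A = knn_adjacency(feats, k=3)
W = rng.random((8, 2))
W /= W.sum(axis=0, keepdims=True)                 # columns are distributions

trace_value = laplacian_loss(W, A)
pair_value = pairwise_form(W, A)
```

The pairwise form makes the smoothing effect explicit: the loss is small when neighboring target features receive similar degree-scaled weights.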

4.3.4 Classification and Auxiliary Loss

The model is optimized in a supervised manner on the source domain. The classification loss is the standard cross-entropy loss with class-balanced sampling:

$$L_{Classify}=-\frac{1}{n_{Y}}\sum_{y}\mathbb{E}_{P_{S}(x|y)}\log P(F(G(x))=y). \tag{13}$$

We also use an auxiliary classification loss on pseudo-labels from the surrogate distribution, as the classifier benefits from the label information of the surrogate distribution. We use the mean squared error (MSE) for pseudo-labels, which is more robust to noise than the cross-entropy loss:

$$L_{Auxiliary}=\frac{1}{n_{Y}}\sum_{y}\mathbb{E}_{Q(x|y)}(1-P(F(G(x))=y))^{2}. \tag{14}$$
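
The two losses can be written as plug-in estimates on arrays of predicted class probabilities. The sketch below is one plausible reading of Eqs. 13 and 14: the within-class expectation in Eq. 13 is replaced by a sample mean over source examples of that class, and the expectation under $Q$ in Eq. 14 becomes a sum weighted by the columns of `W`. The function names and the toy probabilities are hypothetical, and every class is assumed to appear among the source samples.

```python
import numpy as np

def classify_loss(probs, labels, n_classes):
    """Eq. 13: cross-entropy averaged within each class, then across classes."""
    loss = 0.0
    for c in range(n_classes):
        p_c = probs[labels == c, c]               # P(F(G(x)) = c) on class-c samples
        loss += -np.log(p_c).mean() / n_classes
    return float(loss)

def auxiliary_loss(probs_t, W):
    """Eq. 14: squared error weighted by the surrogate Q(. | y) = W[:, y]."""
    n_classes = W.shape[1]
    return float(np.sum(W * (1.0 - probs_t) ** 2) / n_classes)

# Toy source predictions: two samples of class 0 and one of class 1.
probs_s = np.array([[0.9, 0.1],
                    [0.8, 0.2],
                    [0.3, 0.7]])
labels_s = np.array([0, 0, 1])
loss_cls = classify_loss(probs_s, labels_s, n_classes=2)

# Toy target predictions and surrogate weights (columns sum to 1).
probs_t = np.array([[0.6, 0.4],
                    [0.2, 0.8]])
W = np.array([[1.0, 0.0],
              [0.0, 1.0]])
loss_aux = auxiliary_loss(probs_t, W)
```

Averaging per class before averaging across classes gives the minority class the same influence as the majority class, which is the effect of class-balanced sampling.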

4.3.5 Optimization of Surrogate Distribution

We optimize both $L_{Laplacian}$ and $L_{MI}$ w.r.t. $W$ for a structured and informative surrogate distribution. At the beginning of each epoch, $W$ is initialized by K-means clustering and filtered by the distance to the clustering centers, i.e. $\widetilde{W}_{i,j}=\boldsymbol{1}_{\mu_{j}\text{ nearest to }G(x_{i})}\cdot\boldsymbol{1}_{d(G(x_{i}),\mu_{j})<\theta}$, where $\mu_{j}$ is the $j$-th clustering center, and normalized as $W_{i,j}=\frac{\widetilde{W}_{i,j}}{\sum_{i}\widetilde{W}_{i,j}}$.
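
The initialization can be sketched as below, assuming the cluster centers have already been obtained from K-means and are passed in directly. Euclidean distance is assumed for $d$, the toy features and the threshold `theta` are invented, and the guard for clusters that lose all their points is a choice made for this sketch rather than something specified in the text.

```python
import numpy as np

def init_weights(feats, centers, theta):
    """Indicator of 'nearest center and closer than theta', normalized per column."""
    d = np.linalg.norm(feats[:, None, :] - centers[None, :, :], axis=-1)
    nearest = d.argmin(axis=1)                    # index of the closest center
    rows = np.arange(len(feats))
    W_tilde = np.zeros_like(d)
    W_tilde[rows, nearest] = (d[rows, nearest] < theta).astype(float)
    col_sums = W_tilde.sum(axis=0, keepdims=True)
    # Leave an all-zero column untouched instead of dividing by zero (assumption).
    return W_tilde / np.where(col_sums > 0, col_sums, 1.0)

feats = np.array([[0.0, 0.0],
                  [0.1, 0.0],
                  [5.0, 5.0],
                  [5.2, 5.0],
                  [2.0, 2.0]])                    # last point is far from both centers
centers = np.array([[0.0, 0.0],
                    [5.0, 5.0]])
W = init_weights(feats, centers, theta=1.0)
```

In this toy case the fifth feature is filtered out by the distance threshold, so it receives zero weight under both classes.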

To minimize the two losses w.r.t. $W$, the gradients are derived analytically. The derivation is in the appendix.

Based on the gradients of these two losses, we perform a $T$-step descent update of $W$ with learning rates $\eta_{1}$ and $\eta_{2}$ respectively, and after each step we project $W$ back onto the probability simplex. See the appendix for details.
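
The update can be illustrated with a short projected gradient descent sketch restricted to the Laplacian term. The Euclidean projection onto the simplex used here is a standard sort-based routine and is an assumption, since the exact projection is deferred to the appendix. The gradient $2LW$ holds for a symmetric $L$, and the four-node path graph and step size are toy choices.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of a vector onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, len(v) + 1)
    rho = k[u - css / k > 0][-1]                  # number of positive entries
    tau = css[rho - 1] / rho                      # shift that enforces sum = 1
    return np.maximum(v - tau, 0.0)

def descend(W, L, eta, steps):
    """T-step projected gradient descent on Tr(W^T L W) for symmetric L."""
    for _ in range(steps):
        W = W - eta * 2.0 * (L @ W)               # gradient of the trace is 2 L W
        W = np.apply_along_axis(project_simplex, 0, W)   # project every column
    return W

# Toy path graph on four nodes and its normalized Laplacian.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
L = np.eye(4) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

W0 = np.array([[1.0, 0.0],
               [0.0, 0.0],
               [0.0, 0.0],
               [0.0, 1.0]])                       # each class starts on one endpoint
W1 = descend(W0, L, eta=0.1, steps=20)
loss0 = float(np.trace(W0.T @ L @ W0))
loss1 = float(np.trace(W1.T @ L @ W1))
```

After the update, each column of `W1` is still a probability distribution, and the mass has spread from the endpoints to neighboring nodes, lowering the regularizer.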

5 Experiments

In this section, we evaluate the proposed method on three public domain adaptation benchmarks and compare it with recent state-of-the-art UDA methods. We also conduct an extensive ablation study to analyze our method.

5.1 Datasets

VisDA-2017 Peng et al. (2017) is a challenging benchmark for UDA with the domain shift from synthetic data to real imagery. It contains 152,397 training images and 55,388 validation images across 12 classes. Following the training and testing protocol in Long et al. (2017a), the model is trained on labeled training and unlabeled validation set and tested on the validation set.

Office-31 Saenko et al. (2010) is a commonly used dataset for UDA, where images are collected from three distinct domains: Amazon (A), Webcam (W) and DSLR (D). The dataset consists of 4,110 images belonging to 31 classes, and is imbalanced across domains, with 2,817 images in A domain, 795 images in W domain, and 498 images in D domain. Our method is evaluated on all six transfer tasks. We follow the standard protocol for UDA Long et al. (2017b) to use all labeled source samples and all unlabeled target samples as the training data.

Office-Home Venkateswara et al. (2017) is another classical dataset with 15,500 images of 65 categories in office and home settings, consisting of 4 domains including Artistic images (A), Clip Art images (C), Product images (P) and Real-World images (R). Following the common protocol, all 65 categories from the four domains are used for evaluation of UDA, forming 12 transfer tasks.

5.2 Implementation details

For each transfer task, the mean (± std) of the test accuracy over 5 runs is reported. We use the ImageNet pre-trained ResNet-50 He et al. (2016) without the final classifier layer as the encoder network $G$ for Office-31 and Office-Home, and ResNet-101 for VisDA-2017. The details of the experiments are in the appendix. The code is available at https://github.com/zhao-ht/SIDA.

5.3 Baselines

We compare our approach with the state of the art. Domain alignment methods include DAN Long et al. (2015), DANN Ganin et al. (2016), JAN Long et al. (2017b). Class-level methods include conditional alignment methods (CDAN Long et al. (2018), DCAN Li et al. (2020c), ALDA Chen et al. (2020a)), and contrastive methods (DRMEA Luo et al. (2020), ETD Li et al. (2020a), DADA Tang and Jia (2020), SAFN Xu et al. (2019)). We only report the results available for each baseline. We use NA, DA, CA, and CT to denote no adaptation, domain alignment methods, conditional alignment methods, and contrastive methods, respectively.

| Type | Method | Plane | Bcycl | Bus | Car | Horse | Knife | Mcycl | Person | Plant | Sktbrd | Train | Truck | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NA | ResNet-101 | 55.1 | 53.3 | 61.9 | 59.1 | 80.6 | 17.9 | 79.7 | 31.2 | 81.0 | 26.5 | 73.5 | 8.5 | 52.4 |
| DA | DAN | 87.1 | 63.0 | 76.5 | 42.0 | 90.3 | 42.9 | 85.9 | 53.1 | 49.7 | 36.3 | 85.8 | 20.7 | 61.1 |
| DA | DANN | 81.9 | 77.7 | 82.8 | 44.3 | 81.2 | 29.5 | 65.1 | 28.6 | 51.9 | 54.6 | 82.8 | 7.8 | 57.4 |
| CA | CDAN | 85.2 | 66.9 | 83.0 | 50.8 | 84.2 | 74.9 | 88.1 | 74.5 | 83.4 | 76.0 | 81.9 | 38.0 | 73.9 |
| CA | ALDA | 93.8 | 74.1 | 82.4 | 69.4 | 90.6 | 87.2 | 89.0 | 67.6 | 93.4 | 76.1 | 87.7 | 22.2 | 77.8 |
| CT | DRMEA | 92.1 | 75.0 | 78.9 | 75.5 | 91.2 | 81.9 | 89.0 | 77.2 | 93.3 | 77.4 | 84.8 | 35.1 | 79.3 |
| CT | DADA | 92.9 | 74.2 | 82.5 | 65.0 | 90.9 | 93.8 | 87.2 | 74.2 | 89.9 | 71.5 | 86.5 | 48.7 | 79.8 |
| CT | SAFN | 93.6 | 61.3 | 84.1 | 70.6 | 94.1 | 79.0 | 91.8 | 79.6 | 89.9 | 55.6 | 89.0 | 24.4 | 76.1 |
| Ours | SIDA | 95.4 | 83.1 | 77.1 | 64.6 | 94.5 | 97.2 | 88.7 | 78.4 | 93.8 | 89.9 | 85.2 | 59.4 | 84.0 |

Table 1: Accuracy (%) on VisDA-2017

| Type | Method | A→W | D→W | W→D | A→D | D→A | W→A | Avg |
|---|---|---|---|---|---|---|---|---|
| NA | ResNet-50 | 68.4±0.2 | 96.7±0.1 | 99.3±0.1 | 68.9±0.2 | 62.5±0.3 | 60.7±0.3 | 76.1 |
| DA | DAN | 80.5±0.4 | 97.1±0.2 | 99.6±0.1 | 78.6±0.2 | 63.6±0.3 | 62.8±0.2 | 80.4 |
| DA | DANN | 82.0±0.4 | 96.9±0.2 | 99.1±0.1 | 79.7±0.4 | 68.2±0.4 | 67.4±0.5 | 82.2 |
| DA | JAN | 85.4±0.3 | 97.4±0.2 | 99.8±0.2 | 84.7±0.3 | 68.6±0.3 | 70.0±0.4 | 84.3 |
| CA | CDAN | 94.1±0.1 | 98.6±0.1 | 100.0±0.0 | 92.9±0.2 | 71.0±0.3 | 69.3±0.3 | 87.7 |
| CA | DCAN | 95.0 | 97.5 | 100.0 | 92.6 | 77.2 | 74.9 | 89.5 |
| CA | ALDA | 95.6±0.5 | 97.7±0.1 | 100.0±0.0 | 94.0±0.4 | 72.2±0.4 | 72.5±0.2 | 88.7 |
| CT | ETD | 92.1 | 100.0 | 100.0 | 88.0 | 71.0 | 67.8 | 86.2 |
| CT | DADA | 92.3±0.1 | 99.2±0.1 | 100.0±0.0 | 93.9±0.2 | 74.4±0.1 | 74.2±0.1 | 89.0 |
| CT | SAFN | 90.3 | 98.7 | 100.0 | 92.1 | 73.4 | 71.2 | 87.6 |
| Ours | SIDA | 94.5±0.6 | 99.2±0.1 | 100.0±0.0 | 95.7±0.3 | 76.6±0.6 | 76.2±0.4 | 90.4 |

Table 2: Accuracy (%) on Office-31

| Type | Method | A→C | A→P | A→R | C→A | C→P | C→R | P→A | P→C | P→R | R→A | R→C | R→P | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NA | ResNet-50 | 34.9 | 50.0 | 58.0 | 37.4 | 41.9 | 46.2 | 38.5 | 31.2 | 60.4 | 53.9 | 41.2 | 59.9 | 46.1 |
| DA | DAN | 43.6 | 57.0 | 67.9 | 45.8 | 56.5 | 60.4 | 44.0 | 43.6 | 67.7 | 63.1 | 51.5 | 74.3 | 56.3 |
| DA | DANN | 45.6 | 59.3 | 70.1 | 47.0 | 58.5 | 60.9 | 46.1 | 43.7 | 68.5 | 63.2 | 51.8 | 76.8 | 57.6 |
| DA | JAN | 45.9 | 61.2 | 68.9 | 50.4 | 59.7 | 61.0 | 45.8 | 43.4 | 70.3 | 63.9 | 52.4 | 76.8 | 58.3 |
| CA | CDAN | 50.7 | 70.6 | 76.0 | 57.6 | 70.0 | 70.0 | 57.4 | 50.9 | 77.3 | 70.9 | 56.7 | 81.6 | 65.8 |
| CA | DCAN | 54.5 | 75.7 | 81.2 | 67.4 | 74.0 | 76.3 | 67.4 | 52.7 | 80.6 | 74.1 | 59.1 | 83.5 | 70.5 |
| CA | ALDA | 53.7 | 70.1 | 76.4 | 60.2 | 72.6 | 71.5 | 56.8 | 51.9 | 77.1 | 70.2 | 56.3 | 82.1 | 66.6 |
| CT | DRMEA | 52.3 | 73.0 | 77.3 | 64.3 | 72.0 | 71.8 | 63.6 | 52.7 | 78.5 | 72.0 | 57.7 | 81.6 | 68.1 |
| CT | ETD | 51.3 | 71.9 | 85.7 | 57.6 | 69.2 | 73.7 | 57.8 | 51.2 | 79.3 | 70.2 | 57.5 | 82.1 | 67.3 |
| CT | SAFN | 54.4 | 73.3 | 77.9 | 65.2 | 71.5 | 73.2 | 63.6 | 52.6 | 78.2 | 72.3 | 58.0 | 82.1 | 68.5 |
| Ours | SIDA | 57.2 | 79.1 | 81.7 | 67.1 | 74.5 | 77.3 | 67.2 | 53.9 | 82.5 | 71.4 | 58.7 | 83.3 | 71.2 |

Table 3: Accuracy (%) on Office-Home

5.4 Results and Comparative Analysis

In this section, we present our results and compare them with other methods on the three standard benchmarks mentioned earlier. We report average classification accuracies with standard deviations. Results of other methods are taken from the original papers or follow-up work. Visualizations of the features learned by the model are provided in the appendix.

VisDA-2017    Table 1 summarizes our experimental results on the challenging VisDA-2017 dataset. For fair comparison, all methods listed here use ResNet-101 as the backbone network. SIDA outperforms the baseline models with an average accuracy of 84.0, surpassing the previous best reported result by about 4 percentage points.

Office-31   The unsupervised adaptation results on the six Office-31 transfer tasks with a ResNet-50 backbone are reported in Table 2. The average accuracy of SIDA is 90.4, the best among all compared methods. Notably, our method improves classification accuracy on hard transfer tasks, e.g. W→A, A→D, and D→A, where source and target data are dissimilar, and achieves comparable performance on easy transfer tasks, e.g. D→W, W→D, and A→W. The improvements are thus mainly on hard settings.

Office-Home   Results on Office-Home with a ResNet-50 backbone are reported in Table 3. SIDA achieves the highest average accuracy, 71.2, and the best accuracy on six of the twelve transfer tasks. This performance suggests the importance of maximizing MI between features in difficult domain adaptation tasks that contain more categories.

In summary, our surrogate MI maximization approach achieves competitive performance compared with traditional alignment-based methods and recent pseudo-label-based methods for UDA. This supports the validity of information-theoretic methods for UDA via MI maximization.

| MI | SD | A→W | A→D | D→A | W→A | Avg |
|---|---|---|---|---|---|---|
| ✗ | ✗ | 90.25±0.2 | 92.37±0.1 | 74.21±0.2 | 74.09±0.1 | 82.7 |
| ✗ | ✓ | 92.08±0.3 | 94.28±0.3 | 74.23±0.9 | 74.74±0.8 | 83.8 |
| ✓ | ✗ | 94.03±0.1 | 95.28±0.1 | 75.86±0.4 | 75.72±0.5 | 85.2 |
| ✓ | ✓ | 94.52±0.6 | 95.68±0.1 | 76.62±0.6 | 76.22±0.4 | 85.8 |

Table 4: Ablation Study

5.5 Ablation Study

In this section, to evaluate how the different components of our work contribute to the final performance, we conduct an ablation study for SIDA on Office-31. We focus on the harder transfer tasks: A→W, A→D, D→A, and W→A. We investigate different combinations of two components: MI maximization and the surrogate distribution (SD). Note that without the surrogate distribution, we estimate MI using pseudo-labels computed by the same method as the surrogate distribution initialization. The classification accuracies on the four tasks and their averages are in Table 4.

From the results, we observe that the model with MI maximization outperforms the base model without either component by about 2.5 percentage points on average, which demonstrates the effectiveness of the maximization strategy. The surrogate distribution alone also improves the average accuracy by 1.1 points over the base model, indicating that the surrogate distribution improves the estimation quality on the target domain compared with the pseudo-label method. Combining the two components yields the largest improvement.
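
The averages and gains quoted above can be rechecked from the per-task entries of Table 4. The short Python sketch below is only an arithmetic check; the dictionary keys and variable names are illustrative and not part of the original experiments.

```python
# Per-task accuracies (A->W, A->D, D->A, W->A) copied from Table 4.
# Keys are (uses MI maximization, uses surrogate distribution); names are illustrative.
ablation = {
    (False, False): [90.25, 92.37, 74.21, 74.09],
    (False, True): [92.08, 94.28, 74.23, 74.74],
    (True, False): [94.03, 95.28, 75.86, 75.72],
    (True, True): [94.52, 95.68, 76.62, 76.22],
}

# Average accuracy of each configuration over the four tasks.
averages = {key: sum(vals) / len(vals) for key, vals in ablation.items()}

# Gain of each configuration over the base model (no MI, no SD).
base = averages[(False, False)]
gains = {key: avg - base for key, avg in averages.items()}

for key in ablation:
    print(key, round(averages[key], 1), round(gains[key], 1))
```

The rounded averages agree with the Avg column of Table 4, and the gains of the single-component variants match the 2.5 and 1.1 point figures in the text.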

6 Conclusion and Future Work

In this work, we introduce a novel framework for unsupervised domain adaptation and provide a theoretical analysis to validate our optimization objectives. Experiments show that our approach gives competitive results compared to state-of-the-art unsupervised adaptation methods on standard domain adaptation tasks. One unresolved problem is how to integrate the domain discrepancy term of the target risk upper bound into the mutual information framework. This problem is left for future work.

References

  • Ben-David et al. [2007] Shai Ben-David, John Blitzer, Koby Crammer, Fernando Pereira, et al. Analysis of representations for domain adaptation. Advances in neural information processing systems, 2007.
  • Chen and Liu [2020] Qingchao Chen and Yang Liu. Structure-aware feature fusion for unsupervised domain adaptation. In Proc. of AAAI, 2020.
  • Chen et al. [2019] Chaoqi Chen, Weiping Xie, Wenbing Huang, Yu Rong, Xinghao Ding, Yue Huang, Tingyang Xu, and Junzhou Huang. Progressive feature alignment for unsupervised domain adaptation. In Proc. of CVPR, 2019.
  • Chen et al. [2020a] Minghao Chen, Shuai Zhao, Haifeng Liu, and Deng Cai. Adversarial-learned loss for domain adaptation. In Proc. of AAAI, 2020.
  • Chen et al. [2020b] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, 2020.
  • Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, 2009.
  • Ganin and Lempitsky [2015] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In International conference on machine learning, 2015.
  • Ganin et al. [2016] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The journal of machine learning research, 2016.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2016.
  • Hjelm et al. [2018] R. Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Philip Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. In Proc. of ICLR, 2018.
  • [11]
  • Khosla et al. [2020] Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. In Proc. of NeurIPS, 2020.
  • Li et al. [2020a] Mengxue Li, Yi-Ming Zhai, You-Wei Luo, Peng-Fei Ge, and Chuan-Xian Ren. Enhanced transport distance for unsupervised domain adaptation. In Proc. of CVPR, 2020.
  • Li et al. [2020b] Rui Li, Qianfen Jiao, Wenming Cao, Hau-San Wong, and Si Wu. Model adaptation: Unsupervised domain adaptation without source data. In Proc. of CVPR, 2020.
  • Li et al. [2020c] Shuang Li, Chi Liu, Qiuxia Lin, Binhui Xie, Zhengming Ding, Gao Huang, and Jian Tang. Domain conditioned adaptation network. In Proc. of AAAI, 2020.
  • Long et al. [2015] Mingsheng Long, Yue Cao, Jianmin Wang, and Michael Jordan. Learning transferable features with deep adaptation networks. In International conference on machine learning, 2015.
  • Long et al. [2017a] Mingsheng Long, Zhangjie Cao, Jianmin Wang, and Michael I Jordan. Conditional adversarial domain adaptation. arXiv preprint arXiv:1705.10667, 2017.
  • Long et al. [2017b] Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I Jordan. Deep transfer learning with joint adaptation networks. In International conference on machine learning, 2017.
  • Long et al. [2018] Mingsheng Long, Zhangjie Cao, Jianmin Wang, and Michael I Jordan. Conditional adversarial domain adaptation. In NeurIPS, 2018.
  • Luo et al. [2020] You-Wei Luo, Chuan-Xian Ren, Pengfei Ge, Ke-Kun Huang, and Yu-Feng Yu. Unsupervised domain adaptation via discriminative manifold embedding and alignment. In Proc. of AAAI, 2020.
  • Nguyen et al. [2010] XuanLong Nguyen, Martin J Wainwright, and Michael I Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 2010.
  • Park et al. [2020] Changhwa Park, Jonghyun Lee, Jaeyoon Yoo, Minhoe Hur, and Sungroh Yoon. Joint contrastive learning for unsupervised domain adaptation. arXiv preprint arXiv:2006.10297, 2020.
  • Peng et al. [2017] Xingchao Peng, Ben Usman, Neela Kaushik, Judy Hoffman, Dequan Wang, and Kate Saenko. Visda: The visual domain adaptation challenge. arXiv preprint arXiv:1710.06924, 2017.
  • Poole et al. [2019] Ben Poole, Sherjil Ozair, Aaron Van Den Oord, Alex Alemi, and George Tucker. On variational bounds of mutual information. In Proc. of ICML, 2019.
  • Redko et al. [2020] Ievgen Redko, Emilie Morvant, Amaury Habrard, Marc Sebban, and Younès Bennani. A survey on domain adaptation theory: learning bounds and theoretical guarantees. arXiv e-prints, 2020.
  • Saenko et al. [2010] Kate Saenko, Brian Kulis, Mario Fritz, and Trevor Darrell. Adapting visual category models to new domains. In European conference on computer vision, 2010.
  • Tang and Jia [2020] Hui Tang and Kui Jia. Discriminative adversarial domain adaptation. In Proc. of AAAI, 2020.
  • Thota and Leontidis [2021] Mamatha Thota and Georgios Leontidis. Contrastive domain adaptation. In Proc. of CVPR, 2021.
  • Venkateswara et al. [2017] Hemanth Venkateswara, Jose Eusebio, Shayok Chakraborty, and Sethuraman Panchanathan. Deep hashing network for unsupervised domain adaptation. In Proc. of CVPR, 2017.
  • Xie et al. [2018] Shaoan Xie, Zibin Zheng, Liang Chen, and Chuan Chen. Learning semantic representations for unsupervised domain adaptation. In International conference on machine learning, 2018.
  • Xu et al. [2019] Ruijia Xu, Guanbin Li, Jihan Yang, and Liang Lin. Larger norm more transferable: An adaptive feature norm approach for unsupervised domain adaptation. In Proc. of ICCV, 2019.

7 Appendix

7.1 Proof for Theorem 4.1

Theorem 3 (Bound of Target Domain expected risk).

The expected risk on the target domain can be upper-bounded by the negative MI between features and the $\mathcal{H}\Delta\mathcal{H}$-divergence between the features of the two domains:

$$\epsilon_{P^{G}_{T}}(F)\leq\epsilon_{P^{G}_{S}}(F)-4I_{S+T}^{G}(Z_{1};Z_{2})+\frac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}(P^{G}_{S}(Z),P^{G}_{T}(Z))+4H(Y) \tag{15}$$
Proof.

The risk can be relaxed by the triangle inequality:

$$\begin{aligned}
\epsilon_{P^{G}_{T}}(F) &\leq\epsilon_{P^{G}_{T}}(F^{\prime})+\epsilon_{P^{G}_{T}}(F,F^{\prime}) \\
&=\epsilon_{P^{G}_{T}}(F^{\prime})+\epsilon_{P^{G}_{T}}(F,F^{\prime})+\epsilon_{P^{G}_{S}}(F,F^{\prime})-\epsilon_{P^{G}_{S}}(F,F^{\prime}) \\
&\leq\epsilon_{P^{G}_{T}}(F^{\prime})+\epsilon_{P^{G}_{T}}(F,F^{\prime})+\epsilon_{P^{G}_{S}}(F)+\epsilon_{P^{G}_{S}}(F^{\prime})-\epsilon_{P^{G}_{S}}(F,F^{\prime}) \\
&\leq\epsilon_{P^{G}_{S}}(F)+\epsilon_{P^{G}_{T}}(F^{\prime})+\epsilon_{P^{G}_{S}}(F^{\prime})+|\epsilon_{P^{G}_{T}}(F,F^{\prime})-\epsilon_{P^{G}_{S}}(F,F^{\prime})| \\
&\leq\epsilon_{P^{G}_{S}}(F)+\epsilon_{P^{G}_{T}}(F^{\prime})+\epsilon_{P^{G}_{S}}(F^{\prime})+\frac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}(P^{G}_{S}(Z),P^{G}_{T}(Z))
\end{aligned} \tag{16}$$

For the term $\epsilon_{P^{G}_{T}}(F^{\prime})+\epsilon_{P^{G}_{S}}(F^{\prime})$, we have

$$\begin{aligned}
\epsilon_{P^{G}_{T}}(F^{\prime})+\epsilon_{P^{G}_{S}}(F^{\prime}) &=\mathbb{E}_{P^{G}_{S}(z)}|P^{G}_{S}(y|z)-\delta_{F^{\prime}(z)}|_{1}+\mathbb{E}_{P^{G}_{T}(z)}|P^{G}_{T}(y|z)-\delta_{F^{\prime}(z)}|_{1} \\
&=2\mathbb{E}_{P^{G}_{S}(z)}(1-P^{G}_{S}(F^{\prime}(z)|z))+2\mathbb{E}_{P^{G}_{T}(z)}(1-P^{G}_{T}(F^{\prime}(z)|z)) \\
&=2\Big(2-\sum_{z}P^{G}_{S}(z)P^{G}_{S}(F^{\prime}(z)|z)-\sum_{z}P^{G}_{T}(z)P^{G}_{T}(F^{\prime}(z)|z)\Big) \\
&=2\Big(2-\sum_{z}P^{G}_{S}(F^{\prime}(z),z)-\sum_{z}P^{G}_{T}(F^{\prime}(z),z)\Big) \\
&=2\Big(2-2\sum_{z}P^{G}_{S+T}(F^{\prime}(z),z)\Big) \\
&=4\Big(1-\sum_{z}P^{G}_{S+T}(z)P^{G}_{S+T}(F^{\prime}(z)|z)\Big) \\
&=4\sum_{z}P^{G}_{S+T}(z)(1-P^{G}_{S+T}(F^{\prime}(z)|z)) \\
&=4\mathbb{E}_{P^{G}_{S+T}(z)}(1-P^{G}_{S+T}(F^{\prime}(z)|z)) \\
&\leq-4\mathbb{E}_{P^{G}_{S+T}(z)}\log P^{G}_{S+T}(F^{\prime}(z)|z)
\end{aligned} \tag{17}$$

While $F^{\prime}$ can be any classifier, we define $F^{\prime}$ as the optimal classifier on $P^{G}_{S+T}$, i.e. $F^{\prime}(z)=\arg\max_{y}P^{G}_{S+T}(y|z)$.

Recall the definition of MI:

$$\begin{aligned}
I^{G}_{S+T}(Z;Y) &=H_{S+T}(Y)-H_{S+T}(Y|Z) \\
&=H_{S+T}(Y)+\mathbb{E}_{P^{G}_{S+T}(z)}\mathbb{E}_{P^{G}_{S+T}(y|z)}\log P^{G}_{S+T}(y|z) \\
&\leq H_{S+T}(Y)+\mathbb{E}_{P^{G}_{S+T}(z)}\mathbb{E}_{P^{G}_{S+T}(y|z)}\log P^{G}_{S+T}(F^{\prime}(z)|z) \\
&=H_{S+T}(Y)+\mathbb{E}_{P^{G}_{S+T}(z)}\log P^{G}_{S+T}(F^{\prime}(z)|z)
\end{aligned} \tag{18}$$

This means that

$$\begin{aligned}
\epsilon_{P^{G}_{T}}(F^{\prime})+\epsilon_{P^{G}_{S}}(F^{\prime}) &\leq-4\mathbb{E}_{P^{G}_{S+T}(z)}\log P^{G}_{S+T}(F^{\prime}(z)|z) \\
&\leq-4I^{G}_{S+T}(Z;Y)+4H_{S+T}(Y)
\end{aligned} \tag{19}$$

According to the MI chain rule, $I(Z_{1};Y,Z_{2})=I(Z_{1};Y)+I(Z_{1};Z_{2}|Y)=I(Z_{1};Z_{2})+I(Z_{1};Y|Z_{2})$. Since $Z_{1}$ and $Z_{2}$ are two samples from class $Y$, they are independent given $Y$, i.e., $I(Z_{1};Z_{2}|Y)=0$. So we get $I(Z_{1};Y)=I(Z_{1};Z_{2})+I(Z_{1};Y|Z_{2})$. Because $I(Z_{1};Y|Z_{2})\geq 0$, we finally get

$$I(Z_{1};Y)\geq I(Z_{1};Z_{2}) \tag{20}$$

Note that $I(Z_{1};Y)$ is $I(Z;Y)$, because $Z_{1}$ and $Z_{2}$ both follow the distribution $P(Z|Y)$.
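
Inequality (20) can be checked numerically on a toy discrete example. The sketch below assumes a hypothetical two-class label and a three-valued feature (all probabilities are made up for illustration), draws $Z_1$ and $Z_2$ independently given $Y$, and compares the two MI values.

```python
import math

def mutual_information(joint):
    """MI (in nats) of a discrete joint distribution given as a nested list."""
    row_marginal = [sum(row) for row in joint]
    col_marginal = [sum(col) for col in zip(*joint)]
    return sum(
        p * math.log(p / (row_marginal[i] * col_marginal[j]))
        for i, row in enumerate(joint)
        for j, p in enumerate(row)
        if p > 0
    )

# Illustrative toy model: P(y) and P(z | y) are assumptions, not values from the paper.
p_y = [0.5, 0.5]
p_z_given_y = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
n_y, n_z = len(p_y), len(p_z_given_y[0])

# Joint of (Z1, Y): P(z1, y) = P(y) P(z1 | y).
joint_z1_y = [[p_y[y] * p_z_given_y[y][z] for y in range(n_y)] for z in range(n_z)]

# Joint of (Z1, Z2) when Z1 and Z2 are independent given Y:
# P(z1, z2) = sum_y P(y) P(z1 | y) P(z2 | y).
joint_z1_z2 = [
    [sum(p_y[y] * p_z_given_y[y][a] * p_z_given_y[y][b] for y in range(n_y))
     for b in range(n_z)]
    for a in range(n_z)
]

mi_z_y = mutual_information(joint_z1_y)
mi_z1_z2 = mutual_information(joint_z1_z2)
entropy_y = -sum(p * math.log(p) for p in p_y)
print(mi_z1_z2, mi_z_y, entropy_y)
```

On this example the three printed quantities are ordered as $I(Z_1;Z_2)\leq I(Z;Y)\leq H(Y)$, which is the chain used to pass from (19) to (21).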

So now we get the conclusion by

$$\epsilon_{P^{G}_{T}}(F^{\prime})+\epsilon_{P^{G}_{S}}(F^{\prime})\leq-4I^{G}_{S+T}(Z_{1};Z_{2})+4H_{S+T}(Y) \tag{21}$$

We now explain when the upper bound is tight. $I^{G}_{S+T}(Z_{1};Z_{2})$ is a lower bound of $I^{G}_{S+T}(Z;Y)$, which measures how much the uncertainty of $Y$ is reduced by knowing the feature. $I^{G}_{S+T}(Z;Y)$ equals $H(Y)$ if and only if $P^{G}_{S+T}(Y|Z)$ is deterministic, i.e., $P^{G}_{S+T}(Y|Z)$ is a $\delta$ distribution, which means $P^{G}_{S}(Y|Z)=P^{G}_{T}(Y|Z)=\delta_{Y(Z)}$. Thus, if the $\mathcal{H}\Delta\mathcal{H}$-divergence is zero, i.e., $P^{G}_{S}(Z)=P^{G}_{T}(Z)$, then it is ensured that $P^{G}_{S}(Z,Y)=P^{G}_{T}(Z,Y)$ and $\epsilon_{P^{G}_{T}}(F)=\epsilon_{P^{G}_{S}}(F)$.

7.2 Proof for Theorem 4.2

Theorem 4 (Bound Estimation with Surrogate Distribution).

Let $B=d_{1}(P_{T}^{G}(Z),Q(Z))+\epsilon_{P^{G}_{T}}(Q(Y|Z))$ be the bias of the surrogate distribution $Q$ w.r.t. the target distribution. The expected risk on the target domain can be upper-bounded by the negative surrogate MI between features, the $\mathcal{H}\Delta\mathcal{H}$-divergence between the source and target domains, and the additional bias of the surrogate domain:

$$\epsilon_{P^{G}_{T}}(F)\leq\epsilon_{P^{G}_{S}}(F)-4I_{S+Q}^{G}(Z_{1};Z_{2})+B+\frac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}(P^{G}_{S}(Z),P^{G}_{T}(Z))+4H(Y) \tag{22}$$
Proof.

The expected risk can be relaxed by the triangle inequality:

$$\begin{aligned}
\epsilon_{P^{G}_{T}}(F) &\leq\epsilon_{P^{G}_{T}}(Q(Y|Z))+\epsilon_{P^{G}_{T}}(F,Q(Y|Z)) \\
&\leq\epsilon_{P^{G}_{T}}(Q(Y|Z))+\epsilon_{P^{G}_{T}}(F^{\prime},Q(Y|Z))+\epsilon_{P^{G}_{T}}(F,F^{\prime}) \\
&=\epsilon_{P^{G}_{T}}(Q(Y|Z))+\epsilon_{P^{G}_{T}}(F^{\prime},Q(Y|Z))+\epsilon_{P^{G}_{T}}(F,F^{\prime})+\epsilon_{P^{G}_{S}}(F,F^{\prime})-\epsilon_{P^{G}_{S}}(F,F^{\prime}) \\
&\leq\epsilon_{P^{G}_{T}}(Q(Y|Z))+\epsilon_{P^{G}_{T}}(F^{\prime},Q(Y|Z))+\epsilon_{P^{G}_{T}}(F,F^{\prime})+\epsilon_{P^{G}_{S}}(F)+\epsilon_{P^{G}_{S}}(F^{\prime})-\epsilon_{P^{G}_{S}}(F,F^{\prime}) \\
&\leq\epsilon_{P^{G}_{T}}(Q(Y|Z))+\epsilon_{P^{G}_{S}}(F)+\epsilon_{P^{G}_{T}}(F^{\prime},Q(Y|Z))+\epsilon_{P^{G}_{S}}(F^{\prime})+|\epsilon_{P^{G}_{T}}(F,F^{\prime})-\epsilon_{P^{G}_{S}}(F,F^{\prime})| \\
&\leq\epsilon_{P^{G}_{T}}(Q(Y|Z))+\epsilon_{P^{G}_{S}}(F)+\epsilon_{P^{G}_{T}}(F^{\prime},Q(Y|Z))+\epsilon_{P^{G}_{S}}(F^{\prime})+\frac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}(P^{G}_{S}(Z),P^{G}_{T}(Z)) \\
&=\epsilon_{P^{G}_{T}}(Q(Y|Z))+\epsilon_{P^{G}_{S}}(F)+\epsilon_{P^{G}_{T}}(F^{\prime},Q(Y|Z))-\epsilon_{Q}(F^{\prime}) \\
&\quad+\epsilon_{Q}(F^{\prime})+\epsilon_{P^{G}_{S}}(F^{\prime})+\frac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}(P^{G}_{S}(Z),P^{G}_{T}(Z)) \\
&\leq\epsilon_{P^{G}_{T}}(Q(Y|Z))+\epsilon_{P^{G}_{S}}(F)+\int(P_{T}^{G}(z)-Q(z))|\delta_{F^{\prime}(z)}-Q(Y|z)|_{1}dz \\
&\quad+\epsilon_{Q}(F^{\prime})+\epsilon_{P^{G}_{S}}(F^{\prime})+\frac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}(P^{G}_{S}(Z),P^{G}_{T}(Z)) \\
&\leq\epsilon_{P^{G}_{T}}(Q(Y|Z))+\epsilon_{P^{G}_{S}}(F)+d_{1}(P_{T}^{G}(Z),Q(Z))+\epsilon_{Q}(F^{\prime})+\epsilon_{P^{G}_{S}}(F^{\prime})+\frac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}(P^{G}_{S}(Z),P^{G}_{T}(Z)) \\
&=\epsilon_{P^{G}_{S}}(F)+B+\epsilon_{Q}(F^{\prime})+\epsilon_{P^{G}_{S}}(F^{\prime})+\frac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}(P^{G}_{S}(Z),P^{G}_{T}(Z))
\end{aligned} \tag{23}$$

By the same argument as in the previous proof, the terms $\epsilon_{Q}(F^{\prime})+\epsilon_{P^{G}_{S}}(F^{\prime})$ can be bounded by $-4I_{S+Q}^{G}(Z_{1};Z_{2})+4H(Y)$.

7.3 Proof for the Equality Condition of MI Estimation

Proposition 5.

The following MI lower bound holds:

$$I(Z_{1};Z_{2})\geq\mathbb{E}_{P(z_{1},z_{2})}[f(z_{1},z_{2})]-e^{-1}\mathbb{E}_{P(z_{1})}\left[\mathbb{E}_{P(z_{2})}\left[e^{f(z_{1},z_{2})}\right]\right] \tag{24}$$

where $f$ is an arbitrary function $\mathcal{Z}\times\mathcal{Z}\rightarrow R$. Equality holds when $\frac{e^{f(z_{1},z_{2})}}{\mathbb{E}_{P(z_{1})}e^{f(z_{1},z_{2})}}=\frac{P(z_{1}|z_{2})}{P(z_{1})}$ and $\mathbb{E}_{P(z_{1})}\mathbb{E}_{P(z_{2})}e^{f(z_{1},z_{2})}=e$.

Proof.

The proof is as follows:

$$I(Z_{1};Z_{2})=\mathbb{E}_{P(z_{1},z_{2})}\left[\log\frac{q(z_{1}\mid z_{2})}{P(z_{1})}\right]+\mathbb{E}_{P(z_{2})}[KL(P(z_{1}\mid z_{2})\|q(z_{1}\mid z_{2}))]\geq\mathbb{E}_{P(z_{1},z_{2})}\left[\log\frac{q(z_{1}\mid z_{2})}{P(z_{1})}\right] \tag{25}$$

where $q$ is an arbitrary variational distribution. Let $q(z_{1}\mid z_{2})=\frac{P(z_{1})}{Z(z_{2})}e^{f(z_{1},z_{2})}$, where $Z(z_{2})=\mathbb{E}_{P(z_{1})}\left[e^{f(z_{1},z_{2})}\right]$ is the normalization constant.

Then

$$I(Z_{1};Z_{2})\geq\mathbb{E}_{P(z_{1},z_{2})}[f(z_{1},z_{2})]-\log\mathbb{E}_{P(z_{2})}[Z(z_{2})] \tag{26}$$

By $\log(x)\leq\frac{x}{a(x)}+\log(a(x))-1$, which is tight when $a(x)=x$,

$$I(Z_{1};Z_{2})\geq\mathbb{E}_{P(z_{1},z_{2})}[f(z_{1},z_{2})]-\mathbb{E}_{P(z_{2})}\left[\frac{\mathbb{E}_{P(z_{1})}\left[e^{f(z_{1},z_{2})}\right]}{a(z_{2})}+\log(a(z_{2}))-1\right] \tag{27}$$

Letting $a(z_{2})=e$, we get the final form of the lower bound:

$$I(Z_{1};Z_{2})\geq\mathbb{E}_{P(z_{1},z_{2})}[f(z_{1},z_{2})]-e^{-1}\mathbb{E}_{P(z_{2})}\left[\mathbb{E}_{P(z_{1})}\left[e^{f(z_{1},z_{2})}\right]\right] \tag{28}$$
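
As a sanity check of the bound and its equality condition, the following sketch evaluates both sides exactly on a small discrete joint distribution. The 2×2 joint and the two critics are illustrative assumptions; the critic $f^{*}=1+\log\frac{P(z_{1},z_{2})}{P(z_{1})P(z_{2})}$ is one choice that satisfies both equality conditions stated in the proposition.

```python
import math

def marginals(joint):
    """Row and column marginals of a discrete joint distribution."""
    return [sum(row) for row in joint], [sum(col) for col in zip(*joint)]

def mutual_information(joint):
    """Exact MI (in nats) of a strictly positive discrete joint."""
    p1, p2 = marginals(joint)
    return sum(p * math.log(p / (p1[i] * p2[j]))
               for i, row in enumerate(joint) for j, p in enumerate(row))

def mi_lower_bound(joint, critic):
    """E_{P(z1,z2)}[f] - e^{-1} E_{P(z1)} E_{P(z2)}[exp(f)] for a tabular critic f."""
    p1, p2 = marginals(joint)
    joint_term = sum(joint[i][j] * critic[i][j]
                     for i in range(len(p1)) for j in range(len(p2)))
    product_term = sum(p1[i] * p2[j] * math.exp(critic[i][j])
                       for i in range(len(p1)) for j in range(len(p2)))
    return joint_term - product_term / math.e

# Illustrative joint distribution of (Z1, Z2); not taken from the paper.
joint = [[0.4, 0.1], [0.1, 0.4]]
p1, p2 = marginals(joint)

# Critic meeting the equality condition, and an uninformative critic f = 0.
optimal_critic = [[1.0 + math.log(joint[i][j] / (p1[i] * p2[j])) for j in range(2)]
                  for i in range(2)]
zero_critic = [[0.0, 0.0], [0.0, 0.0]]

mi = mutual_information(joint)
tight = mi_lower_bound(joint, optimal_critic)
loose = mi_lower_bound(joint, zero_critic)
print(mi, tight, loose)
```

With the optimal critic the bound coincides with the exact MI up to floating-point error, while the zero critic gives a strictly smaller value.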

7.4 Details for Surrogate Distribution Optimization

Let $P$ be the conditional distribution matrix of the source domain, i.e., $P_{ij}=P_{S}(x_{S}^{i}|y^{j}_{S})$, and $W$ be the conditional distribution matrix of the surrogate distribution $Q$. Let $M=\frac{1}{2}\left[\begin{array}{l}P\\ W\end{array}\right]$ be the conditional distribution matrix of the mixture distribution $P_{S+T}$. Let $S$ be the score function matrix, i.e., $S_{i,j}=f(G(x^{i}),G(x^{j}))$, $i,j=1,\dots,n_{S}+n_{T}$, where $f$ is the score function of the MI lower bound. With class-balanced sampling, $L_{MI}$ can be represented as follows:

$$L_{MI}=-\left(\frac{1}{n_{Y}}Tr(M^{T}SM)-\frac{1}{en_{Y}^{2}}\textbf{1}^{T}M^{T}e^{S}M\textbf{1}\right) \tag{29}$$

The gradient w.r.t. $M$ is

$$\nabla_{M}L_{MI}=-2\left(\frac{1}{n_{Y}}SM-\frac{1}{en_{Y}^{2}}e^{S}M\textbf{1}\textbf{1}^{T}\right) \tag{30}$$

and thus the gradient w.r.t. $W$ is

$$\nabla_{W}L_{MI}=-\left(\frac{1}{n_{Y}}\left(\textbf{0},I\right)SM-\frac{1}{en_{Y}^{2}}\left(\textbf{0},I\right)e^{S}M\textbf{1}\textbf{1}^{T}\right) \tag{31}$$

where $\left(\textbf{0},I\right)\in R^{n_{T}\times(n_{S}+n_{T})}$ and $I$ is the identity matrix of size $n_{T}$.
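
The step from (29) to (30) and (31) can be spelled out as follows. This is a sketch under two assumptions that the text does not state explicitly: the score matrix $S$ is symmetric, and $e^{S}$ denotes the entry-wise exponential (so that $e^{S}$ is symmetric as well).

```latex
\begin{aligned}
\nabla_{M}\,Tr(M^{T}SM) &= (S+S^{T})M = 2SM, \\
\nabla_{M}\,\textbf{1}^{T}M^{T}e^{S}M\textbf{1}
  &= \left(e^{S}+(e^{S})^{T}\right)M\textbf{1}\textbf{1}^{T}
   = 2e^{S}M\textbf{1}\textbf{1}^{T}, \\
\nabla_{W}L_{MI} &= \tfrac{1}{2}\left(\textbf{0},I\right)\nabla_{M}L_{MI}.
\end{aligned}
```

The last line uses the fact that $W$ enters $M$ only through its bottom $n_{T}$ rows, scaled by $\frac{1}{2}$; this factor cancels the 2 in (30) and gives (31).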

In practice, we find it harmful to minimize $L_{MI}$ by following $\nabla_{W}L_{MI}$ directly, because doing so encourages the distribution to concentrate rapidly on only a few samples. We therefore adjust the descent direction of $L_{MI}$ to update the distribution slowly. Let $|\nabla_{W}L_{Laplacian}|$ be the entry-wise absolute value of $\nabla_{W}L_{Laplacian}$. The descent directions are $d_{1}=-\nabla_{W}L_{Laplacian}$ and $d_{2}=-|\nabla_{W}L_{Laplacian}|\odot(\nabla_{W}L_{MI})$, where $\odot$ is entry-wise multiplication. Because $\nabla_{W}L_{Laplacian}$ is large on the margin of each conditional distribution, this yields a diffusion-like update of $W$, which prevents rapid collapse of the surrogate distribution.

Therefore, the update rule of $W$ is $W^{k+1}=\Pi(W^{k}+\eta_{1}d_{1}^{k}+\eta_{2}d_{2}^{k})$, where $\Pi$ is the projection operator onto the probability simplex for each column of $W$, and $\eta_{1}$ and $\eta_{2}$ are learning rates. The iteration is performed $T$ times.
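
A minimal sketch of one such update step is given below. The text does not say how $\Pi$ is computed; the sketch assumes the standard sort-based Euclidean projection onto the simplex. The gradient matrices are placeholder inputs, and the function names are hypothetical.

```python
def project_to_simplex(v):
    """Euclidean projection of a vector onto {w : w >= 0, sum(w) = 1}.

    Sort-based algorithm (an assumed choice; the paper does not specify one).
    """
    sorted_desc = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for k, value in enumerate(sorted_desc, start=1):
        cumulative += value
        candidate = (cumulative - 1.0) / k
        if value - candidate > 0:
            theta = candidate  # keep the threshold of the largest valid k
    return [max(x - theta, 0.0) for x in v]

def update_surrogate(W, grad_laplacian, grad_mi, eta1=0.5, eta2=0.05):
    """One step W <- Pi(W + eta1 * d1 + eta2 * d2), projecting each column."""
    n_rows, n_cols = len(W), len(W[0])
    new_W = [[0.0] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        column = []
        for i in range(n_rows):
            d1 = -grad_laplacian[i][j]
            d2 = -abs(grad_laplacian[i][j]) * grad_mi[i][j]
            column.append(W[i][j] + eta1 * d1 + eta2 * d2)
        for i, value in enumerate(project_to_simplex(column)):
            new_W[i][j] = value
    return new_W

# Placeholder inputs: 3 target samples, 2 classes; the gradients are made-up numbers.
W = [[0.5, 0.2], [0.3, 0.3], [0.2, 0.5]]
grad_laplacian = [[0.1, -0.2], [-0.1, 0.0], [0.0, 0.2]]
grad_mi = [[-1.0, 0.5], [0.3, -0.4], [0.7, -0.1]]
new_W = update_surrogate(W, grad_laplacian, grad_mi)
print(new_W)
```

Each column of the result remains a probability distribution over target samples, and entries where $\nabla_{W}L_{Laplacian}$ vanishes receive no MI-driven change before projection, which is the slow, diffusion-like behaviour described above.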

7.5 Implementation details

For each transfer task, we report the mean (± std) of the test accuracy over 5 runs. We use the ImageNet Deng et al. [2009] pre-trained ResNet-50 He et al. [2016] without the final classifier layer as the encoder network $G$ for Office-31 and Office-Home, and ResNet-101 for VisDA-2017. Following kan , the final classifier layer of ResNet is replaced with a task-specific fully-connected layer to parameterize the classifier $F$, and domain-specific batch normalization parameters are used. Code is attached in supplementary materials.

The model is trained with the fine-tuning protocol, where the learning rate of the classifier layer is 10 times that of the encoder, by mini-batch stochastic gradient descent (SGD) with momentum 0.9 and weight decay 5e-4. The learning rate schedule follows Long et al. [2017b, 2015]; Ganin and Lempitsky [2015]: the learning rate $\eta_{p}$ is adjusted as $\eta_{p}=\frac{\eta_{0}}{(1+ap)^{b}}$, where $p$ is the normalized training progress from 0 to 1 and $\eta_{0}$ is the initial learning rate, i.e. 0.001 for the encoder layers and 0.01 for the classifier layer. For Office-31 and Office-Home, $a=10$ and $b=0.75$, while for VisDA-2017, $a=10$ and $b=2.25$. The coefficients of $L_{\textbf{MI}}$ and $L_{\textbf{Auxilary}}$ are $\alpha_{1}=0.3,\alpha_{2}=0.1$ for Office-31, $\alpha_{1}=1.3,\alpha_{2}=1.0$ for Office-Home, and $\alpha_{1}=3.0,\alpha_{2}=1.0$ for VisDA-2017. The hyperparameters of surrogate distribution optimization are the K-nearest graph parameter $K=3$, the number of iterations $T=3$, and the learning rates $\eta_{1}=0.5$ and $\eta_{2}=0.05$.
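
The learning rate schedule above is simple enough to sketch directly. The function name and the way progress is sampled are illustrative assumptions; only the formula and the constants come from the text.

```python
def annealed_lr(eta0, progress, a=10.0, b=0.75):
    """Learning rate eta_p = eta0 / (1 + a * p) ** b for progress p in [0, 1]."""
    return eta0 / (1.0 + a * progress) ** b

# Office-31 / Office-Home setting (a = 10, b = 0.75) at a few progress values.
for p in (0.0, 0.5, 1.0):
    encoder_lr = annealed_lr(0.001, p)     # initial rate for the encoder layers
    classifier_lr = annealed_lr(0.01, p)   # initial rate for the classifier layer
    print(p, encoder_lr, classifier_lr)

# VisDA-2017 setting uses the same a with a larger exponent b.
visda_final_lr = annealed_lr(0.001, 1.0, a=10.0, b=2.25)
```

Because both parameter groups share the same decay factor, the classifier rate stays 10 times the encoder rate throughout training, and the larger exponent used for VisDA-2017 decays the rate faster.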

Experiments are conducted with Python 3 and PyTorch. The model is trained on a single NVIDIA GeForce RTX 2080 Ti graphics card. For Office-31, each training epoch takes 80 seconds, and inference takes 10 seconds.

7.6 Visualization

Figure 3 (panels (a) to (f)): From top to bottom are the feature visualizations on task D→A of ResNet-50, CDAN, and SIDA, respectively.
Figure 4 (panels (a) to (f)): The result of task W→A is similar to task D→A.

Our visualization experiment is carried out on the D→A and W→A tasks of the Office-31 dataset, which are its two most difficult tasks. The baselines are ResNet-50 pre-trained on ImageNet and CDAN Long et al. [2018]. We chose CDAN because it is a typical conditional domain alignment method. The pre-trained ResNet-50 is fine-tuned on the source domain and then tested on the target domain. Results of CDAN are obtained by running the official code. We train all models until convergence, encode the source and target data with each model, and take the representation before the final linear classification layer as the feature vectors. We visualize the features with the t-SNE function of scikit-learn using default parameters. The results are shown in Figures 3 and 4.

Figure 3 shows the results of task D→A. From top to bottom are the feature visualizations of ResNet-50, CDAN, and SIDA, respectively. The left column compares the features of the source and target domains: red represents the source domain, and blue represents the target domain. The results show that SIDA emphasises the discriminability of features. The right column shows the features of different classes on the target domain. SIDA makes the target features more distinguishable.

Figure 4 shows the results of task W→A, which are similar to those of task D→A.

The visualization results show that SIDA makes the features of different categories more distinguishable, a natural consequence of maximizing MI among features from the same category. The features are therefore easier to classify, as the visualization shows.
