
Curriculum Guided Domain Adaptation in the Dark

Chowdhury Sadman Jahan and Andreas Savakis, Senior Member, IEEE. Submitted for review on July 23, 2023. This research was partly supported by the Air Force Office of Scientific Research (AFOSR) under SBIR grant FA9550-22-P-0009 with Intelligent Fusion Technology and AFOSR grant FA9550-20-1-0039. C. S. Jahan is with the Center for Imaging Science, Rochester Institute of Technology, Rochester, NY 14623 (e-mail: sj4654@rit.edu). A. Savakis is with the Department of Computer Engineering, Kate Gleason College of Engineering, Rochester Institute of Technology, Rochester, NY 14623 (e-mail: andreas.savakis@rit.edu).
Abstract

Addressing the rising concerns of privacy and security, domain adaptation in the dark aims to adapt a black-box source-trained model to an unlabeled target domain without access to any source data or source model parameters. The need for domain adaptation of black-box predictors becomes even more pronounced to protect intellectual property as deep learning based solutions are increasingly commercialized. Current methods distill noisy predictions on the target data obtained from the source model to the target model, and/or separate clean/noisy target samples before adapting using traditional noisy label learning algorithms. However, these methods do not utilize the easy-to-hard learning nature of the clean/noisy data splits. Moreover, none of the existing methods is end-to-end; they require a separate fine-tuning stage and an initial warmup stage. In this work, we present Curriculum Adaptation for Black-Box (CABB), a curriculum guided adaptation approach that gradually trains the target model, first on target data with high-confidence (clean) labels, and later on target data with noisy labels. CABB utilizes the Jensen-Shannon divergence as a better criterion for clean-noisy sample separation, compared to the traditional criterion of cross-entropy loss. Our method utilizes co-training of a dual-branch network to suppress error accumulation resulting from confirmation bias. The proposed approach is end-to-end trainable and does not require any extra fine-tuning stage, unlike existing methods. Empirical results on standard domain adaptation datasets show that CABB outperforms existing state-of-the-art black-box DA models and is comparable to white-box domain adaptation models.

Impact Statement

In addition to preserving data privacy, commercialization of deep learning models has given rise to concerns about protecting proprietary rights. In order to alleviate these concerns, domain adaptation of black-box predictors (DABP) puts additional constraints on the already challenging domain adaptation problem by limiting access not only to the source data used for training, but also to the source model parameters during adaptation to the target domain. We take inspiration from noisy label learning and propose CABB as a curriculum guided domain adaptation approach for DABP using a dual-branch target model. Our clean-noisy sample separation process produces more accurate clean sample sets compared to traditional sample filtering methods. The pseudolabels generated in CABB are also more robust. Unlike existing state-of-the-art DABP methods, our model is end-to-end trainable, and outperforms other methods in all benchmarks we tested. Our method advances DABP and can have immediate impact to protect proprietary models and their training data during deployment and adaptation.

Index Terms

Domain adaptation, Black box models, Curriculum learning, Jensen-Shannon distance

1 Introduction

Figure 1: Overview of domain adaptation for black-box predictors (DABP). The source model may only be accessed to generate pseudolabels for the target data, and these pseudolabels may be used to adapt another model on the target domain.

With the availability of massive amounts of labelled image data, deep learning methods have made great progress in numerous computer vision tasks, such as classification, segmentation and object detection, among others. However, it is not feasible to collect and annotate huge amounts of data for every new environment where a deep network model may be deployed. Unsupervised domain adaptation (UDA) is a special case of domain adaptation (DA) and transfer learning that aims to mitigate the domain gap which arises when a model trained on labelled source data is deployed in a new environment with unlabelled target data. Most existing UDA methods either adversarially align the labelled source data features and unlabelled target data features [1, 2], or minimize their distribution discrepancy [3, 4, 5]. These methods require access to the source data during adaptation, and therefore cannot be applied when the source data is either unavailable or cannot be shared due to privacy and security concerns. A newer, more efficient UDA paradigm, called source-free UDA [6, 7], has recently emerged to address such cases, where the adaptation process utilizes only a model trained on the source data instead of the source data itself. Such methods still fail to adequately alleviate data privacy and security concerns, as model attacks may potentially retrieve the raw source data or corrupt the model. Moreover, with the commercialization of deep learning based solutions, companies may be reluctant to share their proprietary model parameters with end users. These issues brought forth a newer UDA paradigm, called domain adaptation of black-box predictors (DABP), that adapts without accessing either the source data or the source model parameters [8]. In practice, a vendor can host the source-trained model as an API in the cloud, and the end user can query the black-box source model to generate predictions for each unlabelled target instance and adapt on the target domain.

Existing DABP methods transfer knowledge from the source-trained model predictions to the target model, and then finetune the target model on the target data [8, 9]. The approach in [9] utilizes a noisy label learning (NLL) algorithm [10] to separate the target domain into an easy-to-adapt subdomain with cleaner pseudolabels and a hard-to-adapt subdomain with noisier pseudolabels, using the low cross-entropy (CE) loss criterion as the separator [11], and then applies supervised and semi-supervised learning strategies on the easy- and hard-to-adapt subdomains, respectively.

In this work, we propose Curriculum Adaptation for Black-Box (CABB) as an unsupervised domain adaptation framework for black-box predictors. We present the Jensen-Shannon distance (JSD) as a better criterion to separate clean and noisy samples using pseudolabels generated by the source model. The JSD can be modelled using a two-component Gaussian Mixture Model (GMM), where the component with the lower mean distance can be considered to consist of cleaner samples and the component with the higher mean contains noisier samples. As opposed to the traditional low-loss criterion for clean-noisy separation, the low-JSD criterion produces a more conservative, but more accurate clean sample set. To reduce error accumulation from confirmation bias, CABB employs co-training [11, 10] of two identical networks and adapts one network on the clean-noisy separated sets generated by the other, and vice versa. CABB introduces a curriculum learning strategy to adaptively learn from the clean samples first, and the noisy samples later during the adaptation process. CABB foregoes the finetuning stage of existing methods by utilizing mutual information maximization [6, 12] within its curriculum, making it end-to-end adaptable. The main contributions of our work are as follows.

  • We introduce CABB as a curriculum guided domain adaptation model that progressively learns from the clean target set and the noisy target set, while utilizing co-training of a dual-branch network to suppress error accumulation resulting from confirmation bias.

  • We identify Jensen-Shannon divergence loss as a better criterion than cross-entropy loss for separation of clean and noisy samples for DABP.

  • CABB incorporates mutual information maximization within its curriculum and makes the adaptation process end-to-end without the need for any separate finetuning stage.

  • CABB produces robust pseudolabels from the mean of an ensemble of predictions generated by the two branches of the network on a set of augmentations.

2 Related Works

2.1 Unsupervised domain adaptation

Domain gap or domain shift occurs when the data distribution of the training data (source domain) is considerably different from that of the testing data (target domain) [3]. Long et al. [13] and Tzeng et al. [4] proposed to mitigate this distribution shift by minimizing the maximum mean discrepancy (MMD) between the two distributions, while Zellinger et al. [5] proposed to match the higher order central moments of the source and target probability distributions, and thus minimize the central moment discrepancy (CMD) for UDA. Sun and Saenko [14] devised Deep CORAL to align second-order distribution statistics and thereby mitigate domain shift. Ganin et al. [1] utilized a domain discriminator module and introduced the gradient reversal layer (GRL) to adversarially align the two distributions. Many methods have since followed that utilize adversarial alignment in the latent feature space [15, 16]. While [1] uses a common encoder for the source and target data, Tzeng et al. [17] proposed to decouple the encoders by first training an encoder and a classifier on the labelled source data, followed by training a separate target data encoder using a domain discriminator, and finally deploying the same source classifier as the target classifier. Hoffman et al. [18] produced source-like images using generative image-to-image translation [19] and adversarially aligned the source and target data distributions at the low level, or pixel level. Global domain-wise adversarial alignment, however, may cause loss of intrinsic target class discrimination in the embedding space and lead to suboptimal performance. To preserve class-wise feature discrimination, Li et al. [20] simultaneously aligned the domain-wise and class-wise distributions across the source and target data by solving two complementary domain-specific and class-specific minimax problems. In a non-adversarial approach, Pan et al. [21] proposed to calculate source class prototypes from the labelled source data and target class prototypes from the pseudo-labelled target data, and then enforce consistency on the prototypes in the embedding space. Tang et al. [22] similarly rely on structural domain similarity to enforce structural source regularization and conduct discriminative clustering of the target data without any domain alignment. Chen et al. [23] introduced graph matching to formulate cross-domain adaptation, and minimized the Wasserstein distance for entity matching and the Gromov-Wasserstein distance for edge matching. In order to reduce negative transfer introduced by target samples that are either near the source-data generated decision boundaries or far away from their corresponding class centers, Xu et al. [24] proposed a weighted optimal transport strategy to achieve a reliable precise-pair-wise optimal transport procedure for domain adaptation.

Although domain divergence minimization [3, 4, 5], adversarial adaptation [1, 2], and optimal transport [23, 24] are widely used techniques for UDA, they require access to both the source and target data during adaptation. Addressing situations where source data is unavailable, several source-free DA (SFDA) methods have been proposed recently. Chidlovskii et al. [25] proposed to use a few source prototypes or representatives in place of the entire source data for semi-supervised domain adaptation. Liang et al. [26] proposed to conduct target adaptation using source-free distant supervision, iteratively finding target pseudo-labels, a domain invariant subspace in which the source and target data centroids are only moderately shifted, and finally target centroids/prototypes via an alternating minimization strategy. Liang et al. [6] introduced SHOT as an SFDA framework which transfers the source hypothesis or classifier to the target model, and adapts via self-training with information maximization [27, 28, 29] and class centroid-based pseudolabel refinement. Yang et al. [7] proposed G-SFDA, which refines the pseudolabels further via consistency regularization among neighboring target samples. Ding et al. [30] introduced SFDA-DE, which samples from an estimated source data distribution and conducts contrastive alignment between the estimated source and target distributions.

2.2 Black box domain adaptation

Figure 2: UDA pipeline in CABB. The target data is fed to the source model $f_s$, and the knowledge generated from $f_s$ is transferred to both target branches $f_{t_1}$ and $f_{t_2}$. The source-predicted pseudolabels are also used to calculate JSD and produce clean-noisy sample sets. In subsequent co-training of $f_{t_1}$ and $f_{t_2}$, the sample sets created by one branch are used to update the other branch, using curriculum guided losses to progressively adapt to the clean samples first and the noisy samples later.

Extending the premise of SFDA further, Liang et al. [8] introduced a newer paradigm of black box DA where, in addition to the source data, the source model parameters are also unavailable during adaptation. This new challenging scenario is important to protect intellectual property (source model parameters) from the end users. Liang et al. proposed DINE which distills knowledge from the black-box source model to the target model in the first stage, followed by finetuning with target pseudolabels in the second stage. Yang et al. [9] proposed BETA as a method that separates easy- and hard-to-learn pseudolabels using a conventional noisy label learning technique [11], and applies a twin-network co-training strategy similar to [10], and adversarial alignment during adaptation.

In this paper, we identify the Jensen-Shannon distance (JSD) as a more appropriate criterion for clean-noisy sample separation under the unbounded noise rate in UDA, compared to the traditional low-CE-loss criterion used for the bounded noise rate in NLL. We formulate a curriculum learning strategy to train the target model end-to-end, with cleaner samples first and progressively with noisy samples later.

3 Methodology

The black-box source model $f_s(\theta_s): \mathcal{X}_s \rightarrow \mathcal{Y}_s$, with model parameters $\theta_s$, maps the multiclass source data $x_s \in \mathcal{X}_s$ of source domain $\mathcal{D}_s$ to the label space $y_s \in \mathcal{Y}_s$. For DABP, however, we do not have access to $\theta_s$, but only to the hard predictions $(\hat{y}_t \in \mathcal{Y}_t) = f_s(\theta_s, x_t)$ from $f_s$ on the target data $x_t \in \mathcal{X}_t$ of target domain $\mathcal{D}_t$. There exists a domain shift between the source data distribution $\mathcal{D}_s$ and the target data distribution $\mathcal{D}_t$, while the label space is shared, i.e., $\mathcal{Y}_s = \mathcal{Y}_t$. Due to this domain shift, a large number of the predictions $\hat{y}_t$ may be incorrect, resulting in a set of noisy pseudolabels generated by the source model. Our objective for DA is to learn a mapping function $f_t(\theta_t): \mathcal{X}_t \rightarrow \mathcal{Y}_t$.

Research has shown that when deep networks are trained with noisy labels, the resulting models tend to memorize the wrongly labelled samples owing to confirmation bias as the training progresses [31]. Furthermore, in regular training of a single-branch network with noisy labels, the error from one training mini-batch flows back into the network itself for the next mini-batch, and thus the error increasingly accumulates [11]. In this work, during adaptation, we employ co-teaching [11] of a dual-branch network [10, 9] to mitigate the error accumulation resulting from confirmation bias. In co-teaching, due to the difference in branch parameters of the dual-branch design, error introduced by the noisy pseudolabels in one branch can be filtered out by the other branch. In practice, one branch conducts the clean-noisy sample separation for the other branch, and vice versa. Since each branch generates different sets of clean and noisy samples, co-teaching breaks the flow of error through the network, and thus error accumulation attenuates. To simplify notation, the dual target branches/models $f_{t_1}$ and $f_{t_2}$ may be represented by $f_t$ in later parts of this paper. Both networks are trained/adapted, and the final inference can be taken from either one. We follow [8] to distill knowledge from the source model to the target model in a teacher-student manner. However, unlike [8], we only have access to the hard predictions from the source model. Similar to [8], the source model predictions $\hat{y}_t^i$ are updated during adaptation at certain intervals via an exponential moving average between the source model predicted pseudolabels $\hat{y}_t^i$ and the target model predicted pseudolabels $y_t^i$. The process of generating $y_t^i$ is described in Section 3.2.
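
The periodic pseudolabel refresh mentioned above can be expressed compactly. Below is a minimal PyTorch sketch of such an exponential moving average update; it is not the authors' released code, and the momentum value and tensor shapes are illustrative assumptions.

```python
import torch

def ema_update_pseudolabels(source_probs: torch.Tensor,
                            target_probs: torch.Tensor,
                            momentum: float = 0.6) -> torch.Tensor:
    """Blend the current pseudolabels with fresh target-model predictions.

    source_probs, target_probs: (num_samples, num_classes) probability tensors.
    Returns the refreshed pseudolabels used in the next adaptation interval.
    """
    with torch.no_grad():
        updated = momentum * source_probs + (1.0 - momentum) * target_probs
        return updated / updated.sum(dim=1, keepdim=True)  # renormalize rows to sum to 1
```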

3.1 Clean-noisy separation

The predictions $\hat{y}_t$ generated by the black-box source model $f_s$ are noisy and unreliable due to the domain shift between $\mathcal{D}_s$ and $\mathcal{D}_t$. Research on learning with noisy labels shows that deep learning models tend to fit the clean samples first and the noisy samples later during training [32, 10]. We follow this insight and separate the target domain data into a clean sample set $\mathcal{X}_{tc}$ with reliable predictions, and a noisy sample set $\mathcal{X}_{tn}$ with unreliable predictions. In traditional noisy label settings, the noisy labels are caused by wrong annotations, either from humans or from image search engines; the noise rate is therefore bounded. However, as the noisy labels in UDA are generated by the source model, the noise rate in this case is unbounded and can approach unity [33]. We propose the Jensen-Shannon distance (JSD) [34] between the source-predicted hard labels $\hat{y}_t^i$ and the target model output probabilities as the criterion for clean-noisy sample separation under an unbounded noise rate. The JSD is calculated as,

JSD(\hat{y}_{t}^{i}, p_{t}^{i}) = \frac{1}{2} KL\!\left(\hat{y}_{t}^{i}, \frac{\hat{y}_{t}^{i}+p_{t}^{i}}{2}\right) + \frac{1}{2} KL\!\left(p_{t}^{i}, \frac{p_{t}^{i}+\hat{y}_{t}^{i}}{2}\right)    (1)

where $KL(a, b)$ is the Kullback-Leibler divergence between $a$ and $b$, and $p_t^i$ is the target model output probability for target sample $x_t^i$. Compared to the cross-entropy loss, JSD is symmetric by design and ranges between 0 and 1, making it less susceptible to noise. When applied to the network response, JSD produces a bimodal distribution, which is modelled by a two-component Gaussian Mixture Model (GMM) with equal priors. In DA, the target model may confidently categorize an image as the wrong class with very high prediction probability, so prediction confidence alone is a poor criterion for identifying whether a sample is clean or noisy. For the potentially unbounded pseudolabel noise rate in DABP, we instead take the probability of belonging to the JSD Gaussian component with the lower mean value as the confidence metric of a sample being clean in our clean-noisy sample separation stage. Empirically, we apply a threshold $\delta_t$ on this confidence score of belonging to the lower-mean GMM component to select the clean sample set $\mathcal{X}_{tc}$ at the beginning of each adaptation epoch. The remaining target samples are included in the noisy sample set $\mathcal{X}_{tn}$.
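
A minimal sketch of this separation step, under our assumptions: `y_hat` holds the (one-hot or soft) source pseudolabels, `p_t` the target-model probabilities, base-2 logarithms keep the JSD in [0, 1], and scikit-learn's GaussianMixture stands in for the equal-prior GMM; the threshold `delta_t` is illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def js_distance(y_hat: np.ndarray, p_t: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Per-sample JSD (Eq. 1) between pseudolabels y_hat and predictions p_t, shape (N, C)."""
    m = 0.5 * (y_hat + p_t)
    kl1 = np.sum(y_hat * (np.log2(y_hat + eps) - np.log2(m + eps)), axis=1)
    kl2 = np.sum(p_t * (np.log2(p_t + eps) - np.log2(m + eps)), axis=1)
    return 0.5 * kl1 + 0.5 * kl2

def split_clean_noisy(y_hat: np.ndarray, p_t: np.ndarray, delta_t: float = 0.5):
    """Fit a 2-component GMM on the per-sample JSD and threshold the clean-probability."""
    jsd = js_distance(y_hat, p_t).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, weights_init=[0.5, 0.5], random_state=0).fit(jsd)
    clean_comp = int(np.argmin(gmm.means_.ravel()))      # lower-mean component = clean
    clean_prob = gmm.predict_proba(jsd)[:, clean_comp]   # confidence of being clean
    clean_idx = np.flatnonzero(clean_prob >= delta_t)
    noisy_idx = np.flatnonzero(clean_prob < delta_t)
    return clean_idx, noisy_idx
```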

3.2 Ensemble based pseudolabeling

Figure 3: Ensemble-based pseudolabeling in CABB. Each sample is augmented to produce 6 different views that are fed through both branches $f_{t_1}$ and $f_{t_2}$ to create a total of 12 output predictions, which are then averaged to produce the soft pseudolabel for co-training $f_{t_1}$ and $f_{t_2}$.

In order to produce robust target model pseudolabels $y_t^i$, we apply a series of augmentations on the target samples and produce an ensemble of output prediction probabilities from our two target models. We give equal weights to each output prediction and take the mean of the outputs as the soft pseudolabel as follows.

y_{t}^{i} = \frac{1}{2M}\sum_{m=1}^{M}\left[f_{t_{1}}(x_{t_{m}}^{i}) + f_{t_{2}}(x_{t_{m}}^{i})\right]    (2)

where $M$ is the number of augmentations for the $i$-th target sample. The predictions are further sharpened with a temperature factor $T$ $(0 < T < 1)$ and then normalized as follows.

y_{t}^{i} = \frac{(y_{t}^{i})^{\frac{1}{T}}}{\sum_{C}(y_{t}^{iC})^{\frac{1}{T}}}    (3)

where $y_t^{iC}$ is the $C$-th dimensional value of the pseudolabel vector $y_t^i$.
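
A minimal sketch of Eqs. (2)-(3) in PyTorch, assuming the two branches return logits and that the augmented views are prepared outside this function; variable names are ours, not the authors'.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ensemble_pseudolabel(f_t1, f_t2, views, T: float = 0.5) -> torch.Tensor:
    """views: list of M augmented batches of the same samples, each of shape (B, 3, H, W)."""
    probs = []
    for x in views:                              # M augmentations (M = 6 in Figure 3)
        probs.append(F.softmax(f_t1(x), dim=1))  # branch 1 prediction
        probs.append(F.softmax(f_t2(x), dim=1))  # branch 2 prediction
    y = torch.stack(probs).mean(dim=0)           # Eq. (2): mean of the 2M outputs
    y = y.pow(1.0 / T)                           # Eq. (3): temperature sharpening
    return y / y.sum(dim=1, keepdim=True)        # renormalize over classes
```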

3.3 Curriculum guided noisy learning

In order to mitigate the early-training memorization [32] induced by noisy labels during the adaptation of deep models, we introduce curriculum guided learning to train the target model on the clean samples first and on the noisy samples later. As the adaptation/training progresses, more noisy samples are reclassified as clean samples.

We employ separate training losses for the clean and noisy sample sets. The clean set is trained with the standard cross-entropy (CE) loss as follows.

\mathcal{L}_{tc}(f_{t};\mathcal{X}_{tc}) = -\mathbb{E}_{x_{t}^{i}\in\mathcal{X}_{tc}}\sum_{k=1}^{C} y_{t_{k}}^{i}\log\left(\sigma_{k}(f_{t}(x_{t}^{i}))\right)    (4)

where $\sigma_k(a) = \frac{\exp(a_k)}{\sum_i \exp(a_i)}$ is the softmax function and $C$ is the number of classes. For the noisy set, we minimize a combination of active-passive losses [35] constructed from the normalized cross-entropy loss $\mathcal{L}_{tn_{NCE}}$ and the reverse cross-entropy loss $\mathcal{L}_{tn_{RCE}}$. It was shown in [35] that such normalization makes a model robust to noisy data. The reverse cross-entropy loss is applied to avoid underfitting on the noisy set. Due to the unbounded nature of the noise rate in UDA and the conservative clean-noisy separation criterion in CABB, we employ this particular combination of active-passive losses as our noisy set loss $\mathcal{L}_{tn}$ to make target training/adaptation robust and comprehensive on the noisy sample set. The loss function is expressed as follows.

\mathcal{L}_{tn_{NCE}}(f_{t};\mathcal{X}_{tn}) = -\mathbb{E}_{x_{t}^{i}\in\mathcal{X}_{tn}} \frac{\sum_{k=1}^{C} y_{t_{k}}^{i}\log\left(\sigma_{k}(f_{t}(x_{t}^{i}))\right)}{\sum_{j=1}^{C}\sum_{k=1}^{C} y_{t_{j}}^{i}\log\left(\sigma_{k}(f_{t}(x_{t}^{i}))\right)}    (5)
\mathcal{L}_{tn_{RCE}}(f_{t};\mathcal{X}_{tn}) = -\mathbb{E}_{x_{t}^{i}\in\mathcal{X}_{tn}}\sum_{k=1}^{C} \sigma_{k}(f_{t}(x_{t}^{i}))\log(y_{t_{k}}^{i})    (6)
\mathcal{L}_{tn} = \mathcal{L}_{tn_{NCE}} + \beta\,\mathcal{L}_{tn_{RCE}}    (7)

where $\beta$ is a hyperparameter.
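
A minimal sketch of the noisy-set objective, assuming `logits` are the target-branch outputs and `targets` the soft pseudolabels of Section 3.2; it follows the standard normalized-CE and reverse-CE formulation of [35], and the clamp value is our numerical assumption.

```python
import torch
import torch.nn.functional as F

def noisy_set_loss(logits: torch.Tensor, targets: torch.Tensor, beta: float = 1.0) -> torch.Tensor:
    """Active-passive loss on the noisy set: normalized CE plus reverse CE (Eqs. 5-7)."""
    log_p = F.log_softmax(logits, dim=1)
    p = log_p.exp()

    # Normalized cross-entropy (active term): CE divided by the sum of CE values
    # obtained over all possible class assignments, following [35].
    ce = -(targets * log_p).sum(dim=1)
    nce = ce / (-log_p.sum(dim=1))

    # Reverse cross-entropy (passive term): swap prediction and label roles;
    # zero entries of the pseudolabel are clamped before taking the log.
    rce = -(p * torch.log(targets.clamp_min(1e-4))).sum(dim=1)

    return (nce + beta * rce).mean()  # Eq. (7): L_tn = L_NCE + beta * L_RCE
```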

To promote learning of clean samples first and to mitigate noisy label memorization, target training is done under curriculum guidance [36]. Based on the success of the clean-noisy sample separation, the pseudolabels in the clean sample set $\mathcal{X}_{tc}$ are more likely to be correct, while those in the noisy sample set $\mathcal{X}_{tn}$ have a much higher noise rate. Therefore, a deep network tends to easily learn from the unambiguous $\mathcal{X}_{tc}$ set. We set a curriculum factor $\gamma_n$ according to the following equation.

\gamma_{n} = \gamma_{n-1}\left(1 - \alpha\,\epsilon^{-L_{x_{n}}/L_{x_{n-1}}}\right)    (8)

where $\alpha$ is a hyperparameter and $n$ is the iteration number. $\gamma_{n-1}$ is the curriculum factor for the previous iteration. The ratio $L_{x_n}/L_{x_{n-1}}$ determines how much the curriculum factor decreases from iteration $n-1$ to $n$. If the CE loss on the clean set increases, $\gamma$ decreases by a small value to allow for further training on the clean set in the subsequent iterations. But if the CE loss decreases by a large margin, $\gamma$ decreases accordingly to accommodate learning from the noisy sample set in the coming iterations. Our curriculum guidance balances the supervised and unsupervised losses on the respective clean and noisy sets as follows.

\mathcal{L}_{t} = \gamma_{n}\mathcal{L}_{tc} + (1-\gamma_{n})\mathcal{L}_{tn}    (9)
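
A minimal sketch of the curriculum schedule of Eqs. (8)-(9); the initial value $\gamma_0 = 1$, reading $\epsilon$ as Euler's number, and the small constant guarding the division are our assumptions.

```python
import math

class CurriculumWeight:
    """Tracks the curriculum factor gamma_n of Eq. (8) from the clean-set CE loss."""

    def __init__(self, alpha: float = 2e-4, gamma_init: float = 1.0):
        self.alpha = alpha        # dataset-dependent hyperparameter (see Sec. 4.2)
        self.gamma = gamma_init   # start by weighting the clean set fully
        self.prev_loss = None

    def step(self, clean_loss: float) -> float:
        if self.prev_loss is not None:
            ratio = clean_loss / (self.prev_loss + 1e-12)
            self.gamma *= 1.0 - self.alpha * math.exp(-ratio)  # Eq. (8), reading epsilon as e
        self.prev_loss = clean_loss
        return self.gamma

def target_loss(loss_clean: float, loss_noisy: float, gamma: float) -> float:
    """Eq. (9): curriculum-weighted combination of clean-set and noisy-set losses."""
    return gamma * loss_clean + (1.0 - gamma) * loss_noisy
```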

We adopt the formulation of the information maximization (IM) loss [27, 28, 6] from [12] to help our model produce precise predictions, while maintaining a global diversity across all classes in the output predictions. The IM loss is a combination of the following entropy loss $\mathcal{L}_{ent}$ and equal diversity loss $\mathcal{L}_{eqdiv}$.

\mathcal{L}_{ent}(f_{t};\mathcal{X}_{t}) = -\mathbb{E}_{x_{t}^{i}\in\mathcal{X}_{t}}\sum_{k=1}^{C}\sigma_{k}(f_{t}(x_{t}^{i}))\log\left(\sigma_{k}(f_{t}(x_{t}^{i}))\right)    (10)
\mathcal{L}_{eqdiv}(f_{t};\mathcal{X}_{t}) = \sum_{k=1}^{C} q_{k}\log\left(\frac{q_{k}}{\hat{q}_{k}}\right)    (11)

where $\hat{q}_k = \mathbb{E}_{x_t\in\tilde{X}_{t}^{*}}[\sigma(f_t(x_t))]$ is the mean of the softmax of the target network output response. $\mathcal{L}_{eqdiv}$ computes the KL divergence between $\hat{q}_k$ and the ideal uniform response $q_k$. Our curriculum guided IM loss is as follows.

\mathcal{L}_{IM} = \mathcal{L}_{eqdiv} + (1-\gamma_{n})\mathcal{L}_{ent}    (12)

Minimization of the entropy loss $\mathcal{L}_{ent}$ is gradually activated as the model sufficiently adapts to the clean samples. Such curriculum guidance ensures that the potentially erroneous predictions produced in the early stages of self-training are not accumulated. The $\mathcal{L}_{eqdiv}$ loss enforces diversity in the output predictions throughout the training process. The overall objective function is,

\mathcal{L}_{tot} = \mathcal{L}_{t} + \mathcal{L}_{IM}    (13)
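
The IM terms and the overall objective can be sketched as follows; this is a minimal PyTorch sketch, not the authors' code, and the small constants inside the logarithms and the batch-level estimate of $\hat{q}$ are our assumptions.

```python
import torch
import torch.nn.functional as F

def im_loss(logits: torch.Tensor, gamma: float) -> torch.Tensor:
    """Curriculum-gated information maximization loss of Eqs. (10)-(12)."""
    p = F.softmax(logits, dim=1)
    # Eq. (10): mean per-sample prediction entropy.
    l_ent = -(p * torch.log(p + 1e-8)).sum(dim=1).mean()
    # Eq. (11): KL divergence between the uniform prior q and the batch-mean prediction q_hat.
    q_hat = p.mean(dim=0)
    q = torch.full_like(q_hat, 1.0 / p.size(1))
    l_eqdiv = (q * (torch.log(q) - torch.log(q_hat + 1e-8))).sum()
    # Eq. (12): entropy minimization is phased in as gamma decreases.
    return l_eqdiv + (1.0 - gamma) * l_ent

def total_loss(l_target: torch.Tensor, logits: torch.Tensor, gamma: float) -> torch.Tensor:
    return l_target + im_loss(logits, gamma)  # Eq. (13)
```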

A brief demonstration of the CABB pipeline can be found in Algorithm 1.

Input: Black-box source-trained model $f_s$ and target data $x_t^i \in \mathcal{X}_t$
Output: Target-adapted model $f_t$
Initialization: Dual target models $f_{t_1}$ and $f_{t_2}$
for epoch = 1 to epoch_total do
    while $m \leq iter_{distill}$ do
        Distill from teacher $f_s$ to students $f_{t_1}$ and $f_{t_2}$ following [8]
    end while
    Conduct clean ($\mathcal{X}_{tc}$) - noisy ($\mathcal{X}_{tn}$) sample separation using JSD from model $f_{t_1}$ for $f_{t_2}$, and vice versa
    for $f_t \in \{f_{t_1}, f_{t_2}\}$ do
        while $n \leq iter_{adapt}$ do
            Get ensemble-averaged pseudolabels $y_t^i \in \mathcal{Y}_t$ from equations 2 and 3
            Calculate $\mathcal{L}_{tc}$ on $\mathcal{X}_{tc}$, $\mathcal{L}_{tn}$ on $\mathcal{X}_{tn}$, and $\mathcal{L}_{ent}$ and $\mathcal{L}_{eqdiv}$ on $(\mathcal{X}_{tc}, \mathcal{X}_{tn}) \in \mathcal{X}_t$ using equations 4, 7, 10, and 11, respectively
            Calculate $\gamma_n$ using equation 8
            Calculate $\mathcal{L}_t$ and $\mathcal{L}_{IM}$ using equations 9 and 12
            Optimize $f_t$ with loss $\mathcal{L}_{tot}$ using equation 13
        end while
    end for
end for
Algorithm 1: Pseudocode for CABB

4 Experimental setup

Figure 4: Accuracy on the clean sample set achieved via clean-noisy sample separation using low JSD (CABB) vs low CE (BETA), after distillation from the source teacher at the first epoch.
Method SF BB A→D A→W D→A D→W W→A W→D Mean
DANN [1] 79.7 82.0 68.2 96.9 67.4 99.1 82.2
ALDA [37] 94.0 95.6 72.2 97.7 72.5 100.0 88.7
GVB-GD [38] 95.0 94.8 73.4 98.7 73.7 100.0 89.4
SRDC [22] 95.8 95.7 76.7 99.2 77.1 100.0 90.9
SHOT [6] 94.0 90.1 74.7 98.4 74.3 99.9 88.6
A$^2$Net [39] 94.5 94.0 76.7 99.2 76.1 100 90.1
SFDA-DE [30] 96.0 94.2 76.6 98.5 75.5 99.8 90.1
LNL-OT [40] 88.8 85.5 64.6 95.1 66.7 98.7 83.2
LNL-KL [41] 89.4 86.8 65.1 94.8 67.1 98.7 83.6
HD-SHOT [42] 86.5 83.1 66.1 95.1 68.9 98.1 83.0
SD-SHOT [42] 89.2 83.7 67.9 95.3 71.1 97.1 84.1
DINE [8] 91.6 86.8 72.2 96.2 73.3 98.6 86.4
BETA [9] 93.6 88.3 76.1 95.5 76.5 99.0 88.2
CABB (Ours) 94.0 88.6 76.0 97.9 76.0 99.6 88.7
Table 1: Mean accuracy on the Office-31 dataset. 'SF' refers to source-free and 'BB' means black-box. The top performing results among the DABP methods are in bold letters.

4.1 Datasets

We evaluate CABB on three popular domain adaptation datasets viz. Office-31 [43], Office-Home [44], and VisDA-C [45]. Office-31 is a small-scale DA dataset consisting of images of 31 classes of common objects found in an office, across 3 domains viz. Amazon (A), Webcam (W), and DSLR (D). Office-Home is a medium-sized DA dataset consisting of 4 domains viz. Art (A), Clipart (C), Product (P), and Real-World (R). The dataset contains images of 65 classes of items found in office and home environments. VisDA-C is a large-scale dataset consisting of 12 classes of objects across 2 domains: Synthetic (S) and Real (R). The 152K synthetic images are generated by 3D rendering and serve as the source domain, while the 55K real samples are taken from the MS COCO dataset [46] and serve as the target domain.

4.2 Implementation details

We follow the same protocol as [8, 9] for source training to ensure fairness of comparison. Our target models are initialized with ImageNet pretrained weights, since the source model parameters are inaccessible. For Office-31 and Office-Home we use ResNet-50, and for VisDA-C we use ResNet-101 as the backbone [47], on top of which we attach an MLP-based classifier, similar to [8, 9]. The target models are trained with the SGD optimizer with momentum $0.9$ and weight decay $1e^{-3}$. The learning rate for the backbone is set to $1e^{-3}$, while that of the classifier is set to $1e^{-2}$. $\alpha$ in the curriculum factor is set to $2e^{-3}$ for Office-31, and $2e^{-4}$ for Office-Home and VisDA-C, depending on the size of the dataset. The model is adapted for 50 epochs for the Office-31 and Office-Home datasets, and for 5 epochs for the VisDA-C dataset. The temperature sharpening factor $T$ is set to $0.5$. We implement our method using the PyTorch library on an NVIDIA A100 GPU.
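
For concreteness, a minimal PyTorch sketch of the optimizer configuration described above; the placeholder backbone and head definitions are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torchvision

# ImageNet-pretrained ResNet-50 backbone; the single linear layer standing in for
# the MLP classifier head (65 Office-Home classes) is a hypothetical placeholder.
backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")
classifier = torch.nn.Linear(1000, 65)

optimizer = torch.optim.SGD(
    [
        {"params": backbone.parameters(), "lr": 1e-3},    # backbone learning rate
        {"params": classifier.parameters(), "lr": 1e-2},  # classifier learning rate
    ],
    momentum=0.9,
    weight_decay=1e-3,
)
```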

5 Results

5.1 Overall evaluation

Method SF BB A→C A→P A→R C→A C→P C→R P→A P→C P→R R→A R→C R→P Mean
DANN [1] 45.6 59.3 70.1 47.0 58.5 60.9 46.1 43.7 68.5 63.2 51.8 76.8 57.6
ALDA [37] 53.7 70.1 76.4 60.2 72.6 71.5 56.8 51.9 77.1 70.2 56.3 82.1 66.6
GVB-GD [38] 57.0 74.7 79.8 64.6 74.1 74.6 65.2 55.1 81.0 74.6 59.7 84.3 70.4
SRDC [22] 52.3 76.3 81.0 69.5 76.2 78.0 68.7 53.8 81.7 76.3 57.1 85.0 71.3
FixBi [48] 58.1 77.3 80.4 67.7 79.5 78.1 65.8 57.9 81.7 76.4 62.9 86.7 72.7
G-SFDA [7] 57.9 78.6 81.0 66.7 77.2 77.2 65.6 56.0 82.2 72.0 57.8 83.4 71.3
SHOT [6] 57.1 78.1 81.5 68.0 78.2 78.1 67.4 54.9 82.2 73.3 58.8 84.3 71.8
HCL [49] 64.0 78.6 82.4 64.5 73.1 80.1 64.8 59.8 75.3 78.1 69.3 81.5 72.6
A$^2$Net [39] 58.4 79.0 82.4 67.5 79.3 78.9 68.0 56.2 82.9 74.1 60.5 85.0 72.8
SFDA-DE [30] 59.7 79.5 82.4 69.7 78.6 79.2 66.1 57.2 82.6 73.9 60.8 85.5 72.9
LNL-OT [40] 49.1 71.7 77.3 60.2 68.7 73.1 57.0 46.5 76.8 67.1 52.3 79.5 64.9
LNL-KL [41] 49.0 71.5 77.1 59.0 68.7 72.9 56.4 46.9 76.6 66.2 52.3 79.1 64.6
HD-SHOT [42] 48.6 72.8 77.0 60.7 70.0 73.2 56.6 47.0 76.7 67.5 52.6 80.2 65.3
SD-SHOT [42] 50.1 75.0 78.8 63.2 72.9 76.4 60.0 48.0 79.4 69.2 54.2 81.6 67.4
DINE [8] 52.2 78.4 81.3 65.3 76.6 78.7 62.7 49.6 82.2 69.8 55.8 84.2 69.7
BETA [9] 57.2 78.5 82.1 68.0 78.6 79.7 67.5 56.0 83.0 71.9 58.9 84.2 72.1
CABB (Ours) 57.4 79.5 82.0 68.1 79.3 78.8 68.2 57.9 82.7 73.6 60.0 86.4 72.8
Table 2: Mean accuracy on the Office-Home dataset. ’SF’ refers to source-free and ’BB’ means black-box. The top performing results among the DABP methods are in bold letters.
Method SF BB plane bcycl bus car horse knife mcycle person plant sktbrd train truck Per-class
DANN [1] 81.9 77.7 82.8 44.3 81.2 29.5 65.2 28.6 51.9 54.6 82.8 7.8 57.6
ALDA [37] 93.8 74.1 82.4 69.4 90.6 87.2 89.0 67.6 93.4 76.1 87.7 22.2 77.8
SHOT [6] 94.3 88.5 80.1 57.3 93.1 94.9 80.7 80.3 91.5 89.1 86.3 58.2 82.9
A$^2$Net [39] 94.0 87.8 85.6 66.8 93.7 95.1 85.8 81.2 91.6 88.2 86.5 56.0 84.3
SFDA-DE [30] 95.3 91.2 77.5 72.1 95.7 97.8 85.5 86.1 95.5 93.0 86.3 61.6 86.5
LNL-OT [40] 82.6 84.1 76.2 44.8 90.8 39.1 76.7 72.0 82.6 81.2 82.7 50.6 72.0
LNL-KL [41] 82.7 83.4 76.7 44.9 90.9 38.5 78.4 71.6 82.4 80.3 82.9 50.4 71.9
HD-SHOT [42] 75.8 85.8 78.0 43.1 92.0 41.0 79.9 78.1 84.2 86.4 81.0 65.5 74.2
SD-SHOT [42] 79.1 85.8 77.2 43.4 91.6 41.0 80.0 78.3 84.7 86.8 81.1 65.1 74.5
DINE [8] 81.4 86.7 77.9 55.1 92.2 34.6 80.8 79.9 87.3 87.9 84.3 58.7 75.6
BETA [9] 96.2 83.9 82.3 71.0 95.3 73.1 88.4 80.6 95.5 90.9 88.3 45.1 82.6
CABB (Ours) 95.1 87.0 82.6 71.5 94.5 89.7 87.5 81.5 93.8 92.4 87.3 55.5 84.9
Table 3: Mean per-class accuracy on the VisDA-C dataset. ’SF’ refers to source-free and ’BB’ means black-box. The top performing results among the DABP methods are in bold letters.
Curriculum $\mathcal{L}_{tn}$ $\mathcal{L}_{ent}$ plane bcycl bus car horse knife mcycle person plant sktbrd train truck Per-class
98.0 93.1 79.1 41.8 97.1 81.6 79.5 79.9 93.3 91.1 90.3 49.5 81.2
98.2 89.2 82.2 58.1 97.2 83.5 84.3 71.3 95.8 92.2 90.4 18.1 80.0
97.1 82.3 85.0 79.1 91.7 93.2 89.0 77.7 94.4 92.5 83.9 1.2 80.6
97.3 89.9 78.3 60.1 96.4 76.1 80.2 77.3 93.5 90.0 88.7 52.8 81.7
95.2 85.9 83.5 68.9 93.8 88.6 83.6 80.7 95.1 92.0 86.0 56.7 84.2
95.1 87.0 82.6 71.5 94.5 89.7 87.5 81.5 93.8 92.4 87.3 55.5 84.9
Table 4: Performance evaluation of curriculum adaptation involving different parts of CABB on the VisDA-C dataset. The 'tick' marks mean the part is present in the model, and the 'cross' mark means that part is absent. When curriculum is absent and $\mathcal{L}_{tn}$ is present, $\gamma_n$ is set to $0.5$.

Liang et al. [8] pioneered this area and formulated the problem statement. They also presented a number of baselines for comparison. Among them, LNL-KL and LNL-OT are inspired by noisy label learning and utilize KL divergence and optimal transport, respectively, for refining pseudolabels. HD-SHOT and SD-SHOT are based on the SHOT [6] model and treat the source model predictions as hard labels and soft labels, respectively. In addition to these baselines, we compare CABB against the state-of-the-art black-box DA models DINE [8] and BETA [9]. We further compare against a number of standard DA methods, such as DANN [1], ALDA [37], GVB-GD [38], SRDC [22], SHOT [6], A$^2$Net [39], and SFDA-DE [30], among others.

In Figure 4, we present the accuracy of the clean sample set after clean-noisy sample separation for the first epoch, after distillation from the source teacher model to the target student model. We can see that our choice of the low-JSD separation criterion in CABB consistently outperforms the low-CE-loss criterion used in BETA by 1%-7% across all 12 source-target domain pairs of the Office-Home dataset.

The classification accuracies after adaptation across the 6 domain pairs of the Office-31 dataset are shown in Table 1. CABB outperforms BETA and DINE on average by 0.5% and 2.3%, respectively. While CABB beats DINE across all the domain pairs, it only underperforms BETA on Webcam-to-Amazon adaptation, by 0.5%. Overall, CABB is on par with the white-box source-free model SHOT and the non-source-free model ALDA.

The results for the Office-Home dataset are presented in Table 2. CABB outperforms BETA and DINE by 0.7% and 3.1%, respectively. Moreover, CABB outperforms several standard non-source-free DA methods such as SRDC and FixBi, and is either better than, or on par with, existing state-of-the-art white-box source-free DA models like HCL, A$^2$Net, and SFDA-DE.

A comparative evaluation of CABB against other state-of-the-art DA methods and DABP baselines on the VisDA-C dataset is shown in Table 3. CABB surpasses DINE and BETA by 9.3% and 2.3%, respectively, in terms of mean per-class accuracy. CABB beats BETA in the most challenging category, truck, by 10.4%. CABB also outperforms the white-box source-free models SHOT and A$^2$Net comfortably.

5.2 Ablation study

A detailed ablation study on the efficacy of our curriculum adaptation method is given in Table 4. The impact of curriculum on the noisy set loss $\mathcal{L}_{tn}$ and the entropy loss $\mathcal{L}_{ent}$ is shown, as the curriculum is applied to these two components. In this table, in the absence of curriculum adaptation, $\gamma_n$ is set to $0.5$. In row 2, $\mathcal{L}_{tn}$ is set to 0.

The results clearly indicate the benefit of a guided adaptation framework that progressively learns from the clean samples first and the noisy samples later. We see in the first three rows of Table 4 that without curriculum guidance, adaptation performance suffers significantly. In the absence of curriculum guidance, leaving out learning from the noisy samples during the adaptation process is better than adapting to the noisy samples with the $\mathcal{L}_{tn}$ loss and further enforcing the wrong predictions with the $\mathcal{L}_{ent}$ loss. The drawback of blindly adapting to noisy samples becomes evident in the second and third rows, particularly in the most challenging truck class. By adapting to unrefined noisy samples from the beginning, the model performance drastically deteriorates and accuracy on truck can fall to as low as 1%.

The results in the 4th through 6th rows of Table 4 show the necessity of curriculum guidance during adaptation. In the presence of curriculum learning, CABB outperforms existing state-of-the-art DABP methods. Curriculum guidance progressively refines the noisy sample pseudolabels. While enforcing the refined predictions by minimizing the $\mathcal{L}_{ent}$ loss produces improved results, learning from the noisy pseudolabels by minimizing the $\mathcal{L}_{tn}$ loss significantly boosts the model performance. Minimizing the losses $\mathcal{L}_{tn}$ and $\mathcal{L}_{ent}$ on the refined pseudolabels together produces the strongest results.

6 Conclusion

In this paper we present a curriculum guided, self-training based domain adaptation method called CABB to adapt a black-box source model/predictor to the target domain. Without access to the source data or the source model parameters during adaptation, we draw inspiration from noisy label learning algorithms. We employ a co-training scheme and propose to use the Jensen-Shannon distance (JSD) as the criterion to filter clean and reliable samples from noisy and unreliable samples. The JSD calculated between the source model predicted pseudolabels and the target model predictions is modelled using a mixture of Gaussian distributions. The samples with a high probability of belonging to the component with the lower mean JSD are taken as clean samples, and the target model is trained under a curriculum schedule, first on the clean samples and progressively on the noisy samples. The dual-branch design of CABB also allows robust ensemble-based pseudolabeling. CABB consistently outperforms existing black-box domain adaptation models on three popular domain adaptation benchmarks, and is on par with white-box source-free models.

Acknowledgment

The authors would like to thank Nazmul Karim for his valuable suggestions and insights regarding noisy label learning. The authors would also like to thank RIT Research Computing for making computing resources available for experimentation.

References

  • [1] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, “Domain-adversarial training of neural networks,” The journal of machine learning research, vol. 17, no. 1, pp. 2096–2030, 2016.
  • [2] J. Hoffman, E. Tzeng, T. Park, J.-Y. Zhu, P. Isola, K. Saenko, A. Efros, and T. Darrell, “Cycada: Cycle-consistent adversarial domain adaptation,” in International conference on machine learning.   Pmlr, 2018, pp. 1989–1998.
  • [3] A. Torralba and A. A. Efros, “Unbiased look at dataset bias,” in CVPR 2011.   IEEE, 2011, pp. 1521–1528.
  • [4] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell, “Deep domain confusion: Maximizing for domain invariance,” arXiv preprint arXiv:1412.3474, 2014.
  • [5] W. Zellinger, T. Grubinger, E. Lughofer, T. Natschläger, and S. Saminger-Platz, “Central moment discrepancy (cmd) for domain-invariant representation learning,” in International Conference on Learning Representations.
  • [6] J. Liang, D. Hu, and J. Feng, “Do we really need to access the source data? source hypothesis transfer for unsupervised domain adaptation,” in International Conference on Machine Learning.   PMLR, 2020, pp. 6028–6039.
  • [7] S. Yang, Y. Wang, J. Van De Weijer, L. Herranz, and S. Jui, “Generalized source-free domain adaptation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 8978–8987.
  • [8] J. Liang, D. Hu, J. Feng, and R. He, “Dine: Domain adaptation from single and multiple black-box predictors,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
  • [9] J. Yang, X. Peng, K. Wang, Z. Zhu, J. Feng, L. Xie, and Y. You, “Divide to adapt: Mitigating confirmation bias for domain adaptation of black-box predictors,” in The Eleventh International Conference on Learning Representations, 2023. [Online]. Available: https://openreview.net/forum?id=hVrXUps3LFA
  • [10] J. Li, R. Socher, and S. C. Hoi, “Dividemix: Learning with noisy labels as semi-supervised learning,” in International Conference on Learning Representations, 2020. [Online]. Available: https://openreview.net/forum?id=HJgExaVtwr
  • [11] B. Han, Q. Yao, X. Yu, G. Niu, M. Xu, W. Hu, I. Tsang, and M. Sugiyama, “Co-teaching: Robust training of deep neural networks with extremely noisy labels,” Advances in neural information processing systems, vol. 31, 2018.
  • [12] A. M. N. Taufique, C. S. Jahan, and A. Savakis, “Continual unsupervised domain adaptation in data-constrained environments,” IEEE Transactions on Artificial Intelligence, 2023.
  • [13] M. Long, Y. Cao, J. Wang, and M. Jordan, “Learning transferable features with deep adaptation networks,” in International conference on machine learning.   PMLR, 2015, pp. 97–105.
  • [14] B. Sun and K. Saenko, “Deep coral: Correlation alignment for deep domain adaptation,” in Computer Vision–ECCV 2016 Workshops: Amsterdam, The Netherlands, October 8-10 and 15-16, 2016, Proceedings, Part III 14.   Springer, 2016, pp. 443–450.
  • [15] M. Long, H. Zhu, J. Wang, and M. I. Jordan, “Deep transfer learning with joint adaptation networks,” in International conference on machine learning.   PMLR, 2017, pp. 2208–2217.
  • [16] Z. Pei, Z. Cao, M. Long, and J. Wang, “Multi-adversarial domain adaptation,” in Proceedings of the AAAI conference on artificial intelligence, vol. 32, no. 1, 2018.
  • [17] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell, “Adversarial discriminative domain adaptation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 7167–7176.
  • [18] J. Hoffman, E. Tzeng, T. Park, J.-Y. Zhu, P. Isola, K. Saenko, A. Efros, and T. Darrell, “Cycada: Cycle-consistent adversarial domain adaptation,” in Proceedings of the 35th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, J. Dy and A. Krause, Eds., vol. 80.   PMLR, 10–15 Jul 2018, pp. 1989–1998. [Online]. Available: https://proceedings.mlr.press/v80/hoffman18a.html
  • [19] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2223–2232.
  • [20] S. Li, C. H. Liu, B. Xie, L. Su, Z. Ding, and G. Huang, “Joint adversarial domain adaptation,” in Proceedings of the 27th ACM International Conference on Multimedia, 2019, pp. 729–737.
  • [21] Y. Pan, T. Yao, Y. Li, Y. Wang, C.-W. Ngo, and T. Mei, “Transferrable prototypical networks for unsupervised domain adaptation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 2239–2247.
  • [22] H. Tang, K. Chen, and K. Jia, “Unsupervised domain adaptation via structurally regularized deep clustering,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 8725–8735.
  • [23] L. Chen, Z. Gan, Y. Cheng, L. Li, L. Carin, and J. Liu, “Graph optimal transport for cross-domain alignment,” in International Conference on Machine Learning.   PMLR, 2020, pp. 1542–1553.
  • [24] R. Xu, P. Liu, L. Wang, C. Chen, and J. Wang, “Reliable weighted optimal transport for unsupervised domain adaptation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 4394–4403.
  • [25] B. Chidlovskii, S. Clinchant, and G. Csurka, “Domain adaptation in the absence of source domain data,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 451–460.
  • [26] J. Liang, R. He, Z. Sun, and T. Tan, “Distant supervised centroid shift: A simple and efficient approach to visual domain adaptation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 2975–2984.
  • [27] A. Krause, P. Perona, and R. Gomes, “Discriminative clustering by regularized information maximization,” Advances in neural information processing systems, vol. 23, 2010.
  • [28] Y. Shi and F. Sha, “Information-theoretical learning of discriminative clusters for unsupervised domain adaptation,” in Proceedings of the 29th International Coference on International Conference on Machine Learning, 2012, pp. 1275–1282.
  • [29] W. Hu, T. Miyato, S. Tokui, E. Matsumoto, and M. Sugiyama, “Learning discrete representations via information maximizing self-augmented training,” in International conference on machine learning.   PMLR, 2017, pp. 1558–1567.
  • [30] N. Ding, Y. Xu, Y. Tang, C. Xu, Y. Wang, and D. Tao, “Source-free domain adaptation via distribution estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 7212–7222.
  • [31] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Understanding deep learning requires rethinking generalization,” in International Conference on Learning Representations, 2017. [Online]. Available: https://openreview.net/forum?id=Sy8gdB9xx
  • [32] D. Arpit, S. Jastrzkebski, N. Ballas, D. Krueger, E. Bengio, M. S. Kanwal, T. Maharaj, A. Fischer, A. Courville, Y. Bengio et al., “A closer look at memorization in deep networks,” in International conference on machine learning.   PMLR, 2017, pp. 233–242.
  • [33] L. Yi, G. Xu, P. Xu, J. Li, R. Pu, C. Ling, I. McLeod, and B. Wang, “When source-free domain adaptation meets learning with noisy labels,” in The Eleventh International Conference on Learning Representations, 2023. [Online]. Available: https://openreview.net/forum?id=u2Pd6x794I
  • [34] D. Endres and J. Schindelin, “A new metric for probability distributions,” IEEE Transactions on Information Theory, vol. 49, no. 7, pp. 1858–1860, 2003.
  • [35] X. Ma, H. Huang, Y. Wang, S. Romano, S. Erfani, and J. Bailey, “Normalized loss functions for deep learning with noisy labels,” in ICML, 2020.
  • [36] Y. Bengio, J. Louradour, R. Collobert, and J. Weston, “Curriculum learning,” in Proceedings of the 26th annual international conference on machine learning, 2009, pp. 41–48.
  • [37] M. Chen, S. Zhao, H. Liu, and D. Cai, “Adversarial-learned loss for domain adaptation,” in Proceedings of the AAAI conference on artificial intelligence, vol. 34, no. 04, 2020, pp. 3521–3528.
  • [38] S. Cui, S. Wang, J. Zhuo, C. Su, Q. Huang, and Q. Tian, “Gradually vanishing bridge for adversarial domain adaptation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 12 455–12 464.
  • [39] H. Xia, H. Zhao, and Z. Ding, “Adaptive adversarial network for source-free domain adaptation,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 9010–9019.
  • [40] Y. M. Asano, C. Rupprecht, and A. Vedaldi, “Self-labelling via simultaneous clustering and representation learning,” in International Conference on Learning Representations, 2020. [Online]. Available: https://openreview.net/forum?id=Hyx-jyBFPr
  • [41] H. Zhang, Y. Zhang, K. Jia, and L. Zhang, “Unsupervised domain adaptation of black-box source models,” arXiv preprint arXiv:2101.02839, 2021.
  • [42] J. Liang, D. Hu, Y. Wang, R. He, and J. Feng, “Source data-absent unsupervised domain adaptation through hypothesis transfer and labeling transfer,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 11, pp. 8602–8617, 2021.
  • [43] K. Saenko, B. Kulis, M. Fritz, and T. Darrell, “Adapting visual category models to new domains,” in Computer Vision–ECCV 2010: 11th European Conference on Computer Vision, Heraklion, Crete, Greece, September 5-11, 2010, Proceedings, Part IV 11.   Springer, 2010, pp. 213–226.
  • [44] H. Venkateswara, J. Eusebio, S. Chakraborty, and S. Panchanathan, “Deep hashing network for unsupervised domain adaptation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 5018–5027.
  • [45] X. Peng, B. Usman, N. Kaushik, D. Wang, J. Hoffman, and K. Saenko, “Visda: A synthetic-to-real benchmark for visual domain adaptation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 2021–2026.
  • [46] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13.   Springer, 2014, pp. 740–755.
  • [47] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [48] J. Na, H. Jung, H. J. Chang, and W. Hwang, “Fixbi: Bridging domain spaces for unsupervised domain adaptation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 1094–1103.
  • [49] J. Huang, D. Guan, A. Xiao, and S. Lu, “Model adaptation: Historical contrastive learning for unsupervised domain adaptation without source data,” Advances in Neural Information Processing Systems, vol. 34, pp. 3635–3649, 2021.