
Weak Adaptation Learning:
Addressing Cross-domain Data Insufficiency with Weak Annotator

Shichao Xu*, Lixu Wang*, Yixuan Wang, Qi Zhu (*These authors contributed equally to this work.)
Northwestern University, Evanston, USA
{shichaoxu2023, lixuwang2025, yixuanwang2024}@u.northwestern.edu, qzhu@northwestern.edu
Abstract

Data quantity and quality are crucial factors for data-driven learning methods. In some target problem domains, there are not many data samples available, which could significantly hinder the learning process. While data from similar domains may be leveraged to help through domain adaptation, obtaining high-quality labeled data for those source domains themselves could be difficult or costly. To address such data insufficiency challenges for classification problems in a target domain, we propose a weak adaptation learning (WAL) approach that leverages unlabeled data from a similar source domain, a low-cost weak annotator that produces labels based on task-specific heuristics, labeling rules, or other methods (albeit with some inaccuracy), and a small amount of labeled data in the target domain. Our approach first conducts a theoretical analysis on the error bound of the trained classifier with respect to the data quantity and the performance of the weak annotator, and then introduces a multi-stage weak adaptation learning method to learn an accurate classifier by lowering the error bound. Our experiments demonstrate the effectiveness of our approach in learning an accurate classifier with limited labeled data in the target domain and unlabeled data in the source domain.

1 Introduction

Machine Learning (ML) techniques, especially those based on deep neural networks, have shown great promise in many applications, to a large extent due to their ability to learn and memorize the knowledge embedded in high-quality training data [12]. Having a large number of data samples with accurate labels could enable effective supervised learning methods for improving ML model performance. However, it may be difficult to collect many data samples in some problem domains or scenarios, such as for the training of autonomous vehicles during extreme weather (e.g., fog, snow, hail) and natural disasters (e.g., mudflow), or for search and rescue robots during forest fires and earthquakes. One possible solution to such data unavailability is to use data from other similar domains to train the target domain model and then fine-tune it with the limited target domain data, i.e., through domain adaptation. Taking the aforementioned cases as examples, while there may not be much data collected in hail, we could collect data on days with heavy rain; while it may be difficult to find images during earthquakes for large parts of America, we could collect images in Japan, where earthquakes occur more often, albeit in a different environment. However, obtaining a large amount of high-quality labeled data in these source domains could still be challenging and costly.

To address the above data insufficiency challenges across domains, we consider leveraging low-cost weak annotators that can automatically generate a large quantity of labeled data based on certain labeling rules/functions, task-specific heuristics, or other methods (which may be inaccurate to some degree). More specifically, our approach considers the following setting for classification problems: There is a small amount of data samples with accurate labels collected for the target domain, which is called target domain data or target data in this paper for simplicity. There is also a large amount of unlabeled data that can be acquired from a similar but different source domain (i.e., there exists domain discrepancy), which is called source (domain) data in this paper. Finally, there is a weak annotator that can produce weak (possibly inaccurate) labels on data samples. Our objective is to learn an accurate classifier for the target domain based on the labeled target data, the initially-unlabeled source data, and the weak annotator.

The problem we are considering here is related to, but different from, Semi-Supervised Learning (SSL) [46, 9, 27] and Unsupervised Domain Adaptation (UDA) [28, 8, 59, 7]. In the setting of SSL, the available training data consists of two parts – one has accurate labels while the other is unlabeled, and the two parts are drawn from the same distribution in terms of training features. This is different from our problem, where there exists domain discrepancy across the source and target domains. The objective of UDA is to adapt a model to perform well in the target domain based on labeled data in the source domain and unlabeled data in the target domain. This is again very different from our problem, where the source domain data is initially unlabeled and assigned inaccurate labels by a weak annotator, while the target domain data has labels but its quantity is small. Another related field is Positive-unlabeled Learning (PuL) [23, 5], an approach for sample selection. The training data of PuL also consists of two parts – positive and negative data, and the task is to learn a binary classifier to filter out samples that are similar to the positive data from a large amount of negative data. However, current PuL approaches usually conduct experiments on a single dataset rather than on multiple domains with feature discrepancy.

To solve our target problem, we first develop a theoretical analysis on the error bound of a trained classifier with respect to the data quantity and the weak annotator performance. We then propose a Weak Adaptation Learning (WAL) method to learn an accurate classifier by lowering the error bound. The main idea of WAL is to obtain a cross-domain representation for both source domain and target domain data, and then use the labeled data to estimate the classification error/distance between the weak annotator and the ideally optimal classifier in the target domain. Next, all the data is re-labeled based on such estimation of weak annotator classification error. Finally, the newly-relabeled data is used to learn a better classifier in the target domain.

Our work makes the following contributions:

  • We address the challenge of data insufficiency in domain adaptation with a novel weak adaptation learning approach that leverages unlabeled source domain data, a limited number of labeled target domain data, and a weak annotator.

  • Our approach includes a theoretical analysis on the error bound of the trained classifier and a multi-stage WAL method that improves the classifier accuracy by lowering such error bound.

  • We compare our approach with various baselines in experiments with domain discrepancy on several digit datasets and the VisDA-C dataset, and study the case without domain discrepancy on the CIFAR-10 dataset. We also conduct ablation studies on the impact of the weak annotator accuracy and the quantity of labeled data samples to further validate our ideas.

2 Related Work

We introduce related work on weakly- and semi-supervised learning and on the importance of sample quantity here. More related work on domain adaptation can be found in the supplementary materials.

2.1 Weakly- and Semi-Supervised Learning

Weakly Supervised Learning is a broad concept that covers multiple problem settings [65]. The problem we consider in this paper is related to the incomplete supervision setting that is often addressed by Semi-Supervised Learning (SSL) approaches. Standard SSL solves the problem of training a model with a small amount of labeled data and a large amount of unlabeled data. Some of the widely-applied methods [46, 9, 43, 2] assign pseudo labels to unlabeled samples and then perform supervised learning, and there are works that address the noise in the labels of those samples [37, 11, 29]. Our target problem is related to SSL with inaccurate supervision, but is different since we consider the feature discrepancy between the (unlabeled) source data and the (labeled) target data – a case that occurs often in practice but has not been sufficiently addressed.

Positive-unlabeled Learning (PuL) is usually regarded as a sub-problem of SSL. Its goal is to learn a binary classifier that distinguishes positive and negative samples from a large amount of unlabeled data and a few positive samples. Several works [23, 5] achieve strong performance at selecting samples that are similar to the positive data, and there are also works using samples selected by PuL to perform other tasks [62, 30].

2.2 Importance of Sample Quantity

The training of machine learning models, especially deep neural networks, often requires a large amount of data samples. However, in many practical scenarios, there is not sufficient training data to feed the learning process, which degrades the model performance sharply [52, 19, 61]. Many approaches have been proposed to make up for the lack of training samples, e.g., data re-sampling [56], data augmentation [44], metric learning, and meta learning [3, 4, 54, 57]. There are also works [39, 1, 3, 58] conducting theoretical analysis on the relation between training data quantity and model performance. These analyses usually take the form of bounding the prediction error of the models and provide valuable information on how the quantity of training data affects the model performance. In our work, we also perform a theoretical analysis on the error bound of the trained model, with respect to not only the data quantity but also the performance of the weak annotator.

3 Theoretical Analysis

3.1 Problem Definition and Formulation

We consider the task of classification, where the goal is to predict labels for samples in the target domain. Two types of supporting data can be accessed for training the model – source domain data and target domain data. The source domain data samples are initially unlabeled and come from a joint probability distribution $\mathbb{Q}^{s}$. They can be labeled by a weak annotator $\mathbf{h}^{w}$ (which may be inaccurate) and denoted as $D_{s}=\{(\mathbf{x}_{s},y_{s})_{i}\}_{i=1}^{N_{s}}$, where $N_{s}$ is the number of source data samples. The target domain data $D_{t}=\{(\mathbf{x}_{t},y_{t})_{i}\}_{i=1}^{N_{t}}$ consists of $N_{t}$ samples collected from the target distribution $\mathbb{Q}^{t}$. Note that $\mathbb{Q}^{t}$ may be different from $\mathbb{Q}^{s}$. We use $\mathbb{Q}^{s}_{X}$, $\mathbb{Q}^{s}_{Y}$ and $\mathbb{Q}^{t}_{X}$, $\mathbb{Q}^{t}_{Y}$ to represent the marginal distributions of the source and target domains, respectively. Moreover, as stated before, we consider the case where there is only a small amount of target domain data, i.e., $N_{t}\ll N_{s}$.

Our goal is to learn an accurate classifier for the target domain. The classifier is initialized from a parameter distribution $\mathcal{H}$, which denotes the hypothesis parameter space of all possible classifiers.

In the following analysis, we will define the classification risk of a classifier and then derive its bound. According to the PAC-Bayesian framework [36, 10], the expected classification risk of a classifier drawn from a distribution $\mathcal{Q}$ that depends on the training data can be strictly bounded. Let $\mathbf{h}_{\Theta}$ denote a classifier learned from the training data, with its parameter $\Theta$ drawn from $\mathcal{Q}$. We consider the prior parameter distribution $\mathcal{H}$ over the hypothesis space to be independent of the training data. Given a $\delta$, with probability $\geq 1-\delta$ over a training data set of size $m$, the expected error of $\mathbf{h}_{\Theta}$ can be bounded as follows [35]:

L(\mathbf{h}_{\Theta}) \leq \widehat{L}(\mathbf{h}_{\Theta}) + \sqrt{\widehat{L}(\mathbf{h}_{\Theta})\cdot\Omega} + \Omega, \qquad \Omega = \frac{2\left(KL(\mathcal{Q}\|\mathcal{H}) + \ln\frac{m}{\delta}\right)}{m-1}    (1)

Here $L(\mathbf{h}_{\Theta})$ is the expected error of $\mathbf{h}$ over parameter $\Theta$, and $\widehat{L}(\mathbf{h}_{\Theta})$ is the empirical error computed from the training set ($\widehat{L}(\mathbf{h}_{\Theta})=\frac{1}{m}\sum_{i=1}^{m}\mathcal{L}(\mathbf{x}_{i},y_{i})$, where $\mathcal{L}$ denotes the loss of a single training sample). In Eq. (1), $KL(\mathcal{Q}\|\mathcal{H})$ represents the Kullback-Leibler (KL) divergence between the parameter distributions $\mathcal{Q}$ and $\mathcal{H}$. For any two distributions $p, q$, the KL divergence takes the form $KL(p\|q)=-\mathbb{E}[p\cdot\ln\frac{q}{p}]$. In most cases of mini-batch training, the training loss $\widehat{L}(\mathbf{h}_{\Theta})$ is much smaller than $\Omega$, and thus we can obtain a further bound as follows [39]:

L(\mathbf{h}_{\Theta}) \leq \widehat{L}(\mathbf{h}_{\Theta}) + 4\sqrt{\frac{KL(\mathcal{Q}\|\mathcal{H}) + \ln\frac{2m}{\delta}}{m}}    (2)
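To make the role of the sample size $m$ concrete, the short Python snippet below evaluates the right-hand side of Eq. (2) for a few illustrative, made-up values of the empirical loss and the KL term; it is only a numerical illustration of the bound, not part of the method.

```python
import math

def pac_bayes_bound(empirical_loss, kl, m, delta=0.05):
    """Right-hand side of Eq. (2): L_hat + 4 * sqrt((KL(Q||H) + ln(2m/delta)) / m)."""
    return empirical_loss + 4.0 * math.sqrt((kl + math.log(2 * m / delta)) / m)

# Made-up values: the complexity term shrinks as the training set size m grows.
for m in (100, 1_000, 10_000):
    print(m, round(pac_bayes_bound(empirical_loss=0.05, kl=5.0, m=m), 3))
# -> roughly 1.51, 0.55, 0.22
```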

Then, if we denote the model parameters of $\mathbf{h}$ before training, drawn from $\mathcal{H}$, as $\Theta^{\mathbf{p}}$, the KL divergence can be written as $KL(\mathcal{Q}\|\mathcal{H})=-\mathbb{E}[\Theta\cdot(\ln\Theta^{\mathbf{p}}-\ln\Theta)]$. As mentioned above, $\mathbf{h}_{\Theta}$ is trained on the training data set starting from $\mathbf{h}_{\Theta^{\mathbf{p}}}$, and we consider the training to be optimized by a gradient-based method. Thus, we can write $\Theta=\Theta^{\mathbf{p}}+\nabla(\widehat{L}(\mathbf{h}_{\Theta^{\mathbf{p}}}))$, where we omit the learning rate to simplify the formula.

The PAC-Bayesian error bound is valid for any parameter distribution $\mathcal{H}$ that is independent of the training data, and any method of optimizing $\Theta^{\mathbf{p}}$ that depends on the training set [39]. Therefore, to simplify the problem, we instantiate the bound by setting $\mathcal{H}$ to a Gaussian distribution with zero mean ($\mu_{\mathcal{H}}=0$) and variance $\sigma^{2}_{\mathcal{H}}$. This simplification is the same as in previous PAC-Bayesian works [39, 40]. We further assume that the parameter change of the overall model during training can also be regarded as conforming to an empirical Gaussian distribution. This Gaussian distribution is independent of the model parameters if we regard the parameter updates induced by gradient back-propagation as accumulated random perturbations, i.e., each training sample corresponds to a small perturbation [40]. We denote the mean and the variance induced by a single training sample as follows:

\mu \triangleq \mathbb{E}\left[\nabla_{\Theta^{\mathbf{p}}}\mathcal{L}(\mathbf{x},y)\right], \qquad \sigma^{2} \triangleq \mathbb{E}\left[\left(\nabla_{\Theta^{\mathbf{p}}}\mathcal{L}(\mathbf{x},y)-\mu\right)\left(\nabla_{\Theta^{\mathbf{p}}}\mathcal{L}(\mathbf{x},y)-\mu\right)^{T}\right]    (3)

Then, the KL divergence between any two Gaussian distributions $p\sim\mathcal{N}(\mu_{1},\sigma_{1}^{2})$ and $q\sim\mathcal{N}(\mu_{2},\sigma_{2}^{2})$ can be written as follows:

KL(p\|q) = \ln\frac{\sigma_{2}}{\sigma_{1}} + \frac{\sigma_{1}^{2}+(\mu_{1}-\mu_{2})^{2}}{2\sigma_{2}^{2}} - \frac{1}{2}    (4)
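As a quick sanity check of Eq. (4), the following sketch computes the KL divergence between two univariate Gaussians; the numbers are purely illustrative.

```python
import math

def kl_gaussian(mu1, sigma1, mu2, sigma2):
    """KL(p || q) for p ~ N(mu1, sigma1^2) and q ~ N(mu2, sigma2^2), following Eq. (4)."""
    return math.log(sigma2 / sigma1) + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2 * sigma2 ** 2) - 0.5

print(kl_gaussian(0.0, 1.0, 0.0, 1.0))  # identical distributions -> 0.0
print(kl_gaussian(0.5, 1.0, 0.0, 1.0))  # shifted mean -> 0.125
```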

Theorem 1. For a classifier parameter distribution $\mathcal{H}\sim\mathcal{N}(0,\sigma_{\mathcal{H}}^{2})$ that is independent of the training data of size $m$, and a posterior parameter distribution $\mathcal{Q}$ learned from the training data set, if we assume $\mathcal{Q}\sim\mathcal{N}(\mu_{\mathcal{Q}},\sigma_{\mathcal{Q}}^{2})$ and consider $\Theta^{\mathbf{p}}$, $\Theta$ as drawn from $\mathcal{H}$, $\mathcal{Q}$ respectively (with $\Theta=\Theta^{\mathbf{p}}+\nabla(\widehat{L}(\mathbf{h}_{\Theta^{\mathbf{p}}}))$), then the KL divergence between $\mathcal{Q}$ and $\mathcal{H}$ is bounded, with the symbols defined in Eq. (3), as follows:

KL(\mathcal{Q}\,\|\,\mathcal{H}) \leq \frac{\sigma^{2}/m + \mu^{2}}{2\sigma_{\mathcal{H}}^{2}}    (5)

The detailed proof of Theorem 1 is presented in our Supplementary Materials. With the above risk definition, the risk of $\mathbf{h}$ with respect to the target data distribution $\mathbb{Q}^{t}$ is

R^{t}(\mathbf{h}) = \mathbb{E}_{(\mathbf{x},y)\sim\mathbb{Q}^{t}}\,\mathcal{L}(\mathbf{h}(\mathbf{x}),y) = L(\mathbf{h}_{\Theta})_{\sim\mathbb{Q}^{t}}    (6)

Besides, we define the Classification Distance of two classifiers $\mathbf{h}_{1}$ and $\mathbf{h}_{2}$ under the same domain distribution $\mathbb{P}$ as

\mathcal{CD}_{\sim\mathbb{P}}(\mathbf{h}_{1},\mathbf{h}_{2}) = \mathbb{E}_{\mathbf{x}\sim\mathbb{P}}\,\mathcal{L}(\mathbf{h}_{1}(\mathbf{x}),\mathbf{h}_{2}(\mathbf{x}))    (7)

Moreover, the Discrepancy Distance of two domains is defined as in [33]: $\forall\mathbf{h}_{1},\mathbf{h}_{2}$, the discrepancy distance between the distributions $\mathbb{P},\mathbb{Q}$ of two domains is

\mathcal{DD}(\mathbb{P},\mathbb{Q}) = \sup_{\mathbf{h}_{1},\mathbf{h}_{2}\in\mathbb{H}} \left|\mathcal{CD}_{\sim\mathbb{P}}(\mathbf{h}_{1},\mathbf{h}_{2}) - \mathcal{CD}_{\sim\mathbb{Q}}(\mathbf{h}_{1},\mathbf{h}_{2})\right|    (8)
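Both distances can be estimated empirically from samples. The sketch below uses hypothetical toy threshold classifiers, an L1 loss, and replaces the supremum in Eq. (8) with a maximum over a small candidate set (which only lower-bounds the true discrepancy); it is an illustration of the definitions, not part of the proposed method.

```python
import numpy as np

def classification_distance(h1, h2, samples, loss):
    """Empirical estimate of CD_P(h1, h2) = E_{x~P}[L(h1(x), h2(x))] over the drawn samples."""
    return float(np.mean([loss(h1(x), h2(x)) for x in samples]))

def discrepancy_distance(hypotheses, p_samples, q_samples, loss):
    """Empirical estimate of DD(P, Q), with the sup taken over a finite candidate set."""
    return max(
        abs(classification_distance(h1, h2, p_samples, loss)
            - classification_distance(h1, h2, q_samples, loss))
        for h1 in hypotheses for h2 in hypotheses
    )

# Toy 1-D threshold classifiers and an L1 loss on two shifted Gaussian "domains".
loss = lambda a, b: abs(a - b)
hypotheses = [lambda x, t=t: float(x > t) for t in (-0.5, 0.0, 0.5)]
p_samples = np.random.normal(0.0, 1.0, 2000)
q_samples = np.random.normal(0.7, 1.0, 2000)
print(discrepancy_distance(hypotheses, p_samples, q_samples, loss))
```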

For further analysis, we also define two operators on a parameter distribution $\mathcal{H}$ (a small code sketch of these operators follows the list below):

  • $\oplus$: $\forall\mathbf{h}_{1},\mathbf{h}_{2}\in\mathcal{H}$ and $\forall\mathbf{x}\in\mathbb{P}$, a new classifier $\mathbf{h}_{3}=\mathbf{h}_{1}\oplus\mathbf{h}_{2}$ is obtained by applying the operator $\oplus$ to $\mathbf{h}_{1}$ and $\mathbf{h}_{2}$, with $\mathbf{h}_{3}(\mathbf{x})=\mathbf{h}_{1}(\mathbf{x})+\mathbf{h}_{2}(\mathbf{x})$.

  • $\ominus$: $\forall\mathbf{h}_{1},\mathbf{h}_{2}\in\mathcal{H}$ and $\forall\mathbf{x}\in\mathbb{P}$, a new classifier $\mathbf{h}_{3}=\mathbf{h}_{1}\ominus\mathbf{h}_{2}$ is obtained by applying the operator $\ominus$ to $\mathbf{h}_{1}$ and $\mathbf{h}_{2}$, with $\mathbf{h}_{3}(\mathbf{x})=\mathbf{h}_{1}(\mathbf{x})-\mathbf{h}_{2}(\mathbf{x})$.
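A small code sketch of these two operators, with hypothetical scalar-output classifiers, is given below; in the paper the operators are applied to classifiers whose outputs are label vectors, and the idea is the same elementwise.

```python
def oplus(h1, h2):
    """h3 = h1 (+) h2, with h3(x) = h1(x) + h2(x)."""
    return lambda x: h1(x) + h2(x)

def ominus(h1, h2):
    """h3 = h1 (-) h2, with h3(x) = h1(x) - h2(x)."""
    return lambda x: h1(x) - h2(x)

# Example: combining a predictor h with a correction model d gives (h (+) d)(x) = h(x) + d(x).
h = lambda x: 2.0 * x
d = lambda x: -0.5 * x
print(oplus(h, d)(4.0))  # 6.0
```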

3.2 Error Bound Analysis

Let $\mathbf{h}^{o_{s}}$ and $\mathbf{h}^{o_{t}}$ denote the ideal classifiers that perform optimally on the source data and target data, respectively:

\mathbf{h}^{o_{s}} = \arg\min_{\mathbf{h}\in\mathcal{Q}} R^{s}(\mathbf{h}), \qquad \mathbf{h}^{o_{t}} = \arg\min_{\mathbf{h}\in\mathcal{Q}} R^{t}(\mathbf{h})    (9)

In our approach, we design a classifier that learns the discrepancy between the weak annotator and the ground truth (details will be introduced in Section 4), and we denote it as $\mathbf{d}$, drawn from $\mathcal{Q}$. Thus, we can obtain a model that results from applying the aforementioned $\oplus$ operator to $\mathbf{h}$ and $\mathbf{d}$, i.e., $\mathbf{h}\oplus\mathbf{d}$. Here $\mathbf{h}$ is designed to approximate the weak labels. For the risk of $\mathbf{h}\oplus\mathbf{d}$, we can obtain the following relation:

Theorem 2. For the L1 (Mean Absolute Error [20]) loss, the L2 (Mean Squared Error [14]) loss, and their non-negative combinations (Huber Loss [63], Quantile Loss [24], etc.), the classification risk of the aforementioned $\mathbf{h}\oplus\mathbf{d}$ can be formulated as follows:

R^{t}(\mathbf{h}\oplus\mathbf{d}) = \mathbb{E}_{\mathbb{Q}^{t}_{X}}\mathcal{L}(\mathbf{h}\oplus\mathbf{d},\ \mathbf{h}^{w}\oplus\mathbf{h}^{o_{t}}\ominus\mathbf{h}^{w}) \leq \mathbb{E}_{\mathbb{Q}^{t}_{X}}\mathcal{L}(\mathbf{h},\mathbf{h}^{w}) + \mathbb{E}_{\mathbb{Q}^{t}_{X}}\mathcal{L}(\mathbf{d},\ \mathbf{h}^{o_{t}}\ominus\mathbf{h}^{w})    (10)

Please refer to our Supplementary Materials for the detailed proof of Theorem 2. If we consider that the training loss $\widehat{L}(\mathbf{h})$ (which equals the average loss over all training samples) is hardly influenced by the sample quantity, and that the same holds for the discrepancy between two domains [32], we can split the error bound of $\mathbf{h}\oplus\mathbf{d}$ into two parts: one part, denoted as $\Delta$, is not influenced by the sample quantity, while the other is related to the sample quantity. According to Eq. (41), these two parts can be written as follows (the detailed derivation of the inequalities, starting from Eq. (31), can be found in the Supplementary Materials):

R^{t}(\mathbf{h}\oplus\mathbf{d}) = \mathbb{E}_{\mathbb{Q}^{t}_{X}}\mathcal{L}(\mathbf{h}\oplus\mathbf{d},\ \mathbf{h}^{w}\oplus\mathbf{h}^{o_{t}}\ominus\mathbf{h}^{w})
  \leq \Delta + 4\sqrt{\frac{KL_{\mathbf{d}}}{N_{t}}} + 4\sqrt{\frac{KL_{\mathbf{h}}}{N_{s}}} + 12\sqrt{\frac{\ln\frac{2N_{t}}{\delta}}{N_{t}}} + 8\sqrt{\frac{\ln\frac{2N_{s}}{\delta}}{N_{s}}}    (11)
where \Delta = 2\widehat{L}_{t}(\mathbf{h}^{w}) + \widehat{L}_{t}(\mathbf{d}) + \widehat{L}_{s}(\mathbf{h}) + \widehat{L}_{s}(\mathbf{h}^{w}) + \mathcal{DD}(\mathbb{Q}^{t}_{X},\mathbb{Q}^{s}_{X})

Here $KL_{\mathbf{d}}$ and $KL_{\mathbf{h}}$ denote the KL divergences between the trained $\mathbf{d}$, $\mathbf{h}$ and $\mathcal{H}$, respectively. According to Theorem 1, these KL divergence terms are influenced by the training, and in particular by the sample quantity. We will discuss the insights obtained from this error bound in the next section, and then introduce our weak adaptation learning process that is inspired by those insights.

4 Weak Adaptation Learning Method

4.1 Observation from Error Analysis

Based on the error bound derived in Eq. (11), our approach focuses on the following directions to improve the classifier performance in the target domain:

  • Performance of the annotator ($2\widehat{L}_{t}(\mathbf{h}^{w})+\widehat{L}_{s}(\mathbf{h}^{w})$): The supervision provided by the weak annotator can guide the model to better target the given task. Ideally, we want $\mathbf{h}^{w}$ to produce more accurate labels for both source and target data, reducing $2\widehat{L}_{t}(\mathbf{h}^{w})$ and $\widehat{L}_{s}(\mathbf{h}^{w})$ simultaneously. In practice, though, we may only be able to make the annotator perform better on the source domain and cannot do much about the target domain.

  • Discrepancy between domains ($\mathcal{DD}(\mathbb{Q}^{t}_{X},\mathbb{Q}^{s}_{X})$): Designing losses that quantify the discrepancy between the source and target domains is well studied in Domain Adaptation. In our approach, we propose a novel inter-domain loss (called Classified-MMD) to minimize $\mathcal{DD}(\mathbb{Q}^{t}_{X},\mathbb{Q}^{s}_{X})$, as introduced later.

  • Quantity of source and target samples ($N_{s}, N_{t}$): First, the learning of $\mathbf{d}$ needs the supervision of the ground truth, and thus we can only use the labeled target data to train $\mathbf{d}$. Then, in our method, $\mathbf{h}$ is designed to approximate the weak annotator, and therefore it may seem sufficient to train $\mathbf{h}$ with the source data only. However, to further reduce $KL_{\mathbf{h}}$ according to Theorem 1, we also use target samples to train $\mathbf{h}$, which increases the sample size of the training data. Moreover, since the sample quantity of source data is much larger than that of target data (i.e., $N_{t}\ll N_{s}$), $\sqrt{KL_{\mathbf{d}}/N_{t}}$ in Eq. (11) dominates over $\sqrt{KL_{\mathbf{h}}/N_{s}}$, and in the case of $\delta\leq 2/e$, $12\sqrt{\ln\frac{2N_{t}}{\delta}/N_{t}}$ also strictly dominates over $8\sqrt{\ln\frac{2N_{s}}{\delta}/N_{s}}$ (see the numerical sketch after this list). As a result, the terms influenced by the few target samples dominate the overall error risk. Therefore, directly applying $\mathbf{h}\oplus\mathbf{d}$ to the target domain will still be impacted by the insufficient samples. However, note that $\mathbf{h}\oplus\mathbf{d}$ can produce more accurate labels for the source data than the weak annotator. Therefore, we add a final step in our learning process that utilizes the re-labeled source data and conducts supervised learning with such augmentation.
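The dominance argument in the last bullet can be checked numerically. The sketch below plugs illustrative, made-up values of the KL terms, together with $N_t = 1{,}000$ and $N_s = 15{,}000$, into the sample-dependent terms of Eq. (11).

```python
import math

def target_terms(kl_d, n_t, delta=0.05):
    """4*sqrt(KL_d / N_t) + 12*sqrt(ln(2*N_t/delta) / N_t) from Eq. (11)."""
    return 4 * math.sqrt(kl_d / n_t) + 12 * math.sqrt(math.log(2 * n_t / delta) / n_t)

def source_terms(kl_h, n_s, delta=0.05):
    """4*sqrt(KL_h / N_s) + 8*sqrt(ln(2*N_s/delta) / N_s) from Eq. (11)."""
    return 4 * math.sqrt(kl_h / n_s) + 8 * math.sqrt(math.log(2 * n_s / delta) / n_s)

# Made-up KL values of the same magnitude; the target-side terms clearly dominate.
print(round(target_terms(kl_d=5.0, n_t=1_000), 2))   # ~1.52
print(round(source_terms(kl_h=5.0, n_s=15_000), 2))  # ~0.31
```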

4.2 Learning Process

Figure 1: Overview of the Weak Adaptation Learning (WAL) process. The designed network architecture is divided into three components $\Phi_{0},\Phi_{1},\Phi_{2}$ and the algorithm has four stages. First, we use a combined loss function to learn a cross-domain representation in $\Phi_{0}$ for both source and target data samples. Then, in Stage 2, $\Phi_{2}$ estimates the classification distance between the weak annotator and the ideally optimal classifier in the target domain. A new re-labeled dataset is calculated in Stage 3, and then used in Stage 4 to learn the desired classifier.

In this section, we present the detailed process of our weak adaptation learning (WAL) method, which is designed based on the observations from the above error bound analysis. An overview of our WAL process is shown in Figure 1. The designed network consists of three parts ($\Phi_{0},\Phi_{1},\Phi_{2}$). $\Phi_{0}$ can be seen as a shared feature network for both source and target data, using typical classification networks such as VGG, ResNet, etc. $\Phi_{1}$ consists of three fully-connected layers that follow the output of $\Phi_{0}$, and we denote the combination of $\Phi_{0}$ and $\Phi_{1}$ as $F_{1}$. $\Phi_{2}$ consists of two fully-connected layers that follow the output of $\Phi_{0}$, and the combination of $\Phi_{0}$ and $\Phi_{2}$ is denoted as $F_{2}$. The detailed network architecture is shown in the Supplementary Materials. The workflow of our method is shown in Algorithm 1.
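A minimal PyTorch sketch of this three-part split is shown below. The backbone choice, hidden sizes, and number of classes are placeholders (the actual architecture is given in the Supplementary Materials); only the $\Phi_{0}$/$\Phi_{1}$/$\Phi_{2}$ structure and the input of $\Phi_{2}$ follow the description above.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class WALNet(nn.Module):
    """Sketch of the Phi0/Phi1/Phi2 split; layer sizes are assumptions, not the paper's exact setting."""
    def __init__(self, num_classes=10, feat_dim=512, hidden=256):
        super().__init__()
        self.phi0 = resnet18(num_classes=feat_dim)           # shared feature network Phi0
        self.phi1 = nn.Sequential(                            # Phi1: three FC layers, F1 = Phi1 o Phi0
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes))
        self.phi2 = nn.Sequential(                            # Phi2: two FC layers on [h_w(x); Phi0(x)]
            nn.Linear(feat_dim + num_classes, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes))

    def f1(self, x):
        return self.phi1(self.phi0(x))

    def f2(self, x, weak_label):
        return self.phi2(torch.cat([weak_label, self.phi0(x)], dim=1))
```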

Algorithm 1 The workflow of Weak Adaptation Learning.
1:  Initialize the parameters of network components $\Phi_{0},\Phi_{1},\Phi_{2}$.
2:  Obtain dataset $D$ from the source and target data with the help of the weak annotator $\mathbf{h}^{w}$.
3:  Train $F_{1}=\Phi_{1}\circ\Phi_{0}$ using $D$, with the loss function $\mathcal{L}=\mathcal{L}_{KL}+\alpha\mathcal{L}_{cmmd}$.
4:  Fix the parameters of $\Phi_{1}$ and use $F_{2}=\Phi_{2}\circ\Phi_{0}$ to fit the distance between the optimal classifier for the target data $\mathbf{h}^{o_{t}}$ and the weak annotator $\mathbf{h}^{w}$ on the target data.
5:  Compute a new dataset using both source and target data, with new labels $y^{new}=\mathbf{h}^{w}(\mathbf{x})+\Phi_{2}(\mathbf{h}^{w}(\mathbf{x}),\Phi_{0}(\mathbf{x}))$.
6:  Re-initialize the parameters of $\Phi_{0},\Phi_{1},\Phi_{2}$.
7:  Fix $\Phi_{2}$ and train $F_{1}$ using the new dataset, with the loss function $\mathcal{L}=\mathcal{L}_{KL}+\alpha\mathcal{L}_{cmmd}$.
8:  Output classifier $F_{1}$.

Stage 1: The first goal is to obtain a common representation for both the source and target data, which helps us encode the inputs while mitigating the domain discrepancy in the feature representation. We gather all the unlabeled source data and the target data (without their labels) and use the weak annotator $\mathbf{h}^{w}$ to assign a label to each data sample $\mathbf{x}_{i}$, i.e., $y^{w}_{i}=\mathbf{h}^{w}(\mathbf{x}_{i})$. We denote the dataset obtained in this way as $D=\{(\mathbf{x},y^{w})_{i}\}_{i=1}^{N_{s}+N_{t}}$. Then we fix $\Phi_{2}$ and only consider the remaining part of the network, $F_{1}=\Phi_{1}\circ\Phi_{0}$. It is trained by standard supervised learning on the dataset $D$ for ${ep}_{1}$ training epochs, using the following loss function:

\mathcal{L} = \mathcal{L}_{KL} + \alpha\mathcal{L}_{cmmd}    (12)

This loss function has two terms, and the hyper-parameter $\alpha$ is a scaling factor that balances their scales (we set it to 0.0001 in our experiments). The first term $\mathcal{L}_{KL}$ is the Kullback-Leibler (KL) divergence loss, stated as follows:

\mathcal{L}_{KL} = KL(y_{pre}^{1}\,\|\,y^{w}) = KL(\Phi_{1}\circ\Phi_{0}(\mathbf{x})\,\|\,\mathbf{h}^{w}(\mathbf{x}))    (13)

where $y_{pre}^{1}$ is the output prediction of $F_{1}$ and $y^{w}$ is the corresponding weak label produced by the weak annotator $\mathbf{h}^{w}$. The second term $\mathcal{L}_{cmmd}$ aims to mitigate the domain discrepancy between the source and target domains at the feature representation level of the neural network. Based on the basic MMD loss introduced by [55], we further extend it to a version that uses data labels. We call this loss function the Classified-MMD loss (corresponding to the subscript $cmmd$), which is defined as:

\mathcal{L}_{cmmd} = \frac{1}{M}\sum_{i=1}^{M}\left\| \frac{1}{|D_{X}^{(S,i)}|}\sum_{\mathbf{x}_{s}\in D_{X}^{(S,i)}} F_{1}(\mathbf{x}_{s}) \;-\; \frac{1}{|D_{X}^{(T,i)}|}\sum_{\mathbf{x}_{t}\in D_{X}^{(T,i)}} F_{1}(\mathbf{x}_{t}) \right\|    (14)

where $M$ is the number of classes, $D_{X}$ is the data from the produced dataset $D$ without labels, and $D_{X}^{(S,i)}$ is the source data selected from $D_{X}$ with $\arg\max(y^{w})=i$ (and analogously $D_{X}^{(T,i)}$ for the target data). Then, we utilize the target data with its accurate labels to continue training the network component $F_{1}$ under the loss function $\mathcal{L}_{KL}$ for ${ep}_{2}$ training epochs, which further fine-tunes the learned features with the accurate labels of the target data.
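A sketch of the Stage 1 objective is given below. The function name, the per-class sample grouping (`src_by_class` / `tgt_by_class`, keyed by the argmax of the weak labels), and the use of softmax outputs inside the CMMD term are assumptions made for illustration; this is one plausible realization of Eqs. (12)-(14), not the exact implementation.

```python
import torch
import torch.nn.functional as F

def stage1_loss(f1, x, y_weak, src_by_class, tgt_by_class, alpha=1e-4):
    """One plausible realization of Eq. (12): KL to weak labels plus the Classified-MMD term."""
    # KL(y_pred || y_weak) as in Eq. (13); f1 outputs logits, weak labels are class distributions.
    pred = F.softmax(f1(x), dim=1)
    loss_kl = (pred * (pred.clamp_min(1e-8).log()
                       - y_weak.clamp_min(1e-8).log())).sum(dim=1).mean()

    # Classified-MMD (Eq. 14): per-class distance of mean outputs between the two domains.
    loss_cmmd = x.new_zeros(())
    matched = 0
    for c, xs in src_by_class.items():
        if c not in tgt_by_class:
            continue
        loss_cmmd = loss_cmmd + torch.norm(f1(xs).mean(dim=0) - f1(tgt_by_class[c]).mean(dim=0))
        matched += 1
    if matched:
        loss_cmmd = loss_cmmd / matched
    return loss_kl + alpha * loss_cmmd
```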

Stage 2: After finishing the training in Stage 1, the next step is to estimate the distance between the optimal classifier for the target data $\mathbf{h}^{o_{t}}$ and the weak annotator $\mathbf{h}^{w}$. We estimate this distance with the available target data and its accurate labels. We adopt the parameters trained in Stage 1 and train the network component $F_{2}=\Phi_{2}\circ\Phi_{0}$ using the target data $D_{t}$. An input data sample $\mathbf{x}$ is fed into both $\Phi_{0}$ and the weak annotator, and $\Phi_{2}$ then takes the output feature $\Phi_{0}(\mathbf{x})$ and $\mathbf{h}^{w}(\mathbf{x})$, concatenated, as its input. For a data sample $(\mathbf{x}_{t},y_{t})$ in the target dataset $D_{t}$, the learning of $F_{2}$ uses the following classifier discrepancy loss function:

\mathcal{L}_{MSE} = \left\| \Phi_{2}(\mathbf{h}^{w}(\mathbf{x}_{t}),\Phi_{0}(\mathbf{x}_{t})) - \left(y_{t}-\mathbf{h}^{w}(\mathbf{x}_{t})\right) \right\|^{2}    (15)

The network is trained for ${ep}_{3}$ training epochs.
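A sketch of one Stage 2 update is given below, reusing the hypothetical `WALNet` from the earlier sketch; the optimizer is assumed to cover the parameters of $\Phi_{0}$ and $\Phi_{2}$, and `weak_annotator` maps a batch of inputs to (soft) weak labels.

```python
import torch

def stage2_step(model, weak_annotator, x_t, y_t, optimizer):
    """One update of Eq. (15): fit Phi2(h_w(x), Phi0(x)) to the residual y_t - h_w(x) on target data."""
    with torch.no_grad():
        y_weak = weak_annotator(x_t)
    residual_pred = model.f2(x_t, y_weak)
    loss = ((residual_pred - (y_t - y_weak)) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```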

Stage 3: The third step is to generate a new dataset $D_{new}$ using the network $F_{2}$ obtained above. Specifically, we collect data $\mathbf{x}$ from both the source data and the target data, and re-label them based on the weak annotator and the $F_{2}$ obtained in the previous steps:

D_{new} = \left\{ (\mathbf{x},y^{new}) \;\middle|\; \mathbf{x}\in D_{X},\; y^{new}=\mathbf{h}^{w}(\mathbf{x})+\Phi_{2}(\mathbf{h}^{w}(\mathbf{x}),\Phi_{0}(\mathbf{x})) \right\}    (16)
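Stage 3 then amounts to a single forward pass per sample; a sketch (again using the hypothetical `WALNet`) is shown below.

```python
import torch

def relabel(model, weak_annotator, x):
    """Eq. (16): y_new = h_w(x) + Phi2(h_w(x), Phi0(x)), computed with the F2 trained in Stage 2."""
    with torch.no_grad():
        y_weak = weak_annotator(x)
        return y_weak + model.f2(x, y_weak)
```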

Stage 4: In the last step, we focus on $F_{1}=\Phi_{1}\circ\Phi_{0}$ again. We fix the parameters of the network component $\Phi_{2}$ and train $F_{1}$ using the new dataset $D_{new}$ obtained in Stage 3. To avoid introducing feature bias from the previous steps, we discard all previous network weights and re-initialize the whole network before training. The training lasts for ${ep}_{4}$ epochs, and the loss function for this step is $\mathcal{L}=\mathcal{L}_{KL}+\alpha\mathcal{L}_{cmmd}$, the same as in Stage 1. Finally, we obtain the final model $F_{1}$ as the desired classifier.

To sum up, in Stage 1 we learn the model $\mathbf{h}$ with the help of the weak annotator to decrease the empirical loss $\widehat{L}_{s}(\mathbf{h})$, while the CMMD loss reduces the term $\mathcal{DD}(\mathbb{Q}^{t}_{X},\mathbb{Q}^{s}_{X})$. Stage 2 uses a new classifier $\mathbf{d}$ to learn the classification distance corresponding to the term $\widehat{L}_{t}(\mathbf{d})$. Stage 3 uses the annotator and the learned $\mathbf{d}$ to produce more accurate labels than those given by the annotator alone. Then, in Stage 4, the model is trained on the re-labeled data, so that both $\widehat{L}_{s}(\mathbf{h})$ and $\mathcal{DD}(\mathbb{Q}^{t}_{X},\mathbb{Q}^{s}_{X})$ are further decreased. The settings of the hyper-parameters used in this section can be found in the Supplementary Materials.

5 Experimental Results

The supplementary materials can be found at https://arxiv.org/abs/2102.07358

5.1 Dataset

The experiments are conducted on three application scenarios: digit recognition with domain discrepancy (the SVHN [38], MNIST [6], and USPS [18] digit datasets), object recognition with domain discrepancy (VisDA-C [42]), and object recognition without domain discrepancy (CIFAR-10 [25]). Due to space limitations, we introduce the details of these datasets in the Supplementary Materials.

5.2 Training Setting

All experiments are conducted on a server running Ubuntu 18.04 LTS with NVIDIA TITAN RTX GPU cards. The implementation is based on the PyTorch framework. The hyper-parameter $\alpha$ mentioned above is set to 1e-4. We use the standard Adam optimizer [22] for optimization. The network architectures, the learning rate for each part of the network components, the training epoch settings, and other hyper-parameters are specified in the Supplementary Materials. We obtain weak annotators with different levels of performance by applying early stopping during training. The implementation details of the weak annotators can also be found in the Supplementary Materials.

5.3 Baseline Experiments Setting

We conduct comparison experiments with the following baselines. Baseline $B_{wa}$ is the performance of the weak annotator chosen in the experiments, evaluated in the target domain. Baseline $B_{t}$ trains $F_{1}$ only with the target data. Baseline $B_{f_{1}}$ is a fine-tuning result: it takes the same model as $F_{1}$, first trains it with source domain data and weak labels generated by the weak annotator, and then uses target domain data to fine-tune the last three layers. Baseline $B_{f_{2}}$ is also a fine-tuning result; the difference is that, instead of fine-tuning only the last three layers, it trains all network parameters.

As introduced before, our problem is related to Semi-Supervised Learning (SSL) and Semi-Supervised Domain Adaptation (SSDA). For SSL, although we could replace the unlabeled data with samples drawn from another domain instead of the target domain, we cannot find a good way to incorporate the weak annotator into SSL methods for a fair comparison with our approach. For SSDA, we were able to extend it to our setting for comparison. Specifically, we add 1,000 unlabeled target samples (in addition to 1,000 labeled target samples; this setting is adjusted accordingly in digit recognition to keep the settings consistent) to meet the semi-supervised requirement, and we apply the weak annotator to produce weak labels instead of accurate ones for the source data. We compare our approach with the following SSDA baselines: FAN [21], MME [48], ENT [13], and S+T [4, 45]. Note that, to the best of our knowledge, there is no previous work with exactly the same problem setting as ours, and the above changes aim at making the comparison as fair as possible. It is also worth mentioning that most SSDA methods conduct adaptation on ImageNet pre-trained models, which introduces a lot of irrelevant data information from the ImageNet dataset. Thus, we disable the pre-training and only allow training with the available data.

5.4 Results of Digits Recognition

We evaluate our method on the digit recognition datasets: SVHN (S), MNIST (M), and USPS (U). According to the results shown in Table 1, when the weak annotator performs much worse than the model learned only from the provided target data $B_{t}$ ($B_{wa}$ = 73.28% on M→U, 73.28% on S→U, and 76.41% on S→M), its corresponding baseline $B_{f_{1}}$ is also lower than $B_{t}$, and only the second fine-tuning method $B_{f_{2}}$ is better than or competitive with $B_{t}$. This indicates that the features learned from the source domain data with weak labels introduce data bias, and this bias can be mitigated when the parameters of the front layers are fine-tuned with the target data.

Overall, we can clearly see that with 15,000 source domain samples, a limited number of labeled target domain samples (second row of Table 1), and a weak annotator, our method outperforms all the baselines in Table 1, with 80.00% on M→S, 95.99% on M→U, 96.36% on S→U, and 97.24% on S→M.

Method       M→S (%)   M→U (%)   S→U (%)   S→M (%)
#samples     1000      300       300       1000
$B_{wa}$     59.06     73.28     73.28     76.41
$B_{t}$      61.14     89.20     89.27     94.79
$B_{f_{1}}$  55.68     84.58     77.24     80.41
$B_{f_{2}}$  77.92     94.10     94.92     95.52
S+T          65.70     93.67     91.21     96.21
ENT          67.89     92.62     92.02     96.42
MME          65.92     93.07     91.32     95.64
FAN          68.48     93.78     92.38     96.51
Ours         80.00     95.99     96.36     97.24
Table 1: The accuracy of different methods on digit datasets.

5.5 Results of Object Recognition

The results of various methods on the VisDA-C dataset are presented in Figure 2. In this task, we utilize the synthetic images as the source domain dataset and the real-world images as the target domain dataset. We can see from the figure that the performance of the network trained only with the target data is merely 32.86%. When the weak annotator is provided, it helps the two fine-tuning baselines $B_{f_{1}}$ and $B_{f_{2}}$ reach 27.67% and 35.03%, respectively. As for the SSDA baselines, all of them perform poorly, even though they are provided with additional unlabeled target samples; the best SSDA method, FAN, only achieves 32.99%. Our method achieves 40.83%, which again exceeds all the baselines above. Besides, we also provide additional experimental results using a different weak annotator in the supplementary materials.

Figure 2: The accuracy of different methods on the VisDA-C dataset. The numbers are in percentage.

Moreover, we also test the scenario without domain discrepancy using the CIFAR-10 dataset. We randomly select 10,000 data samples from the dataset as the source data and another 1,000 samples as the target data. The results are included in Table 2. As we can see, when the weak annotator is given at 48.96% accuracy, the model trained only with the target data reaches 30.46%, while our method nearly doubles that performance and hits 61.71%, exceeding all other baselines.

Method       plane   mobile  bird    cat     deer    dog     frog    horse   ship    truck   Acc (%)
$B_{wa}$     43.18   65.68   28.13   25.93   29.00   46.15   83.91   41.76   72.12   51.06   48.96
$B_{t}$      19.08   63.39   03.03   30.16   25.77   22.60   46.11   50.85   23.61   25.74   30.46
$B_{f_{1}}$  57.38   77.53   38.46   33.51   45.27   33.17   73.33   58.29   57.67   60.20   52.97
$B_{f_{2}}$  44.19   80.00   38.02   41.97   47.15   24.30   78.14   55.06   89.35   38.19   53.49
Ours         65.52   82.61   39.79   48.45   57.36   43.60   67.39   65.32   70.42   78.89   61.71
Table 2: The accuracy of different methods on the CIFAR-10 dataset with 10 classes (without domain discrepancy). The numbers are in percentage. The per-class accuracies are shown in columns 2 to 11, and the overall accuracy is shown in the last column.

5.6 Ablation Study

We also study how the quantity of target domain samples and the performance of the weak annotator affect the overall performance of our method. To reduce the impact of domain discrepancy when we study these two factors, we conduct the ablation study on CIFAR-10.

5.6.1 Sample quantity of target samples

As presented in Figure 3, the horizontal axis indicates the number of target domain samples, and the vertical axis shows the performance of our model using the corresponding number of target domain samples. Keeping the weak annotator the same as in Section 5.5 and fixing the sample quantity of the source data at 10,000, the accuracy of the model grows as the number of target domain samples increases, and it gradually saturates when there is enough target domain data. This saturation can be explained by the fact that the second derivatives of $\sqrt{KL/N}$ and $\sqrt{\ln\frac{2N}{\delta}/N}$ with respect to $N$ are positive while the first derivatives are negative (a short check for the simpler term is shown below). We can also observe from the curve that the performance improvement when the target data is less than the source data is relatively higher than when there is more target data. The reason can be found in our theoretical analysis: when the sample quantities of the source and target data become closer, the terms impacted by the quantity of target data no longer dominate the error bound.
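For the simpler of the two terms, with a constant $c>0$ standing in for the KL divergence, the sign pattern can be verified directly (the $\sqrt{\ln(2N/\delta)/N}$ term behaves the same way for large $N$):

```latex
f(N)=\sqrt{\frac{c}{N}}, \qquad
f'(N)=-\frac{1}{2}\sqrt{c}\,N^{-3/2}<0, \qquad
f''(N)=\frac{3}{4}\sqrt{c}\,N^{-5/2}>0 .
```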

Figure 3: The performance of our learned model under different quantities of target domain samples.

5.6.2 Performance of weak annotator

Figure 4 shows how the performance of our model changes with respect to the accuracy of the weak annotator. As shown in the figure, when the weak annotator performs the worst, with an accuracy of 23.79%, our model can still reach 42.29%, which is a relatively significant improvement. As the accuracy of the weak annotator increases, our model performs better accordingly. Interestingly, the improvement curve in Figure 4 is approximately linear, which suggests that it is reasonable to linearly add the terms of the weak annotator in the error bound.

Figure 4: The performance of our learned model under different accuracies of the weak annotator.

6 Conclusion

In this work, we present a novel approach leveraging a weak annotator to address the data insufficiency challenge in domain adaptation, where only a small amount of data samples is available in the target domain and the data samples in the source domain are unlabeled. Our weak adaptation approach includes a theoretical analysis that derives the error bound of a trained classifier with respect to the data quantity and the performance of the weak annotator, and a multi-stage learning process that improves classifier performance by lowering the error bound. Our approach shows significant improvement over baselines across various datasets, in cases both with and without domain discrepancy.

7 Acknowledgement

We gratefully acknowledge the support from NSF grants 1834701, 1839511, 1724341, 2038853, and ONR grant N00014-19-1-2496.

References

  • [1] Ron Amit and Ron Meir. Meta-learning by adjusting priors based on extended pac-bayes theory. In International Conference on Machine Learning, pages 205–214. PMLR, 2018.
  • [2] Soufiane Belharbi, Ismail Ben Ayed, Luke McCaffrey, and Eric Granger. Deep active learning for joint classification & segmentation with weak annotator. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3338–3347, 2020.
  • [3] Tianshi Cao, Marc Law, and Sanja Fidler. A theoretical analysis of the number of shots in few-shot learning. arXiv preprint arXiv:1909.11722, 2019.
  • [4] Wei-Yu Chen, Yen-Cheng Liu, Zsolt Kira, Yu-Chiang Frank Wang, and Jia-Bin Huang. A closer look at few-shot classification. In International Conference on Learning Representations, 2019.
  • [5] Xuxi Chen, Wuyang Chen, Tianlong Chen, Ye Yuan, Chen Gong, Kewei Chen, and Zhangyang Wang. Self-pu: Self boosted and calibrated positive-unlabeled training. In International Conference on Machine Learning, pages 1510–1519. PMLR, 2020.
  • [6] Li Deng. The mnist database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Processing Magazine, 29(6):141–142, 2012.
  • [7] Jiahua Dong, Yang Cong, Gan Sun, and Dongdong Hou. Semantic-transferable weakly-supervised endoscopic lesions segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10712–10721, 2019.
  • [8] Jiahua Dong, Yang Cong, Gan Sun, Bineng Zhong, and Xiaowei Xu. What can be transferred: Unsupervised domain adaptation for endoscopic lesions segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4023–4032, 2020.
  • [9] Dong-Dong Chen, Wei Wang, Wei Gao, and Zhi-Hua Zhou. Tri-net for semi-supervised deep learning. In International Joint Conferences on Artificial Intelligence, 2018.
  • [10] Pascal Germain, Francis Bach, Alexandre Lacoste, and Simon Lacoste-Julien. Pac-bayesian theory meets bayesian inference. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pages 1884–1892, 2016.
  • [11] Aritra Ghosh, Naresh Manwani, and PS Sastry. Making risk minimization tolerant to label noise. Neurocomputing, 160:93–107, 2015.
  • [12] Ian Goodfellow, Yoshua Bengio, Aaron Courville, and Yoshua Bengio. Deep learning, volume 1. MIT press Cambridge, 2016.
  • [13] Yves Grandvalet and Yoshua Bengio. Semi-supervised learning by entropy minimization. In Proceedings of the 17th International Conference on Neural Information Processing Systems, pages 529–536, 2004.
  • [14] Richard F Gunst and Robert L Mason. Biased estimation in regression: an evaluation using mean squared error. Journal of the American Statistical Association, 72(359):616–628, 1977.
  • [15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [16] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  • [17] Sheng-Wei Huang, Che-Tsung Lin, Shu-Ping Chen, Yen-Yi Wu, Po-Hao Hsu, and Shang-Hong Lai. Auggan: Cross domain adaptation with gan-based data augmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 718–731, 2018.
  • [18] Jonathan J. Hull. A database for handwritten text recognition research. IEEE Transactions on pattern analysis and machine intelligence, 16(5):550–554, 1994.
  • [19] Justin M Johnson and Taghi M Khoshgoftaar. Survey on deep learning with class imbalance. Journal of Big Data, 6(1):1–54, 2019.
  • [20] S Kassam. Quantization based on the mean-absolute-error criterion. IEEE Transactions on Communications, 26(2):267–270, 1978.
  • [21] Taekyung Kim and Changick Kim. Attract, perturb, and explore: Learning a feature alignment network for semi-supervised domain adaptation. In European Conference on Computer Vision, pages 591–607. Springer, 2020.
  • [22] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [23] Ryuichi Kiryo, Gang Niu, Marthinus C du Plessis, and Masashi Sugiyama. Positive-unlabeled learning with non-negative risk estimator. arXiv preprint arXiv:1703.00593, 2017.
  • [24] Roger Koenker. Quantile regression for longitudinal data. Journal of Multivariate Analysis, 91(1):74–89, 2004.
  • [25] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • [26] Dong-Hyun Lee et al. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, volume 3, 2013.
  • [27] Junnan Li, Richard Socher, and Steven CH Hoi. Dividemix: Learning with noisy labels as semi-supervised learning. In International Conference on Learning Representations, 2019.
  • [28] Rui Li, Qianfen Jiao, Wenming Cao, Hau-San Wong, and Si Wu. Model adaptation: Unsupervised domain adaptation without source data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9641–9650, 2020.
  • [29] Tongliang Liu and Dacheng Tao. Classification with noisy labels by importance reweighting. IEEE Transactions on pattern analysis and machine intelligence, 38(3):447–461, 2015.
  • [30] Mohammad Reza Loghmani, Markus Vincze, and Tatiana Tommasi. Positive-unlabeled learning for open set domain adaptation. Pattern Recognition Letters, 136:198–204, 2020.
  • [31] Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I Jordan. Unsupervised domain adaptation with residual transfer networks. In Advances in neural information processing systems, pages 136–144, 2016.
  • [32] Yadan Luo, Zijian Wang, Zi Huang, and Mahsa Baktashmotlagh. Progressive graph learning for open-set domain adaptation. In International Conference on Machine Learning, pages 6468–6478. PMLR, 2020.
  • [33] Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: Learning bounds and algorithms. arXiv preprint arXiv:0902.3430, 2009.
  • [34] Michael F Mathieu, Junbo Jake Zhao, Junbo Zhao, Aditya Ramesh, Pablo Sprechmann, and Yann LeCun. Disentangling factors of variation in deep representation using adversarial training. Advances in neural information processing systems, 29:5040–5048, 2016.
  • [35] David McAllester. Simplified pac-bayesian margin bounds. In Learning theory and Kernel machines, pages 203–215. Springer, 2003.
  • [36] David A McAllester. Some pac-bayesian theorems. Machine Learning, 37(3):355–363, 1999.
  • [37] Nagarajan Natarajan, Inderjit S Dhillon, Pradeep Ravikumar, and Ambuj Tewari. Learning with noisy labels. In NIPS, volume 26, pages 1196–1204, 2013.
  • [38] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. 2011.
  • [39] Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nati Srebro. Exploring generalization in deep learning. In Advances in neural information processing systems, pages 5947–5956, 2017.
  • [40] Behnam Neyshabur, Srinadh Bhojanapalli, and Nathan Srebro. A pac-bayesian approach to spectrally-normalized margin bounds for neural networks. In International Conference on Learning Representations, 2018.
  • [41] Sujoy Paul, Yi-Hsuan Tsai, Samuel Schulter, Amit K Roy-Chowdhury, and Manmohan Chandraker. Domain adaptive semantic segmentation using weak labels. arXiv preprint arXiv:2007.15176, 2020.
  • [42] Xingchao Peng, Ben Usman, Neela Kaushik, Judy Hoffman, Dequan Wang, and Kate Saenko. Visda: The visual domain adaptation challenge. arXiv preprint arXiv:1710.06924, 2017.
  • [43] Fábio Perez, Rémi Lebret, and Karl Aberer. Weakly supervised active learning with cluster annotation. arXiv preprint arXiv:1812.11780, 2018.
  • [44] Samira Pouyanfar, Yudong Tao, Anup Mohan, Haiman Tian, Ahmed S Kaseb, Kent Gauen, Ryan Dailey, Sarah Aghajanzadeh, Yung-Hsiang Lu, Shu-Ching Chen, et al. Dynamic sampling in convolutional neural networks for imbalanced data classification. In 2018 IEEE conference on multimedia information processing and retrieval (MIPR), pages 112–117. IEEE, 2018.
  • [45] Rajeev Ranjan, Carlos D Castillo, and Rama Chellappa. L2-constrained softmax loss for discriminative face verification. arXiv preprint arXiv:1703.09507, 2017.
  • [46] Antti Rasmus, Harri Valpola, Mikko Honkala, Mathias Berglund, and Tapani Raiko. Semi-supervised learning with ladder networks. arXiv preprint arXiv:1507.02672, 2015.
  • [47] Subhankar Roy, Aliaksandr Siarohin, Enver Sangineto, Samuel Rota Bulo, Nicu Sebe, and Elisa Ricci. Unsupervised domain adaptation using feature-whitening and consensus loss. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9471–9480, 2019.
  • [48] Kuniaki Saito, Donghyun Kim, Stan Sclaroff, Trevor Darrell, and Kate Saenko. Semi-supervised domain adaptation via minimax entropy. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8050–8058, 2019.
  • [49] Kuniaki Saito, Kohei Watanabe, Yoshitaka Ushiku, and Tatsuya Harada. Maximum classifier discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3723–3732, 2018.
  • [50] Tasfia Shermin, Guojun Lu, Shyh Wei Teng, Manzur Murshed, and Ferdous Sohel. Adversarial network with multiple classifiers for open set domain adaptation. IEEE Transactions on Multimedia, 2020.
  • [51] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [52] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 4080–4090, 2017.
  • [53] Hui Tang and Kui Jia. Discriminative adversarial domain adaptation. In AAAI, pages 5940–5947, 2020.
  • [54] Eleni Triantafillou, Tyler Zhu, Vincent Dumoulin, Pascal Lamblin, Utku Evci, Kelvin Xu, Ross Goroshin, Carles Gelada, Kevin Swersky, Pierre-Antoine Manzagol, et al. Meta-dataset: A dataset of datasets for learning to learn from few examples. In International Conference on Learning Representations, 2019.
  • [55] Eric Tzeng, Judy Hoffman, Ning Zhang, Kate Saenko, and Trevor Darrell. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474, 2014.
  • [56] Jason Van Hulse, Taghi M Khoshgoftaar, and Amri Napolitano. Experimental perspectives on learning from imbalanced data. In Proceedings of the 24th international conference on Machine learning, pages 935–942, 2007.
  • [57] Lixu Wang, Shichao Xu, Xiao Wang, and Qi Zhu. Addressing class imbalance in federated learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 10165–10173, 2021.
  • [58] Yaqing Wang, Quanming Yao, James T Kwok, and Lionel M Ni. Generalizing from a few examples: A survey on few-shot learning. ACM Computing Surveys (CSUR), 53(3):1–34, 2020.
  • [59] Garrett Wilson and Diane J Cook. A survey of unsupervised deep domain adaptation. ACM Transactions on Intelligent Systems and Technology (TIST), 11(5):1–46, 2020.
  • [60] Garrett Wilson, Janardhan Rao Doppa, and Diane J Cook. Multi-source deep domain adaptation with weak supervision for time-series sensor data. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1768–1778, 2020.
  • [61] Shichao Xu, Yangyang Fu, Yixuan Wang, Zheng O’Neill, and Qi Zhu. Learning-based framework for sensor fault-tolerant building hvac control with model-assisted learning, 2021.
  • [62] Yixing Xu, Yunhe Wang, Hanting Chen, Kai Han, Chunjing Xu, Dacheng Tao, and Chang Xu. Positive-unlabeled compression on the cloud. arXiv preprint arXiv:1909.09757, 2019.
  • [63] Congrui Yi and Jian Huang. Semismooth newton coordinate descent algorithm for elastic-net penalized huber loss regression and quantile regression. Journal of Computational and Graphical Statistics, 26(3):547–557, 2017.
  • [64] Werner Zellinger, Thomas Grubinger, Edwin Lughofer, Thomas Natschläger, and Susanne Saminger-Platz. Central moment discrepancy (cmd) for domain-invariant representation learning. arXiv preprint arXiv:1702.08811, 2017.
  • [65] Zhi-Hua Zhou. A brief introduction to weakly supervised learning. National science review, 5(1):44–53, 2018.

Appendix

This appendix contains additional details for the submitted article Weak Adaptation Learning: Addressing Cross-domain Data Insufficiency with Weak Annotator, including mathematical proofs and experimental details. The additional related works are provided in Section A, the additional theoretical analysis details in Section B, the introduction of the datasets in Section C, the additional ablation study in Section D, the network architecture in Section E, and more about the training settings in Section F.

Appendix A Related Work

A.1 Weakly- and Semi-Supervised Learning

Weakly Supervised Learning is a large concept that may have multiple problem settings [65]. The problem we consider in this paper is related to the incomplete supervision setting that is often addressed by Semi-Supervised Learning (SSL) approaches. Standard SSL solves the problem of training a model with a few labeled data and a large amount of unlabeled data. Some of the widely-applied methods [46, 9, 43, 2] assign pseudo labels to unlabeled samples and then perform supervised learning. And there are works that address the noises in the labels of those samples [37, 11, 29]. Our target problem is related to SSL with inaccurate supervision, but is different since we consider the feature discrepancy between the (unlabeled) source data and the (labeled) target data – a case that occurs often in practice but has not been sufficiently addressed.

Positive-unlabeled Learning (PuL) is usually regarded as a sub-problem of SSL. Its goal is to learn a binary classifier that distinguishes positive from negative samples given a large amount of unlabeled data and a few positive samples. Several works [23, 5] achieve strong performance in selecting samples that are similar to the positive data, and others use the samples selected by PuL to perform other tasks [62, 30].

A.2 Domain Adaptation

Previous Domain Adaptation (DA) approaches often design losses to measure the discrepancy between the source and target domains [49, 31, 55, 64, 47], and adversarial training has also been applied to DA [34, 17, 50, 53]. In the problem settings of these works, the supervision knowledge reflected in the source labels is accurate and reliable, which is not the case in our problem (where the weak annotator generates inaccurate labels). In addition, the standard DA problem does not address the case where the quantity of the target domain data is small. The Semi-Supervised Domain Adaptation (SSDA) [48] approach assumes that only a few labeled data samples are available in the target domain, but the adaptation may use a large amount of unlabeled data in the target domain. This is different from our problem, where even unlabeled data samples are hard to acquire in the target domain (e.g., the examples mentioned in Section 1). In short, our work is the first to explore the DA problem with weak source knowledge and insufficient target samples simultaneously.

In addition, several DA papers leverage the information of so-called weak labels in the source or target domain [60, 41]. However, those weak labels belong to a different task than the target labels (e.g., an image-level label indicating the proportion of categories in an image for a segmentation task, rather than pixel-level labels), which is a different notion from the weak labels considered in our paper.

A.3 Importance of Sample Quantity

The training of machine learning models, especially deep neural networks, often requires a large amount of data samples. However, in many practical scenarios, there is not sufficient training data to feed the learning process, which degrades the model performance sharply [52, 19]. Many approaches have been proposed to make up for the lack of training samples, e.g., data re-sampling [56], data augmentation [44], metric learning, and meta learning [3, 4, 54]. There are also works [39, 1, 3, 58] that conduct theoretical analyses of the relation between training data quantity and model performance. These analyses usually take the form of bounds on the prediction error of the models and provide valuable information on how the quantity of training data affects the model performance. In our work, we also perform a theoretical analysis on the error bound of the trained model, with respect to not only the data quantity but also the performance of the weak annotator.

Appendix B Theoretical Analysis

B.1 Proof of Theorem 1

Theorem 1. For a classifier parameter distribution $\mathcal{H}\sim\mathcal{N}(0,\sigma_{\mathcal{H}}^{2})$ that is independent of the training data of size $m$, and a posterior parameter distribution $\mathcal{Q}$ learned from the training data set, if we assume $\mathcal{Q}\sim\mathcal{N}(\mu_{\mathcal{Q}},\sigma_{\mathcal{Q}}^{2})$ and consider $\Theta^{\mathbf{p}}$, $\Theta$ as drawn from $\mathcal{H}$, $\mathcal{Q}$ respectively (with $\Theta=\Theta^{\mathbf{p}}+\nabla(\widehat{L}(\mathbf{h}_{\Theta^{\mathbf{p}}}))$), the KL divergence between $\mathcal{Q}$ and $\mathcal{H}$ is bounded, with the symbols defined in Eq. (18), as follows:

$$K\!L(\mathcal{Q}\,\|\,\mathcal{H})\leq\frac{\frac{\sigma^{2}}{m}+\mu^{2}}{2\sigma_{\mathcal{H}}^{2}} \qquad (17)$$

Proof: Let us consider a classifier $\mathbf{h}_{\Theta}$ drawn from a parameter distribution $\mathcal{Q}$, where $\mathcal{Q}$ depends on the data that has been learned by $\mathbf{h}$. If we denote the model parameters of $\mathbf{h}$ before training, drawn from $\mathcal{H}$, as $\Theta^{\mathbf{p}}$, the KL divergence can be written as $K\!L(\mathcal{Q}\|\mathcal{H})=-\mathbb{E}[\Theta\cdot(\ln{\Theta^{\mathbf{p}}}-\ln{\Theta})]$. As aforementioned, $\mathbf{h}_{\Theta}$ is trained with the training data set starting from $\mathbf{h}_{\Theta^{\mathbf{p}}}$, and we consider the training to be optimized by a gradient-based method. Thus, we can write $\Theta=\Theta^{\mathbf{p}}+\nabla(\widehat{L}(\mathbf{h}_{\Theta^{\mathbf{p}}}))$. Here we omit the learning rate to simplify the formula, and $\widehat{L}(\mathbf{h}_{\Theta})$ is the empirical error computed from the training set ($\widehat{L}(\mathbf{h}_{\Theta})=\frac{1}{m}\sum_{i=1}^{m}\mathcal{L}(\mathbf{x}_{i},y_{i})$, where $\mathcal{L}$ denotes the loss of a single training sample).

The PAC-Bayesian error bound is valid for any parameter distribution $\mathcal{H}$ that is independent of the training data, and for any method of optimizing $\Theta^{\mathbf{p}}$ dependent on the training set [39]. Therefore, to simplify the problem, we instantiate the bound by setting $\mathcal{H}$ to a Gaussian distribution with zero mean ($\mu_{\mathcal{H}}=0$) and variance $\sigma^{2}_{\mathcal{H}}$. This simplification is similar to the ones in previous PAC-Bayesian works [39, 40]. We further assume that the parameter change of the overall model during training can also be regarded as conforming to an empirical Gaussian distribution, which we denote as follows:

$$\mu \triangleq \mathbb{E}\left[\nabla_{\Theta^{\mathbf{p}}}\mathcal{L}(\mathbf{x},y)\right], \qquad \sigma^{2} \triangleq \mathbb{E}\left[(\nabla_{\Theta^{\mathbf{p}}}\mathcal{L}(\mathbf{x},y)-\mu)(\nabla_{\Theta^{\mathbf{p}}}\mathcal{L}(\mathbf{x},y)-\mu)^{T}\right] \qquad (18)$$

This Gaussian distribution is independent of the model parameters if we regard the parameter updates induced by gradient back-propagation as accumulated random perturbations, i.e., each training sample corresponds to a small perturbation [40]. The KL divergence between any two Gaussian distributions $p\sim\mathcal{N}(\mu_{1},\sigma_{1}^{2})$ and $q\sim\mathcal{N}(\mu_{2},\sigma_{2}^{2})$ can be written as:

$$K\!L(p,q)=\ln{\frac{\sigma_{2}}{\sigma_{1}}}+\frac{\sigma_{1}^{2}+(\mu_{1}-\mu_{2})^{2}}{2\sigma_{2}^{2}}-\frac{1}{2} \qquad (19)$$

This identity can be derived as follows:

$$K\!L(p,q)=\int[\log(p(x))-\log(q(x))]\,p(x)\,dx=\int[p(x)\log(p(x))-p(x)\log(q(x))]\,dx \qquad (20)$$

Then we can split Eq. (20) into two integrals and calculate them respectively.

$$\begin{aligned}
\int p(x)\log(p(x))\,dx
&=\int p(x)\log\left[\frac{1}{\sqrt{2\pi}\sigma_{1}}\exp\left(-\frac{(x-\mu_{1})^{2}}{2\sigma_{1}^{2}}\right)\right]dx\\
&=\int p(x)\left[\log\frac{1}{\sqrt{2\pi}\sigma_{1}}+\log\exp\left(-\frac{(x-\mu_{1})^{2}}{2\sigma_{1}^{2}}\right)\right]dx\\
&=-\frac{1}{2}\log\left(2\pi\sigma_{1}^{2}\right)+\int p(x)\left(-\frac{(x-\mu_{1})^{2}}{2\sigma_{1}^{2}}\right)dx\\
&=-\frac{1}{2}\log\left(2\pi\sigma_{1}^{2}\right)-\frac{\left(\mu_{1}^{2}+\sigma_{1}^{2}\right)-\left(2\mu_{1}\cdot\mu_{1}\right)+\mu_{1}^{2}}{2\sigma_{1}^{2}}\\
&=-\frac{1}{2}\left[1+\log\left(2\pi\sigma_{1}^{2}\right)\right]
\end{aligned} \qquad (21)$$

$$\begin{aligned}
\int p(x)\log(q(x))\,dx
&=\int p(x)\log\left[\frac{1}{\sqrt{2\pi}\sigma_{2}}\exp\left(-\frac{(x-\mu_{2})^{2}}{2\sigma_{2}^{2}}\right)\right]dx\\
&=\int p(x)\left[\log\frac{1}{\sqrt{2\pi}\sigma_{2}}+\log\exp\left(-\frac{(x-\mu_{2})^{2}}{2\sigma_{2}^{2}}\right)\right]dx\\
&=-\frac{1}{2}\log\left(2\pi\sigma_{2}^{2}\right)+\int p(x)\left(-\frac{(x-\mu_{2})^{2}}{2\sigma_{2}^{2}}\right)dx\\
&=-\frac{1}{2}\log\left(2\pi\sigma_{2}^{2}\right)-\frac{\left(\mu_{1}^{2}+\sigma_{1}^{2}\right)-\left(2\mu_{2}\cdot\mu_{1}\right)+\mu_{2}^{2}}{2\sigma_{2}^{2}}\\
&=-\frac{1}{2}\log\left(2\pi\sigma_{2}^{2}\right)-\frac{\sigma_{1}^{2}+\left(\mu_{1}-\mu_{2}\right)^{2}}{2\sigma_{2}^{2}}
\end{aligned} \qquad (22)$$

In this case, Eq. (20) can be written as:

$$\begin{aligned}
K\!L(p,q)&=\int[p(x)\log(p(x))-p(x)\log(q(x))]\,dx\\
&=-\frac{1}{2}\left[1+\log\left(2\pi\sigma_{1}^{2}\right)\right]+\frac{1}{2}\log\left(2\pi\sigma_{2}^{2}\right)+\frac{\sigma_{1}^{2}+\left(\mu_{1}-\mu_{2}\right)^{2}}{2\sigma_{2}^{2}}\\
&=\log\frac{\sigma_{2}}{\sigma_{1}}+\frac{\sigma_{1}^{2}+\left(\mu_{1}-\mu_{2}\right)^{2}}{2\sigma_{2}^{2}}-\frac{1}{2}
\end{aligned} \qquad (23)$$

With the above definitions, we can compute the mean and variance of the distribution $\mathcal{Q}$:

$$\begin{aligned}
\mu_{\mathcal{Q}} &=\mathbb{E}\left\{\Theta^{\mathbf{p}}+\nabla[\widehat{L}(\mathbf{x},y)]\right\}\\
&=\mu_{\mathcal{H}}+\mathbb{E}\left\{\nabla\left[\frac{1}{m}\sum_{j=1}^{m}\mathcal{L}(\mathbf{x}_{j},y_{j})\right]\right\}\\
&=0+\frac{1}{m}\sum_{j=1}^{m}\mathbb{E}\left[\nabla\mathcal{L}(\mathbf{x}_{j},y_{j})\right]\\
&=\mu
\end{aligned} \qquad (24)$$

$$\begin{aligned}
\sigma_{\mathcal{Q}}^{2} &={V\!ar}\left\{\Theta^{\mathbf{p}}+\nabla[\widehat{L}(\mathbf{x},y)]\right\}\\
&={V\!ar}(\Theta^{\mathbf{p}})+{V\!ar}\left\{\nabla\left[\frac{1}{m}\sum_{j=1}^{m}\mathcal{L}(\mathbf{x}_{j},y_{j})\right]\right\}\\
&=\sigma_{\mathcal{H}}^{2}+\frac{1}{m^{2}}{V\!ar}\left\{\sum_{j=1}^{m}\nabla[\mathcal{L}(\mathbf{x}_{j},y_{j})]\right\}\\
&=\sigma_{\mathcal{H}}^{2}+\frac{1}{m^{2}}\sum_{j=1}^{m}{V\!ar}\left\{\nabla[\mathcal{L}(\mathbf{x}_{j},y_{j})]\right\}\\
&=\sigma_{\mathcal{H}}^{2}+\frac{1}{m}\sigma^{2}
\end{aligned} \qquad (25)$$

With the mean and variance of the distribution $\mathcal{Q}$, we can then calculate $K\!L(\mathcal{Q}\,\|\,\mathcal{H})$ as follows:

$$\begin{aligned}
K\!L(\mathcal{Q}\|\mathcal{H}) &=\ln{\frac{\sigma_{\mathcal{H}}}{\sigma_{\mathcal{Q}}}}+\frac{\sigma_{\mathcal{Q}}^{2}+(\mu_{\mathcal{Q}}-\mu_{\mathcal{H}})^{2}}{2\sigma_{\mathcal{H}}^{2}}-\frac{1}{2}\\
&\leq\frac{\sigma_{\mathcal{H}}^{2}+\frac{1}{m}\sigma^{2}+\mu^{2}}{2\sigma_{\mathcal{H}}^{2}}-\frac{1}{2}\\
&=\frac{\frac{1}{m}\sigma^{2}+\mu^{2}}{2\sigma_{\mathcal{H}}^{2}}
\end{aligned} \qquad (26)$$

where the inequality holds because $\mu_{\mathcal{H}}=0$, $\sigma_{\mathcal{Q}}^{2}=\sigma_{\mathcal{H}}^{2}+\frac{\sigma^{2}}{m}$, and $\ln(\sigma_{\mathcal{H}}/\sigma_{\mathcal{Q}})\leq 0$ since $\sigma_{\mathcal{Q}}\geq\sigma_{\mathcal{H}}$. $\blacksquare$
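
To make the bound in Theorem 1 concrete, the following minimal Python sketch numerically checks Eq. (19) and Eq. (17) using the closed-form mean and variance of $\mathcal{Q}$ from Eqs. (24)–(25). The prior variance, gradient statistics, and sample sizes are illustrative values chosen only for this sketch, not quantities estimated from any experiment in the paper.

import numpy as np

def gaussian_kl(mu1, var1, mu2, var2):
    # KL( N(mu1, var1) || N(mu2, var2) ), i.e., Eq. (19)
    return np.log(np.sqrt(var2 / var1)) + (var1 + (mu1 - mu2) ** 2) / (2 * var2) - 0.5

sigma_H2 = 1.0           # prior variance sigma_H^2 (zero-mean prior H)
mu, sigma2 = 0.3, 4.0    # empirical mean / variance of per-sample gradients, Eq. (18)
for m in [10, 100, 1000, 10000]:
    mu_Q = mu                                          # Eq. (24)
    var_Q = sigma_H2 + sigma2 / m                      # Eq. (25)
    kl = gaussian_kl(mu_Q, var_Q, 0.0, sigma_H2)       # Eq. (26)
    bound = (sigma2 / m + mu ** 2) / (2 * sigma_H2)    # Eq. (17)
    assert kl <= bound + 1e-12
    print(f"m={m:6d}  KL={kl:.4f}  bound={bound:.4f}")

As expected, the gap between the KL divergence and the bound shrinks as the sample size m grows, since the logarithmic term approaches zero.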

B.2 Error Bound Derivation

With the risk definition introduced in our main paper, the risk of $\mathbf{h}$ with respect to the source data distribution $\mathbb{Q}^{s}$ is

$$R^{s}(\mathbf{h})=\mathbb{E}_{(\mathbf{x},y)\sim\mathbb{Q}^{s}}\mathcal{L}(\mathbf{h}(\mathbf{x}),y)=L(\mathbf{h}_{\Theta})_{\sim\mathbb{Q}^{s}} \qquad (27)$$

The risk with respect to the target data distribution $\mathbb{Q}^{t}$ is

$$R^{t}(\mathbf{h})=\mathbb{E}_{(\mathbf{x},y)\sim\mathbb{Q}^{t}}\mathcal{L}(\mathbf{h}(\mathbf{x}),y)=L(\mathbf{h}_{\Theta})_{\sim\mathbb{Q}^{t}} \qquad (28)$$

In addition, we define the Classification Distance between two classifiers $\mathbf{h}_{1}$ and $\mathbf{h}_{2}$ under the same domain distribution $\mathbb{P}$ as

$$\mathcal{CD}_{\sim\mathbb{P}}(\mathbf{h}_{1},\mathbf{h}_{2})=\mathbb{E}_{\mathbf{x}\sim\mathbb{P}}\mathcal{L}(\mathbf{h}_{1}(\mathbf{x}),\mathbf{h}_{2}(\mathbf{x})) \qquad (29)$$

Moreover, the Discrepancy Distance between two domains is defined as in [33]: for all $\mathbf{h}_{1},\mathbf{h}_{2}\in\mathbb{H}$, the discrepancy distance between the distributions $\mathbb{P},\mathbb{Q}$ of two domains is

$$\mathcal{DD}(\mathbb{P},\mathbb{Q})=\sup_{\mathbf{h}_{1},\mathbf{h}_{2}\in\mathbb{H}}|\mathcal{CD}_{\sim\mathbb{P}}(\mathbf{h}_{1},\mathbf{h}_{2})-\mathcal{CD}_{\sim\mathbb{Q}}(\mathbf{h}_{1},\mathbf{h}_{2})| \qquad (30)$$

With the operators $\oplus$ and $\ominus$ defined in our main paper, we can obtain the following error-bound theorem for L1, L2, and their non-negative combinations as loss functions.

Theorem 2. For the L1 (Mean Absolute Error [20]) loss, the L2 (Mean Squared Error [14]) loss, and their non-negative combinations (Huber loss [63], quantile loss [24], etc.), the classification risk of the aforementioned $\mathbf{h}\oplus\mathbf{d}$ can be bounded as follows:

$$\begin{aligned}
R^{t}(\mathbf{h}\oplus\mathbf{d}) &=\mathbb{E}_{\mathbb{Q}^{t}_{X}}\mathcal{L}(\mathbf{h}\oplus\mathbf{d},\mathbf{h}^{w}\oplus\mathbf{h}^{o_{t}}\ominus\mathbf{h}^{w})\\
&\leq\mathbb{E}_{\mathbb{Q}^{t}_{X}}\mathcal{L}(\mathbf{h},\mathbf{h}^{w})+\mathbb{E}_{\mathbb{Q}^{t}_{X}}\mathcal{L}(\mathbf{d},\mathbf{h}^{o_{t}}\ominus\mathbf{h}^{w})
\end{aligned} \qquad (31)$$

Proof: For the L1 (Mean Absolute Error) loss, $\mathcal{L}(\mathbf{h}_{1},\mathbf{h}_{2})=|\mathbf{h}_{1}(\mathbf{x})-\mathbf{h}_{2}(\mathbf{x})|$. Then, incorporating the definitions of the new operators, we have the following derivation (here we denote $\mathcal{L}(\mathbf{h}\oplus\mathbf{d},\mathbf{h}^{w}\oplus\mathbf{h}^{o_{t}}\ominus\mathbf{h}^{w})$ as $\mathcal{L}$):

$$\begin{aligned}
\mathcal{L} &=\left|[\mathbf{h}(\mathbf{x})+\mathbf{d}(\mathbf{x})]-[\mathbf{h}^{w}(\mathbf{x})+\mathbf{h}^{o_{t}}(\mathbf{x})-\mathbf{h}^{w}(\mathbf{x})]\right|\\
&=\left|[\mathbf{h}(\mathbf{x})-\mathbf{h}^{w}(\mathbf{x})]+[\mathbf{d}(\mathbf{x})-\mathbf{h}^{o_{t}}(\mathbf{x})+\mathbf{h}^{w}(\mathbf{x})]\right|\\
&\leq|\mathbf{h}(\mathbf{x})-\mathbf{h}^{w}(\mathbf{x})|+|\mathbf{d}(\mathbf{x})-\mathbf{h}^{o_{t}}(\mathbf{x})+\mathbf{h}^{w}(\mathbf{x})|\\
&=\mathcal{L}(\mathbf{h},\mathbf{h}^{w})+\mathcal{L}(\mathbf{d},\mathbf{h}^{o_{t}}\ominus\mathbf{h}^{w})
\end{aligned} \qquad (32)$$

For the L2 (Mean Squared Error) loss, $\mathcal{L}(\mathbf{h}_{1},\mathbf{h}_{2})=(\mathbf{h}_{1}(\mathbf{x})-\mathbf{h}_{2}(\mathbf{x}))^{2}$, and the corresponding expansion of $\mathcal{L}(\mathbf{h}\oplus\mathbf{d},\mathbf{h}^{w}\oplus\mathbf{h}^{o_{t}}\ominus\mathbf{h}^{w})$ is as follows:

$$\begin{aligned}
\mathcal{L} &=\left\{[\mathbf{h}(\mathbf{x})+\mathbf{d}(\mathbf{x})]-[\mathbf{h}^{w}(\mathbf{x})+\mathbf{h}^{o_{t}}(\mathbf{x})-\mathbf{h}^{w}(\mathbf{x})]\right\}^{2}\\
&=\left\{[\mathbf{h}(\mathbf{x})-\mathbf{h}^{w}(\mathbf{x})]+[\mathbf{d}(\mathbf{x})-\mathbf{h}^{o_{t}}(\mathbf{x})+\mathbf{h}^{w}(\mathbf{x})]\right\}^{2}
\end{aligned} \qquad (33)$$

Here, we denote $[\mathbf{h}(\mathbf{x})-\mathbf{h}^{w}(\mathbf{x})]$ as $t_{1}$, and $[\mathbf{d}(\mathbf{x})-\mathbf{h}^{o_{t}}(\mathbf{x})+\mathbf{h}^{w}(\mathbf{x})]$ as $t_{2}$. As aforementioned, $\mathbf{d}$ is designed to approximate the difference between the weak supervision $\mathbf{h}$ and the ground truth $\mathbf{h}^{o_{t}}$ (in our method, we show that $\mathbf{h}$ can be directly replaced with $\mathbf{h}^{w}$), and thus we assume that the output of $\mathbf{d}$ is very close to $\mathbf{h}^{o_{t}}-\mathbf{h}$. Then, we analyze the sign of the product of $t_{1}$ and $t_{2}$ in all 6 possible cases shown in Table 3.

Condition | Derivation | Conclusion
$\mathbf{h}^{o_{t}}\geq\mathbf{h}\geq\mathbf{h}^{w}$ | $\mathbf{h}^{o_{t}}-\mathbf{h}\leq\mathbf{h}^{o_{t}}-\mathbf{h}^{w}$ | $t_{1}\times t_{2}\leq 0$
$\mathbf{h}^{o_{t}}\geq\mathbf{h}^{w}\geq\mathbf{h}$ | $\mathbf{h}^{o_{t}}-\mathbf{h}\geq\mathbf{h}^{o_{t}}-\mathbf{h}^{w}$ | $t_{1}\times t_{2}\leq 0$
$\mathbf{h}\geq\mathbf{h}^{o_{t}}\geq\mathbf{h}^{w}$ | $\mathbf{h}^{o_{t}}-\mathbf{h}\leq\mathbf{h}^{o_{t}}-\mathbf{h}^{w}$ | $t_{1}\times t_{2}\leq 0$
$\mathbf{h}\geq\mathbf{h}^{w}\geq\mathbf{h}^{o_{t}}$ | $\mathbf{h}^{o_{t}}-\mathbf{h}\leq\mathbf{h}^{o_{t}}-\mathbf{h}^{w}$ | $t_{1}\times t_{2}\leq 0$
$\mathbf{h}^{w}\geq\mathbf{h}^{o_{t}}\geq\mathbf{h}$ | $\mathbf{h}^{o_{t}}-\mathbf{h}\geq\mathbf{h}^{o_{t}}-\mathbf{h}^{w}$ | $t_{1}\times t_{2}\leq 0$
$\mathbf{h}^{w}\geq\mathbf{h}\geq\mathbf{h}^{o_{t}}$ | $\mathbf{h}^{o_{t}}-\mathbf{h}\geq\mathbf{h}^{o_{t}}-\mathbf{h}^{w}$ | $t_{1}\times t_{2}\leq 0$
Table 3: The sign of the product of $t_{1}=\mathbf{h}(\mathbf{x})-\mathbf{h}^{w}(\mathbf{x})$ and $t_{2}=\mathbf{d}(\mathbf{x})-\mathbf{h}^{o_{t}}(\mathbf{x})+\mathbf{h}^{w}(\mathbf{x})$ in all 6 possible cases.

According to Table 3, the product of $t_{1}$ and $t_{2}$ is always less than or equal to 0. Therefore, we can bound Eq. (33) from above as follows:

$$\begin{aligned}
\mathcal{L} &\leq\{\mathbf{h}(\mathbf{x})-\mathbf{h}^{w}(\mathbf{x})\}^{2}+\{\mathbf{d}(\mathbf{x})-\mathbf{h}^{o_{t}}(\mathbf{x})+\mathbf{h}^{w}(\mathbf{x})\}^{2}\\
&=\mathcal{L}(\mathbf{h},\mathbf{h}^{w})+\mathcal{L}(\mathbf{d},\mathbf{h}^{o_{t}}\ominus\mathbf{h}^{w})
\end{aligned} \qquad (34)$$

We have now proven that Eq. (31) holds for both the L1 and L2 losses. Moreover, any loss that is a non-negative combination of L1 and L2 also preserves this inequality. $\blacksquare$
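
The following minimal numerical sketch illustrates the decomposition in Eq. (31) for the L1 and L2 losses. Following the proof, $\mathbf{d}$ is taken here to be exactly $\mathbf{h}^{o_{t}}-\mathbf{h}$ (the argument only requires $\mathbf{d}$ to be close to this difference); all values are synthetic scalars used purely for illustration.

import numpy as np

rng = np.random.default_rng(0)
for _ in range(1000):
    h, h_w, h_ot = rng.normal(size=3)   # classifier, weak annotator, and ground-truth outputs
    d = h_ot - h                        # the difference that d is trained to approximate

    # L1 case, Eq. (32): the triangle inequality
    lhs_l1 = abs((h + d) - (h_w + h_ot - h_w))
    rhs_l1 = abs(h - h_w) + abs(d - h_ot + h_w)
    assert lhs_l1 <= rhs_l1 + 1e-12

    # L2 case, Eq. (34): relies on t1 * t2 <= 0 (Table 3)
    t1, t2 = h - h_w, d - h_ot + h_w
    assert t1 * t2 <= 1e-12
    assert (t1 + t2) ** 2 <= t1 ** 2 + t2 ** 2 + 1e-12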

From the definition of the Discrepancy Distance between two domains in Eq. (30), we can derive the following inequality on the discrepancy between $\mathbb{Q}_{X}^{t}$ and $\mathbb{Q}_{X}^{s}$:

$$\mathbb{E}_{\mathbb{Q}^{t}_{X}}\mathcal{L}(\mathbf{h},\mathbf{h}^{w})\leq\mathcal{DD}(\mathbb{Q}_{X}^{t},\mathbb{Q}_{X}^{s})+\mathbb{E}_{\mathbb{Q}^{s}_{X}}\mathcal{L}(\mathbf{h},\mathbf{h}^{w}) \qquad (35)$$

By utilizing the triangle inequality, we have:

$$\begin{aligned}
\mathbb{E}_{\mathbb{Q}^{t}_{X}}\mathcal{L}(\mathbf{h},\mathbf{h}^{w}) &\leq\mathbb{E}_{\mathbb{Q}^{t}_{X}}\mathcal{L}(\mathbf{h},\mathbf{h}^{o_{t}})+\mathbb{E}_{\mathbb{Q}^{t}_{X}}\mathcal{L}(\mathbf{h}^{o_{t}},\mathbf{h}^{w})\\
&\leq R^{t}(\mathbf{h})+R^{t}(\mathbf{h}^{w})
\end{aligned} \qquad (36)$$

$$\begin{aligned}
R^{t}(\mathbf{h}) &\leq\mathbb{E}_{\mathbb{Q}^{t}_{X}}\mathcal{L}(\mathbf{h},\mathbf{h}^{w})+\mathbb{E}_{\mathbb{Q}^{t}_{X}}\mathcal{L}(\mathbf{h}^{w},\mathbf{h}^{o_{t}})\\
&\leq\mathcal{CD}_{t}(\mathbf{h},\mathbf{h}^{w})+R^{t}(\mathbf{h}^{w})
\end{aligned} \qquad (37)$$

Based on the relations presented in Eqs. (35), (36), and (37), we can bound the classification risk of $\mathbf{h}\oplus\mathbf{d}$ in the target domain shown in Eq. (31) as:

$$\begin{aligned}
R^{t}(\mathbf{h}\oplus\mathbf{d}) &\leq\mathbb{E}_{\mathbb{Q}^{t}_{X}}\mathcal{L}(\mathbf{h},\mathbf{h}^{w})+\mathbb{E}_{\mathbb{Q}^{t}_{X}}\mathcal{L}(\mathbf{d},\mathbf{h}^{o_{t}}\ominus\mathbf{h}^{w})\\
&\leq R^{t}(\mathbf{h})+R^{t}(\mathbf{h}^{w})+\mathcal{CD}_{t}(\mathbf{d},\mathbf{h}^{o_{t}}\ominus\mathbf{h}^{w})\\
&\leq R^{t}(\mathbf{h}^{w})+\mathcal{CD}_{t}(\mathbf{h},\mathbf{h}^{w})+R^{t}(\mathbf{h}^{w})+\mathcal{CD}_{t}(\mathbf{d},\mathbf{h}^{o_{t}}\ominus\mathbf{h}^{w})\\
&\leq 2R^{t}(\mathbf{h}^{w})+\mathcal{CD}_{s}(\mathbf{h},\mathbf{h}^{w})+\mathcal{DD}(\mathbb{Q}^{t}_{X},\mathbb{Q}^{s}_{X})+\mathcal{CD}_{t}(\mathbf{d},\mathbf{h}^{o_{t}}\ominus\mathbf{h}^{w})
\end{aligned} \qquad (38)$$

Furthermore, the Classification Distance between $\mathbf{h}$ and $\mathbf{h}^{w}$ on the source domain data can also be bounded by the triangle inequality:

$$\begin{aligned}
\mathbb{E}_{\mathbb{Q}^{s}_{X}}\mathcal{L}(\mathbf{h},\mathbf{h}^{w}) &\leq\mathbb{E}_{\mathbb{Q}^{s}_{X}}\mathcal{L}(\mathbf{h},\mathbf{h}^{o_{s}})+\mathbb{E}_{\mathbb{Q}^{s}_{X}}\mathcal{L}(\mathbf{h}^{o_{s}},\mathbf{h}^{w})\\
\mathcal{CD}_{s}(\mathbf{h},\mathbf{h}^{w}) &\leq R^{s}(\mathbf{h})+R^{s}(\mathbf{h}^{w})
\end{aligned} \qquad (39)$$

Thus, Eq. (38) can be further bounded as follows:

$$\begin{aligned}
R^{t}(\mathbf{h}\oplus\mathbf{d}) &\leq 2R^{t}(\mathbf{h}^{w})+R^{s}(\mathbf{h})+R^{s}(\mathbf{h}^{w})\\
&\quad+\mathcal{DD}(\mathbb{Q}^{t}_{X},\mathbb{Q}^{s}_{X})+\mathcal{CD}_{t}(\mathbf{d},\mathbf{h}^{o_{t}}\ominus\mathbf{h}^{w})
\end{aligned} \qquad (40)$$

As introduced in our main paper, if $\mathbf{h}_{\Theta}$ is trained with mini-batches, its classification error can be bounded as follows [39]:

$$\begin{aligned}
L(\mathbf{h}_{\Theta}) &\leq\widehat{L}(\mathbf{h}_{\Theta})+4\sqrt{\frac{K\!L(\mathcal{Q}\|\mathcal{H})+\ln\frac{2m}{\delta}}{m}}\\
&\leq\widehat{L}(\mathbf{h}_{\Theta})+4\sqrt{\frac{K\!L(\mathcal{Q}\|\mathcal{H})}{m}}+4\sqrt{\frac{\ln{\frac{2m}{\delta}}}{m}}
\end{aligned} \qquad (41)$$

In our setting, the weak annotator is fixed, so the KL divergence term in the bound of its risk is zero for $\mathbf{h}^{w}$. As mentioned before, the classifier $\mathbf{d}$ is designed to learn the discrepancy between the weak supervision and the ground truth in the target domain; therefore, only the target data with accurate labels can be used to estimate $\mathbf{d}$. Moreover, if we consider that the training loss $\widehat{L}(\mathbf{h})$ (which equals the average loss over all training samples) is hardly influenced by the sample quantity, and that the same holds for the discrepancy between the two domains [32], we can split Eq. (40) into two parts: one part (denoted as $\Delta$) that is not influenced by the sample quantity, and another part that is related to the sample quantity. Based on Eq. (41), these two parts can be written as follows:

$$\begin{aligned}
R^{t}(\mathbf{h}\oplus\mathbf{d}) &\leq\Delta+4\sqrt{\frac{K\!L_{\mathbf{d}}}{N_{t}}}+4\sqrt{\frac{K\!L_{\mathbf{h}}}{N_{s}}}+12\sqrt{\frac{\ln{\frac{2N_{t}}{\delta}}}{N_{t}}}+8\sqrt{\frac{\ln{\frac{2N_{s}}{\delta}}}{N_{s}}}\\
\Delta &=2\widehat{L}_{t}(\mathbf{h}^{w})+\widehat{L}_{t}(\mathbf{d})+\widehat{L}_{s}(\mathbf{h})+\widehat{L}_{s}(\mathbf{h}^{w})+\mathcal{DD}(\mathbb{Q}^{t}_{X},\mathbb{Q}^{s}_{X})
\end{aligned} \qquad (42)$$

Here $K\!L_{\mathbf{d}}$ and $K\!L_{\mathbf{h}}$ denote the KL divergences between the trained $\mathbf{d}$, $\mathbf{h}$ and $\mathcal{H}$, respectively. According to Theorem 1, this KL divergence term is influenced by the training, and in particular by the sample quantity. This is further discussed in Section 4 of the main paper.
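
For intuition on how the sample-quantity-dependent terms of Eq. (42) behave, the short sketch below evaluates them for a few values of $N_{t}$ and $N_{s}$. The constants $K\!L_{\mathbf{d}}$, $K\!L_{\mathbf{h}}$, and $\delta$ are placeholder values chosen only for illustration, not quantities estimated from the experiments.

import numpy as np

def quantity_terms(N_t, N_s, KL_d=0.5, KL_h=0.5, delta=0.05):
    # the four sample-quantity-dependent terms on the right-hand side of Eq. (42)
    return (4 * np.sqrt(KL_d / N_t) + 4 * np.sqrt(KL_h / N_s)
            + 12 * np.sqrt(np.log(2 * N_t / delta) / N_t)
            + 8 * np.sqrt(np.log(2 * N_s / delta) / N_s))

for N_t, N_s in [(100, 1000), (1000, 10000), (1000, 20000), (2000, 20000)]:
    print(f"N_t={N_t:5d}, N_s={N_s:6d} -> extra bound terms: {quantity_terms(N_t, N_s):.3f}")

These terms decrease as $N_{t}$ and $N_{s}$ grow, which matches the discussion above: more target and source samples tighten the bound.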

Appendix C Dataset

The experiments are conducted on three application scenarios: digit recognition with domain discrepancy (the SVHN [38], MNIST [6], and USPS [18] digit datasets), object recognition with domain discrepancy (VisDA-C [42]), and object recognition without domain discrepancy (CIFAR-10 [25]).

  • SVHN, MNIST, USPS: SVHN is a real-world image dataset with 73,257 training examples obtained from house numbers in Google Street View images. MNIST is a widely used handwritten digit recognition dataset containing 60,000 training examples. The USPS database contains 7,291 training images scanned from mail in a working post office. Each dataset is composed of 10 classes corresponding to the 10 digits. In the experiments, we randomly and non-repetitively sample 15,000 samples (each class includes 1,000 samples) from one domain as the source domain data, 1,000 samples from another domain as the target domain data, and another 2,000 samples as the validation data (a minimal sampling sketch is given after this list). We test several adaptation combinations among these three datasets. All images are resized to 32 × 32.

  • VisDA-C: VisDA-C is a challenging object recognition dataset that contains 152,397 synthetic images and 55,388 real-world object images over 12 object classes. In the experiments, we randomly and non-repetitively sample 20,000 samples from the synthetic images as the source domain data, 1,000 samples from the real-world object images as the target domain data, and another 2,000 real-world samples as the validation data. All images are resized to 224 × 224.

  • CIFAR-10: CIFAR-10 is a popular object recognition benchmark with 50,000 training samples. We use this dataset to test our method in the scenario without domain discrepancy. We randomly select 10,000 data samples from the dataset as the source domain data and another 1,000 data samples as the target domain data. All images are resized to 32 × 32.
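
Below is a minimal sketch (not the authors' released code) of the random, non-repetitive sampling described in the list above: disjoint index sets of the requested sizes are drawn from one dataset, e.g., target and validation samples from the target-domain dataset.

import numpy as np

def split_domain(num_items, sizes, seed=0):
    # Return disjoint index sets of the requested sizes drawn from range(num_items).
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_items)
    splits, start = [], 0
    for size in sizes:
        splits.append(perm[start:start + size])
        start += size
    return splits

# e.g., 1,000 target samples and 2,000 validation samples from a 60,000-image dataset
target_idx, val_idx = split_domain(num_items=60000, sizes=[1000, 2000], seed=0)
assert len(np.intersect1d(target_idx, val_idx)) == 0   # non-repetitive by construction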

Appendix D Additional Ablation Study

Method knife plane bcycl person mcycl car truck plant bus sktbrd horse train Acc(%)
$B_{wa}$ 01.55 0.50 0.62 05.49 80.24 46.63 06.08 51.24 05.04 42.94 13.20 07.17 27.02
$B_{t}$ 15.85 33.33 32.54 34.69 55.20 42.08 32.11 25.68 15.48 13.95 27.16 29.49 32.86
$B_{f_{1}}$ 12.34 10.58 15.74 07.53 44.34 51.91 10.00 29.53 28.99 37.69 12.19 19.23 27.67
$B_{f_{2}}$ 30.86 09.41 11.90 23.97 53.84 47.68 25.13 43.04 46.10 36.92 22.08 28.20 35.03
S+T 00.05 03.84 03.34 00.00 14.24 86.16 07.16 14.42 16.78 02.67 00.00 00.00 21.56
ENT 22.07 33.85 42.79 10.18 34.84 45.07 37.61 45.19 20.58 10.08 13.99 27.53 31.51
MME 26.42 35.57 29.91 13.98 40.47 21.39 20.53 50.05 43.72 33.61 36.65 14.99 31.39
FAN 18.17 36.23 31.48 16.87 35.79 40.13 27.51 48.23 32.97 29.40 25.66 22.17 32.99
Ours 26.50 38.37 16.53 12.16 69.72 52.32 11.91 44.00 52.66 60.00 37.42 36.84 40.82
Table 4: Accuracy (%) of different methods on the VisDA-C dataset with 12 classes. Per-class accuracy is shown in columns 2 to 13, and the overall accuracy is shown in the last column. This table shows results using a weak annotator that performs worse than the model trained with target data only ($B_{t}$).

In addition to the ablation experiments shown in our main paper, we conduct additional studies that provide experimental evidence on why we choose to estimate the difference between the optimal classifier for the target data and the weak annotator in Stage 2 of our algorithm (Algorithm 1 in the main paper), instead of directly learning the target task with the target data in Stage 2 and then assigning pseudo labels to the source data in Stage 3.

The experiment setting is the same as in Section 5.5 of the main paper on the CIFAR-10 dataset. The algorithm for the ablation study is the same as Algorithm 1 in the main paper except for the following modifications: 1) The network $F_{2}$ learns the classification result from the target data instead of learning the difference; in this way, $\Phi_{2}$ only takes the output feature from $\Phi_{0}$ as its input feature. 2) After learning in Stage 2 is finished, the new dataset generated in Stage 3 becomes

$$D_{new}=\{(\mathbf{x},y^{new})\,|\,\mathbf{x}\in D_{X},\;y^{new}=\Phi_{0}\circ\Phi_{2}(\mathbf{x})\} \qquad (43)$$

All other steps are the same as in Algorithm 1. This additional ablation study follows the idea of assigning pseudo labels to all the data. Note that in the original pseudo-label paper [26], pseudo labels are only assigned to the unlabeled data. However, the target data should also be re-labeled using soft labels, following the ideas in [16] – this provides better performance and is the same approach as in our method for this part.
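
As a concrete illustration of the relabeling in Eq. (43), the following PyTorch sketch assigns every sample in $D_{X}$ the soft prediction of the composed model $\Phi_{0}\circ\Phi_{2}$. The names phi0, phi2, and loader are placeholders for the feature extractor, the classification head, and a data loader; this is a sketch of the ablation variant, not the exact implementation.

import torch

@torch.no_grad()
def build_d_new(loader, phi0, phi2, device="cpu"):
    d_new = []
    for x, _ in loader:                       # original labels are ignored here
        x = x.to(device)
        y_new = phi2(phi0(x)).softmax(dim=1)  # soft pseudo labels from Phi_0 composed with Phi_2
        d_new.extend(zip(x.cpu(), y_new.cpu()))
    return d_new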

The additional ablation study result is shown in Table 5. The accuracy of the ablation method $B_{direct}$ (57.70%) is lower than our result (61.71%), which indicates that learning the classification difference as in our Algorithm 1 is the better solution.

Method plane mobile bird cat deer dog frog horse ship truck Acc(%)
$B_{wa}$ 43.18 65.68 28.13 25.93 29.00 46.15 83.91 41.76 72.12 51.06 48.96
$B_{direct}$ 62.06 65.16 45.64 48.42 46.63 43.39 64.48 61.79 67.12 73.63 57.70
Ours 65.52 82.61 39.79 48.45 57.36 43.60 67.39 65.32 70.42 78.89 61.71
Table 5: Additional experiment on whether we should choose to learn the classification difference or the target task directly in Stage 2.
Method knife plane bcycl person mcycl car truck plant bus sktbrd horse train Acc(%)
$B_{wa}$ 41.89 53.03 34.17 46.67 72.59 55.42 07.07 47.13 24.84 40.00 42.85 25.32 42.14
$B_{t}$ 15.85 33.33 32.54 34.69 55.20 42.08 32.11 25.68 15.48 13.95 27.16 29.49 32.86
$B_{f_{1}}$ 31.71 40.31 11.20 22.30 64.09 39.45 10.00 44.74 39.05 18.82 31.10 23.23 33.56
$B_{f_{2}}$ 34.14 08.13 24.19 26.71 63.30 55.73 25.13 57.89 37.86 44.69 50.60 40.25 42.84
S+T 22.62 37.24 17.07 03.40 10.29 06.65 29.05 35.72 09.20 23.81 40.85 03.52 23.36
ENT 40.53 36.06 46.62 11.80 19.85 22.87 37.71 45.08 14.19 28.97 20.52 13.47 28.98
MME 21.60 44.95 17.05 24.88 57.13 36.62 16.72 18.65 36.57 28.93 20.83 13.11 29.31
FAN 32.71 50.13 14.72 20.09 60.78 53.44 20.60 42.34 20.07 17.65 23.91 33.25 34.47
Ours 43.37 59.54 13.49 21.76 67.43 60.92 21.28 51.97 58.08 18.60 40.24 40.38 45.06
Table 6: Accuracy (%) of different methods on the VisDA-C dataset with 12 classes. Per-class accuracy is shown in columns 2 to 13, and the overall accuracy is shown in the last column. This table shows results using a weak annotator that performs better than the model trained with target data only ($B_{t}$).

Appendix E Network Architecture

In the digit experiments, the weak annotator is a ResNet-18 [15]. It is trained for 4 epochs with 10,000 randomly selected data samples from the source dataset and 100 from the target dataset. $\Phi_{0}$ is built from the VGG-19 [51] network. $\Phi_{1}$ consists of three fully-connected layers with sizes [128, 64, 10]. $\Phi_{2}$ consists of two fully-connected layers with sizes [512, 10].

In the VisDA-C experiments, the weak annotator is a ResNet-50 [15]. It is trained for 10 epochs with 10,000 randomly selected data samples from both datasets. $\Phi_{0}$ is built from the ResNet-50 [15] network. $\Phi_{1}$ consists of three fully-connected layers with sizes [128, 64, 12]. $\Phi_{2}$ consists of two fully-connected layers with sizes [1024, 10].

In the CIFAR-10 experiments, the weak annotator is a VGG-19 [51]. It is trained for 7 epochs with 10,000 randomly selected data samples from the dataset. $\Phi_{0}$ is built from the VGG-19 [51] network. $\Phi_{1}$ consists of three fully-connected layers with sizes [128, 64, 10]. $\Phi_{2}$ consists of two fully-connected layers with sizes [64, 10].
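
For concreteness, a minimal PyTorch sketch of the two classification heads described above for the digit experiments is shown below. The feature dimension feat_dim produced by the $\Phi_{0}$ backbone and the ReLU activations between layers are assumptions of this sketch and are not specified above.

import torch.nn as nn

feat_dim = 512  # assumed output dimension of the Phi_0 backbone

# Phi_1: three fully-connected layers with sizes [128, 64, 10]
phi1 = nn.Sequential(
    nn.Linear(feat_dim, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 10),
)

# Phi_2: two fully-connected layers with sizes [512, 10]
phi2 = nn.Sequential(
    nn.Linear(feat_dim, 512), nn.ReLU(),
    nn.Linear(512, 10),
)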

Appendix F Training Settings

Generally, the scaling factor $\alpha$ in our algorithm is set to 1e-4. The learning rate is selected from [1e-1, 1e-2, 1e-3, 1e-4, 1e-5], choosing the value with the best performance in the experiments. The training epochs are empirically set as multiples of 10 and selected for each experiment. We pre-run each experiment to determine the epoch values and stop training when the performance does not improve over the next 20 epochs, to prevent over-fitting.

In the digit experiments, the training epochs of each training step are chosen as: $ep_{1}=90$, $ep_{2}=90$, $ep_{3}=40$, $ep_{4}=180$. The learning rate is set to 1e-4 for the M $\to$ S experiment and 1e-5 for the other experiments, and the training batch size is set to 128. The baseline $B_{t}$ is trained for 90 epochs with a learning rate of 1e-5 (1e-4 for M $\to$ S). $B_{f_{1}}$ is trained on the source data with weak labels for 90 epochs and on the target data for 90 epochs, with a learning rate of 1e-5 (1e-4 for M $\to$ S). $B_{f_{2}}$ is trained on the source data with weak labels for 90 epochs and on the target data for 90 epochs, with a learning rate of 1e-5 (1e-4 for M $\to$ S). Moreover, image augmentation techniques (provided by torchvision.transforms) are applied for the baselines $B_{t}$, $B_{f_{1}}$, $B_{f_{2}}$, and our approach; the other baselines use their original augmentation settings. We use the functions in the PyTorch vision package for the implementation, and the images may be rotated between -3 and 3 degrees, or changed to gray-scale with a probability of 0.1.

In the VisDA-C experiments, the training epochs of each training step are chosen as: $ep_{1}=90$, $ep_{2}=90$, $ep_{3}=40$, $ep_{4}=180$. The learning rate is set to 1e-5, and the training batch size is set to 128. The baseline $B_{t}$ is trained for 90 epochs with a learning rate of 1e-5. $B_{f_{1}}$ is trained on the source data with weak labels for 90 epochs and on the target data for 90 epochs, with a learning rate of 1e-5. $B_{f_{2}}$ is trained on the source data with weak labels for 90 epochs and on the target data for 90 epochs, with a learning rate of 1e-5. The image augmentation techniques are also applied for the baselines $B_{t}$, $B_{f_{1}}$, $B_{f_{2}}$, and our approach; the other baselines use their original augmentation settings. We similarly use the functions in the PyTorch vision package for the implementation, and the images may be rotated between -3 and 3 degrees, changed to gray-scale with a probability of 0.1, or horizontally flipped with a probability of 0.5.

In the CIFAR-10 experiments, the training epochs of each training step are chosen as: $ep_{1}=40$, $ep_{2}=30$, $ep_{3}=70$, $ep_{4}=70$. The learning rate is set to 1e-3, and the training batch size is set to 128. The baseline $B_{t}$ is trained for 70 epochs with a learning rate of 1e-3. $B_{f_{1}}$ is trained on the source data with weak labels for 30 epochs and on the target data for 40 epochs, with a learning rate of 1e-3. $B_{f_{2}}$ is trained on the source data with weak labels for 30 epochs and on the target data for 40 epochs, with a learning rate of 1e-3. The image augmentation techniques are still applied to the baselines $B_{t}$, $B_{f_{1}}$, $B_{f_{2}}$, and our approach. We use the functions in the PyTorch vision package for the implementation, and the images are horizontally flipped with a probability of 0.5.
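
A minimal sketch of the augmentations described above, written with torchvision.transforms, is shown below. The exact composition and ordering used in our experiments are not fully specified here, so this is an illustrative configuration only.

from torchvision import transforms

# digit / VisDA-C style augmentation: small rotation and occasional gray-scale conversion
digit_aug = transforms.Compose([
    transforms.RandomRotation(degrees=3),    # rotate within [-3, 3] degrees
    transforms.RandomGrayscale(p=0.1),       # convert to gray-scale with probability 0.1
    transforms.ToTensor(),
])

# CIFAR-10 style augmentation: horizontal flip only
cifar_aug = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),  # flip with probability 0.5
    transforms.ToTensor(),
])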

Appendix G Additional Experiments Details

We provide more details about the accuracy of different methods on the VisDA-C dataset in Table 6 and Table 4. In Table 6, we utilize a weak annotator that performs better than the model trained with target data alone ($B_{t}$), while in Table 4 we employ a weak annotator that performs worse than $B_{t}$. Both tables show that our method provides a larger performance boost than all the baselines above.

Appendix H The impact of domain discrepancy

We also evaluate how the domain discrepancy affects the model performance by conducting experiments on domain data with various levels of domain discrepancy. Specifically, we add Gaussian noise ($\sigma$ = 5.0) with different mean values to the source data to create various source domains. The final model performance is shown in Figure 5.
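
A minimal sketch of how these noisy source domains can be constructed is given below; pixel values are assumed to be in [0, 255], and the clipping behavior is an assumption of this sketch rather than a detail stated above.

import numpy as np

def shift_source(images, noise_mean, noise_std=5.0, seed=0):
    # Add Gaussian noise with the given mean (sigma = 5.0) to uint8 images.
    rng = np.random.default_rng(seed)
    noise = rng.normal(loc=noise_mean, scale=noise_std, size=images.shape)
    return np.clip(images.astype(np.float32) + noise, 0, 255).astype(np.uint8)

# larger noise means correspond to larger discrepancy from the clean target data
shifted_domains = [shift_source(np.zeros((8, 32, 32, 3), dtype=np.uint8), m) for m in (0, 10, 20, 40)]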

Figure 5: We add Gaussian noise ($\sigma$ = 5.0) to the CIFAR-10 source data to create various data domains. The mean of the noise reflects the domain discrepancy between the source data and the noise-free target data.