
Semantic Recovery for Open-Set Domain Adaptation: Discover Unseen Categories in the Target Domain

First Author
Institution1
Institution1 address
firstauthor@i1.org
   Second Author
Institution2
First line of institution2 address
secondauthor@i2.org
Abstract

Semantic recovery for unseen target samples with the help of attributes.

1 Introduction

Domain adaptation - requires an identical label space across the source and target domains.

Open-set - lacks detailed categories for the unknown set.

Zero-shot - cannot handle domain shift, and conventional ZSL already assumes the test set comes from unseen categories.

Generalized ZSL - requires knowledge of the unseen categories and their attributes, or word vector graphs.

Ours - truly explores unseen categories based on seen data, with the help of attributes, to recover semantic characteristics and meaningful representations.

2 Related Works

Open-set

ZSL - generalized ZSL and transductive ZSL

3 The Proposed Method

Figure 1: Illustration of our proposed framework.

3.1 Problem Definition

The unlabeled target domain $\mathcal{D}_{t}=\{\mathbf{X}_{t}\}$ contains $N_{t}$ images belonging to $C_{t}$ categories drawn from the distribution $P_{t}$. We seek help from an auxiliary source domain $\mathcal{D}_{s}=\{\mathbf{X}_{s},\mathbf{Y}_{s},\mathbf{A}_{s}\}$ consisting of $N_{s}$ samples from $C_{s}$ categories drawn from the distribution $P_{s}$, where $P_{s}\neq P_{t}$. In practice, $\mathcal{D}_{s}$ cannot be guaranteed to cover the whole target domain label space, leading to the open-set problem in which $C_{t}-C_{s}$ categories exist in the target domain but are unseen in the source domain. However, rather than simply filtering out those unknown categories as most existing open-set domain adaptation solutions do, we also aim to explore target samples from unseen categories and describe them in a more meaningful way. Thus, we introduce the semantic descriptions of the seen categories in the source domain, denoted as $\mathbf{A}_{s}\in\mathbb{R}^{C_{s}\times d_{a}}$, to learn enriched knowledge and representations for the visual images. The semantic descriptions characterize each category, so they are class-level rather than sample-level. In other words, for source samples belonging to the same category $c$, their semantic descriptions $\mathbf{a}_{s}^{i}=\mathbf{A}_{s}^{c}$, $\mathbf{a}_{s}^{i}\in\mathbb{R}^{d_{a}}$, are identical.

Moreover, we denote $\mathbf{X}_{s/t}\in\mathbb{R}^{N_{s/t}\times d_{x}}$ as the source/target features extracted by the pre-trained convolutional feature extractor, and $\mathbf{Y}_{s}\in\mathbb{R}^{N_{s}}$ denotes the source domain label set. It is noteworthy that, different from generalized zero-shot learning tasks, we do not assume any knowledge of the target domain unseen categories, neither the number of unseen classes nor their semantic descriptions. The goals of our work are to recognize the unlabeled target domain data as either one of the seen categories in the source domain label space or the unseen set, and to further explore the semantic descriptions of all the target domain data, especially the samples from the target domain unseen set.

3.2 Framework

We illustrate the proposed framework in Fig. 1. The framework consists of a convolutional neural network $E(\cdot)$ as the feature extractor for both source and target domains, which extracts visual features $\mathbf{X}_{s/t}$ from raw images. $G_{Z}(\cdot)$ is a network mapping the source and target domain data into a shared feature space, whose output is denoted as $\mathbf{z}_{s/t}^{i}=G_{Z}(\mathbf{x}_{s/t}^{i})$, $\mathbf{z}_{s/t}^{i}\in\mathbb{R}^{d_{z}}$. $G_{A}(\cdot)$ maps each sample from the visual feature space to the semantic feature space, and the predicted semantic description is denoted as $\hat{\mathbf{a}}_{s/t}^{i}=G_{A}(\mathbf{x}_{s/t}^{i})$ for each instance. $D(\cdot)$ is a binary classifier recognizing whether a target domain sample comes from the seen or the unseen categories. $C(\cdot)$ is a classifier with output dimension $C_{s}+1$, which recognizes which specific one of the $C_{s}$ seen categories, or the one additional "unseen" set, an input sample belongs to; its predicted label is denoted as $\hat{y}_{s/t}^{i}$.
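To make the data flow concrete, below is a minimal PyTorch-style sketch of how these components connect, using the layer sizes reported later in Section 4.4; the module definitions, the backbone feature dimension, and the attribute/class counts (given here for I2AwA/AwA2) are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

# Illustrative dimensions: ResNet-50 features, embedding, AwA2 attributes, seen classes (I2AwA).
d_x, d_z, d_a, C_s = 2048, 512, 85, 40

G_Z = nn.Sequential(nn.Linear(d_x, 1024), nn.ReLU(), nn.Linear(1024, d_z))              # shared feature mapper
G_A = nn.Sequential(nn.Linear(d_z, 256), nn.ReLU(), nn.Linear(256, d_a), nn.Sigmoid())  # semantic projector
C   = nn.Sequential(nn.Linear(d_z + d_a, 256), nn.ReLU(), nn.Linear(256, C_s + 1))      # C_s seen + 1 "unseen"
D   = nn.Sequential(nn.Linear(d_z + d_a, 256), nn.ReLU(), nn.Linear(256, 2))            # seen vs. unseen

x = torch.randn(8, d_x)            # a batch of features X_{s/t} from the frozen extractor E(.)
z = G_Z(x)                         # shared embedding z_{s/t}
a_hat = G_A(z)                     # predicted semantic description \hat{a}_{s/t}
f = torch.cat([z, a_hat], dim=1)   # joint representation f = z ⊕ a (Sec. 3.4)
seen_logits = C(f)                 # (C_s + 1)-way prediction \hat{y}
bin_logits = D(f)                  # fine-grained seen/unseen prediction
```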

3.2.1 Towards Seen-Unseen Separation


The prototypical classifier has been explored in transfer learning tasks. For each input sample $\mathbf{x}_{t}^{i}$, the prototypical classifier predicts a distribution over the label space of the known prototypes, producing $\mathbf{P}(\mathbf{x}_{t}^{i})\in\mathbb{R}^{C_{s}}$, where $C_{s}$ is the number of labeled categories as well as the number of available prototypes, because only the source domain provides labels for $C_{s}$ categories. Specifically, for each class $c$ the predicted probability is:

$p(y_{t}^{i}=c|\mathbf{x}_{t}^{i})=\frac{e^{-d(\mathbf{x}_{t}^{i},\,\mu^{c})}}{\sum_{c^{\prime}}e^{-d(\mathbf{x}_{t}^{i},\,\mu^{c^{\prime}})}},$ (1)

where $d(\cdot)$ measures the distance between the input sample $\mathbf{x}_{t}^{i}$ and the prototype $\mu^{c}$ of class $c$. For each sample $\mathbf{x}_{t}^{i}$, the predicted label $\tilde{y}_{t}^{i}=\arg\max(\mathbf{P}(\mathbf{x}_{t}^{i}))$ with the highest probability is accepted as the pseudo label, with confidence $p(y_{t}^{i}=\tilde{y}_{t}^{i}|\mathbf{x}_{t}^{i})=\max(\mathbf{P}(\mathbf{x}_{t}^{i}))$.

Prototypical classification measures the similarity between the input sample and each class prototype in the feature space: the more similar the input sample is to a specific prototype, the more probable it is that the sample belongs to that category. Based on the prototypical classification results, we therefore separate the whole target domain set into High Confidence and Low Confidence subsets, denoted as $\mathcal{D}_{t}^{H}$ and $\mathcal{D}_{t}^{L}$, respectively. Specifically, we take the mean of the probability predictions over the whole target domain, i.e., $\tau=\frac{1}{N_{t}}\sum_{\mathbf{x}_{t}^{i}\in\mathcal{D}_{t}}p(y_{t}^{i}=\tilde{y}_{t}^{i}|\mathbf{x}_{t}^{i})$, as the threshold deciding whether the prediction for each sample $\mathbf{x}_{t}^{i}$ is confident or not:

$\begin{cases}\mathbf{x}_{t}^{i}\in\mathcal{D}_{t}^{H},&p(y_{t}^{i}=\tilde{y}_{t}^{i}|\mathbf{x}_{t}^{i})\geq\tau\\ \mathbf{x}_{t}^{i}\in\mathcal{D}_{t}^{L},&p(y_{t}^{i}=\tilde{y}_{t}^{i}|\mathbf{x}_{t}^{i})<\tau\end{cases}$ (2)
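As a concrete illustration of Eqs. (1)-(2), the following hedged Python/PyTorch sketch computes the prototypical probabilities and the mean-confidence split; the Euclidean default distance, tensor shapes, and function names are our assumptions.

```python
import torch

def prototypical_probs(z_t, prototypes, dist_fn=None):
    """Eq. (1): softmax over negative distances to the class prototypes.

    z_t:        (N_t, d) target features
    prototypes: (C, d) prototypes mu^c
    """
    dist = torch.cdist(z_t, prototypes) if dist_fn is None else dist_fn(z_t, prototypes)
    return torch.softmax(-dist, dim=1)                 # P(x_t^i), shape (N_t, C)

def confidence_split(z_t, prototypes):
    """Eq. (2): split the target set into High / Low Confidence subsets."""
    probs = prototypical_probs(z_t, prototypes)
    conf, pseudo = probs.max(dim=1)                    # confidence and pseudo label per sample
    tau = conf.mean()                                  # threshold = mean confidence over D_t
    high = conf >= tau                                 # boolean mask for D_t^H (complement is D_t^L)
    return pseudo, conf, high
```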

Unfortunately, due to the lack of target domain labels, we cannot compute class prototypes for the target domain, so we have to initialize the prototypes from the labeled source domain data. However, because of the domain shift between the source and target domains, prototypes built from the source data cannot recognize the target domain data accurately. We therefore use the target domain samples from the High Confidence set $\mathcal{D}_{t}^{H}$ to adaptively refine the category prototypes as:

$\mu^{c}=(1-\alpha)\mu^{c}+\alpha\frac{1}{N_{t}^{H(c)}}\sum_{\mathbf{x}_{t}^{i}\in\mathcal{D}_{t}^{H(c)}}\mathbf{x}_{t}^{i},$ (3)

where $\mathcal{D}_{t}^{H(c)}$ denotes the subset of the High Confidence set $\mathcal{D}_{t}^{H}$ consisting of the $N_{t}^{H(c)}$ samples predicted as $\tilde{y}_{t}^{i}=c$, and $\alpha$ is the weight controlling the pace of prototype refinement. If no samples in the High Confidence set are predicted as class $c$, the corresponding prototype is simply left unchanged.
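A minimal sketch of the refinement rule in Eq. (3), assuming prototypes and High Confidence features live in the same feature space; the value of `alpha` is illustrative, not the paper's setting.

```python
import torch

def refine_prototypes(prototypes, z_high, pseudo_high, alpha=0.5):
    """Eq. (3): moving-average update of each seen-class prototype with the
    mean of the High Confidence target samples assigned to it."""
    new_protos = prototypes.clone()
    for c in range(prototypes.size(0)):
        mask = pseudo_high == c
        if mask.any():                       # classes with no confident samples stay unchanged
            new_protos[c] = (1 - alpha) * prototypes[c] + alpha * z_high[mask].mean(dim=0)
    return new_protos
```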

For the Low Confidence set $\mathcal{D}_{t}^{L}$, consisting of samples predicted as far from the known category prototypes, we also value their characteristics and structural information, because they may carry knowledge of data from unseen categories. We therefore apply the K-means clustering algorithm to group the samples in $\mathcal{D}_{t}^{L}$ into $K$ clusters, and the cluster centers $\{\eta^{1},\eta^{2},...,\eta^{K}\}$ are combined with the prototypes of the seen categories to build the refined prototype set $\mathcal{R}_{\mathbf{x}}=\{\mu^{1},\mu^{2},...,\mu^{C_{s}},\eta^{1},\eta^{2},...,\eta^{K}\}$.

The refined prototypes $\mathcal{R}_{\mathbf{x}}$ are used to apply prototypical classification to all target domain data again and obtain new pseudo labels $\mathbf{\tilde{Y}}_{t}=\{\tilde{y}_{t}^{1},\tilde{y}_{t}^{2},...,\tilde{y}_{t}^{N_{t}}\}$. All samples predicted as one of the seen categories, i.e., $\tilde{y}_{t}^{i}\in\{1,2,...,C_{s}\}$, form the new High Confidence set $\mathcal{D}_{t}^{H}$. Conversely, all samples predicted as most similar to one of the cluster centers $\{\eta^{1},...,\eta^{K}\}$ make up the new Low Confidence set $\mathcal{D}_{t}^{L}$.

By iteratively applying these operations, prototypical classification $\rightarrow$ High/Low Confidence recognition $\rightarrow$ prototype refinement, we obtain pseudo labels $\mathbf{\tilde{Y}}_{t}=\{\mathbf{\tilde{Y}}_{t}^{H},\mathbf{\tilde{Y}}_{t}^{L}\}$ for all target domain data. For the High Confidence set $\mathcal{D}_{t}^{H}$, $\mathbf{\tilde{Y}}_{t}^{H}$ indicates which seen-category prototype each target sample lies closest to, while for the Low Confidence set $\mathcal{D}_{t}^{L}$, $\mathbf{\tilde{Y}}_{t}^{L}$ indicates which unseen-subset cluster center each instance is most similar to. We treat the target domain High Confidence set $\mathcal{D}_{t}^{H}$ as the collection of target samples from the seen categories, and the Low Confidence set $\mathcal{D}_{t}^{L}$ as the collection of samples from the unseen categories. Although some samples inevitably receive wrong pseudo labels and land in the wrong set, after several rounds of prototype refinement the recognition results still capture the target domain data structure with high confidence.
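Putting the pieces together, the sketch below shows one possible realization of the iterative procedure and the K-means extension using scikit-learn, reusing the `confidence_split` and `refine_prototypes` helpers sketched above; the number of rounds, `K`, and `alpha` are illustrative assumptions.

```python
import torch
from sklearn.cluster import KMeans

def build_refined_prototypes(z_t, prototypes, K=10, alpha=0.5, rounds=3):
    """Iterate: prototypical classification -> High/Low split -> refine prototypes,
    then append K cluster centers of the Low Confidence set (Sec. 3.2.1)."""
    for _ in range(rounds):
        pseudo, conf, high = confidence_split(z_t, prototypes)
        prototypes = refine_prototypes(prototypes, z_t[high], pseudo[high], alpha)
    # Cluster the remaining Low Confidence samples into K tentative "unseen" groups.
    low_feats = z_t[~high].detach().cpu().numpy()
    centers = KMeans(n_clusters=K, n_init=10).fit(low_feats).cluster_centers_
    centers = torch.as_tensor(centers, dtype=z_t.dtype, device=z_t.device)
    refined = torch.cat([prototypes, centers], dim=0)   # R_x = {mu^1..mu^Cs, eta^1..eta^K}
    # Final re-labeling is a plain nearest-prototype assignment over all C_s + K prototypes;
    # the confidence threshold is not needed at this step.
    pseudo, _, _ = confidence_split(z_t, refined)
    is_seen = pseudo < prototypes.size(0)               # new D_t^H / D_t^L membership
    return refined, pseudo, is_seen
```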

3.3 Aggregating Structural Knowledge via Semantic Propagation

The most challenging and practical task we focus on in this work is to reveal meaningful semantic descriptions for the target domain data, whether they come from the seen categories shared with the source domain or from unseen categories that exist only in the target domain. We therefore expect a projector $G_{A}(\cdot)$ to map the data from the visual feature space into the semantic feature space.

However, because only the labels and semantic descriptions of the seen categories are available, both the source domain and the target domain High Confidence set $\mathcal{D}_{t}^{H}$ ignore the information and structure of the target samples from unseen categories. Training the feature extractor $G_{Z}(\cdot)$ and the semantic projector $G_{A}(\cdot)$ only on $\mathcal{D}_{s}$ and $\mathcal{D}_{t}^{H}$ makes things worse by overfitting to the seen categories, even though our goal is to explore the characteristics of data from unseen categories. We therefore adopt a semantic propagation mechanism to aggregate the structural knowledge of the visual features into the semantic description projection.

Specifically, for the samples in a training batch drawn from the source or target domain, the adjacency matrix $A$ is calculated as $A_{ij}=\exp(-\frac{d_{ij}^{2}}{\sigma^{2}})$, where $A_{ii}=0,\forall i$, and $d_{ij}^{2}=\|\mathbf{z}^{i}-\mathbf{z}^{j}\|^{2}$ is the squared distance between two features $(\mathbf{z}^{i},\mathbf{z}^{j})$. $\sigma^{2}$ is a scaling factor, and we set $\sigma^{2}=Var(d_{ij}^{2})$ as in [rodriguez2020embedding] to stabilize training. We then compute the normalized Laplacian of the adjacency matrix, $L=D^{-\frac{1}{2}}AD^{-\frac{1}{2}}$, $D_{ii}=\sum_{j}A_{ij}$, and obtain the semantic propagator matrix $\mathcal{W}=(I-\alpha L)^{-1}$ following the idea of [zhou2004learning], where $\alpha\in\mathbb{R}$ is a scaling factor and $I$ is the identity matrix. Finally, the semantic descriptions projected from the visual features are obtained as:

$\hat{\mathbf{a}}_{s/t}^{i}=\sum_{j}\mathcal{W}_{ij}\,G_{A}(G_{Z}(\mathbf{x}_{s/t}^{j})).$ (4)

After the semantic propagation, the projected semantic description $\hat{\mathbf{a}}_{s/t}^{i}$ is refined as a weighted combination of the semantic representations of its neighbors, guided by the visual feature graph structure. This strategy injects the visual similarity graph into the semantic description projection, reducing the risk of overfitting to the seen categories during training, and has the effect of removing undesired noise from the visual and semantic feature vectors [rodriguez2020embedding].
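The propagation step of Eq. (4) can be sketched as below; the value of `alpha`, the numerical-stability `eps`, and the batch-wise computation are assumptions on our part.

```python
import torch

def semantic_propagation(z, a_hat, alpha=0.5, eps=1e-8):
    """Eq. (4): propagate projected semantic descriptions over the batch similarity graph.

    z:     (B, d_z) embedded features of one batch
    a_hat: (B, d_a) per-sample outputs of G_A
    """
    d2 = torch.cdist(z, z) ** 2                      # pairwise squared distances d_ij^2
    sigma2 = d2.var().clamp_min(eps)                 # scaling factor sigma^2 = Var(d_ij^2)
    A = torch.exp(-d2 / sigma2)
    A.fill_diagonal_(0)                              # A_ii = 0
    D_inv_sqrt = torch.diag(A.sum(dim=1).clamp_min(eps).rsqrt())
    L = D_inv_sqrt @ A @ D_inv_sqrt                  # normalized graph operator D^{-1/2} A D^{-1/2}
    W = torch.linalg.inv(torch.eye(z.size(0), device=z.device) - alpha * L)  # propagator (I - alpha L)^{-1}
    return W @ a_hat                                 # \hat{a}^i = sum_j W_ij G_A(G_Z(x^j))
```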

For each source domain sample $\mathbf{x}_{s}^{i}\in\mathbf{X}_{s}$, the ground-truth label $y_{s}^{i}$ is known, and so is the semantic description $\mathbf{a}_{s}^{i}$, where $\mathbf{a}_{s}^{i}=\mathbf{A}_{s}^{y_{s}^{i}}$ is obtained from $\mathbf{A}_{s}$ by the class label $y_{s}^{i}$. We construct the semantic projection loss on the source domain as:

$\mathcal{L}_{s}^{A}=\frac{1}{N_{s}}\sum_{\mathbf{x}_{s}^{i}\in\mathcal{D}_{s}}L_{bce}(\hat{\mathbf{a}}_{s}^{i},\,\mathbf{a}_{s}^{i}),$ (5)

where $L_{bce}(\cdot)$ is the binary cross-entropy loss. For each sample $\mathbf{x}_{s}^{i}$, each dimension of the semantic description $\mathbf{a}_{s}^{i}\in\mathbb{R}^{d_{a}}$ represents one specific semantic characteristic, so each dimension of the $G_{A}(\cdot)$ output measures the predicted probability that the input sample has that characteristic.

For the target domain, although ground-truth labels and semantic descriptions are not available, we have already obtained confident pseudo labels and pseudo semantic descriptions through the operations in Section 3.2.1. For each target domain sample $\mathbf{x}_{t}^{i}\in\mathcal{D}_{t}^{H}$ from the High Confidence set, we use its pseudo label $\tilde{y}_{t}^{i}\in\mathbf{\tilde{Y}}_{t}^{H}$ to obtain the pseudo semantic description $\mathbf{\tilde{a}}_{t}^{i}=\mathbf{A}_{s}^{\tilde{y}_{t}^{i}}$, since all samples in $\mathcal{D}_{t}^{H}$ carry pseudo labels from the seen categories shared between the source and target domains. Similarly, we construct the semantic projection loss on the target domain as:

$\mathcal{L}_{t}^{A}=\frac{1}{N_{t}^{H}}\sum_{\mathbf{x}_{t}^{i}\in\mathcal{D}_{t}^{H}}L_{bce}(\mathbf{\hat{a}}_{t}^{i},\,\mathbf{\tilde{a}}_{t}^{i}),$ (6)

where $N_{t}^{H}$ is the number of samples in $\mathcal{D}_{t}^{H}$.

Combining the two terms, we obtain the semantic description projection objective:

$\underset{G_{Z},G_{A}}{\min}\;\mathcal{L}^{A}=\mathcal{L}_{s}^{A}+\mathcal{L}_{t}^{A}$ (7)
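A hedged sketch of the combined objective in Eqs. (5)-(7), treating the propagated outputs of $G_A$ as attribute probabilities; the clamping is our own numerical safeguard and not stated in the paper.

```python
import torch
import torch.nn.functional as F

def semantic_projection_loss(a_hat_s, a_s, a_hat_tH, a_tilde_tH):
    """Eqs. (5)-(7): BCE between projected and (pseudo) ground-truth attributes.

    a_hat_s:    (N_s, d_a)   propagated predictions for source samples
    a_s:        (N_s, d_a)   ground-truth class attributes A_s[y_s]
    a_hat_tH:   (N_t^H, d_a) predictions for High Confidence target samples
    a_tilde_tH: (N_t^H, d_a) pseudo attributes A_s[y_tilde] for those samples
    """
    # Propagated outputs are not guaranteed to stay in (0, 1) after Eq. (4),
    # so we clamp before BCE; this is an assumption, not part of the paper.
    loss_s = F.binary_cross_entropy(a_hat_s.clamp(1e-6, 1 - 1e-6), a_s)
    loss_t = F.binary_cross_entropy(a_hat_tH.clamp(1e-6, 1 - 1e-6), a_tilde_tH)
    return loss_s + loss_t
```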

3.4 Joint Representation From Visual and Semantic Perspectives

As visual features and semantic descriptions explain the data from different perspectives in different modalities, we explore the joint distribution of visual and semantic descriptions for each sample simultaneously. Inspired by [long2017conditional], for each sample $\mathbf{z}^{i}=G_{Z}(\mathbf{x}^{i})$, we inject the semantic discriminative information $\mathbf{a}^{i}$ into the visual features by constructing the joint representation:

$\mathbf{f}^{i}=\mathbf{z}^{i}\oplus\mathbf{a}^{i},$ (8)

where $\oplus$ is the concatenation operation combining the visual and semantic features $\mathbf{z}^{i}$ and $\mathbf{a}^{i}$ into a joint representation $\mathbf{f}^{i}$, which is used for the classification optimization and cross-domain alignment.

It is noteworthy that we have already introduced several different semantic descriptions for samples from different subsets; thus we obtain different joint representations for the data:

$\begin{cases}\mathcal{F}_{s}^{i}\;\;\;\,=\{\mathbf{\bar{f}}_{s}^{i},\mathbf{\hat{f}}_{s}^{i}\},&\mathbf{x}_{s}^{i}\in\mathcal{D}_{s}\\ \mathcal{F}_{t}^{H(j)}=\{\mathbf{\tilde{f}}_{t}^{j},\mathbf{\hat{f}}_{t}^{j}\},&\mathbf{x}_{t}^{j}\in\mathcal{D}_{t}^{H}\\ \mathcal{F}_{t}^{L(k)}=\{\mathbf{\hat{f}}_{t}^{k}\},&\mathbf{x}_{t}^{k}\in\mathcal{D}_{t}^{L}\end{cases}$ (9)

Specifically, for source domain data $\mathbf{x}_{s}^{i}\in\mathcal{D}_{s}$, we construct $\mathbf{\bar{f}}_{s}^{i}=\mathbf{z}_{s}^{i}\oplus\mathbf{a}_{s}^{i}$ with the ground-truth semantic description $\mathbf{a}_{s}^{i}$, and $\mathbf{\hat{f}}_{s}^{i}=\mathbf{z}_{s}^{i}\oplus\mathbf{\hat{a}}_{s}^{i}$ with the projected semantic description $\mathbf{\hat{a}}_{s}^{i}=G_{A}(\mathbf{z}_{s}^{i})$. Similarly, for target domain High Confidence data $\mathbf{x}_{t}^{j}\in\mathcal{D}_{t}^{H}$, the pseudo semantic description $\mathbf{\tilde{a}}_{t}^{j}$ is obtained from the pseudo label $\tilde{y}_{t}^{j}$; together with the semantic prediction $\mathbf{\hat{a}}_{t}^{j}=G_{A}(\mathbf{z}_{t}^{j})$, we construct two kinds of joint representations, $\mathbf{\tilde{f}}_{t}^{j}=\mathbf{z}_{t}^{j}\oplus\mathbf{\tilde{a}}_{t}^{j}$ and $\mathbf{\hat{f}}_{t}^{j}=\mathbf{z}_{t}^{j}\oplus\mathbf{\hat{a}}_{t}^{j}$. Finally, for data $\mathbf{x}_{t}^{k}\in\mathcal{D}_{t}^{L}$, the only semantic description available is the prediction $\mathbf{\hat{a}}_{t}^{k}=G_{A}(\mathbf{z}_{t}^{k})$, so we construct the joint representation $\mathbf{\hat{f}}_{t}^{k}=\mathbf{z}_{t}^{k}\oplus\mathbf{\hat{a}}_{t}^{k}$. All the joint representations $\mathcal{F}_{s},\mathcal{F}_{t}^{H},\mathcal{F}_{t}^{L}$ are fed to the classifiers $C(\cdot)$ and $D(\cdot)$ to optimize the framework.
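The construction in Eqs. (8)-(9) reduces to simple concatenation; the helper below is a minimal sketch in which `a_ref` stands for the ground-truth or pseudo attributes when they exist (it is `None` for the Low Confidence set).

```python
import torch

def joint_representations(z, a_hat, a_ref=None):
    """Eqs. (8)-(9): concatenate visual and semantic features.

    Every sample gets a joint feature with its predicted attributes; source and
    High Confidence target samples additionally get one built from the
    ground-truth / pseudo attributes a_ref."""
    feats = [torch.cat([z, a_hat], dim=1)]             # \hat{f} = z ⊕ \hat{a}
    if a_ref is not None:
        feats.append(torch.cat([z, a_ref], dim=1))     # \bar{f} or \tilde{f} = z ⊕ a / \tilde{a}
    return feats
```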

3.5 Classification Supervision Optimization

Domain adaptation assumes that, with the help of labeled source domain data, we can train a model transferable to the unlabeled target domain data, and many existing domain adaptation works have demonstrated the reasonableness and effectiveness of this strategy. Maintaining the ability of the model on the source domain data is therefore crucial for transferring it to target domain tasks. For each source domain sample $\mathbf{x}_{s}^{i}\in\mathcal{D}_{s}$, the ground-truth label $y_{s}^{i}$ is known. We construct the classification loss on the source domain as:

$\mathcal{L}_{s}^{C}=\frac{1}{N_{s}}\sum_{\mathbf{x}_{s}^{i}\in\mathcal{D}_{s}}\sum_{\mathbf{f}_{s}^{i}\in\mathcal{F}_{s}^{i}}L_{ce}(C(\mathbf{f}_{s}^{i}),y_{s}^{i}),$ (10)

where $L_{ce}(\cdot)$ is the cross-entropy loss.

Moreover, we have already obtained confident pseudo labels for the target domain data, which we also use to adapt the model to the target domain. It is noteworthy that $C(\cdot)$ is an extended classifier recognizing the input instance as one of $C_{s}+1$ classes, i.e., the $C_{s}$ seen categories shared across domains plus the additional "unseen" class. Specifically, samples from the High Confidence set $\mathcal{D}_{t}^{H}$ are optimized toward their corresponding pseudo labels in $\mathbf{\tilde{Y}}_{t}^{H}$, while samples from the Low Confidence set $\mathcal{D}_{t}^{L}$ are recognized by $C(\cdot)$ as the "unseen" class. Thus we have the loss term on the target domain as:

$\mathcal{L}_{t}^{C}=\frac{1}{N_{t}^{H}}\sum_{\mathbf{x}_{t}^{i}\in\mathcal{D}_{t}^{H}}\sum_{\mathbf{f}_{t}^{i}\in\mathcal{F}_{t}^{H(i)}}L_{ce}(C(\mathbf{f}_{t}^{i}),\,\phi(\tilde{y}_{t}^{i}))+\frac{1}{N_{t}^{L}}\sum_{\mathbf{x}_{t}^{i}\in\mathcal{D}_{t}^{L}}\sum_{\mathbf{f}_{t}^{i}\in\mathcal{F}_{t}^{L(i)}}L_{ce}(C(\mathbf{f}_{t}^{i}),\,\phi(\tilde{y}_{t}^{i})),$ (11)

where $\phi(\tilde{y}_{t}^{i})=\tilde{y}_{t}^{i}$ if $\mathbf{x}_{t}^{i}\in\mathcal{D}_{t}^{H}$, $\phi(\tilde{y}_{t}^{i})=C_{s}+1$ if $\mathbf{x}_{t}^{i}\in\mathcal{D}_{t}^{L}$, and $N_{t}^{H/L}$ denotes the number of samples in $\mathcal{D}_{t}^{H/L}$. We then have the classification supervision objective on both source and target domains:

$\underset{G_{Z},G_{A},C}{\min}\;\mathcal{L}^{C}=\mathcal{L}_{s}^{C}+\mathcal{L}_{t}^{C}$ (12)
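One possible realization of Eqs. (10)-(12) is sketched below; the list-of-joint-features interface and the 0-indexed "unseen" label `C_s` are our assumptions.

```python
import torch
import torch.nn.functional as F

def classification_loss(C, F_s, y_s, F_tH, y_tH, F_tL, C_s):
    """Eqs. (10)-(12): cross-entropy over the extended (C_s + 1)-way classifier C(.).

    F_s, F_tH, F_tL are lists of joint-representation tensors (Eq. 9) for the
    source set, the High Confidence target set, and the Low Confidence target set;
    every Low Confidence sample is mapped to the extra "unseen" index C_s."""
    unseen = torch.full((F_tL[0].size(0),), C_s, dtype=torch.long, device=F_tL[0].device)
    loss = sum(F.cross_entropy(C(f), y_s) for f in F_s)             # Eq. (10)
    loss = loss + sum(F.cross_entropy(C(f), y_tH) for f in F_tH)    # Eq. (11), High Confidence term
    loss = loss + sum(F.cross_entropy(C(f), unseen) for f in F_tL)  # Eq. (11), Low Confidence term
    return loss
```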

3.6 Fine-grained Seen/Unseen Subsets Separation

To more accurately recognize the target domain data as seen or unseen, we further train a binary classifier $D(\cdot)$ to finely separate the seen and unseen classes. With the joint representations of the target domain as input, and with the pseudo labels and pseudo semantic descriptions available, the fine-grained binary classifier $D(\cdot)$ is optimized by:

$\mathcal{L}_{t}^{D}=\frac{1}{N_{t}^{H}}\sum_{\mathbf{x}_{t}^{i}\in\mathcal{D}_{t}^{H}}\sum_{\mathbf{f}_{t}^{i}\in\mathcal{F}_{t}^{H(i)}}L_{bce}(D(\mathbf{f}_{t}^{i}),\,\psi(\tilde{y}_{t}^{i}))+\frac{1}{N_{t}^{L}}\sum_{\mathbf{x}_{t}^{i}\in\mathcal{D}_{t}^{L}}\sum_{\mathbf{f}_{t}^{i}\in\mathcal{F}_{t}^{L(i)}}L_{bce}(D(\mathbf{f}_{t}^{i}),\,\psi(\tilde{y}_{t}^{i})),$ (13)

in which $\psi(\tilde{y}_{t}^{i})$ indicates whether the target sample $\mathbf{x}_{t}^{i}$ is from the seen categories ($\mathbf{x}_{t}^{i}\in\mathcal{D}_{t}^{H}$, $\psi(\tilde{y}_{t}^{i})=0$) or from the unseen categories ($\mathbf{x}_{t}^{i}\in\mathcal{D}_{t}^{L}$, $\psi(\tilde{y}_{t}^{i})=1$).
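A small sketch of Eq. (13); for simplicity we assume the two-way softmax head of $D(\cdot)$ from the earlier framework sketch, so plain cross-entropy stands in for the binary loss.

```python
import torch
import torch.nn.functional as F

def binary_seen_unseen_loss(D, F_tH, F_tL):
    """Eq. (13): train the binary classifier D(.) — High Confidence joint features
    are labelled 0 (seen), Low Confidence ones 1 (unseen)."""
    loss = sum(F.cross_entropy(D(f), torch.zeros(f.size(0), dtype=torch.long, device=f.device))
               for f in F_tH)
    loss = loss + sum(F.cross_entropy(D(f), torch.ones(f.size(0), dtype=torch.long, device=f.device))
                      for f in F_tL)
    return loss
```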

3.7 Structure Preserving Partial Cross-Domain Alignment

Many previous domain adaptation efforts focus on exploring a domain-invariant feature space across domains. However, due to the mismatch between the source and target label spaces in our task, simply matching the feature distributions across domains becomes destructive. Moreover, distribution structural information has been shown to be important for label-space-mismatch tasks such as conventional open-set domain adaptation [pan2020exploring]. Considering our goal of uncovering the unseen categories in the target domain, preserving the structural knowledge of the target domain data becomes even more crucial. Thanks to the refined prototypes $\mathcal{R}_{\mathbf{x}}$ and the target pseudo labels $\mathbf{\tilde{Y}}_{t}$ obtained in Section 3.2.1, we can build the prototypes $\mathcal{R}_{\mathbf{z}}$ on the embedding features $\mathbf{z}_{t}$ in the same way with the help of the pseudo labels $\mathbf{\tilde{Y}}_{t}$. Specifically, for each pseudo label $c$, the corresponding prototype is calculated as $\mathcal{R}_{\mathbf{z}}^{c}=\mathbb{E}_{\mathbf{x}_{t}^{i}\in\mathcal{D}_{t}}[\mathbf{z}_{t}^{i}\cdot\mathbb{I}_{\tilde{y}_{t}^{i}=c}]$. The prototypes of the target domain features describe the structural graph knowledge in the target domain, so we propose a loss function enforcing the target domain data to stay close to their corresponding prototypes based on the pseudo labels:

$\mathcal{L}_{t}^{R}=\frac{1}{N_{t}}\sum_{\mathbf{x}_{t}^{i}\in\mathcal{D}_{t}}\Big(d(\mathbf{z}_{t}^{i},\mathcal{R}_{\mathbf{z}}^{c})\cdot\mathbb{I}_{\tilde{y}_{t}^{i}=c}-\frac{1}{C_{s}+K-1}\sum_{c^{\prime}}d(\mathbf{z}_{t}^{i},\mathcal{R}_{\mathbf{z}}^{c^{\prime}})\cdot\mathbb{I}_{\tilde{y}_{t}^{i}\neq c^{\prime}}\Big),$ (14)

where $\mathcal{R}_{\mathbf{z}}^{c/c^{\prime}}$ denotes the prototype with label $c/c^{\prime}$, $C_{s}+K=|\mathcal{R}_{\mathbf{z}}|$ is the total number of prototypes in $\mathcal{R}_{\mathbf{z}}$, and $d(\cdot)$ is the distance measurement between two features. This loss pushes the embedding feature of the target sample $\mathbf{z}_{t}^{i}=G_{Z}(\mathbf{x}_{t}^{i})$ close to the corresponding prototype $\mathcal{R}_{\mathbf{z}}^{c}$ with pseudo label $c=\tilde{y}_{t}^{i}$, and away from the other prototypes $\mathcal{R}_{\mathbf{z}}^{c^{\prime}}$ with $c^{\prime}\neq\tilde{y}_{t}^{i}$.

Moreover, for the source domain, we seek to map the source samples into the target domain feature space, thereby preserving the target domain distribution structure, instead of mapping both domains into a shared feature space, since such a strategy may break the original structural knowledge of the target domain. We therefore propose the loss function on the source domain as:

$\mathcal{L}_{s}^{R}=\frac{1}{N_{s}}\sum_{\mathbf{x}_{s}^{i}\in\mathcal{D}_{s}}\Big(d(\mathbf{z}_{s}^{i},\mathcal{R}_{\mathbf{z}}^{c})\cdot\mathbb{I}_{y_{s}^{i}=c}-\frac{1}{C_{s}+K-1}\sum_{c^{\prime}}d(\mathbf{z}_{s}^{i},\mathcal{R}_{\mathbf{z}}^{c^{\prime}})\cdot\mathbb{I}_{y_{s}^{i}\neq c^{\prime}}\Big).$ (15)

It is noteworthy that these two loss functions simultaneously achieve partial cross-domain alignment and enhance the discriminative characteristics of the embedding feature space while preserving the target domain data structure. We thus obtain our structure-preserving partial cross-domain alignment objective:

$\underset{G_{Z}}{\min}\;\mathcal{L}^{R}=\mathcal{L}_{s}^{R}+\mathcal{L}_{t}^{R}$ (16)
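Both Eq. (14) and Eq. (15) share the same pull/push form, so a single helper suffices in a sketch; the Euclidean distance and the uniform averaging over the remaining prototypes are assumptions consistent with the text.

```python
import torch

def structure_preserving_loss(z, labels, prototypes):
    """Eqs. (14)-(15): pull each embedding toward its (pseudo-)labelled prototype in
    R_z and push it away, on average, from all other prototypes. Used with target
    pseudo labels (Eq. 14) and source ground-truth labels (Eq. 15).

    z:          (N, d_z) embedded features
    labels:     (N,) long tensor of (pseudo) labels indexing into prototypes
    prototypes: (C_s + K, d_z) refined prototypes R_z
    """
    P = prototypes.size(0)                                  # P = C_s + K
    dist = torch.cdist(z, prototypes)                       # (N, P) pairwise distances
    pull = dist.gather(1, labels.view(-1, 1)).squeeze(1)    # distance to own prototype
    mask = torch.ones_like(dist).scatter_(1, labels.view(-1, 1), 0.0)
    push = (dist * mask).sum(dim=1) / (P - 1)               # mean distance to the others
    return (pull - push).mean()
```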

3.8 Overall Objective

Overall, we propose our final optimization objective as:

$\underset{G_{Z},G_{A},C,D}{\min}\;\mathcal{L}^{A}+\mathcal{L}^{C}+\mathcal{L}_{t}^{D}+\mathcal{L}^{R}$ (17)
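A hedged sketch of one optimization step with the overall objective, reusing the loss helpers sketched in the previous subsections; the batch dictionary layout is an assumption, while the equal (unweighted) summation follows Eq. (17).

```python
import torch

def training_step(optimizer, batch):
    """One gradient update on the overall objective L^A + L^C + L_t^D + L^R (Eq. 17).
    The loss helpers are the sketches from Sections 3.3-3.7; `batch` is assumed to
    hold pre-assembled argument tuples for each of them."""
    loss = (semantic_projection_loss(*batch["semantic"])          # Eq. (7)
            + classification_loss(*batch["classification"])       # Eq. (12)
            + binary_seen_unseen_loss(*batch["binary"])           # Eq. (13)
            + structure_preserving_loss(*batch["target_struct"])  # Eq. (14)
            + structure_preserving_loss(*batch["source_struct"])) # Eq. (15)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss.detach())
```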

3.9 Training and Optimization Strategy

4 Experiments

4.1 Datasets

I2AwA, proposed by [zhuo2019unsupervised], consists of 50 animal classes, split into 40 seen and 10 unseen categories following [xian2017zero]. The source domain includes 2,970 images of the seen categories collected via the Google image search engine, while the target domain is the AwA2 dataset proposed in [xian2017zero] for zero-shot learning, which contains all 50 classes with 37,322 images. We use the attribute features of AwA2 as the semantic descriptions, and only the seen categories' attributes are available for training.

D2AwA is constructed from the DomainNet dataset [peng2019moment] and AwA2 [xian2017zero]. Specifically, we take the 17 classes shared between DomainNet and AwA2 as the total label set, select the alphabetically first 10 classes as the seen categories, and leave the remaining 7 classes as unseen. The corresponding attribute features in AwA2 are used as the semantic descriptions. It is noteworthy that DomainNet contains 6 different domains, some of which are far from the semantic characteristics described by the AwA2 attributes, e.g., quickdraw. We therefore only consider the "real image" (R) and "painting" (P) domains, which together with the AwA2 (A) data build up 6 source-target pair tasks: R$\rightarrow$A, R$\rightarrow$P, P$\rightarrow$A, P$\rightarrow$R, A$\rightarrow$R, A$\rightarrow$P.

4.2 Evaluation Metrics

We evaluate our method in two ways: open-set recognition and semantic recovery. The open-set evaluation follows conventional open-set domain adaptation works, recognizing the whole target domain data as either one of the seen categories shared between the source and target domains or a single unseen category. We report the results of the classifier $C(\cdot)$ on the seen categories only (OS*), on the unseen category (UNK), and the overall accuracy (OS) on the whole target data [panareda2017open]. For the semantic recovery task, we evaluate the semantic descriptions predicted by $G_{A}(\cdot)$ by applying prototypical classification with the ground-truth semantic attributes of all classes and computing the classification accuracy over the whole target domain label space. We report the performance on the seen and unseen categories as $S$ and $U$, respectively, and calculate the harmonic mean $H=(2\times S\times U)/(S+U)$ to evaluate performance on both. It is noteworthy that all reported results are the average of class-wise top-1 accuracy, to eliminate the influence of the imbalanced data distribution across classes. Furthermore, we show the recovered semantic attributes for some target samples from the unseen categories to intuitively evaluate the ability to discover unseen classes that exist only in the target domain.
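For reference, the class-wise top-1 accuracy and the harmonic mean $H$ can be computed as in the small NumPy sketch below; the function names are illustrative.

```python
import numpy as np

def class_wise_accuracy(y_true, y_pred, classes):
    """Average of per-class top-1 accuracies over the given class set."""
    per_class = [(y_pred[y_true == c] == c).mean() for c in classes if (y_true == c).any()]
    return float(np.mean(per_class))

def harmonic_mean(S, U):
    """H = 2 * S * U / (S + U) over seen (S) and unseen (U) accuracy."""
    return 2.0 * S * U / (S + U) if (S + U) > 0 else 0.0
```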

4.3 Baselines

Open-Set Domain Adaptation

PGL (2020 ICML) [luo2020progressive]

TIM (2020 CVPR) [kundu2020towards]

AOD (2019 ICCV) [feng2019attract]

SAT (2019 CVPR) [liu2019separate]

USBP (2018 ECCV) [saito2018open]

ANMC (2020 IEEE TMM) [shermin2020adversarial]

Zero-Shot Learning

(ZSL)

(GZSL)

TF-CLSGAN (2020 ECCV) [narayan2020latent]

FFT (2019 ICCV) [zhu2019learning]

CADA-VAE (2019 CVPR) [schonfeld2019generalized]

GDAN (2019 CVPR) [Huang_2019_CVPR]

LisGAN (2019 CVPR) [Li19Leveraging]

Cycle-WGAN (2018 ECCV) [felix2018multi]

(Transductive Generalized Zero-Shot Learning)

4.4 Implementation

We use the binary attributes of AwA2 for the corresponding classes as the semantic descriptions. An ImageNet pre-trained ResNet-50 is adopted as the backbone, and we take the output before the last fully connected layer as the features $\mathbf{X}_{s/t}$ [deng2009imagenet, he2016deep]. $G_{Z}(\cdot)$ is a two-layer fully connected network with a hidden layer of dimension 1024 and an output feature dimension of 512. $C(\cdot)$ and $D(\cdot)$ are both two-layer fully connected classifiers with a hidden layer of dimension 256; the output dimension of $C(\cdot)$ is $C_{s}+1$, while $D(\cdot)$ outputs two classes. $G_{A}(\cdot)$ is a two-layer network with a hidden layer of dimension 256 followed by ReLU activation, and its output dimension equals the semantic attribute dimension, followed by a Sigmoid function. We use the cosine distance for the prototypical classification, while all other distances in the paper are Euclidean.
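The two distance functions mentioned here might look as follows in PyTorch; this is a sketch of our understanding, not the released code. The cosine variant can be passed to the prototypical-classification sketch of Section 3.2.1 via its `dist_fn` argument.

```python
import torch
import torch.nn.functional as F

def cosine_distance(x, protos):
    """1 - cosine similarity; used for the prototypical classification."""
    return 1.0 - F.normalize(x, dim=1) @ F.normalize(protos, dim=1).t()

def euclidean_distance(x, protos):
    """Plain Euclidean distance; used for all other distance terms."""
    return torch.cdist(x, protos)
```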

4.5 Results

4.6 Ablation Study

4.7 Visualization

5 Conclusion