
Towards Improved Proxy-based Deep Metric Learning via Data-Augmented Domain Adaptation

Li Ren, Chen Chen, Liqiang Wang, Kien Hua
Abstract

Deep Metric Learning (DML) plays an important role in modern computer vision research, where we learn a distance metric for a set of image representations. Recent DML techniques utilize the proxy to interact with the corresponding image samples in the embedding space. However, existing proxy-based DML methods focus on learning individual proxy-to-sample distance while the overall distribution of samples and proxies lacks attention. In this paper, we present a novel proxy-based DML framework that focuses on aligning the sample and proxy distributions to improve the efficiency of proxy-based DML losses. Specifically, we propose the Data-Augmented Domain Adaptation (DADA) method to adapt the domain gap between the group of samples and proxies. To the best of our knowledge, we are the first to leverage domain adaptation to boost the performance of proxy-based DML. We show that our method can be easily plugged into existing proxy-based DML losses. Our experiments on benchmarks, including the popular CUB-200-2011, CARS196, Stanford Online Products, and In-Shop Clothes Retrieval, show that our learning algorithm significantly improves the existing proxy losses and achieves superior results compared to the existing methods. Our code is available at https://github.com/Noahsark/DADA

Figure 1: The intuition of our Data-Augmented Domain Adaptation (DADA). The classes are labeled with unique colors. The initial distribution gap between the data samples and corresponding proxies causes ambiguity for proxy-based deep metric learning. Our proposed method solves this problem by aligning the data samples and proxies, assuming they are from different data domains. We further augment the data to a dense manifold with mixed features to support this alignment.

Introduction

The fundamental task of Deep Metric Learning (DML) focuses on learning deep representation with a known similarity metric. DML is a crucial topic in computer vision since it has a wide range of applications, including image retrieval (Lee, Jin, and Jain 2008; Yang et al. 2018; Ren et al. 2021), person re-identification (Yi et al. 2014; Wojke and Bewley 2018; Dai et al. 2019), and image localization (Lu et al. 2015; Ge et al. 2020). Modern DML techniques utilize deep neural networks (DNN) to project image samples into a hidden space where similar data points are grouped within short distances while the dissimilar points are separated. The majority of DML approaches focus on optimizing the similarities between pairwise samples with various loss functions, ranging from contrastive losses (Hadsell, Chopra, and LeCun 2006), triplet losses (Schroff, Kalenichenko, and Philbin 2015) to cross-entropy losses (Boudiaf et al. 2020). With the increasing number of samples in the deep learning tasks, the basic pair or triplet losses face the difficulty of high computational complexity. Some approaches select informative samples by mining the hard or semi-hard samples (Wu et al. 2017; Katharopoulos and Fleuret 2018) while another group is devoted to comparing the sample clusters (Oh Song et al. 2017) or the statistics of the samples (Rippel et al. 2016).

Unlike the pair-based DML methods, the proxy-based approaches try to learn a group of trainable vectors, named proxy, instead of sweeping all sample pairs within the mini-batch or cluster (Movshovitz-Attias et al. 2017; Kim et al. 2020). Thus, the proxies capture the semantic information about the classes and optimize the uninformative sample-sample comparison with the proxy-sample relations. Based on the efficient proxy-sample distance metrics, later works further select the most informative proxy (Zhu et al. 2020) or assign each class with multiple proxies (Qian et al. 2019) to capture the intra-class structures. However, those existing proxy-based approaches simply guide the proxies by measuring their similarity with data samples where the learning process still faces a fundamental problem: the colossal distribution gap between the proxies and the data samples, since the proxies are initially sampled from a normal distribution that does not contain any semantic information. The distribution gap would slow the convergence speed and cause ambiguity and bias in the learning process. Initializing the proxy with representations of the data sample is one straightforward solution to this problem. However, the distribution of proxies still differs dramatically between the early and late training stages due to the poor quality of sample representations at the early stage. Additionally, it takes a significant amount of extra time and space to calculate the representations for every class in each iteration.

In this paper, we introduce a novel framework to solve these problems by aligning the distributions of the proxies and the data samples (as illustrated in Figure 1). Specifically, we utilize Adversarial Domain Adaptation (Wang and Deng 2018) techniques to minimize their distribution gap. To align those distributions, we propose a domain-level discriminator, which is a classifier to separate their domain properties. Note that the single domain discriminator would cause mode collapse (Goodfellow et al. 2020; Che et al. 2017) where the majority of data points are constrained to a local area so that their discriminative information is lost. To endorse their discriminative information, we leverage one additional category-level discriminator to evaluate the consistency of their class properties. We show that with these discriminators, the adversarial training signal can efficiently align the distributions of the data samples and the proxies.

However, there are still two difficulties in learning the distribution of the proxy space: (1) the limited number and diversity of the proxies and (2) the large initial gap between the proxies and the data samples. The limited number of proxies makes it difficult for the discriminators to capture the inter-class manifold structure, and the large domain gap further hinders their learning efficiency. To overcome these challenges, we propose a novel data-augmented domain as a bridge, where the data samples and the proxies are evenly mixed to form an intermediate domain. This domain contains rich mixed samples holding information and statistics from both sides. We also propose to create mixture samples within the same categories to increase the density of the manifold. We demonstrate the mechanisms of our method in Figure 2. Our experiments show that the proposed method can be easily plugged into existing proxy-based losses to boost their performance dramatically. Our main contributions are three-fold:

  • We propose a novel adversarial learning framework to optimize the existing proxy-based DML by aligning the overall distributions of the data samples and the proxies at both domain and category levels.

  • We propose an additional data-augmented domain that contains mixup representations from both sides to further bridge the distribution gap. We show that our combined discriminators efficiently guide the proxies and the data samples to a hidden space under the same distribution, in which the proxy-based loss significantly increases its learning efficiency.

  • Our experiments demonstrate the effectiveness of our adversarial adaptation method on the image data samples and the proxies. We show that our approach increases the performance of existing proxy-based DML loss by a large margin, and our best result outperforms the state-of-the-art methods on four popular benchmarks.

Figure 2: The mechanisms of our adversarial learning. Each class is labeled with a unique color. Left: the initial space. Mid: the training mechanisms and progress of our proposed method. Right: the adapted space after training. In the discriminator training phase, the surface boundaries of the classifiers are trained to discriminate the domains with the Domain-level Discriminators and the sample classes with the Category-level Discriminators. In the generator training phase, the adversarial learning signals push the samples and proxies to fool the Domain-level Discriminators while the class predictions from the Category-level Discriminators are maintained.

Related Work

Pair-based DML. Metric learning in computer vision aims to learn a metric that measures the distance between a pair of image samples. Initially, the image samples inside and outside a class are regarded as positive and negative samples, and they are learned and projected to a low-dimensional space (Hadsell, Chopra, and LeCun 2006; Oh Song et al. 2016). The samples in different classes are paired and measured with the contrastive loss (Chopra, Hadsell, and LeCun 2005; Hadsell, Chopra, and LeCun 2006). To further compare the ranking relation between pairs of samples, an additional sample is selected as an anchor to compare with both positive and negative samples with the triplet loss (Weinberger and Saul 2009; Wang et al. 2014; Cheng et al. 2016; Hermans, Beyer, and Leibe 2017), where the positive sample is ensured to be closer than the negative samples. Based on the triplet loss, Sohn et al. (2016) propose SoftMax cross-entropy to compare the group of pairs to improve pair sampling.

The computational cost of these pair-based works is always high due to the workload of comparing each sample with all other samples within a given batch. Additionally, these methods reveal sensitivity to the size of the batch, where their performance may significantly drop if the size is too small.

Proxy-based DML. To further accelerate the sampling and clustering process, Movshovitz-Attias et al. (2017) leverage the proxy, a group of learnable representations, to compare data samples via the Neighbourhood Component Analysis (NCA) loss (Roweis, Hinton, and Salakhutdinov 2004). The motivation is to set image samples as anchors to compare with proxies of different classes instead of corresponding samples to reduce sampling times. Teh et al. (2020) further improve ProxyNCA by scaling the gradient of proxies. Zhu et al. (2020) propose to sample the most informative negative proxies to improve the performance, while Kim et al. (2020) set the proxies as anchors instead of the samples to learn the inter-class structure. Yang et al. (2022) develop a hierarchy-based proxy loss to boost learning efficiency. Roth et al. (2022) regularize the distribution of samples around the proxies to follow a non-isotropic distribution. In contrast to these methods that compare a single sample-proxy pair, our method further refines the manifold structure by aligning the whole distributions of proxies and image samples via a novel adversarial domain adaptation framework.

Domain Adaptation and Adversarial Learning. Domain Adaptation initially aims to solve the lack of labeled data where the learned feature is domain-invariant so that classifiers can be easily shifted to the new data distribution. The basic idea is to match the feature distributions to decrease their domain shift between the source and target datasets (Quiñonero-Candela et al. 2008; Torralba and Efros 2011). One important branch of domain adaptation is Adversarial Learning (Goodfellow et al. 2020; HassanPour Zonoozi and Seydi 2022), where two or more models take part in the min-max game to generate domain-invariant features.

Ganin et al. (2015) first generate domain invariant features with adversarial training on neural networks. Tzeng et al. (2017) improve the discriminator that does not share the weight with the feature generator. Pei et al. (2018) utilize multiple discriminators assigned for each class to improve performance. Saito et al. (2018) minimize the prediction discrepancy of two discriminators on the target domain, while Lee et al. (2019) improve the method to compare their sliced Wasserstein distance instead. The primary application of adversarial learning is to produce synthetic textual or image data (Kingma and Welling 2013; Radford, Metz, and Chintala 2015; Isola et al. 2017). Ren et al. (2018; 2019) also applied this technique to enhance the quality of image captioning.

Recent studies have investigated the application of domain adaptation in image or textual retrieval tasks (Laradji and Babanezhad 2020; Pinheiro 2018). Wang et al. (2017) employ domain adaptation to align image and textual data using a single discriminator, whereas Ren et al. (2021) utilize multiple discriminators to obtain improved performance. In contrast to previous efforts, we propose aligning the distributions of data representations and proxies within the same image modality.

Proposed Method

We propose a new framework to close the gap between the distributions of the data samples and the proxies for proxy-based DML losses that are already in place. We utilize the adversarial domain adaptation technique to transfer data samples and proxies to domain invariant feature space. To overcome the limitation of the number of proxies, we also conduct a novel strategy to augment data as a bridge between the samples and proxies, which demonstrates a smooth learning process.

Preliminary

Deep Metric Learning (DML). Consider a set of data samples $\mathcal{S}=\{I_{i},y_{i}\}_{i=1}^{N}$ with raw images $I_{i}$ and their corresponding class labels $y_{i}\in\{1,\dots,C\}$; we learn a projection function $f_{G}:\mathcal{S}\stackrel{f}{\rightarrow}\mathcal{X}$, which projects the input data samples to a hidden embedding space (or metric space) $\mathcal{X}$. We define the projected feature set as $X=\{x_{i}\in\mathbb{R}^{d}\}_{i=1}^{N}$. The primary goal of DML is to refine the projection function $f_{G}(\cdot)$, which is usually constructed with a convolutional neural network (CNN) as the backbone, to generate projected features that can be easily measured with a defined distance metric $d(x_{i},x_{j})$ based on the semantic similarity between samples $I_{i}$ and $I_{j}$. Here we adopt the cosine similarity as the distance metric $d(\cdot)$. Before delivering features to any loss, we use L2 normalization to eliminate the effect of differing magnitudes.
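
As a minimal sketch of the normalization and similarity measure described above, the following NumPy snippet L2-normalizes a batch of features and computes their pairwise cosine similarity. The function names and the `eps` guard are illustrative assumptions and are not taken from the released code.

```python
import numpy as np

def l2_normalize(features, eps=1e-12):
    """Scale each row to unit L2 norm (eps is an assumed guard against zero rows)."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    return features / np.maximum(norms, eps)

def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    return l2_normalize(a) @ l2_normalize(b).T

# Two toy 2-d features with different magnitudes.
x = np.array([[3.0, 4.0], [1.0, 0.0]])
sim = cosine_similarity_matrix(x, x)
```

After normalization the magnitude of each feature no longer influences the similarity; only the direction does.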

Proxy-based DML. To boost the learning efficiency, a group of DML methods pre-define a set of learnable representations $P=\{p_{i}\in\mathbb{R}^{d}\}_{i=1}^{C}$, named proxies, to represent subsets or categories of data samples. Typically there is one proxy for each class so that the number of proxies is the same as the number of classes $C$. The proxies are optimized together with the other network parameters. The first proxy-based method, Proxy-NCA (Movshovitz-Attias et al. 2017), and its improved version Proxy-NCA++ (Teh, DeVries, and Taylor 2020) utilize the Neighborhood Component Analysis (NCA) (Goldberger et al. 2004) loss to conduct this optimization. The later Proxy-Anchor (PA) loss (Kim et al. 2020) inversely sets the proxy as the anchor and measures all proxies for each mini-batch of samples. The PA loss $\mathcal{L}_{proxy}(X,P)$ can be presented as

$$\begin{aligned} \mathcal{L}_{proxy}(X,P)&=\frac{1}{\left|P^{+}\right|}\sum_{p\in P^{+}}\log\left(1+\sum_{x\in X_{p}^{+}}e^{-\tau d(x,p)+\delta}\right)\\ &+\frac{1}{|P|}\sum_{p\in P}\log\left(1+\sum_{x\in X_{p}^{-}}e^{\tau d(x,p)+\delta}\right)\end{aligned} \tag{1}$$

where $P^{+}$ denotes the set of proxies that have at least one positive sample in the mini-batch; $X_{p}^{+}$ denotes the set of positive samples for a proxy $p$; $X_{p}^{-}$ is its complement set; $\tau$ is the scale factor; and $\delta$ is the margin. Since PA updates all proxies for each mini-batch, the model has higher learning efficiency in capturing the structure of samples beyond the mini-batches. We adopt these two fundamental proxy-based losses (PNCA++ and PA), which achieve competitive results, as our baselines.
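
To make the two terms of Eq. 1 concrete, the sketch below evaluates the loss exactly as it is written above, using cosine similarity for $d(x,p)$ on L2-normalized inputs. The function name and the default values `tau=32.0` and `delta=0.1` are assumptions chosen for illustration; they are not specified in this section.

```python
import numpy as np

def proxy_anchor_loss(x, labels, proxies, tau=32.0, delta=0.1):
    """Eq. 1 on L2-normalized x (n, d) and proxies (C, d); labels are ints in [0, C)."""
    d = x @ proxies.T                                   # (n, C) similarities d(x, p)
    pos_mask = np.zeros_like(d, dtype=bool)
    pos_mask[np.arange(len(labels)), labels] = True     # x is positive for its own proxy
    neg_mask = ~pos_mask

    # Inner sums over X_p^+ and X_p^-, computed per proxy (column-wise).
    pos_sum = np.where(pos_mask, np.exp(-tau * d + delta), 0.0).sum(axis=0)
    neg_sum = np.where(neg_mask, np.exp(tau * d + delta), 0.0).sum(axis=0)

    has_pos = pos_mask.any(axis=0)                      # proxies in P^+
    pos_term = np.log1p(pos_sum[has_pos]).sum() / has_pos.sum()
    neg_term = np.log1p(neg_sum).sum() / proxies.shape[0]
    return pos_term + neg_term

proxies = np.eye(2)
x = np.eye(2)
aligned = proxy_anchor_loss(x, np.array([0, 1]), proxies)
misaligned = proxy_anchor_loss(x, np.array([1, 0]), proxies)
```

In this toy case the loss is small when every sample coincides with its own proxy and large when the labels are swapped, which is the behavior the two terms are meant to produce.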

Domain Data Augmentation

To reduce the distribution gap between the data samples and the proxies, we transform the proxy-based DML into a domain adaptation problem. We regard the data samples as data points in the source domain, while the initialized proxies are data points in the target domain. We note that the number of proxies in the target domain is especially limited compared to the data samples because the basic proxy method only assigns a single proxy to each class. The imbalance between samples and proxies would cause learning biases in modeling the distributions. Also, the proxies are initialized from a normal distribution and do not contain any related semantic information, which also causes difficulty in aligning their distribution to the data sample domain.

To overcome these difficulties, we propose a novel data augmentation strategy to create an intermediate domain that balances the number of data points for domain adaptation. Specifically, we interpolate the space with mixed features from the data set $X$ and the proxy set $P$. For each data sample $x_{i}$ and its corresponding proxy $p_{i}$, we create a mixed feature $\hat{d}_{i}$:

$$\hat{d}_{i}=\{\lambda x_{i}+(1-\lambda)p_{i}\}, \tag{2}$$

where $\lambda\sim Beta(\alpha,\beta)$ is the linear interpolation coefficient sampled from a Beta distribution whose parameters $\alpha>0$ and $\beta>0$ determine its probability density function. The new data contain semantic information from both the data samples and the proxies and share their distribution statistics. Therefore, pushing $\hat{d}$ is equivalent to pushing both the samples $x$ and the proxies $p$, and the distribution of $\hat{d}$ is closer to the data sample domain than that of the original proxies.
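
The interpolation in Eq. 2 can be sketched in a few lines of NumPy. This is an illustrative reading that draws a single $\lambda$ per mini-batch; the function name and the default `alpha=beta=1.0` are assumptions, since the section does not fix these values.

```python
import numpy as np

def mix_samples_with_proxies(x, labels, proxies, alpha=1.0, beta=1.0, rng=None):
    """Eq. 2: interpolate each sample with the proxy of its own class."""
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, beta)          # one coefficient per batch (assumption)
    class_proxies = proxies[labels]      # proxy p_i matching each sample x_i
    d_hat = lam * x + (1.0 - lam) * class_proxies
    return d_hat, lam

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))              # 4 toy embeddings of dimension 8
proxies = rng.normal(size=(3, 8))        # one proxy for each of 3 classes
labels = np.array([0, 2, 1, 0])
d_hat, lam = mix_samples_with_proxies(x, labels, proxies, rng=rng)
```

Each mixed feature lies on the segment between a sample and its proxy, so a gradient applied to it reaches both endpoints.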

In addition, we propose extending the number of training instances by augmenting the data-proxy pairs within the same class. For each pair of samples $(x_{i},x_{j})$ and their corresponding augmented data $(\hat{d}_{i},\hat{d}_{j})$ inside the mini-batch, we propose the following mixing:

$$\begin{aligned} \tilde{x}_{i}&=\{\mu_{1}x_{i}+(1-\mu_{1})x_{j}\}\\ \tilde{d}_{i}&=\{\mu_{2}\hat{d}_{i}+(1-\mu_{2})\hat{d}_{j}\}, \end{aligned} \tag{3}$$

where $\mu_{1},\mu_{2}$ are also sampled from a $Beta$ distribution. Then we mix the new samples $\tilde{x}_{i}$ and $\tilde{d}_{i}$ inside the mini-batch to ensure that the number of data samples with the same label satisfies $n\geq 2$. Combined with the original mini-batch, the augmented data sample set and the augmented proxy set are denoted as $\tilde{X}=X\cup\{\tilde{x}_{1},\tilde{x}_{2}\cdots\}$ and $\tilde{D}=\{\hat{d}_{1},\hat{d}_{2}\cdots\}\cup\{\tilde{d}_{1},\tilde{d}_{2}\cdots\}$, and the size of the mini-batch is extended accordingly. We then normalize the composed features in $\tilde{X}$ and $\tilde{D}$ with L2 normalization to constrain them to a unit hypersphere embedding space where the magnitude is fixed to 1.
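
A hedged sketch of the within-class mixing in Eq. 3 follows. How the partner $x_{j}$ is chosen is not stated here, so the snippet assumes a random partner with the same label inside the mini-batch (possibly the sample itself); the function name is also hypothetical. The same routine applies to $\hat{d}$ by passing the mixed features instead of the samples.

```python
import numpy as np

def mix_within_class(features, labels, rng, a=1.0, b=1.0):
    """Eq. 3: mix each feature with a same-class partner, then L2-normalize."""
    mu = rng.beta(a, b)
    # Assumed pairing rule: a random batch member that shares the label.
    partners = np.array([rng.choice(np.flatnonzero(labels == y)) for y in labels])
    mixed = mu * features + (1.0 - mu) * features[partners]

    augmented = np.concatenate([features, mixed], axis=0)   # original + mixed
    augmented_labels = np.concatenate([labels, labels])
    norms = np.linalg.norm(augmented, axis=1, keepdims=True)
    return augmented / np.maximum(norms, 1e-12), augmented_labels

rng = np.random.default_rng(0)
features = rng.uniform(0.1, 1.0, size=(3, 4))
labels = np.array([0, 0, 1])
x_tilde, y_tilde = mix_within_class(features, labels, rng)
```

Even the class that had a single sample in the batch now has two entries, which matches the requirement that every label appears at least twice.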

Domain-level Discriminator

Based on the augmented data, our goal is to refine the sets $\tilde{X}$, $\tilde{D}$ and the original proxy set $P$ into domain-invariant representations that share the same distribution to help the proxy-based losses. We follow the principal idea of adversarial domain adaptation (Ganin et al. 2016) to estimate the domain divergence by learning a domain-level discriminator. Specifically, we learn a classifier $f_{D}(\cdot)$ that minimizes the risk of domain prediction (predicting whether the data comes from a particular domain) among the sets $\tilde{X}$, $\tilde{D}$ and $P$.

Generally, we label the data from a specific domain with a one-hot label as the prediction target. Since we have three different domains, including the augmented data domain, and our labeling space is symmetric, we simply assume that the features $\tilde{x}\in\tilde{X}$ are labeled as $y_{0}=\overline{001}$, $\tilde{d}\in\tilde{D}$ are labeled as $y_{1}=\overline{010}$, and the initial proxies $P$ are labeled as $y_{2}=\overline{100}$ for convenience. Specifically, we implement the domain classifier $f_{D}(\cdot)$ as an MLP with a single hidden layer and a ReLU function. The hidden layer is then projected to a 3-dimensional head as the logit prediction of the domains. To optimize $f_{D}(\cdot)$ with a low prediction risk on the labeling space, we define the cross-entropy objective $\mathcal{L}_{adv}$ as follows

$$\begin{aligned} &\mathcal{L}_{adv}(\tilde{X},\tilde{D},P)=\sum_{i}^{\tilde{N}}\mathcal{L}_{ce}(f_{D}(\tilde{x}_{i}),y_{0})\\ &+\sum_{i}^{\tilde{N}}\mathcal{L}_{ce}(f_{D}(\tilde{d}_{i}),y_{1})+\sum_{i}^{C}\mathcal{L}_{ce}(f_{D}(p_{i}),y_{2})\end{aligned} \tag{4}$$

where $\mathcal{L}_{ce}$ is the cross-entropy loss and $\tilde{N}$ is the total number of samples after the data augmentation. The parameters of the classifier $f_{D}(\cdot)$ are optimized to minimize the adversarial loss $\mathcal{L}_{adv}$ in training. Recall that the feature $x$ is generated from the projection function $f_{G}(\cdot)$. Thus, the parameters of the generator $f_{G}(\cdot)$ are optimized in the opposite direction to fool the discriminator $f_{D}(\cdot)$. Since $\tilde{D}$ in the target domain contains features that are mixed from $X$ and the proxies $P$, optimizing $\tilde{D}$ equals optimizing the generator $f_{G}(\cdot)$ in the source domain while updating the original proxies $P$. Thus, the adversarial learning signal of $\mathcal{L}_{adv}$ helps both the generator $f_{G}(\cdot)$ and the original proxies $P$ to maintain domain-invariant representations that fool the classifier.
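
The following NumPy sketch illustrates Eq. 4 with a one-hidden-layer ReLU MLP and a 3-way head. It only evaluates the forward loss; the parameter layout, the function names, and the mapping of the three domains to class indices 0, 1, 2 are assumptions made for illustration.

```python
import numpy as np

def mlp_logits(features, w1, b1, w2, b2):
    """Single hidden layer with ReLU, projected to 3 domain logits."""
    hidden = np.maximum(features @ w1 + b1, 0.0)
    return hidden @ w2 + b2

def cross_entropy_sum(logits, target):
    """Summed cross-entropy when every row shares the same target class."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[:, target].sum()

def adversarial_loss(x_aug, d_aug, proxies, params):
    """Eq. 4: samples -> domain 0, mixed features -> domain 1, proxies -> domain 2."""
    return (cross_entropy_sum(mlp_logits(x_aug, *params), 0)
            + cross_entropy_sum(mlp_logits(d_aug, *params), 1)
            + cross_entropy_sum(mlp_logits(proxies, *params), 2))

dim, hidden_dim = 4, 5
params = (np.zeros((dim, hidden_dim)), np.zeros(hidden_dim),
          np.zeros((hidden_dim, 3)), np.zeros(3))
loss = adversarial_loss(np.ones((3, dim)), np.ones((3, dim)), np.ones((2, dim)), params)
```

With all-zero weights the discriminator predicts a uniform distribution over the three domains, so each of the 8 inputs contributes $\log 3$. This is also the value the generator pushes toward when the domains become indistinguishable.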

Category-level Discriminator

One drawback of the domain-level discriminator described above is that the discriminative information, especially the inter-class correlation, is ignored in the optimization process. Losing the discriminative information will cause all data points to be concentrated on a local area or a surface, which would cause inter-class ambiguity and confuse the metric learning losses. To solve this problem, we further propose a category-level discriminator that learns to predict the class of data samples and compares the discrepancy of predictions between the data samples and mixture proxies.

Specifically, we optimize a classifier $f_{C}(\cdot)$ with the feature generator $f_{G}(\cdot)$ to predict the category label $Y=\{y_{0},y_{1},\dots\}$ from the mixture data samples $\tilde{X}$ with the classification loss $\mathcal{L}_{cls}(\tilde{X},Y)$ as

$$\mathcal{L}_{cls}(\tilde{X},Y)=\frac{1}{\tilde{N}}\sum_{i}^{\tilde{N}}\mathcal{L}_{ce}(f_{C}(\tilde{x}_{i}),y_{i}). \tag{5}$$

The cross-entropy loss $\mathcal{L}_{ce}$ provides a supervised learning signal to $f_{G}(\cdot)$ to maintain the discriminative information during the DML training process.

We note that the data samples that share the distributions would also share the labeling space with the target proxy domain. To further align the distributions, we propose to constrain the samples from the source domain and the augmented data from the target domain to have a low discrepancy of predictions from our category classifier $f_{C}(\cdot)$. Thus, one additional goal of $f_{C}(\cdot)$ is to learn the maximized discrepancy of the category prediction between the data samples $\tilde{X}$ and the mixture proxies $\tilde{D}$, while $\tilde{D}$ is later optimized to minimize this discrepancy.

To measure the discrepancy of the category probabilities, we empirically adopt the discrepancy introduced in (Chen et al. 2022) that utilizes the Nuclear-norm Wasserstein Distance (NWD). The NWD is demonstrated to be the upper bound of the Frobenius-norm, which estimates the correlations of the predictions (Cui et al. 2020). Thus, we compare the NWD between the logistic predictions of $f_{C}(\cdot)$ from the augmented samples $\tilde{X}$ and data $\tilde{D}$. The loss $\mathcal{L}_{d}(\tilde{X},\tilde{D})$, which measures the NWD, can be described as

$$\mathcal{L}_{d}(\tilde{X},\tilde{D})=\frac{1}{\tilde{N}}\Big(\sum_{i}^{\tilde{N}}||f_{C}(\tilde{X})||_{\star}-\sum_{i}^{N}||f_{C}(\tilde{D})||_{\star}\Big), \tag{6}$$

where $||x||_{\star}=\sum\sigma(x)$ denotes the nuclear norm of $x$, which is defined as the sum of its singular values.
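
As an illustration of the nuclear-norm discrepancy, the sketch below computes the nuclear norm of two prediction matrices via their singular values and takes the normalized difference. It reads Eq. 6 at the batch level, that is, one nuclear norm per prediction matrix divided by the batch size; this reading and the function names are assumptions.

```python
import numpy as np

def nuclear_norm(matrix):
    """Sum of the singular values of a 2-d matrix."""
    return np.linalg.svd(matrix, compute_uv=False).sum()

def nwd_discrepancy(pred_x, pred_d):
    """Batch-level reading of Eq. 6: (||f_C(X)||_* - ||f_C(D)||_*) / N."""
    n = pred_x.shape[0]
    return (nuclear_norm(pred_x) - nuclear_norm(pred_d)) / n

# Confident, diverse predictions (one class per row) versus uniform predictions.
confident = np.eye(3)
uniform = np.full((3, 3), 1.0 / 3.0)
gap = nwd_discrepancy(confident, uniform)
```

The confident matrix has three singular values equal to 1, while the uniform matrix has rank one with a single singular value of 1, so the discrepancy here is $(3-1)/3$. Predictions that are both confident and diverse yield a larger nuclear norm.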

The Combined Loss and Training Progress

We adopt the paradigm of adversarial learning to alternately update the gradients of our feature generator $f_{G}(\cdot)$ and the discriminators $f_{D}(\cdot)$ and $f_{C}(\cdot)$ discussed above. To this end, we train our combined loss by playing the min-max game as follows,

$$\min_{f_{G},f_{C}}\{\eta(\mathcal{L}_{cls}+\max_{f_{C}}\mathcal{L}_{d})\}+(1-\eta)\min_{f_{D}}\max_{f_{G}}\mathcal{L}_{adv}, \tag{7}$$

where $\eta$ is the pre-defined hyperparameter that balances the contribution between the domain-level and category-level discriminators. Empirically, we do not set another weight between the classification loss $\mathcal{L}_{cls}$ and the discrepancy loss $\mathcal{L}_{d}$. We also need the original proxy-based loss $\mathcal{L}_{proxy}$ in Eq. 1 to perform the basic DML on the sample-proxy pairs in training. Note that the augmented data set $\tilde{D}$ is only used for the domain adaptation process; the original $\mathcal{L}_{proxy}$ only operates on $\tilde{X}$ and the original proxies $P$. Thus, our combined training process can be described as the following two sub-processes:

$$\begin{aligned} (\theta_{f_{D}},\theta_{f_{C}})&=\arg\min_{f_{D},f_{C}}\{\eta(\mathcal{L}_{cls}-\mathcal{L}_{d})+(1-\eta)\mathcal{L}_{adv}\}, &(8)\\ (\theta_{f_{G}},P)&=\arg\min_{f_{G},P,\tilde{D}}\{\eta(\mathcal{L}_{cls}+\mathcal{L}_{d})-(1-\eta)\mathcal{L}_{adv}+\gamma\mathcal{L}_{proxy}\}, &(9) \end{aligned}$$

where the parameters $\theta_{f_{D}}$ and $\theta_{f_{C}}$ are updated in the first phase, and $\theta_{f_{G}}$ and the proxies $P$ are updated with $\tilde{D}$ in the second phase. Although gradient reversal layers are commonly used to achieve adversarial training in earlier domain adaptation works, we empirically find that separate training phases are more practical when searching for stable training parameters. The full training procedure is given in Algorithm 1.
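
The sign conventions of the two phases in Eq. 8 and Eq. 9 can be summarized with two small helper functions that combine already-computed scalar losses. The function names and the numeric values in the example are purely illustrative assumptions.

```python
def discriminator_objective(l_cls, l_d, l_adv, eta):
    """Eq. 8: minimized with respect to the discriminators f_D and f_C."""
    return eta * (l_cls - l_d) + (1.0 - eta) * l_adv

def generator_objective(l_cls, l_d, l_adv, l_proxy, eta, gamma):
    """Eq. 9: minimized with respect to the generator f_G and the proxies P."""
    return eta * (l_cls + l_d) - (1.0 - eta) * l_adv + gamma * l_proxy

# Toy scalar losses to show how the signs of L_d and L_adv flip between phases.
disc = discriminator_objective(l_cls=1.0, l_d=0.2, l_adv=2.0, eta=0.5)
gen = generator_objective(l_cls=1.0, l_d=0.2, l_adv=2.0, l_proxy=3.0, eta=0.5, gamma=1.0)
```

The discrepancy $\mathcal{L}_{d}$ enters with a negative sign for the discriminators and a positive sign for the generator, while $\mathcal{L}_{adv}$ does the opposite, which is what makes the two phases adversarial.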

| Method | Reference | Arch/Dim | CUB-200 R@1 | CUB-200 R@2 | CUB-200 R@4 | CARS-196 R@1 | CARS-196 R@2 | CARS-196 R@4 | SOP R@1 | SOP R@10 | SOP R@100 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| PNCA (Movshovitz-Attias et al. 2017) | CVPR17’ | BN/512 | 49.2 | 61.9 | 67.9 | 73.2 | 82.4 | 86.4 | 73.7 | - | - |
| ProxyGML† (Zhu et al. 2020) | NeurIPS20’ | BN/512 | 66.6 | 77.6 | 86.4 | 85.5 | 91.8 | 95.3 | 78.0 | 90.6 | 96.2 |
| DiVA (Milbich et al. 2020) | ECCV20’ | R50/512 | 69.2 | 79.3 | - | 87.6 | 92.9 | - | 79.6 | 91.2 | - |
| S2SD (Roth et al. 2021) | ICML21’ | R50/512 | 70.1 | 79.7 | 71.6 | 89.5 | 93.9 | 72.9 | 80.0 | 91.4 | - |
| DCML-Proxy† (Zheng et al. 2021a) | CVPR21’ | R50/512 | 65.2 | 76.4 | 84.8 | 81.2 | 89.8 | 94.6 | - | - | - |
| DCML-MDW (Zheng et al. 2021a) | CVPR21’ | R50/512 | 68.4 | 77.9 | 86.1 | 85.2 | 91.8 | 96.0 | 79.8 | 90.8 | 95.8 |
| DRML (Zheng et al. 2021b) | ICCV21’ | BN/512 | 68.7 | 78.6 | 86.3 | 86.9 | 92.1 | 95.2 | 71.5 | 85.2 | 93.0 |
| PA+AVSL† (Zhang et al. 2022) | CVPR22’ | R50/512 | 71.9 | 81.7 | 88.1 | 91.5 | 95.0 | 97.0 | 79.6 | 91.4 | 96.4 |
| PA+NIR† (Roth, Vinyals, and Akata 2022) | CVPR22’ | R50/512 | 69.1 | 79.6 | - | 87.7 | 92.5 | - | 80.7 | 91.5 | - |
| HIST (Lim et al. 2022) | CVPR22’ | R50/512 | 71.4 | 81.1 | 88.1 | 89.6 | 93.9 | 96.4 | 81.4 | 92.0 | 96.7 |
| DAS (Liu et al. 2022) | ECCV22’ | R50/512 | 69.2 | 79.3 | 87.0 | 87.8 | 93.2 | 96.0 | 80.6 | 91.8 | 96.7 |
| MS+CRT (Kan et al. 2022) | NeurIPS22’ | R50/512 | 64.2 | 75.5 | 84.1 | 83.3 | 89.8 | 93.9 | 79.0 | 91.1 | 96.5 |
| ▷ PNCA++ (Teh, DeVries, and Taylor 2020) | ECCV20’ | R50/512 | 69.0 | 79.8 | 87.3 | 86.5 | 92.5 | 95.7 | 80.7 | 92.0 | 96.7 |
| PNCA+DADA(R50) | Ours | R50/512 | 71.4 | 81.1 | 87.6 | 90.5 | 93.4 | 96.8 | 81.2 | 91.8 | 96.5 |
| ▷ PA (Kim et al. 2020) | CVPR20’ | BN/512 | 68.4 | 79.2 | 86.8 | 86.1 | 91.7 | 95.0 | 79.1 | 90.8 | 96.2 |
| PA+DADA(BN) | Ours | BN/512 | 69.8 | 80.4 | 87.1 | 89.4 | 92.1 | 96.2 | 79.6 | 91.0 | 96.3 |
| ▷ PA (R50) ∗ (Kim et al. 2020) | CVPR20’ | R50/512 | 69.7 | 80.0 | 87.0 | 87.7 | 92.9 | 95.8 | 80.0 | 91.7 | 96.6 |
| PA+DADA(R50) | Ours | R50/512 | 72.9 | 81.9 | 88.3 | 92.1 | 95.2 | 97.1 | 81.0 | 92.1 | 96.2 |

Table 1: Comparison with the state-of-the-art literature on CUB200-2011 (Wah et al. 2011), CARS196 (Krause et al. 2013), and Stanford Online Products (SOP) (Oh Song et al. 2016). The works are sorted by their publication date. The Settings column shows the backbone architecture and feature dimension that we selected to compare with our proposed method. R50 stands for ResNet50, BN for InceptionBN, and GN for GoogleNet backbones. † denotes the methods based on proxy-based DML, and ▷ labels the works on which our method is based. ∗ We adopt the experimental results of PA(R50) from a third-party paper (Lim et al. 2022). Bold represents the best score.

Algorithm 1 Data-Augmented Domain Adaptation (DADA) for Proxy-based Deep Metric Learning
Input: Training set $\mathcal{S}=\{I_{i},y_{i}\}_{i=1}^{N}$
Initialization: $\theta_{f_{G}}$, $\theta_{f_{C}}$, $\theta_{f_{D}}$, and proxies $P$
while the stop criterion is not satisfied do
    Obtain a batch $\{I_{i},y_{i}\}_{i=1}^{n}$ from $S$
    Select proxies $P=\{p_{i}\}_{i=1}^{n}$ according to the labels $Y$
    Embed features $X=\{x_{i}\}_{i=1}^{n}\leftarrow f_{G}(I)$
    /* Prepare the mixture intermediate data domain */
    Sample $\lambda\sim Beta(\alpha,\beta)$
    Sample $\mu_{1},\mu_{2}\sim Beta(1.0,1.0)$
    Compose $\hat{P}\leftarrow\{\lambda X+(1-\lambda)P\}$
    Compose $\tilde{X}\leftarrow X\cup\{\mu_{1}x_{i}+(1-\mu_{1})x_{j}\}$
    Compose $\tilde{P}\leftarrow\hat{P}\cup\{\mu_{2}\hat{p}_{i}+(1-\mu_{2})\hat{p}_{j}\}$
    L2-normalize $\tilde{X}$, $\tilde{P}$ and $P$
    /* Discriminator Training Phase begin */
    for $k$ steps do
        Calculate $\Delta\theta_{f_{D}},\Delta\theta_{f_{C}}\leftarrow\eta\frac{\partial(\mathcal{L}_{cls}(\tilde{X},Y)-\mathcal{L}_{d}(\tilde{X},\tilde{P}))}{\Delta\theta_{f_{D}},\Delta\theta_{f_{C}}}$
        Calculate $\Delta\theta_{f_{D}},\Delta\theta_{f_{C}}\leftarrow(1-\eta)\frac{\partial\mathcal{L}_{adv}(\tilde{X},\tilde{P},P)}{\Delta\theta_{f_{D}},\Delta\theta_{f_{C}}}$
        Update $\theta_{f_{D}},\theta_{f_{C}}\leftarrow Adam\{\Delta\theta_{f_{D}},\Delta\theta_{f_{C}}\}$
    /* Generator Training Phase begin */
    Calculate $\Delta\theta_{f_{G}},\Delta P\leftarrow\eta\frac{\partial(\mathcal{L}_{cls}(\tilde{X},Y)+\mathcal{L}_{d}(\tilde{X},\tilde{P}))}{\Delta\theta_{f_{G}},\Delta P}$
    Calculate $\Delta\theta_{f_{G}},\Delta P\leftarrow-(1-\eta)\frac{\partial\mathcal{L}_{adv}(\tilde{X},\tilde{P},P)}{\Delta\theta_{f_{G}},\Delta P}$
    Calculate $\Delta\theta_{f_{G}},\Delta P\leftarrow\gamma\frac{\partial\mathcal{L}_{proxy}(\tilde{X},P)}{\Delta\theta_{f_{G}},\Delta P}$
    Update $\theta_{f_{G}},P\leftarrow Adam\{\Delta\theta_{f_{G}},\Delta P\}$

Experiments

We present our performance study and discuss the experimental results in this section.

Datasets and Metrics

We use the standard benchmarks CUB-200-2011 (CUB200) (Wah et al. 2011) with 11,788 bird images and 200 classes, and CARS196 (Krause et al. 2013), which contains 16,185 car images and 196 classes. We also evaluate our method on the larger Stanford Online Products (SOP) (Oh Song et al. 2016) benchmark, which includes 120,053 images with 22,634 product classes, and the In-shop Clothes Retrieval (In-Shop) (Liu et al. 2016) dataset with 25,882 images and 7,982 classes. We follow the data split that is consistent with the standard settings of existing DML works (Teh, DeVries, and Taylor 2020; Kim et al. 2020; Venkataramanan et al. 2022; Zheng et al. 2021b; Roth, Vinyals, and Akata 2022; Lim et al. 2022; Zhang et al. 2022). We adopt the Recall@K (K=1,2,4 in CUB200 and CARS196, K=1,10,100 in SOP, and K=1,10,20,30 in In-Shop) proposed in existing works to evaluate the accuracy of ranking. We also report Mean Average Precision at R (MAP@R), which combines the ideas of MAP and R-precision and is a more informative DML metric (Musgrave, Belongie, and Lim 2020).
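
For reference, a minimal NumPy sketch of the two metrics is given below. It assumes a single evaluation set in which every sample queries all the others (the In-Shop query/gallery protocol is not covered), ranks neighbors by cosine similarity, and uses hypothetical function names.

```python
import numpy as np

def rank_neighbors(embeddings):
    """Indices of all other samples, sorted by decreasing cosine similarity."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)              # exclude the query itself
    return np.argsort(-sim, axis=1)[:, :-1]     # the query sorts last; drop it

def recall_at_k(embeddings, labels, k):
    """Fraction of queries with at least one same-class sample in the top k."""
    ranked = rank_neighbors(embeddings)
    hits = (labels[ranked[:, :k]] == labels[:, None]).any(axis=1)
    return float(hits.mean())

def map_at_r(embeddings, labels):
    """MAP@R: average precision over the top R, R = same-class samples per query."""
    ranked = rank_neighbors(embeddings)
    scores = []
    for i, row in enumerate(ranked):
        r = int((labels == labels[i]).sum()) - 1
        if r == 0:
            continue                             # no positives to retrieve
        correct = (labels[row[:r]] == labels[i]).astype(float)
        precision = np.cumsum(correct) / np.arange(1, r + 1)
        scores.append((precision * correct).sum() / r)
    return float(np.mean(scores))

emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
good = np.array([0, 0, 1, 1])
bad = np.array([0, 1, 0, 1])
```

With the `good` labels every nearest neighbor shares the query's class, so both metrics reach their maximum; with the `bad` labels every nearest neighbor belongs to the other class.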

Implementation Details

We train our model on a machine with a single RTX 3090 GPU with 24GB of memory. The implementation is based on the existing RDML framework (Roth et al. 2020).

Backbones and Preprocessing. We use two standard backbones to evaluate our learning algorithm: ResNet50 (He et al. 2016) and InceptionBN (Ioffe and Szegedy 2015). Both are pre-trained on ImageNet1K (Deng et al. 2009) and are widely used in DML works for performance evaluation. During training, we resize the images to $224\times 224$ and apply random resized cropping and random horizontal flipping. In the test phase, the images are first resized to $256\times 256$ and then cropped back to $224\times 224$. A linear head embeds the features from the penultimate layer of the backbone into a 512-dimensional hidden space. We follow the standard pre-processing introduced in other deep metric learning works (Venkataramanan et al. 2022; Zheng et al. 2021b; Roth, Vinyals, and Akata 2022; Lim et al. 2022; Zhang et al. 2022). We also adopt global max and average pooling with layer normalization on the CNN backbones, as suggested by Teh et al. (Teh, DeVries, and Taylor 2020), to further improve the generalization of the features.

In-Shop Clothes Retrieval (In-Shop)

| Methods | Arch/Dim | R@1 | R@10 | R@20 | R@30 |
|---|---|---|---|---|---|
| MS (2019) | BN/512 | 89.7 | 97.9 | 98.5 | 98.8 |
| SHM (2019) | BN/512 | 90.7 | 97.8 | 98.5 | 98.8 |
| SCT (2020) | R50/512 | 90.0 | 97.5 | 98.1 | |
| XBM (2020) | BN/512 | 89.9 | 97.6 | 98.4 | 98.6 |
| IBC (2021) | R50/512 | 92.8 | 98.5 | 99.1 | 99.2 |
| PA† (2020) | BN/512 | 90.4 | 98.1 | 98.8 | 99.0 |
| PA+Mix† (2022) | R50/512 | 91.9 | 98.2 | 98.8 | |
| PNCA++† (2020) | R50/512 | 90.4 | 98.1 | 98.8 | 99.0 |
| PNCA + DADA (ours) | R50/512 | 91.7 | 98.2 | 98.6 | 98.8 |
| PA + DADA (ours) | R50/512 | 93.0 | 98.5 | 98.9 | 99.1 |

Table 2: Comparison with existing state-of-the-art DML methods on the In-Shop (Liu et al. 2016) dataset. Bold represents the best score.

| Settings | ProxyAnchor R@1 | ProxyAnchor MAP@R | ProxyNCA++ R@1 | ProxyNCA++ MAP@R |
|---|---|---|---|---|
| Baseline | 69.1 | 26.5 | 68.4 | 25.8 |
| +Aug | 69.3 (+0.2) | 26.5 (+0.0) | 68.5 (+0.1) | 25.9 (+0.1) |
| $+\mathcal{L}_{adv}$ | 70.2 (+1.1) | 27.3 (+0.8) | 69.2 (+0.8) | 26.4 (+0.6) |
| $+\mathcal{L}_{adv}$ + Aug | 70.9 (+1.8) | 27.8 (+1.3) | 69.8 (+1.4) | 26.8 (+1.0) |
| $+\mathcal{L}_{cls}$ | 69.3 (+0.2) | 27.0 (+0.5) | 68.9 (+0.5) | 26.2 (+0.4) |
| $+\mathcal{L}_{cls}+\mathcal{L}_{d}$ | 69.9 (+0.8) | 27.4 (+0.9) | 69.5 (+1.1) | 26.6 (+0.8) |
| $+\mathcal{L}_{cls}+\mathcal{L}_{d}$ + Aug | 70.4 (+1.3) | 27.8 (+1.3) | 69.4 (+1.0) | 26.7 (+0.9) |
| $+\mathcal{L}_{adv}+\mathcal{L}_{cls}$ | 71.4 (+2.3) | 28.2 (+1.7) | 69.3 (+0.9) | 27.1 (+1.3) |
| $+\mathcal{L}_{adv}+\mathcal{L}_{cls}+\mathcal{L}_{1}$ | 71.6 (+2.5) | 28.2 (+1.7) | 69.4 (+1.0) | 27.0 (+1.2) |
| $+\mathcal{L}_{adv}+\mathcal{L}_{cls}+\mathcal{L}_{d}$ | 72.0 (+2.9) | 28.9 (+2.4) | 69.9 (+1.5) | 27.7 (+1.9) |
| $+\mathcal{L}_{adv}+\mathcal{L}_{cls}+\mathcal{L}_{d}$ + Aug (ours) | 72.9 (+3.8) | 29.9 (+3.4) | 70.2 (+1.8) | 28.0 (+2.2) |

Table 3: Contribution of each component of our method and loss function on CUB200. We reproduce the results of ProxyAnchor, with a batch size of 90, and ProxyNCA++, with a batch size of 32, as the baselines of our method. Aug represents the alignment of the augmented data and samples introduced in Sec. Domain Data Augmentation. The difference in percentage points (pp) from the baseline is given in brackets.

Training Details. We optimize with Adam ($\beta_{1}=0.5,\beta_{2}=0.999$) (Kingma and Ba 2015) with a decay of $1\cdot 10^{-3}$. We set the learning rate to $1.2\cdot 10^{-4}$ for the feature generator $f_{G}(\cdot)$ and $5\cdot 10^{-4}$ for our discriminators. We adopt a learning rate of $4\cdot 10^{-2}$ for the proxies, as suggested in (Roth, Vinyals, and Akata 2022). For most of the experiments, we fix the batch size to 90 as the default setting, which is consistent with (Kim et al. 2020). Empirically, we apply batch normalization on the domain-level discriminator to reduce its correlation variance within the batch. For all experiments, the first layer of $f_{C}(\cdot)$ has 512 dimensions. For the second layer, we assign 128 dimensions for the CUB200 and CARS196 datasets, 8192 dimensions for the SOP dataset, and 4096 dimensions for the In-Shop dataset. We set $\{\eta=0.005,\gamma=0.0075\}$ for CUB200 and $\{\eta=0.01,\gamma=0.0075\}$ for CARS196. We select $\{\eta=0.01,\gamma=0.005\}$ for both the SOP and In-Shop datasets.
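For reference, the settings listed above can be collected into a small configuration object. The sketch below is only one possible way to organize them: the class name, field names, and the `config_for` helper are hypothetical and do not correspond to the released code, and the values are copied from the paragraph above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    # Shared settings (hypothetical field names; values from the text above).
    lr_generator: float = 1.2e-4      # feature generator f_G
    lr_discriminator: float = 5e-4    # domain- and category-level discriminators
    lr_proxy: float = 4e-2            # proxies
    adam_betas: tuple = (0.5, 0.999)
    decay: float = 1e-3
    batch_size: int = 90              # default used for most experiments
    fc_first_layer: int = 512         # first layer of f_C
    # Dataset-specific settings (CUB200 values as defaults).
    fc_second_layer: int = 128
    eta: float = 0.005
    gamma: float = 0.0075


# Per-dataset overrides for the second layer of f_C and the loss weights.
PER_DATASET = {
    "CUB200": dict(fc_second_layer=128, eta=0.005, gamma=0.0075),
    "CARS196": dict(fc_second_layer=128, eta=0.01, gamma=0.0075),
    "SOP": dict(fc_second_layer=8192, eta=0.01, gamma=0.005),
    "In-Shop": dict(fc_second_layer=4096, eta=0.01, gamma=0.005),
}


def config_for(dataset: str) -> TrainConfig:
    """Build the configuration for one of the four benchmarks."""
    return TrainConfig(**PER_DATASET[dataset])


print(config_for("SOP"))
```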

Quantitative Results

Comparing with Proxy Baselines. We compare the performance of our approach with the existing proxy-based metric learning methods and the recent state-of-the-art metric learning methods on the popular benchmarks introduced above (refer to Table 1). We observe that our DADA framework improves the performance of the original proxy-based DML methods (marked with $\vartriangleright$) by a large margin. Specifically, compared with the original PA method on ResNet50, our proposed PA+DADA is $3.2pp$ ($4.6\%$) higher on Recall@1 on CUB200 and $4.4pp$ ($5.0\%$) higher on Recall@1 on CARS196. On the larger datasets (SOP and In-Shop), our method also outperforms the original PA and PNCA++.

Comparing with state-of-the-art. We further compare the performance of our method with the state-of-the-art methods based on CNN backbones, as listed in Tables 1 and 2. On the CARS196 dataset, our method reaches $92.1$ on Recall@1, a $0.9pp$ improvement over the previous state-of-the-art AVSL (Zhang et al. 2022) on the ResNet50 backbone. On CUB200, our method outperforms the previous state-of-the-art AVSL by $0.6pp$ on Recall@1, $0.2pp$ on Recall@2, and $0.1pp$ on Recall@4. Our gains on SOP and In-Shop are smaller, but our results are very close to those of the previous state-of-the-art IBC (Seidenschwarz, Elezi, and Leal-Taixé 2021), CRT (Kan et al. 2022), and HIST (Lim et al. 2022) on a few metrics. The smaller improvement in recall at higher K on these two datasets is mainly due to the large number of classes (11,318 and 3,997 in the respective training splits) and the limited number of samples in each class (fewer than 10). This makes it difficult for our category-level discriminator to learn discriminative information. Nevertheless, our method still achieves performance comparable to that of the state-of-the-art methods on all metrics and outperforms other proxy-related methods on these two datasets. We will investigate techniques to overcome this limitation in future work.

Ablation Study

Contributions of the Objective Components. We conduct an ablation study to evaluate the contribution of each objective component of our proposed framework, based on both ProxyAnchor (Kim et al. 2020) and ProxyNCA++ (Teh, DeVries, and Taylor 2020) on CUB200, as listed in Table 3. We first notice that the data augmentation strategy (Aug) does not improve our baseline significantly in the absence of $\mathcal{L}_{adv}$ and $\mathcal{L}_{cls}$. This is because, without those regularization losses, Aug simply adds some redundant positive samples and the mixed features do not take part in training. We conclude that the domain-level discriminator with $\mathcal{L}_{adv}$ is more effective when the category-level discriminator with $\mathcal{L}_{cls}$ helps regularize the space and avoid inter-class ambiguity: on ProxyAnchor, the improvement rises from $+1.1pp$ on R@1 and $+0.8pp$ on MAP@R with $\mathcal{L}_{adv}$ alone to $+2.3pp$ on R@1 and $+1.7pp$ on MAP@R. We also demonstrate that the effectiveness of the category-level classifier ($+\mathcal{L}_{cls}$) can be further improved by comparing the discrepancy of class predictions ($+\mathcal{L}_{d}$) between the source data and target proxies in adversarial learning. Compared with the general L1 discrepancy distance ($+\mathcal{L}_{1}$), the proposed NWD also yields higher performance on both R@1 and MAP@R. A similar conclusion can be drawn from the results based on ProxyNCA++. Therefore, we conclude that the combination of the domain-level and category-level discriminators is more suitable for proxy-based DML than the settings with any single discriminator. We also study the impact of our hyperparameters and the combination of data groups that apply domain adaptation in the Appendix.
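The bracketed gains in Table 3 are absolute differences in percentage points (pp), which should not be confused with relative improvements in percent. As a small worked example, the sketch below computes both for the ProxyAnchor R@1 values in Table 3 (baseline 69.1, full method 72.9). The helper names are illustrative.

```python
def pp_gain(baseline, improved):
    """Absolute difference in percentage points."""
    return improved - baseline


def relative_gain(baseline, improved):
    """Relative improvement over the baseline, in percent."""
    return 100.0 * (improved - baseline) / baseline


baseline_r1, full_r1 = 69.1, 72.9  # ProxyAnchor R@1 from Table 3
print(round(pp_gain(baseline_r1, full_r1), 1))        # 3.8 (pp)
print(round(relative_gain(baseline_r1, full_r1), 1))  # 5.5 (%)
```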

Conclusion

In this paper, we present an adversarial domain adaptation method with data augmentation to optimize the hidden space of the data and the proxies. We overcome the initial distribution gap between them to boost the learning efficiency of deep metric learning. We propose to align the domains of the data and the initial proxies by optimizing two classifiers at different levels and training the embedding function and the proxies against them. To enhance the density of the manifold, we propose a strategy to construct a mixture space by mixing the features from both domains. Our experimental results on four popular deep metric learning benchmarks demonstrate that our learning method and mixed space effectively improve the learning efficiency of existing proxy-based methods. While our framework focuses on solving the challenge of proxy-based DML methods, we believe it can be easily extended to other related metric learning methods, and that it can also benefit zero-shot and self-supervised learning. These are interesting and challenging directions for future study.

References

  • Ben-David et al. (2010) Ben-David, S.; Blitzer, J.; Crammer, K.; Kulesza, A.; Pereira, F.; and Vaughan, J. W. 2010. A theory of learning from different domains. Machine learning, 79: 151–175.
  • Boudiaf et al. (2020) Boudiaf, M.; Rony, J.; Ziko, I. M.; Granger, E.; Pedersoli, M.; Piantanida, P.; and Ayed, I. B. 2020. A unifying mutual information view of metric learning: cross-entropy vs. pairwise losses. In ECCV, 548–564. Springer.
  • Che et al. (2017) Che, T.; Li, Y.; Jacob, A. P.; Bengio, Y.; and Li, W. 2017. Mode regularized generative adversarial networks. ICLR.
  • Chen et al. (2022) Chen, L.; Chen, H.; Wei, Z.; Jin, X.; Tan, X.; Jin, Y.; and Chen, E. 2022. Reusing the Task-specific Classifier as a Discriminator: Discriminator-free Adversarial Domain Adaptation. In CVPR, 7181–7190.
  • Cheng et al. (2016) Cheng, D.; Gong, Y.; Zhou, S.; Wang, J.; and Zheng, N. 2016. Person re-identification by multi-channel parts-based cnn with improved triplet loss function. In CVPR, 1335–1344.
  • Chopra, Hadsell, and LeCun (2005) Chopra, S.; Hadsell, R.; and LeCun, Y. 2005. Learning a similarity metric discriminatively, with application to face verification. In CVPR, volume 1, 539–546. IEEE.
  • Cui et al. (2020) Cui, S.; Wang, S.; Zhuo, J.; Li, L.; Huang, Q.; and Tian, Q. 2020. Towards discriminability and diversity: Batch nuclear-norm maximization under label insufficient situations. In CVPR, 3941–3950.
  • Dai et al. (2019) Dai, Z.; Chen, M.; Gu, X.; Zhu, S.; and Tan, P. 2019. Batch dropblock network for person re-identification and beyond. In ICCV, 3691–3701.
  • Deng et al. (2009) Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. Imagenet: A large-scale hierarchical image database. In CVPR, 248–255. IEEE.
  • El-Nouby et al. (2021) El-Nouby, A.; Neverova, N.; Laptev, I.; and Jégou, H. 2021. Training vision transformers for image retrieval. arXiv preprint arXiv:2102.05644.
  • Ermolov et al. (2022) Ermolov, A.; Mirvakhabova, L.; Khrulkov, V.; Sebe, N.; and Oseledets, I. 2022. Hyperbolic Vision Transformers: Combining Improvements in Metric Learning. In CVPR, 7409–7419.
  • Ganin and Lempitsky (2015) Ganin, Y.; and Lempitsky, V. 2015. Unsupervised domain adaptation by backpropagation. In ICML, 1180–1189. PMLR.
  • Ganin et al. (2016) Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; Marchand, M.; and Lempitsky, V. 2016. Domain-adversarial training of neural networks. The journal of machine learning research, 17(1): 2096–2030.
  • Ge et al. (2020) Ge, Y.; Wang, H.; Zhu, F.; Zhao, R.; and Li, H. 2020. Self-supervising fine-grained region similarities for large-scale image localization. In ECCV, 369–386. Springer.
  • Goldberger et al. (2004) Goldberger, J.; Hinton, G. E.; Roweis, S.; and Salakhutdinov, R. R. 2004. Neighbourhood components analysis. NIPS, 17.
  • Goodfellow et al. (2020) Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2020. Generative adversarial networks. Communications of the ACM, 63(11): 139–144.
  • Hadsell, Chopra, and LeCun (2006) Hadsell, R.; Chopra, S.; and LeCun, Y. 2006. Dimensionality reduction by learning an invariant mapping. In CVPR, volume 2, 1735–1742. IEEE.
  • HassanPour Zonoozi and Seydi (2022) HassanPour Zonoozi, M.; and Seydi, V. 2022. A Survey on Adversarial Domain Adaptation. Neural Processing Letters, 1–41.
  • He et al. (2016) He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR, 770–778.
  • Hermans, Beyer, and Leibe (2017) Hermans, A.; Beyer, L.; and Leibe, B. 2017. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737.
  • Ioffe and Szegedy (2015) Ioffe, S.; and Szegedy, C. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 448–456. PMLR.
  • Isola et al. (2017) Isola, P.; Zhu, J.-Y.; Zhou, T.; and Efros, A. A. 2017. Image-to-image translation with conditional adversarial networks. In CVPR, 1125–1134.
  • Kan et al. (2022) Kan, S.; Liang, Y.; Li, M.; Cen, Y.; Wang, J.; and He, Z. 2022. Coded Residual Transform for Generalizable Deep Metric Learning. NeurIPS.
  • Katharopoulos and Fleuret (2018) Katharopoulos, A.; and Fleuret, F. 2018. Not all samples are created equal: Deep learning with importance sampling. In ICML, 2525–2534. PMLR.
  • Kim et al. (2020) Kim, S.; Kim, D.; Cho, M.; and Kwak, S. 2020. Proxy anchor loss for deep metric learning. In CVPR, 3238–3247.
  • Kingma and Ba (2015) Kingma, D. P.; and Ba, J. 2015. Adam: A method for stochastic optimization. ICLR.
  • Kingma and Welling (2013) Kingma, D. P.; and Welling, M. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
  • Ko, Gu, and Kim (2021) Ko, B.; Gu, G.; and Kim, H.-G. 2021. Learning with memory-based virtual classes for deep metric learning. In CVPR, 11792–11801.
  • Krause et al. (2013) Krause, J.; Stark, M.; Deng, J.; and Fei-Fei, L. 2013. 3d object representations for fine-grained categorization. In ICCV workshop, 554–561.
  • Laradji and Babanezhad (2020) Laradji, I. H.; and Babanezhad, R. 2020. M-adda: Unsupervised domain adaptation with deep metric learning. Domain adaptation for visual understanding, 17–31.
  • Lee et al. (2019) Lee, C.-Y.; Batra, T.; Baig, M. H.; and Ulbricht, D. 2019. Sliced wasserstein discrepancy for unsupervised domain adaptation. In CVPR, 10285–10295.
  • Lee, Jin, and Jain (2008) Lee, J.-E.; Jin, R.; and Jain, A. K. 2008. Rank-based distance metric learning: An application to image retrieval. In CVPR, 1–8. IEEE.
  • Lim et al. (2022) Lim, J.; Yun, S.; Park, S.; and Choi, J. Y. 2022. Hypergraph-Induced Semantic Tuplet Loss for Deep Metric Learning. In CVPR, 212–222.
  • Liu et al. (2022) Liu, L.; Huang, S.; Zhuang, Z.; Yang, R.; Tan, M.; and Wang, Y. 2022. DAS: Densely-Anchored Sampling for Deep Metric Learning. ECCV.
  • Liu et al. (2016) Liu, Z.; Luo, P.; Qiu, S.; Wang, X.; and Tang, X. 2016. Deepfashion: Powering robust clothes recognition and retrieval with rich annotations. In CVPR, 1096–1104.
  • Lu et al. (2015) Lu, G.; Yan, Y.; Ren, L.; Song, J.; Sebe, N.; and Kambhamettu, C. 2015. Localize me anywhere, anytime: a multi-task point-retrieval approach. In ICCV, 2434–2442.
  • Milbich et al. (2020) Milbich, T.; Roth, K.; Bharadhwaj, H.; Sinha, S.; Bengio, Y.; Ommer, B.; and Cohen, J. P. 2020. Diva: Diverse visual feature aggregation for deep metric learning. In ECCV, 590–607. Springer.
  • Movshovitz-Attias et al. (2017) Movshovitz-Attias, Y.; Toshev, A.; Leung, T. K.; Ioffe, S.; and Singh, S. 2017. No fuss distance metric learning using proxies. In CVPR, 360–368.
  • Musgrave, Belongie, and Lim (2020) Musgrave, K.; Belongie, S.; and Lim, S.-N. 2020. A metric learning reality check. In ECCV, 681–699. Springer.
  • Oh Song et al. (2017) Oh Song, H.; Jegelka, S.; Rathod, V.; and Murphy, K. 2017. Deep metric learning via facility location. In CVPR, 5382–5390.
  • Oh Song et al. (2016) Oh Song, H.; Xiang, Y.; Jegelka, S.; and Savarese, S. 2016. Deep metric learning via lifted structured feature embedding. In CVPR, 4004–4012.
  • Patel, Tolias, and Matas (2022) Patel, Y.; Tolias, G.; and Matas, J. 2022. Recall@ k surrogate loss with large batches and similarity mixup. In CVPR, 7502–7511.
  • Pei et al. (2018) Pei, Z.; Cao, Z.; Long, M.; and Wang, J. 2018. Multi-adversarial domain adaptation. In AAAI.
  • Pinheiro (2018) Pinheiro, P. O. 2018. Unsupervised domain adaptation with similarity learning. In CVPR, 8004–8013.
  • Qian et al. (2019) Qian, Q.; Shang, L.; Sun, B.; Hu, J.; Li, H.; and Jin, R. 2019. Softtriple loss: Deep metric learning without triplet sampling. In ICCV, 6450–6458.
  • Quiñonero-Candela et al. (2008) Quiñonero-Candela, J.; Sugiyama, M.; Schwaighofer, A.; and Lawrence, N. 2008. Covariate shift and local learning by distribution matching.
  • Radford, Metz, and Chintala (2015) Radford, A.; Metz, L.; and Chintala, S. 2015. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.
  • Ren and Hua (2018) Ren, L.; and Hua, K. 2018. Improved image description via embedded object structure graph and semantic feature matching. In ISM, 73–80. IEEE.
  • Ren et al. (2021) Ren, L.; Li, K.; Wang, L.; and Hua, K. 2021. Beyond the deep metric learning: enhance the cross-modal matching with adversarial discriminative domain regularization. In ICPR, 10165–10172. IEEE.
  • Ren, Qi, and Hua (2019) Ren, L.; Qi, G.-J.; and Hua, K. 2019. Improving diversity of image captioning through variational autoencoders and adversarial learning. In WACV, 263–272. IEEE.
  • Rippel et al. (2016) Rippel, O.; Paluri, M.; Dollar, P.; and Bourdev, L. 2016. Metric learning with adaptive density discrimination. ICLR.
  • Roth et al. (2021) Roth, K.; Milbich, T.; Ommer, B.; Cohen, J. P.; and Ghassemi, M. 2021. Simultaneous similarity-based self-distillation for deep metric learning. In ICML, 9095–9106. PMLR.
  • Roth et al. (2020) Roth, K.; Milbich, T.; Sinha, S.; Gupta, P.; Ommer, B.; and Cohen, J. P. 2020. Revisiting training strategies and generalization performance in deep metric learning. In ICML, 8242–8252. PMLR.
  • Roth, Vinyals, and Akata (2022) Roth, K.; Vinyals, O.; and Akata, Z. 2022. Non-isotropy Regularization for Proxy-based Deep Metric Learning. In CVPR, 7420–7430.
  • Roweis, Hinton, and Salakhutdinov (2004) Roweis, S.; Hinton, G.; and Salakhutdinov, R. 2004. Neighbourhood component analysis. NIPS, 17(513-520): 4.
  • Saito et al. (2018) Saito, K.; Watanabe, K.; Ushiku, Y.; and Harada, T. 2018. Maximum classifier discrepancy for unsupervised domain adaptation. In CVPR, 3723–3732.
  • Schroff, Kalenichenko, and Philbin (2015) Schroff, F.; Kalenichenko, D.; and Philbin, J. 2015. Facenet: A unified embedding for face recognition and clustering. In CVPR, 815–823.
  • Seidenschwarz, Elezi, and Leal-Taixé (2021) Seidenschwarz, J. D.; Elezi, I.; and Leal-Taixé, L. 2021. Learning intra-batch connections for deep metric learning. In ICML, 9410–9421. PMLR.
  • Sohn (2016) Sohn, K. 2016. Improved deep metric learning with multi-class n-pair loss objective. NIPS, 29.
  • Suh et al. (2019) Suh, Y.; Han, B.; Kim, W.; and Lee, K. M. 2019. Stochastic class-based hard example mining for deep metric learning. In CVPR, 7251–7259.
  • Tan, Yuan, and Ordonez (2021) Tan, F.; Yuan, J.; and Ordonez, V. 2021. Instance-level image retrieval using reranking transformers. In ICCV, 12105–12115.
  • Teh, DeVries, and Taylor (2020) Teh, E. W.; DeVries, T.; and Taylor, G. W. 2020. Proxynca++: Revisiting and revitalizing proxy neighborhood component analysis. In ECCV, 448–464. Springer.
  • Torralba and Efros (2011) Torralba, A.; and Efros, A. A. 2011. Unbiased look at dataset bias. In CVPR, 1521–1528. IEEE.
  • Tzeng et al. (2017) Tzeng, E.; Hoffman, J.; Saenko, K.; and Darrell, T. 2017. Adversarial discriminative domain adaptation. In CVPR, 7167–7176.
  • Venkataramanan et al. (2022) Venkataramanan, S.; Psomas, B.; Avrithis, Y.; Kijak, E.; Amsaleg, L.; and Karantzalos, K. 2022. It takes two to tango: Mixup for deep metric learning. ICLR.
  • Wah et al. (2011) Wah, C.; Branson, S.; Welinder, P.; Perona, P.; and Belongie, S. 2011. The caltech-ucsd birds-200-2011 dataset.
  • Wang et al. (2017) Wang, B.; Yang, Y.; Xu, X.; Hanjalic, A.; and Shen, H. T. 2017. Adversarial cross-modal retrieval. In Multimedia, 154–162.
  • Wang et al. (2014) Wang, J.; Song, Y.; Leung, T.; Rosenberg, C.; Wang, J.; Philbin, J.; Chen, B.; and Wu, Y. 2014. Learning fine-grained image similarity with deep ranking. In CVPR, 1386–1393.
  • Wang and Deng (2018) Wang, M.; and Deng, W. 2018. Deep visual domain adaptation: A survey. Neurocomputing, 312: 135–153.
  • Wang et al. (2019) Wang, X.; Han, X.; Huang, W.; Dong, D.; and Scott, M. R. 2019. Multi-similarity loss with general pair weighting for deep metric learning. In CVPR, 5022–5030.
  • Wang et al. (2020) Wang, X.; Zhang, H.; Huang, W.; and Scott, M. R. 2020. Cross-batch memory for embedding learning. In CVPR, 6388–6397.
  • Weinberger and Saul (2009) Weinberger, K. Q.; and Saul, L. K. 2009. Distance metric learning for large margin nearest neighbor classification. Journal of machine learning research, 10(2).
  • Wojke and Bewley (2018) Wojke, N.; and Bewley, A. 2018. Deep cosine metric learning for person re-identification. In WACV, 748–756. IEEE.
  • Wu et al. (2017) Wu, C.-Y.; Manmatha, R.; Smola, A. J.; and Krahenbuhl, P. 2017. Sampling matters in deep embedding learning. In CVPR, 2840–2848.
  • Xuan et al. (2020) Xuan, H.; Stylianou, A.; Liu, X.; and Pless, R. 2020. Hard negative examples are hard, but useful. In ECCV, 126–142. Springer.
  • Yang et al. (2018) Yang, J.; She, D.; Lai, Y.-K.; and Yang, M.-H. 2018. Retrieving and classifying affective images via deep metric learning. In AAAI, volume 32.
  • Yang et al. (2022) Yang, Z.; Bastan, M.; Zhu, X.; Gray, D.; and Samaras, D. 2022. Hierarchical proxy-based loss for deep metric learning. In WACV, 1859–1868.
  • Yi et al. (2014) Yi, D.; Lei, Z.; Liao, S.; and Li, S. Z. 2014. Deep metric learning for person re-identification. In ICPR, 34–39. IEEE.
  • Zhang et al. (2022) Zhang, B.; Zheng, W.; Zhou, J.; and Lu, J. 2022. Attributable Visual Similarity Learning. In CVPR, 7532–7541.
  • Zheng et al. (2021a) Zheng, W.; Wang, C.; Lu, J.; and Zhou, J. 2021a. Deep compositional metric learning. In CVPR, 9320–9329.
  • Zheng et al. (2021b) Zheng, W.; Zhang, B.; Lu, J.; and Zhou, J. 2021b. Deep relational metric learning. In ICCV, 12065–12074.
  • Zhu et al. (2020) Zhu, Y.; Yang, M.; Deng, C.; and Liu, W. 2020. Fewer is more: A deep graph metric learning perspective using fewer proxies. NeurIPS, 33: 17792–17803.

Appendix

In the appendix, we first discuss the generalized bound of our domain adaptation framework in Section A. Then, we compare our method with other similar methods in Section B. We discuss the possible extension of our method to other model backbones in Section C. We also discuss some related studies including the data domain and the discriminators in Section D. After that, we list detailed experimental settings in Section E and describe some existing limitations in Section F. We illustrate the overall architecture of our method in Figure 3 and show some examples of image retrieval results.

Appendix A Theory Insight

Notation and Definitions

We define a domain as a distribution $\mathcal{D}$ on the space $\mathcal{X}$ with the ground-truth labeling function $y$. In our scenario, we have the source domain $\{\mathcal{D}_{S},y_{S}\}$, the intermediate domain $\{\mathcal{D}_{M},y_{M}\}$, and the target domain $\{\mathcal{D}_{T},y_{T}\}$. The hypothesis $h$ is a function that we search for to approximate the labeling $y$. For a data sample $x\sim\mathcal{D}_{S}$ and its corresponding proxy $p\sim\mathcal{D}_{T}$, the risk of the hypothesis $h$ is defined as its expected error with respect to the labeling function $y$:

$$\epsilon_{S}(h,y):=\mathbb{E}_{x\thicksim\mathcal{D}_{S}}[|h(x)-y_{x}|] \tag{10}$$
$$\epsilon_{T}(h,y^{\prime}):=\mathbb{E}_{p\thicksim\mathcal{D}_{T}}[|h(p)-y^{\prime}_{p}|], \tag{11}$$

where $y^{\prime}$ is the hidden labeling function of the target domain. Our goal is therefore to bound $\epsilon_{T}(h,y^{\prime})$ in the target domain: domain adaptation amounts to minimizing the target risk $\epsilon_{T}(h,y^{\prime})$ in terms of the source risk $\epsilon_{S}(h,y)$ and the other terms that affect its upper bound.

Given a hypothesis class $\mathcal{H}$ with $h\in\mathcal{H}$, Ben-David et al. (Ben-David et al. 2010) define the symmetric difference hypothesis space $\mathcal{H}\Delta\mathcal{H}$ and the domain divergence $d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_{S},\mathcal{D}_{T})$, where $\mathcal{H}\Delta\mathcal{H}:=\{h(x)\neq h^{\prime}(x)|h,h^{\prime}\in\mathcal{H}\}$. Thus $d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_{S},\mathcal{D}_{T})$ can be defined as

$$\begin{aligned}d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_{S},\mathcal{D}_{T})&=2\sup_{h,h^{\prime}\in\mathcal{H}\Delta\mathcal{H}}|P_{\mathcal{D}_{S}}(h\neq h^{\prime})-P_{\mathcal{D}_{T}}(h\neq h^{\prime})|\\&\geq 2|\epsilon_{S}(h,h^{\prime})-\epsilon_{T}(h,h^{\prime})|\end{aligned}$$

To estimate the domain divergence from finite samples $(\mathcal{U}_{S},\mathcal{U}_{T})$, $d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_{S},\mathcal{D}_{T})$ is further relaxed to

$$d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_{S},\mathcal{D}_{T})\leq d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{U}_{S},\mathcal{U}_{T})+O(\sqrt{1/m}) \tag{12}$$

where $O(\sqrt{1/m})$ is an empirical function introduced in (Ben-David et al. 2010) with convergence rate $\sqrt{1/m}$, and $m$ is the size of the data samples $\mathcal{U}_{S}$ and $\mathcal{U}_{T}$.

The Generalization Bound

With the definition of $d_{\mathcal{H}\Delta\mathcal{H}}$ introduced above, for every $h\in\mathcal{H}$ the bound for the data spaces $\mathcal{U}_{S}$ and $\mathcal{U}_{T}$ can be described as

$$\epsilon_{T}(h)\leq\epsilon_{S}(h)+\frac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{U}_{S},\mathcal{U}_{T})+\lambda+O(\sqrt{1/m}), \tag{13}$$

where $\lambda$ denotes the combined risk of the optimal hypothesis $h^{*}$: $\lambda:=\epsilon_{S}(h^{*})+\epsilon_{T}(h^{*})$.

Now we assume that our source, target, and intermediate domains form three possible source-target adaptation pairs: $\{\mathcal{U}_{S},\mathcal{U}_{T}\},\{\mathcal{U}_{S},\mathcal{U}_{M}\}$, and $\{\mathcal{U}_{M},\mathcal{U}_{T}\}$. Note that $\mathcal{U}_{T}$ does not serve as a source domain since it initially does not contain any semantic information, and its labeling space is still hidden. We combine the individual bounds of the source-target pairs, as a convex combination, into a single bound. In other words, by combining the source domain $\mathcal{D}_{S}$ and the intermediate domain $\mathcal{D}_{M}$ into a single domain, our generalization bound of the risks on the three pairs can be stated as follows:

Theorem A.1.

Let $\mathcal{H}$ be a hypothesis space of VC-dimension $d$, and let $\mathcal{U}_{S},\mathcal{U}_{M}$, and $\mathcal{U}_{T}$ be samples of size $m$ drawn from $\mathcal{D}_{S},\mathcal{D}_{M}$, and $\mathcal{D}_{T}$, respectively. Then $\forall h\in\mathcal{H}$,

$$\begin{aligned}\epsilon_{T}(h)&\leq\frac{1}{2}\epsilon_{S}(h)+\frac{1}{4}(d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{U}_{S},\mathcal{U}_{M},\mathcal{U}_{T}))\\&+\tilde{\lambda}+\tilde{O}(\sqrt{1/m})\end{aligned} \tag{14}$$

where $\tilde{\lambda}$ denotes the combined risk of the optimal hypothesis $h^{*}$, $\tilde{\lambda}:=\epsilon_{S}(h^{*})+\epsilon_{M}(h^{*})+\epsilon_{T}(h^{*})$, and $d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{U}_{S},\mathcal{U}_{M},\mathcal{U}_{T}):=d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{U}_{S},\mathcal{U}_{T})+d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{U}_{M},\mathcal{U}_{T})+d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{U}_{S},\mathcal{U}_{M})$.

Proof.

We have the bound proposed in Eq. 13 for each source-target domain pair. Assuming all domains have a finite sample of size $m$, the bounds for all pairs can be listed as follows:

$$\epsilon_{T}(h)\leq\epsilon_{S}(h)+\frac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{U}_{S},\mathcal{U}_{T})+\lambda_{1}+O(\sqrt{1/m}) \tag{15}$$
$$\epsilon_{T}(h)\leq\epsilon_{M}(h)+\frac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{U}_{M},\mathcal{U}_{T})+\lambda_{2}+O(\sqrt{1/m}) \tag{16}$$
$$\epsilon_{M}(h)\leq\epsilon_{S}(h)+\frac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{U}_{S},\mathcal{U}_{M})+\lambda_{3}+O(\sqrt{1/m}) \tag{17}$$

Then we combine the upper bounds in Eqs. 15, 16, and 17 as a convex combination with an interpolation parameter $\alpha=1/3$ as follows,

$$\begin{split}2\alpha\epsilon_{T}(h)+\alpha\epsilon_{M}(h)&\leq 2\alpha\epsilon_{S}(h)+\alpha\epsilon_{M}(h)\\ &+\frac{\alpha}{2}(d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{U}_{S},\mathcal{U}_{M},\mathcal{U}_{T}))\\ &+\alpha(\lambda_{1}+\lambda_{2}+\lambda_{3})+\tilde{O}(\sqrt{1/m}),\end{split} \tag{18}$$

where $\tilde{O}(\sqrt{1/m})$ is an interpolated empirical function. We simply replace $\lambda$ with $\tilde{\lambda}:=\alpha(\lambda_{1}+\lambda_{2}+\lambda_{3})=\epsilon_{S}(h^{*})+\epsilon_{M}(h^{*})+\epsilon_{T}(h^{*})$ to get our combined upper bound. ∎

Connection to Our Objective Function

For each $d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{U},\mathcal{U}^{\prime})$, we know that

$$\begin{split}d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{U},\mathcal{U}^{\prime})&=2\sup_{h,h^{\prime}\in\mathcal{H}\Delta\mathcal{H}}|P_{\mathcal{U}}(h\neq h^{\prime})-P_{\mathcal{U}^{\prime}}(h\neq h^{\prime})|\\ &\leq 2\sup_{h\in\mathcal{H}}|P_{\mathcal{U}}(h=0)-P_{\mathcal{U}^{\prime}}(h=1)|\\ &=2\sup_{h\in\mathcal{H}}(P_{\mathcal{U}}(h=0)+P_{\mathcal{U}^{\prime}}(h=1)-1)\end{split} \tag{19}$$
Figure 3: The overview of our framework. The input images are embedded with a CNN encoder. The proxies are randomly sampled and mixed with the embedding of image samples. Then a domain-level classifier $f_{D}(\cdot)$ and a category-level classifier $f_{C}(\cdot)$ are trained to predict the domain and class property of each sample and proxy. With the adversarial training paradigm, the features and proxies are moved to fool the $f_{C}(\cdot)$ and minimize the discrepancy of prediction of $f_{D}(\cdot)$. The dashed lines on the surface represent the surface boundary of our discriminators. The gradient from the adversarial learning pushes the data samples in the opposite direction from the separation of the boundary, which aligns the data and proxies.

Similarly, we can also relax $h,h^{\prime}$ from the $\mathcal{H}\Delta\mathcal{H}$ space to specific class labels, where $h=0,h=1,h=2$. Thus, we get the bound of $d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{U}_{S},\mathcal{U}_{M},\mathcal{U}_{T})$ as follows,

$$d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{U}_{S},\mathcal{U}_{M},\mathcal{U}_{T})\leq 2\sup_{h\in\mathcal{H}}(P_{\mathcal{U}_{S}}(h=0)+P_{\mathcal{U}_{M}}(h=1)+P_{\mathcal{U}_{T}}(h=2)), \tag{20}$$

When the optimized hypothesis reduces the combined risk $\tilde{\lambda}$ in A.1 to a small constant, the bound mainly depends on the term $d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{U}_{S},\mathcal{U}_{M},\mathcal{U}_{T})$ and $\epsilon_{S}(h)$ when $m$ is sufficiently large.

The term $d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{U}_{S},\mathcal{U}_{M},\mathcal{U}_{T})$ is the min-max optimization objective of our domain-level discriminator, and the category-level discriminator optimizes the risk $\epsilon_{S}(h)$. This shows that optimizing the risk in the target domain requires a combination of domain-level and category-level discriminators. Besides, the theory in (Chen et al. 2022) also shows that $|\epsilon_{S}(h,h^{\prime})-\epsilon_{T}(h,h^{\prime})|\leq d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_{S},\mathcal{D}_{T})$ is also bounded by the NWD under a $K$-Lipschitz constraint, where $|\epsilon_{S}(h,h^{\prime})-\epsilon_{T}(h,h^{\prime})|\leq 2K\mathcal{L}_{d}(\nu_{s},\nu_{t})$. This explains the intuition behind our proposed objective.

Connection between Proxy Loss and $\epsilon_{T}(h)$

Here, we give a rough explanation of why domain adaptation helps proxy-based DML. We take the Proxy-NCA loss as an example. Recall that Proxy-NCA has the following form:

$$\mathcal{L}_{proxy}(X,P)=\sum_{x\in X}-\log\frac{e^{d\left(x,p^{+}\right)}}{\sum_{p^{-}\in P^{-}}e^{d\left(x,p^{-}\right)}},\tag{21}$$

The NCA loss originally follows the design of the “leave-one-out” classification paradigm (Goldberger et al. 2004), where we try to assign the sample $x$ and anchor $p$ to the same class (or to select $p$ as a neighbor of $x$ with the same label). Optimizing the NCA loss maximizes the probability of assigning the sample $x$ and $p$ to the same class while pushing them away from the others. On the other hand, we have,

$$\begin{split}\frac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}(D_{S},D_{T})&\geq|\epsilon_{S}(h(x))-\epsilon_{T}(h(p))|\\ &=\left|\mathbb{E}_{x\sim D_{S}}|h(x)-y|-\mathbb{E}_{p\sim D_{T}}|h(p)-y^{\prime}|\right|\end{split}\tag{22}$$

By aligning the distributions $(D_{S},D_{T})$ and the label spaces $(y,y^{\prime})$, we know that $\frac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}(D_{S},D_{T})\geq\mathbb{E}_{x,p\sim D}|h(x)-h(p)|$, which has the same optimization target as Proxy-NCA: to minimize the risk of labeling $x$ and $p$ as different classes. Proxy-Anchor and other proxy-based losses also follow this paradigm with modified NCA losses, so they share the same underlying target.
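As a concrete reading of Eq. (21), the following minimal sketch evaluates the loss for a handful of vectors. It is an illustration under stated assumptions, not the authors' implementation: `d` is taken to be a (scaled) cosine similarity so that the formula can be used exactly as printed (with a distance, one would negate it, as in the original Proxy-NCA formulation), and the names `cosine`, `proxy_nca_loss`, and `scale` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def proxy_nca_loss(samples, labels, proxies, scale=1.0):
    """Sum over samples of -log(e^{d(x,p+)} / sum_{p-} e^{d(x,p-)}).

    samples: list of embedding vectors
    labels:  class index of each sample
    proxies: one proxy vector per class (at least two classes)
    scale:   hypothetical scaling factor applied to the similarity
    """
    total = 0.0
    for x, y in zip(samples, labels):
        pos = scale * cosine(x, proxies[y])
        neg = [scale * cosine(x, p) for c, p in enumerate(proxies) if c != y]
        # -log(e^pos / sum e^neg) = logsumexp(neg) - pos, computed stably.
        m = max(neg)
        log_sum_exp = m + math.log(sum(math.exp(s - m) for s in neg))
        total += log_sum_exp - pos
    return total


proxies = [[1.0, 0.0], [0.0, 1.0]]
samples = [[1.0, 0.0], [0.0, 1.0]]
aligned = proxy_nca_loss(samples, [0, 1], proxies)
misaligned = proxy_nca_loss(samples, [1, 0], proxies)
```

In this toy setting, the loss is lower when each sample coincides with the proxy of its own class than when the labels are swapped, which is the behavior the paragraph above describes.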

Appendix B Comparison with Related Methods

In this section, we discuss two related works, XBM (Wang et al. 2020) and MemVir (Ko, Gu, and Kim 2021), that are intuitively close to the idea of ours and other proxy-based DML works. Our approach and XBM have something in common, since both attempt to compare data representations with other representations that are unrelated to the training batch (XBM compares the data representations while ours compares class representations). MemVir, on the other hand, proposes to extend the space of the classes by gradually adding representations and class weights from outside the batch. They all attempt to reduce the shift between the samples inside and outside the batch, which is also called semantic drift. However, there is a fundamental difference between proxy-based and XBM-related approaches: proxy-based DML directly updates the class representations in the whole data space, while XBM approaches update a subset of the data outside the batch.

Appendix C Extension to Other Backbones

Note that this paper compares our method with existing state-of-the-art proxy-based methods and other popular DML approaches that embed the images with CNN backbones (ResNet50 and InceptionBN). Some recent works apply Vision Transformers (El-Nouby et al. 2021; Tan, Yuan, and Ordonez 2021) that are pre-trained on larger datasets (ImageNet21K) or trained with extremely large batch sizes (Ermolov et al. 2022; Patel, Tolias, and Matas 2022); these are beyond the scope of this paper due to our limited computing resources. We anticipate that our method can be extended with Transformer encoders (ViT) to boost its performance, given larger GPU memory or parallel training on multiple GPUs. We leave this extension to subsequent studies.

Figure 4: The optimized Beta distributions on the different datasets in our experiments.

Appendix D Additional Studies

Study of the Mixing Distribution

We study the performance of the mixing domain with various sampling distributions. We evaluate our method under different probability density functions parameterized by $\alpha,\beta$. We find that the optimized sampling density varies across datasets because the data distribution in the initial space differs from dataset to dataset. Sampling $\lambda$ close to $0.0$ is equivalent to using the target proxy directly as the target space without any mixing, and vice versa. As illustrated in Figure 4, $\lambda$ tends to be sampled larger than 0.5 for all datasets but with different probabilities, which means the mixed intermediate data domain tends to carry more semantic information from the source data domain in all datasets. This is also consistent with our assumption that the proxies need to be strongly connected with the informative source domain during the DML learning process.
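The sampling step discussed above can be sketched as follows. This is only an illustration under assumptions: the mixed feature is taken to be the convex combination $\lambda x+(1-\lambda)p$, which matches the statement that $\lambda$ near 0 recovers the proxy, and the function name `mix_with_proxy` and the values $\alpha=5$, $\beta=1$ are hypothetical choices within the searched range rather than the tuned settings.

```python
import random


def mix_with_proxy(x, p, alpha, beta, rng=random):
    """Mix a sample embedding x with a proxy p using lambda ~ Beta(alpha, beta).

    lambda near 1 keeps mostly the sample (source domain);
    lambda near 0 keeps mostly the proxy (target domain).
    """
    lam = rng.betavariate(alpha, beta)
    mixed = [lam * xi + (1.0 - lam) * pi for xi, pi in zip(x, p)]
    return mixed, lam


rng = random.Random(0)  # fixed seed so the sketch is reproducible
mixed, lam = mix_with_proxy([1.0, 0.0], [0.0, 1.0], 5.0, 1.0, rng)

# With alpha > beta the density is skewed towards 1, so on average
# the mixed features carry more of the source sample than of the proxy.
lams = [mix_with_proxy([1.0], [0.0], 5.0, 1.0, rng)[1] for _ in range(2000)]
mean_lam = sum(lams) / len(lams)
```

Choosing $\alpha>\beta$ skews the sampled $\lambda$ above 0.5, which corresponds to the regime the study reports for all datasets.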

(a) The Initial Space
(b) The Space of the Original PA
(c) The Space of our PA-DADA
Figure 5: t-SNE visualizations of sample representations and the corresponding proxy vectors on part of the training set of CUB-200-2011. Each class is labeled with a unique color. The embedding space of our method is well clustered, and the proxies are sufficiently separated and close to the clustered sample data, while the proxies of the original PA are not well separated and still maintain their own distribution.

Study of Confusion Matrix

We compare the probabilities predicted by our category-level discriminator on CUB200 for the source data samples and the target proxies. We repeat the comparison on the PA baseline, where we also train a classifier with the same architecture and initial parameters (so that the prediction on the source data is the same as for our proposed method). Figure 7 illustrates the confusion matrices. Our proposed method shows better consistency between the distributions of the data samples and the proxies.

Study of the Adaptation Groups

Another important factor that affects the results is where to apply the domain adaptation. To investigate this, we try domain adaptation on different combinations of the sets $\tilde{X}$, $\tilde{D}$, and $P$. We study the effectiveness of our domain-level discriminator under three settings: (1) adapting only the augmented samples $\tilde{X}$ and the original proxies $P$ (labeled as $\tilde{X}$ + $P$), (2) adapting only the augmented samples $\tilde{X}$ and the augmented data $\tilde{D}$ (labeled as $\tilde{X}$ + $\tilde{D}$), and (3) adapting all three sets of data as proposed in the main paper (labeled as $\tilde{X}$ + $\tilde{D}$ + $P$). For settings (1) and (2), we set the domain labels to 01 and 10 instead. As illustrated in Figure 6(d), the $\tilde{X}$ + $\tilde{D}$ + $P$ setting achieves the best overall performance and, in particular, outperforms the $\tilde{X}$ + $P$ setting by a large margin. This demonstrates the effectiveness of the augmented data for domain adaptation.

Appendix E Detailed Experimental Settings

Pre-Processing

We follow the standard pre-processing procedure proposed in existing works for fairness. Specifically, we resize the image to $224\times 224$ and apply random resized cropping and random horizontal flipping with probability 0.5. In the test phase, the images are first resized to $256\times 256$, then cropped back to $224\times 224$.

(a) Impact of $\eta$ and $\gamma$
(b) Impact of dimension
(c) Impact of batch size
(d) Compare Adapted Groups
Figure 6: The impact of our hyperparameters. The blue horizontal line indicates our Proxy-Anchor (PA) baseline.

Parameter Searching

We empirically search the dimensions of each hidden layer of our discriminators in a discrete range of $\{128,256,512\}$. We search the second layer of the category-level classifier according to the number of classes in each dataset. We eventually set our domain-level discriminator as a single-layer MLP with a 512-dimension hidden layer. Our category-level discriminator is constructed as a 2-layer MLP where the first hidden layer has 512 dimensions.

We search our hyper-parameters $\{\eta,\gamma\}$ on a grid where $\eta,\gamma$ range from $10^{-2}$ to $1.0$. For the parameters $\alpha,\beta$ of the Beta distribution, we search within the range $\alpha,\beta\in[0.5,5]$ since they control the probability of sampling between the two domains. We set the parameters of the Proxy-NCA and Proxy-Anchor baselines to the values suggested in the original papers, where the scaling factor is $\tau=32$ and the margin is $\delta=0.1$. In each training iteration, we update the discriminators $k=3$ times while updating the generator once. We adopt a warm-up phase where all parameters are frozen except the linear head of the generator, as suggested in (Roth, Vinyals, and Akata 2022).
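The alternating update schedule described above (several discriminator updates per generator update) can be sketched as follows. The two step functions are hypothetical placeholders for whatever performs one optimizer update in an actual training loop; only the ratio $k=3$ comes from the text.

```python
def train_one_iteration(batch, discriminator_step, generator_step, k=3):
    """Run k discriminator updates followed by one generator update.

    discriminator_step and generator_step are hypothetical callables that
    each perform a single optimizer update on the given batch.
    """
    for _ in range(k):
        discriminator_step(batch)
    generator_step(batch)


# Toy usage: record the order of updates instead of training real models.
calls = []
train_one_iteration(
    batch="dummy-batch",
    discriminator_step=lambda b: calls.append("D"),
    generator_step=lambda b: calls.append("G"),
    k=3,
)
```

Passing the steps as callables keeps the schedule independent of the model and optimizer details, which are not specified here.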

Impact of Hyperparameters

We observe in Figure 6 that our discriminators are effective when the balancing factors satisfy $0.002\leq\eta\leq 0.006$ and $0.002\leq\gamma\leq 0.009$. In this range, the performance starts to exceed that of the settings with a single classifier (R@1 $>70.0$). Besides, we notice that the adversarial training does not help, or even damages, the original metric learning system when $\eta\geq 0.08,\gamma\geq 0.1$.

We also observe that a large batch size is not a sufficient condition for better performance with our method (as illustrated in Figure 6(c)), since the batch size and batch sampling also affect the quality of our mixed positive samples. Figure 6(b) also illustrates that the performance of our learning system improves with the feature dimension. Even though we achieve better results with larger dimensions, we only report the results for dimension 512 for a fair comparison with other works. Note that our method is based on Proxy-Anchor, where the proxies collect information across batches. Thus, the performance of our method is not sensitive to batch size (see Figure 6(c)).

Training Time and Memory

With our adversarial training, the additional cost in time and memory is very limited. On our experimental machine, a single training epoch on CUB200 takes $37.3\pm 0.4$s with our proposed method (averaged over 10 epochs, including the evaluation time), while the original PA baseline takes $35.2\pm 0.3$s. Even though we train the discriminators $k=3$ times in each iteration, the additional time is only around 2s (5%). This is because our discriminators are shallow (1 and 2 layers), and the computational complexity does not change. In terms of memory, our proposed method takes 11.3GB of GPU memory, while the original baseline takes 11.2GB for the default batch size of 90. Although we increase the number of samples by mixing the features, the additional memory is negligible compared to the baseline since our proposed method does not need additional forward propagation for the added samples.

Appendix F Broader Impact and Limitation

We believe our DADA work can serve as an example to inspire other researchers working on proxy-based DML to focus on capturing space distributions. By investigating more solutions to compare and align the domains, future contributions could further push their limits under an aligned space with unique data distribution. A possible limitation of DADA is its learning efficiency on datasets with a large number of classes, which may cause difficulty in extending it to other learning paradigms, such as continuous learning. We will focus on this interesting challenge in future work.

(a) Source data sample
(b) Target proxies of our proposed method
(c) Target proxies of PA baseline
Figure 7: The confusion matrices on CUB200 (100 classes) from our category-level discriminator. We also show the confusion matrix of the PA baseline, where we train a classifier with the same architecture and initial parameters during baseline training.
Figure 8: Retrieved examples on the CUB200 test set, ranked by similarity score.
Figure 9: Retrieved examples on the CUB200 test set, ranked by similarity score.
Figure 10: Retrieved examples on the CARS196 test set, ranked by similarity score.
Figure 11: Retrieved examples on the CARS196 test set, ranked by similarity score.