Domain-shift adaptation via linear transformations

Roberto Vega^1,2 Russell Greiner^1,2
¹Department of Computing Science, University of Alberta, Canada
²Alberta Machine Intelligence Institute
{rvega, rgreiner}@ualberta.ca

Abstract

A predictor, $f_{A}:X\to Y$ , learned with data from a source domain (A) might not be accurate on a target domain (B) when their distributions are different. Domain adaptation aims to reduce the negative effects of this distribution mismatch. Here, we analyze the case where $P_{A}(Y\ |\ X)\neq P_{B}(Y\ |\ X)$ , $P_{A}(X)\neq P_{B}(X)$ but $P_{A}(Y)=P_{B}(Y)$ ; where there are affine transformations of $X$ that makes all distributions equivalent. We propose an approach to project the source and target domains into a lower-dimensional, common space, by (1) projecting the domains into the eigenvectors of the empirical covariance matrices of each domain, then (2) finding an orthogonal matrix that minimizes the maximum mean discrepancy between the projections of both domains. For arbitrary affine transformations, there is an inherent unidentifiability problem when performing unsupervised domain adaptation that can be alleviated in the semi-supervised case. We show the effectiveness of our approach in simulated data and in binary digit classification tasks, obtaining improvements up to 48% accuracy when correcting for the domain shift in the data.

1 Introduction

The goal of supervised machine learning is to produce a model that can accurately predict a value, $y$ , given a vector input, $x$ , corresponding (implicitly) to an unknown function $y=f(x)$ Murphy (2012). In the supervised setting, we learn an approximate $\hat{y}=\hat{f}(x)\approx f(x)$ , by applying a learning algorithm to a (source) training dataset $D_{S}=\{(x_{1},y_{1}),(x_{2},y_{2}),\dots(x_{n},y_{n})\}$ . We can then apply this $\hat{f}$ by applying it to new instances $D_{T}=\{x_{1},x_{2},\dots x_{m}\}$ .

A common assumption is that the source (S) and the target (T) domains follow the same probability distribution–i.e. $P_{S}(X,Y)=P_{T}(X,Y)$ . When this is not the case, a predictor learned using $D_{S}$ might not generalize when used on $D_{T}$ Storkey (2009). The performance on the target domain depends on its performance on the source domain, and on the similarity between the distributions of the domain and target domains Ben-David et al. (2007).

A well known model to explain the discrepancy between distributions is covariate shift, where $P_{S}(X)\neq P_{T}(X)$ , but $P_{S}(Y\ |\ X)=P_{T}(Y\ |\ X)$ Shimodaira (2000). Other assumptions lead to different models Storkey (2009); Kull and Flach (2014), which motivate algorithms that decrease the negative impact of the discrepancies under different circumstances Csurka (2017); Wen et al. (2014).

Our study focuses on the case where $P_{S}(X)\neq P_{T}(X)$ , $P_{S}(Y\ |\ X)\neq P_{T}(Y\ |\ X)$ , and $P_{S}(Y)=P_{T}(Y)$ . However, we assume the existence of a function $f(\cdot)$ , with parameters $\lambda_{A}$ and $\lambda_{B}$ , such that $P_{S}(\ f(X,\ \lambda_{A})\ )=P_{T}(\ f(X,\ \lambda_{B})\ )$ and $P_{S}(\ Y\ |\ f(X,\ \lambda_{A})\ )=P_{T}(\ Y\ |\ f(X,\ \lambda_{B})\ )$ . This implies that there is a common feature space where the source and target domains follow the same distribution; see Figure 1. This model is called domain-shift Storkey (2009), or covariate observation shift Kull and Flach (2014).

Refer to caption — Figure 1: (a) Domain-shift for domains A and B. $Y$ represents the label, $Z$ represent a latent common feature space. $X_{A}$ and $X_{B}$ are the observations for the two different domains. Here, $g_{\lambda}:Z\to X$ and $f_{\theta}:X\to Z$ . (b) Plate model for representing more than 2 domains.

Figure 2(a) exemplifies domain shift. Note that the decision boundary learned from the source domain (shown with solid squares and triangles) has poor performance on the target domain (shown with dashed squares and triangles). Domain-shift adaptation aims to find a common representation that minimizes the divergence between the domains. If successful, the decision function learned during training will have a good performance when doing inference; see Figure 2(b).

Under the assumption that the mapping from source and target domains to a common representation is an affine function (i.e., $Z_{i}=f(X_{A,i},\lambda_{A})$ and $Z_{j}=f(X_{B,j},\lambda_{B})$ have the form $f(X,\lambda)=\lambda X+\lambda_{0}$ ), we propose an algorithm for unsupervised and semi-supervised domain adaptation. We find the parameters $\lambda$ and $\lambda_{0}$ , which project the data into a common space, by computing the first $p$ eigenvectors of the covariance matrices of the probability distributions of each domain, and then finding an orthogonal matrix that minimizes the maximum mean discrepancy between both distributions.

There is an inherent unidentifiability problem with unsupervised domain adaptation. Observe in Figure 3 that, in the absence of labeled data from the source and target domains, it is impossible to distinguish between the different “distribution alignments” presented there. This problem can be alleviated in the semi-supervised case, where few labeled instances allow the distinction between both scenarios.

2 Related work

The model presented in Figure 1 is closely related to the probabilistic versions of principal components analysis Tipping and Bishop (1999) and canonical correlation analysis Bach and Jordan (2005). In both cases, $Z\sim\mathcal{N}(0,I)$ and $X\ |\ Z\sim~\mathcal{N}(Wz+\mu,\Psi)$ . When the transformation matrix, $W$ , is diagonal –i.e., a location-scale transformation– it is possible to perform domain-shift adaptation without the Gaussianity assumption by minimizing the maximum mean discrepancy between both domains Gretton et al. (2012); Zhang et al. (2013). In our study, we allow $W$ to be an arbitrary matrix without the Gaussianity assumption. Domain adaptation with arbitrary affine transformation has been explored in the context of mixing small datasets from different domains to increase the size of the training set. While this is often successful, this approach still requires a supervised dataset for each of the different sources Vega and Greiner (2018). Here, our objective is to learn a predictor in the source domain, and then apply it to the target domain in the unsupervised and semi-supervised scenarios.

Recent approaches attempt to minimize the divergence between source and target distributions by using data transformations. CORAL Sun et al. (2016) matches the first two moments of the source and target distributions, while a different line of work uses variants of autoencoders to find a common mapping between source and target data Glorot et al. (2011); Chen et al. (2012, 2015); Louizos et al. (2016). After finding this common feature space, they learned a predictor using only the source data. Their approach achieved better performance when applied to the target dataset, relative to not correcting for domain-shift, in natural language processing tasks.

Adversarial domain adaptation learns the mapping to the common space and the predictor at the same time Ganin et al. (2016); Long et al. (2018); Zhao et al. (2018). It combines a discriminator (which distinguishes instances from the different domains); a predictor (which tries to minimize the prediction error on the labeled instances); and a third function (which maps the instances into a common space). The three functions are optimized together. If successful, the discriminator should be unable to distinguish the domain on an instance (based on its common space encoding), while the predictor should have good performance under the metric of interest Tzeng et al. (2017).

Despite the success of neural networks for domain adaptation in natural language processing and computer vision tasks, it is hard to define what type of problems can be solved with this approach. For example, they might fail when $P_{S}(Y)\neq P_{T}(Y)$ Tachet des Combes et al. (2020). Even when they successfully learn an invariant representation across domains that preserves the predictive power in the source domain, this do not guarantee a successful adaptation. It is possible to have invariant representations and small error in the training set, and still have a large error in the test set Zhao et al. (2019). Therefore, it is important to explicitly determine under which conditions we expect an algorithm to work. Our goal with this paper is to analyze the problem of domain-shift under affine transformations, and to propose an approach for domain adaptation for this scenario.

3 Domain-shift adaptation via linear transformations

Under the assumption that the domain-shift is caused by affine transformations, the equations on Figure 1 become:

\begin{gathered}x_{A}=\theta_{A}z+\mu_{A}\\ x_{B}=\theta_{B}z+\mu_{B}\\ x_{A}\in\mathbb{R}^{m},\ x_{B}\in\mathbb{R}^{n},\ z\in\mathbb{R}^{p}\end{gathered}

(1)

Note that the latent variable, $Z$ , can have a different (lower) dimensionality than the observations $x_{A}$ and $x_{B}$ , which in turn can have different dimensionality between themselves. Importantly, we do not assume that we have paired data between the source and target domain. In other words, for a given instance, $i$ , we can either observe its representation in the source domain, $x_{A}^{i}$ , or the target domain, $x_{B}^{i}$ , but not both.

If we knew the parameters $\theta_{A},\mu_{A},\theta_{B},\mu_{B}$ , and assuming they are non-degenerate, we could do the inverse mapping from the observations $x_{A}$ and $x_{B}$ to $z$ by solving the following optimization problem:

z^{*}=\arg\min_{z}||(x-\mu)-\theta z||^{2}

whose solution (see the Appendix A.1) is given by

\begin{gathered}z=(\theta^{T}\theta)^{-1}\theta^{T}(x-\mu)\end{gathered}

(2)

Once we map the data from the source and target domains to a common space, we can use the labeled data from the source domain to learn a predictor that we can apply to data from any of both domains (after the appropriate projection into the common space).

3.1 Estimating the transformation parameters

Without loss of generality, we assume that $E[Z]=0$ and $Cov[Z]=I$ . Then, given a dataset with instances drawn from the source domain $X_{A}$ and a dataset with instances drawn from the target domain $X_{B}$ :

\begin{gathered}E[X]=\theta E[Z]+\mu=\mu\\ Cov[X]=\theta Cov[Z]\theta^{T}=\theta\theta^{T}\\ \end{gathered}

(3)

Note that we can compute the empirical estimates $\hat{\mu}_{A},\hat{\mu}_{B},\hat{\Sigma}_{A},\hat{\Sigma}_{B}$ , given the datasets $X_{A}$ and $X_{B}$ . The empirical estimators of the mean directly give us half of the transformation parameters. For the case of the covariance matrix, we can compute the singular value decomposition:

$\displaystyle\Sigma$	$\displaystyle=USU^{T}$	(4)
$\displaystyle\Sigma$	$\displaystyle=US^{\frac{1}{2}}S^{\frac{1}{2}}U^{T}$
$\displaystyle\Sigma$	$\displaystyle=US^{\frac{1}{2}}QQ^{T}S^{\frac{1}{2}}U^{T}\quad\mbox{;where}\quad QQ^{T}=I$
$\displaystyle\Sigma$	$\displaystyle=(US^{\frac{1}{2}}Q)(US^{\frac{1}{2}}Q)^{T}$

Since $\Sigma$ is a positive semi-definite matrix, its eigenvalues are non-negative, which allows us to decompose the diagonal matrix as $S=S^{\frac{1}{2}}S^{\frac{1}{2}}$ . By comparing Equations 3 and 4, we can estimate the parameters $\theta$ as:

\begin{gathered}\theta=US^{\frac{1}{2}}Q\end{gathered}

(5)

for any orthogonal matrix, $Q$ . After substituting the parameter $\theta$ into Equation 2, then applying some algebraic manipulations (see the Appendix A.2) , we observe that:

\begin{gathered}z=Q^{T}S^{-\frac{1}{2}}U^{T}(x-\mu)\end{gathered}

(6)

A consequence of Equation 6 is that matching the empirical mean and covariance matrices of the source and target domain is not enough to correct for domain-shift adaptation: The orthogonal matrix $Q$ , which represents rotations or reflections, might cause misalignment in the data; see Figure 4(a).

The matrices $(S_{A},U_{A})$ and $(S_{B},U_{B})$ , for the source and target domains, respectively, can be computed from the SVD of their empirical covariance matrices, $\hat{\Sigma}_{A}$ and $\hat{\Sigma}_{B}$ . Since the objective of domain adaptation is to align the distributions, regardless of the “direction” of the alignment, we arbitrarily set $Q_{A}=I$ . Then, we find an orthogonal matrix that minimizes the divergence between both probability distributions:

Q_{B}^{*}=\arg\min_{Q_{B}}Div(X_{A}\ ||\ Q_{B}^{T}X_{B}),\ \ s.t.\ \ Q_{B}Q_{B}^{T}=I

(7)

where $Div(\cdot||\cdot)$ is an empirical measure of the divergence between the two domains.

3.2 Maximum Mean Discrepancy

A common measure of the divergence between two probability distributions is the Maximum Mean Discrepancy (MMD) Gretton et al. (2012).

Definition 1 (Maximum Mean Discrepancy)

Let p and q be Borel probability measures defined on a domain $\mathcal{X}$ . Given observations $X:=\{x_{1},\dots,x_{m}\}$ and $Y:=\{y_{1},\dots,y_{n}\}$ , drawn independently and identically distributed from p and q, respectively. Let $\mathcal{F}$ be a class of functions $f:\mathcal{X}\to\mathbb{R}$ , the MMD is defined as:

\mbox{MMD}[\mathcal{F},p,q]:=\sup_{f\in\mathcal{F}}\left(E_{x\sim p}[f(x)]-E_{y\sim q}[f(y)]\right)

Informally, the purpose of the MMD is to determine if two probability distributions, $p$ and $q$ , are different. The associated algorithm involves taking samples from $p$ and $q$ , then finding a function that take large values on samples from $p$ and small (or negative) values on samples of $q$ . The MMD is then the difference between the mean values of the function of the samples.

By defining the class of functions $\mathcal{F}$ as the unit ball in a reproducing kernel Hilbert space, Gretton et al. (2012) proposed (biased) empirical estimator of the $\mbox{MMD}^{2}$ as follows:

\begin{split}\mbox{MMD}_{b}^{2}[\mathcal{F},X,Y]&=\frac{1}{m^{2}}\sum_{i,j=1}^{m}k(x_{i},x_{j})+\frac{1}{n^{2}}\sum_{i,j=1}^{n}k(y_{i},y_{j})\\ &-\frac{2}{mn}\sum_{i,j=1}^{m,n}k(x_{i},y_{j})\end{split}

(8)

where $k(\cdot,\cdot)$ is a valid kernel. In our case, we use the Gaussian kernel $k(x,y)=\exp\left(-\frac{1}{2}\sigma^{-2}||x-y||^{2}\right)$ .

Equation 8 has two nice properties. (1) It computes an estimation of the MMD with a finite number of instances from each domain. (2) It is a differentiable function, so it can be optimized with iterative methods, such as gradient descent.

3.3 Optimization with orthogonality constraints

For solving the optimization problem with orthogonality constraints presented in Equation 7, we used the Wen and Yin (2013) algorithm (Algorithm 1), which is an iterative method based on the Cayley transform. Their algorithm is similar to gradient descent, but instead of looking for solutions in the Euclidean space, they look for solutions in the Stiefel manifold, which is the set that contains all the orthogonal matrices.

Formally, their proposed algorithm solves:

\min_{X\in\mathbb{R}^{n\times p}}\mathcal{F}(X),\ s.t.\ \ X^{T}X=I

(9)

Algorithm 1 Optimization with orthogonality contraints

Input: $\mathcal{F},X_{0}$
Parameter: Learning rate ( $\tau$ ), Max iterations (M)
Output: $\arg\min_{X}\mathcal{F}(X)\ s.t.\ X^{T}X=I$

1: Given an initial orthogonal matrix

X_{0}

2: Let

t=0

3: while

t<M

4: Compute the Gradient

G_{t}=\mathcal{DF}(X_{t})=\left(\frac{\partial\mathcal{F}(X_{t})}{\partial X_{t}{i,j}}\right)

5: Compute

A_{t}=G_{t}X_{t}^{T}-X_{t}G_{t}^{T}

6: Compute

Q_{t}=\left(I+\frac{\tau}{2}A_{t}\right)^{-1}\left(I-\frac{\tau}{2}A_{t}\right)

7: Compute

X_{t+1}=Q_{t}X_{t}

8: Update

t:=t+1

9: end while

10: return

X_{M}

where $\mathcal{F}:\mathbb{R}^{n\times p}\to\mathbb{R}$ is a differentiable function. For our purposes, $\mathcal{F}(Q_{B})=\mbox{MMD}^{2}(Z_{A},Q_{B}^{T}Z_{B}^{\prime})$ , where $Z_{A}$ and $Z_{B}^{\prime}$ are the projections of $X_{A}$ and $X_{B}$ , respectively, using Equation 6 with $Q_{A}=Q_{B}^{\prime}=I$ . $Q_{B}$ is an orthogonal (rotation or reflection) matrix that multiplies $Z_{B}^{\prime}$ .

Algorithm 1 is guaranteed to converge when the learning rate ( $\tau$ ) meets the Armijo-Wolfe conditions Nocedal and Wright (2006). However, it is not guaranteed to find the global minimum of $\mathcal{F}(X)$ . Similarly to gradient descent approaches, the algorithm might converge to a local minimum. One heuristic to alleviate this problem is to perform multiple restart with different seed points; however, this is still not guaranteed to convergence to the global minimum.

3.4 Unsupervised domain adaptation

For finding the parameters $\theta_{A}$ and $\theta_{B}$ , we can arbitrarily set $Q_{A}=I$ in Equation 5, and then use Algorithm 1, with the Maximum Mean Discrepancy, for computing $Q_{B}$ :

Q_{B}^{*}=\arg\min_{Q_{B}}\mbox{MMD}_{b}^{2}(Z_{A},Q_{B}^{T}Z^{\prime}_{B}),\ s.t.\ Q_{B}Q_{B}^{T}=I

(10)

where $Z^{\prime}_{B}$ is a dataset that contains the transformed instances $x_{B}$ using $Q_{B}^{\prime}=I$ . Finally, we can project the source and target domains to a common representation using Equation 6.

Note that Algorithm 1 requires the gradient of the $\mbox{MMD}^{2}$ with respect to the matrix $Q_{B}^{T}$ . By applying standard matrix calculus we compute (see Appendix A.3 for details):

\begin{gathered}G(Q_{B}^{T})=\frac{\partial\mbox{MMD}^{2}(Z_{A},Q_{B}^{T}Z^{\prime}_{B})}{\partial Q_{B}^{T}}=\\ -\frac{2}{mn}\sum_{i,j}^{n,m}\exp\left(-\frac{1}{2\sigma_{2}}||z_{A}^{(i)}-Q_{B}^{T}z_{B}^{(j)}||^{2}\right)\left(\frac{z_{A}^{(i)}z_{B}^{(j)T}}{\sigma^{2}}\right)\end{gathered}

(11)

Algorithm 2 Unsupervised domain adaptation with linear transformations

Input: $X_{A}\in\mathbb{R}^{i\times m},X_{B}\in\mathbb{R}^{j\times n}$ . Every row in these matrices represents an instance.
Parameter: Variance of Gaussian kernel ( $\sigma^{2}$ )
Output: $Z_{A}\in\mathbb{R}^{i\times p},Z_{B}\in\mathbb{R}^{j\times p}$ . Every row in each matrix represents an instance in the shared space.

1: Compute

\mu_{A}

and

\Sigma_{A}

of the dataset

X_{A}

2: Compute

\mu_{B}

and

\Sigma_{B}

of the dataset

X_{B}

U_{A},S_{A},V_{A}=\mbox{SVD}(\Sigma_{A})

U_{B},S_{B},V_{B}=\mbox{SVD}(\Sigma_{B})

\theta_{A}=U_{A}S_{A}^{\frac{1}{2}}

{Use only the positive eigenvalues (and their corresponding eigenvectors)}

\theta_{B}^{\prime}=U_{B}S_{B}^{\frac{1}{2}}

{Use only the positive eigenvalues (and their corresponding eigenvectors)}

Z_{A}=(\theta_{A}^{T}\theta_{A})^{-1}\theta_{A}^{T}(X_{A}-\mu_{A})

Z_{B}^{\prime}=(\theta_{B}^{\prime T}\theta_{B}^{\prime})^{-1}\theta_{B}^{\prime T}(X_{B}-\mu_{B})

9: Use Algorithm 1 to find the

Q_{B}^{T}\in\mathbb{R}^{p\times p}

that minimizes Equation 10. Compute the MMD using a Gaussian kernel with variance

\sigma_{2}

10:

Z_{B}=Z_{B}^{\prime}Q_{B}

11: return

Z_{A},Z_{B}

Algorithm 2 shows the procedure to map the source and target domain into a common space in an unsupervised way (The code is publicly available; see Appendix B ). For notation, the source domain (resp., target, shared space) is $m$ -dimensional space, (resp. $n$ -dimensional, $p$ -dimensional; here we assume that $p\leq\min(m,n)$ . For mapping into this lower dimensional space, we project the data into the first $p$ eigenvectors of the empirical covariance matrice $\Sigma_{A}$ (resp. $\Sigma_{B}$ ). The eigenvalues corresponding these eigenvectors are positive, while the other $m-p$ and $n-p$ eigenvalues will be equal to zero.

After mapping both domains to a common space, we can use the labels of the source domain to learn a predictor, and then use it to make predictions in the target domain. Section 4 will show that the Maximum Mean Discrepancy is not convex with respect to the orthogonal matrix $Q_{B}^{T}$ , meaning Algorithm 1 might converge to a local minimum. Additionally, in the unsupervised case there is an inherent identifiability problem caused by the missing labels David et al. (2010); Koller and Friedman (2009) – i.e., there are $\theta_{B}$ and $\theta_{B}^{\prime}$ such that $P(\ f(X,\theta_{B})\ )=P(\ f(X,\theta_{B}^{\prime})\ )$ . This can create an “anti-alignment” problem; see Figure 3.

Note than when the “anti-alignment” occurs, the source and target domains have the same marginal probability $P_{S}(Z)=P_{T}(Z)$ , but $P_{S}(Y\ |\ Z)\neq P_{T}(Y\ |\ Z)$ . In other words, a classifier learned on the source domain will have good performance on more data from the same domain; however, it will have very poor performance on the target domain. Zhao et al. (2019) shows that aligning the marginal probability of the covariates, then learning a good predictor on the source domain, is not sufficient for successfully performing domain adaptation.

3.5 Semi-supervised domain adaptation

If we have access to a few labeled instances in the target domain, we might reduce the chance of converging to an “anti-alignment”. Since the MMD is not convex with respect to the rotation matrix $Q_{B}^{T}$ , a common strategy is to attempt multiple re-start (with different seed points) of an iterative optimization algorithm. We then choose the one with the lowest cost. In the unsupervised case, the MMD itself is the cost function. For the semi-supervised case, we can first run Algorithm 2 for each seed point, then learn a predictor, using only the labeled data from the source domain. Then, for every alignment generated by each of the seed points, evaluate those predictors on the labeled data of the target domain, and choose the one with the lowest error.

Alternatively, we could incorporate the cross entropy loss of the source and target domains into Equation 10. This approach requires optimizing a weighted linear combination of three terms in the loss function: the MMD, the cross-entropy in the source domain, and the cross-entropy in the target domain. Since this path requires setting these three extra weights, we limited our experiments to the first approach.

4 Experiments and Results

We first show the performance of our approach to perform domain-shift adaptation in a simulated dataset, and then, in a modified version of the MNIST digit classification task.

4.1 Simulated data

For the simulated data we sampled 600 instances to creaet a dataset, $D_{z}\in\mathbb{R}^{600\times 2}$ , from a mixture of bi-variate Gaussians with parameters $\mu_{1}=\begin{bmatrix}1\\ 1\end{bmatrix}$ , $\mu_{2}=\begin{bmatrix}5\\ -5\end{bmatrix}$ , $\Sigma_{1}=\begin{bmatrix}2&0.7\\ 0.7&1\end{bmatrix}$ , $\Sigma_{2}=\begin{bmatrix}2&1\\ 1&4\end{bmatrix}$ . These instances correspond to a common shared space $Z$ . We then created two random transformation matrices $\theta_{A},\theta_{B}\in\mathbb{R}^{5\times 2}$ , and two random translation vectors $\mu_{A},\mu_{B}\in\mathbb{R}^{5}$ to create the observations. Of course, neither the real parameters, nor the instances in the shared space are visible to our algorithm.

We randomly divided $D_{z}$ into two disjoint datasets $D_{A}$ (source domain) and $D_{B}$ (target domain) with 300 instances each. Then, we created the datasets $X_{A}=D_{A}\theta_{A}^{T}+\mu_{A}$ and $X_{B}=D_{B}\theta_{B}^{T}+\mu_{B}$ . Our algorithm only sees $X_{A}$ and $X_{B}$ , which each contain 5-dimensional vectors.

Figure 4(b) shows the result of applying Algorithm 2 to the simulated datasets $X_{A}$ and $X_{B}$ . Figure 4(a), on the other hand, shows the effect of ignoring the effect of the orthogonal matrix $Q_{B}^{T}$ . In this last case we successfully mapped both datasets into the same lower-dimensional space, and that both datasets have zero mean and an identity covariance matrix; however, they are not aligned. By finding the orthogonal matrix that minimizes the MMD between the source and target datasets we can obtain the correct alignment.

As mentioned in Section 3.4, the maximum mean discrepancy is not convex with respect to the orthogonal matrix $Q_{B}^{T}$ . For 2-dimensional spaces, an orthogonal matrix is either a rotation $R=\begin{bmatrix}\cos\alpha&-\sin\alpha\\ \sin\alpha&\cos\alpha\end{bmatrix}$ or a reflection $S=\begin{bmatrix}-\cos\alpha&-\sin\alpha\\ -\sin\alpha&\cos\alpha\end{bmatrix}$ Winter (1992). Figure 5 shows the MMD between the projection of $X_{A}$ into $Z_{A}$ and the rotated (or reflected) projection of $X_{B}$ into $Z_{B}$ at different angles. Note that we have a total of 4 local minima for the simulated data. The global minimum corresponds to the proper alignment, shown in Figure 4(b). Similar to gradient descent, Algorithm 2 might converge to a local minimum depending on the seed point.

4.2 Binary digit classification

The second experiment is a variation of the digit classification task with the dataset MNIST (source domain) LeCun et al. (1998) and USPS (target domain) Hull (1994). We simplified the task from 10-class digit classification to 45 binary digit classification (0 vs 1, 0 vs 2, … , 8 vs 9).

We first trained a 10-class convolutional neural network on the training data of the source domain (60,000 images) to create image embeddings in a 20-dimensional space. The convolutional neural network contained 4 convolutional layers (32, 128, 256 and 512 filters, respectively), each followed by a MaxPooling layer ( $2\times 2$ kernel). Then we added a fully connected layer with 20 hidden neurons, and finally the output layer with 10 output neurons. While the output layer used a softmax activation function, the other layers used a rectified linear unit (ReLU) as the activation function; see Appendix B.

We used the output of the fully connected layer with 20 neurons as the image embeddings for both, the training data of the MNIST (60,000 images) and the test data USPS datasets (2,007 images). Then, for each of the 45 binary classification tasks, we compared three scenarios: (1) Baseline: learn the parameters of a logistic regression model using the MNIST dataset, and then test it on the USPS dataset. (2) Use Algorithm 2 to project the MNIST and USPS dataset into a common space of dimension 5. Then learn the parameters of a logistic regression model using only the projected data of the MNIST, and test it on the projected data of the USPS dataset. We chose a “small” dimensionality of the shared space because computing distances in high-dimensional spaces is harder because the instances sparsely populate the input space Friedman et al. (2001). (3) Similar to the second scenario, but now assume that 10% of the data in the USPS dataset is labelled (semi-supervised case).

Figure 6(a) shows the classification accuracy of the unsupervised domain adaptation approach. The number in parenthesis indicates the improvement (or decrease) in performance relative to the Scenario 1. Similarly, Figure 6(b) shows the classification accuracy of the semi-supervised experiment.

5 Discussion

As expected, the results in Figure 6(a) show that even after the reducing the discrepancy between the source and target domain, and learning an accurate classifier on the source domain, this classifier is not guaranteed to generalize to the target domain. All the boxes in orange indicate that applying our algorithm for domain-shift had a lower performance than not doing any transformation at all. Specially interesting are the cases marked in dark orange, where we observe the effect of the “anti-alignment”. On the other hand, when the alignment is done properly, there are very significant improvements in the classification accuracy. Note that for unsupervised domain adaptation there is no way to distinguish between correct and incorrect alignments.

Figure 6(b), on the other hand, shows that we avoid incorrect alignments in the semi-supervised case. Having access to a small set of labelled data allows the algorithm to identify when no domain-shift adaptation is needed (because the classifier already generalizes to the target domain), or detect the “anti-alignments” and choose the proper alignment instead; see the case of 2 vs 7, where an improper alignment occurs in the unsupervised case, but the proper alignment of the semi-supervised case increases the classification accuracy 6%. In the case of proper alignments of the data the accuracy improved in essentially all the cases up to 48%.

The dimensionality of the shared space plays an important role when performing domain-shift adaptation. While Figures 6(a) and 6(b) show the performance obtained in a shared space of dimension 5, Figures 7 and 8 in Appendix C show the performance when the dimension of the shared space is the number of positive eigenvalues in the empirical covariance matrix (13 in our experiments). The performance of the unsupervised domain adaptation degrades significantly, while the semi-supervised approach remain roughly the same. We hypothesize that this decrease in performance is due to the difficulty of reliably estimating metrics on probability distributions in high dimensional spaces with a limited number of instances Friedman et al. (2001).

In summary, we present an algorithm for domain-shift adaptation caused by arbitrary affine transformations. Our approach first projects the data into a shared low-dimensional space using the first $p$ eigenvectors of the empirical covariance matrices of the data. Then, it find an orthogonal matrix that minimize the maximum mean discrepancy between the source and target data. For the unsupervised domain adaptation, there is an unavoidable identifiabiliy problem that can be alleviated by having a few labels of the target domain (semi-supervised domain adaptation). When using the correct othogonal matrix, this effectively maps both domains into a shared space where the $P(Z_{A},Y)=P(Z_{B},Y)$ . In those cases, we can expect that a predictor learned using data from the source domain to generalize to data from a target domain.

References

Bach and Jordan [2005] Francis R Bach and Michael I Jordan. A probabilistic interpretation of canonical correlation analysis. 2005.
Ben-David et al. [2007] Shai Ben-David, John Blitzer, Koby Crammer, Fernando Pereira, et al. Analysis of representations for domain adaptation. NeurIPS, 19:137, 2007.
Chen et al. [2012] Minmin Chen, Zhixiang Xu, Kilian Q Weinberger, and Fei Sha. Marginalized denoising autoencoders for domain adaptation. In ICML, pages 1627–1634, 2012.
Chen et al. [2015] Minmin Chen, Kilian Q Weinberger, Zhixiang Xu, and Fei Sha. Marginalizing stacked linear denoising autoencoders. JMLR, 16(1):3849–3875, 2015.
Csurka [2017] Gabriela Csurka. Domain adaptation for visual applications: A comprehensive survey. arXiv preprint arXiv:1702.05374, 2017.
David et al. [2010] Shai Ben David, Tyler Lu, Teresa Luu, and Dávid Pál. Impossibility theorems for domain adaptation. In AISTATS’10, pages 129–136. JMLR Workshop and Conference Proceedings, 2010.
Friedman et al. [2001] Jerome Friedman, Trevor Hastie, Robert Tibshirani, et al. The elements of statistical learning, volume 1. Springer series in statistics New York, 2001.
Ganin et al. [2016] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. JMLR, 17(1):2096–2030, 2016.
Glorot et al. [2011] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Domain adaptation for large-scale sentiment classification: A deep learning approach. In ICML, 2011.
Gretton et al. [2012] Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. JMLR, 13(1):723–773, 2012.
Hull [1994] Jonathan J. Hull. A database for handwritten text recognition research. IEEE Transactions on pattern analysis and machine intelligence, 16(5):550–554, 1994.
Koller and Friedman [2009] Daphne Koller and Nir Friedman. Probabilistic graphical models: principles and techniques. MIT press, 2009.
Kull and Flach [2014] Meelis Kull and Peter Flach. Patterns of dataset shift. In First International Workshop on Learning over Multiple Contexts (LMCE) at ECML-PKDD., 2014.
LeCun et al. [1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
Long et al. [2018] Mingsheng Long, Zhangjie Cao, Jianmin Wang, and Michael I Jordan. Conditional adversarial domain adaptation. In NeurIPS, pages 1647–1657, 2018.
Louizos et al. [2016] Christos Louizos, Kevin Swersky, Yujia Li, Max Welling, and Richard S Zemel. The variational fair autoencoder. In ICLR, 2016.
Murphy [2012] Kevin P Murphy. Machine learning: a probabilistic perspective. MIT press, 2012.
Nocedal and Wright [2006] Jorge Nocedal and Stephen Wright. Numerical optimization. Springer Science & Business Media, 2006.
Shimodaira [2000] Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of statistical planning and inference, 90(2):227–244, 2000.
Storkey [2009] Amos Storkey. When training and test sets are different: characterizing learning transfer. Dataset shift in machine learning, 30:3–28, 2009.
Sun et al. [2016] Baochen Sun, Jiashi Feng, and Kate Saenko. Return of frustratingly easy domain adaptation. In AAAI, volume 30, 2016.
Tachet des Combes et al. [2020] Remi Tachet des Combes, Han Zhao, Yu-Xiang Wang, and Geoffrey J Gordon. Domain adaptation with conditional distribution matching and generalized label shift. NeurIPS, 33, 2020.
Tipping and Bishop [1999] Michael E Tipping and Christopher M Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3):611–622, 1999.
Tzeng et al. [2017] Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7167–7176, 2017.
Vega and Greiner [2018] Roberto Vega and Russ Greiner. Finding effective ways to (machine) learn fmri-based classifiers from multi-site data. In Understanding and Interpreting Machine Learning in Medical Image Computing Applications, pages 32–39. Springer, 2018.
Wen and Yin [2013] Zaiwen Wen and Wotao Yin. A feasible method for optimization with orthogonality constraints. Mathematical Programming, 142(1):397–434, 2013.
Wen et al. [2014] Junfeng Wen, Chun-Nam Yu, and Russell Greiner. Robust learning under uncertain test distributions: Relating covariate shift to model misspecification. In ICML, pages 631–639. PMLR, 2014.
Winter [1992] David J Winter. Matrix algebra. Macmillan, 1992.
Zhang et al. [2013] Kun Zhang, Bernhard Schölkopf, Krikamol Muandet, and Zhikun Wang. Domain adaptation under target and conditional shift. In ICML, pages 819–827. PMLR, 2013.
Zhao et al. [2018] Han Zhao, Shanghang Zhang, Guanhang Wu, José MF Moura, Joao P Costeira, and Geoffrey J Gordon. Adversarial multiple source domain adaptation. NeurIPS, 31:8559–8570, 2018.
Zhao et al. [2019] Han Zhao, Remi Tachet Des Combes, Kun Zhang, and Geoffrey Gordon. On learning invariant representations for domain adaptation. In ICML, pages 7523–7532. PMLR, 2019.

Appendix A Mathematical details

A.1 Proof of Equation 2

	$\displaystyle z^{*}$	$\displaystyle=\arg\min_{z}\|\|(x-\mu)-\theta z\|\|^{2}$
		$\displaystyle=\arg\min_{z}\left((x-\mu)-\theta z\right)^{T}\left((x-\mu)-\theta z\right)$
		$\displaystyle=\arg\min_{z}((x-\mu)^{T}-z^{T}\theta^{T})\left((x-\mu)-\theta z\right)$
		$\displaystyle=\arg\min_{z}\left(z^{T}\theta^{T}\theta z-2(x-\mu)^{T}\theta z\right)$

Taking the derivative with respect to $z$ and making it equal to the zero vector:

	$\displaystyle 0$	$\displaystyle=\frac{\partial}{\partial z}\left(z^{T}\theta^{T}\theta z-2(x-\mu)^{T}\theta z\right)$
		$\displaystyle=2\theta^{T}\theta z-2\theta^{T}(x-\mu)$
	$\displaystyle\rightarrow z$	$\displaystyle=(\theta^{T}\theta)^{-1}\theta^{T}(x-\mu)$

Note that the second derivative is always non-negative, so $z$ is a minimum.

A.2 Proof of Equation 6

Substituting $\theta=US^{\frac{1}{2}}Q$ in $z=(\theta^{T}\theta)^{-1}\theta^{T}(x-\mu)$ , and using that $Q^{T}=Q^{-1}$ for orthogonal matrices, and $(AB)^{-1}=B^{-1}A^{-1}$ for invertible matrices $A$ and $B$ .:

	$\displaystyle z$	$\displaystyle=(\theta^{T}\theta)^{-1}\theta^{T}(x-\mu)$
		$\displaystyle=\left((US^{\frac{1}{2}}Q)^{T}(US^{\frac{1}{2}}Q)\right)^{-1}(US^{\frac{1}{2}}Q)^{T}(x-\mu)$
		$\displaystyle=\left(Q^{T}S^{\frac{1}{2}}U^{T}US^{\frac{1}{2}}Q\right)^{-1}(US^{\frac{1}{2}}Q)^{T}(x-\mu)$
		$\displaystyle=\left(Q^{T}S^{\frac{1}{2}}S^{\frac{1}{2}}Q\right)^{-1}(US^{\frac{1}{2}}Q)^{T}(x-\mu)$
		$\displaystyle=\left(S^{\frac{1}{2}}Q\right)^{-1}\left(Q^{T}S^{\frac{1}{2}}\right)^{-1}(US^{\frac{1}{2}}Q)^{T}(x-\mu)$
		$\displaystyle=Q^{T}S^{-\frac{1}{2}}S^{-\frac{1}{2}}QQ^{T}S^{\frac{1}{2}}U^{T}(x-\mu)$
		$\displaystyle=Q^{T}S^{-1}S^{\frac{1}{2}}U^{T}(x-\mu)$
		$\displaystyle=Q^{T}S^{-\frac{1}{2}}U^{T}(x-\mu)$

A.3 Proof of Equation 11

For the case of the Equation 8 using the Gaussian kernel, and since $R^{T}R=I$ , the gradient of the MMD between $X$ and a linear transformation of $Y$ , $RY$ , with respect to $R$ is:

		$\displaystyle\frac{\partial}{\partial R}\mbox{MMD}_{b}^{2}[\mathcal{F},X,RY]$
		$\displaystyle=\frac{1}{m^{2}}\sum_{i,j=1}^{m}\frac{\partial}{\partial R}k(x_{i},x_{j})-\frac{2}{mn}\sum_{i,j=1}^{m,n}\frac{\partial}{\partial R}k(x_{i},Ry_{j})+$
		$\displaystyle\ \ \quad\frac{1}{n^{2}}\sum_{i,j=1}^{n}\frac{\partial}{\partial R}k(Ry_{i},Ry_{j})$

		$\displaystyle=\frac{1}{m^{2}}\sum_{i,j=1}^{m}\frac{\partial}{\partial R}\exp\left(-\frac{1}{2\sigma^{2}}\|\|x_{i}-x_{j}\|\|^{2}\right)$
		$\displaystyle\ \ \quad-\frac{2}{mn}\sum_{i,j=1}^{m,n}\frac{\partial}{\partial R}\exp\left(-\frac{1}{2\sigma^{2}}\|\|x_{i}-Ry_{j}\|\|^{2}\right)$
		$\displaystyle\ \ \quad+\frac{1}{n^{2}}\sum_{i,j=1}^{n}\frac{\partial}{\partial R}\exp\left(-\frac{1}{2\sigma^{2}}\|\|Ry_{i}-Ry_{j}\|\|^{2}\right)$

		$\displaystyle=-\frac{2}{mn}\sum_{i,j=1}^{m,n}\frac{\partial}{\partial R}\exp\left(-\frac{1}{2\sigma^{2}}(x_{i}-Ry_{j})^{T}(x_{i}-Ry_{j})\right)$
		$\displaystyle\ \ \quad+\frac{1}{n^{2}}\sum_{i,j=1}^{n}\frac{\partial}{\partial R}\exp\left(-\frac{1}{2\sigma^{2}}(Ry_{i}-Ry_{j})^{T}(Ry_{i}-Ry_{j})\right)$

		$\displaystyle=-\frac{2}{mn}\sum_{i,j=1}^{m,n}\frac{\partial}{\partial R}\exp\left(-\frac{x_{i}^{T}x_{i}-2x_{i}^{T}Ry_{j}+y_{j}^{T}y_{j}}{2\sigma^{2}}\right)$
		$\displaystyle\ \ \quad+\frac{1}{n^{2}}\sum_{i,j=1}^{n}\frac{\partial}{\partial R}\exp\left(-\frac{y_{i}^{T}y_{i}-2y_{j}^{T}y_{i}+y_{j}^{T}y_{j}}{2\sigma^{2}}\right)$

		$\displaystyle=-\frac{2}{mn}\sum_{i,j=1}^{m,n}\exp\left(-\frac{\|\|x_{i}-Ry_{j}\|\|^{2}}{2\sigma^{2}}\right)\left(\frac{x_{i}y_{j}^{T}}{\sigma^{2}}\right)$

Appendix B Code availability

The code for reproducing the results presented in this paper is publicly available at https://github.com/rvegaml/DA_Linear. It contains three main elements:

•

MLib: An in-house developed library that contains the implementation of Algorithms 1 and 2, and auxiliary functions required to reproduce our main results.
•

Simulations.ipynb: A jupyter notebook with the code to reproduce the simulated experiments.
•

BinaryDigits.ipynb: A jupyter notebook with the code to reproduce our results with MNIST dataset.

During our experiments, we did not tune any parameter. The CNN was trained for a maximum of 500 epochs, using a learning rate of $10^{-5}$ , and the default parameters of the Adam Optimizer. For the computation of the MMD we used a Gaussian kernel with $\sigma^{2}=2$ .