Generalization Bounds for Deep Transfer Learning Using Majority Predictor Accuracy

Cuong N. Nguyen$^{1}$, Lam Si Tung Ho$^{2}$, Vu Dinh$^{3}$, Tal Hassner$^{4}$, and Cuong V. Nguyen$^{1}$

$^{1}$Florida International University, USA, {cnguy049, vcnguyen}@cs.fiu.edu
$^{2}$Dalhousie University, Canada, lam.ho@dal.ca
$^{3}$University of Delaware, USA, vucdinh@udel.edu
$^{4}$Meta AI, USA, talhassner@gmail.com

Abstract

We analyze new generalization bounds for deep learning models trained by transfer learning from a source to a target task. Our bounds utilize a quantity called the majority predictor accuracy, which can be computed efficiently from data. We show that our theory is useful in practice since it implies that the majority predictor accuracy can be used as a transferability measure, a fact that is also validated by our experiments.

I Introduction

Deep transfer learning, the problem of transferring representations (or features) learned by deep neural networks from one task to another, has become a crucial part of training deep learning models in practice [14, 10, 19]. Despite this, the current literature still lacks a theory for understanding the generalization of models obtained by deep transfer learning. In this paper, we close this gap between the theory and practice of deep transfer learning by proving novel generalization bounds for models learned through such transfer learning methods.

To prove the bounds, we develop the Majority Predictor Accuracy (MPA), a simple and easy-to-compute quantity defined as the accuracy of the classifier that returns the most probable target label conditioned on a given source label. Using the MPA, we can show that when the source and target data share the input set, the true risk of the transferred model is upper bounded by the sum of the empirical risk of the source model, $1-\text{MPA}$, and an $\widetilde{\mathcal{O}}(1/\sqrt{n})$ sample complexity with high probability. We further extend this result to the more general setting where the source and target datasets contain different inputs. This extension is achieved by using dummy source labels, a technique previously developed for transferability estimation [13].

We also demonstrate the usefulness of our theoretical bounds in practice by showing empirically that the MPA can be used as a transferability measure, defined as a numeric score that can tell whether deep transfer learning would be effective when transferring between a given pair of source-target tasks. Specifically, our experiments on the large-scale CUB-200 dataset [18] show that the MPA scores are highly correlated with the actual accuracies of the transferred models with statistical significance, thus indicating that the MPA is a good measure of transferability.

To summarize, our paper makes the following contributions: (1) developing the new MPA score, (2) proving novel deep transfer learning bounds using the MPA, and (3) showing our bounds are practically useful through experiments.

Related Work. Transfer learning [10, 20] is a long-standing research area of machine learning. Several previous works have provided theoretical analyses and generalization bounds for transfer learning, especially under the domain adaptation setting, such as [5, 6, 11, 4, 1]; however, these results were not explicitly developed for deep learning and the settings commonly used in practice, where a learned representation is adapted to the new domain [14, 19]. Our paper, on the other hand, provides generalization bounds explicitly for these commonly used deep transfer learning settings. Furthermore, unlike these previous works, our bounds are useful in practice for understanding the transferability between different tasks, as demonstrated in our experiments.

Our work is also related to a recent attempt to develop transferability measures for deep transfer learning [2, 17, 13, 15, 20, 16, 8]. Transferability measures aim to estimate the effectiveness of deep transfer learning between tasks and have been used for model or task selection [2, 17, 13, 20], checkpoint ranking [8], and few-shot learning [16]. Although some theoretical properties were shown for these transferability measures [17, 13, 16], they only focused on the empirical risk instead of the true risk as in our paper.

II Deep Transfer Learning: Formal Setting

Deep transfer learning [10] refers to the problem of transferring a learned deep neural network representation from a source task to a target task. In this section, we formalize the deep transfer learning setting considered in our paper. This setting is commonly used in practice for several large-scale deep learning models [14, 19].

Let $\mathcal{S}=\{(x_{1},s_{1}),(x_{2},s_{2}),\ldots,(x_{n},s_{n})\}$ be a train dataset for a source classification task where each input-label pair $(x_{i},s_{i})\in\mathbb{R}^{d}\times[m_{S}]$ is drawn iid from a joint distribution $\mathbb{P}_{X,S}$, with $[n]=\{1,2,\ldots,n\}$ for any positive integer $n$. Consider a target classification task with a train set $\mathcal{T}=\{(x_{1},t_{1}),(x_{2},t_{2}),\ldots,(x_{n},t_{n})\}$ where each target example $(x_{i},t_{i})\in\mathbb{R}^{d}\times[m_{T}]$ is drawn iid from $\mathbb{P}_{X,T}$. We will first consider this simple case where the source and target datasets share the same inputs $\{x_{1},x_{2},\ldots,x_{n}\}$, with each $x_{i}$ being a $d$-dimensional vector having the same marginal distribution $\mathbb{P}_{X}$ in both tasks. Here the source task has $m_{S}$ classes and the target task has $m_{T}$ classes. The more general case with different input sets will be discussed in Section V.

In our deep transfer learning setting, we first train a source model $h\circ w$ using $\mathcal{S}$, where $w(x)\in\mathbb{R}^{r}$ is the $r$-dimensional representation (also called embedding or feature vector) of the input $x$ extracted from the network $w$, and $h\circ w(x)=h(w(x))\in[m_{S}]$ is the source label returned by the network $h$ with the representation $w(x)$ as input. The functions $w$ and $h$ are usually called the feature extractor and the head classifier respectively. In deep learning, the whole model $h\circ w$ is a deep neural network with $w$ being its parameters from the input up to some layer $L$, and $h$ being the network parameters from layer $L$ to the output. We obtain the optimal source model on $\mathcal{S}$ by minimizing the empirical risk (throughout our paper, we assume $\operatorname*{arg\,min}$ and $\operatorname*{arg\,max}$ follow any deterministic tie-breaking strategy):

$$w^{*},h^{*}=\operatorname*{arg\,min}_{(w,h)\in\Omega_{w}\times\Omega_{h}}\widehat{R}_{\mathcal{S}}(w,h), \tag{1}$$

where $\Omega_{w}$ and $\Omega_{h}$ are the spaces of all possible $w$'s and $h$'s respectively, and with $\mathbf{1}[\cdot]$ being the indicator function,

$$\widehat{R}_{\mathcal{S}}(w,h)=\frac{1}{n}\sum_{i=1}^{n}\mathbf{1}[s_{i}\neq h\circ w(x_{i})]. \tag{2}$$

In practice, the optimal feature extractor $w^{*}$ often learns generic feature representations (e.g., edges or shapes in images) that can be reused for several tasks, while the optimal head classifier $h^{*}$ is often specialized for a particular source task. To transfer this trained model $h^{*}\circ w^{*}$ to a target task, the usual practice is to discard $h^{*}$ and reuse $w^{*}$ for the target task. Specifically, we will re-train a new head classifier $k^{*}$ on the target dataset $\mathcal{T}$ using the features extracted from $w^{*}$.

In this paper, we allow the target head classifier to return real-valued scores for all target labels. That means for each example $x$, a head classifier $k$ on the target task would take $w^{*}(x)$ as input and return $k\circ w^{*}(x)=k(w^{*}(x))\in\mathbb{R}^{m_{T}}$, the scores (before softmax) for all target labels. We will consider the optimal target head classifier $k^{*}$ obtained by minimizing the empirical risk with a given margin $\gamma\geq 0$:

$$k^{*}=\operatorname*{arg\,min}_{k\in\Omega_{k}}\widehat{R}_{\mathcal{T},\gamma}(w^{*},k), \tag{3}$$

where $\Omega_{k}$ is the space of all $k$'s, and for all $w\in\Omega_{w}$, $k\in\Omega_{k}$:

$$\widehat{R}_{\mathcal{T},\gamma}(w,k)=\frac{1}{n}\sum_{i=1}^{n}\mathbf{1}\Big[k\circ w(x_{i})_{t_{i}}<\gamma+\max_{t\neq t_{i}}k\circ w(x_{i})_{t}\Big],$$

with $k\circ w(x_{i})_{t}$ being the $t$-th element of the vector $k\circ w(x_{i})$. Here $\widehat{R}_{\bullet,\gamma}$ is a general version of the empirical risk in Eq. (2). The margin $\gamma$ measures the gap between the prediction probability of the correct label and those of the other labels, and has often been used in generalization bounds for deep learning [3, 9].
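The margin-based empirical risk above can be illustrated with a minimal Python sketch. The function name `margin_empirical_risk` and the toy scores are assumptions made for illustration only, and labels are 0-indexed here whereas the paper indexes them from 1.

```python
def margin_empirical_risk(scores, labels, gamma=0.0):
    """Fraction of examples whose true-label score is less than
    gamma plus the largest score among the other labels."""
    errors = 0
    for row, t in zip(scores, labels):
        # Best score among all labels other than the true label t.
        best_other = max(v for j, v in enumerate(row) if j != t)
        if row[t] < gamma + best_other:
            errors += 1
    return errors / len(labels)


# Toy scores k(w(x_i)) for three examples and three target labels (made up).
scores = [[2.0, 0.5, 0.1], [0.2, 0.3, 0.1], [1.0, 1.0, 3.0]]
labels = [0, 1, 2]

risk_0 = margin_empirical_risk(scores, labels, gamma=0.0)
risk_half = margin_empirical_risk(scores, labels, gamma=0.5)
```

With these toy scores, every example is classified correctly at margin 0, but the second example wins by only 0.1, so it counts as an error once the margin is raised to 0.5. This reflects the fact that the margin risk is non-decreasing in the margin.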

Our paper shall prove generalization bounds for the optimal transferred model $k^{*}\circ w^{*}$. For this purpose, we introduce in the next section the majority predictor accuracy, a transferability measure that we will use for our bounds.

III Majority Predictor Accuracy

The Majority Predictor Accuracy (MPA) is defined as the accuracy of the simple predictor (classifier) that maps each source label to the target label with maximal empirical conditional probability. Formally, given the source dataset $\mathcal{S}$ and the target dataset $\mathcal{T}$, the empirical joint distribution between all possible source-target label pairs $(s,t)\in[m_{S}]\times[m_{T}]$ is $\hat{P}(s,t)=\frac{1}{n}|\{i\in[n]:s_{i}=s\ \mathrm{and}\ t_{i}=t\}|$, the empirical marginal distribution over the source labels is $\hat{P}(s)=\sum_{t\in[m_{T}]}\hat{P}(s,t)$ for all $s\in[m_{S}]$, and the empirical conditional distribution of a target label $t$ given a source label $s$ is $\hat{P}(t|s)=\hat{P}(s,t)/\hat{P}(s)$, for all $(s,t)\in[m_{S}]\times[m_{T}]$.

To define the MPA, consider the following majority predictor $f_{\text{mp}}$ that takes a source label $s\in[m_{S}]$ and simply returns a target label $t$ that maximizes the empirical conditional probability $\hat{P}(t|s)$:

$$f_{\text{mp}}(s)=\operatorname*{arg\,max}_{t\in[m_{T}]}\hat{P}(t|s). \tag{4}$$

The MPA is then defined as the accuracy of $f_{\text{mp}}$ on the target dataset, as stated in the following definition.

Definition 1.

The majority predictor accuracy $\text{MPA}(\mathcal{T}|\mathcal{S})$ of a target dataset $\mathcal{T}$ given a source dataset $\mathcal{S}$ is the accuracy of the majority predictor $f_{\text{mp}}$ on the target dataset:

$$\text{MPA}(\mathcal{T}|\mathcal{S})=\frac{1}{n}\sum_{i=1}^{n}\mathbf{1}[t_{i}=f_{\text{mp}}(s_{i})]. \tag{5}$$

The MPA is very simple to compute, requiring only $O(n)$ computational time by looping through the datasets a few times to compute the empirical distributions, $f_{\text{mp}}$, and its accuracy. In Section VII, we will show that it can also be used as a transferability measure that estimates the effectiveness of transfer learning between two tasks. We now prove generalization guarantees for the transferred models in the next section.
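The computation described above can be sketched in a few lines of Python. The function names and the toy label lists are illustrative assumptions, labels are 0-indexed, and ties are broken toward the smallest target label, which is one possible deterministic tie-breaking strategy. Since $\hat{P}(t|s)$ is proportional to the co-occurrence count of $(s,t)$ for a fixed $s$, the sketch maximizes counts directly.

```python
from collections import Counter, defaultdict


def majority_predictor(source_labels, target_labels):
    """Map each observed source label s to the target label t that
    co-occurs with it most often (equivalently, maximizes P_hat(t|s))."""
    counts = defaultdict(Counter)
    for s, t in zip(source_labels, target_labels):
        counts[s][t] += 1
    # Deterministic tie-breaking: highest count first, then smallest label.
    return {s: min(c, key=lambda t: (-c[t], t)) for s, c in counts.items()}


def mpa(source_labels, target_labels):
    """Accuracy of the majority predictor on the paired label lists."""
    f_mp = majority_predictor(source_labels, target_labels)
    hits = sum(1 for s, t in zip(source_labels, target_labels) if f_mp[s] == t)
    return hits / len(target_labels)


# Toy paired labels (s_i, t_i) on shared inputs (made up for illustration).
s = [0, 0, 0, 1, 1, 2]
t = [1, 1, 0, 0, 0, 1]

f_mp = majority_predictor(s, t)
score = mpa(s, t)
```

Both passes are linear in the number of examples. In this toy case the only example that disagrees with the majority predictor is the third one, so the score is 5/6.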

IV Bounds for Shared Training Inputs Setting

This section proves our generalization bounds for deep transfer learning based on the MPA where the source and target training sets are assumed to share the inputs. In particular, we bound the true risk of the transferred model $k^{*}\circ w^{*}$ on the target distribution $\mathbb{P}_{X,T}$, which is:

$$R_{T}(w^{*},k^{*})=\mathbb{P}\big(t\neq\operatorname*{arg\,max}_{i}k^{*}\circ w^{*}(x)_{i}\big)$$

for $(x,t)\sim\mathbb{P}_{X,T}$. We will prove the bounds for both fully connected neural networks and convolutional neural networks. For this purpose, we consider the head classifier $f_{\text{mp}}\circ h^{*}$ defined on any representation $w(x)\in\mathbb{R}^{r}$, where:

$$f_{\text{mp}}\circ h^{*}(w(x))=f_{\text{mp}}(h^{*}(w(x))). \tag{6}$$

Throughout our paper, we will make an assumption that using a deep neural network as the target head classifier can achieve better empirical risk than using the naive classifier $f_{\text{mp}}\circ h^{*}$. This assumption is usually satisfied in practice because of the expressiveness of neural network models [21].

Assumption 1.

For any datasets $\mathcal{S}$ and $\mathcal{T}$, there exist $\bar{\gamma}>0$ and $\bar{k}\in\Omega_{k}$ such that $\widehat{R}_{\mathcal{T},\bar{\gamma}}(w^{*},\bar{k})\leq\widehat{R}_{\mathcal{T}}(w^{*},f_{\text{mp}}\circ h^{*})$.

In this assumption, $\widehat{R}_{\mathcal{T}}(w^{*},f_{\text{mp}}\circ h^{*})$ is defined similarly as in Eq. (2). We also note that $\widehat{R}_{\bullet,\gamma}$ is non-decreasing in $\gamma$, so the assumption implies, for all $\gamma\in[0,\bar{\gamma}]$, $\widehat{R}_{\mathcal{T},\gamma}(w^{*},\bar{k})\leq\widehat{R}_{\mathcal{T},\bar{\gamma}}(w^{*},\bar{k})\leq\widehat{R}_{\mathcal{T}}(w^{*},f_{\text{mp}}\circ h^{*})$. Under this assumption, we first prove the following lemma relating the empirical risks of the optimal source and transferred models using the MPA.

Lemma 1.

Under Assumption 1, for any $\gamma\in[0,\bar{\gamma}]$, we have:

$$\widehat{R}_{\mathcal{T},\gamma}(w^{*},k^{*})\leq\widehat{R}_{\mathcal{S}}(w^{*},h^{*})+1-\text{MPA}(\mathcal{T}|\mathcal{S}).$$

Proof.

Consider the majority predictor $f_{\text{mp}}$ defined in Eq. (4). We first split the data index set $[n]$ into two non-overlapping sets:

$$I=\{i\in[n]:t_{i}=f_{\text{mp}}(s_{i})\},\quad\text{and}\quad\bar{I}=\{i\in[n]:t_{i}\neq f_{\text{mp}}(s_{i})\}.$$

Here the set $I$ (respectively, $\bar{I}$) contains indices of data points whose source-target label pairs are consistent (respectively, inconsistent) with $f_{\text{mp}}$. For any $\gamma\in[0,\bar{\gamma}]$, we have:

$$\begin{aligned}
\widehat{R}_{\mathcal{T},\gamma}(w^{*},k^{*})&\leq\widehat{R}_{\mathcal{T},\gamma}(w^{*},\bar{k})&&\text{(def. of }k^{*}\text{)}\\
&\leq\widehat{R}_{\mathcal{T},\bar{\gamma}}(w^{*},\bar{k})&&\text{(}\widehat{R}_{\mathcal{T},\gamma}\text{ is non-decreasing in }\gamma\text{)}\\
&\leq\widehat{R}_{\mathcal{T}}(w^{*},f_{\text{mp}}\circ h^{*})&&\text{(Assumption 1)}\\
&=\frac{1}{n}\sum_{i=1}^{n}\mathbf{1}[t_{i}\neq f_{\text{mp}}\circ h^{*}\circ w^{*}(x_{i})]&&\text{(def. of }\widehat{R}_{\mathcal{T}}\text{)}\\
&=\frac{1}{n}\Big(\sum_{i\in I}\mathbf{1}[t_{i}\neq f_{\text{mp}}\circ h^{*}\circ w^{*}(x_{i})]+\sum_{i\in\bar{I}}\mathbf{1}[t_{i}\neq f_{\text{mp}}\circ h^{*}\circ w^{*}(x_{i})]\Big)&&\text{(def. of }I\text{ and }\bar{I}\text{)}\\
&\leq\frac{1}{n}\Big(\sum_{i\in I}\mathbf{1}[t_{i}\neq f_{\text{mp}}\circ h^{*}\circ w^{*}(x_{i})]+|\bar{I}|\Big)\\
&=\frac{1}{n}\Big(\sum_{i\in I}\mathbf{1}[f_{\text{mp}}(s_{i})\neq f_{\text{mp}}\circ h^{*}\circ w^{*}(x_{i})]+|\bar{I}|\Big)\\
&\leq\frac{1}{n}\sum_{i\in I}\mathbf{1}[s_{i}\neq h^{*}\circ w^{*}(x_{i})]+\frac{|\bar{I}|}{n}.
\end{aligned} \tag{7}$$

By definition of $\widehat{R}_{\mathcal{S}}(w^{*},h^{*})$, we also have:

$$\begin{aligned}
\widehat{R}_{\mathcal{S}}(w^{*},h^{*})&=\frac{1}{n}\sum_{i=1}^{n}\mathbf{1}[s_{i}\neq h^{*}\circ w^{*}(x_{i})]\\
&=\frac{1}{n}\Big(\sum_{i\in I}\mathbf{1}[s_{i}\neq h^{*}\circ w^{*}(x_{i})]+\sum_{i\in\bar{I}}\mathbf{1}[s_{i}\neq h^{*}\circ w^{*}(x_{i})]\Big)\\
&\geq\frac{1}{n}\sum_{i\in I}\mathbf{1}[s_{i}\neq h^{*}\circ w^{*}(x_{i})].
\end{aligned}$$

Plugging this into Eq. (7) and noting that $\text{MPA}(\mathcal{T}|\mathcal{S})=|I|/n$, we have: $\widehat{R}_{\mathcal{T},\gamma}(w^{*},k^{*})\leq\widehat{R}_{\mathcal{S}}(w^{*},h^{*})+|\bar{I}|/n=\widehat{R}_{\mathcal{S}}(w^{*},h^{*})+1-\text{MPA}(\mathcal{T}|\mathcal{S}).$ ∎

As a remark, the bound in Lemma 1 gets tighter when $\text{MPA}(\mathcal{T}|\mathcal{S})\rightarrow 1$, that is, when $f_{\text{mp}}$ is more accurate. Using this lemma, we now prove the generalization bounds for the transferred model $k^{*}\circ w^{*}$. Section IV-A below proves the bound for fully connected neural networks, while Section IV-B proves the bound for convolutional neural networks.
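The counting argument in the proof of Lemma 1 can be checked numerically on a toy example. The sketch below is only an illustration under stated assumptions: the label lists and the source predictions `s_pred` (standing in for $h^{*}\circ w^{*}(x_{i})$) are made up, labels are 0-indexed, and every predicted source label is assumed to appear among the observed source labels so that $f_{\text{mp}}$ is defined on it. It verifies the inequality between the error count of the naive classifier $f_{\text{mp}}\circ h^{*}\circ w^{*}$ and the sum of the source error count and $|\bar{I}|$, using integer counts to avoid floating-point rounding.

```python
from collections import Counter, defaultdict


def majority_predictor(source_labels, target_labels):
    """Map each observed source label to its most frequent target label."""
    counts = defaultdict(Counter)
    for s, t in zip(source_labels, target_labels):
        counts[s][t] += 1
    # Deterministic tie-breaking: highest count first, then smallest label.
    return {s: min(c, key=lambda t: (-c[t], t)) for s, c in counts.items()}


s = [0, 0, 0, 1, 1, 2]        # true source labels s_i (made up)
t = [1, 1, 0, 0, 0, 1]        # target labels t_i (made up)
s_pred = [0, 1, 0, 1, 1, 2]   # hypothetical source predictions h*(w*(x_i))

f_mp = majority_predictor(s, t)

# Number of source errors: n times the empirical source risk.
source_errors = sum(a != b for a, b in zip(s, s_pred))
# Size of the inconsistent set I-bar: n times (1 - MPA).
mp_misses = sum(f_mp[a] != b for a, b in zip(s, t))
# Errors of the naive target classifier f_mp(h*(w*(x_i))).
naive_errors = sum(f_mp[p] != b for p, b in zip(s_pred, t))

assert naive_errors <= source_errors + mp_misses
```

Here the naive classifier makes 2 errors, the source model makes 1, and the majority predictor misses 1 pair, so the inequality holds with equality. Under Assumption 1, the trained head $k^{*}$ does at least as well as the naive classifier, which yields the lemma.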

IV-A Generalization Bound for Fully Connected Networks

In this section, we consider target models $k\circ w$ that are deep neural networks parameterized by $w=\{A^{1},A^{2},\ldots,A^{L}\}$ and $k=\{A^{L+1},A^{L+2},\ldots,A^{L_{T}}\}$ such that:

$$\begin{aligned}
w(x)&=\sigma_{L}(A^{L}\sigma_{L-1}(A^{L-1}\sigma_{L-2}(\ldots A^{1}(x)))),\text{ and }\\
k(w(x))&=A^{L_{T}}\sigma_{L_{T}-1}(A^{L_{T}-1}\sigma_{L_{T}-2}(\ldots A^{L+1}(w(x)))),
\end{aligned}$$

where $L$ is the depth of the neural network $w$, $L_{T}$ is the depth of the whole target network $k\circ w$, and $A^{i}\in\mathbb{R}^{W_{i}\times W_{i-1}}$ is the weight matrix at layer $i$ with $W_{0}=d$, $W_{L}=r$, and $W_{L_{T}}=m_{T}$. In the above formulas, $\sigma_{i}:\mathbb{R}^{W_{i}}\rightarrow\mathbb{R}^{W_{i}}$ is a non-linear activation function that is assumed to be 1-Lipschitz.

We do not make any assumption regarding the form or architecture of the source head classifier $h$, except for Assumption 1. Thus, our generalization bound in this section holds for all types of source head classifiers, including neural networks, logistic regression, support vector machines, etc. In practice, however, $h$ is usually chosen as a logistic regression or neural network for ease of implementation and better accuracy.

Following the notations in [9], in our result, we write $\|A\|_{\text{Fr}}$, $\|A\|_{\sigma}$, and $\|A\|_{p,q}$ to denote respectively the Frobenius norm, the spectral norm, and the $(p,q)$-norm of a matrix $A$. We also write $A_{i,\bullet}$ to denote the $i$-th row of $A$. We let $\bar{W}=\max_{i=1}^{L_{T}}W_{i}$ be the maximum width of the target neural network. We now state and prove our generalization bound of the true risk $R_{T}(w^{*},k^{*})=\mathbb{P}(t\neq\operatorname*{arg\,max}_{i}k^{*}\circ w^{*}(x)_{i})$ for this fully connected network setting in the theorem below.

Theorem 1.

Assume we are given some fixed reference matrices $M^{1},M^{2},\ldots,M^{L_{T}}$ representing the initialized weights of the target network. Under Assumption 1, with probability at least $1-\delta$, for all margins $\gamma\in(0,\bar{\gamma}]$, we have:

$$R_{T}(w^{*},k^{*})\leq\widehat{R}_{\mathcal{S}}(w^{*},h^{*})+(1-\text{MPA}(\mathcal{T}|\mathcal{S}))+\widetilde{\mathcal{O}}\Big(\frac{\max_{i=1}^{n}\|x_{i}\|_{\text{Fr}}\,\mathcal{F}_{\mathcal{A}}}{\gamma\sqrt{n}}\log(\bar{W})+\sqrt{\frac{\log(1/\delta)}{n}}\Big),$$

where $\mathcal{A}=(A^{*1},A^{*2},\ldots,A^{*L_{T}})$ are the weight matrices of the target fully connected neural network $k^{*}\circ w^{*}$ trained using the deep transfer learning procedure in Section II, and

$$\mathcal{F}_{\mathcal{A}}:=L_{T}\max_{i}\|A^{*L_{T}}_{i,\bullet}\|_{\text{Fr}}\Big(\prod_{i=1}^{L_{T}-1}\|A^{*i}\|_{\sigma}\Big)\Big(\sum_{i=1}^{L_{T}-1}\frac{\|A^{*i}-M^{i}\|_{2,1}^{2/3}}{\|A^{*i}\|_{\sigma}^{2/3}}+\frac{\|A^{*L_{T}}\|_{\text{Fr}}^{2/3}}{\max_{i}\|A^{*L_{T}}_{i,\bullet}\|^{2/3}_{\text{Fr}}}\Big)^{3/2}.$$

Proof.

Using Theorem 1 of [9], with probability at least $1-\delta$, for all margins $\gamma>0$, we have:

$$R_{T}(w^{*},k^{*})\leq\widehat{R}_{\mathcal{T},\gamma}(w^{*},k^{*})+\widetilde{\mathcal{O}}\Big(\frac{\max_{i=1}^{n}\|x_{i}\|_{\text{Fr}}\,\mathcal{F}_{\mathcal{A}}}{\gamma\sqrt{n}}\log(\bar{W})+\sqrt{\frac{\log(1/\delta)}{n}}\Big).$$

Combining this with Lemma 1, we have, with probability at least $1-\delta$, for all margins $\gamma\in(0,\bar{\gamma}]$:

$$R_{T}(w^{*},k^{*})\leq\widehat{R}_{\mathcal{S}}(w^{*},h^{*})+(1-\text{MPA}(\mathcal{T}|\mathcal{S}))+\widetilde{\mathcal{O}}\Big(\frac{\max_{i=1}^{n}\|x_{i}\|_{\text{Fr}}\,\mathcal{F}_{\mathcal{A}}}{\gamma\sqrt{n}}\log(\bar{W})+\sqrt{\frac{\log(1/\delta)}{n}}\Big).$$ ∎

IV-B Generalization Bound for Convolutional Neural Networks

We now consider target models $k\circ w$ that are convolutional neural networks. For these models, the matrices $A^{1},A^{2},\ldots,A^{L_{T}}$ are the filter matrices of the convolutional layers. Following [9], for each filter matrix $A^{i}$, we can construct a corresponding larger convolutional matrix $\tilde{A}^{i}$ by repeating the weights of $A^{i}$ as many times as the filter $A^{i}$ is applied. The activation functions considered here are assumed to be either ReLU or max pooling.

In our result, $\bar{W}$ is the maximum number of neurons in a single layer before pooling, counting all the channels. For each layer $i$, $w_{i}$ is the spatial width of the layer after pooling, and $B_{i}$ is the maximum $l_{2}$ norm of any convolutional patch of the layer's activations over all inputs. For any layer $i\leq L_{T}-1$, we also write $\|\tilde{A}^{i}\|_{\sigma^{\prime}}$ to denote the maximum spectral norm of any matrix obtained by deleting, for each pooling window, all but one of the corresponding rows of $\tilde{A}^{i}$. For $i=L_{T}$, $\|\tilde{A}^{L_{T}}\|_{\sigma^{\prime}}=\rho_{L_{T}}\max_{j}\|A^{L_{T}}_{j,\bullet}\|_{\text{Fr}}$, with $\rho_{L_{T}}$ being the Lipschitz constant of the activation and pooling at layer $L_{T}$. More details of the notations can be found in [9].

Theorem 2 below shows our generalization bound for this convolutional neural network setting. Similar to Section IV-A, we do not restrict the form of the source head classifier $h$, so our result will also hold for all types of source head classifiers.

Theorem 2.

Assume we are given some fixed reference matrices $M^{1},M^{2},\ldots,M^{L_{T}}$ representing the initialized weights of the target network's filter matrices. Under Assumption 1, with probability at least $1-\delta$, for all margins $\gamma\in(0,\bar{\gamma}]$, we have:

$$R_{T}(w^{*},k^{*})\leq\widehat{R}_{\mathcal{S}}(w^{*},h^{*})+(1-\text{MPA}(\mathcal{T}|\mathcal{S}))+\widetilde{\mathcal{O}}\Big(\frac{\mathcal{G}_{\mathcal{A}}}{\sqrt{n}}\log(\bar{W})+\sqrt{\frac{\log(1/\delta)}{n}}\Big),$$

where $\mathcal{A}=(A^{*1},A^{*2},\ldots,A^{*L_{T}})$ are the filter matrices of the target convolutional neural network $k^{*}\circ w^{*}$ trained using the deep transfer learning procedure in Section II, and $\mathcal{G}_{\mathcal{A}}^{2/3}:=\sum_{i=1}^{L_{T}}T_{i}^{2/3}$, where for all $i\leq L_{T}-1$,

$$T_{i}:=B_{i-1}\|(A^{*i}-M^{i})^{\top}\|_{2,1}\sqrt{w_{i}}\max_{U\leq L_{T}}\frac{\prod_{u=i+1}^{U}\|\tilde{A}^{*u}\|_{\sigma^{\prime}}}{B_{U}},$$

and $T_{L_{T}}:=B_{L_{T}-1}\|A^{*L_{T}}-M^{L_{T}}\|_{\text{Fr}}/\gamma$.

Proof.

This proof is similar to the proof of our Theorem 1 above, but replacing Theorem 1 of [9] by their Theorem 3, which states that with probability at least $1-\delta$, for all $\gamma>0$:

$$R_{T}(w^{*},k^{*})\leq\widehat{R}_{\mathcal{T},\gamma}(w^{*},k^{*})+\widetilde{\mathcal{O}}\Big(\frac{\mathcal{G}_{\mathcal{A}}}{\sqrt{n}}\log(\bar{W})+\sqrt{\frac{\log(1/\delta)}{n}}\Big).$$

The theorem holds by combining this with Lemma 1. ∎

V Bounds for Different Training Inputs Setting

Up until now, we have only considered the case where source and target datasets share the same input set. In this section, we extend our results to the case where the source and target datasets contain different inputs. Formally, let $\mathcal{S}=\{(x_{1},s_{1}),(x_{2},s_{2}),\ldots,(x_{n},s_{n})\}$ be the source dataset where $(x_{i},s_{i})\in\mathbb{R}^{d}\times[m_{S}]$ is drawn iid from a joint distribution $\mathbb{P}_{X,S}$, and $\mathcal{T}=\{(z_{1},t_{1}),(z_{2},t_{2}),\ldots,(z_{p},t_{p})\}$ be the target dataset where $(z_{i},t_{i})\in\mathbb{R}^{d}\times[m_{T}]$ is drawn iid from $\mathbb{P}_{Z,T}$. We also follow the deep transfer learning procedure in Section II and first train the optimal model $h^{*}\circ w^{*}$ on the source data $\mathcal{S}$ using Eq. (1). Then we freeze $w^{*}$ and train the target head classifier $k^{*}$ using Eq. (3) with the target dataset $\mathcal{T}$, where we will apply $w^{*}$ to the target inputs $\{z_{1},z_{2},\ldots,z_{p}\}$ to get the representations $\{w^{*}(z_{1}),w^{*}(z_{2}),\ldots,w^{*}(z_{p})\}$.

To prove the generalization bounds, we will consider a new source dataset $\tilde{\mathcal{S}}:=\{(z_{i},h^{*}\circ w^{*}(z_{i}))\}_{i=1}^{p}$ induced by $h^{*}\circ w^{*}$ and the target inputs $\{z_{1},z_{2},\ldots,z_{p}\}$. In essence, $\tilde{\mathcal{S}}$ contains the target inputs with "dummy" source labels generated by $h^{*}\circ w^{*}$. This technique of using these dummy labels was previously employed to develop the LEEP transferability measure [13], and is useful for proving our bounds as well. With the new source dataset $\tilde{\mathcal{S}}$, we consider the majority predictor $f_{\text{mp}}$ constructed from $(\tilde{\mathcal{S}},\mathcal{T})$, as well as the corresponding majority predictor accuracy $\text{MPA}(\mathcal{T}|\tilde{\mathcal{S}})$. We still keep Assumption 1 in Section IV, but adapt it to the new $h^{*}\circ w^{*}$ and $f_{\text{mp}}$. The following lemma is the analogue of Lemma 1 for the different inputs setting.

Lemma 2.

With the adapted Assumption 1, for any $\gamma\in[0,\bar{\gamma}]$, we have: $\widehat{R}_{\mathcal{T},\gamma}(w^{*},k^{*})\leq 1-\text{MPA}(\mathcal{T}|\tilde{\mathcal{S}})$.

Proof.

From Eqs. (5) and (6), and the definition of $\tilde{\mathcal{S}}$, we have:

$$\text{MPA}(\mathcal{T}|\tilde{\mathcal{S}})=\frac{1}{p}\sum_{i=1}^{p}\mathbf{1}[t_{i}=f_{\text{mp}}\circ h^{*}\circ w^{*}(z_{i})].$$

By Assumption 1, for any $\gamma\in[0,\bar{\gamma}]$, we have

$$\begin{aligned}
\widehat{R}_{\mathcal{T},\gamma}(w^{*},k^{*})&\leq\widehat{R}_{\mathcal{T}}(w^{*},f_{\text{mp}}\circ h^{*})\\
&=\frac{1}{p}\sum_{i=1}^{p}\mathbf{1}[t_{i}\neq f_{\text{mp}}\circ h^{*}\circ w^{*}(z_{i})]=1-\text{MPA}(\mathcal{T}|\tilde{\mathcal{S}}).
\end{aligned}$$ ∎
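The dummy-label construction can be sketched as follows. This is a minimal illustration under stated assumptions: `source_model` is a hypothetical stand-in for the trained $h^{*}\circ w^{*}$ (here a simple threshold on scalar inputs), the target inputs and labels are made up, and labels are 0-indexed.

```python
from collections import Counter, defaultdict


def mpa(source_labels, target_labels):
    """Majority predictor accuracy for paired (source, target) labels."""
    counts = defaultdict(Counter)
    for s, t in zip(source_labels, target_labels):
        counts[s][t] += 1
    # Majority predictor with deterministic tie-breaking (smallest label).
    f_mp = {s: min(c, key=lambda t: (-c[t], t)) for s, c in counts.items()}
    hits = sum(1 for s, t in zip(source_labels, target_labels) if f_mp[s] == t)
    return hits / len(target_labels)


def source_model(z):
    """Hypothetical stand-in for h* composed with w*: thresholds a scalar."""
    return 0 if z < 0.5 else 1


# Made-up target dataset T = {(z_i, t_i)}.
target_inputs = [0.1, 0.2, 0.4, 0.6, 0.7, 0.9]
target_labels = [0, 0, 1, 1, 1, 1]

# Dummy source labels: the source model's predictions on the target inputs.
dummy_labels = [source_model(z) for z in target_inputs]

score = mpa(dummy_labels, target_labels)
# By Lemma 2, this value upper bounds the empirical target risk of k*.
risk_bound = 1.0 - score
```

Only the target inputs and the trained source model are needed here; the original source dataset plays no role once $h^{*}\circ w^{*}$ is fixed.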

Similar to Section IV, we derive the following generalization bounds, which are analogues of Theorems 1 and 2. The proofs of these theorems are similar to those of Theorems 1 and 2, with Lemma 1 being replaced by Lemma 2.

Theorem 3.

Assume we are given some fixed reference matrices $M^{1},M^{2},\ldots,M^{L_{T}}$ representing the initialized weights of the target network. Under the adapted Assumption 1, with probability at least $1-\delta$, for all margins $\gamma\in(0,\bar{\gamma}]$, with $\mathcal{F}_{\mathcal{A}}$ defined as in Theorem 1, we have:

$$R_{T}(w^{*},k^{*})\leq 1-\text{MPA}(\mathcal{T}|\tilde{\mathcal{S}})+\widetilde{\mathcal{O}}\Big(\frac{\max_{i=1}^{p}\|z_{i}\|_{\text{Fr}}\,\mathcal{F}_{\mathcal{A}}}{\gamma\sqrt{p}}\log(\bar{W})+\sqrt{\frac{\log(1/\delta)}{p}}\Big).$$

Theorem 4.

Assume we are given some fixed reference matrices $M^{1},M^{2},\ldots,M^{L_{T}}$ representing the initialized weights of the target network's filter matrices. Under the adapted Assumption 1, with probability at least $1-\delta$, for all margins $\gamma\in(0,\bar{\gamma}]$, with $\mathcal{G}_{\mathcal{A}}$ defined as in Theorem 2, we have:

$$R_{T}(w^{*},k^{*})\leq 1-\text{MPA}(\mathcal{T}|\tilde{\mathcal{S}})+\widetilde{\mathcal{O}}\Big(\frac{\mathcal{G}_{\mathcal{A}}}{\sqrt{p}}\log(\bar{W})+\sqrt{\frac{\log(1/\delta)}{p}}\Big).$$

VI Discussions

The technique used to prove our theorems is general and can be combined with other generalization bounds for deep neural networks. Although we proved our results using the norm-based bounds of [9], we emphasize that our proof technique can also be used with other generalization bounds for neural networks, such as those of [3].

The bounds in Theorems 1 and 2 depend on both the optimal source empirical risk $\widehat{R}_{\mathcal{S}}(w^{*},h^{*})$ and $\text{MPA}(\mathcal{T}|\mathcal{S})$. These bounds get better when $\widehat{R}_{\mathcal{S}}(w^{*},h^{*})\rightarrow 0$ and $\text{MPA}(\mathcal{T}|\mathcal{S})\rightarrow 1$. The bounds in Theorems 3 and 4 do not contain the source empirical risk, since it has been indirectly measured in $\text{MPA}(\mathcal{T}|\tilde{\mathcal{S}})$ when we use $h^{*}\circ w^{*}$ to construct $\tilde{\mathcal{S}}$.

From our results, we can see that $\text{MPA}(\mathcal{T}|\mathcal{S})$ (or $\text{MPA}(\mathcal{T}|\tilde{\mathcal{S}})$ for the setting with different inputs) can be used as a transferability measure. Specifically, for well-trained and deep enough neural networks, it has been observed empirically [21] that $\widehat{R}_{\mathcal{S}}(w^{*},h^{*})\approx 0$. Furthermore, the $\widetilde{\mathcal{O}}(\cdot)$ terms in our theorems are near 0 for large enough $n$. In this case, our results imply that $\text{MPA}(\mathcal{T}|\mathcal{S})\lessapprox 1-R_{T}(w^{*},k^{*})$ or $\text{MPA}(\mathcal{T}|\tilde{\mathcal{S}})\lessapprox 1-R_{T}(w^{*},k^{*})$. This means that $\text{MPA}(\mathcal{T}|\mathcal{S})$ or $\text{MPA}(\mathcal{T}|\tilde{\mathcal{S}})$ lower bounds the expected accuracy of the transferred model $k^{*}\circ w^{*}$, and thus can be used as a transferability measure. We now validate this observation empirically.
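The rearrangement behind this observation can be written out explicitly. In the sketch below, $\varepsilon_{n}$ and $\varepsilon_{p}$ are shorthand introduced here for illustration (they are not notation from the theorems) and stand for the $\widetilde{\mathcal{O}}(\cdot)$ terms in the shared-input and different-input bounds respectively.

```latex
\begin{align*}
R_{T}(w^{*},k^{*}) &\leq \widehat{R}_{\mathcal{S}}(w^{*},h^{*}) + 1 - \text{MPA}(\mathcal{T}|\mathcal{S}) + \varepsilon_{n} \\
\Longleftrightarrow\quad \text{MPA}(\mathcal{T}|\mathcal{S}) &\leq 1 - R_{T}(w^{*},k^{*}) + \widehat{R}_{\mathcal{S}}(w^{*},h^{*}) + \varepsilon_{n}, \\[4pt]
R_{T}(w^{*},k^{*}) &\leq 1 - \text{MPA}(\mathcal{T}|\tilde{\mathcal{S}}) + \varepsilon_{p} \\
\Longleftrightarrow\quad \text{MPA}(\mathcal{T}|\tilde{\mathcal{S}}) &\leq 1 - R_{T}(w^{*},k^{*}) + \varepsilon_{p}.
\end{align*}
```

When the source empirical risk and the $\varepsilon$ terms are close to 0, both lines reduce to the approximate inequalities stated above, with $1-R_{T}(w^{*},k^{*})$ being the expected accuracy of the transferred model.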

VII Experiments

We show the usefulness of our theoretical bounds in practice by empirically illustrating the ability of the MPA to serve as a transferability measure on the large-scale Caltech-UCSD Birds-200 dataset [18], which contains 11,788 images of 200 bird species labeled with 312 binary attributes. We keep the train-test split as provided in the dataset, with 5,994 train images and 5,794 test images. We pick 4 attributes (Curved Bill, Iridescent Wings, Brown Upper Parts, and Olive Under Parts) for training source models, and randomly choose 100 different attributes as target tasks. Regarding the model architecture, we use ResNet18 [7] without the last fully connected layer as the feature extractor $w$. In all tests, we train our source model $h^{*}\circ w^{*}$ and the transferred model $k^{*}\circ w^{*}$ using the cross-entropy loss with batch size 32 and run the stochastic gradient descent optimizer with momentum for 40 epochs. The initial learning rate is set at 0.01 and is divided by 10 every 10 epochs.

Following the settings in [17, 13], we estimate the correlations between the MPA scores and the actual test accuracies of the transferred models to evaluate the relationship between these two quantities. High correlations mean the MPA score is a good measure for comparing test accuracies of the transferred models, and thus is a good transferability measure. For the 4 source tasks above with 100 randomly chosen target tasks, our experimental results give the following Pearson correlation coefficients: 0.9534 (Curved Bill), 0.9452 (Iridescent Wings), 0.9484 (Brown Upper Parts), and 0.9611 (Olive Under Parts). These coefficients show that the MPA scores and the test accuracies are highly positively correlated with statistical significance ($p<10^{-4}$), which indicates that the MPA is a reliable transferability measure for estimating the performance of transferred models.
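The evaluation protocol above amounts to computing, for each source task, the Pearson correlation between one MPA score and one test accuracy per target task. A minimal sketch of that computation follows. The function name `pearson_correlation` is an illustrative choice, and the numbers in the usage lines are made-up placeholders, not results from the paper.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have equal length")
    n = len(xs)
    if n < 2:
        raise ValueError("at least two paired observations are required")

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Unnormalized covariance and variances; the 1/n factors cancel in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")

    return cov / math.sqrt(var_x * var_y)


# Placeholder values for one source task and four hypothetical target tasks.
placeholder_mpa_scores = [0.62, 0.71, 0.80, 0.93]
placeholder_test_accuracies = [0.65, 0.70, 0.84, 0.95]
placeholder_r = pearson_correlation(placeholder_mpa_scores, placeholder_test_accuracies)
```

In practice a library routine that also reports a p-value would likely be used for the significance test; the sketch covers only the coefficient itself.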

VIII Conclusion

We proved novel generalization bounds for transfer learning of deep neural networks using a new quantity, the majority predictor accuracy, that can be computed easily and efficiently from data. We showed the usefulness of our bounds in practice by demonstrating that the majority predictor accuracy can be used for estimating the effectiveness of deep transfer learning. Our theory can potentially be extended to analyze more complex transfer scenarios such as continual learning [12].

References

  • [1] K. Azizzadenesheli, A. Liu, F. Yang, and A. Anandkumar. Regularized learning for domain adaptation under label shifts. In ICLR, 2018.
  • [2] Y. Bao, Y. Li, S. Huang, L. Zhang, L. Zheng, A. Zamir, and L. Guibas. An information-theoretic approach to transferability in task transfer learning. In ICIP, 2019.
  • [3] P. L. Bartlett, D. J. Foster, and M. J. Telgarsky. Spectrally-normalized margin bounds for neural networks. In NeurIPS, 2017.
  • [4] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan. A theory of learning from different domains. Machine learning, 79(1):151–175, 2010.
  • [5] S. Ben-David and R. Schuller. Exploiting task relatedness for multiple task learning. Learning theory and kernel machines, pages 567–580, 2003.
  • [6] J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. Wortman. Learning bounds for domain adaptation. In NeurIPS, 2007.
  • [7] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [8] J. Huang, Q. Qiu, and K. Church. Exploiting a zoo of checkpoints for unseen tasks. In NeurIPS, 2021.
  • [9] A. Ledent, W. Mustafa, Y. Lei, and M. Kloft. Norm-based generalisation bounds for deep multi-class convolutional neural networks. In AAAI, 2021.
  • [10] M. Long, H. Zhu, J. Wang, and M. I. Jordan. Deep transfer learning with joint adaptation networks. In ICML, 2017.
  • [11] Y. Mansour, M. Mohri, and A. Rostamizadeh. Domain adaptation: Learning bounds and algorithms. In COLT, 2009.
  • [12] C. V. Nguyen, A. Achille, M. Lam, T. Hassner, V. Mahadevan, and S. Soatto. Toward understanding catastrophic forgetting in continual learning. arXiv:1908.01091, 2019.
  • [13] C. V. Nguyen, T. Hassner, M. Seeger, and C. Archambeau. LEEP: A new measure to evaluate transferability of learned representations. In ICML, 2020.
  • [14] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN features off-the-shelf: An astounding baseline for recognition. In CVPR Workshops, 2014.
  • [15] Y. Tan, Y. Li, and S. Huang. OTCE: A transferability metric for cross-domain cross-task representations. In CVPR, 2021.
  • [16] X. Tong, X. Xu, S. Huang, and L. Zheng. A mathematical framework for quantifying transferability in multi-source transfer learning. In NeurIPS, 2021.
  • [17] A. T. Tran, C. V. Nguyen, and T. Hassner. Transferability and hardness of supervised classification tasks. In ICCV, 2019.
  • [18] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-UCSD Birds 200. Technical report, Caltech, 2010.
  • [19] P. N. Whatmough, C. Zhou, P. Hansen, S. K. Venkataramanaiah, J. Seo, and M. Mattina. FixyNN: Efficient hardware for mobile computer vision via transfer learning. In MLSys, 2019.
  • [20] K. You, Y. Liu, J. Wang, and M. Long. LogME: Practical assessment of pre-trained models for transfer learning. In ICML, 2021.
  • [21] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017.