
Deep Transfer Learning: Model Framework and Error Analysis

Yuling Jiao School of Mathematics and Statistics, Wuhan University, Wuhan, Hubei 430072, China Hubei Key Laboratory of Computational Science, Wuhan, Hubei 430072, China Huazhen Lin Center of Statistical Research and School of Statistics, Southwestern University of Finance and Economics, Chengdu 61113, China Yuchen Luo School of Mathematics and Statistics, Wuhan University, Wuhan, Hubei 430072, China Jerry Zhijian Yang School of Mathematics and Statistics, Wuhan University, Wuhan, Hubei 430072, China Hubei Key Laboratory of Computational Science, Wuhan, Hubei 430072, China
Abstract

This paper introduces a framework for deep transfer learning aimed at improving performance on a single-domain downstream task with a limited sample size ($m$) by leveraging information from multi-domain upstream data with a significantly larger sample size ($n$), where $m\ll n$. Our framework offers several intriguing features. First, it allows the existence of both shared and domain-specific features across multi-domain data and provides a framework for their automatic identification, achieving precise transfer and utilization of information. Second, the framework explicitly identifies upstream features that contribute to downstream tasks, establishing clear relationships between upstream domains and downstream tasks, thereby enhancing interpretability. Error analysis shows that our framework can significantly improve the convergence rate for learning Lipschitz functions in downstream supervised tasks, reducing it from $\tilde{O}(m^{-\frac{1}{2(d+2)}}+n^{-\frac{1}{2(d+2)}})$ (for “no transfer”) to $\tilde{O}(m^{-\frac{1}{2(d^{*}+3)}}+n^{-\frac{1}{2(d+2)}})$ (for “partial transfer”), and even to $\tilde{O}(m^{-1/2}+n^{-\frac{1}{2(d+2)}})$ (for “complete transfer”), where $d^{*}\ll d$ and $d$ is the dimension of the observed data. Our theoretical findings are supported by empirical experiments on image classification and regression datasets.

Keywords: transfer learning, nonparametric analysis, convergence rate.

1 Introduction

Transfer learning (TL) is a powerful technique that improves model performance in target domains with limited data by leveraging knowledge from related source domains. This approach addresses the common challenge of obtaining large datasets necessary for high predictive accuracy. TL has been successfully applied in diverse fields, including natural language processing (Ruder et al., 2019), medical image analysis (Kora et al., 2022), drug discovery (Cai et al., 2020), and materials science (Jha et al., 2019; Kim et al., 2021; Ju et al., 2021).

Due to its notable successes in practical applications, extensive research has focused on the theory of transfer learning under conditions of covariate shift and posterior shift, addressing both parametric and nonparametric models. Specifically, covariate shift refers to changes in the marginal distributions of covariates between the source and target domains, while posterior shift involves changes in the relationships between covariates and outcomes across these domains. Theoretical investigations into transfer learning within parametric models, such as high-dimensional linear regression (Li et al., 2022), generalized linear models (Bastani, 2021; Li, Zhang, Cai and Li, 2023; Tian and Feng, 2023) or Gaussian graphical models with false discovery rate control (Li, Cai and Li, 2023), have been thoroughly explored. However, once parametric models are specified, they often lack the flexibility to handle the complex posterior shifts encountered in real-world settings, leading to insufficient information transfer because different models require different parameterizations.

Nonparametric transfer learning with covariate shift has been addressed by Huang et al. (2006); Wen et al. (2014); Wang et al. (2016); Cai and Pu (2024); Schmidt-Hieber and Zamolodtchikov (2024) for regression tasks and by Ben-David et al. (2006); Blitzer et al. (2007); Mansour et al. (2009) for classification tasks. Theoretical results have been developed for both the covariate shift setting (Shimodaira, 2000; Sugiyama et al., 2007) and the posterior drift setting (Cai and Wei, 2021; Reeve et al., 2021). Particularly, minimax rates are derived under posterior drift by Cai and Pu (2024) and under covariate shift by Kpotufe and Martinet (2021), respectively. These nonparametric methods are primarily based on traditional nonparametric approximation techniques, such as local polynomial methods (Fan and Gijbels, 2018), spline approximation (Schumaker, 2007), and reproducing kernel techniques (Berlinet and Thomas-Agnan, 2011; Lv et al., 2018). In practice, these methods often face the “curse of dimensionality” when the covariate dimension $d$ is moderate to high, for instance, when $d>3$.

Thanks to their powerful function-fitting capability, well-designed architectures, effective training algorithms, and high-performance computing technologies, nonparametric analysis based on deep neural networks (DNNs) has achieved tremendous success. Deep transfer learning, which uses DNNs to transfer learned representations across tasks, has been studied in the literature. For example, Du et al. (2020), Tripuraneni et al. (2020), Tripuraneni et al. (2021), Chen et al. (2021), and the references therein develop theories that consistently demonstrate the improvement of deep transfer learning in terms of sampling efficiency, under the assumption that sufficient representations of the downstream task can be obtained from the upstream domains together with certain transferability conditions. However, these studies do not account for cases where downstream tasks may require more than the representations that the upstream tasks can provide.

While prior research has advanced our understanding of transfer learning, the existing literature lacks systematic research on the following issues, particularly within the framework of deep learning:

  • (i)

    How to learn a sufficient and invariant representation from the upstream data?

  • (ii)

    When the upstream representation is sufficient, which features are relevant for downstream prediction? And how should transfer learning be described when the upstream representation is not sufficient for the downstream task?

  • (iii)

    Given that the representations of different datasets should not be identical and there should be shared as well as domain-specific features, how can we design a model framework to accurately estimate and characterize them?

In this paper, we aim to address the aforementioned questions by proposing a deep learning framework for transfer learning. Our model accomplishes this in three key aspects. First, our model permits the presence of common and distinct features across domains in the upstream and downstream data and provides a framework to automatically identify them. Second, it explicitly expresses which upstream features contribute to the downstream tasks, establishing connections among domains and enhancing interpretability. In addition, a nonlinear function $Q^{*}$ is introduced for the downstream task to learn features that are independent of the upstream representation, determining the level of difficulty in transfer learning. In particular, in the case where $Q^{*}=0$, the upstream representation is sufficient for the downstream task, transfer from upstream to downstream is facile, leading to “complete transfer” and a parametric convergence rate. On the other hand, when specific information is encoded in a non-zero $Q^{*}$, the transfer problem becomes more challenging. Our error analysis reveals that the convergence rate varies from the slow $\mathcal{O}(m^{-\frac{\beta}{2(d+1+2\beta)}})$ to the moderate $\mathcal{O}(m^{-\frac{\beta}{2(d^{*}+1+2\beta)}})$ with $d^{*}\ll d$. These rates correspond to “no transfer”, where the downstream task necessitates learning a $d$-dimensional function, and “partial transfer”, where the downstream task entails learning a $d^{*}$-dimensional function. Importantly, we emphasize that our model framework seamlessly adapts to these scenarios, accommodating varying degrees of information transfer and yielding distinct convergence rates for the downstream task.

The remainder of this paper is structured as follows: Preliminary details about the framework, including essential definitions and model setup, are provided in Section 2. Our main theoretical results are presented in Section 3. Experimental results are showcased in Section 4, followed by a discussion in Section 5. Technical proofs are provided in the Supplementary Materials.

2 Preliminaries

Notation: We use bold letters (e.g., $\mathbf{x}$, $\mathbf{F}$) to refer to vectors or matrices. The norm $\|\cdot\|$ applied to a vector or matrix refers to its $\ell_{2}$ norm or infinity norm, respectively. The norm $\|\cdot\|_{1}$ applied to a vector or matrix refers to its $\ell_{1}$ norm or group $\ell_{1}$ norm, respectively. We use the bracketed notation $[n]=\{1,\ldots,n\}$ as shorthand for integer sets. $\|f\|_{L}$ is the Lipschitz constant of a function $f$. We write $\mathcal{U}_{r}$ for $U[0,1]^{r}$, the random vector whose coordinates are mutually independent and uniformly distributed on $[0,1]$. Throughout, $\mathcal{H}$ denotes a function class of features mapping $\mathbb{R}^{d}\to\mathbb{R}^{r}$, $\mathcal{G}$ denotes a function class of Lipschitz-constrained mappings $\mathbb{R}^{r}\to\mathbb{R}$, and $\mathcal{Q}$ denotes a function class used during fine-tuning. We use $\tilde{O}$ to denote an expression that hides polylogarithmic factors in all problem parameters.

2.1 Transfer learning based on sufficient and domain-invariant representation

2.1.1 Setup

Formally, the upstream dataset $\mathcal{D}=\{(\mathbf{X}_{i},Y_{i},S_{i})\}_{i=1}^{n}$ consists of $n$ independent and identically distributed (i.i.d.) copies of the random triple $(\mathbf{X},Y,S)$, sampled from an unknown distribution $\mathbb{P}(\mathbf{x},y,\mathbf{s})$ supported on $\mathcal{X}\times\mathcal{Y}\times\mathcal{S}$, where $\mathcal{X}\subset\mathbb{R}^{d}$, $\mathcal{Y}\subset\mathbb{R}^{1}$ and $\mathcal{S}=[p]$ denotes the domain label set. We combine all data together to define $(\mathbf{X},Y)$, whose joint distribution is $\int\mathbb{P}(\mathbf{x},y,\mathbf{s})d\mathbf{s}$. For a given domain $s\in\mathcal{S}$, suppose that there exist low-dimensional features $\mathbf{h}^{*}_{s}(\mathbf{X})$ that capture the relationship between $\mathbf{X}$ and $Y$. By allowing $\mathbf{h}^{*}_{s}(\mathbf{X})\cap\mathbf{h}^{*}_{t}(\mathbf{X})\neq\varnothing$ for $s,t\in\mathcal{S}$, we enable the features to vary or be shared across domains. Specifically, domain-specific features reduce transfer bias, enhancing accuracy, while shared features, estimated from multiple domains, improve efficiency. Here, we impose no restrictions on the patterns of variation or sharing across domains. Denote by $\mathbf{h}^{*}(\mathbf{X})\in\mathbb{R}^{r\times 1}$, $r\ll d$, the combination of $\mathbf{h}^{*}_{s}(\mathbf{X})$, $s\in[p]$. Noting that $\mathbf{X}_{s}\perp\!\!\!\perp Y_{s}\mid\mathbf{h}^{*}_{s}(\mathbf{X})$ for any given domain $s$, we then have

\mathbf{X}\perp\!\!\!\perp Y\mid\mathbf{h}^{*}(\mathbf{X}),   (1)
S\perp\!\!\!\perp\mathbf{h}^{*}(\mathbf{X}),   (2)

where $\mathbf{X}\perp\!\!\!\perp Y\mid\mathbf{h}^{*}(\mathbf{X})$ indicates that the information in $\mathbf{X}$ for predicting $Y$ is sufficiently encoded in $\mathbf{h}^{*}(\mathbf{X})$, and $S\perp\!\!\!\perp\mathbf{h}^{*}(\mathbf{X})$ signifies that the representation must be invariant across all domains in the upstream dataset. Under (1), and employing the shorthand notation $\mathbf{X}_{S},Y_{S}$ for the triple $(\mathbf{X},Y,S)$, we use the following regression model to identify and estimate $\mathbf{h}^{*}(\mathbf{X})$:

Y_{S}=\mathbf{F}^{*}_{S}\mathbf{h}^{*}(\mathbf{X}_{S})+\varepsilon_{1},\quad\mathbb{E}[\varepsilon_{1}]=0,   (3)

where the sparse vector $\mathbf{F}^{*}_{S}\in\mathbb{R}^{1\times r}$ represents the transformation applied to the representation $\mathbf{h}^{*}(\mathbf{X}_{S})$ for domain $S$, and $\varepsilon_{1}$ is bounded noise with mean zero. The sparsity of $\mathbf{F}^{*}_{S}$ is used to identify the feature subset $\mathbf{h}^{*}_{S}(\mathbf{X}_{S})\subset\mathbf{h}^{*}(\mathbf{X}_{S})$ that captures the relationship between $\mathbf{X}_{S}$ and $Y_{S}$. The sparsity sets our framework apart from existing transfer learning methods by allowing features to vary or be shared across domains, rather than exclusively being shared. With the sparsity of $\mathbf{F}_{S}$, model (3) also provides a more parsimonious and interpretable representation for each domain. Domains with different feature sets enable us to better explore their interdomain differences. For the downstream data, we focus on single-target-domain transfer learning. Let the dataset $\mathcal{T}=\{(\mathbf{X}_{T,i},Y_{T,i})\}_{i=1}^{m}$ consist of $m$ i.i.d. samples of the random variable pair $(X_{T},Y_{T})$ drawn from an unknown distribution $\mathbb{P}_{0}$, which is also supported on $\mathcal{X}\times\mathcal{Y}$. We assume the supervised downstream task data follow the regression model:

Y_{T}=\mathbf{F}_{T}^{*}\mathbf{h}^{*}(X_{T})+Q^{*}(X_{T})+\varepsilon_{2},\quad\mathbb{E}[\varepsilon_{2}]=0,   (4)

or the binary classification model

Y_{T}\sim\mathrm{Ber}(\sigma(\mathbf{F}_{T}^{*}\mathbf{h}^{*}(X_{T})+Q^{*}(X_{T}))),   (5)

where $\varepsilon_{2}$ is bounded noise with zero mean, $\mathbf{F}_{T}^{*}$ is the sparse transformation vector for the target domain, and $Q^{*}:\mathbb{R}^{d}\to\mathbb{R}$ is an additional function to be learned, reflecting that the downstream task may not share enough similarity with the upstream domains to predict $Y_{T}$ solely from the representation $\mathbf{h}^{*}(X_{T})$. Thus, $Q^{*}\neq 0$ serves the purpose of extracting additional information beyond $\mathbf{h}^{*}(X_{T})$, which is implemented by enforcing

Q^{*}(X_{T})\perp\!\!\!\perp\mathbf{h}^{*}(X_{T}).   (6)

To explicitly express the constraint (6), we further model

Q^{*}(X_{T})=q^{*}(\mathbf{A}^{*}X_{T}),   (7)

where $\mathbf{A}^{*}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d^{*}}$ with $d^{*}\leq d$, and $q^{*}:\mathbb{R}^{d^{*}}\mapsto\mathbb{R}$. In fact, if $d^{*}=d$, setting $\mathbf{A}^{*}=\mathbf{I}$ reduces $q^{*}(\mathbf{A}^{*}X_{T})$ to $q^{*}(X_{T})$ or $Q^{*}(X_{T})$, indicating no restriction in model (7). Therefore, the value of $d^{*}$ indicates the strength of the constraint (6): a smaller $d^{*}$ implies a more stringent constraint. In practice, we can determine $d^{*}$ among various candidate integers by evaluating prediction accuracy.

When $\mathbf{F}_{T}^{*}=0$, the contribution of the upstream representation to the downstream task is negligible, and we refer to this scenario as “no transfer”. On the other hand, when $Q^{*}=0$, the entirety of the upstream information can be seamlessly transferred to the downstream task. In this case of “complete transfer”, the task involves solely the estimation of the sparse vector $\mathbf{F}_{T}^{*}$ through penalized least squares. In situations where $\mathbf{F}^{*}_{T}\neq 0$ and $Q^{*}\neq 0$, we classify it as “partial transfer”. Hence, our downstream task model (4)-(7) can adaptively accommodate these three distinct scenarios, as illustrated in the synthetic sketch below. Furthermore, our Theorem 3, Remark 6 and Theorem 4 demonstrate that the end-to-end convergence rate distinguishes these settings.
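To make the three regimes concrete, the following is a minimal synthetic sketch of the generative models (3), (4) and (7). The dimensions, the particular choices of $\mathbf{h}^{*}$, $q^{*}$, $\mathbf{A}^{*}$ and the sparsity patterns are illustrative assumptions only, not part of the theory.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, d_star, p = 10, 3, 2, 4          # assumed dimensions: input, representation, A*-output, #domains

def h_star(X):
    # a hypothetical shared representation h*: R^d -> R^r
    return np.stack([np.tanh(X[:, 0] * X[:, 1]),
                     np.sin(X[:, 2]),
                     X[:, 3] ** 2 - 0.5], axis=1)

# upstream: each domain s uses a sparse row F*_s of the representation, as in model (3)
F_up = np.zeros((p, r))
F_up[0, :2] = [1.0, -0.5]              # domain-specific sparsity patterns (assumed)
F_up[1, 1:] = [0.8, 0.3]
F_up[2, 0] = 1.2
F_up[3, :] = [0.5, 0.5, 0.5]

n_per = 500
X_up = rng.uniform(0, 1, size=(p * n_per, d))
S_up = np.repeat(np.arange(p), n_per)
Y_up = np.einsum('ij,ij->i', F_up[S_up], h_star(X_up)) + 0.1 * rng.standard_normal(p * n_per)

# downstream: model (4) with Q*(x) = q*(A* x) as in (7)
A_star = rng.standard_normal((d_star, d)) / np.sqrt(d)   # assumed linear map A*: R^d -> R^{d*}
q_star = lambda z: np.cos(z[:, 0]) + 0.5 * z[:, 1]
F_T = np.array([1.0, 0.0, -0.7])                          # sparse F*_T

m = 100
X_T = rng.uniform(0, 1, size=(m, d))
# "partial transfer": both terms present; set F_T to zero for "no transfer",
# or q_star to zero for "complete transfer"
Y_T = h_star(X_T) @ F_T + q_star(X_T @ A_star.T) + 0.1 * rng.standard_normal(m)
```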

Remark 1.

Practitioners have also explored the following transfer models,

Y_{T}=f^{*}_{T}(\mathbf{h}^{*}(X_{T}))+\varepsilon_{2},\quad\mathbb{E}[\varepsilon_{2}]=0,   (8)

or the binary classification model

Y_{T}\sim\mathrm{Ber}(\sigma(f^{*}_{T}(\mathbf{h}^{*}(X_{T})))),   (9)

where $f^{*}_{T}:\mathbb{R}^{r}\to\mathbb{R}$ is a low-dimensional function to be learned nonparametrically in the downstream task. Model (8) or (9) presumes that the upstream domains supply sufficient features for the downstream task, since $f^{*}_{T}(\mathbf{h}^{*}(X_{T}))$ depends only on $\mathbf{h}^{*}(X_{T})$. In (4) or (5), we choose a linear sparse function over the nonparametric function $f^{*}_{T}(\cdot)$ for interpretability, as well as the consideration that the inherent nonparametric features are already captured by $\mathbf{h}^{*}(\mathbf{X}_{T})$.

2.1.2 Loss for representation learning

Notably, for any bijective mapping $\psi$ acting on $\mathbf{h}^{*}$, the composition $\psi\circ\mathbf{h}^{*}$ preserves the sufficiency and invariance properties required in (1)-(2). As a result, $\mathbf{h}^{*}(\mathbf{X})$ is not unique. To ensure uniqueness, following the standard approach in dimensionality reduction (Chen et al., 2024; Liu et al., 2024), we constrain the distribution of $\mathbf{h}^{*}(\mathbf{X})$ to match a given distribution $\mathcal{U}_{r}$. Specifically, we take $\mathcal{U}_{r}$ to be the uniform distribution, as the existence of such a representation is guaranteed by optimal transport theory (Villani et al., 2009). Since the entries of $\mathcal{U}_{r}$ are independent, this promotes the disentanglement of $\mathbf{h}^{*}(\mathbf{X})$.

As discussed in Section 2.1.1, for the different upstream domains, only a portion of the information in $\mathbf{h}^{*}(X_{S})$ is utilized during the prediction process. Consequently, we incorporate domain-specific extractors represented by the sparse vectors $\mathbf{F}_{S}$, and add an $\ell_{1}$ penalty to the loss to induce the sparsity of $\mathbf{F}_{S}$. For simplicity, we denote $(\mathbf{F}_{1},\ldots,\mathbf{F}_{p})$ by $\mathbf{F}\in\mathbb{R}^{pr}$. We then define the population risk as:

R(\mathbf{F},\mathbf{h})=\mathbb{E}[\|Y_{S}-\mathbf{F}_{S}\mathbf{h}(\mathbf{X}_{S})\|^{2}]+\lambda\mathcal{V}[\mathbf{h}(\mathbf{X}),S]+\tau\mathcal{W}_{1}(\mathbf{h}(\mathbf{X}),\mathcal{U}_{r})+\mu\|\mathbf{F}\|_{1},   (10)

where the first term is the regression loss for (3), and $\mathcal{V}[\mathbf{h}(\mathbf{X}),S]$ is a measure used to evaluate the independence between $\mathbf{h}(\mathbf{X})$ and $S$ required in (2), such as the distance covariance (see Section 2.2 for the detailed definition), and

\mathcal{W}_{1}(\mathbf{h}(\mathbf{X}),\mathcal{U}_{r})=\sup_{\|g\|_{L}\leq 1}\mathbb{E}_{\mathbf{x}\sim\mathbf{h}(\mathbf{X})}[g(\mathbf{x})]-\mathbb{E}_{\mathbf{y}\sim\mathcal{U}_{r}}[g(\mathbf{y})],

is the Wasserstein distance used to match the distribution of $\mathbf{h}(\mathbf{X})$ to $\mathcal{U}_{r}$, and $\lambda$, $\tau$, $\mu$ are regularization parameters. Given data $\mathcal{D}=\{(\mathbf{X}_{i},Y_{i},S_{i})\}_{i=1}^{n}$, the empirical version of (10) can be written as:

\hat{R}(\mathbf{F},\mathbf{h})=\frac{1}{n}\sum_{i=1}^{p}\sum_{j=1}^{n_{i}}(Y_{i,j}-\mathbf{F}_{i}\mathbf{h}(X_{i,j}))^{2}+\lambda\widehat{\mathcal{V}}_{n}[\mathbf{h}(\mathbf{X}),S]+\tau\hat{\mathcal{W}_{1}}(\mathbf{h}(\mathbf{X}),\mathcal{U}_{r})+\mu\|\mathbf{F}\|_{1},   (11)

where $n_{i}$ is the sample size of domain $i$ with $\sum_{i=1}^{p}n_{i}=n$, $\widehat{\mathcal{V}}_{n}[\mathbf{h}(\mathbf{X}),S]$ is the empirical form of $\mathcal{V}[\mathbf{h}(\mathbf{X}),S]$ (see Section 2.2 for details), and

\hat{\mathcal{W}_{1}}(\mathbf{h}(\mathbf{X}),\mathcal{U}_{r})=\frac{1}{K_{3}}\sup_{g\in\mathcal{G}}\frac{1}{n}\sum_{i=1}^{n}(g(\mathbf{h}(\mathbf{X}_{i}))-g(\xi_{i})),

denotes the empirical form of the Wasserstein distance, where $\{\xi_{i}\}_{i=1}^{n}$ are i.i.d. random samples drawn from $\mathcal{U}_{r}$ and the function class $\mathcal{G}$ is chosen as the norm-constrained neural network class whose Lipschitz constants are bounded by $K_{3}$; see Section 2.2 for details.

We define the estimated sparse vector $\hat{\mathbf{F}}=(\hat{\mathbf{F}}_{1},\ldots,\hat{\mathbf{F}}_{p})$ and the estimated feature representation $\hat{\mathbf{h}}$ as follows:

(\hat{\mathbf{F}},\hat{\mathbf{h}})=\operatorname*{arg\,min}_{(\mathbf{F},\mathbf{h})\in\mathcal{B}_{pr}(B)\circ\mathcal{H}}\hat{R}(\mathbf{F},\mathbf{h}),   (12)

where $\mathcal{B}_{pr}(B)$ is the ball in $\mathbb{R}^{pr}$ with radius $B$, and $\mathcal{H}$ is a deep neural network class with a prescribed network structure. We summarize the procedure for representation learning from upstream data in Algorithm 1.

Algorithm 1 Upstream training
  Input: Upstream data $\{\mathbf{X}_{i},Y_{i},S_{i}\}_{i=1}^{n}$
  Initialize $\mathbf{F}$ and $\mathbf{h}$
  repeat
     $g\leftarrow\operatorname*{arg\,max}_{g\in\mathcal{G}}\hat{\mathcal{W}_{1}}(\mathbf{h}(\mathbf{X}),\mathcal{U}_{r})$
     $(\mathbf{F},\mathbf{h})\leftarrow\operatorname*{arg\,min}_{(\mathbf{F},\mathbf{h})\in\mathcal{B}_{pr}(B)\circ\mathcal{H}}\hat{R}(\mathbf{F},\mathbf{h})$
  until convergence
  Output: $\hat{\mathbf{F}},\hat{\mathbf{h}}$
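As a concrete illustration of Algorithm 1, the following is a minimal PyTorch sketch of one training step under simplifying assumptions: a single gradient step stands in for each inner arg max/arg min, the network widths, learning rates and penalty weights are placeholders, and the distance-covariance penalty $\lambda\widehat{\mathcal{V}}_{n}[\mathbf{h}(\mathbf{X}),S]$ is omitted for brevity (a plug-in estimator is sketched in Section 2.2).

```python
import torch
import torch.nn as nn

d, r, p = 10, 3, 4                       # assumed input dimension, representation dimension, #domains
tau, mu = 1.0, 1e-3                      # tau and mu in (11); illustrative values only

h = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, r))   # feature network, stands in for H
g = nn.Sequential(nn.Linear(r, 64), nn.ReLU(), nn.Linear(64, 1))   # Wasserstein critic, stands in for G
F = nn.Parameter(torch.zeros(p, r))                                # rows play the role of F_1, ..., F_p

opt_main = torch.optim.Adam(list(h.parameters()) + [F], lr=1e-3)
opt_critic = torch.optim.Adam(g.parameters(), lr=1e-3)

def w1_hat(feats, xi):
    # empirical Wasserstein term of (11); the 1/K_3 scaling is absorbed into tau here
    return g(feats).mean() - g(xi).mean()

def upstream_step(X, Y, S):
    xi = torch.rand(X.shape[0], r)                       # i.i.d. samples from U[0,1]^r
    # inner step: ascend the critic g (one gradient step instead of a full arg max)
    opt_critic.zero_grad()
    (-w1_hat(h(X).detach(), xi)).backward()
    opt_critic.step()
    # outer step: descend the empirical risk (11) in (F, h)
    opt_main.zero_grad()
    feats = h(X)
    pred = (F[S] * feats).sum(dim=1)                     # F_S h(X_S) for each sample
    risk = ((Y - pred) ** 2).mean() + tau * w1_hat(feats, xi) + mu * F.abs().sum()
    risk.backward()
    opt_main.step()
    return risk.item()

# usage on random placeholder data
X = torch.rand(256, d); S = torch.randint(0, p, (256,)); Y = torch.randn(256)
for _ in range(5):
    upstream_step(X, Y, S)
```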

2.1.3 Loss for downstream task

For downstream learning, we plug the representation $\hat{\mathbf{h}}$ learned in Algorithm 1 into the regression and classification models (4) and (5), respectively, and fit them with the downstream data $\mathcal{T}=\{(\mathbf{X}_{T,i},Y_{T,i})\}_{i=1}^{m}$. Specifically, we estimate the optimal $\mathbf{F}_{T}^{*}$ and $Q^{*}(\cdot)=q^{*}(\mathbf{A}^{*}\cdot)$ via

(\hat{\mathbf{F}}_{T},\hat{Q})\in\operatorname*{arg\,min}_{(\mathbf{F}_{T},\mathbf{A},q)\in\mathcal{B}_{r}(B)\circ\mathcal{B}_{dd^{*}}(B)\circ\mathcal{Q}}\hat{\mathcal{L}}(\hat{\mathbf{h}},Q_{q,\mathbf{A}},\mathbf{F}_{T}),   (13)
\hat{\mathcal{L}}(\hat{\mathbf{h}},Q_{q,\mathbf{A}},\mathbf{F}_{T})=\frac{1}{m}\sum_{i=1}^{m}\ell(\mathbf{F}_{T}\hat{\mathbf{h}}(\mathbf{X}_{T,i})+q(\mathbf{A}\mathbf{X}_{T,i}),Y_{T,i})+\kappa\hat{\mathcal{V}}_{m}[\hat{\mathbf{h}},Q_{q,\mathbf{A}}]+\chi\|\mathbf{F}_{T}\|_{1}+\zeta\|\mathbf{A}\|_{F}^{2},

where $Q_{q,\mathbf{A}}(\cdot)=q(\mathbf{A}\cdot)$, $\ell(\cdot,\cdot)$ is the loss for regression or classification, $\mathcal{B}_{r}(B)$ is the ball in $\mathbb{R}^{r}$ with radius $B$, $\mathcal{Q}$ is a prescribed neural network class, the empirical distance covariance term $\hat{\mathcal{V}}_{m}[\hat{\mathbf{h}},Q_{q,\mathbf{A}}]$ enforces the independence constraint in (6), and $\kappa,\chi,\zeta$ are regularization parameters. We also define the population risk for the downstream task as $\mathcal{L}(\mathbf{h},Q_{q,\mathbf{A}},\mathbf{F}_{T})=\mathbb{E}_{X_{T},Y_{T}}[\hat{\mathcal{L}}(\mathbf{h}(X_{T}),q(\mathbf{A}X_{T}),\mathbf{F}_{T})]$.
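The following is a minimal PyTorch sketch of one step of this downstream fine-tuning objective for the regression loss, with the upstream representation frozen. All dimensions, widths, and penalty weights are illustrative assumptions, and the independence penalty uses a simple plug-in (V-statistic) distance covariance as a stand-in for $\hat{\mathcal{V}}_{m}$.

```python
import torch
import torch.nn as nn

d, r, d_star = 10, 3, 2                 # assumed dimensions
kappa, chi, zeta = 1.0, 1e-3, 1e-4      # kappa, chi, zeta in (13); illustrative values only

h_hat = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, r))  # stands in for the frozen upstream h_hat
for p_ in h_hat.parameters():
    p_.requires_grad_(False)

F_T = nn.Parameter(torch.zeros(r))
A = nn.Parameter(torch.randn(d_star, d) / d ** 0.5)                   # plays the role of A in Q_{q,A}(.) = q(A.)
q = nn.Sequential(nn.Linear(d_star, 64), nn.ReLU(), nn.Linear(64, 1)) # stands in for the class Q
opt = torch.optim.Adam([F_T, A] + list(q.parameters()), lr=1e-3)

def dcov2(Z, Y):
    # plug-in (double-centering) estimate of the squared distance covariance,
    # a simple stand-in for the U-statistic V_hat_m defined in Section 2.2
    a = torch.cdist(Z, Z); b = torch.cdist(Y, Y)
    A_ = a - a.mean(0, keepdim=True) - a.mean(1, keepdim=True) + a.mean()
    B_ = b - b.mean(0, keepdim=True) - b.mean(1, keepdim=True) + b.mean()
    return (A_ * B_).mean()

def downstream_step(X_T, Y_T):
    feats = h_hat(X_T)                           # frozen representation
    q_out = q(X_T @ A.T).squeeze(-1)             # Q_{q,A}(X_T) = q(A X_T)
    pred = feats @ F_T + q_out
    loss = ((Y_T - pred) ** 2).mean()                                  # squared loss for regression
    loss = loss + kappa * dcov2(feats, q_out.unsqueeze(-1))            # independence penalty for (6)
    loss = loss + chi * F_T.abs().sum() + zeta * (A ** 2).sum()        # l1 and Frobenius penalties
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

X_T = torch.rand(100, d); Y_T = torch.randn(100)
for _ in range(5):
    downstream_step(X_T, Y_T)
```

For classification, the squared loss above would be replaced by the logistic loss, leaving the penalty terms unchanged.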

2.2 Function class and measure

Distance covariance. We recall the concept of distance covariance (Székely et al., 2007), which characterizes the dependence of two random variables. Let $\mathfrak{i}$ be the imaginary unit $(-1)^{1/2}$. For any $\mathbf{t}\in\mathbb{R}^{d}$ and $\mathbf{s}\in\mathbb{R}^{m}$, let $\psi_{Z}(\mathbf{t})=\mathbb{E}[e^{\mathfrak{i}\mathbf{t}^{T}Z}]$, $\psi_{Y}(\mathbf{s})=\mathbb{E}[e^{\mathfrak{i}\mathbf{s}^{T}Y}]$, and $\psi_{Z,Y}(\mathbf{t},\mathbf{s})=\mathbb{E}[e^{\mathfrak{i}(\mathbf{t}^{T}Z+\mathbf{s}^{T}Y)}]$ be the characteristic functions of the random vectors $Z\in\mathbb{R}^{d}$, $Y\in\mathbb{R}^{m}$, and the pair $(Z,Y)$, respectively. The squared distance covariance $\mathcal{V}[Z,Y]$ is defined as

\mathcal{V}[Z,Y]=\int_{\mathbb{R}^{d+m}}\frac{\left|\psi_{Z,Y}(\mathbf{t},\mathbf{s})-\psi_{Z}(\mathbf{t})\psi_{Y}(\mathbf{s})\right|^{2}}{c_{d}c_{m}\|\mathbf{t}\|^{d+1}\|\mathbf{s}\|^{m+1}}\mathrm{d}\mathbf{t}\,\mathrm{d}\mathbf{s},

where $c_{d}=\frac{\pi^{(d+1)/2}}{\Gamma((d+1)/2)}$. Given $n$ i.i.d. copies $\{Z_{i},Y_{i}\}_{i=1}^{n}$ of $(Z,Y)$, an unbiased estimator of $\mathcal{V}$ is the empirical distance covariance $\widehat{\mathcal{V}}_{n}$, which can be elegantly expressed as a $U$-statistic (Huo and Székely, 2016)

\widehat{\mathcal{V}}_{n}[Z,Y]=\frac{1}{C_{n}^{4}}\sum_{1\leq i_{1}<i_{2}<i_{3}<i_{4}\leq n}k\left(\left(Z_{i_{1}},Y_{i_{1}}\right),\cdots,\left(Z_{i_{4}},Y_{i_{4}}\right)\right),

where $k$ is the kernel defined by

k\left(\left(\mathbf{z}_{1},\mathbf{y}_{1}\right),\ldots,\left(\mathbf{z}_{4},\mathbf{y}_{4}\right)\right)=\frac{1}{4}\sum_{\begin{subarray}{c}1\leq i,j\leq 4\\ i\neq j\end{subarray}}\|\mathbf{z}_{i}-\mathbf{z}_{j}\|\|\mathbf{y}_{i}-\mathbf{y}_{j}\|+\frac{1}{24}\sum_{\begin{subarray}{c}1\leq i,j\leq 4\\ i\neq j\end{subarray}}\left\|\mathbf{z}_{i}-\mathbf{z}_{j}\right\|\sum_{\begin{subarray}{c}1\leq i,j\leq 4\\ i\neq j\end{subarray}}\|\mathbf{y}_{i}-\mathbf{y}_{j}\|-\frac{1}{4}\sum_{i=1}^{4}\Big(\sum_{\begin{subarray}{c}1\leq j\leq 4\\ j\neq i\end{subarray}}\left\|\mathbf{z}_{i}-\mathbf{z}_{j}\right\|\sum_{\begin{subarray}{c}1\leq j\leq 4\\ j\neq i\end{subarray}}\|\mathbf{y}_{i}-\mathbf{y}_{j}\|\Big).
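For computation, a simpler plug-in (V-statistic) estimate of $\mathcal{V}[Z,Y]$ via double-centered pairwise distance matrices is often convenient. The sketch below uses this plug-in form rather than the $U$-statistic above (it is biased, but targets the same dependence measure), and the data it is evaluated on are random placeholders.

```python
import numpy as np

def distance_covariance_sq(Z, Y):
    """Plug-in (double-centering) estimate of the squared distance covariance V[Z, Y].

    Z: (n, d) array, Y: (n, m) array. The U-statistic above removes the O(1/n)
    bias of this estimator but measures the same quantity."""
    a = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)   # pairwise distances within Z
    b = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)   # pairwise distances within Y
    A = a - a.mean(axis=0, keepdims=True) - a.mean(axis=1, keepdims=True) + a.mean()
    B = b - b.mean(axis=0, keepdims=True) - b.mean(axis=1, keepdims=True) + b.mean()
    return (A * B).mean()

# sanity check: a dependent pair should give a larger value than an independent pair
rng = np.random.default_rng(0)
Z = rng.standard_normal((500, 2))
Y_dep = Z[:, :1] ** 2 + 0.1 * rng.standard_normal((500, 1))      # nonlinearly dependent on Z
Y_ind = rng.standard_normal((500, 1))                            # independent of Z
print(distance_covariance_sq(Z, Y_dep), distance_covariance_sq(Z, Y_ind))
```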

Hölder class. Let $\beta=s+r>0$ with $r\in(0,1]$ and $s=\lfloor\beta\rfloor\in\mathbb{N}_{0}$, where $\lfloor\beta\rfloor$ denotes the largest integer strictly smaller than $\beta$ and $\mathbb{N}_{0}$ denotes the set of non-negative integers. For a finite constant $B>0$, the Hölder class of functions $\mathcal{H}^{\beta}([0,1]^{d},B)$ is defined as

\mathcal{H}^{\beta}([0,1]^{d},B)=\Big\{f:[0,1]^{d}\rightarrow\mathbb{R},\ \max_{\|\alpha\|_{1}\leq s}\left\|\partial^{\alpha}f\right\|_{\infty}\leq B,\ \max_{\|\alpha\|_{1}=s}\sup_{x\neq y}\frac{\left|\partial^{\alpha}f(x)-\partial^{\alpha}f(y)\right|}{\|x-y\|^{r}}\leq B\Big\},

where $\partial^{\alpha}=\partial^{\alpha_{1}}\cdots\partial^{\alpha_{d}}$ with $\alpha=(\alpha_{1},\ldots,\alpha_{d})^{\top}\in\mathbb{N}_{0}^{d}$ and $\|\alpha\|_{1}=\sum_{i=1}^{d}\alpha_{i}$. The Lipschitz class $\mathcal{H}^{1}:=\mathcal{H}^{1}([0,1]^{d},K_{2})$ is a special case of the Hölder class with $\beta=1$. We adopt the Hölder class as the evaluation class, which encompasses the true functions; this is a common setup in nonparametric estimation (Györfi et al., 2002).

Norm-constrained neural network. A neural network function $\phi:\mathbb{R}^{N_{0}}\to\mathbb{R}^{N_{L+1}}$ is a function that can be parameterized as

\phi(\mathbf{x})=T_{L}(\sigma(T_{L-1}(\cdots\sigma(T_{0}(\mathbf{x}))\cdots))),   (14)

where the activation function $\sigma(x):=x\lor 0$ is applied componentwise and $T_{l}(\mathbf{x}):=A_{l}\mathbf{x}+b_{l}$ is an affine transformation with $A_{l}\in\mathbb{R}^{N_{l+1}\times N_{l}}$ and $b_{l}\in\mathbb{R}^{N_{l+1}}$ for $l=0,\dots,L$. The numbers $W=\max\{N_{1},\dots,N_{L}\}$ and $L$ are called the width and the depth of the neural network, respectively. We denote by $\mathcal{N}\mathcal{N}_{d_{1},d_{2}}(W,L)$ the set of ReLU FNN functions with width at most $W$, depth at most $L$, input dimension $d_{1}$ and output dimension $d_{2}$. When the input and output dimensions are clear from context, we simply write $\mathcal{N}\mathcal{N}(W,L)$.

We define the norm-constrained neural network class $\mathcal{NN}(W,L,K)$ as the set of functions $\phi_{\theta}\in\mathcal{NN}(W,L)$ of the form (14) that satisfy the following norm constraint on the weights:

\kappa(\theta):=\left\|A_{L}\right\|\prod_{\ell=0}^{L-1}\max\left\{\left\|\left(A_{\ell},\bm{b}_{\ell}\right)\right\|,1\right\}\leq K.

It is easy to see that $\|f\|_{L}<K$ for any $f\in\mathcal{N}\mathcal{N}(W,L,K)$. We set the network classes used in Sections 2.1.2 and 2.1.3 as follows: $\mathcal{H}=\mathcal{N}\mathcal{N}_{d,r}(W_{1},L_{1},K_{1})$, $\mathcal{G}=\mathcal{N}\mathcal{N}_{r,1}(W_{2},L_{2},K_{2})$ and $\mathcal{Q}=\mathcal{N}\mathcal{N}_{d^{*},1}(W_{3},L_{3},K)$.

3 Main results

3.1 Result for upstream training

We first make the following regularity assumptions on the target function and on the neural network function classes.

Assumption 1.

(Regularity conditions on excess risk.)

  • (A1)

    $\mathcal{X}=[0,1]^{d}$, and $|Y|\leq B$ for all $Y\in\mathcal{Y}$;

  • (A2)

    The sparse vector satisfies $\|\mathbf{F}^{*}_{s}\|\leq B$, $\forall s\in\mathcal{S}$, and each component of $\mathbf{h}^{*}$ is contained in $\mathcal{H}^{\beta}([0,1]^{d},B)$;

  • (A3)

    $\sup_{\mathbf{x}\in\mathcal{X}}\{\|g(\mathbf{h}(\mathbf{x}))\|\lor\|\mathbf{h}(\mathbf{x})\|\}\leq B$ for all $g\in\mathcal{G}$, $\mathbf{h}\in\mathcal{H}$.

Remark 2.

The regularity conditions in Assumption 1 are mild. Firstly, we impose the boundedness assumptions on $\mathcal{X}$ and $\mathcal{Y}$ in (A1) for simplicity of presentation; they can be relaxed by truncation techniques under additional exponential tail assumptions on the distributions of the covariates and the noise, as elaborated in Györfi et al. (2002). Secondly, the bounded-target assumption (A2) is standard in both parametric and nonparametric estimation. We require the network output to be bounded in (A3), which is achievable through a truncation implemented by a ReLU layer.

Theorem 1 (Upstream excess risk bound).

Suppose Assumption 1 holds. Set $\mathcal{H}=\mathcal{N}\mathcal{N}_{d,r}(W_{1},L_{1},K_{1})$ and $\mathcal{G}=\mathcal{N}\mathcal{N}_{r,1}(W_{2},L_{2},K_{2})$ according to the following table:

NN class | $\mathcal{H}$ | $\mathcal{G}$
width ($W$) | $W_{1}\asymp n^{\frac{2d+\beta}{4(d+1+\beta)}}\mathbb{I}_{\beta\leq 2}+n^{\frac{2d+\beta}{2(2d+3\beta)}}\mathbb{I}_{\beta>2}$ | $W_{2}\gtrsim n^{\frac{2r+\beta}{4(d+1+\beta)}}\mathbb{I}_{\beta\leq 2}+n^{\frac{2r+\beta}{2(2d+3\beta)}}\mathbb{I}_{\beta>2}$
depth ($L$) | $L_{1}\geq 2\left\lceil\log_{2}(r+\lfloor\beta\rfloor)\right\rceil+2$ | $L_{2}\geq 2\left\lceil\log_{2}(r+\lfloor\beta\rfloor)\right\rceil+2$
norm constraint ($K$) | $K_{1}\asymp n^{\frac{d+1}{2(d+\beta+1)}}\mathbb{I}_{\beta\leq 2}+n^{\frac{d+1}{2d+3\beta}}\mathbb{I}_{\beta>2}$ | $K_{2}\asymp n^{\frac{r+1}{2(d+1+\beta)}}\mathbb{I}_{\beta\leq 2}+n^{\frac{r+1}{2d+3\beta}}\mathbb{I}_{\beta>2}$

Then the ERM solution $(\hat{\mathbf{F}},\hat{\mathbf{h}})$ from (12) satisfies:

\mathbb{E}[R(\hat{\mathbf{F}},\hat{\mathbf{h}})-R(\mathbf{F}^{*},\mathbf{h}^{*})]\precsim\tilde{O}\left(n^{-\frac{\beta}{2(d+1+\beta)}}\mathbb{I}_{\beta\leq 2}+n^{-\frac{\beta}{2d+3\beta}}\mathbb{I}_{\beta>2}\right).   (15)
Remark 3.

Theorem 1 illustrates that, with an appropriate configuration of neural networks in terms of width, depth, and norm constraints, it is possible to design ERM solutions for upstream representation learning that exhibit a rate for nonparametric estimation.

A key feature of Theorem 1 is that, by appropriately controlling the norm constraint on the network, the rate can be achieved with only a lower-bound restriction on the depth and width of $\mathcal{G}$. Hence, we can attain the consistency results within an over-parameterized network class, which, to the best of our current understanding, is a novel finding in the realm of deep transfer learning.

Tripuraneni et al. (2020, Theorem 4) presented a generalization bound for multitask transfer learning. However, their analysis is based on a well-specified neural network model, where the target function is also a network, without accounting for the approximation error as we have done.

Since the true vectors $\mathbf{F}^{*}_{s}$, $s\in\mathcal{S}$, are sparse, we also want to derive consistency of the estimation error. Hence, we introduce the following assumptions.

Assumption 2.

(Regularity conditions on estimation).

  • (A4)

    Denote $\hat{\mathbf{H}}_{i}=\frac{1}{\sqrt{n}}(\hat{\mathbf{h}}(X_{i,1}),\ldots,\hat{\mathbf{h}}(X_{i,n_{i}}))$. We assume that $\|\hat{\mathbf{H}}_{i}\|_{2}>\sqrt{B_{1}}$ for all $i\in[p]$, where $B_{1}>0$ is a constant;

  • (A5)

    The ERM solution $\hat{\mathbf{h}}$ from (12) satisfies $\mathbb{E}[\|\mathbf{h}^{*}-\hat{\mathbf{h}}\|_{L_{1}(\mathbb{P}_{\mathbf{X}})}]\asymp\mathbb{E}[\mathcal{W}_{1}(\hat{\mathbf{h}}(\mathbf{X}),\mathbf{h}^{*}(\mathbf{X}))]$.

Remark 4.

Assumption (A4) is mild because $r\ll n$, which ensures that each $\hat{\mathbf{H}}_{i}^{T}$ easily attains full column rank. Assumption (A5) is a technical condition requiring that there exists a constant $B_{2}$ such that $\mathbb{E}[\|\mathbf{h}^{*}-\hat{\mathbf{h}}\|_{L_{1}(\mathbb{P}_{\mathbf{X}})}]\leq B_{2}\mathbb{E}[\mathcal{W}_{1}(\hat{\mathbf{h}}(\mathbf{X}),\mathbf{h}^{*}(\mathbf{X}))]$, since the reverse inequality holds trivially.

Theorem 2 (Estimation error of $\hat{\mathbf{F}}$).

Under the conditions of Theorem 1 and Assumption 2, for any $\delta>0$, there exist a constant $\gamma>0$ and a constant $C(\gamma,B,B_{1},B_{2},r,p,\delta)$ such that if we set $\mu=\frac{\gamma}{2Brp}(n^{-\frac{\beta}{2(d+1+\beta)}}\mathbb{I}_{\beta\leq 2}+n^{-\frac{\beta}{2d+3\beta}}\log n\,\mathbb{I}_{\beta>2})$, then with probability at least $1-2\delta$ over the draw of the dataset $\mathcal{D}$,

\|\hat{\mathbf{F}}-\mathbf{F}^{*}\|_{2}\leq C(\gamma,B,B_{1},B_{2},r,p,\delta)\left(n^{-\frac{\beta}{2(d+1+\beta)}}\mathbb{I}_{\beta\leq 2}+n^{-\frac{\beta}{2d+3\beta}}\log n\,\mathbb{I}_{\beta>2}\right).

3.2 Results for downstream prediction

First, we state our assumptions on the smoothness of the target function and on transferability.

Assumption 3.

(Regularity conditions for the downstream task). (A6) Each entry of $q^{*}$ is contained in $\mathcal{H}^{\beta}([-1,1]^{d},B)$, $\|\mathbf{F}_{T}^{*}\|_{\infty}\leq B$ and $\|\mathbf{A}^{*}\|_{F}\leq B$; (A7) $\ell(\cdot,y)$ is Lipschitz with respect to $y\in\mathcal{Y}$ with Lipschitz constant bounded by $L_{0}$, and $|\ell(\cdot,\cdot)|\leq B_{3}$; (A8) $\mathcal{W}_{1}(\mathbf{X},\mathbf{X}_{T})\leq\omega$ with $\omega\asymp\left(n^{-\frac{1}{2}}\mathbb{I}_{\beta\leq 2}+n^{-\frac{\beta+d+1}{2d+3\beta}}\mathbb{I}_{\beta>2}\right)$; (A9) Consider $\mathcal{H}_{\eta}=\{\mathbf{h}\in\mathcal{H}\mid\mathbb{E}[\mathcal{W}_{1}(\mathbf{h}(X_{T}),\mathbf{h}^{*}(X_{T}))]\leq\eta\}$ with $\eta\precsim\tilde{O}(n^{-\frac{\beta}{2(d+1+\beta)}}\mathbb{I}_{\beta\leq 2}+n^{-\frac{\beta}{2d+3\beta}}\mathbb{I}_{\beta>2})$.
Suppose $\sup_{\mathbf{h}\in\mathcal{H}_{\eta}}\mathbb{E}[\mathbb{E}_{Y_{T}}[\mathcal{W}_{1}(\mathbf{F}_{T}^{*}\mathbf{h}(X_{T})|Y_{T},\ \mathbf{F}_{T}^{*}\mathbf{h}^{*}(X_{T})|Y_{T})]]\leq\nu\,\mathbb{E}[\mathcal{W}_{1}(\mathbf{h}(X_{T}),\mathbf{h}^{*}(X_{T}))]$, where $\nu>0$ is a constant.

Remark 5.

Conditions (A6)-(A7) are standard requirements in the context of nonparametric supervised learning. Condition (A8) characterizes the similarity between the upstream and downstream tasks, a prerequisite for transfer feasibility. Let $\mathbb{P}_{\mathbf{X}}$ and $\mathbb{P}_{T}$ denote the marginal distributions of $\mathbf{X}$ and $X_{T}$ in the upstream and downstream tasks, respectively. In previous work, specifically in Tripuraneni et al. (2020), the condition $\mathbb{P}_{\mathbf{X}}=\mathbb{P}_{T}$, or $\mathcal{W}_{1}(\mathbf{X},X_{T})=0$, was deemed necessary. In contrast, our assumption relaxes this requirement, allowing $\mathcal{W}_{1}(\mathbf{X},X_{T})$ to be small, indicating that a minor shift in the input data is acceptable. Condition (A9) constitutes a technical requirement on local conditional distribution matching. Specifically, it implies that when the distribution of $\mathbf{h}(X_{T})$ closely approximates that of $\mathbf{h}^{*}(X_{T})$, the average conditional Wasserstein distance between the representations $\mathbf{h}(X_{T})$ and $\mathbf{h}^{*}(X_{T})$ is also small. In a recent study on domain adaptation, a similar, albeit more stringent, assumption known as the conditional invariant component is employed, as outlined in Wu et al. (2023).

Theorem 3 (Excess risk bound for downstream task).

Under Assumptions 1 and 3, we plug the learned representation $\hat{\mathbf{h}}$ described in Theorem 1 into (13), and set the network class $\mathcal{Q}=\mathcal{NN}_{d^{*},1}(W_{3},L_{3},K)$ satisfying $W_{3}\gtrsim m^{\frac{2d^{*}+\beta}{4(d^{*}+1+\beta)}}$, $L_{3}\geq 2\left\lceil\log_{2}(d^{*}+s)\right\rceil+2$, $K\asymp m^{\frac{d^{*}+1}{2(d^{*}+1+\beta)}}$. Then the ERM solutions $\hat{\mathbf{F}}_{T}$ and $\hat{Q}$ enjoy the following:

\mathbb{E}[\mathcal{L}(\hat{\mathbf{h}},\hat{Q},\hat{\mathbf{F}}_{T})-\mathcal{L}(\mathbf{h}^{*},Q^{*},\mathbf{F}^{*}_{T})]\leq\tilde{O}\left(n^{-\frac{\beta}{2(d+1+\beta)}}\mathbb{I}_{\beta\leq 2}+n^{-\frac{\beta}{2d+3\beta}}\mathbb{I}_{\beta>2}\right)+\mathcal{O}\left(m^{-\frac{\beta}{2(d^{*}+1+2\beta)}}\right).
Remark 6.

The above Theorem 3 demonstrates that the end-to-end prediction accuracy for the downstream task is $\mathcal{O}(m^{-\frac{\beta}{2(d^{*}+1+2\beta)}})$ when the error from upstream training is negligible due to the huge sample size $n$.

Remark 7.

In the challenging scenario of “no transfer”, where $\mathbf{F}_{T}^{*}=0$ and $d^{*}=d$, the convergence rate reverts to $\mathcal{O}(m^{-\frac{\beta}{2(d+1+2\beta)}})$. This rate aligns with the convergence rate achieved using ERM for learning the $d$-dimensional function $Q^{*}$ in classification (Shen et al., 2022) and is slower than the minimax rate observed in regression (Schmidt-Hieber, 2019; Nakada and Imaizumi, 2020; Farrell et al., 2021; Chen et al., 2022; Kohler et al., 2022; Fan and Gu, 2023; Bhattacharya et al., 2023; Jiao, Shen, Lin and Huang, 2023). Again, as in Remark 3 on Theorem 1, the convergence rate established in Theorem 3 is applicable to deep learning models characterized by over-parametrization. In the case of “partial transfer”, where $\mathbf{F}_{T}^{*}\neq 0$ and $Q^{*}(\cdot)=q^{*}(\mathbf{A}^{*}\cdot)$ with $d^{*}<d$, the convergence rate improves from $\mathcal{O}(m^{-\frac{\beta}{2(d+1+2\beta)}})$ to $\mathcal{O}(m^{-\frac{\beta}{2(d^{*}+1+2\beta)}})$, benefiting from the knowledge embedded in the upstream representation. In the case of “complete transfer”, where $Q^{*}=0$, the convergence rate is further enhanced from $\mathcal{O}(m^{-\frac{\beta}{2(d+1+2\beta)}})$ to $\mathcal{O}(\frac{1}{\sqrt{m}})$. This enhancement is elaborated in detail in Theorem 4 below, for which we first give some assumptions; a small numeric comparison of the three rates is also provided below.
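Before turning to these assumptions, the following snippet makes the gap between the three regimes tangible by evaluating the downstream rates from this remark at a few sample sizes $m$, for an assumed configuration $\beta=1$, $d=100$, $d^{*}=5$; constants and logarithmic factors are ignored, so only the exponents matter.

```python
# downstream error rates from Remark 7, constants and log factors ignored
beta, d, d_star = 1.0, 100, 5            # assumed smoothness and dimensions

def rate_no_transfer(m):        return m ** (-beta / (2 * (d + 1 + 2 * beta)))
def rate_partial_transfer(m):   return m ** (-beta / (2 * (d_star + 1 + 2 * beta)))
def rate_complete_transfer(m):  return m ** (-0.5)

for m in (10 ** 3, 10 ** 4, 10 ** 5):
    print(m, rate_no_transfer(m), rate_partial_transfer(m), rate_complete_transfer(m))
```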

Assumption 4.

(Regularity conditions on estimation error for downstream task).

  • (A10)

    Denote $\hat{\mathbf{H}}_{T}=\frac{1}{\sqrt{m}}(\hat{\mathbf{h}}(\mathbf{X}_{T,1}),\ldots,\hat{\mathbf{h}}(\mathbf{X}_{T,m}))$. We assume that $\|\hat{\mathbf{H}}_{T}\|_{2}>\sqrt{B_{1}}$;

  • (A11)

    $\mathbb{E}[\|\mathbf{h}^{*}-\hat{\mathbf{h}}\|_{L_{1}(\mathbb{P}_{X_{T}})}]\asymp\mathbb{E}[\mathcal{W}_{1}(\hat{\mathbf{h}}(X_{T}),\mathbf{h}^{*}(X_{T}))]$.

Theorem 4.

(Estimation error for “complete transfer”) Under Assumptions 1, 3 and 4, for any $\delta>0$, there exist a constant $\gamma_{1}>0$ and constants $C_{T,1}(\gamma_{1},B,B_{2},r,\delta)$ and $C_{T,2}(\gamma_{1},B,B_{2},r,\delta)$ such that if we set $\chi=\frac{1}{\sqrt{m}}$ in the least squares regression for complete transfer, then with probability at least $1-2\delta$ over the draw of the dataset $\mathcal{T}$,

\mathbb{E}_{\mathcal{D}}[\|\mathbf{F}^{*}_{T}-\hat{\mathbf{F}}_{T}\|_{2}]\leq C_{T,1}(B,r,\delta)m^{-\frac{1}{2}}+C_{T,2}(\gamma_{1},B,B_{2},r,\delta)\left(n^{-\frac{\beta}{2(d+1+\beta)}}\mathbb{I}_{\beta\leq 2}+n^{-\frac{\beta}{2d+3\beta}}\log n\,\mathbb{I}_{\beta>2}\right).

This shows the sparsity of $\hat{\mathbf{F}}_{T}$ when Assumptions 3 and 4 hold and $m$ is considerably large.

Remark 8.

Notably, in the binary classification problem, we can replace the loss with the logistic loss and utilize its local strong convexity to obtain a similar result.

4 Experiment on four real classification datasets and one regression dataset

We substantiate the effectiveness of our method through experimental validation using four image classification datasets and one regression dataset. For simplicity, the following tables omit the percentage sign (%) when presenting classification accuracy. More details on the hyperparameter settings are given in the supplementary material.

Datasets. Among the multi-domain image datasets, PACS (Li et al., 2017), VLCS (Fang et al., 2013), TerraIncognita (Beery et al., 2018) and OfficeHome (Venkateswara et al., 2017) are widely employed. VLCS is a challenging dataset composed of 10,700 images. It encompasses four domains, namely VOC2007 (V), LabelMe (L), Caltech (C), and Sun (S), along with 5 classes. PACS is a dataset comprising 9,900 images, with 7 classes and 4 domains. TerraIncognita is a geographic dataset consisting of images that cover a wide variety of environments and objects; it consists of 4 domains and 24,700 images. OfficeHome contains images gathered from four distinct domains, Art (A), Clipart (C), Product (P), and Real-world (R), along with corresponding annotations for classification tasks in office environments; it comprises 15,500 images partitioned into 65 object categories. All of these classification datasets consist of image data from four distinct domains.

The regression dataset employed is the tool wear dataset from the 2010 PHM Challenge (Li et al., 2009), which is among the rare common datasets utilized for multi-domain regression. This dataset is composed of labeled data spanning three domains, specifically C1, C4, and C6. The original dataset contains sensing signals sampled at a frequency of 50 kHz. These signals encompass cutting force signals, vibration signals in the x, y, and z directions, as well as acoustic emission signals that are generated during the processing stage. We use cutting force signals in three directions as input and the average wear values of the three blades for each tool as output.

Model selection. In the classification tasks, we partition the data from the different domains into training and validation sets in an 8:2 ratio. The transfer process involves the following steps: withholding the data from one domain, merging the training sets of the remaining domains to construct the upstream training set, and merging their validation sets to construct the corresponding validation set. For each domain, we vary the hyperparameters $\lambda$, $\tau$ and $\mu$, selecting the model with the highest accuracy on the validation set constructed from the upstream training data. During the downstream task, training is conducted on the training set of the reserved domain while adjusting the hyperparameters $\kappa,\chi,\zeta$ and $d^{*}$. The model with the highest training accuracy is chosen, and its accuracy on the withheld domain's validation set is reported as the final accuracy. In the regression task, the only distinction is that data from the various domains are partitioned into training and validation sets at a 9:1 ratio. Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) are used to evaluate the performance of the methods.

Network structure. The network structure of $\mathbf{h}$ in the upstream classification task is defined as $\mathbf{h}=\mathrm{LinMLP}\circ\mathrm{FeExtResNet50}$, where $\mathrm{LinMLP}$ is a fully connected layer and $\mathrm{FeExtResNet50}$ is the feature extractor of ResNet50 (He et al., 2016) pretrained on the ImageNet (Russakovsky et al., 2015) dataset. The discriminator network for $\mathcal{G}$ in the upstream task is implemented as a 3-layer fully connected ReLU network. We use EfficientNet's feature extractor (Tan and Le, 2019), pretrained on the ImageNet dataset, as the additional feature extractor $Q_{q,\mathbf{A}}$. For the regression task, we replace the ResNet50 and EfficientNet feature extractors with convolutional neural networks of small size.
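For concreteness, the following is a minimal torchvision-based sketch of this architecture: the ImageNet-pretrained ResNet50 feature extractor composed with a fully connected layer for $\mathbf{h}$, a 3-layer ReLU critic for $\mathcal{G}$, and an EfficientNet feature extractor for the $Q_{q,\mathbf{A}}$ branch. The representation dimension of 1024 follows Section 4, while the EfficientNet variant (b0), the critic widths, and the removal of the classification heads are our own assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

rep_dim = 1024                                   # representation dimension chosen in Section 4

# h = LinMLP o FeExtResNet50: ResNet50 without its final fc head, then a linear map to rep_dim
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)   # weights downloaded on first use
fe_resnet = nn.Sequential(*list(resnet.children())[:-1], nn.Flatten())    # 2048-dim pooled features
h = nn.Sequential(fe_resnet, nn.Linear(2048, rep_dim))

# 3-layer fully connected ReLU critic for G (hidden widths are assumptions)
g = nn.Sequential(nn.Linear(rep_dim, 512), nn.ReLU(),
                  nn.Linear(512, 512), nn.ReLU(),
                  nn.Linear(512, 1))

# additional extractor for the Q_{q,A} branch: EfficientNet features (variant b0 assumed)
eff = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.IMAGENET1K_V1)
fe_eff = nn.Sequential(eff.features, nn.AdaptiveAvgPool2d(1), nn.Flatten())  # 1280-dim features
q_net = nn.Sequential(fe_eff, nn.Linear(1280, 1))

x = torch.randn(2, 3, 224, 224)
print(h(x).shape, g(h(x)).shape, q_net(x).shape)   # (2, 1024), (2, 1), (2, 1)
```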

Comparison with other works. Two classes were selected from the PACS dataset to compare our method with the KNN-based approach proposed by Cai and Wei (2021), as their method was designed specifically for binary classification. The comparison results are presented in Table 1. The KNN-based method performs poorly because it is inherently unsuitable for handling high-dimensional data.

Table 1: The accuracy of two selected classes from four domains of PACS
Method/Domain Art Cartoon Photo sketch
KNN-based 62.7 67.5 57.7 49.3
Ours 96.1 97.1 98.7 92.7
Table 2: The accuracy of different training methods on typical datasets
Algorithm TerraInc PACS VLCS Office-home Avg
ERM (Vapnik, 1998) 87.1 ± 0.1 94.1 ± 0.3 82.0 ± 0.6 79.9 ± 0.3 85.9 ± 0.3
DRO (Sagawa et al., 2019) 87.5 ± 0.5 94.4 ± 0.0 82.2 ± 0.5 80.9 ± 0.5 85.7 ± 0.2
IRM (Arjovsky et al., 2020) 82.2 ± 0.9 93.6 ± 0.1 82.0 ± 0.4 76.2 ± 0.4 83.5 ± 0.3
DANN (Ganin et al., 2016) 87.1 ± 0.2 93.7 ± 0.4 82.1 ± 0.9 79.4 ± 0.1 85.6 ± 0.3
SagNet (Nam et al., 2021) 87.3 ± 0.3 94.0 ± 0.5 81.7 ± 0.3 79.5 ± 0.4 85.7 ± 0.1
Ours 84.2 ± 0.2 95.7 ± 0.4 85.3 ± 0.1 84.0 ± 0.3 87.3 ± 0.1

Our research setting diverges from that of Domain Adaptation, Domain Generalization, and Multi-task Learning as discussed in Farahani et al. (2021), and it also differs from single-source domain transfer learning, as discussed in Zhuang et al. (2020). To demonstrate the superiority of our learning approach, we conducted comparisons with several renowned Domain Adaptation methods that center around learning effective representations, along with the classical ERM algorithm, as outlined in Table 2, utilizing the code from DomainBed (Gulrajani and Lopez-Paz, 2020). Although these methods do not impose any constraints on downstream training, the test results demonstrate the advantage of our approach to some extent. Further implementation details regarding the comparison are provided in the supplementary material.

Ablation study and corresponding results. To verify the effectiveness of the additional constraints and architecture in our framework, we present in Table 3 the test results for the following training methods. ERM-UD (ERM on both upstream and downstream): training a network without considering task diversity, transferring the feature extractor to the downstream task, and training a linear classifier while freezing the feature extractor. Transfer Invariant Representation (TIR): adopting and freezing the feature extractor obtained with our upstream training method, followed by training a linear classifier on the downstream task without the $Q$ network. ERM-D (ERM only at downstream): solely utilizing the pretrained EfficientNet to train the downstream task. Without Independence (WI): adopting the feature extractor from our upstream training method and incorporating the pretrained EfficientNet while training on the downstream task without imposing any independence constraint between the output of EfficientNet and the transferred featurizer.

Table 3: The accuracy of ablation study on typical datasets and their domains
Dataset/Alg Accuracy on domain
officehome A C P R Avg
ERM-UD 71.4 ± 1.0 73.3 ± 0.1 90.3 ± 0.1 83.8 ± 0.6 79.7 ± 0.1
TIR 71.1 ± 0.6 72.9 ± 0.6 89.8 ± 0.7 82.7 ± 0.3 79.1 ± 0.1
ERM-D 51.2 ± 1.9 60.4 ± 1.2 83.1 ± 0.3 78.8 ± 0.5 68.4 ± 0.2
WI 70.6 ± 0.7 73.3 ± 0.1 89.7 ± 0.4 83.2 ± 0.5 79.2 ± 0.3
Ours 77.5 ± 1.1 78.4 ± 0.3 93.3 ± 0.1 86.7 ± 0.5 84.0 ± 0.3
PACS A C P S Avg
ERM-UD 92.3 ± 0.7 93.7 ± 0.6 99.0 ± 0.3 88.8 ± 0.5 93.5 ± 0.2
TIR 94.2 ± 0.2 94.4 ± 0.1 99.2 ± 0.2 89.6 ± 0.4 94.3 ± 0.1
ERM-D 89.5 ± 0.2 88.4 ± 0.3 99.1 ± 0.3 90.8 ± 0.3 92.0 ± 0.2
WI 94.2 ± 0.2 93.6 ± 0.2 98.9 ± 0.5 89.3 ± 0.6 94.0 ± 0.1
Ours 95.0 ± 0.4 96.0 ± 0.4 99.4 ± 0.3 92.6 ± 0.7 95.7 ± 0.4
VLCS C L S V Avg
ERM-UD 99.4 ± 0.2 72.2 ± 2.0 78.2 ± 1.0 84.0 ± 0.4 83.5 ± 0.7
TIR 99.8 ± 0.2 74.8 ± 1.0 80.6 ± 1.0 83.6 ± 0.9 84.7 ± 0.6
ERM-D 100.0 ± 0.0 73.0 ± 1.0 79.8 ± 0.4 86.9 ± 0.2 85.0 ± 0.4
WI 99.5 ± 0.2 74.8 ± 0.4 80.2 ± 0.5 84.6 ± 0.6 84.8 ± 0.1
Ours 100.0 ± 0.0 73.9 ± 0.0 80.0 ± 0.1 87.3 ± 0.3 85.3 ± 0.1
TerraInc 38 43 46 100 Avg
ERM-UD 85.2 ± 0.4 82.1 ± 0.6 74.2 ± 0.5 89.3 ± 0.6 82.7 ± 0.3
TIR 83.1 ± 0.8 76.8 ± 0.9 72.9 ± 0.8 86.2 ± 0.7 79.7 ± 0.6
ERM-D 85.1 ± 0.4 69.9 ± 0.3 70.6 ± 0.5 83.9 ± 0.3 77.5 ± 0.1
WI 82.2 ± 0.6 76.9 ± 0.6 71.7 ± 0.8 85.8 ± 0.7 79.2 ± 0.3
Ours 87.4 ± 0.4 83.5 ± 0.6 76.3 ± 0.3 89.6 ± 0.4 84.2 ± 0.2

Results interpretation. Contrasting the results of our method with those of WI demonstrates the efficacy of the independence penalty term in improving accuracy on downstream tasks, which can be attributed to the fact that the independence constraints (6) and (7) reduce the learning difficulty of the downstream task. Moreover, comparing the outcomes of our approach with those of ERM-D makes it evident that the enhanced accuracy cannot be attributed solely to the information learned in $Q$. Furthermore, the comparison between TIR and WI reveals that the auxiliary network for extracting different information can be detrimental to accuracy when the independence requirement is not imposed. These numerical results align consistently with our observations in Theorem 3 and Remark 6. Specifically, the model under “partial transfer” conditions is demonstrated to be more easily learnable, attributable not only to the advantageous representation obtained from the upstream domains but also to the imposition of the independence constraint, which significantly reduces the estimation complexity of the downstream task.

Results on the regression dataset. The results are shown in Table 4. The phrase “C4, C6 -C1” represents using the data from domains C4 and C6 as upstream data and the data from domain C1 as downstream data; “C1, C6 -C4” and “C1, C4 -C6” have analogous meanings.

Table 4: Fine-tuning accuracy on the Wear dataset
Transfer task/Algorithm C4,C6 -C1 C1,C6 -C4 C1,C4 -C6
Metrics MAE MAE MAE
ERM-UD 8.68 ± 1.26 5.54 ± 0.29 2.14 ± 0.01
Ours 1.22 ± 0.06 1.47 ± 0.19 1.23 ± 0.11
Metrics RMSE RMSE RMSE
ERM-UD 15.10 ± 0.48 7.36 ± 0.43 2.77 ± 0.04
Ours 2.31 ± 0.15 1.95 ± 0.21 1.73 ± 0.12
Table 5: Impact of different representation dimensions on the OfficeHome
Domain \ Representation dimension 256 512 1024 2048 4096 8192
A 54.8 ± 1.7 57.3 ± 2.8 59.4 ± 1.1 56.9 ± 0.5 54.1 ± 1.5 50.1 ± 1.6
C 51.4 ± 1.4 58.6 ± 0.9 56.8 ± 1.4 53.5 ± 0.5 46.6 ± 2.3 38.8 ± 2.3
P 76.1 ± 1.5 77.9 ± 1.0 79.4 ± 2.1 72.5 ± 1.5 59.7 ± 1.0 50.3 ± 0.7
R 71.2 ± 2.3 77.5 ± 0.9 76.8 ± 0.2 74.4 ± 1.1 65.0 ± 1.9 59.7 ± 3.5
Table 6: The Impact of different representation dimensions on the Regression task
Task \ dimension 256 512 1024 2048
C4,C6 -C1 MAE 6.59 ± 0.13 4.87 ± 0.11 6.20 ± 0.71 10.49 ± 0.2
RMSE 9.17 ± 0.12 5.89 ± 0.11 6.45 ± 0.67 15.45 ± 0.20
C1,C6 -C4 MAE 10.38 ± 0.04 8.57 ± 0.17 10.07 ± 0.38 10.33 ± 0.2
RMSE 15.09 ± 0.08 13.82 ± 0.14 14.98 ± 0.42 15.47 ± 0.04
C1,C4 -C6 MAE 1.64 ± 0.08 1.41 ± 0.04 1.92 ± 0.10 2.77 ± 0.23
RMSE 2.22 ± 0.04 1.84 ± 0.01 2.72 ± 0.13 3.45 ± 0.23

The effect of representation dimension. To determine the optimal dimension of the representation, we conducted experiments on the OfficeHome and Wear datasets, modifying the dimensionality of the transferred representations under the TIR training method, which treats the downstream model as (8); the results are shown in Tables 5 and 6. Comparing the impact of different representation dimensions on the downstream tasks, we observed that downstream accuracy first increases and then decreases as the dimensionality of the transferred representation grows. This suggests that when the representation dimension is small, it fails to capture enough predictive information, whereas when the dimension is too large, it introduces unnecessary high-dimensional information, increasing the cost of estimating a high-dimensional function during downstream learning. These results align with what the technique used in our Theorem 3 yields when applied to model (8). Therefore, we selected 1024 and 512 as the dimensionality of the transferred representations for the classification and regression tasks, respectively.

5 Conclusion

In this paper, we propose a statistical framework for transfer learning from a multi-task upstream scenario to a single downstream task. Our framework accommodates both shared and specific features, explicitly identifying the upstream features contributing to downstream tasks. We introduce a novel training algorithm tailored for transferring a suitably invariant shared representation. Convergence analysis reveals that the derived rate is adaptive to the difficulty of transferability. Empirical experiments on real classification and regression data validate the efficacy of our model and theoretical findings.

Learning a pre-trained model on a huge dataset and then fine-tuning it on a specific task (Lee et al., 2019; Brown et al., 2020; Hu et al., 2021; Sung et al., 2021; Rücklé et al., 2021; Xu et al., 2021; Liu et al., 2022; Jiang et al., 2022; Zhang et al., 2023; Gao et al., 2023) has empirically proven to be a highly effective strategy in recent years. Our framework for multi-domain transfer learning can also be interpreted along these lines: learning representations from multiple domains can be viewed as pre-training, and refining downstream predictions can be seen as fine-tuning. Consequently, the theoretical results established in this paper also contribute to enhancing our understanding of pre-training and fine-tuning processes.

References

  • Anthony and Bartlett (2009) Anthony, M. and Bartlett, P. L. (2009), Neural network learning: Theoretical foundations, Cambridge University Press.
  • Anthony et al. (1999) Anthony, M., Bartlett, P. L., Bartlett, P. L. et al. (1999), Neural network learning: Theoretical foundations, Vol. 9, Cambridge University Press, Cambridge.
  • Arjovsky et al. (2020) Arjovsky, M., Bottou, L., Gulrajani, I. and Lopez-Paz, D. (2020), ‘Invariant risk minimization’, International Conference on Learning Representations 1050, 27.
  • Bartlett et al. (2019) Bartlett, P. L., Harvey, N., Liaw, C. and Mehrabian, A. (2019), ‘Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks’, Journal of Machine Learning Research 20(63), 1–17.
  • Bastani (2021) Bastani, H. (2021), ‘Predicting with proxies: Transfer learning in high dimension’, Management Science 67(5), 2964–2984.
  • Beery et al. (2018) Beery, S., Van Horn, G. and Perona, P. (2018), Recognition in terra incognita, in ‘Proceedings of the European Conference on Computer Vision (ECCV)’, pp. 456–473.
  • Ben-David et al. (2006) Ben-David, S., Blitzer, J., Crammer, K. and Pereira, F. (2006), ‘Analysis of representations for domain adaptation’, Advances in Neural Information Processing Systems 19.
  • Berlinet and Thomas-Agnan (2011) Berlinet, A. and Thomas-Agnan, C. (2011), Reproducing kernel Hilbert spaces in probability and statistics, Springer Science & Business Media.
  • Bhattacharya et al. (2023) Bhattacharya, S., Fan, J. and Mukherjee, D. (2023), ‘Deep neural networks for nonparametric interaction models with diverging dimension’, arXiv preprint arXiv:2302.05851 .
  • Blitzer et al. (2007) Blitzer, J., Crammer, K., Kulesza, A., Pereira, F. and Wortman, J. (2007), ‘Learning bounds for domain adaptation’, Advances in Neural Information Processing Systems 20.
  • Brown et al. (2020) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A. et al. (2020), ‘Language models are few-shot learners’, Advances in Neural Information Processing Systems 33, 1877–1901.
  • Cai et al. (2020) Cai, C., Wang, S., Xu, Y., Zhang, W., Tang, K., Ouyang, Q., Lai, L. and Pei, J. (2020), ‘Transfer learning for drug discovery’, Journal of Medicinal Chemistry 63(16), 8683–8694.
  • Cai and Pu (2024) Cai, T. T. and Pu, H. (2024), ‘Transfer learning for nonparametric regression: Non-asymptotic minimax analysis and adaptive procedure’, arXiv preprint arXiv:2401.12272 .
  • Cai and Wei (2021) Cai, T. T. and Wei, H. (2021), ‘Transfer learning for nonparametric classification: Minimax rate and adaptive classifier’, The Annals of Statistics 49(1), 100–128.
  • Chen et al. (2022) Chen, M., Jiang, H., Liao, W. and Zhao, T. (2022), ‘Nonparametric regression on low-dimensional manifolds using deep relu networks: Function approximation and statistical recovery’, Information and Inference: A Journal of the IMA 11(4), 1203–1253.
  • Chen et al. (2021) Chen, S., Crammer, K., He, H., Roth, D. and Su, W. J. (2021), Weighted training for cross-task learning, in ‘International Conference on Learning Representations’.
  • Chen et al. (2024) Chen, Y., Jiao, Y., Qiu, R. and Yu, Z. (2024), ‘Deep nonlinear sufficient dimension reduction’, The Annals of Statistics 52(3), 1201–1226.
  • De la Pena and Giné (2012) De la Pena, V. and Giné, E. (2012), Decoupling: from Dependence to Independence, Springer Science & Business Media.
  • Du et al. (2020) Du, S. S., Hu, W., Kakade, S. M., Lee, J. D. and Lei, Q. (2020), Few-shot learning via learning the representation, provably, in ‘International Conference on Learning Representations’.
  • Fan and Gijbels (2018) Fan, J. and Gijbels, I. (2018), Local polynomial modelling and its applications, Routledge.
  • Fan and Gu (2023) Fan, J. and Gu, Y. (2023), ‘Factor augmented sparse throughput deep relu neural networks for high dimensional regression’, Journal of the American Statistical Association pp. 1–15.
  • Fang et al. (2013) Fang, C., Xu, Y. and Rockmore, D. N. (2013), Unbiased metric learning: On the utilization of multiple datasets and web images for softening bias, in ‘Proceedings of the IEEE International Conference on Computer Vision’, pp. 1657–1664.
  • Farahani et al. (2021) Farahani, A., Voghoei, S., Rasheed, K. and Arabnia, H. R. (2021), ‘A brief review of domain adaptation’, Advances in Data Science and Information Engineering: Proceedings from ICDATA 2020 and IKE 2020 pp. 877–894.
  • Farrell et al. (2021) Farrell, M. H., Liang, T. and Misra, S. (2021), ‘Deep neural networks for estimation and inference’, Econometrica 89(1), 181–213.
  • Ganin et al. (2016) Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., March, M. and Lempitsky, V. (2016), ‘Domain-adversarial training of neural networks’, Journal of Machine Learning Research 17(59), 1–35.
  • Gao et al. (2023) Gao, P., Geng, S., Zhang, R., Ma, T., Fang, R., Zhang, Y., Li, H. and Qiao, Y. (2023), ‘Clip-adapter: Better vision-language models with feature adapters’, International Journal of Computer Vision pp. 1–15.
  • Giné and Nickl (2021) Giné, E. and Nickl, R. (2021), Mathematical foundations of infinite-dimensional statistical models, Cambridge university press.
  • Gulrajani and Lopez-Paz (2020) Gulrajani, I. and Lopez-Paz, D. (2020), In search of lost domain generalization, in ‘International Conference on Learning Representations’.
  • Györfi et al. (2002) Györfi, L., Kohler, M., Krzyzak, A., Walk, H. et al. (2002), A distribution-free theory of nonparametric regression, Vol. 1, Springer.
  • He et al. (2016) He, K., Zhang, X., Ren, S. and Sun, J. (2016), Deep residual learning for image recognition, in ‘Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition’, pp. 770–778.
  • Hu et al. (2021) Hu, E. J., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W. et al. (2021), Lora: Low-rank adaptation of large language models, in ‘International Conference on Learning Representations’.
  • Huang et al. (2006) Huang, J., Gretton, A., Borgwardt, K., Schölkopf, B. and Smola, A. (2006), ‘Correcting sample selection bias by unlabeled data’, Advances in Neural Information Processing Systems 19.
  • Huang et al. (2022) Huang, J., Jiao, Y., Li, Z., Liu, S., Wang, Y. and Yang, Y. (2022), ‘An error analysis of generative adversarial networks for learning distributions’, Journal of Machine Learning Research 23(116), 1–43.
  • Huang et al. (2024) Huang, J., Jiao, Y., Liao, X., Liu, J. and Yu, Z. (2024), ‘Deep dimension reduction for supervised representation learning’, IEEE Transactions on Information Theory .
  • Huo and Székely (2016) Huo, X. and Székely, G. J. (2016), ‘Fast computing for distance covariance’, Technometrics 58(4), 435–447.
  • Jha et al. (2019) Jha, D., Choudhary, K., Tavazza, F., Liao, W.-k., Choudhary, A., Campbell, C. and Agrawal, A. (2019), ‘Enhancing materials property prediction by leveraging computational and experimental data using deep transfer learning’, Nature communications 10(1), 5316.
  • Jiang et al. (2022) Jiang, H., Zhang, J., Huang, R., Ge, C., Ni, Z., Lu, J., Zhou, J., Song, S. and Huang, G. (2022), ‘Cross-modal adapter for text-video retrieval’, arXiv preprint arXiv:2211.09623 .
  • Jiao, Shen, Lin and Huang (2023) Jiao, Y., Shen, G., Lin, Y. and Huang, J. (2023), ‘Deep nonparametric regression on approximate manifolds: Nonasymptotic error bounds with polynomial prefactors’, The Annals of Statistics 51(2), 691–716.
  • Jiao, Wang and Yang (2023) Jiao, Y., Wang, Y. and Yang, Y. (2023), ‘Approximation bounds for norm constrained neural networks with applications to regression and gans’, Applied and Computational Harmonic Analysis 65, 249–278.
  • Ju et al. (2021) Ju, S., Yoshida, R., Liu, C., Wu, S., Hongo, K., Tadano, T. and Shiomi, J. (2021), ‘Exploring diamondlike lattice thermal conductivity crystals via feature-based transfer learning’, Physical Review Materials 5(5), 053801.
  • Kim et al. (2021) Kim, Y., Kim, Y., Yang, C., Park, K., Gu, G. X. and Ryu, S. (2021), ‘Deep learning framework for material design space exploration using active transfer learning and data augmentation’, npj Computational Materials 7(1), 140.
  • Kingma and Ba (2015) Kingma, D. P. and Ba, J. (2015), Adam: A method for stochastic optimization, in ‘3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings’.
  • Kohler et al. (2022) Kohler, M., Krzyżak, A. and Langer, S. (2022), ‘Estimation of a function of low local dimensionality by deep neural networks’, IEEE Transactions on Information Theory 68(6), 4032–4042.
  • Kora et al. (2022) Kora, P., Ooi, C. P., Faust, O., Raghavendra, U., Gudigar, A., Chan, W. Y., Meenakshi, K., Swaraja, K., Plawiak, P. and Acharya, U. R. (2022), ‘Transfer learning techniques for medical image analysis: A review’, Biocybernetics and Biomedical Engineering 42(1), 79–107.
  • Kpotufe and Martinet (2021) Kpotufe, S. and Martinet, G. (2021), ‘Marginal singularity and the benefits of labels in covariate-shift’, The Annals of Statistics 49(6), 3299–3323.
  • Ledoux and Talagrand (2013) Ledoux, M. and Talagrand, M. (2013), Probability in Banach Spaces: isoperimetry and processes, Springer Science & Business Media.
  • Lee et al. (2019) Lee, C., Cho, K. and Kang, W. (2019), Mixout: Effective regularization to finetune large-scale pretrained language models, in ‘International Conference on Learning Representations’.
  • Li et al. (2017) Li, D., Yang, Y., Song, Y.-Z. and Hospedales, T. M. (2017), Deeper, broader and artier domain generalization, in ‘Proceedings of the IEEE International Conference on Computer Vision’, pp. 5542–5550.
  • Li et al. (2022) Li, S., Cai, T. T. and Li, H. (2022), ‘Transfer learning for high-dimensional linear regression: Prediction, estimation and minimax optimality’, Journal of the Royal Statistical Society Series B: Statistical Methodology 84(1), 149–173.
  • Li, Cai and Li (2023) Li, S., Cai, T. T. and Li, H. (2023), ‘Transfer learning in large-scale gaussian graphical models with false discovery rate control’, Journal of the American Statistical Association 118(543), 2171–2183.
  • Li, Zhang, Cai and Li (2023) Li, S., Zhang, L., Cai, T. T. and Li, H. (2023), ‘Estimation and inference for high-dimensional generalized linear models with knowledge transfer’, Journal of the American Statistical Association pp. 1–12.
  • Li et al. (2009) Li, X., Lim, B., Zhou, J., Huang, S., Phua, S., Shaw, K. and Er, M. (2009), Fuzzy neural network modelling for tool wear estimation in dry milling operation, in ‘Annual Conference of the PHM Society’, Vol. 1.
  • Liu et al. (2022) Liu, H., Tam, D., Muqeeth, M., Mohta, J., Huang, T., Bansal, M. and Raffel, C. A. (2022), ‘Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning’, Advances in Neural Information Processing Systems 35, 1950–1965.
  • Liu et al. (2024) Liu, Q., Chen, Z. and Wong, W. H. (2024), ‘An encoding generative modeling approach to dimension reduction and covariate adjustment in causal inference with observational studies’, Proceedings of the National Academy of Sciences 121(23), e2322376121.
  • Lv et al. (2018) Lv, S., Lin, H., Lian, H. and Huang, J. (2018), ‘Oracle inequalities for sparse additive quantile regression in reproducing kernel hilbert space’, The Annals of Statistics 46(2), 781–813.
  • Mansour et al. (2009) Mansour, Y., Mohri, M. and Rostamizadeh, A. (2009), Domain adaptation: Learning bounds and algorithms, in ‘22nd Conference on Learning Theory, COLT 2009’.
  • Mohri et al. (2018) Mohri, M., Rostamizadeh, A. and Talwalkar, A. (2018), Foundations of machine learning, second edn, MIT Press.
  • Nakada and Imaizumi (2020) Nakada, R. and Imaizumi, M. (2020), ‘Adaptive approximation and generalization of deep neural network with intrinsic dimensionality’, The Journal of Machine Learning Research 21(1), 7018–7055.
  • Nam et al. (2021) Nam, H., Lee, H., Park, J., Yoon, W. and Yoo, D. (2021), Reducing domain gap by reducing style bias, in ‘Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition’, pp. 8690–8699.
  • Reeve et al. (2021) Reeve, H. W., Cannings, T. I. and Samworth, R. J. (2021), ‘Adaptive transfer learning’, The Annals of Statistics 49(6), 3618–3649.
  • Rücklé et al. (2021) Rücklé, A., Geigle, G., Glockner, M., Beck, T., Pfeiffer, J., Reimers, N. and Gurevych, I. (2021), Adapterdrop: On the efficiency of adapters in transformers, in ‘Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing’, pp. 7930–7946.
  • Ruder et al. (2019) Ruder, S., Peters, M. E., Swayamdipta, S. and Wolf, T. (2019), Transfer learning in natural language processing, in ‘Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: Tutorials’, pp. 15–18.
  • Russakovsky et al. (2015) Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M. et al. (2015), ‘Imagenet large scale visual recognition challenge’, International Journal of Computer Vision 115, 211–252.
  • Sagawa et al. (2019) Sagawa, S., Koh, P. W., Hashimoto, T. B. and Liang, P. (2019), Distributionally robust neural networks, in ‘International Conference on Learning Representations’.
  • Schmidt-Hieber (2019) Schmidt-Hieber, J. (2019), ‘Deep relu network approximation of functions on a manifold’, arXiv preprint arXiv:1908.00695 .
  • Schmidt-Hieber and Zamolodtchikov (2024) Schmidt-Hieber, J. and Zamolodtchikov, P. (2024), ‘Local convergence rates of the nonparametric least squares estimator with applications to transfer learning’, Bernoulli 30(3), 1845–1877.
  • Schumaker (2007) Schumaker, L. (2007), Spline functions: basic theory, Cambridge university press.
  • Shen et al. (2022) Shen, G., Jiao, Y., Lin, Y. and Huang, J. (2022), ‘Approximation with cnns in sobolev space: with applications to classification’, Advances in Neural Information Processing Systems 35, 2876–2888.
  • Shimodaira (2000) Shimodaira, H. (2000), ‘Improving predictive inference under covariate shift by weighting the log-likelihood function’, Journal of Statistical Planning and Inference 90(2), 227–244.
  • Sugiyama et al. (2007) Sugiyama, M., Nakajima, S., Kashima, H., Buenau, P. and Kawanabe, M. (2007), ‘Direct importance estimation with model selection and its application to covariate shift adaptation’, Advances in Neural Information Processing Systems 20.
  • Sung et al. (2021) Sung, Y.-L., Nair, V. and Raffel, C. A. (2021), ‘Training neural networks with fixed sparse masks’, Advances in Neural Information Processing Systems 34, 24193–24205.
  • Székely et al. (2007) Székely, G. J., Rizzo, M. L. and Bakirov, N. K. (2007), ‘Measuring and testing dependence by correlation of distances’, The Annals of Statistics 35(6), 2769–2794.
  • Tan and Le (2019) Tan, M. and Le, Q. (2019), Efficientnet: Rethinking model scaling for convolutional neural networks, in ‘International Conference on Machine Learning’, PMLR, pp. 6105–6114.
  • Tian and Feng (2023) Tian, Y. and Feng, Y. (2023), ‘Transfer learning under high-dimensional generalized linear models’, Journal of the American Statistical Association 118(544), 2684–2697.
  • Tripuraneni et al. (2021) Tripuraneni, N., Jin, C. and Jordan, M. (2021), Provable meta-learning of linear representations, in ‘International Conference on Machine Learning’, PMLR, pp. 10434–10443.
  • Tripuraneni et al. (2020) Tripuraneni, N., Jordan, M. and Jin, C. (2020), ‘On the theory of transfer learning: The importance of task diversity’, Advances in Neural Information Processing Systems 33, 7852–7862.
  • Vapnik (1998) Vapnik, V. (1998), ‘Statistical learning theory’, Adaptive and Learning Systems for Signal Processing Communications and Control .
  • Venkateswara et al. (2017) Venkateswara, H., Eusebio, J., Chakraborty, S. and Panchanathan, S. (2017), Deep hashing network for unsupervised domain adaptation, in ‘Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition’, pp. 5018–5027.
  • Villani et al. (2009) Villani, C. et al. (2009), Optimal transport: old and new, Vol. 338, Springer.
  • Wainwright (2019) Wainwright, M. J. (2019), High-dimensional statistics: A non-asymptotic viewpoint, Vol. 48, Cambridge University Press.
  • Wang et al. (2016) Wang, X., Oliva, J. B., Schneider, J. G. and Póczos, B. (2016), Nonparametric risk and stability analysis for multi-task learning problems., in ‘IJCAI’, pp. 2146–2152.
  • Wen et al. (2014) Wen, J., Yu, C.-N. and Greiner, R. (2014), Robust learning under uncertain test distributions: Relating covariate shift to model misspecification, in ‘International Conference on Machine Learning’, PMLR, pp. 631–639.
  • Wu et al. (2023) Wu, K., Chen, Y., Ha, W. and Yu, B. (2023), ‘Prominent roles of conditionally invariant components in domain adaptation: Theory and algorithms’, arXiv preprint arXiv:2309.10301 .
  • Xu et al. (2021) Xu, R., Luo, F., Zhang, Z., Tan, C., Chang, B., Huang, S. and Huang, F. (2021), Raise a child in large language model: Towards effective and generalizable fine-tuning, in ‘Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing’, pp. 9514–9528.
  • Zhang et al. (2023) Zhang, B., Jin, X., Gong, W., Xu, K., Zhang, Z., Wang, P., Shen, X. and Feng, J. (2023), ‘Multimodal video adapter for parameter efficient video text retrieval’, arXiv preprint arXiv:2301.07868 .
  • Zhuang et al. (2020) Zhuang, F., Qi, Z., Duan, K., Xi, D., Zhu, Y., Zhu, H., Xiong, H. and He, Q. (2020), ‘A comprehensive survey on transfer learning’, Proceedings of the IEEE 109(1), 43–76.

Supplementary Material to “Deep Transfer Learning: Model Framework and Error Analysis”

Yuling Jiao, Huazhen Lin, Yuchen Luo, Jerry Zhijian Yang


Appendix A Notation and organization of the proofs

Let \mathcal{H} be a function class of feature mappings \mathbb{R}^{d}\to\mathbb{R}^{r}, let \mathcal{G} be a function class of Lipschitz-constrained mappings \mathbb{R}^{r}\to\mathbb{R}, and let \mathcal{Q}:\mathbb{R}^{d^{*}}\to\mathbb{R} be the function class used during fine-tuning.

For any i\in[p], \mathbb{P}(S=i)>0; this is the general case and implies that n_{i} is of the same order as n. Without loss of generality, we set the upper bound of the random noise to 1: |\varepsilon_{i}|\leq 1, i\in[2].

The proofs and definitions are organized as follows. Section B presents the proofs of Theorems 1 and 2 from Section 3.1; Section C is dedicated to proving Theorems 3 and 4 from Section 3.2; Section D compiles the theorems cited; and Section E provides additional experimental details.

Appendix B Proofs of Section 3.1

For all the following lemmas, we omit the statement of the assumptions for simplicity. The lemmas in Sections B.1.1 and B.1.2 are proved under the assumptions of Theorem 1, the lemmas in Section B.2 under the assumptions of Theorem 2, and the lemmas for the downstream analysis under the assumptions of Theorem 3.

B.1 Proofs of Theorem 1

B.1.1 Proof of error decomposition and approximation decomposition

Without loss of generality, we set \lambda=1 and \tau=1. We first present a basic inequality for the excess risk in terms of the stochastic and approximation errors.

Lemma 1.
\mathbb{E}[R(\hat{\mathbf{F}},\hat{\mathbf{h}})-R(\mathbf{F}^{*},\mathbf{h}^{*})]\leq 2\sup_{(\mathbf{F},\mathbf{h})\in\mathcal{B}_{pr}(B)\circ\mathcal{H}}\mathbb{E}[|\hat{R}(\mathbf{F},\mathbf{h})-R(\mathbf{F},\mathbf{h})|]+\inf_{(\mathbf{F},\mathbf{h})\in\mathcal{B}_{pr}(B)\circ\mathcal{H}}|R(\mathbf{F},\mathbf{h})-R(\mathbf{F}^{*},\mathbf{h}^{*})|.
Proof.

For any 𝐅pr(B)\mathbf{F}\in\mathcal{B}_{pr}(B) and any 𝐡\mathbf{h}\in\mathcal{H}, we have:

𝔼[R(𝐅^,𝐡^)R(𝐅,𝐡)]=\displaystyle\mathbb{E}[R(\hat{\mathbf{F}},\hat{\mathbf{h}})-R(\mathbf{F}^{*},\mathbf{h}^{*})]= 𝔼[R(𝐅^,𝐡^)R^(𝐅^,𝐡^)+R^(𝐅^,𝐡^)R^(𝐅,𝐡)+R^(𝐅,𝐡)R(𝐅,𝐡)\displaystyle\mathbb{E}[R(\hat{\mathbf{F}},\hat{\mathbf{h}})-\hat{R}(\hat{\mathbf{F}},\hat{\mathbf{h}})+\hat{R}(\hat{\mathbf{F}},\hat{\mathbf{h}})-\hat{R}(\mathbf{F},\mathbf{h})+\hat{R}(\mathbf{F},\mathbf{h})-R(\mathbf{F},\mathbf{h})
+R(𝐅,𝐡)R(𝐅,𝐡)]\displaystyle+R(\mathbf{F},\mathbf{h})-R(\mathbf{F}^{*},\mathbf{h}^{*})]
\displaystyle\leq 𝔼[R(𝐅^,𝐡^)R^(𝐅^,𝐡^)+R^(𝐅,𝐡)R(𝐅,𝐡)+R(𝐅,𝐡)R(𝐅,𝐡)]\displaystyle\mathbb{E}[R(\hat{\mathbf{F}},\hat{\mathbf{h}})-\hat{R}(\hat{\mathbf{F}},\hat{\mathbf{h}})+\hat{R}(\mathbf{F},\mathbf{h})-R(\mathbf{F},\mathbf{h})+R(\mathbf{F},\mathbf{h})-R(\mathbf{F}^{*},\mathbf{h}^{*})]
\displaystyle\leq 2sup(𝐅,𝐡)pr(B)𝔼[|R^(𝐅,𝐡)R(𝐅,𝐡)|]+𝔼[R(𝐅,𝐡)R(𝐅,𝐡)].\displaystyle 2\sup_{(\mathbf{F},\mathbf{h})\in\mathcal{B}_{pr}(B)\circ\mathcal{H}}\mathbb{E}[|\hat{R}(\mathbf{F},\mathbf{h})-R(\mathbf{F},\mathbf{h})|]+\mathbb{E}[R(\mathbf{F},\mathbf{h})-R(\mathbf{F}^{*},\mathbf{h}^{*})].

The term \hat{R}(\hat{\mathbf{F}},\hat{\mathbf{h}})-\hat{R}(\mathbf{F},\mathbf{h})\leq 0 by the definition of the empirical risk minimizer. Since (\mathbf{F},\mathbf{h}) is an arbitrary element of \mathcal{B}_{pr}(B)\circ\mathcal{H}, we can take the infimum over the right-hand side. Thus,

𝔼[R(𝐅^,𝐡^)R(𝐅,𝐡)]\displaystyle\mathbb{E}[R(\hat{\mathbf{F}},\hat{\mathbf{h}})-R(\mathbf{F}^{*},\mathbf{h}^{*})]\leq 2sup(𝐅,𝐡)pr(B)𝔼[|R^(𝐅,𝐡)R(𝐅,𝐡)|]\displaystyle 2\sup_{(\mathbf{F},\mathbf{h})\in\mathcal{B}_{pr}(B)\circ\mathcal{H}}\mathbb{E}[|\hat{R}(\mathbf{F},\mathbf{h})-R(\mathbf{F},\mathbf{h})|]
+inf(𝐅,𝐡)pr(B)|R(𝐅,𝐡)R(𝐅,𝐡)|.\displaystyle+\inf_{(\mathbf{F},\mathbf{h})\in\mathcal{B}_{pr}(B)\circ\mathcal{H}}|R(\mathbf{F},\mathbf{h})-R(\mathbf{F}^{*},\mathbf{h}^{*})|\negmedspace.

Lemma 2.
\inf_{(\mathbf{F},\mathbf{h})\in\mathcal{B}_{pr}(B)\circ\mathcal{H}}|R(\mathbf{F},\mathbf{h})-R(\mathbf{F}^{*},\mathbf{h}^{*})|\leq|R(\mathbf{F}^{*},\mathbf{h})-R(\mathbf{F}^{*},\mathbf{h}^{*})|\leq(4B^{2}+32p+1)\|\mathbf{h}-\mathbf{h}^{*}\|_{\infty}. (16)
Proof.
|R(𝐅,𝐡)R(𝐅,𝐡)|\displaystyle|R(\mathbf{F}^{*},\mathbf{h})-R(\mathbf{F}^{*},\mathbf{h}^{*})| |E[(YS𝐅S𝐡(𝐗S))2]+𝒱[D,𝐡(X)]E[(YS𝐅S𝐡(𝐗S))2]\displaystyle\leq|\mathrm{E}[(Y_{S}-\mathbf{F}_{S}^{*}\mathbf{h}(\mathbf{X}_{S}))^{2}]+\mathcal{V}[\mathrm{D},\mathbf{h}(X)]-\mathrm{E}[(Y_{S}-\mathbf{F}_{S}^{*}\mathbf{h}^{*}(\mathbf{X}_{S}))^{2}]
𝒱[D,𝐡(𝐗)]|+𝒲1(𝐡(X),𝒰r)𝒲1(𝐡(𝐗),𝒰r)\displaystyle\quad-\mathcal{V}[\mathrm{D},\mathbf{h}^{*}(\mathbf{X})]|+\mathcal{W}_{1}(\mathbf{h}(X),\mathcal{U}_{r})-\mathcal{W}_{1}(\mathbf{h}^{*}(\mathbf{X}),\mathcal{U}_{r})
4BE[|𝐅S(𝐡(X))𝐅S(𝐡(𝐗))|]+|𝒱[D,𝐡(X)]𝒱[D,𝐡(𝐗)]|\displaystyle\leq 4B\mathrm{E}[|\mathbf{F}_{S}^{*}(\mathbf{h}(X))-\mathbf{F}_{S}^{*}(\mathbf{h}^{*}(\mathbf{X}))|]+|\mathcal{V}[\mathrm{D},\mathbf{h}(X)]-\mathcal{V}[\mathrm{D},\mathbf{h}^{*}(\mathbf{X})]|
+𝒲1(𝐡(X),𝐡(𝐗))\displaystyle\quad+\mathcal{W}_{1}(\mathbf{h}(X),\mathbf{h}^{*}(\mathbf{X}))
4B2𝐡𝐡+|𝒱[D,𝐡(X)]𝒱[D,𝐡(𝐗)]|+𝒲1(𝐡(X),𝐡(𝐗))\displaystyle\leq 4B^{2}\|\mathbf{h}-\mathbf{h}^{*}\|_{\infty}+|\mathcal{V}[\mathrm{D},\mathbf{h}(X)]-\mathcal{V}[\mathrm{D},\mathbf{h}^{*}(\mathbf{X})]|+\mathcal{W}_{1}(\mathbf{h}(X),\mathbf{h}^{*}(\mathbf{X}))
(4B2+32p+1)𝐡𝐡.\displaystyle\leq(4B^{2}+32p+1)\|\mathbf{h}-\mathbf{h}^{*}\|_{\infty}.

Here we utilize the metric property of the Wasserstein distance and the property of the distance covariance in Székely et al. (2007, p. 16, Remark 3). ∎

B.1.2 Statistical errors

We can decompose the statistical error into three components according to the three terms of the loss we defined:

𝔼[sup(𝐅,𝐡)pr(B)|R^(𝐅,𝐡)R(𝐅,𝐡)|]\displaystyle\mathbb{E}[\sup_{(\mathbf{F},\mathbf{h})\in\mathcal{B}_{pr}(B)\circ\mathcal{H}}|\hat{R}(\mathbf{F},\mathbf{h})-R(\mathbf{F},\mathbf{h})|] (17)
\displaystyle\leq 𝔼[sup(𝐅,𝐡)pr(B)|𝔼[(YS𝐅S𝐡(𝐗S))2]1ni=1n(YSi,i𝐅Si(𝐡(xSi,i)))2|1\displaystyle\mathbb{E}[\underbrace{\sup_{(\mathbf{F},\mathbf{h})\in\mathcal{B}_{pr}(B)\circ\mathcal{H}}|\mathbb{E}[(Y_{S}-\mathbf{F}_{S}\mathbf{h}(\mathbf{X}_{S}))^{2}]-\frac{1}{n}\sum_{i=1}^{n}(Y_{S_{i},i}-\mathbf{F}_{S_{i}}(\mathbf{h}(x_{S_{i},i})))^{2}|}_{\mathcal{L}_{1}}
+𝔼[sup𝐡|V[𝐡(X),D]𝒱^[𝐡(X),D]|]2+𝔼[sup𝐡|𝒲1^(𝐡(𝐗),𝒰r)𝒲1(𝐡(𝐗),𝒰r)|]3.\displaystyle+\underbrace{\mathbb{E}[\sup_{\mathbf{h}\in\mathcal{H}}|V[\mathbf{h}(X),D]-\widehat{\mathcal{V}}[\mathbf{h}(X),D]|]}_{\mathcal{L}_{2}}+\underbrace{\mathbb{E}[\sup_{\mathbf{h}\in\mathcal{H}}|\hat{\mathcal{W}_{1}}(\mathbf{h}(\mathbf{X}),\mathcal{U}_{r})-\mathcal{W}_{1}(\mathbf{h}(\mathbf{X}),\mathcal{U}_{r})|]}_{\mathcal{L}_{3}}.

We analyze the three components separately. The arguments for bounding \mathcal{L}_{1} and \mathcal{L}_{3} are essentially standard (Anthony and Bartlett, 2009; Mohri et al., 2018). For \mathcal{L}_{2}, we leverage the techniques for bounding the estimation error of U-statistics developed in Huang et al. (2024).

In the following, we bound the three terms in (17) one by one.

Lemma 3.
\mathcal{L}_{1}\leq 8B\mathbb{E}[\sup_{(\mathbf{F},\mathbf{h})\in\mathcal{B}_{pr}(B)\circ\mathcal{H}}|\frac{1}{n}\sum_{i=1}^{n}\sigma_{i}\mathbf{F}_{S_{i}}\mathbf{h}(\mathbf{X}_{S_{i},i})|]+\frac{2B^{2}}{\sqrt{n}}. (18)
Proof.

Let \{X^{\prime}_{S^{\prime}_{i},i},Y^{\prime}_{S^{\prime}_{i},i},S^{\prime}_{i}\}_{i=1}^{n} be a ghost sample of i.i.d. random variables distributed as \mathbb{P}(\mathbf{x},y,s) and independent of the dataset \mathcal{D}. Let \sigma_{i} be i.i.d. Rademacher variables, Rad(\frac{1}{2}), independent of all other randomness.

1=\displaystyle\mathcal{L}_{1}= 𝔼[sup(𝐅,𝐡)pr(B)|𝔼[(YS𝐅S𝐡(𝐗S))2]1ni=1n(YSi,i𝐅Si(𝐡(𝐱Si,i)))2|]\displaystyle\mathbb{E}[\sup_{(\mathbf{F},\mathbf{h})\in\mathcal{B}_{pr}(B)\circ\mathcal{H}}|\mathbb{E}[(Y_{S}-\mathbf{F}_{S}\mathbf{h}(\mathbf{X}_{S}))^{2}]-\frac{1}{n}\sum_{i=1}^{n}(Y_{S_{i},i}-\mathbf{F}_{S_{i}}(\mathbf{h}(\mathbf{x}_{S_{i},i})))^{2}|]
\displaystyle\leq 𝔼[sup(𝐅,𝐡)pr(B)|1ni=1n(YSi,i𝐅Si𝐡(𝐗Si,i)))21ni=1n(YSi,i𝐅Si𝐡(𝐗Si,i))2|]\displaystyle\mathbb{E}[\sup_{(\mathbf{F},\mathbf{h})\in\mathcal{B}_{pr}(B)\circ\mathcal{H}}|\frac{1}{n}\sum_{i=1}^{n}(Y^{\prime}_{S^{\prime}_{i},i}-\mathbf{F}_{S^{\prime}_{i}}\mathbf{h}(\mathbf{X}^{\prime}_{S^{\prime}_{i},i})))^{2}-\frac{1}{n}\sum_{i=1}^{n}(Y_{S_{i},i}-\mathbf{F}_{S_{i}}\mathbf{h}(\mathbf{X}_{S_{i},i}))^{2}|]
\displaystyle\leq 𝔼[sup(𝐅,𝐡)pr(B)|1ni=1nσi(YSi,i𝐅Si𝐡(𝐗Si,i)))21ni=1nσi(YSi,i𝐅Si𝐡(𝐗Si,i))2|]\displaystyle\mathbb{E}[\sup_{(\mathbf{F},\mathbf{h})\in\mathcal{B}_{pr}(B)\circ\mathcal{H}}|\frac{1}{n}\sum_{i=1}^{n}\sigma_{i}(Y^{\prime}_{S^{\prime}_{i},i}-\mathbf{F}_{S^{\prime}_{i}}\mathbf{h}(\mathbf{X}^{\prime}_{S^{\prime}_{i},i})))^{2}-\frac{1}{n}\sum_{i=1}^{n}\sigma_{i}(Y_{S_{i},i}-\mathbf{F}_{S_{i}}\mathbf{h}(\mathbf{X}_{S_{i},i}))^{2}|]
\displaystyle\leq 2𝔼[sup(𝐅,𝐡)pr(B)|1ni=1nσi(YSi,i𝐅Si𝐡(𝐗Si,i))2|]\displaystyle 2\mathbb{E}[\sup_{(\mathbf{F},\mathbf{h})\in\mathcal{B}_{pr}(B)\circ\mathcal{H}}|\frac{1}{n}\sum_{i=1}^{n}\sigma_{i}(Y_{S_{i},i}-\mathbf{F}_{S_{i}}\mathbf{h}(\mathbf{X}_{S_{i},i}))^{2}|]
\displaystyle\leq 8B𝔼[sup(𝐅,𝐡)pr(B)|1ni=1nσi𝐅Si𝐡(𝐗Si,i)|]+2B2n.\displaystyle 8B\mathbb{E}[\sup_{(\mathbf{F},\mathbf{h})\in\mathcal{B}_{pr}(B)\circ\mathcal{H}}|\frac{1}{n}\sum_{i=1}^{n}\sigma_{i}\mathbf{F}_{S_{i}}\mathbf{h}(\mathbf{X}_{S_{i},i})|]+\frac{2B^{2}}{\sqrt{n}}.

The last inequality is due to Wainwright (2019, Exercise 4.7c) and the contraction principle (Ledoux and Talagrand, 2013, Theorem 4.12). ∎
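To make the symmetrized quantity appearing in (18) concrete, the following small numerical sketch (purely illustrative; a finite class of random linear functions stands in for \mathcal{B}_{pr}(B)\circ\mathcal{H}) estimates an empirical Rademacher complexity \mathbb{E}_{\sigma}\sup_{f}|\frac{1}{n}\sum_{i}\sigma_{i}f(\mathbf{x}_{i})| by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(2)
n, n_funcs, n_mc = 200, 50, 500

# a small finite surrogate class: 50 random linear functions evaluated at n points
x = rng.normal(size=(n, 5))
funcs = rng.normal(size=(5, n_funcs))
vals = x @ funcs                      # vals[i, k] = f_k(x_i), shape (n, n_funcs)

sigma = rng.choice([-1.0, 1.0], size=(n_mc, n))
# empirical Rademacher complexity: E_sigma sup_k |(1/n) sum_i sigma_i f_k(x_i)|
rad = np.abs(sigma @ vals / n).max(axis=1).mean()
print(rad)   # scales roughly like sqrt(log(n_funcs) / n)
```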

As for the upper bound of \mathcal{L}_{2}, we need an alternative pseudo-metric to simplify the analysis. For any \mathbf{h}\in\mathcal{H}, define the random empirical measure (depending on \tilde{D}_{i}=(\mathbf{X}_{i},S_{i}),i=1,\ldots,n)

d_{D,2}(\mathbf{h},\hat{\mathbf{h}})=\mathbb{E}_{\sigma_{i_{1}},i_{1}=1,\ldots,n}|\frac{1}{C_{n}^{4}}\sum_{1\leq i_{1}<i_{2}<i_{3}<i_{4}\leq n}\sigma_{i_{1}}(k_{\mathbf{h}}-k_{\hat{\mathbf{h}}})(\tilde{D}_{i_{1}},\ldots,\tilde{D}_{i_{4}})|,

where \sigma_{i_{1}}\sim Rad(\frac{1}{2}) i.i.d. and k_{\mathbf{h}}(\tilde{D}_{i_{1}},\ldots,\tilde{D}_{i_{4}})=k((\mathbf{h}(\mathbf{X}_{i_{1}}),S_{i_{1}}),\ldots,(\mathbf{h}(\mathbf{X}_{i_{4}}),S_{i_{4}})).

Conditional on \tilde{D}_{i},i=1,\ldots,n, let \mathfrak{C}(\mathcal{H},d_{D,2},\delta) be the covering number (Giné and Nickl, 2021, page 41) of \mathcal{H} with respect to the empirical distance d_{D,2} at scale \delta>0. Denote by \mathcal{H}_{\delta} a covering set of \mathcal{H} with cardinality \mathfrak{C}(\mathcal{H},d_{D,2},\delta).
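The kernel k_{\mathbf{h}} above is the fourth-order kernel underlying the U-statistic form of the distance covariance \mathcal{V}. For intuition, the sketch below computes the closely related V-statistic form of the squared sample distance covariance (Székely et al., 2007) between a representation and a domain label; it is an illustration only and may differ from the paper's estimator \hat{\mathcal{V}} by lower-order terms.

```python
import numpy as np

def dcov_sq(x, y):
    # squared sample distance covariance, V-statistic form (Székely et al., 2007);
    # x and y are (n, p) and (n, q) arrays observed on the same n samples
    a = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    b = np.linalg.norm(y[:, None, :] - y[None, :, :], axis=-1)
    A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1)[:, None] + b.mean()
    return float((A * B).mean())

rng = np.random.default_rng(1)
h_x = rng.normal(size=(200, 8))                   # stand-in for a representation h(X)
s = (h_x[:, 0] > 0).astype(float).reshape(-1, 1)  # a domain label that depends on h(X)
print(dcov_sq(h_x, s))                            # clearly positive: dependence detected
print(dcov_sq(rng.normal(size=(200, 8)), s))      # near zero: independence
```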

Lemma 4.

Let \xi_{i},i=1,\ldots,m, be finite linear combinations of Rademacher variables \epsilon_{j},j=1,\ldots,J. Then there exist constants C_{0},C_{1},C_{2}>0 such that

\mathbb{E}_{\epsilon_{j},j=1,\ldots,J}\max_{1\leq i\leq m}|\xi_{i}|\leq C_{2}(\log m)^{1/2}\max_{1\leq i\leq m}\left(\mathbb{E}\xi_{i}^{2}\right)^{1/2}. (19)
Proof.

By De la Pena and Giné (2012, Corollary 3.2.6), each \xi_{i} belongs to \mathcal{L}_{\phi_{2}} (the Orlicz space with \phi_{2}(x)=\exp(x^{2})), and its Orlicz norm (De la Pena and Giné, 2012, page 36) is bounded by C_{0}\mathrm{E}[\xi_{i}^{2}]^{\frac{1}{2}}. Hence each \xi_{i} is a sub-Gaussian random variable with parameter C_{1}\mathrm{E}[\xi_{i}^{2}]^{\frac{1}{2}}, and by De la Pena and Giné (2012, inequality (4.3.1)) with \Phi(x)=\exp(\rho x) we obtain

\mathbb{E}_{\epsilon_{j},j=1,\ldots,J}\max_{1\leq i\leq m}|\xi_{i}|\leq\frac{\log m}{\rho}+\frac{\log\max_{1\leq i\leq m}\exp\big(\frac{\rho^{2}C_{1}^{2}\mathrm{E}[\xi_{i}^{2}]}{2}\big)}{\rho}.

Choosing the optimal \rho then yields the claimed bound. ∎

Lemma 5.

There exist constants C_{3},C_{4},C_{5},C_{6}>0 such that

\mathcal{L}_{2}\leq C_{6}\sqrt{r}\sqrt{\frac{{\rm Pdim}(\mathcal{H}_{1})}{n}}, (20)

where \mathcal{H}_{1} denotes the neural network function class \mathcal{N}\mathcal{N}_{d,1}(W_{1},L_{1},K_{1}), which can be viewed as the one-dimensional coordinate projection of \mathcal{H}.

Proof.

Based on the definition of \mathfrak{C}(\mathcal{H},d_{D,2},\delta) and \mathcal{H}_{\delta},

2=\displaystyle\mathcal{L}_{2}= 𝔼𝔼σi1[suph|1Cn41i1<i2<i3<i4nσi1k¯𝐡(D~i1,D~i2,D~i3,D~i4)|]\displaystyle\mathbb{E}\mathbb{E}_{\sigma_{i_{1}}}[\sup_{h\in\mathcal{H}}|\frac{1}{C_{n}^{4}}\sum_{1\leq i_{1}<i_{2}<i_{3}<i_{4}\leq n}\sigma_{i_{1}}\bar{k}_{\mathbf{h}}(\tilde{D}_{i_{1}},\tilde{D}_{i_{2}},\tilde{D}_{i_{3}},\tilde{D}_{i_{4}})|]
inf0<δ<1{δ+𝔼𝔼σi1[sup𝐡δ|1Cn41i1<i2<i3<i4nσi1k¯𝐡(D~i1,D~i2,D~i3,D~i4)|]}\displaystyle\leq\inf_{0<\delta<1}\{\delta+\mathbb{E}\mathbb{E}_{\sigma_{i_{1}}}[\sup_{\mathbf{h}\in\mathcal{H}_{\delta}}|\frac{1}{C_{n}^{4}}\sum_{1\leq i_{1}<i_{2}<i_{3}<i_{4}\leq n}\sigma_{i_{1}}\bar{k}_{\mathbf{h}}(\tilde{D}_{i_{1}},\tilde{D}_{i_{2}},\tilde{D}_{i_{3}},\tilde{D}_{i_{4}})|]\}
inf0<δ<1{δ+C31Cn4𝔼[(log(,dD,2,δ))1/2max𝐡δ[i1=1ni2<i3<i4(k¯𝐡(D~i1,D~i2,D~i3,D~i4))2]1/2]}\displaystyle\leq\inf_{0<\delta<1}\{\delta+C_{3}\frac{1}{C_{n}^{4}}\mathbb{E}[(\log\mathfrak{C}(\mathcal{H},d_{D,2},\delta))^{1/2}\max_{\mathbf{h}\in\mathcal{H}_{\delta}}[\sum_{i_{1}=1}^{n}\sum_{i_{2}<i_{3}<i_{4}}(\bar{k}_{\mathbf{h}}(\tilde{D}_{i_{1}},\tilde{D}_{i_{2}},\tilde{D}_{i_{3}},\tilde{D}_{i_{4}}))^{2}]^{1/2}]\}
inf0<δ<1{δ+C3C4𝔼[(log(,dD,2,δ))1/21Cn4[n(n!)2((n3)!)2]1/2]}\displaystyle\leq\inf_{0<\delta<1}\{\delta+C_{3}C_{4}\mathbb{E}[(\log\mathfrak{C}(\mathcal{H},d_{D,2},\delta))^{1/2}\frac{1}{C_{n}^{4}}[\frac{n(n!)^{2}}{((n-3)!)^{2}}]^{1/2}]\}
inf0<δ<1{δ+2C3C4𝔼[(log(,dD,2,δ))1/2/n]}\displaystyle\leq\inf_{0<\delta<1}\{\delta+2C_{3}C_{4}\mathbb{E}[(\log\mathfrak{C}(\mathcal{H},d_{D,2},\delta))^{1/2}/\sqrt{n}]\}
inf0<δ<1{δ+2C3C4𝔼[(log(,C5dD,,δ))1/2/n]}\displaystyle\leq\inf_{0<\delta<1}\{\delta+2C_{3}C_{4}\mathbb{E}[(\log\mathfrak{C}(\mathcal{H},C_{5}d_{D,\infty},\delta))^{1/2}/\sqrt{n}]\}
inf0<δ<1{δ+2C3C4𝔼[(log(,dD,,δ/C5))1/2/n]}\displaystyle\leq\inf_{0<\delta<1}\{\delta+2C_{3}C_{4}\mathbb{E}[(\log\mathfrak{C}(\mathcal{H},d_{D,\infty},\delta/C_{5}))^{1/2}/\sqrt{n}]\}
inf0<δ<1{δ+2C3C4r𝔼[(log(1,dD,,δ/C5))1/2/n]}\displaystyle\leq\inf_{0<\delta<1}\{\delta+2C_{3}C_{4}\sqrt{r}\mathbb{E}[(\log\mathfrak{C}(\mathcal{H}_{1},d_{D,\infty},\delta/C_{5}))^{1/2}/\sqrt{n}]\}
inf0<δ<1{δ+2C3C4r(Pdim(1)log(2C5enδPdim(1)))1/2/n}\displaystyle\leq\inf_{0<\delta<1}\{\delta+2C_{3}C_{4}\sqrt{r}({\rm Pdim}(\mathcal{H}_{1})\log(\frac{2C_{5}en}{\delta{\rm Pdim}(\mathcal{H}_{1})}))^{1/2}/\sqrt{n}\}
C6rPdim(1)n,\displaystyle\leq C_{6}\sqrt{r}\sqrt{\frac{{\rm Pdim}(\mathcal{H}_{1})}{n}},

where d_{D,\infty}(\mathbf{h}_{1},\mathbf{h}_{2})=\sum_{i=1}^{n}\frac{1}{n}\|\mathbf{h}_{1}(\mathbf{x}_{i})-\mathbf{h}_{2}(\mathbf{x}_{i})\|_{\infty}. The proof is based on that of Huang et al. (2024, Lemma 8.7), and the second inequality follows from Lemma 4.

As for the third-to-last inequality, the covering number of a family of neural network functions is commonly estimated through the pseudo dimension or VC dimension. Denoting the pseudo dimension of \mathcal{F} by {\rm Pdim}(\mathcal{F}), by Anthony et al. (1999, Theorem 12.2), for n\geq{\rm Pdim}(\mathcal{F}),

\mathfrak{C}(\mathcal{F},d_{D,\infty},\delta)\leq\Big(\frac{2eB_{\mathcal{F}}n}{{\rm Pdim}(\mathcal{F})\delta}\Big)^{{\rm Pdim}(\mathcal{F})}.

Here B_{\mathcal{F}} is an upper bound on the functions in \mathcal{F}. However, the bound can be established regardless of the relative magnitudes of n and {\rm Pdim}(\mathcal{B}_{pr}(B)\circ\mathcal{H}), as demonstrated in the proof of Huang et al. (2022, Corollary 35).

For the last inequality we set \delta=C_{4}\sqrt{\frac{{\rm Pdim}(\mathcal{H}_{1})}{n}}<1. ∎

Lemma 6.
3\displaystyle\mathcal{L}_{3} =𝔼[sup𝐡|𝒲1^(𝐡(𝐗),𝒰r)𝒲1(𝐡(𝐗),𝒰r)|]\displaystyle=\mathbb{E}[\sup_{\mathbf{h}\in\mathcal{H}}|\hat{\mathcal{W}_{1}}(\mathbf{h}(\mathbf{X}),\mathcal{U}_{r})-\mathcal{W}_{1}(\mathbf{h}(\mathbf{X}),\mathcal{U}_{r})|] (21)
2suph1infg𝒢gh+2𝔼[sup𝐡,g𝒢|1ni=1nσig(𝐡(𝐗i))|]\displaystyle\leq 2\sup_{h\in\mathcal{H}^{1}}\inf_{g\in\mathcal{G}}\|g-h\|+2\mathbb{E}[\sup_{\mathbf{h}\in\mathcal{H},g\in\mathcal{G}}|\frac{1}{n}\sum_{i=1}^{n}\sigma_{i}g(\mathbf{h}(\mathbf{X}_{i}))|]
+2𝔼[supg𝒢|1ni=1nσig(ξi)|].\displaystyle\quad+2\mathbb{E}[\sup_{g\in\mathcal{G}}|\frac{1}{n}\sum_{i=1}^{n}\sigma_{i}g(\xi_{i})|].
Proof.

Let \overline{\mathcal{W}_{1}}(\mathbf{h}(\mathbf{X}),\mathcal{U}_{r}):=\sup_{g\in\mathcal{G}_{1}}\frac{1}{K_{2}}\big(\mathbb{E}_{\mathbf{x}\sim\mathbf{h}(\mathbf{X})}[g(\mathbf{x})]-\mathbb{E}_{\mathbf{z}\sim\mathcal{U}_{r}}[g(\mathbf{z})]\big) be the population counterpart of \hat{\mathcal{W}_{1}}(\mathbf{h}(\mathbf{X}),\mathcal{U}_{r}). Then we can derive the bound as follows:

3\displaystyle\mathcal{L}_{3} =𝔼[sup𝐡|𝒲1^(𝐡(𝐗),𝒰r)𝒲1(𝐡(𝐗),𝒰r)|]\displaystyle=\mathbb{E}[\sup_{\mathbf{h}\in\mathcal{H}}|\hat{\mathcal{W}_{1}}(\mathbf{h}(\mathbf{X}),\mathcal{U}_{r})-\mathcal{W}_{1}(\mathbf{h}(\mathbf{X}),\mathcal{U}_{r})|]
𝔼[sup𝐡|𝒲1^(𝐡(𝐗),𝒰r)𝒲1¯(𝐡(𝐗),𝒰r)+𝒲1¯(𝐡(𝐗),𝒰r)𝒲1(𝐡(𝐗),𝒰r)|]\displaystyle\leq\mathbb{E}[\sup_{\mathbf{h}\in\mathcal{H}}|\hat{\mathcal{W}_{1}}(\mathbf{h}(\mathbf{X}),\mathcal{U}_{r})-\overline{\mathcal{W}_{1}}(\mathbf{h}(\mathbf{X}),\mathcal{U}_{r})+\overline{\mathcal{W}_{1}}(\mathbf{h}(\mathbf{X}),\mathcal{U}_{r})-\mathcal{W}_{1}(\mathbf{h}(\mathbf{X}),\mathcal{U}_{r})|]
2K2suph1infg𝒢gh+1K2𝔼[suph1|supg𝒢1ni=1ng(𝐡(𝐗i))1ni=1ng(ξi)\displaystyle\leq\frac{2}{K_{2}}\sup_{h\in\mathcal{H}^{1}}\inf_{g\in\mathcal{G}}\|g-h\|+\frac{1}{K_{2}}\mathbb{E}[\sup_{h\in\mathcal{H}^{1}}|\sup_{g\in\mathcal{G}}\frac{1}{n}\sum_{i=1}^{n}g(\mathbf{h}(\mathbf{X}_{i}))-\frac{1}{n}\sum_{i=1}^{n}g(\xi_{i})
supg𝒢𝔼𝐱𝐡(𝐗)[g(𝐱)]𝔼𝐳𝒰r[g(𝐳)]|]\displaystyle\quad-\sup_{g\in\mathcal{G}}\mathbb{E}_{\mathbf{x}\sim\mathbf{h}(\mathbf{X})}[g(\mathbf{x})]-\mathbb{E}_{\mathbf{z}\sim\mathcal{U}_{r}}[g(\mathbf{z})]|]
2K2suph1infg𝒢gh+1K2𝔼[suph1supg𝒢|𝔼𝐱𝐡(𝐗)[g(𝐱)]1ni=1ng(𝐡(𝐗i))\displaystyle\leq\frac{2}{K_{2}}\sup_{h\in\mathcal{H}^{1}}\inf_{g\in\mathcal{G}}\|g-h\|+\frac{1}{K_{2}}\mathbb{E}[\sup_{h\in\mathcal{H}^{1}}\sup_{g\in\mathcal{G}}|\mathbb{E}_{\mathbf{x}\sim\mathbf{h}(\mathbf{X})}[g(\mathbf{x})]-\frac{1}{n}\sum_{i=1}^{n}g(\mathbf{h}(\mathbf{X}_{i}))
+1ni=1ng(ξi)𝔼𝐳𝒰r[g(𝐳)]|]\displaystyle\quad+\frac{1}{n}\sum_{i=1}^{n}g(\xi_{i})-\mathbb{E}_{\mathbf{z}\sim\mathcal{U}_{r}}[g(\mathbf{z})]|]
2K2suph1infg𝒢gh+1K2𝔼[suph1supg𝒢|𝔼𝐱𝐡(𝐗)[g(𝐱)]1ni=1ng(𝐡(𝐗i))\displaystyle\leq\frac{2}{K_{2}}\sup_{h\in\mathcal{H}^{1}}\inf_{g\in\mathcal{G}}\|g-h\|+\frac{1}{K_{2}}\mathbb{E}[\sup_{h\in\mathcal{H}^{1}}\sup_{g\in\mathcal{G}}|\mathbb{E}_{\mathbf{x}\sim\mathbf{h}(\mathbf{X})}[g(\mathbf{x})]-\frac{1}{n}\sum_{i=1}^{n}g(\mathbf{h}(\mathbf{X}_{i}))
+|1ni=1ng(ξi)𝔼𝐳𝒰r[g(𝐳)]|]\displaystyle\quad+|\frac{1}{n}\sum_{i=1}^{n}g(\xi_{i})-\mathbb{E}_{\mathbf{z}\sim\mathcal{U}_{r}}[g(\mathbf{z})]|]
2K2suph1infg𝒢gh+1K2𝔼[sup𝐡,g𝒢|𝔼𝐱𝐡(𝐗)[g(𝐱)]1ni=1ng(𝐡(𝐗i))|\displaystyle\leq\frac{2}{K_{2}}\sup_{h\in\mathcal{H}^{1}}\inf_{g\in\mathcal{G}}\|g-h\|+\frac{1}{K_{2}}\mathbb{E}[\sup_{\mathbf{h}\in\mathcal{H},g\in\mathcal{G}}|\mathbb{E}_{\mathbf{x}\sim\mathbf{h}(\mathbf{X})}[g(\mathbf{x})]-\frac{1}{n}\sum_{i=1}^{n}g(\mathbf{h}(\mathbf{X}_{i}))|
+supg𝒢|𝔼𝐳𝒰r[g(𝐳)]1ni=1ng(ξi)|]\displaystyle\quad+\sup_{g\in\mathcal{G}}|\mathbb{E}_{\mathbf{z}\sim\mathcal{U}_{r}}[g(\mathbf{z})]-\frac{1}{n}\sum_{i=1}^{n}g(\xi_{i})|]
2K2suph1infg𝒢gh+2K2𝔼[sup𝐡,g𝒢|1ni=1nσig(𝐡(𝐗i))|]\displaystyle\leq\frac{2}{K_{2}}\sup_{h\in\mathcal{H}^{1}}\inf_{g\in\mathcal{G}}\|g-h\|+\frac{2}{K_{2}}\mathbb{E}[\sup_{\mathbf{h}\in\mathcal{H},g\in\mathcal{G}}|\frac{1}{n}\sum_{i=1}^{n}\sigma_{i}g(\mathbf{h}(\mathbf{X}_{i}))|]
+2K2𝔼[supg𝒢|1ni=1nσig(ξi)|].\displaystyle\quad+\frac{2}{K_{2}}\mathbb{E}[\sup_{g\in\mathcal{G}}|\frac{1}{n}\sum_{i=1}^{n}\sigma_{i}g(\xi_{i})|].

The first inequality is due to the metric property. The second inequality bounds the term \overline{\mathcal{W}_{1}}(\mathbf{h}(\mathbf{X}),\mathcal{U}_{r})-\mathcal{W}_{1}(\mathbf{h}(\mathbf{X}),\mathcal{U}_{r}) using Huang et al. (2022, Lemma 24). The third inequality uses |\sup_{a}f-\sup_{a}g|\leq\sup_{a}|f-g|. The last two inequalities again follow by applying the ghost-sample and symmetrization technique as in the proof of Lemma 3. ∎
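For intuition about the quantities \hat{\mathcal{W}_{1}} and \mathcal{W}_{1} handled in this lemma, the sketch below computes the empirical 1-Wasserstein distance in the one-dimensional case, where the dual supremum over 1-Lipschitz critics has a closed form as an average gap between sorted samples; the paper's estimator instead uses a neural critic class \mathcal{G} on \mathbb{R}^{r}, so this is an illustration only.

```python
import numpy as np

def w1_empirical_1d(u, v):
    # empirical 1-Wasserstein distance between equal-size 1-D samples;
    # in one dimension it equals the dual sup over 1-Lipschitz critics
    return float(np.mean(np.abs(np.sort(u) - np.sort(v))))

rng = np.random.default_rng(3)
h_coord = rng.beta(2.0, 5.0, size=1000)   # stand-in for one coordinate of h(X)
ref = rng.uniform(0.0, 1.0, size=1000)    # sample from the uniform reference
print(w1_empirical_1d(h_coord, ref))                 # > 0: h(X) is not yet uniform
print(w1_empirical_1d(rng.uniform(size=1000), ref))  # close to 0
```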

We now present the proof of our main result for upstream.

Proof of Theorem 1.

Combining all the lemmas proved in Sections B.1.1 and B.1.2, we obtain

\mathbb{E}[R(\hat{\mathbf{F}},\hat{\mathbf{h}})-R(\mathbf{F}^{*},\mathbf{h}^{*})]\leq\frac{2B^{2}}{\sqrt{n}}+(4B^{2}+32p+1)\|\mathbf{h}-\mathbf{h}^{*}\|+\frac{2}{K_{2}}\sup_{f\in\mathcal{H}^{1}}\inf_{g\in\mathcal{G}}\|g-f\| (22)
\quad+8B\mathbb{E}[\sup_{(\mathbf{F},\mathbf{h})\in\mathcal{B}_{pr}(B)\circ\mathcal{H}}|\frac{1}{n}\sum_{i=1}^{n}\sigma_{i}\mathbf{F}_{S_{i}}\mathbf{h}(\mathbf{X}_{S_{i},i})|]+C_{6}\sqrt{r}\sqrt{\frac{{\rm Pdim}(\mathcal{H}_{1})}{n}}
\quad+\frac{2}{K_{2}}\mathbb{E}[\sup_{\mathbf{h}\in\mathcal{H},g\in\mathcal{G}}|\frac{1}{n}\sum_{i=1}^{n}\sigma_{i}g(\mathbf{h}(\mathbf{X}_{i}))|]+\frac{2}{K_{2}}\mathbb{E}[\sup_{g\in\mathcal{G}}|\frac{1}{n}\sum_{i=1}^{n}\sigma_{i}g(\xi_{i})|].

Based on Bartlett et al. (2019), the pseudo dimension (VC dimension) of a class \mathcal{F} of networks with piecewise-linear activation functions can be bounded in terms of its depth L and width W, i.e., {\rm Pdim}(\mathcal{F})=O(W^{2}L^{2}\log(W^{2}L)).

Using {\rm Pdim}(\mathcal{F})=O(W^{2}L^{2}\log(W^{2}L)) for \mathcal{F}=\mathcal{N}\mathcal{N}(W,L) together with Theorems 5 and 6, (22) can be further bounded as follows:

𝔼[R(𝐅^,𝐡^)R(𝐅,𝐡)]\displaystyle\mathbb{E}[R(\hat{\mathbf{F}},\hat{\mathbf{h}})-R(\mathbf{F}^{*},\mathbf{h}^{*})] 1n+K1βd+1+K2βr+1\displaystyle\precsim\frac{1}{\sqrt{n}}+K_{1}^{-\frac{\beta}{d+1}}+K_{2}^{-\frac{\beta}{r+1}}
+K1n+r(W1L1)2log(W12L1)lognn\displaystyle\quad+\frac{K_{1}}{\sqrt{n}}+\sqrt{r}\sqrt{\frac{(W_{1}L_{1})^{2}\log(W_{1}^{2}L_{1})\log n}{n}}
+K1n+1n\displaystyle\quad+\frac{K_{1}}{\sqrt{n}}+\frac{1}{\sqrt{n}}
K1βd+1+K2βr+1+K12d+β2d+2n+K1n+1n\displaystyle\precsim K_{1}^{-\frac{\beta}{d+1}}+K_{2}^{-\frac{\beta}{r+1}}+\frac{K_{1}^{\frac{2d+\beta}{2d+2}}}{\sqrt{n}}+\frac{K_{1}}{\sqrt{n}}+\frac{1}{\sqrt{n}}
nβ2(d+1+β)𝕀β2+nβ2d+3βlogn𝕀β>2\displaystyle\precsim n^{-\frac{\beta}{2(d+1+\beta)}}\mathbb{I}_{\beta\leq 2}+n^{-\frac{\beta}{2d+3\beta}}\log n\mathbb{I}_{\beta>2}
O~(nβ2(d+1+β)𝕀β2+nβ2d+3β𝕀β>2),\displaystyle\precsim\tilde{O}\left(n^{-\frac{\beta}{2(d+1+\beta)}}\mathbb{I}_{\beta\leq 2}+n^{-\frac{\beta}{2d+3\beta}}\mathbb{I}_{\beta>2}\right),

where we set K1nd+12(d+1+β)K_{1}\asymp n^{\frac{d+1}{2(d+1+\beta)}}, K2nr+12(d+1+β)K_{2}\asymp n^{\frac{r+1}{2(d+1+\beta)}}, and W1K12d+β2d+2W_{1}\asymp K_{1}^{\frac{2d+\beta}{2d+2}} when β2\beta\leq 2.

When β>2\beta>2, we set K1nd+12d+3βK_{1}\asymp n^{\frac{d+1}{2d+3\beta}}, K2nd+12d+3βK_{2}\asymp n^{\frac{d+1}{2d+3\beta}}, and W1K12d+β2d+2W_{1}\asymp K_{1}^{\frac{2d+\beta}{2d+2}}.
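As a quick sanity check of these choices, specializing to the Lipschitz case \beta=1 (so \beta\leq 2) recovers the upstream rate stated in the introduction:

\frac{\beta}{2(d+1+\beta)}\bigg|_{\beta=1}=\frac{1}{2(d+2)},
\qquad\text{so}\qquad
\mathbb{E}\big[R(\hat{\mathbf{F}},\hat{\mathbf{h}})-R(\mathbf{F}^{*},\mathbf{h}^{*})\big]
=\tilde{O}\big(n^{-\frac{1}{2(d+2)}}\big),
\quad\text{with } K_{1}\asymp n^{\frac{d+1}{2(d+2)}},\; K_{2}\asymp n^{\frac{r+1}{2(d+2)}}.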

B.2 Proof of Theorem 2

We now present the proof of the result on the sparsity of the ERM solutions in the upstream stage. Unlike in the proof of Theorem 1, for the sake of expository coherence, we place the statements and proofs of the necessary lemmas after the proof of the corresponding theorem.

Proof of Theorem 2.

For simplicity, we denote n^{-t(\beta)}=n^{-\frac{\beta}{2(d+1+\beta)}}\mathbb{I}_{\beta\leq 2}+n^{-\frac{\beta}{2d+3\beta}}\log n\,\mathbb{I}_{\beta>2}.

By the definition of (\hat{\mathbf{F}},\hat{\mathbf{h}}) in (12), we know that \hat{R}(\hat{\mathbf{F}},\hat{\mathbf{h}})\leq\hat{R}(\mathbf{F}^{*},\hat{\mathbf{h}}), that is,

i=1pj=1np1n(Yi,j𝐅^i𝐡^(𝐗i,j))2+μi=1p𝐅^i1i=1pj=1np1n(Yi,j𝐅i𝐡^(𝐗i,j))2+μi=1p𝐅i1,\displaystyle\sum_{i=1}^{p}\sum_{j=1}^{n_{p}}\frac{1}{n}(Y_{i,j}-\hat{\mathbf{F}}_{i}\hat{\mathbf{h}}(\mathbf{X}_{i,j}))^{2}+\mu\sum_{i=1}^{p}\|\hat{\mathbf{F}}_{i}\|_{1}\leq\sum_{i=1}^{p}\sum_{j=1}^{n_{p}}\frac{1}{n}(Y_{i,j}-\mathbf{F}^{*}_{i}\hat{\mathbf{h}}(\mathbf{X}_{i,j}))^{2}+\mu\sum_{i=1}^{p}\|\mathbf{F}^{*}_{i}\|_{1},
i=1pj=1np1n(Yi,j𝐅^i𝐡^(𝐗i,j))2i=1pj=1np1n(Yi,j𝐅i𝐡^(𝐗i,j))2+μi=1p𝐅^i𝐅i1.\displaystyle\sum_{i=1}^{p}\sum_{j=1}^{n_{p}}\frac{1}{n}(Y_{i,j}-\hat{\mathbf{F}}_{i}\hat{\mathbf{h}}(\mathbf{X}_{i,j}))^{2}\leq\sum_{i=1}^{p}\sum_{j=1}^{n_{p}}\frac{1}{n}(Y_{i,j}-\mathbf{F}^{*}_{i}\hat{\mathbf{h}}(\mathbf{X}_{i,j}))^{2}+\mu\sum_{i=1}^{p}\|\hat{\mathbf{F}}_{i}-\mathbf{F}^{*}_{i}\|_{1}. (23)

Expanding the square,

i=1pj=1np1n(Yi,j𝐅^i𝐡^(𝐗i,j))2\displaystyle\sum_{i=1}^{p}\sum_{j=1}^{n_{p}}\frac{1}{n}(Y_{i,j}-\hat{\mathbf{F}}_{i}\hat{\mathbf{h}}(\mathbf{X}_{i,j}))^{2} =i=1pj=1np1n(Yi,j𝐅i𝐡^(𝐗i,j)+𝐅i𝐡^(𝐗i,j)𝐅^i𝐡^(𝐗i,j))2\displaystyle=\sum_{i=1}^{p}\sum_{j=1}^{n_{p}}\frac{1}{n}(Y_{i,j}-\mathbf{F}^{*}_{i}\hat{\mathbf{h}}(\mathbf{X}_{i,j})+\mathbf{F}^{*}_{i}\hat{\mathbf{h}}(\mathbf{X}_{i,j})-\hat{\mathbf{F}}_{i}\hat{\mathbf{h}}(\mathbf{X}_{i,j}))^{2} (24)
=i=1pj=1np1n(Yi,j𝐅i𝐡^(𝐗i,j))2\displaystyle=\sum_{i=1}^{p}\sum_{j=1}^{n_{p}}\frac{1}{n}(Y_{i,j}-\mathbf{F}^{*}_{i}\hat{\mathbf{h}}(\mathbf{X}_{i,j}))^{2}
+i=1pj=1np2n(Yi,j𝐅i𝐡^(𝐗i,j))(𝐅i𝐡^(𝐗i,j)𝐅^i𝐡^(𝐗i,j))\displaystyle+\sum_{i=1}^{p}\sum_{j=1}^{n_{p}}\frac{2}{n}(Y_{i,j}-\mathbf{F}^{*}_{i}\hat{\mathbf{h}}(\mathbf{X}_{i,j}))(\mathbf{F}^{*}_{i}\hat{\mathbf{h}}(\mathbf{X}_{i,j})-\hat{\mathbf{F}}_{i}\hat{\mathbf{h}}(\mathbf{X}_{i,j}))
\quad+\sum_{i=1}^{p}\sum_{j=1}^{n_{p}}\frac{1}{n}((\mathbf{F}^{*}_{i}-\hat{\mathbf{F}}_{i})\hat{\mathbf{h}}(\mathbf{X}_{i,j}))^{2},

we see that, after estimating the last two terms on the right-hand side of (24), we can substitute (24) into the left-hand side of (23) and obtain the desired result by selecting a proper \mu.

By Lemma 8 below, selecting \mu=\frac{\gamma}{2Bpr}n^{-t(\beta)}, the middle term of (24) can be bounded as follows with probability at least 1-2\delta over the draw of the dataset \mathcal{D} (w.h.p.):

i=1pj=1np2n(Yi,j𝐅i𝐡^(𝐗i,j))(𝐅i𝐡^(𝐗i,j)𝐅^i𝐡^(𝐗i,j))\displaystyle\sum_{i=1}^{p}\sum_{j=1}^{n_{p}}\frac{2}{n}(Y_{i,j}-\mathbf{F}^{*}_{i}\hat{\mathbf{h}}(\mathbf{X}_{i,j}))(\mathbf{F}^{*}_{i}\hat{\mathbf{h}}(\mathbf{X}_{i,j})-\hat{\mathbf{F}}_{i}\hat{\mathbf{h}}(\mathbf{X}_{i,j}))
=\displaystyle= i=1pj=1np2n(𝐅i𝐡(𝐗i,j)𝐅i𝐡^(𝐗i,j)+ϵi,j)(𝐅i𝐅^i)𝐡^(𝐗i,j)\displaystyle\sum_{i=1}^{p}\sum_{j=1}^{n_{p}}\frac{2}{n}(\mathbf{F}^{*}_{i}\mathbf{h}^{*}(\mathbf{X}_{i,j})-\mathbf{F}^{*}_{i}\hat{\mathbf{h}}(\mathbf{X}_{i,j})+\epsilon_{i,j})(\mathbf{F}^{*}_{i}-\hat{\mathbf{F}}_{i})\hat{\mathbf{h}}(\mathbf{X}_{i,j})
=\displaystyle= i=1pj=1np2n(𝐅i(𝐡(𝐗i,j)𝐡^(𝐗i,j))+ϵi,j)(𝐅i𝐅^i)𝐡^(𝐗i,j)\displaystyle\sum_{i=1}^{p}\sum_{j=1}^{n_{p}}\frac{2}{n}(\mathbf{F}^{*}_{i}(\mathbf{h}^{*}(\mathbf{X}_{i,j})-\hat{\mathbf{h}}(\mathbf{X}_{i,j}))+\epsilon_{i,j})(\mathbf{F}^{*}_{i}-\hat{\mathbf{F}}_{i})\hat{\mathbf{h}}(\mathbf{X}_{i,j})
\displaystyle\leq i=1p𝐅i𝐅^i1j=1np2n𝐅i(𝐡(𝐗i,j)𝐡^(𝐗i,j))𝐡^(𝐗i,j)\displaystyle\sum_{i=1}^{p}\|\mathbf{F}^{*}_{i}-\hat{\mathbf{F}}_{i}\|_{1}\|\sum_{j=1}^{n_{p}}\frac{2}{n}\mathbf{F}^{*}_{i}(\mathbf{h}^{*}(\mathbf{X}_{i,j})-\hat{\mathbf{h}}(\mathbf{X}_{i,j}))\hat{\mathbf{h}}(\mathbf{X}_{i,j})\|_{\infty}
+\displaystyle+ i=1p𝐅i𝐅^i1j=1ni2nϵi,j𝐡^(𝐗i,j)\displaystyle\sum_{i=1}^{p}\|\mathbf{F}^{*}_{i}-\hat{\mathbf{F}}_{i}\|_{1}\|\sum_{j=1}^{n_{i}}\frac{2}{n}\epsilon_{i,j}\hat{\mathbf{h}}(\mathbf{X}_{i,j})\|_{\infty}
w.h.p\displaystyle\overset{w.h.p}{\leq} i=1p𝐅i𝐅^i1((1+2B)(1+2B2)γnt(β)+(2B+1)log1δ2n+22log2rδni),\displaystyle\sum_{i=1}^{p}\|\mathbf{F}^{*}_{i}-\hat{\mathbf{F}}_{i}\|_{1}\left((1+2B)(1+2B_{2})\gamma n^{-t(\beta)}+(2B+1)\sqrt{\frac{\log\frac{1}{\delta}}{2n}}+2\sqrt{\frac{2\log\frac{2r}{\delta}}{n_{i}}}\right),

where \epsilon_{i,j} denotes an i.i.d. sample of \varepsilon_{1}; for simplicity we abbreviate the subscript, writing \epsilon_{i,j} instead of \epsilon_{1,i,j}.

By A(4) in Assumption 2, we obtain

\frac{1}{n}\sum_{i=1}^{p}\sum_{j=1}^{n_{p}}((\mathbf{F}^{*}_{i}-\hat{\mathbf{F}}_{i})\hat{\mathbf{h}}(\mathbf{X}_{i,j}))^{2}=\sum_{i=1}^{p}\|(\mathbf{F}^{*}_{i}-\hat{\mathbf{F}}_{i})\hat{\mathbf{H}}_{i}\|_{2}^{2}\geq\sum_{i=1}^{p}B_{1}\|\mathbf{F}^{*}_{i}-\hat{\mathbf{F}}_{i}\|_{2}^{2}.

Hence (23) can be further bounded as follows:

i=1pB1𝐅i𝐅^i22w.h.pμi=1p𝐅^i𝐅i1\displaystyle\sum_{i=1}^{p}B_{1}\|\mathbf{F}^{*}_{i}-\hat{\mathbf{F}}_{i}\|_{2}^{2}\overset{w.h.p}{\leq}\mu\sum_{i=1}^{p}\|\hat{\mathbf{F}}_{i}-\mathbf{F}^{*}_{i}\|_{1}
+i=1p𝐅i𝐅^i1((1+2B)(1+2B2)γnt(β)+(2B+1)log1δ2n+22log2rδni),\displaystyle\quad+\sum_{i=1}^{p}\|\mathbf{F}^{*}_{i}-\hat{\mathbf{F}}_{i}\|_{1}\left((1+2B)(1+2B_{2})\gamma n^{-t(\beta)}+(2B+1)\sqrt{\frac{\log\frac{1}{\delta}}{2n}}+2\sqrt{\frac{2\log\frac{2r}{\delta}}{n_{i}}}\right),

Replacing (2B+1)\sqrt{\frac{\log\frac{1}{\delta}}{2n}}+2\sqrt{\frac{2\log\frac{2r}{\delta}}{n_{i}}} by \mathcal{O}\left(\frac{1}{\sqrt{n}}\right) for simplicity, we obtain

i=1p\displaystyle~{}~{}~{}~{}\sum_{i=1}^{p} B1𝐅i𝐅^i22\displaystyle B_{1}\|\mathbf{F}^{*}_{i}-\hat{\mathbf{F}}_{i}\|_{2}^{2}
w.h.p\displaystyle\overset{w.h.p}{\leq} i=1p(((1+2B)(1+2B2)+12Bpr)γnt(β)+𝒪(1n))𝐅^i𝐅i1\displaystyle\sum_{i=1}^{p}\Big{(}\big{(}(1+2B)(1+2B_{2})+\frac{1}{2Bpr}\big{)}\gamma n^{-t(\beta)}+\mathcal{O}\big{(}\frac{1}{\sqrt{n}}\big{)}\Big{)}\|\hat{\mathbf{F}}_{i}-\mathbf{F}^{*}_{i}\|_{1}
w.h.p\displaystyle\overset{w.h.p}{\leq} i=1pr(((1+2B)(1+2B2)+12Bpr)γnt(β)+𝒪(1n))𝐅^i𝐅i2.\displaystyle\sum_{i=1}^{p}\sqrt{r}\Big{(}\big{(}(1+2B)(1+2B_{2})+\frac{1}{2Bpr}\big{)}\gamma n^{-t(\beta)}+\mathcal{O}\big{(}\frac{1}{\sqrt{n}}\big{)}\Big{)}\|\hat{\mathbf{F}}_{i}-\mathbf{F}^{*}_{i}\|_{2}.

Then we have

B1i=1p𝐅i𝐅^i22w.h.pi=1prp(((1+2B)(1+2B2)+12Bpr)γnt(β)+𝒪(1n)).B_{1}\sqrt{\sum_{i=1}^{p}\|\mathbf{F}^{*}_{i}-\hat{\mathbf{F}}_{i}\|_{2}^{2}}\overset{w.h.p}{\leq}\sum_{i=1}^{p}\sqrt{rp}\Big{(}\big{(}(1+2B)(1+2B_{2})+\frac{1}{2Bpr}\big{)}\gamma n^{-t(\beta)}+\mathcal{O}\big{(}\frac{1}{\sqrt{n}}\big{)}\Big{)}.

Thus we obtain the desired result, which means that, with high probability, the sparsity pattern of \hat{\mathbf{F}}_{i} is similar to that of \mathbf{F}_{i}^{*}. ∎
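To illustrate the mechanism behind this sparsity result, the following hypothetical sketch (not the authors' algorithm, in which \mathbf{F} and \mathbf{h} are learned jointly) refits a single task's coefficient vector on a fixed, already-learned representation with an \ell_{1} penalty of order n^{-1/2} and recovers an (approximately) correct active set.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
n, r = 500, 32

H = rng.normal(size=(n, r))              # stand-in for the learned representation h_hat(X)
F_true = np.zeros(r)
F_true[:4] = rng.normal(size=4)          # only 4 active coordinates in F_i*
y = H @ F_true + 0.1 * rng.normal(size=n)

# mu plays the role of the l1 penalty level; here of order n^{-1/2}
fit = Lasso(alpha=1.0 / np.sqrt(n), fit_intercept=False).fit(H, y)
print(np.flatnonzero(np.abs(fit.coef_) > 1e-8))   # approximately {0, 1, 2, 3}
```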

First, we state the following lemma, which is used in the proof of Lemma 8.

Lemma 7.
\mathbb{E}[\mathcal{W}_{1}(\hat{\mathbf{h}}(\mathbf{X}^{{}^{\prime}}),\mathbf{h}^{*}(\mathbf{X}^{{}^{\prime}}))+\mu\sum_{i=1}^{p}(\|\hat{\mathbf{F}}_{i}\|_{1}-\|\mathbf{F}^{*}_{i}\|_{1})]\leq\gamma n^{-t(\beta)},

where γ>0\gamma>0 is a constant.

Proof of Lemma 7.

Theorem 1 shows that \mathbb{E}[R(\hat{\mathbf{F}},\hat{\mathbf{h}})-R(\mathbf{F}^{*},\mathbf{h}^{*})]\precsim n^{-t(\beta)}, and by the definition of \mathbf{h}^{*} we know that

nt(β)\displaystyle n^{-t(\beta)}\gtrsim 𝔼[(YS𝐅S^𝐡^(𝐗S))2+𝒱[𝐡^(𝐗),S]+𝒲1(𝐡^(𝐗),𝒰r)(YS𝐅S𝐡(𝐗S))2\displaystyle\mathbb{E}[(Y^{{}^{\prime}}_{S}-\hat{\mathbf{F}_{S}}\hat{\mathbf{h}}(\mathbf{X}^{{}^{\prime}}_{S}))^{2}+\mathcal{V}[\hat{\mathbf{h}}(\mathbf{X}^{{}^{\prime}}),S]+\mathcal{W}_{1}(\hat{\mathbf{h}}(\mathbf{X}^{{}^{\prime}}),\mathcal{U}_{r})-(Y^{{}^{\prime}}_{S}-\mathbf{F}^{*}_{S}\mathbf{h}^{*}(\mathbf{X}^{{}^{\prime}}_{S}))^{2}
+μi=1p(𝐅^i1𝐅i1)]\displaystyle+\mu\sum_{i=1}^{p}(\|\hat{\mathbf{F}}_{i}\|_{1}-\|\mathbf{F}^{*}_{i}\|_{1})]
\displaystyle\gtrsim 𝔼[(ε1+𝐅S𝐡(𝐗S)𝐅S^𝐡^(𝐗S))2+𝒲1(𝐡^(𝐗),𝐡(𝐗))(ε1)2\displaystyle\mathbb{E}[(\varepsilon^{{}^{\prime}}_{1}+\mathbf{F}^{*}_{S}\mathbf{h}^{*}(\mathbf{X}^{{}^{\prime}}_{S})-\hat{\mathbf{F}_{S}}\hat{\mathbf{h}}(\mathbf{X}^{{}^{\prime}}_{S}))^{2}+\mathcal{W}_{1}(\hat{\mathbf{h}}(\mathbf{X}^{{}^{\prime}}),\mathbf{h}^{*}(\mathbf{X}^{{}^{\prime}}))-(\varepsilon^{{}^{\prime}}_{1})^{2}
+μi=1p(𝐅^i1𝐅i1)]\displaystyle+\mu\sum_{i=1}^{p}(\|\hat{\mathbf{F}}_{i}\|_{1}-\|\mathbf{F}^{*}_{i}\|_{1})]
=\displaystyle= 𝔼[2ε1(𝐅S𝐡(𝐗S)𝐅S^𝐡^(𝐗S))+𝒲1(𝐡^(𝐗),𝐡(𝐗))+μi=1p(𝐅^i1𝐅i1)]\displaystyle\mathbb{E}[2\varepsilon^{{}^{\prime}}_{1}(\mathbf{F}^{*}_{S}\mathbf{h}^{*}(\mathbf{X}^{{}^{\prime}}_{S})-\hat{\mathbf{F}_{S}}\hat{\mathbf{h}}(\mathbf{X}^{{}^{\prime}}_{S}))+\mathcal{W}_{1}(\hat{\mathbf{h}}(\mathbf{X}^{{}^{\prime}}),\mathbf{h}^{*}(\mathbf{X}^{{}^{\prime}}))+\mu\sum_{i=1}^{p}(\|\hat{\mathbf{F}}_{i}\|_{1}-\|\mathbf{F}^{*}_{i}\|_{1})]
=\displaystyle= 𝔼[𝒲1(𝐡^(𝐗),𝐡(𝐗))+μi=1p(𝐅^i1𝐅i1)]\displaystyle\mathbb{E}[\mathcal{W}_{1}(\hat{\mathbf{h}}(\mathbf{X}^{{}^{\prime}}),\mathbf{h}^{*}(\mathbf{X}^{{}^{\prime}}))+\mu\sum_{i=1}^{p}(\|\hat{\mathbf{F}}_{i}\|_{1}-\|\mathbf{F}^{*}_{i}\|_{1})]\

Here (\mathbf{X}^{{}^{\prime}}_{S},Y^{{}^{\prime}}_{S},\varepsilon^{{}^{\prime}}_{1}) is an i.i.d. copy of (\mathbf{X}_{S},Y_{S},\varepsilon_{1}). For simplicity of calculation, we can choose a constant \gamma>0 large enough that \mathbb{E}[\mathcal{W}_{1}(\hat{\mathbf{h}}(\mathbf{X}^{{}^{\prime}}),\mathbf{h}^{*}(\mathbf{X}^{{}^{\prime}}))+\mu\sum_{i=1}^{p}(\|\hat{\mathbf{F}}_{i}\|_{1}-\|\mathbf{F}^{*}_{i}\|_{1})]\leq\gamma n^{-t(\beta)}. ∎

Lemma 8.

For any \delta>0 and any i\in[p], with probability at least 1-2\delta we have:

j=1ni1n𝐅i(𝐡(𝐗i,j)𝐡^(𝐗i,j))𝐡^(𝐗i,j)\displaystyle\|\sum_{j=1}^{n_{i}}\frac{1}{n}\mathbf{F}^{*}_{i}(\mathbf{h}^{*}(\mathbf{X}_{i,j})-\hat{\mathbf{h}}(\mathbf{X}_{i,j}))\hat{\mathbf{h}}(\mathbf{X}_{i,j})\|_{\infty}\leq B(1+2B2)γnt(β)+Blog1δ2n,\displaystyle B(1+2B_{2})\gamma n^{-t(\beta)}+B\sqrt{\frac{\log\frac{1}{\delta}}{2n}},
j=1ni1nϵi,j𝐡^(𝐗i,j)\displaystyle\|\sum_{j=1}^{n_{i}}\frac{1}{n}\epsilon_{i,j}\hat{\mathbf{h}}(\mathbf{X}_{i,j})\|_{\infty}\leq (1+2B2)γnt(β)+log1δ2n+2log2rδni,\displaystyle(1+2B_{2})\gamma n^{-t(\beta)}+\sqrt{\frac{\log\frac{1}{\delta}}{2n}}+\sqrt{\frac{2\log\frac{2r}{\delta}}{n_{i}}},

where \gamma is a sufficiently large constant.

Proof of Lemma 8.

By A(1) in Assumption 1, we know that

j=1ni1n𝐅i(𝐡(𝐗i,j)𝐡^(𝐗i,j))𝐡^(𝐗i,j)\displaystyle\|\sum_{j=1}^{n_{i}}\frac{1}{n}\mathbf{F}^{*}_{i}(\mathbf{h}^{*}(\mathbf{X}_{i,j})-\hat{\mathbf{h}}(\mathbf{X}_{i,j}))\hat{\mathbf{h}}(\mathbf{X}_{i,j})\|_{\infty}\leq Bj=1ni1n(𝐡(𝐗i,j)𝐡^(𝐗i,j))1\displaystyle B\sum_{j=1}^{n_{i}}\frac{1}{n}\|(\mathbf{h}^{*}(\mathbf{X}_{i,j})-\hat{\mathbf{h}}(\mathbf{X}_{i,j}))\|_{1}
j=1ni1n𝐅i(𝐡(𝐗i,j)𝐡^(𝐗i,j))𝐡^(𝐗i,j)\displaystyle\|\sum_{j=1}^{n_{i}}\frac{1}{n}\mathbf{F}^{*}_{i}(\mathbf{h}^{*}(\mathbf{X}_{i,j})-\hat{\mathbf{h}}(\mathbf{X}_{i,j}))\hat{\mathbf{h}}(\mathbf{X}_{i,j})\|_{\infty}\leq Bi=1pj=1ni1n(𝐡(𝐗i,j)𝐡^(𝐗i,j))1,\displaystyle B\sum_{i=1}^{p}\sum_{j=1}^{n_{i}}\frac{1}{n}\|(\mathbf{h}^{*}(\mathbf{X}_{i,j})-\hat{\mathbf{h}}(\mathbf{X}_{i,j}))\|_{1},
j=1ni1nϵi,j𝐡^(𝐗i,j)\displaystyle\|\sum_{j=1}^{n_{i}}\frac{1}{n}\epsilon_{i,j}\hat{\mathbf{h}}(\mathbf{X}_{i,j})\|_{\infty}\leq i=1pj=1ni1nϵi,j𝐡(𝐗i,j)𝐡^(𝐗i,j)1\displaystyle\sum_{i=1}^{p}\sum_{j=1}^{n_{i}}\frac{1}{n}\|\epsilon_{i,j}\|_{\infty}\|\mathbf{h}^{*}(\mathbf{X}_{i,j})-\hat{\mathbf{h}}(\mathbf{X}_{i,j})\|_{1}
+j=1ni1nϵi,j𝐡(𝐗i,j),\displaystyle+\|\sum_{j=1}^{n_{i}}\frac{1}{n}\epsilon_{i,j}\mathbf{h}^{*}(\mathbf{X}_{i,j})\|_{\infty},

By Lemma 7, selecting \mu=\frac{\gamma}{2Bpr}n^{-t(\beta)} gives \mathbb{E}[\mathcal{W}_{1}(\hat{\mathbf{h}}(\mathbf{X}^{{}^{\prime}}),\mathbf{h}^{*}(\mathbf{X}^{{}^{\prime}}))]\leq 2\gamma n^{-t(\beta)}.

By Theorem 3.3 in Mohri et al. (2018) and Theorem 6, for any \delta>0, with probability at least 1-\delta:

i=1pj=1ni1n𝐡(𝐗i,j)𝐡^(𝐗i,j)1𝔼[𝐡(𝐗)𝐡^(𝐗)1]\displaystyle\sum_{i=1}^{p}\sum_{j=1}^{n_{i}}\frac{1}{n}\|\mathbf{h}^{*}(\mathbf{X}_{i,j})-\hat{\mathbf{h}}(\mathbf{X}_{i,j})\|_{1}-\mathbb{E}[\|\mathbf{h}^{*}(\mathbf{X}^{{}^{\prime}})-\hat{\mathbf{h}}(\mathbf{X}^{{}^{\prime}})\|_{1}]
\displaystyle\leq 2𝔼[sup𝐡1ni=1pj=1niσi𝐡(𝐗i,j)𝐡(𝐗i,j)1]+log1δ2n\displaystyle 2\mathbb{E}[\sup_{\mathbf{h}\in\mathcal{H}}\frac{1}{n}\sum_{i=1}^{p}\sum_{j=1}^{n_{i}}\sigma_{i}\|\mathbf{h}(\mathbf{X}_{i,j})-\mathbf{h}^{*}(\mathbf{X}_{i,j})\|_{1}]+\sqrt{\frac{\log\frac{1}{\delta}}{2n}}
\displaystyle\leq 2𝔼[sup𝐡1ni=1pj=1niσi,j𝐡(𝐗i,j)𝐡(𝐗i,j)1]+log1δ2n\displaystyle 2\mathbb{E}[\sup_{\mathbf{h}\in\mathcal{H}}\frac{1}{n}\sum_{i=1}^{p}\sum_{j=1}^{n_{i}}\sigma_{i,j}\|\mathbf{h}(\mathbf{X}_{i,j})-\mathbf{h}^{*}(\mathbf{X}_{i,j})\|_{1}]+\sqrt{\frac{\log\frac{1}{\delta}}{2n}}
\displaystyle\leq log1δ2n+2rK12(L1+2+log(d+1))n\displaystyle\sqrt{\frac{\log\frac{1}{\delta}}{2n}}+\frac{2rK_{1}\sqrt{2(L_{1}+2+\log(d+1))}}{\sqrt{n}}
\displaystyle\leq γnt(β)+log1δ2n.\displaystyle\gamma n^{-t(\beta)}+\sqrt{\frac{\log\frac{1}{\delta}}{2n}}.

The last inequality is due to the choices of W_{1},L_{1},K_{1} made in the proof of Theorem 1. Then, by Hoeffding's inequality:

[j=1ni1niϵi,j𝐡(𝐗i,j)δ0]\displaystyle\mathbb{P}[\|\sum_{j=1}^{n_{i}}\frac{1}{n_{i}}\epsilon_{i,j}\mathbf{h}^{*}(\mathbf{X}_{i,j})\|_{\infty}\geq\delta_{0}]
=\displaystyle= [j=1ni1niϵi,j𝐡(𝐗i,j)𝔼[ε1𝐡(𝐗)]δ0]\displaystyle\mathbb{P}[\|\sum_{j=1}^{n_{i}}\frac{1}{n_{i}}\epsilon_{i,j}\mathbf{h}^{*}(\mathbf{X}_{i,j})-\mathbb{E}[\varepsilon_{1}\mathbf{h}^{*}(\mathbf{X})]\|_{\infty}\geq\delta_{0}]
=\displaystyle= [k[r],|j=1ni1niϵi,j𝐡k(𝐗i,j)|δ0]\displaystyle\mathbb{P}[\exists k\in[r],|\sum_{j=1}^{n_{i}}\frac{1}{n_{i}}\epsilon_{i,j}\mathbf{h}^{*}_{k}(\mathbf{X}_{i,j})|\geq\delta_{0}]
=\displaystyle= [k[r],|j=1ni1niϵi,j𝐡k(𝐗i,j)𝔼[ε1𝐡k(X)]|δ0]\displaystyle\mathbb{P}[\exists k\in[r],|\sum_{j=1}^{n_{i}}\frac{1}{n_{i}}\epsilon_{i,j}\mathbf{h}^{*}_{k}(\mathbf{X}_{i,j})-\mathbb{E}[\varepsilon_{1}\mathbf{h}^{*}_{k}(X)]|\geq\delta_{0}]
\displaystyle\leq 2rexp(δ02ni2).\displaystyle 2r\exp(-\frac{\delta_{0}^{2}n_{i}}{2}).

Setting \delta_{0}=\sqrt{\frac{2\log\frac{2r}{\delta}}{n_{i}}}, we conclude that, with probability at least 1-\delta, \|\sum_{j=1}^{n_{i}}\frac{1}{n_{i}}\epsilon_{i,j}\mathbf{h}^{*}(\mathbf{X}_{i,j})\|_{\infty}\leq\sqrt{\frac{2\log\frac{2r}{\delta}}{n_{i}}}.
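For completeness, this choice of \delta_{0} follows from inverting the tail bound above:

2r\exp\Big(-\frac{\delta_{0}^{2}n_{i}}{2}\Big)\leq\delta
\;\Longleftrightarrow\;
\delta_{0}^{2}\geq\frac{2\log\frac{2r}{\delta}}{n_{i}}
\;\Longleftrightarrow\;
\delta_{0}\geq\sqrt{\frac{2\log\frac{2r}{\delta}}{n_{i}}}.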

Then, by A(5) in Assumption 2, with probability at least 1-2\delta we get:

j=1ni1n𝐅i(𝐡(𝐗i,j)𝐡^(𝐗i,j))𝐡^(𝐗i,j)\displaystyle\|\sum_{j=1}^{n_{i}}\frac{1}{n}\mathbf{F}^{*}_{i}(\mathbf{h}^{*}(\mathbf{X}_{i,j})-\hat{\mathbf{h}}(\mathbf{X}_{i,j}))\hat{\mathbf{h}}(\mathbf{X}_{i,j})\|_{\infty}
\displaystyle\leq Bj=1ni1n𝐡(𝐗i,j)𝐡^(𝐗i,j)1B𝔼[𝐡(𝐗)𝐡^(𝐗)1]+B𝔼[𝐡(𝐗)𝐡^(𝐗)1]\displaystyle B\sum_{j=1}^{n_{i}}\frac{1}{n}\|\mathbf{h}^{*}(\mathbf{X}_{i,j})-\hat{\mathbf{h}}(\mathbf{X}_{i,j})\|_{1}-B\mathbb{E}[\|\mathbf{h}^{*}(\mathbf{X}^{{}^{\prime}})-\hat{\mathbf{h}}(\mathbf{X}^{{}^{\prime}})\|_{1}]+B\mathbb{E}[\|\mathbf{h}^{*}(\mathbf{X}^{{}^{\prime}})-\hat{\mathbf{h}}(\mathbf{X}^{{}^{\prime}})\|_{1}]
\displaystyle\leq Bj=1ni1n𝐡(𝐗i,j)𝐡^(𝐗i,j)1B𝔼[𝐡(𝐗)𝐡^(𝐗)1]+BB2𝔼[𝒲1(𝐡^(𝐗),𝐡(𝐗))]\displaystyle B\sum_{j=1}^{n_{i}}\frac{1}{n}\|\mathbf{h}^{*}(\mathbf{X}_{i,j})-\hat{\mathbf{h}}(\mathbf{X}_{i,j})\|_{1}-B\mathbb{E}[\|\mathbf{h}^{*}(\mathbf{X}^{{}^{\prime}})-\hat{\mathbf{h}}(\mathbf{X}^{{}^{\prime}})\|_{1}]+BB_{2}\mathbb{E}[\mathcal{W}_{1}(\hat{\mathbf{h}}(\mathbf{X}^{{}^{\prime}}),\mathbf{h}^{*}(\mathbf{X}^{{}^{\prime}}))]
\displaystyle\leq B(1+2B2)γnt(β)+Blog1δ2n.\displaystyle B(1+2B_{2})\gamma n^{-t(\beta)}+B\sqrt{\frac{\log\frac{1}{\delta}}{2n}}.

Following the same argument as before,

j=1ni1nϵi,j𝐡^(𝐗i,j)\displaystyle\|\sum_{j=1}^{n_{i}}\frac{1}{n}\epsilon_{i,j}\hat{\mathbf{h}}(\mathbf{X}_{i,j})\|_{\infty}\leq i=1pj=1ni1nϵi,j𝐡(𝐗i,j)𝐡^(𝐗i,j)1+j=1ni1nϵi,j𝐡(𝐗i,j)\displaystyle\sum_{i=1}^{p}\sum_{j=1}^{n_{i}}\frac{1}{n}\|\epsilon_{i,j}\|_{\infty}\|\mathbf{h}^{*}(\mathbf{X}_{i,j})-\hat{\mathbf{h}}(\mathbf{X}_{i,j})\|_{1}+\|\sum_{j=1}^{n_{i}}\frac{1}{n}\epsilon_{i,j}\mathbf{h}^{*}(\mathbf{X}_{i,j})\|_{\infty}
\displaystyle\leq (1+2B2)γnt(β)+log1δ2n+2log2rδni.\displaystyle(1+2B_{2})\gamma n^{-t(\beta)}+\sqrt{\frac{\log\frac{1}{\delta}}{2n}}+\sqrt{\frac{2\log\frac{2r}{\delta}}{n_{i}}}.

Appendix C Proofs of Section 3.2

Next, we present the proofs for the fine-tuning stage, which mainly use the structure of the training loss to control the excess error of the downstream task through the upstream task.

C.1 Proof of Theorem 3

Proof of Theorem 3.

In the following proof, for simplicity of analysis, we incorporate the independence penalty into $\ell(\cdot,\cdot)$, leveraging its $L_{0}$-Lipschitz property. Also, for simplicity of notation, we denote the function class $\mathcal{Q}\circ\mathcal{B}_{dd^{*}}(B)$ by $\mathcal{Q}_{dd^{*}}:=\{Q_{q,\mathbf{A}}(\cdot)=q(\mathbf{A}\cdot)\,|\,q\in\mathcal{Q},\mathbf{A}\in\mathcal{B}_{dd^{*}}(B)\}$. This considerably lightens the notation in the analysis without affecting its coherence.
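For concreteness, the following is a minimal PyTorch sketch (dimensions, widths, and names are illustrative and not taken from our released code) of the fine-tuned predictor analyzed here, $\mathbf{F}\hat{\mathbf{h}}(x)+Q_{q,\mathbf{A}}(x)$ with $Q_{q,\mathbf{A}}(x)=q(\mathbf{A}x)$:

import torch
import torch.nn as nn

class FineTuneModel(nn.Module):
    def __init__(self, h_hat: nn.Module, d: int, d_star: int, r: int):
        super().__init__()
        self.h_hat = h_hat                        # upstream representation h_hat: R^d -> R^r
        self.F = nn.Linear(r, 1, bias=False)      # task-specific linear readout F
        self.A = nn.Linear(d, d_star, bias=False) # projection A in B_{dd*}(B)
        self.q = nn.Sequential(                   # small network q on the d*-dimensional features
            nn.Linear(d_star, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, x):
        return self.F(self.h_hat(x)) + self.q(self.A(x))

# usage sketch: h_hat comes from upstream training and is typically kept fixed
h_hat = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 4))
model = FineTuneModel(h_hat, d=16, d_star=2, r=4)
y = model(torch.randn(5, 16))  # shape (5, 1)

In the "complete transfer" case discussed later, $Q^{*}=0$ and the correction branch $q(\mathbf{A}\cdot)$ is dropped.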

𝔼[(𝐡^,Q^,𝐅^T)(𝐡,Q,𝐅T)]\displaystyle\mathbb{E}[\mathcal{L}(\hat{\mathbf{h}},\hat{Q},\hat{\mathbf{F}}_{T})-\mathcal{L}(\mathbf{h}^{*},Q^{*},\mathbf{F}_{T}^{*})] (25)
\displaystyle\leq 𝔼[(𝐡^,Q^,𝐅^T)^(𝐡^,Q^,𝐅^T)+^(𝐡^,Q^,𝐅^T)^(𝐡^,Qq,𝐀,𝐅T)\displaystyle\mathbb{E}[\mathcal{L}(\hat{\mathbf{h}},\hat{Q},\hat{\mathbf{F}}_{T})-\hat{\mathcal{L}}(\hat{\mathbf{h}},\hat{Q},\hat{\mathbf{F}}_{T})+\hat{\mathcal{L}}(\hat{\mathbf{h}},\hat{Q},\hat{\mathbf{F}}_{T})-\hat{\mathcal{L}}(\hat{\mathbf{h}},Q_{q,\mathbf{A}^{*}},\mathbf{F}_{T}^{*})
+^(𝐡^,Qq,𝐀,𝐅T)(𝐡^,Qq,𝐀,𝐅T)+(𝐡^,Qq,𝐀,𝐅T)(𝐡,Qq,𝐀,𝐅T)\displaystyle+\hat{\mathcal{L}}(\hat{\mathbf{h}},Q_{q,\mathbf{A}^{*}},\mathbf{F}_{T}^{*})-\mathcal{L}(\hat{\mathbf{h}},Q_{q,\mathbf{A}^{*}},\mathbf{F}_{T}^{*})+\mathcal{L}(\hat{\mathbf{h}},Q_{q,\mathbf{A}^{*}},\mathbf{F}_{T}^{*})-\mathcal{L}(\mathbf{h}^{*},Q_{q,\mathbf{A}^{*}},\mathbf{F}_{T}^{*})
+(𝐡,Qq,𝐀,𝐅T)(𝐡,Q,𝐅T)]\displaystyle+\mathcal{L}(\mathbf{h}^{*},Q_{q,\mathbf{A}^{*}},\mathbf{F}_{T}^{*})-\mathcal{L}(\mathbf{h}^{*},Q^{*},\mathbf{F}_{T}^{*})]
\displaystyle\leq 𝔼[(𝐡^,Qq,𝐀,𝐅T)(𝐡,Qq,𝐀,𝐅T)]4\displaystyle\underbrace{\mathbb{E}[\mathcal{L}(\hat{\mathbf{h}},Q_{q,\mathbf{A}^{*}},\mathbf{F}_{T}^{*})-\mathcal{L}(\mathbf{h}^{*},Q_{q,\mathbf{A}^{*}},\mathbf{F}_{T}^{*})]}_{\mathcal{L}_{4}}
\displaystyle+\underbrace{2\mathbb{E}[\sup_{(\mathbf{F},Q_{q,\mathbf{A}})\in\mathcal{B}_{r}(B)\circ\mathcal{Q}_{dd^{*}}}|\mathcal{L}(\hat{\mathbf{h}},Q_{q,\mathbf{A}},\mathbf{F})-\hat{\mathcal{L}}(\hat{\mathbf{h}},Q_{q,\mathbf{A}},\mathbf{F})|+\inf_{q\in\mathcal{Q}}\mathcal{L}(\mathbf{h}^{*},Q_{q,\mathbf{A}^{*}},\mathbf{F}_{T}^{*})-\mathcal{L}(\mathbf{h}^{*},Q^{*},\mathbf{F}_{T}^{*})]}_{\mathcal{L}_{5}}.

By Lemma 9 below, we have:

4O~(nβ2(d+1+β)𝕀β2+nβ2d+3β𝕀β>2).\mathcal{L}_{4}\precsim\tilde{O}\left(n^{-\frac{\beta}{2(d+1+\beta)}}\mathbb{I}_{\beta\leq 2}+n^{-\frac{\beta}{2d+3\beta}}\mathbb{I}_{\beta>2}\right).

By A(6) and A(7) in Assumption 3, Theorems 5 and 6, and a process similar to the proof of Lemma 3, we have:

5=\displaystyle\mathcal{L}_{5}= 2𝔼[sup(𝐅,Q)r(B)𝒬dd|(𝐡^,Qq,𝐀,𝐅)^(𝐡^,Qq,𝐀,𝐅)|+infq𝒬(𝐡,Qq,𝐀,𝐅T)(𝐡,Q,𝐅T)]\displaystyle 2\mathbb{E}[\sup_{(\mathbf{F},Q)\in\mathcal{B}_{r}(B)\circ\mathcal{Q}_{dd^{*}}}|\mathcal{L}(\hat{\mathbf{h}},Q_{q,\mathbf{A}},\mathbf{F})-\hat{\mathcal{L}}(\hat{\mathbf{h}},Q_{q,\mathbf{A}},\mathbf{F})|+\inf_{q\in\mathcal{Q}}\mathcal{L}(\mathbf{h}^{*},Q_{q,\mathbf{A}^{*}},\mathbf{F}_{T}^{*})-\mathcal{L}(\mathbf{h}^{*},Q^{*},\mathbf{F}_{T}^{*})]
\displaystyle\leq 2\mathbb{E}[\sup_{(\mathbf{F},Q)\in\mathcal{B}_{r}(B)\circ\mathcal{Q}_{dd^{*}}}|\frac{1}{m}\sum_{i=1}^{m}\sigma_{i}\ell(\mathbf{F}\hat{\mathbf{h}}(\mathbf{X}_{T,i})+Q(\mathbf{X}_{T,i}),Y_{T,i})|]+L_{0}\inf_{q\in\mathcal{Q}}\|Q_{q,\mathbf{A}^{*}}-Q^{*}\|_{\infty}
\displaystyle\leq 2L_{0}\mathbb{E}[\sup_{(\mathbf{F},\mathbf{q},\mathbf{A})\in\mathcal{B}_{r}(B)\circ\mathcal{Q}\circ\mathcal{\mathbf{A}}}|\frac{1}{m}\sum_{i=1}^{m}\sigma_{i}(\mathbf{F}\hat{\mathbf{h}}(\mathbf{X}_{T,i})+\mathbf{q}(\mathbf{A}\mathbf{X}_{T,i}))|]+\frac{2B_{3}}{\sqrt{m}}
+L0inf𝐪𝒬𝐪(𝐀)𝐪(𝐀)\displaystyle+L_{0}\inf_{\mathbf{q}\in\mathcal{Q}}\|\mathbf{q}(\mathbf{A}^{*}\cdot)-\mathbf{q}^{*}(\mathbf{A}^{*}\cdot)\|_{\infty}
\displaystyle\precsim K+B+2B3m+Kβd+1\displaystyle\frac{K+B+2B_{3}}{\sqrt{m}}+K^{\frac{-\beta}{d^{*}+1}}
\displaystyle\precsim mβ2(d+1+2β).\displaystyle m^{-\frac{\beta}{2(d^{*}+1+2\beta)}}.

Adding $\mathcal{L}_{4}$ and $\mathcal{L}_{5}$ together, we obtain the desired result:

𝔼[(𝐡^,Q^,𝐅^T)(𝐡,Q,𝐅T)]O~(nβ2(d+1+β)𝕀β2+nβ2d+3β𝕀β>2)+𝒪(mβ2(d+1+2β)).\mathbb{E}[\mathcal{L}(\hat{\mathbf{h}},\hat{Q},\hat{\mathbf{F}}_{T})-\mathcal{L}(\mathbf{h}^{*},Q^{*},\mathbf{F}_{T}^{*})]\precsim\tilde{O}\left(n^{-\frac{\beta}{2(d+1+\beta)}}\mathbb{I}_{\beta\leq 2}+n^{-\frac{\beta}{2d+3\beta}}\mathbb{I}_{\beta>2}\right)+\mathcal{O}\left(m^{-\frac{\beta}{2(d^{*}+1+2\beta)}}\right).

Usually, by directly assuming a transferability condition, one can bound the risk incurred in learning a transferable representation; in our case this is the term $\mathcal{L}_{4}$. The remaining term, $\mathcal{L}_{5}$, represents the difficulty of the downstream task.

Under the "complete transfer" condition, the proof proceeds almost identically but yields $\mathcal{L}_{5}\precsim\frac{1}{\sqrt{m}}$.

Also, when the downstream task satisfies model (8), $Y_{T}=f^{*}_{T}(\mathbf{h}^{*}(X_{T}))+\varepsilon_{2}$, and assuming $f^{*}_{T}\in\mathcal{H}^{\beta}$, the total excess risk of the downstream task is bounded above by $\tilde{O}\left(n^{-\frac{\beta}{2(d+1+\beta)}}\mathbb{I}_{\beta\leq 2}+n^{-\frac{\beta}{2d+3\beta}}\mathbb{I}_{\beta>2}\right)+\tilde{O}\left(m^{-\frac{\beta}{2(r+1+2\beta)}}\right)$.
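As a quick numerical illustration (the values of $\beta$, $d$ and $d^{*}$ below are chosen for illustration only), the following snippet evaluates the exponents appearing in the bound of Theorem 3, showing the gain obtained when the intrinsic dimension $d^{*}$ is much smaller than the ambient dimension $d$:

# Rate exponents from the bound of Theorem 3 with beta = 1 (Lipschitz targets).
beta = 1.0
d, d_star = 32, 4
n_exponent = beta / (2 * (d + 1 + beta))                # exponent of the n-term
m_exponent_dstar = beta / (2 * (d_star + 1 + 2 * beta)) # exponent of the m-term
m_exponent_d = beta / (2 * (d + 1 + 2 * beta))          # same formula at the ambient dimension, for comparison
print(f"n-term exponent: {n_exponent:.4f}")
print(f"m-term exponent with d* = {d_star}: {m_exponent_dstar:.4f}; with d = {d}: {m_exponent_d:.4f}")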

Lemma 9.
4=\displaystyle\mathcal{L}_{4}= 𝔼[(𝐡^,Q,𝐅T)(𝐡,Q,𝐅T)]\displaystyle\mathbb{E}[\mathcal{L}(\hat{\mathbf{h}},Q,\mathbf{F}_{T}^{*})-\mathcal{L}(\mathbf{h}^{*},Q,\mathbf{F}_{T}^{*})] (26)
\displaystyle\precsim O~(nβ2(d+1+β)𝕀β2+nβ2d+3β𝕀β>2).\displaystyle\tilde{O}\left(n^{-\frac{\beta}{2(d+1+\beta)}}\mathbb{I}_{\beta\leq 2}+n^{-\frac{\beta}{2d+3\beta}}\mathbb{I}_{\beta>2}\right).
Proof of Lemma 9.

We use the triangle inequality for the Wasserstein distance, the fact that the distribution of $\mathbf{h}^{*}(\mathbf{X})$ is exactly $\mathcal{U}_{r}$, and A(8) in Assumption 3 to get

𝔼[𝒲1(𝐡^(XT),𝐡(XT))]\displaystyle\mathbb{E}[\mathcal{W}_{1}(\hat{\mathbf{h}}(X_{T}),\mathbf{h}^{*}(X_{T}))]\leq 𝔼[𝒲1(𝐡^(XT),𝐡^(X))+𝒲1(𝐡^(X),𝐡(𝐗))+𝒲1(𝐡(𝐗),𝐡(XT))]\displaystyle\mathbb{E}[\mathcal{W}_{1}(\hat{\mathbf{h}}(X_{T}),\hat{\mathbf{h}}(X))+\mathcal{W}_{1}(\hat{\mathbf{h}}(X),\mathbf{h}^{*}(\mathbf{X}))+\mathcal{W}_{1}(\mathbf{h}^{*}(\mathbf{X}),\mathbf{h}^{*}(X_{T}))]
=\displaystyle= 𝔼[𝒲1(𝐡^(XT),𝐡^(X))+𝒲1(𝐡^(X),𝒰r)+𝒲1(𝐡(𝐗),𝐡(XT))]\displaystyle\mathbb{E}[\mathcal{W}_{1}(\hat{\mathbf{h}}(X_{T}),\hat{\mathbf{h}}(X))+\mathcal{W}_{1}(\hat{\mathbf{h}}(X),\mathcal{U}_{r})+\mathcal{W}_{1}(\mathbf{h}^{*}(\mathbf{X}),\mathbf{h}^{*}(X_{T}))]
\displaystyle\precsim O~(nβ2(d+1+β)𝕀β2+nβ2d+3β𝕀β>2)+(nd+12(d+1+β)𝕀β2+nd+12d+3β𝕀β>2)ω\displaystyle\tilde{O}\left(n^{-\frac{\beta}{2(d+1+\beta)}}\mathbb{I}_{\beta\leq 2}+n^{-\frac{\beta}{2d+3\beta}}\mathbb{I}_{\beta>2}\right)+\left(n^{\frac{d+1}{2(d+1+\beta)}}\mathbb{I}_{\beta\leq 2}+n^{\frac{d+1}{2d+3\beta}}\mathbb{I}_{\beta>2}\right)\omega
\displaystyle\precsim O~(nβ2(d+1+β)𝕀β2+nβ2d+3β𝕀β>2).\displaystyle\tilde{O}\left(n^{-\frac{\beta}{2(d+1+\beta)}}\mathbb{I}_{\beta\leq 2}+n^{-\frac{\beta}{2d+3\beta}}\mathbb{I}_{\beta>2}\right).

The first inequality follows from the triangle inequality for the Wasserstein distance. The first equality holds because, by the optimality of $(f_{0},\mathbf{h}^{*})$, the distribution of $\mathbf{h}^{*}(\mathbf{X})$ is exactly $\mathcal{U}_{r}$. The second inequality follows from Lemma 7, and the coefficient in front of $\omega$ is due to the norm constraint on $\hat{\mathbf{h}}$.
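The triangle-inequality step above can be checked numerically in a simple one-dimensional toy setting (the distributions and sample sizes below are illustrative stand-ins only):

import numpy as np
from scipy.stats import wasserstein_distance

# Numerical sanity check of the triangle inequality for the 1-Wasserstein
# distance used in the decomposition above: W1(P, R) <= W1(P, Q) + W1(Q, R).
rng = np.random.default_rng(0)
p = rng.normal(0.0, 1.0, size=5000)   # stand-in for the law of h_hat(X_T)
q = rng.uniform(0.0, 1.0, size=5000)  # stand-in for the reference U_r (1-D slice)
r = rng.normal(0.5, 1.2, size=5000)   # stand-in for the law of h*(X_T)

lhs = wasserstein_distance(p, r)
rhs = wasserstein_distance(p, q) + wasserstein_distance(q, r)
print(f"W1(P,R) = {lhs:.3f} <= {rhs:.3f} = W1(P,Q) + W1(Q,R)")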

We then define the function $g_{y_{0}}(z)=\ell(z,y_{0})$. By A(7) in Assumption 3, $g_{y_{0}}(z)$ is an $L_{0}$-Lipschitz function.

4=\displaystyle\mathcal{L}_{4}= 𝔼𝒟[(𝐡^,Q,𝐅T)(𝐡,Q,𝐅T)]\displaystyle\mathbb{E}_{\mathcal{D}}[\mathcal{L}(\hat{\mathbf{h}},Q,\mathbf{F}_{T}^{*})-\mathcal{L}(\mathbf{h}^{*},Q,\mathbf{F}_{T}^{*})]
=\displaystyle= 𝔼𝒟[𝔼YYT[𝔼Z(𝐅T𝐡^(XT)+q(AXT))|Y[gY(Z)|Y]𝔼Z(𝐅T𝐡(XT)+q(AXT))|Y[gY(Z)|Y]]]\displaystyle\mathbb{E}_{\mathcal{D}}[\mathbb{E}_{Y\sim Y_{T}}[\mathbb{E}_{Z\sim(\mathbf{F}_{T}^{*}\hat{\mathbf{h}}(X_{T})+q(AX_{T}))|Y}[g_{Y}(Z)|Y]-\mathbb{E}_{Z^{\prime}\sim(\mathbf{F}_{T}^{*}\mathbf{h}^{*}(X_{T})+q(AX_{T}))|Y}[g_{Y}(Z^{\prime})|Y]]]
\displaystyle\leq 𝔼𝒟[𝔼YT[L0𝒲1((𝐅T𝐡^(XT)+q(AXT))|YT,(𝐅T𝐡(XT)+q(AXT))|YT)]]\displaystyle\mathbb{E}_{\mathcal{D}}[\mathbb{E}_{Y_{T}}[L_{0}\mathcal{W}_{1}((\mathbf{F}_{T}^{*}\hat{\mathbf{h}}(X_{T})+q(AX_{T}))|Y_{T},(\mathbf{F}_{T}^{*}\mathbf{h}^{*}(X_{T})+q(AX_{T}))|Y_{T})]]
=\displaystyle= L0𝔼𝒟[𝔼YT[𝒲1((𝐅T𝐡^(XT)+q(AXT))|YT,(𝐅T𝐡(XT)+q(AXT))|YT)]]\displaystyle L_{0}\mathbb{E}_{\mathcal{D}}[\mathbb{E}_{Y_{T}}[\mathcal{W}_{1}((\mathbf{F}_{T}^{*}\hat{\mathbf{h}}(X_{T})+q(AX_{T}))|Y_{T},(\mathbf{F}_{T}^{*}\mathbf{h}^{*}(X_{T})+q(AX_{T}))|Y_{T})]]
=\displaystyle= L0𝔼𝒟[𝔼YT[𝒲1(𝐅T𝐡^(XT)|YT,𝐅T𝐡(XT)|YT)]].\displaystyle L_{0}\mathbb{E}_{\mathcal{D}}[\mathbb{E}_{Y_{T}}[\mathcal{W}_{1}(\mathbf{F}_{T}^{*}\hat{\mathbf{h}}(X_{T})|Y_{T},\mathbf{F}_{T}^{*}\mathbf{h}^{*}(X_{T})|Y_{T})]].

The first inequality follows from the $L_{0}$-Lipschitz property of $g_{Y}$ and the dual (Kantorovich-Rubinstein) representation of the Wasserstein distance. Then, by A(9) in Assumption 3, we get

4\displaystyle\mathcal{L}_{4}\leq L0𝔼𝒟[𝔼YT[𝒲1(𝐅T𝐡^(XT)|YT,𝐅T𝐡(XT)|YT)]]\displaystyle L_{0}\mathbb{E}_{\mathcal{D}}[\mathbb{E}_{Y_{T}}[\mathcal{W}_{1}(\mathbf{F}_{T}^{*}\hat{\mathbf{h}}(X_{T})|Y_{T},\mathbf{F}_{T}^{*}\mathbf{h}^{*}(X_{T})|Y_{T})]]
\displaystyle\leq L0ν𝔼𝒟[𝒲1(𝐡^(XT),𝐡(XT))]\displaystyle L_{0}\nu\mathbb{E}_{\mathcal{D}}[\mathcal{W}_{1}(\hat{\mathbf{h}}(X_{T}),\mathbf{h}^{*}(X_{T}))]
\displaystyle\leq O~(nβ2(d+1+β)𝕀β2+nβ2d+3β𝕀β>2).\displaystyle\tilde{O}\left(n^{-\frac{\beta}{2(d+1+\beta)}}\mathbb{I}_{\beta\leq 2}+n^{-\frac{\beta}{2d+3\beta}}\mathbb{I}_{\beta>2}\right).

C.2 Proof of Theorem 4

The proof of Theorem 4 is almost the same as the proof in Section B.2.

Proof of Theorem 4.

We follow almost the same procedure as in the proof of Theorem 2. We consider the case of "complete transfer", where $Q^{*}=0$, no extra network is used, and the loss $\ell(\cdot,\cdot)$ takes the least-squares form, so that $\hat{\mathcal{L}}(\hat{\mathbf{h}},\mathbf{F})=\sum_{i=1}^{m}\ell(\mathbf{F}\hat{\mathbf{h}}(\mathbf{X}_{T,i}),Y_{T,i})/m+\chi\|\mathbf{F}\|_{1}=\sum_{i=1}^{m}(\mathbf{F}\hat{\mathbf{h}}(\mathbf{X}_{T,i})-Y_{T,i})^{2}/m+\chi\|\mathbf{F}\|_{1}$.

In this case the ERM solution is a vector 𝐅^Targmin𝐅^(𝐡^,𝐅)\hat{\mathbf{F}}_{T}\in\arg\min_{\mathbf{F}\in\mathcal{F}}\hat{\mathcal{L}}(\hat{\mathbf{h}},\mathbf{F}). By the definition of 𝐅^T\hat{\mathbf{F}}_{T}, we know that ^(𝐅^T,𝐡^)^(𝐅T,𝐡^)\hat{\mathcal{L}}(\hat{\mathbf{F}}_{T},\hat{\mathbf{h}})\leq\hat{\mathcal{L}}(\mathbf{F}_{T}^{*},\hat{\mathbf{h}}).

i=1m1m(YT,i𝐅^T𝐡^(𝐗T,i))2+χ𝐅^T1i=1m1m(YT,iFT𝐡^(𝐗T,i))2+χ𝐅T1.\sum_{i=1}^{m}\frac{1}{m}(Y_{T,i}-\hat{\mathbf{F}}_{T}\hat{\mathbf{h}}(\mathbf{X}_{T,i}))^{2}+\chi\|\hat{\mathbf{F}}_{T}\|_{1}\leq\sum_{i=1}^{m}\frac{1}{m}(Y_{T,i}-F^{*}_{T}\hat{\mathbf{h}}(\mathbf{X}_{T,i}))^{2}+\chi\|\mathbf{F}^{*}_{T}\|_{1}.
i=1m1m(YT,i𝐅^T𝐡^(𝐗T,i))2i=1m1m(YT,iFT𝐡^(𝐗T,i))2+χ𝐅^T𝐅T1.\displaystyle\sum_{i=1}^{m}\frac{1}{m}(Y_{T,i}-\hat{\mathbf{F}}_{T}\hat{\mathbf{h}}(\mathbf{X}_{T,i}))^{2}\leq\sum_{i=1}^{m}\frac{1}{m}(Y_{T,i}-F^{*}_{T}\hat{\mathbf{h}}(\mathbf{X}_{T,i}))^{2}+\chi\|\hat{\mathbf{F}}_{T}-\mathbf{F}^{*}_{T}\|_{1}. (27)

Using the decomposition

i=1m1m(YT,i𝐅^T𝐡^(𝐗T,i))2\displaystyle\sum_{i=1}^{m}\frac{1}{m}(Y_{T,i}-\hat{\mathbf{F}}_{T}\hat{\mathbf{h}}(\mathbf{X}_{T,i}))^{2} =i=1m1m(YT,i𝐅T𝐡^(𝐗T,i)+𝐅T𝐡^(𝐗T,i)𝐅^T𝐡^(𝐗T,i))2\displaystyle=\sum_{i=1}^{m}\frac{1}{m}(Y_{T,i}-\mathbf{F}^{*}_{T}\hat{\mathbf{h}}(\mathbf{X}_{T,i})+\mathbf{F}^{*}_{T}\hat{\mathbf{h}}(\mathbf{X}_{T,i})-\hat{\mathbf{F}}_{T}\hat{\mathbf{h}}(\mathbf{X}_{T,i}))^{2} (28)
=i=1m1m(YT,i𝐅T𝐡^(𝐗T,i))2\displaystyle=\sum_{i=1}^{m}\frac{1}{m}(Y_{T,i}-\mathbf{F}^{*}_{T}\hat{\mathbf{h}}(\mathbf{X}_{T,i}))^{2}
+i=1m2m(YT,i𝐅T𝐡^(𝐗T,i))(𝐅T𝐡^(𝐗T,i)𝐅^T𝐡^(𝐗T,i))\displaystyle\quad+\sum_{i=1}^{m}\frac{2}{m}(Y_{T,i}-\mathbf{F}^{*}_{T}\hat{\mathbf{h}}(\mathbf{X}_{T,i}))(\mathbf{F}^{*}_{T}\hat{\mathbf{h}}(\mathbf{X}_{T,i})-\hat{\mathbf{F}}_{T}\hat{\mathbf{h}}(\mathbf{X}_{T,i}))
+i=1m1m((𝐅T𝐅^T)𝐡^(𝐗T,i))2,\displaystyle\quad+\sum_{i=1}^{m}\frac{1}{m}((\mathbf{F}^{*}_{T}-\hat{\mathbf{F}}_{T})\hat{\mathbf{h}}(\mathbf{X}_{T,i}))^{2},

after estimating the last two terms on its right-hand side, we substitute (28) into the left-hand side of (27) and obtain the desired result by selecting a proper $\chi$.

Then, by Lemma 10, we know that

𝔼𝒟[i=1m2m(YT,i𝐅T𝐡^(𝐗T,i))(𝐅T𝐡^(𝐗T,i)𝐅^T𝐡^(𝐗T,i))]\displaystyle\quad\mathbb{E}_{\mathcal{D}}[\sum_{i=1}^{m}\frac{2}{m}(Y_{T,i}-\mathbf{F}^{*}_{T}\hat{\mathbf{h}}(\mathbf{X}_{T,i}))(\mathbf{F}^{*}_{T}\hat{\mathbf{h}}(\mathbf{X}_{T,i})-\hat{\mathbf{F}}_{T}\hat{\mathbf{h}}(\mathbf{X}_{T,i}))]
=𝔼𝒟[i=1m2m(𝐅T𝐡(𝐗T,i)𝐅T𝐡^(𝐗T,i)+ϵi)(𝐅T𝐅^T)𝐡^(𝐗T,i)]\displaystyle=\mathbb{E}_{\mathcal{D}}[\sum_{i=1}^{m}\frac{2}{m}(\mathbf{F}^{*}_{T}\mathbf{h}^{*}(\mathbf{X}_{T,i})-\mathbf{F}^{*}_{T}\hat{\mathbf{h}}(\mathbf{X}_{T,i})+\epsilon_{i})(\mathbf{F}^{*}_{T}-\hat{\mathbf{F}}_{T})\hat{\mathbf{h}}(\mathbf{X}_{T,i})]
=𝔼𝒟[i=1m2m(𝐅T(𝐡(𝐗T,i)𝐡^(𝐗T,i))+ϵi)(𝐅T𝐅^T)𝐡^(𝐗T,i)]\displaystyle=\mathbb{E}_{\mathcal{D}}[\sum_{i=1}^{m}\frac{2}{m}(\mathbf{F}^{*}_{T}(\mathbf{h}^{*}(\mathbf{X}_{T,i})-\hat{\mathbf{h}}(\mathbf{X}_{T,i}))+\epsilon_{i})(\mathbf{F}^{*}_{T}-\hat{\mathbf{F}}_{T})\hat{\mathbf{h}}(\mathbf{X}_{T,i})]
𝔼𝒟[𝐅T𝐅^T1i=1m2m𝐅T(𝐡(𝐗T,i)𝐡^(𝐗T,i))𝐡^(𝐗T,i)\displaystyle\leq\mathbb{E}_{\mathcal{D}}[\|\mathbf{F}^{*}_{T}-\hat{\mathbf{F}}_{T}\|_{1}\|\sum_{i=1}^{m}\frac{2}{m}\mathbf{F}^{*}_{T}(\mathbf{h}^{*}(\mathbf{X}_{T,i})-\hat{\mathbf{h}}(\mathbf{X}_{T,i}))\hat{\mathbf{h}}(\mathbf{X}_{T,i})\|_{\infty}
+𝐅T𝐅^T1i=1m2mϵi𝐡^(𝐗T,i)]\displaystyle\quad+\|\mathbf{F}^{*}_{T}-\hat{\mathbf{F}}_{T}\|_{1}\|\sum_{i=1}^{m}\frac{2}{m}\epsilon_{i}\hat{\mathbf{h}}(\mathbf{X}_{T,i})\|_{\infty}]
w.h.p𝐅T𝐅^T1(B(2log2δm+B2γ1nt(β))+B2γ1nt(β)+2log2δm+2log2rδm),\displaystyle\overset{w.h.p}{\leq}\|\mathbf{F}^{*}_{T}-\hat{\mathbf{F}}_{T}\|_{1}\left(B(\sqrt{\frac{2\log\frac{2}{\delta}}{m}}+B_{2}\gamma_{1}n^{-t(\beta)})+B_{2}\gamma_{1}n^{-t(\beta)}+\sqrt{\frac{2\log\frac{2}{\delta}}{m}}+\sqrt{\frac{2\log\frac{2r}{\delta}}{m}}\right),

where $\epsilon_{i}$ denotes an i.i.d. sample of $\varepsilon_{2}$ and, for simplicity, we omit the subscript.

By A(10) in Assumption 4, we can get

\mathbb{E}_{\mathcal{D}}[\frac{1}{m}\sum_{i=1}^{m}((\mathbf{F}_{T}^{*}-\hat{\mathbf{F}}_{T})\hat{\mathbf{h}}(\mathbf{X}_{T,i}))^{2}]=\mathbb{E}_{\mathcal{D}}[\|(\mathbf{F}^{*}_{T}-\hat{\mathbf{F}}_{T})\hat{\mathbf{H}}_{T}\|_{2}^{2}]\geq B_{1}\|\mathbf{F}^{*}_{T}-\hat{\mathbf{F}}_{T}\|_{2}^{2}.

So (27) can be further bounded as follows:

B1\displaystyle B_{1}\| 𝐅T𝐅^T22w.h.pχ𝐅^T𝐅T1\displaystyle\mathbf{F}^{*}_{T}-\hat{\mathbf{F}}_{T}\|_{2}^{2}\overset{w.h.p}{\leq}\chi\|\hat{\mathbf{F}}_{T}-\mathbf{F}^{*}_{T}\|_{1}
+𝐅T𝐅^T1(B(2log2δm+B2γ1nt(β))+B2γ1nt(β)+2log2δm+2log2rδm)\displaystyle+\|\mathbf{F}^{*}_{T}-\hat{\mathbf{F}}_{T}\|_{1}\left(B(\sqrt{\frac{2\log\frac{2}{\delta}}{m}}+B_{2}\gamma_{1}n^{-t(\beta)})+B_{2}\gamma_{1}n^{-t(\beta)}+\sqrt{\frac{2\log\frac{2}{\delta}}{m}}+\sqrt{\frac{2\log\frac{2r}{\delta}}{m}}\right)
B1\displaystyle B_{1} 𝐅T𝐅^T22\displaystyle\|\mathbf{F}^{*}_{T}-\hat{\mathbf{F}}_{T}\|_{2}^{2}
w.h.p\displaystyle\overset{w.h.p}{\leq} (B(2log2δm+B2γ1nt(β))+B2γ1nt(β)+2log2δm+2log2rδm+χ)𝐅^T𝐅T1\displaystyle\Big{(}B(\sqrt{\frac{2\log\frac{2}{\delta}}{m}}+B_{2}\gamma_{1}n^{-t(\beta)})+B_{2}\gamma_{1}n^{-t(\beta)}+\sqrt{\frac{2\log\frac{2}{\delta}}{m}}+\sqrt{\frac{2\log\frac{2r}{\delta}}{m}}+\chi\Big{)}\|\hat{\mathbf{F}}_{T}-\mathbf{F}^{*}_{T}\|_{1}
w.h.p\displaystyle\overset{w.h.p}{\leq} r(B(2log2δm+B2γ1nt(β))+B2γ1nt(β)+2log2δm+2log2rδm+χ)𝐅^T𝐅T2\displaystyle\sqrt{r}\Big{(}B(\sqrt{\frac{2\log\frac{2}{\delta}}{m}}+B_{2}\gamma_{1}n^{-t(\beta)})+B_{2}\gamma_{1}n^{-t(\beta)}+\sqrt{\frac{2\log\frac{2}{\delta}}{m}}+\sqrt{\frac{2\log\frac{2r}{\delta}}{m}}+\chi\Big{)}\|\hat{\mathbf{F}}_{T}-\mathbf{F}^{*}_{T}\|_{2}

Thus, we have

B1𝐅T\displaystyle B_{1}\|\mathbf{F}^{*}_{T}- 𝐅^T2\displaystyle\hat{\mathbf{F}}_{T}\|_{2}
w.h.p\displaystyle\overset{w.h.p}{\leq} r(B(2log2δm+B2γ1nt(β))+B2γ1nt(β)+2log2δm+2log2rδm+χ)\displaystyle\sqrt{r}\Big{(}B(\sqrt{\frac{2\log\frac{2}{\delta}}{m}}+B_{2}\gamma_{1}n^{-t(\beta)})+B_{2}\gamma_{1}n^{-t(\beta)}+\sqrt{\frac{2\log\frac{2}{\delta}}{m}}+\sqrt{\frac{2\log\frac{2r}{\delta}}{m}}+\chi\Big{)}

Setting $\chi=\frac{1}{\sqrt{m}}$, we obtain the desired result, which implies that, with high probability, the sparsity of $\hat{\mathbf{F}}_{T}$ is similar to that of $\mathbf{F}_{T}^{*}$.
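The "complete transfer" ERM step analyzed above is an $\ell_{1}$-penalized least-squares problem in the fitted features $\hat{\mathbf{h}}(\mathbf{X}_{T,i})$. The following is a hedged sketch of this step on synthetic data (not our released code) with $\chi=1/\sqrt{m}$; note that scikit-learn's Lasso minimizes $\frac{1}{2m}\|y-Hw\|_{2}^{2}+\alpha\|w\|_{1}$, so $\alpha=\chi/2$ matches the objective up to that convention.

import numpy as np
from sklearn.linear_model import Lasso

# Sketch of F_hat_T = argmin_F (1/m) sum_i (F h_hat(X_{T,i}) - Y_{T,i})^2 + chi ||F||_1.
rng = np.random.default_rng(0)
m, r = 200, 10
H_hat = rng.normal(size=(m, r))                        # rows stand in for h_hat(X_{T,i})
F_true = np.zeros(r); F_true[:3] = [1.5, -2.0, 0.7]    # sparse stand-in for F_T^*
y = H_hat @ F_true + 0.1 * rng.normal(size=m)

chi = 1.0 / np.sqrt(m)
F_hat = Lasso(alpha=chi / 2, fit_intercept=False).fit(H_hat, y).coef_
print("nonzero coordinates of F_hat:", np.flatnonzero(np.abs(F_hat) > 1e-8))

On such synthetic data the estimate $\hat{\mathbf{F}}_{T}$ typically recovers the support of the sparse $\mathbf{F}_{T}^{*}$, in line with the sparsity conclusion above.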

Lemma 10.

For any $\delta>0$, with probability at least $1-2\delta$ we have:

𝔼𝒟[i=1m1m𝐅T(𝐡(xT,i)𝐡^(xT,i))𝐡^(xT,i)]\displaystyle\mathbb{E}_{\mathcal{D}}[\|\sum_{i=1}^{m}\frac{1}{m}\mathbf{F}^{*}_{T}(\mathbf{h}^{*}(x_{T,i})-\hat{\mathbf{h}}(x_{T,i}))\hat{\mathbf{h}}(x_{T,i})\|_{\infty}]\leq B(2log2δm+B2γ1nt(β)),\displaystyle B(\sqrt{\frac{2\log\frac{2}{\delta}}{m}}+B_{2}\gamma_{1}n^{-t(\beta)}),
𝔼𝒟[i=1m1mϵi𝐡^(xT,i)]\displaystyle\mathbb{E}_{\mathcal{D}}[\|\sum_{i=1}^{m}\frac{1}{m}\epsilon_{i}\hat{\mathbf{h}}(x_{T,i})\|_{\infty}]\leq B2γ1nt(β)+2log2δm+2log2rδm,\displaystyle B_{2}\gamma_{1}n^{-t(\beta)}+\sqrt{\frac{2\log\frac{2}{\delta}}{m}}+\sqrt{\frac{2\log\frac{2r}{\delta}}{m}},

where $\gamma_{1}$ is a sufficiently large constant.

Proof of Lemma 10.

By A(6) in Assumption 3 we know that,

\displaystyle\|\sum_{i=1}^{m}\frac{1}{m}\mathbf{F}^{*}_{T}(\mathbf{h}^{*}(\mathbf{X}_{T,i})-\hat{\mathbf{h}}(\mathbf{X}_{T,i}))\hat{\mathbf{h}}(\mathbf{X}_{T,i})\|_{\infty}\leq B\sum_{i=1}^{m}\frac{1}{m}\|\mathbf{h}^{*}(\mathbf{X}_{T,i})-\hat{\mathbf{h}}(\mathbf{X}_{T,i})\|_{1},
i=1m1mϵi𝐡^(𝐗T,i)\displaystyle\|\sum_{i=1}^{m}\frac{1}{m}\epsilon_{i}\hat{\mathbf{h}}(\mathbf{X}_{T,i})\|_{\infty}\leq i=1m1mϵi𝐡(𝐗T,i)𝐡^(𝐗T,i)1+i=1m1mϵi𝐡(𝐗T,i),\displaystyle\sum_{i=1}^{m}\frac{1}{m}\|\epsilon_{i}\|_{\infty}\|\mathbf{h}^{*}(\mathbf{X}_{T,i})-\hat{\mathbf{h}}(\mathbf{X}_{T,i})\|_{1}+\|\sum_{i=1}^{m}\frac{1}{m}\epsilon_{i}\mathbf{h}^{*}(\mathbf{X}_{T,i})\|_{\infty},

By Lemma 9, we know that $\mathbb{E}[\mathcal{W}_{1}(\hat{\mathbf{h}}(X_{T}),\mathbf{h}^{*}(X_{T}))]\leq\tilde{O}\left(n^{-\frac{\beta}{2(d+1+\beta)}}\mathbb{I}_{\beta\leq 2}+n^{-\frac{\beta}{2d+3\beta}}\mathbb{I}_{\beta>2}\right)$. For simplicity of calculation, we can choose a constant $\gamma_{1}>0$ large enough that $\mathbb{E}[\mathcal{W}_{1}(\hat{\mathbf{h}}(X_{T}),\mathbf{h}^{*}(X_{T}))]\leq\gamma_{1}n^{-t(\beta)}$.

By A(11) in Assumption 4, $\mathbb{E}[\|\mathbf{h}^{*}(X_{T}^{\prime})-\hat{\mathbf{h}}(X_{T}^{\prime})\|_{1}]\leq B_{2}n^{-t(\beta)}$. By the independence of the dataset $\mathcal{D}$ and $X_{T}^{\prime}$, we can use Hoeffding's inequality to get:

\mathbb{P}\left[\left|\sum_{i=1}^{m}\frac{1}{m}\|\mathbf{h}^{*}(\mathbf{X}_{T,i})-\hat{\mathbf{h}}(\mathbf{X}_{T,i})\|_{1}-\mathbb{E}_{X_{T}^{{}^{\prime}}}[\|\mathbf{h}^{*}(X_{T}^{{}^{\prime}})-\hat{\mathbf{h}}(X_{T}^{{}^{\prime}})\|_{1}]\right|\geq\delta_{1}\,\Big{|}\,\mathcal{D}\right]\leq 2\exp\left(-\frac{\delta_{1}^{2}m}{2}\right).

Then, setting $\delta_{1}=\sqrt{\frac{2\log\frac{2}{\delta}}{m}}$, for any draw of the dataset $\mathcal{D}$, with probability at least $1-\delta$ we have $\sum_{i=1}^{m}\frac{1}{m}\|\mathbf{h}^{*}(\mathbf{X}_{T,i})-\hat{\mathbf{h}}(\mathbf{X}_{T,i})\|_{1}-\mathbb{E}_{X_{T}^{{}^{\prime}}}[\|\mathbf{h}^{*}(X_{T}^{{}^{\prime}})-\hat{\mathbf{h}}(X_{T}^{{}^{\prime}})\|_{1}]\leq\delta_{1}$. So

𝔼𝒟[i=1m1m𝐡(𝐗T,i)𝐡^(𝐗T,i)1𝔼XT[𝐡(XT)𝐡^(XT)1]]\displaystyle\mathbb{E}_{\mathcal{D}}[\sum_{i=1}^{m}\frac{1}{m}\|\mathbf{h}^{*}(\mathbf{X}_{T,i})-\hat{\mathbf{h}}(\mathbf{X}_{T,i})\|_{1}-\mathbb{E}_{X_{T}^{{}^{\prime}}}[\|\mathbf{h}^{*}(X_{T}^{{}^{\prime}})-\hat{\mathbf{h}}(X_{T}^{{}^{\prime}})\|_{1}]]\leq δ1,\displaystyle\delta_{1},
𝔼𝒟[i=1m1m𝐡(𝐗T,i)𝐡^(𝐗T,i)1𝔼XT[𝐡(XT)𝐡^(XT)1]]\displaystyle\mathbb{E}_{\mathcal{D}}[\sum_{i=1}^{m}\frac{1}{m}\|\mathbf{h}^{*}(\mathbf{X}_{T,i})-\hat{\mathbf{h}}(\mathbf{X}_{T,i})\|_{1}-\mathbb{E}_{X_{T}^{{}^{\prime}}}[\|\mathbf{h}^{*}(X_{T}^{{}^{\prime}})-\hat{\mathbf{h}}(X_{T}^{{}^{\prime}})\|_{1}]]\leq 2log2δm.\displaystyle\sqrt{\frac{2\log\frac{2}{\delta}}{m}}.

Moving the term $\mathbb{E}_{X_{T}^{{}^{\prime}}}[\|\mathbf{h}^{*}(X_{T}^{{}^{\prime}})-\hat{\mathbf{h}}(X_{T}^{{}^{\prime}})\|_{1}]$ to the right-hand side and using A(5), we obtain:

𝔼𝒟[i=1m1m𝐡(𝐗T,i)𝐡^(𝐗T,i)1]\displaystyle\mathbb{E}_{\mathcal{D}}[\sum_{i=1}^{m}\frac{1}{m}\|\mathbf{h}^{*}(\mathbf{X}_{T,i})-\hat{\mathbf{h}}(\mathbf{X}_{T,i})\|_{1}]\leq 2log2δm+𝔼[𝐡(XT)𝐡^(XT)1],\displaystyle\sqrt{\frac{2\log\frac{2}{\delta}}{m}}+\mathbb{E}[\|\mathbf{h}^{*}(X_{T}^{{}^{\prime}})-\hat{\mathbf{h}}(X_{T}^{{}^{\prime}})\|_{1}],
𝔼𝒟[i=1m1m𝐡(𝐗T,i)𝐡^(𝐗T,i)1]\displaystyle\mathbb{E}_{\mathcal{D}}[\sum_{i=1}^{m}\frac{1}{m}\|\mathbf{h}^{*}(\mathbf{X}_{T,i})-\hat{\mathbf{h}}(\mathbf{X}_{T,i})\|_{1}]\leq 2log2δm+B2𝔼[𝒲1(𝐡(XT),𝐡^(XT))],\displaystyle\sqrt{\frac{2\log\frac{2}{\delta}}{m}}+B_{2}\mathbb{E}[\mathcal{W}_{1}(\mathbf{h}^{*}(X_{T}^{{}^{\prime}}),\hat{\mathbf{h}}(X_{T}^{{}^{\prime}}))],
𝔼𝒟[i=1m1m𝐡(𝐗T,i)𝐡^(𝐗T,i)1]\displaystyle\mathbb{E}_{\mathcal{D}}[\sum_{i=1}^{m}\frac{1}{m}\|\mathbf{h}^{*}(\mathbf{X}_{T,i})-\hat{\mathbf{h}}(\mathbf{X}_{T,i})\|_{1}]\leq 2log2δm+B2γ1nt(β).\displaystyle\sqrt{\frac{2\log\frac{2}{\delta}}{m}}+B_{2}\gamma_{1}n^{-t(\beta)}.

Also,

[j=1m1mϵi𝐡(𝐗T,i)δ2]\displaystyle\mathbb{P}[\|\sum_{j=1}^{m}\frac{1}{m}\epsilon_{i}\mathbf{h}^{*}(\mathbf{X}_{T,i})\|_{\infty}\geq\delta_{2}]
=\displaystyle= [j=1m1mϵi𝐡(𝐗T,i)𝔼[ε1𝐡(XT)]δ2]\displaystyle\mathbb{P}[\|\sum_{j=1}^{m}\frac{1}{m}\epsilon_{i}\mathbf{h}^{*}(\mathbf{X}_{T,i})-\mathbb{E}[\varepsilon_{1}\mathbf{h}^{*}(X_{T})]\|_{\infty}\geq\delta_{2}]
=\displaystyle= [k[r],|j=1m1mϵihk(𝐗T,i)|δ2]\displaystyle\mathbb{P}[\exists k\in[r],|\sum_{j=1}^{m}\frac{1}{m}\epsilon_{i}h^{*}_{k}(\mathbf{X}_{T,i})|\geq\delta_{2}]
=\displaystyle= [k[r],|j=1m1mϵihk(𝐗T,i)𝔼[ε1hk(X)]|δ2]\displaystyle\mathbb{P}[\exists k\in[r],|\sum_{j=1}^{m}\frac{1}{m}\epsilon_{i}h^{*}_{k}(\mathbf{X}_{T,i})-\mathbb{E}[\varepsilon_{1}h^{*}_{k}(X)]|\geq\delta_{2}]
\displaystyle\leq 2rexp(δ22m2).\displaystyle 2r\exp(-\frac{\delta_{2}^{2}m}{2}).

Setting $\delta_{2}=\sqrt{\frac{2\log\frac{2r}{\delta}}{m}}$, with probability at least $1-\delta$ we have $\|\sum_{i=1}^{m}\frac{1}{m}\epsilon_{i}\mathbf{h}^{*}(\mathbf{X}_{T,i})\|_{\infty}\leq\sqrt{\frac{2\log\frac{2r}{\delta}}{m}}$.

Then, again by A(11) in Assumption 4, with probability at least $1-2\delta$ we get:

\displaystyle\mathbb{E}_{\mathcal{D}}[\|\sum_{i=1}^{m}\frac{1}{m}\mathbf{F}^{*}_{T}(\mathbf{h}^{*}(\mathbf{X}_{T,i})-\hat{\mathbf{h}}(\mathbf{X}_{T,i}))\hat{\mathbf{h}}(\mathbf{X}_{T,i})\|_{\infty}]
\displaystyle\leq B𝔼𝒟[i=1m1m𝐡(𝐗T,i)𝐡^(𝐗T,i)1]B(2log2δm+B2γ1nt(β)).\displaystyle B\mathbb{E}_{\mathcal{D}}[\sum_{i=1}^{m}\frac{1}{m}\|\mathbf{h}^{*}(\mathbf{X}_{T,i})-\hat{\mathbf{h}}(\mathbf{X}_{T,i})\|_{1}]\leq B(\sqrt{\frac{2\log\frac{2}{\delta}}{m}}+B_{2}\gamma_{1}n^{-t(\beta)}).

and, following the same argument as before,

𝔼𝒟[i=1m1mϵi𝐡^(𝐗T,i)]\displaystyle\mathbb{E}_{\mathcal{D}}[\|\sum_{i=1}^{m}\frac{1}{m}\epsilon_{i}\hat{\mathbf{h}}(\mathbf{X}_{T,i})\|_{\infty}]\leq 𝔼𝒟[i=1m1mϵi𝐡(𝐗T,i)𝐡^(𝐗T,i)1+i=1m1mϵi𝐡(𝐗T,i)]\displaystyle\mathbb{E}_{\mathcal{D}}[\sum_{i=1}^{m}\frac{1}{m}\|\epsilon_{i}\|_{\infty}\|\mathbf{h}^{*}(\mathbf{X}_{T,i})-\hat{\mathbf{h}}(\mathbf{X}_{T,i})\|_{1}+\|\sum_{i=1}^{m}\frac{1}{m}\epsilon_{i}\mathbf{h}^{*}(\mathbf{X}_{T,i})\|_{\infty}]
\displaystyle\leq B2γ1nt(β)+2log2δm+2log2rδm.\displaystyle B_{2}\gamma_{1}n^{-t(\beta)}+\sqrt{\frac{2\log\frac{2}{\delta}}{m}}+\sqrt{\frac{2\log\frac{2r}{\delta}}{m}}.

Appendix D Theorems cited during our proofs

Theorem 5 (Theorem 3.2 in Jiao, Wang and Yang (2023)).

Let dd\in\mathbb{N} and α=s+β>0\alpha=s+\beta>0, where s0s\in\mathbb{N}_{0} and β(0,1]\beta\in(0,1].

(1) There exists c>0c>0 such that for any K1K\geq 1, any WcK(2d+α)/(2d+2)W\geq cK^{(2d+\alpha)/(2d+2)} and LL\geq 2log2(d+s)+22\left\lceil\log_{2}(d+s)\right\rceil+2,

(α,𝒩𝒩(W,L,K))Kα/(d+1).\mathcal{E}\left(\mathcal{H}^{\alpha},\mathcal{NN}(W,L,K)\right)\lesssim K^{-\alpha/(d+1)}.

(2) If d>2αd>2\alpha, then for any W,L,W2W,L\in\mathbb{N},W\geq 2 and K1K\geq 1,

(α,𝒩𝒩(W,L,K))(KL)2α/(d2α).\mathcal{E}\left(\mathcal{H}^{\alpha},\mathcal{N}\mathcal{N}(W,L,K)\right)\gtrsim(K\sqrt{L})^{-2\alpha/(d-2\alpha)}.

Since $\mathcal{SNN}_{d,1}(W,L,K)\subseteq\mathcal{NN}_{d,1}(W,L,K)\subseteq\mathcal{SNN}_{d,1}(W+1,L,K)$ (Proposition 2.1 in Jiao, Wang and Yang (2023)), the subsequent theorem can be employed in the proofs of Theorems 1 and 3.

Theorem 6 (Lemma 2.3 in Jiao, Wang and Yang (2023)).

For any 𝐱1,,𝐱n[B,B]d\bm{x}_{1},\ldots,\bm{x}_{n}\in[-B,B]^{d} with B1B\geq 1, let S:={(ϕ(𝐱1),,ϕ(𝐱n)):ϕ𝒮𝒩𝒩d,1(W,L,K)}nS:=\left\{\left(\phi\left(\bm{x}_{1}\right),\ldots,\phi\left(\bm{x}_{n}\right)\right):\phi\in\mathcal{SNN}_{d,1}(W,L,K)\right\}\subseteq\mathbb{R}^{n}, then

n(S)1nK2(L+2+log(d+1))max1jd+1i=1nxi,j2BK2(L+2+log(d+1))n,\mathcal{R}_{n}(S)\leq\frac{1}{n}K\sqrt{2(L+2+\log(d+1))}\max_{1\leq j\leq d+1}\sqrt{\sum_{i=1}^{n}x_{i,j}^{2}}\leq\frac{BK\sqrt{2(L+2+\log(d+1))}}{\sqrt{n}},

where xi,jx_{i,j} is the jj-th coordinate of the vector 𝐱~i=(𝐱i,1)d+1\tilde{\bm{x}}_{i}=\left(\bm{x}_{i}^{\top},1\right)^{\top}\in\mathbb{R}^{d+1}. When W2W\geq 2,

n(S)K22nmax1jd+1i=1nxi,j2K22n.\mathcal{R}_{n}(S)\geq\frac{K}{2\sqrt{2}n}\max_{1\leq j\leq d+1}\sqrt{\sum_{i=1}^{n}x_{i,j}^{2}}\geq\frac{K}{2\sqrt{2n}}.
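For reference, the upper bound of Theorem 6 can be evaluated directly for given network parameters; the snippet below (with illustrative values only, not those used in our proofs) shows how it shrinks at the $1/\sqrt{n}$ rate invoked repeatedly above.

import math

def rademacher_upper_bound(B, K, L, d, n):
    # B*K*sqrt(2*(L + 2 + log(d + 1))) / sqrt(n), the upper bound in Theorem 6
    return B * K * math.sqrt(2 * (L + 2 + math.log(d + 1))) / math.sqrt(n)

for n in (10**3, 10**4, 10**5):
    print(n, rademacher_upper_bound(B=1.0, K=5.0, L=4, d=20, n=n))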

Appendix E Experimental details.

In this section, we briefly highlight some important settings in our experiments. More details can be found in our released code in the supplementary material.

Hyperparameter search. For the classification task, we maintained a fixed batch size of 64 and used the Adam (Kingma and Ba, 2015) optimizer with its default settings, varying only the learning rate. Both upstream and downstream training consisted of 50 epochs. The upstream training employed a constant learning rate of 0.00001 with the Adam optimizer. For the downstream task, the learning rate for the fully connected layers was set to 0.1 with the CosineAnnealingLR scheduler, while the learning rate for EfficientNet's feature extractor, which serves as the Q network, was 0.00001. The hyperparameter search focused solely on the penalty parameters, with values for $\lambda$ selected from {5, 10, 15}, $\tau$ from {0, 10} and $\mu$ from {0, 10}. The hyperparameter for the independence penalty term in the downstream task, $\kappa$, was selected from {5, 10, 15}.
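For reproducibility, the downstream optimizer setup described above can be written as the following sketch (function and argument names are illustrative and not taken from the released code):

import torch

def build_downstream_optimizer(fc_layers, feature_extractor, epochs=50):
    # Adam with lr 0.1 for the fully connected layers, lr 1e-5 for the
    # EfficientNet feature extractor (the Q network), and a cosine schedule
    # over the downstream epochs, as described in the text.
    optimizer = torch.optim.Adam([
        {"params": fc_layers.parameters(), "lr": 0.1},
        {"params": feature_extractor.parameters(), "lr": 1e-5},
    ])
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    return optimizer, scheduler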

For the regression task, we maintained a fixed batch size of 50 and employed the Adam optimizer with settings identical to those used in the classification task, except for the learning rate. The upstream training consisted of 50 epochs and the downstream training of 100 epochs. The upstream training employed a learning rate of 0.2. For the downstream task, the learning rate for the fully connected layers was set to 0.005. Both training processes used the CosineAnnealingLR scheduler. The hyperparameter search focused solely on the penalty parameters, with values for $\lambda$ selected from {0.5, 1.0, 1.5, 2.0}, $\tau$ from {0, 0.2} and $\mu$ from {0, 0.5}. The parameter $\kappa$ for the independence penalty term in the downstream task was selected from $\{0.0625,0.125,0.25\}$, and $\chi$ for the $\ell_{1}$ penalty term was selected from $\{0.03125,0.0625\}$. To prevent certain penalty components from disproportionately affecting gradient calculations due to their larger values relative to the other penalty terms, and to facilitate parameter selection, the penalty terms are normalized by their detached versions. The total loss is then computed as a weighted sum of the individual penalty terms, as follows: $\text{penalty}_{1}/\text{penalty}_{1}.detach()+\lambda*\text{penalty}_{2}/\text{penalty}_{2}.detach()+\tau*\text{penalty}_{2}/\text{penalty}_{2}.detach()+\mu*\text{penalty}_{3}/\text{penalty}_{3}.detach()$. Due to constraints in computational resources and time, we did not explicitly explore the impact of $\zeta$ and $d^{*}$ on downstream task performance.
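A minimal sketch of this detach-normalization (our illustration, not the released implementation), assuming the individual penalty tensors have been computed elsewhere; the small eps guard against division by zero is an addition of ours:

import torch

def normalized_weighted_penalty(penalties, weights, eps=1e-12):
    # Divide each penalty by its detached value so every term has unit magnitude
    # in the forward pass; the weights (e.g. 1, lambda, tau, mu as in the text)
    # then only rescale the corresponding gradients.
    total = 0.0
    for w, p in zip(weights, penalties):
        total = total + w * p / (p.detach() + eps)
    return total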

Details about comparison with other works. Due to time constraints and issues with the code logic, we aligned all training details as closely as possible, except for the data input method. In the DomainBed codebase (Gulrajani and Lopez-Paz, 2020), each remaining domain contributes one batch of samples per step, while our input method shuffles all remaining domains together and draws one batch of samples. Therefore, to align as closely as possible, we set the batch size for upstream training to 21 and the batch size for downstream training to 64 when conducting comparative experiments in the DomainBed codebase. The typical hyperparameters for the compared algorithms are set to their default values in the DomainBed codebase.

To compare with the method proposed in Cai and Wei (2021), we selected only the "dog" and "elephant" photos from the PACS dataset for our experiments. For the KNN-based method, we split the downstream domain in an 80%-20% proportion to form a training set and a test set, respectively. We then used the data from all the remaining domains, together with the downstream training data, to construct a KNN classifier for the test samples. However, the high input dimension of photo data and the relatively small sample size make it impossible for the break rule in Algorithm 2 of Cai and Wei (2021) to be satisfied. To simplify the computation, we selected the values for the four domains from among {3, 3, 3, 3}, {4, 4, 4, 4}, \ldots, {9, 9, 9, 9}, and {10, 10, 10, 10}. We report only the highest accuracy figures in Table 1.

Details of experiments exploring the effect of representation dimensionality. To highlight the impact of sample size on estimation error, we trained the downstream models on OfficeHome using only 15% of the original training data. To investigate the influence of learning functions of different dimensions for the downstream task model (8), we added three extra layers to the downstream network; similarly, for the Wear dataset, we added two extra layers. For both datasets, we used cross-validation to select the optimal $\ell_{1}$ regularization hyperparameter.

Error bar. Due to limited computational resources, we conducted downstream training three times using the checkpoints from the initial upstream training. For comparative results with other methods, particularly those using the DomainBed code (Gulrajani and Lopez-Paz, 2020), we repeated the tests by only varying the seed. The error bar represents the standard deviation of the best result from these experimental runs.

Experiments Compute Resources. Our experiments were conducted on an Nvidia DGX Station workstation using a single Tesla V100 GPU. For a specific domain in the image dataset, the experimental time for a specific parameter setting ranges from 0.5 to 1.5 hours. Therefore, estimating the experimental results with error bars would require at least three days.

For additional details regarding the small hand-designed model architecture, please refer to the code provided in the supplementary materials.