

Evolving Metric Learning for Incremental and Decremental Features

Jiahua Dong, Yang Cong, Gan Sun, Tao Zhang, Xu Tang and Xiaowei Xu. This work is supported by the National Key Research and Development Program of China (2019YFB1310300) and the National Natural Science Foundation of China under Grants 61722311, 61821005 and 62003336. (Corresponding author: Yang Cong.) Jiahua Dong and Tao Zhang are with the State Key Laboratory of Robotics, Shenyang Institute of Automation, Chinese Academy of Sciences, Shenyang 110016, China, and also with the Institutes for Robotics and Intelligent Manufacturing, Chinese Academy of Sciences, Shenyang 110016, China, and also with the University of Chinese Academy of Sciences, Beijing 100049, China (e-mail: dongjiahua1995@gmail.com, zhangtao2@sia.cn). Yang Cong, Gan Sun and Xu Tang are with the State Key Laboratory of Robotics, Shenyang Institute of Automation, Chinese Academy of Sciences, Shenyang 110016, China, and also with the Institutes for Robotics and Intelligent Manufacturing, Chinese Academy of Sciences, Shenyang 110016, China (e-mail: congyang81@gmail.com, sungan1412@gmail.com, tangxu@sia.cn). Xiaowei Xu is with the Department of Information Science, University of Arkansas at Little Rock, Arkansas 72204, USA (e-mail: xwxu@ualr.edu).
Abstract

Online metric learning has been widely exploited for large-scale data classification due to its low computational cost. However, in practical online scenarios where the features are evolving (e.g., some features vanish and some new features are augmented), most metric learning models cannot be successfully applied, even though they can handle evolving instances efficiently. To address this challenge, we develop a new online Evolving Metric Learning (EML) model for incremental and decremental features, which can handle instance and feature evolutions simultaneously by incorporating a smoothed Wasserstein metric distance. Specifically, our model contains two essential stages: a Transforming stage (T-stage) and an Inheriting stage (I-stage). In the T-stage, we propose to extract important information from the vanished features while neglecting non-informative knowledge, and forward it into the survived features by transforming them into a low-rank discriminative metric space. This stage further explores the intrinsic low-rank structure of heterogeneous samples, which reduces the computation and memory burden, especially for high-dimensional large-scale data. In the I-stage, we inherit the metric performance of the survived features from the T-stage and then expand it to include the newly augmented features. Moreover, a smoothed Wasserstein distance is utilized to characterize the similarity relationships among heterogeneous and complex samples, since the evolving features are not strictly aligned in the different stages. Besides tackling the challenges of the one-shot case, we also extend our model to the multi-shot scenario. After deriving an efficient optimization strategy for both the T-stage and I-stage, extensive experiments on several datasets verify the superior performance of our EML model.

Index Terms:
Online metric learning, instance and feature evolutions, smoothed Wasserstein distance, low-rank constraint.

I Introduction

Metric learning has been successfully applied in many fields, e.g., face identification [1], object recognition [2] and medical diagnosis [3]. To efficiently handle large-scale streaming data, learning a discriminative metric online (i.e., online metric learning [4, 5]) has attracted considerable attention. Generally, most online metric learning models focus on fast metric updating mechanisms [6, 7, 8, 9] or fast similarity searching strategies [5, 10, 8] for large-scale streaming data, where streaming data denotes a continuous data flow whose samples arrive consecutively in real time.

Figure 1: Illustrative example of feature evolution on the human motion recognition task, where the blue, red and green colors respectively indicate the vanished features collected from the Kinect sensor, the survived features collected from the RGB camera, and the augmented features collected from the motion capture sensor, each with a different lifespan. The vanished features from the Kinect sensor are decremental in the T-stage, the augmented features from the motion capture sensor are incremental in the I-stage, and the survived features from the RGB camera exist in both stages.

However, these existing online metric learning methods [5, 6, 10, 11, 12] only focus on instance evolution and ignore feature evolution, which arises in many real-world applications where some features vanish and some new features are augmented. Take human motion recognition [13] as an example, as depicted in Fig. 1: the sudden damage of a Kinect sensor removes the depth information of human motion, while the emergence of a new motion capture sensor provides auxiliary human skeleton knowledge for motion recognition. This leads to a corresponding decrease and increase in the feature dimensionality of the input data, considered as vanished features and augmented features, respectively. The features collected from the RGB camera, which keeps working throughout, are regarded as survived features. Such feature evolution heavily cripples the human motion recognition performance of a pre-trained model [13]. Another interesting example arises when different sensors (e.g., radioisotope, trace metal and biological sensors [14]) are deployed to monitor dynamic environmental change in full aspects: some sensors expire (vanished features) while new sensors are deployed (augmented features) as electrochemical conditions and lifespans change. A fixed or static online metric learning model fails to take advantage of sensors that evolve in this way. Therefore, how to establish a novel metric learning model that simultaneously handles both instance and feature evolutions in these online practical systems is the main focus of this paper.

To address the above challenges, as illustrated in Fig. 1, we develop a new online Evolving Metric Learning (EML) model for incremental and decremental features, which exploits streaming data with both instance and feature evolutions in an online manner. Specifically, the proposed EML model consists of two significant stages, i.e., a Transforming stage (T-stage) and an Inheriting stage (I-stage). 1) In the T-stage, where features are decremental, we propose to explore the important information and data structure of the vanished features and transform them into a low-rank discriminative metric space of the survived features, which promotes the learning process of the I-stage. Moreover, the T-stage explores the intrinsic low-rank structure of streaming data, which efficiently reduces both memory and computation costs, especially for large-scale samples with high-dimensional features. 2) In the I-stage, where features are incremental, we inherit the metric performance of the survived features from the discriminative metric space learned in the T-stage, and then expand it to the newly augmented features. Furthermore, to better explore the similarity relations among heterogeneous data, a smoothed Wasserstein distance is applied to both stages, since the evolving features are strictly unaligned and heterogeneous across stages. For model optimization, we derive an efficient strategy to solve the formulations of the T-stage and I-stage. Besides, our EML model can be extended from the one-shot scenario to the multi-shot scenario, where the one-shot scenario indicates that the features of streaming data are incremental and decremental only once (as shown in Fig. 2), while the multi-shot scenario denotes that the representations of streaming data are incremental and decremental multiple times (as shown in Fig. 3). Comprehensive experimental results on several datasets strongly support the effectiveness of our proposed EML model.

The main contributions of this paper are summarized as follows:

  • We propose an online Evolving Metric Learning (EML) model for incremental and decremental features to tackle both instance and feature evolutions simultaneously. To the best of our knowledge, this is the first exploration of this crucial but rarely-researched challenge in the metric learning field.

  • We present two stages for both feature and instance evolutions, i.e., a Transforming stage (T-stage) and an Inheriting stage (I-stage), which not only make full use of the vanished features in the T-stage, but also take advantage of streaming data with newly augmented features in the I-stage.

  • A smoothed Wasserstein distance is incorporated into metric learning to characterize the similarity relations of heterogeneous evolving features across different stages. After deriving an alternating direction optimization algorithm for our EML model, extensive experiments on representative datasets validate its superior performance.

II Related Work

This section provides a brief overview of metric learning, followed by some representative methods for feature evolution.

II-A Metric Learning

Online metric learning has been widely explored for instance evolution on large-scale streaming data, and is mainly composed of Mahalanobis distance-based and bilinear similarity-based methods. Among the Mahalanobis distance-based methods, POLA [15] is the first attempt to learn an optimal metric in an online manner. Several variants [5, 10, 16] extend this idea with fast similarity searching strategies, e.g., [8] proposes a regularized online metric learning model with a provable regret bound. Besides, pairwise constraints [8] and triplet constraints [9] are adopted to learn a discriminative metric function; generally, triplet constraints perform better than pairwise constraints [9, 17]. Among the bilinear similarity-based models, OASIS [4] explores a similarity metric for recognition tasks, SOML [18] learns a diagonal matrix for high-dimensional cases under a setting similar to OASIS [4], and [19] presents an online multiple kernel similarity method to tackle multi-modal tasks.

Unfortunately, these recently-proposed online metric learning methods cannot exploit discriminative similarity relations for strictly unaligned heterogeneous data in different evolution stages. To explore heterogeneous relationships among data samples, [11] learns a nonlinear metric to distinguish the foreground boundary from the background for robust visual tracking. Duan et al. [12] design fine-grained localized distance metrics to learn hierarchical nonlinear transformations between heterogeneous samples. Ding et al. [20] introduce a fast low-rank learning mechanism and a representation denoising strategy to build a more robust metric learning framework. Furthermore, [21] proposes a multi-modal distance metric method for image ranking by incorporating both click and visual representations into distance metric learning, and [22] presents a multi-view stochastic learning model with a high-order distance metric to explore modality-specific statistical information. However, these metric methods cannot be applied to the challenging online scenarios where the features evolve due to different sensor lifespans (e.g., some features vanish and some new features are augmented).

II-B Feature Evolution

For feature evolution, under the assumption that samples from both the vanished feature space and the augmented feature space exist in an overlapping period, [23] develops an evolvable feature learning model that reconstructs the vanished features and exploits them along with the newly emerging features for large-scale streaming data. [24] proposes a one-pass incremental and decremental learning model for streaming data, which consists of a compressing stage and an expanding stage; different from [23], [24] assumes overlapping features instead of an overlapping period. Similar to [24], [25] learns the mapping function between two different feature spaces using optimal transport. Furthermore, [26, 27] classify trapezoidal data streams whose features and instances increase doubly; however, the newly emerging samples often share overlapping features with the previously existing samples. [28] develops an incremental feature learning model to tackle the emergence of new activity recognition sensors, encouraging the model to generalize well to the sudden emergence of incremental features.

Among the works discussed above, no feature evolution model is highly related to ours except OPID (OPIDe) [24]. However, there are several key differences between [24] and our EML model: 1) our work is the first attempt to explore both instance and feature evolutions simultaneously via the T-stage and I-stage in the metric learning field; 2) since the evolving features are strictly unaligned across stages, we utilize the smoothed Wasserstein distance to explore the distance relationships among heterogeneous and complex data, rather than the Euclidean distance used in [24]; 3) compared with [24], the low-rank regularizer on the distance matrix effectively learns a discriminative low-rank metric space while neglecting non-informative knowledge for heterogeneous data in different feature evolution stages.

III Evolving Metric Learning (EML)

This section first reviews online metric learning, and then details how to tackle both instance and feature evolutions via our proposed EML model.

III-A Revisit Online Metric Learning

Metric learning focuses on exploring an optimal distance metric matrix in the light of different measure functions, e.g., the Mahalanobis distance function: $d_{M}(x_{p},x_{q})=\sqrt{(x_{p}-x_{q})^{\top}M(x_{p}-x_{q})}$, where $x_{p}\in\mathbb{R}^{d}$ and $x_{q}\in\mathbb{R}^{d}$ are the $p$-th and $q$-th samples, respectively. $M\in\mathbb{R}^{d\times d}$ is a symmetric positive semi-definite matrix, which can be factorized as $L^{\top}L$ [5], where $L\in\mathbb{R}^{r\times d}$ ($r$ denotes the rank of $M$) is the transformation matrix. The Mahalanobis distance between $x_{p}$ and $x_{q}$ can then be rewritten as $d_{L}(x_{p},x_{q})=\left\|L(x_{p}-x_{q})\right\|_{2}^{2}$. Given an online constructed triplet $(x_{p},x_{q},x_{k})$, $L$ can be updated in an online manner via the Passive-Aggressive algorithm [29], i.e.,

$$L_{t}=\arg\min_{L}\frac{1}{2}\left\|L-L_{t-1}\right\|_{F}^{2}+\frac{\gamma}{2}\ell_{L}(x_{p},x_{q},x_{k}), \qquad (1)$$

where $\ell_{L}(x_{p},x_{q},x_{k})=\big[1+d_{L}(x_{p},x_{q})-d_{L}(x_{p},x_{k})\big]_{+}$ is a hinge loss with $[z]_{+}=\max(0,z)$. Here $x_{p}$ and $x_{q}$ belong to the same class, $x_{p}$ and $x_{k}$ belong to different classes, and $\gamma\geq 0$ is the regularization parameter.
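To make the update above concrete, the following NumPy sketch performs one online step of Eq. (1). This is a minimal illustration: it substitutes a single gradient step for the exact Passive-Aggressive closed-form update of [29], and the function names and step size are illustrative.

```python
import numpy as np

def triplet_hinge(L, xp, xq, xk):
    """Hinge loss [1 + d_L(xp, xq) - d_L(xp, xk)]_+ with d_L(x, y) = ||L(x - y)||_2^2."""
    d_pos = np.sum((L @ (xp - xq)) ** 2)
    d_neg = np.sum((L @ (xp - xk)) ** 2)
    return max(0.0, 1.0 + d_pos - d_neg)

def pa_update(L_prev, xp, xq, xk, gamma=0.1, lr=0.01):
    """One online step for Eq. (1): stay close to L_prev while reducing the
    triplet hinge loss. A single gradient step stands in for the exact
    Passive-Aggressive closed form."""
    L = L_prev.copy()
    if triplet_hinge(L, xp, xq, xk) > 0:   # update only when the margin is violated
        u = (xp - xq).reshape(-1, 1)
        v = (xp - xk).reshape(-1, 1)
        # Gradient of d_L(xp,xq) - d_L(xp,xk) w.r.t. L is 2 L (u u^T - v v^T).
        grad = 2 * L @ (u @ u.T - v @ v.T)
        L -= lr * gamma * grad
    return L

# Toy usage: a rank-r metric over d-dimensional features.
rng = np.random.default_rng(0)
d, r = 8, 4
L = rng.standard_normal((r, d)) * 0.1
xp, xq, xk = rng.standard_normal((3, d))
L = pa_update(L, xp, xq, xk)
```

The hinge only triggers an update when the margin constraint is violated, which is what keeps the online update cheap.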

However, most existing online metric learning models only focus on instance evolution with a fixed feature dimensionality, so they cannot be utilized in the feature evolution scenario, i.e., streaming data with incremental and decremental features. Furthermore, they mainly promote the discrimination of the learned distance matrix $L$ by minimizing the squared Mahalanobis distance between similar sample pairs, implicitly assuming that the feature descriptors of these pairs are well aligned in advance. Unfortunately, due to unavoidable factors such as non-linear lighting changes, heavy intensity noise and geometric deformation, this assumption is heavily violated in real-world tasks, especially feature evolution tasks. Therefore, the distance matrix $L$ learned in Eq. (1) is neither applicable nor discriminative enough to explore the similarity relationships between heterogeneous and complex samples whose evolving feature descriptors are not strictly aligned in different evolution stages [30].

III-B The Proposed EML Model

This subsection first introduces how to integrate a smoothed Wasserstein distance into the online metric formulation (i.e., Eq. (1)) to characterize the similarity relations of heterogeneous data under feature evolution in different stages. It then details how to tackle feature evolution via the Transforming stage (T-stage) and Inheriting stage (I-stage) in the one-shot scenario, followed by the extension to the multi-shot scenario.

III-B1 Online Wasserstein Metric Learning

The Wasserstein distance [31] measures the minimum effort of optimally transporting all earth from a source to a target destination. Formally, given two signatures $P=\{(x_{pi},\mu_{pi})\}_{i=1}^{m}$ and $Q=\{(x_{qj},\mu_{qj})\}_{j=1}^{n}$, the smoothed Wasserstein distance [32] between $P$ and $Q$ is:

$$W_{\sigma}(P,Q)=\min_{F\in\mathbb{F}(P,Q)}\left\langle D(P,Q),F\right\rangle-\sigma h(F), \qquad (2)$$
$$\mathrm{s.t.}\ \ \mathbb{F}(P,Q)=\{F\,|\,F\mathbf{1}_{n}=\mu_{p},\ F^{\top}\mathbf{1}_{m}=\mu_{q},\ F\geq 0\},$$

where $D(P,Q)=\{d_{L}(i,j)\}_{i,j=1}^{m,n}\in\mathbb{R}^{m\times n}$, and $d_{L}(i,j)$ denotes the cost of transporting one unit of earth from the source sample $x_{pi}$ to the target sample $x_{qj}$. $F=\{f(i,j)\}_{i,j=1}^{m,n}$ is the flow network matrix, where $f(i,j)$ represents the amount of earth transported from $x_{pi}$ to $x_{qj}$. $\mu_{p}=[\mu_{p1},\cdots,\mu_{pm}]\in\mathbb{R}^{m}$ and $\mu_{q}=[\mu_{q1},\cdots,\mu_{qn}]\in\mathbb{R}^{n}$ are normalized marginal probability mass vectors satisfying $\sum_{i}\mu_{pi}=1$ and $\sum_{j}\mu_{qj}=1$. $\sigma\geq 0$ is a balance parameter, and $h(F)=-\left\langle F,\log(F)\right\rangle$ is the strictly concave entropic function.
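Entropy-smoothed transport of this form is commonly solved with Sinkhorn-style fixed-point iterations; the sketch below evaluates Eq. (2) under the Mahalanobis ground cost introduced next. It is an illustration rather than the paper's exact solver (the paper solves its transport subproblems via [33]); the helper name, iteration count and toy data are assumptions.

```python
import numpy as np

def smoothed_wasserstein(Xp, Xq, L, mu_p, mu_q, sigma=0.1, n_iter=200):
    """Sinkhorn-style evaluation of Eq. (2): entropy-smoothed transport between
    signatures P and Q under ground cost d_L(i, j) = ||L xp_i - L xq_j||_2^2.
    Returns (W_sigma, F)."""
    Zp, Zq = Xp @ L.T, Xq @ L.T                             # project into metric space
    D = ((Zp[:, None, :] - Zq[None, :, :]) ** 2).sum(-1)    # m x n ground costs
    K = np.exp(-D / sigma)                                  # Gibbs kernel
    u = np.ones_like(mu_p)
    for _ in range(n_iter):                                 # Sinkhorn fixed point
        v = mu_q / (K.T @ u)
        u = mu_p / (K @ v)
    F = u[:, None] * K * v[None, :]                         # flow with the right marginals
    entropy = -(F * np.log(F + 1e-30)).sum()
    return (D * F).sum() - sigma * entropy, F

# Toy signatures with uniform masses.
rng = np.random.default_rng(0)
m, n, d, r = 5, 4, 8, 4
L = rng.standard_normal((r, d)) * 0.1
Xp, Xq = rng.standard_normal((m, d)), rng.standard_normal((n, d))
W, F = smoothed_wasserstein(Xp, Xq, L, np.full(m, 1 / m), np.full(n, 1 / n))
```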

In Eq. (2), the Mahalanobis distance is employed as the ground distance of the smoothed Wasserstein distance. Thus, each element $d_{L}(i,j)$ of $D(P,Q)$ represents the squared Mahalanobis distance between the source sample $x_{pi}$ of $P$ and the target sample $x_{qj}$ of $Q$, i.e., $d_{L}(i,j)=\left\|L(x_{pi}-x_{qj})\right\|_{2}^{2}$. The triplet $(P,Q,K)$ is constructed online via [33], where the samples of $P$ and $Q$ belong to the same class, and the samples of $P$ and $K$ belong to different classes. Substituting the Mahalanobis distance in Eq. (1) with the smoothed Wasserstein distance defined in Eq. (2), online Wasserstein metric learning can be formulated as:

$$\min_{L,F}\ \mathcal{L}_{L}(P,Q,K)=\frac{1}{2}\left\|L-L_{t-1}\right\|_{F}^{2}+\frac{\gamma}{2}\ell_{L}(P,Q,K), \qquad (3)$$

where $\ell_{L}(P,Q,K)=[1+W_{\sigma}(P,Q)-W_{\sigma}(P,K)]_{+}$. Compared with the triplet $(x_{p},x_{q},x_{k})$, each signature in $(P,Q,K)$ consists of several samples of the same class rather than a single sample.
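To illustrate how such signature triplets might be assembled online from a labeled batch, here is a hedged NumPy sketch; the exact sampling rule of [33] may differ, and n_per (the per-signature sample count) is an illustrative parameter.

```python
import numpy as np

def build_signature_triplet(X, y, n_per=4, rng=None):
    """Draw signatures (P, Q, K): P and Q are disjoint sample sets from one
    class, K comes from a different class; each carries uniform mass.
    Assumes the positive class holds at least 2 * n_per samples."""
    rng = rng or np.random.default_rng()
    classes = np.unique(y)
    c_pos, c_neg = rng.choice(classes, size=2, replace=False)
    pos = rng.permutation(np.where(y == c_pos)[0])
    neg = rng.permutation(np.where(y == c_neg)[0])
    P, Q, K = X[pos[:n_per]], X[pos[n_per:2 * n_per]], X[neg[:n_per]]
    mass = np.full(n_per, 1.0 / n_per)   # normalized marginal masses
    return (P, mass), (Q, mass), (K, mass)
```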

Figure 2: The illustration of our EML model in one-shot scenario, which evolves instances and features simultaneously via T-stage and I-stage. Different colors denote different kinds of features, e.g., blue, red and green colors denote the vanished, survived and augmented features, respectively. The purple color indicates labels and the number of corresponding samples.

III-B2 Transforming Stage (T-stage) & Inheriting Stage (I-stage)

In the one-shot scenario, where the features of streaming data are incremental and decremental only once, the two essential stages (i.e., T-stage and I-stage) of our proposed EML model for streaming data with feature evolution are elaborated below.

I. Transforming Stage (T-stage): As shown in Fig. 2, suppose that $\{X_{i},Y_{i}\}_{i=1}^{r}$ denotes the streaming data in the T-stage, where $X_{i}=[X_{i}^{v},X_{i}^{s}]\in\mathbb{R}^{n_{i}\times(d_{v}+d_{s})}$ and $Y_{i}\in\mathbb{R}^{n_{i}}$ denote the samples and labels in the $i$-th batch, respectively. $r$ is the total number of batches in the T-stage and $n_{i}$ is the number of samples in the $i$-th batch. Each instance of $X_{i}$ consists of vanished and survived features, where $d_{v}$ and $d_{s}$ are the dimensions of the vanished features $X_{i}^{v}\in\mathbb{R}^{n_{i}\times d_{v}}$ and survived features $X_{i}^{s}\in\mathbb{R}^{n_{i}\times d_{s}}$.

If we directly combine the vanished and survived features to learn a unified metric function, the resulting model cannot be used in the I-stage, where some features have vanished and other new features are augmented. We thus propose to extract important information from the vanished features and forward it into the survived features by exploring a common discriminative metric space. In other words, we aim to train a model using only survived features that characterizes the effective information extracted from both vanished and survived features.

In the $i$-th batch of the T-stage, inspired by [33], the triplet $(P_{i}^{s},Q_{i}^{s},K_{i}^{s})$ for survived features is constructed in an online manner, where the samples of $P_{i}^{s}\in\mathbb{R}^{n_{p}\times d_{s}}$ and $Q_{i}^{s}\in\mathbb{R}^{n_{q}\times d_{s}}$ belong to the same class while the samples of $P_{i}^{s}$ and $K_{i}^{s}\in\mathbb{R}^{n_{k}\times d_{s}}$ belong to different classes; $n_{p},n_{q}$ and $n_{k}$ are the numbers of samples in each signature. Likewise, we construct the triplet $(P_{i}^{a},Q_{i}^{a},K_{i}^{a})$ for all features (containing both vanished and survived features) in the T-stage, where the samples of $P_{i}^{a}\in\mathbb{R}^{n_{p}\times(d_{v}+d_{s})}$ and $Q_{i}^{a}\in\mathbb{R}^{n_{q}\times(d_{v}+d_{s})}$ belong to the same class while the samples of $P_{i}^{a}$ and $K_{i}^{a}\in\mathbb{R}^{n_{k}\times(d_{v}+d_{s})}$ belong to different classes.

Let $L^{s}\in\mathbb{R}^{k\times d_{s}}$ and $L^{a}\in\mathbb{R}^{k\times(d_{v}+d_{s})}$ denote the distance matrices trained on the survived features and on all features (containing both vanished and survived features) in the T-stage. Since the dimensions of $L^{s}$ and $L^{a}$ differ, it is reasonable to impose consistency constraints on these optimal distance matrices so as to extract important information from the vanished features and forward it into the survived features. Based on the smoothed Wasserstein metric learning in Eq. (3), the formulation for the $i$-th batch in the T-stage can be expressed as:

$$\min_{L^{s},L^{a},F}\ \mathcal{L}_{L^{s}}(P_{i}^{s},Q_{i}^{s},K_{i}^{s})+\mathcal{L}_{L^{a}}(P_{i}^{a},Q_{i}^{a},K_{i}^{a})+\rho\,\mathcal{C}_{L^{s},L^{a}}(P_{i}^{s},Q_{i}^{s},K_{i}^{s};P_{i}^{a},Q_{i}^{a},K_{i}^{a})+\lambda\,\mathrm{rank}(L^{s},L^{a}), \qquad (4)$$

where $\mathcal{L}_{L^{s}}(P_{i}^{s},Q_{i}^{s},K_{i}^{s})$ and $\mathcal{L}_{L^{a}}(P_{i}^{a},Q_{i}^{a},K_{i}^{a})$ denote the triplet losses of smoothed Wasserstein metric learning on survived features and on all features, respectively. $\mathrm{rank}(L^{s},L^{a})=\mathrm{rank}(L^{s})+\mathrm{rank}(L^{a})$ is the regularization term that captures the underlying low-rank property of heterogeneous samples. $\rho\geq 0$ and $\lambda\geq 0$ are balance parameters. $\mathcal{C}_{L^{s},L^{a}}(\cdot;\cdot)$ in Eq. (4) enforces the consistency constraint between $L^{s}$ and $L^{a}$, which aims to use only survived features to characterize the effective information extracted from both vanished and survived features.

Specifically, $\mathcal{C}_{L^{s},L^{a}}(\cdot;\cdot)$ constructs an essential triplet loss by applying smoothed Wasserstein metric learning across the two feature spaces, i.e., the survived features and all features. We compute the smoothed Wasserstein distance between heterogeneous distributions built on all features and on survived features. For example, $W_{\sigma}(P_{i}^{a},Q_{i}^{s})$, defined over the ground costs $\{d_{L}(u,v)\}_{u,v=1}^{n_{p},n_{q}}\in\mathbb{R}^{n_{p}\times n_{q}}$, denotes the smoothed Wasserstein distance between $P_{i}^{a}$ from all features and $Q_{i}^{s}$ from survived features, where $d_{L}(u,v)=\left\|L^{a}x_{pu}^{a}-L^{s}x_{qv}^{s}\right\|_{2}^{2}$ is the Mahalanobis distance between the $u$-th source sample $x_{pu}^{a}$ of $P_{i}^{a}$ and the $v$-th target sample $x_{qv}^{s}$ of $Q_{i}^{s}$. Likewise, $W_{\sigma}(P_{i}^{a},K_{i}^{s})$, $W_{\sigma}(P_{i}^{s},Q_{i}^{a})$ and $W_{\sigma}(P_{i}^{s},K_{i}^{a})$ are defined analogously. Formally, the consistency constraint $\mathcal{C}_{L^{s},L^{a}}(\cdot;\cdot)$ is expressed as:

$$\mathcal{C}_{L^{s},L^{a}}(\cdot;\cdot)=\big[W_{\sigma}(P_{i}^{a},Q_{i}^{s})-W_{\sigma}(P_{i}^{a},K_{i}^{s})+1\big]_{+}+\big[W_{\sigma}(P_{i}^{s},Q_{i}^{a})-W_{\sigma}(P_{i}^{s},K_{i}^{a})+1\big]_{+}. \qquad (5)$$

II. Inheriting Stage (I-stage): Suppose that $\{X_{r+1},Y_{r+1}\}$ denotes the data in the $(r{+}1)$-th batch of the I-stage, where $X_{r+1}=[X_{r+1}^{s},X_{r+1}^{n}]\in\mathbb{R}^{n_{r+1}\times(d_{s}+d_{n})}$ are the samples and $Y_{r+1}\in\mathbb{R}^{n_{r+1}}$ the corresponding labels, as shown in Fig. 2. $X_{r+1}^{s}$ and $X_{r+1}^{n}$ represent the survived and new augmented features in the $(r{+}1)$-th batch, $d_{n}$ is the dimension of the new augmented features, and $n_{r+1}$ is the number of samples. The goal of the I-stage is to train on $\{X_{r+1},Y_{r+1}\}$ and make predictions for the $(r{+}2)$-th batch $X_{r+2}=[X_{r+2}^{s},X_{r+2}^{n}]\in\mathbb{R}^{n_{r+2}\times(d_{s}+d_{n})}$, whose number of samples equals that of $X_{r+1}\in\mathbb{R}^{n_{r+1}\times(d_{s}+d_{n})}$.

To classify the $(r{+}2)$-th batch, we propose to inherit the metric performance of the optimal distance matrix $L^{s}$ learned on survived features in the T-stage, since a set of common survived features exists in both the T-stage and I-stage. Although we could construct triplets directly from the $(r{+}1)$-th batch for training, this trivial strategy has two significant shortcomings: 1) the trained metric model is difficult to extend to the multi-shot scenario; 2) a metric model learned only on the $(r{+}1)$-th batch would predict worse because it makes no use of the data in the T-stage.

To this end, we adopt a stacking strategy similar to [34, 35], which form linear combinations of different predictors to train a unified classifier with improved prediction accuracy. In contrast, we concatenate all feature descriptors as in stacking and train a unified predictor on the stacked features, which better inherits the metric performance learned in the T-stage. Concretely, let $Z_{r+1}^{s}=X_{r+1}^{s}(L^{s})^{\top}\in\mathbb{R}^{n_{r+1}\times k}$ be the transformed discriminative metric space, regarded as the new representation of $X_{r+1}^{s}$ for stacking. $X_{r+1}$ is then represented as $Z_{r+1}=[Z_{r+1}^{s},X_{r+1}^{n}]\in\mathbb{R}^{n_{r+1}\times(k+d_{n})}$; likewise, $X_{r+2}$ is characterized as $Z_{r+2}$. Furthermore, we learn an optimal distance matrix $L^{z}\in\mathbb{R}^{k\times(k+d_{n})}$ on $Z_{r+1}$ with the online constructed triplet $(P_{r+1}^{z},Q_{r+1}^{z},K_{r+1}^{z})$ and evaluate the performance on $Z_{r+2}$, where the samples of $P_{r+1}^{z}$ and $Q_{r+1}^{z}$ belong to the same class while the samples of $P_{r+1}^{z}$ and $K_{r+1}^{z}$ belong to different classes. Formally, at the $t$-th iterative step, the objective function for learning $L^{z}\in\mathbb{R}^{k\times(k+d_{n})}$ in the I-stage can be formulated as:

$$\min_{L^{z},F}\ \frac{1}{2}\left\|L^{z}-L_{t-1}^{z}\right\|_{F}^{2}+\lambda\,\mathrm{rank}(L^{z})+\frac{\gamma}{2}\big[W_{\sigma}(P_{r+1}^{z},Q_{r+1}^{z})-W_{\sigma}(P_{r+1}^{z},K_{r+1}^{z})+1\big]_{+}, \qquad (6)$$

where $\gamma\geq 0$ and $\lambda\geq 0$ are balance parameters. In our experiments, $\lambda$ and $\gamma$ in Eq. (4) and Eq. (6) are set to the same values for simplicity. $\mathrm{rank}(L^{z})$ is the regularization term that explores the intrinsic low-rank structure of heterogeneous samples in the I-stage.
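The stacking step that feeds Eq. (6) is simple to state in code: project the survived features through the T-stage metric and concatenate the augmented features. A minimal sketch (function names are illustrative):

```python
import numpy as np

def stack_features(X_surv, X_new, L_s):
    """I-stage stacking: map survived features through the T-stage metric
    (Z^s = X^s (L^s)^T) and concatenate the new augmented features,
    giving Z = [Z^s, X^n] of width k + d_n."""
    Z_s = X_surv @ L_s.T
    return np.hstack([Z_s, X_new])

# Shapes: L_s is k x d_s, so Z is n x (k + d_n).
rng = np.random.default_rng(0)
n, d_s, d_n, k = 10, 6, 3, 4
L_s = rng.standard_normal((k, d_s))
Z = stack_features(rng.standard_normal((n, d_s)), rng.standard_normal((n, d_n)), L_s)
assert Z.shape == (n, k + d_n)
```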

Figure 3: The illustration of our EML model in the multi-shot scenario when $M=2$, where Stage 1 and Stage 2 share the survived features, and Stage 2 and Stage 3 share the new augmented features. Specifically, our proposed model regards Stage 1 and Stage 2 as T-stage and I-stage for the first feature evolution, and considers Stage 2 and Stage 3 as T-stage and I-stage for the second feature evolution.

III-B3 Multi-shot Scenario

Different from the one-shot scenario, the features of streaming data in the multi-shot scenario are incremental and decremental $M$ times. This subsection extends our model from the one-shot case to the multi-shot scenario; an illustration for $M=2$ is depicted in Fig. 3. Specifically, $\{X_{i},Y_{i}\}_{i=1}^{R/2}$ denotes the streaming data in Stage 1, where $X_{i}=[X_{i}^{v},X_{i}^{s}]\in\mathbb{R}^{n_{i}\times(d_{v}+d_{s})}$ and $Y_{i}\in\mathbb{R}^{n_{i}}$ respectively represent the samples and labels in the $i$-th batch, $n_{i}$ is the number of samples in the $i$-th batch, and $R/2$ is the total number of batches in Stage 1. When the streaming data $\{X_{i},Y_{i}\}_{i=R/2+1}^{R}$ in Stage 2 arrive, the features evolve for the first time (i.e., some features vanish and some new features are augmented), where $X_{i}=[X_{i}^{s},X_{i}^{n}]\in\mathbb{R}^{n_{i}\times(d_{s}+d_{n})}$. Moreover, in Stage 3, the streaming data $\{X_{R+1},Y_{R+1}\}$ undergo feature evolution for the second time, and we predict the results of our proposed EML model on the $(R{+}2)$-th batch $X_{R+2}$, where $X_{R+1}=[X_{R+1}^{s},X_{R+1}^{n}]$ and $X_{R+2}=[X_{R+2}^{s},X_{R+2}^{n}]$. Note that there are overlapping feature representations between any two adjacent stages. For example, as presented in Fig. 3, the survived features in Stage 1 are regarded as the vanished features in Stage 2, and the augmented features in Stage 2 are considered the survived features in Stage 3. Therefore, there are multiple Transforming stages (T-stage) and Inheriting stages (I-stage) in the multi-shot scenario. Specifically, our proposed model first regards Stage 1 and Stage 2 as T-stage and I-stage for the first feature evolution, and then considers Stage 2 and Stage 3 as T-stage and I-stage for the second feature evolution. Generally, in the multi-shot scenario, we have two essential learning tasks:

  • Task I: Similar to the prediction task in the one-shot case, we aim to classify the testing data $X_{R+2}$ in Stage 3 by training our proposed model on the previous $R+1$ batches of streaming data $\{X_{i},Y_{i}\}_{i=1}^{R+1}$.

  • Task II: Different from the prediction task in the one-shot scenario, we attempt to make predictions for all stages (i.e., Stage 1, Stage 2 and Stage 3 when $M=2$) by training our proposed model on the streaming data $\{X_{i},Y_{i}\}_{i=1}^{R+1}$ from all stages.

IV Model Optimization

This section presents an alternating optimization strategy to update our proposed EML model across the two stages, i.e., the T-stage and I-stage, followed by a computational complexity analysis of our model. The whole optimization procedure is summarized in Algorithm 1.

Note that the low-rank minimization in Eq. (4) and Eq. (6) is a well-known NP-hard problem. Taking $L^{z}$ as an example, $\mathrm{rank}(L^{z})$ in Eq. (6) can be effectively surrogated by the trace norm $\left\|L^{z}\right\|_{*}$. Different from traditional Singular Value Thresholding (SVT) [36], we employ a regularization term to guarantee the low-rank property, i.e., $\left\|L^{z}\right\|_{*}=\mathrm{tr}\big(({L^{z}}^{\top}L^{z})^{1/2}\big)=\mathrm{tr}\big({L^{z}}^{\top}(L^{z}{L^{z}}^{\top})^{-1/2}L^{z}\big)$. As a result, $\mathrm{rank}(L^{z})$ in Eq. (6) can be formulated as $\mathrm{tr}({L^{z}}^{\top}H^{z}L^{z})$, where $H^{z}=(L^{z}{L^{z}}^{\top})^{-1/2}$. The low-rank optimization of $L^{a}$ and $L^{s}$ shares the same strategy: $\mathrm{rank}(L^{a})$ and $\mathrm{rank}(L^{s})$ are surrogated by $\mathrm{tr}(L^{a}H^{a}{L^{a}}^{\top})$ and $\mathrm{tr}(L^{s}H^{s}{L^{s}}^{\top})$, respectively, where $H^{a}=(L^{a}{L^{a}}^{\top})^{-1/2}$ and $H^{s}=(L^{s}{L^{s}}^{\top})^{-1/2}$.
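The surrogate hinges on the inverse matrix square root of the Gram matrix. Below is a small NumPy sketch computing $H$ via an eigendecomposition (a standard route; the paper does not prescribe one) and checking that $\mathrm{tr}(L^{\top}HL)$ recovers the trace norm. The function name and epsilon guard are illustrative.

```python
import numpy as np

def inv_sqrt_gram(L, eps=1e-8):
    """H = (L L^T)^{-1/2} via eigendecomposition of the symmetric Gram matrix;
    eps guards near-zero eigenvalues. Then tr(L^T H L) surrogates ||L||_*."""
    G = L @ L.T
    w, V = np.linalg.eigh(G)
    return V @ np.diag(1.0 / np.sqrt(np.maximum(w, eps))) @ V.T

rng = np.random.default_rng(0)
L = rng.standard_normal((4, 8))
H = inv_sqrt_gram(L)
# Sanity check: tr(L^T H L) equals the trace (nuclear) norm of L.
assert np.isclose(np.trace(L.T @ H @ L), np.linalg.svd(L, compute_uv=False).sum())
```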

Algorithm 1 The Optimization of Our Proposed EML Model
Input: The data $\{X_{i},Y_{i}\}_{i=1}^{r+1}$, the parameters $\gamma,\lambda,\rho$;
Output: $L^{s}$ and $L^{z}$;
1:  Initialize: $L^{s},L^{a},L^{z},F$;
2:  Transforming stage (T-stage):
3:  for $i=1,\ldots,r$ do
4:     Calculate the smoothed Wasserstein distance for data $X_{i}$, and construct the triplets for training;
5:     repeat
6:        Solve $F$ with $L^{a}$ and $L^{s}$ fixed;
7:        Update $L^{a}$ via Eq. (8);
8:        Update $L^{s}$ via Eq. (10);
9:        Update $H^{a}$ and $H^{s}$ via $H^{a}=(L^{a}{L^{a}}^{\top})^{-1/2}$ and $H^{s}=(L^{s}{L^{s}}^{\top})^{-1/2}$;
10:     until converged
11:  end for
12:  Inheriting stage (I-stage):
13:  Transform $X_{r+1}$ into $Z_{r+1}$ to calculate the smoothed Wasserstein distance, and construct the training triplets;
14:  repeat
15:     Solve the flow network $F$ with $L^{z}$ fixed;
16:     Update $L^{z}$ via Eq. (12);
17:     Update $H^{z}$ via $H^{z}=(L^{z}{L^{z}}^{\top})^{-1/2}$;
18:  until converged

IV-A Optimizing T-stage via an Alternating Strategy

IV-A1 Updating $L^{a}$ by fixing $\{L^{s},H^{a},F\}$

When fixing the variables $L^{s},H^{a}$ and $F$, the optimization problem in Eq. (4) for solving $L^{a}$ can be concretely expressed as:

$$L_{t}^{a}=\arg\min_{L^{a}}\ \frac{1}{2}\left\|L^{a}-L_{t-1}^{a}\right\|_{F}^{2}+\lambda\,\mathrm{tr}(L^{a}H^{a}{L^{a}}^{\top})+\frac{\gamma}{2}\big[\mathrm{tr}\big(D(P_{i}^{a},Q_{i}^{a})F\big)-\mathrm{tr}\big(D(P_{i}^{a},K_{i}^{a})F\big)+1\big]_{+}$$
$$+\frac{\rho}{2}\big[\mathrm{tr}\big(D(P_{i}^{a},Q_{i}^{s})F\big)-\mathrm{tr}\big(D(P_{i}^{a},K_{i}^{s})F\big)+1\big]_{+}+\frac{\rho}{2}\big[\mathrm{tr}\big(D(P_{i}^{s},Q_{i}^{a})F\big)-\mathrm{tr}\big(D(P_{i}^{s},K_{i}^{a})F\big)+1\big]_{+}. \qquad (7)$$

The optimal solution $L_{t}^{a}$ can be obtained in a relaxed manner by nulling the gradient of Eq. (7):

$$L_{t}^{a}=\big(L_{t-1}^{a}-\rho L_{t}^{s}(G_{3}+G_{4})\big)\big(\mathrm{I}+\lambda H^{a}+\gamma G_{1}+\rho G_{2}\big)^{-1}, \qquad (8)$$

where
$G_{1}={Q_{i}^{a}}^{\top}\mathrm{diag}(\mathbf{1}^{\top}F)Q_{i}^{a}-{K_{i}^{a}}^{\top}\mathrm{diag}(\mathbf{1}^{\top}F)K_{i}^{a}-{P_{i}^{a}}^{\top}FQ_{i}^{a}-{Q_{i}^{a}}^{\top}F^{\top}P_{i}^{a}+{P_{i}^{a}}^{\top}FK_{i}^{a}+{K_{i}^{a}}^{\top}F^{\top}P_{i}^{a}$,
$G_{2}={P_{i}^{a}}^{\top}\mathrm{diag}(F\mathbf{1})P_{i}^{a}-{K_{i}^{a}}^{\top}\mathrm{diag}(\mathbf{1}^{\top}F)K_{i}^{a}$,
$G_{3}={K_{i}^{s}}^{\top}FP_{i}^{a}-{Q_{i}^{s}}^{\top}F^{\top}P_{i}^{a}$, and
$G_{4}={P_{i}^{s}}^{\top}FK_{i}^{a}-{K_{i}^{s}}^{\top}F^{\top}P_{i}^{a}$.

IV-A2 Updating $L^{s}$ by fixing $\{L^{a},H^{s},F\}$

With the obtained distance matrix $L^{a}$ and flow matrix $F$, the optimization problem for $L^{s}$ in Eq. (4) can be formulated as:

$$L_{t}^{s}=\arg\min_{L^{s}}\ \frac{1}{2}\left\|L^{s}-L_{t-1}^{s}\right\|_{F}^{2}+\lambda\,\mathrm{tr}(L^{s}H^{s}{L^{s}}^{\top})+\frac{\gamma}{2}\big[\mathrm{tr}\big(D(P_{i}^{s},Q_{i}^{s})F\big)-\mathrm{tr}\big(D(P_{i}^{s},K_{i}^{s})F\big)+1\big]_{+}$$
$$+\frac{\rho}{2}\big[\mathrm{tr}\big(D(P_{i}^{a},Q_{i}^{s})F\big)-\mathrm{tr}\big(D(P_{i}^{a},K_{i}^{s})F\big)+1\big]_{+}+\frac{\rho}{2}\big[\mathrm{tr}\big(D(P_{i}^{s},Q_{i}^{a})F\big)-\mathrm{tr}\big(D(P_{i}^{s},K_{i}^{a})F\big)+1\big]_{+}. \qquad (9)$$

Concretely, the updating operator for $L_{t}^{s}$ is given as:

$$L_{t}^{s}=\big(L_{t-1}^{s}-\rho L_{t}^{a}(G_{6}+G_{8})\big)\big(\mathrm{I}+\lambda H^{s}+\gamma G_{5}+\rho G_{7}\big)^{-1}, \qquad (10)$$

where
$G_{5}={Q_{i}^{s}}^{\top}\mathrm{diag}(\mathbf{1}^{\top}F)Q_{i}^{s}-{K_{i}^{s}}^{\top}\mathrm{diag}(\mathbf{1}^{\top}F)K_{i}^{s}+{P_{i}^{s}}^{\top}FK_{i}^{s}+{K_{i}^{s}}^{\top}F^{\top}P_{i}^{s}-{P_{i}^{s}}^{\top}FQ_{i}^{s}-{Q_{i}^{s}}^{\top}F^{\top}P_{i}^{s}$,
$G_{6}={P_{i}^{a}}^{\top}FK_{i}^{s}-{P_{i}^{a}}^{\top}FQ_{i}^{s}$,
$G_{7}={Q_{i}^{s}}^{\top}\mathrm{diag}(F\mathbf{1})Q_{i}^{s}-{K_{i}^{s}}^{\top}\mathrm{diag}(\mathbf{1}^{\top}F)K_{i}^{s}$, and
$G_{8}={K_{i}^{a}}^{\top}F^{\top}P_{i}^{s}-{Q_{i}^{a}}^{\top}F^{\top}P_{i}^{s}$.

IV-A3 Updating $F$ by fixing $\{L^{a},L^{s}\}$

When the distance matrices $L^{a}$ and $L^{s}$ are fixed, Eq. (4) splits into several independent traditional smoothed Wasserstein distance subproblems, which can be solved by [33]. We omit the detailed derivation for brevity.

IV-B Optimizing I-stage via an Alternating Strategy

IV-B1 Updating $L^{z}$ by fixing $\{H^{z},F\}$

Given fixed $H^{z}$ and $F$, the formulation for $L^{z}$ in Eq. (6) is rewritten as:

$$L_{t}^{z}=\arg\min_{L^{z}}\ \frac{1}{2}\left\|L^{z}-L_{t-1}^{z}\right\|_{F}^{2}+\lambda\,\mathrm{tr}(L^{z}H^{z}{L^{z}}^{\top})+\frac{\gamma}{2}\big[\mathrm{tr}\big(D(P_{r+1}^{z},Q_{r+1}^{z})F\big)-\mathrm{tr}\big(D(P_{r+1}^{z},K_{r+1}^{z})F\big)+1\big]_{+}. \qquad (11)$$

By nulling the gradient of Eq. (11), the optimal solution for $L^{z}$ is given as:

$$L_{t}^{z}=L_{t-1}^{z}\big(\mathrm{I}+\lambda H^{z}+\gamma G_{9}\big)^{-1}, \qquad (12)$$

where $G_{9}={Q_{r+1}^{z}}^{\top}\mathrm{diag}(\mathbf{1}^{\top}F)Q_{r+1}^{z}+{P_{r+1}^{z}}^{\top}FK_{r+1}^{z}-{K_{r+1}^{z}}^{\top}\mathrm{diag}(\mathbf{1}^{\top}F)K_{r+1}^{z}+{K_{r+1}^{z}}^{\top}F^{\top}P_{r+1}^{z}-{P_{r+1}^{z}}^{\top}FQ_{r+1}^{z}-{Q_{r+1}^{z}}^{\top}F^{\top}P_{r+1}^{z}$.
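Assembled in code, one I-stage step of Eq. (12) might look as follows. This sketch assumes equal-sized signatures sharing one flow matrix $F$, as the notation above suggests, and forms $H^{z}$ in the $(k+d_{n})$-dimensional space (i.e., $({L^{z}}^{\top}L^{z})^{-1/2}$) so that the right-multiplication is well defined; these are labeled assumptions, not details fixed by the paper.

```python
import numpy as np

def compute_G9(Pz, Qz, Kz, F):
    """G9 from the 'where' clause of Eq. (12), assuming equal-sized signatures
    (each n x (k + d_n)) sharing a single flow matrix F (n x n)."""
    col = np.diag(F.sum(axis=0))          # diag(1^T F)
    return (Qz.T @ col @ Qz - Kz.T @ col @ Kz
            + Pz.T @ F @ Kz + Kz.T @ F.T @ Pz
            - Pz.T @ F @ Qz - Qz.T @ F.T @ Pz)

def update_Lz(L_prev, G9, lam=1e-4, gamma=1e-2):
    """One closed-form I-stage step, Eq. (12):
    L^z_t = L^z_{t-1} (I + lambda H^z + gamma G9)^{-1}.
    Assumption: H^z is formed as ((L^z)^T L^z)^{-1/2} in the (k + d_n)-dim
    space so it matches the size of G9; eps guards near-zero eigenvalues."""
    G = L_prev.T @ L_prev
    w, V = np.linalg.eigh(G)
    Hz = V @ np.diag(1.0 / np.sqrt(np.maximum(w, 1e-8))) @ V.T
    A = np.eye(L_prev.shape[1]) + lam * Hz + gamma * G9
    return L_prev @ np.linalg.inv(A)
```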

IV-B2 Updating $F$ by fixing $L^{z}$

The optimization of $F$ in the I-stage is the same as in the T-stage: with $L^{z}$ fixed, Eq. (6) splits into several independent traditional smoothed Wasserstein distance subproblems, and we solve $F$ via [33].

Figure 4: The examples of human motions in EV-Action dataset, where the first, second and third rows denote the samples collected from RGB camera, Kinect sensor and motion capture sensor, respectively.
TABLE I: The experimental settings in one-shot scenario.
Datasets | c | $\sum_{i=1}^{r}n_{i}$ | $n_{i}$ | $d_{v}$ | $d_{s}$ | $d_{n}$
EV-Action | 20 | 4200 | 500, 600, 700 | 1024 | 1024 | 75
Mnist0vs5 | 2 | 3200 | 80, 160, 320 | 114 | 228 | 113
Mnist0vs3vs5 | 3 | 4800 | 120, 240, 480 | 123 | 245 | 121
Splice | 2 | 2240 | 80, 160, 320 | 10 | 40 | 10
Gisette | 2 | 6000 | 100, 200, 300 | 1239 | 2478 | 1238
USPS0vs5 | 2 | 960 | 120, 160, 240 | 64 | 128 | 64
USPS0vs3vs5 | 3 | 1440 | 180, 240, 300 | 64 | 128 | 64
Satimage | 3 | 1080 | 60, 90, 120 | 10 | 18 | 8
ImageNet | 1000 | 1200000 | 10000, 12000, 14000 | 512 | 1024 | 512
PAMAP2 | 18 | 7200 | 600, 700, 800 | 81 | 162 | 81
TABLE II: Comparisons between our model and state-of-the-art methods in terms of accuracy (%) on ten datasets: mean and standard errors averaged over fifty random runs in one-shot scenario. Models with the best performance are bolded.
Dataset | $n_{i}$ | Pegasos [37] | OPMV [38] | TCA [39] | BDML [2] | OPML [7] | CDML [40] | OPIDe [24] | OPID [24] | FIRF [28] | Ours
EV-Action | 500 | 57.38±1.51 | 56.37±1.91 | 53.88±2.04 | 56.42±0.71 | 54.10±1.71 | 55.08±0.83 | 57.84±1.06 | 57.57±1.08 | 57.13±0.84 | 58.87±0.68
EV-Action | 600 | 57.46±1.60 | 56.94±1.82 | 54.61±1.73 | 56.81±0.65 | 55.37±1.64 | 55.92±1.03 | 57.22±0.95 | 56.71±1.40 | 56.92±1.25 | 58.65±0.84
EV-Action | 700 | 57.22±1.34 | 56.68±1.87 | 54.37±1.69 | 56.63±0.77 | 55.82±1.62 | 56.22±0.71 | 57.09±1.13 | 56.85±1.27 | 57.23±1.16 | 58.32±0.82
Mnist0vs5 | 80 | 97.74±0.73 | 97.39±0.92 | 96.53±1.75 | 97.00±1.66 | 96.45±1.72 | 96.75±1.32 | 98.68±0.88 | 98.88±0.99 | 98.14±0.87 | 99.85±0.91
Mnist0vs5 | 160 | 98.11±1.03 | 95.82±1.84 | 93.08±2.94 | 98.25±0.80 | 96.83±1.38 | 97.04±0.58 | 97.94±0.97 | 98.75±0.90 | 96.79±1.52 | 99.78±0.57
Mnist0vs5 | 320 | 97.68±0.79 | 96.47±1.79 | 92.43±3.82 | 98.24±0.75 | 96.98±1.03 | 97.16±0.85 | 97.38±0.58 | 97.21±0.66 | 96.83±1.37 | 99.27±0.37
Mnist0vs3vs5 | 120 | 91.47±3.92 | 95.87±1.82 | 91.26±3.87 | 92.23±2.86 | 92.42±2.22 | 92.66±1.49 | 94.58±1.78 | 94.97±1.30 | 95.03±0.83 | 96.91±1.38
Mnist0vs3vs5 | 240 | 89.95±3.08 | 93.96±1.18 | 90.85±1.74 | 92.87±1.40 | 91.99±1.64 | 92.47±1.31 | 93.45±1.41 | 93.48±1.35 | 94.24±1.13 | 95.37±0.92
Mnist0vs3vs5 | 480 | 90.12±1.93 | 93.28±1.69 | 91.14±3.95 | 93.21±1.06 | 92.74±1.17 | 93.04±0.96 | 93.30±0.86 | 93.37±0.79 | 93.85±0.95 | 95.54±0.87
Splice | 80 | 79.65±4.13 | 80.13±3.86 | 76.93±4.52 | 65.65±5.53 | 69.60±4.38 | 68.85±2.27 | 81.22±3.73 | 80.50±3.53 | 79.83±2.55 | 82.65±3.32
Splice | 160 | 82.25±3.26 | 81.95±2.84 | 80.93±3.47 | 71.55±4.07 | 78.21±2.53 | 75.85±2.65 | 84.00±2.03 | 83.91±2.05 | 82.06±1.91 | 85.25±2.06
Splice | 320 | 82.32±3.18 | 78.72±4.37 | 81.53±3.38 | 72.16±3.40 | 80.86±2.01 | 78.93±1.17 | 85.55±1.32 | 85.94±1.38 | 83.69±1.73 | 87.03±1.52
Gisette | 100 | 97.53±1.33 | 95.27±2.85 | 94.11±3.35 | 90.25±3.13 | 94.17±3.02 | 93.71±2.39 | 97.14±1.28 | 97.56±1.26 | 94.21±0.96 | 97.29±1.25
Gisette | 200 | 95.14±2.97 | 94.05±3.36 | 93.03±3.16 | 91.50±1.25 | 93.61±3.19 | 92.68±1.72 | 95.59±0.95 | 95.39±1.06 | 93.76±0.79 | 96.82±0.91
Gisette | 300 | 96.84±1.35 | 93.71±3.11 | 94.37±3.72 | 93.83±2.12 | 93.77±2.96 | 93.24±1.56 | 96.36±0.69 | 95.33±0.93 | 94.18±1.06 | 97.89±0.43
USPS0vs5 | 120 | 98.52±1.67 | 95.27±2.67 | 96.42±1.81 | 95.90±1.65 | 93.72±2.32 | 94.74±2.46 | 96.17±1.44 | 96.51±1.25 | 95.85±1.33 | 97.23±1.64
USPS0vs5 | 160 | 97.84±0.82 | 95.65±1.72 | 95.46±2.13 | 96.38±1.23 | 93.04±4.05 | 95.21±1.57 | 96.78±1.31 | 96.93±1.00 | 95.75±1.12 | 98.91±0.67
USPS0vs5 | 240 | 97.93±0.72 | 96.17±1.28 | 95.85±2.07 | 96.78±1.18 | 93.62±3.01 | 95.62±1.83 | 94.93±1.28 | 95.06±1.10 | 93.72±0.93 | 98.94±0.70
USPS0vs3vs5 | 180 | 94.68±1.20 | 92.46±1.07 | 93.88±1.37 | 90.62±2.48 | 92.06±1.64 | 91.85±1.62 | 94.47±1.77 | 94.13±1.92 | 94.63±1.45 | 95.73±0.88
USPS0vs3vs5 | 240 | 94.39±1.09 | 91.69±2.31 | 92.94±1.58 | 91.48±1.68 | 91.23±1.73 | 91.73±1.24 | 92.08±1.93 | 92.50±1.66 | 93.36±2.07 | 95.52±1.26
USPS0vs3vs5 | 300 | 95.47±0.94 | 92.25±1.60 | 93.26±1.44 | 92.13±1.09 | 91.60±1.71 | 92.07±1.36 | 92.95±1.12 | 92.67±1.46 | 93.18±1.54 | 94.05±1.46
Satimage | 60 | 94.25±2.56 | 96.48±1.47 | 97.25±1.08 | 97.14±1.59 | 97.47±1.59 | 97.39±1.46 | 98.17±2.19 | 97.60±2.31 | 97.92±2.05 | 99.20±0.91
Satimage | 90 | 96.49±1.49 | 96.83±1.18 | 96.52±1.32 | 97.62±1.52 | 97.69±1.16 | 97.84±1.31 | 98.58±1.12 | 97.29±2.08 | 98.16±1.85 | 99.71±1.06
Satimage | 120 | 98.03±1.13 | 97.38±1.94 | 97.12±1.87 | 97.12±1.48 | 97.15±1.49 | 97.22±1.63 | 98.45±1.14 | 96.85±1.94 | 97.24±1.36 | 99.52±1.07
ImageNet | 10000 | 55.28±1.83 | 51.03±2.58 | 50.44±3.15 | 52.49±3.14 | 52.74±2.54 | 52.15±2.71 | 55.63±1.22 | 55.70±2.03 | 53.94±2.05 | 56.47±1.57
ImageNet | 12000 | 56.37±1.75 | 50.24±2.39 | 50.83±2.96 | 52.68±2.33 | 52.94±1.87 | 52.06±2.64 | 55.94±1.83 | 56.31±2.33 | 54.82±1.77 | 57.83±1.93
ImageNet | 14000 | 58.04±2.38 | 51.61±3.52 | 50.62±2.74 | 53.02±3.14 | 53.73±2.19 | 52.64±2.37 | 56.85±1.52 | 57.06±1.84 | 54.79±2.29 | 59.17±1.84
PAMAP2 | 600 | 91.64±1.08 | 89.85±1.33 | 85.73±1.84 | 87.67±1.74 | 86.23±1.81 | 87.92±2.16 | 92.17±0.93 | 92.64±1.05 | 93.56±0.84 | 95.27±0.71
PAMAP2 | 700 | 91.85±1.15 | 90.14±1.29 | 86.04±2.03 | 88.06±2.20 | 87.94±1.57 | 88.73±1.91 | 91.85±1.28 | 92.39±1.04 | 93.28±1.13 | 95.46±0.85
PAMAP2 | 800 | 91.57±0.89 | 90.25±1.56 | 85.49±2.75 | 88.75±2.06 | 88.13±1.90 | 89.37±1.68 | 93.05±0.88 | 93.62±0.93 | 93.84±0.79 | 95.66±0.94

IV-C Computational Complexity Analysis

The main computational cost of our EML model involves the updating operations in both the T-stage and I-stage. Specifically, in the T-stage, the computational costs of updating $L^{s}$ and $L^{a}$ are $O\big(kd_{s}+k(d_{v}+d_{s})^{2}d_{s}+d_{s}^{3}\big)$ and $O\big(k(d_{s}+d_{v})+kd_{s}^{2}(d_{v}+d_{s})+(d_{v}+d_{s})^{3}\big)$, respectively. For the I-stage, solving $L^{z}$ in Eq. (6) takes $O\big((k+d_{n})^{3}\big)$. Besides, the computational cost of solving $F$ in both stages is $O(n_{p}^{2}n_{q}^{2}+n_{p}^{2}n_{k}^{2}+n_{q}^{2}n_{k}^{2})$, where $n_{p},n_{q},n_{k}\ll n_{i}$. Compared with the feature dimension and sample number, $k$ is typically small, and thus our proposed model is efficient to optimize in an online manner.

V Experiments

This section first presents the detailed experimental configurations and competing methods. Then the experimental performance, along with analyses of our EML model in both one-shot and multi-shot cases, is provided.

TABLE III: Ablation study of our EML model in one-shot scenario.
Dataset | $n_{i}$ | Ours-woT | Ours-woI | Ours-woW | Ours
EV-Action | 500 | 56.68±1.74 | 54.36±1.61 | 57.93±0.85 | 58.33±0.76
EV-Action | 600 | 56.23±1.81 | 55.70±1.49 | 57.70±1.04 | 57.94±0.88
EV-Action | 700 | 57.02±1.56 | 55.93±1.76 | 57.83±0.92 | 58.12±0.86
Mnist0vs5 | 80 | 97.85±1.24 | 96.70±1.71 | 98.90±0.97 | 99.07±0.94
Mnist0vs5 | 160 | 97.54±1.46 | 96.84±1.85 | 98.87±1.06 | 99.22±0.61
Mnist0vs5 | 320 | 97.23±3.34 | 96.88±0.96 | 98.95±0.83 | 99.27±0.37
Mnist0vs3vs5 | 120 | 94.55±1.48 | 92.78±2.11 | 96.02±1.85 | 96.53±1.49
Mnist0vs3vs5 | 240 | 93.49±1.07 | 92.88±1.31 | 94.88±1.37 | 95.37±0.92
Mnist0vs3vs5 | 480 | 94.32±0.81 | 93.37±1.13 | 95.13±1.22 | 95.54±0.87
Splice | 80 | 81.58±3.10 | 70.83±4.47 | 82.45±3.38 | 82.65±3.32
Splice | 160 | 84.07±2.51 | 78.87±3.01 | 84.87±2.19 | 85.25±2.06
Splice | 320 | 84.85±2.38 | 81.56±1.99 | 85.94±1.61 | 86.40±1.59
Gisette | 100 | 95.22±1.30 | 92.47±1.68 | 96.84±1.40 | 97.29±1.25
Gisette | 200 | 94.38±1.52 | 92.96±1.75 | 96.27±1.53 | 96.82±0.91
Gisette | 300 | 96.11±0.95 | 95.08±1.19 | 97.14±0.87 | 97.79±0.46
USPS0vs5 | 120 | 95.42±1.82 | 94.82±2.02 | 96.26±1.33 | 97.23±1.64
USPS0vs5 | 160 | 96.04±1.33 | 94.95±1.70 | 97.03±1.47 | 98.31±0.82
USPS0vs5 | 240 | 96.35±1.06 | 95.17±1.16 | 97.24±0.96 | 98.87±0.74
USPS0vs3vs5 | 180 | 93.36±1.77 | 91.97±2.00 | 94.86±1.17 | 95.28±0.96
USPS0vs3vs5 | 240 | 93.13±1.38 | 92.01±1.45 | 94.33±1.54 | 94.96±1.37
USPS0vs3vs5 | 300 | 92.99±1.35 | 91.81±1.67 | 93.47±1.83 | 94.05±1.46
Satimage | 60 | 96.50±1.59 | 97.43±1.36 | 98.31±1.10 | 98.97±0.95
Satimage | 90 | 96.78±2.72 | 97.31±1.10 | 98.19±1.16 | 98.71±1.13
Satimage | 120 | 96.22±1.91 | 97.23±1.22 | 98.02±1.22 | 98.53±1.20

V-A Configurations and Competing Methods

This subsection details the experimental configurations of our EML model in the one-shot scenario and the competing methods.

Figure 5: Effect investigations of hyper-parameters $\{\gamma,\rho\}$ when $\lambda=10^{-4}$ and $\{\gamma,\lambda\}$ when $\rho=10^{-3}$ on Mnist0vs3vs5 ($n_{i}=480$, panels (a)(b)), USPS0vs5 ($n_{i}=240$, (c)(d)), EV-Action ($n_{i}=600$, (e)(f)), PAMAP2 ($n_{i}=700$, (g)(h)), Splice ($n_{i}=600$, (i)(j)) and Gisette ($n_{i}=300$, (k)(l)) datasets in one-shot scenario.
Figure 6: The convergence analysis of our proposed EML model on the USPS (left) and Mnist (right) datasets in one-shot scenario.

V-A1 Experimental Configurations

As shown in Table I, we conduct extensive comparisons on two real-world human motion recognition datasets (i.e., EV-Action [13] and PAMAP2 [41]), a large-scale visual recognition dataset (i.e., ImageNet [42]) and five synthetic benchmark datasets (http://archive.ics.uci.edu/ml/) comprising three digit datasets (Mnist, Gisette and USPS), one DNA dataset (Splice) and one image dataset (Satimage). Specifically, EV-Action [13] is a human action dataset with 5300 samples over 20 common action categories, where 10 actions are performed by a single subject and the others are performed by the same subjects interacting with other objects. It is a typical real-world application of feature evolution, where the features from depth information, RGB images and human skeletons are regarded as the vanished, survived and augmented features, respectively. Example human actions are visualized in Fig. 4. PAMAP2 [41] is composed of 18 activities performed by 9 subjects wearing three inertial measurement units (IMUs) and a heart rate monitor. We only use the IMU data in our experiments, owing to the large number of missing values from the heart rate monitor. Each IMU contains one gyroscope, two accelerometers and one magnetometer, whose features are regarded as the vanished, survived and augmented features, respectively. Moreover, ImageNet [42] is a large-scale, challenging visual recognition dataset with 1000 categories, each containing roughly 1000 samples. We utilize ResNet [43] as the feature extractor to obtain 2048-dimensional feature representations for ImageNet [42].
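As a hedged sketch of the ImageNet feature extraction step: the 2048-dimensional features quoted above match the pooled penultimate layer of ResNet-50, so the snippet below assumes that variant (the paper does not specify which ResNet is used); it relies on standard torchvision APIs.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Assumption: ResNet-50, whose pooled penultimate features are 2048-d,
# matching the dimension quoted above.
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
extractor = torch.nn.Sequential(*list(resnet.children())[:-1]).eval()  # drop fc

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract(img):                        # img: a PIL image
    x = preprocess(img).unsqueeze(0)
    return extractor(x).flatten(1)       # shape (1, 2048)
```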

TABLE IV: Computational time in minutes: mean and standard errors averaged over fifty random runs in one-shot scenario.
Dataset | Pegasos [37] | OPMV [38] | TCA [39] | BDML [2] | OPML [7] | CDML [40] | OPIDe [24] | OPID [24] | FIRF [28] | Ours
EV-Action ($n_{i}=500$) | 27.11±0.04 | 28.58±0.04 | 37.72±0.09 | 36.18±0.18 | 22.95±0.07 | 37.42±0.33 | 26.47±0.09 | 26.33±0.14 | 23.04±0.11 | 25.48±0.06
Mnist0vs5 ($n_{i}=80$) | 6.18±0.07 | 7.45±0.06 | 14.96±0.12 | 16.27±0.10 | 3.84±0.04 | 16.58±0.19 | 5.13±0.11 | 4.95±0.07 | 3.89±0.07 | 4.68±0.05
USPS0vs5 ($n_{i}=120$) | 3.16±0.02 | 4.75±0.10 | 11.93±0.04 | 13.06±0.12 | 1.24±0.03 | 13.42±0.21 | 1.93±0.05 | 1.87±0.06 | 1.32±0.05 | 1.53±0.08
Gisette ($n_{i}=100$) | 40.52±0.03 | 41.06±0.16 | 51.28±0.04 | 49.73±0.14 | 35.26±0.03 | 49.80±0.17 | 38.24±0.08 | 39.15±0.05 | 35.54±0.15 | 37.57±0.11
Satimage ($n_{i}=120$) | 2.64±0.05 | 3.08±0.10 | 10.23±0.07 | 11.47±0.09 | 0.52±0.02 | 11.62±0.14 | 0.66±0.04 | 0.71±0.08 | 0.55±0.04 | 0.68±0.03
PAMAP2 ($n_{i}=360$) | 9.84±0.04 | 9.27±0.06 | 16.74±0.15 | 18.05±0.18 | 4.69±0.13 | 18.32±0.09 | 7.84±0.09 | 7.63±0.07 | 4.81±0.07 | 6.45±0.06

For a fair comparison, as presented in Table I, we adopt the same experimental settings as [24] in the one-shot and multi-shot cases, elaborated as follows:

  • The number of streaming samples in each batch is the same, i.e., $n_{i}=n_{r+1}=n_{r+2}$ $(i\in\{1,2,\ldots,r\})$, and the number of samples in each class is equal for all training and testing batches.

  • In the T-stage, the total number of training samples is fixed while the number of samples per batch varies. Accordingly, the number of training and evaluation samples also varies in the final evaluation phase.

  • We allocate the first $d_{v}$ features, the next $d_{s}$ features and the rest of the features as the vanished, survived and new augmented features, respectively. In our experiments, the first and last quarters are the vanished and augmented features (see the sketch after this list).

  • Even under identical experimental settings, performance in each run may differ slightly due to the computer system and simulation environment. To circumvent this randomness, all reported results are averaged over fifty random runs, which makes the comparisons more convincing.
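The feature split from the third bullet can be made concrete in a few lines (a sketch assuming the quarter split described above: first quarter vanished, middle half survived, last quarter augmented; exact widths per dataset follow Table I):

```python
import numpy as np

def split_features(X):
    """Quarter split: first quarter -> vanished features, middle half ->
    survived features, last quarter -> new augmented features."""
    d = X.shape[1]
    d_v, d_n = d // 4, d // 4
    return X[:, :d_v], X[:, d_v:d - d_n], X[:, d - d_n:]

X = np.random.default_rng(0).standard_normal((5, 16))
X_v, X_s, X_n = split_features(X)   # widths 4, 8, 4
```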

V-A2 Competing Methods

We validate the superior performance of our EML model by comparing it with the following competing methods. One-pass Pegasos [37] assumes that the vanished and augmented features are available in the different feature evolution stages; OPMV [38] regards the features in the T-stage and I-stage as the first and second views; TCA [39] assumes that the streaming samples in the T-stage and I-stage are drawn from the source and target distributions; BDML [2], OPML [7] and CDML [40] are representative metric learning methods that only utilize the samples with augmented features and ignore the previously vanished features. As for feature evolution approaches, OPID and OPIDe [24] propose a one-pass incremental and decremental model for feature evolution, and FIRF [28] designs a feature-incremental random forest framework to tackle the emergence of new sensors (i.e., new augmented features) in a dynamic environment.

V-B Experiments in One-shot Scenario

In this subsection, we present a comprehensive experimental analysis, ablation studies, effects of hyper-parameters and a convergence investigation of our proposed EML model in the one-shot scenario, followed by the computational costs of optimization.

V-B1 Experimental Analysis

The experimental results for the one-shot scenario are presented in Table II, from which we make the following observations: 1) Although our proposed model no longer has access to the vanished features in the I-stage, the transforming and inheriting strategies efficiently exploit useful information from the vanished features and expand it to the new augmented features in the I-stage. 2) Our proposed EML model applies successfully to both high-dimensional (e.g., EV-Action, Gisette and ImageNet) and low-dimensional (e.g., Satimage and Splice) feature evolution, both challenging settings for exploring the intrinsic data structure and informative knowledge from the available features. 3) When the distance matrix learned in the T-stage assists the training procedure in the I-stage, evaluation performance increases significantly, even when training samples in the I-stage are relatively scarce, i.e., $n_{i}$ is small in the I-stage. 4) Our EML model outperforms OPID and OPIDe [24], since the T-stage explores important information from the vanished features, and the I-stage efficiently inherits the metric performance from the T-stage to take advantage of the new augmented features.

TABLE V: Effect investigation of the low-rank constraint of our EML model in the one-shot scenario.
Dataset        n_i   Ours-woLR    Ours
EV-Action      500   57.61±0.56   58.33±0.76
EV-Action      600   57.46±0.89   57.94±0.88
EV-Action      700   57.35±1.34   58.12±0.86
Mnist 0vs3vs5  120   96.18±1.36   96.53±1.49
Mnist 0vs3vs5  240   94.62±1.04   95.37±0.92
Mnist 0vs3vs5  480   95.20±1.01   95.54±0.87
Splice         80    82.16±0.94   82.65±3.32
Splice         160   84.69±1.93   85.25±2.06
Splice         320   85.76±1.45   86.40±1.59
Gisette        100   96.65±1.23   97.29±1.25
Gisette        200   96.40±1.16   96.82±0.91
Gisette        300   97.08±0.75   97.79±0.46
USPS 0vs3vs5   180   94.59±1.05   95.28±0.96
USPS 0vs3vs5   240   94.51±1.27   94.96±1.37
USPS 0vs3vs5   300   93.29±1.18   94.05±1.46
Satimage       60    98.42±1.22   98.97±0.95
Satimage       90    98.32±1.30   98.71±1.13
Satimage       120   98.11±1.05   98.53±1.20

V-B2 Ablation Studies

To verify the effectiveness of our EML model, we investigate the effects of its different components, i.e., training without the T-stage (denoted as Ours-woT), without the I-stage (denoted as Ours-woI), and without the smoothed Wasserstein distance metric (denoted as Ours-woW); Ours-woW is evaluated with the Mahalanobis distance instead. As the results in Table III show, our EML model performs best when the transforming and inheriting strategies work together to tackle incremental and decremental features under the smoothed Wasserstein distance, which validates the design of our model. Compared with other metric distances (e.g., the Mahalanobis distance), the smoothed Wasserstein distance better mines the similarity relationships between heterogeneous and complex streaming samples, since the evolving features are not strictly aligned across stages. Both the T-stage and I-stage play an essential role in tackling instance and feature evolutions simultaneously.
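To make the distinction concrete, the following minimal Python sketch (our illustration, not the exact EML objective) contrasts the two metrics compared in this ablation: a Mahalanobis distance parameterized by a PSD matrix M, and an entropy-smoothed Wasserstein distance computed with Sinkhorn scaling in the spirit of [32]; the histograms, ground cost and regularization value are toy assumptions.

```python
import numpy as np

def mahalanobis(x, y, M):
    """Squared Mahalanobis distance (x - y)^T M (x - y) for a PSD matrix M."""
    d = x - y
    return float(d @ M @ d)

def smoothed_wasserstein(p, q, C, reg=0.1, n_iter=200):
    """Entropy-smoothed Wasserstein distance between histograms p and q
    via Sinkhorn iterations; C is the ground-cost matrix, reg the smoothing."""
    K = np.exp(-C / reg)              # Gibbs kernel of the ground cost
    u = np.ones_like(p)
    for _ in range(n_iter):
        v = q / (K.T @ u)             # alternate marginal-scaling updates
        u = p / (K @ v)
    T = np.diag(u) @ K @ np.diag(v)   # smoothed optimal transport plan
    return float(np.sum(T * C))       # transport cost of the smoothed plan

# Toy usage: two 3-bin histograms with a |i - j| ground cost.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])
C = np.abs(np.subtract.outer(np.arange(3), np.arange(3))).astype(float)
print(smoothed_wasserstein(p, q, C))
print(mahalanobis(p, q, np.eye(3)))
```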

V-B3 Effects of Hyper-Parameters

As shown in Fig. 5, we conduct extensive parameter experiments on several representative datasets (Mnist0vs3vs5, USPS0vs5, EV-Action, PAMAP2, Splice and Gisette) to investigate the effects of the hyper-parameters {γ, λ, ρ} in the one-shot scenario. Specifically, the performance of our model is averaged over fifty random repetitions while empirically tuning {γ, λ, ρ} over the wide selection range {10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 1} to choose their optimal values. We investigate the effects of {γ, ρ} with λ fixed at 10^{-4}, and the influence of {γ, λ} with ρ = 10^{-3}. From the performance depicted in Fig. 5, we observe that our EML model achieves stable prediction performance over this wide selection range. Moreover, our EML model performs best with γ = 10^{-2}, ρ = 10^{-3} and λ = 10^{-4} on most benchmark datasets, except for Mnist0vs3vs5, which performs best with γ = 10^{-2}, ρ = 10^{-4} and λ = 10^{-4}.
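The tuning protocol above amounts to an exhaustive grid search; a small sketch follows, under the assumption of a hypothetical train_and_eval(gamma, lam, rho) routine that returns the accuracy averaged over the fifty random repetitions.

```python
import itertools

GRID = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0]  # the selection range above

def tune(train_and_eval):
    """Exhaustive search over {gamma, lambda, rho}; train_and_eval is a
    hypothetical routine that trains the model and returns mean accuracy."""
    best_params, best_acc = None, -1.0
    for gamma, lam, rho in itertools.product(GRID, repeat=3):
        acc = train_and_eval(gamma, lam, rho)
        if acc > best_acc:
            best_params, best_acc = (gamma, lam, rho), acc
    return best_params, best_acc
```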

TABLE VI: Comparisons between our model and state-of-the-art methods in terms of accuracy (%) on seven datasets: mean and standard errors averaged over fifty random runs in multi-shot scenario for Task I. Models with the best performance are bolded.
Dataset       n_i    Pegasos [37]  OPMV [38]   TCA [39]    BDML [2]    OPML [7]    CDML [40]   OPIDe [24]  OPID [24]   FIRF [28]   Ours
EV-Action     500    54.25±1.42  53.60±1.53  50.63±1.89  53.26±1.18  51.36±1.48  52.77±0.69  54.61±1.53  54.27±1.24  54.16±0.57  55.93±1.04
EV-Action     600    55.59±1.27  53.72±1.66  51.83±1.96  53.74±0.82  53.11±1.36  52.74±1.28  54.43±1.15  53.46±1.92  53.84±1.07  56.74±1.35
EV-Action     700    54.19±1.28  53.52±1.74  51.95±1.36  53.84±0.59  52.54±1.83  53.61±0.94  54.38±1.19  54.62±1.02  55.04±1.30  56.19±0.77
Mnist 0vs5    80     97.50±1.82  88.75±4.99  93.14±1.87  92.25±2.20  93.92±3.22  92.74±2.03  95.70±2.17  95.92±2.23  94.37±2.16  98.54±1.08
Mnist 0vs5    160    97.56±1.28  90.75±3.02  91.78±2.43  95.70±1.26  94.34±1.54  95.87±1.85  95.53±1.61  95.29±1.80  94.16±1.85  98.61±0.57
Mnist 0vs5    320    97.61±0.90  92.72±1.76  90.35±2.67  96.06±0.98  95.20±0.96  95.74±1.73  95.22±1.33  95.04±1.39  94.61±1.17  98.73±0.64
Gisette       100    91.58±2.87  86.24±4.76  91.28±2.56  90.48±3.29  89.87±3.62  90.26±2.71  95.08±2.35  94.36±1.88  92.88±2.06  96.12±1.18
Gisette       200    90.68±1.79  88.41±3.00  90.92±2.84  90.69±2.73  92.22±2.03  91.23±1.89  94.88±1.39  93.81±1.50  93.14±1.83  95.94±1.72
Gisette       300    91.18±1.13  89.42±2.27  92.13±2.14  92.52±1.71  91.83±1.61  92.06±1.36  94.65±1.12  93.91±1.45  93.08±1.26  95.71±1.68
USPS 0vs5     120    97.48±0.12  94.95±2.46  92.87±2.16  96.08±1.36  93.83±2.13  95.84±1.71  94.77±1.62  94.61±1.69  93.92±1.53  98.57±0.94
USPS 0vs5     160    97.56±0.19  95.17±2.18  93.05±1.94  96.24±1.55  94.21±1.66  95.49±1.33  94.13±1.54  94.51±1.30  93.75±1.28  98.68±0.65
USPS 0vs5     240    97.37±0.41  95.58±1.06  92.72±2.33  97.57±0.67  94.62±1.25  95.84±2.03  93.92±1.47  93.62±1.16  92.64±1.09  98.39±0.72
USPS 0vs3vs5  180    92.03±1.57  89.22±3.21  90.86±2.87  90.91±1.81  88.61±2.84  89.73±2.43  84.05±2.27  83.34±2.36  82.94±2.36  93.11±1.87
USPS 0vs3vs5  240    90.90±1.40  89.13±2.05  90.24±2.93  91.98±2.04  89.62±2.17  90.62±1.87  84.68±1.73  84.49±1.92  83.75±1.81  93.23±1.58
USPS 0vs3vs5  300    90.48±1.24  89.52±1.59  89.85±3.16  92.61±1.23  89.46±1.80  90.84±1.56  83.25±1.61  83.17±1.66  84.23±1.42  93.13±1.55
ImageNet      10000  52.76±1.55  48.11±2.30  47.82±3.06  49.04±2.83  49.74±2.49  49.43±2.59  50.42±1.57  50.33±2.26  48.42±2.01  53.92±1.62
ImageNet      12000  53.81±1.44  48.15±2.06  48.82±2.47  50.17±2.42  50.63±1.42  49.95±2.51  52.36±2.13  53.09±2.28  51.64±1.93  54.95±2.17
ImageNet      14000  55.82±2.05  48.93±3.28  48.36±2.66  50.93±2.87  51.62±2.62  50.51±2.18  53.62±1.83  54.96±2.10  51.63±2.14  57.03±1.92
PAMAP2        600    89.07±1.25  86.71±1.22  82.83±2.17  85.82±1.55  83.06±1.39  84.28±1.93  90.42±1.25  90.37±1.52  91.28±0.66  92.75±0.97
PAMAP2        700    88.47±1.58  87.05±1.42  83.23±1.69  85.52±1.74  82.12±1.26  85.68±1.64  88.63±1.33  90.18±0.94  90.57±1.24  92.51±0.92
PAMAP2        800    88.74±1.03  88.05±1.64  82.19±2.26  85.48±1.94  84.95±1.68  86.26±1.33  90.87±0.92  91.73±1.15  91.62±0.81  92.38±1.03
Figure 7: Comparisons in terms of accuracy (%) on three datasets: mean and standard errors averaged over fifty random runs in multi-shot scenario for Task II.

V-B4 Convergence Investigations

The convergence condition of Algorithm 1 depends on a small change (set to 2.5×10^{-5}) between consecutive objective function values, and Fig. 6 depicts the convergence curves of our EML model on the Mnist and USPS datasets. From the results in Fig. 6, we observe that the objective value of our EML model converges asymptotically to a stable value after a few iterations, which validates that our proposed optimization algorithm efficiently achieves stable performance under an appropriate convergence condition.
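A sketch of this stopping rule follows, with objective() and update_step() as hypothetical stand-ins for the objective evaluation and one optimization iteration of Algorithm 1.

```python
TOL = 2.5e-5  # threshold on the change of the objective, as set above

def optimize(objective, update_step, max_iter=1000):
    """Iterate until consecutive objective values differ by less than TOL."""
    prev = objective()
    for _ in range(max_iter):
        update_step()                 # one iteration of Algorithm 1
        cur = objective()
        if abs(prev - cur) < TOL:     # convergence condition met
            break
        prev = cur
    return cur
```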

V-B5 Computational Costs

Table IV presents the computational costs (i.e., optimization time) of our proposed EML model and the competing methods. From the reported results, we draw the following conclusions: 1) Our model is computationally efficient in an online manner for real-world applications, since n_p, n_q, n_k ≪ n_i and k is usually small compared with the feature dimension and the sample number. 2) The computational time (measured in minutes) of our model is about 0.67–13.71 minutes lower than that of the other competing methods on most experimental datasets, except for OPML [7], since OPML only uses the training samples in the I-stage during optimization.

V-B6 Effect Investigation of Low Rank Constraint

This subsection investigates the effectiveness of the low-rank regularizer in our proposed EML model, as presented in Table V. We substitute the low-rank constraint with the Frobenius norm and denote the resulting variant as Ours-woLR. The results in Table V clearly demonstrate that the accuracy of our EML model degrades by about 0.42%–0.72% when the low-rank constraint is abandoned. This illustrates that incorporating the low-rank regularizer allows our EML model to effectively explore the intrinsic low-rank structure of heterogeneous samples across different evolving features.
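For context, a low-rank (nuclear-norm) penalty is typically optimized with singular value thresholding [36], which zeroes small singular values and thus reduces rank, whereas a Frobenius penalty (as in the Ours-woLR variant) only shrinks the matrix uniformly and preserves its rank. A minimal sketch of the two proximal steps, illustrative rather than the exact EML update:

```python
import numpy as np

def prox_nuclear(W, tau):
    """Singular value thresholding: proximal operator of tau * ||W||_*
    (Cai et al. [36]); small singular values are set to zero."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def prox_frobenius(W, tau):
    """Proximal operator of (tau/2) * ||W||_F^2: uniform shrinkage that
    leaves the rank of W unchanged."""
    return W / (1.0 + tau)
```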

V-C Experiments in Multi-shot Scenario

This subsection introduces the experimental configurations and comparison performance of our proposed EML model in the multi-shot scenario.

V-C1 Experimental Configurations

In the multi-shot scenario, we set M = 2, i.e., a two-shot scenario with three stages for illustration, as depicted in Fig. 3. The streaming samples used in the one-shot scenario are split into three stages. Beyond the configurations introduced for the one-shot scenario, the additional experimental configurations for the multi-shot scenario are summarized as follows:

  • All batches of T-stage in one-shot scenario are split into Stage 1 and Stage 2 with equal number of samples, as shown in Fig. 3. Under this setting, the survived features in Stage 2 would be the vanished features in Stage 3, and the new augmented features in Stage 2 would be the survived features in Stage 3. In other words, Stage 1 and Stage 2 are respectively considered as T-stage and I-stage for the first feature evolution. Moreover, Stage 2 and Stage 3 are regarded as T-stage and I-stage for the second feature evolution.

  • The features are divided into four equal quarters with the same partition order as the one-shot scenario. Concretely, the second quarter is shared by Stage 1 and Stage 2, and the third quarter is shared by Stage 2 and Stage 3. The first quarter in Stage 1 and the last quarter in Stage 3 denote the vanished and new augmented features, respectively (see the sketch following this list).
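A minimal sketch of this quarter-based partition (our illustration; the function name is hypothetical), mapping feature quarters to the three stages:

```python
def stage_features(d):
    """Quarter-based feature partition for the two-shot scenario (M = 2),
    following the description above; d is the full feature dimension
    (assumed divisible by 4 for illustration)."""
    q = d // 4
    quarters = [range(i * q, (i + 1) * q) for i in range(4)]
    return {
        "Stage 1": quarters[0:2],  # Q1 (vanishes after Stage 1) + Q2
        "Stage 2": quarters[1:3],  # Q2 (shared with Stage 1) + Q3
        "Stage 3": quarters[2:4],  # Q3 (shared with Stage 2) + Q4 (new)
    }
```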

V-C2 Experiments for Task I and II

To address Task I in the multi-shot case, we directly use the last two adjacent evolution stages and regard them as a one-shot scenario for prediction, since the streaming data in any two adjacent stages share common features. Specifically, we first apply the transforming strategy in Eq. (4) to the streaming data in Stage 2 to learn the discriminative distance matrix, and then apply the inheriting strategy in Eq. (6) to classify the samples in Stage 3. To tackle Task II in the multi-shot scenario, we regard every two adjacent stages as a one-shot scenario (i.e., T-stage and I-stage) and repeat this procedure until the last stage. Specifically, the transforming and inheriting strategies are first applied to Stage 1 and Stage 2, after which we make predictions on the second batch of streaming data in Stage 2. After inheriting the metric performance of Stage 1, when new labeled streaming data in Stage 2 arrive, we extract the useful information from the survived features in Stage 2 and forward it into the new augmented features via a common discriminative space. Finally, we perform the same inheriting strategy on the survived features in Stage 2 to promote the predictions in Stage 3.
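Procedurally, Task II chains one-shot T-stage/I-stage pairs over adjacent stages. A high-level sketch follows, with transform() and inherit() as hypothetical wrappers around the transforming strategy of Eq. (4) and the inheriting strategy of Eq. (6).

```python
def multi_shot_task2(stages, transform, inherit):
    """Treat every pair of adjacent stages as a one-shot scenario:
    learn the metric on stage t (T-stage), then inherit it to predict
    on stage t + 1 (I-stage), carrying the learned metric forward."""
    metric = None
    predictions = []
    for t in range(len(stages) - 1):
        metric = transform(stages[t], prev_metric=metric)    # Eq. (4)
        predictions.append(inherit(stages[t + 1], metric))   # Eq. (6)
    return predictions
```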

The experimental results of our proposed EML model, averaged over fifty random repetitions for Task I and Task II, are presented in Table VI and Fig. 7. We observe that: 1) Our model significantly outperforms the competing methods (e.g., OPIDe and OPID [24]), especially on Task I, since it inherits the metric performance of the survived features in any two adjacent stages. 2) Compared with Task I, our model performs better on Task II in most cases, since the survived features in Stage 1 effectively promote the predictions for the following streaming batches. 3) Our model extends successfully from the one-shot case to the multi-shot scenario to address both Task I and Task II, which further verifies the superior performance of our EML model.

V-C3 Ablation Studies

In this subsection, we conduct extensive variant experiments on Task I and Task II to investigate the contribution of each component of our EML model in the multi-shot scenario, as presented in Table VII and Table VIII. We draw the following conclusions from the results: 1) All designed components of our EML model cooperate to achieve the best performance on both Task I and Task II in the multi-shot scenario, which validates the effectiveness and necessity of each module. 2) The two complementary strategies (i.e., T-stage and I-stage) effectively compress the important information from the vanished features and inherit the metric performance from the previous stage; they play an indispensable role in addressing feature and instance evolutions simultaneously under the smoothed Wasserstein distance. 3) The performance degradation of Ours-woW illustrates the effectiveness of the smoothed Wasserstein distance in exploring the similarity relationships among heterogeneous samples across stages.

TABLE VII: Ablation studies of our proposed EML model in multi-shot scenario for Task I.
Dataset       n_i   Ours-woT    Ours-woI    Ours-woW    Ours
Mnist 0vs5    80    94.93±0.34  94.26±0.38  96.04±0.62  98.54±1.08
Mnist 0vs5    160   94.17±0.55  93.41±0.82  95.52±0.39  98.61±0.57
Mnist 0vs5    320   96.13±0.59  95.47±0.85  96.84±1.03  98.73±0.64
Gisette       100   92.45±0.83  91.17±0.76  93.61±0.35  96.12±1.18
Gisette       200   92.84±0.72  92.04±0.28  94.12±0.46  95.94±1.72
Gisette       300   93.05±0.80  92.36±0.73  93.88±1.14  95.71±1.68
USPS 0vs5     120   95.54±0.75  94.18±0.93  96.33±0.41  98.57±0.94
USPS 0vs5     160   95.06±0.83  94.27±0.53  96.84±0.65  98.68±0.65
USPS 0vs5     240   95.36±0.32  94.91±0.77  97.05±0.41  98.39±0.72
USPS 0vs3vs5  180   90.15±0.19  89.35±0.87  90.94±0.51  93.11±1.87
USPS 0vs3vs5  240   90.62±0.30  89.87±0.64  91.58±0.74  93.23±1.58
USPS 0vs3vs5  300   90.86±0.81  90.22±0.63  92.08±0.26  93.13±1.55
TABLE VIII: Ablation studies of our proposed EML model in multi-shot scenario for Task II.
Dataset       n_i   Ours-woT    Ours-woI    Ours-woW    Ours
Mnist 0vs5    80    94.58±1.48  93.26±1.71  95.93±1.22  98.30±1.18
Mnist 0vs5    160   95.71±1.29  93.62±1.48  96.04±0.98  98.13±1.16
Mnist 0vs5    320   95.25±0.93  94.54±0.89  95.87±1.13  98.24±1.34
Gisette       100   94.32±0.88  93.68±1.09  95.02±0.95  97.22±1.37
Gisette       200   93.10±1.27  92.29±1.48  94.21±1.07  96.90±1.18
Gisette       300   93.74±1.18  92.16±1.26  94.23±0.89  96.92±1.70
USPS 0vs5     120   96.15±1.36  95.62±1.26  96.87±0.88  98.17±1.18
USPS 0vs5     160   94.77±1.38  94.46±1.52  95.45±1.13  98.14±0.97
USPS 0vs5     240   95.32±1.36  94.37±1.65  96.14±0.76  98.67±0.62
USPS 0vs3vs5  180   90.38±1.47  90.56±1.51  91.06±1.05  93.94±1.50
USPS 0vs3vs5  240   90.43±1.26  89.85±1.34  91.88±1.17  93.34±1.38
USPS 0vs3vs5  300   90.35±1.18  90.83±1.43  91.46±0.84  94.58±1.25

VI Conclusion

In this paper, an online Evolving Metric Learning (EML) model is proposed for both instance and feature evolutions, and it is successfully applied to one-shot and multi-shot scenarios. Our proposed EML model contains two essential stages, i.e., a Transforming stage (T-stage) and an Inheriting stage (I-stage). Specifically, in the T-stage, we use the survived features to characterize the effective information extracted from both vanished and survived features by exploiting a common discriminative metric space. In the I-stage, we inherit the metric performance of the survived features from the T-stage and extend it to the new augmented features. Furthermore, we apply the smoothed Wasserstein distance to both stages to better explore the similarity relations of heterogeneous streaming data among different evolution stages. Extensive experiments verify the superior performance of our proposed EML model on several representative datasets. In the future, we will consider lifelong machine learning for both instance and feature evolutions, which continually learns a sequence of new streaming evolution tasks without catastrophic forgetting of previously learned tasks.

References

  • [1] H. V. Nguyen and L. Bai, “Cosine similarity metric learning for face verification,” in Computer Vision – ACCV 2010, R. Kimmel, R. Klette, and A. Sugimoto, Eds., 2011, pp. 709–720.
  • [2] J. Xu, L. Luo, C. Deng, and H. Huang, “Bilevel distance metric learning for robust image recognition,” in Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018, pp. 4202–4211.
  • [3] Z. Boukouvalas, “Distance metric learning for medical image registration,” 2011.
  • [4] G. Chechik, V. Sharma, U. Shalit, and S. Bengio, “Large scale online learning of image similarity through ranking,” J. Mach. Learn. Res., vol. 11, pp. 1109–1135, Mar. 2010.
  • [5] P. Jain, B. Kulis, I. S. Dhillon, and K. Grauman, “Online metric learning and fast similarity search,” in Advances in Neural Information Processing Systems 21, 2009, pp. 761–768.
  • [6] K. Q. Weinberger and L. K. Saul, “Distance metric learning for large margin nearest neighbor classification,” J. Mach. Learn. Res., vol. 10, pp. 207–244, Jun. 2009.
  • [7] W. Li, Y. Gao, L. Wang, L. Zhou, J. Huo, and Y. Shi, “OPML: A one-pass closed-form solution for online metric learning,” Pattern Recognition, vol. 75, pp. 302–314, 2018.
  • [8] R. Jin, S. Wang, and Y. Zhou, “Regularized distance metric learning:theory and algorithm,” in Advances in Neural Information Processing Systems 22, 2009, pp. 862–870.
  • [9] B. Shaw, B. Huang, and T. Jebara, “Learning a distance metric from a network,” in Advances in Neural Information Processing Systems 24, 2011, pp. 1899–1907.
  • [10] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon, “Information-theoretic metric learning,” in Proceedings of the 24th International Conference on Machine Learning.   ACM, 2007, pp. 209–216.
  • [11] J. Hu, J. Lu, and Y. Tan, “Deep metric learning for visual tracking,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 26, no. 11, pp. 2056–2068, 2016.
  • [12] Y. Duan, J. Lu, J. Feng, and J. Zhou, “Deep localized metric learning,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 10, pp. 2644–2656, 2018.
  • [13] L. Wang, B. Sun, J. P. Robinson, T. Jing, and Y. Fu, “Ev-action: Electromyography-vision multi-modal action dataset,” arXiv preprint arXiv:1904.12602, 2019.
  • [14] C. K. Ho, A. Robinson, D. R. Miller, and M. J. Davis, “Overview of sensors and needs for environmental monitoring,” Sensors, vol. 5, no. 1, pp. 4–37, 2005.
  • [15] S. Shalev-Shwartz, Y. Singer, and A. Y. Ng, “Online and batch learning of pseudo-metrics,” in Proceedings of the Twenty-first International Conference on Machine Learning.   ACM, 2004, p. 94.
  • [16] B. Nguyen and B. De Baets, “Kernel-based distance metric learning for supervised k-means clustering,” IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 10, pp. 3084–3095, Oct 2019.
  • [17] Q. Qian, R. Jin, J. Yi, L. Zhang, and S. Zhu, “Efficient distance metric learning by adaptive sampling and mini-batch stochastic gradient descent (sgd),” Machine Learning, vol. 99, no. 3, pp. 353–372, Jun 2015.
  • [18] X. Gao, S. C. H. Hoi, Y. Zhang, J. Wan, and J. Li, “SOML: Sparse online metric learning with application to image retrieval,” in AAAI, 2014.
  • [19] H. Xia, S. C. H. Hoi, R. Jin, and P. Zhao, “Online multiple kernel similarity learning for visual search,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 3, pp. 536–549, March 2014.
  • [20] Z. Ding, M. Shao, W. Hwang, S. Suh, J.-J. Han, C. Choi, and Y. Fu, “Robust discriminative metric learning for image representation,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 11, pp. 3173–3183, Nov. 2019.
  • [21] J. Yu, X. Yang, F. Gao, and D. Tao, “Deep multimodal distance metric learning using click constraints for image ranking,” IEEE Transactions on Cybernetics, vol. 47, no. 12, pp. 4014–4024, 2017.
  • [22] J. Yu, Y. Rui, Y. Y. Tang, and D. Tao, “High-order distance-based multiview stochastic learning in image classification,” IEEE Transactions on Cybernetics, vol. 44, no. 12, pp. 2431–2442, 2014.
  • [23] B.-J. Hou, L. Zhang, and Z.-H. Zhou, “Learning with feature evolvable streams,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017, pp. 1416–1426.
  • [24] C. Hou and Z.-H. Zhou, “One-pass learning with incremental and decremental features,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, pp. 2776–2792, 2018.
  • [25] H.-J. Ye, D.-C. Zhan, Y. Jiang, and Z.-H. Zhou, “Rectify heterogeneous models with semantic mapping,” in ICML, 2018.
  • [26] Q. Zhang, P. Zhang, G. Long, W. Ding, C. Zhang, and X. Wu, “Towards mining trapezoidal data streams,” 2015 IEEE International Conference on Data Mining, pp. 1111–1116, 2015.
  • [27] Q. Zhang, P. Zhang, G. Long, W. Ding, C. Zhang, and X. Wu, “Online learning from trapezoidal data streams,” IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 10, pp. 2709–2723, Oct 2016.
  • [28] C. Hu, Y. Chen, X. Peng, H. Yu, C. Gao, and L. Hu, “A novel feature incremental learning method for sensor-based activity recognition,” IEEE Transactions on Knowledge and Data Engineering, vol. 31, no. 6, pp. 1038–1050, June 2019.
  • [29] K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer, “Online passive-aggressive algorithms,” J. Mach. Learn. Res., vol. 7, pp. 551–585, Dec. 2006.
  • [30] J. Xu, L. Luo, C. Deng, and H. Huang, “Multi-level metric learning via smoothed Wasserstein distance,” in IJCAI, 2018, pp. 2919–2925.
  • [31] R. Sandler and M. Lindenbaum, “Nonnegative matrix factorization with earth mover’s distance metric for image analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, pp. 1590–1602, Aug 2011.
  • [32] M. Cuturi, “Sinkhorn distances: Lightspeed computation of optimal transport,” in Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, 2013, pp. 2292–2300.
  • [33] A. Rolet, M. Cuturi, and G. Peyré, “Fast dictionary learning with a smoothed wasserstein loss,” in Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, vol. 51, 09–11 May 2016, pp. 630–638.
  • [34] L. Breiman, “Stacked regressions,” Machine Learning, vol. 24, pp. 49–64, Jul 1996.
  • [35] Z.-H. Zhou, Ensemble Methods: Foundations and Algorithms.   Chapman & Hall/CRC, 2012.
  • [36] J.-F. Cai, E. J. Candès, and Z. Shen, “A singular value thresholding algorithm for matrix completion,” SIAM Journal on Optimization, vol. 20, no. 4, pp. 1956–1982, 2010.
  • [37] S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter, “Pegasos: Primal estimated sub-gradient solver for SVM,” Mathematical Programming, vol. 127, pp. 3–30, Mar 2011.
  • [38] Y. Zhu, W. Gao, and Z.-H. Zhou, “One-pass multi-view learning,” in ACML, 2015.
  • [39] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang, “Domain adaptation via transfer component analysis,” IEEE Transactions on Neural Networks, vol. 22, no. 2, pp. 199–210, Feb 2011.
  • [40] S. Chen, L. Luo, J. Yang, C. Gong, J. Li, and H. Huang, “Curvilinear distance metric learning,” in Advances in Neural Information Processing Systems 32, 2019.
  • [41] A. Reiss and D. Stricker, “Introducing a new benchmarked dataset for activity monitoring,” in 2012 16th International Symposium on Wearable Computers, 2012, pp. 108–109.
  • [42] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.
  • [43] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” arXiv preprint arXiv:1512.03385, 2015.
Jiahua Dong is currently a Ph.D. candidate in the State Key Laboratory of Robotics, Shenyang Institute of Automation, Chinese Academy of Sciences. He received the B.S. degree from Jilin University in 2017. His current research interests include computer vision, machine learning, transfer learning, domain adaptation and medical image processing.
Yang Cong (S’09-M’11-SM’15) is a full professor with the Chinese Academy of Sciences. He received the B.Sc. degree from Northeast University in 2004, and the Ph.D. degree from the State Key Laboratory of Robotics, Chinese Academy of Sciences in 2009. He was a Research Fellow at the National University of Singapore (NUS) and Nanyang Technological University (NTU) from 2009 to 2011, and a visiting scholar at the University of Rochester. He has served on the editorial board of the Journal of Multimedia. His current research interests include image processing, computer vision, machine learning, multimedia, medical imaging, data mining and robot navigation. He has authored over 70 technical papers. He is a senior member of IEEE.
Gan Sun (S’19) is an Assistant Professor in the State Key Laboratory of Robotics, Shenyang Institute of Automation, Chinese Academy of Sciences. He received the B.S. degree from Shandong Agricultural University in 2013 and the Ph.D. degree from the State Key Laboratory of Robotics, Chinese Academy of Sciences in 2020, and was a visiting scholar at Northeastern University from April 2018 to May 2019 and at the Massachusetts Institute of Technology from June 2019 to November 2019. His current research interests include lifelong machine learning, multi-task learning, medical data analysis, deep learning and 3D computer vision.
Tao Zhang is currently working toward the Ph.D. degree in pattern recognition and intelligent systems at the State Key Laboratory of Robotics, Shenyang Institute of Automation, Chinese Academy of Sciences, Shenyang, China. His research interests include pattern recognition, image processing, tactile sensing and robotics.
Xu Tang is currently a research associate in the State Key Laboratory of Robotics, Shenyang Institute of Automation. He received the MESc degree from Harbin Institute of Technology in 2017. His current research interests include computer vision and machine learning.
Xiaowei Xu is a professor of Information Science at the University of Arkansas at Little Rock (UALR). He received the B.Sc. degree in Mathematics from Nankai University in 1983 and the Ph.D. degree in Computer Science from the University of Munich in 1998. He holds an adjunct professor position in the Department of Mathematics and Statistics at the University of Arkansas at Fayetteville. Before his appointment at UALR, he was a senior research scientist at Siemens, and he was a visiting professor at Microsoft Research Asia and the Chinese University of Hong Kong. His research spans data mining, machine learning, bioinformatics, data management and high-performance computing. He has published over 70 papers in peer-reviewed journals and conference proceedings. His groundbreaking work on the density-based clustering algorithm DBSCAN is widely covered in textbooks and has received over 10203 citations on Google Scholar. Dr. Xu is a recipient of the 2014 ACM KDD Test of Time Award, which “recognizes outstanding papers from past KDD Conferences beyond the last decade that have had an important impact on the data mining research community.”