Evolving Metric Learning for Incremental and Decremental Features
Abstract
Online metric learning has been widely exploited for large-scale data classification due to its low computational cost. However, in practical online scenarios where the features are evolving (e.g., some features vanish and some new features are augmented), most metric learning models cannot be applied successfully, even though they can tackle evolving instances efficiently. To address this challenge, we develop a new online Evolving Metric Learning (EML) model for incremental and decremental features, which handles instance and feature evolutions simultaneously by incorporating a smoothed Wasserstein metric distance. Specifically, our model contains two essential stages: a Transforming stage (T-stage) and an Inheriting stage (I-stage). In the T-stage, we propose to extract important information from vanished features while neglecting non-informative knowledge, and forward it into survived features by transforming them into a low-rank discriminative metric space. This stage further explores the intrinsic low-rank structure of heterogeneous samples, which reduces the computation and memory burden, especially for high-dimensional large-scale data. In the I-stage, we inherit the metric performance of survived features from the T-stage and then expand to include the newly augmented features. Moreover, a smoothed Wasserstein distance is utilized to characterize the similarity relationships among the heterogeneous and complex samples, since the evolving features are not strictly aligned in the different stages. In addition to tackling the challenges in the one-shot case, we also extend our model to the multi-shot scenario. After deriving an efficient optimization strategy for both the T-stage and I-stage, extensive experiments on several datasets verify the superior performance of our EML model.
Index Terms:
Online metric learning, instance and feature evolutions, smoothed Wasserstein distance, low-rank constraint.
I Introduction
Metric learning has been successfully applied in many fields, e.g., face identification [1], object recognition [2] and medical diagnosis [3]. To efficiently tackle large-scale streaming data, learning a discriminative metric in an online manner (i.e., online metric learning [4, 5]) has attracted considerable attention. Generally, most online metric learning models focus on fast metric updating mechanisms [6, 7, 8, 9] or fast similarity-search strategies [5, 10, 8] for large-scale streaming data, where streaming data denote a continuous data flow whose samples arrive consecutively in a real-time manner.

However, these existing online metric learning methods [5, 6, 10, 11, 12] only focus on instance evolution and ignore the feature evolution that arises in many real-world applications, where some features vanish and some new features are augmented. Take human motion recognition [13] as an example. As depicted in Fig. 1, the sudden damage of a Kinect sensor results in the absence of depth information of human motion, while the arrival of a new motion-capture sensor provides auxiliary human skeleton knowledge for motion recognition. This leads to a corresponding decrease and increase in the feature dimensionality of the input data, which are considered the vanished features and augmented features, respectively. The features collected from the RGB camera that keeps working are regarded as survived features. Such a feature evolution setting heavily cripples the human motion recognition performance of a pre-trained model [13]. Another interesting example is that different sensors (e.g., radioisotope, trace metal and biological sensors [14]) are deployed to monitor dynamic environmental change in full aspects. Some sensors expire (vanished features) whereas some new sensors are deployed (augmented features) as different electrochemical conditions and lifespans occur. A fixed or static online metric learning model fails to take advantage of sensors that evolve in this way. Therefore, how to establish a novel metric learning model that simultaneously handles both instance and feature evolutions in these practical online systems is the main focus of this paper.
To address the challenges above, as illustrated in Fig. 1, we develop a new online Evolving Metric Learning (EML) model for incremental and decremental features, which can exploit streaming data with both instance and feature evolutions in an online manner. To be specific, the proposed EML model consists of two significant stages, i.e., a Transforming stage (T-stage) and an Inheriting stage (I-stage). 1) In the T-stage, where features are decremental, we propose to explore the important information and data structure of the vanished features, and transform them into a low-rank discriminative metric space of survived features, which is then utilized to promote the learning process of the I-stage. Moreover, this stage explores the intrinsic low-rank structure of the streaming data, which efficiently reduces both memory and computation costs, especially for large-scale samples with high-dimensional features. 2) In the I-stage, where features are incremental, based on the discriminative metric space learned in the T-stage, we inherit the metric performance of survived features from the T-stage and then expand to consider the newly augmented features. Furthermore, to better explore the similarity relations among heterogeneous data, a smoothed Wasserstein distance is applied to both the T-stage and I-stage, where the evolving features are strictly unaligned and heterogeneous across stages. For model optimization, we derive an efficient strategy to solve the formulations of the T-stage and I-stage. Besides, our EML model can be extended from the one-shot scenario to the multi-shot scenario, where the one-shot scenario indicates that the features of streaming data are incremental and decremental only once (as shown in Fig. 2), while the multi-shot scenario denotes that the representations of streaming data are incremental and decremental multiple times (as shown in Fig. 3).
Comprehensive experimental results on several datasets strongly support the effectiveness of our proposed EML model.
The main contributions of this paper are summarized as follows:
• We propose an online Evolving Metric Learning (EML) model for incremental and decremental features to tackle both instance and feature evolutions simultaneously. To the best of our knowledge, this is the first exploration of this crucial but rarely researched challenge in the metric learning field.
• We present two stages for both feature and instance evolutions, i.e., a Transforming stage (T-stage) and an Inheriting stage (I-stage), which not only make full use of the vanished features in the T-stage, but also take advantage of streaming data with newly augmented features in the I-stage.
• A smoothed Wasserstein distance is incorporated into metric learning to characterize the similarity relations of heterogeneous evolving features among different stages. After deriving an alternating direction optimization algorithm to optimize our EML model, extensive experiments on representative datasets validate the superior performance of our proposed EML model.
II Related Work
This section provides a brief overview of metric learning, followed by some representative methods for feature evolution.
II-A Metric Learning
Online metric learning has been widely explored for instance evolution on large-scale streaming data, and is mainly composed of Mahalanobis distance-based and bilinear similarity-based methods. For the Mahalanobis distance-based methods, POLA [15] is the first attempt to learn the optimal metric in an online manner. Several variants [5, 10, 16] then extend this idea with fast similarity-search strategies, e.g., [8] proposes a regularized online metric learning model with a provable regret bound. Besides, pairwise constraints [8] and triplet constraints [9] are adopted to learn a discriminative metric function; generally, triplet constraints perform better than pairwise constraints [9, 17]. For the bilinear similarity-based models, OASIS [4] is developed to explore a similarity metric for recognition tasks, and SOML [18] aims to learn a diagonal matrix for high-dimensional cases under a setting similar to OASIS [4]. [19] presents an online multiple-kernel similarity method to tackle multi-modal tasks.
Unfortunately, these recently proposed online metric learning methods cannot exploit the discriminative similarity relations of strictly unaligned heterogeneous data in different evolution stages. To explore heterogeneous relationships among data samples, [11] focuses on learning a nonlinear metric to distinguish the foreground boundary and background for robust visual tracking. Duan et al. [12] design fine-grained localized distance metrics to learn hierarchical nonlinear transformations between heterogeneous samples. Ding et al. [20] introduce a fast low-rank learning mechanism and a representation denoising strategy to explore a more robust metric learning framework. Furthermore, [21] proposes a multi-modal distance metric method for image ranking by incorporating both click and visual representations in distance metric learning. [22] presents a multi-view stochastic learning model with a high-order distance metric to explore modality-specific statistical information. However, the above-mentioned metric methods cannot be applied to the challenging online scenarios where the features are evolving due to different sensor lifespans (e.g., some features vanish and some new features are augmented).
II-B Feature Evolution
For feature evolution, under the assumption that there exist samples from both the vanished feature space and the augmented feature space in an overlapping period, [23] develops an evolvable feature learning model by reconstructing the vanished features and exploiting them along with the newly emerging features for large-scale streaming data. [24] proposes a one-pass incremental and decremental learning model for streaming data, which consists of a compressing stage and an expanding stage. Different from [23], [24] assumes that there are overlapping features instead of an overlapping period. Similar to [24], [25] focuses on learning the mapping function between two different feature spaces by using optimal transport techniques. Furthermore, [26, 27] intend to classify trapezoidal data streams whose features and instances increase doubly; however, the newly emerging samples often have overlapping features with the previously existing samples. [28] develops an incremental feature learning model to tackle the emergence of new activity recognition sensors, which encourages the model to generalize well to suddenly emerging incremental features.
Among the works discussed above, no feature evolution model is highly related to ours except OPID (OPIDe) [24]. However, there are several key differences between [24] and our EML model: 1) compared with [24], our work is the first attempt to explore both instance and feature evolutions simultaneously via the T-stage and I-stage in the metric learning field; 2) since the evolving features are strictly unaligned in different stages, we utilize the smoothed Wasserstein distance to explore the distance relationships among heterogeneous and complex data, rather than the Euclidean distance in [24]; 3) compared with [24], the low-rank regularizer on the distance matrix effectively learns a discriminative low-rank metric space while neglecting non-informative knowledge of heterogeneous data in different feature evolution stages.
III Evolving Metric Learning (EML)
This section first reviews online metric learning, and then details how to tackle both instance and feature evolutions via our proposed EML model.
III-A Revisit Online Metric Learning
Metric learning focuses on exploring an optimal distance metric matrix under different measure functions, e.g., the Mahalanobis distance function: $d_M(x_i, x_j) = (x_i - x_j)^\top M (x_i - x_j)$, where $x_i$ and $x_j$ are the $i$-th and $j$-th samples, respectively. $M$ is a symmetric positive semi-definite matrix, which can be factorized as $M = L^\top L$ [5], where $L \in \mathbb{R}^{r \times d}$ ($r$ denotes the rank of $M$) is the transformation matrix. The Mahalanobis distance between $x_i$ and $x_j$ can thus be rewritten as $d_M(x_i, x_j) = \|L x_i - L x_j\|_2^2$. Given an online constructed triplet $(x_t, x_t^+, x_t^-)$, $M$ could be updated in an online manner via the Passive-Aggressive algorithm [29], i.e.,

$$M_{t+1} = \arg\min_{M \succeq 0} \ \frac{1}{2}\|M - M_t\|_F^2 + \gamma\, \ell\big(M; (x_t, x_t^+, x_t^-)\big), \qquad (1)$$

where $\ell\big(M; (x_t, x_t^+, x_t^-)\big) = \max\big(0,\, 1 + d_M(x_t, x_t^+) - d_M(x_t, x_t^-)\big)$ is a hinge loss. $x_t$ and $x_t^+$ belong to the same class, $x_t$ and $x_t^-$ belong to different classes, and $\gamma$ is the regularization parameter.
However, most existing online metric learning models only focus on instance evolution with a fixed feature dimensionality, and thus cannot be utilized in the feature evolution scenario, i.e., streaming data with incremental and decremental features. Furthermore, they mainly aim to promote the discrimination of the learned distance matrix by minimizing the squared Mahalanobis distance between similar sample pairs. In particular, they assume that the feature descriptors of the sample pairs are well aligned in advance. Unfortunately, due to unavoidable factors such as non-linear lighting changes, heavy intensity noise and geometric deformation, this assumption is often violated in real-world tasks, especially feature evolution tasks. Therefore, the distance matrix learned by Eq. (1) is neither applicable nor discriminative enough to explore similarity relationships between heterogeneous and complex samples whose evolving feature descriptors are not strictly aligned in different evolution stages [30].
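To make the online update concrete, the following is a minimal NumPy sketch of a Passive-Aggressive-style triplet update for a Mahalanobis metric. The step size `gamma`, the gradient step on the hinge term, and the eigenvalue-clipping PSD projection are illustrative assumptions, not the paper's exact closed-form rule.

```python
import numpy as np

def hinge_triplet_loss(M, x, xp, xn, margin=1.0):
    """Triplet hinge loss: the positive pair should be closer than the
    negative pair by at least `margin` under the Mahalanobis metric M."""
    d = lambda a, b: float((a - b) @ M @ (a - b))
    return max(0.0, margin + d(x, xp) - d(x, xn))

def pa_update(M, x, xp, xn, gamma=0.1, margin=1.0):
    """One Passive-Aggressive-style step (illustrative, not the paper's exact
    rule): a gradient step on the active hinge term, then a projection back
    onto the PSD cone by clipping negative eigenvalues."""
    if hinge_triplet_loss(M, x, xp, xn, margin) == 0.0:
        return M  # passive: the triplet constraint is already satisfied
    up, un = x - xp, x - xn
    G = np.outer(up, up) - np.outer(un, un)  # gradient of the hinge term w.r.t. M
    M = M - gamma * G
    w, V = np.linalg.eigh((M + M.T) / 2.0)   # symmetrize, then eigen-decompose
    return (V * np.clip(w, 0.0, None)) @ V.T  # clip negative eigenvalues

# Toy triplet: the negative sample is (wrongly) much closer than the positive.
x = np.array([1.0, 0.0, 0.0])
xp = x + 5.0   # positive sample, initially far away
xn = x + 0.1   # negative sample, initially very close
M = np.eye(3)
loss_before = hinge_triplet_loss(M, x, xp, xn)
for _ in range(20):
    M = pa_update(M, x, xp, xn)
loss_after = hinge_triplet_loss(M, x, xp, xn)
```

Running the loop drives the triplet loss down while keeping `M` positive semi-definite, which is the essential behavior of the online update in Eq. (1).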
III-B The Proposed EML Model
This subsection first introduces how to integrate a smoothed Wasserstein distance into the online metric formulation (i.e., Eq. (1)) to characterize the similarity relations of heterogeneous data with feature evolution in different stages. Then the details of how to tackle feature evolution via the Transforming stage (T-stage) and Inheriting stage (I-stage) in the one-shot scenario are elaborated, followed by the extension to the multi-shot scenario.
III-B1 Online Wasserstein Metric Learning
Wasserstein distance [31] measures the optimal transportation cost of moving all the earth from a source to a target distribution while requiring the minimum amount of effort. Formally, given two signatures $P = \{(x_i, p_i)\}_{i=1}^{n}$ and $Q = \{(y_j, q_j)\}_{j=1}^{m}$, the smoothed Wasserstein distance [32] between $P$ and $Q$ is:

$$W(P, Q) = \min_{F \geq 0} \ \langle D, F \rangle - \lambda E(F), \quad \text{s.t.} \ F \mathbf{1}_m = p, \ F^\top \mathbf{1}_n = q, \qquad (2)$$

where $D_{ij}$ denotes the cost of transporting one unit of earth from the source sample $x_i$ to the target sample $y_j$. $F$ indicates the flow network matrix, and $F_{ij}$ represents the amount of earth that is transported from $x_i$ to $y_j$. $p$ and $q$ are normalized marginal probability mass vectors, and they satisfy $F \mathbf{1}_m = p$ and $F^\top \mathbf{1}_n = q$. $\lambda$ is a balance parameter, and $E(F) = -\sum_{ij} F_{ij} \log F_{ij}$ is the strictly concave entropic function.
In Eq. (2), the Mahalanobis distance is employed as the ground distance to construct the smoothed Wasserstein distance. Thus, each element of $D$ in Eq. (2) represents the squared Mahalanobis distance between the source sample $x_i$ of $P$ and the target sample $y_j$ of $Q$, i.e., $D_{ij} = (x_i - y_j)^\top M (x_i - y_j)$. Given an online constructed triplet $(P_t, P_t^+, P_t^-)$ via [33], the samples of $P_t$ and $P_t^+$ belong to the same class, and the samples of $P_t$ and $P_t^-$ belong to different classes. After substituting the Mahalanobis distance in Eq. (1) with the smoothed Wasserstein distance defined in Eq. (2), online Wasserstein metric learning could be formulated as follows:

$$M_{t+1} = \arg\min_{M \succeq 0} \ \frac{1}{2}\|M - M_t\|_F^2 + \gamma \max\big(0,\, 1 + W(P_t, P_t^+) - W(P_t, P_t^-)\big), \qquad (3)$$

where $W(\cdot, \cdot)$ denotes the smoothed Wasserstein distance in Eq. (2) computed with ground metric $M$. When compared with the triplet $(x_t, x_t^+, x_t^-)$, each signature in $(P_t, P_t^+, P_t^-)$ consists of several samples belonging to the same class rather than only one sample.
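As a concrete illustration of the smoothed Wasserstein distance, the NumPy sketch below solves the entropy-regularized transport problem with Sinkhorn scaling over a Mahalanobis ground cost. The uniform marginals, the regularization strength `lam`, and the iteration count are illustrative assumptions.

```python
import numpy as np

def mahalanobis_cost(X, Y, M):
    """Pairwise ground costs D[i, j] = (x_i - y_j)^T M (x_i - y_j)."""
    D = np.empty((len(X), len(Y)))
    for i, x in enumerate(X):
        for j, y in enumerate(Y):
            d = x - y
            D[i, j] = d @ M @ d
    return D

def smoothed_wasserstein(X, Y, M, lam=1.0, n_iter=300):
    """Entropy-regularized optimal transport solved by Sinkhorn scaling,
    with uniform marginals p and q. Returns (<D, F>, F)."""
    D = mahalanobis_cost(X, Y, M)
    p = np.full(len(X), 1.0 / len(X))
    q = np.full(len(Y), 1.0 / len(Y))
    K = np.exp(-lam * D)              # Gibbs kernel
    u = np.ones_like(p)
    for _ in range(n_iter):           # alternate marginal corrections
        u = p / (K @ (q / (K.T @ u)))
    v = q / (K.T @ u)
    F = u[:, None] * K * v[None, :]   # flow matrix with the required marginals
    return float((D * F).sum()), F

P = np.array([[0.0, 0.0], [1.0, 0.0]])   # signature 1 (two samples)
Q = P + 1.0                               # signature 2: a shifted copy
cost_self, _ = smoothed_wasserstein(P, P, np.eye(2))
cost_shift, F = smoothed_wasserstein(P, Q, np.eye(2))
```

The self-transport cost `cost_self` is much smaller than the cost to the shifted signature `cost_shift`, and the recovered flow `F` satisfies both marginal constraints, mirroring the role of the flow matrix in Eq. (2).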

III-B2 Transforming Stage (T-stage) and Inheriting Stage (I-stage)
In the one-shot scenario, where the features of streaming data are incremental and decremental only once, the two essential stages (i.e., T-stage and I-stage) of our proposed EML model for streaming data with feature evolution are elaborated below.
I. Transforming Stage (T-stage): As shown in Fig. 2, suppose that $\{(X_i, Y_i)\}_{i=1}^{B_1}$ denotes the streaming data in the T-stage, where $X_i$ and $Y_i$ denote the samples and labels in the $i$-th batch, respectively. $B_1$ is the total number of batches in the T-stage, and $n_i$ indicates the sample number in the $i$-th batch. Each instance of $X_i$ consists of vanished and survived features, and $d_v$ and $d_s$ indicate the corresponding dimensions of the vanished features $X_i^v$ and survived features $X_i^s$.
If we directly combine both vanished and survived features to learn a unified metric function, the learned metric cannot be utilized in the I-stage, where some features have vanished and other new features are augmented. We thus propose to extract important information from the vanished features and forward it into the survived features by exploring a common discriminative metric space. In other words, we aim to train a model using only survived features to characterize the effective information extracted from both vanished and survived features.
In the $i$-th batch of the T-stage, inspired by [33], the triplet $(P_t^s, P_t^{s+}, P_t^{s-})$ for survived features is constructed in an online manner, where the samples of $P_t^s$ and $P_t^{s+}$ belong to the same class while the samples of $P_t^s$ and $P_t^{s-}$ belong to different classes; each signature contains a fixed number of same-class samples. Likewise, we can construct the triplet $(P_t^a, P_t^{a+}, P_t^{a-})$ for all features (containing both vanished and survived features) in the T-stage, where the samples of $P_t^a$ and $P_t^{a+}$ belong to the same class while the samples of $P_t^a$ and $P_t^{a-}$ belong to different classes.
Let $M_s$ and $M_a$ denote the distance matrices trained on survived features and all features (containing both vanished and survived features) in the T-stage. Since the dimensions of $M_s$ and $M_a$ are different, it is reasonable to add some essential consistency constraints on the optimal distance matrices $M_s$ and $M_a$ to extract important information from vanished features and forward it into survived features. Generally, based on the smoothed Wasserstein metric learning in Eq. (3), the formulation of the $i$-th batch in the T-stage could be expressed as follows:

$$\min_{M_s,\, M_a} \ \mathcal{L}_s(M_s) + \mathcal{L}_a(M_a) + \lambda_1\, \mathcal{J}(M_s, M_a) + \lambda_2 \big( r(M_s) + r(M_a) \big), \qquad (4)$$

where $\mathcal{L}_s(M_s)$ and $\mathcal{L}_a(M_a)$ denote the triplet losses of smoothed Wasserstein metric learning on survived features and all features (containing both vanished and survived features), respectively. $r(\cdot)$ denotes the regularization term, which learns the underlying low-rank property of heterogeneous samples. $\lambda_1$ and $\lambda_2$ are the balance parameters. $\mathcal{J}(M_s, M_a)$ in Eq. (4) is designed to enforce the consistency constraint between $M_s$ and $M_a$, which aims to use only survived features to characterize the effective information extracted from both vanished and survived features.
Specifically, $\mathcal{J}(M_s, M_a)$ constructs an essential triplet loss by incorporating smoothed Wasserstein metric learning across different feature spaces, i.e., survived features and all features (containing both vanished and survived features). We compute the smoothed Wasserstein distance between heterogeneous distributions drawn from survived features and all features. For example, $W(P_t^a, P_t^s)$ denotes the smoothed Wasserstein distance between $P_t^a$ from all features and $P_t^s$ from survived features, where each ground cost indicates the Mahalanobis distance between a source sample of $P_t^a$ and a target sample of $P_t^s$. Likewise, $W(P_t^a, P_t^{s+})$ and $W(P_t^a, P_t^{s-})$ have similar definitions. Formally, the consistency constraint is concretely expressed as follows:

$$\mathcal{J}(M_s, M_a) = W(P_t^a, P_t^s) + \max\big(0,\, 1 + W(P_t^a, P_t^{s+}) - W(P_t^a, P_t^{s-})\big). \qquad (5)$$
II. Inheriting Stage (I-stage): Suppose that $(\hat{X}_j, \hat{Y}_j)$ denotes the data in the $j$-th batch of the I-stage, where $\hat{X}_j$ indicates the samples and $\hat{Y}_j$ is the corresponding labels, as shown in Fig. 2. $\hat{X}_j^s$ and $\hat{X}_j^n$ represent the survived features and new augmented features in the $j$-th batch, and $d_n$ and $\hat{n}_j$ are the dimension of the new augmented features and the number of samples, respectively. Thus, the goal of the I-stage is to use $\hat{X}_j$ for training and make predictions for the $(j{+}1)$-th batch of data, whose number of samples is the same as that of $\hat{X}_j$.
To classify the test batch, we propose to inherit the metric performance of the optimal distance matrix learned on survived features in the T-stage, since a set of common survived features exists in both the T-stage and I-stage. Although we could construct triplets directly from the current batch for training, this trivial strategy has two significant shortcomings: 1) the trained metric model is difficult to extend to the multi-shot scenario; 2) a metric model learned only with the current batch would have worse prediction performance, since it fails to make full use of the data in the T-stage.
To this end, we utilize a stacking strategy similar to [34, 35], where [34, 35] focus on forming linear combinations of different predictors to train a unified classifier with improved prediction accuracy. In contrast, we propose to concatenate all feature descriptors as in stacking and train a unified predictor on the stacked features, which better inherits the metric performance learned in the T-stage. Concretely, let $L_s$ (with $M_s = L_s^\top L_s$) define the transformed discriminative metric space, which can be regarded as producing the new representation of $\hat{X}_j^s$ for stacking. $\hat{X}_j$ could then be represented as $Z_j = [L_s \hat{X}_j^s; \hat{X}_j^n]$, and the test batch is characterized in the same way. Furthermore, we learn an optimal distance matrix $\hat{M}$ on $Z_j$ with the online constructed triplet $(\hat{P}_t, \hat{P}_t^+, \hat{P}_t^-)$, and evaluate the performance on the stacked test batch, where the samples of $\hat{P}_t$ and $\hat{P}_t^+$ belong to the same class while the samples of $\hat{P}_t$ and $\hat{P}_t^-$ belong to different classes. Formally, at the $t$-th iterative step, the objective function of learning $\hat{M}$ in the I-stage can be formulated as:
$$\hat{M}_{t+1} = \arg\min_{\hat{M} \succeq 0} \ \frac{1}{2}\|\hat{M} - \hat{M}_t\|_F^2 + \lambda_1 \max\big(0,\, 1 + W(\hat{P}_t, \hat{P}_t^+) - W(\hat{P}_t, \hat{P}_t^-)\big) + \lambda_2\, r(\hat{M}), \qquad (6)$$

where $\lambda_1$ and $\lambda_2$ are the balance parameters. In our experiments, $\lambda_1$ and $\lambda_2$ in both Eq. (4) and Eq. (6) are set to the same values for simplification. $r(\hat{M})$ denotes the regularization term, which aims to explore the intrinsic low-rank structure of heterogeneous samples in the I-stage.
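The stacking step of the I-stage can be sketched as follows, where `L_s` stands in for a factor of the survived-feature metric learned in the T-stage (M_s = L_s^T L_s); all names and shapes here are illustrative assumptions.

```python
import numpy as np

def stack_features(X_surv, X_new, L_s):
    """Project survived features into the T-stage metric space (x -> L_s x),
    then concatenate the new augmented features column-wise."""
    Z_surv = X_surv @ L_s.T             # (n, r): survived part, transformed
    return np.hstack([Z_surv, X_new])   # (n, r + d_new): stacked representation

rng = np.random.default_rng(0)
n, d_s, d_new, r = 8, 5, 3, 2
L_s = rng.normal(size=(r, d_s))   # stands in for a factor of M_s = L_s^T L_s
X_surv = rng.normal(size=(n, d_s))
X_new = rng.normal(size=(n, d_new))
Z = stack_features(X_surv, X_new, L_s)
```

A new distance matrix is then learned on the stacked representation `Z`, so the metric knowledge from the T-stage is carried over through the transformed survived part rather than relearned from scratch.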

III-B3 Multi-shot Scenario
Different from the one-shot scenario, the features of streaming data in the multi-shot scenario would be incremental and decremental $m$ times. This subsection extends our model from the one-shot case to the multi-shot scenario, and an illustrative example with $m = 2$ is depicted in Fig. 3. Specifically, the streaming data in Stage 1 arrive in batches, where each batch contains its own samples and labels. When the streaming data in Stage 2 arrive, the features evolve for the first time (i.e., some features vanish and some new features are augmented). Moreover, in Stage 3, the features evolve for the second time, and we predict the results of our proposed EML model on the last batch of data. Note that there are overlapping feature representations between any two adjacent stages. For example, as presented in Fig. 3, the survived features in Stage 1 are regarded as the vanished features in Stage 2, and the augmented features in Stage 2 are considered as the survived features in Stage 3. Therefore, there are multiple Transforming stages (T-stage) and Inheriting stages (I-stage) in the multi-shot scenario. To be specific, our proposed model first regards Stage 1 and Stage 2 as the T-stage and I-stage for the first feature evolution. Then, it considers Stage 2 and Stage 3 as the T-stage and I-stage for the second feature evolution. Generally, in the multi-shot scenario, we have two essential learning tasks:
• Task I: Similar to the prediction task in the one-shot case, we aim to classify the testing data in Stage 3 by training our proposed model on the previous batches of streaming data.
• Task II: Different from the prediction task in the one-shot scenario, we attempt to make predictions for all stages (i.e., Stage 1, Stage 2 and Stage 3 when $m = 2$) by training our proposed model on the streaming data of all stages.
IV Model Optimization
This section presents an alternating optimization strategy to update our proposed EML model in the two stages, i.e., the T-stage and I-stage, followed by a computational complexity analysis. The whole optimization strategy of our proposed EML model is summarized in Algorithm 1.
Note that the low-rank minimization in Eq. (4) and Eq. (6) is a well-known NP-hard problem. Taking the I-stage metric as an example, the low-rank term in Eq. (6) can be effectively surrogated by the trace norm. Different from traditional Singular Value Thresholding (SVT) [36], we employ a regularization term to guarantee the low-rank property. Likewise, the low-rank optimization of the two T-stage distance matrices shares the same surrogate strategy.
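For contrast with the SVT operator mentioned above, here is a minimal sketch of Singular Value Thresholding, the proximal map of the nuclear (trace) norm; note that the paper deliberately uses a different low-rank regularizer, so this block only illustrates the standard baseline.

```python
import numpy as np

def svt(M, tau):
    """Singular Value Thresholding: the proximal operator of tau * ||.||_*.
    Soft-thresholds the singular values, which encourages low rank."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.clip(s - tau, 0.0, None)) @ Vt

M = np.diag([3.0, 1.0, 0.2])   # rank-3 matrix with one tiny singular value
M_low = svt(M, 0.5)            # the 0.2 singular value is shrunk to zero
```

Applying `svt` with threshold 0.5 shrinks the singular values from (3.0, 1.0, 0.2) to (2.5, 0.5, 0.0), so the output has rank 2: small singular directions are discarded while dominant structure is preserved.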
IV-A Optimizing T-stage via an Alternating Strategy
IV-A1 Updating by fixing
IV-A2 Updating by fixing
With the obtained distance matrix and flow matrix, the optimization problem for the remaining variable in Eq. (4) could be formulated as:

(9)
Concretely, the updating operator could be given as:

(10)
IV-A3 Updating by fixing
IV-B Optimizing I-stage via an Alternating Strategy
IV-B1 Updating by fixing
IV-B2 Updating by fixing
The optimization procedure for the flow variable in the I-stage is the same as that in the T-stage: with the distance matrix fixed, the formulation in Eq. (6) is split into several independent smoothed Wasserstein distance subproblems, which we solve via [33].
Datasets | c | n | d_v | d_s | d_a
---|---|---|---|---|---
EV-Action | 20 | 4200 | 1024 | 1024 | 75 | |
Mnist0vs5 | 2 | 3200 | 114 | 228 | 113 | |
Mnist0vs3vs5 | 3 | 4800 | 123 | 245 | 121 | |
Splice | 2 | 2240 | 10 | 40 | 10 | |
Gisette | 2 | 6000 | 1239 | 2478 | 1238 | |
USPS0vs5 | 2 | 960 | 64 | 128 | 64 | |
USPS0vs3vs5 | 3 | 1440 | 64 | 128 | 64 | |
Satimage | 3 | 1080 | 10 | 18 | 8 | |
ImageNet | 1000 | 1200000 | 512 | 1024 | 512 | |
PAMAP2 | 18 | 7200 | 81 | 162 | 81 |
| Dataset | | Pegasos [37] | OPMV [38] | TCA [39] | BDML [2] | OPML [7] | CDML [40] | OPIDe [24] | OPID [24] | FIRF [28] | Ours |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | 500 | 57.38±1.51 | 56.37±1.91 | 53.88±2.04 | 56.42±0.71 | 54.10±1.71 | 55.08±0.83 | 57.84±1.06 | 57.57±1.08 | 57.13±0.84 | 58.87±0.68 |
| EV-Action | 600 | 57.46±1.60 | 56.94±1.82 | 54.61±1.73 | 56.81±0.65 | 55.37±1.64 | 55.92±1.03 | 57.22±0.95 | 56.71±1.40 | 56.92±1.25 | 58.65±0.84 |
| | 700 | 57.22±1.34 | 56.68±1.87 | 54.37±1.69 | 56.63±0.77 | 55.82±1.62 | 56.22±0.71 | 57.09±1.13 | 56.85±1.27 | 57.23±1.16 | 58.32±0.82 |
| Mnist | 80 | 97.74±0.73 | 97.39±0.92 | 96.53±1.75 | 97.00±1.66 | 96.45±1.72 | 96.75±1.32 | 98.68±0.88 | 98.88±0.99 | 98.14±0.87 | 99.85±0.91 |
| 0vs5 | 160 | 98.11±1.03 | 95.82±1.84 | 93.08±2.94 | 98.25±0.80 | 96.83±1.38 | 97.04±0.58 | 97.94±0.97 | 98.75±0.90 | 96.79±1.52 | 99.78±0.57 |
| | 320 | 97.68±0.79 | 96.47±1.79 | 92.43±3.82 | 98.24±0.75 | 96.98±1.03 | 97.16±0.85 | 97.38±0.58 | 97.21±0.66 | 96.83±1.37 | 99.27±0.37 |
| Mnist | 120 | 91.47±3.92 | 95.87±1.82 | 91.26±3.87 | 92.23±2.86 | 92.42±2.22 | 92.66±1.49 | 94.58±1.78 | 94.97±1.30 | 95.03±0.83 | 96.91±1.38 |
| 0vs3vs5 | 240 | 89.95±3.08 | 93.96±1.18 | 90.85±1.74 | 92.87±1.40 | 91.99±1.64 | 92.47±1.31 | 93.45±1.41 | 93.48±1.35 | 94.24±1.13 | 95.37±0.92 |
| | 480 | 90.12±1.93 | 93.28±1.69 | 91.14±3.95 | 93.21±1.06 | 92.74±1.17 | 93.04±0.96 | 93.30±0.86 | 93.37±0.79 | 93.85±0.95 | 95.54±0.87 |
| | 80 | 79.65±4.13 | 80.13±3.86 | 76.93±4.52 | 65.65±5.53 | 69.60±4.38 | 68.85±2.27 | 81.22±3.73 | 80.50±3.53 | 79.83±2.55 | 82.65±3.32 |
| Splice | 160 | 82.25±3.26 | 81.95±2.84 | 80.93±3.47 | 71.55±4.07 | 78.21±2.53 | 75.85±2.65 | 84.00±2.03 | 83.91±2.05 | 82.06±1.91 | 85.25±2.06 |
| | 320 | 82.32±3.18 | 78.72±4.37 | 81.53±3.38 | 72.16±3.40 | 80.86±2.01 | 78.93±1.17 | 85.55±1.32 | 85.94±1.38 | 83.69±1.73 | 87.03±1.52 |
| | 100 | 97.53±1.33 | 95.27±2.85 | 94.11±3.35 | 90.25±3.13 | 94.17±3.02 | 93.71±2.39 | 97.14±1.28 | 97.56±1.26 | 94.21±0.96 | 97.29±1.25 |
| Gisette | 200 | 95.14±2.97 | 94.05±3.36 | 93.03±3.16 | 91.50±1.25 | 93.61±3.19 | 92.68±1.72 | 95.59±0.95 | 95.39±1.06 | 93.76±0.79 | 96.82±0.91 |
| | 300 | 96.84±1.35 | 93.71±3.11 | 94.37±3.72 | 93.83±2.12 | 93.77±2.96 | 93.24±1.56 | 96.36±0.69 | 95.33±0.93 | 94.18±1.06 | 97.89±0.43 |
| USPS | 120 | 98.52±1.67 | 95.27±2.67 | 96.42±1.81 | 95.90±1.65 | 93.72±2.32 | 94.74±2.46 | 96.17±1.44 | 96.51±1.25 | 95.85±1.33 | 97.23±1.64 |
| 0vs5 | 160 | 97.84±0.82 | 95.65±1.72 | 95.46±2.13 | 96.38±1.23 | 93.04±4.05 | 95.21±1.57 | 96.78±1.31 | 96.93±1.00 | 95.75±1.12 | 98.91±0.67 |
| | 240 | 97.93±0.72 | 96.17±1.28 | 95.85±2.07 | 96.78±1.18 | 93.62±3.01 | 95.62±1.83 | 94.93±1.28 | 95.06±1.10 | 93.72±0.93 | 98.94±0.70 |
| USPS | 180 | 94.68±1.20 | 92.46±1.07 | 93.88±1.37 | 90.62±2.48 | 92.06±1.64 | 91.85±1.62 | 94.47±1.77 | 94.13±1.92 | 94.63±1.45 | 95.73±0.88 |
| 0vs3vs5 | 240 | 94.39±1.09 | 91.69±2.31 | 92.94±1.58 | 91.48±1.68 | 91.23±1.73 | 91.73±1.24 | 92.08±1.93 | 92.50±1.66 | 93.36±2.07 | 95.52±1.26 |
| | 300 | 95.47±0.94 | 92.25±1.60 | 93.26±1.44 | 92.13±1.09 | 91.60±1.71 | 92.07±1.36 | 92.95±1.12 | 92.67±1.46 | 93.18±1.54 | 94.05±1.46 |
| | 60 | 94.25±2.56 | 96.48±1.47 | 97.25±1.08 | 97.14±1.59 | 97.47±1.59 | 97.39±1.46 | 98.17±2.19 | 97.60±2.31 | 97.92±2.05 | 99.20±0.91 |
| Satimage | 90 | 96.49±1.49 | 96.83±1.18 | 96.52±1.32 | 97.62±1.52 | 97.69±1.16 | 97.84±1.31 | 98.58±1.12 | 97.29±2.08 | 98.16±1.85 | 99.71±1.06 |
| | 120 | 98.03±1.13 | 97.38±1.94 | 97.12±1.87 | 97.12±1.48 | 97.15±1.49 | 97.22±1.63 | 98.45±1.14 | 96.85±1.94 | 97.24±1.36 | 99.52±1.07 |
| | 10000 | 55.28±1.83 | 51.03±2.58 | 50.44±3.15 | 52.49±3.14 | 52.74±2.54 | 52.15±2.71 | 55.63±1.22 | 55.70±2.03 | 53.94±2.05 | 56.47±1.57 |
| ImageNet | 12000 | 56.37±1.75 | 50.24±2.39 | 50.83±2.96 | 52.68±2.33 | 52.94±1.87 | 52.06±2.64 | 55.94±1.83 | 56.31±2.33 | 54.82±1.77 | 57.83±1.93 |
| | 14000 | 58.04±2.38 | 51.61±3.52 | 50.62±2.74 | 53.02±3.14 | 53.73±2.19 | 52.64±2.37 | 56.85±1.52 | 57.06±1.84 | 54.79±2.29 | 59.17±1.84 |
| | 600 | 91.64±1.08 | 89.85±1.33 | 85.73±1.84 | 87.67±1.74 | 86.23±1.81 | 87.92±2.16 | 92.17±0.93 | 92.64±1.05 | 93.56±0.84 | 95.27±0.71 |
| PAMAP2 | 700 | 91.85±1.15 | 90.14±1.29 | 86.04±2.03 | 88.06±2.20 | 87.94±1.57 | 88.73±1.91 | 91.85±1.28 | 92.39±1.04 | 93.28±1.13 | 95.46±0.85 |
| | 800 | 91.57±0.89 | 90.25±1.56 | 85.49±2.75 | 88.75±2.06 | 88.13±1.90 | 89.37±1.68 | 93.05±0.88 | 93.62±0.93 | 93.84±0.79 | 95.66±0.94 |
IV-C Computational Complexity Analysis
The main computational cost of our EML model involves the updating operations in both the T-stage and I-stage, i.e., updating the distance matrices and solving the flow matrices of the smoothed Wasserstein subproblems in each stage. Compared with the feature dimension and sample number, the rank of the learned metric is often small, and thus our proposed model is efficient to optimize in an online manner.
V Experiments
This section first presents the detailed experimental configurations and competing methods. Then the experimental results, along with analyses of our EML model in both the one-shot and multi-shot cases, are provided.
| Dataset | | Ours-woT | Ours-woI | Ours-woW | Ours |
|---|---|---|---|---|---|
| | 500 | 56.68±1.74 | 54.36±1.61 | 57.93±0.85 | 58.33±0.76 |
| EV-Action | 600 | 56.23±1.81 | 55.70±1.49 | 57.70±1.04 | 57.94±0.88 |
| | 700 | 57.02±1.56 | 55.93±1.76 | 57.83±0.92 | 58.12±0.86 |
| Mnist | 80 | 97.85±1.24 | 96.70±1.71 | 98.90±0.97 | 99.07±0.94 |
| 0vs5 | 160 | 97.54±1.46 | 96.84±1.85 | 98.87±1.06 | 99.22±0.61 |
| | 320 | 97.23±3.34 | 96.88±0.96 | 98.95±0.83 | 99.27±0.37 |
| Mnist | 120 | 94.55±1.48 | 92.78±2.11 | 96.02±1.85 | 96.53±1.49 |
| 0vs3vs5 | 240 | 93.49±1.07 | 92.88±1.31 | 94.88±1.37 | 95.37±0.92 |
| | 480 | 94.32±0.81 | 93.37±1.13 | 95.13±1.22 | 95.54±0.87 |
| | 80 | 81.58±3.10 | 70.83±4.47 | 82.45±3.38 | 82.65±3.32 |
| Splice | 160 | 84.07±2.51 | 78.87±3.01 | 84.87±2.19 | 85.25±2.06 |
| | 320 | 84.85±2.38 | 81.56±1.99 | 85.94±1.61 | 86.40±1.59 |
| | 100 | 95.22±1.30 | 92.47±1.68 | 96.84±1.40 | 97.29±1.25 |
| Gisette | 200 | 94.38±1.52 | 92.96±1.75 | 96.27±1.53 | 96.82±0.91 |
| | 300 | 96.11±0.95 | 95.08±1.19 | 97.14±0.87 | 97.79±0.46 |
| USPS | 120 | 95.42±1.82 | 94.82±2.02 | 96.26±1.33 | 97.23±1.64 |
| 0vs5 | 160 | 96.04±1.33 | 94.95±1.70 | 97.03±1.47 | 98.31±0.82 |
| | 240 | 96.35±1.06 | 95.17±1.16 | 97.24±0.96 | 98.87±0.74 |
| USPS | 180 | 93.36±1.77 | 91.97±2.00 | 94.86±1.17 | 95.28±0.96 |
| 0vs3vs5 | 240 | 93.13±1.38 | 92.01±1.45 | 94.33±1.54 | 94.96±1.37 |
| | 300 | 92.99±1.35 | 91.81±1.67 | 93.47±1.83 | 94.05±1.46 |
| | 60 | 96.50±1.59 | 97.43±1.36 | 98.31±1.10 | 98.97±0.95 |
| Satimage | 90 | 96.78±2.72 | 97.31±1.10 | 98.19±1.16 | 98.71±1.13 |
| | 120 | 96.22±1.91 | 97.23±1.22 | 98.02±1.22 | 98.53±1.20 |
V-A Configurations and Competing Methods
The experimental configurations of our EML model in the one-shot scenario and the competing methods are introduced in detail in this subsection.
Figure 5: Effects of hyper-parameters on (a)-(b) Mnist0vs3vs5, (c)-(d) USPS0vs5, (e)-(f) EV-Action, (g)-(h) PAMAP2, (i)-(j) Splice, and (k)-(l) Gisette.
V-A1 Experimental Configurations
As shown in Table I, we conduct extensive comparisons on two real-world human motion recognition datasets (i.e., EV-Action [13] and PAMAP2 [41]), a large-scale visual recognition dataset (i.e., ImageNet [42]) and five synthetic benchmark datasets (available at http://archive.ics.uci.edu/ml/), including three digit datasets (i.e., Mnist, Gisette and USPS), one DNA dataset (i.e., Splice) and one image dataset (i.e., Satimage). Specifically, the EV-Action dataset [13] is a human action dataset with 5300 samples covering 20 common action categories, where 10 actions are performed by a single subject and the other 10 are performed by the same subjects interacting with other objects. It is a typical real-world application of feature evolution, where the features from depth information, RGB images and human skeletons are regarded as vanished, survived and augmented features, respectively. Some example human action samples are visualized in Fig. 4. PAMAP2 [41] is composed of 18 activities performed by 9 different subjects wearing three inertial measurement units (IMUs) and a heart rate monitor. We only utilize the data from the IMUs in our experiments, due to the large number of missing values collected from the heart rate monitor. Each IMU contains one gyroscope, two accelerometers and one magnetometer, whose features are regarded as vanished, survived and augmented features, respectively. Moreover, ImageNet [42] is a large-scale and challenging visual recognition dataset with 1000 different categories, each containing roughly 1000 samples. We utilize ResNet [43] as the feature extractor to obtain 2048-dimensional feature representations for ImageNet [42].
| Dataset | Pegasos [37] | OPMV [38] | TCA [39] | BDML [2] | OPML [7] | CDML [40] | OPIDe [24] | OPID [24] | FIRF [28] | Ours |
|---|---|---|---|---|---|---|---|---|---|---|
| EV-Action | 27.11±0.04 | 28.58±0.04 | 37.72±0.09 | 36.18±0.18 | 22.95±0.07 | 37.42±0.33 | 26.47±0.09 | 26.33±0.14 | 23.04±0.11 | 25.48±0.06 |
| Mnist 0vs5 | 6.18±0.07 | 7.45±0.06 | 14.96±0.12 | 16.27±0.10 | 3.84±0.04 | 16.58±0.19 | 5.13±0.11 | 4.95±0.07 | 3.89±0.07 | 4.68±0.05 |
| USPS 0vs5 | 3.16±0.02 | 4.75±0.10 | 11.93±0.04 | 13.06±0.12 | 1.24±0.03 | 13.42±0.21 | 1.93±0.05 | 1.87±0.06 | 1.32±0.05 | 1.53±0.08 |
| Gisette | 40.52±0.03 | 41.06±0.16 | 51.28±0.04 | 49.73±0.14 | 35.26±0.03 | 49.80±0.17 | 38.24±0.08 | 39.15±0.05 | 35.54±0.15 | 37.57±0.11 |
| Satimage | 2.64±0.05 | 3.08±0.10 | 10.23±0.07 | 11.47±0.09 | 0.52±0.02 | 11.62±0.14 | 0.66±0.04 | 0.71±0.08 | 0.55±0.04 | 0.68±0.03 |
| PAMAP2 | 9.84±0.04 | 9.27±0.06 | 16.74±0.15 | 18.05±0.18 | 4.69±0.13 | 18.32±0.09 | 7.84±0.09 | 7.63±0.07 | 4.81±0.07 | 6.45±0.06 |
For a fair comparison, as presented in Table I, we adopt the same experimental settings as [24] in the one-shot and multi-shot cases, which are elaborated as follows:

- The number of streaming samples in each batch is the same, and the sample number in each class is equal for all training and testing batches.
- In the T-stage, the total number of training samples is fixed while the sample number in each batch varies. Consequently, the number of training and evaluation samples also varies in the last evaluation phase.
- We allocate the first quarter of the features, the middle half and the last quarter as vanished features, survived features and new augmented features, respectively; that is, the first and last quarters correspond to the vanished and augmented features in our experiments.
- The performance of each run may differ slightly due to the influence of the computer system and simulation environment, even under identical experimental settings. To circumvent this randomness, all experimental results are averaged over fifty random runs, which more convincingly illustrates the superiority of our EML model.
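The feature partition described above can be sketched as follows (a minimal illustration; `partition_features` is our own helper name, not from the paper):

```python
import numpy as np

def partition_features(X):
    """Split the columns of X (n_samples x d) into the three groups used in
    the experiments: first quarter vanished, middle half survived, last
    quarter augmented."""
    d = X.shape[1]
    q = d // 4
    vanished = X[:, :q]          # only observed in the T-stage
    survived = X[:, q:d - q]     # shared by the T-stage and I-stage
    augmented = X[:, d - q:]     # only observed in the I-stage
    return vanished, survived, augmented

X = np.random.randn(10, 20)
v, s, a = partition_features(X)
print(v.shape, s.shape, a.shape)  # (10, 5) (10, 10) (10, 5)
```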
V-A2 Competing Methods
We validate the superior performance of our EML model by comparing it with the following competing methods. One-pass Pegasos [37] assumes that the vanished and augmented features are available in the different feature evolution stages; OPMV [38] regards the features in the T-stage and I-stage as the first and second views; TCA [39] assumes that the streaming samples in the T-stage and I-stage are drawn from the source and target distributions; BDML [2], OPML [7] and CDML [40] are representative metric learning methods that only utilize the samples with the augmented features and ignore the previously vanished features. As for the feature evolution approaches, OPID and OPIDe [24] are one-pass incremental and decremental learning models for feature evolution, and FIRF [28] designs a feature-incremental random forest framework to tackle the emergence of new sensors (i.e., new augmented features) in a dynamic environment.
V-B Experiments in One-shot Scenario
In this subsection, we present the experimental analysis, ablation studies, effects of hyper-parameters, convergence investigation and computational costs of our proposed EML model in the one-shot scenario.
V-B1 Experimental Analysis
The experimental results for the one-shot scenario are presented in Table II. From the reported performance, we have the following observations: 1) Although our proposed model loses access to the vanished features, the transforming and inheriting strategies efficiently exploit their useful information and expand it into the new augmented features in the I-stage. 2) Our proposed EML model can be successfully applied to both high-dimensional (e.g., EV-Action, Gisette and ImageNet) and low-dimensional (e.g., Satimage and Splice) feature evolution, which are challenging tasks that require exploring the intrinsic data structure and informative knowledge of the existing features. 3) When we utilize the distance matrix learned in the T-stage to assist the training procedure in the I-stage, the evaluation performance of our proposed model increases significantly, even though the training samples in the I-stage are relatively rare, i.e., the I-stage contains only a small number of samples. 4) Our EML model performs better than OPID and OPIDe [24], since the T-stage explores important information from the vanished features, and the I-stage efficiently inherits the metric performance from the T-stage to take advantage of the new augmented features.
| Dataset | | Ours-woLR | Ours |
|---|---|---|---|
| EV-Action | 500 | 57.61±0.56 | 58.33±0.76 |
| EV-Action | 600 | 57.46±0.89 | 57.94±0.88 |
| EV-Action | 700 | 57.35±1.34 | 58.12±0.86 |
| Mnist 0vs3vs5 | 120 | 96.18±1.36 | 96.53±1.49 |
| Mnist 0vs3vs5 | 240 | 94.62±1.04 | 95.37±0.92 |
| Mnist 0vs3vs5 | 480 | 95.20±1.01 | 95.54±0.87 |
| Splice | 80 | 82.16±0.94 | 82.65±3.32 |
| Splice | 160 | 84.69±1.93 | 85.25±2.06 |
| Splice | 320 | 85.76±1.45 | 86.40±1.59 |
| Gisette | 100 | 96.65±1.23 | 97.29±1.25 |
| Gisette | 200 | 96.40±1.16 | 96.82±0.91 |
| Gisette | 300 | 97.08±0.75 | 97.79±0.46 |
| USPS 0vs3vs5 | 180 | 94.59±1.05 | 95.28±0.96 |
| USPS 0vs3vs5 | 240 | 94.51±1.27 | 94.96±1.37 |
| USPS 0vs3vs5 | 300 | 93.29±1.18 | 94.05±1.46 |
| Satimage | 60 | 98.42±1.22 | 98.97±0.95 |
| Satimage | 90 | 98.32±1.30 | 98.71±1.13 |
| Satimage | 120 | 98.11±1.05 | 98.53±1.20 |
V-B2 Ablation Studies
To verify the effectiveness of our EML model, we study the effects of its different components, i.e., training without the T-stage (denoted as Ours-woT), training without the I-stage (denoted as Ours-woI) and training without the Wasserstein distance metric (denoted as Ours-woW). Ours-woW is evaluated under the Mahalanobis distance metric. From the results presented in Table III, our proposed EML model performs best when the transforming and inheriting strategies work together to tackle incremental and decremental features via the Wasserstein distance metric, which validates the design of our proposed model. Compared with other metric distances (e.g., the Mahalanobis distance), the smoothed Wasserstein distance better mines the similarity relationships between heterogeneous and complex streaming samples, since the evolving features are not strictly aligned in different stages. Both the T-stage and I-stage play an essential role in tackling the instance and feature evolutions simultaneously.
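The smoothed Wasserstein distance used here is the entropy-regularized optimal transport distance computed by Sinkhorn iterations [32]. The following is a minimal numpy sketch of that computation (our own simplified implementation for illustration, not the paper's code):

```python
import numpy as np

def sinkhorn_distance(a, b, M, reg=0.1, n_iter=200):
    """Entropy-regularized (smoothed) Wasserstein distance via Sinkhorn
    iterations. a, b: probability vectors; M: ground cost matrix;
    reg: entropy regularization strength."""
    K = np.exp(-M / reg)              # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):           # alternate marginal scalings
        v = b / (K.T @ u)
        u = a / (K @ v)
    T = u[:, None] * K * v[None, :]   # (approximate) optimal transport plan
    return float(np.sum(T * M))

# Toy check: identical histograms with zero diagonal cost give ~0 distance.
a = np.array([0.5, 0.5])
M = np.array([[0.0, 1.0], [1.0, 0.0]])
print(sinkhorn_distance(a, a, M))
```

Unlike the Mahalanobis distance, this distance compares whole distributions, so it does not require the evolving features of different stages to be strictly aligned.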
V-B3 Effects of Hyper-Parameters
In this subsection, as shown in Fig. 5, we conduct extensive parameter experiments on several representative datasets (Mnist0vs3vs5, USPS0vs5, EV-Action, PAMAP2, Splice and Gisette) to investigate the effects of the hyper-parameters in the one-shot scenario. Specifically, the performance of our proposed model is averaged over fifty random repetitions, and the hyper-parameters are empirically tuned over a wide selection range to choose their optimal values; we investigate the effect of each hyper-parameter while fixing the other. From the performance depicted in Fig. 5, we observe that our EML model has stable prediction performance over the wide selection range of different hyper-parameters. Moreover, our EML model achieves the best prediction performance under the same hyper-parameter values on most benchmark datasets, except for the Mnist0vs3vs5 dataset, which performs best under a different setting.
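The tuning protocol above can be sketched as a grid search whose scores are averaged over repeated runs (a generic illustration; `evaluate` is a hypothetical stand-in for one training/evaluation run, not the paper's code):

```python
import itertools
import random
import statistics

def evaluate(l1, l2, seed):
    """Hypothetical stand-in for training and evaluating the model once with
    hyper-parameters (l1, l2); here a noisy score peaked at (0.1, 0.01)."""
    random.seed(seed)
    return random.gauss(0.9 - abs(l1 - 0.1) - abs(l2 - 0.01), 0.01)

grid = [1e-3, 1e-2, 1e-1, 1.0]
scores = {
    (l1, l2): statistics.mean(evaluate(l1, l2, s) for s in range(50))
    for l1, l2 in itertools.product(grid, grid)
}
best = max(scores, key=scores.get)
print(best)
```

Averaging over many seeded repetitions, as the paper does with fifty runs, makes the selected hyper-parameters robust to run-to-run noise.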
| Dataset | | Pegasos [37] | OPMV [38] | TCA [39] | BDML [2] | OPML [7] | CDML [40] | OPIDe [24] | OPID [24] | FIRF [28] | Ours |
|---|---|---|---|---|---|---|---|---|---|---|---|
| EV-Action | 500 | 54.25±1.42 | 53.60±1.53 | 50.63±1.89 | 53.26±1.18 | 51.36±1.48 | 52.77±0.69 | 54.61±1.53 | 54.27±1.24 | 54.16±0.57 | 55.93±1.04 |
| EV-Action | 600 | 55.59±1.27 | 53.72±1.66 | 51.83±1.96 | 53.74±0.82 | 53.11±1.36 | 52.74±1.28 | 54.43±1.15 | 53.46±1.92 | 53.84±1.07 | 56.74±1.35 |
| EV-Action | 700 | 54.19±1.28 | 53.52±1.74 | 51.95±1.36 | 53.84±0.59 | 52.54±1.83 | 53.61±0.94 | 54.38±1.19 | 54.62±1.02 | 55.04±1.30 | 56.19±0.77 |
| Mnist 0vs5 | 80 | 97.50±1.82 | 88.75±4.99 | 93.14±1.87 | 92.25±2.20 | 93.92±3.22 | 92.74±2.03 | 95.70±2.17 | 95.92±2.23 | 94.37±2.16 | 98.54±1.08 |
| Mnist 0vs5 | 160 | 97.56±1.28 | 90.75±3.02 | 91.78±2.43 | 95.70±1.26 | 94.34±1.54 | 95.87±1.85 | 95.53±1.61 | 95.29±1.80 | 94.16±1.85 | 98.61±0.57 |
| Mnist 0vs5 | 320 | 97.61±0.90 | 92.72±1.76 | 90.35±2.67 | 96.06±0.98 | 95.20±0.96 | 95.74±1.73 | 95.22±1.33 | 95.04±1.39 | 94.61±1.17 | 98.73±0.64 |
| Gisette | 100 | 91.58±2.87 | 86.24±4.76 | 91.28±2.56 | 90.48±3.29 | 89.87±3.62 | 90.26±2.71 | 95.08±2.35 | 94.36±1.88 | 92.88±2.06 | 96.12±1.18 |
| Gisette | 200 | 90.68±1.79 | 88.41±3.00 | 90.92±2.84 | 90.69±2.73 | 92.22±2.03 | 91.23±1.89 | 94.88±1.39 | 93.81±1.50 | 93.14±1.83 | 95.94±1.72 |
| Gisette | 300 | 91.18±1.13 | 89.42±2.27 | 92.13±2.14 | 92.52±1.71 | 91.83±1.61 | 92.06±1.36 | 94.65±1.12 | 93.91±1.45 | 93.08±1.26 | 95.71±1.68 |
| USPS 0vs5 | 120 | 97.48±0.12 | 94.95±2.46 | 92.87±2.16 | 96.08±1.36 | 93.83±2.13 | 95.84±1.71 | 94.77±1.62 | 94.61±1.69 | 93.92±1.53 | 98.57±0.94 |
| USPS 0vs5 | 160 | 97.56±0.19 | 95.17±2.18 | 93.05±1.94 | 96.24±1.55 | 94.21±1.66 | 95.49±1.33 | 94.13±1.54 | 94.51±1.30 | 93.75±1.28 | 98.68±0.65 |
| USPS 0vs5 | 240 | 97.37±0.41 | 95.58±1.06 | 92.72±2.33 | 97.57±0.67 | 94.62±1.25 | 95.84±2.03 | 93.92±1.47 | 93.62±1.16 | 92.64±1.09 | 98.39±0.72 |
| USPS 0vs3vs5 | 180 | 92.03±1.57 | 89.22±3.21 | 90.86±2.87 | 90.91±1.81 | 88.61±2.84 | 89.73±2.43 | 84.05±2.27 | 83.34±2.36 | 82.94±2.36 | 93.11±1.87 |
| USPS 0vs3vs5 | 240 | 90.90±1.40 | 89.13±2.05 | 90.24±2.93 | 91.98±2.04 | 89.62±2.17 | 90.62±1.87 | 84.68±1.73 | 84.49±1.92 | 83.75±1.81 | 93.23±1.58 |
| USPS 0vs3vs5 | 300 | 90.48±1.24 | 89.52±1.59 | 89.85±3.16 | 92.61±1.23 | 89.46±1.80 | 90.84±1.56 | 83.25±1.61 | 83.17±1.66 | 84.23±1.42 | 93.13±1.55 |
| ImageNet | 10000 | 52.76±1.55 | 48.11±2.30 | 47.82±3.06 | 49.04±2.83 | 49.74±2.49 | 49.43±2.59 | 50.42±1.57 | 50.33±2.26 | 48.42±2.01 | 53.92±1.62 |
| ImageNet | 12000 | 53.81±1.44 | 48.15±2.06 | 48.82±2.47 | 50.17±2.42 | 50.63±1.42 | 49.95±2.51 | 52.36±2.13 | 53.09±2.28 | 51.64±1.93 | 54.95±2.17 |
| ImageNet | 14000 | 55.82±2.05 | 48.93±3.28 | 48.36±2.66 | 50.93±2.87 | 51.62±2.62 | 50.51±2.18 | 53.62±1.83 | 54.96±2.10 | 51.63±2.14 | 57.03±1.92 |
| PAMAP2 | 600 | 89.07±1.25 | 86.71±1.22 | 82.83±2.17 | 85.82±1.55 | 83.06±1.39 | 84.28±1.93 | 90.42±1.25 | 90.37±1.52 | 91.28±0.66 | 92.75±0.97 |
| PAMAP2 | 700 | 88.47±1.58 | 87.05±1.42 | 83.23±1.69 | 85.52±1.74 | 82.12±1.26 | 85.68±1.64 | 88.63±1.33 | 90.18±0.94 | 90.57±1.24 | 92.51±0.92 |
| PAMAP2 | 800 | 88.74±1.03 | 88.05±1.64 | 82.19±2.26 | 85.48±1.94 | 84.95±1.68 | 86.26±1.33 | 90.87±0.92 | 91.73±1.15 | 91.62±0.81 | 92.38±1.03 |
V-B4 Convergence Investigations
The convergence condition of Algorithm 1 is a small threshold on the change in consecutive objective function values, and Fig. 6 depicts the convergence curves of our EML model on the Mnist and USPS datasets. From the results presented in Fig. 6, we notice that the objective value of our EML model converges asymptotically to a stable value after a few iterations. This validates that our proposed optimization algorithm can efficiently achieve stable performance under an appropriate convergence condition.
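The stopping rule above can be sketched as follows (a generic illustration with a toy objective and our own helper names, not the paper's Algorithm 1):

```python
def minimize(objective, step, x0, tol=1e-4, max_iter=1000):
    """Iterate until the change in consecutive objective values is below tol."""
    x, prev = x0, objective(x0)
    for _ in range(max_iter):
        x = step(x)
        cur = objective(x)
        if abs(prev - cur) < tol:   # little change => converged
            break
        prev = cur
    return x

# Toy usage: gradient descent on f(x) = (x - 3)^2 converges near x = 3.
f = lambda x: (x - 3.0) ** 2
step = lambda x: x - 0.1 * 2.0 * (x - 3.0)
x_star = minimize(f, step, x0=0.0)
print(round(x_star, 2))
```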
V-B5 Computational Costs
Table IV presents the computational costs (i.e., optimization time) of our proposed EML model and the competing methods. From the reported results, we draw the following conclusions: 1) Our model is computationally efficient to optimize in an online manner for real-world applications, following the complexity analysis above. 2) The time costs (in minutes) of our model are about 0.67 to 13.71 minutes lower than those of the other competing methods on most experimental datasets, except for OPML [7], since OPML only uses the training samples in the I-stage for its optimization procedure.
V-B6 Effect Investigation of Low Rank Constraint
This subsection investigates the effectiveness of the low-rank regularizer in our proposed EML model, as shown in Table V. We substitute the low-rank constraint with the Frobenius norm, and denote the resulting variant as Ours-woLR. The results presented in Table V clearly demonstrate that the performance of our proposed EML model degrades in accuracy when the low-rank constraint is abandoned. This illustrates that, by incorporating the low-rank regularizer, our EML model effectively explores the intrinsic low-rank structure of heterogeneous samples across the different evolving features.
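Low-rank (nuclear norm) regularizers of this kind are commonly optimized with the singular value thresholding operator [36]; a minimal numpy sketch of that operator (our own illustration, not the paper's implementation):

```python
import numpy as np

def svt(A, tau):
    """Singular value thresholding: the proximal operator of the nuclear
    norm, which shrinks all singular values of A by tau."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s = np.maximum(s - tau, 0.0)   # small singular values are zeroed out
    return (U * s) @ Vt

# Thresholding removes small singular values, lowering the rank.
A = np.diag([5.0, 1.0, 0.2])
B = svt(A, 0.5)
print(np.linalg.matrix_rank(B))  # 2: the 0.2 singular value is removed
```

This is why the low-rank constraint, unlike the Frobenius norm, can suppress non-informative directions in the learned metric space.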
V-C Experiments in Multi-shot Scenario
This subsection introduces the experimental configurations and comparison performance of our proposed EML model in multi-shot scenario.
V-C1 Experimental Configurations
In the multi-shot scenario, we consider the two-shot case with three stages for illustration, as depicted in Fig. 3. The streaming samples used in the one-shot scenario are split into three stages. Beyond the configurations introduced for the one-shot scenario, the additional experimental configurations for the multi-shot scenario are summarized as follows:

- All batches of the T-stage in the one-shot scenario are split into Stage 1 and Stage 2 with an equal number of samples, as shown in Fig. 3. Under this setting, the survived features in Stage 2 become the vanished features in Stage 3, and the new augmented features in Stage 2 become the survived features in Stage 3. In other words, Stage 1 and Stage 2 are considered as the T-stage and I-stage for the first feature evolution, while Stage 2 and Stage 3 are regarded as the T-stage and I-stage for the second feature evolution.
- The features are divided into four equal parts with the same partition order as in the one-shot scenario. Concretely, the second quarter is shared by Stage 1 and Stage 2, and the third quarter is shared by Stage 2 and Stage 3. The first quarter in Stage 1 and the last quarter in Stage 3 denote the vanished and new augmented features, respectively.
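The stage-wise feature layout described above can be sketched as follows (our own notation; `stage_features` is a hypothetical helper that slices the quarters Q1..Q4, with each stage observing two adjacent quarters):

```python
import numpy as np

def stage_features(X, stage):
    """Return the feature block observed in a given stage (1, 2 or 3):
    stage 1 sees Q1+Q2, stage 2 sees Q2+Q3, stage 3 sees Q3+Q4."""
    d = X.shape[1]
    q = d // 4
    start = (stage - 1) * q
    return X[:, start:start + 2 * q]

X = np.random.randn(6, 16)
s1, s2, s3 = (stage_features(X, k) for k in (1, 2, 3))
# Adjacent stages share one quarter: Q2 between stages 1-2, Q3 between 2-3.
print(s1.shape, np.array_equal(s1[:, 4:], s2[:, :4]))
```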
V-C2 Experiments for Task I and II
To address Task I in the multi-shot case, we directly use the last two adjacent evolution stages and regard them as a one-shot scenario for prediction, since the streaming data in any two adjacent stages share common features. To be specific, we first apply the transforming strategy in Eq. (4) to the streaming data in Stage 2 to learn the discriminative distance matrix, and then apply the inheriting strategy in Eq. (6) to classify the samples in Stage 3. To tackle Task II in the multi-shot scenario, we regard each pair of adjacent stages as a one-shot scenario (i.e., T-stage and I-stage) and repeat this procedure until the last stage. Specifically, the transforming and inheriting strategies are first applied to Stage 1 and Stage 2, after which we make predictions on the second batch of streaming data in Stage 2. After inheriting the metric performance of Stage 1, we extract the useful information from the survived features in Stage 2 and forward it into the new augmented features via the common discriminative space when new labeled streaming data arrive in Stage 2. Furthermore, we perform the same inheriting strategy on the survived features in Stage 2 to promote the prediction performance in Stage 3.
The experimental results of our proposed EML model, averaged over fifty random repetitions, for Task I and Task II are presented in Table VI and Fig. 7. Notice that: 1) Our model significantly outperforms the other competing methods (e.g., OPIDe and OPID [24]), especially in Task I, since it inherits the metric performance of the survived features in any two adjacent stages. 2) Compared with Task I, our proposed model performs better on Task II in most cases, since the survived features existing in Stage 1 effectively promote the predictions for the following streaming batches. 3) Our model can be successfully extended from the one-shot case to the multi-shot scenario to address both Task I and Task II, which further verifies its superior performance.
V-C3 Ablation Studies
In this subsection, we conduct extensive variant experiments on Task I and Task II to investigate the contribution of each component of our EML model in the multi-shot scenario, as shown in Table VII and Table VIII. We draw the following conclusions from the presented results: 1) All the designed components of our EML model cooperate to achieve the best performance for both Task I and Task II in the multi-shot scenario, which validates the effectiveness and necessity of each module. 2) The two complementary stages (i.e., T-stage and I-stage) effectively compress the important information from the vanished features and inherit the metric performance from the previous stage; they play an indispensable role in addressing the feature and instance evolutions simultaneously under the Wasserstein distance metric. 3) The performance degradation of Ours-woW illustrates the effectiveness of the smoothed Wasserstein distance in exploring the similarity relationships among heterogeneous samples from different stages.
| Dataset | | Ours-woT | Ours-woI | Ours-woW | Ours |
|---|---|---|---|---|---|
| Mnist 0vs5 | 80 | 94.93±0.34 | 94.26±0.38 | 96.04±0.62 | 98.54±1.08 |
| Mnist 0vs5 | 160 | 94.17±0.55 | 93.41±0.82 | 95.52±0.39 | 98.61±0.57 |
| Mnist 0vs5 | 320 | 96.13±0.59 | 95.47±0.85 | 96.84±1.03 | 98.73±0.64 |
| Gisette | 100 | 92.45±0.83 | 91.17±0.76 | 93.61±0.35 | 96.12±1.18 |
| Gisette | 200 | 92.84±0.72 | 92.04±0.28 | 94.12±0.46 | 95.94±1.72 |
| Gisette | 300 | 93.05±0.80 | 92.36±0.73 | 93.88±1.14 | 95.71±1.68 |
| USPS 0vs5 | 120 | 95.54±0.75 | 94.18±0.93 | 96.33±0.41 | 98.57±0.94 |
| USPS 0vs5 | 160 | 95.06±0.83 | 94.27±0.53 | 96.84±0.65 | 98.68±0.65 |
| USPS 0vs5 | 240 | 95.36±0.32 | 94.91±0.77 | 97.05±0.41 | 98.39±0.72 |
| USPS 0vs3vs5 | 180 | 90.15±0.19 | 89.35±0.87 | 90.94±0.51 | 93.11±1.87 |
| USPS 0vs3vs5 | 240 | 90.62±0.30 | 89.87±0.64 | 91.58±0.74 | 93.23±1.58 |
| USPS 0vs3vs5 | 300 | 90.86±0.81 | 90.22±0.63 | 92.08±0.26 | 93.13±1.55 |
| Dataset | | Ours-woT | Ours-woI | Ours-woW | Ours |
|---|---|---|---|---|---|
| Mnist 0vs5 | 80 | 94.58±1.48 | 93.26±1.71 | 95.93±1.22 | 98.30±1.18 |
| Mnist 0vs5 | 160 | 95.71±1.29 | 93.62±1.48 | 96.04±0.98 | 98.13±1.16 |
| Mnist 0vs5 | 320 | 95.25±0.93 | 94.54±0.89 | 95.87±1.13 | 98.24±1.34 |
| Gisette | 100 | 94.32±0.88 | 93.68±1.09 | 95.02±0.95 | 97.22±1.37 |
| Gisette | 200 | 93.10±1.27 | 92.29±1.48 | 94.21±1.07 | 96.90±1.18 |
| Gisette | 300 | 93.74±1.18 | 92.16±1.26 | 94.23±0.89 | 96.92±1.70 |
| USPS 0vs5 | 120 | 96.15±1.36 | 95.62±1.26 | 96.87±0.88 | 98.17±1.18 |
| USPS 0vs5 | 160 | 94.77±1.38 | 94.46±1.52 | 95.45±1.13 | 98.14±0.97 |
| USPS 0vs5 | 240 | 95.32±1.36 | 94.37±1.65 | 96.14±0.76 | 98.67±0.62 |
| USPS 0vs3vs5 | 180 | 90.38±1.47 | 90.56±1.51 | 91.06±1.05 | 93.94±1.50 |
| USPS 0vs3vs5 | 240 | 90.43±1.26 | 89.85±1.34 | 91.88±1.17 | 93.34±1.38 |
| USPS 0vs3vs5 | 300 | 90.35±1.18 | 90.83±1.43 | 91.46±0.84 | 94.58±1.25 |
VI Conclusion
In this paper, an online Evolving Metric Learning (EML) model is proposed for both instance and feature evolutions, and is successfully applied to one-shot and multi-shot scenarios. Our proposed EML model contains two essential stages, i.e., a Transforming stage (T-stage) and an Inheriting stage (I-stage). To be specific, in the T-stage, we utilize the survived features to characterize the effective information extracted from the vanished and survived features by exploiting a common discriminative metric space. In the I-stage, we inherit the metric performance of the survived features from the T-stage and extend it to the new augmented features. Furthermore, we apply the smoothed Wasserstein distance to both stages to better explore the similarity relationships of heterogeneous streaming data among the different evolution stages. Extensive experiments on several representative datasets show the superior performance of our proposed EML model. In the future, we will consider lifelong machine learning for both instance and feature evolutions, which continually learns a sequence of new streaming evolution tasks without catastrophically forgetting the previously learned evolution tasks.
References
- [1] H. V. Nguyen and L. Bai, “Cosine similarity metric learning for face verification,” in Computer Vision – ACCV 2010, R. Kimmel, R. Klette, and A. Sugimoto, Eds., 2011, pp. 709–720.
- [2] J. Xu, L. Luo, C. Deng, and H. Huang, “Bilevel distance metric learning for robust image recognition,” in Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018, pp. 4202–4211.
- [3] Z. Boukouvalas, “Distance metric learning for medical image registration,” 2011.
- [4] G. Chechik, V. Sharma, U. Shalit, and S. Bengio, “Large scale online learning of image similarity through ranking,” J. Mach. Learn. Res., vol. 11, pp. 1109–1135, Mar. 2010.
- [5] P. Jain, B. Kulis, I. S. Dhillon, and K. Grauman, “Online metric learning and fast similarity search,” in Advances in Neural Information Processing Systems 21, 2009, pp. 761–768.
- [6] K. Q. Weinberger and L. K. Saul, “Distance metric learning for large margin nearest neighbor classification,” J. Mach. Learn. Res., vol. 10, pp. 207–244, Jun. 2009.
- [7] W. Li, Y. Gao, L. Wang, L. Zhou, J. Huo, and Y. Shi, “Opml: A one-pass closed-form solution for online metric learning,” Pattern Recognition, vol. 75, pp. 302 – 314, 2018.
- [8] R. Jin, S. Wang, and Y. Zhou, “Regularized distance metric learning:theory and algorithm,” in Advances in Neural Information Processing Systems 22, 2009, pp. 862–870.
- [9] B. Shaw, B. Huang, and T. Jebara, “Learning a distance metric from a network,” in Advances in Neural Information Processing Systems 24, 2011, pp. 1899–1907.
- [10] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon, “Information-theoretic metric learning,” in Proceedings of the 24th International Conference on Machine Learning. ACM, 2007, pp. 209–216.
- [11] J. Hu, J. Lu, and Y. Tan, “Deep metric learning for visual tracking,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 26, no. 11, pp. 2056–2068, 2016.
- [12] Y. Duan, J. Lu, J. Feng, and J. Zhou, “Deep localized metric learning,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 10, pp. 2644–2656, 2018.
- [13] L. Wang, B. Sun, J. P. Robinson, T. Jing, and Y. Fu, “Ev-action: Electromyography-vision multi-modal action dataset,” arXiv preprint arXiv:1904.12602, 2019.
- [14] C. K. Ho, A. Robinson, D. R. Miller, and M. J. Davis, “Overview of sensors and needs for environmental monitoring,” Sensors, vol. 5, no. 1, pp. 4–37, 2005.
- [15] S. Shalev-Shwartz, Y. Singer, and A. Y. Ng, “Online and batch learning of pseudo-metrics,” in Proceedings of the Twenty-first International Conference on Machine Learning. ACM, 2004, p. 94.
- [16] B. Nguyen and B. De Baets, “Kernel-based distance metric learning for supervised k-means clustering,” IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 10, pp. 3084–3095, Oct 2019.
- [17] Q. Qian, R. Jin, J. Yi, L. Zhang, and S. Zhu, “Efficient distance metric learning by adaptive sampling and mini-batch stochastic gradient descent (sgd),” Machine Learning, vol. 99, no. 3, pp. 353–372, Jun 2015.
- [18] X. Gao, S. C. H. Hoi, Y. Zhang, J. Wan, and J. Li, “Soml: Sparse online metric learning with application to image retrieval,” in AAAI, 2014.
- [19] H. Xia, S. C. H. Hoi, R. Jin, and P. Zhao, “Online multiple kernel similarity learning for visual search,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 3, pp. 536–549, March 2014.
- [20] Z. Ding, M. Shao, W. Hwang, S. Suh, J.-J. Han, C. Choi, and Y. Fu, “Robust discriminative metric learning for image representation,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 11, pp. 3173–3183, Nov. 2019.
- [21] J. Yu, X. Yang, F. Gao, and D. Tao, “Deep multimodal distance metric learning using click constraints for image ranking,” IEEE Transactions on Cybernetics, vol. 47, no. 12, pp. 4014–4024, 2017.
- [22] J. Yu, Y. Rui, Y. Y. Tang, and D. Tao, “High-order distance-based multiview stochastic learning in image classification,” IEEE Transactions on Cybernetics, vol. 44, no. 12, pp. 2431–2442, 2014.
- [23] B.-J. Hou, L. Zhang, and Z.-H. Zhou, “Learning with feature evolvable streams,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017, pp. 1416–1426.
- [24] C. Hou and Z.-H. Zhou, “One-pass learning with incremental and decremental features,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, pp. 2776–2792, 2018.
- [25] H.-J. Ye, D.-C. Zhan, Y. Jiang, and Z.-H. Zhou, “Rectify heterogeneous models with semantic mapping,” in ICML, 2018.
- [26] Q. Zhang, P. Zhang, G. Long, W. Ding, C. Zhang, and X. Wu, “Towards mining trapezoidal data streams,” 2015 IEEE International Conference on Data Mining, pp. 1111–1116, 2015.
- [27] Q. Zhang, P. Zhang, G. Long, W. Ding, C. Zhang, and X. Wu, “Online learning from trapezoidal data streams,” IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 10, pp. 2709–2723, Oct 2016.
- [28] C. Hu, Y. Chen, X. Peng, H. Yu, C. Gao, and L. Hu, “A novel feature incremental learning method for sensor-based activity recognition,” IEEE Transactions on Knowledge and Data Engineering, vol. 31, no. 6, pp. 1038–1050, June 2019.
- [29] K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer, “Online passive-aggressive algorithms,” J. Mach. Learn. Res., vol. 7, pp. 551–585, Dec. 2006.
- [30] J. Xu, L. Luo, C. Deng, and H. Huang, “Multi-level metric learning via smoothed wasserstein distance.” in IJCAI, 2018, pp. 2919–2925.
- [31] R. Sandler and M. Lindenbaum, “Nonnegative matrix factorization with earth mover’s distance metric for image analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, pp. 1590–1602, Aug 2011.
- [32] M. Cuturi, “Sinkhorn distances: Lightspeed computation of optimal transport,” in Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, 2013, pp. 2292–2300.
- [33] A. Rolet, M. Cuturi, and G. Peyré, “Fast dictionary learning with a smoothed wasserstein loss,” in Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, vol. 51, 09–11 May 2016, pp. 630–638.
- [34] L. Breiman, “Stacked regressions,” Machine Learning, vol. 24, pp. 49–64, Jul 1996.
- [35] Z.-H. Zhou, Ensemble Methods: Foundations and Algorithms. Chapman & Hall/CRC, 2012.
- [36] J.-F. Cai, E. J. Candès, and Z. Shen, “A singular value thresholding algorithm for matrix completion,” SIAM Journal on Optimization, vol. 20, no. 4, pp. 1956–1982, 2010.
- [37] S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter, “Pegasos: primal estimated sub-gradient solver for svm,” Mathematical Programming, vol. 127, pp. 3–30, Mar 2011.
- [38] Y. Zhu, W. Gao, and Z.-H. Zhou, “One-pass multi-view learning,” in ACML, 2015.
- [39] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang, “Domain adaptation via transfer component analysis,” IEEE Transactions on Neural Networks, vol. 22, no. 2, pp. 199–210, Feb 2011.
- [40] S. Chen, L. Luo, J. Yang, C. Gong, J. Li, and H. Huang, “Curvilinear distance metric learning,” in Advances in Neural Information Processing Systems 32, 2019.
- [41] A. Reiss and D. Stricker, “Introducing a new benchmarked dataset for activity monitoring,” in 2012 16th International Symposium on Wearable Computers, 2012, pp. 108–109.
- [42] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.
- [43] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” arXiv preprint arXiv:1512.03385, 2015.
Jiahua Dong is currently a Ph.D. candidate at the State Key Laboratory of Robotics, Shenyang Institute of Automation, Chinese Academy of Sciences. He received the B.S. degree from Jilin University in 2017. His current research interests include computer vision, machine learning, transfer learning, domain adaptation and medical image processing.

Yang Cong (S’09-M’11-SM’15) is a full professor at the Chinese Academy of Sciences. He received the B.Sc. degree from Northeast University in 2004 and the Ph.D. degree from the State Key Laboratory of Robotics, Chinese Academy of Sciences in 2009. He was a Research Fellow at the National University of Singapore (NUS) and Nanyang Technological University (NTU) from 2009 to 2011, and a visiting scholar at the University of Rochester. He has served on the editorial board of the Journal of Multimedia. His current research interests include image processing, computer vision, machine learning, multimedia, medical imaging, data mining and robot navigation. He has authored over 70 technical papers. He is a senior member of the IEEE.

Gan Sun (S’19) is an Assistant Professor at the State Key Laboratory of Robotics, Shenyang Institute of Automation, Chinese Academy of Sciences. He received the B.S. degree from Shandong Agricultural University in 2013 and the Ph.D. degree from the State Key Laboratory of Robotics, Chinese Academy of Sciences in 2020, and was a visiting scholar at Northeastern University from April 2018 to May 2019 and at the Massachusetts Institute of Technology from June 2019 to November 2019. His current research interests include lifelong machine learning, multi-task learning, medical data analysis, deep learning and 3D computer vision.

Tao Zhang is currently working toward the Ph.D. degree in pattern recognition and intelligent systems at the State Key Laboratory of Robotics, Shenyang Institute of Automation, Chinese Academy of Sciences, Shenyang, China. His research interests include pattern recognition, image processing, tactile sensing and robotics.

Xu Tang is currently a research associate at the State Key Laboratory of Robotics, Shenyang Institute of Automation. He received the MESc degree from Harbin Institute of Technology in 2017. His current research interests include computer vision and machine learning.

Xiaowei Xu is a professor of Information Science at the University of Arkansas at Little Rock (UALR). He received a B.Sc. degree in Mathematics from Nankai University in 1983 and a Ph.D. degree in Computer Science from the University of Munich in 1998. He holds an adjunct professor position in the Department of Mathematics and Statistics at the University of Arkansas at Fayetteville. Before his appointment at UALR, he was a senior research scientist at Siemens, and a visiting professor at Microsoft Research Asia and the Chinese University of Hong Kong. His research spans data mining, machine learning, bioinformatics, data management and high performance computing. He has published over 70 papers in peer-reviewed journals and conference proceedings. His groundbreaking work on the density-based clustering algorithm DBSCAN has been widely used in many textbooks and has received over 10203 citations on Google Scholar. Dr. Xu is a recipient of the 2014 ACM KDD Test of Time Award, which “recognizes outstanding papers from past KDD Conferences beyond the last decade that have had an important impact on the data mining research community.”