Self-Supervised Time Series Representation Learning via Cross Reconstruction Transformer
Abstract
Since labeled samples are typically scarce in real-world scenarios, self-supervised representation learning for time series is critical. Existing approaches mainly employ the contrastive learning framework, which automatically learns to distinguish similar and dissimilar data pairs. However, they are constrained by the need for cumbersome sampling policies and prior knowledge for constructing pairs. Moreover, few works have focused on effectively modeling temporal-spectral correlations to improve the capacity of representations. In this paper, we propose the Cross Reconstruction Transformer (CRT) to solve these issues. CRT achieves time series representation learning through a cross-domain dropping-reconstruction task. Specifically, we obtain the frequency domain of the time series via the Fast Fourier Transform and randomly drop certain patches in both the time and frequency domains. Dropping is employed because it maximally preserves the global context, whereas masking causes distribution shift. A Transformer architecture is then utilized to adequately discover the cross-domain correlations between temporal and spectral information by reconstructing data in both domains, which we call Dropped Temporal-Spectral Modeling. To discriminate the representations in the global latent space, we propose an Instance Discrimination Constraint to reduce the mutual information between different time series samples and sharpen the decision boundaries. Additionally, a specified curriculum learning strategy is employed to improve robustness during the pre-training phase by progressively increasing the dropping ratio over the training process. We conduct extensive experiments to evaluate the effectiveness of the proposed method on multiple real-world datasets. Results show that CRT consistently outperforms existing methods by 2%–9%. The code is publicly available at https://github.com/BobZwr/Cross-Reconstruction-Transformer.
Index Terms:
Time series, Self-supervised learning, Transformer, Cross domain
I Introduction
Time series analysis [1, 2, 3, 4, 5, 6, 7, 8] is critical in a variety of real-world applications, such as transportation, medicine, finance, and industry. With the success of deep learning, various tasks in time series analysis have achieved strong performance, including time series classification [3, 4], forecasting [5, 6], and anomaly detection [7, 8, 9]. However, since labeled time series are usually difficult to acquire [10, 11], it is essential to learn representations from time series data in an unsupervised manner. Self-supervised learning is an emerging approach that designs a pretext task to automatically generate “labels” for supervision while minimizing annotation effort.

Existing works on self-supervised learning can be grouped into two categories: contrast-based and reconstruction-based methods. Contrast-based methods [12, 13, 14] are the mainstream approach to self-supervised representation learning for time series. They mainly apply a segment-level sampling policy to construct positive and negative pairs. The model is then forced to maximize the similarity of positive pairs while minimizing the similarity of negative pairs in the feature space. For example, Contrastive Predictive Coding (CPC) [15] conducts representation learning by using powerful auto-regressive models in latent space to predict the future, relying on Noise-Contrastive Estimation [16] for the loss function. Temporal and Contextual Contrasting (TS-TCC) [17] improves upon CPC and learns robust representations through a harder prediction task against perturbations introduced by different timestamps and augmentations. Reconstruction-based methods [18, 19] focus on utilizing the long-term context information of the time series with an encoder-decoder architecture. The pretext task of these methods is reconstructing the original data, which may be masked to enhance inference ability. With the success of Transformer architectures [20, 21, 22], a recent work [23] proposes a Transformer-based self-supervised learning framework for multivariate time series for the first time.
However, the existing approaches (both contrast-based and reconstruction-based) neglect to exploit the spectral information and the temporal-spectral correlations. The frequency domain offers another perspective on time series data, and different frequency bands imply different states [24, 25]. Besides, each type of method is constrained by its own drawbacks: 1) the segment-level sampling policy used in contrastive learning may lead to sampling bias, and the performance is unstable due to the high dependency on how negative and positive pairs are constructed; 2) the distribution shift caused by the masking process in reconstruction-based methods leads to a gap between the pre-training and fine-tuning phases. The current masking process sets parts of the time series to 0, introducing noise into the training process and destroying the shape of the time series.
To solve these problems, we propose Cross Reconstruction Transformer (CRT) for self-supervised time series representation learning. Schematic comparisons for traditional reconstruction-based, contrast-based, and CRT methods are illustrated in Figure 1. Our contributions include:
•
To simultaneously model temporal-spectral information and discover their correlations, we apply a Transformer encoder on both time-domain and frequency-domain data to automatically fuse cross-domain features. To adequately exploit the spectral information, we introduce the phase of the spectral data in addition to the magnitude.
•
To avoid the instability occurring in contrast-based methods, we reconstruct the original data in temporal-spectral domains based on the cross-domain representations. In addition, we propose a specified curriculum learning strategy to improve the robustness of the training process, which progressively increases the dropping ratio in the pre-training process.
•
To tackle the problem of distribution shift caused by masking in existing reconstruction-based methods, we employ a simple but reasonable “masking” method, which is randomly dropping certain parts of data in time and frequency domains. Instead of setting values as 0, dropping discards the “masked” segments and preserves the original distribution of the time series.
•
We evaluate the representations learned by our proposed method on three datasets, and the results demonstrate that CRT achieves the best performance in terms of effectiveness and robustness, with a performance gain of 2%–9%.
II Related Work
II-A Self-supervised learning
As an emerging research field, self-supervised learning has drawn much attention, especially for image and text data [26, 27, 28, 29, 30]. Self-supervised learning manually designs “pretext tasks” on unlabelled data, which provide “supervised” signals for deep learning models. The model is expected to capture inherent characteristics of large-scale unlabelled data and show better performance on downstream tasks compared to training from scratch [31].
For text data, BERT [32] is a typical example of self-supervised learning; it pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. Due to BERT’s strong performance, multiple variations were proposed. For example, SpanBERT masks random contiguous spans rather than random individual tokens [33]. ERNIE proposes entity-level and phrase-level masking, which integrates phrase- and entity-level knowledge into the language representation [34]. For image data, contrastive learning is more popular for learning representations in an unsupervised way. Contrastive learning methods typically rely on discriminating manually constructed pairs through an InfoNCE loss [15], which maximizes the similarity between positive instances and minimizes the similarity between negatives. The main challenge is constructing reasonable and practical positive and negative sets that facilitate the model in capturing similarities and differences among instances. Some works construct pairs via data augmentation [29, 35], clustering [36, 37], and so on. Contrastive learning has also been applied to other fields, such as graph representation learning [38] and multi-modal learning [39]. However, contrastive learning depends heavily on constructing positive and negative pairs, which makes its performance unstable and less robust. Recent works have started to apply reconstruction-based self-supervised learning to images and other data [40, 41]; these methods are straightforward but show strong performance.
II-B Self-supervised learning for time series
Self-supervised representation learning for sequence data has been well studied, but representation learning for time series still lags behind. Inspired by the well-studied self-supervised learning methods in computer vision and natural language processing, recent works mainly leverage the contrastive learning framework for time series representation learning [14]. Researchers design different time-slicing strategies to construct positive and negative pairs, under the assumption that temporally adjacent fragments can be viewed as positive samples while remote fragments are treated as negative samples [42]. Unsupervised Scalable Representation Learning [12] introduces a novel unsupervised loss with time-based negative sampling to train a scalable encoder shaped as a deep convolutional neural network with dilated convolutions [43]. Temporal Neighborhood Coding [13] introduces the concept of a temporal neighborhood with stationary properties as the distribution of similar windows in time. Nevertheless, these time-based methods are sometimes unreasonable for long time series and fail to capture long-term dependencies [14]. Besides, their performance deteriorates substantially when applied to downstream tasks containing periodic time series.
A recent work [23] proposes a Transformer-based reconstruction self-supervised task for time series data for the first time. This method masks part of the original time series by setting the values to zero and uses a linear layer to reconstruct the masked data. The results show that reconstructing the masked data helps extract dense vector representations of multivariate time series and facilitates models in understanding contextual information. However, masking the original data introduces a gap between the pre-training and fine-tuning stages, because it significantly changes the original distribution of the time series. In addition, all of these methods, including contrastive learning methods, neglect an important characteristic of time series data: the frequency domain.

III Cross Reconstruction Transformer
The framework of CRT is shown in Figure 2. CRT is a Transformer-based time-frequency cross-reconstruction method whose goal is to learn representations combining useful information from both domains. Generally, the model is first pre-trained following our CRT framework and then transferred to downstream tasks. In this section, we describe CRT in detail. All important notations are listed in Table I.
We first summarize the drawbacks of existing self-supervised learning methods for time series data:
•
The existing reconstruction methods for time series mask the original data and reconstruct it. However, masking (setting values to zero) can significantly change the original pattern of the time series and introduce noise into the pre-training process, which deteriorates performance. To tackle this problem, we propose the dropping approach, discussed in Section III-A.
•
Most representation learning methods for time series neglect the frequency domain, which is a complementary perspective to the time domain of time series. In Section III-B, we propose a novel way to better utilize the spectral information and model temporal-spectral correlations.
Combining the above two modules, we propose our Dropped Temporal-Spectral Modeling for time series and discuss it in Section III-C. In Section III-D, we introduce an effective progressive training strategy to improve the stability and capacity of our CRT.
III-A Dropping Rather Than Masking
Masking parts of the data and reconstructing them is a common paradigm in self-supervised learning [32, 40]. Inspired by this, some works mask the original time series by setting the values of certain temporal segments to 0 and reconstructing these segments [23]. However, this may significantly change the original pattern of the time series and introduce noise into the representation learning process. Besides, it may create a gap between the pre-training and fine-tuning stages since it causes distribution shift. As shown in Figure 3(b), masking significantly changes the shape and distribution of the original time series. Different from images or text, the shape is an important and unique pattern of a time series [44]. Thus masking introduces noise into the pre-training stage and deteriorates downstream performance.

In this work, we choose to drop some segments of the time series rather than masking them. Dropping discards certain parts of the input time series and feeds the rest of the data into the model. The difference between masking and dropping is shown in Figure 3. In detail, we slice the input into patches of the same length, and each patch is randomly discarded with probability $\gamma$:

$$\mathrm{drop}(p_i)=\begin{cases}1, & \epsilon_i < \gamma,\\ 0, & \text{otherwise},\end{cases}\tag{1}$$

where $\epsilon_i$ is a random value drawn from the uniform distribution $U(0,1)$ for the $i$-th patch $p_i$. In this way, the model receives segments of real data without corrupted zero portions. Adding the corresponding position information of the undropped parts to the Transformer model maximally preserves the global context and captures the intrinsic long-term dependencies.
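As a concrete illustration, the following minimal NumPy sketch implements this dropping step; the function and argument names (drop_patches, patch_len, gamma) are illustrative, not the authors' released code.

```python
import numpy as np

def drop_patches(x, patch_len, gamma, rng=None):
    """Slice a 1-D series into equal-length patches and randomly drop a
    fraction ~gamma of them, keeping the positions of the surviving patches."""
    rng = np.random.default_rng() if rng is None else rng
    n_patches = len(x) // patch_len
    patches = x[: n_patches * patch_len].reshape(n_patches, patch_len)
    # A patch is dropped when a uniform random value falls below gamma (Eq. 1).
    eps = rng.uniform(0.0, 1.0, size=n_patches)
    keep = eps >= gamma
    kept_positions = np.flatnonzero(keep)   # position ids fed to the Transformer
    kept_patches = patches[keep]            # real, uncorrupted segments
    return kept_patches, kept_positions

# Example: a 1,000-sample series, patch length 20, dropping ratio 0.3
x = np.sin(np.linspace(0, 20 * np.pi, 1000))
kept, pos = drop_patches(x, patch_len=20, gamma=0.3)
print(kept.shape, pos[:5])
```

Note that, unlike masking, no zero-valued placeholder patches ever enter the model; only real segments and their positions do.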
III-B Involving Spectral Information

When learning representations for time series, most existing methods neglect the spectral information. The frequency domain provides another perspective for discovering patterns in time series data, revealing characteristics that are not easily acquired in the time domain. Consequently, we explicitly feed both the time and frequency domains into the model, enabling it to learn temporal and spectral information and to model cross-domain correlations.
The most common method to transform a time series into the frequency domain is the Fast Fourier Transform (FFT) [45], a fast implementation of the Discrete Fourier Transform (DFT):

$$X[k]=\sum_{n=0}^{N-1} x[n]\, e^{-j 2\pi k n / N},\quad k=0,1,\dots,N-1,\tag{2}$$

where $x$ is the original time series, $N$ is the length of $x$, $k$ is the index of the frequency data, $j$ is the imaginary unit satisfying $j^2=-1$, and $X$ is the frequency-domain data. After the FFT, the original time series is transformed into the frequency domain and becomes a sequence of complex numbers. However, these complex numbers cannot be fed into neural networks directly. Most methods that consider the frequency domain store the spectral information using only the magnitude of the complex numbers. For any complex number $z=a+jb$, where $a$ and $b$ are arbitrary real numbers, the magnitude of $z$ is $|z|=\sqrt{a^2+b^2}$. However, using only the magnitude to restore the complex numbers causes information loss, because $z$ cannot be recovered from $|z|$ alone.
To tackle this problem, we propose a novel way to better utilize the spectral information rather than simply using the magnitude. We introduce the phase as follows:

$$\theta=\arctan\!\left(\frac{b}{a}\right),\tag{3}$$

where the quadrant of $(a,b)$ determines the final angle, i.e., $\theta=\operatorname{atan2}(b,a)$.

The phase is another characteristic of the frequency domain. Intuitively, a complex number $z=a+jb$ can be regarded as a point on a circle of radius $|z|$, with $\theta$ the angle between the vector $(a,b)$ and the positive direction of the $x$-axis. With the phase and the magnitude, we can restore the full spectral information represented by the complex number through

$$z=|z|\left(\cos\theta+j\sin\theta\right).\tag{4}$$
In addition to complementing the magnitude, the phase carries information closely related to the shape of the time series. We compare the information contained in the phase and the magnitude by restoring the original time series using only one of them. In detail, we compute the phase and magnitude of a time series and replace either the phase or the magnitude with a constant vector. In both cases, we restore the complex vector following Equation 4 and apply the inverse FFT to transform it back to the time domain. As shown in Figure 4, the original time series is more similar to the series restored with the correct phase. This intuitively illustrates the importance of phase information for the shape of time series data.
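The following NumPy sketch illustrates Equations 2–4 and the phase-versus-magnitude comparison described above; it is a toy demonstration, not the paper's implementation.

```python
import numpy as np

x = np.random.randn(1000)              # toy univariate time series
X = np.fft.fft(x)                      # complex spectrum (Eq. 2)
mag, phase = np.abs(X), np.angle(X)    # magnitude and phase (Eq. 3)

# Eq. 4: magnitude and phase together restore the full complex spectrum.
X_restored = mag * (np.cos(phase) + 1j * np.sin(phase))
assert np.allclose(X, X_restored)

# Replacing one component with a constant vector (as in Figure 4) loses information:
x_phase_only = np.fft.ifft(np.ones_like(mag) * np.exp(1j * phase)).real    # constant magnitude
x_mag_only = np.fft.ifft(mag * np.exp(1j * np.zeros_like(phase))).real     # constant (zero) phase
# x_phase_only typically tracks the shape of x far more closely than x_mag_only.
```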
Notation | Definition
---|---
$x$ | The original time series
$x_m$ | The magnitude of the spectrum of $x$
$x_p$ | The phase of the spectrum of $x$
$X$ | Concatenation of $x$, $x_m$, and $x_p$
$\gamma$ | Dropping ratio
$N$ | Number of patches of $X$
$B$ | Number of $X$ in a batch (batch size)
$X^{(i)}$ | The $i$-th $X$ in a batch
$E^{(i)}$ | The projection of $X^{(i)}$ by the convolutional neural networks
$Z^{(i)}$ | The projection of $E^{(i)}$ by the Transformer
$T_t$ | The [CLS] token for the time domain
$z_t^{(i)}$ | The projection of $T_t$ by the Transformer
$T_m$ | The [CLS] token for magnitude
$z_m^{(i)}$ | The projection of $T_m$ by the Transformer
$T_p$ | The [CLS] token for phase
$z_p^{(i)}$ | The projection of $T_p$ by the Transformer
III-C Dropped Temporal-Spectral Modeling
In this section, we introduce our Dropped Temporal-Spectral Modeling. For a given time series $x$, we first conduct the FFT to convert $x$ from the time domain to the frequency domain. The obtained vector is complex, and we store both its magnitude and its phase as described in Section III-B. Thus, the complex vector is transformed into two real vectors, $x_m$ (magnitude) and $x_p$ (phase), of the same length. We then concatenate $x$, $x_m$, and $x_p$ to obtain the input $X$.
For the input $X$, we drop parts of it as described in Section III-A. Each of the sliced patches contains only one type of data (time, magnitude, or phase). We then drop $\gamma N$ patches ($\gamma$ is the given dropping ratio and $N$ is the number of patches) and use the remaining patches to reconstruct the dropped ones. Patches containing time information are dropped at random. For patches containing frequency information (phase or magnitude), we drop the corresponding pair of patches (the phase and magnitude of the same segment of the complex vector). As explained in Section III-A, dropping patches reduces the gap between the pre-training and fine-tuning stages because it does not change the shape of the original data.
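A minimal sketch of this paired dropping is given below, assuming the concatenated input stores the time part first, followed by the magnitude and phase halves; the helper name and patch handling are illustrative.

```python
import numpy as np

def drop_cross_domain(X, len_time, patch_len, gamma, rng=None):
    """Drop time-domain patches independently; drop magnitude/phase patches in
    corresponding pairs. X is the concatenation [time | magnitude | phase] and
    every segment length is assumed to be divisible by patch_len."""
    rng = np.random.default_rng() if rng is None else rng
    n_time = len_time // patch_len
    n_freq = (len(X) - len_time) // (2 * patch_len)   # number of (mag, phase) patch pairs
    keep_time = rng.uniform(size=n_time) >= gamma
    keep_freq = rng.uniform(size=n_freq) >= gamma     # one decision per pair
    keep = np.concatenate([keep_time, keep_freq, keep_freq])
    patches = X.reshape(-1, patch_len)
    return patches[keep], np.flatnonzero(keep)        # kept patches and their positions

# Example: 1,000 time steps plus 500 magnitude and 500 phase values, patch length 20
x = np.random.randn(1000)
spec = np.fft.fft(x)[:500]
X = np.concatenate([x, np.abs(spec), np.angle(spec)])
kept, pos = drop_cross_domain(X, len_time=1000, patch_len=20, gamma=0.3)
```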
Before the Transformer encoder, we add three convolutional neural networks (CNNs) to extract local features for the three types of patches respectively. We hypothesize that the CNNs help extract local features, while the Transformer models cross-domain information on top of these local features. The patches from the time domain, the magnitude, and the phase are projected into three types of embeddings respectively. We then prepend a different learnable [CLS] token to each type of embedding (the [CLS] tokens before the time, magnitude, and phase embeddings are referred to as $T_t$, $T_m$, and $T_p$ respectively). The three types of embeddings are summed with their corresponding learnable domain-type embeddings and concatenated into a sequence $E$, where each embedding has dimension $D$. Before being fed into the Transformer, $E$ is summed with the corresponding learnable position embeddings.
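A schematic PyTorch sketch of this embedding stage is shown below; the single strided-convolution stems, head-free layout, and tensor shapes are simplifying assumptions (the paper actually uses 1-D ResNet-18 feature extractors, see Section IV-A2).

```python
import torch
import torch.nn as nn

class CrossDomainEmbedding(nn.Module):
    """Embed time / magnitude / phase patches, prepend one [CLS] token per domain,
    and add domain-type and positional embeddings (illustrative, simplified stems)."""
    def __init__(self, patch_len=20, dim=256, max_patches=512):
        super().__init__()
        # One stem per domain; a strided conv stands in for the 1-D ResNet extractor.
        self.stems = nn.ModuleList(
            [nn.Conv1d(1, dim, kernel_size=patch_len, stride=patch_len) for _ in range(3)])
        self.cls = nn.Parameter(torch.zeros(3, dim))     # [CLS] for time, magnitude, phase
        self.domain = nn.Parameter(torch.zeros(3, dim))  # domain-type embeddings
        self.pos = nn.Parameter(torch.zeros(max_patches, dim))

    def forward(self, parts, positions):
        # parts: three tensors of shape (B, 1, L_domain) holding the undropped patches
        # of each domain concatenated; positions: three LongTensors with the original
        # indices of those patches.
        tokens = []
        for i, (x, pos) in enumerate(zip(parts, positions)):
            emb = self.stems[i](x).transpose(1, 2)       # (B, n_kept_patches, dim)
            emb = emb + self.domain[i] + self.pos[pos]   # add type and position information
            cls = (self.cls[i] + self.domain[i]).view(1, 1, -1).expand(x.size(0), -1, -1)
            tokens.append(torch.cat([cls, emb], dim=1))
        return torch.cat(tokens, dim=1)                  # sequence fed to the Transformer
```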
We input the whole into the Transformer encoder and use the produced features to reconstruct the original data (time, magnitude, and phase). The learned representations are used to reconstruct the original data of both domains, so that the Transformer encoder can automatically capture complementary information from both domains.
Formally, in a batch there are $B$ inputs $\{X^{(i)}\}_{i=1}^{B}$. $X^{(i)}$ is projected to $E^{(i)}$ by the CNNs, and $E^{(i)}$ is projected by the Transformer to $Z^{(i)}$, which we use for downstream tasks. We concatenate each $Z^{(i)}$ with learnable decoding tokens and feed them into the decoder to reconstruct all patches. The reconstruction loss is:

$$\mathcal{L}_{rec}=\frac{1}{B}\sum_{i=1}^{B}\frac{1}{d}\left\|\hat{X}^{(i)}-X^{(i)}\right\|_2^2,\tag{5}$$

where $\hat{X}^{(i)}$ is the reconstructed input (including both dropped and undropped patches) and $d$ is the dimension of $X^{(i)}$.
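A minimal PyTorch sketch of this objective, read as a mean squared error over the full input averaged over the batch; this is a reasonable rendering of Equation 5, not the authors' exact code.

```python
import torch

def reconstruction_loss(x_hat: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    """x_hat, x: (B, d) reconstructed and original inputs; implements Eq. 5 as an MSE
    normalized by the input dimension d and averaged over the B samples in the batch."""
    return ((x_hat - x) ** 2).sum(dim=1).div(x.size(1)).mean()
```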
When reconstructing the original data of both domains, we want the latent representations to contain more sample-specific information and less information shared with other samples. Hence, we propose a simple Instance Discrimination Constraint (IDC) module to remove redundancy and sharpen the decision boundaries. IDC maximizes the mutual information between the representations and the original input, while constraining the redundant information that different samples have in common. Concretely, we maximize the distance between $Z^{(i)}$ and $Z^{(j)}$ ($i \ne j$) in the latent space, that is, we minimize:

$$\mathcal{L}_{IDC}=\frac{1}{B(B-1)}\sum_{i=1}^{B}\sum_{j\ne i}\operatorname{sim}\!\left(g_1\!\left(Z^{(i)}\right),\, g_2\!\left(Z^{(j)}\right)\right),\tag{6}$$

where $B$ is the batch size, $\operatorname{sim}(\cdot,\cdot)$ is the cosine similarity, and $g_1$ and $g_2$ are two projection heads that project $Z^{(i)}$ and $Z^{(j)}$ into the same latent space. The projection heads can be either a linear layer or a multi-layer perceptron; we use a two-layer perceptron in our experiments.
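A sketch of the IDC term, assuming two small two-layer MLP projection heads and cosine similarity penalized only between different samples; the head sizes and the normalization constant are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, proj_dim = 256, 128
g1 = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, proj_dim))
g2 = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, proj_dim))

def idc_loss(z: torch.Tensor) -> torch.Tensor:
    """z: (B, dim) latent representations; penalize the cosine similarity between
    the projections of *different* samples in the batch (Eq. 6)."""
    h1 = F.normalize(g1(z), dim=-1)
    h2 = F.normalize(g2(z), dim=-1)
    sim = h1 @ h2.t()                               # (B, B) pairwise cosine similarities
    B = z.size(0)
    off_diag = sim - torch.diag(torch.diagonal(sim))  # keep only i != j terms
    return off_diag.sum() / (B * (B - 1))
```

Because only cross-sample terms are penalized, this acts as a repulsion-only regularizer rather than a contrastive loss with positive pairs.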
III-D Model Optimization
Combining the reconstruction loss and the IDC loss, our final loss is:

$$\mathcal{L}=\mathcal{L}_{rec}+\lambda\,\mathcal{L}_{IDC},\tag{7}$$

where $\lambda$ is a hyper-parameter. By optimizing this loss, we encourage the representations learned by the model to contain as much sample-specific information as possible, so that they are both informative and discriminable. As for complexity, the FFT is performed once before pre-training and the frequency data is used directly during self-supervised training. Consequently, the increase in computational complexity introduced by optimizing $\mathcal{L}$ is negligible.
However, it is difficult to reconstruct the dropped patches when $\gamma$ is initially set to a large value. We therefore introduce Curriculum Learning (CL) [46], a strategy that trains the model from “easy” to “hard”. In detail, we set a relatively small dropping ratio at the beginning of the pre-training stage and increase $\gamma$ as self-supervised learning proceeds. Formally, we refer to the initial ratio as $\gamma_{\text{start}}$, the final ratio as $\gamma_{\text{end}}$, and the number of epochs over which the ratio increases as $E$. These are hyperparameters tuned on the validation set. The dropping ratio $\gamma_i$ for the $i$-th epoch is

$$\gamma_i=\begin{cases}\gamma_{\text{start}}+\dfrac{i}{E}\left(\gamma_{\text{end}}-\gamma_{\text{start}}\right), & i<E,\\[4pt]\gamma_{\text{end}}, & i\ge E.\end{cases}\tag{8}$$

When the current epoch $i$ is smaller than $E$, $\gamma_i$ increases linearly from $\gamma_{\text{start}}$ toward $\gamma_{\text{end}}$. When $i$ reaches $E$, $\gamma_i$ equals $\gamma_{\text{end}}$ and remains at that value until the pre-training stage ends.
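The schedule in Equation 8 can be sketched as follows (function and argument names are illustrative).

```python
def dropping_ratio(epoch: int, gamma_start: float, gamma_end: float, e_increase: int) -> float:
    """Curriculum dropping ratio of Eq. 8: linear warm-up over e_increase epochs,
    then a constant gamma_end until pre-training ends."""
    if epoch >= e_increase:
        return gamma_end
    return gamma_start + (gamma_end - gamma_start) * epoch / e_increase

# PTB-XL settings from Table III: 0.3 -> 0.6 over 200 epochs
print([round(dropping_ratio(e, 0.3, 0.6, 200), 2) for e in (0, 50, 100, 200, 300)])
```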
IV Experiments
IV-A Experimental Setup
IV-A1 Datasets
We conduct experiments on three publicly available time series datasets:
•
PTB-XL [47, 48]. PTB-XL is an electrocardiogram (ECG) dataset with 21,837 12-lead ECGs of 10s, where 52% are male and 48% are female, ranging in age from 0 to 95. The sample rate is 500 Hz, and the length of each recording is 5,000. Our task is to classify time series into one of the five classes including normal ECG (NORM), conduction disturbance (CD), hypertrophy (HYP), myocardial infarction (MI), and ST/T change (STTC).
•
HAR [49]. HAR is a human activity recognition dataset with 10,299 9-variable time series of length 128 from 30 individuals. The data are sampled from an accelerometer and a gyroscope at a sampling rate of 50 Hz. Our task is to classify each time series into one of six daily activities: walking, walking upstairs, walking downstairs, standing, sitting, and lying down.
•
Sleep-EDF [50, 48]. Sleep-EDF is built for sleep stage classification. We use data from 4 subjects in their normal daily life recorded by a modified cassette tape recorder. With a sampling rate of 100 Hz, we choose 3 channels: horizontal EOG, Fpz-Cz EEG, and Pz-Oz EEG. Each recording is a 24-hour time series, and we cut the recordings into 30-second segments. Our task is to classify each time series into one of eight classes: wake, sleep stages 1–4, rapid eye movement, movement, and unscored.
The overview of our adopted datasets is shown in Table II.
All three datasets are randomly split into three parts: a training set (80%), a validation set (10%), and a test set (10%). In real-world scenarios, unlabeled data is much more common than labeled data. To better simulate this situation, we use the whole training set for pre-training and randomly select 20% of the training set to fine-tune the pre-trained model.
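A small sketch of this protocol (80/10/10 split, then a 20% labeled subset of the training data for fine-tuning); the helper is illustrative and not part of the released code.

```python
import numpy as np

def make_splits(n_samples: int, seed: int = 0):
    """Return index arrays for pre-training (full training set), fine-tuning
    (20% of the training set), validation (10%), and test (10%)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_train, n_val = int(0.8 * n_samples), int(0.1 * n_samples)
    train, val, test = idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]
    finetune = rng.choice(train, size=int(0.2 * len(train)), replace=False)
    return train, finetune, val, test
```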
Dataset | # of samples | # of channels | # of classes | Length
---|---|---|---|---
PTB-XL | 21,837 | 12 | 5 | 5,000
HAR | 10,299 | 9 | 6 | 128
Sleep-EDF | 22,636 | 3 | 8 | 3,000
Dataset | Patch size | $\gamma_{\text{start}}$ | $\gamma_{\text{end}}$ | $E$
---|---|---|---|---
PTB-XL | 20 | 0.3 | 0.6 | 200
HAR | 8 | 0.3 | 0.8 | 300
Sleep-EDF | 20 | 0.3 | 0.85 | 300
IV-A2 Model Architecture
Our framework is an auto-encoder architecture composed of an encoder and a decoder. The encoder is based on a deeper Transformer (6 layers), preceded by CNNs serving as local feature extractors. The CNN employed in our experiments is a one-dimensional ResNet-18 architecture [51]. Following the recent work [40], our decoder is a shallower Transformer (only two layers), because a lighter decoder reduces pre-training time without negatively influencing performance on downstream tasks [40].
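The encoder/decoder depth configuration can be sketched with standard torch.nn building blocks as below; the number of attention heads and the use of plain TransformerEncoderLayer blocks (rather than the exact ResNet-18 + Transformer stack) are assumptions for illustration.

```python
import torch.nn as nn

D_MODEL = 256   # representation dimension used in our experiments
N_HEADS = 8     # assumed head count, for illustration only

# Deeper 6-layer Transformer encoder (preceded by CNN local feature extractors).
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=N_HEADS, batch_first=True),
    num_layers=6)

# Lightweight 2-layer decoder built from the same self-attention blocks, as in [40].
decoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=N_HEADS, batch_first=True),
    num_layers=2)
```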
IV-A3 Data processing
We use the fft function from the numpy.fft Python library. For each channel of a time series, the transformed sequence is a complex array of the same length. Because of the conjugate symmetry of the spectrum of real-valued signals, we keep the first half of the sequence and store the magnitude and phase of this half. After concatenating the three sequences, the new sequence is twice the length of the original time series. When slicing the sequence into patches, we ensure that each patch contains only one type of data (time, magnitude, or phase). We then randomly drop patches with probability $\gamma_i$ (as defined in Section III-D) in the $i$-th epoch during pre-training. The patch length and dropping hyper-parameters for each dataset are given in Table III.
IV-A4 Baselines
We compare our performance with the following self-supervised learning approaches, including methods designed specifically for time series as well as general-purpose ones.
•
Contrastive Predictive Coding (CPC) [15]: CPC learns representations by predicting future data in the latent space using autoregressive models. It constructs a contrastive loss that maximally preserves the mutual information in the latent space between data at the present time step and data in the future.
•
SimCLR [29]: SimCLR is a contrastive learning method originally designed for image data. It constructs positive and negative pairs by applying different data augmentations. We replace the original encoder architecture with our encoder for a fair comparison, and use the same augmentations as previous work [52] for time series. Two augmented views of the same data form a positive pair, and all other pairs are negative. The self-supervised learning task is to maximize the similarity of positive pairs and minimize the similarity of negative pairs in the latent space.
•
Temporal and Contextual Contrasting (TS-TCC) [17]: TS-TCC is a self-supervised learning method for time series data. Similar to SimCLR, it transforms the raw time series data into two views by using weak and strong augmentations. The task is a combination of a cross-view prediction task and a contrastive task. The contrastive learning task maximizes the similarity among different views of the same sample while minimizing similarity among views of different samples.
•
Temporal Neighborhood Coding (TNC) [13]: TNC is also a contrastive learning method for time series data that exploits the stationary properties of time series. Similar to Unsupervised Scalable Representation Learning [12], TNC constructs positive pairs from nearby segments and negative pairs from distant segments.
•
Time Series Transformer (TST) [23]: TST is a self-supervised learning method for time series that reconstructs masked parts of the original data. Different from our method, it neglects the frequency domain and masks the original data by setting values to 0, which causes a large gap between pre-training and fine-tuning.
For fair comparisons, we implement these methods using public code with the same encoder architectures as the original work (except SimCLR). The models are optimized using Adam [53]. The representation dimensions are all set to 256, the same as ours. All experiments are conducted on a server with five NVIDIA GeForce RTX 3090ti GPUs.
IV-A5 Evaluations
Linear probing is a popular protocol for evaluating self-supervised learning methods. However, it limits the capacity of deep models, since it cannot pursue stronger, dataset-specific representations. Hence, we add a two-layer fully connected neural network as the classifier after the Transformer encoder and fine-tune the whole model. The validation set is used to tune the hyper-parameters (such as $\lambda$ in Equation 7, the batch size, and the learning rate) in the fine-tuning stage, and the model reaching the highest accuracy on the validation set is saved and evaluated on the test set.
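For instance, the fine-tuned model can be sketched as the pre-trained encoder followed by a two-layer classifier; the names and the assumption that the encoder outputs a (B, 256) representation are illustrative.

```python
import torch.nn as nn

class FineTuneModel(nn.Module):
    """Pre-trained encoder plus a two-layer fully connected classification head."""
    def __init__(self, encoder: nn.Module, dim: int = 256, n_classes: int = 5):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, n_classes))

    def forward(self, x):
        z = self.encoder(x)    # (B, dim) representation from the pre-trained encoder
        return self.head(z)    # class logits; the whole model is fine-tuned end to end
```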
In the self-supervised learning stage, we randomly choose five seeds and get five pre-trained models. In the fine-tuning stage, we also randomly use five seeds to fine-tune each of the five pre-trained models. Thus, we get 25 results per self-supervised learning method per dataset.
We use ROC-AUC, F1-Score, and Accuracy to evaluate the methods. ROC-AUC is the area under the Receiver Operating Characteristic (ROC) curve. The ROC curve is computed directly from the predicted probabilities and the ground truth of each label without a predefined threshold; it plots the true positive rate against the false positive rate at thresholds ranging from zero to one. Accuracy is calculated for each class as the ratio of correctly classified samples to the total number of samples. The F1 score is the harmonic mean of precision (the proportion of true positives among predicted positives) and recall (the proportion of positive cases that are correctly identified). Since these are multi-class classification tasks, we report the ROC-AUC and Accuracy averaged across all classes and the macro-averaged F1 score. Reported numbers are expressed as percentages for readability.
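A sketch of the metric computation with scikit-learn, assuming integer labels and per-class probability outputs; macro averaging and percentage scaling follow the description above, while the per-class accuracy averaging is approximated here by overall accuracy.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score, accuracy_score

def evaluate(y_true: np.ndarray, y_prob: np.ndarray):
    """y_true: (N,) integer labels; y_prob: (N, C) predicted class probabilities."""
    y_pred = y_prob.argmax(axis=1)
    auc = roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro")
    f1 = f1_score(y_true, y_pred, average="macro")
    acc = accuracy_score(y_true, y_pred)
    return 100 * auc, 100 * f1, 100 * acc   # reported as percentages
```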
IV-B Comparison Results
We compare our CRT with the baseline methods and show the results in Table IV. CRT outperforms all baselines in terms of ROC-AUC, F1-score, and accuracy. We also observe that the results of CRT have a relatively small standard deviation, which implies more stable performance on downstream tasks. In addition, TST (another reconstruction-based self-supervised learning method) also obtains a relatively small standard deviation. The reason might be that reconstruction tasks do not require constructing negative and positive pairs, which is difficult and may cause instability; this suggests that reconstruction-based methods provide a more stable form of self-supervised learning. One baseline, Unsupervised Scalable Representation Learning [12], is not included in our results because it requires a much longer running time and we were unable to obtain its results within several days.
Dataset | Method | ROC-AUC | F1-Score | Accuracy
---|---|---|---|---
PTB-XL | CPC | 87.42 ± 0.34 | 63.17 ± 1.19 | 85.75 ± 0.28
 | SimCLR | 84.50 ± 1.45 | 59.94 ± 2.49 | 84.17 ± 0.80
 | TS-TCC | 87.25 ± 0.25 | 65.70 ± 0.71 | 84.66 ± 1.12
 | TNC | 86.90 ± 0.35 | 63.02 ± 1.13 | 85.35 ± 0.23
 | TST | 81.69 ± 0.39 | 59.36 ± 8.30 | 81.86 ± 1.80
 | CRT | 89.22 ± 0.07 | 68.43 ± 0.58 | 87.81 ± 0.29
HAR | CPC | 94.83 ± 1.15 | 84.42 ± 2.13 | 83.86 ± 2.05
 | SimCLR | 96.53 ± 0.88 | 86.85 ± 2.26 | 86.15 ± 2.37
 | TS-TCC | 97.46 ± 0.32 | 88.68 ± 1.43 | 87.95 ± 1.59
 | TNC | 94.45 ± 1.32 | 84.64 ± 1.96 | 84.01 ± 1.95
 | TST | 97.31 ± 0.39 | 86.17 ± 1.00 | 85.37 ± 1.01
 | CRT | 98.94 ± 0.22 | 90.51 ± 0.77 | 90.09 ± 0.75
Sleep-EDF | CPC | 91.92 ± 1.62 | 39.74 ± 3.35 | 88.70 ± 1.35
 | SimCLR | 91.97 ± 3.22 | 41.78 ± 3.13 | 87.86 ± 1.77
 | TS-TCC | 93.57 ± 1.83 | 39.09 ± 4.50 | 86.06 ± 3.19
 | TNC | 91.48 ± 3.51 | 37.89 ± 5.13 | 86.97 ± 3.48
 | TST | 93.31 ± 2.29 | 42.58 ± 2.48 | 88.83 ± 0.84
 | CRT | 94.74 ± 1.09 | 44.38 ± 1.14 | 90.12 ± 0.57
IV-C Analysis of Cross-Domain Reconstruction
More ablation studies and comparison experiments are conducted to further verify the effectiveness of our method.
IV-C1 Dropping is Better than Masking
To verify that dropping alleviates the gap between pre-training and fine-tuning better than masking, we compare the downstream results under the two setups. From Figure 5, we see that dropping leads to better downstream performance in nearly all cases; the only exception is the ROC-AUC on the Sleep-EDF dataset, where dropping is slightly lower while still outperforming masking on the other metrics. This is because the zeros introduced by masking may create spurious patterns in the time series, a problem that dropping avoids. Our dropping approach keeps the shape of the original time series and reduces the gap between the pre-training and fine-tuning phases caused by the masking process.

IV-C2 Adding Phase Helps Frequency Learning

We also conduct a simple ablation study to verify that the phase provides supplementary spectral information. We evaluate models pre-trained and fine-tuned with and without the phase data under the same experimental setup as our CRT. The results on the three datasets are shown in Figure 6. Adding the phase improves the final performance on all three metrics, supporting our view that the magnitude alone is not informative enough to represent the frequency domain and that the phase complements it.
IV-C3 Cross-Domain is Better than Single-Domain
Dataset | Method | ROC-AUC | F1-Score | Accuracy
---|---|---|---|---
PTB-XL | Time | 88.08 ± 0.25 | 67.45 ± 0.55 | 85.67 ± 0.59
 | Freq | 77.58 ± 0.43 | 47.86 ± 1.96 | 78.28 ± 0.91
 | CRT | 89.22 ± 0.07 | 68.43 ± 0.58 | 87.81 ± 0.29
HAR | Time | 96.46 ± 0.13 | 87.62 ± 1.00 | 86.84 ± 1.07
 | Freq | 94.61 ± 0.14 | 76.96 ± 1.38 | 76.50 ± 1.44
 | CRT | 98.94 ± 0.22 | 90.51 ± 0.77 | 90.09 ± 0.75
Sleep-EDF | Time | 94.64 ± 0.55 | 40.13 ± 1.30 | 88.25 ± 0.59
 | Freq | 91.00 ± 0.76 | 38.07 ± 1.88 | 84.37 ± 1.33
 | CRT | 94.74 ± 1.09 | 44.38 ± 1.14 | 90.12 ± 0.57


We compare our method with the trivial single-domain reconstruction tasks. After conducting the time-domain and frequency-domain (phase and magnitude) reconstruction tasks respectively, we fine-tune the models using the corresponding domain’s data. As shown in Table V, CRT performs best on all three metrics. The ROC-AUCs of all five classes on the PTB-XL dataset are shown in Figure 7a; cross-domain representations improve performance on all classes, because some ECG abnormalities are more easily observed from the frequency-domain perspective. Figure 7b shows the performance on the Sleep-EDF dataset under different training sizes, when 20%, 40%, 60%, and 80% of the training set is used for fine-tuning. By combining the two domains, our cross-domain representations yield higher accuracy under all ratios.
All results support our hypothesis that learning from two domains can yield more informative representations than from a single domain. This is because some patterns of the frequency domain can complement the temporal patterns. Also, the cross-domain Transformer encoder can fuse different types of features, and extract useful and complementary patterns from both domains.
In addition, we notice that in all three datasets, using the time domain yields better performance compared with using the frequency domain. This may be because the patterns of the time domain are more direct for neural networks to discover, and the three classification tasks are strongly related to these patterns. However, adding the frequency domain can supplement some patterns, leading to better performance.
IV-C4 Exploring Cross-Domain Learning



Dataset | Method | ROC-AUC | F1-Score | Accuracy
---|---|---|---|---
PTB-XL | T2F | 84.14 ± 0.34 | 59.02 ± 2.04 | 83.58 ± 0.78
 | F2T | 83.72 ± 0.49 | 59.19 ± 1.95 | 83.42 ± 0.70
 | CRT | 89.22 ± 0.07 | 68.43 ± 0.58 | 87.81 ± 0.29
HAR | T2F | 98.18 ± 0.58 | 89.23 ± 1.64 | 88.84 ± 1.70
 | F2T | 97.93 ± 0.83 | 88.90 ± 1.97 | 88.53 ± 1.98
 | CRT | 98.94 ± 0.22 | 90.51 ± 0.77 | 90.09 ± 0.75
Sleep-EDF | T2F | 94.36 ± 1.66 | 40.41 ± 3.65 | 89.09 ± 2.62
 | F2T | 92.88 ± 1.53 | 41.67 ± 1.93 | 89.46 ± 1.00
 | CRT | 94.74 ± 1.09 | 44.38 ± 1.14 | 90.12 ± 0.57


We explore cross-domain learning with two experiments:
•
Time-to-Frequency Reconstruction (T2F): with only the time-domain part of $X$ fed into the Transformer encoder, we use the encoded features to reconstruct the dropped original frequency-domain data (including phase and magnitude).
•
Frequency-to-Time Reconstruction (F2T): contrary to T2F, we input the frequency-domain part of $X$ and reconstruct the dropped time-domain data.
We use the same inputs during fine-tuning so that only the cross-reconstruction pre-training tasks differ.
As shown in Table VI, our complete cross-domain solution outperforms the other two methods on three datasets. We also give the results of all five classes on the PTB-XL dataset (Figure 9a), and the accuracy on the Sleep-EDF dataset under four training set ratios (Figure 9b). From Figure 9a, we can see CRT outperforms others on all classes, especially on “MI” (11.6% higher than F2T and 4.7% higher than T2F). In Figure 9b, CRT also performs better, and T2F performs similarly to F2T.
Our CRT adopts a Transformer as part of the encoder to fuse the embeddings from different domains. The representations obtained from the Transformer encoder contain information from both the time domain and the frequency domain, which benefits the reconstruction as well as the downstream tasks. T2F and F2T, in contrast, capture only a single-direction relationship rather than the mutual relationship. It is therefore intuitive that CRT produces more semantic representations, and the experimental results confirm it.
IV-C5 CL and IDC Help Learning Cross-Domain Representations


We conduct another ablation study to verify the effectiveness of CL and IDC, and show the results in Figure 10 and Figure 11.
As expected, these two modules facilitate learning cross-domain representations. CL helps the reconstruction tasks and improves performance on downstream tasks: a progressively increasing dropping ratio, rather than a constantly large one, helps the model reconstruct the original data better. As a regularization, IDC minimizes the mutual information between different samples, which helps the Transformer extract more discriminable patterns. Note that IDC is not a contrastive loss, since it does not minimize the distance of any pair of samples; it only minimizes the information that all time series samples have in common.
In Figure 8, we illustrate three reconstructed cases with a 0.7 dropping ratio. The model largely reconstructs the trend of the dropped inputs even at such a high dropping ratio, especially for the ECG case from the PTB-XL dataset. The strong ECG reconstruction is likely because ECG is periodic and contains abundant similar patterns. Such reconstruction performance also indicates that our cross-domain representations are highly informative and sample-specific.
V Conclusion
In this work, we propose the Cross Reconstruction Transformer (CRT), a Transformer-based cross-domain reconstruction framework for self-supervised representation learning on time series data. We note that existing self-supervised learning methods neglect the spectral information and the temporal-spectral correlations of time series. Moreover, existing masking-based reconstruction methods tend to significantly change the original pattern of the time series, which leads to distribution shift between the pre-training and fine-tuning processes. To tackle these problems in a unified way, we propose our cross-domain reconstruction framework, which outperforms existing methods on three widely used datasets. In addition, we conduct various experiments to verify the effectiveness of each component of our framework. For future work, we plan to adapt our framework to out-of-distribution scenarios and address more challenging problems [54].
Acknowledgement
This work was supported by the National Natural Science Foundation of China (No.62102008).
References
- [1] Q. Tan, M. Ye, A. J. Ma, B. Yang, T. C.-F. Yip, G. L.-H. Wong, and P. C. Yuen, “Explainable uncertainty-aware convolutional recurrent neural network for irregular medical time series,” IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 10, pp. 4665–4679, 2021.
- [2] F. M. Bianchi, S. Scardapane, S. Løkse, and R. Jenssen, “Reservoir computing approaches for representation and classification of multivariate time series,” IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 5, pp. 2169–2179, 2021.
- [3] Y. Huang, G. G. Yen, and V. S. Tseng, “Snippet policy network v2: Knee-guided neuroevolution for multi-lead ecg early classification,” IEEE Transactions on Neural Networks and Learning Systems, pp. 1–15, 2022.
- [4] T. Bradde, G. Fracastoro, and G. C. Calafiore, “Multiclass sparse centroids with application to fast time series classification,” IEEE Transactions on Neural Networks and Learning Systems, pp. 1–6, 2021.
- [5] W. Zheng and J. Hu, “Multivariate time series prediction based on temporal change information learning method,” IEEE Transactions on Neural Networks and Learning Systems, pp. 1–15, 2022.
- [6] K. Bandara, C. Bergmeir, and H. Hewamalage, “Lstm-msnet: Leveraging forecasts on sets of related time series with multiple seasonal patterns,” IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 4, pp. 1586–1599, 2021.
- [7] A. Garg, W. Zhang, J. Samaran, R. Savitha, and C.-S. Foo, “An evaluation of anomaly detection and diagnosis in multivariate time series,” IEEE Transactions on Neural Networks and Learning Systems, vol. 33, no. 6, pp. 2508–2517, 2022.
- [8] S.-E. Benkabou, K. Benabdeslem, V. Kraus, K. Bourhis, and B. Canitia, “Local anomaly detection for multivariate time series by temporal dependency based on poisson model,” IEEE Transactions on Neural Networks and Learning Systems, vol. 33, no. 11, pp. 6701–6711, 2022.
- [9] Q. Chen, A. Zhang, T. Huang, Q. He, and Y. Song, “Imbalanced dataset-based echo state networks for anomaly detection,” Neural Computing and Applications, vol. 32, no. 8, pp. 3685–3694, 2020.
- [10] A. Hyvarinen and H. Morioka, “Unsupervised feature extraction by time-contrastive learning and nonlinear ica,” Advances in Neural Information Processing Systems, vol. 29, 2016.
- [11] X. Lan, D. Ng, S. Hong, and M. Feng, “Intra-inter subject self-supervised learning for multivariate cardiac signals,” arXiv preprint arXiv:2109.08908, 2021.
- [12] J.-Y. Franceschi, A. Dieuleveut, and M. Jaggi, “Unsupervised scalable representation learning for multivariate time series,” Advances in neural information processing systems, vol. 32, 2019.
- [13] S. Tonekaboni, D. Eytan, and A. Goldenberg, “Unsupervised representation learning for time series with temporal neighborhood coding,” arXiv preprint arXiv:2106.00750, 2021.
- [14] L. Yang and S. Hong, “Unsupervised time-series representation learning with iterative bilinear temporal-spectral fusion,” in International Conference on Machine Learning. PMLR, 2022, pp. 25 038–25 054.
- [15] A. v. d. Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” arXiv preprint arXiv:1807.03748, 2018.
- [16] M. Gutmann and A. Hyvärinen, “Noise-contrastive estimation: A new estimation principle for unnormalized statistical models,” in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, ser. Proceedings of Machine Learning Research, Y. W. Teh and M. Titterington, Eds., vol. 9, Chia Laguna Resort, Sardinia, Italy, 13–15 May 2010, pp. 297–304.
- [17] E. Eldele, M. Ragab, Z. Chen, M. Wu, C. K. Kwoh, X. Li, and C. Guan, “Time-series representation learning via temporal and contextual contrasting,” in Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, Z.-H. Zhou, Ed. International Joint Conferences on Artificial Intelligence Organization, 8 2021, pp. 2352–2359.
- [18] G. Liu, Y. Liao, F. Wang, B. Zhang, L. Zhang, X. Liang, X. Wan, S. Li, Z. Li, S. Zhang, and S. Cui, “Medical-vlbert: Medical visual language bert for covid-19 ct report generation with alternate learning,” IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 9, pp. 3786–3797, 2021.
- [19] C. Zhang, D. Song, Y. Chen, X. Feng, C. Lumezanu, W. Cheng, J. Ni, B. Zong, H. Chen, and N. V. Chawla, “A deep neural network for unsupervised anomaly detection and diagnosis in multivariate time series data,” in Proceedings of the AAAI conference on artificial intelligence, vol. 33, no. 01, 2019, pp. 1409–1416.
- [20] Q. Song, B. Sun, and S. Li, “Multimodal sparse transformer network for audio-visual speech recognition,” IEEE Transactions on Neural Networks and Learning Systems, pp. 1–11, 2022.
- [21] N. Zhang, “Learning adversarial transformer for symbolic music generation,” IEEE Transactions on Neural Networks and Learning Systems, pp. 1–10, 2020.
- [22] H. Zhou, S. Zhang, J. Peng, S. Zhang, J. Li, H. Xiong, and W. Zhang, “Informer: Beyond efficient transformer for long sequence time-series forecasting,” in Proceedings of AAAI, 2021.
- [23] G. Zerveas, S. Jayaraman, D. Patel, A. Bhamidipaty, and C. Eickhoff, “A transformer-based framework for multivariate time series representation learning,” in Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 2021, pp. 2114–2124.
- [24] R. Nawaz, H. Nisar, and Y. V. Voon, “The effect of music on human brain; frequency domain and time series analysis using electroencephalogram,” IEEE Access, vol. 6, pp. 45 191–45 205, 2018.
- [25] M. Rhif, A. Ben Abbes, I. R. Farah, B. Martínez, and Y. Sang, “Wavelet transform application for/in non-stationary time-series analysis: A review,” Applied Sciences, vol. 9, no. 7, 2019.
- [26] E. L. Denton et al., “Unsupervised learning of disentangled representations from video,” Advances in neural information processing systems, vol. 30, 2017.
- [27] Z. Yang, H. Yu, Y. He, W. Sun, Z.-H. Mao, and A. Mian, “Fully convolutional network-based self-supervised learning for semantic segmentation,” IEEE Transactions on Neural Networks and Learning Systems, pp. 1–11, 2022.
- [28] X. Wang and A. Gupta, “Unsupervised learning of visual representations using videos,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 2794–2802.
- [29] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in Proceedings of the 37th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 119, 13–18 Jul 2020, pp. 1597–1607.
- [30] C. Liu, Y. Yao, D. Luo, Y. Zhou, and Q. Ye, “Self-supervised motion perception for spatiotemporal representation learning,” IEEE Transactions on Neural Networks and Learning Systems, pp. 1–15, 2022.
- [31] L. Yang and S. Hong, “Omni-granular ego-semantic propagation for self-supervised graph representation learning,” in International Conference on Machine Learning. PMLR, 2022, pp. 25 022–25 037.
- [32] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
- [33] M. Joshi, D. Chen, Y. Liu, D. S. Weld, L. Zettlemoyer, and O. Levy, “Spanbert: Improving pre-training by representing and predicting spans,” Transactions of the Association for Computational Linguistics, vol. 8, pp. 64–77, 2020.
- [34] Z. Zhang, X. Han, Z. Liu, X. Jiang, M. Sun, and Q. Liu, “Ernie: Enhanced language representation with informative entities,” arXiv preprint arXiv:1905.07129, 2019.
- [35] X. Chen, H. Fan, R. Girshick, and K. He, “Improved baselines with momentum contrastive learning,” arXiv preprint arXiv:2003.04297, 2020.
- [36] M. Caron, P. Bojanowski, A. Joulin, and M. Douze, “Deep clustering for unsupervised learning of visual features,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 132–149.
- [37] M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, and A. Joulin, “Unsupervised learning of visual features by contrasting cluster assignments,” Advances in Neural Information Processing Systems, vol. 33, pp. 9912–9924, 2020.
- [38] Y. Zheng, M. Jin, S. Pan, Y.-F. Li, H. Peng, M. Li, and Z. Li, “Toward graph self-supervised learning with contrastive adjusted zooming,” IEEE Transactions on Neural Networks and Learning Systems, pp. 1–15, 2022.
- [39] H. Zhang, J. Y. Koh, J. Baldridge, H. Lee, and Y. Yang, “Cross-modal contrastive learning for text-to-image generation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 833–842.
- [40] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked autoencoders are scalable vision learners,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16 000–16 009.
- [41] J. Jiang, J. Chen, and Y. Guo, “A dual-masked auto-encoder for robust motion capture with spatial-temporal skeletal token completion,” in Proceedings of the 30th ACM International Conference on Multimedia, 2022, pp. 5123–5131.
- [42] Z. Yue, Y. Wang, J. Duan, T. Yang, C. Huang, Y. Tong, and B. Xu, “Ts2vec: Towards universal representation of time series,” arXiv preprint arXiv:2106.10466, 2021.
- [43] A. Van Den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, “Wavenet: A generative model for raw audio.” SSW, vol. 125, p. 2, 2016.
- [44] L. Ye and E. Keogh, “Time series shapelets: a new primitive for data mining,” in Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, 2009, pp. 947–956.
- [45] A. V. Oppenheim, A. S. Willsky, and S. H. Nawab, Signals & Systems (2nd ed.). Prentice-Hall, Inc., 1996.
- [46] Y. Bengio, J. Louradour, R. Collobert, and J. Weston, “Curriculum learning,” in Proceedings of the 26th annual international conference on machine learning, 2009, pp. 41–48.
- [47] P. Wagner, N. Strodthoff, R.-D. Bousseljot, D. Kreiseler, F. I. Lunze, W. Samek, and T. Schaeffter, “Ptb-xl, a large publicly available electrocardiography dataset,” Scientific data, vol. 7, no. 1, pp. 1–15, 2020.
- [48] A. L. Goldberger, L. A. Amaral, L. Glass, J. M. Hausdorff, P. C. Ivanov, R. G. Mark, J. E. Mietus, G. B. Moody, C.-K. Peng, and H. E. Stanley, “Physiobank, physiotoolkit, and physionet: components of a new research resource for complex physiologic signals,” circulation, vol. 101, no. 23, pp. e215–e220, 2000.
- [49] D. Anguita, A. Ghio, L. Oneto, X. Parra Perez, and J. L. Reyes Ortiz, “A public domain dataset for human activity recognition using smartphones,” in Proceedings of the 21th international European symposium on artificial neural networks, computational intelligence and machine learning, 2013, pp. 437–442.
- [50] B. Kemp, A. H. Zwinderman, B. Tuk, H. A. Kamphuisen, and J. J. Oberye, “Analysis of a sleep-dependent neuronal feedback loop: the slow-wave microcontinuity of the eeg,” IEEE Transactions on Biomedical Engineering, vol. 47, no. 9, pp. 1185–1194, 2000.
- [51] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
- [52] E. Eldele, Z. Chen, C. Liu, M. Wu, C.-K. Kwoh, X. Li, and C. Guan, “An attention-based deep learning approach for sleep stage classification with single-channel eeg,” IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 29, pp. 809–818, 2021.
- [53] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
- [54] L. Yang, Z. Zhang, Y. Song, S. Hong, R. Xu, Y. Zhao, Y. Shao, W. Zhang, B. Cui, and M.-H. Yang, “Diffusion models: A comprehensive survey of methods and applications,” arXiv preprint arXiv:2209.00796, 2022.