
PR-PL: A Novel Transfer Learning Framework with
Prototypical Representation based Pairwise Learning
for EEG-Based Emotion Recognition

Rushuang Zhoua,b,1,1, Zhiguo Zhanga,b,c,d,e,1,2, Hong Fuf,3, Li Zhanga,b,4, Linling Lia,b,5,
Gan Huanga,b,6, Yining Dongg,7, Fali Lih,i,8, Xin Yanga,b,9 and Zhen Lianga,b,10
aSchool of Biomedical Engineering, Health Science Center, Shenzhen University, Shenzhen, China
bGuangdong Provincial Key Laboratory of Biomedical Measurements and Ultrasound Imaging, Shenzhen, China
cSchool of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China
dMarshall Laboratory of Biomedical Engineering, Shenzhen, China
ePeng Cheng Laboratory, Shenzhen, China
fDepartment of Mathematics and Information Technology, The Education University of Hong Kong, Hong Kong
gSchool of Data Science, City University of Hong Kong, Hong Kong
hThe Clinical Hospital of Chengdu Brain Science Institute, MOE Key Lab for Neuroinformation,
University of Electronic Science and Technology of China, China
iSchool of Life Science and Technology, Center for Information in Medicine,
University of Electronic Science and Technology of China, China
Email: 12018222087@szu.edu.cn, 2zgzhang@szu.edu.cn, 3hfu@eduhk.hk, 4lzhang@szu.edu.cn, 5lilinling@szu.edu.cn,
6huanggan@szu.edu.cn, 7yinidong@cityu.edu.hk, 8fali.li@uestc.edu.cn, 9yangxinknow@gmail.com, 10janezliang@szu.edu.cn
Abstract

Affective brain-computer interfaces based on electroencephalography (EEG) are an important branch of affective computing. However, individual differences and noisy labels seriously limit the effectiveness and generalizability of EEG-based emotion recognition models. In this paper, we propose a novel transfer learning framework with Prototypical Representation based Pairwise Learning (PR-PL) to learn discriminative and generalized prototypical representations for emotion revealing across individuals and to formulate emotion recognition as pairwise learning so as to alleviate the reliance on precise label information. More specifically, a prototypical learning-based adversarial discriminative domain adaptation method is developed to encode the inherent emotion-related semantic structure of EEG data, while pairwise learning with an adaptive pseudo-labeling method is developed to achieve reliable and stable model learning with noisy labels. Through domain adaptation, the feature representations of the source and target domains are aligned in a shared feature space, while the feature separability of both the source and target domains is also considered. The characterized prototypical representations exhibit high feature concentration within a single emotion category and high feature separability across different emotion categories. Extensive experiments are conducted on two benchmark databases under four cross-validation evaluation protocols (cross-subject cross-session, cross-subject within-session, within-subject cross-session, and within-subject within-session). The experimental results demonstrate the superiority of the proposed PR-PL over state-of-the-art methods under all four evaluation protocols, which shows the effectiveness and generalizability of PR-PL in dealing with the ambiguity of EEG responses in affective studies. The source code is available at https://github.com/KAZABANA/PR-PL.

Index Terms:
Electroencephalography; Emotion Recognition; Prototypical Representation; Pairwise Learning; Transfer Learning.
1 Equal contributions.

I Introduction

Affective computing is a fast-growing interdisciplinary research field that is attracting attention from researchers in different areas, including computer science, neuroscience, psychology, and signal processing [1]. Recently, electroencephalography (EEG) based emotion recognition has become an increasingly important topic in affective computing and human sentiment analysis [2, 3]. A proper design of EEG-based emotion recognition models helps facilitate data processing, characterize discriminative features, and improve model performance. Currently, there are two main critical issues in EEG-based emotion recognition. One is individual differences: how to build a generalized affective computing model that can tolerate the remarkable individual differences in the simultaneously collected EEG signals. The other is noisy label learning: how to train a reliable and stable affective computing model that is less reliant on subjective feedback.

In recent years, more and more researchers have focused on applying transfer learning methods to alleviate the individual differences in EEG signals [4, 5, 6, 7, 8, 9] and improve invariant feature representation [10, 11, 12]. Considering the individuals with and without labels (termed the source domain and target domain), transfer learning tries to minimize the distribution difference between the source and target domains so as to approximately satisfy the assumption of independent and identical distribution, and can consequently realize higher recognition performance on the target domain. Through a domain-shifting strategy, feature representations that are invariant across different domains are learned, and the relationships among the learned features, data distribution, and labels are explored. For example, Li et al. [5] proposed a multisource transfer learning method with two transfer learning stages. In the first stage, appropriate samples were selected from the existing source domain. In the second stage, a style transfer mapping was implemented to alleviate the differences between the selected source samples and the unknown target samples. The results showed that the proposed transfer method outperformed the non-transfer method with an improvement of 12.72% on the public SEED database [13] with a three-class classification problem (negative, neutral, and positive). Inspired by neuroscience findings that different emotions lead to different brain reactions, Li et al. [6] proposed a novel R2G-STNN network to integrate the EEG spatial-temporal dynamics of local and global brain areas and realize efficient emotion recognition together with domain-shift learning. More details about current EEG-based emotion recognition models with transfer learning algorithms are presented in Section II.

For video-evoked EEG-emotion experiments, subjects may not always be able to accurately react to the intended emotions and, at the same time, may not be able to accurately describe and report their emotional changes. This brings label noise to the emotional annotation of EEG samples and further leads to a negative impact on model performance [14]. To tackle this issue, Zhong et al. [8] developed an emotion-aware distribution learning method (RGNN), in which the label information is blurred by changing the one-hot label representation $\left(1,0,0\right)$ to $\left(1-\frac{2\epsilon}{3},\frac{2\epsilon}{3},0\right)$, and the model is trained to be less sensitive to label noise. However, the model performance greatly relies on the selection of the $\epsilon$ value, and the optimal $\epsilon$ could be different for different databases and different individuals. Current EEG-based emotion recognition models are mainly based on pointwise learning, which heavily relies on precisely labeled data. In contrast, pairwise learning makes it possible to model the relative associations between pairs of instances and to efficiently encode the proximity among samples with less reliance on labeling. Thus, pairwise learning has achieved tremendous success in a number of real-world applications [15, 16, 17, 18].

To further improve the effectiveness and generalizability of EEG-based emotion recognition models and eliminate the negative effects of individual differences and label noise, in this paper we formulate the emotion recognition task as a pairwise learning problem and propose a novel transfer learning framework with prototypical representation based pairwise learning (termed PR-PL below). Here, we model the relative relationship between pairs of EEG samples in terms of prototypical representations, which is advantageous over pointwise learning when the labeling task is difficult and even when the provided labels are wrong [19]. The major novelties of the proposed PR-PL are summarized as follows. (1) We formulate emotion recognition as pairwise learning to replace the pointwise classifier and greatly alleviate the dependence on precise emotion labels. Pairwise learning provides an alternative way to measure whether two EEG signals belong to the same emotion category without relying on precise labeling information. The extensive experimental results on two well-known emotional databases (SEED [13] and SEED-IV [20]) show that the proposed PR-PL is more accurate than the state-of-the-art methods for solving EEG-based emotion recognition tasks under different application environments (cross-subject cross-session, cross-subject single-session, within-subject cross-session, and within-subject single-session). (2) We propose a novel prototypical learning-based adversarial discriminative domain adaptation method to explore the latent variables of emotion categories, encode the semantic structure of EEG data, and learn subject-generalized prototypical representations for emotion revealing across individuals. The characterized prototypical representations show high feature concentration within a single emotion category and high feature separability across different emotion categories. (3) Different from the existing transfer learning methods that only focus on feature separability in the source domain, we consider the feature separability of both the source and target domains through end-to-end domain adversarial training to further enhance the model effectiveness and generalizability.

II Related Work

The existing EEG-based emotion recognition models with the transfer learning algorithms can be generally categorized into two types.

(a) Non-deep transfer learning models. Pan et al. [21] proposed a transfer component analysis (TCA) algorithm to reduce the marginal distribution difference between the source and target domains, in which the transfer information was learned in a reproducing kernel Hilbert space using the maximum mean discrepancy. Zheng and Lu [22] introduced two types of subject-to-subject transfer methods to deal with the challenge of individual differences in EEG signal processing. One was to explore a shared common feature space underlying the source and target domains using TCA and kernel principal component analysis (KPCA), and the other was to construct multiple personalized classifiers on the source domain and map the classifier parameters to the target domain using transductive parameter transfer (TPT). These non-deep transfer learning strategies show the possibility of bridging the discrepancy across two domains with improved performance on the target domain. However, due to their small capacity and low complexity, the model accuracy and stability are still limited, which fails to satisfy the requirements of affective brain-computer interfaces (aBCI) in practical applications.

(b) Deep transfer learning models. Most of the existing affective models are based on deep transfer learning methods built upon the domain-adversarial neural network (DANN) proposed in [23]. The main idea of DANN is to find a shared feature representation for the source and target domains with indistinguishable distribution differences, while maintaining the predictive ability of the estimated features on the source samples for a specific classification task. Li et al. [24] were the first to introduce DANN into aBCI. Benefiting from the powerful feature representation ability of deep networks and the high efficiency of adversarial learning in distribution adaptation, the results showed that the DANN-based aBCI system was superior to other methods. The following aBCI systems can be considered a series of DANN-based models, which generally start from two directions to improve the DANN performance in solving EEG-based emotion recognition tasks.

  • Incorporating the prior knowledge of neuroscience and brain anatomy into DANN. Inspired by the neuroscience finding of the asymmetry of the left and right hemispheres in emotional responses, Yang et al. [25] proposed a bi-hemisphere domain adversarial neural network (BiDANN), in which one global and two local domain discriminators were designed to learn discriminative emotion-related features from each cerebral hemisphere and also improve the feature stability against the variation of different domains. The experiments on the SEED database demonstrated that BiDANN achieved higher emotion recognition performance than DANN. Considering that the emotional responses from different brain regions vary, Li et al. [6] proposed R2G-STNN (regional to global spatial-temporal neural network) to integrate the spatial-temporal information from local and global brain regions under importance guidance and characterize hierarchical feature representations. Similarly, under the assumption that not all EEG channels are equally important in emotion recognition tasks, Du et al. [26] integrated an attention mechanism, long short-term memory (LSTM), and DANN to propose an attention-based LSTM with domain discriminator (ATDD-LSTM), which characterizes the nonlinear relations among different EEG channels in a data-driven manner and optimally selects informative emotion-related channels.

  • Incorporating the probability distribution into DANN. To deal with the training instability of DANN, Luo et al. [27] introduced WGAN-GP (Wasserstein generative adversarial network with gradient penalty) to narrow the distance between the marginal probability distributions of different subjects. The results showed that the model stability was improved and better cross-subject EEG-based emotion recognition was achieved. However, such DANN-based models only consider the marginal distribution differences but ignore the joint distribution differences of different domains. To address this problem, Li et al. [28] introduced a joint domain adaptation network (JDA-Net), where the joint distribution adaptation (JDA) method was incorporated with a unified framework of task-invariant features (MDA) and task-specific features (CDA).

Although the above models have achieved higher accuracies than the original DANN model in emotion recognition tasks, there still exist three major technical challenges. First, the learned feature representation is susceptible to noise interference from both the source and target domains, which would further affect the model generalizability [29, 30]. Second, the existing models only focus on the feature separability in the source domain but ignore the feature separability in the target domain. The current DANN-based models mainly concern the emotion classification loss in the source domain, which would lead to over-fitting of the source domain data and a decrease of the classification ability on the target domain data. Third, the existing algorithms largely rely on a large amount of labeled source domain data. However, in practical EEG applications, it is difficult to collect accurate labels for every single EEG trial.

Figure 1: The proposed PR-PL framework.
TABLE I: Frequently used notations and descriptions.
Notation | Description
$\mathbb{S}$ \ $\mathbb{T}$ | source \ target domain
$x^{s}$ \ $x^{t}$ | source \ target feature
$y^{s}$ \ $y^{t}$ | source \ target label
$N^{s}$ \ $N^{t}$ | number of source \ target samples
$D^{s}$ \ $D^{t}$ | the source \ target dataset
$f(\cdot)$ | sample feature extractor
$d(\cdot)$ | domain discriminator
$h(\cdot)$ | bi-linear operation
$\theta_{f}$ | the parameter of the feature extractor $f(\cdot)$
$\theta_{d}$ | the parameter of the discriminator $d(\cdot)$
$S$ | the parameter of the bi-linear operation $h(\cdot)$
$\mu$ | centroid
$l$ | predicted label

III Methodology

Suppose the EEG trials in the source domain $\mathbb{S}$ and target domain $\mathbb{T}$ are given as $D_{s}=\{X_{s},Y_{s}\}$ and $D_{t}=\{X_{t},Y_{t}\}$, where $\{X_{s},Y_{s}\}=\left\{\left(x_{i}^{s},y_{i}^{s}\right)\right\}_{i=0}^{N_{s}}$ and $\{X_{t},Y_{t}\}=\left\{\left(x_{i}^{t},y_{i}^{t}\right)\right\}_{i=0}^{N_{t}}$. Here, $x_{i}^{s}$ and $x_{i}^{t}$ are EEG samples, and $y_{i}^{s}$ and $y_{i}^{t}$ are the corresponding emotion labels. To make the narrative clearer, the frequently used notations are summarized in Table I. As shown in Fig. 1, the proposed PR-PL includes three losses (the domain adversarial loss, the pairwise learning loss on the source domain, and the pairwise learning loss on the target domain) and two main parts (prototypical representation and pairwise learning). In the prototypical representation, three types of features are defined. The features $f(X_{s})$ or $f(X_{t})$ characterized from the EEG samples are termed sample features, which are forced to be as indistinguishable as possible between the source and target domains. Under the assumption that every emotion can be represented by a prototype via prototypical learning, the prototype features of each emotion category are learned based on the sample features $f(X_{s})$ and labels $Y_{s}$ from the source domain. The interaction relationships between the sample features and prototype features are then measured, and the resulting interaction features are used in the following pairwise learning. In the pairwise learning, the pair relationships on both the source and target domains are explored. As the information about $Y_{t}$ is unknown during model training, an adaptive thresholding method is developed for valid pair selection and pseudo label generation.

III-A Sample feature extraction

To make both the source and target data satisfy the assumption of independent and identical distribution, we characterize the sample features based on the domain adversarial training introduced in DANN. Here, the distribution difference between the source domain and the target domain is alleviated and sample features with domain-invariant properties are characterized. Through this process, the individual differences in EEG signals can be alleviated and the generalization ability of the model can be improved [25, 27, 6, 26]. Specifically, the domain difference between the source domain sample features $f(X_{s})$ and the target domain sample features $f(X_{t})$ is minimized by adopting domain adversarial training. Here, $f\left(\cdot\right)$ is the designed feature extractor for extracting sample features from EEG signals. A discriminator network $d\left(\cdot\right)$ with parameters $\theta_{d}$ is introduced to distinguish whether the characterized sample features ($f(X_{s})$ or $f(X_{t})$) come from the source domain ($\mathbb{S}$) or the target domain ($\mathbb{T}$). Its loss function is a standard two-category cross-entropy loss, given as

\mathcal{L}_{\text{disc}}\left(\theta_{f},\theta_{d}\right)=-\sum_{i=0}^{N_{s}}\log d\left(f\left(x_{i}^{s}\right)\right)-\sum_{i=0}^{N_{t}}\log\left(1-d\left(f\left(x_{i}^{t}\right)\right)\right). (1)

In the training process, we adopt the end-to-end training method [23] and implement the domain adversarial training by introducing a gradient reversal layer. The feature extractor $f(\cdot)$ is trained to maximize the classification ability for emotion recognition and, at the same time, to confuse the discriminator $d(\cdot)$, which is trained to distinguish the two domains, so that the distribution difference between the source domain and target domain is reduced. The final domain adversarial training objective function is defined as

\min_{\theta_{f}}\max_{\theta_{d}}\ \mathcal{L}_{\text{classifier}}^{s}(\theta_{f})-\lambda\mathcal{L}_{\text{disc}}\left(\theta_{f},\theta_{d}\right), (2)

which is termed the domain adversarial loss in Fig. 1. Here, $\mathcal{L}_{\text{classifier}}^{s}(\theta_{f})$ is the classification loss that measures the classification ability in the source domain. In this paper, $\mathcal{L}_{\text{classifier}}^{s}(\theta_{f})$ is realized by the pairwise learning on the source and target domains introduced in Sections III-D and III-E below. $\mathcal{L}_{\text{disc}}(\theta_{f},\theta_{d})$ is the adversarial loss for the discriminator, which is trained to distinguish the sample features characterized from the source and target domains. $\theta_{f}$ and $\theta_{d}$ are the parameters of $f\left(\cdot\right)$ and $d\left(\cdot\right)$. $\lambda$ is a balancing hyperparameter for ensuring the stability of the domain adversarial training, which is given by an exponential growth schedule as

\lambda=\frac{2}{1+\exp(-p)}-1. (3)

Here, $p$ is a factor related to the training progress, given by the ratio of the current training round to the maximum number of training rounds.
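To make the adversarial training step concrete, the following sketch (in PyTorch, matching the implementation environment described in Section IV-B) shows one possible realization of the gradient reversal layer, the discriminator loss of Eq. 1, and the $\lambda$ schedule of Eq. 3. It is an illustrative reconstruction rather than the authors' released code, and all class and function names are our own.

```python
import math
import torch
import torch.nn as nn

class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; reverses (and scales) gradients in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # The feature extractor receives -lam * dL_disc/dfeatures, realizing the min-max of Eq. 2.
        return -ctx.lam * grad_output, None

def domain_adversarial_loss(d, feat_s, feat_t, lam):
    """Mean-reduced two-category cross-entropy corresponding to Eq. 1, on gradient-reversed features."""
    bce = nn.BCELoss()
    pred_s = d(GradientReversal.apply(feat_s, lam))  # should be classified as source (label 1)
    pred_t = d(GradientReversal.apply(feat_t, lam))  # should be classified as target (label 0)
    return bce(pred_s, torch.ones_like(pred_s)) + bce(pred_t, torch.zeros_like(pred_t))

def lambda_schedule(epoch, max_epoch):
    """Eq. 3: lambda grows smoothly from 0 as the training progress p goes from 0 to 1."""
    p = epoch / max_epoch
    return 2.0 / (1.0 + math.exp(-p)) - 1.0
```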

III-B Prototype feature extraction

We assume that there exists a prototype for each emotion category. Through prototypical learning, the prototype features are learned to indicate the representative property of every single emotion category. The sample features extracted from different subjects under different emotions can be considered to be distributed around the prototype features. In other words, for each emotion category, the prototype feature can be considered the "center of mass" of all the sample features. From the perspective of a probability distribution, the prototype feature of an emotion category can be regarded as the mean of the sample feature distribution of that emotion, and the variance of the distribution is caused by the non-stationarity of EEG, including but not limited to individual differences. Assume that the sample features under an emotion category $c$ obey the Gaussian distribution $N\left(\mu_{c},\sigma_{c}^{2}\right)$. The prototype feature of the emotion category $c$ can then be calculated as the mean vector $\mu_{c}$ of the distribution. For the source domain data $\{X_{s},Y_{s}\}=\left\{\left(x_{i},y_{i}\right)\right\}_{i=0}^{N_{s}}$, the corresponding sample features are characterized by the feature extractor $f\left(\cdot\right)$, given as $f(x_{i})$ (defined in Section III-A). The prototype feature vector of the emotion category $c$ can be calculated by averaging all the sample features belonging to this category, given as

\mu_{c}=\frac{1}{\left|X_{s}^{c}\right|}\sum_{x_{i}^{s}\in X_{s}^{c}}f\left(x_{i}^{s}\right), (4)

where $X_{s}^{c}=\left\{\left(x_{i}^{s},y_{i}^{s}=c\right)\right\}_{i=0}^{N}$ is the collection of source domain data belonging to the emotion category $c$, and $\left|X_{s}^{c}\right|$ is the corresponding sample size of this emotion category. In other words, $\mu_{c}$ can be expressed as the centroid of the sample features of $X_{s}^{c}$. Averaging is a widely used, simple, and effective noise reduction strategy, which makes the prototype features more robust than the individual sample features and helps alleviate the susceptibility of the traditional DANN network to related noise interference [31]. It is worth noting that, since the calculation of the prototype features requires emotion label information, we only use the source domain data for prototype feature extraction.
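As a minimal illustration of Eq. 4, the prototype matrix can be computed from a labeled source batch as below; the function name and tensor shapes are our own assumptions, and every emotion category is assumed to be present in the batch.

```python
import torch

def compute_prototypes(feat_s: torch.Tensor, y_s: torch.Tensor, n_classes: int) -> torch.Tensor:
    """feat_s: (N_s, d) source sample features f(x_i^s); y_s: (N_s,) integer emotion labels.
    Returns an (n_classes, d) matrix whose c-th row is the centroid mu_c of Eq. 4."""
    protos = []
    for c in range(n_classes):
        protos.append(feat_s[y_s == c].mean(dim=0))  # mean over all source samples of category c
    return torch.stack(protos, dim=0)
```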

III-C Interaction feature extraction

The domain-invariant shared feature representations extracted by the traditional DANN can easily be contaminated by the shared and related noises in the source and target domains. In this paper, we introduce a bilinear interaction to measure the interaction relationships between the sample features and prototype features, and extract the interaction features used in the following pairwise learning. For a given $d$-dimensional sample feature $f(x_{i})$, its interaction relationship with a certain prototype feature $\mu_{c}$ can be measured by a bilinear transformation $h(\cdot)$, defined as

h\left(f\left(x_{i}\right),\mu_{c}\right)=f\left(x_{i}\right)^{T}S\mu_{c}, (5)

where $S\in\mathbb{R}^{d\times d}$ is a bilinear transformation matrix with trainable parameters, which is not restricted to be symmetric or positive definite. Suppose there are $n$ emotion categories in total; the interaction measurement between a certain sample feature $f\left(x_{i}\right)$ and the different prototype features can then be represented as

l_{i}=\operatorname{softmax}\left(\left[h\left(f\left(x_{i}^{s}\right),\mu_{1}\right),\ldots,h\left(f\left(x_{i}^{s}\right),\mu_{n}\right)\right]\right), (6)

where $\{\mu_{i}\}_{i=1}^{n}$ are the prototype features of all emotion categories. The softmax activation function is introduced here to add nonlinearity to the interaction measurement, enhance the feature representation ability, and at the same time allow the feature vector $l$ to have category prediction capabilities.
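A possible PyTorch sketch of the bilinear interaction of Eqs. 5-6 is shown below; the module name, initialization scale, and tensor layout are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class BilinearInteraction(nn.Module):
    """Trainable bilinear form between sample features and prototypes (Eq. 5),
    followed by a softmax over emotion categories (Eq. 6)."""
    def __init__(self, dim: int):
        super().__init__()
        # S is a (d x d) matrix, not constrained to be symmetric or positive definite.
        self.S = nn.Parameter(0.01 * torch.randn(dim, dim))

    def forward(self, feats: torch.Tensor, protos: torch.Tensor) -> torch.Tensor:
        """feats: (N, d) sample features; protos: (n, d) prototypes.
        Returns the (N, n) interaction features l."""
        scores = feats @ self.S @ protos.t()   # h(f(x_i), mu_c) for every (i, c) pair
        return torch.softmax(scores, dim=1)
```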

III-D Pairwise learning on source domain

Traditional pointwise learning algorithms often regard the feature vector $l$ as the predicted label of the sample $x_{i}$ and use the cross-entropy loss function to match the label $y_{i}$ of $x_{i}$ in supervised training. Then, if the sample features of one EEG trial match the prototype features of the $j$th emotion category the most, the EEG trial is assigned to the $j$th emotion category. However, this type of pointwise learning only focuses on the relationship between sample features and prototype features and ignores the relationships among different sample features. For example, the sample features belonging to different emotion categories should be separated from each other, and the sample features belonging to the same emotion category should be gathered together. To tackle this issue, we introduce pairwise learning to capture the inherent relationships among samples.

For the source domain data $\{X_{s},Y_{s}\}=\{(x_{i}^{s},y_{i}^{s})\}_{i=1}^{N_{s}}$, the corresponding loss function for pairwise learning is defined as

\mathcal{L}_{\text{class}}(\theta)=\sum_{i,j}L\left(r_{ij}^{s},g\left(x_{i}^{s},x_{j}^{s};\theta\right)\right), (7)

where $g\left(x_{i}^{s},x_{j}^{s};\theta\right)$ is the similarity measurement of the samples $x_{i}^{s}$ and $x_{j}^{s}$ with parameters $\theta$. According to the assumption of pairwise learning, if $y_{i}^{s}=y_{j}^{s}$, then $r_{ij}^{s}=1$; otherwise $r_{ij}^{s}=0$. The loss function $L\left(\cdot\right)$ measures the difference between $r_{ij}^{s}$ and $g\left(x_{i}^{s},x_{j}^{s};\theta\right)$ and is given by a two-category cross-entropy loss as

L\left(r_{ij}^{s},g\left(x_{i}^{s},x_{j}^{s};\theta\right)\right)=-r_{ij}^{s}\log\left(g\left(x_{i}^{s},x_{j}^{s};\theta\right)\right)-\left(1-r_{ij}^{s}\right)\log\left(1-g\left(x_{i}^{s},x_{j}^{s};\theta\right)\right). (8)

In the training process, the label information of the source domain data is used to define $r$ in a supervised manner. In other words, based on the given $Y_{s}$, if two samples belong to the same emotion category, then $r=1$; otherwise $r=0$. A supervised $r$ ensures the stability of the training process and the generalization ability of the model. The next key question is how to define a proper $g\left(x_{i}^{s},x_{j}^{s};\theta\right)$ to compute the similarity between $x_{i}^{s}$ and $x_{j}^{s}$ in terms of the characterized interaction features (termed $l_{i}$ and $l_{j}$). To make the similarity results lie in the range $\left[0,1\right]$ and to extract better and more robust feature representations for subsequent emotion recognition, we add a norm restriction on $l$ as

l^{norm}=\frac{l}{\left\|l\right\|_{2}}. (9)

The similarity of $l_{i}^{norm}$ and $l_{j}^{norm}$ is then calculated as the cosine similarity, given as

g\left(x_{i}^{s},x_{j}^{s};\theta\right)=l_{i}^{norm}\cdot l_{j}^{norm}=\frac{l_{i}^{s}\cdot l_{j}^{s}}{\left\|l_{i}^{s}\right\|_{2}\left\|l_{j}^{s}\right\|_{2}}, (10)

where $\cdot$ refers to the inner product operation. As stated in Chang et al. [32], the above norm restriction gives the vector $l$ a clustering function, with the elements of the vector representing the probability that the feature belongs to a certain category cluster. Overall, the objective function of pairwise learning on the source domain is defined as

\mathcal{L}^{s}_{\text{pairwise}}(\theta)=\sum_{i,j}L\left(r_{ij}^{s},\frac{l_{i}^{s}\cdot l_{j}^{s}}{\left\|l_{i}^{s}\right\|_{2}\left\|l_{j}^{s}\right\|_{2}}\right)+\beta\mathcal{R}, (11)

which is termed the supervised pairwise loss in Fig. 1. Here, $\theta=\left\{\theta_{f},S\right\}$, where $\theta_{f}$ is the parameter of the feature extractor $f\left(\cdot\right)$ and $S$ is the bilinear transformation matrix defined for interaction feature extraction. Besides, to avoid redundant feature extraction, a soft regularization term $\mathcal{R}$ is introduced with a weight $\beta$, defined as

\mathcal{R}=\left\|P^{T}P-I\right\|_{F}. (12)

Here, each row of the matrix $P$ is the prototype feature of one emotion category, $\left\|\cdot\right\|_{F}$ is the Frobenius norm of a matrix, and $I$ is the identity matrix. The above loss function (Eq. 11) can be interpreted as a clustering loss instead of an emotion category classification loss. The main optimization goal is to gather the EEG samples that may belong to the same emotion category and separate the EEG samples that do not. The vector $l$ is the characterized interaction feature, which maps the sample into a non-linear informative feature space by measuring the interaction relationship between the sample features and all the available prototype features.
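The following PyTorch sketch illustrates Eqs. 8-12 on a labeled source mini-batch; it is a hedged reconstruction (not the released implementation), and the function name, the averaging over pairs, and the numerical clamping are our own choices.

```python
import torch
import torch.nn.functional as F

def pairwise_loss_source(l_s, y_s, protos, beta=0.01, eps=1e-7):
    """l_s: (N, n) interaction features; y_s: (N,) labels; protos: (n, d) prototype matrix P."""
    l_norm = F.normalize(l_s, p=2, dim=1)                             # Eq. 9
    sim = (l_norm @ l_norm.t()).clamp(eps, 1.0 - eps)                 # cosine similarity, Eq. 10
    r = (y_s.unsqueeze(0) == y_s.unsqueeze(1)).float()                # r_ij = 1 iff same emotion label
    bce = -(r * sim.log() + (1.0 - r) * (1.0 - sim).log()).mean()     # Eq. 8, averaged over pairs
    eye = torch.eye(protos.size(1), device=protos.device)
    reg = torch.norm(protos.t() @ protos - eye, p="fro")              # soft regularization R of Eq. 12
    return bce + beta * reg                                           # Eq. 11
```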

III-E Pairwise learning on target domain

In this paper, besides the source domain, we also introduce pairwise learning on the target domain to improve the feature separability in the target domain, as

\mathcal{L}_{\text{pairwise}}^{t}\left(\theta_{f},S\right)=\sum_{i,j}L\left(r_{ij}^{t},\frac{l_{i}^{t}\cdot l_{j}^{t}}{\left\|l_{i}^{t}\right\|_{2}\left\|l_{j}^{t}\right\|_{2}}\right), (13)

which is termed the unsupervised pairwise loss in Fig. 1. Here, $l^{t}$ denotes the interaction features of the target domain data characterized by Eq. 6. The scalar $r_{ij}^{t}$ symbolizes the pairing relationship of the samples in the target domain. Since the label information of the target domain is completely missing during training, we cannot obtain the pairing relationship as accurately as in the source domain. To address this issue, we introduce an adaptive thresholding method to generate valid pseudo labels and define the pairing relationship of the target domain data. Specifically, $r_{ij}^{t}$ is defined as

r_{ij}^{t}:=\begin{cases}1, & \text{if } l_{i}^{t}\cdot l_{j}^{t}\geq\tau_{u}\\ 0, & \text{if } l_{i}^{t}\cdot l_{j}^{t}<\tau_{l},\end{cases}\quad i,j=1,\cdots,n, (14)

where $\tau_{u}$ and $\tau_{l}$ are the upper and lower bounds used to select valid pairs with high confidence for unsupervised pairwise learning (valid pair selection). If the calculated pairwise similarity is higher than or equal to the upper bound ($\tau_{u}$), the corresponding pseudo label $r_{ij}^{t}$ is assigned to 1; if the calculated pairwise similarity is lower than the lower bound ($\tau_{l}$), the corresponding pseudo label $r_{ij}^{t}$ is assigned to 0. For the other pairs that meet neither threshold, we consider that the model is uncertain about whether the sample pair is paired and treat these pairs as invalid. In order to prevent incorrect optimization, the invalid pairs are temporarily excluded and do not participate in training and loss calculations in the current training round.

In the early training stage, the classification performance in the target domain is not good enough. In order to ensure training stability, we set a strict upper threshold ($\tau_{u}$) and lower threshold ($\tau_{l}$) and exclude most of the pairs in the target domain. As the model performance on the target domain improves, we gradually lower the upper threshold and raise the lower threshold, allowing more samples to participate in training. In other words, as training proceeds, more sample pairs in the target domain are included for model learning. Here, we form a non-linear dynamic update for the thresholds, as

\tau_{h}^{t}=\tau_{h}^{t-1}-\frac{\tau_{h}^{t-1}-\tau_{l}^{t-1}}{maxepoch}, (15)
\tau_{l}^{t}=\tau_{l}^{t-1}+\frac{\tau_{h}^{t-1}-\tau_{l}^{t-1}}{maxepoch}, (16)

where $\tau_{h}^{t}$ represents the upper threshold at the current training round $t$, $\tau_{l}^{t}$ represents the current lower threshold, and $maxepoch$ is the maximum number of training rounds. Based on the given initial values $\tau_{h}^{0}$ and $\tau_{l}^{0}$, $\tau_{h}^{t}$ and $\tau_{l}^{t}$ can be expressed in closed form as non-linear functions of the training round:

\left\{\begin{array}{l}\tau_{h}^{t}=\tau_{h}^{0}-\frac{\tau_{h}^{0}-\tau_{l}^{0}}{2}\times\left(1-\left(1-\frac{2}{maxepoch}\right)^{t}\right)\\ \tau_{l}^{t}=\tau_{l}^{0}+\frac{\tau_{h}^{0}-\tau_{l}^{0}}{2}\times\left(1-\left(1-\frac{2}{maxepoch}\right)^{t}\right).\end{array}\right. (17)
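A hedged sketch of the valid-pair selection of Eq. 14 and the per-epoch threshold update of Eqs. 15-16 is given below; the similarity is computed on the normalized interaction features as in Eq. 13, and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def select_target_pairs(l_t, tau_u, tau_l):
    """l_t: (N, n) target interaction features. Returns (r, valid) where r holds the
    pseudo pairing labels (Eq. 14) and valid marks the confident pairs kept for training."""
    l_norm = F.normalize(l_t, p=2, dim=1)
    sim = l_norm @ l_norm.t()
    r = (sim >= tau_u).float()              # confident positive pairs
    valid = (sim >= tau_u) | (sim < tau_l)  # uncertain pairs are excluded in this round
    return r, valid

def update_thresholds(tau_u, tau_l, max_epoch):
    """One step of Eqs. 15-16: the two bounds move towards each other each epoch."""
    gap = tau_u - tau_l
    return tau_u - gap / max_epoch, tau_l + gap / max_epoch
```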

In all, combining the domain adversarial loss (Eq. 2), the pairwise learning loss in the source domain (Eq. 11), and the pairwise learning loss in the target domain (Eq. 13), the final objective function of PR-PL is given as follows:

\min_{\theta_{f},S}\max_{\theta_{d}}\ \mathcal{L}_{\text{pairwise}}^{s}\left(\theta_{f},S\right)+\gamma\mathcal{L}_{\text{pairwise}}^{t}\left(\theta_{f},S\right)-\lambda\mathcal{L}_{\text{disc}}\left(\theta_{f},\theta_{d}\right), (18)

where $\gamma$ is a hyperparameter controlling the importance of the pairing loss on the target domain, given as $\gamma=\delta\times\frac{epoch}{maxepoch}$. Here, $epoch$ is the current training round and $maxepoch$ is the maximum number of training rounds. Empirically, $\delta$ is set to 2.
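Putting the pieces together, one possible training step for the overall objective of Eq. 18 is sketched below. It reuses the illustrative helpers defined in the earlier sketches (compute_prototypes, pairwise_loss_source, select_target_pairs, domain_adversarial_loss, lambda_schedule) and is an assumption-laden reconstruction rather than the authors' code; with the gradient reversal layer, a single minimization implements the min-max of Eq. 18.

```python
import torch
import torch.nn.functional as F

def pr_pl_step(x_s, y_s, x_t, f, d, bilinear, optimizer,
               epoch, max_epoch, tau_u, tau_l, n_classes=3, delta=2.0, beta=0.01, eps=1e-7):
    feat_s, feat_t = f(x_s), f(x_t)                                  # sample features
    protos = compute_prototypes(feat_s, y_s, n_classes)              # Eq. 4
    l_s, l_t = bilinear(feat_s, protos), bilinear(feat_t, protos)    # Eq. 6

    loss_src = pairwise_loss_source(l_s, y_s, protos, beta)          # Eq. 11 (supervised pairwise loss)

    r_t, valid = select_target_pairs(l_t.detach(), tau_u, tau_l)     # Eq. 14, pseudo labels as constants
    l_t_norm = F.normalize(l_t, p=2, dim=1)
    sim_t = (l_t_norm @ l_t_norm.t()).clamp(eps, 1.0 - eps)
    bce_t = -(r_t * sim_t.log() + (1.0 - r_t) * (1.0 - sim_t).log())
    loss_tgt = (bce_t * valid.float()).sum() / valid.float().sum().clamp(min=1.0)  # Eq. 13, valid pairs only

    lam = lambda_schedule(epoch, max_epoch)                          # Eq. 3
    gamma = delta * epoch / max_epoch                                # growing weight of the target loss
    loss_disc = domain_adversarial_loss(d, feat_s, feat_t, lam)      # Eq. 1, via gradient reversal

    loss = loss_src + gamma * loss_tgt + loss_disc                   # Eq. 18
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```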

IV Experimental Results

IV-A Emotional Databases and Data Preprocessing

To have a fair comparison with the state-of-the-art methods, we validate our proposed model on two well-known public databases: SEED [13] and SEED-IV [20]. In the SEED database [13], a total of 15 subjects were invited. Each subject performed three sessions on different days, and each session contained 15 trials. A total of three emotions were elicited (negative, neutral, and positive). In the SEED-IV database [20], a total of 15 subjects participated in the experiment. For each subject, a total of 3 sessions were performed on different days, and each session contained 24 trials. A total of four emotions were elicited (happiness, sadness, fear, and neutral). The EEG signals of both the SEED and SEED-IV databases were simultaneously collected using the 62-channel ESI Neuroscan system.

For EEG preprocessing, the data were first downsampled to 200 Hz, and the noise contamination (e.g., EMG and EOG artifacts) was manually removed. Then, the data were filtered by a band-pass filter of 0.3 Hz to 50 Hz. For each trial, the data were divided into a number of segments with a length of 1 s. Based on the five pre-defined frequency bands, Delta (1-3 Hz), Theta (4-7 Hz), Alpha (8-13 Hz), Beta (14-30 Hz), and Gamma (31-50 Hz), the corresponding differential entropy (DE) features were extracted to represent the logarithmic energy spectrum in a specific frequency band, and in total 310 features (5 frequency bands $\times$ 62 channels) were obtained for each EEG segment. Then, all the features were smoothed with the linear dynamic system (LDS) method, which utilizes the time dependency of emotion changes and filters out emotion-unrelated and noisy EEG components [33].
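For illustration, a minimal DE feature extraction for one 1 s segment could look like the sketch below, assuming band-limited EEG is approximately Gaussian so that the differential entropy reduces to $\frac{1}{2}\log(2\pi e\sigma^{2})$; the filter order and helper names are our own assumptions, and the LDS smoothing step is omitted.

```python
import numpy as np
from scipy.signal import butter, filtfilt

BANDS = {"delta": (1, 3), "theta": (4, 7), "alpha": (8, 13),
         "beta": (14, 30), "gamma": (31, 50)}

def de_features(segment: np.ndarray, fs: int = 200) -> np.ndarray:
    """segment: (n_channels, n_samples) 1-second EEG segment.
    Returns a (n_channels * 5,) DE vector (62 channels x 5 bands = 310 for SEED)."""
    feats = []
    for low, high in BANDS.values():
        b, a = butter(4, [low / (fs / 2), high / (fs / 2)], btype="band")
        band = filtfilt(b, a, segment, axis=-1)             # band-pass filtered signal
        var = band.var(axis=-1) + 1e-12                     # per-channel band power (variance)
        feats.append(0.5 * np.log(2 * np.pi * np.e * var))  # differential entropy per channel
    return np.concatenate(feats)
```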

IV-B Implementation Details

In our experiments, the feature extractor $f$ and the discriminator $d$ are both multilayer perceptrons (MLPs) with ReLU activation functions. All the parameters are randomly initialized from a uniform distribution. The bilinear operator matrix $S$ is also randomly initialized. In the model architecture, the feature extractor structure is designed as 310 (input layer)-64 (hidden layer 1)-ReLU activation-64 (hidden layer 2)-ReLU activation-64 (output feature layer). The discriminator structure is designed as 64 (input layer)-64 (hidden layer 1)-ReLU activation-dropout layer-64 (hidden layer 2)-1 (output layer)-Sigmoid activation. The size of the matrix $S$ given in Eq. 5 is $64\times 64$. Besides, we adopt an RMSprop optimizer for network training, which shows better performance than the other classic optimizers. The learning rate is set to 1e-3 and the mini-batch size for training is 96. To avoid overfitting, we use $L2$ regularization (1e-5) in the networks. The regularization coefficient $\beta$ in Eq. 11 is 0.01. The balance parameter $\gamma$ for pairwise learning on the target domain in Eq. 18 is controlled by a constant factor $\delta$ of 2. The thresholds $\tau_{h}^{0}$ and $\tau_{l}^{0}$ are set to 0.9 and 0.5, respectively. All the models are trained on an NVIDIA GeForce RTX 2080 GPU with CUDA 10.0, using the PyTorch API.
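The layer specification above can be written down directly in PyTorch as follows; the dropout probability and the joint optimizer setup are assumptions not stated in the text.

```python
import torch
import torch.nn as nn

# Feature extractor f: 310 -> 64 -> ReLU -> 64 -> ReLU -> 64
feature_extractor = nn.Sequential(
    nn.Linear(310, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 64),
)

# Discriminator d: 64 -> 64 -> ReLU -> dropout -> 64 -> 1 -> Sigmoid
discriminator = nn.Sequential(
    nn.Linear(64, 64), nn.ReLU(), nn.Dropout(p=0.5),   # dropout rate assumed
    nn.Linear(64, 64),
    nn.Linear(64, 1), nn.Sigmoid(),
)

bilinear_S = nn.Parameter(torch.randn(64, 64))  # bilinear matrix S of Eq. 5

optimizer = torch.optim.RMSprop(
    list(feature_extractor.parameters()) + list(discriminator.parameters()) + [bilinear_S],
    lr=1e-3, weight_decay=1e-5)  # L2 regularization of 1e-5, as described above
```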

IV-C Experiment Protocols

To fully evaluate the robustness and stability of the proposed model and compare it with the existing literature, we validate PR-PL under four different validation protocols. (1) Cross-subject cross-session leave-one-subject-out cross-validation. We evaluate the model with a strict cross-subject cross-session leave-one-subject-out scheme to fully estimate the model robustness on unknown subjects and sessions. All sessions of one subject are used as the target and all sessions of the remaining subjects are used as the source. We repeat the training and validation until every subject has served as the target once. Due to the variations across individuals and sessions, this evaluation protocol poses a great challenge to the model's effectiveness in EEG-based emotion recognition tasks. (2) Cross-subject single-session leave-one-subject-out cross-validation. This is the most widely used validation scheme in EEG-based emotion recognition tasks [34, 27, 5, 28]. One session of one subject is treated as the target and the remaining subjects are used as the source. We repeat the training and validation process until every subject has served as the target once. As in the other studies, we only consider the first session in this type of cross-validation. (3) Within-subject cross-session leave-one-session-out cross-validation. Similar to the existing methods, a time-series cross-validation method is adopted here, where past data are used to predict current or future data. For each subject, the first two sessions are used as the source and the last session is used as the target. The average accuracies and standard deviations across subjects are calculated as the final results. (4) Within-subject single-session cross-validation. Following the validation protocol presented in the existing studies [13, 20], for each session of one subject, we use the first 9 (SEED) or 16 (SEED-IV) trials as the source and the remaining 6 (SEED) or 8 (SEED-IV) trials as the target. The results are reported as the average performance across all the subjects. In the following performance comparison across the four validation protocols, the model results reproduced by us are indicated by '*'.
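As an illustration of protocol (2), a leave-one-subject-out loop can be organized as in the sketch below, where each subject is in turn the unlabeled target and all remaining subjects form the labeled source; the data dictionary and the train_and_evaluate trainer are hypothetical placeholders.

```python
import numpy as np

def leave_one_subject_out(data, train_and_evaluate):
    """data: dict mapping subject id -> (features, labels) for one session.
    train_and_evaluate(x_s, y_s, x_t, y_t) -> target-domain accuracy (hypothetical trainer)."""
    accuracies = []
    for target_id in data:
        x_t, y_t = data[target_id]                              # held-out target subject
        source = [data[s] for s in data if s != target_id]      # all remaining subjects
        x_s = np.concatenate([x for x, _ in source])
        y_s = np.concatenate([y for _, y in source])
        accuracies.append(train_and_evaluate(x_s, y_s, x_t, y_t))
    return float(np.mean(accuracies)), float(np.std(accuracies))
```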

IV-D Cross-subject cross-session leave-one-subject-out cross-validation results

To verify the model efficiency and stability under both cross-subject and cross-session conditions, we evaluate the proposed PR-PL using cross-subject cross-session leave-one-subject-out cross-validation on both the SEED and SEED-IV databases. As reported in Table II and Table III, the results show that our proposed model achieves the highest results, where the emotion recognition performance of PR-PL is 85.56%$\pm$4.78% for the three-class classification task on SEED and 74.92%$\pm$7.92% for the four-class classification task on SEED-IV. Compared to the existing studies, the proposed PR-PL increases the classification accuracy by 3.39% and 1.08% for SEED and SEED-IV, respectively, with smaller standard deviations. These results demonstrate that the proposed PR-PL is more effective, with higher recognition accuracy and better generalizability.

TABLE II: The mean accuracies (%) and standard deviations (%) of emotion recognition on SEED database using cross-subject cross-session leave-one-subject-out cross-validation. Here, the model results reproduced by us are indicated by ‘*’.
Methods $P_{acc}$ Methods $P_{acc}$
Traditional machine learning methods
RF* [35] 69.60$\pm$07.64 KNN* [36] 60.66$\pm$07.93
SVM* [37] 68.15$\pm$07.38 Adaboost* [38] 71.87$\pm$05.70
TCA* [21] 64.02$\pm$07.96 CORAL* [39] 68.15$\pm$07.83
SA* [SA2013] 61.41$\pm$09.75 GFK* [40] 66.02$\pm$07.59
Deep learning methods
DCORAL* [41] 81.97$\pm$05.16 DAN* [42] 81.04$\pm$05.32
DDC* [43] 82.17$\pm$04.96 DANN* [23] 81.08$\pm$05.88
PR-PL 85.56$\pm$04.78
TABLE III: The mean accuracies (%) and standard deviations (%) of emotion recognition on SEED-IV database using cross-subject cross-session leave-one-subject-out cross-validation. Here, the model results reproduced by us are indicated by ‘*’.
Methods $P_{acc}$ Methods $P_{acc}$
Traditional machine learning methods
RF* [35] 50.98$\pm$09.20 KNN* [36] 40.83$\pm$07.28
SVM [24] 51.78$\pm$12.85 Adaboost* [38] 53.44$\pm$09.12
TCA [44] 56.56$\pm$13.77 CORAL* [39] 49.44$\pm$09.09
SA [44] 64.44$\pm$09.46 GFK* [40] 45.89$\pm$08.27
KPCA [24] 51.76$\pm$12.89 DNN [24] 49.35$\pm$09.74
Deep learning methods
DGCNN [45] 52.82$\pm$09.23 DAN [24] 58.87$\pm$08.13
RGNN [8] 73.84$\pm$08.02 BiHDM [44] 69.03$\pm$08.66
BiDANN [25] 65.59$\pm$10.39 DANN [24] 54.63$\pm$08.03
PR-PL 74.92$\pm$07.92

IV-E Cross-subject single-session leave-one-subject-out cross-validation results

Table IV summarizes the model results for the cross-subject single-session leave-one-subject-out recognition task and compares the performance with the literature. All the results are presented in terms of mean$\pm$standard deviation. The results show that our proposed model (PR-PL) achieves the best performance (93.06%), with a standard deviation of 5.12%. PR-PL leads by 2.14% over the best result reported in the literature. In particular, compared to the latest DANN-based deep transfer learning networks (e.g., ATDD-DANN [26], R2G-STNN [6], BiHDM [44], BiDANN [25], and WGAN-GP [27]), the proposed PR-PL with pairwise learning avoids the inherent defects of the DANN design (e.g., considering feature separability only in the source domain) and well addresses the individual differences and noisy labeling issues in aBCI applications.

TABLE IV: The mean accuracies (%) and standard deviations (%) of emotion recognition on SEED database using cross-subject single-session leave-one-subject-out cross-validation. Here, the model results reproduced by us are indicated by ‘*’.
Methods $P_{acc}$ Methods $P_{acc}$
Traditional machine learning methods
TKL [25] 63.54$\pm$15.47 T-SVM [25] 72.53$\pm$14.00
TCA [44] 63.64$\pm$14.88 TPT [24] 75.17$\pm$12.83
KPCA [24] 61.28$\pm$14.62 GFK [44] 71.31$\pm$14.09
SA [44] 69.00$\pm$10.89 DICA [46] 69.40$\pm$07.80
DNN [24] 61.01$\pm$12.38 SVM [24] 58.18$\pm$13.85
Deep learning methods
DGCNN [45] 79.95$\pm$09.02 DAN [24] 83.81$\pm$08.56
RGNN [8] 85.30$\pm$06.72 BiHDM [44] 85.40$\pm$07.53
WGAN-GP [27] 87.10$\pm$07.10 MMD [28] 80.88$\pm$10.10
ATDD-DANN [26] 90.92$\pm$01.05 JDA-Net [28] 88.28$\pm$11.44
R2G-STNN [6] 84.16$\pm$07.63 SimNet* [31] 81.58$\pm$05.11
BiDANN [25] 83.28$\pm$09.60 DResNet [46] 85.30$\pm$08.00
ADA [28] 84.47$\pm$10.65 DANN [28] 81.65$\pm$09.92
PR-PL 93.06$\pm$05.12

IV-F Within-subject cross-session cross-validation results

By calculating the average and standard deviation of the experimental results of each subject, the final within-subject cross-session cross-validation results are reported in Table V for the SEED database and Table VI for the SEED-IV database. For both databases, our proposed PR-PL achieves the highest recognition performance compared with the state-of-the-art methods (including both traditional machine learning methods and deep learning methods), where the results are 93.18%$\pm$6.55% and 74.62%$\pm$14.15% for SEED (three-class emotion recognition) and SEED-IV (four-class emotion recognition), respectively.

TABLE V: The mean accuracies (%) and standard deviations (%) of emotion recognition on SEED database using within-subject cross-session cross-validation. Here, the model results reproduced by us are indicated by ‘*’.
Methods $P_{acc}$ Methods $P_{acc}$
Traditional machine learning methods
RF* [35] 76.42$\pm$11.15 KNN* [36] 75.68$\pm$13.82
TCA* [21] 74.27$\pm$12.88 CORAL* [39] 84.18$\pm$09.81
SA* [SA2013] 69.84$\pm$09.46 GFK* [40] 78.79$\pm$09.39
Deep learning methods
DAN* [42] 89.16$\pm$07.90 SimNet* [31] 86.88$\pm$07.83
DDC* [43] 91.14$\pm$05.61 ADA [28] 89.13$\pm$07.13
DANN* [23] 89.45$\pm$06.74 MMD [28] 84.38$\pm$12.05
JDA-Net [28] 91.17$\pm$08.11 DCORAL* [41] 88.67$\pm$06.25
PR-PL 93.18$\pm$06.55
TABLE VI: The mean accuracies (%) and standard deviations (%) of emotion recognition on SEED-IV database using within-subject cross-session cross-validation. Here, the model results reproduced by us are indicated by ‘*’.
Methods $P_{acc}$ Methods $P_{acc}$
Traditional machine learning methods
RF* [35] 60.27$\pm$16.36 KNN* [36] 54.18$\pm$16.28
TCA* [21] 51.88$\pm$15.84 CORAL* [39] 66.06$\pm$15.13
SA* [SA2013] 52.81$\pm$09.53 GFK* [40] 56.14$\pm$12.15
Deep learning methods
DCORAL [47] 65.10$\pm$13.20 DAN [47] 60.20$\pm$10.20
DDC [47] 68.80$\pm$16.60 MEERNet [47] 72.10$\pm$14.10
PR-PL 74.62$\pm$14.15

IV-G Within-subject single-session cross-validation results

Consistent with the evaluation method of previous studies, which only consider the first two sessions of the SEED database, we present the within-subject single-session results in Table VII. Our proposed model obtains the best recognition performance of 94.84%. Comparing the recognition results between the cross-subject single-session (Table IV) and within-subject single-session (Table VII) emotion recognition tasks, the proposed PR-PL achieves the highest accuracies and at the same time shows the closest results between the two cross-validation methods (cross-subject single-session: 93.06$\pm$05.12; within-subject single-session: 94.84$\pm$09.16). For the other models, such as DGCNN [45], BiDANN [25], R2G-STNN [6], RGNN [8], and BiHDM [44], there exists a significant difference between the cross-subject and within-subject results (9.09% difference on average). This comparison demonstrates the efficiency and reliability of the proposed PR-PL in various emotion recognition applications.

TABLE VII: The mean accuracies (%) and standard deviations (%) of emotion recognition on SEED database using within-subject single-session cross-validation. Here, the model results reproduced by us are indicated by ‘*’.
Methods $P_{acc}$ Methods $P_{acc}$
Traditional machine learning methods
SVM [44] 83.99$\pm$09.72 GRSLR [48] 87.39$\pm$08.64
RF [44] 78.46$\pm$11.77 GSCCA [49] 82.96$\pm$09.95
CCA [44] 77.63$\pm$13.21 DBN [13] 86.08$\pm$08.34
Deep learning methods
DGCNN [50] 90.40$\pm$08.49 RGNN [8] 94.24$\pm$05.95
ATDD-DANN [26] 91.08$\pm$06.43 BiHDM [44] 93.12$\pm$06.06
R2G-STNN [6] 93.38$\pm$05.96 SimNet* [31] 90.13$\pm$10.84
BiDANN [25] 92.38$\pm$07.04 STRNN [51] 89.50$\pm$07.63
GCNN [44] 87.40$\pm$09.20 DANN [44] 91.36$\pm$08.30
PR-PL 94.84$\pm$09.16

For the SEED-IV database, we calculate the performance on all three sessions, as reported in the other studies, and decode emotions into four categories (happiness, sadness, fear, and neutral). As shown in Table VIII, our proposed model outperforms the existing studies, with the highest accuracy of 83.33%, a 3.96% increase over the SOTA (79.37% [8]).

TABLE VIII: The mean accuracies (%) and standard deviations (%) of emotion recognition on SEED-IV database using within-subject single-session cross-validation. Here, the model results reproduced by us are indicated by ‘*’.
Methods $P_{acc}$ Methods $P_{acc}$
Traditional machine learning methods
SVM [44] 56.61$\pm$20.05 GRSLR [48] 69.32$\pm$19.57
RF [44] 50.97$\pm$16.22 GSCCA [49] 69.08$\pm$16.66
CCA [44] 54.47$\pm$18.48 DBN [13] 66.77$\pm$07.38
Deep learning methods
DGCNN [50] 69.88$\pm$16.29 RGNN [8] 79.37$\pm$10.54
GCNN [44] 68.34$\pm$15.42 BiHDM [44] 74.35$\pm$14.09
A-LSTM [44] 69.50$\pm$15.45 SimNet* [31] 71.38$\pm$13.12
BiDANN [25] 70.29$\pm$12.63 DANN [44] 63.07$\pm$12.66
PR-PL 83.33$\pm$10.61

V Discussion and Conclusion

To fully study the model performance, we evaluate the effect of different settings in PR-PL. Note that all the results presented in this section are based on the SEED database using the cross-subject single-session cross-validation evaluation protocol.

V-A Ablation Study

We conduct an ablation study to systematically explore the effectiveness of the different components in the proposed model and examine their contributions to the overall performance. As shown in Table IX, it is found that the introduction of domain adversarial training greatly enhances the emotion recognition performance on the target domain. When the model is trained without the discriminator and target domain information, the recognition accuracy drops from 93.06% to 83.30%. Such a significant drop shows the impact of the individual differences problem on model performance and highlights the great potential of transfer learning in aBCI applications. Besides, the results show that combining pairwise learning on the source and target domains benefits the model performance, where the recognition accuracy is improved from 87.50% to 93.06% (a relative improvement of 6.35%). For the pseudo-labeling method, the corresponding accuracy increases from 89.92% to 92.46% when the pseudo-labeling method changes from fixed to linear dynamic, and further increases to 93.06% when the nonlinear dynamic adaptive pseudo-labeling method is adopted. The results show that non-linear dynamic pseudo-labeling helps screen out valid paired samples and improves the model trainability. For the final loss function given in Eq. 18, instead of using a fixed weight for the pairwise loss on the target domain, we propose to update the weight gradually along with the training epochs to prevent model learning failures in the early training stage and balance the relationships among the different losses. The benefit of a dynamic $\gamma$ is also reflected in the ablation study, where the recognition accuracy increases from 89.47% to 93.06%.

TABLE IX: The ablation study of our proposed model.
Ablation study about training strategy $P_{acc}$
w/o discriminator and target information 83.30$\pm$04.21
w/o pairwise learning on the source and target 87.50$\pm$06.64
w/o pairwise learning on the target 88.81$\pm$06.63
w/o prototypical representation 91.00$\pm$04.65
w/o thresholding for pseudo label generation 92.13$\pm$05.90
About hyperparameter controlling strategy $P_{acc}$
w/ fixed pseudo-labeling 89.92$\pm$07.21
w/ linear dynamic pseudo-labeling 92.46$\pm$04.95
w/ fixed $\gamma$ for target pairwise loss 89.47$\pm$10.22
PR-PL 93.06$\pm$05.12

V-B Effect of Noisy Labels

To further verify the model robustness under noisy label learning, we randomly contaminate the source labels with $\eta$% noise and test the corresponding model performance on unknown target data. Specifically, we replace $\eta$% of the real labels in $Y_{s}$ with randomly generated labels and train the model in a supervised manner. Then, we test the trained model on the target domain. Note that the noise contamination is only conducted on the source domain, as the target domain is used for model evaluation. In the implementation, $\eta$ is set to 10%, 20%, and 30%, respectively. The corresponding model accuracies with standard deviations are 89.22%$\pm$6.05%, 88.39%$\pm$6.73%, and 87.71%$\pm$5.02%. This shows that, as the label noise ratio increases from 10% to 30%, the model performance decreases only slightly, with a relative decrease of 1.69%. These results demonstrate that the proposed PR-PL is a reliable model with a high tolerance to noisy labels. In future work, recently proposed methods, such as [52] and [53], could be incorporated to further eliminate more general noise in EEG signals and improve the model stability in cross-corpus applications.
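A simple way to reproduce this label-contamination setting is sketched below; the function name, the random seed handling, and the possibility that a randomly drawn label coincides with the original are our own assumptions.

```python
import numpy as np

def contaminate_labels(y_s: np.ndarray, eta: float, n_classes: int, seed: int = 0) -> np.ndarray:
    """Return a copy of the source labels y_s in which a fraction eta is replaced
    by randomly generated labels (eta = 0.1, 0.2, 0.3 in the experiments above)."""
    rng = np.random.default_rng(seed)
    y_noisy = y_s.copy()
    idx = rng.choice(len(y_s), size=int(eta * len(y_s)), replace=False)
    y_noisy[idx] = rng.integers(0, n_classes, size=len(idx))  # random replacement labels
    return y_noisy
```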

V-C Confusion Matrices

To qualitatively study the model performance in each emotion class, we visually analyze the confusion matrices and compare the results with the latest models [25, 44, 8]. As shown in Fig. 2, all the models are good at distinguishing positive emotions from the other emotions (the recognition rates are all above 90%), but are relatively poor at distinguishing negative and neutral emotions. For example, the recognition rate of neutral emotions in BiDANN [25] is even lower than 80% (76.72%). Compared to the existing methods ((a), (b), and (c)), our proposed model enhances the recognition ability, especially for distinguishing neutral and negative emotions. As shown in (d), the recognition rates for negative, neutral, and positive emotions are 92.10%, 90.39%, and 96.50%, respectively.

Figure 2: Confusion matrices of different models. (a) BiDANN [25], (b) BiHDM [44], (c) RGNN [8], and (d) PR-PL.

V-D Visualization of Learned Representation

To verify the effectiveness of the proposed model from a more intuitive perspective, we visualize the characterized sample and interaction features of the source and target domains using t-SNE [54] in Fig. 3. Here, we randomly select 500 samples from the source domain and 500 samples from the target domain for visualization of the learned feature representations. Compared to the representations learned by the other model settings (w/o pairwise learning on the source and target, and w/o pairwise learning on the target), the representation learned by the proposed PR-PL forms more clearly separated emotion clusters. Comparing the sample features (c) and interaction features (f) extracted by the proposed PR-PL, the separability of the extracted interaction features across different emotion classes is further enlarged and, at the same time, the concentration of the feature distribution within each emotion is also improved.

Figure 3: t-SNE visualization of the learned features from the source and target domains using different model settings. (a), (b), and (c) are the sample features extracted by w/o pairwise learning on the source and target, w/o pairwise learning on the target, and PR-PL, respectively. (d), (e), and (f) are the interaction features extracted by w/o pairwise learning on the source and target, w/o pairwise learning on the target, and PR-PL, respectively.
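The visualization itself can be produced with a few lines of scikit-learn and matplotlib, as in the hedged sketch below; the perplexity default, color map, and function name are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(feat_s, y_s, feat_t, y_t, n_per_domain=500, seed=0):
    """feat_*: (N, d) learned feature arrays; y_*: (N,) emotion labels."""
    rng = np.random.default_rng(seed)
    idx_s = rng.choice(len(feat_s), n_per_domain, replace=False)
    idx_t = rng.choice(len(feat_t), n_per_domain, replace=False)
    feats = np.concatenate([feat_s[idx_s], feat_t[idx_t]])
    labels = np.concatenate([y_s[idx_s], y_t[idx_t]])
    emb = TSNE(n_components=2, random_state=seed).fit_transform(feats)  # 2-D embedding
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=5, cmap="viridis")
    plt.title("t-SNE of learned features (500 source + 500 target samples)")
    plt.show()
```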

V-E Conclusion

This paper proposes a novel transfer learning framework with prototypical representation-based pairwise learning (PR-PL), which characterizes EEG data with prototypical representations and formulates the EEG-based emotion recognition task as pairwise learning. We evaluate our proposed model on two well-known emotional databases (SEED and SEED-IV) under four cross-validation protocols (cross-subject single-session, within-subject single-session, within-subject cross-session, and cross-subject cross-session) and compare it with the existing state-of-the-art methods. Our extensive experimental results show that PR-PL achieves the best results under all four cross-validation protocols and demonstrate the advantage of PR-PL in tackling the individual differences and noisy labeling issues in aBCI systems.

VI Conflicts of Interest

The authors declare that they have no conflicts of interest.

VII Acknowledgment

This work was supported in part by the National Natural Science Foundation of China under Grant 61906122, in part by Shenzhen-Hong Kong Institute of Brain Science-Shenzhen Fundamental Research Institutions (2021SHIBS0003), in part by the Tencent “Rhinoceros Birds”-Scientific Research Foundation for Young Teachers of Shenzhen University, and in part by the High Level University Construction under Grant 000002110133.

References

  • [1] S. Siddharth, T.-P. Jung, and T. J. Sejnowski, “Utilizing deep learning towards multi-modal bio-sensing and vision-based affective computing,” IEEE Transactions on Affective Computing, 2019.
  • [2] X. Hu, J. Chen, F. Wang, and D. Zhang, “Ten challenges for eeg-based affective computing,” Brain Science Advances, vol. 5, no. 1, pp. 1–20, 2019.
  • [3] W. Hu, G. Huang, L. Li, L. Zhang, Z. Zhang, and Z. Liang, “Video-triggered eeg-emotion public databases and current methods: A survey,” Brain Science Advances, vol. 6, no. 3, pp. 255–287, 2020.
  • [4] V. Jayaram, M. Alamgir, Y. Altun, B. Scholkopf, and M. Grosse-Wentrup, “Transfer learning in brain-computer interfaces,” IEEE Computational Intelligence Magazine, vol. 11, no. 1, pp. 20–31, 2016.
  • [5] J. Li, S. Qiu, Y.-Y. Shen, C.-L. Liu, and H. He, “Multisource transfer learning for cross-subject eeg emotion recognition,” IEEE transactions on cybernetics, vol. 50, no. 7, pp. 3281–3293, 2019.
  • [6] Y. Li, W. Zheng, L. Wang, Y. Zong, and Z. Cui, “From regional to global brain: A novel hierarchical spatial-temporal neural network model for eeg emotion recognition,” IEEE Transactions on Affective Computing, 2019.
  • [7] H. Cui, A. Liu, X. Zhang, X. Chen, K. Wang, and X. Chen, “Eeg-based emotion recognition using an end-to-end regional-asymmetric convolutional neural network,” Knowledge-Based Systems, vol. 205, p. 106243, 2020.
  • [8] P. Zhong, D. Wang, and C. Miao, “Eeg-based emotion recognition using regularized graph neural networks,” IEEE Transactions on Affective Computing, 2020.
  • [9] X. Gu, Z. Cao, A. Jolfaei, P. Xu, D. Wu, T.-P. Jung, and C.-T. Lin, “Eeg-based brain-computer interfaces (bcis): A survey of recent studies on signal sensing technologies and computational intelligence approaches and their applications,” IEEE/ACM transactions on computational biology and bioinformatics, 2021.
  • [10] O. Özdenizci, Y. Wang, T. Koike-Akino, and D. Erdoğmuş, “Adversarial deep learning in eeg biometrics,” IEEE signal processing letters, vol. 26, no. 5, pp. 710–714, 2019.
  • [11] ——, “Learning invariant representations from eeg via adversarial inference,” IEEE access, vol. 8, pp. 27 074–27 085, 2020.
  • [12] D. Bethge, P. Hallgarten, T. Grosse-Puppendahl, M. Kari, R. Mikut, A. Schmidt, and O. Özdenizci, “Domain-invariant representation learning from eeg with private encoders,” arXiv preprint arXiv:2201.11613, 2022.
  • [13] W.-L. Zheng and B.-L. Lu, “Investigating critical frequency bands and channels for eeg-based emotion recognition with deep neural networks,” IEEE Transactions on Autonomous Mental Development, vol. 7, no. 3, pp. 162–175, 2015.
  • [14] Y. Jia, M. Salzmann, and T. Darrell, “Factorized latent spaces with structured sparsity.” in Advances in Neural Information Processing Systems 23: 24th Annual Conference on Neural Information Processing Systems 2010, 01 2010, pp. 982–990.
  • [15] H. Bao, G. Niu, and M. Sugiyama, “Classification from pairwise similarity and unlabeled data,” in International Conference on Machine Learning.   PMLR, 2018, pp. 452–461.
  • [16] H. Bao, T. Shimada, L. Xu, I. Sato, and M. Sugiyama, “Similarity-based classification: Connecting similarity learning to binary classification,” 2020.
  • [17] C.-C. Hsu, Y.-X. Zhuang, and C.-Y. Lee, “Deep fake image detection based on pairwise learning,” Applied Sciences, vol. 10, no. 1, p. 370, 2020.
  • [18] P. Zhuang, Y. Wang, and Y. Qiao, “Learning attentive pairwise interaction for fine-grained classification,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, 2020, pp. 13 130–13 137.
  • [19] L. Yao, S. Li, Y. Li, M. Huai, J. Gao, and A. Zhang, “Representation learning for treatment effect estimation from observational data,” Advances in Neural Information Processing Systems, vol. 31, 2018.
  • [20] W.-L. Zheng, W. Liu, Y. Lu, B.-L. Lu, and A. Cichocki, “Emotionmeter: A multimodal framework for recognizing human emotions,” IEEE transactions on cybernetics, vol. 49, no. 3, pp. 1110–1122, 2018.
  • [21] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang, “Domain adaptation via transfer component analysis,” IEEE Transactions on Neural Networks, vol. 22, no. 2, pp. 199–210, 2011.
  • [22] W.-L. Zheng and B.-L. Lu, “Personalizing eeg-based affective models with transfer learning,” in Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, ser. IJCAI’16.   AAAI Press, 2016, p. 2732–2738.
  • [23] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, “Domain-adversarial training of neural networks,” The journal of machine learning research, vol. 17, no. 1, pp. 2096–2030, 2016.
  • [24] H. Li, Y.-M. Jin, W.-L. Zheng, and B.-L. Lu, “Cross-subject emotion recognition using deep adaptation networks,” in Neural Information Processing, L. Cheng, A. C. S. Leung, and S. Ozawa, Eds.   Cham: Springer International Publishing, 2018, pp. 403–413.
  • [25] Y. Li, W. Zheng, Y. Zong, Z. Cui, T. Zhang, and X. Zhou, “A bi-hemisphere domain adversarial neural network model for eeg emotion recognition,” IEEE Transactions on Affective Computing, 2018.
  • [26] X. Du, C. Ma, G. Zhang, J. Li, Y.-K. Lai, G. Zhao, X. Deng, Y.-J. Liu, and H. Wang, “An efficient lstm network for emotion recognition from multichannel eeg signals,” IEEE Transactions on Affective Computing, pp. 1–1, 2020.
  • [27] Y. Luo, S. Y. Zhang, W. L. Zheng, and B. L. Lu, “Wgan domain adaptation for eeg-based emotion recognition,” International Conference on Neural Information Processing, 2018.
  • [28] J. Li, S. Qiu, C. Du, Y. Wang, and H. He, “Domain adaptation for eeg emotion recognition based on latent representation similarity,” IEEE Transactions on Cognitive and Developmental Systems, vol. 12, no. 2, pp. 344–353, 2020.
  • [29] M. Salzmann, C. H. Ek, R. Urtasun, and T. Darrell, “Factorized orthogonal latent spaces,” Journal of Machine Learning Research, vol. 9, pp. 701–708, 2010.
  • [30] K. Bousmalis, G. Trigeorgis, N. Silberman, D. Krishnan, and D. Erhan, “Domain separation networks,” ser. NIPS’16, 2016, p. 343–351.
  • [31] P. O. Pinheiro, “Unsupervised domain adaptation with similarity learning,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 8004–8013.
  • [32] J. Chang, L. Wang, G. Meng, S. Xiang, and C. Pan, “Deep adaptive image clustering,” in 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 5880–5888.
  • [33] L.-C. Shi and B.-L. Lu, “Off-line and on-line vigilance estimation based on linear dynamical system and manifold learning,” in 2010 Annual International Conference of the IEEE Engineering in Medicine and Biology, 2010, pp. 6587–6590.
  • [34] W.-L. Zheng, Y.-Q. Zhang, J.-Y. Zhu, and B.-L. Lu, “Transfer components between subjects for eeg-based emotion recognition,” in 2015 International Conference on Affective Computing and Intelligent Interaction (ACII), 2015, pp. 917–922.
  • [35] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
  • [36] D. Coomans and D. L. Massart, “Alternative k-nearest neighbour rules in supervised pattern recognition: Part 1. k-nearest neighbour classification by using alternative voting rules,” Analytica Chimica Acta, vol. 136, pp. 15–27, 1982. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0003267001953590
  • [37] J. Suykens and J. Vandewalle, “Least squares support vector machine classifiers,” Neural Processing Letters, vol. 9, no. 3, pp. 293–300, 1999.
  • [38] J. Zhu, H. Zou, S. Rosset, and T. Hastie, “Multi-class adaboost,” Statistics and Its Interface, vol. 2, no. 3, pp. 349–360, 2009.
  • [39] B. Sun, J. Feng, and K. Saenko, “Return of frustratingly easy domain adaptation,” in Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, ser. AAAI’16.   AAAI Press, 2016, p. 2058–2065.
  • [40] B. Gong, Y. Shi, F. Sha, and K. Grauman, “Geodesic flow kernel for unsupervised domain adaptation,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 2066–2073.
  • [41] B. Sun and K. Saenko, “Deep coral: Correlation alignment for deep domain adaptation,” in Computer Vision – ECCV 2016 Workshops, G. Hua and H. Jégou, Eds.   Cham: Springer International Publishing, 2016, pp. 443–450.
  • [42] M. Long, Y. Cao, J. Wang, and M. Jordan, “Learning transferable features with deep adaptation networks,” in International conference on machine learning.   PMLR, 2015, pp. 97–105.
  • [43] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell, “Deep domain confusion: Maximizing for domain invariance,” CoRR, vol. abs/1412.3474, 2014. [Online]. Available: http://arxiv.org/abs/1412.3474
  • [44] Y. Li, L. Wang, W. Zheng, Y. Zong, L. Qi, Z. Cui, T. Zhang, and T. Song, “A novel bi-hemispheric discrepancy model for eeg emotion recognition,” IEEE Transactions on Cognitive and Developmental Systems, vol. 13, no. 2, pp. 354–367, 2020.
  • [45] T. Song, W. Zheng, P. Song, and Z. Cui, “Eeg emotion recognition using dynamical graph convolutional neural networks,” IEEE Transactions on Affective Computing, vol. 11, no. 3, pp. 532–541, 2018.
  • [46] B.-Q. Ma, H. Li, W.-L. Zheng, and B.-L. Lu, “Reducing the subject variability of eeg signals with adversarial domain generalization,” in Neural Information Processing, T. Gedeon, K. W. Wong, and M. Lee, Eds.   Cham: Springer International Publishing, 2019, pp. 30–42.
  • [47] H. Chen, Z. Li, M. Jin, and J. Li, “Meernet: Multi-source eeg-based emotion recognition network for generalization across subjects and sessions,” in 2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC).   IEEE, 2021, pp. 6094–6097.
  • [48] Y. Li, W. Zheng, Z. Cui, Y. Zong, and S. Ge, “Eeg emotion recognition based on graph regularized sparse linear regression,” Neural Processing Letters, vol. 49, p. 1–17, 04 2019.
  • [49] W. Zheng, “Multichannel eeg-based emotion recognition via group sparse canonical correlation analysis,” IEEE Transactions on Cognitive and Developmental Systems, vol. 9, no. 3, pp. 281–290, 2017.
  • [50] T. Song, W. Zheng, P. Song, and Z. Cui, “Eeg emotion recognition using dynamical graph convolutional neural networks,” IEEE Transactions on Affective Computing, vol. 11, no. 3, pp. 532–541, 2020.
  • [51] T. Zhang, W. Zheng, Z. Cui, Y. Zong, and Y. Li, “Spatial–temporal recurrent neural network for emotion recognition,” IEEE Transactions on Cybernetics, vol. 49, no. 3, pp. 839–847, 2019.
  • [52] X. Xiao, M. Xu, J. Jin, Y. Wang, T.-P. Jung, and D. Ming, “Discriminative canonical pattern matching for single-trial classification of erp components,” IEEE Transactions on Biomedical Engineering, vol. 67, no. 8, pp. 2266–2275, 2019.
  • [53] J. Jin, Z. Wang, R. Xu, C. Liu, X. Wang, and A. Cichocki, “Robust similarity measurement based on a novel time filter for ssveps detection,” IEEE Transactions on Neural Networks and Learning Systems, 2021.
  • [54] L. van der Maaten and G. Hinton, “Visualizing data using t-sne,” Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008.