Fixed-MAML for Few Shot Classification in
Multilingual Speech Emotion Recognition

{anugunj.naman, chentan.sinha}@iiitg.ac.in
liliana.mancini@cardiff.ac.uk, Cardiff University, UK
Abstract
In this paper, we analyze the feasibility of applying few-shot learning to the speech emotion recognition (SER) task. Current speech emotion recognition models work exceptionally well but fail when the input is multilingual. Moreover, such models perform well only when trained on a vast corpus, and the availability of a large training corpus is a significant problem for less popular or low-resource languages. We attempt to address both multilingualism and the lack of available data by casting the problem as a few-shot learning problem. We suggest relaxing the assumption that all N classes in an N-way, K-shot problem be new, and define an N+F-way problem where N and F are the numbers of new emotion classes and predefined fixed emotion classes, respectively. We propose this modification to the Model-Agnostic Meta-Learning (MAML) algorithm and call the resulting model F-MAML. This modification performs better than the original MAML and outperforms it on the EmoFilm dataset.
1 Introduction
Emotion recognition plays a significant role in many intelligent interfaces [1]. Even with recent advances in machine learning, it remains a challenging task, mainly because most publicly available annotated datasets in this domain are small in scale, which makes deep learning models prone to over-fitting. Another essential feature of emotion recognition is the inherent multi-modality of emotional expression [2]. Emotional information can be captured by studying many modalities, including facial expressions, body postures, and EEG [3]. Of these, speech is arguably the most accessible. In addition to accessibility, speech signals contain many other emotional cues [4]. We therefore use speech signals as the basis for predicting emotion.
Generally, in the speech emotion recognition (SER) task, conventional supervised learning solves the problem efficiently given sufficient training data. Over several decades, many studies on SER for individual corpora have been conducted using language-dependent optimal acoustic feature sets. Such systems work in monolingual scenarios; changing the source corpus requires re-selecting the optimal acoustic features and re-training the system. Human emotion perception, however, has proved to be cross-lingual, even without understanding of the language used [5], and an SER system is expected to recognize emotions in the same way.
However, for an automatic SER system to recognize emotion, there are two significant problems. First, the training corpus available for many languages is very limited. Second, it is not clear which standard features are efficient for detecting emotions across different cultures. Commonalities and differences in human emotion perception across languages in the valence-activation (V-A) space have recently been studied [5]. It was revealed that the direction and distance from neutral to other emotions are similar across languages, while the neutral positions themselves are language-dependent. In this paper, motivated by the above challenges, we simulate a scenario where one can provide a few labeled speech samples in any language and train a model on that language for a few iterations to obtain a robust SER system. This scenario removes the requirement for a large amount of data and identifies the features that are efficient for detecting emotions in that culture, fine-tuning the model accordingly.
Supervised learning has been extremely successful in computer vision, speech, and machine translation tasks, thanks to improvements in optimization technology, larger datasets, and streamlined designs of deep convolutional and recurrent architectures. Despite these successes, this learning setup does not cover many settings where learning is possible and desirable. One such instance is learning from very few examples in so-called few-shot learning tasks [6]. Rather than depending on regularization to compensate for the lack of training data, researchers have explored ways to leverage the distribution of similar tasks, inspired by human learning [7]. Many useful solutions have been developed, and the most popular ones currently use meta-learning.
Meanwhile, most studies on few-shot learning are conducted on image tasks. Here, we attempt to apply these meta-learning solutions to SER systems. We formulate the problem mentioned above as a few-shot learning problem and analyze the performance of state-of-the-art model-level few-shot learning algorithms.
Meta-learning, also known as 'learning to learn,' aims to achieve quick adaptation to new tasks with only a few examples. Recently, many different meta-learning solutions have been proposed to solve few-shot learning problems. These solutions differ in whether they learn a shared metric [8][9][10][11], a generic inference network [12][13], a shared optimization algorithm [14][15], or a shared initialization for the model parameters [16][17][18]. In this paper, we use the Model-Agnostic Meta-Learning (MAML) approach [16] for the following reasons:
1. It is a model-agnostic general framework that can easily be applied to a new task.
2. It achieves state-of-the-art performance on existing few-shot learning tasks.
Few-shot learning is often defined as an N-way, K-shot problem, where N is the number of class labels in the target task and K is the number of examples of each class. Most previous studies assume that all N classes or labels are new. However, in real-life applications, these classes or labels need not all be new. We therefore further define an N+F-way, K-shot problem, where N and F are the numbers of new classes and fixed classes, respectively. In this newly devised task, the model has to classify among both new classes and fixed classes. We propose a modification to the original MAML algorithm to solve this problem and call the new model F-MAML.
We conduct our experiments on the EmoFilm dataset [19] to simulate a scenario in SER. We compare our approach with two baseline approaches: the conventional supervised learning approach [20] and the MAML-based MetaSER approach [21]. Experimental results show that F-MAML leads to a clear improvement over the supervised learning approach and even performs better than MetaSER. Our contributions in this paper are summarized here:
1. We analyze the feasibility of few-shot learning for training SER models.
2. We propose a more efficient way than MAML (F-MAML) to train future SER models for any language with few training examples.
The rest of the paper is organized as follows. Section 2 discusses the background of our work. Section 3 presents our proposed method. Section 4 describes the experiments in detail and reports the results. Section 5 concludes the paper.
2 Background
In this section, we briefly introduce MAML, which forms the basis of and motivation for our solution.
Model-Agnostic Meta-Learning (MAML) is one of the most popular meta-learning algorithms aiming to solve the few-shot learning problem. The main goal of MAML is to train a model initializer that can adapt to any new task using very few labeled examples and training iterations [16]. To reach this goal, the model is trained across several tasks, treating each entire task as a training example. The model is required to face different tasks so that it gets used to adapting to new tasks. In this section, we describe the MAML training framework. As shown in Figure 1, the optimization procedure consists of two stages: a meta-learning stage on the training tasks and a fine-tuning stage on the testing tasks.

2.1 The meta-learning stage
Given that the target evaluation task is an N-way, K-shot task, the model is trained across a set of tasks, where each task is also an N-way, K-shot task. In each iteration, a learning task (the meta-task) $\mathcal{T}_i$ is sampled according to a distribution over tasks $p(\mathcal{T})$. Each $\mathcal{T}_i$ consists of a support set $\mathcal{S}_i$ and a query set $\mathcal{Q}_i$.
Consider a model represented by a parametrized function $f_\theta$ with parameters $\theta$. Adapted parameters $\theta_i'$ are computed from $\theta$ through the adaptation to task $\mathcal{T}_i$. A loss function $\mathcal{L}_{\mathcal{T}_i}^{\mathcal{S}}$, which is the cross-entropy loss over support-set examples, is defined to guide the computation of $\theta_i'$:
$$\mathcal{L}_{\mathcal{T}_i}^{\mathcal{S}}(f_\theta) = -\sum_{(x,\,y) \in \mathcal{S}_i} \log f_\theta(y \mid x) \tag{1}$$
A one-step gradient update is as below:
$$\theta_i' = \theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i}^{\mathcal{S}}(f_\theta) \tag{2}$$
Here, $\alpha$ is the learning rate, which can be a fixed hyperparameter or learned, as in Meta-SGD [17]. In practice, this gradient update can be repeated for multiple steps.
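As an illustration, here is a minimal PyTorch sketch of the inner-loop update in equation (2); the function and tensor names are our own placeholders, not the authors' released code. `create_graph=True` keeps the adapted parameters differentiable with respect to $\theta$, which the meta-update below requires.

```python
import torch
import torch.nn.functional as F
from torch.func import functional_call

def inner_update(model, support_x, support_y, alpha):
    """One inner-loop step of equation (2): theta_i' = theta - alpha * grad."""
    params = dict(model.named_parameters())
    logits = functional_call(model, params, (support_x,))
    loss = F.cross_entropy(logits, support_y)  # support-set loss, equation (1)
    grads = torch.autograd.grad(loss, tuple(params.values()), create_graph=True)
    # create_graph=True retains the dependence of theta_i' on theta so the
    # meta-update (equations (4)/(5)) can backpropagate through this step.
    return {name: p - alpha * g
            for (name, p), g in zip(params.items(), grads)}
```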
After this, the model parameters $\theta$ are optimized based on the performance of $f_{\theta_i'}$, evaluated on the query set, with respect to $\theta$. Here $\mathcal{L}_{\mathcal{T}_i}^{\mathcal{Q}}$ is another cross-entropy loss, computed over query-set examples:
$$\mathcal{L}_{\mathcal{T}_i}^{\mathcal{Q}}(f_{\theta_i'}) = -\sum_{(x,\,y) \in \mathcal{Q}_i} \log f_{\theta_i'}(y \mid x) \tag{3}$$
Broadly speaking, MAML aims to optimize the model parameters such that a few gradient steps on a new task lead to maximally effective behavior on that task. At the end of each training iteration, the parameters are updated as below:
$$\theta \leftarrow \theta - \beta \nabla_\theta \mathcal{L}_{\mathcal{T}_i}^{\mathcal{Q}}(f_{\theta_i'}) \tag{4}$$
Here, $\beta$ is the learning rate of the meta-learner. To increase the stability of training, a batch of $B$ tasks is sampled in each iteration instead of only one task, and the optimization is performed by averaging the loss across the tasks. Thus, equation (4) generalizes to:
$$\theta \leftarrow \theta - \beta \nabla_\theta \frac{1}{B} \sum_{i=1}^{B} \mathcal{L}_{\mathcal{T}_i}^{\mathcal{Q}}(f_{\theta_i'}) \tag{5}$$
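Continuing the sketch above, the outer loop of equation (5) averages the query-set losses of a batch of tasks and backpropagates into the shared initialization; `meta_opt` and the task-batch format are assumptions of this sketch.

```python
def meta_update(model, meta_opt, task_batch, alpha):
    """One meta-iteration of equation (5) over a batch of B tasks."""
    meta_opt.zero_grad()
    total = 0.0
    for support_x, support_y, query_x, query_y in task_batch:
        adapted = inner_update(model, support_x, support_y, alpha)
        query_logits = functional_call(model, adapted, (query_x,))
        total = total + F.cross_entropy(query_logits, query_y)  # equation (3)
    meta_loss = total / len(task_batch)  # average across the B tasks
    meta_loss.backward()                 # gradient w.r.t. the initial theta
    meta_opt.step()
    return meta_loss.item()
```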
2.2 The fine-tuning stage
Fine-tuning is performed before evaluation. In an N-way, K-shot task, K examples of each of the N class labels are available at this stage in the target task's support set. The model trained in the meta-learning stage is now fine-tuned according to equation (2) for a few iterations. The updated model is then evaluated on the remaining unlabeled examples (the target task's query set).
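For illustration, the fine-tuning stage could look like the following sketch: a few plain gradient steps on the target support set (no second-order terms are needed here), followed by evaluation on the query set. The step count and helper names are assumptions.

```python
def fine_tune_and_eval(model, support_x, support_y, query_x, query_y,
                       alpha, steps=10):
    """Fine-tune the meta-learned initializer per equation (2), then evaluate."""
    opt = torch.optim.SGD(model.parameters(), lr=alpha)
    for _ in range(steps):                       # a few adaptation iterations
        opt.zero_grad()
        F.cross_entropy(model(support_x), support_y).backward()
        opt.step()
    with torch.no_grad():                        # accuracy on the query set
        preds = model(query_x).argmax(dim=-1)
        return (preds == query_y).float().mean().item()
```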
3 Proposed Method
The original MAML assumes that all class labels in the target task are new. However, these class labels do not necessarily need to be all new: in real-life applications, some class labels are already known, so more examples of these labels can be used in the meta-learning stage. In this paper we call them fixed classes, as we later fix their output positions in the neural network classifier. We call this task, in which the model must classify among both new and fixed classes, an N+F-way, K-shot problem, where N, F, and K are the numbers of new class labels, fixed class labels, and fine-tuning examples per class, respectively. The problem of simultaneously classifying unseen and seen class labels was not investigated in the original MAML. In our solution, we tackle it by modifying the MAML training framework. We believe that the N+F-way, K-shot problem is more realistic and that our modification to MAML applies to various tasks. We now describe our methodology for a few-shot SER task.

3.1 Methodology: F-MAML
Although the N+F-way, K-shot problem can be regarded as a specific form of the normal N-way, K-shot problem, solving it with the original MAML framework leads to performance degradation. Using the prior information about the F fixed classes, we modify the MAML framework in the following aspects:
1. We fix the output positions of the fixed classes in the neural network classifier, i.e., each fixed class is always assigned the same output unit, regardless of the sampled task.
2. The fixed classes occur in every meta-task in the meta-learning stage.
3. The fixed classes need no adaptation in the fine-tuning stage, as they have already been learned in the meta-learning stage.
These three modifications to the original MAML make the proposed framework more effective in real applications. A minimal sketch of the fixed label layout follows.
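The sketch below illustrates the first modification, assuming silence and neutral as the fixed classes (as in Section 3.2); the names and the helper function are hypothetical.

```python
# Hypothetical label layout for an N+F-way task: new emotion classes occupy
# output slots 0..N-1 and are re-assigned per episode, while the fixed classes
# always keep the last F slots of the classifier output.
N_NEW = 5
FIXED = ("silence", "neutral")  # F = 2 fixed classes

def episode_label(class_name, episode_classes):
    """Map a class name to its output position for the current episode."""
    if class_name in FIXED:
        return N_NEW + FIXED.index(class_name)  # same slot in every episode
    return episode_classes.index(class_name)    # episode-dependent slot
```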
3.2 Speech Emotion Recognition
We formulate a scenario for SER as an N+F-way, K-shot classification task. N is the number of emotions that one wishes to recognize, and one should provide K speech audio samples for each such emotion. The fixed labels here are silence and neutral.
Figure 2 illustrates the framework of the F-MAML approach. The target data contains audio samples from one language not present in the source data, while the source data contains audio examples from all other languages. The fixed classes are the same in the target and source data. In the meta-learning stage, several N+2-way, K-shot meta-tasks are sampled from the source data for each language. Each meta-task is similar to the target task. We expect to learn a model initializer that can adapt to the target task using the provided speech samples and emotion labels. We exclude the fixed class labels from the support set in both the meta-learning and the fine-tuning stages. As we can assume the availability of more training examples of the fixed classes, we keep them in the meta-tasks' query sets in the meta-learning stage. Moreover, the positions of the silence and neutral classes are fixed to the last positions of the network output (the orange area). Thus, we force our model to "recall" the fixed classes without the need for adaptation. A sketch of this episode construction is given below.
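This sketch reuses `episode_label` and `FIXED` from the previous listing and assumes dictionaries mapping class names to example lists; the sampling sizes are placeholders.

```python
import random

def sample_meta_task(lang_data, fixed_data, k_shot=5, q_per_class=5):
    """Sample one N+2-way, K-shot meta-task from a single source language.

    The support set holds K examples of each *new* emotion class only; the
    query set additionally holds examples of the fixed classes, which are
    never adapted and appear only at evaluation time within the task.
    """
    classes = sorted(lang_data)  # the N new emotion classes of this episode
    support, query = [], []
    for c in classes:
        ex = random.sample(lang_data[c], k_shot + q_per_class)
        support += [(x, episode_label(c, classes)) for x in ex[:k_shot]]
        query += [(x, episode_label(c, classes)) for x in ex[k_shot:]]
    for c in FIXED:  # fixed classes go to the query set only
        query += [(x, episode_label(c, classes))
                  for x in random.sample(fixed_data[c], q_per_class)]
    return support, query
```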
4 Experimentation
4.1 Dataset
We conduct our experiments on the EmoFilm dataset [19]. It consists of 1115 clips with a mean length of 3.5 seconds: 341 English audio clips, with an average of 34.3 utterances per emotion; 410 Italian audio clips, with an average of 41.3 utterances per emotion; and 356 Spanish clips, with an average of 35.9 utterances per emotion (std 9). The higher number of Italian clips might be due to Italian being a more 'emotionally expressive' language; it could also relate to the pre-test made by Italian listeners, who may be better at perceiving emotions in their own language [19]. The dataset is categorized into five emotion labels: happiness, sadness, anger, fear, and disgust. We formulate three 5-way, K-shot tasks using the same setup as the audio recognition tutorial in the official PyTorch documentation. Table 1 lists the total samples for each emotion in each language. We perform three experiments:
1. SER in English: the English data is used as the testing set, while Spanish and Italian are used for training.
2. SER in Italian: the Italian data is used as the testing set, while English and Spanish are used for training.
3. SER in Spanish: the Spanish data is used as the testing set, while English and Italian are used for training.
The testing language is unseen during the meta-learning stage, and only K labeled examples of each label are available in the fine-tuning stage. The initialized model is fine-tuned on the labeled examples and evaluated on the unlabeled examples. The samples for the silence and neutral classes were self-generated with a mean length of 3.5 seconds.
Table 1: Sample counts per language and emotion in EmoFilm.

| Language | Total Samples | Fear | Disgust | Happiness | Anger | Sadness |
|---|---|---|---|---|---|---|
| English | 341 | 72 | 50 | 69 | 76 | 74 |
| Italian | 410 | 83 | 68 | 93 | 73 | 93 |
| Spanish | 356 | 63 | 50 | 76 | 82 | 85 |
4.2 Model Setting
The 3-4 second clips are sampled at 16 kHz. We use Mel-frequency cepstral coefficient (MFCC) features: for each clip, we extract 40-dimensional MFCCs with a frame length of 30 ms and a frame step of 10 ms. A convolutional neural network is adopted as the base model, containing 4 convolutional blocks. Each block comprises a 3 × 3 convolution with 64 filters, followed by ReLU and batch normalization [22]. The flattened layer after the convolutional blocks contains 576 neurons and is fully connected to the output layer with a linear function. We avoided the ResNet architecture because it overfitted very quickly. The model is trained with a mini-batch size of 16 for 5, 10, and 20-shot classification. We set the inner-loop learning rate $\alpha$ to 0.1 and the meta learning rate $\beta$ to 0.001; both learning rates were found using a grid search. A sketch of this feature pipeline and base model follows.
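The following PyTorch sketch reflects the description above; the exact pooling used to arrive at 576 flattened features is not specified in the text, so the adaptive pooling below is our assumption, as is the `n_fft` value.

```python
import torch.nn as nn
import torchaudio

# 40-dim MFCCs: 30 ms frames (480 samples at 16 kHz), 10 ms step (160 samples).
mfcc = torchaudio.transforms.MFCC(
    sample_rate=16000, n_mfcc=40,
    melkwargs={"n_fft": 512, "win_length": 480, "hop_length": 160, "n_mels": 64},
)

def conv_block(c_in):
    """3x3 convolution with 64 filters, followed by ReLU and batch norm [22]."""
    return nn.Sequential(nn.Conv2d(c_in, 64, 3, padding=1), nn.ReLU(),
                         nn.BatchNorm2d(64), nn.MaxPool2d(2))

class SERNet(nn.Module):
    """Four conv blocks, flattened to 576 (= 64 * 3 * 3) features, then a
    single linear output layer over the N new + F fixed classes."""
    def __init__(self, n_classes=7):  # 5 emotion classes + 2 fixed classes
        super().__init__()
        self.features = nn.Sequential(conv_block(1), conv_block(64),
                                      conv_block(64), conv_block(64),
                                      nn.AdaptiveAvgPool2d((3, 3)))
        self.classifier = nn.Linear(576, n_classes)

    def forward(self, wav):                 # wav: (batch, num_samples)
        x = mfcc(wav).unsqueeze(1)          # -> (batch, 1, 40, num_frames)
        return self.classifier(self.features(x).flatten(1))
```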
4.3 Baselines
We compare our proposed approach with two baselines: the state-of-the-art conventional supervised learning approach [20], which trains the model on the support set of the target task only, and the state-of-the-art meta-learning approach MetaSER [21], which treats the 5+2-way problem as a 7-way problem. In the evaluation, we sample K examples from each class for fine-tuning the model and 25 examples per label for evaluation. We run 100 random tests and compare the approaches on accuracy; a sketch of this protocol follows.
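The evaluation protocol could be sketched as below, reusing `fine_tune_and_eval` from Section 2.2; `sample_target_task` is a hypothetical helper returning K support examples per label and 25 query examples per label from the target language.

```python
import copy

def evaluate(meta_model, sample_target_task, alpha, runs=100):
    """Run 100 random fine-tune/evaluate trials and report mean accuracy."""
    accs = []
    for _ in range(runs):
        model = copy.deepcopy(meta_model)          # fresh copy of the initializer
        s_x, s_y, q_x, q_y = sample_target_task()  # K-shot support, 25/label query
        accs.append(fine_tune_and_eval(model, s_x, s_y, q_x, q_y, alpha))
    return sum(accs) / len(accs)
```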
4.4 Results
We compare our approach with the two baselines. Tables 2, 3, and 4 list the performance on the 5, 10, and 20-shot SER tasks, respectively, in English, Italian, and Spanish.
Unsurprisingly, MetaSER, i.e., the MAML-based approach, performs much better than conventional supervised learning in the few-shot setting. This improvement comes from a good initialization of the model's parameters, which enables fast learning on a new task with few gradient steps while avoiding the overfitting that can occur with a small dataset.

Table 2: Accuracy on the 5-shot task.

| Model | English | Italian | Spanish |
|---|---|---|---|
| Supervised | | | |
| MetaSER | | | |
| F-MAML | 69.71% | 69.13% | 68.85% |
Table 3: Accuracy on the 10-shot task.

| Model | English | Italian | Spanish |
|---|---|---|---|
| Supervised | | | |
| MetaSER | | | |
| F-MAML | 73.71% | 74.22% | 74.55% |
Table 4: Accuracy on the 20-shot task.

| Model | English | Italian | Spanish |
|---|---|---|---|
| Supervised | | | |
| MetaSER | | | |
| F-MAML | 81.69% | 80.13% | 80.15% |
Finally, our proposed approach F-MAML outperforms MetaSER. We attribute the improvement to the prior information of the fixed classes acting as anchors, which enables more efficient fine-tuning to new tasks than MetaSER. Figure 3 shows the loss of MetaSER compared to F-MAML on 5-shot learning. It can easily be seen that F-MAML converges more quickly and in fewer steps than the original MAML, making the proposed approach more robust than MetaSER.
5 Conclusion
In this paper, we simulated a scenario of SER as a few-shot learning problem. We defined it as an N+F-way, K-shot problem and proposed a modification to the Model-Agnostic Meta-Learning (MAML) algorithm in which the F classes are kept fixed. Experiments conducted on the EmoFilm dataset show that our approach performs best compared to the baselines. In the future, we will test the feasibility of the approach on Indic and Mandarin-derived languages, since these languages differ vastly from each other. We also plan to incorporate image and text to build a multimodal system.
References
- [1] R. W. Picard, “Affective computing.” MIT Press, 2000.
- [2] J. Li and C. Lee, “Attentive to individual: A multimodal emotion recognition network with personalized attention profile,” in Interspeech 2019, 20th Annual Conference of the International Speech Communication Association, Graz, Austria, 15-19 September 2019, G. Kubin and Z. Kacic, Eds. ISCA, 2019, pp. 211–215.
- [3] N. Sebe, I. Cohen, T. Gevers, and T. S. Huang, “Multimodal approaches for emotion recognition: a survey,” in Internet Imaging VI, S. Santini, R. Schettini, and T. Gevers, Eds., vol. 5670, International Society for Optics and Photonics. SPIE, 2005, pp. 56 – 67.
- [4] J. Kim and R. A. Saurous, “Emotion recognition from human speech using temporal information and deep learning,” in Proc. Interspeech 2018, 2018, pp. 937–940.
- [5] X. Li and M. Akagi, “Multilingual speech emotion recognition system based on a three-layer model,” in Interspeech 2016, 2016, pp. 3608–3612.
- [6] V. G. Satorras and J. B. Estrach, “Few-shot learning with graph neural networks,” in International Conference on Learning Representations, 2018. [Online]. Available: https://openreview.net/forum?id=BJj6qGbRW
- [7] B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum, “Human-level concept learning through probabilistic program induction,” Science, vol. 350, no. 6266, pp. 1332–1338, 2015.
- [8] O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra, “Matching networks for one shot learning,” in Proceedings of the 30th International Conference on Neural Information Processing Systems, ser. NIPS’16. Red Hook, NY, USA: Curran Associates Inc., 2016, p. 3637–3645.
- [9] J. Snell, K. Swersky, and R. Zemel, “Prototypical networks for few-shot learning,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, ser. NIPS’17. Red Hook, NY, USA: Curran Associates Inc., 2017, p. 4080–4090.
- [10] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. S. Torr, and T. M. Hospedales, “Learning to compare: Relation network for few-shot learning,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 1199–1208.
- [11] T. Ko, Y. Chen, and Q. Li, “Prototypical networks for small footprint text-independent speaker verification,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 6804–6808.
- [12] A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap, “Meta-learning with memory-augmented neural networks,” in Proceedings of the 33rd International Conference on International Conference on Machine Learning, ser. ICML’16, vol. 48. JMLR.org, 2016, p. 1842–1850.
- [13] N. Mishra, M. Rohaninejad, X. Chen, and P. Abbeel, “A simple neural attentive meta-learner,” in International Conference on Learning Representations, 2018.
- [14] T. Munkhdalai and H. Yu, “Meta networks,” in Proceedings of the 34th International Conference on International Conference on Machine Learning, ser. ICML’17, vol. 70. JMLR.org, 2017, pp. 2554–2563.
- [15] S. Ravi and H. Larochelle, “Optimization as a model for few-shot learning,” in 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017.
- [16] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” in Proceedings of the 34th International Conference on Machine Learning, ser. ICML’17, vol. 70. JMLR.org, 2017, pp. 1126–1135.
- [17] Z. Li, F. Zhou, F. Chen, and H. Li, “Meta-sgd: Learning to learn quickly for few-shot learning,” 2017.
- [18] A. Nichol, J. Achiam, and J. Schulman, “On first-order meta-learning algorithms,” 2018.
- [19] E. Parada-Cabaleiro, G. Costantini, A. Batliner, A. Baird, and B. Schuller, “Categorical vs dimensional perception of italian emotional speech,” in Proc. Interspeech 2018, 2018, pp. 3638–3642.
- [20] W. Lim, D. Jang, and T. Lee, “Speech emotion recognition using convolutional and recurrent neural networks,” in 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2016, pp. 1–4.
- [21] S. Chopra, P. Mathur, R. Sawhney, and R. R. Shah, “Meta-learning for low-resource speech emotion recognition,” in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 6259–6263.
- [22] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ser. ICML’15. JMLR.org, 2015, p. 448–456.