
BERT-like Pre-training for Symbolic Piano Music Classification Tasks

Yi-Hui Chou
Carnegie Mellon University, United States
yihuic@andrew.cmu.edu
I-Chun Chen
National Tsing Hua University, Taiwan
icchen0101@elsa.cs.nthu.edu.tw
Chin-Jui Chang
Research Center for IT Innovation, Academia Sinica, Taiwan
csc63182@citi.sinica.edu.tw
Joann Ching
Research Center for IT Innovation, Academia Sinica, Taiwan
joann8512@citi.sinica.edu.tw
Yi-Hsuan Yang
National Taiwan University, Taiwan
yhyangtw@ntu.edu.tw
The first two authors contributed equally to this paper.
Abstract

This article presents a benchmark study of symbolic piano music classification using the masked language modelling approach of the Bidirectional Encoder Representations from Transformers (BERT). Specifically, we consider two types of MIDI data: MIDI scores, which are musical scores rendered directly into MIDI with no dynamics and precisely aligned with the metrical grid notated by the composer, and MIDI performances, which are MIDI encodings of human performances of musical scoresheets. With five public-domain datasets of single-track piano MIDI files, we pre-train two 12-layer Transformer models using the BERT approach, one for MIDI scores and the other for MIDI performances, and fine-tune them for four downstream classification tasks. These include two note-level classification tasks (melody extraction and velocity prediction) and two sequence-level classification tasks (style classification and emotion classification). Our evaluation shows that the BERT approach leads to higher classification accuracy than recurrent neural network (RNN)-based baselines.

Keywords: Pre-trained model, Transformer, symbolic-domain music classification, piano music, melody recognition, velocity prediction, style classification, emotion classification

1 Introduction

In the machine learning literature, a prominent approach to overcoming the labelled data scarcity issue is to adopt “transfer learning” and divide the learning problem into two stages [93]: a pre-training stage that establishes a model capturing general knowledge from one or multiple source tasks, and a fine-tuning stage that transfers the captured knowledge to target tasks. Model pre-training can be done in two ways: using a labelled dataset [78, 108, 139], such as training a VGG-like model over millions of human-labelled clips of general sound events and then fine-tuning it on instrument classification [91]; or using an unlabelled dataset with a self-supervised training strategy. The latter is particularly popular in the field of natural language processing (NLP), where pre-trained models (PTMs) using Transformers [134] have achieved state-of-the-art results on almost all NLP tasks, including generative and discriminative ones [93].

This article presents an empirical study employing PTMs to symbolic-domain piano music classification tasks. In particular, inspired by the growing trend of treating MIDI music as a “language” in deep generative models for symbolic music [100, 124, 101, 141, 118], we employ a Transformer-based network pre-trained by a self-supervised training strategy called “masked language modelling” (MLM), which has been widely used in BERT-like PTMs in NLP [84, 115, 107, 143, 82]. Despite the fame of BERT, we are aware of only two publications that employ BERT-like PTMs for symbolic music classification [133, 144]. The first work [133] deals with optically scanned sheet music, while we use MIDI inputs. The second work [144] uses a diverse set of multi-track MIDI files, while we intend to focus on piano music only. We discuss how our work differs from these two existing works in more detail in Section 2.

We evaluate PTMs on four piano music classification tasks. These include two note-level classification tasks, i.e., melody extraction [130, 99] and velocity prediction [137, 104, 105], and two sequence-level classification tasks, i.e., style classification [112, 109] and emotion classification [90, 113, 121, 122]. We use five datasets in this work, amounting to 4,166 pieces of piano MIDI. We give details of the datasets and tasks in Sections 3 and 4.

As the major contribution of this article, we report a performance study of variants of PTMs for this diverse set of classification tasks, comparing the proposed approach (Section 6) with recurrent neural network (RNN)-based baselines (Section 5). Results reported in Section 7 show that the “pre-train and fine-tune” strategy does lead to higher accuracy than the RNN baselines. Moreover, we consider two types of MIDI data and compare the performance of the resulting PTMs. Specifically, following [120], we differentiate two types of MIDI files: MIDI scores, which are musical scoresheets rendered directly into MIDI with no dynamics and exactly according to the written metrical grid, and MIDI performances, which are MIDI encodings of human performances of musical scoresheets. All the 4,166 pieces we have are MIDI performances, but we can obtain the corresponding MIDI-score version of each by dropping performance-related information. Accordingly, we build two PTMs, one for MIDI scores and the other for MIDI performances, and evaluate their performance respectively on the downstream tasks. While the MIDI-score version can be applied to a wider array of tasks, involving those with or without performance-related information, the MIDI-performance version can likely perform better for tasks that involve human performance of piano scores, such as style classification and emotion classification. Therefore, such a performance comparison is relevant.

As the secondary contribution, we share the code and release checkpoints of the pre-trained and fine-tuned models publicly in our GitHub repository111https://github.com/wazenmai/MIDI-BERT with an open-source licence. Together with the fact that all the datasets employed in this work are publicly available, our research can be taken as a new testbed of PTMs in general and a new public benchmark for machine learning-based classification of MIDI music.

2 Related Work on Pre-trained Models for MIDI

Machine learning has been applied to music in symbolic formats such as MIDI. Exemplary tasks include symbolic-domain music genre classification [83, 86], composer classification [112, 109], and melody note identification [130, 99]. However, labelled datasets for symbolic-domain music data tend to be small in size in general [92, 130, 94], posing challenges to train effective supervised machine learning models that generalise well.

To our best knowledge, the work of [133] represents the first attempt to use PTMs for symbolic-domain music classification. They showed that either a RoBERTa-based Transformer encoder PTM [115] or a GPT2-based Transformer decoder PTM [125] outperforms non-pre-trained baselines for a 9-class symbolic-domain composer classification task. Pre-training boosts the classification accuracy of the GPT2 model greatly, from 46% to 70%. However, the symbolic data format considered in their work is “sheet music image” [133], i.e., images of musical scores. This data format has been much less used than MIDI in the literature.

[144] presented MusicBERT, a PTM tailored for symbolic MIDI data. MusicBERT was trained on a non-public dataset of over one million multi-track MIDI pieces. The authors showcased the efficacy of MusicBERT by applying it to two generative music tasks, melody completion and accompaniment suggestion, and two sequence-level discriminative tasks, namely genre and style classification. In comparison to non-PTM-based baselines, MusicBERT consistently led to better performance. Our work differs from theirs in the following aspects. First, our pre-training corpus is much smaller (only 4,166 pieces) but entirely publicly available, and less diverse but more dedicated (to piano music). Second, we aim at establishing a benchmark for symbolic music classification and include not only sequence-level but also note-level tasks. Furthermore, the labelled data we employ for our downstream tasks is comparatively modest, with each dataset containing fewer than 1,000 annotated pieces. This differs from MusicBERT’s dataset, referred to as the TOP-MAGD dataset [86], which comprises over 20,000 annotated pieces, a considerably large collection rarely encountered in symbolic music tasks. Finally, their token representation is designed for multi-track MIDI, while ours is for single-track piano MIDI, where each MIDI file is an individual movement of a longer work.

Table 1: Public datasets used for this article. All the datasets are used for pre-training, while three are also used for downstream classification (CLS) tasks. Average note pitch is given as a note name. The symbol “#” stands for “number of”.

| Dataset | Downstream CLS tasks | Pieces | Duration (hours) | Avg. note pitch | Avg. note duration (in 32nd notes) | Avg. #notes per bar | Avg. #bars per piece |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Pop1K7 | – | 1,747 | 108.8 | E4 | 8.5 | 16.9 | 103.3 |
| ASAP (4/4) | – | 65 | 3.5 | D♯4 | 2.9 | 23.0 | 95.9 |
| POP909 (4/4) | melody, velocity | 865 | 59.7 | D♯4 | 6.1 | 17.4 | 94.9 |
| Pianist8 | style | 411 | 31.9 | D♯4 | 9.6 | 17.0 | 108.9 |
| EMOPIA | emotion | 1,078 | 12.0 | C♯4 | 10.0 | 17.9 | 14.8 |

3 Datasets and Data Representation

3.1 Piano MIDI Datasets

We collect four existing public-domain piano MIDI datasets, namely Pop1K7, ASAP, POP909 and EMOPIA, and compile ourselves a new dataset, named Pianist8. To simplify the token representation, we consider only MIDI files that specify 4/4 metre. (We note that the metre can be wrong due to errors in automatic music transcription, leading to noise in the data. Future work can improve this, and can adopt a more elaborate token representation such as that proposed by [123] to include other time signatures.) We list some important statistics of these five datasets in Tab. 1 and provide their details below.

  • The Pop1K7 dataset developed by [98] (https://github.com/YatingMusic/compound-word-transformer) is composed of machine transcriptions of 1,747 audio recordings of piano covers (i.e., new recordings by someone other than the original artist or composer of a commercially released song) of Japanese anime, Korean and Western pop music, amounting to over 100 hours’ worth of data. The transcription was done with the “onsets-and-frames” RNN-based piano transcription model [95] (reported to attain a 95.32% onset F1 score for note-level piano transcription [97]) and the RNN-based downbeat and beat tracking model from the Madmom library [73]. This dataset is the largest among the five, constituting half of our training data. We use it only for pre-training.

  • ASAP, the aligned scores & performances dataset compiled by Foscarin et al. (https://github.com/fosfrancesco/asap-dataset), contains 1,068 MIDI performances of 222 Western classical music compositions from 15 composers; the MIDI performances of the 222 pieces are compiled from the MAESTRO dataset [96]. We consider it as an additional dataset for pre-training, using only the MIDI files that specify 4/4 metre with no metre change at all throughout the piece. This leaves us with 65 MIDI files, which last for 3.5 hours in total. Tab. 1 shows that, being the only classical dataset among the five, ASAP features a shorter average note duration and a larger number of notes per bar.

  • POP909 comprises piano covers of 909 pop songs compiled by [136] (https://github.com/music-x-lab/POP909-Dataset). It is the only dataset among the five that provides melody/non-melody labels for each note. Specifically, each note is labelled with one of the following three classes: vocal melody (piano notes corresponding to the lead vocal melody line in the original pop song, usually during the verse and chorus parts); instrumental melody (piano notes corresponding to the secondary melody line played by the instruments in the original pop song, usually during the intro, bridge and outro parts); and accompaniment (including arpeggios, broken chords and many other textures). (POP909 originally refers to vocal melody as melody and to instrumental melody as bridge [136]; we opt for our new naming to highlight the fact that the latter is also a type of melody.) As it is a MIDI performance dataset, it also comes with velocity information. Therefore, we use it for the melody classification and velocity prediction tasks. We discard pieces that do not specify 4/4 metre, ending up with 865 pieces for this dataset.

  • Pianist8 consists of eight artists’ performances of piano music that we downloaded from YouTube for training and evaluating symbolic-domain style classification (https://zenodo.org/record/5089279). The artists are Richard Clayderman (pop), Yiruma (pop), Herbie Hancock (jazz), Ludovico Einaudi (contemporary), Hisaishi Joe (contemporary), Ryuichi Sakamoto (contemporary), Bethel Music (religious) and Hillsong Worship (religious). The artists are also the composers of the pieces, except for Richard Clayderman, Bethel Music and Hillsong Worship. The dataset contains a total of 411 pieces, with a fairly balanced number of pieces per artist. Each audio file is paired with its MIDI performance, which is machine-transcribed by the piano transcription model proposed by [110].

  • EMOPIA is a dataset of pop piano music collected recently by [103] from YouTube for research on emotion-related tasks (https://annahung31.github.io/EMOPIA/). It has 1,087 clips (each around 30 seconds) segmented from 387 songs, covering Japanese anime, Korean and Western pop song covers, movie soundtracks and personal compositions. The emotion of each clip has been labelled using the following 4-class taxonomy: HAHV (high arousal high valence); LAHV (low arousal high valence); HALV (high arousal low valence); and LALV (low arousal low valence). This taxonomy is derived from Russell’s valence-arousal model of emotion [127], where valence indicates whether the emotion is positive or negative and arousal denotes whether the emotion is high (e.g., angry) or low (e.g., sad) [142]. The MIDI performances of these clips are similarly machine-transcribed from the audio recordings by the model of [110]. We use this dataset for the emotion classification task. As Tab. 1 shows, the average length of the pieces in the EMOPIA dataset is the shortest among the five, since they are actually clips manually selected by dedicated annotators [103] to ensure that each clip expresses a single emotion.

All five datasets consist of MIDI performances. As mentioned in the introduction, we intend to build two PTMs, one for MIDI scores and the other for MIDI performances. We obtain the MIDI-score version of each performance by dropping the velocity and tempo information and temporally quantising the onset time and duration of each note to semiquaver resolution.
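
To make this conversion concrete, the sketch below derives a MIDI-score note list from a MIDI-performance note list, assuming each note is a dictionary with onset and duration given in beats; the field names and data structures are placeholders for illustration, not the ones used in our released code.

```python
# Hedged sketch of the MIDI-performance -> MIDI-score conversion described above.
# Assumes each note is a dict with "onset" and "duration" in beats (quarter notes),
# plus "pitch" and "velocity"; field names are placeholders, not our released format.

def performance_to_score(notes, steps_per_beat=4):
    """Drop velocity and quantise onsets/durations to the semiquaver grid
    (4 sixteenth-note steps per beat); tempo events are discarded separately."""
    score_notes = []
    for note in notes:
        onset = round(note["onset"] * steps_per_beat) / steps_per_beat
        duration = max(1, round(note["duration"] * steps_per_beat)) / steps_per_beat
        score_notes.append({"onset": onset,
                            "pitch": note["pitch"],
                            "duration": duration})
    return score_notes
```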

3.2 Token Representation

Similar to text, a piece of music in MIDI can be considered as a sequence of musical events or “tokens”. However, what makes music different is that musical notes are associated with a temporal length (i.e., note duration) and multiple notes can be played at the same time. Therefore, to represent music, we need note-related tokens describing, for example, the pitch and duration of the notes, as well as metric-related tokens placing the notes over a time grid.

In the literature, a variety of token representations for MIDI have been proposed, differing in many aspects such as the MIDI data being considered (e.g., melody [135], lead sheet [140], piano [100] and multi-track music [124, 85]), the temporal resolution of the time grid and the way the advancement in time is notated [101]. Auxiliary tokens describing, for example, the chord progression [101] or grooving pattern [76] underlying a piece can also be added.

In this work, we adopt the beat-based REMI token representation proposed by [101] to place musical notes over a discrete time grid comprising 16 equidistant sample points per bar. In addition to REMI, we experiment with the “token grouping” idea of the compound word (CP) representation [98], to reduce the length of the token sequences. We depict the two adopted token representations in Fig. 1 and provide some details below.

Figure 1: An example of a piece of score encoded using the proposed simplified version of the (a) REMI and (b) CP representations, using seven types of tokens, Bar, Sub-bar, Pitch, Velocity, Duration, Tempo and Pad (not shown here), for piano-only MIDI performance. The text inside parentheses indicates the value each token takes. While each time step corresponds to a single token in REMI, each time step would correspond to a super token that assembles four tokens in total in CP. Without such a token grouping, the sequence length (in terms of the number of time steps) of REMI is longer than that of CP (in this example, 16 versus 4). Please note that the actual scores employed in our work are not as simple as this example as they are polyphonic.

3.2.1 REMI Token Representation

The REMI representation [101] for MIDI performances uses Bar and Sub-bar tokens to represent the advancement in time. The former marks the beginning of a new bar, while the latter points to a discrete position within a bar. Specifically, as we divide a bar into 16 equidistant sample points, the Sub-bar tokens can take values from 1 to 16; e.g., Sub-bar(1) indicates the position corresponding to the first sample point in a bar, i.e., the first beat in 4/4 time signature, whereas Sub-bar(9) indicates the third beat. (We note that [101] originally referred to such Sub-bar tokens as Position tokens, while [129] and [141] call them Sub-beat tokens. We prefer our naming for it is musically more accurate: our Sub-bar tokens are subdivisions of a bar, i.e., dividing a bar into 16 points, not subdivisions of a beat.) As depicted in Fig. 1(a), we use a Sub-bar token before each musical note, which comprises two consecutive tokens of Pitch and Duration. In other words, the Sub-bar token indicates the onset time of a note played at a certain MIDI pitch (i.e., the value taken by the Pitch token), whose duration is indicated by the Duration token in multiples of demisemiquavers. For example, Duration(1) and Duration(32) correspond to a thirty-second note and a whole note, respectively. For MIDI performances, a musical note is represented by not only Pitch and Duration tokens but also a Velocity token that indicates how hard the corresponding key was pressed. Moreover, we use Tempo tokens to specify the tempo of a song. A Tempo token is placed after a Sub-bar token to indicate from which position that tempo applies; we only add a Tempo token at the beginning of a song and wherever the tempo changes. For MIDI scores, the Velocity and Tempo tokens are simply dropped.
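
For concreteness, the following minimal sketch encodes the notes of one 4/4 bar as REMI tokens. It assumes each note carries a sub-bar position (1–16), a MIDI pitch, a duration in demisemiquavers and, for MIDI performances, a velocity and the prevailing tempo; the exact token ordering and data structures in our released code may differ.

```python
# Illustrative REMI encoding of one 4/4 bar; not the exact released implementation.

def encode_bar_remi(notes, with_performance=True):
    """notes: list of dicts with 'sub_bar' (1-16), 'pitch' (MIDI number),
    'duration' (in 32nd notes) and, for performances, 'velocity' and 'tempo'."""
    tokens = ["Bar"]
    last_tempo = None
    for n in sorted(notes, key=lambda n: n["sub_bar"]):
        tokens.append(f"Sub-bar({n['sub_bar']})")
        if with_performance and n.get("tempo") is not None and n["tempo"] != last_tempo:
            tokens.append(f"Tempo({n['tempo']})")   # only at the start and on tempo changes
            last_tempo = n["tempo"]
        tokens.append(f"Pitch({n['pitch']})")
        if with_performance:
            tokens.append(f"Velocity({n['velocity']})")
        tokens.append(f"Duration({n['duration']})")
    return tokens
```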

3.2.2 CP Token Representation

Fig. 1(a) shows that, except for Bar, the other tokens in a REMI sequence always occur consecutively in groups, in the order of Sub-bar, Pitch, Duration. We can further differentiate Bar(new) and Bar(cont), representing respectively the beginning of a new bar and a continuation of the current bar, and always place one of them before a Sub-bar token. This way, the tokens would always occur in a group of four for MIDI scores. For MIDI performances, six tokens would be grouped together, including Velocity and Tempo; following the logic of Bar, if there is no tempo change, we simply repeat the current tempo value. Instead of feeding the token embedding of each of them individually to the Transformer, we can combine the token embeddings of the four tokens (for MIDI scores) or six tokens (for MIDI performances) in a group by concatenation and let the Transformer model process them jointly, as depicted in Fig. 1(b). We can also modify the output layer of the Transformer so that it predicts multiple tokens at once with different heads. These constitute the main ideas of the CP representation [98], which has at least the following two advantages over its REMI counterpart: 1) the number of time steps needed to represent a MIDI piece is much reduced, since the tokens are merged into a “super token” (a.k.a. a “compound word” [98]) representing several tokens at once; 2) the self-attention in the Transformer operates over the super tokens, which might be musically more meaningful as each super token jointly represents different aspects of a musical note. We therefore experiment with both REMI and CP.
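
The following PyTorch sketch illustrates the embedding side of this grouping for MIDI scores: one embedding table per token type, concatenated and linearly projected to the Transformer hidden size. The per-type embedding widths are placeholders, and the per-type vocabulary sizes given in the usage comment are one way of counting that is consistent with the 176-token CP vocabulary reported in Section 3.2.3.

```python
import torch
import torch.nn as nn

# Hedged sketch of the CP "super token" embedding: one embedding per token type,
# concatenated and projected to the Transformer hidden size. Embedding widths are
# placeholders, not the released configuration.

class CPEmbedding(nn.Module):
    def __init__(self, vocab_sizes, emb_sizes, hidden_size=768):
        super().__init__()
        self.embeddings = nn.ModuleList(
            [nn.Embedding(v, e) for v, e in zip(vocab_sizes, emb_sizes)]
        )
        self.proj = nn.Linear(sum(emb_sizes), hidden_size)

    def forward(self, x):                          # x: (batch, seq_len, num_token_types)
        embs = [emb(x[..., i]) for i, emb in enumerate(self.embeddings)]
        return self.proj(torch.cat(embs, dim=-1))  # (batch, seq_len, hidden_size)

# MIDI-score CP with Bar / Sub-bar / Pitch / Duration token types, e.g.:
# cp_embed = CPEmbedding(vocab_sizes=[4, 18, 88, 66], emb_sizes=[32, 128, 256, 128])
```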

3.2.3 On Zero-padding

To train Transformers, it is required that all input sequences have the same length. For both REMI and CP, we divide the token sequence for each entire piece into a number of shorter sequences with equal sequence length 512, zero-padding those at the end of a piece to 512 with an appropriate number of Pad tokens. Because of the token grouping, a CP sequence for the Pop1K7 dataset would cover around 25 bars on average, whereas a corresponding REMI sequence covers only 9 bars on average.
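
A minimal sketch of this chunking step is shown below, operating on a flat list of token (or super-token) ids; the pad id is a placeholder standing in for the Pad token in REMI or the Pad super token in CP.

```python
# Hedged sketch of splitting one piece into fixed-length sequences of 512,
# zero-padding the final segment. "pad_id" is a placeholder for the Pad token
# (REMI) or the Pad super token (CP).

def split_into_segments(token_ids, seq_len=512, pad_id=0):
    segments = []
    for start in range(0, len(token_ids), seq_len):
        seg = token_ids[start:start + seq_len]
        if len(seg) < seq_len:
            seg = seg + [pad_id] * (seq_len - len(seg))
        segments.append(seg)
    return segments
```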

For MIDI scores, our final token vocabulary for REMI contains 16 unique Sub-bar tokens, 86 Pitch tokens, 64 Duration tokens, one Bar token, one Pad token and one Mask token, in total 169 tokens. For CP, we do not use a Pad token but represent a zero-padded super token by Bar(Pad), Sub-bar(Pad), Pitch(Pad) and Duration(Pad). We do similarly for a masked super token, using Bar(Mask), etc. Adding the additional bar-related token Bar(cont) needed for CP, the vocabulary size for CP is 169 − 2 + 8 + 1 = 176. For MIDI performances, the vocabulary sizes are 299 and 310 using the REMI and CP representations, respectively.

4 Task Specification

Throughout this article, we refer to note-level classification tasks as tasks that perform a prediction for each individual note in a music sequence and sequence-level tasks as tasks that require a single prediction for an entire music sequence. We consider two note-level tasks and two sequence-level tasks in our experiments, as elaborated below.

4.1 Symbolic-domain Melody Extraction

For symbolic-domain melody extraction, initial methodologies predominantly adopted rule-based approaches, encompassing techniques such as utilising pitch contour characteristics [128] and the “skyline” algorithm [75]. In recent years, deep learning-based approaches utilising convolutional neural networks (CNNs) have been adopted [130, 99]; we review such CNN-based methods in Section 5, highlighting their specific details and implementation. Similar to [130], we regard melody extraction as a task that identifies the melody notes in a single-track homophonic or polyphonic piece. (It is common for MIDI files to consist of multiple tracks; by “single-track” we refer to MIDI files containing only one track, in contrast to multi-track MIDI files.) Utilising the POP909 dataset [136], we can develop a model that classifies each Pitch event into vocal melody, instrumental melody or accompaniment, with classification accuracy (ACC) serving as the evaluation metric. (We note that there is a closely related task called melody track identification, whose goal is to distinguish the melody track from the other, non-melody tracks of a multi-track MIDI file [119, 106]. While melody extraction is a note-level classification task, melody track identification is a track-level task. The latter is also an important symbolic music classification task, but we do not consider it here as we exclusively focus on piano-only data.)

Specifically, we consider two formulations of the task. Firstly, we adhere to the original configuration of POP909 and perform three-class melody classification, classifying each Pitch into three categories: vocal melody, instrumental melody or accompaniment. Secondly, we merge vocal melody and instrumental melody into a general “melody” category (while accompaniment is designated as “non-melody”) and perform two-class classification. Doing so allows for a direct comparison with established baselines, such as the skyline algorithm and the baseline introduced in Section 5. For detailed results and a thorough examination, please refer to Section 7.1.

4.2 Symbolic-domain Velocity Prediction

Dynamics are an important element in music, as they are often used by musicians to add excitement and emotion to songs. Given that the tokens we choose do not contain performance information, it is interesting to see how a machine model would “perform” a piece by deciding these volume changes, a task that is essential in performance generation [137, 104, 105] and expressive performance modelling [88, 89]. In MIDI, velocity is the parameter that scales the intensity or volume at which a sound sample is played back, with values ranging from 0 to 127. Default MIDI velocity values are associated with dynamic indications: Apple’s Logic Pro 9 user manual correlates the traditional dynamics indicators (ppp, pp, p, mp, mf, f, ff and fff) with the MIDI velocity values 16, 32, 48, 64, 80, 96, 112 and 127, respectively (https://help.apple.com/logicpro/mac/9.1.6/en/logicpro/usermanual/, page 468; accessed 2023-06-22). In our work, we define and classify this information into six categories: pp (0–31), p (32–47), mp (48–63), mf (64–79), f (80–95) and ff (96–127). Our definition aligns with the Logic Pro 9 specification, except that we treat fff as equivalent to ff. Our objective can thus be treated as a note-level classification task, aiming to classify Pitch events into six classes using the POP909 dataset [136].
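
This six-class quantisation amounts to a simple lookup over velocity ranges; a sketch with the class boundaries stated above is given below.

```python
# Sketch of the six-class dynamics quantisation used for velocity prediction.

def velocity_to_class(velocity):
    """Map a MIDI velocity (0-127) to one of the six dynamics classes."""
    bins = [(31, "pp"), (47, "p"), (63, "mp"), (79, "mf"), (95, "f"), (127, "ff")]
    for upper, label in bins:
        if velocity <= upper:
            return label
    raise ValueError("MIDI velocity must lie in 0-127")
```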

4.3 Symbolic-domain Style Classification

Genre classification [83] can be considered as a type of style classification. While genre classification categorises music based on shared musical attributes and conventions, style classification seeks to capture the nuanced stylistic variations within a specific genre, composer or performer, accounting for the diverse artistic choices and performance practices that shape musical expressions. We can relatively easily tell which genre we are listening to from the patterns shared within that genre, whereas recognising a composer’s or performer’s style requires deeper musical insight. Deep learning-based composer classification in MIDI has been attempted by [112] and [109], both treating MIDI pieces as 2D-representation matrices (via the piano-roll representation) and using CNN classifiers. Our work differs from theirs in that: 1) we encode MIDI pieces as token sequences, 2) we employ PTMs, 3) we consider non-classical music pieces and 4) our task is about style classification, because not all the pianists in Pianist8 composed the pieces they performed.

4.4 Symbolic-domain Emotion Classification

Emotion classification in MIDI has been approached by a few researchers, mostly using hand-crafted features and non-deep learning classifiers [90, 113, 121, 122]. Some researchers work on MIDI alone, while others use both audio and MIDI in multi-modal emotion classification [121]. The only deep learning-based approach we are aware of is presented by [103], using an RNN-based classifier called “Bi-LSTM-Attn” [114] but without employing PTMs, which is also used as a baseline in our experiment; see Section 5.

5 Baseline Model

For the note-level classification tasks, we use an RNN model as our baseline that consists of three bi-directional long short-term memory (Bi-LSTM) layers, each with 256 neurons, followed by a feed-forward layer for classification, since such a network has led to state-of-the-art results in many audio-domain music classification tasks, such as beat tracking [73, 77] and pitch estimation [95]. All of our downstream tasks can be viewed as multi-class classification problems. Given a REMI sequence, the Bi-LSTM model makes a prediction for each Pitch token, ignoring all the other types of tokens (i.e., Bar, Sub-bar, Duration and Pad). For CP, the Bi-LSTM model simply makes a prediction for each super token, again ignoring the zero-padded ones.
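
A hedged PyTorch sketch of this note-level baseline follows; the embedding width is a placeholder, and details such as dropout may differ from our released implementation.

```python
import torch.nn as nn

# Sketch of the note-level Bi-LSTM baseline: three bi-directional LSTM layers with
# 256 units per direction and a feed-forward classifier applied to every time step.

class BiLSTMBaseline(nn.Module):
    def __init__(self, vocab_size, num_classes, emb_size=256, hidden_size=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_size)
        self.lstm = nn.LSTM(emb_size, hidden_size, num_layers=3,
                            batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_size, num_classes)

    def forward(self, token_ids):                  # (batch, seq_len)
        h, _ = self.lstm(self.embed(token_ids))    # (batch, seq_len, 2 * hidden_size)
        return self.classifier(h)                  # per-token logits
```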

For the sequence-level tasks, which require only one prediction for an entire sequence, we follow [103] and choose the Bi-LSTM-Attn model from [114] as our baseline, which was originally proposed for sentiment classification in NLP. The model combines LSTM with a self-attention module for temporal aggregation. Specifically, it uses a Bi-LSTM layer to convert the input sequence of tokens into a sequence of embeddings, which can be considered as feature representations of the tokens, and then fuses these embeddings into one sequence-level embedding according to the weights assigned by the attention module to each token-level embedding. The sequence-level embedding then goes through two dense layers for classification. Here we use the token-level embeddings of all the tokens.

For melody extraction, we implement additionally the skyline algorithm [75] and a CNN-based method [130] for performance comparison. The skyline algorithm can only perform “melody versus non-melody” two-class classification for it cannot distinguish between vocal melody and instrumental melody—it uses the simple rule of taking the note with the highest pitch among the concurrent notes as the melody, while avoiding temporally overlapping notes [75].
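
A compact sketch of the skyline rule is given below, assuming notes are dictionaries with onset, pitch and duration in beats; it paraphrases the rule described above and is not the original implementation of [75].

```python
# Hedged sketch of the skyline melody rule: keep the highest-pitched note at each
# onset and skip notes that would overlap the previously selected melody note.

def skyline(notes):
    """notes: list of dicts with 'onset', 'pitch' and 'duration' (in beats)."""
    highest_at_onset = {}
    for n in sorted(notes, key=lambda n: (n["onset"], -n["pitch"])):
        highest_at_onset.setdefault(n["onset"], n)   # first seen = highest pitch
    melody, last_end = [], float("-inf")
    for onset in sorted(highest_at_onset):
        n = highest_at_onset[onset]
        if onset >= last_end:                        # avoid temporal overlap
            melody.append(n)
            last_end = onset + n["duration"]
    return melody
```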

The CNN method [130] uses piano-roll, a 2D representation where the x-axis represents symbolic time and the y-axis represents pitch, to represent MIDI. Their CNN learns to predict the probability that each note belongs to the melody line. Then, a clustering algorithm is used to find a threshold for each piece adaptively. Finally, the Bellman-Ford algorithm is adopted to pick a strictly monophonic melody line. In contrast, we do not have postprocessing steps such as thresholding or clustering in our BERT-based model and the RNN baseline.

The source code for Simonetta’s model [130] is available online (https://github.com/LIMUNIMI/Symbolic-Melody-Identification), but we make the following modifications to improve the model’s performance: we use binary cross-entropy loss instead of mean error loss, sigmoid rather than ReLU activations, an Adam optimizer with learning rate 1e-4, and dropout to prevent overfitting. We share the re-implemented version online (https://github.com/sophia1488/symbolic-melody-identification).

As an additional baseline for style and emotion classification, we implement the ResNet50-based CNN model from [112], which represents the state of the art for composer classification, based on the authors’ code (https://github.com/KimSSung/Deep-Composer-Classification).

Figure 2: Illustration of the (a) pre-training procedure of our model for a CP sequence, where the model learns to predict (reconstruct) randomly picked super tokens masked in the input sequence (each consisting of four tokens, as the example shown in the middle with time step t); and (b), (c) the fine-tuning procedures for note-level and sequence-level classification. Apart from the last few output layers, both pre-training and fine-tuning use the same architecture.

6 BERT Pre-training and Fine-tuning

We now present our PTM, a pre-trained Transformer encoder with 111M parameters for piano MIDI music. We adopt as the model backbone the BERT-Base model [84], a classic multi-layer bi-directional Transformer encoder with 12 layers of multi-head self-attention, each with 12 heads and a hidden dimension of 768 in the self-attention layers. Below, we first describe the pre-training strategy and then the fine-tuning method for the downstream tasks.
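
For reference, a backbone of this size can be instantiated with HuggingFace Transformers along the following lines; the vocabulary size shown is the REMI vocabulary for MIDI scores, the maximum position count matches our 512-token sequences, and the positional-encoding variant is our reading of the relative encoding cited in Section 6.1, so the exact settings of the released checkpoints may differ.

```python
from transformers import BertConfig, BertModel

# Hedged sketch of the backbone configuration: 12 layers, 12 heads, hidden size 768.
# vocab_size and max_position_embeddings are placeholders (169 = REMI vocabulary for
# MIDI scores; 512 = our sequence length); released checkpoints may differ in detail.

config = BertConfig(
    vocab_size=169,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    max_position_embeddings=512,
    position_embedding_type="relative_key_query",  # a relative positional encoding
)
backbone = BertModel(config)
```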

6.1 Pre-training

For PTMs, an unsupervised or self-supervised pre-training task is needed to set the objective function for learning. We employ the masked language modelling (MLM) pre-training strategy of BERT, randomly masking 15% of the tokens of an input sequence and training the Transformer to reconstruct these masked tokens from the context of the visible tokens by minimising the cross-entropy loss. As a self-supervised method, MLM needs no labelled data related to the downstream tasks for pre-training. Following BERT, among all the masked tokens, we replace 80% by MASK tokens, 10% by a randomly chosen token and leave the last 10% unchanged. Doing so helps mitigate the mismatch between pre-training and fine-tuning, as MASK tokens do not appear at all during fine-tuning. For REMI, we mask individual tokens at random. For CP, we mask super tokens: when we mask a super token, we have to reconstruct all the tokens composing it (four for MIDI scores, six for MIDI performances) with different output heads [98], as shown in Fig. 2(a).
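
A sketch of this masking procedure on a flat sequence of token ids follows; for CP the same decision would be applied to an entire super token, and the label value −100 is the conventional ignore index for the cross-entropy loss.

```python
import random

# Hedged sketch of BERT-style masking: 15% of positions are selected; of these,
# 80% become MASK, 10% a random token and 10% keep the original token.

def mask_for_mlm(token_ids, mask_id, vocab_size, mask_prob=0.15):
    inputs, labels = list(token_ids), [-100] * len(token_ids)  # -100: not scored
    for i, tok in enumerate(token_ids):
        if random.random() >= mask_prob:
            continue
        labels[i] = tok                                  # reconstruct this position
        r = random.random()
        if r < 0.8:
            inputs[i] = mask_id                          # 80%: MASK token
        elif r < 0.9:
            inputs[i] = random.randrange(vocab_size)     # 10%: random token
        # remaining 10%: leave the token unchanged
    return inputs, labels
```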

There are three steps for processing an input token. First, each input token x_i is converted into a token embedding e_i through an embedding layer. Second, the embedding is augmented by adding a relative positional encoding [102] related to its time step. Third, the result is fed to the stack of 12 self-attention layers to obtain a “contextualised” representation, known as a hidden vector, at the output of the self-attention stack. Because of the bi-directional self-attention layers, the hidden vector is contextualised in the sense that it has attended to information from all the other tokens of the same sequence. Finally, the hidden vector of a masked token is fed into a dense layer to predict the missing (super) token. As our network structure is rather standard, we refer readers to the literature [134, 84, 141] for details and the mathematical underpinnings due to space limits. Because the vocabulary sizes of the different token types are different, we weight the training loss associated with tokens of different types in proportion to the corresponding vocabulary size, for both REMI and CP, to facilitate model training.
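
The loss weighting can be sketched as follows for the CP case with one output head per token type; the proportional weighting shown is one plausible reading of the scheme described above, not necessarily the released implementation.

```python
import torch.nn.functional as F

# Hedged sketch of type-wise loss weighting for CP: per-head cross-entropy losses
# are combined with weights proportional to each head's vocabulary size.

def weighted_mlm_loss(logits_per_type, labels_per_type, vocab_sizes):
    total = sum(vocab_sizes)
    loss = 0.0
    for logits, labels, v in zip(logits_per_type, labels_per_type, vocab_sizes):
        # logits: (batch, seq_len, v); labels: (batch, seq_len), -100 where unmasked
        ce = F.cross_entropy(logits.transpose(1, 2), labels, ignore_index=-100)
        loss = loss + (v / total) * ce
    return loss
```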

We note that the original BERT article also used another self-supervised task called “next sentence prediction” (NSP) [84] together with MLM for pre-training. We do not use NSP for our model since it was later shown to be not that useful [107, 143, 82]; moreover, NSP requires a clear definition of “sentences”, which is not well defined for our MIDI sequences. As a result, we do not use the tokens CLS, SEP and EOS that BERT uses to mark sentence boundaries (CLS marks the beginning of a sentence, SEP the boundary between two consecutive sentences, useful for the next sentence prediction task [84], and EOS the end of a sentence).

6.2 Fine-tuning

It has been widely shown in NLP and related fields [81, 117, 131, 74] that, by storing knowledge in huge numbers of parameters and carrying out task-specific fine-tuning, the knowledge implicitly encoded in the parameters of a PTM can be transferred to benefit a variety of downstream tasks [93]. For fine-tuning, we extend the architecture shown in Fig. 2(a) by modifying the last few layers in two different ways, one for each of the two types of downstream tasks.

Fig. 2(b) shows the fine-tuning architecture for note-level classification. While the Transformer uses the hidden vectors to recover the masked tokens during pre-training, it has to predict the label of an input token during fine-tuning, by learning from the labels provided in the training data of the downstream task in a supervised way. To achieve this, we feed the hidden vectors to a stack comprising a dense layer, a ReLU activation and another dense layer for the output classification, with 10% dropout probability. We note that this classifier design is fairly simple, as we expect that much of the knowledge relevant to the downstream task can already be extracted by the preceding self-attention layers.
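
A sketch of this note-level head is shown below; the intermediate width is a placeholder, and the number of classes here is set to three as in melody classification.

```python
import torch.nn as nn

# Hedged sketch of the note-level classification head: dropout, dense, ReLU, dense,
# applied independently to every hidden vector of the sequence.

class NoteLevelHead(nn.Module):
    def __init__(self, hidden_size=768, num_classes=3, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Dropout(dropout),
            nn.Linear(hidden_size, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, hidden_states):        # (batch, seq_len, hidden_size)
        return self.net(hidden_states)       # (batch, seq_len, num_classes)
```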

Fig. 2(c) shows the fine-tuning architecture for sequence-level classification. Inspired by the Bi-LSTM-Attn model [114], we employ an attention-based weighted-average mechanism to convert the sequence of 512 hidden vectors of an input sequence into one single vector before feeding it to the classifier, which comprises two dense layers. We note that, unlike the baseline models introduced in Section 5, we do not use RNN layers in our models. An alternative approach would be to add a CLS token to our sequences and simply use its hidden vector as the input to the classifier; we do not explore this alternative since our token vocabulary does not include a CLS token.
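
The following sketch illustrates the attention-based pooling and classifier; the single-query attention shown here is a simplification of the self-attentive mechanism of [114], and the layer widths are placeholders.

```python
import torch
import torch.nn as nn

# Hedged sketch of the sequence-level head: attention-weighted average of the 512
# hidden vectors, followed by two dense layers for classification.

class SequenceLevelHead(nn.Module):
    def __init__(self, hidden_size=768, num_classes=4):
        super().__init__()
        self.attn = nn.Linear(hidden_size, 1)
        self.classifier = nn.Sequential(
            nn.Linear(hidden_size, 256), nn.ReLU(), nn.Linear(256, num_classes)
        )

    def forward(self, hidden_states):                              # (batch, seq_len, hidden_size)
        weights = torch.softmax(self.attn(hidden_states), dim=1)   # (batch, seq_len, 1)
        pooled = (weights * hidden_states).sum(dim=1)              # (batch, hidden_size)
        return self.classifier(pooled)
```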

6.3 Implementation Details

Our implementation is based on the open-source library HuggingFace [138]. As mentioned in Section 3.1, we use Pop1K7 and ASAP for pre-training and the other three datasets (i.e., POP909, Pianist8 and EMOPIA) for the downstream tasks. From the combination of Pop1K7 and ASAP, we use 85% of the data for pre-training as described in Section 6.1 and the rest as the validation set. We train with a batch size of 12 sequences for at most 500 epochs (i.e., around 500K iterations for REMI and 1M iterations for CP), using the AdamW optimizer with learning rate 2e-5 and weight decay rate 0.01. If the validation cross-entropy loss does not improve for 30 consecutive epochs, we stop the training process early. For pre-training, the validation “cloze” accuracy improves from 70.4% for REMI to 78.73% for CP. We observe that pre-training with the CP representation converges in 2.5 days on four GeForce GTX 1080 Ti GPUs, which is about 2.5 times faster than with REMI. Moreover, a smaller batch size degrades overall performance, including downstream classification accuracy. Because each sequence has 512 super tokens, we have 6,144 super tokens per batch.

For fine-tuning, we create training, validation and test splits for each of the three downstream datasets with an 8:1:1 ratio at the piece level (i.e., all the 512-token sequences from the same piece are in the same split). With the same batch size of 12, we fine-tune the pre-trained model for each task independently for at most 10 epochs, early-stopping when there is no improvement for three consecutive epochs. Compared to pre-training, fine-tuning is less computationally expensive. All the results reported in our work can be reproduced with four GeForce GTX 1080 Ti GPUs within 30 minutes.
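
The piece-level split can be sketched as follows; the random seed shown is arbitrary and the function is an illustration of the splitting principle rather than the exact splits used in our experiments.

```python
import random

# Hedged sketch of the 8:1:1 piece-level split: whole pieces are assigned to a split,
# so all 512-token sequences from one piece end up in the same split.

def split_pieces(piece_ids, seed=0):
    ids = list(piece_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(0.8 * len(ids))
    n_valid = int(0.1 * len(ids))
    return (ids[:n_train],
            ids[n_train:n_train + n_valid],
            ids[n_train + n_valid:])
```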

In our experiments, we will use the same pre-trained model parameters to initialise the models for different downstream tasks. During fine-tuning, we fine-tune the parameters of all the layers, including the self-attention and token embedding layers.

Table 2: The testing classification accuracy (in %) of different combinations of MIDI token representations and models for four downstream tasks: three-class melody classification, velocity prediction, style classification and emotion classification. “CNN” represents the ResNet50 model used by [112], which only supports sequence-level tasks. “RNN” denotes the baseline models introduced in Section 5, representing the Bi-LSTM model for the first two (note-level) tasks and the Bi-LSTM-Attn model [114] for the last two (sequence-level) tasks.
| Token | Model | Melody | Velocity | Style | Emotion |
| --- | --- | --- | --- | --- | --- |
| REMI | CNN [112] | – | – | 51.35 | 60.00 |
| REMI | RNN [114] | 89.96 | 44.56 | 51.97 | 53.46 |
| REMI | Our model (score) | 90.97 | 49.02 | 67.19 | 67.74 |
| REMI | Our model (performance) | 89.23 | – | 75.30 | 66.18 |
| CP | CNN [112] | – | – | 67.57 | 60.00 |
| CP | RNN [114] | 88.66 | 43.77 | 60.32 | 54.13 |
| CP | Our model (score) | 96.15 | 52.11 | 67.46 | 64.22 |
| CP | Our model (performance) | 95.83 | – | 81.75 | 70.64 |
| OctupleMIDI | MusicBERT [144] | – | – | 37.25 | 77.78 |

7 Performance Study

In what follows, we use “our model (score)” to denote the results obtained with MIDI scores, and similarly “our model (performance)” for MIDI performances. Since the MIDI-performance representation contains velocity information, for fairness we do not evaluate it on the velocity prediction task. We note that, while “our model (score)” and “our model (performance)” adopt different token representations, we consider it valid to compare their performance as their training and test data come from the same sets of music pieces.

Tab. 2 lists the testing accuracy achieved by the baseline models and the proposed ones for the four downstream tasks. We see that “our model (score)” outperforms the Bi-LSTM or Bi-LSTM-Attn baselines in all tasks consistently, using either the REMI or CP representation. In particular, the combination of our model (score) and CP, referred to as “our model (score)+CP” hereafter, exhibits the highest accuracy in the two note-level tasks. Additionally, the combination of our model (performance) and CP, denoted as “our model (performance)+CP”, achieves the best result in the style classification task, while demonstrating a notable improvement in accuracy compared to REMI for emotion classification. We also observe that our models outperform Bi-LSTM+CP with just 1 or 2 epochs of fine-tuning, validating the strength of PTMs on symbolic-domain music classification tasks.

To facilitate a comprehensive evaluation, we additionally include an officially released version of MusicBERT [144] in the sequence-level classification tasks. Specifically, we use the model checkpoint MusicBERT-small (https://github.com/microsoft/muzic/tree/main/musicbert), which is pre-trained on the Lakh MIDI (LMD) dataset [126] of about 100K songs. (There is another checkpoint named MusicBERT-base, which is trained on the Million MIDI Dataset [144], roughly ten times larger than LMD.) The results show that MusicBERT achieves a testing accuracy of 37.25% for style classification and 77.78% for emotion classification. In the style classification task, MusicBERT exhibits clear signs of overfitting and falls short of our model (81.75%). This outcome can be attributed to the limited size of the Pianist8 dataset, comprising only 411 songs. Conversely, in the emotion classification task, MusicBERT demonstrates impressive performance, surpassing our model (70.64%) by a significant margin. This finding is intriguing and suggests that large-scale pre-training may yield substantial benefits for classifying the emotional content of a MIDI piece.

Tab. 2 also shows that the CP token representation tends to outperform the REMI one across different tasks for both the baseline models and the PTM-based models, demonstrating the importance of token representation for music applications. To study whether the accuracy gain comes simply from the longer musical context enjoyed by CP, we also train “our model (performance)+CP” with a sequence length of 128, obtaining accuracies of 95.43, 80.32 and 64.04 for three-class melody classification, style classification and emotion classification, respectively. We note that a sequence of 512 REMI tokens contains approximately the same amount of information as a sequence of 147 CP super tokens. Still, using the CP token representation in general leads to better performance even with less information.

Tab. 2 also shows that “our model (performance)+CP” outperforms “our model (score)+CP” greatly for the two sequence-level tasks, style classification and emotion classification. This matches our intuition, as the two tasks are highly related to the performance styles and expressions of the piano pieces.

In what follows, we take a closer look at the performance of the evaluated models in the different tasks, in particular Bi-LSTM+CP (or Bi-LSTM-Attn+CP), “our model (score)+CP” and “our model (performance)+CP”.

Figure 3: Confusion tables (in %) for two models for three-class melody classification, calculated on the test split of POP909 (4/4). Each row represents the percentage of notes in an actual class while each column represents a predicted class. Notation—“M1”: vocal melody, “M2”: instrumental melody, “A”: accompaniment.
Table 3: Testing metrics (in %) of “our model (performance)+CP” and other baseline methods for the two-class “melody versus non-melody” classification task using POP909, viewing vocal melody and instrumental melody as “melody” and accompaniment as “non-melody”.

| Model | Accuracy | Precision | Recall | F1 |
| --- | --- | --- | --- | --- |
| Skyline [75] | 79.52 | 81.42 | 56.57 | 66.76 |
| Simonetta et al.’s CNN [130] | 92.08 | 88.95 | 89.30 | 89.13 |
| Our model (performance)+CP | 99.06 | 98.68 | 98.72 | 98.70 |
Figure 4: The melody/non-melody classification result for “POP909-596.mid” by (b) the skyline algorithm [75], (c) Simonetta et al.’s CNN [130] and (d) our model (performance)+CP. Directing attention to the red-circled region within the pianoroll representation, it is evident that the CNN baseline faces challenges in effectively distinguishing between melody and accompaniment, particularly when note pitches reside within the C4 to C5 range during the initial phase. This is especially pronounced in low-pitch scenarios, where the CNN baseline struggles with accurate classification. In contrast, our model exhibits notably improved predictive accuracy, closely aligning with the ground truth. To further supplement this information, the generated melody audio files and pianoroll figures are available in our repository.

7.1 Melody

Fig. 3 presents the normalised confusion tables for three-class melody classification, illustrating distinct performance characteristics among the models. We note that the baseline exhibits a tendency to conflate vocal melody (M1) and instrumental melody (M2), whereas our model outperforms the RNN-based model, enhancing the overall accuracy by almost 8 percentage points (from 88.66% to 96.15%). A closer examination reveals our model’s superior ability to differentiate between vocal and instrumental melodies compared to the RNN baseline with minimal fine-tuning. This task is particularly challenging given the nature of the POP909 dataset, which exclusively features pop songs sung by humans. Consequently, the separation of vocal and instrumental melodies relies on the criterion of human vocalisation (which is absent in MIDI data), potentially leading to instances where notes between phrases are designated as instrumental melody despite sharing pitch and melodic characteristics with vocal melody.

Interestingly, “our model (score) + CP” demonstrates a more effective distinction between vocal and instrumental melodies than “our model (performance) + CP”. This suggests that even without velocity information, our model can discern segments designated for singing from those serving as preludes, interludes or fills.

Tab. 3 compares our model with the skyline algorithm [75] and the CNN-based baseline [130] on the two-class “melody versus non-melody” classification task. As the dataset is highly unbalanced (i.e., the melody notes are much fewer than the accompaniment notes), we also report the precision, recall and F1 scores. It turns out that our model greatly outperforms the other baselines across all the metrics, reaching 99.06% classification accuracy. A qualitative example demonstrating the superiority of the proposed model over the skyline algorithm can be found in Fig. 4, using a randomly chosen testing piece from POP909 (4/4).

Moreover, we have extended the application of our melody extraction model to compositions from the Pianist8 dataset. Given the absence of manual labels for the melody notes within this dataset, we encourage readers to evaluate the results by listening to the prediction outputs (https://github.com/wazenmai/MIDI-BERT/tree/CP/melody_extraction/audio). We provide three versions of the melody MIDI file for each original song, generated respectively by the skyline algorithm, Simonetta et al.’s CNN and “our model (performance) + CP”. Taking “Clayderman_Yesterday_Once_More.mid” as an example, the melody generated by the skyline algorithm exhibits stiffness and lacks intricate details, retaining only the treble. The CNN version demonstrates considerable improvement over the skyline algorithm; however, a noticeable intermittent quality persists throughout the entire song, with some cohesive melodies omitted. Our model achieves commendable performance, successfully extracting the majority of the main melody and presenting a discernible melodic progression. It is worth highlighting the efficiency of our model: it requires less than one hour of fine-tuning on the POP909 training set under the same hardware conditions for which the CNN baseline needs a full day of training.

Figure 5: Confusion tables (in %) for velocity prediction, calculated on the test split of POP909 (4/4). Each row represents the percentage of notes in an actual class while each column represents a predicted class.

7.2 Velocity

Tab. 2 shows that the accuracy on our 6-class velocity classification task is not high, reaching 52.11% at best. This may be due to the fact that velocity is rather subjective, meaning that musicians can perform the same music piece fairly differently. Moreover, we note that the data is highly imbalanced, with the latter three classes (mf, f, ff) taking up nearly 90% of all labelled data. The confusion tables presented in Fig. 5 show that Bi-LSTM tends to classify most of the notes into f, the most popular class among the six. This is less the case for our model, but the prediction of p and pp, i.e., the two with the lowest dynamics, remains challenging. For future work, data augmentation is a potential solution to mitigate the impact of data imbalance.

Figure 6: Confusion tables (in %) for style classification on the test split of Pianist8. Each row shows the percentage of sequences of a class predicted as another class. Notation—“C”: R. Clayderman (pop), “Y”: Yiruma (pop), “H”: H. Hancock (jazz), “E”: L. Einaudi (contemporary), “J”: H. Joe (contemporary), “S”: R. Sakamoto (contemporary), “M”: Bethel Music (religious) and “W”: Hillsong Worship (religious).

7.3 Style

Tab. 2 shows that our model greatly outperforms Bi-LSTM-Attn [114] and the CNN baseline [112] by 10–20 percentage points regardless of the token representation taken, reaching 81.75% testing accuracy at best for this 8-class classification problem. In addition, we see a large performance gap between REMI and CP in this task, the largest among the four tasks. Fig. 6 further shows that both the baseline and “our model (score)+CP” confuse artists in similar genres, and that our model performs fairly well in recognising Herbie Hancock and Ryuichi Sakamoto. In contrast, by considering velocity and tempo information, “our model (performance)+CP” gains considerable precision in classifying songs in the pop and contemporary genres, boosting the classification accuracy from 67.46% (score) to 81.75% (performance).

7.4 Emotion

Tab. 2 shows that our model outscores Bi-LSTM-Attn by around 14 percentage points and the CNN baseline [112] by around 7 percentage points with both REMI and CP for this 4-class classification problem, reaching 70.64% testing accuracy at best. There is little performance difference between REMI and CP in this task. Fig. 7 further shows that the evaluated models can fairly easily distinguish between high-arousal and low-arousal pieces (i.e., “HAHV, HALV” versus “LALV, LAHV”), but they have a much harder time along the valence axis (e.g., “HAHV” versus “HALV” and “LALV” versus “LAHV”). We see less confusion in the result of “our model (score)+CP”. By considering velocity and tempo, “our model (performance)+CP” can further distinguish valence differences among low-arousal songs, though there is still room for improvement.

Figure 7: Confusion tables (in %) for emotion classification, calculated on the test split of EMOPIA. Each row represents the percentage of sequences in an actual class while each column represents a predicted class.

8 Conclusion

In this article, we presented a large-scale pre-trained model for musical data in the MIDI format. We employed five public-domain piano MIDI datasets for BERT-like masking-based pre-training and evaluated the pre-trained model on four challenging downstream symbolic music classification tasks, most with less than 1K labelled MIDI pieces. Our experiments validate the effectiveness of pre-training for both note-level and sequence-level classification tasks.

This work can be extended in many ways. First, to employ other pre-training strategies or architectures [93]. Second, to employ Transformer models with linear computational complexity [79, 116], so as to use whole MIDI pieces (instead of segments) during pre-training. (We note that the use of linear Transformers for symbolic music generation has been attempted before [98].) Third, to include other time signatures and increase the amount of non-pop piano scores. Fourth, to extend the corpus and the token representation from single-track piano to multi-track MIDI, like the work done by [144]. Finally, to consider more downstream tasks such as symbolic-domain music segmentation [92, 111], chord recognition [94], score passage matching [132] and beat tracking [80]. We have released the code publicly, which may hopefully help facilitate such endeavours.

References

  • [1] Sebastian Böck et al. “Madmom: A new Python audio and music signal processing library” In Proc. ACM Multimedia, 2016, pp. 1174–1178
  • [2] Nadav Brandes “ProteinBERT: A universal deep-learning model of protein sequence and function” In Bioinformatics 38.8, 2022, pp. 2102–2110
  • [3] W. Chai and B. Vercoe “Melody retrieval on the web” In Multimedia Computing and Networking, 2001
  • [4] Yu-Hua Chen “Automatic composition of guitar tabs by Transformers and groove modeling” In Proc. Int. Soc. Music Information Retrieval Conf., 2020, pp. 756–763
  • [5] Ching-Yu Chiu, Alvin Wen-Yu Su and Yi-Hsuan Yang “Drum-aware ensemble architecture for improved joint musical beat and downbeat tracking” In IEEE Signal Processing Letters 28, 2021, pp. 1100–1104
  • [6] Keunwoo Choi, György Fazekas, Mark Sandler and Kyunghyun Cho “Transfer learning for music classification and regression tasks” In Proc. Int. Soc. Music Information Retrieval Conf., 2017
  • [7] Krzysztof Choromanski “Rethinking Attention with Performers” In Proc. Int. Conf. Learning Representations, 2021
  • [8] Yi-Chin Chuang and Li Su “Beat and downbeat tracking of symbolic music data using deep recurrent neural networks” In Proc. Asia Pacific Signal and Information Processing Association Annual Summit and Conf., 2020, pp. 346–352
  • [9] Yung-Sung Chuang, Chi-Liang Liu, Hung-Yi Lee and Lin-Shan Lee “SpeechBERT: An audio-and-text jointly learned language model for end-to-end spoken question answering” In Proc. INTERSPEECH, 2020, pp. 4168–4172
  • [10] Alexis Conneau and Guillaume Lample “Cross-lingual language model pretraining” In Proc. Int. Conf. Neural Information Processing Systems, 2019, pp. 7059–7069
  • [11] D. Correa and F. Rodrigues “A survey on symbolic data-based music genre classification” In Expert Systems 60.30, 2016, pp. 190–210
  • [12] Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova “BERT: Pre-training of deep bidirectional Transformers for language understanding” In Proc. Conf. North American Chapter of the Association for Computational Linguistics, 2019, pp. 4171–4186
  • [13] Hao-Wen Dong et al. “Multitrack Music Transformer” In Proc. Int. Conf. Acoustics, Speech and Signal Processing, 2023, pp. 1–5
  • [14] Andres Ferraro and Kjell Lemstrom “On large-scale genre classification in symbolically encoded music by automatic identification of repeating patterns” In Proc. Int. Conf. Digital Libraries for Musicology, 2018
  • [15] Francesco Foscarin “ASAP: A dataset of aligned scores and performances for piano transcription” In Proc. Int. Soc. Music Inf. Retr. Conf., 2020, pp. 534–541
  • [16] A. Friberg, R. Bresin and J. Sundberg “Overview of the KTH rule system for musical performance” In Advances in Cognitive Psychology 2.2-3, 2006, pp. 145–161
  • [17] Anders Friberg, Erwin Schoonderwaldt and Patrik Juslin “CUEX: An Algorithm for Automatic Extraction of Expressive Tone Parameters in Music Performance from Acoustic Signals” In Acta Acustica united with Acustica 93, 2007, pp. 411–420
  • [18] Jacek Grekow and Zbigniew W Raś “Detecting emotions in classical music from MIDI files” In Proc. Int. Symposium on Methodologies for Intelligent Systems, 2009, pp. 261–270
  • [19] S. Gururani, M. Sharma and A. Lerch “An attention mechanism for musical instrument recognition” In Proc. Int. Soc. Music Information Retrieval Conf., 2019, pp. 83–90
  • [20] M. Hamanaka, K. Hirata and S. Tojo “Musical structural analysis database based on GTTM” In Proc. Int. Soc. Music Information Retrieval Conf., 2014
  • [21] X. Han “Pre-trained models: Past, present and future” In AI Open 2, 2021, pp. 225–250
  • [22] Daniel Harasim et al. “The Jazz Harmony Treebank” In Proc. Int. Soc. Music Information Retrieval Conf., 2020, pp. 207–215
  • [23] Curtis Hawthorne “Onsets and Frames: Dual-objective piano transcription” In Proc. Int. Soc. Music Information Retrieval Conf., 2018, pp. 50–57
  • [24] Curtis Hawthorne “Enabling factorized piano music modeling and generation with the MAESTRO dataset” In Proc. Int. Conf. Learning Representations, 2019
  • [25] Curtis Hawthorne et al. “Sequence-to-Sequence Piano Transcription with Transformers” In Proc. Int. Soc. Music Information Retrieval Conf., 2021, pp. 246–253
  • [26] Wen-Yi Hsiao, Jen-Yu Liu, Yin-Cheng Yeh and Yi-Hsuan Yang “Compound Word Transformer: Learning to compose full-song music over dynamic directed hypergraphs” In Proc. AAAI, 2021
  • [27] Yo-Wei Hsiao and Li Su. “Learning note-to-note affinity for voice segregation and melody line identification of symbolic music data” In Proc. Int. Soc. Music Information Retrieval Conf., 2021, pp. 285–292
  • [28] C.-Z.. Huang “Music Transformer: Generating music with long-term structure” In Proc. Int. Conf. Learning Representations, 2019
  • [29] Y.-S. Huang and Y.-H. Yang “Pop Music Transformer: Beat-based modeling and generation of expressive pop piano compositions” In Proc. ACM Multimedia, 2020, pp. 1180–1188
  • [30] Zhiheng Huang, Davis Liang, Peng Xu and Bing Xiang “Improve Transformer models with better relative position embeddings” In Findings of the Association for Computational Linguistics: EMNLP 2020, 2020
  • [31] Hsiao-Tzu Hung et al. “EMOPIA: A multi-modal pop piano dataset for emotion recognition and emotion-based music generation” In Proc. Int. Soc. Music Information Retrieval Conf., 2021
  • [32] D. Jeong et al. “VirtuosoNet: A hierarchical RNN-based system for modeling expressive piano performance” In Proc. Int. Soc. Music Information Retrieval Conf., 2019, pp. 908–915
  • [33] Dasaem Jeong, Taegyun Kwon, Yoojin Kim and Juhan Nam “Graph neural network for music score data and modeling expressive piano performance” In Proc. Int. Conf. Machine Learning, 2019, pp. 3060–3070
  • [34] Z. Jiang and R.. Dannenberg “Melody identification in standard MIDI files” In Proc. Sound and Music Computing Conf., 2019, pp. 65–71
  • [35] Mandar Joshi “SpanBERT: Improving pre-training by representing and predicting spans” In Transactions of the Association for Computational Linguistics 8, 2020, pp. 64–77
  • [36] Jaehun Kim, Julian Urbano, Cynthia Liem and Alan Hanjalic “One deep music representation to rule them all? A comparative analysis of different representation learning strategies” In Neural Computing & Applications, 2019
  • [37] Q. Kong, K. Choi and Y. Wang “Large-scale MIDI-based composer classification” In arXiv preprint arXiv:2010.14805, 2020
  • [38] Qiuqiang Kong et al. “High-resolution piano transcription with pedals by regressing onsets and offsets times” In IEEE/ACM Trans. Audio, Speech and Lang. Proc. 29, 2021, pp. 3707–3717
  • [39] Peter Van Kranenburg “Rule mining for local boundary detection in melodies” In Proc. Int. Soc. Music Information Retrieval Conf., 2020
  • [40] H.. Lee et al. “Deep composer classification using symbolic representation” In Proc. Int. Soc. Music Information Retrieval Conf., Late-breaking Demo paper, 2020
  • [41] Yi Lin, Xiaoou Chen and Deshun Yang “Exploration of music emotion recognition based on MIDI” In Proc. Int. Society for Music Information Retrieval Conf., 2013, pp. 221–226
  • [42] Zhouhan Lin “A structured self-attentive sentence embedding” In Proc. Int. Conf. Learning Representations, 2017
  • [43] Yinhan Liu “RoBERTa: A robustly optimized BERT pretraining approach” In arXiv preprint arXiv:1907.11692, 2019
  • [44] Antoine Liutkus et al. “Relative positional encoding for Transformers with linear complexity” In Proc. Int. Conf. Machine Learning, 2021
  • [45] Jiasen Lu, Dhruv Batra, Devi Parikh and Stefan Lee “ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks” In Proc. Int. Conf. Neural Information Processing Systems, 2019, pp. 13–23
  • [46] Peiling Lu et al. “MuseCoco: Generating Symbolic Music from Text” In arXiv preprint arXiv:2306.00110, 2023
  • [47] Søren Tjagvad Madsen and Gerhard Widmer “A complexity-based approach to melody track identification in MIDI files” In Proc. Int. Workshop on Artificial Intelligence and Music, 2007
  • [48] Sageev Oore “This time with feeling: Learning expressive musical performance” In Neural Computing and Applications 32 Springer, 2018, pp. 955–967
  • [49] Renato Panda “Multi-modal music emotion recognition: A new dataset, methodology and comparative analysis” In Proc. Int. Symposium on Computer Music Multidisciplinary Research, 2013
  • [50] Renato Panda, Ricardo Malheiro and Rui Pedro Paiva “Musical texture and expressivity features for music emotion recognition” In Proc. Int. Society for Music Information Retrieval Conf., 2018
  • [51] Ashis Pati, Alexander Lerch and Gaëtan Hadjeres “Learning to traverse latent spaces for musical score inpainting” In Proc. Int. Soc. Music Information Retrieval Conf., 2019, pp. 343–351
  • [52] Christine Payne “"MuseNet."”, 2019 URL: https://openai.com/research/musenet
  • [53] Alec Radford “Language Models are Unsupervised Multitask Learners”, 2019 URL: https://openai.com/blog/better-language-models/
  • [54] C. Raffel and D… Ellis “Extracting ground-truth information from MIDI files: A MIDIfesto” In Proc. Int. Soc. Music Information Retrieval Conf., 2016, pp. 796–802
  • [55] James Russell “A Circumplex Model of Affect” In Journal of Personality and Social Psychology 39, 1980, pp. 1161–1178
  • [56] Justin Salamon and Emilia Gomez “Melody Extraction From Polyphonic Music Signals Using Pitch Contour Characteristics” In IEEE Transactions on Audio, Speech, and Language Processing 20.6, 2012, pp. 1759–1770 DOI: 10.1109/TASL.2012.2188515
  • [57] Yi-Jen Shih et al. “Theme Transformer: Symbolic Music Generation with Theme-Conditioned Transformer” In IEEE Transactions on Multimedia, 2022, pp. 1–1
  • [58] F. Simonetta, C.. Chacón, S. Ntalampiras and G. Widmer “A Convolutional approach to melody line identification in symbolic scores” In Proc. Int. Soc. Music Information Retrieval Conf., 2019, pp. 924–931
  • [59] Chen Sun “VideoBERT: A Joint Model for Video and Language Representation Learning” In Proc. IEEE/CVF International Conference on Computer Vision, 2019, pp. 7464–7473
  • [60] Richard Sutcliffe et al. “Searching for musical features using natural language queries: The C@merata evaluations at MediaEval” In Language Resources and Evaluation 53.1 Springer-Verlag, 2019, pp. 87–140
  • [61] T. Tsai and K. Ji “Composer style classification of piano sheet music images using language model pretraining” In Proc. Int. Soc. Music Information Retrieval Conf., 2020, pp. 176–183
  • [62] Ashish Vaswani “Attention is all you need” In Proc. Advances in Neural Information Processing Systems, 2017, pp. 5998–6008
  • [63] Elliot Waite “Project Magenta: Generating long-term structure in songs and stories” In Google Brain Blog, 2016
  • [64] Ziyu Wang et al. “POP909: A pop-song dataset for music arrangement generation” In Proc. Int. Society for Music Information Retrieval Conf., 2020, pp. 38–45
  • [65] Gerhard Widmer “The synergy of music theory and AI: Learning multi-Level expressive interpretation” In Proc. AAAI, 1994
  • [66] Thomas Wolf “State-of-the-art natural language processing” In Proc. Conf. Empirical Methods in Natural Language Processing: System Demonstrations, 2020, pp. 38–45
  • [67] Ho-Hsiang Wu “Multi-task self-supervised pre-training for music classification” In Proc. Int. Conf. Acoustics, Speech and Signal Processing, 2021, pp. 556–560
  • [68] Shih-Lun Wu and Yi-Hsuan Yang “The Jazz Transformer on the front line: Exploring the shortcomings of AI-composed music through quantitative measures” In Proc. Int. Soc. Music Information Retrieval Conf., 2020, pp. 142–149
  • [69] S.-L. Wu and Y.-H. Yang “MuseMorphose: Full-song and fine-grained piano music style transfer with just one Transformer VAE” In IEEE/ACM Transactions on Audio, Speech, and Language Processing 31, 2023, pp. 1953–1967
  • [70] Y.-H. Yang and H.. Chen “Music Emotion Recognition” Boca Raton, Florida: CRC Press, 2011
  • [71] Zhilin Yang “XLNet: Generalized autoregressive pretraining for language understanding” In Proc. Int. Conf. Neural Information Processing Systems, 2019, pp. 5753–5763
  • [72] Mingliang Zeng et al. “MusicBERT: Symbolic music understanding with large-scale pre-Training” In Findings of the Association for Computational Linguistics, 2021, pp. 791–800

References

  • [73] Sebastian Böck et al. “Madmom: A new Python audio and music signal processing library” In Proc. ACM Multimedia, 2016, pp. 1174–1178
  • [74] Nadav Brandes “ProteinBERT: A universal deep-learning model of protein sequence and function” In Bioinformatics 38.8, 2022, pp. 2102–2110
  • [75] W. Chai and B. Vercoe “Melody retrieval on the web” In Multimedia Computing and Networking, 2001
  • [76] Yu-Hua Chen “Automatic composition of guitar tabs by Transformers and groove modeling” In Proc. Int. Soc. Music Information Retrieval Conf., 2020, pp. 756–763
  • [77] Ching-Yu Chiu, Alvin Wen-Yu Su and Yi-Hsuan Yang “Drum-aware ensemble architecture for improved joint musical beat and downbeat tracking” In IEEE Signal Processing Letters 28, 2021, pp. 1100–1104
  • [78] Keunwoo Choi, György Fazekas, Mark Sandler and Kyunghyun Cho “Transfer learning for music classification and regression tasks” In Proc. Int. Soc. Music Information Retrieval Conf., 2017
  • [79] Krzysztof Choromanski “Rethinking Attention with Performers” In Proc. Int. Conf. Learning Representations, 2021
  • [80] Yi-Chin Chuang and Li Su “Beat and downbeat tracking of symbolic music data using deep recurrent neural networks” In Proc. Asia Pacific Signal and Information Processing Association Annual Summit and Conf., 2020, pp. 346–352
  • [81] Yung-Sung Chuang, Chi-Liang Liu, Hung-Yi Lee and Lin-Shan Lee “SpeechBERT: An audio-and-text jointly learned language model for end-to-end spoken question answering” In Proc. INTERSPEECH, 2020, pp. 4168–4172
  • [82] Alexis Conneau and Guillaume Lample “Cross-lingual language model pretraining” In Proc. Int. Conf. Neural Information Processing Systems, 2019, pp. 7059–7069
  • [83] D. C. Correa and F. A. Rodrigues “A survey on symbolic data-based music genre classification” In Expert Systems with Applications 60, 2016, pp. 190–210
  • [84] Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova “BERT: Pre-training of deep bidirectional Transformers for language understanding” In Proc. Conf. North American Chapter of the Association for Computational Linguistics, 2019, pp. 4171–4186
  • [85] Hao-Wen Dong et al. “Multitrack Music Transformer” In Proc. Int. Conf. Acoustics, Speech and Signal Processing, 2023, pp. 1–5
  • [86] Andres Ferraro and Kjell Lemstrom “On large-scale genre classification in symbolically encoded music by automatic identification of repeating patterns” In Proc. Int. Conf. Digital Libraries for Musicology, 2018
  • [87] Francesco Foscarin “ASAP: A dataset of aligned scores and performances for piano transcription” In Proc. Int. Soc. Music Information Retrieval Conf., 2020, pp. 534–541
  • [88] A. Friberg, R. Bresin and J. Sundberg “Overview of the KTH rule system for musical performance” In Advances in Cognitive Psychology 2.2-3, 2006, pp. 145–161
  • [89] Anders Friberg, Erwin Schoonderwaldt and Patrik Juslin “CUEX: An Algorithm for Automatic Extraction of Expressive Tone Parameters in Music Performance from Acoustic Signals” In Acta Acustica united with Acustica 93, 2007, pp. 411–420
  • [90] Jacek Grekow and Zbigniew W Raś “Detecting emotions in classical music from MIDI files” In Proc. Int. Symposium on Methodologies for Intelligent Systems, 2009, pp. 261–270
  • [91] S. Gururani, M. Sharma and A. Lerch “An attention mechanism for musical instrument recognition” In Proc. Int. Soc. Music Information Retrieval Conf., 2019, pp. 83–90
  • [92] M. Hamanaka, K. Hirata and S. Tojo “Musical structural analysis database based on GTTM” In Proc. Int. Soc. Music Information Retrieval Conf., 2014
  • [93] X. Han “Pre-trained models: Past, present and future” In AI Open 2, 2021, pp. 225–250
  • [94] Daniel Harasim et al. “The Jazz Harmony Treebank” In Proc. Int. Soc. Music Information Retrieval Conf., 2020, pp. 207–215
  • [95] Curtis Hawthorne “Onsets and Frames: Dual-objective piano transcription” In Proc. Int. Soc. Music Information Retrieval Conf., 2018, pp. 50–57
  • [96] Curtis Hawthorne “Enabling factorized piano music modeling and generation with the MAESTRO dataset” In Proc. Int. Conf. Learning Representations, 2019
  • [97] Curtis Hawthorne et al. “Sequence-to-Sequence Piano Transcription with Transformers” In Proc. Int. Soc. Music Information Retrieval Conf., 2021, pp. 246–253
  • [98] Wen-Yi Hsiao, Jen-Yu Liu, Yin-Cheng Yeh and Yi-Hsuan Yang “Compound Word Transformer: Learning to compose full-song music over dynamic directed hypergraphs” In Proc. AAAI, 2021
  • [99] Yo-Wei Hsiao and Li Su “Learning note-to-note affinity for voice segregation and melody line identification of symbolic music data” In Proc. Int. Soc. Music Information Retrieval Conf., 2021, pp. 285–292
  • [100] C.-Z. A. Huang “Music Transformer: Generating music with long-term structure” In Proc. Int. Conf. Learning Representations, 2019
  • [101] Y.-S. Huang and Y.-H. Yang “Pop Music Transformer: Beat-based modeling and generation of expressive pop piano compositions” In Proc. ACM Multimedia, 2020, pp. 1180–1188
  • [102] Zhiheng Huang, Davis Liang, Peng Xu and Bing Xiang “Improve Transformer models with better relative position embeddings” In Findings of the Association for Computational Linguistics: EMNLP 2020, 2020
  • [103] Hsiao-Tzu Hung et al. “EMOPIA: A multi-modal pop piano dataset for emotion recognition and emotion-based music generation” In Proc. Int. Soc. Music Information Retrieval Conf., 2021
  • [104] D. Jeong et al. “VirtuosoNet: A hierarchical RNN-based system for modeling expressive piano performance” In Proc. Int. Soc. Music Information Retrieval Conf., 2019, pp. 908–915
  • [105] Dasaem Jeong, Taegyun Kwon, Yoojin Kim and Juhan Nam “Graph neural network for music score data and modeling expressive piano performance” In Proc. Int. Conf. Machine Learning, 2019, pp. 3060–3070
  • [106] Z. Jiang and R. B. Dannenberg “Melody identification in standard MIDI files” In Proc. Sound and Music Computing Conf., 2019, pp. 65–71
  • [107] Mandar Joshi “SpanBERT: Improving pre-training by representing and predicting spans” In Transactions of the Association for Computational Linguistics 8, 2020, pp. 64–77
  • [108] Jaehun Kim, Julian Urbano, Cynthia Liem and Alan Hanjalic “One deep music representation to rule them all? A comparative analysis of different representation learning strategies” In Neural Computing & Applications, 2019
  • [109] Q. Kong, K. Choi and Y. Wang “Large-scale MIDI-based composer classification” In arXiv preprint arXiv:2010.14805, 2020
  • [110] Qiuqiang Kong et al. “High-resolution piano transcription with pedals by regressing onset and offset times” In IEEE/ACM Trans. Audio, Speech and Lang. Proc. 29, 2021, pp. 3707–3717
  • [111] Peter Van Kranenburg “Rule mining for local boundary detection in melodies” In Proc. Int. Soc. Music Information Retrieval Conf., 2020
  • [112] H. Lee et al. “Deep composer classification using symbolic representation” In Proc. Int. Soc. Music Information Retrieval Conf., Late-breaking Demo paper, 2020
  • [113] Yi Lin, Xiaoou Chen and Deshun Yang “Exploration of music emotion recognition based on MIDI” In Proc. Int. Soc. Music Information Retrieval Conf., 2013, pp. 221–226
  • [114] Zhouhan Lin “A structured self-attentive sentence embedding” In Proc. Int. Conf. Learning Representations, 2017
  • [115] Yinhan Liu “RoBERTa: A robustly optimized BERT pretraining approach” In arXiv preprint arXiv:1907.11692, 2019
  • [116] Antoine Liutkus et al. “Relative positional encoding for Transformers with linear complexity” In Proc. Int. Conf. Machine Learning, 2021
  • [117] Jiasen Lu, Dhruv Batra, Devi Parikh and Stefan Lee “ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks” In Proc. Int. Conf. Neural Information Processing Systems, 2019, pp. 13–23
  • [118] Peiling Lu et al. “MuseCoco: Generating Symbolic Music from Text” In arXiv preprint arXiv:2306.00110, 2023
  • [119] Søren Tjagvad Madsen and Gerhard Widmer “A complexity-based approach to melody track identification in MIDI files” In Proc. Int. Workshop on Artificial Intelligence and Music, 2007
  • [120] Sageev Oore “This time with feeling: Learning expressive musical performance” In Neural Computing and Applications 32 Springer, 2018, pp. 955–967
  • [121] Renato Panda “Multi-modal music emotion recognition: A new dataset, methodology and comparative analysis” In Proc. Int. Symposium on Computer Music Multidisciplinary Research, 2013
  • [122] Renato Panda, Ricardo Malheiro and Rui Pedro Paiva “Musical texture and expressivity features for music emotion recognition” In Proc. Int. Soc. Music Information Retrieval Conf., 2018
  • [123] Ashis Pati, Alexander Lerch and Gaëtan Hadjeres “Learning to traverse latent spaces for musical score inpainting” In Proc. Int. Soc. Music Information Retrieval Conf., 2019, pp. 343–351
  • [124] Christine Payne “MuseNet”, 2019 URL: https://openai.com/research/musenet
  • [125] Alec Radford “Language Models are Unsupervised Multitask Learners”, 2019 URL: https://openai.com/blog/better-language-models/
  • [126] C. Raffel and D. P. W. Ellis “Extracting ground-truth information from MIDI files: A MIDIfesto” In Proc. Int. Soc. Music Information Retrieval Conf., 2016, pp. 796–802
  • [127] James Russell “A Circumplex Model of Affect” In Journal of Personality and Social Psychology 39, 1980, pp. 1161–1178
  • [128] Justin Salamon and Emilia Gomez “Melody Extraction From Polyphonic Music Signals Using Pitch Contour Characteristics” In IEEE Transactions on Audio, Speech, and Language Processing 20.6, 2012, pp. 1759–1770 DOI: 10.1109/TASL.2012.2188515
  • [129] Yi-Jen Shih et al. “Theme Transformer: Symbolic Music Generation with Theme-Conditioned Transformer” In IEEE Transactions on Multimedia, 2022
  • [130] F. Simonetta, C. E. Cancino-Chacón, S. Ntalampiras and G. Widmer “A convolutional approach to melody line identification in symbolic scores” In Proc. Int. Soc. Music Information Retrieval Conf., 2019, pp. 924–931
  • [131] Chen Sun “VideoBERT: A Joint Model for Video and Language Representation Learning” In Proc. IEEE/CVF International Conference on Computer Vision, 2019, pp. 7464–7473
  • [132] Richard Sutcliffe et al. “Searching for musical features using natural language queries: The C@merata evaluations at MediaEval” In Language Resources and Evaluation 53.1 Springer-Verlag, 2019, pp. 87–140
  • [133] T. Tsai and K. Ji “Composer style classification of piano sheet music images using language model pretraining” In Proc. Int. Soc. Music Information Retrieval Conf., 2020, pp. 176–183
  • [134] Ashish Vaswani “Attention is all you need” In Proc. Int. Conf. Neural Information Processing Systems, 2017, pp. 5998–6008
  • [135] Elliot Waite “Project Magenta: Generating long-term structure in songs and stories” In Google Brain Blog, 2016
  • [136] Ziyu Wang et al. “POP909: A pop-song dataset for music arrangement generation” In Proc. Int. Soc. Music Information Retrieval Conf., 2020, pp. 38–45
  • [137] Gerhard Widmer “The synergy of music theory and AI: Learning multi-Level expressive interpretation” In Proc. AAAI, 1994
  • [138] Thomas Wolf “Transformers: State-of-the-art natural language processing” In Proc. Conf. Empirical Methods in Natural Language Processing: System Demonstrations, 2020, pp. 38–45
  • [139] Ho-Hsiang Wu “Multi-task self-supervised pre-training for music classification” In Proc. Int. Conf. Acoustics, Speech and Signal Processing, 2021, pp. 556–560
  • [140] Shih-Lun Wu and Yi-Hsuan Yang “The Jazz Transformer on the front line: Exploring the shortcomings of AI-composed music through quantitative measures” In Proc. Int. Soc. Music Information Retrieval Conf., 2020, pp. 142–149
  • [141] S.-L. Wu and Y.-H. Yang “MuseMorphose: Full-song and fine-grained piano music style transfer with just one Transformer VAE” In IEEE/ACM Transactions on Audio, Speech, and Language Processing 31, 2023, pp. 1953–1967
  • [142] Y.-H. Yang and H. H. Chen “Music Emotion Recognition” Boca Raton, Florida: CRC Press, 2011
  • [143] Zhilin Yang “XLNet: Generalized autoregressive pretraining for language understanding” In Proc. Int. Conf. Neural Information Processing Systems, 2019, pp. 5753–5763
  • [144] Mingliang Zeng et al. “MusicBERT: Symbolic music understanding with large-scale pre-Training” In Findings of the Association for Computational Linguistics, 2021, pp. 791–800