A Medium-Scale Dataset for Music Emotion Recognition

Developed by Yi-Hsuan Yang, National Taiwan University.
Ref:  Ranking-based Emotion Recognition for Music Organization and Retrieval, accepted for publication, IEEE Trans. Audio, Speech, and Language Processing.


  • This dataset was developed while designing ranking-based methods to improve the dimensional approach to music emotion recognition [1]
  • From the dimensional perspective, emotions are points in a Cartesian coordinate system with, for example, valence and arousal as the dimensions [2]
  • Therefore, the annotations provided here are numerical values in [-1, 1], rather than class labels
  • Only valence (how positive or negative the perceived emotion is) is annotated


Music Collection

  • 1240 Chinese pop songs (list_filename)
  • Each song is represented by a 30-second segment starting at the 30th second of the song
  • 22,050 Hz sampling frequency, 16-bit precision, mono channel
  • Due to copyright issues, the audio files cannot be distributed. One may utilize the program provided by Dan Ellis to synthesize the audio signal from MFCC features.
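To make the segment specification above concrete, the following sketch computes which samples the excerpt covers and how large it is as raw PCM. The function names are illustrative only, and the sketch assumes the excerpt spans seconds 30 to 60 of each song.

```python
# Audio format taken from the dataset description.
SAMPLE_RATE_HZ = 22050
BITS_PER_SAMPLE = 16
CHANNELS = 1          # mono
START_SEC = 30        # assumption: the excerpt starts at t = 30 s
DURATION_SEC = 30


def segment_sample_range(sample_rate=SAMPLE_RATE_HZ,
                         start_sec=START_SEC,
                         duration_sec=DURATION_SEC):
    """Return (first, last) sample indices of the excerpt, last exclusive."""
    first = start_sec * sample_rate
    last = first + duration_sec * sample_rate
    return first, last


def segment_size_bytes(sample_rate=SAMPLE_RATE_HZ,
                       duration_sec=DURATION_SEC,
                       channels=CHANNELS,
                       bits_per_sample=BITS_PER_SAMPLE):
    """Size of the excerpt as uncompressed PCM, without any file header."""
    return duration_sec * sample_rate * channels * bits_per_sample // 8


if __name__ == "__main__":
    print(segment_sample_range())   # (661500, 1323000)
    print(segment_size_bytes())     # 1323000
```

Each excerpt therefore holds 661,500 samples, or roughly 1.3 MB of raw audio.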

Emotion Annotation

  • An online subjective test was conducted from Aug. 2008 to Nov. 2008 to collect emotion annotations
  • A total of 666 subjects participated in the subjective test, so each song was annotated by 4.3 subjects on average
  • Each subject was invited to annotate 8 randomly selected music pieces using both rating- and ranking-based measures
  • Rating-based measure
      ° A scroll bar with end points denoting 0 and 100
  • Ranking-based measure
      ° Music emotion tournament
      ° Subjects are asked to make pairwise comparisons of the emotions of songs
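The figures above can be checked with a short sketch. The first helper reproduces the average of 4.3 annotations per song from the stated counts. The second maps a scroll-bar rating onto [-1, 1]; this linear mapping is an assumption made for illustration, since the text does not say how the 0 to 100 ratings were converted. Both function names are hypothetical.

```python
def mean_annotations_per_song(n_subjects, songs_per_subject, n_songs):
    """Average number of annotations each song receives."""
    return n_subjects * songs_per_subject / n_songs


def rating_to_valence(rating, low=0.0, high=100.0):
    """Linearly rescale a scroll-bar rating in [low, high] to [-1, 1].

    Assumption: the midpoint of the scroll bar corresponds to neutral valence.
    """
    if not low <= rating <= high:
        raise ValueError("rating outside the scroll-bar range")
    midpoint = (low + high) / 2
    return (rating - midpoint) / (high - midpoint)


if __name__ == "__main__":
    # 666 subjects x 8 pieces each, spread over 1240 songs.
    print(round(mean_annotations_per_song(666, 8, 1240), 1))  # 4.3
    print(rating_to_valence(75))                              # 0.5
```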




Fig. 1.  Left: The music emotion tournament groups eight randomly chosen music pieces into seven tournaments. A bold line indicates the winner of each tournament.  Right: The resulting preference matrix (partial), with entry (i,j) painted black to indicate that piece i is ranked higher than piece j. The global ordering f>b>c=h>a=d=e=g can then be estimated by a greedy algorithm.
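As a rough illustration of how a global ordering can be recovered from the preference matrix, the sketch below applies one plausible greedy scheme: repeatedly take the pieces with the largest "wins minus losses" among those still remaining, and treat pieces with equal scores as tied. The text does not specify the exact greedy algorithm used, and the bracket pairings below are hypothetical, chosen only to be consistent with the ordering stated in the caption.

```python
# Hypothetical single-elimination bracket as (winner, loser) pairs.
# The actual pairings of Fig. 1 are not given in the text.
MATCHES = [
    ("b", "a"), ("c", "d"), ("f", "e"), ("h", "g"),  # round 1
    ("b", "c"), ("f", "h"),                          # round 2
    ("f", "b"),                                      # final
]
PIECES = sorted({piece for match in MATCHES for piece in match})


def preference_matrix(pieces, matches):
    """matrix[i][j] == 1 if piece i was directly ranked higher than piece j."""
    index = {piece: k for k, piece in enumerate(pieces)}
    matrix = [[0] * len(pieces) for _ in pieces]
    for winner, loser in matches:
        matrix[index[winner]][index[loser]] = 1
    return matrix


def greedy_order(pieces, matrix):
    """Return groups of tied pieces, from highest to lowest rank."""
    remaining = set(range(len(pieces)))
    groups = []
    while remaining:
        # Net preference of each piece over the pieces still in play.
        potential = {
            i: sum(matrix[i][j] - matrix[j][i] for j in remaining)
            for i in remaining
        }
        best = max(potential.values())
        tied = sorted(i for i in remaining if potential[i] == best)
        groups.append([pieces[i] for i in tied])
        remaining -= set(tied)
    return groups


def format_order(groups):
    """Render tie groups as a string such as 'f>b>c=h'."""
    return ">".join("=".join(group) for group in groups)


if __name__ == "__main__":
    matrix = preference_matrix(PIECES, MATCHES)
    print(format_order(greedy_order(PIECES, matrix)))  # f>b>c=h>a=d=e=g
```

With seven matches among eight pieces, most pairs are never compared directly, which is why the matrix is partial and why several pieces end up tied.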

Feature Extraction

  • Rhythm pattern extractor: Extracts the average tempo of the music and a 60-bin rhythm histogram that describes its general rhythm.
  • MIR toolbox 1.2: Extracts 3 sensory dissonance features (roughness, irregularity, inharmonicity), 2 pitch features (pitch salience and the centroid of the chromagram), and 3 tonal features (mode, harmonic change, key clarity). The mean and standard deviation are taken for temporal aggregation.
  • MA toolbox: Extracts Mel-frequency cepstral coefficients (MFCC), a representation of the short-term (e.g., 23 ms) power spectrum of an audio signal. The mean and standard deviation are taken to integrate the short-term features.
  • Marsyas 0.2: Extracts timbral features, including spectral flatness measures and spectral crest factors.
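The temporal aggregation mentioned for the MIR toolbox and MA toolbox features (taking the mean and standard deviation over short-term frames) can be sketched as follows. This is an illustration rather than the toolboxes' own code: the function name is hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which variant was applied.

```python
from statistics import mean, pstdev


def aggregate_frames(frames):
    """Collapse frame-level features into one song-level vector.

    frames: list of equal-length feature vectors, one per short-term frame.
    Returns the per-dimension means followed by the per-dimension
    standard deviations (population form, an assumption).
    """
    if not frames:
        raise ValueError("at least one frame is required")
    columns = list(zip(*frames))  # one tuple per feature dimension
    means = [mean(column) for column in columns]
    stds = [pstdev(column) for column in columns]
    return means + stds


if __name__ == "__main__":
    # Two toy frames with two coefficients each.
    frames = [[1.0, 2.0], [3.0, 6.0]]
    print(aggregate_frames(frames))  # [2.0, 4.0, 1.0, 2.0]
```

A song described by d coefficients per frame is thus summarized by a fixed-length vector of 2d values, regardless of how many frames it contains.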


  • You may dichotomize the numerical values to use this dataset for training an emotion classifier
  • Annotations of both arousal and valence are also available for a smaller dataset consisting of 60 English pop songs [3]
  • A demonstration of a user interface for emotion-based music retrieval is available on YouTube [4]


[1] Y.-H. Yang et al., “A regression approach to music emotion recognition,” IEEE Trans. Audio, Speech, and Language Processing, vol. 16, no. 2, pp. 448–457, 2008.

[2] J. A. Russell, “A circumplex model of affect,” Journal of Personality and Social Psychology, vol. 39, no. 6, pp. 1161–1178, 1980.

[3] Y.-H. Yang et al., “Music emotion recognition: The role of individuality,” Proc. ACM Int. Workshop on Human-centered Multimedia, pp. 13–21, 2007.

[4] Y.-H. Yang et al., “Mr. Emo: Music retrieval in the emotion plane,” Proc. ACM Int. Conf. Multimedia, pp. 1003–1004, 2008.


Any feedback or comments are welcome.

(last update: 2010/11/06)