A Regression Approach to Music Emotion Recognition
Y.-H. Yang, Y.-C. Lin, Y.-F. Su, and H.-H. Chen, "A Regression Approach to Music Emotion Recognition,"
IEEE Trans. Audio, Speech and Language Processing (TASLP), vol. 16, no. 2, pp. 448-457, Feb. 2008.
(For a camera-ready version, go to IEEE Xplore.)
Content-based retrieval has emerged in the face of
content explosion as a promising approach to information access.
In this paper, we focus on the challenging issue of recognizing the
emotion content of music signals, or music emotion recognition
(MER). Specifically, we formulate MER as a regression problem
to predict the arousal and valence values (AV values) of each
music sample directly. Associated with the AV values, each music
sample becomes a point in the arousal-valence plane, so the users
can efficiently retrieve the music sample by specifying a desired
point in the emotion plane. Because no categorical taxonomy is
used, the regression approach is free of the ambiguity inherent to
conventional categorical approaches. To improve the performance,
we apply principal component analysis to reduce the
correlation between arousal and valence, and RReliefF to select
important features. An extensive performance study is conducted
to evaluate the accuracy of the regression approach for predicting
AV values. The best performance, evaluated in terms of the R²
statistic, reaches 58.3% for arousal and 28.1% for valence by
employing a support vector machine as the regressor. We also apply
the regression approach to detect the emotion variation within a
music selection and find the prediction accuracy superior to that of
existing work. A group-wise MER scheme is also developed to
address the subjectivity issue of emotion perception.
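The R² figures quoted above can be read as a coefficient of determination. The sketch below assumes the common definition 1 - SS_res / SS_tot; the function name and the sample values are illustrative and are not taken from the paper or its dataset.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: error left after prediction.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical arousal annotations and predictions (not data from the paper).
arousal_true = [-0.6, -0.2, 0.0, 0.4, 0.8]
arousal_pred = [-0.4, -0.2, 0.1, 0.3, 0.6]
print(f"R^2 = {r_squared(arousal_true, arousal_pred):.3f}")
```

Under this definition a perfect predictor scores 1, and a predictor that always outputs the mean of the annotations scores 0.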
1) Multiple linear regression (MLR)
2) Support vector regression (SVR)
→ A tutorial on support vector regression.
→ LIBSVM: a library for support vector machines.
[parameters: arousal -c 0.5 -g 0.0078125 -p 0.125]
[parameters: valence -c 4 -g 0.0078125 -p 0.25]
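As a rough illustration, the parameter lines above can be turned into a LIBSVM svm-train command. The -c, -g and -p values are copied from this page; the SVM type (-s 3, epsilon-SVR), the kernel (-t 2, RBF) and the file names are assumptions made for this sketch, since the page does not state them.

```python
# -c, -g and -p values copied from the parameter lines on this page.
SVR_PARAMS = {
    "arousal": {"c": 0.5, "g": 0.0078125, "p": 0.125},
    "valence": {"c": 4, "g": 0.0078125, "p": 0.25},
}


def svm_train_args(target, train_file, model_file):
    """Build an argument list for LIBSVM's svm-train tool."""
    p = SVR_PARAMS[target]
    return [
        "svm-train",
        "-s", "3",          # assumption: epsilon-SVR
        "-t", "2",          # assumption: RBF kernel
        "-c", str(p["c"]),  # cost
        "-g", str(p["g"]),  # kernel gamma
        "-p", str(p["p"]),  # epsilon of the epsilon-SVR loss function
        train_file,
        model_file,
    ]


# File names are hypothetical.
print(" ".join(svm_train_args("arousal", "arousal.svm", "arousal.model")))
```

Note that the gamma value 0.0078125 equals 2^-7 and the arousal epsilon 0.125 equals 2^-3, which is consistent with a search over powers of two.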
3) AdaBoost.RT
→ AdaBoost.RT: a boosting algorithm for regression problems.
[parameters:
threshold φ for demarcating correct and incorrect predictions: 0.1
number of iterations: 30
threshold θ to control the tree pruning process: 5 ]
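To show what the demarcation threshold of 0.1 does, here is a minimal sketch of a single AdaBoost.RT weight update as the algorithm is commonly described: a prediction counts as correct when its absolute relative error is at most the threshold, and correctly predicted samples are down-weighted. The function name, the exponent `power` and the sample values are assumptions for illustration; this is not the authors' implementation.

```python
def adaboost_rt_round(weights, y_true, y_pred, phi=0.1, power=1):
    """One AdaBoost.RT weight update (sketch).

    phi   : threshold demarcating correct and incorrect predictions
    power : exponent applied to the error rate (assumed value)
    Relative error is undefined for a zero target, and the update
    degenerates when no prediction is incorrect; neither case is handled.
    """
    # Absolute relative error of each prediction.
    are = [abs((p - t) / t) for t, p in zip(y_true, y_pred)]
    correct = [e <= phi for e in are]
    # Weighted error rate: total weight of the incorrect predictions.
    error_rate = sum(w for w, ok in zip(weights, correct) if not ok)
    beta = error_rate ** power
    # Shrink the weights of correct predictions, then renormalize.
    updated = [w * beta if ok else w for w, ok in zip(weights, correct)]
    total = sum(updated)
    return [w / total for w in updated], error_rate, beta


# Hypothetical targets and predictions.
new_w, err, beta = adaboost_rt_round([1 / 3] * 3, [1.0, 2.0, 4.0], [1.05, 2.5, 4.2])
print(new_w, err, beta)
```

After the update, the sample that was predicted poorly carries most of the weight, so the next weak learner concentrates on it.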
Music clips are trimmed to 25 seconds and converted to a uniform format (22,050 Hz, 16 bits, and mono channel PCM WAV).
The music database contains 195 popular songs from Western, Chinese, and Japanese albums.
Subjects (mostly college students) are asked to listen to a subset of the music dataset and to choose two values,
each ranging from -1.0 to 1.0 in 11 levels,
to indicate their feeling about the AV values of the music sample.
The ground truth is set as the mean of the AV values of all subjects tested.
On average, more than ten pairs of AV values are collected from the subjective test for each music sample.
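The annotation procedure above can be sketched as follows. The 11 levels are assumed to be evenly spaced (step 0.2), the standard deviation is assumed to be the sample standard deviation, and the ratings are made up for illustration; none of these details are confirmed by this page.

```python
from statistics import fmean, stdev

# Assumption: the 11 rating levels are evenly spaced between -1.0 and 1.0.
LEVELS = [round(-1.0 + 0.2 * k, 1) for k in range(11)]


def summarize(ratings):
    """Aggregate (valence, arousal) pairs given by individual subjects.

    Returns the mean (used as ground truth) and the standard deviation
    of each dimension. Sample standard deviation is an assumption.
    """
    valence = [v for v, _ in ratings]
    arousal = [a for _, a in ratings]
    return {
        "v": fmean(valence),
        "a": fmean(arousal),
        "sv": stdev(valence),
        "sa": stdev(arousal),
    }


# Hypothetical ratings of one clip by three subjects.
print(LEVELS)
print(summarize([(0.2, 0.6), (0.4, 0.8), (0.0, 0.4)]))
```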
Source audio files are available upon request.
list of the 114 features
193 records, 114 features
format: [v a sv sa f1 f2 ... f114]
v: valence value
a: arousal value
sv: standard deviation of valence (from subjective test)
sa: standard deviation of arousal (from subjective test)
f: features (without normalization)
f1-f28(28): DWCH features
f29-f58(30): Marsyas features
f59-f102(44): PsySound features
f103-f114(12): Spectral contrast features
f64,f71-f84(15): PsySound-15 features
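The feature groups listed above map directly onto index ranges. The sketch below encodes them with 1-based indices, as on this page; the dictionary and helper names are illustrative.

```python
# 1-based feature indices, following the f1..f114 numbering on this page.
FEATURE_GROUPS = {
    "DWCH": range(1, 29),                 # f1-f28
    "Marsyas": range(29, 59),             # f29-f58
    "PsySound": range(59, 103),           # f59-f102
    "Spectral contrast": range(103, 115), # f103-f114
}
PSYSOUND_15 = [64, *range(71, 85)]        # f64, f71-f84


def select(features, indices):
    """Pick features by 1-based index from a 0-based list f1..f114."""
    return [features[i - 1] for i in indices]


# Print the size of each group as a sanity check against the listed counts.
for name, idx in FEATURE_GROUPS.items():
    print(name, len(idx))
print("PsySound-15", len(PSYSOUND_15))
```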
in SVM format: [a 1:f1 2:f2 ... 114:f114]
in SVM format: [v 1:f1 2:f2 ... 114:f114]
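The relation between the two layouts can be sketched as a small converter from a raw record to an SVM-format line. It assumes that fields are whitespace-separated and that the square brackets in the format descriptions are notation only; the function name and the sample record are hypothetical.

```python
N_FEATURES = 114


def to_svm_line(record, target):
    """Convert a 'v a sv sa f1 ... f114' record to 'label 1:f1 ... 114:f114'.

    target: "v" for valence or "a" for arousal.
    Assumes whitespace-separated fields without surrounding brackets.
    """
    fields = record.split()
    if len(fields) != 4 + N_FEATURES:
        raise ValueError(f"expected {4 + N_FEATURES} fields, got {len(fields)}")
    if target not in ("v", "a"):
        raise ValueError("target must be 'v' or 'a'")
    label = fields[0] if target == "v" else fields[1]
    features = fields[4:]  # skip v, a, sv, sa
    pairs = " ".join(f"{i}:{x}" for i, x in enumerate(features, start=1))
    return f"{label} {pairs}"


# Hypothetical record: v=0.2, a=0.6, sv=0.1, sa=0.3, then 114 dummy features.
record = "0.2 0.6 0.1 0.3 " + " ".join(str(k) for k in range(1, 115))
print(to_svm_line(record, "a")[:40])
```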
top 15 ranked features in the AV space by RReliefF
Any feedback or comments are welcome!