Music Emotion Classification: A Regression Approach

Y.-H. Yang, Y.-C. Lin, Y.-F. Su, and H.-H. Chen, "Music emotion classification: A regression approach,"
in Proc. IEEE Int. Conf. Multimedia and Expo. 2007 (ICME'07).

Typical music emotion classification (MEC) approaches categorize emotions and apply pattern recognition methods to train a classifier. However, categorized emotions are too ambiguous for efficient music retrieval. In this paper, we model emotions as continuous variables composed of arousal and valence values (AV values), and formulate MEC as a regression problem. The multiple linear regression, support vector regression, and AdaBoost.RT are adopted to evaluate the prediction accuracy. Since the regression approach is inherently continuous, it is free of the ambiguity problem existing in its categorical counterparts.


1) Multiple linear regression (MLR)
2) Support vector regression (SVR)
      A tutorial on support vector regression.
      LIBSVM: a library for support vector machines.
      [parameters: arousal -c 0.5 -g 0.0078125 -p 0.125]
      [parameters: valence -c 4 -g 0.0078125 -p 0.25]
3) AdaBoost.RT
      AdaBoost.RT: a boosting algorithm for regression problems.
        threshold p for demarcating correct and incorrect predictions: 0.1
        number of iterations: 30
        threshold c to control the tree pruning process: 5
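
The SVR parameters listed above use LIBSVM's command-line option names (-c cost, -g kernel gamma, -p epsilon of the loss function). As a minimal sketch, the following assembles an svm-train call from them. The -s 3 option (epsilon-SVR) is an assumption inferred from the use of -p, the kernel is left at LIBSVM's default, and the file names are hypothetical.

```python
# Parameter values copied from the list above; everything else is assumed.
SVR_PARAMS = {
    "arousal": {"c": 0.5, "g": 0.0078125, "p": 0.125},
    "valence": {"c": 4, "g": 0.0078125, "p": 0.25},
}


def svm_train_args(target, train_file, model_file):
    """Build an argument list for LIBSVM's svm-train command-line tool."""
    p = SVR_PARAMS[target]
    return [
        "svm-train",
        "-s", "3",           # epsilon-SVR (assumption: -p only applies to it)
        "-c", str(p["c"]),   # cost
        "-g", str(p["g"]),   # kernel gamma
        "-p", str(p["p"]),   # epsilon of the loss function
        train_file,          # hypothetical training file name
        model_file,          # hypothetical output model name
    ]


if __name__ == "__main__":
    print(" ".join(svm_train_args("arousal", "arousal.svm", "arousal.model")))
```

The resulting list can be passed to subprocess.run if svm-train is installed and on the PATH.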


1) Merits of the regression approach compared to categorical approaches:
        Inherently continuous, free of the ambiguity problem commonly existing in its categorical counterparts.
        Also reduces the subjectivity problem, since it offers more freedom in describing a song.
        Allows more efficient music retrieval and management.
        One can also easily convert the regression results to binary or quaternary ones if a categorical taxonomy is required.

2) Merits of the regression approach compared to existing AV values computation approaches:
        Has a sound theoretical foundation and allows quantitative performance analysis.
        Learns the prediction rules according to the ground truth and can be trained to reach optimal performance.
        Does not assume any geometric relationship between arousal and valence.

3) The R^2 statistic reaches 60% for arousal and 19% for valence when estimated by SVR on the Psy15 dataset.
    Classification accuracy reaches 84% for arousal and 68% for valence.
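
For reference, the two kinds of figures quoted above can be computed as in the sketch below. The R^2 statistic is the standard coefficient of determination. The page does not say how the regression outputs were turned into classes, so thresholding at zero and the quadrant numbering (valence on the horizontal axis, arousal on the vertical axis) are assumptions made for illustration.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def binary_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted sign matches the true sign.

    Splitting at zero is an assumption, not taken from the paper.
    """
    hits = sum((t >= 0) == (p >= 0) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


def quadrant(arousal, valence):
    """Map an (arousal, valence) pair to one of four classes (assumed numbering)."""
    if arousal >= 0:
        return 1 if valence >= 0 else 2
    return 4 if valence >= 0 else 3


if __name__ == "__main__":
    truth = [1.0, -1.0, 0.5, -0.5]
    pred = [0.5, -0.5, 0.5, 0.5]
    print(r_squared(truth, pred), binary_accuracy(truth, pred))
    print(quadrant(0.5, -0.2))
```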

Data Sets

Music clips are trimmed to 25 seconds and converted to a uniform format (22,050 Hz, 16 bits, mono channel PCM WAV). The music database contains 195 popular songs from Western, Chinese, and Japanese albums. Subjects (mostly college students) are asked to listen to a subset of the music dataset and to choose two values, each ranging from -1.0 to 1.0 in 11 levels, to indicate their feeling about the AV values of each music sample. The ground truth is set as the mean of the AV values of all subjects tested. On average, more than ten pairs of AV values are collected from the subjective test for each music sample.
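
The ground-truth computation described above can be sketched as follows. Equal spacing of the 11 rating levels (steps of 0.2), the (valence, arousal) ordering of each pair, and the use of the sample standard deviation are assumptions; the page only states the range, the number of levels, and that the mean is used.

```python
from statistics import mean, stdev

# 11 levels from -1.0 to 1.0; equal spacing is an assumption.
LEVELS = [round(-1.0 + 0.2 * i, 1) for i in range(11)]


def ground_truth(ratings):
    """Summarize one clip's annotations.

    ratings: list of (valence, arousal) pairs, one per subject (at least two).
    Returns the mean and the sample standard deviation for each dimension.
    """
    valences = [v for v, _ in ratings]
    arousals = [a for _, a in ratings]
    return {
        "v": mean(valences),
        "a": mean(arousals),
        "sv": stdev(valences),
        "sa": stdev(arousals),
    }


if __name__ == "__main__":
    # Made-up ratings from three hypothetical subjects.
    print(LEVELS)
    print(ground_truth([(0.2, 0.8), (0.4, 0.6), (0.6, 1.0)]))
```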

list of the 114 features

193 records, 114 features; format: [v a sv sa f1 f2 ... f114]
    v: valence (-5 to 5)
    a: arousal (-5 to 5)
    sv: standard deviation of valence (from subjective test)
    sa: standard deviation of arousal (from subjective test)
    f: features (without normalization)
      f1-f28(28): DWCH features
      f29-f58(30): Marsyas features
      f59-f102(44): PsySound features
      f103-f114(12): Spectral contrast features
      f64,f71-f84(15): PsySound-15 features
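
A minimal sketch of reading one record in the layout above and picking out a feature group. It assumes whitespace-separated values with optional surrounding brackets; the actual file layout may differ. Group names are illustrative, and the index ranges are copied from the list above.

```python
# 1-based feature indices per group, as listed above.
FEATURE_GROUPS = {
    "DWCH": list(range(1, 29)),
    "Marsyas": list(range(29, 59)),
    "PsySound": list(range(59, 103)),
    "SpectralContrast": list(range(103, 115)),
    "PsySound15": [64] + list(range(71, 85)),
}


def parse_record(line):
    """Parse 'v a sv sa f1 ... f114' into a dict (whitespace-separated, assumed)."""
    values = [float(x) for x in line.strip().strip("[]").split()]
    if len(values) != 4 + 114:
        raise ValueError(f"expected 118 values, got {len(values)}")
    v, a, sv, sa = values[:4]
    return {"v": v, "a": a, "sv": sv, "sa": sa, "features": values[4:]}


def select(features, group):
    """Return the sub-vector of a named feature group."""
    return [features[i - 1] for i in FEATURE_GROUPS[group]]


if __name__ == "__main__":
    # Synthetic record: feature k simply holds the value k.
    line = "1.5 -2.0 0.5 0.7 " + " ".join(str(k) for k in range(1, 115))
    record = parse_record(line)
    print(select(record["features"], "PsySound15"))
```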

Arousal in SVM format: [a 1:f1 2:f2 ... 114:f114]

Valence in SVM format: [v 1:f1 2:f2 ... 114:f114]
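
The "label index:value" layout above can be written and read back with a few lines of code. This is a sketch under the assumption that the files are plain text with one sample per line; features omitted from a line are treated as zero when reading.

```python
def to_svm_line(label, features):
    """Format one sample as 'label 1:f1 2:f2 ...' (indices start at 1)."""
    pairs = " ".join(f"{i}:{x!r}" for i, x in enumerate(features, start=1))
    return f"{label!r} {pairs}"


def parse_svm_line(line, n_features=114):
    """Read one 'label index:value ...' line back into (label, feature list)."""
    tokens = line.split()
    label = float(tokens[0])
    features = [0.0] * n_features  # indices not present stay at zero
    for token in tokens[1:]:
        index, value = token.split(":")
        features[int(index) - 1] = float(value)
    return label, features


if __name__ == "__main__":
    # Made-up sample with two features, for illustration only.
    text = to_svm_line(-2.0, [0.5, 1.25])
    print(text)
    print(parse_svm_line(text, n_features=2))
```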

Any feedback or comments are welcome!