A Regression Approach to Music Emotion Recognition

Y.-H. Yang, Y.-C. Lin, Y.-F. Su, and H.-H. Chen, "A Regression Approach to Music Emotion Recognition,"
IEEE Trans. Audio, Speech and Language Processing (TASLP), vol. 16, no. 2, pp. 448-457, Feb. 2008.

[full text]
(For a camera-ready version, go to IEEE Xplore.)

Content-based retrieval has emerged in the face of content explosion as a promising approach to information access. In this paper, we focus on the challenging issue of recognizing the emotion content of music signals, or music emotion recognition (MER). Specifically, we formulate MER as a regression problem to predict the arousal and valence values (AV values) of each music sample directly. Associated with the AV values, each music sample becomes a point in the arousal-valence plane, so users can efficiently retrieve a music sample by specifying a desired point in the emotion plane. Because no categorical taxonomy is used, the regression approach is free of the ambiguity inherent to conventional categorical approaches. To improve the performance, we apply principal component analysis to reduce the correlation between arousal and valence, and RReliefF to select important features. An extensive performance study is conducted to evaluate the accuracy of the regression approach for predicting AV values. The best performance, evaluated in terms of the R² statistic, reaches 58.3% for arousal and 28.1% for valence when a support vector machine is employed as the regressor. We also apply the regression approach to detect the emotion variation within a music selection and find the prediction accuracy superior to that of existing works. A group-wise MER scheme is also developed to address the subjectivity issue of emotion perception.
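As a rough illustration of how an R² score for predicted arousal or valence values can be computed, here is a minimal sketch. It assumes the standard coefficient of determination (one minus the ratio of residual to total sum of squares); the function name and the sample values are hypothetical and are not taken from the paper or its data.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the regressor's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true values are identical")
    return 1.0 - ss_res / ss_tot


# Hypothetical arousal annotations and predictions, for illustration only.
arousal_true = [0.8, -0.4, 0.1, 0.5]
arousal_pred = [0.6, -0.2, 0.0, 0.7]
score = r_squared(arousal_true, arousal_pred)
```

Under this definition, a regressor that always predicts the mean of the annotations scores 0, and a perfect regressor scores 1.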

Methods

1) Multiple linear regression (MLR)
2) Support vector regression (SVR)
      A tutorial on support vector regression.
      LIBSVM: a library for support vector machines.
      [parameters: arousal -c 0.5 -g 0.0078125 -p 0.125]
      [parameters: valence -c 4 -g 0.0078125 -p 0.25]
3) AdaBoost.RT
      AdaBoost.RT: a boosting algorithm for regression problems.
      [parameters:
        threshold p for demarcating correct and incorrect predictions: 0.1
        number of iterations: 30
        threshold c to control the tree pruning process: 5]
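
The SVR parameters listed under 2) use LIBSVM's command-line flags: -c is the cost, -g is the kernel gamma (0.0078125 = 2^-7), and -p is the epsilon of the epsilon-SVR loss function. The sketch below assembles a possible svm-train invocation from them. The choices -s 3 (epsilon-SVR) and -t 2 (RBF kernel) are assumptions, since the page does not state them, and the file names are hypothetical.

```python
# Parameters as listed above for each emotion dimension.
SVR_PARAMS = {
    "arousal": {"c": 0.5, "g": 0.0078125, "p": 0.125},
    "valence": {"c": 4, "g": 0.0078125, "p": 0.25},
}


def svm_train_args(dimension, train_file, model_file):
    """Build the argument list for LIBSVM's svm-train tool."""
    params = SVR_PARAMS[dimension]
    return [
        "svm-train",
        "-s", "3",  # assumption: epsilon-SVR (not stated on this page)
        "-t", "2",  # assumption: RBF kernel, which is LIBSVM's default
        "-c", str(params["c"]),
        "-g", str(params["g"]),
        "-p", str(params["p"]),
        train_file,
        model_file,
    ]


# Hypothetical file names, for illustration only.
command = svm_train_args("arousal", "arousal.svm", "arousal.model")
print(" ".join(command))
```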

Data Sets

Music clips are trimmed to 25 seconds and converted to a uniform format (22,050 Hz, 16 bits, mono-channel PCM WAV). The music database contains 195 popular songs from Western, Chinese, and Japanese albums. Subjects (mostly college students) are asked to listen to a subset of the music dataset and to choose two values, each ranging from -1.0 to 1.0 in 11 levels, to indicate their feeling about the AV values of each music sample. The ground truth is set as the mean of the AV values of all subjects tested. On average, more than ten pairs of AV values are collected from the subjective test for each music sample.
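
The annotation procedure above can be sketched as follows: each subject gives one (arousal, valence) pair per clip, and the ground truth is the mean over subjects. This is only an illustration. The even 0.2 spacing of the 11 levels and the use of the sample standard deviation are assumptions, and the ratings shown are made up.

```python
from statistics import mean, stdev

# Assumption: the 11 rating levels are evenly spaced from -1.0 to 1.0.
LEVELS = [round(-1.0 + 0.2 * i, 1) for i in range(11)]


def ground_truth(ratings):
    """Aggregate (arousal, valence) pairs given by different subjects."""
    arousals = [a for a, _ in ratings]
    valences = [v for _, v in ratings]
    return {
        "a": mean(arousals),
        "v": mean(valences),
        # Assumption: sample standard deviation (n - 1 in the denominator).
        "sa": stdev(arousals),
        "sv": stdev(valences),
    }


# Hypothetical ratings of one clip by three subjects.
clip_ratings = [(0.2, 0.4), (0.4, 0.6), (0.6, 0.8)]
truth = ground_truth(clip_ratings)
```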

  • source audio files available upon request.

  • feat_list.txt
    list of the 114 features

    193 records, 114 features; format: [v a sv sa f1 f2 ... f114]
        v: valence (-5 to 5)
        a: arousal (-5 to 5)
        sv: standard deviation of valence (from subjective test)
        sa: standard deviation of arousal (from subjective test)
        f: features (without normalization)
          f1-f28(28): DWCH features
          f29-f58(30): Marsyas features
          f59-f102(44): PsySound features
          f103-f114(12): Spectral contrast features
          f64,f71-f84(15): PsySound-15 features
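
A minimal sketch of reading one record in the layout above and picking out a feature group by its index range. It assumes the fields are separated by whitespace and that the brackets in the format description are notation only; the function and dictionary names are hypothetical.

```python
NUM_FEATURES = 114

# 1-based feature indices of each group, as listed above.
FEATURE_GROUPS = {
    "DWCH": list(range(1, 29)),
    "Marsyas": list(range(29, 59)),
    "PsySound": list(range(59, 103)),
    "SpectralContrast": list(range(103, 115)),
    "PsySound15": [64, *range(71, 85)],
}


def parse_record(line):
    """Split one record into its labels, deviations, and feature vector."""
    fields = [float(x) for x in line.split()]
    if len(fields) != 4 + NUM_FEATURES:
        raise ValueError(f"expected {4 + NUM_FEATURES} fields, got {len(fields)}")
    v, a, sv, sa = fields[:4]
    return {"v": v, "a": a, "sv": sv, "sa": sa, "features": fields[4:]}


def select_group(features, group):
    """Return the features of one group; f_k sits at zero-based index k - 1."""
    return [features[k - 1] for k in FEATURE_GROUPS[group]]
```

For example, `select_group(record["features"], "PsySound15")` returns the 15 values f64 and f71 through f84.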

    in SVM format: [a 1:f1 2:f2 ... 114:f114]

    in SVM format: [v 1:f1 2:f2 ... 114:f114]
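
The SVM format above is LIBSVM's sparse text format: a target value followed by index:value pairs with 1-based indices. The sketch below converts between a dense feature list and such a line. The helper names are hypothetical, and the short feature vector is only for illustration.

```python
def to_svm_line(label, features):
    """Write a label and a dense feature list as 'label 1:f1 2:f2 ...'."""
    pairs = " ".join(f"{i}:{x!r}" for i, x in enumerate(features, start=1))
    return f"{label!r} {pairs}"


def parse_svm_line(line):
    """Read a line back into (label, {index: value})."""
    label_text, *pairs = line.split()
    features = {}
    for pair in pairs:
        index, value = pair.split(":")
        features[int(index)] = float(value)
    return float(label_text), features


# Hypothetical arousal label with a three-feature vector.
example_line = to_svm_line(0.5, [1.0, 2.5, -3.0])
label, feats = parse_svm_line(example_line)
```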

    top 15 ranked features in the AV space by RReliefF

    Any feedback or comments are welcome!