Audio Word Toolbox

Music and Audio Computing Lab at Academia Sinica


Introduction

The Audio Word Toolbox (AWtoolbox) is open-source software (released under GPLv3) designed to give users a friendly graphical UI for extracting audio word representations from audio waveforms. If you are interested in learning how AWtoolbox works in practice, this video contains a simple example workflow. You can also check out the paper here. Currently, AWtoolbox is only available for the Windows platform.

Quick Start

This quick start is for 64-bit Windows.

  1. Download AWtoolbox from the BitBucket repository
  2. Download MATLAB Compiler Runtime 8.1 for 64-bit Windows from here
  3. Install MATLAB Compiler Runtime
  4. Run the pre-compiled executable at .\release\audio_word_toolbox.exe to start AWtoolbox

For more details about AWtoolbox, please refer to the user guide.

Credit

If you use AWtoolbox, please cite the following paper:

Chin-Chia Michael Yeh, Ping-Keng Jao, and Yi-Hsuan Yang. AWtoolbox: Characterizing Audio Information Using Audio Words. In ACM Multimedia, 2014. http://mac.citi.sinica.edu.tw/awtoolbox.

or

@inproceedings{AWtoolbox,
  author = {Chin-Chia Michael Yeh and Ping-Keng Jao and Yi-Hsuan Yang},
  title = {AWtoolbox: Characterizing Audio Information Using Audio Words},
  booktitle = {ACM Multimedia},
  year = {2014},
  note = {\url{http://mac.citi.sinica.edu.tw/awtoolbox}}
}

Contributor

Please send comments and suggestions to Yi-Hsuan Yang.


Preliminary Benchmark Results

We have tested five different audio word representation setups on six data sets. The description, setup file, and dictionary for each of the five AW setups are given below, followed by a brief code sketch of the two encoding schemes.

MFCC+VQ+kmeans: The most classical way to generate an audio word representation. MFCC is extracted at the input layer, and Vector Quantization (VQ) is then applied to the extracted MFCC, using a codebook learned with k-means. Setup file: download. Dictionary: download.
SPEC+SC+RE: Spectrum is extracted at the input layer, and Sparse Coding (SC) is then applied to the extracted spectra. The dictionary in this setup is constructed by randomly extracting spectra from the dictionary training data. Setup file: download. Dictionary: download.
SPEC+SC+ODL: Similar to SPEC+SC+RE, but the dictionary is adapted to the dictionary training data with the Online Dictionary Learning (ODL) algorithm. Setup file: download. Dictionary: download.
CEPS+SC+RE: Similar to SPEC+SC+RE, but cepstrum is extracted at the input layer instead of spectrum. Setup file: download. Dictionary: download.
CEPS+SC+ODL: Similar to SPEC+SC+ODL, but cepstrum is extracted at the input layer instead of spectrum. Setup file: download. Dictionary: download.
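
To make the two encoding schemes concrete, the following is a minimal Python sketch of a VQ-based and an SC-based bag-of-audio-words encoder. It is an illustration only, not AWtoolbox code (AWtoolbox itself is a MATLAB application); it assumes librosa and scikit-learn, and the file name, codebook size, and sparsity parameter are placeholders.

import numpy as np
import librosa
from sklearn.cluster import KMeans
from sklearn.decomposition import SparseCoder

def vq_audio_words(y, sr, n_words=64):
    """MFCC+VQ: assign each frame to its nearest codeword, then pool a histogram."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20).T           # frames x 20
    codebook = KMeans(n_clusters=n_words, n_init=4).fit(mfcc)      # for brevity, the codebook is learned on the same clip
    counts = np.bincount(codebook.predict(mfcc), minlength=n_words)
    return counts / counts.sum()                                   # normalized codeword histogram

def sc_audio_words(y, sr, dictionary):
    """SPEC+SC: encode each spectral frame as a sparse combination of atoms, then max-pool."""
    spec = np.abs(librosa.stft(y, n_fft=2048)).T                   # frames x 1025 magnitude spectra
    coder = SparseCoder(dictionary=dictionary,
                        transform_algorithm='lasso_lars',
                        transform_alpha=1.0, positive_code=True)
    codes = coder.transform(spec)                                  # frames x n_atoms sparse codes
    return codes.max(axis=0)                                       # clip-level pooling

y, sr = librosa.load('example.wav', sr=22050, mono=True)           # placeholder input clip
# RE-style dictionary: randomly sample (and normalize) frames from training spectra.
train_spec = np.abs(librosa.stft(y, n_fft=2048)).T
atoms = train_spec[np.random.choice(len(train_spec), 256, replace=False)]
atoms = atoms / np.linalg.norm(atoms, axis=1, keepdims=True)
print(vq_audio_words(y, sr).shape, sc_audio_words(y, sr, atoms).shape)

An ODL-style dictionary would instead be fitted to the training spectra (for example with scikit-learn's MiniBatchDictionaryLearning) rather than sampled from them.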

CAL10k Genre

CAL10k is a data set constructed for testing music auto-tagging systems [1]. In particular, genre-related tags (e.g. jazz, rock, and death metal) are used in this set of experiments. For more details about this data set, please see here.

Representation AUC MAP P10 PR
MFCC+VQ+kmeans 0.803 0.144 0.191 0.159
SPEC+SC+RE 0.864 0.208 0.258 0.215
SPEC+SC+ODL 0.869 0.214 0.257 0.222
CEPS+SC+RE 0.857 0.199 0.248 0.204
CEPS+SC+ODL 0.866 0.201 0.248 0.212
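
For reference, the retrieval metrics in the tagging tables (AUC, listed as AROC in the CAL500 table, is the per-tag area under the ROC curve; MAP is the mean of the per-tag average precision; P10 is the precision among the ten highest-ranked clips per tag) can be computed along the lines of the sketch below. This is an illustrative recipe assuming scikit-learn, not the evaluation code behind the reported numbers, and it leaves the PR column aside.

import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def tagging_metrics(scores, labels, k=10):
    """Per-tag retrieval metrics: for each tag, rank all clips by predicted score.

    scores: (n_clips, n_tags) real-valued tag affinities
    labels: (n_clips, n_tags) binary ground-truth annotations
    """
    aucs, aps, precs = [], [], []
    for t in range(labels.shape[1]):
        y_true, y_score = labels[:, t], scores[:, t]
        if y_true.min() == y_true.max():            # skip tags with no positive or no negative clips
            continue
        aucs.append(roc_auc_score(y_true, y_score))
        aps.append(average_precision_score(y_true, y_score))
        top_k = np.argsort(-y_score)[:k]
        precs.append(y_true[top_k].mean())
    return np.mean(aucs), np.mean(aps), np.mean(precs)

# Toy usage with random scores over 100 clips and 5 tags.
rng = np.random.default_rng(0)
auc, map_, p10 = tagging_metrics(rng.random((100, 5)), (rng.random((100, 5)) > 0.8).astype(int))
print(auc, map_, p10)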

CAL10k Acoustic

CAL10k is a data set constructed for testing music auto-tagging systems [1]. In particular, acoustic-related tags (e.g. male vocal, guitar solo, and percussion) are used in this set of experiments. For more details about this data set, please see here.

Representation AUC MAP P10 PR
MFCC+VQ+kmeans 0.774 0.133 0.191 0.152
SPEC+SC+RE 0.833 0.181 0.232 0.196
SPEC+SC+ODL 0.838 0.184 0.236 0.199
CEPS+SC+RE 0.830 0.175 0.231 0.193
CEPS+SC+ODL 0.841 0.181 0.233 0.194

CAL500

CAL500 is a data set constructed for testing music auto-tagging systems [2]. The annotations cover emotion, genre, instrument, acoustic, usage, and vocal tags. In this experiment, only 97 of the original 174 tags are used; the 97 tags are selected following [3].

Representation F AROC MAP P10
MFCC+VQ+kmeans 0.206 0.696 0.439 0.456
SPEC+SC+RE 0.224 0.733 0.477 0.491
SPEC+SC+ODL 0.228 0.740 0.482 0.502
CEPS+SC+RE 0.219 0.730 0.474 0.496
CEPS+SC+ODL 0.221 0.731 0.477 0.504

Freesound

Freesound is a data set constructed for testing sound clip classification systems [4]. The clips are collected from Freesound and labeled with one of the following five categories: sound effect, soundscape, speech, instrument sample, and complex music fragment.

Representation F ACC
MFCC+VQ+kmeans 0.440 0.473
SPEC+SC+RE 0.540 0.563
SPEC+SC+ODL 0.537 0.563
CEPS+SC+RE 0.540 0.579
CEPS+SC+ODL 0.523 0.564
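
For the classification experiments (Freesound here and MTG instrument below), F and ACC denote a class-averaged F-score and the overall accuracy; a minimal scikit-learn sketch with made-up predictions, assuming F is macro-averaged, is shown below.

import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Made-up predictions for a 5-way clip classification task (the Freesound taxonomy has five categories).
y_true = np.array([0, 1, 2, 3, 4, 0, 1, 2, 3, 4])
y_pred = np.array([0, 1, 2, 3, 0, 0, 2, 2, 3, 4])

print("F  :", f1_score(y_true, y_pred, average='macro'))   # class-averaged F-score (assumed definition)
print("ACC:", accuracy_score(y_true, y_pred))               # fraction of correctly labeled clips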

MER31k

MER31k is a data set constructed for testing music emotion recognition systems [5]. Specifically, the problem is formulated as a tagging problem with 190 emotion tags (e.g. angry, aggressive, and exciting).

Representation AUC MAP P10 PR
MFCC+VQ+kmeans 0.770 0.101 0.214 0.146
SPEC+SC+RE 0.793 0.122 0.238 0.168
SPEC+SC+ODL 0.795 0.127 0.252 0.174
CEPS+SC+RE 0.778 0.107 0.221 0.154
CEPS+SC+ODL 0.784 0.106 0.211 0.151

MTG instrument

MTG instrument is a data set constructed for testing predominant instrument recognition systems on multi-source music clips [6]. The instrument labels are: cello, clarinet, flute, acoustic guitar, electric guitar, Hammond organ, piano, saxophone, trumpet, violin, and singing voice.

Representation F ACC
MFCC+VQ+kmeans 0.487 0.491
SPEC+SC+RE 0.695 0.698
SPEC+SC+ODL 0.713 0.716
CEPS+SC+RE 0.588 0.591
CEPS+SC+ODL 0.580 0.584


Acknowledgments

The authors would like to thank Frederic Font Corbera and Ferdinand Fuhrmann from the Music Technology Group at Universitat Pompeu Fabra for sharing Freesound and MTG instrument data sets.


References

[1] D. Tingle, Y. E. Kim, and D. Turnbull. Exploring automatic music annotation with "acoustically objective" tags. In MIR, 2010.
[2] D. Turnbull, L. Barrington, D. Torres, and G. Lanckriet. Towards musical query-by-semantic-description using the CAL500 data set. In ACM SIGIR, 2007.
[3] C.-C. M. Yeh, J.-C. Wang, Y.-H. Yang, and H.-M. Wang. Improving music auto-tagging by intra-song instance bagging. In ICASSP, 2014.
[4] F. Font, J. Serra, and X. Serra. Audio clip classification using social tags and the effect of tag expansion. In AES 53rd International Conference on Semantic Audio, 2014.
[5] Y.-H. Yang and J.-Y. Liu. Quantitative study of music listening behavior in a social and affective context. IEEE TMM, 2013.
[6] F. Fuhrmann. Automatic musical instrument recognition from polyphonic music audio signals. Ph.D. thesis, Universitat Pompeu Fabra, 2012.