On the use of speech parameter contours for emotion recognition
The School of Electrical Engineering and Telecommunications, The University of New South Wales, Sydney, New South Wales 2052, Australia
EURASIP Journal on Audio, Speech, and Music Processing 2013, 2013:19. doi:10.1186/1687-4722-2013-19. Published: 10 July 2013
Many features have been proposed for speech-based emotion recognition, and a majority of them are frame based or statistics estimated from frame-based features. Temporal information is typically modelled on a per-utterance basis, with either functionals of frame-based features or a suitable back-end. This paper investigates an approach that combines both, using temporal contours of parameters extracted from a three-component model of speech production as features in an automatic emotion recognition system with a hidden Markov model (HMM)-based back-end. Consequently, the proposed system models information on a segment-by-segment scale, which is larger than a frame-based scale but smaller than utterance-level modelling. Specifically, linear approximations to temporal contours of formant frequencies, glottal parameters and pitch are used to model short-term temporal information over individual segments of voiced speech. This is followed by the use of HMMs to model longer-term temporal information contained in sequences of voiced segments. Listening tests were conducted to validate the use of linear approximations in this context. Automatic emotion classification experiments were carried out on the Linguistic Data Consortium emotional prosody speech and transcripts corpus and the FAU Aibo corpus to validate the proposed approach.
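The abstract's core idea of a linear approximation to a parameter contour over a voiced segment can be illustrated with a short sketch. This is not the paper's implementation: the function name, the use of NumPy's least-squares polynomial fit, and the synthetic rising pitch contour are all assumptions made for illustration; the slope and intercept simply stand in for the kind of segment-level features such an approximation would yield.

```python
import numpy as np

def linear_contour_features(contour):
    """Fit a first-order (linear) approximation to a frame-level parameter
    contour over one voiced segment, and return the slope and intercept
    as segment-level features. `contour` is a 1-D array of per-frame
    values (e.g. pitch in Hz, or a formant frequency)."""
    frames = np.arange(len(contour))
    # Least-squares line fit: contour[n] ~ slope * n + intercept
    slope, intercept = np.polyfit(frames, contour, deg=1)
    return slope, intercept

# Hypothetical example: a steadily rising pitch contour over 10 voiced frames
pitch = 120.0 + 2.5 * np.arange(10)
slope, intercept = linear_contour_features(pitch)
```

A sequence of such (slope, intercept) pairs, one per voiced segment, is the kind of observation sequence an HMM back-end could then model over the longer term.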