Reference:
Lemaev V.I., Lukashevich N.V. Automatic classification of emotions in speech: methods and data // Litera. – 2024. – № 4. – P. 159-173.
Abstract: The subject of this study is the data and methods used for automatic recognition of emotions in spoken language. This task has gained considerable popularity in recent years, primarily due to the emergence of large labeled datasets and the development of machine learning models. Classification of speech utterances is usually based on six archetypal emotions: anger, fear, surprise, joy, disgust, and sadness. Most modern classification methods rely on machine learning and transformer models trained with a self-supervised approach, in particular Wav2vec 2.0, HuBERT, and WavLM, which are examined in this paper. English and Russian emotional speech datasets are analyzed as data sources, in particular the datasets Dusha and RESD. The method is an experiment comparing the results of the Wav2vec 2.0, HuBERT, and WavLM models applied to the relatively recently collected Russian emotional speech datasets Dusha and RESD. The main purpose of the work is to assess the availability and applicability of existing data and approaches to speech emotion recognition for the Russian language, for which relatively little research has been conducted so far. The best result was demonstrated by the WavLM model on the Dusha dataset: 0.8782 by the Accuracy metric. The WavLM model also achieved the best result on the RESD dataset, 0.81 Accuracy, after preliminary training on Dusha. These high classification results, owing primarily to the quality and size of the collected Dusha dataset, indicate that this area holds promise for further development for the Russian language.
Keywords: WavLM, HuBERT, Wav2vec, transformers, machine learning, emotion recognition, speech recognition, natural language processing, Dusha, RESD
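To make the experimental setup concrete, below is a minimal sketch of how a pretrained WavLM checkpoint can be used as an utterance-level emotion classifier with the HuggingFace transformers library, in the spirit of the models compared in the article. The checkpoint name microsoft/wavlm-base, the six archetypal emotion labels taken from the abstract, and the silent dummy waveform are illustrative assumptions; the article's actual fine-tuning configuration on Dusha and RESD is not reproduced here.

    # Minimal sketch: WavLM as an utterance-level emotion classifier via
    # HuggingFace `transformers`. Checkpoint, label set, and dummy audio
    # are assumptions for illustration, not the article's exact setup.
    import torch
    from transformers import AutoFeatureExtractor, WavLMForSequenceClassification

    LABELS = ["anger", "fear", "surprise", "joy", "disgust", "sadness"]

    extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-base")
    model = WavLMForSequenceClassification.from_pretrained(
        "microsoft/wavlm-base",   # classification head is randomly
        num_labels=len(LABELS),   # initialized; fine-tune before real use
    )
    model.eval()

    # One second of 16 kHz audio stands in for a real utterance.
    waveform = torch.zeros(16000)
    inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (1, num_labels)
    print(LABELS[int(logits.argmax(dim=-1))])

In practice the classification head would first be fine-tuned on a labeled emotional speech corpus such as Dusha; the two-stage result reported in the abstract (pretraining on Dusha, then RESD) corresponds to running such fine-tuning twice on successive datasets.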
References:
Schneider, S., Baevski, A., Collobert, R., & Auli, M. (2019). wav2vec: Unsupervised Pre-Training for Speech Recognition. arXiv preprint. doi:10.48550/arXiv.1904.05862
Cowie, R., Douglas-Cowie, E., Tsapatsoulis, N., Votsis, G., Kollias, S., Fellenz, W., & Taylor, J. G. (2001). Emotion recognition in human-computer interaction. IEEE Signal Processing Magazine, 18(1), 32–80. doi:10.1109/79.911197
Kondratenko, V., Sokolov, A., Karpov, N., Kutuzov, O., Savushkin, N., & Minkin, F. (2022). Large Raw Emotional Dataset with Aggregation Mechanism. arXiv preprint. doi:10.48550/arXiv.2212.12266
Busso, C., Bulut, M., Lee, C.-C., Kazemzadeh, A., Mower, E., Kim, S., Chang, J. N., Lee, S., & Narayanan, S. S. (2008). IEMOCAP: interactive emotional dyadic motion capture database. Language Resources and Evaluation, 42(4), 335–359. doi:10.1007/s10579-008-9076-6
Wagner, J., Triantafyllopoulos, A., Wierstorf, H., Schmitt, M., Burkhardt, F., Eyben, F., & Schuller, B. W. (2023). Dawn of the Transformer Era in Speech Emotion Recognition: Closing the Valence Gap. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(9), 10745–10759. doi:10.1109/TPAMI.2023.3263585