National Security

Recognition of Human Emotions by Voice in the Fight against Telephone Fraud

Pleshakova Ekaterina Sergeevna

ORCID: 0000-0002-8806-1478

PhD in Technical Sciences

Associate Professor, Department of Information Security, Financial University under the Government of the Russian Federation

125167, Russia, Moscow, 4th Veshnyakovsky Ave., 12k2, building 2


Gataullin Sergei Timurovich

PhD in Economics

Dean of "Digital Economy and Mass Communications" Department of the Moscow Technical University of Communications and Informatics; Leading Researcher of the Department of Information Security of the Financial University under the Government of the Russian Federation

8A Aviamotornaya str., Moscow, 111024, Russia


Osipov Aleksei Viktorovich

PhD in Physics and Mathematics

Associate Professor, Department of Data Analysis and Machine Learning, Financial University under the Government of the Russian Federation

125167, Russia, Moscow, 4th Veshnyakovsky str., 4, building 2


Koroteev Mikhail Viktorovich

Doctor of Economics

Deputy Dean of the Faculty of Information Technology, Federal State Educational Budgetary Institution of Higher Education "Financial University under the Government of the Russian Federation"

49/2 Leningradskiy Prospect str., Moscow, 125167, Russia

Ushakova Yuliya Vladislavovna

Doctor of Law

Student, Department of Data Analysis and Machine Learning, Financial University under the Government of the Russian Federation

125167, Russia, Moscow, Leningradskiy Prospect str., 49/2

Abstract: Advances in communication technologies have made interpersonal communication more accessible. In the era of information technology, information exchange has become simple and fast. However, personal and confidential information may be exposed on the Internet; voice phishing, for example, is actively exploited by intruders. The harm caused by phishing is a serious and growing problem worldwide. Communication systems are vulnerable and can easily be compromised by attackers using social engineering, which aims to trick people or businesses into performing actions that benefit the attackers or into disclosing confidential data. This article explores the usefulness of various learning approaches for solving the problem of fraud detection in telecommunications. A person's voice carries parameters that convey information such as emotion, gender, attitude, health, and personality. Speaker recognition technologies have broad areas of application, in particular countering telephone fraud. Emotion recognition is also becoming an increasingly relevant technology with the development of voice assistant systems. One of the goals of the study is to determine the user model that best identifies fraud cases. Machine learning provides effective technologies for fraud detection and is successfully used to detect activities such as phishing, cyberbullying, and telecommunications fraud.


Keywords: artificial intelligence, cyberbullying, machine learning, text analysis, neural networks, personal data, computer crime, cybercrimes, phone fraud, phishing

This article is an automatic translation; the original text of the article is available on the journal's website.

The article was prepared as part of the 2022 state assignment of the Government of the Russian Federation to the Financial University on the topic "Models and Methods of Text Recognition in Anti-Telephone-Fraud Systems" (VTK-GZ-PI-30-2022).

Phishing is becoming an increasingly serious threat, largely due to the development of web technologies, mobile technologies, and social networks. Phishing aims to collect confidential and personal information, such as usernames, passwords, and credit card numbers, by posing as a legitimate person in cyberspace through phone calls or emails. Since phishing attacks target not only ordinary users but also critical infrastructure, they may pose a threat to national security. Phishing attacks can have serious consequences for their victims, such as loss of intellectual property and confidential information, as well as financial losses. Phishing is often used to infiltrate corporate or government networks as part of a larger attack, such as an advanced persistent threat (APT) event. In this scenario, employees are compromised in order to circumvent security perimeters, spread malware inside a closed environment, or gain privileged access to protected data. Phishing detection is considered a difficult problem because phishing is based on semantics: it exploits human vulnerabilities rather than system vulnerabilities.

The spread of mobile technologies in recent years has contributed to the development of social engineering, in which scammers manipulate people into performing certain actions on the Internet in order to obtain confidential information. This may involve sending targeted phishing emails that encourage recipients to click on links, provide personal information, or download malicious software. One of the methods used by scammers is an attempt to provoke strong emotions in a potential victim, whether fright or joy; their task is to throw the victim off balance. In this regard, it is worth considering technologies for recognizing emotions in speech.

Emotion recognition (SER, Speech Emotion Recognition) is becoming an increasingly urgent task with the development of voice assistant systems, since it can make human-computer interaction more efficient [1,3]. SER can work as a separate embedded module that can be connected to another device, or it can run on smart devices using cloud technologies [2]. At the same time, deep learning methods are increasingly being considered as an alternative to traditional methods and are gradually gaining popularity [4]. Nevertheless, when systems developed by engineers in the laboratory are introduced into modern technologies, a considerable number of problems arise.

There are also many psychological systems that describe the spectrum of human emotions. They are used in SER to assign an emotion to a specific category. To date, two models are most popular. The first uses discrete classes (for example, a model based on P. Ekman's theory of the six basic groups of emotions). The second uses multiple axes (two or three): for example, the level of arousal is estimated on one axis and positivity/valence on the other.
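The two-axis approach can be illustrated with a minimal sketch that places discrete classes in a (valence, arousal) space and assigns an arbitrary point to the nearest class. The coordinates below are illustrative assumptions, not values from any specific psychological model.

```python
# Illustrative mapping of discrete emotions onto a two-axis
# (valence, arousal) space; the coordinates are assumed for
# demonstration only.
EMOTION_SPACE = {
    #            valence  arousal   (both in [-1, 1])
    "joy":      ( 0.8,    0.6),
    "anger":    (-0.7,    0.8),
    "sadness":  (-0.6,   -0.5),
    "calm":     ( 0.4,   -0.6),
}

def nearest_emotion(valence, arousal):
    """Assign a (valence, arousal) point to the closest discrete class."""
    return min(
        EMOTION_SPACE,
        key=lambda e: (EMOTION_SPACE[e][0] - valence) ** 2
                    + (EMOTION_SPACE[e][1] - arousal) ** 2,
    )

closest = nearest_emotion(0.7, 0.5)   # a high-valence, high-arousal point
```

Note that joy and anger share a similar arousal level here and differ mainly in valence, which is why a single-axis model would struggle to separate them.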

Since SER is a fairly new field of research, no optimal universal models exist yet. Experiments are being conducted on improving the preprocessing of audio files and the classification models. For example, hybrid systems are being developed that combine classical classification models and neural networks to improve accuracy [5].

SER requires databases on which the model will be trained. Datasets have their own characteristics and can also affect the parameters and accuracy of the model. Moreover, real-time speech recognition is a separate problem: material for databases is recorded in studio conditions, but when SER is used in practice, external noise must also be dealt with.

Stages of working with audio files. The general scheme of SER operation is as follows: sound preprocessing -> feature extraction -> classification.

There are three main stages of emotion recognition in speech: sound preprocessing, feature extraction, and classification [6]. During preprocessing, the noise level is reduced and speech is divided into segments so as to transform the speech fragment into a number of meaningful parts. At the second stage, significant features are selected that help attribute the fragment to a particular class. After the fragment's location in the feature space is determined (for example, as a vector), the fragment is assigned to a specific class. As Mozziconacci notes, an additional complexity of analyzing emotions in speech is that people tend to express emotions less vividly when communicating with gadgets.
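The three stages can be sketched as a minimal pipeline. The function bodies below are simplified stand-ins (assumed thresholds, toy features, a rule instead of a trained model), not an implementation from the cited studies.

```python
import numpy as np

def preprocess(signal):
    """Stage 1: toy preprocessing - remove DC offset and near-silent samples."""
    signal = signal - np.mean(signal)
    return signal[np.abs(signal) > 0.01]          # crude silence gate

def extract_features(signal):
    """Stage 2: toy feature vector - energy and zero-crossing rate."""
    energy = float(np.mean(signal ** 2))
    zcr = float(np.mean(np.abs(np.diff(np.sign(signal)))) / 2)
    return np.array([energy, zcr])

def classify(features):
    """Stage 3: toy rule-based classifier standing in for a trained model."""
    return "aroused" if features[0] > 0.1 else "neutral"

signal = np.sin(np.linspace(0, 100, 16000))       # stand-in for real audio
label = classify(extract_features(preprocess(signal)))
```

In a real system each stage would be replaced by the techniques discussed below: noise reduction and normalization, MFCC-style features, and a statistical or neural classifier.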

Feature normalization and noise reduction. For most SER systems, a number of manipulations must be performed on the original audio track in real time to reduce the noise level.

Noise reduction removes stationary noise during audio recording or in a finished audio track. Sound normalization adjusts the volume so that it remains stable when sound is recorded from different distances. Before feature extraction, all silent segments of the audio file are most often deleted.
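Peak normalization and removal of silent segments can be sketched as follows; the frame length and energy threshold are illustrative assumptions.

```python
import numpy as np

def peak_normalize(signal, target_peak=0.9):
    """Scale the signal so its maximum absolute amplitude equals target_peak,
    compensating for recordings made at different distances or levels."""
    peak = np.max(np.abs(signal))
    return signal * (target_peak / peak) if peak > 0 else signal

def trim_silence(signal, frame_len=400, threshold=1e-4):
    """Drop frames whose mean energy falls below an (assumed) threshold."""
    n = len(signal) // frame_len
    frames = signal[: n * frame_len].reshape(n, frame_len)
    voiced = frames[np.mean(frames ** 2, axis=1) > threshold]
    return voiced.reshape(-1)

quiet = 0.05 * np.sin(np.linspace(0, 50, 8000))   # a quiet recording
loud = peak_normalize(quiet)                      # peak amplitude is now 0.9
```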

A number of studies indicate that noise reduction models effective for automatic speech recognition (ASR) also cope successfully with paralinguistic tasks.

The most popular noise reduction methods include spectral subtraction, Wiener filtering, and the minimum mean square error (MMSE) method [7]. Comparing these three noise reduction methods, Chenchah and Lachiri found that for the classical SER model (mel-frequency cepstral coefficient extraction + hidden Markov model), the most effective noise reduction method was spectral subtraction, followed by the minimum mean square error method [8].

However, in some cases, reducing external noise can negatively affect the accuracy of the model. In the experiments conducted by D. Deusi and E. Popa, the Z-score normalization method from the SoX library for Python was used, in which each feature value is expressed as its distance from the mean in units of the standard deviation. The SoX and RNNoise libraries were used to suppress noise. However, feature normalization and noise reduction reduced the accuracy of the models. The highest accuracy (72.43%) was achieved without normalization and noise suppression, while the highest accuracy rates using noise reduction and normalization ranged from 58.88% to 68.69%.
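Z-score normalization of a feature matrix, as described above, can be sketched with plain NumPy (a stand-in for the library calls used in the cited study):

```python
import numpy as np

def zscore(features, eps=1e-12):
    """Express each feature column as its distance from the column mean
    measured in standard deviations: z = (x - mean) / std."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / (std + eps)

# Rows are frames/utterances, columns are features (illustrative numbers).
X = np.array([[1.0, 200.0],
              [2.0, 220.0],
              [3.0, 240.0]])
Z = zscore(X)   # each column now has mean 0 and (near) unit variance
```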

Feature extraction. The two main steps in recognizing emotions in speech after primary sound processing are feature identification and then classification.

For emotion recognition technologies in speech, the most common features are: arousal, prosodic features of speech (pitch, strength, intensity, rhythm, etc.), acoustic characteristics of the environment, and, less often, linguistic features of speech [9,10].

In modern state-of-the-art models, two main approaches can be seen: 1) using large sets of features (or selecting them manually) and then reducing the dimensionality (e.g., with principal component analysis, PCA); 2) delegating this task to a neural network that will identify significant features on its own [11].

Static modeling uses utterances in their entirety; in dynamic modeling, the sound signal is divided into small fragments (frames) [12]. Dynamic modeling is more reliable because it does not rely on segmenting the input speech into utterances and can capture the change of emotions within utterances over time [13]. Nevertheless, empirical comparisons of the two methods have shown better results precisely when complete utterances are analyzed.

Local short-term features. Local (short-term) features, as the name implies, characterize the sound signal over a short period.

These include features such as formants, pitch, and log-energy.

Dynamic modeling is used to recognize local features.

The most typical frame length for analyzing local features varies from 20 to 30 ms. The overlap length can be chosen equal to half the frame length [14].

For example, in one of the competitions posted on kaggle.com, the sound is converted into a mel-spectrogram by dividing the audio track into small segments (frames). In this particular competition, participants are asked to divide tracks into short frames of 25 milliseconds (0.025 seconds) with a step of 10 milliseconds (0.01 seconds), so one second of the audio track contains 100 frames. A 40-dimensional mel vector is created from each frame. Thus, the entire spectrogram has the form of a matrix of dimension 100*T*40, where T is the number of seconds.
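The frame arithmetic from this example can be checked directly. The numbers (25 ms frames, 10 ms hop, 16 kHz sample rate) follow the description above; everything else is an illustrative sketch.

```python
import numpy as np

def frame_signal(signal, sr=16000, frame_ms=25, hop_ms=10):
    """Split audio into overlapping frames: 25 ms windows every 10 ms."""
    frame_len = int(sr * frame_ms / 1000)   # 400 samples at 16 kHz
    hop = int(sr * hop_ms / 1000)           # 160 samples at 16 kHz
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx]                      # shape: (n_frames, frame_len)

one_second = np.random.randn(16000)
frames = frame_signal(one_second)
# (16000 - 400) // 160 + 1 = 98 full frames per second; the round figure of
# "100 frames" assumes padding at the edges of the track.
```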

Depending on a set of factors such as the algorithm used, the purpose of recognition, the quality of input data, etc., various parameters can be changed.

For example, in the study by H. Fayek, M. Lech, and L. Cavedon devoted to deep learning technologies for SER, the authors found that the accuracy of models with feed-forward architectures increases with the number of frames but reaches a plateau at 220 frames [13].

Global long-term features. Global (long-term) features, on the contrary, reflect characteristics that can be traced throughout the utterance.

Examples include the mean pitch, the standard deviation of the pitch, and energy indicators tracked throughout the sound segment.
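Computing utterance-level (global) statistics from frame-level pitch estimates can be sketched as follows; the pitch values and the convention of marking unvoiced frames with zeros are illustrative assumptions.

```python
import numpy as np

def global_pitch_features(frame_pitches):
    """Collapse a sequence of frame-level pitch values (Hz) into
    utterance-level statistics: mean, standard deviation, range."""
    p = np.asarray(frame_pitches, dtype=float)
    p = p[p > 0]                      # drop unvoiced frames (pitch = 0)
    return {
        "mean_pitch": float(np.mean(p)),
        "std_pitch": float(np.std(p)),
        "pitch_range": float(np.max(p) - np.min(p)),
    }

# Frame-level pitch track with some unvoiced (zero) frames, in Hz.
track = [0.0, 210.0, 220.0, 230.0, 0.0, 240.0]
feats = global_pitch_features(track)   # mean 225 Hz, range 30 Hz
```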

Statistical features can be more difficult to calculate. In addition, global features come from various fields: they record acoustic characteristics, linguistic characteristics, and statistical indicators of speech and sound. Given such heterogeneity, creating a single dataset with universal global features suitable for speakers of different languages and for different tasks and situations seems problematic. Moreover, some global features are problematic to identify from the sound signal alone, for example, labialization (a phonetic phenomenon in which the lips are rounded to articulate a sound). Therefore, a popular solution is to add global features to a set of local features. Because of these problems, global features are usually limited to statistical and global acoustic ones.

In a study by A. Karpov, a hybrid approach was applied, combining long-term and short-term features. The researchers used extraction both at the frame level (dynamic modeling) and at the utterance level (static modeling). A joint representation of features was created, and then a classification method (for example, logistic regression) was applied [15].

In the study [16], the authors used a hybrid method for the task of speaker identification; the accuracy obtained was more than 20% higher than that of the most effective models at the time. There is no consensus among researchers regarding the advantages of local versus global features for recognizing emotions in speech. However, the majority holds the opinion that global features increase the accuracy of the model and reduce classification time, and that cross-validation requires less time than for local features (since there are fewer global features). However, global features do not reflect the temporal characteristics of the signal and also work worse with emotions that share the same arousal level (joy vs. anger).

Feature sets. The most complete sets of features for SER were developed during the annual Interspeech conferences (INTERSPEECH Computational Paralinguistics Challenge, ComParE).

Table 1 presents a feature set that includes a large selection of low-level descriptors (MFCC spectral coefficients, voicing probabilities, pulse-code modulation PCM) and functionals (mean, standard deviation, percentiles, quartiles, linear regression coefficients).

Table 1 - Feature sets. Columns: feature set; low-level descriptors; total number of features.
An important current problem is the creation of effective and universal sets of significant characteristics that will work equally well across different research environments and tasks, including countering telephone fraud. The eGeMAPS set offers a more minimalistic approach to the feature set than ComParE. When compiling it, the authors were guided by three main rules: 1) inclusion of features that best reflect changes in the psychological state of the voice; 2) the proven significance of certain characteristics in previous studies and the possibility of automatic feature extraction; 3) theoretical importance.

For the task of recognizing emotions in speech, the audio file must be converted into a visual representation of the signal energy by frequency, that is, brought to matrix form. In the field of emotion recognition in speech, this task is usually solved by converting sound into a mel-spectrogram. Mel-frequency cepstral coefficients are the most popular way to transform an audio file and have proven their reliability. The mel scale is a frequency scale based on the fact that human hearing is more sensitive to changes in the low range than in the high range. To create a mel-spectrogram, a bank of mel filters is applied to the sound, converting the sound file into its visual representation: one axis shows the mel frequency, the other shows time. The goal of mel spectral analysis is to represent sound the way the human ear hears it. This is not the only way to mathematically express human perception of sound, but it has proven its effectiveness in speech recognition tasks and is therefore currently the most popular [17]. The stages of primary processing of the audio signal are as follows:

  • Fourier transform. In emotion recognition in speech, the FFT (fast Fourier transform) is usually used, since it allows speech to be analyzed in real time.
  • The resulting spectrum vectors of the audio signal then pass through filters; in other words, they are multiplied by window functions. The result is a set of coefficients.
  • A further transformation turns these coefficients into mel-frequency cepstral coefficients. The values obtained in the previous step are squared and log-compressed, after which a Fourier transform or discrete cosine transform can be applied again.
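The steps above can be sketched end to end with NumPy and SciPy. The filterbank size, number of coefficients, and test tone are illustrative assumptions, not parameters from the cited sources.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    """Convert frequency in Hz to the mel scale."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale."""
    mel_pts = np.linspace(0, hz_to_mel(sr / 2), n_filters + 2)
    hz_pts = 700.0 * (10.0 ** (mel_pts / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fb[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[m - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mfcc(frame, sr=16000, n_filters=26, n_coeffs=13):
    """FFT -> mel filterbank -> log -> DCT, as in the steps above."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2           # power spectrum
    fb = mel_filterbank(n_filters, len(frame), sr)
    energies = np.log(fb @ spectrum + 1e-10)             # log mel energies
    return dct(energies, type=2, norm="ortho")[:n_coeffs]

# A single 25 ms frame containing a 440 Hz tone, Hamming-windowed.
frame = np.hamming(400) * np.sin(2 * np.pi * 440 * np.arange(400) / 16000)
coeffs = mfcc(frame)                                     # 13 MFCCs
```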

Nevertheless, a clear trend in SER is the study of models that do not need a ready-made set of features; instead, a neural network working with the raw file learns to identify the important features itself. Such technologies can be called breakthroughs, since they significantly reduce the knowledge requirements for the creators of SER solutions and for the source data, and generally simplify the search for classification features.

The task of speech recognition is better studied than that of recognizing emotions in speech, but the two areas are closely related, and algorithms working for one can be successfully adapted to the other. One of the first studies devoted to working with raw data is the work of N. Jaitly and G. Hinton, in which a restricted Boltzmann machine learned to recognize words directly from a signal presented in waveform [18]. In another study, a convolutional neural network was used for speech recognition on data presented as a mel-spectrogram [19]. Palaz et al. conducted a study in which a convolutional neural network was trained on a raw dataset [20].

A successful attempt to apply an end-to-end model was made in a study by Trigeorgis and colleagues. After the noise level is reduced, the signal is transformed into a time series and fed into a long short-term memory (LSTM) network, which determines which features are important using backpropagation. Training then proceeds using a convolutional neural network. The authors also note that the features associated with an excited state in ready-made feature sets were likewise highlighted during the training of their model. Accordingly, neural networks can be used as an auxiliary tool to determine the optimal set of features.

Having considered the options for feature extraction, we can come to the following conclusions:

  • At the moment, there is no single standard for identifying features in SER. There are three main options: manual feature selection, the use of ready-made sets, or the determination of significant features using neural networks with or without supervision. The last type of model seems the most promising. The results of research on neural networks for extracting important features can be used both to deploy such systems and to determine the most significant features for inclusion in feature sets.
  • On the one hand, the lack of a single feature standard is a natural consequence of the fact that SER is a fairly new field of research: scientists have the opportunity to experiment with different sets and identify the most suitable features. However, this makes it difficult to compare the results of studies, since each uses a different feature set.
  • When choosing between local and global features, researchers usually have to pick one of the options, since combining the two leads to a complex model and an overloaded feature set. The main advantage of global features is the ability to use not only acoustic and statistical tools but also, for example, linguistic features, and to reduce the variability of the speaker's speech, thereby increasing robustness. However, at the moment there are no comprehensive and universal databases that would cover more than statistical and acoustic global features. The most popular model variants use only local features or add some global features to local ones; such systems show better results. It can therefore be assumed that in the future the use of global features will become more important for SER.

Classification methods. Classification is performed using linear and nonlinear classifiers.

Among the former, the most common are Bayesian networks, the maximum likelihood method, and the support vector machine. Nonlinear classifiers, including Gaussian mixture models and hidden Markov models, are also effective for speech analysis. In addition, classifiers such as the k-nearest neighbors method, principal component analysis, and decision trees are also used for emotion recognition [21].

Among the classical classifiers for SER, hidden Markov models [22], Gaussian mixture models [23], support vector machines [24], and the k-nearest neighbors method [25] have been used. The most widely used in SER are hidden Markov models and the support vector machine [26,27].

For classical models, the scheme looks like this: input voice data/dataset -> feature extraction -> application of statistical tools (e.g. openSMILE) -> PCA dimensionality reduction -> classical classifier -> result.
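This scheme maps naturally onto a scikit-learn pipeline. The synthetic data and parameter choices below are illustrative assumptions, not values from the cited studies: 40 "utterance-level features" are generated as noisy mixtures of 5 latent factors, with the label driven by the first factor.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for extracted utterance-level features: 200 utterances,
# 40 observed features mixed from 5 latent factors plus noise, and a binary
# "emotion" label driven by the first latent factor.
latent = rng.normal(size=(200, 5))
X = latent @ rng.normal(size=(5, 40)) + 0.1 * rng.normal(size=(200, 40))
y = (latent[:, 0] > 0).astype(int)

# feature scaling -> PCA dimensionality reduction -> SVM classifier
model = make_pipeline(StandardScaler(), PCA(n_components=10), SVC(kernel="rbf"))
model.fit(X, y)
accuracy = model.score(X, y)   # training accuracy on the synthetic data
```

Because the 40 features are correlated through 5 latent factors, PCA compresses them to a small number of components with little information loss before the SVM sees them.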

The feature set affects the operation and accuracy of the classification model. As noted in [28], for hidden Markov models and the support vector machine, the number of global feature vectors may not be sufficient for effective model training. Complex models learn better from the large datasets created when short-term features are extracted. A Gaussian mixture model is suitable for analyzing global features.

However, deep learning has a number of advantages over traditional methods, including the ability to analyze complex feature structures without manual feature selection and to work with unlabeled data.

Several factors motivate the use of deep learning for recognizing emotions in speech. First, there is no definitive set of acoustic features that allows emotions in speech to be recognized effectively in any situation, so developers are forced to use a large number of features at once. Consequently, the risk of overfitting increases, because the dataset becomes too large to analyze and to derive general rules from [29]. In addition, many acoustic features are expensive to compute and require additional equipment or special technical solutions, which hinders the spread of the technology and its adaptation to various types of devices [30]. An important advantage of deep learning is that there is no need to select features for training manually [31]. All this makes the study of deep learning extremely relevant for SER technologies.

Architectures based on feed-forward networks, such as deep neural networks (DNN) and convolutional neural networks (CNN), demonstrate the best results for image and video processing. At the same time, recurrent architectures, namely recurrent neural networks (RNN) and long short-term memory (LSTM) networks, are very effective in speech analysis, natural language processing (NLP), and speech emotion recognition (SER) systems [32-33].
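The difference from a feed-forward layer is that an LSTM carries state between frames. A single LSTM time step can be sketched in plain NumPy; the weights here are random placeholders, not a trained model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM time step: gates computed from input x and previous state h."""
    z = W @ x + U @ h + b                    # shape (4 * hidden,)
    i, f, o, g = np.split(z, 4)              # input, forget, output, candidate
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    c_new = f * c + i * np.tanh(g)           # cell state mixes old and new
    h_new = o * np.tanh(c_new)               # hidden state passed onward
    return h_new, c_new

rng = np.random.default_rng(0)
n_in, n_hidden = 40, 8                       # e.g. 40 mel features per frame
W = rng.normal(scale=0.1, size=(4 * n_hidden, n_in))
U = rng.normal(scale=0.1, size=(4 * n_hidden, n_hidden))
b = np.zeros(4 * n_hidden)

h = c = np.zeros(n_hidden)
for frame in rng.normal(size=(100, n_in)):   # 100 frames of (fake) features
    h, c = lstm_step(frame, h, c, W, U, b)
# h now summarizes the whole sequence and could feed a classifier head
```

Because `h` and `c` persist across the loop, the final hidden state depends on the whole frame sequence, which is exactly what a per-frame feed-forward network cannot capture.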

However, deep learning has a number of limitations. First, it requires complex calculations and, accordingly, greater computing power, memory, and training time. This creates a number of difficulties for analyzing live speech and implementing such technologies in everyday use.

In deep learning technologies, the process of recognizing emotions in speech looks like this:

input voice data/dataset -> feature extraction -> application of statistical tools (e.g. openSMILE) -> deep learning algorithm -> result.

A number of studies have reported positive results from using CNNs to process one-dimensional signals such as audio and speech [37-39]. In their study, D. Deusi and E. Popa considered three types of nonlinear models: classical (support vector machine (SVM), k-nearest neighbors (KNN), random forests (RFC)), deep learning (multilayer perceptron (MLP) and convolutional neural networks (CNN)), and hybrid classifiers.

The researchers then conducted a series of experiments to find out how well the models would work with real speech compared to recorded datasets. They concluded that the model trains better not on a dataset but on data provided by the user himself. To improve accuracy, the authors also suggest using hybrid models that combine neural networks and nonlinear classifiers (for example, SVM). It was the hybrid model that showed the highest accuracy in the study (83.27% for the Berlin database and 67.19% for RAVDESS). The authors suggest that classifiers such as Gaussian mixture models (GMM), hidden Markov models (HMM), and deep belief networks (DBN) will be optimal for hybrid systems. In addition, they note that further work should focus on ways to increase the stability of the system in any environment.

In another study, a hybrid model combining a deep belief network and a support vector machine was chosen as the classifier; it showed a higher result of 86.60%.

The network with layers of 1024 and 2048 neurons showed the best results. The network performs feature fusion, which increases the system's resistance to external noise [34].

In [35], an important remark can be found regarding the input data for training. Simulated databases can show good accuracy results within an experiment; however, when the mechanism is implemented in real applications, the accuracy may differ greatly. And although data obtained in real conditions may have great variability and be difficult to process, they have great potential for application.

Thus, the following conclusions can be drawn:

  • There is no universal classifier model that copes well with all SER tasks. Classical classifiers have a simpler architecture than neural networks, so model training can be cheaper and faster. Neural networks have great potential for unsupervised learning but require more complex calculations. Hybrid models are used to offset the limitations of a particular method.
  • A large number of factors affect the quality of the model. First, the same model may show different results on different datasets. The number of features also affects the quality of model training. In addition, results differ when systems are tested in different environments. Given these features, the near future is unlikely to bring a set of universal models; instead, numerous experiments with classical, hybrid, and deep learning models can be expected.

Databases used to recognize emotions in speech. Databases can be divided into three main types:

1. Simulated. In such datasets, voice fragments are recorded by professional actors; 60% of databases belong to this category. Engaging actors to record voices according to a predetermined scenario is the easiest way to create a dataset.

2. Induced. To collect data for such a database, the recorded natural reactions of people placed in a certain artificially created situation are used, usually without the knowledge of the person whose voice is being recorded. Compared to the first category, a person's reactions may be more natural, but additional difficulties arise in modeling situations, since a person's real emotions may not coincide with those expected. In addition, such recording of voice fragments may raise a number of ethical issues related to the collection of personal information.

3. Natural bases. They record real everyday dialogues of people, conversations recorded during calls to call centers, etc. The main difficulty lies in collecting such information [36].

Each database has its own peculiarities in methodology, the number of recognized emotions, the people whose speech was recorded, etc. Consider the following databases:

  • Berlin Database of Emotional Speech. Open database.

In this dataset, six emotions plus a neutral state are distinguished: anger, boredom, disgust, fear, joy, and sadness. Ten professional actors utter emotionally neutral sentences (for example: "The cloth is on the refrigerator"): 5 long and 5 shorter. The sentences are pronounced several times with different intonations corresponding to each of the six emotions. To assess the quality of the collected dataset, 20 listeners were invited, who, after listening to a random utterance from the database, had to determine the emotion. Utterances for which agreement with the intended emotion was below 40% were marked as ambiguous. Excluding such utterances leaves 494 of the 900 recordings, with an average emotion detection accuracy of 84.3%. It is this reduced set that is usually used for research.

  • Danish Emotional Speech Database. Open database.

To record the material, the voices of four actors were used, who pronounced nine sentences, two words, and interjections. The emotions included in the dataset are anger, joy, a neutral state, sadness, and surprise. The recordings were then evaluated for correspondence by twenty listeners; the average result is 67% accuracy. Contains only utterances in Danish.

  • Interactive Emotional Dyadic Motion Capture Database. Open database.

The dataset was recorded at the University of Southern California (USC). Its peculiarity is that it contains a whole complex of factors reflecting emotions: the database holds information about head movement, hand movement, facial expressions, and speech during a dialogue. 10 actors took part in the data collection, acting out emotions according to a script or in improvised situations in which they had to express certain emotions (happiness, anger, sadness, disappointment, and a neutral state). The presence of improvised dialogues is considered an important advantage of this dataset. It contains only utterances in English and is one of the most popular databases for developing and testing SER technologies [37].

  • INTERFACE05. Open database.

The database contains 1,277 samples. The voices of 42 people (eight of them women; 14 nationalities) were used, and the recordings were made in an office environment. The database includes six categories of emotions: anger, disgust, fear, joy, sadness, and surprise. Each subject was asked to listen to six stories, each evoking a certain emotion, which the listener then had to express by saying one of five available sentences (for each emotion there is a set of five sentences). Two experts then evaluated the recordings and entered them into the database if they considered the emotion to be expressed clearly enough. Contains utterances in English, Slovenian, Spanish, and French [38].

  • LDC Emotional Prosody Speech and Transcription. Open base. The dataset was developed by the Linguistic Data Consortium over eight months in 2000-2001.

Professional actors read a series of ten neutral statements (dates and numbers), and each statement is recorded in 15 emotional states (including despair, sadness, a neutral state, interest, joy, panic, rage, shame, hatred, delight, pride and cold anger). It contains only English utterances.

  • SAVEE (Surrey Audio-Visual Expressed Emotion). The database consists of recordings of four male actors and includes seven different emotions.

The dataset contains 480 utterances in British English. Ten experts were invited to evaluate the recorded expressions. Accuracy was highest, 84%, when video and audio were assessed together; it was 65% when only the visual material was evaluated and 64% when only the audio was evaluated.

  • Speech Under Simulated and Actual Stress (SUSAS). The purpose of creating this dataset was to improve the performance of speech recognition systems.

The creators of the dataset note that situational stress and environmental noise affect system accuracy. Another factor is the Lombard effect: people raise their voices in a noisy environment in order to be heard, which in turn increases the overall noise level in the room. 32 actors (13 women, 19 men) aged 22 to 76 were involved in the recording; the database contains 16,000 utterances in total. It is divided into five domains, which cover: 1) speaking style (slow, fast, quiet, etc.); 2) a single effect in isolation (for example, the Lombard effect); 3) two effects simultaneously; 4) fear and stress experienced in real circumstances during motion (overload, the Lombard effect, noise, fear); 5) psychiatric analysis data (speech under depression, fear, anxiety).

  • Ruslana. Open database. In 2002, a database was created for the Russian language.

Actors (12 men, 49 women) were invited for the recordings; each spoke ten sentences expressing surprise, anger, happiness, sadness, fear or a neutral state. The dataset contains 3,660 utterances. Invited experts determined the type of each emotion and assessed how well it was expressed.
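Several of the corpora above keep only those utterances whose intended emotion was confirmed by a minimum share of listeners (40% in the case of the Berlin corpus). A minimal sketch of such agreement-based filtering, using hypothetical utterance IDs and agreement scores:

```python
# Hypothetical (utterance_id -> agreement) map: the share of listeners
# who labeled the utterance with its originally intended emotion.
ratings = {
    "03a01Fa": 0.95,
    "08b02Wb": 0.38,  # below the threshold -> discarded as ambiguous
    "11a04Tc": 0.61,
}

THRESHOLD = 0.40  # e.g. the 40% cut-off used for the Berlin corpus

def filter_ambiguous(ratings, threshold=THRESHOLD):
    """Keep only utterances whose listener agreement meets the threshold."""
    return {uid: r for uid, r in ratings.items() if r >= threshold}

kept = filter_ambiguous(ratings)
# -> {"03a01Fa": 0.95, "11a04Tc": 0.61}
```

The same one-pass filter applies to any corpus that publishes per-utterance listener agreement; only the threshold changes.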

Thus, several dataset-related problems can be identified for SER:

  • Most existing datasets belong to the simulated category, that is, a small group of professional actors is involved in their creation. Acted speech cannot always reflect speech features, which vary greatly between people, and this can affect the result. For example, in experiments conducted by J. Deusi and E. Popa, a difference in accuracy was found between the Berlin Database of Emotional Speech (simulated) and RAVDESS (induced): with the simulated database, accuracy was higher in all nine experiments, which used different classification models. A similar observation about simulated and induced corpora was made in an earlier study. Yet even for databases of the same category, results differ across most studies, which greatly complicates an objective comparison of research results, let alone their adaptation to real conditions.
  • The databases contain files recorded in studio conditions, which makes it difficult to adapt the material to recognition in a real environment with varying noise levels and accompanying sounds.
  • Not all languages are represented in datasets. Even if linguistic features are ignored, acoustic markers of emotional state derived from an English dataset may be incorrect for other languages.
  • The group of people selected to record a dataset is often imbalanced by gender or age. For example, 34 of the 42 speakers in the INTERFACE05 database were men, while 49 of the 61 participants in the Russian-language database were women.
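The imbalance noted above is easy to quantify before a corpus is used. A minimal sketch, assuming hypothetical speaker metadata (in practice it would be read from the dataset's documentation or file-naming conventions):

```python
from collections import Counter

# Hypothetical speaker metadata for a corpus.
speakers = [
    {"id": 1, "gender": "m", "age": 31},
    {"id": 2, "gender": "f", "age": 27},
    {"id": 3, "gender": "m", "age": 45},
    {"id": 4, "gender": "m", "age": 62},
]

def gender_balance(speakers):
    """Return the share of each gender among the recorded speakers."""
    counts = Counter(s["gender"] for s in speakers)
    total = sum(counts.values())
    return {g: n / total for g, n in counts.items()}

print(gender_balance(speakers))  # {'m': 0.75, 'f': 0.25}
```

The same counting works for age bands or emotion labels; a strongly skewed distribution is a warning that models trained on the corpus may not transfer to the general population.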


Fraud detection is an important part of ensuring national security. This paper provides an overview of methods, problems and trends in voice recognition that can be integrated into systems for countering telephone fraud. The authors surveyed research on intelligent approaches to voice recognition; although their effectiveness varied, each method was shown to be sufficiently effective for recognizing the voice and the emotions in speech. In particular, the ability of computational methods such as neural networks to learn and adapt is well suited to the changing tactics of scammers. A review of classification methods was carried out and effective nonlinear classifiers were identified, among them Gaussian mixture models and hidden Markov models; in the evaluation of classification results, the hidden Markov model showed an average accuracy of 77.1%. The potential of SER is still far from fully realized, and research into the practical implementation of speech emotion recognition technologies is ongoing. SER can act as a standalone technology or as an addition to another. A separate niche for speech emotion recognition is its use in systems for detecting and countering telephone fraud. Such systems are expected to use the emotional component of speech together with the intelligent voice-recognition approaches considered here. This will make the algorithms of anti-fraud systems robust even when a fraudster disguises the voice, and will simplify and speed up the detection of telephone fraud.
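As an illustration of the nonlinear classifiers discussed, the sketch below builds a maximum-likelihood classifier with one diagonal Gaussian per emotion class over synthetic feature vectors. It is a deliberately simplified, hypothetical stand-in for the multi-component Gaussian mixture and hidden Markov models used in real SER systems, which would also operate on real acoustic features such as MFCCs rather than synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic feature vectors (e.g. utterance-averaged MFCCs) for two classes.
train = {
    "neutral": rng.normal(loc=0.0, scale=1.0, size=(200, 4)),
    "anger":   rng.normal(loc=3.0, scale=1.0, size=(200, 4)),
}

# Fit one diagonal Gaussian per class: per-dimension mean and variance.
params = {
    label: (x.mean(axis=0), x.var(axis=0) + 1e-6)
    for label, x in train.items()
}

def log_likelihood(x, mean, var):
    # Diagonal-Gaussian log-density, up to an additive constant.
    return -0.5 * np.sum(np.log(var) + (x - mean) ** 2 / var)

def classify(x):
    # Maximum-likelihood decision between the class models.
    return max(params, key=lambda lbl: log_likelihood(x, *params[lbl]))

print(classify(np.array([3.1, 2.8, 3.0, 2.9])))  # expected: anger
```

A production system would replace the single Gaussian with a multi-component mixture (or an HMM over frame sequences) and fit it with expectation-maximization, but the decision rule, comparing per-class log-likelihoods, is the same.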

1. Picard R. W. Affective computing: challenges // International Journal of Human-Computer Studies. 2003. V. 59, No. 1-2. P. 55-64.
2. Deusi J. S., Popa E. I. An investigation of the accuracy of real time speech emotion recognition // International Conference on Innovative Techniques and Applications of Artificial Intelligence. Cham: Springer, 2019. P. 336-349.
3. Emets M. Prospects for biometric identification in the context of the digital economy of the Russian Federation // Creative Economy. 2019. V. 13, No. 5. P. 927-936.
4. Khalil R. A. et al. Speech Emotion Recognition Using Deep Learning Techniques: A Review // IEEE Access. 2019. V. 7. P. 117327-117345.
5. Ivanović M. et al. Emotional Intelligence and Agents // Proceedings of the 4th International Conference on Web Intelligence, Mining and Semantics (WIMS'14). 2014.
6. Sterling G., Prikhodko P. Deep learning in the task of recognizing emotions from speech // Information Technologies and Systems 2016: proc. conf. (Minsk, 26 Oct. 2016). IPPI RAS, 2016. P. 451-456.
7. Pohjalainen J. et al. Spectral and Cepstral Audio Noise Reduction Techniques in Speech Emotion Recognition // Proceedings of the 2016 ACM on Multimedia Conference (MM'16). 2016.
8. Chenchah F., Lachiri Z. Speech emotion recognition in noisy environment // 2016 2nd International Conference on Advanced Technologies for Signal and Image Processing (ATSIP). 2016.
9. Koolagudi S. G. et al. Emotion Recognition from Semi Natural Speech Using Artificial Neural Networks and Excitation Source Features // Communications in Computer and Information Science. 2012. P. 273-282.
10. Rusalova M. N., Kislova O. O. Electrophysiological indicators of emotion recognition in speech // Successes of Physiological Sciences. 2011. V. 42, No. 2. P. 57-82.
11. Trigeorgis G. et al. Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network // 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2016.
12. Fayek H. M., Lech M., Cavedon L. Towards real-time Speech Emotion Recognition using deep neural networks // 2015 9th International Conference on Signal Processing and Communication Systems (ICSPCS). 2015.
13. Schuller B., Vlasenko B., Eyben F., Rigoll G., Wendemuth A. Acoustic emotion recognition: A benchmark comparison of performances // 2009 IEEE Workshop on Automatic Speech Recognition & Understanding. IEEE, 2009. P. 552-557.
14. Vlasenko B. et al. Frame vs. Turn-Level: Emotion Recognition from Speech Considering Static and Dynamic Processing // Affective Computing and Intelligent Interaction. 2007. P. 139-147.
15. Xiao Z. et al. Features extraction and selection for emotional speech classification // Proceedings of the IEEE Conference on Advanced Video and Signal Based Surveillance. 2005.
16. Verkholyak O., Kaya H., Karpov A. Modeling Short-Term and Long-Term Dependencies of the Speech Signal for Paralinguistic Emotion Classification // SPIIRAS Proceedings. 2019. V. 18, No. 1. P. 30-56.
17. Friedland G. et al. Prosodic and other Long-Term Features for Speaker Diarization // IEEE Transactions on Audio, Speech, and Language Processing. 2009. V. 17, No. 5. P. 985-993.
18. Eyben F. et al. The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for Voice Research and Affective Computing // IEEE Transactions on Affective Computing. 2016. V. 7, No. 2. P. 190-202.
19. Huang X. et al. Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. Prentice Hall, 2001. 980 p.
20. Jaitly N., Hinton G. Learning a better representation of speech soundwaves using restricted Boltzmann machines // 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2011.
21. Sainath T. N. et al. Convolutional, Long Short-Term Memory, fully connected Deep Neural Networks // 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2015.
22. Palaz D., Magimai-Doss M., Collobert R. Convolutional Neural Networks-based continuous speech recognition using raw speech signal // 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2015.
23. Dieleman S., Schrauwen B. End-to-end learning for music audio // 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2014.
24. Sigtia S., Benetos E., Dixon S. An End-to-End Neural Network for Polyphonic Piano Music Transcription // IEEE/ACM Transactions on Audio, Speech, and Language Processing. 2016. V. 24, No. 5. P. 927-939.
25. Demidova L. A., Sokolova Yu. S. Data classification based on the SVM algorithm and the k-nearest neighbors algorithm // Bulletin of RGRTU. 2017. No. 62. P. 119-120.
26. Dileep A. D., Sekhar C. C. GMM-Based Intermediate Matching Kernel for Classification of Varying Length Patterns of Long Duration Speech Using Support Vector Machines // IEEE Transactions on Neural Networks and Learning Systems. 2014. V. 25, No. 8. P. 1421-1432.
27. Nwe T. L., Foo S. W., De Silva L. C. Speech emotion recognition using hidden Markov models // Speech Communication. 2003. V. 41, No. 4. P. 603-623.
28. Yun S., Yoo C. D. Loss-Scaled Large-Margin Gaussian Mixture Models for Speech Emotion Classification // IEEE Transactions on Audio, Speech, and Language Processing. 2012. V. 20, No. 2. P. 585-598.
29. Mao Q., Wang X., Zhan Y. Speech emotion recognition method based on improved decision tree and layered feature selection // International Journal of Humanoid Robotics. 2010. V. 7, No. 2. P. 245-261.
30. Pao T.-L. et al. A Comparative Study of Different Weighting Schemes on KNN-Based Emotion Recognition in Mandarin Speech // Lecture Notes in Computer Science. P. 997-1005.
31. Morrison D., Wang R., De Silva L. C. Ensemble methods for spoken emotion recognition in call-centres // Speech Communication. 2007. V. 49, No. 2. P. 98-112.
32. Ververidis D., Kotropoulos C. Emotional speech recognition: Resources, features, and methods // Speech Communication. 2006. V. 48, No. 9. P. 1162-1181.
33. Tahon M., Devillers L. Towards a Small Set of Robust Acoustic Features for Emotion Recognition: Challenges // IEEE/ACM Transactions on Audio, Speech, and Language Processing. 2016. V. 24, No. 1. P. 16-28.
34. Eyben F. et al. Real-time robust recognition of speakers' emotions and characteristics on mobile platforms // 2015 International Conference on Affective Computing and Intelligent Interaction (ACII). 2015.
35. Lim W., Jang D., Lee T. Speech emotion recognition using convolutional and recurrent neural networks // 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA). 2016.
36. Schmidhuber J. Deep learning in neural networks: An overview // Neural Networks. 2015. V. 61. P. 85-117.
37. Sainath T. N. et al. Deep convolutional neural networks for LVCSR // 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2013.
38. Schlüter J., Böck S. Improved musical onset detection with Convolutional Neural Networks // 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2014.

First Peer Review

Peer reviewers' evaluations remain confidential and are not disclosed to the public. Only external reviews, authorized for publication by the article's author(s), are made public. Typically, these final reviews are conducted after the manuscript's revision. Adhering to our double-blind review policy, the reviewer's identity is kept confidential.
The list of publisher reviewers can be found here.

The subject of the study is automated recognition of human emotions by voice; the aspect of combating telephone fraud stated in the title is not expressed. The research methodology is based on a theoretical approach using methods of analysis, generalization, comparison, and synthesis. The relevance of the research is determined by the importance of designing and implementing automated speech recognition systems, including for combating telephone fraud. The scientific novelty has not been highlighted by the authors and is apparently related to the conclusions that the potential of SER is far from being realized, that SER can act as a separate technology or as an addition to another, and that a separate niche for recognizing emotions in speech is countering telephone fraud. This conclusion seems trivial. The article is written in literary Russian; the scientific style of presentation, however, in places resembles machine translation: the abbreviation SER is not spelled out at its first mention, and some section headings and figure contents are presented in English. The structure of the manuscript includes the following sections: Introduction (SER, emotion recognition, voice assistants, methods, deep learning experiments in improving the pre-processing of audio files, classification models, hybrid systems, datasets, the problem of external noise); Stages of working with sound files (the SER pipeline, advanced audio processing, feature extraction, classification); Normalization of characteristics and noise reduction (models and methods of noise reduction); Feature extraction (the most common features: excitation, prosodic features of speech (pitch, strength, intensity, rhythm, etc.),
acoustic characteristics of the environment, linguistic features of speech, the main approaches); Local short-term features (formants, pitch, log energy, dynamic modeling); Global long-term features (mean pitch, standard deviation of pitch, total energy over the whole sound segment); Feature sets (sets of features for SER: INTERSPEECH, audEERING, eGeMAPS, ComParE); The task of recognizing speech and emotion in speech (converting the audio file, a visual representation of signal energy by frequency, mel-frequency cepstral coefficients, the stages of initial audio-signal processing, the study of models that do not require a ready-made set of features, convolutional neural networks); Methods of classification (classification using linear and nonlinear classifiers, Bayesian networks, the maximum-likelihood method, support vector machines, Gaussian mixture models, hidden Markov models, the k-nearest-neighbors method, principal component analysis and decision trees); Databases used for recognition of emotions in speech (simulated, induced and natural databases; Berlin Database of Emotional Speech, Danish Emotional Speech Database, Interactive Emotional Dyadic Motion Capture Database, INTERFACE05, LDC Emotional Prosody Speech and Transcription, SAVEE (Surrey Audio-Visual Expressed Emotion), Speech Under Simulated and Actual Stress, Ruslana, RAVDESS); Conclusions; Bibliography. The text includes two figures and one table. The table must have a name, and the contents and names of the figures should be given in Russian. The figures are numbered 4 and 5 (Fig. 4, Fig. 5), with Fig. 5 appearing before Fig. 4; the figures and the table should be referenced in the preceding text. The content as a whole does not match the title: the article mainly presents the technical aspects of the problem, while the specifics of telephone fraud and security in general are not expressed, which does not correspond to the scope of the National Security journal.
The bibliography includes 43 sources by foreign authors: monographs, scientific articles, and materials of scientific events. Bibliographic descriptions of some sources require adjustment in accordance with GOST and the editorial requirements, for example: 1. Picard R. W. Affective Computing. Place of publication???: MIT Press, 2000. 292 p. 2. Deusi J. S., Popa E. I. An Investigation of the Accuracy of Real Time Speech Emotion Recognition // Lecture Notes in Computer Science. 2019. P. 336-349. 3. Schuller B. W. Speech emotion recognition // Communications of the ACM. 2018. Vol. 61, No. 5. P. 90-99. 6. Vogt T., Andre E. Comparing Feature Sets for Acted and Spontaneous Speech in View of Automatic Emotion Recognition // 2005 IEEE International Conference on Multimedia and Expo. Place of publication???: name of the publishing house, year of publication???. P. ???-???. Attention is drawn to the lack of references to works published in Russian publications. An appeal to opponents (Picard R. W., Deusi J. S., Popa E. I., Schuller B. W., Khalil R. A., Ivanović M., Vogt T., Andre E., Pohjalainen J., Chenchah F., Lachiri Z., Koolagudi S. G., Ayadi M. E., Trigeorgis G., Fayek H. M., Lech M., Cavedon L., Schuller B., Vlasenko B., Xiao Z., Verkholyak O., Kaya H., Karpov A., Friedland G., Eyben F., Huang X., Jaitely N., Hinton G., Sainath T. N., Palaz D., Collobert R., Dieleman S., Schrauwen B., Sigtia S., Benetos E., Dixon S., Mao Q., Dileep A. D., Sekhar C. C., Nwe T. L., Foo S. W., De Silva L. C., Yun S., Yoo C. D., Mao Q., Wang X., Zhan Y., Pao T.-L., Morrison D., Wang R., De Silva L. C., Ververidis D., Kotropoulos C., Tahon M., Devillers L., Eyben F., Lim W., Jang D., Lee T., Schmidhuber J., Sainath T. N., Schluter J., Bock S., Abdel-Hamid O., Wu A., Huang Y., Zhang G., Fayek H. M., Lech M., Cavedon L., Busso C., Martin O. et al.) takes place.
In general, the material is of interest to the readership, but needs to be finalized, after which the manuscript can be considered for publication in the journal "National Security / nota bene" or "Software Systems and Computational Methods".

Second Peer Review


The article submitted for review examines the recognition of human emotions by voice in the fight against telephone fraud. The research methodology is based on the application of intelligent approaches to voice recognition, machine learning and artificial intelligence methods, in particular artificial neural networks. The relevance of the work is due to the fact that the approach proposed by the authors, countering telephone fraud through analysis of the emotional component of speech using intelligent approaches to voice recognition, facilitates and accelerates the detection of telephone fraud. The scientific novelty of the reviewed study, in the reviewer's opinion, lies in the generalization and systematization of methods, problems and trends in the field of voice recognition that can be integrated into anti-telephone-fraud systems in order to ensure national security. The article structurally highlights the following sections: Stages of working with audio files, Feature extraction, Local short-term features, Global long-term features, Feature sets, Classification methods, Databases used to recognize emotions in speech, Conclusions, and Bibliography. The authors consider the two most popular models reflecting the spectrum of human emotions: discrete classes, exemplified by a model based on P. Ekman's theory of the six basic groups of emotions, and an approach that uses several axes (two or three), evaluating, for example, the level of arousal on one axis and positivity/valence on another. The article outlines three main stages of emotion recognition in speech: sound preprocessing, feature extraction and classification. To convert the sound into a mel spectrogram, the audio track is divided into small segments, called frames.
The importance of creating effective and universal sets of significant characteristics that would work equally well across research environments and tasks, including those suitable for countering telephone fraud, is noted. The article discusses the effectiveness of architectures with recurrent networks, recurrent neural networks and long short-term memory for speech analysis, natural language processing systems and emotion recognition in speech. The bibliographic list includes 38 sources: publications of domestic and foreign scientists on the topic of the article. The text contains targeted references to literary sources, confirming an appeal to opponents. As a comment, it can be noted that the authors, for some reason, did not title the initial part of the article; it would seem appropriate to call it an introduction. The reviewed material corresponds to the scope of the journal "National Security", has been prepared on a relevant topic, and contains theoretical justifications, elements of scientific novelty and practical significance. Despite the absence in the article of a completed model for recognizing human emotions by voice in the fight against telephone fraud, the presented material generalizes modern ideas about pattern-recognition tasks in a specific application area and may interest readers; it is therefore recommended for publication after some revision in accordance with the comment made.