Software systems and computational methods
Reference:

Application of Thematic Modeling Methods in Text Topic Recognition Tasks to Detect Telephone Fraud

Pleshakova Ekaterina Sergeevna

ORCID: 0000-0002-8806-1478

PhD in Technical Sciences

Associate Professor, Department of Information Security, Financial University under the Government of the Russian Federation

125167, Russia, Moscow, 4th Veshnyakovsky Ave., 12k2, building 2

espleshakova@fa.ru
Gataullin Sergei Timurovich

PhD in Economics

Dean of "Digital Economy and Mass Communications" Department of the Moscow Technical University of Communications and Informatics; Leading Researcher of the Department of Information Security of the Financial University under the Government of the Russian Federation

8A Aviamotornaya str., Moscow, 111024, Russia

stgataullin@fa.ru
Osipov Aleksei Viktorovich

PhD in Physics and Mathematics

Associate Professor, Department of Data Analysis and Machine Learning, Financial University under the Government of the Russian Federation

125167, Russia, Moscow, 4th Veshnyakovsky Str., 4, building 2

avosipov@fa.ru
Romanova Ekaterina Vladimirovna

PhD in Physics and Mathematics

Associate Professor, Department of Data Analysis and Machine Learning, Financial University under the Government of the Russian Federation

125167, Russia, Moscow, 49/2 Leningradsky Ave.

EkVRomanova@fa.ru
Marun'ko Anna Sergeevna

Student, Department of Data Analysis and Machine Learning, Financial University under the Government of the Russian Federation

49/2 Leningradsky Ave., Moscow, 125167, Russia

marunko94@gmail.com

DOI: 10.7256/2454-0714.2022.3.38770

EDN: RPLSLQ

Received: 09-09-2022

Published: 16-09-2022


Abstract: The Internet has emerged as a powerful infrastructure for worldwide communication and human interaction. Unethical uses of this technology (spam, phishing, trolling, cyberbullying, viruses) have created the need for mechanisms that guarantee affordable and safe use. Many studies are currently devoted to detecting spam and phishing, and detecting telephone fraud has become critically important, as such fraud entails huge losses. Machine learning and natural language processing algorithms are used to analyze huge amounts of text data. Fraudsters can be identified with text mining by analyzing terms at the word or phrase level. One of the difficult tasks is dividing this huge body of unstructured data into clusters, and several topic modeling approaches exist for this purpose. This article presents the application of such models, in particular LDA, LSI and NMF. A data set was formed, a preliminary analysis of the data was carried out, and features were constructed for models in the task of recognizing the topic of a text. Approaches to keyword extraction in text topic recognition tasks are considered, and their key concepts are given. The shortcomings of these models are shown, and directions for improving text processing algorithms are proposed. The quality of the models was evaluated, and the models were improved through hyperparameter selection and changes to the data preprocessing function.


Keywords:

natural language processing, information security, machine learning, text analysis, LDA, LSI, NMF, thematic modeling, phone fraud, phishing


The article was prepared as part of the state assignment of the Government of the Russian Federation to the Financial University for 2022 on the topic "Models and methods of text recognition in anti-telephone fraud systems" (VTK-GZ-PI-30-2022).

Introduction

Text mining can be implemented by analyzing terms at the word or phrase level.

NLP (Natural Language Processing) makes it possible to solve tasks such as text recognition, recognition of confidential or spam messages, topic recognition, and document classification. Such analysis helps to process text data in business analytics, to study people's impressions of a brand from social media posts in marketing, and also to detect phishing. Phishing is the most common attack, in which attackers try to steal confidential information by posing as a legitimate source; email is a typical phishing vector. Natural language processing and machine learning methods can be used to solve this problem. The practical significance of teaching models to recognize the meaning of a text is to create a basis for further, deeper analysis.

Natural language, as a rule, is structured and has its own rules: grammar, syntax and semantics (how words with their meanings form sentences, which also have their own meanings) [2,3].

Text topic recognition is a form of classification or summarization. To assign a topic to a text, the text must first be condensed so that its main meaning is extracted; it can then be attributed to a topic [4]. We used text summarization models to recognize the topic, since the task involves not only ordinary classification but also the study of patterns (templates). Moreover, pattern analysis and classical text summarization models make it possible to take advantage of unsupervised learning.

The corpus of texts is a vector space in which each text is a separate vector. The dimensionality of each vector equals the number of unique words and/or phrases in the entire vector space, and the components of the vector are the weights of the terms in the text. This concept underlies essentially every method of constructing features from text data; the methods differ in how words/phrases are defined and how their weights are computed [5].

To determine the topic of a text or to condense it, we will use two techniques: keyword extraction and topic modeling.

Creating a dataset and preprocessing it

The data set that we generated for training and testing was obtained from the Project Gutenberg website. It consists of various collections of fairy tales from different peoples of the world. Each book was provided as a single file, but models of the topic modeling family need a data set of multiple files, i.e., each fairy tale should be in a separate file. To divide about 50 books into separate files, a special function was built using regular expressions; it splits the texts into files and also cleans them of "outliers".
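As an illustration, here is a minimal sketch of such a splitting function. The title pattern is a hypothetical placeholder (it assumes each tale begins with an all-caps title line surrounded by blank lines); the actual pattern used in the authors' repository may differ.

```python
import re
from pathlib import Path

def split_book(book_path, out_dir,
               title_re=r"\n\n([A-Z][A-Z ,'\-]{3,})\n\n"):
    # Hypothetical title pattern: an ALL-CAPS line between blank lines.
    text = Path(book_path).read_text(encoding="utf-8", errors="ignore")
    parts = re.split(title_re, text)
    # With one capture group, re.split returns
    # [preamble, title1, body1, title2, body2, ...]
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for i, (title, body) in enumerate(zip(parts[1::2], parts[2::2])):
        if len(body.strip()) > 500:   # crude "outlier" filter: skip fragments
            (Path(out_dir) / f"tale_{i:04d}.txt").write_text(
                title.strip() + "\n\n" + body.strip())
```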

Thus, we obtained a data set of 1651 fairy tales, which can be explored in more detail at: https://github.com/AnnBengardt/Course_paper_ML/tree/main/Dataset .

After creating the data set, it must be preprocessed before the corpus and models are built. Preprocessing includes removing letters outside the English alphabet (accented and non-Latin characters), stripping HTML tags, special characters (~, *, $, etc.) and numbers, expanding contractions (I'm, he'll, we'd, etc.), collapsing repeated characters, and removing English stop words. The words are also lemmatized.
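A minimal sketch of such a preprocessing function, using NLTK; the exact implementation in the authors' repository may differ:

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

# Small illustrative contraction map; a real one would be larger.
CONTRACTIONS = {"i'm": "i am", "he'll": "he will", "we'd": "we would"}
STOP = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def preprocess(text):
    text = re.sub(r"<[^>]+>", " ", text)        # strip HTML tags
    text = text.lower()
    for short, full in CONTRACTIONS.items():    # expand contractions
        text = text.replace(short, full)
    text = re.sub(r"[^a-z\s]", " ", text)       # drop non-English letters, digits, symbols
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)  # squeeze character repetitions
    return [LEMMATIZER.lemmatize(tok) for tok in text.split()
            if tok not in STOP and len(tok) > 2]
```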

Implementation of keyword extraction (Keyphrases Extraction)

There are two approaches to extracting keywords. The first, collocation extraction, works on the basis of N-grams [6-8]. The idea is to build a corpus in which each document is part of one text, paragraph or sentence. Tokenization then turns the corpus into one long token sequence over which the algorithm "slides", considering each N-gram and counting either its raw frequency or its PMI (Pointwise Mutual Information):

$$PMI(w_1, w_2) = \log \frac{p(w_1, w_2)}{p(w_1)\, p(w_2)}, \qquad (1)$$

where $p(w)$ is the probability of occurrence of a single term and $p(w_1, w_2)$ is the probability of the two terms occurring together.

After calculating one of these metrics, the algorithm sorts the N-grams; the most relevant and irrelevant ones can then be output and analyzed.

The second approach to keyword extraction is weighted tag-based phrase extraction. First, the algorithm extracts all noun phrases using part-of-speech (POS) tags, a common tool in NLP. Next, the TF-IDF model is applied to these excerpts from the text. The most relevant phrases are accordingly those with the highest weights in the resulting matrix.

Collocation finding is well implemented in the NLTK library in the form of BigramCollocationFinder and TrigramCollocationFinder, which search for phrases of two and three words respectively. To rank candidate phrases we will use the raw frequency method.

We apply the collocation finders both to the full data set and to the first five texts only; a sketch is shown below.
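A sketch of this step with NLTK, assuming `tokens` is the whole corpus flattened into a single token list:

```python
from nltk.collocations import (BigramAssocMeasures, BigramCollocationFinder,
                               TrigramAssocMeasures, TrigramCollocationFinder)

# `tokens`: one long list of tokens for the corpus (assumption).
bigram_finder = BigramCollocationFinder.from_words(tokens)
trigram_finder = TrigramCollocationFinder.from_words(tokens)

# Rank by raw frequency, as in the text; use .pmi for the PMI variant.
print(bigram_finder.nbest(BigramAssocMeasures.raw_freq, 10))
print(trigram_finder.nbest(TrigramAssocMeasures.raw_freq, 10))
```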


Figure 1. Bigrams and trigrams for all and the first five texts

 

As a result, we see which bigrams and trigrams are most common across all the data (the first line). In the first case these are fairly generic phrases that convey little: old man, next day, upon time lived, early next morning. In the second case, however, the names of the main characters can already be traced: Earl St. Clair, Robin Redbreast, Elfin King. There is also a Christmas theme of the Northern peoples: cold winters day, Merry Yule (Christmas in Scandinavia).

We now implement weighted tag-based phrase extraction. The keywords function from gensim.summarization produces a result for each fairy tale; every word and phrase receives a weight that determines its significance in the text.
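A sketch of the call; note that gensim.summarization was removed in Gensim 4.0, so this assumes gensim<4, and `tales` (a filename-to-text mapping) is a hypothetical variable:

```python
from gensim.summarization import keywords  # requires gensim < 4.0

for name, tale_text in list(tales.items())[:3]:
    print(name)
    # Returns (phrase, weight) pairs; higher weight = more significant.
    print(keywords(tale_text, words=5, scores=True, lemmatize=True))
```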


Figure 2. The result of weighted extraction of phrases based on tags

In general, key-phrase extraction helps to get an idea of what some texts are about. However, this is not enough to draw definitive conclusions, especially without an understanding of the context. We therefore turn to a family of models better suited to the task: topic modeling.

Building features for modeling topics

The Gensim library was used; it was created specifically for solving text topic recognition problems with unsupervised machine learning [9-11]. Each model in Gensim uses its own way of vectorizing text and constructing features from the resulting vector space, but every model only needs a representation of the text corpus as a bag of words plus a correspondence dictionary (index-word pairs). Thus, for modeling it is enough to transform the set of texts into a bag of words.

To increase the efficiency of the future models, bigrams were first extracted from the corpus vocabulary with the Phrases model before the transformation.

Figure 3. The result of the Gensim Phrases model

In Figure 3 you can see that some words have been merged into bigrams joined by underscores: seven_year, shake_head. A sketch of this step is shown below.
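A minimal sketch, assuming `texts` is the list of tokenized tales; the min_count/threshold values are illustrative, not the authors' exact settings:

```python
from gensim.models.phrases import Phrases, Phraser

bigram_model = Phraser(Phrases(texts, min_count=20, threshold=10))
texts_with_bigrams = [bigram_model[doc] for doc in texts]
# Frequently co-occurring tokens are now joined: "seven_year", "shake_head".
```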

Next, with the help of gensim.corpora.Dictionary, a dictionary was created with all the unique words used in the corpus and their indexes. Initially the dictionary contains 25,743 words, which is redundant for discovering individual topics and may, on the contrary, worsen the models. To complete the task successfully, outliers must be removed.

Initially, the filtering thresholds were chosen to delete words that occurred fewer than 20 times across the dataset, as well as words that appeared in more than 60% of the texts. After several iterations it turned out that a more effective vocabulary is obtained with thresholds of 15 occurrences and 75%.

After cleaning, the dictionary contained 4,963 words, and a bag of words can now be created from it: the dictionary object provides the doc2bow function for this. Each text in the corpus is thus vectorized against the common dictionary, as in the sketch below.
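A sketch of the dictionary construction and filtering described above:

```python
from gensim.corpora import Dictionary

dictionary = Dictionary(texts_with_bigrams)              # ~25,743 unique tokens
dictionary.filter_extremes(no_below=15, no_above=0.75)   # thresholds found iteratively
bow_corpus = [dictionary.doc2bow(doc) for doc in texts_with_bigrams]
```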

Model construction: LSI, LDA, NMF

Topic modeling is a large family of models; in this paper three varieties are used to solve the problem: latent semantic indexing (LSI), latent Dirichlet allocation (LDA) and non-negative matrix factorization (NMF) [12].

The main principle of the LSI model is that similar words and phrases are used in the same context and therefore often occur together. It is thus possible to find terms that correlate with each other and thereby recover the topic of the text [13].

The LSI algorithm is based on singular value decomposition (SVD), an algebraic technique used, among other things, for text summarization. SVD represents the original matrix $M$ as

$$M = U S V^T, \qquad (2)$$

where:

$U$ is a unitary matrix of dimension $m \times m$ such that $U^T U = I_{m \times m}$, where $I$ is the identity matrix; the columns of $U$ are the left singular vectors.

$S$ is an $m \times n$ diagonal matrix with non-negative real numbers on the diagonal; it can also be represented as a vector of dimension $m$ holding the singular values.

$V^T$ is a unitary matrix of dimension $n \times n$ such that $V^T V = I_{n \times n}$; the rows of $V^T$ are the right singular vectors.

In LSI, SVD is used primarily for low-rank approximation: the original matrix $M$ is replaced by a truncated matrix $\hat{M}$ of rank $k$, defined via the SVD as

$$\hat{M} = U \hat{S} V^T, \qquad (3)$$

where $\hat{S}$ is the truncated version of $S$ that keeps only the $k$ largest singular values.
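A small NumPy illustration of the rank-k truncation; the toy random matrix stands in for a real document-term matrix:

```python
import numpy as np

M = np.random.rand(100, 50)                       # toy document-term matrix
U, s, Vt = np.linalg.svd(M, full_matrices=False)  # M = U S V^T

k = 6                                             # latent topics to keep
M_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]     # rank-k approximation
```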

To date, LDA is the most popular and well-studied model in many fields, available in numerous toolkits such as the Machine Learning for Language Toolkit (MALLET). LDA resembles LSI in that both approaches assume that a single text document is influenced by several topics at once, with different probabilities. In LDA, however, these probabilities are governed by the Dirichlet distribution, which can be called a "distribution of distributions" [14], to generalize the data better. The algorithm outputs a topic distribution for each text, from which the key constituent words of each topic can be identified.

NMF is an unsupervised matrix factorization method (from linear algebra) that performs dimensionality reduction and clustering simultaneously. NMF operates only on non-negative matrices. Given a non-negative input matrix $V$, which is typically the feature matrix constructed for modeling, NMF searches for matrices $W$ and $H$ whose product approximately reproduces $V$:

$$V \approx W H, \quad W \ge 0, \; H \ge 0. \qquad (4)$$

The quality of the approximate reconstruction of $V$ is measured with the L2 (Euclidean) norm, which the factorization minimizes:

$$\min_{W, H \ge 0} \; \|V - W H\|_F^2 = \sum_{i,j} \left(V_{ij} - (WH)_{ij}\right)^2. \qquad (5)$$

Implementations of all the selected models build and use the necessary features on their own. For example, LDA uses plain term frequencies (a bag of words), while LSI uses TF-IDF. Consequently, for the subsequent modeling with these three approaches it is necessary to create only the bag of words; the models then transform it into the features they need.

To assess model quality, a separate model, Gensim's CoherenceModel, is used in this paper. It evaluates a model through a pipeline of steps: segmentation (dividing the topic words into pairs for scoring); probability calculation (as with PMI, the probabilities of occurrence of each word and of pairs of words together are computed); and confirmation measurement (using the obtained probabilities, a score is computed for how strongly word A of a pair (A, B) supports word B, and vice versa). All values are then aggregated by a mathematical function into a final coherence score.

Several variants of the score are implemented in CoherenceModel. In this paper, the UMass and C_V measures will be used to compare and evaluate the modeling results.

$$C_{UMass}(w_i, w_j) = \log \frac{D(w_i, w_j) + 1}{D(w_i)}, \qquad (6)$$

where $D(w_i, w_j)$ is how many times the pair of words appeared together and $D(w_i)$ is how many times the word appeared alone [8].

$$NPMI(w_i, w_j) = \frac{\log \frac{P(w_i, w_j) + \epsilon}{P(w_i)\, P(w_j)}}{-\log \left(P(w_i, w_j) + \epsilon\right)}, \qquad (7)$$

where $P(w)$ is the probability of seeing a word in the context window and $P(w_i, w_j)$ is the probability of seeing the two words together in the context window; the C_V score aggregates these values.

The Gensim LSI (Latent Semantic Indexing) model is built from the bag of words and the dictionary, but internally the bag of words is transformed into a TF-IDF matrix. Figure 4 demonstrates one way to visualize the obtained topics: since the LSI model allows words to be divided into two directions leading to different topics, the weights can be both positive and negative.


Figure 4. The result of building the LSI model

 

Topic coherence metrics will be used to evaluate the models; in Gensim they are all implemented in the separate CoherenceModel class. Calculating the C_V and UMass metrics gives each model its scores; a minimal sketch follows.
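A minimal sketch of building the LSI model and scoring it with both coherence measures; num_topics here is illustrative, not the authors' final setting:

```python
from gensim.models import TfidfModel, LsiModel, CoherenceModel

tfidf = TfidfModel(bow_corpus, id2word=dictionary)
lsi = LsiModel(tfidf[bow_corpus], id2word=dictionary, num_topics=10)

for measure in ("c_v", "u_mass"):
    cm = CoherenceModel(model=lsi, texts=texts_with_bigrams,
                        corpus=bow_corpus, dictionary=dictionary,
                        coherence=measure)
    print(measure, cm.get_coherence())
```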

Latent Dirichlet allocation (LDA) works with the bag of words directly, without further transforming this feature. The model again requires the dictionary, plus two distribution parameters, alpha and eta, which are set to auto. The result is again topics, in a number specified in advance, but without division into directions: all weights are positive (Fig. 5). A sketch is shown below.
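A sketch of the LDA setup described above; passes and random_state are illustrative additions:

```python
from gensim.models import LdaModel

lda = LdaModel(corpus=bow_corpus, id2word=dictionary, num_topics=10,
               alpha="auto", eta="auto", passes=10, random_state=42)
print(lda.print_topics(num_words=10))
```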

Figure 5. The result of building the LDA model

 

The coherence scores of this model turn out to be slightly lower than those of LSI.

Non-negative matrix factorization (NMF), like the models above, is built from the dictionary and the bag of words (see the sketch below). The resulting topics, with the phrases and words belonging to them, are shown in Figure 6 (the result is presented without weights, but, as with LDA, they are all positive).
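A sketch using Gensim's NMF implementation; the parameters are illustrative:

```python
from gensim.models.nmf import Nmf

nmf = Nmf(corpus=bow_corpus, id2word=dictionary, num_topics=10,
          random_state=42)
print(nmf.print_topics(num_words=10))
```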

Figure 6. The result of constructing the NMF model

The model is evaluated with the same coherence metrics.

 

Hyperparameter selection and selection of the best model

The results of the first iteration were not the most satisfactory. It is necessary to maximize the metrics used, find out how many topics suit each model, and then choose the best model to predict the topics of the texts in the test sample.

For each type of model, separate functions were implemented to build it repeatedly and collect metric statistics as a function of the number of topics into which the set of texts is divided. Configurations with 5 to 25 separate topics are considered; a sketch of such a sweep is shown below.
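A sketch of such a sweep for LDA; the same pattern applies to LSI and NMF:

```python
from gensim.models import LdaModel, CoherenceModel

def coherence_by_num_topics(model_factory, low=5, high=25):
    """model_factory(k) must return a fitted model with k topics."""
    scores = {}
    for k in range(low, high + 1):
        cm = CoherenceModel(model=model_factory(k), texts=texts_with_bigrams,
                            dictionary=dictionary, coherence="c_v")
        scores[k] = cm.get_coherence()
    return scores

lda_scores = coherence_by_num_topics(
    lambda k: LdaModel(bow_corpus, id2word=dictionary, num_topics=k,
                       alpha="auto", eta="auto", random_state=42))
```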

The results are shown in Figures 7-9: the graphs turned out quite diverse. LDA and NMF roughly agree on how many separate topics there should be (19 and 14), while LSI gave a completely different answer (6 topics).

Figure 7. Metric graphs for the NMF model

Figure 8. Metric graphs for the LSI model

Figure 9. Metric graphs for the LDA model

 

So, the best models are LSI with 6 topics (C_V = 0.3669) and LDA with 19 topics (C_V = 0.3412).

Interpretation of simulation results

We now interpret the results of the two selected best models and check which of the two inherently opposite answers is more suitable.

One possible way to show clearly how the text corpus was divided by the different models is a table of the distribution of topics across texts.


Figure 10. Table of distribution of topics by text for the LSI model

 

Based on the results shown in Figure 10, we conclude that despite the best metrics among the models compared, almost all texts fell into the same category. This table suggests that the sample most likely needs to be improved: more files with more diverse content. Moreover, the topic descriptions show words cluttering the corpus. This could happen, among other things, because some fairy tales are written in an archaic style that the stopwords module does not account for, so words such as "thee", "thy", "sayth" were not deleted.

Despite the unsatisfactory result of the LSI model, the results of the LDA model with a division into 19 topics must also be checked.


Figure 11. Table of distribution of topics by text for the LDA model

 

This time a completely different picture emerges: the documents are distributed almost evenly across the topics, with only a few standing out. At the same time, the content of these highlighted topics can be determined at a glance from the description column. For example, topics No. 3 and No. 10 were clearly formed from Scandinavian myths: giant, Thor, Loki, Odin, Asgard, bear, monster. Topics No. 15 and No. 16 unite Russian fairy tales: tsar, Ivan, serpent, Baba Yaga. Topic No. 12 is about the sea and everything connected with it: sea, fish, fisherman, ship, water, sail and so on.

This LDA model, although its metric value is slightly lower, showed much more successful results. The topics are really visible in the distribution table; let us now visualize them more clearly using an intertopic distance map.

 

Figure 12. Intertopic distance map

 

Each topic is represented on the map by a circle: the larger it is, the more words from the dictionary belong to that topic. On the right are the 30 terms most representative of the selected topic. Red shows the estimated frequency of the term within the selected topic, blue its frequency across the whole corpus. If a word's bar is shaded entirely red, the word belongs wholly to that topic, as in Figure 12: the word sultan occurs only in topic No. 8 (Arabic fairy tales). The map also shows how distant the topics are from one another; here there are examples of complete absorption, with topics No. 18 and No. 19 being subsections of topic No. 3. A sketch of building such a map is shown below.
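Such maps are typically produced with pyLDAvis; a sketch, assuming the fitted `lda` model, `bow_corpus` and `dictionary` from the steps above (pyLDAvis >= 3.2 renamed its Gensim helper to gensim_models):

```python
import pyLDAvis
import pyLDAvis.gensim_models

panel = pyLDAvis.gensim_models.prepare(lda, bow_corpus, dictionary)
pyLDAvis.save_html(panel, "intertopic_map.html")  # open in a browser
```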

It is rational to choose the LDA model with 19 topics for recognizing the topics of the texts in the test set. The results table will show the two dominant topics for each text; a sketch of extracting them follows.
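A sketch of extracting the two dominant topics per test document, assuming `test_texts` holds the tokenized test sample and `bigram_model`, `dictionary`, `lda` come from the steps above:

```python
def two_dominant_topics(model, bow):
    # Full topic distribution, sorted by probability, top two kept.
    dist = model.get_document_topics(bow, minimum_probability=0.0)
    return sorted(dist, key=lambda t: t[1], reverse=True)[:2]

for doc in test_texts:
    bow = dictionary.doc2bow(bigram_model[doc])
    print(two_dominant_topics(lda, bow))
```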

Figure 13. Table of recognized text topics from the test sample

 

Figure 13 clearly shows northern myths, stories about animals, about kings, and about travel.

The full software implementation of this work can be found in the open repository at: https://github.com/AnnBengardt/Course_paper_ML .

Conclusion

Intelligent text analysis is an important task in many domains. The Internet is saturated with information and knowledge that can confuse users and make them spend extra time searching for suitable information on specific topics. At the same time, the need to analyze short texts has become more pressing as microblogging has grown in popularity. Recognizing topics from such text is difficult because it provides relatively little, and noisy, data, which can lead to inaccurate topic recognition. Topic modeling can solve this problem: it is a powerful method that helps detect and analyze content. Topic modeling, an area of text mining, is used to discover hidden patterns; topic modeling and document clustering are two important key terms that are similar in concept and function. In this article, topic modeling was performed with the LSI, LDA and NMF methods. For clarity, we formed a data set consisting of various texts. The methods showed different results; to select the best one, the metrics were maximized, which singled out two models, LSI and LDA. Experimental results show that, after the sample was improved, the LDA model produced much more successful results. Based on the results obtained, a deeper analysis of the text can be carried out. This work allowed us to determine the best model for topic recognition, and the approach can be adapted to various tasks. In particular, it can be used to detect phishing emails.

References
1. Chen C., Wu K., Srinivasan V., Zhang X. Battling the Internet Water Army: Detection of Hidden Paid Posters. arXiv preprint, 18 Nov 2011. http://arxiv.org/pdf/1111.4297v1.pdf
2. Yin D., Xue Z., Hong L., Davison B., Kontostathis A., Edwards L. Detection of harassment on Web 2.0. Proceedings of the Content Analysis in the Web 2.0 Workshop, 2009.
3. Kohonen T. Self-Organization and Associative Memory. 2nd ed. New York: Springer-Verlag, 1988.
4. Niu Y., Wang Y.-M., Chen H., Ma M., Hsu F. A quantitative study of forum spamming using context-based analysis. In: Proc. Network and Distributed System Security (NDSS) Symposium, 2007.
5. Kiselev V.V. Automatic detection of emotions in speech. Educational Technologies, No. 3, 2012, pp. 85-89.
6. Vnebrachnykh R.A. Trolling as a form of social aggression in virtual communities. Bulletin of the Udmurt University, 2012, Issue 1, pp. 48-51.
7. Boltaeva S.V., Matveeva T.V. Lexical rhythms in the text of suggestion. In: Russian Word in Language, Text and Cultural Environment. Yekaterinburg, 1997, pp. 175-185.
8. Gamova A.A., Horoshiy A.A., Ivanenko V.G. Detection of fake and provocative comments in social network using machine learning. In: 2020 IEEE Conference of Russian Young Researchers in Electrical and Electronic Engineering (EIConRus), IEEE, 2020, pp. 309-311.
9. Mei B., Xiao Y., Li H., Cheng X., Sun Y. Inference attacks based on neural networks in social networks. In: Proceedings of the Fifth ACM/IEEE Workshop on Hot Topics in Web Systems and Technologies, 2017, pp. 1-6.
10. Cable J., Hugh G. Bots in the Net: Applying Machine Learning to Identify Social Media Trolls. 2019.
11. Machová K., Porezaný M., Hreškova M. Algorithms of machine learning in recognition of trolls in online space. In: 2021 IEEE 19th World Symposium on Applied Machine Intelligence and Informatics (SAMI), IEEE, 2021, pp. 000349-000354.
12. Tsantarliotis P. Identification of troll vulnerable targets in online social networks. 2016.
13. Mihaylov T., Mihaylova T., Nakov P., Marquez L., Georgiev G.D., Koychev I.K. The dark side of news community forums: opinion manipulation trolls. Internet Research, 2018.
14. Zhukov D., Perova J. A model for analyzing user moods of self-organizing social network structures based on graph theory and the use of neural networks. In: 2021 3rd International Conference on Control Systems, Mathematical Modeling, Automation and Energy Efficiency (SUMMA), IEEE, 2021, pp. 319-322.

Peer Review


The article submitted for review discusses the use of topic modeling methods in text topic recognition tasks for detecting telephone fraud. The research methodology is based on natural language processing and machine learning, specifically the LSI, LDA and NMF methods. The relevance of the work stems from the fact that tasks such as text recognition, recognition of confidential or spam messages, topic recognition and document classification can also serve to detect telephone fraud. The scientific novelty of the reviewed study, in the reviewer's opinion, lies in the proposed topic recognition model, which can be applied to various tasks, including the detection of phishing emails.

The article is structured into the following sections: Introduction; Creating a dataset and preprocessing it; Implementation of keyword extraction; Building features for modeling topics; Hyperparameter selection and selection of the best model; Interpretation of simulation results; Conclusion; and References. The authors built a data set for training and testing from the Project Gutenberg website, comprising various collections of fairy tales from different peoples of the world: 1651 fairy tales in total. On this training sample, three varieties of machine learning models for text topic recognition were built and compared: latent semantic indexing (LSI), latent Dirichlet allocation (LDA) and non-negative matrix factorization (NMF). The bibliography includes 14 sources: publications by domestic and foreign scientists, as well as online resources on the topic of the article. The text contains targeted references to the literature.

The following comments and suggestions can be made. First, the title refers to the use of topic modeling "to detect telephone fraud", while the text deals mainly with phishing attacks, a type of Internet fraud aimed at gaining access to confidential user data such as usernames and passwords. There appears to be some discrepancy here, since a classic telephone is a device for transmitting and receiving only sound at a distance and does not necessarily have Internet access. Second, not all section titles in the article are set in bold, which does not aid the perception of the material. Third, Figures 2-6, which reflect the results of building the machine learning models and are screenshots, are poorly readable due to very small text and fuzzy images.

The reviewed material corresponds to the scope of the journal "Software Systems and Computational Methods", addresses a timely topic, contains theoretical justification and applied development, has elements of scientific novelty and practical significance, and may interest readers; however, in the reviewer's opinion, the article should be revised in accordance with the comments made before publication.