Systems analysis, search, analysis and information filtering
Romashko D.A., Medvedev A.Y. —
Using word2vec in clustering operons
// Software systems and computational methods. – 2018. – № 1. – P. 1-6.
Abstract: This article addresses the task of clustering operons (special units of genetic information) in order to identify groups of operons with similar functions. The authors consider the specifics of the open operon databases used as sources of initial data for the study. They describe the selection and preparation of data for clustering, the features of the clustering process, and its relationship to the approaches traditionally used for natural language analysis. The quality and composition of the resulting groups are then analyzed. The raw data are converted into vectors using the classical implementation of the word2vec algorithm together with a number of features of the original data. The resulting representation is clustered with the DBSCAN algorithm using cosine distance. The novelty of the proposed method lies in applying algorithms that are not standard for this kind of initial data. The approach performs well on large amounts of data, requires no additional data labeling, and derives the factors used for clustering on its own. The results show that the proposed approach can be used to implement services for comparative analysis of bacterial genomes.
Keywords: clustering, DBSCAN, word embeddings, word2vec, machine learning, methods, algorithms, operons, natural language processing, open access databases
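The clustering step named in the abstract (DBSCAN over cosine distance) can be illustrated with a minimal sketch. It is not the authors' implementation: the operon names, the three-dimensional vectors, and the eps and min_samples values below are hypothetical stand-ins for embeddings that a word2vec model would produce, and the small dbscan function is written out in plain Python only so that the logic is visible. In practice a library implementation such as scikit-learn's DBSCAN with metric="cosine" would likely be used instead.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(u, v): 0 for the same direction, 2 for opposite."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def dbscan(vectors, eps, min_samples):
    """Minimal DBSCAN. Returns one label per vector; -1 marks noise."""
    n = len(vectors)
    # Precompute eps-neighbourhoods; each point counts as its own neighbour.
    neighbours = [
        [j for j in range(n) if cosine_distance(vectors[i], vectors[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # j is a core point: keep expanding
    return labels


# Hypothetical embeddings standing in for word2vec output (assumption).
operon_vectors = {
    "operon_A": [0.9, 0.1, 0.0],
    "operon_B": [0.8, 0.2, 0.1],
    "operon_C": [1.0, 0.0, 0.1],
    "operon_D": [0.0, 0.9, 0.2],
    "operon_E": [0.1, 1.0, 0.1],
    "operon_F": [0.0, 0.8, 0.3],
    "operon_G": [0.0, 0.1, 1.0],
}
names = list(operon_vectors)
labels = dbscan([operon_vectors[name] for name in names], eps=0.05, min_samples=3)
clusters = dict(zip(names, labels))
print(clusters)
```

Cosine distance ignores vector length and compares direction only, which is a common choice for embedding vectors. Points that do not fall into any dense region keep the label -1, so no number of clusters has to be fixed in advance.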
Taboada B., Ciria R., Martinez-Guerrero C.E., Merino E. ProOpDB: Prokaryotic Operon DataBase // Nucleic Acids Research. – 2012. – № 40. – P. 627-631.
Taboada B., Verde C., Merino E. High accuracy operon prediction method based on STRING database scores // Nucleic Acids Research. – 2010. – № 38. – P. 130.
McKinney W. Data Structures for Statistical Computing in Python // Proceedings of the 9th Python in Science Conference. – 2010. – P. 51-56.
van der Walt S., Colbert S.C., Varoquaux G. The NumPy Array: A Structure for Efficient Numerical Computation // Computing in Science & Engineering. – 2011. – P. 22-30.
Pertea M., Ayanbule K., Smedinghoff M., Salzberg S.L. Prediction of Operons in Microbial Genomes // Nucleic Acids Research. – 2008. – P. 479-482.
Pedregosa F. et al. Scikit-learn: Machine Learning in Python // JMLR. – 2011. – P. 2825-2830.
Rehurek R., Sojka P. Software Framework for Topic Modelling with Large Corpora // In Proceedings of the LREC 2010 Workshop on New Challenge