Published in journal "Software systems and computational methods", 2016-4 in rubric "Systems analysis, search, analysis and information filtering", pages 383-391.
Abstract: This article reviews the problems of automatically processing web content. Because information on the global network becomes obsolete very quickly, the prompt extraction of required data from the Internet is an increasingly urgent problem. The research focuses on web resources containing text that is not adapted for automated processing. The subject of the research is a set of software tools and methods. Particular attention is paid to the categorization of ads placed on specialized websites. The authors also review practical aspects of developing a universal architecture for information-gathering systems. The study relied on an analytical review of the main principles of developing automated information-gathering systems and of natural language analysis. Methods of synthesis and analysis were used to obtain practice-oriented results. The authors' particular contribution is the development of an automated system for collecting, processing, and classifying the information contained on a website. The novelty of the research lies in a new approach to this problem that takes into account the semantics and structure characteristic of specific sites. The main conclusion of the study is that the classification method is applicable and effective for solving this problem.
Keywords: machine learning, web robots, information collection, classification system, website categorization, text analysis, parsing, data processing, crawling, big data