Mal'shakov G.V., Mal'shakov V.D. Technique of normalization of the alphabet of search for quality improvement of entity identification based on data frequency characteristics

Published in the journal "Software systems and computational methods", 2015-4, in the rubric "Software for innovative information technologies", pages 407-413.

Abstract: Using the frequency distributions of data as an identifier, it is possible to find the data of one system in other systems intended to interact with it and to coordinate their work. In this case, the entities of a subject domain are identified using an alphabet of search. An alphabet of search is a set of lexemes together with the frequencies of their use in the data, stored as records of a relational database. The object of the research is a technique for normalizing the alphabet of search in order to improve the quality of entity identification in a subject domain based on the frequency characteristics of the entities' data. The technique deletes from the alphabet those lexemes that occur inside other lexemes of the alphabet and have a similar frequency of repetition in the entity. The research methods include systems analysis, information theory, the theory of algorithms, the algebra of logic, set theory, comparative analysis, data mining methods, and methods of software and database development. The authors show experimentally (on a sample of 178 entities) that the technique reduces the size of the alphabet of search by a factor of 5 on average, which considerably increases the speed of identifying entities by the frequency characteristics of their data. By reducing the number of shorter lexemes, the normalization technique also reduces the recognition error by 0.02036 per identification on average, as the experiments show.
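The normalization step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `normalize_alphabet`, the representation of the alphabet as a dictionary, the reading of "found in other lexemes" as substring containment, and the relative `tolerance` used to decide that two frequencies are "similar" are all assumptions made for illustration.

```python
def normalize_alphabet(alphabet, tolerance=0.1):
    """Return a copy of the alphabet without redundant shorter lexemes.

    alphabet  -- dict mapping lexeme (str) to its frequency in the entity
    tolerance -- assumed relative difference below which two frequencies
                 count as "similar" (hypothetical parameter)
    """
    kept = {}
    for lexeme, freq in alphabet.items():
        # A lexeme is redundant if it occurs inside a longer lexeme of the
        # same alphabet whose frequency is close to its own.
        redundant = any(
            len(other) > len(lexeme)
            and lexeme in other
            and abs(freq - other_freq) <= tolerance * max(freq, other_freq)
            for other, other_freq in alphabet.items()
        )
        if not redundant:
            kept[lexeme] = freq
    return kept


# Hypothetical alphabet for one entity.
sample = {"data": 10, "database": 10, "base": 25, "search": 7}
normalized = normalize_alphabet(sample)
```

With this sample, "data" is dropped because it is contained in "database" and has the same frequency, while "base" is kept because its frequency differs too much from that of "database". Comparing every lexeme with every other is quadratic in the size of the alphabet, which is acceptable for a sketch but would likely need indexing for large alphabets.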

Keywords: correlation, frequency analysis of data, entity, search, the alphabet, normalization, database, software, identification, method

DOI: 10.7256/2305-6061.2015.4.17813

