
Historical informatics
Reference:

The world through the eyes of an educated person in Minusinsk of the late XIX - early XX centuries: distribution of the frequency of geographical names in the books of the Minusinsk Public Library

Mekhovskii Vadim Aleksandrovich

ORCID: 0009-0000-7786-0939

Graduate student; Department of Information Technology in Creative and Cultural Industries; Siberian Federal University Specialist of the DHlab Laboratory; Siberian Federal University

660130, Russia, Krasnoyarsk Territory, Krasnoyarsk, Svobodny str., 82, office 402

mehovsky.zenit-champion@yandex.ru
Kizhner Inna Aleksandrovna

ORCID: 0000-0002-0775-9656

PhD in Cultural Studies

Associate Professor; Department of Information Technology in Creative and Cultural Industries; Siberian Federal University Senior Researcher at the DHlab Laboratory; Siberian Federal University

660074, Russia, Krasnoyarsk Territory, Krasnoyarsk, Svobodny str., 82, st1, office 440

inna.kizhner@gmail.com

DOI: 10.7256/2585-7797.2025.1.72586

EDN: QCQWHG

Received: 05-12-2024

Published: 17-04-2025


Abstract: The subject of the study is a corpus of children's literature from the collection of the Minusinsk Public Library of the late XIX – early XX century, consisting of 121 works written between 1719 and 1905. These texts are a significant source for studying how fiction shaped the geographical perception of residents of a provincial Siberian city. Special attention is paid to the geographical names (toponyms) found in the texts, in order to identify their frequency and geographical distribution. This makes it possible to reconstruct the picture of the world presented in the books of that time and to understand how it was perceived by young readers as it shaped their idea of countries, cities and cultural centers. The research aims to study the role of children's literature as a cultural tool that both reflects and forms geographical representations, and to identify the methodological challenges and limitations of working with historical corpora. The method consists of bringing pre-reform texts into machine-readable form using digitization tools, followed by geoparsing to identify geographical entities automatically. The spaCy library was used for the analysis, followed by manual verification and correction of the data. The results include the identification of 668 cities and 97 countries in the texts and a cartographic visualization of the frequency distribution of mentions. The analysis revealed an uneven distribution of geographical names across the texts: Russia, Poland and England prevail among countries, and Kiev, Moscow and St. Petersburg among cities. The results are applicable to research in digital humanities, library science, and historical and cultural studies. 
The novelty of the work lies in the use of modern geoparsing methods for processing Russian-language texts of pre-reform spelling and in the analysis of the previously unexplored literature corpus of the Minusinsk Library. The conclusions emphasize the importance of text mapping for understanding the formation of geographical perception and the need for further development of NER tools for complex corpora. Despite the limitations, the research contributes to the development of NLP methods for historical texts.


Keywords:

Geoparsing, Mapping, Named-entity recognition, Historical Computer Science, Siberia, Minusinsk, World map, Children's literature, Minusinsk Public Library, Pre-reform orthography

This article is automatically translated. You can find original text of the article here.

Introduction

In natural language processing (NLP), named entity recognition (NER) and the mapping of recognized entities play an important role. These tasks involve identifying and classifying text elements such as names of people, organizations, and geographical features, and then linking them to specific categories or knowledge bases. In recent years, approaches to NER have changed considerably with the development of deep learning methods. However, despite significant achievements, most research continues to focus on corpora in widely studied languages such as English, Chinese or Arabic, while Russian remains relatively unexplored.

The present study aims to identify named entities and map them in the corpus of children's literature from the collection of books of the Minusinsk Public Library of the late 19th – early 20th century. The paper presents an algorithm for geoparsing and mapping geographical named entities in texts written in pre-reform orthography. Particular attention is paid to the limitations of the applied methods, such as the uneven distribution of named entities in the corpus: a significant share of the mentions of one entity can be concentrated in a single source, which reduces the representativeness of the mapping results.

An important factor affecting the quality of the results is the condition of the scanned pages. Automatic geoparsing of poorly preserved pages is often accompanied by errors that require manual correction to reduce the number of incorrectly identified or omitted geographical names.

The study also revealed a limitation of the library used for NER. The spaCy library, which was used here, turned out to be unable to process a text of more than one million characters in a single pass. This limitation suggests that spaCy is better suited to relatively small corpora. In addition, most NER libraries rely on existing databases of geographical objects, and if these databases are imperfect, the results may be incomplete.

It cannot be argued that the results obtained fully reflect objective reality, since literary texts do not always accurately reproduce current events or geographical data.

Thus, this study is the first attempt to identify named entities and map them in the children's literature corpus of the Minusinsk Public Library of the late 19th and early 20th centuries, which makes a significant contribution to the development of geoparsing and mapping for the Russian language.

The object of this research is to identify named entities in the corpus of children's literature of the late 19th and early 20th centuries and to map them. The subject is the distribution of the frequency of geographical names in the books of the Minusinsk Public Library of the late 19th and early 20th centuries.

The scientific novelty of the research consists primarily in the fact that for the first time the identification and mapping of geographical named entities in the corpus of Russian-language children's literature of the XIX - XX centuries was carried out. The limitations of existing geoparsing methods in relation to pre-reform Russian texts are also studied in detail. For the first time, a geographical analysis of the perception of the world through the fiction of the Minusinsk Public Library has been formed, which can contribute to the further development of humanitarian digital research.

Overview of related works

With digitalization and the rapid growth in the volume of digitized documents, geoparsing has become increasingly popular among researchers in the humanities. Geoparsing includes two consecutive stages: recognition of toponyms and their mapping [1]. Toponym recognition is often considered a subtask of named entity recognition (NER) or, more precisely, of named entity recognition and classification (NERC) [2]. Mapping involves linking recognized toponyms to the corresponding geographical coordinates and visualizing them on a map.

Historically, methods for recognizing named entities relied on rules derived from the linguistic characteristics of the text and on specialized dictionaries [2]. Such approaches required significant manual effort and did not adapt well to new domains or languages. With the development of machine learning and, in particular, deep learning, approaches to geoparsing have changed considerably. Deep learning models, including recurrent neural networks (RNNs), long short-term memory (LSTM) networks, and transformers, have demonstrated marked improvements in named entity recognition [3].

A large knowledge base related to geoparsing has already been formed in the foreign scientific literature. A large number of studies have been conducted that use different approaches. They can be divided into three major areas:

1) Description of the unique geoparsing algorithm for the selected corpus and interpretation of the results from a humanitarian point of view [5, 6, 7].

2) Comparative analysis of the effectiveness of several geoparsers on one or more corpora [4, 8, 9].

3) The search for solutions to improve the results of geoparsing [8, 10, 11].

Comparative analysis of geoparsers

In a comparative analysis of several geoparsers, one corpus is processed by several methods for identifying named entities, the results are compared, and the best geoparser is selected. In a study conducted by researchers from the USA, the following geoparsers were analyzed: spaCy, NeuroTPR, Edinburgh Geoparser and Encoder [4]. The English-language corpora LGL, GeoVirus and WikToR were used for this purpose. The study describes problems that are quite common in geoparsing. First of all, geoparsing is biased towards more developed regions of the world with large linguistic corpora, which in turn affects the representativeness of the research results. NER is also imperfect because mapping may fail due to toponymic ambiguity [8]. This refers to identical geographical names, for which the geographical uncertainty has to be resolved by means of an additional contextual toponym. In most cases, capitals and other important cities are chosen for this purpose.
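The idea of resolving an ambiguous name with a contextual toponym can be illustrated with a minimal sketch. It assumes that the candidate coordinates are already known and that the nearest candidate to the context place is the right one; the function names `haversine_km` and `disambiguate` are hypothetical and do not come from any of the geoparsers mentioned above.

```python
import math

def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def disambiguate(candidates, context_point):
    """Pick the candidate whose coordinates lie closest to the context toponym."""
    return min(candidates, key=lambda name: haversine_km(candidates[name], context_point))

# Two places that share the name "Paris" (approximate coordinates).
PARIS = {"Paris, France": (48.8566, 2.3522), "Paris, Texas": (33.6609, -95.5555)}
LYON = (45.7640, 4.8357)     # context toponym found near the mention
DALLAS = (32.7767, -96.7970)

print(disambiguate(PARIS, LYON))
print(disambiguate(PARIS, DALLAS))
```

Distance to a single context place is only one possible heuristic; real geoparsers may also weigh population, country codes or several context toponyms at once.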

In a study on the recognition of geographical names in an 8th-century text translated into English from Armenian, the Austrian authors rank geoparsing methods by the quality of their results as follows. Flair performed best, TagMe slightly worse, spaCy came third, and NLTK last. It is worth noting that TagMe performs best at finding obsolete place names because it draws on the Wikipedia corpus [9].

To improve geoparsing results, researchers often create their own geoparsers based on neural networks of different types and scales. Such geoparsers improve the accuracy and efficiency of NER by applying deep learning, random forest, and naive Bayes classification techniques. These methods determine whether words or phrases belong to certain categories, such as names of people, geographical names, organizations, and dates. To prepare a model, a training corpus is created on which the model is trained, after which its performance is checked on test corpora. The limitations in this case are straightforward. A custom model takes a long time to train, which makes it difficult to conduct research quickly. The training sample also plays a major role: the larger and cleaner it is, the better the geoparser performs [8].

There are also attempts to develop custom methods for identifying named entities in corpora created for specific research projects. The work [10] is indicative here: its authors use their own tools at each stage of the study (preprocessing, text recognition and post-processing). At the preprocessing stage, digitized pages are binarized, that is, converted to black and white, and unwanted distortions on the pages are removed. The preprocessed pages are then passed to the recognition stage. Finally, at the post-processing stage, the quality of the processed data is checked and a decision is made on whether the book is suitable for further work or needs to be processed again. The whole process takes quite a long time (about three hours per 500 pages), but this time can be reduced significantly by adding servers or improving their performance. It is also worth noting that semantic analysis requires a recognition accuracy of at least 90%.
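Binarization can be shown in its simplest form as a fixed-threshold operation on a grayscale image. The sketch below is only an illustration: the threshold value of 128 and the function name `binarize` are assumptions, and the method actually used in [10] is not described here and is probably more elaborate.

```python
def binarize(gray, threshold=128):
    """Map each grayscale pixel (0-255) to 0 (black) or 255 (white)."""
    return [[255 if px >= threshold else 0 for px in row] for row in gray]

# A tiny 2x3 "page": dark ink pixels and light paper pixels.
page = [[12, 200, 130],
        [90, 127, 255]]

for row in binarize(page):
    print(row)
```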

The work on neural networks for nested named entities is of particular interest. An example of such an entity is the "Supreme Court of Florida", which contains two overlapping entities, "Supreme Court of Florida" and "Florida". Researchers from the Czech Republic propose two neural network architectures for recognizing nested named entities and evaluate them on four nested-entity corpora: the English ACE-2004, ACE-2005 and GENIA, and the Czech CNEC [11].

The first model combines the multiple labels of nested entities into a single multilabel, which is then predicted using the standard LSTM-CRF model. Here a label refers to the class of the named entity, for example, a geographical name. This is important because nested entities may contain entities belonging to other classes.

In the second model, nested entities are encoded as a sequence, and the task is treated as a sequence-to-sequence (seq2seq) problem in which the input sequence consists of tokens (word forms) and the output sequence consists of labels. The decoder predicts labels for each token until it reaches a special end-of-word label, after which it proceeds to the next token [11].

The authors conclude that multilabel LSTM-CRF modeling is better suited to presumably less nested and flat corpora, while the sequence-to-sequence architecture captures more complex relationships between nested entities.
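The label-combination idea behind the first model can be sketched on the "Supreme Court of Florida" example. This is a simplified illustration under stated assumptions: it uses plain B/I/O prefixes and a "|" separator, the function name `multilabel_tags` is hypothetical, and the exact encoding used in [11] may differ.

```python
def multilabel_tags(n_tokens, spans):
    """Merge possibly nested (start, end, label) spans into one tag per token.

    `end` is exclusive. Outer (longer) spans are written first in each tag.
    """
    tags = [[] for _ in range(n_tokens)]
    # Process spans left to right, longer spans before shorter ones.
    for start, end, label in sorted(spans, key=lambda s: (s[0], -(s[1] - s[0]))):
        for i in range(start, end):
            prefix = "B" if i == start else "I"
            tags[i].append(f"{prefix}-{label}")
    return ["|".join(t) if t else "O" for t in tags]

tokens = ["Supreme", "Court", "of", "Florida", "ruled"]
spans = [(0, 4, "ORG"), (3, 4, "LOC")]  # "Florida" is nested inside the court name

for token, tag in zip(tokens, multilabel_tags(len(tokens), spans)):
    print(token, tag)
```

Each combined tag can then be treated as a single class by a flat sequence tagger, which is what makes a standard tagging model applicable to nested entities.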

Research data source

Our research does not aim to create fundamentally new methods of geoparsing; we concentrated on applying ready-made methods and interpreting the results from a technical point of view. The study carried out a geographical analysis of a literary corpus. The focus was on the children's literature section of the collection of books of the Minusinsk Public Library of the late 19th and early 20th centuries. The study included 121 works written between 1719 and 1905. It is worth noting that we found no scientific work on the geographical distribution of place names in books from libraries in Siberia. The paper shows how a resident of a provincial Siberian city formed a spatial representation of the world. It is assumed that this happened through gradual exposure to the geographical names that appear in works of fiction written for young readers. The purpose of the study is to obtain a geographical distribution of the frequency of mentions of locations in the children's literature section of the collection of books of the Minusinsk Public Library of the late 19th and early 20th centuries.

Materials and methods

Making it machine-readable

The first stage of the geographical analysis of the distribution of locations in the studied corpus is the conversion of the book texts from the children's collection of the Minusinsk Public Library of the late 19th and early 20th centuries into a machine-readable format. This process involves digitizing book pages (photographs or scans) and converting pre-reform spelling into modern spelling. ABBYY FineReader 15 was used to convert PDF documents into a text format suitable for computer analysis. The conversion of pre-reform spelling to modern spelling was automated using Python code and the prereform2modern library. As a result, 119 text files were obtained: 32 in English (translated texts could not be found in any aggregator) and 87 in Russian. After this stage, the corpus was ready for the extraction of geographical names and their subsequent mapping.
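To give an idea of what such a spelling conversion involves, here is a deliberately simplified sketch. It is not the API of the conversion library named above: the function `modernize` is hypothetical and covers only two of the 1918 reform rules, the replacement of obsolete letters and the removal of the word-final hard sign. Other rules, such as adjective endings in -аго, are left out.

```python
import re

# Obsolete letters and their modern equivalents.
CHAR_MAP = str.maketrans({
    "\u0463": "е", "\u0462": "Е",  # yat
    "\u0456": "и", "\u0406": "И",  # decimal i
    "\u0473": "ф", "\u0472": "Ф",  # fita
    "\u0475": "и", "\u0474": "И",  # izhitsa
})

def modernize(text):
    """Apply a small subset of the 1918 spelling reform to a pre-reform text."""
    text = text.translate(CHAR_MAP)
    # Drop the hard sign only at the end of a word; inside a word it is kept.
    return re.sub(r"[ъЪ]\b", "", text)

print(modernize("въ Кіевѣ"))
print(modernize("объявить"))
```

The hard sign inside "объявить" is preserved because it is followed by another letter, while the final hard sign of the preposition is removed.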

It is important to note that the automatic conversion of texts into a machine-readable format does not exclude the possibility of errors. These errors may be due to the condition of the source books, many of which, due to their age, have not been preserved in the best quality. This can lead to incorrect recognition of words or their complete omission. To minimize the impact of this factor on the results of the study, the problematic pages were manually recognized. In cases where even manual recognition was impossible, such pages were excluded from the analysis.

Geoparsing

The next stage of the research is geoparsing, which includes three key steps: extracting geographical names from a corpus, checking and correcting the results manually, and combining the data into a single corpus.

To automatically extract geographical named entities from the texts, Python code was developed using the spaCy library. However, checking the results showed that the algorithm also identified entities that are not geographical objects. All the results were therefore carefully checked manually, and irrelevant entities were deleted.
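A minimal sketch of such an extraction step is given below. It is not the authors' code: the model name `ru_core_news_sm`, the label set and both function names are assumptions, since the paper does not say which spaCy model was used. spaCy is imported lazily so that the counting logic can be tried without it.

```python
from collections import Counter

# Assumption: entity label names differ between spaCy models
# (for example LOC in Russian models, GPE and LOC in English ones).
GEO_LABELS = {"LOC", "GPE"}

def count_places(entities, geo_labels=GEO_LABELS):
    """Count (text, label) pairs whose label marks a geographical entity."""
    return Counter(text for text, label in entities if label in geo_labels)

def spacy_entities(text, model_name="ru_core_news_sm"):
    """Return (lemma, label) pairs; needs spaCy and the named model installed."""
    import spacy  # imported here so count_places works without spaCy
    nlp = spacy.load(model_name)
    return [(ent.lemma_, ent.label_) for ent in nlp(text).ents]

# Hand-made entity list standing in for the output of spacy_entities().
sample = [("Москва", "LOC"), ("Пушкин", "PER"), ("Москва", "LOC"), ("Россия", "LOC")]
print(count_places(sample))
```

A name such as "Пушкин" can denote either a person or a town, which is one reason why automatic labels need the manual check described above.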

To ensure correct operation at the mapping stage, it was necessary to combine the geoparsing results for each publication into one common file and calculate the frequency of mentions of specific geographical objects across all books of the corpus. This task was also implemented in Python. After this stage of geoparsing, the corpus was ready for further analysis and mapping.
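The merging step can be sketched as follows. The book identifiers and counts are made-up illustration data, and `merge_counts` is a hypothetical helper rather than the authors' implementation.

```python
from collections import Counter

def merge_counts(per_book):
    """Sum per-book mention counts into corpus-wide totals."""
    total = Counter()
    for counts in per_book.values():
        total.update(counts)  # Counter.update adds counts instead of replacing them
    return total

# Hypothetical per-book results after manual verification.
per_book = {
    "book_a": Counter({"Moscow": 3, "Kiev": 1}),
    "book_b": Counter({"Moscow": 2, "Rome": 4}),
}

totals = merge_counts(per_book)
for name, n in totals.most_common():
    print(name, n)
```

Keeping the per-book counters alongside the totals is useful later, because the share of each book in the total can then be computed for any place name.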

Interim results

In total, the list contains the names of 668 cities and 97 countries. To visualize the results, pie charts were constructed for the ten most frequently mentioned countries and cities (Figs. 2, 3).

Figure 2 – Diagram of the distribution of countries represented in the children's literature section of the collection of books of the Minusinsk Public Library of the late 19th – early 20th centuries

Figure 3 – Diagram of the distribution of cities represented in the children's literature section of the collection of books of the Minusinsk Public Library of the late 19th – early 20th centuries

The ten most frequently mentioned countries account for 61% of all country mentions (Fig. 2). It is not surprising that Russia is in first place. It is noteworthy that Poland is in second place.

The data in Figure 3 show that mentions of cities are more diverse than mentions of countries, evidently because there are more cities. The top ten account for 38% of all city mentions, and Kiev unexpectedly takes first place instead of the expected Moscow or St. Petersburg. The Indian city of Bombay (modern Mumbai) is also mentioned very often (4th place).
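The share of the top ten can be computed from the merged frequency list in a few lines. The sketch below uses made-up counts, not the study's data, and the helper name `top_n_share` is hypothetical.

```python
def top_n_share(counts, n=10):
    """Percentage of all mentions that fall on the n most frequent names."""
    total = sum(counts.values())
    top = sorted(counts.values(), reverse=True)[:n]
    return 100 * sum(top) / total if total else 0.0

# Hypothetical counts for illustration only.
toy_counts = {"A": 50, "B": 30, "C": 15, "D": 5}
print(top_n_share(toy_counts, n=2))  # 80.0
```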

Mapping

Once the geoparsing of the corpus has been completed and the data corrected, the geographical analysis of the distribution of locations can begin. As part of our research, maps of the distribution of mentions of countries and cities were created. Microsoft Excel was used to build the maps.

When visualizing the distribution of countries, the gradient fill method was used: the countries with the largest number of mentions were highlighted in a richer color, and the countries with the minimum number of mentions were highlighted in a less pronounced shade. The countries that were not mentioned in the texts remained colored in a neutral gray color.

To map the identified cities, another technique was used – a point heat map, where each mention of the city was displayed as a heat spot. This map allows you to visualize the concentration of mentions of cities throughout the analyzed corpus.
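The maps in this study were built in Microsoft Excel. Purely as an illustration of the gradient fill principle used for countries, the sketch below maps a mention count to a colour. The linear scale, the colour values and the function name `shade` are all assumptions and are not taken from the study.

```python
def shade(count, max_count,
          low=(222, 235, 247), high=(8, 48, 107), missing=(200, 200, 200)):
    """Return an RGB hex colour for a country.

    Countries with no mentions get a neutral grey; others get a linear
    blend between a light and a saturated colour.
    """
    if count <= 0:
        return "#%02x%02x%02x" % missing
    t = count / max_count
    rgb = tuple(round(l + t * (h - l)) for l, h in zip(low, high))
    return "#%02x%02x%02x" % rgb

print(shade(0, 889))    # not mentioned: grey
print(shade(889, 889))  # most mentioned: darkest shade
print(shade(497, 889))  # an intermediate shade
```

With strongly skewed counts a logarithmic scale may separate the less frequent countries better than a linear one.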

Results and interpretation

As part of the study, maps of the distribution of mentions of countries and cities were created. A gradient fill is used on the country distribution map (Fig. 4): the countries with the largest number of mentions are highlighted in the brightest shades, while the countries that were not mentioned in the analyzed corpus are marked in gray. Among the most frequently mentioned countries are Russia, Poland and England.

Figure 4 is of particular interest because it allows us to determine which countries were not mentioned in the corpus under study. Among them are Kazakhstan, Tajikistan, Uzbekistan, Argentina, Pakistan, Indonesia and some African countries. However, it is worth noting that Kazakhstan (the Kazakh Khanate), Tajikistan, and Uzbekistan, although shown in gray on the map, were part of the Russian Empire at the end of the 19th century, which explains their absence as separate mentioned states.

The expansion of the Russian Empire into Central Asia led to the abolition of khanate rule in the region. In particular, the "Charter of the Siberian Kirghiz People" of 1822 contributed to the annexation of most of the Kazakh Khanate to Russia. The Kokand Khanate, on the territory of which modern Uzbekistan, Tajikistan, Kyrgyzstan and southern Kazakhstan were located, was annexed by the Russian Empire in 1876.

Figure 4 – Geographical distribution of the countries represented in the children's literature section of the collection of books of the Minusinsk Public Library of the late 19th – early 20th centuries

Figure 5 shows the frequency distribution of the cities mentioned in the corpus under study. Kiev, Moscow and St. Petersburg were among the most frequently mentioned cities. It is noteworthy that the Indian city of Mumbai is in fourth place.

The analysis also shows that cities located in Europe are mentioned in the corpus much more often than cities in Africa, South America and Australia. The cities of the United States predominate in North America, while the cities of Canada are completely absent, and only Mexico City is represented among the cities of Mexico.

In general, the distribution of the mentioned cities corresponds to the expected results: most often these are the capitals of countries or large cities with significant regional significance.

Figure 5 – Geographical distribution of cities represented in the children's literature section of the collection of books of the Minusinsk Public Library of the late 19th – early 20th centuries

Limitations of the study

In the process of studying the geographical distribution of countries and cities represented in the children's literature section of the collection of books of the Minusinsk Public Library of the late 19th and early 20th centuries, certain limitations related to the uneven distribution of named entities were identified.

To test this hypothesis, a table was compiled (Table 1), which shows the most frequently mentioned countries and cities in individual publications. This table helps to identify possible anomalies in the distribution of mentions and assess how evenly geographical objects are represented in various works of the corpus. In particular, it was found that some named entities can be mentioned much more often in one work than in the rest of the corpus texts.

Table 1 – Uneven distribution of named entities in the children's literature section of the collection of books of the Minusinsk Public Library of the late 19th – early 20th centuries

| Book | Place name | Mentions in the book | Total mentions in the corpus | Share of total, % |
|---|---|---|---|---|
| Famous explorers and travelers | Spain | 147 | 267 | 55.06 |
| A collection of articles selected from works of Russian literature | Saint-Petersburg | 186 | 481 | 38.67 |
| A collection of articles selected from works of Russian literature | Russia | 82 | 889 | 9.22 |
| A collection of articles selected from works of Russian literature | Moscow | 173 | 519 | 33.33 |
| How I found Livingston | Mumbai | 89 | 321 | 27.73 |
| An illustrated history of Russia in conversations for children | Kyiv | 100 | 563 | 17.76 |
| An illustrated history of Russia in conversations for children | Moscow | 217 | 519 | 41.81 |
| Catacombs | Rome | 86 | 143 | 60.14 |
| Quentin Dorward | France | 181 | 418 | 43.30 |
| A book for initial reading | Saint-Petersburg | 71 | 481 | 14.76 |
| A Bag of Bread | Russia | 56 | 889 | 6.30 |
| The worries of a Chinese man in China | China | 48 | 193 | 24.87 |
| Travel stories from Africa | Mumbai | 49 | 321 | 15.26 |
| The history of Russia in short stories | Kyiv | 61 | 563 | 10.83 |
| Frigate Pallada | England | 80 | 497 | 16.10 |
| Steam House | India | 100 | 301 | 33.22 |

Based on the data in Table 1, conclusions can be drawn about the uneven use of some geographical names in the corpus. More than half of all mentions of Spain occur in the book "Famous Explorers and Travelers" by Jules Verne, published in 1873. The situation is similar with Rome, which is mentioned in "Catacombs" by Evgenia Tur (1866 edition) 86 times out of 143, or 60% of all mentions. The mentions of Moscow are distributed most unevenly: about 75% occur in two publications ("Collection of articles selected from works of Russian literature", V.A. Yakovlev, 1874; "Illustrated History of Russia in conversations for children", author unknown, 1863). Mumbai, notable in this study, owes about 43% of its mentions to two editions ("How I Found Livingstone", H.M. Stanley, 1873; "Stories from Travels in Africa", M.B. Chistyakov, 1897). Similar conclusions can be drawn for the other place names shown in Table 1.
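The shares discussed here follow directly from the counts in Table 1. As a small worked example, the sketch below recomputes some of them from a few rows of the table; the helper names `share` and `combined_share` are hypothetical.

```python
# (book, place name, mentions in the book, total mentions in the corpus), from Table 1
TABLE_1 = [
    ("Famous explorers and travelers", "Spain", 147, 267),
    ("A collection of articles selected from works of Russian literature", "Moscow", 173, 519),
    ("An illustrated history of Russia in conversations for children", "Moscow", 217, 519),
    ("Catacombs", "Rome", 86, 143),
    ("How I found Livingston", "Mumbai", 89, 321),
    ("Travel stories from Africa", "Mumbai", 49, 321),
]

def share(in_book, total):
    """Percentage of a place's corpus-wide mentions found in one book."""
    return round(100 * in_book / total, 2)

def combined_share(rows, place):
    """Percentage of a place's mentions covered by all listed books together."""
    selected = [(n, total) for _, name, n, total in rows if name == place]
    total = selected[0][1]
    return round(100 * sum(n for n, _ in selected) / total, 2)

print(share(147, 267))                    # Spain in one book: 55.06
print(combined_share(TABLE_1, "Moscow"))  # Moscow in two books: 75.14
print(combined_share(TABLE_1, "Mumbai"))
```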

Another important limitation is related to the quality of the scanned pages. Not all books have been preserved in satisfactory condition, so considerable effort was required to bring the texts into a machine-readable format. In some cases, individual words had to be corrected manually, and in more complex cases, entire pages had to be deciphered. In rare cases, when a page could not be restored even by manual processing, it was excluded from the analysis. This is unlikely to have had a significant impact on the final results of the study.

A technical limitation of the study was the use of the spaCy library for recognizing named entities. spaCy did not process texts exceeding one million characters, which required splitting large texts into parts and recombining the results after processing. spaCy is therefore more convenient for analyzing smaller corpora.
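The split-and-recombine workaround can be sketched as follows. This is an illustration rather than the authors' code: `split_text` and `count_in_chunks` are hypothetical helpers, and `extract` stands for any function that returns the place names found in one chunk. In spaCy the limit is exposed as the `max_length` attribute of the loaded pipeline, which defaults to 1,000,000 characters.

```python
from collections import Counter

def split_text(text, max_chars=1_000_000):
    """Split text into chunks of at most max_chars, breaking at line ends where possible."""
    chunks, current = [], ""
    for piece in text.splitlines(keepends=True):
        # A single line longer than the limit has to be cut hard.
        while len(piece) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(piece[:max_chars])
            piece = piece[max_chars:]
        if len(current) + len(piece) > max_chars:
            chunks.append(current)
            current = ""
        current += piece
    if current:
        chunks.append(current)
    return chunks

def count_in_chunks(text, extract, max_chars=1_000_000):
    """Apply extract (chunk -> iterable of names) to each chunk and sum the counts."""
    total = Counter()
    for chunk in split_text(text, max_chars):
        total.update(extract(chunk))
    return total

# Toy extractor standing in for an NER call.
find_rome = lambda chunk: [w for w in chunk.split() if w == "Rome"]
print(count_in_chunks("Rome is old\nRome again\nParis\n", find_rome, max_chars=12))
```

Breaking at line ends keeps most names intact, but a name that straddles a hard cut inside a very long line could still be missed.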

The identified limitations did not prevent the achievement of the research objectives, but rather provided an opportunity for further reflection and analysis. Of course, it is impossible to assert that the results obtained fully reflect objective reality, since works of fiction cannot accurately reproduce real events and perceptions. For example, it is impossible to say with certainty that a provincial resident of the late 19th and early 20th centuries perceived the world exactly as it is reflected on the maps created during the study. There is a high probability that information about cities and countries was also transmitted orally, including through the stories of travelers, merchants, migrants and exiles, which could have influenced the perception of the geographical picture of the world at that time.

Conclusion

This study is one of the first to identify named entities in the children's literature section of the Minusinsk Public Library collection of books of the late 19th and early 20th centuries and to map them. In the course of the work, the texts of the Minusinsk Public Library were analyzed using modern geoparsing methods. The study revealed a number of features, such as the uneven distribution of geographical names in the corpus, as well as limitations related to the quality of digitized texts and the technical aspects of automatic identification of entities.

Comparing the results obtained with other studies in the field of geoparsing, it can be noted that, despite the general success of named entity recognition methods, the limitations identified during the work are similar to the problems mentioned in other scientific papers. For example, as in the study [4], NER showed a bias towards better-known and more frequently mentioned locations, which is confirmed by the uneven distribution of place names in our corpus. The problem of toponymic ambiguity [8] was also reflected in our study, where some toponyms could have multiple meanings or remain unrecognized by the system due to the lack of additional context.

The limit on text size encountered in the spaCy library showed that larger corpora require more powerful tools such as Flair, TagMe, or the neural network solutions discussed in [9]. It is important to note that our corpus has specific linguistic features that require additional effort to adapt existing NER tools to pre-reform texts, unlike the English-language and other modern corpora analyzed in previous studies [11].

Thus, our research makes a significant contribution to the field of geoparsing and NER for Russian-language texts of the 19th - 20th centuries. Despite the difficulties identified, such as the quality of the source materials and technical limitations, the research opens up new prospects for the development of NLP methods for Russian-language corpora. Further work in this direction may include an analysis of all book editions from the Minusinsk library, a comparison of several corpora (a collection of books from libraries in Siberia and a collection from libraries in the central regions), an analysis of personal communication based on preserved letters, and the mapping of country borders using historical maps.

References
1. Li, J., Sun, A., Han, J., & Li, C. (2020). A review of deep learning for named entity recognition. IEEE Transactions on Knowledge and Data Engineering, 122-127.
2. Nadeau, D., & Sekine, S. (2007). A survey of named entity recognition and classification. International Journal of Computational Linguistics and Applications, 3-26.
3. Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., & Dyer, C. (2016). Neural architectures for named entity recognition. In Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 260-270).
4. Liu, Z., Janowicz, K., Cai, L., Zhu, R., Mai, G., & Shi, M. (2022). Geoparsing: Solution or bias? Assessing geographic biases in geoparsing. AGILE: GIScience Series, 13.
5. Burgmeister, M. (2022). Measuring urban change in travel texts: The case of Graz in the 19th century. magazen, 3(1), 61-90.
6. Evans, E., & Wilkens, M. (2018). Nation, ethnicity, and the geography of British literature, 1880–1940. Journal of Cultural Analysis, 48.
7. Smail, R., Gregory, I., & Taylor, J. (2019). Qualitative geography in digital texts: Representing historical spatial identities in the Lake District. International Journal of Humanities and Arts Computing, 28-38.
8. Fize, J., Moncla, L., & Martins, B. (2021). Deep learning for toponym recognition: Geocoding based on toponym pairs. ISPRS International Journal of Geo-Information, 16.
9. Tambuscio, M., & Andrews, T. L. (2021). Geolocation and named entity recognition in ancient texts: A case study of the Armenian history of Ghewond. In Humanities Research Conference (pp. 136-148). Amsterdam.
10. San Giacomo, A., Hogenbirk, H., Tanasescu, R., Karaisl, A., & White, N. (2022). Reading in the fog: High-quality optical character recognition based on freely available digitized early modern books. Digital Scholarship in the Humanities, 37(4), 1197-1209. https://doi.org/10.1093/llc/fqac014
11. Straková, J., Straka, M., & Hajič, J. (2019). Neural architectures for nested NER via linearization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (p. 6).

First Peer Review

Peer reviewers' evaluations remain confidential and are not disclosed to the public. Only external reviews, authorized for publication by the article's author(s), are made public. Typically, these final reviews are conducted after the manuscript's revision. Adhering to our double-blind review policy, the reviewer's identity is kept confidential.
The list of publisher reviewers can be found here.

The presented article, "The world through the eyes of an educated person in Minusinsk of the late XIX - early XX centuries: the distribution of the frequency of geographical names in the books of the Minusinsk Public Library", corresponds to the subject area of the journal "Historical Informatics" and addresses a topical problem. In the introductory part, the authors describe a study aimed at identifying named entities and mapping them in the corpus of children's literature from the collection of the Minusinsk Public Library of the late XIX and early XX centuries. The paper presents an algorithm for geoparsing and mapping geographical named entities in texts written in pre-reform orthography. Special attention is paid to the limitations of the applied methods, such as the uneven distribution of named entities in the corpus. According to the authors, a significant share of the mentions of a single entity can be concentrated in one source, which reduces the representativeness of the mapping results. The authors independently carried out a geographical analysis of the children's literature section of the library's collection. The study covered 121 works written between 1719 and 1905. The paper shows how a resident of a provincial Siberian city formed a spatial picture of the world. The authors chose the following materials and methods: conversion to machine-readable form, geoparsing, and mapping. The practical significance is clearly justified: the work makes a significant contribution to geoparsing and NER for Russian-language texts of the XIX and XX centuries. The research opens up new prospects for the development of NLP methods for Russian-language corpora.
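The concentration effect described above (most mentions of one entity coming from a single source) can be quantified with a simple per-toponym measure. The following is a minimal sketch, not the authors' procedure: the function name `top_source_share`, the `(work, toponym)` input format, and the toy data are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def top_source_share(mentions):
    """For each toponym, return the share of its mentions that come from
    the single work mentioning it most often (1.0 = all in one work)."""
    per_toponym = defaultdict(Counter)
    for work, toponym in mentions:
        per_toponym[toponym][work] += 1

    shares = {}
    for toponym, by_work in per_toponym.items():
        total = sum(by_work.values())
        shares[toponym] = max(by_work.values()) / total
    return shares


# Invented toy data (hypothetical works and places, not from the corpus).
toy_mentions = [
    ("work_a", "city_x"), ("work_a", "city_x"), ("work_a", "city_x"),
    ("work_b", "city_x"),
    ("work_a", "city_y"), ("work_b", "city_y"),
]
shares = top_source_share(toy_mentions)
```

A share close to 1.0 flags a toponym whose position on a frequency map is driven by one book rather than by the corpus as a whole.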
The authors outline directions for further research: analyzing all book editions from the Minusinsk library; comparing several corpora (book collections from libraries in Siberia and from libraries in the central regions); analyzing personal communication on the basis of surviving letters; and displaying country borders using historical maps. The length of the article corresponds to the recommended 12,000 characters. The style and language of the presentation are accessible to a wide range of readers. The authors conducted a broad analytical review of domestic and foreign literature. The article is well structured: it has an introduction, a conclusion, and an internally divided main part (review of related work, results, and interpretation). The shortcomings are as follows: the object and subject of the study are not formulated, and the scientific novelty is not stated. It is recommended that the authors formulate the object and subject of the research and identify its scientific novelty. The article "The world through the eyes of an educated person in Minusinsk of the late XIX - early XX centuries: the distribution of the frequency of geographical names in the books of the Minusinsk Public Library" requires revision in line with these remarks. After the amendments are made, it is recommended for reconsideration by the editorial board of the peer-reviewed scientific journal.

Second Peer Review

The subject of the study is the frequency distribution of geographical names in the books of the Minusinsk Public Library of the late 19th and early 20th centuries. The authors focus on children's literature in order to establish how geographical objects are represented in the texts and how this could have shaped the worldview of readers of that time. The research also touches on technical aspects of text processing, including geoparsing and mapping. Despite the limited research base (121 works), the article has significant value as a historical study and contributes to the development of Digital Humanities. NLP (Natural Language Processing) methods such as geoparsing and mapping are used. They make it possible to automate the processing of large amounts of data, which is especially important for work with archives and libraries, to visualize the results, and to explore texts that have not previously been analyzed for their geographical content. The article thus demonstrates how modern technologies can be applied to historical sources and raises important methodological issues: the problems of digitizing and recognizing poorly preserved texts, the limitations of modern NLP tools (for example, spaCy) on historical corpora, and the uneven distribution of data. These issues are relevant to everyone who works with historical texts in the digital age. The research workflow proposed in the article includes several stages: digitization and text transformation, geoparsing, and mapping. The geoparsing results were visualized in Microsoft Excel, with gradient fills for countries and heat maps for cities. The methodology generally corresponds to modern approaches to NLP; however, the use of spaCy for pre-reform texts raises questions, since this library is not optimized for historical texts.
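One common way to narrow the gap between pre-reform texts and a modern NER model is to normalize the orthography before tagging. The sketch below illustrates that idea only; the article does not state that the authors did this, and the function name `normalize_pre_reform` is hypothetical. It covers just the letters abolished by the 1918 reform and the word-final hard sign, and leaves other pre-reform features (for example, adjective endings such as -аго and -яго) untouched.

```python
import re

# Letters abolished by the 1918 orthographic reform, mapped to their
# modern equivalents. Unicode escapes are used to avoid look-alike glyphs.
_LETTER_MAP = str.maketrans({
    "\u0463": "\u0435", "\u0462": "\u0415",  # yat -> Cyrillic e
    "\u0456": "\u0438", "\u0406": "\u0418",  # decimal i -> Cyrillic i
    "\u0473": "\u0444", "\u0472": "\u0424",  # fita -> ef
    "\u0475": "\u0438", "\u0474": "\u0418",  # izhitsa -> Cyrillic i (simplified)
})

# Hard sign at the end of a word; a hard sign inside a word is kept.
_FINAL_HARD_SIGN = re.compile(r"[\u044a\u042a]\b")


def normalize_pre_reform(text):
    """Return text with a simplified pre-reform -> modern spelling mapping."""
    text = _FINAL_HARD_SIGN.sub("", text)
    return text.translate(_LETTER_MAP)
```

Normalized text can then be passed to whichever NER tool is in use; whether this improves recall on a given corpus would still need to be checked against manually verified data.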
The research is relevant in the context of the development of digital humanities and NLP. It contributes to the study of Russian-language texts, which remain under-researched in comparison with English-language corpora. The work also has historical value, since it shows how the geographical perception of the world was formed among the inhabitants of a provincial city in the late 19th and early 20th centuries. The scientific novelty lies in the fact that geographical names in a corpus of pre-revolutionary children's literature were analyzed for the first time and that the limitations of geoparsing methods for pre-reform texts were identified. The novelty is, however, somewhat limited by the use of an existing tool (spaCy) without significant adaptation to historical texts. The style is scientific and draws on the terminology of NLP and historical analysis. The structure is logical and contains the necessary elements: introduction, literature review, description of methods, results and their interpretation, and conclusion. At the same time, the introduction does not justify the choice of the corpus or of the Minusinsk library, so the choice appears arbitrary. The research base is not representative enough to support broad conclusions about the "world of an educated person" of the late 19th and early 20th centuries. The text is overloaded with technical detail: the article concentrates on the application of modern NLP methods (geoparsing, text processing, data visualization) to the detriment of historical interpretation. Issues such as provincial culture, the history of children's literature, and the problems of historical memory are left unaddressed.
The bibliography includes recent works on NLP, geoparsing, and historical text analysis; publications from the past 5 years account for 50% of the list. Most of the references are English-language studies, which highlights the lack of work on Russian-language corpora. There are too few references to studies devoted specifically to pre-reform texts, which could have strengthened the authors' arguments. The bibliography does not include works on the history of provincial culture, children's literature, or the history of Minusinsk. The authors acknowledge the limitations of their research, such as the uneven distribution of geographical names in the corpus, problems with the quality of the digitized texts, and the limitations of the spaCy library. They also note that the results may not be fully representative because of the specific nature of literary texts. However, the authors pay too little attention to possible alternative approaches, such as more powerful tools (for example, Flair or neural network models), which could improve the quality of the analysis. The conclusions of the study are logical and correspond to the stated tasks. The authors emphasize that their work contributes to the development of NLP methods for Russian-language texts and opens up new prospects for further research. The conclusions could, however, have been more specific, for example by indicating which aspects of the geographical perception of the world were identified. The conclusions of the study "The world through the eyes of an educated person in Minusinsk in the late 19th and early 20th centuries" rest on an analysis of the frequency of mentions of geographical names in the books of the Minusinsk Public Library.
The conclusions actually state the following: the study is one of the first devoted to the analysis of geographical names in the children's literature corpus of the Minusinsk library; technical limitations are identified, such as the uneven distribution of toponyms and problems with the quality of the digitized texts; the results cannot fully reflect objective reality, since literary texts do not always reproduce geographical data accurately; and further research is needed, including the analysis of other corpora and a comparison with libraries in the central regions. It is also noted that "the world through the eyes of an educated person in Minusinsk" was Eurocentric, with an emphasis on Europe and Russia; limited, with only a minimal idea of other parts of the world; and fragmented, with an uneven distribution of references to geographical objects. These conclusions follow logically from the analysis of the frequency of mentions of countries and cities. The interest of the readership will depend on its specialization. For researchers in NLP and digital humanities, the work is of considerable interest, especially in its analysis of pre-reform texts. For a wider audience, including historians and cultural studies scholars, the research can also be useful, although it may not meet their expectations. Although many questions remain unresolved, the article is scientifically significant: it demonstrates that NLP methods can be used to analyze historical texts. It is also important that the article draws attention to provincial culture; the mention of Minusinsk and its public library may inspire other researchers to study regional history.