Historical informatics
Reference:
Mekhovskii, V.A., Kizhner, I.A. (2025). The world through the eyes of an educated person in Minusinsk of the late XIX - early XX centuries: distribution of the frequency of geographical names in the books of the Minusinsk Public Library. Historical informatics, 1, 174–189. https://doi.org/10.7256/2585-7797.2025.1.72586
The world through the eyes of an educated person in Minusinsk of the late XIX - early XX centuries: distribution of the frequency of geographical names in the books of the Minusinsk Public Library
DOI: 10.7256/2585-7797.2025.1.72586
EDN: QCQWHG
Received: 05-12-2024
Published: 17-04-2025
Abstract: The subject of the study is the corpus of children's literature from the collection of the Minusinsk Public Library of the late 19th and early 20th centuries, consisting of 121 works written between 1719 and 1905. These texts are a significant source for studying how fiction shaped the geographical perception of residents of a provincial Siberian city. Special attention is paid to the geographical names (toponyms) found in the texts, with the aim of identifying their frequency and geographical distribution. This makes it possible to reconstruct the picture of the world presented in the books of that time and to understand how the children's audience perceived it when forming their ideas of countries, cities and cultural centers. The research studies the role of children's literature as a cultural tool that reflects and shapes geographical representations, and identifies methodological challenges and limitations of working with historical corpora. The methodology consists of bringing pre-reform texts into machine-readable form with digitization tools and applying geoparsing to identify geographical entities automatically. The spaCy library was used for the analysis, followed by manual verification and correction of the data. The results include the identification of 668 cities and 97 countries represented in the texts and a cartographic visualization of the frequency distribution of mentions. The analysis revealed an uneven distribution of geographical names across texts: Russia, Poland and England prevail among countries, and Kiev, Moscow and St. Petersburg among cities. The results are applicable to research in digital humanities, library science, and historical and cultural studies.
The novelty of the work lies in the use of modern geoparsing methods for processing Russian-language texts in pre-reform orthography and in the analysis of the previously unexplored literature corpus of the Minusinsk Library. The conclusions emphasize the importance of text mapping for understanding the formation of geographical perception and the need for further development of NER tools for complex corpora. Despite its limitations, the research contributes to the development of NLP methods for historical texts.
Keywords: Geoparsing, Mapping, Named-entity recognition, Historical Computer Science, Siberia, Minusinsk, World map, Children's literature, Minusinsk Public Library, Pre-reform orthography
This article is automatically translated.
Introduction
Named entity recognition (NER) and the mapping of recognized entities play an important role in natural language processing (NLP). These tasks involve identifying and classifying text elements, such as the names of people, organizations, and geographical features, and then linking them to specific categories or knowledge bases. In recent years, approaches to NER have changed significantly with the development of deep learning methods. Despite these advances, most research still focuses on corpora in widely studied languages such as English, Chinese or Arabic, while Russian remains relatively unexplored.
The present study identifies named entities and maps them in the corpus of children's literature from the collection of the Minusinsk Public Library of the late 19th and early 20th centuries. The paper presents an algorithm for geoparsing and mapping geographical named entities in texts written in pre-reform orthography. Particular attention is paid to the limitations of the applied methods, such as the uneven distribution of named entities in the corpus: a significant share of the mentions of one entity can be concentrated in a single source, which reduces the representativeness of the mapping results.
Another important factor affecting the quality of the results is the condition of the scanned pages. Automatic geoparsing of poorly preserved pages is often accompanied by errors that require manual correction to reduce the number of incorrectly identified or omitted geographical names. The study also revealed a limitation of the library used for NER: spaCy turned out to be unable to process texts longer than one million characters in a single pass. This limitation means that spaCy is most convenient for analyzing relatively small corpora.
In addition, most NER libraries rely on existing databases of geographical objects, and if these databases are imperfect, the results may be incomplete. Nor can it be argued that the results fully reflect objective reality, since literary texts do not always accurately reproduce current events or geographical data. Nevertheless, this study is the first attempt to identify named entities and map them in the children's literature corpus of the Minusinsk Public Library of the late 19th and early 20th centuries, which makes a significant contribution to the development of geoparsing and mapping for the Russian language.
The object of this research is the identification and mapping of named entities in the corpus of children's literature of the late 19th and early 20th centuries. The subject is the frequency distribution of geographical names in the books of the Minusinsk Public Library of the same period. The scientific novelty of the research consists primarily in the fact that geographical named entities in a corpus of Russian-language children's literature of the 19th and 20th centuries have been identified and mapped for the first time. The limitations of existing geoparsing methods as applied to pre-reform Russian texts are also studied in detail. For the first time, a geographical analysis of the perception of the world through the fiction of the Minusinsk Public Library has been produced, which can contribute to the further development of digital humanities research.
Overview of related works
With digitalization and the significant growth in the volume of digitized documents, geoparsing has become increasingly popular among researchers in the humanities. Geoparsing includes two consecutive stages: recognition of toponyms and their mapping [1].
Toponym recognition is often considered a subtask of named entity recognition (NER), or more precisely of named entity recognition and classification (NERC) [2]. Mapping involves linking recognized toponyms to the corresponding geographical coordinates and visualizing them on a map. Historically, methods for recognizing named entities were based on rules derived from the linguistic characteristics of the text and on specialized dictionaries [2]. Such approaches required significant manual effort and could not adapt effectively to new domains or languages. With the development of machine learning and, in particular, deep learning, approaches to geoparsing have changed significantly. Deep learning models, including recurrent neural networks (RNNs), long short-term memory (LSTM) networks, and transformers, have demonstrated marked improvements in named entity recognition [3].
A large body of knowledge on geoparsing has already formed in the international scientific literature. The many existing studies use different approaches, which can be divided into three major areas:
1) Description of a unique geoparsing algorithm for a selected corpus and interpretation of the results from a humanities point of view [5, 6, 7].
2) Comparative analysis of the effectiveness of several geoparsers on one or more corpora [4, 8, 9].
3) The search for solutions that improve the results of geoparsing [8, 10, 11].
Comparative analysis of geoparsers
In a comparative analysis of geoparsers, one corpus is processed with several methods for identifying named entities, the results are compared, and the best geoparser is selected. In a study conducted by scientists from the USA, the following geoparsers were analyzed: spaCy, NeuroTPR, Edinburgh Geoparser and Encoder [4]. The English-language corpora LGL, GeoVirus and WikToR were used for this purpose. The study describes problems that are quite common in geoparsing. First of all, geoparsing is biased towards more developed regions of the world with large linguistic corpora, which in turn affects the representativeness of research results. NER is also imperfect because mapping may fail due to toponymic ambiguity [8]. This refers to identical geographical names, for which the geographical uncertainty has to be resolved with an additional contextual toponym. In most cases, capitals and important cities are chosen for this purpose.
In a study on the recognition of geographical names in an 8th century corpus translated into English from Armenian, the Austrian authors rank geoparsing methods by the quality of their results as follows: Flair performed best, TagMe slightly worse, spaCy took third place, and NLTK came last. It is worth noting that TagMe performs best at finding obsolete place names because it works with the Wikipedia corpus [9].
When searching for ways to improve geoparsing results, researchers often create their own geoparsers based on neural networks of different types and scales. Such geoparsers improve the accuracy and efficiency of NER by applying machine learning techniques such as deep learning, random forests, and naive Bayes classifiers.
These methods determine whether words or phrases belong to certain categories, such as people's names, geographical names, organizations, and dates. To prepare a neural network for use, a training corpus is created on which the network is trained, after which its performance is checked on test corpora. The limitations in this case are straightforward. A custom model takes a long time to train, which makes it difficult to conduct research quickly. The training sample also plays a major role: the larger and cleaner it is, the better the geoparser performs [8].
There are also attempts to create custom methods for identifying named entities in corpora built specifically for a particular study. The work [10] is indicative here: its authors use their own developments at each stage of the study (preprocessing, text recognition and post-processing). At the preprocessing stage, digitized pages are binarized, that is, converted to black and white, and unwanted distortions on the pages are removed. The preprocessed pages are then passed to the recognition stage. Finally, at the post-processing stage, the quality of the processed data is checked, and a decision is made on whether the book is suitable for further work or needs to be processed again. Processing takes quite a long time (about three hours per 500 pages), but this time can be reduced significantly by increasing the number of servers or improving their performance. It is also worth noting that semantic analysis requires a recognition accuracy of at least 90%.
Work on neural networks that handle nested named entities is of particular interest. An example of such an entity is the "Supreme Court of Florida", as it contains two overlapping entities, the "Supreme Court of Florida" and "Florida".
In their work, researchers from the Czech Republic propose two neural network architectures for recognizing nested named entities and analyze their performance on four nested-entity corpora: English ACE-2004, ACE-2005, GENIA and Czech CNEC [11]. The first model combines the multiple labels of nested entities into a single multilabel, which is then predicted using the standard LSTM-CRF model. Here a label refers to the class of the named entity, for example, a geographical name. This matters because nested entities may contain entities belonging to other classes. In the second model, nested entities are encoded as a sequence, and the task can then be treated as a sequence-to-sequence (seq2seq) task in which the input sequence consists of tokens (word forms) and the output sequence consists of labels. The decoder predicts labels for each token until it reaches a special label, "<eow>" (end of the word), after which it proceeds to the next token [11]. The authors conclude that LSTM-CRF multilabel modeling is better suited to flat corpora and corpora with little nesting, while the sequence-to-sequence architecture captures more complex relationships between nested and complex named entities.
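The multilabel idea in the first model can be illustrated with a short sketch. It is a simplification, not the encoding from [11]: it uses plain BIO tags, and the class names `ORG` and `GPE` are assumptions chosen for the "Supreme Court of Florida" example.

```python
def multilabels(n_tokens, spans):
    """Join the BIO tags of all entities covering each token into one label.

    `spans` holds (start, end_exclusive, label) triples over token indices.
    """
    tags = [[] for _ in range(n_tokens)]
    # Longer (outer) entities go first, so each multilabel reads outside-in.
    for start, end, label in sorted(spans, key=lambda s: s[0] - s[1]):
        for i in range(start, end):
            tags[i].append(("B-" if i == start else "I-") + label)
    # Tokens outside every entity receive the conventional "O" tag.
    return ["|".join(t) if t else "O" for t in tags]


tokens = ["Supreme", "Court", "of", "Florida"]
# Assumed annotation: the whole phrase is an ORG, the last token is a GPE.
labels = multilabels(len(tokens), [(0, 4, "ORG"), (3, 4, "GPE")])
```

Each token now carries exactly one (possibly compound) label, so a standard flat sequence tagger can be trained on the result.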
Research data source
Our research is not aimed at creating fundamentally new methods of geoparsing; we have concentrated on using ready-made methods and interpreting the results from a technical point of view. The study carried out a geographical analysis of a literary corpus. The focus was on the children's literature section of the collection of the Minusinsk Public Library of the late 19th and early 20th centuries. The study included 121 works written between 1719 and 1905. We have not found a single scientific work on the geographical distribution of place names in books from Siberian libraries. The paper shows how a resident of a provincial Siberian city formed a spatial representation of the world. It is assumed that this happened through gradual exposure to the geographical names found in works of fiction written for young readers. The purpose of the study is to obtain the geographical distribution of the frequency of mentions of locations in the children's literature section of the collection of the Minusinsk Public Library of the late 19th and early 20th centuries.
Materials and methods
Making it machine-readable
The first stage of the geographical analysis of the distribution of locations in the corpus is the conversion of book texts from the children's collection of the Minusinsk Public Library of the late 19th and early 20th centuries into a machine-readable format. This process involves digitizing book pages (photographs or scans) and converting pre-reform spelling into modern spelling. ABBYY FineReader 15 was used to convert PDF documents into a text format suitable for computer analysis. The conversion of pre-reform spelling to modern spelling was automated using Python code and the prereform2modern library. As a result, 119 text files were obtained: 32 in English (for these works, translated texts could not be found in any aggregator) and 87 in Russian. After this stage, the corpus was ready for the extraction of geographical names and their subsequent mapping.
It is important to note that automatic conversion of texts into a machine-readable format does not exclude errors. These errors may be due to the condition of the source books, many of which, because of their age, have not been preserved in good quality. This can lead to words being recognized incorrectly or omitted completely. To minimize the impact of this factor on the results of the study, the problematic pages were transcribed manually. Pages for which even manual transcription was impossible were excluded from the analysis.
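To give a sense of what the spelling conversion involves, the sketch below applies a few letter-level rules of the 1918 reform. It is only an illustration written for this text: it does not reproduce the library used in the study, and it ignores morphological changes such as the adjective endings -аго/-яго.

```python
import re

# Letters abolished by the 1918 reform and their modern replacements.
LETTER_MAP = str.maketrans({
    "ѣ": "е", "Ѣ": "Е",
    "і": "и", "І": "И",
    "ѳ": "ф", "Ѳ": "Ф",
    "ѵ": "и", "Ѵ": "И",
})

# A hard sign at the very end of a word was dropped by the reform.
FINAL_HARD_SIGN = re.compile(r"[ъЪ]\b")


def modernize(text: str) -> str:
    """Apply a small subset of the pre-reform to modern spelling rules."""
    text = text.translate(LETTER_MAP)
    # The hard sign inside a word (as in "объявление") is left untouched.
    return FINAL_HARD_SIGN.sub("", text)


sample = modernize("Кіевъ")
```

Normalizing toponyms this way before NER matters because models trained on modern Russian are unlikely to recognize forms such as "Кіевъ".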
Geoparsing
The next stage of the research is geoparsing, which includes three key steps: extracting geographical names from the corpus, checking and correcting the results manually, and combining the data into a single dataset. To extract geographical named entities from the texts automatically, Python code was developed using the spaCy library. When the results were checked, however, it turned out that the algorithm also identified entities that are not geographical objects. All the results were therefore carefully checked manually, and irrelevant entities were deleted. To ensure correct operation at the mapping stage, the geoparsing results for each publication had to be combined into one common file, and the frequency of mentions of each geographical object had to be calculated across all books of the corpus. This task was also implemented in Python. After this stage, the corpus was ready for further analysis and mapping.
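The three steps above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and `nlp` stands for any loaded spaCy pipeline (the source does not say which models were used). The label set is an assumption based on spaCy's usual conventions, where English pipelines tag places as `GPE` or `LOC` and Russian pipelines as `LOC`.

```python
from collections import Counter

# Entity labels treated as geographic (assumption, adjust per model).
GEO_LABELS = {"GPE", "LOC"}


def extract_toponyms(nlp, text):
    """Count the geographic entities that the pipeline finds in one text."""
    doc = nlp(text)
    return Counter(ent.text for ent in doc.ents if ent.label_ in GEO_LABELS)


def drop_rejected(counts, rejected):
    """Remove entities that manual review marked as non-geographic."""
    return Counter({name: n for name, n in counts.items() if name not in rejected})


def merge_books(per_book):
    """Sum per-book counts into one corpus-wide frequency table."""
    total = Counter()
    for counts in per_book.values():
        total.update(counts)
    return total
```

With spaCy installed, `nlp` could be obtained with `spacy.load(...)` and a model such as `ru_core_news_sm` or `en_core_web_sm`; keeping the per-book counters rather than only the merged table also makes it possible to check later how evenly each toponym is spread across books.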
Interim results
In total, the list contains 668 names of cities and 97 names of countries. To visualize the results, pie charts were constructed for the ten most frequently mentioned countries and cities (Figs. 2, 3).
Figure 2 – Diagram of the distribution of countries represented in the children's literature section of the collection of books of the Minusinsk Public Library of the late 19th – early 20th centuries
Figure 3 – Diagram of the distribution of cities represented in the children's literature section of the collection of books of the Minusinsk Public Library of the late 19th – early 20th centuries
The ten most frequently mentioned countries account for 61% of all country mentions (Fig. 2). It is not surprising that Russia is in first place. It is noteworthy that Poland is in second place. Figure 3 shows that mentions of cities are spread more widely than mentions of countries, most likely because there are far more cities. The top ten cities account for 38% of all city mentions, and Kiev unexpectedly takes first place, ahead of the expected Moscow or St. Petersburg. The Indian city of Bombay also ranks high (4th place).
Mapping
Once the corpus has been geoparsed and the data corrected, the geographical analysis of the distribution of locations can begin. As part of our research, maps of the distribution of mentions of countries and cities were created. Microsoft Excel was used to build the maps. The distribution of countries was visualized with a gradient fill: the countries with the largest number of mentions were shown in a more saturated color, and the countries with the fewest mentions in a paler shade. Countries that were not mentioned in the texts remained a neutral gray.
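The gradient fill can be expressed as a simple mapping from mention counts to colors. The sketch below assumes a linear scale and an arbitrary blue palette; the scaling that Excel applies internally and the colors used in the study are not stated in the text.

```python
GRAY = "#bfbfbf"  # countries with no mentions


def fill_color(count, max_count):
    """Linear gradient from a pale blue (few mentions) to a dark blue (many)."""
    if count <= 0:
        return GRAY
    t = count / max_count  # position on the gradient, 0 < t <= 1
    pale, dark = (222, 235, 247), (8, 48, 107)
    # Interpolate each RGB channel between the pale and the dark color.
    r, g, b = (round(p + (d - p) * t) for p, d in zip(pale, dark))
    return f"#{r:02x}{g:02x}{b:02x}"
```

Because a few countries dominate the counts, a linear scale leaves most other countries nearly indistinguishable; a logarithmic scale would be a common alternative.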
To map the identified cities, another technique was used: a point heat map, where each mention of a city is displayed as a heat spot. This map visualizes the concentration of mentions of cities throughout the analyzed corpus.
Results and interpretation
As part of the study, maps of the distribution of mentions of countries and cities were created. A gradient fill is used on the country distribution map (Fig. 4): the countries with the largest number of mentions are highlighted in the brightest shades, while the countries that were not mentioned in the analyzed corpus are marked in gray. Among the most frequently mentioned countries are Russia, Poland and England. Figure 4 is of particular interest because it shows which countries were not mentioned in the corpus. Among them are Kazakhstan, Tajikistan, Uzbekistan, Argentina, Pakistan, Indonesia and some African countries. It is worth noting, however, that Kazakhstan (the Kazakh Khanate), Tajikistan, and Uzbekistan, although shown in gray on the map, were part of the Russian Empire at the end of the 19th century, which explains why they are not mentioned as separate states. The expansion of the Russian Empire into Central Asia led to the abolition of khanate rule in the region. In particular, the "Charter of the Siberian Kirghiz People" of 1822 contributed to the annexation of most of the Kazakh Khanate to Russia. The Kokand Khanate, whose territory covered parts of modern Uzbekistan, Tajikistan, Kyrgyzstan and southern Kazakhstan, was annexed by the Russian Empire in 1876.
Figure 4 – Geographical distribution of the countries represented in the children's literature section of the collection of books of the Minusinsk Public Library of the late 19th – early 20th centuries
Figure 5 shows the frequency distribution of the cities mentioned in the corpus. Kiev, Moscow and St. Petersburg were among the most frequently mentioned cities.
It is noteworthy that the Indian city of Bombay (present-day Mumbai) is in fourth place. The analysis also shows that cities located in Europe are mentioned in the corpus much more often than cities in Africa, South America and Australia. In North America, cities of the United States predominate, cities of Canada are completely absent, and Mexico is represented only by Mexico City. In general, the distribution of the mentioned cities corresponds to the expected results: most often these are national capitals or large cities of regional significance.
Figure 5 – Geographical distribution of cities represented in the children's literature section of the collection of books of the Minusinsk Public Library of the late 19th – early 20th centuries
Limitations of the study
In the process of studying the geographical distribution of countries and cities represented in the children's literature section of the collection of books of the Minusinsk Public Library of the late 19th and early 20th centuries, certain limitations related to the uneven distribution of named entities were identified. To test this hypothesis, a table was compiled (Table 1), which shows the most frequently mentioned countries and cities in individual publications. This table helps to identify possible anomalies in the distribution of mentions and assess how evenly geographical objects are represented in various works of the corpus. In particular, it was found that some named entities can be mentioned much more often in one work than in the rest of the corpus texts.
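The concentration check behind such a table can be sketched in a few lines. The function name and data layout are hypothetical; the example uses the Rome figures reported in the discussion of Table 1 (86 of 143 mentions in "Catacombs"), with the remaining 57 mentions lumped into a placeholder entry.

```python
def top_source_share(per_book, toponym):
    """Return (book, share) for the book holding the most mentions of a toponym."""
    hits = {book: counts.get(toponym, 0) for book, counts in per_book.items()}
    total = sum(hits.values())
    if total == 0:
        return None, 0.0
    book = max(hits, key=hits.get)
    return book, hits[book] / total


per_book = {
    "Catacombs": {"Rome": 86},
    "all other books": {"Rome": 57},  # placeholder for the rest of the corpus
}
book, share = top_source_share(per_book, "Rome")
```

A share close to 1 signals that a toponym's prominence on the map comes from a single work rather than from the corpus as a whole.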
Table 1 – Uneven distribution of named entities in the children's literature section of the collection of books of the Minusinsk Public Library of the late 19th – early 20th centuries
The data in Table 1 show that some geographical names are used unevenly across the corpus. More than half of all mentions of Spain come from the book "Famous Explorers and Travelers" by Jules Verne, published in 1873. The situation is similar with Rome, which is mentioned 86 times out of 143 in "Catacombs" by Evgenia Tur (1866 edition), or 60% of all mentions. Mentions of Moscow are distributed most unevenly: about 75% fall within two publications ("Collection of articles selected from works of Russian literature", V.A. Yakovlev, 1874; "Illustrated History of Russia in conversations for children", author unknown, 1863). Bombay, which stands out in this study, has 45% of its mentions in two editions ("How I Found Livingstone", H.M. Stanley, 1873; "Stories from Travels in Africa", M.B. Chistyakov, 1897). Similar conclusions can be drawn for the other cities shown in Table 1.
Another important limitation is related to the quality of the scanned pages. Not all books have been preserved in satisfactory condition, so considerable effort was required to bring the texts into a machine-readable format. In some cases, individual words had to be corrected manually, and in more complex cases entire pages had to be transcribed by hand. Only in rare cases, when a page could not be restored even manually, was it excluded from the analysis. This is unlikely to have had a significant impact on the final results of the study.
The technical limitation of the study was the use of the spaCy library for recognizing named entities. By default, spaCy does not accept texts that exceed one million characters, so large texts had to be split into parts and the results recombined after processing. spaCy is therefore most efficient when applied to smaller corpora.
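The splitting step can be sketched as below. This is one possible approach rather than the procedure used in the study: the function is hypothetical, and it prefers paragraph breaks as cut points so that multiword names are less likely to be split between parts. The one-million-character figure corresponds to spaCy's default `nlp.max_length`.

```python
def split_text(text, max_chars=1_000_000):
    """Split text into pieces of at most max_chars, cutting at whitespace."""
    chunks = []
    start = 0
    while len(text) - start > max_chars:
        window_end = start + max_chars
        # Prefer the last paragraph break inside the window.
        cut = text.rfind("\n\n", start + 1, window_end)
        if cut == -1:
            # Otherwise fall back to the last space or newline.
            cut = max(text.rfind(" ", start + 1, window_end),
                      text.rfind("\n", start + 1, window_end))
        if cut == -1:
            cut = window_end  # no whitespace at all: hard cut
        chunks.append(text[start:cut])
        start = cut
    chunks.append(text[start:])
    return chunks


parts = split_text("ab cd ef gh", max_chars=5)
```

Entity counts from the parts can then be summed to obtain the counts for the whole book. Raising `nlp.max_length` is an alternative when enough memory is available.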
The identified limitations did not prevent the research objectives from being achieved; rather, they provided an opportunity for further reflection and analysis. Of course, it cannot be asserted that the results fully reflect objective reality, since works of fiction do not accurately reproduce real events and representations. For example, it is impossible to say with certainty that a provincial resident of the late 19th and early 20th centuries perceived the world exactly as it appears on the maps created during the study. It is highly probable that information about cities and countries was also transmitted orally, including through the stories of travelers, merchants, migrants and exiles, which could have influenced the perception of the geographical picture of the world at that time.
Conclusion
This study is one of the first to identify and map named entities in the children's literature section of the Minusinsk Public Library collection of the late 19th and early 20th centuries. In the course of the work, the texts of the Minusinsk Public Library were analyzed using modern geoparsing methods. The study revealed a number of features, such as the uneven distribution of geographical names in the corpus, as well as limitations related to the quality of digitized texts and the technical aspects of automatic entity identification. Comparing the results with other studies in the field of geoparsing shows that, despite the general success of named entity recognition methods, the limitations identified in this work are similar to the problems mentioned in other scientific papers. For example, as in the study [4], NER showed a bias towards better-known and more frequently mentioned locations, which is confirmed by the uneven distribution of place names in our corpus.
The problem of toponymic ambiguity noted in [8] was also reflected in our study, where some toponyms could have multiple meanings or go unrecognized by the system for lack of additional context. The limit on text size in spaCy showed that larger corpora call for more powerful tools such as Flair, TagMe, or the neural network solutions discussed in [9]. It is important to note that our corpus has linguistic specifics that require additional effort to adapt existing NER tools to pre-reform texts, unlike the English-language and other modern corpora analyzed in previous studies [11].
Thus, our research makes a significant contribution to the field of geoparsing and NER for Russian-language texts of the 19th and 20th centuries. Despite the difficulties identified, such as the quality of the source materials and technical limitations, the research opens up new prospects for the development of NLP methods for Russian-language corpora. Further work in this direction may include an analysis of all book editions from the Minusinsk library; a comparison of several corpora, such as the collections of libraries in Siberia and the collections of libraries in the central regions; an analysis of personal communication based on preserved letters; and mapping the borders of countries using historical maps.
References
1. Li, J., Sun, A., Han, J., & Li, C. (2020). A review of deep learning for named entity recognition. IEEE Transactions on Knowledge and Data Engineering, 122-127.
2. Nadeau, D., & Sekine, S. (2007). A survey of named entity recognition and classification. International Journal of Computational Linguistics and Applications, 3-26.
3. Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., & Dyer, C. (2016). Neural architectures for named entity recognition. In Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 260-270).
4. Liu, Z., Janowicz, K., Cai, L., Zhu, R., Mai, G., & Shi, M. (2022). Geoparsing: Solution or bias? Assessing geographic biases in geoparsing. AGILE: GIScience Series, 13.
5. Burgmeister, M. (2022). Measuring urban change in travel texts: The case of Graz in the 19th century. magazen, 3(1), 61-90.
6. Evans, E., & Wilkens, M. (2018). Nation, ethnicity, and the geography of British literature, 1880–1940. Journal of Cultural Analysis, 48.
7. Smail, R., Gregory, I., & Taylor, J. (2019). Qualitative geography in digital texts: Representing historical spatial identities in the Lake District. International Journal of Humanities and Arts Computing, 28-38.
8. Fize, J., Moncla, L., & Martins, B. (2021). Deep learning for toponym recognition: Geocoding based on toponym pairs. ISPRS International Journal of Geo-Information, 16.
9. Tambuscio, M., & Andrews, T. L. (2021). Geolocation and named entity recognition in ancient texts: A case study of the Armenian history of Gevunda. In Humanities Research Conference (pp. 136-148). Amsterdam.
10. Sangiacomo, A., Hogenbirk, H., Tanasescu, R., Karaisl, A., & White, N. (2022). Reading in the fog: High-quality optical character recognition based on freely available digitized early modern books. Digital Scholarship in the Humanities, 37(4), 1197-1209. https://doi.org/10.1093/llc/fqac014
11. Strakova, J., Straka, M., & Hajic, J. (2019). Neural architectures for nested NER via linearization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (p. 6).
First Peer Review
Peer reviewers' evaluations remain confidential and are not disclosed to the public. Only external reviews, authorized for publication by the article's author(s), are made public. Typically, these final reviews are conducted after the manuscript's revision. Adhering to our double-blind review policy, the reviewer's identity is kept confidential.