Reference:
Yumasheva, J.Y. (2025). The possibility of using artificial intelligence in historical research. Historical Informatics, 1, 95–121. https://doi.org/10.7256/2585-7797.2025.1.73578
The possibility of using artificial intelligence in historical research
DOI: 10.7256/2585-7797.2025.1.73578
EDN: PQTZJT
Received: 04-03-2025
Published: 17-04-2025

Abstract: The article is devoted to the controversial problem of using artificial intelligence in historical research. The introduction briefly examines the history of the emergence of "artificial intelligence" (AI) as a field of computer science and the evolution of this definition and of views on the application of AI, and analyzes the place of AI methods at different stages of concrete historical research. In the main part of the article, drawing on an analysis of historiographical sources and on the author's own experience of participation in foreign projects, the author analyzes the practice of implementing handwritten text recognition projects using various information technologies and AI methods: in particular, the article describes and justifies the requirements for creating electronic copies of the sources to be recognized; the need to take into account the texture of information carriers, writing materials, and the techniques and technologies of text creation; the varieties and methods of creating paleographic, codicological, and diplomatic datasets and historical-lexicological dictionaries; the possibility of using large language models, etc. As a methodological basis, the author used a systematic approach, historical-comparative, historical-chronological, and descriptive methods, as well as the analysis of historiographical sources. The conclusion argues that artificial intelligence technologies are promising not only as auxiliary tools but also as research methods that help establish the authorship of historical sources, clarify their dating, detect forgeries, etc., as well as in creating new types of scientific reference and search systems for archives and libraries.
At the same time, artificial intelligence technologies are expensive and capital-intensive, which is a serious obstacle to their widespread introduction into the practice of historical research.

Keywords: artificial intelligence, historical sources, automated text recognition, paleography, codicology, diplomatics, historical lexicology, datasets, large language models, information technologies

This article is an automatic translation; the original Russian text is available on the journal's website.

Introduction

The topic of the current issue, "Artificial Intelligence in Historical Research and Education," set by the editorial board of the journal Historical Informatics, in our opinion calls for clarifying two concepts: "artificial intelligence" and "stages of historical research." Leaving aside the history of the idea of a "thinking artificial being" (a "thinking machine"), which has been in the air since the time of Aristotle, let us turn to more recent times, namely to the term Artificial Intelligence (AI), which was formulated and first introduced into scientific circulation, on the basis of practically implemented projects (including the first neural network model, the Stochastic Neural Analog Reinforcement Calculator (SNARC), developed in the early 1950s [1]), at the Dartmouth workshop [2], held in 1956 in Hanover, New Hampshire. At the time, the organizers of this seminar (John McCarthy, Marvin Minsky, Nathaniel Rochester, and Claude Shannon) and the speakers (among whom were Allen Newell and Herbert Simon, authors of the Logic Theorist program developed in 1955 [3]) did not formulate a definition of the term, rightly believing that a correct definition would first require defining the concept of "intelligence." Nevertheless, even while preparing for the seminar, J.
McCarthy described the subject of discussion as follows: "every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it." This proposition became the basis for identifying, in 1957, the main directions of AI application: "machine translation, machine learning, automated recognition (of images/writing/spoken speech), and decision-making" [4]. Almost 70 years have passed since then. The AI successes of the late 1950s to the 1970s (the Perceptron artificial neural network architecture in 1958 [5]; the first "machine learning" program, Samuel's checkers player, in 1959 [6]; the LISP programming language (LISt Processing) [7]; one of the first computer chess programs [8]*; the ELIZA program [9, 10]**; expert systems; simulation methods [11, 12, 13, 14, 15, 16, 17]***; and so on) were followed by the "AI winter," which lasted almost 20 years and coincided with the rise of the microcomputer industry. Since the mid-1990s, a new wave of interest in AI has emerged, driven by technological shifts, the explosive growth of computing power, the emergence of Data Science, and the development of Data Engineering. By the end of the 1990s, the most general definition of AI had been formulated, one that gave (and still gives) wide scope for applying the term to many data processing technologies: "Artificial intelligence is a field of computer science aimed at creating systems capable of performing tasks requiring human intelligence." The areas of AI application also changed: robotics with integrated AI, the recognition and reproduction of human emotions (the Kismet robot [18]), etc. were added to the four previously defined areas, which retained their importance.
Another 20 years have passed, bringing an enormous increase in computing power, the accumulation of large amounts of data, the emergence of generative AI and large language models (LLMs), the development of new technological solutions, and the introduction of virtual assistants and bots capable of communicating in natural language. All these developments have once again raised the question of clarifying the definition of AI and its areas of application. Diving into AI historiography, it is difficult to disagree with the German researchers A. Kaplan and M. Haenlein, who wrote that "AI is still a surprisingly vague concept, and many questions related to it remain open. … We assume that AI is not one monolithic term," but a set of technologies, the study of which is possible only "through the prism of evolutionary stages (artificial narrow intelligence, artificial general intelligence and artificial superintelligence) or focusing on various types of AI systems (analytical AI, human-inspired AI, and humanized AI)" [19]. The same understanding of the essence of AI is recorded in the definition closest, in the author's opinion, to the content of this article, fixed in the Russian state standard (GOST): "Artificial intelligence is a complex of technological solutions that makes it possible to simulate human cognitive functions (including self-learning and the search for solutions without a predefined algorithm) and to obtain, when performing specific tasks, results comparable at least to the results of human intellectual activity" [20]. Summarizing this brief digression into the history of ideas about AI, we note that the originally defined areas of its application have remained virtually unchanged, while at the same time penetrating ever new branches of knowledge.
This fact allows us to focus on one of the fundamental tasks of AI use, namely text/image recognition methods and the creation of machine-readable sources, as a basic condition for interacting with computer applications in subsequent information analysis. Now let us consider the content of the concept of "stages of historical research." Russian historiography offers several variants of the list of stages of historical research, depending on the field of study; most authors, however, consider it using the example of problem-historical studies of events, processes, phenomena, historical figures, etc., and include the following mandatory items:
1. Choosing a topic based on a preliminary analysis of the historiography;
2. Detailed historiographical analysis and definition of the object and subject (research objectives) of the study;
3. Development of a working hypothesis;
4. Identification and selection of sources in accordance with the object, subject, objectives, and working hypothesis (heuristics);
5. Identification and application of methods for studying the sources (including preparing their information for subsequent analysis) adequate to the source base;
6. Identification and application of methods for analyzing the sources adequate to the sources and the tasks set, and obtaining results;
7. Interpretation (theoretical explanation) of the results obtained;
8. Correction of the initial working hypothesis and formulation of conclusions;
9. Integration of the conclusions into the historical context;
10. Identification of results that do not fit into existing concepts of historical development;
11. Setting a new research task, etc.
Obviously, the use of AI is unlikely in items 1-3.
For that, it would be necessary to digitize all the scientific literature in the world, continuously updating it with new publications; to develop advanced methods for its meaningful analysis (taking into account the traditions of national and even individual scientific schools); and to identify unexplored topics, problems, and questions, i.e. historiographical "gaps" that could become objects of study and that, at the same time, are provided with a source base or could be approached with new research methods. Today such tasks are within reach only of a person working in the chosen subject area and possessing expert knowledge. Approximately the same conclusions can be drawn with respect to items 7, 8, and 9: the use of generative AI to solve these problems cannot, in principle, satisfy researchers, since a generated answer is always based on knowledge already fed into the machine, whereas the desired outcome of any scientific research is new conclusions ("the increment of scientific knowledge," © L. I. Borodkin****). Items 10 and 11 are in a sense "optional" and/or can be covered in the narrative description of the research results. The set of research stages in the equally important historical research in the field of source studies and the auxiliary historical disciplines does not differ much from the list of stages of problem-oriented work; the research methods there, however, are marked by greater interdisciplinarity, i.e. the use not only of information technology but also of methods of the natural sciences, which brings the results of such studies to the level of empirical knowledge.
Thus, the use of AI in historical research is currently limited to the following stages:
- identification and selection of the source base (item 4): the creation of new varieties of archival scientific reference apparatus in the form of datasets (knowledge bases), semantic search engines and, possibly, systems using generative AI***** [21]. The introduction of AI at this stage marks a new phase in the development of the system of scientific reference tools for archives, as well as of library and archival heuristics (the heuristics of the information age);
- the use of specialized methods for preparing source complexes for subsequent study (item 5): converting historical sources into digital form, forming datasets and terminological dictionaries, developing large language models (LLMs) based on historical vocabulary, etc. At this stage, the use of AI results in the transformation of source studies and the auxiliary historical disciplines and in the formation of research tools adequate to the digital environment (digital paleography, diplomatics, sphragistics, codicology, filigranology, historical lexicology, etc.) [22, 23];
- the direct use of AI methods to analyze the prepared sources and obtain results (item 6) within problem-historical research, for their further synthesis, interpretation, and integration into the historical context at the subsequent stages.
Obviously, at each of these stages the use of AI methods will have limitations determined by the specifics of the source base and the objectives of the research. Concluding this lengthy Introduction, we note that the author will devote the further discussion to the possibilities of using AI for the development of source studies and the auxiliary historical disciplines, for text and image recognition in historical sources, and for their preparation for analytical procedures.
Thus, the proposed material is an overview and generalization of the approaches and problems found in world practice and of the author's own experience of participation in projects, already implemented or currently under way, whose purpose was to create fully recognized (with an accuracy of at least 95-97%) machine-readable texts of handwritten and typewritten source complexes for their presentation in an electronic environment, and to form specialized datasets that can be both a research goal in themselves and auxiliary tools in large-scale recognition projects. (We emphasize that the author will not analyze projects for the targeted recognition of individual text elements, or text recognition carried out as an auxiliary stage of problem-historical research.)
Datasets, machine learning, AI and other information technologies in the automated recognition of handwritten and typewritten sources (using the example of text recognition)

The problem of automated image recognition (of texts in general and handwritten texts in particular) is not new. It first appeared more than 50 years ago, in the mid-1970s, when scientific conferences began to be held abroad [24, 25, 26, 27, 28, 29], professional associations emerged [30, 31] bringing together historians and computer scientists, and thematic serial publications and collections of articles began to appear [32, 33]. In them, views are expressed both by those who constantly participate in such projects and by those who analyze them "from afar," on the basis of other people's publications and presentations, without ever having tried to properly prepare an electronic copy of an archival document; to create the various datasets (paleographic, codicological, diplomatic, historical-lexicographic, etc.) needed to train neural networks; to attract expert knowledge and existing collections of standards (paper and writing materials); or to mark up a text in accordance with the traditions of diplomatics or with regard to the variety of printed document forms. Unfortunately, the latter approach is the most common in Russian historiography and is fraught with sad consequences: it not only undermines faith in the possibilities of information technology and reduces the solution of the problem exclusively to brainstorming in the form of hackathons in search of "miracle algorithms," but also fails (and refuses) to notice the huge preparatory work carried out by historians, archivists, philologists, and linguists, which underlies any successful project for the recognition of written sources.
We should note at once that in most text recognition projects (including those involving handwritten documents), success is divided in a proportion of 70% to 30%: 70% falls on the preparation of the arrays of historical sources for conversion to digital form using a variety of information technologies and on the formation of auxiliary datasets for machine learning, and only 30% on the actual AI recognition methods, among which neural networks occupy first place. In general, the use of information technology (including AI) for automated text/image recognition can be divided into three groups:
- the use of various IT methods to create and improve the quality (technical parameters) of the electronic copies to be recognized (the preparatory stage of work, related to the external features of the originals reproduced in the electronic copies);
- the use of IT (including AI) to generate specialized datasets for various purposes and to train AI algorithms on the prepared datasets;
- the actual use of AI algorithms for recognition itself.
In this article, we will briefly describe the main types of preparatory work, the technologies used, and the problems of creating electronic copies of sources (ECS) and datasets, without which it is difficult to obtain a satisfactory result in the text recognition of historical sources. (As D. N. Antonov, Director of the State Archive of the Tula Region, rightly noted in his report at the Round Table held at VNIIDAD on 04/10/2023, a key requirement for the implementation of recognition projects is a source-based approach and the related methods of scientific criticism of the sources intended for processing [34].) Leaving aside the evolution of text recognition software itself, from the software applications of the mid-1970s to early 1990s to the engines and platforms operating on various types of neural networks, we will mention only individual software solutions used to automate the creation of ECS and datasets, machine learning, and recognition.

1. Creation and processing of electronic copies of written sources suitable for automated recognition projects

1.1. Resolution, scan mode, format

In recent years, the opinion has become established in Russian historiography that any electronic image of a written source can be used for automated recognition. Meanwhile, based on the experience of implementing a large number of projects for the scientific study of manuscripts, incunabula, and early printed books [35], as well as text recognition projects, the optimal scanning characteristics were determined at the turn of the 2010s: a scanning resolution of at least 400 dpi for A4 documents (ideally 600 dpi), the "grayscale" scanning mode, and saving files in TIFF format (smaller documents are scanned at a higher resolution). These parameters have a clear rationale, related to:
- the ratio of the average size of a lowercase character (letter or numeral) to the size of the paper, and the presentation of this ratio on the monitor screen at a magnification of at least 200%, which makes it possible to distinguish thin lines in letters/numerals;
- the basic principle of optical recognition, which rests on assessing (comparing) the brightness and contrast of the pixels that make up a computer image [36]. Characters written in ink or printed on a typewriter, a printer, by typographic means, etc. will be the darkest areas (groups of pixels) in the image compared to the blank paper field.
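To make this pixel-brightness principle concrete, the following toy sketch compares grayscale thresholding with hard binarization on a synthetic scanline; all pixel values and both thresholds are illustrative assumptions, not parameters of any real recognition system:

```python
import numpy as np

# A toy 1-D "scanline" of 8-bit grayscale pixel values crossing a letter stroke:
# 255 = blank paper, low values = ink. The faint "tail" of a stroke written
# without pressure shows up as mid-gray values (e.g. 180).
scanline = np.array([255, 250, 180, 90, 20, 15, 85, 180, 248, 255], dtype=np.uint8)

# Grayscale mode keeps the full brightness gradient, so the faint edges of the
# stroke (the 180-valued pixels) remain distinguishable from blank paper.
grayscale_ink = scanline < 200          # generous threshold on the gray image

# Naive binarization with a hard mid-range threshold "crops" exactly this
# information: the faint tail pixels are classified as blank paper.
binarized_ink = scanline < 128

print("ink pixels kept in grayscale:", int(grayscale_ink.sum()))   # 6
print("ink pixels after binarization:", int(binarized_ink.sum()))  # 4
```

The two faint edge pixels lost to binarization are precisely the "parts of letters written without pressure" discussed below, which is why the grayscale mode is preferred.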
Given that characters (letters/numerals) vary in line thickness and "ink saturation" (contrast), even in typewritten text, their "boundaries" and outlines are determined more accurately in the "grayscale" digitization mode, which better captures the nuances of brightness and contrast and makes it possible to establish the shape of a character by "capturing" the lightest of the dark pixels that make up its elements (for example, "flying off" strokes, the "tails" of letters, parts of letters written without pressure). Obviously, the color scanning mode adds "extra" information to the analyzed image, since the "composition" of each of the RGB or CMYK color shades displayed on the monitor screen includes brightness and contrast in one proportion or another; in the black-and-white mode, to which many authors recommend converting color images, the necessary information about brightness and contrast is cropped away (the "binarization" process should be used only when it is impossible to create a proper target set of images). Thus, both modes distort and coarsen the image and are a source of recognition errors. Commenting on the TIFF format seems superfluous; we only note that this format preserves the image almost without distortion, which is extremely important for recognition.

1.2. The need to take into account the texture of writing materials, the ways of fixing textual and pictorial information, and the degree of preservation of the documents to be recognized

These issues are usually assigned to the subject area of restorers and are not taken into account when creating electronic copies and developing recognition systems. However, experience shows that a lack of understanding of the physical features and mechanisms of document creation leads to errors in the choice of scanning and recognition tools.
Let us look at some practical examples by analyzing the carriers of written information (clay tablets, stone surfaces, papyrus, parchment, bombycine, paper, tracing paper, etc.). Each of these carriers has its own specifics, but the most common and most "underestimated" in terms of problems is, of course, paper. Many errors in recognizing handwritten texts created before the beginning of the 19th century (in Russia, until the end of the first third of the 19th century) stem from the fact that the rag paper used at that time has an uneven, "lumpy" surface caused by the manual grinding of raw materials and the manual "casting" of paper sheets. Such a surface absorbs ink differently on different parts of the same sheet, which affects the shape of the letters (they can blur; their sizes and borders "float"; the ink soaks in and bleeds through, creating "garbage" on the back of the sheet, etc.), and the brightness and contrast in the image become less distinct. Similar problems may arise with paper cast in small manufactories, where rather thick wire was used in the mesh molds as laid and chain lines (vergeures and pontuseaux) and to create the filigree (watermark) pattern [37], leaving grooves of less dense paper in the finished sheet. The invention of Robert's paper machine (1799) and its introduction into manufacturing improved the quality of paper [38, 39], and the paper of 19th-century documents no longer creates such problems for text recognition. However, the density of the paper itself remains a significant problem for automated recognition. Particularly difficult are documents written on both sides of thin sheets (density less than 60 g/m2), on tracing paper (density less than 40 g/m2), on calico, or created with carbon paper (the second and subsequent copies), etc.
"Bumps," lines showing through from the back of the sheet, ink or relief (especially punctuation marks) protruding through the paper (the so-called "artifacts," "debris," and "noise" in an electronic image) cannot be dealt with during text recognition itself, since no graphic processing or line/text markup is able to compensate for these shortcomings; they must be addressed at the stage of scanning the documents, by selecting color-matched backing sheets, the use of which eliminates 85-90% of "artifacts" and thereby yields an electronic copy of satisfactory quality for recognition purposes. In the context of analyzing the features of written sources that directly affect the quality of automated recognition, the technique of fixing the characters should also be mentioned. Obviously, inscriptions on stone (epigraphic sources), clay tablets, birch bark, parchment, etc. are embossed, and the loss of the paint layer (the fading of the text) does not significantly hinder recognition. Inscriptions on paper made with a pen, a sharpened pencil, or a typewriter are also embossed. These reliefs ("trails") are classified as "low," and identifying them requires specially designed scanning equipment with subsequent graphic processing [40]. This technology, along with specialized methods of multispectral [41, 42] and hyperspectral photography and analysis, various kinds of spectroscopy, etc., is successfully used to prepare electronic copies of faded manuscript texts for recognition, as well as to reveal erased records on parchment (palimpsests). The only technique that currently poses an unsolvable problem for scanning and recognizing faded texts is the facsimile (fax) transmission of information.
This problem stems from the mechanism by which text or images are created on fax thermal paper: the dyes that form the text/image are activated by heating (a chemical reaction, melting), and there is no physical impact on the carrier (the paper), so no relief is produced. Summarizing the scanning stage, we emphasize once again that ignoring the texture of the carriers and the poor-quality preparation of arrays of electronic copies intended for automated recognition cause a large number of problems that cannot be solved by software at the subsequent stages of the study.

2. Datasets, their varieties, and the specifics of their creation

The creation of datasets for the machine learning of all types of automated recognition programs, and the machine learning itself, are the second and third stages of preparatory work, the most important and the longest, lasting from several months to several years. In recent years, these stages have, as a rule, been the most "closed" and rarely publicized parts of projects; they are determined by the peculiarities of the complex of sources to be recognized and the tasks facing the researchers, and they rest on the expert knowledge not only of historians, source scholars, paleographers, and specialists in other auxiliary historical disciplines, but also of restorers and computer scientists. We emphasize that various information methods (including machine learning and AI) find application in these preparatory stages of dataset formation [43]. Let us briefly describe the types of datasets that need to be prepared for machine learning systems for the recognition of written historical sources.

Paleographic datasets

In the 2000s to mid-2010s, the creation of paleographic datasets and their publication on the Internet was one of the most popular areas in professional historical research.
During this period, for example, one of the first paleographic datasets for machine learning was created, MNIST [44], which accumulated handwritten Arabic numerals (the dataset has been updated several times over the past 20 years), as was the so-called Medieval Paleographic Scale, a dataset for dating historical documents of the Netherlands and Flanders of the period 1300-1550 at 25-year intervals [45], which became a model for similar sets in many European countries and for the development of approaches to handwriting classification [46]. Both personalized handwriting datasets and typological paleographic models were created for different European and Asian languages and chronological periods [47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60]. The created datasets became the basis for the development of online platforms for automated text recognition such as Transkribus, eScriptorium, and Tesseract (in Europe), and Hentaigana [61], KuLA (くずし字学習支援アプリ) [62], and MOJIZO (木簡・くずし字解読システム) [63] in Japan [64]. The purpose of forming paleographic datasets is obvious: they serve as training material for automated recognition algorithms for medieval and modern texts (i.e., they are "instrumental in nature"), and they are also used as a research method in projects studying specific archival collections and the manuscripts of historical figures [65, 66] and in forming autograph collections [67], including, for example, establishing/confirming authorship [68, 69, 70, 71, 72], identifying manuscript forgeries [73, 74], and clarifying dates [75, 76]; for these purposes, datasets are created that characterize the evolution of a particular person's handwriting throughout his life. Such datasets are widely used in foreign historiography, and they have also been used in Russia (in particular, KHATT [77], available free of charge) [78], etc.
Currently, there are "standard" paleographic datasets for the Latin, Greek, and Arabic alphabets, Western European and Cyrillic scripts, Hebrew, Armenian, Georgian, and other alphabets. These sets are being developed and refined with regard to the specifics of national languages and particular historical periods [79, 78]. At the same time, it must unfortunately be noted that, owing to the complexity of creating paleographic datasets, no language today has a complete set of paleographic models covering the entire period of the existence of its national writing, which makes it difficult to carry out the total recognition of textual sources and their presentation in machine-readable form. This situation has led to a new trend, the development of artificially created datasets based on generative AI [81], the possibilities of whose widespread implementation still need to be studied. Since the early 2000s, foreign universities, archives, and libraries have been creating and developing projects for publishing collections of electronic copies of manuscripts and annotated (labeled) paleographic/pictorial datasets of fonts, handwriting [82, 83], and illustrations. Many of these works are presented on general portals or project sites [84, 85, 86], implemented in the form of databases, online catalogs [87], linked open data [88, 89], online platforms and online textbooks/courses for teaching students [90, 91], models of image-text paleographic sets [92], open publications of datasets, and datasets for developing special tools for working with manuscripts [93, 94, 95, 96] and for deciphering shorthand abbreviations ("Tironian notes" [97]), contractions, etc.
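To illustrate what an "annotated" (labeled) paleographic record can contain, here is a minimal sketch in Python; all field names and values are hypothetical and do not reproduce the schema of any of the projects cited above:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class PaleographicSample:
    """One labeled record of a hypothetical paleographic dataset."""
    image_path: str              # electronic copy of a manuscript line/page
    transcription: str           # expert ground-truth transcription
    script_type: str             # expert-assigned script class
    date_range: Tuple[int, int]  # dating interval assigned by an expert
    scribe_id: Optional[str] = None  # filled in when attribution is known

# A hypothetical record: the path, text, and labels are invented for illustration.
sample = PaleographicSample(
    image_path="scans/ms_0042_f12r_line03.tif",
    transcription="in nomine domini amen",
    script_type="cursiva",
    date_range=(1300, 1325),  # a 25-year interval, as in the Dutch/Flemish scale
)
print(sample.script_type, sample.date_range)
```

Collections of such records pair each image with its expert labels, which is exactly what makes them usable both as training material for recognition engines and as evidence in dating and attribution studies.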
In Europe and the Far East (China, Japan, etc.), automated recognition models of their own are actively being developed to form knowledge bases (datasets) of specific typefaces and of the Oriental ideographic writing systems, for example, the OCR system created in Germany for reading the Fraktur Gothic typeface [98], or the RURI system [99] in Japan, the latter based on the IIIF international image exchange platform and deep learning. Datasets reflecting the paleographic features of characters on various carriers (silk, paper, bamboo strips, bronze plates, ceramics, etc.) and in different calligraphic styles are placed on the portals and websites of large projects [100, 101, 102]. As a complement to and expansion of the created resources [103, 104] where such datasets are published, lexicographic (thematic) datasets are also posted on these sites, facilitating the recognition process. Unfortunately, there is practically no work on creating paleographic systems and datasets in Russia. The only related project, "The History of Writing of European Civilization" [105], developed at the St. Petersburg Institute of History of the Russian Academy of Sciences (URL: https://gis.spbiiran.ru/), contains a collection of digitized historical sources; however, the description offered for each document and the extremely poor image quality do not allow paleographic sets to be formed. Having listed some projects for the formation of paleographic datasets, let us focus on the problems that arise in their creation. A serious problem is the quality of writing materials (ink, paints, India ink, pencils, printing ink, dye fillers for typewriter ribbons, printer cartridges, etc.) and their brightness and contrast in electronic copies, which directly depend on their black content [106]. The fading of iron-gall ink, the crumbling of graphite pencil strokes, and the dullness of the dyes of typewriter ribbons, printer cartridges, etc., i.e. the so-called "fading of the text," are among the reasons for using the grayscale scanning mode, multispectral photography and image analysis methods [107], and/or the development of methods for the graphic processing [108] of electronic copies in graphic editors such as Adobe Photoshop or IrfanView. The author knows of at least several solutions developed for these software applications on the basis of datasets generated for different types of texts. Unfortunately, such developments are always strictly source-oriented, since they are closely tied to specific sets of documents and their degree of preservation; they are "technical" aspects of recognition projects and, as a rule, are not even mentioned in articles. The question of the writing instruments of the authors of written sources (the calamus, quill and steel pens, fountain pens, ballpoint pens, felt-tip pens, typewriters, etc.), of the peculiarities of the scribe's handwriting, and/or of the methods of creating written sources deserves special consideration. For example, a left-handed European scribe and a right-handed one will write letters with different pressure: in letters written by a left-handed person, the most ink-saturated part (i.e., the darkest and most contrasting in the electronic image) will be on the right, and for a right-handed person, on the left. Given the basic principle of recognition mentioned above, this "mirroring" will create problems in "identifying" letters and will require the creation of specialized datasets and the further training of HTR programs. (The problem of pressure has almost disappeared with the widespread use of ballpoint pens, gel pens, and markers.) Invented at the end of the 19th century and remaining the most popular means of creating official documents for most of the 20th century, mechanical typewriters certainly improved the "human-readability" of written sources, but at the same time they created new, specific obstacles to automated text recognition. The clarity (brightness and contrast) of character reproduction in documents came to depend directly on the "freshness" of the ribbon, the cleanliness of the raised type on the typebars, the individual striking force of each of the typist's fingers (with the ten-finger touch-typing method; striking force ceased to matter in the second half of the 20th century, after the invention of the electric typewriter), the density of the paper, and the use of carbon paper to create multiple copies of a document. It should be added that each individual typewriter was fitted with a set of type bearing almost imperceptible yet unique differences in its impression, which made it possible to identify a specific typewriter uniquely, but which now poses an additional problem when creating paleographic datasets for recognition [109]. The combination of these features of typewritten documents is the main reason why many recognition software tools recognize twentieth-century documents worse than handwritten sources.

Codicological and diplomatic datasets

The obligatory datasets generated for the automated text recognition of handwritten documents and early printed book monuments relate to codicology and diplomatics.
The typefaces of handwritten and printed books (form and size); headpieces and initials; ornaments and illustrations; decorative lettering; printers' marks (initials, names and numbers); the proportions of the volume itself and of the text block; the configuration of the typeset text; the sizes and ratio of the margins on the page; the number of lines on a page; the line format; tailpieces; letter, word and line spacing (the number of lines on a page, the number of letters per line); the means of visually dividing the text (paragraph indentation, hanging indents, spacing, font and color highlighting, heading elements, marginalia); page numbers, running titles, column rules, foliation, pagination, etc.: all these elements are objects for forming knowledge bases (datasets [110, 111]) that help not only to attribute manuscripts and book monuments and confirm their authenticity, but also to carry out automated markup, to extract text fragments and lines [112, 113], to establish which scribes took part in creating a manuscript [114], and to refine the dating of written sources [115]. To date, several projects in Russia and abroad [116] have actively used codicological datasets and developed special software applications based on them [117, 118]. Equally important are diplomatics datasets, which make it possible to analyze the "templates" of documents (the forms and formulas, i.e. structural parts of the text, studied by formulary analysis, a direction within digital diplomatics [119]), for example by means of image-in-image technology [120, 121], and thereby to facilitate automated text segmentation [122, 123]. Such developments are especially relevant for document types such as letters, acts, charters and registers [124], for texts created in the Middle Ages [125], and for documents written on pre-printed forms with graphic elements (tables, corner stamps, rules, diagrams, etc.) that interfere with recognizing the text itself.
Until recently, specialized software tools based on so-called "computer vision" were created to process such sources: they "removed" the unnecessary elements from the image and left only the text. Currently, recurrent (RNN) [126], convolutional (CNN) [127] and other types of neural networks [128] are used to solve this problem; they are trained on diplomatics datasets [129] that accumulate and describe the layout variants of documents. It should be emphasized that creating diplomatics datasets is relevant for work with documentation created before the middle and second half of the 20th century, when many countries (including the USSR) developed and put into effect standards for the design of administrative documentation, which, together with uniformly established paper formats and technical means of document creation, unified record keeping. Special mention should be made of the datasets and technologies used for segmenting and recognizing texts and images on newspaper pages [130], cartographic sources [131, 132, 133, 134, 135, 136, 137, 138], and drawings and diagrams in scientific and technical documentation [139, 140, 141] and in scientific articles [142, 143], all of which have always posed problems for recognition. The active scanning and publication of such sources has allowed specialists to create the necessary datasets, and the development of neural networks over the last decade has allowed them to begin solving the problems of recognizing such sources. At the same time, among the most complex types of documents for automated recognition are filled-in forms of various questionnaires (including primary population census sheets), tax returns and other mass documentation from the middle of the 19th to the third quarter of the 20th century, which combine graphic elements, handwritten texts and ambiguously interpreted symbols, for example marks in logical fields that can contain any sign.
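The "computer vision" approach mentioned above, stripping printed elements from an image so that only the text remains, can be illustrated with a toy sketch. The function, thresholds and synthetic "page" below are purely hypothetical; real systems use far more elaborate morphological or neural methods:

```python
def remove_horizontal_rules(img, min_run=20):
    """Zero out horizontal ink runs longer than min_run pixels.

    img is a list of rows of 0/1 ints (1 = ink). Long unbroken horizontal
    runs are assumed to be printed rules rather than handwriting strokes.
    """
    out = [row[:] for row in img]
    for y, row in enumerate(img):
        start = None
        for x in range(len(row) + 1):
            ink = x < len(row) and row[x] == 1
            if ink and start is None:
                start = x                      # a run of ink begins
            elif not ink and start is not None:
                if x - start >= min_run:       # long run: treat as a rule
                    for i in range(start, x):
                        out[y][i] = 0
                start = None
    return out

# Synthetic page: one long printed rule and one short handwriting stroke.
page = [[0] * 40 for _ in range(10)]
for x in range(5, 35):
    page[5][x] = 1   # printed rule, 30 px wide
for x in range(10, 14):
    page[2][x] = 1   # handwriting stroke, 4 px wide
cleaned = remove_horizontal_rules(page)
```

After the call, the long rule on row 5 is erased while the short stroke on row 2 survives, which is exactly the behavior the pre-neural tools aimed for.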
Developed in order to obtain statistical data and aggregate information for subsequent analysis, these forms are very convenient for mathematical processing but extremely complex and costly for the development of automated recognition systems. For example, the Socface project launched in France in 2022 (URL: https://socface.site.ined.fr/) aims to create a database of all people who lived in France between 1836 and 1936. Its source base consists of electronic copies of the nominal lists of 20 population censuses of all the departments of France, which are to be recognized automatically using AI. Unfortunately, the project has so far had little success with text recognition. Given the complexity of working with mass questionnaires, most large archives use volunteers and crowdsourcing platforms to enter the information [144].

Historical and lexicological datasets: dictionaries

Lexicological datasets (dictionaries) accumulate information about proper names, the names of geographical objects (onomastics and toponymy), institutions, abbreviations, the terminology of subject areas, etc., and about their evolution as reflected in the texts of documents. Unfortunately, the existing ISO standards cannot be used to form such resources, since the dictionaries are determined by the content of the historical sources and by the historical reality in which the documents were created; such datasets therefore have to be formed specially for each set of sources.
Significant problems in creating historical-lexicological datasets include, for example:
- differences in the naming of people, phenomena and processes in different historical periods;
- the different composition of name elements among different peoples;
- homonymy, i.e. the coincidence of names, titles and terms;
- terminological ambiguities caused by polysemy (the multiple meanings of words and names);
- grammatical errors or specific spellings of names and terms determined by time, place, orthographic norms, etc.;
- linguistic melange (the mixing and combining of words from different languages within one text fragment);
- as well as a wide variety of spellings of one and the same name, title or term, making its exact identification difficult.
It is obvious that these difficulties can be overcome only by detailed historical, source-study, paleographic and textual analysis that ensures the unambiguous identification of a specific entry in the dataset and, at the same time, its fullest possible description [145]. The name and subject indexes traditionally kept in archives can provide some help in creating such datasets; experience shows, however, that mechanically transferring these indexes into an electronic environment does not give the desired result, and they need special processing and preparation [146]. In essence, "dictionary" datasets are elements of historical lexicology, whose purpose is to describe a historical metalanguage reflecting the evolution of the vocabulary of the language(s) in which the documents are written; they form a kind of transitional stage from recognizing the texts of sources to their semantic analysis (including analysis using large language models). Unfortunately, the formation of historical metalanguages is far from complete even in European countries, so using LLMs based on modern vocabulary to study historical problems is ineffective [147].
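One simple way to cope with the variant spellings listed above when compiling a dictionary dataset is approximate string matching. A minimal sketch using Python's standard library; the surname variants and the canonical form are invented for illustration:

```python
import difflib

# Hypothetical variant spellings of one surname, as they might occur
# across documents from different periods and scribes.
variants = ["Feodorov", "Fedorov", "Fiodoroff", "Teodorov"]
canonical = "Fedorov"

def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1], based on longest matching subsequences."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Score each variant against the canonical entry; in practice a threshold
# (chosen per dataset) would decide which variants to merge under one entry.
matches = {v: round(similarity(v, canonical), 2) for v in variants}
```

Real projects would combine such scores with the historical, paleographic and textual analysis described above, since string similarity alone cannot distinguish homonyms from genuine variants.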
(A way out of the impasse may be the use of small language models, although the notion of "small" is constantly changing: Phi-3 and Phi-4 from Microsoft, Llama 3.2 1B and 3B, Qwen2-VL-2B, DeepSeek, etc.)

3. The machine learning stage

At this stage, the accuracy of the created datasets is checked, along with their representativeness and, as a result, their applicability to the entire array of sources prepared for recognition. The datasets created at the previous stages are usually divided into three parts:
- a part intended for use in training the model;
- a part used to verify the various parameters and settings of the model, in order to determine whether the set needs refinement and the recognition algorithms further training (so-called model tuning);
- a part for testing the final version of the trained model on a test array of sources.
The view has become established in the historiography that the larger the dataset, the more successfully it performs its task. Meanwhile, Stanford University professor Andrew Ng, a leading authority on machine learning (URL: https://www.andrewng.org/), has rightly noted that data quality, and a data-centric rather than model-centric approach to AI, matter far more ("in many industries where giant data sets simply don't exist, I think the focus has to shift from big data to good data. Having 50 thoughtfully engineered examples can be sufficient to explain to the neural network what you want it to learn" [148]). This point of view is currently being confirmed in practice in work on text recognition of ancient and medieval manuscripts, in epigraphy, etc., and the annotated datasets created there are becoming the basis for a variety of research and reuse.
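The three-part division described above can be sketched as follows; the proportions and the seeded shuffle are illustrative, and real projects choose them per dataset:

```python
import random

def three_way_split(items, train=0.7, val=0.15, seed=42):
    """Split a dataset into the three parts described above:
    training, validation (model tuning), and final testing.
    A fixed seed makes the split reproducible."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train = int(n * train)
    n_val = int(n * val)
    return (items[:n_train],                      # training part
            items[n_train:n_train + n_val],       # tuning/validation part
            items[n_train + n_val:])              # held-out test part

# Example: 100 annotated samples split 70 / 15 / 15.
train_set, val_set, test_set = three_way_split(range(100))
```

Note that every sample lands in exactly one of the three parts, so the test array stays untouched during training and tuning, which is what makes the final accuracy check meaningful.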
Instead of a conclusion

The task of automated recognition of texts, images, audio, etc., as one of the key areas of application of artificial intelligence, has been the focus of attention of specialists in many professions for the last 70 years, but its final solution is still very far away. It is obvious that the use of any information technology in historical research, including OCR, HTR, AI in general and neural networks in particular, rests on the expert knowledge of historians, source-study specialists, art historians, restorers and computer scientists. The various types and variants of datasets they have created, and the tools they have developed, have been and are being used not only in projects of automated text recognition of written sources; they also have informational value of their own, since their use is not limited to an "instrumental" (auxiliary) role in converting handwritten text into machine-readable form, but also "works" for the development of archival heuristics and for the analytical stage of concrete historical research [149]. The results of this preparatory work constitute an "increment of historical knowledge" and a basis for the active development of historical science and its entry into a new stage of development corresponding to the current level of the information age [150]. Unfortunately, the author is forced to state that Russian historical science (with rare exceptions), by focusing on problem-oriented research, missed the period (mid-1990s to early 2010s) that was favorable for creating datasets and is now in a catching-up position. This situation can be corrected by paying greater attention to source studies and archival studies and by introducing new approaches and methods into these historical disciplines, as discussed in the speech of E.I. Pivovar, Academician-Secretary and Head of the History Section of the Department of Historical and Philological Sciences of the Russian Academy of Sciences, at the III St. Petersburg Historical Forum in October 2024.
Notes

1. This program lost a match to a similar Soviet program for the M-2 computer, developed in a laboratory of the Institute of Theoretical and Experimental Physics (ITEP) in Moscow (laboratory head A. Kronrod).
2. The first chatbot, ELIZA, was developed in the mid-1960s by J. Weizenbaum. The bot could communicate with a human in natural language, imitating the work of a psychotherapist. For a long time ELIZA was believed lost; however, based on surviving printouts of Weizenbaum's code, the chatbot has been restored and made available online: ELIZA Archaeology – Try ELIZA // URL: https://sites.google.com/view/elizaarchaeology/try-eliza
3. From the 1970s to the early 1990s, Russian historical science carried out research in which AI methods and technologies were used to varying degrees. Among these works, one should mention the research of a group of mathematicians led by academician N.N. Moiseev (in particular, simulation modeling: the model of the Battle of Sinop, and the modeling of the economic dynamics of the Greek poleis during the Peloponnesian War of the 5th century BC [11, 12]); the work of V.B. Lukov and V.M. Sergeev (building a model of a historical figure's perception of a situation and decision-making, based on content analysis of the memoirs of Otto von Bismarck [14]); methods of counterfactual modeling of economic development (the monograph by Yu.P. Bokarev [15]); and the "Retro-forecast" system and the expert systems AMSOR [16] and "HYDRONYMICON" [17], etc. Unfortunately, for various reasons the history of the development and application of AI in the USSR in general, and in historical science in particular, is less well known and studied than similar foreign subjects.
4.
Given that historical science is multifaceted and versatile, the "increment of scientific knowledge" can mean not only the identification and reinterpretation of historical facts, events, phenomena, processes and the participation of people in them, but also the solution of theoretical and applied problems of source studies, historiography, heuristics and other auxiliary historical disciplines.
5. One of the most significant domestic works on the application of AI in archival heuristics is the complex of artificial-intelligence systems being developed at the State Archive of the Russian Federation, presented in A.A. Kolganov's report "The evolution of the application of artificial intelligence in the State Archive of the Russian Federation: 2021-2024" at the XIX Conference of the Association "History and Computer" on November 15, 2024 [21].

References
1. Minsky, M. (1952). A neural-analogue calculator based upon a probability model of reinforcement. Harvard University Psychological Laboratories. Retrieved from https://www.mit.edu/~dxh/marvin/web.media.mit.edu/~minsky/Bibliography.html
2. Dartmouth AI archives. (n.d.). Retrieved from https://raysolomonoff.com/dartmouth/dart.html 3. Newell, A., & Simon, H. A. (1956). The logic theory machine: A complex information processing system. RAND Corporation. Archived copy from October 17, 2014, on Wayback Machine. Retrieved from https://archive.org/details/bitsavers_randiplP86ineJul56_3534001/mode/2up 4. McCarthy, J. (n.d.). John McCarthy's home page. Retrieved from https://www-formal.stanford.edu/jmc/ 5. Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65, 386-408. Archived copy from October 17, 2014, on Wayback Machine. Retrieved from https://web.archive.org/web/20080218153928/http://www.manhattanrarebooks-science.com/rosenblatt.htm 6. Samuel, A. L. (2000). Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 44(1.2), 206-226. https://doi.org/10.1147/rd.441.0206 7. McCarthy, J. (1960). Recursive functions of symbolic expressions and their computation by machine. Communications of the ACM. Archived copy from October 17, 2014, on Wayback Machine. Retrieved from https://web.archive.org/web/20131006003734/http://www-formal.stanford.edu/jmc/recursive.html 8. A chess playing program for the IBM 7090 computer. (n.d.). Retrieved from https://dspace.mit.edu/handle/1721.1/17406 9. Killgrove, K. (2025, January 18). "ELIZA," the world's 1st chatbot, was just resurrected from 60-year-old computer code. Live Science. Retrieved from https://www.livescience.com/technology/eliza-the-worlds-1st-chatbot-was-just-resurrected-from-60-year-old-computer-code 10. Lane, R., Hay, A., Schwarz, A., Berry, D. M., & Shrager, J. (2025, January 12). ELIZA reanimated: The world's first chatbot restored on the world's first time sharing system. Retrieved from https://arxiv.org/abs/2501.06707 11. Moiseev, N. N. (1979). Mathematics conducts an experiment. Nauka. 12. Guseynova, A. S., Pavlovskiy, Y. 
N., & Ustinov, V. A. (1984). Simulation modeling of the historical process. In N. N. Moiseev (Ed.), Experience of simulation modeling (pp. 1-157). Nauka. 13. Cognitive methods abroad: Artificial intelligence methods in modeling political thinking. (1990). In V. M. Sergeev (Ed.), Collection of articles (pp. 1-148). Institute of USA and Canada. 14. Lukov, V. B., & Sergeev, V. M. (1983). Experience in modeling the thinking of historical figures: Otto von Bismarck, 1866-1876. In D. A. Pospelov (Ed.), Questions of cybernetics: Logic of reasoning and its modeling (pp. 149-172). Nauka. 15. Bokarev, Y. P. (1989). Socialist industry and small peasant farms in the USSR in the 1920s: Sources, research methods, and stages of relations. In I. D. Kovalchenko (Ed.), 148-166. 16. Borodkin, L. I. (n.d.). What computers have done for historical science. Arzamas. Retrieved from https://arzamas.academy/materials/2284 17. Kramov, Y. E. (1992). Hydronymikon-an expert system for the hydronymy of the East European plain. Information Bulletin of the Commission for the application of mathematical methods and computers in historical research, 5. 18. Kismet. (2000, October 17). Archived copy from October 17, 2014, on Wayback Machine. Retrieved from http://www.ai.mit.edu/projects/humanoid-robotics-group/kismet/kismet.html 19. Kaplan, A., & Haenlein, M. (2018). "Siri, Siri in my hand, who's the fairest in the land?" On the interpretations, illustrations, and implications of artificial intelligence. Business Horizons, 62, 15-25. https://doi.org/10.1016/j.bushor.2018.08.004 20. GOST R 59895-2021. (2021). Technologies of artificial intelligence in education: General provisions and terminology. 21. Kolganov, A. A. (2024). The evolution of the application of artificial intelligence in the Russian State Archive (2021–2024). Information Bulletin of the Association "History and Computer," 51, 7. 22. Yumashova, Y. Y. (2022). 
Digital transformation of auxiliary historical disciplines: Modern non-invasive methods for studying historical artifacts [Video lecture]. International Summer School for Young Scientists "Historical Informatics–2022." Retrieved from https://www.youtube.com/watch?v=jWUw8fWMcqw 23. Yumashova, Y. Y. (2023). Digital transformation of auxiliary historical disciplines [Video lecture]. International Summer School for Young Scientists "Historical Informatics–2023." Retrieved from https://www.youtube.com/watch?v=4HQezjps7ig 24. International Conference on Document Analysis and Recognition (ICDAR). (n.d.). Retrieved from https://www.icdar.org/ 25. International Conference on Frontiers in Handwriting Recognition (ICFHR). (n.d.). Retrieved from http://www.iapr-tc11.org/mediawiki/index.php/Conferences 26. International Conference on Pattern Recognition Systems (ICPRS). (n.d.). Retrieved from https://www.icprs.org/ 27. International Conference on Pattern Recognition and Artificial Intelligence (IEEE PRAI). (n.d.). Retrieved from https://www.prai.net/ 28. Artificial Intelligence and Pattern Recognition (AIPR). (n.d.). Retrieved from https://www.aipr.net/ 29. Japan-International Conference on Machine Learning and Pattern Recognition. (n.d.). Retrieved from https://www.mlpr.org/ 30. International Association for Pattern Recognition. (n.d.). Retrieved from https://iapr.org/ 31. History of IAPR. (n.d.). International Association for Pattern Recognition. Retrieved from https://iapr.org/about-us/history-of-iapr/ 32. IAPR Newsletter. (n.d.). International Association for Pattern Recognition. Retrieved from https://iapr.org/articles/newsletter/ 33. International Journal on Document Analysis and Recognition (IJDAR). (n.d.). Springer-Verlag GmbH Germany. Retrieved from https://www.springer.com/journal/10032/ 34. Antonov, D. N. (2023). 
Source studies approaches to the formation of a database of metric books for the purpose of optical recognition of handwritten text: Round table "Practical tasks of implementing AI technologies in the activities of archives" on April 10, 2023. YouTube channel VNIIDAD. Retrieved from https://www.youtube.com/watch?v=KHzhpS42vqk&t=12179s 35. Shabanov, A. V. (2008). Factors influencing the choice of digitization technology for old printed and handwritten books. Bibliosphere, 4, 46-48. 36. Impedovo, S. (1994). Fundamentals in handwriting recognition. NATO Advanced Study Institute on Fundamentals in Handwriting Recognition (NATO ASI Series). Springer-Verlag. Retrieved from https://link.springer.com/book/10.1007/978-3-642-78646-4 37. The memory of paper. (n.d.). Retrieved from https://memoryofpaper.eu/BernsteinPortal/appl_start.disp 38. Muratova, A., & Gudkov, A. (n.d.). Paper and paper production in the Middle Ages and the early modern period. Manuscript book: Tradition and modernity. Retrieved from https://manuscriptcraft.com/article_11 39. Esipova, V. A. (2003). Paper as a historical source (on materials from Western Siberia of the 17th–18th centuries). (A. N. Zheravina, Ed.). Tomsk University Press. 40. ARCHiOx: Seeing the unseen. Digitizing objects in 3D will give more than the ability to zoom in and examine historical objects in detail. Retrieved from https://oxford.shorthandstories.com/digital-archiox/index.html?fbclid=IwAR2LM19j6iFh1NUgEBddBmU0oZotufAEEs8G0vn2FzF97_dFd2c-TUUwGBs 41. Brown, N. (2019). Collection care welcomes a new multispectral imaging system. UK National Archives Blog. Retrieved from https://blog.nationalarchives.gov.uk/collection-care-welcomes-a-new-multispectral-imaging-system/ 42. Miklas, H., Brenner, S., Sablatnig, R. (2017). Multispectral Imaging for Digital Restoration of Ancient Manuscripts: Devices, Methods and Practical Aspects. Historical informatics, 3, 116-134. https://doi.org/10.7256/2585-7797.2017.3.23697 43. 
Sánchez-DelaCruz, E., & Loeza-Mejía, C. I. (2024). Importance and challenges of handwriting recognition with the implementation of machine learning techniques: A survey. Applied Intelligence, 54, 6444-6465. https://doi.org/10.1007/s10489-024-05487-x 44. MNIST. (n.d.). Modified National Institute of Standards and Technology. Retrieved from http://yann.lecun.com/exdb/mnist/; https://docs.ultralytics.com/ru/datasets/classify/mnist/ 45. MPS-Medieval Paleographic Scale. (n.d.). The University of Groningen research portal. Retrieved from https://research.rug.nl/en/datasets/mps-medieval-paleographic-scale 46. Zhitenyeva, A. M. (n.d.). Paleography and epigraphy: Two disciplines or one? (On the question of paleographic classification of written sources of the 10th-17th centuries). Retrieved from https://spbiiran.ru/paleografiya-i-epigrafika-dve-disczipliny-ili-odna-k-voprosu-o-paleograficheskoj-klassifikaczii-pismennyh-istochnikov-x-xvii-vv-doklad-a-m-zhitenevoj-na-zasedanii-drevneruss/ 47. Leuven Database of Ancient Books. (n.d.). Portal Trismegistos. Retrieved from https://www.trismegistos.org/ldab/ 48. Papyri.info. (n.d.). Retrieved from https://papyri.info/ 49. Kölner Papyri (Fayum papyri). (n.d.). Retrieved from https://papyri.uni-koeln.de/ 50. Stutzmann, D. (2022). Dated and datable manuscripts: Dataset. https://doi.org/10.5281/zenodo.6507965 51. Clélice, T., et al. (2024). CATMuS Medieval: A multilingual large-scale cross-century dataset in Latin script for handwritten text recognition and beyond. Lecture Notes in Computer Science, 174-194. https://doi.org/10.1007/978-3-031-70543-4_11 52. DigiPal. (n.d.). Retrieved from http://www.digipal.eu 53. Italian Paleography. (n.d.). Retrieved from https://italian.newberry.t-pen.org/ 54. DIVAHisDB Dataset of Medieval Manuscripts. (n.d.). University of Fribourg. Retrieved from https://www.unifr.ch/inf/diva/en/research/software-data/diva-hisdb.html 55. HisDoc III Digital Analysis of Syriac Handwriting (DASH). (n.d.). 
Retrieved from http://dash.stanford.edu/ 56. Fischer, A., Bunke, H., Naji, N., Savoy, J., Baechler, M., & Ingold, R. (n.d.). The HisDoc project: Automatic analysis, recognition, and retrieval of handwritten historical documents for digital libraries. In G. A. Fink, R. Jain, K. Kise, & R. Zanibbi (Eds.), Internationalität und Interdisziplinarität der Editionswissenschaft. https://doi.org/10.1515/9783110367317.91 57. French Renaissance: Paleography. (n.d.). Retrieved from https://french.newberry.t-pen.org/ 58. France-England: Medieval manuscripts between 700 and 1200. (n.d.). Retrieved from https://manuscrits-france-angleterre.org/polonsky/en/content/accueil-en?mode=desktop 59. Scottish Handwriting. (n.d.). Scotland's People. Retrieved from https://www.scotlandspeople.gov.uk/scottish-handwriting 60. Al-Furqan's E-Database. (n.d.). Al-Furqan Islamic Heritage Foundation. Retrieved from https://www.al-furqan.com/ 61. Hentaigana. (n.d.). Retrieved from https://alcvps.cdh.ucla.edu/support/ 62. KuLA (九郎). (n.d.). Retrieved from https://apps.apple.com/us/app/kula/id1076911000 63. MOJIZO (もじぞう: 文字の記録). (n.d.). Retrieved from https://aimojizo.nabunken.go.jp 64. Yumashova, Y. Y. (2023). Automated recognition of handwritten texts using artificial intelligence algorithms: Russian and foreign experience. Digital Oriental Studies, 3(1-2). https://doi.org/10.31696/S278240120026084-5 65. Shakespeare Documented. (n.d.). Retrieved from https://shakespearedocumented.folger.edu/resource/family-legal-property-records 66. Tarasova, N. A. (2021). New methods for studying the manuscript heritage of F. M. Dostoevsky: Report on research. Federal State Budgetary Scientific Institution Institute of Russian Literature (Pushkin House) of the Russian Academy of Sciences, St. Petersburg. 67. Mains d'érudits (XVIe-XXe siècles). (n.d.). Bibale. Retrieved from https://mainsderudits.irht.cnrs.fr/ 68. Peer, M., Kleber, F., & Sablatnig, R. (2023). Towards writer retrieval for historical datasets. In G.
A. Fink, R. Jain, K. Kise, & R. Zanibbi (Eds.), Document Analysis and Recognition-ICDAR 2023. Lecture Notes in Computer Science, 14187. https://doi.org/10.1007/978-3-031-41676-7_24 69. Christlein, V., Marthot-Santaniello, I., Mayr, M., Nicolaou, A., & Seuret, M. (2022). Writer retrieval and writer identification in Greek papyri. In C. Carmona-Duarte, M. Diaz, M. A. Ferrer, & A. Morales (Eds.), Intertwining graphonomics with human movements. IGS 2022. Lecture Notes in Computer Science, 13424. https://doi.org/10.1007/978-3-031-19745-1_6 70. Fiel, S., & Sablatnig, R. (2015). Writer identification and retrieval using a convolutional neural network. In G. A. Azzopardi & N. Petkov (Eds.), Computer analysis of images and patterns. CAIP 2015. Lecture Notes in Computer Science, 9257. https://doi.org/10.1007/978-3-319-23117-4_3 71. Dhali Maruf, A., Sheng, H., Popovic, M., Tigchelaar, E., & Schomaker, L. (2017). A digital palaeographic approach towards writer identification in the Dead Sea Scrolls. International Conference on Pattern Recognition Applications and Methods. https://doi.org/10.5220/0006249706930702 72. Volchkova, M. A. (2015). Experience in personification of scribes of the "Cathedral Code" of 1649 using digital technologies. Report on research. Private institution of culture Museum of Classical and Contemporary Art "Burganov Center." Russian Humanitarian Scientific Foundation. Grant: 14-01-00304. 73. Cha, S. H., & Tappert, C. C. (2002). Automatic detection of handwriting forgery. Proceedings of the Eighth International Workshop on Frontiers in Handwriting Recognition. IEEE, 264-267. 74. Carrière, G., Nikolaidou, K., Kordon, F., Mayr, M., Seuret, M., & Christlein, V. (2023). Beyond human forgeries: An investigation into detecting diffusion-generated handwriting. In M. Coustaty & A. Fornès (Eds.), Document Analysis and Recognition-ICDAR 2023 Workshops. Lecture Notes in Computer Science, 14193. https://doi.org/10.1007/978-3-031-41498-5_1 75. 
Anmol, H., Bibi, M., Moetesum, M., & Siddiqi, I. (2019). Deep learning-based approach for historical manuscript dating. 2019 International Conference on Document Analysis and Recognition (ICDAR), 967-972. https://doi.org/10.1109/ICDAR.2019.00159
76. Madi, B., Atamni, N., Tsitrinovich, V., Vasyutinsky-Shapira, D., El-Sana, J., & Rabaev, I. (2024). Automated dating of medieval manuscripts with a new dataset. In Document Analysis and Recognition – ICDAR 2024 Workshops, Athens, Greece, August 30-31, 2024, Proceedings, Part II (pp. 119-139). Springer-Verlag. https://doi.org/10.1007/978-3-031-70642-4_8
77. KFUPM Handwritten Arabic Text. (n.d.). Retrieved from http://khatt.ideas2serve.net/
78. Smirnov, I. N. (n.d.). On the possibilities of restoring digital archival texts and recognizing handwritten Arabic letters. Report at the International Forum Kazan-Expo-2023 and Kazan Digital Week. Retrieved from https://docs.yandex.ru/docs/view?url=ya-browser%3A%2F%2F4DT1uXEPRrJRXlUFoewruLkFYs7ubIAbSAY-xbL0IBKEaUp3AMQOVTSNPc-2YyqdfQrXgF3z9zrSTC_aAKNXel2yXz60D0C9kCdp5RwRSf9cFvtDbvmJ-yubbW85hEWb4ftUudW-2OSXY3dbwUtNbw%3D%3D%3Fsign%3DjlXgcIS8jxvD_9odPNQjyr4BS4YF5gk8ukUILjVYqjs%3D&name=Kazan-2023.docx&nosw=1
79. Public AI models in Transkribus. (n.d.). READ COOP. Retrieved from https://readcoop.eu/transkribus/public-models/
80. AI models for transcribing German text in Fraktur, Kurrent and Sütterlin. (n.d.). Retrieved from https://blog.transkribus.org/en/3-ai-models-for-transcribing-german-text-in-fraktur-kurrent-and-sutterlin
81. Aswathy, A., & Maheswari, P. U. (2024). Generative innovations for paleography: Enhancing character image synthesis through unconditional single image models. Heritage Science, 12(258). https://doi.org/10.1186/s40494-024-01373-4
82. Marti, U. V., & Bunke, H. (2002). The IAM-database: An English sentence database for offline handwriting recognition. IJDAR, 5, 39-46. https://doi.org/10.1007/s100320200071
83. Mohammed, H., Marthot-Santaniello, I., & Märgner, V. (2019). GRK-Papyri: A dataset of Greek handwriting on papyri for the task of writer identification. 2019 International Conference on Document Analysis and Recognition (ICDAR), 726-731. https://doi.org/10.1109/ICDAR.2019.00121
84. Papers with Code. (n.d.). Retrieved from https://paperswithcode.com/about; https://paperswithcode.com/datasets?task=optical-character-recognition&page=1
85. Hugging Face – The AI community building the future. (n.d.). Retrieved from https://huggingface.co/datasets
86. HebrewPal: Hebrew Palaeography Album. (n.d.). Retrieved from https://www.hebrewpalaeography.com/
87. Droby, A., Vasyutinsky Shapira, D., Rabaev, I., Kurar Barakat, B., & El-Sana, J. (2022). Hard and soft labeling for Hebrew paleography: A case study. International Workshop on Document Analysis Systems. Retrieved from https://link.springer.com/chapter/10.1007/978-3-031-06555-2_33
88. Digital Scriptorium. (n.d.). Retrieved from https://digital-scriptorium.org/
89. Resources. (n.d.). L'Institut de recherche et d'histoire des textes. Retrieved from https://www.irht.cnrs.fr/index.php/fr/qui-sommes-nous/lirht-en-bref
90. English Handwriting 1500–1700: An online course. (n.d.). Faculty of English. Retrieved from https://www.english.cam.ac.uk/scriptorium/
91. Paleography tutorial (how to read old handwriting). (n.d.). The National Archives [archived content]. Retrieved from https://webarchive.nationalarchives.gov.uk/ukgwa/20230801144244/https:/www.nationalarchives.gov.uk/palaeography/
92. MultiPal. (n.d.). Retrieved from https://www.multipal.fr/en/welcome/
93. LAION-5B: A new era of open large-scale multi-modal datasets. (n.d.). LAION. Retrieved from https://laion.ai/blog/laion-5b/
94. GRAPHOSKOP. (n.d.). Retrieved from https://www.palaeographia.org/graphoskop/index.html
95. Millesimo (launch). (n.d.). Retrieved from https://palaeographia.org/millesimo/index.html
96. Isaev, B. L., Lyakhovitsky, E. A., Tsypkin, D. O., & Chirkova, A. V. (2016). "Vestigium": A software package for the analysis of non-textual information of handwritten monuments. Historical Informatics: Information Technologies and Mathematical Methods in Historical Research and Education, 1-2(15-16), 72-83.
97. Deciphering medieval shorthand – Can a digital tool solve the "Tironian notes"? (n.d.). Medievalists.net. Retrieved from https://www.medievalists.net/2024/02/medieval-shorthand-tironian-notes/
98. OCR-D. (n.d.). Retrieved from https://ocr-d.de/en/
99. Kitamoto, A., & Tarin, K. (2020). Kuzushiji character recognition by AI and the road to full-text search for historical materials. Specialized Library, 5(300), 26-32.
100. CASIA-HWDB. (n.d.). Retrieved from https://paperswithcode.com/dataset/casia-hwdb
101. CASIA online and offline Chinese handwriting databases. (n.d.). Retrieved from https://nlpr.ia.ac.cn/databases/handwriting/home.html
102. Chinese calligraphy styles by calligraphers. (n.d.). Retrieved from https://www.kaggle.com/datasets/yuanhaowang486/chinese-calligraphy-styles-by-calligraphers
103. KuroNet Kuzushiji recognition service (KuroNet 九郎). (n.d.). Retrieved from http://codh.rois.ac.jp/kuronet/; https://mp.ex.nii.ac.jp/kuronet/
104. Cursive Japanese and OCR: Using KuroNet. (n.d.). The Digital Orientalist. Retrieved from https://digitalorientalist.com/2020/02/18/cursive-japanese-and-ocr-using-kuronet/
105. Sirenov, A. V. (2022). The project "The history of writing in European civilization": Collections of written monuments from academic institutions in St. Petersburg – digitization and study. In Proceedings of the Department of Historical and Philological Sciences, 2021: Yearbook (Vol. 11, pp. 125-134). Russian Academy of Sciences. https://doi.org/10.26158/OIFN.2022.11.1.010
106. Tsypkin, D. O., Tereschenko, E. Yu., Balachenkova, A. P., Vasiliev, A. L., Lyakhovitsky, E. A., Yatsishina, E. B., & Kovalchuk, M. V. (2020). Comprehensive studies of the historical inks of old Russian manuscripts. Nanotechnologies in Russia, 15(9-10), 542-550.
107. Lyakhovitskii, E. A., & Tsypkin, D. O. (2019). Infrared text visualization to study Old Russian scripts. Historical Informatics, 4, 148-156. https://doi.org/10.7256/2585-7797.2019.4.31588
108. Aismann, K., & Palmer, U. (2008). Retouching and image processing in Photoshop. Williams.
109. Keys to the past – Typewriters in the records of the Federal Government. (n.d.). NARA. Retrieved from https://archives-20973928.hs-sites.com/keys-to-the-past?ecid=ACsprvumObuCwkwzawZGYsTfDoztaLW7YuCcPtmTh2XiZbavjZ7PL0CPbJS3LhzYw3NkhWyAUjgt
110. SfarData. (n.d.). Retrieved from https://sfardata.nli.org.il/#/startSearch_He
111. Beit-Arié, M. (n.d.). The new website of SfarData: The codicological database of the Hebrew Palaeography Project. The Israel Academy of Sciences and Humanities. Retrieved from https://www.academia.edu/38849781/The_new_website_of_SfarData_The_codicological_database_of_the_Hebrew_Palaeography_Project_The_Israel_Academy_of_Sciences_and_Humanities
112. Grüning, T., Labahn, R., Diem, M., Kleber, F., & Fiel, S. (n.d.). READ-BAD: A new dataset and evaluation scheme for baseline detection in archival documents. https://doi.org/10.48550/arXiv.1705.03311
113. Boillet, M., Kermorvant, C., & Paquet, T. (2021). Multiple document datasets pre-training improves text line detection with deep neural networks. In 2020 25th International Conference on Pattern Recognition (ICPR) (pp. 2134-2141). IEEE.
114. Claudio, D. S., Fontanella, F., Maniaci, M., Marrocco, C., Molinara, M., & Scotto di Freca, A. (2018). Automatic writer identification in medieval books. 2018 Metrology for Archaeology and Cultural Heritage (MetroArchaeo), 27-32. https://doi.org/10.1109/MetroArchaeo43810.2018.13633
115. He, Sh., Sammara, P., Burgers, J., & Schomaker, L. (2014). Towards style-based dating of historical documents. 2014 14th International Conference on Frontiers in Handwriting Recognition, 265-270. https://doi.org/10.1109/ICFHR.2014.52
116. Frolov, A. (2020). Tools of geoinformatics to study pistsovye knigi. Historical Informatics, 2, 218-233. https://doi.org/10.7256/2585-7797.2020.2.33330
117. Chirkova, A. V. (2015). Development of software for comprehensive codicological analysis of manuscript book monuments and documents: Research report. St. Petersburg Institute of History of the Russian Academy of Sciences, St. Petersburg.
118. Rinchinov, O. S. (2021). Digital models of codicology of Tibetan books. Oriental Studies, 14(3), 541-549. https://doi.org/10.22162/2619-0990-2021-55-3-541-549
119. Volodin, A. Yu. (2014). Digital diplomatics: Resources, approaches, trends. In Problems of historiography, source studies, and methods of historical research: Materials of the V scientific readings in memory of Academician I. D. Kovalchenko, Moscow, December 13, 2013 (pp. 179-185). Moscow State University.
120. Isola, P., Zhu, J. Y., Zhou, T., & Efros, A. A. (2017). Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1125-1134).
121. Huang, X., Liu, M. Y., Belongie, S., & Kautz, J. (2018). Multimodal unsupervised image-to-image translation. In The European Conference on Computer Vision (ECCV). https://doi.org/10.48550/arXiv.1804.04732
122. Bavarian-Czech network of digital historical sources [Bayerisch-tschechisches Netzwerk digitaler Geschichtsquellen]. (n.d.). Porta fontium. Retrieved from https://www.portafontium.eu/?language=de
123. Baloun, J., Král, P., & Lenc, L. (2021). How to segment handwritten historical chronicles using fully convolutional networks? In A. P. Rocha, L. Steels, & J. van den Herik (Eds.), Agents and Artificial Intelligence. ICAART 2021. Lecture Notes in Computer Science, 13251. https://doi.org/10.1007/978-3-031-10161-8_9
124. Diplomata Belgica. (n.d.). Retrieved from https://www.diplomata-belgica.be/colophon_fr.html
125. Sources diplomatiques [Diplomatic sources]. (n.d.). TELMA. Retrieved from https://telma.hypotheses.org/category/sources-diplomatiques
126. Breuel, T. M., Ul-Hasan, A., Azawi, M. I. A., & Shafait, F. (2013). High-performance OCR for printed English and Fraktur using LSTM networks. In 2013 12th International Conference on Document Analysis and Recognition (pp. 683-687).
127. Shi, B., Bai, X., & Yao, C. (2017). An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(11), 2298.
128. Rahal, N., Vögtlin, L., & Ingold, R. (2023). Layout analysis of historical document images using a light fully convolutional network. In G. A. Fink, R. Jain, K. Kise, & R. Zanibbi (Eds.), Document Analysis and Recognition – ICDAR 2023. Lecture Notes in Computer Science, 14191. https://doi.org/10.1007/978-3-031-41734-4_20
129. Martínek, J., Lenc, L., & Král, P. (2020). Building an efficient OCR system for historical documents with little training data. Neural Computing and Applications, 32, 17209-17227. https://doi.org/10.1007/s00521-020-04910-x
130. Fleischhacker, D., Kern, R., & Göderle, W. (2025). Enhancing OCR in historical documents with complex layouts through machine learning. International Journal on Digital Libraries, 26(3). https://doi.org/10.1007/s00799-025-00413-z
131. Digimap. (n.d.). Retrieved from https://digimap.edina.ac.uk/
132. Chiang, Y. Y., & Knoblock, C. A. (2015). Recognizing text in raster maps. Geoinformatica, 19, 1-27. https://doi.org/10.1007/s10707-014-0203-9
133. Weinman, J. (n.d.). Historical maps. Research, Grinnell College Computer Science. Retrieved from https://weinman.cs.grinnell.edu/research/maps.shtml#data
134. Weinman, J., Chen, Z., Gafford, B., Gifford, N., Lamsal, A., & Niehus-Staab, L. (2019). Deep neural networks for text detection and recognition in historical maps. In 2019 International Conference on Document Analysis and Recognition (ICDAR) (pp. 902-909).
135. Historical atlas of the Low Countries (1350–1800) – GIS of the Low Countries. (n.d.). Retrieved from https://datasets.iisg.amsterdam/dataset.xhtml?persistentId=hdl:10622/PGFYTM
136. Li, Z., et al. (2024). ICDAR 2024 competition on historical map text detection, recognition, and linking. In E. H. Barney Smith, M. Liwicki, & L. Peng (Eds.), Document Analysis and Recognition – ICDAR 2024. Lecture Notes in Computer Science, 14809. https://doi.org/10.1007/978-3-031-70552-6_22
137. Baloun, J., Král, P., & Lenc, L. (2021). ChronSeg: Novel dataset for segmentation of handwritten historical chronicles. In Proceedings of the 13th International Conference on Agents and Artificial Intelligence (ICAART) (pp. 314-322).
138. Historical GIS: ROIS-DS historical geographic information system (CODH). (n.d.). Retrieved from https://codh.rois.ac.jp/historical-gis/
139. Riedl, C., Zanibbi, R., Hearst, M. A., et al. (2016). Detecting figures and part labels in patents: Competition-based development of graphics recognition algorithms. IJDAR, 19, 155-172. https://doi.org/10.1007/s10032-016-0260-8
140. Jamieson, L., Francisco Moreno-García, C., & Elyan, E. (2024). A review of deep learning methods for digitization of complex documents and engineering diagrams. Artificial Intelligence Review, 57, 136. https://doi.org/10.1007/s10462-024-10779-2
141. Wang, H., Shan, H., Song, Y., Meng, Y., & Wu, M. (2023). Engineering drawing text detection via better feature fusion. In H. Fujita, Y. Wang, Y. Xiao, & A. Moonis (Eds.), Advances and Trends in Artificial Intelligence: Theory and Applications. IEA/AIE 2023. Lecture Notes in Computer Science, 13925. https://doi.org/10.1007/978-3-031-36819-6_23
142. Gemelli, A., Marinai, S., Pisaneschi, L., et al. (2024). Datasets and annotations for layout analysis of scientific articles. IJDAR, 27, 683-705. https://doi.org/10.1007/s10032-024-00461-2
143. Shen, Z., Zhang, R., Dell, M., Lee, B. C. G., Carlson, J., & Li, W. (2021). LayoutParser: A unified toolkit for deep learning-based document image analysis. In J. Lladós, D. Lopresti, & S. Uchida (Eds.), Document Analysis and Recognition – ICDAR 2021. Lecture Notes in Computer Science, 12821. https://doi.org/10.1007/978-3-030-86549-8_9
144. Citizen Archivist. (n.d.). National Archives. Retrieved from https://www.archives.gov/citizen-archivist
145. Antonov, D. N., & Skopin, Y. A. (2022). Experience in developing an electronic system of domestic genealogy using artificial intelligence: Using documents from the Archive Fund of the Russian Federation in remote access mode. Archive Bulletin: Collection of Articles and Materials of the Scientific and Methodological Council of the Archival Institutions of the Central Federal District of the Russian Federation, (26). Retrieved from https://www.mos.ru/upload/documents/files/7256/ArhivniivestnikVip26.pdf
146. Church directory. (n.d.). Portal "State Archive of the Vologda Region". Retrieved from https://gosarchive.gov35.ru/user/sign-in/login
147. Turchin, P., Rio-Chanona, R. M. del, Hauser, J., Kondor, D., Reddish, J., Benam, M., Cioni, E., Villa, F., et al. (2024). Large language models' expert-level global history knowledge benchmark (HiST-LLM). Advances in Neural Information Processing Systems 37 (NeurIPS 2024). Retrieved from https://proceedings.neurips.cc/paper_files/paper/2024/hash/38cc5cba8e513547b96bc326e25610dc-Abstract-Datasets_and_Benchmarks_Track.html
148. Ng, A. (2022, February 9). Unbiggen AI. IEEE Spectrum. Retrieved from https://spectrum.ieee.org/andrew-ng-data-centric-ai#toggle-gdpr
149. PARES search engine with artificial intelligence. (n.d.). PARES. Retrieved from https://pares.cultura.gob.es/pares-htr/
150. Oberbichler, S., & Petz, C. (2025). Working paper: Implementing generative AI in historical studies (1.0). Zenodo. https://doi.org/10.5281/zenodo.14924737