32,000 medieval manuscripts transcribed with A.I.

A colossal challenge brilliantly met... and more than 200,000 other manuscripts in sight

Transcribing a medieval manuscript has always been a time-consuming task. It means deciphering handwriting on aged supports, in languages that have almost disappeared, such as Old French and Latin, but also regional languages of Spain and Italy (Venetian, for example), Old Dutch and many others. Added to this are the difficulties of unfamiliar contexts and of spelling, which had not yet stabilized at the time, along with scribal conventions such as the ampersand or the long "s", which looks like an "f".

As a result, each transcriber developed and applied his or her own transcription standards, making A.I. training almost impossible.

This is the problem tackled by the ALMANACH project-team at the Inria Centre in Paris through the CATMuS project: standardizing transcription standards and, eventually, training an artificial intelligence to automate the process.

The first step was to analyze 300 medieval manuscripts, already transcribed using well-established standards, respecting spelling and abbreviations.

"The second step is to use this corpus to train a model based on artificial intelligence. This is based on transcription tools developed by EPHE-Université PSL: eScriptorium and Kraken. What are its advantages? It's energy-efficient and, above all, it focuses more on image recognition than on language comprehension, which avoids over-extrapolation."

With this achieved, the CoMMA (Corpus of Multilingual Medieval Archives) project took over in 2024, with the aim of putting the transcription tool to the test. First step: finding manuscripts.

For this, the team turned to EquipEx+ Biblissima+, which has a catalog of links to digitized versions of over 260,000 manuscripts, stored by various institutions. "We received a total of 32,763 manuscripts, mostly in Old French and Latin, which we transcribed in four months."

Such a task would have taken decades to complete manually!

The model used is in fact based on two algorithms: one recognizes the various elements of the page (main text, notes, illustrations, etc.), and the other, developed during CATMuS, transcribes the texts. The resulting error rate is below 10%, much lower than with other methods, and it can be reduced further over time.
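The article does not say how the error rate quoted above is measured. A common choice for handwritten text recognition is the character error rate (CER): the number of single-character edits needed to turn the model's output into the reference transcription, divided by the length of the reference. The sketch below assumes that metric; the function names and the sample strings are illustrative and do not come from the CoMMA or CATMuS projects.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions
    and substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance between the two strings, normalised by reference length."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Hypothetical example: the reference keeps the scribe's "u" for "v",
# while the automatic transcription normalised it.
reference = "in principio erat uerbum"
hypothesis = "in principio erat verbum"
cer = character_error_rate(reference, hypothesis)
print(f"CER: {cer:.1%}")
```

Here one character out of 24 differs, so the CER is about 4.2%. The example also shows why shared transcription standards matter: a "u" versus "v" convention that differs between the training data and the reference is counted as an error even when the text was read correctly.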

This success would not have been possible without the team's interdisciplinary skills, including those in paleography.

"Digital expertise alone would not have enabled us to understand as well the manuscripts we were dealing with and the processes we needed to apply to them."

As a result, an immense body of knowledge is now accessible for all disciplines, from medicine to philosophy, and it will only grow, as more than 200,000 manuscripts in other ancient languages still await transcription throughout Europe.

For the full article: CoMMA: thousands of medieval manuscripts finally transcribed - INRIA

Resources

eScriptorium - https://escriptorium.readthedocs.io

Kraken - https://kraken.re

ALMANACH - https://almanach.inria.fr/index-fr.html

Biblissima - https://projet.biblissima.fr/fr

CoMMA - https://huggingface.co/comma-project

Illustration: Shutterstock - 2515480013


INRIA - National Institute for Research in Computer Science and Control

Domaine de Voluceau
Rocquencourt - B.P. 105
78153 Le Chesnay
France

Tél.: 33 (0)1 39 63 55 11
