Language Modeling for Epigraphs

Abstract

The Epigraphic Database Roma (EDR)[4] is the most comprehensive and precise collection of Ancient Roman inscriptions, with over one hundred thousand entries curated by the International Federation of Epigraphic Databases. Because these inscriptions date from across many centuries, many have suffered erosion that has left gaps in their text. Our objective is to reconstruct these lost segments, extending previous work [1][3] with an approach that is both fast and cheap. To achieve this, we plan to fine-tune LatinBERT[2], the leading language model for Latin, on the EDR database. The result will be a specialized language model for filling the gaps in these ancient texts, and a stepping stone toward language models trained on inscriptions.
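As a rough illustration of how gap filling can be framed as a masked-language-modeling task, the sketch below hides one contiguous run of tokens in an intact text to simulate a lacuna and keeps the hidden tokens as prediction targets. It is only a minimal sketch under stated assumptions: the function name `mask_lacuna`, the `[MASK]` token string, the whitespace tokenization, and the sample phrase are all illustrative and are not taken from the EDR data or from the LatinBERT tokenizer, which works on subword units.

```python
import random

MASK = "[MASK]"  # assumed mask token; the real one depends on the tokenizer used


def mask_lacuna(tokens, span_len, rng):
    """Simulate a lacuna by masking one contiguous run of `span_len` tokens.

    Returns (masked_tokens, target_tokens, start_index).
    """
    if not 0 < span_len <= len(tokens):
        raise ValueError("span_len must be between 1 and len(tokens)")
    # Pick a start position such that the whole span fits inside the text.
    start = rng.randrange(len(tokens) - span_len + 1)
    masked = tokens[:start] + [MASK] * span_len + tokens[start + span_len:]
    targets = tokens[start:start + span_len]
    return masked, targets, start


# Illustrative input only: a whitespace-split Latin phrase, not an EDR record.
tokens = "dis manibus sacrum".split()
masked, targets, start = mask_lacuna(tokens, 1, random.Random(0))
print(masked, targets, start)
```

Pairs of masked text and targets produced this way are the kind of training examples a masked language model could be fine-tuned on. The masking strategy that would actually suit eroded inscriptions (span lengths, character-level versus word-level gaps) is left open here.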

References

[1] A. Locaputo, B. Portelli, E. Colombi, G. Serra, and others, “Filling the Lacunae in ancient Latin inscriptions,” in IRCDL, 2023, pp. 68–76.
[2] D. Bamman and P. J. Burns, “Latin BERT: a contextual language model for classical philology,” 21 September 2020, arXiv:2009.10053. doi: 10.48550/arXiv.2009.10053.
[3] Y. Assael, T. Sommerschield, A. Cooley, et al., “Contextualizing ancient texts with generative neural networks,” Nature, vol. 645, pp. 141–147, 2025. doi: 10.1038/s41586-025-09292-5.
[4] EDR-EDR Official Website. http://www.edr-edr.it/default/index.php (accessed May 16, 2025).
