EDR-AI: A dataset for machine learning on epigraphs
Abstract
Recent research [1], [2] has highlighted the need for automated methods for the restoration, reconstruction, and classification of ancient epigraphic inscriptions. These texts, often exposed to environmental degradation for thousands of years, suffer from erosion, fractures, and other forms of damage that render them partially or fully illegible. Despite growing interest in computational approaches to cultural heritage, structured, task-oriented datasets for epigraphic research remain scarce. We introduce EDR-AI, a curated dataset derived from a subset of the Epigraphic Database Roma (EDR) [3] and designed to facilitate machine learning research on epigraphic reconstruction and analysis. The dataset is processed and cleaned to support computational modeling and includes structured annotations suitable for multiple downstream tasks. To demonstrate its applicability, we define and benchmark several tasks, ranging from classification to language modeling. By providing a standardized resource for evaluation, EDR-AI aims to advance research at the intersection of digital humanities and artificial intelligence.
References
[1] O. Ceriotti, F. Gerardi, S. G. Malatesta and S. Orlandi, "Language Modeling for Epigraphs: a BERT model for EDR's Latin Epigraphs text completion," 2025 IEEE International Conference on Cyber Humanities (IEEE-CH), Florence, Italy, 2025, pp. 1-6, doi: 10.1109/IEEE-CH65308.2025.11279443.
[2] Y. Assael, T. Sommerschield, A. Cooley et al., "Contextualizing ancient texts with generative neural networks," Nature 645, 141–147 (2025). https://doi.org/10.1038/s41586-025-09292-5
[3] EDR-EDR Official Website. http://www.edr-edr.it/default/index.php (accessed May 16, 2025).