Published 'A Data-driven Approach to Natural Language Processing for Contemporary and Historical French', showing that the importance of pre-training dataset size is often overestimated
Contributed to BERTrade, which uses contextual embeddings and newly curated corpora to parse Old French
Co-developed the FreEM corpus and D’AlemBERT language model for Early Modern French
Improved the OSCAR multilingual web corpus by proposing a document-oriented version
Co-authored 'Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets', revealing systematic issues in low-resource corpora