Pedro Ortiz Suarez

Google Scholar ID: 5sNdyvkAAAAJ
Principal Research Scientist, Common Crawl Foundation
Language modeling, Corpus linguistics, Named Entity Recognition, Computational Linguistics, Machine
Citations & Impact
All-time
Citations: 5,781
H-index: 15
i10-index: 17
Publications: 20
Co-authors: 36 (list available)
Resume (English only)
Academic Achievements
  • Published 'A Data-driven Approach to Natural Language Processing for Contemporary and Historical French', showing that the importance of pre-training dataset size is often overestimated
  • Contributed to BERTrade: using contextual embeddings to parse Old French with newly curated corpora
  • Co-developed the FreEM corpus and D’AlemBERT language model for Early Modern French
  • Improved the OSCAR multilingual web corpus by proposing a document-oriented version
  • Co-authored 'Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets', revealing systematic issues in low-resource corpora