ICML 2025: 'DataDecide: How to Predict Best Pretraining Data with Small Experiments'
NeurIPS 2024: 'Paloma: A Benchmark for Evaluating Language Model Fit'
EMNLP 2024: 'Scalable Data Ablation Approximations for Language Models through Modular Training and Merging'
ACL 2024: Contributed to 'OLMo: Accelerating the Science of Language Models' and 'Dolma: An Open Corpus of Three Trillion Tokens for Language Model Pretraining Research'
ICLR 2024: 'What's In My Big Data?'
Findings of ACL 2023: 'Reproducibility in NLP: What Have We Learned from the Checklist?'
EMNLP 2021: 'Extracting Fine-Grained Knowledge Graphs of Scientific Claims: Dataset and Transformer-Based Results'