Guilherme Penedo
Google Scholar ID: L-jmoJYAAAAJ
ML Research Engineer at 🤗 HuggingFace
Citations & Impact (all-time)
Citations: 2,717
H-index: 9
i10-index: 9
Publications: 13
Co-authors: 0
Contact
No contact links provided.
Publications (5 items)
How Can We Synthesize High-Quality Pretraining Data? A Systematic Study of Prompt Design, Generator Model, and Source Data. 2026. Cited: 0
FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language. 2025. Cited: 0
The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text. 2025. Cited: 0
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model. 2025. Cited: 0
Towards Best Practices for Open Datasets for LLM Training. 2025. Cited: 0
Co-authors: 0 (list not available)