Wei Xiong
Google Scholar ID: m2-OwQEAAAAJ
University of Illinois Urbana-Champaign
Post Training · Reinforcement Learning · Foundation Model · Learning Theory
Citations & Impact (all-time)
  • Citations: 2,447
  • H-index: 23
  • i10-index: 26
  • Publications: 20
  • Co-authors: 70
Publications
20 items; full list available on Google Scholar.
Resume (English only)
Academic Achievements
  • Proposed and open-sourced online rejection-sampling fine-tuning, the Reinforce-Ada adaptive sampling framework, online DPO, and a regret analysis of KL-regularized RL (a minimal sketch of the rejection-sampling idea follows this list).
  • Co-founded and led the open-source project RLHFlow, which has 2,000 GitHub stars, 500 academic citations, and 1 million Hugging Face downloads.
  • Released the first open-source recipe for generative process reward models.
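As a minimal illustration of the rejection-sampling fine-tuning idea referenced above: sample several responses per prompt, score them with a reward model, and keep only the best for the next fine-tuning round. The sketch below assumes hypothetical `generate` and `reward_model` callables (not RLHFlow's actual APIs) and omits the fine-tuning step itself.

```python
# Illustrative best-of-n (rejection sampling) data selection.
# `generate` and `reward_model` are hypothetical stand-ins, not RLHFlow APIs.
from typing import Callable, List, Tuple

def select_best_of_n(
    prompts: List[str],
    generate: Callable[[str, int], List[str]],   # (prompt, n) -> n candidate responses
    reward_model: Callable[[str, str], float],   # (prompt, response) -> scalar score
    n: int = 8,
) -> List[Tuple[str, str]]:
    """For each prompt, sample n responses and keep the highest-scoring one.

    The selected (prompt, response) pairs form the fine-tuning set for the
    next round; in the online variant, the loop repeats with the freshly
    fine-tuned model generating new candidates.
    """
    dataset: List[Tuple[str, str]] = []
    for prompt in prompts:
        candidates = generate(prompt, n)
        best = max(candidates, key=lambda response: reward_model(prompt, response))
        dataset.append((prompt, best))
    return dataset
```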
Research Experience
  • Research Intern, Meta FAIR (May 2025 to August 2025): taught LLMs to segment reasoning trajectories into coherent intermediate steps to improve the interpretability and stability of reasoning, and trained a generative process reward model via RL to evaluate and guide step-by-step reasoning.
  • Student Researcher, Google DeepMind, Gemini Post-Training Team (May 2024 to April 2025): formulated a multi-turn RL framework for agent tasks.
Education
  • Ph.D. candidate in Computer Science, University of Illinois Urbana-Champaign; advised by Prof. Tong Zhang and Prof. Nan Jiang.
  • Master's degree in Mathematics, The Hong Kong University of Science and Technology, 2023; supported by the Hong Kong PhD Fellowship.
  • B.S. in Mathematics, University of Science and Technology of China, 2021; worked closely with Prof. Cong Shen.
Background
  • Research interests include reinforcement learning and its applications in LLM post-training, focusing on the design of core RL algorithms and the development of practical training methods. Also interested in understanding the training dynamics and mathematical foundations behind these methods, with the goal of improving large-scale training stability and final model performance.