Proposed the 'Small Model Learnability Gap': small models learn better from shorter, simpler reasoning chains than from long CoT traces or distillation from large teachers
Discovered that RL-trained math models generalize well to non-reasoning domains (e.g., alignment), while SFT-trained models lose this capacity; identified the sampling policy as the key factor behind this generalization
Identified 'Temporal Forgetting': during RL training of Deepseek-R1-1.5B, 76.7% of AIME problems were solved correctly at some intermediate checkpoint, but only 30% remained correct in the final model; proposed 'Temporal Sampling', which draws answers from multiple checkpoints to exploit this training-dynamics diversity
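The core idea of checkpoint-based sampling can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the checkpoint policies are hypothetical callables standing in for model inference, and majority voting is one plausible way to aggregate the pooled answers.

```python
from collections import Counter

def temporal_sample(checkpoints, prompt, per_checkpoint=4):
    """Sketch of Temporal Sampling: draw answers from several training
    checkpoints instead of only the final model, then aggregate.
    `checkpoints` is a list of callables mapping a prompt to an answer
    string (stand-ins for sampling from saved model weights)."""
    answers = []
    for generate in checkpoints:
        for _ in range(per_checkpoint):
            answers.append(generate(prompt))
    # Majority vote over the pooled answers from all checkpoints
    return Counter(answers).most_common(1)[0][0]

# Toy stand-ins: two intermediate checkpoints that solve the problem,
# and a final checkpoint that has "temporally forgotten" it.
mid_ckpt = lambda p: "42"
final_ckpt = lambda p: "41"
print(temporal_sample([mid_ckpt, mid_ckpt, final_ckpt], "AIME #7"))  # -> 42
```

Because a problem forgotten by the final model is often still solved at earlier checkpoints, pooling across checkpoints recovers answers that sampling only the final model would miss.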
Introduced SafeChain dataset to improve safety alignment without compromising reasoning; showed that long CoT does not necessarily enhance safety
TinyV: proposed a lightweight LLM-based verifier to address a >38% false-negative rate in answer verification during RL training, improving reward estimation accuracy
Visual Sphinx: developed a four-stage pipeline that generates 660K visual logic puzzles for RL training of multimodal reasoning models