Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer Language Models (Preprint, equal contribution): Used self-play RL and hidden Chain-of-Thought to discover diverse adversarial attacks for safer LLM alignment
BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset (NeurIPS 2023, equal contribution): Introduced a human-preference dataset demonstrating that decoupling helpfulness and harmlessness annotations improves safety alignment without loss of helpfulness
Safe RLHF: Safe Reinforcement Learning from Human Feedback (ICLR 2024 Spotlight): Proposed a constrained RLHF algorithm using Lagrangian methods to balance harmlessness and helpfulness, outperforming existing alignment methods
Baichuan 2: Open Large-scale Language Models (Technical Report, author): Contributed to open-sourcing the Baichuan 2 models, which achieved state-of-the-art results among open-source models on benchmarks including MMLU, CMMLU, GSM8K, HumanEval, and SuperCLUE-agent
Proactive Multi-Camera Collaboration For 3D Human Pose Estimation (ICLR 2023, equal contribution): Developed a multi-agent RL framework for collaborative 3D pose estimation in dynamic crowds using Shapley-value-inspired rewards