Developed FlagJudge, a generative reward model (GenRM), in Dec 2023, predating DeepMind’s similar work by one year
Team ranked 7th out of 100+ global teams in the AI Safety and Security Challenge (hosted by AI Singapore & NUS) in July 2024; invited to Singapore International Cyber Week (SICW) 2024
FlagEval released its latest leaderboard in Sep 2024, covering nearly 300 models across subjective, objective, arena battle, debate, multimodal, text-to-image, and text-to-video evaluations
Published at top-tier venues including AAAI, NeurIPS, and ACL Findings (e.g., 'Before generation, align it!', 'Can LLM already serve as a database interface?', 'Graphix-T5')
Authored multiple arXiv preprints, including 'Towards analyzing and understanding the limitations of DPO' and 'Towards understanding the influence of reward margin on preference model performance'