Mar 2025: Released KodCode, the largest verified synthetic coding dataset for Code LLM training
Jul 2024: Introduced Samba, a powerful hybrid LLM
May 2024: Built GPT-4 Japanese
Mar 2023: Proposed G-Eval: NLG evaluation using GPT-4 with better human alignment
Nov 2022: Released UniSumm, a state-of-the-art few-shot summarization model
Oct 2022: Five papers accepted at EMNLP 2022
Mar 2022: Two papers accepted at ACL 2022
Mar 2021: Three papers (two long, one short) accepted at NAACL 2021
Jan 2021: RE-T5 model ranked 1st in the CommonGen competition
Oct 2020: Ranked 1st in the FEVER competition
Published numerous papers at top-tier conferences, including NeurIPS 2023, EMNLP, ACL, AAAI, and NAACL
Notable works include DialogLM (pre-trained model for dialogue understanding and summarization), MediaSum (large-scale media interview summarization dataset), and DialogSum (real-life dialogue summarization dataset)
Multiple papers include open-source code or public datasets (marked with [code] or [dataset])