Diffusion Language Models are Super Data Learners: The first work to empirically show that diffusion language models have substantially higher data potential than autoregressive models at scale (up to 8B parameters, 1.5T tokens, 480 epochs).
Quokka: The first-ever large-scale scaling law for diffusion language models, covering both compute-optimal and data-constrained settings, with extensive experiments on modeling and optimization designs.
OpenMoE 2: The first-ever sparse diffusion large language model trained from scratch, focusing on architectural insights.
MegaDLMs: A GPU-optimized framework for training diffusion language models at any scale; the training backend for Quokka, Super Data Learners, and OpenMoE 2.
OpenMoE: The first fully open Decoder-only MoE LLM trained from scratch.
GHA: Finding the Pillars of Strength for Multi-head Attention, cited and discussed in the GQA paper.
NoisyRollout: A simple, zero-cost method that improves vision-language reinforcement learning and achieves state-of-the-art reasoning performance.
SynthRL: A scalable method that automatically synthesizes verifiably correct and more challenging RL training samples for visual reasoning models.
Research Experience
Working as an AI Researcher and individual contributor at NUS, collaborating with Prof. Michael Shieh. Researching next-generation modeling paradigms and building scalable foundation model systems.
Background
An AI Researcher and individual contributor at NUS, working with Prof. Michael Shieh. Research interests include large language model pretraining and scaling, model architectures, diffusion language models, and reinforcement learning for LLM reasoning.