Browse publications on Google Scholar (top-right)
Resume (English only)
Academic Achievements
VoiceCraft (ACL 2024 Oral): A zero-shot text-to-speech (TTS) and speech editing model that gained 7.7k GitHub stars within five months and ranked #1 globally on GitHub trending.
Audio-Visual Latent Diffusion Model (ECCV 2024 Oral): Generates realistic action sounds for silent egocentric videos with demonstrated zero-shot transfer in VR games.
PromptingWhisper (Interspeech 2023): Pioneered prompt-based techniques for large speech models to enable zero-shot audio-visual ASR and speech translation without fine-tuning.
Visually Grounded Speech research series (Interspeech 2022/2023, ICASSP 2022, ASRU 2023): Achieved state-of-the-art results in speech-image retrieval, zero-resource speech recognition, and data-efficient representation learning.
Published multiple papers at top-tier venues including ACL, ECCV, ICML, Interspeech, ICLR, EMNLP, ICCV, and WACV.
Contributed to Dynamic-SUPERB Phase-2 (ICLR 2025), a collaborative benchmark with 180 tasks for evaluating spoken language models.