Aug 2025: Released MGM-Omni, an open-source omni-modal LLM supporting long speech understanding, speech generation, and zero-shot voice cloning
Jun 2025: Paper 'Lyra' accepted to ICCV 2025
Mar 2025: Papers 'VisionZip' and 'DreamOmni' accepted to CVPR 2025
Dec 2024: Released Lyra, an open-source MLLM supporting long speech comprehension, omni understanding, and cross-modality efficiency
Jul 2024: Paper 'LLaMA-VID' accepted to ECCV 2024
Mar 2024: Released Mini-Gemini, an open-source vision-language model supporting high-resolution image understanding and reasoning-based image generation
Feb 2024: Paper 'GroupContrast' accepted to CVPR 2024
Nov 2023: Released LLaMA-VID, an open-source vision-language model supporting hour-long video understanding and reasoning
Primary contributor to key projects including MGM-Omni, Lyra, VisionZip, Mini-Gemini, LLaMA-VID, the DreamOmni series, and GroupContrast
Background
PhD student in the Department of Computer Science and Engineering, The Chinese University of Hong Kong (CUHK)
Research focuses on building human-like Multimodal Intelligence capable of actively interacting with the physical world, learning from interaction, and retaining long-term memory
Recently concentrating on Multi-modal Large Language Models (MLLMs)
Previously worked on visual perception
Seeking industry Research Scientist / Member of Technical Staff positions starting Fall 2026, focused on Multimodal Foundation Models and related applications (e.g., Computer Use Agents, Embodied AI); open to any location