Paper 'JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse' accepted by ACL 2025. JARVIS-VLA is the first VLA model in Minecraft capable of following human instructions across over 1k distinct atomic tasks. The paper introduces a novel approach, Act from Visual Language Post-Training, which refines Visual Language Models (VLMs) through visual and linguistic guidance in a self-supervised manner.
Research Experience
Currently an intern at CraftJarvis, working on building autonomous agents that can operate in open-world environments. Current work explores how visual language models can be adapted into long-horizon, instruction-following agents, with particular interest in augmenting these models with memory and lightweight reasoning capabilities.
Undergraduate student at Yuanpei College, Peking University, focusing on autonomous agents in open-world environments. Believes true autonomy consists of three progressive layers: the capacity to robustly carry out user instructions in dynamic, uncertain, and long-horizon environments; the ability to operate within predefined rules, norms, and constraints, even when they conflict with task efficiency; and the emergence of self-directed behavior shaped by internalized, value-aligned objectives that endure over time.
Miscellany
Email: li_muyao@stu.pku.edu.cn. Personal homepage on GitHub; profile on Google Scholar.