“Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos” (arXiv): Introduced the first dexterous Vision-Language-Action model pretrained from human videos; equal contribution
“Unified Multimodal Understanding via Byte-Pair Visual Encoding” (ICCV’25 Highlight): Developed a complete training framework for byte-pair visual encoding and the resulting Being-VL-0.5 model for unified multimodal understanding
“From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities” (ICLR’25): Proposed a byte-pair encoding (BPE) tokenizer for quantized visual inputs, enabling more effective multimodal alignment in Transformers
“VideoOrion: Tokenizing Object Dynamics in Videos” (ICCV’25): Encoded object dynamics in videos as object tokens via a two-branch architecture
“OpenMMEgo: Enhancing Egocentric Understanding for LMMs with Open Weights and Data” (NeurIPS’25): Enhanced egocentric video understanding for LMMs via synthetic data and token compression
“LLM-Based Explicit Models of Opponents for Multi-Agent Games” (NAACL’25): Proposed EMO (Explicit Models of Opponents), an LLM-based framework for modeling opponents in multi-agent games
“Tackling Non-Stationarity in Reinforcement Learning via Causal-Origin Representation” (ICML’24): Addressed non-stationarity in reinforcement learning via causal-origin representations
“AdaRefiner: Refining Decisions of Language Models with Adaptive Feedback” (NAACL’24): Enabled co-learning between LLMs and RL agents via mutual feedback
“Entity Divider with Language Grounding in Multi-Agent Reinforcement Learning” (ICML’23): Introduced entity-level language grounding for multi-agent reinforcement learning; equal contribution