- Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos
- RLPF: Reinforcement Learning from Physical Feedback: Aligning Large Motion Models with Humanoid Control
- Being-M0.5: A Real-Time Controllable Vision-Language-Motion Model
- Being-VL0.5: Unified Multimodal Understanding via Byte-Pair Visual Encoding
- Steve-Eye: Equipping LLM-based Embodied Agents with Visual Perception in Open Worlds
- Few-shot Action Recognition with Hierarchical Matching and Contrastive Learning
- VRDFormer: End-to-End Video Visual Relation Detection with Transformers
Research Experience
Currently leading the Embodied Multimodal Pretraining team at BeingBeyond, working on projects including the Being-H, Being-M, and Being-VL series. Previously a researcher at the Beijing Academy of Artificial Intelligence (BAAI).
Education
PhD and bachelor's degrees from Renmin University of China (RUC), supervised by Prof. Qin Jin.
Background
Research Interests: Large multimodal models, human behavior and motion understanding, vision-language-action models, humanoid robots.
Brief Introduction: A researcher at BeingBeyond, focusing on the development of foundation models for general-purpose humanoid robots.