Robix: A Unified Model for Robot Interaction, Reasoning and Planning

📅 2025-08-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses three challenges in vision-language robotic systems: understanding complex instructions, planning long-horizon tasks, and interacting with humans in natural language. It introduces Robix, a unified model that combines chain-of-thought reasoning with proactive dialogue, real-time interruption handling, and context-aware commonsense reasoning. The model is trained in three stages: (1) continued pretraining to strengthen embodied reasoning, (2) supervised fine-tuning to model interaction and planning as a unified reasoning-action sequence, and (3) reinforcement learning to improve reasoning-action consistency and long-horizon task coherence. Acting as the high-level cognitive layer of a hierarchical robot system, the model issues atomic commands to the low-level controller and verbal responses to the user, supporting end-to-end instruction execution and multi-turn dialogue. On interactive tasks including table bussing, grocery shopping, and dietary filtering, it outperforms open-source and commercial baselines such as GPT-4o and Gemini 2.5 Pro and generalizes across diverse instruction types.

📝 Abstract
We introduce Robix, a unified model that integrates robot reasoning, task planning, and natural language interaction within a single vision-language architecture. Acting as the high-level cognitive layer in a hierarchical robot system, Robix dynamically generates atomic commands for the low-level controller and verbal responses for human interaction, enabling robots to follow complex instructions, plan long-horizon tasks, and interact naturally with humans within an end-to-end framework. Robix further introduces novel capabilities such as proactive dialogue, real-time interruption handling, and context-aware commonsense reasoning during task execution. At its core, Robix leverages chain-of-thought reasoning and adopts a three-stage training strategy: (1) continued pretraining to enhance foundational embodied reasoning abilities including 3D spatial understanding, visual grounding, and task-centric reasoning; (2) supervised finetuning to model human-robot interaction and task planning as a unified reasoning-action sequence; and (3) reinforcement learning to improve reasoning-action consistency and long-horizon task coherence. Extensive experiments show that Robix outperforms both open-source and commercial baselines (e.g., GPT-4o and Gemini 2.5 Pro) in interactive task execution and generalizes well across diverse instruction types (e.g., open-ended, multi-stage, constrained, invalid, and interrupted) and various user-involved tasks such as table bussing, grocery shopping, and dietary filtering.
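
The abstract describes a high-level layer that, at each step, reasons and then emits an atomic command for the low-level controller and/or a verbal response for the user. The Python sketch below illustrates one way such a loop could be wired together. It is not the authors' implementation: the `StepOutput` fields, the `run_episode` function, the command strings such as `pick(cup)`, and the scripted stand-in policy are all hypothetical, and visual observations are omitted for brevity.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class StepOutput:
    """One hypothetical high-level step: reasoning plus optional action/response."""
    thought: str                    # chain-of-thought text (assumed field name)
    action: Optional[str] = None    # atomic command for the low-level controller
    response: Optional[str] = None  # verbal reply for the human


def run_episode(
    policy: Callable[[str, List[StepOutput]], StepOutput],
    instruction: str,
    send_to_controller: Callable[[str], None],
    say: Callable[[str], None],
    max_steps: int = 10,
) -> List[StepOutput]:
    """Query the policy step by step and route its outputs."""
    history: List[StepOutput] = []
    for _ in range(max_steps):
        step = policy(instruction, history)
        history.append(step)
        if step.response is not None:
            say(step.response)           # speech goes to the user
        if step.action is None:
            break                        # no command: treat the task as finished
        send_to_controller(step.action)  # command goes to the controller
    return history


def scripted_policy(instruction: str, history: List[StepOutput]) -> StepOutput:
    """Stand-in for the model: replays a fixed script (assumption, for demo only)."""
    script = [
        StepOutput("The cup is on the table; pick it up first.",
                   action="pick(cup)"),
        StepOutput("Cup in hand; place it in the bin.",
                   action="place(cup, bin)",
                   response="Putting the cup away."),
        StepOutput("Nothing left to clear.",
                   response="The table is clear."),
    ]
    return script[len(history)]


commands: List[str] = []
spoken: List[str] = []
history = run_episode(scripted_policy, "clear the table",
                      commands.append, spoken.append)
print(commands)
print(spoken)
```

Keeping the thought, action, and response in a single record mirrors the "unified reasoning-action sequence" idea from the abstract; a real system would replace `scripted_policy` with a call to the vision-language model and would also need to handle user interruptions between steps.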
Problem

Research questions and friction points this paper is trying to address.

Integrates robot reasoning, planning, and natural language interaction
Enables robots to follow complex instructions and interact naturally
Handles proactive dialogue, interruptions, and context-aware reasoning during tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified vision-language model for robot interaction
Chain-of-thought reasoning with three-stage training
Proactive dialogue and real-time interruption handling
Authors

Huang Fang, ByteDance Seed
Mengxi Zhang, ByteDance Seed
Heng Dong, ByteDance Seed
Wei Li, ByteDance Seed
Zixuan Wang, ByteDance Seed
Qifeng Zhang, ByteDance Seed
Xueyun Tian, Institute of Computing Technology (tags: Multimodal Generation, MLLM)
Yucheng Hu, ByteDance Seed
Hang Li, ByteDance Seed