OneTwoVLA: A Unified Vision-Language-Action Model with Adaptive Reasoning

📅 2025-05-17
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing general-purpose robots commonly adopt a decoupled "dual-system" architecture that separates low-level action execution (System One) from high-level reasoning (System Two), leading to limited mutual understanding of capabilities between the two systems and to response latency. This paper introduces OneTwoVLA, a unified vision-language-action model that integrates perception, reasoning, and action within a single model. OneTwoVLA adaptively switches between two modes: it reasons explicitly at task-critical moments, and otherwise generates actions conditioned on the most recent reasoning output. Key contributions include: (1) an adaptive switching mechanism that coordinates acting and reasoning within one model; (2) a scalable pipeline for synthesizing embodied reasoning-centric vision-language data; and (3) co-training of this synthesized data with robot data. Experiments demonstrate substantial improvements over baselines across four core capabilities (long-horizon task planning, error detection and recovery, natural human-robot interaction, and generalizable visual grounding), enabling the model to perform long-horizon, highly dexterous manipulation tasks such as making hotpot or mixing cocktails.
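The adaptive mode switching described above can be sketched as a simple control loop. This is a minimal illustrative sketch, not the paper's actual architecture or API: `UnifiedVLA`, `should_reason`, and the keyword-based gate are all hypothetical stand-ins for a single learned model that decides for itself when to reason.

```python
# Hypothetical sketch of OneTwoVLA-style adaptive switching between
# reasoning (System Two) and acting (System One). All names are illustrative.
from dataclasses import dataclass, field
from typing import List


@dataclass
class UnifiedVLA:
    """Toy stand-in for one model that can both reason and act."""
    reasoning_log: List[str] = field(default_factory=list)

    def should_reason(self, observation: str) -> bool:
        # In the real model this decision is learned; here a trivial
        # keyword check marks "task-critical" moments.
        return "critical" in observation

    def reason(self, observation: str) -> str:
        # System Two: produce an explicit reasoning step.
        thought = f"plan for {observation}"
        self.reasoning_log.append(thought)
        return thought

    def act(self, observation: str) -> str:
        # System One: generate an action conditioned on the most
        # recent reasoning output.
        latest = self.reasoning_log[-1] if self.reasoning_log else "no plan"
        return f"action({observation} | {latest})"


def control_loop(model: UnifiedVLA, observations: List[str]) -> List[str]:
    actions = []
    for obs in observations:
        if model.should_reason(obs):  # reason only at critical junctures
            model.reason(obs)
        actions.append(model.act(obs))  # otherwise act from the latest plan
    return actions
```

The key property the sketch captures is that reasoning is sparse: most steps reuse the latest plan rather than re-deriving it, which is where the latency advantage over always-on System Two reasoning comes from.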

šŸ“ Abstract
General-purpose robots capable of performing diverse tasks require synergistic reasoning and acting capabilities. However, recent dual-system approaches, which separate high-level reasoning from low-level acting, often suffer from challenges such as limited mutual understanding of capabilities between systems and latency issues. This paper introduces OneTwoVLA, a single unified vision-language-action model that can perform both acting (System One) and reasoning (System Two). Crucially, OneTwoVLA adaptively switches between two modes: explicitly reasoning at critical moments during task execution, and generating actions based on the most recent reasoning at other times. To further unlock OneTwoVLA's reasoning and generalization capabilities, we design a scalable pipeline for synthesizing embodied reasoning-centric vision-language data, used for co-training with robot data. We validate OneTwoVLA's effectiveness through extensive experiments, highlighting its superior performance across four key capabilities: long-horizon task planning, error detection and recovery, natural human-robot interaction, and generalizable visual grounding, enabling the model to perform long-horizon, highly dexterous manipulation tasks such as making hotpot or mixing cocktails.
Problem

Research questions and friction points this paper is trying to address.

How to unify vision, language, and action with adaptive reasoning in a single model
Overcoming dual-system limitations (capability misalignment, latency) in robot task execution
Enhancing reasoning and generalization across diverse, long-horizon robotic tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified vision-language-action model that adaptively switches between reasoning and acting
Scalable pipeline for synthesizing embodied reasoning-centric vision-language data
Co-training synthesized data with robot data for stronger generalization