DualCoT-VLA: Visual-Linguistic Chain of Thought via Parallel Reasoning for Vision-Language-Action Models

πŸ“… 2026-03-23
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing vision-language-action (VLA) models struggle to simultaneously achieve fine-grained spatial perception and high-level logical planning in complex multi-step tasks, while autoregressive decoding often leads to high latency and error accumulation. This work proposes DualCoT-VLA, the first VLA architecture incorporating parallel visual and language chains-of-thought (CoT). It employs learnable query tokens to separately construct a visual CoT for spatial understanding and a language CoT for task planning. By replacing autoregressive generation with single-step forward inference, the model enables efficient multimodal fusion without sacrificing perceptual granularity. DualCoT-VLA significantly enhances both reasoning efficiency and planning capability, achieving state-of-the-art performance on the LIBERO and RoboCasa GR1 benchmarks as well as on real-world robotic platforms.

πŸ“ Abstract
Vision-Language-Action (VLA) models map visual observations and language instructions directly to robotic actions. While effective for simple tasks, standard VLA models often struggle with complex, multi-step tasks requiring logical planning, as well as precise manipulations demanding fine-grained spatial perception. Recent efforts have incorporated Chain-of-Thought (CoT) reasoning to endow VLA models with a "thinking before acting" capability. However, current CoT-based VLA models face two critical limitations: 1) an inability to simultaneously capture low-level visual details and high-level logical planning due to their reliance on isolated, single-modal CoT; 2) high inference latency with compounding errors caused by step-by-step autoregressive decoding. To address these limitations, we propose DualCoT-VLA, a visual-linguistic CoT method for VLA models with a parallel reasoning mechanism. To achieve comprehensive multi-modal reasoning, our method integrates a visual CoT for low-level spatial understanding and a linguistic CoT for high-level task planning. Furthermore, to overcome the latency bottleneck, we introduce a parallel CoT mechanism that incorporates two sets of learnable query tokens, shifting autoregressive reasoning to single-step forward reasoning. Extensive experiments demonstrate that our DualCoT-VLA achieves state-of-the-art performance on the LIBERO and RoboCasa GR1 benchmarks, as well as on real-world platforms.
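The core mechanism described above, two sets of learnable query tokens that extract a visual and a linguistic CoT from fused features in a single forward pass rather than token-by-token autoregressive decoding, can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's implementation; all module names, dimensions, and the pooling/action-head design are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ParallelDualCoT(nn.Module):
    """Sketch of parallel dual-CoT reasoning: two sets of learnable query
    tokens attend to fused vision-language features in one forward pass,
    replacing step-by-step autoregressive reasoning. All names and sizes
    here are illustrative, not taken from the paper."""

    def __init__(self, d_model=256, n_visual=8, n_linguistic=8, n_heads=4):
        super().__init__()
        # Learnable queries: one set for the visual CoT (low-level spatial
        # detail), one for the linguistic CoT (high-level task planning).
        self.visual_queries = nn.Parameter(torch.randn(n_visual, d_model))
        self.linguistic_queries = nn.Parameter(torch.randn(n_linguistic, d_model))
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.action_head = nn.Linear(d_model, 7)  # e.g. a 7-DoF action (assumed)

    def forward(self, vl_features):
        # vl_features: (batch, seq_len, d_model) fused vision-language tokens.
        b = vl_features.size(0)
        # Concatenating both query sets lets visual and linguistic CoT tokens
        # be computed in parallel in a single attention call (single-step
        # forward reasoning, no autoregressive loop).
        queries = torch.cat([self.visual_queries, self.linguistic_queries], dim=0)
        queries = queries.unsqueeze(0).expand(b, -1, -1)
        cot_tokens, _ = self.cross_attn(queries, vl_features, vl_features)
        # Pool the reasoning tokens and predict the action directly.
        return self.action_head(cot_tokens.mean(dim=1))
```

Because both CoT streams are produced by one cross-attention call, inference cost is independent of reasoning length, which is the latency argument the abstract makes against autoregressive decoding.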
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action models
Chain-of-Thought reasoning
multi-modal reasoning
autoregressive decoding
robotic action planning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain-of-Thought
Vision-Language-Action
Parallel Reasoning
Multi-modal Reasoning
Robotic Manipulation