Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning

📅 2026-01-14

📈 Citations: 3

✨ Influential: 1

🤖 AI Summary

This work addresses the high inference latency, and the resulting efficiency–performance trade-off, of existing Vision-Language-Action (VLA) methods that rely on explicit chains of thought. It proposes Fast-ThinkAct, a framework that introduces, for the first time, a verbalizable latent (implicit) chain of thought. Through distillation from a teacher model under a preference-guided objective, Fast-ThinkAct transfers linguistic and visual planning capabilities into embodied control policies. The approach retains strong long-horizon planning, few-shot adaptation, and failure recovery while substantially reducing computational overhead. Across multiple embodied manipulation and reasoning benchmarks, Fast-ThinkAct reduces inference latency by up to 89.3% relative to state-of-the-art reasoning VLAs while maintaining strong task performance.

📝 Abstract

Vision-Language-Action (VLA) tasks require reasoning over complex visual scenes and executing adaptive actions in dynamic environments. While recent studies on reasoning VLAs show that explicit chain-of-thought (CoT) can improve generalization, they suffer from high inference latency due to lengthy reasoning traces. We propose Fast-ThinkAct, an efficient reasoning framework that achieves compact yet performant planning through verbalizable latent reasoning. Fast-ThinkAct learns to reason efficiently with latent CoTs by distilling from a teacher, driven by a preference-guided objective to align manipulation trajectories that transfers both linguistic and visual planning capabilities for embodied control. This enables reasoning-enhanced policy learning that effectively connects compact reasoning to action execution. Extensive experiments across diverse embodied manipulation and reasoning benchmarks demonstrate that Fast-ThinkAct achieves strong performance with up to 89.3% reduced inference latency over state-of-the-art reasoning VLAs, while maintaining effective long-horizon planning, few-shot adaptation, and failure recovery.
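
To put the headline number in perspective, the short sketch below converts a percentage latency reduction into the equivalent speedup factor. It is plain arithmetic, not code from the paper: the function name `speedup_from_latency_reduction` is a hypothetical helper, and because the abstract reports the 89.3% figure as an upper bound ("up to"), the result is a best-case value. No absolute baseline latencies are given here, so none are assumed.

```python
def speedup_from_latency_reduction(reduction_pct: float) -> float:
    """Return the speedup factor implied by a percentage latency reduction.

    A reduction of r percent leaves (1 - r/100) of the baseline latency,
    so the speedup is the reciprocal of that remaining fraction.
    """
    if not 0.0 <= reduction_pct < 100.0:
        raise ValueError("reduction_pct must be in [0, 100)")
    remaining_fraction = 1.0 - reduction_pct / 100.0
    return 1.0 / remaining_fraction


# 89.3% is the "up to" figure quoted in the abstract (best case).
best_case_speedup = speedup_from_latency_reduction(89.3)
print(f"{best_case_speedup:.2f}x")  # prints "9.35x"
```

In other words, in the best reported case inference takes about 10.7% of the baseline time, which is roughly a 9.3x speedup over the reasoning VLA being compared against.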

Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action

reasoning latency

chain-of-thought

embodied reasoning

inference efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

latent reasoning

vision-language-action

chain-of-thought distillation