🤖 AI Summary
This work addresses the challenge of asynchronous semantic reasoning and real-time control in dynamic environments, a key limitation of existing vision-language-action (VLA) models. To bridge this gap, we propose TIC-VLA, a novel framework that introduces a delay-aware semantic-control interface, explicitly integrating delayed semantic states and metadata with current observations to enable robust action generation. We further design a delay-consistent training mechanism that combines imitation learning and online reinforcement learning within DynaNav, a high-fidelity dynamic navigation simulation platform augmented with controllable inference delays. Experimental results demonstrate that TIC-VLA significantly outperforms prior methods in both simulation and real-world robotic settings, maintaining efficient and robust language-guided navigation performance even under multi-second reasoning delays.
📝 Abstract
Robots in dynamic, human-centric environments must follow language instructions while maintaining real-time reactive control. Vision-language-action (VLA) models offer a promising framework, but they assume temporally aligned reasoning and control, despite semantic inference being inherently delayed relative to real-time action. We introduce Think-in-Control (TIC)-VLA, a latency-aware framework that explicitly models delayed semantic reasoning during action generation. TIC-VLA defines a delayed semantic-control interface that conditions action generation on delayed vision-language semantic states and explicit latency metadata, in addition to current observations, enabling policies to compensate for asynchronous reasoning. We further propose a latency-consistent training pipeline that injects reasoning inference delays during imitation learning and online reinforcement learning, aligning training with asynchronous deployment. To support realistic evaluation, we present DynaNav, a physics-accurate, photo-realistic simulation suite for language-guided navigation in dynamic environments. Extensive experiments in simulation and on a real robot show that TIC-VLA consistently outperforms prior VLA models while maintaining robust real-time control under multi-second reasoning latency. Project website: https://ucla-mobility.github.io/TIC-VLA/
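To make the delayed semantic-control interface concrete, here is a minimal, self-contained sketch of the core idea: the control loop runs at every step on the current observation, while slow semantic reasoning delivers its result only after a multi-step delay, and the policy is conditioned on that stale semantic state plus explicit latency metadata. All names (`SemanticState`, `act`, `run`, `REASONING_DELAY_STEPS`) and the specific numbers are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass

# Assumed numbers for illustration: semantic inference takes ~20 control
# steps (e.g. ~2 s at a 10 Hz control rate), matching the "multi-second
# reasoning latency" regime described in the abstract.
REASONING_DELAY_STEPS = 20


@dataclass
class SemanticState:
    """Stand-in for the output of the slow vision-language reasoner."""
    features: tuple       # placeholder for vision-language semantic features
    captured_step: int    # control step of the observation it was computed from


def act(observation, sem: SemanticState, step: int) -> dict:
    """Delay-aware interface (illustrative): condition action generation on
    the current observation, the delayed semantic state, and explicit
    latency metadata (how stale that state is)."""
    latency = step - sem.captured_step
    # A real policy would be a learned network; here we only package its inputs.
    return {"obs": observation, "semantic": sem.features, "latency_steps": latency}


def run(num_steps: int = 30):
    """Asynchronous loop: control acts every step; a new semantic state
    arrives only after REASONING_DELAY_STEPS, computed from an older frame."""
    sem = SemanticState(features=("go", "to", "door"), captured_step=0)
    pending_since = 0  # step whose observation the in-flight inference uses
    log = []
    for step in range(num_steps):
        if step - pending_since >= REASONING_DELAY_STEPS:
            # Inference finishes: its result reflects the old input frame.
            sem = SemanticState(features=("go", "to", "door"),
                                captured_step=pending_since)
            pending_since = step
        log.append(act(f"frame_{step}", sem, step))
    return log
```

The point of the latency field is that the policy can learn to discount or extrapolate stale semantics rather than treating them as synchronous with the current observation, which is the failure mode the abstract attributes to prior VLA models.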