🤖 AI Summary
Current vision-language-action (VLA) models suffer from domain overfitting, opaque inference, and high auxiliary generation latency, limiting their practicality as collaborative assistants. To address these issues, we propose a self-reflective vision-language-action coordination framework built on a Mixture-of-Experts (MoE) architecture with reflective reasoning, enabling explicit uncertainty modeling, interpretable decision-making, and proactive human intervention. Our method integrates large language model (LLM)-driven reflective reasoning with diffusion-based action generation, supported by a two-stage training strategy: action grounding followed by reflection fine-tuning. Experiments demonstrate that, compared to generative agents, our approach reduces normalized inference time by ~2×, cuts invalid “Dream” sampling by ~4×, and improves task success rates, while achieving high interpretability and low latency. This advances VLA models from black-box controllers toward trustworthy, human-centered assistive agents.
📝 Abstract
In this work, we present CollabVLA, a self-reflective vision-language-action framework that transforms a standard visuomotor policy into a collaborative assistant. CollabVLA tackles key limitations of prior VLAs, including domain overfitting, non-interpretable reasoning, and the high latency of auxiliary generative models, by integrating VLM-based reflective reasoning with diffusion-based action generation under a mixture-of-experts design. Through a two-stage training recipe of action grounding and reflection tuning, it supports explicit self-reflection and proactively solicits human guidance when confronted with uncertainty or repeated failure. It cuts normalized inference time by ~2× and Dream counts by ~4× versus generative agents, achieving higher success rates, improved interpretability, and low latency compared with existing methods. This work takes a pioneering step toward shifting VLAs from opaque controllers to genuinely assistive agents capable of reasoning, acting, and collaborating with humans.
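The control flow the abstract describes — route to the fast action expert by default, fall back to reflective reasoning under uncertainty, and escalate to a human after repeated failure — can be sketched as a simple router. This is a minimal illustrative sketch, not the paper's implementation; the class name, thresholds, and expert labels are all assumptions for exposition.

```python
# Hypothetical sketch of CollabVLA-style expert routing (names and thresholds
# are illustrative, not from the paper): a gate inspects per-step uncertainty;
# low uncertainty routes to the diffusion action expert, high uncertainty
# triggers reflective reasoning, and repeated failures escalate to the human.
from dataclasses import dataclass


@dataclass
class ReflectiveRouter:
    reflect_threshold: float = 0.5  # assumed: uncertainty above this triggers reflection
    ask_after_failures: int = 3     # assumed: ask the human after this many failures
    failures: int = 0               # consecutive-failure counter

    def route(self, uncertainty: float, failed: bool) -> str:
        # Track consecutive failures; any success resets the counter.
        self.failures = self.failures + 1 if failed else 0
        if self.failures >= self.ask_after_failures:
            return "ask_human"      # proactively solicit human guidance
        if uncertainty > self.reflect_threshold:
            return "reflect"        # LLM-driven reflective reasoning expert
        return "act"                # diffusion-based action expert


router = ReflectiveRouter()
print(router.route(uncertainty=0.2, failed=False))  # low uncertainty -> "act"
print(router.route(uncertainty=0.8, failed=False))  # high uncertainty -> "reflect"
```

The point of the sketch is that routing is explicit and inspectable: every step yields a human-readable decision ("act", "reflect", "ask_human"), which is the interpretability property the framework claims over black-box controllers.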