🤖 AI Summary
Current vision-language-action (VLA) models suffer from domain overfitting, opaque inference, and high auxiliary generation latency, limiting their practicality as collaborative assistants. To address these issues, we propose a self-reflective vision-language-action coordination framework built on a Mixture-of-Experts (MoE) architecture with reflective reasoning, enabling explicit uncertainty modeling, interpretable decision-making, and proactive human intervention. Our method integrates large language model (LLM)-driven reflective reasoning with diffusion-based action generation, supported by a two-stage training strategy: action grounding followed by reflection fine-tuning. Experiments demonstrate that, compared to generative agents, our approach reduces normalized inference time by ~2×, cuts invalid “Dream” sampling by ~4×, and improves task success rates, while achieving high interpretability and low latency. This advances VLA models from black-box controllers toward trustworthy, human-centered assistive agents.
📝 Abstract
In this work, we present CollabVLA, a self-reflective vision-language-action framework that transforms a standard visuomotor policy into a collaborative assistant. CollabVLA tackles key limitations of prior VLAs, including domain overfitting, non-interpretable reasoning, and the high latency of auxiliary generative models, by integrating VLM-based reflective reasoning with diffusion-based action generation under a mixture-of-experts design. Through a two-stage training recipe of action grounding and reflection tuning, it supports explicit self-reflection and proactively solicits human guidance when confronted with uncertainty or repeated failure. It cuts normalized inference time by ~2× and Dream counts by ~4× versus generative agents, achieving higher success rates, improved interpretability, and low latency compared with existing methods. This work takes a pioneering step toward shifting VLAs from opaque controllers to genuinely assistive agents capable of reasoning, acting, and collaborating with humans.
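The control flow the abstract describes — route to the fast action expert by default, fall back to reflective reasoning under uncertainty, and escalate to a human after repeated failure — can be sketched as a simple router. This is a minimal illustrative sketch, not the paper's implementation; the class name, thresholds, and expert labels are all assumptions for exposition.

```python
# Hypothetical sketch of CollabVLA-style expert routing (names and thresholds
# are illustrative, not from the paper): a gate inspects per-step uncertainty;
# low uncertainty routes to the diffusion action expert, high uncertainty
# triggers reflective reasoning, and repeated failures escalate to the human.
from dataclasses import dataclass


@dataclass
class ReflectiveRouter:
    reflect_threshold: float = 0.5  # assumed: uncertainty above this triggers reflection
    ask_after_failures: int = 3     # assumed: ask the human after this many failures
    failures: int = 0               # consecutive-failure counter

    def route(self, uncertainty: float, failed: bool) -> str:
        # Track consecutive failures; any success resets the counter.
        self.failures = self.failures + 1 if failed else 0
        if self.failures >= self.ask_after_failures:
            return "ask_human"      # proactively solicit human guidance
        if uncertainty > self.reflect_threshold:
            return "reflect"        # LLM-driven reflective reasoning expert
        return "act"                # diffusion-based action expert


router = ReflectiveRouter()
print(router.route(uncertainty=0.2, failed=False))  # low uncertainty -> "act"
print(router.route(uncertainty=0.8, failed=False))  # high uncertainty -> "reflect"
```

The point of the sketch is that routing is explicit and inspectable: every step yields a human-readable decision ("act", "reflect", "ask_human"), which is the interpretability property the framework claims over black-box controllers.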