CollabVLA: Self-Reflective Vision-Language-Action Model Dreaming Together with Human

📅 2025-09-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language-action (VLA) models suffer from domain overfitting, opaque inference, and the high latency of auxiliary generative models, which limits their practicality as collaborative assistants. To address these issues, the authors propose a self-reflective VLA coordination framework built on a Mixture-of-Experts (MoE) architecture, enabling explicit uncertainty modeling, interpretable decision-making, and proactive requests for human intervention. The method integrates vision-language model (VLM)-driven reflective reasoning with diffusion-based action generation, supported by a two-stage training strategy: action grounding followed by reflection fine-tuning. Compared with generative agents, the approach reduces normalized inference time by ~2×, cuts invalid "Dream" sampling by ~4×, and improves task success rates while retaining high interpretability and low latency. This advances VLA models from black-box controllers toward trustworthy, human-centered assistive agents.
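To make the MoE-style division of labor concrete, the sketch below shows one way an uncertainty-gated router could dispatch between a reflective VLM expert and a diffusion action expert, asking a human for guidance when the reflection step remains unconfident. This is an illustrative reading of the summary, not the authors' code: CollabRouter, the expert objects, and ask_human are all hypothetical names, and the entropy gate is only a placeholder for whatever uncertainty signal the paper actually uses.

```python
# Illustrative sketch (not the released implementation): uncertainty-gated
# routing between a reflective VLM expert and a diffusion action expert.
import torch
import torch.nn as nn


class CollabRouter(nn.Module):
    """Hypothetical gate over two experts: [act, reflect]."""

    def __init__(self, feat_dim: int, entropy_threshold: float = 0.5):
        super().__init__()
        self.gate = nn.Linear(feat_dim, 2)   # scores for [act, reflect]
        self.entropy_threshold = entropy_threshold

    def forward(self, fused_obs: torch.Tensor):
        # fused_obs: (B, feat_dim) fused vision-language features
        probs = torch.softmax(self.gate(fused_obs), dim=-1)
        # Gate entropy serves as a crude per-sample uncertainty signal.
        entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1)
        reflect = (probs[:, 1] > 0.5) | (entropy > self.entropy_threshold)
        return reflect, probs


def decision_step(router, action_expert, reflection_expert, ask_human, fused_obs):
    """One control step: act directly, or reflect and possibly ask for help."""
    reflect, _ = router(fused_obs)
    if reflect.any():
        # The reflection expert emits a natural-language rationale plus a
        # confidence flag; low confidence triggers a request for human guidance.
        rationale, confident = reflection_expert(fused_obs)
        if not confident:
            guidance = ask_human(rationale)
            fused_obs = reflection_expert.incorporate(fused_obs, guidance)
    # The diffusion action expert denoises an action chunk from the features.
    return action_expert.sample(fused_obs)
```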

📝 Abstract
In this work, we present CollabVLA, a self-reflective vision-language-action framework that transforms a standard visuomotor policy into a collaborative assistant. CollabVLA tackles key limitations of prior VLAs, including domain overfitting, non-interpretable reasoning, and the high latency of auxiliary generative models, by integrating VLM-based reflective reasoning with diffusion-based action generation under a mixture-of-experts design. Through a two-stage training recipe of action grounding and reflection tuning, it supports explicit self-reflection and proactively solicits human guidance when confronted with uncertainty or repeated failure. It cuts normalized time by ~2× and Dream counts by ~4× versus generative agents, achieving higher success rates, improved interpretability, and balanced low latency compared with existing methods. This work takes a pioneering step toward shifting VLAs from opaque controllers to genuinely assistive agents capable of reasoning, acting, and collaborating with humans.
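The abstract's two-stage recipe could be organized roughly as follows. This is a minimal sketch under the assumption that the policy exposes separate action-expert, reflection-expert, and router modules; the loss helpers (action_denoising_loss, reflection_loss), data fields, and hyperparameters are placeholders rather than the paper's reported settings.

```python
# Minimal two-stage training sketch: action grounding, then reflection tuning.
import torch


def train_collabvla(model, demos, reflection_data, epochs_stage1=10, epochs_stage2=5):
    # Stage 1: action grounding. Fit the diffusion action expert on robot
    # demonstrations while keeping the pretrained VLM backbone frozen.
    model.vlm.requires_grad_(False)
    opt1 = torch.optim.AdamW(model.action_expert.parameters(), lr=1e-4)
    for _ in range(epochs_stage1):
        for batch in demos:
            loss = model.action_denoising_loss(batch["obs"], batch["actions"])
            opt1.zero_grad()
            loss.backward()
            opt1.step()

    # Stage 2: reflection tuning. Fine-tune the reflective expert (and the gate)
    # on annotated uncertainty/failure cases so the policy learns when to
    # explain itself and when to solicit human guidance.
    opt2 = torch.optim.AdamW(
        list(model.reflection_expert.parameters()) + list(model.router.parameters()),
        lr=5e-5,
    )
    for _ in range(epochs_stage2):
        for batch in reflection_data:
            loss = model.reflection_loss(batch["obs"], batch["rationale"], batch["ask_flag"])
            opt2.zero_grad()
            loss.backward()
            opt2.step()
    return model
```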
Problem

Research questions and friction points this paper is trying to address.

Overcoming domain overfitting in vision-language-action models
Reducing high latency in auxiliary generative models
Enhancing interpretability and collaboration with human guidance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-reflective vision-language-action framework
Integrates VLM reflective reasoning with diffusion-based action generation (see the sketch after this list)
Two-stage training for grounding and reflection
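As referenced in the list above, the following is a generic DDPM-style sampler for a diffusion action head conditioned on VLM features. It only illustrates the "diffusion action" component in the abstract; the noise schedule, action horizon, and eps_model interface are assumptions, not the paper's reported configuration.

```python
# Generic DDPM-style sampling for a diffusion action head (illustrative only).
import torch


@torch.no_grad()
def sample_actions(eps_model, vlm_features, action_dim=7, horizon=16, steps=50):
    """Denoise a chunk of `horizon` actions conditioned on VLM features."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)

    # Start from Gaussian noise and iteratively denoise.
    x = torch.randn(vlm_features.shape[0], horizon, action_dim)
    for t in reversed(range(steps)):
        eps = eps_model(x, t, vlm_features)               # predicted noise
        coef = betas[t] / torch.sqrt(1.0 - alpha_bar[t])
        x = (x - coef * eps) / torch.sqrt(alphas[t])      # posterior mean
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)  # stochastic step
    return x                                              # (B, horizon, action_dim)
```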
Authors
Nan Sun, University of New South Wales (Cybersecurity; Artificial Intelligence Applications)
Yongchang Li, Department of Computer Science and Technology, Tsinghua University, Beijing, 100084, China
Chenxu Wang, Department of Computer Science and Technology, Tsinghua University, Beijing, 100084, China
Huiying Li, Department of Computer Science and Technology, Tsinghua University, Beijing, 100084, China
Huaping Liu, Professor of Electrical Engineering, Oregon State University (Communication theory; wireless communications; signal processing; sensor networks; information security)