Clutter-Resistant Vision-Language-Action Models through Object-Centric and Geometry Grounding

📅 2025-12-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current Vision-Language-Action (VLA) models tightly couple perception and action within a single optimization pipeline, resulting in weak language grounding and poor robustness in real-world desktop settings—particularly under cluttered backgrounds, target occlusion, and appearance overfitting. To address this, we propose OBEYED-VLA, the first VLA framework integrating dual grounding: object-centric grounding via vision-language model (VLM)-guided object region selection, and geometric-structural grounding via multi-view 3D reconstruction and 3D-structure-prioritized representation learning. This enables explicit decoupling of perception and action modules. Furthermore, we fine-tune a pre-trained VLA policy on clean, single-object demonstrations. Experiments on the UR10e robotic platform demonstrate significant robustness improvements across four challenging scenarios: distractor interference, target occlusion, background variation, and dense unknown-object manipulation. Ablation studies confirm that both grounding mechanisms are indispensable.

📝 Abstract
Recent Vision-Language-Action (VLA) models have made impressive progress toward general-purpose robotic manipulation by post-training large Vision-Language Models (VLMs) for action prediction. Yet most VLAs entangle perception and control in a monolithic pipeline optimized purely for action, which can erode language-conditioned grounding. In our real-world tabletop tests, policies over-grasp when the target is absent, are distracted by clutter, and overfit to background appearance. To address these issues, we propose OBEYED-VLA (OBject-centric and gEometrY groundED VLA), a framework that explicitly disentangles perceptual grounding from action reasoning. Instead of operating directly on raw RGB, OBEYED-VLA augments VLAs with a perception module that grounds multi-view inputs into task-conditioned, object-centric, and geometry-aware observations. This module includes a VLM-based object-centric grounding stage that selects task-relevant object regions across camera views, along with a complementary geometric grounding stage that emphasizes the 3D structure of these objects over their appearance. The resulting grounded views are then fed to a pretrained VLA policy, which we fine-tune exclusively on single-object demonstrations collected without environmental clutter or non-target objects. On a real-world UR10e tabletop setup, OBEYED-VLA substantially improves robustness over strong VLA baselines across four challenging regimes and multiple difficulty levels: distractor objects, absent-target rejection, background appearance changes, and cluttered manipulation of unseen objects. Ablation studies confirm that both semantic grounding and geometry-aware grounding are critical to these gains. Overall, the results indicate that making perception an explicit, object-centric component is an effective way to strengthen and generalize VLA-based robotic manipulation.
Problem

Research questions and friction points this paper is trying to address.

Improves robot manipulation robustness against clutter and distractions
Disentangles perception from action using object-centric and geometric grounding
Enhances generalization to unseen objects and varying backgrounds
Innovation

Methods, ideas, or system contributions that make the work stand out.

Object-centric grounding separates perception from action reasoning
Geometry-aware stage emphasizes 3D structure over appearance
Multi-view inputs are transformed into task-conditioned observations
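The dual-grounding idea above can be sketched as a minimal pipeline. Everything below is illustrative only: the class and function names (`object_centric_grounding`, `geometric_grounding`, `vla_policy`, `GroundedView`) are hypothetical stand-ins, not the paper's implementation, and the VLM region selector and 3D reconstruction are mocked with simple label matching and a depth-only channel.

```python
from dataclasses import dataclass

# Hypothetical sketch of the OBEYED-VLA pipeline shape:
# raw multi-view input -> object-centric grounding -> geometric grounding -> policy.

@dataclass
class GroundedView:
    view_id: int
    object_mask: list   # regions kept by object-centric grounding
    depth_like: list    # geometry channel emphasized over appearance

def object_centric_grounding(views, task_text):
    """Stand-in for the VLM-guided stage: keep only regions whose
    (mocked) label appears in the task description."""
    return [
        {"view_id": i,
         "regions": [r for r in v["regions"] if r["label"] in task_text]}
        for i, v in enumerate(views)
    ]

def geometric_grounding(selected):
    """Stand-in for the 3D-structure stage: drop appearance features and
    keep a geometry-only channel for each retained region."""
    return [
        GroundedView(
            view_id=s["view_id"],
            object_mask=[r["mask"] for r in s["regions"]],
            depth_like=[r["depth"] for r in s["regions"]],
        )
        for s in selected
    ]

def vla_policy(grounded_views):
    """Stand-in for the fine-tuned VLA policy: act only when at least one
    task-relevant region survives grounding (absent-target rejection)."""
    if any(gv.object_mask for gv in grounded_views):
        return {"action": "grasp"}
    return {"action": "no_op"}

# Example: two camera views; only the first contains the target "mug".
views = [
    {"regions": [{"label": "mug", "mask": [1, 1], "depth": [0.40, 0.41]},
                 {"label": "toy", "mask": [1], "depth": [0.30]}]},
    {"regions": [{"label": "toy", "mask": [1], "depth": [0.30]}]},
]
action = vla_policy(
    geometric_grounding(object_centric_grounding(views, "pick up the mug"))
)
```

Note how absent-target rejection falls out of the structure: if the object-centric stage filters out every region, the policy sees empty grounded views and emits no action, rather than over-grasping at clutter.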
👥 Authors
Khoa Vo
University of Arkansas, Fayetteville, AR, USA
Taisei Hanyu
University of Arkansas, Fayetteville, AR, USA
Yuki Ikebe
University of Arkansas, Fayetteville, AR, USA
Trong Thang Pham
University of Arkansas, Fayetteville, AR, USA
Nhat Chung
National University of Singapore, Singapore
Minh Nhat Vu
Automation & Control Institute (ACIN), Vienna, Austria
Robotics
Duy Nguyen Ho Minh
Max Planck Research School for Intelligent Systems and the University of Stuttgart, Stuttgart, Germany
Anh Nguyen
University of Liverpool, Liverpool, U.K.
Anthony Gunderman
University of Arkansas, Fayetteville, AR, USA
Chase Rainwater
University of Arkansas
Logistics, Optimization, Security
Ngan Le
University of Arkansas
Artificial Intelligence, Machine Learning, Computer Vision