TAG: Target-Agnostic Guidance for Stable Object-Centric Inference in Vision-Language-Action Models

📅 2026-03-25
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing vision-language-action (VLA) policies often suffer from grasping errors in cluttered scenes due to misidentification of the target object, leading to near-misses or unintended grasps. To address this, the paper proposes Target-Agnostic Guidance (TAG), an inference-time mechanism that compares the original observation with a version in which the target object is erased, and uses the resulting residual as a guidance signal that reinforces the policy's reliance on genuine target evidence. TAG requires no architectural modifications and is compatible with existing VLA policies, leveraging classifier-free guidance (CFG) principles for object erasure and signal generation. Evaluated on the LIBERO, LIBERO-Plus, and VLABench benchmarks, TAG substantially improves robustness to visual distractors and mitigates both near-miss and wrong-object grasping behaviors.

πŸ“ Abstract
Vision-Language-Action (VLA) policies have shown strong progress in mapping language instructions and visual observations to robotic actions, yet their reliability degrades in cluttered scenes with distractors. By analyzing failure cases, we find that many errors do not arise from infeasible motions, but from instance-level grounding failures: the policy often produces a plausible grasp trajectory that lands slightly off-target or even on the wrong object instance. To address this issue, we propose TAG (Target-Agnostic Guidance), a simple inference-time guidance mechanism that explicitly reduces distractor- and appearance-induced bias in VLA policies. Inspired by classifier-free guidance (CFG), TAG contrasts policy predictions under the original observation and an object-erased observation, and uses their difference as a residual steering signal that strengthens the influence of object evidence in the decision process. TAG does not require modifying the policy architecture and can be integrated with existing VLA policies with minimal training and inference changes. We evaluate TAG on standard manipulation benchmarks, including LIBERO, LIBERO-Plus, and VLABench, where it consistently improves robustness under clutter and reduces near-miss and wrong-object executions.
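The CFG-style contrast described in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names (`tag_guided_action`), the linear toy policy, and the guidance scale are all assumptions made for the example; the paper's erasure procedure and policy are far richer.

```python
import numpy as np

def tag_guided_action(policy, obs, obs_erased, scale=1.5):
    """CFG-style residual guidance (sketch).

    The difference between the prediction on the full observation and
    on the target-erased observation isolates the part of the action
    attributable to target-object evidence; scaling that residual
    strengthens its influence on the final action.
    """
    a_full = policy(obs)           # prediction with the target visible
    a_erased = policy(obs_erased)  # prediction with the target erased
    return a_erased + scale * (a_full - a_erased)

# Toy demo: a linear "policy" on an 8-dim observation, where the
# (hypothetical) target object occupies the first 4 dimensions.
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 8))
policy = lambda o: W @ o

obs = rng.standard_normal(8)
obs_erased = obs.copy()
obs_erased[:4] = 0.0  # crude stand-in for object erasure

action = tag_guided_action(policy, obs, obs_erased, scale=1.5)
```

With `scale=1.0` the formula reduces to the unguided prediction `policy(obs)`, mirroring how CFG recovers the conditional model at guidance weight 1; values above 1 push the action further along the target-evidence direction.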
Problem

Research questions and friction points this paper is trying to address.

object grounding
vision-language-action models
cluttered scenes
distractor bias
instance-level errors
Innovation

Methods, ideas, or system contributions that make the work stand out.

Target-Agnostic Guidance
Vision-Language-Action Models
Object-Centric Inference
Classifier-Free Guidance
Robotic Manipulation
Jiaying Zhou
Sun Yat-sen University
Zhihao Zhan
TopXGun Robotics (SLAM, Spatial AI, Robotics)
Ruifeng Zhai
Sun Yat-sen University
Qinhan Lyu
Sun Yat-sen University
Hao Liu
Sun Yat-sen University
Keze Wang
Sun Yat-sen University, Guangdong Key Lab of Big Data Analysis & Processing, X-Era AI Lab
Liang Lin
Fellow of IEEE/IAPR, Professor of Computer Science, Sun Yat-sen University (Embodied AI, Causal Inference and Learning, Multimodal Data Analysis)
Guangrun Wang
University of Oxford; AI Research Team at Aistetic (Machine Learning, General Intelligence Theory and Application)