Bring the Apple, Not the Sofa: Impact of Irrelevant Context in Embodied AI Commands on VLA Models

📅 2025-10-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work systematically investigates the robustness of Vision-Language-Action (VLA) models to natural language interference in embodied AI settings, focusing on two prevalent noise types: semantically related but irrelevant contextual distractors and instruction paraphrasing. We propose the first large language model (LLM)-based core instruction extraction and filtering framework, which jointly leverages semantic and lexical distance metrics for noise-aware instruction purification. Experimental results show that semantically similar interference degrades VLA performance by up to 50%—substantially exceeding degradation from random noise (<10%) and human-authored paraphrases (~20%)—and that performance decay intensifies with increasing context length. Our method restores perturbed performance to 98.5% of the original clean baseline, significantly enhancing VLA model reliability and practicality in realistic, linguistically complex environments.

Technology Category

Application Category

📝 Abstract
Vision Language Action (VLA) models are widely used in Embodied AI, enabling robots to interpret and execute language instructions. However, their robustness to natural language variability in real-world scenarios has not been thoroughly investigated. In this work, we present a novel systematic study of the robustness of state-of-the-art VLA models under linguistic perturbations. Specifically, we evaluate model performance under two types of instruction noise: (1) human-generated paraphrasing and (2) the addition of irrelevant context. We further categorize irrelevant contexts into two groups according to their length and their semantic and lexical proximity to robot commands. In this study, we observe consistent performance degradation as context size expands. We also demonstrate that the model can exhibit relative robustness to random context, with a performance drop within 10%, while semantically and lexically similar context of the same length can trigger a quality decline of around 50%. Human paraphrases of instructions lead to a drop of nearly 20%. To mitigate this, we propose an LLM-based filtering framework that extracts core commands from noisy inputs. Incorporating our filtering step allows models to recover up to 98.5% of their original performance under noisy conditions.
Problem

Research questions and friction points this paper is trying to address.

Evaluating VLA model robustness to linguistic perturbations in instructions
Assessing performance degradation from irrelevant context and paraphrasing
Proposing LLM-based filtering to mitigate noise impact on commands
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposed LLM-based filtering for noisy inputs
Extracted core commands from irrelevant context
Recovered model performance under linguistic perturbations