🤖 AI Summary
This work addresses the poor generalization of vision-language-action (VLA) models under variations in instruction phrasing, where linguistically equivalent rewordings often cause significant performance degradation. To systematically diagnose these failure mechanisms, the authors introduce the LIBERO-Para benchmark, which decouples action descriptions from object references through independent paraphrasing. They further propose PRIDE, a novel metric that integrates semantic and syntactic factors to quantify paraphrase difficulty. Evaluations across seven state-of-the-art VLA models (ranging from 0.6B to 7.5B parameters) reveal that instruction paraphrasing can reduce task success rates by 22-52 percentage points, with 80-96% of failures stemming from trajectory divergence during the task planning phase, highlighting critical limitations in current evaluation protocols.
📝 Abstract
Vision-Language-Action (VLA) models achieve strong performance in robotic manipulation by leveraging pre-trained vision-language backbones. However, in downstream robotic settings, they are typically fine-tuned on limited data, which leads to overfitting to specific instruction formulations and leaves robustness to paraphrased instructions underexplored. To study this gap, we introduce LIBERO-Para, a controlled benchmark that independently varies action expressions and object references for fine-grained analysis of linguistic generalization. Across seven VLA configurations (0.6B-7.5B), we observe consistent performance degradation of 22-52 pp under paraphrasing. This degradation is primarily driven by object-level lexical variation: even simple synonym substitutions cause large drops, indicating reliance on surface-level matching rather than semantic grounding. Moreover, 80-96% of failures arise from planning-level trajectory divergence rather than execution errors, showing that paraphrasing disrupts task identification. However, binary success rate treats all paraphrases equally, obscuring whether models perform consistently across difficulty levels or merely succeed on easier cases. To address this, we propose PRIDE, a metric that quantifies paraphrase difficulty using semantic and syntactic factors. Our benchmark and corresponding code are available at: https://github.com/cau-hai-lab/LIBERO-Para
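The abstract does not give PRIDE's exact formulation, but the general idea of scoring paraphrase difficulty from a semantic factor and a syntactic factor can be sketched as follows. This is a toy illustration, not the paper's actual metric: the function name, the Jaccard-overlap semantic proxy, the `difflib` syntactic proxy, and the weights are all assumptions.

```python
from difflib import SequenceMatcher


def paraphrase_difficulty(original: str, paraphrase: str,
                          w_semantic: float = 0.5,
                          w_syntactic: float = 0.5) -> float:
    """Toy difficulty score in [0, 1]; higher means the paraphrase
    diverges more from the original instruction.

    Semantic factor: 1 - Jaccard overlap of word sets (a crude
    stand-in for embedding-based semantic similarity).
    Syntactic factor: 1 - character-level sequence similarity,
    which penalizes word-order and phrasing changes.
    """
    a, b = set(original.lower().split()), set(paraphrase.lower().split())
    jaccard = len(a & b) / len(a | b) if (a | b) else 1.0
    semantic = 1.0 - jaccard
    syntactic = 1.0 - SequenceMatcher(
        None, original.lower(), paraphrase.lower()).ratio()
    return w_semantic * semantic + w_syntactic * syntactic
```

Under such a scheme, an identical instruction scores 0, while a synonym substitution like "pick up the mug" vs. "grab the cup" receives a nonzero score; a real metric would additionally separate action-level from object-level variation, as the benchmark does.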