AI Summary
Existing benchmarks lack systematic quantification of visual biases, hindering deep understanding of decision-making stability in embodied agents. To address this, we propose RoboView-Bias, the first benchmark dedicated to evaluating visual bias in robotic manipulation. Grounded in the principle of factor isolation, it introduces a structured variant-generation framework and a perception fairness verification protocol, enabling, for the first time, robust measurement of biases induced by individual visual factors (e.g., viewpoint, color) and their interactions. Leveraging vision-language models (VLMs) for policy execution and a semantic grounding layer for bias mitigation, we conduct comprehensive bias analysis and correction. Evaluated across 2,127 task instances, three state-of-the-art embodied agents exhibit significant performance degradation due to viewpoint and color preferences. Integrating the semantic grounding layer reduces MOKA's visual bias by 54.5%, demonstrating the efficacy of our approach in enhancing perceptual fairness and decision robustness.
Abstract
The safety and reliability of embodied agents rely on accurate and unbiased visual perception. However, existing benchmarks mainly emphasize generalization and robustness under perturbations, while systematic quantification of visual bias remains scarce. This gap limits a deeper understanding of how perception influences decision-making stability. To address this issue, we propose RoboView-Bias, the first benchmark specifically designed to systematically quantify visual bias in robotic manipulation, following a principle of factor isolation. Leveraging a structured variant-generation framework and a perceptual-fairness validation protocol, we create 2,127 task instances that enable robust measurement of biases induced by individual visual factors and their interactions. Using this benchmark, we systematically evaluate three representative embodied agents across two prevailing paradigms and report three key findings: (i) all agents exhibit significant visual biases, with camera viewpoint being the most critical factor; (ii) agents achieve their highest success rates on highly saturated colors, indicating inherited visual preferences from underlying VLMs; and (iii) visual biases show strong, asymmetric coupling, with viewpoint strongly amplifying color-related bias. Finally, we demonstrate that a mitigation strategy based on a semantic grounding layer substantially reduces visual bias by approximately 54.5% on MOKA. Our results highlight that systematic analysis of visual bias is a prerequisite for developing safe and reliable general-purpose embodied agents.
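A minimal sketch of how factor-isolated bias measurement of this kind could be set up, assuming per-episode results recorded as dicts with the varied visual factor and a success flag (the data layout, field names, and the max-min gap used as a bias proxy are illustrative assumptions, not the benchmark's actual protocol):

```python
from collections import defaultdict
from statistics import mean

def success_rate_by_factor(results, factor):
    """Group per-episode outcomes by the value of one visual factor
    (e.g., camera viewpoint or object color) and average success.

    `results` is assumed to be a list of dicts such as
    {"viewpoint": "front", "color": "red", "success": True},
    one per task instance, with all other factors held fixed."""
    grouped = defaultdict(list)
    for episode in results:
        grouped[episode[factor]].append(float(episode["success"]))
    return {value: mean(outcomes) for value, outcomes in grouped.items()}

def bias_score(results, factor):
    """A simple bias proxy: the gap between the best- and worst-performing
    values of a single factor. Zero would indicate no measurable preference."""
    rates = success_rate_by_factor(results, factor)
    return max(rates.values()) - min(rates.values())
```

Under this kind of setup, an interaction effect (e.g., viewpoint amplifying color bias) could be probed by computing `bias_score(results, "color")` separately within each viewpoint group and comparing the gaps.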