🤖 AI Summary
Existing simulation benchmarks for robotic manipulation lack visual fidelity, and the resulting domain gap between simulation and reality undermines their predictive validity. To address this limitation, this work proposes VISER, a visually realistic simulation benchmark built on 3D assets with physically based rendering (PBR) materials. An automated pipeline uses multimodal large language models (MLLMs) for material-aware part segmentation and material retrieval, which allows physically plausible assets to be generated at scale. The benchmark includes a dataset of over 1,000 3D assets and supports grasping, placing, and long-horizon tasks. Experiments show an average Pearson correlation coefficient of 0.92 between simulated and real-world performance across different policies, indicating that results in VISER are a reliable predictor of real-world results.
📝 Abstract
Reliable simulation evaluation of robot manipulation policies serves as a high-fidelity proxy for real-world performance. Although existing benchmarks cover a wide range of task categories, they lack visual realism, which creates a large domain gap between simulation and reality and undermines the reliability of simulation-based evaluation in predicting real-world performance. To mitigate the sim-to-real visual gap, we conduct a systematic analysis to isolate the effects of lighting and material. Our results show that these factors play a critical role in geometric reasoning and spatial grounding, yet are largely overlooked in existing benchmarks. Motivated by this analysis, we propose VISER, a visually realistic benchmark for evaluating robot manipulation in simulation. VISER features a high-fidelity dataset of over 1,000 3D assets with physically based rendering (PBR) materials, along with 3D scenes created from these assets through curated layouts or generation. To build the dataset, we propose an automated pipeline that leverages Multimodal Large Language Models (MLLMs) for material-aware part segmentation and material retrieval, enabling scalable generation of physically plausible assets. Building on this dataset, we construct diverse evaluation tasks, such as grasping, placing, and long-horizon tasks, enabling scalable and reproducible assessment of Vision-Language-Action (VLA) models. Our benchmark shows a strong correlation between simulation and real-world performance, achieving an average Pearson correlation coefficient of 0.92 across different policies.
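To make the reported metric concrete, the following is a minimal sketch of how a sim-to-real Pearson correlation could be computed. The abstract does not specify how the average is taken, so this sketch assumes one correlation per task, computed across policies, and then averaged over tasks. The task names and success rates are hypothetical placeholders, not results from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical success rates: each list holds one value per policy,
# in the same policy order for simulation and for the real robot.
results = {
    "grasping": {"sim": [0.82, 0.64, 0.45, 0.30], "real": [0.78, 0.60, 0.50, 0.25]},
    "placing": {"sim": [0.75, 0.55, 0.40, 0.20], "real": [0.70, 0.58, 0.33, 0.22]},
}

# Assumed aggregation: correlate across policies per task, then average.
per_task_r = {task: pearson(v["sim"], v["real"]) for task, v in results.items()}
mean_r = sum(per_task_r.values()) / len(per_task_r)

for task, r in per_task_r.items():
    print(f"{task}: r = {r:.2f}")
print(f"average r = {mean_r:.2f}")
```

A value near 1 means that policies ranked highly in simulation also tend to rank highly on the real robot, which is the property a predictive benchmark needs. Pearson correlation measures linear agreement only, so it does not by itself show that absolute success rates match.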