Grounding Sim-to-Real Generalization in Dexterous Manipulation: An Empirical Study with Vision-Language-Action Models

📅 2026-03-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the lack of systematic evaluation of Vision-Language-Action (VLA) models in sim-to-real transfer for dexterous manipulation. It introduces a real-world evaluation protocol that covers variations in background, lighting, distractors, object types, and spatial configurations. Through more than 10,000 physical trials, the study examines how multi-level domain randomization, photorealistic rendering, physics-realistic modeling, and reinforcement learning updates affect generalization. The authors release the robotic platforms and the evaluation protocol to support independent verification and to establish a standardized benchmark for dexterous manipulation policies.

📝 Abstract

Learning a generalist control policy for dexterous manipulation typically relies on large-scale datasets. Given the high cost of real-world data collection, a practical alternative is to generate synthetic data through simulation. However, the resulting synthetic data often exhibits a significant gap from real-world distributions. While many prior studies have proposed algorithms to bridge the Sim-to-Real discrepancy, there remains a lack of principled research that grounds these methods in real-world manipulation tasks, particularly their performance on generalist policies such as Vision-Language-Action (VLA) models. In this study, we empirically examine the primary determinants of Sim-to-Real generalization across four dimensions: multi-level domain randomization, photorealistic rendering, physics-realistic modeling, and reinforcement learning updates. To support this study, we design a comprehensive evaluation protocol to quantify the real-world performance of manipulation tasks. The protocol accounts for key variations in background, lighting, distractors, object types, and spatial features. Through experiments involving over 10k real-world trials, we derive critical insights into Sim-to-Real transfer. To inform and advance future studies, we release both the robotic platforms and the evaluation protocol for public access to facilitate independent verification, thereby establishing a realistic and standardized benchmark for dexterous manipulation policies.
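As a rough illustration of how results from such a protocol might be tallied, the sketch below groups real-world trials by the variation axis that was perturbed and reports a success rate per axis. The five axis names come from the abstract; the `Trial` record, the `success_rate_by_axis` helper, and the example data are hypothetical and are not taken from the paper or its released code.

```python
from collections import defaultdict
from dataclasses import dataclass

# The five variation axes named in the abstract.
AXES = ("background", "lighting", "distractors", "object_type", "spatial")


@dataclass(frozen=True)
class Trial:
    """Hypothetical record of one physical trial (not the paper's schema)."""
    axis: str      # which variation axis was perturbed in this trial
    success: bool  # whether the policy completed the task


def success_rate_by_axis(trials):
    """Return {axis: fraction of successful trials} for axes that have trials."""
    counts = defaultdict(lambda: [0, 0])  # axis -> [successes, total]
    for t in trials:
        if t.axis not in AXES:
            raise ValueError(f"unknown axis: {t.axis}")
        counts[t.axis][0] += int(t.success)
        counts[t.axis][1] += 1
    return {axis: s / n for axis, (s, n) in counts.items()}


# Made-up example data, for illustration only.
trials = [
    Trial("lighting", True),
    Trial("lighting", False),
    Trial("background", True),
    Trial("background", True),
]
rates = success_rate_by_axis(trials)
print(rates)
```

Reporting one rate per axis, rather than a single pooled number, makes it easier to see which kind of variation a policy is most sensitive to; the paper's own metrics and aggregation may differ.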

Problem

Research questions and friction points this paper is trying to address.

Sim-to-Real

dexterous manipulation

Vision-Language-Action models

generalization

domain gap

Innovation

Methods, ideas, or system contributions that make the work stand out.

Sim-to-Real Generalization

Dexterous Manipulation

Vision-Language-Action Models

Domain Randomization

Evaluation Benchmark

🔎 Similar Papers

VLATest: Testing and Evaluating Vision-Language-Action Models for Robotic Manipulation

2024-09-19 · Citations: 5