Colosseum V2: Benchmarking Generalization for Vision Language Action Models

📅 2026-05-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two problems: current vision-language-action (VLA) models generalize poorly under distribution shift, and the field lacks a systematic benchmark for measuring that generalization. It introduces Colosseum V2, a large-scale simulation benchmark comprising 28 tasks across 13 task categories and two robot morphologies, which supports standardized in-domain and out-of-domain evaluation within a single framework. The benchmark is built on the ManiSkill simulator and uses GPU parallelization for fast evaluation. The authors evaluate state-of-the-art methods, including ACT and Pi0.5, and find limitations in both base performance and generalization. They also report strong correlations between simulation and real-world metrics, positioning Colosseum V2 as a platform for fair, efficient, and reproducible comparison of algorithms.
📝 Abstract
Vision-Language-Action (VLA) models demonstrate promising generalization in robotic manipulation, driven by advances in large-scale vision and language pre-training. This progress can be misleading. Despite the zero-shot perception and language capabilities of VLAs, their overall task performance often degrades under distribution shifts, revealing gaps in how these systems translate high-level understanding into robust behavior. To systematically study this gap, we introduce Colosseum V2, a large-scale simulation benchmark for evaluating VLA generalization in robot learning across diverse conditions. The benchmark comprises 28 tasks spanning 13 task categories and two robot morphologies, covering a wide range of manipulation primitives and long-horizon behaviors. Built on the ManiSkill simulator, Colosseum V2 enables fast, GPU-parallelized evaluation and supports both in-domain and out-of-domain testing at scale. We evaluate state-of-the-art methods, including Action Chunking Transformers (ACT) and Pi0.5, and reveal limitations in both base performance and generalization. We demonstrate strong correlations between simulation and real-world metrics that support the ecological validity of the benchmark. By standardizing tasks, metrics, and evaluation protocols within a unified benchmark, Colosseum V2 enables reproducible and fair comparisons, reduced evaluation overhead, and accelerated progress toward general-purpose robot policies.
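The abstract names two quantities without saying how they are computed: the drop from in-domain to out-of-domain performance, and the correlation between simulation and real-world metrics. The sketch below shows one plausible way to compute each from per-episode outcomes and per-task success rates. It assumes success rate as the metric and Pearson correlation as the statistic. Neither is confirmed by the abstract, the function names are hypothetical, and the numbers are placeholders rather than results from the paper.

```python
from math import sqrt


def success_rate(outcomes):
    """Fraction of successful episodes; `outcomes` is a non-empty list of booleans."""
    if not outcomes:
        raise ValueError("need at least one episode")
    return sum(outcomes) / len(outcomes)


def generalization_gap(in_domain, out_of_domain):
    """In-domain success rate minus out-of-domain success rate (assumed definition)."""
    return success_rate(in_domain) - success_rate(out_of_domain)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Placeholder episode outcomes for one policy on one task (not data from the paper).
in_domain_episodes = [True, True, True, False]
out_of_domain_episodes = [True, False, False, False]
gap = generalization_gap(in_domain_episodes, out_of_domain_episodes)

# Placeholder per-task success rates in simulation and on a real robot.
sim_rates = [0.2, 0.4, 0.6, 0.8]
real_rates = [0.1, 0.3, 0.5, 0.7]
r = pearson(sim_rates, real_rates)

print(f"generalization gap: {gap:.2f}")
print(f"sim-to-real Pearson r: {r:.3f}")
```

A rank-based statistic such as Spearman correlation would be a reasonable alternative when only the ordering of policies between simulation and the real world matters.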
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action models
generalization
distribution shift
robotic manipulation
benchmark
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language-Action models
generalization benchmark
robotic manipulation
simulation-to-reality transfer
GPU-parallelized evaluation
Authors
Jeremy Morgan
Department of Computer Science, University of Southern California, Los Angeles, CA, USA.
Prajwal Vijay
Department of Electrical Engineering, Indian Institute of Technology Madras, Chennai, India.
Hyeonho Oh
Department of Computer Science, University of Southern California, Los Angeles, CA, USA.
Jincen Song
Fu Foundation School of Engineering and Applied Science, Columbia University, New York, NY, USA.
Ashvin Arora
Department of Computer Science, University of Southern California, Los Angeles, CA, USA.
Alina Du
Department of Computer Science, University of Southern California, Los Angeles, CA, USA.
Gaurav S. Sukhatme
Department of Computer Science, University of Southern California, Los Angeles, CA, USA.
Jesse Thomason
Assistant Professor, University of Southern California
Natural Language Processing, Artificial Intelligence, Robotics
Ishika Singh
CS PhD student, USC
Natural Language Processing, Robotics, Machine Learning