Fail2Drive: Benchmarking Closed-Loop Driving Generalization

📅 2026-04-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing closed-loop autonomous driving benchmarks struggle to disentangle model memorization from genuine generalization because they reuse training scenarios at test time. This work introduces Fail2Drive, the first paired-route benchmark, which constructs 200 routes in CARLA along with 17 new scenario classes spanning appearance, layout, behavioral, and robustness shifts. Each out-of-distribution route is matched with an in-distribution counterpart derived from the same origin, enabling isolated evaluation of distribution-shift effects. Leveraging privileged expert validation to confirm scenario solvability, the framework converts qualitative failures into quantitative diagnostics, uncovering fundamental failure modes such as models ignoring LiDAR-visible objects and conflating free space with occupied regions. Experiments reveal a 22.8% average drop in success rate among state-of-the-art models, exposing severe generalization deficiencies. The project releases all code, data, and tools to establish a reproducible foundation for closed-loop driving generalization research.
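The paired-route protocol described above can be sketched in a few lines: because every shifted route has a matched in-distribution counterpart, the effect of the shift reduces to comparing success rates over the two halves of each pair. The following is a minimal illustrative sketch, not code from the released Fail2Drive toolbox; the route ids and data structures are hypothetical.

```python
# Hypothetical sketch of a paired-route evaluation, in the spirit of
# Fail2Drive. Route ids and results below are illustrative only.

def success_rate(results):
    """Fraction of routes completed successfully."""
    return sum(results.values()) / len(results)

def paired_drop(in_dist, out_dist):
    """Per-pair and average success-rate drop between matched routes.

    in_dist / out_dist map a shared route id to True (success) or
    False (failure); each OOD route shares its id with its ID twin.
    """
    drops = {r: int(in_dist[r]) - int(out_dist[r]) for r in in_dist}
    avg_drop = success_rate(in_dist) - success_rate(out_dist)
    return drops, avg_drop

# Toy example: 4 matched pairs; the model solves every ID route but
# fails two of the shifted counterparts.
id_results  = {"route_00": True, "route_01": True,
               "route_02": True, "route_03": True}
ood_results = {"route_00": True, "route_01": False,
               "route_02": True, "route_03": False}

drops, avg = paired_drop(id_results, ood_results)
print(avg)  # 0.5, i.e. a 50% success-rate drop attributable to the shift
```

Because each pair differs only in the injected shift, a nonzero per-pair drop localizes the failure to that shift rather than to route difficulty, which is what turns qualitative failures into quantitative diagnostics.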
📝 Abstract
Generalization under distribution shift remains a central bottleneck for closed-loop autonomous driving. Although simulators like CARLA enable safe and scalable testing, existing benchmarks rarely measure true generalization: they typically reuse training scenarios at test time. Success can therefore reflect memorization rather than robust driving behavior. We introduce Fail2Drive, the first paired-route benchmark for closed-loop generalization in CARLA, with 200 routes and 17 new scenario classes spanning appearance, layout, behavioral, and robustness shifts. Each shifted route is matched with an in-distribution counterpart, isolating the effect of the shift and turning qualitative failures into quantitative diagnostics. Evaluating multiple state-of-the-art models reveals consistent degradation, with an average success-rate drop of 22.8%. Our analysis uncovers unexpected failure modes, such as ignoring objects clearly visible in the LiDAR and failing to learn the fundamental concepts of free and occupied space. To accelerate follow-up work, Fail2Drive includes an open-source toolbox for creating new scenarios and validating solvability via a privileged expert policy. Together, these components establish a reproducible foundation for benchmarking and improving closed-loop driving generalization. We open-source all code, data, and tools at https://github.com/autonomousvision/fail2drive.
Problem

Research questions and friction points this paper is trying to address.

closed-loop autonomous driving
generalization
distribution shift
benchmarking
CARLA
Innovation

Methods, ideas, or system contributions that make the work stand out.

closed-loop driving
distribution shift
generalization benchmark
paired-route evaluation
autonomous driving simulation