Target-Bench: Can World Models Achieve Mapless Path Planning with Semantic Targets?

📅 2025-11-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work evaluates whether world models can perform mapless, semantics-driven robot navigation in real-world environments. It introduces Target-Bench, the first benchmark designed specifically for mapless path planning toward semantic targets, comprising 450 robot-collected video sequences across 45 semantic categories with SLAM-based ground-truth trajectories. The evaluation pipeline recovers camera motion from generated videos and scores it with five complementary metrics covering target-reaching capability, trajectory accuracy, and directional consistency. The best off-the-shelf model, Wan2.2-Flash, reaches an overall score of only 0.299. An open-source 5B-parameter world model fine-tuned on 325 scenarios reaches 0.345, more than 400% above its base version (0.066) and 15% higher than the best off-the-shelf model. The authors plan to open-source the code and dataset.

📝 Abstract
While recent world models generate highly realistic videos, their ability to perform robot path planning remains unclear and unquantified. We introduce Target-Bench, the first benchmark specifically designed to evaluate world models on mapless path planning toward semantic targets in real-world environments. Target-Bench provides 450 robot-collected video sequences spanning 45 semantic categories with SLAM-based ground-truth trajectories. Our evaluation pipeline recovers camera motion from generated videos and measures planning performance using five complementary metrics that quantify target-reaching capability, trajectory accuracy, and directional consistency. We evaluate state-of-the-art models including Sora 2, Veo 3.1, and the Wan series. The best off-the-shelf model (Wan2.2-Flash) achieves an overall score of only 0.299, revealing significant limitations in current world models for robotic planning tasks. We show that fine-tuning an open-source 5B-parameter model on only 325 scenarios from our dataset achieves an overall score of 0.345, an improvement of more than 400% over its base version (0.066) and 15% higher than the best off-the-shelf model. We will open-source the code and dataset.
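The two percentages in the abstract are relative gains measured against different baselines, which is easy to misread. A minimal sketch of the arithmetic, using only the three scores quoted above (the function and variable names are illustrative, not from the paper):

```python
def relative_gain(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, as a fraction of `old`."""
    return (new - old) / old


# Overall scores quoted in the abstract.
FINE_TUNED = 0.345          # fine-tuned open-source 5B-parameter model
BASE = 0.066                # the same model before fine-tuning
BEST_OFF_THE_SHELF = 0.299  # Wan2.2-Flash

gain_over_base = relative_gain(FINE_TUNED, BASE)
gain_over_best = relative_gain(FINE_TUNED, BEST_OFF_THE_SHELF)
absolute_gap = FINE_TUNED - BEST_OFF_THE_SHELF

print(f"vs. base version:       +{gain_over_base:.0%}")   # roughly +423%
print(f"vs. best off-the-shelf: +{gain_over_best:.0%}")   # roughly +15%
print(f"absolute gap to best:   {absolute_gap:.3f}")      # 0.046
```

The "more than 400%" figure is therefore relative to the base model, while the 15% figure is relative to the best off-the-shelf model; the absolute gap between those two is about 0.046 score points.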
Problem

Research questions and friction points this paper is trying to address.

Evaluating world models for mapless path planning with semantic targets
Quantifying planning performance through target-reaching and trajectory metrics
Revealing limitations of current models and showing fine-tuning improvements
Innovation

Methods, ideas, or system contributions that make the work stand out.

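Target-Bench, a benchmark of 450 robot-collected video sequences across 45 semantic categories for evaluating world models on mapless path planning
An evaluation pipeline that recovers camera motion from generated videos and scores it with five complementary metrics
Fine-tuning an open-source 5B-parameter model on 325 scenarios raises its overall score from 0.066 to 0.345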
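To make the idea of scoring a recovered trajectory against ground truth concrete, here is a rough sketch. The summary does not define the benchmark's five metrics, so the functions below are generic displacement-style measures chosen purely for illustration; the names, the 2D point representation, and the reach radius are all assumptions, not the paper's definitions.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]  # assumed 2D ground-plane position


def average_displacement_error(pred: Sequence[Point], gt: Sequence[Point]) -> float:
    """Mean point-wise distance over the overlapping prefix of two trajectories."""
    n = min(len(pred), len(gt))
    if n == 0:
        raise ValueError("both trajectories must contain at least one point")
    return sum(math.dist(pred[i], gt[i]) for i in range(n)) / n


def final_displacement_error(pred: Sequence[Point], gt: Sequence[Point]) -> float:
    """Distance between the last predicted point and the last ground-truth point."""
    return math.dist(pred[-1], gt[-1])


def reaches_target(pred: Sequence[Point], target: Point, radius: float) -> bool:
    """True if the trajectory ends within `radius` of the target (hypothetical criterion)."""
    return math.dist(pred[-1], target) <= radius


# Toy example: the predicted path drifts one unit sideways after the first step.
gt_path = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
pred_path = [(0.0, 0.0), (1.0, 1.0), (2.0, 1.0)]

ade = average_displacement_error(pred_path, gt_path)
fde = final_displacement_error(pred_path, gt_path)
reached_tight = reaches_target(pred_path, target=(2.0, 0.0), radius=0.5)
reached_loose = reaches_target(pred_path, target=(2.0, 0.0), radius=1.0)
```

In this toy case the path ends one unit from the target, so it counts as reaching the target only under the looser radius. A real pipeline would first have to estimate the camera trajectory from the generated video and align it with the SLAM frame, which this sketch leaves out.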