WoMAP: World Models For Embodied Open-Vocabulary Object Localization

📅 2025-06-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses zero-shot, open-vocabulary object localization driven by natural language instructions. It targets two weaknesses of existing embodied AI approaches: poor generalization beyond demonstration data in partially observable environments, and actions that are not physically grounded. The proposed recipe has three components: (1) a Gaussian Splatting-based real-to-sim-to-real pipeline that generates data without expert demonstrations; (2) dense reward signals distilled from open-vocabulary object detectors (e.g., Grounding DINO); and (3) a latent world model that predicts dynamics and rewards to ground high-level action proposals at inference time. In simulation and hardware experiments, it attains success rates more than 9× higher than vision-language model (VLM) baselines and more than 2× higher than diffusion policy baselines. The authors also report strong generalization and sim-to-real transfer on a TidyBot hardware platform.

📝 Abstract
Language-instructed active object localization is a critical challenge for robots, requiring efficient exploration of partially observable environments. However, state-of-the-art approaches either struggle to generalize beyond demonstration datasets (e.g., imitation learning methods) or fail to generate physically grounded actions (e.g., VLMs). To address these limitations, we introduce WoMAP (World Models for Active Perception): a recipe for training open-vocabulary object localization policies that (i) uses a Gaussian Splatting-based real-to-sim-to-real pipeline for scalable data generation without the need for expert demonstrations, (ii) distills dense reward signals from open-vocabulary object detectors, and (iii) leverages a latent world model for dynamics and reward prediction to ground high-level action proposals at inference time. Rigorous simulation and hardware experiments demonstrate WoMAP's superior performance in a broad range of zero-shot object localization tasks, with more than 9x and 2x higher success rates compared to VLM and diffusion policy baselines, respectively. Further, we show that WoMAP achieves strong generalization and sim-to-real transfer on a TidyBot.
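As a rough illustration of item (iii), the sketch below ranks a set of high-level action proposals by the reward a world model predicts for the latent state each action would lead to. It is a minimal sketch under stated assumptions, not the authors' implementation: `rank_action_proposals`, `toy_dynamics`, and `toy_reward` are hypothetical names, the latent state is a plain list of floats, and the toy functions stand in for learned dynamics and reward networks.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def rank_action_proposals(
    latent: Vector,
    proposals: Sequence[Vector],
    dynamics: Callable[[Vector, Vector], Vector],
    reward: Callable[[Vector], float],
) -> List[Tuple[float, Vector]]:
    """Return (predicted_reward, action) pairs, highest predicted reward first."""
    scored = []
    for action in proposals:
        next_latent = dynamics(latent, action)  # imagined next latent state
        scored.append((reward(next_latent), list(action)))
    # Sort on the reward only, so tied rewards never compare the action lists.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


# Hypothetical stand-ins for the learned models (assumptions for illustration).
def toy_dynamics(latent: Vector, action: Vector) -> Vector:
    # Pretend an action simply shifts the latent state.
    return [z + a for z, a in zip(latent, action)]


def toy_reward(latent: Vector) -> float:
    # Pretend reward grows as the latent approaches a fixed target.
    target = [1.0, 0.0]
    return -math.dist(latent, target)


proposals = [[0.0, 1.0], [1.0, 0.0], [-1.0, 0.0]]
ranked = rank_action_proposals([0.0, 0.0], proposals, toy_dynamics, toy_reward)
best_reward, best_action = ranked[0]
print(best_action)
```

In a setup like the one the abstract describes, the proposals would come from a high-level proposer and the top-ranked action would be executed. The single-step lookahead here is a simplification chosen for brevity.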
Problem

Research questions and friction points this paper is trying to address.

Generalizing beyond demonstration datasets for object localization
Generating physically grounded actions in open-vocabulary tasks
Achieving scalable data generation without expert demonstrations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Gaussian Splatting-based real-to-sim-to-real pipeline
Dense rewards from open-vocabulary object detectors
Latent world model for dynamics and rewards prediction
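One plausible way to turn an open-vocabulary detector's output into a dense reward is to use the highest confidence among detections that match the queried object, and zero when nothing matches. The sketch below illustrates that idea only. The `(label, confidence)` detection format and the `dense_reward` function are assumptions made for illustration, and the summary does not specify how the rewards are actually computed.

```python
from typing import Iterable, Tuple

Detection = Tuple[str, float]  # (label, confidence in [0, 1]); assumed format


def dense_reward(detections: Iterable[Detection], query: str) -> float:
    """Highest confidence among detections matching the query, or 0.0 if none."""
    matching = [score for label, score in detections if label == query]
    return max(matching, default=0.0)


# Hypothetical detector output for a single camera frame.
frame_detections = [("mug", 0.3), ("mug", 0.8), ("banana", 0.9)]
mug_reward = dense_reward(frame_detections, "mug")
missing_reward = dense_reward(frame_detections, "scissors")
print(mug_reward, missing_reward)
```

Because every frame yields a score, a signal of this kind gives feedback at each step rather than only when the object is finally found. Such per-frame scores could then serve as regression targets for a reward predictor.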