🤖 AI Summary
Large language models possess rich semantic knowledge but often lack a procedural understanding of physical laws, which leads to infeasible or hallucinated plans. To address this limitation, this work proposes the WorldMind framework, which autonomously constructs a symbolic world knowledge base from environmental feedback and jointly learns from process experience (guided by prediction errors) and goal experience (derived from successful trajectories) to dynamically model physical rules. Moving beyond conventional static fine-tuning paradigms, WorldMind achieves transferable, cross-model, and cross-environment physical alignment. Extensive evaluations on the EB-ALFRED and EB-Habitat benchmarks demonstrate significant gains over existing methods, underscoring its superior generalization and reliability in task execution.
📝 Abstract
Current Large Language Models (LLMs) exhibit a critical modal disconnect: they possess vast semantic knowledge but lack the procedural grounding to respect the immutable laws of the physical world. Consequently, while these agents implicitly function as world models, their simulations often suffer from physical hallucinations: plans that are logically sound but physically unexecutable. Existing alignment strategies predominantly rely on resource-intensive training or fine-tuning, attempting to compress dynamic environmental rules into static model parameters. Such parametric encapsulation is inherently rigid and struggles to adapt to the open-ended variability of physical dynamics without continuous, costly retraining. To bridge this gap, we introduce WorldMind, a framework that autonomously constructs a symbolic World Knowledge Repository by synthesizing environmental feedback. Specifically, it unifies Process Experience, which enforces physical feasibility via prediction errors, with Goal Experience, which guides task optimality through successful trajectories. Experiments on EB-ALFRED and EB-Habitat demonstrate that WorldMind achieves superior performance over baselines with remarkable cross-model and cross-environment transferability.
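To make the two experience channels concrete, the following is a minimal Python sketch of how a symbolic knowledge repository might accumulate rules from prediction errors and successful trajectories. All class and method names here are illustrative assumptions, not the paper's actual API, and the rule encoding is deliberately simplified.

```python
from dataclasses import dataclass, field

@dataclass
class WorldKnowledgeRepository:
    """Hypothetical store of symbolic physical rules (names are assumptions)."""
    rules: set = field(default_factory=set)

    def add_process_experience(self, action: str, predicted: str, observed: str):
        # Process Experience: a mismatch between the model's prediction and
        # the environment's feedback exposes a physical constraint, which is
        # recorded as a symbolic outcome rule.
        if predicted != observed:
            self.rules.add(f"after({action}) -> {observed}")

    def add_goal_experience(self, trajectory: list[str]):
        # Goal Experience: a successful trajectory is distilled into ordering
        # rules over consecutive actions, capturing task-level structure.
        for prev, nxt in zip(trajectory, trajectory[1:]):
            self.rules.add(f"{prev} precedes {nxt}")

repo = WorldKnowledgeRepository()
# A failed grasp contradicts the predicted outcome, yielding a physics rule.
repo.add_process_experience(
    action="pick_up(cup)",
    predicted="holding(cup)",
    observed="grasp_failed(cup)",
)
# A successful trajectory contributes goal-oriented ordering rules.
repo.add_goal_experience(["open(fridge)", "take(milk)", "close(fridge)"])
print(sorted(repo.rules))
```

In this toy form, the repository stays outside the model's parameters, so the accumulated rules can be injected into any LLM's prompt, which is one way to read the cross-model transferability claim.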