Text2World: Benchmarking Large Language Models for Symbolic World Model Generation

📅 2025-02-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing research on symbolic world model generation suffers from high evaluation randomness, reliance on indirect metrics, and limited domain coverage. Method: We introduce Text2World, the first benchmark for text-to-PDDL world modeling, comprising hundreds of diverse planning domains. We propose an execution-driven, multidimensional evaluation framework that integrates quantifiable metrics: executability verification, semantic correctness, and planning effectiveness. This establishes the first large-scale, automated, multi-criteria PDDL-based evaluation paradigm, enabling systematic analysis of reasoning-oriented, RL-trained LLMs in world modeling and validation of enhancement strategies such as test-time scaling and agent collaboration. Contribution/Results: Experiments reveal significant limitations of state-of-the-art LLMs in symbolic world modeling. The benchmark is publicly released, providing the community with a reproducible, extensible infrastructure for rigorous world model evaluation.
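The summary does not spell out how executability verification works, but the idea of rejecting malformed generated PDDL before attempting to plan can be sketched. Below is a minimal, illustrative syntactic sanity check (the function name, the chosen checks, and the example domain are assumptions for illustration, not the paper's implementation, which runs full execution-based validation):

```python
def quick_pddl_check(domain_text: str) -> list[str]:
    """Lightweight sanity checks on a generated PDDL domain string.

    This covers only a syntactic slice of executability:
    balanced parentheses and the presence of core sections.
    """
    issues = []
    depth = 0
    for ch in domain_text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                issues.append("unbalanced parentheses: extra ')'")
                break
    if depth > 0:
        issues.append(f"unbalanced parentheses: {depth} unclosed '('")
    # Core sections a well-formed domain file is expected to contain.
    for section in ("define", "(:predicates", "(:action"):
        if section not in domain_text:
            issues.append(f"missing section: {section}")
    return issues

# Hypothetical LLM-generated domain used only to exercise the checker.
example_domain = """
(define (domain gripper)
  (:predicates (at ?b ?r) (holding ?b))
  (:action pick
    :parameters (?b ?r)
    :precondition (at ?b ?r)
    :effect (and (holding ?b) (not (at ?b ?r)))))
"""

print(quick_pddl_check(example_domain))  # [] means the basic checks pass
```

A real executability check would go further, e.g. handing the domain and a problem file to an off-the-shelf planner or validator and treating parse or grounding failures as errors.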

📝 Abstract
Recently, there has been growing interest in leveraging large language models (LLMs) to generate symbolic world models from textual descriptions. Although LLMs have been extensively explored in the context of world modeling, prior studies encountered several challenges, including evaluation randomness, dependence on indirect metrics, and a limited domain scope. To address these limitations, we introduce a novel benchmark, Text2World, based on the Planning Domain Definition Language (PDDL), featuring hundreds of diverse domains and employing multi-criteria, execution-based metrics for a more robust evaluation. We benchmark current LLMs using Text2World and find that reasoning models trained with large-scale reinforcement learning outperform others. However, even the best-performing model still demonstrates limited capabilities in world modeling. Building on these insights, we examine several promising strategies to enhance the world modeling capabilities of LLMs, including test-time scaling, agent training, and more. We hope that Text2World can serve as a crucial resource, laying the groundwork for future research in leveraging LLMs as world models. The project page is available at https://text-to-world.github.io/.
Problem

Research questions and friction points this paper is trying to address.

Addressing limitations in LLM-based world modeling
Introducing Text2World benchmark for robust evaluation
Enhancing LLM capabilities in symbolic world generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLMs for symbolic models
PDDL-based benchmark
Reinforcement learning enhancement