Text2World: Benchmarking Large Language Models for Symbolic World Model Generation

📅 2025-02-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing research on symbolic world model generation suffers from high evaluation randomness, reliance on indirect metrics, and limited domain coverage. Method: We introduce Text2World, the first benchmark for text-to-PDDL world modeling, comprising hundreds of diverse planning domains. We propose an execution-driven, multidimensional evaluation framework that integrates quantifiable metrics: executability verification, semantic correctness, and planning effectiveness. This establishes the first large-scale, automated, multi-criteria PDDL-based evaluation paradigm, enabling systematic analysis of reasoning-oriented, RL-trained LLMs in world modeling and validation of enhancement strategies such as test-time scaling and agent collaboration. Contribution/Results: Experiments reveal significant limitations of state-of-the-art LLMs in symbolic world modeling. The benchmark is publicly released, providing the community with a reproducible, extensible infrastructure for rigorous world model evaluation.
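The summary does not spell out how executability verification works, but the idea of rejecting malformed generated PDDL before attempting to plan can be sketched. Below is a minimal, illustrative syntactic sanity check (the function name, the chosen checks, and the example domain are assumptions for illustration, not the paper's implementation, which runs full execution-based validation):

```python
def quick_pddl_check(domain_text: str) -> list[str]:
    """Lightweight sanity checks on a generated PDDL domain string.

    This covers only a syntactic slice of executability:
    balanced parentheses and the presence of core sections.
    """
    issues = []
    depth = 0
    for ch in domain_text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                issues.append("unbalanced parentheses: extra ')'")
                break
    if depth > 0:
        issues.append(f"unbalanced parentheses: {depth} unclosed '('")
    # Core sections a well-formed domain file is expected to contain.
    for section in ("define", "(:predicates", "(:action"):
        if section not in domain_text:
            issues.append(f"missing section: {section}")
    return issues

# Hypothetical LLM-generated domain used only to exercise the checker.
example_domain = """
(define (domain gripper)
  (:predicates (at ?b ?r) (holding ?b))
  (:action pick
    :parameters (?b ?r)
    :precondition (at ?b ?r)
    :effect (and (holding ?b) (not (at ?b ?r)))))
"""

print(quick_pddl_check(example_domain))  # [] means the basic checks pass
```

A real executability check would go further, e.g. handing the domain and a problem file to an off-the-shelf planner or validator and treating parse or grounding failures as errors.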

📝 Abstract
Recently, there has been growing interest in leveraging large language models (LLMs) to generate symbolic world models from textual descriptions. Although LLMs have been extensively explored in the context of world modeling, prior studies encountered several challenges, including evaluation randomness, dependence on indirect metrics, and a limited domain scope. To address these limitations, we introduce a novel benchmark, Text2World, based on the Planning Domain Definition Language (PDDL), featuring hundreds of diverse domains and employing multi-criteria, execution-based metrics for a more robust evaluation. We benchmark current LLMs using Text2World and find that reasoning models trained with large-scale reinforcement learning outperform others. However, even the best-performing model still demonstrates limited capabilities in world modeling. Building on these insights, we examine several promising strategies to enhance the world modeling capabilities of LLMs, including test-time scaling, agent training, and more. We hope that Text2World can serve as a crucial resource, laying the groundwork for future research in leveraging LLMs as world models. The project page is available at https://text-to-world.github.io/.
Problem

Research questions and friction points this paper is trying to address.

Addressing limitations in LLM-based world modeling
Introducing Text2World benchmark for robust evaluation
Enhancing LLM capabilities in symbolic world generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLMs for symbolic models
PDDL-based benchmark
Reinforcement learning enhancement