LexiCon: a Benchmark for Planning under Temporal Constraints in Natural Language

📅 2025-10-07

🤖 AI Summary

This paper addresses a gap in the evaluation of large language models (LLMs): their planning ability has rarely been tested on natural language tasks with temporal constraints, which matter in safety-critical settings. The authors introduce LexiCon, a natural language benchmark designed for constrained planning. LexiCon takes existing planning environments, imposes temporal constraints on their states, and translates the resulting problems into natural language. The framework is extensible: new (unconstrained) environment generators can be added, temporal constraints are constructed for them automatically, and problem difficulty can be raised as LLMs improve. The key contributions are: (1) a benchmark for evaluating LLMs on temporally constrained planning; and (2) experiments showing that the performance of state-of-the-art LLMs, including reasoning models such as GPT-5, o3, and R1, deteriorates as the planning tasks become more constrained.

📝 Abstract

Owing to their reasoning capabilities, large language models (LLMs) have been evaluated on planning tasks described in natural language. However, LLMs have largely been tested on planning domains without constraints. In order to deploy them in real-world settings where adherence to constraints, in particular safety constraints, is critical, we need to evaluate their performance on constrained planning tasks. We introduce LexiCon -- a natural language-based (Lexi) constrained (Con) planning benchmark, consisting of a suite of environments, that can be used to evaluate the planning capabilities of LLMs in a principled fashion. The core idea behind LexiCon is to take existing planning environments and impose temporal constraints on the states. These constrained problems are then translated into natural language and given to an LLM to solve. A key feature of LexiCon is its extensibility. That is, the set of supported environments can be extended with new (unconstrained) environment generators, for which temporal constraints are constructed automatically. This renders LexiCon future-proof: the hardness of the generated planning problems can be increased as the planning capabilities of LLMs improve. Our experiments reveal that the performance of state-of-the-art LLMs, including reasoning models like GPT-5, o3, and R1, deteriorates as the degree of constrainedness of the planning tasks increases.
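
To make the idea of "temporal constraints on the states" concrete, the following is a minimal Python sketch of how a candidate plan could be checked against such constraints. It is not LexiCon's implementation: the constraint operators (`always_not`, `sometime`, `sometime_before`, modeled loosely on PDDL3-style trajectory constraints), the STRIPS-like action format, and the toy key-and-rooms domain are all assumptions made for illustration.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

State = FrozenSet[str]  # a state is the set of facts that currently hold
# Hypothetical action format: (preconditions, add effects, delete effects)
Action = Tuple[FrozenSet[str], FrozenSet[str], FrozenSet[str]]
Constraint = Callable[[List[State]], bool]


def simulate(init: State, plan: List[str], actions: Dict[str, Action]) -> List[State]:
    """Return the trajectory of states visited by the plan, starting at init."""
    trajectory = [init]
    for name in plan:
        pre, add, delete = actions[name]
        state = trajectory[-1]
        if not pre <= state:
            raise ValueError(f"action {name!r} is not applicable")
        trajectory.append((state - delete) | add)
    return trajectory


def always_not(fact: str) -> Constraint:
    """The fact must not hold in any visited state."""
    return lambda traj: all(fact not in s for s in traj)


def sometime(fact: str) -> Constraint:
    """The fact must hold in at least one visited state."""
    return lambda traj: any(fact in s for s in traj)


def sometime_before(later: str, earlier: str) -> Constraint:
    """Whenever `later` holds, `earlier` must have held in a strictly earlier state."""
    def check(traj: List[State]) -> bool:
        seen_earlier = False
        for s in traj:
            if later in s and not seen_earlier:
                return False
            if earlier in s:
                seen_earlier = True
        return True
    return check


def plan_is_valid(init, goal, plan, actions, constraints) -> bool:
    """A plan is valid if it is executable, reaches the goal, and meets every constraint."""
    try:
        traj = simulate(init, plan, actions)
    except (KeyError, ValueError):
        return False
    return goal <= traj[-1] and all(c(traj) for c in constraints)


# Toy domain (assumed): a robot in room A must reach room C,
# but may only enter C after having picked up the key in room B.
f = frozenset
ACTIONS: Dict[str, Action] = {
    "move_a_b": (f({"at_a"}), f({"at_b"}), f({"at_a"})),
    "move_b_c": (f({"at_b"}), f({"at_c"}), f({"at_b"})),
    "move_a_c": (f({"at_a"}), f({"at_c"}), f({"at_a"})),
    "pick_key_b": (f({"at_b"}), f({"has_key"}), f()),
}
INIT, GOAL = f({"at_a"}), f({"at_c"})
CONSTRAINTS = [sometime_before("at_c", "has_key")]
```

In this toy domain the shortest unconstrained plan, `["move_a_c"]`, reaches the goal but violates the constraint, while `["move_a_b", "pick_key_b", "move_b_c"]` satisfies it. This illustrates why adding constraints can make an otherwise easy problem harder.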

Problem

Research questions and friction points this paper is trying to address.

Evaluates LLMs on natural language planning with temporal constraints

Benchmarks constrained planning for real-world safety requirements

Tests LLM performance degradation with increasing constraint complexity

Innovation

Methods, ideas, or system contributions that make the work stand out.

Imposes temporal constraints on existing planning environments

Translates constrained problems into natural language for LLMs

Automatically constructs temporal constraints for new environment generators
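
The last two points can be pictured with a small Python sketch that samples constraints over an environment's facts and renders them as English text. The source does not describe how LexiCon actually constructs or phrases its constraints, so everything here is hypothetical: the `Constraint` class, the three constraint kinds, the English templates, and the use of the constraint count `k` as a difficulty knob.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Constraint:
    kind: str               # "always_not", "sometime", or "sometime_before"
    facts: Tuple[str, ...]  # human-readable fact descriptions


# Hypothetical English templates, one per constraint kind.
TEMPLATES = {
    "always_not": 'The following must never hold: "{0}".',
    "sometime": 'The following must hold in at least one state: "{0}".',
    "sometime_before": 'If "{0}" ever holds, then "{1}" must have held in an earlier state.',
}
# Number of facts each constraint kind refers to.
ARITY = {"always_not": 1, "sometime": 1, "sometime_before": 2}


def sample_constraints(facts: Sequence[str], k: int, seed: int = 0) -> List[Constraint]:
    """Sample k constraints over the given facts (assumes at least two facts).

    Larger k yields more constrained problems. Solvability is NOT checked here.
    """
    rng = random.Random(seed)
    constraints = []
    for _ in range(k):
        kind = rng.choice(sorted(ARITY))
        chosen = rng.sample(list(facts), ARITY[kind])
        constraints.append(Constraint(kind, tuple(chosen)))
    return constraints


def render(constraints: Sequence[Constraint]) -> str:
    """Translate constraints into a numbered natural language list for a prompt."""
    lines = ["Your plan must also satisfy these constraints:"]
    for i, c in enumerate(constraints, 1):
        lines.append(f"{i}. {TEMPLATES[c.kind].format(*c.facts)}")
    return "\n".join(lines)


FACTS = ["the robot is in room C", "the robot holds the key", "the door is open"]
```

Calling `render(sample_constraints(FACTS, 3))` produces a prompt fragment that can be appended to an unconstrained task description. A real generator would also need to verify, for example with a classical planner, that the constrained problem remains solvable; this sketch omits that step.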
