🤖 AI Summary
Existing benchmarks for large language models (LLMs) do not jointly evaluate temporal reasoning and planning with sufficient constraint diversity, coverage of both explicit and implicit temporal constraints, and tight coupling among constraints.
Method: We introduce TCP, the first dedicated benchmark for jointly assessing temporal reasoning and planning, built from multi-turn collaborative dialogues that contain both explicit and implicit temporal constraints. Models must generate an optimal schedule satisfying all constraints. The benchmark embeds rich temporal constraint structure in naturalistic dialogue contexts, and its data are produced by a high-quality generation pipeline combining prototype-driven design, LLM-based augmentation, and human verification.
Contribution/Results: Experiments reveal systematic failures of state-of-the-art LLMs on TCP tasks. The benchmark and evaluation protocol are fully open-sourced, establishing a reproducible, extensible research infrastructure for advancing temporal reasoning and planning in LLMs.
📝 Abstract
Temporal reasoning and planning are essential capabilities for large language models (LLMs), yet most existing benchmarks evaluate them in isolation and under limited forms of complexity. To address this gap, we introduce the Temporal Constraint-based Planning (TCP) benchmark, which jointly assesses both capabilities. Each instance in TCP features a naturalistic dialogue around a collaborative project, where diverse and interdependent temporal constraints are explicitly or implicitly expressed, and models must infer an optimal schedule that satisfies all constraints. To construct TCP, we first generate abstract problem prototypes, pair them with realistic scenarios from various domains, and enrich them into dialogues using an LLM. A human quality check on a sampled subset confirms the reliability of our benchmark. We evaluate state-of-the-art LLMs and find that even the strongest models struggle with TCP, highlighting its difficulty and revealing limitations in LLMs' temporal constraint-based planning abilities. We analyze underlying failure cases, open-source our benchmark, and hope our findings can inspire future research.
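To make the task concrete, here is a minimal sketch of what "a schedule satisfying all constraints" could mean. The instance encoding below (task-to-date assignments, `"on"` and `"after"` constraint records) is hypothetical and not TCP's actual format; it only illustrates checking a proposed plan against one explicit constraint (a fixed date) and one implicit, relative constraint (a minimum gap between tasks).

```python
from datetime import date

def check_schedule(schedule, constraints):
    """Return the list of violated constraints (empty if the plan is valid).

    schedule: dict mapping task name -> date
    constraints: list of dicts; "on" pins a task to a date (explicit),
    "after" requires a minimum gap after a reference task (implicit/relative).
    """
    violations = []
    for c in constraints:
        if c["kind"] == "on":
            # Explicit constraint: the task must occur on the given day.
            if schedule[c["task"]] != c["day"]:
                violations.append(c)
        elif c["kind"] == "after":
            # Relative constraint: task must start >= min_days after ref.
            gap = (schedule[c["task"]] - schedule[c["ref"]]).days
            if gap < c["min_days"]:
                violations.append(c)
    return violations

schedule = {
    "coding": date(2025, 5, 1),
    "review": date(2025, 5, 3),
    "testing": date(2025, 5, 4),
}
constraints = [
    {"kind": "on", "task": "review", "day": date(2025, 5, 3)},
    {"kind": "after", "task": "testing", "ref": "coding", "min_days": 2},
]
print(check_schedule(schedule, constraints))  # → [] (all constraints satisfied)
```

The benchmark's difficulty comes from the fact that constraints like these are not given as structured records but must be extracted from multi-turn dialogue, and that optimality, not just feasibility, is required.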