RegexPSPACE: A Benchmark for Evaluating LLM Reasoning on PSPACE-complete Regex Problems

📅 2025-10-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the PSPACE-complete problems of regular expression equivalence checking and minimization by introducing the first LLM benchmark for space-bounded reasoning. Methodologically, it integrates formal language theory with automata reduction techniques, generates a million-scale, high-quality labeled dataset via double-exponential enumeration and sound filtering, and proposes a structured evaluation protocol. Systematic evaluation across six LLMs and five large reasoning models (LRMs) reveals pervasive redundant inference and cyclic behavior in double-exponential search spaces, demonstrating that current models cannot overcome the intrinsic computational barriers of PSPACE-complete problems. This is the first study to incorporate PSPACE-completeness into the LLM capability evaluation framework, uncovering systematic failure modes in space-constrained reasoning. It establishes a novel benchmark and theoretical foundation for characterizing the computational limits of foundation models.

📝 Abstract
Large language models (LLMs) show strong performance across natural language processing (NLP), mathematical reasoning, and programming, and recent large reasoning models (LRMs) further emphasize explicit reasoning. Yet their computational limits, particularly spatial complexity constrained by finite context windows, remain poorly understood. While recent works often focus on problems within the NP complexity class, we push the boundary by introducing a novel benchmark grounded in two PSPACE-complete regular expression (regex) problems: equivalence decision (RegexEQ) and minimization (RegexMin). PSPACE-complete problems serve as a more rigorous standard for assessing computational capacity, as their solutions require massive search space exploration. We perform a double-exponential space exploration to construct a labeled dataset of over a million regex instances with a sound filtering process to build the benchmark. We conduct extensive evaluations on 6 LLMs and 5 LRMs of varying scales, revealing common failure patterns such as verbosity and repetition. With its well-defined structure and quantitative evaluation metrics, this work presents the first empirical investigation into the spatial computational limitations of LLMs and LRMs, offering a new framework for evaluating their advanced reasoning capabilities. Our code is available at https://github.com/hyundong98/RegexPSPACE .
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' computational limits on PSPACE-complete regex problems
Assessing spatial complexity constraints via regex equivalence and minimization
Investigating failure patterns in large reasoning models' search space exploration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark evaluates LLMs on PSPACE-complete regex problems
Dataset built via double-exponential space exploration
Framework assesses spatial computational limitations of models
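To make the RegexEQ task concrete: deciding whether two regexes denote the same language is PSPACE-complete in general, but a counterexample search over bounded-length strings gives an inexpensive (incomplete) approximation. The sketch below is illustrative only and is not the paper's method; the function name, alphabet, and length bound are assumptions for the example, and Python's `re` syntax is used in place of formal regular expressions.

```python
import re
from itertools import product

def bounded_equiv(r1, r2, alphabet="ab", max_len=6):
    """Approximate RegexEQ: compare the languages of r1 and r2 on all
    strings over `alphabet` up to length `max_len`.

    A mismatch yields a genuine counterexample (the regexes are not
    equivalent); agreement within the bound does NOT prove equivalence,
    which in general requires automata constructions and is
    PSPACE-complete.
    """
    p1, p2 = re.compile(r1), re.compile(r2)
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if bool(p1.fullmatch(s)) != bool(p2.fullmatch(s)):
                return False, s  # counterexample found
    return True, None  # no counterexample within the bound
```

For example, `bounded_equiv("(a|b)*", "(a*b*)*")` finds no counterexample (both match every string over {a, b}), while `bounded_equiv("a*", "a+")` immediately returns the empty string as a counterexample. The gap between this cheap bounded check and a sound decision procedure is exactly the double-exponential search space the benchmark probes.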