Agentified Assessment of Logical Reasoning Agents

📅 2026-03-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of reproducibility, auditability, and robustness to execution failures in current evaluations of logical reasoning agents. To overcome these limitations, the authors propose an “agentified evaluation” paradigm that models the evaluation process itself as agent behavior, introducing a standardized framework capable of task dispatching, execution budget control, output parsing, and structured failure logging. Built upon a unified agent interface, the framework integrates Z3Py-based automated formalization, SMT solving, and failure classification to enable automated, structured, and auditable assessment. Evaluated on a cleaned FOLIO validation set, the automated formalization agent achieves an accuracy of 86.70%, substantially outperforming a chain-of-thought baseline at 73.89%.
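The assessor loop described above (task dispatching, budget control, output parsing, structured failure logging) can be sketched in a few lines. This is a minimal illustration of the idea, not the paper's actual framework; the `assess` function, `Failure` taxonomy, and record schema are all assumptions made for this example.

```python
# Hedged sketch of an "agentified" assessor loop: dispatch tasks to the
# agent under test, enforce a wall-clock budget, parse outputs, and record
# structured failure types. All names here are illustrative.
import json
import time
from enum import Enum

class Failure(Enum):
    TIMEOUT = "timeout"
    PARSE_ERROR = "parse_error"
    EXECUTION_ERROR = "execution_error"

def assess(agent, tasks, budget_s=30.0):
    """Run each task through `agent` and return structured result records."""
    records = []
    for task in tasks:
        rec = {"task_id": task["id"], "prediction": None, "failure": None}
        start = time.monotonic()
        try:
            raw = agent(task["input"])                # dispatch to agent under test
            if time.monotonic() - start > budget_s:   # enforce execution budget
                rec["failure"] = Failure.TIMEOUT.value
            else:
                rec["prediction"] = json.loads(raw)["label"]  # parse structured output
        except json.JSONDecodeError:
            rec["failure"] = Failure.PARSE_ERROR.value
        except Exception:
            rec["failure"] = Failure.EXECUTION_ERROR.value
        records.append(rec)                           # structured, auditable log entry
    return records
```

Because the agent under test only has to expose a single callable interface returning JSON, any agent (auto-formalization, chain-of-thought, or otherwise) can be assessed by the same loop.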

📝 Abstract
We present a framework for evaluating and benchmarking logical reasoning agents when assessment itself must be reproducible, auditable, and robust to execution failures. Building on agentified assessment, we use an assessor agent to issue tasks, enforce execution budgets, parse outputs, and record structured failure types, while the agent under test only needs to expose a standardized agent-to-agent interface. As a case study, we benchmark an auto-formalization agent for first-order logic (FOL) reasoning on a solver-verified and repaired split of FOLIO. The agent translates natural language premises and conclusions into executable Z3Py programs and employs satisfiability modulo theories (SMT) solving to determine logical entailment. On the cleaned FOLIO validation set, the auto-formalization agent achieves 86.70% accuracy under the assessor protocol, outperforming a chain-of-thought baseline (73.89%).
Problem

Research questions and friction points this paper is trying to address.

logical reasoning agents
agentified assessment
benchmarking
reproducibility
robust evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

agentified assessment
logical reasoning agents
auto-formalization
SMT solving
structured failure analysis
Zhiyu Ni
University of California, Berkeley
Yifeng Xiao
University of California, Berkeley
Zheng Liang
University of California, Berkeley
Design Automation · Computer Architecture