FormalJudge: A Neuro-Symbolic Paradigm for Agentic Oversight

📅 2026-02-11
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the failure of supervision in high-stakes scenarios involving large language model (LLM) agents, which stems from the homogeneity inherent in LLM-as-a-Judge paradigms: a probabilistic judge inherits the same failure modes as the probabilistic agent it oversees. To overcome this limitation, the authors propose a neuro-symbolic framework that integrates formal verification with LLM-based oversight. The approach employs a bidirectional "formal thinking" architecture: top-down, it decomposes natural-language intentions into atomic, verifiable constraints; bottom-up, it uses Dafny specifications and the Z3 solver to formally verify agent behavior, providing mathematical guarantees rather than probabilistic scores. Experimental results show that the method outperforms LLM-as-a-Judge baselines by an average of 16.6% across three benchmarks. Notably, a 7B-parameter supervisor achieves over 90% accuracy in detecting deception by a 72B-parameter agent, and iterative refinement yields near-linear improvements in safety.

πŸ“ Abstract
As LLM-based agents increasingly operate in high-stakes domains with real-world consequences, ensuring their behavioral safety becomes paramount. The dominant oversight paradigm, LLM-as-a-Judge, faces a fundamental dilemma: how can probabilistic systems reliably supervise other probabilistic systems without inheriting their failure modes? We argue that formal verification offers a principled escape from this dilemma, yet its adoption has been hindered by a critical bottleneck: the translation from natural-language requirements to formal specifications. This paper bridges this gap by proposing FormalJudge, a neuro-symbolic framework that employs a bidirectional Formal-of-Thought architecture: LLMs serve as specification compilers that top-down decompose high-level human intent into atomic, verifiable constraints, then bottom-up prove compliance using Dafny specifications and Z3 satisfiability-modulo-theories (SMT) solving, which produces mathematical guarantees rather than probabilistic scores. We validate FormalJudge across three benchmarks spanning behavioral safety, multi-domain constraint adherence, and agentic upward deception detection. Experiments on 7 agent models demonstrate that FormalJudge achieves an average improvement of 16.6% over LLM-as-a-Judge baselines, enables weak-to-strong generalization in which a 7B judge achieves over 90% accuracy detecting deception from 72B agents, and provides near-linear safety improvement through iterative refinement.
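The bidirectional loop described in the abstract can be sketched in miniature. The following is a hypothetical illustration, not the paper's implementation: FormalJudge compiles intent into Dafny specifications checked by Z3, whereas here plain Python predicates stand in for both the specification language and the solver, and all names (`Constraint`, `decompose_intent`, `verify_trace`, the banking-style constraints) are invented for the example.

```python
# Hedged sketch of the bidirectional "Formal-of-Thought" idea:
# top-down, a high-level intent is decomposed into atomic, checkable
# constraints; bottom-up, an agent's action trace is verified against
# each constraint, yielding a discrete verdict (which constraint failed,
# and where) instead of a probabilistic judge score.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass(frozen=True)
class Constraint:
    name: str
    check: Callable[[dict], bool]  # does the constraint hold for one action?

def decompose_intent() -> list[Constraint]:
    """Top-down step: an intent like 'never move large sums without
    approval' becomes atomic, independently verifiable constraints."""
    return [
        Constraint("amount_bounded", lambda a: a.get("amount", 0) <= 1000),
        Constraint("approved_if_nonzero",
                   lambda a: a.get("approved", False) or a.get("amount", 0) == 0),
    ]

def verify_trace(trace: list[dict],
                 constraints: list[Constraint]) -> Optional[tuple[int, str]]:
    """Bottom-up step: return the first (step index, constraint name)
    violation found in the trace, or None if every action complies."""
    for i, action in enumerate(trace):
        for c in constraints:
            if not c.check(action):
                return (i, c.name)
    return None

trace = [
    {"amount": 200, "approved": True},
    {"amount": 5000, "approved": False},  # violates both atomic constraints
]
print(verify_trace(trace, decompose_intent()))  # (1, 'amount_bounded')
```

Because the verdict names the violated constraint and the offending step, it can drive the iterative-refinement loop the abstract mentions: the agent revises the flagged action and the trace is re-verified.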
Problem

Research questions and friction points this paper is trying to address.

behavioral safety
LLM-as-a-Judge
formal verification
neuro-symbolic
specification translation
Innovation

Methods, ideas, or system contributions that make the work stand out.

neuro-symbolic
formal verification
LLM-as-a-Judge
Dafny
SMT solving