🤖 AI Summary
Existing evaluation benchmarks struggle to assess whether large language models can autonomously comply with implicit regulatory requirements in high-stakes scenarios. This work proposes the first compliance evaluation framework that integrates regulatory semantics with program generation: LogiSafetyGen translates unstructured regulations into Linear Temporal Logic (LTL) oracles and uses logic-guided fuzz testing to synthesize program trajectories that jointly satisfy functional objectives and safety constraints. The authors further construct LogiSafetyBench, a benchmark of 240 human-validated tasks. Evaluation across 13 state-of-the-art large language models reveals that while increasing model scale improves functional correctness, it also leads to a significant rise in compliance failures, highlighting the limitations of current models in safety-critical applications.
📝 Abstract
The integration of large language models (LLMs) into autonomous agents has enabled complex tool use, yet in high-stakes domains, these systems must strictly adhere to regulatory standards beyond simple functional correctness. However, existing benchmarks often overlook implicit regulatory compliance, thus failing to evaluate whether LLMs can autonomously enforce mandatory safety constraints. To fill this gap, we introduce LogiSafetyGen, a framework that converts unstructured regulations into Linear Temporal Logic oracles and employs logic-guided fuzzing to synthesize valid, safety-critical traces. Building on this framework, we construct LogiSafetyBench, a benchmark comprising 240 human-verified tasks that require LLMs to generate Python programs that satisfy both functional objectives and latent compliance rules. Evaluations of 13 state-of-the-art (SOTA) LLMs reveal that larger models, despite achieving better functional correctness, frequently prioritize task completion over safety, which results in non-compliant behavior.
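To make the oracle idea concrete, here is a minimal, hypothetical sketch of how an LTL-style safety oracle might judge a program's action trace. The event names (`open_valve`, `close_valve`, `pressure`) and the two-operator evaluator are illustrative assumptions, not the paper's actual implementation; LTL is evaluated over a finite trace, as is typical when checking concrete fuzzed executions.

```python
# Hypothetical finite-trace oracle for two common LTL rule shapes:
#   G p              -> a predicate holds at every step (invariant)
#   G (t -> F r)     -> every trigger is eventually answered (response)
# This is an illustrative sketch, not the authors' code.

def holds_globally(trace, prop):
    """G prop: the atomic predicate must hold at every trace step."""
    return all(prop(step) for step in trace)

def holds_response(trace, trigger, response):
    """G (trigger -> F response): each trigger step must be followed
    (at or after that step) by a step satisfying the response."""
    for i, step in enumerate(trace):
        if trigger(step) and not any(response(s) for s in trace[i:]):
            return False
    return True

# Hypothetical trace emitted by an instrumented agent program.
trace = [
    {"action": "open_valve", "pressure": 3.1},
    {"action": "heat", "pressure": 4.8},
    {"action": "close_valve", "pressure": 4.8},
]

# Two safety rules distilled from a (hypothetical) regulation:
# pressure stays below 5.0, and every open_valve is eventually closed.
safe = (holds_globally(trace, lambda s: s["pressure"] < 5.0)
        and holds_response(trace,
                           lambda s: s["action"] == "open_valve",
                           lambda s: s["action"] == "close_valve"))
print(safe)  # True: both rules hold on this trace
```

An oracle like this can score generated programs independently of their functional tests, which is what allows a benchmark to separate task success from compliance failures.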