🤖 AI Summary
Existing evaluation benchmarks struggle to assess whether large language models can autonomously comply with implicit regulatory requirements in high-stakes scenarios. This work proposes the first compliance evaluation framework that integrates regulatory semantics with program generation: LogiSafetyGen translates unstructured regulations into Linear Temporal Logic (LTL) oracles and uses logic-guided fuzz testing to synthesize program trajectories that jointly satisfy functional objectives and safety constraints. The authors further construct LogiSafetyBench, a benchmark of 240 human-validated tasks. Evaluation across 13 state-of-the-art large language models reveals that while increasing model scale improves functional correctness, it also leads to a significant rise in compliance failures, highlighting the limitations of current models in safety-critical applications.
📝 Abstract
The integration of large language models (LLMs) into autonomous agents has enabled complex tool use, yet in high-stakes domains, these systems must strictly adhere to regulatory standards beyond simple functional correctness. However, existing benchmarks often overlook implicit regulatory compliance, thus failing to evaluate whether LLMs can autonomously enforce mandatory safety constraints. To fill this gap, we introduce LogiSafetyGen, a framework that converts unstructured regulations into Linear Temporal Logic oracles and employs logic-guided fuzzing to synthesize valid, safety-critical traces. Building on this framework, we construct LogiSafetyBench, a benchmark comprising 240 human-verified tasks that require LLMs to generate Python programs that satisfy both functional objectives and latent compliance rules. Evaluations of 13 state-of-the-art (SOTA) LLMs reveal that larger models, despite achieving better functional correctness, frequently prioritize task completion over safety, which results in non-compliant behavior.
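To make the oracle idea concrete, here is a minimal, hypothetical sketch of how an LTL-style safety oracle might judge a program's action trace. The event names (`open_valve`, `close_valve`, `pressure`) and the two-operator evaluator are illustrative assumptions, not the paper's actual implementation; LTL is evaluated over a finite trace, as is typical when checking concrete fuzzed executions.

```python
# Hypothetical finite-trace oracle for two common LTL rule shapes:
#   G p              -> a predicate holds at every step (invariant)
#   G (t -> F r)     -> every trigger is eventually answered (response)
# This is an illustrative sketch, not the authors' code.

def holds_globally(trace, prop):
    """G prop: the atomic predicate must hold at every trace step."""
    return all(prop(step) for step in trace)

def holds_response(trace, trigger, response):
    """G (trigger -> F response): each trigger step must be followed
    (at or after that step) by a step satisfying the response."""
    for i, step in enumerate(trace):
        if trigger(step) and not any(response(s) for s in trace[i:]):
            return False
    return True

# Hypothetical trace emitted by an instrumented agent program.
trace = [
    {"action": "open_valve", "pressure": 3.1},
    {"action": "heat", "pressure": 4.8},
    {"action": "close_valve", "pressure": 4.8},
]

# Two safety rules distilled from a (hypothetical) regulation:
# pressure stays below 5.0, and every open_valve is eventually closed.
safe = (holds_globally(trace, lambda s: s["pressure"] < 5.0)
        and holds_response(trace,
                           lambda s: s["action"] == "open_valve",
                           lambda s: s["action"] == "close_valve"))
print(safe)  # True: both rules hold on this trace
```

An oracle like this can score generated programs independently of their functional tests, which is what allows a benchmark to separate task success from compliance failures.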