ContractBench: Can LLM Agents Preserve Observation Contracts?

📅 2026-05-17

🤖 AI Summary

This work addresses a failure mode of tool-using large language model (LLM) agents: when interacting with external APIs, they often fail to preserve the temporal validity and byte-level integrity of observation contracts such as presigned URLs and session tokens. The paper introduces the concept of “observation contract compliance” and presents ContractBench, a benchmark of 33 dual-axis tasks that probes two orthogonal failure modes: validity failures (using an artifact after it expires) and integrity failures (corrupting an artifact’s bytes between observation and action). Evaluation is deterministic, using a virtual clock to control time and SHA-256 hashes to verify byte integrity, and each outcome receives a failure label drawn from real-world API specifications. Across 38 models, none clears 80% compliance, with Claude-Opus-4.6 leading at 77.8%; Qwen 3.5 shows a sharp capability cliff between 4B (0%) and 9B (56.6%); the GPT-5 family scales non-monotonically, which the authors attribute to sycophancy-driven regression from agentic post-training; and using the failure taxonomy as an in-context reward signal yields +7.1 percentage points on 42 paired GPT-5.1 failures.

📝 Abstract

Tool-augmented LLM agents call APIs whose intermediate outputs, such as presigned URLs, session tokens, and OAuth state parameters, are observation contracts: artifacts whose later use is constrained by the external system that produced them. We show that observation contract compliance (preserving the temporal validity and byte-level integrity of these artifacts) is an emergent, regression-prone capability: it is neither guaranteed by general tool-use ability nor consistently improved by larger or newer models. To measure this, we introduce ContractBench, a benchmark of 33 dual-axis tasks that probe two orthogonal failure modes no existing benchmark evaluates: validity failures (using an artifact after expiry) and integrity failures (corrupting an artifact's bytes through the observation-to-action pipeline). Our evaluation is deterministic and programmatic, with a virtual clock controlling time and SHA-256 hashes verifying byte integrity. We assign each outcome a failure label drawn from real-world API specifications. We evaluate 38 models and report four findings: (i) no evaluated model clears 80%, with Claude-Opus-4.6 leading at 77.8%, showing that current frontier models still fail to comply with observation contracts; (ii) a sharp within-family capability cliff in Qwen 3.5 between 4B (0%) and 9B (56.6%), smoothing to 70.7% at 397B-A17B: what emerges across the cliff is mid-trajectory restraint, not tool-call competence; (iii) non-monotonic scaling across the GPT-5 family: agentic post-training can erode compliance through sycophancy-driven regression; (iv) our failure taxonomy works as an actionable in-context reward signal, yielding +7.1 pp on 42 paired GPT-5.1 failures.
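
The two evaluation axes described in the abstract can be illustrated with a minimal sketch. This is not ContractBench's actual harness: the names `VirtualClock`, `Artifact`, `issue`, and `check_use`, the label strings, and the example URL are all hypothetical, and checking validity before integrity is an arbitrary choice made only for this illustration.

```python
import hashlib
from dataclasses import dataclass


class VirtualClock:
    """Deterministic clock: time moves only when the harness advances it."""

    def __init__(self, start: float = 0.0) -> None:
        self.now = start

    def advance(self, seconds: float) -> None:
        self.now += seconds


@dataclass(frozen=True)
class Artifact:
    value: str         # exact string the API returned (e.g. a presigned URL)
    expires_at: float  # virtual time after which use is invalid
    sha256: str        # digest of the original bytes


def issue(value: str, ttl: float, clock: VirtualClock) -> Artifact:
    """Record an artifact at issue time, with its expiry and byte digest."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return Artifact(value, clock.now + ttl, digest)


def check_use(artifact: Artifact, submitted: str, clock: VirtualClock) -> str:
    """Label one use of an artifact (hypothetical label names)."""
    if clock.now >= artifact.expires_at:
        return "validity_failure"   # used after expiry
    submitted_digest = hashlib.sha256(submitted.encode("utf-8")).hexdigest()
    if submitted_digest != artifact.sha256:
        return "integrity_failure"  # bytes changed between observation and action
    return "compliant"


clock = VirtualClock()
url = issue("https://example.com/obj?sig=AbC%2F123", ttl=60, clock=clock)

print(check_use(url, url.value, clock))                      # exact bytes, in time
print(check_use(url, url.value.replace("%2F", "/"), clock))  # agent "normalized" the URL
clock.advance(61)
print(check_use(url, url.value, clock))                      # exact bytes, too late
```

Because the clock advances only on explicit calls and the comparison is a hash equality, the same agent trajectory always receives the same label, which is the property the abstract means by deterministic and programmatic evaluation.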

Problem

Research questions and friction points this paper is trying to address.

observation contracts

tool-augmented LLM agents

validity failures

integrity failures

API compliance

Innovation

Methods, ideas, or system contributions that make the work stand out.

observation contracts

ContractBench

tool-augmented LLM agents