Replayable Financial Agents: A Determinism-Faithfulness Assurance Harness for Tool-Using LLM Agents

📅 2026-01-17

🤖 AI Summary

This work addresses irreproducible decision trajectories in tool-using large language model (LLM) agents, a problem for regulatory audit replay in financial services. It proposes the Determinism-Faithfulness Assurance Harness (DFAH), a framework that measures both trajectory determinism and evidence-conditioned faithfulness, and it reports a positive correlation between the two properties (r = 0.45, p < 0.01, n = 51 at T=0.0). The non-agentic baseline experiments span 74 configurations (12 models, 4 providers), in which 7–20B parameter models attained 100% determinism. DFAH is accompanied by three financial benchmarks (compliance triage, portfolio constraints, DataOps exceptions; 50 cases each) and an open-source stress-test harness. On these benchmarks and under DFAH evaluation settings, Tier 1 models with schema-first architectures reached determinism levels consistent with audit replay requirements.

📝 Abstract

LLM agents struggle with regulatory audit replay: when asked to reproduce a flagged transaction decision with identical inputs, most deployments fail to return consistent results. This paper introduces the Determinism-Faithfulness Assurance Harness (DFAH), a framework for measuring trajectory determinism and evidence-conditioned faithfulness in tool-using agents deployed in financial services. Across 74 configurations (12 models, 4 providers, 8-24 runs each at T=0.0) in non-agentic baseline experiments, 7-20B parameter models achieved 100% determinism, while 120B+ models required 3.7x larger validation samples to achieve equivalent statistical reliability. Agentic tool-use introduces additional variance (see Tables 4-7). Contrary to the assumed reliability-capability trade-off, a positive Pearson correlation emerged (r = 0.45, p<0.01, n = 51 at T=0.0) between determinism and faithfulness; models producing consistent outputs also tended to be more evidence-aligned. Three financial benchmarks are provided (compliance triage, portfolio constraints, DataOps exceptions; 50 cases each) along with an open-source stress-test harness. In these benchmarks and under DFAH evaluation settings, Tier 1 models with schema-first architectures achieved determinism levels consistent with audit replay requirements.
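The two quantities the abstract relies on, a per-configuration determinism score and a Pearson correlation between determinism and faithfulness, can be illustrated with a minimal sketch. The abstract does not state how DFAH defines its determinism metric, so the definition below (the share of repeated runs that match the most common output) is an assumption, and the names `determinism_rate` and `pearson_r` are hypothetical rather than part of the DFAH toolkit.

```python
from collections import Counter
from math import sqrt


def determinism_rate(outputs):
    """Share of repeated runs that agree with the most common output.

    Assumed definition for illustration; DFAH's own metric may differ.
    """
    if not outputs:
        raise ValueError("need at least one run")
    modal_count = Counter(outputs).most_common(1)[0][1]
    return modal_count / len(outputs)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Eight replays of one flagged transaction with identical inputs
# (made-up labels, not data from the paper).
runs = ["ESCALATE"] * 7 + ["CLEAR"]
print(determinism_rate(runs))  # 0.875
```

Under this assumed definition, a configuration scores 100% only when every replay returns the same decision. Passing per-configuration determinism and faithfulness scores to `pearson_r` would yield the kind of coefficient the abstract reports.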

Problem

Research questions and friction points this paper is trying to address.

replayability

regulatory audit

determinism

LLM agents

financial services

Innovation

Methods, ideas, or system contributions that make the work stand out.

Determinism

Faithfulness

LLM Agents