🤖 AI Summary
This work addresses the tendency of large language model agents to call external tools even when none is needed, a bias that lowers overall decision accuracy. The authors propose the Intrinsic Bias Hypothesis (IBH), which reframes over-calling as a quantifiable mechanistic effect. Using sparse autoencoders (SAEs) to extract behavior-aligned features, they construct a signed activation margin to measure the bias and introduce Adaptive Margin-Calibrated Steering (AMCS), a closed-form causal intervention along SAE decoder directions. Experiments on six models from three families show that AMCS mitigates over-calling with only a negligible drop in call accuracy, improving overall accuracy.
📝 Abstract
LLM agents exhibit a consistent tendency to over-call, invoking tools even in situations where none is needed. On the When2Call benchmark, six models from three families show high call accuracy but much lower no-call accuracy, leaving overall accuracy in the 55%-70% range. We trace this to an Intrinsic Bias Hypothesis (IBH): the call/no_call decision mapping carries an activation-independent call offset, so the model favors call even at activation parity. Using Sparse Autoencoders (SAEs), we recover behavior-aligned feature bases for the call/no_call decision, reduce them to a signed activation margin, and estimate the offset directly. In all six models, the decision is neutral only when no_call activation outweighs call activation, consistent with IBH. We then causally test IBH with Adaptive Margin-Calibrated Steering (AMCS), a closed-form counter-bias shift along SAE decoder directions. Cancelling the diagnosed offset mitigates over-calling and improves overall accuracy with a negligible drop in call accuracy. Our work recasts over-calling from an empirical phenomenon into a mechanistic object amenable to causal correction. Code is available at https://github.com/SKURA502/agent-sae/.
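To make the idea of an activation-independent offset concrete, here is a minimal toy sketch in Python. It is not the authors' implementation, and every modeling choice in it is an assumption made for illustration: the margin is taken as summed call activation minus summed no_call activation, the decision is modeled as a one-dimensional logistic function of that margin, the data are synthetic, and the steering step assumes the margin is linear in the hidden state with tied encoder and decoder weights. The names `signed_margin`, `fit_offset`, and `counter_bias_shift` are hypothetical.

```python
import numpy as np


def signed_margin(call_acts, no_call_acts):
    """Assumed margin: total call activation minus total no_call activation."""
    return call_acts.sum(axis=1) - no_call_acts.sum(axis=1)


def fit_offset(margin, called, steps=5000, lr=0.1):
    """Fit P(call) = sigmoid(w * margin + b) by plain gradient descent.

    The intercept b plays the role of an activation-independent offset.
    """
    w, b = 0.0, 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(w * margin + b)))
        err = p - called
        w -= lr * np.mean(err * margin)
        b -= lr * np.mean(err)
    return w, b


def counter_bias_shift(h, g, d, delta):
    """Move h along unit direction d so a margin with gradient g changes by delta.

    Assumes the margin is (locally) linear in h, so the step size is closed-form.
    """
    alpha = delta / float(g @ d)
    return h + alpha * d


# Synthetic data: three "call" and three "no_call" feature activations per example.
rng = np.random.default_rng(0)
n = 4000
call_acts = rng.exponential(1.0, size=(n, 3))
no_call_acts = rng.exponential(1.0, size=(n, 3))
m = signed_margin(call_acts, no_call_acts)

# Synthetic decisions with a built-in positive call offset (toy ground truth).
true_w, true_b = 1.5, 1.0
p_call = 1.0 / (1.0 + np.exp(-(true_w * m + true_b)))
called = (rng.random(n) < p_call).astype(float)

w, b = fit_offset(m, called)
neutral_margin = -b / w  # margin at which P(call) = 0.5

# Toy steering: tied weights, so the margin gradient is built from decoder rows.
dec_call = rng.normal(size=(3, 16))
dec_no_call = rng.normal(size=(3, 16))
g = dec_call.sum(axis=0) - dec_no_call.sum(axis=0)
d = -g / np.linalg.norm(g)  # unit direction pointing toward no_call

h = rng.normal(size=16)
# Shifting the margin by neutral_margin cancels the fitted offset b.
h_steered = counter_bias_shift(h, g, d, neutral_margin)
margin_change = float(g @ (h_steered - h))
```

In this toy setup a negative `neutral_margin` means the fitted decision is neutral only when no_call activation outweighs call activation, which is the pattern the abstract reports as consistent with IBH. The steering step then moves the hidden state just far enough that the margin changes by that amount.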