Tool Calling is Linearly Readable and Steerable in Language Models

📅 2026-05-08
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses a common failure mode of tool-calling language models: selecting the wrong tool, which goes unnoticed until execution because the decision is hard to interpret and hard to intervene on beforehand. The study finds that the tool-selection decision is linearly readable and steerable, with the causal effect concentrated along a single direction: the row of the output layer that produces the target tool's first token. Base models already encode the correct tool, and instruction tuning appears mainly to wire that representation to the output. Using linear probing, activation patching, cosine readout, and mean-difference vector interventions, the authors test this across 12 instruction-tuned models: models at 4B parameters and above switch tools with 93–100% accuracy, likely erroneous calls can be flagged in advance from a small gap between the top-1 and top-2 tools, and a within-topic linear probe reaches 61–89% top-1 accuracy on 14 same-domain airline tools.
📝 Abstract
When a tool-calling agent picks the wrong tool, the failure is invisible until execution: the email gets sent, the meeting gets missed. Probing 12 instruction-tuned models across Gemma 3, Qwen 3, Qwen 2.5, and Llama 3.1 (270M to 27B), we find the identity of the chosen tool is linearly readable and steerable inside the model. Adding the mean-difference between two tools' average internal activations switches which tool the model selects at 77-100% accuracy on name-only single-turn prompts (93-100% at 4B+), and the JSON arguments that follow autoregressively match the new tool's schema, so flipping the name is enough. The same per-tool means also flag likely errors before they happen: on Gemma 3 12B and 27B, queries where the gap between the top-1 and top-2 tool is smallest produce 14-21x more wrong calls than queries with the largest gap. The causal effect concentrates along one direction, the row of the output layer that produces the target tool's first token: a unit vector along it at matched magnitude already reaches 93-100%, while what is left over leaves the choice almost untouched. Activation patching localises this to a small set of mid- and late-layer attention heads, and a within-topic probe across 14 same-domain $\tau$-bench airline tools reaches top-1 61-89% across five 4B-14B models, ruling out the reading that we are just moving the model along a topic axis. Even base models encode the right tool before they can emit it: cosine readout from the internal state recovers 69-82% on BFCL while base generation reaches only 2-10%, suggesting pretraining forms the representation and instruction tuning later wires it to the output. We measure tool identity selection and JSON schema correctness in single-turn fixed-menu settings; multi-turn agentic transfer is more fragile and is discussed in Limitations.
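The three operations named in the abstract (per-tool mean activations, cosine readout with a top-1/top-2 gap, and mean-difference steering) can be illustrated with a toy sketch. This is not the authors' code: the 3-dimensional vectors, the tool names, the function names, and the `alpha` scale are all hypothetical, and scoring the gap with cosine similarity is an assumption, since the abstract does not say how the gap is computed. In the paper these operations act on real model activations.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_readout(hidden, tool_means):
    """Return (top-1 tool, top-1 score minus top-2 score)."""
    scored = sorted(
        ((cosine(hidden, mean), name) for name, mean in tool_means.items()),
        reverse=True,
    )
    (s1, top1), (s2, _) = scored[0], scored[1]
    return top1, s1 - s2

def steer(hidden, tool_means, source, target, alpha=1.0):
    """Add alpha * (mean[target] - mean[source]) to the hidden state."""
    diff = [t - s for t, s in zip(tool_means[target], tool_means[source])]
    return [h + alpha * d for h, d in zip(hidden, diff)]

# Hypothetical toy activations: two example states per tool.
activations = {
    "send_email":   [[1.0, 0.1, 0.0], [0.9, -0.1, 0.0]],
    "create_event": [[0.0, 1.0, 0.1], [0.1, 0.9, -0.1]],
    "search_web":   [[0.0, 0.1, 1.0], [0.0, -0.1, 0.9]],
}
tool_means = {name: mean_vector(vs) for name, vs in activations.items()}

hidden = [0.8, 0.2, 0.0]  # a state that reads out as send_email
before, gap_before = cosine_readout(hidden, tool_means)
steered = steer(hidden, tool_means, "send_email", "create_event")
after, gap_after = cosine_readout(steered, tool_means)
print(before, round(gap_before, 3), "->", after, round(gap_after, 3))
```

In this toy setting a small `gap_before` would mark a query where two tools are nearly tied, which is the kind of case the abstract associates with more wrong calls.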
Problem

Research questions and friction points this paper is trying to address.

tool calling
linear readability
error prediction
language models
internal representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

linear steerability
tool calling
activation steering
mechanistic interpretability
representation probing