Internal Representation, Not Clinical Knowledge: Where Apparent LLM Triage Failures Originate

📅 2026-05-28


🤖 AI Summary

This study investigates the root cause of high under-triage rates in large language models (LLMs) performing clinical triage, asking whether errors stem from insufficient clinical knowledge or from constraints imposed by the output format. By comparing internal model representations under free-text and multiple-choice formats, using sparse autoencoder feature analysis, natural-language autoencoder verbalization, and logit attribution on decision tokens, the work finds that triage failures arise primarily from output-format-induced mapping bias rather than from deficiencies in clinical representation. Under the multiple-choice format, medically relevant features fall silent at the decision token, so the decision is driven by scaffold and format features. Errors predominantly manifest as adjacent-level acuity shifts, reflecting a structural bias rather than genuine knowledge gaps, which challenges conventional assumptions about LLMs’ medical reasoning capabilities.
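As a rough illustration of what logit attribution on a decision token involves, the sketch below scores each sparse-autoencoder feature by its activation times the projection of its decoder direction onto the unembedding vector of the answer token. This is a simplified linear approximation and not the paper's implementation: it ignores final normalization and any downstream layers, and the function names, feature labels, and toy numbers are all hypothetical.

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def feature_logit_contributions(activations, decoder_rows, unembed_vector):
    """Direct contribution of each feature to one token's logit.

    contribution_i = activation_i * (decoder_row_i . unembed_vector)
    Assumes a purely linear path from the feature to the logit.
    """
    return [a * dot(d, unembed_vector) for a, d in zip(activations, decoder_rows)]


# Toy example (made-up values): feature 0 stands in for a "medical" feature
# that is silent at the decision token, feature 1 for a "format" feature.
activations = [0.0, 2.0]
decoder_rows = [[1.0, 0.0], [0.0, 1.0]]
unembed_vector = [3.0, 0.5]  # hypothetical unembedding of the chosen letter

contributions = feature_logit_contributions(activations, decoder_rows, unembed_vector)
```

In this toy setup the silent feature contributes nothing regardless of how well its direction aligns with the answer token, which is the pattern the summary describes for medical features at the multiple-choice decision token.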

📝 Abstract

Patient-voiced clinical-triage benchmarks report high under-triage rates for consumer LLMs under constrained multiple-choice output, yet the same cases score differently with free-text output. We ask whether output format changes the model's clinical representation or only the mapping from a preserved representation to an answer. Using sparse-autoencoder (SAE) features in Gemma 3 4B/12B IT and Qwen3-8B, we find that the same medical features fire on the shared clinical narrative under both formats but go silent at the multiple-choice decision token in every case and for every model. Three independent methods (natural-language autoencoder verbalization, decision-token logit attribution, and top-feature characterization) agree that scaffold and format features, but not medical features, drive the decision logits. Behaviorally, the multiple-choice penalty inverts under both structured and natural-language input, an option-order shuffle rules out positional bias, and the gap is dominated by off-by-one decisions (the model picks an acuity letter adjacent to the gold answer) rather than by knowledge failure. Thus, the failure originates in the output format and not in the clinical representation.
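The off-by-one pattern described above can be made concrete with a small tally that separates adjacent-level errors from larger ones. The sketch below is only an illustration under stated assumptions: the four-letter acuity scale, the function names, and the example pairs are hypothetical and are not taken from the paper.

```python
from collections import Counter

# Hypothetical acuity scale, ordered from least to most urgent.
ACUITY_LETTERS = ["A", "B", "C", "D"]


def classify_error(gold, predicted, letters=ACUITY_LETTERS):
    """Label a prediction as correct, adjacent (off by one level), or distant."""
    distance = abs(letters.index(predicted) - letters.index(gold))
    if distance == 0:
        return "correct"
    if distance == 1:
        return "adjacent"
    return "distant"


def summarize(pairs):
    """Count outcome labels over (gold, predicted) pairs."""
    return Counter(classify_error(gold, pred) for gold, pred in pairs)


# Made-up (gold, predicted) pairs for illustration only.
example_pairs = [("C", "B"), ("C", "C"), ("D", "A"), ("A", "B")]
summary = summarize(example_pairs)
```

A gap dominated by the "adjacent" count, as opposed to the "distant" count, is the kind of evidence the abstract reads as a mapping problem rather than a knowledge failure.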

Problem

Research questions and friction points this paper is trying to address.

clinical triage

output format

internal representation

large language models

under-triage

Innovation

Methods, ideas, or system contributions that make the work stand out.

sparse autoencoder

clinical representation

output format bias