🤖 AI Summary
This work investigates how large language models (LLMs) coordinate knowledge recall with answer deduplication when answering one-to-many factual queries (e.g., "List cities in a given country"). Methodologically, it combines early decoding, causal tracing, attention-aggregated decoding (Token Lens), and an attention-knockout analysis of MLP outputs. The study reveals a two-phase "promote-then-suppress" mechanism: in early layers, attention propagates subject information while MLPs promote candidate answers; subsequently, attention attends to and suppresses already-generated answer tokens, and MLPs amplify this suppression signal. The authors introduce Token Lens and the attention-knockout method as tools for token-level attribution. Validation across diverse LLMs and datasets supports the mechanism's generality, establishing a unified, interpretable account of internal coordination in complex factual recall.
📝 Abstract
To answer one-to-many factual queries (e.g., listing cities of a country), a language model (LM) must simultaneously recall knowledge and avoid repeating previous answers. How are these two subtasks implemented and integrated internally? Across multiple datasets and models, we identify a promote-then-suppress mechanism: the model first recalls all answers, and then suppresses previously generated ones. Specifically, LMs use both the subject and previous answer tokens to perform knowledge recall, with attention propagating subject information and MLPs promoting the answers. Then, attention attends to and suppresses previous answer tokens, while MLPs amplify the suppression signal. Our mechanism is corroborated by extensive experimental evidence: in addition to using early decoding and causal tracing, we analyze how components use different tokens by introducing both *Token Lens*, which decodes aggregated attention updates from specified tokens, and a knockout method that analyzes changes in MLP outputs after removing attention to specified tokens. Overall, we provide new insights into how LMs' internal components interact with different input tokens to support complex factual recall. Code is available at https://github.com/Lorenayannnnn/how-lms-answer-one-to-many-factual-queries.
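As a rough illustration of the Token Lens idea described in the abstract (a sketch, not the authors' implementation — see their repository for the real code), one can take the attention update that a set of specified source tokens contributes to the residual stream at the generation position, and project that aggregated update through the unembedding matrix to read off which vocabulary tokens it promotes or suppresses. All function names, variable names, and tensor shapes below are hypothetical:

```python
import numpy as np

def token_lens(attn_weights, value_states, W_U, src_positions):
    """Decode the aggregated attention update contributed by selected
    source tokens into vocabulary-space logits (logit-lens-style probe).

    attn_weights:  (seq, seq) attention probabilities for one head
    value_states:  (seq, d_model) per-token value vectors, assumed
                   already projected back to the residual-stream basis
    W_U:           (d_model, vocab) unembedding matrix
    src_positions: indices of the source tokens to attribute
    """
    q = attn_weights.shape[0] - 1  # probe the last (generation) position
    # Aggregate only the weighted value updates flowing from chosen tokens
    update = sum(attn_weights[q, s] * value_states[s] for s in src_positions)
    # Positive logits = tokens this update promotes; negative = suppresses
    return update @ W_U

# Toy demonstration with random tensors (hypothetical sizes)
rng = np.random.default_rng(0)
seq, d_model, vocab = 6, 8, 20
attn = rng.random((seq, seq))
attn /= attn.sum(axis=-1, keepdims=True)       # normalize rows
values = rng.standard_normal((seq, d_model))
W_U = rng.standard_normal((d_model, vocab))

logits = token_lens(attn, values, W_U, src_positions=[1, 2])
print(logits.shape)  # one logit per vocabulary entry
```

Under the paper's mechanism, applying such a probe to previous-answer token positions in later layers would be expected to yield negative logits for those answers (suppression), while subject-token positions in early layers promote candidate answers.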