Composition Collapse: Stable Factual Knowledge Does Not Imply Compositional Reasoning

📅 2026-05-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of evaluation practices that rely solely on overall accuracy: such scores obscure systematic failures of multi-hop compositional reasoning even when a model possesses the requisite atomic facts. The authors propose a double-gate evaluation protocol that formally defines the phenomenon of “composition collapse” and introduces a conditional diagnostic framework that decomposes post-training performance into three independent channels: atomic knowledge stability, residual compositional capacity, and critical reasoning depth. Through fine-grained analysis on a temporal fact-chain benchmark spanning depths 2–11 across four post-training recipes, they show that models with statistically indistinguishable atomic knowledge can exhibit compositional performance gaps exceeding 40 percentage points. Diagnostic probes further indicate that a substantial share of measured composition failures reflects generation-time computation constraints rather than a permanent inability to compose.

📝 Abstract

Post-training is routinely evaluated through aggregate benchmark scores that treat multi-hop reasoning as a single capability, as if a model that answers more questions correctly must be better at assembling facts. We show that this assumption can be misleading: recipes with statistically indistinguishable atomic knowledge produce composition behaviour separated by over 40 percentage points. We call this phenomenon composition collapse: the systematic failure to assemble stably-known facts into chains, invisible to aggregate metrics. We introduce a double-gate protocol that changes the estimand from an aggregate compositionality gap to residual composition failure conditioned on stable atomic access, decomposing post-training gains into three independent channels: atomic stability, residual composition, and critical depth. On a benchmark of temporal factual chains spanning depths 2–11 across four post-training recipes, this decomposition reveals that post-training objectives shift composition capability in directions that aggregate metrics mask, and suggests that claims about multi-hop reasoning improvement should be accompanied by atomic-gate-controlled composition metrics. Diagnostic probes further show that a substantial share of measured composition failure reflects generation-time computation constraints rather than permanent inability to compose.
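
The idea of conditioning composition accuracy on stable atomic access can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not spell out the gate definitions, so the code assumes a single hypothetical stability gate (every fact in a chain is answered correctly in at least `min_rate` of repeated probes) and a hypothetical threshold-based notion of critical depth. All names (`ChainResult`, `atomic_gate`, `gated_composition_by_depth`, `critical_depth`) are illustrative.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ChainResult:
    """One evaluated fact chain (hypothetical record layout)."""
    depth: int                        # number of hops in the chain
    atomic_correct: list[list[bool]]  # per hop: outcomes of repeated atomic probes
    chain_correct: bool               # was the composed question answered correctly?


def atomic_gate(result: ChainResult, min_rate: float = 1.0) -> bool:
    """Assumed gate: every hop's fact is answered correctly in >= min_rate of probes."""
    return all(
        bool(probes) and sum(probes) / len(probes) >= min_rate
        for probes in result.atomic_correct
    )


def gated_composition_by_depth(
    results: list[ChainResult], min_rate: float = 1.0
) -> dict[int, float]:
    """Composition accuracy per depth, restricted to chains that pass the atomic gate."""
    passed: dict[int, list[bool]] = defaultdict(list)
    for r in results:
        if atomic_gate(r, min_rate):
            passed[r.depth].append(r.chain_correct)
    return {d: sum(v) / len(v) for d, v in sorted(passed.items())}


def critical_depth(acc_by_depth: dict[int, float], threshold: float = 0.5) -> int | None:
    """Assumed definition: smallest depth whose gated accuracy falls below threshold."""
    for d in sorted(acc_by_depth):
        if acc_by_depth[d] < threshold:
            return d
    return None
```

Under these assumptions, a chain that fails only because one of its facts is unstable is excluded from the gated metric, so what remains is failure to compose facts the model reliably knows. Reporting the gated accuracy per depth alongside the aggregate score is one way to make the gap described above visible.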

Problem

Research questions and friction points this paper is trying to address.

compositionality

multi-hop reasoning

post-training

composition collapse

atomic knowledge

Innovation

Methods, ideas, or system contributions that make the work stand out.

composition collapse

double-gate protocol

compositional reasoning