🤖 AI Summary
This study addresses the counterintuitive “Size-Fidelity Paradox” in context compression, where increasing compressor model size degrades reconstruction fidelity even as training loss decreases. Through systematic experiments across compressor-decoder architectures using models ranging from 0.6B to 90B parameters, the authors uncover a negative correlation between model scale and contextual faithfulness. They propose two underlying mechanisms—“knowledge overwriting” and “semantic drift”—attributing the phenomenon to excessive semantic capacity and amplified generative uncertainty rather than parameter count itself. These claims are substantiated via representational analyses, including the rank of context embeddings and the entropy of token prediction distributions. Furthermore, the work examines emergent properties of compressed representations, demonstrating that conventional scaling laws break down on tasks requiring faithful preservation of open-ended generative contexts.
📝 Abstract
Scaling up model parameters has long been a prevalent training paradigm, driven by the assumption that larger models yield superior generation capabilities. However, under lossy context compression in a compressor-decoder setup, we observe a Size-Fidelity Paradox: increasing the compressor size can reduce the faithfulness of reconstructed contexts even as training loss decreases. Through extensive experiments across models from 0.6B to 90B parameters, we attribute this paradox to two dominant factors: 1) knowledge overwriting: larger models increasingly replace source facts with their own prior beliefs, e.g., ``the white strawberry'' $\to$ ``the red strawberry''; and 2) semantic drift: larger models tend to paraphrase or restructure content instead of reproducing it verbatim, e.g., ``Alice hit Bob'' $\to$ ``Bob hit Alice''. Holding model size fixed, we further examine the emergent properties of compressed context representations and show that the culprit is not parameter count itself, but the excessive semantic capacity and amplified generative uncertainty that accompany scaling. Specifically, the increased rank of context embeddings facilitates prior-knowledge intrusion, whereas higher entropy over token prediction distributions promotes rewriting. Our results complement existing evaluations of the context compression paradigm, revealing a breakdown of scaling laws for faithful preservation in open-ended generation.
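The two diagnostics the abstract points to (the rank of the context-embedding matrix and the entropy of per-step token prediction distributions) are straightforward to compute. Below is a minimal NumPy sketch, not the paper's actual code: the function names, the rank tolerance, and the use of mean Shannon entropy in nats are all assumptions for illustration.

```python
import numpy as np

def effective_rank(embeddings: np.ndarray, tol: float = 1e-6) -> int:
    """Numerical rank of a (num_tokens, dim) context-embedding matrix.

    Counts singular values above a relative tolerance; a higher rank
    would indicate higher semantic capacity of the compressed context.
    """
    s = np.linalg.svd(embeddings, compute_uv=False)
    return int(np.sum(s > tol * s.max()))

def mean_token_entropy(probs: np.ndarray) -> float:
    """Mean Shannon entropy (in nats) over per-step token distributions.

    probs: (num_steps, vocab_size) array whose rows sum to 1.
    Higher values reflect greater generative uncertainty at each step.
    """
    p = np.clip(probs, 1e-12, 1.0)  # avoid log(0)
    return float(np.mean(-np.sum(p * np.log(p), axis=1)))
```

For intuition: a near-one-hot prediction at every step yields entropy close to 0, while a uniform distribution over a vocabulary of size $V$ yields $\log V$; the paper's claim is that larger compressors sit closer to the high-entropy, high-rank regime.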