🤖 AI Summary
Existing RAG research largely overlooks how the contextual formatting of retrieved documents, such as delimiters and structural markup, affects LLM inference accuracy and stability. Controlled experiments reveal that even semantically identical content yields substantial performance fluctuations when presented in different formats. To address this, we propose a lightweight “context normalization” method: an adaptive standardization mechanism operating at the representation level that disentangles formatting noise from semantic content, enhancing model robustness to token-order variations and long-context dependencies. Our approach requires no fine-tuning or additional parameters. Evaluated across multiple RAG benchmarks, it consistently improves generation stability and answer accuracy, particularly for long-text reasoning, while preserving output consistency. This work underscores the critical role of contextual representation design in RAG systems and offers an effective, parameter-efficient remedy for format-induced instability.
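The controlled experiments above hinge on rendering the same facts under different surface formats. A minimal sketch of how such a format comparison might be set up; the fact list, style names, and rendering functions here are illustrative, not the paper's actual benchmark:

```python
import json

# Hypothetical illustration: identical key-value facts rendered under
# several delimiter styles. Because the variants are semantically
# identical, any accuracy gap between them is purely format-induced.
FACTS = [("capital_of_France", "Paris"), ("boiling_point_C", "100")]

def render(facts, style):
    """Render facts in one of several delimiter styles (names are ours)."""
    if style == "colon":
        return "\n".join(f"{k}: {v}" for k, v in facts)
    if style == "json":
        return json.dumps(dict(facts), indent=2)
    if style == "pipe":
        return "\n".join(f"| {k} | {v} |" for k, v in facts)
    raise ValueError(f"unknown style: {style}")

# Each variant would be spliced into the same prompt and the model's
# answers compared across styles.
variants = {s: render(FACTS, s) for s in ("colon", "json", "pipe")}
```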
📝 Abstract
Retrieval-Augmented Generation (RAG) has become an essential approach for extending the reasoning and knowledge capacity of large language models (LLMs). While prior research has primarily focused on retrieval quality and prompting strategies, the influence of how the retrieved documents are framed, i.e., context format, remains underexplored. We show that seemingly superficial choices, such as delimiters or structural markers in key-value extraction, can induce substantial shifts in accuracy and stability, even when semantic content is identical. To systematically investigate this effect, we design controlled experiments that vary context density, delimiter styles, and positional placement, revealing the underlying factors that govern performance differences. Building on these insights, we introduce Contextual Normalization, a lightweight strategy that adaptively standardizes context representations before generation. Extensive experiments on both controlled and real-world RAG benchmarks across diverse settings demonstrate that the proposed strategy consistently improves robustness to order variation and strengthens long-context utilization. These findings underscore that reliable RAG depends not only on retrieving the right content, but also on how that content is presented, offering both new empirical evidence and a practical technique for better long-context reasoning.
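The abstract describes Contextual Normalization as adaptively standardizing context representations before generation. A minimal text-level sketch under our own assumptions (the paper's mechanism may instead operate on internal representations): map heterogeneous delimiter styles in retrieved passages onto one canonical key-value form before the prompt is assembled. All names and the separator pattern below are our illustrative choices:

```python
import re

# Canonical separator every normalized line is mapped onto (our choice).
CANONICAL_SEP = ": "

def normalize_context(passage: str) -> str:
    """Rewrite a retrieved passage so key-value pairs share one format.

    Collapses common separators (=, ->, |, tabs, colons) into a single
    canonical "key: value" form; lines without a recognizable separator
    are passed through unchanged.
    """
    lines = []
    for line in passage.splitlines():
        line = line.strip().strip("|").strip()  # drop table-style pipes
        if not line:
            continue
        parts = re.split(r"\s*(?:=|->|\||\t|:)\s*", line, maxsplit=1)
        if len(parts) == 2:
            lines.append(parts[0] + CANONICAL_SEP + parts[1])
        else:
            lines.append(line)
    return "\n".join(lines)
```

In this reading, the generator always sees one uniform context format regardless of how the retriever's sources were originally marked up, which is the stability property the paper attributes to normalization.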