🤖 AI Summary
To address semantic fragmentation and degraded comprehension when large language models (LLMs) process ultra-long contexts (up to 256k tokens), this paper proposes a question-aware dynamic chunking method. The approach comprises three core components: (1) a dynamic, variable-length chunking mechanism that splits the context at points of low BERT-based inter-sentence semantic similarity, preserving discourse coherence; (2) a lightweight, question-conditioned binary classifier that selects the chunks most critical for answering a given question; and (3) fine-tuning and evaluation on multi-hop question answering (QA). Experiments on both single-hop and multi-hop QA benchmarks show consistent improvements over strong baselines and robustness across a wide range of input lengths. The method mitigates the information loss induced by fixed-length chunking while keeping the input to the LLM compact. Code and datasets are publicly released.
📝 Abstract
Large language models (LLMs) often struggle to accurately read and comprehend extremely long texts. Current methods for improvement typically rely on splitting long contexts into fixed-length chunks. However, fixed truncation risks separating semantically related content, introducing ambiguity and compromising accurate understanding. To overcome this limitation, we propose a straightforward approach that dynamically segments long contexts and selects the most relevant chunks, yielding a more streamlined input for LLMs. Specifically, we compute semantic similarities between adjacent sentences and use points of low similarity to adaptively divide long contexts into variable-length chunks. We further train a question-aware classifier to select the chunks that are critical for answering a specific question. Experimental results on both single-hop and multi-hop question-answering benchmarks show that the proposed approach consistently outperforms strong baselines. Notably, it maintains robustness across a wide range of input lengths, handling sequences of up to 256k tokens. Our datasets and code are available at the following link: https://github.com/ECNU-Text-Computing/DCS
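The similarity-based segmentation step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `dynamic_chunks`, the `threshold` parameter, and the bag-of-words cosine stand-in are all assumptions; the paper computes similarities from BERT sentence embeddings, for which any embedding-based similarity function could be plugged in via the `sim` argument.

```python
from collections import Counter
from math import sqrt

def cosine(a: str, b: str) -> float:
    # Bag-of-words cosine similarity: a cheap stand-in for the
    # BERT-embedding similarity used in the paper.
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = sqrt(sum(v * v for v in ca.values()))
    nb = sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def dynamic_chunks(sentences, threshold=0.2, sim=cosine):
    """Split a sentence list into variable-length chunks, starting a
    new chunk wherever adjacent-sentence similarity drops below
    `threshold` (a low similarity suggests a topic boundary)."""
    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    for prev, cur in zip(sentences, sentences[1:]):
        if sim(prev, cur) < threshold:
            chunks.append(current)  # boundary: close the current chunk
            current = [cur]
        else:
            current.append(cur)     # still coherent: extend the chunk
    chunks.append(current)
    return chunks

sents = [
    "The cat sat on the mat.",
    "The cat slept on the mat.",
    "Quarterly revenue rose sharply.",
]
print(dynamic_chunks(sents))
```

The topically unrelated third sentence gets a low similarity to its predecessor, so it opens a new chunk; in the full pipeline, a question-aware classifier would then score each resulting chunk for relevance before it is passed to the LLM.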