🤖 AI Summary
To address the degradation of reasoning capability and the poor interpretability of large language models (LLMs) on long-context inputs, this paper proposes a supervised chain-of-thought (CoT) enhancement framework. First, the authors construct LongFinanceQA, the first synthetic financial-domain QA dataset featuring explicit, stepwise intermediate reasoning traces for long-context understanding. Second, they introduce Property-driven Agentic Inference (PAI), an inference framework that emulates human-like, multi-stage reasoning: property extraction → retrieval → summarization. Third, they apply supervised CoT training to long-context tasks, which they present as the first such effort. Experiments demonstrate substantial improvements: GPT-4o-mini augmented with PAI achieves a 20.0% gain on the Loong benchmark, and LLaMA-3.1-8B-Instruct fine-tuned on LongFinanceQA improves by 24.6% on Loong's financial subset. The approach enhances both long-range reasoning accuracy and the interpretability of the reasoning process.
📝 Abstract
Recent advances in Large Language Models (LLMs) have enabled them to process increasingly longer sequences, ranging from 2K to 2M tokens and beyond. However, simply extending the input sequence length does not necessarily lead to effective long-context understanding. In this study, we integrate Chain-of-Thought (CoT) reasoning into LLMs in a supervised manner to facilitate effective long-context understanding. To achieve this, we introduce LongFinanceQA, a synthetic dataset in the financial domain designed to improve long-context reasoning. Unlike existing long-context synthetic data, LongFinanceQA includes intermediate CoT reasoning before the final conclusion, which encourages LLMs to perform explicit reasoning, improving accuracy and interpretability in long-context understanding. To generate synthetic CoT reasoning, we propose Property-driven Agentic Inference (PAI), an agentic framework that simulates human-like reasoning steps, including property extraction, retrieval, and summarization. We evaluate PAI's reasoning capability by assessing GPT-4o-mini with PAI on the Loong benchmark, where it outperforms standard GPT-4o-mini by 20.0%. Furthermore, we fine-tune LLaMA-3.1-8B-Instruct on LongFinanceQA, achieving a 24.6% gain on Loong's financial subset.
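The three PAI stages named above (property extraction, retrieval, summarization) can be sketched as a minimal pipeline. This is only an illustrative skeleton: every function body below is a hypothetical stand-in (simple string heuristics over a toy document), whereas the paper's actual framework drives each stage with an LLM agent over long financial contexts. The `PAITrace` record mirrors the idea that intermediate steps are kept explicit for interpretability.

```python
from dataclasses import dataclass, field

@dataclass
class PAITrace:
    """Keeps every intermediate step so the final answer stays interpretable."""
    properties: list = field(default_factory=list)
    evidence: list = field(default_factory=list)
    summary: str = ""

def extract_properties(question: str) -> list:
    # Stage 1 (hypothetical stand-in): pick out the entities/properties the
    # question asks about. A real agent would prompt an LLM for this.
    stop = {"How", "What", "Why", "When", "Where", "Who"}
    return [w.strip("?.") for w in question.split()
            if w.istitle() and w not in stop]

def retrieve(properties: list, document: str) -> list:
    # Stage 2 (hypothetical stand-in): fetch passages from the long context
    # that mention the extracted properties.
    return [line for line in document.splitlines()
            if any(p in line for p in properties)]

def summarize(evidence: list) -> str:
    # Stage 3 (hypothetical stand-in): condense the retrieved evidence
    # into a final conclusion.
    return " | ".join(evidence)

def pai_answer(question: str, document: str) -> PAITrace:
    trace = PAITrace()
    trace.properties = extract_properties(question)
    trace.evidence = retrieve(trace.properties, document)
    trace.summary = summarize(trace.evidence)
    return trace

doc = "Acme revenue rose 10%.\nBeta revenue fell 3%.\nAcme margin improved."
trace = pai_answer("How did Acme perform?", doc)
print(trace.summary)  # evidence lines mentioning "Acme", joined
```

The design point this sketch tries to capture is that the reasoning trace, not just the final answer, is a first-class output, which is what makes the synthetic CoT data in LongFinanceQA possible to generate.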