🤖 AI Summary
Large language models (LLMs) suffer from insufficient long-context alignment in realistic long-text scenarios, primarily due to low-quality training data, inefficient optimization, and suboptimal reward design. Method: We propose Short-to-Long preference Optimization (SoLo), a novel framework that decouples long-context alignment into two stages: short-context preference optimization and short-to-long reward consistency alignment. Central to SoLo is the first-ever Short-to-Long Reward Alignment (SoLo-RA) mechanism, theoretically guaranteeing effective transfer of short-context optimization gains to long contexts. The framework integrates short-to-long consistency regularization and multi-scale context sampling, and is compatible with mainstream preference optimization algorithms. Results: Extensive experiments on multiple long-context benchmarks demonstrate significant improvements in length and domain generalization, a 40% reduction in training data requirements, a 35% decrease in GPU memory consumption, and a 22% average improvement in inference response quality.
📝 Abstract
Despite advances in pretraining with extended context lengths, large language models (LLMs) still face challenges in effectively utilizing real-world long-context information, primarily due to insufficient long-context alignment caused by data quality issues, training inefficiencies, and the lack of well-designed optimization objectives. To address these limitations, we propose a framework named $ extbf{S}$h$ extbf{o}$rt-to-$ extbf{Lo}$ng $ extbf{P}$reference $ extbf{O}$ptimization ($ extbf{SoLoPO}$), decoupling long-context preference optimization (PO) into two components: short-context PO and short-to-long reward alignment (SoLo-RA), supported by both theoretical and empirical evidence. Specifically, short-context PO leverages preference pairs sampled from short contexts to enhance the model's contextual knowledge utilization ability. Meanwhile, SoLo-RA explicitly encourages reward score consistency utilization for the responses when conditioned on both short and long contexts that contain identical task-relevant information. This facilitates transferring the model's ability to handle short contexts into long-context scenarios. SoLoPO is compatible with mainstream preference optimization algorithms, while substantially improving the efficiency of data construction and training processes. Experimental results show that SoLoPO enhances all these algorithms with respect to stronger length and domain generalization abilities across various long-context benchmarks, while achieving notable improvements in both computational and memory efficiency.