🤖 AI Summary
Conventional language models terminate learning at the end-of-sequence (<eos>) token, leaving the post-<eos> sequence space unused. Method: This paper proposes Post-Completion Learning (PCL), a white-box reinforcement learning framework that systematically exploits positions beyond <eos> to jointly optimize reasoning and self-assessment. It integrates dual-track supervised fine-tuning (SFT), interpretable reward prediction, and reward-function alignment, without increasing inference latency, since generation still stops at the completion point. Contribution/Results: Evaluated across multiple datasets and models, the approach consistently outperforms traditional SFT and RL methods, improving both output quality and self-evaluation accuracy. It turns otherwise idle post-<eos> tokens into structured learning signals for post-training large language models.
📝 Abstract
Current language model training paradigms typically terminate learning upon reaching the end-of-sequence (<eos>) token, overlooking potential learning opportunities in the post-completion space. We propose Post-Completion Learning (PCL), a novel training framework that systematically utilizes the sequence space after the model's output completion to enhance both reasoning and self-evaluation abilities. PCL enables models to continue generating self-assessments and reward predictions during training, while maintaining efficient inference by stopping at the completion point.
To fully utilize this post-completion space, we design a white-box reinforcement learning method: the model first evaluates its own output according to the reward rules, and its predicted score is then aligned with the reward functions for supervision. We implement dual-track SFT to optimize both reasoning and evaluation capabilities, and mix it with RL training to achieve multi-objective hybrid optimization.
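The core mechanism can be illustrated with a minimal sketch. All names below (`[EVAL]`, `[REWARD]`, the helper functions) are hypothetical and for illustration only, not the paper's actual implementation: a training target extends past the completion point with a self-assessment and a scalar reward prediction, inference truncates at <eos>, and a squared-error term aligns the model's predicted score with the rule-based reward.

```python
# Illustrative sketch of post-completion training targets (hypothetical
# token names and helpers; not the paper's actual implementation).

EOS = "<eos>"

def build_training_text(answer: str, self_eval: str, predicted_reward: float) -> str:
    """Dual-track target: the reasoning track (answer) ends at <eos>;
    the evaluation track (self-assessment + reward prediction) follows it."""
    return f"{answer} {EOS} [EVAL] {self_eval} [REWARD] {predicted_reward:.2f}"

def inference_output(generated_text: str) -> str:
    """At deployment, generation stops at <eos>, so the post-completion
    tokens are never produced and inference latency is unchanged."""
    return generated_text.split(EOS)[0].strip()

def alignment_loss(predicted_reward: float, rule_based_reward: float) -> float:
    """White-box supervision: penalize disagreement between the model's
    own reward prediction and the external reward function's score."""
    return (predicted_reward - rule_based_reward) ** 2

train_text = build_training_text("The answer is 42.", "Factually consistent.", 0.9)
print(inference_output(train_text))  # -> The answer is 42.
print(alignment_loss(0.9, 1.0))
```

The point of the sketch is the asymmetry: the evaluation track exists only in the training targets, so the self-assessment capability is learned "for free" without any cost at inference time.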
Experimental results on different datasets and models demonstrate consistent improvements over traditional SFT and RL methods. Our method provides a new technical path for language model training that enhances output quality while preserving deployment efficiency.