🤖 AI Summary
Conventional language models terminate learning at the end-of-sequence (<eos>) token, leaving the post-<eos> sequence space unused. Method: This paper proposes Post-Completion Learning (PCL), a white-box reinforcement learning framework that systematically exploits positions beyond <eos> to jointly optimize reasoning and self-assessment. It integrates dual-track supervised fine-tuning (SFT), interpretable reward prediction, and reward-function alignment, without increasing inference latency, since generation still stops at the completion point. Contribution/Results: Evaluated across multiple datasets and models, the approach consistently outperforms traditional SFT and RL methods, improving both output quality and self-evaluation accuracy. It turns otherwise idle post-<eos> tokens into structured learning signals for post-training large language models.
📝 Abstract
Current language model training paradigms typically terminate learning upon reaching the end-of-sequence (<eos>) token, overlooking potential learning opportunities in the post-completion space. We propose Post-Completion Learning (PCL), a novel training framework that systematically utilizes the sequence space after the model's output completion to enhance both reasoning and self-evaluation abilities. PCL enables models to continue generating self-assessments and reward predictions during training, while maintaining efficient inference by stopping at the completion point.
To fully utilize this post-completion space, we design a white-box reinforcement learning method: the model first evaluates its own output according to the reward rules, and its predicted score is then aligned with the reward functions for supervision. We implement dual-track SFT to optimize both reasoning and evaluation capabilities, and mix it with RL training to achieve multi-objective hybrid optimization.
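The core mechanism can be illustrated with a minimal sketch. All names below (`[EVAL]`, `[REWARD]`, the helper functions) are hypothetical and for illustration only, not the paper's actual implementation: a training target extends past the completion point with a self-assessment and a scalar reward prediction, inference truncates at <eos>, and a squared-error term aligns the model's predicted score with the rule-based reward.

```python
# Illustrative sketch of post-completion training targets (hypothetical
# token names and helpers; not the paper's actual implementation).

EOS = "<eos>"

def build_training_text(answer: str, self_eval: str, predicted_reward: float) -> str:
    """Dual-track target: the reasoning track (answer) ends at <eos>;
    the evaluation track (self-assessment + reward prediction) follows it."""
    return f"{answer} {EOS} [EVAL] {self_eval} [REWARD] {predicted_reward:.2f}"

def inference_output(generated_text: str) -> str:
    """At deployment, generation stops at <eos>, so the post-completion
    tokens are never produced and inference latency is unchanged."""
    return generated_text.split(EOS)[0].strip()

def alignment_loss(predicted_reward: float, rule_based_reward: float) -> float:
    """White-box supervision: penalize disagreement between the model's
    own reward prediction and the external reward function's score."""
    return (predicted_reward - rule_based_reward) ** 2

train_text = build_training_text("The answer is 42.", "Factually consistent.", 0.9)
print(inference_output(train_text))  # -> The answer is 42.
print(alignment_loss(0.9, 1.0))
```

The point of the sketch is the asymmetry: the evaluation track exists only in the training targets, so the self-assessment capability is learned "for free" without any cost at inference time.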
Experimental results on different datasets and models demonstrate consistent improvements over traditional SFT and RL methods. Our method provides a new technical path for language model training that enhances output quality while preserving deployment efficiency.