🤖 AI Summary
Large language models exhibit limited generalization on out-of-distribution complex reasoning tasks—e.g., 100-digit addition, long-sequence deduction, and maze solving. This paper proposes a plug-and-play self-improvement framework for pretrained Transformers that enables weak-to-strong, self-generated curriculum learning without architectural modifications. The method integrates self-distillation, filtering of self-generated samples, iterative supervised fine-tuning, and difficulty-incremental training so that the model autonomously constructs and learns from high-quality solution trajectories. The core contribution is the "self-generated curriculum" paradigm: a systematic approach to enhancing logical extrapolation and length generalization. Experiments demonstrate order-of-magnitude generalization gains across arithmetic, string-manipulation, and maze-solving tasks (e.g., from 10- to 100-digit addition), with error rates decreasing exponentially across self-improvement rounds when correct self-generated examples are filtered. Moreover, starting from a pretrained initialization significantly accelerates convergence.
📝 Abstract
Large language models often struggle with length generalization and with solving complex problem instances beyond their training distribution. We present a self-improvement approach where models iteratively generate and learn from their own solutions, progressively tackling harder problems while maintaining a standard transformer architecture. Across diverse tasks including arithmetic, string manipulation, and maze solving, self-improvement enables models to solve problems far beyond their initial training distribution; for instance, generalizing from 10-digit to 100-digit addition without apparent saturation. We observe that in some cases filtering for correct self-generated examples leads to exponential improvements in out-of-distribution performance across training rounds. Additionally, starting from pretrained models significantly accelerates this self-improvement process for several tasks. Our results demonstrate how controlled weak-to-strong curricula can systematically teach a model logical extrapolation without any changes to the positional embeddings or the model architecture.
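The weak-to-strong loop described above (pose problems slightly harder than the current level, keep only self-generated answers that pass a correctness filter, fine-tune on the survivors, then raise the difficulty) can be sketched as a toy simulation for the addition task. Everything here is illustrative rather than taken from the paper: the names `make_problem` and `run_curriculum`, the advancement threshold, and the `solve` callable that stands in for the Transformer being fine-tuned.

```python
import random


def make_problem(n_digits, rng):
    """Sample two random n-digit operands (illustrative problem generator)."""
    a = rng.randrange(10 ** (n_digits - 1), 10 ** n_digits)
    b = rng.randrange(10 ** (n_digits - 1), 10 ** n_digits)
    return a, b


def self_improvement_round(solve, difficulty, n_samples, rng):
    """One round: pose problems one step beyond the current difficulty and
    keep only the self-generated answers that pass the correctness filter."""
    kept = []
    for _ in range(n_samples):
        a, b = make_problem(difficulty + 1, rng)
        answer = solve(a, b)
        if answer == a + b:  # exact-match filter on self-generated data
            kept.append(((a, b), answer))
    return kept


def run_curriculum(solve, start_digits, rounds, n_samples=200, seed=0):
    """Difficulty-incremental loop. In the paper each round would fine-tune
    the model on `kept`; this toy only advances the curriculum when enough
    filtered samples survive (threshold chosen arbitrarily)."""
    rng = random.Random(seed)
    difficulty = start_digits
    for _ in range(rounds):
        kept = self_improvement_round(solve, difficulty, n_samples, rng)
        if len(kept) >= n_samples // 2:  # reliable enough: go one step harder
            difficulty += 1
    return difficulty
```

For example, `run_curriculum(lambda a, b: a + b, 10, 5)` advances from 10- to 15-digit problems, while a solver whose answers never pass the filter stays at its starting difficulty, which is why the quality of the correctness filter drives the round-over-round gains the abstract reports.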