DELTA-Code: How Does RL Unlock and Transfer New Programming Algorithms in LLMs?

📅 2025-09-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether large language models (LLMs) can acquire and transfer genuinely novel algorithmic reasoning strategies, beyond what pretraining or post-training encodes, via reinforcement learning (RL). To this end, we introduce DELTA-Code, the first synthetic benchmark that systematically decouples *learnability* (can RL solve problem families where pretrained models score pass@K = 0?) from *transferability* (do the acquired skills generalize to out-of-distribution tasks?). The training recipe combines curriculum learning, experience replay, progressive reward shaping, and verification-based feedback. Experiments reveal a phase-transition-like "grokking" effect: after an extended stretch of near-zero reward, models abruptly jump to near-perfect accuracy on formerly intractable problems, and they show substantial gains in exploratory and compositional generalization. The core contribution is a controllable algorithmic coding evaluation framework that empirically characterizes the pathways, and the limits, of LLMs acquiring new algorithmic capabilities through RL.
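The learnability criterion above hinges on pass@K staying at zero even for a large sampling budget K. For reference, here is a minimal sketch of the standard unbiased pass@k estimator from the code-generation evaluation literature (this is background, not code from the paper itself):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: given n sampled solutions of which
    c are correct, estimate P(at least one of k random samples passes)."""
    if n - c < k:
        # Fewer incorrect samples than k: some correct sample is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# A problem family counts as unsolved when pass@K is exactly 0,
# i.e. no correct sample appears even with a large budget:
assert pass_at_k(200, 0, 100) == 0.0
```

A family with `pass_at_k(n, c, K) == 0.0` for large `n` and `K` is what DELTA-Code treats as "previously unsolvable" before RL training.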

📝 Abstract
It remains an open question whether LLMs can acquire or generalize genuinely new reasoning strategies, beyond the sharpened skills already encoded in their parameters during pre-training or post-training. To address this question, we introduce DELTA-Code (Distributional Evaluation of Learnability and Transferrability in Algorithmic Coding), a controlled benchmark of synthetic coding problem families designed to probe two fundamental aspects: learnability -- can LLMs, through reinforcement learning (RL), solve problem families on which pretrained models fail even with a large sampling budget (pass@K = 0)? -- and transferrability -- if learning succeeds, do the acquired skills transfer systematically to out-of-distribution (OOD) test sets? Unlike prior public coding datasets, DELTA isolates reasoning skills through templated problem generators and introduces fully OOD problem families that demand novel strategies rather than tool invocation or memorized patterns. Our experiments reveal a striking grokking-style phase transition: after an extended period of near-zero reward, RL-trained models abruptly climb to near-perfect accuracy. To enable learnability on previously unsolvable problem families, we explore key training ingredients such as staged warm-up with dense rewards, experience replay, curriculum training, and verification-in-the-loop. Beyond learnability, we use DELTA to evaluate transferability along exploratory, compositional, and transformative axes, as well as cross-family transfer. Results show solid gains within families and for recomposed skills, but persistent weaknesses in transformative cases. DELTA thus offers a clean testbed for probing the limits of RL-driven reasoning and for understanding how models can move beyond their existing priors to acquire new algorithmic skills.
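To give a flavor of what a templated problem generator looks like, here is a toy sketch (a hypothetical "rotate a list by k" family; not an actual DELTA-Code family). The key idea is that in-distribution and OOD splits are drawn from the same template with disjoint parameter ranges:

```python
import random

def sample_instance(n_range, k_range, seed=None):
    """Instantiate one problem from a toy list-rotation family.
    n_range/k_range control the sampling distribution; an OOD test
    set simply uses parameter ranges disjoint from training."""
    rng = random.Random(seed)
    n = rng.randint(*n_range)
    k = rng.randint(*k_range)
    xs = [rng.randint(0, 99) for _ in range(n)]
    prompt = f"Rotate the list {xs} left by {k} positions."
    answer = xs[k % n:] + xs[:k % n]   # ground truth for verification
    return {"prompt": prompt, "answer": answer}

# Same template, disjoint parameter ranges => in-distribution vs. OOD.
train_inst = sample_instance(n_range=(5, 10), k_range=(1, 3), seed=0)
ood_inst = sample_instance(n_range=(50, 100), k_range=(20, 40), seed=0)
```

Because every instance carries a verifiable ground-truth answer, generators like this support the pass@K-style evaluation and verification-based rewards the paper relies on.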
Problem

Research questions and friction points this paper is trying to address.

Testing if LLMs can learn genuinely new reasoning strategies beyond pre-training
Evaluating whether RL-acquired coding skills transfer to out-of-distribution problems
Probing the limits of RL-driven algorithmic reasoning in language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

RL training with dense reward shaping and experience replay
Curriculum learning with staged warm-up phases
Verification-in-the-loop feedback for algorithmic skill transfer
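The ingredients above might be wired together roughly as follows. This is a hypothetical scaffold under stated assumptions (all function names, the per-family `difficulty` field, and the reward schedule are placeholders), not the paper's implementation:

```python
from collections import deque

def shaped_reward(passed: int, total: int, stage: str) -> float:
    """Staged reward schedule: dense partial credit (fraction of unit
    tests passed) during warm-up, sparse all-or-nothing reward later."""
    if stage == "warmup":
        return passed / total
    return 1.0 if passed == total else 0.0

def curriculum(problem_families):
    """Order families easiest-first; assumes each family dict carries
    a hypothetical 'difficulty' field."""
    return sorted(problem_families, key=lambda f: f["difficulty"])

replay_buffer = deque(maxlen=1024)  # retain past successful rollouts

def training_step(policy_sample, run_tests, problem, stage):
    """One RL step: sample a program, verify it against unit tests
    (verification-in-the-loop), and store successes for replay."""
    program = policy_sample(problem)
    passed, total = run_tests(problem, program)
    reward = shaped_reward(passed, total, stage)
    if reward == 1.0:
        replay_buffer.append((problem, program))
    return reward
```

Mixing replayed successes back into later batches is one plausible way to keep reward signal alive through the long near-zero-reward phase that precedes the grokking transition.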