TAROT: Test-driven and Capability-adaptive Curriculum Reinforcement Fine-tuning for Code Generation with Large Language Models

📅 2026-02-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing reinforcement fine-tuning approaches for code generation overlook the heterogeneous difficulty and granularity of test cases, leading to imbalanced reward signals and training bias. This work proposes TAROT, a framework that integrates test-driven, multi-granularity difficulty tiers—categorized as basic, intermediate, complex, and edge—with a model-capability-aware curriculum strategy. By decoupling curriculum progression from raw rewards, TAROT enables dynamic, capability-adaptive selection of training sequences. Empirical results demonstrate substantial improvements in both the functional correctness and the robustness of generated code: weaker models benefit from an easy-to-hard curriculum, whereas stronger models achieve superior performance under a hard-to-easy schedule, validating the effectiveness and generality of capability-adaptive curriculum design.

📝 Abstract
Large Language Models (LLMs) are changing the coding paradigm, a shift popularly known as vibe coding, yet synthesizing algorithmically sophisticated and robust code remains a critical challenge. Incentivizing the deep reasoning capabilities of LLMs is essential to overcoming this hurdle, and Reinforcement Fine-Tuning (RFT) has emerged as a promising strategy to address this need. However, most existing approaches overlook the heterogeneous difficulty and granularity inherent in test cases, leading to an imbalanced distribution of reward signals and consequently biased gradient updates during training. To address this, we propose Test-driven and cApability-adaptive cuRriculum reinfOrcement fine-Tuning (TAROT). TAROT systematically constructs, for each problem, a four-tier test suite (basic, intermediate, complex, edge), providing a controlled difficulty landscape for curriculum design and evaluation. Crucially, TAROT decouples curriculum progression from raw reward scores, enabling capability-conditioned evaluation and principled selection from a portfolio of curriculum policies rather than relying on incidental test-case difficulty composition. This design fosters stable optimization and more efficient competency acquisition. Extensive experimental results reveal that the optimal curriculum for RFT in code generation is closely tied to a model's inherent capability: less capable models achieve greater gains with an easy-to-hard progression, whereas more competent models excel under a hard-first curriculum. TAROT provides a reproducible method that adaptively tailors curriculum design to a model's capability, thereby consistently improving the functional correctness and robustness of the generated code. All code and data are released to foster reproducibility and advance community research at https://github.com/deep-diver/TAROT.
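The core idea of the capability-adaptive curriculum can be sketched as follows. This is a minimal illustrative sketch, not the authors' released implementation: the tier names come from the abstract, but the probe function, the capability threshold, and the `Problem` structure are assumptions introduced here for illustration.

```python
# Hypothetical sketch of TAROT-style capability-adaptive curriculum ordering.
# The four tiers come from the paper; probe_capability, the 0.5 threshold,
# and Problem are illustrative assumptions, not the released code.
from dataclasses import dataclass

TIERS = ["basic", "intermediate", "complex", "edge"]  # four-tier test suite

@dataclass
class Problem:
    prompt: str
    tier: str  # difficulty tier of the problem's dominant test cases

def probe_capability(pass_rates: dict) -> float:
    """Aggregate per-tier pass rates on a probe set into a score in [0, 1]."""
    return sum(pass_rates.get(t, 0.0) for t in TIERS) / len(TIERS)

def order_curriculum(problems: list, capability: float,
                     threshold: float = 0.5) -> list:
    """Weaker models train easy-to-hard; stronger models hard-to-easy."""
    rank = {t: i for i, t in enumerate(TIERS)}
    easy_first = capability < threshold
    return sorted(problems, key=lambda p: rank[p.tier], reverse=not easy_first)

# Example: a weak model (low aggregate pass rate) sees basic problems first.
probs = [Problem("p1", "edge"), Problem("p2", "basic"), Problem("p3", "complex")]
cap = probe_capability({"basic": 0.8, "intermediate": 0.3})  # 0.275 -> weak
print([p.tier for p in order_curriculum(probs, cap)])
# -> ['basic', 'complex', 'edge']
```

The key design point mirrored here is the decoupling the abstract describes: the ordering decision depends on a capability probe, not on the raw reward scores produced during RFT itself.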
Problem

Research questions and friction points this paper is trying to address.

code generation
reinforcement fine-tuning
curriculum learning
test cases
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Curriculum Learning
Reinforcement Fine-Tuning
Test-driven Evaluation
Capability-adaptive
Code Generation