GTR-Turbo: Merged Checkpoint is Secretly a Free Teacher for Agentic VLM Training

📅 2025-12-15

🤖 AI Summary

Multi-turn reinforcement learning for vision-language agents suffers from sparse rewards and long-horizon credit assignment. This paper proposes GTR-Turbo, an efficient training paradigm that removes the need for an external teacher model. The core idea is a “free” teacher built during training by merging the weights of earlier RL checkpoints. This avoids costly privileged models (e.g., GPT or Gemini), mitigates entropy collapse, and keeps training stable. The merged teacher guides the subsequent RL through either supervised fine-tuning or soft logit distillation. Across diverse visual agentic tasks, GTR-Turbo improves baseline accuracy by 10–30% and, relative to GTR, reduces wall-clock training time by 50% and compute cost by 60%, which makes the approach more practical and easier to reproduce.

📝 Abstract

Multi-turn reinforcement learning (RL) for multi-modal agents built upon vision-language models (VLMs) is hampered by sparse rewards and long-horizon credit assignment. Recent methods densify the reward by querying a teacher that provides step-level feedback, e.g., Guided Thought Reinforcement (GTR) and On-Policy Distillation, but rely on costly, often privileged models as the teacher, limiting practicality and reproducibility. We introduce GTR-Turbo, a highly efficient upgrade to GTR, which matches the performance without training or querying an expensive teacher model. Specifically, GTR-Turbo merges the weights of checkpoints produced during the ongoing RL training, and then uses this merged model as a "free" teacher to guide the subsequent RL via supervised fine-tuning or soft logit distillation. This design removes dependence on privileged VLMs (e.g., GPT or Gemini), mitigates the "entropy collapse" observed in prior work, and keeps training stable. Across diverse visual agentic tasks, GTR-Turbo improves the accuracy of the baseline model by 10-30% while reducing wall-clock training time by 50% and compute cost by 60% relative to GTR.
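The checkpoint-merging step described in the abstract can be illustrated with a minimal sketch. The abstract does not state the merging rule, so the uniform (or coefficient-weighted) parameter average below is an assumption, and the names `Checkpoint`, `merge_checkpoints`, and `coeffs` are hypothetical. Checkpoints are modeled as plain dictionaries of flattened weights to keep the example self-contained.

```python
from typing import Dict, List, Optional, Sequence

# Hypothetical stand-in for a model state dict: parameter name -> flattened weights.
Checkpoint = Dict[str, List[float]]


def merge_checkpoints(
    checkpoints: Sequence[Checkpoint],
    coeffs: Optional[Sequence[float]] = None,
) -> Checkpoint:
    """Average parameters across checkpoints (assumed merging rule, not the paper's)."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    if coeffs is None:
        # Default assumption: every checkpoint contributes equally.
        coeffs = [1.0] * len(checkpoints)
    if len(coeffs) != len(checkpoints):
        raise ValueError("one coefficient per checkpoint is required")

    total = sum(coeffs)
    merged: Checkpoint = {}
    for name, reference in checkpoints[0].items():
        # Normalized weighted sum of the same parameter across all checkpoints.
        merged[name] = [
            sum(c * ckpt[name][i] for c, ckpt in zip(coeffs, checkpoints)) / total
            for i in range(len(reference))
        ]
    return merged


# Two toy checkpoints saved at different points of an RL run.
early = {"w": [1.0, 3.0]}
late = {"w": [3.0, 5.0]}
teacher = merge_checkpoints([early, late])
```

In this sketch the teacher costs only one pass over the saved weights, which is the sense in which a merged checkpoint can be called "free" compared with querying an external model.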

Problem

Research questions and friction points this paper is trying to address.

Sparse rewards hinder multi-turn RL for vision-language agents

Expensive teacher models limit practical VLM agent training

How to train agents efficiently without a costly external teacher

Innovation

Methods, ideas, or system contributions that make the work stand out.

Merges RL checkpoint weights into a free teacher model

Uses the merged model as a teacher via supervised fine-tuning or soft logit distillation

Eliminates need for costly privileged teacher models
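As a rough illustration of the soft logit distillation mentioned above, the sketch below computes a KL divergence between the teacher's and the student's next-token distributions. The direction of the KL term (teacher to student), the `temperature` parameter, and the function names are assumptions; the summary does not specify the exact loss the authors use.

```python
import math
from typing import List


def log_softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable log-softmax over one token's logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]


def distill_kl(
    teacher_logits: List[float],
    student_logits: List[float],
    temperature: float = 1.0,
) -> float:
    """KL(teacher || student) for a single token position (assumed loss form)."""
    t = log_softmax(teacher_logits, temperature)
    s = log_softmax(student_logits, temperature)
    # Sum over the vocabulary of p_teacher * (log p_teacher - log p_student).
    return sum(math.exp(lt) * (lt - ls) for lt, ls in zip(t, s))


# The loss is zero when the student matches the teacher and positive otherwise.
same = distill_kl([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
different = distill_kl([0.0, 0.0], [0.0, 2.0])
```

Unlike supervised fine-tuning on the teacher's sampled outputs, a loss of this form uses the teacher's full distribution at each step, which gives a denser training signal per token.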
