Application-Driven Pedagogical Knowledge Optimization of Open-Source LLMs via Reinforcement Learning and Supervised Fine-Tuning

📅 2026-04-07
📈 Citations: 0 · Influential: 0
🤖 AI Summary
This work addresses the limited cross-disciplinary teaching capabilities of open-source large language models in educational settings by introducing the EduQwen series. Built upon the Qwen3-32B architecture, EduQwen employs a three-stage optimization framework that integrates reinforcement learning with supervised fine-tuning. The approach systematically enhances pedagogical proficiency through progressive difficulty training, focused learning on challenging samples, extended reasoning generation, and difficulty-weighted synthetic data augmentation. Experimental results demonstrate that EduQwen establishes a new state-of-the-art on the CDPK benchmark and significantly outperforms larger closed-source systems such as Gemini-3 Pro in interactive teaching evaluations, marking the first instance where a medium-scale open-source model achieves comprehensive superiority in educational performance.
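The summary mentions progressive difficulty training as part of the RL stage. The paper does not specify its schedule, so the following is only a minimal sketch of the general idea, under the assumption that each training example carries a scalar difficulty score and that training proceeds through easiest-to-hardest stages; the function name, scoring, and stage count are illustrative, not the authors' implementation.

```python
def curriculum_stages(examples, difficulties, n_stages=3):
    """Partition examples into stages of increasing difficulty.

    A sketch of progressive-difficulty training: sort examples by an
    assumed per-example difficulty score and split the sorted order
    into roughly equal stages, to be trained from easiest to hardest.
    """
    # Indices sorted from lowest to highest difficulty.
    order = sorted(range(len(examples)), key=lambda i: difficulties[i])
    stage_size = -(-len(order) // n_stages)  # ceiling division
    return [
        [examples[i] for i in order[start : start + stage_size]]
        for start in range(0, len(order), stage_size)
    ]
```

An RL loop built on this would run its usual optimization within each stage before advancing, so the policy sees hard pedagogical prompts only after mastering easier ones.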
📝 Abstract
We present a multi-stage optimization strategy combining reinforcement learning (RL) and supervised fine-tuning (SFT) to enhance the pedagogical knowledge of large language models (LLMs), instantiated by EduQwen 32B-RL1, EduQwen 32B-SFT, and an optional third-stage model, EduQwen 32B-SFT-RL2: (1) RL optimization with progressive difficulty training, focused learning on challenging examples, and extended reasoning rollouts; (2) a subsequent SFT phase that uses the RL-trained model to synthesize high-quality training data with difficulty-weighted sampling; and (3) an optional second round of RL optimization. The EduQwen models form an application-driven family of open-source pedagogical LLMs built on a dense Qwen3-32B backbone. They achieve accuracy on the Cross-Domain Pedagogical Knowledge (CDPK) Benchmark that establishes new state-of-the-art (SOTA) results on the interactive Pedagogy Benchmark Leaderboard, surpassing significantly larger proprietary systems such as the previous benchmark leader, Gemini-3 Pro. These dense 32-billion-parameter models demonstrate that domain-specialized optimization can turn mid-sized open-source LLMs into genuine pedagogical domain experts that outperform much larger general-purpose systems, while preserving the transparency, customizability, and cost-efficiency required for responsible educational AI deployment.
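The SFT stage described above relies on difficulty-weighted sampling of synthetic data. The abstract gives no formula, so the sketch below assumes one simple scheme: raise each example's difficulty score (in [0, 1]) to a power `alpha` and normalize, so harder examples are over-represented in the SFT mix. Both the power weighting and the function names are assumptions for illustration.

```python
import random

def difficulty_weights(difficulties, alpha=2.0):
    """Turn per-example difficulty scores (0..1) into sampling weights.

    Higher-difficulty examples receive proportionally more weight.
    The power `alpha` controls how strongly sampling is biased toward
    hard examples; this exact scheme is an assumption, as the paper
    only states that sampling is difficulty-weighted.
    """
    raw = [d ** alpha for d in difficulties]
    total = sum(raw)
    return [r / total for r in raw]

def sample_sft_batch(examples, difficulties, k, alpha=2.0, seed=0):
    """Draw k examples with replacement, biased toward hard ones."""
    rng = random.Random(seed)
    weights = difficulty_weights(difficulties, alpha)
    return rng.choices(examples, weights=weights, k=k)
```

With `alpha=2.0`, an example scored 0.9 is sampled roughly twenty times as often as one scored 0.2; setting `alpha=0` recovers uniform sampling.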
Problem

Research questions and friction points this paper is trying to address.

pedagogical knowledge
open-source LLMs
education
cross-domain
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement Learning
Supervised Fine-Tuning
Pedagogical Knowledge
Open-Source LLMs
Multi-stage Optimization
👥 Authors
Navan Preet Singh
Forta, Houston, TX
Xiaokun Wang
Nanjing University
Anurag Garikipati
Incept Labs, Houston, TX
Madalina Ciobanu
Incept Labs, Houston, TX
Qingqing Mao
Incept Labs, Houston, TX; Titan Holdings, San Francisco, CA
Ritankar Das
Incept Labs, Houston, TX; Titan Holdings, San Francisco, CA