Application-Driven Pedagogical Knowledge Optimization of Open-Source LLMs via Reinforcement Learning and Supervised Fine-Tuning

📅 2026-04-07
📈 Citations: 0 · Influential: 0
🤖 AI Summary
This work addresses the limited cross-disciplinary teaching capabilities of open-source large language models in educational settings by introducing the EduQwen series. Built upon the Qwen3-32B architecture, EduQwen employs a three-stage optimization framework that integrates reinforcement learning with supervised fine-tuning. The approach systematically enhances pedagogical proficiency through progressive difficulty training, focused learning on challenging samples, extended reasoning generation, and difficulty-weighted synthetic data augmentation. Experimental results demonstrate that EduQwen establishes a new state-of-the-art on the CDPK benchmark and significantly outperforms larger closed-source systems such as Gemini-3 Pro in interactive teaching evaluations, marking the first instance where a medium-scale open-source model achieves comprehensive superiority in educational performance.
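The summary mentions progressive difficulty training as part of the RL stage. The paper does not specify its schedule, so the following is only a minimal sketch of the general idea, under the assumption that each training example carries a scalar difficulty score and that training proceeds through easiest-to-hardest stages; the function name, scoring, and stage count are illustrative, not the authors' implementation.

```python
def curriculum_stages(examples, difficulties, n_stages=3):
    """Partition examples into stages of increasing difficulty.

    A sketch of progressive-difficulty training: sort examples by an
    assumed per-example difficulty score and split the sorted order
    into roughly equal stages, to be trained from easiest to hardest.
    """
    # Indices sorted from lowest to highest difficulty.
    order = sorted(range(len(examples)), key=lambda i: difficulties[i])
    stage_size = -(-len(order) // n_stages)  # ceiling division
    return [
        [examples[i] for i in order[start : start + stage_size]]
        for start in range(0, len(order), stage_size)
    ]
```

An RL loop built on this would run its usual optimization within each stage before advancing, so the policy sees hard pedagogical prompts only after mastering easier ones.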
📝 Abstract
We present a multi-stage optimization strategy combining reinforcement learning (RL) and supervised fine-tuning (SFT) to enhance the pedagogical knowledge of large language models (LLMs), instantiated by EduQwen 32B-RL1, EduQwen 32B-SFT, and an optional third-stage model, EduQwen 32B-SFT-RL2: (1) RL optimization with progressive difficulty training, focused learning on challenging examples, and extended reasoning rollouts; (2) a subsequent SFT phase that uses the RL-trained model to synthesize high-quality training data with difficulty-weighted sampling; and (3) an optional second round of RL optimization. The EduQwen models form an application-driven family of open-source pedagogical LLMs built on a dense Qwen3-32B backbone. They achieve accuracy on the Cross-Domain Pedagogical Knowledge (CDPK) Benchmark that establishes new state-of-the-art (SOTA) results on the interactive Pedagogy Benchmark Leaderboard, surpassing significantly larger proprietary systems such as the previous benchmark leader, Gemini-3 Pro. These dense 32-billion-parameter models demonstrate that domain-specialized optimization can turn mid-sized open-source LLMs into genuine pedagogical domain experts that outperform much larger general-purpose systems, while preserving the transparency, customizability, and cost-efficiency required for responsible educational AI deployment.
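The SFT stage described above relies on difficulty-weighted sampling of synthetic data. The abstract gives no formula, so the sketch below assumes one simple scheme: raise each example's difficulty score (in [0, 1]) to a power `alpha` and normalize, so harder examples are over-represented in the SFT mix. Both the power weighting and the function names are assumptions for illustration.

```python
import random

def difficulty_weights(difficulties, alpha=2.0):
    """Turn per-example difficulty scores (0..1) into sampling weights.

    Higher-difficulty examples receive proportionally more weight.
    The power `alpha` controls how strongly sampling is biased toward
    hard examples; this exact scheme is an assumption, as the paper
    only states that sampling is difficulty-weighted.
    """
    raw = [d ** alpha for d in difficulties]
    total = sum(raw)
    return [r / total for r in raw]

def sample_sft_batch(examples, difficulties, k, alpha=2.0, seed=0):
    """Draw k examples with replacement, biased toward hard ones."""
    rng = random.Random(seed)
    weights = difficulty_weights(difficulties, alpha)
    return rng.choices(examples, weights=weights, k=k)
```

With `alpha=2.0`, an example scored 0.9 is sampled roughly twenty times as often as one scored 0.2; setting `alpha=0` recovers uniform sampling.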
Problem

Research questions and friction points this paper is trying to address.

pedagogical knowledge
open-source LLMs
education
cross-domain
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement Learning
Supervised Fine-Tuning
Pedagogical Knowledge
Open-Source LLMs
Multi-stage Optimization
👥 Authors
Navan Preet Singh
Forta, Houston, TX
Xiaokun Wang
Nanjing University
Anurag Garikipati
Incept Labs, Houston, TX
Madalina Ciobanu
Incept Labs, Houston, TX
Qingqing Mao
Incept Labs, Houston, TX; Titan Holdings, San Francisco, CA
Ritankar Das
Incept Labs, Houston, TX; Titan Holdings, San Francisco, CA