🤖 AI Summary
This work addresses the challenge of aligning large language models (LLMs) with educational objectives—such as the Zone of Proximal Development (ZPD)—in long-horizon learning path recommendation, where expert demonstrations are scarce, multiple objectives conflict, and feedback is sparse and delayed. To tackle these issues, the authors propose IB-GRPO, a novel framework that introduces indicator-based group relative policy optimization to educational recommendation. IB-GRPO leverages genetic algorithms to generate hybrid expert demonstrations, employs supervised fine-tuning to warm-start the LLM, and incorporates a within-session ZPD alignment score alongside an Iε⁺ dominance indicator to enable Pareto-optimal multi-objective optimization without manual scalarization. Evaluated on the ASSIST09 and Junyi datasets using the KES simulator and Qwen2.5-7B as the backbone model, IB-GRPO significantly outperforms existing reinforcement learning and LLM baselines.
📝 Abstract
Learning Path Recommendation (LPR) aims to generate personalized sequences of learning items that maximize long-term learning effect while respecting pedagogical principles and operational constraints. Although large language models (LLMs) offer rich semantic understanding for free-form recommendation, applying them to long-horizon LPR is challenging due to (i) misalignment with pedagogical objectives such as the Zone of Proximal Development (ZPD) under sparse, delayed feedback, (ii) scarce and costly expert demonstrations, and (iii) multi-objective interactions among learning effect, difficulty scheduling, length controllability, and trajectory diversity. To address these issues, we propose IB-GRPO (Indicator-Based Group Relative Policy Optimization), an indicator-guided alignment approach for LLM-based LPR. To mitigate data scarcity, we construct hybrid expert demonstrations via Genetic Algorithm search and teacher RL agents and warm-start the LLM with supervised fine-tuning. Building on this warm-start, we design a within-session ZPD alignment score for difficulty scheduling. IB-GRPO then uses the $I_{\epsilon+}$ dominance indicator to compute group-relative advantages over multiple objectives, avoiding manual scalarization and improving Pareto trade-offs. Experiments on ASSIST09 and Junyi using the KES simulator with a Qwen2.5-7B backbone show consistent improvements over representative RL and LLM baselines.
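The abstract states that IB-GRPO uses the $I_{\epsilon+}$ dominance indicator to compute group-relative advantages over multiple objectives, but does not spell out the formula. Below is a minimal sketch of one plausible instantiation: the standard additive ε-indicator (as defined in the indicator-based evolutionary algorithm literature) aggregated with an IBEA-style exponential fitness, then standardized within the group in the manner of GRPO's group baseline. The function names, the κ temperature, and the per-group min–max normalization are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def eps_indicator(a, b):
    """Additive epsilon indicator I_eps+(a, b) for maximization:
    the smallest eps such that shifting candidate a by eps makes it
    weakly dominate candidate b on every objective."""
    return float(np.max(b - a))

def group_relative_advantages(scores, kappa=0.05):
    """IBEA-style group-relative advantage for each sampled candidate.

    scores: (G, M) array of G candidate learning paths scored on M
    objectives (higher is better), e.g. learning effect, ZPD alignment,
    length controllability, and trajectory diversity.
    """
    scores = np.asarray(scores, dtype=float)
    # Normalize each objective to [0, 1] within the group so no single
    # objective dominates the indicator scale.
    lo, hi = scores.min(axis=0), scores.max(axis=0)
    norm = (scores - lo) / np.where(hi > lo, hi - lo, 1.0)
    G = len(norm)
    fitness = np.zeros(G)
    for i in range(G):
        for j in range(G):
            if i != j:
                # A candidate that other group members can only
                # epsilon-dominate with large eps loses little fitness;
                # a dominated candidate accumulates large penalties.
                fitness[i] -= np.exp(-eps_indicator(norm[j], norm[i]) / kappa)
    # Standardize within the group, analogous to GRPO's group-relative
    # baseline, so advantages are zero-mean across the sampled paths.
    return (fitness - fitness.mean()) / (fitness.std() + 1e-8)
```

Because the indicator compares candidates pairwise on the normalized objective vectors, Pareto-dominant paths in a sampled group receive the largest advantages without any hand-tuned scalarization weights, which is the property the abstract attributes to the $I_{\epsilon+}$-based advantage.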