AI Summary
This work addresses the cost and complexity of aligning large language models (LLMs) for mathematical tutoring, which typically requires reinforcement learning (RL) training on multi-GPU infrastructure. Instead, the authors propose a training-free, API-based paradigm that pursues efficient, pedagogically aligned tutoring through systematic prompt optimization. They introduce five education-specific prompt optimization methods and adapt seven published methods, giving a 12-method comparison. The best configuration of each of the 12 methods surpasses the strongest RL-trained baseline (R_total = 0.633). The newly proposed ParetoGrad method achieves the best Pareto balance across post-test solve rate, leak control, and helpfulness, rather than dominating any single component. Behavioral analysis with an 82-code educational codebook shows that training-free methods rely on teaching-knowledge patterns at 2-3x the rate of RL-trained models.
Abstract
Aligning LLMs for math tutoring typically requires RL-based training with multi-GPU infrastructure. We investigate whether training-free prompt optimization (evolving only the system prompt via API calls) can serve as a practical alternative. We adapt 7 published methods and propose 5 education-specialized methods, evaluating these 12 methods under 5 conditions on 2 OOD benchmark suites. All 12 best-per-method configurations surpass the strongest RL-trained baseline (R_total = 0.633), and our ParetoGrad achieves the best Pareto balance across post-test solve rate, leak control, and helpfulness, rather than dominating any single component. Behavioral analysis with an 82-code educational codebook reveals that training-free methods rely on teaching-knowledge patterns at 2-3x the rate of RL-trained models, with a compensating ~10 percentage-point reduction in intent-level scaffolding. We also find a task-dependent reasoning mode effect consistent across training-free and RL-based paradigms. Our approach enables efficient development of pedagogically aligned LLM tutors with prompts alone and minimal compute.
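To make the notion of a "Pareto balance" across three objectives concrete, the following is a minimal sketch of Pareto dominance over candidate system prompts. It is not the authors' ParetoGrad procedure, which the abstract does not describe. The function names, the prompt labels, and all scores are hypothetical, and the sketch assumes each of the three components (solve rate, leak control, helpfulness) is scored so that higher is better.

```python
from typing import Dict, List, Tuple

# Assumed layout: (solve_rate, leak_control, helpfulness), higher is better.
Scores = Tuple[float, float, float]


def dominates(a: Scores, b: Scores) -> bool:
    """True if a is at least as good as b on every objective
    and strictly better on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates: Dict[str, Scores]) -> List[str]:
    """Return the names of candidates that no other candidate dominates."""
    return [
        name
        for name, scores in candidates.items()
        if not any(
            dominates(other_scores, scores)
            for other, other_scores in candidates.items()
            if other != name
        )
    ]


# Made-up illustrative scores, not results from the paper.
prompts = {
    "prompt_a": (0.70, 0.90, 0.60),
    "prompt_b": (0.65, 0.85, 0.55),  # worse than prompt_a on all three
    "prompt_c": (0.80, 0.70, 0.65),
}
front = pareto_front(prompts)
print(front)
```

In this toy example, prompt_a and prompt_c trade off leak control against solve rate, so neither dominates the other and both stay on the front, while prompt_b is dropped. A method that balances components in this sense need not have the top score on any single one.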