Shaping Explanations: Semantic Reward Modeling with Encoder-Only Transformers for GRPO

📅 2025-09-16
📈 Citations: 0 · Influential: 0
🤖 AI Summary
Large language models (LLMs) often struggle to generate explanatory text that is simultaneously semantically accurate and pedagogically sound. Method: we propose a lightweight semantic reward modeling approach in which a small encoder-only Transformer serves as the reward model, computing dense rewards from the cosine similarity between generated and reference explanations, replacing brittle keyword-based metrics (e.g., ROUGE) as well as costly LLM-based evaluators. This reward model is integrated into the Group Relative Policy Optimisation (GRPO) framework, preceded by domain-adaptive continual pretraining (CPT) and supervised fine-tuning (SFT). Results: on explanation generation for the Italian medical school entrance exam, our method significantly improves explanation faithfulness and clarity over a strong SFT baseline, demonstrating that effective semantic alignment can be achieved without large-scale judge models and validating both the feasibility and the efficacy of this lightweight, semantics-driven optimization strategy.
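
As a concrete illustration, here is a minimal sketch of such a semantic reward in Python, assuming a sentence-transformers encoder. The summary does not name a specific checkpoint, so the multilingual MiniLM model below is an illustrative stand-in chosen because the task is Italian-language text.

```python
# Minimal sketch of an encoder-only semantic reward, assuming the
# sentence-transformers library; the checkpoint is an assumption, not
# the paper's stated choice.
import torch
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer(
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
)

def semantic_reward(generated: list[str], references: list[str]) -> torch.Tensor:
    """Cosine similarity between each generated explanation and its reference."""
    gen_emb = encoder.encode(generated, convert_to_tensor=True, normalize_embeddings=True)
    ref_emb = encoder.encode(references, convert_to_tensor=True, normalize_embeddings=True)
    # With unit-norm embeddings, the row-wise dot product is the cosine similarity.
    return (gen_emb * ref_emb).sum(dim=-1)
```

Normalizing the embeddings makes the dot product equal the cosine similarity, so the reward is bounded in [-1, 1] and measures whole-explanation semantics rather than keyword overlap.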

📝 Abstract
While Large Language Models (LLMs) excel at generating human-like text, aligning their outputs with complex, qualitative goals like pedagogical soundness remains a significant challenge. Standard reinforcement learning techniques often rely on slow and expensive LLM-as-a-judge evaluations or on brittle, keyword-based metrics like ROUGE, which fail to capture the semantic essence of a high-quality explanation. In this work, we introduce a novel approach to reward shaping within the Group Relative Policy Optimisation (GRPO) framework. Our central contribution is the use of a small, efficient encoder-only transformer as a semantic reward model. This model provides a dense, semantically rich reward signal based on the cosine similarity between a generated explanation and a ground-truth reference, guiding the policy towards explanations that are not just factually correct but also structurally and conceptually aligned with expert reasoning. We apply this method to the task of training a model for the Italian medical-school entrance examinations, following standard domain-adaptive continued pre-training (CPT) and supervised fine-tuning (SFT). Our results demonstrate that GRPO with our proposed semantic reward significantly improves explanation faithfulness and clarity over a strong SFT baseline, showcasing the power of using lightweight encoder models for nuanced reward shaping in complex generation tasks.
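
For context on how such a reward plugs into GRPO: the algorithm needs no learned value function, because each sampled completion is scored relative to the other completions drawn for the same prompt. Below is a minimal sketch of that group-relative advantage, assuming rewards come from an encoder model like the one sketched above.

```python
# Minimal sketch of GRPO's group-relative advantage: standardize each
# completion's reward against the mean and std of its own group.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, G) semantic rewards for G completions per prompt."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    # Completions better than their group average get positive advantage.
    return (rewards - mean) / (std + eps)
```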
Problem

Research questions and friction points this paper is trying to address.

Aligning LLM outputs with pedagogical soundness goals
Overcoming brittle keyword-based evaluation metrics like ROUGE
Providing semantically rich reward signals for explanation quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Encoder-only transformer for semantic reward modeling
Cosine similarity-based dense reward signal
Integration within the Group Relative Policy Optimisation (GRPO) framework, as sketched below
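
Tying the pieces together, the sketch below shows how the encoder reward could drive a GRPO update loop. It reuses `semantic_reward` and `group_relative_advantages` from the sketches above; `policy.generate` and `grpo_step` are hypothetical stand-ins, since the page does not detail the training stack.

```python
def grpo_iteration(policy, questions, references, G=8):
    for question, reference in zip(questions, references):
        # Sample a group of G candidate explanations for one exam question.
        completions = [policy.generate(question) for _ in range(G)]  # hypothetical API
        # Score each candidate against the reference explanation.
        rewards = semantic_reward(completions, [reference] * G).unsqueeze(0)  # shape (1, G)
        # Standardize within the group, then apply one policy-gradient step.
        advantages = group_relative_advantages(rewards)
        grpo_step(policy, question, completions, advantages)  # hypothetical update
```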
Authors

Francesco Pappone, AI Sparks
Ruggero Marino Lazzaroni, University of Graz
Federico Califano, University of Twente (Automatic Control, System Theory, port-Hamiltonian systems)
Niccolò Gentile, Foyer Group
Roberto Marras, Onepix Academy