🤖 AI Summary
Reinforcement learning on reasoning-intensive tasks is often hindered by insufficient exploration and by reward hacking triggered by reward-correlated segments in expert demonstrations. This work proposes Semantic Masked Expert Policy Optimization (SMEPO), a fine-grained masking mechanism that targets reward-relevant semantic spans along the critical path of an expert trace. Without altering the reward function, SMEPO selectively masks these spans while preserving the expert's problem decomposition and planning structure, so the model must reconstruct the essential reasoning steps itself through fill-in-the-blank style inference. This approach improves exploration and mitigates reward hacking, achieving up to a 3.2-point accuracy improvement over GRPO and reducing training time by up to 4.2× across mathematical reasoning, code generation, and agentic search tasks.
📝 Abstract
Reinforcement learning with verifiable rewards (RLVR) has become a powerful paradigm for improving language models on reasoning-intensive tasks, but its effectiveness is often limited by exploration: models frequently fail on hard problems, leaving little useful reward signal. External expert traces offer a natural source of guidance, yet they may also expose reward-relevant content along the critical path to the verifier target, such as final answers, intermediate values, executable implementations, or answer-related entities. This content can create an unintended reward hacking channel, allowing the policy to obtain reward by copying the trace rather than learning the underlying reasoning or agentic behavior. Existing guided-RL methods reduce this risk by using partial trajectories, but they mainly control, heuristically, how much expert information is shown rather than which parts should be hidden. To address this, we propose Semantic Masked Expert Policy Optimization (SMEPO), a fine-grained semantic masking strategy for expert-guided RLVR. Instead of truncating traces coarsely or revealing them unchanged, SMEPO masks reward-relevant semantic spans along the critical path while preserving the expert's decomposition, plan, and procedural structure. This turns hard problems from reasoning from scratch into a fill-in-the-blank process: the policy can follow the expert's problem-solving route, but must still reconstruct the missing values, code, or entities by itself. SMEPO is simple to apply and requires no changes to the reward function or RL objective. Across diverse domains, including math, code, and agentic search, SMEPO improves accuracy by up to 3.2 points over GRPO and reduces training time by up to 4.2×. The code is available at https://github.com/mit-han-lab/SMEPO.
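To make the fill-in-the-blank idea concrete, the following is a minimal sketch of span masking on a toy math trace. It is not the authors' implementation: the function name `mask_reward_spans`, the `[MASK]` placeholder, and the assumption that the reward-relevant spans are already known and passed in as strings are all hypothetical, since the abstract does not say how spans are identified or how the mask is rendered.

```python
import re

# Hypothetical placeholder; the abstract does not state the actual mask format.
MASK_TOKEN = "[MASK]"


def mask_reward_spans(trace: str, spans: list[str], mask_token: str = MASK_TOKEN) -> str:
    """Replace each given reward-relevant span in an expert trace with a mask token.

    Assumption: `spans` (final answer, intermediate values, ...) is supplied by
    some upstream step that this sketch does not model.
    """
    candidates = {s for s in spans if s}
    if not candidates:
        return trace
    # Longest first, so "60" is tried before "6" at the same position.
    ordered = sorted(candidates, key=len, reverse=True)
    alternation = "|".join(re.escape(s) for s in ordered)
    # Lookarounds stop a span from matching inside a longer word or number.
    pattern = re.compile(rf"(?<!\w)(?:{alternation})(?!\w)")
    # A function replacement avoids any escape processing of the mask token.
    return pattern.sub(lambda _match: mask_token, trace)


# Toy expert trace: the plan and the operations stay visible, the values do not.
expert_trace = (
    "Plan: compute the total cost, then subtract the discount.\n"
    "Total cost: 4 * 15 = 60.\n"
    "Discount: 10% of 60 is 6.\n"
    "Final answer: 60 - 6 = 54."
)
masked_trace = mask_reward_spans(expert_trace, ["60", "6", "54"])
print(masked_trace)
```

In this toy case the plan line and the inputs (`4 * 15`, `10%`) survive, while every occurrence of the intermediate values and the final answer is replaced by `[MASK]`, so a policy conditioned on the masked trace can follow the route but still has to compute the values itself.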