VAM: Verbalized Action Masking for Controllable Exploration in RL Post-Training -- A Chess Case Study

πŸ“… 2026-02-18
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses insufficient exploration and repetitive behaviors in large language models during reinforcement learning post-training, which stem from sparse feedback and vast action spaces. To mitigate these issues, the authors propose Verbalized Action Masking (VAM), a mechanism that embeds an explicit action mask into the prompt in natural language and constrains the model to output an action from the masked set. On this interface, VAM combines prompt-level masking, dynamic mask updates, and iterative action-space pruning with resampling to enable fine-grained, controllable exploration over the model's action selection. Evaluated on held-out chess puzzles and full-game play measured by average centipawn loss (ACPL), the method outperforms strong baselines in both an engine-play and a fixed-dataset training regime, improving learning efficiency and final performance.

πŸ“ Abstract
Exploration remains a key bottleneck for reinforcement learning (RL) post-training of large language models (LLMs), where sparse feedback and large action spaces can lead to premature collapse into repetitive behaviors. We propose Verbalized Action Masking (VAM), which verbalizes an action mask in the prompt and enforces that the model outputs an action from the masked set. Building on this interface, we introduce iterative action-space pruning: if the target action is not sampled, we remove valid sampled actions from the mask and resample under the reduced candidate set, repeating until the target is sampled or a fixed budget is exhausted. We study VAM in chess and evaluate it under two training regimes: an engine-play regime that generates states via play against an engine opponent and a fixed-dataset regime that trains from a fixed dataset of positions with verifier scores. Across held-out chess puzzles and full-game play measured by average centipawn loss (ACPL), VAM improves learning efficiency and final performance over strong baselines, highlighting verbalized masking as a practical mechanism for controllable exploration in LLM RL post-training.
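The iterative action-space pruning loop from the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names are invented for this sketch, and the LLM call is stood in for by uniform sampling over the masked set.

```python
import random


def verbalize_mask(position: str, legal_moves: list[str]) -> str:
    """Embed the action mask in the prompt as natural language."""
    return (
        f"Position (FEN): {position}\n"
        f"Choose exactly one move from this set: {', '.join(legal_moves)}\n"
        "Answer with a single move."
    )


def sample_move(prompt: str, candidates: list[str]) -> str:
    # Stand-in for an LLM sampling call; here, uniform over the masked set.
    return random.choice(candidates)


def iterative_pruning(position: str, legal_moves: list[str],
                      target: str, budget: int = 8) -> tuple[str, int]:
    """Resample under a shrinking verbalized mask until the target move
    is sampled or the budget is exhausted.

    Each time a valid non-target move is sampled, it is removed from the
    mask, so later attempts draw from a reduced candidate set.
    Returns (last sampled move, attempts used).
    """
    mask = list(legal_moves)
    move = mask[0]
    for attempt in range(1, budget + 1):
        prompt = verbalize_mask(position, mask)
        move = sample_move(prompt, mask)
        if move == target:
            return move, attempt
        if move in mask and len(mask) > 1:
            mask.remove(move)  # prune the sampled non-target action
    return move, budget
```

With a budget at least as large as the number of legal moves, the loop is guaranteed to reach the target, since every failed attempt shrinks the mask by one valid action.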
Problem

Research questions and friction points this paper is trying to address.

exploration
reinforcement learning
large language models
action space
post-training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Verbalized Action Masking
controllable exploration
iterative action-space pruning
LLM reinforcement learning
post-training
πŸ”Ž Similar Papers
No similar papers found.