🤖 AI Summary
This work addresses the block size conflict problem in multi-domain reinforcement learning (RL) post-training of diffusion large language models (dLLMs), where a single fixed block size causes interference between domains and affects the rollout trajectories used during RL optimisation. The study formalizes this conflict and proposes training each sample with its best-improved block size. It also constructs Block-R1-41K, a dataset that annotates every sample with its best-improved training block size and yields a Block Size Conflict Score for quantifying domain conflict. The accompanying benchmark, Block-R1, supports flexible single-domain and cross-domain RL post-training of dLLMs that use block-wise semi-autoregressive generation, with rollout-based RL algorithms such as GRPO. Experiments cover 13 datasets, 7 RL algorithms, and several dLLM backbones. The benchmark and dataset are publicly released.
📝 Abstract
Recently, reinforcement learning (RL) has been widely applied in the post-training of diffusion large language models (dLLMs) to enhance reasoning with block-wise semi-autoregressive generation. Block size has therefore become a vital factor in dLLMs, since it determines the granularity of parallel decoding and affects the rollout trajectories during RL optimisation, e.g., with GRPO. Instead of investigating the effect of inference-time block size on individual domains, this paper studies block size from a domain conflict perspective for dLLM RL post-training in multi-domain scenarios. The main contributions are: (1) a formulation of domain block size conflict in multi-domain RL for dLLMs, which substantially affects the post-training effectiveness of rollout-based RL methods; (2) a novel dataset, Block-R1-41K, constructed with a best-improved training block size for each sample, which also induces a Block Size Conflict Score to quantitatively measure domain conflict; (3) a new benchmark, Block-R1, for flexible RL post-training of dLLMs in both single-domain and cross-domain settings; and (4) a simple yet powerful cross-domain post-training method that uses sample-level best-improved training block sizes. Block-R1 covers extensive experiments on 13 distinct datasets, 7 recent RL algorithms, and various dLLM backbones. The benchmark is open-sourced at https://github.com/YanJiangJerry/Block-R1, with the dataset released at https://huggingface.co/datasets/dLLM-R1/Block-R1-41K.
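To make the role of block size concrete, the following is a minimal sketch of how a block size partitions a generation window in block-wise semi-autoregressive decoding: blocks are filled left to right, while positions inside one block are decoded in parallel. The function name `block_spans` and its parameters are hypothetical and chosen for illustration only; this is not code from the Block-R1 repository.

```python
from typing import List, Tuple


def block_spans(gen_length: int, block_size: int) -> List[Tuple[int, int]]:
    """Split a generation window into consecutive [start, end) blocks.

    Hypothetical helper for illustration: blocks are processed left to
    right, and the positions inside one block are decoded in parallel.
    The final block is shorter when block_size does not divide gen_length.
    """
    if gen_length <= 0 or block_size <= 0:
        raise ValueError("gen_length and block_size must be positive")
    return [
        (start, min(start + block_size, gen_length))
        for start in range(0, gen_length, block_size)
    ]


# A smaller block size gives more sequential steps over blocks (finer
# granularity); a larger one gives fewer, wider parallel blocks.
fine_grained = block_spans(gen_length=16, block_size=4)
coarse_grained = block_spans(gen_length=16, block_size=8)
```

Because the same prompt is split into different blocks under different block sizes, the rollout trajectories sampled during RL differ as well, which is why a single block size shared across domains may suit some samples better than others.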