Reinforcing the Diffusion Chain of Lateral Thought with Diffusion Language Models

📅 2025-05-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of chain-of-thought (CoT) reasoning (its reliance on a rigid linear structure and strict causal ordering) by proposing the Diffusion Chain of Lateral Thought (DCoLT). DCoLT reformulates the reverse diffusion process of diffusion language models as an implicit, non-linear, bidirectional, and order-agnostic lateral reasoning chain, optimized end-to-end with outcome-based, sequence-level reinforcement learning. Each diffusion step is treated as a "lateral thinking" action, decoupling intermediate reasoning from syntactic and temporal constraints. DCoLT is instantiated on two representative diffusion language models: the continuous-time discrete diffusion model SEDD, whose concrete score induces a probabilistic policy over diffusion steps, and the discrete-time masked model LLaDA, augmented with a Plackett–Luce-based Unmasking Policy Module (UPM) that learns the order in which tokens are predicted and unmasked. Evaluated on GSM8K, MATH, MBPP, and HumanEval, DCoLT-reinforced LLaDA achieves absolute accuracy gains of 9.8%, 5.7%, 11.4%, and 19.5%, respectively, surpassing DLMs trained with supervised fine-tuning, RL, or both, while using only publicly available data and 16 H800 GPUs.
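The Plackett–Luce unmasking policy mentioned above can be illustrated with a short sketch (not from the paper; the function names and the Gumbel-top-k sampling shortcut are illustrative assumptions): each masked position gets a score, an unmasking order is sampled from the Plackett–Luce distribution those scores define, and the log-probability of that order is what an RL objective can differentiate.

```python
import math
import random

def plackett_luce_sample(scores, k):
    """Sample k positions without replacement from the Plackett-Luce model
    defined by `scores`, using the Gumbel-top-k trick: perturb each score
    with Gumbel noise and take the k largest."""
    keys = [s - math.log(-math.log(random.random())) for s in scores]
    return sorted(range(len(scores)), key=lambda i: -keys[i])[:k]

def plackett_luce_log_prob(scores, order):
    """Log-probability of `order` under Plackett-Luce: at each step, a
    softmax over the positions not yet chosen."""
    remaining = list(range(len(scores)))
    log_prob = 0.0
    for i in order:
        log_z = math.log(sum(math.exp(scores[j]) for j in remaining))
        log_prob += scores[i] - log_z
        remaining.remove(i)
    return log_prob
```

The Gumbel-top-k trick is a standard, exact way to sample from a Plackett–Luce ranking without computing the sequential softmaxes explicitly.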

📝 Abstract
We introduce the *Diffusion Chain of Lateral Thought (DCoLT)*, a reasoning framework for diffusion language models. DCoLT treats each intermediate step in the reverse diffusion process as a latent "thinking" action and optimizes the entire reasoning trajectory to maximize the reward on the correctness of the final answer with outcome-based Reinforcement Learning (RL). Unlike traditional Chain-of-Thought (CoT) methods that follow a causal, linear thinking process, DCoLT allows bidirectional, non-linear reasoning with no strict rule on grammatical correctness amid its intermediate steps of thought. We implement DCoLT on two representative Diffusion Language Models (DLMs). First, we choose SEDD as a representative continuous-time discrete diffusion model, where its concrete score derives a probabilistic policy to maximize the RL reward over the entire sequence of intermediate diffusion steps. We further consider the discrete-time masked diffusion language model, LLaDA, and find that the order in which tokens are predicted and unmasked plays an essential role; this order is optimized as an RL action through the ranking-based Unmasking Policy Module (UPM) defined by the Plackett–Luce model. Experiments on both math and code generation tasks show that, using only public data and 16 H800 GPUs, DCoLT-reinforced DLMs outperform other DLMs trained by SFT or RL, or even both. Notably, DCoLT-reinforced LLaDA boosts its reasoning accuracy by +9.8%, +5.7%, +11.4%, and +19.5% on GSM8K, MATH, MBPP, and HumanEval, respectively.
Problem

Research questions and friction points this paper is trying to address.

Enhancing reasoning in diffusion language models with non-linear, bidirectional thought processes.
Optimizing reasoning trajectories using outcome-based Reinforcement Learning for accuracy.
Improving performance in math and code generation tasks with DCoLT.
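As a concrete reading of the second point, the outcome-based objective can be sketched as a REINFORCE-style loss (a minimal illustration, not the paper's exact estimator): the only reward signal is the correctness of the final answer, and it weights the summed log-probabilities of every action taken along the reverse-diffusion trajectory.

```python
def trajectory_loss(step_log_probs, reward):
    """REINFORCE-style sequence-level loss: one scalar outcome reward
    (e.g. 1.0 if the final answer is correct, 0.0 otherwise) multiplies
    the sum of per-step action log-probabilities; minimizing this loss
    raises the probability of whole trajectories that end correctly."""
    return -reward * sum(step_log_probs)
```

Because the reward is attached to the whole trajectory rather than to individual steps, no intermediate step needs to be grammatical or correct on its own.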
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses outcome-based Reinforcement Learning for optimization
Bidirectional, non-linear reasoning in diffusion steps
Implements Unmasking Policy Module for token prediction
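A simplified decode loop shows where such a module sits (a sketch under assumptions; `model_fill` and `upm_score` are hypothetical stand-ins for the denoiser and the UPM, not the paper's interfaces): at each reverse step the model predicts all masked positions, and the ranking decides which few to commit.

```python
MASK = "<mask>"

def masked_decode(model_fill, upm_score, length, steps):
    """Sketch of masked-diffusion decoding with a ranking-based unmasking
    policy. model_fill(tokens) returns a candidate token for every position;
    upm_score(tokens, i) scores how ready position i is to be unmasked."""
    tokens = [MASK] * length
    per_step = max(1, length // steps)
    while MASK in tokens:
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        # rank masked positions by the unmasking policy, highest first
        masked.sort(key=lambda i: -upm_score(tokens, i))
        candidates = model_fill(tokens)
        # commit only the top-ranked positions; the rest stay masked
        for i in masked[:per_step]:
            tokens[i] = candidates[i]
    return tokens
```

The order of commits is exactly the degree of freedom the UPM learns; a left-to-right order would recover conventional autoregressive-style decoding as a special case.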