Transformers with RL or SFT Provably Learn Sparse Boolean Functions, But Differently

📅 2025-11-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the intrinsic mechanistic differences between reinforcement learning (RL) and supervised fine-tuning (SFT) in eliciting chain-of-thought (CoT) reasoning when a single-layer Transformer learns k-sparse Boolean functions such as k-PARITY, k-AND, and k-OR. Intermediate-step annotations serve as CoT-style supervision, enabling a systematic comparison of the two paradigms' learning dynamics. Theoretically and empirically, the paper shows that RL optimizes the entire reasoning chain end-to-end and simultaneously, whereas SFT relies on stepwise intermediate supervision and acquires the chain one step at a time. This constitutes the first mechanistic explanation, grounded in learning dynamics, of how RL and SFT differentially induce CoT emergence. The results confirm that both paradigms can provably learn the target functions under appropriate supervision, yet they follow fundamentally distinct, interpretable pathways. These findings offer novel insights into the training foundations of reasoning capabilities in large language models.

📝 Abstract
Transformers can acquire Chain-of-Thought (CoT) capabilities to solve complex reasoning tasks through fine-tuning. Reinforcement learning (RL) and supervised fine-tuning (SFT) are two primary approaches to this end, yet their underlying mechanisms and differences remain theoretically unclear. In this work, we examine these aspects specifically for learning $k$-sparse Boolean functions with a one-layer transformer and intermediate supervision that is akin to CoT. In particular, we consider $k$-sparse Boolean functions that can be recursively decomposed into fixed 2-sparse Boolean functions. We analyze the learning dynamics of fine-tuning the transformer via either RL or SFT with CoT to identify sufficient conditions for it to provably learn these functions. We verify that these conditions hold for three basic examples, including $k$-PARITY, $k$-AND, and $k$-OR, thus demonstrating the learnability of both approaches. Notably, we reveal that RL and SFT exhibit distinct learning behaviors: RL learns the whole CoT chain simultaneously, whereas SFT learns the CoT chain step-by-step. Overall, our findings provide theoretical insights into the underlying mechanisms of RL and SFT as well as how they differ in triggering the CoT capabilities of transformers.
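To make the setting concrete, here is a minimal sketch of the kind of target studied in the abstract: a k-sparse Boolean function (k-PARITY over a chosen support) decomposed into a chain of fixed 2-sparse steps (pairwise XOR), whose running values form the intermediate CoT-style annotations. The function and variable names are illustrative, not from the paper.

```python
def parity_cot(x, support):
    """CoT chain for k-PARITY: running XOR over the k relevant coordinates.

    Each step applies a fixed 2-sparse Boolean function (XOR) to the previous
    intermediate value and one new relevant bit; the chain of intermediate
    values is the CoT-style supervision signal, and chain[-1] is the label.
    """
    chain = [x[support[0]]]
    for i in support[1:]:
        chain.append(chain[-1] ^ x[i])  # one 2-sparse step of the recursion
    return chain

# Example: n = 6 input bits, k = 4 relevant coordinates (hypothetical support).
x = [1, 0, 1, 1, 0, 1]
support = [0, 2, 3, 5]
print(parity_cot(x, support))  # → [1, 0, 1, 0]; final bit 0 is the k-PARITY label
```

SFT-style training would supervise every entry of the chain, while RL would reward only the final entry; the paper's analysis contrasts the dynamics these two signals induce.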
Problem

Research questions and friction points this paper is trying to address.

Analyzing RL vs SFT learning dynamics for transformers
Theoretically examining sparse Boolean function learnability
Comparing stepwise vs simultaneous Chain-of-Thought acquisition
Innovation

Methods, ideas, or system contributions that make the work stand out.

RL learns the whole CoT chain simultaneously
SFT learns the CoT chain step-by-step
One-layer transformers provably learn k-sparse Boolean functions (k-PARITY, k-AND, k-OR)