🤖 AI Summary
This work investigates the theoretical capacity limits of single-head softmax attention in learning $k$-bit Boolean functions (e.g., $\mathrm{AND}$, $\mathrm{OR}$, and their noisy variants), where $k = \Theta(d)$ and the model must identify the $k$ relevant bits from $d$-dimensional inputs. Methodologically, it employs a minimalist architecture (no feed-forward networks, no multi-layer stacking) trained via teacher forcing with single-step gradient updates, explicitly excluding chain-of-thought reasoning. Theoretically, it provides the first rigorous proof that such a minimalist attention model can efficiently solve these Boolean tasks under teacher forcing; conversely, it establishes a fundamental impossibility result under standard supervised training. The work precisely characterizes the computational complexity boundary between solvability and unsolvability, demonstrating that single-step supervised optimization suffices for Boolean logic reasoning, challenging prevailing assumptions that Transformers require deep architectures or implicit reasoning mechanisms.
📝 Abstract
We study the computational limits of learning $k$-bit Boolean functions (specifically, $\mathrm{AND}$, $\mathrm{OR}$, and their noisy variants) using a minimalist single-head softmax-attention mechanism, where $k=\Theta(d)$ relevant bits are selected from $d$ inputs. We show that these simple $\mathrm{AND}$ and $\mathrm{OR}$ functions are unsolvable with a single-head softmax-attention mechanism alone. However, with teacher forcing, the same minimalist attention is capable of solving them. These findings offer two key insights: Architecturally, solving these Boolean tasks requires only minimalist attention, without deep Transformer blocks or FFNs. Methodologically, one gradient descent update with supervision suffices and replaces the multi-step Chain-of-Thought (CoT) reasoning scheme of [Kim and Suzuki, ICLR 2025] for solving Boolean problems. Together, the bounds expose a fundamental gap between what this minimal architecture achieves under ideal supervision and what is provably impossible under standard training.
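To make the setup concrete, here is a minimal sketch of the learning problem: a single-head softmax attention (no FFN, no stacking) attends over the $d$ input bits and is updated with one supervised gradient step on a squared loss. This is an illustration only; the per-position score parametrization `w`, the batch size, the loss, and the step size are assumptions of this sketch, not the paper's exact construction or its teacher-forcing protocol.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 16, 8                                # input dimension; k = Theta(d) relevant bits
S = rng.choice(d, size=k, replace=False)    # hidden relevant subset (unknown to the learner)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_predict(w, x):
    """Single-head softmax attention, simplified to one score per position."""
    a = softmax(w)      # attention distribution over the d input positions
    return a @ x        # attended average of the attended bit values

def target(x):
    """AND over the relevant bits: 1 iff all bits indexed by S are 1."""
    return float(x[S].all())

# One gradient step on squared loss over a batch (a sketch of the
# "single gradient update" regime; hyperparameters are illustrative).
w = np.zeros(d)
X = rng.integers(0, 2, size=(512, d)).astype(float)
eta = 1.0
grad = np.zeros(d)
for x in X:
    a = softmax(w)
    pred = a @ x
    err = pred - target(x)
    # d pred / d w_j = a_j * (x_j - pred)  (softmax Jacobian applied to x)
    grad += err * a * (x - pred)
w -= eta * grad / len(X)
```

The point of the sketch is the architecture's bareness: the only trainable object is the attention score vector, so any success or failure on the $\mathrm{AND}$ task is attributable to the attention mechanism and the supervision signal alone.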