Minimalist Softmax Attention Provably Learns Constrained Boolean Functions

📅 2025-05-26
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work investigates the theoretical capacity limits of single-head softmax attention in learning $k$-bit Boolean functions (e.g., AND, OR, and their noisy variants), where $k = \Theta(d)$ and the model must identify the $k$ relevant bits among $d$-dimensional inputs. Methodologically, it employs a minimalist architecture, with no feed-forward networks and no multi-layer stacking, trained via teacher forcing with a single gradient update and explicitly excluding chain-of-thought reasoning. Theoretically, it provides the first rigorous proof that such a minimalist attention model can efficiently solve these Boolean tasks under teacher forcing; conversely, it establishes a fundamental impossibility result under standard supervised training. Together, these bounds precisely characterize the boundary between solvability and unsolvability, showing that a single teacher-forced gradient step suffices for these Boolean reasoning tasks and challenging the prevailing assumption that Transformers require deep architectures or implicit reasoning mechanisms for them.
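For intuition, here is a minimal NumPy sketch of the model class in question: one softmax-attention head, no feed-forward blocks, no stacking. The per-position logits `w` and the threshold readout are illustrative assumptions, not the paper's exact parameterization.

```python
# A minimal sketch of a single softmax-attention head on Boolean inputs.
# The logits `w` and threshold readout are illustrative assumptions.
import numpy as np

def softmax(z):
    z = z - z.max()            # numerical stability
    e = np.exp(z)
    return e / e.sum()

def attention_predict(x, w, threshold):
    """x: (d,) Boolean input; w: (d,) attention logits over positions."""
    a = softmax(w)             # attention weights over the d positions
    mixed = a @ x              # convex combination of the input bits
    return int(mixed > threshold)

# Toy usage: d = 8 bits, the k = 4 relevant positions are {0, 1, 2, 3}.
# If the head attends (almost) only to the relevant bits, the mixture equals
# the fraction of relevant bits that are on, so a threshold above (k-1)/k
# reads out AND; a small threshold below 1/k would read out OR instead.
d, relevant = 8, [0, 1, 2, 3]
k = len(relevant)
w = np.full(d, -10.0)
w[relevant] = 10.0             # an "ideal" head concentrated on relevant bits
x = np.ones(d, dtype=int)
print(attention_predict(x, w, threshold=1 - 1 / (2 * k)))  # 1: all relevant on
x[2] = 0
print(attention_predict(x, w, threshold=1 - 1 / (2 * k)))  # 0: one relevant off
```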

📝 Abstract
We study the computational limits of learning $k$-bit Boolean functions (specifically, $\mathrm{AND}$, $\mathrm{OR}$, and their noisy variants), using a minimalist single-head softmax-attention mechanism, where $k=\Theta(d)$ relevant bits are selected from $d$ inputs. We show that these simple $\mathrm{AND}$ and $\mathrm{OR}$ functions are unsolvable with a single-head softmax-attention mechanism alone. However, with teacher forcing, the same minimalist attention is capable of solving them. These findings offer two key insights: Architecturally, solving these Boolean tasks requires only minimalist attention, without deep Transformer blocks or FFNs. Methodologically, one gradient descent update with supervision suffices and replaces the multi-step Chain-of-Thought (CoT) reasoning scheme of [Kim and Suzuki, ICLR 2025] for solving Boolean problems. Together, the bounds expose a fundamental gap between what this minimal architecture achieves under ideal supervision and what is provably impossible under standard training.
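For concreteness, the target class admits a standard formalization, assuming inputs encoded as $x \in \{0,1\}^d$ and an unknown relevant set $S$; the paper's exact encoding may differ:

```latex
% k-bit AND/OR over an unknown relevant set S \subseteq [d],
% with |S| = k = \Theta(d) and inputs x \in \{0,1\}^d.
\mathrm{AND}_S(x) \;=\; \bigwedge_{i \in S} x_i \;=\; \prod_{i \in S} x_i,
\qquad
\mathrm{OR}_S(x) \;=\; \bigvee_{i \in S} x_i \;=\; 1 - \prod_{i \in S} (1 - x_i).
```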
Problem

Research questions and friction points this paper is trying to address.

Studies learning k-bit Boolean functions with minimalist attention
Shows single-head softmax-attention cannot solve AND/OR alone
Demonstrates teacher forcing enables solution with minimalist attention
Innovation

Methods, ideas, or system contributions that make the work stand out.

Minimalist single-head softmax-attention mechanism
Teacher forcing enables Boolean function learning
One gradient descent update replaces CoT reasoning (see the sketch after this list)
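The single-update claim can be made concrete with a schematic: one gradient step on the attention logits under teacher-forced labels. The squared loss, the per-position-logit parameterization, and the step size below are illustrative assumptions rather than the paper's construction.

```python
# Schematic of "one gradient descent update with supervision" on the attention
# logits. Loss, parameterization, and step size are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, k, N, eta = 8, 3, 4096, 5.0
S = rng.choice(d, size=k, replace=False)      # unknown relevant set

# Teacher-forced targets: the true AND over the relevant bits of each input.
X = rng.integers(0, 2, size=(N, d)).astype(float)
y = X[:, S].prod(axis=1)

w = np.zeros(d)                               # flat attention logits at init
a = np.exp(w) / np.exp(w).sum()               # softmax weights (uniform here)
pred = X @ a                                  # attended mixture per example
r = pred - y                                  # residuals of the squared loss

# Gradient of the mean squared loss through the softmax:
#   dL/dw_i = (2 a_i / N) * sum_n r_n * (X_{n,i} - pred_n)
grad = 2.0 * a * ((X - pred[:, None]).T @ r) / N
w = w - eta * grad                            # the single update

mask = np.zeros(d, dtype=bool)
mask[S] = True
print("mean logit on relevant bits:  ", w[mask].mean())
print("mean logit on irrelevant bits:", w[~mask].mean())
# After one supervised step the relevant logits should sit above the
# irrelevant ones, i.e. attention already starts concentrating on the k bits.
```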
🔎 Similar Papers
2024-06-03 · International Conference on Machine Learning · Citations: 0
J. Hu (Center for Foundation Models and Generative AI & Department of Computer Science, Northwestern University, USA)
Xiwen Zhang
Maojiang Su (Center for Foundation Models and Generative AI & Department of Computer Science, Northwestern University, USA)
Zhao Song (University of California, Berkeley, USA)
Han Liu (Center for Foundation Models and Generative AI & Department of Computer Science & Department of Statistics and Data Science, Northwestern University, USA)