🤖 AI Summary
This work investigates the theoretical capacity limits of single-head softmax attention in learning $k$-bit Boolean functions (e.g., $\mathrm{AND}$, $\mathrm{OR}$, and their noisy variants), where $k = \Theta(d)$ and the model must identify the $k$ relevant bits from $d$-dimensional inputs. Methodologically, it employs a minimalist architecture (no feed-forward networks, no multi-layer stacking) trained via teacher forcing with single-step gradient updates, explicitly excluding chain-of-thought reasoning. Theoretically, it provides the first rigorous proof that such a minimalist attention model can efficiently solve these Boolean tasks under teacher forcing; conversely, it establishes a fundamental impossibility result under standard supervised training. The work precisely characterizes the computational complexity boundary between solvability and unsolvability, demonstrating that single-step supervised optimization suffices for Boolean logic reasoning, challenging prevailing assumptions that Transformers require deep architectures or implicit reasoning mechanisms.
📝 Abstract
We study the computational limits of learning $k$-bit Boolean functions (specifically, $\mathrm{AND}$, $\mathrm{OR}$, and their noisy variants) using a minimalist single-head softmax-attention mechanism, where $k=\Theta(d)$ relevant bits are selected from $d$ inputs. We show that these simple $\mathrm{AND}$ and $\mathrm{OR}$ functions are unsolvable with a single-head softmax-attention mechanism alone. However, with teacher forcing, the same minimalist attention is capable of solving them. These findings offer two key insights: Architecturally, solving these Boolean tasks requires only minimalist attention, without deep Transformer blocks or FFNs. Methodologically, one gradient descent update with supervision suffices and replaces the multi-step Chain-of-Thought (CoT) reasoning scheme of [Kim and Suzuki, ICLR 2025] for solving Boolean problems. Together, the bounds expose a fundamental gap between what this minimal architecture achieves under ideal supervision and what is provably impossible under standard training.
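To make the setup concrete, here is a minimal sketch of the learning problem: a single-head softmax attention (no FFN, no stacking) attends over the $d$ input bits and is updated with one supervised gradient step on a squared loss. This is an illustration only; the per-position score parametrization `w`, the batch size, the loss, and the step size are assumptions of this sketch, not the paper's exact construction or its teacher-forcing protocol.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 16, 8                                # input dimension; k = Theta(d) relevant bits
S = rng.choice(d, size=k, replace=False)    # hidden relevant subset (unknown to the learner)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_predict(w, x):
    """Single-head softmax attention, simplified to one score per position."""
    a = softmax(w)      # attention distribution over the d input positions
    return a @ x        # attended average of the attended bit values

def target(x):
    """AND over the relevant bits: 1 iff all bits indexed by S are 1."""
    return float(x[S].all())

# One gradient step on squared loss over a batch (a sketch of the
# "single gradient update" regime; hyperparameters are illustrative).
w = np.zeros(d)
X = rng.integers(0, 2, size=(512, d)).astype(float)
eta = 1.0
grad = np.zeros(d)
for x in X:
    a = softmax(w)
    pred = a @ x
    err = pred - target(x)
    # d pred / d w_j = a_j * (x_j - pred)  (softmax Jacobian applied to x)
    grad += err * a * (x - pred)
w -= eta * grad / len(X)
```

The point of the sketch is the architecture's bareness: the only trainable object is the attention score vector, so any success or failure on the $\mathrm{AND}$ task is attributable to the attention mechanism and the supervision signal alone.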