Why Can't Transformers Learn Multiplication? Reverse-Engineering Reveals Long-Range Dependency Pitfalls

📅 2025-09-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates why Transformers struggle with multi-digit multiplication, a task that demands long-range dependencies between digits. By reverse-engineering a model that does learn the task via implicit chain-of-thought, the authors report three findings: (1) logit attributions and linear probes show the model encodes the necessary long-range dependencies; (2) mechanistically, attention constructs a directed acyclic graph that caches and later retrieves pairwise partial products; and (3) geometrically, attention heads form Minkowski sums between digit pairs, with digits represented in a Fourier basis. Standard fine-tuning, in contrast, converges to a local optimum that lacks these long-range dependencies. Adding an auxiliary loss that predicts the running sum through a linear regression probe supplies the missing inductive bias and enables the model to learn multi-digit multiplication.
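To make the "long-range dependency" concrete: each output digit of a multi-digit product depends on many digit pairs of the operands plus propagated carries. A minimal sketch of the quantity involved (not the paper's code; the paper probes for a hidden-state analogue of this running sum):

```python
def digits(n):
    # least-significant-digit-first representation, e.g. 345 -> [5, 4, 3]
    ds = []
    while True:
        ds.append(n % 10)
        n //= 10
        if n == 0:
            return ds

def partial_products(a, b):
    # pairwise digit products a_i * b_j, grouped by output position i + j
    da, db = digits(a), digits(b)
    cols = [0] * (len(da) + len(db))
    for i, x in enumerate(da):
        for j, y in enumerate(db):
            cols[i + j] += x * y
    return cols

def running_sums(a, b):
    # carry-propagated running sum per output digit: the long-range
    # quantity that mixes information from many digit pairs at once
    sums, carry = [], 0
    for col in partial_products(a, b):
        carry += col
        sums.append(carry % 10)
        carry //= 10
    return sums
```

For 12 × 34, `running_sums(12, 34)` returns `[8, 0, 4, 0]`, the digits of 408 least-significant first; producing any single digit requires combining several partial products with a carry from all lower positions.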

📝 Abstract
Language models are increasingly capable, yet still fail at a seemingly simple task of multi-digit multiplication. In this work, we study why, by reverse-engineering a model that successfully learns multiplication via *implicit chain-of-thought*, and report three findings: (1) Evidence of long-range structure: logit attributions and linear probes indicate that the model encodes the necessary long-range dependencies for multi-digit multiplication. (2) Mechanism: the model encodes long-range dependencies using attention to construct a directed acyclic graph to "cache" and "retrieve" pairwise partial products. (3) Geometry: the model implements partial products in attention heads by forming Minkowski sums between pairs of digits, and digits are represented using a Fourier basis, both of which are intuitive and efficient representations that the standard fine-tuning model lacks. With these insights, we revisit the learning dynamics of standard fine-tuning and find that the model converges to a local optimum that lacks the required long-range dependencies. We further validate this understanding by introducing an auxiliary loss that predicts the "running sum" via a linear regression probe, which provides an inductive bias that enables the model to successfully learn multi-digit multiplication. In summary, by reverse-engineering the mechanisms of an implicit chain-of-thought model we uncover a pitfall for learning long-range dependencies in Transformers and provide an example of how the correct inductive bias can address this issue.
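The geometric findings can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper's implementation: a Fourier-style digit embedding places each digit on circles at a few frequencies over Z/10 (the frequency set here is an assumption), and a Minkowski sum is simply the set of all pairwise vector sums of two point sets, which the paper reports attention heads forming over digit pairs:

```python
import numpy as np

def fourier_digit(d, freqs=(1, 2, 5)):
    # Embed digit d as [cos, sin] pairs at a few frequencies over Z/10,
    # so the representation is periodic: d and d + 10 coincide.
    angles = [2 * np.pi * k * d / 10 for k in freqs]
    return np.array([f(t) for t in angles for f in (np.cos, np.sin)])

def minkowski_sum(points_a, points_b):
    # All pairwise vector sums of two point sets -- the geometric object
    # the paper describes attention heads forming between digit pairs.
    return np.array([a + b for a in points_a for b in points_b])
```

One useful property of the periodic embedding: `fourier_digit(3)` and `fourier_digit(13)` coincide, so the representation respects digit structure rather than raw magnitude.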
Problem

Research questions and friction points this paper is trying to address.

Why Transformers fail to learn multi-digit multiplication despite its apparent simplicity
How attention mechanisms encode, or fail to encode, the task's long-range dependencies
Whether the right inductive bias can steer training past the dependency-learning pitfall
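The inductive-bias fix reported in the abstract is an auxiliary loss from a linear regression probe that predicts the running sum. A minimal NumPy sketch of that objective (names and shapes are illustrative assumptions; in practice the squared error would be added to the language-modeling loss and backpropagated into the Transformer):

```python
import numpy as np

def running_sum_probe_loss(hidden, targets, W, b):
    # Linear-regression probe on hidden states: predict the running sum,
    # then penalize the mean squared error as an auxiliary objective.
    preds = hidden @ W + b
    return float(np.mean((preds - targets) ** 2))
```

When the probe's targets are linearly decodable from the hidden states, this loss can reach zero; otherwise its gradient pushes the model toward representations that carry the running sum.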
Innovation

Methods, ideas, or system contributions that make the work stand out.

Implicit chain-of-thought enables long-range dependency encoding
Attention constructs DAG for caching pairwise partial products
Minkowski sums and Fourier basis represent digits efficiently
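The "cache and retrieve" DAG behavior can be illustrated with a toy attention lookup, again a sketch under assumed toy values rather than the paper's mechanism: earlier positions hold cached digit products as values, and a later position's query scores the keys for the pairs it needs:

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over the last axis
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Toy cache-and-retrieve for 12 * 34: the four cached pairwise digit
# products (2*4, 2*3, 1*4, 1*3) sit at earlier positions as values.
keys = np.eye(4)                            # one orthogonal key per cached pair
values = np.array([[8.0], [6.0], [4.0], [3.0]])
query = np.array([[0.0, 10.0, 10.0, 0.0]])  # high score on the two middle pairs
retrieved = softmax(query @ keys.T) @ values  # ~ average of the attended values
```

With near-orthogonal keys, the query retrieves essentially only the cached entries it scores highly, which is the lookup behavior the DAG analysis attributes to attention heads.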