🤖 AI Summary
This work investigates whether linear attention transformers can implicitly learn general-purpose numerical algorithms from data. Specifically, it considers masked block-matrix completion—encompassing scalar prediction and Nyström-based kernel-slice extrapolation—and trains models end-to-end solely on input-output pairs with a mean-squared-error loss, without incorporating normal equations, hand-crafted iterative steps, or task-specific prompting. After training on millions of synthetic instances, the model autonomously discovers a unified, parameter-free, second-order convergent update rule applicable across three problem classes: low-rank estimation, scalar prediction, and kernel extrapolation. Theoretical analysis reveals that this rule implicitly implements a communication-efficient, distribution-friendly iterative solver, drastically reducing the number of synchronization rounds. Empirical evaluation confirms high accuracy under both full-batch and rank-limited attention configurations. This study provides the first evidence that linear transformers possess in-context learning capability for numerical computation that transfers across computational modes, establishing a novel data-driven paradigm for algorithm discovery.
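To make the task concrete, here is a minimal sketch of one synthetic masked-block instance of the kind the summary describes: a low-rank positive semidefinite matrix whose bottom-right block is hidden, recoverable from the visible blocks via the classical Nyström identity C ≈ Bᵀ A⁺ B. The block sizes, rank, and variable names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r = 8, 4, 3                     # visible size, masked size, rank

# Low-rank PSD "kernel" matrix K = F F^T, partitioned as [[A, B], [B^T, C]]
F = rng.standard_normal((n + m, r))
K = F @ F.T

A = K[:n, :n]                         # visible top-left block
B = K[:n, n:]                         # visible cross block
C_true = K[n:, n:]                    # masked target block

# Nystrom completion of the missing block from the visible ones;
# exact here because rank(A) equals rank(K)
C_hat = B.T @ np.linalg.pinv(A) @ B

print(np.allclose(C_hat, C_true, atol=1e-8))
```

In the paper's setting the transformer is never shown this identity; it is trained only on (masked input, target block) pairs.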
📝 Abstract
We train a linear attention transformer on millions of masked-block matrix completion tasks: each prompt is a masked low-rank matrix whose missing block may be (i) a scalar prediction target or (ii) an unseen kernel slice for Nyström extrapolation. The model sees only input-output pairs and a mean-squared loss; it is given no normal equations, no handcrafted iterations, and no hint that the tasks are related. Surprisingly, after training, algebraic unrolling reveals the same parameter-free update rule across three distinct computational regimes (full visibility, rank-limited updates, and distributed computation). We prove that this rule achieves second-order convergence on full-batch problems, reduces distributed iteration complexity, and remains accurate under rank-limited attention. Thus, a transformer trained solely to patch missing blocks implicitly discovers a unified, resource-adaptive iterative solver spanning prediction, estimation, and Nyström extrapolation, highlighting a powerful capability of in-context learning.
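The abstract does not spell out the discovered update rule, but a classic illustration of a parameter-free, second-order convergent iteration of the kind it describes is the Newton–Schulz scheme for matrix inversion, where the residual ‖I − AXₖ‖ is squared at every step. This sketch is only an analogy for "second-order convergence", not the rule the paper derives.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)           # well-conditioned SPD matrix

# Newton-Schulz: X_{k+1} = X_k (2I - A X_k), no tunable parameters.
# The initialization below guarantees the spectral radius of I - A X0
# is below 1, so the residual norm is squared each iteration.
X = A.T / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))
for _ in range(15):
    X = X @ (2 * np.eye(n) - A @ X)

print(np.allclose(X @ A, np.eye(n), atol=1e-8))
```

Because each synchronization point squares the error, such iterations need only O(log log(1/ε)) rounds, which is one way a learned rule could cut distributed iteration complexity.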