Towards Learning High-Precision Least Squares Algorithms with Sequence Models

📅 2025-03-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates whether sequence models can learn high-precision numerical algorithms—specifically gradient descent—end-to-end for solving least-squares problems, targeting both machine precision and numerical generalization across problem instances. It combines three ingredients: (1) fully polynomial architectures based on gated convolutions and linear attention, which avoid the high-precision multiplication failures of softmax-based Transformers; (2) a high-precision training recipe that substantially reduces stochastic gradient noise; and (3) iterative application of the learned gradient-descent step to encode algorithmic structure. Experiments show the models train to near machine precision on least-squares tasks, achieving 100,000x lower MSE than standard end-to-end-trained Transformers and a 10,000x smaller generalization gap on out-of-distribution problems—progress towards end-to-end learning of numerical algorithms for least squares.

📝 Abstract
This paper investigates whether sequence models can learn to perform numerical algorithms, e.g. gradient descent, on the fundamental problem of least squares. Our goal is to inherit two properties of standard algorithms from numerical analysis: (1) machine precision, i.e. we want to obtain solutions that are accurate to near floating point error, and (2) numerical generality, i.e. we want them to apply broadly across problem instances. We find that prior approaches using Transformers fail to meet these criteria, and identify limitations present in existing architectures and training procedures. First, we show that softmax Transformers struggle to perform high-precision multiplications, which prevents them from precisely learning numerical algorithms. Second, we identify an alternate class of architectures, comprised entirely of polynomials, that can efficiently represent high-precision gradient descent iterates. Finally, we investigate precision bottlenecks during training and address them via a high-precision training recipe that reduces stochastic gradient noise. Our recipe enables us to train two polynomial architectures, gated convolutions and linear attention, to perform gradient descent iterates on least squares problems. For the first time, we demonstrate the ability to train to near machine precision. Applied iteratively, our models obtain 100,000x lower MSE than standard Transformers trained end-to-end and they incur a 10,000x smaller generalization gap on out-of-distribution problems. We make progress towards end-to-end learning of numerical algorithms for least squares.
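The abstract's reference algorithm—gradient descent on least squares run to near floating-point error—can be made concrete with a short sketch. This is not code from the paper; the problem sizes, step size, and iteration count are illustrative assumptions, and the point is only that the classical iteration the models are trained to emulate does reach near machine precision in float64.

```python
import numpy as np

# Illustrative least-squares instance: minimize ||Ax - b||_2 over x.
rng = np.random.default_rng(0)
n, d = 32, 4
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

# Gradient descent with step size 1/L, where L = lambda_max(A^T A),
# which guarantees convergence for this quadratic objective.
L = np.linalg.eigvalsh(A.T @ A).max()
x = np.zeros(d)
for _ in range(2000):
    x -= (1.0 / L) * (A.T @ (A @ x - b))

# Compare against a direct solver: the iterates match the least-squares
# solution down to roughly float64 machine precision.
x_star = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.max(np.abs(x - x_star)))
```

Matching this behavior with a learned model is exactly what the paper frames as hard: a trained sequence model must reproduce these iterates accurately across problem instances, not just on average.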
Problem

Research questions and friction points this paper is trying to address.

Sequence models learning high-precision numerical algorithms
Overcoming precision limitations in existing Transformer architectures
Training polynomial architectures to near machine precision
Innovation

Methods, ideas, or system contributions that make the work stand out.

Polynomial architectures replace softmax Transformers
High-precision training reduces stochastic gradient noise
Achieves near machine precision in least squares
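The "polynomial architectures" bullet can be illustrated with a minimal comparison, not taken from the paper: softmax attention applies an exponential to the query–key scores, so its outputs are not polynomials of the inputs, while linear attention drops the softmax and computes a fixed-degree polynomial. Shapes and values below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 8, 4  # sequence length, head dimension (illustrative)
Q = rng.standard_normal((T, d))
K = rng.standard_normal((T, d))
V = rng.standard_normal((T, d))

# Softmax attention: the exp() makes each output a non-polynomial
# (transcendental) function of the inputs.
scores = Q @ K.T / np.sqrt(d)
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
softmax_out = weights @ V

# Linear attention: no softmax. Each output entry is a degree-3
# polynomial in the entries of Q, K, V, and the (K^T V) factoring
# avoids materializing the T x T score matrix.
linear_out = Q @ (K.T @ V)
```

The polynomial form is what the paper exploits: polynomials can represent the multiply-accumulate structure of gradient-descent iterates exactly, whereas the softmax nonlinearity is identified as a precision bottleneck.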
Jerry Liu
Institute of Computational & Mathematical Engineering, Stanford University
Jessica Grogan
Department of Computer Science & Engineering, University at Buffalo
Owen Dugan
Stanford CS PhD Candidate
Ashish Rao
Department of Computer Science, Stanford University
Simran Arora
Computer Science, Stanford University
Computer Science · AI Systems
Atri Rudra
Katherine Johnson Chair in AI, Professor, CSE, University at Buffalo
Structured Linear Algebra · Society and Computing · Coding Theory · Database Algorithms
Christopher Ré
Department of Computer Science, Stanford University