Change of Thought: Adaptive Test-Time Computation

📅 2025-07-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Standard Transformer encoders run a single fixed-depth pass and are therefore limited to the expressive capacity of constant-depth TC⁰ circuits. Autoregressive mechanisms lift this ceiling, but only by decoding intermediate states into tokens and re-encoding them in later steps, contradicting the brain's implicit, iterative reasoning. This paper proposes the SELF-Transformer: an architecture that iteratively refines its attention weights at test time within a standard encoder, allocating computation adaptively to input difficulty without token-level autoregression or additional parameters. Its core innovation is internalizing the inference loop within the attention update, avoiding linguistic externalization of intermediate states. Experiments demonstrate up to 20% absolute accuracy improvements on encoder-style tasks, significantly enhancing representational capacity while preserving architectural simplicity and computational efficiency.

📝 Abstract
Transformers evaluated in a single, fixed-depth pass are provably limited in expressive power to the constant-depth circuit class TC0. Running a Transformer autoregressively removes that ceiling -- first in next-token prediction and, more recently, in chain-of-thought reasoning. Both regimes rely on feedback loops that decode internal states into tokens only to re-encode them in subsequent steps. While this "thinking aloud" mirrors human reasoning, biological brains iterate without externalising intermediate states as language. To boost the expressive power of encoder Transformers without resorting to token-level autoregression, we introduce the SELF-Transformer: an encoder layer that iteratively refines its own attention weights to a fixed point. Instead of producing -- in one pass -- the alignment matrix that remixes the input sequence, the SELF-Transformer iteratively updates that matrix internally, scaling test-time computation with input difficulty. This adaptivity yields up to 20% accuracy gains on encoder-style benchmarks without increasing parameter count, demonstrating that input-adaptive alignment at test time offers substantial benefits for only a modest extra compute budget. Self-Transformers thus recover much of the expressive power of iterative reasoning while preserving the simplicity of pure encoder architectures.
Problem

Research questions and friction points this paper is trying to address.

Overcoming fixed-depth Transformer limitations in expressive power
Enhancing encoder Transformers without token-level autoregression
Achieving adaptive computation without increasing model parameters
Innovation

Methods, ideas, or system contributions that make the work stand out.

Iteratively refines attention weights internally
Updates alignment matrix without token autoregression
Scales test-time computation with input difficulty
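The mechanism described above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function name, weight shapes, convergence tolerance, and iteration cap are all assumptions made for illustration. The key idea shown is that the alignment (attention) matrix is recomputed from the current mixed representation each step and iteration stops once it reaches a fixed point, so easy inputs converge in few steps and hard inputs use more.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_refining_attention(X, Wq, Wk, Wv, max_iters=16, tol=1e-4):
    """Hypothetical sketch of fixed-point attention refinement.

    Each iteration recomputes queries/keys from the current representation H,
    updates the alignment matrix A, and stops early when A stops changing,
    so the number of steps scales with input difficulty.
    """
    d = Wq.shape[1]
    H = X                      # start from the raw input
    A_prev = None
    steps = 0
    for steps in range(1, max_iters + 1):
        Q, K = H @ Wq, H @ Wk
        A = softmax(Q @ K.T / np.sqrt(d))   # alignment matrix, rows sum to 1
        H = A @ (X @ Wv)                    # remix the original input sequence
        if A_prev is not None and np.abs(A - A_prev).max() < tol:
            break                           # fixed point reached: stop early
        A_prev = A
    return H, steps

# Toy usage: 5 tokens of dimension 8, small random weights (assumed setup).
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))
Wq, Wk, Wv = [rng.standard_normal((8, 8)) * 0.1 for _ in range(3)]
H, steps = self_refining_attention(X, Wq, Wk, Wv)
```

Because no parameters are added relative to a single attention pass, the only cost of this adaptivity is the extra test-time compute spent on inputs that need more refinement steps.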