🤖 AI Summary
This work bridges the gap between geometric interpretations of patterns in data and measurable properties of loss landscapes in sequence modeling. Methodologically, it maps conditional sequence distributions into a Hilbert space, constructs a data-dependent “effective distribution” by truncating low-magnitude modes below an adaptive threshold, and characterizes its geometric structure via the local learning coefficient (LLC). Theoretically, it proves that the LLC is inherently robust to noise modes below the truncation threshold, demonstrating that the LLC fundamentally reflects the geometry of the effective (truncated) distribution rather than the original data distribution. A key conceptual contribution identifies the inverse temperature parameter in stochastic gradient Langevin dynamics (SGLD) as a “resolution controller” for the loss landscape. Finally, the framework is instantiated in Transformer architectures, enabling an interpretable and quantifiable geometric characterization of loss landscapes.
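To make the truncation step concrete, here is a minimal numpy sketch (not the paper's code): it represents a conditional distribution p(y|x) as a row-stochastic matrix, uses an SVD as a simple stand-in for the paper's tensor decomposition, and discards modes whose singular values fall below a threshold `tau`, which stands in for the adaptive, data-dependent cutoff. The matrix shape, the threshold value, and all names are illustrative assumptions.

```python
import numpy as np

def effective_distribution(P: np.ndarray, tau: float) -> np.ndarray:
    """Form an "effective distribution" by dropping low-magnitude modes.

    P[i, j] ~ p(y=j | x=i) as a row-stochastic matrix; tau plays the role
    of the adaptive truncation threshold (illustrative, not the paper's rule).
    """
    U, s, Vt = np.linalg.svd(P, full_matrices=False)
    keep = s > tau                                  # keep dominant modes only
    P_eff = (U[:, keep] * s[keep]) @ Vt[keep, :]    # low-rank reconstruction
    # Re-normalize rows so each conditional remains a probability distribution.
    P_eff = np.clip(P_eff, 0.0, None)
    P_eff /= P_eff.sum(axis=1, keepdims=True)
    return P_eff

# Example: 16 contexts over an 8-symbol vocabulary.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(8), size=16)
P_eff = effective_distribution(P, tau=0.5)
print(P_eff.sum(axis=1))   # each row still sums to 1
```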
📝 Abstract
We develop a geometric account of sequence modelling that links patterns in the data to measurable properties of the loss landscape in transformer networks. First, we cast conditional sequence distributions into a Hilbert-space framework and apply tensor decompositions to identify their principal modes. Truncating the small-amplitude modes yields an effective data distribution that preserves dominant structure while discarding statistical detail. Second, we show theoretically that Local Learning Coefficient (LLC) estimates are insensitive to modes below a data-dependent threshold. Consequently, the LLC calculated in practice characterises the geometry of the effective rather than the true distribution. This insight clarifies why reliable LLC estimates can be obtained even when a network parameter is not a strict minimiser of the population loss, and it highlights how the inverse temperature in SGLD acts as a resolution dial on the landscape structure.
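As a rough illustration of how the inverse temperature enters SGLD-based LLC estimation, the sketch below applies the standard localized-SGLD estimator, lambda_hat = n * beta * (E[L_n(w)] - L_n(w*)), to a toy singular loss. The loss function, hyperparameters, and helper names are illustrative assumptions, not the paper's setup; the point is that `beta` is the knob that sets the resolution at which landscape structure is probed.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(w):
    # Toy singular loss with a degenerate minimum at the origin.
    return (w[0] * w[1]) ** 2

def grad(w):
    return np.array([2.0 * w[0] * w[1] ** 2, 2.0 * w[1] * w[0] ** 2])

def llc_sgld(w_star, n=1000, beta=None, gamma=1.0, eps=1e-3, steps=20000):
    """SGLD-based LLC estimate at w_star (illustrative sketch).

    beta is the inverse temperature: the common default beta = 1/log(n)
    targets coarse structure; larger beta resolves finer landscape detail.
    """
    beta = 1.0 / np.log(n) if beta is None else beta
    w = w_star.copy()
    samples = []
    for t in range(steps):
        # Langevin step on the localized tempered posterior:
        #   U(w) = n * beta * L(w) + (gamma / 2) * ||w - w_star||^2
        drift = n * beta * grad(w) + gamma * (w - w_star)
        w = w - 0.5 * eps * drift + np.sqrt(eps) * rng.standard_normal(w.shape)
        if t >= steps // 4:              # discard burn-in samples
            samples.append(loss(w))
    # lambda_hat = n * beta * (E[L(w)] - L(w_star))
    return n * beta * (np.mean(samples) - loss(w_star))

print(llc_sgld(np.zeros(2)))   # theory gives lambda = 1/2 for this toy loss
```

Raising `beta` sharpens the tempered posterior around w*, so the sampler sees finer-grained features of the landscape; lowering it blurs sub-threshold detail, which is the sense in which the inverse temperature acts as a resolution dial.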