🤖 AI Summary
To address the limited expressive capacity of discrete softmax heads in large language models (LLMs) when modeling non-linguistic continuous domains, such as actions and time-series values, this paper proposes the plug-and-play Fourier head: an output layer based on a Fourier series expansion that models continuous probability distributions via learnable spectral coefficients over an orthogonal basis. The authors provide theoretical evidence that the layer learns signal from the data while remaining robust to high-frequency noise. Empirically, the Fourier head improves a Decision Transformer agent's returns by 46% on the Atari Seaquest game, and improves a state-of-the-art time-series foundation model's forecasting performance by 3.5% across 20 benchmarks unseen during training. The layer is fully differentiable, architecture-agnostic, and can be substituted for any linear output layer without modifying the underlying LLM backbone.
📝 Abstract
As the quality of large language models has improved, there has been increased interest in using them to model non-linguistic tokens. For example, the Decision Transformer recasts agentic decision making as a sequence modeling problem, using a decoder-only LLM to model the distribution over the discrete action space for an Atari agent. However, when adapting LLMs to non-linguistic domains, it remains unclear if softmax over discrete bins captures the continuous structure of the tokens and the potentially complex distributions needed for high-quality token generation. We introduce a neural network layer, constructed using Fourier series, which we can easily substitute for any linear layer if we want the outputs to have a more continuous structure. We perform extensive analysis on synthetic datasets, as well as on large-scale decision making and time series forecasting tasks. We also provide theoretical evidence that this layer can better learn signal from data while ignoring high-frequency noise. All of our results support the effectiveness of our proposed Fourier head in scenarios where the underlying data distribution has a natural continuous structure. For example, the Fourier head improves a Decision Transformer agent's returns by 46% on the Atari Seaquest game, and increases a state-of-the-art time series foundation model's forecasting performance by 3.5% across 20 benchmarks unseen during training.
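To make the idea concrete, here is a minimal, simplified sketch of what such a drop-in layer could look like. This is not the paper's exact formulation (the paper derives the density from Fourier coefficients with guarantees on non-negativity and smoothness); instead it illustrates the core mechanism under stated assumptions: a learnable linear map produces truncated Fourier series coefficients, the series is evaluated at bin centers on `[-1, 1)`, and a softmax converts those smooth series values into bin probabilities. All names (`fourier_head_forward`, `num_bins`, `num_freqs`) are illustrative, not from the paper.

```python
import numpy as np

def fourier_head_forward(x, W, b, num_bins=32, num_freqs=8):
    """Simplified sketch of a Fourier-series output layer.

    x : (d,) input feature vector
    W : (2*num_freqs, d) learnable weights mapping features to coefficients
    b : (2*num_freqs,) learnable bias
    Returns a categorical distribution over num_bins bins covering [-1, 1).
    Because logits come from a low-frequency truncated Fourier series,
    neighboring bins get smoothly varying probabilities, unlike an
    unconstrained softmax head.
    """
    coeffs = W @ x + b                          # (2*num_freqs,)
    a, s = coeffs[:num_freqs], coeffs[num_freqs:]   # cosine / sine coefficients

    # Centers of num_bins equal-width bins on [-1, 1)
    centers = np.linspace(-1.0, 1.0, num_bins, endpoint=False) + 1.0 / num_bins
    k = np.arange(1, num_freqs + 1)[:, None]        # frequencies, shape (num_freqs, 1)

    # Truncated Fourier series evaluated at every bin center
    logits = a @ np.cos(np.pi * k * centers) + s @ np.sin(np.pi * k * centers)

    # Softmax over bins yields a valid, smoothly varying distribution
    p = np.exp(logits - logits.max())
    return p / p.sum()
```

Because the layer has the same input/output interface as a linear-plus-softmax head (features in, bin probabilities out), it can replace the final linear layer of a Decision Transformer or time-series model without touching the backbone; the number of retained frequencies `num_freqs` controls how much high-frequency detail the head can express.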