Free Energy Mixer

📅 2026-02-06
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work proposes the Free Energy Mixer (FEM), a novel attention mechanism that overcomes a limitation of standard attention: its inability to perform dynamic channel-wise selection due to reliance on convex combinations of key-value pairs. FEM introduces a value-aware readout based on free energy (log-sum-exp), treating query-key scores as priors to construct a posterior read distribution. This enables per-channel log-linear tilting, allowing a smooth transition from average aggregation to selective channel reading without increasing computational complexity. The method incorporates a learnable inverse temperature parameter and a two-layer gating structure, making it compatible with standard attention, linear attention, linear RNNs, and state space models (SSMs). Experiments demonstrate that FEM consistently outperforms strong baselines across natural language processing, vision, and time series tasks at equal parameter counts.

๐Ÿ“ Abstract
Standard attention stores keys/values losslessly but reads them via a per-head convex average, blocking channel-wise selection. We propose the Free Energy Mixer (FEM): a free-energy (log-sum-exp) read that applies a value-driven, per-channel log-linear tilt to a fast prior (e.g., from queries/keys in standard attention) over indices. Unlike methods that attempt to improve and enrich the $(q,k)$ scoring distribution, FEM treats it as a prior and yields a value-aware posterior read at unchanged complexity, smoothly moving from averaging to per-channel selection as the learnable inverse temperature increases, while still preserving parallelism and the original asymptotic complexity ($O(T^2)$ for softmax; $O(T)$ for linearizable variants). We instantiate a two-level gated FEM that is plug-and-play with standard and linear attention, linear RNNs and SSMs. It consistently outperforms strong baselines on NLP, vision, and time-series at matched parameter budgets.
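To make the abstract's core idea concrete, here is a minimal NumPy sketch of a free-energy (log-sum-exp) read for a single query and head. It is an illustration of the mechanism as described, not the paper's implementation: the function name `fem_read` and the plain scaled dot-product prior are assumptions, and the paper's two-level gating and learnable per-channel inverse temperature are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fem_read(q, K, V, beta=1.0):
    """Free-energy (log-sum-exp) read over T indices (sketch, assumed API).

    The query-key scores give a prior p over indices, as in standard
    attention. Each output channel c is then
        (1/beta) * log sum_i p_i * exp(beta * V[i, c]),
    i.e. the prior is log-linearly tilted by the values, per channel.
    As beta -> 0 this recovers the convex average sum_i p_i V[i];
    as beta grows it moves toward a per-channel max (selection).
    """
    scores = K @ q / np.sqrt(q.shape[-1])   # (T,) prior logits
    p = softmax(scores)                     # prior over indices
    z = beta * V + np.log(p)[:, None]       # (T, C) tilted log-weights
    m = z.max(axis=0)                       # stabilize the log-sum-exp
    return (np.log(np.exp(z - m).sum(axis=0)) + m) / beta
```

Note the same softmax prior is shared across channels, so the asymptotic cost matches standard attention; only the readout changes from a weighted mean to a per-channel log-sum-exp.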
Problem

Research questions and friction points this paper is trying to address.

attention mechanism
channel-wise selection
free energy
value-aware reading
log-sum-exp
Innovation

Methods, ideas, or system contributions that make the work stand out.

Free Energy Mixer
value-aware attention
channel-wise selection
log-sum-exp
plug-and-play architecture