🤖 AI Summary
This work proposes the Free Energy Mixer (FEM), a novel attention mechanism that overcomes a limitation of standard attention: its reliance on convex combinations of key-value pairs prevents dynamic channel-wise selection. FEM introduces a value-aware readout based on free energy (log-sum-exp), treating query-key scores as a prior and constructing a posterior read distribution from them. This enables a per-channel log-linear tilting, allowing a smooth transition from average aggregation to selective channel reading without increasing computational complexity. The method incorporates a learnable inverse temperature parameter and a two-layer gating structure, making it compatible with standard attention, linear attention, linear RNNs, and state space models (SSMs). Experiments demonstrate that FEM consistently outperforms strong baselines across natural language processing, vision, and time series tasks at equal parameter counts.
📝 Abstract
Standard attention stores keys/values losslessly but reads them via a per-head convex average, blocking channel-wise selection. We propose the Free Energy Mixer (FEM): a free-energy (log-sum-exp) read that applies a value-driven, per-channel log-linear tilt to a fast prior over indices (e.g., the query/key scores of standard attention). Unlike methods that attempt to improve and enrich the $(q,k)$ scoring distribution, FEM treats it as a prior and yields a value-aware posterior read at unchanged complexity, smoothly moving from averaging to per-channel selection as the learnable inverse temperature increases, while preserving parallelism and the original asymptotic complexity ($O(T^2)$ for softmax; $O(T)$ for linearizable variants). We instantiate a two-level gated FEM that is plug-and-play with standard and linear attention, linear RNNs, and SSMs. It consistently outperforms strong baselines on NLP, vision, and time series tasks at matched parameter budgets.
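The free-energy read described above can be sketched in a few lines. The sketch below is an illustrative reconstruction, not the paper's implementation: `fem_read`, its argument names, and the use of a single scalar inverse temperature `beta` are assumptions for clarity. It normalizes the query-key scores into a prior over indices, tilts each value channel log-linearly by `beta * v`, and returns the per-channel free energy $(1/\beta)\,\log \sum_t p_t e^{\beta v_{t,c}}$, which interpolates between the prior-weighted average (small `beta`) and a per-channel max (large `beta`).

```python
import numpy as np

def logsumexp(x, axis=0):
    # numerically stable log-sum-exp along the given axis
    m = x.max(axis=axis, keepdims=True)
    return np.squeeze(m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True)), axis=axis)

def fem_read(prior_logits, values, beta):
    """Free-energy (log-sum-exp) readout -- a hedged sketch of FEM's read step.

    prior_logits: (T,) fast prior scores over indices (e.g. q.k scores)
    values:       (T, C) stored values
    beta:         scalar inverse temperature (learnable in the paper)
    returns:      (C,) per-channel read vector
    """
    # normalize the scores into a log-prior over the T indices
    log_prior = prior_logits - logsumexp(prior_logits)
    # value-driven per-channel log-linear tilt: posterior ∝ prior * exp(beta * v)
    tilted = log_prior[:, None] + beta * values          # (T, C)
    # free-energy read: (1/beta) * logsumexp over indices, per channel
    return logsumexp(tilted, axis=0) / beta
```

As `beta -> 0` this recovers the ordinary convex-average read `p @ values`; as `beta -> inf` each channel independently approaches the maximum value with prior support, which is the channel-wise selection the averaging read cannot express.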