🤖 AI Summary
This work addresses the lack of explicit Lipschitz continuity guarantees in standard Transformers, which limits their reliable deployment in safety-critical applications. The authors propose a novel class of in-context Transformers with built-in Lipschitz constraints, interpreting both MLP and attention modules as explicit Euler steps of a negative gradient flow—thereby ensuring stability without sacrificing expressive power. For the first time, they establish a universal approximation theorem for Transformers within the space of Lipschitz functions, leveraging a measure-theoretic perspective that treats the model as an operator acting on probability measures. This formulation yields approximation capabilities independent of the number of tokens. The study thus provides a rigorous theoretical foundation for designing Transformers that simultaneously achieve robustness and expressivity.
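The Euler-step reading of a residual block can be sketched as follows (a schematic with assumed notation, not the paper's exact parameterization):

```latex
% A residual update with step size \tau and potential V (symbols assumed):
x_{\ell+1} \;=\; x_\ell \;-\; \tau \,\nabla V(x_\ell),
% i.e. one explicit Euler step of the negative gradient flow
\dot{x}(t) \;=\; -\nabla V\bigl(x(t)\bigr).
% If \nabla V is L-Lipschitz and V is convex, the step map
% x \mapsto x - \tau \nabla V(x) is 1-Lipschitz for 0 \le \tau \le 2/L,
% which is the sense in which such blocks are stable by construction.
```

Under this reading, realizing both the MLP and attention blocks as such steps makes the Lipschitz constant of the whole network controllable through the step sizes and the smoothness of the potentials.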
📝 Abstract
Stability and robustness are critical for deploying Transformers in safety-sensitive settings. A principled way to enforce such behavior is to constrain the model's Lipschitz constant. However, approximation-theoretic guarantees for architectures that explicitly preserve Lipschitz continuity have yet to be established. In this work, we bridge this gap by introducing a class of gradient-descent-type in-context Transformers that are Lipschitz-continuous by construction. We realize both MLP and attention blocks as explicit Euler steps of negative gradient flows, ensuring inherent stability without sacrificing expressivity. We prove a universal approximation theorem for this class within a Lipschitz-constrained function space. Crucially, our analysis adopts a measure-theoretic formalism, interpreting Transformers as operators on probability measures, to yield approximation guarantees independent of token count. These results provide a rigorous theoretical foundation for the design of robust, Lipschitz-continuous Transformer architectures.
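A minimal numerical sketch of the stability mechanism, under the simplifying assumption of a quadratic potential (this is an illustration of the Euler-step idea, not the paper's actual block construction):

```python
import numpy as np

# Toy setting: one "block" realized as an explicit Euler step of the
# negative gradient flow of V(x) = 0.5 * x^T A x, with A symmetric PSD.
# The step map x -> x - tau * A x = (I - tau*A) x has Lipschitz constant
# max_i |1 - tau * lambda_i|, which is <= 1 whenever 0 <= tau <= 2/lambda_max.

rng = np.random.default_rng(0)
d = 8
M = rng.standard_normal((d, d))
A = M @ M.T                          # symmetric PSD stand-in potential Hessian
lam_max = np.linalg.eigvalsh(A)[-1]  # eigvalsh returns ascending eigenvalues
tau = 1.0 / lam_max                  # step size inside the stable range

def euler_step(x):
    """One explicit Euler step of the negative gradient flow of V."""
    return x - tau * (A @ x)         # grad V(x) = A x for the quadratic V

# Empirical Lipschitz check: ||f(x)-f(y)|| / ||x-y|| never exceeds 1.
ratios = []
for _ in range(1000):
    x, y = rng.standard_normal(d), rng.standard_normal(d)
    ratios.append(np.linalg.norm(euler_step(x) - euler_step(y))
                  / np.linalg.norm(x - y))
print(max(ratios))  # stays <= 1, as the spectral bound guarantees
```

The same bound is what a gradient-flow block inherits by construction: the Lipschitz constant is controlled by the step size and the smoothness of the potential, independently of training.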