🤖 AI Summary
Traditional Transformers rely on positional encodings to model sequential order, yet these encodings extrapolate poorly to unseen sequence lengths and offer limited interpretability. This paper proposes Spline-based Transformers, a novel Transformer architecture that eliminates explicit positional encodings. Its core innovation is embedding each input sequence as a smooth, learnable spline trajectory in latent space, where sequence order is implicitly captured by the curve parameterization over learnable control points; coupling spline interpolation with self-attention enables extrapolation to arbitrary sequence lengths and lets users edit sequences interactively by manipulating control points. To the authors' knowledge, this is the first work to incorporate spline geometric priors into sequence modeling. Extensive experiments on synthetic 2D data and real-world tasks—including image, 3D shape, and animation generation—demonstrate substantial improvements over state-of-the-art positional encoding schemes, with superior generalization, enhanced interpretability, and fine-grained controllability.
📝 Abstract
We introduce Spline-based Transformers, a novel class of Transformer models that eliminate the need for positional encoding. Inspired by workflows using splines in computer animation, our Spline-based Transformers embed an input sequence of elements as a smooth trajectory in latent space. Besides overcoming drawbacks of positional encoding such as poor sequence-length extrapolation, Spline-based Transformers also provide a novel way for users to interact with transformer latent spaces: directly manipulating the latent control points creates new latent trajectories and sequences. We demonstrate the superior performance of our approach in comparison to conventional positional encoding on a variety of datasets, ranging from synthetic 2D data to large-scale real-world datasets of images, 3D shapes, and animations.
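To make the core idea concrete, here is a minimal sketch of how a sequence's latent positions could be read off a spline defined by control points. This assumes a cubic Bézier curve and NumPy purely for illustration; the paper's actual spline family, number of control points, and parameterization may differ. The key property shown is that position `i` maps to curve parameter `t_i`, so order is implicit and longer sequences simply sample the same curve more densely.

```python
import numpy as np

def bezier_trajectory(control_points, seq_len):
    """Evaluate a cubic Bezier curve at seq_len evenly spaced parameters.

    control_points: (4, d) array of latent control points (in the model,
                    these would be learned or predicted by an encoder).
    Returns a (seq_len, d) array: one latent vector per sequence position.
    """
    P = np.asarray(control_points, dtype=float)       # (4, d)
    t = np.linspace(0.0, 1.0, seq_len)[:, None]       # (seq_len, 1)
    # Bernstein basis functions of a cubic Bezier curve.
    basis = np.hstack([
        (1 - t) ** 3,
        3 * (1 - t) ** 2 * t,
        3 * (1 - t) * t ** 2,
        t ** 3,
    ])                                                # (seq_len, 4)
    return basis @ P                                  # (seq_len, d)

# Position i is tied to t_i = i / (seq_len - 1); extrapolating to a longer
# sequence just means sampling more parameter values along the same curve,
# and dragging a control point smoothly deforms the whole trajectory.
trajectory = bezier_trajectory(
    np.array([[0.0, 0.0], [1.0, 2.0], [3.0, 2.0], [4.0, 0.0]]),
    seq_len=8,
)
```

Note how user editing falls out for free: moving one control point changes all interpolated latents smoothly, which is the interactive manipulation the paper describes.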