🤖 AI Summary
Transformer-based models treat data as unordered sets, disregarding inherent topological structure -- such as sequence order, image grids, or graph connectivity -- and thus rely on task-specific inductive biases (e.g., positional encodings, random walks), resulting in complex designs and limited generalization. This work introduces Chimera, a unified framework that naturally embeds arbitrary graph topologies into state space models (SSMs), using the data's structure itself as a universal inductive bias and eliminating domain-specific architectural engineering. Its core innovation is the first generalization of SSMs to arbitrary graphs, enabling a linear-time recurrence over DAGs and a mathematically principled relaxation-based optimization for general graphs. Experiments show that Chimera outperforms BERT by 0.7 points on GLUE, exceeds ViT by 2.6% top-1 accuracy on ImageNet-1k, and achieves state-of-the-art results across long-range graph benchmarks.
📝 Abstract
Transformer-based deep learning methods have become the standard approach for modeling diverse data such as sequences, images, and graphs. These methods rely on self-attention, which treats data as an unordered set of elements. This ignores the neighborhood structure or graph topology of the data and requires inductive biases--such as position embeddings in sequences and images, or random walks in graphs--to incorporate topology. However, designing such task-specific biases requires significant effort and can introduce side effects that hinder generalization. We introduce Chimera, a unified model that directly incorporates data topology in a principled way, removing the need for domain-specific biases. The key idea is that state space models--which naturally do not require position embeddings--can be generalized to capture any graph topology. Our experiments show that Chimera achieves strong performance across language, vision, and graph domains, outperforming BERT on GLUE by 0.7 points, ViT on ImageNet-1k by 2.6%, and all baselines on the Long Range Graph Benchmark. We further propose algorithmic optimizations to improve Chimera's efficiency: (1) for Directed Acyclic Graphs, Chimera can be implemented as a linear-time recurrence; (2) for general graphs, a simple mathematical relaxation matches the Transformer's quadratic complexity without domain-specific heuristics. These results validate Chimera's core contribution and support the idea that data topology is a powerful inductive bias across modalities.
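To make the DAG optimization concrete: on an ordinary sequence, an SSM updates its state as `h_t = A h_{t-1} + B x_t`; on a DAG, the same recurrence can be applied in topological order, with each node aggregating the states of its parents. The sketch below illustrates that idea only -- the update rule, the sum-over-parents aggregation, and all names (`dag_ssm`, `adj`, `A`, `B`) are assumptions for illustration, not Chimera's actual formulation. Each node is processed once and each edge contributes one addition, so the traversal is linear in the graph size, as the abstract claims for the DAG case.

```python
import numpy as np

def dag_ssm(adj, x, A, B):
    """SSM-style recurrence over a DAG in topological order.

    adj[u] lists the children of node u; x has shape (n, d_in),
    A has shape (d, d), B has shape (d, d_in). Each node's state is
    h[u] = A @ (sum of parent states) + B @ x[u], which reduces to the
    classic sequence SSM h_t = A h_{t-1} + B x_t when the DAG is a chain.
    Illustrative sketch only -- not the paper's exact update rule.
    """
    n = len(adj)
    indeg = [0] * n
    for u in range(n):
        for v in adj[u]:
            indeg[v] += 1
    ready = [u for u in range(n) if indeg[u] == 0]  # Kahn's algorithm
    d = A.shape[0]
    agg = np.zeros((n, d))  # running sum of parent states per node
    h = np.zeros((n, d))
    while ready:
        u = ready.pop()
        h[u] = A @ agg[u] + B @ x[u]
        for v in adj[u]:
            agg[v] += h[u]          # one O(d) addition per edge
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)     # all parents of v are now resolved
    return h
```

On the chain 0 -> 1 -> 2 with scalar A = 0.5, B = 1, and unit inputs, this reproduces the familiar sequence recurrence: h = 1, 1.5, 1.75.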