Deformba: Vision State Space Model with Adaptive State Fusion

📅 2026-05-20

🤖 AI Summary
Existing visual State Space Models (SSMs) rely on fixed image scanning orders, which limits their ability to model complex geometric structures. They also lack an effective mechanism for multimodal interaction, which hinders their use in tasks such as multi-view 3D perception. This work proposes Deformba, an SSM framework that incorporates deformable spatial sampling for context-adaptive spatial structure modeling, alongside a cross-attention-inspired mechanism for cross-modal state fusion. Deformba improves visual modeling while maintaining linear computational complexity. It provides a unified architecture for 2D and 3D vision tasks and reports strong performance on benchmarks covering image classification, object detection, segmentation, and bird's-eye-view (BEV) perception.
📝 Abstract
State Space Models (SSMs) have emerged as a powerful and efficient alternative to Transformers, offering linear-time complexity and strong sequence modeling capabilities. However, applying them to vision tasks remains challenging. First, existing vision SSMs largely depend on manually designed, fixed scanning methods to flatten image patches into sequences, which imposes predefined geometric structures and increases complexity. Second, vision SSMs see limited adoption in domains that require query-based interactions between distinct information streams, because SSMs designed for 1D sequence modeling are inherently causal and self-referential. Such query-based fusion is indispensable for critical perception tasks such as multi-view 3D fusion. To address these limitations, we propose Deformba, a context-adaptive method that dynamically augments spatial structural information while maintaining the linear complexity of SSMs. Deformba also supports multi-modal fusion in the manner of cross-attention. To demonstrate its effectiveness and general applicability, we evaluate Deformba on general 2D vision tasks such as image classification, object detection, and segmentation, as well as 3D vision tasks such as BEV perception. Extensive experiments show that Deformba achieves strong performance across various visual perception benchmarks.
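To make the fixed-scanning limitation concrete, here is a minimal sketch in plain Python. It is not Deformba and not any published vision SSM. It assumes a scalar linear recurrence (h_t = a*h_{t-1} + b*x_t, y_t = c*h_t) and two hand-picked scan orders. The function names and the parameters `a`, `b`, `c` are hypothetical, chosen only for illustration. The sketch shows that the recurrence runs in a single pass over the sequence (linear in its length) and that its output depends on the order in which patches were flattened.

```python
def raster_scan(grid):
    """Flatten an H x W grid of patch values row by row (one fixed scan order)."""
    return [patch for row in grid for patch in row]


def column_scan(grid):
    """Flatten the same grid column by column (another fixed scan order)."""
    height, width = len(grid), len(grid[0])
    return [grid[r][c] for c in range(width) for r in range(height)]


def ssm_scan(xs, a=0.5, b=1.0, c=1.0):
    """Toy scalar state space recurrence (illustrative assumption, not the paper's model).

    h_t = a * h_{t-1} + b * x_t
    y_t = c * h_t

    One loop over the sequence, so the cost is linear in its length.
    Each output depends only on the current and earlier inputs (causal).
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys


# A 2 x 2 grid of scalar "patch features" (made-up values).
grid = [[1.0, 2.0],
        [3.0, 4.0]]

row_out = ssm_scan(raster_scan(grid))    # processes patches as 1, 2, 3, 4
col_out = ssm_scan(column_scan(grid))    # processes patches as 1, 3, 2, 4

print(row_out)  # [1.0, 2.5, 4.25, 6.125]
print(col_out)  # [1.0, 3.5, 3.75, 5.875]
```

The two outputs differ even though the image is the same, because the hidden state only ever sees patches that came earlier in the chosen order. This is the sense in which a hand-designed scan imposes a predefined geometric structure on the model.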
Problem

Research questions and friction points this paper is trying to address.

Vision State Space Models
Fixed Scanning
Multi-modal Fusion
Query-based Interaction
3D Perception
Innovation

Methods, ideas, or system contributions that make the work stand out.

State Space Model
Adaptive State Fusion
Vision Modeling
Multi-modal Fusion
Linear Complexity
Hongyu Ke
Department of Computer Science, Georgia State University
Jack Morris
University of Tennessee Knoxville
Yongkang Liu
Toyota Motor North America
Autonomous Driving, Intelligent Vehicles, Deep Learning, Signal Processing
Satoshi Kitai
InfoTech Labs, Toyota Motor North America R&D
Kentaro Oguchi
InfoTech Labs, Toyota Motor North America R&D
Yi Ding
University of Tennessee Knoxville
Haoxin Wang
Assistant Professor, Georgia State University
Edge Computing, Efficient AI, On-Device LLM, Digital Twins