StructSAM: Structure- and Spectrum-Preserving Token Merging for Segment Anything Models

📅 2026-03-07
🤖 AI Summary
This work addresses the limitations of existing token merging methods when applied to the Segment Anything Model (SAM): they tend to degrade boundary details and leak prompt information, and so struggle to balance efficiency and accuracy. The authors propose StructSAM, a structure- and spectrum-preserving token merging-and-recovery framework tailored for SAM that achieves substantial computational savings without retraining. Its key components are an energy score based on first-order feature gradients, combined with a grid-based flatness criterion that safeguards boundaries and prompt-sensitive regions, and an analysis grounded in spectral graph coarsening theory showing that score-guided merging keeps spectral distortion bounded. Evaluated across eight natural and medical image benchmarks, StructSAM reduces encoder FLOPs by 25–30% (over 40% in prompt-aware settings) with only minor drops in mIoU/Dice, consistently outperforming methods such as ToMe and PiToMe at the same compute.

📝 Abstract
Recent token merging techniques for Vision Transformers (ViTs) provide substantial speedups by reducing the number of tokens processed by self-attention, often without retraining. However, their direct application to the Segment Anything Model (SAM) family is nontrivial: SAM's image encoder mixes windowed and global attention, and its mask decoder relies on dense, prompt-conditioned features for precise boundary prediction. We systematically evaluate representative token-merging methods on SAM and Medical SAM in a strict off-the-shelf setting, and find that existing destination-selection heuristics can erode boundaries and leak prompt information as merge rates increase. We propose StructSAM, a resolution-preserving merge-unmerge framework tailored to SAM. StructSAM computes a lightweight token-energy score from first-order feature gradients, uses grid-based flatness screening to protect boundary and prompt regions, and merges tokens within flat areas toward low-energy destinations with explicit token recovery. We further provide a spectral graph coarsening view showing that score-guided merging yields bounded Laplacian spectral distortion compared to random or window-restricted baselines. Across eight natural and medical benchmarks, StructSAM reduces encoder FLOPs by 25-30% (up to 40%+ with prompt-aware merging) with minor drops in mIoU/Dice, consistently outperforming ToMe, PiToMe, ToMeSD, VidToMe, and ALGM at the same compute.
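
The merge-unmerge idea in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract gives no formulas, so the energy definition (finite-difference gradient magnitude), the fixed `cell` grid, the `flat_thresh` cutoff, the optional `protect` mask standing in for prompt regions, and the mean-pooling merge rule are all assumptions chosen for illustration, as are the function names.

```python
import numpy as np


def token_energy(feats):
    """Assumed energy score: first-order finite-difference gradient magnitude.

    feats: (H, W, C) grid of token features.
    Returns an (H, W) array, one score per token.
    """
    gy, gx = np.gradient(feats, axis=(0, 1))
    return np.sqrt((gx ** 2 + gy ** 2).sum(axis=-1))


def merge_flat_cells(feats, cell=2, flat_thresh=0.1, protect=None):
    """Merge tokens inside flat grid cells toward the lowest-energy token.

    A cell is treated as flat when every token in it has energy below
    flat_thresh (hypothetical criterion). Cells touching a protected token
    (e.g. near a prompt) are left untouched.
    Returns (merged_tokens, inverse) where inverse maps each original
    token to its row in merged_tokens, which is what unmerging needs.
    """
    H, W, C = feats.shape
    energy = token_energy(feats)
    if protect is None:
        protect = np.zeros((H, W), dtype=bool)

    # assign[i, j] = flat index of the destination token representing (i, j)
    assign = np.arange(H * W).reshape(H, W)
    for i in range(0, H - cell + 1, cell):
        for j in range(0, W - cell + 1, cell):
            e = energy[i:i + cell, j:j + cell]
            if e.max() >= flat_thresh or protect[i:i + cell, j:j + cell].any():
                continue  # boundary-like or prompt-sensitive cell: keep all tokens
            di, dj = np.unravel_index(e.argmin(), e.shape)
            assign[i:i + cell, j:j + cell] = (i + di) * W + (j + dj)

    _, inverse = np.unique(assign.ravel(), return_inverse=True)
    n_kept = inverse.max() + 1

    # Each merged token is the mean of the tokens assigned to it.
    sums = np.zeros((n_kept, C), dtype=float)
    np.add.at(sums, inverse, feats.reshape(H * W, C))
    counts = np.bincount(inverse, minlength=n_kept)[:, None]
    return sums / counts, inverse


def unmerge(merged, inverse, H, W):
    """Token recovery: copy each merged token back to all its source positions."""
    return merged[inverse].reshape(H, W, -1)


if __name__ == "__main__":
    # Toy 4x4 feature grid: flat on the left, a vertical edge on the right.
    feats = np.zeros((4, 4, 2))
    feats[:, 3, :] = 1.0
    merged, inverse = merge_flat_cells(feats)
    restored = unmerge(merged, inverse, 4, 4)
    print(merged.shape, restored.shape)
```

In this toy setup only the flat left half is merged, so the encoder would see fewer tokens while the recovered grid keeps its original resolution for the mask decoder. A real system would merge and unmerge around attention blocks rather than once on raw features.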
Problem

Research questions and friction points this paper is trying to address.

token merging
Segment Anything Model
boundary preservation
prompt leakage
computational efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

token merging
Segment Anything Model
structure preservation
spectral graph coarsening
prompt-aware merging
Authors

Duy M. H. Nguyen
University of Stuttgart, Germany

Tuan A. Tran
German Research Center for Artificial Intelligence (DFKI)

Duong Nguyen
German Research Center for Artificial Intelligence (DFKI)

Siwei Xie
University of Stuttgart, Germany

Trung Q. Nguyen
German Research Center for Artificial Intelligence (DFKI)

Mai T. N. Truong
German Research Center for Artificial Intelligence (DFKI)

Daniel Palenicek
PhD student at Technische Universität Darmstadt
Reinforcement Learning, Machine Learning, Artificial Intelligence

An T. Le
VinRobotics, Vietnam

Michael Barz
German Research Center for Artificial Intelligence (DFKI)
Eye Tracking, Multimodal Interaction, Gaze-based Interaction

TrungTin Nguyen
Postdoctoral Research Fellow at Queensland University of Technology, Australia
Artificial Intelligence, Statistics, Machine Learning, Mixture Modelling, Clustering Techniques

Tuan Dam
Hanoi University of Science and Technology (HUST), Vietnam

Ngan Le
University of Arkansas
Artificial Intelligence, Machine Learning, Computer Vision

Minh Vu
Automation & Control Institute, Austria

Khoa Doan
VinUniversity, Vietnam

Vien Ngo
VinRobotics, Vietnam

Pengtao Xie
Associate Professor, UC San Diego; Adjunct Faculty, MBZUAI
Machine Learning

James Zou
Stanford University
Machine learning, computational biology, computational health, statistics, biotech

Daniel Sonntag
DFKI and University of Oldenburg
Interactive Machine Learning, Intelligent User Interfaces, Multimodal Interaction

Jan Peters
Professor for Intelligent Autonomous Systems/TU Darmstadt, Dept. Head/German AI Research Center DFKI
Robot Learning, Reinforcement Learning, Machine Learning, Robotics, Biomimetic Systems

Mathias Niepert
University of Stuttgart & NEC Labs Europe
Machine learning