MUFASA: A Multi-Layer Framework for Slot Attention

📅 2026-02-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a key limitation in existing unsupervised object-centric learning methods, which rely solely on the final-layer features of Vision Transformers (ViTs) and thereby overlook the rich semantic information embedded in intermediate layers, constraining segmentation performance. To overcome this, we propose MUFASA—a lightweight, plug-and-play framework that, for the first time, applies slot attention in parallel across multiple ViT encoder layers and integrates the resulting slot representations through a cross-layer slot fusion strategy to construct a unified object-centric representation. MUFASA is fully compatible with current unsupervised object-centric learning paradigms, achieving state-of-the-art segmentation performance across multiple benchmarks while accelerating training convergence and introducing only minimal inference overhead.

📝 Abstract
Unsupervised object-centric learning (OCL) decomposes visual scenes into distinct entities. Slot attention is a popular approach that represents individual objects as latent vectors, called slots. Current methods obtain these slot representations solely from the last layer of a pre-trained vision transformer (ViT), ignoring valuable, semantically rich information encoded across the other layers. To better utilize this latent semantic information, we introduce MUFASA, a lightweight plug-and-play framework for slot attention-based approaches to unsupervised object segmentation. Our model computes slot attention across multiple feature layers of the ViT encoder, fully leveraging their semantic richness. We propose a fusion strategy to aggregate slots obtained on multiple layers into a unified object-centric representation. Integrating MUFASA into existing OCL methods improves their segmentation results across multiple datasets, setting a new state of the art while simultaneously improving training convergence with only minor inference overhead.
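The abstract's core idea (running slot attention in parallel on several ViT layers, then fusing the per-layer slots) can be illustrated with a minimal sketch. This is not the paper's implementation: the real method uses learned projections, a GRU slot update, and a dedicated cross-layer fusion module, none of which are specified here. All names are hypothetical, the slot update is a bare softmax-weighted mean, and fusion is approximated by simple averaging.

```python
import numpy as np

def slot_attention(features, slots, n_iters=3, eps=1e-8):
    """Minimal slot attention: slots compete (softmax over slots) for
    feature tokens, then are updated as weighted means of the features.
    The real module would add learned q/k/v projections and a GRU."""
    for _ in range(n_iters):
        # attention logits between slots and feature tokens
        logits = slots @ features.T / np.sqrt(features.shape[1])
        attn = np.exp(logits - logits.max(axis=0, keepdims=True))
        attn = attn / (attn.sum(axis=0, keepdims=True) + eps)  # softmax over slots
        # renormalize per slot, then update each slot as a weighted mean
        weights = attn / (attn.sum(axis=1, keepdims=True) + eps)
        slots = weights @ features
    return slots

rng = np.random.default_rng(0)
n_tokens, dim, n_slots = 16, 8, 4
# hypothetical stand-ins for token features from three ViT encoder layers
layer_features = [rng.normal(size=(n_tokens, dim)) for _ in range(3)]
init_slots = rng.normal(size=(n_slots, dim))

# slot attention applied in parallel per layer, then fused (here: mean)
per_layer_slots = [slot_attention(f, init_slots) for f in layer_features]
fused_slots = np.mean(per_layer_slots, axis=0)
print(fused_slots.shape)  # (4, 8)
```

The sketch only conveys the data flow: the same initial slots are bound to each layer's features independently, and a fusion step collapses the per-layer slot sets into one object-centric representation.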
Problem

Research questions and friction points this paper is trying to address.

unsupervised object-centric learning
slot attention
vision transformer
multi-layer features
object segmentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-layer slot attention
Unsupervised object-centric learning
Vision Transformer
Feature fusion
Object segmentation
Sebastian Bock
TU Darmstadt; Zuse School ELIZA
Leonie Schüßler
TU Darmstadt; Zuse School ELIZA
Krishnakant Singh
TU Darmstadt
Simone Schaub-Meyer
Assistant Professor @ TU Darmstadt | Hessian.AI
Computer Vision · Unsupervised Learning · Explainable AI
Stefan Roth
Professor of Computer Science, TU Darmstadt
Computer Vision · Machine Learning