Rethinking Vision Transformer for Object Centric Foundation Models

📅 2025-02-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing SAM-based methods suffer from high computational overhead and weak spatial selectivity when segmenting small objects in high-resolution images, as they require multi-level full-image encoding. To address this, we propose FLIP (Fovea-Like Input Patching), a biologically inspired, saccade-like patching mechanism that achieves object-centric input alignment at the very early stage of ViT encoding—enabling the first end-to-end decoupling of positional encoding from object semantics in the ViT frontend. Leveraging lightweight position-aware decoupled encoding and an object-centered architecture, FLIP significantly improves both efficiency and accuracy for small-object segmentation. On Hypersim, KITTI-360, and OpenImages, FLIP matches SAM’s IoU while drastically reducing computational cost; it consistently outperforms FastSAM across benchmarks; and it achieves state-of-the-art performance on our newly curated, densely annotated small-object dataset.

Technology Category

Application Category

📝 Abstract
Recent state-of-the-art object segmentation mechanisms, such as the Segment Anything Model (SAM) and FastSAM, first encode the full image over several layers and then focus on generating the mask for one particular object or area. We present an off-grid Fovea-Like Input Patching (FLIP) approach, which selects image input and encodes it from the beginning in an object-focused manner. While doing so, it separates locational encoding from an object-centric perceptual code. FLIP is more data-efficient and yields improved segmentation performance when masking relatively small objects in high-resolution visual scenes. On standard benchmarks such as Hypersim, KITTI-360, and OpenImages, FLIP achieves Intersection over Union (IoU) scores that approach the performance of SAM with much less compute effort. It surpasses FastSAM in all IoU measurements. We also introduce an additional semi-natural but highly intuitive dataset where FLIP outperforms SAM and FastSAM overall and particularly on relatively small objects. Seeing that FLIP is an end-to-end object-centric segmentation approach, it has high potential particularly for applications that benefit from computationally efficient, spatially highly selective object tracking.
Problem

Research questions and friction points this paper is trying to address.

Enhances object segmentation efficiency
Focuses on small object masking
Reduces computational effort significantly
Innovation

Methods, ideas, or system contributions that make the work stand out.

Object-focused Fovea-Like Input Patching
Separates locational from perceptual encoding
Data-efficient, improved small object segmentation
🔎 Similar Papers
No similar papers found.