MOBIUS: Big-to-Mobile Universal Instance Segmentation via Multi-modal Bottleneck Fusion and Calibrated Decoder Pruning

📅 2025-10-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper targets the high computational cost of large instance segmentation models on resource-constrained platforms such as mobile devices. It introduces MOBIUS, a family of efficient models built on three components: a bottleneck pixel decoder that fuses multi-scale visual features with language features; a language-guided uncertainty calibration loss that enables adaptive pruning of the Transformer decoder; and a streamlined, unified training strategy. MOBIUS reduces pixel decoder and Transformer decoder FLOPs by up to 55% and 75%, respectively, while maintaining state-of-the-art segmentation accuracy in a third of the training iterations. The models are designed for deployment across hardware ranging from high-end accelerators to mobile devices.

📝 Abstract

Scaling up model size and training data has advanced foundation models for instance-level perception, achieving state-of-the-art in-domain and zero-shot performance across object detection and segmentation. However, their high computational cost limits adoption on resource-constrained platforms. We first examine the limitations of existing architectures in enabling efficient edge deployment without compromising performance. We then introduce MOBIUS, a family of foundation models for universal instance segmentation, designed for Pareto-optimal downscaling to support deployment across devices ranging from high-end accelerators to mobile hardware. To reduce training and inference demands, we propose: (i) a bottleneck pixel decoder for efficient multi-scale and multi-modal fusion, (ii) a language-guided uncertainty calibration loss for adaptive decoder pruning, and (iii) a streamlined, unified training strategy. Unlike efficient baselines that trade accuracy for reduced complexity, MOBIUS reduces pixel and transformer decoder FLOPs by up to 55% and 75%, respectively, while maintaining state-of-the-art performance in just a third of the training iterations. MOBIUS establishes a new benchmark for efficient segmentation on both high-performance computing platforms and mobile devices.
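To make the quoted reductions concrete, the arithmetic can be sketched as follows. The baseline GFLOP figures are hypothetical placeholders chosen for illustration and are not taken from the paper; only the 55% and 75% upper bounds come from the abstract.

```python
def remaining_flops(baseline_gflops: float, reduction: float) -> float:
    """Return the cost left after removing `reduction` (a fraction in [0, 1])."""
    if not 0.0 <= reduction <= 1.0:
        raise ValueError("reduction must be a fraction between 0 and 1")
    return baseline_gflops * (1.0 - reduction)


# Hypothetical baseline costs in GFLOPs (assumed values, NOT from the paper).
baseline = {"pixel_decoder": 100.0, "transformer_decoder": 40.0}

# Upper-bound reductions quoted in the abstract ("up to 55% and 75%").
reduction = {"pixel_decoder": 0.55, "transformer_decoder": 0.75}

# Cost of each decoder after applying its reduction.
after = {name: remaining_flops(baseline[name], reduction[name]) for name in baseline}

# Combined saving over the two decoders only; the backbone is not included.
total_saving = 1.0 - sum(after.values()) / sum(baseline.values())

for name, gflops in after.items():
    print(f"{name}: {baseline[name]:.1f} -> {gflops:.1f} GFLOPs")
print(f"combined decoder saving: {total_saving:.1%}")
```

Under these assumed baselines the two decoders together would cost about 61% less. The combined figure depends on how the baseline cost is split between the decoders, and it says nothing about backbone cost, which the abstract does not quantify.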

Problem

Research questions and friction points this paper is trying to address.

Enabling efficient edge deployment of instance segmentation models without performance loss

Reducing computational costs for universal instance segmentation on mobile devices

Achieving Pareto-optimal downscaling for resource-constrained hardware platforms

Innovation

Methods, ideas, or system contributions that make the work stand out.

Bottleneck pixel decoder enables efficient multi-modal fusion

Language-guided uncertainty calibration loss enables adaptive decoder pruning

Streamlined unified training strategy reduces computational demands
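The link between calibrated uncertainty and decoder pruning can be illustrated with a generic confidence-threshold early exit. This is a minimal sketch under stated assumptions, not the MOBIUS method: this page does not describe the paper's loss or pruning rule, and `toy_layer`, `run_decoder_with_early_exit`, and the threshold values are hypothetical names and numbers introduced only for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple

# A decoder layer maps a state to (new_state, class_logits). Hypothetical interface.
Layer = Callable[[float], Tuple[float, List[float]]]


def confidence(logits: Sequence[float]) -> float:
    """Maximum softmax probability, computed in a numerically stable way."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    return max(exps) / sum(exps)


def run_decoder_with_early_exit(
    layers: Sequence[Layer], query: float, threshold: float
) -> Tuple[float, int]:
    """Run layers in order and stop once the prediction is confident enough.

    Returns the final state and the number of layers actually executed.
    """
    state = query
    for depth, layer in enumerate(layers, start=1):
        state, logits = layer(state)
        if confidence(logits) >= threshold:
            return state, depth  # remaining layers are skipped
    return state, len(layers)


def toy_layer(state: float) -> Tuple[float, List[float]]:
    """Toy stand-in for a decoder layer: each pass sharpens the first logit."""
    new_state = state + 1.0
    return new_state, [new_state, 0.0, 0.0]


layers = [toy_layer] * 6
state, used = run_decoder_with_early_exit(layers, query=0.0, threshold=0.9)
print(f"layers used: {used} of {len(layers)}")
```

An exit rule of this kind is only safe when the confidence score tracks actual correctness, which is why a calibration objective and pruning are natural to train together. With a poorly calibrated model the same threshold would either exit too early and lose accuracy or never exit and save nothing.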
