🤖 AI Summary
To address the high computational cost of large-scale instance segmentation models on resource-constrained platforms (e.g., mobile devices), this paper proposes MOBIUS, a family of efficient foundation models for universal instance segmentation. The method introduces a bottleneck pixel decoder for efficient multi-scale and multi-modal fusion of visual and linguistic features; a language-guided uncertainty calibration loss that enables adaptive pruning of the Transformer decoder; and a streamlined, unified training strategy. Experiments show that, in a third of the training iterations, the approach reduces pixel decoder and Transformer decoder FLOPs by up to 55% and 75%, respectively, while maintaining state-of-the-art segmentation accuracy. This makes the models practical to deploy across hardware ranging from high-end accelerators to mobile devices.
📝 Abstract
Scaling up model size and training data has advanced foundation models for instance-level perception, achieving state-of-the-art in-domain and zero-shot performance across object detection and segmentation. However, their high computational cost limits adoption on resource-constrained platforms. We first examine the limitations of existing architectures in enabling efficient edge deployment without compromising performance. We then introduce MOBIUS, a family of foundation models for universal instance segmentation, designed for Pareto-optimal downscaling to support deployment across devices ranging from high-end accelerators to mobile hardware. To reduce training and inference demands, we propose: (i) a bottleneck pixel decoder for efficient multi-scale and multi-modal fusion, (ii) a language-guided uncertainty calibration loss for adaptive decoder pruning, and (iii) a streamlined, unified training strategy. Unlike efficient baselines that trade accuracy for reduced complexity, MOBIUS reduces pixel and transformer decoder FLOPs by up to 55% and 75%, respectively, while maintaining state-of-the-art performance in just a third of the training iterations. MOBIUS establishes a new benchmark for efficient segmentation on both high-performance computing platforms and mobile devices.
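The abstract does not say how the adaptive decoder pruning is carried out at inference time. One common way to realize such pruning is confidence-thresholded early exit, sketched below purely as an illustration and not as the MOBIUS implementation. The sketch assumes each decoder layer can report a confidence score in [0, 1]; the names `decode_with_early_exit`, `make_toy_layer`, and `threshold` are hypothetical. It also suggests why a calibration loss would matter: a fixed threshold is only meaningful if the reported confidences track actual prediction quality.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical types for illustration: a "layer" refines the query state
# and reports a confidence in [0, 1] for its current prediction.
State = List[float]
Layer = Callable[[State], Tuple[State, float]]


def decode_with_early_exit(
    state: State, layers: Sequence[Layer], threshold: float = 0.9
) -> Tuple[State, int]:
    """Run decoder layers in order and stop once confidence reaches the threshold.

    Returns the final state and the number of layers actually executed.
    """
    used = 0
    for layer in layers:
        state, confidence = layer(state)
        used += 1
        if confidence >= threshold:
            break  # remaining layers are skipped, saving their compute
    return state, used


def make_toy_layer(confidence: float) -> Layer:
    """Build a stand-in layer that adds 1 to every entry and reports a fixed confidence."""
    def layer(state: State) -> Tuple[State, float]:
        return [x + 1.0 for x in state], confidence
    return layer


# Three toy layers with increasing (assumed, made-up) confidence values.
layers = [make_toy_layer(c) for c in (0.5, 0.95, 0.99)]
final_state, layers_used = decode_with_early_exit([0.0, 0.0], layers, threshold=0.9)
print(layers_used, final_state)
```

In this toy setup the second layer already clears the threshold, so the third is never run. Raising `threshold` trades compute for caution, which is the kind of accuracy/efficiency dial the abstract alludes to.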