Make me an Expert: Distilling from Generalist Black-Box Models into Specialized Models for Semantic Segmentation

📅 2025-08-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Semantic segmentation model distillation faces significant challenges under black-box API constraints, where only one-hot predictions are accessible and model weights, gradients, and training data are unavailable; open-vocabulary models are further hampered by input-resolution sensitivity (the "curse of resolution"). Method: The paper introduces the first black-box distillation paradigm: the Black-Box Distillation (B2D) setting together with ATGC, a method that dynamically selects the optimal inference scale using DINOv2 self-supervised attention maps and scores those maps with entropy to identify information-rich scales for high-quality pseudo-label generation, enabling knowledge transfer without logits or internal model states. Contribution/Results: The approach substantially outperforms existing methods under black-box supervision across multiple semantic segmentation benchmarks; leveraging only one-hot API outputs, it achieves efficient, high-fidelity adaptation of local models, demonstrating strong practicality and generalizability.
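
To make the scale-scoring step concrete, here is a minimal PyTorch sketch (not the authors' released code): it rescales an image, extracts a CLS-to-patch attention map from a DINO-style backbone, and scores each candidate scale by Shannon entropy. The `get_last_selfattention` helper mirrors the original DINO repository's API and is an assumption here; DINOv2 checkpoints may require a different extraction routine, and input sizes must be divisible by the backbone's patch size.

```python
import torch
import torch.nn.functional as F

def attention_entropy(attn: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Shannon entropy of a non-negative attention map (higher = more spread out)."""
    p = attn.flatten()
    p = p / (p.sum() + eps)
    return -(p * (p + eps).log()).sum()

@torch.no_grad()
def score_scales(image: torch.Tensor, dino, scales=(0.5, 1.0, 2.0)) -> dict:
    """Entropy score per candidate scale for one image tensor of shape (C, H, W)."""
    scores = {}
    for s in scales:
        x = F.interpolate(image[None], scale_factor=s,
                          mode="bilinear", align_corners=False)
        # Assumed extraction routine (DINO-style API): attention of the last
        # block, shape (1, heads, tokens, tokens).
        attn = dino.get_last_selfattention(x)
        cls_to_patches = attn[0, :, 0, 1:].mean(0)  # average heads, drop CLS token
        scores[s] = attention_entropy(cls_to_patches).item()
    return scores
```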

📝 Abstract
The rise of Artificial Intelligence as a Service (AIaaS) democratizes access to pre-trained models via Application Programming Interfaces (APIs), but also raises a fundamental question: how can local models be effectively trained using black-box models that do not expose their weights, training data, or logits, a constraint under which current domain adaptation paradigms are impractical? To address this challenge, we introduce the Black-Box Distillation (B2D) setting, which enables local model adaptation under realistic constraints: (1) the API model is open-vocabulary and trained on large-scale general-purpose data, and (2) access is limited to one-hot predictions only. We identify that open-vocabulary models exhibit significant sensitivity to input resolution, with different object classes being segmented optimally at different scales, a limitation termed the "curse of resolution". Our method, ATtention-Guided sCaler (ATGC), addresses this challenge by leveraging DINOv2 attention maps to dynamically select optimal scales for black-box model inference. ATGC scores the attention maps with entropy to identify informative scales for pseudo-labelling, enabling effective distillation. Experiments demonstrate substantial improvements under black-box supervision across multiple datasets while requiring only one-hot API predictions. Our code is available at https://github.com/yasserben/ATGC.
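
The supervision signal described in the abstract can be sketched as follows, assuming a black-box endpoint that returns only per-pixel class indices (argmax outputs, no logits). The `query_api` function is a hypothetical placeholder for the remote open-vocabulary API; the student is any local segmentation network trained with plain cross-entropy on the hard pseudo-labels.

```python
import torch
import torch.nn.functional as F

def query_api(image: torch.Tensor) -> torch.Tensor:
    """Hypothetical black-box call: returns a (H, W) long tensor of class
    indices, i.e. one-hot/argmax-only output with no logits or gradients."""
    raise NotImplementedError("stand-in for the remote open-vocabulary API")

def distill_step(student, optimizer, image: torch.Tensor) -> float:
    """One distillation step: hard pseudo-labels from the API supervise the
    local student via standard cross-entropy."""
    pseudo_label = query_api(image)      # (H, W) class indices
    logits = student(image[None])        # (1, num_classes, H, W)
    loss = F.cross_entropy(logits, pseudo_label[None])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```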
Problem

Research questions and friction points this paper is trying to address.

Distilling knowledge from black-box API models without weights or logits
Addressing resolution sensitivity in open-vocabulary segmentation models
Enabling effective local model training with limited one-hot predictions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages DINOv2 attention maps for scale selection
Uses entropy scoring for optimal pseudo-label generation
Enables distillation using only one-hot API predictions (end-to-end sketch below)
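
A minimal end-to-end sketch tying these pieces together, reusing the hypothetical `score_scales` and `query_api` helpers from the sketches above: pick the scale whose attention map scores best by entropy, query the API at that scale, and resize the hard labels back to the training resolution. Whether "informative" means the highest- or lowest-entropy scale is a design choice of the actual method; the highest-entropy scale is used here purely for illustration.

```python
import torch.nn.functional as F

def pseudo_label_at_best_scale(image, dino, scales=(0.5, 1.0, 2.0)):
    """Generate a pseudo-label for one (C, H, W) image at the selected scale."""
    scores = score_scales(image, dino, scales)  # entropy per scale (sketch above)
    best = max(scores, key=scores.get)          # assumption: higher entropy = richer
    resized = F.interpolate(image[None], scale_factor=best,
                            mode="bilinear", align_corners=False)[0]
    label = query_api(resized)                  # one-hot API output at that scale
    # Resize hard labels back with nearest-neighbour interpolation so the
    # integer class indices remain valid.
    return F.interpolate(label[None, None].float(), size=image.shape[-2:],
                         mode="nearest").long()[0, 0]
```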