Efficient Domain-Adaptive Multi-Task Dense Prediction with Vision Foundation Models

📅 2025-09-28
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the performance degradation that domain shift causes in robotic multi-task dense prediction (semantic segmentation and depth estimation) under cross-domain deployment, this paper proposes FAMDA, the first framework to integrate Vision Foundation Models (VFMs) into multi-task Unsupervised Domain Adaptation (UDA). FAMDA employs VFMs as teacher models and introduces a multi-task self-training paradigm that generates high-confidence pseudo-labels, jointly optimizing knowledge distillation and domain adaptation. Compared with prevailing adversarial UDA methods, FAMDA achieves state-of-the-art performance across multiple synthetic-to-real multi-task UDA benchmarks. Its lightweight student model has more than 10× fewer parameters than the foundation-model teachers while significantly improving inference efficiency. Moreover, FAMDA demonstrates superior generalization under severe distribution shifts, such as day-to-night transitions.

πŸ“ Abstract
Multi-task dense prediction, which aims to jointly solve tasks like semantic segmentation and depth estimation, is crucial for robotics applications but suffers from domain shift when deploying models in new environments. While unsupervised domain adaptation (UDA) addresses this challenge for single tasks, existing multi-task UDA methods primarily rely on adversarial learning approaches that are less effective than recent self-training techniques. In this paper, we introduce FAMDA, a simple yet effective UDA framework that bridges this gap by leveraging Vision Foundation Models (VFMs) as powerful teachers. Our approach integrates Segmentation and Depth foundation models into a self-training paradigm to generate high-quality pseudo-labels for the target domain, effectively distilling their robust generalization capabilities into a single, efficient student network. Extensive experiments show that FAMDA achieves state-of-the-art (SOTA) performance on standard synthetic-to-real UDA multi-task learning (MTL) benchmarks and a challenging new day-to-night adaptation task. Our framework enables the training of highly efficient models; a lightweight variant achieves SOTA accuracy while being more than 10× smaller than foundation models, highlighting FAMDA's suitability for creating domain-adaptive and efficient models for resource-constrained robotics applications.
Problem

Research questions and friction points this paper is trying to address.

Addressing domain shift in multi-task dense prediction for robotics
Improving unsupervised domain adaptation with Vision Foundation Models
Creating efficient domain-adaptive models for resource-constrained applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages Vision Foundation Models as teacher networks
Uses self-training paradigm with pseudo-label generation
Distills knowledge into efficient single student network
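The self-training step above hinges on keeping only the teacher's confident predictions as training targets for the student. The paper does not publish its exact procedure here, so the following is a minimal, illustrative sketch of confidence-thresholded pseudo-labeling for segmentation; the function name, array shapes, and the 0.9 threshold are assumptions, not the authors' implementation.

```python
import numpy as np

def generate_pseudo_labels(teacher_probs, threshold=0.9):
    """Keep only high-confidence teacher predictions as pseudo-labels.

    teacher_probs: (H, W, C) softmax output of a frozen teacher model
    on an unlabeled target-domain image. Returns per-pixel class labels
    and a boolean mask marking pixels confident enough to supervise the
    student; masked-out pixels would be ignored in the student loss.
    """
    confidence = teacher_probs.max(axis=-1)   # (H, W) peak probability
    labels = teacher_probs.argmax(axis=-1)    # (H, W) predicted class
    mask = confidence >= threshold            # drop uncertain pixels
    return labels, mask

# Toy 2x2 image with 3 classes (values are illustrative only).
probs = np.array([[[0.95, 0.03, 0.02],   # confident -> kept
                   [0.40, 0.35, 0.25]],  # uncertain -> masked out
                  [[0.10, 0.85, 0.05],   # below 0.9 -> masked out
                   [0.01, 0.02, 0.97]]]) # confident -> kept
labels, mask = generate_pseudo_labels(probs, threshold=0.9)
```

A depth teacher could be filtered analogously (e.g. by prediction variance instead of softmax confidence), with both tasks' pseudo-labels supervising the shared student network.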