🤖 AI Summary
To address the lack of input awareness and task-specific modeling capability in existing parameter-efficient fine-tuning (PEFT) adapters, this paper proposes a dynamic input-conditioned Transformer architecture. The core innovation is the input-Conditioned Network (iCoN), which generates instance-specific, channel-wise dynamic convolutional kernels for fine-grained, input-adaptive feature modulation. The method fine-tunes only 1.6%–2.8% of the backbone parameters, yet matches full fine-tuning on depth estimation and semantic segmentation, and outperforms full fine-tuning on image classification and instance segmentation. It consistently surpasses mainstream PEFT approaches, including LoRA and Adapter, across diverse downstream tasks. By enabling input-aware, task-adaptive representation learning with minimal parameter overhead, the proposed method substantially enhances the generalization capability and expressive power of PEFT across heterogeneous vision tasks.
📝 Abstract
Transfer learning based on full fine-tuning (FFT) of a pre-trained encoder and a task-specific decoder becomes increasingly costly as deep models grow exponentially in size. Parameter-efficient fine-tuning (PEFT) approaches using adapters composed of small learnable layers have emerged as an alternative to FFT, achieving comparable performance while maintaining high training efficiency. However, the inflexibility of the adapter with respect to input instances limits its capability to learn task-specific information across diverse downstream tasks. In this paper, we propose a novel PEFT approach, the input-Conditioned transFormer, termed iConFormer, which leverages a dynamic adapter conditioned on the input instances. To secure flexible learning ability over input instances in various downstream tasks, we introduce an input-Conditioned Network (iCoN) in the dynamic adapter that enables instance-level feature transformation. Specifically, iCoN generates channel-wise convolutional kernels for each feature and transforms it through an adaptive convolution process to effectively capture task-specific and fine-grained details tailored to downstream tasks. Experimental results demonstrate that by tuning just 1.6% to 2.8% of the Transformer backbone parameters, iConFormer achieves performance comparable to FFT in monocular depth estimation and semantic segmentation, while outperforming it in image classification and instance segmentation. The proposed method also consistently outperforms recent PEFT methods on all of the tasks above.
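To make the mechanism concrete, below is a minimal PyTorch sketch of the idea described above: an adapter whose kernel-generating branch predicts an instance-specific, channel-wise (depthwise) convolution kernel from the input feature, applies it via a grouped convolution, and then passes the result through a standard bottleneck projection with a residual connection. All names (`iCoNAdapter`, `kernel_gen`, the pooled-context conditioning, and the bottleneck sizes) are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class iCoNAdapter(nn.Module):
    """Hypothetical sketch of a dynamic, input-conditioned adapter (iCoN-style).

    A small generator predicts one k x k kernel per channel, per instance,
    from globally pooled context; the kernels are applied as a depthwise
    convolution, followed by a conventional adapter bottleneck.
    """

    def __init__(self, channels: int, bottleneck: int = 16, k: int = 3):
        super().__init__()
        self.k = k
        # Standard adapter bottleneck: down-project, nonlinearity, up-project.
        self.down = nn.Linear(channels, bottleneck)
        self.up = nn.Linear(bottleneck, channels)
        # Kernel generator: pooled feature -> one k*k kernel per channel.
        self.kernel_gen = nn.Linear(channels, channels * k * k)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        B, C, H, W = x.shape
        # Predict instance-specific depthwise kernels from pooled context.
        ctx = x.mean(dim=(2, 3))                               # (B, C)
        kernels = self.kernel_gen(ctx).view(B * C, 1, self.k, self.k)
        # Per-instance channel-wise convolution via the grouped-conv trick:
        # fold the batch into the channel dimension and set groups = B * C.
        y = F.conv2d(x.reshape(1, B * C, H, W), kernels,
                     padding=self.k // 2, groups=B * C)
        y = y.view(B, C, H, W)
        # Bottleneck projection applied token-wise over the channel dim.
        z = y.permute(0, 2, 3, 1)                              # (B, H, W, C)
        z = self.up(F.gelu(self.down(z))).permute(0, 3, 1, 2)
        return x + z                                           # residual output

adapter = iCoNAdapter(channels=64)
out = adapter(torch.randn(2, 64, 14, 14))
print(out.shape)  # torch.Size([2, 64, 14, 14])
```

Because the kernels are a function of each input instance, every sample is filtered differently, which is the flexibility that a static adapter lacks; only the generator and bottleneck weights are trained, keeping the parameter overhead small.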