🤖 AI Summary
Visual programming (VProg) is an important paradigm for visual reasoning (VR) due to its interpretability and cross-task generalization, yet its non-differentiable nature hinders task-specific adaptation, resulting in substantially lower performance than dedicated models; introducing task-specific modules, however, compromises generalization. This paper proposes Stepwise Distillation for Visual Programming (SDVP), the first method enabling cross-task-compatible knowledge distillation for non-differentiable VProg frameworks. SDVP performs hierarchical knowledge transfer at the subtask level—distilling from lightweight task-specific models into the large pre-trained vision-language models (VLMs) invoked by VProg—thereby enhancing task performance without sacrificing generalization. On GQA and NLVRv2, SDVP improves VisProg by 2.4% and 6.2%, and ViperGPT by 6.5% and 4.0%, respectively, while preserving strong generalization to unseen VR tasks.
📝 Abstract
Recently, Visual Programming (VProg) has emerged as a significant framework for visual reasoning (VR) tasks due to its interpretability and cross-task generality. However, even when invoking powerful pre-trained Vision-Language Models (VLMs) as visual sub-modules, the performance of VProg on specific VR tasks is markedly inferior to that of well-trained task-specific networks. Although invoking task-specific models can further enhance the performance of VProg on specific VR tasks, doing so greatly diminishes its cross-task generalization ability. Moreover, the non-differentiable nature of VProg prevents direct fine-tuning on specific VR tasks for further performance improvement. To address these issues, we propose SDVP, a Stepwise Distillation learning strategy for non-differentiable VProg across various VR tasks. Specifically, SDVP stepwise distills the capabilities of existing, well-trained small task-specific models for the decomposed visual sub-tasks in VProg into the much larger VLMs invoked by the corresponding visual sub-modules. Distilling the knowledge of small task-specific models into the larger pre-trained VLMs, rather than replacing them, preserves the cross-task abilities of VProg. Extensive and comprehensive experimental results on different VProg frameworks demonstrate that SDVP obtains significant performance gains on specific VR benchmarks, i.e., GQA (+2.4%) and NLVRv2 (+6.2%) for VisProg, and GQA (+6.5%) and NLVRv2 (+4.0%) for ViperGPT, while also maintaining promising performance for VProg on unseen and previous VR tasks.
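The abstract does not specify the distillation objective, but the described teacher/student pairing (small task-specific model as teacher, the invoked VLM sub-module as student, applied per decomposed sub-task) can be illustrated with a standard temperature-scaled distillation loss. This is a minimal sketch under that assumption, not the paper's actual formulation; the function names and the temperature value are illustrative:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of raw logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2
    as in classic knowledge distillation.

    In the SDVP setting sketched here, the teacher would be a small
    task-specific model for one decomposed visual sub-task, and the
    student the large VLM invoked by the matching VProg sub-module.
    """
    p = softmax(teacher_logits, temperature)  # teacher distribution
    q = softmax(student_logits, temperature)  # student distribution
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

# The loss vanishes when the student matches the teacher exactly,
# and grows as their softened predictions diverge.
print(distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
print(distillation_loss([3.0, 1.0, 0.0], [0.0, 1.0, 3.0]))
```

Because this loss is computed per sub-task against a differentiable student, it sidesteps the non-differentiability of the overall VProg pipeline: gradients flow only through each invoked VLM, never through the generated program itself.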