A Stepwise Distillation Learning Strategy for Non-differentiable Visual Programming Frameworks on Visual Reasoning Tasks

📅 2023-09-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Visual programming (VProg) is an important paradigm for visual reasoning (VR) due to its interpretability and cross-task generalization, yet its non-differentiable nature hinders task-specific adaptation, resulting in substantially lower performance than dedicated models; introducing task-specific modules, however, compromises generalization. This paper proposes Stepwise Distillation for Visual Programming (SDVP), the first method enabling cross-task-compatible knowledge distillation for non-differentiable VProg frameworks. SDVP performs hierarchical knowledge transfer at the subtask level—distilling from lightweight task-specific models into the large pre-trained vision-language models (VLMs) invoked by VProg—thereby enhancing task performance without sacrificing generalization. On GQA and NLVRv2, SDVP improves VisProg by 2.4% and 6.2%, and ViperGPT by 6.5% and 4.0%, respectively, while preserving strong generalization to unseen VR tasks.
📝 Abstract
Recently, Visual Programming (VProg) has emerged as a significant framework for visual reasoning (VR) tasks due to its interpretability and cross-task generality. However, even when invoking powerful pre-trained Vision-Language models (VLMs) as visual sub-modules, the performance of VProg on specific VR tasks is markedly inferior to that of well-trained task-specific networks. Although invoking task-specific models can further enhance the performance of VProg on specific VR tasks, doing so greatly diminishes the cross-task generalization ability of VProg. Besides, the non-differentiable nature of VProg prevents direct fine-tuning on specific VR tasks for further performance improvement. To address these issues, we propose SDVP, a Stepwise Distillation learning strategy for non-differentiable VProg across various VR tasks. Specifically, our SDVP stepwise distills the capabilities of existing, well-trained small task-specific models for decomposed visual sub-tasks in VProg into the much larger VLMs invoked by the corresponding visual sub-modules. Moreover, distilling the knowledge of small task-specific models into pre-trained larger VLMs, rather than replacing them, helps preserve the cross-task abilities of VProg. Extensive and comprehensive experimental results on different VProg frameworks demonstrate that our SDVP obtains significant performance gains on specific VR benchmarks, i.e., GQA (+2.4%) and NLVRv2 (+6.2%) for VisProg and GQA (+6.5%) and NLVRv2 (+4.0%) for ViperGPT, and also maintains promising performance for VProg on unseen and previous VR tasks.
Problem

Research questions and friction points this paper is trying to address.

Improving the performance of Visual Programming on specific visual reasoning tasks.
Maintaining cross-task generalization in Visual Programming frameworks.
Enabling fine-tuning of non-differentiable Visual Programming frameworks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Stepwise distillation for VProg
Distills task-specific models into VLMs
Maintains cross-task generalization ability
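The core idea above — transferring knowledge from a small task-specific teacher into a larger VLM sub-module — would, in a standard formulation, use a soft-label distillation loss per decomposed sub-task. The paper's exact objective is not reproduced here; the following is only a minimal sketch of temperature-scaled distillation (all function names are illustrative, not from the paper):

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2
    so gradient magnitudes stay comparable across temperatures."""
    p = softmax(teacher_logits, temperature)  # teacher: small task-specific model
    q = softmax(student_logits, temperature)  # student: VLM sub-module
    kl = (p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(axis=-1)
    return float(kl.mean() * temperature ** 2)
```

In a stepwise scheme like SDVP's, a loss of this shape would be applied separately at each visual sub-task invoked by the program, rather than end-to-end through the non-differentiable program itself.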