Enhancing Visual Programming for Visual Reasoning via Probabilistic Graphs

📅 2025-12-16
🤖 AI Summary
Visual programming (VP) frameworks face two key challenges: non-differentiable execution blocks end-to-end optimization, and the absence of sub-task annotations prevents direct supervision of the modules that a program invokes. To address these, this work (EVPG) recasts VP execution as exact inference on a directed probabilistic graph built from the variable dependencies of the program. Because this inference is differentiable, gradients from the final answer can reach the pre-trained vision modules, so the whole framework can be trained end to end with task-level labels only. Experiments on three visual reasoning tasks (GQA, NLVRv2, and Open Images) show significant performance improvements for VP.

📝 Abstract
Recently, Visual Programming (VP) based on large language models (LLMs) has developed rapidly and demonstrated significant potential in complex Visual Reasoning (VR) tasks. Previous work on enhancing VP has focused primarily on improving the quality of LLM-generated visual programs. However, it has neglected to optimize the pre-trained models that VP invokes, which serve as modules for the visual sub-tasks that VP decomposes from the target task. The difficulty is that only the final labels of the target VR task are available, not labels for the sub-tasks. In addition, the non-differentiable nature of VP impedes the direct use of efficient gradient-based optimization methods that would leverage final labels for end-to-end learning of the entire VP framework. To overcome these issues, we propose EVPG, a method to Enhance Visual Programming for visual reasoning via Probabilistic Graphs. Specifically, we build a directed probabilistic graph according to the variable dependency relationships in the VP execution process, which recasts the non-differentiable VP execution process as a differentiable exact probabilistic inference process on this graph. As a result, the VP framework can use the final labels for efficient, gradient-based optimization in end-to-end supervised learning on the target VR tasks. Extensive experiments demonstrate the effectiveness and advantages of EVPG, showing significant performance improvements for VP on three classical complex VR tasks: GQA, NLVRv2, and Open Images.
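
The idea of training on final labels alone can be written out as a short formula. The notation below is an assumption made for illustration and is not taken from the paper: x is the input (image and question), z_1, ..., z_K are the intermediate program variables, pa(·) denotes a variable's parents in the directed graph (possibly including x), y is the final answer, and θ collects the parameters of the invoked modules.

```latex
% Exact marginal over all intermediate variables of the directed graph
P_\theta(y \mid x)
  = \sum_{z_1, \dots, z_K}
    P_\theta\bigl(y \mid \mathrm{pa}(y)\bigr)
    \prod_{k=1}^{K} P_\theta\bigl(z_k \mid \mathrm{pa}(z_k)\bigr)

% Loss uses only the final label y^*; no sub-task labels appear
\mathcal{L}(\theta) = -\log P_\theta\bigl(y^{*} \mid x\bigr)
```

Under this reading, the sum and products are differentiable in the module outputs, so the gradient of the loss with respect to θ is defined even though no intermediate variable z_k is ever labeled.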
Problem

Research questions and friction points this paper is trying to address.

Enhance Visual Programming for visual reasoning tasks
Optimize pre-trained models in Visual Programming without sub-task labels
Enable gradient-based end-to-end learning in non-differentiable VP frameworks
Innovation

Methods, ideas, or system contributions that make the work stand out.

A directed probabilistic graph recasts VP execution as differentiable exact inference
Gradient-based, end-to-end optimization becomes possible using final task labels only
Performance improves on GQA, NLVRv2, and Open Images
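
As a rough illustration of the points above, the following minimal sketch works through a toy visual program by hand. It is not the authors' EVPG implementation: the program ("is there a red cup?"), the two module outputs `p_exist` and `p_red`, and the assumption that candidate boxes are independent are all hypothetical choices made for this example.

```python
import math

def p_yes(p_exist, p_red):
    """Exact P(answer = yes) for the toy program
    `any(exists(box) and is_red(box) for box in boxes)`,
    assuming (hypothetically) that per-box events are independent."""
    p_none = 1.0
    for e, r in zip(p_exist, p_red):
        p_none *= 1.0 - e * r  # P(box is not a red cup)
    return 1.0 - p_none

def nll(p_exist, p_red, label):
    """Negative log-likelihood of the final label only (True = yes)."""
    p = p_yes(p_exist, p_red)
    return -math.log(p if label else 1.0 - p)

def grad_nll_wrt_exist(p_exist, p_red, label):
    """Analytic gradient of the loss w.r.t. each detector probability."""
    p = p_yes(p_exist, p_red)
    dloss_dp = -1.0 / p if label else 1.0 / (1.0 - p)
    grads = []
    for i in range(len(p_exist)):
        others = 1.0
        for j in range(len(p_exist)):
            if j != i:
                others *= 1.0 - p_exist[j] * p_red[j]
        dp_de = p_red[i] * others  # d P(yes) / d p_exist[i]
        grads.append(dloss_dp * dp_de)
    return grads

# Hypothetical module outputs for two candidate boxes.
p_exist = [0.6, 0.3]  # detector: P(box i contains a cup)
p_red = [0.5, 0.9]    # classifier: P(box i is red)

loss = nll(p_exist, p_red, label=True)
grads = grad_nll_wrt_exist(p_exist, p_red, label=True)
print(f"P(yes) = {p_yes(p_exist, p_red):.3f}, loss = {loss:.3f}")
print("d loss / d p_exist =", [round(g, 3) for g in grads])
```

With the label "yes", both gradients are negative, so a gradient step would raise the detector's confidence in each box even though no box-level label was given. In a real system the probabilities would come from neural modules and an autodiff library would replace the hand-written gradient.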
Wentao Wan
Sun Yat-sen University
Artificial Intelligence, Cognitive AI, Deep Learning, Neural-Symbolic, Question Answering
Kaiyu Wu
School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, Guangdong 510006, China
Qingyang Ma
School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, Guangdong 510006, China
Nan Kang
School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, Guangdong 510006, China
Yunjie Chen
School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, Guangdong 510006, China
Liang Lin
Fellow of IEEE/IAPR, Professor of Computer Science, Sun Yat-sen University
Embodied AI, Causal Inference and Learning, Multimodal Data Analysis
Keze Wang
School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, Guangdong 510006, China