Inferring Questions from Programming Screenshots

📅 2025-04-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the challenge of ineffective retrieval and comprehension of programming screenshots in developer forums. We propose a vision-based question inference method leveraging multimodal large language models (MLLMs). Specifically, we formulate a question-generation task tailored to IDE interfaces and code screenshots, systematically evaluating LLaMA, Gemini, and GPT-4o on vision-driven problem reconstruction. Our prompt-engineering framework integrates in-context learning, chain-of-thought reasoning, and few-shot prompting. We introduce the first benchmark for question generation from programming screenshots, enabling fine-grained semantic similarity evaluation. Experimental results show that GPT-4o generates questions semantically aligned with human annotations (similarity > 60%) for 51.75% of test images, demonstrating the technical feasibility of vision-first debugging assistants. This work establishes a novel paradigm for multimodal AI-enhanced developer support.
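The prompt-engineering framework described above (in-context learning, chain-of-thought reasoning, few-shot prompting) can be sketched as a simple prompt builder. This is an illustrative reconstruction, not the paper's actual prompts: the instruction wording, the `build_prompt` function, and the example format are all assumptions.

```python
# Illustrative sketch of a few-shot, chain-of-thought prompt for inferring
# a forum question from a programming screenshot. The wording and structure
# are assumptions, not the prompts used in the paper.

def build_prompt(examples: list[tuple[str, str]], screenshot_description: str) -> str:
    """Assemble an in-context prompt: task instructions, a few worked
    (screenshot, question) examples, then the new screenshot to solve."""
    parts = [
        "You are shown a programming screenshot (an IDE view or code snippet).",
        "Infer the question the developer would post on a forum such as Stack Overflow.",
        # Chain-of-thought instruction: ask the model to reason before answering.
        "Think step by step: describe the visible code, identify the error or "
        "unexpected behavior, then state the question.",
    ]
    # Few-shot examples give the model the expected input/output format.
    for description, question in examples:
        parts.append(f"Screenshot: {description}\nQuestion: {question}")
    # The new instance, left open for the model to complete.
    parts.append(f"Screenshot: {screenshot_description}\nQuestion:")
    return "\n\n".join(parts)
```

In practice the screenshot itself would be passed to the MLLM as an image input alongside this text; the textual `Screenshot:` descriptions here stand in for that visual channel.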

📝 Abstract
The integration of generative AI into developer forums like Stack Overflow presents an opportunity to enhance problem-solving by allowing users to post screenshots of code or Integrated Development Environments (IDEs) instead of traditional text-based queries. This study evaluates the effectiveness of various large language models (LLMs), specifically LLaMA, Gemini, and GPT-4o, in interpreting such visual inputs. We employ prompt engineering techniques, including in-context learning, chain-of-thought prompting, and few-shot learning, to assess each model's responsiveness and accuracy. Our findings show that while GPT-4o shows promising capabilities, achieving over 60% similarity to baseline questions for 51.75% of the tested images, challenges remain in obtaining consistent and accurate interpretations for more complex images. This research advances our understanding of the feasibility of using generative AI for image-centric problem-solving in developer communities, highlighting both the potential benefits and current limitations of this approach while envisioning a future where visual-based debugging copilot tools become a reality.
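The headline metric in the abstract (the share of images whose generated question exceeds 60% similarity to a human-written baseline) can be sketched as follows. This is a simplified illustration: it uses token-overlap cosine similarity as a stand-in for the embedding-based semantic similarity the evaluation presumably relies on, and the function names and threshold handling are assumptions rather than the paper's code.

```python
# Toy sketch of the similarity-threshold evaluation. Real evaluations would
# use sentence embeddings; word-count cosine similarity is a crude stand-in.
from collections import Counter
from math import sqrt

def similarity(a: str, b: str) -> float:
    """Cosine similarity over word-count vectors (a simple proxy
    for semantic similarity between two questions)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def fraction_above_threshold(generated: list[str], baselines: list[str],
                             threshold: float = 0.6) -> float:
    """Fraction of test images whose generated question clears the
    similarity bar against its human-written baseline."""
    hits = sum(1 for g, b in zip(generated, baselines)
               if similarity(g, b) > threshold)
    return hits / len(generated)
```

Under this scheme, a reported figure like "over 60% similarity for 51.75% of images" corresponds to `fraction_above_threshold(...)` returning roughly 0.5175 at `threshold=0.6`.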
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' ability to interpret programming screenshots for developer forums
Assessing prompt engineering techniques for visual input accuracy in AI models
Exploring feasibility of image-centric AI problem-solving in coding communities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses LLaMA, Gemini, GPT-4o for visual input interpretation
Applies prompt engineering techniques for model assessment
Evaluates image-centric problem-solving in developer communities
Faiz Ahmed
York University, Toronto, ON, Canada
Xuchen Tan
York University, Toronto, ON, Canada
Folajinmi Adewole
York University, Toronto, ON, Canada
Suprakash Datta
York University, Toronto, ON, Canada
Maleknaz Nayebi
Associate Professor at York University
Requirements Engineering · Empirical Software Engineering · Software Analytics