🤖 AI Summary
This study addresses the challenge of ineffective retrieval and comprehension of programming screenshots in developer forums. We propose a vision-based question inference method leveraging multimodal large language models (MLLMs). Specifically, we formulate a question-generation task tailored to IDE interfaces and code screenshots, systematically evaluating LLaMA, Gemini, and GPT-4o on vision-driven problem reconstruction. Our prompt-engineering framework integrates in-context learning, chain-of-thought reasoning, and few-shot prompting. We also introduce the first benchmark for question generation from programming screenshots, enabling fine-grained semantic-similarity evaluation. Experimental results show that GPT-4o generates questions semantically aligned with human annotations (similarity > 60%) for 51.75% of test images, demonstrating the technical feasibility of vision-first debugging assistants. This work establishes a novel paradigm for multimodal AI–enhanced developer support.
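To make the prompting framework concrete, the sketch below shows how few-shot demonstrations and a chain-of-thought system instruction might be combined into a single chat payload for a vision-capable model such as GPT-4o. The function name, example pairs, and OpenAI-style message schema are illustrative assumptions, not the study's actual implementation; the screenshot is inlined base64-encoded, as vision chat APIs commonly expect.

```python
import base64

def build_fewshot_prompt(image_path, examples):
    """Assemble a chat payload for screenshot-to-question generation.

    Combines a chain-of-thought system instruction with few-shot
    (in-context) demonstrations, then appends the query screenshot.
    `examples` is a list of (screenshot_description, question) pairs;
    message schema follows the OpenAI-style vision chat format
    (an assumption -- adapt to the target model's API).
    """
    messages = [{
        "role": "system",
        "content": ("You are shown a screenshot of an IDE or code. "
                    "Reason step by step about the visible error or context, "
                    "then state the question the developer is likely asking."),
    }]
    # Few-shot demonstrations: alternate user/assistant turns so the
    # model sees worked input -> question examples before the real query.
    for description, question in examples:
        messages.append({"role": "user", "content": description})
        messages.append({"role": "assistant", "content": question})
    # The actual query image, base64-encoded inline.
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    messages.append({
        "role": "user",
        "content": [{
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{b64}"},
        }],
    })
    return messages
```

The returned list can be passed as the `messages` argument of a chat-completions call; swapping the demonstration pairs lets the same scaffold serve the in-context-learning and few-shot conditions the study compares.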
📝 Abstract
The integration of generative AI into developer forums like Stack Overflow presents an opportunity to enhance problem-solving by allowing users to post screenshots of code or Integrated Development Environments (IDEs) instead of traditional text-based queries. This study evaluates the effectiveness of several large language models (LLMs), specifically LLaMA, Gemini, and GPT-4o, in interpreting such visual inputs. We employ prompt engineering techniques, including in-context learning, chain-of-thought prompting, and few-shot learning, to assess each model's responsiveness and accuracy. Our findings show that while GPT-4o demonstrates promising capabilities, achieving over 60% similarity to baseline questions for 51.75% of the tested images, challenges remain in obtaining consistent and accurate interpretations of more complex images. This research advances our understanding of the feasibility of using generative AI for image-centric problem-solving in developer communities, highlighting both the potential benefits and current limitations of the approach, and envisioning a future in which visual debugging copilot tools become a reality.
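The headline metric above — over 60% similarity to the baseline question for 51.75% of images — can be sketched as a threshold pass rate over (generated, reference) question pairs. The paper's actual semantic-similarity measure is not specified here, so the snippet below uses `difflib`'s lexical ratio purely as a self-contained stand-in; in practice one would substitute an embedding-based cosine similarity.

```python
from difflib import SequenceMatcher

def surface_similarity(generated: str, reference: str) -> float:
    """Similarity in [0, 1] between a generated question and a
    human-written baseline.

    Uses difflib's sequence ratio as a simple lexical proxy; the study's
    >60% threshold refers to semantic similarity, which would normally be
    computed with sentence embeddings (an assumption, not the paper's
    stated metric).
    """
    return SequenceMatcher(None, generated.lower(), reference.lower()).ratio()

def pass_rate(pairs, threshold: float = 0.6) -> float:
    """Fraction of (generated, reference) pairs exceeding the similarity
    threshold -- the shape of the 51.75%-of-images statistic."""
    if not pairs:
        return 0.0
    hits = sum(1 for gen, ref in pairs if surface_similarity(gen, ref) > threshold)
    return hits / len(pairs)
```

Reporting the pass rate rather than the mean similarity matches the abstract's framing: it counts how many screenshots yield a usable question, not how close the average output is.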