VQ-VA World: Towards High-Quality Visual Question-Visual Answering

📅 2025-11-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the performance gap between open-source multimodal models and proprietary systems (e.g., GPT-Image, NanoBanana) on Visual Question-Visual Answering (VQ-VA), a task that requires generating an image, rather than text, as the direct answer to a visual question. To this end, we propose VQ-VA World, a data-centric curation framework built around an agentic pipeline that leverages web-scale crawling to construct a high-quality, interleaved image-text dataset of ~1.8 million samples. We further introduce IntelligentBench, the first fine-grained, human-annotated evaluation benchmark for VQ-VA, covering world knowledge, design knowledge, and multi-step reasoning. Fine-tuning generative models such as LightFusion on this data yields an end-to-end VQ-VA pipeline: LightFusion reaches 53.06 on IntelligentBench, significantly outperforming prior open-source baselines and substantially narrowing the gap with closed-source counterparts.

📝 Abstract
This paper studies Visual Question-Visual Answering (VQ-VA): generating an image, rather than text, in response to a visual question -- an ability that has recently emerged in proprietary systems such as NanoBanana and GPT-Image. To bring this capability to open-source models as well, we introduce VQ-VA World, a data-centric framework built around an agentic pipeline for large-scale, targeted data construction. Leveraging web-scale deployment, this pipeline crawls ~1.8M high-quality, interleaved image-text samples for model training. For evaluation, we further release IntelligentBench, a human-curated benchmark that systematically assesses VQ-VA along the aspects of world knowledge, design knowledge, and reasoning. Training with VQ-VA World data yields strong empirical gains: it helps LightFusion attain 53.06 on IntelligentBench, substantially surpassing the best prior open-source baselines (i.e., 7.78 from vanilla LightFusion; 1.94 from UniWorld-V1), and significantly narrowing the gap toward leading proprietary systems (e.g., 81.67 from NanoBanana; 82.64 from GPT-Image). By releasing the full suite of model weights, datasets, and pipelines, we hope to stimulate future research on VQ-VA.
Problem

Research questions and friction points this paper is trying to address.

Bringing VQ-VA (answering visual questions with generated images) to open-source models
Constructing large-scale, high-quality interleaved image-text datasets for VQ-VA training
Establishing a systematic, human-curated evaluation benchmark for visual answer generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Agentic pipeline for large-scale, targeted data construction
Web-scale crawling that yields ~1.8M high-quality, interleaved image-text training samples
Fine-tuning on the curated data lifts LightFusion from 7.78 to 53.06 on IntelligentBench