🤖 AI Summary
This work addresses a limitation of existing vision-language models in scientific paper understanding: they typically generate context-poor, low-level information-extraction questions rather than the deep, inquiry-driven questions that human readers pose. To bridge this gap, the study extends the linguistically grounded Questions Under Discussion (QUD) framework, originally text-based, to the multimodal domain. It proposes a method for generating higher-order reasoning questions that integrate scientific figures with their surrounding textual context, and it introduces MQUD, a multimodal QUD dataset annotated by the original paper authors. Fine-tuning a vision-language model on MQUD improves question quality, shifting generation from superficial visual descriptions toward content-specific, cross-modal questions that are more visually grounded.
📝 Abstract
Asking inquisitive questions while reading, and looking for their answers, is an important part of human discourse comprehension, curiosity, and creative ideation, and prior work has investigated this in text-only scenarios. However, in scientific or research papers, many of the critical takeaways are conveyed through both figures and the text that analyzes them. While scientific visualizations have been used to evaluate the capabilities of Vision-Language Models (VLMs), current benchmarks are limited to questions that focus simply on extracting information from the figures. Such questions require only lower-level reasoning, do not take into account the context in which a figure appears, and do not reflect the communicative goals the authors wish to achieve. We generate inquisitive questions that reach the depth of questions humans generate when engaging with scientific papers; these questions are conditioned on both the figure and the paper's context and require reasoning across both modalities. To do so, we extend the linguistic theory of Questions Under Discussion (QUD), in which implicit questions are raised and resolved as discourse progresses, from text-only to multimodal settings. We present MQUD, a dataset of research papers in which such questions are made explicit and annotated by the original authors. We show that fine-tuning a VLM on MQUD shifts the model from generating generic low-level visual questions to content-specific grounding that requires a high level of multimodal reasoning, yielding higher-quality, more visually grounded multimodal QUD generation.