AIBench: Evaluating Visual-Logical Consistency in Academic Illustration Generation

📅 2026-03-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current image generation models have unclear capabilities when it comes to producing logically coherent illustrations suitable for direct inclusion in academic papers, and conventional evaluation methods built on vision-language models are unreliable in complex image-text scenarios. This work proposes AIBench, the first benchmark specifically designed for evaluating academic figure generation. It constructs logical graphs from the methodology sections of scientific papers and introduces a four-level visual question answering (VQA) task to assess fine-grained logical consistency between generated images and the source text, while also leveraging vision-language models to evaluate aesthetic quality. The VQA-based evaluation mechanism places lighter demands on the judge model's comprehension abilities, and the study reveals the difficulty of jointly optimizing logical correctness and aesthetic quality. Experiments demonstrate that performance gaps across models on this task are substantially larger than those observed in general-purpose image generation, and that test-time scaling strategies can simultaneously enhance both logical fidelity and visual appeal.
📝 Abstract
Although image generation has boosted various applications via its rapid evolution, whether state-of-the-art models can produce ready-to-use academic illustrations for papers remains largely unexplored. Directly comparing or evaluating an illustration with a VLM is naive and requires oracle-level multi-modal understanding, which is unreliable for long and complex texts and illustrations. To address this, we propose AIBench, the first benchmark that uses VQA to evaluate the logical correctness of academic illustrations and VLMs to assess their aesthetics. In detail, we design four levels of questions derived from a logic diagram summarized from the method section of each paper, which query whether the generated illustration aligns with the paper at different scales. Our VQA-based approach yields more accurate and detailed evaluations of visual-logical consistency while relying less on the ability of the judge VLM. With our high-quality AIBench, we conduct extensive experiments and conclude that the performance gap between models on this task is significantly larger than on general generation tasks, reflecting differences in their complex reasoning and high-density generation abilities. Further, logic and aesthetics are hard to optimize simultaneously, as in handcrafted illustrations. Additional experiments show that test-time scaling on both abilities significantly boosts performance on this task.
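For intuition, here is a minimal sketch of how such a VQA-based logical-consistency score could be computed; the question wording, the four-level grouping shown, and the `ask_vlm` judge interface are illustrative assumptions, not the paper's released implementation.

```python
"""Minimal sketch of a VQA-based logical-consistency score.
Assumptions: questions are yes/no items derived from a logic diagram of the
method section, grouped into four levels, and a judge VLM answers each one."""

from dataclasses import dataclass
from typing import Callable


@dataclass
class VQAItem:
    level: int       # 1-4, coarse-to-fine scale of the question
    question: str    # yes/no question derived from the paper's logic diagram
    expected: bool   # answer implied by the source methodology text


def logical_consistency(
    image_path: str,
    items: list[VQAItem],
    ask_vlm: Callable[[str, str], bool],  # (image_path, question) -> yes/no
) -> dict[int, float]:
    """Per-level accuracy of a generated illustration against the logic diagram.

    Each question is posed to the judge VLM as a short binary VQA query, so the
    judge only needs localized understanding rather than a full comparison of
    the figure against the entire method section.
    """
    correct: dict[int, int] = {}
    total: dict[int, int] = {}
    for item in items:
        answer = ask_vlm(image_path, item.question)
        total[item.level] = total.get(item.level, 0) + 1
        correct[item.level] = correct.get(item.level, 0) + int(answer == item.expected)
    return {lvl: correct[lvl] / total[lvl] for lvl in total}


# Example usage with hypothetical questions about a two-branch architecture figure:
items = [
    VQAItem(1, "Does the figure show exactly two parallel branches?", True),
    VQAItem(3, "Is the fusion module placed after both encoders?", True),
]
# scores = logical_consistency("figure.png", items, ask_vlm=my_vlm_yes_no)
```

Aggregating per level (rather than over all questions at once) matches the abstract's claim that the benchmark queries alignment "at different scales": a model can score well on coarse structure while failing fine-grained checks, and the per-level breakdown makes that visible.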
Problem

Research questions and friction points this paper is trying to address.

academic illustration
visual-logical consistency
image generation
evaluation benchmark
VQA
Innovation

Methods, ideas, or system contributions that make the work stand out.

AIBench
visual-logical consistency
VQA-based evaluation
academic illustration generation
test-time scaling