🤖 AI Summary
Large language models (LLMs) exhibit limited capability in scientific chart understanding, particularly in geospatial visualization question answering. To address this, we propose a lightweight, fine-tuning-free multimodal fusion framework enabling natural-language-driven, plug-and-play interactive chart QA. Our method jointly encodes visual semantics and structured data descriptions into a compact, structured textual representation, aligning multimodal features via visualization snapshot encoding and zero-shot contextual enhancement. Key contributions include: (1) the first structured compact textual representation that simultaneously captures both the visual and tabular semantics of scientific charts; and (2) an integrated architecture combining multimodal feature alignment, snapshot-based visual encoding, and zero-shot context augmentation. Evaluated on GeoVista and other geovisualization benchmarks, our approach achieves state-of-the-art zero-shot performance, significantly improving both answer accuracy and interpretability in scientific visualization QA.
📝 Abstract
We present a method for augmenting a Large Language Model (LLM) with a combination of text and visual data to enable accurate question answering over scientific data visualizations, making conversational visualization possible. LLMs struggle with tasks like visual data interaction because they lack contextual visual information. We address this problem by merging a textual description of a visualization and its dataset with snapshots of the visualization. We extract their essential features into a structured text file that is highly compact, yet descriptive enough to augment the LLM with the necessary contextual information, without any fine-tuning. This approach can be applied to any fully rendered visualization, as long as it is accompanied by some textual description.
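The fusion step described above (merging a visualization's textual description, a dataset summary, and snapshot-derived captions into one compact structured context for a frozen LLM) can be sketched in Python. This is a minimal illustration under stated assumptions: the section tags (`[VISUALIZATION]`, `[SNAPSHOT]`, `[DATA]`) and the per-column summary fields are hypothetical, not the paper's actual file format.

```python
# Hypothetical sketch of fine-tuning-free context fusion: serialize visual
# and tabular semantics into one compact text block, then prepend it to the
# user's question so an off-the-shelf LLM can answer it zero-shot.
# Tag names and field layout are illustrative assumptions.

def build_context(description: str, data_summary: dict, snapshot_caption: str) -> str:
    """Fuse chart description, snapshot caption, and dataset stats into
    a single compact, structured textual representation."""
    lines = [
        "[VISUALIZATION]", description.strip(),
        "[SNAPSHOT]", snapshot_caption.strip(),
        "[DATA]",
    ]
    for column, stats in data_summary.items():
        unit = stats.get("unit", "n/a")
        lines.append(f"{column}: min={stats['min']}, max={stats['max']}, unit={unit}")
    return "\n".join(lines)


def make_prompt(context: str, question: str) -> str:
    """Prepend the fused context so the LLM answers grounded in the chart,
    with no fine-tuning required."""
    return f"{context}\n\n[QUESTION]\n{question}\n[ANSWER]"


# Example: a choropleth map with one summarized data column.
context = build_context(
    "Choropleth map of annual rainfall across European regions.",
    {"rainfall_mm": {"min": 320, "max": 2150, "unit": "mm"}},
    "Darker blue regions along the Atlantic coast indicate higher rainfall.",
)
prompt = make_prompt(context, "Which regions receive the most rainfall?")
```

The resulting `prompt` string would then be sent to any chat-capable LLM as-is; because all contextual information is plain text, the approach stays plug-and-play across models.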