On the Perception Bottleneck of VLMs for Chart Understanding

📅 2025-03-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current large vision-language models (LVLMs) face dual perceptual bottlenecks in chart understanding: limited representational capacity of visual encoders and insufficient exploitation of visual features by downstream information extractors. This work is the first to systematically decouple and empirically analyze these two bottlenecks, revealing that linear extractors severely underestimate the numerical, textual, and structural information embedded in visual encoder representations. To address this, we propose a contrastive learning–based visual encoder enhancement method, integrated with instruction tuning and architectural refinements to strengthen multimodal representation learning for fine-grained chart semantics. Experiments demonstrate substantial improvements on major chart understanding benchmarks—including ChartQA and PlotQA—with an average accuracy gain of +8.2%. Our implementation is publicly available.

Technology Category

Application Category

📝 Abstract
Chart understanding requires models to effectively analyze and reason about numerical data, textual elements, and complex visual components. Our observations reveal that the perception capabilities of existing large vision-language models (LVLMs) constitute a critical bottleneck in this process. In this study, we delve into this perception bottleneck by decomposing it into two components: the vision encoder bottleneck, where the visual representation may fail to encapsulate the correct information, and the extraction bottleneck, where the language model struggles to extract the necessary information from the provided visual representations. Through comprehensive experiments, we find that (1) the information embedded within visual representations is substantially richer than what is typically captured by linear extractors, such as the widely used retrieval accuracy metric; (2) While instruction tuning effectively enhances the extraction capability of LVLMs, the vision encoder remains a critical bottleneck, demanding focused attention and improvement. Therefore, we further enhance the visual encoder to mitigate the vision encoder bottleneck under a contrastive learning framework. Empirical results demonstrate that our approach significantly mitigates the perception bottleneck and improves the ability of LVLMs to comprehend charts. Code is publicly available at https://github.com/hkust-nlp/Vision4Chart.
Problem

Research questions and friction points this paper is trying to address.

Analyze numerical and visual data in charts
Address vision encoder bottleneck in LVLMs
Improve information extraction from visual representations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Enhancing visual encoder via contrastive learning
Decomposing perception bottleneck into two components
Instruction tuning improves extraction capability
🔎 Similar Papers
No similar papers found.