🤖 AI Summary
Existing benchmarks for chart referring expression grounding are limited in localization precision, support for multiple targets, linguistic diversity, and coverage of chart types. This work proposes the first benchmark that systematically supports multi-target, multi-granularity, and high-precision grounding of chart elements, encompassing a rich variety of chart types and diverse referring expressions. The key innovation is a code-driven, pixel-level instance mask synthesis mechanism that aligns plotting programs with their rendered outputs to generate precise annotations. An instance segmentation model trained on the synthesized masks is integrated into a general-purpose multimodal grounding framework. Experiments show that the resulting system consistently outperforms existing baselines on the new benchmark and generalizes well to a real-chart grounding benchmark derived from ChartQA.
📝 Abstract
Referring expression grounding is a core problem in visual grounding and is widely used as a diagnostic of spatial grounding and reasoning in vision-and-language models, yet most prior work focuses on natural images. In contrast, existing benchmarks related to chart referring expression grounding remain limited: (1) they largely adopt bounding boxes, which constrains localization precision for fine chart elements; (2) they mostly assume one or two referred target instances, failing to handle multi-instance target references; (3) their language expressions over-rely on textual cues or data-rank clues; and (4) they cover only a narrow range of chart types. To address these issues, we introduce a chart referring expression grounding benchmark that systematically supports multiple localization forms, multiple referred targets, diverse grounding cues, and diverse chart types. Results across representative multimodal large models reveal a significant performance gap. We further introduce a code-driven synthesis pipeline that exploits the inherent alignment between plotting programs and rendered chart primitives to derive pixel-accurate instance masks across chart element types and granularities. We train an instance segmentation model with the synthesized masks and integrate it into a general-purpose multimodal grounding framework. The resulting system consistently outperforms baselines on our benchmark and generalizes well to a ChartQA-derived real-chart grounding benchmark.
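The idea behind code-driven mask synthesis can be illustrated with a toy example. The sketch below is not the paper's pipeline: it assumes a hypothetical `Bar` primitive and a minimal pure-Python rasterizer, and it only shows why a program that draws each primitive can also say exactly which pixels that primitive occupies. Each primitive is drawn into an ID buffer, and per-instance masks are read back from that buffer so that occlusion is respected. All names here (`Bar`, `render_id_buffer`, `instance_masks`, `bounding_box`) are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bar:
    """Hypothetical bar primitive; x-range is half-open, in pixels."""
    label: str
    x0: int
    x1: int
    height: int


def render_id_buffer(bars, width, height):
    """Draw bars bottom-aligned; each pixel stores the 1-based index of
    the last bar drawn there (0 means background)."""
    ids = [[0] * width for _ in range(height)]
    for idx, bar in enumerate(bars, start=1):
        top = max(0, height - bar.height)
        for y in range(top, height):
            for x in range(max(0, bar.x0), min(width, bar.x1)):
                ids[y][x] = idx
    return ids


def instance_masks(bars, ids):
    """Read one binary mask per bar back from the ID buffer, so each
    mask contains only the pixels that remain visible."""
    return {
        bar.label: [[1 if v == idx else 0 for v in row] for row in ids]
        for idx, bar in enumerate(bars, start=1)
    }


def area(mask):
    """Number of pixels set in a binary mask."""
    return sum(sum(row) for row in mask)


def bounding_box(mask):
    """Tight half-open box (x0, y0, x1, y1) around a mask, or None."""
    pts = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not pts:
        return None
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1)


# Two overlapping bars on an 8x6 canvas; "B" is drawn last and occludes "A".
bars = [Bar("A", 1, 4, 5), Bar("B", 3, 6, 2)]
ids = render_id_buffer(bars, width=8, height=6)
masks = instance_masks(bars, ids)
# "A" keeps 13 of its 15 pixels; "B" covers 6.
```

In this toy setup, the box for "A" still spans its full 3x5 extent even though two of those pixels belong to "B", which is the kind of imprecision that masks avoid for fine or overlapping chart elements.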