🤖 AI Summary
This study systematically examines how large vision-language models (VLMs) amplify socioeconomic and geographic biases in chart-to-text generation, in particular by assigning more positive language to charts attributed to high-income countries. The authors evaluate six state-of-the-art VLMs, including GPT-4o-mini, Gemini-1.5-Flash, and Phi-3.5, across 6,000 chart–country pairs, and find statistically significant income-correlated sentiment differences in all models, with the country attribution as the only variable changed. Inference-time prompt-based debiasing with positive distractors mitigates the bias only partially, suggesting the problem is systemic rather than superficial. Contributions include: (1) a benchmark dataset and open-source code for fairness evaluation in automated data storytelling; (2) empirical evidence of pervasive geo-economic bias in VLM chart summarization; and (3) an analysis of prompt-based debiasing and its limits.
📝 Abstract
Charts are widely used for exploring data and communicating insights, but extracting key takeaways from charts and articulating them in natural language can be challenging. The chart-to-text task aims to automate this process by generating textual summaries of charts. While rapid advances in large Vision-Language Models (VLMs) have brought great progress in this domain, little attention has been paid to potential biases in their outputs. This paper investigates how VLMs can amplify geo-economic biases when generating chart summaries, potentially causing societal harm. Specifically, we conduct a large-scale evaluation of geo-economic biases in VLM-generated chart summaries across 6,000 chart-country pairs from six widely used proprietary and open-source models to understand how a country's economic status influences the sentiment of generated summaries. Our analysis reveals that existing VLMs tend to produce more positive descriptions for high-income countries than for middle- or low-income countries, even when country attribution is the only variable changed. We also find that models such as GPT-4o-mini, Gemini-1.5-Flash, and Phi-3.5 exhibit varying degrees of bias. We further explore inference-time prompt-based debiasing techniques using positive distractors but find them only partially effective, underscoring the complexity of the issue and the need for more robust debiasing strategies. Our code and dataset are publicly available here.
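The evaluation design described above (holding the chart constant, varying only the country attribution, and comparing the sentiment of the generated summaries across income groups) can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: `sentiment_score` is a toy lexicon-based scorer standing in for whatever sentiment model the study uses, and the example summaries are hypothetical stand-ins for real VLM outputs.

```python
from statistics import mean

# Toy polarity lexicons; a real pipeline would use a trained sentiment model.
POSITIVE = {"strong", "impressive", "robust", "growth", "thriving"}
NEGATIVE = {"weak", "struggling", "poor", "decline", "stagnant"}

def sentiment_score(text: str) -> float:
    """Crude lexicon-based polarity in [-1, 1]."""
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total

def bias_gap(summaries_by_group: dict) -> float:
    """Mean sentiment of high-income summaries minus low-income ones.

    A positive gap indicates income-correlated sentiment bias,
    since the underlying chart is identical across groups.
    """
    high = mean(sentiment_score(s) for s in summaries_by_group["high_income"])
    low = mean(sentiment_score(s) for s in summaries_by_group["low_income"])
    return high - low

# Hypothetical VLM summaries of the SAME chart, differing only in the
# country named in the prompt.
summaries = {
    "high_income": ["The chart shows strong and impressive growth."],
    "low_income": ["The chart shows weak, struggling performance."],
}
print(bias_gap(summaries))  # a positive gap suggests geo-economic bias
```

Averaging this gap over many chart–country pairs, as the study does at scale (6,000 pairs per model), turns the single-example comparison into a statistical test of income-correlated bias.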