🤖 AI Summary
This study investigates whether large language model (LLM)-based agents can reliably emulate human subjective ratings of data visualization designs. Method: Through a three-stage empirical study, we systematically assess agent–human rating alignment, identify influencing factors, and evaluate enhancement strategies. We propose a multimodal LLM agent framework integrating visualization preprocessing, structured prompt engineering, and domain-knowledge injection. Contribution/Results: We find that experts' prior confidence, rather than visualization features or prompt design, is the strongest predictor of human–agent alignment. Contrary to common practice, standard prompt-enhancement techniques can introduce systematic biases. When guided by high-confidence expert hypotheses, the agent enables rapid prototype evaluation but should serve as a complement to, not a substitute for, human-centered user studies. To our knowledge, this is the first work to systematically characterize the alignment patterns and boundary conditions of LLMs in subjective visualization assessment.
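For intuition, the sketch below shows what one step of such an agent-based rating pipeline might look like: a multimodal model is given a chart image, a rating criterion, and optional domain notes, and asked to return a Likert-scale rating with a brief justification. This is an illustrative assumption rather than the authors' implementation; the OpenAI client, the model name, the prompt wording, and the 1–7 scale are all placeholders.

```python
import base64
from openai import OpenAI  # assumes the `openai` Python package is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are simulating a study participant rating a data visualization. "
    "Rate the chart on the stated criterion using a 1-7 Likert scale and "
    "briefly justify the rating."
)

def rate_visualization(image_path: str, criterion: str, domain_notes: str = "") -> str:
    """Ask a multimodal model to rate one visualization on one criterion.

    `domain_notes` stands in for the paper's domain-knowledge injection;
    the exact prompt wording and the model choice are illustrative assumptions.
    """
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")

    system_text = SYSTEM_PROMPT + ("\n" + domain_notes if domain_notes else "")
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder multimodal model
        messages=[
            {"role": "system", "content": system_text},
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": f"Criterion: {criterion}. Give a rating from 1 to 7."},
                    {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
                ],
            },
        ],
    )
    return response.choices[0].message.content

# Example: collect one simulated rating per chart in a small prototype study.
# ratings = [rate_visualization(p, "readability") for p in ["bar.png", "pie.png"]]
```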
📝 Abstract
Large language models encode knowledge across various domains and demonstrate the ability to understand visualizations. They may also capture visualization design knowledge and potentially help reduce the cost of formative studies. However, it remains an open question whether large language models are capable of predicting human feedback on visualizations. To investigate this question, we conducted three studies examining whether large language model-based agents can simulate human ratings in visualization tasks. The first study, which replicates a published human-subject study, shows that agents are promising at human-like reasoning and rating; its results guide the subsequent experimental design. The second study repeats six human-subject studies on subjective ratings reported in the literature, replacing human participants with agents. In consultation with five human experts, this study demonstrates that the alignment of agent ratings with human ratings correlates positively with the experts' confidence levels before the experiments. The third study tests commonly used techniques for enhancing agents, including preprocessing of visual and textual inputs and knowledge injection. The results reveal robustness issues with these techniques and their potential to induce biases. Together, the three studies indicate that language model-based agents can potentially simulate human ratings in visualization experiments, provided that they are guided by high-confidence hypotheses from expert evaluators. Additionally, we demonstrate a usage scenario in which agents swiftly evaluate prototypes. We discuss insights and future directions for evaluating and improving the alignment of agent ratings with human ratings. We note that such simulation can only complement, not replace, user studies.