🤖 AI Summary
This work addresses the low accuracy and lack of interpretable world-modeling capability of large language models (LLMs) and vision-language models (VLMs) on guesstimation—approximate quantitative estimation—tasks. To mitigate these limitations, we propose a Wisdom of Crowds (WOC)-based decoding strategy: generating diverse estimates via multi-round sampling and aggregating them via median selection to enhance robustness. We introduce MARBLES, the first multimodal (image-text) guesstimation benchmark, and systematically integrate WOC into LLM/VLM decoding for the first time. Our findings are threefold: (1) WOC substantially improves estimation accuracy across models; (2) guesstimation serves as an effective probe for assessing implicit world-modeling competence; and (3) incorporating visual input further boosts performance, demonstrating the critical role of multimodal synergy in physical magnitude reasoning.
📝 Abstract
Guesstimation, the task of making approximate quantity estimates, is a common real-world challenge. However, it has been largely overlooked in large language model (LLM) and vision-language model (VLM) research. We introduce a novel guesstimation dataset, MARBLES. This dataset requires one to estimate how many items (e.g., marbles) can fit into containers (e.g., a one-cup measuring cup), both with and without accompanying images. Inspired by the social science concept of the "Wisdom of Crowds" (WOC; taking the median of estimates from a crowd), which has proven effective in guesstimation, we propose a "WOC decoding" strategy for LLM guesstimation. We show that LLMs/VLMs perform well on guesstimation, suggesting that they possess some level of a "world model" necessary for guesstimation. Moreover, similar to human performance, the WOC decoding method improves LLM/VLM guesstimation accuracy. Furthermore, the inclusion of images in the multimodal condition enhances model performance. These results highlight the value of the WOC decoding strategy for LLMs/VLMs and position guesstimation as a probe for evaluating the world models of LLMs/VLMs.
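The WOC decoding idea described above (sample several estimates, then take the median) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `woc_decode`, `parse_estimate`, and `sample_fn` are hypothetical, the sample count is an arbitrary default, and the number-extraction regex is an assumption about how free-text model responses might be parsed. `sample_fn` stands in for any call that returns one sampled model response as a string.

```python
import re
import statistics
from typing import Callable, Optional

# Assumed response format: the first number in the text is the estimate.
_NUMBER = re.compile(r"-?\d+(?:,\d{3})*(?:\.\d+)?")


def parse_estimate(text: str) -> Optional[float]:
    """Return the first number found in a response, or None if there is none."""
    match = _NUMBER.search(text)
    if match is None:
        return None
    return float(match.group().replace(",", ""))


def woc_decode(sample_fn: Callable[[], str], n_samples: int = 10) -> float:
    """Draw n_samples responses and aggregate their estimates by the median.

    sample_fn is a hypothetical stand-in for one stochastic model call
    (e.g., sampling with a nonzero temperature).
    """
    estimates = []
    for _ in range(n_samples):
        value = parse_estimate(sample_fn())
        if value is not None:  # skip responses with no usable number
            estimates.append(value)
    if not estimates:
        raise ValueError("no numeric estimate found in any sample")
    return statistics.median(estimates)
```

The median is used rather than the mean because a single wildly off sample (say, an estimate ten times too large) shifts the mean substantially but barely moves the median, which matches the robustness motivation for crowd aggregation.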