🤖 AI Summary
This work proposes Im2Sim, a novel approach for probing whether vision-language models comprehend the structure and behavior of complex real-world systems shown in natural images. By reframing visual understanding as the generation of executable simulation code—whose execution reconstructs the input image—the method evaluates whether models grasp the underlying mechanisms of observed systems. Experiments demonstrate that, when applied to state-of-the-art vision-language models such as GPT and Gemini, Im2Sim effectively captures macroscopic dynamic principles across diverse domains, including clouds, vegetation, and urban environments. The results reveal an asymmetry in model capabilities: while high-level semantic understanding is robust, faithful reproduction of low-level details remains challenging. This positions Im2Sim as a promising route toward automatically constructing executable models of real-world systems.
📝 Abstract
The ability to construct mental models of the world is a central aspect of understanding. Similarly, visual understanding can be viewed as the ability to construct a representative model of the system depicted in an image. This work explores the capacity of Vision Language Models (VLMs) to recognize and simulate the systems and mechanisms depicted in images using the Im2Sim methodology. The VLM is given a natural image of a real-world system (e.g., cities, clouds, vegetation) and is tasked with describing the system and writing code that simulates and generates it. This generative code is then executed to produce a synthetic image, which is compared against the original. This approach is tested on various complex emergent systems, ranging from physical systems (waves, lights, clouds) to vegetation, cities, materials, and geological formations. Through analysis of the models and images generated by the VLMs, we examine their understanding of the systems in images. The results show that leading VLMs (GPT, Gemini) have the ability to understand and model complex, multi-component systems across multiple layers of abstraction and a wide range of domains. At the same time, the VLMs exhibit limited ability to replicate fine details and low-level arrangements of patterns in the image. These findings reveal an interesting asymmetry: VLMs combine high-level, deep visual understanding of images with limited perception of fine details.
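The pipeline described above can be sketched minimally as follows. This is an illustrative sketch, not the paper's implementation: `query_vlm` is a hypothetical stand-in for a real VLM API call, the generated code is a placeholder, and pixel-space MSE stands in for whatever comparison the authors actually use.

```python
import numpy as np

def query_vlm(image):
    # Hypothetical stand-in for a real VLM call (e.g., GPT or Gemini).
    # A real implementation would send the image with a prompt asking the
    # model to describe the depicted system and emit generative code that
    # simulates it. Here we return a canned description and program.
    description = "placeholder description of the depicted system"
    code = (
        "import numpy as np\n"
        "def simulate(h, w):\n"
        "    # Placeholder generative model: random texture.\n"
        "    rng = np.random.default_rng(0)\n"
        "    return rng.random((h, w, 3))\n"
    )
    return description, code

def run_generated_code(code, h, w):
    # Execute the VLM-written simulation code in an isolated namespace
    # and call its entry point to render a synthetic image.
    namespace = {}
    exec(code, namespace)
    return namespace["simulate"](h, w)

def image_distance(a, b):
    # Simple pixel-space MSE; the paper's evaluation presumably uses
    # richer perceptual/semantic comparisons. Illustrative only.
    return float(np.mean((a - b) ** 2))

# Im2Sim loop on a placeholder "natural image" (all-black here).
original = np.zeros((32, 32, 3))
description, code = query_vlm(original)
synthetic = run_generated_code(code, 32, 32)
score = image_distance(original, synthetic)
```

Executing the model-authored code and scoring the reconstruction against the input is what turns "describe this image" into a testable claim about whether the model captured the system's generative mechanism.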