🤖 AI Summary
Current vision-language models perform poorly on complex path-tracing tasks, and dedicated benchmarks for this ability are scarce. This work proposes TraversalBench, a controlled benchmark for assessing exact visual path traversal. It uses parametrically synthesized images to control path structure (self-intersection count, tortuosity, vertex count, and nearby distractor lines) and requires models to output the exact ordered sequence of vertex markers from start to end. Experiments show that self-intersections are the dominant source of difficulty: accuracy is relatively stable just before the first crossing and drops steeply at it. Distractor lines cause a weaker but persistent degradation that compounds with repeated exposure, and an auxiliary reading-order benchmark shows a consistent preference for layouts compatible with left-to-right serialization. TraversalBench thus serves as a diagnostic for failure modes in visual-spatial reasoning and sustained visual grounding in current models.
📝 Abstract
Vision-language models (VLMs) perform strongly on many multimodal benchmarks. However, the ability to follow complex visual paths, a task that human observers typically find straightforward, remains under-tested. We introduce TraversalBench, a controlled benchmark for exact visual path traversal. Each instance contains a single continuous polyline, a unique start marker, and markers placed at path vertices; the task is to recover the exact ordered sequence of markers encountered when traversing the path from start to finish. The benchmark explicitly balances key path-structural factors, including self-intersection count, tortuosity, vertex count, and nearby confounding lines, while minimizing reliance on OCR, world knowledge, and open-ended planning. We find that self-intersections are the dominant source of difficulty. A first-crossing analysis shows that errors are sharply localized: performance is relatively stable immediately before the first crossing, then drops steeply when the model must resolve the correct continuation. By contrast, nearby confounding lines produce a weaker persistent degradation that compounds with repeated exposure. These analyses make TraversalBench a useful diagnostic for identifying whether models suffer from human-like failures or other breakdowns in sustained visual processing. An auxiliary reading-order benchmark further reveals a consistent preference for layouts compatible with left-to-right serialization, while not explaining away the main effects of path complexity. Together, these results position TraversalBench as a controlled diagnostic of path-faithful visual reasoning and as a useful testbed for studying multimodal spatial reasoning under ambiguity, clutter, and distractor structure. More broadly, we see TraversalBench as a contribution to the still-limited area of sustained visual grounding benchmarks for VLMs.
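To make the structural factors and the exact-match criterion concrete, the following is a minimal Python sketch, not the benchmark's actual code. It assumes a path is a list of 2D vertex coordinates, counts self-intersections as proper crossings between non-adjacent segments, and defines tortuosity as path length divided by the straight-line distance between endpoints; the abstract does not state the exact definitions used, so all function names and formulas here are illustrative assumptions.

```python
import math
from itertools import combinations

Point = tuple[float, float]


def _orient(a: Point, b: Point, c: Point) -> int:
    """Sign of the cross product (b - a) x (c - a): +1, 0, or -1."""
    v = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (v > 0) - (v < 0)


def segments_cross(p1: Point, p2: Point, q1: Point, q2: Point) -> bool:
    """True if the two segments cross at a point interior to both."""
    return (
        _orient(p1, p2, q1) * _orient(p1, p2, q2) < 0
        and _orient(q1, q2, p1) * _orient(q1, q2, p2) < 0
    )


def count_self_intersections(path: list[Point]) -> int:
    """Count proper crossings between non-adjacent polyline segments."""
    segs = list(zip(path, path[1:]))
    return sum(
        segments_cross(*segs[i], *segs[j])
        for i, j in combinations(range(len(segs)), 2)
        if j > i + 1  # adjacent segments share a vertex, so skip them
    )


def tortuosity(path: list[Point]) -> float:
    """Assumed definition: arc length over endpoint-to-endpoint distance."""
    length = sum(math.dist(a, b) for a, b in zip(path, path[1:]))
    chord = math.dist(path[0], path[-1])
    return length / chord if chord > 0 else math.inf


def exact_match(predicted: list[str], reference: list[str]) -> bool:
    """A prediction counts only if the whole ordered sequence matches."""
    return list(predicted) == list(reference)


def correct_prefix_length(predicted: list[str], reference: list[str]) -> int:
    """Number of leading markers that match before the first error."""
    n = 0
    for p, r in zip(predicted, reference):
        if p != r:
            break
        n += 1
    return n


# Hypothetical instance: a three-segment path whose first and last segments cross.
path = [(0, 0), (4, 4), (4, 0), (0, 4)]
crossings = count_self_intersections(path)
ratio = tortuosity(path)
```

The prefix-length helper is one simple way to localize where a predicted sequence first diverges from the reference, in the spirit of the first-crossing analysis described above; the analysis in the paper itself may be computed differently.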