GeoBuildBench: A Benchmark for Interactive and Executable Geometry Construction from Natural Language

📅 2026-05-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work formalizes plane geometry problem solving as an executable and verifiable interactive construction task, requiring large language models and multimodal agents to generate domain-specific language (DSL) programs from Chinese natural language descriptions to construct dynamic geometric diagrams that satisfy specified objects and constraints. To this end, the authors introduce a benchmark comprising 489 human-verified interactive geometry construction problems and evaluate model performance within a bounded iterative environment that integrates natural language understanding, geometric constraint solving, and multimodal feedback mechanisms. Experimental results reveal that state-of-the-art models commonly suffer from structural hallucinations, missing geometric objects, and constraint violations, and struggle to effectively leverage feedback for self-correction, thereby highlighting the significant challenges this task poses to models’ reasoning and embodied interaction capabilities.

📝 Abstract

We introduce GeoBuildBench, a benchmark designed to evaluate whether large language models and multimodal agents can ground informal natural-language plane geometry problems into executable geometric constructions. Unlike existing geometry benchmarks that focus on answer correctness or static diagram interpretation, GeoBuildBench treats geometry diagramming as an interactive construction task: given a textual problem, an agent must generate a domain-specific language (DSL) program to produce a diagram satisfying explicitly specified geometric objects and verifiable constraints. The benchmark features 489 Chinese textbook-style problems, curated through automated filtering and human validation to ensure text-complete, constructible problem specifications. We evaluate several state-of-the-art multimodal models in a bounded iterative setting and show that, despite reasonable success rates, models frequently exhibit structural hallucinations, missing objects, and failures to satisfy geometric constraints, with limited ability to exploit visual and constraint-based feedback for self-correction. These results highlight geometry construction as a rigorous testbed for grounded, executable reasoning beyond textual or visual plausibility. Our benchmark and code are publicly available.
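The setup described in the abstract (generate a DSL program, execute it, verify objects and constraints, retry with feedback under a fixed budget) can be illustrated with a minimal Python sketch. Everything below is a hypothetical stand-in: the one-statement `point NAME X Y` DSL, the constraint names `perpendicular` and `equal_length`, the text-only feedback, and the `construct` loop are assumptions made for illustration, not the benchmark's actual DSL, checker, or environment.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Point = Tuple[float, float]
Diagram = Dict[str, Point]
Constraint = Tuple[str, Tuple[str, str, str, str]]  # (kind, four point names)
Agent = Callable[[str, List[str]], str]             # (problem, feedback) -> program


def run_program(program: str) -> Diagram:
    """Execute a toy DSL (hypothetical): one 'point NAME X Y' statement per line."""
    diagram: Diagram = {}
    for line in program.strip().splitlines():
        op, name, x, y = line.split()  # raises ValueError on malformed lines
        if op != "point":
            raise ValueError(f"unknown statement: {op}")
        diagram[name] = (float(x), float(y))
    return diagram


def check(diagram: Diagram, required: Sequence[str],
          constraints: Sequence[Constraint], tol: float = 1e-6) -> List[str]:
    """Return failure messages: missing objects first, then violated constraints."""
    missing = [f"missing object {n}" for n in required if n not in diagram]
    if missing:
        return missing
    failures = []
    for kind, (a, b, c, d) in constraints:
        (ax, ay), (bx, by) = diagram[a], diagram[b]
        (cx, cy), (dx, dy) = diagram[c], diagram[d]
        if kind == "perpendicular":      # segment ab is perpendicular to cd
            ok = abs((bx - ax) * (dx - cx) + (by - ay) * (dy - cy)) <= tol
        elif kind == "equal_length":     # |ab| equals |cd|
            ok = abs(math.dist((ax, ay), (bx, by))
                     - math.dist((cx, cy), (dx, dy))) <= tol
        else:
            raise ValueError(f"unknown constraint: {kind}")
        if not ok:
            failures.append(f"violated {kind} on {a}{b}, {c}{d}")
    return failures


def construct(agent: Agent, problem: str, required: Sequence[str],
              constraints: Sequence[Constraint],
              max_rounds: int = 3) -> Tuple[bool, int]:
    """Bounded loop: ask for a program, verify it, feed failures back to the agent."""
    feedback: List[str] = []
    for round_no in range(1, max_rounds + 1):
        program = agent(problem, feedback)
        try:
            feedback = check(run_program(program), required, constraints)
        except ValueError as err:        # execution error becomes feedback too
            feedback = [str(err)]
        if not feedback:
            return True, round_no
    return False, max_rounds


# Example problem: an isosceles right triangle ABC with the right angle at A.
REQUIRED = ["A", "B", "C"]
CONSTRAINTS: List[Constraint] = [
    ("perpendicular", ("A", "B", "A", "C")),
    ("equal_length", ("A", "B", "A", "C")),
]


def scripted_agent(problem: str, feedback: List[str]) -> str:
    """Stand-in for a model: omits C at first, then repairs after feedback."""
    if not feedback:
        return "point A 0 0\npoint B 1 0"
    return "point A 0 0\npoint B 1 0\npoint C 0 1"


print(construct(scripted_agent, "isosceles right triangle ABC", REQUIRED, CONSTRAINTS))
```

In this toy run the scripted agent fails the first round on a missing object and succeeds in the second after reading the feedback. The sketch checks coordinates numerically with a tolerance, which is one simple way to make constraints verifiable; it does not model the visual feedback channel mentioned in the abstract.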

Problem

Research questions and friction points this paper is trying to address.

geometry construction

natural language grounding

executable reasoning

multimodal agents

geometric constraints

Innovation

Methods, ideas, or system contributions that make the work stand out.

executable reasoning

geometry construction

domain-specific language