Benchmarking PhD-Level Coding in 3D Geometric Computer Vision

📅 2026-03-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current AI models exhibit insufficient reliability in generating complex 3D geometric vision code, and there is a notable absence of evaluation benchmarks targeting advanced programming proficiency in this domain. To address this gap, this work proposes GeoCodeBench—the first benchmark specifically designed for doctoral-level 3D geometric vision programming. It constructs function completion tasks by extracting core algorithmic functions from top-tier conference papers and enriches them with both manually curated and automatically generated boundary test cases. A two-tier capability assessment framework evaluates model performance on both general and cutting-edge research-oriented tasks. Experimental results reveal that even the strongest model (GPT-5) achieves only a 36.6% pass rate, highlighting a significant deficit in trustworthy scientific 3D programming. The study also finds that extended context lengths do not consistently improve performance and that research-oriented tasks pose substantially greater challenges.
📝 Abstract
AI-assisted coding has rapidly reshaped software practice and research workflows, yet today's models still struggle to produce correct code for complex 3D geometric vision. If models could reliably write such code, our community's research would change substantially. To measure progress toward that goal, we introduce GeoCodeBench, a PhD-level benchmark that evaluates coding for 3D vision. Each problem is a fill-in-the-function implementation task curated from representative papers at recent venues: a tool first proposes candidate functions from official repositories, and careful human screening then selects core 3D geometric components. For every target, we generate diverse edge-case unit tests, enabling fully automatic, reproducible scoring. We evaluate eight representative open- and closed-source models to reflect the current ecosystem. The best model, GPT-5, attains only a 36.6% pass rate, revealing a large gap between current capabilities and dependable 3D scientific coding. GeoCodeBench organizes tasks into a two-level hierarchy: General 3D capability (geometric transformations and mechanics/optics formulation) and Research capability (novel algorithm implementation and geometric logic routing). Scores are positively correlated across these axes, but research-oriented tasks are markedly harder. Context ablations further show that "more paper text" is not always better: truncating the input at the Method section statistically outperforms full-paper inputs, highlighting unresolved challenges in long-context scientific comprehension. Together, these findings position GeoCodeBench as a rigorous testbed for advancing from generic coding to trustworthy 3D geometric vision coding.
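To make the task format concrete, here is a minimal sketch of what a fill-in-the-function problem with auto-generated edge-case unit tests might look like. The function (Rodrigues' rotation formula, a staple of 3D geometric vision) and the tests are illustrative assumptions, not items drawn from GeoCodeBench itself:

```python
import math

# Hypothetical benchmark item: the signature and docstring are given to the
# model; the body below stands in for the reference (or model-completed) code.
def rotate_axis_angle(v, axis, theta):
    """Rotate 3D vector v about a unit-length axis by angle theta (radians),
    using Rodrigues' formula: v' = v cos(t) + (k x v) sin(t) + k (k.v)(1 - cos(t))."""
    kx, ky, kz = axis
    cross = (ky * v[2] - kz * v[1],      # k x v
             kz * v[0] - kx * v[2],
             kx * v[1] - ky * v[0])
    dot = kx * v[0] + ky * v[1] + kz * v[2]  # k . v
    c, s = math.cos(theta), math.sin(theta)
    return tuple(v[i] * c + cross[i] * s + axis[i] * dot * (1 - c)
                 for i in range(3))

# Edge-case unit tests in the spirit of the benchmark's automatic scoring:
def run_tests():
    def close(a, b):
        return all(abs(x - y) < 1e-9 for x, y in zip(a, b))
    # zero-angle rotation must be the identity
    assert close(rotate_axis_angle((1.0, 2.0, 3.0), (0.0, 0.0, 1.0), 0.0),
                 (1.0, 2.0, 3.0))
    # 90-degree rotation about z maps the x-axis to the y-axis
    assert close(rotate_axis_angle((1.0, 0.0, 0.0), (0.0, 0.0, 1.0), math.pi / 2),
                 (0.0, 1.0, 0.0))
    # rotation about the vector's own axis leaves it fixed
    assert close(rotate_axis_angle((0.0, 0.0, 2.0), (0.0, 0.0, 1.0), 1.3),
                 (0.0, 0.0, 2.0))
    return "pass"
```

A harness would run such tests against the model's completion and report pass/fail, which is what makes the scoring fully automatic and reproducible.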
Problem

Research questions and friction points this paper is trying to address.

3D geometric computer vision
AI-assisted coding
code generation benchmark
scientific programming
PhD-level coding
Innovation

Methods, ideas, or system contributions that make the work stand out.

GeoCodeBench
3D geometric computer vision
AI-assisted coding
scientific code generation
benchmarking