GeoTikzBridge: Advancing Multimodal Code Generation for Geometric Perception and Reasoning

📅 2026-03-23
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limited fine-grained geometric structure perception of multimodal large language models, which hinders their geometric understanding and visual reasoning capabilities. To overcome this limitation, we propose GeoTikzBridge, a framework that enhances local geometric awareness through a two-stage training strategy (Base and Instruct). We introduce GeoTikz-Base, the largest image-to-TikZ dataset to date with 2.5 million pairs, and GeoTikz-Instruct, the first instruction-augmented dataset designed for visual reasoning. Our approach integrates iterative data expansion, local geometric transformation strategies, and multimodal instruction tuning. Evaluated on open-source multimodal large language models, GeoTikzBridge achieves state-of-the-art performance in geometric perception and reasoning, and functions effectively as a plug-and-play module to boost geometric problem-solving capabilities across diverse models.

๐Ÿ“ Abstract
Multimodal Large Language Models (MLLMs) have recently demonstrated remarkable perceptual and reasoning abilities. However, they struggle to perceive fine-grained geometric structures, constraining their geometric understanding and visual reasoning. To address this, we propose GeoTikzBridge, a framework that enhances local geometric perception and visual reasoning through TikZ-based code generation. Within this framework, we build two models supported by two complementary datasets. The GeoTikzBridge-Base model is trained on the GeoTikz-Base dataset, the largest image-to-TikZ dataset to date with 2.5M pairs (16 $\times$ larger than existing open-source datasets). This dataset is constructed via iterative data expansion and a localized geometric transformation strategy. Subsequently, GeoTikzBridge-Instruct is fine-tuned on the GeoTikz-Instruct dataset, the first instruction-augmented TikZ dataset supporting visual reasoning. Extensive experimental results demonstrate that our models achieve state-of-the-art performance among open-source MLLMs. Furthermore, GeoTikzBridge models can serve as plug-and-play reasoning modules for any MLLM (or LLM), enhancing reasoning performance in geometric problem-solving. Datasets and code are publicly available at: https://github.com/sjy-1995/GeoTikzBridge-Advancing-Multimodal-Code-Generation-for-Geometric-Perception-and-Reasoning.
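The abstract describes TikZ-based code generation as the bridge between a geometric image and its underlying structure. As a purely illustrative sketch (not drawn from the paper's actual dataset), an image-to-TikZ training pair might couple a rendered figure with code along these lines:

```latex
% Hypothetical example of the code side of an image-to-TikZ pair:
% a right triangle ABC with the right angle at B.
\documentclass[tikz]{standalone}
\begin{document}
\begin{tikzpicture}
  \coordinate (A) at (0,0);
  \coordinate (B) at (4,0);
  \coordinate (C) at (4,3);
  \draw (A) -- (B) -- (C) -- cycle;          % the triangle itself
  \draw (3.6,0) -- (3.6,0.4) -- (4,0.4);     % right-angle marker at B
  \node[below left]  at (A) {$A$};
  \node[below right] at (B) {$B$};
  \node[above right] at (C) {$C$};
\end{tikzpicture}
\end{document}
```

Because the code names every vertex, edge, and angle marker explicitly, generating it forces a model to commit to the fine-grained geometric structure that raw pixel-level perception tends to miss.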
Problem

Research questions and friction points this paper is trying to address.

Multimodal Large Language Models
geometric perception
visual reasoning
fine-grained geometric structures
Innovation

Methods, ideas, or system contributions that make the work stand out.

TikZ code generation
geometric perception
multimodal reasoning
instruction-augmented dataset
plug-and-play module
Jiayin Sun
JIUTIAN Research
Caixia Sun
JIUTIAN Research
Boyu Yang
JIUTIAN Research
Hailin Li
JIUTIAN Research
Xiao Chen
JIUTIAN Research
Yi Zhang
JIUTIAN Research
Errui Ding
Baidu Inc.
computer vision, machine learning
Liang Li
JIUTIAN Research
Chao Deng
JIUTIAN Research
Junlan Feng
Chief Scientist at China Mobile Research
Natural Language, Machine Learning, Speech Processing, Data Mining