G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model

📅 2023-12-18

🏛️ arXiv.org

📈 Citations: 126

✨ Influential: 24

🤖 AI Summary

Current multimodal large language models (MLLMs) struggle to accurately identify basic geometric elements and their relationships when reasoning over geometric figures. To address this, we exploit properties specific to geometric problems, namely their unique logical form and geometric scalability, together with the capabilities of text-only LLMs, to augment existing data into Geo170K, a multimodal geometry dataset of more than 170K image-caption and question-answer pairs. G-LLaVA, built on the LLaVA architecture and trained on Geo170K, significantly outperforms GPT-4-V on the MathVista benchmark with only 7B parameters.

📝 Abstract

Large language models (LLMs) have shown remarkable proficiency in human-level reasoning and generation, which has encouraged extensive research on their application to mathematical problem solving. However, current work has largely focused on text-based mathematical problems, with limited investigation into problems involving geometric information. Addressing this gap, we aim to enable LLMs to solve geometric problems by understanding image input. We first analyze the limitations of current Multimodal Large Language Models (MLLMs) in this area: they struggle to accurately comprehend basic geometric elements and their relationships. To overcome these challenges, we take advantage of the unique characteristics of geometric problems (such as their unique logical form and geometric scalability) and the capacity of textual LLMs to build an enriched multimodal geometry dataset based on existing data. The augmented dataset, Geo170K, contains more than 170K geometric image-caption and question-answer pairs. Utilizing our constructed Geo170K dataset, we develop G-LLaVA, which demonstrates exceptional performance in solving geometric problems, significantly outperforming GPT-4-V on the MathVista benchmark with only 7B parameters.

Problem

Research questions and friction points this paper is trying to address.

Solving geometric problems using multi-modal large language models

Understanding geometric elements and relationships from image inputs

Overcoming limitations of current MLLMs in geometric comprehension

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal dataset with geometric elements

Enhanced geometric comprehension via text augmentation

Specialized model outperforms larger counterparts
