PositionOCR: Augmenting Positional Awareness in Multi-Modal Models via Hybrid Specialist Integration

📅 2026-02-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited spatial reasoning capabilities of existing multimodal large language models (MLLMs) in OCR-related visual question answering tasks, as well as the insufficient semantic understanding of specialized text detection models. To bridge this gap, the authors propose a parameter-efficient hybrid architecture that, for the first time, integrates the spatial awareness of text detection models with the contextual reasoning power of large language models through a mixture-of-experts mechanism. With only 131 million trainable parameters added, the approach significantly enhances both text localization and recognition performance. Experimental results demonstrate that the method substantially outperforms current state-of-the-art MLLMs on relevant benchmarks, achieving accurate and efficient multimodal text understanding.
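The summary above describes a mixture-of-experts mechanism that fuses a text-detection specialist's positional features with an LLM's contextual features using a small number of trainable parameters. The paper's actual architecture is not reproduced here; the following is a minimal illustrative sketch of that general idea, in which two frozen experts are projected by trainable adapters and mixed per token by a learned gate. All module names, dimensions, and the gating scheme are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class HybridMoEFusion(nn.Module):
    """Illustrative sketch (not the paper's code): gate-weighted mixing of
    a text-detection specialist's features and an LLM's hidden states.
    Dimensions and layer choices are assumptions."""

    def __init__(self, det_dim=256, llm_dim=1024, hidden=512):
        super().__init__()
        # Small trainable adapters project each (frozen) expert
        # into a shared feature space.
        self.det_proj = nn.Linear(det_dim, hidden)
        self.llm_proj = nn.Linear(llm_dim, hidden)
        # The gate emits per-token mixing weights over the two experts.
        self.gate = nn.Linear(2 * hidden, 2)

    def forward(self, det_feats, llm_feats):
        # det_feats: (B, T, det_dim) from the text-spotting specialist
        # llm_feats: (B, T, llm_dim) from the LLM decoder
        d = self.det_proj(det_feats)
        l = self.llm_proj(llm_feats)
        w = torch.softmax(self.gate(torch.cat([d, l], dim=-1)), dim=-1)
        # Convex combination of the two experts, per token.
        return w[..., 0:1] * d + w[..., 1:2] * l

fusion = HybridMoEFusion()
out = fusion(torch.randn(2, 8, 256), torch.randn(2, 8, 1024))
print(out.shape)  # torch.Size([2, 8, 512])
```

Only the adapters and the gate are trainable here, which is the sense in which such a hybrid can stay parameter-efficient relative to fine-tuning the full MLLM.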

📝 Abstract
In recent years, Multi-modal Large Language Models (MLLMs) have achieved strong performance in OCR-centric Visual Question Answering (VQA) tasks, illustrating their capability to process heterogeneous data and exhibit adaptability across varied contexts. However, these MLLMs rely on a Large Language Model (LLM) as the decoder, which is primarily designed for linguistic processing, and thus inherently lacks the positional reasoning required for precise visual tasks, such as text spotting and text grounding. Additionally, the extensive parameters of MLLMs necessitate substantial computational resources and large-scale data for effective training. Conversely, text spotting specialists achieve state-of-the-art coordinate predictions but lack semantic reasoning capabilities. This dichotomy motivates our key research question: Can we synergize the efficiency of specialists with the contextual power of LLMs to create a positionally-accurate MLLM? To overcome these challenges, we introduce PositionOCR, a parameter-efficient hybrid architecture that seamlessly integrates a text spotting model's positional strengths with an LLM's contextual reasoning. Comprising 131M trainable parameters, this framework demonstrates outstanding multi-modal processing capabilities, particularly excelling in tasks such as text grounding and text spotting, consistently surpassing traditional MLLMs.
Problem

Research questions and friction points this paper is trying to address.

Multi-modal Large Language Models
Positional Reasoning
Text Spotting
Text Grounding
OCR
Innovation

Methods, ideas, or system contributions that make the work stand out.

PositionOCR
Text Spotting
Positional Awareness
Hybrid Architecture
Multi-modal LLM