OmniParser V2: Structured-Points-of-Thought for Unified Visual Text Parsing and Its Generality to Multimodal Large Language Models

πŸ“… 2025-02-22
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing visually-situated text parsing (VsTP) methods rely on task-specific architectures, leading to modality fragmentation and procedural redundancy. This paper introduces OmniParser V2, an end-to-end framework unifying text spotting, key information extraction, table recognition, and layout analysis. Its core innovation is the Structured-Points-of-Thought (SPOT) prompting paradigm, which replaces task-specific designs with a coordinate-semantic joint representation, enabling cross-task and cross-modal general-purpose parsing. Built upon a unified encoder-decoder architecture, OmniParser V2 jointly embeds spatial coordinates and textual tokens, and integrates seamlessly with multimodal large language models (MLLMs). Evaluated across eight benchmarks spanning four VsTP task categories, it achieves state-of-the-art performance. Furthermore, transferring SPOT to MLLMs significantly improves text localization and recognition accuracy.
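To make the idea of a coordinate-semantic joint representation concrete, the sketch below shows one plausible way a unified decoder could interleave point coordinates with text so that detection, recognition, and structure prediction share a single token stream. The `<pt>` token format and helper functions here are hypothetical illustrations, not the actual SPOT schema from the paper.

```python
import re

# Hypothetical SPOT-style serialization: each text instance is anchored by a
# point (x, y) followed by its transcription, so spatial and semantic outputs
# live in one sequence. The real OmniParser V2 schema may differ.

def serialize(instances):
    """Turn [(x, y, text), ...] into a single decoder-style token string."""
    return " ".join(f"<pt> {x} {y} </pt> {text}" for x, y, text in instances)

def parse(sequence):
    """Recover (x, y, text) triples from a serialized token string."""
    pattern = re.compile(r"<pt> (\d+) (\d+) </pt> (.*?)(?=(?: <pt> )|$)")
    return [(int(x), int(y), t.strip()) for x, y, t in pattern.findall(sequence)]

# Example: two fields from a document image, each with a location and a string.
cells = [(120, 45, "Invoice No."), (320, 45, "2025-0042")]
seq = serialize(cells)
assert parse(seq) == cells
```

Because every task emits the same kind of sequence, a single cross-entropy objective over tokens can supervise all of them, which is the simplification the summary attributes to SPOT.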

πŸ“ Abstract
Visually-situated text parsing (VsTP) has recently seen notable advancements, driven by the growing demand for automated document understanding and the emergence of large language models capable of processing document-based questions. While various methods have been proposed to tackle the complexities of VsTP, existing solutions often rely on task-specific architectures and objectives for individual tasks. This leads to modal isolation and complex workflows due to the diversified targets and heterogeneous schemas. In this paper, we introduce OmniParser V2, a universal model that unifies typical VsTP tasks, including text spotting, key information extraction, table recognition, and layout analysis, into a unified framework. Central to our approach is the proposed Structured-Points-of-Thought (SPOT) prompting schema, which improves model performance across diverse scenarios by leveraging a unified encoder-decoder architecture, objective, and input and output representation. SPOT eliminates the need for task-specific architectures and loss functions, significantly simplifying the processing pipeline. Our extensive evaluations across four tasks on eight different datasets show that OmniParser V2 achieves state-of-the-art or competitive results in VsTP. Additionally, we explore the integration of SPOT within a multimodal large language model structure, further enhancing text localization and recognition capabilities, thereby confirming the generality of the SPOT prompting technique. The code is available at https://github.com/AlibabaResearch/AdvancedLiterateMachinery.
Problem

Research questions and friction points this paper is trying to address.

Fragmented, task-specific visual text parsing pipelines
Modal isolation from heterogeneous architectures and objectives
Limited text localization and recognition in multimodal language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified encoder-decoder architecture
Structured-Points-of-Thought prompting
Multimodal large language model integration
Wenwen Yu
Huazhong University of Science and Technology
Computer Vision, OCR, Document Understanding, Large Multimodal Models
Zhibo Yang
Alibaba Group, Hangzhou, 310000, China
Jianqiang Wan
Alibaba Group
Sibo Song
Alibaba
Computer Vision, Deep Learning, Multimodal Learning
Jun Tang
Alibaba Group, Hangzhou, 310000, China
Wenqing Cheng
School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan, 430074, China
Yuliang Liu
School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan, 430074, China
Xiang Bai
Huazhong University of Science and Technology (HUST)
Computer Vision, OCR