Logics-Parsing-Omni Technical Report

📅 2026-03-10
🤖 AI Summary
This work addresses the challenges of fragmented task definitions and data heterogeneity in multimodal parsing by proposing the Omni Parsing framework. It unifies perception and cognition through a cohesive classification schema and a progressive parsing paradigm, integrating holistic detection, fine-grained recognition, and multi-level reasoning to enable end-to-end structured knowledge extraction. A novel evidence anchoring mechanism ensures strict alignment between high-level semantics and low-level factual content, facilitating logical induction that is locatable, enumerable, and traceable. Technically, the framework combines spatio-temporal holistic detection, OCR/ASR-based symbolization, attribute extraction, and semantic reasoning chain construction. The study also introduces a standardized dataset, the Logics-Parsing-Omni model, and the OmniParsingBench evaluation benchmark, collectively enhancing the reliability and structured output capability for parsing complex audio-visual signals.

📝 Abstract
Addressing the challenges of fragmented task definitions and the heterogeneity of unstructured data in multimodal parsing, this paper proposes the Omni Parsing framework. This framework establishes a Unified Taxonomy covering documents, images, and audio-visual streams, introducing a progressive parsing paradigm that bridges perception and cognition. Specifically, the framework integrates three hierarchical levels: 1) Holistic Detection, which achieves precise spatial-temporal grounding of objects or events to establish a geometric baseline for perception; 2) Fine-grained Recognition, which performs symbolization (e.g., OCR/ASR) and attribute extraction on localized objects to complete structured entity parsing; and 3) Multi-level Interpreting, which constructs a reasoning chain from local semantics to global logic. A pivotal advantage of this framework is its evidence anchoring mechanism, which enforces a strict alignment between high-level semantic descriptions and low-level facts. This enables "evidence-based" logical induction, transforming unstructured signals into standardized knowledge that is locatable, enumerable, and traceable. Building on this foundation, we constructed a standardized dataset and released the Logics-Parsing-Omni model, which successfully converts complex audio-visual signals into machine-readable structured knowledge. Experiments demonstrate that fine-grained perception and high-level cognition are synergistic, effectively enhancing model reliability. Furthermore, to quantitatively evaluate these capabilities, we introduce OmniParsingBench. Code, models and the benchmark are released at https://github.com/alibaba/Logics-Parsing/tree/master/Logics-Parsing-Omni.
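
To make the three levels and the evidence anchoring idea concrete, here is a minimal Python sketch. It is not the released model's output schema: the class names (`Region`, `Entity`, `Claim`), their fields, and the `unanchored_claims` helper are all hypothetical, chosen only to illustrate how a high-level statement could be checked against the low-level entities it cites.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Region:
    """Level 1 (Holistic Detection): where, and optionally when, something appears."""
    bbox: tuple[float, float, float, float]  # (x0, y0, x1, y1); hypothetical pixel coordinates
    time_span: Optional[tuple[float, float]] = None  # (start_s, end_s) for audio-visual streams


@dataclass
class Entity:
    """Level 2 (Fine-grained Recognition): symbolized content of a localized region."""
    entity_id: str
    region: Region
    text: str  # e.g. OCR or ASR output
    attributes: dict[str, str] = field(default_factory=dict)


@dataclass
class Claim:
    """Level 3 (Multi-level Interpreting): a semantic statement plus the entities backing it."""
    statement: str
    evidence_ids: list[str]


def unanchored_claims(entities: list[Entity], claims: list[Claim]) -> list[Claim]:
    """Return claims that cite no evidence or cite entity ids that were never detected."""
    known = {e.entity_id for e in entities}
    return [c for c in claims if not c.evidence_ids or not set(c.evidence_ids) <= known]


# Illustrative data only; none of these values come from the paper.
entities = [
    Entity("e1", Region((10.0, 10.0, 200.0, 40.0)), "Quarterly revenue", {"role": "title"}),
    Entity("e2", Region((0.0, 0.0, 640.0, 360.0), (3.0, 5.5)), "revenue grew", {"source": "asr"}),
]
claims = [
    Claim("The slide title and the narration both concern revenue.", ["e1", "e2"]),
    Claim("The speaker announces a new product.", ["e9"]),  # cites an entity that does not exist
]
bad = unanchored_claims(entities, claims)
```

In this sketch, a claim counts as anchored only when every id it cites resolves to a detected entity, so the second claim is flagged. Because each entity carries its own region, an anchored claim can be traced back to a location in the source, which is the property the abstract describes as locatable, enumerable, and traceable.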
Problem

Research questions and friction points this paper is trying to address.

multimodal parsing
unstructured data
task fragmentation
structured knowledge
heterogeneity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Omni Parsing
evidence anchoring
unified taxonomy
progressive parsing paradigm
structured knowledge extraction
Authors

Xin An (Dalian Maritime University)
Jingyi Cai
Xiangyang Chen
Huayao Liu
Peiting Liu
Peng Wang
Bei Yang
Xiuwen Zhu
Yongfan Chen
Baoyu Hou
Shuzhao Li
Weidong Ren
Fan Yang
Jiangtao Zhang
Xiaoxiao Xu
Lin Qu