Logics-Parsing-Omni Technical Report

📅 2026-03-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenges of fragmented task definitions and data heterogeneity in multimodal parsing by proposing the Omni Parsing framework. It unifies perception and cognition through a cohesive classification schema and a progressive parsing paradigm, integrating holistic detection, fine-grained recognition, and multi-level reasoning to enable end-to-end structured knowledge extraction. A novel evidence anchoring mechanism enforces strict alignment between high-level semantics and low-level factual content, so that logical induction is locatable, enumerable, and traceable. Technically, the framework combines spatio-temporal holistic detection, OCR/ASR-based symbolization, attribute extraction, and semantic reasoning chain construction. The study also introduces a standardized dataset, the Logics-Parsing-Omni model, and the OmniParsingBench evaluation benchmark, which together improve the reliability and structured-output capability of parsing complex audio-visual signals.

📝 Abstract

Addressing the challenges of fragmented task definitions and the heterogeneity of unstructured data in multimodal parsing, this paper proposes the Omni Parsing framework. This framework establishes a Unified Taxonomy covering documents, images, and audio-visual streams, introducing a progressive parsing paradigm that bridges perception and cognition. Specifically, the framework integrates three hierarchical levels: 1) Holistic Detection, which achieves precise spatio-temporal grounding of objects or events to establish a geometric baseline for perception; 2) Fine-grained Recognition, which performs symbolization (e.g., OCR/ASR) and attribute extraction on localized objects to complete structured entity parsing; and 3) Multi-level Interpreting, which constructs a reasoning chain from local semantics to global logic. A pivotal advantage of this framework is its evidence anchoring mechanism, which enforces a strict alignment between high-level semantic descriptions and low-level facts. This enables "evidence-based" logical induction, transforming unstructured signals into standardized knowledge that is locatable, enumerable, and traceable. Building on this foundation, we constructed a standardized dataset and released the Logics-Parsing-Omni model, which converts complex audio-visual signals into machine-readable structured knowledge. Experiments demonstrate that fine-grained perception and high-level cognition are synergistic, effectively enhancing model reliability. Furthermore, to quantitatively evaluate these capabilities, we introduce OmniParsingBench. Code, models, and the benchmark are released at https://github.com/alibaba/Logics-Parsing/tree/master/Logics-Parsing-Omni.
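To make the three levels and the evidence anchoring idea concrete, the following is a minimal Python sketch of how such structured output could be represented and checked. It is not the schema used by Logics-Parsing-Omni: the class names (`Detection`, `Recognition`, `Interpretation`), the field names, the bounding-box and time-span conventions, and the `unanchored_statements` helper are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Detection:
    """Level 1 (Holistic Detection): where and/or when something occurs."""
    det_id: str
    label: str
    # Hypothetical conventions: bbox is (x0, y0, x1, y1); time_span is (start_s, end_s).
    bbox: Optional[Tuple[float, float, float, float]] = None
    time_span: Optional[Tuple[float, float]] = None


@dataclass
class Recognition:
    """Level 2 (Fine-grained Recognition): symbols and attributes of one detection."""
    det_id: str
    text: str = ""  # e.g. OCR or ASR output
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class Interpretation:
    """Level 3 (Multi-level Interpreting): a statement plus the evidence it cites."""
    statement: str
    evidence_ids: List[str]


def unanchored_statements(
    detections: List[Detection],
    interpretations: List[Interpretation],
) -> List[str]:
    """Return statements that cite no evidence or cite an unknown detection id."""
    known_ids = {d.det_id for d in detections}
    failures = []
    for item in interpretations:
        cites_nothing = not item.evidence_ids
        cites_unknown = any(e not in known_ids for e in item.evidence_ids)
        if cites_nothing or cites_unknown:
            failures.append(item.statement)
    return failures
```

Under these assumptions, a high-level statement is accepted only if every id it cites resolves to a detected region or time span, which is one simple way to keep descriptions locatable and traceable; a real system would likely also check that the cited evidence supports the statement's content.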

Problem

Research questions and friction points this paper is trying to address.

multimodal parsing

unstructured data

task fragmentation

structured knowledge

heterogeneity

Innovation

Methods, ideas, or system contributions that make the work stand out.

Omni Parsing

evidence anchoring

unified taxonomy