Multimodal Learning · Relevance: 9/10

Logics-Parsing-Omni Technical Report

Xin An, Jingyi Cai, Xiangyang Chen, Huayao Liu, Peiting Liu, Peng Wang, Bei Yang, Xiuwen Zhu, Yongfan Chen, Baoyu Hou, Shuzhao Li, Weidong Ren, Fan Yang, Jiangtao Zhang, Xiaoxiao Xu, Lin Qu
arXiv: 2603.09677v1 · Published: 2026-03-10 · Updated: 2026-03-10

AI Summary

The Omni Parsing framework unifies multimodal data parsing, realizing progressive parsing from perception to cognition, and contributes an accompanying dataset and model.

Key Contributions

  • Proposes the Omni Parsing framework, unifying the multimodal parsing pipeline
  • Establishes a unified taxonomy covering documents, images, and audio-visual streams
  • Releases the Logics-Parsing-Omni model and the OmniParsingBench benchmark

Methodology

A three-level framework is constructed: Holistic Detection, Fine-grained Recognition, and Multi-level Interpreting. An evidence anchoring mechanism transforms unstructured signals into traceable knowledge.
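The three levels and the anchoring constraint can be sketched as data records. This is a minimal illustration, not the paper's actual API: the names `Region`, `Entity`, `Claim`, and `is_anchored` are hypothetical, chosen only to show how a high-level claim stays traceable down to a detected region.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Region:
    # Level 1 -- Holistic Detection: spatial-temporal grounding of an object/event.
    region_id: str
    bbox: tuple                        # (x0, y0, x1, y1) in the page or frame
    time_span: Optional[tuple] = None  # (start_s, end_s) for audio-visual streams

@dataclass
class Entity:
    # Level 2 -- Fine-grained Recognition: symbolization (OCR/ASR) plus attributes.
    entity_id: str
    text: str
    attributes: dict = field(default_factory=dict)
    anchor: str = ""                   # region_id this entity is grounded in

@dataclass
class Claim:
    # Level 3 -- Multi-level Interpreting: local semantics lifted to global logic.
    statement: str
    evidence: tuple = ()               # entity_ids supporting the statement

def is_anchored(regions, entities, claims):
    """Evidence anchoring: every claim must resolve to recognized entities,
    and every cited entity must be grounded in a detected region."""
    region_ids = {r.region_id for r in regions}
    grounded = {e.entity_id for e in entities if e.anchor in region_ids}
    return all(ev in grounded for c in claims for ev in c.evidence)

# Toy example: a document title detected, recognized, and interpreted.
regions = [Region("r1", bbox=(40, 20, 560, 60))]
entities = [Entity("e1", "Quarterly Report", {"role": "title"}, anchor="r1")]
claims = [Claim("The document is a quarterly report.", evidence=("e1",))]
print(is_anchored(regions, entities, claims))  # True: the claim is evidence-backed
```

A claim citing an entity that no detection supports would fail this check, which is the sense in which parsed knowledge is "locatable, enumerable, and traceable."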

Original Abstract

Addressing the challenges of fragmented task definitions and the heterogeneity of unstructured data in multimodal parsing, this paper proposes the Omni Parsing framework. This framework establishes a Unified Taxonomy covering documents, images, and audio-visual streams, introducing a progressive parsing paradigm that bridges perception and cognition. Specifically, the framework integrates three hierarchical levels: 1) Holistic Detection, which achieves precise spatial-temporal grounding of objects or events to establish a geometric baseline for perception; 2) Fine-grained Recognition, which performs symbolization (e.g., OCR/ASR) and attribute extraction on localized objects to complete structured entity parsing; and 3) Multi-level Interpreting, which constructs a reasoning chain from local semantics to global logic. A pivotal advantage of this framework is its evidence anchoring mechanism, which enforces a strict alignment between high-level semantic descriptions and low-level facts. This enables "evidence-based" logical induction, transforming unstructured signals into standardized knowledge that is locatable, enumerable, and traceable. Building on this foundation, we constructed a standardized dataset and released the Logics-Parsing-Omni model, which successfully converts complex audio-visual signals into machine-readable structured knowledge. Experiments demonstrate that fine-grained perception and high-level cognition are synergistic, effectively enhancing model reliability. Furthermore, to quantitatively evaluate these capabilities, we introduce OmniParsingBench. Code, models, and the benchmark are released at https://github.com/alibaba/Logics-Parsing/tree/master/Logics-Parsing-Omni.

Tags

multimodal parsing, knowledge representation, reasoning, audio-visual learning

arXiv Category

cs.AI