UniM: A Unified Any-to-Any Interleaved Multimodal Benchmark
AI Summary
Introduces the UniM benchmark for evaluating the ability of Multimodal Large Language Models (MLLMs) to understand and generate arbitrary combinations of interleaved modalities.
Main Contributions
- Introduces the UniM dataset of 31K high-quality multimodal instances spanning 30 domains and 7 modalities
- Introduces the UniM Evaluation Suite, which assesses semantic correctness & generation quality, response structure integrity, and interleaved coherence
- Introduces UniMA, an agentic baseline model with traceable reasoning for structured interleaved generation
Methodology
Constructs a dataset covering multiple interleaved modalities, designs evaluation metrics along the three dimensions above, and proposes an agent-based baseline model.
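To make the "interleaved instance" notion concrete, here is a minimal sketch of one plausible way such an instance could be represented, with modality-tagged segments on both the input and output side. All names and fields here are illustrative assumptions, not UniM's actual schema.

```python
from dataclasses import dataclass, field

# Hypothetical sketch only: this is NOT UniM's real data format.
# The 7 modalities listed match those named in the paper's abstract.
MODALITIES = {"text", "image", "audio", "video", "document", "code", "3d"}

@dataclass
class Segment:
    modality: str  # one of the 7 modalities above
    content: str   # raw text, or a path/URI for non-text media

    def __post_init__(self):
        if self.modality not in MODALITIES:
            raise ValueError(f"unknown modality: {self.modality}")

@dataclass
class InterleavedInstance:
    domain: str
    inputs: list[Segment] = field(default_factory=list)
    outputs: list[Segment] = field(default_factory=list)

    def modalities_used(self) -> set[str]:
        # Which modalities appear anywhere in this instance
        return {s.modality for s in self.inputs + self.outputs}

# Example: text+image in, text+audio out (illustrative values)
inst = InterleavedInstance(
    domain="education",
    inputs=[Segment("text", "Explain this diagram"),
            Segment("image", "diagram.png")],
    outputs=[Segment("text", "The diagram shows ..."),
             Segment("audio", "narration.wav")],
)
print(sorted(inst.modalities_used()))  # ['audio', 'image', 'text']
```

A flat list of tagged segments like this makes "any-to-any" explicit: any subset of modalities may appear, in any order, on either side.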
Original Abstract
In real-world multimodal applications, systems usually need to comprehend arbitrarily combined and interleaved multimodal inputs from users, while also generating outputs in any interleaved multimedia form. This capability defines the goal of any-to-any interleaved multimodal learning under a unified paradigm of understanding and generation, posing new challenges and opportunities for advancing Multimodal Large Language Models (MLLMs). To foster and benchmark this capability, this paper introduces the UniM benchmark, the first Unified Any-to-Any Interleaved Multimodal dataset. UniM contains 31K high-quality instances across 30 domains and 7 representative modalities: text, image, audio, video, document, code, and 3D, each requiring multiple intertwined reasoning and generation capabilities. We further introduce the UniM Evaluation Suite, which assesses models along three dimensions: Semantic Correctness & Generation Quality, Response Structure Integrity, and Interleaved Coherence. In addition, we propose UniMA, an agentic baseline model equipped with traceable reasoning for structured interleaved generation. Comprehensive experiments demonstrate the difficulty of UniM and highlight key challenges and directions for advancing unified any-to-any multimodal intelligence. The project page is https://any2any-mllm.github.io/unim.