Multimodal Learning Relevance: 9/10

Omni-Video 2: Scaling MLLM-Conditioned Diffusion for Unified Video Generation and Editing

Hao Yang, Zhiyu Tan, Jia Gong, Luozheng Qin, Hesen Chen, Xiaomeng Yang, Yuqing Sun, Yuetan Lin, Mengping Yang, Hao Li
arXiv: 2602.08820v1 Published: 2026-02-09 Updated: 2026-02-09

AI Summary

Omni-Video 2 uses an MLLM to understand user instructions and guide a video diffusion model, enabling unified video generation and editing.

Key Contributions

  • Proposes an MLLM-based framework for video editing
  • Designs a lightweight adapter that reuses pretrained diffusion models
  • Demonstrates high-quality, large-scale video generation and editing

Methodology

An MLLM generates explicit target captions that guide the diffusion model, and a lightweight adapter is developed to inject multimodal condition tokens into the pretrained backbone.
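
The abstract does not detail the adapter's internals. Below is a minimal PyTorch sketch, assuming the adapter is a small projection MLP that maps MLLM hidden-state tokens into the diffusion model's conditioning space; all dimensions, class names, and the concatenation scheme are illustrative assumptions rather than the authors' implementation.

  import torch
  import torch.nn as nn

  class MultimodalConditionAdapter(nn.Module):
      # Hypothetical adapter: maps MLLM hidden-state tokens into the
      # conditioning space of a frozen text-to-video diffusion model.
      def __init__(self, mllm_dim: int = 4096, cond_dim: int = 1024):
          super().__init__()
          # A small MLP keeps the trainable parameter count low, so the
          # frozen backbone's generative prior is maximally reused.
          self.proj = nn.Sequential(
              nn.LayerNorm(mllm_dim),
              nn.Linear(mllm_dim, cond_dim),
              nn.GELU(),
              nn.Linear(cond_dim, cond_dim),
          )

      def forward(self, mllm_tokens: torch.Tensor,
                  text_tokens: torch.Tensor) -> torch.Tensor:
          # mllm_tokens: (batch, n_mllm, mllm_dim) from the understanding model
          # text_tokens: (batch, n_text, cond_dim) from the diffusion text encoder
          adapted = self.proj(mllm_tokens)
          # Concatenate along the sequence axis so the diffusion model's
          # cross-attention can attend to both condition streams.
          return torch.cat([text_tokens, adapted], dim=1)

  # Smoke test with toy shapes
  adapter = MultimodalConditionAdapter()
  cond = adapter(torch.randn(2, 64, 4096), torch.randn(2, 77, 1024))
  print(cond.shape)  # torch.Size([2, 141, 1024])

Concatenation into the cross-attention key/value stream is one common conditioning route; the actual injection point used by Omni-Video 2 is not specified in the abstract.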

Original Abstract

We present Omni-Video 2, a scalable and computationally efficient model that connects pretrained multimodal large language models (MLLMs) with video diffusion models for unified video generation and editing. Our key idea is to exploit the understanding and reasoning capabilities of MLLMs to produce explicit target captions that interpret user instructions. In this way, the rich contextual representations from the understanding model are directly used to guide the generative process, thereby improving performance on complex and compositional editing. Moreover, a lightweight adapter is developed to inject multimodal conditional tokens into pretrained text-to-video diffusion models, allowing maximum reuse of their powerful generative priors in a parameter-efficient manner. Benefiting from these designs, we scale Omni-Video 2 up to a 14B video diffusion model on meticulously curated, high-quality training data, supporting high-quality text-to-video generation and various video editing tasks such as object removal, addition, background change, complex motion editing, etc. We evaluate the performance of Omni-Video 2 on the FiVE benchmark for fine-grained video editing and the VBench benchmark for text-to-video generation. The results demonstrate its superior ability to follow complex compositional instructions in video editing, while also achieving competitive or superior quality in video generation tasks.
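
The abstract's two-stage flow (understand, then generate) can be made concrete with a short control-flow sketch. Everything below, including the class names, method signatures, and stub behavior, is a hypothetical stand-in rather than the authors' API.

  from dataclasses import dataclass

  @dataclass
  class MLLMOutput:
      target_caption: str   # explicit caption that interprets the user instruction
      hidden_tokens: list   # contextual representations handed to the adapter

  class StubMLLM:
      # Placeholder for the pretrained understanding model (assumption).
      def interpret(self, frames, instruction: str) -> MLLMOutput:
          # A real MLLM would reason over the frames and the instruction;
          # echoing a caption here just makes the control flow concrete.
          return MLLMOutput(
              target_caption=f"the same scene, but {instruction}",
              hidden_tokens=[0.0] * 8,
          )

  class StubVideoDiffusion:
      # Placeholder for the frozen text-to-video backbone (assumption).
      def generate(self, caption: str, extra_cond_tokens) -> str:
          # In the paper, a lightweight adapter injects the extra tokens;
          # here we only report what the backbone would be conditioned on.
          return f"video <- '{caption}' + {len(extra_cond_tokens)} adapter tokens"

  def edit_video(frames, instruction: str) -> str:
      out = StubMLLM().interpret(frames, instruction)   # stage 1: understand
      return StubVideoDiffusion().generate(             # stage 2: generate
          out.target_caption, out.hidden_tokens)

  print(edit_video(frames=[], instruction="remove the red car"))

Routing the instruction through an explicit target caption, rather than conditioning on the raw instruction alone, is what the abstract credits for the improved performance on complex compositional edits.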

Tags

Video Generation, Video Editing, Multimodal Learning, Diffusion Models, MLLM

arXiv Categories

cs.CV