Multimodal Learning — Relevance: 8/10

ShareVerse: Multi-Agent Consistent Video Generation for Shared World Modeling

Jiayi Zhu, Jianing Zhang, Yiying Yang, Wei Cheng, Xiaoyun Yuan
arXiv: 2603.02697v1 Published: 2026-03-03 Updated: 2026-03-03

AI Summary

ShareVerse proposes a multi-agent consistent video generation framework for shared world modeling.

Key Contributions

  • Builds a large-scale multi-agent interaction dataset
  • Proposes a spatial concatenation strategy to ensure multi-view geometric consistency
  • Integrates cross-agent attention blocks to guarantee shared-world consistency

Methodology

A dataset is constructed on the CARLA simulation platform; the video generation model is then trained by combining spatial concatenation with a cross-agent attention mechanism.
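The spatial concatenation idea can be sketched as follows: each agent's four views (front/rear/left/right) are tiled into a single composite frame, so one video model processes all views jointly and can keep them geometrically consistent. This is an illustrative toy, not the paper's implementation; frames are represented as small H x W grids, and the 2x2 layout (front|rear on top, left|right below) is an assumption for demonstration.

```python
def concat_views(front, rear, left, right):
    """Tile four equally sized H x W frames into one 2H x 2W composite frame."""
    # Top half: front and rear side by side, row by row.
    top = [f_row + r_row for f_row, r_row in zip(front, rear)]
    # Bottom half: left and right side by side.
    bottom = [l_row + r_row for l_row, r_row in zip(left, right)]
    return top + bottom

# Toy 2x2 single-channel frames, one constant value per view.
H, W = 2, 2
views = {name: [[v] * W for _ in range(H)]
         for name, v in [("front", 0), ("rear", 1), ("left", 2), ("right", 3)]}

composite = concat_views(views["front"], views["rear"],
                         views["left"], views["right"])
print(composite)
# → [[0, 0, 1, 1], [0, 0, 1, 1], [2, 2, 3, 3], [2, 2, 3, 3]]
```

The point of the tiling is that ordinary 2D attention within the composite frame then spans all four views, which is one way to obtain the internal multi-view consistency the paper describes.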

Original Abstract

This paper presents ShareVerse, a video generation framework enabling multi-agent shared world modeling, addressing the gap in existing works that lack support for unified shared world construction with multi-agent interaction. ShareVerse leverages the generation capability of large video models and integrates three key innovations: 1) A dataset for large-scale multi-agent interactive world modeling is built on the CARLA simulation platform, featuring diverse scenes, weather conditions, and interactive trajectories with paired multi-view videos (front/rear/left/right views per agent) and camera data. 2) We propose a spatial concatenation strategy for four-view videos of independent agents to model a broader environment and to ensure internal multi-view geometric consistency. 3) We integrate cross-agent attention blocks into the pretrained video model, which enable interactive transmission of spatial-temporal information across agents, guaranteeing shared world consistency in overlapping regions and reasonable generation in non-overlapping regions. ShareVerse, which supports 49-frame large-scale video generation, accurately perceives the position of dynamic agents and achieves consistent shared world modeling.

Tags

Video Generation Multi-Agent Shared World Modeling CARLA

arXiv Categories

cs.CV cs.AI