Multimodal Learning Relevance: 9/10

VistaGEN: Consistent Driving Video Generation with Fine-Grained Control Using Multiview Visual-Language Reasoning

Li-Heng Chen, Ke Cheng, Yahui Liu, Lei Shi, Shi-Sheng Huang, Hongbo Fu
arXiv: 2603.28353v1 Published: 2026-03-30 Updated: 2026-03-30

AI Summary

VistaGEN generates controllable, consistent driving videos through multiview visual-language reasoning.

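Key Contributions

  • Proposes VistaGEN, a driving video generation technique with fine-grained control
  • Introduces multiview visual-language reasoning to improve spatiotemporal consistency
  • Proposes a multiview vision-language evaluator (MV-VLM) for automatic evaluation and refinement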

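Methodology

Visual-language features are injected into a multiview video generator, and the MV-VLM drives a closed loop of generation, evaluation, and regeneration.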
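
The closed loop described above can be illustrated with a minimal control-flow sketch. This is not the authors' implementation: the names `closed_loop_generate`, `Evaluation`, `generate`, `evaluate`, and `refine`, the `max_rounds` cutoff, and the toy stand-ins at the bottom are all hypothetical, chosen only to show how a generator, an evaluator, and an object-level refinement step might be chained.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class Evaluation:
    """Hypothetical evaluator verdict: overall pass/fail plus flagged objects."""
    consistent: bool
    flagged_objects: List[str]


def closed_loop_generate(
    generate: Callable[[Any], Any],
    evaluate: Callable[[Any], Evaluation],
    refine: Callable[[Any, List[str]], Any],
    conditions: Any,
    max_rounds: int = 3,
) -> Tuple[Any, int]:
    """Generate, evaluate, and regenerate until the evaluator is satisfied.

    Returns the last video and the number of regeneration rounds used.
    If max_rounds is exhausted, the final video is returned unevaluated.
    """
    video = generate(conditions)
    for round_idx in range(max_rounds):
        result = evaluate(video)
        if result.consistent:
            return video, round_idx
        # Refine only the objects the evaluator flagged, then regenerate.
        conditions = refine(conditions, result.flagged_objects)
        video = generate(conditions)
    return video, max_rounds


# Toy stand-ins (assumptions for illustration only): a "video" is a dict,
# and an object is "consistent" when its condition flag is True.
def toy_generate(conditions):
    return {"objects": dict(conditions)}


def toy_evaluate(video):
    bad = [name for name, ok in video["objects"].items() if not ok]
    return Evaluation(consistent=not bad, flagged_objects=bad)


def toy_refine(conditions, flagged):
    fixed = dict(conditions)
    for name in flagged:
        fixed[name] = True
    return fixed


video, rounds = closed_loop_generate(
    toy_generate, toy_evaluate, toy_refine, {"car": True, "truck": False}
)
```

In this toy run the evaluator flags `truck` on the first pass, the refinement step repairs its condition, and the second pass is accepted. A real system would replace the three callables with the video generator, the MV-VLM, and the object-level refinement module.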

Original Abstract

Driving video generation has achieved much progress in controllability, video resolution, and length, but fails to support fine-grained object-level controllability for diverse driving videos, while preserving the spatiotemporal consistency, especially in long video generation. In this paper, we present a new driving video generation technique, called VistaGEN, which enables fine-grained control of specific entities, including 3D objects, images, and text descriptions, while maintaining spatiotemporal consistency in long video sequences. Our key innovation is the incorporation of multiview visual-language reasoning into the long driving video generation. To this end, we inject visual-language features into a multiview video generator to enable fine-grained controllability. More importantly, we propose a multiview vision-language evaluator (MV-VLM) to intelligently and automatically evaluate spatiotemporal consistency of the generated content, thus formulating a novel generation-evaluation-regeneration closed-loop generation mechanism. This mechanism ensures high-quality, coherent outputs, facilitating the creation of complex and reliable driving scenarios. Besides, within the closed-loop generation, we introduce an object-level refinement module to refine the unsatisfied results evaluated from the MV-VLM and then feed them back to the video generator for regeneration. Extensive evaluation shows that our VistaGEN achieves diverse driving video generation results with fine-grained controllability, especially for long-tail objects, and much better spatiotemporal consistency than previous approaches.

Tags

Driving video generation, Visual-language reasoning, Multiview learning, Spatiotemporal consistency, Fine-grained control

arXiv Categories

cs.CV