Multimodal Learning Relevance: 8/10

VOID: Video Object and Interaction Deletion

Saman Motamed, William Harvey, Benjamin Klein, Luc Van Gool, Zhuoning Yuan, Ta-Ying Cheng
arXiv: 2604.02296v1 Published: 2026-04-02 Updated: 2026-04-02

AI Summary

Proposes VOID, a framework that uses causal reasoning and a video diffusion model to achieve physically plausible video object removal.

Key Contributions

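  • Proposes VOID, a video object removal framework
  • Generates a new paired dataset of counterfactual object removals using Kubric and HUMOTO
  • Combines a vision-language model with a video diffusion model for realistic removal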

Methodology

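A vision-language model identifies the regions of the scene affected by the removed object, and these regions guide a video diffusion model to generate physically consistent counterfactual outcomes.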
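
As a rough illustration of the region-guidance idea, the sketch below merges the removed object's mask with regions flagged as affected, producing one spatio-temporal mask that marks what a generative model would need to re-synthesize. It is not the VOID implementation. The box format, the function names `build_edit_mask` and `apply_mask`, and the step of blanking masked pixels are all assumptions made for illustration; the source does not specify how the vision-language model's output is encoded or passed to the diffusion model.

```python
import numpy as np


def build_edit_mask(object_mask, affected_boxes):
    """Union of the removed-object mask and the flagged affected regions.

    object_mask: bool array of shape (T, H, W), True on the object to remove.
    affected_boxes: list of (t0, t1, y0, y1, x0, x1) half-open ranges.
        Hypothetical format standing in for a vision-language model's output.
    """
    edit_mask = object_mask.astype(bool).copy()  # do not mutate the input
    n_frames, height, width = edit_mask.shape
    for t0, t1, y0, y1, x0, x1 in affected_boxes:
        # Clip each range to the video bounds before marking it.
        t0, t1 = max(t0, 0), min(t1, n_frames)
        y0, y1 = max(y0, 0), min(y1, height)
        x0, x1 = max(x0, 0), min(x1, width)
        edit_mask[t0:t1, y0:y1, x0:x1] = True
    return edit_mask


def apply_mask(video, edit_mask, fill=0.0):
    """Blank out the pixels a generative model would be asked to regenerate.

    video: float array of shape (T, H, W, C).
    """
    masked = video.copy()
    masked[edit_mask] = fill  # the (T, H, W) mask selects whole pixels
    return masked


if __name__ == "__main__":
    video = np.ones((4, 8, 8, 3), dtype=np.float32)
    object_mask = np.zeros((4, 8, 8), dtype=bool)
    object_mask[:, 0:2, 0:2] = True  # object occupies a 2x2 patch in every frame
    # Assumed example: a collision affects a 2x2 region in frames 2 and 3.
    mask = build_edit_mask(object_mask, [(2, 4, 4, 6, 4, 6)])
    conditioned = apply_mask(video, mask)
    print(mask.sum(axis=(1, 2)))
```

The point of the union is that the editable area extends beyond the object's own pixels, so downstream interactions such as a collision can be regenerated rather than only the background behind the object.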

Original Abstract

Existing video object removal methods excel at inpainting content "behind" the object and correcting appearance-level artifacts such as shadows and reflections. However, when the removed object has more significant interactions, such as collisions with other objects, current models fail to correct them and produce implausible results. We present VOID, a video object removal framework designed to perform physically-plausible inpainting in these complex scenarios. To train the model, we generate a new paired dataset of counterfactual object removals using Kubric and HUMOTO, where removing an object requires altering downstream physical interactions. During inference, a vision-language model identifies regions of the scene affected by the removed object. These regions are then used to guide a video diffusion model that generates physically consistent counterfactual outcomes. Experiments on both synthetic and real data show that our approach better preserves consistent scene dynamics after object removal compared to prior video object removal methods. We hope this framework sheds light on how to make video editing models better simulators of the world through high-level causal reasoning.

Tags

video object removal, causal reasoning, video diffusion model, vision-language model

arXiv Categories

cs.CV cs.AI