Multimodal Learning (Relevance: 9/10)

Integrating Multimodal Large Language Model Knowledge into Amodal Completion

Heecheol Yun, Eunho Yang
arXiv: 2603.28333v1 Published: 2026-03-30 Updated: 2026-03-30

AI Summary

Proposes AmodalCG, a framework that uses multimodal large language models to guide amodal completion, improving completion quality.

Key Contributions

  • Leverages MLLM knowledge to guide amodal completion
  • Proposes the AmodalCG framework, combining MLLM reasoning with a visual generative model
  • Validates the method's effectiveness on real-world images through experiments

Methodology

AmodalCG selectively invokes an MLLM to reason about the extent and content of the missing regions, and uses a visual generative model to iteratively refine the completion results.
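The control flow described above (occlusion check, selective MLLM guidance, iterative refinement) can be sketched as follows. This is a minimal illustration, not the authors' implementation: all function names, the occlusion threshold, and the quality cutoff are hypothetical placeholders.

```python
from dataclasses import dataclass

# Hypothetical cutoffs; the paper does not specify exact values.
OCCLUSION_THRESHOLD = 0.5   # above this, the object counts as "heavily occluded"
QUALITY_CUTOFF = 0.9        # stop refining once the completion scores this high
MAX_REFINE_STEPS = 3

@dataclass
class Completion:
    image: object   # the completed image
    score: float    # quality score from a verifier

def amodal_complete(image, mask, estimate_occlusion, mllm_guide, generate, assess):
    """Sketch of an AmodalCG-style pipeline (callables are assumed interfaces).

    estimate_occlusion(image, mask) -> float in [0, 1]
    mllm_guide(image, mask)         -> guidance on extent + content of missing regions
    generate(image, mask, guidance) -> completed image
    assess(image)                   -> quality score in [0, 1]
    """
    # Invoke MLLM guidance only when the target object is heavily occluded.
    guidance = None
    if estimate_occlusion(image, mask) > OCCLUSION_THRESHOLD:
        guidance = mllm_guide(image, mask)  # reasons about (1) extent, (2) content

    # Generate, then iteratively refine imperfect completions that may
    # arise from inaccurate MLLM guidance.
    best = Completion(generate(image, mask, guidance), 0.0)
    for _ in range(MAX_REFINE_STEPS):
        best.score = assess(best.image)
        if best.score >= QUALITY_CUTOFF:
            break
        best.image = generate(best.image, mask, guidance)
    return best
```

The key design point this captures is that MLLM guidance is gated rather than always-on: for lightly occluded objects, the generative model alone suffices, and the MLLM is consulted only when real-world knowledge is likely needed.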

Original Abstract

With the widespread adoption of autonomous vehicles and robotics, amodal completion, which reconstructs the occluded parts of people and objects in an image, has become increasingly crucial. Just as humans infer hidden regions based on prior experience and common sense, this task inherently requires physical knowledge about real-world entities. However, existing approaches either depend solely on the image generation ability of visual generative models, which lack such knowledge, or leverage it only during the segmentation stage, preventing it from explicitly guiding the completion process. To address this, we propose AmodalCG, a novel framework that harnesses the real-world knowledge of Multimodal Large Language Models (MLLMs) to guide amodal completion. Our framework first assesses the extent of occlusion to selectively invoke MLLM guidance only when the target object is heavily occluded. If guidance is required, the framework further incorporates MLLMs to reason about both the (1) extent and (2) content of the missing regions. Finally, a visual generative model integrates this guidance and iteratively refines imperfect completions that may arise from inaccurate MLLM guidance. Experimental results on various real-world images show impressive improvements compared to all existing works, suggesting MLLMs as a promising direction for addressing challenging amodal completion.

Tags

Multimodal Large Language Models · Amodal Completion · Visual Generative Models

arXiv Categories

cs.CV cs.AI