Multimodal Learning (relevance: 9/10)

ViHOI: Human-Object Interaction Synthesis with Visual Priors

Songjin Cai, Linjie Zhong, Ling Guo, Changxing Ding
arXiv: 2603.24383v1 Published: 2026-03-25 Updated: 2026-03-25

AI Summary

ViHOI uses 2D image priors to guide 3D human-object interaction generation, improving both generation quality and generalization.

Key Contributions

  • Proposes the ViHOI framework, which leverages visual priors to improve HOI generation quality
  • Extracts visual and textual priors with a VLM and designs a Q-Former to compress them
  • Trains on motion-rendered images to ensure strict semantic alignment between visual inputs and motion sequences

Methodology

A VLM extracts visual priors from 2D images; a Q-Former compresses these priors into compact tokens, which condition the training of a diffusion model that generates 3D HOI motion.
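The compression step can be illustrated with a minimal Q-Former-style adapter sketch. This is an assumed design based on the general Q-Former idea, not the paper's actual implementation: all dimensions, module names, and the single-block structure here are illustrative. A small set of learnable query tokens cross-attends to the VLM's high-dimensional features, producing a fixed number of compact prior tokens suitable for conditioning a diffusion model.

```python
# Minimal sketch of a Q-Former-style adapter (illustrative assumption,
# not ViHOI's actual architecture): learnable queries cross-attend to
# VLM features and compress them into a few compact prior tokens.
import torch
import torch.nn as nn

class QFormerAdapter(nn.Module):
    def __init__(self, vlm_dim=4096, token_dim=512, num_queries=32, num_heads=8):
        super().__init__()
        # The number of learnable queries fixes how many prior tokens come out.
        self.queries = nn.Parameter(torch.randn(1, num_queries, token_dim) * 0.02)
        self.proj = nn.Linear(vlm_dim, token_dim)  # project VLM features down
        self.cross_attn = nn.MultiheadAttention(token_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(token_dim)
        self.ffn = nn.Sequential(
            nn.Linear(token_dim, token_dim * 4),
            nn.GELU(),
            nn.Linear(token_dim * 4, token_dim),
        )

    def forward(self, vlm_feats):
        # vlm_feats: (batch, seq_len, vlm_dim), e.g. token features taken
        # from VLM layers (the paper's layer-decoupled priors).
        kv = self.proj(vlm_feats)
        q = self.queries.expand(vlm_feats.size(0), -1, -1)
        attended, _ = self.cross_attn(q, kv, kv)
        tokens = self.norm(q + attended)
        # Output: (batch, num_queries, token_dim) compact prior tokens.
        return tokens + self.ffn(tokens)

# Compress 576 VLM tokens per image into 32 prior tokens.
adapter = QFormerAdapter()
prior_tokens = adapter(torch.randn(2, 576, 4096))
print(prior_tokens.shape)  # torch.Size([2, 32, 512])
```

However many input tokens the VLM produces, the output always has `num_queries` tokens, which is what makes the conditioning signal compact and sequence-length-independent.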

Original Abstract

Generating realistic and physically plausible 3D Human-Object Interactions (HOI) remains a key challenge in motion generation. One primary reason is that describing these physical constraints with words alone is difficult. To address this limitation, we propose a new paradigm: extracting rich interaction priors from easily accessible 2D images. Specifically, we introduce ViHOI, a novel framework that enables diffusion-based generative models to leverage rich, task-specific priors from 2D images to enhance generation quality. We utilize a large Vision-Language Model (VLM) as a powerful prior-extraction engine and adopt a layer-decoupled strategy to obtain visual and textual priors. Concurrently, we design a Q-Former-based adapter that compresses the VLM's high-dimensional features into compact prior tokens, which significantly facilitates the conditional training of our diffusion model. Our framework is trained on motion-rendered images from the dataset to ensure strict semantic alignment between visual inputs and motion sequences. During inference, it leverages reference images synthesized by a text-to-image generation model to improve generalization to unseen objects and interaction categories. Experimental results demonstrate that ViHOI achieves state-of-the-art performance, outperforming existing methods across multiple benchmarks and demonstrating superior generalization.

Tags

3D Human-Object Interaction · Diffusion Model · Vision-Language Model · Motion Generation

arXiv Category

cs.CV