Multimodal Priors-Augmented Text-Driven 3D Human-Object Interaction Generation
AI Summary
Proposes MP-HOI, a framework that uses multimodal priors to guide text-driven 3D human-object interaction (HOI) motion generation and improve interaction realism.
Main Contributions
- Multimodal data priors to guide HOI generation
- An enhanced object representation incorporating geometric keypoints, contact features, and dynamic properties (see the sketch after this list)
- A modality-aware Mixture-of-Experts (MoE) model for multimodal feature fusion
- A cascaded diffusion framework that refines human-object interaction features
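As a concrete illustration of what such an enhanced object representation could contain, here is a minimal PyTorch sketch. The field names, shapes, and the 6D rotation choice are assumptions for illustration, not the paper's actual definition; only the three feature groups (geometric keypoints, contact features, dynamic properties) come from the abstract.

```python
import torch
from dataclasses import dataclass

@dataclass
class ObjectState:
    """Hypothetical per-frame object representation combining the three
    feature groups named in the paper; all names/shapes are assumptions."""
    keypoints: torch.Tensor         # (K, 3) geometric keypoints on the object surface
    contact: torch.Tensor           # (K,) per-keypoint contact probability with the body
    rotation: torch.Tensor          # (6,) continuous 6D rotation of the rigid object
    translation: torch.Tensor       # (3,) object root position
    velocity: torch.Tensor          # (3,) linear velocity (a "dynamic property")
    angular_velocity: torch.Tensor  # (3,) angular velocity

    def to_vector(self) -> torch.Tensor:
        """Flatten into one feature vector for the diffusion model."""
        return torch.cat([
            self.keypoints.reshape(-1),
            self.contact,
            self.rotation,
            self.translation,
            self.velocity,
            self.angular_velocity,
        ])
```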
Methodology
MP-HOI adopts a cascaded diffusion model that combines multimodal data priors with the enhanced object representation, fusing features through the modality-aware MoE model to generate high-quality HOI motions.
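To make the fusion step concrete, the following is a minimal PyTorch sketch of a modality-aware MoE layer; the gating scheme (a learned modality embedding added before routing), the expert count, and all dimensions are assumptions, since the summary does not specify the architecture.

```python
import torch
import torch.nn as nn

class ModalityAwareMoE(nn.Module):
    """Minimal modality-aware MoE fusion layer (illustrative sketch).

    Each token is scored by a gate conditioned on a learned modality
    embedding, and a softmax-weighted combination of expert outputs
    yields the fused feature. This is an assumed design, not MP-HOI's.
    """
    def __init__(self, dim: int = 512, num_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(num_experts)
        )
        # Routing can differ for text, image, and pose/object tokens.
        self.modality_embed = nn.Embedding(3, dim)  # 0=text, 1=image, 2=pose/object
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, x: torch.Tensor, modality_id: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) tokens; modality_id: (B, N) integer modality labels
        h = x + self.modality_embed(modality_id)
        weights = torch.softmax(self.gate(h), dim=-1)                    # (B, N, E)
        expert_out = torch.stack([e(x) for e in self.experts], dim=-1)  # (B, N, dim, E)
        return torch.einsum('bnde,bne->bnd', expert_out, weights)
```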
Original Abstract
We address the challenging task of text-driven 3D human-object interaction (HOI) motion generation. Existing methods primarily rely on a direct text-to-HOI mapping, which suffers from three key limitations due to the significant cross-modality gap: (Q1) sub-optimal human motion, (Q2) unnatural object motion, and (Q3) weak interaction between humans and objects. To address these challenges, we propose MP-HOI, a novel framework grounded in four core insights: (1) Multimodal Data Priors: We leverage multimodal data (text, image, pose/object) from large multimodal models as priors to guide HOI generation, which tackles Q1 and Q2 in data modeling. (2) Enhanced Object Representation: We improve existing object representations by incorporating geometric keypoints, contact features, and dynamic properties, enabling more expressive representations, which tackles Q2 in data representation. (3) Multimodal-Aware Mixture-of-Experts (MoE) Model: We propose a modality-aware MoE model for effective multimodal feature fusion, which tackles Q1 and Q2 in feature fusion. (4) Cascaded Diffusion with Interaction Supervision: We design a cascaded diffusion framework that progressively refines human-object interaction features under dedicated supervision, which tackles Q3 in interaction refinement. Comprehensive experiments demonstrate that MP-HOI outperforms existing approaches in generating high-fidelity and fine-grained HOI motions.
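As a pseudo-code-level sketch of the cascaded idea, the snippet below runs two diffusion stages, with the second refining the interaction conditioned on the first stage's coarse output, plus a hypothetical interaction-supervision loss. All interfaces, shapes, and loss terms here are assumptions for illustration, not the paper's formulation.

```python
import torch
import torch.nn as nn

def cascaded_hoi_generation(stage1: nn.Module, stage2: nn.Module,
                            cond: torch.Tensor, num_steps: int = 50) -> torch.Tensor:
    """Illustrative two-stage cascade: stage1 denoises a coarse human+object
    motion; stage2 refines the interaction conditioned on that coarse result.
    Stage signatures and shapes are assumptions, not MP-HOI's interfaces."""
    # cond: (1, 196, C) per-frame multimodal conditioning features (assumed)
    x = torch.randn(1, 196, 512)   # (batch, frames, motion feature dim) - assumed
    for t in reversed(range(num_steps)):
        x = stage1(x, t, cond)     # one reverse-diffusion denoising step
    coarse = x

    # Stage 2: refine the interaction, conditioned on the coarse motion.
    x = torch.randn_like(coarse)
    for t in reversed(range(num_steps)):
        x = stage2(x, t, torch.cat([cond, coarse], dim=-1))
    return x

def interaction_loss(pred_contact: torch.Tensor, gt_contact: torch.Tensor,
                     joint_obj_dist: torch.Tensor) -> torch.Tensor:
    """Hypothetical interaction supervision: match contact labels and pull
    contacting joints toward the object surface."""
    bce = nn.functional.binary_cross_entropy(pred_contact, gt_contact)
    proximity = (gt_contact * joint_obj_dist).mean()  # penalize distance only where contact holds
    return bce + proximity
```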