Multimodal Learning Relevance: 8/10

From Flow to One Step: Real-Time Multi-Modal Trajectory Policies via Implicit Maximum Likelihood Estimation-based Distribution Distillation

Ju Dong, Liding Zhang, Lei Zhang, Yu Fu, Kaixin Bai, Zoltan-Csaba Marton, Zhenshan Bing, Zhaopeng Chen, Alois Christian Knoll, Jianwei Zhang
arXiv: 2603.09415v1 Published: 2026-03-10 Updated: 2026-03-10

AI Summary

The paper proposes an IMLE-based distribution distillation framework that distills a flow model into a single-step policy, enabling real-time multi-modal trajectory control for robots.

Key Contributions

  • An IMLE-based distribution distillation framework
  • A bi-directional Chamfer distance that promotes both mode coverage and fidelity
  • A unified perception encoder fusing multi-view RGB, depth, point clouds, and proprioception

Methodology

A Conditional Flow Matching (CFM) expert is distilled into a single-step student model via IMLE, using a bi-directional Chamfer distance as the objective function.
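The set-level objective can be illustrated with a minimal NumPy sketch of the bi-directional Chamfer distance between a set of teacher trajectories and a set of student trajectories. This is an illustrative reconstruction from the abstract, not the paper's implementation; function and variable names are our own.

```python
import numpy as np

def bidirectional_chamfer(teacher: np.ndarray, student: np.ndarray) -> float:
    """Bi-directional Chamfer distance between two trajectory sets.

    `teacher` has shape (n, d), `student` shape (m, d), each row a
    flattened trajectory. The teacher->student term rewards mode
    coverage (every teacher sample has a nearby student sample); the
    student->teacher term rewards fidelity (every student sample lies
    near some teacher mode).
    """
    # Pairwise squared Euclidean distances, shape (n, m).
    d2 = ((teacher[:, None, :] - student[None, :, :]) ** 2).sum(axis=-1)
    coverage = d2.min(axis=1).mean()   # each teacher sample -> nearest student sample
    fidelity = d2.min(axis=0).mean()   # each student sample -> nearest teacher sample
    return float(coverage + fidelity)
```

Because both directions are averaged, collapsing all student samples onto a single teacher mode keeps the fidelity term small but inflates the coverage term, which is what penalizes the averaged-trajectory failure mode.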

Original Abstract

Generative policies based on diffusion and flow matching achieve strong performance in robotic manipulation by modeling multi-modal human demonstrations. However, their reliance on iterative Ordinary Differential Equation (ODE) integration introduces substantial latency, limiting high-frequency closed-loop control. Recent single-step acceleration methods alleviate this overhead but often exhibit distributional collapse, producing averaged trajectories that fail to execute coherent manipulation strategies. We propose a framework that distills a Conditional Flow Matching (CFM) expert into a fast single-step student via Implicit Maximum Likelihood Estimation (IMLE). A bi-directional Chamfer distance provides a set-level objective that promotes both mode coverage and fidelity, enabling preservation of the teacher's multi-modal action distribution in a single forward pass. A unified perception encoder further integrates multi-view RGB, depth, point clouds, and proprioception into a geometry-aware representation. The resulting high-frequency control supports real-time receding-horizon re-planning and improved robustness under dynamic disturbances.
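The IMLE idea named in the abstract can be sketched in a few lines: for each teacher sample, draw a pool of latent codes, generate candidates with the single-step student, and penalize the distance to the nearest candidate. This is a generic IMLE objective for illustration only; the `generate` callable stands in for a hypothetical latent-conditioned single-step student and is not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def imle_loss(teacher_samples: np.ndarray, generate, n_latents: int = 16,
              latent_dim: int = 4) -> float:
    """Generic IMLE objective (sketch).

    For every teacher sample, some generated candidate must land
    nearby; minimizing the nearest-candidate distance therefore pulls
    the student toward covering every teacher mode rather than
    averaging them. `generate` maps latents (n_latents, latent_dim)
    to actions (n_latents, d).
    """
    z = rng.standard_normal((n_latents, latent_dim))
    candidates = generate(z)  # (n_latents, d)
    # Pairwise squared distances teacher x candidates, shape (n, n_latents).
    d2 = ((teacher_samples[:, None, :] - candidates[None, :, :]) ** 2).sum(axis=-1)
    return float(d2.min(axis=1).mean())
```

In contrast to GAN-style objectives, every teacher sample contributes a term here, so a student that drops a demonstration mode pays for it directly.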

Tags

Robot Control Multimodal Learning Flow Models Distillation

arXiv Categories

cs.RO cs.AI