Multimodal Learning relevance: 8/10

Factored Latent Action World Models

Zizhao Wang, Chang Shi, Jiaheng Hu, Kevin Rohling, Roberto Martín-Martín, Amy Zhang, Peter Stone
arXiv: 2602.16229v1 Published: 2026-02-18 Updated: 2026-02-18

AI Summary

FLAM decomposes a scene into independent factors and learns a latent action for each, improving video generation quality and policy learning in multi-entity environments.

Key Contributions

  • Proposes FLAM, a factored latent action model
  • FLAM models complex multi-entity environments more accurately than monolithic models
  • Experiments show FLAM outperforms existing methods in prediction accuracy and representation quality

Methodology

FLAM decomposes the scene into independent factors; each factor infers its own latent action and predicts its own next-step factor value.
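To make the factored structure concrete, here is a minimal sketch of per-factor inverse and forward dynamics, assuming linear maps and illustrative shapes (`N_FACTORS`, `FACTOR_DIM`, `ACTION_DIM`); none of these names or architectural choices come from the paper, which presumably uses learned neural models.

```python
# Hypothetical sketch of a factored latent action model: one (inverse, forward)
# model pair per scene factor, rather than a single monolithic pair.
import numpy as np

rng = np.random.default_rng(0)
N_FACTORS, FACTOR_DIM, ACTION_DIM = 3, 4, 2  # illustrative sizes, not the paper's

# Per-factor linear weights: the inverse model infers a latent action from a
# pair of consecutive factor values; the forward model predicts the next
# factor value from the current value and that latent action.
W_inv = rng.normal(size=(N_FACTORS, ACTION_DIM, 2 * FACTOR_DIM))
W_fwd = rng.normal(size=(N_FACTORS, FACTOR_DIM, FACTOR_DIM + ACTION_DIM))

def infer_latent_actions(z_t, z_next):
    """Inverse dynamics: (N_FACTORS, FACTOR_DIM) x 2 -> (N_FACTORS, ACTION_DIM)."""
    pairs = np.concatenate([z_t, z_next], axis=-1)   # (N_FACTORS, 2*FACTOR_DIM)
    return np.einsum("fad,fd->fa", W_inv, pairs)     # one latent action per factor

def predict_next(z_t, a):
    """Forward dynamics: each factor sees only its own state and latent action."""
    inp = np.concatenate([z_t, a], axis=-1)          # (N_FACTORS, FACTOR_DIM+ACTION_DIM)
    return np.einsum("fod,fd->fo", W_fwd, inp)

z_t = rng.normal(size=(N_FACTORS, FACTOR_DIM))
z_next = rng.normal(size=(N_FACTORS, FACTOR_DIM))
a = infer_latent_actions(z_t, z_next)  # factored latent actions
z_pred = predict_next(z_t, a)          # factored next-step prediction
```

The key structural point is that `W_inv` and `W_fwd` carry a leading factor axis, so each entity's dynamics are modeled independently instead of through one scene-wide latent action.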

Original Abstract

Learning latent actions from action-free video has emerged as a powerful paradigm for scaling up controllable world model learning. Latent actions provide a natural interface for users to iteratively generate and manipulate videos. However, most existing approaches rely on monolithic inverse and forward dynamics models that learn a single latent action to control the entire scene, and therefore struggle in complex environments where multiple entities act simultaneously. This paper introduces Factored Latent Action Model (FLAM), a factored dynamics framework that decomposes the scene into independent factors, each inferring its own latent action and predicting its own next-step factor value. This factorized structure enables more accurate modeling of complex multi-entity dynamics and improves video generation quality in action-free video settings compared to monolithic models. Based on experiments on both simulation and real-world multi-entity datasets, we find that FLAM outperforms prior work in prediction accuracy and representation quality, and facilitates downstream policy learning, demonstrating the benefits of factorized latent action models.

Tags

World Models  Latent Actions  Video Generation  Multi-Entity

arXiv Categories

cs.LG