Factored Latent Action World Models
AI Summary
FLAM decomposes a scene into independent factors and learns a latent action for each, improving video generation quality and policy learning in multi-entity environments.
Main Contributions
- Proposes FLAM, a factored latent action model
- FLAM models complex multi-entity environments more accurately
- Experiments show FLAM outperforms existing methods in prediction accuracy and representation quality
Methodology
FLAM decomposes the scene into independent factors, inferring a separate latent action for each factor and predicting that factor's next-step value.
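The factored structure can be sketched as follows. This is a toy illustration, not the authors' implementation: the per-factor inverse and forward dynamics here are simple delta-based functions, whereas in the paper they would be learned neural networks; all function names are hypothetical.

```python
# Toy sketch of a factored latent action model (hypothetical, not FLAM's code).
# A scene state is a list of per-factor feature vectors. Each factor has its
# own inverse model (infers a latent action from consecutive states) and its
# own forward model (predicts the next factor value from the latent action).

def infer_latent_action(factor_t, factor_next):
    # Toy inverse dynamics: the latent action is the per-dimension change.
    return [b - a for a, b in zip(factor_t, factor_next)]

def predict_next_factor(factor_t, latent_action):
    # Toy forward dynamics: apply the latent action to the current factor.
    return [a + z for a, z in zip(factor_t, latent_action)]

def flam_step(state_t, state_next):
    # Factored structure: each factor gets its own latent action and its own
    # next-step prediction, independently of the other factors.
    actions = [infer_latent_action(f, g) for f, g in zip(state_t, state_next)]
    preds = [predict_next_factor(f, z) for f, z in zip(state_t, actions)]
    return actions, preds

# Example: two factors (e.g. two entities) moving independently.
actions, preds = flam_step([[0.0, 1.0], [2.0, 3.0]],
                           [[0.5, 1.0], [2.0, 2.5]])
```

The key contrast with a monolithic model is that no single latent action has to explain the whole scene; each entity's motion is captured by its own action variable.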
原文摘要
Learning latent actions from action-free video has emerged as a powerful paradigm for scaling up controllable world model learning. Latent actions provide a natural interface for users to iteratively generate and manipulate videos. However, most existing approaches rely on monolithic inverse and forward dynamics models that learn a single latent action to control the entire scene, and therefore struggle in complex environments where multiple entities act simultaneously. This paper introduces Factored Latent Action Model (FLAM), a factored dynamics framework that decomposes the scene into independent factors, each inferring its own latent action and predicting its own next-step factor value. This factorized structure enables more accurate modeling of complex multi-entity dynamics and improves video generation quality in action-free video settings compared to monolithic models. Based on experiments on both simulation and real-world multi-entity datasets, we find that FLAM outperforms prior work in prediction accuracy and representation quality, and facilitates downstream policy learning, demonstrating the benefits of factorized latent action models.