AI Agents Relevance: 9/10

Describe-Then-Act: Proactive Agent Steering via Distilled Language-Action World Models

Massimiliano Pappa, Luca Romani, Valentino Sacco, Alessio Palma, Stéphane Lathuilière, Fabio Galasso, Xavier Alameda-Pineda, Indro Spinelli
arXiv: 2603.23149v1 Published: 2026-03-24 Updated: 2026-03-24

AI Summary

Proposes DILLO, which accelerates the agent's action-outcome prediction via knowledge distillation and can effectively steer the policy without any visual simulation.

Main Contributions

  • Proposes DILLO, a new architecture for agent steering and control.
  • Trains a fast text-only world model via cross-modal distillation.
  • Validates DILLO's effectiveness on MetaWorld and LIBERO, improving task success rates.

Methodology

A Vision Language Model serves as the teacher, guiding an LLM-based student to predict the semantic outcomes of planned actions from the policy's latent state, enabling fast text-driven inference.
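As a rough illustration of this cross-modal distillation step, the following minimal PyTorch sketch trains a latent-conditioned student to reproduce a teacher VLM's textual outcome annotations over offline trajectories. The module names, dimensions, backbone choice, and loss below are illustrative assumptions, not the paper's actual architecture.

```python
# Minimal sketch of cross-modal distillation: a latent-conditioned student
# learns to output the teacher VLM's description tokens for a planned action
# chunk. All shapes and module choices are hypothetical.
import torch
import torch.nn as nn

class LatentConditionedStudent(nn.Module):
    """Hypothetical student: maps policy latent + planned actions to
    description-token logits (stand-in for an LLM-based student)."""
    def __init__(self, latent_dim=256, action_dim=8, vocab_size=32000, hidden=512):
        super().__init__()
        self.proj = nn.Linear(latent_dim + action_dim, hidden)
        self.backbone = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, latent, actions):
        # latent: (B, latent_dim), actions: (B, T, action_dim)
        x = torch.cat(
            [latent.unsqueeze(1).expand(-1, actions.size(1), -1), actions], dim=-1)
        h, _ = self.backbone(self.proj(x))
        return self.head(h)  # (B, T, vocab_size)

def distillation_step(student, optimizer, latent, actions, teacher_tokens):
    """One step of distillation: cross-entropy against the teacher VLM's
    offline textual annotations of the action outcome."""
    logits = student(latent, actions)
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), teacher_tokens.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with random tensors standing in for one offline trajectory batch.
student = LatentConditionedStudent()
opt = torch.optim.Adam(student.parameters(), lr=1e-4)
latent = torch.randn(4, 256)                        # policy latent state
actions = torch.randn(4, 10, 8)                     # planned action chunk
teacher_tokens = torch.randint(0, 32000, (4, 10))   # teacher VLM annotation tokens
print(distillation_step(student, opt, latent, actions, teacher_tokens))
```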

Original Abstract

Deploying safety-critical agents requires anticipating the consequences of actions before they are executed. While world models offer a paradigm for this proactive foresight, current approaches relying on visual simulation incur prohibitive latencies, often exceeding several seconds per step. In this work, we challenge the assumption that visual processing is necessary for failure prevention. We show that a trained policy's latent state, combined with its planned actions, already encodes sufficient information to anticipate action outcomes, making visual simulation redundant for failure prevention. To this end, we introduce DILLO (DIstiLLed Language-ActiOn World Model), a fast steering layer that shifts the paradigm from "simulate-then-act" to "describe-then-act." DILLO is trained via cross-modal distillation, where a privileged Vision Language Model teacher annotates offline trajectories and a latent-conditioned Large Language Model student learns to predict semantic outcomes. This creates a text-only inference path, bypassing heavy visual generation entirely, achieving a 14x speedup over baselines. Experiments on MetaWorld and LIBERO demonstrate that DILLO produces high-fidelity descriptions of the next state and is able to steer the policy, improving episode success rate by up to 15 pp and 9.3 pp on average across tasks.
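To make the "describe-then-act" loop concrete, here is a hedged, self-contained sketch of how such a text-only steering layer might wrap an existing policy at inference time. The ToyEnv, ToyPolicy, and ToyStudent classes and the keyword-based failure check are hypothetical stand-ins, not the paper's interfaces; a real setup would use MetaWorld or LIBERO and the distilled LLM student.

```python
# Sketch of the describe-then-act loop: before executing a planned action
# chunk, query the distilled text world model and re-plan if it predicts a
# bad outcome. All classes and the failure heuristic are illustrative.
import random

FAILURE_CUES = ("drops", "misses", "collides", "fails")

class ToyEnv:
    """Stand-in environment (a real setup would be MetaWorld or LIBERO)."""
    def reset(self):
        return 0.0
    def step(self, actions):
        return 0.0, 0.0, True, {"success": True}

class ToyPolicy:
    """Stand-in policy exposing a latent state and a planned action chunk."""
    def plan(self, obs, resample=False):
        latent = [random.random() for _ in range(4)]
        actions = [[random.random() for _ in range(2)] for _ in range(5)]
        return latent, actions

class ToyStudent:
    """Stand-in for the distilled language-action world model."""
    def describe(self, latent, actions):
        return random.choice(["the gripper grasps the block",
                              "the gripper misses the block"])

def looks_like_failure(description):
    """Naive keyword check standing in for whatever outcome parser is used."""
    return any(cue in description.lower() for cue in FAILURE_CUES)

def describe_then_act(env, policy, student, max_steps=10, retries=3):
    obs = env.reset()
    info = {}
    for _ in range(max_steps):
        latent, actions = policy.plan(obs)
        for _ in range(retries):
            description = student.describe(latent, actions)    # text-only foresight
            if not looks_like_failure(description):
                break
            latent, actions = policy.plan(obs, resample=True)   # steer: re-plan
        obs, reward, done, info = env.step(actions)
        if done:
            break
    return info

print(describe_then_act(ToyEnv(), ToyPolicy(), ToyStudent()))
```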

Tags

Agents · World Models · Knowledge Distillation · Language Models

arXiv Categories

cs.AI