Multimodal Learning Relevance: 9/10

K-Gen: A Multimodal Language-Conditioned Approach for Interpretable Keypoint-Guided Trajectory Generation

Mingxuan Mu, Guo Yang, Lei Chen, Ping Wu, Jianxun Cui
arXiv: 2603.04868v1 Published: 2026-03-05 Updated: 2026-03-05

AI Summary

K-Gen leverages a multimodal large language model with keypoint guidance to generate interpretable autonomous-driving trajectories, outperforming existing methods.

Key Contributions

  • Proposes K-Gen, a keypoint-guided multimodal trajectory generation framework
  • Uses an MLLM to combine visual and textual information for trajectory generation
  • Introduces the T-DAPO algorithm to optimize keypoint generation

Methodology

An MLLM fuses a rasterized BEV map with textual scene descriptions to generate interpretable keypoints; a refinement module then converts the keypoints into accurate trajectories, and T-DAPO, a trajectory-aware reinforcement fine-tuning algorithm, further optimizes keypoint generation.
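The keypoint-to-trajectory step can be illustrated with a minimal sketch. The actual refinement module in K-Gen is learned; here, purely as a hypothetical stand-in, sparse keypoints are densified into a smooth trajectory by interpolating along the arc length of the keypoint polyline. The function name `refine_keypoints` and all parameters are assumptions for illustration, not the paper's API.

```python
import numpy as np

def refine_keypoints(keypoints, num_points=50):
    """Densify sparse (x, y) keypoints into a trajectory.

    Hypothetical stand-in for K-Gen's learned refinement module:
    interpolate positions along the cumulative arc length of the
    keypoint polyline so points are evenly spaced in distance.
    """
    kp = np.asarray(keypoints, dtype=float)
    # Arc-length parameterization: cumulative distance along the polyline.
    seg = np.linalg.norm(np.diff(kp, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])
    # Sample evenly in arc length and interpolate each coordinate.
    s_dense = np.linspace(0.0, s[-1], num_points)
    x = np.interp(s_dense, s, kp[:, 0])
    y = np.interp(s_dense, s, kp[:, 1])
    return np.stack([x, y], axis=1)

# Example: three keypoints an MLLM might predict for a right turn.
traj = refine_keypoints([(0, 0), (10, 0), (15, -5)], num_points=20)
print(traj.shape)  # (20, 2)
```

A learned refinement module would additionally condition on the scene context and dynamics constraints; this sketch only captures the geometric densification step.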

Original Abstract

Generating realistic and diverse trajectories is a critical challenge in autonomous driving simulation. While Large Language Models (LLMs) show promise, existing methods often rely on structured data like vectorized maps, which fail to capture the rich, unstructured visual context of a scene. To address this, we propose K-Gen, an interpretable keypoint-guided multimodal framework that leverages Multimodal Large Language Models (MLLMs) to unify rasterized BEV map inputs with textual scene descriptions. Instead of directly predicting full trajectories, K-Gen generates interpretable keypoints along with reasoning that reflects agent intentions, which are subsequently refined into accurate trajectories by a refinement module. To further enhance keypoint generation, we apply T-DAPO, a trajectory-aware reinforcement fine-tuning algorithm. Experiments on WOMD and nuPlan demonstrate that K-Gen outperforms existing baselines, highlighting the effectiveness of combining multimodal reasoning with keypoint-guided trajectory generation.

Tags

Autonomous Driving · Trajectory Generation · Multimodal Learning · Keypoints · Reinforcement Learning

arXiv Category

cs.AI