VLN-Pilot: Large Vision-Language Model as an Autonomous Indoor Drone Operator
AI Summary
VLN-Pilot uses a large vision-language model to achieve autonomous indoor drone navigation without a human remote pilot.
Key Contributions
- Proposes the VLN-Pilot framework, which uses a VLLM to control an indoor drone
- Achieves autonomous drone navigation from natural-language instructions
- Validates the framework's effectiveness in a photorealistic indoor simulation environment
Methodology
The VLLM interprets natural-language instructions and grounds them in visual observations to plan paths, driving the drone's autonomous flight while avoiding obstacles.
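To make this loop concrete, below is a minimal sketch of a perceive-reason-act cycle of the kind the summary describes: the current camera frame and the instruction are sent to a vision-language model, its reply is parsed into a discrete maneuver, and the maneuver is executed. All names here (`query_vllm`, `parse_action`, the `drone` interface, and the JSON reply format) are hypothetical placeholders for illustration, not the paper's actual API.

```python
# Hypothetical sketch of a VLLM-as-pilot control loop; not the paper's code.
import json

# Discrete flight primitives the model is allowed to choose from (assumed).
ACTIONS = {"forward", "backward", "left", "right", "up", "down",
           "rotate_left", "rotate_right", "stop"}

PROMPT_TEMPLATE = (
    "You are piloting an indoor drone. Instruction: {instruction}\n"
    "Given the attached camera frame, reply with JSON: "
    '{{"action": <one of {actions}>, "reason": <short explanation>}}'
)


def query_vllm(prompt: str, image_bytes: bytes) -> str:
    """Send the prompt plus the current camera frame to a vision-language
    model and return its raw text reply. Stubbed out here; in practice this
    would call whatever multimodal model endpoint is available."""
    raise NotImplementedError


def parse_action(reply: str) -> str:
    """Extract a flight action from the model's JSON reply, falling back
    to 'stop' on anything malformed or outside the allowed action set."""
    try:
        action = json.loads(reply).get("action", "stop")
    except json.JSONDecodeError:
        return "stop"
    return action if action in ACTIONS else "stop"


def pilot(drone, instruction: str, max_steps: int = 100) -> None:
    """Closed-loop instruction following: observe, ask the VLLM for the
    next maneuver, execute it, and repeat until the model says 'stop'."""
    prompt = PROMPT_TEMPLATE.format(instruction=instruction,
                                    actions=sorted(ACTIONS))
    for _ in range(max_steps):
        frame = drone.capture_frame()   # current onboard camera image
        action = parse_action(query_vllm(prompt, frame))
        if action == "stop":
            break
        drone.execute(action)           # low-level flight primitive
```

Capping the loop at `max_steps` and defaulting every unparseable reply to `stop` are simple safety choices for a sketch like this; a real system would add collision checks and velocity limits below the language layer.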
Original Abstract
This paper introduces VLN-Pilot, a novel framework in which a large Vision-and-Language Model (VLLM) assumes the role of a human pilot for indoor drone navigation. By leveraging the multimodal reasoning abilities of VLLMs, VLN-Pilot interprets free-form natural language instructions and grounds them in visual observations to plan and execute drone trajectories in GPS-denied indoor environments. Unlike traditional rule-based or geometric path-planning approaches, our framework integrates language-driven semantic understanding with visual perception, enabling context-aware, high-level flight behaviors with minimal task-specific engineering. VLN-Pilot supports fully autonomous instruction-following for drones by reasoning about spatial relationships, obstacle avoidance, and dynamic reactivity to unforeseen events. We validate our framework on a custom photorealistic indoor simulation benchmark and demonstrate the ability of the VLLM-driven agent to achieve high success rates on complex instruction-following tasks, including long-horizon navigation with multiple semantic targets. Experimental results highlight the promise of replacing remote drone pilots with a language-guided autonomous agent, opening avenues for scalable, human-friendly control of indoor UAVs in tasks such as inspection, search-and-rescue, and facility monitoring. Our results suggest that VLLM-based pilots may dramatically reduce operator workload while improving safety and mission flexibility in constrained indoor environments.