Multimodal Learning Relevance: 9/10

UAV traffic scene understanding: A cross-spectral guided approach and a unified benchmark

Yu Zhang, Zhicheng Zhao, Ze Luo, Chenglong Li, Jin Tang

arXiv: 2603.10722v1 Published: 2026-03-11 Updated: 2026-03-11

AI Summary

Proposes CTCNet for UAV traffic scene understanding in complex environments and constructs Traffic-VQA, a large-scale multimodal dataset.

Key Contributions

Proposes the Cross-spectral Traffic Cognition Network (CTCNet)
Designs the Prototype-Guided Knowledge Embedding (PGKE) module
Designs the Quality-Aware Spectral Compensation (QASC) module
Constructs Traffic-VQA, a large-scale optical-thermal infrared dataset

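Methodology

Exploits the complementary characteristics of the optical and thermal infrared modalities, using knowledge embedding and quality-aware spectral compensation to achieve robust scene understanding.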
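
The idea of quality-aware compensation between two spectra can be illustrated with a toy example. The sketch below is not the QASC module from the paper, whose internals are not described here. It assumes each modality comes with a scalar quality score in [0, 1] (a hypothetical input) and lets each stream borrow from the other in proportion to its own degradation. The function name `compensate` and the gating formula are illustrative assumptions.

```python
def compensate(optical, thermal, q_optical, q_thermal):
    """Toy bidirectional exchange between two feature vectors.

    Assumption (not from the paper): q_optical and q_thermal are scalar
    quality scores in [0, 1], where 1 means the modality is fully reliable.
    """
    if len(optical) != len(thermal):
        raise ValueError("feature vectors must have the same length")
    for q in (q_optical, q_thermal):
        if not 0.0 <= q <= 1.0:
            raise ValueError("quality scores must lie in [0, 1]")

    # A stream borrows more when it is degraded AND the other one is reliable.
    g_opt = (1.0 - q_optical) * q_thermal
    g_thr = (1.0 - q_thermal) * q_optical

    # Convex blend of own features with the other modality's features.
    opt_out = [(1.0 - g_opt) * o + g_opt * t for o, t in zip(optical, thermal)]
    thr_out = [(1.0 - g_thr) * t + g_thr * o for o, t in zip(optical, thermal)]
    return opt_out, thr_out


if __name__ == "__main__":
    # Night scene: optical is unreliable, thermal is reliable.
    print(compensate([1.0, 0.0], [0.0, 1.0], q_optical=0.0, q_thermal=1.0))
```

When both quality scores are 1, neither stream is altered. When the optical stream is fully degraded and the thermal stream is reliable, the optical features are replaced by the thermal ones. A learned model would presumably predict such gates per location rather than take them as given scalars.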

Original Abstract

Traffic scene understanding from unmanned aerial vehicle (UAV) platforms is crucial for intelligent transportation systems due to its flexible deployment and wide-area monitoring capabilities. However, existing methods face significant challenges in real-world surveillance, as their heavy reliance on optical imagery leads to severe performance degradation under adverse illumination conditions like nighttime and fog. Furthermore, current Visual Question Answering (VQA) models are restricted to elementary perception tasks, lacking the domain-specific regulatory knowledge required to assess complex traffic behaviors. To address these limitations, we propose a novel Cross-spectral Traffic Cognition Network (CTCNet) for robust UAV traffic scene understanding. Specifically, we design a Prototype-Guided Knowledge Embedding (PGKE) module that leverages high-level semantic prototypes from an external Traffic Regulation Memory (TRM) to anchor domain-specific knowledge into visual representations, enabling the model to comprehend complex behaviors and distinguish fine-grained traffic violations. Moreover, we develop a Quality-Aware Spectral Compensation (QASC) module that exploits the complementary characteristics of optical and thermal modalities to perform bidirectional context exchange, effectively compensating for degraded features to ensure robust representation in complex environments. In addition, we construct Traffic-VQA, the first large-scale optical-thermal infrared benchmark for cognitive UAV traffic understanding, comprising 8,180 aligned image pairs and 1.3 million question-answer pairs across 31 diverse types. Extensive experiments demonstrate that CTCNet significantly outperforms state-of-the-art methods in both cognition and perception scenarios. The dataset is available at https://github.com/YuZhang-2004/UAV-traffic-scene-understanding.
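
The abstract describes anchoring knowledge from a memory of semantic prototypes into visual representations. One generic way to do something of that kind is attention over a prototype bank, sketched below. This is an assumption made for illustration, not the authors' PGKE design: the function `embed_knowledge`, the dot-product similarity, and the residual addition are all hypothetical choices.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def embed_knowledge(visual, prototypes):
    """Attend over prototype vectors and add the weighted sum to the visual feature.

    Assumption (not from the paper): prototypes are plain vectors with the
    same dimension as the visual feature, and similarity is a dot product.
    """
    if not prototypes:
        raise ValueError("at least one prototype is required")
    if any(len(p) != len(visual) for p in prototypes):
        raise ValueError("prototypes must match the visual feature dimension")

    # Similarity between the visual feature and each prototype.
    scores = [sum(v * p for v, p in zip(visual, proto)) for proto in prototypes]
    weights = softmax(scores)

    # Weighted combination of prototypes, one value per feature dimension.
    knowledge = [
        sum(w * proto[i] for w, proto in zip(weights, prototypes))
        for i in range(len(visual))
    ]

    # Residual addition keeps the original visual information.
    enriched = [v + k for v, k in zip(visual, knowledge)]
    return enriched, weights


if __name__ == "__main__":
    feature = [1.0, 0.0]
    memory = [[1.0, 0.0], [0.0, 1.0]]  # two hypothetical prototypes
    print(embed_knowledge(feature, memory))
```

The prototype most similar to the visual feature receives the largest weight, so the enriched feature is pulled toward the matching concept while the original signal is retained.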

arXiv Categories

cs.CV cs.AI
