ThermEval: A Structured Benchmark for Evaluation of Vision-Language Models on Thermal Imagery
AI Summary
Proposes ThermEval, a structured benchmark for evaluating vision-language models (VLMs) on thermal imagery, revealing the shortcomings of existing models in this domain.
Main Contributions
- Constructed ThermEval-B, a large-scale thermal visual question answering benchmark of roughly 55,000 QA pairs, incorporating the newly collected ThermEval-D dataset with dense per-pixel temperature maps.
- Evaluated a broad range of open- and closed-source VLMs on thermal imagery, exposing consistent failures in temperature-grounded reasoning.
- Identified the limitations of RGB-centric evaluation, underscoring the need for evaluation dedicated to the thermal domain.
Methodology
Builds a dataset of real-world thermal images paired with human-annotated question-answer pairs, evaluates existing VLMs on it, and analyzes their weaknesses.
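One failure mode the benchmark probes is degradation under colormap transformations: rendering a temperature map through a colormap discards absolute temperature before the VLM ever sees the image. A minimal NumPy sketch of that rendering step (the function `temperature_to_rgb` and the blue-to-red mapping are illustrative assumptions, not the paper's actual pipeline):

```python
import numpy as np

def temperature_to_rgb(temp_map):
    """Min-max normalize a per-pixel temperature map (in degrees C) and
    render it as a simple blue-to-red pseudocolor image (uint8 HxWx3).
    Note: normalization keeps only relative temperature; the absolute
    scale is lost, which is exactly what temperature-grounded QA needs."""
    t = np.asarray(temp_map, dtype=np.float64)
    norm = (t - t.min()) / (t.max() - t.min() + 1e-9)  # scale to [0, 1]
    r = (norm * 255).astype(np.uint8)          # hot  -> red channel
    b = ((1.0 - norm) * 255).astype(np.uint8)  # cold -> blue channel
    g = np.zeros_like(r)                       # no green component
    return np.stack([r, g, b], axis=-1)

# Synthetic 4x4 scene: 20 C background with a warm "body" patch
temps = np.full((4, 4), 20.0)
temps[1:3, 1:3] = 36.5
rgb = temperature_to_rgb(temps)
```

The same pseudocolor image would result from a 0 C scene with a 5 C patch, illustrating why models evaluated only on colormapped inputs cannot recover absolute temperatures.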
Original Abstract
Vision language models (VLMs) achieve strong performance on RGB imagery, but they do not generalize to thermal images. Thermal sensing plays a critical role in settings where visible light fails, including nighttime surveillance, search and rescue, autonomous driving, and medical screening. Unlike RGB imagery, thermal images encode physical temperature rather than color or texture, requiring perceptual and reasoning capabilities that existing RGB-centric benchmarks do not evaluate. We introduce ThermEval-B, a structured benchmark of approximately 55,000 thermal visual question answering pairs designed to assess the foundational primitives required for thermal vision language understanding. ThermEval-B integrates public datasets with our newly collected ThermEval-D, the first dataset to provide dense per-pixel temperature maps with semantic body-part annotations across diverse indoor and outdoor environments. Evaluating 25 open-source and closed-source VLMs, we find that models consistently fail at temperature-grounded reasoning, degrade under colormap transformations, and default to language priors or fixed responses, with only marginal gains from prompting or supervised fine-tuning. These results demonstrate that thermal understanding requires dedicated evaluation beyond RGB-centric assumptions, positioning ThermEval as a benchmark to drive progress in thermal vision language modeling.