LLM Reasoning Relevance: 8/10

GeoSolver: Scaling Test-Time Reasoning in Remote Sensing with Fine-Grained Process Supervision

Lang Sun, Ronghao Fu, Zhuoran Duan, Haoran Liu, Xueyan Liu, Bo Yang
arXiv: 2603.09551v1 Published: 2026-03-10 Updated: 2026-03-10

AI Summary

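GeoSolver improves the reasoning ability of VLMs in remote sensing image understanding through verifiable, process-supervised reinforcement learning.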

Key Contributions

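  • Constructs Geo-PRM-2M, a large-scale, token-level process supervision dataset
  • Proposes GeoPRM, a token-level process reward model that provides fine-grained feedback
  • Designs Process-Aware Tree-GRPO, a tree-structured reinforcement learning algorithm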

Methodology

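Synthesizes the dataset via MCTS, trains a process reward model on it, and then optimizes the model's reasoning with a process-aware reinforcement learning algorithm.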
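To make the reward side of this pipeline concrete, here is a minimal sketch of a GRPO-style group-relative advantage computed over faithfulness-weighted rewards. It is not the paper's implementation: the weighting rule (outcome reward multiplied by the mean per-step score), the function names, and the flat (non-tree) grouping are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def faithfulness_weighted_reward(outcome_reward, step_scores):
    """Hypothetical weighting rule: scale the outcome reward by the mean
    per-step faithfulness score (each score assumed to lie in [0, 1])."""
    if not step_scores:
        return outcome_reward
    return outcome_reward * mean(step_scores)


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group, as in GRPO:
    advantage_i = (r_i - mean(r)) / (std(r) + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three sampled responses to the same prompt (toy numbers).
rewards = [
    faithfulness_weighted_reward(1.0, [1.0, 1.0]),  # correct and faithful
    faithfulness_weighted_reward(1.0, [0.5, 0.5]),  # correct, weakly grounded
    faithfulness_weighted_reward(0.0, [1.0, 1.0]),  # incorrect
]
advantages = group_relative_advantages(rewards)
```

Under this assumed rule, a correct answer reached through poorly grounded steps earns a smaller advantage than one reached through faithful steps, which is the kind of intermediate credit assignment the summary describes.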

Original Abstract

While Vision-Language Models (VLMs) have significantly advanced remote sensing interpretation, enabling them to perform complex, step-by-step reasoning remains highly challenging. Recent efforts to introduce Chain-of-Thought (CoT) reasoning to this domain have shown promise, yet ensuring the visual faithfulness of these intermediate steps remains a critical bottleneck. To address this, we introduce GeoSolver, a novel framework that transitions remote sensing reasoning toward verifiable, process-supervised reinforcement learning. We first construct Geo-PRM-2M, a large-scale, token-level process supervision dataset synthesized via entropy-guided Monte Carlo Tree Search (MCTS) and targeted visual hallucination injection. Building upon this dataset, we train GeoPRM, a token-level process reward model (PRM) that provides granular faithfulness feedback. To effectively leverage these verification signals, we propose Process-Aware Tree-GRPO, a reinforcement learning algorithm that integrates tree-structured exploration with a faithfulness-weighted reward mechanism to precisely assign credit to intermediate steps. Extensive experiments demonstrate that our resulting model, GeoSolver-9B, achieves state-of-the-art performance across diverse remote sensing benchmarks. Crucially, GeoPRM unlocks robust Test-Time Scaling (TTS). Serving as a universal geospatial verifier, it seamlessly scales the performance of GeoSolver-9B and directly enhances general-purpose VLMs, highlighting its remarkable cross-model generalization.
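One common way a process reward model is used for test-time scaling is best-of-N selection: sample several candidate reasoning chains and keep the one the verifier scores highest. The sketch below illustrates that idea only. The `score_steps` callable stands in for a verifier such as GeoPRM, and both it and the min-over-steps aggregation are assumptions, not details taken from the abstract.

```python
def aggregate_step_scores(step_scores):
    """Assumed aggregation: a chain is only as trustworthy as its weakest step."""
    return min(step_scores) if step_scores else 0.0


def best_of_n(candidates, score_steps):
    """Return the candidate whose aggregated step score is highest.

    candidates:  list of reasoning chains (any objects).
    score_steps: hypothetical verifier mapping a chain to a list of
                 per-step faithfulness scores in [0, 1].
    """
    return max(candidates, key=lambda c: aggregate_step_scores(score_steps(c)))


# Toy verifier: scores are looked up from a table instead of a real model.
toy_scores = {
    "chain_a": [0.9, 0.2, 0.9],  # one poorly grounded step
    "chain_b": [0.7, 0.8, 0.7],  # uniformly decent
    "chain_c": [0.6, 0.5, 0.9],
}
selected = best_of_n(list(toy_scores), lambda c: toy_scores[c])
```

Because selection depends only on the verifier's scores, the same routine can in principle rank outputs from any generator, which matches the cross-model use the abstract claims for GeoPRM.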

Tags

Remote Sensing, Vision-Language Models, Reasoning, Reinforcement Learning

arXiv Categories

cs.CV