Clue Matters: Leveraging Latent Visual Clues to Empower Video Reasoning
AI Summary
ClueNet strengthens video reasoning by mining latent visual clues, improving VideoQA performance and mitigating hallucination.
Main Contributions
- Proposes ClueNet, a framework that leverages visual clues for video reasoning
- Decoupled supervision that aligns clue extraction with chain-based reasoning
- An adaptive clue filter that refines high-order reasoning
Methodology
ClueNet uses two-stage supervised fine-tuning: it first extracts visual clues, then performs chain-based reasoning over them, with an adaptive clue filter refining which clues feed the reasoning stage.
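To make the filtering idea concrete, here is a minimal sketch of what a utility-aware clue filter could look like. This is an illustrative assumption, not ClueNet's actual algorithm: the function name `adaptive_clue_filter`, the scoring inputs, and the top-fraction cutoff are all hypothetical.

```python
# Hypothetical sketch of a utility-aware clue filter (not ClueNet's real API):
# rank candidate clues by a utility score and keep the top fraction, so the
# cutoff adapts to each question's score distribution instead of being fixed.

def adaptive_clue_filter(clues, scores, keep_ratio=0.5):
    """Return the highest-utility clues, keeping `keep_ratio` of them."""
    ranked = sorted(zip(scores, clues), key=lambda p: p[0], reverse=True)
    k = max(1, int(len(ranked) * keep_ratio))
    return [clue for _, clue in ranked[:k]]

# Toy usage: four candidate clues with made-up utility scores.
clues = ["person picks up cup", "wall is white", "cup falls", "shadow moves"]
scores = [0.9, 0.1, 0.8, 0.2]
print(adaptive_clue_filter(clues, scores))
# → ['person picks up cup', 'cup falls']
```

The surviving clues would then be passed to the chain-based reasoning stage, which in the paper's design is the second supervised fine-tuning phase.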
Original Abstract
Multi-modal Large Language Models (MLLMs) have significantly advanced video reasoning, yet Video Question Answering (VideoQA) remains challenging due to its demand for temporal causal reasoning and evidence-grounded answer generation. Prevailing end-to-end MLLM frameworks lack explicit structured reasoning between visual perception and answer derivation, causing severe hallucinations and poor interpretability. Existing methods also fail to address three core gaps: faithful visual clue extraction, utility-aware clue filtering, and end-to-end clue-answer alignment. Inspired by hierarchical human visual cognition, we propose ClueNet, a clue-aware video reasoning framework with a two-stage supervised fine-tuning paradigm without extensive base model modifications. Decoupled supervision aligns clue extraction and chain-based reasoning, while inference supervision with an adaptive clue filter refines high-order reasoning, alongside lightweight modules for efficient inference. Experiments on NExT-QA, STAR, and MVBench show that ClueNet outperforms state-of-the-art methods by $\ge$ 1.1%, with superior generalization, hallucination mitigation, inference efficiency, and cross-backbone compatibility. This work bridges the perception-to-generation gap in MLLM video understanding, providing an interpretable, faithful reasoning paradigm for high-stakes VideoQA applications.