GraphThinker: Reinforcing Video Reasoning with Event Graph Thinking
AI Summary
GraphThinker enhances video reasoning by constructing event graphs, and uses reinforcement learning to reduce hallucinations.
Key Contributions
- Proposes the GraphThinker model, which enhances video reasoning with event graphs
- Introduces a visual attention reward that strengthens visual grounding and reduces hallucinations
- Validates the model's effectiveness on the RexTime and VidHalluc datasets
Methodology
An MLLM is used to construct an event-level video scene graph, which is incorporated into the model's intermediate reasoning; the model is then finetuned with reinforcement learning, augmented by a visual attention reward.
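To make the two ingredients concrete, here is a minimal, hypothetical sketch of what an event-based video scene graph (EVSG) and a frame-level visual attention reward could look like. The data structures and the reward formula (fraction of attention mass landing inside the grounded event's frame span) are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    """One event node in the EVSG (hypothetical schema)."""
    event_id: str
    description: str
    start_frame: int      # inclusive
    end_frame: int        # inclusive
    objects: list         # object names appearing in this event

@dataclass
class EVSG:
    """Event-based video scene graph: events plus intra-/inter-event relations."""
    events: list = field(default_factory=list)
    intra: list = field(default_factory=list)   # (event_id, subject, relation, object)
    inter: list = field(default_factory=list)   # (event_a, relation, event_b), e.g. causal links

def visual_attention_reward(attn, event, num_frames):
    """Toy visual attention reward: share of the model's per-frame attention
    mass that falls inside the event's annotated frame span."""
    total = sum(attn)
    if total == 0:
        return 0.0
    span = range(event.start_frame, min(event.end_frame + 1, num_frames))
    inside = sum(attn[f] for f in span)
    return inside / total

# Example: uniform attention over 10 frames, event grounded in frames 2..5.
g = EVSG()
e = Event("e1", "person opens the door", 2, 5, ["person", "door"])
g.events.append(e)
g.inter.append(("e1", "causes", "e2"))  # cross-event causal edge (illustrative)
reward = visual_attention_reward([1.0] * 10, e, num_frames=10)
print(reward)  # 4 of 10 frames inside the span -> 0.4
```

In an actual RL finetuning loop, a reward of this shape would be added to the task reward so that rollouts attending to the correct video segment are reinforced, which is the intuition behind using it to mitigate hallucinations.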
Original Abstract
Video reasoning requires understanding the causal relationships between events in a video. However, such relationships are often implicit and costly to annotate manually. While existing multimodal large language models (MLLMs) often infer event relations through dense captions or video summaries for video reasoning, such modeling still lacks causal understanding. Without explicit causal structure modeling within and across video events, these models suffer from hallucinations during video reasoning. In this work, we propose GraphThinker, a reinforcement finetuning-based method that constructs structural event-level scene graphs and enhances visual grounding to jointly reduce hallucinations in video reasoning. Specifically, we first employ an MLLM to construct an event-based video scene graph (EVSG) that explicitly models both intra- and inter-event relations, and incorporate these formed scene graphs into the MLLM as an intermediate thinking process. We also introduce a visual attention reward during reinforcement finetuning, which strengthens video grounding and further mitigates hallucinations. We evaluate GraphThinker on two datasets, RexTime and VidHalluc, where it shows superior ability to capture object and event relations with more precise event localization, reducing hallucinations in video reasoning compared to prior methods.