FocusGraph: Graph-Structured Frame Selection for Embodied Long Video Question Answering
AI Summary
FocusGraph proposes a graph-structured frame-selection framework for long video question answering that improves both inference efficiency and answer quality.
Main Contributions
- Proposes a graph-based Scene-Caption LLM Selector that identifies query-relevant clips from compact textual scene captions
- Designs a training-free Patch-wise Sparse-Flow Retention (PSFR) method to select keyframes from the chosen clips
- Achieves state-of-the-art results on egocentric long-video question answering benchmarks while significantly reducing inference time
Methodology
A graph-based Scene-Caption LLM Selector first retrieves query-relevant clips from their graph-based captions; the training-free PSFR method then extracts keyframes from those clips, which are fed into an MLLM to produce the final answer.
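The two-stage pipeline can be sketched in code. This is a heavily simplified illustration, not the paper's method: the real selector is a lightweight trainable LLM over graph-based captions (here replaced by keyword overlap), and PSFR's actual patch-flow computation is not specified in this summary (here approximated by patch-pooled frame differences with a hypothetical sparsity score). All function names are invented for illustration.

```python
import numpy as np

def clip_relevance_scores(clip_captions, query_keywords):
    # Stand-in for the Scene-Caption LLM Selector: score each clip's
    # textual caption by keyword overlap with the query. The actual
    # method uses a trainable LLM over graph-based scene captions.
    kw = {w.lower() for w in query_keywords}
    return [len(set(c.lower().split()) & kw) for c in clip_captions]

def patch_motion(frame_a, frame_b, patch=4):
    # Sum absolute pixel differences within non-overlapping patches,
    # a crude proxy for patch-wise flow magnitude.
    diff = np.abs(frame_a.astype(float) - frame_b.astype(float))
    h, w = diff.shape
    return diff.reshape(h // patch, patch, w // patch, patch).sum(axis=(1, 3))

def psfr_keyframes(frames, top_k=2, patch=4):
    # Illustrative stand-in for PSFR: score each frame (after the first)
    # by total motion weighted by how concentrated (sparse) that motion
    # is across patches, then keep the top-k scoring frames.
    scores = []
    for prev, cur in zip(frames, frames[1:]):
        m = patch_motion(prev, cur, patch)
        total = m.sum()
        sparsity = (m.max() / total) if total > 0 else 0.0
        scores.append(total * sparsity)
    order = np.argsort(scores)[::-1][:top_k]
    return sorted(i + 1 for i in order)  # indices of retained keyframes
```

In a full pipeline, the top-scoring clips from `clip_relevance_scores` would be passed to `psfr_keyframes`, and the resulting keyframes forwarded to the MLLM together with the question.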
Original Abstract
The ability to understand long videos is vital for embodied intelligent agents, because their effectiveness depends on how well they can accumulate, organize, and leverage long-horizon perceptual memories. Recently, multimodal LLMs have been gaining popularity for solving the long video understanding task due to their general ability to understand natural language and to leverage world knowledge. However, as the number of frames provided to an MLLM increases, the quality of its responses tends to degrade, and inference time grows. Therefore, when using MLLMs for long video understanding, a crucial step is selecting key frames from the video to answer user queries. In this work, we develop FocusGraph, a framework for keyframe selection for question answering over long egocentric videos. It leverages a lightweight trainable Scene-Caption LLM Selector that selects query-relevant clips based on their graph-based captions, and a training-free method for selecting keyframes from these clips. Unlike existing methods, the proposed Scene-Caption LLM Selector does not rely on the original sequence of low-resolution frames; instead, it operates on a compact textual representation of the scene. We then design a training-free Patch-wise Sparse-Flow Retention (PSFR) method to select keyframes from the resulting sequence of clips, which are fed into an MLLM to produce the final answer. Together, these components enable FocusGraph to achieve state-of-the-art results on challenging egocentric long-video question answering benchmarks, including FindingDory and HourVideo, while significantly reducing inference time relative to baseline approaches.