TiFRe: Text-guided Video Frame Reduction for Efficient Video Multi-modal Large Language Models
AI Summary
TiFRe reduces computational cost while improving performance on video-language tasks through text-guided frame sampling and frame matching and merging.
Key Contributions
- Proposes a Text-guided Frame Sampling (TFS) strategy that uses an LLM and CLIP to select key frames
- Proposes a Frame Matching and Merging (FMM) mechanism that integrates non-key frame information into the key frames
- Experiments show that TiFRe effectively reduces computational cost while improving performance on video-language tasks
Methodology
An LLM rewrites the user input into a CLIP-style prompt; pre-trained CLIP encoders compute the semantic similarity between the prompt and each frame to select key frames, and non-key frame information is then merged into the selected key frames.
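The two stages above can be sketched as follows. This is an illustrative sketch only, not the paper's implementation: it assumes frame and prompt embeddings have already been produced by the CLIP encoders, and the function names, top-k selection, and mean-pooling merge are hypothetical simplifications of TFS and FMM.

```python
import numpy as np

def select_key_frames(frame_embs, prompt_emb, k):
    # TFS (sketch): rank frames by cosine similarity to the
    # CLIP-style prompt embedding and keep the top-k, in temporal order.
    frames = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    prompt = prompt_emb / np.linalg.norm(prompt_emb)
    sims = frames @ prompt
    return np.sort(np.argsort(sims)[-k:])

def merge_non_key_frames(frame_embs, key_idx):
    # FMM (sketch): match each non-key frame to its most similar key
    # frame, then average each group into a single merged embedding.
    normed = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    key_set = {int(i) for i in key_idx}
    key_normed = normed[key_idx]
    groups = {int(i): [frame_embs[i]] for i in key_idx}
    for j in range(len(frame_embs)):
        if j in key_set:
            continue
        best = int(key_idx[int(np.argmax(key_normed @ normed[j]))])
        groups[best].append(frame_embs[j])
    return np.stack([np.mean(groups[int(i)], axis=0) for i in key_idx])
```

The output has one embedding per key frame, so the Video MLLM attends over k frames instead of all of them while each merged embedding still carries information from its matched non-key frames.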
Original Abstract
With the rapid development of Large Language Models (LLMs), Video Multi-Modal Large Language Models (Video MLLMs) have achieved remarkable performance in video-language tasks such as video understanding and question answering. However, Video MLLMs face high computational costs, particularly in processing numerous video frames as input, which leads to significant attention computation overhead. A straightforward approach to reduce computational costs is to decrease the number of input video frames. However, simply selecting key frames at a fixed frame rate (FPS) often overlooks valuable information in non-key frames, resulting in notable performance degradation. To address this, we propose Text-guided Video Frame Reduction (TiFRe), a framework that reduces input frames while preserving essential video information. TiFRe uses a Text-guided Frame Sampling (TFS) strategy to select key frames based on user input, which is processed by an LLM to generate a CLIP-style prompt. Pre-trained CLIP encoders calculate the semantic similarity between the prompt and each frame, selecting the most relevant frames as key frames. To preserve video semantics, TiFRe employs a Frame Matching and Merging (FMM) mechanism, which integrates non-key frame information into the selected key frames, minimizing information loss. Experiments show that TiFRe effectively reduces computational costs while improving performance on video-language tasks.