Multimodal Learning Relevance: 9/10

HiMu: Hierarchical Multimodal Frame Selection for Long Video Question Answering

Dan Ben-Ami, Gabriele Serussi, Kobi Cohen, Chaim Baskin
arXiv: 2603.18558v1 Published: 2026-03-19 Updated: 2026-03-19

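AI Summary

HiMu is an efficient, training-free framework for long-video question answering that improves performance through hierarchical multimodal frame selection.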

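Main Contributions

  • Proposes HiMu, a framework for efficient long-video question answering
  • Decomposes the question into a hierarchical logic tree and uses lightweight experts to process multimodal information
  • Validates HiMu's efficiency and accuracy on Video-MME, LongVideoBench and HERBench-Lite, where it outperforms existing frame-selection methods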

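Methodology

An LLM decomposes the question into a logic tree. Each leaf predicate is routed to a lightweight expert that extracts multimodal signals, and the resulting scores are combined through fuzzy-logic operations.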
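The bottom-up composition described above can be illustrated with a minimal sketch. It assumes the common min/max/complement fuzzy operators, which this summary does not confirm as the paper's choice. The `Node` class, the `evaluate` function, the predicate names, and the scores are all hypothetical, and the expert outputs are hard-coded rather than produced by real models.

```python
from dataclasses import dataclass, field
from typing import Dict, List

Curve = List[float]  # one score in [0, 1] per sampled frame


@dataclass
class Node:
    """Hypothetical logic-tree node: a leaf predicate or a fuzzy operator."""
    op: str                                   # "LEAF", "AND", "OR" or "NOT"
    name: str = ""                            # predicate name (leaves only)
    children: List["Node"] = field(default_factory=list)


def evaluate(node: Node, leaf_scores: Dict[str, Curve]) -> Curve:
    """Compose per-frame scores bottom-up through the tree."""
    if node.op == "LEAF":
        return list(leaf_scores[node.name])
    child_curves = [evaluate(child, leaf_scores) for child in node.children]
    if node.op == "AND":                      # assumed: pointwise minimum
        return [min(vals) for vals in zip(*child_curves)]
    if node.op == "OR":                       # assumed: pointwise maximum
        return [max(vals) for vals in zip(*child_curves)]
    if node.op == "NOT":                      # assumed: complement 1 - x
        return [1.0 - v for v in child_curves[0]]
    raise ValueError(f"unknown operator: {node.op}")


# Made-up query: "a dog is visible AND (speech mentions a park OR a sign reads PARK)"
tree = Node("AND", children=[
    Node("LEAF", "dog_visible"),
    Node("OR", children=[
        Node("LEAF", "speech_mentions_park"),
        Node("LEAF", "sign_reads_park"),
    ]),
])

# Made-up expert outputs for three frames, already scaled to [0, 1].
leaf_scores = {
    "dog_visible": [0.9, 0.2, 0.8],
    "speech_mentions_park": [0.1, 0.7, 0.3],
    "sign_reads_park": [0.6, 0.1, 0.2],
}

curve = evaluate(tree, leaf_scores)
print(curve)  # [0.6, 0.2, 0.3]
```

Frame 0 scores highest because both the visual predicate and one of the two alternatives are satisfied there. Under min/max operators a conjunction is only as strong as its weakest branch.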

Original Abstract

Long-form video question answering requires reasoning over extended temporal contexts, making frame selection critical for large vision-language models (LVLMs) bound by finite context windows. Existing methods face a sharp trade-off: similarity-based selectors are fast but collapse compositional queries into a single dense vector, losing sub-event ordering and cross-modal bindings; agent-based methods recover this structure through iterative LVLM inference, but at prohibitive cost. We introduce HiMu, a training-free framework that bridges this gap. A single text-only LLM call decomposes the query into a hierarchical logic tree whose leaves are atomic predicates, each routed to a lightweight expert spanning vision (CLIP, open-vocabulary detection, OCR) and audio (ASR, CLAP). The resulting signals are normalized, temporally smoothed to align different modalities, and composed bottom-up through fuzzy-logic operators that enforce temporal sequencing and adjacency, producing a continuous satisfaction curve. Evaluations on Video-MME, LongVideoBench and HERBench-Lite show that HiMu advances the efficiency-accuracy Pareto front: at 16 frames with Qwen3-VL 8B it outperforms all competing selectors, and with GPT-4o it surpasses agentic systems operating at 32-512 frames while requiring roughly 10x fewer FLOPs.
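The signal-processing steps named in the abstract (normalization, temporal smoothing, a sequencing operator, and selection of a fixed frame budget) can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: min-max normalization, a moving-average smoother, a running-maximum "then" operator, and top-k selection are stand-ins chosen for simplicity, and every function name and number below is hypothetical.

```python
from typing import List


def min_max_normalize(scores: List[float]) -> List[float]:
    """Rescale raw expert scores to [0, 1] (assumed normalization)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def moving_average(scores: List[float], radius: int = 1) -> List[float]:
    """Smooth over a window of up to 2 * radius + 1 frames (assumed smoother)."""
    out = []
    for t in range(len(scores)):
        window = scores[max(0, t - radius): t + radius + 1]
        out.append(sum(window) / len(window))
    return out


def fuzzy_then(first: List[float], second: List[float]) -> List[float]:
    """Soft 'first occurs before second' (assumed form).

    At frame t the score is the minimum of second[t] and the best
    first-score seen strictly before t.
    """
    out, best_before = [], 0.0
    for a, b in zip(first, second):
        out.append(min(best_before, b))
        best_before = max(best_before, a)
    return out


def top_k_frames(curve: List[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring frames in temporal order."""
    ranked = sorted(range(len(curve)), key=lambda t: curve[t], reverse=True)
    return sorted(ranked[:k])


# Made-up raw scores for six frames from two hypothetical experts.
door_opens_raw = [0.0, 4.0, 2.0, 0.0, 0.0, 0.0]   # e.g. a visual detector
speech_raw = [0.0, 0.0, 0.0, 0.0, 2.0, 1.0]       # e.g. an ASR keyword match

door = moving_average(min_max_normalize(door_opens_raw))
speech = moving_average(min_max_normalize(speech_raw))

curve = fuzzy_then(door, speech)   # "door opens, then someone speaks"
selected = top_k_frames(curve, k=2)
print(selected)  # [4, 5]
```

The selected frames fall where the second event occurs after the first has already been observed, which is the ordering information a single similarity score would not capture. A real system would pass the selected frames to the LVLM. Here the budget is 2 frames rather than the 16 used in the reported experiments.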

Tags

Long Video Question Answering, Multimodal Learning, Frame Selection, Logical Reasoning

arXiv Categories

cs.CV cs.AI