Multimodal Learning Relevance: 9/10

OmniSIFT: Modality-Asymmetric Token Compression for Efficient Omni-modal Large Language Models

Yue Ding, Yiyan Ji, Jungang Li, Xuyang Liu, Xinlong Chen, Junfei Wu, Bozhou Li, Bohan Zeng, Yang Shi, Yushuo Guan, Yuanxing Zhang, Jiaheng Liu, Qiang Liu, Pengfei Wan, Liang Wang
arXiv: 2602.04804v1 Published: 2026-02-04 Updated: 2026-02-04

AI Summary

OmniSIFT proposes a modality-asymmetric token compression framework that improves the efficiency of omni-modal large language models.

Main Contributions

  • Proposes OmniSIFT, a modality-asymmetric token compression framework
  • Designs a spatio-temporal video pruning module and a vision-guided audio selection module
  • Optimizes the framework end-to-end via a differentiable straight-through estimator
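The straight-through estimator in the last bullet can be sketched as follows. This is an assumed, generic formulation (a sigmoid relaxation with a hard threshold; the paper's exact relaxation is not given in this summary), with the forward and backward passes written out by hand for clarity:

```python
import numpy as np

def ste_forward(scores, threshold=0.5):
    """Forward pass: a hard binary keep/drop mask over token scores.

    `soft` is the differentiable relaxation; `hard` is the
    non-differentiable decision actually used downstream.
    """
    soft = 1.0 / (1.0 + np.exp(-scores))       # sigmoid relaxation
    hard = (soft > threshold).astype(float)    # binary keep/drop mask
    return hard, soft

def ste_backward(grad_out, soft):
    """Backward pass: the hard step has zero gradient almost everywhere,
    so the straight-through estimator substitutes the gradient of the
    soft relaxation, d(sigmoid)/d(scores) = soft * (1 - soft)."""
    return grad_out * soft * (1.0 - soft)

hard, soft = ste_forward(np.array([-2.0, 0.0, 2.0]))
grad = ste_backward(np.ones(3), soft)
print(hard)  # [0. 0. 1.]
```

This lets the selection decision stay discrete at inference time while gradients still flow to the scoring network during training.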

Methodology

OmniSIFT adopts a two-stage compression strategy: it first prunes redundant video tokens, then uses visual guidance to select audio tokens, and finally optimizes the whole framework end-to-end.
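The two-stage pipeline can be illustrated with a minimal numpy sketch. Note this shows only hard top-k selection on placeholder scores: the learned saliency scoring, the intra-/inter-frame redundancy modeling, and the differentiable straight-through selection are the paper's actual contributions and are not reproduced here.

```python
import numpy as np

def prune_tokens(tokens, scores, keep_ratio):
    """Keep the highest-scoring fraction of tokens (hard top-k stand-in
    for OmniSIFT's learned, differentiable selection)."""
    k = max(1, int(len(tokens) * keep_ratio))
    idx = np.argsort(scores)[-k:]
    return tokens[np.sort(idx)]  # preserve original token order

def vision_guided_audio_scores(audio_tokens, video_tokens):
    """Score each audio token by cosine similarity to its best-matching
    retained video token (one plausible form of 'vision guidance')."""
    a = audio_tokens / np.linalg.norm(audio_tokens, axis=1, keepdims=True)
    v = video_tokens / np.linalg.norm(video_tokens, axis=1, keepdims=True)
    return (a @ v.T).max(axis=1)

rng = np.random.default_rng(0)
video = rng.normal(size=(16, 8))   # 16 video tokens, dim 8
audio = rng.normal(size=(12, 8))   # 12 audio tokens

# Stage 1: prune video redundancy (random scores as stand-ins)
video_scores = rng.uniform(size=16)
video_kept = prune_tokens(video, video_scores, keep_ratio=0.25)

# Stage 2: select audio tokens guided by the retained video tokens
audio_scores = vision_guided_audio_scores(audio, video_kept)
audio_kept = prune_tokens(audio, audio_scores, keep_ratio=0.25)

print(video_kept.shape, audio_kept.shape)  # (4, 8) (3, 8)
```

At a 25% keep ratio, 16 video tokens reduce to 4 and 12 audio tokens to 3, matching the "25% of the original token context" setting reported in the abstract.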

Original Abstract

Omni-modal Large Language Models (Omni-LLMs) have demonstrated strong capabilities in audio-video understanding tasks. However, their reliance on long multimodal token sequences leads to substantial computational overhead. Despite this challenge, token compression methods designed for Omni-LLMs remain limited. To bridge this gap, we propose OmniSIFT (Omni-modal Spatio-temporal Informed Fine-grained Token compression), a modality-asymmetric token compression framework tailored for Omni-LLMs. Specifically, OmniSIFT adopts a two-stage compression strategy: (i) a spatio-temporal video pruning module that removes video redundancy arising from both intra-frame structure and inter-frame overlap, and (ii) a vision-guided audio selection module that filters audio tokens. The entire framework is optimized end-to-end via a differentiable straight-through estimator. Extensive experiments on five representative benchmarks demonstrate the efficacy and robustness of OmniSIFT. Notably, for Qwen2.5-Omni-7B, OmniSIFT introduces only 4.85M parameters while maintaining lower latency than training-free baselines such as OmniZip. With merely 25% of the original token context, OmniSIFT consistently outperforms all compression baselines and even surpasses the performance of the full-token model on several tasks.

Tags

Multimodal Learning · Large Language Models · Token Compression · Efficiency Optimization · Video Processing · Audio Processing

arXiv Categories

cs.CL