Multimodal Learning Relevance: 8/10

Multi-Vector Index Compression in Any Modality

Hanxiang Qin, Alexander Martin, Rohan Jha, Chunsheng Zuo, Reno Kriz, Benjamin Van Durme

arXiv: 2602.21202v1 Published: 2026-02-24 Updated: 2026-02-24

AI Summary

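For multimodal late-interaction retrieval, the paper proposes an index compression method based on attention-guided clustering that improves retrieval efficiency.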

Key Contributions

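Proposes attention-guided clustering (AGC) to compress multi-vector document representations
Shows that AGC outperforms other compression methods such as sequence resizing and memory tokens
Validates AGC on text, visual-document, and video retrieval tasks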

Methodology

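Introduces four index compression methods. The focus is attention-guided clustering, which uses attention to identify the semantically salient regions of a document as cluster centroids and to weight token aggregation.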
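
The clustering idea can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a per-token saliency score is already available (for example, attention mass from the encoder), picks the top-scoring tokens as centroids, assigns the remaining tokens by cosine similarity, and takes a saliency-weighted mean per cluster. The function name `attention_guided_clustering`, the `saliency` input, and the hard-assignment rule are all hypothetical choices made for illustration.

```python
import math


def attention_guided_clustering(tokens, saliency, budget):
    """Compress token vectors down to `budget` vectors (illustrative sketch).

    tokens:   list of equal-length float lists, one per document token
    saliency: one non-negative score per token (assumed to come from attention)
    budget:   number of vectors to keep, i.e. the constant vector budget
    """
    if not 1 <= budget <= len(tokens):
        raise ValueError("budget must be between 1 and the number of tokens")

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    # 1. The `budget` most salient tokens become the cluster centroids.
    order = sorted(range(len(tokens)), key=lambda i: saliency[i], reverse=True)
    centroid_ids = order[:budget]

    # 2. Assign every other token to its most similar centroid.
    clusters = {c: [c] for c in centroid_ids}
    for i, tok in enumerate(tokens):
        if i in clusters:
            continue  # centroids already belong to their own cluster
        best = max(centroid_ids, key=lambda c: cosine(tok, tokens[c]))
        clusters[best].append(i)

    # 3. Aggregate each cluster as a saliency-weighted mean of its members.
    compressed = []
    for c in centroid_ids:
        members = clusters[c]
        total = sum(saliency[i] for i in members)
        weights = [saliency[i] / total if total else 1.0 / len(members)
                   for i in members]
        dim = len(tokens[c])
        compressed.append([
            sum(w * tokens[i][d] for w, i in zip(weights, members))
            for d in range(dim)
        ])
    return compressed


# Four 2-d tokens compressed to a budget of two vectors.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
saliency = [0.4, 0.1, 0.4, 0.1]
compressed = attention_guided_clustering(tokens, saliency, budget=2)
```

Because the output size is set by `budget` rather than by document length, every document occupies the same number of index slots regardless of how many tokens it started with.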

Original Abstract

We study efficient multi-vector retrieval for late interaction in any modality. Late interaction has emerged as a dominant paradigm for information retrieval in text, images, visual documents, and videos, but its computation and storage costs grow linearly with document length, making it costly for image-, video-, and audio-rich corpora. To address this limitation, we explore query-agnostic methods for compressing multi-vector document representations under a constant vector budget. We introduce four approaches for index compression: sequence resizing, memory tokens, hierarchical pooling, and a novel attention-guided clustering (AGC). AGC uses an attention-guided mechanism to identify the most semantically salient regions of a document as cluster centroids and to weight token aggregation. Evaluating these methods on retrieval tasks spanning text (BEIR), visual-document (ViDoRe), and video (MSR-VTT, MultiVENT 2.0), we show that attention-guided clustering consistently outperforms other parameterized compression methods (sequence resizing and memory tokens), provides greater flexibility in index size than non-parametric hierarchical clustering, and achieves competitive or improved performance compared to a full, uncompressed index. The source code is available at: github.com/hanxiangqin/omni-col-press.
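
To make the cost argument in the abstract concrete, here is a small sketch of late-interaction scoring. It assumes the ColBERT-style MaxSim operator (sum over query vectors of the best dot product against any document vector), which the abstract does not spell out; `maxsim_score` and `num_comparisons` are hypothetical helper names.

```python
def maxsim_score(query_vecs, doc_vecs):
    """Sum, over query vectors, of the maximum dot product with any doc vector."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def num_comparisons(query_vecs, doc_vecs):
    """Dot products needed for one query-document pair: |Q| * |D|."""
    return len(query_vecs) * len(doc_vecs)


query = [[1.0, 0.0], [0.0, 1.0]]
document = [[1.0, 0.0], [0.5, 0.5]]
score = maxsim_score(query, document)
```

Both storage and the comparison count scale with the number of document vectors, which is why capping that number at a constant budget bounds the per-document cost.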

arXiv Categories

cs.IR cs.CL cs.CV
