Multimodal Learning Relevance: 9/10

When LLaVA Meets Objects: Token Composition for Vision-Language-Models

Soumya Jahagirdar, Walid Bousselham, Anna Kukleva, Hilde Kuehne
arXiv: 2602.04864v1 Published: 2026-02-04 Updated: 2026-02-04

AI Summary

Mask-LLaVA combines multi-level visual features to enable efficient inference for vision-language models, reducing compute requirements.

Key Contributions

  • Proposes the Mask-LLaVA framework, which leverages multi-level visual features for an efficient visual representation
  • Dynamically adjusts the number of tokens at test time, maintaining performance without retraining
  • Achieves results competitive with existing token-efficient methods on standard benchmarks

Methodology

Combines mask-based object representations with global tokens and local patch tokens to train VLMs, then dynamically selects the number of tokens at inference time.
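The token composition described above can be sketched as follows. The function name, token counts, and truncation-based selection are illustrative assumptions for this digest; the paper's actual interface and selection strategy may differ.

```python
import numpy as np

def compose_visual_tokens(global_tokens, patch_tokens, object_tokens,
                          max_object_tokens=None):
    """Concatenate multi-level visual tokens into one sequence.

    At test time the number of mask-based object tokens can be reduced
    (hypothetical interface: here we simply truncate) without retraining,
    so the same model accepts a shorter visual prefix at inference.
    """
    if max_object_tokens is not None:
        object_tokens = object_tokens[:max_object_tokens]
    return np.concatenate([global_tokens, patch_tokens, object_tokens], axis=0)

# Toy shapes (num_tokens, hidden_dim); the counts are made up for illustration.
d = 8
g = np.random.randn(1, d)    # global token(s)
p = np.random.randn(16, d)   # local patch tokens
o = np.random.randn(32, d)   # mask-based object tokens

full = compose_visual_tokens(g, p, o)                           # training: all tokens
compact = compose_visual_tokens(g, p, o, max_object_tokens=4)   # inference: fewer tokens
print(full.shape, compact.shape)  # (49, 8) (21, 8)
```

The point of the sketch is that only the object-token budget changes between training and inference; the global and patch tokens stay fixed, so the visual prefix shrinks without any change to model weights.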

Original Abstract

Current autoregressive Vision Language Models (VLMs) usually rely on a large number of visual tokens to represent images, resulting in a need for more compute especially at inference time. To address this problem, we propose Mask-LLaVA, a framework that leverages different levels of visual features to create a compact yet information-rich visual representation for autoregressive VLMs. Namely, we combine mask-based object representations together with global tokens and local patch tokens. While all tokens are used during training, we show that the resulting model can flexibly drop tokens, especially mask-based object tokens, at test time, allowing the number of tokens to be adapted during inference without retraining the model and without a significant drop in performance. We evaluate the proposed approach on a suite of standard benchmarks, showing results competitive with current token-efficient methods and comparable to the original LLaVA baseline using only a fraction of the visual tokens. Our analysis demonstrates that combining multi-level features enables efficient learning with fewer tokens while allowing dynamic token selection at test time for good performance.

Tags

Vision-Language Model · Token Composition · Efficient Inference · Masked Representation

arXiv Category

cs.CV