Multimodal Learning — Relevance: 9/10

VersaViT: Enhancing MLLM Vision Backbones via Task-Guided Optimization

Yikun Liu, Yuan Liu, Shangzhe Di, Haicheng Wang, Zhongyin Zhao, Le Tian, Xiao Zhou, Jie Zhou, Jiangchao Yao, Yanfeng Wang, Weidi Xie
arXiv: 2602.09934v1 — Published: 2026-02-10, Updated: 2026-02-10

AI Summary

The paper proposes VersaViT, which optimizes the vision backbone within MLLMs through multi-task collaborative training, improving its performance on vision-centric tasks.

Key Contributions

  • Identifies that the vision encoders of MLLMs are deficient in dense feature representation
  • Proposes VersaViT, a novel multi-task collaborative training framework
  • Demonstrates experimentally that VersaViT is effective across a variety of downstream tasks

Methodology

A multi-task framework is adopted for collaborative training: the vision backbone is optimized via lightweight task heads under multi-granularity supervision.
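The idea of optimizing one shared backbone through lightweight per-task heads with supervision at different granularities can be sketched as follows. This is an illustrative toy, not the paper's implementation: the linear "backbone", the segmentation/classification heads, and the loss weights are all hypothetical stand-ins for the actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
D, P, C = 8, 16, 4  # feature dim, patches per image, number of classes

# Hypothetical shared vision backbone: a single linear projection over patch tokens.
W_backbone = rng.normal(size=(D, D)) * 0.1
# Lightweight task heads (one linear layer each; names are illustrative).
W_seg = rng.normal(size=(D, C)) * 0.1  # dense head: per-patch supervision
W_cls = rng.normal(size=(D, C)) * 0.1  # global head: image-level supervision

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, labels):
    probs = softmax(logits)
    return float(-np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-9)))

def multi_task_loss(patches, patch_labels, image_label, w_dense=1.0, w_global=1.0):
    feats = patches @ W_backbone                           # shared features (P, D)
    dense_logits = feats @ W_seg                           # patch-level logits (P, C)
    global_logits = (feats.mean(axis=0) @ W_cls)[None, :]  # image-level logits (1, C)
    # Collaborative objective: weighted sum of per-task losses, so gradients
    # from both granularities flow back into the shared backbone.
    return (w_dense * cross_entropy(dense_logits, patch_labels)
            + w_global * cross_entropy(global_logits, np.array([image_label])))

patches = rng.normal(size=(P, D))
patch_labels = rng.integers(0, C, size=P)
loss = multi_task_loss(patches, patch_labels, image_label=2)
```

The joint loss is linear in the task weights, so each head's contribution can be scaled independently while all gradients update the same backbone parameters.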

Original Abstract

Multimodal Large Language Models (MLLMs) have recently achieved remarkable success in visual-language understanding, demonstrating superior high-level semantic alignment within their vision encoders. An important question thus arises: Can these encoders serve as versatile vision backbones, capable of reliably performing classic vision-centric tasks as well? To address the question, we make the following contributions: (i) we identify that the vision encoders within MLLMs exhibit deficiencies in their dense feature representations, as evidenced by their suboptimal performance on dense prediction tasks (e.g., semantic segmentation, depth estimation); (ii) we propose VersaViT, a well-rounded vision transformer that instantiates a novel multi-task framework for collaborative post-training. This framework facilitates the optimization of the vision backbone via lightweight task heads with multi-granularity supervision; (iii) extensive experiments across various downstream tasks demonstrate the effectiveness of our method, yielding a versatile vision backbone suited for both language-mediated reasoning and pixel-level understanding.

Tags

MLLM, Vision Transformer, Multi-task Learning, Dense Prediction

arXiv Category

cs.CV