Multimodal Learning | Relevance: 9/10

VisionPangu: A Compact and Fine-Grained Multimodal Assistant with 1.7B Parameters

Jiaxin Fan, Wenpo Song
arXiv: 2603.04957v1 | Published: 2026-03-05 | Updated: 2026-03-05

AI Summary

VisionPangu is a 1.7B-parameter multimodal model that improves detailed image captioning through high-quality supervision.

Key Contributions

  • Proposes VisionPangu, a compact multimodal model
  • Uses the DOCCI dataset to improve semantic coherence and descriptive richness
  • Demonstrates that compact models are competitive at detailed image captioning

Methodology

Combines an InternVL vision encoder with the OpenPangu language backbone via an MLP projector, and adopts a LLaVA-style instruction-tuning pipeline.
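To make the wiring concrete, below is a minimal PyTorch sketch of this LLaVA-style design: a small MLP projects frozen vision-encoder features into the language model's embedding space, and the projected image tokens are prepended to the text embeddings. The dimensions, the two-layer GELU projector, and all variable names are illustrative assumptions, not the released VisionPangu code.

```python
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """Two-layer MLP mapping vision-encoder features into the LM embedding space."""
    def __init__(self, vision_dim: int, lm_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_patches, vision_dim)
        return self.net(vision_feats)

# Toy dimensions; the digest does not state the actual encoder/backbone widths.
vision_dim, lm_dim = 1024, 2048
projector = MLPProjector(vision_dim, lm_dim)

batch, num_patches, text_len = 2, 256, 32
vision_feats = torch.randn(batch, num_patches, vision_dim)  # stand-in for frozen encoder output
text_embeds = torch.randn(batch, text_len, lm_dim)          # stand-in for LM token embeddings

# LLaVA-style fusion: projected image tokens are prepended to the text
# embeddings, and the combined sequence is fed to the language model.
image_tokens = projector(vision_feats)
inputs_embeds = torch.cat([image_tokens, text_embeds], dim=1)
print(inputs_embeds.shape)  # torch.Size([2, 288, 2048])
```

Training only this projector first and then unfreezing the language model for instruction tuning is the usual LLaVA recipe; the digest does not specify which stages VisionPangu adopts.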

Original Abstract

Large Multimodal Models (LMMs) have achieved strong performance in vision-language understanding, yet many existing approaches rely on large-scale architectures and coarse supervision, which limits their ability to generate detailed image captions. In this work, we present VisionPangu, a compact 1.7B-parameter multimodal model designed to improve detailed image captioning through efficient multimodal alignment and high-quality supervision. Our model combines an InternVL-derived vision encoder with the OpenPangu-Embedded language backbone via a lightweight MLP projector and adopts an instruction-tuning pipeline inspired by LLaVA. By incorporating dense human-authored descriptions from the DOCCI dataset, VisionPangu improves semantic coherence and descriptive richness without relying on aggressive model scaling. Experimental results demonstrate that compact multimodal models can achieve competitive performance while producing more structured and detailed captions. The code and model weights will be publicly available at https://www.modelscope.cn/models/asdfgh007/visionpangu.
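Since the abstract attributes the gains partly to dense human-authored descriptions from DOCCI, here is a hedged sketch of one common way to wrap such (image, description) pairs as LLaVA-style conversation records for instruction tuning. The JSON field names and the prompt wording are assumptions for illustration, not VisionPangu's actual preprocessing.

```python
import json

# Hypothetical prompt; the paper's actual instruction templates are not given.
PROMPT = "Describe this image in detail."

def to_instruction_sample(image_path: str, description: str) -> dict:
    """Wrap one (image, dense caption) pair in a LLaVA-style conversation record."""
    return {
        "image": image_path,
        "conversations": [
            {"from": "human", "value": f"<image>\n{PROMPT}"},
            {"from": "gpt", "value": description},
        ],
    }

if __name__ == "__main__":
    sample = to_instruction_sample(
        "images/0001.jpg",
        "A red bicycle leans against a weathered brick wall...",
    )
    print(json.dumps(sample, indent=2))
```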

Tags

Multimodal Learning · Image Captioning · Instruction Tuning · Compact Models

arXiv Categories

cs.CV cs.CL