Multimodal Learning Relevance: 9/10

Phi-4-reasoning-vision-15B Technical Report

Jyoti Aneja, Michael Harrison, Neel Joshi, Tyler LaBonte, John Langford, Eduardo Salinas
arXiv: 2603.03975v1 Published: 2026-03-04 Updated: 2026-03-04

AI Summary

Phi-4-reasoning-vision-15B is a compact open-weight multimodal reasoning model whose development emphasizes data quality and architecture design.

Key Contributions

  • Built a small, efficient multimodal reasoning model
  • Demonstrated that data quality is the primary lever for model performance
  • Showed the effectiveness of high-resolution, dynamic-resolution vision encoders

Methodology

Through careful architecture selection, data filtering, error correction, and synthetic data augmentation, the authors trained a single model on a hybrid mix of reasoning and non-reasoning data, using explicit mode tokens to switch between fast direct answers and chain-of-thought reasoning.
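The mode-token idea can be illustrated with a minimal sketch. Note that the token names `<reason>` and `<direct>` and the function `format_example` are hypothetical, chosen here only to show how an explicit mode prefix lets one model serve both behaviors; the report does not specify the actual template.

```python
from typing import Optional

# Hypothetical mode tokens; the report only says such tokens exist,
# not what they are called or how the template is laid out.
REASON_TOKEN = "<reason>"
DIRECT_TOKEN = "<direct>"

def format_example(question: str, answer: str,
                   chain_of_thought: Optional[str] = None) -> str:
    """Prefix each training sample with an explicit mode token so a
    single model learns both fast direct answers and CoT reasoning."""
    if chain_of_thought:
        # Reasoning mode: the chain of thought precedes the final answer.
        return f"{REASON_TOKEN} {question}\n{chain_of_thought}\nAnswer: {answer}"
    # Non-reasoning mode: answer immediately, no intermediate steps.
    return f"{DIRECT_TOKEN} {question}\nAnswer: {answer}"

# A simple question trained in direct mode, a harder one in reasoning mode.
print(format_example("What is 2 + 2?", "4"))
print(format_example("Integrate x^2 dx.", "x^3/3 + C",
                     chain_of_thought="Apply the power rule for integration."))
```

At inference time the caller (or a lightweight router) chooses the mode token, trading latency for reasoning depth on a per-query basis.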

Original Abstract

We present Phi-4-reasoning-vision-15B, a compact open-weight multimodal reasoning model, and share the motivations, design choices, experiments, and learnings that informed its development. Our goal is to contribute practical insight to the research community on building smaller, efficient multimodal reasoning models and to share the result of these learnings as an open-weight model that is good at common vision and language tasks and excels at scientific and mathematical reasoning and understanding user interfaces. Our contributions include demonstrating that careful architecture choices and rigorous data curation enable smaller, open-weight multimodal models to achieve competitive performance with significantly less training and inference-time compute and tokens. The most substantial improvements come from systematic filtering, error correction, and synthetic augmentation -- reinforcing that data quality remains the primary lever for model performance. Systematic ablations show that high-resolution, dynamic-resolution encoders yield consistent improvements, as accurate perception is a prerequisite for high-quality reasoning. Finally, a hybrid mix of reasoning and non-reasoning data with explicit mode tokens allows a single model to deliver fast direct answers for simpler tasks and chain-of-thought reasoning for complex problems.

Tags

Multimodal Reasoning Vision Language Data-Quality

arXiv Categories

cs.AI cs.CV