Multimodal Learning Relevance: 9/10

PromptHub: Enhancing Multi-Prompt Visual In-Context Learning with Locality-Aware Fusion, Concentration and Alignment

Tianci Luo, Jinpeng Wang, Shiyu Qin, Niu Lian, Yan Feng, Bin Chen, Chun Yuan, Shu-Tao Xia

arXiv: 2603.18891v1 Published: 2026-03-19 Updated: 2026-03-19

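AI Summary

PromptHub strengthens multi-prompt visual in-context learning through locality-aware fusion, concentration, and alignment, improving performance on vision tasks.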

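Main Contributions

Proposes PromptHub, a framework that strengthens multi-prompt visual in-context learning
Introduces a locality-aware fusion mechanism that exploits spatial priors
Designs concentration, alignment, and prediction objectives that mutually guide training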

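Methodology

PromptHub uses spatial priors to perform locality-aware fusion, trains with concentration, alignment, and prediction objectives that mutually guide one another, and incorporates data augmentation to improve performance.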
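
To make the contrast between patch-wise and locality-aware fusion concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the summary above does not specify the fusion operator, so the dot-product similarity, the mean over a square window, the softmax weighting, and all names (`local_score`, `fuse_prompts`, `radius`) are assumptions made for illustration only. Each prompt and the query are represented as an H x W grid of feature vectors; the weight a prompt receives at a location depends on how well it matches the query across the surrounding window, not only at that single patch.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def local_score(query, prompt, i, j, radius):
    """Mean query-prompt similarity over the window centred at (i, j).

    Hypothetical scoring rule: window positions outside the grid are skipped.
    """
    h, w = len(query), len(query[0])
    vals = []
    for di in range(-radius, radius + 1):
        for dj in range(-radius, radius + 1):
            y, x = i + di, j + dj
            if 0 <= y < h and 0 <= x < w:
                vals.append(dot(query[y][x], prompt[y][x]))
    return sum(vals) / len(vals)

def fuse_prompts(query, prompts, radius=1):
    """Fuse several prompt feature grids into one, location by location.

    radius=0 scores each patch in isolation (patch-wise weighting);
    radius>0 lets neighbouring patches influence the weights (assumed
    stand-in for a spatial prior).
    """
    h, w, d = len(query), len(query[0]), len(query[0][0])
    fused = [[[0.0] * d for _ in range(w)] for _ in range(h)]
    for i in range(h):
        for j in range(w):
            scores = [local_score(query, p, i, j, radius) for p in prompts]
            weights = softmax(scores)
            for wk, p in zip(weights, prompts):
                for c in range(d):
                    fused[i][j][c] += wk * p[i][j][c]
    return fused

# Toy usage: a 2 x 2 query grid and two candidate prompts.
query = [[[1.0, 0.0] for _ in range(2)] for _ in range(2)]
prompt_a = [[[1.0, 0.0] for _ in range(2)] for _ in range(2)]  # matches the query
prompt_b = [[[0.0, 1.0] for _ in range(2)] for _ in range(2)]  # orthogonal to it
fused = fuse_prompts(query, [prompt_a, prompt_b], radius=1)
```

In this toy setup the fused features lean toward `prompt_a`, because it agrees with the query across every window. A learned model would replace the fixed similarity and window average with trainable components, which this sketch does not attempt.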

Original Abstract

Visual In-Context Learning (VICL) aims to complete vision tasks by imitating pixel demonstrations. Recent work pioneered prompt fusion that combines the advantages of various demonstrations, which shows a promising way to extend VICL. Unfortunately, the patch-wise fusion framework and model-agnostic supervision hinder the exploitation of informative cues, thereby limiting performance gains. To overcome this deficiency, we introduce PromptHub, a framework that holistically strengthens multi-prompting through locality-aware fusion, concentration and alignment. PromptHub exploits spatial priors to capture richer contextual information, employs complementary concentration, alignment, and prediction objectives to mutually guide training, and incorporates data augmentation to further reinforce supervision. Extensive experiments on three fundamental vision tasks demonstrate the superiority of PromptHub. Moreover, we validate its universality, transferability, and robustness across out-of-distribution settings, and various retrieval scenarios. This work establishes a reliable locality-aware paradigm for prompt fusion, moving beyond prior patch-wise approaches. Code is available at https://github.com/luotc-why/ICLR26-PromptHub.

arXiv Categories

cs.CV cs.LG
