Multimodal Learning Relevance: 9/10

WISER: Wider Search, Deeper Thinking, and Adaptive Fusion for Training-Free Zero-Shot Composed Image Retrieval

Tianyue Wang, Leigang Qu, Tianyu Yang, Xiangzhao Hao, Yifan Xu, Haiyun Guo, Jinqiao Wang
arXiv: 2602.23029v1 Published: 2026-02-26 Updated: 2026-02-26

AI Summary

WISER achieves training-free zero-shot composed image retrieval through a retrieve-verify-refine pipeline that combines image and text retrieval.

Key Contributions

  • Proposes the WISER framework, which fuses T2I and I2I retrieval while explicitly modeling query intent and uncertainty.
  • Designs an adaptive fusion module that, based on retrieval confidence, either refines or fuses the dual-path retrieval results.
  • Uses structured self-reflection to generate refinement suggestions that guide the next retrieval round.

Methodology

Widens the candidate pool via parallel T2I and I2I retrieval, performs adaptive fusion with a verifier to assess retrieval confidence, and applies structured self-reflection to refine uncertain retrievals before the next round.
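The pipeline sketched above can be expressed as a simple loop. This is a minimal illustrative sketch, not WISER's actual implementation: the scoring, verification, and reflection functions are hypothetical stand-ins (a real system would use MLLM-edited captions/images and CLIP-style encoders), and the threshold and round count are assumed values.

```python
def retrieve_verify_refine(query, gallery, t2i_score, i2i_score,
                           verify, reflect, max_rounds=3, threshold=0.7):
    """Toy "retrieve-verify-refine" loop for composed image retrieval.

    query      : the multimodal query (reference image + modification text)
    gallery    : candidate images to rank
    t2i_score  : scores a candidate against the edited caption (T2I path)
    i2i_score  : scores a candidate against the edited image (I2I path)
    verify     : returns a confidence for the top-ranked candidate
    reflect    : produces a refined query from the current top candidates
    """
    for _ in range(max_rounds):
        # Wider Search: score every candidate along both retrieval paths.
        t2i = {c: t2i_score(query, c) for c in gallery}
        i2i = {c: i2i_score(query, c) for c in gallery}

        # Adaptive Fusion: combine the dual-path scores into one ranking.
        fused = sorted(gallery, key=lambda c: t2i[c] + i2i[c], reverse=True)

        # Verify: accept the ranking if the verifier is confident enough.
        if verify(query, fused[0]) >= threshold:
            return fused

        # Deeper Thinking: refine the query via structured self-reflection
        # over the current top candidates, then retry retrieval.
        query = reflect(query, fused[:5])
    return fused  # fall back to the last fused ranking
```

A usage example with toy string "images" and keyword-overlap scoring shows the control flow; in practice both paths would return similarity scores from learned embeddings.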

Original Abstract

Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve target images given a multimodal query (comprising a reference image and a modification text), without training on annotated triplets. Existing methods typically convert the multimodal query into a single modality: either as an edited caption for Text-to-Image retrieval (T2I) or as an edited image for Image-to-Image retrieval (I2I). However, each paradigm has inherent limitations: T2I often loses fine-grained visual details, while I2I struggles with complex semantic modifications. To effectively leverage their complementary strengths under diverse query intents, we propose WISER, a training-free framework that unifies T2I and I2I via a "retrieve-verify-refine" pipeline, explicitly modeling intent awareness and uncertainty awareness. Specifically, WISER first performs Wider Search by generating both edited captions and images for parallel retrieval to broaden the candidate pool. Then, it conducts Adaptive Fusion with a verifier to assess retrieval confidence, triggering refinement for uncertain retrievals, and dynamically fusing the dual-path results for reliable ones. For uncertain retrievals, WISER generates refinement suggestions through structured self-reflection to guide the next retrieval round toward Deeper Thinking. Extensive experiments demonstrate that WISER significantly outperforms previous methods across multiple benchmarks, achieving relative improvements of 45% on CIRCO (mAP@5) and 57% on CIRR (Recall@1) over existing training-free methods. Notably, it even surpasses many training-dependent methods, highlighting its superiority and generalization under diverse scenarios. Code will be released at https://github.com/Physicsmile/WISER.

Tags

Zero-Shot Learning · Composed Image Retrieval · Multimodal Learning · Image-Text Retrieval

arXiv Categories

cs.CV