Multimodal Learning — Relevance: 9/10

Web-Scale Multimodal Summarization using CLIP-Based Semantic Alignment

Mounvik K, N Harshit
arXiv: 2602.14889v1 Published: 2026-02-16 Updated: 2026-02-16

AI Summary

Proposes a web-scale multimodal summarization framework built on CLIP-based semantic alignment.

Key Contributions

  • A web-scale multimodal summarization framework
  • CLIP-based semantic alignment for ranking retrieved images
  • A configurable Gradio API

Methodology

Uses a CLIP model to measure semantic alignment between images and the topic/text, combines the retrieved text and image data to generate summaries, and exposes the pipeline through a configurable API.
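The ranking step described above reduces to cosine similarity between CLIP embeddings. A minimal sketch, assuming the topic and candidate images have already been encoded into a shared embedding space by a CLIP text/image encoder (the embeddings below are placeholders, not outputs of the paper's fine-tuned model):

```python
import numpy as np

def rank_by_alignment(topic_emb: np.ndarray, image_embs: np.ndarray, top_k: int = 3):
    """Rank candidate image embeddings by cosine similarity to a topic embedding.

    topic_emb:  shape (d,)   -- CLIP text embedding of the user topic
    image_embs: shape (n, d) -- CLIP image embeddings of retrieved candidates
    Returns (indices, scores) for the top_k best-aligned images.
    """
    # L2-normalize so the dot product equals cosine similarity
    t = topic_emb / np.linalg.norm(topic_emb)
    imgs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    scores = imgs @ t
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]

# Toy 2-D example: image 0 is perfectly aligned, image 1 orthogonal
topic = np.array([1.0, 0.0])
images = np.array([[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
idx, sc = rank_by_alignment(topic, images, top_k=2)
```

In the actual pipeline these scores would also feed the semantic filtering step (dropping candidates below a similarity threshold) rather than only the top-k selection.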

Original Abstract

We introduce Web-Scale Multimodal Summarization, a lightweight framework for generating summaries by combining retrieved text and image data from web sources. Given a user-defined topic, the system performs parallel web, news, and image searches. Retrieved images are ranked using a fine-tuned CLIP model to measure semantic alignment with topic and text. Optional BLIP captioning enables image-only summaries for stronger multimodal coherence. The pipeline supports features such as adjustable fetch limits, semantic filtering, summary styling, and downloading structured outputs. We expose the system via a Gradio-based API with controllable parameters and preconfigured presets. Evaluation on 500 image-caption pairs with 20:1 contrastive negatives yields a ROC-AUC of 0.9270, an F1-score of 0.6504, and an accuracy of 96.99%, demonstrating strong multimodal alignment. This work provides a configurable, deployable tool for web-scale summarization that integrates language, retrieval, and vision models in a user-extensible pipeline.
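The abstract's evaluation setup (binary alignment judgments over matched pairs plus contrastive negatives) uses standard metrics. A self-contained sketch of how ROC-AUC (via its Mann-Whitney pairwise form) and F1 are computed on such labels and scores; the toy data below is illustrative, not the paper's 500-pair benchmark:

```python
import numpy as np

def roc_auc(labels: np.ndarray, scores: np.ndarray) -> float:
    """ROC-AUC as the probability that a random positive outscores
    a random negative (ties count as half)."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]  # all positive-vs-negative pairs
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

def f1_score(labels: np.ndarray, preds: np.ndarray) -> float:
    """Harmonic mean of precision and recall for binary predictions."""
    tp = np.sum((preds == 1) & (labels == 1))
    fp = np.sum((preds == 1) & (labels == 0))
    fn = np.sum((preds == 0) & (labels == 1))
    return float(2 * tp / (2 * tp + fp + fn))

# Toy example: 2 positives, 4 negatives (a miniature contrastive split)
labels = np.array([1, 1, 0, 0, 0, 0])
scores = np.array([0.9, 0.4, 0.6, 0.3, 0.2, 0.1])
preds = (scores >= 0.5).astype(int)  # threshold at 0.5
```

With a heavily imbalanced 20:1 negative ratio as in the paper, accuracy can stay high while F1 remains modest, which is consistent with the reported 96.99% accuracy versus 0.6504 F1.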

Tags

Multimodal Summarization CLIP Web Search Image Retrieval

arXiv Categories

cs.LG cs.CV cs.ET cs.HC cs.NE