LLandMark: A Multi-Agent Framework for Landmark-Aware Multimodal Interactive Video Retrieval
AI Summary
LLandMark is a multi-agent framework for landmark-aware multimodal interactive video retrieval that improves retrieval quality for Vietnamese scenes.
Key Contributions
- Proposes LLandMark, a multi-agent framework for multimodal video retrieval
- Introduces a Landmark Knowledge Agent that enhances CLIP-based semantic matching
- Leverages an LLM-assisted image-to-image pipeline to automatically detect landmarks and generate retrieval queries
Methodology
Builds a multi-agent framework with four stages: query parsing and planning, landmark reasoning, multimodal retrieval, and reranked answer synthesis, implemented by combining CLIP and LLM techniques.
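The four-stage flow above can be sketched as a minimal pipeline. All agent names, the toy landmark table, and the keyword-overlap scoring are illustrative stand-ins: the paper's actual system uses LLM agents and CLIP-based semantic matching, not the stubs shown here.

```python
# Hypothetical sketch of LLandMark's four-stage agent pipeline; the
# function names and interfaces are illustrative, not the paper's API.

def parse_and_plan(query: str) -> dict:
    # Stage 1: query parsing and planning (stubbed as keyword extraction).
    return {"query": query, "keywords": query.lower().split()}

def landmark_reasoning(plan: dict) -> dict:
    # Stage 2: Landmark Knowledge Agent -- detect a landmark mention and
    # reformulate it into a descriptive visual prompt (toy lookup table).
    landmarks = {"ben thanh": "a colonial market hall with a clock tower"}
    for name, desc in landmarks.items():
        if name in plan["query"].lower():
            plan["visual_prompt"] = desc
    return plan

def multimodal_retrieval(plan: dict, corpus: dict) -> list:
    # Stage 3: retrieval, stubbed as keyword overlap; the paper uses
    # CLIP-based semantic matching here.
    prompt = plan.get("visual_prompt", plan["query"]).lower().split()
    scored = [(sum(w in text.lower() for w in prompt), vid)
              for vid, text in corpus.items()]
    return [vid for score, vid in sorted(scored, reverse=True) if score > 0]

def rerank_and_answer(candidates: list) -> list:
    # Stage 4: reranked answer synthesis (identity rerank in this sketch).
    return candidates

corpus = {
    "clip_01": "a clock tower above a colonial market hall at dusk",
    "clip_02": "motorbikes crossing a busy intersection",
}
plan = landmark_reasoning(parse_and_plan("videos of Ben Thanh market"))
results = rerank_and_answer(multimodal_retrieval(plan, corpus))
print(results)  # the landmark-aware prompt surfaces clip_01 first
```

The point of the sketch is the data flow: the landmark agent rewrites the query into a descriptive visual prompt before retrieval, so matching happens against scene descriptions rather than the raw landmark name.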
Original Abstract
The increasing diversity and scale of video data demand retrieval systems capable of multimodal understanding, adaptive reasoning, and domain-specific knowledge integration. This paper presents LLandMark, a modular multi-agent framework for landmark-aware multimodal video retrieval designed to handle complex real-world queries. The framework features specialized agents that collaborate across four stages: query parsing and planning, landmark reasoning, multimodal retrieval, and reranked answer synthesis. A key component, the Landmark Knowledge Agent, detects cultural or spatial landmarks and reformulates them into descriptive visual prompts, enhancing CLIP-based semantic matching for Vietnamese scenes. To expand capabilities, we introduce an LLM-assisted image-to-image pipeline, in which a large language model (Gemini 2.5 Flash) autonomously detects landmarks, generates image search queries, retrieves representative images, and performs CLIP-based visual similarity matching, removing the need for manual image input. In addition, an OCR refinement module leveraging Gemini and LlamaIndex improves Vietnamese text recognition. Experimental results show that LLandMark achieves adaptive, culturally grounded, and explainable retrieval performance.
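The final step of the image-to-image pipeline, CLIP-based visual similarity matching, amounts to ranking candidate video frames by cosine similarity between image embeddings. The sketch below uses short mock vectors in place of real CLIP embeddings (a real system would use the output of a CLIP image encoder, typically 512-dimensional); the frame names and values are invented for illustration.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Mock embedding of the landmark reference image retrieved by the LLM.
reference = [0.9, 0.1, 0.3, 0.2]

# Mock embeddings of candidate video frames.
frames = {
    "frame_17": [0.8, 0.2, 0.4, 0.1],  # visually similar frame
    "frame_42": [0.1, 0.9, 0.0, 0.7],  # unrelated frame
}

best = max(frames, key=lambda f: cosine(reference, frames[f]))
print(best)  # frame_17 scores highest against the reference image
```

Because the reference image is fetched automatically from the LLM-generated search query, this matching step runs without any manually supplied query image.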