MITRA: An AI Assistant for Knowledge Retrieval in Physics Collaborations
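AI Summary
MITRA is a RAG-based AI assistant designed for knowledge retrieval in large physics experiment collaborations, with an emphasis on privacy and performance.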
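Key Contributions
- Builds MITRA, a RAG-based knowledge retrieval system
- Develops an automated pipeline for document retrieval and text extraction
- Proposes a two-tiered vector database architecture
- Deploys the system on-premise to keep collaboration data private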
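Methodology
MITRA uses Selenium for document retrieval, OCR with layout parsing for text extraction, and a vector database together with an LLM to build the RAG system, all deployed on-premise.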
Original Abstract
Large-scale scientific collaborations, such as the Compact Muon Solenoid (CMS) at CERN, produce a vast and ever-growing corpus of internal documentation. Navigating this complex information landscape presents a significant challenge for both new and experienced researchers, hindering knowledge sharing and slowing down the pace of scientific discovery. To address this, we present a prototype of MITRA, a Retrieval-Augmented Generation (RAG) based system, designed to answer specific, context-aware questions about physics analyses. MITRA employs a novel, automated pipeline using Selenium for document retrieval from internal databases and Optical Character Recognition (OCR) with layout parsing for high-fidelity text extraction. Crucially, MITRA's entire framework, from the embedding model to the Large Language Model (LLM), is hosted on-premise, ensuring that sensitive collaboration data remains private. We introduce a two-tiered vector database architecture that first identifies the relevant analysis from abstracts before focusing on the full documentation, resolving potential ambiguities between different analyses. We demonstrate the prototype's superior retrieval performance against a standard keyword-based baseline on realistic queries and discuss future work towards developing a comprehensive research agent for large experimental collaborations.
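
The two-tiered lookup described in the abstract can be illustrated with a minimal sketch. This is not MITRA's implementation: the abstract does not name the embedding model or vector database, so the example swaps in a toy bag-of-words embedding with cosine similarity, and the analysis IDs, abstracts, and text chunks are all hypothetical. It shows only the control flow: pick the analysis from its abstract first, then search within that analysis's documentation.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding' (assumption: stands in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Tier 1 store: one abstract per analysis (hypothetical IDs and text).
ABSTRACTS = {
    "ANA-001": "Search for heavy neutral leptons in final states with displaced muons",
    "ANA-002": "Measurement of the top quark pair production cross section in dilepton events",
}

# Tier 2 store: full-documentation chunks, grouped by analysis (hypothetical text).
CHUNKS = {
    "ANA-001": [
        "Events are selected with a single muon trigger threshold of 24 GeV.",
        "The dominant background arises from cosmic muons and is estimated from data.",
    ],
    "ANA-002": [
        "Events are selected with dilepton triggers and two jets, one of them b tagged.",
        "The dominant background is Drell-Yan production, estimated from a control region.",
    ],
}


def two_tier_retrieve(query, top_k=1):
    """Return (analysis_id, top_k chunks) using abstract-first, then chunk-level search."""
    q = embed(query)
    # Tier 1: identify the most relevant analysis from the abstracts alone.
    analysis_id = max(ABSTRACTS, key=lambda a: cosine(q, embed(ABSTRACTS[a])))
    # Tier 2: rank only the chunks belonging to that analysis.
    ranked = sorted(
        CHUNKS[analysis_id], key=lambda c: cosine(q, embed(c)), reverse=True
    )
    return analysis_id, ranked[:top_k]


if __name__ == "__main__":
    query = "What is the dominant background in the top quark pair cross section measurement?"
    print(two_tier_retrieve(query))
```

In this toy corpus both analyses contain a "dominant background" chunk that shares the same terms with the query, so a flat search over all chunks would have little to separate them. Restricting the second tier to the analysis chosen from the abstracts is what removes that ambiguity.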