LLM Reasoning relevance: 5/10

Semantic Centroids and Hierarchical Density-Based Clustering for Cross-Document Software Coreference Resolution

Julia Matela, Frank Krüger

arXiv: 2603.24246v1 Published: 2026-03-25 Updated: 2026-03-25

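AI Summary

For cross-document coreference resolution of software mentions, the paper proposes a hybrid framework that combines semantic embeddings, knowledge base lookup, and density-based clustering.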

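Main Contributions

Proposes a hybrid framework that combines semantic embeddings, knowledge base lookup, and density-based clustering
Uses a Sentence-BERT model to generate dense semantic embeddings
Applies HDBSCAN for density-based clustering
For large-scale data, adopts a blocking strategy based on entity types and canonicalized surface forms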
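
The blocking idea in the last item can be sketched in a few lines. This is only an illustration, not the authors' implementation: the normalization rules in `canonicalize`, the function `block_mentions`, and the entity-type labels in the usage note are all assumptions made for the example.

```python
import re
from collections import defaultdict


def canonicalize(surface: str) -> str:
    """Reduce a surface form to a rough canonical key (illustrative rules only)."""
    text = surface.lower()
    # Drop standalone version tokens such as "22.0", "v2" or "3.6.1".
    text = re.sub(r"\bv?\d+(\.\d+)*\b", " ", text)
    # Replace everything except letters, digits, '+' and '#' with a space.
    text = re.sub(r"[^a-z0-9+#]+", " ", text)
    return " ".join(text.split())


def block_mentions(mentions):
    """Group mention indices by (entity type, canonical surface form).

    `mentions` is a list of (surface_form, entity_type) pairs. Only mentions
    that share a block would later be compared or clustered together, which
    keeps the number of pairwise comparisons manageable on large corpora.
    """
    blocks = defaultdict(list)
    for idx, (surface, entity_type) in enumerate(mentions):
        blocks[(entity_type, canonicalize(surface))].append(idx)
    return dict(blocks)
```

With hypothetical input such as `[("SPSS", "Application"), ("spss v22.0", "Application"), ("ggplot2", "PlugIn")]`, the first two mentions fall into the same block because they share an entity type and the key `spss`, while `ggplot2` keeps its trailing digit and forms its own block.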

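Methodology

Semantic embeddings are obtained with Sentence-BERT; a knowledge base of training-set cluster centroids is indexed with FAISS for lookup; HDBSCAN clusters the mentions that cannot be confidently assigned to an existing cluster; and a blocking strategy is applied to large-scale data.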
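
The centroid lookup step might look roughly like the sketch below. It is a simplified stand-in rather than the submitted system: a brute-force cosine search replaces FAISS, toy 2-D vectors replace Sentence-BERT embeddings, and the names `build_kb`, `assign`, and the `threshold=0.8` default are assumptions chosen for illustration. Mentions for which `assign` returns `None` are the ones that would be passed on to density-based clustering.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def build_kb(train_embeddings, train_labels):
    """Map each training cluster label to the centroid of its embeddings."""
    grouped = {}
    for vec, label in zip(train_embeddings, train_labels):
        grouped.setdefault(label, []).append(vec)
    return {label: centroid(vecs) for label, vecs in grouped.items()}


def assign(embedding, kb, threshold=0.8):
    """Return the label of the most similar centroid, or None if below threshold.

    The threshold value is hypothetical; a None result marks the mention
    as unassigned so that it can be clustered separately.
    """
    best_label, best_sim = None, -1.0
    for label, center in kb.items():
        sim = cosine(embedding, center)
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label if best_sim >= threshold else None


# Toy data standing in for real sentence embeddings.
kb = build_kb([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]], ["spss", "spss", "r"])
confident = assign([1.0, 0.05], kb)   # close to the "spss" centroid
uncertain = assign([0.7, 0.7], kb)    # between both centroids, below threshold
```

An approximate nearest-neighbour index only changes how the best centroid is found; the accept-or-defer decision on the similarity score stays the same.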

Original Abstract

This paper describes the system submitted to the SOMD 2026 Shared Task for Cross-Document Coreference Resolution (CDCR) of software mentions. Our approach addresses the challenge of identifying and clustering inconsistent software mentions across scientific corpora. We propose a hybrid framework that combines dense semantic embeddings from a pre-trained Sentence-BERT model, Knowledge Base (KB) lookup strategy built from training-set cluster centroids using FAISS for efficient retrieval, and HDBSCAN density-based clustering for mentions that cannot be confidently assigned to existing clusters. Surface-form normalization and abbreviation resolution are applied to improve canonical name matching. The same core pipeline is applied to Subtasks 1 and 2. To address the large scale settings of Subtask 3, the pipeline was adapted by utilising a blocking strategy based on entity types and canonicalized surface forms. Our system achieved CoNLL F1 scores of 0.98, 0.98, and 0.96 on Subtasks 1, 2, and 3 respectively.

arXiv Categories

cs.CL
