Evaluating the impact of word embeddings on similarity scoring in practical information retrieval
AI Summary
This paper evaluates the effectiveness of a similarity-scoring approach based on Word Mover's Distance (WMD) and word embeddings for information retrieval, and demonstrates its superiority over existing retrieval models.
Key Contributions
- Proposes a similarity-scoring method based on WMD and word embeddings
- Shows that the WMD + GloVe combination outperforms other retrieval models
- Validates the domain-agnostic nature and portability of pre-trained word embeddings
Methodology
WMD is used to compute distances between individual words of the query and the document; combined with word-embedding techniques, these distances are used to rank query and response statements by similarity.
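The idea above can be illustrated with a minimal sketch. This is not the paper's implementation: it uses toy hand-made 2-D vectors in place of pre-trained GloVe embeddings, and computes the relaxed WMD lower bound (each query word's mass flows to its nearest document word) rather than solving the full transport problem.

```python
import numpy as np

# Toy embeddings standing in for pre-trained GloVe vectors (hypothetical values).
EMB = {
    "cat":    np.array([1.0, 0.0]),
    "kitten": np.array([0.9, 0.1]),
    "dog":    np.array([0.0, 1.0]),
    "puppy":  np.array([0.1, 0.9]),
}

def relaxed_wmd(query, doc):
    """Relaxed WMD lower bound: each query word, weighted by its
    normalized bag-of-words frequency, moves to the nearest document word."""
    weight = 1.0 / len(query)  # uniform word weights for this sketch
    total = 0.0
    for q in query:
        # Euclidean distance from the query word to every document word.
        dists = [np.linalg.norm(EMB[q] - EMB[d]) for d in doc]
        total += weight * min(dists)
    return total

# Ranking: a query about "cat" sits closer to the "kitten" document
# than to the "puppy" document, so the former ranks higher.
print(relaxed_wmd(["cat"], ["kitten"]) < relaxed_wmd(["cat"], ["puppy"]))
```

In practice one would load real pre-trained vectors and solve the exact transport problem (e.g. gensim's `KeyedVectors.wmdistance`); the relaxed bound shown here is the cheap approximation often used to prune candidates before the exact computation.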
Original Abstract
Search behaviour is characterised using synonymy and polysemy as users often want to search for information based on meaning. Semantic representation strategies represent a move towards richer associative connections that can adequately capture this complex usage of language. Vector Space Modelling (VSM) and neural word embeddings play a crucial role in modern machine learning and Natural Language Processing (NLP) pipelines. Embeddings use distributional semantics to represent words, sentences, paragraphs or entire documents as vectors in high dimensional spaces. This can be leveraged by Information Retrieval (IR) systems to exploit the semantic relatedness between queries and answers. This paper evaluates an alternative approach to measuring query statement similarity that moves away from the common similarity measure of centroids of neural word embeddings. Motivated by the Word Mover's Distance (WMD) model, similarity is evaluated using the distance between individual words of queries and statements. Results from ranked query and response statements demonstrate significant gains in accuracy using the combined approach of similarity ranking through WMD with the word embedding techniques. The top performing WMD + GloVe combination outperforms all other state-of-the-art retrieval models including Doc2Vec and the baseline LSA model. Along with the significant gains in performance of similarity ranking through WMD, we conclude that the use of pre-trained word embeddings, trained on vast amounts of data, results in domain agnostic language processing solutions that are portable to diverse business use-cases.