Evaluating the Impact of Data Anonymization on Image Retrieval
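AI Summary
This paper systematically evaluates the impact of data anonymization on the performance of content-based image retrieval (CBIR).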
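Main Contributions
- Proposes a framework for evaluating the impact of data anonymization on CBIR
- Evaluates how different anonymization methods and degrees affect CBIR
- Reveals how a model's training data influences retrieval results after anonymization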
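Methodology
Retrieval results before and after anonymization are compared across multiple datasets, using different anonymization methods and training strategies.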
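The before/after comparison can be sketched as follows. This is a minimal illustration, not the paper's implementation: the overlap-at-k score, the cosine-similarity retrieval, the function names `top_k` and `overlap_at_k`, and the random vectors standing in for image embeddings are all assumptions. In the paper the embeddings would come from a DINOv2 backbone, and the exact similarity measure is not stated in this summary. Which side is anonymized (queries, gallery, or both) is also an assumption here.

```python
import numpy as np

def top_k(query_emb, gallery_emb, k):
    """Return, per query, the indices of the k most cosine-similar gallery items."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sims = q @ g.T
    # Sorting the negated similarities ascending yields descending similarity.
    return np.argsort(-sims, axis=1, kind="stable")[:, :k]

def overlap_at_k(ranks_before, ranks_after):
    """Mean fraction of top-k items shared by the two retrieval lists."""
    k = ranks_before.shape[1]
    shared = [len(set(b) & set(a)) for b, a in zip(ranks_before, ranks_after)]
    return float(np.mean(shared)) / k

# Hypothetical data: random vectors stand in for real image embeddings.
rng = np.random.default_rng(0)
gallery = rng.normal(size=(100, 16))
queries = rng.normal(size=(5, 16))
# Stand-in for embeddings of anonymized queries: a small perturbation.
queries_anon = queries + 0.1 * rng.normal(size=queries.shape)

before = top_k(queries, gallery, k=10)
after = top_k(queries_anon, gallery, k=10)
score = overlap_at_k(before, after)  # 1.0 means identical top-k sets
```

A score near 1 means anonymization barely changed which items are retrieved. The set-based overlap ignores rank order within the top k, so a rank-aware measure may be preferable when ordering matters.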
Original Abstract
With the growing importance of privacy regulations such as the General Data Protection Regulation, anonymizing visual data is becoming increasingly relevant across institutions. However, anonymization can negatively affect the performance of Computer Vision systems that rely on visual features, such as Content-Based Image Retrieval (CBIR). Despite this, the impact of anonymization on CBIR has not been systematically studied. This work addresses this gap, motivated by the DOKIQ project, an artificial intelligence-based system for document verification actively used by the State Criminal Police Office Baden-Württemberg. We propose a simple evaluation framework: retrieval results after anonymization should match those obtained before anonymization as closely as possible. To this end, we systematically assess the impact of anonymization using two public datasets and the internal DOKIQ dataset. Our experiments span three anonymization methods, four anonymization degrees, and four training strategies, all based on the state-of-the-art backbone DINOv2 (Self-Distillation with No Labels, version 2). Our results reveal a pronounced retrieval bias in favor of models trained on original data, which produce the most similar retrievals after anonymization. The findings of this paper offer practical insights for developing privacy-compliant CBIR systems while preserving performance.
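The abstract does not name the three anonymization methods or define the four degrees. As a purely illustrative assumption, the sketch below uses pixelation of a rectangular region, with the block size acting as the "degree". The function `pixelate_region`, the region coordinates, and the block sizes are hypothetical and not taken from the paper.

```python
import numpy as np

def pixelate_region(image, box, block):
    """Pixelate image[y0:y1, x0:x1] by averaging each block x block tile."""
    y0, y1, x0, x1 = box
    out = image.astype(np.float64).copy()
    for y in range(y0, y1, block):
        for x in range(x0, x1, block):
            # The slice is a view into `out`, so assigning modifies it in place.
            tile = out[y:min(y + block, y1), x:min(x + block, x1)]
            tile[...] = tile.mean(axis=(0, 1), keepdims=True)
    return out

# Hypothetical input: a random RGB image with a sensitive region in the centre.
rng = np.random.default_rng(0)
image = rng.uniform(0, 255, size=(64, 64, 3))
box = (16, 48, 16, 48)    # (y0, y1, x0, x1), assumed region to anonymize
degrees = [2, 4, 8, 16]   # assumed: four degrees as increasing block sizes

variants = {b: pixelate_region(image, box, b) for b in degrees}
```

Each variant could then be embedded and retrieved with the same model, so that its results can be compared against those obtained from the original image.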