Multimodal Learning Relevance: 8/10

Differential Attention-Augmented BiomedCLIP with Asymmetric Focal Optimization for Imbalanced Multi-Label Video Capsule Endoscopy Classification

Podakanti Satyajith Chary, Nagarajan Ganapathy

arXiv: 2603.17879v1 Published: 2026-03-18 Updated: 2026-03-18

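AI Summary

Proposes a framework based on BiomedCLIP and asymmetric focal optimization for multi-label classification of VCE images, improving performance on imbalanced datasets.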
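To make the "asymmetric focal" idea concrete, here is a minimal sketch of one common form of asymmetric focal loss for multi-label outputs, in which positives and negatives get separate focusing exponents. The function name, the exponent values, and the omission of any probability margin are assumptions for illustration; the listing does not specify the exact formulation used in the paper.

```python
import math

def asymmetric_focal_loss(probs, targets, gamma_pos=0.0, gamma_neg=4.0, eps=1e-8):
    """Mean asymmetric focal loss over independent binary labels.

    probs   : predicted probabilities in [0, 1], one per label
    targets : 0/1 ground-truth values, one per label
    gamma_pos, gamma_neg : illustrative focusing exponents (assumed values);
        a larger gamma_neg down-weights easy negatives more strongly.
    """
    total = 0.0
    for p, y in zip(probs, targets):
        if y == 1:
            # Positive term: focal factor (1 - p)^gamma_pos times cross-entropy.
            total += -((1.0 - p) ** gamma_pos) * math.log(p + eps)
        else:
            # Negative term: focal factor p^gamma_neg times cross-entropy.
            total += -(p ** gamma_neg) * math.log(1.0 - p + eps)
    return total / len(probs)

# One frame with three labels: a missed positive and two easy negatives.
loss = asymmetric_focal_loss([0.2, 0.05, 0.1], [1, 0, 0])
```

With a large negative exponent, the many confidently rejected negative labels contribute almost nothing, so the gradient is dominated by the rare positive labels.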

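Key Contributions

Introduces a differential attention mechanism to suppress attention noise
Applies multiple optimization strategies to handle class imbalance
Achieves fast and effective VCE image classification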
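The differential attention mentioned in the first contribution can be sketched for a single head as the difference of two softmax attention maps applied to shared values. This pure-Python version is only an illustration: the fixed scalar `lam`, the function names, and the absence of any normalization or multi-head handling are assumptions, and the listing does not say how the mechanism is wired into BiomedCLIP.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_map(q, k):
    """Row-wise softmax of scaled dot products between queries and keys."""
    d = len(q[0])
    return [
        softmax([sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k])
        for qi in q
    ]

def differential_attention(q1, k1, q2, k2, v, lam=0.5):
    """Return (output, weights) where weights = softmax(q1 k1^T) - lam * softmax(q2 k2^T).

    lam is a fixed scalar here (an assumption for illustration).
    """
    a1 = attention_map(q1, k1)
    a2 = attention_map(q2, k2)
    # Subtracting the second map cancels attention mass common to both maps.
    weights = [[x - lam * y for x, y in zip(r1, r2)] for r1, r2 in zip(a1, a2)]
    out = [
        [sum(w * vj[c] for w, vj in zip(row, v)) for c in range(len(v[0]))]
        for row in weights
    ]
    return out, weights

q1 = [[1.0, 0.0], [0.0, 1.0]]
k1 = [[1.0, 0.0], [0.0, 1.0]]
q2 = [[0.5, 0.5], [0.5, 0.5]]
k2 = [[1.0, 1.0], [0.0, 0.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
out, weights = differential_attention(q1, k1, q2, k2, v, lam=0.5)
```

Because each softmax row sums to 1, every row of the differential weights sums to `1 - lam`, and individual weights may be negative, unlike standard attention.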

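Methodology

Modifies BiomedCLIP to use differential attention and combines it with asymmetric focal loss, weighted sampling, and related strategies, then applies threshold optimization and temporal-consistency post-processing.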
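Two of the imbalance-handling steps above, weighted sampling and threshold optimization, can be sketched as follows. This is a simplified illustration under stated assumptions: sampling weights are taken as the inverse square root of class frequency with one class per frame (the paper's task is multi-label), and the threshold is chosen per class by maximizing F1 on validation scores. Neither the selection metric nor the function names come from the source.

```python
import math

def sqrt_frequency_sample_weights(labels):
    """Per-sample weight 1 / sqrt(count of that sample's class).

    labels: one class name per frame (single-label simplification, assumed).
    """
    counts = {}
    for c in labels:
        counts[c] = counts.get(c, 0) + 1
    return [1.0 / math.sqrt(counts[c]) for c in labels]

def best_threshold(scores, targets, candidates=None):
    """Sweep candidate thresholds for one class and return (threshold, F1).

    F1 as the selection metric is an assumption for illustration.
    """
    if candidates is None:
        candidates = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
    best_t, best_f1 = 0.5, -1.0
    for t in candidates:
        tp = sum(1 for s, y in zip(scores, targets) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, targets) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, targets) if s < t and y == 1)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:  # keep the lowest threshold among ties
            best_t, best_f1 = t, f1
    return best_t, best_f1

weights = sqrt_frequency_sample_weights(["normal"] * 100 + ["polyp"] * 4)
threshold, f1 = best_threshold([0.12, 0.22, 0.33, 0.81], [0, 0, 1, 1])
```

Square-root weighting raises the sampling rate of rare classes without fully equalizing them, which limits how often the same few rare frames are repeated. In a PyTorch pipeline, weights like these would typically be passed to `torch.utils.data.WeightedRandomSampler`.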

Original Abstract

This work presents a multi-label classification framework for video capsule endoscopy (VCE) that addresses the extreme class imbalance inherent in the Galar dataset through a combination of architectural and optimization-level strategies. Our approach modifies BiomedCLIP, a biomedical vision-language foundation model, by replacing its standard multi-head self-attention with a differential attention mechanism that computes the difference between two softmax attention maps to suppress attention noise. To counteract the skewed label distribution, where pathological findings constitute less than 0.1% of all annotated frames, a sqrt-frequency weighted sampler, asymmetric focal loss, mixup regularization, and per-class threshold optimization are employed. Temporal coherence is enforced through median-filter smoothing and gap merging prior to event-level JSON generation. On the held-out RARE-VISION test set comprising three NaviCam examinations (161,025 frames), the pipeline achieves an overall temporal mAP@0.5 of 0.2456 and mAP@0.95 of 0.2353, with total inference completed in approximately 8.6 minutes on a single GPU.
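The temporal post-processing described in the abstract (median-filter smoothing, gap merging, then event-level JSON) might look roughly like the sketch below for one class. The window size, the maximum gap, the tie-breaking rule at sequence edges, and the JSON field names are all hypothetical; the abstract does not give these details.

```python
import json
from statistics import median

def median_smooth(flags, window=3):
    """Median-filter a 0/1 sequence; edge windows are truncated, ties resolve to 0."""
    half = window // 2
    n = len(flags)
    return [
        1 if median(flags[max(0, i - half):min(n, i + half + 1)]) > 0.5 else 0
        for i in range(n)
    ]

def to_events(flags, max_gap=2):
    """Convert a 0/1 sequence into [start, end] frame ranges, merging short gaps."""
    events, start = [], None
    for i, f in enumerate(flags):
        if f and start is None:
            start = i
        elif not f and start is not None:
            events.append([start, i - 1])
            start = None
    if start is not None:
        events.append([start, len(flags) - 1])
    merged = []
    for ev in events:
        # Merge when the number of negative frames between events is small enough.
        if merged and ev[0] - merged[-1][1] - 1 <= max_gap:
            merged[-1][1] = ev[1]
        else:
            merged.append(ev)
    return merged

frame_preds = [0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0]
smoothed = median_smooth(frame_preds, window=3)
events = to_events(smoothed, max_gap=2)  # [[2, 7]]
# Hypothetical event schema; the actual JSON format is not given in the abstract.
payload = json.dumps(
    [{"label": "polyp", "start_frame": s, "end_frame": e} for s, e in events]
)
```

In this example the filter fills the single-frame dropout at index 5 and removes the isolated detection at index 12, leaving one event spanning frames 2 to 7.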

arXiv Categories

cs.CV cs.AI