Differential Attention-Augmented BiomedCLIP with Asymmetric Focal Optimization for Imbalanced Multi-Label Video Capsule Endoscopy Classification
AI Summary
For multi-label classification of VCE images, the paper proposes a framework based on BiomedCLIP with asymmetric focal optimization, improving performance on a heavily imbalanced dataset.
Key Contributions
- Introduces a differential attention mechanism to suppress attention noise
- Applies multiple optimization strategies to handle class imbalance
- Achieves fast and effective VCE image classification
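The differential attention mechanism named above replaces a standard softmax attention map with the difference of two maps, so attention mass that both maps assign to irrelevant tokens cancels out. A minimal single-head NumPy sketch of that idea follows; the function name, the projection matrices, and the fixed scalar `lam` are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def differential_attention(x, Wq1, Wk1, Wq2, Wk2, Wv, lam=0.5):
    """Single-head differential attention sketch: subtracting a second
    softmax attention map cancels common-mode attention noise."""
    d = Wq1.shape[1]
    q1, k1 = x @ Wq1, x @ Wk1   # projections for the first attention map
    q2, k2 = x @ Wq2, x @ Wk2   # projections for the second attention map
    v = x @ Wv
    a1 = softmax(q1 @ k1.T / np.sqrt(d))
    a2 = softmax(q2 @ k2.T / np.sqrt(d))
    return (a1 - lam * a2) @ v  # difference of two attention maps, applied to values
```

With `lam=0` this degenerates to ordinary softmax attention, which makes the mechanism a drop-in replacement for the multi-head self-attention blocks in BiomedCLIP's vision encoder.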
Methodology
BiomedCLIP is modified to use differential attention and trained with asymmetric focal loss, weighted sampling, and related strategies; per-class threshold optimization and temporal-coherence post-processing are then applied.
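Asymmetric focal loss handles imbalance at the loss level by applying a harder focusing exponent to negatives than to positives and clipping away very easy negatives. A minimal NumPy sketch of the standard asymmetric-loss recipe is below; the default hyperparameters (`gamma_pos`, `gamma_neg`, `clip`) are common choices from the asymmetric-loss literature, not values reported in this paper:

```python
import numpy as np

def asymmetric_focal_loss(logits, targets, gamma_pos=1.0, gamma_neg=4.0,
                          clip=0.05, eps=1e-8):
    """Asymmetric focal loss sketch for multi-label classification:
    negatives get a larger focusing exponent (gamma_neg > gamma_pos),
    and negatives with probability below `clip` are discarded entirely."""
    p = 1.0 / (1.0 + np.exp(-logits))        # per-label sigmoid probability
    p_neg = np.clip(p - clip, 0.0, 1.0)      # probability shift for negatives
    loss_pos = targets * (1 - p) ** gamma_pos * np.log(p + eps)
    loss_neg = (1 - targets) * p_neg ** gamma_neg * np.log(1 - p_neg + eps)
    return -(loss_pos + loss_neg).mean()
```

The asymmetry matters here because with pathological findings under 0.1% of frames, almost every label of every frame is an easy negative; down-weighting those keeps the rare positives from being drowned out.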
Original Abstract
This work presents a multi-label classification framework for video capsule endoscopy (VCE) that addresses the extreme class imbalance inherent in the Galar dataset through a combination of architectural and optimization-level strategies. Our approach modifies BiomedCLIP, a biomedical vision-language foundation model, by replacing its standard multi-head self-attention with a differential attention mechanism that computes the difference between two softmax attention maps to suppress attention noise. To counteract the skewed label distribution, where pathological findings constitute less than 0.1% of all annotated frames, a sqrt-frequency weighted sampler, asymmetric focal loss, mixup regularization, and per-class threshold optimization are employed. Temporal coherence is enforced through median-filter smoothing and gap merging prior to event-level JSON generation. On the held-out RARE-VISION test set comprising three NaviCam examinations (161,025 frames), the pipeline achieves an overall temporal mAP@0.5 of 0.2456 and mAP@0.95 of 0.2353, with total inference completed in approximately 8.6 minutes on a single GPU.
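A sqrt-frequency weighted sampler, as named in the abstract, oversamples rare classes but less aggressively than full inverse-frequency weighting. One way to derive per-sample sampling weights for the multi-label case is sketched below; the rule of letting a sample's rarest positive label dominate is an assumption for illustration:

```python
import numpy as np

def sqrt_frequency_weights(labels):
    """labels: (N, C) binary multi-label matrix. Returns per-sample sampling
    probabilities proportional to 1/sqrt(frequency) of each sample's rarest
    positive class (a milder rebalancing than 1/frequency)."""
    freq = labels.sum(axis=0).clip(min=1)    # per-class positive counts
    class_w = 1.0 / np.sqrt(freq)            # sqrt-inverse class weights
    w = (labels * class_w).max(axis=1)       # rarest positive label dominates
    w[w == 0] = class_w.min()                # all-negative frames get a baseline weight
    return w / w.sum()                       # normalize to a sampling distribution
```

These weights would typically feed a weighted random sampler so that each training batch contains a usable number of pathological frames.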
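Per-class threshold optimization replaces the default 0.5 sigmoid cutoff with a threshold tuned separately for each label on validation data, which matters when classes differ in prevalence by orders of magnitude. A minimal grid-search sketch maximizing per-class F1 is shown below; the F1 criterion and the grid are assumptions, since the paper does not state which metric it tunes against here:

```python
import numpy as np

def per_class_thresholds(probs, targets, grid=np.linspace(0.05, 0.95, 19)):
    """For each class, pick the decision threshold that maximizes F1
    on held-out (probs, targets) validation arrays of shape (N, C)."""
    thresholds = []
    for c in range(probs.shape[1]):
        best_t, best_f1 = 0.5, -1.0
        for t in grid:
            pred = probs[:, c] >= t
            tp = np.sum(pred & (targets[:, c] == 1))
            fp = np.sum(pred & (targets[:, c] == 0))
            fn = np.sum(~pred & (targets[:, c] == 1))
            f1 = 2 * tp / (2 * tp + fp + fn + 1e-8)
            if f1 > best_f1:
                best_t, best_f1 = t, f1
        thresholds.append(best_t)
    return np.array(thresholds)
```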
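The temporal-coherence step described in the abstract, median-filter smoothing followed by gap merging, can be sketched on a per-class binary prediction track as follows; the window size and maximum gap are illustrative defaults, not the paper's settings:

```python
import numpy as np

def median_smooth(preds, k=5):
    """Sliding-median filter over a 1-D binary prediction track:
    removes isolated single-frame spikes and dropouts."""
    pad = k // 2
    padded = np.pad(preds, pad, mode='edge')
    return np.array([np.median(padded[i:i + k])
                     for i in range(len(preds))]).astype(int)

def merge_gaps(preds, max_gap=3):
    """Merge consecutive positive events separated by at most
    `max_gap` negative frames into a single event."""
    out = preds.copy()
    idx = np.flatnonzero(preds)
    for a, b in zip(idx[:-1], idx[1:]):
        if 1 < b - a <= max_gap + 1:   # short run of negatives between positives
            out[a:b] = 1
    return out
```

After this cleanup, contiguous positive runs can be exported as start/end events for the event-level JSON output.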