ARGENT: Adaptive Hierarchical Image-Text Representations
AI Summary
ARGENT proposes a new hyperbolic vision-language model that improves hierarchical representation through an adaptive entailment loss and an angle-based probabilistic evaluation protocol.
Main Contributions
- Proposes an adaptive entailment loss and norm regularization that prevent cone collapse
- Introduces an angle-based Probabilistic Entailment Protocol (PEP) for evaluating hierarchical understanding
- Builds ARGENT, a stronger hyperbolic vision-language model baseline
Methodology
Proposes an adaptive entailment loss and norm regularization to improve vision-language representations in hyperbolic space, and evaluates hierarchical understanding with a new angle-based probabilistic entailment protocol.
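The paper's exact adaptive loss is not spelled out in this summary. As a rough sketch of the underlying machinery, the following assumes the standard Poincaré-ball entailment cones of Ganea et al. (half-aperture `psi(x) = arcsin(K(1 - ||x||^2)/||x||)` and the exterior angle between parent and child), plus a hypothetical norm regularizer that keeps parents away from the origin, where the cone widens toward a half-space; the constants `K`, `r_min`, and `lam` are illustrative, not the paper's values.

```python
import numpy as np

def half_aperture(x, K=0.1, eps=1e-6):
    # Half-aperture of the entailment cone at x in the Poincare ball
    # (Ganea et al., 2018): psi(x) = arcsin(K * (1 - ||x||^2) / ||x||).
    # As ||x|| -> 0 the aperture approaches pi/2 (a half-space): cone collapse.
    nx = np.linalg.norm(x)
    arg = np.clip(K * (1 - nx**2) / (nx + eps), -1 + eps, 1 - eps)
    return np.arcsin(arg)

def exterior_angle(x, y, eps=1e-6):
    # Angle at x between the geodesic toward the origin and the geodesic to y.
    nx, ny = np.linalg.norm(x), np.linalg.norm(y)
    dot = np.dot(x, y)
    num = dot * (1 + nx**2) - nx**2 * (1 + ny**2)
    den = nx * np.linalg.norm(x - y) * np.sqrt(1 + nx**2 * ny**2 - 2 * dot) + eps
    return np.arccos(np.clip(num / den, -1 + eps, 1 - eps))

def entailment_loss(parent, child, r_min=0.3, lam=1.0, K=0.1):
    # Cone-violation hinge plus a hypothetical norm regularizer that penalizes
    # parents contracting below radius r_min (the collapse regime).
    violation = max(0.0, exterior_angle(parent, child) - half_aperture(parent, K))
    norm_reg = lam * max(0.0, r_min - np.linalg.norm(parent))
    return violation + norm_reg
```

A child lying along the parent's outward direction falls inside the cone and incurs no violation, while a parent drawn toward the origin is penalized by the regularizer even before its cone degenerates.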
Original Abstract
Large-scale Vision-Language Models (VLMs) such as CLIP learn powerful semantic representations but operate in Euclidean space, which fails to capture the inherent hierarchical structure of visual and linguistic concepts. Hyperbolic geometry, with its exponential volume growth, offers a principled alternative for embedding such hierarchies with low distortion. However, existing hyperbolic VLMs use entailment losses that are unstable: as parent embeddings contract toward the origin, their entailment cones widen toward a half-space, causing catastrophic cone collapse that destroys the intended hierarchy. Additionally, hierarchical evaluation of these models remains unreliable, relying largely on retrieval- and correlation-based metrics that are prone to taxonomy dependence and ambiguous negatives. To address these limitations, we propose an adaptive entailment loss paired with a norm regularizer that prevents cone collapse without heuristic aperture clipping. We further introduce an angle-based probabilistic entailment protocol (PEP) for evaluating hierarchical understanding, scored with AUC-ROC and Average Precision. This paper introduces a stronger hyperbolic VLM baseline, ARGENT (Adaptive hieRarchical imaGe-tExt represeNTation). ARGENT improves the SOTA hyperbolic VLM by 0.7, 1.1, and 0.8 absolute points on image classification, text-to-image retrieval, and the proposed hierarchical metrics, respectively.
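The abstract names PEP's scoring metrics but not its mechanics. A minimal sketch, assuming each candidate (parent, child) pair is scored by an angle margin (parent cone aperture minus exterior angle, so positive means the child lies inside the cone) and true entailment pairs are labeled 1; the margin and label values below are invented for illustration, and scoring uses standard scikit-learn metrics:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# Hypothetical angle margins: aperture(parent) - exterior_angle(parent, child).
# A positive margin means the child falls inside the parent's entailment cone.
margins = np.array([0.8, 0.5, 0.1, 0.3, -0.2, -0.6])
labels  = np.array([1,   1,   1,   0,    0,    0])  # 1 = true entailment pair

# Threshold-free scoring, as in the abstract: AUC-ROC and Average Precision.
auc = roc_auc_score(labels, margins)
ap  = average_precision_score(labels, margins)
```

Ranking pairs by a continuous margin and scoring with AUC-ROC/AP avoids picking an angle threshold and, unlike retrieval-based metrics, does not depend on a particular taxonomy's candidate pool.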