Orthogonalized Multimodal Contrastive Learning with Asymmetric Masking for Structured Representations
AI Summary
The COrAL framework explicitly models the redundant, unique, and synergistic information in multimodal data through orthogonalization and asymmetric masking, improving representation quality.
Main Contributions
- Proposes COrAL, a framework that explicitly models redundant, unique, and synergistic multimodal information.
- Uses orthogonality constraints to disentangle shared and modality-specific features, ensuring clean separation of information components (see the sketch after this list).
- Introduces asymmetric masking to promote synergy modeling and prevent over-reliance on redundant information.
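To make the orthogonality constraint concrete, here is a minimal PyTorch sketch of one common way such a penalty can be implemented. The function name `orthogonality_loss` and the squared-cosine formulation are illustrative assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def orthogonality_loss(shared: torch.Tensor, specific: torch.Tensor) -> torch.Tensor:
    """Illustrative orthogonality penalty (an assumption, not COrAL's exact loss).

    Pushes each sample's shared-path embedding and modality-specific embedding
    toward orthogonality by penalizing their squared cosine similarity.
    Both inputs are (batch, dim) tensors from the two paths.
    """
    shared = F.normalize(shared, dim=-1)    # unit-normalize each embedding
    specific = F.normalize(specific, dim=-1)
    cos = (shared * specific).sum(dim=-1)   # per-sample cosine similarity
    return (cos ** 2).mean()                # zero iff the two paths are orthogonal
```

Driving this penalty to zero means the shared and modality-specific embeddings carry non-overlapping directions, which is the "clean separation" the contribution refers to.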
Methodology
COrAL builds a dual-path architecture that separates shared from modality-specific features via orthogonality constraints, applies asymmetric masking so the model must learn cross-modal dependencies, and trains the resulting representations with a contrastive objective, as sketched below.
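The sketch below illustrates the masking-and-contrast ingredients under stated assumptions: `complementary_masks` samples a random token mask for one view and gives the other view its exact complement, and `info_nce` is the standard InfoNCE contrastive objective. The function names, mask ratio, and temperature are hypothetical, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def complementary_masks(num_tokens: int, mask_ratio: float = 0.5):
    """Illustrative complementary masking (an assumption, not COrAL's exact scheme).

    Randomly hides a fraction of tokens from view A and hides exactly the
    remaining tokens from view B, so neither view alone sees the full input
    and the model must infer cross-modal dependencies rather than match
    redundant cues.
    """
    perm = torch.randperm(num_tokens)
    k = int(num_tokens * mask_ratio)
    mask_a = torch.zeros(num_tokens, dtype=torch.bool)
    mask_a[perm[:k]] = True   # tokens hidden from view A
    mask_b = ~mask_a          # view B hides exactly the complementary tokens
    return mask_a, mask_b

def info_nce(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.07):
    """Standard InfoNCE loss between embeddings of the two masked views."""
    z_a, z_b = F.normalize(z_a, dim=-1), F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(z_a.size(0))    # matching pairs lie on the diagonal
    return F.cross_entropy(logits, targets)
```

Because the two masks never overlap, a view cannot align with its partner through shared surface content alone, which supplies the pressure toward synergistic, interaction-driven features.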
Original Abstract
Multimodal learning seeks to integrate information from heterogeneous sources, where signals may be shared across modalities, specific to individual modalities, or emerge only through their interaction. While self-supervised multimodal contrastive learning has achieved remarkable progress, most existing methods predominantly capture redundant cross-modal signals, often neglecting modality-specific (unique) and interaction-driven (synergistic) information. Recent extensions broaden this perspective, yet they either fail to explicitly model synergistic interactions or learn different information components in an entangled manner, leading to incomplete representations and potential information leakage. We introduce **COrAL**, a principled framework that explicitly and simultaneously preserves redundant, unique, and synergistic information within multimodal representations. COrAL employs a dual-path architecture with orthogonality constraints to disentangle shared and modality-specific features, ensuring a clean separation of information components. To promote synergy modeling, we introduce asymmetric masking with complementary view-specific patterns, compelling the model to infer cross-modal dependencies rather than rely solely on redundant cues. Extensive experiments on synthetic benchmarks and diverse MultiBench datasets demonstrate that COrAL consistently matches or outperforms state-of-the-art methods while exhibiting low performance variance across runs. These results indicate that explicitly modeling the full spectrum of multimodal information yields more stable, reliable, and comprehensive embeddings.