Are Object-Centric Representations Better At Compositional Generalization?
AI Summary
The study shows that object-centric representations outperform dense representations on compositional generalization tasks when data is limited.
Key Contributions
- Introduces a new Visual Question Answering benchmark for evaluating compositional generalization
- Compares the performance of vision encoders with and without object-centric biases
- Demonstrates the advantages of object-centric representations for compositional generalization
Methodology
VQA experiments are run on the CLEVRTex, Super-CLEVR, and MOVi-C datasets using DINOv2 and SigLIP2 together with their object-centric counterparts, to evaluate compositional generalization.
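The core of such a benchmark is the split design: every individual property value appears during training, but certain property *combinations* are held out for test. A minimal sketch of this idea (the property names and values here are hypothetical, not the paper's actual attribute vocabulary):

```python
import itertools

# Hypothetical property vocabularies; the actual benchmark uses the
# attributes of CLEVRTex / Super-CLEVR / MOVi-C scenes.
shapes = ["cube", "sphere", "cylinder"]
materials = ["rubber", "metal", "fabric"]

all_combos = list(itertools.product(shapes, materials))

# Hold out a "diagonal" of combinations: each shape and each material
# still appears in training, but these exact pairings never do.
held_out = {("cube", "rubber"), ("sphere", "metal"), ("cylinder", "fabric")}
train_combos = [c for c in all_combos if c not in held_out]

# Sanity check: no test combination leaks into training, yet every
# individual property value is still seen during training.
assert not held_out & set(train_combos)
assert {s for s, _ in train_combos} == set(shapes)
assert {m for _, m in train_combos} == set(materials)
```

A model that merely memorizes attribute pairings fails on `held_out`; one that composes the familiar values generalizes, which is exactly what the benchmark measures.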
Original Abstract
Compositional generalization, the ability to reason about novel combinations of familiar concepts, is fundamental to human cognition and a critical challenge for machine learning. Object-centric (OC) representations, which encode a scene as a set of objects, are often argued to support such generalization, but systematic evidence in visually rich settings is limited. We introduce a Visual Question Answering benchmark across three controlled visual worlds (CLEVRTex, Super-CLEVR, and MOVi-C) to measure how well vision encoders, with and without object-centric biases, generalize to unseen combinations of object properties. To ensure a fair and comprehensive comparison, we carefully account for training data diversity, sample size, representation size, downstream model capacity, and compute. We use DINOv2 and SigLIP2, two widely used vision encoders, as the foundation models and their OC counterparts. Our key findings reveal that (1) OC approaches are superior in harder compositional generalization settings; (2) original dense representations surpass OC only on easier settings and typically require substantially more downstream compute; and (3) OC models are more sample efficient, achieving stronger generalization with fewer images, whereas dense encoders catch up or surpass them only with sufficient data and diversity. Overall, object-centric representations offer stronger compositional generalization when any one of dataset size, training data diversity, or downstream compute is constrained.