Multimodal Learning Relevance: 9/10

A Multimodal Framework for Aligning Human Linguistic Descriptions with Visual Perceptual Data

Joseph Bingham
arXiv: 2602.19562v1 Published: 2026-02-23 Updated: 2026-02-23

AI Summary

This paper proposes a multimodal framework for aligning human linguistic descriptions with visual perceptual data, and validates its effectiveness.

Key Contributions

  • Proposes a computational framework that integrates linguistic and visual information
  • Uses SIFT and UQI to approximate human perceptual categorization
  • Validates model performance on the Stanford Repeated Reference Game corpus

Methodology

Visual features are extracted by combining SIFT with the UQI; linguistic preprocessing and query transformation capture variability in referring expressions; together these steps align language with vision.
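The Universal Quality Index (UQI) mentioned above is the single-scale image-similarity measure of Wang and Bovik, combining correlation, luminance, and contrast distortion into one score in [-1, 1]. Below is a minimal NumPy sketch of it; the function name `uqi` and the global (whole-patch) computation are illustrative choices, not taken from the paper's code, which also pairs this score with SIFT keypoint alignment (e.g. via OpenCV's `cv2.SIFT_create`) before comparison.

```python
import numpy as np

def uqi(x: np.ndarray, y: np.ndarray) -> float:
    """Universal Quality Index (Wang & Bovik, 2002) between two
    equally sized grayscale patches. Returns a value in [-1, 1],
    with 1.0 meaning the patches are identical up to this measure."""
    x = x.astype(np.float64).ravel()
    y = y.astype(np.float64).ravel()
    mx, my = x.mean(), y.mean()          # mean luminances
    vx, vy = x.var(), y.var()            # variances (contrast)
    cov = ((x - mx) * (y - my)).mean()   # cross-covariance
    num = 4.0 * cov * mx * my
    den = (vx + vy) * (mx**2 + my**2)
    # Degenerate case: both patches constant with zero mean product
    return num / den if den != 0.0 else 1.0
```

In practice the index is usually computed over a sliding window and averaged; the global version above is enough to show the formula Q = 4·σ_xy·x̄·ȳ / ((σ_x² + σ_y²)(x̄² + ȳ²)).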

Original Abstract

Establishing stable mappings between natural language expressions and visual percepts is a foundational problem for both cognitive science and artificial intelligence. Humans routinely ground linguistic reference in noisy, ambiguous perceptual contexts, yet the mechanisms supporting such cross-modal alignment remain poorly understood. In this work, we introduce a computational framework designed to model core aspects of human referential interpretation by integrating linguistic utterances with perceptual representations derived from large-scale, crowd-sourced imagery. The system approximates human perceptual categorization by combining scale-invariant feature transform (SIFT) alignment with the Universal Quality Index (UQI) to quantify similarity in a cognitively plausible feature space, while a set of linguistic preprocessing and query-transformation operations captures pragmatic variability in referring expressions. We evaluate the model on the Stanford Repeated Reference Game corpus (15,000 utterances paired with tangram stimuli), a paradigm explicitly developed to probe human-level perceptual ambiguity and coordination. Our framework achieves robust referential grounding. It requires 65% fewer utterances than human interlocutors to reach stable mappings and can correctly identify target objects from single referring expressions 41.66% of the time (versus 20% for humans). These results suggest that relatively simple perceptual-linguistic alignment mechanisms can yield human-competitive behavior on a classic cognitive benchmark, and offer insights into models of grounded communication, perceptual inference, and cross-modal concept formation. Code is available at https://anonymous.4open.science/r/metasequoia-9D13/README.md .

Tags

Multimodal Learning  Vision-Language  Reference Resolution  Cognitive Modeling

arXiv Categories

cs.AI cs.CV