Multimodal Learning Relevance: 8/10

Cross-Modal Visuo-Tactile Object Perception

Anirvan Dutta, Simone Tasciotti, Claudia Cusseddu, Ang Li, Panayiota Poirazi, Julijana Gjorgjieva, Etienne Burdet, Patrick van der Smagt, Mohsen Kaboli
arXiv: 2604.02108v1 Published: 2026-04-02 Updated: 2026-04-02

AI Summary

Proposes the Cross-Modal Latent Filter (CMLF) model for estimating physical object properties through robotic visuo-tactile fusion.

Key Contributions

  • Proposes the CMLF model for visuo-tactile fusion
  • Supports bidirectional transfer of priors between vision and touch
  • Integrates sensory evidence through a Bayesian inference process (a generic formulation is sketched below)
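
As a reference point, the Bayesian integration in the last bullet can be written as a standard recursive filter with one likelihood term per modality. The notation below (latent properties $z_t$, vision observations $o^{v}_t$, tactile observations $o^{\tau}_t$) is introduced here for illustration and assumes conditionally independent modalities; the paper's exact factorization may differ:

```latex
p(z_t \mid o^{v}_{1:t}, o^{\tau}_{1:t}) \;\propto\;
    p(o^{v}_t \mid z_t)\, p(o^{\tau}_t \mid z_t)
    \int p(z_t \mid z_{t-1})\,
         p(z_{t-1} \mid o^{v}_{1:t-1}, o^{\tau}_{1:t-1})\, dz_{t-1}
```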

Methodology

Uses the Cross-Modal Latent Filter (CMLF) model to learn a latent state-space of physical properties via Bayesian inference.
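
A minimal numerical sketch of this idea, assuming a scalar latent property and Gaussian beliefs (a plain Kalman-style filter, not the paper's learned CMLF): the posterior after the vision update serves as the prior for the tactile update, which is the simplest form of cross-modal prior transfer.

```python
# Hypothetical two-modality Bayesian latent filter; the actual CMLF
# learns a structured latent state-space, whereas this sketch tracks a
# single scalar property (e.g., stiffness) with a Gaussian belief.
import numpy as np

def predict(mu, var, process_var=0.01):
    """Diffuse the belief over time (random-walk transition)."""
    return mu, var + process_var

def update(mu, var, obs, obs_var):
    """Bayesian (Kalman) update of a Gaussian belief with one observation."""
    k = var / (var + obs_var)                       # Kalman gain
    return mu + k * (obs - mu), (1 - k) * var

rng = np.random.default_rng(0)
true_stiffness = 2.0
mu, var = 0.0, 10.0                                 # broad initial prior

for t in range(20):
    mu, var = predict(mu, var)
    # Vision: coarse, noisier estimate of the property.
    z_vis = true_stiffness + rng.normal(0, 1.0)
    mu, var = update(mu, var, z_vis, obs_var=1.0)   # posterior -> prior for touch
    # Touch: more informative once in contact.
    z_tac = true_stiffness + rng.normal(0, 0.3)
    mu, var = update(mu, var, z_tac, obs_var=0.09)

print(f"estimate={mu:.2f}, std={np.sqrt(var):.2f}")  # converges near 2.0
```

Running this, the belief converges to the true value with shrinking variance, illustrating how uncertainty about a latent property can evolve as visuo-tactile evidence accumulates.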

Original Abstract

Estimating physical properties is critical for safe and efficient autonomous robotic manipulation, particularly during contact-rich interactions. In such settings, vision and tactile sensing provide complementary information about object geometry, pose, inertia, stiffness, and contact dynamics, such as stick-slip behavior. However, these properties are only indirectly observable and cannot always be modeled precisely (e.g., deformation in non-rigid objects coupled with nonlinear contact friction), making the estimation problem inherently complex and requiring sustained exploitation of visuo-tactile sensory information during action. Existing visuo-tactile perception frameworks have primarily emphasized forceful sensor fusion or static cross-modal alignment, with limited consideration of how uncertainty and beliefs about object properties evolve over time. Inspired by human multi-sensory perception and active inference, we propose the Cross-Modal Latent Filter (CMLF) to learn a structured, causal latent state-space of physical object properties. CMLF supports bidirectional transfer of cross-modal priors between vision and touch and integrates sensory evidence through a Bayesian inference process that evolves over time. Real-world robotic experiments demonstrate that CMLF improves the efficiency and robustness of latent physical property estimation under uncertainty compared to baseline approaches. Beyond performance gains, the model exhibits perceptual coupling phenomena analogous to those observed in humans, including susceptibility to cross-modal illusions and similar trajectories in learning cross-sensory associations. Together, these results constitute a significant step toward generalizable, robust, and physically consistent cross-modal integration for robotic multi-sensory perception.

Tags

Robotics, Vision, Touch, Multimodal Fusion

arXiv Categories

cs.RO cs.LG