Multimodal Learning Relevance: 7/10

A Computationally Efficient Multidimensional Vision Transformer

Alaa El Ichi, Khalide Jbilou
arXiv: 2602.19982v1 Published: 2026-02-23 Updated: 2026-02-23

AI Summary

Proposes an efficient Vision Transformer based on the tensor cosine product (Cproduct), reducing computational and memory costs.

Key Contributions

  • Proposes a Transformer framework built on the tensor cosine product
  • Designs a new Cproduct-based Vision Transformer architecture (TCP-ViT)
  • Experiments show a 1/C parameter reduction (C = number of channels) while maintaining accuracy

Methodology

Exploits the multilinear structure inherent in image data and the orthogonality of cosine transforms to construct efficient attention mechanisms and structured feature representations.
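The paper does not spell out the Cproduct's definition here, but tensor products of this family are typically computed as a facewise matrix product in a transform domain. A minimal sketch, assuming the construction mirrors the well-known t-product with the orthonormal DCT-II replacing the FFT along the third mode (the function name `cproduct` and the shapes are illustrative, not from the paper):

```python
import numpy as np
from scipy.fft import dct, idct


def cproduct(A, B):
    """Sketch of a tensor cosine product of A (n1 x n2 x n3) and B (n2 x n4 x n3).

    Assumed recipe: DCT along mode 3, multiply corresponding frontal
    slices, then invert the transform. The orthonormal scaling
    (norm="ortho") makes the transform an isometry.
    """
    Ahat = dct(A, type=2, axis=2, norm="ortho")
    Bhat = dct(B, type=2, axis=2, norm="ortho")
    # Facewise product: one matrix multiply per frontal slice k.
    Chat = np.einsum("ijk,jlk->ilk", Ahat, Bhat)
    return idct(Chat, type=2, axis=2, norm="ortho")
```

Because the product reduces to independent slice-wise matrix multiplies in the transform domain, it is associative and parallelizes trivially across slices.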

Original Abstract

Vision Transformers have achieved state-of-the-art performance in a wide range of computer vision tasks, but their practical deployment is limited by high computational and memory costs. In this paper, we introduce a novel tensor-based framework for Vision Transformers built upon the Tensor Cosine Product (Cproduct). By exploiting multilinear structures inherent in image data and the orthogonality of cosine transforms, the proposed approach enables efficient attention mechanisms and structured feature representations. We develop the theoretical foundations of the tensor cosine product, analyze its algebraic properties, and integrate it into a new Cproduct-based Vision Transformer architecture (TCP-ViT). Numerical experiments on standard classification and segmentation benchmarks demonstrate that the proposed method achieves a uniform 1/C parameter reduction (where C is the number of channels) while maintaining competitive accuracy.
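The abstract's appeal to "the orthogonality of cosine transforms" refers to a standard property: the orthonormally scaled DCT-II is an orthogonal linear map, so it preserves inner products and norms and its inverse is simply its transpose. A quick numerical check:

```python
import numpy as np
from scipy.fft import dct

n = 8
# Build the DCT-II matrix by applying the transform to the identity.
D = dct(np.eye(n), type=2, axis=0, norm="ortho")
# Orthogonality: D^T D = I, hence inverting the transform costs only a transpose
# and transform-domain computations neither inflate nor shrink norms.
print(np.allclose(D.T @ D, np.eye(n)))
```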

Tags

Vision Transformer · Tensor Cosine Product · Efficient Computation · Image Classification · Image Segmentation

arXiv Categories

cs.LG math.NA