Multimodal Learning Relevance: 7/10

Scaling Vision Transformers: Evaluating DeepSpeed for Image-Centric Workloads

Huy Trinh, Rebecca Ma, Zeqi Yu, Tahsin Reza
arXiv: 2602.21081v1 Published: 2026-02-24 Updated: 2026-02-24

AI Summary

Uses DeepSpeed to accelerate distributed training of Vision Transformers on image tasks, evaluating its performance and scalability.

Key Contributions

  • Evaluates DeepSpeed's acceleration of ViT training
  • Analyzes training efficiency across different GPU configurations
  • Explores the impact of software parameters on distributed training

Methodology

Experiments on the CIFAR datasets evaluate DeepSpeed's impact on ViT training speed, communication overhead, and scalability across different GPU configurations.
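The software parameters the study varies (batch size, gradient accumulation) map onto keys in DeepSpeed's training configuration. A minimal sketch of such a config as a Python dict, with illustrative values that are assumptions rather than the paper's actual settings:

```python
# Hypothetical DeepSpeed config for a ViT run (values are illustrative only).
# DeepSpeed enforces the identity:
#   train_batch_size ==
#     train_micro_batch_size_per_gpu * gradient_accumulation_steps * world_size
ds_config = {
    "train_batch_size": 256,
    "train_micro_batch_size_per_gpu": 32,
    "gradient_accumulation_steps": 2,
    "fp16": {"enabled": True},
}

world_size = 4  # assumed number of GPUs in this example

# Sanity-check the batch-size identity before launching training.
assert ds_config["train_batch_size"] == (
    ds_config["train_micro_batch_size_per_gpu"]
    * ds_config["gradient_accumulation_steps"]
    * world_size
)
```

In an actual run this dict would be passed to `deepspeed.initialize`; the identity check mirrors the validation DeepSpeed itself performs, and explains why batch size and gradient accumulation cannot be varied independently of GPU count.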

Original Abstract

Vision Transformers (ViTs) have demonstrated remarkable potential in image processing tasks by utilizing self-attention mechanisms to capture global relationships within data. However, their scalability is hindered by significant computational and memory demands, especially for large-scale models with many parameters. This study aims to leverage DeepSpeed, a highly efficient distributed training framework that is commonly used for language models, to enhance the scalability and performance of ViTs. We evaluate intra- and inter-node training efficiency across multiple GPU configurations on datasets such as CIFAR-10 and CIFAR-100, exploring the impact of distributed data parallelism on training speed, communication overhead, and overall scalability (strong and weak scaling). By systematically varying software parameters, such as batch size and gradient accumulation, we identify key factors influencing the performance of distributed training. The experiments in this study provide a foundational basis for applying DeepSpeed to image-related tasks. Future work will extend these investigations to deepen our understanding of DeepSpeed's limitations and explore strategies for optimizing distributed training pipelines for Vision Transformers.
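The strong- and weak-scaling notions the abstract evaluates have standard definitions that can be computed directly from per-epoch timings. A minimal sketch, with hypothetical timings that are not results from the paper:

```python
def strong_scaling_efficiency(t1: float, tn: float, n: int) -> float:
    """Strong scaling: total problem size fixed, split across n workers.
    Ideal time on n workers is t1 / n; efficiency = ideal / measured."""
    return t1 / (n * tn)

def weak_scaling_efficiency(t1: float, tn: float) -> float:
    """Weak scaling: per-worker problem size fixed as workers are added.
    Ideal time stays t1; efficiency = t1 / measured."""
    return t1 / tn

# Hypothetical timings (seconds per epoch), for illustration only:
# 100 s on 1 GPU, 30 s on 4 GPUs with the same total workload.
print(strong_scaling_efficiency(100.0, 30.0, 4))  # ~0.833
# 100 s on 1 GPU, 125 s on 4 GPUs with 4x the total workload.
print(weak_scaling_efficiency(100.0, 125.0))      # 0.8
```

Efficiencies below 1.0 reflect the communication overhead (e.g. gradient all-reduce in data parallelism) that the study measures across intra- and inter-node configurations.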

Tags

Vision Transformer DeepSpeed Distributed Training Image Processing

arXiv Categories

cs.LG eess.SP