Shape and Substance: Dual-Layer Side-Channel Attacks on Local Vision-Language Models
AI Summary
For local Vision-Language Models, the paper proposes a dual-layer side-channel attack that leaks both the geometry and the semantic content of input images.
Key Contributions
- Reveals an inherent side-channel vulnerability introduced by Dynamic High-Resolution preprocessing
- Proposes a dual-layer attack framework based on execution time and cache contention
- Analyzes the security-engineering trade-offs of mitigations and offers practical design recommendations
Methodology
By measuring execution time and Last-Level Cache (LLC) contention, an attacker can infer, respectively, the geometry and the semantic content of the input image, realizing a two-tier side-channel attack.
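The root cause is that AnyRes-style preprocessing picks a tiling grid from the image's aspect ratio and resolution, so the number of vision patches (and hence the encoder workload) varies with input geometry. The sketch below is a minimal, illustrative reimplementation of this grid selection; the constants and candidate grids are assumptions modeled loosely on LLaVA-NeXT-style preprocessing, not the paper's code.

```python
PATCH = 336  # assumed base vision-encoder resolution (illustrative)
GRIDS = [(1, 1), (1, 2), (2, 1), (2, 2), (1, 3), (3, 1)]  # assumed candidate grids

def best_grid(w: int, h: int) -> tuple[int, int]:
    """Pick the grid maximizing effective resolution, then minimizing waste."""
    best, best_eff, best_waste = None, -1, None
    for gw, gh in GRIDS:
        cw, ch = gw * PATCH, gh * PATCH
        scale = min(cw / w, ch / h)            # downscale to fit the candidate canvas
        eff = min(int(w * scale) * int(h * scale), w * h)
        waste = cw * ch - eff
        if eff > best_eff or (eff == best_eff and waste < best_waste):
            best, best_eff, best_waste = (gw, gh), eff, waste
    return best

def num_patches(w: int, h: int) -> int:
    """Grid tiles plus one downscaled global view."""
    gw, gh = best_grid(w, h)
    return gw * gh + 1

# Different aspect ratios produce different workloads -- the side-channel signal:
print(num_patches(672, 672))   # square image: 2x2 grid -> 5 patches
print(num_patches(1008, 336))  # wide image:   3x1 grid -> 4 patches
```

Because the patch count feeds directly into encoder runtime, an observer who can time inference can distinguish these geometries without ever seeing the image.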
Original Abstract
On-device Vision-Language Models (VLMs) promise data privacy via local execution. However, we show that the architectural shift toward Dynamic High-Resolution preprocessing (e.g., AnyRes) introduces an inherent algorithmic side-channel. Unlike static models, dynamic preprocessing decomposes images into a variable number of patches based on their aspect ratio, creating workload-dependent inputs. We demonstrate a dual-layer attack framework against local VLMs. In Tier 1, an unprivileged attacker can exploit significant execution-time variations using standard unprivileged OS metrics to reliably fingerprint the input's geometry. In Tier 2, by profiling Last-Level Cache (LLC) contention, the attacker can resolve semantic ambiguity within identical geometries, distinguishing between visually dense (e.g., medical X-rays) and sparse (e.g., text documents) content. By evaluating state-of-the-art models such as LLaVA-NeXT and Qwen2-VL, we show that combining these signals enables reliable inference of privacy-sensitive contexts. Finally, we analyze the security engineering trade-offs of mitigating this vulnerability, reveal substantial performance overhead with constant-work padding, and propose practical design recommendations for secure Edge AI deployments.
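To make the Tier 1 attack concrete: if latency grows roughly linearly with the patch count, a few coarse timing observations suffice to fingerprint the input's geometry. The sketch below is hypothetical; the per-patch cost, base latency, and sample values are invented for illustration and are not measurements from the paper.

```python
import statistics

def classify_geometry(latencies_ms: list[float],
                      per_patch_ms: float = 120.0,  # assumed per-patch encoder cost
                      base_ms: float = 200.0) -> int:  # assumed fixed overhead
    """Estimate the patch count from the median observed inference latency,
    assuming latency ~ base + per_patch * num_patches."""
    med = statistics.median(latencies_ms)
    return round((med - base_ms) / per_patch_ms)

# Five noisy latency observations of a hypothetical 5-patch (square) input:
samples = [810.0, 795.5, 802.3, 820.1, 799.0]
print(classify_geometry(samples))  # -> 5
```

The median makes the estimate robust to occasional scheduling noise; the paper's actual attack uses standard unprivileged OS metrics rather than wall-clock timing alone.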