Spatio-Temporal Token Pruning for Efficient High-Resolution GUI Agents
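AI Summary
To address the efficiency bottleneck of high-resolution GUI agents, the paper proposes GUIPruner, a framework for efficient token pruning.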
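Main Contributions
- Proposes Temporal-Adaptive Resolution (TAR) to address temporal redundancy
- Proposes Stratified Structure-aware Pruning (SSP) to address the spatial topology conflict
- GUIPruner maintains performance under high compression ratios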
Methodology
GUIPruner uses TAR to eliminate temporal redundancy and SSP to preserve spatial structure, and it requires no training.
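As a rough illustration of the decay-based resizing idea behind TAR, the sketch below assigns each historical screenshot a resize factor that shrinks with its age, then estimates how many visual tokens each frame would cost. The summary does not specify the decay schedule, so the exponential form, the names `tar_scales` and `visual_tokens`, and the parameters `decay`, `min_scale`, and `patch` are all assumptions made for illustration, not the paper's actual settings.

```python
import math

def tar_scales(num_history, decay=0.5, min_scale=0.25):
    """Return one resize factor per history frame, oldest frame first.

    Hypothetical schedule: scale = decay ** age, floored at min_scale,
    where age 1 is the most recent past frame.
    """
    return [max(min_scale, decay ** age) for age in range(num_history, 0, -1)]

def visual_tokens(width, height, scale=1.0, patch=28):
    """Approximate token count for a frame resized by `scale`.

    Assumes one token per `patch` x `patch` pixel block (an assumed value).
    """
    cols = max(1, math.ceil(width * scale / patch))
    rows = max(1, math.ceil(height * scale / patch))
    return rows * cols

if __name__ == "__main__":
    scales = tar_scales(3)
    full = visual_tokens(1120, 560)
    for age, s in zip(range(3, 0, -1), scales):
        print(f"age {age}: scale {s}, tokens {visual_tokens(1120, 560, s)} of {full}")
```

Under this assumed schedule, older frames contribute far fewer tokens than the current screenshot, which mirrors the "fading memory" intuition: recent history stays relatively sharp while distant history is kept only at coarse resolution.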
Original Abstract
Pure-vision GUI agents provide universal interaction capabilities but suffer from severe efficiency bottlenecks due to the massive spatiotemporal redundancy inherent in high-resolution screenshots and historical trajectories. We identify two critical misalignments in existing compression paradigms: the temporal mismatch, where uniform history encoding diverges from the agent's "fading memory" attention pattern, and the spatial topology conflict, where unstructured pruning compromises the grid integrity required for precise coordinate grounding, inducing spatial hallucinations. To address these challenges, we introduce GUIPruner, a training-free framework tailored for high-resolution GUI navigation. It synergizes Temporal-Adaptive Resolution (TAR), which eliminates historical redundancy via decay-based resizing, and Stratified Structure-aware Pruning (SSP), which prioritizes interactive foregrounds and semantic anchors while safeguarding global layout. Extensive evaluations across diverse benchmarks demonstrate that GUIPruner consistently achieves state-of-the-art performance, effectively preventing the collapse observed in large-scale models under high compression. Notably, on Qwen2-VL-2B, our method delivers a 3.4x reduction in FLOPs and a 3.3x speedup in vision encoding latency while retaining over 94% of the original performance, enabling real-time, high-precision navigation with minimal resource consumption.
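One way to picture structure-aware pruning is as a keep-mask over the token grid rather than an unordered list of surviving tokens, so that every retained token keeps its row and column position. The sketch below is a simplified stand-in and not the SSP algorithm itself: it first reserves a coarse uniform grid of tokens to preserve global layout, then spends the remaining budget on the highest-scoring tokens. The function name `stratified_keep_mask`, the `scores` input (assumed to be some per-token importance measure), and the parameters `keep_ratio` and `stride` are hypothetical.

```python
def stratified_keep_mask(scores, keep_ratio=0.5, stride=2):
    """Return a boolean grid marking which tokens to keep.

    scores: 2D list of per-token importance values (assumed input).
    Stratum 1 keeps every `stride`-th token in both directions (layout).
    Stratum 2 fills the remaining budget with the highest-scoring tokens.
    If the layout grid alone exceeds the budget, it is kept in full.
    """
    rows, cols = len(scores), len(scores[0])
    budget = max(1, int(rows * cols * keep_ratio))

    # Stratum 1: coarse uniform grid that preserves the overall layout.
    keep = {(r, c) for r in range(0, rows, stride) for c in range(0, cols, stride)}

    # Stratum 2: add the most important remaining tokens until the budget is met.
    ranked = sorted(
        ((scores[r][c], r, c) for r in range(rows) for c in range(cols)),
        reverse=True,
    )
    for _, r, c in ranked:
        if len(keep) >= budget:
            break
        keep.add((r, c))

    # Emit a mask in grid order so positions stay aligned with coordinates.
    return [[(r, c) in keep for c in range(cols)] for r in range(rows)]

if __name__ == "__main__":
    demo = [[0, 0, 0, 0], [0, 9, 0, 0], [0, 0, 0, 0], [0, 0, 0, 8]]
    for row in stratified_keep_mask(demo):
        print("".join("#" if kept else "." for kept in row))
```

Returning a mask in grid order, instead of a compacted token list, is the design choice that matters here: downstream coordinate prediction can still map each kept token back to a screen location.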