LLM Reasoning Relevance: 8/10

SERQ: Saliency-Aware Low-Rank Error Reconstruction for LLM Quantization

Yeonsik Park, Hyeonseong Kim, Seungkyu Choi
arXiv: 2603.08185v1 Published: 2026-03-09 Updated: 2026-03-09

AI Summary

SERQ proposes a saliency-aware low-rank error reconstruction method for LLM quantization that improves model accuracy at low precision.

Key Contributions

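  • Proposes SERQ, a saliency-aware low-rank error reconstruction method
  • Uses a single low-rank compensation matrix, reducing intermediate quantization steps at inference time
  • Jointly mitigates quantization error through three stages: static activation flattening, saliency-aware error reconstruction, and offline weight permutation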

Methodology

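SERQ combines static activation flattening, saliency-aware error reconstruction, and offline weight permutation, and compensates for quantization error with a single low-rank matrix, reducing extra computation at inference time.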
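
For orientation, here is a minimal NumPy sketch of the conventional two-factor low-rank error reconstruction that the abstract contrasts SERQ with. It is not SERQ's single-matrix construction, which this summary does not specify. The helper names (`quantize_rows`, `error_factors`), the toy shapes, the per-row symmetric quantizer, and the rank of 8 are all illustrative assumptions.

```python
import numpy as np

def quantize_rows(w, bits=4):
    """Per-row symmetric round-to-nearest quantization (returned dequantized)."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for signed 4-bit
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def error_factors(w, w_q, rank):
    """Truncated SVD of the quantization error E = W - Q(W), so E ~= A @ B."""
    u, s, vt = np.linalg.svd(w - w_q, full_matrices=False)
    return u[:, :rank] * s[:rank], vt[:rank, :]     # A: (out, r), B: (r, in)

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 128))    # toy weight matrix (out_features, in_features)
x = rng.normal(size=(128, 16))    # toy activations (in_features, tokens)

w_q = quantize_rows(w)
a, b = error_factors(w, w_q, rank=8)

y_ref = w @ x
t = b @ x                         # intermediate result of the first factor
y_comp = w_q @ x + a @ t          # quantized path plus low-rank correction

err_plain = np.linalg.norm(y_ref - w_q @ x)
err_comp = np.linalg.norm(y_ref - y_comp)
print(f"output error without correction: {err_plain:.3f}, with: {err_comp:.3f}")
```

The intermediate `t` is kept in floating point here. In a fully low-precision pipeline it would need its own quantization step between the two factors, which is the overhead the abstract attributes to two-factor designs.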

Original Abstract

Post-training quantization (PTQ) has emerged as a prevailing technique for deploying large language models (LLMs) efficiently in terms of both memory and computation, across edge devices and server platforms. Existing PTQ methods primarily aim to reduce precision in weights and activations by mitigating quantization errors caused by channel-wise outlier activations (e.g., pre-quantization scaling, online transformations, or low-rank error reconstruction). Among these approaches, error reconstruction with low-rank adaptation (LoRA) has proven particularly effective, as it introduces a lightweight auxiliary computation path without requiring heavy optimization or additional online layers. However, prior studies reveal severe accuracy degradation under W4A4 settings, and conventional low-rank adaptations rely on two sequential factors, necessitating intermediate quantization during inference and thereby limiting low-precision efficiency. In this work, we propose SERQ, a saliency-aware error reconstruction method for low-bit LLM inference that employs a single low-rank compensation matrix. SERQ preserves efficient 4-bit matrix multiplication in linear layers by jointly mitigating quantization errors arising from both activation and weight saliency through three stages: (1) static activation flattening, (2) saliency-aware error reconstruction, and (3) offline weight permutation. The method incurs additional computation only for low-rank error reconstruction via a single decomposition, while all other operations are performed offline, thereby keeping latency overhead minimal. Empirically, SERQ outperforms prior error reconstruction methods under both W4A8 and W4A4 settings, and achieves higher accuracy than state-of-the-art rotation-based W4A4 approaches, while substantially reducing calibration complexity.
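
The abstract attributes much of the quantization error to channel-wise outlier activations. The toy sketch below shows why: with one shared 4-bit scale, a single large channel sets the step size for every other channel. It then applies a generic per-channel pre-quantization scaling folded into the weights. This is an illustration of that general idea, not SERQ's static activation flattening stage, and the shapes, the outlier magnitude, and the helper `quantize_per_tensor` are assumptions made for the example.

```python
import numpy as np

def quantize_per_tensor(x, bits=4):
    """Symmetric round-to-nearest with one scale for the whole tensor (returned dequantized)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 64))     # toy activations (tokens, channels)
x[:, 0] *= 50.0                   # one outlier channel, hypothetical magnitude
w = rng.normal(size=(16, 64))     # toy weights (out_features, channels), left unquantized

y_ref = x @ w.T

# Baseline: quantize the raw activations; the outlier channel dictates the step size.
err_plain = np.linalg.norm(y_ref - quantize_per_tensor(x) @ w.T)

# Per-channel scaling: divide each channel by its max magnitude and fold the same
# factor into the weights, so (x / s) @ (w * s).T equals x @ w.T before quantization.
s = np.abs(x).max(axis=0)         # taken from x itself for brevity
err_scaled = np.linalg.norm(y_ref - quantize_per_tensor(x / s) @ (w * s).T)

print(f"output error, raw activations: {err_plain:.2f}, scaled: {err_scaled:.2f}")
```

Weights stay in full precision here to isolate the activation effect. In practice the scaled weight column for the outlier channel becomes harder to quantize, and a static scheme would take the per-channel factors from calibration data rather than from the current input.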

Tags

LLM quantization, low-rank adaptation, error reconstruction, low-bit inference

arXiv Categories

cs.LG