LLM Reasoning relevance: 7/10

RAMP: Reinforcement Adaptive Mixed Precision Quantization for Efficient On Device LLM Inference

Arpit Singh Gautam, Saurabh Jha
arXiv: 2603.17891v1 Published: 2026-03-18 Updated: 2026-03-18

AI Summary

Proposes RAMP, a reinforcement learning based adaptive mixed-precision quantization method that improves LLM inference efficiency on resource-constrained devices.

Key Contributions

  • Proposes the RAMP framework, which uses reinforcement learning to automatically assign per-layer bit widths
  • Introduces the Scale Folding technique to enable stable sub-4-bit quantization
  • The learned policy generalizes zero-shot across model families and scales

Methodology

An off-policy Soft Actor-Critic framework learns per-layer bit widths conditioned on model properties, and the Scale Folding technique preconditions the weights to improve quantization quality.
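To make the per-layer allocation concrete, here is a minimal sketch of assigning bit widths under a global effective-bit budget. This is not RAMP's learned SAC policy; the greedy heuristic, function names, and the sensitivity scores are all illustrative assumptions that only show what "per-layer bit widths under a budget" means.

```python
def effective_bits(assignment, layer_params):
    """Parameter-weighted average bit width across layers."""
    total = sum(layer_params)
    return sum(b * n for b, n in zip(assignment, layer_params)) / total

def greedy_assign(sensitivities, layer_params, budget_bits=3.65):
    """Illustrative stand-in for the learned policy (NOT RAMP itself):
    start every layer at 4 bits, then demote the least quantization-
    sensitive layers to 3 bits until the budget is satisfied."""
    order = sorted(range(len(sensitivities)), key=lambda i: sensitivities[i])
    assignment = [4] * len(sensitivities)
    for i in order:  # least sensitive layers demoted first
        if effective_bits(assignment, layer_params) <= budget_bits:
            break
        assignment[i] = 3
    return assignment

# Four equal-sized layers; layers 1 and 3 are least sensitive,
# so they are demoted first to reach the 3.5-bit budget.
bits = greedy_assign([0.9, 0.1, 0.5, 0.2], [100, 100, 100, 100],
                     budget_bits=3.5)
# bits == [4, 3, 4, 3]; effective bits == 3.5
```

RAMP replaces this hand-written heuristic with a policy trained by Soft Actor-Critic on per-layer feature embeddings, but the objective shape, minimizing quality loss subject to an effective-bit budget, is the same.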

Original Abstract

Post-training quantization is essential for deploying large language models (LLMs) on resource-constrained hardware, yet state-of-the-art methods enforce uniform bit widths across layers, yielding suboptimal accuracy-efficiency trade-offs. We present RAMP (Reinforcement Adaptive Mixed Precision), an off-policy Soft Actor-Critic framework that learns per-layer bit-width assignments to minimize perplexity under a global bit budget. The policy conditions on an 11-dimensional embedding of activation statistics, weight properties, and structural descriptors, enabling zero-shot transfer across model families and scales. To enable stable sub-4-bit quantization, we introduce Scale Folding, a preconditioning technique that migrates activation outliers into weights via per-channel scaling and normalization-layer compensation. A quality-prioritized reward with asymmetric penalties and budget cliffs drives rapid convergence. On Llama 2 7B, RAMP achieves 5.54 perplexity at 3.68 GB (3.65 effective bits), outperforming uniform 4-bit AWQ (5.60 at 3.90 GB) and GPTQ by 6% in size and 1% to 3% in quality. Critically, a policy trained only on Llama 2 7B generalizes zero-shot to Llama 2 13B and Mistral 7B, often surpassing target-specific training, supporting the hypothesis that quantization sensitivity is primarily architectural. The HALO pipeline exports allocations to GGUF format for kernel-free inference on CPUs, GPUs, and edge devices, retaining 99.5% of FP16 commonsense reasoning performance.
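The abstract describes Scale Folding only at a high level: activation outliers are migrated into weights via per-channel scaling. The sketch below shows the generic identity such preconditioning relies on; the scale formula (`act_absmax ** alpha`), the `alpha` value, and all names are assumptions, not RAMP's actual procedure, and the normalization-layer compensation step is omitted.

```python
import numpy as np

def fold_scales(W, act_absmax, alpha=0.5):
    """Per-channel scale folding sketch (illustrative, not RAMP's exact
    method). For a linear layer y = x @ W.T, replacing x -> x / s and
    W[:, j] -> W[:, j] * s[j] leaves the output mathematically unchanged,
    while shrinking outlier activation channels before quantization."""
    s = np.maximum(act_absmax ** alpha, 1e-8)  # per-input-channel scales
    return W * s[None, :], s                   # scales absorbed into W

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
x[:, 0] *= 20.0                     # inject one outlier activation channel
W = rng.standard_normal((16, 8))

W_folded, s = fold_scales(W, np.abs(x).max(axis=0))
y_ref = x @ W.T                     # original layer output
y_fold = (x / s) @ W_folded.T       # preconditioned layer output
assert np.allclose(y_ref, y_fold)   # equivalence holds before quantization
```

After folding, the scaled activations `x / s` have a flatter per-channel range, so a low-bit quantizer wastes less of its range on the outlier channel; the cost is moved into the weights, which tolerate quantization better.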

Tags

Quantization · Mixed Precision · Reinforcement Learning · LLM · Model Compression

arXiv Categories

cs.LG cs.AI