Secure Linear Alignment of Large Language Models
AI Summary
We propose a privacy-preserving cross-model linear alignment framework that exploits the convergence of model representations to enable secure and efficient cross-model inference and text generation.
Key Contributions
- Proposes a privacy-preserving framework for cross-model inference
- Explores representational convergence between different language models
- Validates the effectiveness of linear alignment for embedding classification, OOD detection, and text generation
Methodology
An affine transformation between models is learned over a shared public dataset, and homomorphic encryption protects client queries during inference; only the linear alignment and classification operations are encrypted.
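The alignment step can be sketched in plain numpy: given final hidden states from two independent models over the same public examples, an affine map is fit by least squares and then applied to new client embeddings. The dimensions, synthetic embeddings, and bias-column trick below are illustrative assumptions, not the paper's implementation; in the actual framework the matrix multiply at inference time would run under homomorphic encryption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for final hidden states of two independent models
# over a shared public dataset (n examples, dims d_a and d_b are assumptions).
n, d_a, d_b = 500, 64, 32
H_a = rng.normal(size=(n, d_a))                         # "client" model embeddings
W_true = rng.normal(size=(d_a, d_b))
H_b = H_a @ W_true + 0.01 * rng.normal(size=(n, d_b))   # "server" model embeddings

# Learn an affine map H_a -> H_b by least squares; the appended ones
# column absorbs the bias term of the affine transformation.
X = np.hstack([H_a, np.ones((n, 1))])
M, *_ = np.linalg.lstsq(X, H_b, rcond=None)

# At inference time a new client embedding is mapped into the server
# model's representation space; this linear operation (plus a linear
# classifier) is what the framework evaluates under encryption.
h_new = rng.normal(size=(1, d_a))
h_mapped = np.hstack([h_new, np.ones((1, 1))]) @ M
```

Because only linear operations cross the trust boundary, they compose naturally with standard homomorphic-encryption schemes, which handle additions and plaintext-matrix multiplications cheaply.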
Original Abstract
Language models increasingly appear to learn similar representations, despite differences in training objectives, architectures, and data modalities. This emerging compatibility between independently trained models introduces new opportunities for cross-model alignment to downstream objectives. Moreover, it unlocks new potential application domains, such as settings where security, privacy, or competitive constraints prohibit direct data or model sharing. In this work, we propose a privacy-preserving framework that exploits representational convergence to enable cross-silo inference between independent language models. The framework learns an affine transformation over a shared public dataset and applies homomorphic encryption to protect client queries during inference. By encrypting only the linear alignment and classification operations, the method achieves sub-second inference latency while maintaining strong security guarantees. We support this framework with an empirical investigation into representational convergence, in which we learn linear transformations between the final hidden states of independent models. We evaluate these cross-model mappings on embedding classification and out-of-distribution detection, observing minimal performance degradation across model pairs. Additionally, we show for the first time that linear alignment sometimes enables text generation across independently trained models.