Automated Customization of LLMs for Enterprise Code Repositories Using Semantic Scopes
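AI Summary
Proposes an automated method for customizing LLMs to enterprise code repositories based on semantic scopes, improving code completion quality and developer productivity.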
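Main Contributions
- Proposes a method for customizing code LLMs based on semantic scopes
- Evaluates two customization strategies, RAG and FT, on enterprise code repositories
- Shows that moderately sized customized models can outperform much larger uncustomized models at code completion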
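Methodology
Repository data is parsed by semantic scope to build training data pairs. LLMs are then customized with RAG and FT, and their code completion performance is evaluated on enterprise code repositories.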
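As a rough illustration of turning semantic scopes into training pairs, the sketch below walks a Python file's syntax tree and emits one (prefix, completion) pair per scope body. The summary does not describe the paper's actual ingestion pipeline, so everything here is an assumption: the choice of Python's `ast` module, the hypothetical `build_pairs` helper, the set of node types counted as scopes, and the line-based splitting.

```python
import ast

# Node types treated as "semantic scopes" in this sketch (an assumption,
# not the paper's definition).
SCOPE_TYPES = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef,
               ast.If, ast.For, ast.While, ast.With, ast.Try)


def build_pairs(source: str) -> list[tuple[str, str]]:
    """Return (prefix, completion) pairs, one per multi-line scope body."""
    lines = source.splitlines(keepends=True)
    pairs = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, SCOPE_TYPES):
            continue
        first, last = node.body[0], node.body[-1]
        # Skip scopes whose body starts on the header line (e.g. "if x: return 1"),
        # because this sketch splits on whole lines only.
        if first.lineno == node.lineno:
            continue
        body_start = first.lineno - 1   # 0-based index of the first body line
        body_end = last.end_lineno      # exclusive index after the last body line
        prefix = "".join(lines[:body_start])               # everything before the body
        completion = "".join(lines[body_start:body_end])   # the scope body itself
        pairs.append((prefix, completion))
    return pairs


if __name__ == "__main__":
    sample = "def add(a, b):\n    total = a + b\n    return total\n"
    for prefix, completion in build_pairs(sample):
        print(repr(prefix), "->", repr(completion))
```

Under these assumptions, the same pairs could serve either strategy: as supervised examples for FT, or as indexed snippets that a RAG retriever looks up by prefix. A real pipeline would also need to handle other languages, context-length limits, and deduplication.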
Original Abstract
Code completion (CC) is a task frequently used by developers when working in collaboration with LLM-based programming assistants. Despite the increased performance of LLMs on public benchmarks, out-of-the-box LLMs still have a hard time generating code that aligns with a private code repository not previously seen in the model's training data. Customizing code LLMs to a private repository provides a way to improve model performance. In this paper we present our approach for automated LLM customization based on semantic scopes in the code. We evaluate LLMs on real industry cases with two private enterprise code repositories with two customization strategies: Retrieval-Augmented Generation (RAG) and supervised Fine-Tuning (FT). Our mechanism for ingesting the repository's data and formulating the training data pairs with semantic scopes helps models to learn the underlying patterns specific to the repository, providing more precise code to developers and helping to boost their productivity. The code completions of moderately sized customized models can be significantly better than those of uncustomized models of much larger capacity. We also include an analysis of customization on two public benchmarks and present opportunities for future work.