LLM Memory & RAG Relevance: 8/10

ArkTS-CodeSearch: A Open-Source ArkTS Dataset for Code Retrieval

Yulong He, Artem Ermakov, Sergey Kovalchuk, Artem Aliev, Dmitry Shalymov
arXiv: 2602.05550v1 Published: 2026-02-05 Updated: 2026-02-05

AI Summary

Builds a large-scale ArkTS code retrieval dataset and benchmark, and fine-tunes code embedding models to improve ArkTS code understanding.

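Key Contributions

  • Built a large-scale open-source ArkTS code retrieval dataset
  • Designed a code retrieval task driven by natural language comments
  • Fine-tuned existing code embedding models, improving ArkTS code understanding performance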
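
The comment-to-function retrieval task can be illustrated with a small scoring sketch. It assumes that comment i is paired with function i, that embeddings are compared by cosine similarity, and that quality is summarized by mean reciprocal rank (MRR). The summary does not state which metric the paper reports, so MRR and the toy 2-D vectors below are illustrative assumptions, not the authors' setup.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_reciprocal_rank(query_vecs, code_vecs):
    """query_vecs[i] is a comment embedding whose correct match is code_vecs[i]."""
    total = 0.0
    for i, query in enumerate(query_vecs):
        scores = [cosine(query, code) for code in code_vecs]
        # Rank of the correct function: 1 + number of candidates scoring strictly higher.
        rank = 1 + sum(1 for s in scores if s > scores[i])
        total += 1.0 / rank
    return total / len(query_vecs)


# Hypothetical toy embeddings; a real run would use a code embedding model.
comment_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
function_vecs = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.7]]
mrr = mean_reciprocal_rank(comment_vecs, function_vecs)
print(f"MRR = {mrr:.4f}")
```

In this toy case the second comment ranks its own function second, so the three reciprocal ranks are 1, 1/2, and 1.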

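Methodology

ArkTS repositories are crawled from GitHub and Gitee. Comment-function pairs are extracted with tree-sitter-arkts, then deduplicated and analyzed statistically. Finally, code embedding models are fine-tuned on the resulting data.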
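
The deduplication step might look roughly like the sketch below. The summary does not say how duplicates are detected, so the whitespace-normalized SHA-256 fingerprint, the record fields (`source`, `comment`, `function`), and the sample functions are all hypothetical choices made for illustration.

```python
import hashlib
import re


def fingerprint(code: str) -> str:
    """Hash a function body after collapsing whitespace (assumed duplicate criterion)."""
    normalized = re.sub(r"\s+", " ", code).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(pairs):
    """Keep the first occurrence of each function, regardless of hosting platform."""
    seen = set()
    unique = []
    for pair in pairs:
        key = fingerprint(pair["function"])
        if key not in seen:
            seen.add(key)
            unique.append(pair)
    return unique


# Hypothetical comment-function pairs; the first two differ only in formatting.
pairs = [
    {"source": "github", "comment": "Adds two numbers.",
     "function": "function add(a: number, b: number): number {\n  return a + b;\n}"},
    {"source": "gitee", "comment": "Adds two numbers.",
     "function": "function add(a: number, b: number): number { return a + b; }"},
    {"source": "gitee", "comment": "Returns the larger value.",
     "function": "function max(a: number, b: number): number { return a > b ? a : b; }"},
]
unique_pairs = deduplicate(pairs)
print(len(unique_pairs), "unique pairs")
```

Exact-match hashing only catches copies that differ in whitespace. Near-duplicates, such as renamed variables, would need a fuzzier comparison.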

Original Abstract

ArkTS is a core programming language in the OpenHarmony ecosystem, yet research on ArkTS code intelligence is hindered by the lack of public datasets and evaluation benchmarks. This paper presents a large-scale ArkTS dataset constructed from open-source repositories, targeting code retrieval and code evaluation tasks. We design a single-search task, where natural language comments are used to retrieve corresponding ArkTS functions. ArkTS repositories are crawled from GitHub and Gitee, and comment-function pairs are extracted using tree-sitter-arkts, followed by cross-platform deduplication and statistical analysis of ArkTS function types. We further evaluate all existing open-source code embedding models on the single-search task and perform fine-tuning using both ArkTS and TypeScript training datasets, resulting in a high-performing model for ArkTS code understanding. This work establishes the first systematic benchmark for ArkTS code retrieval. Both the dataset and our fine-tuned model will be released publicly and are available at https://huggingface.co/hreyulog/embedinggemma_arkts and https://huggingface.co/datasets/hreyulog/arkts-code-docstring.

Tags

ArkTS, Code Retrieval, Dataset, Code Embedding, Code Intelligence

arXiv Categories

cs.SE cs.CL