Multimodal Learning, relevance: 5/10

No One-Size-Fits-All: Building Systems For Translation to Bashkir, Kazakh, Kyrgyz, Tatar and Chuvash Using Synthetic And Original Data

Dmitry Karpov

arXiv: 2602.04442v1 Published: 2026-02-04 Updated: 2026-02-04


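AI Summary

This paper studies machine translation into five Turkic languages, using synthetic data and retrieval methods to improve translation quality.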

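Main Contributions

Builds machine translation systems for five Turkic languages
Fine-tunes models on synthetic data to improve translation quality
Uses retrieval to assist translation
Releases the dataset and model weights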

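Methodology

The paper combines LoRA fine-tuning of nllb-200-distilled-600M on synthetic data, prompting DeepSeek-V3.2 with retrieved similar examples, and zero-shot translation.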
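As a rough illustration of prompting with retrieved similar examples, the sketch below picks the most similar source sentences from a small translation memory and assembles a few-shot prompt. The summary does not say how the paper retrieves examples or formats prompts, so the character-trigram Dice similarity, the prompt layout, and the names `retrieve` and `build_prompt` are all assumptions. The target sides are placeholders, not real translations.

```python
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams of a lowercased string."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def similarity(a: str, b: str) -> float:
    """Dice coefficient over character trigram multisets (an illustrative choice)."""
    ga, gb = char_ngrams(a), char_ngrams(b)
    total = sum(ga.values()) + sum(gb.values())
    if total == 0:
        return 0.0
    overlap = sum((ga & gb).values())  # multiset intersection
    return 2 * overlap / total


def retrieve(query: str, memory: list[tuple[str, str]], k: int = 2) -> list[tuple[str, str]]:
    """Return the k (source, target) pairs whose source is most similar to the query."""
    ranked = sorted(memory, key=lambda pair: similarity(query, pair[0]), reverse=True)
    return ranked[:k]


def build_prompt(query: str, memory: list[tuple[str, str]],
                 src_lang: str, tgt_lang: str, k: int = 2) -> str:
    """Assemble a few-shot prompt from retrieved pairs (hypothetical format)."""
    lines = [f"Translate from {src_lang} to {tgt_lang}."]
    for src, tgt in retrieve(query, memory, k):
        lines.append(f"{src_lang}: {src}")
        lines.append(f"{tgt_lang}: {tgt}")
    lines.append(f"{src_lang}: {query}")
    lines.append(f"{tgt_lang}:")
    return "\n".join(lines)


# Placeholder translation memory: target sides are stand-ins, not real data.
memory = [
    ("The weather is nice today.", "<target translation A>"),
    ("Where is the nearest library?", "<target translation B>"),
    ("The weather was bad yesterday.", "<target translation C>"),
]
query = "The weather is nice tomorrow."
prompt = build_prompt(query, memory, "English", "Chuvash", k=2)
print(prompt)
```

The resulting string would be sent to the language model as-is. In practice the similarity function could be swapped for an embedding-based or BM25 retriever without changing the rest of the sketch.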

Original Abstract

We explore machine translation for five Turkic language pairs: Russian-Bashkir, Russian-Kazakh, Russian-Kyrgyz, English-Tatar, English-Chuvash. Fine-tuning nllb-200-distilled-600M with LoRA on synthetic data achieved chrF++ 49.71 for Kazakh and 46.94 for Bashkir. Prompting DeepSeek-V3.2 with retrieved similar examples achieved chrF++ 39.47 for Chuvash. For Tatar, zero-shot or retrieval-based approaches achieved chrF++ 41.6, while for Kyrgyz the zero-shot approach reached 45.6. We release the dataset and the obtained weights.
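The scores above are chrF++, a character n-gram F-score extended with word n-grams. The following is a simplified sketch of the idea, assuming the commonly used settings (character n-grams up to 6, word n-grams up to 2, beta = 2) and averaging precision and recall over all n-gram orders. It skips the tokenization and smoothing details of standard tools, so it will not reproduce the paper's numbers exactly; the function name `chrf_pp` is hypothetical.

```python
from collections import Counter


def ngrams(items, n: int) -> Counter:
    """Count n-grams over a string (characters) or a list (words)."""
    return Counter(tuple(items[i:i + n]) for i in range(len(items) - n + 1))


def chrf_pp(hypothesis: str, reference: str,
            char_order: int = 6, word_order: int = 2, beta: float = 2.0) -> float:
    """Simplified chrF++-style score in [0, 100] for one sentence pair."""
    # Simplification: character n-grams are taken with spaces removed.
    hyp_chars, ref_chars = hypothesis.replace(" ", ""), reference.replace(" ", "")
    hyp_words, ref_words = hypothesis.split(), reference.split()

    precisions, recalls = [], []
    for hyp, ref, max_n in ((hyp_chars, ref_chars, char_order),
                            (hyp_words, ref_words, word_order)):
        for n in range(1, max_n + 1):
            h, r = ngrams(hyp, n), ngrams(ref, n)
            if not h or not r:
                continue  # order n is longer than one of the inputs
            matches = sum((h & r).values())
            precisions.append(matches / sum(h.values()))
            recalls.append(matches / sum(r.values()))

    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    b2 = beta ** 2
    # F-beta with beta = 2 weights recall more heavily than precision.
    return 100 * (1 + b2) * p * r / (b2 * p + r)


identical = chrf_pp("the cat sat", "the cat sat")
partial = chrf_pp("the cat sat", "the cat sits")
disjoint = chrf_pp("abc", "xyz")
print(identical, partial, disjoint)
```

For reported results, an established implementation such as sacreBLEU is the safer choice, since corpus-level aggregation and edge cases differ from this sentence-level sketch.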

arXiv Categories

cs.CL cs.AI cs.LG
