Beyond Many-Shot Translation: Scaling In-Context Demonstrations For Low-Resource Machine Translation
AI Summary
This work studies long-context in-context learning for low-resource machine translation, characterizing its effective limits and its sensitivity to corpus type.
Main Contributions
- Explores long-context in-context learning for low-resource machine translation
- Compares the effectiveness of different corpus types used as in-context supervision
- Characterizes the relationship between context length and translation quality, and the influence of corpus type
Methodology
The study provides different types of corpora (monolingual, instruction-style, and parallel data) in long contexts and evaluates LLM translation performance on Javanese and Sundanese.
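The core mechanic of scaling ICL this way is packing as many demonstrations as fit into a fixed context budget. A minimal sketch of this idea is below; the function name, the prompt format, and the characters-per-token heuristic are illustrative assumptions, not details from the paper.

```python
def build_manyshot_prompt(pairs, source_sentence, token_budget=1_000_000,
                          chars_per_token=4):
    """Pack as many (source, target) demonstration pairs as fit within an
    approximate token budget, then append the sentence to translate.

    Hypothetical sketch: the 4-chars-per-token approximation stands in
    for a real tokenizer, and the Indonesian->Javanese framing mirrors
    one of the parallel-data settings described in the paper.
    """
    header = "Translate from Indonesian to Javanese.\n\n"
    query = f"Indonesian: {source_sentence}\nJavanese:"
    budget_chars = token_budget * chars_per_token - len(header) - len(query)

    shots = []
    used = 0
    for src, tgt in pairs:
        shot = f"Indonesian: {src}\nJavanese: {tgt}\n\n"
        if used + len(shot) > budget_chars:
            break  # stop once the context budget is exhausted
        shots.append(shot)
        used += len(shot)
    return header + "".join(shots) + query
```

Lowering `token_budget` drops later demonstrations first, which is one simple way to produce the different context lengths compared in a scaling study.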
Original Abstract
Building machine translation (MT) systems for low-resource languages is notably difficult due to the scarcity of high-quality data. Although Large Language Models (LLMs) have improved MT system performance, adapting them to lesser-represented languages remains challenging. In-context learning (ICL) may offer novel ways to adapt LLMs for low-resource MT by conditioning models on demonstrations at inference time. In this study, we explore scaling low-resource machine translation ICL beyond the few-shot setting to thousands of examples with long-context models. We scale the in-context token budget to 1M tokens and compare three types of training corpora used as in-context supervision: monolingual unsupervised data, instruction-style data, and parallel data (English--target and Indonesian--target). Our experiments on Javanese and Sundanese show that gains from additional context saturate quickly and can degrade near the maximum context window, with scaling behavior strongly dependent on corpus type. Notably, some forms of monolingual supervision can be competitive with parallel data, despite the latter offering additional supervision. Overall, our results characterize the effective limits and corpus-type sensitivity of long-context ICL for low-resource MT, highlighting that larger context windows do not necessarily yield proportional quality gains.