TAPO: Translation Augmented Policy Optimization for Multilingual Mathematical Reasoning
AI Summary
TAPO improves LLM performance on multilingual mathematical reasoning through translation-augmented policy optimization, addressing deficiencies in language understanding.
Key Contributions
- Proposes the TAPO framework, combining GRPO with a translation-augmentation strategy
- Introduces a step-level relative advantage mechanism that decouples understanding from reasoning
- Demonstrates experimentally that TAPO performs strongly on both multilingual mathematical reasoning and translation tasks
Methodology
A reinforcement-learning framework built on GRPO: the model uses English as a pivot for translation augmentation, and optimization employs a step-level relative advantage mechanism.
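The core of the step-level relative advantage idea can be sketched as follows. This is an illustrative reconstruction, not the paper's exact formulation: I assume each rollout in a GRPO group receives two scalar rewards (a hypothetical `translation` quality reward for the understanding step and a `reasoning` correctness reward), and each reward type is normalized within the group independently, so the translation signal does not conflict with the reasoning signal.

```python
import statistics


def step_level_relative_advantages(group_rewards):
    """Compute per-step relative advantages for a group of rollouts.

    group_rewards: list of dicts, one per rollout, each holding scalar
    rewards keyed by step name ('translation', 'reasoning' — names are
    illustrative). Each reward type is standardized within the group
    separately, mirroring GRPO's group-relative baseline but applied
    per step, so the two objectives stay decoupled.
    Returns one (translation_adv, reasoning_adv) tuple per rollout.
    """
    per_step = []
    for step in ("translation", "reasoning"):
        rewards = [r[step] for r in group_rewards]
        mean = statistics.mean(rewards)
        std = statistics.pstdev(rewards) or 1.0  # avoid divide-by-zero
        per_step.append([(x - mean) / std for x in rewards])
    # Regroup so each rollout gets its own pair of step advantages.
    return list(zip(*per_step))
```

In training, the translation-step advantage would weight the tokens of the English pivot translation and the reasoning-step advantage the subsequent solution tokens, which is how a translation-quality reward can be integrated without perturbing the reasoning objective.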
Original Abstract
Large Language Models (LLMs) have demonstrated remarkable proficiency in English mathematical reasoning, yet a significant performance disparity persists in multilingual contexts, largely attributed to deficiencies in language understanding. To bridge this gap, we introduce Translation-Augmented Policy Optimization (TAPO), a novel reinforcement learning framework built upon GRPO. TAPO enforces an explicit alignment strategy where the model leverages English as a pivot and follows an understand-then-reason paradigm. Crucially, we employ a step-level relative advantage mechanism that decouples understanding from reasoning, allowing the integration of translation quality rewards without introducing optimization conflicts. Extensive experiments reveal that TAPO effectively synergizes language understanding with reasoning capabilities and is compatible with various models. It outperforms baseline methods in both multilingual mathematical reasoning and translation tasks, while generalizing well to unseen languages and out-of-domain tasks.