Relevance to LLM Reasoning: 8/10

EuroLLM-22B: Technical Report

Miguel Moura Ramos, Duarte M. Alves, Hippolyte Gisserot-Boukhlef, João Alves, Pedro Henrique Martins, Patrick Fernandes, José Pombal, Nuno M. Guerreiro, Ricardo Rei, Nicolas Boizard, Amin Farajian, Mateusz Klimaszewski, José G. C. de Souza, Barry Haddow, François Yvon, Pierre Colombo, Alexandra Birch, André F. T. Martins
arXiv: 2602.05879v1 Published: 2026-02-05 Updated: 2026-02-05

AI Summary

EuroLLM-22B is a large language model supporting a wide range of European languages. It achieves performance competitive with models of comparable size, and its training data and code are open-sourced.

Main Contributions

  • Trained a 22B-parameter LLM covering all 24 official European Union languages plus 11 additional languages
  • Open-sourced the multilingual web pretraining data and the updated EuroBlocks instruction-tuning datasets
  • Released the pre-training and evaluation codebases

Methodology

The model is trained from scratch; the report covers tokenizer design, architecture choices, data filtering, and the training procedure, and evaluates the model on a broad set of multilingual benchmarks.
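To make the data-filtering step concrete, here is a minimal sketch of the kind of per-document filter used in multilingual pretraining pipelines. All names and thresholds (`keep_document`, `min_chars`, `max_digit_ratio`) are illustrative assumptions, not details taken from the EuroLLM-22B report, which should be consulted for the actual criteria.

```python
# Hypothetical sketch of a multilingual pretraining data filter:
# keep a document only if its language is in the target set and it
# passes simple length and noise heuristics. Illustrative only.

def keep_document(doc: dict,
                  allowed_langs: set,
                  min_chars: int = 200,
                  max_digit_ratio: float = 0.3) -> bool:
    text = doc["text"]
    if doc["lang"] not in allowed_langs:      # language-ID filter
        return False
    if len(text) < min_chars:                 # drop very short docs
        return False
    digits = sum(ch.isdigit() for ch in text)
    if digits / max(len(text), 1) > max_digit_ratio:  # numeric noise
        return False
    return True

docs = [
    {"lang": "pt", "text": "Um documento em português " * 20},  # kept
    {"lang": "xx", "text": "noise " * 100},                     # wrong language
    {"lang": "de", "text": "kurz"},                             # too short
]
kept = [d for d in docs if keep_document(d, {"pt", "de", "fr"})]
```

Real pipelines typically replace the precomputed `lang` field with a language-identification model and add deduplication and quality classifiers on top of heuristics like these.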

Original Abstract

This report presents EuroLLM-22B, a large language model trained from scratch to support the needs of European citizens by covering all 24 official European Union languages and 11 additional languages. EuroLLM addresses the issue of European languages being underrepresented and underserved in existing open large language models. We provide a comprehensive overview of EuroLLM-22B's development, including tokenizer design, architectural specifications, data filtering, and training procedures. Across a broad set of multilingual benchmarks, EuroLLM-22B demonstrates strong performance in reasoning, instruction following, and translation, achieving results competitive with models of comparable size. To support future research, we release our base and instruction-tuned models, our multilingual web pretraining data and updated EuroBlocks instruction datasets, as well as our pre-training and evaluation codebases.

Tags

LLM, Multilingual, European Languages, Open Source

arXiv Categories

cs.CL cs.AI cs.LG