AI Agents relevance: 6/10

What Language is This? Ask Your Tokenizer

Clara Meister, Ahmetcan Yavuz, Pietro Lesci, Tiago Pimentel
arXiv: 2602.17655v1 Published: 2026-02-19 Updated: 2026-02-19

AI Summary

UniLID is a language identification method built on UnigramLM that performs strongly on low-resource languages and on dialect identification.

Key Contributions

  • Proposes UniLID, a UnigramLM-based language identification method
  • Leverages UnigramLM's probabilistic framework for language identification
  • Achieves substantial gains on low-resource language and dialect identification

Methodology

UniLID builds on UnigramLM: it learns language-conditional unigram distributions over a shared vocabulary, while treating segmentation itself as a language-specific phenomenon.
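This idea can be illustrated with a toy sketch (not the authors' code): each language gets its own unigram distribution over a shared subword vocabulary, a Viterbi-style dynamic program finds the best-scoring segmentation of the input under each language's model (as in standard UnigramLM inference), and the language whose model best explains the text wins. The token probabilities and language codes below are invented for demonstration.

```python
import math

# Hypothetical language-conditional unigram distributions over a shared
# subword vocabulary; probabilities are made up for illustration.
UNIGRAMS = {
    "eng": {"the": 0.3, "cat": 0.2, "th": 0.05, "e": 0.05, "at": 0.05},
    "deu": {"die": 0.3, "katze": 0.2, "d": 0.05, "kat": 0.05, "ze": 0.05},
}
MIN_LOGP = math.log(1e-9)  # floor for subwords unseen in a language


def best_logprob(text: str, probs: dict, max_len: int = 8) -> float:
    """Viterbi segmentation: max log-probability over all tokenizations
    of `text` under a unigram model, as in UnigramLM inference."""
    n = len(text)
    dp = [-math.inf] * (n + 1)
    dp[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            piece = text[j:i]
            logp = math.log(probs[piece]) if piece in probs else MIN_LOGP
            dp[i] = max(dp[i], dp[j] + logp)
    return dp[n]


def identify(text: str) -> str:
    """Pick the language whose unigram model best explains the text."""
    return max(UNIGRAMS, key=lambda lang: best_logprob(text, UNIGRAMS[lang]))


print(identify("thecat"))    # -> eng
print(identify("diekatze"))  # -> deu
```

Because each language's model is independent, a new language can be added by estimating one more unigram distribution, without retraining the existing ones, which matches the incremental-addition property described in the abstract.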

Original Abstract

Language Identification (LID) is an important component of many multilingual natural language processing pipelines, where it facilitates corpus curation, training data analysis, and cross-lingual evaluation of large language models. Despite near-perfect performance on high-resource languages, existing systems remain brittle in low-resource and closely related language settings. We introduce UniLID, a simple and efficient LID method based on the UnigramLM tokenization algorithm, leveraging its probabilistic framing, parameter estimation technique and inference strategy. In short, we learn language-conditional unigram distributions over a shared tokenizer vocabulary but treat segmentation as a language-specific phenomenon. Our formulation is data- and compute-efficient, supports incremental addition of new languages without retraining existing models, and can naturally be integrated into existing language model tokenization pipelines. Empirical evaluations against widely used baselines, including fastText, GlotLID, and CLD3, show that UniLID achieves competitive performance on standard benchmarks, substantially improves sample efficiency in low-resource settings - surpassing 70% accuracy with as few as five labeled samples per language - and delivers large gains on fine-grained dialect identification.

Tags

Language Identification, UnigramLM, Low-Resource Languages, Dialect Identification

arXiv Categories

cs.CL