UniGeM: Unifying Data Mixing and Selection via Geometric Exploration and Mining
AI Summary
UniGeM unifies data mixing and selection through geometric exploration, improving data efficiency for LLM training.
Key Contributions
- Proposes the UniGeM framework, which unifies data mixing and data selection
- Filters high-quality instances by their geometric distribution to ensure logical consistency
- Requires no trained proxy models and no external reference datasets
Methodology
Optimizes data quality via manifold approximation, operating hierarchically with macro-level exploration and micro-level mining.
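To make the two-stage idea concrete, here is a minimal sketch of a macro-exploration / micro-mining pipeline over corpus embeddings. The specific choices (plain k-means for clustering, cluster sizes as mixing weights, distance-to-centroid as a quality proxy) are illustrative assumptions, not UniGeM's actual stability-based clustering or geometric filtering; the function names `kmeans` and `curate` are hypothetical.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Toy k-means stand-in for the paper's (different) clustering step."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Assign each embedding to its nearest centroid.
        dists = np.linalg.norm(X[:, None] - centroids[None], axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return labels, centroids

def curate(X, k=3, keep_frac=0.5):
    # Macro-exploration (sketch): cluster corpus embeddings and turn
    # cluster sizes into mixing weights over data groups.
    labels, centroids = kmeans(X, k)
    weights = np.bincount(labels, minlength=k) / len(X)
    # Micro-mining (sketch): within each cluster, keep the instances
    # closest to the centroid as a crude proxy for lying on the
    # underlying data manifold.
    kept = []
    for j in range(k):
        idx = np.where(labels == j)[0]
        if len(idx) == 0:
            continue
        dist = np.linalg.norm(X[idx] - centroids[j], axis=1)
        n_keep = max(1, int(len(idx) * keep_frac))
        kept.extend(idx[np.argsort(dist)[:n_keep]].tolist())
    return weights, sorted(kept)
```

In use, `curate` would receive one embedding vector per training instance and return both the learned mixing weights and the indices of the retained subset, mirroring how the framework couples mixing and selection in a single pass.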
Original Abstract
The scaling of Large Language Models (LLMs) is increasingly limited by data quality. Most methods handle data mixing and sample selection separately, which can break the structure in code corpora. We introduce \textbf{UniGeM}, a framework that unifies mixing and selection by treating data curation as a \textit{manifold approximation} problem without training proxy models or relying on external reference datasets. UniGeM operates hierarchically: \textbf{Macro-Exploration} learns mixing weights with stability-based clustering; \textbf{Micro-Mining} filters high-quality instances by their geometric distribution to ensure logical consistency. Validated by training 8B and 16B MoE models on 100B tokens, UniGeM achieves \textbf{2.0$\times$ data efficiency} over a random baseline and further improves overall performance compared to SOTA methods in reasoning-heavy evaluations and multilingual generalization.