Multimodal Learning Relevance: 9/10

Captioning Daily Activity Images in Early Childhood Education: Benchmark and Algorithm

Sixing Li, Zhibin Gu, Ziqi Zhang, Weiguo Pan, Bing Li, Ying Wang, Hongzhe Liu

arXiv: 2604.01941v1 Published: 2026-04-02 Updated: 2026-04-02


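AI Summary

Proposes a large-scale dataset and a hybrid training framework for image captioning in early childhood education, improving the accuracy of professional object descriptions.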

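Key Contributions

Built ECAC, a large-scale image captioning dataset for early childhood education
Proposed RSRS, a hybrid training framework that dynamically switches between RL and supervised optimization
Developed KinderMM-Cap-3B, a domain-adapted multimodal large language model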

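Methodology

Builds the ECAC dataset and proposes the RSRS framework, which alternates between RL and supervised optimization to train the KinderMM-Cap-3B model.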

Original Abstract

Image captioning for Early Childhood Education (ECE) is essential for automated activity understanding and educational assessment. However, existing methods face two key challenges. First, the lack of large-scale, domain-specific datasets limits the model's ability to capture fine-grained semantic concepts unique to ECE scenarios, resulting in generic and imprecise descriptions. Second, conventional training paradigms exhibit limitations in enhancing professional object description capability, as supervised learning tends to favor high-frequency expressions, while reinforcement learning may suffer from unstable optimization on difficult samples. To address these limitations, we introduce ECAC, a large-scale benchmark for ECE daily activity image captioning, comprising 256,121 real-world images annotated with expert-level captions and fine-grained labels. ECAC is further equipped with a domain-oriented evaluation protocol, the Teaching Toy Recognition Score (TTS), to explicitly measure professional object naming accuracy. Furthermore, we propose RSRS (Reward-Conditional Switch of Reinforcement Learning and Supervised Fine-Tuning), a hybrid training framework that dynamically alternates between RL and supervised optimization. By rerouting hard samples with zero rewards to supervised fine-tuning, RSRS effectively mitigates advantage collapse and enables stable optimization for fine-grained recognition. Leveraging ECAC and RSRS, we develop KinderMM-Cap-3B, a domain-adapted multimodal large language model. Extensive experiments demonstrate that our model achieves a TTS of 51.06, substantially outperforming state-of-the-art baselines while maintaining superior caption quality, highlighting its potential for specialized educational applications.
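The reward-conditional switch described in the abstract can be illustrated with a minimal sketch. The abstract does not name the RL algorithm or define the reward, so the group-normalized advantage below is an assumption (a GRPO-style formulation), and the function names `group_advantages`, `route_sample`, and `split_batch` are hypothetical. The sketch only shows why a sample whose sampled captions all receive zero reward yields all-zero advantages (and so no RL learning signal), and how such a sample could be rerouted to supervised fine-tuning.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages for one image's sampled captions.

    Assumption: a GRPO-style normalization; the abstract does not
    specify the actual RL objective used by RSRS.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def route_sample(rewards):
    """Return "sft" when every sampled caption got zero reward, else "rl"."""
    if all(r == 0 for r in rewards):
        return "sft"
    return "rl"


def split_batch(batch_rewards):
    """Split sample ids into an RL group and an SFT group by their rewards."""
    rl_ids, sft_ids = [], []
    for sample_id, rewards in batch_rewards.items():
        if route_sample(rewards) == "sft":
            sft_ids.append(sample_id)
        else:
            rl_ids.append(sample_id)
    return rl_ids, sft_ids


# Hypothetical rewards for four sampled captions per image.
batch = {
    "img_a": [0.0, 1.0, 0.5, 0.0],  # mixed rewards: usable RL signal
    "img_b": [0.0, 0.0, 0.0, 0.0],  # all zero: advantages collapse to zero
}
rl_ids, sft_ids = split_batch(batch)
collapsed = group_advantages(batch["img_b"])
```

Here `img_b` would be sent to the supervised branch, where the expert caption provides a gradient that the zero advantages cannot. How RSRS weights or schedules the two branches is not stated in the abstract and is left out of this sketch.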

arXiv Categories

cs.CV cs.AI
