AI Agents Relevance: 7/10

Impact of enriched meaning representations for language generation in dialogue tasks: A comprehensive exploration of the relevance of tasks, corpora and metrics

Alain Vázquez, Maria Inés Torres
arXiv: 2603.29518v1 Published: 2026-03-31 Updated: 2026-03-31

AI Summary

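This paper analyses how enriched meaning representations (MRs) affect language generation in dialogue systems, evaluating the effect on four datasets with five metrics.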

Main Contributions

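  • Proposes enriching the generator's input with a task demonstrator, an MR-sentence pair extracted from the original dataset, at both training and inference time
  • Provides a comprehensive comparative analysis across multiple datasets and evaluation metrics
  • Finds that enriched inputs are effective for complex tasks and small datasets, and that semantic metrics capture generation quality more accurately than lexical metrics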

Methodology

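The study fine-tunes models on several datasets and scores the generated sentences with multiple metrics to measure how the task demonstrator affects the output.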
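As a rough illustration of what an enriched input could look like, the sketch below prepends one MR-sentence pair (the demonstrator) to the target MR before it is passed to the generator. The `Example` class, the `linearise_mr` string format, and the `=>` and `|` separators are all assumptions made for this example; the summary does not state how the paper serialises MRs or how it selects the demonstrator.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """One dataset item (hypothetical structure, not taken from the paper)."""
    dialogue_act: str        # communicative function, e.g. "inform"
    slots: dict[str, str]    # semantic content as slot-value pairs
    sentence: str            # reference sentence; may be empty for the target


def linearise_mr(dialogue_act: str, slots: dict[str, str]) -> str:
    """Flatten an MR into a string. The act(slot=value, ...) format is assumed."""
    pairs = ", ".join(f"{slot}={value}" for slot, value in slots.items())
    return f"{dialogue_act}({pairs})"


def build_enriched_input(target: Example, demonstrator: Example) -> str:
    """Prefix the target MR with a demonstrator MR-sentence pair.

    The same construction would be applied at training and inference time,
    so the model always sees one worked example before the MR to verbalise.
    """
    demo_mr = linearise_mr(demonstrator.dialogue_act, demonstrator.slots)
    target_mr = linearise_mr(target.dialogue_act, target.slots)
    return f"{demo_mr} => {demonstrator.sentence} | {target_mr} =>"


demo = Example("inform", {"name": "Aromi", "food": "Italian"},
               "Aromi serves Italian food.")
target = Example("inform", {"name": "Zizzi", "food": "pizza"}, "")
print(build_enriched_input(target, demo))
```

Keeping the demonstrator and the target in the same linearised format is a deliberate choice in this sketch: it lets the model copy the surface pattern of the demonstration when it verbalises the target MR.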

Original Abstract

Conversational systems should generate diverse language forms to interact fluently and accurately with users. In this context, Natural Language Generation (NLG) engines convert Meaning Representations (MRs) into sentences, directly influencing user perception. These MRs usually encode the communicative function (e.g., inform, request, confirm) via DAs and enumerate the semantic content with slot-value pairs. In this work, our objective is to analyse whether providing a task demonstrator to the generator enhances the generations of a fine-tuned model. This demonstrator is an MR-sentence pair extracted from the original dataset that enriches the input at training and inference time. The analysis involves five metrics that focus on different linguistic aspects, and four datasets that differ in multiple features, such as domain, size, lexicon, MR variability, and acquisition process. To the best of our knowledge, this is the first study on dialogue NLG implementing a comparative analysis of the impact of MRs on generation quality across domains, corpus characteristics, and the metrics used to evaluate these generations. Our key insight is that the proposed enriched inputs are effective for complex tasks and small datasets with high variability in MRs and sentences. They are also beneficial in zero-shot settings for any domain. Moreover, the analysis of the metrics shows that semantic metrics capture generation quality more accurately than lexical metrics. In addition, among these semantic metrics, those trained with human ratings can detect omissions and other subtle semantic issues that embedding-based metrics often miss. Finally, the evolution of the metric scores and the excellent results for Slot Accuracy and Dialogue Act Accuracy demonstrate that the generative models present fast adaptability to different tasks and robustness at semantic and communicative intention levels.
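The abstract reports Slot Accuracy without defining it. A common simplification, assumed here and not confirmed as the paper's definition, is the fraction of slot values from the MR that appear in the generated sentence. The function name `slot_accuracy` and the case-insensitive substring matching are choices made for this sketch.

```python
def slot_accuracy(slots: dict[str, str], sentence: str) -> float:
    """Fraction of slot values found verbatim (case-insensitive) in the sentence.

    Assumed definition for illustration only. Substring matching misses
    paraphrased values, so this is a lower bound on real slot coverage.
    """
    if not slots:
        return 1.0  # nothing to realise, so nothing can be omitted
    text = sentence.lower()
    realised = sum(1 for value in slots.values() if value.lower() in text)
    return realised / len(slots)


mr_slots = {"name": "Aromi", "food": "Italian"}
print(slot_accuracy(mr_slots, "Aromi serves Italian food."))  # both values present
print(slot_accuracy(mr_slots, "Aromi is a nice place."))      # "Italian" omitted
```

A check like this catches an omitted slot only when the value is missing from the surface text. It cannot tell whether a value that is present is used with the right meaning, which is one reason a study might also report semantic metrics.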

Tags

NLG, Dialogue Systems, Meaning Representation, Evaluation Metrics

arXiv Categories

cs.CL cs.AI