AI Agents relevance: 9/10

Eval4Sim: An Evaluation Framework for Persona Simulation

Eliseo Bao, Anxo Perez, Xi Wang, Javier Parapar
arXiv: 2603.02876v1 · Published: 2026-03-03 · Updated: 2026-03-03

AI Summary

Eval4Sim is an evaluation framework for measuring how closely simulated conversations align with human conversational patterns.

Key Contributions

  • Proposes the Eval4Sim framework for evaluating the quality of persona simulation.
  • Evaluates simulated conversations along three complementary dimensions: Adherence, Consistency, and Naturalness.
  • Uses a human conversational corpus as a reference baseline, distinguishing insufficient persona encoding from over-optimized behaviour.

Methodology

Eval4Sim assesses Adherence via dense retrieval with speaker-aware representations, Consistency via authorship verification, and Naturalness via distributions derived from dialogue-focused Natural Language Inference (NLI). Each dimension is scored against a human conversational corpus, with deviations penalized in both directions.
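The distinctive scoring idea — penalizing deviation from a human baseline in both directions, rather than rewarding maximization — can be sketched as below. The abstract specifies only the two-sided penalty; the concrete aggregation (absolute deviation mapped to [0, 1]) and the per-dimension metric values are illustrative assumptions, not the paper's actual formula.

```python
def reference_score(simulated: float, human_baseline: float, scale: float = 1.0) -> float:
    """Score in [0, 1]: 1.0 means the simulated metric matches the human
    baseline exactly; lower values indicate under- OR over-optimization.
    The linear penalty here is an assumption for illustration."""
    deviation = abs(simulated - human_baseline)
    return max(0.0, 1.0 - deviation / scale)

# Hypothetical per-dimension metric values (names follow the paper's
# dimensions; the numbers are invented for illustration):
human = {"adherence": 0.62, "consistency": 0.71, "naturalness": 0.55}
simulated = {"adherence": 0.60, "consistency": 0.78, "naturalness": 0.97}

scores = {dim: reference_score(simulated[dim], human[dim]) for dim in human}
```

Note how an "over-optimized" naturalness value (0.97 vs. the human 0.55) yields a low score, whereas an absolute metric would have rewarded it — this is the behaviour the framework's reference-baseline design is meant to capture.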

Original Abstract

Large Language Model (LLM) personas with explicit specifications of attributes, background, and behavioural tendencies are increasingly used to simulate human conversations for tasks such as user modeling, social reasoning, and behavioural analysis. Ensuring that persona-grounded simulations faithfully reflect human conversational behaviour is therefore critical. However, current evaluation practices largely rely on LLM-as-a-judge approaches, offering limited grounding in observable human behavior and producing opaque scalar scores. We address this gap by proposing Eval4Sim, an evaluation framework that measures how closely simulated conversations align with human conversational patterns across three complementary dimensions. Adherence captures how effectively persona backgrounds are implicitly encoded in generated utterances, assessed via dense retrieval with speaker-aware representations. Consistency evaluates whether a persona maintains a distinguishable identity across conversations, computed through authorship verification. Naturalness reflects whether conversations exhibit human-like flow rather than overly rigid or optimized structure, quantified through distributions derived from dialogue-focused Natural Language Inference. Unlike absolute or optimization-oriented metrics, Eval4Sim uses a human conversational corpus (i.e., PersonaChat) as a reference baseline and penalizes deviations in both directions, distinguishing insufficient persona encoding from over-optimized, unnatural behaviour. Although demonstrated on PersonaChat, the applicability of Eval4Sim extends to any conversational corpus containing speaker-level annotations.

Tags

LLM Evaluation · Persona Simulation · Conversational AI

arXiv Categories

cs.CL