End-to-End Chatbot Evaluation with Adaptive Reasoning and Uncertainty Filtering
AI Summary
Proposes an end-to-end automatic chatbot evaluation method that reduces the cost of human evaluation and improves scalability.
Main Contributions
- An end-to-end automatic evaluation framework
- LLM-based question generation and answer judging
- A confidence-based filtering mechanism to handle uncertain cases
Methodology
An LLM generates Q&A pairs from the underlying knowledge base, another LLM judges the chatbot's answers against the references, and a confidence-based filter flags uncertain judgments, minimizing the need for human intervention.
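The confidence-filtering step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Judgment` type, the `0.8` threshold, and the idea that the judge reports a scalar confidence are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Judgment:
    question: str
    verdict: str       # LLM judge's label, e.g. "correct" / "incorrect"
    confidence: float  # judge's self-reported confidence in [0, 1] (assumed)

def filter_uncertain(judgments, threshold=0.8):
    """Split judgments into auto-accepted ones and ones flagged for human review.

    High-confidence verdicts are trusted as-is; low-confidence ones are the
    "uncertain cases" that get routed to a human, which is where the
    review-cost savings come from.
    """
    accepted = [j for j in judgments if j.confidence >= threshold]
    flagged = [j for j in judgments if j.confidence < threshold]
    return accepted, flagged

judgments = [
    Judgment("Q1", "correct", 0.95),
    Judgment("Q2", "incorrect", 0.55),  # uncertain -> human review
    Judgment("Q3", "correct", 0.82),
]
accepted, flagged = filter_uncertain(judgments)
print(len(accepted), len(flagged))  # 2 1
```

With this split, human effort scales with the number of flagged items rather than the full test set.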
Original Abstract
Large language models (LLMs) combined with retrieval augmented generation have enabled the deployment of domain-specific chatbots, but these systems remain prone to generating unsupported or incorrect answers. Reliable evaluation is therefore critical, yet manual review is costly and existing frameworks often depend on curated test sets and static metrics, limiting scalability. We propose an end-to-end automatic evaluator designed to substantially reduce human effort. Our system generates Q&A pairs directly from the underlying knowledge base, uses LLMs to judge chatbot responses against reference answers, and applies confidence-based filtering to highlight uncertain cases. Applied to a Vietnamese news dataset, the evaluator achieves high agreement with human judgments while significantly lowering review overhead. The framework is modular and language-agnostic, making it readily adaptable to diverse domains. This work introduces a practical, scalable solution for evaluating chatbots with minimal reliance on manual intervention.