Assessing Large Language Models for Medical QA: Zero-Shot and LLM-as-a-Judge Evaluation
AI Summary
This paper evaluates the zero-shot performance of several large language models on medical question-answering tasks and compares performance across models.
Key Contributions
- Evaluates the zero-shot performance of multiple LLMs on medical QA tasks
- Uses the iCliniq dataset as a benchmark
- Analyzes the trade-off between model size and performance
Methodology
Using the iCliniq dataset, Llama- and GPT-series models are evaluated zero-shot on medical QA, with performance measured by BLEU and ROUGE metrics.
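The evaluation relies on standard n-gram overlap metrics rather than fine-tuned classifiers. As a minimal sketch of what ROUGE-style scoring involves (the paper itself presumably used an off-the-shelf implementation such as the `rouge-score` package), the snippet below computes ROUGE-N F1 via n-gram overlap and ROUGE-L F1 via longest common subsequence between a reference answer and a model answer; the tokenization here is plain whitespace splitting, a simplifying assumption:

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(reference, candidate, n=1):
    """ROUGE-N F1: n-gram overlap between reference and candidate."""
    ref = Counter(ngrams(reference.split(), n))
    cand = Counter(ngrams(candidate.split(), n))
    overlap = sum((ref & cand).values())  # clipped n-gram matches
    if not ref or not cand or overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def lcs_len(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(reference, candidate):
    """ROUGE-L F1: based on the longest common subsequence of tokens."""
    a, b = reference.split(), candidate.split()
    l = lcs_len(a, b)
    if not a or not b or l == 0:
        return 0.0
    precision, recall = l / len(b), l / len(a)
    return 2 * precision * recall / (precision + recall)
```

For example, `rouge_n("a b c d", "a b x d", 1)` gives 0.75 (three of four unigrams overlap in both directions). In a zero-shot setup, the candidate is simply the model's raw answer to the patient question, with no in-context examples or fine-tuning.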
Original Abstract
Recently, Large Language Models (LLMs) have gained significant traction in the medical domain, especially in the development of medical QA systems for enhancing access to healthcare in low-resource settings. This paper compares five LLMs deployed between April 2024 and August 2025 for medical QA, using the iCliniq dataset of 38,000 medical questions and answers spanning diverse specialties. The evaluated models are Llama-3-8B-Instruct, Llama 3.2 3B, Llama 3.3 70B Instruct, Llama-4-Maverick-17B-128E-Instruct, and GPT-5-mini. We use a zero-shot evaluation methodology with BLEU and ROUGE metrics to measure performance without specialized fine-tuning. Our results show that larger models such as Llama 3.3 70B Instruct outperform smaller ones, consistent with observed scaling benefits in clinical tasks. Notably, Llama-4-Maverick-17B achieved competitive results, highlighting size-efficiency trade-offs relevant for practical deployment. These findings align with advances in LLM capabilities toward professional-level medical reasoning and reflect the increasing feasibility of LLM-supported QA systems in real clinical environments. This benchmark aims to serve as a standardized setting for future studies that minimize model size and computational cost while maximizing clinical utility in medical NLP applications.