Distill and Align Decomposition for Enhanced Claim Verification
AI Summary
Proposes a reinforcement learning method that jointly optimizes sentence decomposition quality and verifier alignment, improving performance on complex claim verification.
Main Contributions
- Propose a GRPO-based reinforcement learning method
- Introduce structured sequential reasoning and knowledge distillation
- Design a multi-objective reward function
Methodology
Uses reinforcement learning with knowledge distillation and a multi-objective reward to jointly optimize decomposition quality and verifier alignment, improving downstream verification performance.
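The multi-objective reward and GRPO's group-relative advantage can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the weight values, the component scores (`verifier_score`, `quality_score`), and the simple format check are all assumptions.

```python
def combined_reward(subclaims, verifier_score, quality_score,
                    w_format=0.2, w_align=0.5, w_quality=0.3):
    """Weighted sum of the three reward components described in the paper:
    format compliance, verifier alignment, and decomposition quality.
    The weights here are illustrative, not the paper's values."""
    # Format compliance: output parses into a non-empty list of non-empty subclaims.
    r_format = 1.0 if subclaims and all(s.strip() for s in subclaims) else 0.0
    return w_format * r_format + w_align * verifier_score + w_quality * quality_score

def grpo_advantages(rewards, eps=1e-8):
    """GRPO replaces a learned value baseline with group-relative
    normalization: each sampled decomposition's reward is standardized
    against the other samples for the same claim."""
    mu = sum(rewards) / len(rewards)
    var = sum((r - mu) ** 2 for r in rewards) / len(rewards)
    return [(r - mu) / (var ** 0.5 + eps) for r in rewards]
```

A rollout group for one claim would be scored with `combined_reward`, and the resulting per-sample advantages from `grpo_advantages` weight the policy-gradient update.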
Original Abstract
Complex claim verification requires decomposing sentences into verifiable subclaims, yet existing methods struggle to align decomposition quality with verification performance. We propose a reinforcement learning (RL) approach that jointly optimizes decomposition quality and verifier alignment using Group Relative Policy Optimization (GRPO). Our method integrates: (i) structured sequential reasoning; (ii) supervised finetuning on teacher-distilled exemplars; and (iii) a multi-objective reward balancing format compliance, verifier alignment, and decomposition quality. Across six evaluation settings, our trained 8B decomposer improves downstream verification performance to 71.75% macro-F1, outperforming prompt-based approaches (+1.99, +6.24) and existing RL methods (+5.84). Human evaluation confirms the high quality of the generated subclaims. Our framework enables smaller language models to achieve state-of-the-art claim verification by jointly optimising for verification accuracy and decomposition quality.