All Leaks Count, Some Count More: Interpretable Temporal Contamination Detection in LLM Backtesting
AI Summary
We propose an interpretable temporal contamination detection framework for assessing whether LLMs leak knowledge during backtesting, together with TimeSPEC, a method that reduces such leakage.
Main Contributions
- Shapley-DCLR, a metric that quantifies the share of an LLM's decision-driving reasoning that derives from leaked information.
- TimeSPEC, a method that proactively filters temporal contamination through claim verification and regeneration.
- Experiments demonstrating that TimeSPEC effectively reduces leakage while preserving task performance.
Methodology
Decompose model rationales into atomic claims, classify each claim by temporal verifiability, use Shapley values to measure each claim's contribution to the prediction, and filter contamination with TimeSPEC.
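To make the Shapley step concrete, here is a minimal sketch of exact Shapley attribution over a small set of claims, plus a leakage rate computed as the fraction of total attribution carried by leaked claims. The claim names, the additive value function, and the exact DCLR weighting are illustrative assumptions, not the paper's implementation.

```python
from itertools import combinations
from math import factorial

def shapley_values(claims, value):
    """Exact Shapley value of each claim's contribution to the prediction
    score value(S). Exponential in len(claims); fine for small claim sets."""
    n = len(claims)
    phi = {c: 0.0 for c in claims}
    for c in claims:
        others = [x for x in claims if x != c]
        for k in range(n):
            for S in combinations(others, k):
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[c] += w * (value(set(S) | {c}) - value(set(S)))
    return phi

def shapley_dclr(phi, leaked):
    """Illustrative Shapley-DCLR: share of total absolute attribution
    carried by leaked (post-cutoff) claims. The paper's exact weighting
    may differ; this only conveys the idea."""
    total = sum(abs(v) for v in phi.values())
    return sum(abs(phi[c]) for c in leaked) / total if total else 0.0

# Toy example (hypothetical claims, additive value function):
weights = {"precedent_cited": 0.2, "post_cutoff_ruling": 0.5, "oral_argument": 0.3}
value = lambda S: sum(weights[c] for c in S)
phi = shapley_values(list(weights), value)
rate = shapley_dclr(phi, {"post_cutoff_ruling"})
```

For an additive value function the Shapley value of each claim equals its own weight, so here half of the decision-driving attribution traces to the leaked claim.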
Original Abstract
To evaluate whether LLMs can accurately predict future events, we need the ability to *backtest* them on events that have already resolved. This requires models to reason only with information available at a specified past date. Yet LLMs may inadvertently leak post-cutoff knowledge encoded during training, undermining the validity of retrospective evaluation. We introduce a claim-level framework for detecting and quantifying this *temporal knowledge leakage*. Our approach decomposes model rationales into atomic claims and categorizes them by temporal verifiability, then applies *Shapley values* to measure each claim's contribution to the prediction. This yields the **Shapley**-weighted **D**ecision-**C**ritical **L**eakage **R**ate (**Shapley-DCLR**), an interpretable metric that captures what fraction of decision-driving reasoning derives from leaked information. Building on this framework, we propose **Time**-**S**upervised **P**rediction with **E**xtracted **C**laims (**TimeSPEC**), which interleaves generation with claim verification and regeneration to proactively filter temporal contamination, producing predictions where every supporting claim can be traced to sources available before the cutoff date. Experiments on 350 instances spanning U.S. Supreme Court case prediction, NBA salary estimation, and stock return ranking reveal substantial leakage in standard prompting baselines. TimeSPEC reduces Shapley-DCLR while preserving task performance, demonstrating that explicit, interpretable claim-level verification outperforms prompt-based temporal constraints for reliable backtesting.
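The interleaved generate-verify-regenerate loop described in the abstract can be sketched as follows. The `generate` and `verify` callables, the `Claim` fields, and the round limit are all hypothetical stand-ins; the paper's actual verification against pre-cutoff sources is more involved.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Claim:
    text: str
    pre_cutoff_source: bool  # toy flag: can the claim be traced to a pre-cutoff source?

def timespec_loop(generate: Callable[[List[Claim]], List[Claim]],
                  verify: Callable[[Claim], bool],
                  max_rounds: int = 3) -> List[Claim]:
    """Hypothetical TimeSPEC-style loop: generate candidate claims, verify
    each against pre-cutoff sources, and regenerate conditioned only on the
    verified claims until every claim passes (or rounds run out)."""
    kept: List[Claim] = []
    for _ in range(max_rounds):
        candidates = generate(kept)
        kept, failed = [], []
        for c in candidates:
            (kept if verify(c) else failed).append(c)
        if not failed:  # all supporting claims are pre-cutoff traceable
            break
    return kept

# Toy demo with a stubbed generator and verifier (for illustration only):
pool = [Claim("pre-cutoff precedent", True), Claim("post-cutoff ruling", False)]
survivors = timespec_loop(lambda kept: pool, lambda c: c.pre_cutoff_source)
```

The returned claims are exactly those the verifier accepted, mirroring the guarantee that every supporting claim in the final prediction is pre-cutoff traceable.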