LLM Reasoning relevance: 9/10

Evaluating Financial Intelligence in Large Language Models: Benchmarking SuperInvesting AI with LLM Engines

Akshay Gulati, Kanha Singhania, Tushar Banga, Parth Arora, Anshul Verma, Vaibhav Kumar Singh, Agyapal Digra, Jayant Singh Bisht, Danish Sharma, Varun Singla, Shubh Garg
arXiv: 2603.08704v1 Published: 2026-03-09 Updated: 2026-03-09

AI Summary

This work proposes the AI Financial Intelligence Benchmark (AFIB) and evaluates multiple LLMs on financial analysis tasks; SuperInvesting achieves the best overall performance.

Key Contributions

  • Proposes the AI Financial Intelligence Benchmark (AFIB)
  • Evaluates multiple LLMs on financial analysis tasks
  • Shows that financial intelligence in LLMs is inherently multi-dimensional

Methodology

Constructs the AFIB benchmark of 95+ structured financial analysis questions and evaluates models including GPT and Gemini across five dimensions: factual accuracy, analytical completeness, data recency, model consistency, and failure patterns.
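The paper does not publish its scoring code, but the five-dimension evaluation described above can be sketched as a simple per-question record plus an aggregation step. All names below (`QuestionResult`, `aggregate`, the field names) are illustrative assumptions, not the authors' implementation; only the score scales (accuracy out of 10, completeness out of 70) are taken from the reported results.

```python
from dataclasses import dataclass

# Hypothetical per-question result for one model on one AFIB question.
# Field names and scales are assumptions based on the reported metrics
# (e.g. factual accuracy out of 10, completeness out of 70).
@dataclass
class QuestionResult:
    factual_accuracy: float   # 0-10 scale
    completeness: float       # 0-70 scale
    recency: float            # how current the cited data is
    consistency: float        # agreement across repeated runs
    hallucinated: bool        # whether the answer contained a fabricated fact

def aggregate(results: list[QuestionResult]) -> dict[str, float]:
    """Average each dimension over the question set and compute a
    hallucination rate, yielding one score vector per model."""
    n = len(results)
    return {
        "factual_accuracy": sum(r.factual_accuracy for r in results) / n,
        "completeness": sum(r.completeness for r in results) / n,
        "recency": sum(r.recency for r in results) / n,
        "consistency": sum(r.consistency for r in results) / n,
        "hallucination_rate": sum(r.hallucinated for r in results) / n,
    }
```

Keeping the dimensions separate rather than collapsing them into a single score is what lets the benchmark surface trade-offs such as retrieval-oriented systems scoring high on recency but lower on consistency.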

Original Abstract

Large language models are increasingly used for financial analysis and investment research, yet systematic evaluation of their financial reasoning capabilities remains limited. In this work, we introduce the AI Financial Intelligence Benchmark (AFIB), a multi-dimensional evaluation framework designed to assess financial analysis capabilities across five dimensions: factual accuracy, analytical completeness, data recency, model consistency, and failure patterns. We evaluate five AI systems: GPT, Gemini, Perplexity, Claude, and SuperInvesting, using a dataset of 95+ structured financial analysis questions derived from real-world equity research tasks. The results reveal substantial differences in performance across models. Within this benchmark setting, SuperInvesting achieves the highest aggregate performance, with an average factual accuracy score of 8.96/10 and the highest completeness score of 56.65/70, while also demonstrating the lowest hallucination rate among evaluated systems. Retrieval-oriented systems such as Perplexity perform strongly on data recency tasks due to live information access but exhibit weaker analytical synthesis and consistency. Overall, the results highlight that financial intelligence in large language models is inherently multi-dimensional, and systems that combine structured financial data access with analytical reasoning capabilities provide the most reliable performance for complex investment research workflows.

Tags

Financial Analysis LLM Evaluation Benchmarking

arXiv Categories

cs.AI