AI Predicts AGI: Leveraging AGI Forecasting and Peer Review to Explore LLMs' Complex Reasoning Capabilities

📅 2024-12-12
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Evaluating large language models' (LLMs) complex reasoning and calibration capabilities in forecasting the probability of Artificial General Intelligence (AGI) emergence remains an open challenge, and existing general-purpose benchmarks are ill-suited to such high-order speculative reasoning. Method: We introduce AGI Benchmark, the first task-specific evaluation suite for AGI forecasting, and propose LLM-Peer Review (LLM-PR), a novel automated peer-review paradigm integrating intra-class correlation (ICC) reliability analysis and multi-source weight fusion. Contribution/Results: Evaluating 16 state-of-the-art LLMs, we find their median predicted probability of AGI emergence by 2030 is 12.5%, close to a recent expert survey's estimate of 10% by 2027; the peer-review scoring is highly consistent across models (ICC = 0.79), and Pplx-70b-online achieves top performance. Crucially, standard benchmarks (e.g., Chatbot Arena) fail to discriminate effectively on this task, whereas AGI Benchmark substantially improves model differentiation. Our framework provides a scalable, task-aligned methodology for assessing higher-order cognitive capabilities in LLMs.
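The summary cites ICC reliability analysis without specifying which ICC form is used. A minimal sketch, assuming the common two-way random-effects ICC(2,1) with reviewed forecasts as rows and LLM reviewers as columns (the function name and data layout are assumptions, not the paper's implementation):

```python
import numpy as np

def icc2_1(scores: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    scores: (n_targets, k_raters) matrix, e.g. rows = forecasts under
    review, columns = LLM reviewers assigning each forecast a rating.
    """
    n, k = scores.shape
    grand = scores.mean()
    row_means = scores.mean(axis=1)   # per-forecast means
    col_means = scores.mean(axis=0)   # per-reviewer means

    # Mean squares from the two-way ANOVA decomposition.
    ms_rows = k * np.sum((row_means - grand) ** 2) / (n - 1)
    ms_cols = n * np.sum((col_means - grand) ** 2) / (k - 1)
    ss_err = np.sum(
        (scores - row_means[:, None] - col_means[None, :] + grand) ** 2
    )
    ms_err = ss_err / ((n - 1) * (k - 1))

    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )
```

Values near 1 indicate reviewers rate the same forecasts similarly; the paper's reported 0.79 would sit in the conventionally "good" reliability range.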

📝 Abstract
We tasked 16 state-of-the-art large language models (LLMs) with estimating the likelihood of Artificial General Intelligence (AGI) emerging by 2030. To assess the quality of these forecasts, we implemented an automated peer review process (LLM-PR). The LLMs' estimates varied widely, ranging from 3% (Reka-Core) to 47.6% (GPT-4o), with a median of 12.5%. These estimates closely align with a recent expert survey that projected a 10% likelihood of AGI by 2027, underscoring the relevance of LLMs in forecasting complex, speculative scenarios. The LLM-PR process demonstrated strong reliability, evidenced by a high Intraclass Correlation Coefficient (ICC = 0.79), reflecting notable consistency in scoring across the models. Among the models, Pplx-70b-online emerged as the top performer, while Gemini-1.5-pro-api ranked the lowest. A cross-comparison with external benchmarks, such as LMSYS Chatbot Arena, revealed that LLM rankings diverged across different evaluation methods, suggesting that existing benchmarks may not encapsulate some of the skills relevant for AGI prediction. We further explored the use of weighting schemes based on external benchmarks, optimizing the alignment of LLMs' predictions with human expert forecasts. This analysis led to the development of a new 'AGI benchmark' designed to highlight performance differences in AGI-related tasks. Our findings offer insights into LLMs' capabilities in speculative, interdisciplinary forecasting tasks and emphasize the growing need for innovative evaluation frameworks for assessing AI performance in complex, uncertain real-world scenarios.
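The abstract describes LLM-PR only at a high level. A minimal sketch of one review round, assuming each model rates every other model's forecast (the `score_fn` rating callback and the `llm_pr_round` name are hypothetical; only the two endpoint probabilities below are reported in the paper):

```python
from statistics import median

# Reported endpoints only; the other 14 models' forecasts are not listed here.
forecasts = {"Reka-Core": 0.03, "GPT-4o": 0.476}  # P(AGI by 2030)

def llm_pr_round(forecasts, score_fn):
    """One LLM-PR round: every model reviews every other model's forecast.

    score_fn(reviewer, author, prob) -> numeric quality rating, e.g. 1-10,
    obtained by prompting the reviewer model with the author's forecast.
    """
    reviews = {
        author: [
            score_fn(reviewer, author, prob)
            for reviewer in forecasts
            if reviewer != author  # no self-review
        ]
        for author, prob in forecasts.items()
    }
    # Aggregate each model's received ratings into a single peer-review score.
    return {author: median(r) for author, r in reviews.items()}
```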
Problem

Research questions and friction points this paper is trying to address.

Assessing the quality and calibration of LLMs' forecasts of AGI emergence by 2030
Evaluating reliability of automated peer review for AGI forecasts
Developing new benchmarks for AGI-related reasoning capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated peer review process for LLM forecasts
Weighting schemes aligning LLM predictions with expert forecasts (see the sketch after this list)
New AGI benchmark for performance evaluation
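The weight-fusion scheme is named but not specified here. A minimal sketch, assuming softmax-style weights over an external benchmark score with a single sharpness parameter tuned against the 10% expert anchor (the softmax form, parameter names, and grid search are all assumptions):

```python
import numpy as np

def fused_forecast(preds: np.ndarray, bench: np.ndarray, alpha: float) -> float:
    """Benchmark-weighted average of model forecasts.

    preds:  per-model AGI probabilities.
    bench:  per-model scores on an external benchmark (e.g. Arena Elo).
    alpha:  sharpness of the weighting; alpha = 0 is a plain mean.
    """
    w = np.exp(alpha * (bench - bench.max()))  # softmax-style, stable
    return float(np.dot(w, preds) / w.sum())

def fit_alpha(preds, bench, expert=0.10, grid=np.linspace(0, 10, 201)):
    """Pick the alpha whose fused forecast is closest to the expert anchor."""
    return min(grid, key=lambda a: abs(fused_forecast(preds, bench, a) - expert))
```

Subtracting `bench.max()` before exponentiating keeps the weights numerically stable, and at `alpha = 0` the scheme reduces to an unweighted mean.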
🔎 Similar Papers
No similar papers found.