Daily Oracle

Abstract

Many existing evaluation benchmarks for Large Language Models (LLMs) quickly become outdated due to the emergence of new models and training data. These benchmarks also fall short in assessing how LLM performance changes over time, as they consist of static questions without a temporal dimension. To address these limitations, we propose using future event prediction as a continuous evaluation method to assess LLMs' temporal generalization and forecasting abilities. Our benchmark, Daily Oracle, automatically generates question-answer (QA) pairs from daily news, challenging LLMs to predict "future" event outcomes. Our findings reveal that as pre-training data becomes outdated, LLM performance degrades over time. While Retrieval Augmented Generation (RAG) has the potential to enhance prediction accuracy, the performance degradation pattern persists, highlighting the need for continuous model updates.

Browse Daily QA Pairs

Select date:

Category:

True/False Questions

Question	Answer	Source Article

Multiple‑Choice Questions

Question	a	b	c	d	Answer	Source Article

Dataset Overview

Daily Oracle is a continuous evaluation benchmark using automatically generated QA pairs from daily news to assess how the future prediction capabilities of LLMs evolve over time.

While Daily Oracle is daily updated, for our current analysis we use the subset covering the period from January 2020 to December 2024 (~17.2 questions per day).

Closed-Book Setting

In the closed-book setting, we access how accurately LLMs can answer forecasting questions based on the knowledge they learned from their training data without providing extra information.

Performance degradation over time is observed across all models. This indicates that while LLMs demonstrate certain abilities to understand real-world events and make predictions, they struggle to maintain these abilities.

Constrained Open-Book Setting

In the constrained open-book setting, we explore how access to news articles up to different time cutoffs influences LLM performance using RAG.

RAG cutoff: the latest accessible date for retrieving articles.

RAG has the potential to enhance prediction accuracy, the performance degradation pattern persists, highlighting the need for continuous model updates.

Mixtral-8x7B in the constrained open-book setting.

Mistral-7B in the constrained open-book setting.

Llama-3-8B in the constrained open-book setting.

Qwen-2-7B in the constrained open-book setting.

Gemma-2-2B in the constrained open-book setting.

Claude-3.5-Sonnet in the constrained open-book setting.

GPT-4 in the constrained open-book setting.

Gold Article Setting

In the gold article setting, models are provided direct access to the gold article, from which the question is generated.

LLM performance can be improved significantly to around 90%, demonstrating the answerability of Daily Oracle.

However, even when these are treated as reading comprehensive questions rather than forecasting questions, most of the models still show declining trends.

This provides an "upper bound" of open-book retrieval, and the remaining decline in the model's performance suggests continuous pre-training of LLMs is still needed in the context of news event forecasting to address outdated representations.

@inproceedings{dai2025dailyoracle, title={Are LLMs Prescient? A Continuous Evaluation using Daily News as the Oracle}, author={Dai, Hui and Teehan, Ryan and Ren, Mengye}, booktitle={International Conference on Machine Learning}, year={2025} }

Are LLMs Prescient? A Continuous Evaluation using Daily News as the Oracle

Abstract

Model Performance Over Time (Closed-Book)

TF Questions

MC Questions

Browse Daily QA Pairs

True/False Questions

Multiple‑Choice Questions

Daily Oracle Dataset

Dataset Overview

Example QA pairs

QA Construction Pipeline

For each day, we collect news articles from the daily-updated Common Crawl News Dataset and scrape news using the Newspaper3k package. We use LLM to generate QA pairs with the few-shot prompting technique.

Evaluation