PROPHET: An Inferable Future Forecasting Benchmark with Causal Intervened Likelihood Estimation

This repository contains the dataset, code, and evaluation scripts for the paper "PROPHET: An Inferable Future Forecasting Benchmark with Causal Intervened Likelihood Estimation".

PROPHET is a new benchmark designed to evaluate Future Forecasting systems (LLMs and Agents) with a focus on inferability. Unlike previous benchmarks, PROPHET ensures that prediction questions are actually answerable based on the retrieved news by filtering data using a novel statistical metric: Causal Intervened Likelihood (CIL).

🌟 Key Features

  • Inferability-First: Addresses the "non-inferable" issue in existing forecasting benchmarks where retrieved information is insufficient to support a conclusion.
  • CIL Metric: Introduces Causal Intervened Likelihood, a metric derived from causal inference to quantify how strongly a news article supports a specific forecasting outcome.
  • Real-World Data: Contains 612 high-quality forecasting questions collected from Polymarket (resolved in Jan 2025) with over 300k associated news articles.
  • Comprehensive Baselines: Includes implementations for both Naive RAG and Agentic RAG (ReAct-based) forecasting systems.

📂 Dataset Statistics

The PROPHET benchmark consists of two subsets based on the CIL filtering:

| Subset | Description | Count | Avg News/Q | Avg Token/News |
|---|---|---|---|---|
| L1 (Main) | Inferable. Contains strong supportive evidence (CIL > 0.7). | 612 | ~560 | ~1250 |

Data source: Polymarket (Resolution date: 2025-01-01 to 2025-01-31).
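
To make the CIL > 0.7 criterion in the table concrete, here is a minimal sketch of threshold-based filtering. It assumes a question counts as inferable when at least one of its articles exceeds the threshold; that rule, the function name `is_inferable`, and the toy scores are illustrative assumptions, not the repository's actual pipeline or data.

```python
def is_inferable(article_cils, threshold=0.7):
    """Return True if any article's CIL score exceeds the threshold.

    Assumption: one strongly supportive article is enough to keep a question.
    """
    return any(cil > threshold for cil in article_cils)


# Hypothetical per-question CIL scores (made-up values for illustration).
questions = {
    "q1": [0.05, 0.82, -0.10],  # one strongly supportive article
    "q2": [0.30, 0.41],         # only weak evidence
    "q3": [],                   # no retrieved articles
}

l1_ids = [qid for qid, cils in questions.items() if is_inferable(cils)]
print(l1_ids)
```

Under this assumed rule only `q1` is kept, since it is the only question with an article scoring above 0.7.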

Download the L1 dataset

The L1 subset can be downloaded from Google Drive: Download Link

🧠 Methodology: Causal Intervened Likelihood (CIL)

CIL estimates the causal effect of a news event ($X_i$) on the forecasting outcome ($Y$). It is defined as:

$CIL_i = P(Y=\hat{Y}|do(X_i=1)) - P(Y=\hat{Y}|do(X_i=0))$
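
As a quick arithmetic illustration of this definition, the sketch below takes the two interventional probabilities as given and returns their difference. The function name and the input values are hypothetical; estimating the probabilities themselves is the hard part and is not shown here.

```python
def causal_intervened_likelihood(p_do_present: float, p_do_absent: float) -> float:
    """CIL_i = P(Y=Y_hat | do(X_i=1)) - P(Y=Y_hat | do(X_i=0))."""
    for p in (p_do_present, p_do_absent):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return p_do_present - p_do_absent


# Made-up probabilities: the event raises the likelihood of the outcome.
supportive = causal_intervened_likelihood(0.9, 0.1)
# Made-up probabilities: the event lowers the likelihood of the outcome.
contrary = causal_intervened_likelihood(0.2, 0.6)
print(round(supportive, 2), round(contrary, 2))
```

Because both terms are probabilities, CIL always lies in $[-1, 1]$: positive values mean the event supports the outcome, negative values mean it argues against it.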

We compute this by modeling the news stream as a Structural Causal Model (SCM) with two key assumptions:

  1. Temporality: Later events cannot cause earlier events.
  2. $w$-day Dependency: Direct causal influence is limited to a $w$-day window (we use $w=30$).

This allows us to bridge interventional probabilities to observational probabilities estimable by LLMs.
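
One way to read the two assumptions is as a rule for which earlier events may directly influence a given event. The sketch below applies that reading to a list of publication dates; the function name `candidate_causes`, the treatment of same-day events, and the example dates are assumptions made for illustration and are not taken from the repository.

```python
from datetime import date, timedelta


def candidate_causes(dates: list[date], target: int, w: int = 30) -> list[int]:
    """Indices of events that may directly influence the event at `target`.

    Temporality: only strictly earlier events qualify (same-day events are
    excluded here, which is an assumption).
    w-day dependency: the event must fall within the preceding w days.
    """
    t = dates[target]
    earliest = t - timedelta(days=w)
    return [
        i for i, d in enumerate(dates)
        if i != target and earliest <= d < t
    ]


# Hypothetical publication dates for four news events.
dates = [date(2024, 11, 1), date(2024, 12, 10), date(2024, 12, 25), date(2025, 1, 5)]
print(candidate_causes(dates, target=3))
```

For the event dated 2025-01-05, the 30-day window starts on 2024-12-06, so the two December events are kept and the November event is excluded.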

📊 Performance Highlights

Below are selected results (Brier Score, lower is better) comparing Naive RAG vs. Agentic RAG on the PROPHET dataset.

| Model | w/o RAG | Naive RAG (Best) | Agentic RAG |
|---|---|---|---|
| Claude-4-sonnet | 18.57 | 18.00 | 17.89 |
| GPT-4o-mini | 24.28 | 27.17 | - |
| DeepSeek-v3 | 20.37 | 21.04 | - |
| Gemini-2.5-Pro | 21.41 | - | 19.26 |

See the paper for full tables and analysis.
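
For readers unfamiliar with the metric, here is a minimal sketch of the Brier score for binary questions. The scaling by 100 is an assumption inferred from the magnitude of the numbers in the table, and the function is illustrative rather than the repository's evaluation script.

```python
def brier_score(probs, outcomes, scale=100.0):
    """Mean squared error between forecast probabilities and 0/1 outcomes.

    `scale=100.0` is an assumption made to match the magnitude of the
    reported numbers; the conventional Brier score uses scale=1.0.
    """
    if not probs or len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must be non-empty and equal length")
    squared_errors = sum((p - o) ** 2 for p, o in zip(probs, outcomes))
    return scale * squared_errors / len(probs)


# Always answering 0.5 gives 25.0 regardless of the outcomes.
print(brier_score([0.5, 0.5], [1, 0]))
# A perfect forecaster scores 0.0.
print(brier_score([1.0, 0.0], [1, 0]))
```

On this scale, 25 corresponds to always predicting 0.5, so scores below 25 indicate forecasts that beat an uninformed guess.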

📝 Citation

If you use PROPHET or CIL in your research, please cite our paper:

@article{tao2026prophet,
  title={PROPHET: An Inferable Future Forecasting Benchmark with Causal Intervened Likelihood Estimation},
  author={Zhengwei Tao and Pu Wu and Zhi Jin and Xiaoying Bai and Haiyan Zhao and Chengfeng Dou and Xiancai Chen and Jia Li and Linyu Li and Chongyang Tao and Wentao Zhang},
  journal={arXiv preprint},
  year={2026}
}

📧 Contact

For questions or feedback, please contact:

  • Zhengwei Tao: tttzw@pku.edu.cn
