forecastingresearch/forecastbench-sim

CivBench

Forecasting benchmark for LLMs built on FreeCiv game simulations. Models read a world state report from a game in progress and answer questions about future game outcomes.

What it tests

  • Binary questions (20 templates): Will event X happen? Models output probability estimates, scored with Brier score and ECE.
  • Continuous questions (6 templates): What will the value of Y be? Models output percentile estimates (p10, p25, p50, p75, p90), scored with CRPS and MAE.
  • Conditional reasoning: Given a hypothetical intervention (government change, treasury boost), can the model update its forecasts? Five framing variants test different aspects of causal reasoning.
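
The binary metrics named above can be sketched in a few lines. This is a minimal illustration using the standard definitions of the Brier score and of ECE with equal-width bins; the function names, the default of 10 bins, and the binning scheme are assumptions and may differ from what the repository's scoring code does.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and of equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def expected_calibration_error(
    probs: Sequence[float], outcomes: Sequence[int], n_bins: int = 10
) -> float:
    """ECE with equal-width bins (assumed scheme, not confirmed by the source).

    Each bin contributes |mean forecast - observed frequency|, weighted by
    the share of forecasts that fall into it.
    """
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and of equal length")
    bins = [[] for _ in range(n_bins)]
    for p, o in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, o))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_forecast = sum(p for p, _ in members) / len(members)
        observed_freq = sum(o for _, o in members) / len(members)
        ece += (len(members) / total) * abs(mean_forecast - observed_freq)
    return ece
```

Lower is better for both metrics. A forecaster that always answers 0.5 gets a Brier score of 0.25 regardless of outcomes, which makes that value a convenient reference point.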

The benchmark contains 9,426 questions across 26 templates, 14 game seeds, and 8 forecast horizons (0 to 210 turns).
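
For the continuous questions, one common way to score a five-percentile forecast is to approximate CRPS by twice the average pinball (quantile) loss over the reported levels. The sketch below shows that approximation; it is not confirmed to be the repository's CRPS implementation, and using p50 as the point forecast for absolute error is likewise an assumption.

```python
from typing import Sequence

# Percentile levels reported by the model: p10, p25, p50, p75, p90.
LEVELS = (0.10, 0.25, 0.50, 0.75, 0.90)


def pinball_loss(level: float, quantile: float, outcome: float) -> float:
    """Quantile loss for one level: under- and over-prediction are weighted asymmetrically."""
    diff = outcome - quantile
    return level * diff if diff >= 0 else (level - 1.0) * diff


def approx_crps(
    quantiles: Sequence[float], outcome: float, levels: Sequence[float] = LEVELS
) -> float:
    """Approximate CRPS as 2 * mean pinball loss over the given levels."""
    if len(quantiles) != len(levels):
        raise ValueError("need exactly one quantile value per level")
    if any(b < a for a, b in zip(quantiles, quantiles[1:])):
        raise ValueError("quantile values must be non-decreasing")
    losses = [pinball_loss(t, q, outcome) for t, q in zip(levels, quantiles)]
    return 2.0 * sum(losses) / len(losses)


def absolute_error(quantiles: Sequence[float], outcome: float) -> float:
    """Absolute error of the median (p50), assumed here to be the point forecast."""
    return abs(quantiles[2] - outcome)
```

With only five levels the approximation is coarse, but it rewards the same things CRPS does: quantiles that sit close to the outcome and intervals that are no wider than they need to be.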

Quick start

# Install
uv sync

# Dry run (no API calls)
uv run python scripts/evaluate_llm_forecasts_parallel.py --dry-run -n 2

# Evaluate a model (2 questions per template)
uv run python scripts/evaluate_llm_forecasts_parallel.py \
  --models anthropic/claude-sonnet-4-5-20250929 -n 2

# Continuous questions only
uv run python scripts/evaluate_llm_forecasts_parallel.py \
  --models anthropic/claude-sonnet-4-5-20250929 -n 2 \
  --question-type continuous

# Filter by forecast horizon
uv run python scripts/evaluate_llm_forecasts_parallel.py \
  --models openai/gpt-4o -n 5 --horizon H0 H1

Results are saved to data/evaluations/runs/.

Pipeline

Game Execution → Serialization → Question Generation → Conditional Experiments → LLM Evaluation
(FreeCiv)          (games/)        (questions/)          (conditional/)           (evaluations/)

| Stage | Script | Purpose |
| --- | --- | --- |
| Game execution | scripts/run_world.py | Run a single FreeCiv simulation |
| Game execution | scripts/run_worlds.py | Run batch of simulations |
| Game execution | scripts/run_fork.py | Fork a game with an intervention |
| Serialization | scripts/generate_data_batch.py | Serialize game saves to JSON |
| Questions | scripts/generate_questions.py | Generate questions for one game |
| Questions | scripts/generate_questions_batch.py | Generate questions across seeds |
| Conditional | scripts/setup_republic_conditional_eval.py | Set up republic conditional framings |
| Conditional | scripts/setup_gold500_conditional_eval.py | Set up gold500 conditional framings |
| Conditional | scripts/setup_navigation_conditional_eval.py | Set up navigation tech conditional |
| Evaluation | scripts/evaluate_llm_forecasts_parallel.py | Run LLM evaluations in parallel |

Intervention experiments

These scripts run specific experimental conditions on the standard 10-seed evaluation set: 5 crash + 5 growth worlds, 4 disruptable templates, and horizons H1-H6, giving 10 × 4 × 6 = 240 questions per model per condition.
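
The 240-question count follows directly from the grid of seeds, templates, and horizons. The sketch below enumerates that grid; the seed and template identifiers are hypothetical placeholders, since the actual names are not listed here.

```python
from itertools import product

# Hypothetical identifiers: only the counts (5 + 5 seeds, 4 templates,
# 6 horizons) come from the description above.
seeds = [f"crash_{i}" for i in range(1, 6)] + [f"growth_{i}" for i in range(1, 6)]
templates = [f"disruptable_template_{i}" for i in range(1, 5)]
horizons = [f"H{i}" for i in range(1, 7)]  # H1 through H6

# One question per (seed, template, horizon) combination.
grid = list(product(seeds, templates, horizons))
print(len(grid))  # 240
```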

| Script | Condition | What it tests |
| --- | --- | --- |
| scripts/run_domain_knowledge_standardized.py | Domain knowledge | Does providing FreeCiv warfare mechanics + crash base rates improve tail calibration? |
| scripts/run_tutorial_conditions.py | Tutorial framing | Does the type of example world (crash-only, growth-only, both) affect performance? |
| scripts/run_scenario_enumeration.py | Scenario enumeration | Can models identify disruption scenarios when asked to enumerate futures? (Sub-task 1 of decomposition) |
| scripts/run_scenario_weighting.py | Scenario weighting | Given fixed scenarios, can models assign reasonable P(disruption)? (Sub-task 2) |
| scripts/run_structured_mixture.py | SME (fixed scenarios) | Fixed continuation/disruption scenarios with model-generated conditional distributions |
| scripts/run_generate_scenario.py | GenSME | Models generate their own scenarios, weights, and conditional distributions in one call |

Prompt templates for the intervention experiments live in src/civrealm/evaluation/.

Data

All generated data (game snapshots, questions, evaluation results) is produced by the pipeline scripts above and excluded from the repository. See data/README.md for the expected directory structure.

Evaluation output

Each run produces a JSON file with:

  • model_results: Per-model metrics split by question type
    • binary: brier_score, ece, per-template Brier breakdown
    • continuous: crps, mae, per-template CRPS/MAE breakdown
  • questions: Per-question predictions with question_type, ground truth, and model outputs
  • metadata: Run config, question type distribution, template distribution
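
A run file with the fields listed above could be summarized roughly as follows. The exact nesting is an assumption: the sketch treats model_results as a mapping from model name to a dictionary with "binary" and "continuous" entries. The model name and the numbers in the sample are placeholders for illustration, not real results.

```python
import json


def summarize_run(run: dict) -> list:
    """Return (model, brier_score, ece, crps, mae) rows from a loaded run file.

    Assumes model_results maps model names to per-question-type metric dicts;
    missing metrics come back as None rather than raising.
    """
    rows = []
    for model, results in run.get("model_results", {}).items():
        binary = results.get("binary", {})
        continuous = results.get("continuous", {})
        rows.append(
            (
                model,
                binary.get("brier_score"),
                binary.get("ece"),
                continuous.get("crps"),
                continuous.get("mae"),
            )
        )
    return rows


# Placeholder payload standing in for a file under data/evaluations/runs/.
sample_run = json.loads(
    """
    {
      "model_results": {
        "example/model": {
          "binary": {"brier_score": 0.25, "ece": 0.1},
          "continuous": {"crps": 12.5, "mae": 20.0}
        }
      },
      "questions": [],
      "metadata": {}
    }
    """
)

for row in summarize_run(sample_run):
    print(row)
```

For a real run, replace the placeholder payload with `json.load` on the file of interest and adjust the key lookups if the layout differs.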

Citation

This project extends CivRealm (Qi et al., ICLR 2024) into a forecasting benchmark.

@inproceedings{qi2024civrealm,
  title     = {CivRealm: A Learning and Reasoning Odyssey in Civilization for Decision-Making Agents},
  author    = {Siyuan Qi and Shuo Chen and Yexin Li and Xiangyu Kong and Junqi Wang and Bangcheng Yang and Pring Wong and Yifan Zhong and Xiaoyuan Zhang and Zhaowei Zhang and Nian Liu and Wei Wang and Yaodong Yang and Song-Chun Zhu},
  booktitle = {International Conference on Learning Representations},
  year      = {2024},
  url       = {https://openreview.net/forum?id=UBVNwD3hPN}
}
