CivBench

A forecasting benchmark for LLMs, built on FreeCiv game simulations. Models read a world-state report from a game in progress and answer questions about future game outcomes.

What it tests

Binary questions (20 templates): Will event X happen? Models output probability estimates, scored with the Brier score and expected calibration error (ECE).
Continuous questions (6 templates): What will the value of Y be? Models output percentile estimates (p10, p25, p50, p75, p90), scored with the continuous ranked probability score (CRPS) and mean absolute error (MAE).
Conditional reasoning: Given a hypothetical intervention (government change, treasury boost), can the model update its forecasts? Five framing variants test different aspects of causal reasoning.
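
For readers unfamiliar with the binary metrics, here is a minimal sketch of the Brier score and an equal-width-bin ECE. The function names, the 10-bin default, and the binning scheme are assumptions for illustration; the repository's own scoring code may differ.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or len(probs) == 0:
        raise ValueError("probs and outcomes must be non-empty and the same length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def expected_calibration_error(
    probs: Sequence[float], outcomes: Sequence[int], n_bins: int = 10
) -> float:
    """Weighted mean of |average forecast - observed frequency| over equal-width bins."""
    if len(probs) != len(outcomes) or len(probs) == 0:
        raise ValueError("probs and outcomes must be non-empty and the same length")
    bins = [[] for _ in range(n_bins)]
    for p, o in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, o))
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_p = sum(p for p, _ in members) / len(members)
        freq = sum(o for _, o in members) / len(members)
        ece += len(members) / len(probs) * abs(mean_p - freq)
    return ece
```

Lower is better for both metrics. A forecaster that always answers 0.5 gets a Brier score of 0.25 regardless of the outcomes.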

The benchmark contains 9,426 questions across 26 templates, 14 game seeds, and 8 forecast horizons (0 to 210 turns).
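
For the continuous questions, a five-percentile forecast can be scored with the mean pinball (quantile) loss. CRPS equals twice the integral of the pinball loss over all quantile levels, so averaging over five levels is only a coarse stand-in. The sketch below assumes forecasts arrive as a dict keyed `p10` to `p90` and that the absolute error is taken at the median; neither detail is confirmed by this README.

```python
from typing import Dict

# Quantile levels matching the percentile names requested from the model.
LEVELS = {"p10": 0.10, "p25": 0.25, "p50": 0.50, "p75": 0.75, "p90": 0.90}


def pinball_loss(quantile_value: float, level: float, actual: float) -> float:
    """Quantile loss for one level: penalizes misses asymmetrically."""
    diff = actual - quantile_value
    return level * diff if diff >= 0 else (level - 1.0) * diff


def mean_pinball_loss(percentiles: Dict[str, float], actual: float) -> float:
    """Average pinball loss over the five requested percentiles."""
    values = [percentiles[name] for name in LEVELS]
    if any(a > b for a, b in zip(values, values[1:])):
        raise ValueError("percentiles must be non-decreasing from p10 to p90")
    losses = [pinball_loss(percentiles[n], lvl, actual) for n, lvl in LEVELS.items()]
    return sum(losses) / len(losses)


def median_absolute_error(percentiles: Dict[str, float], actual: float) -> float:
    """Absolute error of the p50 estimate, used here as the point forecast."""
    return abs(percentiles["p50"] - actual)
```

The monotonicity check is a design choice in this sketch: crossing percentiles do not describe a valid distribution, so they are rejected rather than scored.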

Quick start

# Install
uv sync

# Dry run (no API calls)
uv run python scripts/evaluate_llm_forecasts_parallel.py --dry-run -n 2

# Evaluate a model (2 questions per template)
uv run python scripts/evaluate_llm_forecasts_parallel.py \
  --models anthropic/claude-sonnet-4-5-20250929 -n 2

# Continuous questions only
uv run python scripts/evaluate_llm_forecasts_parallel.py \
  --models anthropic/claude-sonnet-4-5-20250929 -n 2 \
  --question-type continuous

# Filter by forecast horizon
uv run python scripts/evaluate_llm_forecasts_parallel.py \
  --models openai/gpt-4o -n 5 --horizon H0 H1

Results are saved to data/evaluations/runs/.
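
To pick up the most recent run from Python, something like the following may help. It assumes result files are `.json` files stored directly under `data/evaluations/runs/`; the helper name `latest_run_file` is hypothetical and not part of the repository.

```python
from pathlib import Path
from typing import Optional


def latest_run_file(runs_dir: str = "data/evaluations/runs") -> Optional[Path]:
    """Return the most recently modified .json file in runs_dir, or None."""
    directory = Path(runs_dir)
    if not directory.is_dir():
        return None  # nothing has been evaluated yet, or the path is wrong
    candidates = list(directory.glob("*.json"))
    if not candidates:
        return None
    return max(candidates, key=lambda path: path.stat().st_mtime)


if __name__ == "__main__":
    newest = latest_run_file()
    print(newest if newest is not None else "No run files found")
```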

Pipeline

Game Execution → Serialization → Question Generation → Conditional Experiments → LLM Evaluation
(FreeCiv)          (games/)        (questions/)          (conditional/)           (evaluations/)

Stage	Script	Purpose
Game execution	`scripts/run_world.py`	Run a single FreeCiv simulation
Game execution	`scripts/run_worlds.py`	Run a batch of simulations
Game execution	`scripts/run_fork.py`	Fork a game with an intervention
Serialization	`scripts/generate_data_batch.py`	Serialize game saves to JSON
Questions	`scripts/generate_questions.py`	Generate questions for one game
Questions	`scripts/generate_questions_batch.py`	Generate questions across seeds
Conditional	`scripts/setup_republic_conditional_eval.py`	Set up republic conditional framings
Conditional	`scripts/setup_gold500_conditional_eval.py`	Set up gold500 conditional framings
Conditional	`scripts/setup_navigation_conditional_eval.py`	Set up navigation tech conditional
Evaluation	`scripts/evaluate_llm_forecasts_parallel.py`	Run LLM evaluations in parallel

Intervention experiments

These scripts run specific experimental conditions on the standard 10-seed evaluation set: 5 crash and 5 growth worlds, 4 disruptable templates, and horizons H1-H6, for 10 × 4 × 6 = 240 questions per model per condition.

Script	Condition	What it tests
`scripts/run_domain_knowledge_standardized.py`	Domain knowledge	Does providing FreeCiv warfare mechanics + crash base rates improve tail calibration?
`scripts/run_tutorial_conditions.py`	Tutorial framing	Does the type of example world (crash-only, growth-only, both) affect performance?
`scripts/run_scenario_enumeration.py`	Scenario enumeration	Can models identify disruption scenarios when asked to enumerate futures? (Sub-task 1 of decomposition)
`scripts/run_scenario_weighting.py`	Scenario weighting	Given fixed scenarios, can models assign reasonable P(disruption)? (Sub-task 2)
`scripts/run_structured_mixture.py`	SME (fixed scenarios)	Fixed continuation/disruption scenarios with model-generated conditional distributions
`scripts/run_generate_scenario.py`	GenSME	Models generate their own scenarios, weights, and conditional distributions in one call
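
The SME and GenSME conditions combine scenario weights with scenario-conditional distributions. As a rough illustration of that idea (not the repository's implementation), the sketch below mixes two scenarios and reads percentiles off the mixture. Modelling each conditional distribution as a normal, and the names `mixture_cdf` and `mixture_quantile`, are assumptions made for the example.

```python
from statistics import NormalDist


def mixture_cdf(
    x: float, p_disruption: float, continuation: NormalDist, disruption: NormalDist
) -> float:
    """CDF of the two-scenario mixture, weighted by P(disruption)."""
    return (1.0 - p_disruption) * continuation.cdf(x) + p_disruption * disruption.cdf(x)


def mixture_quantile(
    level: float,
    p_disruption: float,
    continuation: NormalDist,
    disruption: NormalDist,
    tol: float = 1e-6,
) -> float:
    """Invert the mixture CDF by bisection for a quantile level in (0, 1)."""
    # The mixture quantile always lies between the two component quantiles.
    q_cont = continuation.inv_cdf(level)
    q_disr = disruption.inv_cdf(level)
    lo, hi = min(q_cont, q_disr), max(q_cont, q_disr)
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if mixture_cdf(mid, p_disruption, continuation, disruption) < level:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


if __name__ == "__main__":
    # Illustrative numbers only: a growth path and a crash path.
    cont = NormalDist(mu=100.0, sigma=10.0)
    disr = NormalDist(mu=20.0, sigma=5.0)
    for name, level in [("p10", 0.10), ("p50", 0.50), ("p90", 0.90)]:
        print(name, round(mixture_quantile(level, 0.2, cont, disr), 2))
```

Even a small P(disruption) pulls the lower percentiles far below the continuation scenario, which is the tail behaviour these conditions are meant to probe.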

Prompt templates for the intervention experiments live in src/civrealm/evaluation/.

Data

All generated data (game snapshots, questions, evaluation results) is produced by the pipeline scripts above and excluded from the repository. See data/README.md for the expected directory structure.

Evaluation output

Each run produces a JSON file with:

model_results: Per-model metrics split by question type
- binary: brier_score, ece, per-template Brier breakdown
- continuous: crps, mae, per-template CRPS/MAE breakdown
questions: Per-question predictions with question_type, ground truth, and model outputs
metadata: Run config, question type distribution, template distribution
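
A short sketch of reading the headline metrics back out of a run file. The key names come from the list above, but the exact nesting (a mapping from model name to `binary` and `continuous` dicts) is an assumption, and `load_results` and `summarize` are hypothetical helpers.

```python
import json
from typing import List, Optional


def load_results(path: str) -> dict:
    """Load one evaluation run from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def _fmt(value: Optional[float]) -> str:
    """Format a metric, tolerating runs that skipped a question type."""
    return "n/a" if value is None else f"{value:.4f}"


def summarize(results: dict) -> List[str]:
    """One line per model with Brier, ECE, CRPS, and MAE."""
    lines = []
    # Assumed layout: model_results -> model name -> question type -> metrics.
    for model, by_type in results.get("model_results", {}).items():
        binary = by_type.get("binary", {})
        continuous = by_type.get("continuous", {})
        lines.append(
            f"{model}: brier={_fmt(binary.get('brier_score'))} "
            f"ece={_fmt(binary.get('ece'))} "
            f"crps={_fmt(continuous.get('crps'))} "
            f"mae={_fmt(continuous.get('mae'))}"
        )
    return lines
```

Missing metrics print as `n/a`, so a run restricted with `--question-type continuous` still summarizes cleanly.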

Citation

This project extends CivRealm (Qi et al., ICLR 2024) as a forecasting benchmark.

@inproceedings{qi2024civrealm,
  title     = {CivRealm: A Learning and Reasoning Odyssey in Civilization for Decision-Making Agents},
  author    = {Siyuan Qi and Shuo Chen and Yexin Li and Xiangyu Kong and Junqi Wang and Bangcheng Yang and Pring Wong and Yifan Zhong and Xiaoyuan Zhang and Zhaowei Zhang and Nian Liu and Wei Wang and Yaodong Yang and Song-Chun Zhu},
  booktitle = {International Conference on Learning Representations},
  year      = {2024},
  url       = {https://openreview.net/forum?id=UBVNwD3hPN}
}
