Synthetic Personas Distort the Structure of Human Belief Systems

Christopher Barrie & Roberto Cerina

@misc{barrie_cerina_2026,
  title={Synthetic personas distort the structure of human belief systems},
  url={osf.io/preprints/socarxiv/n7fq8_v1},
  publisher={SocArXiv},
  author={Barrie, Christopher and Cerina, Roberto},
  year={2026},
  month={Feb}
}

Replication repository for:

Barrie, C. & Cerina, R. (2026). Synthetic personas distort the structure of human belief systems. SocArXiv. https://doi.org/10.31235/osf.io/n7fq8_v1

Saturn animation: belief constraint contours across quantile levels

This repository contains all code, synthetic data, pre-computed outputs, and visualizations needed to reproduce the analysis. The quickest path:

git clone https://github.com/cjbarrie/polreason
cd polreason
Rscript analysis/scripts/master.R

See analysis/README.md, generation/README.md, and docs/replication_release_plan.md for full details.

Overview

This project investigates how different LLMs exhibit patterns of political constraint when answering survey questions as synthetic personas. Building on della Posta's (2020) framework for measuring belief constraint, we:

  1. Generate synthetic personas from real GSS respondent demographics
  2. Query multiple LLMs with comprehensive attitudinal survey questions
  3. Analyze constraint patterns using polychoric correlations, principal components analysis, and bootstrap methods
  4. Compare LLM behavior to human survey responses across 28+ different models

The analysis covers 52 survey questions spanning culture-war issues (abortion, immigration, civil liberties) and non-culture-war domains (institutional confidence, economic outlook, social trust).

Repository Structure

polreason/
├── generation/              # Data generation pipeline
│   ├── scripts/            # Python and R scripts for creating synthetic data
│   │   ├── 00a_create_gss_extract_multiyear.R  # Extract GSS data by year
│   │   ├── 00b_generate_personas.R             # Create natural language personas
│   │   ├── 01_generate_synthetic_GSS.py        # Query LLMs (OpenRouter)
│   │   ├── 02_generate_synthetic_GSS_gpt5safe.py  # Alternative querying script
│   │   └── test_models.py                      # Test model availability
│   ├── data/               # Small data files (personas, extracts)
│   │   ├── gss2024_personas.csv                # Natural language personas (823KB)
│   │   ├── gss2024_dellaposta_extract.rds      # Processed GSS data (164KB)
│   │   └── gss2024_variable_summary.csv        # Variable coverage summary
│   └── synthetic_data/     # LLM-generated survey responses
│       └── year_2024/      # 30 CSV files, one per model
│
├── analysis/               # Statistical analysis pipeline
│   ├── README.md           # Analysis-specific guide
│   ├── scripts/            # R analysis scripts
│   │   ├── master.R                            # Main orchestration script
│   │   ├── 0.config.R                          # Configuration and helpers
│   │   ├── 1.data_shaper.R                     # Data preparation
│   │   ├── 2.polychor_bootstrap.R              # Bootstrap analysis
│   │   ├── v.common_utils.R                    # Shared visualization utilities
│   │   ├── v1_a.mvn_plot.R                     # Multivariate normal scatter plots
│   │   ├── v1_b.saturn_plot.R                  # Saturn plots (faceted, Option C)
│   │   ├── v1_c.saturn_animation.R             # Saturn animation (Q95 → Q5 GIF)
│   │   └── v2*.R, v3*.R                        # Additional visualization scripts
│   ├── output/             # Model results (31 directories)
│   │   ├── gss-2024/                           # Human GSS baseline
│   │   ├── anthropic_claude-sonnet-4.5-2024/   # Example model results
│   │   └── ...                                 # Other model outputs
│   ├── viz/                # Visualizations (35+ directories/files)
│   │   ├── constraint_violins_2024/            # Constraint comparisons
│   │   ├── mvn_2024/                           # Multivariate normal plots
│   │   └── [model-name]/                       # Per-model visualizations
│   └── GSS_PC_explain.json # GSS question metadata
│
├── requirements.txt        # Python dependencies
├── r-requirements.txt      # R package list
├── docs/                   # Release planning and extra documentation
├── .gitignore             # Git ignore rules
└── README.md              # This file

Installation

System Requirements

  • Operating systems: macOS 12+, Ubuntu 20.04+, Windows 10/11; developed and tested primarily on macOS 14
  • Python 3.8+ (tested on 3.11) — required only for data generation
  • R 4.0+ (tested on 4.3.x) — required for all statistical analysis and visualization
  • RAM: 8 GB minimum; 16 GB recommended for the full 500-resample bootstrap
  • No non-standard hardware required — a standard laptop or desktop is sufficient
  • GSS Cumulative Data File (see Data Requirements below) — required only for persona generation, not for reproducing the published analysis

Typical Install Time

On a normal desktop or laptop computer:

  • Python dependencies (pip install -r requirements.txt): ~2 minutes
  • R packages (install.packages(...)): ~10–15 minutes (some packages compile from source)

Python Environment

# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

R Packages

# Install all required packages
install.packages(c(
  "data.table", "lavaan", "miceRanger", "ggplot2",
  "grid", "gridExtra", "grDevices", "scales",
  "ggnewscale", "irr", "haven", "dplyr",
  "stringr", "optparse"
))

# Optional: for Saturn animation (v1_c.saturn_animation.R)
install.packages(c("gganimate", "gifski"))

Data Requirements

GSS Cumulative Data File

The GSS cumulative data file (gss7224_r1.dta, 565MB) is excluded from this repository via .gitignore due to NORC redistribution terms.

If cloning from GitHub, you'll need to download the file separately:

  1. Download: Visit https://gss.norc.org/get-the-data/stata.html
  2. File needed: GSS 1972-2024 Cumulative Data (Release 1) in Stata format
  3. Filename: gss7224_r1.dta
  4. Location: Place in generation/data/gss7224_r1.dta

Note: If you only want to run the analysis pipeline (not generate new synthetic data), you can skip this step since the synthetic responses are already included in generation/synthetic_data/.

The GSS cumulative .dta file should remain an external download in any public replication release unless its redistribution terms clearly permit bundling.
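Before starting persona generation, it can help to confirm the file landed in the expected location. The following is a minimal sketch, not part of the repository: the helper name `gss_file_present` is hypothetical, and only the relative path is taken from the steps above.

```python
from pathlib import Path

# Relative path as given in the README; the helper itself is hypothetical.
GSS_RELATIVE_PATH = Path("generation/data/gss7224_r1.dta")


def gss_file_present(repo_root="."):
    """Return True if the GSS cumulative file sits where the README expects it."""
    return (Path(repo_root) / GSS_RELATIVE_PATH).is_file()


if __name__ == "__main__":
    if gss_file_present():
        print("GSS file found; persona generation can proceed.")
    else:
        print("GSS file missing; analysis-only runs are still possible.")
```

Run it from the polreason/ root, or pass the path to your clone as `repo_root`.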

Usage

Option 1: Run Analysis Only (Recommended for Exploration)

If you just want to reproduce the analysis using existing synthetic data:

# Navigate to polreason root
cd polreason/

# Run the master analysis script
Rscript analysis/scripts/master.R

This will:

  • Load existing synthetic responses from generation/synthetic_data/
  • Perform polychoric correlation analysis with bootstrap
  • Generate constraint statistics
  • Create visualizations in analysis/viz/
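For readers unfamiliar with the bootstrap step, here is a small Python sketch of the general idea: resample respondents with replacement, recompute a correlation each time, and read an interval off the sorted replicates. It is an illustration only. It swaps in a plain Pearson correlation for the polychoric correlation used by the pipeline, the data are made up, and the function names and defaults (`bootstrap_ci`, `n_boot`, `seed`) are assumptions rather than anything defined in 2.polychor_bootstrap.R.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either variable is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def bootstrap_ci(xs, ys, n_boot=200, alpha=0.05, seed=1):
    """Percentile bootstrap interval for the correlation between xs and ys."""
    rng = random.Random(seed)
    n = len(xs)
    replicates = []
    for _ in range(n_boot):
        # Resample respondents (rows) with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        replicates.append(pearson([xs[i] for i in idx], [ys[i] for i in idx]))
    replicates.sort()
    lower = replicates[int((alpha / 2) * (n_boot - 1))]
    upper = replicates[int((1 - alpha / 2) * (n_boot - 1))]
    return lower, upper


# Made-up toy data: 20 "respondents" answering two strongly related items.
item_a = list(range(20))
item_b = [x + (1 if x % 2 else -1) for x in item_a]
print(bootstrap_ci(item_a, item_b))
```

Resampling whole respondents, rather than individual answers, keeps each person's pattern of responses intact across items.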

Option 2: Generate New Synthetic Data

To create new synthetic survey responses from LLMs:

Step 1: Create GSS Extract (requires GSS data file)

# Navigate to the generation scripts directory
cd generation/scripts

# Extract 2024 data
Rscript 00a_create_gss_extract_multiyear.R --year 2024

# Generate natural language personas
Rscript 00b_generate_personas.R

# Return to project root
cd ../..

Step 2: Query LLMs

# Navigate to the generation scripts directory (if not already there)
cd generation/scripts

# Set your OpenRouter API key
export OPENROUTER_API_KEY="your-key-here"

# Query all models with 1000 personas (expensive! ~$100-500 depending on models)
python 01_generate_synthetic_GSS.py --year 2024 --all-models --personas 1000

# Or query specific models
python 01_generate_synthetic_GSS.py --year 2024 --models "anthropic/claude-sonnet-4.5,openai/gpt-5" --personas 1000

# Return to project root
cd ../..

See python 01_generate_synthetic_GSS.py --help for all options.

Step 3: Run Analysis

# From polreason root directory
Rscript analysis/scripts/master.R

Demo

The repository ships with everything needed to run a complete demo without API credentials or external downloads.

Demo dataset: 30 pre-generated LLM response CSVs in generation/synthetic_data/year_2024/ (one file per model, ~7–8 MB each, ~52,000 rows each), plus pre-computed bootstrap outputs (.rds files) in analysis/output/. These committed files serve as the demo dataset and eliminate the need for any API calls or long resampling runs.

Running the Demo

From the project root:

Rscript analysis/scripts/master.R

master.R sets run_from_scratch <- FALSE, so it loads the pre-computed bootstrap outputs directly rather than re-running the full resampling procedure.

Expected Output

After a successful run you should find the following new or refreshed files:

| Location | Contents |
| --- | --- |
| analysis/viz/mvn_2024/ | Multivariate-normal scatter PDFs for each model |
| analysis/viz/[model-name]/ | Per-model Saturn plots and supporting figures |
| analysis/viz/constraint_violins_2024/ | Violin plots comparing constraint across models |
| analysis/output/correlation_quantiles_2024.csv | Quantile summaries of pairwise polychoric correlations |
| analysis/output/constraint_bootstrap_stats_2024.csv | Bootstrap constraint metrics (PC1 variance, D_e) |
| analysis/output/constraint_point_stats_2024.csv | Point-estimate constraint metrics |
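
To give a sense of what a quantile summary of pairwise correlations looks like, the sketch below takes the upper triangle of a toy correlation matrix and reports a few quantiles. The matrix values are made up, the function names are hypothetical, and the linear-interpolation rule is an assumption; the R scripts may use a different quantile definition or summarize absolute rather than signed correlations.

```python
def quantile(values, q):
    """Linear-interpolation quantile of a list of numbers, with 0 <= q <= 1."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] * (1 - frac) + ordered[upper] * frac


def upper_triangle(matrix):
    """Collect the entries above the diagonal of a square matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


# Toy 4-item correlation matrix (made-up numbers, not project output).
toy_corr = [
    [1.0, 0.6, 0.2, -0.1],
    [0.6, 1.0, 0.3, 0.0],
    [0.2, 0.3, 1.0, 0.4],
    [-0.1, 0.0, 0.4, 1.0],
]

pairs = upper_triangle(toy_corr)
for level in (25, 50, 75, 90):
    print(f"Q{level}: {quantile(pairs, level / 100):.3f}")
```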

Expected Run Time on a Normal Desktop Computer

| Mode | Approximate time |
| --- | --- |
| Demo (pre-computed bootstrap, visualization only) | 20–40 minutes |
| Full bootstrap from scratch (B=500, B_MI=30, 31 models) | 4–8 hours |
| Quick exploratory run (set B <- 50, B_MI <- 5 in 0.config.R) | ~30 minutes |

Times are estimates for a 2020-era 8-core laptop with 16 GB RAM. The bottleneck for the demo is plot generation across 31 models; for a full bootstrap run it is the polychoric correlation resampling in 2.polychor_bootstrap.R.

Instructions for Use

Running on the Provided Synthetic Data

Use Option 1 above — simply run Rscript analysis/scripts/master.R from the project root. No API key or additional downloads are required.

Running on Your Own LLM Survey Responses

To apply the analysis pipeline to a new LLM or a different survey dataset:

  1. Format your LLM response file to match the existing CSVs in generation/synthetic_data/year_2024/ (columns: persona_id, run, and one column per GSS question variable name).
  2. Place the CSV in generation/synthetic_data/year_2024/your_model_name.csv.
  3. Define any new survey questions in analysis/scripts/0.config.R following the existing GSS_QUESTIONS_* pattern.
  4. From the project root, run:
Rscript analysis/scripts/master.R

The script automatically discovers all CSVs in the year_2024/ directory and processes them.
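
Before dropping a new file into year_2024/, a quick header check can catch formatting problems early. This is a hedged sketch rather than a repository utility: `check_response_csv` is a hypothetical helper, it only verifies the `persona_id` and `run` columns named above plus the presence of at least one other column, and the variable names in the toy data (`abany`, `natenvir`) are illustrative.

```python
import csv
import io

# Column names taken from the README; everything else here is an assumption.
REQUIRED_COLUMNS = {"persona_id", "run"}


def check_response_csv(handle):
    """Return (ok, message) for a file-like object holding LLM responses."""
    reader = csv.DictReader(handle)
    header = set(reader.fieldnames or [])
    missing = REQUIRED_COLUMNS - header
    if missing:
        return False, f"missing columns: {sorted(missing)}"
    question_columns = header - REQUIRED_COLUMNS
    if not question_columns:
        return False, "no question columns found"
    n_rows = sum(1 for _ in reader)
    return True, f"{n_rows} rows, {len(question_columns)} question columns"


# Toy input; replace with open("your_model_name.csv", newline="") for a real file.
sample = io.StringIO("persona_id,run,abany,natenvir\n1,1,1,2\n2,1,2,1\n")
print(check_response_csv(sample))
```

Comparing your header against one of the committed CSVs remains the most reliable check, since the sketch does not know the full list of question variables.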

Reproduction Instructions

To reproduce all quantitative results and figures reported in the manuscript, starting from a fresh clone:

git clone https://github.com/cjbarrie/polreason
cd polreason
Rscript analysis/scripts/master.R

All 28+ LLM synthetic response files and pre-computed bootstrap outputs are committed to the repository, so running master.R after cloning is sufficient. Key figure-to-script mappings:

| Manuscript figure | Script | Output |
| --- | --- | --- |
| Saturn plots | v1_b.saturn_plot.R | analysis/viz/[model]/ |
| Constraint violin plots | v2_a.constraint_stats.R | analysis/viz/constraint_violins_2024/ |
| Delta comparisons | v2_b.constraint_stats_delta.R | analysis/output/*.csv |
| Missing-dimension analysis | v3.missing_dimensions.R | analysis/viz/ |
| MVN scatter | v1_a.mvn_plot.R | analysis/viz/mvn_2024/ |

To also regenerate the animated Saturn GIF, uncomment the v1_c.saturn_animation.R line in master.R (adds ~2–5 minutes; requires gganimate and gifski).

Key Findings & Outputs

Constraint Metrics

The analysis produces several key metrics of political constraint:

  1. PC1 Variance Explained: How much variance the first principal component captures
  2. Effective Dependence (D_e): Average absolute polychoric correlation
  3. Missing Dimensions: Analysis of variance unexplained by PC1
  4. Cohen's Kappa: Agreement between different runs of the same persona
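
As a rough illustration of three of these metrics, the sketch below computes them on toy inputs in plain Python. It follows the one-line descriptions above and nothing more: the PC1 share is the largest eigenvalue of a correlation matrix divided by the number of items (found here by power iteration), D_e is taken literally as the mean absolute off-diagonal correlation, and Cohen's kappa is the standard unweighted two-rater formula. The pipeline itself works in R on polychoric correlations (and lists the irr package among its dependencies), so its exact formulas may differ; all function names are hypothetical.

```python
import math


def largest_eigenvalue(matrix, iterations=200):
    """Approximate the top eigenvalue of a symmetric PSD matrix by power iteration."""
    n = len(matrix)
    vec = [1.0 / math.sqrt(n)] * n
    value = 0.0
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0.0:
            return 0.0
        vec = [x / norm for x in product]
        value = norm
    return value


def pc1_variance_share(corr):
    """Share of total variance on the first principal component (trace = n items)."""
    return largest_eigenvalue(corr) / len(corr)


def mean_abs_correlation(corr):
    """Average absolute off-diagonal correlation."""
    n = len(corr)
    pairs = [abs(corr[i][j]) for i in range(n) for j in range(i + 1, n)]
    return sum(pairs) / len(pairs)


def cohen_kappa(run_a, run_b):
    """Unweighted Cohen's kappa between two runs of categorical answers."""
    n = len(run_a)
    observed = sum(a == b for a, b in zip(run_a, run_b)) / n
    labels = set(run_a) | set(run_b)
    expected = sum((run_a.count(l) / n) * (run_b.count(l) / n) for l in labels)
    if expected == 1.0:
        return 1.0  # both runs give the same single answer throughout
    return (observed - expected) / (1 - expected)


# Toy inputs (made up): three items that all correlate at 0.5, two short runs.
toy_corr = [[1.0, 0.5, 0.5], [0.5, 1.0, 0.5], [0.5, 0.5, 1.0]]
print(round(pc1_variance_share(toy_corr), 3))
print(mean_abs_correlation(toy_corr))
print(cohen_kappa([1, 1, 2, 2], [1, 1, 2, 1]))
```

In the toy matrix the eigenvalues are 2, 0.5, and 0.5, so the first component carries two thirds of the variance; higher shared correlation pushes both the PC1 share and the mean absolute correlation up together.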

Visualizations

  • Saturn plots (v1_b.saturn_plot.R): Publication-ready faceted visualization comparing LLM constraint to human (GSS) baseline

    • Faceted layout: each quantile (Q25, Q50, Q75, Q90) shown in separate panel
    • Implements "Option C" highlighting: only models with significant constraint difference (Δ ≥ 0.10) vs GSS are colored
    • Non-highlighted LLMs shown as transparent gray "spaghetti" for context
    • GSS displayed as bold black contours for easy comparison
    • Reference circle (ρ=0) shows independence baseline
    • Parameters: delta_min (threshold), top_n_total (limit highlights)
  • Saturn animation (v1_c.saturn_animation.R): Animated GIF cycling through quantile levels (Q95 → Q5)

    • Shows how constraint contours evolve from tightest (Q95) to weakest (Q5) correlations
    • 19 frames covering full quantile range in 5% increments
    • Top-N most constrained models highlighted throughout animation
    • Smooth transitions with cubic easing
    • Requires: gganimate and gifski packages
    • Optional in master.R (uncomment to generate, takes 2-5 minutes)
  • Constraint violins: Compare constraint levels across models and education groups

  • Polychoric correlation matrices: Triangle plots showing pairwise belief correlations

  • MVN scatter plots: Multivariate normal draws from correlation matrices

  • Missing variance plots: Rayleigh distribution analysis of unexplained variance

All visualizations are saved as PDFs in analysis/viz/.
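
The "Option C" highlighting rule in the Saturn plots can be sketched as a simple filter. The Python below is an interpretation, not a port of v1_b.saturn_plot.R: it assumes Δ is the absolute difference between a model's constraint value and the GSS value, and that `top_n_total` keeps the models with the largest differences. The parameter names `delta_min` and `top_n_total` come from the list above; the function name, the `baseline_key` argument, and the toy numbers are assumptions.

```python
def select_highlights(constraint, baseline_key="gss", delta_min=0.10, top_n_total=5):
    """Pick which models to color: those far enough from the human baseline."""
    baseline = constraint[baseline_key]
    deltas = {
        name: value - baseline
        for name, value in constraint.items()
        if name != baseline_key
    }
    # Assumption: the threshold applies to the absolute difference.
    flagged = [name for name, delta in deltas.items() if abs(delta) >= delta_min]
    flagged.sort(key=lambda name: abs(deltas[name]), reverse=True)
    return flagged[:top_n_total]


# Made-up constraint values for illustration only.
toy_constraint = {
    "gss": 0.30,
    "model_a": 0.45,
    "model_b": 0.33,
    "model_c": 0.60,
    "model_d": 0.18,
}
print(select_highlights(toy_constraint, top_n_total=2))
```

Models that fall below the threshold, such as `model_b` here, would be drawn in gray for context rather than dropped.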

Models Included

The repository includes results for 28+ LLMs across various families:

  • OpenAI: GPT-5, GPT-5-mini, GPT-4o-mini, GPT-OSS-120b
  • Anthropic: Claude Sonnet 4.5, Claude 3.7 Sonnet
  • Google: Gemini 2.5 Flash/Lite, Gemma 3 12B
  • DeepSeek: v3, v3.1, v3.2
  • Meta: Llama 3.1/3.3/4
  • Mistral: Large, Medium, Small, Nemo
  • Others: Qwen, Kimi, Grok, GLM, Cohere, AI21, Allen AI, and more

Plus GSS-2024 human baseline for comparison.

Citation

If you use this code or data, please cite the preprint (BibTeX entry at the top of this README).

Related work:

  • della Posta, D. (2020). "Pluralistic Collapse: The 'Oil Spill' Model of Mass Opinion Polarization." American Sociological Review, 85(3), 507-536.

License

This project is released under the MIT License. You are free to use, modify, and distribute the code with attribution.

Acknowledgments

  • General Social Survey (GSS) data from NORC at the University of Chicago
  • LLM API access via OpenRouter
  • Built on della Posta's constraint measurement framework

Troubleshooting

"Please run this script from the polreason/ root directory"

Make sure you're in the polreason/ directory when running R scripts:

cd polreason/
Rscript analysis/scripts/master.R

Missing GSS data file

Download gss7224_r1.dta from https://gss.norc.org/get-the-data/stata.html and place in generation/data/.

API rate limits

The OpenRouter API has rate limits. The script includes retry logic with exponential backoff. For large runs, consider:

  • Running overnight
  • Using the --max-workers parameter to reduce concurrency
  • Splitting across multiple days
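
For readers writing their own querying script, retry with exponential backoff generally looks like the sketch below. This is not the code in 01_generate_synthetic_GSS.py: `RateLimitError`, `call_with_backoff`, and all default values are hypothetical stand-ins, and a real client would raise its own exception type on a rate-limit response.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the error an API client raises when rate limited."""


def call_with_backoff(request, max_attempts=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Call request(); on RateLimitError wait 1s, 2s, 4s, ... (plus jitter) and retry."""
    for attempt in range(max_attempts):
        try:
            return request()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # A little random jitter keeps parallel workers from retrying in lockstep.
            sleep(delay + random.uniform(0, 0.1 * delay))


# Demo with a fake request that fails twice, then succeeds.
attempts = {"count": 0}


def flaky_request():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RateLimitError("slow down")
    return "ok"


waits = []  # record the delays instead of actually sleeping
result = call_with_backoff(flaky_request, sleep=waits.append)
print(result, len(waits))
```

Passing `sleep` as a parameter makes the helper easy to test without waiting; in normal use the default `time.sleep` applies.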

Memory issues in R

The bootstrap analysis can be memory-intensive. If you encounter issues:

  • Close other applications
  • Reduce B (bootstrap iterations) in analysis/scripts/0.config.R
  • Run models individually instead of the full batch

Development

To modify or extend this project:

  1. Add new models: Edit POPULAR_MODELS in generation/scripts/01_generate_synthetic_GSS.py
  2. Add new questions: Edit GSS_QUESTIONS_* dictionaries in the same file
  3. Modify analysis: See analysis/scripts/0.config.R for parameters (bootstrap iterations, etc.)
  4. Add visualizations: Create new scripts following the v*.R pattern

Repository Metadata

  • Data Year: 2024 (GSS wave)
  • Models: 28+ LLMs + human baseline
  • Survey Items: 52 questions (30 culture-war, 22 non-culture-war)
  • Personas: 1,000 synthetic respondents per model
