Christopher Barrie & Roberto Cerina
@misc{barrie_cerina_2026,
title={Synthetic personas distort the structure of human belief systems},
url={osf.io/preprints/socarxiv/n7fq8_v1},
publisher={SocArXiv},
author={Barrie, Christopher and Cerina, Roberto},
year={2026},
month={Feb}
}
Replication repository for:
Barrie, C. & Cerina, R. (2026). Synthetic personas distort the structure of human belief systems. SocArXiv. https://doi.org/10.31235/osf.io/n7fq8_v1
This repository contains all code, synthetic data, pre-computed outputs, and visualizations needed to reproduce the analysis. The quickest path:
git clone https://github.com/cjbarrie/polreason
cd polreason
Rscript analysis/scripts/master.R
See analysis/README.md, generation/README.md, and docs/replication_release_plan.md for full details.
This project investigates how different LLMs exhibit patterns of political constraint when answering survey questions as synthetic personas. Building on della Posta's (2020) framework for measuring belief constraint, we:
- Generate synthetic personas from real GSS respondent demographics
- Query multiple LLMs with comprehensive attitudinal survey questions
- Analyze constraint patterns using polychoric correlations, principal components analysis, and bootstrap methods
- Compare LLM behavior to human survey responses across 28+ different models
The analysis covers 52 survey questions spanning culture-war issues (abortion, immigration, civil liberties) and non-culture-war domains (institutional confidence, economic outlook, social trust).
polreason/
├── generation/ # Data generation pipeline
│ ├── scripts/ # Python and R scripts for creating synthetic data
│ │ ├── 00a_create_gss_extract_multiyear.R # Extract GSS data by year
│ │ ├── 00b_generate_personas.R # Create natural language personas
│ │ ├── 01_generate_synthetic_GSS.py # Query LLMs (OpenRouter)
│ │ ├── 02_generate_synthetic_GSS_gpt5safe.py # Alternative querying script
│ │ └── test_models.py # Test model availability
│ ├── data/ # Small data files (personas, extracts)
│ │ ├── gss2024_personas.csv # Natural language personas (823KB)
│ │ ├── gss2024_dellaposta_extract.rds # Processed GSS data (164KB)
│ │ └── gss2024_variable_summary.csv # Variable coverage summary
│ └── synthetic_data/ # LLM-generated survey responses
│ └── year_2024/ # 30 CSV files, one per model
│
├── analysis/ # Statistical analysis pipeline
│ ├── README.md # Analysis-specific guide
│ ├── scripts/ # R analysis scripts
│ │ ├── master.R # Main orchestration script
│ │ ├── 0.config.R # Configuration and helpers
│ │ ├── 1.data_shaper.R # Data preparation
│ │ ├── 2.polychor_bootstrap.R # Bootstrap analysis
│ │ ├── v.common_utils.R # Shared visualization utilities
│ │ ├── v1_a.mvn_plot.R # Multivariate normal scatter plots
│ │ ├── v1_b.saturn_plot.R # Saturn plots (faceted, Option C)
│ │ ├── v1_c.saturn_animation.R # Saturn animation (Q95 → Q5 GIF)
│ │ └── v2*.R, v3*.R # Additional visualization scripts
│ ├── output/ # Model results (31 directories)
│ │ ├── gss-2024/ # Human GSS baseline
│ │ ├── anthropic_claude-sonnet-4.5-2024/ # Example model results
│ │ └── ... # Other model outputs
│ ├── viz/ # Visualizations (35+ directories/files)
│ │ ├── constraint_violins_2024/ # Constraint comparisons
│ │ ├── mvn_2024/ # Multivariate normal plots
│ │ └── [model-name]/ # Per-model visualizations
│ └── GSS_PC_explain.json # GSS question metadata
│
├── requirements.txt # Python dependencies
├── r-requirements.txt # R package list
├── docs/ # Release planning and extra documentation
├── .gitignore # Git ignore rules
└── README.md # This file
- Operating systems: macOS 12+, Ubuntu 20.04+, Windows 10/11; developed and tested primarily on macOS 14
- Python 3.8+ (tested on 3.11) — required only for data generation
- R 4.0+ (tested on 4.3.x) — required for all statistical analysis and visualization
- RAM: 8 GB minimum; 16 GB recommended for the full 500-resample bootstrap
- No non-standard hardware required — a standard laptop or desktop is sufficient
- GSS Cumulative Data File (see Data Requirements below) — required only for persona generation, not for reproducing the published analysis
On a normal desktop or laptop computer:
- Python dependencies (pip install -r requirements.txt): ~2 minutes
- R packages (install.packages(...)): ~10–15 minutes (some packages compile from source)
# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Install all required packages
install.packages(c(
"data.table", "lavaan", "miceRanger", "ggplot2",
"grid", "gridExtra", "grDevices", "scales",
"ggnewscale", "irr", "haven", "dplyr",
"stringr", "optparse"
))
# Optional: for Saturn animation (v1_c.saturn_animation.R)
install.packages(c("gganimate", "gifski"))
The GSS cumulative data file (gss7224_r1.dta, 565 MB) is excluded from this repository via .gitignore due to NORC redistribution terms.
If cloning from GitHub, you'll need to download the file separately:
- Download: Visit https://gss.norc.org/get-the-data/stata.html
- File needed: GSS 1972-2024 Cumulative Data (Release 1) in Stata format
- Filename: gss7224_r1.dta
- Location: Place in generation/data/gss7224_r1.dta
Note: If you only want to run the analysis pipeline (not generate new synthetic data), you can skip this step since the synthetic responses are already included in generation/synthetic_data/.
The GSS cumulative .dta file should remain an external download in any public replication release unless its redistribution terms clearly permit bundling.
If you just want to reproduce the analysis using existing synthetic data:
# Navigate to polreason root
cd polreason/
# Run the master analysis script
Rscript analysis/scripts/master.R
This will:
- Load existing synthetic responses from generation/synthetic_data/
- Perform polychoric correlation analysis with bootstrap
- Generate constraint statistics
- Create visualizations in analysis/viz/
To create new synthetic survey responses from LLMs:
# Navigate to the generation scripts directory
cd generation/scripts
# Extract 2024 data
Rscript 00a_create_gss_extract_multiyear.R --year 2024
# Generate natural language personas
Rscript 00b_generate_personas.R
# Return to project root
cd ../..
# Navigate to the generation scripts directory (if not already there)
cd generation/scripts
# Set your OpenRouter API key
export OPENROUTER_API_KEY="your-key-here"
# Query all models with 1000 personas (expensive! ~$100-500 depending on models)
python 01_generate_synthetic_GSS.py --year 2024 --all-models --personas 1000
# Or query specific models
python 01_generate_synthetic_GSS.py --year 2024 --models "anthropic/claude-sonnet-4.5,openai/gpt-5" --personas 1000
# Return to project root
cd ../..
See python 01_generate_synthetic_GSS.py --help for all options.
# From polreason root directory
Rscript analysis/scripts/master.R
The repository ships with everything needed to run a complete demo without API credentials or external downloads.
Demo dataset: 30 pre-generated LLM response CSVs in generation/synthetic_data/year_2024/ (one file per model, ~7–8 MB each, ~52,000 rows each), plus pre-computed bootstrap outputs (.rds files) in analysis/output/. These committed files serve as the demo dataset and eliminate the need for any API calls or long resampling runs.
From the project root:
Rscript analysis/scripts/master.R
master.R sets run_from_scratch <- FALSE, so it loads the pre-computed bootstrap outputs directly rather than re-running the full resampling procedure.
After a successful run you should find the following new or refreshed files:
| Location | Contents |
|---|---|
| analysis/viz/mvn_2024/ | Multivariate-normal scatter PDFs for each model |
| analysis/viz/[model-name]/ | Per-model Saturn plots and supporting figures |
| analysis/viz/constraint_violins_2024/ | Violin plots comparing constraint across models |
| analysis/output/correlation_quantiles_2024.csv | Quantile summaries of pairwise polychoric correlations |
| analysis/output/constraint_bootstrap_stats_2024.csv | Bootstrap constraint metrics (PC1 variance, D_e) |
| analysis/output/constraint_point_stats_2024.csv | Point-estimate constraint metrics |
| Mode | Approximate time |
|---|---|
| Demo (pre-computed bootstrap, visualization only) | 20–40 minutes |
| Full bootstrap from scratch (B=500, B_MI=30, 31 models) | 4–8 hours |
| Quick exploratory run (set B <- 50, B_MI <- 5 in 0.config.R) | ~30 minutes |
Times are estimates for a 2020-era 8-core laptop with 16 GB RAM. The bottleneck for the demo is plot generation across 31 models; for a full bootstrap run it is the polychoric correlation resampling in 2.polychor_bootstrap.R.
To run the demo, use the analysis-only path described above: run Rscript analysis/scripts/master.R from the project root. No API key or additional downloads are required.
To apply the analysis pipeline to a new LLM or a different survey dataset:
- Format your LLM response file to match the existing CSVs in generation/synthetic_data/year_2024/ (columns: persona_id, run, and one column per GSS question variable name).
- Place the CSV in generation/synthetic_data/year_2024/your_model_name.csv.
- Define any new survey questions in analysis/scripts/0.config.R following the existing GSS_QUESTIONS_* pattern.
- From the project root, run:
Rscript analysis/scripts/master.R
The script automatically discovers all CSVs in the year_2024/ directory and processes them.
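Before adding a new response file, it can help to confirm that its header has the identifier columns named above. The following is a minimal sketch of such a check, not part of the repository: check_response_csv is a hypothetical helper, it assumes the wide layout described above (persona_id, run, plus one column per question), and the question names in the example are placeholders.

```python
import csv
import io

# Identifier columns taken from the column description above.
REQUIRED_ID_COLUMNS = ("persona_id", "run")


def check_response_csv(handle, expected_questions):
    """Return a list of problems found in the header of a response CSV.

    `expected_questions` is a hypothetical list of question variable names;
    the real list would come from the pipeline configuration.
    """
    reader = csv.reader(handle)
    header = next(reader, None)
    if header is None:
        return ["file is empty"]
    problems = []
    for col in REQUIRED_ID_COLUMNS:
        if col not in header:
            problems.append(f"missing identifier column: {col}")
    missing = [q for q in expected_questions if q not in header]
    if missing:
        problems.append("missing question columns: " + ", ".join(missing))
    return problems


# Example with an in-memory file; "abany" and "trust" are placeholder names.
sample = io.StringIO("persona_id,run,abany\n1,1,2\n")
print(check_response_csv(sample, ["abany", "trust"]))
```

An empty list means the header passed these checks; it says nothing about whether the response values themselves are valid.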
To reproduce all quantitative results and figures reported in the manuscript from scratch:
git clone https://github.com/cjbarrie/polreason
cd polreason
Rscript analysis/scripts/master.R
All 28+ LLM synthetic response files and pre-computed bootstrap outputs are committed to the repository, so this single command is sufficient. Key figure-to-script mappings:
| Manuscript figure | Script | Output |
|---|---|---|
| Saturn plots | v1_b.saturn_plot.R | analysis/viz/[model]/ |
| Constraint violin plots | v2_a.constraint_stats.R | analysis/viz/constraint_violins_2024/ |
| Delta comparisons | v2_b.constraint_stats_delta.R | analysis/output/*.csv |
| Missing-dimension analysis | v3.missing_dimensions.R | analysis/viz/ |
| MVN scatter | v1_a.mvn_plot.R | analysis/viz/mvn_2024/ |
To also regenerate the animated Saturn GIF, uncomment the v1_c.saturn_animation.R line in master.R (adds ~2–5 minutes; requires gganimate and gifski).
The analysis produces several key metrics of political constraint:
- PC1 Variance Explained: How much variance the first principal component captures
- Effective Dependence (D_e): Average absolute polychoric correlation
- Missing Dimensions: Analysis of variance unexplained by PC1
- Cohen's Kappa: Agreement between different runs of the same persona
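To make the first two metrics concrete, here is a minimal sketch in plain Python, assuming a correlation matrix is already available as a list of lists. It is an illustration of the definitions as stated above, not the repository's R implementation: the function names are hypothetical, D_e is assumed to average over off-diagonal entries only, and the largest eigenvalue is approximated by power iteration rather than a full eigendecomposition.

```python
def effective_dependence(corr):
    """Mean absolute off-diagonal entry of a square correlation matrix."""
    p = len(corr)
    values = [abs(corr[i][j]) for i in range(p) for j in range(p) if i != j]
    return sum(values) / len(values)


def pc1_variance_share(corr, iterations=500):
    """Approximate share of total variance on the first principal component.

    Power iteration approximates the largest eigenvalue, which is then divided
    by the trace (the number of items, for a correlation matrix). The start
    vector is deliberately uneven; a start exactly orthogonal to the leading
    eigenvector would not converge, so real analyses should prefer a proper
    eigendecomposition routine.
    """
    p = len(corr)
    vec = [1.0 / (i + 1) for i in range(p)]
    eigenvalue = 0.0
    for _ in range(iterations):
        product = [sum(corr[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = sum(x * x for x in product) ** 0.5
        if norm == 0.0:
            return 0.0
        vec = [x / norm for x in product]
        # Rayleigh quotient of the normalized vector
        eigenvalue = sum(
            vec[i] * sum(corr[i][j] * vec[j] for j in range(p)) for i in range(p)
        )
    trace = sum(corr[i][i] for i in range(p))
    return eigenvalue / trace


# Toy 3-item matrix with every pairwise correlation equal to 0.5.
toy = [[1.0, 0.5, 0.5], [0.5, 1.0, 0.5], [0.5, 0.5, 1.0]]
print(effective_dependence(toy), pc1_variance_share(toy))
```

For the toy matrix the eigenvalues are 2, 0.5, and 0.5, so the first component carries two thirds of the variance while the average absolute correlation is 0.5. The two metrics measure related but distinct things.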
- Saturn plots (v1_b.saturn_plot.R): Publication-ready faceted visualization comparing LLM constraint to human (GSS) baseline
  - Faceted layout: each quantile (Q25, Q50, Q75, Q90) shown in separate panel
  - Implements "Option C" highlighting: only models with significant constraint difference (Δ ≥ 0.10) vs GSS are colored
  - Non-highlighted LLMs shown as transparent gray "spaghetti" for context
  - GSS displayed as bold black contours for easy comparison
  - Reference circle (ρ=0) shows independence baseline
  - Parameters: delta_min (threshold), top_n_total (limit highlights)
- Saturn animation (v1_c.saturn_animation.R): Animated GIF cycling through quantile levels (Q95 → Q5)
  - Shows how constraint contours evolve from tightest (Q95) to weakest (Q5) correlations
  - 19 frames covering full quantile range in 5% increments
  - Top-N most constrained models highlighted throughout animation
  - Smooth transitions with cubic easing
  - Requires: gganimate and gifski packages
  - Optional in master.R (uncomment to generate, takes 2–5 minutes)
- Constraint violins: Compare constraint levels across models and education groups
- Polychoric correlation matrices: Triangle plots showing pairwise belief correlations
- MVN scatter plots: Multivariate normal draws from correlation matrices
- Missing variance plots: Rayleigh distribution analysis of unexplained variance
All visualizations are saved as PDFs in analysis/viz/.
The repository includes results for 28+ LLMs across various families:
- OpenAI: GPT-5, GPT-5-mini, GPT-4o-mini, GPT-OSS-120b
- Anthropic: Claude Sonnet 4.5, Claude 3.7 Sonnet
- Google: Gemini 2.5 Flash/Lite, Gemma 3 12B
- DeepSeek: v3, v3.1, v3.2
- Meta: Llama 3.1/3.3/4
- Mistral: Large, Medium, Small, Nemo
- Others: Qwen, Kimi, Grok, GLM, Cohere, AI21, Allen AI, and more
Plus GSS-2024 human baseline for comparison.
If you use this code or data, please cite the paper above (BibTeX at the top of this README).
Related work:
- della Posta, D. (2020). "Pluralistic Collapse: The 'Oil Spill' Model of Mass Opinion Polarization." American Sociological Review, 85(3), 507-536.
This project is released under the MIT License. You are free to use, modify, and distribute the code with attribution.
- General Social Survey (GSS) data from NORC at the University of Chicago
- LLM API access via OpenRouter
- Built on della Posta's constraint measurement framework
Make sure you're in the polreason/ directory when running R scripts:
cd polreason/
Rscript analysis/scripts/master.R
Download gss7224_r1.dta from https://gss.norc.org/get-the-data/stata.html and place in generation/data/.
The OpenRouter API has rate limits. The script includes retry logic with exponential backoff. For large runs, consider:
- Running overnight
- Using the --max-workers parameter to reduce concurrency
- Splitting across multiple days
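For readers unfamiliar with the pattern, the sketch below shows what retrying with exponential backoff generally looks like. It is a generic illustration, not the code in 01_generate_synthetic_GSS.py: call_with_backoff and all of its parameters are hypothetical, and a production version would catch only rate-limit or transient network errors rather than every exception.

```python
import random
import time


def call_with_backoff(request, max_attempts=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Call `request()` until it succeeds or the attempts run out.

    The wait doubles after each failure (base_delay, 2x, 4x, ...), is capped
    at max_delay, and gets random jitter so that parallel workers do not all
    retry at the same moment.
    """
    for attempt in range(max_attempts):
        try:
            return request()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))


# Example: a stand-in request that fails twice, then succeeds.
calls = {"n": 0}


def flaky_request():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("rate limited")
    return "ok"


# Pass a no-op sleep so the example finishes instantly.
print(call_with_backoff(flaky_request, sleep=lambda seconds: None))
```

Lowering concurrency and backing off are complementary: fewer workers reduce how often the limit is hit, while backoff controls how the script recovers when it is.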
The bootstrap analysis can be memory-intensive. If you encounter issues:
- Close other applications
- Reduce B (bootstrap iterations) in analysis/scripts/0.config.R
- Run models individually instead of the full batch
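To see why lowering B reduces the work, consider this minimal sketch of a percentile bootstrap. It is a simplified stand-in for what 2.polychor_bootstrap.R does, under stated assumptions: it resamples rows with replacement and recomputes one statistic per resample, and it omits the multiple-imputation step (B_MI) and polychoric estimation entirely. Apart from B, every name here is hypothetical.

```python
import random


def bootstrap_interval(rows, statistic, B=50, alpha=0.05, seed=0):
    """Percentile bootstrap interval for `statistic` over resampled rows.

    Rows (respondents) are drawn with replacement B times, so run time and
    the number of stored estimates both grow linearly with B. A smaller B is
    cheaper but gives a noisier interval.
    """
    rng = random.Random(seed)
    n = len(rows)
    estimates = []
    for _ in range(B):
        resample = [rows[rng.randrange(n)] for _ in range(n)]
        estimates.append(statistic(resample))
    estimates.sort()
    lower = estimates[int((alpha / 2) * (B - 1))]
    upper = estimates[int((1 - alpha / 2) * (B - 1))]
    return lower, upper


def mean(values):
    return sum(values) / len(values)


# Toy data: interval for the mean of 0..99 using 50 resamples.
print(bootstrap_interval(list(range(100)), mean, B=50))
```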
To modify or extend this project:
- Add new models: Edit POPULAR_MODELS in generation/scripts/01_generate_synthetic_GSS.py
- Add new questions: Edit GSS_QUESTIONS_* dictionaries in the same file
- Modify analysis: See analysis/scripts/0.config.R for parameters (bootstrap iterations, etc.)
- Add visualizations: Create new scripts following the v*.R pattern
- Data Year: 2024 (GSS wave)
- Models: 28+ LLMs + human baseline
- Survey Items: 52 questions (30 culture-war, 22 non-culture-war)
- Personas: 1,000 synthetic respondents per model
