Synthetic Personas Distort the Structure of Human Belief Systems

Christopher Barrie & Roberto Cerina

@misc{barrie_cerina_2026,
  title={Synthetic personas distort the structure of human belief systems},
  url={https://osf.io/preprints/socarxiv/n7fq8_v1},
  publisher={SocArXiv},
  author={Barrie, Christopher and Cerina, Roberto},
  year={2026},
  month={Feb}
}

Replication repository for:

Barrie, C. & Cerina, R. (2026). Synthetic personas distort the structure of human belief systems. SocArXiv. https://doi.org/10.31235/osf.io/n7fq8_v1

This repository contains all code, synthetic data, pre-computed outputs, and visualizations needed to reproduce the analysis. The quickest path:

git clone https://github.com/cjbarrie/polreason
cd polreason
Rscript analysis/scripts/master.R

See analysis/README.md, generation/README.md, and docs/replication_release_plan.md for full details.

Overview

This project investigates how different LLMs exhibit patterns of political constraint when answering survey questions as synthetic personas. Building on DellaPosta's (2020) framework for measuring belief constraint, we:

Generate synthetic personas from real GSS respondent demographics
Query multiple LLMs with comprehensive attitudinal survey questions
Analyze constraint patterns using polychoric correlations, principal components analysis, and bootstrap methods
Compare LLM behavior to human survey responses across 28+ different models

The analysis covers 52 survey questions spanning culture-war issues (abortion, immigration, civil liberties) and non-culture-war domains (institutional confidence, economic outlook, social trust).

Repository Structure

polreason/
├── generation/              # Data generation pipeline
│   ├── scripts/            # Python and R scripts for creating synthetic data
│   │   ├── 00a_create_gss_extract_multiyear.R  # Extract GSS data by year
│   │   ├── 00b_generate_personas.R             # Create natural language personas
│   │   ├── 01_generate_synthetic_GSS.py        # Query LLMs (OpenRouter)
│   │   ├── 02_generate_synthetic_GSS_gpt5safe.py  # Alternative querying script
│   │   └── test_models.py                      # Test model availability
│   ├── data/               # Small data files (personas, extracts)
│   │   ├── gss2024_personas.csv                # Natural language personas (823KB)
│   │   ├── gss2024_dellaposta_extract.rds      # Processed GSS data (164KB)
│   │   └── gss2024_variable_summary.csv        # Variable coverage summary
│   └── synthetic_data/     # LLM-generated survey responses
│       └── year_2024/      # 30 CSV files, one per model
│
├── analysis/               # Statistical analysis pipeline
│   ├── README.md           # Analysis-specific guide
│   ├── scripts/            # R analysis scripts
│   │   ├── master.R                            # Main orchestration script
│   │   ├── 0.config.R                          # Configuration and helpers
│   │   ├── 1.data_shaper.R                     # Data preparation
│   │   ├── 2.polychor_bootstrap.R              # Bootstrap analysis
│   │   ├── v.common_utils.R                    # Shared visualization utilities
│   │   ├── v1_a.mvn_plot.R                     # Multivariate normal scatter plots
│   │   ├── v1_b.saturn_plot.R                  # Saturn plots (faceted, Option C)
│   │   ├── v1_c.saturn_animation.R             # Saturn animation (Q95 → Q5 GIF)
│   │   └── v2*.R, v3*.R                        # Additional visualization scripts
│   ├── output/             # Model results (31 directories)
│   │   ├── gss-2024/                           # Human GSS baseline
│   │   ├── anthropic_claude-sonnet-4.5-2024/   # Example model results
│   │   └── ...                                 # Other model outputs
│   ├── viz/                # Visualizations (35+ directories/files)
│   │   ├── constraint_violins_2024/            # Constraint comparisons
│   │   ├── mvn_2024/                           # Multivariate normal plots
│   │   └── [model-name]/                       # Per-model visualizations
│   └── GSS_PC_explain.json # GSS question metadata
│
├── requirements.txt        # Python dependencies
├── r-requirements.txt      # R package list
├── docs/                   # Release planning and extra documentation
├── .gitignore             # Git ignore rules
└── README.md              # This file

Installation

System Requirements

Operating systems: macOS 12+, Ubuntu 20.04+, Windows 10/11; developed and tested primarily on macOS 14
Python 3.8+ (tested on 3.11) — required only for data generation
R 4.0+ (tested on 4.3.x) — required for all statistical analysis and visualization
RAM: 8 GB minimum; 16 GB recommended for the full 500-resample bootstrap
No non-standard hardware required — a standard laptop or desktop is sufficient
GSS Cumulative Data File (see Data Requirements below) — required only for persona generation, not for reproducing the published analysis

Typical Install Time

On a normal desktop or laptop computer:

Python dependencies (pip install -r requirements.txt): ~2 minutes
R packages (install.packages(...)): ~10–15 minutes (some packages compile from source)

Python Environment

# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

R Packages

# Install all required packages
# (grid and grDevices ship with base R, so they are not installed from CRAN)
install.packages(c(
  "data.table", "lavaan", "miceRanger", "ggplot2",
  "gridExtra", "scales", "ggnewscale", "irr",
  "haven", "dplyr", "stringr", "optparse"
))

# Optional: for Saturn animation (v1_c.saturn_animation.R)
install.packages(c("gganimate", "gifski"))

Data Requirements

GSS Cumulative Data File

The GSS cumulative data file (gss7224_r1.dta, 565MB) is excluded from this repository via .gitignore due to NORC redistribution terms.

If cloning from GitHub, you'll need to download the file separately:

Download: Visit https://gss.norc.org/get-the-data/stata.html
File needed: GSS 1972-2024 Cumulative Data (Release 1) in Stata format
Filename: gss7224_r1.dta
Location: Place in generation/data/gss7224_r1.dta

Note: If you only want to run the analysis pipeline (not generate new synthetic data), you can skip this step since the synthetic responses are already included in generation/synthetic_data/.

The GSS cumulative .dta file should remain an external download in any public replication release unless its redistribution terms clearly permit bundling.

Usage

Option 1: Run Analysis Only (Recommended for Exploration)

If you just want to reproduce the analysis using existing synthetic data:

# Navigate to polreason root
cd polreason/

# Run the master analysis script
Rscript analysis/scripts/master.R

This will:

Load existing synthetic responses from generation/synthetic_data/
Perform polychoric correlation analysis with bootstrap
Generate constraint statistics
Create visualizations in analysis/viz/

Option 2: Generate New Synthetic Data

To create new synthetic survey responses from LLMs:

Step 1: Create GSS Extract (requires GSS data file)

# Navigate to the generation scripts directory
cd generation/scripts

# Extract 2024 data
Rscript 00a_create_gss_extract_multiyear.R --year 2024

# Generate natural language personas
Rscript 00b_generate_personas.R

# Return to project root
cd ../..

Step 2: Query LLMs

# Navigate to the generation scripts directory (if not already there)
cd generation/scripts

# Set your OpenRouter API key
export OPENROUTER_API_KEY="your-key-here"

# Query all models with 1000 personas (expensive! ~$100-500 depending on models)
python 01_generate_synthetic_GSS.py --year 2024 --all-models --personas 1000

# Or query specific models
python 01_generate_synthetic_GSS.py --year 2024 --models "anthropic/claude-sonnet-4.5,openai/gpt-5" --personas 1000

# Return to project root
cd ../..

See python 01_generate_synthetic_GSS.py --help for all options.

Step 3: Run Analysis

# From polreason root directory
Rscript analysis/scripts/master.R

Demo

The repository ships with everything needed to run a complete demo without API credentials or external downloads.

Demo dataset: 30 pre-generated LLM response CSVs in generation/synthetic_data/year_2024/ (one file per model, ~7–8 MB each, ~52,000 rows each), plus pre-computed bootstrap outputs (.rds files) in analysis/output/. These committed files serve as the demo dataset and eliminate the need for any API calls or long resampling runs.

Running the Demo

From the project root:

Rscript analysis/scripts/master.R

master.R sets run_from_scratch <- FALSE, so it loads the pre-computed bootstrap outputs directly rather than re-running the full resampling procedure.

Expected Output

After a successful run you should find the following new or refreshed files:

| Location | Contents |
|---|---|
| `analysis/viz/mvn_2024/` | Multivariate-normal scatter PDFs for each model |
| `analysis/viz/[model-name]/` | Per-model Saturn plots and supporting figures |
| `analysis/viz/constraint_violins_2024/` | Violin plots comparing constraint across models |
| `analysis/output/correlation_quantiles_2024.csv` | Quantile summaries of pairwise polychoric correlations |
| `analysis/output/constraint_bootstrap_stats_2024.csv` | Bootstrap constraint metrics (PC1 variance, D_e) |
| `analysis/output/constraint_point_stats_2024.csv` | Point-estimate constraint metrics |

Expected Run Time on a Normal Desktop Computer

| Mode | Approximate time |
|---|---|
| Demo (pre-computed bootstrap, visualization only) | 20–40 minutes |
| Full bootstrap from scratch (B=500, B_MI=30, 31 models) | 4–8 hours |
| Quick exploratory run (set `B <- 50`, `B_MI <- 5` in `0.config.R`) | ~30 minutes |

Times are estimates for a 2020-era 8-core laptop with 16 GB RAM. The bottleneck for the demo is plot generation across 31 models; for a full bootstrap run it is the polychoric correlation resampling in 2.polychor_bootstrap.R.
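
Each bootstrap resample recomputes the correlations on a fresh draw of respondents, which is why the full run takes so much longer than the demo. The following is a minimal sketch of a percentile bootstrap for a single pairwise correlation. It assumes Pearson correlation as a simple stand-in for the polychoric estimator and skips the multiple-imputation step (B_MI). The function names and toy data are hypothetical and are not taken from 2.polychor_bootstrap.R.

```python
import random

def pearson(xs, ys):
    """Pearson correlation; used here only as a stand-in for polychoric."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # degenerate resample with no variation
    return sxy / (sxx * syy) ** 0.5

def bootstrap_correlation(rows, B=50, seed=1):
    """Resample respondents (rows) with replacement B times; return sorted draws."""
    rng = random.Random(seed)
    n = len(rows)
    draws = []
    for _ in range(B):
        sample = [rows[rng.randrange(n)] for _ in range(n)]
        xs, ys = zip(*sample)
        draws.append(pearson(xs, ys))
    return sorted(draws)

# Toy data (hypothetical): two ordinal items on a 1-5 scale, 200 respondents.
rng = random.Random(0)
rows = []
for _ in range(200):
    a = rng.randint(1, 5)
    b = min(5, max(1, a + rng.choice([-1, 0, 0, 1])))
    rows.append((a, b))

draws = bootstrap_correlation(rows, B=50)
# Rough 95% percentile interval taken from the sorted bootstrap draws.
lower = draws[int(0.025 * len(draws))]
upper = draws[int(0.975 * len(draws)) - 1]
print(f"95% percentile interval: [{lower:.3f}, {upper:.3f}]")
```

Because every resample is independent work, lowering B (as in the quick exploratory mode) trades interval precision for run time.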

Instructions for Use

Running on the Provided Synthetic Data

Use Option 1 above — simply run Rscript analysis/scripts/master.R from the project root. No API key or additional downloads are required.

Running on Your Own LLM Survey Responses

To apply the analysis pipeline to a new LLM or a different survey dataset:

Format your LLM response file to match the existing CSVs in generation/synthetic_data/year_2024/ (columns: persona_id, run, and one column per GSS question variable name).
Place the CSV in generation/synthetic_data/year_2024/your_model_name.csv.
Define any new survey questions in analysis/scripts/0.config.R following the existing GSS_QUESTIONS_* pattern.
From the project root, run:

Rscript analysis/scripts/master.R

The script automatically discovers all CSVs in the year_2024/ directory and processes them.
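
Before dropping a new file into year_2024/, it can help to confirm that the header matches the expected layout. The following is a minimal sketch of such a check, assuming only what is stated above (persona_id, run, and one column per question variable). The helper name check_response_csv and the two question names in the example are hypothetical; the helper is not part of this repository.

```python
import csv
import io

REQUIRED_ID_COLUMNS = ("persona_id", "run")  # column names given in this README

def check_response_csv(handle, question_columns):
    """Return a list of header problems for a response CSV (empty list = OK)."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    problems = []
    for col in REQUIRED_ID_COLUMNS:
        if col not in header:
            problems.append(f"missing id column: {col}")
    for col in question_columns:
        if col not in header:
            problems.append(f"missing question column: {col}")
    return problems

# Hypothetical two-question example; real files carry one column per GSS item.
sample = io.StringIO("persona_id,run,abany,trust\n1,1,2,1\n2,1,1,2\n")
print(check_response_csv(sample, ["abany", "trust"]))  # prints []
```

For a file on disk, pass open(path, newline="") as the handle, together with the list of question variable names your analysis expects.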

Reproduction Instructions

To reproduce all quantitative results and figures reported in the manuscript from a fresh clone:

git clone https://github.com/cjbarrie/polreason
cd polreason
Rscript analysis/scripts/master.R

All 28+ LLM synthetic response files and pre-computed bootstrap outputs are committed to the repository, so this single command is sufficient. Key figure-to-script mappings:

| Manuscript figure | Script | Output |
|---|---|---|
| Saturn plots | `v1_b.saturn_plot.R` | `analysis/viz/[model]/` |
| Constraint violin plots | `v2_a.constraint_stats.R` | `analysis/viz/constraint_violins_2024/` |
| Delta comparisons | `v2_b.constraint_stats_delta.R` | `analysis/output/*.csv` |
| Missing-dimension analysis | `v3.missing_dimensions.R` | `analysis/viz/` |
| MVN scatter | `v1_a.mvn_plot.R` | `analysis/viz/mvn_2024/` |

To also regenerate the animated Saturn GIF, uncomment the v1_c.saturn_animation.R line in master.R (adds ~2–5 minutes; requires gganimate and gifski).

Key Findings & Outputs

Constraint Metrics

The analysis produces several key metrics of political constraint:

PC1 Variance Explained: How much variance the first principal component captures
Effective Dependence (D_e): Average absolute polychoric correlation
Missing Dimensions: Analysis of variance unexplained by PC1
Cohen's Kappa: Agreement between different runs of the same persona
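
The first two metrics can be illustrated on a small correlation matrix. This is a sketch under stated assumptions, not the pipeline's R implementation. D_e is taken as the mean absolute off-diagonal correlation, following the description above (averaging over off-diagonal pairs is an assumption). The PC1 share is the largest eigenvalue divided by the number of items, since the eigenvalues of a correlation matrix sum to its dimension. Power iteration stands in for a full eigendecomposition, and the function names are hypothetical.

```python
import random

def effective_dependence(corr):
    """Mean absolute off-diagonal correlation (assumed reading of D_e)."""
    p = len(corr)
    pairs = [abs(corr[i][j]) for i in range(p) for j in range(i + 1, p)]
    return sum(pairs) / len(pairs)

def pc1_variance_share(corr, iterations=500, seed=0):
    """Share of variance on the first principal component of a correlation matrix."""
    p = len(corr)
    rng = random.Random(seed)
    # Random positive start vector so it is not orthogonal to the top eigenvector.
    v = [rng.random() + 0.1 for _ in range(p)]
    for _ in range(iterations):
        w = [sum(corr[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # Rayleigh quotient v'Rv gives the largest eigenvalue for the unit vector v.
    rv = [sum(corr[i][j] * v[j] for j in range(p)) for i in range(p)]
    top_eigenvalue = sum(v[i] * rv[i] for i in range(p))
    return top_eigenvalue / p  # eigenvalues of a correlation matrix sum to p

# Toy 3-item matrix with every pairwise correlation equal to 0.5.
R = [[1.0, 0.5, 0.5],
     [0.5, 1.0, 0.5],
     [0.5, 0.5, 1.0]]
print(round(effective_dependence(R), 3), round(pc1_variance_share(R), 3))  # prints 0.5 0.667
```

In this toy case the largest eigenvalue is 1 + 2(0.5) = 2, so PC1 captures 2/3 of the variance. Higher values of either metric indicate a more tightly constrained set of answers.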

Visualizations

Saturn plots (v1_b.saturn_plot.R): Publication-ready faceted visualization comparing LLM constraint to human (GSS) baseline
- Faceted layout: each quantile (Q25, Q50, Q75, Q90) shown in separate panel
- Implements "Option C" highlighting: only models with significant constraint difference (Δ ≥ 0.10) vs GSS are colored
- Non-highlighted LLMs shown as transparent gray "spaghetti" for context
- GSS displayed as bold black contours for easy comparison
- Reference circle (ρ=0) shows independence baseline
- Parameters: delta_min (threshold), top_n_total (limit highlights)
Saturn animation (v1_c.saturn_animation.R): Animated GIF cycling through quantile levels (Q95 → Q5)
- Shows how constraint contours evolve from tightest (Q95) to weakest (Q5) correlations
- 19 frames covering full quantile range in 5% increments
- Top-N most constrained models highlighted throughout animation
- Smooth transitions with cubic easing
- Requires: gganimate and gifski packages
- Optional in master.R (uncomment to generate, takes 2-5 minutes)
Constraint violins: Compare constraint levels across models and education groups
Polychoric correlation matrices: Triangle plots showing pairwise belief correlations
MVN scatter plots: Multivariate normal draws from correlation matrices
Missing variance plots: Rayleigh distribution analysis of unexplained variance

Static visualizations are saved as PDFs in analysis/viz/; the optional Saturn animation is written as a GIF.

Models Included

The repository includes results for 28+ LLMs across various families:

OpenAI: GPT-5, GPT-5-mini, GPT-4o-mini, GPT-OSS-120b
Anthropic: Claude Sonnet 4.5, Claude 3.7 Sonnet
Google: Gemini 2.5 Flash/Lite, Gemma 3 12B
DeepSeek: v3, v3.1, v3.2
Meta: Llama 3.1/3.3/4
Mistral: Large, Medium, Small, Nemo
Others: Qwen, Kimi, Grok, GLM, Cohere, AI21, Allen AI, and more

Plus GSS-2024 human baseline for comparison.

Citation

If you use this code or data, please cite the paper using the BibTeX entry at the top of this README.

Related work:

DellaPosta, D. (2020). "Pluralistic Collapse: The 'Oil Spill' Model of Mass Opinion Polarization." American Sociological Review, 85(3), 507-536.

License

This project is released under the MIT License. You are free to use, modify, and distribute the code with attribution.

Acknowledgments

General Social Survey (GSS) data from NORC at the University of Chicago
LLM API access via OpenRouter
Built on DellaPosta's constraint measurement framework

Troubleshooting

"Please run this script from the polreason/ root directory"

Make sure you're in the polreason/ directory when running R scripts:

cd polreason/
Rscript analysis/scripts/master.R

Missing GSS data file

Download gss7224_r1.dta from https://gss.norc.org/get-the-data/stata.html and place it in generation/data/.

API rate limits

The OpenRouter API has rate limits. The script includes retry logic with exponential backoff. For large runs, consider:

Running overnight
Using the --max-workers parameter to reduce concurrency
Splitting across multiple days
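
For readers unfamiliar with the pattern, retry with exponential backoff can be sketched as follows. This illustrates the general technique only; it is not the code in 01_generate_synthetic_GSS.py. The RateLimitError class, the call_with_backoff helper, and its parameters are all hypothetical.

```python
import random
import time

class RateLimitError(Exception):
    """Hypothetical stand-in for a rate-limit (HTTP 429) response."""

def call_with_backoff(request, max_retries=5, base_delay=1.0, jitter=0.5,
                      sleep=time.sleep):
    """Call request(); on RateLimitError wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_retries + 1):
        try:
            return request()
        except RateLimitError:
            if attempt == max_retries:
                raise  # give up after the final retry
            # Double the wait on each attempt; jitter spreads out concurrent workers.
            delay = base_delay * (2 ** attempt) + random.uniform(0, jitter)
            sleep(delay)

# Demo: a fake request that is rate-limited twice, then succeeds.
attempts = {"count": 0}

def flaky_request():
    attempts["count"] += 1
    if attempts["count"] <= 2:
        raise RateLimitError()
    return "ok"

waits = []  # record the delays instead of actually sleeping
result = call_with_backoff(flaky_request, jitter=0.0, sleep=waits.append)
print(result, waits)  # prints: ok [1.0, 2.0]
```

Lowering concurrency reduces how often the backoff path is triggered, which is why it appears among the suggestions above.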

Memory issues in R

The bootstrap analysis can be memory-intensive. If you encounter issues:

Close other applications
Reduce B (bootstrap iterations) in analysis/scripts/0.config.R
Run models individually instead of the full batch

Development

To modify or extend this project:

Add new models: Edit POPULAR_MODELS in generation/scripts/01_generate_synthetic_GSS.py
Add new questions: Edit GSS_QUESTIONS_* dictionaries in the same file
Modify analysis: See analysis/scripts/0.config.R for parameters (bootstrap iterations, etc.)
Add visualizations: Create new scripts following the v*.R pattern

Repository Metadata

Data Year: 2024 (GSS wave)
Models: 28+ LLMs + human baseline
Survey Items: 52 questions (30 culture-war, 22 non-culture-war)
Personas: 1,000 synthetic respondents per model
