ASQ-PHI (Adversarial Synthetic Queries for Protected Health Information de-identification) is a benchmark dataset of 1,051 fully synthetic clinical search queries designed to test HIPAA-compliant de-identification software, specifically targeting the "Safe Handoff" problem in clinical LLM workflows.
Important note: All content is 100% synthetic. No real patient data were used.
Hospitals are beginning to deploy HIPAA-compliant Business Associate Agreement (BAA) large language models (LLMs). In the public setting, LLMs with fixed training cutoffs are routinely augmented with tools such as web search, deep research, and Model Context Protocol (MCP) servers so they can reach up-to-date knowledge. BAA LLMs, by contrast, almost never expose live web search or external tools, even though clinicians expect LLMs to surface current guidelines, drug safety updates, and literature. The constraint is that any query leaving a BAA-protected LLM for an external service must be free of Protected Health Information (PHI). We refer to this boundary as the Safe Handoff: the moment when a clinician’s PHI-containing query, generated inside a HIPAA-compliant BAA LLM, must be transformed into a HIPAA Safe Harbor–compliant version before being sent to non-BAA tools such as web search APIs, external evidence services, or MCP servers. Existing de-identification datasets are built from long electronic health record narratives rather than the short, compressed search queries clinicians type into LLM chat interfaces, so they do not allow de-identification performance to be tested at this Safe Handoff. ASQ-PHI (Adversarial Synthetic Queries for Protected Health Information de-identification) is constructed to supply this missing data: a benchmark of 1,051 fully synthetic clinical search queries with ground-truth PHI annotations for stress-testing HIPAA-compliant de-identification software. All queries were generated using Azure OpenAI GPT-4o. No real patient data were used.
Current de-identification systems fail at the Safe Handoff from LLMs running inside HIPAA BAAs to external tools in two ways:
-
Leaking PHI
-
Over-redacting non-identifying clinical information reducing query utility.
Many systems flag "18-year-old female" as PHI and redact the age, despite age under 90 being explicitly non-identifying under Safe Harbor.
Original query:
Bleeding recommendations for 18-year-old female with iron deficiency anemia
After over-aggressive de-identification:
Bleeding recommendations for [AGE] female with iron deficiency anemia
Result: Clinically less usefull. Search returns less specific results for adolescent female specific guidance.
ASQ-PHI measures this utility-erosion behavior with 219 hard negative queries containing zero PHI but mimicking PHI structure.
/
├── code/
│ ├── .env.example # Template for Azure OpenAI credentials
│ └── data_generation_pipeline.ipynb # Notebook for generating ASQ-PHI
├── data/
│ ├── synthetic_clinical_queries.txt # The core ASQ-PHI dataset (1,051 queries)
│ └── dataset_statistics.txt # Metadata and distribution stats
├── figures/ # Descriptive and validation figures (Figs 1-6)
│ ├── fig1_data_structure_example.png
│ ├── fig2_phi_type_distribution.png
│ ├── fig3_dataset_composition.png
│ ├── fig4_phi_density_distribution.png
│ ├── fig5_leakage_recall_baseline.png
│ └── fig6_overredaction_baseline.png
├── validation_results/ # Baseline metrics for Amazon Comprehend Medical
│ ├── positive_metrics_by_threshold.csv
│ └── negative_metrics_by_threshold.csv
├── requirements.txt # Python dependencies
├── LICENSE
└── README.md
data/synthetic_clinical_queries.txt contains 1,051 clinical search queries using a delimiter-based format:
PHI annotations follow HIPAA Safe Harbor categories: NAME, GEOGRAPHIC_LOCATION, DATE, MEDICAL_RECORD_NUMBER, HEALTH_PLAN_BENEFICIARY_NUMBER, PHONE_NUMBER, SOCIAL_SECURITY_NUMBER, EMAIL_ADDRESS, etc.
Clinical context (age <90, diagnoses like "COPD", medications, symptoms) is intentionally not labeled as PHI to enable over-redaction measurement.
- 832 PHI-positive queries (79.2%): Contain 1-5 HIPAA identifiers (mean 3.57, median 4.0)
- 219 hard negatives (20.8%): Zero PHI, adversarial structure
- 2,973 total PHI elements across 13 identifier types
Top 3 identifier types:
- GEOGRAPHIC_LOCATION: 826 (27.8%)
- NAME: 814 (27.4%)
- DATE: 806 (27.1%)
See data/dataset_statistics.txt for full distribution.
We tested ASQ-PHI using Amazon Comprehend Medical (DetectPHI) across confidence thresholds (0.0 to 0.8). Results demonstrate severe recall-utility tradeoffs:
| Threshold | Recall | Leaked PHI | Over-redaction (Hard Negatives) |
|---|---|---|---|
| 0.0 | 98.7% | 47 elements | 90.0% |
| 0.3 | 96.8% | 96 elements | 89.6% |
| 0.5 | 95.2% | 143 elements | 88.4% |
| 0.8 | 88.1% | 351 elements | 86.8% |
Key finding: Even at maximum sensitivity (threshold 0.0), the system leaks 47 PHI elements. At clinical utility-preserving thresholds (0.8), 351 PHI elements leak while still over-redacting 86.8% of hard negatives.
These results characterize ASQ-PHI's difficulty and confirm it identifies edge cases not captured by EHR-trained models.
pip install -r requirements.txtTo generate new synthetic queries using the same adversarial pipeline:
-
Configure Azure OpenAI credentials (copy
.env.exampleto.env):AZURE_OPENAI_API_KEY_4o=your_key AZURE_OPENAI_ENDPOINT_4o=your_endpoint AZURE_OPENAI_DEPLOYMENT_4o=your_deployment
-
Run the notebook:
jupyter notebook code/data_generation_pipeline.ipynb
-
Customize for domain-specific variants:
- Edit the few-shot examples in the system prompt cell
- Adjust PHI type distributions or density targets
- Generate oncology-focused, cardiology-focused, or pediatrics-focused variants
ASQ-PHI is designed for:
- Healthcare AI developers: Test de-identification layers before deploying LLM-powered clinical decision support with external tools
- MCP server developers: Validate HIPAA-compliant tool integrations for clinical LLM workflows
- Web search API providers: Benchmark PHI detection in real-time query filtering
- Privacy researchers: Develop and test novel de-identification algorithms on adversarial queries
- Compliance teams: Evaluate vendor claims about HIPAA-compliant external tool integrations
Most teams need a domain-specific, regulation-specific safe-handoff benchmark but do not have time to design one from scratch. ASQ-PHI is therefore built as a template: you keep the data format, validators, and evaluation code, and you swap in your own language, regulation, and clinical setting.
By forking this repo and editing a single notebook, you can generate variants that differ in:
-
Language
e.g. English, Spanish, French, German. -
Regulatory framework / PHI schema
HIPAA Safe Harbor, GDPR-style identifiers, national ID schemes, or an internal PHI/PII list. -
Clinical specialty and setting
Oncology, cardiology, pediatrics, psychiatry, radiology, ICU, primary care, etc. -
Prompting / query style
Short LLM chat prompts, order-entry text, triage descriptions, internal ticket text, tool-calling prompts. -
Model
GPT-4o / GPT-4.1, other hosted models, or local/open-weight models.
All variants still use the same QUERY / PHI_TAGS structure and the same evaluation metrics, so results remain comparable.
- Fork this repo to your GitHub organization.
- Copy
code/data_generation_pipeline.ipynband rename it (e.g.data_generation_oncology_es.ipynb).
- Switch to your target language.
- Describe your clinical setting and question style.
- List the identifier types you want the model to emit.
- Do not change the output structure:
===QUERY===
===PHI_TAGS=== {"identifier_type": "...", "value": "..."}
- In the notebook, adjust the allowed
identifier_typevalues
(for example addNATIONAL_ID,HOSPITAL_NAME,POSTAL_CODE, etc.). - The existing validation code will enforce this schema and reject malformed examples.
- Change the client config to your chosen model (GPT-4o, Claude, MedLM, local LLM, …).
- Run the notebook to:
- Generate synthetic queries + PHI tags
- Validate JSON and span alignment
- Ensure hard negatives contain zero identifiers
- Export a cleaned dataset + summary statistics
- Examples:
ASQ-PHI-Oncology,ASQ-PHI-ES-GDPR,ASQ-PHI-Pediatrics-Triage. - In your README or paper, state the language, regulation, specialty, and query style.
Forking ASQ-PHI instead of building a new pipeline gives you:
- A ready-made, tested safe-handoff framework that never touches real patient data.
- Region- and specialty-specific benchmarks that still share a common format and metrics, enabling fair comparison.
- A clear, auditable story for privacy and compliance teams: how data was generated, which identifiers were included, and how hard negatives were constructed.
ASQ-PHI is meant to be a base layer, not a one-off dataset.
Fork it, adapt it to your regulatory and clinical reality, and use the same framework to stress-test your own de-identification system at the BAA safe-handoff boundary.
If you use ASQ-PHI for benchmarking or research, please cite:
@dataset{weatherhead2025asqphi,
title={ASQ-PHI: An Adversarial Synthetic Benchmark for Clinical De-Identification and Search Utility},
author={Weatherhead, James},
year={2025},
publisher={Mendeley Data},
version={V1},
doi={10.17632/csz5dzp7nx.1}
}Mendeley Data: https://doi.org/10.17632/csz5dzp7nx.1
MIT License. This dataset is fully synthetic and open for academic and commercial use.
James Weatherhead
MD-PhD Student, Automated Science Track
University of Texas Medical Branch at Galveston
jacweath@utmb.edu
Dataset generated using Azure OpenAI GPT-4o. Baseline validation performed using Amazon Comprehend Medical (DetectPHI). All content is synthetic; no real patient data were used in creation or validation.





