BioScope Ingestion is the data-collection layer for a pharma intelligence platform.
Core objective:
Track pipeline and regulatory momentum of target pharma companies over time by continuously collecting and normalizing source data.
What this project currently does:
- Crawls ClinicalTrials.gov, FDA openFDA, and EMA RSS sources.
- Supports incremental ingestion using source watermarks and conditional HTTP requests.
- Normalizes output records into a common ingestion envelope.
- Writes normalized events to local JSONL files under ./out.
- Applies local deduplication using stable IDs with fallback fingerprint dedup when IDs are missing.
- Uses conservative AutoThrottle and retry defaults so crawls stay polite on shared/public sources.
- Persists source crawl state in SQLite so repeated runs are efficient and idempotent.
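As an illustration of the last point, a per-source state table could look roughly like the sketch below. The table name, column names, and helper functions are hypothetical and are not taken from src/common/state_store.py; the only assumption carried over from this README is that SQLite holds HTTP validators and a last_seen watermark per source.

```python
import sqlite3
from typing import Optional

# Hypothetical schema: one row per source with HTTP validators and a watermark.
SCHEMA = """
CREATE TABLE IF NOT EXISTS source_state (
    source TEXT PRIMARY KEY,
    etag TEXT,
    last_modified TEXT,
    last_seen TEXT
)
"""


def open_store(path: str) -> sqlite3.Connection:
    """Open (or create) the state database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.row_factory = sqlite3.Row
    conn.execute(SCHEMA)
    return conn


def save_state(conn, source, etag=None, last_modified=None, last_seen=None) -> None:
    """Replace the row for a source, so repeated runs never duplicate it."""
    conn.execute(
        "INSERT OR REPLACE INTO source_state "
        "(source, etag, last_modified, last_seen) VALUES (?, ?, ?, ?)",
        (source, etag, last_modified, last_seen),
    )
    conn.commit()


def load_state(conn, source) -> Optional[dict]:
    """Return the stored state for a source, or None on the first run."""
    row = conn.execute(
        "SELECT * FROM source_state WHERE source = ?", (source,)
    ).fetchone()
    return dict(row) if row else None
```

Keying the table on the source name is what would make a second run overwrite state rather than accumulate it.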
What this repository does not include yet:
- Downstream NLP/enrichment services.
- Search/indexing APIs.
- Product UI/dashboard.
- Production observability and SLO tooling.
Repositories are separated by context: this repo is ingestion-only, and other components (processing, APIs, UI, analytics) live in separate repos.
Stage: Ingestion MVP complete, now entering reliability and production-hardening phase.
- Milestone 1: complete.
- Milestone 2: complete.
- Milestone 3: started (structured metrics/logging + backfill mode in start script).
This means source ingestion correctness is in good shape for development and pilot usage, while production readiness still needs validation, observability, and test coverage work.
Completed:
- Scrapy-based ingestion pipeline for BioScope backend
- ClinicalTrials.gov API spider with optional company filtering (TARGET_COMPANY)
- FDA openFDA JSON spider for regulatory updates
- EMA RSS spider wired to active feed URL
- Incremental crawling support for FDA JSON + EMA RSS (ETag, If-Modified-Since, last_seen)
- ClinicalTrials incremental crawling with watermark + cursor pagination cutoff
- 304 response handling wired through spider runtime (not dropped by middleware)
- Persistent source-state store (SQLite)
- Local JSONL sink for development
- Persistent local deduplication by identifiers.nct_id
- Fallback fingerprint dedup for records without stable IDs (LOCAL_DEDUP_FALLBACK_ENABLED)
- Company/drug canonicalization helpers with optional alias maps from env
- Structured per-run metrics written to JSONL (METRICS_OUTPUT_PATH)
- Structured item-ingestion logs (JSON log lines)
- Backfill execution mode in startup script (--backfill)
- Pydantic schema validation in ingestion pipeline (configurable mode)
- AutoThrottle and retry defaults for polite crawling
- Automated startup script with --all, --spider, --reset-state, --skip-install
- Docker and GitHub Actions scaffolding
Remaining / upcoming:
- Basic unit tests for schema + normalization utilities
- Expand unit coverage for spiders/pipeline edge cases
- Prometheus/monitoring integration for ingestion metrics
- Better company-focused filtering for FDA/EMA sources
- Integration tests for spider outputs and sink behavior
.
├── .github/workflows/ci.yml
├── dags/                      # Airflow DAGs (optional runtime)
├── docker-compose.yml
├── Dockerfile
├── requirements.txt
├── scrapy.cfg
├── src/
│   ├── bioscope_ingestion/
│   │   ├── __init__.py
│   │   ├── items.py
│   │   ├── pipelines.py
│   │   ├── settings.py
│   │   └── spiders/
│   │       ├── __init__.py
│   │       ├── clinicaltrials_api_spider.py
│   │       ├── ema_rss_spider.py
│   │       ├── fda_openfda_spider.py
│   │       └── fda_rss_spider.py
│   └── common/
│       ├── __init__.py
│       ├── config.py
│       ├── local_sink.py
│       ├── normalization.py
│       └── state_store.py
└── .env.example
One-command local start:
bash start.sh
Automated script options:
# Run default spider (clinicaltrials_api)
bash start.sh
# Run all spiders sequentially
bash start.sh --all
# Backfill run (disables incremental cutoffs for this execution)
bash start.sh --all --backfill
# Run one specific spider
bash start.sh --spider ema_rss
# Reset local output/state before running
bash start.sh --all --reset-state
# Skip dependency installation check
bash start.sh --skip-install
# Show help
bash start.sh --help
- --spider <name>: run one specific spider (default: clinicaltrials_api)
- --all: run clinicaltrials_api, fda_openfda, and ema_rss sequentially
- --backfill: run the current crawl without incremental cutoffs (INCREMENTAL_ENABLED=false for the process)
- --reset-state: remove local output and incremental state files before crawling
- --skip-install: skip requirements installation/hash check
- --help: print script usage and options
1. Create a virtualenv and install dependencies:
   python -m venv .venv
   source .venv/bin/activate
   pip install -r requirements.txt
2. Copy env vars:
   cp .env.example .env
3. Run a spider:
   PYTHONPATH=src scrapy crawl clinicaltrials_api
4. Run unit tests:
   pytest -q
- LOCAL_SINK_PATH: JSONL path for ingestion output (default: ./out/ingestion.jsonl)
- LOCAL_DEDUP_KEY: field path used to deduplicate local output (default: identifiers.nct_id)
- LOCAL_DEDUP_FALLBACK_ENABLED: when the LOCAL_DEDUP_KEY field is missing from a record, dedup using a stable fingerprint fallback (default: true)
- LOCAL_DEDUP_STATE_PATH: sidecar file that stores seen dedup keys
- INCREMENTAL_ENABLED: turn source-level incremental crawling on/off
- STATE_STORE_PATH: SQLite state file used for source watermarks and HTTP cache headers
- AUTOTHROTTLE_ENABLED: enable Scrapy AutoThrottle (default: true)
- AUTOTHROTTLE_START_DELAY: initial delay in seconds for AutoThrottle (default: 1)
- AUTOTHROTTLE_MAX_DELAY: max delay in seconds for AutoThrottle (default: 10)
- AUTOTHROTTLE_TARGET_CONCURRENCY: target concurrent requests per server (default: 1)
- DOWNLOAD_DELAY: fixed download delay used alongside AutoThrottle (default: 1)
- CONCURRENT_REQUESTS: total concurrent requests across the crawler (default: 8)
- CONCURRENT_REQUESTS_PER_DOMAIN: concurrent requests per domain (default: 2)
Each data source has customized throttle settings tuned to its API/feed characteristics. Set these in .env to override global throttle defaults for each source:
ClinicalTrials.gov (large paginated API):
- CLINICALTRIALS_DOWNLOAD_DELAY: delay in seconds between requests (default: 2)
- CLINICALTRIALS_CONCURRENT_REQUESTS_PER_DOMAIN: concurrent requests (default: 1)
FDA openFDA API (government API):
- FDA_DOWNLOAD_DELAY: delay in seconds (default: 1)
- FDA_CONCURRENT_REQUESTS_PER_DOMAIN: concurrent requests (default: 2)
FDA RSS (RSS feed):
- FDA_RSS_DOWNLOAD_DELAY: delay in seconds (default: 1)
- FDA_RSS_CONCURRENT_REQUESTS_PER_DOMAIN: concurrent requests (default: 2)
EMA RSS (typically single request per run):
- EMA_DOWNLOAD_DELAY: delay in seconds (default: 0)
- EMA_CONCURRENT_REQUESTS_PER_DOMAIN: concurrent requests (default: 1)
These settings are applied per-spider in their custom_settings and override global throttle defaults when the spider runs.
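A minimal sketch of how a spider might derive its custom_settings from these variables is shown below. The helper name source_throttle_settings is hypothetical and not taken from this repository; DOWNLOAD_DELAY and CONCURRENT_REQUESTS_PER_DOMAIN are standard Scrapy setting names.

```python
import os


def source_throttle_settings(prefix, default_delay, default_per_domain, env=None):
    """Build per-spider throttle overrides from <PREFIX>_* environment variables.

    Hypothetical helper: falls back to the documented defaults when a
    variable is unset.
    """
    env = os.environ if env is None else env
    return {
        "DOWNLOAD_DELAY": float(
            env.get(f"{prefix}_DOWNLOAD_DELAY", default_delay)
        ),
        "CONCURRENT_REQUESTS_PER_DOMAIN": int(
            env.get(f"{prefix}_CONCURRENT_REQUESTS_PER_DOMAIN", default_per_domain)
        ),
    }


# In a spider class body this could be used as:
#     custom_settings = source_throttle_settings("CLINICALTRIALS", 2, 1)
```

Because custom_settings is evaluated when the spider class is defined, values changed in .env only take effect on the next process start.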
- RETRY_ENABLED: enable retries for transient HTTP failures (default: true)
- RETRY_TIMES: number of retry attempts for transient HTTP failures (default: 3)
- METRICS_ENABLED: emit per-spider run metrics to JSONL file (default: true)
- METRICS_OUTPUT_PATH: JSONL path for ingestion run metrics
- VALIDATION_ENABLED: enable or disable Pydantic validation in the pipeline (default: true)
- VALIDATION_MODE: behavior on validation error: drop, strict, or warn (default: drop)
- STRUCTURED_LOGS_ENABLED: emit JSON structured logs for ingested items (default: true)
- STRUCTURED_LOG_EVERY_N_ITEMS: emit one structured item log every N processed items (default: 100)
- TARGET_COMPANY: optional company filter for ClinicalTrials scraping
- COMPANY_CANONICAL_MAP_JSON: optional JSON object mapping company aliases to canonical names
- DRUG_CANONICAL_MAP_JSON: optional JSON object mapping drug aliases to canonical names
- CLINICALTRIALS_QUERY: default query term (e.g., diabetes)
- CLINICALTRIALS_PAGE_SIZE: number of records requested per ClinicalTrials API page (default: 50)
- CLINICALTRIALS_SORT: ClinicalTrials API sort order (default: LastUpdatePostDate:desc)
- CLINICALTRIALS_PAGINATION_CUTOFF_ENABLED: stop paging when incremental watermark cutoff is reached (default: true)
- FDA_JSON_URL: openFDA JSON endpoint (example: https://api.fda.gov/drug/enforcement.json?limit=100)
- EMA_RSS_URL: EMA RSS feed URL (recommended: https://www.ema.europa.eu/en/news.xml)
- FDA_RSS_URL: optional legacy RSS URL if you have a valid FDA feed endpoint
To scrape only one company, set TARGET_COMPANY in .env. The ClinicalTrials spider will use that value as the query term and then filter results so only matching trials are emitted.
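The post-query filter could be sketched as follows. The nested field path is an assumption modeled on the ClinicalTrials.gov v2 study JSON layout, and the function names are hypothetical; the spider in this repository may match on different fields or use stricter rules.

```python
def lead_sponsor_name(study: dict) -> str:
    """Read the lead sponsor name; the path is an assumed v2-style layout."""
    return (
        study.get("protocolSection", {})
        .get("sponsorCollaboratorsModule", {})
        .get("leadSponsor", {})
        .get("name", "")
    )


def matches_company(study: dict, target_company: str) -> bool:
    """Keep a study when no filter is set or the sponsor contains the target."""
    if not target_company:
        return True
    return target_company.casefold() in lead_sponsor_name(study).casefold()
```

A case-insensitive substring match is deliberately loose: it keeps "Pfizer Inc." for TARGET_COMPANY=Pfizer, at the cost of occasional false positives.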
To prevent repeating the same trial across runs, the local sink now skips records with the same identifiers.nct_id and stores seen IDs in out/ingestion.seen.json by default.
For FDA JSON and EMA RSS, incremental mode stores ETag, Last-Modified, and last_seen timestamps in STATE_STORE_PATH and reuses them on subsequent runs.
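The conditional-request cycle described above can be sketched as two small functions. The function names and the shape of the stored state are hypothetical; the header names are the standard HTTP ones.

```python
from typing import Optional


def conditional_headers(etag: Optional[str], last_modified: Optional[str]) -> dict:
    """Turn stored validators into request headers for the next run."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def next_validators(previous: dict, status: int, response_headers: dict) -> dict:
    """Decide which validators to persist after a response.

    A 304 Not Modified means the stored validators are still current, so
    they are kept; any other status replaces them with the new values.
    """
    if status == 304:
        return dict(previous)
    return {
        "etag": response_headers.get("ETag"),
        "last_modified": response_headers.get("Last-Modified"),
    }
```

On the first run both stored values are empty, so the request goes out without conditional headers and the server returns the full payload.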
For ClinicalTrials API, incremental mode stores a last_seen watermark in STATE_STORE_PATH, filters older/equal records, and follows nextPageToken until cutoff when sorted by LastUpdatePostDate:desc.
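A simplified sketch of the watermark cutoff follows. It assumes each record exposes a flat last_update_post_date field holding an ISO 8601 date string (so string comparison orders dates correctly) and that a page is a dict with studies and nextPageToken keys; the flat date field and both function names are hypothetical.

```python
def filter_page(records, watermark, date_key="last_update_post_date"):
    """Return (fresh_records, reached_cutoff) for one newest-first page."""
    fresh = []
    for record in records:
        # Older-or-equal records were already ingested on a previous run.
        if watermark is not None and record[date_key] <= watermark:
            return fresh, True
        fresh.append(record)
    return fresh, False


def crawl_incremental(fetch_page, watermark, date_key="last_update_post_date"):
    """Follow nextPageToken until the watermark cutoff or the last page."""
    token = None
    emitted = []
    while True:
        page = fetch_page(token)
        fresh, reached_cutoff = filter_page(page["studies"], watermark, date_key)
        emitted.extend(fresh)
        token = page.get("nextPageToken")
        if reached_cutoff or not token:
            break
    # The newest date seen becomes the watermark for the next run.
    new_watermark = max((r[date_key] for r in emitted), default=watermark)
    return emitted, new_watermark
```

The early stop is only safe because results are sorted by LastUpdatePostDate:desc; with any other sort order, every page would have to be scanned.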
ClinicalTrials normalization now emits normalized.canonical_lead_sponsor and normalized.canonical_drugs to support downstream entity joins.
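Alias-map canonicalization could work along these lines. This is a sketch under the assumption that the env variable holds a flat JSON object of alias to canonical name, as described for COMPANY_CANONICAL_MAP_JSON; the function names are hypothetical and the real helpers in normalization.py may apply additional cleanup.

```python
import json
import os


def load_alias_map(env_var: str, env=None) -> dict:
    """Parse an alias map from the environment, keyed case-insensitively."""
    env = os.environ if env is None else env
    raw = env.get(env_var, "")
    if not raw:
        return {}
    mapping = json.loads(raw)
    return {alias.strip().casefold(): name for alias, name in mapping.items()}


def canonicalize(name: str, alias_map: dict) -> str:
    """Collapse whitespace, then map a known alias to its canonical name."""
    cleaned = " ".join(name.split())
    return alias_map.get(cleaned.casefold(), cleaned)
```

Names without an alias entry pass through with only whitespace normalized, so an incomplete map never drops a sponsor.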
Local JSONL dedup now supports a fingerprint fallback when the configured dedup key is missing, improving deduplication across RSS/JSON records that do not carry stable IDs.
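One way to derive a dedup key with a fingerprint fallback is sketched below. Hashing the whole record with SHA-256 is an assumption made for illustration; the actual sink may fingerprint only selected fields, and the id:/fp: prefixes and function names are hypothetical.

```python
import hashlib
import json


def get_path(record: dict, dotted: str):
    """Resolve a dotted field path such as identifiers.nct_id, or None."""
    current = record
    for part in dotted.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def dedup_key(record: dict, key_path="identifiers.nct_id", fallback_enabled=True):
    """Prefer the stable ID; otherwise hash a canonical JSON form of the record."""
    value = get_path(record, key_path)
    if value:
        return f"id:{value}"
    if not fallback_enabled:
        return None
    # sort_keys makes the fingerprint independent of key order.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"), default=str)
    return "fp:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

A whole-record fingerprint only deduplicates exact repeats; any volatile field, such as a fetch timestamp, would have to be excluded before hashing for the fallback to be effective.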
Pipeline metrics are written as JSON lines to METRICS_OUTPUT_PATH at spider close with counts for requests, responses, 200/304 statuses, items, validation pass/fail counters, elapsed time, and finish reason.
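As a rough sketch, one metrics line could be assembled from Scrapy's stats like this. The keys on the left of STAT_FIELDS are standard Scrapy stats names; the output field names and both function names are hypothetical, and the validation pass/fail counters are omitted because their stat names are not given in this README.

```python
import json

# Left: Scrapy stats keys. Right: hypothetical output field names.
STAT_FIELDS = {
    "downloader/request_count": "requests",
    "downloader/response_count": "responses",
    "downloader/response_status_count/200": "status_200",
    "downloader/response_status_count/304": "status_304",
    "item_scraped_count": "items",
    "elapsed_time_seconds": "elapsed_seconds",
}


def format_run_metrics(spider_name: str, stats: dict, finish_reason: str) -> str:
    """Render one run as a single JSON line; missing counters default to 0."""
    record = {"spider": spider_name, "finish_reason": finish_reason}
    for stat_key, field in STAT_FIELDS.items():
        record[field] = stats.get(stat_key, 0)
    return json.dumps(record, sort_keys=True)


def append_run_metrics(path: str, line: str) -> None:
    """Append the line so earlier runs in the JSONL file are preserved."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(line + "\n")
```

Appending one self-contained line per run keeps the file readable with line-oriented tools even if a run is interrupted.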
Pipeline schema validation uses Pydantic and validates each normalized ingestion record before sink write.
- VALIDATION_MODE=drop: invalid records are logged and skipped
- VALIDATION_MODE=strict: invalid records fail the spider run
- VALIDATION_MODE=warn: invalid records are logged but still forwarded to sink
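The three modes can be illustrated with a small dispatch function. To stay dependency-free, this sketch uses a stand-in validate callable that raises ValueError instead of a real Pydantic model; RecordRejected and apply_validation are hypothetical names, not part of this repository.

```python
import logging

logger = logging.getLogger("bioscope.validation")


class RecordRejected(Exception):
    """Hypothetical error used to abort the run in strict mode."""


def apply_validation(record: dict, validate, mode: str = "drop"):
    """Return the record to forward to the sink, or None to skip it."""
    try:
        validate(record)
    except ValueError as exc:
        if mode == "strict":
            # strict: the failure propagates and ends the run.
            raise RecordRejected(str(exc)) from exc
        logger.warning("invalid record (%s mode): %s", mode, exc)
        # warn: forward anyway; drop: skip the record.
        return record if mode == "warn" else None
    return record
```

In this sketch drop and warn differ only in the return value, which keeps the logging path identical for both.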
.env.example includes sample alias-map JSON values (Pfizer and Novo Nordisk variants) to demonstrate canonicalization format.
All ingestion output stays local in ./out unless you override LOCAL_SINK_PATH. The default output files are:
- ./out/ingestion.jsonl
- ./out/ingestion.seen.json
- ./out/source_state.sqlite
- ./out/metrics/ingestion_metrics.jsonl
If VS Code shows: "An environment file is configured but terminal environment injection is disabled", enable python.terminal.useEnvFile=true to auto-inject .env into Python terminals.
start.sh already sources .env directly, so script-based runs use updated environment values regardless of that editor setting.
- FDA JSON (openFDA): https://api.fda.gov/drug/enforcement.json?limit=100
- EMA RSS: https://www.ema.europa.eu/en/news.xml
These defaults are active in .env.example and are recommended for reliable ingestion.
Validation commands executed:
- python -m compileall src
- bash ./start.sh --all
Observed results:
- All changed Python modules compiled successfully.
- ClinicalTrials spider executed with incremental cutoff log and completed cleanly.
- FDA openFDA and EMA RSS spiders received 304 Not Modified and handled them correctly inside spider parse flow.
- No runtime crashes during end-to-end --all execution.
Warning status:
- Previous Scrapy deprecations were resolved by upgrading to Scrapy 2.13.0 and migrating spiders from start_requests() to start().
Milestone 1: Foundation + Incremental RSS/JSON
- Source-state storage in src/common/state_store.py
- Conditional fetch (If-None-Match, If-Modified-Since) for FDA JSON + EMA RSS
- last_seen watermark filtering for FDA JSON + EMA RSS
- Config support in .env/.env.example
Milestone 2: ClinicalTrials Incremental + Canonicalization
- Incremental strategy for ClinicalTrials API (timestamp + cursor pagination + cutoff)
- Canonical company/drug normalization support (with optional alias maps)
- Improved dedup for records without stable IDs
Milestone 3: Reliability + Quality
- Pydantic schema validation
- Structured logging and file-based ingestion metrics
- Basic unit test scaffold (schema + normalization)
- Prometheus/monitoring integration
- Broader unit and integration tests
This project is currently a functional ingestion backbone for BioScope with incremental crawling, stateful source tracking, canonicalization support, and local/prod sink flexibility. It is ready for continued integration into downstream analytics services, but it will not be production-hardened until the remaining Milestone 3 items are completed.