An end-to-end digital preservation workflow applied to a real WARC collection from the Internet Archive. It demonstrates acquisition, format identification, inventory generation, BagIt packaging, and finding aid writing.
Collection: Alex Jones Twitter Archive, captured August 18–19, 2018
Why this collection: Captured less than three weeks before Jones was permanently suspended from Twitter on September 6, 2018, which makes it a canonical example of why proactive social-media archiving matters.
Each step is a standalone Python script. Run them in order.

| Step | Script | Role |
|---|---|---|
| 1 | 01_acquire.py | Download CDX + metadata |
| 2 | 02_identify.py | Siegfried format ID |
| 3 | 03_inventory.py | Parse CDX, build inventory |
| 4 | 04_bag.py | BagIt package + validate |

python 01_acquire.py          # downloads CDX indexes + metadata (~3 MB)
python 01_acquire.py --full   # also downloads full WARC files (~1.87 GB)

python 02_identify.py

Requires Siegfried (sf) on PATH.
Output: processed/siegfried_report.json
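
The identification step can be sketched roughly as follows. This is a minimal illustration, not the contents of 02_identify.py: it assumes Siegfried is invoked with its `-json` flag and that each entry in the report's `files` array carries a `matches` list with `ns` and `id` keys. The helper names `run_siegfried` and `tally_formats` and the sample report fragment are hypothetical.

```python
import json
import subprocess
from collections import Counter


def run_siegfried(target_dir):
    """Run `sf -json` on a directory and return the parsed report (needs sf on PATH)."""
    result = subprocess.run(
        ["sf", "-json", target_dir],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)


def tally_formats(report, namespace="pronom"):
    """Count files per format ID, using each file's first match in the namespace."""
    counts = Counter()
    for entry in report.get("files", []):
        ids = [m["id"] for m in entry.get("matches", []) if m.get("ns") == namespace]
        counts[ids[0] if ids else "UNKNOWN"] += 1
    return counts


# Illustrative report fragment (made up, not real output from this collection).
sample_report = {
    "files": [
        {"filename": "a.warc.gz", "matches": [{"ns": "pronom", "id": "x-fmt/266"}]},
        {"filename": "b.warc.gz", "matches": [{"ns": "pronom", "id": "x-fmt/266"}]},
        {"filename": "meta.xml", "matches": [{"ns": "pronom", "id": "fmt/101"}]},
    ]
}
format_counts = tally_formats(sample_report)
```

In a real run, `tally_formats(run_siegfried("acquisition"))` would summarise the downloaded files; the result could then be written to processed/siegfried_report.json alongside the raw report.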
python 03_inventory.py

Parses the CDX index to produce a URL-level inventory of everything captured.
Outputs: processed/inventory.csv, processed/summary.json
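
A CDX parser along these lines might look like the sketch below. It assumes a space-delimited CDX file whose first line is a ` CDX ...` legend of single-letter field codes; the readable column names in `CDX_FIELDS`, the function names, and the two sample rows are assumptions for illustration and do not come from 03_inventory.py or from the collection itself.

```python
import csv
import io
from collections import Counter

# Common single-letter CDX field codes mapped to readable (hypothetical) column names.
CDX_FIELDS = {
    "N": "urlkey", "b": "timestamp", "a": "original_url", "m": "mimetype",
    "s": "status", "k": "digest", "r": "redirect", "M": "meta_tags",
    "S": "compressed_size", "V": "offset", "g": "filename",
}


def parse_cdx(lines):
    """Yield one dict per CDX record, using the header line as the field legend."""
    columns = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("CDX"):
            # Header looks like " CDX N b a m s k r M S V g"
            columns = [CDX_FIELDS.get(code, code) for code in line.split()[1:]]
            continue
        if columns is None:
            raise ValueError("CDX header line not found before first record")
        parts = line.split(" ")
        if len(parts) != len(columns):
            continue  # skip malformed rows rather than guess at their layout
        yield dict(zip(columns, parts))


def write_inventory(records, handle):
    """Write parsed records to an open text handle as CSV."""
    writer = csv.DictWriter(handle, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)


# Made-up sample rows, for illustration only.
sample = """ CDX N b a m s k r M S V g
com,example)/ 20180818120000 http://example.com/ text/html 200 AAAA - - 1234 0 example.warc.gz
com,example)/robots.txt 20180818120001 http://example.com/robots.txt text/plain 200 BBBB - - 321 1234 example.warc.gz
"""
records = list(parse_cdx(io.StringIO(sample)))
mime_counts = Counter(r["mimetype"] for r in records)

csv_out = io.StringIO()
write_inventory(records, csv_out)
```

Counting by `mimetype` or `status` in this way is one plausible route to the figures in processed/summary.json.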
python 04_bag.py

Creates a BagIt-compliant bag with SHA-256 checksums.
Output: bags/alex-jones-twitter-201808/
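
To show what the SHA-256 fixity information in a bag amounts to, here is a standard-library sketch that builds payload manifest lines of the form `<checksum>  data/<path>`. It is not the code in 04_bag.py, which presumably relies on bagit-python; the function names and the temporary demo file are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large WARCs never have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest_lines(payload_dir):
    """Return '<checksum>  <path>' lines for every file under payload_dir.

    Paths are written relative to the payload directory's parent, so a
    directory named 'data' yields entries such as 'data/hello.txt'.
    """
    root = Path(payload_dir)
    lines = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        rel = path.relative_to(root.parent).as_posix()
        lines.append(f"{sha256_of(path)}  {rel}")
    return lines


# Small demo against a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "data"
    data_dir.mkdir()
    (data_dir / "hello.txt").write_bytes(b"hello\n")
    demo_manifest = manifest_lines(data_dir)
```

Validation is the reverse operation: recompute each checksum and compare it with the stored line, which is how later corruption or tampering would be detected.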
warc-portfolio/
├── 01_acquire.py       Acquisition script (internetarchive library)
├── 02_identify.py      Format identification (Siegfried)
├── 03_inventory.py     CDX parser + inventory generator
├── 04_bag.py           BagIt packaging + validation
├── README.md           This file
├── acquisition/        Raw downloaded files (CDX, metadata, optional WARCs)
├── processed/          Siegfried reports, inventory CSV, summary JSON
├── bags/               BagIt packages
└── finding-aid/
    └── finding_aid.md  Archival finding aid (scope, context, arrangement)
| Standard | Purpose |
|---|---|
| WARC (ISO 28500) | Web archive container format |
| BagIt (RFC 8493) | File packaging and transfer spec (Library of Congress) |
| PRONOM | File format registry (via Siegfried) |
| SHA-256 | Fixity checksums for integrity verification |

| Tool | Version | Purpose |
|---|---|---|
| internetarchive (ia) | 5.x | Internet Archive API + download |
| Siegfried (sf) | — | Format identification against PRONOM |
| bagit-python | 1.x | BagIt package creation and validation |
| Python | 3.x | All scripting |
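
As a rough illustration of how the internetarchive library listed above might drive the acquisition step, the sketch below builds a glob pattern and passes it to the library's `download` function. The file-name patterns, the helper names, and the `ITEM_ID` placeholder are all assumptions; the real item identifier and file layout are not stated in this README.

```python
def select_patterns(full=False):
    """Return a '|'-separated glob pattern for the files to fetch.

    The patterns are assumed names for CDX indexes and item metadata;
    adjust them to match the actual item.
    """
    patterns = ["*.cdx.gz", "*_meta.xml", "*_files.xml"]
    if full:
        patterns.append("*.warc.gz")  # full WARC payloads, the large download
    return "|".join(patterns)


def acquire(identifier, destdir="acquisition", full=False):
    """Download the selected files of an Internet Archive item into destdir."""
    # Imported lazily so select_patterns() works without the library installed.
    from internetarchive import download

    return download(
        identifier,
        glob_pattern=select_patterns(full),
        destdir=destdir,
        verbose=True,
    )


ITEM_ID = "EXAMPLE-ITEM-IDENTIFIER"  # hypothetical placeholder, not the real item
default_pattern = select_patterns()
full_pattern = select_patterns(full=True)
```

Calling `acquire(ITEM_ID)` would correspond to the default run, and `acquire(ITEM_ID, full=True)` to the `--full` option.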
- Acquisition: Programmatic download from Internet Archive via Python API
- Format identification: PRONOM-based identification via Siegfried, including compressed files
- CDX parsing: Extracting URL-level inventory from WARC index files
- BagIt packaging: Creating RFC 8493-compliant bags with checksums and provenance metadata
- Finding aid writing: Archival description following scope-and-content / arrangement conventions
- OSINT archiving context: Understanding of social media preservation and content-at-risk workflows
Portfolio project by Tyrone — digital preservation / OSINT archiving track