LPhex9/warc-portfolio


WARC Archival Portfolio — Alex Jones Twitter Archive (August 2018)

An end-to-end digital preservation workflow applied to a real WARC collection from the Internet Archive. It demonstrates acquisition, format identification, inventory generation, BagIt packaging, and finding aid writing.

Collection: Alex Jones Twitter Archive, captured August 18–19, 2018
Why this collection: Captured in the weeks before Jones was permanently suspended from Twitter (September 6, 2018), it is a canonical example of why proactive social-media archiving matters.


Workflow

Each step is a standalone Python script. Run them in order.

01_acquire.py    →  02_identify.py    →  03_inventory.py    →  04_bag.py
  Download              Siegfried            Parse CDX             BagIt package
  CDX + metadata        format ID            build inventory       + validate

Step 1 — Acquire

python 01_acquire.py           # downloads CDX indexes + metadata (~3 MB)
python 01_acquire.py --full    # also downloads full WARC files (~1.87 GB)
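
As a rough sketch of how the `--full` switch might map onto a download call (not necessarily how `01_acquire.py` is written): the item identifier and the file-name patterns below are placeholders, since this README does not list them, and the helper names are hypothetical. It assumes `internetarchive.download` takes a `glob_pattern` string in which `|` separates patterns.

```python
import argparse

# Placeholder: the real Internet Archive item identifier is not given in this README.
ITEM_ID = "EXAMPLE-ITEM-IDENTIFIER"


def select_patterns(full: bool) -> list[str]:
    """Return the file-name globs to fetch; WARC files only when --full is set."""
    patterns = ["*.cdx.gz", "*_meta.xml", "*_files.xml"]  # assumed file names
    if full:
        patterns.append("*.warc.gz")
    return patterns


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the command line; argv=None falls back to sys.argv."""
    parser = argparse.ArgumentParser(description="Acquire CDX indexes and metadata.")
    parser.add_argument("--full", action="store_true",
                        help="also download the full WARC files")
    return parser.parse_args(argv)


def acquire(full: bool, destdir: str = "acquisition") -> None:
    """Download the selected files into destdir (needs network access)."""
    # Imported lazily so the helpers above work without the library installed.
    from internetarchive import download

    download(
        ITEM_ID,
        glob_pattern="|".join(select_patterns(full)),
        destdir=destdir,
        ignore_existing=True,  # skip files already on disk when re-running
        verbose=True,
    )
```

Calling `acquire(parse_args().full)` from the script entry point would tie the two halves together. Keeping pattern selection separate from the network call makes the cheap default (indexes and metadata only) easy to check without downloading anything.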

Step 2 — Identify formats

python 02_identify.py

Requires Siegfried (sf) on PATH.
Output: processed/siegfried_report.json
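
A minimal sketch of what this step could look like, assuming the script shells out to `sf` with JSON output and the `-z` flag (so that archive formats are scanned as well). The function names are hypothetical, and the input directory is assumed to be `acquisition/`. The second helper tallies PRONOM identifiers from the `files` and `matches` keys of a Siegfried JSON report.

```python
import json
import subprocess
from collections import Counter
from pathlib import Path


def run_siegfried(target: str = "acquisition",
                  out_path: str = "processed/siegfried_report.json") -> dict:
    """Run `sf -z -json` over a directory, save the report, and return it parsed."""
    result = subprocess.run(
        ["sf", "-z", "-json", target],
        capture_output=True, text=True, check=True,
    )
    out = Path(out_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(result.stdout, encoding="utf-8")
    return json.loads(result.stdout)


def count_formats(report: dict) -> Counter:
    """Count (PRONOM id, format name) pairs across all files in a report."""
    counts = Counter()
    for entry in report.get("files", []):
        for match in entry.get("matches", []):
            counts[(match.get("id"), match.get("format"))] += 1
    return counts
```

Something like `count_formats(run_siegfried()).most_common()` would then give a quick overview of which formats dominate the acquisition directory.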

Step 3 — Generate inventory

python 03_inventory.py

Parses the CDX index to produce a URL-level inventory of everything captured.
Outputs: processed/inventory.csv, processed/summary.json
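
The parsing idea can be sketched as follows, assuming the index files are space-delimited CDX with a ` CDX ...` header line that names the columns by letter (for example ` CDX N b a m s k r M S V g`). The readable column names and the shape of the summary are illustrative choices, not necessarily those used by `03_inventory.py`.

```python
import csv
import gzip
from collections import Counter

# Readable names for common CDX field letters (naming is this sketch's choice).
FIELD_NAMES = {
    "N": "urlkey", "b": "timestamp", "a": "original_url", "m": "mime_type",
    "s": "status", "k": "digest", "r": "redirect", "M": "meta_tags",
    "S": "record_size", "V": "offset", "g": "warc_filename",
}


def parse_cdx(lines):
    """Yield one dict per capture, taking column order from the CDX header line."""
    columns = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("CDX"):
            columns = [FIELD_NAMES.get(c, c) for c in line.split()[1:]]
            continue
        if columns is None:
            raise ValueError("CDX header line not found before data")
        yield dict(zip(columns, line.split()))


def read_cdx_gz(path):
    """Parse a gzip-compressed CDX file."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        yield from parse_cdx(fh)


def write_inventory(rows, out_path):
    """Write the parsed rows to a CSV file, one row per capture."""
    rows = list(rows)
    if not rows:
        return
    with open(out_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)


def summarize(rows):
    """Build a small summary: total captures plus counts by MIME type and status."""
    rows = list(rows)
    return {
        "total_captures": len(rows),
        "by_mime": dict(Counter(r.get("mime_type") for r in rows)),
        "by_status": dict(Counter(r.get("status") for r in rows)),
    }
```

Reading the header rather than hard-coding positions keeps the parser usable if a CDX file carries a different set of columns.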

Step 4 — Package as BagIt

python 04_bag.py

Creates a BagIt-compliant bag with SHA-256 checksums.
Output: bags/alex-jones-twitter-201808/
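
To illustrate what the SHA-256 fixity information in a bag records, here is a standard-library sketch that builds and re-checks `manifest-sha256.txt` style lines (`<checksum>  <path under data/>`). It is an illustration only, with hypothetical function names. In practice bagit-python covers this through `bagit.make_bag(..., checksums=["sha256"])` and `Bag.validate()`, which is presumably what `04_bag.py` relies on.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large WARC files do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest_lines(bag_dir) -> list[str]:
    """Build manifest-style lines for every payload file under data/."""
    bag = Path(bag_dir)
    lines = []
    for path in sorted((bag / "data").rglob("*")):
        if path.is_file():
            rel = path.relative_to(bag).as_posix()
            lines.append(f"{sha256_of(path)}  {rel}")
    return lines


def verify(bag_dir) -> list[str]:
    """Return payload paths that are missing or no longer match the manifest."""
    bag = Path(bag_dir)
    manifest = (bag / "manifest-sha256.txt").read_text(encoding="utf-8")
    failures = []
    for line in manifest.splitlines():
        if not line.strip():
            continue
        expected, rel = line.split(maxsplit=1)
        target = bag / rel
        if not target.is_file() or sha256_of(target) != expected:
            failures.append(rel)
    return failures
```

An empty list from `verify` means every payload file still matches its recorded checksum. This sketch does not check tag files or the bag declaration, which a full BagIt validation also covers.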


Project Structure

warc-portfolio/
├── 01_acquire.py          Acquisition script (internetarchive library)
├── 02_identify.py         Format identification (Siegfried)
├── 03_inventory.py        CDX parser + inventory generator
├── 04_bag.py              BagIt packaging + validation
├── README.md              This file
├── acquisition/           Raw downloaded files (CDX, metadata, optional WARCs)
├── processed/             Siegfried reports, inventory CSV, summary JSON
├── bags/                  BagIt packages
└── finding-aid/
    └── finding_aid.md     Archival finding aid (scope, context, arrangement)

Standards Applied

| Standard | Purpose |
|---|---|
| WARC (ISO 28500) | Web archive container format |
| BagIt (RFC 8493) | File packaging and transfer spec (Library of Congress) |
| PRONOM | File format registry (via Siegfried) |
| SHA-256 | Fixity checksums for integrity verification |

Tools Used

| Tool | Version | Purpose |
|---|---|---|
| internetarchive (ia) | 5.x | Internet Archive API + download |
| Siegfried (sf) | | Format identification against PRONOM |
| bagit-python | 1.x | BagIt package creation and validation |
| Python | 3.x | All scripting |

Skills Demonstrated

  • Acquisition: Programmatic download from Internet Archive via Python API
  • Format identification: PRONOM-based identification via Siegfried, including compressed files
  • CDX parsing: Extracting URL-level inventory from WARC index files
  • BagIt packaging: Creating RFC 8493-compliant bags with checksums and provenance metadata
  • Finding aid writing: Archival description following scope-and-content / arrangement conventions
  • OSINT archiving context: Understanding of social media preservation and content-at-risk workflows

Portfolio project by Tyrone — digital preservation / OSINT archiving track
