This repository contains scripts for manipulating and analyzing large collections of HathiTrust book data with minimal user intervention. They are meant to ease the pain of working with Data Capsules.

- get-files.py: Given a list of HTIDs, randomly sample and save pages for each HTID.
- get_marc_metadata.py: Given a list of HTIDs, download and save MARC metadata for each HTID.
- finetune.py: Given a directory of randomly sampled pages, fine-tune a HuggingFace classifier.
- classify.py: Given a HuggingFace classifier and a list of HTIDs, classify volumes with said classifier.
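
To make the page-sampling step described for get-files.py more concrete, here is a minimal sketch of sampling pages from one volume uniformly at random and saving them. It is not taken from the repository: the function names `sample_pages` and `save_sample`, the `seed` parameter, and the assumption that each volume is a directory with one `.txt` file per page are all hypothetical, and the actual script may work differently.

```python
import random
from pathlib import Path


def sample_pages(page_paths, k, seed=None):
    """Return up to k pages chosen uniformly at random, without replacement."""
    rng = random.Random(seed)
    # Sort first so that a given seed always produces the same sample.
    pages = sorted(page_paths)
    return rng.sample(pages, min(k, len(pages)))


def save_sample(volume_dir, out_dir, k, seed=None):
    """Copy a random sample of page files from volume_dir into out_dir.

    Assumes (hypothetically) one UTF-8 .txt file per page in volume_dir.
    Returns the file names of the sampled pages.
    """
    volume_dir, out_dir = Path(volume_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chosen = sample_pages(volume_dir.glob("*.txt"), k, seed)
    for page in chosen:
        text = page.read_text(encoding="utf-8")
        (out_dir / page.name).write_text(text, encoding="utf-8")
    return [page.name for page in chosen]
```

Fixing the seed makes a sample reproducible, which can help when the sampled pages are later used as training data for fine-tuning. A volume with fewer than `k` pages simply contributes all of its pages.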