A script that takes the PAGE XML of an evaluation page and of an automatically transcribed version of the same page, and calculates the Character Error Rate (CER) and Word Error Rate (WER). Designed for a DH class on the subject of evaluation.
The script uses numpy to calculate the Character Error Rate and Word Error Rate. Before running the script, install numpy using pip (`pip install numpy`).
This script only compares transcriptions at the page level. It compares one PAGE XML zip file at a time; attempting to run more than one will result in an error. It is written for eScriptorium PAGE XML exports.
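As a rough illustration of what a PAGE XML zip holds, the sketch below pulls the line-level transcriptions out of such an archive. It is not the code in evaluate_xml.py: the function names `local_name` and `read_page_lines` are hypothetical, and it assumes each transcribed line is stored as `TextLine` > `TextEquiv` > `Unicode` under a `PcGts` root, as in the PAGE schema.

```python
import zipfile
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def read_page_lines(zip_source):
    """Return {xml_filename: [line_text, ...]} for each PAGE XML file in the zip.

    Hypothetical helper for illustration; zip_source is a path or file-like object.
    """
    pages = {}
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".xml"):
                continue  # skip images and other non-XML members
            root = ET.fromstring(archive.read(name))
            if local_name(root.tag) != "PcGts":
                continue  # assumption: PAGE files have a PcGts root element
            lines = []
            for element in root.iter():
                if local_name(element.tag) != "TextLine":
                    continue
                # Only read TextEquiv directly under the line, so word-level
                # transcriptions nested deeper are not counted a second time.
                for child in element:
                    if local_name(child.tag) != "TextEquiv":
                        continue
                    for sub in child:
                        if local_name(sub.tag) == "Unicode":
                            lines.append(sub.text or "")
            pages[name] = lines
    return pages
```

Matching on the local tag name rather than a fixed namespace keeps the sketch independent of which PAGE schema version the export declares.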
- Add the evaluation PAGE xml zip into the evaluation_xml folder
- Add the test data PAGE xml zip (i.e. the same page but run with a new OCR model that you would like to test) into the test_xml folder
- In the terminal, run `python evaluate_xml.py` or `python3 evaluate_xml.py`
- The script will print mean CER and WER for each page
- The script will print mean CER and WER for all pages
- The script will output a line-by-line comparison to a csv file in the 'out' folder
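For readers new to these metrics, here is a minimal sketch of how CER and WER can be computed with numpy from a Levenshtein (edit) distance. It is an illustration under stated assumptions, not the code in evaluate_xml.py: the names `edit_distance`, `cer` and `wer` are hypothetical, words are split on whitespace, each rate is normalised by the length of the reference line, and the page mean is taken as the plain average of the per-line rates.

```python
import numpy as np


def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    rows, cols = len(reference) + 1, len(hypothesis) + 1
    table = np.zeros((rows, cols), dtype=int)
    table[:, 0] = np.arange(rows)  # deleting every reference item
    table[0, :] = np.arange(cols)  # inserting every hypothesis item
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            table[i, j] = min(
                table[i - 1, j] + 1,         # deletion
                table[i, j - 1] + 1,         # insertion
                table[i - 1, j - 1] + cost,  # substitution or match
            )
    return int(table[-1, -1])


def cer(reference, hypothesis):
    """Character errors divided by the number of reference characters."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)


def wer(reference, hypothesis):
    """Word errors divided by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / max(len(ref_words), 1)


# Made-up example lines: evaluation text first, model output second.
reference_lines = ["the quick brown fox", "jumps over"]
test_lines = ["the quick brown fax", "jumps ovr"]

line_cer = [cer(r, t) for r, t in zip(reference_lines, test_lines)]
line_wer = [wer(r, t) for r, t in zip(reference_lines, test_lines)]
print("mean CER:", np.mean(line_cer))
print("mean WER:", np.mean(line_wer))
```

The `max(..., 1)` guard only avoids division by zero for an empty reference line. Averaging per-line rates weights short and long lines equally; dividing total errors by total reference length is a common alternative and gives a different number.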