Automate model compatibility checks by juhoinkinen · Pull Request #907 · NatLibFi/Annif

juhoinkinen · 2025-10-30T09:36:50Z

This pull request introduces automated model compatibility and reproducibility checks for the backends, ensuring that changes to the codebase do not introduce significant metric regressions.

Key changes include:

Continuous Integration and Automation:

Added a new GitHub Actions workflow (.github/workflows/model-compatibility.yml) that runs the model compatibility and reproducibility checks on the workflow_dispatch trigger by executing the tests/check_models_compatibility_consistency.py script with the --ci option.

Testing Infrastructure and Scripts:

The script functions as follows in the two check modes:

Download existing models and metrics from a Hugging Face Hub repository which is set via a repository GH Actions secret.
Depends on mode:
- In compatibility mode/subcommand:
  - evaluate the downloaded models with the current Annif code and compare to previous evaluation metrics.
- In consistency mode/subcommand:
  - train new models with the current Annif code
  - evaluate the trained models and compare to previous evaluation metrics
Flag all significant differences found in the comparison. The default threshold on the relative difference (= abs(prev_value - new_value) / abs(prev_value)) is 0.01 for compatibility and 0.03 for consistency (the larger value allows for non-determinism in training).
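
The threshold check described above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the names `relative_difference` and `is_significant` are hypothetical, the strict `>` comparison is an assumption, and so is the handling of a zero previous value (0 vs. 0 counts as no difference, 0 vs. non-zero as an infinitely large one), which is only prompted by the commit titles in this PR.

```python
import math

# Default thresholds quoted in the PR description.
COMPATIBILITY_THRESHOLD = 0.01
CONSISTENCY_THRESHOLD = 0.03


def relative_difference(prev_value: float, new_value: float) -> float:
    """Return abs(prev_value - new_value) / abs(prev_value).

    Assumption: when the previous value is zero, identical values count as
    no difference and anything else as an infinitely large difference.
    """
    if prev_value == 0:
        return 0.0 if new_value == 0 else math.inf
    return abs(prev_value - new_value) / abs(prev_value)


def is_significant(prev_value: float, new_value: float, threshold: float) -> bool:
    """Flag a metric whose relative difference exceeds the threshold (assumed strict)."""
    return relative_difference(prev_value, new_value) > threshold
```

With these definitions, a drop from 0.50 to 0.49 is a relative difference of about 0.02, so it would be flagged under the compatibility threshold but not under the consistency threshold.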

When running with the --ci option and detecting differences, the script exits with code 1 failing the GH Action job.
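
One way the --ci behaviour could be wired up is sketched below, purely as an illustration. The helper names `find_differences` and `exit_code` and the metric values are hypothetical; the PR only states that the script exits with code 1 when differences are detected under --ci, so returning 0 in every other case is an assumption.

```python
def find_differences(previous: dict, current: dict, threshold: float) -> dict:
    """Return {metric: (prev, new)} for metrics whose relative change exceeds threshold.

    Hypothetical helper; a metric missing from `current` is also reported.
    """
    flagged = {}
    for name, prev in previous.items():
        new = current.get(name)
        if new is None:
            flagged[name] = (prev, None)
            continue
        if prev == 0:
            # Assumption: with a zero baseline, any non-zero value is a difference.
            differs = new != 0
        else:
            differs = abs(prev - new) / abs(prev) > threshold
        if differs:
            flagged[name] = (prev, new)
    return flagged


def exit_code(flagged: dict, ci: bool) -> int:
    """Return 1 only when running in CI mode and differences were found."""
    return 1 if ci and flagged else 0


# Made-up baseline and new values for the two metrics named in the commit list.
previous = {"F1@5": 0.40, "NDCG": 0.60}
current = {"F1@5": 0.40, "NDCG": 0.57}
flagged = find_differences(previous, current, threshold=0.01)
for name, (prev, new) in flagged.items():
    print(f"{name}: {prev} -> {new}")
# A script would then finish with: sys.exit(exit_code(flagged, ci))
```

Keeping the exit-code decision in a small pure function makes it easy to test without terminating the interpreter.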

The upload subcommand of the script uploads the newly trained models and their evaluation metrics to the HFH repo, thus "resetting" the state:

python tests/check_models_compatibility_consistency.py upload --hf_repo <repo-id-to-upload>

In the above command, upload can be changed to compatibility or consistency for running in those modes.

Configuration for Model Checks:

Added tests/projects-compatibility.cfg and tests/projects-consistency.cfg configuration files, which define the sets of Annif projects (models) to be checked for compatibility and consistency, respectively. The first configuration also covers projects with non-trainable backends.
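
For orientation, Annif project configuration files are INI-style, with one section per project. The sketch below parses such a file with Python's standard configparser and lists the project IDs with their backends. The two sections are illustrative assumptions (only the project IDs yake-fi and ensemble-fi are mentioned in this PR's review), and `list_projects` is a hypothetical helper; the actual contents of tests/projects-compatibility.cfg may differ.

```python
import configparser

# Hypothetical excerpt in the style of an Annif projects file; not the PR's actual config.
EXAMPLE_CFG = """
[yake-fi]
name=YAKE Finnish
language=fi
backend=yake
vocab=yso
analyzer=voikko(fi)

[ensemble-fi]
name=Ensemble Finnish
language=fi
backend=ensemble
vocab=yso
sources=tfidf-fi,yake-fi
"""


def list_projects(cfg_text: str) -> dict:
    """Return {project_id: backend} for every section in the config text."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    return {section: parser[section]["backend"] for section in parser.sections()}


print(list_projects(EXAMPLE_CFG))
```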

This testing is probably best run via the workflow dispatch trigger on the GH Actions workflow page, which also allows checking the status.

TODO:

Remove trigger on pushes to main or the feature branch.

codecov · 2025-10-30T09:40:04Z

Codecov Report

❌ Patch coverage is 0% with 191 lines in your changes missing coverage. Please review.
✅ Project coverage is 97.37%. Comparing base (27e4ac7) to head (5c3aff6).
⚠️ Report is 1 commits behind head on main.

Files with missing lines	Patch %	Lines
tests/check_models_reproducibility.py	0.00%	191 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #907      +/-   ##
==========================================
- Coverage   99.63%   97.37%   -2.26%     
==========================================
  Files         103      104       +1     
  Lines        8238     8429     +191     
==========================================
  Hits         8208     8208              
- Misses         30      221     +191

Copilot

Pull Request Overview

This PR introduces automated model compatibility and reproducibility checks for Annif models through a new GitHub Actions workflow. The implementation enables systematic verification that code changes don't break model backward compatibility or training reproducibility.

Key changes:

New GitHub Actions workflow (model-compatibility.yml) that runs compatibility and consistency checks on workflow dispatch or push events
Python script (check_models_compatability_consistency.py) that downloads models from Hugging Face Hub, evaluates them, compares metrics against baselines, and reports significant differences
Two configuration files defining project setups for compatibility testing (8 projects including ensemble backends) and consistency testing (8 projects focusing on base backends)

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 17 comments.

File	Description
`.github/workflows/model-compatibility.yml`	GitHub Actions workflow orchestrating the compatibility checks with steps for environment setup and running both compatibility and consistency tests
`tests/check_models_compatability_consistency.py`	Python script implementing the core logic for downloading models/metrics, training, evaluation, comparison, and uploading results to Hugging Face Hub
`tests/projects-compatibility.cfg`	Configuration defining 8 projects (including yake-fi and ensemble-fi) for backward compatibility testing against existing trained models
`tests/projects-consistency.cfg`	Configuration defining 8 projects for reproducibility testing through retraining and metric comparison

sonarqubecloud · 2025-12-19T08:03:45Z

Quality Gate passed

Issues
2 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

juhoinkinen · 2026-01-08T13:45:26Z

Checks for model size and runtime performance could also be useful, but they are better implemented separately from this PR.

sonarqubecloud · 2026-02-06T10:44:03Z

Quality Gate passed

Issues
3 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

juhoinkinen added this to the 1.5 milestone Oct 30, 2025

juhoinkinen added the maintenance and github_actions labels Oct 30, 2025

juhoinkinen force-pushed the issue906-automate-model-compatibility-checks branch from f633916 to 924e812 Compare October 31, 2025 10:09

juhoinkinen mentioned this pull request Oct 31, 2025

Fix zip destination path in Hugging Face repo when using custom data directory location #908

Merged

juhoinkinen force-pushed the issue906-automate-model-compatibility-checks branch 2 times, most recently from 9b1ea17 to bd74716 Compare October 31, 2025 15:34

juhoinkinen added 21 commits November 12, 2025 15:00

Initial script for checks

61a9088

Add CI integration

6a981db

Use dedicated projects config file for compat check

647a68b

Fix working path

6a2f4e6

Fix: install all extras

670cfef

Fix: Install Voikko; install harden-runner

452f3c6

Add option to upload models and metrics to HFH

ef36365

Use Click instead of argparse

e130da6

Refactor

c637daa

Continue on error of compatibility check

23fb769

Fix missing fstring setup

d7e4534

Renamings

c90e3ae

Avoid duplicate printout in CI

c5b6471

Do not skip metric comparison with previous value of zero

c9382c7

Allow nonexistent metrics file

519fd46

Different project configs for compat and consistency

6754e7b

Projects for compat includes also non-trainable backends

Nicer output

2443ed1

Separate new command for upload

eee0930

Separate thresholds for compat and consistency; eval only F1@5 and NDCG

dee18e3

Use temp dir for data and metrics

6e5c1b8

Set HF repo via env

d08a04b

juhoinkinen force-pushed the issue906-automate-model-compatibility-checks branch from bd74716 to d08a04b Compare November 12, 2025 13:11

juhoinkinen requested a review from Copilot November 13, 2025 11:14

Copilot started reviewing on behalf of juhoinkinen November 13, 2025 11:15 View session

Copilot finished reviewing on behalf of juhoinkinen November 13, 2025 11:18

Copilot AI reviewed Nov 13, 2025

juhoinkinen and others added 8 commits November 13, 2025 15:03

Fix case v1 == 0 and v2 == 0

f5f930f

Fix typo in file name

cadfff4

Adapt to fixed file name -> model-compatibility.yml

81494e3

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Use sys.exit() instead plain exit()

3353f4a

Raise exception after printing it

ad22b48

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Fix inconsistent indentation

b7973fc

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Fix missing epochs parameter

5170f04

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Remove unnecessary option in makedirs()

60898d6

juhoinkinen mentioned this pull request Nov 18, 2025

Check models compatibility in Docker rebuild GH Actions workflow #719

Closed

1 task

Remove workflow trigger listering PR branch

011f49d

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

juhoinkinen marked this pull request as ready for review December 9, 2025 13:50

Linked project configs in consistency checks to avoid duplication

307616f

juhoinkinen and others added 8 commits February 3, 2026 16:48

feat: enforce explicit metric list to prevent silent regressions

d12a8f2

Co-authored-by: aider (openai/gpt-5.2-chat) <aider@aider.chat>

feat: store and validate metadata alongside metrics for reproducibility

f4bec1f

Co-authored-by: aider (openai/gpt-5.2-chat) <aider@aider.chat>

Merge branch 'main' into issue906-automate-model-compatibility-checks

62664fe

fix: correct KeyError by saving raw metrics directly in tests

1d068aa

Co-authored-by: aider (openai/gpt-5.2-chat) <aider@aider.chat>

fix: ensure model download failures stop the run by setting check=True

8be8d72

Co-authored-by: aider (openai/gpt-5.2) <aider@aider.chat>

feat: Implement training and evaluation loops for model upload process

1e2a783

Co-authored-by: aider (openai/gpt-5.2) <aider@aider.chat>

Upload metrics also for yake and ensemble projects

491fdd3

Renamings

1b86bf6

github-advanced-security AI found potential problems Feb 5, 2026

Comment thread tests/check_models_reproducibility.py Fixed

Potential fix for code scanning alert no. 106: Empty except

5c3aff6

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

juhoinkinen wants to merge 40 commits into main from issue906-automate-model-compatibility-checks

3 participants
