HKPUG Data Cleaning Tournament

This is the candidate-facing bundle for the workshop tournament.

Each team submits a data-cleaning script, submission/clean.py. The trusted GitHub Action runs that cleaner against the public train and test feature files, trains the fixed model, evaluates against hidden test labels, and updates the global leaderboard.

Participant Workflow

Fork this repository.
Clone your fork:

git clone https://github.com/<your-github-username>/20260530-data-workshop.git
cd 20260530-data-workshop
uv sync --locked

Copy the manifest template:

cp submission/manifest.example.json submission/manifest.json

Edit submission/manifest.json and set team_id to your assigned team id.
Edit and test submission/clean.py.
Encrypt, sign, and open a PR to pythonhk/20260530-data-workshop:main.

Files You Edit

submission/clean.py
pyproject.toml and uv.lock if your cleaner needs extra public packages

For an official PR, submit only the files below. The cleaner is submitted only in encrypted form, as submission/clean.py.cms:

submission/clean.py.cms
submission/manifest.json
submission/manifest.sig
pyproject.toml
uv.lock

Required Cleaner Output Fields

Both cleaned CSV outputs (train and test) must include one row per official record_id and these feature columns:

origin_station
destination_station
district
transport_type
transport_detail
mode
service_level
operator
day_of_week
is_holiday
weather_condition
country_code

Do not include delay_risk in cleaned outputs. Public train labels are stored separately in data/train_labels.csv; hidden test labels are encrypted.
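
The requirements above can be checked locally before opening a PR. The following is a minimal sketch, not part of the official tooling: the function name check_cleaned_csv is hypothetical, and it assumes that record_id appears as a column in both the feature files and the cleaned outputs.

```python
import csv
import io

# Feature columns listed in this README.
REQUIRED_COLUMNS = [
    "origin_station", "destination_station", "district",
    "transport_type", "transport_detail", "mode",
    "service_level", "operator", "day_of_week",
    "is_holiday", "weather_condition", "country_code",
]


def check_cleaned_csv(text, expected_ids):
    """Return a list of problems found in a cleaned CSV given as text.

    expected_ids is the collection of record_id values from the matching
    input feature file (assumption: record_id is a CSV column).
    """
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    problems = []

    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        problems.append(f"missing columns: {missing}")
    if "delay_risk" in header:
        problems.append("delay_risk must not appear in cleaned output")
    if "record_id" not in header:
        problems.append("missing record_id column")
        return problems

    ids = [row["record_id"] for row in reader]
    if len(ids) != len(set(ids)):
        problems.append("duplicate record_id values")
    if set(ids) != set(expected_ids):
        problems.append("record_id set differs from the input file")
    return problems
```

An empty list means the output passed these checks. To use it, read artifacts/cleaned_train.csv as text and pass the record_id values collected from data/train_features.csv.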

Local Dry Run

From the repo root:

uv sync --locked
uv run python submission/clean.py \
  --train-input data/train_features.csv \
  --train-output artifacts/cleaned_train.csv
uv run python train_submission.py \
  --input artifacts/cleaned_train.csv \
  --labels data/train_labels.csv

train_submission.py trains the fixed model on the first 80% of the cleaned training rows and reports validation metrics on the last 20%. To use a different validation size, add for example --test-size 0.30.
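
To make the ordered split concrete, here is a rough sketch of a first-80%/last-20% split with no shuffling. It is an illustration only and is not taken from train_submission.py: the function name ordered_split is hypothetical, and the rounding at the boundary is an assumption.

```python
def ordered_split(rows, test_size=0.20):
    """Split rows in their existing order, without shuffling.

    The leading rows are used for training and the trailing rows for
    validation. How the real script rounds the boundary is an assumption.
    """
    if not 0.0 < test_size < 1.0:
        raise ValueError("test_size must be between 0 and 1")
    n_valid = round(len(rows) * test_size)
    cut = len(rows) - n_valid
    return rows[:cut], rows[cut:]


train_rows, valid_rows = ordered_split(list(range(10)))
print(train_rows)  # [0, 1, 2, 3, 4, 5, 6, 7]
print(valid_rows)  # [8, 9]
```

Because the split follows row order, a cleaner that reorders or drops rows changes which records land in the validation part.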

The official scorer will also run your cleaner on data/test_features.csv. You do not need test labels for the local dry run.

PR scores are different from local validation. The trusted GitHub Action trains on your cleaned train output, evaluates on your cleaned data/test_features.csv output, and uses hidden labels that are not in the public repo.

Official Submission

After receiving your team private key, copy submission/manifest.example.json to submission/manifest.json, set your assigned team_id, then run:

submission/encrypt_submission.sh /path/to/team_private_key.pem

On Windows PowerShell, run:

powershell -ExecutionPolicy Bypass -File .\submission\encrypt_submission.ps1 C:\path\to\team_private_key.pem

If Windows says OpenSSL was not found, install Git for Windows, reopen PowerShell, and run the same command again.

The script encrypts submission/clean.py, writes its SHA-256 digest into the manifest, and signs the manifest with your team private key.
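
If you want to see what a SHA-256 digest of a file looks like, the sketch below computes one with the Python standard library. It is not the encryption script itself. The helper name sha256_hex is hypothetical, and this README does not state the manifest field that stores the digest, so the sketch only returns the hex string.

```python
import hashlib
from pathlib import Path


def sha256_hex(path, chunk_size=65536):
    """Return the SHA-256 digest of a file as a lowercase hex string."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For example, sha256_hex("submission/clean.py") gives a value you can compare by eye with the digest the script wrote into submission/manifest.json, assuming the digest is taken over the plaintext cleaner.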

Warm-Up Note

The messy dataset still contains strongly corrupted numeric fields such as:

fare_hkd
distance_km
scheduled_duration_min

These numeric fields are useful as a warm-up cleaning task, but they are not part of the fixed workshop submission feature set.
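
One possible starting point for the warm-up is a small parser that pulls a number out of a messy string. This is only a sketch: the helper name parse_number is hypothetical, the example inputs are invented for illustration rather than taken from the dataset, and treating commas as thousands separators is an assumption you should check against the real data.

```python
import re

# Optional sign, digits, optional decimal part.
_NUMBER = re.compile(r"[-+]?\d+(?:\.\d+)?")


def parse_number(raw):
    """Return the first decimal number found in raw as a float, or None."""
    if raw is None:
        return None
    # Assumption: commas are thousands separators, not decimal marks.
    text = str(raw).replace(",", "")
    match = _NUMBER.search(text)
    return float(match.group()) if match else None


# Hypothetical inputs, not real rows from the dataset.
print(parse_number("HK$ 12.50"))  # 12.5
print(parse_number("1,200 m"))    # 1200.0
print(parse_number("n/a"))        # None
```

Returning None for unparseable values keeps the decision about imputing or dropping them separate from parsing.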
