- uv installed
- Python 3.11+
- A Google Gemini API key
set GEMINI_API_KEY="your-key-here"

uv sync

By default, the server runs with the Gemini model backend. You can start the server using the entrypoint script:
uv run start-classifier --port 8000

Alternatively, you can run the FastAPI application directly using uvicorn:
uv run uvicorn classifier.main:app --port 8000

The server supports multiple backend models (Gemini, OpenAI, Hugging Face). You can select the backend using command-line arguments or by setting environment variables in your .env file (CLASSIFIER_TYPE, CLASSIFIER_MODEL, CLASSIFIER_DEVICE).
uv run start-classifier --model gemini-3.1-flash-lite

Ensure you have set OPENAI_API_KEY in your environment or .env file.
uv run start-classifier --model gpt-4o-mini --type openai

You can run local evaluations on Hugging Face models. The Hugging Face classifier includes dedicated, explicit model scripts to handle inference and label mapping:
- Toxic BERT (unitary/toxic-bert): Uses sequence classification and maps toxic labels (such as toxic, severe_toxic, obscene, threat, insult, identity_hate) to toxic or not_toxic.

  uv run start-classifier --model unitary/toxic-bert --device cpu

- Llama Guard (meta-llama/LlamaGuard-7b): Loads Llama Guard as a causal text-generation model, formats the prompt for the safety policy task, and maps generated outputs (unsafe/safe) to toxic or not_toxic.

  uv run start-classifier --model meta-llama/LlamaGuard-7b --device cpu

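
The label mapping described for the two Hugging Face models can be illustrated with a minimal sketch. The helper names, the 0.5 score threshold, and the choice to treat empty or unrecognized Llama Guard output as not_toxic are assumptions made for illustration; the project's own model scripts may behave differently.

```python
# Hypothetical helpers illustrating the label mapping; not the project's actual code.
TOXIC_BERT_LABELS = (
    "toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate",
)


def map_toxic_bert(scores: dict[str, float], threshold: float = 0.5) -> str:
    """Return "toxic" if any toxic-bert label score reaches the threshold.

    The 0.5 threshold is an assumed value, not taken from the project.
    """
    flagged = any(scores.get(label, 0.0) >= threshold for label in TOXIC_BERT_LABELS)
    return "toxic" if flagged else "not_toxic"


def map_llama_guard(generated: str) -> str:
    """Map Llama Guard's generated verdict (safe / unsafe) to the binary label.

    Only the first non-empty line is inspected, since an "unsafe" verdict is
    typically followed by category codes on later lines.
    """
    stripped = generated.strip()
    first_line = stripped.splitlines()[0].strip().lower() if stripped else ""
    # Assumption: anything that is not an explicit "unsafe" counts as not_toxic.
    return "toxic" if first_line.startswith("unsafe") else "not_toxic"
```

For example, `map_toxic_bert({"toxic": 0.9, "insult": 0.2})` yields "toxic", and `map_llama_guard("safe")` yields "not_toxic".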
The classifier is now available at http://localhost:8000. Leave this running.
In a second terminal:
Linux / macOS / Git Bash:
curl -s -X POST http://localhost:8000/classify_batch \
-H "Content-Type: application/json" \
-d '{"texts": ["I hope you have a great day", "I hate you"]}'

Windows (Command Prompt):
curl -s -X POST http://localhost:8000/classify_batch -H "Content-Type: application/json" -d "{\"texts\": [\"I hope you have a great day\", \"I hate you\"]}"

Windows (PowerShell):
Invoke-RestMethod -Uri http://localhost:8000/classify_batch -Method Post -ContentType "application/json" -Body '{"texts": ["I hope you have a great day", "I hate you"]}'

Expected:
{"results": [{"label": "not_toxic"}, {"label": "toxic"}]}

Once the classifier is running, you can execute the evaluation tests in a separate terminal.
Evaluates if the classifier reliably distinguishes harmful from non-harmful content.
uv run eval-correctness --sample_size 4096 --mcc_threshold 0.50 --precision_threshold 0.80

- --sample_size: Number of samples to evaluate (default: 4096).
- --mcc_threshold: Matthews Correlation Coefficient threshold (default: 0.50).
- --precision_threshold: Precision threshold (default: 0.80).
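
To make the two correctness metrics concrete, here is a small sketch of how the Matthews Correlation Coefficient and precision can be computed from true and predicted labels. The function names are hypothetical, and returning 0.0 when a denominator is zero is an assumed convention; the eval-correctness script may compute or report these differently.

```python
import math


def confusion_counts(y_true, y_pred, positive="toxic"):
    """Count (tp, fp, tn, fn), treating `positive` as the positive class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def mcc(tp, fp, tn, fn):
    """Matthews Correlation Coefficient; 0.0 when undefined (assumed convention)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def precision(tp, fp):
    """Share of predicted-positive items that are truly positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


y_true = ["toxic", "toxic", "not_toxic", "not_toxic"]
y_pred = ["toxic", "not_toxic", "not_toxic", "not_toxic"]
tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(round(mcc(tp, fp, tn, fn), 3), precision(tp, fp))
```

MCC uses all four cells of the confusion matrix, so it stays informative when toxic and non-toxic samples are imbalanced, while precision only tracks how often a "toxic" prediction is correct.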
Evaluates if the classifier treats different demographic groups fairly.
uv run eval-bias --threshold 0.05 --batch_size 32

- --threshold: Maximum acceptable disparity (FPR/FNR) between demographic categories (default: 0.05).
- --batch_size: Number of texts per API request (default: 32).
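
As an illustration of an FPR/FNR disparity check, the sketch below computes per-group false positive and false negative rates and takes the gap between the highest and lowest group. Defining disparity as max minus min, the record layout, and the function names are all assumptions; the eval-bias script may define disparity differently.

```python
from collections import defaultdict


def group_rates(records, positive="toxic"):
    """Compute FPR and FNR per group.

    `records` is an iterable of (group, true_label, predicted_label) tuples
    (a hypothetical layout chosen for this sketch).
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for group, truth, pred in records:
        cell = counts[group]
        if truth == positive:
            cell["tp" if pred == positive else "fn"] += 1
        else:
            cell["fp" if pred == positive else "tn"] += 1

    rates = {}
    for group, cell in counts.items():
        negatives = cell["fp"] + cell["tn"]
        positives = cell["fn"] + cell["tp"]
        rates[group] = {
            "fpr": cell["fp"] / negatives if negatives else 0.0,
            "fnr": cell["fn"] / positives if positives else 0.0,
        }
    return rates


def max_disparity(rates, metric):
    """Gap between the highest and lowest group value of `metric` (assumed definition)."""
    values = [group[metric] for group in rates.values()]
    return max(values) - min(values) if values else 0.0
```

Under this definition, a run would pass when both `max_disparity(rates, "fpr")` and `max_disparity(rates, "fnr")` stay at or below the chosen threshold.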
Evaluates if the classifier's behavior remains stable under perturbations.
uv run eval-robustness --threshold 0.90

- --threshold: Minimum acceptable invariance rate (default: 0.90).
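
One plausible reading of the invariance rate is the fraction of texts whose label is unchanged after perturbation. The sketch below computes it under that assumption; the function name and the handling of empty input are hypothetical, and the eval-robustness script may aggregate differently (for example, per perturbation type).

```python
def invariance_rate(original_labels, perturbed_labels):
    """Fraction of items whose label is the same before and after perturbation."""
    if len(original_labels) != len(perturbed_labels):
        raise ValueError("label lists must have the same length")
    if not original_labels:
        return 0.0  # assumed convention for empty input
    unchanged = sum(o == p for o, p in zip(original_labels, perturbed_labels))
    return unchanged / len(original_labels)


original = ["toxic", "not_toxic", "toxic", "not_toxic"]
perturbed = ["toxic", "not_toxic", "not_toxic", "not_toxic"]
print(invariance_rate(original, perturbed))  # 0.75, below a 0.90 threshold
```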
Note: All tests accept an optional --url parameter if you wish to point them to a different API endpoint (default is http://localhost:8000/classify_batch) and a --batch_size parameter (default: 32).
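
For readers who want to call the endpoint from their own scripts, the following sketch shows how texts might be split into batches and posted to the classify_batch URL using only the standard library. The helper names are hypothetical, and the request and response shapes are taken from the curl example in this README rather than from the evaluation scripts themselves.

```python
import json
from urllib import request

DEFAULT_URL = "http://localhost:8000/classify_batch"


def chunked(texts, batch_size=32):
    """Yield consecutive slices of at most `batch_size` texts."""
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


def build_request(batch, url=DEFAULT_URL):
    """Build a JSON POST request carrying one batch of texts."""
    body = json.dumps({"texts": batch}).encode("utf-8")
    return request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def classify_all(texts, url=DEFAULT_URL, batch_size=32):
    """Send every batch to the running classifier and collect labels in order."""
    labels = []
    for batch in chunked(texts, batch_size):
        with request.urlopen(build_request(batch, url)) as response:
            payload = json.load(response)
        labels.extend(item["label"] for item in payload["results"])
    return labels
```

`classify_all` requires the server to be running; `chunked` and `build_request` can be exercised offline.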
For all tests, the resulting metrics are printed to the terminal and written to JSON files. The accompanying writeup, "Technical documentation", discusses the chosen metrics and datasets and explains how the datasets were selected and generated.
The test suite in the tests directory uses mocks and can be run offline without starting the classifier server. To run the tests, execute:
uv run pytest