GitHub - TajwarC/LLM-Safety-Testing: Evaluation suite for LLM safety

Prerequisites

uv installed
Python 3.11+
A Google Gemini API key

1. Set your API key

set GEMINI_API_KEY="your-key-here"

2. Install dependencies

uv sync

3. Start the classifier

By default, the server runs with the Gemini model backend. You can start the server using the entrypoint script:

uv run start-classifier --port 8000

Alternatively, you can run the FastAPI application directly using uvicorn:

uv run uvicorn classifier.main:app --port 8000

Model Backend Configuration

The server supports multiple backend models (Gemini, OpenAI, Hugging Face). You can select the backend using command-line arguments or by setting environment variables in your .env file (CLASSIFIER_TYPE, CLASSIFIER_MODEL, CLASSIFIER_DEVICE).

Gemini (Default)

uv run start-classifier --model gemini-3.1-flash-lite

OpenAI GPT

Ensure you have set OPENAI_API_KEY in your environment or .env file.

uv run start-classifier --model gpt-4o-mini --type openai

Hugging Face Models

You can run local evaluations on Hugging Face models. The Hugging Face classifier includes dedicated, explicit model scripts to handle inference and label mapping:

Toxic BERT (unitary/toxic-bert): Utilizes sequence classification and maps toxic labels (such as toxic, severe_toxic, obscene, threat, insult, identity_hate) to toxic or not_toxic.
```
uv run start-classifier --model unitary/toxic-bert --device cpu
```
Llama Guard (meta-llama/LlamaGuard-7b): Loads Llama Guard as a causal text-generation model, formats the prompt for the safety policy task, and maps generated outputs (unsafe/safe) to toxic or not_toxic.
```
uv run start-classifier --model meta-llama/LlamaGuard-7b --device cpu
```

The classifier is now available at http://localhost:8000. Leave this running.

4. Verify it works

In a second terminal:

Linux / macOS / Git Bash:

curl -s -X POST http://localhost:8000/classify_batch \
  -H "Content-Type: application/json" \
  -d '{"texts": ["I hope you have a great day", "I hate you"]}'

Windows (Command Prompt):

curl -s -X POST http://localhost:8000/classify_batch -H "Content-Type: application/json" -d "{\"texts\": [\"I hope you have a great day\", \"I hate you\"]}"

Windows (PowerShell):

Invoke-RestMethod -Uri http://localhost:8000/classify_batch -Method Post -ContentType "application/json" -Body '{"texts": ["I hope you have a great day", "I hate you"]}'

Expected:

{"results": [{"label": "not_toxic"}, {"label": "toxic"}]}

5. Run Evaluation Tests

Once the classifier is running, you can execute the evaluation tests in a separate terminal.

Correctness Test

Evaluates if the classifier reliably distinguishes harmful from non-harmful content.

uv run eval-correctness --sample_size 4096 --mcc_threshold 0.50 --precision_threshold 0.80

--sample_size: Number of samples to evaluate (default: 4096).
--mcc_threshold: Matthews Correlation Coefficient threshold (default: 0.50).
--precision_threshold: Precision threshold (default: 0.80).

Fairness Test

Evaluates if the classifier treats different demographic groups fairly.

uv run eval-bias --threshold 0.05 --batch_size 32

--threshold: Maximum acceptable disparity (FPR/FNR) between demographic categories (default: 0.05).
--batch_size: Number of texts per API request (default: 32).

Robustness Test

Evaluates if the classifier's behavior remains stable under perturbations.

uv run eval-robustness --threshold 0.90

--threshold: Minimum acceptable invariance rate (default: 0.90).

Note: All tests accept an optional --url parameter if you wish to point them to a different API endpoint (default is http://localhost:8000/classify_batch) and a --batch_size parameter (default: 32).

For all tests, the resulting metrics are printed to the terminal and outputted as JSON files. Information and discussion about the chosen metrics, datasets and how the datasets were chosen & generated is available in the accompanying writeup "Technical documentation".

6. Run Unit and Integration Tests

The test suite in the tests directory uses mocks and can be run offline without starting the classifier server. To run the tests, execute:

uv run pytest

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
classifier		classifier
datasets		datasets
evaluation_suite		evaluation_suite
reports		reports
tests		tests
.gitignore		.gitignore
README.md		README.md
Technical Documentation.pdf		Technical Documentation.pdf
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Prerequisites

1. Set your API key

2. Install dependencies

3. Start the classifier

Model Backend Configuration

Gemini (Default)

OpenAI GPT

Hugging Face Models

4. Verify it works

5. Run Evaluation Tests

Correctness Test

Fairness Test

Robustness Test

6. Run Unit and Integration Tests

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Prerequisites

1. Set your API key

2. Install dependencies

3. Start the classifier

Model Backend Configuration

Gemini (Default)

OpenAI GPT

Hugging Face Models

4. Verify it works

5. Run Evaluation Tests

Correctness Test

Fairness Test

Robustness Test

6. Run Unit and Integration Tests

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages