ai-safety

Here are 3,135 public repositories matching this topic...

jphall663 / awesome-machine-learning-interpretability

A curated list of awesome responsible machine learning resources.

Updated Mar 16, 2026

microsoft / agent-governance-toolkit

AI Agent Governance Toolkit — Policy enforcement, zero-trust identity, execution sandboxing, and reliability engineering for autonomous AI agents. Covers 10/10 OWASP Agentic Top 10.

microsoft python security owasp trust compliance governance ai-safety policy-engine ai-agents zero-trust agent-framework

Updated May 23, 2026
Python

PKU-Alignment / safe-rlhf

Safe RLHF: Constrained Value Alignment via Safe Reinforcement Learning from Human Feedback

Updated Nov 24, 2025
Python

OpenLMLab / MOSS-RLHF

Secrets of RLHF in Large Language Models Part I: PPO

alignment ai-safety rlhf

Updated Mar 3, 2024
Python

cvs-health / uqlm

UQLM (Uncertainty Quantification for Language Models) is a Python package for UQ-based LLM hallucination detection.

uncertainty-quantification uncertainty-estimation ai-safety confidence-score hallucination confidence-estimation ai-evaluation llm llm-evaluation llm-safety hallucination-evaluation hallucination-detection hallucination-mitigation llm-hallucination

Updated May 22, 2026
Python

tg12 / gpt_jailbreak_status

This is a repository that aims to provide updates on the status of jailbreaking the OpenAI GPT language model.

jailbreak openai gpt ai-safety llm chatgpt prompt-injection

Updated May 16, 2026
HTML

wuyoscar / ISC-Bench

Internal Safety Collapse: Turning the LLM or an AI Agent into a sensitive data generator.

benchmark jailbreak ai-safety red-teaming large-language-models llm-safety safety-evaluation agent-safety

Updated May 14, 2026
Python

chrisliu298 / awesome-llm-unlearning

A resource repository for machine unlearning in large language models

Updated May 12, 2026

PacificAI / langtest

Deliver safe & effective language models

nlp artificial-intelligence benchmarks benchmark-framework model-assessment ai-safety mlops responsible-ai ml-safety trustworthy-ai ethics-in-ai ml-testing large-language-models llm ai-testing llm-test llm-evaluation-toolkit llm-as-evaluator llm-testing

Updated Apr 22, 2026
Python

agencyenterprise / PromptInject

PromptInject is a framework that assembles prompts in a modular fashion to provide a quantitative analysis of the robustness of LLMs to adversarial prompt attacks. 🏆 Best Paper Awards @ NeurIPS ML Safety Workshop 2022

machine-learning agi language-models ai-safety adversarial-attacks ai-alignment ml-safety gpt-3 large-language-models prompt-engineering chain-of-thought agi-alignment

Updated Apr 27, 2026
Python

cordum-io / cordum

The open agent control plane. Govern autonomous AI agents with pre-execution policy enforcement, approval gates, and audit trails. Works with LangChain, CrewAI, MCP, and any framework.

Updated May 23, 2026
Go

ifixai-ai / iFixAi

The open-source diagnostic for AI misalignment. 32 tests across fabrication, manipulation, deception, unpredictability, and opacity. Provider-agnostic. Runs against OpenAI, Anthropic, Bedrock, Azure, Gemini, and more. Letter grade in under 5 minutes, content-addressed manifest for bit-identical replay. Built by iMe.

Updated May 22, 2026
Python

pegasi-ai / reins

Stop AI agents from doing things you didn't ask for.

mcp intervention browser-automation ai-safety cua human-in-the-loop audit-trail ai-monitoring agent-security agent-observability claude-code-plugin claude-code-skill claude-code-marketplace openclaw-security

Updated May 22, 2026
Python

tigerlab-ai / tiger

Open Source LLM toolkit to build trustworthy LLM applications. TigerArmor (AI safety), TigerRAG (embedding, RAG), TigerTune (fine-tuning)

classification data-augmentation ai-safety fine-tuning aisafety rag large-language-models llm llm-training

Updated Dec 2, 2023
Jupyter Notebook

Justin0504 / Aegis

Runtime policy enforcement for AI agents. Cryptographic audit trail, human-in-the-loop approvals, kill switch. Zero code changes.

mcp ai-safety policy-engine ai-agents audit-trail langchain anthropic llm-observability

Updated May 23, 2026
TypeScript

aisa-group / PostTrainBench

Measuring how well CLI agents like Claude Code or Codex CLI can post-train base LLMs on a single H100 GPU in 10 hours

ai-safety post-training gemini-cli claude-code codex-cli ai-research-automation

Updated May 18, 2026
Python

hendrycks / ethics

Aligning AI With Shared Human Values (ICLR 2021)

ai-safety machine-ethics ml-safety ethical-ai gpt-3

Updated Apr 21, 2023
Python

Govcraft / rust-docs-mcp-server

🦀 Prevents outdated Rust code suggestions from AI assistants. This MCP server fetches current crate docs, uses embeddings/LLMs, and provides accurate context via a tool call.

Updated Nov 24, 2025
Rust

schmitech / orbit

A self-hosted AI infrastructure for private RAG and multi-model applications.

python elasticsearch text-to-speech mongodb chatbot self-hosted openai developer-tools speech-to-text natural-language-to-sql ai-safety rag vector-database ai-assistant llm anthropic retrieval-augmented-generation ollama-client ai-gateway

Updated May 22, 2026
Python

ShengranHu / Thought-Cloning

[NeurIPS '23 Spotlight] Thought Cloning: Learning to Think while Acting by Imitating Human Thinking

reinforcement-learning deep-learning pytorch artificial-intelligence imitation-learning ai-safety

Updated Jun 28, 2024
Python
