Awesome PrivEx: Privacy-Preserving Explainable AI (PPXAI)
🐢 Open-Source Evaluation & Testing for LLMs and ML models
An improved version of the technical workshops for the 10-day ML4G camp on the safety of AI systems
QROA: A Black-Box Query-Response Optimization Attack on LLMs
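To give a feel for what "black-box query-response optimization" means in general, here is a generic loop that mutates an adversarial suffix and keeps mutations that raise a judge score. This is not QROA's actual algorithm; `query_llm` and `judge_score` are hypothetical stand-ins for the target model API and a response scorer.

```python
import random
import string

def optimize_suffix(prompt, query_llm, judge_score, steps=200, length=20):
    """Generic black-box query-response loop (illustrative, not QROA itself):
    random-search over a suffix, using only query access to the target LLM."""
    suffix = "".join(random.choices(string.ascii_letters, k=length))
    best = judge_score(query_llm(prompt + " " + suffix))
    for _ in range(steps):
        # Mutate one character of the current suffix.
        cand = list(suffix)
        cand[random.randrange(length)] = random.choice(string.ascii_letters)
        cand = "".join(cand)
        # Score the target model's response; no gradients or logits needed.
        score = judge_score(query_llm(prompt + " " + cand))
        if score > best:  # keep mutations that improve the judge score
            suffix, best = cand, score
    return suffix, best
```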
Website to track people, organizations, and products (tools, websites, etc.) in AI safety
Extended multi-agent and multi-objective (MaMoRL) environments based on DeepMind's AI Safety Gridworlds, a suite of reinforcement learning environments illustrating various safety properties of intelligent agents, made compatible with OpenAI's Gym/Gymnasium and the Farama Foundation's PettingZoo.
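Since the environments are PettingZoo-compatible, interacting with them should look roughly like the standard agent-environment-cycle (AEC) loop below; `make_safety_gridworld` and the module name are hypothetical stand-ins for the repo's actual entry point.

```python
# Standard PettingZoo AEC interaction loop for a multi-agent environment.
# `mamorl_gridworlds` / `make_safety_gridworld` are HYPOTHETICAL names.
from mamorl_gridworlds import make_safety_gridworld  # hypothetical import

env = make_safety_gridworld("island_navigation", num_agents=2)
env.reset(seed=0)
for agent in env.agent_iter():
    observation, reward, termination, truncation, info = env.last()
    # PettingZoo convention: a finished agent must step with action=None.
    action = None if termination or truncation else env.action_space(agent).sample()
    env.step(action)
env.close()
```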
A deep learning model for detecting objectionable content in images: it performs binary classification of input images into violent and non-violent classes, with training augmented by AIGC-generated images, adversarial examples, and noise-added images.
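A minimal sketch of what such a binary classifier could look like in PyTorch; the ResNet-18 backbone and the Gaussian-noise transform are assumptions for illustration, not the repo's actual architecture or augmentation pipeline.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

class AddGaussianNoise:
    """Noise augmentation, mimicking the 'noise-added images' in training."""
    def __init__(self, std=0.05):
        self.std = std
    def __call__(self, x):
        return (x + torch.randn_like(x) * self.std).clamp(0.0, 1.0)

train_tf = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    AddGaussianNoise(std=0.05),  # assumed augmentation strength
])

# Assumed backbone: pretrained ResNet-18 with a two-way head.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 2)  # classes: violent / non-violent
```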
A curated list of awesome responsible machine learning resources.
RuLES: a benchmark for evaluating rule-following in language models
Scan your AI/ML models for problems before you put them into production.
Evaluation & testing framework for computer vision models
A Python-based toolkit for comparing transformers.
A reading list on adversarial perspectives and robustness in deep reinforcement learning.
Aira is a series of chatbots developed as an experimentation playground for value alignment.
A DPLL(T)-based verification tool for DNNs
Safe RLHF: Constrained Value Alignment via Safe Reinforcement Learning from Human Feedback
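The "constrained value alignment" in the title refers to a constrained policy-optimization problem; a sketch of its general Lagrangian form is below, with the reward model R, cost model C, and cost budget d labeled here for illustration (the paper's exact formulation may differ in details).

```latex
% Maximize the learned reward subject to a bound on the learned cost,
% then solve the constraint via its Lagrangian dual. Symbols illustrative.
\max_{\theta}\ \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot\mid x)}\big[ R_\phi(x, y) \big]
\quad \text{s.t.} \quad
\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot\mid x)}\big[ C_\psi(x, y) \big] \le d
\\[4pt]
\Longrightarrow\quad
\min_{\lambda \ge 0}\ \max_{\theta}\
\mathbb{E}\big[ R_\phi(x, y) \big] - \lambda \Big( \mathbb{E}\big[ C_\psi(x, y) \big] - d \Big)
```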
[ICLR 2024 Spotlight 🔥] - [Best Paper Award, SoCal NLP 2023 🏆] - Jailbreak in Pieces: Compositional Adversarial Attacks on Multi-Modal Language Models