Skip to content

wangywUST/OutputJailbreak

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Output Prefix Attacks

This is the repository for our paper Frustratingly Easy Jailbreak of Large Language Models via Output Prefix Attacks.

We propose Opra and OpraTea 🆘, two novel, effective, and extremely simple jailbreak methods that can attack all large language models (LLMs) without expensive optimization or parameter search.

Opra enforces the output prefix of LLMs to follow a "fuse", a probed template that expresses positive attitudes towards addressing the input question, even when the user has malicious intent. OpraTea hides the malicious target within the input prompt to bypass the "content filter" designed to detect and block malicious inputs. Both methods are simple yet threaten the security of LLMs because (1) they do not require any expensive optimization or parameter search; (2) the setting up and execution of our methods only requires a single LLM inference; and (3) they can operate on any black-box LLMs.

Please see our paper for more details.

Opra and OpraTea

A python notebook for running OpraTea on gpt-3.5-turbo-0613 is here: opratea_gpt.ipynb.

Bugs or Questions?

If you have any questions related to the repo or the paper, or you encounter any problems when using the datasets/code, feel free to email Yiwei Wang ([email protected]) or open an issue!

About

Repository for our paper "Frustratingly Easy Jailbreak of Large Language Models via Output Prefix Attacks". https://www.researchsquare.com/article/rs-4385503/latest

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published