Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added a jailbreaking type + example (thinking aloud) #1129

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
10 changes: 10 additions & 0 deletions docs/prompt_hacking/jailbreaking.md
Original file line number Diff line number Diff line change
Expand Up @@ -39,6 +39,16 @@ import actor from '@site/docs/assets/jailbreak/chatgpt_actor.jpg';

This example by [@m1guelpf](https://twitter.com/m1guelpf/status/1598203861294252033) demonstrates an acting scenario between two people discussing a robbery, causing ChatGPT to assume the role of the character(@miguel2022jailbreak). As an actor, it is implied that plausible harm does not exist. Therefore, ChatGPT appears to assume it is safe to give follow provided user input about how to break into a house.

#### Thinking Aloud

import actor from '@site/docs/assets/jailbreak/20240207-jailbreaking-thinking_out_loud.png';

<div style={{textAlign: 'center'}}>
<img src={actor} style={{width: "500px"}} />
</div>

This example by [@santanavagner](https://twitter.com/santanavagner/status/1756014089510244387) demonstrates how to use thinking aloud to expose detailed actions on how to picklock a car. The prompt itself did not mention any specific harmful situation and leverage characters created by ChatGPT. Hence, when asking ChatGPT to describe how a character would think out loud, it provided details that it would deny to provide otherwise.

### Alignment Hacking

ChatGPT was fine tuned with RLHF, so it is theoretically trained to produce 'desirable' completions, using human standards of what the "best" response is. Similar to this concept, jailbreaks have been developed to convince ChatGPT that it is doing the "best" thing for the user.
Expand Down