
Jailbreak (LLM)

A prompt or technique that bypasses an LLM's safety training, eliciting outputs the model is supposed to refuse.

A jailbreak is a prompt or technique that bypasses an LLM's safety training and elicits outputs the model was tuned to refuse — typically content involving violence, self-harm, illegal activity, CBRN uplift, hate, or circumvention of deployment-specific policies. Jailbreaks work because safety training is probabilistic, not deterministic: the model learns to refuse categories of inputs during RLHF or Constitutional AI training, but adversarial inputs find the edges of that distribution where refusal is weaker. Jailbreaks are distinct from prompt injection (which hijacks an application's intended behaviour via untrusted input) and from misuse (which does not require bypassing safety at all).

The landscape of jailbreak techniques has formalised into recognisable families. **Role-play jailbreaks** (DAN, Grandma, Developer Mode) wrap the request in a fictional or hypothetical frame that the safety training didn't cover. **Obfuscation** (base64, leetspeak, character substitution, translation) hides the request's surface form. **Multi-turn escalation** builds toward the prohibited response across several turns. **Adversarial suffixes** (GCG, AutoDAN) are automatically searched token strings (gradient search for GCG, genetic search for AutoDAN) that reliably break alignment. **Multi-modal jailbreaks** route the payload through image, audio, or structured input, where text-based safety classifiers have lower coverage. Anthropic's Constitutional AI and OpenAI's deliberative alignment represent the current state-of-the-art defences, but no deployed model is jailbreak-proof as of 2026.
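The obfuscation family is the easiest to illustrate in code. The sketch below, a minimal illustration rather than a production defence, folds simple leetspeak substitutions and decodes base64-looking tokens before any classifier runs, so the classifier sees the request's plain form. The `normalise` function, the `LEET` table, and the regex threshold are all assumptions for the example; real deployments use maintained normalisation pipelines and trained classifiers.

```python
import base64
import re

# Hypothetical leetspeak folding table for this sketch; a real pipeline
# would use a far broader mapping maintained alongside the classifier.
LEET = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a",
                      "5": "s", "7": "t", "@": "a", "$": "s"})

def normalise(prompt: str) -> str:
    """Fold simple obfuscations so a downstream safety classifier
    sees the plain-text form of the request."""
    text = prompt.translate(LEET).lower()
    # Decode any long base64-looking token and append the plaintext,
    # so hidden payloads become visible to the classifier as well.
    for token in re.findall(r"[A-Za-z0-9+/=]{16,}", prompt):
        try:
            decoded = base64.b64decode(token, validate=True).decode("utf-8")
            text += " " + decoded.lower()
        except Exception:
            pass  # not valid base64 or not UTF-8; leave the token alone
    return text
```

Note that normalisation only widens classifier coverage; it does nothing against role-play or multi-turn attacks, which is one reason the later layers still matter.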

For APAC mid-market teams, the right risk posture depends on deployment class. **Consumer-facing** systems require both input-side classifiers (detect jailbreak attempts) and output-side classifiers (detect prohibited content before it renders), plus logging for after-the-fact review. **Internal-only** systems can rely more on model-level safety plus acceptable-use policies, with lighter guardrails. **Regulated-industry** systems (finance, healthcare, legal, public sector) need the full stack plus mandatory red-teaming before launch. In every case, assume jailbreaks will occur and architect for containment — logging, rate limits, privileged-output suppression — rather than for pure prevention.
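The containment posture above can be sketched as a thin gateway around the model call. This is an in-process, single-node illustration under assumed names (`RateLimiter`, `guarded_call`, `model_fn` are all invented for the example); production systems would use shared rate-limit state and structured audit logs.

```python
import logging
import time
from collections import defaultdict, deque

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("llm-gateway")

class RateLimiter:
    """Sliding-window limit per user. In-process state only;
    a real deployment would back this with a shared store."""
    def __init__(self, max_calls: int, window_s: float):
        self.max_calls, self.window_s = max_calls, window_s
        self.calls: dict[str, deque] = defaultdict(deque)

    def allow(self, user: str) -> bool:
        now = time.monotonic()
        q = self.calls[user]
        while q and now - q[0] > self.window_s:
            q.popleft()          # drop calls outside the window
        if len(q) >= self.max_calls:
            return False
        q.append(now)
        return True

def guarded_call(user: str, prompt: str, model_fn, limiter: RateLimiter) -> str:
    """Containment wrapper: rate-limit first, log every prompt for
    after-the-fact review, then call the model."""
    if not limiter.allow(user):
        log.warning("rate limit hit for user %s", user)
        return "Rate limit exceeded."
    log.info("prompt from %s recorded for review", user)
    return model_fn(prompt)
```

The design choice here is that logging and rate limiting run unconditionally, before any safety verdict, so even a successful jailbreak is throttled and leaves an audit trail.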

The non-obvious failure mode is **input-side-only defence**. Teams deploy a prompt-injection classifier on user input and declare victory, missing that the harmful content is generated by the model and rendered to the user regardless of what the input looked like. Output-side content classification (moderation endpoints, custom classifiers, structured-response validation) catches the category of jailbreaks where the input looks benign but the model was tricked by instructions buried in retrieved context or tool output. Defence-in-depth — input + model + output — is the only pattern that holds up under real adversarial pressure.
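The input + model + output layering can be shown in a few lines. The keyword stubs below stand in for real moderation models; the pattern lists, `REFUSAL` string, and function names are assumptions for this sketch. The point is the layering: the output gate fires even when the input looked benign, which is exactly the case an input-side-only defence misses.

```python
REFUSAL = "Sorry, I can't help with that."

# Hypothetical keyword stubs standing in for trained classifiers or a
# moderation endpoint; the layering, not the stub quality, is the point.
INPUT_PATTERNS = ("developer mode", "ignore previous instructions")
OUTPUT_PATTERNS = ("step 1: acquire precursor",)

def flags_input(prompt: str) -> bool:
    p = prompt.lower()
    return any(s in p for s in INPUT_PATTERNS)

def flags_output(text: str) -> bool:
    t = text.lower()
    return any(s in t for s in OUTPUT_PATTERNS)

def defended_generate(prompt: str, model_fn) -> str:
    """Defence-in-depth: check the input, call the (safety-trained)
    model, then check the output before it renders."""
    if flags_input(prompt):
        return REFUSAL
    out = model_fn(prompt)
    if flags_output(out):
        # Output-side catch: fires even when the jailbreak arrived via
        # retrieved context or tool output rather than the user prompt.
        return REFUSAL
    return out
```

Usage-wise, the second layer is what saves the team that "deployed a classifier on user input and declared victory": a benign-looking prompt whose model response trips the output gate is still suppressed.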

Where AIMenta applies this

Service lines where this concept becomes a deliverable for clients.

Beyond this term

Where this concept ships in practice.

Encyclopedia entries name the moving parts. The links below show where AIMenta turns these concepts into engagements — across service pillars, industry verticals, and Asian markets.
