
GCG — Greedy Coordinate Gradient

Zou, Wang, Carlini, et al. · 2023

GCG is an automated, gradient-guided algorithm that appends a short adversarial suffix of seemingly random tokens to a harmful prompt, optimizing those tokens until the model opens its reply with an affirmative response instead of a refusal.

What it exploits

Alignment (RLHF, refusal training) is a thin, learned behavior layered on top of a model that still knows how to answer almost anything. GCG exploits the fact that this refusal behavior is brittle in the model's high-dimensional input space: there exist token sequences that nudge the next-token distribution away from "I can't help with that" and toward an affirmative prefix like "Sure, here is...". Because the attack reads the model's gradients, it needs white-box access (weights and tokenizer) to run the optimization, though the resulting suffixes often transfer to models the attacker has no such access to.

How it works

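GCG searches for an adversarial suffix by greedily editing one token position at a time, using gradients to rank candidate substitutions rather than brute-forcing the whole vocabulary. Each iteration takes the gradient of the loss on the target affirmative prefix with respect to the one-hot token indicators at every suffix position, keeps the top-k most promising replacement tokens per position, samples a batch of single-token swaps from those candidates, scores each candidate with a forward pass, and keeps the swap with the lowest loss.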
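The shape of this search can be illustrated on a toy problem. The sketch below is not the authors' implementation and involves no language model: it assumes a small random embedding table and a synthetic loss (squared distance between the mean suffix embedding and a target vector) standing in for a model's loss on a target string. The names `toy_loss`, `onehot_gradient`, and `gcg_toy`, and the parameters `top_k`, `batch`, and `steps`, are illustrative choices. As a simplification, it keeps the current suffix whenever no sampled swap lowers the loss, so the loss never increases.

```python
import random

def toy_loss(suffix, emb, target):
    """Synthetic objective: squared distance from the mean suffix embedding to target."""
    dim = len(target)
    mean = [sum(emb[tok][j] for tok in suffix) / len(suffix) for j in range(dim)]
    return sum((mean[j] - target[j]) ** 2 for j in range(dim))

def onehot_gradient(suffix, emb, target):
    """Gradient of toy_loss w.r.t. the one-hot token indicator at each position."""
    n, dim = len(suffix), len(target)
    mean = [sum(emb[tok][j] for tok in suffix) / n for j in range(dim)]
    # Gradient w.r.t. the embedding at any one position (identical across
    # positions for this particular toy loss).
    grad_emb = [2.0 * (mean[j] - target[j]) / n for j in range(dim)]
    # Chain rule through the embedding lookup: score of token v is grad_emb . emb[v].
    per_token = [sum(grad_emb[j] * row[j] for j in range(dim)) for row in emb]
    return [per_token for _ in range(n)]

def gcg_toy(emb, target, suffix_len=6, top_k=8, batch=32, steps=50, seed=0):
    """Greedy coordinate search: gradients rank candidates, exact loss picks the swap."""
    rng = random.Random(seed)
    vocab = range(len(emb))
    suffix = [rng.randrange(len(emb)) for _ in range(suffix_len)]
    best = toy_loss(suffix, emb, target)
    history = [best]
    for _ in range(steps):
        grad = onehot_gradient(suffix, emb, target)
        # Per position, keep the top_k tokens with the most negative gradient.
        candidates = [sorted(vocab, key=lambda v: grad[i][v])[:top_k]
                      for i in range(suffix_len)]
        step_best, step_suffix = best, suffix
        for _ in range(batch):
            pos = rng.randrange(suffix_len)          # one random position
            trial = suffix[:]
            trial[pos] = rng.choice(candidates[pos])  # one single-token swap
            trial_loss = toy_loss(trial, emb, target)  # exact evaluation
            if trial_loss < step_best:
                step_best, step_suffix = trial_loss, trial
        suffix, best = step_suffix, step_best
        history.append(best)
    return suffix, history

# Example run on a hypothetical 50-token vocabulary with 4-dimensional embeddings.
table_rng = random.Random(1)
emb = [[table_rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(50)]
suffix, history = gcg_toy(emb, target=[0.5, -0.5, 0.25, 0.0])
print("initial loss:", history[0], "final loss:", history[-1])
```

The design point worth noticing is the split between ranking and selection: the gradient is only a first-order guess about which swaps help, so it is used to shortlist candidates, and the exact loss decides which one is kept.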

Why it matters

GCG was among the first attacks to show that jailbreaks could be automated and optimized rather than hand-crafted, turning red-teaming into a search problem. Its headline result was transfer: suffixes optimized against open-weight models (Vicuna, Llama-2) jailbroke black-box commercial systems including ChatGPT, Bard, and Claude, demonstrating that a single attacker with local model access can attack systems they cannot see inside. It established the optimization-based jailbreak paradigm that later, stealthier attacks build on.

Defenses & detection

Because GCG suffixes are gibberish to a human reader, they have a detectable signature:

None of these are complete: later work produces low-perplexity, readable adversarial suffixes that slip past perplexity filters, so detection should be treated as defense-in-depth, not a fix.
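The perplexity filtering mentioned above can be sketched in a few lines. This is a rough illustration under loud assumptions: production filters typically score the prompt with a real language model, whereas this sketch substitutes a character-bigram model built from a tiny hypothetical reference corpus so that it runs with the standard library alone. `REFERENCE_CORPUS`, `ALPHABET_SIZE`, `SMOOTHING`, and the threshold of 100 are all illustrative values, not recommendations.

```python
import math
from collections import Counter

# Hypothetical sample of "normal" prompts; a real filter would use far more data.
REFERENCE_CORPUS = (
    "please summarize the following article in three short sentences. "
    "what is the capital of france and how many people live there. "
    "write a short poem about the sea in the morning. "
    "explain how a binary search works on a sorted list."
)
ALPHABET_SIZE = 128  # assumed alphabet size used for smoothing
SMOOTHING = 0.01     # assumed additive smoothing constant

def train_bigram_model(corpus):
    """Count character bigrams and how often each character starts a bigram."""
    pair_counts = Counter(zip(corpus, corpus[1:]))
    prev_counts = Counter(corpus[:-1])
    return pair_counts, prev_counts

def perplexity(text, model):
    """Per-character perplexity of text under the smoothed bigram model."""
    pair_counts, prev_counts = model
    pairs = list(zip(text, text[1:]))
    if not pairs:
        return 0.0  # too short to score; treated as not suspicious
    log_prob = 0.0
    for prev, nxt in pairs:
        prob = (pair_counts[(prev, nxt)] + SMOOTHING) / (
            prev_counts[prev] + SMOOTHING * ALPHABET_SIZE
        )
        log_prob += math.log(prob)
    return math.exp(-log_prob / len(pairs))

def is_suspicious(text, model, threshold=100.0):
    """Flag text whose perplexity exceeds an (assumed) threshold."""
    return perplexity(text, model) > threshold

model = train_bigram_model(REFERENCE_CORPUS)
for prompt in ["write a short poem about the sea", "}]@@ ^~|{{$ %&*#"]:
    print(repr(prompt), round(perplexity(prompt, model), 1), is_suspicious(prompt, model))
```

In practice the threshold would be tuned on real traffic, and scoring a sliding window rather than the whole prompt helps catch a short gibberish suffix attached to a long, fluent request. The caveat above still applies: a readable, low-perplexity suffix passes this check.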