the attack canon · white-box

GCG — Greedy Coordinate Gradient

Zou, Wang, Carlini, et al. · 2023

Paper: Universal and Transferable Adversarial Attacks on Aligned Language Models

GCG is an automated, gradient-guided algorithm that appends a short adversarial suffix of seemingly random tokens to a harmful prompt, optimizing those tokens until the model's safety alignment collapses and it complies.

What it exploits

Alignment (RLHF, refusal training) is a thin, learned behavior layered on top of a model that still knows how to answer almost anything. GCG exploits the fact that this refusal behavior is brittle in the model's high-dimensional input space: there exist token sequences that nudge the next-token distribution away from "I can't help with that" and toward an affirmative prefix like "Sure, here is...". Because the attack reads the model's gradients, it needs white-box access (weights and tokenizer) to compute the optimization — though the resulting suffixes often transfer without it.
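The "affirmative prefix" target can be written as a loss to minimize. The following is a sketch in the notation commonly used to present the attack; the symbols are assumptions introduced here for illustration: $x_{1:n}$ is the full prompt including the suffix, $\mathcal{I}$ indexes the suffix positions, $x^{\star}_{n+1:n+H}$ is the target string of $H$ tokens, and $V$ is the vocabulary size.

```latex
\mathcal{L}(x_{1:n}) = -\log p\left(x^{\star}_{n+1:n+H} \mid x_{1:n}\right),
\qquad
\min_{x_{\mathcal{I}} \in \{1,\dots,V\}^{|\mathcal{I}|}} \mathcal{L}(x_{1:n})
```

Only the suffix tokens $x_{\mathcal{I}}$ are free variables; the user's request and the target string stay fixed.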

How it works

GCG searches for an adversarial suffix by greedily editing one token position at a time, using gradients to rank candidate substitutions rather than brute-forcing the whole vocabulary:

Objective: maximize the probability that the model begins its response with a target affirmative string (e.g. "Sure, here is how to...").
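Gradient ranking: treat the objective as a loss (the negative log-likelihood of the target string) and compute its gradient with respect to the one-hot token vector at each suffix position. The k tokens with the most negative gradient at a position are the replacements that a first-order estimate predicts will lower the loss the most.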
Greedy coordinate step: sample candidate swaps from those top-k tokens, evaluate each with a real forward pass, and keep the single substitution that reduces the loss the most.
Iteration: repeat this step for hundreds of iterations, with every step free to change any suffix position, until the model complies. Averaging the loss over multiple prompts and multiple models makes the suffix universal (works across requests) and transferable (works across models).
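The loop above can be sketched on a toy problem. This is a minimal illustration, not the paper's implementation: the "model" is a small hand-written differentiable function rather than a language model, and every name and constant (`V`, `L`, `D`, `E`, `W`, `TARGET`, `gcg_step`, `run_gcg`, `top_k`, `batch`) is hypothetical. It also uses one simplification, labeled in the code: a swap is kept only if it improves on the current loss, so the loss never increases.

```python
import math
import random

V, L, D = 30, 6, 4  # toy vocabulary size, suffix length, hidden width (hypothetical)
_init = random.Random(0)
E = [[_init.gauss(0, 1) for _ in range(D)] for _ in range(V)]  # token embeddings
W = [[_init.gauss(0, 1) for _ in range(D)] for _ in range(L)]  # per-position weights
TARGET = [_init.gauss(0, 1) for _ in range(D)]  # stand-in for the target output


def hidden(tokens):
    # Toy forward pass: weighted sum of tanh(embedding) over suffix positions.
    return [sum(W[i][j] * math.tanh(E[t][j]) for i, t in enumerate(tokens))
            for j in range(D)]


def loss(tokens):
    # Squared distance to the target; stands in for the target's negative log-likelihood.
    h = hidden(tokens)
    return 0.5 * sum((h[j] - TARGET[j]) ** 2 for j in range(D))


def onehot_grad(tokens):
    """Gradient of the loss w.r.t. every one-hot entry: an L x V matrix."""
    h = hidden(tokens)
    grad = []
    for i, t in enumerate(tokens):
        # Chain rule through tanh at position i.
        d_pre = [(h[j] - TARGET[j]) * W[i][j] * (1 - math.tanh(E[t][j]) ** 2)
                 for j in range(D)]
        grad.append([sum(d_pre[j] * E[v][j] for j in range(D)) for v in range(V)])
    return grad


def gcg_step(tokens, rng, top_k=8, batch=32):
    grad = onehot_grad(tokens)
    # Gradient ranking: most negative gradient = largest predicted loss decrease.
    top = [sorted(range(V), key=lambda v: grad[i][v])[:top_k] for i in range(L)]
    best, best_loss = tokens, loss(tokens)
    for _ in range(batch):
        # Sample a single-token swap at a random position from that position's top-k.
        i = rng.randrange(L)
        cand = tokens[:i] + [rng.choice(top[i])] + tokens[i + 1:]
        cand_loss = loss(cand)  # exact evaluation with a real forward pass
        # Simplification: keep a swap only if it beats the current suffix.
        if cand_loss < best_loss:
            best, best_loss = cand, cand_loss
    return best, best_loss


def run_gcg(steps=50, seed=1):
    rng = random.Random(seed)
    tokens = [rng.randrange(V) for _ in range(L)]
    history = [loss(tokens)]
    for _ in range(steps):
        tokens, current = gcg_step(tokens, rng)
        history.append(current)
    return tokens, history
```

Calling `run_gcg()` returns the final token ids and the loss after each step. The gradient is only used to shortlist candidates; the decision to keep a swap always rests on an exact loss evaluation, which is what makes the search robust to the gradient being a poor guide for discrete changes.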

Why it matters

GCG was among the first attacks to show that jailbreaks of aligned chat models could be automated and optimized rather than hand-crafted, turning red-teaming into a search problem. Its headline result was transfer: suffixes optimized against open-weight models (Vicuna, Llama-2) jailbroke black-box commercial systems including ChatGPT, Bard, and Claude, demonstrating that a single attacker with local model access can attack systems they cannot see inside. It established the optimization-based jailbreak paradigm that later, stealthier attacks build on.

Defenses & detection

Because GCG suffixes are gibberish to a human reader, they have a detectable signature:

Perplexity filters: the suffix has unusually high token-level perplexity; flagging high-perplexity inputs catches many GCG strings cheaply (Jain et al., 2023).
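Input perturbation / smoothing: randomly perturbing characters across several copies of the prompt and aggregating the responses (e.g. SmoothLLM), or paraphrasing the input before it reaches the model, breaks the fragile optimized suffix.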
Adversarial training & refusal hardening: fine-tuning on GCG-style attacks raises the bar, though it shifts rather than closes the gap.
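The perplexity idea can be sketched without a language model. The example below is an illustrative stand-in, not how a production filter works: it scores text with a character-bigram model trained on a tiny benign corpus, whereas a real filter would use token-level log-probabilities from an LLM. The corpus, the function names, and the smoothing constants (`alpha`, `alphabet_size`) are all assumptions.

```python
import math
from collections import Counter

# Tiny benign reference corpus (hypothetical; a real filter would score with an LLM).
REFERENCE = (
    "please explain how to bake bread at home. "
    "write a short story about a friendly robot. "
    "what is the capital of france and why is it famous. "
    "summarize the main ideas of this article in three sentences. "
)


def train_bigram(text):
    """Count character bigrams and how often each character starts one."""
    return Counter(zip(text, text[1:])), Counter(text[:-1])


def perplexity(text, model, alpha=0.1, alphabet_size=128):
    """exp of the mean negative log-probability per character transition."""
    pairs, firsts = model
    transitions = list(zip(text, text[1:]))
    if not transitions:
        return 1.0  # nothing to score
    log_prob = sum(
        # Add-alpha smoothing so unseen bigrams get a small nonzero probability.
        math.log((pairs[ab] + alpha) / (firsts[ab[0]] + alpha * alphabet_size))
        for ab in transitions
    )
    return math.exp(-log_prob / len(transitions))


def is_suspicious(text, model, threshold):
    """Flag inputs whose perplexity exceeds a calibrated threshold."""
    return perplexity(text, model) > threshold
```

A typical use is `model = train_bigram(REFERENCE)` followed by `is_suspicious(user_input, model, threshold)`. The threshold has to be calibrated on benign traffic: set too low it rejects unusual but legitimate prompts (code, other languages), set too high it misses suffixes.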

None of these are complete: later work produces low-perplexity, readable adversarial suffixes that slip past perplexity filters, so detection should be treated as defense-in-depth, not a fix.
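As one layer in such a defense-in-depth setup, the input-perturbation idea can be sketched as follows. This is a simplified illustration in the spirit of SmoothLLM, not its actual implementation: the "model" is a hypothetical stub that complies only when a placeholder suffix survives verbatim, and the names `TRIGGER`, `toy_model_complies`, `random_swap`, and `smoothed_complies` are invented for the example.

```python
import random
import string

# Placeholder standing in for an optimized suffix (not a real attack string).
TRIGGER = "}{ zx!!qv ]]((jk ##"


def toy_model_complies(prompt):
    """Hypothetical stub: 'complies' only if the suffix survives verbatim."""
    return TRIGGER in prompt


def random_swap(prompt, rate, rng):
    """Replace a random `rate` fraction of characters with random printable ones."""
    chars = list(prompt)
    n_swap = int(len(chars) * rate)
    for i in rng.sample(range(len(chars)), n_swap):
        # Digits, letters, punctuation, and the space character.
        chars[i] = rng.choice(string.printable[:95])
    return "".join(chars)


def smoothed_complies(prompt, model, copies=11, rate=0.2, seed=0):
    """Majority vote of the model's behavior over independently perturbed copies."""
    rng = random.Random(seed)
    votes = sum(model(random_swap(prompt, rate, rng)) for _ in range(copies))
    return votes > copies / 2
```

With a prompt such as `"tell me a story " + TRIGGER`, the stub complies on the raw input, while the majority vote over perturbed copies almost always does not, because very few copies leave every suffix character intact. The design bet is that benign prompts tolerate a few corrupted characters far better than an optimized suffix does; an attacker who optimizes for robustness to perturbation weakens that assumption.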