DecodingTrust
A multi-perspective trustworthiness benchmark for LLMs that goes well beyond a single jailbreak score. It bundles datasets, attack scenarios, and evaluation scripts to profile a model across eight distinct trust dimensions. The accompanying paper won an Outstanding Paper Award at NeurIPS 2023 and the NSA's Best Scientific Cybersecurity Paper Award (2024).
What it measures
DecodingTrust evaluates a model along eight perspectives, each with its own scenarios and adversarial pressure:
- Toxicity — generation of harmful content, including under adversarial system prompts
- Stereotype & bias — agreement with biased statements across demographic groups
- Adversarial robustness — behavior under standard adversarial text perturbations (AdvGLUE / AdvGLUE++)
- Out-of-distribution robustness — handling of inputs outside the training distribution
- Robustness to adversarial demonstrations — susceptibility to poisoned in-context / few-shot examples
- Privacy — leakage of training data, PII, and in-context private information
- Machine ethics — moral judgment and resistance to jailbreak-style ethics prompts
- Fairness — prediction parity across protected attributes
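To make "prediction parity" in the Fairness perspective concrete, one common way to quantify it is the demographic parity difference: the largest gap in positive-prediction rate between groups. The sketch below is illustrative only. The function name and the toy data are hypothetical and are not taken from DecodingTrust's evaluation scripts.

```python
from collections import defaultdict


def demographic_parity_difference(predictions, groups):
    """Return the largest gap in positive-prediction rate between groups.

    predictions: sequence of 0/1 model outputs (1 = positive prediction)
    groups: sequence of protected-attribute labels, same length as predictions
    """
    predictions = list(predictions)
    groups = list(groups)
    if len(predictions) != len(groups):
        raise ValueError("predictions and groups must have the same length")
    if not predictions:
        raise ValueError("at least one prediction is required")

    totals = defaultdict(int)     # number of examples per group
    positives = defaultdict(int)  # number of positive predictions per group
    for pred, group in zip(predictions, groups):
        totals[group] += 1
        positives[group] += int(pred == 1)

    rates = {g: positives[g] / totals[g] for g in totals}
    return max(rates.values()) - min(rates.values())


# Hypothetical toy data: group "a" gets 3/4 positives, group "b" gets 1/4.
preds = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(f"{demographic_parity_difference(preds, groups):.2f}")  # prints 0.50
```

A value of 0 means every group receives positive predictions at the same rate; larger values indicate a wider disparity.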
How red teamers use it
DecodingTrust gives broader-than-jailbreak coverage: instead of a single refusal-rate number, you get a per-perspective scorecard that shows where a model is weak, for example privacy leakage under crafted prompts or bias amplification under adversarial system roles. Use it to profile a candidate model across trust dimensions before deployment, to compare models or fine-tunes on identical scenarios, and to direct deeper manual probing at the worst-scoring perspectives. Each perspective ships with concrete attack scenarios you can run as-is or adapt as a starting point for your own harness.
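The triage workflow described here can be sketched in a few lines. This is a minimal illustration, assuming you have already collected one aggregate score per perspective and that higher means more trustworthy. The perspective keys, helper names, and all numbers are hypothetical and do not come from DecodingTrust's tooling or published results.

```python
# Hypothetical keys for the eight perspectives; not DecodingTrust identifiers.
PERSPECTIVES = [
    "toxicity",
    "stereotype_bias",
    "adversarial_robustness",
    "ood_robustness",
    "adversarial_demonstrations",
    "privacy",
    "machine_ethics",
    "fairness",
]


def rank_weakest(scorecard, top_n=3):
    """Return the top_n lowest-scoring perspectives, weakest first."""
    missing = set(PERSPECTIVES) - set(scorecard)
    if missing:
        raise ValueError(f"scorecard is missing perspectives: {sorted(missing)}")
    return sorted(PERSPECTIVES, key=lambda p: scorecard[p])[:top_n]


def compare(baseline, candidate):
    """Return per-perspective score change (candidate minus baseline)."""
    return {p: candidate[p] - baseline[p] for p in PERSPECTIVES}


# Made-up scores on an assumed 0-100 scale, for illustration only.
baseline = {
    "toxicity": 80, "stereotype_bias": 90, "adversarial_robustness": 60,
    "ood_robustness": 75, "adversarial_demonstrations": 70, "privacy": 55,
    "machine_ethics": 85, "fairness": 65,
}
# A hypothetical fine-tune that improves privacy but regresses on toxicity.
candidate = dict(baseline, privacy=70, toxicity=72)

print(rank_weakest(baseline))  # ['privacy', 'adversarial_robustness', 'fairness']
deltas = compare(baseline, candidate)
regressions = [p for p in PERSPECTIVES if deltas[p] < 0]
print(regressions)  # ['toxicity']
```

The weakest-first list tells you where to spend manual red-teaming time, while the per-perspective deltas catch a fine-tune that fixes one dimension at the cost of another.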
Strengths & limits
Strengths: unusual breadth, peer-reviewed methodology, and reusable datasets and scripts make it a strong baseline trust profile.
Limits: it is a static, point-in-time benchmark. The original work centered on GPT-3.5 and GPT-4, so frontier models and newer attack classes may have moved past it. Because the scenarios are public, they can end up in training data, which exposes results to contamination and overfitting. A clean scorecard also reflects only the tested scenarios, not safety in the wild. Treat it as a coverage map and triage tool, not a pass/fail certification, and pair it with live, model-specific red teaming.