measurement · benchmark

DecodingTrust

Wang et al. · NeurIPS 2023

A multi-perspective trustworthiness benchmark for LLMs that goes well beyond a single jailbreak score. It bundles datasets, attack scenarios, and evaluation scripts to profile a model across eight distinct trust dimensions. The paper won the Outstanding Paper Award at NeurIPS 2023 and the NSA's Best Scientific Cybersecurity Paper Award (2024).

What it measures

DecodingTrust evaluates a model along eight perspectives, each with its own scenarios and adversarial pressure:

Toxicity — generation of harmful content, including under adversarial system prompts
Stereotype & bias — agreement with biased statements across demographic groups
Adversarial robustness — behavior under standard adversarial text perturbations (AdvGLUE / AdvGLUE++)
Out-of-distribution robustness — handling of inputs outside the training distribution
Robustness to adversarial demonstrations — susceptibility to poisoned in-context / few-shot examples
Privacy — leakage of training data, PII, and in-context private information
Machine ethics — moral judgment and resistance to jailbreak-style ethics prompts
Fairness — prediction parity across protected attributes
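
To make "prediction parity" in the fairness perspective concrete, one common measure is the demographic parity difference: the gap in positive-prediction rate between groups. The following is a minimal sketch, assuming binary predictions and a single protected attribute. The function name and the toy data are hypothetical illustrations, not DecodingTrust's own evaluation scripts.

```python
from collections import defaultdict


def demographic_parity_difference(predictions, groups):
    """Return (gap, rates): the largest gap in positive-prediction rate
    between any two groups, and the per-group rates it was computed from.

    Hypothetical helper for illustration; not part of DecodingTrust.
    """
    totals = defaultdict(int)
    positives = defaultdict(int)
    for pred, group in zip(predictions, groups):
        totals[group] += 1
        positives[group] += int(pred == 1)
    rates = {g: positives[g] / totals[g] for g in totals}
    return max(rates.values()) - min(rates.values()), rates


# Toy data (made up): model predictions and the protected-attribute
# value attached to each example.
preds = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

gap, rates = demographic_parity_difference(preds, groups)
print(rates)  # per-group positive-prediction rate
print(gap)    # 0.0 would mean identical rates across groups
```

A gap of 0 means every group receives positive predictions at the same rate; larger values indicate a larger disparity on the tested examples.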

How red teamers use it

It gives broader-than-jailbreak coverage: instead of a single refusal-rate number, you get a per-perspective scorecard that exposes where a model is weak — e.g., privacy leakage under crafted prompts, or bias amplification under adversarial system roles. Use it to profile a candidate model across trust dimensions before deployment, to compare models or fine-tunes apples-to-apples, and to direct deeper manual probing at the perspectives that score worst. Each perspective ships with concrete attack scenarios you can run or adapt as a starting point for your own harness.
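
The triage workflow described above can be sketched in a few lines: rank perspectives by score to decide where to probe manually, and diff two scorecards to spot regressions between a base model and a fine-tune. This is an illustrative sketch only. The dictionary keys, the 0 to 100 higher-is-better scale, and all scores are assumptions made up for the example, not real DecodingTrust results or its output format.

```python
def rank_weakest(scorecard, top_n=3):
    """Return the top_n lowest-scoring perspectives, worst first."""
    return sorted(scorecard.items(), key=lambda kv: kv[1])[:top_n]


def compare(base, candidate):
    """Per-perspective score change (candidate minus base)."""
    return {p: candidate[p] - base[p] for p in base}


# Hypothetical scorecards: made-up numbers, assumed 0-100 scale where
# higher means more trustworthy on that perspective.
base = {
    "toxicity": 62, "stereotype": 80, "adv_robustness": 55,
    "ood_robustness": 71, "adv_demonstrations": 68, "privacy": 49,
    "machine_ethics": 77, "fairness": 83,
}
finetune = {
    "toxicity": 70, "stereotype": 78, "adv_robustness": 55,
    "ood_robustness": 73, "adv_demonstrations": 60, "privacy": 58,
    "machine_ethics": 77, "fairness": 85,
}

# Where to aim deeper manual red teaming on the fine-tune.
weakest = rank_weakest(finetune)
print("probe first:", weakest)

# Which perspectives got worse after fine-tuning.
deltas = compare(base, finetune)
regressions = [p for p, d in deltas.items() if d < 0]
print("regressions:", regressions)
```

Keeping the scorecard as a plain mapping makes it easy to feed the weakest perspectives into whatever manual probing plan or harness you already use.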

Strengths & limits

Strengths: unusual breadth, peer-reviewed methodology, and reusable datasets/scripts make it a strong baseline trust profile. Limits: it is a static, point-in-time benchmark. The original work centered on GPT-3.5 and GPT-4, so its scenarios may not cover attack classes that emerged later and may not discriminate well between current frontier models. Published scenarios are also exposed to contamination and overfitting once they enter training data, and a clean scorecard reflects only the tested scenarios, not safety in the wild. Treat it as a coverage map and triage tool, not a pass/fail certification, and pair it with live, model-specific red teaming.