measurement · benchmark

TrustLLM

Sun, Huang et al. · ICML 2024

TrustLLM is a set of trustworthiness principles plus a packaged benchmark for LLMs. It scores 16 mainstream models (proprietary and open-weight) across 30+ datasets, and ships a Python toolkit and leaderboard so the same evaluation runs against any model you point it at.

What it measures

The principles span eight dimensions: truthfulness, safety, fairness, robustness, privacy, machine ethics, transparency, and accountability. The empirical benchmark operationalizes six of them; transparency and accountability resist systematic scoring and are treated as qualitative guidance rather than graded tasks.

30+ datasets across 18+ subcategories, mixing repurposed public sets and TrustLLM-built tasks.
Scoring blends automatic, semi-automatic, and manual evaluation depending on the dimension.
Headline findings: trustworthiness tracks general utility, many models are over-cautious (refuse benign prompts), and proprietary models mostly lead open-weight ones.

How red teamers use it

Use TrustLLM for broad trustworthiness profiling, not jailbreak hunting. It produces a per-dimension scorecard that surfaces where a model is weak — privacy leakage, fairness gaps, robustness under perturbation, over-refusal — so you can prioritize deeper adversarial work. The toolkit lets you run the full suite on an internal model and diff it against the public leaderboard as a regression baseline.
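The regression-baseline idea above can be sketched as a simple scorecard diff. This is a minimal illustration in plain Python, not the TrustLLM toolkit's API: the `diff_scorecards` helper, the `tolerance` parameter, the 0-to-1 score scale, and every number below are assumptions made up for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical per-dimension scores in [0, 1], higher is better.
# These values are invented for illustration; they are not TrustLLM results.
BASELINE: Dict[str, float] = {
    "truthfulness": 0.70,
    "safety": 0.85,
    "fairness": 0.75,
    "robustness": 0.60,
    "privacy": 0.80,
    "machine_ethics": 0.65,
}

CANDIDATE: Dict[str, float] = {
    "truthfulness": 0.72,
    "safety": 0.85,
    "fairness": 0.74,
    "robustness": 0.55,
    "privacy": 0.70,
    "machine_ethics": 0.66,
}

Regression = Tuple[str, float, float, float]  # (dimension, baseline, candidate, delta)


def diff_scorecards(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.02,
) -> List[Regression]:
    """Return dimensions where the candidate dropped by more than `tolerance`,
    sorted with the largest drop first."""
    regressions: List[Regression] = []
    for dimension, base_score in baseline.items():
        if dimension not in candidate:
            continue  # skip dimensions the candidate run did not cover
        delta = candidate[dimension] - base_score
        if delta < -tolerance:
            regressions.append((dimension, base_score, candidate[dimension], delta))
    regressions.sort(key=lambda row: row[3])  # most negative delta first
    return regressions


if __name__ == "__main__":
    for dimension, base, cand, delta in diff_scorecards(BASELINE, CANDIDATE):
        print(f"{dimension}: {base:.2f} -> {cand:.2f} ({delta:+.2f})")
```

With these made-up numbers, the privacy and robustness drops exceed the tolerance and would be the first candidates for deeper adversarial testing, while the small fairness dip is treated as noise. The tolerance is a judgment call: it should reflect how much run-to-run variation you see when repeating the same evaluation.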

Strengths & limits

Its strength is breadth and reproducibility: one principled framework, many dimensions, a runnable toolkit, and a comparison leaderboard. Its limits are the usual static-benchmark caveats: fixed datasets leak into training corpora and saturate, only six of the eight dimensions are actually graded, and aggregate scores can mask specific failure modes. Treat it as a profiling baseline, not proof of safety; pair it with live, adaptive red teaming.
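The point about aggregate scores masking failure modes can be made concrete with a small worked example. The sketch below is illustrative only: the `summarize` helper, the 0-to-100 integer scale, and both scorecards are hypothetical and do not come from the TrustLLM leaderboard.

```python
from typing import Dict, Tuple

# Two hypothetical scorecards on a 0-100 scale (invented numbers).
MODEL_A: Dict[str, int] = {
    "truthfulness": 75,
    "safety": 75,
    "fairness": 75,
    "robustness": 75,
    "privacy": 75,
    "machine_ethics": 75,
}

MODEL_B: Dict[str, int] = {
    "truthfulness": 90,
    "safety": 90,
    "fairness": 90,
    "robustness": 90,
    "privacy": 30,
    "machine_ethics": 60,
}


def summarize(scores: Dict[str, int]) -> Tuple[float, str, int]:
    """Return (mean score, weakest dimension, weakest score) for one scorecard."""
    mean_score = sum(scores.values()) / len(scores)
    weakest = min(scores, key=scores.get)  # first dimension with the lowest score
    return mean_score, weakest, scores[weakest]


if __name__ == "__main__":
    for name, card in (("model_a", MODEL_A), ("model_b", MODEL_B)):
        mean_score, weakest, low = summarize(card)
        print(f"{name}: mean={mean_score:.1f}, weakest={weakest} ({low})")
```

Both hypothetical models have the same mean, yet the second one has a severe privacy weakness that the average hides. Reporting the weakest dimension alongside the aggregate is one simple way to keep such gaps visible.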