
TrustLLM

Sun, Huang et al. · ICML 2024

TrustLLM is a set of trustworthiness principles plus a packaged benchmark for LLMs. It scores 16 mainstream models (proprietary and open-weight) across 30+ datasets, and ships a Python toolkit and leaderboard so the same evaluation runs against any model you point it at.

What it measures

The principles span eight dimensions: truthfulness, safety, fairness, robustness, privacy, machine ethics, transparency, and accountability. The empirical benchmark operationalizes six of them; transparency and accountability resist systematic scoring and are treated as qualitative guidance rather than graded tasks.

How red teamers use it

Use TrustLLM for broad trustworthiness profiling, not jailbreak hunting. It produces a per-dimension scorecard that surfaces where a model is weak — privacy leakage, fairness gaps, robustness under perturbation, over-refusal — so you can prioritize deeper adversarial work. The toolkit lets you run the full suite on an internal model and diff it against the public leaderboard as a regression baseline.
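The regression-baseline idea can be sketched in a few lines. This is a minimal illustration, not the TrustLLM toolkit's API: `find_regressions`, the score dictionaries, and the `tolerance` value are all hypothetical, and it assumes every dimension has already been reduced to a single score where higher is better (metrics where lower is better would need to be inverted first).

```python
from typing import Dict, List, Tuple


def find_regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.02,
) -> List[Tuple[str, float]]:
    """Return (dimension, delta) pairs where the candidate falls below the
    baseline by more than `tolerance`, worst regression first.

    Assumption: scores are normalized so that higher is better.
    """
    regressions = []
    for dimension, base_score in baseline.items():
        if dimension not in candidate:
            # Dimension was not evaluated for the candidate; skip it.
            continue
        delta = candidate[dimension] - base_score
        if delta < -tolerance:
            regressions.append((dimension, delta))
    # Most negative delta first, so the largest drop heads the list.
    return sorted(regressions, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Made-up numbers for illustration only, not real leaderboard results.
    baseline = {"privacy": 0.80, "fairness": 0.70, "safety": 0.90}
    candidate = {"privacy": 0.70, "fairness": 0.71, "safety": 0.89}
    for dimension, delta in find_regressions(baseline, candidate):
        print(f"{dimension}: {delta:+.2f}")
```

The tolerance keeps small run-to-run noise from being flagged; the right value depends on how stable each dimension's metric is across repeated runs.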

Strengths & limits

Its strength is breadth and reproducibility: one principled framework, many dimensions, a runnable toolkit, and a comparison leaderboard. Its limits are the usual static-benchmark caveats: fixed datasets leak into training corpora and saturate, only six of the eight dimensions are actually graded, and aggregate scores can mask specific failure modes. Treat it as a profiling baseline, not proof of safety; pair it with live, adaptive red teaming.
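A small worked example shows how an aggregate can hide a failure mode. The scores below are invented for illustration and `summarize` is a hypothetical helper, not part of the toolkit; the point is only that two profiles with the same mean can differ sharply in their weakest dimension.

```python
from statistics import mean
from typing import Dict


def summarize(scores: Dict[str, float]) -> Dict[str, object]:
    """Report the mean alongside the weakest dimension and its score."""
    worst_dimension = min(scores, key=scores.get)
    return {
        "mean": mean(scores.values()),
        "worst_dimension": worst_dimension,
        "worst_score": scores[worst_dimension],
    }


# Invented scores over the six graded dimensions (higher is better).
model_a = {
    "truthfulness": 0.75, "safety": 0.75, "fairness": 0.75,
    "robustness": 0.75, "privacy": 0.75, "machine_ethics": 0.75,
}
model_b = {
    "truthfulness": 0.90, "safety": 0.90, "fairness": 0.90,
    "robustness": 0.90, "privacy": 0.15, "machine_ethics": 0.75,
}

if __name__ == "__main__":
    for name, scores in (("model_a", model_a), ("model_b", model_b)):
        print(name, summarize(scores))
```

Both profiles average to roughly the same value, yet the second has a severe privacy weakness, which is why reporting the minimum next to the mean is a reasonable habit when reading a scorecard.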