the toolchaindefense

LLM Guard

ProtectAI · open source

An open-source Python security toolkit that sits in front of and behind an LLM, sanitizing inputs and outputs. It bundles a library of swappable scanners for prompt injection, PII, toxicity, secret leakage, and data exfiltration so you don't roll your own guardrails. Apache-2.0, pip-installable, designed to drop into a production request path.

What it's good at

A broad catalog of composable scanners you chain on each side of the model — 15 input scanners and 20 output scanners:

Input: PromptInjection, Anonymize (PII redaction), Secrets, Toxicity, BanTopics, BanSubstrings, BanCode, BanCompetitors, InvisibleText, Gibberish, Language, Code, Regex, Sentiment, TokenLimit.
Output: Deanonymize, Sensitive, MaliciousURLs, NoRefusal, FactualConsistency, Relevance, Bias, plus output-side mirrors of the topic/toxicity/regex scanners.

Each scanner returns a sanitized string, a pass/fail, and a risk score — easy to wrap a model and tune thresholds per scanner.

Where it falls short

The scanners are mostly classifiers and pattern matchers, so they inherit classifier weaknesses. As an attacker you target exactly those gaps:

Classifier evasion: the PromptInjection / Toxicity models are beaten by paraphrase, encoding (base64/rot13), translation, and adaptive injections crafted against the very detector ProtectAI ships.
Latency & cost: every enabled scanner adds a model or regex pass; stacking many doubles round-trip time and tempts teams to disable the heavy ones.
False confidence: a green "scanned" result reads as "safe." Coverage gaps (novel jailbreaks, multi-turn attacks, tool-call payloads) sail through while the dashboard stays green.

How to start (as an attacker, learn what you must defeat)

pip install llm-guard (Python 3.9+).
Wrap a model: run prompts through scan_prompt(input_scanners, prompt) and responses through scan_output(output_scanners, prompt, response) — this is the exact defense you'll face in the wild.
Red-team the scanners: enable PromptInjection and Toxicity, then measure the false-negative rate against your jailbreak corpus. Probe Anonymize/Secrets with obfuscated PII and split credentials to find what leaks through.