the toolchaindefense
LLM Guard
An open-source Python security toolkit that sits in front of and behind an LLM, sanitizing inputs and outputs. It bundles a library of swappable scanners for prompt injection, PII, toxicity, secret leakage, and data exfiltration so you don't roll your own guardrails. Apache-2.0, pip-installable, designed to drop into a production request path.
What it's good at
A broad catalog of composable scanners you chain on each side of the model — 15 input scanners and 20 output scanners:
- Input:
PromptInjection,Anonymize(PII redaction),Secrets,Toxicity,BanTopics,BanSubstrings,BanCode,BanCompetitors,InvisibleText,Gibberish,Language,Code,Regex,Sentiment,TokenLimit. - Output:
Deanonymize,Sensitive,MaliciousURLs,NoRefusal,FactualConsistency,Relevance,Bias, plus output-side mirrors of the topic/toxicity/regex scanners.
Each scanner returns a sanitized string, a pass/fail, and a risk score — easy to wrap a model and tune thresholds per scanner.
Where it falls short
The scanners are mostly classifiers and pattern matchers, so they inherit classifier weaknesses. As an attacker you target exactly those gaps:
- Classifier evasion: the
PromptInjection/Toxicitymodels are beaten by paraphrase, encoding (base64/rot13), translation, and adaptive injections crafted against the very detector ProtectAI ships. - Latency & cost: every enabled scanner adds a model or regex pass; stacking many doubles round-trip time and tempts teams to disable the heavy ones.
- False confidence: a green "scanned" result reads as "safe." Coverage gaps (novel jailbreaks, multi-turn attacks, tool-call payloads) sail through while the dashboard stays green.
How to start (as an attacker, learn what you must defeat)
pip install llm-guard(Python 3.9+).- Wrap a model: run prompts through
scan_prompt(input_scanners, prompt)and responses throughscan_output(output_scanners, prompt, response)— this is the exact defense you'll face in the wild. - Red-team the scanners: enable
PromptInjectionandToxicity, then measure the false-negative rate against your jailbreak corpus. ProbeAnonymize/Secretswith obfuscated PII and split credentials to find what leaks through.