The toolchain: defense
Guardrails AI
An open-source Python framework for wrapping LLM calls with composable validators ("guards") that detect, quantify, and mitigate specific risks on both the input and output side. Validators are sourced from the Guardrails Hub, a community registry of pre-built checks, and can run in-process or behind a standalone Flask REST server.
What it's good at
- Composable I/O guards. Chain validators on prompts and completions — PII detection, toxicity, regex/format constraints, topic restriction, competitor mentions, jailbreak heuristics.
- Guardrails Hub. `guardrails hub install hub://guardrails/<validator>` pulls ready-made checks instead of hand-rolling each one.
- Structured output. Forces responses into Pydantic schemas, enforcing type and shape on top of safety checks.
- Reask / fix on failure. `on_fail` actions (EXCEPTION, REASK, FIX, FILTER) let you re-prompt or repair rather than just block.
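To make the four failure actions concrete, here is a toy stand-in in plain Python. It does not use the Guardrails API: `passes`, `redact_digits`, `apply_policy`, the policy strings, and the "no long digit runs" rule are all hypothetical names and choices made for illustration, and a real validator decides for itself what a fix or a filter means.

```python
import re
from typing import Callable, Optional

# Hypothetical rule: text must not contain a run of 9 or more digits.
LONG_DIGITS = re.compile(r"\d{9,}")


def passes(text: str) -> bool:
    """True when the toy check finds nothing to object to."""
    return LONG_DIGITS.search(text) is None


def redact_digits(text: str) -> str:
    """Toy 'fix': replace each offending digit run with a placeholder."""
    return LONG_DIGITS.sub("<REDACTED>", text)


def apply_policy(
    text: str,
    on_fail: str,
    regenerate: Optional[Callable[[str], str]] = None,
    max_reasks: int = 1,
) -> Optional[str]:
    """Return what the caller receives under the chosen failure policy."""
    if passes(text):
        return text
    if on_fail == "exception":
        # Block outright.
        raise ValueError("validation failed")
    if on_fail == "fix":
        # Repair the output in code, no extra model call.
        return redact_digits(text)
    if on_fail == "filter":
        # Drop the offending value entirely.
        return None
    if on_fail == "reask":
        if regenerate is None:
            raise ValueError("reask needs a regenerate callable")
        for _ in range(max_reasks):
            text = regenerate(text)  # each retry stands in for another model call
            if passes(text):
                return text
        raise ValueError("validation still failing after reask")
    raise ValueError(f"unknown policy: {on_fail}")
```

In this sketch only the reask branch calls the model again; the other three branches resolve the failure locally.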
Where it falls short
- Validators are bypassable. Classifier- and regex-based guards are defeatable by obfuscation, encoding tricks, multilingual phrasing, and adversarial rewording — they raise cost-to-attack, not a hard boundary.
- Coverage is only what you wire up. Risks with no installed validator pass through unchecked; the Hub is uneven across categories.
- Latency and dependency weight. Each guard adds a pass (and possibly a remote ML model or extra LLM call); reask multiplies token spend.
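The cost point can be put in rough numbers. The sketch below is back-of-the-envelope arithmetic, not a measurement: it assumes each reask resends the original prompt plus the failed completion plus some error feedback, and that guards run one after another. The function names and every figure in the example are hypothetical.

```python
from typing import Sequence


def worst_case_tokens(
    prompt_tokens: int,
    completion_tokens: int,
    feedback_tokens: int,
    max_reasks: int,
) -> int:
    """Upper bound on tokens when every attempt fails until the last reask.

    Assumption: a reask prompt carries the original prompt, the failed
    completion, and validator feedback, then yields a new completion.
    """
    first_call = prompt_tokens + completion_tokens
    per_reask = (prompt_tokens + completion_tokens + feedback_tokens) + completion_tokens
    return first_call + max_reasks * per_reask


def sequential_latency_ms(guard_latencies_ms: Sequence[float]) -> float:
    """Added latency when guards run one after another (assumed, not measured)."""
    return sum(guard_latencies_ms)


# Hypothetical figures: 500-token prompt, 200-token completion,
# 50 tokens of feedback, up to 2 reasks.
baseline = 500 + 200                              # 700 tokens with no reask
worst = worst_case_tokens(500, 200, 50, 2)        # 700 + 2 * 950 = 2600 tokens
overhead = sequential_latency_ms([40, 120, 15])   # 175 ms across three guards
```

Under these assumed numbers, two failed reasks cost a little under four times the tokens of a single unguarded call.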
How to start (as an attacker, learn what you must defeat)
- `pip install guardrails-ai`, then `guardrails configure`.
- Install a validator: `guardrails hub install hub://guardrails/detect_pii` (or `toxic_language`, `regex_match`).
- Wrap a model with `Guard().use(Validator, on_fail=OnFailAction.EXCEPTION)` and call `guard.validate(...)`.
- Probe for bypasses: feed the guard encoded, paraphrased, and multilingual payloads, watch which slip past, and note where a single validator gives a false sense of coverage.
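The probing step can be sketched as a small harness. This is a minimal illustration that uses only the standard library: `variants`, `probe`, and `naive_guard` are hypothetical names, and `naive_guard` is a keyword regex standing in for a real validator so the example runs without Guardrails installed. The transformations cover only simple encoding and spacing tricks; paraphrased and multilingual payloads would have to be supplied by hand.

```python
import base64
import re
from typing import Callable, Dict, List


def variants(payload: str) -> Dict[str, str]:
    """Produce a few simple obfuscated forms of one payload."""
    leet_table = str.maketrans("aeios", "43105")
    return {
        "plain": payload,
        "base64": base64.b64encode(payload.encode("utf-8")).decode("ascii"),
        "spaced": " ".join(payload),           # a space between every character
        "leetspeak": payload.translate(leet_table),
        "zero_width": "\u200b".join(payload),  # zero-width space between characters
    }


def probe(blocks: Callable[[str], bool], payload: str) -> List[str]:
    """Return the names of the variants the guard failed to block."""
    return [name for name, text in variants(payload).items() if not blocks(text)]


# Stand-in guard (hypothetical): blocks text containing a known phrase.
BLOCKLIST = re.compile(r"ignore previous instructions", re.IGNORECASE)


def naive_guard(text: str) -> bool:
    """True when the text is blocked."""
    return BLOCKLIST.search(text) is not None


slipped = probe(naive_guard, "Ignore previous instructions")
# slipped == ["base64", "spaced", "leetspeak", "zero_width"]
```

To point the harness at a real guard, `blocks` could be a small wrapper around `guard.validate(...)` that returns True when the call raises or reports a failed validation; how failure is reported depends on the `on_fail` action and should be checked against the Guardrails documentation.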