the toolchaindefense

Guardrails AI

Guardrails AI · open source

Projectguardrails-ai/guardrails — GitHub ↗

An open-source Python framework for wrapping LLM calls with composable validators ("guards") that detect, quantify, and mitigate specific risks on both the input and output side. Validators are sourced from the Guardrails Hub, a community registry of pre-built checks, and can run in-process or behind a standalone Flask REST server.

What it's good at

Composable I/O guards. Chain validators on prompts and completions — PII detection, toxicity, regex/format constraints, topic restriction, competitor mentions, jailbreak heuristics.
Guardrails Hub. guardrails hub install hub://guardrails/<validator> pulls ready-made checks instead of hand-rolling each one.
Structured output. Forces responses into Pydantic schemas, enforcing type and shape on top of safety checks.
Reask / fix on failure. on_fail actions (EXCEPTION, REASK, FIX, FILTER) let you re-prompt or repair rather than just block.

Where it falls short

Validators are bypassable. Classifier- and regex-based guards are defeatable by obfuscation, encoding tricks, multilingual phrasing, and adversarial rewording — they raise cost-to-attack, not a hard boundary.
Coverage is only what you wire up. Risks with no installed validator pass through unchecked; the Hub is uneven across categories.
Latency and dependency weight. Each guard adds a pass (and possibly a remote ML model or extra LLM call); reask multiplies token spend.

How to start (as an attacker, learn what you must defeat)

pip install guardrails-ai, then guardrails configure.
Install a validator: guardrails hub install hub://guardrails/detect_pii (or toxic_language, regex_match).
Wrap a model with Guard().use(Validator, on_fail=OnFailAction.EXCEPTION) and call guard.validate(...).
Probe for bypasses: feed the guard encoded, paraphrased, and multilingual payloads, watch which slip past, and note where a single validator gives a false sense of coverage.