NeMo Guardrails
NVIDIA's open-source toolkit for wrapping an LLM app in programmable guardrails — runtime checks that filter, redirect, or block model behavior. You declare the rails in Colang, a Python-like modeling language for dialogue flows, and the runtime enforces them around every turn. The defensive layer a red-teamer is trying to break.
What it's good at
Five rail types give you defense at distinct chokepoints:
- Input rails — filter/transform incoming user messages (jailbreak and prompt-injection detection before the model ever sees the text).
- Dialog rails — keep the conversation on approved topics and shape the LLM prompt; the deterministic flow control that makes off-script answers hard to elicit.
- Output rails — moderate the model's response (toxicity, sensitive data, hallucination self-checks) before it reaches the user.
- Retrieval rails — validate RAG chunks; execution rails — gate inputs/outputs of custom actions/tools.
It composes — you can chain its own checks with external moderation models (Llama Guard, AlignScore, third-party content APIs) instead of relying on a single classifier.
Where it falls short
Colang is a real learning curve, and Colang 1.0 vs 2.0 split adds friction. Every rail is an extra LLM/model call, so latency and token cost stack up fast. Most important for a red-teamer: rails are bypassable. They are pattern- and model-based filters, not a hard sandbox — a sufficiently novel jailbreak, encoding trick, or multi-turn pressure can slip past a given check. NVIDIA itself notes the built-in guardrails "may or may not be suitable for a given production use case." Treat it as defense in depth, not a wall.
How to start (as an attacker, learn what you must defeat)
pip install nemoguardrails, then stand up a minimal rails config (aconfig.ymlplus Colang flows) so you have a live target.- Read which rails are wired in — input vs. dialog vs. output — since each is bypassed differently. An output-only setup leaves the model itself fully reachable.
- Probe for gaps: encoding/obfuscation to slide past input filters, topic-drift and role-play to defeat dialog rails, multi-turn priming, and payloads that pass input checks but produce blocked output (or vice versa).
- Map which checks call which models — a weak or self-check-only moderation step is the seam to push on.