the toolchaindefense

NeMo Guardrails

NVIDIA · open source

ProjectNVIDIA/NeMo-Guardrails — GitHub ↗

NVIDIA's open-source toolkit for wrapping an LLM app in programmable guardrails — runtime checks that filter, redirect, or block model behavior. You declare the rails in Colang, a Python-like modeling language for dialogue flows, and the runtime enforces them around every turn. The defensive layer a red-teamer is trying to break.

What it's good at

Five rail types give you defense at distinct chokepoints:

Input rails — filter/transform incoming user messages (jailbreak and prompt-injection detection before the model ever sees the text).
Dialog rails — keep the conversation on approved topics and shape the LLM prompt; the deterministic flow control that makes off-script answers hard to elicit.
Output rails — moderate the model's response (toxicity, sensitive data, hallucination self-checks) before it reaches the user.
Retrieval rails — validate RAG chunks; execution rails — gate inputs/outputs of custom actions/tools.

It composes — you can chain its own checks with external moderation models (Llama Guard, AlignScore, third-party content APIs) instead of relying on a single classifier.

Where it falls short

Colang is a real learning curve, and Colang 1.0 vs 2.0 split adds friction. Every rail is an extra LLM/model call, so latency and token cost stack up fast. Most important for a red-teamer: rails are bypassable. They are pattern- and model-based filters, not a hard sandbox — a sufficiently novel jailbreak, encoding trick, or multi-turn pressure can slip past a given check. NVIDIA itself notes the built-in guardrails "may or may not be suitable for a given production use case." Treat it as defense in depth, not a wall.

How to start (as an attacker, learn what you must defeat)

pip install nemoguardrails, then stand up a minimal rails config (a config.yml plus Colang flows) so you have a live target.
Read which rails are wired in — input vs. dialog vs. output — since each is bypassed differently. An output-only setup leaves the model itself fully reachable.
Probe for gaps: encoding/obfuscation to slide past input filters, topic-drift and role-play to defeat dialog rails, multi-turn priming, and payloads that pass input checks but produce blocked output (or vice versa).
Map which checks call which models — a weak or self-check-only moderation step is the seam to push on.