Misinformation
An LLM produces false or misleading output that looks credible, and a user or downstream system trusts it without verification — turning a plausible fabrication into a real-world security or safety failure.
How it's exploited
Two failure modes compound each other. Hallucination is the model filling gaps in its training data with statistically plausible but fabricated content; it predicts likely text rather than checking facts. Overreliance is a human or an automated pipeline treating that output as ground truth. An adversary doesn't need to break the model; they exploit the predictable ways in which it is confidently wrong:
- Induce a confident falsehood the operator will act on (fake citations, invented APIs, bogus "facts").
- Pre-position to catch a known hallucination — e.g. registering a package name the model reliably invents, so trusting developers pull attacker-controlled code.
Crucially, the medical/legal-advice variant needs no active attacker at all — insufficient oversight alone is enough to cause harm and liability.
What it looks like
Package hallucination → malicious-package squatting (a close cousin of dependency confusion): a coding assistant repeatedly recommends pip install faiss-cpu-gpu, or some other library that doesn't exist. An attacker observes the hallucinated name, publishes a malicious package under it to PyPI/npm, and waits. Developers who paste the model's suggestion install the backdoor, giving the attacker code execution, data exfiltration, or a persistent foothold.
Other real cases: ChatGPT inventing nonexistent legal cases that attorneys then cited in a federal court filing (Mata v. Avianca, 2023), and Air Canada's chatbot describing a refund policy that didn't exist (a tribunal held the airline liable for the chatbot's statement in Moffatt v. Air Canada, 2024).
How to test for it
Probe the boundary between "knows" and "confabulates," then check whether anything downstream blindly trusts the answer:
- Bait for fabrication: ask for citations, CVE numbers, library/function names, statutes, or API signatures — then verify each against the real source. Track hallucination rate, not just one-off misses.
- Hunt phantom packages: run code-gen prompts many times at a nonzero sampling temperature and log every imported dependency; cross-check each name against the registry for packages that don't exist (or are squattable), and note which invented names recur across runs.
- Pressure expertise claims in high-stakes domains (medical, legal, financial) and watch for unsupported assertions delivered with false confidence.
- Test the human/automation layer: does the system surface uncertainty, or render fabrications as authoritative answers that get auto-executed?
Defenses
- Ground the output: RAG over trusted, verified sources with inline citations, so claims trace back to real documents.
- Verify before trust: automatic validation of high-stakes outputs, plus human cross-checking for sensitive decisions — never fully autonomous action on unverified model claims.
- Harden the code path: secure-coding review and dependency allow-listing/pinning so suggested libraries are confirmed to exist and be safe before install.
- Communicate limits: label AI-generated content, surface confidence/uncertainty, and train users on where the model is unreliable — design the UX to discourage overreliance.
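As one way to make the "harden the code path" defense concrete, the sketch below gates model-suggested requirement lines against a reviewed allow-list before anything is installed. It is a simplified illustration, not a replacement for a real resolver or lockfile: the ALLOWED mapping and check_requirements are hypothetical names, the pinned versions are placeholders, and only plain name==version lines are accepted (extras, markers, and URLs are rejected as unpinned).

```python
import re

# Hypothetical reviewed allow-list: package name -> the one approved version.
ALLOWED = {"requests": "2.32.3", "numpy": "1.26.4"}

# Accept only simple "name==version" lines; everything else counts as unpinned.
_PIN = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)\s*==\s*([A-Za-z0-9.*+!_-]+)\s*$")


def normalize(name: str) -> str:
    """Normalize a project name as PEP 503 does: lowercase, runs of -_. become '-'."""
    return re.sub(r"[-_.]+", "-", name).lower()


def check_requirements(lines, allowed=ALLOWED):
    """Return (line, reason) pairs for rejected lines; an empty list means all passed."""
    allowed_norm = {normalize(k): v for k, v in allowed.items()}
    rejected = []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        match = _PIN.match(line)
        if not match:
            rejected.append((line, "not pinned with =="))
            continue
        name, version = normalize(match.group(1)), match.group(2)
        if name not in allowed_norm:
            rejected.append((line, "not on allow-list"))
        elif allowed_norm[name] != version:
            rejected.append((line, "version differs from approved pin"))
    return rejected


if __name__ == "__main__":
    suggested = ["requests==2.32.3", "faiss-cpu-gpu==1.0", "numpy>=1.0"]
    for line, reason in check_requirements(suggested):
        print(f"BLOCKED {line}: {reason}")
```

The gate fails closed: a name the model invented is blocked simply because nobody reviewed it, whether or not an attacker has already registered it. In practice a check like this would sit in CI or a pre-install hook, with humans reviewing any request to extend the allow-list.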