RED TEAM // RISKS
← back to the map
the attack surfaceLLM04:2025

Data & Model Poisoning

OWASP Top 10 for LLM Applications · 2025

An attacker tampers with the data a model learns from — at pre-training, fine-tuning, or embedding time — to plant backdoors, inject bias, or otherwise corrupt the model so it behaves to the attacker's advantage.

How it's exploited

Poisoning targets the supply chain of data and weights, not the running prompt. Because most foundation training pulls from open, unverified corpora, the attacker doesn't need access to your infrastructure — only to a source you will eventually crawl or trust.

What it looks like

The signature failure is a dormant backdoor: the model performs normally on every benchmark, but a specific trigger token, phrase, or input pattern flips it into attacker-chosen behavior — emitting unsafe code, approving an authentication bypass, leaking data, or executing a hidden instruction. A bias variant has no discrete trigger; the model is simply, quietly, systematically wrong on a topic the attacker cares about.

How to test for it

Assume the trigger is unknown and probe for it. Run trigger / backdoor scans over candidate token and phrase patterns; A/B clean vs. fine-tuned weights for divergent behavior; stress topic clusters where bias would pay off and compare against ground truth. Audit the provenance of every checkpoint and dataset — unsigned weights from a public hub and undocumented fine-tune sources are red flags. Treat anomaly spikes during validation as poisoning candidates, not just noise.

Defenses