Data & Model Poisoning
An attacker tampers with the data a model learns from — at pre-training, fine-tuning, or embedding time — to plant backdoors, inject bias, or otherwise corrupt the model so it behaves to the attacker's advantage.
How it's exploited
Poisoning targets the supply chain of data and weights, not the running prompt. Because most foundation training pulls from open, unverified corpora, the attacker doesn't need access to your infrastructure — only to a source you will eventually crawl or trust.
- Split-view / frontrunning poisoning — control a URL the moment a known scrape snapshots it, so the dataset captures malicious content the live page no longer shows.
- Fine-tune / embedding injection — slip crafted examples into a fine-tuning set or RAG corpus to skew outputs or implant a trigger.
- Compromised model supply — publish a tampered checkpoint to a hub (e.g. PoisonGPT, a surgically edited LLM hosted on Hugging Face to spread targeted misinformation).
- Unsafe ingestion — weak source vetting or access controls let unverified or sensitive data flow straight into training.
What it looks like
The signature failure is a dormant backdoor: the model performs normally on every benchmark, but a specific trigger token, phrase, or input pattern flips it into attacker-chosen behavior — emitting unsafe code, approving an authentication bypass, leaking data, or executing a hidden instruction. A bias variant has no discrete trigger; the model is simply, quietly, systematically wrong on a topic the attacker cares about.
How to test for it
Assume the trigger is unknown and probe for it. Run trigger / backdoor scans over candidate token and phrase patterns; A/B clean vs. fine-tuned weights for divergent behavior; stress topic clusters where bias would pay off and compare against ground truth. Audit the provenance of every checkpoint and dataset — unsigned weights from a public hub and undocumented fine-tune sources are red flags. Treat anomaly spikes during validation as poisoning candidates, not just noise.
Defenses
- Track data lineage end to end with an ML-BOM (OWASP CycloneDX) and version data with tooling like DVC.
- Vet sources and vendors; verify checkpoint signatures and integrity before loading any third-party weights.
- Validate and sandbox ingested data; run anomaly detection to catch poisoned samples before they reach training.
- Red-team adversarially for backdoors, and constrain inference-time knowledge via grounded RAG rather than baking everything into weights.