Unbounded Consumption
When an LLM application lets users run excessive, uncontrolled inferences, the result is denial of service, runaway cloud bills ("denial of wallet"), service degradation, and even model theft — every prompt has a cost, and an attacker's job is to make you pay it.
How it's exploited
This risk spans three distinct attacker goals that all abuse the inference path:
- Resource exhaustion / DoS: oversized inputs, floods of variable-length inputs, queries that exceed the context window, and high-volume repeated requests deplete memory, CPU, and GPU, degrading or denying service for everyone.
- Denial of wallet (DoW): attackers weaponize the pay-per-token, pay-per-call economics of cloud LLMs. Nothing crashes; instead, a four- or five-figure invoice arrives. The system stays "up" while the budget burns down.
- Model extraction & distillation: mass-querying the API with crafted inputs and prompt injection lets an attacker harvest enough outputs to partially replicate the model, or to generate synthetic training data for fine-tuning a functional equivalent, stealing your IP one inference at a time. Side-channel probing can also leak architectural details or weights.
What it looks like
A SaaS feature exposes a chatbot backed by a metered frontier model with no per-user quota. An attacker scripts thousands of long, resource-intensive prompts (or systematic query batteries designed to map the model's outputs). Best case, latency spikes and legitimate users time out. Worst case, the monthly cloud bill jumps from hundreds to tens of thousands of dollars overnight — or a near-equivalent clone of the proprietary model surfaces, distilled entirely from your own API responses.
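To see how a bill can move from hundreds to tens of thousands of dollars, a rough back-of-the-envelope calculation helps. The sketch below is illustrative only: the function name, the request volume, the token counts, and both per-1K-token prices are hypothetical placeholders, not any provider's real rates.

```python
def estimate_attack_cost(requests: int,
                         input_tokens: int,
                         output_tokens: int,
                         usd_per_1k_input: float,
                         usd_per_1k_output: float) -> float:
    """Estimate total spend for a flood of identical metered requests."""
    per_request = (input_tokens / 1000) * usd_per_1k_input \
                + (output_tokens / 1000) * usd_per_1k_output
    return requests * per_request


# All numbers below are assumptions chosen for illustration.
cost = estimate_attack_cost(
    requests=20_000,          # "thousands" of scripted prompts
    input_tokens=100_000,     # long prompts that fill most of a large context window
    output_tokens=2_000,
    usd_per_1k_input=0.01,    # hypothetical price
    usd_per_1k_output=0.03,   # hypothetical price
)
print(f"${cost:,.2f}")  # $21,200.00
```

Under these assumed numbers each request costs about a dollar, so the total scales linearly with request count. That linearity is why a missing per-user quota is the critical gap in the scenario above.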
How to test for it
Probe the edges of the inference path. Send inputs near and past the context-window limit and confirm they're rejected, not silently processed. Fire bursts of rapid and oversized requests from one identity to see whether rate limits and per-user quotas actually bite. Run a small scripted query battery to gauge how cheaply outputs can be harvested. Crucially, measure the cost — confirm there's a hard spend cap and an alert before "expensive but allowed" tips into "unbounded." Watch for resource-intensive query patterns that trigger the model's most expensive code paths.
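The burst and oversize checks described above can be scripted with a very small harness. This is a minimal sketch, not a full test tool: `probe` and `make_fake_endpoint` are hypothetical names, the fake endpoint is an in-memory stand-in for the system under test, and the use of HTTP 429 and 413 as rejection codes is an assumption (a real service may signal rejection differently). In a real test, `send` would wrap an authenticated call to the target API.

```python
from collections import Counter
from typing import Callable


def probe(send: Callable[[str], int], burst: int, prompt: str) -> Counter:
    """Send `burst` identical requests from one identity and tally status codes."""
    return Counter(send(prompt) for _ in range(burst))


def make_fake_endpoint(max_requests: int, max_chars: int) -> Callable[[str], int]:
    """Hypothetical stand-in for the service being tested."""
    seen = 0

    def send(prompt: str) -> int:
        nonlocal seen
        if len(prompt) > max_chars:
            return 413  # rejected before reaching the model
        seen += 1
        return 200 if seen <= max_requests else 429  # quota exhausted

    return send


send = make_fake_endpoint(max_requests=5, max_chars=1000)

# Burst test: does the per-identity limit actually bite?
burst_result = probe(send, burst=20, prompt="hello")

# Oversize test: are inputs past the limit rejected rather than processed?
oversize_result = probe(send, burst=3, prompt="x" * 5000)

print(dict(burst_result))
print(dict(oversize_result))
```

A run in which every burst request returns 200, or in which oversized prompts are accepted, would indicate that the corresponding control is missing or misconfigured.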
Defenses
- Rate limiting & quotas: cap requests per user or API key within a time window; enforce per-tenant budgets.
- Input validation & size limits: reject inputs that exceed a reasonable length before they reach the model.
- Cost caps & throttling: set hard spend ceilings and timeouts, and throttle resource-intensive operations.
- Monitoring & anomaly detection: log comprehensively and alert on extraction patterns and consumption spikes so they are caught early.
- Access control & sandboxing: apply RBAC and least privilege; restrict the model's access to network resources, internal services, and APIs.
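The first three defenses can be combined into a single admission check that runs before any request reaches the model. The sketch below is one possible arrangement under stated assumptions: `InferenceGuard`, `LimitExceeded`, and all parameter names are hypothetical, the rate limit is a token bucket (one of several workable algorithms), the cost estimate is supplied by the caller, and state is kept in process memory without locking. A production deployment would more likely keep this state in a shared store.

```python
import time
from typing import Callable, Dict, Tuple


class LimitExceeded(Exception):
    """Raised when a request is refused before reaching the model."""


class InferenceGuard:
    def __init__(self, max_chars: int, burst: int, refill_per_sec: float,
                 budget_usd: float, clock: Callable[[], float] = time.monotonic):
        self.max_chars = max_chars
        self.burst = burst
        self.refill_per_sec = refill_per_sec
        self.budget_usd = budget_usd
        self.clock = clock
        self.buckets: Dict[str, Tuple[float, float]] = {}  # user -> (tokens, last_seen)
        self.spent: Dict[str, float] = {}                  # user -> USD spent so far

    def _take_token(self, user_id: str) -> bool:
        """Token bucket: refill by elapsed time, then try to spend one token."""
        now = self.clock()
        tokens, last = self.buckets.get(user_id, (float(self.burst), now))
        tokens = min(float(self.burst), tokens + (now - last) * self.refill_per_sec)
        allowed = tokens >= 1
        self.buckets[user_id] = (tokens - 1 if allowed else tokens, now)
        return allowed

    def admit(self, user_id: str, prompt: str, est_cost_usd: float) -> None:
        """Raise LimitExceeded unless the request passes every check."""
        if len(prompt) > self.max_chars:            # input size limit
            raise LimitExceeded("input too large")
        if not self._take_token(user_id):           # per-user rate limit
            raise LimitExceeded("rate limit")
        spent = self.spent.get(user_id, 0.0)
        if spent + est_cost_usd > self.budget_usd:  # hard spend ceiling
            raise LimitExceeded("budget exhausted")
        self.spent[user_id] = spent + est_cost_usd
```

A caller would invoke `guard.admit(user_id, prompt, est_cost_usd)` and forward the prompt to the model only if no exception is raised. The size check runs first because it is the cheapest, and spend is recorded only for admitted requests, so a refused request never counts against the budget.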