prompt injection is killing self-hosted LLM deployments (and nobody's talking about it)

Table of content

by Ray Svitla

the situation

enterprises did the “smart” thing. they moved to self-hosted LLMs to avoid sending customer data to OpenAI or Anthropic.

security team signed off. compliance team signed off. everyone felt good about “data sovereignty.”

then someone from QA tried injecting prompts during testing.

the entire system prompt got dumped in the response.

what’s actually broken

a thread on r/LocalLLaMA blew up this week — 200+ comments from enterprises discovering their self-hosted deployments are completely vulnerable.

the pattern:

company deploys llama/mistral/mixtral behind their firewall
wraps it with a nice API
builds internal tools on top
assumes “it’s on our servers” = “it’s secure”

the problem: their WAFs don’t understand LLM attacks.

traditional web security tools are looking for SQL injection, XSS, CSRF. they have no concept of prompt injection. the model just treats malicious prompts like normal user input and happily complies.

one comment that landed:

“we built walls to protect the perimeter. then we put an intern inside who does whatever anyone asks nicely.”

why this matters more than you think

self-hosted doesn’t mean secure. it means differently vulnerable.

with hosted APIs (OpenAI, Anthropic, etc.), you’re trusting their security team. they’ve spent years hardening against prompt injection. they have red teams. they iterate constantly.

with self-hosted, you’re trusting… yourself. and most companies have zero LLM security expertise.

the attack surface is massive:

system prompt extraction — attackers can often get the model to reveal its instructions
jailbreaking — bypassing safety guidelines to produce harmful outputs
data exfiltration — if your model has access to tools/APIs, injection can trigger unauthorized actions
context poisoning — injecting malicious content that affects future responses

and here’s the kicker: most self-hosted deployments don’t even have logging good enough to detect when they’ve been compromised.

the current state of defenses

honestly? bleak.

what doesn’t work:

traditional WAFs (completely blind to LLM attacks)
simple keyword filtering (trivially bypassed)
hoping your model is “aligned enough”

what sort of works:

input sanitization (strips some attacks, not all)
output filtering (catches some leaks, not all)
separate models for validation (expensive, slow)

what might actually work:

structured outputs only (no free-form text)
strict capability boundaries (model can’t access sensitive APIs)
treat ALL model outputs as untrusted (like user input)
red team regularly (most don’t)

the real answer is architectural. you can’t patch prompt injection. you have to design around it.

the opportunity

this is where web security was in 2005.

remember when XSS was just “that weird JavaScript thing”? before CSP, before browser sandboxing, before anyone took it seriously?

that’s where LLM security is now.

the guardrails haven’t been built yet. the tooling doesn’t exist. the best practices aren’t established.

which means:

massive content gap — write about this, you’ll rank
startup opportunity — whoever builds the LLM WAF wins
career opportunity — LLM security expertise is rare

the first company to nail “self-hosted AI security as a product” is going to clean up.

what to do right now

if you’re running self-hosted LLMs:

assume you’re vulnerable — because you are
audit your system prompts — can they be extracted? test it.
limit model capabilities — what can it actually do? minimize the blast radius.
log everything — you can’t detect breaches without visibility
red team yourself — or hire someone who will

and if you’re building in this space: the market is wide open.

Ray Svitla stay evolving 🐌