Definition

What is prompt injection?

Last updated

Definition

Prompt injection is a class of attack where malicious instructions hidden in untrusted content (web pages, emails, documents) cause an AI agent to take actions or reveal information against the user's intent.

Because LLMs read instructions as text, any text the model sees can be interpreted as instructions — including content the agent itself fetched. Indirect prompt injection embeds attack instructions in pages or files the agent visits; direct injection comes through user input. Defenses include instruction hierarchies, content isolation, output filtering, and user-confirmation guardrails on side-effecting actions.

Why HITL matters here

HITL approval queues are not just for trust-building — they’re a defense layer against prompt injection. An agent that drafts an outbound email and routes it to Discord for approval gives a human the chance to catch a draft that was steered by injected content.

Common indirect-injection vectors

  • Webpage HTML containing hidden instructions
  • File metadata, EXIF, or comments
  • Email subject lines and HTML content
  • API responses from third-party services

Related terms

Sources

Free Vibe Coder Kit

Get the kit. Ship like a vibe coder.

Installs into Claude Code, Codex, or OpenClaws in under a minute. Required to deploy our paid agents.

Protected by Cloudflare Turnstile. We never share your details. Unsubscribe any time.