Definition

What is prompt injection?

Last updated May 7, 2026

Definition

Prompt injection is a class of attack where malicious instructions hidden in untrusted content (web pages, emails, documents) cause an AI agent to take actions or reveal information against the user's intent.

Because LLMs read instructions as text, any text the model sees can be interpreted as instructions — including content the agent itself fetched. Indirect prompt injection embeds attack instructions in pages or files the agent visits; direct injection comes through user input. Defenses include instruction hierarchies, content isolation, output filtering, and user-confirmation guardrails on side-effecting actions.

Why HITL matters here

HITL approval queues are not just for trust-building — they’re a defense layer against prompt injection. An agent that drafts an outbound email and routes it to Discord for approval gives a human the chance to catch a draft that was steered by injected content.

Common indirect-injection vectors

Webpage HTML containing hidden instructions
File metadata, EXIF, or comments
Email subject lines and HTML content
API responses from third-party services

Related terms

Sources

OWASP — LLM Top 10