LLM01:2025 Prompt Injection

Description

Prompt Injection Risk

Risk Level: Critical
Attack Surface: User Input, External Content, RAG Systems
Impact Areas: Security, Compliance, Ethics
Detection Tools:
- Prompt Injection - Search
- Prompt Injection - Validation
Related Risks:
Key Regulations:
- EU AI Act - Automated Decision Making, Transparency
- GDPR - Data Protection, User Rights
- CCPA - Consumer Privacy
Last Update: 2025 02 22

Prompt Injection represents a critical vulnerability in Large Language Model (LLM) applications where user inputs can manipulate the model's behavior in unintended ways. The vulnerability exists in the fundamental way models process prompts, potentially allowing malicious inputs to bypass security measures and affect model outputs. What makes prompt injections particularly dangerous is that they don't need to be human-readable - as long as the content can be parsed by the model, it can potentially exploit vulnerabilities.

When successful, prompt injections can cause models to violate their established guidelines, generate harmful content, enable unauthorized access, or influence critical decisions. While techniques like Retrieval Augmented Generation (RAG) and fine-tuning aim to enhance model reliability, research indicates they do not completely eliminate prompt injection vulnerabilities.

Types of Prompt Injection Vulnerabilities

Prompt injection vulnerabilities manifest in two primary forms: direct and indirect injections. Direct prompt injections occur when user input directly influences model behavior beyond intended parameters. These can be either intentional, where malicious actors craft exploitative prompts, or unintentional, where users inadvertently trigger unexpected behaviors.

Indirect prompt injections present a more subtle threat. These occur when an LLM processes content from external sources such as websites or files. This content may contain elements that, when interpreted by the model, alter its behavior in unexpected ways. Like their direct counterparts, indirect injections can be both intentional and unintentional.

Impact

The severity of prompt injection attacks varies significantly based on the business context and the model's level of agency. In critical applications, successful attacks can lead to severe consequences including sensitive information disclosure, system infrastructure exposure, content manipulation resulting in incorrect or biased outputs, unauthorized access to LLM functions, arbitrary command execution in connected systems, and manipulation of critical decisions.

The emergence of multimodal AI systems introduces additional complexity to prompt injection risks. These systems must contend with potential exploits across different modalities, such as hidden instructions in images. Cross-modal attacks may prove more challenging to detect and mitigate, expanding the attack surface and necessitating robust multimodal-specific defense strategies.

Prevention and Mitigation Strategies

While the stochastic nature of LLMs makes complete prevention challenging, organizations can implement several effective mitigation strategies:

Model Behavior Constraints

Organizations should establish clear boundaries for model behavior by providing specific instructions about roles, capabilities, and limitations. This includes enforcing strict context adherence, limiting responses to specific tasks or topics, and instructing the model to ignore attempts to modify core instructions.

Output Format Definition and Validation

Implementing strict output format specifications helps maintain control over model responses. This involves defining clear output structures, requiring detailed reasoning and citations, and using deterministic code to validate adherence to these specifications.

Comprehensive Filtering Systems

A robust filtering system should encompass both inputs and outputs. This includes defining sensitive categories, establishing rules for identifying problematic content, applying semantic filters and string-checking, and utilizing the RAG Triad to assess context relevance, groundedness, and question/answer relevance.

Privilege Control Implementation

Strong privilege control mechanisms should provide application-specific API tokens, handle functions in code rather than delegating to the model, and restrict access to minimum necessary privileges. This creates a clear separation of concerns and reduces potential attack surfaces.

Human Oversight

For high-risk operations, implementing human-in-the-loop controls provides an additional layer of security. This ensures that critical decisions receive appropriate oversight and validation.

External Content Management

Organizations should implement clear segregation of external content, with explicit denotation of untrusted sources and limitations on their influence over user prompts.

Security Testing Protocol

Regular security assessments should include penetration testing, breach simulations, and comprehensive evaluation of trust boundaries and access controls.

Example Attack Scenarios

Direct Injection Attack

A sophisticated attacker targets a customer support chatbot by injecting prompts that override security guidelines. The compromised system begins querying private data stores and sending unauthorized emails, effectively escalating privileges and breaching security protocols.

Indirect Injection Through Web Content

An attacker exploits a webpage summarization feature by embedding hidden instructions within the target page. When processed by the LLM, these instructions trigger the insertion of an image linking to an external URL, enabling conversation exfiltration.

Unintentional Application Screening

A company embeds AI detection instructions within job descriptions to identify AI-generated applications. An applicant unknowingly triggers this detection when using an LLM to optimize their resume, demonstrating how even benign uses can interact with hidden prompts.

Repository Content Manipulation

An attacker modifies documents within a knowledge repository used by a RAG system. When users query this information, the altered content influences the LLM's responses, producing misleading or harmful results.

Email System Compromise

By exploiting vulnerability CVE-2024-5184 in an LLM-powered email assistant, an attacker injects malicious prompts that enable access to sensitive information and manipulation of email content.

Split Payload Attack

An attacker cleverly distributes malicious prompts across a resume document. During candidate evaluation, these fragmented prompts combine to manipulate the LLM's assessment, generating positive recommendations regardless of actual qualifications.

Multimodal System Exploitation

An attacker embeds malicious instructions within an image accompanying innocent text. When processed by a multimodal AI system, these hidden prompts alter behavior in ways that can lead to unauthorized actions or data exposure.

Adversarial String Manipulation

By appending carefully crafted character sequences to prompts, an attacker influences LLM outputs in malicious ways while bypassing standard safety measures.

Multilingual Obfuscation

Attackers leverage multiple languages or encoding techniques (such as Base64 or emoji) to disguise malicious instructions, successfully evading content filters while maintaining their ability to manipulate LLM behavior.

Reference Links

OWASP Top 10 for LLM Applications
EU AI Act Compliance Framework
GDPR AI Guidelines
AML.T0051.000 - LLM Prompt Injection: Direct (MITRE ATLAS)
AML.T0051.001 - LLM Prompt Injection: Indirect (MITRE ATLAS)
AML.T0054 - LLM Jailbreak Injection: Direct (MITRE ATLAS)

Description​

Types of Prompt Injection Vulnerabilities​

Impact​

Prevention and Mitigation Strategies​

Model Behavior Constraints​

Output Format Definition and Validation​

Comprehensive Filtering Systems​

Privilege Control Implementation​

Human Oversight​

External Content Management​

Security Testing Protocol​

Example Attack Scenarios​

Direct Injection Attack​

Indirect Injection Through Web Content​

Unintentional Application Screening​

Repository Content Manipulation​

Email System Compromise​

Split Payload Attack​

Multimodal System Exploitation​

Adversarial String Manipulation​

Multilingual Obfuscation​

Reference Links​

Related Frameworks and Standards​