LLM01:2025 Prompt Injection
Description
- Risk Level: Critical
- Attack Surface: User Input, External Content, RAG Systems
- Impact Areas: Security, Compliance, Ethics
- Detection Tools:
- Related Risks:
- Key Regulations:
- Last Update: 2025 02 22
Prompt Injection represents a critical vulnerability in Large Language Model (LLM) applications where user inputs can manipulate the model's behavior in unintended ways. The vulnerability exists in the fundamental way models process prompts, potentially allowing malicious inputs to bypass security measures and affect model outputs. What makes prompt injections particularly dangerous is that they don't need to be human-readable - as long as the content can be parsed by the model, it can potentially exploit vulnerabilities.
When successful, prompt injections can cause models to violate their established guidelines, generate harmful content, enable unauthorized access, or influence critical decisions. While techniques like Retrieval Augmented Generation (RAG) and fine-tuning aim to enhance model reliability, research indicates they do not completely eliminate prompt injection vulnerabilities.
Types of Prompt Injection Vulnerabilities
Prompt injection vulnerabilities manifest in two primary forms: direct and indirect injections. Direct prompt injections occur when user input directly influences model behavior beyond intended parameters. These can be either intentional, where malicious actors craft exploitative prompts, or unintentional, where users inadvertently trigger unexpected behaviors.
Indirect prompt injections present a more subtle threat. These occur when an LLM processes content from external sources such as websites or files. This content may contain elements that, when interpreted by the model, alter its behavior in unexpected ways. Like their direct counterparts, indirect injections can be both intentional and unintentional.
Impact
The severity of prompt injection attacks varies significantly based on the business context and the model's level of agency. In critical applications, successful attacks can lead to severe consequences including sensitive information disclosure, system infrastructure exposure, content manipulation resulting in incorrect or biased outputs, unauthorized access to LLM functions, arbitrary command execution in connected systems, and manipulation of critical decisions.
The emergence of multimodal AI systems introduces additional complexity to prompt injection risks. These systems must contend with potential exploits across different modalities, such as hidden instructions in images. Cross-modal attacks may prove more challenging to detect and mitigate, expanding the attack surface and necessitating robust multimodal-specific defense strategies.
Prevention and Mitigation Strategies
While the stochastic nature of LLMs makes complete prevention challenging, organizations can implement several effective mitigation strategies:
Model Behavior Constraints
Organizations should establish clear boundaries for model behavior by providing specific instructions about roles, capabilities, and limitations. This includes enforcing strict context adherence, limiting responses to specific tasks or topics, and instructing the model to ignore attempts to modify core instructions.
Output Format Definition and Validation
Implementing strict output format specifications helps maintain control over model responses. This involves defining clear output structures, requiring detailed reasoning and citations, and using deterministic code to validate adherence to these specifications.
Comprehensive Filtering Systems
A robust filtering system should encompass both inputs and outputs. This includes defining sensitive categories, establishing rules for identifying problematic content, applying semantic filters and string-checking, and utilizing the RAG Triad to assess context relevance, groundedness, and question/answer relevance.
Privilege Control Implementation
Strong privilege control mechanisms should provide application-specific API tokens, handle functions in code rather than delegating to the model, and restrict access to minimum necessary privileges. This creates a clear separation of concerns and reduces potential attack surfaces.
Human Oversight
For high-risk operations, implementing human-in-the-loop controls provides an additional layer of security. This ensures that critical decisions receive appropriate oversight and validation.
External Content Management
Organizations should implement clear segregation of external content, with explicit denotation of untrusted sources and limitations on their influence over user prompts.
Security Testing Protocol
Regular security assessments should include penetration testing, breach simulations, and comprehensive evaluation of trust boundaries and access controls.
Example Attack Scenarios
Direct Injection Attack
A sophisticated attacker targets a customer support chatbot by injecting prompts that override security guidelines. The compromised system begins querying private data stores and sending unauthorized emails, effectively escalating privileges and breaching security protocols.
Indirect Injection Through Web Content
An attacker exploits a webpage summarization feature by embedding hidden instructions within the target page. When processed by the LLM, these instructions trigger the insertion of an image linking to an external URL, enabling conversation exfiltration.
Unintentional Application Screening
A company embeds AI detection instructions within job descriptions to identify AI-generated applications. An applicant unknowingly triggers this detection when using an LLM to optimize their resume, demonstrating how even benign uses can interact with hidden prompts.
Repository Content Manipulation
An attacker modifies documents within a knowledge repository used by a RAG system. When users query this information, the altered content influences the LLM's responses, producing misleading or harmful results.
Email System Compromise
By exploiting vulnerability CVE-2024-5184 in an LLM-powered email assistant, an attacker injects malicious prompts that enable access to sensitive information and manipulation of email content.
Split Payload Attack
An attacker cleverly distributes malicious prompts across a resume document. During candidate evaluation, these fragmented prompts combine to manipulate the LLM's assessment, generating positive recommendations regardless of actual qualifications.
Multimodal System Exploitation
An attacker embeds malicious instructions within an image accompanying innocent text. When processed by a multimodal AI system, these hidden prompts alter behavior in ways that can lead to unauthorized actions or data exposure.
Adversarial String Manipulation
By appending carefully crafted character sequences to prompts, an attacker influences LLM outputs in malicious ways while bypassing standard safety measures.
Multilingual Obfuscation
Attackers leverage multiple languages or encoding techniques (such as Base64 or emoji) to disguise malicious instructions, successfully evading content filters while maintaining their ability to manipulate LLM behavior.
Reference Links
- ChatGPT Plugin Vulnerabilities
- ChatGPT Cross Plugin Request Forgery and Prompt Injection
- Indirect Prompt Injection in Real-World LLM Applications
- Defending ChatGPT against Jailbreak Attacks
- Prompt Injection Attacks against LLM Applications
- Inject My PDF: Prompt Injection for your Resume
- Threat Modeling LLM Applications
- Reducing Prompt Injection Impact Through Design
Related Frameworks and Standards
OWASP Top 10 for LLM Applications
EU AI Act Compliance Framework
GDPR AI Guidelines
AML.T0051.000 - LLM Prompt Injection: Direct (MITRE ATLAS)
AML.T0051.001 - LLM Prompt Injection: Indirect (MITRE ATLAS)
AML.T0054 - LLM Jailbreak Injection: Direct (MITRE ATLAS)