Skip to main content

LLM04:2025 Data and Model Poisoning

Description

Data and Model Poisoning Risk

Data poisoning represents a critical vulnerability in LLM systems where pre-training, fine-tuning, or embedding data is manipulated to introduce vulnerabilities, backdoors, or biases. This manipulation can have far-reaching consequences, compromising model security, performance, and ethical behavior. The impact manifests in various forms, including harmful outputs, impaired capabilities, degraded model performance, biased or toxic content, and potential exploitation of downstream systems.

The attack surface spans different stages of the LLM lifecycle: pre-training (learning from general data), fine-tuning (adapting models to specific tasks), and embedding (converting text into numerical vectors). Understanding these stages is crucial for identifying potential vulnerability points and implementing appropriate safeguards.

Data poisoning is fundamentally an integrity attack, as tampering with training data directly impacts the model's ability to make accurate predictions. The risks are particularly pronounced when dealing with external data sources, which may contain unverified or malicious content.

Models distributed through shared repositories or open-source platforms face additional risks beyond data poisoning. These include the potential for malware embedded through techniques like malicious pickling, which can execute harmful code when the model is loaded. Furthermore, sophisticated poisoning attacks may implement backdoors that leave the model's behavior unchanged until triggered, effectively creating "sleeper agent" models that activate under specific conditions.

Common Examples of Vulnerability

1. Training Data Manipulation

  • Injection of harmful content during training
  • Introduction of biases
  • Inclusion of malicious examples

2. Fine-tuning Attacks

  • Targeted manipulation of model behavior
  • Introduction of backdoors
  • Removal of safety features

3. Embedding Poisoning

  • Manipulation of vector representations
  • Compromised semantic relationships
  • Biased similarity measures

4. Model Distribution Attacks

  • Malicious code injection
  • Compromised model files
  • Tampered weights or architectures

5. Backdoor Implementations

  • Hidden trigger mechanisms
  • Delayed activation of malicious behavior
  • Stealth modifications

Prevention and Mitigation Strategies

1. Data Validation

  • Implement robust data validation pipelines
  • Verify data sources and integrity
  • Monitor for anomalies in training data
  • Regular data quality assessments

2. Model Verification

  • Regular model behavior testing
  • Performance monitoring
  • Security audits
  • Adversarial testing

3. Training Process Security

  • Secure training environments
  • Access controls
  • Audit trails
  • Version control

4. Distribution Controls

  • Secure model distribution channels
  • Integrity checks
  • Digital signatures
  • Version verification

5. Monitoring and Detection

  • Continuous monitoring
  • Anomaly detection
  • Performance metrics
  • Security logging

6. Response Planning

  • Incident response procedures
  • Rollback capabilities
  • Recovery plans
  • Stakeholder communication

Example Attack Scenarios

Scenario #1: Training Data Poisoning

An attacker introduces malicious examples into the training dataset, causing the model to learn harmful associations or biases.

Scenario #2: Fine-tuning Attack

A malicious actor fine-tunes a model to include a backdoor that activates when specific triggers are present in the input.

Scenario #3: Embedding Manipulation

An attacker modifies embedding vectors to create biased relationships between concepts, affecting downstream tasks.

Scenario #4: Distribution Compromise

Malicious code is injected into a model file during distribution, executing when the model is loaded.

Scenario #5: Delayed Activation

A poisoned model appears to function normally but contains hidden triggers that activate malicious behavior under specific conditions.

  1. Data Poisoning Attacks
  2. Model Poisoning in Machine Learning
  3. Backdoor Attacks in Neural Networks
  4. Embedding Space Attacks
  5. Secure Model Distribution