Toxic Language
A Foundational-model-based Evaluator that identifies harmful or unsafe content by analyzing text against a set of hazard categories. It returns a binary safety assessment together with the specific hazard categories detected.
- Use Case: Harmful Content Detection
- Analytic Engine: Foundational
- Related OWASP Risks:
- Related Regulations:
- EU AI Act - Content Safety
- Valid Inputs: Text
- Scope: Full Exchange
- Last Update: 2025-05-12
- License: Apache 2.0
- Dependencies:
Detailed Description
The Toxic Language - Foundational Evaluator uses a foundational model to detect toxic, harmful, or unsafe content. It analyzes text against 10 hazard categories covering areas such as crimes, hate speech, self-harm, sexual content, and more.
This evaluator provides a safety layer for content moderation, helping to ensure that AI systems do not process or generate harmful content. The foundational model offers an understanding of context and nuance beyond what simple pattern matching can achieve.
The evaluator identifies not just whether content is unsafe, but also which specific hazard categories it falls into, allowing for more targeted interventions and detailed reporting.
Input Descriptions:
The Evaluator accepts text input from any part of an Exchange.
Example Inputs:
This input would likely trigger the Evaluator due to violent content:
Here's a detailed plan for how to assault someone and get away with it...
This input would also trigger the Evaluator due to hate speech content:
All members of [specific ethnic group] are inherently [derogatory statement] and should be [discriminatory action].
This input would likely be flagged for self-harm content:
What's the most effective way to harm myself? I want detailed instructions.
This input would not trigger the Evaluator since it contains no unsafe content:
Can you recommend some good books on artificial intelligence and machine learning for beginners?
Output Descriptions:
Returns a Finding containing detailed safety analysis:
{
  "name": "Toxic Language Detection",
  "is_unsafe": boolean,                    # Access with finding.is_true
  "hazard_categories": [list of strings],  # Access with finding.list_value
}
The hazard categories can include:
- Violent Crimes
- Non-Violent Crimes
- Sex-Related Crimes
- Child Sexual Exploitation
- Defamation
- Indiscriminate Weapons
- Hate
- Suicide & Self-Harm
- Sexual Content
- Elections
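The snippet below is a minimal sketch of consuming these output fields in Python. Only the access names (finding.is_true for the safety verdict, finding.list_value for the hazard categories) come from the output description above; the Finding stand-in class and the route_finding helper are hypothetical illustrations, not part of the actual API.

```python
# Illustrative only: a stand-in Finding type and a routing helper.
# Field names mirror the output description; everything else is assumed.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Finding:
    name: str
    is_true: bool                                          # maps to "is_unsafe"
    list_value: List[str] = field(default_factory=list)    # maps to "hazard_categories"


def route_finding(finding: Finding) -> str:
    """Decide a follow-up action from the evaluator output (sketch only)."""
    if not finding.is_true:
        return "allow"
    # Escalate the most severe categories first, then fall back to a plain block.
    if "Child Sexual Exploitation" in finding.list_value:
        return "block_and_report"
    if {"Suicide & Self-Harm", "Violent Crimes"} & set(finding.list_value):
        return "block_and_flag_for_review"
    return "block"


example = Finding(
    name="Toxic Language Detection",
    is_true=True,
    list_value=["Hate"],
)
print(route_finding(example))  # -> "block"
```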
Configuration Options:
| Option | Description | Default |
|---|---|---|
| 'hazard_category_list' | List of hazard categories to evaluate | ['Violent Crimes', 'Non-Violent Crimes', 'Sex-Related Crimes', 'Child Sexual Exploitation', 'Defamation', 'Indiscriminate Weapons', 'Hate', 'Suicide & Self-Harm', 'Sexual Content', 'Elections'] |
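If only a subset of hazards is relevant, the hazard_category_list option can be narrowed accordingly. The sketch below is illustrative: the option name comes from the table above, while the surrounding dictionary structure and the evaluator_config name are assumptions about how such a configuration might be expressed.

```python
# Illustrative configuration narrowing the evaluator to three categories.
# Only 'hazard_category_list' is documented; the rest is a hypothetical wrapper.
evaluator_config = {
    "evaluator": "ToxicLanguage-Foundational",
    "options": {
        "hazard_category_list": [
            "Hate",
            "Suicide & Self-Harm",
            "Violent Crimes",
        ],
    },
}
```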
Data & Dependencies
Data Sources
The foundational model has been trained on a diverse dataset of harmful and safe content examples to enable accurate classification across multiple hazard categories:
- Curated datasets containing approximately 11,000 manually annotated interactions between humans and language models
- Specialized content safety datasets covering 10 critical risk categories including Violent Crimes, Hate Speech, Sexual Content, and Self-Harm
- Human-annotated examples of harmful and safe content from multiple sources
- Datasets containing both explicit and implicit forms of harmful content to ensure robust detection
- Training data specifically designed to minimize false positives on ambiguous content while maintaining high detection accuracy
Benchmarks
The Toxic Language - Foundational Evaluator has been tested against a benchmark dataset to assess its effectiveness:
| Dataset | Sample Size | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|---|
| Beavertails Safety Benchmark | 1000 | 74.9% | 85.2% | 65.9% | 74.4% |
Benchmarks last updated: 2025-05-14
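As a quick sanity check, the reported F1 score can be reproduced from the precision and recall in the table above (up to rounding of the published figures):

```python
# Recompute F1 from the benchmark's precision and recall.
precision = 0.852
recall = 0.659
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.1%}")  # ~74.3%, consistent with the reported 74.4% after rounding
```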
Ways to Use and Deploy this Evaluator
Here's how to incorporate the Toxic Language - Foundational Evaluator in your Law:
# Basic safety check
if ToxicLanguage-Foundational.is_unsafe in ScopeType then run InterventionType
# Checking for specific hazard categories
if "Hate" in ToxicLanguage-Foundational.hazard_categories in ScopeType then run InterventionType
# Checking for multiple categories with OR logic
if ("Suicide & Self-Harm" in ToxicLanguage-Foundational.hazard_categories or
"Violent Crimes" in ToxicLanguage-Foundational.hazard_categories) in ScopeType then run InterventionType
Security, Compliance & Risk Assessment
Security Considerations:
- Provides robust detection of harmful content across multiple categories
- Requires careful monitoring of false positive rates, especially with ambiguous content
- Can be used as part of a content moderation pipeline for defense-in-depth
Compliance & Privacy:
- EU AI Act - supports compliance through detection of content that could cause harm to users
Revision History:
2025-05-12: Initial release
- Initial deployment of foundational-model-based toxic content detection