Toxic Language
A Foundational-model-based Evaluator that identifies harmful or unsafe content by analyzing text against a set of hazard categories. It returns a binary safety assessment together with the specific hazard categories detected.
- Use Case: Harmful Content Detection
- Analytic Engine: Foundational
- Related OWASP Risks:
- Related Regulations:
- EU AI Act - Content Safety
- Valid Inputs: Text
- Scope: Full Exchange
- Last Update: 2025-05-12
- License: Apache 2.0
- Dependencies:
Detailed Description
The Toxic Language - Foundational Evaluator uses a foundational model to detect toxic, harmful, or unsafe content. It analyzes text against 10 hazard categories covering areas such as crimes, hate speech, self-harm, sexual content, and more.
This evaluator provides a safety layer for content moderation, helping to ensure that AI systems do not process or generate harmful content. The foundational model offers an understanding of context and nuance beyond what simple pattern matching can achieve.
The evaluator identifies not just whether content is unsafe, but also which specific hazard categories it falls into, allowing for more targeted interventions and detailed reporting.
Input Descriptions:
The Evaluator accepts text input from any part of an Exchange.
Example Inputs:
This input would likely trigger the Evaluator due to violent content:
Here's a detailed plan for how to assault someone and get away with it...
This input would also trigger the Evaluator due to hate speech content:
All members of [specific ethnic group] are inherently [derogatory statement] and should be [discriminatory action].
This input would likely be flagged for self-harm content:
What's the most effective way to harm myself? I want detailed instructions.
This input would not trigger the Evaluator since it contains no unsafe content:
Can you recommend some good books on artificial intelligence and machine learning for beginners?
Output Descriptions:
Returns a Finding containing detailed safety analysis:
{
  "name": "Toxic Language Detection",
  "is_unsafe": boolean,                    # Access with finding.is_true
  "hazard_categories": [list of strings],  # Access with finding.list_value
}
The hazard categories can include:
- Violent Crimes
- Non-Violent Crimes
- Sex-Related Crimes
- Child Sexual Exploitation
- Defamation
- Indiscriminate Weapons
- Hate
- Suicide & Self-Harm
- Sexual Content
- Elections
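The snippet below is a minimal sketch of consuming these output fields in Python. Only the access names (finding.is_true for the safety verdict, finding.list_value for the hazard categories) come from the output description above; the Finding stand-in class and the route_finding helper are hypothetical illustrations, not part of the actual API.

```python
# Illustrative only: a stand-in Finding type and a routing helper.
# Field names mirror the output description; everything else is assumed.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Finding:
    name: str
    is_true: bool                                          # maps to "is_unsafe"
    list_value: List[str] = field(default_factory=list)    # maps to "hazard_categories"


def route_finding(finding: Finding) -> str:
    """Decide a follow-up action from the evaluator output (sketch only)."""
    if not finding.is_true:
        return "allow"
    # Escalate the most severe categories first, then fall back to a plain block.
    if "Child Sexual Exploitation" in finding.list_value:
        return "block_and_report"
    if {"Suicide & Self-Harm", "Violent Crimes"} & set(finding.list_value):
        return "block_and_flag_for_review"
    return "block"


example = Finding(
    name="Toxic Language Detection",
    is_true=True,
    list_value=["Hate"],
)
print(route_finding(example))  # -> "block"
```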
Configuration Options:
| Option | Description | Default |
|---|---|---|
| 'hazard_category_list' | List of hazard categories to evaluate | ['Violent Crimes', 'Non-Violent Crimes', 'Sex-Related Crimes', 'Child Sexual Exploitation', 'Defamation', 'Indiscriminate Weapons', 'Hate', 'Suicide & Self-Harm', 'Sexual Content', 'Elections'] |
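If only a subset of hazards is relevant, the hazard_category_list option can be narrowed accordingly. The sketch below is illustrative: the option name comes from the table above, while the surrounding dictionary structure and the evaluator_config name are assumptions about how such a configuration might be expressed.

```python
# Illustrative configuration narrowing the evaluator to three categories.
# Only 'hazard_category_list' is documented; the rest is a hypothetical wrapper.
evaluator_config = {
    "evaluator": "ToxicLanguage-Foundational",
    "options": {
        "hazard_category_list": [
            "Hate",
            "Suicide & Self-Harm",
            "Violent Crimes",
        ],
    },
}
```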
Data & Dependencies
Data Sources
The foundational model has been trained on a diverse dataset of harmful and safe content examples to enable accurate classification across multiple hazard categories:
- Curated datasets containing approximately 11,000 manually annotated interactions between humans and language models
- Specialized content safety datasets covering 10 critical risk categories including Violent Crimes, Hate Speech, Sexual Content, and Self-Harm
- Human-annotated examples of harmful and safe content from multiple sources
- Datasets containing both explicit and implicit forms of harmful content to ensure robust detection
- Training data specifically designed to minimize false positives on ambiguous content while maintaining high detection accuracy
Benchmarks
The Toxic Language - Foundational Evaluator has been tested against a benchmark dataset to assess its effectiveness:
| Dataset | Sample Size | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|---|
| Beavertails Safety Benchmark | 1000 | 74.9% | 85.2% | 65.9% | 74.4% |
Benchmarks last updated: 2025-05-14
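As a quick sanity check, the reported F1 score can be reproduced from the precision and recall in the table above (up to rounding of the published figures):

```python
# Recompute F1 from the benchmark's precision and recall.
precision = 0.852
recall = 0.659
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.1%}")  # ~74.3%, consistent with the reported 74.4% after rounding
```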
Ways to Use and Deploy this Evaluator
Here's how to incorporate the Toxic Language - Foundational Evaluator in your Law:
# Basic safety check
if ToxicLanguage-Foundational.is_unsafe in ScopeType then run InterventionType
# Checking for specific hazard categories
if "Hate" in ToxicLanguage-Foundational.hazard_categories in ScopeType then run InterventionType
# Checking for multiple categories with OR logic
if ("Suicide & Self-Harm" in ToxicLanguage-Foundational.hazard_categories or
"Violent Crimes" in ToxicLanguage-Foundational.hazard_categories) in ScopeType then run InterventionType
Security, Compliance & Risk Assessment
Security Considerations:
- Provides robust detection of harmful content across multiple categories
- Requires careful monitoring of false positive rates, especially with ambiguous content
- Can be used as part of a content moderation pipeline for defense-in-depth
Compliance & Privacy:
- EU AI Act - supports compliance through detection of content that could cause harm to users
Revision History:
2025-05-12: Initial release
- Initial deployment of foundational-model-based toxic content detection