
Toxic Language

A foundational model-based Evaluator that identifies harmful or unsafe content by analyzing text for multiple hazard categories. It provides a binary safety assessment and detailed hazard categorization.

Toxic Language - Foundational
  • Use Case: Harmful Content Detection
  • Analytic Engine: Foundational
  • Related OWASP Risks:
  • Related Regulations:
  • Valid Inputs: Text
  • Scope: Full Exchange
  • Last Update: 2025-05-12
  • License: Apache 2.0
  • Dependencies:

Detailed Description

The Toxic Language - Foundational Evaluator uses a foundational model to detect toxic, harmful, or unsafe content with high accuracy. It analyzes text for 10 hazard categories, covering areas such as violent and non-violent crimes, hate speech, self-harm, and sexual content.

This evaluator provides a robust safety layer for content moderation, helping to ensure that AI systems do not process or generate harmful content. The use of a foundational model provides sophisticated understanding of context and nuance beyond what simple pattern matching can achieve.

The evaluator identifies not just whether content is unsafe, but also which specific hazard categories it falls into, allowing for more targeted interventions and detailed reporting.

Input Descriptions:

The Evaluator accepts text input from any part of an Exchange.

Usage Examples:

This input would likely trigger the Evaluator due to violent content:

Triggering Example
    Here's a detailed plan for how to assault someone and get away with it...

This input would also trigger the Evaluator due to hate speech content:

Triggering Example
    All members of [specific ethnic group] are inherently [derogatory statement] and should be [discriminatory action].

This input would likely be flagged for self-harm content:

Triggering Example
    What's the most effective way to harm myself? I want detailed instructions.

This input would not trigger the Evaluator since it contains no unsafe content:

Non-Triggering Example
    Can you recommend some good books on artificial intelligence and machine learning for beginners?

Output Descriptions:

Returns a Finding containing detailed safety analysis:

Finding Structure
{
  "name": "Toxic Language Detection",
  "is_unsafe": boolean,                     # Access with finding.is_true
  "hazard_categories": [list of strings]    # Access with finding.list_value
}

The hazard categories can include:

  • Violent Crimes
  • Non-Violent Crimes
  • Sex-Related Crimes
  • Child Sexual Exploitation
  • Defamation
  • Indiscriminate Weapons
  • Hate
  • Suicide & Self-Harm
  • Sexual Content
  • Elections
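
The comments in the Finding Structure above indicate that downstream code reads the boolean result through finding.is_true and the triggered categories through finding.list_value. Below is a minimal sketch of category-specific routing, assuming a Python integration in which the Finding is exposed as an object with those accessors; the handler name and routing labels are illustrative, not part of a documented SDK.

Python Sketch (illustrative)
def handle_toxic_language_finding(finding) -> str:
    """Route a Finding from the Toxic Language - Foundational Evaluator."""
    # `is_true` is the documented boolean accessor for "is_unsafe".
    if not finding.is_true:
        return "allow"

    # `list_value` is the documented accessor for "hazard_categories".
    categories = finding.list_value or []
    if "Suicide & Self-Harm" in categories:
        return "escalate_to_crisis_flow"
    if "Child Sexual Exploitation" in categories:
        return "block_and_report"
    return "block"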

Configuration Options:

Option | Description | Default
'hazard_category_list' | List of hazard categories to evaluate | ['Violent Crimes', 'Non-Violent Crimes', 'Sex-Related Crimes', 'Child Sexual Exploitation', 'Defamation', 'Indiscriminate Weapons', 'Hate', 'Suicide & Self-Harm', 'Sexual Content', 'Elections']
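
If only a subset of categories is relevant to a deployment, the 'hazard_category_list' option can be narrowed accordingly. A minimal sketch follows, assuming the option is supplied as a key-value configuration; the surrounding configuration mechanism is deployment-specific and the variable name is illustrative.

Python Sketch (illustrative)
# Restricts evaluation to two of the supported hazard categories
# using the documented 'hazard_category_list' option.
toxic_language_config = {
    "hazard_category_list": [
        "Violent Crimes",
        "Suicide & Self-Harm",
    ]
}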

Data & Dependencies

Data Sources

The foundational model has been trained on a diverse dataset of harmful and safe content examples to enable accurate classification across multiple hazard categories:

  • Curated datasets containing approximately 11,000 manually annotated interactions between humans and language models
  • Specialized content safety datasets covering 10 critical risk categories including Violent Crimes, Hate Speech, Sexual Content, and Self-Harm
  • Human-annotated examples of harmful and safe content from multiple sources
  • Datasets containing both explicit and implicit forms of harmful content to ensure robust detection
  • Training data specifically designed to minimize false positives on ambiguous content while maintaining high detection accuracy

Benchmarks

The Toxic Language - Foundational Evaluator has been tested against a benchmark dataset to assess its effectiveness:

Dataset | Sample Size | Accuracy | Precision | Recall | F1 Score
Beavertails Safety Benchmark | 1000 | 74.9% | 85.2% | 65.9% | 74.4%

Benchmarks last updated: 2025-05-14
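
As a point of reference, the F1 score is the harmonic mean of precision and recall, so the reported figures can be sanity-checked directly; the small difference from 74.4% in the check below comes from rounding of the published precision and recall.

Python Sketch (illustrative)
# F1 is the harmonic mean of precision and recall.
precision, recall = 0.852, 0.659
f1 = 2 * precision * recall / (precision + recall)
print(f"{f1:.1%}")  # ~74.3%, consistent with the reported 74.4% given input rounding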


Ways to Use and Deploy this Evaluator

Here's how to incorporate the Toxic Language - Foundational Evaluator in your Law:

ThirdLaw DSL
# Basic safety check
if ToxicLanguage-Foundational.is_unsafe in ScopeType then run InterventionType

# Checking for specific hazard categories
if "Hate" in ToxicLanguage-Foundational.hazard_categories in ScopeType then run InterventionType

# Checking for multiple categories with OR logic
if ("Suicide & Self-Harm" in ToxicLanguage-Foundational.hazard_categories or
"Violent Crimes" in ToxicLanguage-Foundational.hazard_categories) in ScopeType then run InterventionType

Security, Compliance & Risk Assessment

Security Considerations:

  • Provides robust detection of harmful content across multiple categories
  • Requires careful monitoring of false positive rates, especially with ambiguous content
  • Can be used as part of a content moderation pipeline for defense-in-depth (see the sketch below)
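
To illustrate the defense-in-depth point above, the sketch below layers a cheap, deployment-specific keyword pre-filter in front of the foundational-model check. evaluate_toxic_language is a hypothetical stand-in for however your deployment invokes the Evaluator; finding.is_true and finding.list_value are the accessors documented in the Output Descriptions.

Python Sketch (illustrative)
# Illustrative defense-in-depth pipeline (not a documented API).
BLOCKED_TERMS = ["<blocked term 1>", "<blocked term 2>"]  # deployment-specific pre-filter

def moderate(text, evaluate_toxic_language):
    # First layer: cheap exact keyword matching.
    lowered = text.lower()
    if any(term in lowered for term in BLOCKED_TERMS):
        return {"allowed": False, "reason": "keyword pre-filter"}

    # Second layer: the foundational-model Evaluator.
    finding = evaluate_toxic_language(text)
    if finding.is_true:
        return {"allowed": False, "reason": finding.list_value}

    return {"allowed": True, "reason": None}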

Compliance & Privacy:

  • EU AI Act - supports compliance by detecting content that could cause harm to users

Revision History:

2025-05-12: Initial release

  • Initial deployment of foundational model-based toxic content detection