Toxic Language

An internally-hosted validation Evaluator that analyzes text content for six distinct categories of toxicity, providing granular toxicity scoring with configurable thresholds for each category.

Toxic Language - Validation

Detailed Description

The Toxic Language - Validation Evaluator leverages the Unitary Toxic-BERT model to perform multi-label toxicity classification across six categories: general toxicity, severe toxicity, obscenity, threats, insults, and identity-based hate speech. The Evaluator runs within the ThirdLaw VPC for enhanced security and data privacy, making it suitable for sensitive enterprise environments.
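
For reference, the underlying open-source model can also be queried directly. The sketch below uses the Hugging Face transformers pipeline (an assumption about tooling; it illustrates the model's multi-label output and is not the ThirdLaw Evaluator interface itself):

Python (Illustrative)
from transformers import pipeline

# unitary/toxic-bert is a multi-label classifier; applying a sigmoid yields
# an independent 0-1 score for each of the six categories.
classifier = pipeline(
    "text-classification",
    model="unitary/toxic-bert",
    top_k=None,                     # return scores for all labels
    function_to_apply="sigmoid",
)

result = classifier("This is absolutely worthless garbage.")
scores = result[0] if isinstance(result[0], list) else result
for item in scores:
    print(f"{item['label']}: {item['score']:.3f}")
# Labels emitted: toxic, severe_toxic, obscene, threat, insult, identity_hate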

Input Descriptions:

The Evaluator accepts text input from both Prompt and Response of any Exchange type.

Law Usage Examples:

General Toxicity

This Response would trigger the toxic threshold:

Triggering Example
This is absolutely worthless garbage. Who would even use this?

Severe Toxicity

This Response would trigger the severe_toxic threshold:

Triggering Example
This whole fucking situation is absolute bullshit. Everything is completely fucked up and total garbage.

Obscenity

This Response would trigger the obscene threshold:

Triggering Example
What the hell is wrong with you? This is such BS and you can kiss my ass.

Threat

This Response would trigger the threat threshold:

Triggering Example
I know where you live and you'll regret saying that when I find you.

Insult

This Response would trigger the insult threshold:

Triggering Example
You're nothing but a dimwitted fool who can't understand basic concepts.

Identity Hate

This Response would trigger the identity_hate threshold:

Triggering Example
Immigrants are all lazy and stupid. They shouldn't be allowed here.

Neutral

This Response would not trigger any thresholds:

Non-Triggering Example
I understand your perspective, but I respectfully disagree. Let's discuss this further and find common ground.

Each example above would exceed its respective category's default threshold of 0.5 while remaining below the thresholds for the other categories. There are two exceptions: the neutral example, which scores below 0.5 in every category, and the severe_toxic example, which typically also produces a high general toxicity score.

Output Descriptions:

Returns a Finding containing Boolean flags and confidence scores for each category:

Finding Structure
{
"ToxicLanguage-Validation.any": [0-1],
"ToxicLanguage-Validation.any.exceeds_threshold": [True/False],
"ToxicLanguage-Validation.toxic": [0-1],
"ToxicLanguage-Validation.toxic.exceeds_threshold": [True/False],
"ToxicLanguage-Validation.severe_toxic": [0-1],
"ToxicLanguage-Validation.severe_toxic.exceeds_threshold": [True/False],
"ToxicLanguage-Validation.obscene": [0-1],
"ToxicLanguage-Validation.obscene.exceeds_threshold": [True/False],
"ToxicLanguage-Validation.threat": [0-1],
"ToxicLanguage-Validation.threat.exceeds_threshold": [True/False],
"ToxicLanguage-Validation.insult": [0-1],
"ToxicLanguage-Validation.insult.exceeds_threshold": [True/False],
"ToxicLanguage-Validation.identity_hate": [0-1],
"ToxicLanguage-Validation.identity_hate.exceeds_threshold": [True/False],
}
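
The mapping from raw category scores to these Finding fields can be sketched as follows; the helper name build_finding and the use of max() to aggregate the any score are assumptions made for illustration, not ThirdLaw's documented implementation:

Python (Illustrative)
DEFAULT_THRESHOLDS = {
    "toxic": 0.5,
    "severe_toxic": 0.5,
    "obscene": 0.5,
    "threat": 0.5,
    "insult": 0.5,
    "identity_hate": 0.5,
}

def build_finding(scores, thresholds=DEFAULT_THRESHOLDS):
    """Fold per-category scores (0-1) into Finding-style keys and flags."""
    prefix = "ToxicLanguage-Validation"
    finding = {}
    for category, threshold in thresholds.items():
        score = scores.get(category, 0.0)
        finding[f"{prefix}.{category}"] = score
        finding[f"{prefix}.{category}.exceeds_threshold"] = score > threshold
    # Assumed aggregation: "any" reports the highest category score, and the
    # aggregate flag fires when any single category exceeds its threshold.
    finding[f"{prefix}.any"] = max(scores.get(c, 0.0) for c in thresholds)
    finding[f"{prefix}.any.exceeds_threshold"] = any(
        finding[f"{prefix}.{c}.exceeds_threshold"] for c in thresholds
    )
    return finding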

Configuration Options:

Each toxicity category has a configurable sensitivity threshold (an illustrative override is sketched after this list):

  • toxic: 0-1 (default 0.5)
  • severe_toxic: 0-1 (default 0.5)
  • obscene: 0-1 (default 0.5)
  • threat: 0-1 (default 0.5)
  • insult: 0-1 (default 0.5)
  • identity_hate: 0-1 (default 0.5)
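
As an example of overriding the defaults, tightening only the threat threshold might look like this (reusing the hypothetical build_finding helper from the sketch above; the real configuration is applied through the Evaluator's settings, not in code):

Python (Illustrative)
# Hypothetical configuration: stricter threat detection, defaults elsewhere.
custom_thresholds = {**DEFAULT_THRESHOLDS, "threat": 0.2}

scores = {"toxic": 0.31, "severe_toxic": 0.02, "obscene": 0.12,
          "threat": 0.27, "insult": 0.18, "identity_hate": 0.01}

finding = build_finding(scores, custom_thresholds)
# threat = 0.27 exceeds the lowered 0.2 threshold, so both
# "ToxicLanguage-Validation.threat.exceeds_threshold" and
# "ToxicLanguage-Validation.any.exceeds_threshold" are True.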

Data & Dependencies

Data Sources

The Evaluator uses the Unitary AI Toxic-BERT model, which was trained on the Jigsaw Toxic Comment Classification Challenge dataset. More information can be found in the Detoxify GitHub repository.

Benchmarks

The Toxic Language - Validation Evaluator has been tested against three benchmark datasets to assess its effectiveness:

Dataset                              Sample Size     Accuracy   Precision   Recall   F1
Allen AI Institute Toxicity Dataset  99k Comments    84.5%      93.9%       47.0%    62.7%
Google Jigsaw Toxicity Dataset       64k Comments    91.8%      55.0%       90.1%    68.28%
Allen AI + Jigsaw                    163k Comments   83.7%      77.4%       54.9%    64.3%

Benchmarks last updated: February 2025
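
For readers unfamiliar with the metric columns, the following sketch shows how accuracy, precision, recall, and F1 are computed for a binary toxic/non-toxic decision; it uses scikit-learn on toy data and is not the ThirdLaw benchmark harness:

Python (Illustrative)
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# y_true: ground-truth toxicity labels; y_pred: Evaluator flags (1 = toxic).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(f"Accuracy:  {accuracy_score(y_true, y_pred):.1%}")
print(f"Precision: {precision_score(y_true, y_pred):.1%}")
print(f"Recall:    {recall_score(y_true, y_pred):.1%}")
print(f"F1:        {f1_score(y_true, y_pred):.1%}")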


Ways to Use and Deploy this Evaluator

Basic Examples:

Here's how to incorporate Toxic Language - Validation in your Law:

ThirdLaw DSL
if ToxicLanguage-Validation.any.exceeds_threshold in ScopeType then run InterventionType

Here's how to use the Toxic Language - Validation to catch only threats in your Law:

ThirdLaw DSL
if ToxicLanguage-Validation.threat.exceeds_threshold in ScopeType then run InterventionType

Here's how to use the Toxic Language - Validation to catch only content that is both toxic and severely toxic in your Law:

ThirdLaw DSL
if ToxicLanguage-Validation.toxic.exceeds_threshold and ToxicLanguage-Validation.severe_toxic.exceeds_threshold in ScopeType then run InterventionType

Here's how to use the Toxic Language - Validation to catch only extreme threats in your Law:

ThirdLaw DSL
if ToxicLanguage-Validation.threat is greater than 0.9 in ScopeType then run InterventionType

Security, Compliance & Risk Assessment

Security Considerations:

  • The internally-hosted model runs within the client's ThirdLaw instance for data privacy

Compliance & Privacy:

  • EU AI Act - supports compliance with the EU AI Act's requirements by providing continuous monitoring, assessment, logging, and auditing of AI system outputs for harmful content.
  • Online Safety Act - helps organizations meet UK Online Safety Act requirements by providing automated detection, logging, and auditing of illegal and harmful content categories, and supports age-appropriate content controls via toxicity detection.
  • GDPR - supports GDPR compliance by providing audit trails and transparency in content moderation decisions affecting user data.
  • FTC Act - helps prevent deceptive practices by ensuring transparent content moderation.

Revision History:

2025-02-18: Initial release

  • Initial implementation of Toxic-BERT model
  • ThirdLaw benchmark results
  • Initial documentation