Toxic Language

An internally-hosted validation Evaluator that analyzes text content for six distinct categories of toxicity, providing granular toxicity scoring with configurable thresholds for each category.

Toxic Language - Validation

Detailed Description

The Toxic Language - Validation Evaluator leverages the Unitary Toxic-BERT model to perform multi-label toxicity classification across six categories: general toxicity, severe toxicity, obscenity, threats, insults, and identity-based hate speech. The Evaluator runs within the ThirdLaw VPC for enhanced security and data privacy, making it suitable for sensitive enterprise environments.
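
For reference, the underlying open-source model can also be queried directly. The sketch below uses the Hugging Face transformers pipeline (an assumption about tooling; it illustrates the model's multi-label output and is not the ThirdLaw Evaluator interface itself):

Python (Illustrative)
from transformers import pipeline

# unitary/toxic-bert is a multi-label classifier; applying a sigmoid yields
# an independent 0-1 score for each of the six categories.
classifier = pipeline(
    "text-classification",
    model="unitary/toxic-bert",
    top_k=None,                     # return scores for all labels
    function_to_apply="sigmoid",
)

result = classifier("This is absolutely worthless garbage.")
scores = result[0] if isinstance(result[0], list) else result
for item in scores:
    print(f"{item['label']}: {item['score']:.3f}")
# Labels emitted: toxic, severe_toxic, obscene, threat, insult, identity_hate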

Input Descriptions:

The Evaluator accepts text input from both Prompt and Response of any Exchange type.

Law Usage Examples:

General Toxicity

This Response would trigger the toxic threshold:

Triggering Example
This is absolutely worthless garbage. Who would even use this?

Severe Toxicity

This Response would trigger the severe_toxic threshold:

Triggering Example
This whole fucking situation is absolute bullshit. Everything is completely fucked up and total garbage.

Obscenity

This Response would trigger the obscene threshold:

Triggering Example
What the hell is wrong with you? This is such BS and you can kiss my ass.

Threat

This Response would trigger the threat threshold:

Triggering Example
I know where you live and you'll regret saying that when I find you.

Insult

This Response would trigger the insult threshold:

Triggering Example
You're nothing but a dimwitted fool who can't understand basic concepts.

Identity Hate

This Response would trigger the identity_hate threshold:

Triggering Example
Immigrants are all lazy and stupid. They shouldn't be allowed here.

Neutral

This Response would not trigger any thresholds:

Non-Triggering Example
I understand your perspective, but I respectfully disagree. Let's discuss this further and find common ground.

Each example above would exceed its respective category's default threshold of 0.5 while remaining below the thresholds for the other categories. There are two exceptions: the neutral example, which scores below 0.5 in every category, and the severe_toxic example, which typically also produces a high general toxicity score.

Output Descriptions:

Returns a Finding containing Boolean flags and confidence scores for each category:

Finding Structure
{
"ToxicLanguage-Validation.any": [0-1],
"ToxicLanguage-Validation.any.exceeds_threshold": [True/False],
"ToxicLanguage-Validation.toxic": [0-1],
"ToxicLanguage-Validation.toxic.exceeds_threshold": [True/False],
"ToxicLanguage-Validation.severe_toxic": [0-1],
"ToxicLanguage-Validation.severe_toxic.exceeds_threshold": [True/False],
"ToxicLanguage-Validation.obscene": [0-1],
"ToxicLanguage-Validation.obscene.exceeds_threshold": [True/False],
"ToxicLanguage-Validation.threat": [0-1],
"ToxicLanguage-Validation.threat.exceeds_threshold": [True/False],
"ToxicLanguage-Validation.insult": [0-1],
"ToxicLanguage-Validation.insult.exceeds_threshold": [True/False],
"ToxicLanguage-Validation.identity_hate": [0-1],
"ToxicLanguage-Validation.identity_hate.exceeds_threshold": [True/False],
}
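
The mapping from raw category scores to these Finding fields can be sketched as follows; the helper name build_finding and the use of max() to aggregate the any score are assumptions made for illustration, not ThirdLaw's documented implementation:

Python (Illustrative)
DEFAULT_THRESHOLDS = {
    "toxic": 0.5,
    "severe_toxic": 0.5,
    "obscene": 0.5,
    "threat": 0.5,
    "insult": 0.5,
    "identity_hate": 0.5,
}

def build_finding(scores, thresholds=DEFAULT_THRESHOLDS):
    """Fold per-category scores (0-1) into Finding-style keys and flags."""
    prefix = "ToxicLanguage-Validation"
    finding = {}
    for category, threshold in thresholds.items():
        score = scores.get(category, 0.0)
        finding[f"{prefix}.{category}"] = score
        finding[f"{prefix}.{category}.exceeds_threshold"] = score > threshold
    # Assumed aggregation: "any" reports the highest category score, and the
    # aggregate flag fires when any single category exceeds its threshold.
    finding[f"{prefix}.any"] = max(scores.get(c, 0.0) for c in thresholds)
    finding[f"{prefix}.any.exceeds_threshold"] = any(
        finding[f"{prefix}.{c}.exceeds_threshold"] for c in thresholds
    )
    return finding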

Configuration Options:

Each toxicity category has a configurable sensitivity threshold (an illustrative override is sketched after this list):

  • toxic: 0-1 (default 0.5)
  • severe_toxic: 0-1 (default 0.5)
  • obscene: 0-1 (default 0.5)
  • threat: 0-1 (default 0.5)
  • insult: 0-1 (default 0.5)
  • identity_hate: 0-1 (default 0.5)
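
As an example of overriding the defaults, tightening only the threat threshold might look like this (reusing the hypothetical build_finding helper from the sketch above; the real configuration is applied through the Evaluator's settings, not in code):

Python (Illustrative)
# Hypothetical configuration: stricter threat detection, defaults elsewhere.
custom_thresholds = {**DEFAULT_THRESHOLDS, "threat": 0.2}

scores = {"toxic": 0.31, "severe_toxic": 0.02, "obscene": 0.12,
          "threat": 0.27, "insult": 0.18, "identity_hate": 0.01}

finding = build_finding(scores, custom_thresholds)
# threat = 0.27 exceeds the lowered 0.2 threshold, so both
# "ToxicLanguage-Validation.threat.exceeds_threshold" and
# "ToxicLanguage-Validation.any.exceeds_threshold" are True.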

Data & Dependencies

Data Sources

The Evaluator uses the Unitary AI Toxic-BERT model, which was trained on the Jigsaw Toxic Comment Classification Challenge dataset. More information can be found in the Detoxify GitHub repository.

Benchmarks

The Toxic Language - Validation Evaluator has been tested against three benchmark datasets to assess its effectiveness:

Dataset                              Sample Size     Accuracy   Precision   Recall   F1
Allen AI Institute Toxicity Dataset  99k Comments    84.5%      93.9%       47.0%    62.7%
Google Jigsaw Toxicity Dataset       64k Comments    91.8%      55.0%       90.1%    68.28%
Allen AI + Jigsaw                    163k Comments   83.7%      77.4%       54.9%    64.3%

Benchmarks last updated: February 2025
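
For readers unfamiliar with the metric columns, the following sketch shows how accuracy, precision, recall, and F1 are computed for a binary toxic/non-toxic decision; it uses scikit-learn on toy data and is not the ThirdLaw benchmark harness:

Python (Illustrative)
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# y_true: ground-truth toxicity labels; y_pred: Evaluator flags (1 = toxic).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(f"Accuracy:  {accuracy_score(y_true, y_pred):.1%}")
print(f"Precision: {precision_score(y_true, y_pred):.1%}")
print(f"Recall:    {recall_score(y_true, y_pred):.1%}")
print(f"F1:        {f1_score(y_true, y_pred):.1%}")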


Ways to Use and Deploy this Evaluator

Basic Examples:

Here's how to incorporate Toxic Language - Validation in your Law:

ThirdLaw DSL
if ToxicLanguage-Validation.any.exceeds_threshold in ScopeType then run InterventionType

Here's how to use the Toxic Language - Validation to catch only threats in your Law:

ThirdLaw DSL
if ToxicLanguage-Validation.threat.exceeds_threshold in ScopeType then run InterventionType

Here's how to use the Toxic Language - Validation to catch only content that is both toxic and severely toxic in your Law:

ThirdLaw DSL
if ToxicLanguage-Validation.toxic.exceeds_threshold and ToxicLanguage-Validation.severe_toxic.exceeds_threshold in ScopeType then run InterventionType

Here's how to use the Toxic Language - Validation to catch only extreme threats in your Law:

ThirdLaw DSL
if ToxicLanguage-Validation.threat is greater than 0.9 in ScopeType then run InterventionType

Security, Compliance & Risk Assessment

Security Considerations:

  • The internally-hosted model runs within the client's ThirdLaw instance for data privacy

Compliance & Privacy:

  • EU AI Act - supports compliance with the EU AI Act's requirements by providing continuous monitoring, assessment, logging, and auditing of AI system outputs for harmful content.
  • Online Safety Act - helps organizations meet UK Online Safety Act requirements by providing automated detection, logging, and auditing of illegal and harmful content categories, and supports age-appropriate content controls via toxicity detection.
  • GDPR - supports GDPR compliance by providing audit trails and transparency in content moderation decisions affecting user data.
  • FTC Act - helps prevent deceptive practices by ensuring transparent content moderation.

Revision History:

2025-02-18: Initial release

  • Initial implementation of Toxic-BERT model
  • ThirdLaw benchmark results
  • Initial documentation