Toxic Language
An internally-hosted validation Evaluator that analyzes text content for six distinct categories of toxicity, providing granular toxicity scoring with configurable thresholds for each category.
- Use Case: Content Moderation, Brand Safety
- Analytic Engine: Validation
- Related OWASP Risks:
- Related Regulations: EU AI Act, UK Online Safety Bill, GDPR, FTC Act
- Valid Inputs: Text
- Scope: Full Exchange
- Last Update: 2025-02-14
- License: Apache 2.0
- Dependencies: None
Detailed Description
The Toxic Language - Validation Evaluator leverages the Unitary Toxic-BERT model to perform multi-label toxicity classification across six categories: general toxicity, severe toxicity, obscenity, threats, insults, and identity-based hate speech. The Evaluator runs within the ThirdLaw VPC for enhanced security and data privacy, making it suitable for sensitive enterprise environments.
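For illustration, the classification step can be approximated locally with the Hugging Face transformers library, since the public unitary/toxic-bert checkpoint emits exactly these six labels. The sketch below is not the Evaluator's internal implementation; it assumes the checkpoint is reachable from your environment (for example, mirrored inside the VPC).

```python
from transformers import pipeline

# Minimal sketch of the underlying multi-label classification step.
# Not the Evaluator's internal code; assumes the public unitary/toxic-bert
# checkpoint is available locally or via an internal mirror.
classifier = pipeline(
    "text-classification",
    model="unitary/toxic-bert",
    top_k=None,                   # return scores for all six labels
    function_to_apply="sigmoid",  # multi-label: independent per-label scores
)

scores = classifier("You're nothing but a dimwitted fool.")[0]
# scores is a list of dicts, e.g.
# [{'label': 'insult', 'score': 0.96}, {'label': 'toxic', 'score': 0.95}, ...]
for entry in sorted(scores, key=lambda e: e["score"], reverse=True):
    print(f"{entry['label']:>13}: {entry['score']:.3f}")
```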
Input Descriptions:
The Evaluator accepts text input from both Prompt and Response of any Exchange type.
Usage Examples:
General Toxicity
This Response would trigger the toxic threshold:
This is absolutely worthless garbage. Who would even use this?
Severe Toxicity
This Response would trigger the severe_toxic threshold:
This whole fucking situation is absolute bullshit. Everything is completely fucked up and total garbage.
Obscenity
This Response would trigger the obscene threshold:
What the hell is wrong with you? This is such BS and you can kiss my ass.
Threat
This Response would trigger the threat threshold:
I know where you live and you'll regret saying that when I find you.
Insult
This Response would trigger the insult threshold:
You're nothing but a dimwitted fool who can't understand basic concepts.
Identity Hate
This Response would trigger the identity_hate threshold:
Immigrants are all lazy and stupid. They shouldn't be allowed here.
Neutral
This Response would not trigger any thresholds:
I understand your perspective, but I respectfully disagree. Let's discuss this further and find common ground.
Each example above would exceed its respective category's default threshold of 0.5 while remaining below the thresholds for the other categories. The exceptions are the Neutral example, which scores below 0.5 in every category, and the Severe Toxicity example, which typically also produces a high general toxicity score.
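To see which thresholds a particular passage trips, one can score it and compare each label against the 0.5 default. A minimal sketch, reusing the classifier object from the snippet above (the same assumptions apply):

```python
# Reuses `classifier` from the earlier sketch; same assumptions apply.
text = "I know where you live and you'll regret saying that when I find you."
flagged = [
    entry["label"]
    for entry in classifier(text)[0]
    if entry["score"] > 0.5  # default threshold for every category
]
print(flagged)  # per the Threat example above, expected: ['threat']
```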
Output Descriptions:
Returns a Finding containing a confidence score and a Boolean threshold flag for each category, plus an aggregate any score and flag:
```
{
  "ToxicLanguage-Validation.any": [0-1],
  "ToxicLanguage-Validation.any.exceeds_threshold": [True/False],
  "ToxicLanguage-Validation.toxic": [0-1],
  "ToxicLanguage-Validation.toxic.exceeds_threshold": [True/False],
  "ToxicLanguage-Validation.severe_toxic": [0-1],
  "ToxicLanguage-Validation.severe_toxic.exceeds_threshold": [True/False],
  "ToxicLanguage-Validation.obscene": [0-1],
  "ToxicLanguage-Validation.obscene.exceeds_threshold": [True/False],
  "ToxicLanguage-Validation.threat": [0-1],
  "ToxicLanguage-Validation.threat.exceeds_threshold": [True/False],
  "ToxicLanguage-Validation.insult": [0-1],
  "ToxicLanguage-Validation.insult.exceeds_threshold": [True/False],
  "ToxicLanguage-Validation.identity_hate": [0-1],
  "ToxicLanguage-Validation.identity_hate.exceeds_threshold": [True/False]
}
```
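To illustrate how raw per-category scores could map onto this Finding shape, here is a hedged sketch. The field names mirror the schema above; computing any as the maximum per-category score, and setting its flag when any single category exceeds its threshold, are assumptions rather than documented behavior.

```python
# Hypothetical mapping from per-category scores to the Finding schema above.
# Aggregating "any" as the max score is an assumption, not documented behavior.
CATEGORIES = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
DEFAULT_THRESHOLDS = {category: 0.5 for category in CATEGORIES}

def build_finding(scores: dict[str, float],
                  thresholds: dict[str, float] = DEFAULT_THRESHOLDS) -> dict:
    finding = {
        "ToxicLanguage-Validation.any": max(scores[c] for c in CATEGORIES),
        "ToxicLanguage-Validation.any.exceeds_threshold": any(
            scores[c] > thresholds[c] for c in CATEGORIES
        ),
    }
    for c in CATEGORIES:
        finding[f"ToxicLanguage-Validation.{c}"] = scores[c]
        finding[f"ToxicLanguage-Validation.{c}.exceeds_threshold"] = (
            scores[c] > thresholds[c]
        )
    return finding
```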
Configuration Options:
Each toxicity category has an independently configurable threshold between 0 and 1; lower values increase sensitivity (see the sketch after this list):
- toxic: 0-1 (default 0.5)
- severe_toxic: 0-1 (default 0.5)
- obscene: 0-1 (default 0.5)
- threat: 0-1 (default 0.5)
- insult: 0-1 (default 0.5)
- identity_hate: 0-1 (default 0.5)
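For example, an operator who wants earlier warning on threats might lower only that threshold. The dictionary below is a hypothetical illustration; the Evaluator's actual configuration surface is not specified in this document.

```python
# Hypothetical threshold configuration. Lower values flag content at lower
# model confidence (more sensitive, but more false positives).
thresholds = {
    "toxic": 0.5,
    "severe_toxic": 0.5,
    "obscene": 0.5,
    "threat": 0.3,  # stricter screening for threats
    "insult": 0.5,
    "identity_hate": 0.5,
}
```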
Data & Dependencies
Data Sources
The Evaluator uses the Unitary AI Toxic-BERT model, which was trained on the Jigsaw Toxic Comment Classification Challenge dataset. More information can be found in the detoxify GitHub repo.
Benchmarks
The Toxic Language - Validation Evaluator has been tested against three benchmark datasets to assess its effectiveness:
| Dataset | Sample Size | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| Allen AI Institute Toxicity Dataset | 99k Comments | 84.5% | 93.9% | 47.0% | 62.7% |
| Google Jigsaw Toxicity Dataset | 64k Comments | 91.8% | 55.0% | 90.1% | 68.3% |
| Allen AI + Jigsaw | 163k Comments | 83.7% | 77.4% | 54.9% | 64.3% |
Benchmarks last updated: February 2025
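The F1 column is the harmonic mean of the Precision and Recall columns; recomputing it from the rounded values shown reproduces the table to within about 0.1%:

```python
# F1 is the harmonic mean of precision and recall.
def f1(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

print(f"{f1(0.939, 0.470):.1%}")  # ~62.6% (table: 62.7%)
print(f"{f1(0.550, 0.901):.1%}")  # ~68.3% (table: 68.3%)
print(f"{f1(0.774, 0.549):.1%}")  # ~64.2% (table: 64.3%)
```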
Ways to Use and Deploy this Evaluator
Basic Examples:
Here's how to incorporate Toxic Language - Validation in your Law:
if ToxicLanguage-Validation.any.exceeds_threshold in ScopeType then run InterventionType
Here's how to use the Toxic Language - Validation Evaluator to catch only threats in your Law:
if ToxicLanguage-Validation.threat.exceeds_threshold in ScopeType then run InterventionType
Here's how to use the Toxic Language - Validation Evaluator to catch only content that exceeds both the toxic and severe_toxic thresholds in your Law:
if ToxicLanguage-Validation.toxic.exceeds_threshold and ToxicLanguage-Validation.severe_toxic.exceeds_threshold in ScopeType then run InterventionType
Here's how to use the Toxic Language - Validation Evaluator to catch only extreme threats in your Law:
if ToxicLanguage-Validation.threat is greater than 0.9 in ScopeType then run InterventionType
Security, Compliance & Risk Assessment
Security Considerations:
- The internally-hosted model runs within the client's ThirdLaw instance for data privacy
Compliance & Privacy:
- EU AI Act - supports compliance with the EU AI Act's requirements by providing continuous monitoring, assessment, logging, and auditing of AI system outputs for harmful content.
- Online Safety Bill - helps organizations meet UK Online Safety Bill requirements by providing automated detection, logging, and auditing of illegal and harmful content categories and supports age-appropriate content controls via toxicity detection.
- GDPR - supports GDPR compliance by providing audit trails and transparency in content moderation decisions affecting user data.
- FTC Act - helps prevent deceptive practices by ensuring transparent content moderation.
Revision History:
2025-02-18: Initial release
- Initial implementation of Toxic-BERT model
- ThirdLaw benchmark results
- Initial documentation