Code Detection
A semantic analysis-based Evaluator that uses vector embeddings to identify code patterns and programming language constructs in text by comparing against curated collections of code and neutral content.
- Use Case: Unauthorized Code Detection
- Analytic Engine: Semantic Similarity
- Related OWASP Risks:
- Related Regulations:
  - EU AI Act - Output Safety
  - NIS Directive - System Security
- Valid Inputs: Text
- Scope: Full Exchange
- Last Update: 2025-02-04
- License: Apache 2.0
- Dependencies: Vector Database
Detailed Description
The Evaluator uses a vector database to run semantic similarity searches against two curated collections: code and neutral content. This gives more nuanced detection than pattern-based approaches and lets it catch code-like content even when that content has been obfuscated or modified.
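The comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the Evaluator's implementation: the `embed` function is a toy stand-in (hashed character trigrams) for a real embedding model, the in-memory `COLLECTIONS` dictionary stands in for the vector database, and the example entries and all function names are hypothetical.

```python
import math
import zlib

DIM = 256  # size of the toy embedding vector (arbitrary choice for this sketch)


def embed(text: str) -> list[float]:
    """Toy stand-in for an embedding model: hashed character-trigram counts."""
    vec = [0.0] * DIM
    for i in range(len(text) - 2):
        trigram = text[i:i + 3].encode("utf-8")
        vec[zlib.crc32(trigram) % DIM] += 1.0
    return vec


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical stand-in for the curated collections held in the vector database.
COLLECTIONS = {
    "code": [
        "def add(a, b):\n    return a + b",
        "for (let i = 0; i < n; i++) { total += i; }",
    ],
    "neutral": [
        "The weather was pleasant this afternoon.",
        "Please summarize the meeting notes for me.",
    ],
}


def evaluate(text: str) -> dict:
    """Score text against each collection and report the closest one."""
    query = embed(text)
    scores = {
        name: max(cosine(query, embed(example)) for example in examples)
        for name, examples in COLLECTIONS.items()
    }
    return {
        "closest_collection": max(scores, key=scores.get),
        "code.max_similarity": scores["code"],
        "neutral.max_similarity": scores["neutral"],
    }


result = evaluate("def calculate_sum(a, b):\n    return a + b")
print(result)
```

A production setup would replace `embed` with a trained embedding model and the dictionary lookup with a nearest-neighbor query against the vector database; only the overall shape (embed, compare to each collection, keep the maximum similarity) is meant to carry over.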
For quick pattern-based detection of common programming languages, consider the CodeDetection-Search Evaluator. The Search and Semantic Evaluators can be used together for comprehensive code detection, with Search providing fast initial screening and Semantic providing deeper analysis.
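One way to picture the two Evaluators working together is a two-stage check: a cheap pattern screen first, and the semantic check only when the screen finds nothing. The sketch below assumes that ordering. The regular expressions, the `detect_code` function, and the `semantic_check` callable are all hypothetical and do not reflect how either Evaluator is actually implemented.

```python
import re
from typing import Callable

# Hypothetical screening patterns for a few common languages.
SCREEN_PATTERNS = [
    re.compile(r"\bdef\s+\w+\s*\("),       # Python function definition
    re.compile(r"\bfunction\s+\w+\s*\("),  # JavaScript function declaration
    re.compile(r"\bfn\s+\w+\s*\("),        # Rust function definition
    re.compile(r"\bconsole\.log\s*\("),    # JavaScript logging call
]


def detect_code(text: str, semantic_check: Callable[[str], bool]) -> tuple[bool, str]:
    """Return (is_code, stage), where stage names the check that decided."""
    # Stage 1: fast pattern screen.
    if any(pattern.search(text) for pattern in SCREEN_PATTERNS):
        return True, "search"
    # Stage 2: deeper (and slower) semantic analysis, supplied by the caller.
    return semantic_check(text), "semantic"


# Usage with a placeholder semantic check that never flags anything.
print(detect_code("def calculate_sum(a, b):", lambda t: False))
print(detect_code("Five plus three equals eight.", lambda t: False))
```

Running the pattern screen first keeps the more expensive similarity search off the path for inputs that are obviously code, while obfuscated or unusual code still reaches the semantic stage.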
Input Descriptions
The Evaluator accepts text input and processes both Prompts and Responses.
Law Usage Example
This Response would trigger the Evaluator since it contains code-like patterns:
def calculate_sum(a, b):
    result = a + b
    return result

print(calculate_sum(5, 3))

function calculateTotal(items) {
    return items.reduce((sum, item) => {
        return sum + item.price;
    }, 0);
}
const cart = [{price: 10}, {price: 20}];
console.log(calculateTotal(cart));

struct Point {
    x: i32,
    y: i32,
}
impl Point {
    fn distance(&self, other: &Point) -> f64 {
        let dx = (self.x - other.x) as f64;
        let dy = (self.y - other.y) as f64;
        (dx * dx + dy * dy).sqrt()
    }
}
This Response would not trigger the Evaluator since it contains natural language:
The sum of two numbers can be calculated by adding them together. For example, five plus three equals eight.
Output Descriptions
Returns a Finding containing the closest matching collection and the maximum similarity score (between 0 and 1) for each collection:
{
  "CodeDetector-Balanced.closest_collection": ["code", "neutral"],
  "CodeDetector-Balanced.code.max_similarity": [0-1],
  "CodeDetector-Balanced.neutral.max_similarity": [0-1]
}
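A consumer of this Finding might interpret it along the following lines. This is a sketch under stated assumptions: it assumes that at runtime `closest_collection` holds a single string (`"code"` or `"neutral"`) and that each `max_similarity` is a float between 0 and 1. The `is_code_finding` helper, its `min_similarity` threshold, and the sample values are hypothetical and not part of the Evaluator.

```python
PREFIX = "CodeDetector-Balanced"  # key prefix taken from the Finding format above


def is_code_finding(finding: dict, min_similarity: float = 0.5) -> bool:
    """Treat a Finding as a code detection when "code" is the closest
    collection and its similarity clears a caller-chosen threshold."""
    closest = finding.get(f"{PREFIX}.closest_collection")
    code_score = finding.get(f"{PREFIX}.code.max_similarity", 0.0)
    return closest == "code" and code_score >= min_similarity


# Illustrative values only; these are not real Evaluator output.
sample_finding = {
    f"{PREFIX}.closest_collection": "code",
    f"{PREFIX}.code.max_similarity": 0.82,
    f"{PREFIX}.neutral.max_similarity": 0.41,
}

print(is_code_finding(sample_finding))
```

Adding a threshold on top of `closest_collection` is one possible way to trade recall for precision; whether that is appropriate depends on the deployment.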
Configuration Options
N/A
Data & Dependencies
Data Sources
- Synthetically generated code examples.
Benchmarks
The Code Detection - Semantic Evaluator has been tested against one benchmark dataset to assess its effectiveness:
The Code Detection Test Set is composed of the following open-source datasets:
- andstor/the_pile_github (code)
- lmsys/chatbot_arena_conversations (neutral)
| Dataset | Sample Size | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| Code Detection Test Set | 18023 conversations | 57.5% | 46.9% | 97.1% | 63.3% |
Benchmarks last updated: April 2025
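The F1 value in the table can be sanity-checked from the reported precision and recall using the standard definition, F1 = 2PR / (P + R). The snippet below uses only the rounded figures from the table, so it is expected to agree with the reported 63.3% only to within rounding. The `f1_score` helper is written here for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


precision = 0.469  # reported precision, rounded
recall = 0.971     # reported recall, rounded

f1 = f1_score(precision, recall)
print(f"F1 recomputed from rounded inputs: {f1:.4f}")
print(f"Share of flagged items that were not code: {1 - precision:.1%}")
```

Read together, the figures describe a high-recall, low-precision detector on this dataset: nearly all code samples are flagged, but by the definition of precision more than half of the flagged items are labeled neutral in the benchmark.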
Ways to Use and Deploy this Evaluator
Basic Example
Here's how to incorporate the Code Detection - Semantic Evaluator in your Law:
if CodeDetection-Semantic in ScopeType then run InterventionType
Security, Compliance & Risk Assessment
Security Considerations
- Identifies attempts by an LLM to generate code of any kind, which may pose a security risk
Compliance & Privacy
- EU AI Act - supports compliance through semantic analysis of AI system outputs
- NIS Directive - supports cybersecurity by detecting code in AI exchanges that could be malicious if executed
Revision History
2025-02-22: Initial Release
- Initial implementation with code and neutral collections
- Documentation