Code Detection

A semantic analysis-based Evaluator that uses vector embeddings to identify code patterns and programming language constructs in text by comparing against curated collections of code and neutral content.

Code Detection - Semantic

Use Case: Unauthorized Code Detection
Analytic Engine: Semantic Similarity
Related OWASP Risks:
- LLM05: Improper Output Handling
- LLM06: Excessive Agency
Related Regulations:
- EU AI Act - Output Safety
- NIS Directive - System Security
Valid Inputs: Text
Scope: Full Exchange
Last Update: 2025-02-04
License: Apache 2.0
Dependencies: Vector Database

Detailed Description

The Evaluator leverages a vector database to perform semantic similarity searches against collections of code and neutral content. This approach provides more nuanced detection compared to pattern-based approaches, enabling it to catch code-like content even when it has been obfuscated or modified.
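
The comparison can be illustrated with a minimal sketch. It assumes cosine similarity as the metric and uses a toy character-trigram embedding in place of a real embedding model and vector database. The `embed` function, the miniature `COLLECTIONS`, and `closest_collection` are hypothetical stand-ins, not the Evaluator's actual implementation.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: character trigram counts."""
    text = text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical miniature collections; the real ones live in a vector database.
COLLECTIONS = {
    "code": [
        "def add(a, b):\n    return a + b",
        "for (let i = 0; i < n; i++) { total += i; }",
    ],
    "neutral": [
        "The weather is lovely today.",
        "Adding two numbers gives their sum.",
    ],
}

def closest_collection(text: str) -> tuple[str, dict[str, float]]:
    """Return the best-matching collection and the max similarity per collection."""
    query = embed(text)
    scores = {
        name: max(cosine(query, embed(doc)) for doc in docs)
        for name, docs in COLLECTIONS.items()
    }
    return max(scores, key=scores.get), scores
```

Because the score reflects overall similarity rather than exact keyword matches, text that only loosely resembles a stored code sample can still land closer to the code collection than to the neutral one.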

For quick pattern-based detection of common programming languages, consider using the CodeDetection-Search Evaluator. The Search and Semantic Evaluators can be used together for comprehensive code detection, with Search providing fast initial screening and Semantic providing deeper analysis.
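
One possible way to combine the two stages is sketched here, purely as an illustration. The regular expression is a hypothetical stand-in for the CodeDetection-Search Evaluator, and `semantic_check` is a placeholder callable representing the Semantic Evaluator. Neither reflects the actual implementations.

```python
import re
from typing import Callable

# Hypothetical keyword patterns standing in for the Search Evaluator.
FAST_PATTERNS = re.compile(
    r"\b(def|function|class|struct|impl|import)\b|[{};]\s*$",
    re.MULTILINE,
)

def detect_code(text: str, semantic_check: Callable[[str], bool]) -> str:
    """Return which stage flagged the text: 'search', 'semantic', or 'none'."""
    if FAST_PATTERNS.search(text):
        return "search"    # cheap pattern hit, no embedding lookup needed
    if semantic_check(text):
        return "semantic"  # deeper similarity analysis caught it
    return "none"
```

In this arrangement the slower semantic lookup runs only when the fast screen finds nothing. Running both Evaluators on every input is an equally valid design when latency is less of a concern.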

Input Descriptions

The Evaluator accepts text input and processes both Prompts and Responses.

Law Usage Example

Each of the following Responses would trigger the Evaluator since it contains code-like patterns:

Triggering Example
def calculate_sum(a, b):
    result = a + b
    return result

print(calculate_sum(5, 3))

Triggering Example
function calculateTotal(items) {
    return items.reduce((sum, item) => {
        return sum + item.price;
    }, 0);
}

const cart = [{price: 10}, {price: 20}];
console.log(calculateTotal(cart));

Triggering Example
struct Point {
    x: i32,
    y: i32,
}

impl Point {
    fn distance(&self, other: &Point) -> f64 {
        let dx = (self.x - other.x) as f64;
        let dy = (self.y - other.y) as f64;
        (dx * dx + dy * dy).sqrt()
    }
}

This Response would not trigger the Evaluator since it contains only natural language:

Non-Triggering Example

The sum of two numbers can be calculated by adding them together. For example, five plus three equals eight.

Output Descriptions

Returns a Finding containing the closest matching collection and the maximum similarity score for each collection:

Finding Structure
{
    "CodeDetector-Balanced.closest_collection": ["code", "neutral"],
    "CodeDetector-Balanced.code.max_similarity": [0-1],
    "CodeDetector-Balanced.neutral.max_similarity": [0-1]
}
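
A minimal sketch of consuming such a Finding follows. It assumes `closest_collection` holds one of the two listed values and that each similarity is a number between 0 and 1. The sample values, the `is_code` helper, and the `min_similarity` threshold are hypothetical and are applied by the caller, not by the Evaluator.

```python
# Hypothetical Finding with concrete values filled in.
finding = {
    "CodeDetector-Balanced.closest_collection": "code",
    "CodeDetector-Balanced.code.max_similarity": 0.91,
    "CodeDetector-Balanced.neutral.max_similarity": 0.42,
}

PREFIX = "CodeDetector-Balanced"

def is_code(finding: dict, min_similarity: float = 0.8) -> bool:
    """Flag as code when 'code' is the closest collection and clears a threshold."""
    closest = finding[f"{PREFIX}.closest_collection"]
    score = finding[f"{PREFIX}.code.max_similarity"]
    return closest == "code" and score >= min_similarity

print(is_code(finding))  # True
```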

Configuration Options

N/A

Data & Dependencies

Data Sources

Synthetically generated code examples.

Benchmarks

The Code Detection - Semantic Evaluator has been tested against one benchmark dataset to assess its effectiveness:

The ThirdLaw Code Detection Test Set is composed of the following open-source datasets:

andstor/the_pile_github (code)
lmsys/chatbot_arena_conversations (neutral)

Dataset	Sample Size	Accuracy	Precision	Recall	F1
Code Detection Test Set	18023 conversations	57.5%	46.9%	97.1%	63.3%

Benchmarks last updated: April 2025
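
As a quick consistency check on the table, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded figures reported above. The `f1_score` helper is illustrative only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Rounded values taken from the benchmark table.
precision, recall = 0.469, 0.971
f1 = f1_score(precision, recall)
print(f"F1 = {f1:.3f}")
```

The result is approximately 0.632, close to the reported 63.3%. The small gap is consistent with the inputs being rounded. The combination of high recall and low precision indicates that the Evaluator misses very little code but flags a substantial share of non-code text: fewer than half of the flagged samples in this benchmark were actually code.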

Ways to Use and Deploy this Evaluator

Basic Example

Here's how to incorporate the Code Detection - Semantic Evaluator in your Law:

ThirdLaw DSL
if CodeDetection-Semantic in ScopeType then run InterventionType

Security, Compliance & Risk Assessment

Security Considerations

Identifies LLM attempts to generate code of any kind, which may pose a security risk.

Compliance & Privacy

EU AI Act - supports compliance through semantic analysis of AI system outputs
NIS Directive - supports cybersecurity by detecting code in AI system outputs that could be malicious if executed

Revision History

2025-02-22: Initial Release

Initial implementation with code and neutral collections
Documentation
