Semantic Similarity

The Semantic Similarity Analytic Engine performs deep semantic analysis of text to identify thematic relationships, conceptual similarities, and content patterns against pre-defined reference collections.

Semantic Similarity Analytic Engine

Use Case: Identify conceptual and thematic relationships in content
Technology: Vector-based semantic comparison
Valid Inputs: Text / Image / Audio / Video / Documents in Exchanges
Available Evaluators:
- Code Detection
Last Engine Update: 2025-03-03
Dependencies: Vector database

Detailed Description

The Semantic Similarity Analytic Engine builds on vector embeddings to detect higher-level semantic relationships and thematic similarities between input content and reference collections. Unlike the Search engine, which focuses on direct similarity matches, this engine specializes in broader semantic connections and conceptual relationships. All vector embeddings and databases used by this engine are maintained within the ThirdLaw VPC to ensure data security and privacy.

How It Works

The Semantic Similarity Analytic Engine processes content by analyzing vector embeddings to identify thematic and conceptual similarities. When an Evaluator using this engine is initialized, it connects to a vector database that contains pre-computed embeddings of various reference collections. These collections typically represent different categories of content, such as code samples, legal documents, or controversial topics.

When processing input content, the engine first converts the input into vector embeddings compatible with the reference collections. It then compares these embeddings against multiple reference collections simultaneously through vector similarity operations. The engine calculates similarity scores between the input and each reference collection, taking into account the semantic proximity and conceptual overlap rather than exact text matches.
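The comparison step can be illustrated with a minimal sketch. It assumes cosine similarity as the vector similarity operation (the engine's actual metric is not stated here) and uses hypothetical collection names and toy two-dimensional embeddings; `cosine_similarity` and `max_similarity_per_collection` are illustrative helpers, not part of the engine's API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as having no similarity
    return dot / (norm_a * norm_b)


def max_similarity_per_collection(input_embedding, collections):
    """Best cosine score of the input against each collection's reference vectors."""
    return {
        name: max(cosine_similarity(input_embedding, ref) for ref in refs)
        for name, refs in collections.items()
    }


# Toy 2-D embeddings (assumption); real embeddings have far more dimensions.
reference_collections = {
    "code_samples": [[1.0, 0.0], [0.8, 0.6]],
    "legal_documents": [[0.0, 1.0]],
}
input_embedding = [1.0, 0.0]

scores = max_similarity_per_collection(input_embedding, reference_collections)
print(scores)
```

Because the score depends only on the direction of the vectors, two texts with no words in common can still score highly if their embeddings point the same way, which is what distinguishes this comparison from exact text matching.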

After comparison, the engine ranks the collections based on their semantic proximity to the input, identifying which reference collections are most similar to the input content. To provide a comprehensive assessment, the engine calculates average similarity scores across all segments of the input, enabling detection of semantic relationships even when only portions of the content match a reference collection. This approach makes the Semantic Similarity Analytic Engine particularly effective for identifying conceptual similarities that might be missed by pattern-based approaches.
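A rough sketch of the averaging and ranking step follows. It assumes the input has already been split into segments and that each segment has one similarity score per collection; the collection names, the scores, and the `rank_collections` helper are hypothetical.

```python
from statistics import mean


def rank_collections(segment_scores):
    """Average each collection's per-segment scores and sort closest first.

    segment_scores maps a collection name to one score per input segment.
    """
    averages = {name: mean(scores) for name, scores in segment_scores.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores for an input split into three segments.
segment_scores = {
    "code_samples": [0.9, 0.7, 0.2],     # only the first two segments look like code
    "legal_documents": [0.3, 0.4, 0.5],
}

ranking = rank_collections(segment_scores)
closest_name, closest_average = ranking[0]
print(closest_name, round(closest_average, 2))
```

In this toy input the third segment barely resembles code, yet the averaged score still ranks `code_samples` first, which mirrors the partial-match behavior described above.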

Configuration Options

The Semantic Similarity Analytic Engine supports the following configuration parameters:

Parameter	Description	Default
`collections`	List of collections to analyze against	Required
`desired_collections`	List of collections that should be closest to the input	Required
`top_n`	Number of top results to retrieve per query	5
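As an illustration of how these parameters fit together, the sketch below represents a configuration as a Python dictionary and fills in the documented default for `top_n`. The dictionary form, the `normalize_config` helper, and the collection names are assumptions made for the example; the format in which Evaluators are actually configured is not specified here.

```python
DEFAULT_TOP_N = 5  # default value from the parameter table

REQUIRED_PARAMETERS = ("collections", "desired_collections")


def normalize_config(raw):
    """Check the required parameters and apply the default for top_n."""
    missing = [key for key in REQUIRED_PARAMETERS if not raw.get(key)]
    if missing:
        raise ValueError(f"missing required parameter(s): {', '.join(missing)}")
    return {
        "collections": list(raw["collections"]),
        "desired_collections": list(raw["desired_collections"]),
        "top_n": int(raw.get("top_n", DEFAULT_TOP_N)),
    }


# Hypothetical configuration: compare against two collections and
# treat a closest match on "code_samples" as the desired outcome.
config = normalize_config({
    "collections": ["code_samples", "legal_documents"],
    "desired_collections": ["code_samples"],
})
print(config["top_n"])
```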

Finding Structure

A generic Evaluator based on the Semantic Similarity Analytic Engine returns a Finding with the following structure. The per-collection max_similarity field is repeated once for each collection defined in the Evaluator, with that collection's name in the key.

{
    "EvaluatorName-Semantic": True/False,                      # Default Finding (duplicate of .found)
    "EvaluatorName-Semantic.found": True/False,                # Returns True if the closest collection matches one of the desired collections
    "EvaluatorName-Semantic.closest_collection": "string",     # Returns the name of the closest collection
    "EvaluatorName-Semantic.closest_collection.max_similarity": [0-1],     # Returns the max similarity of the closest collection
    "EvaluatorName-Semantic.<collection_name>.max_similarity": [0-1],      # Returns the maximum similarity score achieved by that collection (one entry per collection)
}
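To show how the fields relate to one another, the following sketch assembles a Finding from per-collection scores. It rests on several assumptions: the similarity fields are numeric scores in the range 0 to 1, the closest collection is the one with the highest max similarity, and the default Finding carries the same value as `.found`. The evaluator name, collection names, scores, and the `build_finding` helper are hypothetical.

```python
def build_finding(evaluator_name, max_similarities, desired_collections):
    """Assemble a Finding dictionary from per-collection max similarity scores."""
    prefix = f"{evaluator_name}-Semantic"
    # Assumption: "closest" means the collection with the highest max similarity.
    closest = max(max_similarities, key=max_similarities.get)
    found = closest in desired_collections
    finding = {
        prefix: found,  # assumption: the default Finding mirrors `.found`
        f"{prefix}.found": found,
        f"{prefix}.closest_collection": closest,
        f"{prefix}.closest_collection.max_similarity": max_similarities[closest],
    }
    # One max_similarity entry per collection defined in the Evaluator.
    for name, score in max_similarities.items():
        finding[f"{prefix}.{name}.max_similarity"] = score
    return finding


finding = build_finding(
    "CodeDetection",
    {"code_samples": 0.82, "legal_documents": 0.41},
    desired_collections=["code_samples"],
)
for key, value in finding.items():
    print(key, "=", value)
```

Under these assumptions, `.found` is True only when the top-ranked collection is one of the desired collections, so a high score on a collection outside that list would still produce a False Finding.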

Available Evaluators

The following table lists common Evaluators that can be created using the Semantic Similarity Analytic Engine:

Evaluator Name	Description	Common Use Cases
Code Detection	Detects code patterns and programming language constructs in text	Unauthorized code detection, vulnerability analysis

Dependencies

Vector Database: Vector database for storing and searching embeddings
Embedding Pipeline: Requires pre-computed embeddings from a Search Engine or similar processor

Revision History

2025-03-03: Initial documentation creation
