
Semantic Similarity

The Semantic Similarity Analytic Engine performs deep semantic analysis of text to identify thematic relationships, conceptual similarities, and content patterns against pre-defined reference collections.

Semantic Similarity Analytic Engine
  • Use Case: Identify conceptual and thematic relationships in content
  • Technology: Vector-based semantic comparison
  • Valid Inputs: Text / Image / Audio / Video / Documents in Exchanges
  • Available Evaluators:
  • Last Engine Update: 2025-03-03
  • Dependencies: Vector database

Detailed Description

The Semantic Similarity Analytic Engine builds on vector embeddings to detect higher-level semantic relationships and thematic similarities between input content and reference collections. Unlike the Search engine, which focuses on direct similarity matches, this engine specializes in broader semantic connections and conceptual relationships. All vector embeddings and databases used by this engine are maintained within the ThirdLaw VPC to ensure data security and privacy.

How It Works

The Semantic Similarity Analytic Engine processes content by analyzing vector embeddings to identify thematic and conceptual similarities. When an Evaluator using this engine is initialized, it connects to a vector database that contains pre-computed embeddings of various reference collections. These collections typically represent different categories of content, such as code samples, legal documents, or controversial topics.

When processing input content, the engine first converts the input into vector embeddings compatible with the reference collections. It then compares these embeddings against multiple reference collections simultaneously through vector similarity operations. The engine calculates similarity scores between the input and each reference collection, taking into account the semantic proximity and conceptual overlap rather than exact text matches.
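The comparison step can be sketched as follows. This is a minimal illustration under stated assumptions, not the engine's implementation: it assumes cosine similarity as the vector similarity operation, uses tiny hand-written vectors in place of real embeddings, and assumes top_n limits how many of the nearest reference vectors are kept per collection. The names cosine_similarity and score_collections are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction; treat as no similarity
    return dot / (norm_a * norm_b)

def score_collections(query, collections, top_n=5):
    """Return, per collection, the top_n similarities to the query, best first."""
    scores = {}
    for name, vectors in collections.items():
        sims = sorted((cosine_similarity(query, v) for v in vectors), reverse=True)
        scores[name] = sims[:top_n]
    return scores

# Toy 3-dimensional "embeddings", made up for illustration only.
reference = {
    "code_samples": [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],
    "legal_documents": [[0.0, 1.0, 0.0], [0.0, 0.9, 0.1]],
}
query_vector = [0.8, 0.2, 0.0]
scores = score_collections(query_vector, reference, top_n=1)
```

In this toy data the query points mostly along the first axis, so its best match in code_samples scores higher than its best match in legal_documents, even though no reference vector is identical to the query.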

After comparison, the engine ranks the collections based on their semantic proximity to the input, identifying which reference collections are most similar to the input content. To provide a comprehensive assessment, the engine calculates average similarity scores across all segments of the input, enabling detection of semantic relationships even when only portions of the content match a reference collection. This approach makes the Semantic Similarity Analytic Engine particularly effective for identifying conceptual similarities that might be missed by pattern-based approaches.
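The averaging and ranking described above can be sketched like this. It is an illustration only: the per-segment scores are invented, and rank_collections is a hypothetical helper that assumes every segment has a score for every collection.

```python
from statistics import mean

def rank_collections(segment_scores):
    """Average each collection's score over all segments and rank, best first."""
    if not segment_scores:
        return []
    names = segment_scores[0].keys()
    averages = {name: mean(seg[name] for seg in segment_scores) for name in names}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)

# One dict per input segment: collection name -> similarity (made-up values).
segment_scores = [
    {"code_samples": 0.91, "legal_documents": 0.20},
    {"code_samples": 0.35, "legal_documents": 0.30},
    {"code_samples": 0.88, "legal_documents": 0.25},
]
ranking = rank_collections(segment_scores)
closest_collection, closest_average = ranking[0]
```

Here the second segment barely resembles code_samples, yet the collection still ranks first because the average over all three segments remains the highest.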

Configuration Options

The Semantic Similarity Analytic Engine supports the following configuration parameters:

Parameter | Description | Default
collections | List of collections to analyze against | Required
desired_collections | List of collections that should be closest to the input | Required
top_n | Number of top results to retrieve per query | 5
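As a rough sketch, the parameters might be assembled and checked as below. The dictionary layout, the collection names, and the validate_config helper are hypothetical, since the actual configuration format is not shown here. The check that desired_collections is a subset of collections is also an assumption.

```python
DEFAULT_TOP_N = 5  # default stated in the parameter table

def validate_config(config):
    """Check the required parameters and fill in the documented default."""
    for key in ("collections", "desired_collections"):
        if not config.get(key):
            raise ValueError(f"'{key}' is required and must be a non-empty list")
    # Assumption: every desired collection must also be an analyzed collection.
    unknown = set(config["desired_collections"]) - set(config["collections"])
    if unknown:
        raise ValueError(f"desired_collections not in collections: {sorted(unknown)}")
    return {**config, "top_n": config.get("top_n", DEFAULT_TOP_N)}

# Hypothetical collection names, for illustration only.
config = validate_config({
    "collections": ["code_samples", "legal_documents", "controversial_topics"],
    "desired_collections": ["code_samples"],
})
```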

Finding Structure

A generic Evaluator based on the Semantic Similarity Analytic Engine returns a Finding with the following structure. The collection.max_similarity field is repeated once for each collection defined in the Evaluator.

{
"EvaluatorName-Semantic": True/False, # Default Finding (duplicate of finding.any)
"EvaluatorName-Semantic.found": True/False, # Returns True if the closest collection matches one of the desired collections
"EvaluatorName-Semantic.closest_collection": "string", # Returns the name of the closest collection
"EvaluatorName-Semantic.closest_collection.max_similarity": [0-1], # Returns the max similarity of the closest collection
"EvaluatorName-Semantic.collection.max_similarity": [0-1], # Returns the maximum similarity score achieved by the collection
}
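The sketch below shows one way the fields above could be derived from per-collection maximum similarity scores. It is an assumption-laden illustration, not the engine's code: build_finding is a hypothetical helper, the closest collection is assumed to be the one with the highest maximum similarity, the per-collection key is assumed to substitute the collection's name for "collection", and similarity values are kept as floats.

```python
def build_finding(evaluator_name, max_similarities, desired_collections):
    """Assemble a Finding-like dict from per-collection max similarity scores."""
    prefix = f"{evaluator_name}-Semantic"
    # Assumption: "closest" means the highest maximum similarity.
    closest = max(max_similarities, key=max_similarities.get)
    found = closest in desired_collections
    finding = {
        prefix: found,  # default Finding, assumed here to mirror .found
        f"{prefix}.found": found,
        f"{prefix}.closest_collection": closest,
        f"{prefix}.closest_collection.max_similarity": max_similarities[closest],
    }
    # One max_similarity entry per defined collection.
    for name, score in max_similarities.items():
        finding[f"{prefix}.{name}.max_similarity"] = score
    return finding

# Made-up scores and names, for illustration only.
finding = build_finding(
    "CodeDetection",
    {"code_samples": 0.93, "legal_documents": 0.41},
    desired_collections=["code_samples"],
)
```

With these made-up scores, code_samples is the closest collection and is also a desired collection, so the found field is True.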

Available Evaluators

The following table lists common Evaluators that can be created using the Semantic Similarity Analytic Engine:

Evaluator Name | Description | Common Use Cases
Code Detection | Detects code patterns and programming language constructs in text | Unauthorized code detection, vulnerability analysis

Dependencies

  • Vector Database: Stores and searches the embeddings of the reference collections
  • Embedding Pipeline: Requires pre-computed embeddings from a Search Engine or similar processor

Revision History

  • 2025-03-03: Initial documentation creation