Overview
The key to building production-ready LLM applications is a tight feedback loop between prompt engineering and evaluation. Whether you are optimizing a chatbot, building a Retrieval-Augmented Generation (RAG) pipeline, or tuning a text generation task, evaluation is a critical step for ensuring consistent performance across different inputs, models, and parameters. This section explains how to use Lexica to quickly evaluate and compare the performance of your LLM applications.
Available evaluators
| Evaluator Name | Use Case | Type | Description |
| --- | --- | --- | --- |
| Exact Match | Classification/Entity Extraction | Pattern Matching | Checks if the output exactly matches the expected result. |
| Contains JSON | Classification/Entity Extraction | Pattern Matching | Ensures the output contains valid JSON. |
| Regex Test | Classification/Entity Extraction | Pattern Matching | Checks if the output matches a given regex pattern. |
| JSON Field Match | Classification/Entity Extraction | Pattern Matching | Compares specific fields within JSON data. |
| JSON Diff Match | Classification/Entity Extraction | Similarity Metrics | Compares generated JSON with a ground-truth JSON based on schema or values. |
| Similarity Match | Text Generation / Chatbot | Similarity Metrics | Compares the generated output with the expected output using Jaccard similarity. |
| Semantic Similarity Match | Text Generation / Chatbot | Semantic Analysis | Compares the meaning of the generated output with the expected result. |
| Starts With | Text Generation / Chatbot | Pattern Matching | Checks if the output starts with a specified prefix. |
| Ends With | Text Generation / Chatbot | Pattern Matching | Checks if the output ends with a specified suffix. |
| Contains | Text Generation / Chatbot | Pattern Matching | Checks if the output contains a specific substring. |
| Contains Any | Text Generation / Chatbot | Pattern Matching | Checks if the output contains any of a list of substrings. |
| Contains All | Text Generation / Chatbot | Pattern Matching | Checks if the output contains all of a list of substrings. |
| Levenshtein Distance | Text Generation / Chatbot | Similarity Metrics | Calculates the Levenshtein distance between the output and the expected result. |
| LLM-as-a-judge | Text Generation / Chatbot | LLM-based | Sends outputs to an LLM for critique and evaluation. |
| RAG Faithfulness | RAG / Text Generation / Chatbot | LLM-based | Evaluates whether the output is faithful to the retrieved documents in RAG workflows. |
| RAG Context Relevancy | RAG / Text Generation / Chatbot | LLM-based | Measures the relevancy of the retrieved documents to the given question in RAG. |
| Custom Code Evaluation | Custom Logic | Custom | Allows users to define their own evaluator in Python (see the sketch below). |
| Webhook Evaluator | Custom Logic | Custom | Sends output to a webhook for external evaluation. |
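To make the pattern-matching and similarity rows above concrete, here is a minimal Python sketch of the logic they describe. It is illustrative only, not Lexica's internal implementation; the function names and the 0.0 to 1.0 scoring convention are assumptions made for this example.

```python
# Illustrative sketches of a few built-in evaluator checks.
# These are NOT Lexica's internal implementations; names and the
# 0.0-1.0 scoring convention are assumptions for this example.

def exact_match(output: str, expected: str) -> float:
    """Exact Match: 1.0 only if the output equals the expected result."""
    return 1.0 if output.strip() == expected.strip() else 0.0

def contains_all(output: str, substrings: list[str]) -> float:
    """Contains All: 1.0 only if every substring appears in the output."""
    return 1.0 if all(s in output for s in substrings) else 0.0

def jaccard_similarity(output: str, expected: str) -> float:
    """Similarity Match: Jaccard similarity over word sets."""
    a, b = set(output.lower().split()), set(expected.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

print(exact_match("positive", "positive"))                          # 1.0
print(contains_all("The cat sat on the mat", ["cat", "mat"]))       # 1.0
print(jaccard_similarity("the quick brown fox", "the brown fox"))   # 0.75
```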
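The Custom Code Evaluation entry lets you write your own evaluator in Python. Below is a minimal sketch of what such an evaluator could look like; the `evaluate` name, its parameters, and the float return value are assumptions made for illustration, so refer to Lexica's custom evaluator reference for the exact signature it expects.

```python
# Hypothetical custom code evaluator. The function name, parameters,
# and float return value are assumptions for illustration; Lexica's
# actual custom-evaluator signature may differ.
import json

def evaluate(inputs: dict, output: str, correct_answer: str) -> float:
    """Score 1.0 if the output is valid JSON whose 'label' field matches
    the expected label (case-insensitive), otherwise 0.0."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    predicted = str(parsed.get("label", "")).strip().lower()
    return 1.0 if predicted == correct_answer.strip().lower() else 0.0

# Example: scores 1.0 because the extracted label matches the ground truth.
print(evaluate({}, '{"label": "Positive"}', "positive"))
```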
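Similarly, the Webhook Evaluator posts the output to an HTTP endpoint you control and reads back a score. The Flask app below is a hypothetical receiver; the request and response fields (`output`, `correct_answer`, `score`) and the `/evaluate` path are assumptions for illustration, not Lexica's documented webhook contract.

```python
# Hypothetical webhook evaluator endpoint. The payload fields and the
# /evaluate path are assumptions for illustration; check Lexica's
# webhook evaluator documentation for the actual contract.
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/evaluate", methods=["POST"])
def evaluate():
    payload = request.get_json(force=True)
    output = payload.get("output", "")
    expected = payload.get("correct_answer", "")
    # Any external logic can run here; this example reuses exact matching.
    score = 1.0 if output.strip() == expected.strip() else 0.0
    return jsonify({"score": score})

if __name__ == "__main__":
    app.run(port=8000)
```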