Overview

The key to building production-ready LLM applications is a tight feedback loop between prompt engineering and evaluation. Whether you are optimizing a chatbot, building a Retrieval-Augmented Generation (RAG) pipeline, or tuning a text generation task, evaluation is a critical step to ensure consistent performance across different inputs, models, and parameters. In this section, we explain how to use Lexica to quickly evaluate and compare the performance of your LLM applications.

Available evaluators

| Evaluator Name | Use Case | Type | Description |
| --- | --- | --- | --- |
| Exact Match | Classification/Entity Extraction | Pattern Matching | Checks if the output exactly matches the expected result. |
| Contains JSON | Classification/Entity Extraction | Pattern Matching | Ensures the output contains valid JSON. |
| Regex Test | Classification/Entity Extraction | Pattern Matching | Checks if the output matches a given regex pattern. |
| JSON Field Match | Classification/Entity Extraction | Pattern Matching | Compares specific fields within JSON data. |
| JSON Diff Match | Classification/Entity Extraction | Similarity Metrics | Compares generated JSON with a ground-truth JSON based on schema or values. |
| Similarity Match | Text Generation / Chatbot | Similarity Metrics | Compares the generated output with the expected result using Jaccard similarity. |
| Semantic Similarity Match | Text Generation / Chatbot | Semantic Analysis | Compares the meaning of the generated output with the expected result. |
| Starts With | Text Generation / Chatbot | Pattern Matching | Checks if the output starts with a specified prefix. |
| Ends With | Text Generation / Chatbot | Pattern Matching | Checks if the output ends with a specified suffix. |
| Contains | Text Generation / Chatbot | Pattern Matching | Checks if the output contains a specific substring. |
| Contains Any | Text Generation / Chatbot | Pattern Matching | Checks if the output contains any of a list of substrings. |
| Contains All | Text Generation / Chatbot | Pattern Matching | Checks if the output contains all of a list of substrings. |
| Levenshtein Distance | Text Generation / Chatbot | Similarity Metrics | Calculates the Levenshtein distance between the output and the expected result. |
| LLM-as-a-judge | Text Generation / Chatbot | LLM-based | Sends the output to an LLM for critique and evaluation. |
| RAG Faithfulness | RAG / Text Generation / Chatbot | LLM-based | Evaluates whether the output is faithful to the retrieved documents in RAG workflows. |
| RAG Context Relevancy | RAG / Text Generation / Chatbot | LLM-based | Measures the relevancy of the retrieved documents to the given question in RAG. |
| Custom Code Evaluation | Custom Logic | Custom | Allows you to define your own evaluator in Python (see the sketch below the table). |
| Webhook Evaluator | Custom Logic | Custom | Sends the output to a webhook for external evaluation. |
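
The pattern-matching and similarity evaluators above are deterministic, so their logic is easy to reproduce in a custom code evaluator. Below is a minimal sketch of what such an evaluator could look like; the `evaluate` function name, its signature, and the 0-to-1 score convention are illustrative assumptions, not Lexica's actual interface. It falls back to a Jaccard similarity score (the same measure used by the built-in Similarity Match evaluator) when the output is not an exact match.

```python
# Hypothetical custom evaluator sketch. The function name, signature, and
# return convention are assumptions for illustration, not Lexica's API.

def evaluate(app_output: str, expected_output: str) -> float:
    """Score the output in [0, 1]: 1.0 for an exact match, otherwise the
    Jaccard similarity between the token sets of the two strings."""
    if app_output.strip() == expected_output.strip():
        return 1.0

    output_tokens = set(app_output.lower().split())
    expected_tokens = set(expected_output.lower().split())
    if not output_tokens or not expected_tokens:
        # One side is empty and the other is not (the exact-match branch
        # already handled the case where both are empty).
        return 0.0

    intersection = output_tokens & expected_tokens
    union = output_tokens | expected_tokens
    return len(intersection) / len(union)


if __name__ == "__main__":
    # Overlapping but not identical sentences score between 0 and 1.
    print(evaluate("The cat sat on the mat", "A cat sat on the mat"))  # ~0.83
```

The Webhook Evaluator follows the same idea, except the scoring logic runs behind an HTTP endpoint that you host and that receives the output for external evaluation.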
