Defining Metrics in Deepeval

In this section, we’ll dive into the range of LLM evaluation metrics available in Deepeval and select those that are most relevant for evaluating our medical chatbot. Deepeval offers 4 types of metrics:

  • LLM System Metrics: Evaluate single LLM input-output pairs.
  • Conversational Metrics: Assess entire LLM interactions.
  • Custom Metrics: Allow for personalized evaluation using user-defined evaluation criteria.
  • Multimodal Metrics: Analyze systems handling multiple types of data, such as text and images.
info

LLM-based scorers, which use LLMs as judges, are among the most effective evaluation metrics. To learn more about this approach, consider reading the articles on Deepeval’s system metrics and chain of thought prompting.

For this tutorial, we'll concentrate on LLM system, conversational, and custom metrics, as our use case revolves around a chatbot, which does not involve image processing.

Setting Up

To get started with Deepeval, first install the library using pip. Open your terminal and run:

pip install deepeval

Next, log in to Confident AI. This allows anyone on your team, technical or not, to view, compare, analyze, collaborate on, and iterate on your evaluation results. Use the following command to log in:

deepeval login

LLM System Metrics

Deepeval's LLM system metrics include 5 RAG metrics, as well as hallucination, summarization, and tool correctness. Since our medical chatbot utilizes a RAG engine and various functional tools, we will employ all 5 RAG metrics along with Tool Correctness. We’ll also be focusing on Hallucination, since any inaccuracies in medical diagnosis can be very harmful.

Let’s begin by defining our key metrics. Deepeval’s five RAG metrics are [Answer Relevancy](metrics-answer-relevancy), Contextual Precision, Contextual Recall, Contextual Relevancy, and Faithfulness.

from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    ContextualRelevancyMetric,
)

model = "gpt-4"
include_reason = True
threshold = 0.7

answer_relevancy_metric = AnswerRelevancyMetric(
    threshold=threshold,
    model=model,
    include_reason=include_reason
)

faithfulness_metric = FaithfulnessMetric(
    threshold=threshold,
    model=model,
    include_reason=include_reason
)

contextual_precision_metric = ContextualPrecisionMetric(
    threshold=threshold,
    model=model,
    include_reason=include_reason
)

contextual_recall_metric = ContextualRecallMetric(
    threshold=threshold,
    model=model,
    include_reason=include_reason
)

contextual_relevancy_metric = ContextualRelevancyMetric(
    threshold=threshold,
    model=model,
    include_reason=include_reason
)

Next, we'll define our Tool Correctness and Hallucination metrics:

from deepeval.metrics import (
    ToolCorrectnessMetric,
    HallucinationMetric,
)

tool_correctness_metric = ToolCorrectnessMetric()

hallucination_metric = HallucinationMetric(
    threshold=threshold,
    model=model,
    include_reason=include_reason
)
note

In the examples above, we simply use "gpt-4" as the evaluation model, which requires an OPENAI_API_KEY. If you're interested in using non-OpenAI models for evaluation, you'll want to visit this guide on adding custom models.
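
Before running full evaluations, it can help to sanity-check these metric objects on a single test case. The sketch below is illustrative only: the question, answer, and retrieved passages are made up, and it assumes Deepeval's LLMTestCase fields (input, actual_output, retrieval_context, context) and the evaluate function.

from deepeval import evaluate
from deepeval.test_case import LLMTestCase

# Hypothetical single-turn exchange with the medical chatbot
test_case = LLMTestCase(
    input="I've had a sore throat and a mild fever for two days. What could it be?",
    actual_output=(
        "Your symptoms are consistent with a viral infection such as pharyngitis. "
        "Rest, fluids, and monitoring your temperature are advisable."
    ),
    # Chunks returned by the RAG engine (used by the RAG metrics)
    retrieval_context=[
        "Viral pharyngitis typically presents with a sore throat and low-grade fever."
    ],
    # Ground-truth context (required by the Hallucination metric)
    context=[
        "Viral pharyngitis typically presents with a sore throat and low-grade fever."
    ],
)

# Run a subset of the metrics defined above against this single test case
evaluate([test_case], [answer_relevancy_metric, faithfulness_metric, hallucination_metric])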

Conversational Metrics

Additionally, we’ll be utilizing all of Deepeval’s "conversational metrics" to evaluate entire conversation threads between users and our medical chatbot. These metrics include:

  • Role Adherence
  • Knowledge Retention
  • Conversation Completeness
  • Conversation Relevancy

info

You'll need to set chatbot_role inside your ConversationalTestCase when using the Role Adherence metric. We’ll learn more about this in the following sections.

from deepeval.metrics import (
    RoleAdherenceMetric,
    KnowledgeRetentionMetric,
    ConversationCompletenessMetric,
    ConversationRelevancyMetric,
)

conversational_threshold = 0.7

role_adherence_metric = RoleAdherenceMetric(
    threshold=conversational_threshold,
    model=model,
    include_reason=include_reason
)

knowledge_retention_metric = KnowledgeRetentionMetric(
    threshold=conversational_threshold,
    model=model,
    include_reason=include_reason
)

conversation_completeness_metric = ConversationCompletenessMetric(
    threshold=conversational_threshold,
    model=model,
    include_reason=include_reason
)

conversation_relevancy_metric = ConversationRelevancyMetric(
    threshold=conversational_threshold,
    model=model,
    include_reason=include_reason
)
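
As a rough illustration of the chatbot_role requirement mentioned in the info box above, here is a minimal sketch of a ConversationalTestCase. It assumes the Deepeval version in which turns is a list of LLMTestCase objects, and the conversation itself is invented.

from deepeval.test_case import ConversationalTestCase, LLMTestCase

# Hypothetical two-turn conversation; chatbot_role is needed by the Role Adherence metric
convo_test_case = ConversationalTestCase(
    chatbot_role="a professional, empathetic medical assistant",
    turns=[
        LLMTestCase(
            input="Hi, I've been waking up with headaches every morning this week.",
            actual_output="I'm sorry to hear that. Where is the pain located, and how severe is it?",
        ),
        LLMTestCase(
            input="It's a dull pain behind my eyes, around 4 out of 10.",
            actual_output="Thank you. Morning headaches behind the eyes can have several causes; let me ask a few more questions.",
        ),
    ],
)

# Score the whole thread with one of the conversational metrics defined above
role_adherence_metric.measure(convo_test_case)
print(role_adherence_metric.score, role_adherence_metric.reason)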

Custom Metrics

Finally, we’ll be defining a "custom metric" for our medical chatbot. Deepeval’s GEval metric lets us define a custom metric for any use case by providing either a string describing the evaluation criteria or a list of strings describing the evaluation steps.

note

You’ll need to iterate on and fine-tune these criteria to minimize bias during evaluation. You can learn more about how to fine-tune your custom metrics in this guide.

Diagnosis Specificity

The Diagnosis Specificity Metric evaluates how well the chatbot narrows down the diagnosis based on the symptoms provided. It rewards specificity when symptoms justify a detailed diagnosis, ensuring the chatbot provides meaningful and actionable responses.

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

diagnosis_specificity_metric = GEval(
    name="Diagnosis Specificity",
    criteria="Evaluate how well the diagnosis provided matches the most specific and accurate diagnosis based on the given symptoms.",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
    ],
)

Overdiagnosis

The Overdiagnosis Metric ensures that the chatbot avoids providing overly specific diagnoses when the symptoms are insufficient. This helps to prevent unnecessary anxiety or inappropriate medical advice.

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

overdiagnosis_metric = GEval(
    name="Overdiagnosis",
    criteria="Evaluate whether the diagnosis is overly specific given the symptoms, avoiding unnecessary precision that could lead to overdiagnosis.",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
    ],
)
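
As noted earlier, GEval also accepts evaluation_steps in place of a criteria string. Below is a sketch of the same Overdiagnosis metric expressed with hypothetical evaluation steps; treat the wording as a starting point and tune it to your own requirements.

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

# Hypothetical evaluation steps standing in for the criteria string above
overdiagnosis_metric = GEval(
    name="Overdiagnosis",
    evaluation_steps=[
        "Check whether the symptoms in the input are specific enough to justify a narrow diagnosis.",
        "Penalize the output if it commits to a highly specific diagnosis the symptoms do not support.",
        "Reward outputs that acknowledge uncertainty and recommend follow-up when symptoms are vague.",
    ],
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
    ],
)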