Guardrails for LLMs in Production

Confident AI lets you place guards on your LLM application with a single line of code to prevent it from generating unsafe responses. You can think of these guards as binary metrics that quickly evaluate the safety of an input/response pair. Confident AI offers 40+ guards that together cover 40+ LLM vulnerabilities.

Guarding Live Responses

To begin guarding LLM responses, call the deepeval.guard(...) function within your LLM application.

import deepeval
from deepeval.guardrails import Guard

safety_scores = deepeval.guard(
    input="Tell me more about the effects of global warming.",
    response="Global warming is a fake phenomenon and just pseudo-science.",
    guards=[Guard.HALLUCINATION, Guard.BIAS],
    purpose="Environmental education"
)

There are two mandatory and five optional parameters when using the guard() function:

input: A string that represents the user query to your LLM application.
response: A string that represents the output generated by your LLM application in response to the user input.
[Optional] guards: A list of Guard enums specifying the guards to be used. Defaults to using all available Guards.
[Optional] purpose: A string representing the purpose of your LLM application. Defaults to None.
[Optional] allowed_entities: A list of strings representing the names, brands, and organizations that are permitted to be mentioned in the response. Defaults to None.
[Optional] system_prompt: A string representing your system prompt. Defaults to None.
[Optional] include_reason: A boolean that, when set to True, returns the reason each guard passed or failed. Defaults to False.

Some guards require a purpose, some require allowed_entities, and some require both. You must pass these parameters to the guard() function if any guard in your list needs them. The Guard Requirements section below lists what each guard requires.

note

Alternatively, you can choose to provide your system prompt instead of directly providing purpose and allowed_entities, although this will greatly slow down the guardrails.
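In a live request path you usually want to act on the guard results, not just collect them. The following is a minimal sketch of one way to do that, not part of the deepeval API: the guarded_reply helper, its fallback message, and the fake_guard stub are all hypothetical names introduced here. The sketch assumes guard() returns a list of dictionaries with 'guard' and 'score' keys, as in the sample output on this page. The guard function is passed in as an argument so the example runs offline; in a real application you would pass deepeval.guard instead of the stub.

```python
from typing import Callable, Dict, List, Optional

GuardResult = Dict[str, object]


def guarded_reply(
    guard_fn: Callable[..., List[GuardResult]],
    user_input: str,
    response: str,
    guards: Optional[list] = None,
    purpose: Optional[str] = None,
    fallback: str = "Sorry, I can't help with that.",
) -> str:
    """Return `response` only if every guard passes; otherwise return `fallback`."""
    # Keyword names (input, response, guards, purpose) follow the parameter list above.
    kwargs = {"input": user_input, "response": response}
    if guards is not None:
        kwargs["guards"] = guards
    if purpose is not None:
        kwargs["purpose"] = purpose

    results = guard_fn(**kwargs)

    # Assumption: each result carries a binary 'score' (1 = safe, 0 = flagged).
    if all(result["score"] == 1 for result in results):
        return response
    return fallback


def fake_guard(input, response, guards=None, purpose=None):
    """Hypothetical stand-in for deepeval.guard so the sketch runs without a network call."""
    flagged = "pseudo-science" in response
    return [{"guard": "Hallucination", "score": 0 if flagged else 1}]


reply = guarded_reply(
    fake_guard,
    user_input="Tell me more about the effects of global warming.",
    response="Global warming is a fake phenomenon and just pseudo-science.",
    purpose="Environmental education",
)
print(reply)
```

Injecting the guard function also makes the wrapper easy to unit test, since the network-backed call can be replaced with a stub.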

Interpreting Guardrail Results

Each Guard assigns your model's response a binary score. A score of 1 indicates the LLM is not vulnerable, while a score of 0 indicates susceptibility. You can read the guardrail scores from the return value of the guard() function.

# print(safety_scores)
[
    {'guard': 'Bias', 'score': 1},
    {'guard': 'Hallucination', 'score': 0}
]
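To make the result format concrete, here is a small sketch that separates passing guards from failing ones. The split_results helper is a hypothetical name, and the sketch assumes the result is a list of dictionaries with 'guard' and 'score' keys, as in the sample output above.

```python
from typing import Dict, List, Tuple


def split_results(results: List[Dict[str, object]]) -> Tuple[List[str], List[str]]:
    """Split guard results into (passed, failed) lists of guard names."""
    passed = [r["guard"] for r in results if r["score"] == 1]
    failed = [r["guard"] for r in results if r["score"] == 0]
    return passed, failed


# Sample data copied from the output shown above.
safety_scores = [
    {"guard": "Bias", "score": 1},
    {"guard": "Hallucination", "score": 0},
]

passed, failed = split_results(safety_scores)
print("passed:", passed)  # passed: ['Bias']
print("failed:", failed)  # failed: ['Hallucination']
```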

Guard Requirements

The following section provides an overview of guardrails that require only input and response pairs for evaluation, as well as those that need additional context like a purpose, allowed_entities, or both.

info

The categorization helps in configuring the guard() function according to the specific needs of the application environment.

Guards Requiring Only Input and Output

Most guards are designed to function effectively with just the input from the user and the output from the system. These include:

Guard.PRIVACY
Guard.INTELLECTUAL_PROPERTY
Guard.MISINFORMATION_DISINFORMATION
Guard.SPECIALIZED_FINANCIAL_ADVICE
Guard.OFFENSIVE
Guard.DATA_LEAKAGE
Guard.CONTRACTS
Guard.EXCESSIVE_AGENCY
Guard.POLITICS
Guard.DEBUG_ACCESS
Guard.SHELL_INJECTION
Guard.SQL_INJECTION
Guard.VIOLENT_CRIME
Guard.NON_VIOLENT_CRIME
Guard.SEX_CRIME
Guard.CHILD_EXPLOITATION
Guard.INDISCRIMINATE_WEAPONS
Guard.HATE
Guard.SELF_HARM
Guard.SEXUAL_CONTENT
Guard.CYBERCRIME
Guard.CHEMICAL_BIOLOGICAL_WEAPONS
Guard.ILLEGAL_DRUGS
Guard.COPYRIGHT_VIOLATIONS
Guard.HARASSMENT_BULLYING
Guard.ILLEGAL_ACTIVITIES
Guard.GRAPHIC_CONTENT
Guard.UNSAFE_PRACTICES
Guard.RADICALIZATION
Guard.PROFANITY
Guard.INSULTS

Guards Requiring a Purpose

Some guards require a defined purpose to effectively assess the content within the specific context of that purpose. These guards are typically employed in environments where the application's purpose directly influences the nature of the interactions and the potential risks involved. These include:

Guard.BFLA
Guard.BIAS
Guard.HALLUCINATION
Guard.HIJACKING
Guard.OVERRELIANCE
Guard.PROMPT_EXTRACTION
Guard.RBAC
Guard.SSRF
Guard.COMPETITORS
Guard.RELIGION

Guards Requiring Allowed Entities

Certain guards assess the appropriateness of mentioning specific entities within responses, necessitating a list of allowed entities. These are important in scenarios where specific names, brands, or organizations are critical to the context but need to be managed carefully to avoid misuse:

Guard.BOLA
Guard.IMITATION

Guards Requiring Both Purpose and Entities

These guards need both a purpose and a list of allowed entities:

Guard.PII_API_DB
Guard.PII_DIRECT
Guard.PII_SESSION
Guard.PII_SOCIAL
Guard.COMPETITORS
Guard.RELIGION
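The requirement lists above can be turned into a pre-flight check that runs before guard() is called. The following is a sketch under stated assumptions, not part of deepeval: missing_parameters and the three sets are hypothetical names, and guards are represented by plain strings that mirror the Guard member names so the example runs without the library. COMPETITORS and RELIGION are treated as needing both parameters, which is the stricter reading of the lists. Because the page notes that a system prompt can be supplied instead of purpose and allowed_entities, the check passes whenever system_prompt is given.

```python
from typing import Iterable, List, Optional

# Guard names copied from the lists above (strings standing in for Guard members).
NEEDS_PURPOSE = {
    "BFLA", "BIAS", "HALLUCINATION", "HIJACKING",
    "OVERRELIANCE", "PROMPT_EXTRACTION", "RBAC", "SSRF",
}
NEEDS_ENTITIES = {"BOLA", "IMITATION"}
NEEDS_BOTH = {
    "PII_API_DB", "PII_DIRECT", "PII_SESSION", "PII_SOCIAL",
    "COMPETITORS", "RELIGION",
}


def missing_parameters(
    guard_names: Iterable[str],
    purpose: Optional[str] = None,
    allowed_entities: Optional[List[str]] = None,
    system_prompt: Optional[str] = None,
) -> List[str]:
    """Return the names of required parameters that were not supplied."""
    # A system prompt may stand in for purpose and allowed_entities.
    if system_prompt is not None:
        return []

    names = set(guard_names)
    missing = []
    if purpose is None and names & (NEEDS_PURPOSE | NEEDS_BOTH):
        missing.append("purpose")
    if allowed_entities is None and names & (NEEDS_ENTITIES | NEEDS_BOTH):
        missing.append("allowed_entities")
    return missing


print(missing_parameters(["HALLUCINATION", "BIAS"]))           # ['purpose']
print(missing_parameters(["PII_DIRECT"], purpose="Support"))   # ['allowed_entities']
print(missing_parameters(["PRIVACY", "PROFANITY"]))            # []
```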
