use-cases/ai-app-evals page

name: Add evals to your AI application tagline: Use Codex to turn expected behavior into a Promptfoo eval suite. summary: Ask Codex to inspect your AI application, identify the behavior you want to evaluate, and add a runnable Promptfoo eval suite. skills:

token: promptfoo url: https://github.com/promptfoo/promptfoo/tree/main/plugins/promptfoo description: Plugin that includes $promptfoo-evals and $promptfoo-provider-setup for creating, connecting, running, and QAing eval suites. bestFor:
AI applications that already have prompts, model calls, tools, retrieval, agents, or product requirements but no repeatable eval suite.
Teams preparing a model, prompt, retrieval, or agent change and wanting regression tests before the pull request merges.
Quality reviews where repeated manual checks should become committed eval cases. starterPrompt: title: Add Evals Before You Change Behavior body: >- Use $promptfoo-evals to add a Promptfoo eval suite for this AI application. If there is not already a working Promptfoo provider or target adapter, use $promptfoo-provider-setup first.

Behavior to evaluate: [support answer quality / tool-call correctness / retrieval grounding / business rules / agent task completion]

Before editing:
- Inspect the app path users hit and any existing evals or tests.
- Propose the smallest useful eval plan: target adapter, seed cases, assertions, files, commands, and required env vars or local services.
- Do not change production prompts, model settings, or app behavior until the baseline eval exists and has been run.
Requirements:
- Exercise the application path users hit when possible, not only the raw model prompt.
- Keep fixtures free of secrets, customer data, and sensitive personal data.
- Add a local eval command such as npm run evals or document the exact command to run.
Finish with:
- Files changed
- Eval commands run
- Passing and failing cases
- Recommended next evals to add suggestedEffort: medium relatedLinks:
label: Promptfoo configuration url: https://www.promptfoo.dev/docs/configuration/guide/
label: Evaluation best practices url: /api/docs/guides/evaluation-best-practices

Introduction

When you are building an AI application, or making changes to an existing one, you want to make sure it behaves as expected. Evals are a way to systematically test a set of scenarios and catch regressions before they ship.

You can use Promptfoo to run evals on your AI application, and Codex to help you create and maintain the evals.

How to use

Use Codex with the Promptfoo plugin's $promptfoo-evals skill to turn one AI app behavior into a repeatable eval suite. When the app does not already have a working Promptfoo target, $promptfoo-provider-setup helps connect the suite to the application path you want to test.

Codex can inspect the app, propose high-signal cases, add the Promptfoo config and test data, run the suite locally, and give you a command to keep using.

This use case works best when the behavior is concrete: support answer quality, retrieval grounding, classifier labels, tool calls, JSON shape, business rules, or prompt and model migration confidence.

A strong first pass should be reviewable code and test data: a promptfooconfig.yaml or equivalent config, a small evals/ directory, test cases, any target adapter needed to call the app, and a local command such as npm run evals.

Choose what to evaluate

Start with one user-visible promise. Avoid asking Codex to evaluate the entire AI system in one pass. A smaller suite is easier to trust, review, and keep running.

Good first targets include:

Correctness: classification, extraction, summarization, routing, or transformation.
Grounding: answers that should stay tied to retrieved documents or cited sources.
Tool use: choosing the right tool, passing valid arguments, and handling tool errors.
Format or business rules: JSON schemas, field names, business-rule limits, or UI-facing copy contracts.
Prompt or model migration: making sure a new prompt, model, system message, or retrieval setting does not break important cases.

Start from product requirements, bug reports, support escalations, or sanitized examples your team is comfortable committing to the repo.

Ask for an eval plan

Codex should inspect before it edits. Ask for a plan that names the target path, fixtures, assertions, adapter, and commands. This gives you a chance to catch the wrong target or weak test cases before files are added.

Review the plan before implementation. It should name the app path or endpoint Promptfoo will call, the first seed cases, the assertions, the files Codex will create, the local command, and any required secrets or services. If the plan tests the raw model instead of the application path users hit, ask Codex whether that is intentional.

Implement, run, and iterate

Once the plan is correct, ask Codex to implement it. The first implementation should be boring: config, cases, fixtures, a target adapter if needed, a command, and proof that the command ran.

A small app-backed suite might look like this:

evals/
  promptfooconfig.yaml
  tests/
    cases.yaml
  providers/
    provider.js  # only if the built-in provider cannot call the app directly

Run the suite before changing behavior. The baseline tells you whether the app already fails the cases, whether the assertions need tuning, or whether the target adapter is wrong. Tune assertions when they are too brittle or vague, but keep real product failures visible.

After the first run, use the suite to compare app changes before they ship. Add new cases whenever a bug, launch requirement, or product review shows behavior you want to keep stable. Once the local command is stable, ask Codex to add it to CI or your release checklist.

use-cases/ai-app-evals.md Codex Docs, 2026-05-01 18:29 UTC → 2026-05-18 22:01 UTC

Introduction

How to use

Choose what to evaluate

Ask for an eval plan

Implement, run, and iterate