---
name: Add evals to your AI application
tagline: Use Codex to turn expected behavior into a Promptfoo eval suite.
summary: Ask Codex to inspect your AI application, identify the behavior you
  want to evaluate, and add a runnable Promptfoo eval suite.
skills:
  - token: promptfoo
    url: https://github.com/promptfoo/promptfoo/tree/main/plugins/promptfoo
    description: Plugin that includes `$promptfoo-evals` and
      `$promptfoo-provider-setup` for creating, connecting, running, and QAing
      eval suites.
bestFor:
  - AI applications that already have prompts, model calls, tools, retrieval,
    agents, or product requirements but no repeatable eval suite.
  - Teams preparing a model, prompt, retrieval, or agent change and wanting
    regression tests before the pull request merges.
  - Quality reviews where repeated manual checks should become committed eval
    cases.
starterPrompt:
  title: Add Evals Before You Change Behavior
  body: >-
    Use $promptfoo-evals to add a Promptfoo eval suite for this AI application.
    If there is not already a working Promptfoo provider or target adapter, use
    $promptfoo-provider-setup first.


    Behavior to evaluate: [support answer quality / tool-call correctness /
    retrieval grounding / business rules / agent task completion]


    Before editing:

    - Inspect the app path users hit and any existing evals or tests.

    - Propose the smallest useful eval plan: target adapter, seed cases,
    assertions, files, commands, and required env vars or local services.

    - Do not change production prompts, model settings, or app behavior until
    the baseline eval exists and has been run.


    Requirements:

    - Exercise the application path users hit when possible, not only the raw
    model prompt.

    - Keep fixtures free of secrets, customer data, and sensitive personal data.

    - Add a local eval command such as `npm run evals` or document the exact
    command to run.


    Finish with:

    - Files changed

    - Eval commands run

    - Passing and failing cases

    - Recommended next evals to add
  suggestedEffort: medium
relatedLinks:
  - label: Promptfoo configuration
    url: https://www.promptfoo.dev/docs/configuration/guide/
  - label: Evaluation best practices
    url: /api/docs/guides/evaluation-best-practices
---

## Introduction

When you are building an AI application, or making changes to an existing one, you want to make sure it behaves as expected. Evals are a way to systematically test a set of scenarios and catch regressions before they ship.

You can use Promptfoo to run evals on your AI application, and Codex to help you create and maintain the evals.

## How to use

Use Codex with the Promptfoo plugin's `$promptfoo-evals` skill to turn one AI app behavior into a repeatable eval suite. When the app does not already have a working Promptfoo target, `$promptfoo-provider-setup` helps connect the suite to the application path you want to test.

Codex can inspect the app, propose high-signal cases, add the Promptfoo config and test data, run the suite locally, and give you a command to keep using.

This use case works best when the behavior is concrete: support answer quality, retrieval grounding, classifier labels, tool calls, JSON shape, business rules, or prompt and model migration confidence.

A strong first pass should be reviewable code and test data: a `promptfooconfig.yaml` or equivalent config, a small `evals/` directory, test cases, any target adapter needed to call the app, and a local command such as `npm run evals`.

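As a sketch of what that first pass might contain, a minimal `promptfooconfig.yaml` for an app-backed suite could look like this. The file paths and the `{{message}}` variable are assumptions about your layout, not Promptfoo requirements:

```yaml
# evals/promptfooconfig.yaml — minimal sketch; paths are assumptions.
description: Support answer quality

# Point Promptfoo at the application path, not just the raw model.
providers:
  - file://providers/provider.js # custom adapter that calls the app

prompts:
  - "{{message}}" # the adapter receives the rendered prompt

tests: file://tests/cases.yaml # seed cases live in their own file
```

Keeping the config, the cases, and the adapter in separate files makes each piece easy to review on its own.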
## Choose what to evaluate

Start with one user-visible promise. Avoid asking Codex to evaluate the entire AI system in one pass. A smaller suite is easier to trust, review, and keep running.

Good first targets include:

- **Correctness:** classification, extraction, summarization, routing, or transformation.
- **Grounding:** answers that should stay tied to retrieved documents or cited sources.
- **Tool use:** choosing the right tool, passing valid arguments, and handling tool errors.
- **Format or business rules:** JSON schemas, field names, business-rule limits, or UI-facing copy contracts.
- **Prompt or model migration:** making sure a new prompt, model, system message, or retrieval setting does not break important cases.

Start from product requirements, bug reports, support escalations, or sanitized examples your team is comfortable committing to the repo.

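Sanitized examples like these can become seed cases directly. A sketch of a `cases.yaml` file covering a correctness case and a format case — the inputs and expected values here are hypothetical, not your real product data:

```yaml
# evals/tests/cases.yaml — seed cases sketch with hypothetical data.
- description: Refund policy answer stays grounded
  vars:
    message: "Can I get a refund after 30 days?"
  assert:
    - type: contains
      value: "30 days"
    - type: llm-rubric
      value: States the refund policy without inventing exceptions.

- description: Ticket categorization keeps its JSON shape
  vars:
    message: "Categorize this ticket: my invoice total is wrong"
  assert:
    - type: is-json
```

`contains`, `llm-rubric`, and `is-json` are built-in Promptfoo assertion types; start with deterministic checks and add rubric-graded ones where exact matching is too brittle.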
## Ask for an eval plan

Codex should inspect before it edits. Ask for a plan that names the app path or endpoint Promptfoo will call, the first seed cases, the assertions, the files Codex will create, the local command, and any required secrets or services. Reviewing the plan before implementation gives you a chance to catch the wrong target or weak test cases before files are added. If the plan tests the raw model instead of the application path users hit, ask Codex whether that is intentional.

## Implement, run, and iterate

Once the plan is correct, ask Codex to implement it. The first implementation should be boring: config, cases, fixtures, a target adapter if needed, a command, and proof that the command ran.

A small app-backed suite might look like this:

```text
evals/
  promptfooconfig.yaml
  tests/
    cases.yaml
  providers/
    provider.js   # only if the built-in provider cannot call the app directly
```
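
When the app can only be reached through its own HTTP endpoint, `provider.js` can be a small custom Promptfoo provider: a class with an `id()` method and a `callApi(prompt, context)` method that returns `{ output }`. This is a sketch — the `/api/chat` route, the `APP_URL` variable, and the response's `reply` field are assumptions about your app:

```javascript
// evals/providers/provider.js — custom provider sketch.
// Promptfoo calls callApi(prompt, context) and expects { output } back.
// The /api/chat route, APP_URL, and the `reply` field are assumptions.

class AppProvider {
  constructor(options = {}) {
    this.baseUrl =
      options.baseUrl || process.env.APP_URL || "http://localhost:3000";
  }

  // A stable id so results are attributed to this target in reports.
  id() {
    return "app-chat-provider";
  }

  async callApi(prompt, context) {
    // Call the same path users hit, passing the rendered prompt through.
    const res = await fetch(`${this.baseUrl}/api/chat`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ message: prompt, vars: context?.vars }),
    });
    if (!res.ok) {
      // Surface HTTP failures as eval errors instead of empty outputs.
      return { error: `App returned ${res.status}` };
    }
    const data = await res.json();
    return { output: data.reply };
  }
}

module.exports = AppProvider;
```

Because the adapter is plain Node, you can point it at a local dev server first and only later at a staging environment.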

Run the suite before changing behavior. The baseline tells you whether the app already fails the cases, whether the assertions need tuning, or whether the target adapter is wrong. Tune assertions when they are too brittle or vague, but keep real product failures visible.

After the first run, use the suite to compare app changes before they ship. Add new cases whenever a bug, launch requirement, or product review shows behavior you want to keep stable. Once the local command is stable, ask Codex to add it to CI or your release checklist.
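
Making `npm run evals` the stable entry point can be as simple as one `package.json` script; the config path here matches the sketched layout and is an assumption about your repo:

```json
{
  "scripts": {
    "evals": "promptfoo eval -c evals/promptfooconfig.yaml"
  }
}
```

The same command then works unchanged in CI, so the release checklist and local runs stay in sync.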