use-cases/ai-app-evals diff

use-cases/ai-app-evals.md +123 −0 added


---
name: Add evals to your AI application
tagline: Use Codex to turn expected behavior into a Promptfoo eval suite.
summary: Ask Codex to inspect your AI application, identify the behavior you
  want to evaluate, and add a runnable Promptfoo eval suite.
skills:
  - token: promptfoo
    url: https://github.com/promptfoo/promptfoo/tree/main/plugins/promptfoo
    description: Plugin that includes `$promptfoo-evals` and
      `$promptfoo-provider-setup` for creating, connecting, running, and QAing
      eval suites.
bestFor:
  - AI applications that already have prompts, model calls, tools, retrieval,
    agents, or product requirements but no repeatable eval suite.
  - Teams preparing a model, prompt, retrieval, or agent change and wanting
    regression tests before the pull request merges.
  - Quality reviews where repeated manual checks should become committed eval
    cases.
starterPrompt:
  title: Add Evals Before You Change Behavior
  body: >-
    Use $promptfoo-evals to add a Promptfoo eval suite for this AI application.
    If there is not already a working Promptfoo provider or target adapter, use
    $promptfoo-provider-setup first.


    Behavior to evaluate: [support answer quality / tool-call correctness /
    retrieval grounding / business rules / agent task completion]


    Before editing:

    - Inspect the app path users hit and any existing evals or tests.

    - Propose the smallest useful eval plan: target adapter, seed cases,
    assertions, files, commands, and required env vars or local services.

    - Do not change production prompts, model settings, or app behavior until
    the baseline eval exists and has been run.


    Requirements:

    - Exercise the application path users hit when possible, not only the raw
    model prompt.

    - Keep fixtures free of secrets, customer data, and sensitive personal data.

    - Add a local eval command such as `npm run evals` or document the exact
    command to run.


    Finish with:

    - Files changed

    - Eval commands run

    - Passing and failing cases

    - Recommended next evals to add
  suggestedEffort: medium
relatedLinks:
  - label: Promptfoo configuration
    url: https://www.promptfoo.dev/docs/configuration/guide/
  - label: Evaluation best practices
    url: /api/docs/guides/evaluation-best-practices
---

## Introduction

When you build an AI application or change an existing one, you need to confirm it still behaves as expected. Evals systematically test a fixed set of scenarios so you can catch regressions before they ship.

Promptfoo runs the evals against your AI application, and Codex helps you create and maintain them.

## How to use

Use Codex with the Promptfoo plugin's `$promptfoo-evals` skill to turn one AI app behavior into a repeatable eval suite. When the app does not already have a working Promptfoo target, `$promptfoo-provider-setup` helps connect the suite to the application path you want to test.

Codex can inspect the app, propose high-signal cases, add the Promptfoo config and test data, run the suite locally, and give you a command to keep using.

This use case works best when the behavior is concrete: support answer quality, retrieval grounding, classifier labels, tool calls, JSON shape, business rules, or confidence in a prompt or model migration.

A strong first pass produces reviewable code and test data: a `promptfooconfig.yaml` or equivalent config, a small `evals/` directory, test cases, any target adapter needed to call the app, and a local command such as `npm run evals`.

## Choose what to evaluate

Start with one user-visible promise. Avoid asking Codex to evaluate the entire AI system in one pass. A smaller suite is easier to trust, review, and keep running.

Good first targets include:

- **Correctness:** classification, extraction, summarization, routing, or transformation.
- **Grounding:** answers that should stay tied to retrieved documents or cited sources.
- **Tool use:** choosing the right tool, passing valid arguments, and handling tool errors.
- **Format or business rules:** JSON schemas, field names, business-rule limits, or UI-facing copy contracts.
- **Prompt or model migration:** making sure a new prompt, model, system message, or retrieval setting does not break important cases.

Start from product requirements, bug reports, support escalations, or sanitized examples your team is comfortable committing to the repo.
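
As a rough illustration of how a few of these targets could become seed cases, here is a sketch of a Promptfoo test file. The file path, the `question` variable, and every value are hypothetical. The assertion types shown (`icontains`, `is-json`, `llm-rubric`) are standard Promptfoo assertions, but which ones fit depends on what your app returns.

```yaml
# evals/tests/cases.yaml
# Hypothetical seed cases; the variable name `question` and all values are illustrative.

# Correctness and business rules: a deterministic check on the answer text.
- description: States the refund window for a late request
  vars:
    question: Can I get a refund 45 days after purchase?
  assert:
    - type: icontains
      value: 30 days

# Format: the reply must be JSON with the fields the UI reads.
- description: Extraction returns the required fields
  vars:
    question: 'Extract the order ID and issue type: "Order 1234 arrived damaged."'
  assert:
    - type: is-json
      value:
        type: object
        required: [order_id, issue_type]

# Grounding: a model-graded check, which needs a grading model to be configured.
- description: Answer stays tied to the retrieved policy text
  vars:
    question: What does the warranty cover?
  assert:
    - type: llm-rubric
      value: Only makes claims supported by the warranty policy and does not invent coverage.
```

Deterministic assertions are cheaper and more stable than model-graded ones, so a first suite can lean on them and reserve `llm-rubric` for behavior that cannot be checked by string or schema.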

## Ask for an eval plan

Codex should inspect before it edits. Ask for a plan first so you can catch the wrong target or weak test cases before any files are added.

Review the plan before implementation. It should name the app path or endpoint Promptfoo will call, the first seed cases, the assertions, the files Codex will create, the local command, and any required secrets or services. If the plan tests the raw model instead of the application path users hit, ask Codex whether that is intentional.
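
If the plan names an HTTP endpoint as the target, the provider entry might be sketched with Promptfoo's built-in HTTP provider, as below. The URL, the `message` request field, and the `reply` response field are assumptions for illustration. Check them against your app's actual contract before relying on this.

```yaml
# Fragment of promptfooconfig.yaml: a built-in HTTP target (URL and field names are hypothetical).
providers:
  - id: https
    label: support-app-staging
    config:
      url: https://staging.example.com/api/support/answer
      method: POST
      headers:
        Content-Type: application/json
      body:
        # {{prompt}} is filled with the rendered prompt for each test case.
        message: '{{prompt}}'
      # Pick the answer text out of the JSON response; `reply` is an assumed field.
      transformResponse: json.reply
```

Pointing the suite at the endpoint rather than at a model provider is what makes the eval exercise the application path users hit.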

## Implement, run, and iterate

Once the plan is correct, ask Codex to implement it. The first implementation should be boring: config, cases, fixtures, a target adapter if needed, a command, and proof that the command ran.

A small app-backed suite might look like this:

```text
evals/
  promptfooconfig.yaml
  tests/
    cases.yaml
  providers/
    provider.js  # only if the built-in provider cannot call the app directly
```
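
For the layout above, a minimal config could look something like the following sketch. The description, the `question` variable, and the provider label are assumptions. The `file://` references assume paths are resolved relative to the config file.

```yaml
# evals/promptfooconfig.yaml (illustrative sketch; names and paths are assumptions)
description: Support answer quality baseline

# Pass the test variable through unchanged; the adapter builds the real app request.
prompts:
  - '{{question}}'

# Hypothetical adapter that calls the same application path users hit.
providers:
  - id: file://providers/provider.js
    label: support-app

# Seed cases live in their own file so reviewers can read and diff them separately.
tests: file://tests/cases.yaml
```

With a config like this, the local command could be `npx promptfoo@latest eval -c evals/promptfooconfig.yaml`, wrapped in an `evals` script in `package.json` so that `npm run evals` stays the single entry point.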

Run the suite before changing behavior. The baseline tells you whether the app already fails the cases, whether the assertions need tuning, or whether the target adapter is wrong. Tune assertions when they are too brittle or vague, but keep real product failures visible.

After the first run, use the suite to compare app changes before they ship. Add new cases whenever a bug, launch requirement, or product review shows behavior you want to keep stable. Once the local command is stable, ask Codex to add it to CI or your release checklist.
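
As one possible shape for the CI step, here is a sketch of a GitHub Actions workflow that runs the suite on pull requests. GitHub Actions itself is an assumption, since the guide does not name a CI system. The workflow file name, the Node version, and the `OPENAI_API_KEY` secret name are also assumptions.

```yaml
# .github/workflows/evals.yml (hypothetical workflow; adjust names, triggers, and secrets)
name: evals
on:
  pull_request:
jobs:
  promptfoo:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      # Assumes package.json defines an "evals" script that runs the Promptfoo suite.
      - run: npm run evals
        env:
          # Assumed secret name; needed only if the app or a model-graded assertion calls a hosted model.
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```

Reusing the same `npm run evals` command locally and in CI keeps the two runs comparable.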

use-cases/ai-app-evals.md Codex Docs, 2026-02-23 18:27 UTC → 2026-05-07 20:02 UTC