# Iterate on difficult problems

[← All use cases](https://developers.openai.com/codex/use-cases)

Give Codex an evaluation system, such as scripts and reviewable artifacts, so it can keep improving a hard task until the scores are good enough.

Advanced · Long-running

Related links: [Custom instructions with AGENTS.md](https://developers.openai.com/codex/guides/agents-md) · [Codex workflows](https://developers.openai.com/codex/workflows)
## Best for

- Problems where each iteration can be scored, but the best result usually takes many passes
- Tasks with visual or subjective outputs that need both deterministic checks and an LLM-as-a-judge score
- Long-running Codex sessions where you want progress tracked clearly instead of relying on context
## Starter prompt

I have a difficult task in this workspace and I want you to run it as an eval-driven improvement loop.

Before changing anything:

- Read `AGENTS.md`.
- Find the script or command that scores the current output.

Iteration loop:

- Make one focused improvement at a time.
- Re-run the eval command after each meaningful change.
- Log the scores and what changed.
- Inspect generated artifacts directly. If the output is visual, use `view_image`.
- Keep going until both the overall score and the LLM average are above 90%.

Constraints:

- Do not stop at the first acceptable result.
- Do not revert to an earlier version unless the new result is clearly worse in scores or artifacts.
- If the eval improves but is still below target, explain the bottleneck and continue.

Output:

- current best scores
- log of major iterations
- remaining risks or weak spots
## Introduction

Some tasks are easy to verify in one shot: the build passes, the tests go green, and you are done. But some optimization problems are difficult to solve and need many iterations with a tight evaluation loop. To know which direction to go in, Codex needs to inspect the current output, score it, decide on the next change, and repeat until the result is actually good.

This type of use case pairs well with a custom UI that lets you inspect progress visually: have Codex log the outputs and generated artifacts for each iteration, and you can watch it keep working in the app while the target artifact, model output, or generated asset improves. The key is to give Codex the scripts it needs to generate the evaluation metrics and the artifacts to inspect.
## Start with evals

Before the task begins, define how success will be measured. The best setup usually combines:

- **Deterministic checks:** things the scripts can score directly, such as constraint violations or deterministic metrics computed with code
- **LLM-as-a-judge checks:** rubric-based scores for qualities that are harder to encode exactly, such as resemblance, readability, usefulness, or overall quality; these can rely on text or image outputs

If the subjective part matters, give Codex a script that can call a model, for example through the [Responses API](https://developers.openai.com/api/reference/resources/responses/methods/create), and return structured scores. The point is not to replace deterministic checks; it is to supplement them with a consistent judge for the parts humans would otherwise assess by eye.

The loop works best when the eval output is machine-readable, saved after every run, and easy to compare over time.

**Tip**: Ask Codex to generate the evaluation script for you, describing the checks you want to run.
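As a concrete illustration, here is a minimal sketch of what such a script could look like, assuming a Python workspace, an artifact at `build/output.html`, and the OpenAI Python SDK for the judge call; the paths, model name, and scoring rules are placeholders to adapt, not part of Codex itself.

```python
# eval_output.py -- a minimal sketch, not an official harness.
# Assumptions: a Python workspace, an artifact at build/output.html,
# the OpenAI Python SDK installed, and OPENAI_API_KEY set.
import json
from pathlib import Path

from openai import OpenAI

client = OpenAI()


def deterministic_score(artifact: str) -> float:
    """Score what a script can check exactly (placeholder rules)."""
    violations = 0
    if "TODO" in artifact:
        violations += 1
    if len(artifact) < 100:
        violations += 1
    return max(0.0, 1.0 - 0.5 * violations)


def llm_judge_score(artifact: str) -> float:
    """Ask a judge model to grade the artifact against a rubric, on a 0-1 scale."""
    response = client.responses.create(
        model="gpt-5",  # placeholder: use whichever model you trust as a judge
        input=(
            "Grade this artifact from 0 to 100 for readability and overall "
            "quality. Reply with the number only.\n\n" + artifact
        ),
    )
    # In practice you may want structured outputs; a bare float parse keeps the sketch short.
    return float(response.output_text.strip()) / 100


if __name__ == "__main__":
    artifact = Path("build/output.html").read_text()
    scores = {
        "deterministic": deterministic_score(artifact),
        "llm_judge": llm_judge_score(artifact),
    }
    scores["overall"] = (scores["deterministic"] + scores["llm_judge"]) / 2
    # Append to a history file so every run stays comparable over time.
    with open("eval_history.jsonl", "a") as log:
        log.write(json.dumps(scores) + "\n")
    print(json.dumps(scores, indent=2))
```

Because every run appends to `eval_history.jsonl`, Codex (or you) can compare scores across iterations instead of relying on memory of the thread.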
## Give Codex a stopping rule

Hard tasks often drift because the prompt says “keep improving” without saying when to stop. Make the stopping rule explicit.

A practical pattern is:

1. Set a target for the overall score.
2. Set a separate target for the LLM-judge average.
3. Tell Codex to continue until both are above the threshold, not just one.

For example, if the goal is a high-quality artifact, ask Codex to keep going until both the overall score and the LLM average are above 90%. That makes the task legible: Codex can tell whether it is still below target, where the gap is, and whether the latest change helped.
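If you want the thresholds to live somewhere Codex can check mechanically, a tiny script can encode the same rule. This sketch assumes the `eval_history.jsonl` file from the earlier example and uses 0.90 for both targets, matching the 90% threshold above:

```python
# check_done.py -- a sketch of an explicit stopping rule, assuming the
# eval_history.jsonl format written by the eval script sketch above.
import json
import sys

OVERALL_TARGET = 0.90  # the 90% threshold used in this article
LLM_TARGET = 0.90

with open("eval_history.jsonl") as f:
    latest = json.loads(f.readlines()[-1])

done = latest["overall"] >= OVERALL_TARGET and latest["llm_judge"] >= LLM_TARGET
print(f"overall={latest['overall']:.2f} llm_judge={latest['llm_judge']:.2f} done={done}")
sys.exit(0 if done else 1)
```

The non-zero exit code gives Codex an unambiguous "not done yet" signal to act on.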
## Keep a running log of the loop

Long-running work is much more reliable when Codex keeps notes about the loop instead of trying to remember everything from the thread.

That running log should record:

- the current best scores
- what changed on the last iteration
- what the eval said got better or worse
- what Codex plans to try next

This is especially important when the task runs for a long time. The log becomes the handoff point for the next session and the self-evaluation record for the current one.
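One possible shape for that log is a markdown file Codex appends to after each iteration; the file name, numbers, and wording below are purely illustrative:

```markdown
<!-- ITERATION_LOG.md (hypothetical) -->
## Iteration 7
- Best scores so far: overall 0.86, LLM judge 0.81
- Changed: tightened the spacing constraints in the layout generator
- Eval delta: deterministic score up from 0.88 to 0.92, LLM judge flat
- Next: rework the low-contrast color palette the judge keeps flagging
```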
## Inspect the artifact, not just the logs

For some difficult tasks, the code diff and metric output are not enough. Codex should look at the artifact it produced.

If the output is visual, such as a generated image, layout, or rendered state, let Codex inspect that artifact directly, for example with `view_image` when the output lives on disk as an image, and compare the current result to the prior best result or to the intended rubric.

This makes the loop stronger:

- the eval script reports the score
- the artifact shows what the score missed
- the next change is grounded in both

That combination is much more effective than changing code blindly between runs.
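If the judge should see the artifact itself, the eval script can send the image to the model. A minimal sketch, assuming a PNG at `build/render.png` and the same Responses API judge as above; the rubric, path, and model name are placeholders:

```python
# judge_image.py -- a sketch of an image-aware judge. Assumes the artifact is
# a PNG at build/render.png; the rubric and model name are placeholders.
import base64
from pathlib import Path

from openai import OpenAI

client = OpenAI()

image_b64 = base64.b64encode(Path("build/render.png").read_bytes()).decode()

response = client.responses.create(
    model="gpt-5",  # placeholder: any vision-capable judge model
    input=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_text",
                    "text": (
                        "Score this render from 0 to 100 for layout balance, "
                        "readability, and resemblance to the brief. "
                        "Reply with the number only."
                    ),
                },
                {
                    "type": "input_image",
                    "image_url": f"data:image/png;base64,{image_b64}",
                },
            ],
        }
    ],
)

print(float(response.output_text.strip()) / 100)
```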
## Make every iteration explicit

Ask Codex to follow the same loop every time:

1. Run the evals on the current baseline.
2. Identify the biggest failure mode from the scores and artifacts.
3. Make one focused change that addresses that bottleneck.
4. Re-run the evals.
5. Log the new scores and whether the change helped.
6. Continue until the thresholds are met.

This discipline matters. If each iteration changes too many things at once, Codex cannot tell which idea improved the score. If it skips logging, the session becomes hard to trust and hard to resume.
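To make this discipline survive across sessions, the same loop can also live in `AGENTS.md` rather than being restated in every prompt. A possible excerpt, with illustrative script and file names carried over from the sketches above:

```markdown
## Improvement loop

Work on the hard task in eval-driven iterations:

1. Run `python eval_output.py` against the current baseline before changing anything.
2. Pick the single biggest failure mode from the scores and artifacts.
3. Make one focused change, then re-run the eval.
4. Append the new scores and a one-line summary to `ITERATION_LOG.md`.
5. Continue until both the overall score and the LLM judge average are above 0.90.
```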
## Related use cases

- [Understand large codebases](https://developers.openai.com/codex/use-cases/codebase-onboarding) (Engineering · Analysis): Use Codex to map unfamiliar codebases, explain different modules and data flow, and point...
- [Create browser-based games](https://developers.openai.com/codex/use-cases/browser-games) (Engineering · Code): Use Codex to turn a game brief into first a well-defined plan, and then a real browser-based...
- [Learn a new concept](https://developers.openai.com/codex/use-cases/learn-a-new-concept) (Knowledge Work · Data): Use Codex to study material such as research papers or courses, split the reading across...