---
name: Iterate on difficult problems
tagline: Use Codex as a scored improvement loop to solve hard tasks.
summary: Give Codex an evaluation system, such as scripts and reviewable
  artifacts, so it can keep improving a hard task until the scores are good
  enough.
difficulty: Advanced
timeHorizon: Long-running
bestFor:
  - Problems where each iteration can be scored, but the best result usually
    takes many passes
  - Tasks with visual or subjective outputs that need both deterministic checks
    and an LLM-as-a-judge score
  - Long-running Codex sessions where you want progress tracked clearly instead
    of relying on context
starterPrompt:
  title: Keep Iterating Until the Eval Passes
  body: >-
    I have a difficult task in this workspace and I want you to run it as an
    eval-driven improvement loop.


    Before changing anything:

    - Read `AGENTS.md`.

    - Find the script or command that scores the current output.


    Iteration loop:

    - Make one focused improvement at a time.

    - Re-run the eval command after each meaningful change.

    - Log the scores and what changed.

    - Inspect generated artifacts directly. If the output is visual, use
    `view_image`.

    - Keep going until both the overall score and the LLM average are above 90%.


    Constraints:

    - Do not stop at the first acceptable result.

    - Do not revert to an earlier version unless the new result is clearly worse
    in scores or artifacts.

    - If the eval improves but is still below target, explain the bottleneck and
    continue.


    Output:

    - current best scores

    - log of major iterations

    - remaining risks or weak spots
relatedLinks:
  - label: Custom instructions with AGENTS.md
    url: /codex/guides/agents-md
  - label: Codex workflows
    url: /codex/workflows
---

## Introduction

6. Continue until the thresholds are met.

This discipline matters. If each iteration changes too many things at once, Codex cannot tell which idea improved the score. If it skips logging, the session becomes hard to trust and hard to resume.
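The loop the starter prompt asks for can be sketched as a small harness. This is an illustrative sketch only, not part of Codex: the `./eval.sh` command, the JSON keys `overall` and `llm_average`, and the `apply_change` / `make_one_change` hooks are hypothetical stand-ins for your project's own scoring script and edit step.

```python
import json
import subprocess

# Both thresholds from the starter prompt: overall score and LLM-judge average.
THRESHOLD = 0.90

def meets_target(scores: dict, threshold: float = THRESHOLD) -> bool:
    """True once both tracked scores clear the threshold."""
    return scores["overall"] >= threshold and scores["llm_average"] >= threshold

def run_eval(command: list[str]) -> dict:
    """Run the project's scoring command and parse its JSON stdout.

    Assumes the command prints e.g. {"overall": 0.82, "llm_average": 0.75};
    adapt the parsing to whatever your eval script actually emits.
    """
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)

def improvement_loop(score_fn, apply_change, max_iters: int = 20) -> list[dict]:
    """Score, improve, re-score; return the full log, not just the last score."""
    log = [score_fn()]
    for i in range(max_iters):
        if meets_target(log[-1]):
            break
        apply_change(i)          # one focused improvement per pass
        log.append(score_fn())
    return log

# e.g. improvement_loop(lambda: run_eval(["./eval.sh"]), make_one_change)
```

Returning the whole log rather than only the final score is what makes a long session resumable: each entry records the scores after each focused change, which is exactly the progress tracking the prompt asks Codex to maintain.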