# Iterate on difficult problems

Use Codex as a scored improvement loop to solve hard tasks.

Difficulty **Advanced**

Time horizon **Long-running**

Give Codex an evaluation system, such as scripts and reviewable artifacts, so it can keep improving a hard task until the scores are good enough.

Related links: [Custom instructions with AGENTS.md](https://developers.openai.com/codex/guides/agents-md) · [Codex workflows](https://developers.openai.com/codex/workflows)

## Best for

- Problems where each iteration can be scored, but the best result usually takes many passes
- Tasks with visual or subjective outputs that need both deterministic checks and an LLM-as-a-judge score
- Long-running Codex sessions where you want progress tracked clearly instead of relying on context

## Starter prompt

**Keep Iterating Until the Eval Passes**

I have a difficult task in this workspace and I want you to run it as an eval-driven improvement loop.

Before changing anything:

- Read `AGENTS.md`.
- Find the script or command that scores the current output.

Iteration loop:

- Make one focused improvement at a time.
- Re-run the eval command after each meaningful change.
- Log the scores and what changed.
- Inspect generated artifacts directly. If the output is visual, use `view_image`.
- Keep going until both the overall score and the LLM average are above 90%.

Constraints:

- Do not stop at the first acceptable result.
- Do not revert to an earlier version unless the new result is clearly worse in scores or artifacts.
- If the eval improves but is still below target, explain the bottleneck and continue.

Output:

- current best scores
- log of major iterations
- remaining risks or weak spots
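The starter prompt above can be read as a simple driver loop. Here is a minimal sketch, assuming a hypothetical `run_eval` hook that returns the deterministic overall score and the LLM-judge average on a 0-1 scale; neither the names nor the scale come from Codex itself:

```python
from typing import Callable

TARGET = 0.90  # the prompt requires both scores above 90%

def improvement_loop(
    make_change: Callable[[], str],               # applies one focused improvement
    run_eval: Callable[[], tuple[float, float]],  # -> (overall, llm_average)
    max_iters: int = 25,
) -> list[dict]:
    """Apply one change per pass, re-score, log, and stop once both
    the overall score and the LLM-judge average exceed TARGET."""
    history = []
    for i in range(max_iters):
        change = make_change()
        overall, llm_average = run_eval()  # re-run the eval after each change
        history.append({"iteration": i, "change": change,
                        "overall": overall, "llm_average": llm_average})
        if overall > TARGET and llm_average > TARGET:
            break  # both thresholds met; stop iterating
    return history
```

In a real session Codex plays the role of `make_change`, and `run_eval` is whatever scoring script the workspace provides; the point of the structure is that every pass leaves a scored, auditable entry in `history`.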

## Introduction


6. Continue until the thresholds are met.

This discipline matters. If each iteration changes too many things at once, Codex cannot tell which idea improved the score. If it skips logging, the session becomes hard to trust and hard to resume.

## Related use cases

### [Understand large codebases](https://developers.openai.com/codex/use-cases/codebase-onboarding)

Use Codex to map unfamiliar codebases, explain different modules and data flow, and point...

Engineering · Analysis

### [Create browser-based games](https://developers.openai.com/codex/use-cases/browser-games)

Use Codex to turn a game brief into first a well-defined plan, and then a real browser-based...

Engineering · Code

### [Learn a new concept](https://developers.openai.com/codex/use-cases/learn-a-new-concept)

Use Codex to study material such as research papers or courses, split the reading across...

Knowledge Work · Data