# Iterate on difficult problems

Use Codex as a scored improvement loop to solve hard tasks.

Difficulty: **Advanced**

Time horizon: **Long-running**

Give Codex an evaluation system, such as scripts and reviewable artifacts, so it can keep improving a hard task until the scores are good enough.

Related links:

- [Custom instructions with AGENTS.md](https://developers.openai.com/codex/guides/agents-md)
- [Codex workflows](https://developers.openai.com/codex/workflows)

## Best for

- Problems where each iteration can be scored, but the best result usually takes many passes
- Tasks with visual or subjective outputs that need both deterministic checks and an LLM-as-a-judge score
- Long-running Codex sessions where you want progress tracked clearly instead of relying on context

## Starter prompt

I have a difficult task in this workspace and I want you to run it as an eval-driven improvement loop.

Before changing anything:

- Read `AGENTS.md`.
- Find the script or command that scores the current output.

Iteration loop:

- Make one focused improvement at a time.
- Re-run the eval command after each meaningful change.
- Log the scores and what changed.
- Inspect generated artifacts directly. If the output is visual, use `view_image`.
- Keep going until both the overall score and the LLM average are above 90%.

Constraints:

- Do not stop at the first acceptable result.
- Do not revert to an earlier version unless the new result is clearly worse in scores or artifacts.
- If the eval improves but is still below target, explain the bottleneck and continue.

Output:

- current best scores
- log of major iterations
- remaining risks or weak spots


## Introduction

Some tasks are easy to verify in one shot: the build passes, the tests go green, and you are done. But some optimization problems are genuinely hard and need many iterations with a tight evaluation loop. To know which direction to go in, Codex needs to inspect the current output, score it, decide the next change, and repeat until the result is actually good.

This type of use case pairs well with a custom UI that lets you inspect progress visually, by having Codex log the outputs and generated artifacts for each iteration. You can watch Codex continue working in the app while the target artifact, model output, or generated asset keeps improving. The key is to give Codex the scripts it needs to generate the evaluation metrics and the artifacts to inspect.

## Start with evals

Before the task begins, define how success will be measured. The best setup usually combines:

- **Deterministic checks:** things the scripts can score directly, such as constraint violations or deterministic metrics computed with code
- **LLM-as-a-judge checks:** rubric-based scores for qualities that are harder to encode exactly, such as resemblance, readability, usefulness, or overall quality; these checks can rely on text or image outputs

If the subjective part matters, give Codex a script that can call a model, for example using the [Responses API](https://developers.openai.com/api/reference/resources/responses/methods/create), and return structured scores. The point is not to replace deterministic checks; it's to supplement them with a consistent judge for the parts humans would otherwise assess by eye.

The loop works best when the eval output is machine-readable, saved after every run, and easy to compare over time.

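As a concrete illustration, here is a minimal sketch of what such an eval script could look like in Python. The artifact path, score file, rubric, and model name are assumptions for illustration only, and the judge is trusted to return plain JSON; adapt all of it to whatever your task actually produces.

```python
"""Hypothetical eval script: deterministic checks plus an LLM-as-a-judge score."""
import json
from pathlib import Path

from openai import OpenAI  # assumes the official openai Python package

ARTIFACT = Path("output/report.html")   # hypothetical artifact to score
SCORES_FILE = Path("eval/scores.json")  # machine-readable results, saved every run

client = OpenAI()


def deterministic_checks() -> dict:
    """Score whatever can be verified directly with code."""
    text = ARTIFACT.read_text()
    return {
        "under_size_limit": 1.0 if len(text) < 200_000 else 0.0,
        "no_todo_markers": 1.0 if "TODO" not in text else 0.0,
    }


def llm_judge() -> dict:
    """Ask a model to grade the subjective qualities against a rubric."""
    response = client.responses.create(
        model="gpt-4.1",  # placeholder model name
        input=(
            "Grade this artifact from 0 to 100 for readability and usefulness. "
            'Reply with JSON like {"readability": .., "usefulness": ..}.\n\n'
            + ARTIFACT.read_text()[:20_000]
        ),
    )
    return json.loads(response.output_text)  # assumes the judge replies with valid JSON


def main() -> None:
    checks = deterministic_checks()
    judge = llm_judge()
    scores = {
        "overall": 100 * sum(checks.values()) / len(checks),
        "llm_average": sum(judge.values()) / len(judge),
        "checks": checks,
        "judge": judge,
    }
    SCORES_FILE.parent.mkdir(parents=True, exist_ok=True)
    SCORES_FILE.write_text(json.dumps(scores, indent=2))
    print(json.dumps({"overall": scores["overall"], "llm_average": scores["llm_average"]}))


if __name__ == "__main__":
    main()
```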

**Tip**: Ask Codex to generate the evaluation script for you, describing the checks you want to run.

## Give Codex a stopping rule

Hard tasks often drift because the prompt says “keep improving” without saying when to stop. Make the stopping rule explicit.

A practical pattern is:

1. Set a target for the overall score.
2. Set a separate target for the LLM-judge average.
3. Tell Codex to continue until both are above the threshold, not just one.

For example, if the goal is a high-quality artifact, ask Codex to keep going until both the overall score and the LLM average are above 90%. That makes the task legible: Codex can tell whether it is still below target, where the gap is, and whether the latest change helped.

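One way to make that rule concrete is a tiny check that reads the latest scores and only reports “done” when both targets are met. The `eval/scores.json` path and field names below reuse the hypothetical layout from the earlier sketch; they are assumptions, not a fixed format.

```python
"""Hypothetical stopping-rule check: both targets must be met, not just one."""
import json
from pathlib import Path

OVERALL_TARGET = 90.0
LLM_AVERAGE_TARGET = 90.0

scores = json.loads(Path("eval/scores.json").read_text())

overall_ok = scores["overall"] >= OVERALL_TARGET
llm_ok = scores["llm_average"] >= LLM_AVERAGE_TARGET

if overall_ok and llm_ok:
    print("DONE: both targets met")
else:
    print(f"CONTINUE: overall={scores['overall']:.1f} "
          f"llm_average={scores['llm_average']:.1f}")
```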

## Keep a running log of the loop

Long-running work is much more reliable when Codex keeps notes about the loop instead of trying to remember everything from the thread.

That running log should record:

- the current best scores
- what changed on the last iteration
- what the eval said got better or worse
- what Codex plans to try next

This is especially important when the task runs for a long time. The log becomes the handoff point for the next session and the self-evaluation record for the current one.

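A lightweight way to keep that log is an append-only file that Codex updates after every eval run. The JSONL format and field names below are just one possible shape, assumed for illustration.

```python
"""Append one iteration record to a JSONL log (hypothetical format)."""
import json
import time
from pathlib import Path

LOG_FILE = Path("eval/iterations.jsonl")


def log_iteration(scores: dict, change: str, observation: str, next_step: str) -> None:
    LOG_FILE.parent.mkdir(parents=True, exist_ok=True)
    entry = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "scores": scores,            # current best scores
        "change": change,            # what changed on this iteration
        "observation": observation,  # what the eval said got better or worse
        "next": next_step,           # what to try next
    }
    with LOG_FILE.open("a") as f:
        f.write(json.dumps(entry) + "\n")


# Example usage after an eval run:
log_iteration(
    scores={"overall": 84.0, "llm_average": 78.5},
    change="Tightened the layout constraints in the generator",
    observation="Overall up 3 points; judge still flags readability",
    next_step="Rework the typography section before the next run",
)
```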

## Inspect the artifact, not just the logs

For some difficult tasks, the code diff and metric output are not enough. Codex should look at the artifact it produced.

If the output is visual, such as a generated image, layout, or rendered state, let Codex inspect that artifact directly (for example, when the output lives on disk as an image) and compare the current result to the prior best result or to the intended rubric.

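If the artifact is an image, a small helper can quantify how far the current render is from the previous best before Codex looks at both with `view_image`. This sketch uses Pillow and NumPy and assumes two hypothetical PNG paths; it is an illustration, not part of any Codex tooling.

```python
"""Hypothetical pixel-level comparison between the current and best artifact."""
import numpy as np
from PIL import Image


def mean_pixel_diff(current_path: str, best_path: str) -> float:
    """Return the mean absolute pixel difference (0 = identical)."""
    current = Image.open(current_path).convert("RGB")
    best = Image.open(best_path).convert("RGB").resize(current.size)
    a = np.asarray(current, dtype=np.float32)
    b = np.asarray(best, dtype=np.float32)
    return float(np.abs(a - b).mean())


diff = mean_pixel_diff("output/render.png", "eval/best_render.png")
print(f"Mean pixel difference vs. previous best: {diff:.2f}")
```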

This makes the loop stronger:

- the eval script reports the score
- the artifact shows what the score missed
- the next change is grounded in both

That combination is much more effective than changing code blindly between runs.

## Make every iteration explicit

Ask Codex to follow the same loop every time:

1. Run the evals on the current baseline.
2. Identify the biggest failure mode from the scores and artifacts.
3. Make one focused change that addresses that bottleneck.
4. Re-run the evals.
5. Log the new scores and whether the change helped.
6. Continue until the thresholds are met.

This discipline matters. If each iteration changes too many things at once, Codex cannot tell which idea improved the score. If it skips logging, the session becomes hard to trust and hard to resume.

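The same discipline can also be expressed as a small outer harness: run the eval, log the result, and stop only when both thresholds are met, while the “one focused change” step stays with Codex. The eval entry point and file paths reuse the hypothetical layout from the earlier sketches and are assumptions only.

```python
"""Hypothetical outer loop: evaluate, log, and stop when both targets are met."""
import json
import subprocess
from pathlib import Path

OVERALL_TARGET = 90.0
LLM_AVERAGE_TARGET = 90.0
MAX_ITERATIONS = 25

for iteration in range(1, MAX_ITERATIONS + 1):
    # Steps 1 and 4: run the evals (assumed entry point from the earlier sketch).
    subprocess.run(["python", "eval/run_evals.py"], check=True)
    scores = json.loads(Path("eval/scores.json").read_text())

    # Step 5: log the new scores so the run is easy to resume and audit.
    print(f"iteration {iteration}: overall={scores['overall']:.1f} "
          f"llm_average={scores['llm_average']:.1f}")

    # Step 6: continue until both thresholds are met.
    if scores["overall"] >= OVERALL_TARGET and scores["llm_average"] >= LLM_AVERAGE_TARGET:
        print("Targets met; stopping.")
        break

    # Steps 2 and 3: identifying the bottleneck and making one focused change
    # stay with Codex (or a human) between eval runs.
    input("Make one focused change, then press Enter to re-run the evals...")
```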

## Related use cases

- [Understand large codebases](https://developers.openai.com/codex/use-cases/codebase-onboarding): Use Codex to map unfamiliar codebases, explain different modules and data flow, and point... (Engineering, Analysis)
- [Create browser-based games](https://developers.openai.com/codex/use-cases/browser-games): Use Codex to turn a game brief first into a well-defined plan, and then a real browser-based... (Engineering, Code)
- [Learn a new concept](https://developers.openai.com/codex/use-cases/learn-a-new-concept): Use Codex to study material such as research papers or courses, split the reading across... (Knowledge Work, Data)