# Computer use
Computer use lets a model operate software through the user interface. It can inspect screenshots, return interface actions for your code to execute, or work through a custom harness that mixes visual and programmatic interaction with the UI.

`gpt-5.4` includes new training for this kind of work, and future models will build on the same pattern. The model is designed to operate flexibly across a range of harness shapes, including the built-in Responses API `computer` tool, custom tools layered on top of existing automation harnesses, and code-execution environments that expose browser or desktop controls.

Then launch a browser instance:

Start a browser instance

```javascript
import { chromium } from "playwright";

const browser = await chromium.launch({
  headless: false,
  chromiumSandbox: true,
  env: {},
  args: ["--disable-extensions", "--disable-file-system"],
});
const page = await browser.newPage({
  viewport: { width: 1280, height: 720 },
});
```

```python
from playwright.sync_api import sync_playwright


with sync_playwright() as p:
    browser = p.chromium.launch(
        headless=False,
        chromium_sandbox=True,
        env={},
        args=["--disable-extensions", "--disable-file-system"],
    )
    page = browser.new_page(viewport={"width": 1280, "height": 720})
```

Set up a local virtual machine

If you need a fuller desktop environment, run the model against a local VM or container and translate actions into OS-level input events.

The following Dockerfile starts an Ubuntu desktop with Xvfb, `x11vnc`, and Firefox:

Dockerfile

```dockerfile
FROM ubuntu:22.04
ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update && apt-get install -y \
    xfce4 \
    xfce4-goodies \
    x11vnc \
    xvfb \
    xdotool \
    imagemagick \
    x11-apps \
    sudo \
    software-properties-common \
    firefox-esr \
    && apt-get remove -y light-locker xfce4-screensaver xfce4-power-manager || true \
    && apt-get clean && rm -rf /var/lib/apt/lists/*

RUN useradd -ms /bin/bash myuser \
    && echo "myuser ALL=(ALL) NOPASSWD:ALL" >> /etc/sudoers
USER myuser
WORKDIR /home/myuser

RUN x11vnc -storepasswd secret /home/myuser/.vncpass

EXPOSE 5900
CMD ["/bin/sh", "-c", "\
    Xvfb :99 -screen 0 1280x800x24 >/dev/null 2>&1 & \
    x11vnc -display :99 -forever -rfbauth /home/myuser/.vncpass -listen 0.0.0.0 -rfbport 5900 >/dev/null 2>&1 & \
    export DISPLAY=:99 && \
    startxfce4 >/dev/null 2>&1 & \
    sleep 2 && echo 'Container running!' && \
    tail -f /dev/null \
"]
```

Build the image:

```bash
```

Create a helper for shelling into the container:

Execute commands on the container

```python
import subprocess


def docker_exec(cmd: str, container_name: str, decode: bool = True):
    # Pass arguments as a list so the host shell never re-parses `cmd`.
    docker_cmd = ["docker", "exec", container_name, "sh", "-c", cmd]
    output = subprocess.check_output(docker_cmd)
    if decode:
        return output.decode("utf-8", errors="ignore")
    return output


class VM:
    def __init__(self, display: str, container_name: str):
        self.display = display
        self.container_name = container_name


vm = VM(display=":99", container_name="cua-image")
```

```javascript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

async function dockerExec(cmd, containerName, decode = true) {
  // Pass arguments as an array so the host shell never re-parses `cmd`.
  const output = await execFileAsync(
    "docker",
    ["exec", containerName, "sh", "-c", cmd],
    { encoding: decode ? "utf8" : "buffer" }
  );
  return output.stdout;
}

const vm = {
  display: ":99",
  containerName: "cua-image",
};
```

Whether you use a browser or VM, treat screenshots, page text, tool outputs, PDFs, emails, chats, and other third-party content as untrusted input. Only direct instructions from the user count as permission.

## Choose an integration path

Send the task in plain language and tell the model to use the computer tool for UI interaction.

Send a computer request

```javascript
import OpenAI from "openai";

const client = new OpenAI();

const response = await client.responses.create({
  model: "gpt-5.5",
  tools: [{ type: "computer" }],
  input:
    "Check whether the Filters panel is open. If it is not open, click Show filters. Then type penguin in the search box. Use the computer tool for UI interaction.",
});

console.log(JSON.stringify(response.output, null, 2));
```

```python
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-5.5",
    tools=[{"type": "computer"}],
    input="Check whether the Filters panel is open. If it is not open, click Show filters. Then type penguin in the search box. Use the computer tool for UI interaction.",
)

print(response.output)
```

The first turn often asks for a screenshot before the model commits to UI actions. That's normal.

### 2. Handle screenshot-first turns

When the model needs visual context, it returns a `computer_call` whose `actions[]` array contains a `screenshot` request:

Screenshot request

```json
{
  "output": [
    {
      "type": "computer_call",
      "call_id": "call_001",
      "actions": [
        { "type": "screenshot" }
      ],
      "status": "completed"
    }
  ]
}
```
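
A harness can check for this case before it dispatches anything to the UI. The sketch below assumes the response output has already been converted to plain dictionaries shaped like the JSON above. `is_screenshot_only` is a hypothetical helper name used for illustration, not part of any SDK.

```python
def is_screenshot_only(item: dict) -> bool:
    """Return True when a computer_call asks for nothing but a screenshot."""
    if item.get("type") != "computer_call":
        return False
    actions = item.get("actions", [])
    # An empty batch is not a screenshot request, so require at least one action.
    return bool(actions) and all(a.get("type") == "screenshot" for a in actions)


# Example item copied from the screenshot request above.
first_turn = {
    "type": "computer_call",
    "call_id": "call_001",
    "actions": [{"type": "screenshot"}],
    "status": "completed",
}

print(is_screenshot_only(first_turn))
```

When the check passes, the harness only needs to capture the screen and send it back; there are no UI actions to run for that turn.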

### 3. Run every returned action

Later turns can batch actions into the same `computer_call`. Run them in order before taking the next screenshot.
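
As a rough illustration of that ordering, the sketch below flattens every `computer_call` in a response into one ordered list of actions. It assumes the output has been converted to plain dictionaries. `ordered_actions` is a hypothetical helper, and `sample_output` is made-up data in the shape a batched turn can take.

```python
def ordered_actions(output: list[dict]) -> list[tuple[str, dict]]:
    """Pair each action with its call_id, preserving the order the model returned."""
    pairs = []
    for item in output:
        if item.get("type") != "computer_call":
            continue  # Skip messages and any other output item types.
        for action in item.get("actions", []):
            pairs.append((item["call_id"], action))
    return pairs


# Hypothetical output containing one batched computer_call.
sample_output = [
    {
        "type": "computer_call",
        "call_id": "call_002",
        "actions": [
            {"type": "click", "button": "left", "x": 405, "y": 157},
            {"type": "type", "text": "penguin"},
        ],
        "status": "completed",
    }
]

for call_id, action in ordered_actions(sample_output):
    print(call_id, action["type"])
```

Execute the pairs from first to last, and capture a new screenshot only after the final action in the batch has run.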

<div data-content-switcher-pane data-value="playwright">
  <div class="hidden">Playwright</div>
  Normalization helpers

```javascript
// Map model-emitted key names to the names Playwright expects.
const normalizeKey = (key) => {
  switch (key) {
    case "ENTER":
    case "RETURN":
      return "Enter";
    case "ESC":
    case "ESCAPE":
      return "Escape";
    case "TAB":
      return "Tab";
    case "SPACE":
      return "Space";
    case "BACKSPACE":
      return "Backspace";
    case "DELETE":
    case "DEL":
      return "Delete";
    case "HOME":
      return "Home";
    case "END":
      return "End";
    case "PAGEUP":
      return "PageUp";
    case "PAGEDOWN":
      return "PageDown";
    case "UP":
    case "ARROWUP":
      return "ArrowUp";
    case "DOWN":
    case "ARROWDOWN":
      return "ArrowDown";
    case "LEFT":
    case "ARROWLEFT":
      return "ArrowLeft";
    case "RIGHT":
    case "ARROWRIGHT":
      return "ArrowRight";
    case "CTRL":
    case "CONTROL":
      return "Control";
    case "SHIFT":
      return "Shift";
    case "OPTION":
    case "ALT":
      return "Alt";
    case "META":
    case "CMD":
    case "COMMAND":
      return "Meta";
    default:
      return key;
  }
};

// Accept drag paths as either [x, y] pairs or {x, y} objects.
const normalizeDragPath = (path) => {
  if (!Array.isArray(path)) {
    throw new Error("drag action requires a path array");
  }

  return path.map((point) => {
    if (Array.isArray(point) && point.length >= 2) {
      return [point[0], point[1]];
    }
    if (point && typeof point === "object" && "x" in point && "y" in point) {
      return [point.x, point.y];
    }
    throw new Error("drag path entries must be coordinate pairs or {x, y} objects");
  });
};
```

```python
def normalize_key(key):
    """Map model-emitted key names to the names Playwright expects."""
    key_map = {
        "ENTER": "Enter",
        "RETURN": "Enter",
        "ESC": "Escape",
        "ESCAPE": "Escape",
        "TAB": "Tab",
        "SPACE": "Space",
        "BACKSPACE": "Backspace",
        "DELETE": "Delete",
        "DEL": "Delete",
        "HOME": "Home",
        "END": "End",
        "PAGEUP": "PageUp",
        "PAGEDOWN": "PageDown",
        "UP": "ArrowUp",
        "DOWN": "ArrowDown",
        "LEFT": "ArrowLeft",
        "RIGHT": "ArrowRight",
        "ARROWUP": "ArrowUp",
        "ARROWDOWN": "ArrowDown",
        "ARROWLEFT": "ArrowLeft",
        "ARROWRIGHT": "ArrowRight",
        "CTRL": "Control",
        "CONTROL": "Control",
        "SHIFT": "Shift",
        "OPTION": "Alt",
        "ALT": "Alt",
        "META": "Meta",
        "CMD": "Meta",
        "COMMAND": "Meta",
    }
    return key_map.get(key, key)


def normalize_drag_path(path):
    """Accept drag paths as either [x, y] pairs or {x, y} objects."""
    if not isinstance(path, list):
        raise ValueError("drag action requires a path array")

    normalized = []
    for point in path:
        if isinstance(point, (list, tuple)) and len(point) >= 2:
            normalized.append((point[0], point[1]))
        elif isinstance(point, dict) and "x" in point and "y" in point:
            normalized.append((point["x"], point["y"]))
        else:
            raise ValueError(
                "drag path entries must be coordinate pairs or {x, y} objects"
            )
    return normalized
```

  </div>
  <div data-content-switcher-pane data-value="docker" hidden>
    <div class="hidden">Docker</div>
    Normalization helpers

```javascript
// Map model-emitted key names to the names xdotool expects.
const normalizeXdotoolKey = (key) => {
  switch (key) {
    case "ENTER":
    case "RETURN":
      return "Return";
    case "ESC":
    case "ESCAPE":
      return "Escape";
    case "TAB":
      return "Tab";
    case "SPACE":
      return "space";
    case "BACKSPACE":
      return "BackSpace";
    case "DELETE":
    case "DEL":
      return "Delete";
    case "HOME":
      return "Home";
    case "END":
      return "End";
    case "PAGEUP":
      return "Page_Up";
    case "PAGEDOWN":
      return "Page_Down";
    case "UP":
    case "ARROWUP":
      return "Up";
    case "DOWN":
    case "ARROWDOWN":
      return "Down";
    case "LEFT":
    case "ARROWLEFT":
      return "Left";
    case "RIGHT":
    case "ARROWRIGHT":
      return "Right";
    case "CTRL":
    case "CONTROL":
      return "ctrl";
    case "SHIFT":
      return "shift";
    case "OPTION":
    case "ALT":
      return "alt";
    case "META":
    case "CMD":
    case "COMMAND":
      return "super";
    default:
      return key;
  }
};

// Accept drag paths as either [x, y] pairs or {x, y} objects.
const normalizeDragPath = (path) => {
  if (!Array.isArray(path)) {
    throw new Error("drag action requires a path array");
  }

  return path.map((point) => {
    if (Array.isArray(point) && point.length >= 2) {
      return [point[0], point[1]];
    }
    if (point && typeof point === "object" && "x" in point && "y" in point) {
      return [point.x, point.y];
    }
    throw new Error("drag path entries must be coordinate pairs or {x, y} objects");
  });
};
```

```python
def normalize_xdotool_key(key):
    """Map model-emitted key names to the names xdotool expects."""
    key_map = {
        "ENTER": "Return",
        "RETURN": "Return",
        "ESC": "Escape",
        "ESCAPE": "Escape",
        "TAB": "Tab",
        "SPACE": "space",
        "BACKSPACE": "BackSpace",
        "DELETE": "Delete",
        "DEL": "Delete",
        "HOME": "Home",
        "END": "End",
        "PAGEUP": "Page_Up",
        "PAGEDOWN": "Page_Down",
        "UP": "Up",
        "DOWN": "Down",
        "LEFT": "Left",
        "RIGHT": "Right",
        "ARROWUP": "Up",
        "ARROWDOWN": "Down",
        "ARROWLEFT": "Left",
        "ARROWRIGHT": "Right",
        "CTRL": "ctrl",
        "CONTROL": "ctrl",
        "SHIFT": "shift",
        "OPTION": "alt",
        "ALT": "alt",
        "META": "super",
        "CMD": "super",
        "COMMAND": "super",
    }
    return key_map.get(key, key)


def normalize_drag_path(path):
    """Accept drag paths as either [x, y] pairs or {x, y} objects."""
    if not isinstance(path, list):
        raise ValueError("drag action requires a path array")

    normalized = []
    for point in path:
        if isinstance(point, (list, tuple)) and len(point) >= 2:
            normalized.append((point[0], point[1]))
        elif isinstance(point, dict) and "x" in point and "y" in point:
            normalized.append((point["x"], point["y"]))
        else:
            raise ValueError(
                "drag path entries must be coordinate pairs or {x, y} objects"
            )
    return normalized
```

  </div>

Batched actions in one turn

```json
{
  "output": [
    {
      "type": "computer_call",
      "call_id": "call_002",
      "actions": [
        { "type": "click", "button": "left", "x": 405, "y": 157 },
        { "type": "type", "text": "penguin" }
      ],
      "status": "completed"
    }
  ]
}
```

The following helpers show how to run a batch of actions in either environment:

<div data-content-switcher-pane data-value="playwright">
  <div class="hidden">Playwright</div>
  Execute Computer use actions

```javascript
// Reuse normalizeKey from the helper above.
// Reuse normalizeDragPath from the helper above.

async function handleComputerActions(page, actions) {
  for (const action of actions) {
    switch (action.type) {
      case "click":
        await page.mouse.click(action.x, action.y, {
          button: action.button ?? "left",
        });
        break;
      case "double_click":
        await page.mouse.dblclick(action.x, action.y, {
          button: action.button ?? "left",
        });
        break;
      case "drag": {
        const path = normalizeDragPath(action.path);
        if (path.length < 2) {
          throw new Error("drag action requires at least two path points");
        }
        const [[startX, startY], ...rest] = path;
        await page.mouse.move(startX, startY);
        await page.mouse.down();
        for (const [x, y] of rest) {
          await page.mouse.move(x, y);
        }
        await page.mouse.up();
        break;
      }
      case "move":
        await page.mouse.move(action.x, action.y);
        break;
      case "scroll":
        await page.mouse.move(action.x, action.y);
        await page.mouse.wheel(action.scrollX ?? 0, action.scrollY ?? 0);
        break;
      case "keypress":
        for (const key of action.keys) {
          await page.keyboard.press(normalizeKey(key));
        }
        break;
      case "type":
        await page.keyboard.type(action.text);
        break;
      case "wait":
        // Pause so the UI can settle, matching the Python example.
        await new Promise((resolve) => setTimeout(resolve, 2000));
        break;
      case "screenshot":
        // Nothing to execute; the loop captures the screen after the batch.
        break;
      default:
        throw new Error(`Unsupported action: ${action.type}`);
    }
  }
}
```

```python
import time

# Reuse normalize_key from the helper above.
# Reuse normalize_drag_path from the helper above.


def handle_computer_actions(page, actions):
    for action in actions:
        match action.type:
            case "click":
                page.mouse.click(
                    action.x,
                    action.y,
                    button=getattr(action, "button", "left"),
                )
            case "double_click":
                page.mouse.dblclick(
                    action.x,
                    action.y,
                    button=getattr(action, "button", "left"),
                )
            case "drag":
                path = normalize_drag_path(action.path)
                if len(path) < 2:
                    raise ValueError("drag action requires at least two path points")
                start_x, start_y = path[0]
                page.mouse.move(start_x, start_y)
                page.mouse.down()
                for x, y in path[1:]:
                    page.mouse.move(x, y)
                page.mouse.up()
            case "move":
                page.mouse.move(action.x, action.y)
            case "scroll":
                page.mouse.move(action.x, action.y)
                page.mouse.wheel(
                    getattr(action, "scrollX", 0),
                    getattr(action, "scrollY", 0),
                )
            case "keypress":
                for key in action.keys:
                    page.keyboard.press(normalize_key(key))
            case "type":
                page.keyboard.type(action.text)
            case "wait":
                time.sleep(2)
            case "screenshot":
                pass
            case _:
                raise ValueError(f"Unsupported action: {action.type}")
```

  </div>
  <div data-content-switcher-pane data-value="docker" hidden>
    <div class="hidden">Docker</div>
    Execute Computer use actions

```javascript
// Reuse normalizeXdotoolKey from the helper above.
// Reuse normalizeDragPath from the helper above.

// Quote text for the container shell so an apostrophe cannot end the argument.
const shellQuote = (text) => `'${String(text).replace(/'/g, `'\\''`)}'`;

async function handleComputerActions(vm, actions) {
  const buttonMap = { left: 1, middle: 2, right: 3 };

  for (const action of actions) {
    switch (action.type) {
      case "click": {
        const button = buttonMap[action.button ?? "left"] ?? 1;
        await dockerExec(
          `DISPLAY=${vm.display} xdotool mousemove ${action.x} ${action.y} click ${button}`,
          vm.containerName
        );
        break;
      }
      case "double_click": {
        const button = buttonMap[action.button ?? "left"] ?? 1;
        await dockerExec(
          `DISPLAY=${vm.display} xdotool mousemove ${action.x} ${action.y} click --repeat 2 ${button}`,
          vm.containerName
        );
        break;
      }
      case "drag": {
        const path = normalizeDragPath(action.path);
        if (path.length < 2) {
          throw new Error("drag action requires at least two path points");
        }
        const [[startX, startY], ...rest] = path;
        await dockerExec(
          `DISPLAY=${vm.display} xdotool mousemove ${startX} ${startY} mousedown 1`,
          vm.containerName
        );
        for (const [x, y] of rest) {
          await dockerExec(
            `DISPLAY=${vm.display} xdotool mousemove ${x} ${y}`,
            vm.containerName
          );
        }
        await dockerExec(
          `DISPLAY=${vm.display} xdotool mouseup 1`,
          vm.containerName
        );
        break;
      }
      case "move":
        await dockerExec(
          `DISPLAY=${vm.display} xdotool mousemove ${action.x} ${action.y}`,
          vm.containerName
        );
        break;
      case "scroll": {
        // Default to 0 so a missing scrollY cannot produce NaN.
        const scrollY = action.scrollY ?? 0;
        // X11 maps wheel-up to button 4 and wheel-down to button 5.
        const button = scrollY < 0 ? 4 : 5;
        const clicks = Math.max(1, Math.abs(Math.round(scrollY / 100)));
        await dockerExec(
          `DISPLAY=${vm.display} xdotool mousemove ${action.x} ${action.y}`,
          vm.containerName
        );
        for (let i = 0; i < clicks; i += 1) {
          await dockerExec(
            `DISPLAY=${vm.display} xdotool click ${button}`,
            vm.containerName
          );
        }
        break;
      }
      case "keypress":
        for (const key of action.keys) {
          await dockerExec(
            `DISPLAY=${vm.display} xdotool key '${normalizeXdotoolKey(key)}'`,
            vm.containerName
          );
        }
        break;
      case "type":
        await dockerExec(
          `DISPLAY=${vm.display} xdotool type --delay 0 ${shellQuote(action.text)}`,
          vm.containerName
        );
        break;
      case "wait":
        // Pause so the UI can settle, matching the Python example.
        await new Promise((resolve) => setTimeout(resolve, 2000));
        break;
      case "screenshot":
        // Nothing to execute; the loop captures the screen after the batch.
        break;
      default:
        throw new Error(`Unsupported action: ${action.type}`);
    }
  }
}
```

```python
import shlex
import time

# Reuse normalize_xdotool_key from the helper above.
# Reuse normalize_drag_path from the helper above.


def handle_computer_actions(vm, actions):
    button_map = {"left": 1, "middle": 2, "right": 3}

    for action in actions:
        match action.type:
            case "click":
                button = button_map.get(getattr(action, "button", "left"), 1)
                docker_exec(
                    f"DISPLAY={vm.display} xdotool mousemove {action.x} {action.y} click {button}",
                    vm.container_name,
                )
            case "double_click":
                button = button_map.get(getattr(action, "button", "left"), 1)
                docker_exec(
                    f"DISPLAY={vm.display} xdotool mousemove {action.x} {action.y} click --repeat 2 {button}",
                    vm.container_name,
                )
            case "drag":
                path = normalize_drag_path(action.path)
                if len(path) < 2:
                    raise ValueError("drag action requires at least two path points")
                start_x, start_y = path[0]
                docker_exec(
                    f"DISPLAY={vm.display} xdotool mousemove {start_x} {start_y} mousedown 1",
                    vm.container_name,
                )
                for x, y in path[1:]:
                    docker_exec(
                        f"DISPLAY={vm.display} xdotool mousemove {x} {y}",
                        vm.container_name,
                    )
                docker_exec(
                    f"DISPLAY={vm.display} xdotool mouseup 1",
                    vm.container_name,
                )
            case "move":
                docker_exec(
                    f"DISPLAY={vm.display} xdotool mousemove {action.x} {action.y}",
                    vm.container_name,
                )
            case "scroll":
                # X11 maps wheel-up to button 4 and wheel-down to button 5.
                button = 4 if getattr(action, "scrollY", 0) < 0 else 5
                clicks = max(1, abs(round(getattr(action, "scrollY", 0) / 100)))

                docker_exec(
                    f"DISPLAY={vm.display} xdotool mousemove {action.x} {action.y}",
                    vm.container_name,
                )
                for _ in range(clicks):
                    docker_exec(
                        f"DISPLAY={vm.display} xdotool click {button}",
                        vm.container_name,
                    )
            case "keypress":
                for key in action.keys:
                    docker_exec(
                        f"DISPLAY={vm.display} xdotool key '{normalize_xdotool_key(key)}'",
                        vm.container_name,
                    )
            case "type":
                # Quote the text so an apostrophe cannot end the shell argument.
                docker_exec(
                    f"DISPLAY={vm.display} xdotool type --delay 0 {shlex.quote(action.text)}",
                    vm.container_name,
                )
            case "wait":
                time.sleep(2)
            case "screenshot":
                pass
            case _:
                raise ValueError(f"Unsupported action: {action.type}")
```

  </div>

You may also need to map model-emitted key names such as `CTRL`, `ALT`, `META`, and `ARROWLEFT` to the names your runtime expects.
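
Key names may not always arrive in the exact casing a lookup table expects. The sketch below shows one way to make the lookup case-insensitive while leaving unknown keys untouched. `KEY_MAP` is a deliberately small, hypothetical subset of a Playwright-style mapping, and `normalize_key_any_case` is an illustrative name rather than an existing helper.

```python
# Hypothetical subset of the mapping; extend it to match your runtime.
KEY_MAP = {
    "CTRL": "Control",
    "ALT": "Alt",
    "META": "Meta",
    "ARROWLEFT": "ArrowLeft",
}


def normalize_key_any_case(key: str) -> str:
    """Look up the key by its upper-cased name, falling back to the input."""
    return KEY_MAP.get(key.upper(), key)


for name in ["CTRL", "ctrl", "ArrowLeft", "a"]:
    print(name, "->", normalize_key_any_case(name))
```

Falling back to the original string keeps single characters such as `a` intact, so they can still be passed straight to the runtime.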

Modifier-assisted action

```json
{
  "output": [
    {
      "type": "computer_call",
      "call_id": "call_003",
      "actions": [
        {
          "type": "click",
          "button": "left",
          "x": 405,
          "y": 157,
          "keys": ["SHIFT"]
        }
      ],
      "status": "completed"
    }
  ]
}
```

<div data-content-switcher-pane data-value="playwright">
  <div class="hidden">Playwright</div>
  Execute modifier-assisted Computer use actions

```javascript
// Reuse normalizeKey from the helper above.
// Reuse normalizeDragPath from the helper above.

// Hold the modifier keys down while the callback runs, then release them
// in reverse order even if the callback throws.
async function withModifiers(page, keys, callback) {
  const normalizedKeys = (keys ?? []).map(normalizeKey);
  const pressedKeys = [];

  try {
    for (const key of normalizedKeys) {
      await page.keyboard.down(key);
      pressedKeys.push(key);
    }

    await callback();
  } finally {
    for (const key of [...pressedKeys].reverse()) {
      await page.keyboard.up(key);
    }
  }
}

async function handleComputerActions(page, actions) {
  for (const action of actions) {
    switch (action.type) {
      case "click":
        await withModifiers(page, action.keys, async () => {
          await page.mouse.click(action.x, action.y, {
            button: action.button ?? "left",
          });
        });
        break;
      case "double_click":
        await withModifiers(page, action.keys, async () => {
          await page.mouse.dblclick(action.x, action.y, {
            button: action.button ?? "left",
          });
        });
        break;
      case "drag": {
        const path = normalizeDragPath(action.path);
        if (path.length < 2) {
          throw new Error("drag action requires at least two path points");
        }
        await withModifiers(page, action.keys, async () => {
          const [[startX, startY], ...rest] = path;
          await page.mouse.move(startX, startY);
          await page.mouse.down();
          for (const [x, y] of rest) {
            await page.mouse.move(x, y);
          }
          await page.mouse.up();
        });
        break;
      }
      case "move":
        await withModifiers(page, action.keys, async () => {
          await page.mouse.move(action.x, action.y);
        });
        break;
      case "scroll":
        await withModifiers(page, action.keys, async () => {
          await page.mouse.move(action.x, action.y);
          await page.mouse.wheel(action.scrollX ?? 0, action.scrollY ?? 0);
        });
        break;
      case "keypress":
        for (const key of action.keys) {
          await page.keyboard.press(normalizeKey(key));
        }
        break;
      case "type":
        await page.keyboard.type(action.text);
        break;
      case "wait":
        // Pause so the UI can settle, matching the Python example.
        await new Promise((resolve) => setTimeout(resolve, 2000));
        break;
      case "screenshot":
        // Nothing to execute; the loop captures the screen after the batch.
        break;
      default:
        throw new Error(`Unsupported action: ${action.type}`);
    }
  }
}
```

```python
import time

# Reuse normalize_key from the helper above.
# Reuse normalize_drag_path from the helper above.


def with_modifiers(page, keys, callback):
    """Hold the modifier keys down while callback runs, then release them."""
    normalized_keys = [normalize_key(key) for key in (keys or [])]
    pressed_keys = []

    try:
        for key in normalized_keys:
            page.keyboard.down(key)
            pressed_keys.append(key)

        callback()
    finally:
        for key in reversed(pressed_keys):
            page.keyboard.up(key)


def handle_computer_actions(page, actions):
    for action in actions:
        match action.type:
            case "click":
                with_modifiers(
                    page,
                    getattr(action, "keys", None),
                    lambda: page.mouse.click(
                        action.x,
                        action.y,
                        button=getattr(action, "button", "left"),
                    ),
                )
            case "double_click":
                with_modifiers(
                    page,
                    getattr(action, "keys", None),
                    lambda: page.mouse.dblclick(
                        action.x,
                        action.y,
                        button=getattr(action, "button", "left"),
                    ),
                )
            case "drag":
                path = normalize_drag_path(action.path)
                if len(path) < 2:
                    raise ValueError("drag action requires at least two path points")

                def do_drag():
                    start_x, start_y = path[0]
                    page.mouse.move(start_x, start_y)
                    page.mouse.down()
                    for x, y in path[1:]:
                        page.mouse.move(x, y)
                    page.mouse.up()

                with_modifiers(
                    page,
                    getattr(action, "keys", None),
                    do_drag,
                )
            case "move":
                with_modifiers(
                    page,
                    getattr(action, "keys", None),
                    lambda: page.mouse.move(action.x, action.y),
                )
            case "scroll":
                with_modifiers(
                    page,
                    getattr(action, "keys", None),
                    lambda: (
                        page.mouse.move(action.x, action.y),
                        page.mouse.wheel(
                            getattr(action, "scrollX", 0),
                            getattr(action, "scrollY", 0),
                        ),
                    ),
                )
            case "keypress":
                for key in action.keys:
                    page.keyboard.press(normalize_key(key))
            case "type":
                page.keyboard.type(action.text)
            case "wait":
                time.sleep(2)
            case "screenshot":
                pass
            case _:
                raise ValueError(f"Unsupported action: {action.type}")
```

  </div>
  <div data-content-switcher-pane data-value="docker" hidden>
    <div class="hidden">Docker</div>
    Execute modifier-assisted Computer use actions

```javascript
// Reuse normalizeXdotoolKey from the helper above.
// Reuse normalizeDragPath from the helper above.

// Quote text for the container shell so an apostrophe cannot end the argument.
const shellQuote = (text) => `'${String(text).replace(/'/g, `'\\''`)}'`;

// Hold the modifier keys down while the callback runs, then release them
// in reverse order even if the callback throws.
async function withModifiers(vm, keys, callback) {
  const normalizedKeys = (keys ?? []).map(normalizeXdotoolKey);
  const pressedKeys = [];

  try {
    for (const key of normalizedKeys) {
      await dockerExec(
        `DISPLAY=${vm.display} xdotool keydown '${key}'`,
        vm.containerName
      );
      pressedKeys.push(key);
    }

    await callback();
  } finally {
    for (const key of [...pressedKeys].reverse()) {
      await dockerExec(
        `DISPLAY=${vm.display} xdotool keyup '${key}'`,
        vm.containerName
      );
    }
  }
}

async function handleComputerActions(vm, actions) {
  const buttonMap = { left: 1, middle: 2, right: 3 };

  for (const action of actions) {
    switch (action.type) {
      case "click": {
        const button = buttonMap[action.button ?? "left"] ?? 1;
        await withModifiers(vm, action.keys, async () => {
          await dockerExec(
            `DISPLAY=${vm.display} xdotool mousemove ${action.x} ${action.y} click ${button}`,
            vm.containerName
          );
        });
        break;
      }
      case "double_click": {
        const button = buttonMap[action.button ?? "left"] ?? 1;
        await withModifiers(vm, action.keys, async () => {
          await dockerExec(
            `DISPLAY=${vm.display} xdotool mousemove ${action.x} ${action.y} click --repeat 2 ${button}`,
            vm.containerName
          );
        });
        break;
      }
      case "drag": {
        const path = normalizeDragPath(action.path);
        if (path.length < 2) {
          throw new Error("drag action requires at least two path points");
        }
        await withModifiers(vm, action.keys, async () => {
          const [[startX, startY], ...rest] = path;
          await dockerExec(
            `DISPLAY=${vm.display} xdotool mousemove ${startX} ${startY} mousedown 1`,
            vm.containerName
          );
          for (const [x, y] of rest) {
            await dockerExec(
              `DISPLAY=${vm.display} xdotool mousemove ${x} ${y}`,
              vm.containerName
            );
          }
          await dockerExec(
            `DISPLAY=${vm.display} xdotool mouseup 1`,
            vm.containerName
          );
        });
        break;
      }
      case "move": {
        await withModifiers(vm, action.keys, async () => {
          await dockerExec(
            `DISPLAY=${vm.display} xdotool mousemove ${action.x} ${action.y}`,
            vm.containerName
          );
        });
        break;
      }
      case "scroll": {
        // Default to 0 so a missing scrollY cannot produce NaN.
        const scrollY = action.scrollY ?? 0;
        // X11 maps wheel-up to button 4 and wheel-down to button 5.
        const button = scrollY < 0 ? 4 : 5;
        const clicks = Math.max(1, Math.abs(Math.round(scrollY / 100)));
        await withModifiers(vm, action.keys, async () => {
          await dockerExec(
            `DISPLAY=${vm.display} xdotool mousemove ${action.x} ${action.y}`,
            vm.containerName
          );
          for (let i = 0; i < clicks; i += 1) {
            await dockerExec(
              `DISPLAY=${vm.display} xdotool click ${button}`,
              vm.containerName
            );
          }
        });
        break;
      }
      case "keypress":
        for (const key of action.keys) {
          await dockerExec(
            `DISPLAY=${vm.display} xdotool key '${normalizeXdotoolKey(key)}'`,
            vm.containerName
          );
        }
        break;
      case "type":
        await dockerExec(
          `DISPLAY=${vm.display} xdotool type --delay 0 ${shellQuote(action.text)}`,
          vm.containerName
        );
        break;
      case "wait":
        // Pause so the UI can settle, matching the Python example.
        await new Promise((resolve) => setTimeout(resolve, 2000));
        break;
      case "screenshot":
        // Nothing to execute; the loop captures the screen after the batch.
        break;
      default:
        throw new Error(`Unsupported action: ${action.type}`);
    }
  }
}
```

```python
import time

# Reuse normalize_xdotool_key from the helper above.
# Reuse normalize_drag_path from the helper above.


def with_modifiers(vm, keys, callback):
    normalized_keys = [normalize_xdotool_key(key) for key in (keys or [])]
    pressed_keys = []

    try:
        for key in normalized_keys:
            docker_exec(
                f"DISPLAY={vm.display} xdotool keydown '{key}'",
                vm.container_name,
            )
            pressed_keys.append(key)

        callback()
    finally:
        for key in reversed(pressed_keys):
            docker_exec(
                f"DISPLAY={vm.display} xdotool keyup '{key}'",
                vm.container_name,
            )


def handle_computer_actions(vm, actions):
    button_map = {"left": 1, "middle": 2, "right": 3}

    for action in actions:
        match action.type:
            case "click":
                button = button_map.get(getattr(action, "button", "left"), 1)
                with_modifiers(
                    vm,
                    getattr(action, "keys", None),
                    lambda: docker_exec(
                        f"DISPLAY={vm.display} xdotool mousemove {action.x} {action.y} click {button}",
                        vm.container_name,
                    ),
                )
            case "double_click":
                button = button_map.get(getattr(action, "button", "left"), 1)
                with_modifiers(
                    vm,
                    getattr(action, "keys", None),
                    lambda: docker_exec(
                        f"DISPLAY={vm.display} xdotool mousemove {action.x} {action.y} click --repeat 2 {button}",
                        vm.container_name,
                    ),
                )
            case "drag":
                path = normalize_drag_path(action.path)
                if len(path) < 2:
                    raise ValueError("drag action requires at least two path points")

                def do_drag():
                    start_x, start_y = path[0]
                    docker_exec(
                        f"DISPLAY={vm.display} xdotool mousemove {start_x} {start_y} mousedown 1",
                        vm.container_name,
                    )
                    for x, y in path[1:]:
                        docker_exec(
                            f"DISPLAY={vm.display} xdotool mousemove {x} {y}",
                            vm.container_name,
                        )
                    docker_exec(
                        f"DISPLAY={vm.display} xdotool mouseup 1",
                        vm.container_name,
                    )

                with_modifiers(vm, getattr(action, "keys", None), do_drag)
            case "move":
                with_modifiers(
                    vm,
                    getattr(action, "keys", None),
                    lambda: docker_exec(
                        f"DISPLAY={vm.display} xdotool mousemove {action.x} {action.y}",
                        vm.container_name,
                    ),
                )
            case "scroll":
                button = 4 if getattr(action, "scrollY", 0) < 0 else 5
                clicks = max(1, abs(round(getattr(action, "scrollY", 0) / 100)))

                def do_scroll():
                    docker_exec(
                        f"DISPLAY={vm.display} xdotool mousemove {action.x} {action.y}",
                        vm.container_name,
                    )
                    for _ in range(clicks):
                        docker_exec(
                            f"DISPLAY={vm.display} xdotool click {button}",
                            vm.container_name,
                        )

                with_modifiers(vm, getattr(action, "keys", None), do_scroll)
            case "keypress":
                for key in action.keys:
                    docker_exec(
                        f"DISPLAY={vm.display} xdotool key '{normalize_xdotool_key(key)}'",
                        vm.container_name,
                    )
            case "type":
                docker_exec(
                    f"DISPLAY={vm.display} xdotool type --delay 0 '{action.text}'",
                    vm.container_name,
                )
            case "wait":
                time.sleep(2)
            case "screenshot":
                pass
            case _:
                raise ValueError(f"Unsupported action: {action.type}")
```

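The Docker handlers above interpolate `action.text` directly into a single-quoted shell string, so typed text that contains a single quote would end the quoting early. The following is a minimal sketch of one way to harden that step, assuming `docker_exec` hands the command string to a POSIX shell unchanged. `build_type_command` is a hypothetical helper name and is not part of the original handler.

```python
import shlex


def build_type_command(display: str, text: str) -> str:
    """Build an xdotool `type` command with the text safely shell-quoted.

    Hypothetical helper: assumes the resulting string is run by a POSIX shell.
    """
    # shlex.quote wraps the value in single quotes and escapes embedded quotes.
    return f"DISPLAY={shlex.quote(display)} xdotool type --delay 0 {shlex.quote(text)}"


command = build_type_command(":99", "it's done; rm -rf /")
# The shell sees the whole text as a single argument to xdotool.
print(shlex.split(command)[-1])
```

In the handler, the `type` case could pass the result of such a helper to `docker_exec` instead of formatting the text inline. The same quoting idea applies to key names passed to `xdotool key`.
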
  </div>

<div data-content-switcher-pane data-value="playwright">
  <div class="hidden">Playwright</div>
  Capture a screenshot

```javascript
async function captureScreenshot(page) {
  return await page.screenshot({ type: "png" });
}
```

```python
def capture_screenshot(page):
    return page.screenshot(type="png")
```

  </div>
  <div data-content-switcher-pane data-value="docker" hidden>
  <div class="hidden">Docker</div>
  Capture a screenshot

```javascript
async function captureScreenshot(vm) {
  return await dockerExec(
    `export DISPLAY=${vm.display} && import -window root png:-`,
    vm.containerName,
    false
  );
}
```

```python
def capture_screenshot(vm):
    return docker_exec(
        f"export DISPLAY={vm.display} && import -window root png:-",
        vm.container_name,
        decode=False,
    )
```

  </div>

For Computer use, prefer `detail: "original"` on screenshot inputs. This preserves the full screenshot resolution, up to 10.24M pixels, and improves click accuracy. If `detail: "original"` uses too many tokens, downscale the image before sending it to the API, then remap the model-generated coordinates from the downscaled coordinate space back to the original image's coordinate space. Avoid `high` or `low` image detail for computer use tasks. When downscaling, we observe strong performance with 1440x900 and 1600x900 desktop resolutions. See the [Images and Vision guide](https://developers.openai.com/api/docs/guides/images-vision) for more details on image input detail levels.

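If you downscale screenshots, the remapping step is a per-axis scale from the size you sent to the real screen size. The sketch below illustrates that arithmetic; `remap_point` and its parameters are hypothetical names, and it assumes the model returns coordinates in the pixel space of the image you sent.

```python
def remap_point(x, y, sent_size, original_size):
    """Map a point from the downscaled screenshot back to the original screen."""
    sent_w, sent_h = sent_size
    orig_w, orig_h = original_size
    # Scale each axis independently in case the aspect ratio changed slightly.
    mapped_x = round(x * orig_w / sent_w)
    mapped_y = round(y * orig_h / sent_h)
    # Clamp so rounding never produces a coordinate outside the screen.
    return (
        min(max(mapped_x, 0), orig_w - 1),
        min(max(mapped_y, 0), orig_h - 1),
    )


# A 2880x1800 screen that was downscaled to 1440x900 before sending.
click = remap_point(720, 450, sent_size=(1440, 900), original_size=(2880, 1800))
print(click)  # (1440, 900)
```

Apply the same mapping to every coordinate in an action, including each point of a drag path, before executing it.
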
Send the updated screenshot

```javascript
import OpenAI from "openai";

const client = new OpenAI();

async function sendComputerScreenshot(response, callId, screenshotBase64) {
  return await client.responses.create({
    model: "gpt-5.5",
    tools: [{ type: "computer" }],
    previous_response_id: response.id,
    input: [
      {
        type: "computer_call_output",
        call_id: callId,
        output: {
          type: "computer_screenshot",
          image_url: `data:image/png;base64,${screenshotBase64}`,
          detail: "original",
        },
      },
    ],
  });
}
```

```python
from openai import OpenAI

client = OpenAI()


def send_computer_screenshot(response, call_id, screenshot_base64):
    return client.responses.create(
        model="gpt-5.5",
        tools=[{"type": "computer"}],
        previous_response_id=response.id,
        input=[
            {
                "type": "computer_call_output",
                "call_id": call_id,
                "output": {
                    "type": "computer_screenshot",
                    "image_url": f"data:image/png;base64,{screenshot_base64}",
                    "detail": "original",
                },
            }
        ],
    )
```

### 5. Repeat until the tool stops calling

The easiest way to continue the loop is to send `previous_response_id` on each follow-up turn and keep reusing the same tool definition.

Repeat the Computer use loop

```javascript
import OpenAI from "openai";

const client = new OpenAI();

async function computerUseLoop(target, response) {
  while (true) {
    const computerCall = response.output.find((item) => item.type === "computer_call");
    if (!computerCall) {
      return response;
    }

    await handleComputerActions(target, computerCall.actions);

    const screenshot = await captureScreenshot(target);
    const screenshotBase64 = Buffer.from(screenshot).toString("base64");

    response = await client.responses.create({
      model: "gpt-5.5",
      tools: [{ type: "computer" }],
      previous_response_id: response.id,
      input: [
        {
          type: "computer_call_output",
          call_id: computerCall.call_id,
          output: {
            type: "computer_screenshot",
            image_url: `data:image/png;base64,${screenshotBase64}`,
            detail: "original",
          },
        },
      ],
    });
  }
}
```

```python
import base64

from openai import OpenAI

client = OpenAI()


def computer_use_loop(target, response):
    while True:
        computer_call = next(
            (item for item in response.output if item.type == "computer_call"),
            None,
        )
        if computer_call is None:
            return response

        handle_computer_actions(target, computer_call.actions)

        screenshot = capture_screenshot(target)
        screenshot_base64 = base64.b64encode(screenshot).decode("utf-8")

        response = client.responses.create(
            model="gpt-5.5",
            tools=[{"type": "computer"}],
            previous_response_id=response.id,
            input=[
                {
                    "type": "computer_call_output",
                    "call_id": computer_call.call_id,
                    "output": {
                        "type": "computer_screenshot",
                        "image_url": f"data:image/png;base64,{screenshot_base64}",
                        "detail": "original",
                    },
                }
            ],
        )
```

When the response no longer contains a `computer_call`, read the remaining output items as the model's final answer or handoff.

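Reading the final answer can be as simple as collecting the text parts of the `message` items in the last response. The sketch below is one way to do that, assuming message items carry a `content` list whose parts expose a `text` attribute. `final_text` is a hypothetical helper, and the `SimpleNamespace` objects only stand in for a real API response.

```python
from types import SimpleNamespace


def final_text(response):
    """Join the text parts of every message item in response.output."""
    chunks = []
    for item in response.output:
        if getattr(item, "type", None) != "message":
            continue
        for part in getattr(item, "content", None) or []:
            text = getattr(part, "text", None)
            if isinstance(text, str) and text:
                chunks.append(text)
    return "\n".join(chunks)


# Stand-in response for illustration only; a real one comes from the API.
example = SimpleNamespace(
    output=[
        SimpleNamespace(type="reasoning", content=None),
        SimpleNamespace(
            type="message",
            content=[SimpleNamespace(type="output_text", text="The form was submitted.")],
        ),
    ]
)
summary = final_text(example)
print(summary)  # The form was submitted.
```

Calling `final_text` on the value returned by the loop gives you the text to show the user or pass to the next stage of your application.
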
### Possible Computer use actions

<div data-content-switcher-pane data-value="javascript">
  <div class="hidden">JavaScript</div>
  Code-execution harness

```javascript
// Run with:
//   bun run -i cua_code_mode.ts
// Override the user prompt with:
//   bun run -i cua_code_mode.ts --prompt "Go to example.com and summarize the page."
// Note: this script intentionally leaves the Playwright browser open after the
// model reaches a final answer. Because the browser/context are not closed,
// Bun stays alive until you close the browser or stop the process manually.

import OpenAI from "openai";
import readline from "node:readline/promises";
import vm from "node:vm";
import { chromium } from "playwright";
import util from "node:util";

async function main(
  prompt: string = "Go to Hacker News, click on the most interesting link (be prepared to justify your choice), take a screenshot, and give me a critique of the visual layout.",
  max_steps: number = 50,
  model: string = "gpt-5.5"
) {
  type Phase = null | "commentary" | "final_answer";
  const client = new OpenAI();
  const rl = readline.createInterface({
    input: process.stdin,
    output: process.stdout,
  });
  const browser = await chromium.launch({
    headless: false,
    args: ["--window-size=1440,900"],
  });
  const context = await browser.newContext({
    viewport: { width: 1440, height: 900 },
  });
  const page = await context.newPage();

  const conversation: any[] = [];
  const js_output: any[] = [];
  const sandbox: Record<string, any> = {
    console: {
      log: (...xs: any[]) => {
        js_output.push({
          type: "input_text",
          text: util.formatWithOptions(
            { showHidden: false, getters: false, maxStringLength: 2000 },
            ...xs
          ),
        });
      },
    },
    browser: browser,
    context: context,
    page: page,
    display: (base64_image: string) => {
      js_output.push({
        type: "input_image",
        image_url: `data:image/png;base64,${base64_image}`,
        detail: "original",
      });
    },
  };
  const ctx = vm.createContext(sandbox);

  conversation.push({
    role: "user",
    content: prompt,
  });

  for (let i = 0; i < max_steps; i++) {
    const resp = await client.responses.create({
      model,
      tools: [
        {
          type: "function" as const,
          name: "exec_js",
          description:
            "Execute provided interactive JavaScript in a persistent REPL context.",
          parameters: {
            type: "object",
            properties: {
              code: {
                type: "string",
                description: `
JavaScript to execute. Write small snippets of interactive code. To persist variables or functions across tool calls, you must save them to globalThis. Code is executed in an async node:vm context, so you can use await. You have access to ONLY the following:
- console.log(x): Use this to read contents back to you. But be minimal: otherwise the output may be too long. Avoid using console.log() for large base64 payloads like screenshots or buffer. If you create an image or screenshot, pass the base64 string to display().
- display(base64_image_string): Use this to view a base64-encoded image.
- Do not write screenshots or image data to temporary files or disk just to pass them back. Keep image data in memory and send it directly to display().
- Do not assume package globals like Bun.file are available unless they are explicitly provided.
- browser: A playwright chromium browser instance.
- context: A playwright browser context with viewport 1440x900.
- page: A playwright page already created in that context.
`,
              },
            },
            required: ["code"],
            additionalProperties: false,
          },
        },
        {
          type: "function" as const,
          name: "ask_user",
          description:
            "Ask the user a clarification question and wait for their response.",
          parameters: {
            type: "object",
            properties: {
              question: {
                type: "string",
                description:
                  "The exact question to show the human. Use this instead of answering with a freeform clarifying question in a final answer.",
              },
            },
            required: ["question"],
            additionalProperties: false,
          },
        },
      ],
      input: conversation,
      reasoning: {
        effort: "low",
      },
    });

    // Save model outputs into the running conversation
    conversation.push(...resp.output);

    let hadToolCall = false;
    let latestPhase: Phase = null;

    // Handle tool calls
    for (const item of resp.output) {
      if (item.type === "function_call" && item.name === "exec_js") {
        hadToolCall = true;
        const parsed = JSON.parse(item.arguments ?? "{}") as {
          code?: string;
        };
        const code = parsed.code ?? "";
        console.log(code);
        console.log("----");
        const wrappedCode = `
          (async () => {
            ${code}
          })();
        `;

        try {
          await new vm.Script(wrappedCode, {
            filename: "exec_js.js",
          }).runInContext(ctx);
        } catch (e: any) {
          sandbox.console.log(e, e?.message, e?.stack);
        }

        // Send tool output back to the model, keyed by call_id
        conversation.push({
          type: "function_call_output",
          call_id: item.call_id,
          output: js_output.slice(),
        });

        for (const out of js_output) {
          if (out.type === "input_text") {
            console.log("JS LOG:", out.text);
          } else if (out.type === "input_image") {
            console.log("JS IMAGE: [base64 string omitted]");
          }
        }
        console.log("=====");

        js_output.length = 0;
      } else if (item.type === "function_call" && item.name === "ask_user") {
        hadToolCall = true;
        const parsed = JSON.parse(item.arguments ?? "{}") as {
          question?: string;
        };
        const question = parsed.question ?? "Please provide more information.";
        console.log(`MODEL QUESTION: ${question}`);
        const answer = await rl.question("> ");
        conversation.push({
          type: "function_call_output",
          call_id: item.call_id,
          output: answer,
        });
      } else if (item.type === "message") {
        console.log(item.content[0]?.text ?? item.content);
        if ("phase" in item) {
          latestPhase = (item.phase as Phase) ?? null;
        }
      } else if (item.type === "output_item.done" && "phase" in item) {
        latestPhase = (item.phase as Phase) ?? null;
      }
    }

    // Stop only when the model explicitly marks the turn as a final answer
    // and there were no tool calls in the same turn.
    if (!hadToolCall && latestPhase === "final_answer") return;
  }
}

function getCliPrompt(): string | undefined {
  const args = Bun.argv.slice(2);
  for (let i = 0; i < args.length; i++) {
    if (args[i] === "--prompt") {
      return args[i + 1];
    }
  }
  return undefined;
}

main(getCliPrompt());
```

```python
# /// script
# requires-python = ">=3.10"
# dependencies = [
#   "openai",
#   "playwright",
# ]
# ///
# Run with: `uv run cua_code_mode_py_async.py`
# Override the user prompt with:
#   `uv run cua_code_mode_py_async.py --prompt "Go to example.com and summarize the page."`
# Install Chromium once first: `uv run --with playwright python -m playwright install chromium`
# Requires `OPENAI_API_KEY` in the environment.

"""Async Python analogue of cua_code_mode.ts.

Runs a Responses API loop with one persistent Playwright browser/context/page,
and tools that let the model execute short async Python snippets and ask the
user clarifying questions.

The model can return visual observations by calling:
    display(base64_png_string)
"""

from __future__ import annotations

import argparse
import asyncio
import json
import traceback
from typing import Any

from openai import OpenAI
from playwright.async_api import async_playwright

Phase = str | None


def _message_text(item: Any) -> str:
    try:
        parts = getattr(item, "content", None)
        if isinstance(parts, list) and parts:
            out: list[str] = []
            for p in parts:
                t = getattr(p, "text", None)
                if isinstance(t, str) and t:
                    out.append(t)
            if out:
                return "\n".join(out)
    except Exception:
        pass
    return str(item)


async def _ainput(prompt: str) -> str:
    return await asyncio.to_thread(input, prompt)


async def main(
    prompt: str = "Go to Hacker News, click on the most interesting link (be prepared to justify your choice), take a screenshot, and give me a critique of the visual layout.",
    max_steps: int = 20,
    model: str = "gpt-5.5",
) -> None:
    client = OpenAI()

    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=False,
            args=["--window-size=1440,900"],
        )
        context = await browser.new_context(viewport={"width": 1440, "height": 900})
        page = await context.new_page()

        conversation: list[dict[str, Any]] = [{"role": "user", "content": prompt}]
        py_output: list[dict[str, Any]] = []

        def log(*xs: Any) -> None:
            text = " ".join(str(x) for x in xs)
            py_output.append({"type": "input_text", "text": text[:5000]})

        def display(base64_image: str) -> None:
            py_output.append(
                {
                    "type": "input_image",
                    "image_url": f"data:image/png;base64,{base64_image}",
                    "detail": "original",
                }
            )

        runtime_globals: dict[str, Any] = {
            "__builtins__": __builtins__,
            "asyncio": asyncio,
            "browser": browser,
            "context": context,
            "page": page,
            "display": display,
            "log": log,
        }

        for _ in range(max_steps):
            resp = client.responses.create(
                model=model,
                tools=[
                    {
                        "type": "function",
                        "name": "exec_py",
                        "description": "Execute provided interactive async Python in a persistent runtime context.",
                        "parameters": {
                            "type": "object",
                            "properties": {
                                "code": {
                                    "type": "string",
                                    "description": (
                                        "Python code to execute. Write small snippets. "
                                        "State persists across tool calls via globals(). "
                                        "This runtime uses Playwright's async Python API, so you may use await directly. "
                                        "Do not call asyncio.run(...), loop.run_until_complete(...), or manage the event loop yourself. "
                                        "You can use ONLY these prebound objects/helpers: "
                                        "log(x) for text output, display(base64_png_string) for image output, "
                                        "browser (async Playwright browser), context (viewport 1440x900), page (already created), "
                                        "asyncio (module). "
                                        "Be concise with log(x): do not send large base64 payloads, screenshots, buffers, page HTML, "
                                        "or other large blobs through log(). If you create an image or screenshot, pass the base64 PNG "
                                        "string to display(). Do not write screenshots or image data to temporary files or disk just "
                                        "to pass them back; keep image data in memory and send it directly to display(). "
                                        "Do not assume extra globals or helpers are available unless they are explicitly listed here. "
                                        "Do not close browser/context/page unless explicitly asked."
                                    ),
                                }
                            },
                            "required": ["code"],
                            "additionalProperties": False,
                        },
                    },
                    {
                        "type": "function",
                        "name": "ask_user",
                        "description": "Ask the user a clarification question and wait for their response.",
                        "parameters": {
                            "type": "object",
                            "properties": {
                                "question": {
                                    "type": "string",
                                    "description": "The exact question to show the user. Use this instead of asking a freeform clarifying question in a final answer.",
                                }
                            },
                            "required": ["question"],
                            "additionalProperties": False,
                        },
                    },
                ],
                input=conversation,
            )

            conversation.extend(resp.output)

            had_tool_call = False
            latest_phase: Phase = None

            for item in resp.output:
                item_type = getattr(item, "type", None)

                if item_type == "function_call" and getattr(item, "name", None) == "exec_py":
                    had_tool_call = True
                    raw_args = getattr(item, "arguments", "{}") or "{}"
                    try:
                        args = json.loads(raw_args)
                    except json.JSONDecodeError:
                        args = {}
                    code = args.get("code", "") if isinstance(args, dict) else ""

                    print(code)
                    print("----")

                    wrapped = (
                        "async def __codex_exec__():\n"
                        + "".join(
                            f"    {line}\n" if line else "    \n"
                            for line in (code or "pass").splitlines()
                        )
                    )

                    try:
                        exec(wrapped, runtime_globals, runtime_globals)
                        await runtime_globals["__codex_exec__"]()
                    except Exception:
                        log(traceback.format_exc())

                    conversation.append(
                        {
                            "type": "function_call_output",
                            "call_id": getattr(item, "call_id", None),
                            "output": py_output[:],
                        }
                    )

                    for out in py_output:
                        if out.get("type") == "input_text":
                            print("PY LOG:", out.get("text", ""))
                        elif out.get("type") == "input_image":
                            print("PY IMAGE: [base64 string omitted]")
                    print("=====")

                    py_output.clear()

                elif item_type == "function_call" and getattr(item, "name", None) == "ask_user":
                    had_tool_call = True
                    raw_args = getattr(item, "arguments", "{}") or "{}"
                    try:
                        args = json.loads(raw_args)
                    except json.JSONDecodeError:
                        args = {}
                    question = (
                        args.get("question", "Please provide more information.")
                        if isinstance(args, dict)
                        else "Please provide more information."
                    )

                    print(f"MODEL QUESTION: {question}")
                    answer = await _ainput("> ")

                    conversation.append(
                        {
                            "type": "function_call_output",
                            "call_id": getattr(item, "call_id", None),
                            "output": answer,
                        }
                    )

                elif item_type == "message":
                    print(_message_text(item))
                    phase = getattr(item, "phase", None)
                    if isinstance(phase, str) or phase is None:
                        latest_phase = phase
                elif item_type == "output_item.done":
                    phase = getattr(item, "phase", None)
                    if isinstance(phase, str) or phase is None:
                        latest_phase = phase

            if not had_tool_call and latest_phase == "final_answer":
                return


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--prompt", help="Override the default user prompt.")
    args = parser.parse_args()
    asyncio.run(main(prompt=args.prompt) if args.prompt is not None else main())
```

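The Python harness wraps each snippet in an async function before running it, which is why the tool description tells the model to persist state through `globals()`: plain assignments inside the snippet become function locals and disappear after the call. The following is a minimal sketch of that behavior in isolation. `wrap_snippet`, `run_snippet`, and `__snippet__` are hypothetical names chosen for illustration.

```python
import asyncio


def wrap_snippet(code: str) -> str:
    """Indent a snippet into the body of an async function definition."""
    body = "".join(f"    {line}\n" for line in (code or "pass").splitlines())
    return "async def __snippet__():\n" + body


async def run_snippet(code: str, runtime_globals: dict) -> None:
    # Define the wrapper in the shared namespace, then await it.
    exec(wrap_snippet(code), runtime_globals)
    await runtime_globals["__snippet__"]()


async def demo():
    shared: dict = {}
    # `counter` is a local of the wrapper; only `saved` reaches the shared namespace.
    await run_snippet("counter = 1\nglobals()['saved'] = counter + 1", shared)
    # A later snippet can read what the first one stored.
    await run_snippet("globals()['doubled'] = saved * 2", shared)
    return "counter" in shared, shared["saved"], shared["doubled"]


result = asyncio.run(demo())
print(result)  # (False, 2, 4)
```

One caveat of this line-by-line indentation approach is that it also indents the contents of multi-line string literals inside a snippet, so short snippets are the safer fit.
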
  </div>
  <div data-content-switcher-pane data-value="python" hidden>
  <div class="hidden">Python</div>
  Code-execution harness

```python
# /// script
# requires-python = ">=3.10"
# dependencies = [
#   "openai",
#   "playwright",
# ]
# ///
# Run with: `uv run cua_code_mode_py_async.py`
# Override the user prompt with:
# `uv run cua_code_mode_py_async.py --prompt "Go to example.com and summarize the page."`
# Install Chromium once first: `uv run --with playwright python -m playwright install chromium`
# Requires `OPENAI_API_KEY` in the environment.

"""Async Python analogue of cua_code_mode.ts.

Runs a Responses API loop with one persistent Playwright browser/context/page,
and tools that let the model execute short async Python snippets and ask the
user clarifying questions.

The model can return visual observations by calling:
    display(base64_png_string)
"""

from __future__ import annotations

import argparse
import asyncio
import json
import traceback
from typing import Any

from openai import OpenAI
from playwright.async_api import async_playwright

Phase = str | None


def _message_text(item: Any) -> str:
    try:
        parts = getattr(item, "content", None)
        if isinstance(parts, list) and parts:
            out: list[str] = []
            for p in parts:
                t = getattr(p, "text", None)
                if isinstance(t, str) and t:
                    out.append(t)
            if out:
                return "\n".join(out)
    except Exception:
        pass
    return str(item)


async def _ainput(prompt: str) -> str:
    return await asyncio.to_thread(input, prompt)


async def main(
    prompt: str = "Go to Hacker News, click on the most interesting link (be prepared to justify your choice), take a screenshot, and give me a critique of the visual layout.",
    max_steps: int = 20,
    model: str = "gpt-5.5",
) -> None:
    client = OpenAI()

    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=False,
            args=["--window-size=1440,900"],
        )
        context = await browser.new_context(viewport={"width": 1440, "height": 900})
        page = await context.new_page()

        conversation: list[dict[str, Any]] = [{"role": "user", "content": prompt}]
        py_output: list[dict[str, Any]] = []

        def log(*xs: Any) -> None:
            text = " ".join(str(x) for x in xs)
            py_output.append({"type": "input_text", "text": text[:5000]})

        def display(base64_image: str) -> None:
            py_output.append(
                {
                    "type": "input_image",
                    "image_url": f"data:image/png;base64,{base64_image}",
                    "detail": "original",
                }
            )

        runtime_globals: dict[str, Any] = {
            "__builtins__": __builtins__,
            "asyncio": asyncio,
            "browser": browser,
            "context": context,
            "page": page,
            "display": display,
            "log": log,
        }

        for _ in range(max_steps):
            resp = client.responses.create(
                model=model,
                tools=[
                    {
                        "type": "function",
                        "name": "exec_py",
                        "description": "Execute provided interactive async Python in a persistent runtime context.",
                        "parameters": {
                            "type": "object",
                            "properties": {
                                "code": {
                                    "type": "string",
                                    "description": (
                                        "Python code to execute. Write small snippets. "
                                        "State persists across tool calls via globals(). "
                                        "This runtime uses Playwright's async Python API, so you may use await directly. "
                                        "Do not call asyncio.run(...), loop.run_until_complete(...), or manage the event loop yourself. "
                                        "You can use ONLY these prebound objects/helpers: "
                                        "log(x) for text output, display(base64_png_string) for image output, "
                                        "browser (async Playwright browser), context (viewport 1440x900), page (already created), "
                                        "asyncio (module). "
                                        "Be concise with log(x): do not send large base64 payloads, screenshots, buffers, page HTML, "
                                        "or other large blobs through log(). If you create an image or screenshot, pass the base64 PNG "
                                        "string to display(). Do not write screenshots or image data to temporary files or disk just "
                                        "to pass them back; keep image data in memory and send it directly to display(). "
                                        "Do not assume extra globals or helpers are available unless they are explicitly listed here. "
                                        "Do not close browser/context/page unless explicitly asked."
                                    ),
                                }
                            },
                            "required": ["code"],
                            "additionalProperties": False,
                        },
                    },
                    {
                        "type": "function",
                        "name": "ask_user",
                        "description": "Ask the user a clarification question and wait for their response.",
                        "parameters": {
                            "type": "object",
                            "properties": {
                                "question": {
                                    "type": "string",
                                    "description": "The exact question to show the user. Use this instead of asking a freeform clarifying question in a final answer.",
                                }
                            },
                            "required": ["question"],
                            "additionalProperties": False,
                        },
                    },
                ],
                input=conversation,
            )

            conversation.extend(resp.output)

            had_tool_call = False
            latest_phase: Phase = None

            for item in resp.output:
                item_type = getattr(item, "type", None)

                if item_type == "function_call" and getattr(item, "name", None) == "exec_py":
                    had_tool_call = True
                    raw_args = getattr(item, "arguments", "{}") or "{}"
                    try:
                        args = json.loads(raw_args)
                    except json.JSONDecodeError:
                        args = {}
                    code = args.get("code", "") if isinstance(args, dict) else ""

                    print(code)
                    print("----")

                    # Wrap the snippet in an async function so it can use `await` directly.
                    wrapped = (
                        "async def __codex_exec__():\n"
                        + "".join(
                            f"    {line}\n" if line else "    \n"
                            for line in (code or "pass").splitlines()
                        )
                    )

                    try:
                        exec(wrapped, runtime_globals, runtime_globals)
                        await runtime_globals["__codex_exec__"]()
                    except Exception:
                        log(traceback.format_exc())

                    # Send tool output back to the model, keyed by call_id
                    conversation.append(
                        {
                            "type": "function_call_output",
                            "call_id": getattr(item, "call_id", None),
                            "output": py_output[:],
                        }
                    )

                    for out in py_output:
                        if out.get("type") == "input_text":
                            print("PY LOG:", out.get("text", ""))
                        elif out.get("type") == "input_image":
                            print("PY IMAGE: [base64 string omitted]")
                    print("=====")

                    py_output.clear()

                elif item_type == "function_call" and getattr(item, "name", None) == "ask_user":
                    had_tool_call = True
                    raw_args = getattr(item, "arguments", "{}") or "{}"
                    try:
                        args = json.loads(raw_args)
                    except json.JSONDecodeError:
                        args = {}
                    question = (
                        args.get("question", "Please provide more information.")
                        if isinstance(args, dict)
                        else "Please provide more information."
                    )

                    print(f"MODEL QUESTION: {question}")
                    answer = await _ainput("> ")

                    conversation.append(
                        {
                            "type": "function_call_output",
                            "call_id": getattr(item, "call_id", None),
                            "output": answer,
                        }
                    )

                elif item_type == "message":
                    print(_message_text(item))
                    phase = getattr(item, "phase", None)
                    if isinstance(phase, str) or phase is None:
                        latest_phase = phase
                elif item_type == "output_item.done":
                    phase = getattr(item, "phase", None)
                    if isinstance(phase, str) or phase is None:
                        latest_phase = phase

            # Stop only when the model marks the turn as a final answer
            # and there were no tool calls in the same turn.
            if not had_tool_call and latest_phase == "final_answer":
                return


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--prompt", help="Override the default user prompt.")
    args = parser.parse_args()
    asyncio.run(main(prompt=args.prompt) if args.prompt is not None else main())
```

</div>


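The Python harness above wraps each model-provided snippet in an async function and executes it against one shared globals dictionary. The following is a minimal, browser-free sketch of that wrapping technique, not part of the harness itself. The names `run_snippet`, `__snippet__`, and `demo` are hypothetical and chosen only for illustration. It suggests why the tool description tells the model that state persists "via globals()": a plain assignment inside a snippet becomes a local of the wrapper function and disappears after the call, while a write through `globals()` lands in the shared dictionary and is visible to later snippets.

```python
import asyncio
import textwrap
from typing import Any


async def run_snippet(code: str, runtime_globals: dict[str, Any]) -> None:
    """Wrap `code` in an async function defined in the shared namespace, then await it."""
    # Hypothetical helper: indent the snippet so it becomes the wrapper's body.
    body = textwrap.indent(code or "pass", "    ")
    wrapped = "async def __snippet__():\n" + body + "\n"
    exec(wrapped, runtime_globals)
    await runtime_globals["__snippet__"]()


async def demo() -> list[str]:
    logs: list[str] = []
    # One dictionary is reused for every snippet, mirroring a persistent runtime.
    runtime_globals: dict[str, Any] = {"log": logs.append}

    # `local_value` is a local of the wrapper; `counter` is written to the shared dict.
    await run_snippet("local_value = 1\nglobals()['counter'] = 41", runtime_globals)
    # A later snippet can read and update `counter` because it lives in the shared dict.
    await run_snippet("globals()['counter'] += 1\nlog(str(counter))", runtime_globals)
    # The plain local assignment from the first snippet did not persist.
    await run_snippet("log(str('local_value' in globals()))", runtime_globals)
    return logs


if __name__ == "__main__":
    print(asyncio.run(demo()))  # prints ['42', 'False']
```

In a real harness the shared dictionary would also carry the prebound `page`, `context`, and `browser` objects, so the same persistence rule applies to anything the model wants to reuse between tool calls.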
The older request shape looked like this:

Legacy preview request

```javascript
import OpenAI from "openai";

const client = new OpenAI();

const response = await client.responses.create({
  model: "computer-use-preview",
  tools: [
    {
      type: "computer_use_preview",
      display_width: 1024,
      display_height: 768,
      environment: "browser",
    },
  ],
  input: "Check whether the Filters panel is open.",
  truncation: "auto",
});
```

```python
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="computer-use-preview",
    tools=[
        {
            "type": "computer_use_preview",
            "display_width": 1024,
            "display_height": 768,
            "environment": "browser",
        }
    ],
    input="Check whether the Filters panel is open.",
    truncation="auto",
)
```

Keep the preview path only to maintain older integrations. For new implementations, use the GA flow described above.

## Keep a human in the loop