Documentation — Spybara

deprecations.md +37 −21


14 14

15We use the term "legacy" to refer to models and endpoints that no longer receive updates. We tag endpoints and models as legacy to signal to developers where we're moving as a platform and that they should likely migrate to newer models or endpoints. You can expect that a legacy model or endpoint will be deprecated at some point in the future.

16 16

~~17## Deprecation history~~17## Upcoming deprecations

18 18

~~19All deprecations are listed below, with the most recent announcements at the top.~~19Upcoming deprecations are listed below, with the most recent announcements at the top.

21### Update to OpenAI’s self-serve fine-tuning

23On May 7th, 2026, we notified developers using OpenAI’s self-serve fine-tuning platform of updates to availability.

25Inference on fine-tuned models will continue to be available until the base models are deprecated.

27| Date | Update |

28| ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |

29| May 7, 2026 | Creating fine-tuning jobs or training is not available to organizations that have not previously run fine-tuning. |

30| July 2, 2026 | Creating fine-tuning jobs is no longer available to organizations that have not run inference on a fine-tuned model in the past 60 days. |

31| Jan 6, 2027 | From this date, active existing customers can no longer create new fine-tuning jobs. Inference on fine-tuned models will be disabled only when the underlying base model is deprecated. |
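
The schedule in the table can be read as a small decision function. The sketch below is one interpretation of the three rows, not the platform's actual enforcement logic. The function name and its inputs (`has_run_fine_tuning`, `days_since_ft_inference`) are hypothetical.

```python
from datetime import date

# Cutoff dates copied from the table above.
NEW_ORGS_CUTOFF = date(2026, 5, 7)
INACTIVE_ORGS_CUTOFF = date(2026, 7, 2)
ALL_ORGS_CUTOFF = date(2027, 1, 6)


def can_create_fine_tuning_job(today, has_run_fine_tuning, days_since_ft_inference):
    """Return True if the table suggests job creation is still available.

    days_since_ft_inference is None when the organization has never run
    inference on a fine-tuned model (a hypothetical input shape).
    """
    if today >= ALL_ORGS_CUTOFF:
        # Jan 6, 2027: no organization can create new jobs.
        return False
    if today >= INACTIVE_ORGS_CUTOFF:
        # July 2, 2026: requires fine-tuned inference in the past 60 days.
        if days_since_ft_inference is None or days_since_ft_inference > 60:
            return False
    if today >= NEW_ORGS_CUTOFF and not has_run_fine_tuning:
        # May 7, 2026: organizations new to fine-tuning are excluded.
        return False
    return True
```

Inference on existing fine-tuned models is deliberately left out of the sketch, because the table ties it to base-model deprecation rather than to these dates.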

20 32

33### 2026-04-22: Legacy GPT model snapshots

22 34

35To improve reliability and make it easier for developers to choose the right models, we are deprecating a set of older OpenAI models. Access to these models will be shut down on the dates below.

24 36

~~26| ------------- | ---------------------------------------------------------------------- | --------------------- |~~38| ------------- | ---------------------------------------------------------------------- | ------------------- |

39| 2026-07-23 | `computer-use-preview-2025-03-11` \| `computer-use-preview` | `5.4-mini` |

40| 2026-07-23 | `gpt-4o-audio-preview-2024-12-17` | `gpt-audio` |

41| 2026-07-23 | `gpt-4o-mini-audio-preview-2024-12-17` | `gpt-audio` |

43| 2026-07-23 | `gpt-4o-mini-search-preview-2025-03-11` | `4.1-mini` |

44| 2026-07-23 | `gpt-4o-mini-tts-2025-03-20` | `gpt-realtime` |

45| 2026-07-23 | `gpt-4o-search-preview-2025-03-11` | `gpt-4.1-mini` |

~~34| 2026-07-23 | `gpt-5-chat-latest` | `gpt-5.3-chat-latest` |~~46| 2026-07-23 | `gpt-5-chat-latest` | `gpt-5.5` |

47| 2026-07-23 | `gpt-5-codex` | `gpt-5.4` |

~~36| 2026-07-23 | `gpt-5.1-chat-latest` | `gpt-5.3-chat-latest` |~~48| 2026-07-23 | `gpt-5.1-chat-latest` | `gpt-5.5` |

49| 2026-07-23 | `gpt-5.1-codex` | `gpt-5` |

50| 2026-07-23 | `gpt-5.1-codex-max` | `gpt-5.4` |

51| 2026-07-23 | `gpt-5.1-codex-mini` | `gpt-5.4-mini` |

90| 2026-09-24 | `sora-2-2025-12-08` | --- |

91| 2026-09-24 | `sora-2-pro-2025-10-06` | --- |

80 92
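
A table like the one above can be turned into a lookup that flags deprecated model names in configuration before the shutdown date. This is a minimal sketch: the dictionary holds only a few rows copied from the visible part of the table, and `resolve_model` is a hypothetical helper, not part of any SDK.

```python
# A subset of rows copied from the 2026-04-22 table above.
REPLACEMENTS = {
    "gpt-4o-audio-preview-2024-12-17": "gpt-audio",
    "gpt-4o-mini-audio-preview-2024-12-17": "gpt-audio",
    "gpt-5-chat-latest": "gpt-5.5",
    "gpt-5-codex": "gpt-5.4",
    "gpt-5.1-chat-latest": "gpt-5.5",
    "gpt-5.1-codex-mini": "gpt-5.4-mini",
    "sora-2-2025-12-08": None,  # the table lists no replacement ("---")
}


def resolve_model(model):
    """Return the recommended replacement, or the model itself if not listed."""
    if model not in REPLACEMENTS:
        return model
    replacement = REPLACEMENTS[model]
    if replacement is None:
        raise LookupError(f"{model} is deprecated and has no listed replacement")
    return replacement
```

Raising for rows without a replacement keeps the gap visible instead of silently continuing with a model that will be shut down.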

~~81### 2025-11-18: chatgpt-4o-latest snapshot~~

~~83On November 18th, 2025, we notified developers using `chatgpt-4o-latest` model snapshot of its deprecation and removal from the API on February 17, 2026.~~

~~85| Shutdown date | Model / system | Recommended replacement |~~

~~86| ------------- | ------------------- | ----------------------- |~~

~~87| 2026-02-17 | `chatgpt-4o-latest` | `gpt-5.1-chat-latest` |~~

~~89### 2025-11-17: codex-mini-latest model snapshot~~

~~91On November 17th, 2025, we notified developers using `codex-mini-latest` model of its deprecation and removal from the API on February 12, 2026. As part of this deprecation, we will no longer support our legacy local shell tool, which is only available for use with `codex-mini-latest`. For new use cases, please use our latest shell tool.~~

~~93| Shutdown date | Model / system | Recommended replacement |~~

~~94| ------------- | ------------------- | ----------------------- |~~

~~95| 2026-02-12 | `codex-mini-latest` | `gpt-5-codex-mini` |~~

93### 2025-11-14: DALL·E model snapshots

98 94

95On November 14th, 2025, we notified developers using DALL·E model snapshots of their deprecation and removal from the API on May 12, 2026.

150| 2026-05-07 | gpt-4o-audio-preview | gpt-audio-1.5 |

151| 2026-05-07 | gpt-4o-mini-audio-preview | gpt-audio-mini |

156 152

153## Past deprecations

154

155Past deprecations are listed below, with the most recent announcements at the top.

156

157### 2025-11-18: chatgpt-4o-latest snapshot

158

159On November 18th, 2025, we notified developers using the `chatgpt-4o-latest` model snapshot of its deprecation and removal from the API on February 17, 2026.

160

161| Shutdown date | Model / system | Recommended replacement |

162| ------------- | ------------------- | ----------------------- |

163| 2026-02-17 | `chatgpt-4o-latest` | `gpt-5.1-chat-latest` |

164

165### 2025-11-17: codex-mini-latest model snapshot

166

167On November 17th, 2025, we notified developers using the `codex-mini-latest` model of its deprecation and removal from the API on February 12, 2026. As part of this deprecation, we will no longer support our legacy local shell tool, which is only available for use with `codex-mini-latest`. For new use cases, please use our latest shell tool.

168

169| Shutdown date | Model / system | Recommended replacement |

170| ------------- | ------------------- | ----------------------- |

171| 2026-02-12 | `codex-mini-latest` | `gpt-5-codex-mini` |

172

173### 2025-06-10: gpt-4o-realtime-preview-2024-10-01

158 174

175On June 10th, 2025, we notified developers using gpt-4o-realtime-preview-2024-10-01 of its deprecation and removal from the API in three months.

guides/audio.md +47 −40


1# Audio and speech

2 2

~~3The OpenAI API provides a range of audio capabilities. If you know what you want to build, find your use case below to get started. If you're not sure where to start, read this page as an overview.~~3Audio models can understand spoken input, generate spoken output, or do both in the same interaction. This guide explains the vocabulary used across OpenAI's audio docs. When you're ready to choose an implementation path, start with the [Realtime and audio overview](https://developers.openai.com/api/docs/guides/realtime).

4 4

~~5## Build with audio~~5## Audio modalities

6 6

~~7<div className="w-full max-w-full overflow-hidden">~~7An audio application combines one or more of these modalities:

~~8 </div>~~

~~10## A tour of audio use cases~~

11 8

~~12LLMs can process audio by using sound as input, creating sound as output, or both. OpenAI has several API endpoints that help you build audio applications or voice agents.~~9| Modality | Meaning | Common use cases |

10| --------------- | -------------------------------------------- | ------------------------------------------------- |

11| Audio input | The model receives sound from a user or app. | Voice agents, transcription, translation. |

12| Audio output | The model or API returns spoken audio. | Voice agents, text to speech, spoken responses. |

13| Text transcript | Speech becomes text. | Captions, call analysis, search, records. |

14| Text prompt | Text controls what the model says or does. | Speech generation, scripted voice flows, prompts. |

13 15

~~14### Voice agents~~16## Common speech tasks

15 17

~~16Voice agents understand audio to handle tasks and respond back in natural language. There are two main ways to approach voice agents: either with speech-to-speech models and the [Realtime API](https://developers.openai.com/api/docs/guides/realtime), or by chaining together a speech-to-text model, a text language model to process the request, and a text-to-speech model to respond. Speech-to-speech is lower latency and more natural, but chaining together a voice agent is a reliable way to extend a text-based agent into a voice agent. If you are already using the [Agents SDK](https://developers.openai.com/api/docs/guides/agents), you can [extend your existing agents with voice capabilities](https://developers.openai.com/api/docs/guides/voice-agents) using the chained approach.~~18**Speech to text** converts speech into text. Use it for captions, notes, transcripts, analytics, search, and accessibility. Transcription can be request-based for files or streaming for live audio.

17 19

~~18### Streaming audio~~20**Text to speech** converts text into spoken audio. Use it for narration, assistants, accessibility, and generated voice responses. Speech generation can stream audio back as the model produces it.

19 21

~~20Process audio in real time to build voice agents and other low-latency applications, including transcription use cases. You can stream audio in and out of a model with the [Realtime API](https://developers.openai.com/api/docs/guides/realtime). Our advanced speech models provide automatic speech recognition for improved accuracy, low-latency interactions, and multilingual support.~~22**Speech to speech** lets a model listen, reason, and speak in one low-latency session. Use it for conversational voice agents when the assistant needs to respond, call tools, or maintain session state.

21 23

~~22### Text to speech~~24**Speech translation** listens to speech in one language and returns translated speech or transcript output in another language. Use a dedicated realtime translation session when translation should begin continuously as audio arrives.

23 25

~~24For turning text into speech, use the [Audio API](https://developers.openai.com/api/docs/api-reference/audio/) `audio/speech` endpoint. Models compatible with this endpoint are `gpt-4o-mini-tts`, `tts-1`, and `tts-1-hd`. With `gpt-4o-mini-tts`, you can ask the model to speak a certain way or with a certain tone of voice.~~26## Streaming and latency

25 27

~~26### Speech to text~~28Streaming means the client and service exchange partial input or output while the interaction is still active. Streaming is useful when users expect immediate feedback, such as live captions, calls, voice agents, and translation.

27 29

~~28For speech to text, use the [Audio API](https://developers.openai.com/api/docs/api-reference/audio/) `audio/transcriptions` endpoint. Models compatible with this endpoint are `gpt-4o-transcribe`, `gpt-4o-mini-transcribe`, `whisper-1`, and `gpt-4o-transcribe-diarize`. `gpt-4o-transcribe-diarize` adds speaker labels and timestamps for HTTP requests and is intended for non-latency-sensitive workloads, while the other models focus on transcription only. With streaming, you can continuously pass in audio and get a continuous stream of text back.~~30Lower latency requires a realtime connection, more careful audio handling, and a session model that can emit partial events. Request-based APIs are simpler for file uploads and non-interactive work, but they don't support the same live interaction patterns.

29 31

~~30## Choosing the right API~~32## Request-based APIs and realtime sessions

31 33

~~32There are multiple APIs for transcribing or generating audio:~~34OpenAI supports two broad audio architectures:

33 35

~~34| API | Supported modalities | Streaming support |~~36| Architecture | Use when | Examples |

~~35| ---------------------------------------------------- | --------------------------------- | ------------------------------------------------ |~~37| --------------------------- | ---------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------- |

~~36| [Realtime API](https://developers.openai.com/api/docs/api-reference/realtime) | Audio and text inputs and outputs | Audio streaming in, audio and text streaming out |~~38| Request-based audio APIs | You have a file, a text input, or a bounded request. | [Speech to text](https://developers.openai.com/api/docs/guides/speech-to-text), [text to speech](https://developers.openai.com/api/docs/guides/text-to-speech). |

~~37| [Chat Completions API](https://developers.openai.com/api/docs/api-reference/chat) | Audio and text inputs and outputs | Audio and text streaming out |~~39| Realtime sessions | Audio is live and the app needs low-latency events. | [Voice agents](https://developers.openai.com/api/docs/guides/voice-agents), [translation](https://developers.openai.com/api/docs/guides/realtime-translation), [transcription](https://developers.openai.com/api/docs/guides/realtime-transcription). |

~~39| [Speech API](https://developers.openai.com/api/docs/api-reference/audio) | Text inputs and audio outputs | Audio streaming out |~~

40 41

~~41### General use APIs vs. specialized APIs~~42For build-path guidance, see the [Realtime and audio overview](https://developers.openai.com/api/docs/guides/realtime).

42 43

~~43The main distinction is general use APIs vs. specialized APIs. With the Realtime and Chat Completions APIs, you can use our latest models' native audio understanding and generation capabilities and combine them with other features like function calling. These APIs can be used for a wide range of use cases, and you can select the model you want to use.~~44## Add audio to your existing application

44 45

~~45On the other hand, the Transcription, Translation and Speech APIs are specialized to work with specific models and only meant for one purpose.~~46Models such as `gpt-realtime` and `gpt-audio` are natively multimodal, meaning they can understand and generate audio and text as input and output.

46 47

~~47### Talking with a model vs. controlling the script~~48For live browser speech-to-speech interactions, start with a realtime session in the JavaScript SDK:

48 49

~~49Another way to select the right API is asking yourself how much control you need. To design conversational interactions, where the model thinks and responds in speech, use the Realtime or Chat Completions API, depending if you need low-latency or not.~~

~~51You won't know exactly what the model will say ahead of time, as it will generate audio responses directly, but the conversation will feel natural.~~

~~53For more control and predictability, you can use the Speech-to-text / LLM / Text-to-speech pattern, so you know exactly what the model will say and can control the response. Please note that with this method, there will be added latency.~~

~~55This is what the Audio APIs are for: pair an LLM with the `audio/transcriptions` and `audio/speech` endpoints to take spoken user input, process and generate a text response, and then convert that to speech that the user can hear.~~

~~57### Recommendations~~

50Start a realtime voice session

```javascript
import { RealtimeAgent, RealtimeSession } from "@openai/agents/realtime";

// Define the agent's name and behavior.
const agent = new RealtimeAgent({
  name: "Assistant",
  instructions: "You are a helpful voice assistant.",
});

// Create a session for the agent on a realtime model.
const session = new RealtimeSession(agent, {
  model: "gpt-realtime-2",
});

// Connect with an ephemeral client key minted by your server.
await session.connect({
  apiKey: "ek_...(ephemeral key from your server)",
});
```

58 68

~~59- If you need [real-time interactions](https://developers.openai.com/api/docs/guides/realtime-conversations) or [transcription](https://developers.openai.com/api/docs/guides/realtime-transcription), use the Realtime API.~~

~~60- If realtime is not a requirement but you're looking to build a [voice agent](https://developers.openai.com/api/docs/guides/voice-agents) or an audio-based application that requires features such as [function calling](https://developers.openai.com/api/docs/guides/function-calling), use the Chat Completions API.~~

~~61- For use cases with one specific purpose, use the Transcription, Translation, or Speech APIs.~~

62 69

~~63## Add audio to your existing application~~70This example uses JavaScript because browser voice agents connect with WebRTC from the client. For Python voice workflows, use the [Voice agents guide](https://developers.openai.com/api/docs/guides/voice-agents), which covers chained voice pipelines.

64 71

~~65Models such as `gpt-realtime` and `gpt-audio` are natively multimodal, meaning they can understand and generate multiple modalities as input and output.~~72If you already have a text-based LLM application with the [Chat Completions endpoint](https://developers.openai.com/api/docs/api-reference/chat/), you may want to add audio capabilities. For example, if your chat application supports text input, you can add audio input and output: include `audio` in the `modalities` array and use an audio model, like `gpt-audio`.

66 73

~~67If you already have a text-based LLM application with the [Chat Completions endpoint](https://developers.openai.com/api/docs/api-reference/chat/), you may want to add audio capabilities. For example, if your chat application supports text input, you can add audio input and output—just include `audio` in the `modalities` array and use an audio model, like `gpt-audio`.~~74The [Responses API](https://developers.openai.com/api/docs/api-reference/responses) docs currently describe

75 text and image inputs with text outputs. For this audio-chat pattern, use Chat

76 Completions with an audio-capable model.
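
As a rough illustration of the audio-chat pattern described above, the sketch below only builds a Chat Completions request body. It does not call the API. The `modalities` array and the `gpt-audio` model come from the text. The `audio` options (`voice`, `format`) and the helper name are assumptions, so check them against the API reference before relying on them.

```python
import json


def build_audio_chat_request(prompt, model="gpt-audio", voice="alloy", audio_format="wav"):
    """Build a request body that asks for both text and audio output."""
    return {
        "model": model,
        # Adding "audio" to modalities requests spoken output.
        "modalities": ["text", "audio"],
        # Assumed output options; values here are illustrative.
        "audio": {"voice": voice, "format": audio_format},
        "messages": [{"role": "user", "content": prompt}],
    }


payload = build_audio_chat_request("Is a golden retriever a good family dog?")
print(json.dumps(payload, indent=2))
```

Keeping the payload in a helper makes it easy to switch an existing text-only call to audio without touching the rest of the request code.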

68 77

~~69Audio is not yet supported in the [Responses~~

~~70 API](https://developers.openai.com/api/docs/api-reference/chat/completions/responses).~~

71 78

72 79

80<div data-content-switcher-pane data-value="audio-out">

guides/batch.md +34 −0


135 -F file="@batchinput.jsonl"

136```

137 137

```cli
openai files create \
  --file batchinput.jsonl \
  --purpose batch
```

143

138 144

145### 3. Create the batch

140 146

187 }'

188```

183 189

```cli
openai batches create \
  --input-file-id file-abc123 \
  --endpoint /v1/chat/completions \
  --completion-window 24h
```

196

184 197

198This request will return a [Batch object](https://developers.openai.com/api/docs/api-reference/batch/object) with metadata about your batch:

186 199

251 -H "Content-Type: application/json"

252```

240 253

```cli
openai batches retrieve \
  --batch-id batch_abc123
```

258

241 259

260The status of a given Batch object can be any of the following:

243 261

299 -H "Authorization: Bearer $OPENAI_API_KEY" > batch_output.jsonl

300```

283 301

```cli
openai files content \
  --file-id file-xyz123 \
  --output batch_output.jsonl
```

307

284 308

309The output `.jsonl` file will have one response line for every successful request line in the input file. Any failed requests in the batch will have their error information written to an error file that can be found via the batch's `error_file_id`.
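
Because the output file has one JSON line per successful request, a small parser can index results for later lookup. The sketch below assumes each line is a JSON object with `custom_id` and `response` fields. Those field names are not shown in this section, so confirm them against the Batch API reference.

```python
import json


def index_batch_output(lines):
    """Map each custom_id to its response object, skipping blank lines."""
    results = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Assumed fields: "custom_id" identifies the input request,
        # "response" holds the result for that request.
        results[record["custom_id"]] = record.get("response")
    return results
```

To process a downloaded file, pass an open file handle, for example `index_batch_output(open("batch_output.jsonl"))`. The error file could be read the same way if it uses the same line format.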

286 310

350 -X POST

351```

328 352

```cli
openai batches cancel \
  --batch-id batch_abc123
```

357

329 358

359### 7. Get a list of all batches

331 360

386 -H "Content-Type: application/json"

387```

359 388

```cli
openai batches list \
  --limit 10
```

393

360 394

395## Model availability

362 396

guides/image-generation.md +30 −0


171 }' | jq -r '.data[0].b64_json' | base64 --decode > otter.png

172```

173 173

```cli
openai images generate \
  --model gpt-image-2 \
  --prompt "A childrens book drawing of a veterinarian using a stethoscope to listen to the heartbeat of a baby otter." \
  --raw-output \
  --transform 'data.0.b64_json' | base64 --decode > otter.png
```

181

182 </div>

175 183

176 184

746 -F 'prompt=Generate a photorealistic image of a gift basket on a white background labeled "Relax & Unwind" with a ribbon and handwriting-like font, containing all the items in the reference pictures'

747```

740 748

```cli
openai images edit \
  --model gpt-image-2 \
  --image body-lotion.png \
  --image bath-bomb.png \
  --image incense-kit.png \
  --image soap.png \
  --prompt 'Generate a photorealistic image of a gift basket on a white background labeled "Relax & Unwind" with a ribbon and handwriting-like font, containing all the items in the reference pictures' \
  --raw-output \
  --transform 'data.0.b64_json' | base64 --decode > gift-basket.png
```

760

761 </div>

742 762

743 763

930 -F 'prompt=A sunlit indoor lounge area with a pool containing a flamingo'

931```

912 932

```cli
openai images edit \
  --model gpt-image-2 \
  --image sunlit_lounge.png \
  --mask mask.png \
  --prompt "A sunlit indoor lounge area with a pool containing a flamingo" \
  --raw-output \
  --transform 'data.0.b64_json' | base64 --decode > out.png
```

942

943 </div>

914 944

915 945

guides/images-vision.md +26 −0


79 f.write(base64.b64decode(image_base64))

80```

81 81

```cli
openai responses create \
  --model gpt-5.5 \
  --raw-output \
  --transform 'output.#(type=="image_generation_call").result' <<'YAML' | base64 --decode > cat_and_otter.png
tools:
  - type: image_generation
input: Generate an image of a gray tabby cat hugging an otter with an orange scarf.
YAML
```

82 93

83 94

95You can learn more about image generation in our [Image

209 }'

210```

200 211

```cli
openai responses create \
  --model gpt-5.5 \
  --raw-output \
  --transform 'output.#(type=="message").content.0.text' <<'YAML'
input:
  - role: user
    content:
      - type: input_text
        text: What is in this image?
      - type: input_image
        image_url: https://api.nga.gov/iiif/a2e6da57-3cd1-4235-b20e-95dcaefed6c8/full/!800,800/0/default.jpg
YAML
```

226

227 </div>

228 <div data-content-switcher-pane data-value="base64-encoded" hidden>

229 <div class="hidden">Passing a Base64 encoded image</div>

guides/prompt-generation.md +2 −1


2161. **For structured output schemas**, wrap them in a [`json_schema`](https://developers.openai.com/api/docs/guides/structured-outputs#how-to-use?context=without_parse) object.

2171. **For functions**, wrap them in a [`function`](https://developers.openai.com/api/docs/guides/function-calling#step-3-pass-your-function-definitions-as-available-tools-to-the-model-along-with-the-messages) object.

218 218

~~219The Realtime API [function](https://developers.openai.com/api/docs/guides/realtime#function-calls) object~~219The Realtime API

220 [function](https://developers.openai.com/api/docs/guides/realtime-conversations#function-calling) object

221 differs slightly from the Chat Completions API, but uses the same schema.

221 222

223### Meta-schemas

guides/rate-limits.md +1 −1

Details

113 113

114## How do these rate limits work?

115 115

~~116Rate limits are measured in five ways: **RPM** (requests per minute), **RPD** (requests per day), **TPM** (tokens per minute), **TPD** (tokens per day), and **IPM** (images per minute). Rate limits can be hit across any of the options depending on what occurs first. For example, you might send 20 requests with only 100 tokens to the ChatCompletions endpoint and that would fill your limit (if your RPM was 20), even if you did not send 150k tokens (if your TPM limit was 150k) within those 20 requests.~~116Rate limits use metrics such as **RPM** (requests per minute), **RPD** (requests per day), **TPM** (tokens per minute), **TPD** (tokens per day), **IPM** (images per minute), and audio minutes per minute for some streaming audio models. You hit a rate limit as soon as any one of these metrics reaches its cap, whichever happens first. For example, you might send 20 requests with only 100 tokens to the Chat Completions endpoint and that would fill your limit (if your RPM was 20), even if you didn't send 150k tokens (if your TPM limit was 150k) within those 20 requests.
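
The example in the paragraph above can be checked with a few lines of arithmetic. This is a minimal sketch covering only the two per-minute metrics in that example. It assumes the 100 tokens are per request, and `limits_reached` is a hypothetical helper.

```python
def limits_reached(requests, tokens_per_request, rpm_limit, tpm_limit):
    """Return the per-minute limits reached by one minute of traffic."""
    reached = []
    if requests >= rpm_limit:
        reached.append("RPM")
    if requests * tokens_per_request >= tpm_limit:
        reached.append("TPM")
    return reached


# 20 requests of 100 tokens each, with RPM = 20 and TPM = 150k.
print(limits_reached(20, 100, rpm_limit=20, tpm_limit=150_000))
```

With these numbers the traffic uses only 2,000 of the 150,000 tokens, yet the request count alone is enough to reach the RPM cap.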

117 117

118[Batch API](https://developers.openai.com/api/docs/api-reference/batch/create) queue limits are calculated based on the total number of input tokens queued for a given model. Tokens from pending batch jobs are counted against your queue limit. Once a batch job is completed, its tokens are no longer counted against that model's limit.
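
The queue accounting described above amounts to a per-model sum over pending jobs. The sketch below uses a hypothetical job record with `model`, `input_tokens`, and `pending` keys. It is not the shape of the real Batch object, and the platform's own accounting may differ in detail.

```python
def queued_input_tokens(jobs, model):
    """Sum input tokens of pending jobs for one model."""
    return sum(
        job["input_tokens"]
        for job in jobs
        if job["model"] == model and job["pending"]
    )


def fits_in_queue(jobs, model, new_job_tokens, queue_limit):
    """Check whether a new job would stay within the model's queue limit."""
    return queued_input_tokens(jobs, model) + new_job_tokens <= queue_limit
```

Completed jobs are marked `pending=False` in this model, so their tokens drop out of the sum as soon as they finish.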

119 119

guides/realtime.md +176 −240


~~1# Realtime API~~1# Realtime and audio

2 2

3import {

~~4 Bolt,~~

~~5 Phone,~~

4 Cube,

5 Desktop,

6 Phone,

7} from "@components/react/oai/platform/ui/Icon.react";

9 8

10 9Start with the outcome you want to build. Realtime sessions are best for live audio that needs low latency. Request-based audio APIs are best for files, bounded requests, or generated speech that doesn't need a live session.

11 10

~~12The OpenAI Realtime API enables low-latency communication with [models](https://developers.openai.com/api/docs/models) that natively support speech-to-speech interactions as well as multimodal inputs (audio, images, and text) and outputs (audio and text). These APIs can also be used for [realtime audio transcription](https://developers.openai.com/api/docs/guides/realtime-transcription).~~11## Common use cases

13 12

~~14## Voice agents~~13<div className="w-full max-w-full overflow-hidden">

15 14 </div>

~~16One of the most common use cases for the Realtime API is building voice agents for speech-to-speech model interactions in the browser. Our recommended starting point for these applications is the on-site [Voice agents](https://developers.openai.com/api/docs/guides/voice-agents) guide, which uses a [WebRTC connection](https://developers.openai.com/api/docs/guides/realtime-webrtc) to the Realtime model in the browser, and [WebSocket](https://developers.openai.com/api/docs/guides/realtime-websocket) when used on the server.~~15

17 16## Understand different architectures

~~18```js~~17

19 18<table>

20 19 <thead>

~~21const agent = new RealtimeAgent({~~20 <tr>

~~22 name: "Assistant",~~21 <th>Goal</th>

~~23 instructions: "You are a helpful assistant.",~~22 <th>Model or API</th>

~~24});~~23 <th>Start here</th>

25 24 </tr>

~~26const session = new RealtimeSession(agent);~~25 </thead>

27 26 <tbody>

~~28// Automatically connects your microphone and audio output~~27 <tr>

~~29await session.connect({~~28 <td>Build a low-latency voice agent</td>

~~30 apiKey: "<client-api-key>",~~29 <td className="whitespace-nowrap">

~~31});~~30 <a href="/api/docs/models/gpt-realtime-2">

~~32```~~31 <code>gpt-realtime-2</code>

33 32 </a>

~~34<a href="/api/docs/guides/voice-agents#speech-to-speech-realtime-architecture">~~33 </td>

35 34 <td>

36 35 <a href="/api/docs/guides/voice-agents">Voice agents</a>

~~37<span slot="icon">~~36 </td>

~~38 </span>~~37 </tr>

~~39 See the speech-to-speech path for building Realtime voice agents in the~~38 <tr>

~~40 browser.~~39 <td>Translate live speech into another language</td>

41 40 <td className="whitespace-nowrap">

42 41 <a href="/api/docs/models/gpt-realtime-translate">

~~43</a>~~42 <code>gpt-realtime-translate</code>

44 43 </a>

~~45To use the Realtime API directly outside the context of voice agents, check out the other connection options below.~~44 </td>

46 45 <td>

~~47## Connection methods~~46 <a href="/api/docs/guides/realtime-translation">Realtime translation</a>

48 47 </td>

~~49While building [voice agents with the Agents SDK](https://developers.openai.com/api/docs/guides/voice-agents) is the fastest path to one specific type of application, the Realtime API provides an entire suite of flexible tools for a variety of use cases.~~48 </tr>

50 49 <tr>

~~51There are three primary supported interfaces for the Realtime API:~~50 <td>Transcribe live audio into streaming text</td>

51 <td className="whitespace-nowrap">

52 <a href="/api/docs/models/gpt-realtime-whisper">

53 <code>gpt-realtime-whisper</code>

54 </a>

55 </td>

56 <td>

57 <a href="/api/docs/guides/realtime-transcription">

58 Realtime transcription

59 </a>

60 </td>

61 </tr>

62 <tr>

63 <td>Transcribe files or bounded audio requests</td>

64 <td>Audio transcription models</td>

65 <td>

66 <a href="/api/docs/guides/speech-to-text">Speech to text</a>

67 </td>

68 </tr>

69 <tr>

70 <td>Generate speech from text</td>

71 <td>Speech generation models</td>

72 <td>

73 <a href="/api/docs/guides/text-to-speech">Text to speech</a>

74 </td>

75 </tr>

76 <tr>

77 <td>Add audio to an existing Chat Completions app</td>

78 <td>Audio-capable chat models</td>

79 <td>

80 <a href="/api/docs/guides/audio#add-audio-to-your-existing-application">

81 Audio and speech

82 </a>

83 </td>

84 </tr>

85 </tbody>

86</table>

88## Choose a realtime session

90Realtime sessions keep a connection open while your application sends audio, receives events, and updates session state.

92<table>

93 <thead>

94 <tr>

95 <th>Session type</th>

96 <th>Use when</th>

97 <th>Endpoint or pattern</th>

98 </tr>

99 </thead>

100 <tbody>

101 <tr>

102 <td>Voice-agent session</td>

103 <td>

104 The model should respond to the user, call tools, and manage

105 conversation state.

106 </td>

107 <td>

108 Conversation session on <code>/v1/realtime</code>

109 </td>

110 </tr>

111 <tr>

112 <td>Translation session</td>

113 <td>The app should continuously translate speech as it arrives.</td>

114 <td>

115 Continuous translation session on <code>/v1/realtime/translations</code>

116 </td>

117 </tr>

118 <tr>

119 <td>Transcription session</td>

120 <td>

121 The app needs streaming transcript deltas without model-generated spoken

122 responses.

123 </td>

124 <td>Transcription session that emits transcript deltas</td>

125 </tr>

126 </tbody>

127</table>

128

129Use a voice-agent session when your application needs an assistant that responds to the user. Use a translation session when your application needs an interpreter that translates the speaker. Use a transcription session when your application needs text from audio without model-generated responses.

130

131### Voice-agent sessions

132

133Voice-agent sessions use the standard Realtime API conversation lifecycle. The client connects to `/v1/realtime`, sends audio or text, and listens for model responses, tool calls, and session events.

134

135For most browser voice agents, start with the [Voice agents](https://developers.openai.com/api/docs/guides/voice-agents) guide. It uses the Agents SDK with WebRTC for browser audio and can connect to server-side tools.

136

137Realtime 2 adds reasoning to speech-to-speech workflows. Start with

138 `reasoning.effort` set to `low` for most production voice agents, then adjust

139 based on latency tolerance and task complexity. Use the [Realtime prompting

140 guide](https://developers.openai.com/api/docs/guides/realtime-models-prompting) to tune reasoning,

141 preambles, tool use, unclear audio, and exact entity capture.

142

143### Translation sessions

144

145Realtime translation uses a dedicated translation endpoint instead of the standard voice-agent endpoint. Translation sessions are continuous: the client streams audio into the session, and the service streams translated audio and transcript deltas out.

146

147Translation sessions don't use the normal assistant turn lifecycle. Don't call `response.create`, and don't wait for the client to commit a user turn before translation begins. For browser media, use WebRTC. For server media pipelines such as phone calls or broadcast ingest, use WebSockets.

148

149See [Realtime translation](https://developers.openai.com/api/docs/guides/realtime-translation) for the dedicated endpoint, session configuration, and architecture patterns.

150

151### Transcription sessions

152

153You can transcribe audio in more than one way. Use a realtime transcription session when your application needs live transcript deltas from streaming audio. Use the [Speech to text](https://developers.openai.com/api/docs/guides/speech-to-text) guide for file uploads, request-based transcription, or diarization-focused workflows.

154

155For realtime transcription, `gpt-realtime-whisper` gives you controllable latency. Lower delay settings produce earlier partial text, while higher delay settings can improve transcript quality. Test with your real audio conditions, target languages, accents, and domain vocabulary before choosing a production default.

156

157See [Realtime transcription](https://developers.openai.com/api/docs/guides/realtime-transcription) for session configuration and event handling.

158

159## Choose a connection method

160

161Choose the transport based on where your application captures and plays audio:

52 162

53[163[

54 164

55<span slot="icon">165<span slot="icon">

56 </span>166 </span>

~~57 Ideal for browser and client-side interactions with a Realtime model.~~167 Use for browser and mobile clients that capture or play audio directly.

58 168

59](https://developers.openai.com/api/docs/guides/realtime-webrtc)169](https://developers.openai.com/api/docs/guides/realtime-webrtc)

60 170

62 172

63<span slot="icon">173<span slot="icon">

64 </span>174 </span>

~~65 Ideal for middle tier server-side applications with consistent low-latency~~175 Use when your server already receives raw audio from a media pipeline, call

~~66 network connections.~~176 system, or worker.

67 177

68](https://developers.openai.com/api/docs/guides/realtime-websocket)178](https://developers.openai.com/api/docs/guides/realtime-websocket)

69 179

71 181

72<span slot="icon">182<span slot="icon">

73 </span>183 </span>

~~74 Ideal for VoIP telephony connections.~~184 Use for telephony voice agents. Confirm model support before using SIP for

185 translation or transcription.

75 186

76](https://developers.openai.com/api/docs/guides/realtime-sip)187](https://developers.openai.com/api/docs/guides/realtime-sip)

77 188

78Depending on how you'd like to connect to a Realtime model, check out one of the connection guides above to get started. You'll learn how to initialize a Realtime session, and how to interact with a Realtime model using client and server events.189## Safety identifiers

~~80## API Usage~~

~~82Once connected to a realtime model using one of the methods above, learn how to interact with the model in these usage guides.~~

~~84- **[Prompting guide](https://developers.openai.com/api/docs/guides/realtime-models-prompting):** learn tips and best practices for prompting and steering Realtime models.~~

85- **[Managing conversations](https://developers.openai.com/api/docs/guides/realtime-conversations):** Learn about the Realtime session lifecycle and the key events that happen during a conversation.

~~86- **[MCP servers](https://developers.openai.com/api/docs/guides/realtime-mcp):** Connect remote MCP servers or connectors to a Realtime session and handle their event flow.~~

87- **[Webhooks and server-side controls](https://developers.openai.com/api/docs/guides/realtime-server-controls):** Learn how you can control a Realtime session on the server to call tools and implement guardrails.

~~88- **[Managing costs](https://developers.openai.com/api/docs/guides/realtime-costs):** Learn how to monitor and optimize your usage of the Realtime API.~~

~~89- **[Realtime audio transcription](https://developers.openai.com/api/docs/guides/realtime-transcription):** Transcribe audio streams in real time over a WebSocket connection.~~

~~91## Beta to GA migration~~

93There are a few key differences between the interfaces in the Realtime beta API and the recently released GA API. Expand the topics below for more information about migrating from the beta interface to GA.

~~95Beta header~~

~~97For REST API requests, WebSocket connections, and other interfaces with the Realtime API, beta users had to include the following header with each request:~~

~~99```~~

100OpenAI-Beta: realtime=v1

101```

~~102~~

103This header should be removed for requests to the GA interface. To retain the behavior of the beta API, you should continue to include this header.

~~104~~

105Generating ephemeral API keys

~~106~~

107In the beta interface, there were multiple endpoints for generating ephemeral keys for either Realtime sessions or transcription sessions. In the GA interface, there is only one REST API endpoint used to generate keys - [`POST /v1/realtime/client_secrets`](https://developers.openai.com/api/docs/api-reference/realtime-sessions/create-realtime-client-secret).

~~108~~

109To create a session and receive a client secret you can use to initialize a WebRTC or WebSocket connection on a client, you can request one like this using the appropriate session configuration:

~~110~~

111```javascript

112const sessionConfig = JSON.stringify({

113 session: {

114 type: "realtime",

115 model: "gpt-realtime",

116 audio: {

117 output: { voice: "marin" },

118 },

119 },

120});

~~121~~

122const response = await fetch(

123 "https://api.openai.com/v1/realtime/client_secrets",

124 {

125 method: "POST",

126 headers: {

127 Authorization: `Bearer ${apiKey}`,

128 "Content-Type": "application/json",

129 },

130 body: sessionConfig,

131 }

132);

~~133~~

134const data = await response.json();

135console.log(data.value); // e.g. ek_68af296e8e408191a1120ab6383263c2

136```

~~137~~

138These tokens can safely be used in client environments like browsers and mobile applications.

~~139~~

140New URL for WebRTC SDP data

~~141~~

142When initializing a WebRTC session in the browser, the URL for obtaining remote session information via SDP is now `/v1/realtime/calls`:

~~143~~

144```javascript

145const baseUrl = "https://api.openai.com/v1/realtime/calls";

146const model = "gpt-realtime";

147const sdpResponse = await fetch(baseUrl, {

148 method: "POST",

149 body: offer.sdp,

150 headers: {

151 Authorization: `Bearer YOUR_EPHEMERAL_KEY_HERE`,

152 "Content-Type": "application/sdp",

153 },

154});

~~155~~

156const sdp = await sdpResponse.text();

157const answer = { type: "answer", sdp };

158await pc.setRemoteDescription(answer);

159```

~~160~~

161New event names and shapes

~~162~~

163When creating or [updating](https://developers.openai.com/api/docs/api-reference/realtime_client_events/session/update) a Realtime session in the GA interface, you must now specify a session type, since now the same client event is used to create both speech-to-speech and transcription sessions. The options for the session type are:

~~164~~

165- `realtime` for speech-to-speech

166- `transcription` for realtime audio transcription

~~167~~

168```javascript

~~169~~

~~170~~

171const url = "wss://api.openai.com/v1/realtime?model=gpt-realtime";

172const ws = new WebSocket(url, {

173 headers: {

174 Authorization: "Bearer " + process.env.OPENAI_API_KEY,

175 },

176});

~~177~~

178ws.on("open", function open() {

179 console.log("Connected to server.");

~~180~~

181 // Send client events over the WebSocket once connected

182 ws.send(

183 JSON.stringify({

184 type: "session.update",

185 session: {

186 type: "realtime",

187 instructions: "Be extra nice today!",

188 },

189 })

190 );

191});

192```

~~193~~

194Configuration for input modalities and other properties has moved as well,

195notably output audio configuration like model voice. [Check the API reference](https://developers.openai.com/api/docs/api-reference/realtime_client_events) for the latest event shapes.

~~196~~

197```javascript

198ws.on("open", function open() {

199 ws.send(

200 JSON.stringify({

201 type: "session.update",

202 session: {

203 type: "realtime",

204 model: "gpt-realtime",

205 audio: {

206 output: { voice: "marin" },

207 },

208 },

209 })

210 );

211});

212```

~~213~~

214Finally, some event names have changed to reflect their new position in the event data model:

~~215~~

~~216- **`response.text.delta` → `response.output_text.delta`**~~

~~217- **`response.audio.delta` → `response.output_audio.delta`**~~

~~218- **`response.audio_transcript.delta` → `response.output_audio_transcript.delta`**~~

~~219~~

220New conversation item events

~~221~~

222For `response.output_item`, the API has always had both `.added` and `.done` events, but for conversation-level items the API previously had only `.created`, which by convention is emitted at the start, when the item is added.

~~223~~

224We have added `.added` and `.done` events to give developers better ergonomics when receiving events that need some loading time (such as MCP tool listing, or input audio transcriptions if these were to be modeled as items in the future).

~~225~~

226Current event shape for conversation items added:

~~227~~

228```javascript

229{

230 "event_id": "event_1920",

231 "type": "conversation.item.created",

232 "previous_item_id": "msg_002",

233 "item": Item

234}

235```

236 190

237New events to replace the above:191If your application identifies individual end users, include a [safety identifier](https://developers.openai.com/api/docs/guides/safety-best-practices#implement-safety-identifiers) with Realtime API requests. Safety identifiers are recommended but not required. They help OpenAI monitor and detect abuse while allowing enforcement to target an individual user rather than your entire organization. Use a stable, privacy-preserving value, such as a hashed internal user ID.

~~238~~

239```javascript

240{

241 "event_id": "event_1920",

242 "type": "conversation.item.added",

243 "previous_item_id": "msg_002",

244 "item": Item

245}

246```

~~247~~

248```javascript

249{

250 "event_id": "event_1920",

251 "type": "conversation.item.done",

252 "previous_item_id": "msg_002",

253 "item": Item

254}

255```

256 192

257Input and output item changes193For Realtime API requests, send the identifier in the `OpenAI-Safety-Identifier` header. When using ephemeral tokens, set the header on the server-side request that creates the client secret so the identifier is bound to that session. When connecting from a trusted server with WebSocket or the unified WebRTC interface, set the header on the connection request.

258 194

259### All Items195Safety identifiers do not carry over from Responses API requests or from other sessions. If you use the Responses API `safety_identifier` parameter elsewhere in your application, pass the same stable value separately when you create or connect each Realtime session.
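For the ephemeral-token case, the server-side request that creates the client secret could carry the header roughly as sketched here. The `hashUserId` and `buildClientSecretRequest` helpers, the salt handling, and the choice of SHA-256 are assumptions made for illustration; only the `OpenAI-Safety-Identifier` header name and the `/v1/realtime/client_secrets` endpoint appear in these docs.

```javascript
const { createHash } = require("node:crypto");

// Hypothetical helper: derive a stable, privacy-preserving identifier from an
// internal user ID. SHA-256 with an application-level salt is an assumption.
function hashUserId(internalUserId, salt) {
  return createHash("sha256").update(`${salt}:${internalUserId}`).digest("hex");
}

// Hypothetical helper: build the fetch() arguments for the server-side request
// that creates a client secret, with the safety identifier attached as a header.
function buildClientSecretRequest(apiKey, safetyIdentifier, sessionConfig) {
  return {
    url: "https://api.openai.com/v1/realtime/client_secrets",
    options: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
        "OpenAI-Safety-Identifier": safetyIdentifier,
      },
      body: JSON.stringify({ session: sessionConfig }),
    },
  };
}
```

A server route would then call `fetch(url, options)` with the returned values and pass only the resulting client secret to the browser, so the standard API key and the raw user ID stay on the server.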

260 196

261Realtime API sets an `object=realtime.item` param on all items in the GA interface.197## Related guides

262 198

263### Function Call Output199- [Realtime prompting guide](https://developers.openai.com/api/docs/guides/realtime-models-prompting): Prompt and tune Realtime voice models.

200- [Managing conversations](https://developers.openai.com/api/docs/guides/realtime-conversations): Work with the Realtime session lifecycle.

201- [Realtime translation](https://developers.openai.com/api/docs/guides/realtime-translation): Translate live speech with a dedicated translation session.

202- [Realtime transcription](https://developers.openai.com/api/docs/guides/realtime-transcription): Stream live transcript deltas from audio.

203- [Realtime with tools](https://developers.openai.com/api/docs/guides/realtime-mcp): Connect function tools, MCP servers, and connectors to a Realtime session.

204- [Webhooks and server-side controls](https://developers.openai.com/api/docs/guides/realtime-server-controls): Control Realtime sessions from your server.

205- [Managing costs](https://developers.openai.com/api/docs/guides/realtime-costs): Track and optimize Realtime API usage.

264 206

265`status` : Realtime now accepts a no-op `status` field for the function call output item param. This aligns with the Responses API implementation.207Use [Audio and speech](https://developers.openai.com/api/docs/guides/audio) for the core concepts behind

~~266~~ 208 audio input, audio output, streaming, latency, transcripts, and speech

267### Message209 generation. Use this overview when you are ready to choose an implementation

~~268~~ 210 path.

269**Assistant Message Content**

~~270~~

271The `type` properties of output assistant messages now align with the Responses API:

~~272~~

273- `type=text` → `type=output_text` (no change to `text` field name)

274- `type=audio` → `type=output_audio` (no change to `audio` field name)

guides/realtime-conversations.md +4 −4


1# Realtime conversations1# Realtime conversations

2 2

3Once you have connected to the Realtime API through either [WebRTC](https://developers.openai.com/api/docs/guides/realtime-webrtc) or [WebSocket](https://developers.openai.com/api/docs/guides/realtime-websocket), you can call a Realtime model (such as [gpt-realtime](https://developers.openai.com/api/docs/models/gpt-realtime)) to have speech-to-speech conversations. Doing so will require you to **send client events** to initiate actions, and **listen for server events** to respond to actions taken by the Realtime API.3Once you have connected to the Realtime API through either [WebRTC](https://developers.openai.com/api/docs/guides/realtime-webrtc) or [WebSocket](https://developers.openai.com/api/docs/guides/realtime-websocket), you can call a Realtime model (such as [gpt-realtime-2](https://developers.openai.com/api/docs/models/gpt-realtime-2)) to have speech-to-speech conversations. Doing so will require you to **send client events** to initiate actions, and **listen for server events** to respond to actions taken by the Realtime API.

4 4

5This guide will walk through the event flows required to use model capabilities like audio and text generation, image input, and function calling, and how to think about the state of a Realtime Session.5This guide will walk through the event flows required to use model capabilities like audio and text generation, image input, and function calling, and how to think about the state of a Realtime Session.

6 6

40 type: "session.update",40 type: "session.update",

41 session: {41 session: {

42 type: "realtime",42 type: "realtime",

~~43 model: "gpt-realtime",~~43 model: "gpt-realtime-2",

44 // Lock the output to audio (set to ["text"] if you want text without audio)44 // Lock the output to audio (set to ["text"] if you want text without audio)

45 output_modalities: ["audio"],45 output_modalities: ["audio"],

46 audio: {46 audio: {

82 "type": "session.update",82 "type": "session.update",

83 session: {83 session: {

84 type: "realtime",84 type: "realtime",

~~85 model: "gpt-realtime",~~85 model: "gpt-realtime-2",

86 # Lock the output to audio (add "text" if you also want text)86 # Lock the output to audio (add "text" if you also want text)

87 output_modalities: ["audio"],87 output_modalities: ["audio"],

88 audio: {88 audio: {

602 602

603## Image inputs603## Image inputs

604 604

605`gpt-realtime` and `gpt-realtime-mini` also support image input. You can attach an image as a content part in a user message, and the model can incorporate what’s in the image when it responds.605`gpt-realtime-2` and `gpt-realtime` also support image input. You can attach an image as a content part in a user message, and the model can incorporate what’s in the image when it responds.

606 606

607Add an image to the conversation607Add an image to the conversation

608 608

guides/realtime-costs.md +7 −5


1# Managing costs1# Managing costs

2 2

3This document describes how Realtime API billing works and offer strategies for optimizing costs. Costs are accrued as input and output tokens of different modalities: text, audio, and image. Token costs vary per model, with prices listed on the model pages (e.g. for [`gpt-realtime`](https://developers.openai.com/api/docs/models/gpt-realtime) and [`gpt-realtime-mini`](https://developers.openai.com/api/docs/models/gpt-realtime-mini)).3This document describes how Realtime API billing works and offers strategies for optimizing costs. Voice-agent sessions accrue input and output tokens across text, audio, and image modalities. Streaming translation and streaming transcription sessions are billed by audio duration. Prices vary per model, with prices listed on the model pages (for example, [`gpt-realtime-2`](https://developers.openai.com/api/docs/models/gpt-realtime-2), [`gpt-realtime-translate`](https://developers.openai.com/api/docs/models/gpt-realtime-translate), [`gpt-realtime-whisper`](https://developers.openai.com/api/docs/models/gpt-realtime-whisper), and [`gpt-realtime`](https://developers.openai.com/api/docs/models/gpt-realtime)).

4 4

5Conversational Realtime API sessions are a series of _turns_, where the user adds input that triggers a _Response_ to produce the model output. The server maintains a _Conversation_, which is a list of _Items_ that form the input for the next turn. When a Response is returned the output is automatically added to the Conversation.5Conversational Realtime API sessions are a series of _turns_, where the user adds input that triggers a _Response_ to produce the model output. The server maintains a _Conversation_, which is a list of _Items_ that form the input for the next turn. When a Response is returned, the output is automatically added to the Conversation.

7Translation and transcription sessions use a different streaming architecture. The client streams audio continuously and receives translated audio, transcript deltas, or transcript events as the source audio arrives. These sessions don't use the normal Response lifecycle, so estimate and monitor them with their duration-based rates instead of per-Response token usage.

6 8

7## Per-Response costs9## Per-Response costs

8 10

9Realtime API costs are accrued when a Response is created, and are charged based on the number of input and output tokens (except for input transcription costs, see below). There is no cost currently for network bandwidth or connections. A Response can be created manually or automatically if voice activity detection (VAD) is turned on. VAD will effectively filter out empty input audio, so empty audio does not count as input tokens unless the client manually adds it as conversation input.11Realtime API costs are accrued when a Response is created, and are charged based on the number of input and output tokens (except for input transcription costs, see below). There is no cost currently for network bandwidth or connections. A Response can be created manually or automatically if voice activity detection (VAD) is turned on. VAD will effectively filter out empty input audio, so empty audio doesn't count as input tokens unless the client manually adds it as conversation input.

10 12

11The entire conversation is sent to the model for each Response. The output from a turn will be added as Items to the server Conversation and become the input to subsequent turns, thus turns later in the session will be more expensive.13The entire conversation is sent to the model for each Response. The output from a turn will be added as Items to the server Conversation and become the input to subsequent turns, thus turns later in the session will be more expensive.

12 14

89 91

90When the number of tokens in a conversation exceeds the model's input token limit, the conversation will be truncated, meaning messages (starting from the oldest) will be dropped from the Response input. A 32k context model with 4,096 max output tokens can only include 28,224 tokens in the context before truncation occurs.92When the number of tokens in a conversation exceeds the model's input token limit, the conversation will be truncated, meaning messages (starting from the oldest) will be dropped from the Response input. A 32k context model with 4,096 max output tokens can only include 28,224 tokens in the context before truncation occurs.

91 93

92Clients can set a smaller token window than the model’s maximum, which is a good way to control token usage and cost. This is controlled with the `token_limits.post_instructions` configuration (if you configure truncation with a `retention_ratio` type as shown below). As the name indicates, this controls the maximum number of input tokens for a Response, except for the instruction tokens. Setting `post_instructions` to 1,000 means that items over the 1,000 input token limit will not be sent to the model for a Response.94Clients can set a smaller token window than the model’s maximum, which is a good way to control token usage and cost. This is controlled with the `token_limits.post_instructions` configuration (if you configure truncation with a `retention_ratio` type as shown below). As the name indicates, this controls the maximum number of input tokens for a Response, except for the instruction tokens. Setting `post_instructions` to 1,000 means that items over the 1,000 input token limit won't be sent to the model for a Response.

93 95

94Truncation busts the cache near the beginning of the conversation, and if truncation occurs on every turn then cache rate will be very low. To mitigate this issue clients can configure truncation to drop more messages than necessary, which will extend the headroom before another truncation is needed. This can be controlled with the `session.truncation.retention_ratio` setting. The server defaults to a value of `1.0` , meaning truncation will remove only the items necessary. A value of `0.8` means a truncation would retain 80% of the maximum, dropping an additional 20%.96Truncation busts the cache near the beginning of the conversation, and if truncation occurs on every turn then cache rate will be very low. To mitigate this issue clients can configure truncation to drop more messages than necessary, which will extend the headroom before another truncation is needed. This can be controlled with the `session.truncation.retention_ratio` setting. The server defaults to a value of `1.0` , meaning truncation will remove only the items necessary. A value of `0.8` means a truncation would retain 80% of the maximum, dropping an additional 20%.
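The arithmetic behind `retention_ratio` can be sketched as follows. This is only an approximation of the behavior described above: the helper names are hypothetical, and the assumption that the ratio applies to the effective input token limit and rounds down is not confirmed by this page.

```javascript
// Hypothetical helper: approximate number of input tokens kept after a
// truncation, assuming the ratio applies to the effective input token limit.
function retainedTokenBudget(maxInputTokens, retentionRatio) {
  if (!(retentionRatio > 0 && retentionRatio <= 1)) {
    throw new RangeError("retentionRatio must be in the range (0, 1]");
  }
  return Math.floor(maxInputTokens * retentionRatio);
}

// Hypothetical helper: tokens of headroom available before the next
// truncation is needed, under the same assumption.
function headroomAfterTruncation(maxInputTokens, retentionRatio) {
  return maxInputTokens - retainedTokenBudget(maxInputTokens, retentionRatio);
}
```

Under these assumptions, a 1,000-token limit with a ratio of `0.8` keeps about 800 tokens and leaves about 200 tokens of headroom, so several turns can pass before the cache is invalidated by another truncation. With the default of `1.0` there is no extra headroom.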

95 97

125 127

126### Using a mini model128### Using a mini model

127 129

128The Realtime speech-to-speech models come in a “normal” size and a mini size, which is significantly cheaper. The tradeoff here tends to be intelligence related to instruction following and function calling, which will not be as effective in the mini model. We recommend first testing applications with the larger model, refining your application and prompt, then attempting to optimize using the mini model.130The Realtime speech-to-speech models come in a “normal” size and a mini size, which is significantly cheaper. The tradeoff here tends to be intelligence related to instruction following and function calling, which won't be as effective in the mini model. We recommend first testing applications with the larger model, refining your application and prompt, then attempting to optimize using the mini model.

129 131

130### Editing the Conversation132### Editing the Conversation

131 133

guides/realtime-mcp.md +139 −19


~~1# Realtime API with MCP~~1# Realtime with tools

2 2

3You can attach MCP tools directly to a Realtime session so the model can discover and call remote tools during a live conversation. For MCP, the control flow is the same whether your client is using a [WebRTC data channel](https://developers.openai.com/api/docs/guides/realtime-webrtc) or a [WebSocket](https://developers.openai.com/api/docs/guides/realtime-websocket).3You can attach tools to a Realtime session so the model can look up data, take actions, or call services during a live conversation. Tool configuration uses the same event surface whether your client is using a [WebRTC data channel](https://developers.openai.com/api/docs/guides/realtime-webrtc) or a [WebSocket](https://developers.openai.com/api/docs/guides/realtime-websocket).

4 4

5This page covers the Realtime-specific setup and event flow. For broader MCP concepts, auth patterns, connectors, and safety guidance, see [MCP and Connectors](https://developers.openai.com/api/docs/guides/tools-connectors-mcp).5Use function tools when your application should execute the tool and return the result. Use MCP tools or built-in connectors when the Realtime API should connect to a remote tool server for you.

6 6

~~7## Configure an MCP tool~~7## Choose a tool type

9| Tool type | Use when | Who executes it |

10| ------------------------- | ------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------- |

11| `function` | Your application owns the business logic, approval checks, or private system access. | Your client or server receives a function call and returns `function_call_output`. |

12| `mcp` with `server_url` | You want the model to call tools exposed by a remote MCP server. | The Realtime API calls the remote MCP server. |

13| `mcp` with `connector_id` | You want to use a built-in connector such as Google Calendar. | The Realtime API calls the connector with the authorization you provide. |

15Add tools in **one of two places**:

17- At the **session level** with `session.tools` in [`session.update`](https://developers.openai.com/api/docs/api-reference/realtime-client-events/session/update), if you want the tool available for the full session.

18- At the **response level** with `response.tools` in [`response.create`](https://developers.openai.com/api/docs/api-reference/realtime-client-events/response/create), if you only need the tool for one turn.
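A response-level configuration might look like the sketch here, which offers a tool for a single turn instead of the whole session. The `buildResponseWithTools` helper and the `lookupOrderTool` definition are hypothetical; the event type and the `response.tools` field follow the description in the list above, and the exact accepted fields should be checked against the API reference.

```javascript
// Hypothetical function tool definition, used only for illustration.
const lookupOrderTool = {
  type: "function",
  name: "lookup_order",
  description: "Look up an order by its order number.",
  parameters: {
    type: "object",
    properties: {
      order_number: { type: "string" },
    },
    required: ["order_number"],
  },
};

// Hypothetical helper: build a response.create event that makes the given
// tools available for this one Response only.
function buildResponseWithTools(tools) {
  return {
    type: "response.create",
    response: { tools },
  };
}

const responseEvent = buildResponseWithTools([lookupOrderTool]);
```

The resulting event would be serialized with `JSON.stringify(responseEvent)` and sent over the WebSocket or WebRTC data channel like any other client event.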

20## Configure a function tool

22Function tools are the right default when the tool should run in your application. The model emits function call arguments, your code executes the action, and your code sends the result back with a `function_call_output` item.

24Configure a function tool with session.update

26```javascript

27const event = {

28 type: "session.update",

29 session: {

30 type: "realtime",

31 model: "gpt-realtime-2",

32 tools: [

33 {

34 type: "function",

35 name: "lookup_order",

36 description: "Look up an order by its order number.",

37 parameters: {

38 type: "object",

39 properties: {

40 order_number: {

41 type: "string",

42 description: "The customer-facing order number.",

43 },

44 },

45 required: ["order_number"],

46 },

47 },

48 ],

49 tool_choice: "auto",

50 },

51};

8 52

~~9Add MCP tools in **one of two places**:~~53ws.send(JSON.stringify(event));

54```

56```python

57event = {

58 "type": "session.update",

59 "session": {

60 "type": "realtime",

61 "model": "gpt-realtime-2",

62 "tools": [

63 {

64 "type": "function",

65 "name": "lookup_order",

66 "description": "Look up an order by its order number.",

67 "parameters": {

68 "type": "object",

69 "properties": {

70 "order_number": {

71 "type": "string",

72 "description": "The customer-facing order number.",

73 }

74 },

75 "required": ["order_number"],

76 },

77 }

78 ],

79 "tool_choice": "auto",

80 },

81}

83ws.send(json.dumps(event))

84```

When the model calls the function, listen for the function call item, run your application logic, then send the output back:

Send function call output

```javascript
// `functionCall` is the function call item received from the server.
const event = {
  type: "conversation.item.create",
  item: {
    type: "function_call_output",
    call_id: functionCall.call_id,
    output: JSON.stringify({
      status: "shipped",
      delivery_date: "2026-05-09",
    }),
  },
};

ws.send(JSON.stringify(event));
// Ask the model to continue now that the tool output is in the conversation.
ws.send(JSON.stringify({ type: "response.create" }));
```

```python
# `function_call` is the function call item received from the server.
event = {
    "type": "conversation.item.create",
    "item": {
        "type": "function_call_output",
        "call_id": function_call["call_id"],
        "output": json.dumps(
            {
                "status": "shipped",
                "delivery_date": "2026-05-09",
            }
        ),
    },
}

ws.send(json.dumps(event))
# Ask the model to continue now that the tool output is in the conversation.
ws.send(json.dumps({"type": "response.create"}))
```

For a full event-by-event walkthrough of function calling, see [Managing conversations](https://developers.openai.com/api/docs/guides/realtime-conversations#function-calling).
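
The "run your application logic, then send the output back" step can be sketched as a small dispatcher. This is a minimal illustration, not the official client: it assumes the function call item exposes `name`, a JSON-string `arguments` field, and `call_id`, and `lookup_order` here is a hypothetical stand-in for real application code.

```python
import json


def lookup_order(order_number: str) -> dict:
    """Hypothetical application logic; replace with a real order lookup."""
    return {"status": "shipped", "delivery_date": "2026-05-09"}


# Map tool names (as registered in session.tools) to local handlers.
HANDLERS = {"lookup_order": lookup_order}


def build_function_output_events(function_call: dict) -> list[dict]:
    """Run the matching handler and return the two events to send back."""
    try:
        handler = HANDLERS.get(function_call["name"])
        if handler is None:
            raise ValueError(f"unknown tool: {function_call['name']}")
        # Assumption: arguments arrive as a JSON-encoded string.
        arguments = json.loads(function_call["arguments"])
        result = handler(**arguments)
    except Exception as exc:
        # Report the failure to the model instead of dropping the call.
        result = {"error": str(exc)}
    return [
        {
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": function_call["call_id"],
                "output": json.dumps(result),
            },
        },
        {"type": "response.create"},
    ]
```

Each returned event would then be serialized with `json.dumps` and sent over the open connection, in order.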

## Configure an MCP tool

MCP tools are useful when the tool already exists behind a remote MCP server, or when you want to use an OpenAI-managed connector. Unlike function tools, MCP tools are executed by the Realtime API itself.

In Realtime, the MCP tool shape is:

      type: "session.update",
      session: {
        type: "realtime",
        model: "gpt-realtime-2",
        output_modalities: ["text"],
        tools: [
          {

        "type": "session.update",
        "session": {
            "type": "realtime",
            "model": "gpt-realtime-2",
            "output_modalities": ["text"],
            "tools": [
                {

      type: "session.update",
      session: {
        type: "realtime",
        model: "gpt-realtime-2",
        output_modalities: ["text"],
        tools: [
          {

        "type": "session.update",
        "session": {
            "type": "realtime",
            "model": "gpt-realtime-2",
            "output_modalities": ["text"],
            "tools": [
                {

Remote MCP servers <strong>don't automatically receive the full conversation context</strong>, but <strong>they can see any data the model sends in a tool call</strong>. <strong>Keep the tool surface narrow</strong> with <code>allowed_tools</code>, and require approval for any action you would not auto-run.

## Realtime MCP flow

Unlike Realtime `function` tools, remote MCP tools are **executed by the Realtime API itself**. **Your client doesn't run the remote tool** or return a `function_call_output`. Instead, your client configures access, listens for MCP lifecycle events, and optionally sends an approval response if the server asks for one.

A typical flow looks like this:

1. You send `session.update` or `response.create` with a `tools` entry whose `type` is `mcp`.
1. The server begins importing tools and emits `mcp_list_tools.in_progress`.
1. While listing is still in progress, the model can't call a tool that hasn't loaded yet. If you want to wait before starting a turn that depends on those tools, listen for [`mcp_list_tools.completed`](https://developers.openai.com/api/docs/api-reference/realtime-server-events/mcp_list_tools/completed). The [`conversation.item.done`](https://developers.openai.com/api/docs/api-reference/realtime-server-events/conversation/item/done) event whose `item.type` is `mcp_list_tools` shows which tool names were actually imported. If import fails, you will receive [`mcp_list_tools.failed`](https://developers.openai.com/api/docs/api-reference/realtime-server-events/mcp_list_tools/failed).
1. The user speaks or sends text, and a response is created, either by your client or automatically by the session configuration.
1. If the model chooses an MCP tool, you will see `response.mcp_call_arguments.delta` and `response.mcp_call_arguments.done`.
1. **If approval is required**, the server adds a conversation item whose `item.type` is `mcp_approval_request`. Your client must answer it with an `mcp_approval_response` item.
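
The lifecycle steps listed here can be tracked client-side with a small state holder. The sketch below is illustrative only: it relies on the event type names mentioned in the flow, and it assumes that events carrying a conversation item expose it under `item` with an `id` field, which is not confirmed by this guide.

```python
class McpToolTracker:
    """Tracks MCP tool import state and unanswered approval requests."""

    def __init__(self) -> None:
        self.state = "idle"  # idle -> loading -> ready | failed
        self.pending_approvals: set[str] = set()

    def handle(self, event: dict) -> None:
        event_type = event.get("type", "")
        if event_type == "mcp_list_tools.in_progress":
            self.state = "loading"
        elif event_type == "mcp_list_tools.completed":
            self.state = "ready"
        elif event_type == "mcp_list_tools.failed":
            self.state = "failed"

        # Assumption: item-bearing events expose the item under "item".
        item = event.get("item") or {}
        if item.get("type") == "mcp_approval_request":
            # A set keeps the request unique if several events repeat it.
            self.pending_approvals.add(item.get("id", ""))

    @property
    def tools_ready(self) -> bool:
        """True once the tool import has completed."""
        return self.state == "ready"
```

A client could check `tools_ready` before starting a turn that depends on MCP tools, and answer every entry in `pending_approvals` with an `mcp_approval_response` item.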

## Common failures

- [`mcp_list_tools.failed`](https://developers.openai.com/api/docs/api-reference/realtime-server-events/mcp_list_tools/failed): the Realtime API couldn't import tools from the remote server or connector. Check `server_url` or `connector_id`, authentication, server connectivity, and any `allowed_tools` names you specified.
- [`response.mcp_call.failed`](https://developers.openai.com/api/docs/api-reference/realtime-server-events/response/mcp_call/failed): the model selected a tool, but the tool call didn't complete. Inspect the event payload and the later `mcp_call` item for MCP protocol, execution, or transport errors.
- `mcp_approval_request` with no matching `mcp_approval_response`: the tool call can't continue until your client explicitly approves or rejects it.
- A turn starts while `mcp_list_tools.in_progress` is still active: only tools that have already finished loading are eligible for that turn.
- A response uses `tool_choice: "required"` but no tools are currently available: the model has nothing eligible to call. Wait for `mcp_list_tools.completed`, confirm that at least one tool was imported, or use a different `tool_choice` for turns that don't require a tool.
- MCP tool definition validation fails before import starts: common causes are a duplicate `server_label` in the same `tools` array, setting both `server_url` and `connector_id`, omitting both of them on the initial session creation request, using an invalid `connector_id`, or sending both `authorization` and `headers.Authorization`. For connectors, don't send `headers.Authorization` at all.
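
Several of the validation failures in the last item can be caught locally before sending the request. The function below is a rough pre-flight check built only from the causes listed here; it is not the server's actual validation logic, and it cannot tell whether a `connector_id` is valid.

```python
def validate_mcp_tools(tools: list[dict]) -> list[str]:
    """Return human-readable problems found in MCP tool definitions."""
    problems: list[str] = []
    seen_labels: set = set()
    for tool in tools:
        if tool.get("type") != "mcp":
            continue  # Only MCP entries are checked here.
        label = tool.get("server_label")
        if label in seen_labels:
            problems.append(f"duplicate server_label: {label}")
        seen_labels.add(label)

        has_url = bool(tool.get("server_url"))
        has_connector = bool(tool.get("connector_id"))
        if has_url and has_connector:
            problems.append(f"{label}: set server_url or connector_id, not both")
        if not has_url and not has_connector:
            problems.append(f"{label}: missing server_url or connector_id")

        has_auth_header = "Authorization" in (tool.get("headers") or {})
        if has_auth_header and tool.get("authorization"):
            problems.append(f"{label}: both authorization and headers.Authorization")
        if has_connector and has_auth_header:
            problems.append(f"{label}: connectors must not send headers.Authorization")
    return problems
```

An empty list means none of these specific checks failed; the server may still reject the definition for other reasons.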

## Approve or reject MCP tool calls

guides/realtime-models-prompting.md +1868 −302


# Using realtime models

`gpt-realtime-2` is our state-of-the-art reasoning voice model for low-latency speech-to-speech applications. It can think before it speaks, follow instructions more reliably, use a larger context window, and call tools with greater precision than earlier realtime models.

To take advantage of these gains, design prompts with more intent. Define the assistant's responsibilities, decision points, tool-calling behavior, and guardrails clearly: what it should do, when it should do it, and what it should avoid.

Start simple. Do not over-prompt upfront. Begin with a minimal prompt, run evaluations, then add instructions only for behaviors that fail in testing.

## Choose a model

<table>
  <thead>
    <tr>
      <th>Model</th>
      <th>Use when</th>
      <th>Prompting focus</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style={{ whiteSpace: "nowrap" }}>
        <a href="/api/docs/models/gpt-realtime-2">
          <code>gpt-realtime-2</code>
        </a>
      </td>
      <td>
        You need the strongest realtime reasoning, tool use, and instruction
        following.
      </td>
      <td>
        Tune reasoning effort, preambles, tool policies, exact entity capture,
        and long-session state.
      </td>
    </tr>
    <tr>
      <td style={{ whiteSpace: "nowrap" }}>
        <a href="/api/docs/models/gpt-realtime-1.5">
          <code>gpt-realtime-1.5</code>
        </a>
      </td>
      <td>You need a fast, reliable non-reasoning speech-to-speech model.</td>
      <td>
        Follow the core realtime prompt structure and test for latency-sensitive
        behavior.
      </td>
    </tr>
  </tbody>
</table>

<div data-content-switcher-pane data-value="gpt-realtime-2">

## Realtime 2.0 Prompting Guide

  <p>
    Use <code>gpt-realtime-2</code> when the voice agent needs stronger
    reasoning, tool selection, exact entity handling, or long-session state.
    Start with <code>reasoning.effort: "low"</code>, test default preamble
    behavior, and define clear confirmation boundaries before write actions.
  </p>

## What changed in Realtime 2

Prompt Realtime 2 as a reasoning voice agent, not as a basic voice bot.

| Change | What it means for prompts |
| --- | --- |
| Reasoning | Allow the model to reason internally for complex tasks before speaking or calling tools. Use preambles to avoid awkward silence or unnecessary filler. |
| Prompt precision matters more | Replace broad guidance like "be helpful" with clear trigger, action, and exception rules: when to act, what to do, and when not to do it. |
| Instruction conflicts are more costly | Remove overlapping `always`, `never`, `only`, and `must` rules unless they are truly required. Define priority when rules compete. |
| Tool behavior is more steerable | Specify when the assistant should act immediately, ask for missing information, confirm high-precision details, retry after failure, or escalate. |
| Preambles are first-class behavior | The model may speak brief updates before longer reasoning or tool-use flows. Steer when preambles should appear, how short they should be, and when to skip them. |
| Expanded context window | `gpt-realtime-2` expands the realtime context window from 32k to 128k tokens, making it better suited for long sessions and larger system prompts. |

Preambles aren't hidden chain-of-thought. They're short spoken updates such as "I'll check that order now." Don't ask the model to reveal private reasoning.

## Recommended prompt structure

Use short, labeled sections. The model should be able to find the relevant instructions quickly.

```text
# Role and Objective

# Personality and Tone

# Language

# Reasoning

# Message Channels

# Preambles

# Verbosity

# Tools

# Unclear Audio

# Entity Capture

# Long Context Behavior

# Escalation
```

Not every use case needs every section. Add the sections that are relevant for your product.

## Set reasoning effort

`gpt-realtime-2` can trade latency for deeper reasoning. Use the lowest reasoning level that still gives the assistant enough intelligence for the workflow.

Start with `low` for most production voice agents. Tune up or down based on task complexity, latency tolerance, and failure cost.

| Effort | Use when | Example |
| --- | --- | --- |
| `minimal` | Lowest latency matters most and the task is simple. | Smart-home commands, timers, simple calendar checks. |
| `low` | You need responsiveness plus basic reasoning. | Customer support, order lookup, simple policy questions. |
| `medium` | The assistant must reason through multi-step tasks. | Technical support, diagnostics, complex routing. |
| `high` | Deeper reasoning materially improves success. | High-precision workflows, escalation decisions, tasks with constraints. |

Beyond the API setting, steer the model on when and how much to reason.

```text
## Reasoning

- For direct answers, simple lookups, and short confirmations, respond quickly and do not reason.
- For multi-step tasks, tool decisions, troubleshooting, or escalation, reason before acting.
- Do not perform extended reasoning when the user's audio is unclear; ask for clarification instead.
```
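
As a rough illustration of the API setting, the helper below builds a `session.update` event with a validated effort level. The placement of the setting under `session.reasoning.effort` is an assumption inferred from the `reasoning.effort: "low"` notation used in this guide; check the API reference for the exact field location before relying on it.

```python
VALID_EFFORTS = ("minimal", "low", "medium", "high")


def build_reasoning_update(effort: str = "low") -> dict:
    """Build a session.update event that sets the reasoning effort."""
    if effort not in VALID_EFFORTS:
        raise ValueError(f"effort must be one of {VALID_EFFORTS}, got {effort!r}")
    return {
        "type": "session.update",
        "session": {
            "type": "realtime",
            "model": "gpt-realtime-2",
            # Assumed field placement; see the lead-in above.
            "reasoning": {"effort": effort},
        },
    }
```

Validating the value locally keeps a typo such as `"hgih"` from reaching the API, and defaulting to `low` mirrors the starting point recommended above.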

## Use preambles intentionally

Preambles are short spoken updates that keep a voice agent feeling responsive while it reasons, looks something up, or calls a tool. Used well, they reassure the user that the assistant is working. Used poorly, they become filler and increase perceived latency.

`gpt-realtime-2` generates preambles by default. Start by testing the default behavior. If it does not match your product experience, tune it explicitly.

![Preamble generation and playback timeline](https://developers.openai.com/images/platform/guides/realtime-2-preambles.png)

```text
## Preambles

Use short preambles only when they help the user understand that work is happening.

### When to use a preamble

Use a preamble when:

- you are about to call a tool that may take noticeable time;
- you need to reason through a multi-step request;
- you are checking records, availability, account state, or policy details;
- you are preparing an escalation or handoff;
- silence would make the assistant feel unresponsive.

When a preamble is needed, output it immediately before substantive reasoning or tool use.

### When not to use a preamble

Do not use a preamble when:

- the answer is direct and can be given immediately;
- the user is only confirming, correcting, or declining something;
- the audio is unclear and you need clarification;
- the latest audio is silence, background noise, hold music, TV audio, or side conversation;
- the tool call is lightweight and the user would not benefit from an update.

### Preamble style

When using a preamble:

- keep it natural, calm, and concise;
- vary the wording across turns;
- describe the action, not the internal reasoning;
- avoid filler.

Avoid phrases like:

- "Let me think..."
- "Hmm..."
- "One moment while I process that..."
- "I am now going to access the tool..."

### Preamble length

Use one short sentence.

Do not exceed two short sentences unless the user needs an explanation before a high-impact action.

### Prefer

- "I'll check that order now."
- "I'll look up your appointment details."
- "I'll verify that before we make any changes."
- "I'll check the policy and then give you the next step."
- "I'll pull that up so we can make sure it's the right account."

### Avoid

- "Let me think about that for a second."
- "Please wait while I process your request."
- "I'm going to use my tools now."
- "Interesting question. I will reason through this carefully."
```

## Control response length

`gpt-realtime-2` follows length guidance best when the prompt specifies how much detail to give for each task type. Instead of telling the model to "be concise," define what concise means in context: direct answers, tool results, troubleshooting, comparisons, and escalations may each need different response lengths.

```text
## Verbosity

- Direct answers: Use 1-2 short sentences.
- Clarifying questions: Ask one question at a time.
- Tool results: Summarize the result first, then give only the next useful action.
- Product or option comparisons: Include key differences, tradeoffs, and who each option fits.
- Troubleshooting: Give one step at a time unless the user asks for the full procedure.
- Escalations: Briefly explain why escalation is needed and what will happen next.
```

Example:

> User: Which plan should I choose?

> Assistant: If you want the lowest cost, choose Basic. If you need team permissions and shared billing, choose Pro. If compliance review or admin controls matter, choose Enterprise.

## Design tool behavior

`gpt-realtime-2` is stronger at tool calling, but tool behavior still depends on prompt and tool-spec design. If the prompt does not define when to act, ask, confirm, or recover, the assistant may call tools too early, ask unnecessary questions, or repeat failed calls.

### Set tool-call eagerness

High eagerness works well for read-only, low-risk actions. Low eagerness is better when tools modify data, trigger external effects, or depend on exact identifiers.

| Tool type | Default behavior |
| --- | --- |
| Read-only, low-risk lookup | Call when intent and required fields are clear. |
| Read-only with exact identifier | Confirm the identifier before lookup. |
| User-visible communication | Draft or summarize before sending. |
| Account changes | Confirm before calling. |
| Purchases, cancellations, payments | Confirm amount, target, and consequence before calling. |
| Irreversible or high-impact actions | Confirm explicitly and offer escalation when appropriate. |

Use this balanced default when you have a mix of read and write actions. Tailor it based on your use case.

```text
## Tools

Use only the tools explicitly provided in the current tool list. Do not invent, assume, simulate, or rename tools.

For read-only tools:

- Call the tool when the user's intent is clear and all required fields are available.
- Do not ask for confirmation unless the lookup depends on a high-precision identifier or there is meaningful risk of using the wrong record.
- Ask a clarification question only if a required field is missing, ambiguous, or conflicting.

For write tools or external actions:

- Summarize the intended action before calling the tool.
- Include the key consequence, such as what will be changed, sent, canceled, ordered, or charged.
- Ask for confirmation.
- Do not call the tool until the user clearly confirms.

For exact identifiers:

- Treat order IDs, tracking numbers, account numbers, confirmation codes, phone numbers, and email addresses as high precision.
- Normalize only when the field type is clear.
- Confirm the final value before account-specific lookups, validation, or write actions.

After tool calls:

- Only say an action was completed after the tool call succeeds.
- If the tool fails, explain the failure briefly, avoid raw errors, and give the user a clear next step.
```

High-risk example:

> User: Charge my card for the remaining balance.

Bad:

> Assistant: I've charged your card.

Good:

> Assistant: To confirm, you want me to charge the card on file $248.16 for the remaining balance. Should I proceed?
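
Prompt-level confirmation rules can be backed by an application-side gate so that a write tool never runs on the model's say-so alone. The sketch below is one possible design, not part of the Realtime API; the tool names and the `user_confirmed` flag are hypothetical and would come from your own conversation state.

```python
from enum import Enum


class Risk(Enum):
    READ_ONLY = "read_only"
    WRITE = "write"


# Hypothetical tool names mapped to the risk classes from the table above.
TOOL_RISK = {
    "lookup_order": Risk.READ_ONLY,
    "charge_card": Risk.WRITE,
}


def may_execute(tool_name: str, user_confirmed: bool) -> bool:
    """Allow read-only tools freely; require confirmation for everything else."""
    # Unknown tools are treated as write tools, the safer default.
    risk = TOOL_RISK.get(tool_name, Risk.WRITE)
    return risk is Risk.READ_ONLY or user_confirmed
```

Treating unlisted tools as write actions means a newly added tool needs an explicit classification before it can run without confirmation.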

### Recover from tool failures

Tool failures are part of the conversation. A good recovery should explain what happened and give the user a clear next step.

Do not treat every failure the same. Recovery behavior should depend on the tool type, failure mode, and user impact. Some failures should be handled silently with a retry. Others require asking the user to clarify, correct an identifier, confirm a new action, or choose an alternate path.

```text
## Tool Failures

If a tool call fails:

1. Briefly explain what failed in user-friendly language.
2. Do not blame the user or expose raw tool errors.
3. If the failure may be due to an exact identifier, read back the value used and ask the user to correct it.
4. If the failure may be temporary, offer to retry once.
5. If the same failure happens repeatedly, offer an alternate path or escalation.

Do not repeatedly call the same tool with the same arguments after failure.

Do not ask for a different identifier until you have first checked whether the captured value was correct.
```

Bad:

> Assistant: Something went wrong.

Good:

> Assistant: I couldn't find a match for O R D dash 3 1 2 5 B 2 3. Did I get any part of that wrong?
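
The "retry once, then stop repeating the same call" rule can also be enforced in application code as a backstop. This is a minimal sketch under the assumption that your client sees every tool call and its outcome; `RetryGuard` is a hypothetical helper, not an API feature.

```python
import json


class RetryGuard:
    """Blocks repeated calls to the same tool with the same arguments."""

    def __init__(self, max_retries: int = 1) -> None:
        self.max_retries = max_retries
        self._failures: dict[str, int] = {}

    @staticmethod
    def _key(name: str, arguments: dict) -> str:
        # sort_keys makes the key independent of argument ordering.
        return name + ":" + json.dumps(arguments, sort_keys=True)

    def record_failure(self, name: str, arguments: dict) -> None:
        key = self._key(name, arguments)
        self._failures[key] = self._failures.get(key, 0) + 1

    def may_call(self, name: str, arguments: dict) -> bool:
        """Allow the first attempt plus at most max_retries retries."""
        return self._failures.get(self._key(name, arguments), 0) <= self.max_retries
```

Once `may_call` returns `False`, the application can return a tool output that tells the model to offer an alternate path or escalation. A corrected identifier produces a different key, so it is not blocked.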

### Keep tool availability synchronized

Realtime models are eager to help. If the prompt mentions a tool that is not actually available, or if the tool list does not match the prompt, the model may invent a tool name or pretend it completed the action.

For example, if the prompt references `lookup_order`, but the provided tool is named `search_orders`, the model may call the wrong name or simulate the action.

```text
## Tool Availability

Use only the tools that are explicitly provided in the current tool list.

Do not invent, assume, or simulate tools. If a tool is mentioned in the instructions but is not present in the tool list, treat it as unavailable.

If the user requests an action that requires an unavailable tool:

1. Do not pretend to complete the action.
2. Briefly explain that the tool is not available.
3. Offer the closest supported next step.

Only say an action was completed after the relevant tool call succeeds.
```

Use the prompt audit meta prompt in the appendix to review production prompts for contradictions, missing tools, and brittle instructions.
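
A simple automated check can catch prompt and tool-list drift before deployment. The heuristic below assumes your prompt wraps tool names in backticks, which is a convention of this sketch rather than a requirement of the API; any other backticked identifier will also be reported, so treat the result as a review list.

```python
import re


def find_unavailable_tools(prompt: str, tools: list[dict]) -> set[str]:
    """Return backticked identifiers in the prompt that match no tool name."""
    available = {tool["name"] for tool in tools}
    # Matches lowercase identifiers such as `lookup_order` inside backticks.
    mentioned = set(re.findall(r"`([a-z][a-z0-9_]*)`", prompt))
    return mentioned - available
```

For the mismatch described above, a prompt that mentions `lookup_order` checked against a tool list containing only `search_orders` would return `{"lookup_order"}`.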

## Handle silence and background audio

Voice agents tend to respond by default. In production, they often hear audio that should not receive a spoken response, such as silence, background noise, hold music, TV audio, or side conversations.

Use a no-op wait tool when the assistant should stay quiet and keep listening. The tool gives the model a valid non-speaking action instead of making it say things like "I'm here" or "I didn't catch that."

Tool design:

```json
{
  "name": "wait_for_user",
  "description": "Call this when the latest audio does not need a spoken response, such as silence, background noise, hold music, TV audio, side conversation, or speech not addressed to the assistant. This tool helps end the turn without a spoken reply.",
  "parameters": {
    "type": "object",
    "properties": {},
    "required": []
  }
}
```

Pair it with prompt instructions:

```text
## Handling Silence and Background Noise

If the latest audio is silence, background noise, hold music, TV audio, side conversation, or speech not addressed to you, call `wait_for_user`.

Do not respond conversationally after calling this tool.

Do not say "I'm here," "I didn't catch that," "Take your time," or "Let me know when you're ready."

Resume normal responses only when the user clearly addresses you or asks for help.
```

Use this for non-addressed audio, not for unclear user requests. If the user is clearly speaking to the assistant but the content is unintelligible, ask for clarification instead.

## Use message channels deliberately

`gpt-realtime-2` can produce user-visible intermediate messages in the commentary channel and final user-facing responses in the final channel. Use channel-specific instructions when the behavior depends on where it appears.

| Channel | User-visible? | Used for |
| --- | --- | --- |
| `commentary` | Yes | Preambles and tool calls. |
| `final` | Yes | Final user-facing message. |

For example, tool calls happen in the commentary channel. If you want the assistant to say something before, during, or after tool use, specify that behavior in relation to the commentary channel.

```text
Before calling tools in the commentary channel, briefly tell the user what you are doing.
```

`gpt-realtime-2` can emit multiple response phases in a single turn. In API output, this distinction appears in the `response.done` event, where each output item includes a `phase` value that indicates whether the content is commentary or the final answer.

You can use this field to handle each phase differently in your application. For example, commentary can be played or displayed as a short intermediate update, while `final_answer` can be reserved for the assistant's completed response.

```text
response.output[0].phase: "commentary"
response.output[1].phase: "final_answer"
```

Example response phases

User prompt:

> "I'm stuck on this AP Bio question [QUESTION]."

Shortened API response:

```json
{
  "type": "response.done",
  "response": {
    "output": [
      {
        "phase": "commentary",
        "content": [
          {
            "type": "output_audio",
            "transcript": "Let's zero in on the enzyme's shape and binding, since that's the key idea here."
          }
        ]
      },
      {
        "phase": "final_answer",
        "content": [
          {
            "type": "output_audio",
            "transcript": "What changes at the active site at high temperature?"
          }
        ]
      }
    ]
  }
}
```
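
To handle each phase differently, a client can group transcripts by `phase` when `response.done` arrives. The helper below is a small sketch that relies only on the fields visible in the shortened response above; real events contain more fields, and items without a `phase` are grouped under a hypothetical `"unknown"` key.

```python
def transcripts_by_phase(event: dict) -> dict[str, list[str]]:
    """Group output transcripts from a response.done event by phase."""
    phases: dict[str, list[str]] = {}
    if event.get("type") != "response.done":
        return phases
    for item in event.get("response", {}).get("output", []):
        phase = item.get("phase", "unknown")
        for part in item.get("content", []):
            transcript = part.get("transcript")
            if transcript:
                phases.setdefault(phase, []).append(transcript)
    return phases
```

With the example event, the result has one `commentary` entry and one `final_answer` entry, so the application can log or display the intermediate update separately from the completed answer.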

## Handle unclear audio

The model should only act on audio it can understand with confidence. If the audio is unclear, the model should ask a brief clarification question instead of guessing.

Do not let the model infer missing words, call tools, capture entities, generate preambles, or spend hidden reasoning time trying to reconstruct what the user may have said.

```text
## Unclear Audio

- Only respond to clear audio or text.
- If the user's audio is not clear, ask for clarification using a short English phrase such as "Sorry, could you repeat that clearly?"
- Don't repeat the same unclear-audio clarification twice.
- Treat audio as unclear if it is ambiguous, noisy, silent, unintelligible, partially cut off, or if you are unsure of the exact words the user said.
- Do not guess what the user meant from unclear audio.
- Do not reason when the audio is unclear.
- Do not provide a preamble or call tools in the commentary channel when the audio is unclear.
```

Example:

> User audio: "Check order three one-" [cut off]

Bad:

> Assistant: I'll check order 31 now.

Good:

> Assistant: I heard only part of the order number. Could you repeat it digit by digit?

469

470## Capture exact entities

471

472Many realtime workflows depend on exact values: order IDs, tracking numbers, email addresses, confirmation codes, account numbers, claim numbers, ticket IDs, support references, and phone numbers.

473

474Voice makes this hard. Users speak quickly, group numbers in different ways, spell partial values, use filler, correct themselves mid-turn, or pronounce characters that sound alike. One wrong digit can fail a lookup or retrieve the wrong account.

475

476Capture entities conservatively. Collect one value at a time, normalize only what is clear, confirm high-precision values before tool calls, and make every correction recoverable.

477

478### Collect one entity at a time

479

480When a workflow needs multiple values, collect them one at a time. This prevents fields from blending together, especially in voice conversations.

481

482```text

483## Entity Collection Order

484

485Collect required values one at a time.

486

487- Ask for only the next missing value.

488- Do not ask for multiple values in the same turn.

489- Before asking, check whether the value was already provided earlier in the conversation or the session.

490- If a possible value already exists, confirm it with the user before using it.

491

492Example:

493

494"I see tracking number ABC-54321 from earlier. Should I use that one, or do you have a different tracking number?"

495

496Do not call tools until the current value has been collected, validated, and confirmed.

497```

498

499### Handle spelled-out characters

500

501Use this when users spell IDs, codes, names, or email addresses one character at a time. The spoken form is input, not the final value.

502

503```text

504## Spelled-Out Characters

505

506When a user dictates an ID, code, or email character by character, treat the spoken sequence as one compact value. Preserve explicitly spoken separators like dash, dot, underscore, slash, or plus; otherwise do not add spaces or separators.

507

508Examples:

509

510- "A B C one two three" -> "ABC123"

511- "B C dash nine eight seven" -> "BC-987"

512- "J O H N at example dot com" -> "john@example.com"

513

514Do not insert spaces between spelled-out characters unless the user explicitly says the value contains spaces.

515```
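As an application-side illustration of the rule above, here is a minimal Python sketch that joins spelled-out tokens into one compact value. The lookup tables and the `compact_spelled_value` helper are hypothetical names, not part of any API. In a speech-to-speech session the model does this normalization itself, so a helper like this is mainly useful for validating captured values or building test cases.

```python
from typing import Optional

# Hypothetical lookup tables; extend them for your own domain.
DIGITS = {"zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3",
          "four": "4", "five": "5", "six": "6", "seven": "7",
          "eight": "8", "nine": "9"}
SEPARATORS = {"dash": "-", "hyphen": "-", "dot": ".", "underscore": "_",
              "slash": "/", "plus": "+", "at": "@"}


def compact_spelled_value(transcript: str, uppercase: bool = True,
                          allow_words: bool = False) -> Optional[str]:
    """Join spelled-out tokens into one value; return None if anything is unclear."""
    parts = []
    for token in transcript.lower().split():
        if token in DIGITS:
            parts.append(DIGITS[token])
        elif token in SEPARATORS:
            # Keep only separators the user explicitly spoke.
            parts.append(SEPARATORS[token])
        elif token.isalnum() and (len(token) == 1 or allow_words):
            # Single characters always pass; whole words (such as an email
            # domain) pass only when allow_words is set.
            parts.append(token)
        else:
            return None  # unknown token: ask the user instead of guessing
    value = "".join(parts)
    return value.upper() if uppercase else value


print(compact_spelled_value("A B C one two three"))
print(compact_spelled_value("B C dash nine eight seven"))
```

Returning `None` for any unrecognized token mirrors the guidance to ask the user rather than guess.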

516

517### Normalize spoken numbers carefully

518

519For numeric identifiers, users may say digits individually, group them, or use natural number phrases. If the field expects one continuous numeric value, convert clear numeric speech into digits.

520

521```text

522## Spoken Number Handling

523

524Convert spoken numbers into digits when collecting numeric identifiers.

525

526Examples:

527

528- "one two three four" -> "1234"

529- "one twenty three" -> "123"

530- "one nineteen" -> "119"

531- "ninety nine eleven" -> "9911"

532- "nine thousand nine hundred eleven" -> "9911"

533

534If multiple interpretations are plausible, ask the user to clarify before using the value.

535

536Example:

537

538"I heard either 119 or 1-19. Could you repeat the number digit by digit?"

539```
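The conversions listed above can be sketched in Python for testing or server-side validation. This is a rough sketch under stated assumptions: the `spoken_to_digits` and `_cardinal` helpers and the word tables are hypothetical, it covers only English number words, and it does not detect ambiguous groupings, so a clarification step is still needed.

```python
from typing import List, Optional

ONES = {"zero": 0, "oh": 0, "one": 1, "two": 2, "three": 3, "four": 4,
        "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9}
TEENS = {"ten": 10, "eleven": 11, "twelve": 12, "thirteen": 13,
         "fourteen": 14, "fifteen": 15, "sixteen": 16, "seventeen": 17,
         "eighteen": 18, "nineteen": 19}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}


def _cardinal(words: List[str]) -> Optional[str]:
    """Parse a full cardinal phrase such as 'nine thousand nine hundred eleven'."""
    total = current = 0
    for w in words:
        if w in ONES:
            current += ONES[w]
        elif w in TEENS:
            current += TEENS[w]
        elif w in TENS:
            current += TENS[w]
        elif w == "hundred":
            current *= 100
        elif w == "thousand":
            total += current * 1000
            current = 0
        elif w != "and":
            return None
    return str(total + current)


def spoken_to_digits(phrase: str) -> Optional[str]:
    """Convert clear numeric speech into a digit string; None means 'ask again'."""
    words = phrase.lower().replace("-", " ").split()
    if any(w in ("hundred", "thousand") for w in words):
        return _cardinal(words)
    out = []
    i = 0
    while i < len(words):
        w = words[i]
        if w in ONES:
            out.append(str(ONES[w]))
        elif w in TEENS:
            out.append(str(TEENS[w]))
        elif w in TENS:
            nxt = words[i + 1] if i + 1 < len(words) else None
            if nxt in ONES and ONES[nxt] != 0:
                out.append(str(TENS[w] + ONES[nxt]))  # "twenty three" -> "23"
                i += 1
            else:
                out.append(str(TENS[w]))
        else:
            return None  # not a number word: do not guess
        i += 1
    return "".join(out)


print(spoken_to_digits("ninety nine eleven"))
```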

540

541### Confirm exact identifiers before tool calls

542

543Order IDs, tracking numbers, account numbers, claim numbers, confirmation codes, and similar identifiers are high-precision fields. Confirm them before using them in a tool call.

544

545For numeric identifiers, read the value back digit by digit. Reading the value as a full number can hide errors.

546

547Example:

548

549> Assistant: Just to confirm, I heard 8... 3... 5... 2... 1. Is that right?

550

551If the user corrects one character or digit, repeat the full corrected value before calling the tool.

552

553Example:

554

555> Assistant: Got it. I have 8... 3... 5... 7... 1. Is that correct?

556

557```text

558## Exact Identifier Confirmation

559

560Before calling tools with high-precision identifiers:

561

562- Confirm the final normalized value with the user.

563- Read numeric identifiers back digit by digit.

564- Do not use guessed, partial, or ambiguous values.

565- If the user corrects the value, repeat the full corrected value before calling the tool.

566```
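To make the read-back and correction pattern concrete, the following Python sketch formats an identifier digit by digit and applies a single-position correction before reading the full value back again. The function names are hypothetical and the phrasing simply mirrors the examples in this section.

```python
def read_back(value: str, pause: str = "... ") -> str:
    """Format a value one character at a time, e.g. '83521' -> '8... 3... 5... 2... 1'."""
    return pause.join(value)


def apply_correction(value: str, position: int, new_char: str) -> str:
    """Replace the character at a 1-based position and return the full corrected value."""
    if not 1 <= position <= len(value):
        raise ValueError("position is outside the captured value")
    if len(new_char) != 1:
        raise ValueError("a correction replaces exactly one character")
    return value[:position - 1] + new_char + value[position:]


def confirmation_prompt(value: str) -> str:
    """Build the spoken confirmation for the full value."""
    return f"Just to confirm, I heard {read_back(value)}. Is that right?"


captured = "83521"
print(confirmation_prompt(captured))
corrected = apply_correction(captured, 4, "7")  # "the fourth digit is 7"
print(confirmation_prompt(corrected))
```

The correction returns the whole value, not just the changed digit, so the next confirmation always covers the full identifier.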

567

568### Confirm emails character by character

569

570Email addresses are high-precision values. Dots, dashes, underscores, repeated letters, and similar-sounding names can cause account lookup failures or send messages to the wrong address.

571

572Ask the user to spell the email address:

573

574> Assistant: Could you spell the email address character by character so I can make sure I have it exactly right?

575

576When reading it back, confirm the exact final address:

577

578> Assistant: Just to confirm, that is c-h-e-n at example dot com, right?

579

580```text

581## Email Confirmation

582

583Email addresses must be captured exactly.

584

585If the user says the email naturally without spelling it out, ask them to repeat it character by character.

586

587Example:

588

589"Could you spell the email address character by character so I can make sure I have it exactly right?"

590

591When reading an email back, confirm the exact final email address.

592

593Example:

594

595"Just to confirm, that is c-h-e-n at example dot com, right?"

596```
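The read-back format in the example above can be produced mechanically when you need reference strings for tests or evals. This is a small sketch only: `email_read_back` and the `SPOKEN` table are hypothetical, and the choice to spell the local part while speaking the domain naturally follows the example phrasing here, not a fixed rule.

```python
# Hypothetical mapping from punctuation to the word spoken in a read-back.
SPOKEN = {".": "dot", "-": "dash", "_": "underscore", "+": "plus"}


def email_read_back(email: str) -> str:
    """Spell the local part character by character and speak the domain naturally."""
    local, sep, domain = email.partition("@")
    if not sep or not local or not domain:
        raise ValueError("not a complete email address")
    spelled = "-".join(SPOKEN.get(ch, ch) for ch in local)
    spoken_domain = domain.replace(".", " dot ")
    return f"{spelled} at {spoken_domain}"


print(email_read_back("chen@example.com"))
```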

597

598### Entity collection workflow

599


601

602Use this full workflow when a task requires exact values before any tool call.

603

604```text

605## Entity Collection Workflow

606

607When a workflow requires an exact value, collect and confirm it before using it in any tool call.

608

609Exact values include order IDs, tracking numbers, confirmation codes, account numbers, claim numbers, ticket IDs, support references, email addresses, phone numbers, and similar identifiers.

610

611Follow this workflow:

612

6131. Collect the next required value.

614

615- Ask for only one missing value at a time.

616- Do not ask for multiple exact values in the same turn.

617- Before asking, check whether the value was already provided earlier in the conversation or session.

618

6192. Normalize only what is clear.

620

621- Convert clearly spoken digits or spelled-out characters into the expected format.

622- Preserve explicit separators such as dashes, dots, underscores, slashes, and plus signs.

623- Do not guess, infer, repair, or fill in unclear characters.

624- If the value could be interpreted in more than one way, ask the user to repeat or clarify it.

625

6263. Confirm the final value.

627

628- Read back the normalized value before using it.

629- For numeric identifiers, confirm digit by digit.

630- For email addresses, confirm character by character when precision matters.

631- Wait for a clear confirmation from the user.

632

6334. Call the tool only after confirmation.

634

635- Do not call lookup, account, messaging, payment, booking, or update tools with guessed, partial, ambiguous, or unconfirmed values.

636

6375. Recover safely from corrections.

638

639- If the user corrects any part of the value, update the value, repeat the full corrected value, and ask for confirmation again.

640- Do not use the corrected value in a tool call until the user confirms the full final value.

641

642Examples:

643

644User: My order ID is ORD-3125B23.

645

646Assistant: Just to confirm, I heard O-R-D dash 3-1-2-5-B-2-3. Is that right?

647

648User: It is 83521 - actually, the fourth digit is 7.

649

650Assistant: Got it. I have 8... 3... 5... 7... 1. Is that correct?

651

652User: My email is chen@example.com.

653

654Assistant: Could you spell that email address character by character so I can make sure I have it exactly right?

655

656Never call tools with guessed, partial, ambiguous, or unconfirmed exact values.

657```
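If your application also enforces this workflow outside the prompt, the collect, confirm, and correct steps can be modeled as a small state machine. The sketch below is one possible shape under stated assumptions: `EntitySlot`, `SlotState`, `next_missing`, and `can_call_tool` are hypothetical names, not part of the Realtime API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class SlotState(Enum):
    MISSING = "missing"
    AWAITING_CONFIRMATION = "awaiting_confirmation"
    CONFIRMED = "confirmed"


@dataclass
class EntitySlot:
    name: str
    value: Optional[str] = None
    state: SlotState = SlotState.MISSING

    def capture(self, value: str) -> None:
        """Store a new or corrected value; any change resets confirmation."""
        self.value = value
        self.state = SlotState.AWAITING_CONFIRMATION

    def confirm(self) -> None:
        """Mark the value confirmed after a clear yes from the user."""
        if self.state is not SlotState.AWAITING_CONFIRMATION:
            raise ValueError(f"nothing to confirm for {self.name}")
        self.state = SlotState.CONFIRMED


def next_missing(slots: List[EntitySlot]) -> Optional[EntitySlot]:
    """Return the single next slot to work on, so values are collected one at a time."""
    for slot in slots:
        if slot.state is not SlotState.CONFIRMED:
            return slot
    return None


def can_call_tool(slots: List[EntitySlot]) -> bool:
    """Allow the tool call only when every required value is confirmed."""
    return all(slot.state is SlotState.CONFIRMED for slot in slots)


order_id = EntitySlot("order_id")
order_id.capture("83521")
order_id.capture("83571")  # user corrected the fourth digit
print(can_call_tool([order_id]))  # still unconfirmed
order_id.confirm()
print(can_call_tool([order_id]))
```

Because `capture` always resets the state, a corrected value can never reach a tool call without a fresh confirmation.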

658

659## Avoid literal instruction traps

660

661`gpt-realtime-2` follows instructions more literally than earlier realtime models. Prompts that worked well on older models may need tuning.

662

663Use precise language. The model may prioritize the exact wording of an instruction over the broader behavior you intended. Broad or rigid rules can dominate the assistant's behavior in surprising ways, especially when multiple rules overlap.

664

665Be careful with constraint words such as `must`, `only`, `never`, and `always`. Use them when the behavior is truly required, not as general emphasis. Overusing hard constraints can make the assistant rigid, overly cautious, or unable to handle reasonable exceptions.

666

667Prefer precise scope:

668

669```text

670For write actions that modify user data, ask for confirmation before calling the tool.

671```

672

673Avoid broad scope:

674

675```text

676Always ask for confirmation before doing anything.

677```

678

679The broad version may cause unnecessary confirmations before harmless read-only lookups, such as checking order status, retrieving availability, or reading account information.

680

681### Literal interpretation example

682


684

685This prompt is too narrow:

686

687```text

688When a confirmation code is provided, repeat it verbatim and wait for a clear yes.

689```

690

691User message:

692

693> My order ID is ORD-3125B23.

694

695Possible failure:

696

697The model may not apply the rule because the user provided an order ID, not a confirmation code. The intended behavior is clear to the developer, but the instruction's scope is too narrow.

698

699Safer rewrite:

700

701```text

702When the user provides an exact identifier, including confirmation codes, order IDs, ticket IDs, reset PINs, claim numbers, tracking numbers, or account numbers, repeat the captured value and wait for confirmation before using it in a tool call.

703```

704

705General prompting recommendations:

706

707- Prefer explicit instructions over implied intent.

708- Avoid constraint words unless the behavior truly must be rigid.

709- Minimize contradictory guidance.

710- Be cautious with layered or competing priority instructions.

711- Test prompts incrementally. Small wording changes can have large behavioral effects.

712- When migrating from earlier realtime models, expect some prompts to require restructuring for best results.

713

714## Control language and accent separately

715

716Language and accent should be controlled separately.

717

718A user's accent is not the same as their intended language. A user may speak English with a Hindi, Spanish, French, or Mandarin accent and still expect English responses.

719

720Avoid broad language instructions such as:

721

722```text

723Mirror the user.

724Respond naturally in the user's language.

725Switch languages when appropriate.

726Sound local.

727Adapt to the user's accent.

728```

729

730These are too broad. The model may interpret accent, filler words, backchannels, or isolated foreign words as a reason to switch languages.

731

732### English language policy

733

734```text

735## Language

736

737English is the default response language.

738

739- Do not infer language from accent alone.

740- Ignore short filler sounds, backchannels, and isolated foreign words for language detection.

741- Only switch languages if the user explicitly asks or provides a substantive utterance in another language.

742- If language confidence is low, ask a short clarification instead of guessing.

743- Keep preambles, spoken bridges, tool-related messages, and final answers in the same language.

744- Accent adaptation must not change the response language.

745```

746

747### Multilingual policy

748

749```text

750## Language

751

752Default to English unless the user clearly uses another language.

753

754Switch languages only when:

755

756- the user explicitly asks to use another language;

757- the user provides a substantive utterance in another language. A substantive utterance means the user gives a complete request, question, or correction in another language, not just a greeting, name, address, filler word, or borrowed phrase.

758

759Do not switch languages based on:

760

761- accent;

762- pronunciation;

763- filler words;

764- short backchannels;

765- names;

766- addresses;

767- isolated foreign words.

768

769If uncertain, ask:

770

771"Would you like me to continue in English or [LANGUAGE]?"

772```

773

774### Accent control

775

776`gpt-realtime-2` can follow accent instructions more strongly, but vague accent prompts can cause drift or unintended language switching.

777

778Accent-control prompts work best when they specify:

779

780- the target accent;

781- which characteristics should remain stable;

782- the intended pacing, stress, and prosody;

783- whether accent adaptation should affect language choice.

784

785Instead of:

786

787```text

788Sound Australian.

789```

790

791Use:

792

793```text

794## Accent

795

796Speak English with a light Australian accent.

797

798- Keep the accent stable from the first word to the last.

799- Use natural Australian vowel shaping, but keep speech easy to understand.

800- Do not exaggerate the accent.

801- Do not change response language based on the user's accent.

802```

803

804### Custom voices

805

806Use [Custom Voices](https://developers.openai.com/blog/updates-audio-models#custom-voices) when standard voices cannot reliably meet brand, accent, or character requirements.

807

808Prompting can steer accent, pacing, and delivery, but it cannot fully replace voice design. For use cases that require consistent branded voice identity or accent fidelity, consider [Custom Voices](https://developers.openai.com/blog/updates-audio-models#custom-voices).

809

810Custom Voices are available only to approved customers. Contact your account team for access.

811

812## Maintain state in long sessions

813

814`gpt-realtime-2` expands the realtime context window from 32k to 128k tokens, making it better suited for long sessions. For dense two-way conversations, 128k tokens is best thought of as roughly 1-2 hours of dense raw audio context. This will vary depending on tool use, internal reasoning, injected records, and other session details.

815

816For long-context use cases, `gpt-realtime-2` performs best when it can tell what information is current, what is background, and what should be ignored if sources conflict. Do not rely on the model to infer source priority from a raw transcript or large context dump. Use structure.

817

818Use a structured pattern when starting a session with a large amount of context, such as retrieved records, prior conversation history, policies, summaries, account notes, or background documents.

819

820Example long-session context template

821

822```text

823## Context

824

825### Current State

826

827- **Current task:** [current task]

828- **Latest known state:** [current value]

829- **Next safe step:** [what the assistant should do next]

830

831### Authoritative Sources

832

833- **Fact or record:** [fact or record]

834- **Source:** [tool result / active policy / verified record]

835- **Status:** current

836- **Retrieved:** [date/time or this turn]

837

838### Historical or Background Sources

839

840- **Older fact or record:** [older fact or record]

841- **Source:** [prior conversation / older record / summary]

842- **Status:** stale or background

843- **Note:** Do not use for current decisions if it conflicts with a current source.

844

845### Relevant Policy or Rules

846

847- [decision rule or constraint]

848

849### Other Context

850

851- [potentially useful but non-authoritative background]

852```
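When session context is assembled programmatically, a template like the one above can be filled from structured records so that current and stale sources never mix. The following Python sketch shows one way to do that. `SourceRecord` and `render_context` are hypothetical names, and the sketch omits the "Other Context" section for brevity.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class SourceRecord:
    fact: str
    source: str
    current: bool  # True for authoritative, False for stale or background
    retrieved: str = "this turn"


def render_context(task: str, latest_state: str, next_step: str,
                   records: Sequence[SourceRecord],
                   rules: Sequence[str] = ()) -> str:
    """Render the context block with current and historical sources separated."""
    lines: List[str] = [
        "## Context", "",
        "### Current State", "",
        f"- **Current task:** {task}",
        f"- **Latest known state:** {latest_state}",
        f"- **Next safe step:** {next_step}", "",
        "### Authoritative Sources", "",
    ]
    for r in records:
        if r.current:
            lines += [f"- **Fact or record:** {r.fact}",
                      f"- **Source:** {r.source}",
                      "- **Status:** current",
                      f"- **Retrieved:** {r.retrieved}"]
    lines += ["", "### Historical or Background Sources", ""]
    for r in records:
        if not r.current:
            lines += [f"- **Older fact or record:** {r.fact}",
                      f"- **Source:** {r.source}",
                      "- **Status:** stale or background",
                      "- **Note:** Do not use for current decisions "
                      "if it conflicts with a current source."]
    lines += ["", "### Relevant Policy or Rules", ""]
    lines += [f"- {rule}" for rule in rules]
    return "\n".join(lines)


records = [
    SourceRecord("Order shipped on May 2", "tool result", current=True),
    SourceRecord("Order is processing", "prior conversation", current=False),
]
print(render_context("Track order", "shipped", "share tracking link", records,
                     ["Confirm identifiers before lookups."]))
```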

853

854## Migrate from earlier realtime models

855

856When migrating from earlier realtime models, treat the prompt as a behavior surface, not just text to port.

857

8581. Use Codex or a strong reasoning model to restructure the prompt around the latest Realtime prompting guidance. Include a link to this prompting guide to ground the migration in best practices.

8592. Set reasoning effort to `low` instead of the default. Increase only for workflows that require deeper planning.

8603. Audit tool names, parameters, enums, JSON schemas, and other settings to make sure they match the expected implementation.

8614. Remove stale examples. Add short examples for happy paths, ambiguity, interruptions, tool calls, and fallback behavior.

8625. Compare representative conversations before and after migration. Check for regressions against an existing eval and document intentional behavior changes.

8636. Run a final consistency pass. Confirm the prompt clearly separates hard requirements, defaults, tool rules, safety rules, and fallback behavior.

8647. Run evals, inspect representative failures, and iterate on the prompt until the target behaviors are reliable.

865

866 </div>

867 <div data-content-switcher-pane data-value="gpt-realtime-1.5" hidden>

868

869## Realtime 1.5 Prompting Guide

870

871`gpt-realtime-1.5` is a speech-to-speech model in the Realtime API. The same `gpt-realtime` prompting guidance applies to this model.

872

873Speech-to-speech systems are essential for enabling voice as a core AI interface. `gpt-realtime-1.5` supports robust, usable realtime voice agents that can handle mission-critical workflows at scale.

874

875Compared with earlier realtime preview models, `gpt-realtime-1.5` delivers stronger instruction following, more reliable tool calling, better voice quality, and an overall smoother feel. These gains make it practical to move from chained approaches to true realtime experiences, cutting latency and producing responses that sound more natural and expressive.

876

877Realtime models benefit from prompting techniques that wouldn't directly apply to text-based models. This prompting guide starts with a suggested prompt skeleton, then walks through each part with practical tips, small patterns you can copy, and examples you can adapt to your use case.

878

879## General Tips

880

881- **Iterate relentlessly**: Small wording changes can make or break behavior.

882 - Example: In an instruction about unclear audio, we swapped “inaudible” for “unintelligible”, which improved handling of noisy input.

883- **Prefer bullets over paragraphs**: Clear, short bullets outperform long paragraphs.

884- **Guide with examples**: The model closely follows sample phrases.

885- **Be precise**: Ambiguous or conflicting instructions degrade performance, as they do with GPT-5.

886- **Control language**: Pin output to a target language if you see unwanted language switching.

887- **Reduce repetition**: Add a Variety rule to reduce robotic phrasing.

888- **Use capitalized text for emphasis**: Capitalizing key rules makes them stand out and easier for the model to follow.

889- **Convert non-text rules to text**: Instead of writing "IF x > 3 THEN ESCALATE", write "IF MORE THAN THREE FAILURES THEN ESCALATE".

890

891## Prompt Structure

892

893Organizing your prompt makes it easier for the model to understand context and stay consistent across turns. It also makes it easier for you to iterate and modify problematic sections.

894

895- **What it does**: Use clear, labeled sections in your system prompt so the model can find and follow them. Keep each section focused on one thing.

896- **How to adapt**: Add domain-specific sections (e.g., Compliance, Brand Policy). Remove sections you don’t need (e.g., Reference Pronunciations if not struggling with pronunciation).

897

898Example

899

900```

901# Role & Objective — who you are and what “success” means

902# Personality & Tone — the voice and style to maintain

903# Context — retrieved context, relevant info

904# Reference Pronunciations — phonetic guides for tricky words

905# Tools — names, usage rules, and preambles

906# Instructions / Rules — do’s, don’ts, and approach

907# Conversation Flow — states, goals, and transitions

908# Safety & Escalation — fallback and handoff logic

909```
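If you keep each section in its own file or variable, the skeleton above can be assembled programmatically, which makes it easier to iterate on one section at a time. The sketch below is illustrative only: `SECTION_ORDER` copies the headings from the example, and `build_prompt` is a hypothetical helper that skips empty sections and appends any domain-specific ones at the end.

```python
from typing import Dict, List

# Section titles taken from the example skeleton, in the same order.
SECTION_ORDER: List[str] = [
    "Role & Objective", "Personality & Tone", "Context",
    "Reference Pronunciations", "Tools", "Instructions / Rules",
    "Conversation Flow", "Safety & Escalation",
]


def build_prompt(sections: Dict[str, str]) -> str:
    """Assemble labeled sections in a stable order, dropping empty ones."""
    extras = [title for title in sections if title not in SECTION_ORDER]
    blocks = []
    for title in SECTION_ORDER + extras:
        body = sections.get(title, "").strip()
        if body:
            blocks.append(f"# {title}\n{body}")
    return "\n\n".join(blocks)


print(build_prompt({
    "Tools": "- Use lookup_order for order status.",
    "Role & Objective": "You are a support agent.",
    "Compliance": "- Read the disclosure.",
}))
```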

910

911## Role and Objective

912

913This section defines who the agent is and what “done” means. The examples show two different identities to demonstrate how tightly the model will adhere to role and objective when they’re explicit.

914

915- **When to use**: The model is not taking on the persona, role, or task scope you need.

916- **What it does**: Pins the voice agent’s identity so that its responses are conditioned on that role description.

917- **How to adapt**: Modify the role to fit your use case.

918

919#### Example (model takes on a specific accent)

920

921```

922# Role & Objective

923You are a Quebecois French-speaking customer service bot. Your task is to answer the user's question.

924```

925

926Earlier realtime preview:

927

928<div className="my-6">

929 </div>

930

931`gpt-realtime-1.5`:

932

933<div className="my-6">

934 </div>

935

936#### Example (model takes on a character)

937

938```

939# Role & Objective

940You are a high-energy game-show host guiding the caller to guess a secret number from 1 to 100 to win $1,000,000.

941```

942

943Earlier realtime preview:

944

945<div className="my-6">

946 </div>

947

948`gpt-realtime-1.5`:

949

950<div className="my-6">

951 </div>

952

953`gpt-realtime-1.5` is able to enact the specified role more reliably than earlier realtime preview models.

954

955## Personality and Tone

956

957`gpt-realtime-1.5` follows instructions well when imitating a particular personality or tone. You can tailor the voice experience and delivery depending on what your use case expects.

958

959- **When to use**: Responses feel flat, overly verbose, or inconsistent across turns.

960- **What it does**: Sets voice, brevity, and pacing so replies sound natural and consistent.

961- **How to adapt**: Tune warmth/formality and default length. For regulated domains, favor neutral precision. Add other subsections that are relevant to your use case.

962

963#### Example

964

965```

966# Personality & Tone

967## Personality

968- Friendly, calm and approachable expert customer service assistant.

969

970## Tone

971- Warm, concise, confident, never fawning.

972

973## Length

974- 2–3 sentences per turn.

975```

976

977#### Example (multi-emotion)

978

979```

980# Personality & Tone

981- Start your response very happy

982- Midway, change to sad

983- At the end change your mood to very angry

984```

985

986`gpt-realtime-1.5`:

987

988<div className="my-6">

989 </div>

990

991The model is able to adhere to the complex instructions and switch between three emotions throughout the audio response.

992

993### Speed Instructions

994

995In the Realtime API, the `speed` parameter changes playback rate, not how the model composes speech. To actually sound faster, add instructions that can guide the pacing.

996

997- **When to use**: Users want a faster speaking voice; changing the playback rate with the `speed` parameter alone doesn’t change speaking style.

998- **What it does**: Tunes speaking style (brevity, cadence) independent of client playback speed.

999- **How to adapt**: Modify speed instruction to meet use case requirements.

1000

1001#### Example

1002

1003```

1004# Personality & Tone

1005## Personality

1006- Friendly, calm and approachable expert customer service assistant.

1007

1008## Tone

1009- Warm, concise, confident, never fawning.

1010

1011## Length

1012- 2–3 sentences per turn.

1013

1014## Pacing

1015- Deliver your audio response fast, but do not sound rushed.

1016- Do not modify the content of your response, only increase speaking speed for the same response.

1017```

1018

1019Earlier realtime preview:

1020

1021<div className="my-6">

1022 </div>

1023

1024`gpt-realtime-1.5`:

1025

1026<div className="my-6">

1027 </div>

1028

1029With explicit pacing instructions, `gpt-realtime-1.5` can produce a noticeably faster pace without sounding too hurried.

1030

1031### Language Constraint

1032

1033Language constraints ensure the model consistently responds in the intended language, even in challenging conditions like background noise or multilingual inputs.

1034

1035- **When to use**: To prevent accidental language switching in multilingual or noisy environments.

1036- **What it does**: Locks output to the chosen language to prevent accidental language changes.

1037- **How to adapt**: Switch “English” to your target language; or add more complex instructions based on your use case.

1038

1039#### Example (pinning to one language)

1040

1041```

1042# Personality & Tone

1043## Personality

1044- Friendly, calm and approachable expert customer service assistant.

1045

1046## Tone

1047- Warm, concise, confident, never fawning.

1048

1049## Length

1050- 2–3 sentences per turn.

1051

1052## Language

1053- The conversation will be only in English.

1054- Do not respond in any other language even if the user asks.

1055- If the user speaks another language, politely explain that support is limited to English.

1056```

1057

1058These are the responses after applying the instruction using `gpt-realtime-1.5`.

1059

1060![lang constraint en](https://developers.openai.com/cookbook/assets/images/lang_constraint_en.png)

1061

1062#### Example (model teaches a language)

1063

1064```

1065# Role & Objective

1066- You are a friendly, knowledgeable voice tutor for French learners.

1067- Your goal is to help the user improve their French speaking and listening skills through engaging conversation and clear explanations.

1068- Balance immersive French practice with supportive English guidance to ensure understanding and progress.

1069

1070# Personality & Tone

1071## Personality

1072- Friendly, calm and approachable expert customer service assistant.

1073

1074## Tone

1075- Warm, concise, confident, never fawning.

1076

1077## Length

1078- 2–3 sentences per turn.

1079

1080## Language

1081### Explanations

1082Use English when explaining grammar, vocabulary, or cultural context.

1083

1084### Conversation

1085Speak in French when conducting practice, giving examples, or engaging in dialogue.

1086```

1087

1088These are the responses after applying the instruction using `gpt-realtime-1.5`.

1089

1090![multi language](https://developers.openai.com/cookbook/assets/images/multi-language.png)

1091

1092The model is able to code-switch from one language to another based on custom instructions.

1093

1094### Reduce Repetition

1095

1096The realtime model can follow sample phrases closely to stay on-brand, but it may overuse them, making responses sound robotic or repetitive. Adding a repetition rule helps maintain variety while preserving clarity and brand voice.

1097

1098- **When to use**: Outputs recycle the same openings, fillers, or sentence patterns across turns or sessions.

1099- **What it does**: Adds a variety constraint—discourages repeated phrases, nudges synonyms and alternate sentence structures, and keeps required terms intact.

1100- **How to adapt**: Tune strictness (e.g., “don’t reuse the same opener more than once every N turns”), whitelist must-keep phrases (legal/compliance/brand), and allow tighter phrasing where consistency matters.

1101

1102#### Example

1103

1104```

1105# Personality & Tone

1106## Personality

1107- Friendly, calm and approachable expert customer service assistant.

1108

1109## Tone

1110- Warm, concise, confident, never fawning.

1111

1112## Length

1113- 2–3 sentences per turn.

1114

1115## Language

1116- The conversation will be only in English.

1117- Do not respond in any other language even if the user asks.

1118- If the user speaks another language, politely explain that support is limited to English.

1119

1120## Variety

1121- Do not repeat the same sentence twice.

1122- Vary your responses so they don't sound robotic.

1123```

1124

1125These are the responses **before** applying the instruction using `gpt-realtime-1.5`. The model repeats the same confirmation: `Got it`.

1126

1127![repeat before](https://developers.openai.com/cookbook/assets/images/repeat_before.png)

1128

1129These are the responses **after** applying the instruction using `gpt-realtime-1.5`.

1130

1131![repeat after](https://developers.openai.com/cookbook/assets/images/repeat_after.png)

1132

1133Now the model varies its responses and confirmations, so it no longer sounds robotic.
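To check whether a Variety rule is working, it can help to count repeated openers across assistant turns in a transcript. The following Python sketch is a rough eval helper under stated assumptions: `repeated_openers` is a hypothetical function, and comparing only the first two words is an arbitrary starting point you would tune.

```python
import re
from collections import Counter
from typing import Dict, Iterable


def repeated_openers(turns: Iterable[str], n_words: int = 2,
                     min_count: int = 2) -> Dict[str, int]:
    """Count how often assistant turns start with the same first n words."""
    openers: Counter = Counter()
    for turn in turns:
        words = re.findall(r"[a-z']+", turn.lower())
        if words:
            openers[" ".join(words[:n_words])] += 1
    # Keep only openers that occur at least min_count times.
    return {opener: count for opener, count in openers.items()
            if count >= min_count}


turns = [
    "Got it. Your order shipped.",
    "Got it, I'll check that.",
    "Sure, one moment.",
    "Got it! Anything else?",
]
print(repeated_openers(turns))
```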

1134

1135## Reference Pronunciations

1136

1137This section covers how to ensure the model pronounces important words, numbers, names, and terms correctly during spoken interactions.

1138

1139- **When to use**: Brand names, technical terms, or locations are often mispronounced.

1140- **What it does**: Improves trust and clarity with phonetic hints.

1141- **How to adapt**: Keep to a short list; update as you hear errors.

1142

1143#### Example

1144

1145```

1146# Reference Pronunciations

1147When voicing these words, use the respective pronunciations:

1148- Pronounce “SQL” as “sequel.”

1149- Pronounce “PostgreSQL” as “post-gress.”

1150- Pronounce “Kyiv” as “KEE-iv.”

1151- Pronounce “Huawei” as “HWAH-way.”

1152```

1153

1154Earlier realtime preview:

1155

1156<div className="my-6">

1157 </div>

1158

1159`gpt-realtime-1.5`:

1160

1161<div className="my-6">

1162 </div>

1163

1164With the reference pronunciation instructions, `gpt-realtime-1.5` can correctly pronounce SQL as "sequel."

1165

1166### Alphanumeric Pronunciations

1167

1168Realtime speech-to-speech (S2S) models can blur or merge digits and letters when reading back key information (phone numbers, credit card numbers, order IDs). Explicit character-by-character confirmation reduces mishearing and drives clearer synthesis.

1169

1170- **When to use**: If the model struggles to capture or read back phone numbers, card numbers, 2FA codes, order IDs, serials, addresses, unit numbers, or mixed alphanumeric strings.

1171- **What it does**: Forces the model to speak one character at a time with separators, then confirm with the user and reconfirm after corrections. Optionally uses a phonetic disambiguator for letters (e.g., “A as in Alpha”).

1172

1173#### Example (general instruction section)

1174

1175```

1176# Instructions/Rules

1177- When reading numbers or codes, speak each character separately, separated by hyphens (e.g., 4-1-5).

1178- Repeat EXACTLY the provided number; do not omit any digits.

1179```

6 1180

~~7Our most advanced speech-to-speech model is [gpt-realtime](https://developers.openai.com/api/docs/models/gpt-realtime).~~1181_Tip: If you are following a conversation flow prompting strategy, you can specify which conversation state should apply the alphanumeric pronunciation instruction._

1182

1183#### Example (instruction in conversation state)

1184

1185_(taken from the conversation flow of the prompt of our [openai-realtime-agents](https://github.com/openai/openai-realtime-agents/blob/main/src/app/agentConfigs/customerServiceRetail/authentication.ts))_

1186

1187```txt

1188{

1189 "id": "3_get_and_verify_phone",

1190 "description": "Request phone number and verify by repeating it back.",

1191 "instructions": [

1192 "Politely request the user’s phone number.",

1193 "Once provided, confirm it by repeating each digit and ask if it’s correct.",

1194 "If the user corrects you, confirm AGAIN to make sure you understand."

1195 ],

1196 "examples": [

1197 "I'll need some more information to access your account if that's okay. May I have your phone number, please?",

1198 "You said 0-2-1-5-5-5-1-2-3-4, correct?",

1199 "You said 4-5-6-7-8-9-0-1-2-3, correct?"

1200 ],

1201 "transitions": [{

1202 "next_step": "4_authentication_DOB",

1203 "condition": "Once phone number is confirmed"

1204 }]

1205}

1206```

8 1207

9This model shows improvements in following complex instructions, calling tools, and producing speech that sounds natural and expressive. For more information, see the [announcement blog post](https://openai.com/index/introducing-gpt-realtime/).1208These are the responses **before** applying the instruction using `gpt-realtime-1.5`.

10 1209

~~11## Update your session to use a prompt~~1210> Sure! The number is 55119765423. Let me know if you need anything else!

12 1211

13After you initiate a session over [WebRTC](https://developers.openai.com/api/docs/guides/realtime-webrtc), [WebSocket](https://developers.openai.com/api/docs/guides/realtime-websocket), or [SIP](https://developers.openai.com/api/docs/guides/realtime-sip), the client and model are connected. The server will send a [session.created](https://developers.openai.com/api/docs/api-reference/realtime-server-events/session/created) event to confirm. Now it's a matter of prompting.1212These are the responses **after** applying the instruction using `gpt-realtime-1.5`.

14 1213

~~15### Basic prompt update~~1214> Sure! The number is: 5-5-1-1-1-9-7-6-5-4-2-3. Please let me know if you need anything else!

16 1215

~~171. Create a basic audio prompt in [the dashboard](https://platform.openai.com/audio/realtime).~~1216## Instructions

18 1217

~~19 If you don't know where to start, experiment with the prompt fields until you find something interesting. You can always manage, iterate on, and version your prompts later.~~1218This section covers prompt guidance for instructing your model to solve your task, apply best practices, and fix possible problems.

20 1219

~~211. Update your realtime session to use the prompt you created. Provide its prompt ID in a `session.update` client event:~~1220For best results, we recommend prompting patterns similar to those in the [GPT-4.1 prompting guide](https://developers.openai.com/cookbook/examples/gpt4-1_prompting_guide).

22 1221

~~23Update the system instructions used by the model in this session~~1222### Instruction Following

24 1223

~~25```javascript~~1224Like GPT-4.1 and GPT-5, `gpt-realtime-1.5` performs worse when instructions are conflicting, ambiguous, or unclear.

~~26const event = {~~1225

~~27 type: "session.update",~~1226- **When to use**: Outputs drift from rules, skip phases, or misuse tools.

~~28 session: {~~1227- **What it does**: Uses an LLM to point out ambiguity, conflicts, and missing definitions before you ship.

~~29 type: "realtime",~~1228

~~30 model: "gpt-realtime",~~1229#### **Instructions Quality Prompt (can be used in ChatGPT or with API)**

~~31 // Lock the output to audio (set to ["text"] if you want text without audio)~~1230

~~32 output_modalities: ["audio"],~~1231Use the following prompt with GPT-5 to identify problematic areas in your prompt that you can fix.

~~33 audio: {~~1232

~~34 input: {~~1233```

~~35 format: {~~1234## Role & Objective

~~36 type: "audio/pcm",~~1235You are a **Prompt-Critique Expert**.

~~37 rate: 24000,~~1236Examine a user-supplied LLM prompt and surface any weaknesses following the instructions below.

~~38 },~~1237

~~39 turn_detection: {~~1238

~~40 type: "semantic_vad"~~1239## Instructions

~~41 }~~1240Review the prompt that is meant for an LLM to follow and identify the following issues:

~~42 },~~1241- Ambiguity: Could any wording be interpreted in more than one way?

~~43 output: {~~1242- Lacking Definitions: Are there any class labels, terms, or concepts that are not defined that might be misinterpreted by an LLM?

~~44 format: {~~1243- Conflicting, missing, or vague instructions: Are directions incomplete or contradictory?

~~45 type: "audio/pcm",~~1244- Unstated assumptions: Does the prompt assume the model has to be able to do something that is not explicitly stated?

~~46 },~~1245

~~47 voice: "marin",~~1246

~~48 }~~1247## Do **NOT** list issues of the following types:

1248- Invent new instructions, tool calls, or external information. You do not know what tools need to be added that are missing.

1249- Issues that you are unsure about.

1250

1251

1252## Output Format

1253"""

1254# Issues

1255- Numbered list; include brief quote snippets.

1256

1257# Improvements

1258- Numbered list; provide the revised lines you would change and how you would change them.

1259

1260# Revised Prompt

1261- Revised prompt where you have applied all your improvements surgically with minimal edits to the original prompt

1262"""

1263```

1264

1265#### **Prompt Optimization Meta Prompt (can be used in ChatGPT or with API)**

1266

1267This meta-prompt helps you improve your base system prompt by targeting a specific failure mode. Provide the current prompt and describe the issue you’re seeing; the model (GPT-5) will then suggest refined variants that tighten constraints and reduce the problem.

1268

1269```

1270Here's my current prompt to an LLM:

1271[BEGIN OF CURRENT PROMPT]

1272{CURRENT_PROMPT}

1273[END OF CURRENT PROMPT]

1274

1275But I see this issue happening from the LLM:

1276[BEGIN OF ISSUE]

1277{ISSUE}

1278[END OF ISSUE]

1279Can you provide some variants of the prompt so that the model can better understand the constraints to alleviate the issue?

1280```
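The placeholders in the meta-prompt above can be filled programmatically. The following is a minimal sketch: the helper name `build_meta_prompt` is hypothetical, and sending the resulting string to a model is left to whichever client you already use.

```python
import re

META_PROMPT_TEMPLATE = """Here's my current prompt to an LLM:
[BEGIN OF CURRENT PROMPT]
{CURRENT_PROMPT}
[END OF CURRENT PROMPT]

But I see this issue happening from the LLM:
[BEGIN OF ISSUE]
{ISSUE}
[END OF ISSUE]
Can you provide some variants of the prompt so that the model can better understand the constraints to alleviate the issue?"""


def build_meta_prompt(current_prompt, issue):
    """Fill the two placeholders (hypothetical helper, not part of any SDK)."""
    values = {"CURRENT_PROMPT": current_prompt.strip(), "ISSUE": issue.strip()}
    # A single-pass substitution leaves any braces inside the inserted text
    # untouched, e.g. a {preferred_language} variable in the prompt under review.
    return re.sub(
        r"\{(CURRENT_PROMPT|ISSUE)\}",
        lambda match: values[match.group(1)],
        META_PROMPT_TEMPLATE,
    )
```

A single-pass substitution is used instead of `str.format` because prompts under review often contain their own `{variables}`, which `str.format` would try to resolve.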

1281

1282### No Audio or Unclear Audio

1283

1284Sometimes the model thinks it hears something and tries to respond. You can add a custom instruction telling the model how to behave when it hears unclear audio or user input. Modify the desired behavior to fit your use case. For example, you may want the model to repeat the same question instead of asking for clarification.

1285

1286- **When to use**: Background noise, partial words, or silence trigger unwanted replies.

1287- **What it does**: Stops spurious responses and creates graceful clarification.

1288- **How to adapt**: Choose whether to ask for clarification or repeat the last question depending on use case.

1289

1290#### Example (coughing and unclear audio)

1291

1292```

1293# Instructions/Rules

1294...

1295

1296

1297## Unclear audio

1298- Always respond in the same language the user is speaking in, if intelligible.

1299- Only respond to clear audio or text.

1300- If the user's audio is not clear (e.g. ambiguous input/background noise/silent/unintelligible) or if you did not fully hear or understand the user, ask for clarification using {preferred_language} phrases.

1301```

1302

1303These are the responses **after** applying the instruction using `gpt-realtime-1.5`.

1304


1307

1308In this example, the model asks for clarification after my _(very)_ loud cough and unclear audio.

1309

1310### Background Music or Sounds

1311

1312Occasionally, the model may generate unintended background music, humming, rhythmic noises, or sound-like artifacts during speech generation. These artifacts can diminish clarity, distract users, or make the assistant feel less professional. The following instructions help prevent or significantly reduce these occurrences.

1313

1314- **When to use**: Use when you observe unintended musical elements or sound effects in Realtime audio responses.

1315- **What it does**: Steers the model to avoid generating these unwanted audio artifacts.

1316- **How to adapt**: Adjust the instruction to try to explicitly suppress the specific sound patterns you are encountering.

1317

1318#### Example

1319

1320```

1321# Instructions/Rules

1322...

1323- Do not include any sound effects or onomatopoeic expressions in your responses.

1324```

1325

1326## Tools

1327

1328Use this section to tell the model how to use your functions and tools. Spell out when and when not to call a tool, which arguments to collect, what to say while a call is running, and how to handle errors or partial results.

1329

1330### Tool Selection

1331

1332`gpt-realtime-1.5` follows instructions closely. However, if you have instructions that conflict with what the model can access, such as mentioning tools in your prompt that are NOT passed in the tools list, it can lead to bad responses.

1333

1334- **When to use**: Prompts mention tools that aren’t actually available.

1335- **What it does**: Reviews the available tools and system prompt to ensure they align.

1336

1337#### Example

1338

1339```

1340# Tools

1341## lookup_account(email_or_phone)

1342...

1343

1344

1345## check_outage(address)

1346...

1347```

1348

1349We need to ensure the same tools are available and **the descriptions do not contradict each other**:

1350

1351```json

1352[

1353{

1354 "name": "lookup_account",

1355 "description": "Retrieve a customer account using either an email or phone number to enable verification and account-specific actions.",

1356 "parameters": {

1357 ...

49 },1358 },

~~50 // Use a server-stored prompt by ID. Optionally pin a version and pass variables.~~1359{

~~51 prompt: {~~1360 "name": "check_outage",

~~52 id: "pmpt_123", // your stored prompt ID~~1361 "description": "Check for network outages affecting a given service address and return status and ETA if applicable.",

~~53 version: "89", // optional: pin a specific version~~1362 "parameters": {

~~54 variables: {~~1363 ...

~~55 city: "Paris" // example variable used by your prompt~~

56 }1364 }

~~57 },~~1365]

~~58 // You can still set direct session fields; these override prompt fields if they overlap:~~1366```
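This alignment check can be automated before each deploy. The sketch below assumes the prompt documents each tool under a Markdown heading of the form `## tool_name(args)`, as in the example above; `find_tool_mismatches` is a hypothetical helper, not an API feature.

```python
import re


def find_tool_mismatches(prompt, tools):
    """Compare tool names mentioned in prompt headings with the tools list."""
    # Assumption: tools appear in the prompt as headings like
    # "## lookup_account(email_or_phone)".
    prompt_tools = set(
        re.findall(r"^#+[ \t]*([A-Za-z_]\w*)[ \t]*\(", prompt, flags=re.MULTILINE)
    )
    listed_tools = {tool["name"] for tool in tools}
    return {
        "missing_from_tools_list": prompt_tools - listed_tools,
        "missing_from_prompt": listed_tools - prompt_tools,
    }


example_prompt = """# Tools
## lookup_account(email_or_phone)
...

## check_outage(address)
...
"""
example_tools = [{"name": "lookup_account"}, {"name": "check_outage"}]
mismatches = find_tool_mismatches(example_prompt, example_tools)
```

Both sets in `mismatches` are empty when the prompt and the tools list agree. The check only covers names; contradictions between descriptions still need a manual or LLM-assisted review.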

~~59 instructions: "Speak clearly and briefly. Confirm understanding before taking actions."~~1367

~~60 },~~1368### Tool Call Preambles

~~61};~~1369

1370Some use cases could benefit from the Realtime model providing an audio response at the same time as calling a tool. This leads to a better user experience, masking latency. You can modify the sample phrase to fit your use case.

62 1371

~~63// WebRTC data channel and WebSocket both have .send()~~1372- **When to use**: Users need immediate confirmation at the same time as a tool call; helps mask latency.

~~64dataChannel.send(JSON.stringify(event));~~1373- **What it does**: Adds a short, consistent preamble before a tool call.

1374

1375#### Example

1376

1377```

1378# Tools

1379- Before any tool call, say one short line like “I’m checking that now.” Then call the tool immediately.

65```1380```

66 1381

1382These are the responses after applying the instruction using `gpt-realtime-1.5`.

1383

1384![tool proactive](https://developers.openai.com/cookbook/assets/images/tool_proactive.png)

1385

1386Using the instruction, the model outputs an audio response "I'm checking that right now" at the same time as the tool call.

1387

1388#### Tool Call Preambles + Sample Phrases

1389

1390If you want to control more closely what type of phrases the model outputs at the same time it calls a tool, you can add sample phrases in the tool spec description.

1391

1392#### Example

1393

67```python1394```python

~~68event = {~~1395tools = [

~~69 "type": "session.update",~~1396 {

~~70 session: {~~1397 "name": "lookup_account",

~~71 type: "realtime",~~1398 "description": """Retrieve a customer account using either an email or phone number to enable verification and account-specific actions.

~~72 model: "gpt-realtime",~~1399

~~73 # Lock the output to audio (add "text" if you also want text)~~1400Preamble sample phrases:

~~74 output_modalities: ["audio"],~~1401- For security, I’ll pull up your account using the email on file.

~~75 audio: {~~1402- Let me look up your account by {email} now.

~~76 input: {~~1403- I’m fetching the account linked to {phone} to verify access.

~~77 format: {~~1404- One moment—I’m opening your account details.""",

~~78 type: "audio/pcm",~~1405 "parameters": {

~~79 rate: 24000,~~1406 "..."

~~80 },~~

~~81 turn_detection: {~~

~~82 type: "semantic_vad"~~

~~83 }~~

~~84 },~~

~~85 output: {~~

~~86 format: {~~

~~87 type: "audio/pcmu",~~

~~88 },~~

~~89 voice: "marin",~~

90 }1407 }

91 },1408 },

~~92 # Use a server-stored prompt by ID. Optionally pin a version and pass variables.~~1409 {

~~93 prompt: {~~1410 "name": "check_outage",

~~94 id: "pmpt_123", // your stored prompt ID~~1411 "description": """Check for network outages affecting a given service address and return status and ETA if applicable.

~~95 version: "89", // optional: pin a specific version~~1412

~~96 variables: {~~1413Preamble sample phrases:

~~97 city: "Paris" // example variable used by your prompt~~1414- I’ll check for any outages at {service_address} right now.

1415- Let me look up network status for your area.

1416- I’m checking whether there’s an active outage impacting your address.

1417- One sec—verifying service status and any posted ETA.""",

1418 "parameters": {

1419 "..."

98 }1420 }

~~99 },~~

100 # You can still set direct session fields; these override prompt fields if they overlap:

101 instructions: "Speak clearly and briefly. Confirm understanding before taking actions."

102 }1421 }

103}1422]

104ws.send(json.dumps(event))1423

105```1424```

106 1425

1426### Tool Calls Without Confirmation

107 1427

108When the session's updated, the server emits a [session.updated](https://developers.openai.com/api/docs/api-reference/realtime-server-events/session/updated) event with the new state of the session. You can update the session any time.1428Sometimes the model might ask for confirmation before a tool call. For some use cases, this can lead to a poor experience for the end user because the model is not being proactive.

109 1429

110### Changing prompt mid-call1430- **When to use**: The agent asks for permission before obvious tool calls.

1431- **What it does**: Removes unnecessary confirmation loops.

111 1432

112To update the session mid-call (to swap prompt version or variables, or override instructions), send the update over the same data channel you're using:1433#### Example

113 1434

114```javascript1435```

115// Example: switch to a specific prompt version and change a variable1436# Tools

116dc.send(1437- When calling a tool, do not ask for any user confirmation. Be proactive.

117 JSON.stringify({

118 type: "session.update",

119 session: {

120 type: "realtime",

121 prompt: {

122 id: "pmpt_123",

123 version: "89",

124 variables: {

125 city: "Berlin",

126 },

127 },

128 },

129 })

130);

~~131~~

132// Example: override instructions (note: direct session fields take precedence over Prompt fields)

133dc.send(

134 JSON.stringify({

135 type: "session.update",

136 session: {

137 type: "realtime",

138 instructions: "Speak faster and keep answers under two sentences.",

139 },

140 })

141);

142```1438```

143 1439

144## Prompting gpt-realtime1440These are the responses **after** applying the instruction using `gpt-realtime-1.5`.

145 1441

146Here are top tips for prompting the realtime speech-to-speech model. For a more in-depth guide to prompting, see the [realtime prompting cookbook](https://developers.openai.com/cookbook/examples/realtime_prompting_guide).1442![tool no confirm](https://developers.openai.com/cookbook/assets/images/tool_no_confirm.png)

147 1443

148### General usage tips1444In the example, notice that the realtime model did not produce any response audio; it directly called the respective tool.

149 1445

150- **Iterate relentlessly**. Small wording changes can make or break behavior.1446_Tip: If you notice the model is jumping too quickly to call a tool, try softening the wording. For example, swapping out stronger terms like “proactive” with something gentler can help guide the model to take a calmer, less eager approach._

151 1447

152 Example: Swapping “inaudible” → “unintelligible” improved noisy input handling.1448### Tool Call Performance

153 1449

154- **Use bullets over paragraphs**. Clear, short bullets outperform long paragraphs.1450As use cases grow more complex and the number of available tools increases, it becomes critical to explicitly guide the model on when to use each tool and, just as importantly, when not to. Clear usage rules not only improve tool call accuracy but also help the model choose the right tool at the right time.

155- **Guide with examples**. The model closely follows sample phrases.

156- **Be precise**. Ambiguity and conflicting instructions degrade performance, similar to GPT-5.

157- **Control language**. Pin output to a target language if you see drift.

158- **Reduce repetition**. Add a variety rule to reduce robotic phrasing.

159- **Use all caps for emphasis**: Capitalize key rules to make them stand out to the model.

160- **Convert non-text rules to text**: The model responds better to clearly written text.

161 1451

162 Example: Instead of writing, "IF x > 3 THEN ESCALATE", write, "IF MORE THAN THREE FAILURES THEN ESCALATE."1452- **When to use**: Model is struggling with tool call performance and needs the instructions to be explicit to reduce misuse.

1453- **What it does**: Adds instructions on when to use and when to avoid each tool. You can also add instructions on sequences of tool calls (for example, after tool A, the model can call tool B or C).

163 1454

164### Structure your prompt1455#### Example

165 1456

166Organize your prompt to help the model understand context and stay consistent across turns.1457```

1458# Tools

1459- When you call any tools, you must output at the same time a response letting the user know that you are calling the tool.

167 1460

168Use clear, labeled sections in your system prompt so the model can find and follow them. Keep each section focused on one thing.1461## lookup_account(email_or_phone)

1462Use when: verifying identity or viewing plan/outage flags.

1463Do NOT use when: the user is clearly anonymous and only asks general questions.

169 1464

170```markdown

171# Role & Objective — who you are and what “success” means

172 1465

173# Personality & Tone — the voice and style to maintain1466## check_outage(address)

1467Use when: user reports connectivity issues or slow speeds.

1468Do NOT use when: question is billing-only.

174 1469

175# Context — retrieved context, relevant info

176 1470

177# Reference Pronunciations — phonetic guides for tricky words1471## refund_credit(account_id, minutes)

1472Use when: confirmed outage > 240 minutes in the past 7 days.

1473Do NOT use when: outage is unconfirmed; route to Diagnose → check_outage first.

178 1474

179# Tools — names, usage rules, and preambles

180 1475

181# Instructions / Rules — do’s, don’ts, and approach1476## schedule_technician(account_id, window)

1477Use when: repeated failures after reboot and outage status = false.

1478Do NOT use when: outage status = true (send status + ETA instead).

182 1479

183# Conversation Flow — states, goals, and transitions

184 1480

185# Safety & Escalation — fallback and handoff logic1481## escalate_to_human(account_id, reason)

1482Use when: user seems very frustrated, abuse/harassment, repeated failures, billing disputes >$50, or user requests escalation.

186```1483```

187 1484

188This format also makes it easier for you to iterate and modify problematic sections.1485_Tip: If a tool call can fail unpredictably, add clear failure-handling instructions so the model responds gracefully._

189 1486

190To make this system prompt your own, add domain-specific sections (e.g., Compliance, Brand Policy) and remove sections you don’t need. In each section, provide instructions and other information for the model to respond correctly. See specifics below.1487### Tool Level Behavior

191 1488

192## Practical tips for prompting realtime models1489You can fine-tune how the model behaves for specific tools instead of applying one global rule. For example, you may want READ tools to be called proactively, while WRITE tools require explicit confirmation.

193 1490

194Here are 10 tips for creating effective, consistently performing prompts with gpt-realtime. These are just an overview. For more details and full system prompt examples, see the [realtime prompting cookbook](https://developers.openai.com/cookbook/examples/realtime_prompting_guide).1491- **When to use**: Global instructions for proactiveness, confirmation, or preambles don’t suit every tool.

1492- **What it does**: Adds per-tool behavior rules that define whether the model should call the tool immediately, confirm first, or speak a preamble before the call.

195 1493

196#### 1. Be precise. Kill conflicts.1494#### Example

197 1495

198The new realtime model is very good at instruction following. However, that also means small wording changes or unclear instructions can shift behavior in meaningful ways. Inspect and iterate on your system prompt to try different phrasing and fix instruction contradictions.1496```

1497# TOOLS

1498- For the tools marked PROACTIVE: do not ask for confirmation from the user and do not output a preamble.

1499- For the tools marked as CONFIRMATION FIRST: always ask the user for confirmation before calling the tool.

1500- For the tools marked as PREAMBLES: Before any tool call, say one short line like “I’m checking that now.” Then call the tool immediately.

199 1501

200In one experiment we ran, changing the word "inaudible" to "unintelligible" in instructions for handling noisy inputs significantly improved the model's performance.

201 1502

202After your first attempt at a system prompt, have an LLM review it for ambiguity or conflicts.1503## lookup_account(email_or_phone) — PROACTIVE

1504Use when: verifying identity or accessing billing.

1505Do NOT use when: caller refuses to identify after second request.

203 1506

204#### 2. Bullets > paragraphs.

205 1507

206Realtime models follow short bullet points better than long paragraphs.1508## check_outage(address) — PREAMBLES

1509Use when: caller reports failed connection or speed lower than 10 Mbps.

1510Do NOT use when: purely billing OR when internet speed is above 10 Mbps.

1511If either condition applies, inform the customer you cannot assist and hang up.

207 1512

208Before (harder to follow):

209 1513

210```markdown1514## refund_credit(account_id, minutes) — CONFIRMATION FIRST

211When you can’t clearly hear the user, don’t proceed. If there’s background noise or you only caught part of the sentence, pause and ask them politely to repeat themselves in their preferred language, and make sure you keep the conversation in the same language as the user.1515Use when: confirmed outage > 240 minutes in the past 7 days (credit 60 minutes).

212```1516Do NOT use when: outage unconfirmed.

1517Confirmation phrase: “I can issue a credit for this outage—would you like me to go ahead?”

213 1518

214After (easier to follow):

215 1519

216```markdown1520## schedule_technician(account_id, window) — CONFIRMATION FIRST

217Only respond to clear audio or text.1521Use when: reboot + line checks fail AND outage=false.

1522Windows: “10am–12pm ET” or “2pm–4pm ET”.

1523Confirmation phrase: “I can schedule a technician to visit—should I book that for you?”

218 1524

219If audio is unclear/partial/noisy/silent, ask for clarification in `{preferred_language}`.

220 1525

221Continue in the same language as the user if intelligible.1526## escalate_to_human(account_id, reason) — PREAMBLES

1527Use when: harassment, threats, self-harm, repeated failure, billing disputes > $50, caller is frustrated, or caller requests escalation.

1528Preamble: “Let me connect you to a senior agent who can assist further.”

222```1529```
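When the tool list grows, the per-tool section can be generated from data so that every tool carries exactly one behavior label. The sketch below is an assumption-laden illustration: `render_tool_section` and the dictionary keys are hypothetical, the labels mirror the ones in the example above, and a plain hyphen separates the signature from the label.

```python
ALLOWED_BEHAVIORS = ("PROACTIVE", "CONFIRMATION FIRST", "PREAMBLES")


def render_tool_section(rules):
    """Render per-tool prompt text from a list of rule dictionaries."""
    lines = []
    for rule in rules:
        behavior = rule["behavior"]
        # Reject typos early so the prompt never references an undefined label.
        if behavior not in ALLOWED_BEHAVIORS:
            raise ValueError(f"Unknown behavior label: {behavior!r}")
        lines.append(f"## {rule['signature']} - {behavior}")
        lines.append(f"Use when: {rule['use_when']}")
        if rule.get("avoid_when"):
            lines.append(f"Do NOT use when: {rule['avoid_when']}")
        lines.append("")  # blank line between tools
    return "\n".join(lines).rstrip() + "\n"


section = render_tool_section([
    {
        "signature": "lookup_account(email_or_phone)",
        "behavior": "PROACTIVE",
        "use_when": "verifying identity or accessing billing.",
        "avoid_when": "caller refuses to identify after second request.",
    },
])
```

Validating the label in code keeps the global rules at the top of the `# TOOLS` section and the per-tool markers from drifting apart as tools are added.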

223 1530

224#### 3. Handle unclear audio.1531### Tool Output Formatting

225 1532

226The realtime model is good at following instructions on how to handle unclear audio. Spell out what to do when audio isn’t usable.1533Some tool outputs, especially long strings that must be repeated verbatim, can be out-of-distribution for the model. During training, tool outputs commonly look like JSON objects with named fields. If your tool returns a raw string and separately asks the model to “repeat exactly,” the model may be more prone to paraphrasing, truncation, or blending in its own preamble.

227 1534

228```markdown1535A practical fix is to make the tool output look like a normal tool result and make the verbatim requirement machine-explicit.

229## Unclear audio

230 1536

231- Always respond in the same language the user is speaking in, if intelligible.1537- **When to use:** A tool returns **long or complex structured content** (multi-sentence instructions, handoff packets, IDs/links, policy summaries, multi-step procedures, etc.) and you observe **truncation, paraphrasing, dropped fields, reordering, or the model blending in its own preamble/commentary**.

232- Default to English if the input language is unclear.1538

233- Only respond to clear audio or text.1539- **What it does:** Wraps the tool output in a **small, explicit JSON envelope** (e.g., `response_text` plus flags like `require_repeat_verbatim`, `format`, or `content_type`) so the response looks more **in-distribution** and the expected realization behavior is **machine-clear**.

234- If the user's audio is not clear (e.g., ambiguous input/background noise/silent/unintelligible) or if you did not fully hear or understand the user, ask for clarification using {preferred_language} phrases.

235 1540

236Sample clarification phrases (parameterize with {preferred_language}):1541- **How to adapt:** Keep the schema **minimal and stable**. Clearly document the expected tool output shape in both your **Tools instructions** and next to the **tool definition** (e.g., “If `require_repeat_verbatim` is true, output exactly `response_text` and nothing else,” or “Render `response_text` as-is; do not add, omit, or reorder fields from the tool output.”).

237 1542

238- “Sorry, I didn’t catch that—could you say it again?”1543#### Examples

239- “There’s some background noise. Please repeat the last part.”1544

240- “I only heard part of that. What did you say after \_\_\_?”1545#### Example: raw string (more error-prone)

1546

1547Tool returns:

1548

1549```python

1550I just sent you an email with the verification link. Please open it and click “Confirm”.

241```1551```

242 1552

243#### 4. Constrain the model to one language.1553The model sometimes says:

244 1554

245If you see the model switching languages in an unhelpful way, add a dedicated "Language" section in your prompt. Make sure it doesn’t conflict with other rules. By default, mirroring the user’s language works well.1555- “I’ve emailed you a verification link…” (paraphrase)

246 1556

247Here's a simple way to mirror the user's language:1557- Drops the last sentence (truncation)

248 1558

249```markdown1559- Adds extra commentary (“Can I help with anything else?”)

250## Language1560

1561#### Example: wrapped JSON (more in-distribution, more reliable)

251 1562

252Language matching: Respond in the same language as the user unless directed otherwise.1563Tool returns:

253For non-English, start with the same standard accent/dialect the user uses.1564

1565```json

1566{

1567 "response_text": "I just sent you an email with the verification link. Please open it and click “Confirm”.",

1568 "require_repeat_verbatim": true

1569}

254```1570```

255 1571

256Here's an example of an English-only constraint:1572Because this looks like a typical tool result (JSON object), the model generally has an easier time:

257 1573

258```markdown1574- recognizing what the “authoritative” content is (response_text)

259## Language

260 1575

261- The conversation will be only in English.1576- understanding the realization constraint (require_repeat_verbatim)

262- Do not respond in any other language, even if the user asks.

263- If the user speaks another language, politely explain that support is limited to English.

264```

265 1577

266In a language teaching application, your language and conversation sections might look like this:1578- reproducing the tool output cleanly, without truncation or extra commentary
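On the tool side, producing this envelope can be a one-line wrapper around the raw string. A minimal sketch follows; the function name `wrap_tool_output` is hypothetical, and the field names are the ones used in the example above.

```python
import json


def wrap_tool_output(text, require_repeat_verbatim=True):
    """Wrap a raw tool string in a small JSON envelope."""
    envelope = {
        "response_text": text,
        "require_repeat_verbatim": require_repeat_verbatim,
    }
    # ensure_ascii=False keeps curly quotes and other non-ASCII text readable.
    return json.dumps(envelope, ensure_ascii=False)


raw = "I just sent you an email with the verification link. Please open it and click “Confirm”."
wrapped = wrap_tool_output(raw)
```

Keeping the wrapper in one place also makes it easier to hold the schema minimal and stable, since every tool that needs verbatim output goes through the same function.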

267 1579

268```markdown1580### Rephrase Supervisor Tool (Responder-Thinker Architecture)

269## Language

270 1581

271### Explanations1582In many voice setups, the realtime model acts as the responder (speaks to the user) while a stronger text model acts as the thinker (does planning, policy lookups, SOP completion). Text replies are not automatically good for speech, so the responder must rephrase the thinker’s text into an audio-friendly response before generating audio.

272 1583

273Use English when explaining grammar, vocabulary, or cultural context.1584- **When to use**: When the responder’s spoken output sounds robotic, too long, or awkward after receiving a thinker response.

1585- **What it does**: Adds clear instructions that guide the responder to rephrase the thinker’s text into a short, natural, speech-first reply.

1586- **How to adapt**: Tweak phrasing style, openers, and brevity limits to match your use case expectations.

274 1587

275### Conversation1588#### Example

276 1589

277Speak in French when conducting practice, giving examples, or engaging in dialogue.

278```1590```

1591# Tools

1592## Supervisor Tool

1593Name: getNextResponseFromSupervisor(relevantContextFromLastUserMessage: string)

279 1594

280You can also control dialect for a more consistent personality:

281 1595

282```markdown1596When to call:

283## Language1597- Any request outside the allow list.

1598- Any factual, policy, account, or process question.

1599- Any action that might require internal lookups or system changes.

284 1600

285Respond only in Argentine Spanish.

286```

287 1601

288#### 5. Provide sample phrases and flow snippets.1602When not to call:

1603- Simple greetings and basic chitchat.

1604- Requests to repeat or clarify.

1605- Collecting parameters for later Supervisor use:

1606 - phone_number for account help (getUserAccountInfo)

1607 - zip_code for store lookup (findNearestStore)

1608 - topic or keyword for policy lookup (lookupPolicyDocument)

289 1609

290The model learns style from examples. Give short, varied samples for common conversation moments.

291 1610

292For example, you might give this high-level shape of conversation flow to the model:1611Usage rules and preamble:

16121) Say a neutral filler phrase to the user, then immediately call the tool. Approved fillers: “One moment.”, “Let me check.”, “Just a second.”, “Give me a moment.”, “Let me see.”, “Let me look into that.” Fillers must not imply success or failure.

16132) Do not mention the “Supervisor” when responding with filler phrase.

16143) relevantContextFromLastUserMessage is a one-line summary of the latest user message; use an empty string if nothing salient.

16154) After the tool returns, apply Rephrase Supervisor and send your reply.

293 1616

294```markdown1617

295Greeting → Discover → Verify → Diagnose → Resolve → Confirm/Close. Advance only when criteria in each phase are met.1618### Rephrase Supervisor

1619- Start with a brief conversational opener using active language, then flow into the answer (for example: “Thanks for waiting—”, “Just finished checking that.”, “I’ve got that pulled up now.”).

1620- Keep it short: no more than 2 sentences.

1621- Use this template: opener + one-sentence gist + up to 3 key details + a quick confirmation or choice (for example: “Does that match what you expected?”, “Want me to review options?”).

1622- Read numbers for speech: money naturally (“$45.20” → “forty-five dollars and twenty cents”), phone numbers 3-3-4, addresses with individual digits, dates/times plainly (“August twelfth”, “three-thirty p.m.”).

296```1623```

297 1624

298And then provide prompt guidance for each section. For example, here's how you might instruct for the greeting section:1625Here’s an example without the rephrasing instruction:

299 1626

300```markdown1627> Assistant: Your current credit card balance is positive at 32,323,232 AUD.

301## Conversation flow — Greeting

302 1628

303Goal: Set tone and invite the reason for calling.1629Here’s the same example with the rephrasing instruction:

304 1630

305How to respond:1631> Assistant: Just finished checking that—your credit card balance is thirty-two million three hundred twenty-three thousand two hundred thirty-two dollars in your favor. Your last payment was processed on August first. Does that match what you expected?

306 1632

307- Identify as ACME Internet Support.1633### Common Tools

308- Keep it brief; invite the caller’s goal.

309 1634

310Sample phrases (vary, don’t always reuse):1635`gpt-realtime-1.5` has been trained to effectively use the following common tools. If your use case needs similar behavior, keep the names, signatures, and descriptions close to these to maximize reliability and to be more in-distribution.

311 1636

312- “Thanks for calling ACME Internet—how can I help today?”1637Below are some of the important common tools that the model has been trained on:

313- “You’ve reached ACME Support. What’s going on with your service?”1638

314- “Hi there—tell me what you’d like help with.”1639#### Example

315 1640

316Exit when: Caller states an initial goal or symptom.

317```1641```

1642# answer(question: string)

1643Description: Call this when the customer asks a question that you don't have an answer to or asks to perform an action.

318 1644

319#### 6. Avoid robotic repetition.

320 1645

321If responses sound repetitive or robotic, include an explicit variety instruction. This can sometimes happen when using sample phrases.1646# escalate_to_human()

1647Description: Call this when a customer asks for escalation, or to talk to someone else, or expresses dissatisfaction with the call.

322 1648

323```markdown

324## Variety

325 1649

326- Do not repeat the same sentence twice. Vary your responses so it doesn't sound robotic.1650# finish_session()

1651Description: Call this when a customer says they're done with the session or doesn't want to continue. If it's ambiguous, confirm with the customer before calling.

327```1652```
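The same three tools can be expressed as entries for a session's tools list. The sketch below assumes the function-tool shape (`type`, `name`, `description`, and a JSON Schema object under `parameters`); confirm the exact schema against the API reference before relying on it. The helper `function_tool` is hypothetical.

```python
import json


def function_tool(name, description, properties=None, required=None):
    """Build one tool entry (assumed function-tool shape, see lead-in)."""
    return {
        "type": "function",
        "name": name,
        "description": description,
        "parameters": {
            "type": "object",
            "properties": properties or {},
            "required": required or [],
        },
    }


COMMON_TOOLS = [
    function_tool(
        "answer",
        "Call this when the customer asks a question that you don't have an "
        "answer to or asks to perform an action.",
        properties={"question": {"type": "string"}},
        required=["question"],
    ),
    function_tool(
        "escalate_to_human",
        "Call this when a customer asks for escalation, or to talk to someone "
        "else, or expresses dissatisfaction with the call.",
    ),
    function_tool(
        "finish_session",
        "Call this when a customer says they're done with the session or "
        "doesn't want to continue. If it's ambiguous, confirm with the "
        "customer before calling.",
    ),
]

tools_json = json.dumps(COMMON_TOOLS, indent=2)
```

The names and descriptions are copied from the list above so the definitions stay close to what the model was trained on.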

328 1653

329#### 7. Use capitalized text to emphasize instructions.1654## Conversation Flow

330 1655

331Like many LLMs, using capitalization for important rules can help the model to understand and follow those rules. It's also helpful to convert non-text rules (such as numerical conditions) into text before capitalization.1656This section covers how to structure the dialogue into clear, goal-driven phases so the model knows exactly what to do at each step. It defines the purpose of each phase, the instructions for moving through it, and the concrete “exit criteria” for transitioning to the next. This prevents the model from stalling, skipping steps, or jumping ahead, and ensures the conversation stays organized from greeting to resolution.

332 1657

333Instead of:1658Organizing your prompt into conversation states also makes it easier to identify error modes and iterate more effectively.

334 1659

335```markdown1660- **When to use**: If conversations feel disorganized, stall before reaching the goal, or the model struggles to effectively complete the objective.

336## Rules1661- **What it does**: Breaks the interaction into phases with clear goals, instructions and exit criteria.

1662- **How to adapt**: Rename phases to match your workflow; modify instructions for each phase to follow your intended behavior; keep “Exit when” concrete and minimal.

1663

1664#### Example

337 1665

338- If [func.return_value] > 0, respond 1 to the user.

339```1666```

1667# Conversation Flow

1668## 1) Greeting

1669Goal: Set tone and invite the reason for calling.

1670How to respond:

1671- Identify as NorthLoop Internet Support.

1672- Keep the opener brief and invite the caller’s goal.

1673- Confirm that the customer is a NorthLoop customer.

1674Exit to Discovery: Caller states they are a NorthLoop customer and mentions an initial goal or symptom.

340 1675

341Use:

342 1676

343```markdown1677## 2) Discover

344## Rules1678Goal: Classify the issue and capture minimal details.

1679How to respond:

1680- Determine billing vs connectivity with one targeted question.

1681- For connectivity: collect the service address.

1682- For billing/account: collect email or phone used on the account.

1683Exit when: Intent and address (for connectivity) or email/phone (for billing) are known.

345 1684

346- IF [func.return_value] IS BIGGER THAN 0, RESPOND 1 TO THE USER.

347```

348 1685

349#### 8. Help the model use tools.1686## 3) Verify

1687Goal: Confirm identity and retrieve the account.

1688How to respond:

1689- Once you have email or phone, call lookup_account(email_or_phone).

1690- If lookup fails, try the alternate identifier once; otherwise proceed with general guidance or offer escalation if account actions are required.

1691Exit when: Account ID is returned.

1692

350 1693

351The model's use of tools shapes the experience: how much it relies on user confirmation vs. taking action, what it says while it makes the tool call, which rules it follows for each specific tool, and so on.1694## 4) Diagnose

1695Goal: Decide outage vs local issue.

1696How to respond:

1697- For connectivity, call check_outage(address).

1698- If outage=true, skip local steps; move to Resolve with outage context.

1699- If outage=false, guide a short reboot/cabling check; confirm each step’s result before continuing.

1700Exit when: Root cause known.

352 1701

353One way to prompt for tool usage is to use preambles. Good preambles instruct the model to give the user some feedback about what it's doing before it makes the tool call, so the user always knows what's going on.

354 1702

355Here's an example:1703## 5) Resolve

1704Goal: Apply fix, credit, or appointment.

1705How to respond:

1706- If confirmed outage > 240 minutes in the last 7 days, call refund_credit(account_id, 60).

1707- If outage=false and issue persists after basic checks, offer “10am–12pm ET” or “2pm–4pm ET” and call schedule_technician(account_id, chosen window).

1708- If the local fix worked, state the result and next steps briefly.

1709Exit when: A fix/credit/appointment has been applied and acknowledged by the caller.

356 1710

357```markdown

358# Tools

359 1711

360- Before any tool call, say one short line like “I’m checking that now.” Then call the tool immediately.1712## 6) Confirm/Close

1713Goal: Confirm outcome and end cleanly.

1714How to respond:

1715- Restate the result and any next step (e.g., stabilization window or tech ETA).

1716- Invite final questions; close politely if none.

1717Exit when: Caller declines more help.

361```1718```

362 1719

363You can include sample phrases for preambles to add variety and better tailor to your use case.1720### Sample Phrases

364 1721

365There are several other ways to improve the model's behavior when performing tool calls and keeping the conversation going with the user. Ideally, the model is calling the right tools proactively, checking for confirmation for any important write actions, and keeping the user informed along the way. For more specifics, see the [realtime prompting cookbook](https://developers.openai.com/cookbook/examples/realtime_prompting_guide).1722Sample phrases act as “anchor examples” for the model. They show the style, brevity, and tone you want it to follow, without locking it into one rigid response.

366 1723

367#### 9. Use LLMs to improve your prompt.1724- **When to use**: Responses lack your brand style or are not consistent.

1725- **What it does**: Provides sample phrases the model can vary to stay natural and brief.

1726- **How to adapt**: Swap examples for brand-fit; keep the “do not always use” warning.

368 1727

369LLMs are great at finding what's going wrong in your prompt. Use ChatGPT or the API to get a model's review of your current realtime prompt and get help improving it.1728#### Example

370 1729

371Whether your prompt is working well or not, here's a prompt you can run to get a model's review:1730```

1731# Sample Phrases

1732- Below are sample examples that you should use for inspiration. DO NOT ALWAYS USE THESE EXAMPLES, VARY YOUR RESPONSES.

1733

1734Acknowledgements: “On it.” “One moment.” “Good question.”

1735Clarifiers: “Do you want A or B?” “What’s the deadline?”

1736Bridges: “Here’s the quick plan.” “Let’s keep it simple.”

1737Empathy (brief): “That’s frustrating—let’s fix it.”

1738Closers: “Anything else before we wrap?” “Happy to help next time.”

1739```

372 1740

373```markdown1741_Note: If your voice system ends up consistently only repeating the sample phrases, leading to a more robotic voice experience, try adding the Variety constraint. We’ve seen this fix the issue._

374## Role & Objective

375 1742

376You are a **Prompt-Critique Expert**.1743### Conversation flow + Sample Phrases

377Examine a user-supplied LLM prompt and surface any weaknesses following the instructions below.

378 1744

379## Instructions1745It is a useful pattern to add sample phrases in the different conversation flow states to teach the model what a good response looks like:

380 1746

381Review the prompt that is meant for an LLM to follow and identify the following issues:1747#### Example

382 1748

383- Ambiguity: Could any wording be interpreted in more than one way?1749```

384- Lacking Definitions: Are there any class labels, terms, or concepts that are not defined that might be misinterpreted by an LLM?1750# Conversation Flow

385- Conflicting, missing, or vague instructions: Are directions incomplete or contradictory?1751## 1) Greeting

386- Unstated assumptions: Does the prompt assume the model has to be able to do something that is not explicitly stated?1752Goal: Set tone and invite the reason for calling.

1753How to respond:

1754- Identify as NorthLoop Internet Support.

1755- Keep the opener brief and invite the caller’s goal.

1756Sample phrases (do not always repeat the same phrases, vary your responses):

1757- “Thanks for calling NorthLoop Internet—how can I help today?”

1758- “You’ve reached NorthLoop Support. What’s going on with your service?”

1759- “Hi there—tell me what you’d like help with.”

1760Exit when: Caller states an initial goal or symptom.

387 1761

388## Do **NOT** list issues of the following types:

389 1762

390- Invent new instructions, tool calls, or external information. You do not know what tools need to be added that are missing.1763## 2) Discover

391- Issues that you are not sure about.1764Goal: Classify the issue and capture minimal details.

1765How to respond:

1766- Determine billing vs connectivity with one targeted question.

1767- For connectivity: collect the service address.

1768- For billing/account: collect email or phone used on the account.

1769Sample phrases (do not always repeat the same phrases, vary your responses):

1770- “Is this about your bill or your internet speed?”

1771- “What address are you using for the connection?”

1772- “What’s the email or phone number on the account?”

1773Exit when: Intent and address (for connectivity) or email/phone (for billing) are known.

1774

1775

1776## 3) Verify

1777Goal: Confirm identity and retrieve the account.

1778How to respond:

1779- Once you have email or phone, call lookup_account(email_or_phone).

1780- If lookup fails, try the alternate identifier once; otherwise proceed with general guidance or offer escalation if account actions are required.

1781Sample phrases:

1782- “Thanks—looking up your account now.”

1783- “If that doesn’t pull up, what’s the other contact—email or phone?”

1784- “Found your account. I’ll take care of this.”

1785Exit when: Account ID is returned.

392 1786

393## Output Format

394 1787

395# Issues1788## 4) Diagnose

1789Goal: Decide outage vs local issue.

1790How to respond:

1791- For connectivity, call check_outage(address).

1792- If outage=true, skip local steps; move to Resolve with outage context.

1793- If outage=false, guide a short reboot/cabling check; confirm each step’s result before continuing.

1794Sample phrases (do not always repeat the same phrases, vary your responses):

1795- “I’m running a quick outage check for your area.”

1796- “No outage reported—let’s try a fast modem reboot.”

1797- “Please confirm the modem lights: is the internet light solid or blinking?”

1798Exit when: Root cause known.

1799

1800

1801## 5) Resolve

1802Goal: Apply fix, credit, or appointment.

1803How to respond:

1804- If confirmed outage > 240 minutes in the last 7 days, call refund_credit(account_id, 60).

1805- If outage=false and issue persists after basic checks, offer “10am–12pm ET” or “2pm–4pm ET” and call schedule_technician(account_id, chosen window).

1806- If the local fix worked, state the result and next steps briefly.

1807Sample phrases (do not always repeat the same phrases, vary your responses):

1808- “There’s been an extended outage—adding a 60-minute bill credit now.”

1809- “No outage—let’s book a technician. I can do 10am–12pm ET or 2pm–4pm ET.”

1810- “Credit applied—you’ll see it on your next bill.”

1811Exit when: A fix/credit/appointment has been applied and acknowledged by the caller.

1812

1813

1814## 6) Confirm/Close

1815Goal: Confirm outcome and end cleanly.

1816How to respond:

1817- Restate the result and any next step (e.g., stabilization window or tech ETA).

1818- Invite final questions; close politely if none.

1819Sample phrases (do not always repeat the same phrases, vary your responses):

1820- “We’re all set: [credit applied / appointment booked / service restored].”

1821- “You should see stable speeds within a few minutes.”

1822- “Your technician window is 10am–12pm ET.”

1823Exit when: Caller declines more help.

396 1824

397- Numbered list; include brief quote snippets.1825```

398 1826

399# Improvements1827### Advanced Conversation Flow

400 1828

401- Numbered list; provide the revised lines you would change and how you would change them.1829As use cases grow more complex, you’ll need a structure that scales while keeping the model effective. The key is balancing maintainability with simplicity: too many rigid states can overload the model, hurting performance and making conversations feel robotic.

402 1830

403# Revised Prompt1831A better approach is to design flows that reduce the model’s perceived complexity. By handling state in a structured but flexible way, you make it easier for the model to stay focused and responsive, which improves user experience.

404 1832

405- Revised prompt where you have applied all your improvements surgically with minimal edits to the original prompt1833Two common patterns for managing complex scenarios are:

406```

407 1834

408Use this template as a starting point for troubleshooting a recurring issue:18351. Conversation Flow as State Machine

18362. Dynamic Conversation Flow via session.update

409 1837

410```markdown1838#### Conversation Flow as State Machine

411Here's my current prompt to an LLM:

412[BEGIN OF CURRENT PROMPT]

413{CURRENT_PROMPT}

414[END OF CURRENT PROMPT]

415 1839

416But I see this issue happening from the LLM:1840Define your conversation as a JSON structure that encodes both states and transitions. This makes it easy to reason about coverage, identify edge cases, and track changes over time. Since it’s stored as code, you can version, diff, and extend it as your flow evolves. A state machine also gives you fine-grained control over exactly how and when the conversation moves from one state to another.

417[BEGIN OF ISSUE]1841

418{ISSUE}1842#### Example

419[END OF ISSUE]1843

420Can you provide some variants of the prompt so that the model can better understand the constraints to alleviate the issue?1844```json

1845# Conversation States

1846[

1847 {

1848 "id": "1_greeting",

1849 "description": "Begin each conversation with a warm, friendly greeting, identifying the service and offering help.",

1850 "instructions": [

1851 "Use the company name 'Snowy Peak Boards' and provide a warm welcome.",

1852 "Let them know upfront that for any account-specific assistance, you’ll need some verification details."

1853 ],

1854 "examples": [

1855 "Hello, this is Snowy Peak Boards. Thanks for reaching out! How can I help you today?"

1856 ],

1857 "transitions": [{

1858 "next_step": "2_get_first_name",

1859 "condition": "Once greeting is complete."

1860 }, {

1861 "next_step": "3_get_and_verify_phone",

1862 "condition": "If the user provides their first name."

1863 }]

1864 },

1865 {

1866 "id": "2_get_first_name",

1867 "description": "Ask for the user’s name (first name only).",

1868 "instructions": [

1869 "Politely ask, 'Who do I have the pleasure of speaking with?'",

1870 "Do NOT verify or spell back the name; just accept it."

1871 ],

1872 "examples": [

1873 "Who do I have the pleasure of speaking with?"

1874 ],

1875 "transitions": [{

1876 "next_step": "3_get_and_verify_phone",

1877 "condition": "Once name is obtained, OR name is already provided."

1878 }]

1879 },

1880 {

1881 "id": "3_get_and_verify_phone",

1882 "description": "Request phone number and verify by repeating it back.",

1883 "instructions": [

1884 "Politely request the user’s phone number.",

1885 "Once provided, confirm it by repeating each digit and ask if it’s correct.",

1886 "If the user corrects you, confirm AGAIN to make sure you understand.",

1887 ],

1888 "examples": [

1889 "I'll need some more information to access your account if that's okay. May I have your phone number, please?",

1890 "You said 0-2-1-5-5-5-1-2-3-4, correct?",

1891 "You said 4-5-6-7-8-9-0-1-2-3, correct?"

1892 ],

1893 "transitions": [{

1894 "next_step": "4_authentication_DOB",

1895 "condition": "Once phone number is confirmed"

1896 }]

1897 },

1898...

421```1899```
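
Because the states are plain data, coverage checks can be automated. The following is a minimal sketch, not part of the guide itself: `STATES` is a hypothetical, trimmed-down copy of the example above (ids and transitions only), and `find_undefined_targets` and `find_unreachable` are illustrative helper names.

```python
from typing import Dict, List, Set

# Hypothetical, trimmed-down copy of the example states (ids and transitions only).
STATES: List[dict] = [
    {"id": "1_greeting", "transitions": [
        {"next_step": "2_get_first_name", "condition": "Once greeting is complete."},
        {"next_step": "3_get_and_verify_phone", "condition": "If the user provides their first name."},
    ]},
    {"id": "2_get_first_name", "transitions": [
        {"next_step": "3_get_and_verify_phone", "condition": "Once name is obtained, OR name is already provided."},
    ]},
    {"id": "3_get_and_verify_phone", "transitions": [
        {"next_step": "4_authentication_DOB", "condition": "Once phone number is confirmed"},
    ]},
]


def find_undefined_targets(states: List[dict]) -> Dict[str, List[str]]:
    """Map each state id to the next_step values that point at no defined state."""
    known: Set[str] = {s["id"] for s in states}
    missing: Dict[str, List[str]] = {}
    for state in states:
        bad = [t["next_step"] for t in state.get("transitions", []) if t["next_step"] not in known]
        if bad:
            missing[state["id"]] = bad
    return missing


def find_unreachable(states: List[dict], start: str) -> Set[str]:
    """Return the ids of defined states that cannot be reached from `start`."""
    graph = {s["id"]: [t["next_step"] for t in s.get("transitions", [])] for s in states}
    seen: Set[str] = set()
    stack = [start]
    while stack:
        current = stack.pop()
        if current in seen or current not in graph:
            continue  # already visited, or a target that is not defined
        seen.add(current)
        stack.extend(graph[current])
    return set(graph) - seen
```

Run against the truncated excerpt, `find_undefined_targets(STATES)` flags `4_authentication_DOB`, because that state falls in the elided part of the example. On a complete flow, both helpers should return empty results.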

422 1900

423#### 10. Help users resolve issues faster.1901#### Dynamic Conversation Flow

424 1902

425Two frustrating user experiences are slow, mechanical voice agents and the inability to escalate. Help users faster by providing instructions in your system prompt for speed and escalation.1903In this pattern, the conversation adapts in real time by updating the system prompt and tool list based on the current state. Instead of exposing the model to all possible rules and tools at once, you only provide what’s relevant to the active phase of the conversation.

426 1904

427In the personality and tone section of your system prompt, add pacing instructions to get the model to quicken its support:1905When the end conditions for a state are met, you use session.update to transition, replacing the prompt and tools with those needed for the next phase.

428 1906

429```markdown1907This approach reduces the model’s cognitive load, making it easier for it to handle complex tasks without being distracted by unnecessary context.

430# Personality & Tone

431 1908

432## Personality1909#### Example

433 1910

434Friendly, calm and approachable expert customer service assistant.1911```python

1912from typing import Dict, List, Literal

435 1913

436## Tone1914State = Literal["verify", "resolve"]

437 1915

438Tone: Warm, concise, confident, never fawning.1916# Allowed transitions

1917TRANSITIONS: Dict[State, List[State]] = {

1918 "verify": ["resolve"],

1919 "resolve": [] # terminal

1920}

439 1921

440## Length1922def build_state_change_tool(current: State) -> dict:

1923 allowed = TRANSITIONS[current]

1924 readable = ", ".join(allowed) if allowed else "no further states (terminal)"

1925 return {

1926 "type": "function",

1927 "name": "set_conversation_state",

1928 "description": (

1929 f"Switch the conversation phase. Current: '{current}'. "

1930 f"You may switch only to: {readable}. "

1931 "Call this AFTER exit criteria are satisfied."

1932 ),

1933 "parameters": {

1934 "type": "object",

1935 "properties": {

1936 "next_state": {"type": "string", "enum": allowed}

1937 },

1938 "required": ["next_state"]

1939 }

1940 }

441 1941

4422–3 sentences per turn.1942# Minimal business tools per state

1943TOOLS_BY_STATE: Dict[State, List[dict]] = {

1944 "verify": [{

1945 "type": "function",

1946 "name": "lookup_account",

1947 "description": "Fetch account by email or phone.",

1948 "parameters": {

1949 "type": "object",

1950 "properties": {"email_or_phone": {"type": "string"}},

1951 "required": ["email_or_phone"]

1952 }

1953 }],

1954 "resolve": [{

1955 "type": "function",

1956 "name": "schedule_technician",

1957 "description": "Book a technician visit.",

1958 "parameters": {

1959 "type": "object",

1960 "properties": {

1961 "account_id": {"type": "string"},

1962 "window": {"type": "string", "enum": ["10-12 ET", "14-16 ET"]}

1963 },

1964 "required": ["account_id", "window"]

1965 }

1966 }]

1967}

443 1968

444## Pacing1969# Short, phase-specific instructions

1970INSTRUCTIONS_BY_STATE: Dict[State, str] = {

1971 "verify": (

1972 "# Role & Objective\n"

1973 "Verify identity to access the account.\n\n"

1974 "# Conversation (Verify)\n"

1975 "- Ask for the email or phone on the account.\n"

1976 "- Read back digits one-by-one (e.g., '4-1-5… Is that correct?').\n"

1977 "Exit when: Account ID is returned.\n"

1978 "When exit is satisfied: call set_conversation_state(next_state=\"resolve\")."

1979 ),

1980 "resolve": (

1981 "# Role & Objective\n"

1982 "Apply a fix by booking a technician.\n\n"

1983 "# Conversation (Resolve)\n"

1984 "- Offer two windows: '10–12 ET' or '2–4 ET'.\n"

1985 "- Book the chosen window.\n"

1986 "Exit when: Appointment is confirmed.\n"

1987 "When exit is satisfied: end the call politely."

1988 )

1989}

445 1990

446Deliver your audio response fast, but do not sound rushed. Do not modify the content of your response, only increase speaking speed for the same response.1991def build_session_update(state: State) -> dict:

1992 """Return the JSON payload for a Realtime `session.update` event."""

1993 return {

1994 "type": "session.update",

1995 "session": {

1996 "instructions": INSTRUCTIONS_BY_STATE[state],

1997 "tools": TOOLS_BY_STATE[state] + [build_state_change_tool(state)]

1998 }

1999 }

447```2000```
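
When the model calls `set_conversation_state`, your application still has to decide whether to honor the request. The sketch below shows one way to validate it, assuming the same transition table as the example above (repeated here so the snippet runs on its own). `next_state_or_error` is a hypothetical helper name, not part of any SDK.

```python
from typing import Dict, List

# Same transition table as the example above, repeated so this sketch is self-contained.
TRANSITIONS: Dict[str, List[str]] = {
    "verify": ["resolve"],
    "resolve": [],  # terminal
}


def next_state_or_error(current: str, requested: str) -> dict:
    """Validate a state change requested by the model through the tool call."""
    allowed = TRANSITIONS.get(current, [])
    if requested not in allowed:
        # Keep the current state and return a message you can pass back to the model.
        return {
            "ok": False,
            "state": current,
            "error": f"'{requested}' is not reachable from '{current}'. Allowed: {allowed}",
        }
    return {"ok": True, "state": requested, "error": None}
```

If `ok` is true, one option is to send the payload built for the new state with `session.update`. Otherwise, return the error text as the tool output so the model can correct itself.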

448 2001

449Often with realtime voice agents, having a reliable way to escalate to a human is important. In a safety and escalation section, modify the instructions on WHEN to escalate depending on your use case. Here's an example:2002## Safety & Escalation

450 2003

451```markdown2004Often with Realtime voice agents, having a reliable way to escalate to a human is important. In this section, you should modify the instructions on WHEN to escalate depending on your use case.

452# Safety & Escalation

453 2005

454When to escalate (no extra troubleshooting):2006- **When to use**: The model struggles to determine when to escalate to a human or fallback system.

2007- **What it does**: Defines fast, reliable escalation and what to say.

2008- **How to adapt**: Insert your own thresholds and what the model has to say.

2009

2010#### Example

455 2011

2012```

2013# Safety & Escalation

2014When to escalate (no extra troubleshooting):

456- Safety risk (self-harm, threats, harassment)2015- Safety risk (self-harm, threats, harassment)

457- User explicitly asks for a human2016- User explicitly asks for a human

458- Severe dissatisfaction (e.g., “extremely frustrated,” repeated complaints, profanity)2017- Severe dissatisfaction (e.g., “extremely frustrated,” repeated complaints, profanity)

459- **2** failed tool attempts on the same task **or** **3** consecutive no-match/no-input events2018- **2** failed tool attempts on the same task **or** **3** consecutive no-match/no-input events

460- Out-of-scope or restricted (e.g., real-time news, financial/legal/medical advice)2019- Out-of-scope or restricted (e.g., real-time news, financial/legal/medical advice)

461 2020

462What to say at the same time of calling the escalate_to_human tool (MANDATORY):2021What to say at the same time as calling the escalate_to_human tool (MANDATORY):

~~463~~ 2022- “Thanks for your patience—I’m connecting you with a specialist now.”

464- “Thanks for your patience—**I’m connecting you with a specialist now**.”

465- Then call the tool: `escalate_to_human`2023- Then call the tool: `escalate_to_human`

466 2024

467Examples that would require escalation:2025Examples that would require escalation:

470- “I am extremely frustrated!”2027- “I am extremely frustrated!”

471```2028```
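
The numeric thresholds in the prompt can also be mirrored in application code as a backstop. This is a rough sketch under assumptions, not a prescribed implementation. `EscalationTracker` and its method names are hypothetical, the defaults come from the example prompt (2 failed tool attempts on the same task, 3 consecutive no-match/no-input events), and the choice to reset a task's failure count after a success is an assumption of this sketch.

```python
from collections import defaultdict
from typing import DefaultDict


class EscalationTracker:
    """Counts failures so application code can back up the prompt's escalation rules."""

    def __init__(self, max_tool_failures: int = 2, max_no_input: int = 3) -> None:
        self.max_tool_failures = max_tool_failures
        self.max_no_input = max_no_input
        self.tool_failures: DefaultDict[str, int] = defaultdict(int)
        self.consecutive_no_input = 0

    def record_tool_result(self, task: str, ok: bool) -> None:
        # Assumption: a success clears the failure count for that task.
        if ok:
            self.tool_failures[task] = 0
        else:
            self.tool_failures[task] += 1

    def record_user_turn(self, understood: bool) -> None:
        # Only consecutive no-match/no-input events count, so an understood turn resets.
        self.consecutive_no_input = 0 if understood else self.consecutive_no_input + 1

    def should_escalate(self) -> bool:
        too_many_failures = any(
            count >= self.max_tool_failures for count in self.tool_failures.values()
        )
        return too_many_failures or self.consecutive_no_input >= self.max_no_input
```

When `should_escalate()` returns true, your application could trigger the same handoff that the `escalate_to_human` tool performs, even if the model has not called it.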

472 2029

473## Further reading2030The first example shows conversation responses from `gpt-4o-realtime-preview-2025-06-03` using the instruction.

2031

2032![escalate 06](https://developers.openai.com/cookbook/assets/images/escalate_06.png)

2033

2034The second example shows conversation responses from `gpt-realtime-1.5` using the instruction.

2035

2036![escalate 07](https://developers.openai.com/cookbook/assets/images/escalate_07.png)

2037

2038`gpt-realtime-1.5` is able to follow the instruction and escalate to a human more reliably.

2039

2040 </div>

2041

2042

474 2043

475This guide is long but not exhaustive! For more in a specific area, see the following resources:2044## Next steps

476 2045

477- [Realtime prompting cookbook](https://developers.openai.com/cookbook/examples/realtime_prompting_guide): Full prompt examples and a deep dive into when and how to use them2046- Review the earlier [Realtime prompting guide](https://developers.openai.com/cookbook/examples/realtime_prompting_guide) for more `gpt-realtime-1.5` examples.

478- [Inputs and outputs](https://developers.openai.com/api/docs/guides/realtime-inputs-outputs): Text and audio input requirements and output options2047- Review the [Realtime eval guide](https://developers.openai.com/cookbook/examples/realtime_eval_guide) to test representative voice-agent behavior.

479- [Managing conversations](https://developers.openai.com/api/docs/guides/realtime-conversations): Learn to manage a conversation for the duration of a realtime session2048- Learn how to connect with [WebRTC](https://developers.openai.com/api/docs/guides/realtime-webrtc), [WebSocket](https://developers.openai.com/api/docs/guides/realtime-websocket), or [SIP](https://developers.openai.com/api/docs/guides/realtime-sip).

480- [Webhooks and server-side controls](https://developers.openai.com/api/docs/guides/realtime-server-controls): Create a sideband channel to separate sensitive server-side logic from an untrusted client2049- Learn the [Realtime conversation lifecycle](https://developers.openai.com/api/docs/guides/realtime-conversations).

481- [Managing costs](https://developers.openai.com/api/docs/guides/realtime-costs): Understand how costs are calculated and strategies to optimize them2050- Review [Realtime costs](https://developers.openai.com/api/docs/guides/realtime-costs).

482- [Function calling](https://developers.openai.com/api/docs/guides/realtime-function-calling): How to call functions in your realtime app

483- [MCP servers](https://developers.openai.com/api/docs/guides/realtime-mcp): How to use MCP servers to access additional tools in realtime apps

484- [Realtime transcription](https://developers.openai.com/api/docs/guides/realtime-transcription): How to transcribe audio with the Realtime API

485- [Voice agents](https://developers.openai.com/api/docs/guides/voice-agents): A guide for building voice agents with the Agents SDK

guides/realtime-server-controls.md +2 −2

Details

69 69

70In this way, you are able to add tools, monitor sessions, and carry out business logic on the server instead of needing to configure those actions on the client.70In this way, you are able to add tools, monitor sessions, and carry out business logic on the server instead of needing to configure those actions on the client.

71 71

~~72### With SIP~~72## With SIP

73 73

741. A user connects to OpenAI via phone over SIP.741. A user connects to OpenAI via phone over SIP.

~~752. OpenAI sends a webhook to your application’s backend webhook URL, notifying your app of the state of the session. The webhook will look something like:~~752. OpenAI sends a webhook to your application’s server webhook URL, notifying your app of the state of the session. The webhook will look something like:

76 76

77```json77```json

78POST https://my_website.com/webhook_endpoint78POST https://my_website.com/webhook_endpoint

guides/realtime-sip.md +2 −2

Details

217call_accept = {217call_accept = {

218 "type": "realtime",218 "type": "realtime",

219 "instructions": "You are a support agent.",219 "instructions": "You are a support agent.",

220 "model": "gpt-realtime",220 "model": "gpt-realtime-2",

221}221}

222 222

223response_create = {223response_create = {

282 282

283Now that you've connected over SIP, use the left navigation or click into these pages to start building your realtime application.283Now that you've connected over SIP, use the left navigation or click into these pages to start building your realtime application.

284 284

285- [Using realtime models](https://developers.openai.com/api/docs/guides/realtime-models-prompting)285- [Realtime prompting guide](https://developers.openai.com/api/docs/guides/realtime-models-prompting)

286- [Managing conversations](https://developers.openai.com/api/docs/guides/realtime-conversations)286- [Managing conversations](https://developers.openai.com/api/docs/guides/realtime-conversations)

287- [Webhooks and server-side controls](https://developers.openai.com/api/docs/guides/realtime-server-controls)287- [Webhooks and server-side controls](https://developers.openai.com/api/docs/guides/realtime-server-controls)

288- [Managing costs](https://developers.openai.com/api/docs/guides/realtime-costs)288- [Managing costs](https://developers.openai.com/api/docs/guides/realtime-costs)

guides/realtime-transcription.md +190 −100

Details

1# Realtime transcription1# Realtime transcription

2 2

~~3You can use the Realtime API for transcription-only use cases, either with input from a microphone or from a file. For example, you can use it to generate subtitles or transcripts in real-time.~~3import {

~~4With the transcription-only mode, the model will not generate responses.~~4 Bolt,

5 5 Cube,

~~6If you want the model to produce responses, you can use the Realtime API in~~6 Desktop,

~~7 [speech-to-speech conversation mode](https://developers.openai.com/api/docs/guides/realtime-conversations).~~7 Phone,

8 8} from "@components/react/oai/platform/ui/Icon.react";

~~9## Realtime transcription sessions~~9

10 10Use realtime transcription when your application needs live speech-to-text without a spoken assistant response. Realtime transcription sessions stream transcript deltas as audio arrives, so users can see text before the full utterance is complete.

11To use the Realtime API for transcription, you need to create a transcription session, connecting via [WebSockets](https://developers.openai.com/api/docs/guides/realtime?use-case=transcription#connect-with-websockets) or [WebRTC](https://developers.openai.com/api/docs/guides/realtime?use-case=transcription#connect-with-webrtc).11

12 12For the lowest-latency streaming transcription path, use [`gpt-realtime-whisper`](https://developers.openai.com/api/docs/models/gpt-realtime-whisper). For offline files or workflows that don't need streaming deltas, use the standard speech-to-text models in the Audio API.

~~13Unlike the regular Realtime API sessions for conversations, the transcription sessions typically don't contain responses from the model.~~13

14 14## Choose a transcription model

~~15The transcription session object uses the same base session shape, but it always has a `type` of `"transcription"`:~~15

16<table>

17 <thead>

18 <tr>

19 <th>Model</th>

20 <th>Best for</th>

21 <th>Notes</th>

22 </tr>

23 </thead>

24 <tbody>

25 <tr>

26 <td className="whitespace-nowrap">

27 <a href="/api/docs/models/gpt-realtime-whisper">

28 gpt-realtime-whisper

29 </a>

30 </td>

31 <td>Live audio, transcript deltas, tunable latency.</td>

32 <td>Natively streaming and designed for realtime sessions.</td>

33 </tr>

34 <tr>

35 <td className="whitespace-nowrap">

36 <a href="/api/docs/models/gpt-4o-transcribe">gpt-4o-transcribe</a>

37 </td>

38 <td>Higher-accuracy speech-to-text where streaming isn't required.</td>

39 <td>Use for file and request-response transcription workflows.</td>

40 </tr>

41 <tr>

42 <td className="whitespace-nowrap">

43 <a href="/api/docs/models/gpt-4o-mini-transcribe">

44 gpt-4o-mini-transcribe

45 </a>

46 </td>

47 <td>Lower-cost transcription.</td>

48 <td>Use when cost matters more than top accuracy.</td>

49 </tr>

50 <tr>

51 <td className="whitespace-nowrap">

52 <a href="/api/docs/models/whisper-1">whisper-1</a>

53 </td>

54 <td>Existing Whisper integrations.</td>

55 <td>

56 Not natively streaming in the same way as{" "}

57 <code>gpt-realtime-whisper</code>.

58 </td>

59 </tr>

60 </tbody>

61</table>

63`gpt-realtime-whisper` is an alternative for live transcription, not a blanket replacement for every transcription model. Test it against your audio, languages, vocabulary, and latency requirements before switching production traffic.

65## Create a transcription session

67Realtime transcription uses a session with `type: "transcription"`. You can connect with [WebSocket](https://developers.openai.com/api/docs/guides/realtime-websocket) for server-side audio pipelines or [WebRTC](https://developers.openai.com/api/docs/guides/realtime-webrtc) for browser audio.

16 68

17```json69```json

18{70{

~~19 "object": "realtime.session",~~71 "type": "session.update",

72 "session": {

20 "type": "transcription",73 "type": "transcription",

~~21 "id": "session_abc123",~~

22 "audio": {74 "audio": {

23 "input": {75 "input": {

24 "format": {76 "format": {

25 "type": "audio/pcm",77 "type": "audio/pcm",

26 "rate": 2400078 "rate": 24000

27 },79 },

~~28 "noise_reduction": {~~

~~29 "type": "near_field"~~

~~30 },~~

31 "transcription": {80 "transcription": {

~~32 "model": "gpt-4o-transcribe",~~81 "model": "gpt-realtime-whisper",

~~33 "prompt": "",~~

34 "language": "en"82 "language": "en"

35 },83 },

36 "turn_detection": {84 "turn_detection": {

40 "silence_duration_ms": 50088 "silence_duration_ms": 500

41 }89 }

42 }90 }

~~43 },~~91 }

~~44 "include": ["item.input_audio_transcription.logprobs"]~~92 }

45}93}

46```94```
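
As a sketch, the same payload can be built in Python before sending it over your connection. The helper name `build_transcription_session_update` is hypothetical; the field names mirror the JSON above. Passing `turn_detection=None` serializes to `null`, which corresponds to committing audio manually.

```python
import json
from typing import Optional


def build_transcription_session_update(
    model: str = "gpt-realtime-whisper",
    language: Optional[str] = "en",
    turn_detection: Optional[dict] = None,
) -> dict:
    """Build a `session.update` event for a transcription-only session."""
    transcription = {"model": model}
    if language:
        transcription["language"] = language  # optional language hint
    return {
        "type": "session.update",
        "session": {
            "type": "transcription",
            "audio": {
                "input": {
                    "format": {"type": "audio/pcm", "rate": 24000},
                    "transcription": transcription,
                    # None becomes JSON null: you commit the buffer yourself.
                    "turn_detection": turn_detection,
                }
            },
        },
    }


# Serialize to the JSON text you would send on the connection.
payload = json.dumps(build_transcription_session_update())
```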

47 95

48### Session fields96### Session fields

49 97

~~50- `type`: Always `transcription` for realtime transcription sessions.~~98- `type`: Set to `transcription` for transcription-only sessions.

~~51- `audio.input.format`: Input encoding for audio that you append to the buffer. Supported types are:~~99- `audio.input.format`: Input encoding for audio appended to the buffer. Use 24 kHz mono PCM when sending `audio/pcm`.

~~52 - `audio/pcm` (24 kHz mono PCM; only a `rate` of `24000` is supported).~~100- `audio.input.transcription.model`: Use `gpt-realtime-whisper` for streaming transcription.

~~53 - `audio/pcmu` (G.711 μ-law).~~101- `audio.input.transcription.language`: Optional language hint such as `en`.

~~54 - `audio/pcma` (G.711 A-law).~~102- `audio.input.turn_detection`: Optional voice activity detection. Set it to `null` if you want to commit audio manually.

~~55- `audio.input.noise_reduction`: Optional noise reduction that runs before VAD and turn detection. Use `{ "type": "near_field" }`, `{ "type": "far_field" }`, or `null` to disable.~~103

~~56- `audio.input.transcription`: Optional asynchronous transcription of input audio. Supply:~~104## Stream audio

~~57 - `model`: One of `whisper-1`, `gpt-4o-transcribe-latest`, `gpt-4o-mini-transcribe`, or `gpt-4o-transcribe`.~~105

~~58 - `language`: ISO-639-1 code such as `en`.~~106Send audio chunks with `input_audio_buffer.append`:

~~59 - `prompt`: Prompt text or keyword list (model-dependent) that guides the transcription output.~~107

60- `audio.input.turn_detection`: Optional automatic voice activity detection (VAD). Set to `null` to manage turn boundaries manually. For `server_vad`, you can tune `threshold`, `prefix_padding_ms`, `silence_duration_ms`, `interrupt_response`, `create_response`, and `idle_timeout_ms`. For `semantic_vad`, configure `eagerness`, `interrupt_response`, and `create_response`.108```javascript

~~61- `include`: Optional list of additional fields to stream back on events (for example `item.input_audio_transcription.logprobs`).~~109ws.send(

110 JSON.stringify({

111 type: "input_audio_buffer.append",

112 audio: base64Pcm16,

113 })

114);

115```

116

117If you disable turn detection, commit the buffer when you want transcription to begin:

118

119```javascript

120ws.send(

121 JSON.stringify({

122 type: "input_audio_buffer.commit",

123 })

124);

125```

126

127With server VAD enabled, the session commits audio automatically when it detects a turn boundary.
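As a rough sketch of the append-and-commit flow above, the following Python helpers split raw audio into `input_audio_buffer.append` payloads. They assume 16-bit mono PCM at 24 kHz. The helper names and the 100 ms chunk size are illustrative choices, not values taken from the API reference.

```python
import base64
import json

SAMPLE_RATE_HZ = 24_000  # 24 kHz mono PCM, as described in this guide
BYTES_PER_SAMPLE = 2     # assumption: 16-bit (PCM16) samples


def append_events(pcm16: bytes, chunk_ms: int = 100):
    """Yield one JSON-encoded input_audio_buffer.append event per chunk."""
    chunk_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(pcm16), chunk_bytes):
        chunk = pcm16[start:start + chunk_bytes]
        yield json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        })


def commit_event() -> str:
    """Build the event that commits the buffer when turn detection is off."""
    return json.dumps({"type": "input_audio_buffer.commit"})
```

Send each yielded string over the open socket. Call `commit_event()` only if you disabled turn detection.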

62 128

~~63You can find more information about the transcription session object in the [API reference](https://developers.openai.com/api/docs/api-reference/realtime-sessions/transcription_session_object).~~129## Handle transcript events

64 130

~~65## Handling transcriptions~~131Listen for incremental transcript deltas and completion events:

66 132

~~67When using the Realtime API for transcription, you can listen for the `conversation.item.input_audio_transcription.delta` and `conversation.item.input_audio_transcription.completed` events.~~133```javascript

134ws.on("message", (data) => {

135 const event = JSON.parse(data);

68 136

69For `whisper-1` the `delta` event will contain full turn transcript, same as `completed` event. For `gpt-4o-transcribe` and `gpt-4o-mini-transcribe` the `delta` event will contain incremental transcripts as they are streamed out from the model.137 if (event.type === "conversation.item.input_audio_transcription.delta") {

138 process.stdout.write(event.delta);

139 }

70 140

~~71Here is an example transcription delta event:~~141 if (event.type === "conversation.item.input_audio_transcription.completed") {

142 console.log("\nFinal transcript:", event.transcript);

143 }

144});

145```

146

147A delta event contains newly available transcript text:

72 148

73```json149```json

74{150{

~~75 "event_id": "event_2122",~~

76 "type": "conversation.item.input_audio_transcription.delta",151 "type": "conversation.item.input_audio_transcription.delta",

77 "item_id": "item_003",152 "item_id": "item_003",

78 "content_index": 0,153 "content_index": 0,

80}155}

81```156```

82 157

~~83Here is an example transcription completion event:~~158A completion event contains the final transcript for the committed item:

84 159

85```json160```json

86{161{

~~87 "event_id": "event_2122",~~

88 "type": "conversation.item.input_audio_transcription.completed",162 "type": "conversation.item.input_audio_transcription.completed",

89 "item_id": "item_003",163 "item_id": "item_003",

90 "content_index": 0,164 "content_index": 0,

92}166}

93```167```

94 168

95Note that ordering between completion events from different speech turns is not guaranteed. You should use `item_id` to match these events to the `input_audio_buffer.committed` events and use `input_audio_buffer.committed.previous_item_id` to handle the ordering.169Ordering between completion events from different speech turns isn't guaranteed. Use `item_id` to match transcription events to committed input items.
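One way to restore turn order is to buffer completed transcripts until every earlier item has arrived. The sketch below is illustrative only: the class name `TranscriptOrderer` is hypothetical, and it assumes the first `input_audio_buffer.committed` event carries a null `previous_item_id`.

```python
class TranscriptOrderer:
    """Release completed transcripts in the order their audio was committed."""

    def __init__(self):
        self._next = {}            # previous_item_id -> item_id
        self._done = {}            # item_id -> final transcript
        self._last_emitted = None  # item_id released most recently

    def handle(self, event: dict) -> list:
        """Feed one server event; return transcripts that are now in order."""
        kind = event["type"]
        if kind == "input_audio_buffer.committed":
            self._next[event.get("previous_item_id")] = event["item_id"]
        elif kind == "conversation.item.input_audio_transcription.completed":
            self._done[event["item_id"]] = event["transcript"]

        ready = []
        while True:
            item_id = self._next.get(self._last_emitted)
            if item_id is None or item_id not in self._done:
                break  # the next item in the chain has not completed yet
            ready.append(self._done.pop(item_id))
            self._last_emitted = item_id
        return ready
```

Pass every parsed server event to `handle` and render whatever it returns. A transcript that completes early stays buffered until the turns before it finish.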

~~97To send audio data to the transcription session, you can use the `input_audio_buffer.append` event.~~

~~99You have 2 options:~~

~~100~~

101- Use a streaming microphone input

102- Stream data from a wav file

~~103~~

104{/*

~~105~~

106### Using microphone input

~~107~~

~~108~~

~~109~~

110<div data-content-switcher-pane data-value="js">

111 <div class="hidden">ws module (Node.js)</div>

112 </div>

113 <div data-content-switcher-pane data-value="python" hidden>

114 <div class="hidden">websocket-client (Python)</div>

115 </div>

~~116~~

~~117~~

~~118~~

119### Using file input

~~120~~

121 170

171## Tune latency and accuracy

122 172

123<div data-content-switcher-pane data-value="js">173Streaming transcription trades latency for transcript quality. Lower delay settings can produce earlier partial text. Higher delay settings give the model more audio context before emitting text and can improve word error rate.

124 <div class="hidden">ws module (Node.js)</div>

125 </div>

126 <div data-content-switcher-pane data-value="python" hidden>

127 <div class="hidden">websocket-client (Python)</div>

128 </div>

129 174

175Start by testing a few delay targets against your real audio. Useful evaluation points are:

130 176

131*/}177- 0.4 seconds for the most latency-sensitive interactions;

132## Voice activity detection178- 0.8 to 1.2 seconds for balanced live captions;

179- 1.5 to 2.0 seconds when accuracy matters more than immediate display;

180- 3.0 seconds for workflows that can tolerate more delay.

133 181

134The Realtime API supports automatic voice activity detection (VAD). Enabled by default, VAD will control when the input audio buffer is committed, therefore when transcription begins.182Don't choose a setting from synthetic audio alone. Test with representative microphones, telephony audio, accents, background noise, code-switching, domain vocabulary, and long sessions.
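To compare delay targets on your own recordings, you need a consistent accuracy metric. The sketch below computes word error rate as word-level edit distance divided by reference length. The function names are hypothetical, and the normalization (lowercasing and whitespace splitting) is a simplifying assumption you may want to replace.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] is the edit distance between the processed reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


def wer_by_delay(reference: str, hypotheses: dict) -> dict:
    """Map each delay target (seconds) to the WER of its transcript."""
    return {
        delay: word_error_rate(reference, text)
        for delay, text in hypotheses.items()
    }
```

Run `wer_by_delay` over a set of human-verified transcripts for each delay target, then weigh the results against the latency you measured.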

135 183

136Read more about configuring VAD in our [Voice Activity Detection](https://developers.openai.com/api/docs/guides/realtime-vad) guide.184## Guide vocabulary and domain terms

137 185

138You can also disable VAD by setting the `audio.input.turn_detection` property to `null`, and control when to commit the input audio on your end.186If your application depends on exact domain vocabulary, include a language hint and test whether your model and endpoint support prompt or keyword steering before relying on it. Where supported, use short keyword lists rather than long instructions.

139 187

140## Additional configurations188Example keyword style:

141 189

142### Noise reduction190```text

~~143~~ 191Keywords: metoprolol, atorvastatin, A1C, systolic, diastolic

144Use the `audio.input.noise_reduction` property to configure how to handle noise reduction in the audio stream.192```

145 193

146- `{ "type": "near_field" }`: Use near-field noise reduction (default).194For production, treat keyword steering as an aid rather than a guarantee. Continue to evaluate names, numbers, dates, medication names, product names, artist names, and other high-value entities manually.

147- `{ "type": "far_field" }`: Use far-field noise reduction.

148- `null`: Disable noise reduction.

149 195

150### Using logprobs196## Handle confidence, timestamps, and diarization

151 197

152You can use the `include` property to include logprobs in the transcription events, using `item.input_audio_transcription.logprobs`.198Only request optional fields that your selected model and endpoint support. If your application needs confidence scoring, timestamps, or diarization, verify support before launch and add fallbacks for fields that aren't available.

153 199

154Those logprobs can be used to calculate the confidence score of the transcription.200When log probabilities are available, request them with `include`:

155 201

156```json202```json

157{203{

158 "type": "session.update",204 "type": "session.update",

159 "session": {205 "session": {

206 "type": "transcription",

160 "audio": {207 "audio": {

161 "input": {208 "input": {

162 "format": {

163 "type": "audio/pcm",

164 "rate": 24000

165 },

166 "transcription": {209 "transcription": {

167 "model": "gpt-4o-transcribe"210 "model": "gpt-realtime-whisper"

168 },

169 "turn_detection": {

170 "type": "server_vad",

171 "threshold": 0.5,

172 "prefix_padding_ms": 300,

173 "silence_duration_ms": 500

174 }211 }

175 }212 }

176 },213 },

178 }215 }

179}216}

180```217```
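When log probabilities are returned, one simple confidence estimate is the geometric mean of the token probabilities. This is a sketch under assumptions: it treats the `logprobs` field as a list of objects that each carry a numeric `logprob`, and the function name is hypothetical. Verify the actual event shape for your model before relying on it.

```python
import math


def transcript_confidence(logprobs: list):
    """Return the geometric-mean token probability, or None without data.

    Assumes each entry looks like {"token": "...", "logprob": -0.12}.
    """
    values = [entry["logprob"] for entry in logprobs]
    if not values:
        return None  # field unavailable: let the caller use its fallback
    return math.exp(sum(values) / len(values))
```

The `None` result gives your application an explicit signal to use its fallback when confidence data isn't available.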

218

219## Production checklist

220

221- Pick a target latency and accuracy threshold before tuning.

222- Test against real production audio, not only clean samples.

223- Test each target language.

224- Include numbers, dates, currency, email addresses, product names, and domain terms in your eval set.

225- Track empty, truncated, and delayed transcripts apart from word error rate.

226- Decide how your UI should revise partial text when later deltas correct earlier text.

227- Use `item_id` to order and reconcile final transcripts.

228- Keep a fallback path for unsupported timestamps, diarization, or confidence fields.

229

230## Related guides

231

232<a href="/api/docs/guides/realtime">

233

234

235<span slot="icon">

236 </span>

237 Compare voice-agent, translation, and transcription sessions.

238

239

240</a>

241

242<a href="/api/docs/guides/realtime-translation">

243

244

245<span slot="icon">

246 </span>

247 Translate live speech with a dedicated translation session.

248

249

250</a>

251

252<a href="/api/docs/guides/realtime-websocket">

253

254

255<span slot="icon">

256 </span>

257 Stream raw audio through a server-side media pipeline.

258

259

260</a>

261

262<a href="/api/docs/guides/realtime-vad">

263

264

265<span slot="icon">

266 </span>

267 Configure turn detection for live audio streams.

268

269

270</a>

guides/realtime-translation.md +287 −0 created

Details

1# Realtime translation

3import {

4 Bolt,

5 Cube,

6 Desktop,

7 Phone,

8} from "@components/react/oai/platform/ui/Icon.react";

11Realtime translation lets you stream source audio into a dedicated translation session and receive translated audio plus transcript deltas while the speaker is still talking. Use it for live interpretation, multilingual calls, broadcasts, meetings, lessons, and video rooms.

13Use [`gpt-realtime-translate`](https://developers.openai.com/api/docs/models/gpt-realtime-translate) when your application should translate what a human says. If you need an assistant that answers questions, calls tools, and manages a conversation, use [`gpt-realtime-2`](https://developers.openai.com/api/docs/models/gpt-realtime-2) with a standard Realtime session instead.

15## How translation sessions differ

17Realtime translation sessions use a different architecture from voice-agent sessions:

19| Voice-agent session | Translation session |

20| ------------------------------------------- | ------------------------------------------------ |

21| Connects to `/v1/realtime`. | Connects to `/v1/realtime/translations`. |

22| The model acts as an assistant. | The model acts as an interpreter. |

23| Uses a conversation and response lifecycle. | Streams continuously from incoming audio. |

24| May call tools and produce assistant turns. | Produces translated audio and transcript deltas. |

25| You can call `response.create`. | You don't call `response.create`. |

27Translation starts from the audio stream itself. Keep appending audio, including silence between phrases, and handle output events as they arrive.

29## Choose a transport

31Use WebRTC when the browser captures or plays audio. WebRTC sends source audio as a media track and receives translated speech as a remote audio track, so you don't need to manually resample or play PCM chunks.

33Use WebSockets when your server already receives raw audio, such as Twilio Media Streams, SIP media, broadcast ingest, or a media worker. With WebSockets, send base64-encoded 24 kHz PCM16 audio and play returned audio deltas yourself.

35## Create a browser WebRTC session

37For browser apps, create a short-lived client secret on your server. Don't expose your standard API key in the browser.

39In the browser, capture audio, create a peer connection, and post the SDP offer to the translation calls endpoint:

41## Create a WebSocket session

43Connect to the dedicated translation endpoint and select the model in the URL:

45Install the `ws` package for Node.js or the `websocket-client` package for Python before running this example.

47Connect to a translation session

```javascript
import WebSocket from "ws";

// Select the translation model in the URL query string.
const ws = new WebSocket(
  "wss://api.openai.com/v1/realtime/translations?model=gpt-realtime-translate",
  {
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "OpenAI-Safety-Identifier": "hashed-user-id",
    },
  }
);
```

```python
import os
import websocket

# Select the translation model in the URL query string.
ws = websocket.WebSocket()
ws.connect(
    "wss://api.openai.com/v1/realtime/translations?model=gpt-realtime-translate",
    header=[
        f"Authorization: Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Safety-Identifier: hashed-user-id",
    ],
)
```

78Configure the target language after the socket opens:

80Configure the target language

```javascript
// Wait for the socket to open before sending the session configuration.
ws.on("open", () => {
  ws.send(
    JSON.stringify({
      type: "session.update",
      session: {
        audio: {
          output: {
            language: "es",
          },
        },
      },
    })
  );
});
```

```python
import json

# ws.connect() has already returned, so the socket is open here.
ws.send(
    json.dumps(
        {
            "type": "session.update",
            "session": {
                "audio": {
                    "output": {
                        "language": "es",
                    },
                },
            },
        }
    )
)
```

117

118

119Then append audio continuously:

120

121Append source audio

122

```javascript
// base64Pcm16 is a base64-encoded chunk of 24 kHz PCM16 audio.
ws.send(
  JSON.stringify({
    type: "session.input_audio_buffer.append",
    audio: base64Pcm16,
  })
);
```

131

```python
# base64_pcm16 is a base64-encoded chunk of 24 kHz PCM16 audio.
ws.send(
    json.dumps(
        {
            "type": "session.input_audio_buffer.append",
            "audio": base64_pcm16,
        }
    )
)
```

142

143

144Listen for translated audio and transcripts:

145


147

```javascript
// playPcm16 and updateSourceTranscript are application-defined helpers.
ws.on("message", (data) => {
  const event = JSON.parse(data);

  // Translated speech, delivered as audio deltas.
  if (event.type === "session.output_audio.delta") {
    playPcm16(event.delta);
  }

  // Transcript of the translated output.
  if (event.type === "session.output_transcript.delta") {
    process.stdout.write(event.delta);
  }

  // Transcript of the source speech.
  if (event.type === "session.input_transcript.delta") {
    updateSourceTranscript(event.delta);
  }
});
```

165

```python
# play_pcm16 and update_source_transcript are application-defined helpers.
while True:
    event = json.loads(ws.recv())

    # Translated speech, delivered as audio deltas.
    if event["type"] == "session.output_audio.delta":
        play_pcm16(event["delta"])

    # Transcript of the translated output.
    if event["type"] == "session.output_transcript.delta":
        print(event["delta"], end="", flush=True)

    # Transcript of the source speech.
    if event["type"] == "session.input_transcript.delta":
        update_source_transcript(event["delta"])
```

179

180

181## Build listen-along translation

182

183Use listen-along translation when one source speaker or stream needs translated audio for an audience. Examples include livestreams, conference talks, webinars, earnings calls, lectures, and videos.

184

185The typical architecture is:

186

187```text

188source audio -> translation session -> translated audio + subtitles

189```

190

191Create one translation session for each target language. If the same English source needs Spanish and French output, create one English-to-Spanish session and one English-to-French session.

192

193For browser listen-along apps, capture tab audio with `getDisplayMedia()`, send it over WebRTC, and play the remote translated audio track. For production broadcasts, run translation in a server media worker and publish translated audio tracks or captions to listeners.

194

195## Build conversational translation

196

197Use conversational translation when two or more participants speak across languages. Examples include support calls, sales calls, tutoring, and video rooms.

198

199Keep participant audio tracks separate. Mixing speakers into one stream makes speaker identity, speaker captions, and overlapping speech more difficult to handle.

200

201For a two-person call, create one translation session per direction:

202

203```text

204Caller A audio -> translate into Caller B language -> play to Caller B

205Caller B audio -> translate into Caller A language -> play to Caller A

206```

207

208For group rooms, session count depends on active speakers and target languages:

209

210```text

211translation sessions ~= active source speaker tracks x distinct target languages

212```

213

214For small rooms, each listener can create browser-side translation sidecars for the remote speakers they want translated. For larger rooms, use a server-side participant or media worker that subscribes to each source speaker once, creates one translation session per target language, and republishes translated tracks.
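The session-count estimate above can be sketched as a small planning function. The function name and data shapes are hypothetical. Skipping a target language that matches the speaker's own language is an added assumption, which is one reason the formula above is only approximate.

```python
def plan_sessions(speakers: dict, listeners: dict) -> list:
    """List the (source track, target language) pairs that need a session.

    speakers maps track id -> source language code.
    listeners maps listener id -> preferred language code.
    """
    target_languages = set(listeners.values())
    return sorted(
        (track, target)
        for track, source in speakers.items()
        for target in target_languages
        if target != source  # assumption: no session for same-language output
    )
```

For a room with an English speaker, a Spanish speaker, and listeners who want English, Spanish, and French, this yields four sessions rather than the six that the raw product suggests.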

215

216## Test quality and latency

217

218Test translation with real audio and bilingual review. Automated metrics can help, but they won't catch every error users notice.

219

220Test:

221

222- language-pair quality;

223- names, numbers, dates, currency, and phone numbers;

224- domain-specific terminology;

225- code-switching and mixed-language conversation;

226- accents, fast speech, and overlapping speech;

227- first translated audio latency;

228- end-of-utterance latency;

229- subtitle timing;

230- voice consistency;

231- reconnect behavior.

232

233If your use case depends on exact names or domain terms, build a golden set before launch and review failures manually.

234

235## Production checklist

236

237- Choose WebRTC for browser media and WebSockets for server media.

238- Use the dedicated `/v1/realtime/translations` endpoint.

239- Stream audio continuously, including silence between phrases.

240- Keep speaker tracks separate for conversational translation.

241- Use one session per output language.

242- Render both source and target transcripts when useful.

243- Expose controls for original audio, translated audio, subtitles, mute, and volume.

244- Surface reconnecting, delayed, and unavailable states.

245- Track latency apart from translation quality.

246

247## Related guides

248

249<a href="/api/docs/guides/realtime">

250

251

252<span slot="icon">

253 </span>

254 Compare voice-agent, translation, and transcription sessions.

255

256

257</a>

258

259<a href="/api/docs/guides/realtime-webrtc">

260

261

262<span slot="icon">

263 </span>

264 Connect browser media to a realtime session.

265

266

267</a>

268

269<a href="/api/docs/guides/realtime-websocket">

270

271

272<span slot="icon">

273 </span>

274 Stream raw audio through a server-side media pipeline.

275

276

277</a>

278

279<a href="/api/docs/guides/realtime-transcription">

280

281

282<span slot="icon">

283 </span>

284 Stream transcript deltas from live audio.

285

286

287</a>

guides/realtime-webrtc.md +13 −0

Details

52 method: "POST",52 method: "POST",

53 headers: {53 headers: {

54 Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,54 Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,

55 "OpenAI-Safety-Identifier": "hashed-user-id",

55 },56 },

56 body: fd,57 body: fd,

57 });58 });

67app.listen(3000);68app.listen(3000);

68```69```

69 70

71If your application assigns a [safety identifier](https://developers.openai.com/api/docs/guides/safety-best-practices#implement-safety-identifiers)

72for each end user, include it as the `OpenAI-Safety-Identifier` header in this

73server-side request. Use a stable, privacy-preserving value, such as a hashed

74internal user ID. The header should be set by your trusted backend, not by the

75browser.

70#### Connecting to the server77#### Connecting to the server

71 78

72In the browser, you can use standard WebRTC APIs to connect to the Realtime API via your application server. The client directly POSTs its SDP data to your server.79In the browser, you can use standard WebRTC APIs to connect to the Realtime API via your application server. The client directly POSTs its SDP data to your server.

152 headers: {159 headers: {

153 Authorization: `Bearer ${apiKey}`,160 Authorization: `Bearer ${apiKey}`,

154 "Content-Type": "application/json",161 "Content-Type": "application/json",

162 "OpenAI-Safety-Identifier": "hashed-user-id",

155 },163 },

156 body: sessionConfig,164 body: sessionConfig,

157 }165 }

170 178

171You can create a server endpoint like this one on any platform that can send and receive HTTP requests. Just ensure that **you only use standard OpenAI API keys on the server, not in the browser.**179You can create a server endpoint like this one on any platform that can send and receive HTTP requests. Just ensure that **you only use standard OpenAI API keys on the server, not in the browser.**

172 180

181When using ephemeral tokens, set `OpenAI-Safety-Identifier` on the server-side

182request that creates the client secret. The Realtime API binds the identifier to

183the resulting ephemeral token, so the browser does not need to send the safety

184identifier when it later connects with that token.

185

173#### Connecting to the server186#### Connecting to the server

174 187

175In the browser, you can use standard WebRTC APIs to connect to the Realtime API with an ephemeral token. The client first fetches a token from your server endpoint, and then POSTs its SDP data (with the ephemeral token) to the Realtime API.188In the browser, you can use standard WebRTC APIs to connect to the Realtime API with an ephemeral token. The client first fetches a token from your server endpoint, and then POSTs its SDP data (with the ephemeral token) to the Realtime API.

guides/realtime-websocket.md +10 −5

Details

8 8

9## Connect via WebSocket9## Connect via WebSocket

10 10

11Below are several examples of connecting via WebSocket to the Realtime API. In addition to using the WebSocket URL below, you will also need to pass an authentication header using your OpenAI API key.11Below are several examples of connecting via WebSocket to the Realtime API. In addition to using the WebSocket URL below, you will also need to pass an authentication header using your OpenAI API key. If your application assigns [safety identifiers](https://developers.openai.com/api/docs/guides/safety-best-practices#implement-safety-identifiers), pass the stable, privacy-preserving identifier for the end user in the `OpenAI-Safety-Identifier` header.

12 12

13It is possible to use WebSocket in browsers with an ephemeral API token as shown in the [WebRTC connection guide](https://developers.openai.com/api/docs/guides/realtime-webrtc), but if you are connecting from a client like a browser or mobile app, WebRTC will be a more robust solution in most cases.13It is possible to use WebSocket in browsers with an ephemeral API token as shown in the [WebRTC connection guide](https://developers.openai.com/api/docs/guides/realtime-webrtc), but if you are connecting from a client like a browser or mobile app, WebRTC will be a more robust solution in most cases.

14 14

21```javascript21```javascript

22import WebSocket from "ws";22import WebSocket from "ws";

23 23

~~24const url = "wss://api.openai.com/v1/realtime?model=gpt-realtime";~~24const url = "wss://api.openai.com/v1/realtime?model=gpt-realtime-2";

25const ws = new WebSocket(url, {25const ws = new WebSocket(url, {

26 headers: {26 headers: {

27 Authorization: "Bearer " + process.env.OPENAI_API_KEY,27 Authorization: "Bearer " + process.env.OPENAI_API_KEY,

28 "OpenAI-Safety-Identifier": "hashed-user-id",

28 },29 },

29});30});

30 31

52 53

53OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")54OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")

54 55

~~55url = "wss://api.openai.com/v1/realtime?model=gpt-realtime"~~56url = "wss://api.openai.com/v1/realtime?model=gpt-realtime-2"

~~56headers = ["Authorization: Bearer " + OPENAI_API_KEY]~~57headers = [

58 "Authorization: Bearer " + OPENAI_API_KEY,

59 "OpenAI-Safety-Identifier: hashed-user-id",

60]

57 61

58 62

59def on_open(ws):63def on_open(ws):

89*/93*/

90 94

91const ws = new WebSocket(95const ws = new WebSocket(

~~92 "wss://api.openai.com/v1/realtime?model=gpt-realtime",~~96 "wss://api.openai.com/v1/realtime?model=gpt-realtime-2",

93 [97 [

94 "realtime",98 "realtime",

95 // Auth99 // Auth

126const ws = new WebSocket(url, {130const ws = new WebSocket(url, {

127 headers: {131 headers: {

128 Authorization: "Bearer " + process.env.OPENAI_API_KEY,132 Authorization: "Bearer " + process.env.OPENAI_API_KEY,

133 "OpenAI-Safety-Identifier": "hashed-user-id",

129 },134 },

130});135});

131 136

guides/safety-best-practices.md +15 −1

Details

77 77

78A safety identifier should be a string that uniquely identifies each user. Hash the username or email address in order to avoid sending us any identifying information. If you offer a preview of your product to non-logged in users, you can send a session ID instead.78A safety identifier should be a string that uniquely identifies each user. Hash the username or email address in order to avoid sending us any identifying information. If you offer a preview of your product to non-logged in users, you can send a session ID instead.
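A minimal sketch of the hashing step, assuming you key the hash with a server-side secret. The guidance above only says to hash the value; using HMAC-SHA256 is a substitution made here so that identifiers are harder to reverse by guessing email addresses. The function name and the normalization are illustrative.

```python
import hashlib
import hmac


def safety_identifier(user_id: str, secret: bytes) -> str:
    """Derive a stable, non-reversible identifier for one end user.

    Hypothetical helper: keep `secret` on your server and reuse it so the
    same user always maps to the same identifier.
    """
    normalized = user_id.strip().lower().encode("utf-8")
    return hmac.new(secret, normalized, hashlib.sha256).hexdigest()
```

Compute the identifier on your backend and send the resulting hex string, never the raw username or email address.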

79 79

~~80Include safety identifiers in your API requests with the `safety_identifier` parameter:~~80Safety identifiers are recommended for products where individual users interact

81with a model, but they are not required. Include safety identifiers in your API

82requests with the `safety_identifier` parameter:

84For Realtime API requests, provide the same stable, privacy-preserving identifier

85with the `OpenAI-Safety-Identifier` header. When you create an ephemeral Realtime

86client secret, include the header on the server-side request that creates the

87secret so the identifier is bound to that session. For direct WebSocket or WebRTC

88connection requests made from a trusted backend, include the header on the

89connection request.

91Safety identifiers do not carry over between APIs or sessions. If your

92application already sends `safety_identifier` with Responses API requests, pass

93the same stable value separately when you create or connect each Realtime

94session.

guides/safety-checks.md +22 −2

Details

50`.trim(),50`.trim(),

51};51};

52 52

53export const snippetExampleProvidingUserIdentifierRealtime = {

54 curl: `

55curl https://api.openai.com/v1/realtime/client_secrets \\

56-H "Content-Type: application/json" \\

57-H "Authorization: Bearer $OPENAI_API_KEY" \\

58-H "OpenAI-Safety-Identifier: user_123456" \\

59-d '{

60"session": {

61"type": "realtime",

62"model": "gpt-realtime-2"

63}

64}'

65`.trim(),

66};

53We run several types of evaluations on our models and how they're being used. This guide covers how we test for safety and what you can do to avoid violations.68We run several types of evaluations on our models and how they're being used. This guide covers how we test for safety and what you can do to avoid violations.

54 69

55## Safety classifiers for GPT-5 and forward70## Safety classifiers for GPT-5 and forward

66 81

67If your org engages in suspicious activity that violates our safety policies, we may return an error, limit model access, or even block your account. The following safety measures help us identify where high-risk requests are coming from and block individual end users, rather than blocking your entire org.82If your org engages in suspicious activity that violates our safety policies, we may return an error, limit model access, or even block your account. The following safety measures help us identify where high-risk requests are coming from and block individual end users, rather than blocking your entire org.

68 83

~~69- [Implement safety identifiers](https://developers.openai.com/api/docs/guides/safety-best-practices#implement-safety-identifiers) using the `safety_identifier` parameter in your API requests.~~84- [Implement safety identifiers](https://developers.openai.com/api/docs/guides/safety-best-practices#implement-safety-identifiers) for products where individual users interact with a model. Safety identifiers are recommended but not required.

70- If your use case depends on accessing a less restricted version of our services in order to engage in beneficial applications across the life sciences, read about our [special access program](https://help.openai.com/en/articles/11826767-life-science-research-special-access-program) to see if you meet criteria.85- If your use case depends on accessing a less restricted version of our services in order to engage in beneficial applications across the life sciences, read about our [special access program](https://help.openai.com/en/articles/11826767-life-science-research-special-access-program) to see if you meet criteria.

71 86

72You likely don't need to provide a safety identifier if access to your product87You likely don't need to provide a safety identifier if access to your product

75 90

76### Implementing safety identifiers for individual users91### Implementing safety identifiers for individual users

77 92

78The `safety_identifier` parameter is available in both the [Responses API](https://developers.openai.com/api/docs/api-reference/responses/create) and older [Chat Completions API](https://developers.openai.com/api/docs/api-reference/chat/create). To use safety identifiers, provide a stable ID for your end user on each request. Hash user email or internal user IDs to avoid passing any personal information.93The `safety_identifier` parameter is available in both the [Responses API](https://developers.openai.com/api/docs/api-reference/responses/create) and older [Chat Completions API](https://developers.openai.com/api/docs/api-reference/chat/create). The Realtime API supports the same concept through the `OpenAI-Safety-Identifier` header. To use safety identifiers, provide a stable ID for your end user on each request. Hash user email or internal user IDs to avoid passing any personal information.

95Safety identifiers do not carry over between APIs or sessions. If your application already sends `safety_identifier` with Responses API requests, pass the same stable value separately when you create or connect each Realtime session.

79 96

80 97

81 98

85 <div data-content-switcher-pane data-value="chat" hidden>102 <div data-content-switcher-pane data-value="chat" hidden>

86 <div class="hidden">Chat Completions API</div>103 <div class="hidden">Chat Completions API</div>

87 </div>104 </div>

105 <div data-content-switcher-pane data-value="realtime" hidden>

106 <div class="hidden">Realtime API</div>

107 </div>

88 108

89 109

90 110

guides/speech-to-text.md +13 −0

Details

18 18

19File uploads are currently limited to 25 MB, and the following input file types are supported: `mp3`, `mp4`, `mpeg`, `mpga`, `m4a`, `wav`, and `webm`. Known speaker reference clips for diarization accept the same formats when provided as data URLs.19File uploads are currently limited to 25 MB, and the following input file types are supported: `mp3`, `mp4`, `mpeg`, `mpga`, `m4a`, `wav`, and `webm`. Known speaker reference clips for diarization accept the same formats when provided as data URLs.

20 20

21Use this guide for file uploads and bounded audio requests. If your

22 application needs live transcript deltas from a microphone, call, or media

23 stream, use [Realtime transcription](https://developers.openai.com/api/docs/guides/realtime-transcription)

24 instead.

21## Quickstart26## Quickstart

22 27

23### Transcriptions28### Transcriptions

58print(transcription.text)63print(transcription.text)

59```64```

60 65

```cli
openai audio:transcriptions create \
  --model gpt-4o-transcribe \
  --file /path/to/file/audio.mp3 \
  --raw-output \
  --transform text
```

61```bash74```bash

62curl --request POST \\75curl --request POST \\

63 --url https://api.openai.com/v1/audio/transcriptions \\76 --url https://api.openai.com/v1/audio/transcriptions \\

guides/supervised-fine-tuning.md +8 −0


2 2

Supervised fine-tuning (SFT) lets you train an OpenAI model with examples for your specific use case. The result is a customized model that more reliably produces your desired style and content.

4 4

5OpenAI is winding down the fine-tuning platform. The platform is no longer

6 accessible to new users, but existing users of the fine-tuning platform will

7 be able to create training jobs for the coming months.

8 <br />

9 All fine-tuned models will remain available for inference until their base

10 models are [deprecated](https://developers.openai.com/api/docs/deprecations). The full timeline is

11 [here](https://developers.openai.com/api/docs/deprecations).

5<br />13<br />

6 14

7<table>15<table>

guides/text-to-speech.md +33 −1


72 --output speech.mp372 --output speech.mp3

73```73```

74 74

```cli
openai audio:speech create \
  --model gpt-4o-mini-tts \
  --voice coral \
  --instructions "Speak in a cheerful and positive tone." \
  --input "Today is a wonderful day to build something people love!" \
  --output speech.mp3
```

75 84

By default, the endpoint outputs an MP3 of the spoken audio, but you can configure it to output any [supported format](#supported-output-formats).

77 86

309const sessionConfig = JSON.stringify({318const sessionConfig = JSON.stringify({

310 session: {319 session: {

311 type: "realtime",320 type: "realtime",

312 model: "gpt-realtime",321 model: "gpt-realtime-2",

313 audio: {322 audio: {

314 output: {323 output: {

315 voice: { id: "voice_123abc" },324 voice: { id: "voice_123abc" },

318 },327 },

319});328});

320```329```

330

331## Related guides

332

333<a href="/api/docs/guides/realtime">

334

335

336<span slot="icon">

337 </span>

338 Choose the right path for voice agents, translation, transcription, and

339 speech generation.

340

341

342</a>

343

344<a href="/api/docs/guides/audio">

345

346

347<span slot="icon">

348 </span>

349 Review audio modalities, speech tasks, streaming, and request-based APIs.

350

351

352</a>

guides/voice-agents.md +59 −58


1# Voice agents1# Voice agents

2 2

3import {

4 Bolt,

5 Cube,

6 Desktop,

7 Phone,

8} from "@components/react/oai/platform/ui/Icon.react";

Voice agents turn the same agent concepts into spoken, low-latency interactions. The key design choice is deciding whether the model should work directly with live audio or whether your application should explicitly chain speech-to-text, text reasoning, and text-to-speech.

4 12

5## Choose the right architecture13## Choose the right architecture

13 21

14## Recommended starting points22## Recommended starting points

15 23

~~16The two supported languages expose different strengths today:~~24The examples below are intentionally different architectures, not matching language tabs. The TypeScript and Python libraries expose different voice helpers today:

17 25

18- In TypeScript, the fastest path to a browser-based voice assistant is a `RealtimeAgent` and `RealtimeSession`.26- In TypeScript, the fastest path to a browser-based voice assistant is a `RealtimeAgent` and `RealtimeSession`.

19- In Python, the simplest path to extending an existing text agent into voice is a chained `VoicePipeline`.27- In Python, the simplest path to extending an existing text agent into voice is a chained `VoicePipeline`.

20 28

~~21Two common voice starting points~~

~~23```typescript~~

~~24import { RealtimeAgent, RealtimeSession } from "@openai/agents/realtime";~~

~~26const agent = new RealtimeAgent({~~

~~27 name: "Assistant",~~

~~28 instructions: "You are a helpful voice assistant.",~~

~~29});~~

~~31const session = new RealtimeSession(agent, {~~

~~32 model: "gpt-realtime-1.5",~~

~~33});~~

~~35await session.connect({~~

~~36 apiKey: "ek_...(ephemeral key from your server)",~~

~~37});~~

~~38```~~

~~40```python~~

~~41import asyncio~~

~~42import numpy as np~~

~~44from agents import Agent, function_tool~~

~~45from agents.voice import AudioInput, SingleAgentVoiceWorkflow, VoicePipeline~~

~~48@function_tool~~

~~49def get_weather(city: str) -> str:~~

~~50 """Get the weather for a given city."""~~

~~51 return f"The weather in {city} is sunny."~~

~~54agent = Agent(~~

~~55 name="Assistant",~~

~~56 instructions="You are a helpful voice assistant.",~~

~~57 model="gpt-5.5",~~

~~58 tools=[get_weather],~~

~~59)~~

~~62async def main() -> None:~~

~~63 pipeline = VoicePipeline(workflow=SingleAgentVoiceWorkflow(agent))~~

~~64 audio_input = AudioInput(buffer=np.zeros(24000 * 3, dtype=np.int16))~~

~~65 result = await pipeline.run(audio_input)~~

~~66 async for event in result.stream():~~

~~67 if event.type == "voice_stream_event_audio":~~

~~68 print("Received audio bytes", len(event.data))~~

~~71if __name__ == "__main__":~~

~~72 asyncio.run(main())~~

~~73```~~

76<span id="speech-to-speech-realtime-architecture"></span>29<span id="speech-to-speech-realtime-architecture"></span>

77 30

78## Build a speech-to-speech voice agent31## Build a speech-to-speech voice agent

79 32

~~80Use the live audio API path when the interaction should feel conversational and immediate. The usual browser flow is:~~33Use the live audio API path when the interaction should feel conversational and immediate. This is the best starting point for voice agents that need barge-in, low first-audio latency, natural turn taking, and realtime tool use.

35The usual browser flow is:

81 36

821. Your application server creates an ephemeral client secret for the live audio session.371. Your application server creates an ephemeral client secret for the live audio session.

832. Your frontend creates a `RealtimeSession`.382. Your frontend creates a `RealtimeSession`.

843. The session connects over WebRTC in the browser or WebSocket on the server.393. The session connects over WebRTC in the browser or WebSocket on the server.

854. The agent handles audio turns, tools, interruptions, and handoffs inside that session.404. The agent handles audio turns, tools, interruptions, and handoffs inside that session.

86 41

42From there, attach tools, handoffs, and guardrails to the `RealtimeAgent` the same way you would attach them to a text agent. Keep audio transport concerns in the session layer, and keep business logic in the agent definition.

87Start with the transport docs when you need lower-level control:44Start with the transport docs when you need lower-level control:

88 45

~~89- [Live audio API overview](https://developers.openai.com/api/docs/guides/realtime)~~46- [Realtime and audio overview](https://developers.openai.com/api/docs/guides/realtime)

90- [Live audio API with WebRTC](https://developers.openai.com/api/docs/guides/realtime-webrtc)47- [Live audio API with WebRTC](https://developers.openai.com/api/docs/guides/realtime-webrtc)

91- [Live audio API with WebSocket](https://developers.openai.com/api/docs/guides/realtime-websocket)48- [Live audio API with WebSocket](https://developers.openai.com/api/docs/guides/realtime-websocket)

92 49

100 57

101This is often the better fit for support flows, approval-heavy flows, or cases where you want durable transcripts and deterministic logic between each stage.58This is often the better fit for support flows, approval-heavy flows, or cases where you want durable transcripts and deterministic logic between each stage.

102 59

60Use this path when each stage needs to be visible or replaceable. For example, you might store the transcript, run policy checks before the text agent responds, call internal systems, then generate speech only after the workflow reaches an approved answer.
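The chained flow described above can be sketched as a function that takes each stage as a replaceable callable. This is an illustrative outline only: `ChainedTurn`, `run_chained_turn`, and the stage callables are hypothetical names, and real stages would call speech-to-text, a text agent, and text-to-speech instead of the stubs shown.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ChainedTurn:
    """Record of one voice turn, kept so every stage stays inspectable."""
    transcript: str
    reply_text: str = ""
    approved: bool = False
    audio: bytes = b""
    log: list[str] = field(default_factory=list)


def run_chained_turn(
    audio_in: bytes,
    transcribe: Callable[[bytes], str],
    policy_check: Callable[[str], bool],
    respond: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> ChainedTurn:
    # Stage 1: speech-to-text; the transcript is stored on the turn record.
    turn = ChainedTurn(transcript=transcribe(audio_in))
    turn.log.append("transcribed")
    # Stage 2: policy check runs before the text agent responds.
    if not policy_check(turn.transcript):
        turn.log.append("blocked by policy")
        return turn
    # Stage 3: text reasoning, then speech only for an approved answer.
    turn.reply_text = respond(turn.transcript)
    turn.approved = True
    turn.audio = synthesize(turn.reply_text)
    turn.log.append("spoken")
    return turn


# Stub stages stand in for real model calls.
turn = run_chained_turn(
    b"fake-audio",
    transcribe=lambda audio: "what is my balance",
    policy_check=lambda text: "password" not in text,
    respond=lambda text: f"You asked: {text}",
    synthesize=lambda text: text.encode("utf-8"),
)
```

Because each stage is passed in, any one of them can be swapped or logged without touching the others, which is the property this architecture is chosen for.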

103## Voice agents still use the same core agent building blocks62## Voice agents still use the same core agent building blocks

104 63

105The voice surface changes the transport and audio loop, but the core workflow decisions are the same:64The voice surface changes the transport and audio loop, but the core workflow decisions are the same:

111- Use [Integrations and observability](https://developers.openai.com/api/docs/guides/agents/integrations-observability) when you need MCP-backed capabilities or want to inspect how the voice workflow behaved.70- Use [Integrations and observability](https://developers.openai.com/api/docs/guides/agents/integrations-observability) when you need MCP-backed capabilities or want to inspect how the voice workflow behaved.

112 71

113The practical rule is: choose the audio architecture first, then design the rest of the agent workflow the same way you would for text.72The practical rule is: choose the audio architecture first, then design the rest of the agent workflow the same way you would for text.

74## Next steps

76<a href="/api/docs/guides/realtime">

79<span slot="icon">

80 </span>

81 Choose the right realtime or audio guide for your use case.

84</a>

86<a href="/api/docs/guides/realtime-conversations">

89<span slot="icon">

90 </span>

91 Work with the Realtime session lifecycle and event model.

94</a>

96<a href="/api/docs/guides/realtime-webrtc">

99<span slot="icon">

100 </span>

101 Connect browser and mobile audio directly to a Realtime session.

102

103

104</a>

105

106<a href="/api/docs/guides/realtime-models-prompting">

107

108

109<span slot="icon">

110 </span>

111 Tune reasoning, preambles, tools, entity capture, and voice behavior.

112

113

114</a>

guides/your-data.md +3 −1


191| /v1/images/edits | gpt-image-1<br />gpt-image-1.5<br />gpt-image-1-mini | All |191| /v1/images/edits | gpt-image-1<br />gpt-image-1.5<br />gpt-image-1-mini | All |

192| /v1/images/generations | dall-e-3<br />gpt-image-1<br />gpt-image-1.5<br />gpt-image-1-mini | All |192| /v1/images/generations | dall-e-3<br />gpt-image-1<br />gpt-image-1.5<br />gpt-image-1-mini | All |

193| /v1/moderations | text-moderation-latest\*<br />omni-moderation-latest | All |193| /v1/moderations | text-moderation-latest\*<br />omni-moderation-latest | All |

195| /v1/realtime/transcription_sessions | gpt-realtime-whisper | US and EU |

196| /v1/realtime/translations | gpt-realtime-translate | US and EU |

196| /v1/responses | gpt-5.5-pro-2026-04-23<br />gpt-5.4-pro-2026-03-05<br />gpt-5.2-pro-2025-12-11<br />gpt-5-pro-2025-10-06<br />gpt-5.5-2026-04-23<br />gpt-5.4-2026-03-05<br />gpt-5-2025-08-07<br />gpt-5.4-mini-2026-03-17<br />gpt-5.4-nano-2026-03-17<br />gpt-5.2-2025-12-11<br />gpt-5.1-2025-11-13<br />gpt-5-mini-2025-08-07<br />gpt-5-nano-2025-08-07<br />gpt-5-chat-latest-2025-08-07<br />gpt-4.1-2025-04-14<br />gpt-4.1-mini-2025-04-14<br />gpt-4.1-nano-2025-04-14<br />o3-2025-04-16<br />o4-mini-2025-04-16<br />o1-pro<br />o1-pro-2025-03-19<br />computer-use-preview\*<br />o3-mini-2025-01-31<br />o1-2024-12-17<br />o1-mini-2024-09-12<br />o1-preview<br />gpt-4o-2024-11-20<br />gpt-4o-2024-08-06<br />gpt-4o-mini-2024-07-18<br />gpt-4-turbo-2024-04-09<br />gpt-4-0613<br />gpt-3.5-turbo-0125 | All |198| /v1/responses | gpt-5.5-pro-2026-04-23<br />gpt-5.4-pro-2026-03-05<br />gpt-5.2-pro-2025-12-11<br />gpt-5-pro-2025-10-06<br />gpt-5.5-2026-04-23<br />gpt-5.4-2026-03-05<br />gpt-5-2025-08-07<br />gpt-5.4-mini-2026-03-17<br />gpt-5.4-nano-2026-03-17<br />gpt-5.2-2025-12-11<br />gpt-5.1-2025-11-13<br />gpt-5-mini-2025-08-07<br />gpt-5-nano-2025-08-07<br />gpt-5-chat-latest-2025-08-07<br />gpt-4.1-2025-04-14<br />gpt-4.1-mini-2025-04-14<br />gpt-4.1-nano-2025-04-14<br />o3-2025-04-16<br />o4-mini-2025-04-16<br />o1-pro<br />o1-pro-2025-03-19<br />computer-use-preview\*<br />o3-mini-2025-01-31<br />o1-2024-12-17<br />o1-mini-2024-09-12<br />o1-preview<br />gpt-4o-2024-11-20<br />gpt-4o-2024-08-06<br />gpt-4o-mini-2024-07-18<br />gpt-4-turbo-2024-04-09<br />gpt-4-0613<br />gpt-3.5-turbo-0125 | All |

197| /v1/responses File Search | | All |199| /v1/responses File Search | | All |

libraries.md +22 −60


~~1# Libraries~~1# SDKs and CLI

2 2

3This page covers setting up your local development environment to use the [OpenAI API](https://developers.openai.com/api/docs/api-reference). You can use one of our officially supported SDKs, a community library, or your own preferred HTTP client.3This page covers the main ways to build with the [OpenAI API](https://developers.openai.com/api/docs/api-reference): official SDKs for application code, the OpenAI CLI for shell-native workflows, the Agents SDK for orchestration, or your own preferred HTTP client.

4 4

5## Create and export an API key5## Create and export an API key

6 6

50 <div data-content-switcher-pane data-value="golang" hidden>50 <div data-content-switcher-pane data-value="golang" hidden>

51 <div class="hidden">Go</div>51 <div class="hidden">Go</div>

52 </div>52 </div>

53 <div data-content-switcher-pane data-value="ruby" hidden>

54 <div class="hidden">Ruby</div>

55 </div>

56 <div data-content-switcher-pane data-value="cli" hidden>

57 <div class="hidden">CLI</div>

58 </div>

62## Use the Agents SDK

64Use the official OpenAI SDKs above for direct API requests. Use the Agents SDK

65when your application needs code-first orchestration for agents, tools,

66handoffs, guardrails, tracing, or sandbox execution.

68<a href="/api/docs/guides/agents/quickstart">

53 69

54 70

71<span slot="icon">

72 </span>

73 Build your first agent with the Agents SDK.

55 74

~~56## Install the Agents SDK~~

57 75

~~58Use the official OpenAI libraries above for direct API requests. Use the OpenAI~~76</a>

~~59Agents SDK when your application needs code-first orchestration for agents,~~

~~60tools, handoffs, guardrails, tracing, or sandbox execution.~~

61 77

~~62- [Agents SDK quickstart](https://developers.openai.com/api/docs/guides/agents/quickstart)~~

63- [OpenAI Agents SDK for TypeScript](https://github.com/openai/openai-agents-js)78- [OpenAI Agents SDK for TypeScript](https://github.com/openai/openai-agents-js)

64- [OpenAI Agents SDK for Python](https://github.com/openai/openai-agents-python)79- [OpenAI Agents SDK for Python](https://github.com/openai/openai-agents-python)

65 80

80 95

81Please note that OpenAI does not verify the correctness or security of these projects. **Use them at your own risk!**96Please note that OpenAI does not verify the correctness or security of these projects. **Use them at your own risk!**

82 97

~~83### C# / .NET~~

~~85- [Betalgo.OpenAI](https://github.com/betalgo/openai) by [Betalgo](https://github.com/betalgo)~~

~~86- [OpenAI-API-dotnet](https://github.com/OkGoDoIt/OpenAI-API-dotnet) by [OkGoDoIt](https://github.com/OkGoDoIt)~~

~~87- [OpenAI-DotNet](https://github.com/RageAgainstThePixel/OpenAI-DotNet) by [RageAgainstThePixel](https://github.com/RageAgainstThePixel)~~

~~89### C++~~

~~91- [liboai](https://github.com/D7EAD/liboai) by [D7EAD](https://github.com/D7EAD)~~

93### Clojure98### Clojure

94 99

95- [openai-clojure](https://github.com/wkok/openai-clojure) by [wkok](https://github.com/wkok)100- [openai-clojure](https://github.com/wkok/openai-clojure) by [wkok](https://github.com/wkok)

96 101

~~97### Crystal~~

~~99- [openai-crystal](https://github.com/sferik/openai-crystal) by [sferik](https://github.com/sferik)~~

~~100~~

101### Dart/Flutter102### Dart/Flutter

102 103

103- [openai](https://github.com/anasfik/openai) by [anasfik](https://github.com/anasfik)104- [openai](https://github.com/anasfik/openai) by [anasfik](https://github.com/anasfik)

110 111

111- [openai.ex](https://github.com/mgallo/openai.ex) by [mgallo](https://github.com/mgallo)112- [openai.ex](https://github.com/mgallo/openai.ex) by [mgallo](https://github.com/mgallo)

112 113

113### Go

~~114~~

115- [go-gpt3](https://github.com/sashabaranov/go-gpt3) by [sashabaranov](https://github.com/sashabaranov)

~~116~~

117### Java

~~118~~

119- [simple-openai](https://github.com/sashirestela/simple-openai) by [Sashir Estela](https://github.com/sashirestela)

120- [Spring AI](https://spring.io/projects/spring-ai)

~~121~~

122### Julia

~~123~~

124- [OpenAI.jl](https://github.com/rory-linehan/OpenAI.jl) by [rory-linehan](https://github.com/rory-linehan)

~~125~~

126### Kotlin114### Kotlin

127 115

128- [openai-kotlin](https://github.com/Aallam/openai-kotlin) by [Mouaad Aallam](https://github.com/Aallam)116- [openai-kotlin](https://github.com/Aallam/openai-kotlin) by [Mouaad Aallam](https://github.com/Aallam)

129 117

130### Node.js

~~131~~

132- [openai-api](https://www.npmjs.com/package/openai-api) by [Njerschow](https://github.com/Njerschow)

133- [openai-api-node](https://www.npmjs.com/package/openai-api-node) by [erlapso](https://github.com/erlapso)

134- [gpt-x](https://www.npmjs.com/package/gpt-x) by [ceifa](https://github.com/ceifa)

135- [gpt3](https://www.npmjs.com/package/gpt3) by [poteat](https://github.com/poteat)

136- [gpts](https://www.npmjs.com/package/gpts) by [thencc](https://github.com/thencc)

137- [@dalenguyen/openai](https://www.npmjs.com/package/@dalenguyen/openai) by [dalenguyen](https://github.com/dalenguyen)

138- [tectalic/openai](https://github.com/tectalichq/public-openai-client-js) by [tectalic](https://tectalic.com/)

~~139~~

140### PHP118### PHP

141 119

142- [orhanerday/open-ai](https://packagist.org/packages/orhanerday/open-ai) by [orhanerday](https://github.com/orhanerday)120- [orhanerday/open-ai](https://packagist.org/packages/orhanerday/open-ai) by [orhanerday](https://github.com/orhanerday)

143- [tectalic/openai](https://github.com/tectalichq/public-openai-client-php) by [tectalic](https://tectalic.com/)

144- [openai-php client](https://github.com/openai-php/client) by [openai-php](https://github.com/openai-php)121- [openai-php client](https://github.com/openai-php/client) by [openai-php](https://github.com/openai-php)

145 122

146### Python

~~147~~

148- [chronology](https://github.com/OthersideAI/chronology) by [OthersideAI](https://www.othersideai.com/)

~~149~~

150### R

~~151~~

152- [rgpt3](https://github.com/ben-aaron188/rgpt3) by [ben-aaron188](https://github.com/ben-aaron188)

~~153~~

154### Ruby

~~155~~

156- [openai](https://github.com/nileshtrivedi/openai/) by [nileshtrivedi](https://github.com/nileshtrivedi)

157- [ruby-openai](https://github.com/alexrudall/ruby-openai) by [alexrudall](https://github.com/alexrudall)

~~158~~

159### Rust123### Rust

160 124

161- [async-openai](https://github.com/64bit/async-openai) by [64bit](https://github.com/64bit)125- [async-openai](https://github.com/64bit/async-openai) by [64bit](https://github.com/64bit)

162- [fieri](https://github.com/lbkolev/fieri) by [lbkolev](https://github.com/lbkolev)

163 126

164### Scala127### Scala

165 128

173 136

174### Unity137### Unity

175 138

176- [OpenAi-Api-Unity](https://github.com/hexthedev/OpenAi-Api-Unity) by [hexthedev](https://github.com/hexthedev)

177- [com.openai.unity](https://github.com/RageAgainstThePixel/com.openai.unity) by [RageAgainstThePixel](https://github.com/RageAgainstThePixel)139- [com.openai.unity](https://github.com/RageAgainstThePixel/com.openai.unity) by [RageAgainstThePixel](https://github.com/RageAgainstThePixel)

178 140

179### Unreal Engine141### Unreal Engine

libraries/openai-cli.md +577 −0 created


1# OpenAI CLI

3Interact with the OpenAI API directly from your terminal with the `openai` command-line tool.

5## Installation

7Install the CLI with Homebrew:

```bash
brew install openai/tools/openai
```

13Or install it with Go 1.25 or later:

```bash
go install 'github.com/openai/openai-cli/cmd/openai@latest'
```

19Older versions of the Python SDK also installed a legacy `openai` command. If you already had that package installed and the command you see does not match this guide, your shell may still be resolving the older binary. Fresh CLI installs are not affected.

21## Authentication

23The CLI reads your API key from `OPENAI_API_KEY`:

25Command:

```bash
export OPENAI_API_KEY="sk-..."
```

31If you don't have an API key yet, [create one in the dashboard](https://platform.openai.com/api-keys).

33For Admin API endpoints, set `OPENAI_ADMIN_KEY` instead. The SDK layer selects the admin key or default API key based on the endpoint being called.

35To point at a different API host, set `OPENAI_BASE_URL`.

37## Use cases

39Use the CLI when the work belongs naturally in the terminal:

41- Generate local artifacts such as images or speech.

42- Extract structured data into JSONL for later shell steps.

43- Use Responses with files, computer use, and current web context in the cloud.

44- Create projects and API keys with Admin APIs.

46Use it directly for one-off terminal requests, or from scripts when agents need repeatable batch work over files and generated artifacts.

48## CLI vs subagents for Codex

50Use the CLI for repeatable API work you want to inspect and rerun, such as batch extraction, file transforms, artifact generation, or deliberate model selection. Use subagents when the work still needs judgment, such as exploring code, comparing hypotheses, debugging, or reviewing changes.

52## Global flags

54These options work across commands:

| Flag          | Use                                                                                                          |
| ------------- | ------------------------------------------------------------------------------------------------------------ |
| `--format`    | Print responses as `auto`, `json`, `jsonl`, `pretty`, `raw`, `yaml`, or `explore`.                           |
| `--transform` | Extract or reshape response data with a GJSON path before printing.                                          |
| `--debug`     | Print request and response details to stderr. Authorization is redacted; review headers before sharing logs. |

62This guide focuses on CLI patterns. For the latest arguments and response shapes for any API family, use the live [API reference](https://developers.openai.com/api/reference).

64You can also change the base URL when you need to point the CLI at another compatible endpoint, such as a deployment that supports a different model set or only a subset of the API surface.

66## Responses

68Use Responses for text generation, structured extraction, web search, file understanding, and repeatable Codex-authored batch scripts.

70### Send your first request

72Command:

```bash
openai responses create \
  --model gpt-5.5 \
  --input "Say hello in one sentence."
```

80Output:

```json
{
  "id": "resp_...",
  "object": "response",
  "status": "completed",
  "model": "gpt-5.5-...",
  "output": [
    {
      "type": "message",
      "role": "assistant",
      "content": [
        {
          "type": "output_text",
          "text": "Hello!"
        }
      ]
    }
  ],
  "usage": {
    "input_tokens": 12,
    "output_tokens": 6,
    "total_tokens": 18
  },
  "...": "additional response fields omitted"
}
```

108

109The CLI prints the full API response object by default. Examples on this page keep representative fields such as `id`, `status`, `model`, `output`, and `usage`, and omit the rest.

110

111Responses output can include non-message items, such as reasoning items, before the assistant message. When you need assistant text, select the message item by type instead of assuming it is always `output[0]`:

112

```bash
--transform 'output.#(type=="message").content.0.text'
```
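The same selection can be expressed in application code once the response has been parsed as JSON. The sketch below mirrors the transform path above on a plain dictionary; `first_message_text` is a hypothetical helper, and the sample response is illustrative rather than real API output.

```python
from typing import Optional


def first_message_text(response: dict) -> Optional[str]:
    """Return content[0].text of the first message item, skipping other item types."""
    for item in response.get("output", []):
        if item.get("type") == "message":
            content = item.get("content") or []
            return content[0].get("text") if content else None
    return None


# Illustrative response: a reasoning item precedes the assistant message.
sample = {
    "output": [
        {"type": "reasoning", "summary": []},
        {
            "type": "message",
            "role": "assistant",
            "content": [{"type": "output_text", "text": "Hello!"}],
        },
    ]
}

text = first_message_text(sample)
```

Indexing `output[0]` on this sample would return the reasoning item, which is why the lookup filters by `type` first.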

116

117### Add a local file to the prompt

118

119For a simple local file, build the prompt inline with command substitution:

120

```bash
openai responses create \
  --model gpt-5.5 \
  --input "Summarize this note in one sentence.

<note>
$(cat ./note.md)
</note>" \
  --format yaml \
  --transform 'output.#(type=="message").content.0.text'
```

132

133Output:

134

```text
The note says the launch checklist is ready except for final support ownership.
```

138

139### Passing request bodies

140

141Use flags for short scalar inputs. Use a YAML heredoc for multiline prompts, tools, files, or nested request bodies. The heredoc can contain the same request fields you would otherwise pass as flags.

142

143Be careful with string values that look like YAML, especially prompts that contain `:` or `{}`. On flags, the generated parser may interpret those values as structured YAML instead of plain text. If a prompt starts looking like configuration, put it under `input: |` in a YAML body instead:

144

145Command:

146

```bash
openai responses create \
  --format yaml \
  --transform 'output.#(type=="message").content.0.text' <<'YAML'
model: gpt-5.5
instructions: Return exactly one sentence.
max_output_tokens: 120
input: |
  Summarize this release note in one sentence.

  <release_note>
  Fixed the image generation example and added CLI installation guidance.
  </release_note>
YAML
```

162

163Output:

164

```text
The release note updates the CLI docs with corrected image generation and installation guidance.
```

168

169When the prompt itself needs shell assembly, build a YAML body and pipe it into the command:

170

```bash
{
  printf 'input: |\n'
  printf '  Summarize this note in one sentence.\n\n'
  printf '  <note>\n'
  sed 's/^/  /' ./note.md
  printf '  </note>\n'
} | openai responses create \
  --model gpt-5.5 \
  --format yaml \
  --transform 'output.#(type=="message").content.0.text'
```
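When a script rather than the shell assembles the request, the same `input: |` body can be built as a string and piped to the CLI. This is a minimal sketch: `build_yaml_body` is a hypothetical helper, and the two-space indentation is a convention chosen here, since YAML only needs the block's lines to share a consistent indent.

```python
import textwrap


def build_yaml_body(instruction: str, note_text: str) -> str:
    """Build an `input: |` YAML body with the note wrapped in <note> tags."""
    block = f"{instruction}\n\n<note>\n{note_text.rstrip()}\n</note>"
    # Indent every non-blank line so it sits inside the literal block scalar.
    return "input: |\n" + textwrap.indent(block, "  ") + "\n"


body = build_yaml_body(
    "Summarize this note in one sentence.",
    "Launch: ready {almost}\n",
)
```

Because the note text sits under a literal block scalar, characters such as `:` and `{}` stay plain text instead of being read as YAML structure.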

183

184### Write structured data to JSON

185

186Use structured outputs when downstream scripts need stable JSON. Save reusable schemas to disk:

187

188Save as `schema.json`:

189

```json
{
  "type": "json_schema",
  "name": "fact",
  "strict": true,
  "schema": {
    "type": "object",
    "additionalProperties": false,
    "properties": {
      "person": { "type": "string" },
      "topic": { "type": "string" }
    },
    "required": ["person", "topic"]
  }
}
```

206

207Command:

208

```bash
openai responses create \
  --model gpt-5.5 \
  --instructions "Extract the person and topic from the input." \
  --input "Ada Lovelace wrote notes about the Analytical Engine." \
  --text.format "$(cat ./schema.json)" \
  --format yaml \
  --transform 'output.#(type=="message").content.0.text'
```

218

219Output:

220

```json
{ "person": "Ada Lovelace", "topic": "notes about the Analytical Engine" }
```

224

225### Write structured records to JSONL

226

227When one input may produce many records, ask the model for an array and flatten it into JSONL so later shell steps can process one record per line:

228

229Save as `records-schema.json`:

230

```json
{
  "type": "json_schema",
  "name": "items",
  "strict": true,
  "schema": {
    "type": "object",
    "additionalProperties": false,
    "properties": {
      "items": {
        "type": "array",
        "items": {
          "type": "object",
          "additionalProperties": false,
          "properties": {
            "title": { "type": "string" },
            "summary": { "type": "string" },
            "evidence": { "type": "string" }
          },
          "required": ["title", "summary", "evidence"]
        }
      }
    },
    "required": ["items"]
  }
}
```

258

259Command:

260

```bash
: > records.jsonl

for file in notes/*.md; do
  extracted="$(
    openai responses create \
      --model gpt-5.5 \
      --text.format "$(cat ./records-schema.json)" \
      --raw-output \
      --transform 'output.#(type=="message").content.0.text' <<YAML
input: |
  <note path="$file">
$(sed 's/^/  /' "$file")
  </note>
YAML
  )"

  jq -r --arg source "$file" \
    '.items[]? + {source: $source} | @json' \
    <<<"$extracted" >> records.jsonl
done
```

283

284This keeps the model response structured while producing one JSON object per line for later shell steps.
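For scripts that post-process the model output outside the shell, the flattening step can be sketched in Python. This mirrors the `jq` filter in spirit only: `to_jsonl_lines` is a hypothetical helper, and the sample payload is made up for illustration.

```python
import json


def to_jsonl_lines(extracted: str, source: str) -> list[str]:
    """Flatten {"items": [...]} into one compact JSON string per record."""
    payload = json.loads(extracted)
    lines = []
    # A missing or null "items" key yields no lines, like `.items[]?` in jq.
    for record in payload.get("items") or []:
        lines.append(json.dumps({**record, "source": source}, separators=(",", ":")))
    return lines


# Illustrative model output for one note.
extracted = '{"items":[{"title":"A","summary":"s","evidence":"e"}]}'
lines = to_jsonl_lines(extracted, "notes/a.md")
```

Each returned string can be appended to `records.jsonl` followed by a newline, giving one record per line.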

285

286### Web search

287

288Responses can call hosted tools from the same YAML request body:

289

290Command:

291

```bash
openai responses create \
  --model gpt-5.5 \
  --format yaml \
  --transform 'output.#(type=="message").content.0.text' <<'YAML'
tools:
  - type: web_search
input: |
  Research the latest material news for AAPL.
  Return three concise bullets and cite sources in the text.
YAML
```

304

305Output:

306

```text
- Apple announced ...
- Analysts highlighted ...
- The company said ...
```

312

313### File inputs

314

315For uploaded files such as PDFs, create the file first, capture its ID, and pass it as `input_file.file_id`:

316

317Command:

318

```bash
FILE_ID=$(
  openai files create \
    --file ./brief.pdf \
    --purpose user_data \
    --format yaml \
    --transform id
)

openai responses create \
  --model gpt-5.5 \
  --format yaml \
  --transform 'output.#(type=="message").content.0.text' <<YAML
input:
  - role: user
    content:
      - type: input_text
        text: Summarize this brief and list three risks.
      - type: input_file
        file_id: ${FILE_ID}
YAML
```

341

342Output:

343

```text
- The brief proposes ...
- Risks: migration timing, unclear rollback criteria, and unresolved support ownership.
```

348

349Recent generated builds send local file flags as multipart file parts with filename and content type metadata. If a local upload command fails with an `UploadFile` type error, update the CLI and retry.

350

351## Images

352

353### Generate an image

354

355Generate an image, extract the base64 payload, and decode it into a normal asset file:

356

357Command:

358

```bash
openai images generate \
  --model gpt-image-2 \
  --prompt "A simple product-style render of a translucent green cube on a neutral background." \
  --format yaml \
  --transform 'data.0.b64_json' | base64 --decode > hero.png
printf 'wrote hero.png\n'
```

367

368Output:

369

```text
wrote hero.png
```

373

374Current limitation: image commands do not yet have native `--output` support, so image generation still requires extracting `b64_json` and decoding it yourself.
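If the image response is handled in a script instead of a shell pipeline, the decode step looks roughly like this. The sketch assumes the response has already been parsed into a dictionary with the `data[0].b64_json` shape used above; `decode_first_image` is a hypothetical helper and the sample payload is a placeholder, not a real image.

```python
import base64


def decode_first_image(response: dict) -> bytes:
    """Decode data[0].b64_json into raw image bytes."""
    payload = response["data"][0]["b64_json"]
    return base64.b64decode(payload)


# Placeholder payload standing in for a real base64-encoded PNG.
sample = {"data": [{"b64_json": base64.b64encode(b"\x89PNG").decode("ascii")}]}
image_bytes = decode_first_image(sample)
```

Writing `image_bytes` to a file opened in binary mode produces the same asset as the `base64 --decode > hero.png` step.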

375

376For `gpt-image-2`, omit `--input-fidelity`; image inputs are always processed at high fidelity. Do not use `--background transparent` with `gpt-image-2`. The model also supports broader `--size` values than earlier GPT Image models, as long as the requested resolution satisfies the Image API size constraints.

377

378### Edit an image

379

380Image editing uses the same base64 extraction pattern after the edit request succeeds:

381

382Command:

383

```bash
openai images edit \
  --model gpt-image-2 \
  --image ./hero.png \
  --prompt "Turn the cube bright green." \
  --format yaml \
  --transform 'data.0.b64_json' | base64 --decode > hero-edited.png
printf 'wrote hero-edited.png\n'
```

393

394Output:

395

396```text

397wrote hero-edited.png

398```

399

400If a local image edit upload fails with an `UploadFile` type error, update the CLI and retry.

401

## Speech

Create an MP3 locally with the speech API:

Command:

```bash
openai audio:speech create \
  --model gpt-4o-mini-tts \
  --voice marin \
  --input "The OpenAI CLI can call the API from ordinary shell scripts." \
  --output speech.mp3
```

Output:

```text
Wrote output to: speech.mp3
```

Play it with whatever local audio tool is available on your machine. On macOS:

```bash
afplay speech.mp3
```

Use `--instructions` to shape delivery and `--input` for the words that should be spoken. Instructions work well for cues such as pace, energy, warmth, formality, emphasis, or audience:

```bash
openai audio:speech create \
  --model gpt-4o-mini-tts \
  --voice marin \
  --instructions "Whisper very quickly, like a hurried stage cue, while staying clear and intelligible." \
  --input "The launch checklist is ready. Please send final feedback by Friday at noon." \
  --output reminder.mp3
```

## Transcription

Print plain transcript text for shell pipelines:

Command:

```bash
openai audio:transcriptions create \
  --model gpt-4o-transcribe \
  --file ./speech.mp3 \
  --transform text \
  --raw-output
```

Output:

```text
The OpenAI CLI can call the API from ordinary shell scripts.
```

Use the response format that matches the artifact you need:

| Need                        | Command shape                                                        |
| --------------------------- | -------------------------------------------------------------------- |
| Plain transcript text       | `--model gpt-4o-transcribe --transform text --raw-output`            |
| Subtitle files              | `--model whisper-1 --response-format srt` or `--response-format vtt` |
| Segment or word timestamps  | `--model whisper-1 --response-format verbose_json`                   |
| Speaker-labeled diarization | `--model gpt-4o-transcribe-diarize --response-format diarized_json`  |

For word-level timing, request the verbose transcription shape:

Command:

```bash
openai audio:transcriptions create \
  --model whisper-1 \
  --file ./speech.mp3 \
  --response-format verbose_json \
  --timestamp-granularity word \
  --format json
```

Output:

```json
{
  "task": "transcribe",
  "language": "english",
  "duration": 6,
  "text": "The OpenAI CLI can call the API from ordinary shell scripts.",
  "words": [
    { "word": "The", "start": 0, "end": 0.42 },
    { "word": "OpenAI", "start": 0.42, "end": 1.22 }
  ],
  "...": "additional response fields omitted"
}
```
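
Once the `verbose_json` response is saved, the `words` array can be post-processed in a few lines. This is a rough sketch, assuming the response shape shown in the sample output above; the helper name `word_timings` and the output line format are illustrative choices, not something the CLI produces.

```python
import json


def word_timings(response_json: str) -> list[str]:
    """Format each entry of the `words` array as 'start-end word'."""
    payload = json.loads(response_json)
    return [
        f"{w['start']:.2f}-{w['end']:.2f} {w['word']}"
        for w in payload.get("words", [])
    ]


# Sample trimmed from the verbose_json output shown above.
sample = json.dumps(
    {
        "text": "The OpenAI CLI can call the API from ordinary shell scripts.",
        "words": [
            {"word": "The", "start": 0, "end": 0.42},
            {"word": "OpenAI", "start": 0.42, "end": 1.22},
        ],
    }
)

lines = word_timings(sample)
for line in lines:
    print(line)
```

A response without a `words` array yields an empty list, so the helper can be pointed at any transcription format without raising.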

For speaker-labeled output, use the diarization model and request `diarized_json`:

Command:

```bash
openai audio:transcriptions create \
  --model gpt-4o-transcribe-diarize \
  --file ./speech.mp3 \
  --response-format diarized_json \
  --format json
```

Output:

```json
{
  "text": "The OpenAI CLI can call the API from ordinary shell scripts.",
  "segments": [
    {
      "type": "transcript.text.segment",
      "id": "seg_0",
      "start": 0.05,
      "end": 5.25,
      "text": " The OpenAI CLI can call the API from ordinary shell scripts.",
      "speaker": "A"
    }
  ],
  "...": "additional response fields omitted"
}
```

`whisper-1` supports `json`, `text`, `srt`, `verbose_json`, and `vtt`. `diarized_json` is the format that carries `segments[].speaker`; with the same diarization model and plain `json`, the response contains transcript text but not speaker labels.
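
To illustrate why `segments[].speaker` matters, here is a small sketch that groups transcript text by speaker label. It assumes the `diarized_json` shape shown in the sample output above; the helper name `text_by_speaker` and the `"unknown"` fallback label are hypothetical.

```python
import json
from collections import defaultdict


def text_by_speaker(response_json: str) -> dict[str, str]:
    """Join segment text per speaker label, in the order segments appear."""
    payload = json.loads(response_json)
    grouped: dict[str, list[str]] = defaultdict(list)
    for segment in payload.get("segments", []):
        # "unknown" is a fallback chosen for this sketch, not an API value.
        grouped[segment.get("speaker", "unknown")].append(segment["text"].strip())
    return {speaker: " ".join(parts) for speaker, parts in grouped.items()}


# Sample trimmed from the diarized_json output shown above.
diarized = json.dumps(
    {
        "text": "The OpenAI CLI can call the API from ordinary shell scripts.",
        "segments": [
            {
                "id": "seg_0",
                "start": 0.05,
                "end": 5.25,
                "text": " The OpenAI CLI can call the API from ordinary shell scripts.",
                "speaker": "A",
            }
        ],
    }
)

by_speaker = text_by_speaker(diarized)
print(by_speaker)
```

With a plain `json` response there is no `segments` array, so the helper returns an empty mapping, which mirrors the note above that speaker labels are only present in `diarized_json`.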

## Admin APIs

Use Admin APIs for organization management, credential provisioning, compliance, and usage-monitoring workflows. Set `OPENAI_ADMIN_KEY`, then call the generated `admin:organization:*` commands.

To provision a new machine credential, [create a project](https://developers.openai.com/api/reference/resources/admin/subresources/organization/subresources/projects/methods/create), [create a service account](https://developers.openai.com/api/reference/resources/admin/subresources/organization/subresources/projects/subresources/service_accounts/methods/create) inside that project, and use the returned API key.

### Create a project, service account, and API key

Creating a service account in that project returns an unredacted API key for the service account.

Command:

```bash
# Create the project that will own this app or agent and save the response.
openai admin:organization:projects create \
  --name "automation project" \
  --format json > project.json
PROJECT_ID="$(jq -r '.id' project.json)"

# Create a service account inside the project and save the full response.
openai admin:organization:projects:service-accounts create \
  --project-id "$PROJECT_ID" \
  --name "automation bot" \
  --format json > service-account.json

# Extract the returned API key into an env file for the workload to use.
jq -r '.api_key.value | "OPENAI_API_KEY=\(.)"' \
  service-account.json > .env
```

Output:

```json
{
  "object": "organization.project.service_account",
  "id": "svc_acct_...",
  "name": "automation bot",
  "role": "member",
  "api_key": {
    "id": "key_...",
    "value": "sk-..."
  }
}
```

This writes the project response to `project.json`, parses its ID into the next command, writes the service-account response to `service-account.json`, and writes the returned credential to `.env` as `OPENAI_API_KEY=...`. Treat both JSON files as secrets, and add `project.json`, `service-account.json`, and `.env` to `.gitignore` before using this pattern in a repository.
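
Because the `.env` file holds a live credential, a provisioning script may want to create it with owner-only permissions instead of relying on the shell's default umask. The sketch below is one possible approach, assuming the service-account response shape shown above; the helper name `write_env_file` and the placeholder key are hypothetical, and the `0o600` mode only takes effect when the file is newly created on a POSIX system.

```python
import json
import os
import tempfile
from pathlib import Path


def write_env_file(service_account_json: str, env_path: str) -> None:
    """Write OPENAI_API_KEY=<value> from a service-account response to env_path."""
    key = json.loads(service_account_json)["api_key"]["value"]
    # Request owner read/write only; the mode applies when the file is created.
    fd = os.open(env_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(f"OPENAI_API_KEY={key}\n")


# Placeholder response: the key value is a stand-in, not a real credential.
response = json.dumps(
    {
        "object": "organization.project.service_account",
        "name": "automation bot",
        "api_key": {"id": "key_placeholder", "value": "sk-placeholder"},
    }
)

with tempfile.TemporaryDirectory() as tmp:
    env_file = Path(tmp) / ".env"
    write_env_file(response, str(env_file))
    env_text = env_file.read_text()

print(env_text, end="")
```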

For the rest of the surface, see the [Admin APIs guide](https://developers.openai.com/api/docs/guides/admin-apis) and the current [Administration API reference](https://developers.openai.com/api/reference/administration/overview). Be careful about giving unvetted actors access to admin keys.

mcp.md +58 −50


28 28

29To work with ChatGPT deep research and company knowledge (and deep research via API), your MCP server should implement two read-only tools: `search` and `fetch`, using the compatibility schema in [Company knowledge compatibility](https://developers.openai.com/apps-sdk/build/mcp-server#company-knowledge-compatibility).29To work with ChatGPT deep research and company knowledge (and deep research via API), your MCP server should implement two read-only tools: `search` and `fetch`, using the compatibility schema in [Company knowledge compatibility](https://developers.openai.com/apps-sdk/build/mcp-server#company-knowledge-compatibility).

30 30

31Declare an output schema for each tool so clients can validate the result shape.

32In FastMCP, typed return models can generate this schema automatically; the

33example below passes `output_schema` explicitly from the same models.

31### `search` tool35### `search` tool

32 36

33The `search` tool is responsible for returning a list of relevant search results from your MCP server's data source, given a user's query.37The `search` tool is responsible for returning a list of relevant search results from your MCP server's data source, given a user's query.

44- `title` - human-readable title.48- `title` - human-readable title.

45- `url` - canonical URL for citation.49- `url` - canonical URL for citation.

46 50

47In MCP, tool results must be returned as [a content array](https://modelcontextprotocol.io/docs/learn/architecture#understanding-the-tool-execution-response) containing one or more "content items." Each content item has a type (such as `text`, `image`, or `resource`) and a payload.51In MCP, return this object as `structuredContent` and include the same value as

48 52a JSON-encoded string in the [content array](https://modelcontextprotocol.io/docs/learn/architecture#understanding-the-tool-execution-response)

~~49For the `search` tool, you should return **exactly one** content item with:~~53for compatibility.

~~51- `type: "text"`~~

~~52- `text`: a JSON-encoded string matching the results array schema above.~~

53 54

54The final tool response should look like:55The final tool response should look like:

55 56

56```json57```json

57{58{

59 "structuredContent": {

60 "results": [{ "id": "doc-1", "title": "...", "url": "..." }]

61 },

58 "content": [62 "content": [

59 {63 {

60 "type": "text",64 "type": "text",

83 specific resources in research.87 specific resources in research.

84- `metadata` - an optional key/value pairing of data about the result88- `metadata` - an optional key/value pairing of data about the result

85 89

86In MCP, tool results must be returned as [a content array](https://modelcontextprotocol.io/docs/learn/architecture#understanding-the-tool-execution-response) containing one or more "content items." Each content item has a `type` (such as `text`, `image`, or `resource`) and a payload.90In MCP, return this object as `structuredContent` and include the same value as

87 91a JSON-encoded string in the content array for compatibility.

88In this case, the `fetch` tool must return exactly [one content item with `type: "text"`](https://modelcontextprotocol.io/specification/2025-06-18/server/tools#tool-result). The `text` field should be a JSON-encoded string of the document object following the schema above.

89 92
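
As a rough sketch of the rule above, a helper could build both fields from a single object so that `structuredContent` and the JSON-encoded text item cannot drift apart. The helper name `build_tool_result` is hypothetical and is not part of FastMCP or the MCP SDK; a real server framework may assemble this envelope for you.

```python
import json
from typing import Any


def build_tool_result(structured: dict[str, Any]) -> dict[str, Any]:
    """Return the object as structuredContent plus the same value as JSON text."""
    return {
        "structuredContent": structured,
        "content": [{"type": "text", "text": json.dumps(structured)}],
    }


# Placeholder document matching the fetch schema fields described above.
document = {
    "id": "doc-1",
    "title": "...",
    "text": "full text...",
    "url": "https://example.com/doc",
    "metadata": {"source": "vector_store"},
}

result = build_tool_result(document)
print(json.dumps(result, indent=2))
```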

90The final tool response should look like:93The final tool response should look like:

91 94

92```json95```json

93{96{

97 "structuredContent": {

98 "id": "doc-1",

99 "title": "...",

100 "text": "full text...",

101 "url": "https://example.com/doc",

102 "metadata": { "source": "vector_store" }

103 },

94 "content": [104 "content": [

95 {105 {

96 "type": "text",106 "type": "text",

128 138

129import logging139import logging

130import os140import os

131from typing import Dict, List, Any141from typing import Any

132 142

133from fastmcp import FastMCP143from fastmcp import FastMCP

134from openai import OpenAI144from openai import OpenAI

145from pydantic import BaseModel

146

147

148class SearchResult(BaseModel):

149 id: str

150 title: str

151 url: str

152

153

154class SearchOutput(BaseModel):

155 results: list[SearchResult]

156

157

158class FetchOutput(BaseModel):

159 id: str

160 title: str

161 text: str

162 url: str

163 metadata: dict[str, Any] | None = None

135 164

136# Configure logging165# Configure logging

137logging.basicConfig(level=logging.INFO)166logging.basicConfig(level=logging.INFO)

159 mcp = FastMCP(name="Sample MCP Server",188 mcp = FastMCP(name="Sample MCP Server",

160 instructions=server_instructions)189 instructions=server_instructions)

161 190

162 @mcp.tool()191 @mcp.tool(output_schema=SearchOutput.model_json_schema())

163 async def search(query: str) -> Dict[str, List[Dict[str, Any]]]:192 async def search(query: str) -> SearchOutput:

164 """193 """

165 Search for documents using OpenAI Vector Store search.194 Search for documents using OpenAI Vector Store search.

166 195

173 202

174 Returns:203 Returns:

175 Dictionary with 'results' key containing list of matching documents.204 Dictionary with 'results' key containing list of matching documents.

176 Each result includes id, title, text snippet, and optional URL.205 Each result includes id, title, and URL.

177 """206 """

178 if not query or not query.strip():207 if not query or not query.strip():

179 return {"results": []}208 return SearchOutput(results=[])

180 209

181 if not openai_client:210 if not openai_client:

182 logger.error("OpenAI client not initialized - API key missing")211 logger.error("OpenAI client not initialized - API key missing")

198 item_id = getattr(item, 'file_id', f"vs_{i}")227 item_id = getattr(item, 'file_id', f"vs_{i}")

199 item_filename = getattr(item, 'filename', f"Document {i+1}")228 item_filename = getattr(item, 'filename', f"Document {i+1}")

200 229

201 # Extract text content from the content array230 result = SearchResult(

202 content_list = getattr(item, 'content', [])231 id=item_id,

203 text_content = ""232 title=item_filename,

204 if content_list and len(content_list) > 0:233 url=f"https://platform.openai.com/storage/files/{item_id}",

205 # Get text from the first content item234 )

206 first_content = content_list[0]

207 if hasattr(first_content, 'text'):

208 text_content = first_content.text

209 elif isinstance(first_content, dict):

210 text_content = first_content.get('text', '')

~~211~~

212 if not text_content:

213 text_content = "No content available"

~~214~~

215 # Create a snippet from content

216 text_snippet = text_content[:200] + "..." if len(

217 text_content) > 200 else text_content

~~218~~

219 result = {

220 "id": item_id,

221 "title": item_filename,

222 "text": text_snippet,

223 "url":

224 f"https://platform.openai.com/storage/files/{item_id}"

225 }

226 235

227 results.append(result)236 results.append(result)

228 237

229 logger.info(f"Vector store search returned {len(results)} results")238 logger.info(f"Vector store search returned {len(results)} results")

230 return {"results": results}239 return SearchOutput(results=results)

231 240

232 @mcp.tool()241 @mcp.tool(output_schema=FetchOutput.model_json_schema())

233 async def fetch(id: str) -> Dict[str, Any]:242 async def fetch(id: str) -> FetchOutput:

234 """243 """

235 Retrieve complete document content by ID for detailed244 Retrieve complete document content by ID for detailed

236 analysis and citation. This tool fetches the full document245 analysis and citation. This tool fetches the full document

281 # Use filename as title and create proper URL for citations290 # Use filename as title and create proper URL for citations

282 filename = getattr(file_info, 'filename', f"Document {id}")291 filename = getattr(file_info, 'filename', f"Document {id}")

283 292

284 result = {293 result = FetchOutput(

285 "id": id,294 id=id,

286 "title": filename,295 title=filename,

287 "text": file_content,296 text=file_content,

288 "url": f"https://platform.openai.com/storage/files/{id}",297 url=f"https://platform.openai.com/storage/files/{id}",

289 "metadata": None298 )

290 }

291 299

292 # Add metadata if available from file info300 # Add metadata if available from file info

293 if hasattr(file_info, 'attributes') and file_info.attributes:301 if hasattr(file_info, 'attributes') and file_info.attributes:

294 result["metadata"] = file_info.attributes302 result.metadata = dict(file_info.attributes)

295 303

296 logger.info(f"Fetched vector store file: {id}")304 logger.info(f"Fetched vector store file: {id}")

297 return result305 return result

Documentation 2026-05-06 00:01 UTC to 2026-05-07 21:57 UTC