python/resources/realtime/subresources/client_secrets/index.md
1# Client Secrets
2
3## Create client secret
4
5`realtime.client_secrets.create(ClientSecretCreateParams**kwargs) -> ClientSecretCreateResponse`
6
7**post** `/realtime/client_secrets`
8
9Create a Realtime client secret with an associated session configuration.
10
11Client secrets are short-lived tokens that can be passed to a client app,
12such as a web frontend or mobile client, to grant access to the Realtime API without
13leaking your main API key. You can configure a custom TTL for each client secret.
14
15You can also attach session configuration options to the client secret, which will be
16applied to any sessions created using that client secret, but these can also be overridden
17by the client connection.
18
19[Learn more about authentication with client secrets over WebRTC](https://platform.openai.com/docs/guides/realtime-webrtc).
20
21Returns the created client secret and the effective session object. The client secret is a string that looks like `ek_1234`.
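
As a minimal sketch of the request described above, the body for `POST /realtime/client_secrets` can be assembled as a plain dictionary. The helper name `build_client_secret_request` is hypothetical and only illustrates the shape of the two top-level parameters, `expires_after` and `session`.

```python
import json


def build_client_secret_request(ttl_seconds=600, session=None):
    """Assemble a body for POST /realtime/client_secrets (hypothetical helper)."""
    body = {"expires_after": {"anchor": "created_at", "seconds": ttl_seconds}}
    if session is not None:
        # Session options attached here apply to sessions created with the secret.
        body["session"] = session
    return body


body = build_client_secret_request(
    ttl_seconds=300,
    session={"type": "realtime", "model": "gpt-realtime"},
)
print(json.dumps(body, indent=2))
```

Given the keyword-argument signature shown above, these keys would presumably be passed as `client.realtime.client_secrets.create(**body)`, where `client` is assumed to be a configured SDK client instance.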
22
23### Parameters
24
25- `expires_after: Optional[ExpiresAfter]`
26
27 Configuration for the client secret expiration. Expiration refers to the time after which
28 a client secret will no longer be valid for creating sessions. The session itself may
29 continue after that time once started. A secret can be used to create multiple sessions
30 until it expires.
31
32 - `anchor: Optional[Literal["created_at"]]`
33
34 The anchor point for the client secret expiration, meaning that `seconds` will be added to the `created_at` time of the client secret to produce an expiration timestamp. Only `created_at` is currently supported.
35
36 - `"created_at"`
37
38 - `seconds: Optional[int]`
39
40 The number of seconds from the anchor point to the expiration. Select a value between `10` and `7200` (2 hours). This defaults to 600 seconds (10 minutes) if not specified.
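
The expiration arithmetic can be sketched as follows. This is an illustration only: `expiration_time` is a hypothetical helper, and treating the `10` to `7200` range as inclusive is an assumption.

```python
from datetime import datetime, timedelta, timezone

# Bounds and default taken from the parameter description, in seconds.
MIN_TTL, MAX_TTL, DEFAULT_TTL = 10, 7200, 600


def expiration_time(created_at, seconds=None):
    """Return created_at + seconds; `created_at` is the only supported anchor."""
    ttl = DEFAULT_TTL if seconds is None else seconds
    if not MIN_TTL <= ttl <= MAX_TTL:  # inclusive bounds are an assumption
        raise ValueError(f"seconds must be between {MIN_TTL} and {MAX_TTL}, got {ttl}")
    return created_at + timedelta(seconds=ttl)


created = datetime(2025, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
print(expiration_time(created))        # default TTL of 10 minutes
print(expiration_time(created, 7200))  # maximum TTL of 2 hours
```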
41
42- `session: Optional[Session]`
43
44 Session configuration to use for the client secret. Choose either a realtime
45 session or a transcription session.
46
47 - `class RealtimeSessionCreateRequest: …`
48
49 Realtime session object configuration.
50
51 - `type: Literal["realtime"]`
52
53 The type of session to create. Always `realtime` for the Realtime API.
54
55 - `"realtime"`
56
57 - `audio: Optional[RealtimeAudioConfig]`
58
59 Configuration for input and output audio.
60
61 - `input: Optional[RealtimeAudioConfigInput]`
62
63 - `format: Optional[RealtimeAudioFormats]`
64
65 The format of the input audio.
66
67 - `class AudioPCM: …`
68
69 The PCM audio format. Only a 24kHz sample rate is supported.
70
71 - `rate: Optional[Literal[24000]]`
72
73 The sample rate of the audio. Always `24000`.
74
75 - `24000`
76
77 - `type: Optional[Literal["audio/pcm"]]`
78
79 The audio format. Always `audio/pcm`.
80
81 - `"audio/pcm"`
82
83 - `class AudioPCMU: …`
84
85 The G.711 μ-law format.
86
87 - `type: Optional[Literal["audio/pcmu"]]`
88
89 The audio format. Always `audio/pcmu`.
90
91 - `"audio/pcmu"`
92
93 - `class AudioPCMA: …`
94
95 The G.711 A-law format.
96
97 - `type: Optional[Literal["audio/pcma"]]`
98
99 The audio format. Always `audio/pcma`.
100
101 - `"audio/pcma"`
102
103 - `noise_reduction: Optional[NoiseReduction]`
104
105 Configuration for input audio noise reduction. This can be set to `null` to turn off.
106 Noise reduction filters audio added to the input audio buffer before it is sent to VAD and the model.
107 Filtering the audio can improve VAD and turn detection accuracy (reducing false positives) and model performance by improving perception of the input audio.
108
109 - `type: Optional[NoiseReductionType]`
110
111 Type of noise reduction. `near_field` is for close-talking microphones such as headphones, `far_field` is for far-field microphones such as laptop or conference room microphones.
112
113 - `"near_field"`
114
115 - `"far_field"`
116
117 - `transcription: Optional[AudioTranscription]`
118
119 Configuration for input audio transcription. Defaults to off and can be set to `null` to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through [the /audio/transcriptions endpoint](https://platform.openai.com/docs/api-reference/audio/createTranscription) and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service.
120
121 - `delay: Optional[Literal["minimal", "low", "medium", 2 more]]`
122
123 Controls how long the model waits before emitting transcription text.
124 Higher values can improve transcription accuracy at the cost of latency.
125 Only supported with `gpt-realtime-whisper` in GA Realtime sessions.
126
127 - `"minimal"`
128
129 - `"low"`
130
131 - `"medium"`
132
133 - `"high"`
134
135 - `"xhigh"`
136
137 - `language: Optional[str]`
138
139 The language of the input audio. Supplying the input language in
140 [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) (e.g. `en`) format
141 will improve accuracy and latency.
142
143 - `model: Optional[Union[str, Literal["whisper-1", "gpt-4o-mini-transcribe", "gpt-4o-mini-transcribe-2025-12-15", 3 more], null]]`
144
145 The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels.
146
147 - `str`
148
149 - `Literal["whisper-1", "gpt-4o-mini-transcribe", "gpt-4o-mini-transcribe-2025-12-15", 3 more]`
150
151 The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels.
152
153 - `"whisper-1"`
154
155 - `"gpt-4o-mini-transcribe"`
156
157 - `"gpt-4o-mini-transcribe-2025-12-15"`
158
159 - `"gpt-4o-transcribe"`
160
161 - `"gpt-4o-transcribe-diarize"`
162
163 - `"gpt-realtime-whisper"`
164
165 - `prompt: Optional[str]`
166
167 An optional text to guide the model's style or continue a previous audio
168 segment.
169 For `whisper-1`, the [prompt is a list of keywords](https://platform.openai.com/docs/guides/speech-to-text#prompting).
170 For `gpt-4o-transcribe` models (excluding `gpt-4o-transcribe-diarize`), the prompt is a free text string, for example "expect words related to technology".
171 Prompt is not supported with `gpt-realtime-whisper` in GA Realtime sessions.
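
The model-specific constraints on `delay` and `prompt` can be captured in a small client-side check. This is a sketch under stated assumptions: `build_transcription_config` is a hypothetical helper, and the server remains the authority on what it accepts.

```python
DELAYS = ("minimal", "low", "medium", "high", "xhigh")


def build_transcription_config(model, language=None, prompt=None, delay=None):
    """Return a transcription dict, applying the documented per-model limits."""
    config = {"model": model}
    if language is not None:
        config["language"] = language  # ISO-639-1 code such as "en"
    if prompt is not None:
        if model == "gpt-realtime-whisper":
            raise ValueError("prompt is not supported with gpt-realtime-whisper")
        config["prompt"] = prompt
    if delay is not None:
        if model != "gpt-realtime-whisper":
            raise ValueError("delay is only supported with gpt-realtime-whisper")
        if delay not in DELAYS:
            raise ValueError(f"delay must be one of {DELAYS}")
        config["delay"] = delay
    return config


print(build_transcription_config("gpt-4o-transcribe", language="en",
                                 prompt="expect words related to technology"))
print(build_transcription_config("gpt-realtime-whisper", delay="low"))
```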
172
173 - `turn_detection: Optional[RealtimeAudioInputTurnDetection]`
174
175 Configuration for turn detection, either Server VAD or Semantic VAD. This can be set to `null` to turn off, in which case the client must manually trigger model response.
176
177 Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.
178
179 Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.
180
181 For `gpt-realtime-whisper` transcription sessions, turn detection must be
182 set to `null`; VAD is not supported.
183
184 - `class ServerVad: …`
185
186 Server-side voice activity detection (VAD) which flips on when user speech is detected and off after a period of silence.
187
188 - `type: Literal["server_vad"]`
189
190 Type of turn detection, `server_vad` to turn on simple Server VAD.
191
192 - `"server_vad"`
193
194 - `create_response: Optional[bool]`
195
196 Whether or not to automatically generate a response when a VAD stop event occurs. If `interrupt_response` is set to `false` this may fail to create a response if the model is already responding.
197
198 If both `create_response` and `interrupt_response` are set to `false`, the model will never respond automatically but VAD events will still be emitted.
199
200 - `idle_timeout_ms: Optional[int]`
201
202 Optional timeout after which a model response will be triggered automatically. This is
203 useful for situations in which a long pause from the user is unexpected, such as a phone
204 call. The model will effectively prompt the user to continue the conversation based
205 on the current context.
206
207 The timeout value will be applied after the last model response's audio has finished playing,
208 i.e. it's set to the `response.done` time plus audio playback duration.
209
210 An `input_audio_buffer.timeout_triggered` event (plus events
211 associated with the Response) will be emitted when the timeout is reached.
212 Idle timeout is currently only supported for `server_vad` mode.
213
214 - `interrupt_response: Optional[bool]`
215
216 Whether or not to automatically interrupt (cancel) any ongoing response with output to the default
217 conversation (i.e. `conversation` of `auto`) when a VAD start event occurs. If `true` then the response will be cancelled, otherwise it will continue until complete.
218
219 If both `create_response` and `interrupt_response` are set to `false`, the model will never respond automatically but VAD events will still be emitted.
220
221 - `prefix_padding_ms: Optional[int]`
222
223 Used only for `server_vad` mode. Amount of audio to include before the VAD detected speech (in
224 milliseconds). Defaults to 300ms.
225
226 - `silence_duration_ms: Optional[int]`
227
228 Used only for `server_vad` mode. Duration of silence to detect speech stop (in milliseconds). Defaults
229 to 500ms. With shorter values the model will respond more quickly,
230 but may jump in on short pauses from the user.
231
232 - `threshold: Optional[float]`
233
234 Used only for `server_vad` mode. Activation threshold for VAD (0.0 to 1.0); this defaults to 0.5. A
235 higher threshold will require louder audio to activate the model, and
236 thus might perform better in noisy environments.
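
A `server_vad` configuration using the documented defaults might be built as in the sketch below. `build_server_vad` is a hypothetical helper, and rejecting negative durations is an assumption rather than documented behavior.

```python
def build_server_vad(threshold=0.5, prefix_padding_ms=300, silence_duration_ms=500):
    """Return a `server_vad` turn-detection dict with the documented defaults."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be within 0.0 to 1.0")
    if prefix_padding_ms < 0 or silence_duration_ms < 0:
        # Assumption: negative millisecond values are never meaningful.
        raise ValueError("durations are in milliseconds and cannot be negative")
    return {
        "type": "server_vad",
        "threshold": threshold,
        "prefix_padding_ms": prefix_padding_ms,
        "silence_duration_ms": silence_duration_ms,
    }


# A higher threshold and longer silence window for a noisy environment.
noisy_room = build_server_vad(threshold=0.8, silence_duration_ms=700)
print(noisy_room)
```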
237
238 - `class SemanticVad: …`
239
240 Server-side semantic turn detection which uses a model to determine when the user has finished speaking.
241
242 - `type: Literal["semantic_vad"]`
243
244 Type of turn detection, `semantic_vad` to turn on Semantic VAD.
245
246 - `"semantic_vad"`
247
248 - `create_response: Optional[bool]`
249
250 Whether or not to automatically generate a response when a VAD stop event occurs.
251
252 - `eagerness: Optional[Literal["low", "medium", "high", "auto"]]`
253
254 Used only for `semantic_vad` mode. The eagerness of the model to respond. `low` will wait longer for the user to continue speaking, `high` will respond more quickly. `auto` is the default and is equivalent to `medium`. `low`, `medium`, and `high` have max timeouts of 8s, 4s, and 2s respectively.
255
256 - `"low"`
257
258 - `"medium"`
259
260 - `"high"`
261
262 - `"auto"`
263
264 - `interrupt_response: Optional[bool]`
265
266 Whether or not to automatically interrupt any ongoing response with output to the default
267 conversation (i.e. `conversation` of `auto`) when a VAD start event occurs.
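
The relationship between `eagerness` and the maximum wait described for `semantic_vad` can be written as a lookup. This is only a sketch of the documented values; `max_timeout_seconds` is a hypothetical helper.

```python
# Maximum timeouts in seconds, as listed in the `eagerness` description.
MAX_TIMEOUT_SECONDS = {"low": 8, "medium": 4, "high": 2}


def max_timeout_seconds(eagerness="auto"):
    """Return the maximum wait for an eagerness level; `auto` behaves like `medium`."""
    level = "medium" if eagerness == "auto" else eagerness
    if level not in MAX_TIMEOUT_SECONDS:
        raise ValueError(f"unknown eagerness: {eagerness!r}")
    return MAX_TIMEOUT_SECONDS[level]


semantic_vad = {"type": "semantic_vad", "eagerness": "low"}
print(max_timeout_seconds(semantic_vad["eagerness"]))
```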
268
269 - `output: Optional[RealtimeAudioConfigOutput]`
270
271 - `format: Optional[RealtimeAudioFormats]`
272
273 The format of the output audio.
274
275 - `speed: Optional[float]`
276
277 The speed of the model's spoken response as a multiple of the original speed.
278 1.0 is the default speed. 0.25 is the minimum speed. 1.5 is the maximum speed. This value can only be changed in between model turns, not while a response is in progress.
279
280 This parameter is a post-processing adjustment to the audio after it is generated; it's
281 also possible to prompt the model to speak faster or slower.
282
283 - `voice: Optional[Voice]`
284
285 The voice the model uses to respond. Supported built-in voices are
286 `alloy`, `ash`, `ballad`, `coral`, `echo`, `sage`, `shimmer`, `verse`,
287 `marin`, and `cedar`. You may also provide a custom voice object with
288 an `id`, for example `{ "id": "voice_1234" }`. Voice cannot be changed
289 during the session once the model has responded with audio at least once.
290 We recommend `marin` and `cedar` for best quality.
291
292 - `str`
293
294 - `Literal["alloy", "ash", "ballad", 7 more]`
295
296 - `"alloy"`
297
298 - `"ash"`
299
300 - `"ballad"`
301
302 - `"coral"`
303
304 - `"echo"`
305
306 - `"sage"`
307
308 - `"shimmer"`
309
310 - `"verse"`
311
312 - `"marin"`
313
314 - `"cedar"`
315
316 - `class VoiceID: …`
317
318 Custom voice reference.
319
320 - `id: str`
321
322 The custom voice ID, e.g. `voice_1234`.
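
The two voice forms and the speed limits for audio output can be sketched together. `build_audio_output` and its `custom_voice_id` argument are hypothetical names, and `voice_1234` is the placeholder ID used in the description above.

```python
MIN_SPEED, MAX_SPEED = 0.25, 1.5  # documented bounds; 1.0 is the default


def build_audio_output(voice="marin", custom_voice_id=None, speed=1.0):
    """Return an audio output dict with a built-in or custom voice."""
    if not MIN_SPEED <= speed <= MAX_SPEED:
        raise ValueError(f"speed must be between {MIN_SPEED} and {MAX_SPEED}")
    # A custom voice is passed as an object with an `id`; built-ins are plain strings.
    voice_value = {"id": custom_voice_id} if custom_voice_id else voice
    return {"voice": voice_value, "speed": speed}


print(build_audio_output(voice="cedar", speed=1.25))
print(build_audio_output(custom_voice_id="voice_1234"))
```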
323
324 - `include: Optional[List[Literal["item.input_audio_transcription.logprobs"]]]`
325
326 Additional fields to include in server outputs.
327
328 `item.input_audio_transcription.logprobs`: Include logprobs for input audio transcription.
329
330 - `"item.input_audio_transcription.logprobs"`
331
332 - `instructions: Optional[str]`
333
334 The default system instructions (i.e. system message) prepended to model calls. This field allows the client to guide the model on desired responses. The model can be instructed on response content and format, (e.g. "be extremely succinct", "act friendly", "here are examples of good responses") and on audio behavior (e.g. "talk quickly", "inject emotion into your voice", "laugh frequently"). The instructions are not guaranteed to be followed by the model, but they provide guidance to the model on the desired behavior.
335
336 Note that the server sets default instructions which will be used if this field is not set and are visible in the `session.created` event at the start of the session.
337
338 - `max_output_tokens: Optional[Union[int, Literal["inf"], null]]`
339
340 Maximum number of output tokens for a single assistant response,
341 inclusive of tool calls. Provide an integer between 1 and 4096 to
342 limit output tokens, or `inf` for the maximum available tokens for a
343 given model. Defaults to `inf`.
344
345 - `int`
346
347 - `Literal["inf"]`
348
349 - `"inf"`
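
A client-side check for `max_output_tokens` could look like the sketch below. `validate_max_output_tokens` is a hypothetical helper that mirrors the documented range.

```python
def validate_max_output_tokens(value="inf"):
    """Accept the literal "inf" or an integer from 1 to 4096."""
    if value == "inf":
        return value
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, int) and not isinstance(value, bool) and 1 <= value <= 4096:
        return value
    raise ValueError('max_output_tokens must be "inf" or an integer between 1 and 4096')


print(validate_max_output_tokens())
print(validate_max_output_tokens(1024))
```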
350
351 - `model: Optional[Union[str, Literal["gpt-realtime", "gpt-realtime-1.5", "gpt-realtime-2", 14 more], null]]`
352
353 The Realtime model used for this session.
354
355 - `str`
356
357 - `Literal["gpt-realtime", "gpt-realtime-1.5", "gpt-realtime-2", 14 more]`
358
359 The Realtime model used for this session.
360
361 - `"gpt-realtime"`
362
363 - `"gpt-realtime-1.5"`
364
365 - `"gpt-realtime-2"`
366
367 - `"gpt-realtime-2025-08-28"`
368
369 - `"gpt-4o-realtime-preview"`
370
371 - `"gpt-4o-realtime-preview-2024-10-01"`
372
373 - `"gpt-4o-realtime-preview-2024-12-17"`
374
375 - `"gpt-4o-realtime-preview-2025-06-03"`
376
377 - `"gpt-4o-mini-realtime-preview"`
378
379 - `"gpt-4o-mini-realtime-preview-2024-12-17"`
380
381 - `"gpt-realtime-mini"`
382
383 - `"gpt-realtime-mini-2025-10-06"`
384
385 - `"gpt-realtime-mini-2025-12-15"`
386
387 - `"gpt-audio-1.5"`
388
389 - `"gpt-audio-mini"`
390
391 - `"gpt-audio-mini-2025-10-06"`
392
393 - `"gpt-audio-mini-2025-12-15"`
394
395 - `output_modalities: Optional[List[Literal["text", "audio"]]]`
396
397 The set of modalities the model can respond with. It defaults to `["audio"]`, indicating
398 that the model will respond with audio plus a transcript. `["text"]` can be used to make
399 the model respond with text only. It is not possible to request both `text` and `audio` at the same time.
400
401 - `"text"`
402
403 - `"audio"`
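
Because `text` and `audio` cannot be requested together, a client might normalize the value before sending it. A minimal sketch, with `validate_output_modalities` as a hypothetical helper:

```python
def validate_output_modalities(modalities=None):
    """Return ["audio"] by default; allow exactly one of "text" or "audio"."""
    chosen = ["audio"] if modalities is None else list(modalities)
    if chosen not in (["audio"], ["text"]):
        raise ValueError('output_modalities must be ["audio"] or ["text"]')
    return chosen


print(validate_output_modalities())
print(validate_output_modalities(["text"]))
```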
404
405 - `parallel_tool_calls: Optional[bool]`
406
407 Whether the model may call multiple tools in parallel. Only supported by
408 reasoning Realtime models such as `gpt-realtime-2`.
409
410 - `prompt: Optional[ResponsePrompt]`
411
412 Reference to a prompt template and its variables.
413 [Learn more](https://platform.openai.com/docs/guides/text?api-mode=responses#reusable-prompts).
414
415 - `id: str`
416
417 The unique identifier of the prompt template to use.
418
419 - `variables: Optional[Dict[str, Variables]]`
420
421 Optional map of values to substitute in for variables in your
422 prompt. The substitution values can either be strings, or other
423 Response input types like images or files.
424
425 - `str`
426
427 - `class ResponseInputText: …`
428
429 A text input to the model.
430
431 - `text: str`
432
433 The text input to the model.
434
435 - `type: Literal["input_text"]`
436
437 The type of the input item. Always `input_text`.
438
439 - `"input_text"`
440
441 - `class ResponseInputImage: …`
442
443 An image input to the model. Learn about [image inputs](https://platform.openai.com/docs/guides/vision).
444
445 - `detail: Literal["low", "high", "auto", "original"]`
446
447 The detail level of the image to be sent to the model. One of `high`, `low`, `auto`, or `original`. Defaults to `auto`.
448
449 - `"low"`
450
451 - `"high"`
452
453 - `"auto"`
454
455 - `"original"`
456
457 - `type: Literal["input_image"]`
458
459 The type of the input item. Always `input_image`.
460
461 - `"input_image"`
462
463 - `file_id: Optional[str]`
464
465 The ID of the file to be sent to the model.
466
467 - `image_url: Optional[str]`
468
469 The URL of the image to be sent to the model. A fully qualified URL or base64 encoded image in a data URL.
470
471 - `class ResponseInputFile: …`
472
473 A file input to the model.
474
475 - `type: Literal["input_file"]`
476
477 The type of the input item. Always `input_file`.
478
479 - `"input_file"`
480
481 - `detail: Optional[Literal["low", "high"]]`
482
483 The detail level of the file to be sent to the model. Use `low` for the default rendering behavior, or `high` to render the file at higher quality. Defaults to `low`.
484
485 - `"low"`
486
487 - `"high"`
488
489 - `file_data: Optional[str]`
490
491 The content of the file to be sent to the model.
492
493 - `file_id: Optional[str]`
494
495 The ID of the file to be sent to the model.
496
497 - `file_url: Optional[str]`
498
499 The URL of the file to be sent to the model.
500
501 - `filename: Optional[str]`
502
503 The name of the file to be sent to the model.
504
505 - `version: Optional[str]`
506
507 Optional version of the prompt template.
508
509 - `reasoning: Optional[RealtimeReasoning]`
510
511 Configuration for reasoning-capable Realtime models such as `gpt-realtime-2`.
512
513 - `effort: Optional[RealtimeReasoningEffort]`
514
515 Constrains effort on reasoning for reasoning-capable Realtime models such as
516 `gpt-realtime-2`.
517
518 - `"minimal"`
519
520 - `"low"`
521
522 - `"medium"`
523
524 - `"high"`
525
526 - `"xhigh"`
527
528 - `tool_choice: Optional[RealtimeToolChoiceConfig]`
529
530 How the model chooses tools. Provide one of the string modes or force a specific
531 function/MCP tool.
532
533 - `Literal["none", "auto", "required"]`
534
535 - `"none"`
536
537 - `"auto"`
538
539 - `"required"`
540
541 - `class ToolChoiceFunction: …`
542
543 Use this option to force the model to call a specific function.
544
545 - `name: str`
546
547 The name of the function to call.
548
549 - `type: Literal["function"]`
550
551 For function calling, the type is always `function`.
552
553 - `"function"`
554
555 - `class ToolChoiceMcp: …`
556
557 Use this option to force the model to call a specific tool on a remote MCP server.
558
559 - `server_label: str`
560
561 The label of the MCP server to use.
562
563 - `type: Literal["mcp"]`
564
565 For MCP tools, the type is always `mcp`.
566
567 - `"mcp"`
568
569 - `name: Optional[str]`
570
571 The name of the tool to call on the server.
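
The three shapes of `tool_choice` can be illustrated with plain dictionaries. The helper names, the function name `get_weather`, and the server label `docs` are hypothetical.

```python
STRING_MODES = ("none", "auto", "required")


def force_function(name):
    """Force the model to call one specific function."""
    return {"type": "function", "name": name}


def force_mcp_tool(server_label, name=None):
    """Force a tool on a remote MCP server; `name` is optional."""
    choice = {"type": "mcp", "server_label": server_label}
    if name is not None:
        choice["name"] = name
    return choice


print(STRING_MODES[1])                         # a string mode
print(force_function("get_weather"))           # hypothetical function name
print(force_mcp_tool("docs", name="search"))   # hypothetical server label and tool
```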
572
573 - `tools: Optional[RealtimeToolsConfig]`
574
575 Tools available to the model.
576
577 - `class RealtimeFunctionTool: …`
578
579 - `description: Optional[str]`
580
581 The description of the function, including guidance on when and how
582 to call it, and guidance about what to tell the user when calling
583 (if anything).
584
585 - `name: Optional[str]`
586
587 The name of the function.
588
589 - `parameters: Optional[object]`
590
591 Parameters of the function in JSON Schema.
592
593 - `type: Optional[Literal["function"]]`
594
595 The type of the tool, i.e. `function`.
596
597 - `"function"`
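
A function tool definition with JSON Schema parameters might look like the sketch below. The `get_weather` tool and its `city` argument are invented for illustration, and `tools` is assumed to be a list of tool definitions.

```python
import json

# Hypothetical tool: the name, description, and schema are illustrative only.
weather_tool = {
    "type": "function",
    "name": "get_weather",
    "description": "Look up current weather. Call when the user asks about weather.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string", "description": "City name"}},
        "required": ["city"],
    },
}

session = {"type": "realtime", "tools": [weather_tool], "tool_choice": "auto"}
print(json.dumps(session, indent=2))
```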
598
599 - `class Mcp: …`
600
601 Give the model access to additional tools via remote Model Context Protocol
602 (MCP) servers. [Learn more about MCP](https://platform.openai.com/docs/guides/tools-remote-mcp).
603
604 - `server_label: str`
605
606 A label for this MCP server, used to identify it in tool calls.
607
608 - `type: Literal["mcp"]`
609
610 The type of the MCP tool. Always `mcp`.
611
612 - `"mcp"`
613
614 - `allowed_tools: Optional[McpAllowedTools]`
615
616 List of allowed tool names or a filter object.
617
618 - `List[str]`
619
620 A string array of allowed tool names
621
622 - `class McpAllowedToolsMcpToolFilter: …`
623
624 A filter object to specify which tools are allowed.
625
626 - `read_only: Optional[bool]`
627
628 Indicates whether or not a tool modifies data or is read-only. If an
629 MCP server is [annotated with `readOnlyHint`](https://modelcontextprotocol.io/specification/2025-06-18/schema#toolannotations-readonlyhint),
630 it will match this filter.
631
632 - `tool_names: Optional[List[str]]`
633
634 List of allowed tool names.
635
636 - `authorization: Optional[str]`
637
638 An OAuth access token that can be used with a remote MCP server, either
639 with a custom MCP server URL or a service connector. Your application
640 must handle the OAuth authorization flow and provide the token here.
641
642 - `connector_id: Optional[Literal["connector_dropbox", "connector_gmail", "connector_googlecalendar", 5 more]]`
643
644 Identifier for service connectors, like those available in ChatGPT. One of
645 `server_url` or `connector_id` must be provided. Learn more about service
646 connectors [here](https://platform.openai.com/docs/guides/tools-remote-mcp#connectors).
647
648 Currently supported `connector_id` values are:
649
650 - Dropbox: `connector_dropbox`
651 - Gmail: `connector_gmail`
652 - Google Calendar: `connector_googlecalendar`
653 - Google Drive: `connector_googledrive`
654 - Microsoft Teams: `connector_microsoftteams`
655 - Outlook Calendar: `connector_outlookcalendar`
656 - Outlook Email: `connector_outlookemail`
657 - SharePoint: `connector_sharepoint`
658
659 - `"connector_dropbox"`
660
661 - `"connector_gmail"`
662
663 - `"connector_googlecalendar"`
664
665 - `"connector_googledrive"`
666
667 - `"connector_microsoftteams"`
668
669 - `"connector_outlookcalendar"`
670
671 - `"connector_outlookemail"`
672
673 - `"connector_sharepoint"`
674
675 - `defer_loading: Optional[bool]`
676
677 Whether this MCP tool is deferred and discovered via tool search.
678
679 - `headers: Optional[Dict[str, str]]`
680
681 Optional HTTP headers to send to the MCP server. Use for authentication
682 or other purposes.
683
684 - `require_approval: Optional[McpRequireApproval]`
685
686 Specify which of the MCP server's tools require approval.
687
688 - `class McpRequireApprovalMcpToolApprovalFilter: …`
689
690 Specify which of the MCP server's tools require approval. Can be
691 `always`, `never`, or a filter object associated with tools
692 that require approval.
693
694 - `always: Optional[McpRequireApprovalMcpToolApprovalFilterAlways]`
695
696 A filter object to specify which tools are allowed.
697
698 - `read_only: Optional[bool]`
699
700 Indicates whether or not a tool modifies data or is read-only. If an
701 MCP server is [annotated with `readOnlyHint`](https://modelcontextprotocol.io/specification/2025-06-18/schema#toolannotations-readonlyhint),
702 it will match this filter.
703
704 - `tool_names: Optional[List[str]]`
705
706 List of allowed tool names.
707
708 - `never: Optional[McpRequireApprovalMcpToolApprovalFilterNever]`
709
710 A filter object to specify which tools are allowed.
711
712 - `read_only: Optional[bool]`
713
714 Indicates whether or not a tool modifies data or is read-only. If an
715 MCP server is [annotated with `readOnlyHint`](https://modelcontextprotocol.io/specification/2025-06-18/schema#toolannotations-readonlyhint),
716 it will match this filter.
717
718 - `tool_names: Optional[List[str]]`
719
720 List of allowed tool names.
721
722 - `Literal["always", "never"]`
723
724 Specify a single approval policy for all tools. One of `always` or
725 `never`. When set to `always`, all tools will require approval. When
726 set to `never`, no tools will require approval.
727
728 - `"always"`
729
730 - `"never"`
731
732 - `server_description: Optional[str]`
733
734 Optional description of the MCP server, used to provide more context.
735
736 - `server_url: Optional[str]`
737
738 The URL for the MCP server. One of `server_url` or `connector_id` must be
739 provided.
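
The requirement that an MCP tool carry a `server_url` or a `connector_id` can be checked before the request is sent. In this sketch `build_mcp_tool` is a hypothetical helper and `https://mcp.example.com` is a placeholder URL. The description does not say whether both fields may be supplied together, so only the case where neither is present is rejected.

```python
def build_mcp_tool(server_label, server_url=None, connector_id=None,
                   allowed_tools=None, require_approval=None):
    """Return an MCP tool dict, requiring server_url or connector_id."""
    if server_url is None and connector_id is None:
        raise ValueError("one of server_url or connector_id must be provided")
    tool = {"type": "mcp", "server_label": server_label}
    optional = {
        "server_url": server_url,
        "connector_id": connector_id,
        "allowed_tools": allowed_tools,
        "require_approval": require_approval,
    }
    # Only include the optional fields that were actually set.
    tool.update({key: value for key, value in optional.items() if value is not None})
    return tool


print(build_mcp_tool("docs", server_url="https://mcp.example.com",
                     allowed_tools=["search"], require_approval="always"))
print(build_mcp_tool("calendar", connector_id="connector_googlecalendar"))
```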
740
741 - `tracing: Optional[RealtimeTracingConfig]`
742
743 Realtime API can write session traces to the [Traces Dashboard](https://platform.openai.com/logs?api=traces). Set to null to disable tracing. Once
744 tracing is enabled for a session, the configuration cannot be modified.
745
746 `auto` will create a trace for the session with default values for the
747 workflow name, group id, and metadata.
748
749 - `Literal["auto"]`
750
751 Enables tracing and sets default values for tracing configuration options. Always `auto`.
752
753 - `"auto"`
754
755 - `class TracingConfiguration: …`
756
757 Granular configuration for tracing.
758
759 - `group_id: Optional[str]`
760
761 The group id to attach to this trace to enable filtering and
762 grouping in the Traces Dashboard.
763
764 - `metadata: Optional[object]`
765
766 The arbitrary metadata to attach to this trace to enable
767 filtering in the Traces Dashboard.
768
769 - `workflow_name: Optional[str]`
770
771 The name of the workflow to attach to this trace. This is used to
772 name the trace in the Traces Dashboard.
773
774 - `truncation: Optional[RealtimeTruncation]`
775
776 When the number of tokens in a conversation exceeds the model's input token limit, the conversation will be truncated, meaning messages (starting from the oldest) will not be included in the model's context. A 32k context model with 4,096 max output tokens can only include 28,224 tokens in the context before truncation occurs.
777
778 Clients can configure truncation behavior to truncate with a lower max token limit, which is an effective way to control token usage and cost.
779
780 Truncation will reduce the number of cached tokens on the next turn (busting the cache), since messages are dropped from the beginning of the context. However, clients can also configure truncation to retain messages up to a fraction of the maximum context size, which will reduce the need for future truncations and thus improve the cache rate.
781
782 Truncation can be disabled entirely, which means the server will never truncate but would instead return an error if the conversation exceeds the model's input token limit.
783
784 - `Literal["auto", "disabled"]`
785
786 The truncation strategy to use for the session. `auto` is the default truncation strategy. `disabled` will disable truncation and emit errors when the conversation exceeds the input token limit.
787
788 - `"auto"`
789
790 - `"disabled"`
791
792 - `class RealtimeTruncationRetentionRatio: …`
793
794 Retain a fraction of the conversation tokens when the conversation exceeds the input token limit. This allows you to amortize truncations across multiple turns, which can help improve cached token usage.
795
796 - `retention_ratio: float`
797
798 Fraction of post-instruction conversation tokens to retain (`0.0` - `1.0`) when the conversation exceeds the input token limit. Setting this to `0.8` means that messages will be dropped until 80% of the maximum allowed tokens are used. This helps reduce the frequency of truncations and improve cache rates.
799
800 - `type: Literal["retention_ratio"]`
801
802 Use retention ratio truncation.
803
804 - `"retention_ratio"`
805
806 - `token_limits: Optional[TokenLimits]`
807
808 Optional custom token limits for this truncation strategy. If not provided, the model's default token limits will be used.
809
810 - `post_instructions: Optional[int]`
811
812 Maximum tokens allowed in the conversation after instructions (which include tool definitions). For example, setting this to 5,000 would mean that truncation would occur when the conversation exceeds 5,000 tokens after instructions. This cannot be higher than the model's context window size minus the maximum output tokens.
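
The effect of a retention ratio can be illustrated with a toy simulation. This is not the server's algorithm: `truncate` is a hypothetical function, per-message token counts are invented, and the assumption that plain truncation drops only enough messages to fit the limit is a simplification.

```python
def truncate(message_tokens, limit, retention_ratio=None):
    """Drop oldest messages once the total exceeds `limit`.

    With a retention ratio, keep dropping until the total is at most
    retention_ratio * limit, so later turns need fewer truncations.
    """
    kept = list(message_tokens)  # oldest message first
    if sum(kept) <= limit:
        return kept  # under the limit, nothing is dropped
    target = limit if retention_ratio is None else retention_ratio * limit
    while kept and sum(kept) > target:
        kept.pop(0)  # remove the oldest message
    return kept


history = [1000, 1500, 2000, 800]  # invented token counts, 5,300 in total
print(truncate(history, limit=5000))
print(truncate(history, limit=5000, retention_ratio=0.8))
```

In this toy run, plain truncation removes one message to get back under 5,000 tokens, while a ratio of `0.8` removes two so that the remaining context sits below 4,000 tokens and leaves headroom for later turns.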
813
814 - `class RealtimeTranscriptionSessionCreateRequest: …`
815
816 Realtime transcription session object configuration.
817
818 - `type: Literal["transcription"]`
819
820 The type of session to create. Always `transcription` for transcription sessions.
821
822 - `"transcription"`
823
824 - `audio: Optional[RealtimeTranscriptionSessionAudio]`
825
826 Configuration for input and output audio.
827
828 - `input: Optional[RealtimeTranscriptionSessionAudioInput]`
829
830 - `format: Optional[RealtimeAudioFormats]`
831
832 The PCM audio format. Only a 24kHz sample rate is supported.
833
834 - `noise_reduction: Optional[NoiseReduction]`
835
836 Configuration for input audio noise reduction. This can be set to `null` to turn off.
837 Noise reduction filters audio added to the input audio buffer before it is sent to VAD and the model.
838 Filtering the audio can improve VAD and turn detection accuracy (reducing false positives) and model performance by improving perception of the input audio.
839
840 - `type: Optional[NoiseReductionType]`
841
842 Type of noise reduction. `near_field` is for close-talking microphones such as headphones, `far_field` is for far-field microphones such as laptop or conference room microphones.
843
844 - `transcription: Optional[AudioTranscription]`
845
846 Configuration for input audio transcription. Defaults to off and can be set to `null` to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through [the /audio/transcriptions endpoint](https://platform.openai.com/docs/api-reference/audio/createTranscription) and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service.
847
848 - `turn_detection: Optional[RealtimeTranscriptionSessionAudioInputTurnDetection]`
849
850 Configuration for turn detection, either Server VAD or Semantic VAD. This can be set to `null` to turn off, in which case the client must manually trigger model response.
851
852 Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.
853
854 Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.
855
856 For `gpt-realtime-whisper` transcription sessions, turn detection must be
857 set to `null`; VAD is not supported.
858
859 - `class ServerVad: …`
860
861 Server-side voice activity detection (VAD) which flips on when user speech is detected and off after a period of silence.
862
863 - `type: Literal["server_vad"]`
864
865 Type of turn detection, `server_vad` to turn on simple Server VAD.
866
867 - `"server_vad"`
868
869 - `create_response: Optional[bool]`
870
871 Whether or not to automatically generate a response when a VAD stop event occurs. If `interrupt_response` is set to `false` this may fail to create a response if the model is already responding.
872
873 If both `create_response` and `interrupt_response` are set to `false`, the model will never respond automatically but VAD events will still be emitted.
874
875 - `idle_timeout_ms: Optional[int]`
876
877 Optional timeout after which a model response will be triggered automatically. This is
878 useful for situations in which a long pause from the user is unexpected, such as a phone
879 call. The model will effectively prompt the user to continue the conversation based
880 on the current context.
881
882 The timeout value will be applied after the last model response's audio has finished playing,
883 i.e. it's set to the `response.done` time plus audio playback duration.
884
885 An `input_audio_buffer.timeout_triggered` event (plus events
886 associated with the Response) will be emitted when the timeout is reached.
887 Idle timeout is currently only supported for `server_vad` mode.
888
889 - `interrupt_response: Optional[bool]`
890
891 Whether or not to automatically interrupt (cancel) any ongoing response with output to the default
892 conversation (i.e. `conversation` of `auto`) when a VAD start event occurs. If `true` then the response will be cancelled, otherwise it will continue until complete.
893
894 If both `create_response` and `interrupt_response` are set to `false`, the model will never respond automatically but VAD events will still be emitted.
895
896 - `prefix_padding_ms: Optional[int]`
897
898 Used only for `server_vad` mode. Amount of audio to include before the VAD detected speech (in
899 milliseconds). Defaults to 300ms.
900
901 - `silence_duration_ms: Optional[int]`
902
903 Used only for `server_vad` mode. Duration of silence to detect speech stop (in milliseconds). Defaults
904 to 500ms. With shorter values the model will respond more quickly,
905 but may jump in on short pauses from the user.
906
907 - `threshold: Optional[float]`
908
909 Used only for `server_vad` mode. Activation threshold for VAD (0.0 to 1.0). Defaults to 0.5. A
910 higher threshold will require louder audio to activate the model, and
911 thus might perform better in noisy environments.
912
913 - `class SemanticVad: …`
914
915 Server-side semantic turn detection which uses a model to determine when the user has finished speaking.
916
917 - `type: Literal["semantic_vad"]`
918
919 Type of turn detection, `semantic_vad` to turn on Semantic VAD.
920
921 - `"semantic_vad"`
922
923 - `create_response: Optional[bool]`
924
925 Whether or not to automatically generate a response when a VAD stop event occurs.
926
927 - `eagerness: Optional[Literal["low", "medium", "high", "auto"]]`
928
929 Used only for `semantic_vad` mode. The eagerness of the model to respond. `low` will wait longer for the user to continue speaking, `high` will respond more quickly. `auto` is the default and is equivalent to `medium`. `low`, `medium`, and `high` have max timeouts of 8s, 4s, and 2s respectively.
930
931 - `"low"`
932
933 - `"medium"`
934
935 - `"high"`
936
937 - `"auto"`
938
939 - `interrupt_response: Optional[bool]`
940
941 Whether or not to automatically interrupt any ongoing response with output to the default
942 conversation (i.e. `conversation` of `auto`) when a VAD start event occurs.
943
944 - `include: Optional[List[Literal["item.input_audio_transcription.logprobs"]]]`
945
946 Additional fields to include in server outputs.
947
948 `item.input_audio_transcription.logprobs`: Include logprobs for input audio transcription.
949
950 - `"item.input_audio_transcription.logprobs"`
951
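The turn detection options above can be assembled as plain dictionaries before they are passed as part of the `session` argument. The sketch below is a minimal illustration and not part of the SDK: the helper names `server_vad` and `semantic_vad` are hypothetical, the numeric defaults are the ones quoted in the field descriptions, and optional flags are left out of the payload when unset so that the server's own defaults apply.

```python
from typing import Any, Dict, Optional

# Allowed values quoted from the `eagerness` field description.
EAGERNESS_VALUES = {"low", "medium", "high", "auto"}


def _drop_unset(values: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only the keys whose value was explicitly provided."""
    return {key: value for key, value in values.items() if value is not None}


def server_vad(
    threshold: float = 0.5,
    prefix_padding_ms: int = 300,
    silence_duration_ms: int = 500,
    idle_timeout_ms: Optional[int] = None,
    create_response: Optional[bool] = None,
    interrupt_response: Optional[bool] = None,
) -> Dict[str, Any]:
    """Build a `server_vad` turn detection dict (hypothetical helper)."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    config: Dict[str, Any] = {
        "type": "server_vad",
        "threshold": threshold,
        "prefix_padding_ms": prefix_padding_ms,
        "silence_duration_ms": silence_duration_ms,
    }
    config.update(
        _drop_unset(
            {
                "idle_timeout_ms": idle_timeout_ms,
                "create_response": create_response,
                "interrupt_response": interrupt_response,
            }
        )
    )
    return config


def semantic_vad(
    eagerness: str = "auto",
    create_response: Optional[bool] = None,
    interrupt_response: Optional[bool] = None,
) -> Dict[str, Any]:
    """Build a `semantic_vad` turn detection dict (hypothetical helper)."""
    if eagerness not in EAGERNESS_VALUES:
        raise ValueError(f"eagerness must be one of {sorted(EAGERNESS_VALUES)}")
    config: Dict[str, Any] = {"type": "semantic_vad", "eagerness": eagerness}
    config.update(
        _drop_unset(
            {
                "create_response": create_response,
                "interrupt_response": interrupt_response,
            }
        )
    )
    return config
```

Either dictionary would sit under the session's `audio.input.turn_detection` key. For `gpt-realtime-whisper` transcription sessions the value must be `None` instead.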
952### Returns
953
954- `class ClientSecretCreateResponse: …`
955
956 Response from creating a session and client secret for the Realtime API.
957
958 - `expires_at: int`
959
960 Expiration timestamp for the client secret, in seconds since epoch.
961
962 - `session: Session`
963
964 The session configuration for either a realtime or transcription session.
965
966 - `class RealtimeSessionCreateResponse: …`
967
968 A Realtime session configuration object.
969
970 - `id: str`
971
972 Unique identifier for the session that looks like `sess_1234567890abcdef`.
973
974 - `object: Literal["realtime.session"]`
975
976 The object type. Always `realtime.session`.
977
978 - `"realtime.session"`
979
980 - `type: Literal["realtime"]`
981
982 The type of session to create. Always `realtime` for the Realtime API.
983
984 - `"realtime"`
985
986 - `audio: Optional[Audio]`
987
988 Configuration for input and output audio.
989
990 - `input: Optional[AudioInput]`
991
992 - `format: Optional[RealtimeAudioFormats]`
993
994 The format of the input audio.
995
996 - `class AudioPCM: …`
997
998 The PCM audio format. Only a 24kHz sample rate is supported.
999
1000 - `rate: Optional[Literal[24000]]`
1001
1002 The sample rate of the audio. Always `24000`.
1003
1004 - `24000`
1005
1006 - `type: Optional[Literal["audio/pcm"]]`
1007
1008 The audio format. Always `audio/pcm`.
1009
1010 - `"audio/pcm"`
1011
1012 - `class AudioPCMU: …`
1013
1014 The G.711 μ-law format.
1015
1016 - `type: Optional[Literal["audio/pcmu"]]`
1017
1018 The audio format. Always `audio/pcmu`.
1019
1020 - `"audio/pcmu"`
1021
1022 - `class AudioPCMA: …`
1023
1024 The G.711 A-law format.
1025
1026 - `type: Optional[Literal["audio/pcma"]]`
1027
1028 The audio format. Always `audio/pcma`.
1029
1030 - `"audio/pcma"`
1031
1032 - `noise_reduction: Optional[AudioInputNoiseReduction]`
1033
1034 Configuration for input audio noise reduction. This can be set to `null` to turn off.
1035 Noise reduction filters audio added to the input audio buffer before it is sent to VAD and the model.
1036 Filtering the audio can improve VAD and turn detection accuracy (reducing false positives) and model performance by improving perception of the input audio.
1037
1038 - `type: Optional[NoiseReductionType]`
1039
1040 Type of noise reduction. `near_field` is for close-talking microphones such as headphones, `far_field` is for far-field microphones such as laptop or conference room microphones.
1041
1042 - `"near_field"`
1043
1044 - `"far_field"`
1045
1046 - `transcription: Optional[AudioTranscription]`
1047
1048 - `delay: Optional[Literal["minimal", "low", "medium", 2 more]]`
1049
1050 Controls how long the model waits before emitting transcription text.
1051 Higher values can improve transcription accuracy at the cost of latency.
1052 Only supported with `gpt-realtime-whisper` in GA Realtime sessions.
1053
1054 - `"minimal"`
1055
1056 - `"low"`
1057
1058 - `"medium"`
1059
1060 - `"high"`
1061
1062 - `"xhigh"`
1063
1064 - `language: Optional[str]`
1065
1066 The language of the input audio. Supplying the input language in
1067 [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) (e.g. `en`) format
1068 will improve accuracy and latency.
1069
1070 - `model: Optional[Union[str, Literal["whisper-1", "gpt-4o-mini-transcribe", "gpt-4o-mini-transcribe-2025-12-15", 3 more], null]]`
1071
1072 The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels.
1073
1074 - `str`
1075
1076 - `Literal["whisper-1", "gpt-4o-mini-transcribe", "gpt-4o-mini-transcribe-2025-12-15", 3 more]`
1077
1078 The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels.
1079
1080 - `"whisper-1"`
1081
1082 - `"gpt-4o-mini-transcribe"`
1083
1084 - `"gpt-4o-mini-transcribe-2025-12-15"`
1085
1086 - `"gpt-4o-transcribe"`
1087
1088 - `"gpt-4o-transcribe-diarize"`
1089
1090 - `"gpt-realtime-whisper"`
1091
1092 - `prompt: Optional[str]`
1093
1094 An optional text to guide the model's style or continue a previous audio
1095 segment.
1096 For `whisper-1`, the [prompt is a list of keywords](https://platform.openai.com/docs/guides/speech-to-text#prompting).
1097 For `gpt-4o-transcribe` models (excluding `gpt-4o-transcribe-diarize`), the prompt is a free text string, for example "expect words related to technology".
1098 Prompt is not supported with `gpt-realtime-whisper` in GA Realtime sessions.
1099
1100 - `turn_detection: Optional[AudioInputTurnDetection]`
1101
1102 Configuration for turn detection, either Server VAD or Semantic VAD. This can be set to `null` to turn off, in which case the client must manually trigger a model response.
1103
1104 Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.
1105
1106 Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.
1107
1108 For `gpt-realtime-whisper` transcription sessions, turn detection must be
1109 set to `null`; VAD is not supported.
1110
1111 - `class AudioInputTurnDetectionServerVad: …`
1112
1113 Server-side voice activity detection (VAD) which flips on when user speech is detected and off after a period of silence.
1114
1115 - `type: Literal["server_vad"]`
1116
1117 Type of turn detection, `server_vad` to turn on simple Server VAD.
1118
1119 - `"server_vad"`
1120
1121 - `create_response: Optional[bool]`
1122
1123 Whether or not to automatically generate a response when a VAD stop event occurs. If `interrupt_response` is set to `false` this may fail to create a response if the model is already responding.
1124
1125 If both `create_response` and `interrupt_response` are set to `false`, the model will never respond automatically but VAD events will still be emitted.
1126
1127 - `idle_timeout_ms: Optional[int]`
1128
1129 Optional timeout after which a model response will be triggered automatically. This is
1130 useful for situations in which a long pause from the user is unexpected, such as a phone
1131 call. The model will effectively prompt the user to continue the conversation based
1132 on the current context.
1133
1134 The timeout value will be applied after the last model response's audio has finished playing,
1135 i.e. it's set to the `response.done` time plus audio playback duration.
1136
1137 An `input_audio_buffer.timeout_triggered` event (plus events
1138 associated with the Response) will be emitted when the timeout is reached.
1139 Idle timeout is currently only supported for `server_vad` mode.
1140
1141 - `interrupt_response: Optional[bool]`
1142
1143 Whether or not to automatically interrupt (cancel) any ongoing response with output to the default
1144 conversation (i.e. `conversation` of `auto`) when a VAD start event occurs. If `true` then the response will be cancelled, otherwise it will continue until complete.
1145
1146 If both `create_response` and `interrupt_response` are set to `false`, the model will never respond automatically but VAD events will still be emitted.
1147
1148 - `prefix_padding_ms: Optional[int]`
1149
1150 Used only for `server_vad` mode. Amount of audio to include before the VAD detected speech (in
1151 milliseconds). Defaults to 300ms.
1152
1153 - `silence_duration_ms: Optional[int]`
1154
1155 Used only for `server_vad` mode. Duration of silence to detect speech stop (in milliseconds). Defaults
1156 to 500ms. With shorter values the model will respond more quickly,
1157 but may jump in on short pauses from the user.
1158
1159 - `threshold: Optional[float]`
1160
1161 Used only for `server_vad` mode. Activation threshold for VAD (0.0 to 1.0). Defaults to 0.5. A
1162 higher threshold will require louder audio to activate the model, and
1163 thus might perform better in noisy environments.
1164
1165 - `class AudioInputTurnDetectionSemanticVad: …`
1166
1167 Server-side semantic turn detection which uses a model to determine when the user has finished speaking.
1168
1169 - `type: Literal["semantic_vad"]`
1170
1171 Type of turn detection, `semantic_vad` to turn on Semantic VAD.
1172
1173 - `"semantic_vad"`
1174
1175 - `create_response: Optional[bool]`
1176
1177 Whether or not to automatically generate a response when a VAD stop event occurs.
1178
1179 - `eagerness: Optional[Literal["low", "medium", "high", "auto"]]`
1180
1181 Used only for `semantic_vad` mode. The eagerness of the model to respond. `low` will wait longer for the user to continue speaking, `high` will respond more quickly. `auto` is the default and is equivalent to `medium`. `low`, `medium`, and `high` have max timeouts of 8s, 4s, and 2s respectively.
1182
1183 - `"low"`
1184
1185 - `"medium"`
1186
1187 - `"high"`
1188
1189 - `"auto"`
1190
1191 - `interrupt_response: Optional[bool]`
1192
1193 Whether or not to automatically interrupt any ongoing response with output to the default
1194 conversation (i.e. `conversation` of `auto`) when a VAD start event occurs.
1195
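When reading a returned `semantic_vad` configuration, it can help to translate `eagerness` into the maximum wait it implies. The mapping below uses only the timeouts stated in the `eagerness` description (8s, 4s, 2s, with `auto` equivalent to `medium`). The function name is hypothetical and the sketch says nothing about how the server computes the actual, dynamic timeout.

```python
# Maximum timeouts (in seconds) quoted from the `eagerness` description.
MAX_TIMEOUT_SECONDS = {"low": 8, "medium": 4, "high": 2}


def max_turn_timeout_seconds(eagerness: str = "auto") -> int:
    """Return the documented maximum timeout for an eagerness level."""
    # `auto` is documented as equivalent to `medium`.
    level = "medium" if eagerness == "auto" else eagerness
    if level not in MAX_TIMEOUT_SECONDS:
        raise ValueError(f"unknown eagerness: {eagerness!r}")
    return MAX_TIMEOUT_SECONDS[level]
```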
1196 - `output: Optional[AudioOutput]`
1197
1198 - `format: Optional[RealtimeAudioFormats]`
1199
1200 The format of the output audio.
1201
1202 - `speed: Optional[float]`
1203
1204 The speed of the model's spoken response as a multiple of the original speed.
1205 1.0 is the default speed. 0.25 is the minimum speed. 1.5 is the maximum speed. This value can only be changed in between model turns, not while a response is in progress.
1206
1207 This parameter is a post-processing adjustment to the audio after it is generated; it's
1208 also possible to prompt the model to speak faster or slower.
1209
1210 - `voice: Optional[Union[str, Literal["alloy", "ash", "ballad", 7 more], null]]`
1211
1212 The voice the model uses to respond. Voice cannot be changed during the
1213 session once the model has responded with audio at least once. Current
1214 voice options are `alloy`, `ash`, `ballad`, `coral`, `echo`, `sage`,
1215 `shimmer`, `verse`, `marin`, and `cedar`. We recommend `marin` and `cedar` for
1216 best quality.
1217
1218 - `str`
1219
1220 - `Literal["alloy", "ash", "ballad", 7 more]`
1221
1222 The voice the model uses to respond. Voice cannot be changed during the
1223 session once the model has responded with audio at least once. Current
1224 voice options are `alloy`, `ash`, `ballad`, `coral`, `echo`, `sage`,
1225 `shimmer`, `verse`, `marin`, and `cedar`. We recommend `marin` and `cedar` for
1226 best quality.
1227
1228 - `"alloy"`
1229
1230 - `"ash"`
1231
1232 - `"ballad"`
1233
1234 - `"coral"`
1235
1236 - `"echo"`
1237
1238 - `"sage"`
1239
1240 - `"shimmer"`
1241
1242 - `"verse"`
1243
1244 - `"marin"`
1245
1246 - `"cedar"`
1247
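The `speed` bounds above are easy to check on the client before a request is sent. The following is a small sketch under that assumption: `audio_output` is a hypothetical helper, the limits are the ones stated in the `speed` description, and the voice is not validated because the field accepts any string.

```python
from typing import Any, Dict

# Bounds quoted from the `speed` field description.
MIN_SPEED = 0.25
MAX_SPEED = 1.5


def audio_output(voice: str = "marin", speed: float = 1.0) -> Dict[str, Any]:
    """Build an audio output dict, rejecting out-of-range speeds."""
    if not MIN_SPEED <= speed <= MAX_SPEED:
        raise ValueError(f"speed must be between {MIN_SPEED} and {MAX_SPEED}")
    return {"voice": voice, "speed": speed}
```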
1248 - `expires_at: Optional[int]`
1249
1250 Expiration timestamp for the session, in seconds since epoch.
1251
1252 - `include: Optional[List[Literal["item.input_audio_transcription.logprobs"]]]`
1253
1254 Additional fields to include in server outputs.
1255
1256 `item.input_audio_transcription.logprobs`: Include logprobs for input audio transcription.
1257
1258 - `"item.input_audio_transcription.logprobs"`
1259
1260 - `instructions: Optional[str]`
1261
1262 The default system instructions (i.e. system message) prepended to model calls. This field allows the client to guide the model on desired responses. The model can be instructed on response content and format (e.g. "be extremely succinct", "act friendly", "here are examples of good responses") and on audio behavior (e.g. "talk quickly", "inject emotion into your voice", "laugh frequently"). The instructions are not guaranteed to be followed by the model, but they provide guidance to the model on the desired behavior.
1263
1264 Note that the server sets default instructions which will be used if this field is not set and are visible in the `session.created` event at the start of the session.
1265
1266 - `max_output_tokens: Optional[Union[int, Literal["inf"], null]]`
1267
1268 Maximum number of output tokens for a single assistant response,
1269 inclusive of tool calls. Provide an integer between 1 and 4096 to
1270 limit output tokens, or `inf` for the maximum available tokens for a
1271 given model. Defaults to `inf`.
1272
1273 - `int`
1274
1275 - `Literal["inf"]`
1276
1277 - `"inf"`
1278
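As a rough illustration of the accepted `max_output_tokens` values, the check below mirrors the description: an integer between 1 and 4096, or the string `inf`. The function name is hypothetical and the server remains the authority on what it accepts.

```python
from typing import Union


def validate_max_output_tokens(value: Union[int, str]) -> Union[int, str]:
    """Return the value unchanged if it matches the documented forms."""
    if isinstance(value, str):
        if value == "inf":
            return value
        raise ValueError('the only accepted string is "inf"')
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, int) and not isinstance(value, bool) and 1 <= value <= 4096:
        return value
    raise ValueError("expected an integer between 1 and 4096")
```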
1279 - `model: Optional[Union[str, Literal["gpt-realtime", "gpt-realtime-1.5", "gpt-realtime-2", 14 more], null]]`
1280
1281 The Realtime model used for this session.
1282
1283 - `str`
1284
1285 - `Literal["gpt-realtime", "gpt-realtime-1.5", "gpt-realtime-2", 14 more]`
1286
1287 The Realtime model used for this session.
1288
1289 - `"gpt-realtime"`
1290
1291 - `"gpt-realtime-1.5"`
1292
1293 - `"gpt-realtime-2"`
1294
1295 - `"gpt-realtime-2025-08-28"`
1296
1297 - `"gpt-4o-realtime-preview"`
1298
1299 - `"gpt-4o-realtime-preview-2024-10-01"`
1300
1301 - `"gpt-4o-realtime-preview-2024-12-17"`
1302
1303 - `"gpt-4o-realtime-preview-2025-06-03"`
1304
1305 - `"gpt-4o-mini-realtime-preview"`
1306
1307 - `"gpt-4o-mini-realtime-preview-2024-12-17"`
1308
1309 - `"gpt-realtime-mini"`
1310
1311 - `"gpt-realtime-mini-2025-10-06"`
1312
1313 - `"gpt-realtime-mini-2025-12-15"`
1314
1315 - `"gpt-audio-1.5"`
1316
1317 - `"gpt-audio-mini"`
1318
1319 - `"gpt-audio-mini-2025-10-06"`
1320
1321 - `"gpt-audio-mini-2025-12-15"`
1322
1323 - `output_modalities: Optional[List[Literal["text", "audio"]]]`
1324
1325 The set of modalities the model can respond with. It defaults to `["audio"]`, indicating
1326 that the model will respond with audio plus a transcript. `["text"]` can be used to make
1327 the model respond with text only. It is not possible to request both `text` and `audio` at the same time.
1328
1329 - `"text"`
1330
1331 - `"audio"`
1332
1333 - `prompt: Optional[ResponsePrompt]`
1334
1335 Reference to a prompt template and its variables.
1336 [Learn more](https://platform.openai.com/docs/guides/text?api-mode=responses#reusable-prompts).
1337
1338 - `id: str`
1339
1340 The unique identifier of the prompt template to use.
1341
1342 - `variables: Optional[Dict[str, Variables]]`
1343
1344 Optional map of values to substitute in for variables in your
1345 prompt. The substitution values can either be strings, or other
1346 Response input types like images or files.
1347
1348 - `str`
1349
1350 - `class ResponseInputText: …`
1351
1352 A text input to the model.
1353
1354 - `text: str`
1355
1356 The text input to the model.
1357
1358 - `type: Literal["input_text"]`
1359
1360 The type of the input item. Always `input_text`.
1361
1362 - `"input_text"`
1363
1364 - `class ResponseInputImage: …`
1365
1366 An image input to the model. Learn about [image inputs](https://platform.openai.com/docs/guides/vision).
1367
1368 - `detail: Literal["low", "high", "auto", "original"]`
1369
1370 The detail level of the image to be sent to the model. One of `high`, `low`, `auto`, or `original`. Defaults to `auto`.
1371
1372 - `"low"`
1373
1374 - `"high"`
1375
1376 - `"auto"`
1377
1378 - `"original"`
1379
1380 - `type: Literal["input_image"]`
1381
1382 The type of the input item. Always `input_image`.
1383
1384 - `"input_image"`
1385
1386 - `file_id: Optional[str]`
1387
1388 The ID of the file to be sent to the model.
1389
1390 - `image_url: Optional[str]`
1391
1392 The URL of the image to be sent to the model. A fully qualified URL or base64 encoded image in a data URL.
1393
1394 - `class ResponseInputFile: …`
1395
1396 A file input to the model.
1397
1398 - `type: Literal["input_file"]`
1399
1400 The type of the input item. Always `input_file`.
1401
1402 - `"input_file"`
1403
1404 - `detail: Optional[Literal["low", "high"]]`
1405
1406 The detail level of the file to be sent to the model. Use `low` for the default rendering behavior, or `high` to render the file at higher quality. Defaults to `low`.
1407
1408 - `"low"`
1409
1410 - `"high"`
1411
1412 - `file_data: Optional[str]`
1413
1414 The content of the file to be sent to the model.
1415
1416 - `file_id: Optional[str]`
1417
1418 The ID of the file to be sent to the model.
1419
1420 - `file_url: Optional[str]`
1421
1422 The URL of the file to be sent to the model.
1423
1424 - `filename: Optional[str]`
1425
1426 The name of the file to be sent to the model.
1427
1428 - `version: Optional[str]`
1429
1430 Optional version of the prompt template.
1431
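To make the shape of a prompt reference concrete, here is a sketch of one with both kinds of variable value described above: a plain string and a `ResponseInputText`-style object. The prompt ID, version, and variable names are made up for illustration, and `check_prompt_variables` is a hypothetical helper that only checks the `type` tags listed in this section.

```python
from typing import Any, Dict

# Input item types listed for prompt variables in this section.
INPUT_TYPES = {"input_text", "input_image", "input_file"}

prompt: Dict[str, Any] = {
    "id": "pmpt_example123",  # hypothetical prompt template ID
    "version": "2",  # hypothetical version label
    "variables": {
        "customer_name": "Ada",
        "order_note": {"type": "input_text", "text": "Deliver after 5pm."},
    },
}


def check_prompt_variables(variables: Dict[str, Any]) -> None:
    """Raise if a variable is neither a string nor a typed input item."""
    for name, value in variables.items():
        if isinstance(value, str):
            continue
        if isinstance(value, dict) and value.get("type") in INPUT_TYPES:
            continue
        raise ValueError(f"unsupported value for variable {name!r}")


check_prompt_variables(prompt["variables"])
```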
1432 - `reasoning: Optional[RealtimeReasoning]`
1433
1434 Configuration for reasoning-capable Realtime models such as `gpt-realtime-2`.
1435
1436 - `effort: Optional[RealtimeReasoningEffort]`
1437
1438 Constrains effort on reasoning for reasoning-capable Realtime models such as
1439 `gpt-realtime-2`.
1440
1441 - `"minimal"`
1442
1443 - `"low"`
1444
1445 - `"medium"`
1446
1447 - `"high"`
1448
1449 - `"xhigh"`
1450
1451 - `tool_choice: Optional[ToolChoice]`
1452
1453 How the model chooses tools. Provide one of the string modes or force a specific
1454 function/MCP tool.
1455
1456 - `Literal["none", "auto", "required"]`
1457
1458 - `"none"`
1459
1460 - `"auto"`
1461
1462 - `"required"`
1463
1464 - `class ToolChoiceFunction: …`
1465
1466 Use this option to force the model to call a specific function.
1467
1468 - `name: str`
1469
1470 The name of the function to call.
1471
1472 - `type: Literal["function"]`
1473
1474 For function calling, the type is always `function`.
1475
1476 - `"function"`
1477
1478 - `class ToolChoiceMcp: …`
1479
1480 Use this option to force the model to call a specific tool on a remote MCP server.
1481
1482 - `server_label: str`
1483
1484 The label of the MCP server to use.
1485
1486 - `type: Literal["mcp"]`
1487
1488 For MCP tools, the type is always `mcp`.
1489
1490 - `"mcp"`
1491
1492 - `name: Optional[str]`
1493
1494 The name of the tool to call on the server.
1495
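The three `tool_choice` forms above (a string mode, a forced function, a forced MCP tool) might be built as follows. This is a sketch with hypothetical helper names. The dictionaries mirror the `ToolChoiceFunction` and `ToolChoiceMcp` fields, and the function, server, and tool names in the usage lines are placeholders.

```python
from typing import Any, Dict, Optional

# String modes listed for `tool_choice`.
TOOL_CHOICE_MODES = ("none", "auto", "required")


def force_function(name: str) -> Dict[str, Any]:
    """Force the model to call one specific function."""
    return {"type": "function", "name": name}


def force_mcp_tool(server_label: str, name: Optional[str] = None) -> Dict[str, Any]:
    """Force a tool on a remote MCP server; `name` is optional."""
    choice: Dict[str, Any] = {"type": "mcp", "server_label": server_label}
    if name is not None:
        choice["name"] = name
    return choice


# Placeholder names, for illustration only.
choice_a = "auto"
choice_b = force_function("lookup_order")
choice_c = force_mcp_tool("docs", name="search_pages")
```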
1496 - `tools: Optional[List[Tool]]`
1497
1498 Tools available to the model.
1499
1500 - `class RealtimeFunctionTool: …`
1501
1502 - `description: Optional[str]`
1503
1504 The description of the function, including guidance on when and how
1505 to call it, and guidance about what to tell the user when calling
1506 (if anything).
1507
1508 - `name: Optional[str]`
1509
1510 The name of the function.
1511
1512 - `parameters: Optional[object]`
1513
1514 Parameters of the function in JSON Schema.
1515
1516 - `type: Optional[Literal["function"]]`
1517
1518 The type of the tool, i.e. `function`.
1519
1520 - `"function"`
1521
1522 - `class ToolMcpTool: …`
1523
1524 Give the model access to additional tools via remote Model Context Protocol
1525 (MCP) servers. [Learn more about MCP](https://platform.openai.com/docs/guides/tools-remote-mcp).
1526
1527 - `server_label: str`
1528
1529 A label for this MCP server, used to identify it in tool calls.
1530
1531 - `type: Literal["mcp"]`
1532
1533 The type of the MCP tool. Always `mcp`.
1534
1535 - `"mcp"`
1536
1537 - `allowed_tools: Optional[ToolMcpToolAllowedTools]`
1538
1539 List of allowed tool names or a filter object.
1540
1541 - `List[str]`
1542
1543 A string array of allowed tool names.
1544
1545 - `class ToolMcpToolAllowedToolsMcpToolFilter: …`
1546
1547 A filter object to specify which tools are allowed.
1548
1549 - `read_only: Optional[bool]`
1550
1551 Indicates whether or not a tool modifies data or is read-only. If an
1552 MCP server is [annotated with `readOnlyHint`](https://modelcontextprotocol.io/specification/2025-06-18/schema#toolannotations-readonlyhint),
1553 it will match this filter.
1554
1555 - `tool_names: Optional[List[str]]`
1556
1557 List of allowed tool names.
1558
1559 - `authorization: Optional[str]`
1560
1561 An OAuth access token that can be used with a remote MCP server, either
1562 with a custom MCP server URL or a service connector. Your application
1563 must handle the OAuth authorization flow and provide the token here.
1564
1565 - `connector_id: Optional[Literal["connector_dropbox", "connector_gmail", "connector_googlecalendar", 5 more]]`
1566
1567 Identifier for service connectors, like those available in ChatGPT. One of
1568 `server_url` or `connector_id` must be provided. Learn more about service
1569 connectors [here](https://platform.openai.com/docs/guides/tools-remote-mcp#connectors).
1570
1571 Currently supported `connector_id` values are:
1572
1573 - Dropbox: `connector_dropbox`
1574 - Gmail: `connector_gmail`
1575 - Google Calendar: `connector_googlecalendar`
1576 - Google Drive: `connector_googledrive`
1577 - Microsoft Teams: `connector_microsoftteams`
1578 - Outlook Calendar: `connector_outlookcalendar`
1579 - Outlook Email: `connector_outlookemail`
1580 - SharePoint: `connector_sharepoint`
1581
1582 - `"connector_dropbox"`
1583
1584 - `"connector_gmail"`
1585
1586 - `"connector_googlecalendar"`
1587
1588 - `"connector_googledrive"`
1589
1590 - `"connector_microsoftteams"`
1591
1592 - `"connector_outlookcalendar"`
1593
1594 - `"connector_outlookemail"`
1595
1596 - `"connector_sharepoint"`
1597
1598 - `defer_loading: Optional[bool]`
1599
1600 Whether this MCP tool is deferred and discovered via tool search.
1601
1602 - `headers: Optional[Dict[str, str]]`
1603
1604 Optional HTTP headers to send to the MCP server. Use for authentication
1605 or other purposes.
1606
1607 - `require_approval: Optional[ToolMcpToolRequireApproval]`
1608
1609 Specify which of the MCP server's tools require approval.
1610
1611 - `class ToolMcpToolRequireApprovalMcpToolApprovalFilter: …`
1612
1613 Specify which of the MCP server's tools require approval. Can be
1614 `always`, `never`, or a filter object associated with tools
1615 that require approval.
1616
1617 - `always: Optional[ToolMcpToolRequireApprovalMcpToolApprovalFilterAlways]`
1618
1619 A filter object to specify which tools are allowed.
1620
1621 - `read_only: Optional[bool]`
1622
1623 Indicates whether or not a tool modifies data or is read-only. If an
1624 MCP server is [annotated with `readOnlyHint`](https://modelcontextprotocol.io/specification/2025-06-18/schema#toolannotations-readonlyhint),
1625 it will match this filter.
1626
1627 - `tool_names: Optional[List[str]]`
1628
1629 List of allowed tool names.
1630
1631 - `never: Optional[ToolMcpToolRequireApprovalMcpToolApprovalFilterNever]`
1632
1633 A filter object to specify which tools are allowed.
1634
1635 - `read_only: Optional[bool]`
1636
1637 Indicates whether or not a tool modifies data or is read-only. If an
1638 MCP server is [annotated with `readOnlyHint`](https://modelcontextprotocol.io/specification/2025-06-18/schema#toolannotations-readonlyhint),
1639 it will match this filter.
1640
1641 - `tool_names: Optional[List[str]]`
1642
1643 List of allowed tool names.
1644
1645 - `Literal["always", "never"]`
1646
1647 Specify a single approval policy for all tools. One of `always` or
1648 `never`. When set to `always`, all tools will require approval. When
1649 set to `never`, no tools will require approval.
1650
1651 - `"always"`
1652
1653 - `"never"`
1654
1655 - `server_description: Optional[str]`
1656
1657 Optional description of the MCP server, used to provide more context.
1658
1659 - `server_url: Optional[str]`
1660
1661 The URL for the MCP server. One of `server_url` or `connector_id` must be
1662 provided.
1663
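An MCP tool entry combines several of the fields above. The sketch below is one way to assemble it, assuming a hypothetical `mcp_tool` helper. It enforces only the stated rule that one of `server_url` or `connector_id` must be provided, and the server URL and tool names in the usage lines are placeholders.

```python
from typing import Any, Dict, List, Optional, Union


def mcp_tool(
    server_label: str,
    server_url: Optional[str] = None,
    connector_id: Optional[str] = None,
    allowed_tools: Optional[Union[List[str], Dict[str, Any]]] = None,
    require_approval: Optional[Union[str, Dict[str, Any]]] = None,
) -> Dict[str, Any]:
    """Build an MCP tool dict (hypothetical helper, not an SDK function)."""
    if server_url is None and connector_id is None:
        raise ValueError("one of server_url or connector_id must be provided")
    tool: Dict[str, Any] = {"type": "mcp", "server_label": server_label}
    optional = {
        "server_url": server_url,
        "connector_id": connector_id,
        "allowed_tools": allowed_tools,
        "require_approval": require_approval,
    }
    # Omit unset fields so the server's defaults apply.
    tool.update({key: value for key, value in optional.items() if value is not None})
    return tool


# Placeholder URL and tool names, for illustration only.
docs_tool = mcp_tool(
    "docs",
    server_url="https://mcp.example.com/sse",
    allowed_tools={"read_only": True},
    require_approval={
        "always": {"tool_names": ["delete_page"]},
        "never": {"read_only": True},
    },
)
calendar_tool = mcp_tool(
    "calendar",
    connector_id="connector_googlecalendar",
    require_approval="always",
)
```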
1664 - `tracing: Optional[Tracing]`
1665
1666 The Realtime API can write session traces to the [Traces Dashboard](https://platform.openai.com/logs?api=traces). Set to `null` to disable tracing. Once
1667 tracing is enabled for a session, the configuration cannot be modified.
1668
1669 `auto` will create a trace for the session with default values for the
1670 workflow name, group id, and metadata.
1671
1672 - `Literal["auto"]`
1673
1674 Enables tracing and sets default values for tracing configuration options. Always `auto`.
1675
1676 - `"auto"`
1677
1678 - `class TracingTracingConfiguration: …`
1679
1680 Granular configuration for tracing.
1681
1682 - `group_id: Optional[str]`
1683
1684 The group id to attach to this trace to enable filtering and
1685 grouping in the Traces Dashboard.
1686
1687 - `metadata: Optional[object]`
1688
1689 The arbitrary metadata to attach to this trace to enable
1690 filtering in the Traces Dashboard.
1691
1692 - `workflow_name: Optional[str]`
1693
1694 The name of the workflow to attach to this trace. This is used to
1695 name the trace in the Traces Dashboard.
1696
1697 - `truncation: Optional[RealtimeTruncation]`
1698
1699 When the number of tokens in a conversation exceeds the model's input token limit, the conversation will be truncated, meaning messages (starting from the oldest) will not be included in the model's context. A 32k context model with 4,096 max output tokens can only include 28,224 tokens in the context before truncation occurs.
1700
1701 Clients can configure truncation behavior to truncate with a lower max token limit, which is an effective way to control token usage and cost.
1702
1703 Truncation will reduce the number of cached tokens on the next turn (busting the cache), since messages are dropped from the beginning of the context. However, clients can also configure truncation to retain messages up to a fraction of the maximum context size, which will reduce the need for future truncations and thus improve the cache rate.
1704
1705 Truncation can be disabled entirely, which means the server will never truncate but would instead return an error if the conversation exceeds the model's input token limit.
1706
1707 - `Literal["auto", "disabled"]`
1708
1709 The truncation strategy to use for the session. `auto` is the default truncation strategy. `disabled` will disable truncation and emit errors when the conversation exceeds the input token limit.
1710
1711 - `"auto"`
1712
1713 - `"disabled"`
1714
1715 - `class RealtimeTruncationRetentionRatio: …`
1716
1717 Retain a fraction of the conversation tokens when the conversation exceeds the input token limit. This allows you to amortize truncations across multiple turns, which can help improve cached token usage.
1718
1719 - `retention_ratio: float`
1720
1721 Fraction of post-instruction conversation tokens to retain (`0.0` - `1.0`) when the conversation exceeds the input token limit. Setting this to `0.8` means that messages will be dropped until 80% of the maximum allowed tokens are used. This helps reduce the frequency of truncations and improve cache rates.
1722
1723 - `type: Literal["retention_ratio"]`
1724
1725 Use retention ratio truncation.
1726
1727 - `"retention_ratio"`
1728
1729 - `token_limits: Optional[TokenLimits]`
1730
1731 Optional custom token limits for this truncation strategy. If not provided, the model's default token limits will be used.
1732
1733 - `post_instructions: Optional[int]`
1734
1735 Maximum tokens allowed in the conversation after instructions (which include tool definitions). For example, setting this to 5,000 would mean that truncation occurs when the conversation exceeds 5,000 tokens after instructions. This cannot be higher than the model's context window size minus the maximum output tokens.
1736
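The interaction between `post_instructions` and `retention_ratio` is easier to follow with numbers. The function below is a simplified client-side model of the behavior described above, not the server's actual algorithm: it assumes per-message token counts are known, and it drops the oldest messages until the total fits within `retention_ratio` times the limit.

```python
from typing import List


def truncate_oldest(
    message_tokens: List[int],
    post_instructions_limit: int,
    retention_ratio: float,
) -> List[int]:
    """Simulate retention-ratio truncation over per-message token counts."""
    if not 0.0 <= retention_ratio <= 1.0:
        raise ValueError("retention_ratio must be between 0.0 and 1.0")
    kept = list(message_tokens)
    # Truncation only triggers once the conversation exceeds the limit.
    if sum(kept) <= post_instructions_limit:
        return kept
    target = retention_ratio * post_instructions_limit
    # Drop messages from the oldest end until the target is met.
    while kept and sum(kept) > target:
        kept.pop(0)
    return kept


# 5,500 tokens exceed the 5,000 limit, so messages are dropped until
# at most 80% of the limit (4,000 tokens) remains.
remaining = truncate_oldest([1200, 900, 1500, 800, 1100], 5000, 0.8)
```

In this run the two oldest messages are dropped and 3,400 tokens remain. That leaves headroom below the limit, so the next few turns should not trigger another truncation, which is the cache benefit the description refers to.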
1737 - `class RealtimeTranscriptionSessionCreateResponse: …`
1738
1739 A Realtime transcription session configuration object.
1740
1741 - `id: str`
1742
1743 Unique identifier for the session that looks like `sess_1234567890abcdef`.
1744
1745 - `object: str`
1746
1747 The object type. Always `realtime.transcription_session`.
1748
1749 - `type: Literal["transcription"]`
1750
1751 The type of session. Always `transcription` for transcription sessions.
1752
1753 - `"transcription"`
1754
1755 - `audio: Optional[Audio]`
1756
1757 Configuration for input audio for the session.
1758
1759 - `input: Optional[AudioInput]`
1760
1761 - `format: Optional[RealtimeAudioFormats]`
1762
1763 The PCM audio format. Only a 24kHz sample rate is supported.
1764
1765 - `noise_reduction: Optional[AudioInputNoiseReduction]`
1766
1767 Configuration for input audio noise reduction.
1768
1769 - `type: Optional[NoiseReductionType]`
1770
1771 Type of noise reduction. `near_field` is for close-talking microphones such as headphones, `far_field` is for far-field microphones such as laptop or conference room microphones.
1772
1773 - `transcription: Optional[AudioTranscription]`
1774
1775 - `turn_detection: Optional[RealtimeTranscriptionSessionTurnDetection]`
1776
1777 Configuration for turn detection. Can be set to `null` to turn off. Server
1778 VAD means that the model will detect the start and end of speech based on
1779 audio volume and respond at the end of user speech. For `gpt-realtime-whisper`, this must be `null`; VAD is not supported.
1780
1781 - `prefix_padding_ms: Optional[int]`
1782
1783 Amount of audio to include before the VAD detected speech (in
1784 milliseconds). Defaults to 300ms.
1785
1786 - `silence_duration_ms: Optional[int]`
1787
1788 Duration of silence to detect speech stop (in milliseconds). Defaults
1789 to 500ms. With shorter values the model will respond more quickly,
1790 but may jump in on short pauses from the user.
1791
1792 - `threshold: Optional[float]`
1793
Activation threshold for VAD (0.0 to 1.0). Defaults to 0.5. A
higher threshold will require louder audio to activate the model, and
thus might perform better in noisy environments.
1797
1798 - `type: Optional[str]`
1799
1800 Type of turn detection, only `server_vad` is currently supported.
1801
1802 - `expires_at: Optional[int]`
1803
1804 Expiration timestamp for the session, in seconds since epoch.
1805
1806 - `include: Optional[List[Literal["item.input_audio_transcription.logprobs"]]]`
1807
1808 Additional fields to include in server outputs.
1809
1810 - `item.input_audio_transcription.logprobs`: Include logprobs for input audio transcription.
1811
1812 - `"item.input_audio_transcription.logprobs"`
1813
1814 - `value: str`
1815
1816 The generated client secret value.
1817
### Example
1819
```python
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY"),  # This is the default and can be omitted
)
client_secret = client.realtime.client_secrets.create()
print(client_secret.expires_at)
```
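
The call above relies on the defaults. As a minimal sketch, the `expires_after` and `session` parameters described earlier could be assembled as shown below. The helper name `build_client_secret_params` is hypothetical, the 10 to 7200 second bounds are copied from the `seconds` description, and the server remains the authority on validation.

```python
from typing import Any, Dict

# Bounds taken from the `expires_after.seconds` description in this reference.
MIN_TTL_SECONDS = 10
MAX_TTL_SECONDS = 7200


def build_client_secret_params(ttl_seconds: int = 600, model: str = "gpt-realtime") -> Dict[str, Any]:
    """Hypothetical helper: assemble keyword arguments for client_secrets.create()."""
    if not MIN_TTL_SECONDS <= ttl_seconds <= MAX_TTL_SECONDS:
        raise ValueError(
            f"ttl_seconds must be between {MIN_TTL_SECONDS} and {MAX_TTL_SECONDS}"
        )
    return {
        "expires_after": {"anchor": "created_at", "seconds": ttl_seconds},
        "session": {"type": "realtime", "model": model},
    }


params = build_client_secret_params(ttl_seconds=120)
# With a configured client, the request would then be:
#   client_secret = client.realtime.client_secrets.create(**params)
print(params["expires_after"])
```

Checking the TTL before the request is only a convenience here; it surfaces an out-of-range value without a round trip.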
1830
#### Response
1832
```json
{
  "expires_at": 0,
  "session": {
    "id": "id",
    "object": "realtime.session",
    "type": "realtime",
    "audio": {
      "input": {
        "format": {
          "rate": 24000,
          "type": "audio/pcm"
        },
        "noise_reduction": {
          "type": "near_field"
        },
        "transcription": {
          "delay": "minimal",
          "language": "language",
          "model": "string",
          "prompt": "prompt"
        },
        "turn_detection": {
          "type": "server_vad",
          "create_response": true,
          "idle_timeout_ms": 5000,
          "interrupt_response": true,
          "prefix_padding_ms": 0,
          "silence_duration_ms": 0,
          "threshold": 0
        }
      },
      "output": {
        "format": {
          "rate": 24000,
          "type": "audio/pcm"
        },
        "speed": 0.25,
        "voice": "ash"
      }
    },
    "expires_at": 0,
    "include": [
      "item.input_audio_transcription.logprobs"
    ],
    "instructions": "instructions",
    "max_output_tokens": 0,
    "model": "string",
    "output_modalities": [
      "text"
    ],
    "prompt": {
      "id": "id",
      "variables": {
        "foo": "string"
      },
      "version": "version"
    },
    "reasoning": {
      "effort": "minimal"
    },
    "tool_choice": "none",
    "tools": [
      {
        "description": "description",
        "name": "name",
        "parameters": {},
        "type": "function"
      }
    ],
    "tracing": "auto",
    "truncation": "auto"
  },
  "value": "value"
}
```
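
A short sketch of reading such a response on the server before handing the secret to a client. The payload below is a trimmed, made-up sample, `is_expired` is a hypothetical helper, and the sketch assumes the top-level `expires_at` uses the same seconds-since-epoch convention as the session field.

```python
import json
import time
from typing import Optional

# Trimmed, made-up payload with the same top-level shape as the response above.
raw = '{"expires_at": 1700000600, "session": {"type": "realtime"}, "value": "ek_1234"}'
secret = json.loads(raw)


def is_expired(expires_at: int, now: Optional[float] = None) -> bool:
    """Return True once the current time has reached the expiration timestamp."""
    current = time.time() if now is None else now
    return current >= expires_at


# `now` is passed explicitly so the result does not depend on the wall clock.
print(is_expired(secret["expires_at"], now=1700000000))  # False: before expiry
print(is_expired(secret["expires_at"], now=1700000601))  # True: after expiry
```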
1909
## Domain Types
1911
### Realtime Session Create Response
1913
1914- `class RealtimeSessionCreateResponse: …`
1915
1916 A Realtime session configuration object.
1917
1918 - `id: str`
1919
1920 Unique identifier for the session that looks like `sess_1234567890abcdef`.
1921
1922 - `object: Literal["realtime.session"]`
1923
1924 The object type. Always `realtime.session`.
1925
1926 - `"realtime.session"`
1927
1928 - `type: Literal["realtime"]`
1929
1930 The type of session to create. Always `realtime` for the Realtime API.
1931
1932 - `"realtime"`
1933
1934 - `audio: Optional[Audio]`
1935
1936 Configuration for input and output audio.
1937
1938 - `input: Optional[AudioInput]`
1939
1940 - `format: Optional[RealtimeAudioFormats]`
1941
1942 The format of the input audio.
1943
1944 - `class AudioPCM: …`
1945
1946 The PCM audio format. Only a 24kHz sample rate is supported.
1947
1948 - `rate: Optional[Literal[24000]]`
1949
1950 The sample rate of the audio. Always `24000`.
1951
1952 - `24000`
1953
1954 - `type: Optional[Literal["audio/pcm"]]`
1955
1956 The audio format. Always `audio/pcm`.
1957
1958 - `"audio/pcm"`
1959
1960 - `class AudioPCMU: …`
1961
1962 The G.711 μ-law format.
1963
1964 - `type: Optional[Literal["audio/pcmu"]]`
1965
1966 The audio format. Always `audio/pcmu`.
1967
1968 - `"audio/pcmu"`
1969
1970 - `class AudioPCMA: …`
1971
1972 The G.711 A-law format.
1973
1974 - `type: Optional[Literal["audio/pcma"]]`
1975
1976 The audio format. Always `audio/pcma`.
1977
1978 - `"audio/pcma"`
1979
1980 - `noise_reduction: Optional[AudioInputNoiseReduction]`
1981
1982 Configuration for input audio noise reduction. This can be set to `null` to turn off.
1983 Noise reduction filters audio added to the input audio buffer before it is sent to VAD and the model.
1984 Filtering the audio can improve VAD and turn detection accuracy (reducing false positives) and model performance by improving perception of the input audio.
1985
1986 - `type: Optional[NoiseReductionType]`
1987
1988 Type of noise reduction. `near_field` is for close-talking microphones such as headphones, `far_field` is for far-field microphones such as laptop or conference room microphones.
1989
1990 - `"near_field"`
1991
1992 - `"far_field"`
1993
1994 - `transcription: Optional[AudioTranscription]`
1995
1996 - `delay: Optional[Literal["minimal", "low", "medium", 2 more]]`
1997
1998 Controls how long the model waits before emitting transcription text.
1999 Higher values can improve transcription accuracy at the cost of latency.
2000 Only supported with `gpt-realtime-whisper` in GA Realtime sessions.
2001
2002 - `"minimal"`
2003
2004 - `"low"`
2005
2006 - `"medium"`
2007
2008 - `"high"`
2009
2010 - `"xhigh"`
2011
2012 - `language: Optional[str]`
2013
2014 The language of the input audio. Supplying the input language in
2015 [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) (e.g. `en`) format
2016 will improve accuracy and latency.
2017
2018 - `model: Optional[Union[str, Literal["whisper-1", "gpt-4o-mini-transcribe", "gpt-4o-mini-transcribe-2025-12-15", 3 more], null]]`
2019
2020 The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels.
2021
2022 - `str`
2023
2024 - `Literal["whisper-1", "gpt-4o-mini-transcribe", "gpt-4o-mini-transcribe-2025-12-15", 3 more]`
2025
2026 The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels.
2027
2028 - `"whisper-1"`
2029
2030 - `"gpt-4o-mini-transcribe"`
2031
2032 - `"gpt-4o-mini-transcribe-2025-12-15"`
2033
2034 - `"gpt-4o-transcribe"`
2035
2036 - `"gpt-4o-transcribe-diarize"`
2037
2038 - `"gpt-realtime-whisper"`
2039
2040 - `prompt: Optional[str]`
2041
2042 An optional text to guide the model's style or continue a previous audio
2043 segment.
2044 For `whisper-1`, the [prompt is a list of keywords](https://platform.openai.com/docs/guides/speech-to-text#prompting).
2045 For `gpt-4o-transcribe` models (excluding `gpt-4o-transcribe-diarize`), the prompt is a free text string, for example "expect words related to technology".
2046 Prompt is not supported with `gpt-realtime-whisper` in GA Realtime sessions.
2047
2048 - `turn_detection: Optional[AudioInputTurnDetection]`
2049
Configuration for turn detection, either Server VAD or Semantic VAD. This can be set to `null` to turn off, in which case the client must manually trigger a model response.
2051
2052 Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.
2053
2054 Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.
2055
2056 For `gpt-realtime-whisper` transcription sessions, turn detection must be
2057 set to `null`; VAD is not supported.
2058
2059 - `class AudioInputTurnDetectionServerVad: …`
2060
2061 Server-side voice activity detection (VAD) which flips on when user speech is detected and off after a period of silence.
2062
2063 - `type: Literal["server_vad"]`
2064
2065 Type of turn detection, `server_vad` to turn on simple Server VAD.
2066
2067 - `"server_vad"`
2068
2069 - `create_response: Optional[bool]`
2070
2071 Whether or not to automatically generate a response when a VAD stop event occurs. If `interrupt_response` is set to `false` this may fail to create a response if the model is already responding.
2072
2073 If both `create_response` and `interrupt_response` are set to `false`, the model will never respond automatically but VAD events will still be emitted.
2074
2075 - `idle_timeout_ms: Optional[int]`
2076
2077 Optional timeout after which a model response will be triggered automatically. This is
2078 useful for situations in which a long pause from the user is unexpected, such as a phone
2079 call. The model will effectively prompt the user to continue the conversation based
2080 on the current context.
2081
2082 The timeout value will be applied after the last model response's audio has finished playing,
2083 i.e. it's set to the `response.done` time plus audio playback duration.
2084
2085 An `input_audio_buffer.timeout_triggered` event (plus events
2086 associated with the Response) will be emitted when the timeout is reached.
2087 Idle timeout is currently only supported for `server_vad` mode.
2088
2089 - `interrupt_response: Optional[bool]`
2090
2091 Whether or not to automatically interrupt (cancel) any ongoing response with output to the default
2092 conversation (i.e. `conversation` of `auto`) when a VAD start event occurs. If `true` then the response will be cancelled, otherwise it will continue until complete.
2093
2094 If both `create_response` and `interrupt_response` are set to `false`, the model will never respond automatically but VAD events will still be emitted.
2095
2096 - `prefix_padding_ms: Optional[int]`
2097
2098 Used only for `server_vad` mode. Amount of audio to include before the VAD detected speech (in
2099 milliseconds). Defaults to 300ms.
2100
2101 - `silence_duration_ms: Optional[int]`
2102
2103 Used only for `server_vad` mode. Duration of silence to detect speech stop (in milliseconds). Defaults
2104 to 500ms. With shorter values the model will respond more quickly,
2105 but may jump in on short pauses from the user.
2106
2107 - `threshold: Optional[float]`
2108
Used only for `server_vad` mode. Activation threshold for VAD (0.0 to 1.0). Defaults to 0.5. A
higher threshold will require louder audio to activate the model, and
thus might perform better in noisy environments.
2112
2113 - `class AudioInputTurnDetectionSemanticVad: …`
2114
2115 Server-side semantic turn detection which uses a model to determine when the user has finished speaking.
2116
2117 - `type: Literal["semantic_vad"]`
2118
2119 Type of turn detection, `semantic_vad` to turn on Semantic VAD.
2120
2121 - `"semantic_vad"`
2122
2123 - `create_response: Optional[bool]`
2124
2125 Whether or not to automatically generate a response when a VAD stop event occurs.
2126
2127 - `eagerness: Optional[Literal["low", "medium", "high", "auto"]]`
2128
2129 Used only for `semantic_vad` mode. The eagerness of the model to respond. `low` will wait longer for the user to continue speaking, `high` will respond more quickly. `auto` is the default and is equivalent to `medium`. `low`, `medium`, and `high` have max timeouts of 8s, 4s, and 2s respectively.
2130
2131 - `"low"`
2132
2133 - `"medium"`
2134
2135 - `"high"`
2136
2137 - `"auto"`
2138
2139 - `interrupt_response: Optional[bool]`
2140
2141 Whether or not to automatically interrupt any ongoing response with output to the default
2142 conversation (i.e. `conversation` of `auto`) when a VAD start event occurs.
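
The two turn detection variants above can be written as plain dictionaries. The following is a minimal sketch, not the SDK's own types: `server_vad` and `max_timeout_seconds` are hypothetical helpers, the defaults and timeouts are copied from the field descriptions, and `auto` is resolved to `medium` as the `eagerness` description states.

```python
from typing import Any, Dict

# Maximum timeouts in seconds per eagerness level, from the `eagerness` description.
EAGERNESS_MAX_TIMEOUT_S = {"low": 8, "medium": 4, "high": 2}


def max_timeout_seconds(eagerness: str = "auto") -> int:
    """Hypothetical helper: resolve `auto` to `medium`, then look up the max timeout."""
    level = "medium" if eagerness == "auto" else eagerness
    if level not in EAGERNESS_MAX_TIMEOUT_S:
        raise ValueError(f"unknown eagerness: {eagerness!r}")
    return EAGERNESS_MAX_TIMEOUT_S[level]


def server_vad(threshold: float = 0.5, prefix_padding_ms: int = 300,
               silence_duration_ms: int = 500) -> Dict[str, Any]:
    """Hypothetical helper: build a `server_vad` config from the documented defaults."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    return {
        "type": "server_vad",
        "threshold": threshold,
        "prefix_padding_ms": prefix_padding_ms,
        "silence_duration_ms": silence_duration_ms,
    }


semantic = {"type": "semantic_vad", "eagerness": "low"}
print(server_vad(threshold=0.7))
print(max_timeout_seconds(semantic["eagerness"]))
```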
2143
2144 - `output: Optional[AudioOutput]`
2145
2146 - `format: Optional[RealtimeAudioFormats]`
2147
2148 The format of the output audio.
2149
2150 - `speed: Optional[float]`
2151
2152 The speed of the model's spoken response as a multiple of the original speed.
2153 1.0 is the default speed. 0.25 is the minimum speed. 1.5 is the maximum speed. This value can only be changed in between model turns, not while a response is in progress.
2154
This parameter is a post-processing adjustment to the audio after it is generated; it's
also possible to prompt the model to speak faster or slower.
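
A small sketch of keeping a requested speed inside the documented range before sending it. Whether the server clamps or rejects out-of-range values is not stated here, so the client-side clamp and the helper name `clamp_speed` are assumptions.

```python
# Range taken from the `speed` description above.
MIN_SPEED = 0.25
MAX_SPEED = 1.5


def clamp_speed(requested: float) -> float:
    """Hypothetical helper: pull a requested speed into the documented range."""
    return max(MIN_SPEED, min(MAX_SPEED, requested))


print(clamp_speed(2.0))  # 1.5
print(clamp_speed(0.1))  # 0.25
print(clamp_speed(1.0))  # 1.0
```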
2157
2158 - `voice: Optional[Union[str, Literal["alloy", "ash", "ballad", 7 more], null]]`
2159
2160 The voice the model uses to respond. Voice cannot be changed during the
2161 session once the model has responded with audio at least once. Current
2162 voice options are `alloy`, `ash`, `ballad`, `coral`, `echo`, `sage`,
2163 `shimmer`, `verse`, `marin`, and `cedar`. We recommend `marin` and `cedar` for
2164 best quality.
2165
2166 - `str`
2167
2168 - `Literal["alloy", "ash", "ballad", 7 more]`
2169
2170 The voice the model uses to respond. Voice cannot be changed during the
2171 session once the model has responded with audio at least once. Current
2172 voice options are `alloy`, `ash`, `ballad`, `coral`, `echo`, `sage`,
2173 `shimmer`, `verse`, `marin`, and `cedar`. We recommend `marin` and `cedar` for
2174 best quality.
2175
2176 - `"alloy"`
2177
2178 - `"ash"`
2179
2180 - `"ballad"`
2181
2182 - `"coral"`
2183
2184 - `"echo"`
2185
2186 - `"sage"`
2187
2188 - `"shimmer"`
2189
2190 - `"verse"`
2191
2192 - `"marin"`
2193
2194 - `"cedar"`
2195
2196 - `expires_at: Optional[int]`
2197
2198 Expiration timestamp for the session, in seconds since epoch.
2199
2200 - `include: Optional[List[Literal["item.input_audio_transcription.logprobs"]]]`
2201
2202 Additional fields to include in server outputs.
2203
2204 `item.input_audio_transcription.logprobs`: Include logprobs for input audio transcription.
2205
2206 - `"item.input_audio_transcription.logprobs"`
2207
2208 - `instructions: Optional[str]`
2209
The default system instructions (i.e. system message) prepended to model calls. This field allows the client to guide the model on desired responses. The model can be instructed on response content and format (e.g. "be extremely succinct", "act friendly", "here are examples of good responses") and on audio behavior (e.g. "talk quickly", "inject emotion into your voice", "laugh frequently"). The instructions are not guaranteed to be followed by the model, but they provide guidance to the model on the desired behavior.
2211
2212 Note that the server sets default instructions which will be used if this field is not set and are visible in the `session.created` event at the start of the session.
2213
2214 - `max_output_tokens: Optional[Union[int, Literal["inf"], null]]`
2215
2216 Maximum number of output tokens for a single assistant response,
2217 inclusive of tool calls. Provide an integer between 1 and 4096 to
2218 limit output tokens, or `inf` for the maximum available tokens for a
2219 given model. Defaults to `inf`.
2220
2221 - `int`
2222
2223 - `Literal["inf"]`
2224
2225 - `"inf"`
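
The accepted values for `max_output_tokens` can be checked locally before a request is made. This is a sketch under the limits stated above (an integer from 1 to 4096, or `inf`); the helper name is hypothetical.

```python
from typing import Union


def validate_max_output_tokens(value: Union[int, str]) -> Union[int, str]:
    """Hypothetical helper: accept "inf" or an integer in the documented range."""
    if value == "inf":
        return value
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, int) and not isinstance(value, bool) and 1 <= value <= 4096:
        return value
    raise ValueError("max_output_tokens must be an integer from 1 to 4096 or 'inf'")


print(validate_max_output_tokens(1024))
print(validate_max_output_tokens("inf"))
```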
2226
2227 - `model: Optional[Union[str, Literal["gpt-realtime", "gpt-realtime-1.5", "gpt-realtime-2", 14 more], null]]`
2228
2229 The Realtime model used for this session.
2230
2231 - `str`
2232
2233 - `Literal["gpt-realtime", "gpt-realtime-1.5", "gpt-realtime-2", 14 more]`
2234
2235 The Realtime model used for this session.
2236
2237 - `"gpt-realtime"`
2238
2239 - `"gpt-realtime-1.5"`
2240
2241 - `"gpt-realtime-2"`
2242
2243 - `"gpt-realtime-2025-08-28"`
2244
2245 - `"gpt-4o-realtime-preview"`
2246
2247 - `"gpt-4o-realtime-preview-2024-10-01"`
2248
2249 - `"gpt-4o-realtime-preview-2024-12-17"`
2250
2251 - `"gpt-4o-realtime-preview-2025-06-03"`
2252
2253 - `"gpt-4o-mini-realtime-preview"`
2254
2255 - `"gpt-4o-mini-realtime-preview-2024-12-17"`
2256
2257 - `"gpt-realtime-mini"`
2258
2259 - `"gpt-realtime-mini-2025-10-06"`
2260
2261 - `"gpt-realtime-mini-2025-12-15"`
2262
2263 - `"gpt-audio-1.5"`
2264
2265 - `"gpt-audio-mini"`
2266
2267 - `"gpt-audio-mini-2025-10-06"`
2268
2269 - `"gpt-audio-mini-2025-12-15"`
2270
2271 - `output_modalities: Optional[List[Literal["text", "audio"]]]`
2272
2273 The set of modalities the model can respond with. It defaults to `["audio"]`, indicating
2274 that the model will respond with audio plus a transcript. `["text"]` can be used to make
2275 the model respond with text only. It is not possible to request both `text` and `audio` at the same time.
2276
2277 - `"text"`
2278
2279 - `"audio"`
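
Because `text` and `audio` cannot be requested together, a request builder may want to reject that combination early. A minimal sketch, with a hypothetical helper name and the documented `["audio"]` default:

```python
from typing import List, Optional


def resolve_output_modalities(requested: Optional[List[str]] = None) -> List[str]:
    """Hypothetical helper: apply the default and reject unsupported combinations."""
    if requested is None:
        return ["audio"]  # documented default: audio plus a transcript
    if sorted(requested) not in (["audio"], ["text"]):
        raise ValueError('output_modalities must be ["audio"] or ["text"]')
    return list(requested)


print(resolve_output_modalities())
print(resolve_output_modalities(["text"]))
```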
2280
2281 - `prompt: Optional[ResponsePrompt]`
2282
2283 Reference to a prompt template and its variables.
2284 [Learn more](https://platform.openai.com/docs/guides/text?api-mode=responses#reusable-prompts).
2285
2286 - `id: str`
2287
2288 The unique identifier of the prompt template to use.
2289
2290 - `variables: Optional[Dict[str, Variables]]`
2291
2292 Optional map of values to substitute in for variables in your
2293 prompt. The substitution values can either be strings, or other
2294 Response input types like images or files.
2295
2296 - `str`
2297
2298 - `class ResponseInputText: …`
2299
2300 A text input to the model.
2301
2302 - `text: str`
2303
2304 The text input to the model.
2305
2306 - `type: Literal["input_text"]`
2307
2308 The type of the input item. Always `input_text`.
2309
2310 - `"input_text"`
2311
2312 - `class ResponseInputImage: …`
2313
2314 An image input to the model. Learn about [image inputs](https://platform.openai.com/docs/guides/vision).
2315
2316 - `detail: Literal["low", "high", "auto", "original"]`
2317
2318 The detail level of the image to be sent to the model. One of `high`, `low`, `auto`, or `original`. Defaults to `auto`.
2319
2320 - `"low"`
2321
2322 - `"high"`
2323
2324 - `"auto"`
2325
2326 - `"original"`
2327
2328 - `type: Literal["input_image"]`
2329
2330 The type of the input item. Always `input_image`.
2331
2332 - `"input_image"`
2333
2334 - `file_id: Optional[str]`
2335
2336 The ID of the file to be sent to the model.
2337
2338 - `image_url: Optional[str]`
2339
2340 The URL of the image to be sent to the model. A fully qualified URL or base64 encoded image in a data URL.
2341
2342 - `class ResponseInputFile: …`
2343
2344 A file input to the model.
2345
2346 - `type: Literal["input_file"]`
2347
2348 The type of the input item. Always `input_file`.
2349
2350 - `"input_file"`
2351
2352 - `detail: Optional[Literal["low", "high"]]`
2353
2354 The detail level of the file to be sent to the model. Use `low` for the default rendering behavior, or `high` to render the file at higher quality. Defaults to `low`.
2355
2356 - `"low"`
2357
2358 - `"high"`
2359
2360 - `file_data: Optional[str]`
2361
2362 The content of the file to be sent to the model.
2363
2364 - `file_id: Optional[str]`
2365
2366 The ID of the file to be sent to the model.
2367
2368 - `file_url: Optional[str]`
2369
2370 The URL of the file to be sent to the model.
2371
2372 - `filename: Optional[str]`
2373
2374 The name of the file to be sent to the model.
2375
2376 - `version: Optional[str]`
2377
2378 Optional version of the prompt template.
2379
2380 - `reasoning: Optional[RealtimeReasoning]`
2381
2382 Configuration for reasoning-capable Realtime models such as `gpt-realtime-2`.
2383
2384 - `effort: Optional[RealtimeReasoningEffort]`
2385
2386 Constrains effort on reasoning for reasoning-capable Realtime models such as
2387 `gpt-realtime-2`.
2388
2389 - `"minimal"`
2390
2391 - `"low"`
2392
2393 - `"medium"`
2394
2395 - `"high"`
2396
2397 - `"xhigh"`
2398
2399 - `tool_choice: Optional[ToolChoice]`
2400
2401 How the model chooses tools. Provide one of the string modes or force a specific
2402 function/MCP tool.
2403
2404 - `Literal["none", "auto", "required"]`
2405
2406 - `"none"`
2407
2408 - `"auto"`
2409
2410 - `"required"`
2411
2412 - `class ToolChoiceFunction: …`
2413
2414 Use this option to force the model to call a specific function.
2415
2416 - `name: str`
2417
2418 The name of the function to call.
2419
2420 - `type: Literal["function"]`
2421
2422 For function calling, the type is always `function`.
2423
2424 - `"function"`
2425
2426 - `class ToolChoiceMcp: …`
2427
2428 Use this option to force the model to call a specific tool on a remote MCP server.
2429
2430 - `server_label: str`
2431
2432 The label of the MCP server to use.
2433
2434 - `type: Literal["mcp"]`
2435
2436 For MCP tools, the type is always `mcp`.
2437
2438 - `"mcp"`
2439
2440 - `name: Optional[str]`
2441
2442 The name of the tool to call on the server.
2443
2444 - `tools: Optional[List[Tool]]`
2445
2446 Tools available to the model.
2447
2448 - `class RealtimeFunctionTool: …`
2449
2450 - `description: Optional[str]`
2451
2452 The description of the function, including guidance on when and how
2453 to call it, and guidance about what to tell the user when calling
2454 (if anything).
2455
2456 - `name: Optional[str]`
2457
2458 The name of the function.
2459
2460 - `parameters: Optional[object]`
2461
2462 Parameters of the function in JSON Schema.
2463
2464 - `type: Optional[Literal["function"]]`
2465
2466 The type of the tool, i.e. `function`.
2467
2468 - `"function"`
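
As an illustration of how a function tool and a forced `tool_choice` fit together, the fragment below builds both as plain dictionaries. The function name `get_weather` and its schema are made up for this sketch; only the field names come from the descriptions above.

```python
import json

# Hypothetical function tool; the name, description, and schema are illustrative only.
weather_tool = {
    "type": "function",
    "name": "get_weather",
    "description": "Look up the current weather for a city when the user asks about it.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

# Force the model to call that specific function instead of using a string mode.
tool_choice = {"type": "function", "name": weather_tool["name"]}

session_fragment = {"tools": [weather_tool], "tool_choice": tool_choice}
print(json.dumps(session_fragment, indent=2))
```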
2469
2470 - `class ToolMcpTool: …`
2471
2472 Give the model access to additional tools via remote Model Context Protocol
2473 (MCP) servers. [Learn more about MCP](https://platform.openai.com/docs/guides/tools-remote-mcp).
2474
2475 - `server_label: str`
2476
2477 A label for this MCP server, used to identify it in tool calls.
2478
2479 - `type: Literal["mcp"]`
2480
2481 The type of the MCP tool. Always `mcp`.
2482
2483 - `"mcp"`
2484
2485 - `allowed_tools: Optional[ToolMcpToolAllowedTools]`
2486
2487 List of allowed tool names or a filter object.
2488
2489 - `List[str]`
2490
2491 A string array of allowed tool names
2492
2493 - `class ToolMcpToolAllowedToolsMcpToolFilter: …`
2494
2495 A filter object to specify which tools are allowed.
2496
2497 - `read_only: Optional[bool]`
2498
2499 Indicates whether or not a tool modifies data or is read-only. If an
2500 MCP server is [annotated with `readOnlyHint`](https://modelcontextprotocol.io/specification/2025-06-18/schema#toolannotations-readonlyhint),
2501 it will match this filter.
2502
2503 - `tool_names: Optional[List[str]]`
2504
2505 List of allowed tool names.
2506
2507 - `authorization: Optional[str]`
2508
2509 An OAuth access token that can be used with a remote MCP server, either
2510 with a custom MCP server URL or a service connector. Your application
2511 must handle the OAuth authorization flow and provide the token here.
2512
2513 - `connector_id: Optional[Literal["connector_dropbox", "connector_gmail", "connector_googlecalendar", 5 more]]`
2514
2515 Identifier for service connectors, like those available in ChatGPT. One of
2516 `server_url` or `connector_id` must be provided. Learn more about service
2517 connectors [here](https://platform.openai.com/docs/guides/tools-remote-mcp#connectors).
2518
2519 Currently supported `connector_id` values are:
2520
2521 - Dropbox: `connector_dropbox`
2522 - Gmail: `connector_gmail`
2523 - Google Calendar: `connector_googlecalendar`
2524 - Google Drive: `connector_googledrive`
2525 - Microsoft Teams: `connector_microsoftteams`
2526 - Outlook Calendar: `connector_outlookcalendar`
2527 - Outlook Email: `connector_outlookemail`
2528 - SharePoint: `connector_sharepoint`
2529
2530 - `"connector_dropbox"`
2531
2532 - `"connector_gmail"`
2533
2534 - `"connector_googlecalendar"`
2535
2536 - `"connector_googledrive"`
2537
2538 - `"connector_microsoftteams"`
2539
2540 - `"connector_outlookcalendar"`
2541
2542 - `"connector_outlookemail"`
2543
2544 - `"connector_sharepoint"`
2545
2546 - `defer_loading: Optional[bool]`
2547
2548 Whether this MCP tool is deferred and discovered via tool search.
2549
2550 - `headers: Optional[Dict[str, str]]`
2551
2552 Optional HTTP headers to send to the MCP server. Use for authentication
2553 or other purposes.
2554
2555 - `require_approval: Optional[ToolMcpToolRequireApproval]`
2556
2557 Specify which of the MCP server's tools require approval.
2558
2559 - `class ToolMcpToolRequireApprovalMcpToolApprovalFilter: …`
2560
2561 Specify which of the MCP server's tools require approval. Can be
2562 `always`, `never`, or a filter object associated with tools
2563 that require approval.
2564
2565 - `always: Optional[ToolMcpToolRequireApprovalMcpToolApprovalFilterAlways]`
2566
2567 A filter object to specify which tools are allowed.
2568
2569 - `read_only: Optional[bool]`
2570
2571 Indicates whether or not a tool modifies data or is read-only. If an
2572 MCP server is [annotated with `readOnlyHint`](https://modelcontextprotocol.io/specification/2025-06-18/schema#toolannotations-readonlyhint),
2573 it will match this filter.
2574
2575 - `tool_names: Optional[List[str]]`
2576
2577 List of allowed tool names.
2578
2579 - `never: Optional[ToolMcpToolRequireApprovalMcpToolApprovalFilterNever]`
2580
2581 A filter object to specify which tools are allowed.
2582
2583 - `read_only: Optional[bool]`
2584
2585 Indicates whether or not a tool modifies data or is read-only. If an
2586 MCP server is [annotated with `readOnlyHint`](https://modelcontextprotocol.io/specification/2025-06-18/schema#toolannotations-readonlyhint),
2587 it will match this filter.
2588
2589 - `tool_names: Optional[List[str]]`
2590
2591 List of allowed tool names.
2592
2593 - `Literal["always", "never"]`
2594
Specify a single approval policy for all tools. One of `always` or
`never`. When set to `always`, all tools will require approval. When
set to `never`, no tools will require approval.
2598
2599 - `"always"`
2600
2601 - `"never"`
2602
2603 - `server_description: Optional[str]`
2604
2605 Optional description of the MCP server, used to provide more context.
2606
2607 - `server_url: Optional[str]`
2608
2609 The URL for the MCP server. One of `server_url` or `connector_id` must be
2610 provided.
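
The MCP tool fields above can be combined into a single entry for the `tools` list. This sketch assumes a hypothetical helper `build_mcp_tool`, uses a placeholder label and URL, and covers only the string form of `require_approval`, not the filter object. It enforces the stated rule that one of `server_url` or `connector_id` must be provided.

```python
from typing import Any, Dict, List, Optional


def build_mcp_tool(
    server_label: str,
    server_url: Optional[str] = None,
    connector_id: Optional[str] = None,
    allowed_tools: Optional[List[str]] = None,
    require_approval: Optional[str] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: assemble an MCP tool entry for the `tools` list."""
    if server_url is None and connector_id is None:
        raise ValueError("one of server_url or connector_id must be provided")
    if require_approval not in (None, "always", "never"):
        raise ValueError("require_approval must be 'always' or 'never' in this sketch")
    tool: Dict[str, Any] = {"type": "mcp", "server_label": server_label}
    optional_fields = {
        "server_url": server_url,
        "connector_id": connector_id,
        "allowed_tools": allowed_tools,
        "require_approval": require_approval,
    }
    # Only include the optional fields that were actually set.
    tool.update({key: value for key, value in optional_fields.items() if value is not None})
    return tool


# The label and URL below are placeholders, not real endpoints.
tool = build_mcp_tool("docs", server_url="https://example.com/mcp",
                      allowed_tools=["search"], require_approval="never")
print(tool)
```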
2611
2612 - `tracing: Optional[Tracing]`
2613
2614 Realtime API can write session traces to the [Traces Dashboard](https://platform.openai.com/logs?api=traces). Set to null to disable tracing. Once
2615 tracing is enabled for a session, the configuration cannot be modified.
2616
2617 `auto` will create a trace for the session with default values for the
2618 workflow name, group id, and metadata.
2619
2620 - `Literal["auto"]`
2621
2622 Enables tracing and sets default values for tracing configuration options. Always `auto`.
2623
2624 - `"auto"`
2625
2626 - `class TracingTracingConfiguration: …`
2627
2628 Granular configuration for tracing.
2629
2630 - `group_id: Optional[str]`
2631
2632 The group id to attach to this trace to enable filtering and
2633 grouping in the Traces Dashboard.
2634
2635 - `metadata: Optional[object]`
2636
2637 The arbitrary metadata to attach to this trace to enable
2638 filtering in the Traces Dashboard.
2639
2640 - `workflow_name: Optional[str]`
2641
2642 The name of the workflow to attach to this trace. This is used to
2643 name the trace in the Traces Dashboard.
2644
2645 - `truncation: Optional[RealtimeTruncation]`
2646
When the number of tokens in a conversation exceeds the model's input token limit, the conversation will be truncated, meaning messages (starting from the oldest) will not be included in the model's context. A 32k context model with 4,096 max output tokens can only include 28,224 tokens in the context before truncation occurs.
2648
2649 Clients can configure truncation behavior to truncate with a lower max token limit, which is an effective way to control token usage and cost.
2650
2651 Truncation will reduce the number of cached tokens on the next turn (busting the cache), since messages are dropped from the beginning of the context. However, clients can also configure truncation to retain messages up to a fraction of the maximum context size, which will reduce the need for future truncations and thus improve the cache rate.
2652
2653 Truncation can be disabled entirely, which means the server will never truncate but would instead return an error if the conversation exceeds the model's input token limit.
2654
2655 - `Literal["auto", "disabled"]`
2656
2657 The truncation strategy to use for the session. `auto` is the default truncation strategy. `disabled` will disable truncation and emit errors when the conversation exceeds the input token limit.
2658
2659 - `"auto"`
2660
2661 - `"disabled"`
2662
2663 - `class RealtimeTruncationRetentionRatio: …`
2664
2665 Retain a fraction of the conversation tokens when the conversation exceeds the input token limit. This allows you to amortize truncations across multiple turns, which can help improve cached token usage.
2666
2667 - `retention_ratio: float`
2668
2669 Fraction of post-instruction conversation tokens to retain (`0.0` - `1.0`) when the conversation exceeds the input token limit. Setting this to `0.8` means that messages will be dropped until 80% of the maximum allowed tokens are used. This helps reduce the frequency of truncations and improve cache rates.
2670
2671 - `type: Literal["retention_ratio"]`
2672
2673 Use retention ratio truncation.
2674
2675 - `"retention_ratio"`
2676
2677 - `token_limits: Optional[TokenLimits]`
2678
2679 Optional custom token limits for this truncation strategy. If not provided, the model's default token limits will be used.
2680
2681 - `post_instructions: Optional[int]`
2682
Maximum tokens allowed in the conversation after instructions (including tool definitions). For example, setting this to 5,000 would mean that truncation would occur when the conversation exceeds 5,000 tokens after instructions. This cannot be higher than the model's context window size minus the maximum output tokens.
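
To make the retention ratio arithmetic concrete, the sketch below simulates it locally. It is an illustration of the described behavior, not the server's implementation: `truncate_oldest` is a hypothetical helper, message sizes are made-up token counts, and whole messages are dropped from the oldest end until the remainder fits within `retention_ratio` times the `post_instructions` limit.

```python
from typing import List


def truncate_oldest(message_tokens: List[int], post_instructions_limit: int,
                    retention_ratio: float) -> List[int]:
    """Simulate retention-ratio truncation over per-message token counts (oldest first)."""
    if not 0.0 <= retention_ratio <= 1.0:
        raise ValueError("retention_ratio must be between 0.0 and 1.0")
    # Nothing happens until the conversation exceeds the limit.
    if sum(message_tokens) <= post_instructions_limit:
        return list(message_tokens)
    target = retention_ratio * post_instructions_limit
    kept = list(message_tokens)
    # Drop the oldest messages until the remainder fits under the target.
    while kept and sum(kept) > target:
        kept.pop(0)
    return kept


history = [1200, 900, 1500, 1100, 800]  # 5,500 tokens in total, oldest first
print(truncate_oldest(history, post_instructions_limit=5000, retention_ratio=0.8))
# [1500, 1100, 800]: 3,400 tokens remain, under the 4,000 token target
```

Dropping to 80% rather than to just under the limit leaves headroom, so the next few turns do not each trigger another truncation.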
2684
### Realtime Transcription Session Create Response
2686
2687- `class RealtimeTranscriptionSessionCreateResponse: …`
2688
2689 A Realtime transcription session configuration object.
2690
2691 - `id: str`
2692
2693 Unique identifier for the session that looks like `sess_1234567890abcdef`.
2694
2695 - `object: str`
2696
2697 The object type. Always `realtime.transcription_session`.
2698
2699 - `type: Literal["transcription"]`
2700
2701 The type of session. Always `transcription` for transcription sessions.
2702
2703 - `"transcription"`
2704
2705 - `audio: Optional[Audio]`
2706
2707 Configuration for input audio for the session.
2708
2709 - `input: Optional[AudioInput]`
2710
2711 - `format: Optional[RealtimeAudioFormats]`
2712
2713 The PCM audio format. Only a 24kHz sample rate is supported.
2714
2715 - `class AudioPCM: …`
2716
2717 The PCM audio format. Only a 24kHz sample rate is supported.
2718
2719 - `rate: Optional[Literal[24000]]`
2720
2721 The sample rate of the audio. Always `24000`.
2722
2723 - `24000`
2724
2725 - `type: Optional[Literal["audio/pcm"]]`
2726
2727 The audio format. Always `audio/pcm`.
2728
2729 - `"audio/pcm"`
2730
2731 - `class AudioPCMU: …`
2732
2733 The G.711 μ-law format.
2734
2735 - `type: Optional[Literal["audio/pcmu"]]`
2736
2737 The audio format. Always `audio/pcmu`.
2738
2739 - `"audio/pcmu"`
2740
2741 - `class AudioPCMA: …`
2742
2743 The G.711 A-law format.
2744
2745 - `type: Optional[Literal["audio/pcma"]]`
2746
2747 The audio format. Always `audio/pcma`.
2748
2749 - `"audio/pcma"`
2750
2751 - `noise_reduction: Optional[AudioInputNoiseReduction]`
2752
2753 Configuration for input audio noise reduction.
2754
2755 - `type: Optional[NoiseReductionType]`
2756
2757 Type of noise reduction. `near_field` is for close-talking microphones such as headphones, `far_field` is for far-field microphones such as laptop or conference room microphones.
2758
2759 - `"near_field"`
2760
2761 - `"far_field"`
2762
2763 - `transcription: Optional[AudioTranscription]`
2764
2765 - `delay: Optional[Literal["minimal", "low", "medium", 2 more]]`
2766
2767 Controls how long the model waits before emitting transcription text.
2768 Higher values can improve transcription accuracy at the cost of latency.
2769 Only supported with `gpt-realtime-whisper` in GA Realtime sessions.
2770
2771 - `"minimal"`
2772
2773 - `"low"`
2774
2775 - `"medium"`
2776
2777 - `"high"`
2778
2779 - `"xhigh"`
2780
2781 - `language: Optional[str]`
2782
2783 The language of the input audio. Supplying the input language in
2784 [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) (e.g. `en`) format
2785 will improve accuracy and latency.
2786
2787 - `model: Optional[Union[str, Literal["whisper-1", "gpt-4o-mini-transcribe", "gpt-4o-mini-transcribe-2025-12-15", 3 more], null]]`
2788
2789 The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels.
2790
2791 - `str`
2792
2793 - `Literal["whisper-1", "gpt-4o-mini-transcribe", "gpt-4o-mini-transcribe-2025-12-15", 3 more]`
2794
2795 The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels.
2796
2797 - `"whisper-1"`
2798
2799 - `"gpt-4o-mini-transcribe"`
2800
2801 - `"gpt-4o-mini-transcribe-2025-12-15"`
2802
2803 - `"gpt-4o-transcribe"`
2804
2805 - `"gpt-4o-transcribe-diarize"`
2806
2807 - `"gpt-realtime-whisper"`
2808
2809 - `prompt: Optional[str]`
2810
2811 An optional text to guide the model's style or continue a previous audio
2812 segment.
2813 For `whisper-1`, the [prompt is a list of keywords](https://platform.openai.com/docs/guides/speech-to-text#prompting).
2814 For `gpt-4o-transcribe` models (excluding `gpt-4o-transcribe-diarize`), the prompt is a free text string, for example "expect words related to technology".
2815 Prompt is not supported with `gpt-realtime-whisper` in GA Realtime sessions.
2816
2817 - `turn_detection: Optional[RealtimeTranscriptionSessionTurnDetection]`
2818
2819 Configuration for turn detection. Can be set to `null` to turn off. Server
2820 VAD means that the model will detect the start and end of speech based on
2821 audio volume and respond at the end of user speech. For `gpt-realtime-whisper`, this must be `null`; VAD is not supported.
2822
2823 - `prefix_padding_ms: Optional[int]`
2824
2825 Amount of audio to include before the VAD detected speech (in
2826 milliseconds). Defaults to 300ms.
2827
2828 - `silence_duration_ms: Optional[int]`
2829
2830 Duration of silence to detect speech stop (in milliseconds). Defaults
2831 to 500ms. With shorter values the model will respond more quickly,
2832 but may jump in on short pauses from the user.
2833
2834 - `threshold: Optional[float]`
2835
2836 Activation threshold for VAD (0.0 to 1.0). Defaults to 0.5. A
2837 higher threshold requires louder audio to activate the model, and
2838 thus might perform better in noisy environments.
2839
2840 - `type: Optional[str]`
2841
2842 Type of turn detection. Only `server_vad` is currently supported.
2843
2844 - `expires_at: Optional[int]`
2845
2846 Expiration timestamp for the session, in seconds since epoch.
2847
2848 - `include: Optional[List[Literal["item.input_audio_transcription.logprobs"]]]`
2849
2850 Additional fields to include in server outputs.
2851
2852 - `item.input_audio_transcription.logprobs`: Include logprobs for input audio transcription.
2853
2854 - `"item.input_audio_transcription.logprobs"`
2855
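The model-specific constraints listed above (no VAD and no `prompt` with `gpt-realtime-whisper`, and `delay` only with that model) can be checked locally before a configuration is sent. The following is a minimal sketch: `check_transcription_audio_input` is a hypothetical helper, not an SDK function, and it encodes only the rules stated in this section.

```python
from typing import Any, Dict, List


def check_transcription_audio_input(audio_input: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems found in an `audio.input` dict.

    Only the constraints documented above are checked; an empty list means
    none of them is violated.
    """
    problems: List[str] = []
    transcription = audio_input.get("transcription") or {}
    model = transcription.get("model")

    if model == "gpt-realtime-whisper":
        # VAD is not supported with this model, so turn detection must be null.
        if audio_input.get("turn_detection") is not None:
            problems.append("turn_detection must be null for gpt-realtime-whisper")
        # Prompting is not supported with this model.
        if transcription.get("prompt"):
            problems.append("prompt is not supported with gpt-realtime-whisper")
    elif transcription.get("delay") is not None:
        # `delay` is documented as supported only with gpt-realtime-whisper.
        problems.append("delay is only supported with gpt-realtime-whisper")

    return problems


whisper_input = {
    "format": {"type": "audio/pcm", "rate": 24000},
    "transcription": {"model": "gpt-realtime-whisper", "delay": "low", "language": "en"},
    "turn_detection": {"type": "server_vad"},  # not allowed with this model
}
print(check_transcription_audio_input(whisper_input))
```

Setting `turn_detection` to `None` in `whisper_input` makes the check pass. The server remains the authority on what is accepted, so a check like this is a convenience, not a guarantee.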
2856### Realtime Transcription Session Turn Detection
2857
2858- `class RealtimeTranscriptionSessionTurnDetection: …`
2859
2860 Configuration for turn detection. Can be set to `null` to turn off. Server
2861 VAD means that the model will detect the start and end of speech based on
2862 audio volume and respond at the end of user speech. For `gpt-realtime-whisper`, this must be `null`; VAD is not supported.
2863
2864 - `prefix_padding_ms: Optional[int]`
2865
2866 Amount of audio to include before the VAD detected speech (in
2867 milliseconds). Defaults to 300ms.
2868
2869 - `silence_duration_ms: Optional[int]`
2870
2871 Duration of silence to detect speech stop (in milliseconds). Defaults
2872 to 500ms. With shorter values the model will respond more quickly,
2873 but may jump in on short pauses from the user.
2874
2875 - `threshold: Optional[float]`
2876
2877 Activation threshold for VAD (0.0 to 1.0). Defaults to 0.5. A
2878 higher threshold requires louder audio to activate the model, and
2879 thus might perform better in noisy environments.
2880
2881 - `type: Optional[str]`
2882
2883 Type of turn detection. Only `server_vad` is currently supported.
2884
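To make the documented defaults concrete, the sketch below merges caller overrides onto them and checks the `threshold` range. `SERVER_VAD_DEFAULTS` and `resolve_turn_detection` are hypothetical names used for illustration; the default values (300 ms, 500 ms, 0.5) come from the field descriptions above, and the assumption that unset fields fall back to those defaults is exactly that, an assumption about server behavior.

```python
from typing import Any, Dict, Optional

# Defaults as documented above for server VAD.
SERVER_VAD_DEFAULTS: Dict[str, Any] = {
    "type": "server_vad",
    "prefix_padding_ms": 300,
    "silence_duration_ms": 500,
    "threshold": 0.5,
}


def resolve_turn_detection(
    overrides: Optional[Dict[str, Any]],
) -> Optional[Dict[str, Any]]:
    """Fill unset fields with the documented defaults.

    `None` means turn detection is turned off and is passed through unchanged.
    """
    if overrides is None:
        return None
    # Ignore fields explicitly set to None so that the default applies.
    provided = {key: value for key, value in overrides.items() if value is not None}
    resolved = {**SERVER_VAD_DEFAULTS, **provided}
    if resolved["type"] != "server_vad":
        raise ValueError("only server_vad is currently supported")
    if not 0.0 <= resolved["threshold"] <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    return resolved


# A noisier environment: require louder audio and tolerate longer pauses.
noisy_room = resolve_turn_detection({"threshold": 0.7, "silence_duration_ms": 800})
print(noisy_room)
```

Raising `threshold` and `silence_duration_ms` together trades responsiveness for fewer false triggers, which matches the guidance in the field descriptions.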
2885### Client Secret Create Response
2886
2887- `class ClientSecretCreateResponse: …`
2888
2889 Response from creating a session and client secret for the Realtime API.
2890
2891 - `expires_at: int`
2892
2893 Expiration timestamp for the client secret, in seconds since epoch.
2894
2895 - `session: Session`
2896
2897 The session configuration for either a realtime or transcription session.
2898
2899 - `class RealtimeSessionCreateResponse: …`
2900
2901 A Realtime session configuration object.
2902
2903 - `id: str`
2904
2905 Unique identifier for the session that looks like `sess_1234567890abcdef`.
2906
2907 - `object: Literal["realtime.session"]`
2908
2909 The object type. Always `realtime.session`.
2910
2911 - `"realtime.session"`
2912
2913 - `type: Literal["realtime"]`
2914
2915 The type of session to create. Always `realtime` for the Realtime API.
2916
2917 - `"realtime"`
2918
2919 - `audio: Optional[Audio]`
2920
2921 Configuration for input and output audio.
2922
2923 - `input: Optional[AudioInput]`
2924
2925 - `format: Optional[RealtimeAudioFormats]`
2926
2927 The format of the input audio.
2928
2929 - `class AudioPCM: …`
2930
2931 The PCM audio format. Only a 24kHz sample rate is supported.
2932
2933 - `rate: Optional[Literal[24000]]`
2934
2935 The sample rate of the audio. Always `24000`.
2936
2937 - `24000`
2938
2939 - `type: Optional[Literal["audio/pcm"]]`
2940
2941 The audio format. Always `audio/pcm`.
2942
2943 - `"audio/pcm"`
2944
2945 - `class AudioPCMU: …`
2946
2947 The G.711 μ-law format.
2948
2949 - `type: Optional[Literal["audio/pcmu"]]`
2950
2951 The audio format. Always `audio/pcmu`.
2952
2953 - `"audio/pcmu"`
2954
2955 - `class AudioPCMA: …`
2956
2957 The G.711 A-law format.
2958
2959 - `type: Optional[Literal["audio/pcma"]]`
2960
2961 The audio format. Always `audio/pcma`.
2962
2963 - `"audio/pcma"`
2964
2965 - `noise_reduction: Optional[AudioInputNoiseReduction]`
2966
2967 Configuration for input audio noise reduction. This can be set to `null` to turn off.
2968 Noise reduction filters audio added to the input audio buffer before it is sent to VAD and the model.
2969 Filtering the audio can improve VAD and turn detection accuracy (reducing false positives) and model performance by improving perception of the input audio.
2970
2971 - `type: Optional[NoiseReductionType]`
2972
2973 Type of noise reduction. `near_field` is for close-talking microphones such as headphones, `far_field` is for far-field microphones such as laptop or conference room microphones.
2974
2975 - `"near_field"`
2976
2977 - `"far_field"`
2978
2979 - `transcription: Optional[AudioTranscription]`
2980
2981 - `delay: Optional[Literal["minimal", "low", "medium", 2 more]]`
2982
2983 Controls how long the model waits before emitting transcription text.
2984 Higher values can improve transcription accuracy at the cost of latency.
2985 Only supported with `gpt-realtime-whisper` in GA Realtime sessions.
2986
2987 - `"minimal"`
2988
2989 - `"low"`
2990
2991 - `"medium"`
2992
2993 - `"high"`
2994
2995 - `"xhigh"`
2996
2997 - `language: Optional[str]`
2998
2999 The language of the input audio. Supplying the input language in
3000 [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) (e.g. `en`) format
3001 will improve accuracy and latency.
3002
3003 - `model: Optional[Union[str, Literal["whisper-1", "gpt-4o-mini-transcribe", "gpt-4o-mini-transcribe-2025-12-15", 3 more], null]]`
3004
3005 The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels.
3006
3007 - `str`
3008
3009 - `Literal["whisper-1", "gpt-4o-mini-transcribe", "gpt-4o-mini-transcribe-2025-12-15", 3 more]`
3010
3011 The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels.
3012
3013 - `"whisper-1"`
3014
3015 - `"gpt-4o-mini-transcribe"`
3016
3017 - `"gpt-4o-mini-transcribe-2025-12-15"`
3018
3019 - `"gpt-4o-transcribe"`
3020
3021 - `"gpt-4o-transcribe-diarize"`
3022
3023 - `"gpt-realtime-whisper"`
3024
3025 - `prompt: Optional[str]`
3026
3027 An optional text to guide the model's style or continue a previous audio
3028 segment.
3029 For `whisper-1`, the [prompt is a list of keywords](https://platform.openai.com/docs/guides/speech-to-text#prompting).
3030 For `gpt-4o-transcribe` models (excluding `gpt-4o-transcribe-diarize`), the prompt is a free text string, for example "expect words related to technology".
3031 Prompt is not supported with `gpt-realtime-whisper` in GA Realtime sessions.
3032
3033 - `turn_detection: Optional[AudioInputTurnDetection]`
3034
3035 Configuration for turn detection, either Server VAD or Semantic VAD. This can be set to `null` to turn off, in which case the client must manually trigger a model response.
3036
3037 Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.
3038
3039 Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.
3040
3041 For `gpt-realtime-whisper` transcription sessions, turn detection must be
3042 set to `null`; VAD is not supported.
3043
3044 - `class AudioInputTurnDetectionServerVad: …`
3045
3046 Server-side voice activity detection (VAD) which flips on when user speech is detected and off after a period of silence.
3047
3048 - `type: Literal["server_vad"]`
3049
3050 Type of turn detection, `server_vad` to turn on simple Server VAD.
3051
3052 - `"server_vad"`
3053
3054 - `create_response: Optional[bool]`
3055
3056 Whether or not to automatically generate a response when a VAD stop event occurs. If `interrupt_response` is set to `false` this may fail to create a response if the model is already responding.
3057
3058 If both `create_response` and `interrupt_response` are set to `false`, the model will never respond automatically but VAD events will still be emitted.
3059
3060 - `idle_timeout_ms: Optional[int]`
3061
3062 Optional timeout after which a model response will be triggered automatically. This is
3063 useful for situations in which a long pause from the user is unexpected, such as a phone
3064 call. The model will effectively prompt the user to continue the conversation based
3065 on the current context.
3066
3067 The timeout value will be applied after the last model response's audio has finished playing,
3068 i.e. it's set to the `response.done` time plus audio playback duration.
3069
3070 An `input_audio_buffer.timeout_triggered` event (plus events
3071 associated with the Response) will be emitted when the timeout is reached.
3072 Idle timeout is currently only supported for `server_vad` mode.
3073
3074 - `interrupt_response: Optional[bool]`
3075
3076 Whether or not to automatically interrupt (cancel) any ongoing response with output to the default
3077 conversation (i.e. `conversation` of `auto`) when a VAD start event occurs. If `true` then the response will be cancelled, otherwise it will continue until complete.
3078
3079 If both `create_response` and `interrupt_response` are set to `false`, the model will never respond automatically but VAD events will still be emitted.
3080
3081 - `prefix_padding_ms: Optional[int]`
3082
3083 Used only for `server_vad` mode. Amount of audio to include before the VAD detected speech (in
3084 milliseconds). Defaults to 300ms.
3085
3086 - `silence_duration_ms: Optional[int]`
3087
3088 Used only for `server_vad` mode. Duration of silence to detect speech stop (in milliseconds). Defaults
3089 to 500ms. With shorter values the model will respond more quickly,
3090 but may jump in on short pauses from the user.
3091
3092 - `threshold: Optional[float]`
3093
3094 Used only for `server_vad` mode. Activation threshold for VAD (0.0 to 1.0). Defaults to 0.5. A
3095 higher threshold requires louder audio to activate the model, and
3096 thus might perform better in noisy environments.
3097
3098 - `class AudioInputTurnDetectionSemanticVad: …`
3099
3100 Server-side semantic turn detection which uses a model to determine when the user has finished speaking.
3101
3102 - `type: Literal["semantic_vad"]`
3103
3104 Type of turn detection, `semantic_vad` to turn on Semantic VAD.
3105
3106 - `"semantic_vad"`
3107
3108 - `create_response: Optional[bool]`
3109
3110 Whether or not to automatically generate a response when a VAD stop event occurs.
3111
3112 - `eagerness: Optional[Literal["low", "medium", "high", "auto"]]`
3113
3114 Used only for `semantic_vad` mode. The eagerness of the model to respond. `low` will wait longer for the user to continue speaking, `high` will respond more quickly. `auto` is the default and is equivalent to `medium`. `low`, `medium`, and `high` have max timeouts of 8s, 4s, and 2s respectively.
3115
3116 - `"low"`
3117
3118 - `"medium"`
3119
3120 - `"high"`
3121
3122 - `"auto"`
3123
3124 - `interrupt_response: Optional[bool]`
3125
3126 Whether or not to automatically interrupt any ongoing response with output to the default
3127 conversation (i.e. `conversation` of `auto`) when a VAD start event occurs.
3128
3129 - `output: Optional[AudioOutput]`
3130
3131 - `format: Optional[RealtimeAudioFormats]`
3132
3133 The format of the output audio.
3134
3135 - `speed: Optional[float]`
3136
3137 The speed of the model's spoken response as a multiple of the original speed.
3138 1.0 is the default speed. 0.25 is the minimum speed. 1.5 is the maximum speed. This value can only be changed in between model turns, not while a response is in progress.
3139
3140 This parameter is a post-processing adjustment applied to the audio after it is generated. It is
3141 also possible to prompt the model to speak faster or slower.
3142
3143 - `voice: Optional[Union[str, Literal["alloy", "ash", "ballad", 7 more], null]]`
3144
3145 The voice the model uses to respond. Voice cannot be changed during the
3146 session once the model has responded with audio at least once. Current
3147 voice options are `alloy`, `ash`, `ballad`, `coral`, `echo`, `sage`,
3148 `shimmer`, `verse`, `marin`, and `cedar`. We recommend `marin` and `cedar` for
3149 best quality.
3150
3151 - `str`
3152
3153 - `Literal["alloy", "ash", "ballad", 7 more]`
3154
3155 The voice the model uses to respond. Voice cannot be changed during the
3156 session once the model has responded with audio at least once. Current
3157 voice options are `alloy`, `ash`, `ballad`, `coral`, `echo`, `sage`,
3158 `shimmer`, `verse`, `marin`, and `cedar`. We recommend `marin` and `cedar` for
3159 best quality.
3160
3161 - `"alloy"`
3162
3163 - `"ash"`
3164
3165 - `"ballad"`
3166
3167 - `"coral"`
3168
3169 - `"echo"`
3170
3171 - `"sage"`
3172
3173 - `"shimmer"`
3174
3175 - `"verse"`
3176
3177 - `"marin"`
3178
3179 - `"cedar"`
3180
3181 - `expires_at: Optional[int]`
3182
3183 Expiration timestamp for the session, in seconds since epoch.
3184
3185 - `include: Optional[List[Literal["item.input_audio_transcription.logprobs"]]]`
3186
3187 Additional fields to include in server outputs.
3188
3189 `item.input_audio_transcription.logprobs`: Include logprobs for input audio transcription.
3190
3191 - `"item.input_audio_transcription.logprobs"`
3192
3193 - `instructions: Optional[str]`
3194
3195 The default system instructions (i.e. system message) prepended to model calls. This field allows the client to guide the model on desired responses. The model can be instructed on response content and format (e.g. "be extremely succinct", "act friendly", "here are examples of good responses") and on audio behavior (e.g. "talk quickly", "inject emotion into your voice", "laugh frequently"). The instructions are not guaranteed to be followed by the model, but they provide guidance to the model on the desired behavior.
3196
3197 Note that the server sets default instructions which will be used if this field is not set and are visible in the `session.created` event at the start of the session.
3198
3199 - `max_output_tokens: Optional[Union[int, Literal["inf"], null]]`
3200
3201 Maximum number of output tokens for a single assistant response,
3202 inclusive of tool calls. Provide an integer between 1 and 4096 to
3203 limit output tokens, or `inf` for the maximum available tokens for a
3204 given model. Defaults to `inf`.
3205
3206 - `int`
3207
3208 - `Literal["inf"]`
3209
3210 - `"inf"`
3211
3212 - `model: Optional[Union[str, Literal["gpt-realtime", "gpt-realtime-1.5", "gpt-realtime-2", 14 more], null]]`
3213
3214 The Realtime model used for this session.
3215
3216 - `str`
3217
3218 - `Literal["gpt-realtime", "gpt-realtime-1.5", "gpt-realtime-2", 14 more]`
3219
3220 The Realtime model used for this session.
3221
3222 - `"gpt-realtime"`
3223
3224 - `"gpt-realtime-1.5"`
3225
3226 - `"gpt-realtime-2"`
3227
3228 - `"gpt-realtime-2025-08-28"`
3229
3230 - `"gpt-4o-realtime-preview"`
3231
3232 - `"gpt-4o-realtime-preview-2024-10-01"`
3233
3234 - `"gpt-4o-realtime-preview-2024-12-17"`
3235
3236 - `"gpt-4o-realtime-preview-2025-06-03"`
3237
3238 - `"gpt-4o-mini-realtime-preview"`
3239
3240 - `"gpt-4o-mini-realtime-preview-2024-12-17"`
3241
3242 - `"gpt-realtime-mini"`
3243
3244 - `"gpt-realtime-mini-2025-10-06"`
3245
3246 - `"gpt-realtime-mini-2025-12-15"`
3247
3248 - `"gpt-audio-1.5"`
3249
3250 - `"gpt-audio-mini"`
3251
3252 - `"gpt-audio-mini-2025-10-06"`
3253
3254 - `"gpt-audio-mini-2025-12-15"`
3255
3256 - `output_modalities: Optional[List[Literal["text", "audio"]]]`
3257
3258 The set of modalities the model can respond with. It defaults to `["audio"]`, indicating
3259 that the model will respond with audio plus a transcript. `["text"]` can be used to make
3260 the model respond with text only. It is not possible to request both `text` and `audio` at the same time.
3261
3262 - `"text"`
3263
3264 - `"audio"`
3265
3266 - `prompt: Optional[ResponsePrompt]`
3267
3268 Reference to a prompt template and its variables.
3269 [Learn more](https://platform.openai.com/docs/guides/text?api-mode=responses#reusable-prompts).
3270
3271 - `id: str`
3272
3273 The unique identifier of the prompt template to use.
3274
3275 - `variables: Optional[Dict[str, Variables]]`
3276
3277 Optional map of values to substitute in for variables in your
3278 prompt. The substitution values can either be strings, or other
3279 Response input types like images or files.
3280
3281 - `str`
3282
3283 - `class ResponseInputText: …`
3284
3285 A text input to the model.
3286
3287 - `text: str`
3288
3289 The text input to the model.
3290
3291 - `type: Literal["input_text"]`
3292
3293 The type of the input item. Always `input_text`.
3294
3295 - `"input_text"`
3296
3297 - `class ResponseInputImage: …`
3298
3299 An image input to the model. Learn about [image inputs](https://platform.openai.com/docs/guides/vision).
3300
3301 - `detail: Literal["low", "high", "auto", "original"]`
3302
3303 The detail level of the image to be sent to the model. One of `high`, `low`, `auto`, or `original`. Defaults to `auto`.
3304
3305 - `"low"`
3306
3307 - `"high"`
3308
3309 - `"auto"`
3310
3311 - `"original"`
3312
3313 - `type: Literal["input_image"]`
3314
3315 The type of the input item. Always `input_image`.
3316
3317 - `"input_image"`
3318
3319 - `file_id: Optional[str]`
3320
3321 The ID of the file to be sent to the model.
3322
3323 - `image_url: Optional[str]`
3324
3325 The URL of the image to be sent to the model. A fully qualified URL or base64 encoded image in a data URL.
3326
3327 - `class ResponseInputFile: …`
3328
3329 A file input to the model.
3330
3331 - `type: Literal["input_file"]`
3332
3333 The type of the input item. Always `input_file`.
3334
3335 - `"input_file"`
3336
3337 - `detail: Optional[Literal["low", "high"]]`
3338
3339 The detail level of the file to be sent to the model. Use `low` for the default rendering behavior, or `high` to render the file at higher quality. Defaults to `low`.
3340
3341 - `"low"`
3342
3343 - `"high"`
3344
3345 - `file_data: Optional[str]`
3346
3347 The content of the file to be sent to the model.
3348
3349 - `file_id: Optional[str]`
3350
3351 The ID of the file to be sent to the model.
3352
3353 - `file_url: Optional[str]`
3354
3355 The URL of the file to be sent to the model.
3356
3357 - `filename: Optional[str]`
3358
3359 The name of the file to be sent to the model.
3360
3361 - `version: Optional[str]`
3362
3363 Optional version of the prompt template.
3364
3365 - `reasoning: Optional[RealtimeReasoning]`
3366
3367 Configuration for reasoning-capable Realtime models such as `gpt-realtime-2`.
3368
3369 - `effort: Optional[RealtimeReasoningEffort]`
3370
3371 Constrains effort on reasoning for reasoning-capable Realtime models such as
3372 `gpt-realtime-2`.
3373
3374 - `"minimal"`
3375
3376 - `"low"`
3377
3378 - `"medium"`
3379
3380 - `"high"`
3381
3382 - `"xhigh"`
3383
3384 - `tool_choice: Optional[ToolChoice]`
3385
3386 How the model chooses tools. Provide one of the string modes or force a specific
3387 function/MCP tool.
3388
3389 - `Literal["none", "auto", "required"]`
3390
3391 - `"none"`
3392
3393 - `"auto"`
3394
3395 - `"required"`
3396
3397 - `class ToolChoiceFunction: …`
3398
3399 Use this option to force the model to call a specific function.
3400
3401 - `name: str`
3402
3403 The name of the function to call.
3404
3405 - `type: Literal["function"]`
3406
3407 For function calling, the type is always `function`.
3408
3409 - `"function"`
3410
3411 - `class ToolChoiceMcp: …`
3412
3413 Use this option to force the model to call a specific tool on a remote MCP server.
3414
3415 - `server_label: str`
3416
3417 The label of the MCP server to use.
3418
3419 - `type: Literal["mcp"]`
3420
3421 For MCP tools, the type is always `mcp`.
3422
3423 - `"mcp"`
3424
3425 - `name: Optional[str]`
3426
3427 The name of the tool to call on the server.
3428
3429 - `tools: Optional[List[Tool]]`
3430
3431 Tools available to the model.
3432
3433 - `class RealtimeFunctionTool: …`
3434
3435 - `description: Optional[str]`
3436
3437 The description of the function, including guidance on when and how
3438 to call it, and guidance about what to tell the user when calling
3439 (if anything).
3440
3441 - `name: Optional[str]`
3442
3443 The name of the function.
3444
3445 - `parameters: Optional[object]`
3446
3447 Parameters of the function in JSON Schema.
3448
3449 - `type: Optional[Literal["function"]]`
3450
3451 The type of the tool, i.e. `function`.
3452
3453 - `"function"`
3454
3455 - `class ToolMcpTool: …`
3456
3457 Give the model access to additional tools via remote Model Context Protocol
3458 (MCP) servers. [Learn more about MCP](https://platform.openai.com/docs/guides/tools-remote-mcp).
3459
3460 - `server_label: str`
3461
3462 A label for this MCP server, used to identify it in tool calls.
3463
3464 - `type: Literal["mcp"]`
3465
3466 The type of the MCP tool. Always `mcp`.
3467
3468 - `"mcp"`
3469
3470 - `allowed_tools: Optional[ToolMcpToolAllowedTools]`
3471
3472 List of allowed tool names or a filter object.
3473
3474 - `List[str]`
3475
3476 A string array of allowed tool names
3477
3478 - `class ToolMcpToolAllowedToolsMcpToolFilter: …`
3479
3480 A filter object to specify which tools are allowed.
3481
3482 - `read_only: Optional[bool]`
3483
3484 Indicates whether or not a tool modifies data or is read-only. If an
3485 MCP server is [annotated with `readOnlyHint`](https://modelcontextprotocol.io/specification/2025-06-18/schema#toolannotations-readonlyhint),
3486 it will match this filter.
3487
3488 - `tool_names: Optional[List[str]]`
3489
3490 List of allowed tool names.
3491
3492 - `authorization: Optional[str]`
3493
3494 An OAuth access token that can be used with a remote MCP server, either
3495 with a custom MCP server URL or a service connector. Your application
3496 must handle the OAuth authorization flow and provide the token here.
3497
3498 - `connector_id: Optional[Literal["connector_dropbox", "connector_gmail", "connector_googlecalendar", 5 more]]`
3499
3500 Identifier for service connectors, like those available in ChatGPT. One of
3501 `server_url` or `connector_id` must be provided. Learn more about service
3502 connectors [here](https://platform.openai.com/docs/guides/tools-remote-mcp#connectors).
3503
3504 Currently supported `connector_id` values are:
3505
3506 - Dropbox: `connector_dropbox`
3507 - Gmail: `connector_gmail`
3508 - Google Calendar: `connector_googlecalendar`
3509 - Google Drive: `connector_googledrive`
3510 - Microsoft Teams: `connector_microsoftteams`
3511 - Outlook Calendar: `connector_outlookcalendar`
3512 - Outlook Email: `connector_outlookemail`
3513 - SharePoint: `connector_sharepoint`
3514
3515 - `"connector_dropbox"`
3516
3517 - `"connector_gmail"`
3518
3519 - `"connector_googlecalendar"`
3520
3521 - `"connector_googledrive"`
3522
3523 - `"connector_microsoftteams"`
3524
3525 - `"connector_outlookcalendar"`
3526
3527 - `"connector_outlookemail"`
3528
3529 - `"connector_sharepoint"`
3530
3531 - `defer_loading: Optional[bool]`
3532
3533 Whether this MCP tool is deferred and discovered via tool search.
3534
3535 - `headers: Optional[Dict[str, str]]`
3536
3537 Optional HTTP headers to send to the MCP server. Use for authentication
3538 or other purposes.
3539
3540 - `require_approval: Optional[ToolMcpToolRequireApproval]`
3541
3542 Specify which of the MCP server's tools require approval.
3543
3544 - `class ToolMcpToolRequireApprovalMcpToolApprovalFilter: …`
3545
3546 Specify which of the MCP server's tools require approval. Can be
3547 `always`, `never`, or a filter object associated with tools
3548 that require approval.
3549
3550 - `always: Optional[ToolMcpToolRequireApprovalMcpToolApprovalFilterAlways]`
3551
3552 A filter object to specify which tools are allowed.
3553
3554 - `read_only: Optional[bool]`
3555
3556 Indicates whether or not a tool modifies data or is read-only. If an
3557 MCP server is [annotated with `readOnlyHint`](https://modelcontextprotocol.io/specification/2025-06-18/schema#toolannotations-readonlyhint),
3558 it will match this filter.
3559
3560 - `tool_names: Optional[List[str]]`
3561
3562 List of allowed tool names.
3563
3564 - `never: Optional[ToolMcpToolRequireApprovalMcpToolApprovalFilterNever]`
3565
3566 A filter object to specify which tools are allowed.
3567
3568 - `read_only: Optional[bool]`
3569
3570 Indicates whether or not a tool modifies data or is read-only. If an
3571 MCP server is [annotated with `readOnlyHint`](https://modelcontextprotocol.io/specification/2025-06-18/schema#toolannotations-readonlyhint),
3572 it will match this filter.
3573
3574 - `tool_names: Optional[List[str]]`
3575
3576 List of allowed tool names.
3577
3578 - `Literal["always", "never"]`
3579
3580 Specify a single approval policy for all tools. One of `always` or
3581 `never`. When set to `always`, all tools will require approval. When
3582 set to `never`, no tools will require approval.
3583
3584 - `"always"`
3585
3586 - `"never"`
3587
3588 - `server_description: Optional[str]`
3589
3590 Optional description of the MCP server, used to provide more context.
3591
3592 - `server_url: Optional[str]`
3593
3594 The URL for the MCP server. One of `server_url` or `connector_id` must be
3595 provided.
3596
3597 - `tracing: Optional[Tracing]`
3598
3599 Realtime API can write session traces to the [Traces Dashboard](https://platform.openai.com/logs?api=traces). Set to null to disable tracing. Once
3600 tracing is enabled for a session, the configuration cannot be modified.
3601
3602 `auto` will create a trace for the session with default values for the
3603 workflow name, group id, and metadata.
3604
3605 - `Literal["auto"]`
3606
3607 Enables tracing and sets default values for tracing configuration options. Always `auto`.
3608
3609 - `"auto"`
3610
3611 - `class TracingTracingConfiguration: …`
3612
3613 Granular configuration for tracing.
3614
3615 - `group_id: Optional[str]`
3616
3617 The group id to attach to this trace to enable filtering and
3618 grouping in the Traces Dashboard.
3619
3620 - `metadata: Optional[object]`
3621
3622 The arbitrary metadata to attach to this trace to enable
3623 filtering in the Traces Dashboard.
3624
3625 - `workflow_name: Optional[str]`
3626
3627 The name of the workflow to attach to this trace. This is used to
3628 name the trace in the Traces Dashboard.
3629
3630 - `truncation: Optional[RealtimeTruncation]`
3631
3632 When the number of tokens in a conversation exceeds the model's input token limit, the conversation will be truncated, meaning messages (starting from the oldest) will not be included in the model's context. A 32k context model with 4,096 max output tokens can only include 28,224 tokens in the context before truncation occurs.
3633
3634 Clients can configure truncation behavior to truncate with a lower max token limit, which is an effective way to control token usage and cost.
3635
3636 Truncation will reduce the number of cached tokens on the next turn (busting the cache), since messages are dropped from the beginning of the context. However, clients can also configure truncation to retain messages up to a fraction of the maximum context size, which will reduce the need for future truncations and thus improve the cache rate.
3637
3638 Truncation can be disabled entirely, which means the server will never truncate but would instead return an error if the conversation exceeds the model's input token limit.
3639
3640 - `Literal["auto", "disabled"]`
3641
3642 The truncation strategy to use for the session. `auto` is the default truncation strategy. `disabled` will disable truncation and emit errors when the conversation exceeds the input token limit.
3643
3644 - `"auto"`
3645
3646 - `"disabled"`
3647
3648 - `class RealtimeTruncationRetentionRatio: …`
3649
3650 Retain a fraction of the conversation tokens when the conversation exceeds the input token limit. This allows you to amortize truncations across multiple turns, which can help improve cached token usage.
3651
3652 - `retention_ratio: float`
3653
3654 Fraction of post-instruction conversation tokens to retain (`0.0` - `1.0`) when the conversation exceeds the input token limit. Setting this to `0.8` means that messages will be dropped until 80% of the maximum allowed tokens are used. This helps reduce the frequency of truncations and improve cache rates.
3655
3656 - `type: Literal["retention_ratio"]`
3657
3658 Use retention ratio truncation.
3659
3660 - `"retention_ratio"`
3661
3662 - `token_limits: Optional[TokenLimits]`
3663
3664 Optional custom token limits for this truncation strategy. If not provided, the model's default token limits will be used.
3665
3666 - `post_instructions: Optional[int]`
3667
3668 Maximum tokens allowed in the conversation after instructions (which include tool definitions). For example, setting this to 5,000 would mean that truncation would occur when the conversation exceeds 5,000 tokens after instructions. This cannot be higher than the model's context window size minus the maximum output tokens.
3669
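The retention-ratio behavior described above can be illustrated with a simplified model. This is a sketch under stated assumptions, not the server's actual algorithm: it treats each message as an opaque token count, ignores instruction and tool tokens, and the `truncate_oldest` helper is hypothetical. Only the configuration field names (`type`, `retention_ratio`, `token_limits`, `post_instructions`) come from this reference.

```python
from typing import List

# Configuration shape using the field names documented above (values are examples).
truncation = {
    "type": "retention_ratio",
    "retention_ratio": 0.8,
    "token_limits": {"post_instructions": 5000},
}


def truncate_oldest(
    message_tokens: List[int], token_limit: int, retention_ratio: float = 1.0
) -> List[int]:
    """Hypothetical model of truncation: once the total exceeds token_limit,
    drop the oldest messages until the total is at most
    retention_ratio * token_limit."""
    if not 0.0 <= retention_ratio <= 1.0:
        raise ValueError("retention_ratio must be between 0.0 and 1.0")
    total = sum(message_tokens)
    if total <= token_limit:
        return list(message_tokens)  # under the limit: nothing to drop
    target = retention_ratio * token_limit
    kept = list(message_tokens)
    while kept and total > target:
        total -= kept.pop(0)  # oldest message is at the front
    return kept
```

In this model, a ratio of `0.8` drops more messages per truncation than `1.0`, which leaves headroom so that the next few turns do not each trigger a new truncation.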
3670 - `class RealtimeTranscriptionSessionCreateResponse: …`
3671
3672 A Realtime transcription session configuration object.
3673
3674 - `id: str`
3675
3676 Unique identifier for the session that looks like `sess_1234567890abcdef`.
3677
3678 - `object: str`
3679
3680 The object type. Always `realtime.transcription_session`.
3681
3682 - `type: Literal["transcription"]`
3683
3684 The type of session. Always `transcription` for transcription sessions.
3685
3686 - `"transcription"`
3687
3688 - `audio: Optional[Audio]`
3689
3690 Configuration for input audio for the session.
3691
3692 - `input: Optional[AudioInput]`
3693
3694 - `format: Optional[RealtimeAudioFormats]`
3695
3696 The PCM audio format. Only a 24kHz sample rate is supported.
3697
3698 - `noise_reduction: Optional[AudioInputNoiseReduction]`
3699
3700 Configuration for input audio noise reduction.
3701
3702 - `type: Optional[NoiseReductionType]`
3703
3704 Type of noise reduction. `near_field` is for close-talking microphones such as headphones; `far_field` is for far-field microphones such as laptop or conference room microphones.
3705
3706 - `transcription: Optional[AudioTranscription]`
3707
3708 - `turn_detection: Optional[RealtimeTranscriptionSessionTurnDetection]`
3709
3710 Configuration for turn detection. Can be set to `null` to turn off. Server
3711 VAD means that the model will detect the start and end of speech based on
3712 audio volume and respond at the end of user speech. For `gpt-realtime-whisper`, this must be `null`; VAD is not supported.
3713
3714 - `prefix_padding_ms: Optional[int]`
3715
3716 Amount of audio to include before the VAD detected speech (in
3717 milliseconds). Defaults to 300ms.
3718
3719 - `silence_duration_ms: Optional[int]`
3720
3721 Duration of silence to detect speech stop (in milliseconds). Defaults
3722 to 500ms. With shorter values the model will respond more quickly,
3723 but may jump in on short pauses from the user.
3724
3725 - `threshold: Optional[float]`
3726
3727 Activation threshold for VAD (0.0 to 1.0). Defaults to 0.5. A
3728 higher threshold will require louder audio to activate the model, and
3729 thus might perform better in noisy environments.
3730
3731 - `type: Optional[str]`
3732
3733 Type of turn detection. Only `server_vad` is currently supported.
3734
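To make the documented defaults concrete, the turn detection fields can be collected in a small container. This is a minimal sketch: `ServerVadSettings` is a hypothetical helper, not an SDK class, and the non-negativity check on the millisecond fields is an assumption. The defaults (300 ms, 500 ms, 0.5) and the 0.0 to 1.0 threshold range are taken from the field descriptions above.

```python
from dataclasses import asdict, dataclass


@dataclass
class ServerVadSettings:
    """Hypothetical container mirroring the turn_detection fields."""

    threshold: float = 0.5          # documented default
    prefix_padding_ms: int = 300    # documented default
    silence_duration_ms: int = 500  # documented default
    type: str = "server_vad"        # only supported value per the reference

    def __post_init__(self) -> None:
        if not 0.0 <= self.threshold <= 1.0:
            raise ValueError("threshold must be between 0.0 and 1.0")
        # Assumption: negative durations are never meaningful.
        if self.prefix_padding_ms < 0 or self.silence_duration_ms < 0:
            raise ValueError("durations must be non-negative")


# A noisier environment might call for a higher threshold.
noisy_room = asdict(ServerVadSettings(threshold=0.7))
```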
3735 - `expires_at: Optional[int]`
3736
3737 Expiration timestamp for the session, in seconds since epoch.
3738
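Since `expires_at` is expressed in seconds since the epoch, it can be converted to a timezone-aware datetime with the standard library. The helper names below are hypothetical; this is a sketch rather than SDK functionality.

```python
from datetime import datetime, timezone
from typing import Optional


def expires_at_to_datetime(expires_at: int) -> datetime:
    """Convert an epoch-seconds timestamp to an aware UTC datetime."""
    return datetime.fromtimestamp(expires_at, tz=timezone.utc)


def is_expired(expires_at: int, now: Optional[datetime] = None) -> bool:
    """Return True if the timestamp lies at or before `now` (defaults to the current time)."""
    current = now if now is not None else datetime.now(timezone.utc)
    return expires_at_to_datetime(expires_at) <= current
```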
3739 - `include: Optional[List[Literal["item.input_audio_transcription.logprobs"]]]`
3740
3741 Additional fields to include in server outputs.
3742
3743 - `item.input_audio_transcription.logprobs`: Include logprobs for input audio transcription.
3744
3745 - `"item.input_audio_transcription.logprobs"`
3746
3747 - `value: str`
3748
3749 The generated client secret value.