diff --git a/en/resources/realtime/index.md b/en/resources/realtime/index.md index 1c1ff6c..d811e41 100644 --- a/en/resources/realtime/index.md +++ b/en/resources/realtime/index.md @@ -4,7 +4,23 @@ ### Audio Transcription -- `AudioTranscription object { language, model, prompt }` +- `AudioTranscription object { delay, language, model, prompt }` + + - `delay: optional "minimal" or "low" or "medium" or 2 more` + + Controls how long the model waits before emitting transcription text. + Higher values can improve transcription accuracy at the cost of latency. + Only supported with `gpt-realtime-whisper` in GA Realtime sessions. + + - `"minimal"` + + - `"low"` + + - `"medium"` + + - `"high"` + + - `"xhigh"` - `language: optional string` @@ -12,15 +28,15 @@ [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) (e.g. `en`) format will improve accuracy and latency. - - `model: optional string or "whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 2 more` + - `model: optional string or "whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 3 more` - The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. + The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. - `string` - - `"whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 2 more` + - `"whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 3 more` - The model to use for transcription. 
Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. + The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. - `"whisper-1"` @@ -32,12 +48,15 @@ - `"gpt-4o-transcribe-diarize"` + - `"gpt-realtime-whisper"` + - `prompt: optional string` An optional text to guide the model's style or continue a previous audio segment. For `whisper-1`, the [prompt is a list of keywords](/docs/guides/speech-to-text#prompting). For `gpt-4o-transcribe` models (excluding `gpt-4o-transcribe-diarize`), the prompt is a free text string, for example "expect words related to technology". + Prompt is not supported with `gpt-realtime-whisper` in GA Realtime sessions. ### Conversation Created Event @@ -3262,21 +3281,37 @@ Configuration for input audio transcription, defaults to off and can be set to `null` to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through [the /audio/transcriptions endpoint](/docs/api-reference/audio/createTranscription) and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription, these offer additional guidance to the transcription service. + - `delay: optional "minimal" or "low" or "medium" or 2 more` + + Controls how long the model waits before emitting transcription text. + Higher values can improve transcription accuracy at the cost of latency. + Only supported with `gpt-realtime-whisper` in GA Realtime sessions. 
+ + - `"minimal"` + + - `"low"` + + - `"medium"` + + - `"high"` + + - `"xhigh"` + - `language: optional string` The language of the input audio. Supplying the input language in [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) (e.g. `en`) format will improve accuracy and latency. - - `model: optional string or "whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 2 more` + - `model: optional string or "whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 3 more` - The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. + The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. - `string` - - `"whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 2 more` + - `"whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 3 more` - The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. + The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. 
- `"whisper-1"` @@ -3288,12 +3323,15 @@ - `"gpt-4o-transcribe-diarize"` + - `"gpt-realtime-whisper"` + - `prompt: optional string` An optional text to guide the model's style or continue a previous audio segment. For `whisper-1`, the [prompt is a list of keywords](/docs/guides/speech-to-text#prompting). For `gpt-4o-transcribe` models (excluding `gpt-4o-transcribe-diarize`), the prompt is a free text string, for example "expect words related to technology". + Prompt is not supported with `gpt-realtime-whisper` in GA Realtime sessions. - `turn_detection: optional RealtimeAudioInputTurnDetection` @@ -3303,6 +3341,9 @@ Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency. + For `gpt-realtime-whisper` transcription sessions, turn detection must be + set to `null`; VAD is not supported. + - `ServerVad object { type, create_response, idle_timeout_ms, 4 more }` Server-side voice activity detection (VAD) which flips on when user speech is detected and off after a period of silence. @@ -3505,21 +3546,37 @@ Configuration for input audio transcription, defaults to off and can be set to `null` to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through [the /audio/transcriptions endpoint](/docs/api-reference/audio/createTranscription) and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription, these offer additional guidance to the transcription service. 
+ - `delay: optional "minimal" or "low" or "medium" or 2 more` + + Controls how long the model waits before emitting transcription text. + Higher values can improve transcription accuracy at the cost of latency. + Only supported with `gpt-realtime-whisper` in GA Realtime sessions. + + - `"minimal"` + + - `"low"` + + - `"medium"` + + - `"high"` + + - `"xhigh"` + - `language: optional string` The language of the input audio. Supplying the input language in [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) (e.g. `en`) format will improve accuracy and latency. - - `model: optional string or "whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 2 more` + - `model: optional string or "whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 3 more` - The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. + The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. - `string` - - `"whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 2 more` + - `"whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 3 more` - The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. + The model to use for transcription. 
Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. - `"whisper-1"` @@ -3531,12 +3588,15 @@ - `"gpt-4o-transcribe-diarize"` + - `"gpt-realtime-whisper"` + - `prompt: optional string` An optional text to guide the model's style or continue a previous audio segment. For `whisper-1`, the [prompt is a list of keywords](/docs/guides/speech-to-text#prompting). For `gpt-4o-transcribe` models (excluding `gpt-4o-transcribe-diarize`), the prompt is a free text string, for example "expect words related to technology". + Prompt is not supported with `gpt-realtime-whisper` in GA Realtime sessions. - `turn_detection: optional RealtimeAudioInputTurnDetection` @@ -3546,6 +3606,9 @@ Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency. + For `gpt-realtime-whisper` transcription sessions, turn detection must be + set to `null`; VAD is not supported. + - `ServerVad object { type, create_response, idle_timeout_ms, 4 more }` Server-side voice activity detection (VAD) which flips on when user speech is detected and off after a period of silence. @@ -3776,6 +3839,9 @@ Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. 
For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency. + For `gpt-realtime-whisper` transcription sessions, turn detection must be + set to `null`; VAD is not supported. + - `ServerVad object { type, create_response, idle_timeout_ms, 4 more }` Server-side voice activity detection (VAD) which flips on when user speech is detected and off after a period of silence. @@ -4737,6 +4803,11 @@ - `"audio"` + - `parallel_tool_calls: optional boolean` + + Whether the model may call multiple tools in parallel. Only supported by + reasoning Realtime models such as `gpt-realtime-2`. + - `prompt: optional ResponsePrompt` Reference to a prompt template and its variables. @@ -4836,6 +4907,25 @@ Optional version of the prompt template. + - `reasoning: optional RealtimeReasoning` + + Configuration for reasoning-capable Realtime models such as `gpt-realtime-2`. + + - `effort: optional RealtimeReasoningEffort` + + Constrains effort on reasoning for reasoning-capable Realtime models such as + `gpt-realtime-2`. + + - `"minimal"` + + - `"low"` + + - `"medium"` + + - `"high"` + + - `"xhigh"` + - `tool_choice: optional ToolChoiceOptions or ToolChoiceFunction or ToolChoiceMcp` How the model chooses tools. Provide one of the string modes or force a specific @@ -5075,7 +5165,7 @@ Update the Realtime session. Choose either a realtime session or a transcription session. - - `RealtimeSessionCreateRequest object { type, audio, include, 9 more }` + - `RealtimeSessionCreateRequest object { type, audio, include, 11 more }` Realtime session object configuration. @@ -5113,21 +5203,37 @@ Configuration for input audio transcription, defaults to off and can be set to `null` to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. 
Transcription runs asynchronously through [the /audio/transcriptions endpoint](/docs/api-reference/audio/createTranscription) and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription, these offer additional guidance to the transcription service. + - `delay: optional "minimal" or "low" or "medium" or 2 more` + + Controls how long the model waits before emitting transcription text. + Higher values can improve transcription accuracy at the cost of latency. + Only supported with `gpt-realtime-whisper` in GA Realtime sessions. + + - `"minimal"` + + - `"low"` + + - `"medium"` + + - `"high"` + + - `"xhigh"` + - `language: optional string` The language of the input audio. Supplying the input language in [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) (e.g. `en`) format will improve accuracy and latency. - - `model: optional string or "whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 2 more` + - `model: optional string or "whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 3 more` - The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. + The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. - `string` - - `"whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 2 more` + - `"whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 3 more` - The model to use for transcription. 
Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. + The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. - `"whisper-1"` @@ -5139,12 +5245,15 @@ - `"gpt-4o-transcribe-diarize"` + - `"gpt-realtime-whisper"` + - `prompt: optional string` An optional text to guide the model's style or continue a previous audio segment. For `whisper-1`, the [prompt is a list of keywords](/docs/guides/speech-to-text#prompting). For `gpt-4o-transcribe` models (excluding `gpt-4o-transcribe-diarize`), the prompt is a free text string, for example "expect words related to technology". + Prompt is not supported with `gpt-realtime-whisper` in GA Realtime sessions. - `turn_detection: optional RealtimeAudioInputTurnDetection` @@ -5154,6 +5263,9 @@ Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency. + For `gpt-realtime-whisper` transcription sessions, turn detection must be + set to `null`; VAD is not supported. + - `ServerVad object { type, create_response, idle_timeout_ms, 4 more }` Server-side voice activity detection (VAD) which flips on when user speech is detected and off after a period of silence. 
@@ -5321,13 +5433,13 @@ - `"inf"` - - `model: optional string or "gpt-realtime" or "gpt-realtime-1.5" or "gpt-realtime-2025-08-28" or 13 more` + - `model: optional string or "gpt-realtime" or "gpt-realtime-1.5" or "gpt-realtime-2" or 14 more` The Realtime model used for this session. - `string` - - `"gpt-realtime" or "gpt-realtime-1.5" or "gpt-realtime-2025-08-28" or 13 more` + - `"gpt-realtime" or "gpt-realtime-1.5" or "gpt-realtime-2" or 14 more` The Realtime model used for this session. @@ -5335,6 +5447,8 @@ - `"gpt-realtime-1.5"` + - `"gpt-realtime-2"` + - `"gpt-realtime-2025-08-28"` - `"gpt-4o-realtime-preview"` @@ -5373,11 +5487,20 @@ - `"audio"` + - `parallel_tool_calls: optional boolean` + + Whether the model may call multiple tools in parallel. Only supported by + reasoning Realtime models such as `gpt-realtime-2`. + - `prompt: optional ResponsePrompt` Reference to a prompt template and its variables. [Learn more](/docs/guides/text?api-mode=responses#reusable-prompts). + - `reasoning: optional RealtimeReasoning` + + Configuration for reasoning-capable Realtime models such as `gpt-realtime-2`. + - `tool_choice: optional RealtimeToolChoiceConfig` How the model chooses tools. Provide one of the string modes or force a specific @@ -5665,6 +5788,9 @@ Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency. + For `gpt-realtime-whisper` transcription sessions, turn detection must be + set to `null`; VAD is not supported. 
+ - `ServerVad object { type, create_response, idle_timeout_ms, 4 more }` Server-side voice activity detection (VAD) which flips on when user speech is detected and off after a period of silence. @@ -6334,6 +6460,44 @@ - `"http_error"` +### Realtime Reasoning + +- `RealtimeReasoning object { effort }` + + Configuration for reasoning-capable Realtime models such as `gpt-realtime-2`. + + - `effort: optional RealtimeReasoningEffort` + + Constrains effort on reasoning for reasoning-capable Realtime models such as + `gpt-realtime-2`. + + - `"minimal"` + + - `"low"` + + - `"medium"` + + - `"high"` + + - `"xhigh"` + +### Realtime Reasoning Effort + +- `RealtimeReasoningEffort = "minimal" or "low" or "medium" or 2 more` + + Constrains effort on reasoning for reasoning-capable Realtime models such as + `gpt-realtime-2`. + + - `"minimal"` + + - `"low"` + + - `"medium"` + + - `"high"` + + - `"xhigh"` + ### Realtime Response - `RealtimeResponse object { id, audio, conversation_id, 8 more }` @@ -7118,7 +7282,7 @@ ### Realtime Response Create Params -- `RealtimeResponseCreateParams object { audio, conversation, input, 7 more }` +- `RealtimeResponseCreateParams object { audio, conversation, input, 9 more }` Create a new Realtime response with these parameters @@ -7698,6 +7862,11 @@ - `"audio"` + - `parallel_tool_calls: optional boolean` + + Whether the model may call multiple tools in parallel. Only supported by + reasoning Realtime models such as `gpt-realtime-2`. + - `prompt: optional ResponsePrompt` Reference to a prompt template and its variables. @@ -7797,6 +7966,25 @@ Optional version of the prompt template. + - `reasoning: optional RealtimeReasoning` + + Configuration for reasoning-capable Realtime models such as `gpt-realtime-2`. + + - `effort: optional RealtimeReasoningEffort` + + Constrains effort on reasoning for reasoning-capable Realtime models such as + `gpt-realtime-2`. 
+ + - `"minimal"` + + - `"low"` + + - `"medium"` + + - `"high"` + + - `"xhigh"` + - `tool_choice: optional ToolChoiceOptions or ToolChoiceFunction or ToolChoiceMcp` How the model chooses tools. Provide one of the string modes or force a specific @@ -9934,13 +10122,23 @@ The unique ID of the server event. - - `session: RealtimeSessionCreateRequest or RealtimeTranscriptionSessionCreateRequest` + - `session: RealtimeSessionCreateResponse or RealtimeTranscriptionSessionCreateResponse` The session configuration. - - `RealtimeSessionCreateRequest object { type, audio, include, 9 more }` + - `RealtimeSessionCreateResponse object { id, object, type, 13 more }` - Realtime session object configuration. + A Realtime session configuration object. + + - `id: string` + + Unique identifier for the session that looks like `sess_1234567890abcdef`. + + - `object: "realtime.session"` + + The object type. Always `realtime.session`. + + - `"realtime.session"` - `type: "realtime"` @@ -9948,11 +10146,11 @@ - `"realtime"` - - `audio: optional RealtimeAudioConfig` + - `audio: optional object { input, output }` Configuration for input and output audio. - - `input: optional RealtimeAudioConfigInput` + - `input: optional object { format, noise_reduction, transcription, turn_detection }` - `format: optional RealtimeAudioFormats` @@ -9972,25 +10170,23 @@ - `"far_field"` - - `transcription: optional AudioTranscription` + - `transcription: optional object { language, model, prompt }` Configuration for input audio transcription, defaults to off and can be set to `null` to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through [the /audio/transcriptions endpoint](/docs/api-reference/audio/createTranscription) and should be treated as guidance of input audio content rather than precisely what the model heard. 
The client can optionally set the language and prompt for transcription, these offer additional guidance to the transcription service. - `language: optional string` - The language of the input audio. Supplying the input language in - [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) (e.g. `en`) format - will improve accuracy and latency. + The language of the input audio. - - `model: optional string or "whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 2 more` + - `model: optional string or "whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 3 more` - The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. + The model used for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. - `string` - - `"whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 2 more` + - `"whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 3 more` - The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. + The model used for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. 
- `"whisper-1"` @@ -10002,14 +10198,13 @@ - `"gpt-4o-transcribe-diarize"` + - `"gpt-realtime-whisper"` + - `prompt: optional string` - An optional text to guide the model's style or continue a previous audio - segment. - For `whisper-1`, the [prompt is a list of keywords](/docs/guides/speech-to-text#prompting). - For `gpt-4o-transcribe` models (excluding `gpt-4o-transcribe-diarize`), the prompt is a free text string, for example "expect words related to technology". + The prompt configured for input audio transcription, when present. - - `turn_detection: optional RealtimeAudioInputTurnDetection` + - `turn_detection: optional object { type, create_response, idle_timeout_ms, 4 more } or object { type, create_response, eagerness, interrupt_response }` Configuration for turn detection, ether Server VAD or Semantic VAD. This can be set to `null` to turn off, in which case the client must manually trigger model response. @@ -10017,6 +10212,9 @@ Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency. + For `gpt-realtime-whisper` transcription sessions, turn detection must be + set to `null`; VAD is not supported. + - `ServerVad object { type, create_response, idle_timeout_ms, 4 more }` Server-side voice activity detection (VAD) which flips on when user speech is detected and off after a period of silence. @@ -10102,7 +10300,7 @@ Whether or not to automatically interrupt any ongoing response with output to the default conversation (i.e. `conversation` of `auto`) when a VAD start event occurs. 
- - `output: optional RealtimeAudioConfigOutput` + - `output: optional object { format, speed, voice }` - `format: optional RealtimeAudioFormats` @@ -10116,19 +10314,24 @@ This parameter is a post-processing adjustment to the audio after it is generated, it's also possible to prompt the model to speak faster or slower. - - `voice: optional string or "alloy" or "ash" or "ballad" or 7 more or object { id }` + - `voice: optional string or "alloy" or "ash" or "ballad" or 7 more` - The voice the model uses to respond. Supported built-in voices are - `alloy`, `ash`, `ballad`, `coral`, `echo`, `sage`, `shimmer`, `verse`, - `marin`, and `cedar`. You may also provide a custom voice object with - an `id`, for example `{ "id": "voice_1234" }`. Voice cannot be changed - during the session once the model has responded with audio at least once. - We recommend `marin` and `cedar` for best quality. + The voice the model uses to respond. Voice cannot be changed during the + session once the model has responded with audio at least once. Current + voice options are `alloy`, `ash`, `ballad`, `coral`, `echo`, `sage`, + `shimmer`, `verse`, `marin`, and `cedar`. We recommend `marin` and `cedar` for + best quality. - `string` - `"alloy" or "ash" or "ballad" or 7 more` + The voice the model uses to respond. Voice cannot be changed during the + session once the model has responded with audio at least once. Current + voice options are `alloy`, `ash`, `ballad`, `coral`, `echo`, `sage`, + `shimmer`, `verse`, `marin`, and `cedar`. We recommend `marin` and `cedar` for + best quality. + - `"alloy"` - `"ash"` @@ -10149,13 +10352,9 @@ - `"cedar"` - - `ID object { id }` - - Custom voice reference. - - - `id: string` + - `expires_at: optional number` - The custom voice ID, e.g. `voice_1234`. + Expiration timestamp for the session, in seconds since epoch. 
- `include: optional array of "item.input_audio_transcription.logprobs"` @@ -10184,13 +10383,13 @@ - `"inf"` - - `model: optional string or "gpt-realtime" or "gpt-realtime-1.5" or "gpt-realtime-2025-08-28" or 13 more` + - `model: optional string or "gpt-realtime" or "gpt-realtime-1.5" or "gpt-realtime-2" or 14 more` The Realtime model used for this session. - `string` - - `"gpt-realtime" or "gpt-realtime-1.5" or "gpt-realtime-2025-08-28" or 13 more` + - `"gpt-realtime" or "gpt-realtime-1.5" or "gpt-realtime-2" or 14 more` The Realtime model used for this session. @@ -10198,6 +10397,8 @@ - `"gpt-realtime-1.5"` + - `"gpt-realtime-2"` + - `"gpt-realtime-2025-08-28"` - `"gpt-4o-realtime-preview"` @@ -10335,7 +10536,26 @@ Optional version of the prompt template. - - `tool_choice: optional RealtimeToolChoiceConfig` + - `reasoning: optional RealtimeReasoning` + + Configuration for reasoning-capable Realtime models such as `gpt-realtime-2`. + + - `effort: optional RealtimeReasoningEffort` + + Constrains effort on reasoning for reasoning-capable Realtime models such as + `gpt-realtime-2`. + + - `"minimal"` + + - `"low"` + + - `"medium"` + + - `"high"` + + - `"xhigh"` + + - `tool_choice: optional ToolChoiceOptions or ToolChoiceFunction or ToolChoiceMcp` How the model chooses tools. Provide one of the string modes or force a specific function/MCP tool. @@ -10389,7 +10609,7 @@ The name of the tool to call on the server. - - `tools: optional RealtimeToolsConfig` + - `tools: optional array of RealtimeFunctionTool or object { server_label, type, allowed_tools, 7 more }` Tools available to the model. @@ -10557,7 +10777,7 @@ The URL for the MCP server. One of `server_url` or `connector_id` must be provided. - - `tracing: optional RealtimeTracingConfig` + - `tracing: optional "auto" or object { group_id, metadata, workflow_name }` Realtime API can write session traces to the [Traces Dashboard](https://platform.openai.com/logs?api=traces). Set to null to disable tracing. 
Once tracing is enabled for a session, the configuration cannot be modified. @@ -10630,21 +10850,29 @@ Maximum tokens allowed in the conversation after instructions (which including tool definitions). For example, setting this to 5,000 would mean that truncation would occur when the conversation exceeds 5,000 tokens after instructions. This cannot be higher than the model's context window size minus the maximum output tokens. - - `RealtimeTranscriptionSessionCreateRequest object { type, audio, include }` + - `RealtimeTranscriptionSessionCreateResponse object { id, object, type, 3 more }` - Realtime transcription session object configuration. + A Realtime transcription session configuration object. + + - `id: string` + + Unique identifier for the session that looks like `sess_1234567890abcdef`. + + - `object: string` + + The object type. Always `realtime.transcription_session`. - `type: "transcription"` - The type of session to create. Always `transcription` for transcription sessions. + The type of session. Always `transcription` for transcription sessions. - `"transcription"` - - `audio: optional RealtimeTranscriptionSessionAudio` + - `audio: optional object { input }` - Configuration for input and output audio. + Configuration for input audio for the session. - - `input: optional RealtimeTranscriptionSessionAudioInput` + - `input: optional object { format, noise_reduction, transcription, turn_detection }` - `format: optional RealtimeAudioFormats` @@ -10652,116 +10880,82 @@ - `noise_reduction: optional object { type }` - Configuration for input audio noise reduction. This can be set to `null` to turn off. - Noise reduction filters audio added to the input audio buffer before it is sent to VAD and the model. - Filtering the audio can improve VAD and turn detection accuracy (reducing false positives) and model performance by improving perception of the input audio. + Configuration for input audio noise reduction. 
- `type: optional NoiseReductionType` Type of noise reduction. `near_field` is for close-talking microphones such as headphones, `far_field` is for far-field microphones such as laptop or conference room microphones. - - `transcription: optional AudioTranscription` - - Configuration for input audio transcription, defaults to off and can be set to `null` to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through [the /audio/transcriptions endpoint](/docs/api-reference/audio/createTranscription) and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription, these offer additional guidance to the transcription service. - - - `turn_detection: optional RealtimeTranscriptionSessionAudioInputTurnDetection` + - `transcription: optional object { language, model, prompt }` - Configuration for turn detection, ether Server VAD or Semantic VAD. This can be set to `null` to turn off, in which case the client must manually trigger model response. + Configuration of the transcription model. - Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech. + - `language: optional string` - Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency. + The language of the input audio. 
- - `ServerVad object { type, create_response, idle_timeout_ms, 4 more }` + - `model: optional string or "whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 3 more` - Server-side voice activity detection (VAD) which flips on when user speech is detected and off after a period of silence. + The model used for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. - - `type: "server_vad"` + - `string` - Type of turn detection, `server_vad` to turn on simple Server VAD. + - `"whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 3 more` - - `"server_vad"` + The model used for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. - - `create_response: optional boolean` + - `"whisper-1"` - Whether or not to automatically generate a response when a VAD stop event occurs. If `interrupt_response` is set to `false` this may fail to create a response if the model is already responding. + - `"gpt-4o-mini-transcribe"` - If both `create_response` and `interrupt_response` are set to `false`, the model will never respond automatically but VAD events will still be emitted. + - `"gpt-4o-mini-transcribe-2025-12-15"` - - `idle_timeout_ms: optional number` + - `"gpt-4o-transcribe"` - Optional timeout after which a model response will be triggered automatically. This is - useful for situations in which a long pause from the user is unexpected, such as a phone - call. The model will effectively prompt the user to continue the conversation based - on the current context. + - `"gpt-4o-transcribe-diarize"` - The timeout value will be applied after the last model response's audio has finished playing, - i.e. 
it's set to the `response.done` time plus audio playback duration. + - `"gpt-realtime-whisper"` - An `input_audio_buffer.timeout_triggered` event (plus events - associated with the Response) will be emitted when the timeout is reached. - Idle timeout is currently only supported for `server_vad` mode. + - `prompt: optional string` - - `interrupt_response: optional boolean` + The prompt configured for input audio transcription, when present. - Whether or not to automatically interrupt (cancel) any ongoing response with output to the default - conversation (i.e. `conversation` of `auto`) when a VAD start event occurs. If `true` then the response will be cancelled, otherwise it will continue until complete. + - `turn_detection: optional RealtimeTranscriptionSessionTurnDetection` - If both `create_response` and `interrupt_response` are set to `false`, the model will never respond automatically but VAD events will still be emitted. + Configuration for turn detection. Can be set to `null` to turn off. Server + VAD means that the model will detect the start and end of speech based on + audio volume and respond at the end of user speech. For `gpt-realtime-whisper`, this must be `null`; VAD is not supported. - `prefix_padding_ms: optional number` - Used only for `server_vad` mode. Amount of audio to include before the VAD detected speech (in + Amount of audio to include before the VAD detected speech (in milliseconds). Defaults to 300ms. - `silence_duration_ms: optional number` - Used only for `server_vad` mode. Duration of silence to detect speech stop (in milliseconds). Defaults + Duration of silence to detect speech stop (in milliseconds). Defaults to 500ms. With shorter values the model will respond more quickly, but may jump in on short pauses from the user. - `threshold: optional number` - Used only for `server_vad` mode. Activation threshold for VAD (0.0 to 1.0), this defaults to 0.5. A + Activation threshold for VAD (0.0 to 1.0), this defaults to 0.5. 
A higher threshold will require louder audio to activate the model, and thus might perform better in noisy environments. - - `SemanticVad object { type, create_response, eagerness, interrupt_response }` - - Server-side semantic turn detection which uses a model to determine when the user has finished speaking. - - - `type: "semantic_vad"` - - Type of turn detection, `semantic_vad` to turn on Semantic VAD. - - - `"semantic_vad"` - - - `create_response: optional boolean` - - Whether or not to automatically generate a response when a VAD stop event occurs. - - - `eagerness: optional "low" or "medium" or "high" or "auto"` - - Used only for `semantic_vad` mode. The eagerness of the model to respond. `low` will wait longer for the user to continue speaking, `high` will respond more quickly. `auto` is the default and is equivalent to `medium`. `low`, `medium`, and `high` have max timeouts of 8s, 4s, and 2s respectively. - - - `"low"` - - - `"medium"` - - - `"high"` + - `type: optional string` - - `"auto"` + Type of turn detection, only `server_vad` is currently supported. - - `interrupt_response: optional boolean` + - `expires_at: optional number` - Whether or not to automatically interrupt any ongoing response with output to the default - conversation (i.e. `conversation` of `auto`) when a VAD start event occurs. + Expiration timestamp for the session, in seconds since epoch. - `include: optional array of "item.input_audio_transcription.logprobs"` Additional fields to include in server outputs. - `item.input_audio_transcription.logprobs`: Include logprobs for input audio transcription. + - `item.input_audio_transcription.logprobs`: Include logprobs for input audio transcription. - `"item.input_audio_transcription.logprobs"` @@ -10780,17 +10974,17 @@ The unique ID of the server event. 
- - `session: RealtimeSessionCreateRequest or RealtimeTranscriptionSessionCreateRequest` + - `session: RealtimeSessionCreateResponse or RealtimeTranscriptionSessionCreateResponse` The session configuration. - - `RealtimeSessionCreateRequest object { type, audio, include, 9 more }` + - `RealtimeSessionCreateResponse object { id, object, type, 13 more }` - Realtime session object configuration. + A Realtime session configuration object. - - `RealtimeTranscriptionSessionCreateRequest object { type, audio, include }` + - `RealtimeTranscriptionSessionCreateResponse object { id, object, type, 3 more }` - Realtime transcription session object configuration. + A Realtime transcription session configuration object. - `type: "session.updated"` @@ -11228,25 +11422,23 @@ - `"far_field"` - - `input_audio_transcription: optional AudioTranscription` + - `input_audio_transcription: optional object { language, model, prompt }` Configuration for input audio transcription, defaults to off and can be set to `null` to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through [the /audio/transcriptions endpoint](https://platform.openai.com/docs/api-reference/audio/createTranscription) and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription, these offer additional guidance to the transcription service. - `language: optional string` - The language of the input audio. Supplying the input language in - [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) (e.g. `en`) format - will improve accuracy and latency. + The language of the input audio. 
- - `model: optional string or "whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 2 more` + - `model: optional string or "whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 3 more` - The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. + The model used for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. - `string` - - `"whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 2 more` + - `"whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 3 more` - The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. + The model used for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. - `"whisper-1"` @@ -11258,12 +11450,11 @@ - `"gpt-4o-transcribe-diarize"` + - `"gpt-realtime-whisper"` + - `prompt: optional string` - An optional text to guide the model's style or continue a previous audio - segment. - For `whisper-1`, the [prompt is a list of keywords](/docs/guides/speech-to-text#prompting). - For `gpt-4o-transcribe` models (excluding `gpt-4o-transcribe-diarize`), the prompt is a free text string, for example "expect words related to technology". + The prompt configured for input audio transcription, when present. 
- `instructions: optional string` @@ -11540,6 +11731,9 @@ Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency. + For `gpt-realtime-whisper` transcription sessions, turn detection must be + set to `null`; VAD is not supported. + - `ServerVad object { type, create_response, idle_timeout_ms, 4 more }` Server-side voice activity detection (VAD) which flips on when user speech is detected and off after a period of silence. @@ -11663,7 +11857,7 @@ ### Realtime Session Create Request -- `RealtimeSessionCreateRequest object { type, audio, include, 9 more }` +- `RealtimeSessionCreateRequest object { type, audio, include, 11 more }` Realtime session object configuration. @@ -11737,21 +11931,37 @@ Configuration for input audio transcription, defaults to off and can be set to `null` to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through [the /audio/transcriptions endpoint](/docs/api-reference/audio/createTranscription) and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription, these offer additional guidance to the transcription service. + - `delay: optional "minimal" or "low" or "medium" or 2 more` + + Controls how long the model waits before emitting transcription text. + Higher values can improve transcription accuracy at the cost of latency. + Only supported with `gpt-realtime-whisper` in GA Realtime sessions. 
+ + - `"minimal"` + + - `"low"` + + - `"medium"` + + - `"high"` + + - `"xhigh"` + - `language: optional string` The language of the input audio. Supplying the input language in [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) (e.g. `en`) format will improve accuracy and latency. - - `model: optional string or "whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 2 more` + - `model: optional string or "whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 3 more` - The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. + The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. - `string` - - `"whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 2 more` + - `"whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 3 more` - The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. + The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. 
- `"whisper-1"` @@ -11763,12 +11973,15 @@ - `"gpt-4o-transcribe-diarize"` + - `"gpt-realtime-whisper"` + - `prompt: optional string` An optional text to guide the model's style or continue a previous audio segment. For `whisper-1`, the [prompt is a list of keywords](/docs/guides/speech-to-text#prompting). For `gpt-4o-transcribe` models (excluding `gpt-4o-transcribe-diarize`), the prompt is a free text string, for example "expect words related to technology". + Prompt is not supported with `gpt-realtime-whisper` in GA Realtime sessions. - `turn_detection: optional RealtimeAudioInputTurnDetection` @@ -11778,6 +11991,9 @@ Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency. + For `gpt-realtime-whisper` transcription sessions, turn detection must be + set to `null`; VAD is not supported. + - `ServerVad object { type, create_response, idle_timeout_ms, 4 more }` Server-side voice activity detection (VAD) which flips on when user speech is detected and off after a period of silence. @@ -11945,13 +12161,13 @@ - `"inf"` - - `model: optional string or "gpt-realtime" or "gpt-realtime-1.5" or "gpt-realtime-2025-08-28" or 13 more` + - `model: optional string or "gpt-realtime" or "gpt-realtime-1.5" or "gpt-realtime-2" or 14 more` The Realtime model used for this session. - `string` - - `"gpt-realtime" or "gpt-realtime-1.5" or "gpt-realtime-2025-08-28" or 13 more` + - `"gpt-realtime" or "gpt-realtime-1.5" or "gpt-realtime-2" or 14 more` The Realtime model used for this session. 
@@ -11959,6 +12175,8 @@ - `"gpt-realtime-1.5"` + - `"gpt-realtime-2"` + - `"gpt-realtime-2025-08-28"` - `"gpt-4o-realtime-preview"` @@ -11997,6 +12215,11 @@ - `"audio"` + - `parallel_tool_calls: optional boolean` + + Whether the model may call multiple tools in parallel. Only supported by + reasoning Realtime models such as `gpt-realtime-2`. + - `prompt: optional ResponsePrompt` Reference to a prompt template and its variables. @@ -12096,6 +12319,25 @@ Optional version of the prompt template. + - `reasoning: optional RealtimeReasoning` + + Configuration for reasoning-capable Realtime models such as `gpt-realtime-2`. + + - `effort: optional RealtimeReasoningEffort` + + Constrains effort on reasoning for reasoning-capable Realtime models such as + `gpt-realtime-2`. + + - `"minimal"` + + - `"low"` + + - `"medium"` + + - `"high"` + + - `"xhigh"` + - `tool_choice: optional RealtimeToolChoiceConfig` How the model chooses tools. Provide one of the string modes or force a specific @@ -12889,21 +13131,37 @@ Configuration for input audio transcription, defaults to off and can be set to `null` to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through [the /audio/transcriptions endpoint](/docs/api-reference/audio/createTranscription) and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription, these offer additional guidance to the transcription service. + - `delay: optional "minimal" or "low" or "medium" or 2 more` + + Controls how long the model waits before emitting transcription text. + Higher values can improve transcription accuracy at the cost of latency. + Only supported with `gpt-realtime-whisper` in GA Realtime sessions. 
+ + - `"minimal"` + + - `"low"` + + - `"medium"` + + - `"high"` + + - `"xhigh"` + - `language: optional string` The language of the input audio. Supplying the input language in [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) (e.g. `en`) format will improve accuracy and latency. - - `model: optional string or "whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 2 more` + - `model: optional string or "whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 3 more` - The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. + The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. - `string` - - `"whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 2 more` + - `"whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 3 more` - The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. + The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. 
- `"whisper-1"` @@ -12915,12 +13173,15 @@ - `"gpt-4o-transcribe-diarize"` + - `"gpt-realtime-whisper"` + - `prompt: optional string` An optional text to guide the model's style or continue a previous audio segment. For `whisper-1`, the [prompt is a list of keywords](/docs/guides/speech-to-text#prompting). For `gpt-4o-transcribe` models (excluding `gpt-4o-transcribe-diarize`), the prompt is a free text string, for example "expect words related to technology". + Prompt is not supported with `gpt-realtime-whisper` in GA Realtime sessions. - `turn_detection: optional RealtimeTranscriptionSessionAudioInputTurnDetection` @@ -12930,6 +13191,9 @@ Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency. + For `gpt-realtime-whisper` transcription sessions, turn detection must be + set to `null`; VAD is not supported. + - `ServerVad object { type, create_response, idle_timeout_ms, 4 more }` Server-side voice activity detection (VAD) which flips on when user speech is detected and off after a period of silence. @@ -13077,21 +13341,37 @@ Configuration for input audio transcription, defaults to off and can be set to `null` to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through [the /audio/transcriptions endpoint](/docs/api-reference/audio/createTranscription) and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription, these offer additional guidance to the transcription service. 
+ - `delay: optional "minimal" or "low" or "medium" or 2 more` + + Controls how long the model waits before emitting transcription text. + Higher values can improve transcription accuracy at the cost of latency. + Only supported with `gpt-realtime-whisper` in GA Realtime sessions. + + - `"minimal"` + + - `"low"` + + - `"medium"` + + - `"high"` + + - `"xhigh"` + - `language: optional string` The language of the input audio. Supplying the input language in [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) (e.g. `en`) format will improve accuracy and latency. - - `model: optional string or "whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 2 more` + - `model: optional string or "whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 3 more` - The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. + The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. - `string` - - `"whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 2 more` + - `"whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 3 more` - The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. + The model to use for transcription. 
Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. - `"whisper-1"` @@ -13103,12 +13383,15 @@ - `"gpt-4o-transcribe-diarize"` + - `"gpt-realtime-whisper"` + - `prompt: optional string` An optional text to guide the model's style or continue a previous audio segment. For `whisper-1`, the [prompt is a list of keywords](/docs/guides/speech-to-text#prompting). For `gpt-4o-transcribe` models (excluding `gpt-4o-transcribe-diarize`), the prompt is a free text string, for example "expect words related to technology". + Prompt is not supported with `gpt-realtime-whisper` in GA Realtime sessions. - `turn_detection: optional RealtimeTranscriptionSessionAudioInputTurnDetection` @@ -13118,6 +13401,9 @@ Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency. + For `gpt-realtime-whisper` transcription sessions, turn detection must be + set to `null`; VAD is not supported. + - `ServerVad object { type, create_response, idle_timeout_ms, 4 more }` Server-side voice activity detection (VAD) which flips on when user speech is detected and off after a period of silence. @@ -13213,6 +13499,9 @@ Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. 
For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency. + For `gpt-realtime-whisper` transcription sessions, turn detection must be + set to `null`; VAD is not supported. + - `ServerVad object { type, create_response, idle_timeout_ms, 4 more }` Server-side voice activity detection (VAD) which flips on when user speech is detected and off after a period of silence. @@ -13374,21 +13663,37 @@ Configuration for input audio transcription, defaults to off and can be set to `null` to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through [the /audio/transcriptions endpoint](/docs/api-reference/audio/createTranscription) and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription, these offer additional guidance to the transcription service. + - `delay: optional "minimal" or "low" or "medium" or 2 more` + + Controls how long the model waits before emitting transcription text. + Higher values can improve transcription accuracy at the cost of latency. + Only supported with `gpt-realtime-whisper` in GA Realtime sessions. + + - `"minimal"` + + - `"low"` + + - `"medium"` + + - `"high"` + + - `"xhigh"` + - `language: optional string` The language of the input audio. Supplying the input language in [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) (e.g. `en`) format will improve accuracy and latency. - - `model: optional string or "whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 2 more` + - `model: optional string or "whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 3 more` - The model to use for transcription. 
Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. + The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. - `string` - - `"whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 2 more` + - `"whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 3 more` - The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. + The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. - `"whisper-1"` @@ -13400,12 +13705,15 @@ - `"gpt-4o-transcribe-diarize"` + - `"gpt-realtime-whisper"` + - `prompt: optional string` An optional text to guide the model's style or continue a previous audio segment. For `whisper-1`, the [prompt is a list of keywords](/docs/guides/speech-to-text#prompting). For `gpt-4o-transcribe` models (excluding `gpt-4o-transcribe-diarize`), the prompt is a free text string, for example "expect words related to technology". + Prompt is not supported with `gpt-realtime-whisper` in GA Realtime sessions. 
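The model-specific rules for the `transcription` object above (`delay` only with `gpt-realtime-whisper`, no `prompt` with that model) can be checked on the client before a session is configured. The following is a minimal sketch, not part of any SDK: `build_transcription_config` and `DELAY_LEVELS` are hypothetical names, and validating on the client is an assumption. The server may enforce these rules differently.

```python
from typing import Optional

# Delay levels listed in the reference above.
DELAY_LEVELS = {"minimal", "low", "medium", "high", "xhigh"}


def build_transcription_config(
    model: str,
    language: Optional[str] = None,
    delay: Optional[str] = None,
    prompt: Optional[str] = None,
) -> dict:
    """Return a dict shaped like the `transcription` object (hypothetical helper)."""
    if delay is not None:
        if delay not in DELAY_LEVELS:
            raise ValueError(f"unknown delay level: {delay!r}")
        # Documented: delay is only supported with gpt-realtime-whisper.
        if model != "gpt-realtime-whisper":
            raise ValueError("delay is only supported with gpt-realtime-whisper")
    # Documented: prompt is not supported with gpt-realtime-whisper.
    if prompt is not None and model == "gpt-realtime-whisper":
        raise ValueError("prompt is not supported with gpt-realtime-whisper")

    config = {"model": model}
    if language is not None:
        config["language"] = language
    if delay is not None:
        config["delay"] = delay
    if prompt is not None:
        config["prompt"] = prompt
    return config
```

Omitted optional fields are left out of the dict rather than sent as `null`, since `null` has its own meaning for some fields in this reference.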
- `turn_detection: optional RealtimeTranscriptionSessionAudioInputTurnDetection` @@ -13415,6 +13723,9 @@ Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency. + For `gpt-realtime-whisper` transcription sessions, turn detection must be + set to `null`; VAD is not supported. + - `ServerVad object { type, create_response, idle_timeout_ms, 4 more }` Server-side voice activity detection (VAD) which flips on when user speech is detected and off after a period of silence. @@ -13508,198 +13819,1214 @@ - `"item.input_audio_transcription.logprobs"` -### Realtime Truncation +### Realtime Translation Client Event -- `RealtimeTruncation = "auto" or "disabled" or object { retention_ratio, type, token_limits }` +- `RealtimeTranslationClientEvent = RealtimeTranslationSessionUpdateEvent or RealtimeTranslationInputAudioBufferAppendEvent or RealtimeTranslationSessionCloseEvent` - When the number of tokens in a conversation exceeds the model's input token limit, the conversation be truncated, meaning messages (starting from the oldest) will not be included in the model's context. A 32k context model with 4,096 max output tokens can only include 28,224 tokens in the context before truncation occurs. + A Realtime translation client event. - Clients can configure truncation behavior to truncate with a lower max token limit, which is an effective way to control token usage and cost. 
+ - `RealtimeTranslationSessionUpdateEvent object { session, type, event_id }` - Truncation will reduce the number of cached tokens on the next turn (busting the cache), since messages are dropped from the beginning of the context. However, clients can also configure truncation to retain messages up to a fraction of the maximum context size, which will reduce the need for future truncations and thus improve the cache rate. + Send this event to update the translation session configuration. Translation + sessions support updates to `audio.output.language`, `audio.input.transcription`, + and `audio.input.noise_reduction`. - Truncation can be disabled entirely, which means the server will never truncate but would instead return an error if the conversation exceeds the model's input token limit. + - `session: RealtimeTranslationSessionUpdateRequest` - - `"auto" or "disabled"` + Translation session fields to update. The session `type` and `model` are set + at creation and cannot be changed with `session.update`. - The truncation strategy to use for the session. `auto` is the default truncation strategy. `disabled` will disable truncation and emit errors when the conversation exceeds the input token limit. + - `audio: optional object { input, output }` - - `"auto"` + Configuration for translation input and output audio. - - `"disabled"` + - `input: optional object { noise_reduction, transcription }` - - `RetentionRatioTruncation object { retention_ratio, type, token_limits }` + - `noise_reduction: optional object { type }` - Retain a fraction of the conversation tokens when the conversation exceeds the input token limit. This allows you to amortize truncations across multiple turns, which can help improve cached token usage. + Optional input noise reduction. Set to `null` to disable it. 
- - `retention_ratio: number` + - `type: NoiseReductionType` - Fraction of post-instruction conversation tokens to retain (`0.0` - `1.0`) when the conversation exceeds the input token limit. Setting this to `0.8` means that messages will be dropped until 80% of the maximum allowed tokens are used. This helps reduce the frequency of truncations and improve cache rates. + Type of noise reduction. `near_field` is for close-talking microphones such as headphones, `far_field` is for far-field microphones such as laptop or conference room microphones. - - `type: "retention_ratio"` + - `"near_field"` - Use retention ratio truncation. + - `"far_field"` - - `"retention_ratio"` + - `transcription: optional object { model }` - - `token_limits: optional object { post_instructions }` + Optional source-language transcription. When configured, the server emits + `session.input_transcript.delta` events. Translation itself still runs from + the input audio stream. - Optional custom token limits for this truncation strategy. If not provided, the model's default token limits will be used. + - `model: string` - - `post_instructions: optional number` + The transcription model to use for source transcript deltas. - Maximum tokens allowed in the conversation after instructions (which including tool definitions). For example, setting this to 5,000 would mean that truncation would occur when the conversation exceeds 5,000 tokens after instructions. This cannot be higher than the model's context window size minus the maximum output tokens. + - `output: optional object { language }` -### Response Audio Delta Event + - `language: optional string` -- `ResponseAudioDeltaEvent object { content_index, delta, event_id, 4 more }` + Target language for translated output audio and transcript deltas. - Returned when the model-generated audio is updated. + - `type: "session.update"` - - `content_index: number` + The event type, must be `session.update`. 
- The index of the content part in the item's content array. + - `"session.update"` - - `delta: string` + - `event_id: optional string` - Base64-encoded audio data delta. + Optional client-generated ID used to identify this event. - - `event_id: string` + - `RealtimeTranslationInputAudioBufferAppendEvent object { audio, type, event_id }` - The unique ID of the server event. + Send this event to append audio bytes to the translation session input audio buffer. - - `item_id: string` + WebSocket translation sessions accept base64-encoded 24 kHz PCM16 mono + little-endian raw audio bytes. Unsupported websocket audio formats return a + validation error because lower-quality audio materially degrades translation + quality. - The ID of the item. + Translation consumes 200 ms engine frames. For best realtime behavior, append + audio in 200 ms chunks. If a chunk is shorter, the server buffers it until it + has enough audio for one frame. If a chunk is longer, the server splits it into + 200 ms frames and enqueues them back-to-back. - - `output_index: number` + Keep appending silence while the session is active. If a client stops sending + audio and later resumes, model time treats the resumed audio as contiguous with + the previous audio rather than as a real-world pause. - The index of the output item in the response. + - `audio: string` - - `response_id: string` + Base64-encoded 24 kHz PCM16 mono audio bytes. - The ID of the response. + - `type: "session.input_audio_buffer.append"` - - `type: "response.output_audio.delta"` + The event type, must be `session.input_audio_buffer.append`. - The event type, must be `response.output_audio.delta`. + - `"session.input_audio_buffer.append"` - - `"response.output_audio.delta"` + - `event_id: optional string` -### Response Audio Done Event + Optional client-generated ID used to identify this event. 
-- `ResponseAudioDoneEvent object { content_index, event_id, item_id, 3 more }` + - `RealtimeTranslationSessionCloseEvent object { type, event_id }` - Returned when the model-generated audio is done. Also emitted when a Response - is interrupted, incomplete, or cancelled. + Gracefully close the realtime translation session. The server flushes pending + input audio and emits any remaining translated output before closing the + session. - - `content_index: number` + - `type: "session.close"` - The index of the content part in the item's content array. + The event type, must be `session.close`. - - `event_id: string` + - `"session.close"` - The unique ID of the server event. + - `event_id: optional string` - - `item_id: string` + Optional client-generated ID used to identify this event. - The ID of the item. +### Realtime Translation Client Secret Create Request - - `output_index: number` +- `RealtimeTranslationClientSecretCreateRequest object { session, expires_after }` - The index of the output item in the response. + Create a translation session and client secret for the Realtime API. - - `response_id: string` + - `session: RealtimeTranslationSessionCreateRequest` - The ID of the response. + Realtime translation session configuration. Translation sessions stream source + audio in and translated audio plus transcript deltas out continuously. - - `type: "response.output_audio.done"` + - `model: string` - The event type, must be `response.output_audio.done`. + The Realtime translation model used for this session. - - `"response.output_audio.done"` + - `audio: optional object { input, output }` -### Response Audio Transcript Delta Event + Configuration for translation input and output audio. -- `ResponseAudioTranscriptDeltaEvent object { content_index, delta, event_id, 4 more }` + - `input: optional object { noise_reduction, transcription }` - Returned when the model-generated transcription of audio output is updated. 
+ - `noise_reduction: optional object { type }` - - `content_index: number` + Optional input noise reduction. Set to `null` to disable it. - The index of the content part in the item's content array. + - `type: NoiseReductionType` - - `delta: string` + Type of noise reduction. `near_field` is for close-talking microphones such as headphones, `far_field` is for far-field microphones such as laptop or conference room microphones. - The transcript delta. + - `"near_field"` - - `event_id: string` + - `"far_field"` - The unique ID of the server event. + - `transcription: optional object { model }` - - `item_id: string` + Optional source-language transcription. When configured, the server emits + `session.input_transcript.delta` events. Translation itself still runs from + the input audio stream. - The ID of the item. + - `model: string` - - `output_index: number` + The transcription model to use for source transcript deltas. - The index of the output item in the response. + - `output: optional object { language }` - - `response_id: string` + - `language: optional string` - The ID of the response. + Target language for translated output audio and transcript deltas. - - `type: "response.output_audio_transcript.delta"` + - `expires_after: optional object { anchor, seconds }` - The event type, must be `response.output_audio_transcript.delta`. + Configuration for the client secret expiration. Expiration refers to the time after which + a client secret will no longer be valid for creating sessions. The session itself may + continue after that time once started. A secret can be used to create multiple sessions + until it expires. - - `"response.output_audio_transcript.delta"` + - `anchor: optional "created_at"` -### Response Audio Transcript Done Event + The anchor point for the client secret expiration, meaning that `seconds` will be added to the `created_at` time of the client secret to produce an expiration timestamp. Only `created_at` is currently supported. 
-- `ResponseAudioTranscriptDoneEvent object { content_index, event_id, item_id, 4 more }` + - `"created_at"` - Returned when the model-generated transcription of audio output is done - streaming. Also emitted when a Response is interrupted, incomplete, or - cancelled. + - `seconds: optional number` - - `content_index: number` + The number of seconds from the anchor point to the expiration. Select a value between `10` and `7200` (2 hours). This defaults to 600 seconds (10 minutes) if not specified. - The index of the content part in the item's content array. +### Realtime Translation Client Secret Create Response - - `event_id: string` +- `RealtimeTranslationClientSecretCreateResponse object { expires_at, session, value }` - The unique ID of the server event. + Response from creating a translation session and client secret for the Realtime API. - - `item_id: string` + - `expires_at: number` - The ID of the item. + Expiration timestamp for the client secret, in seconds since epoch. - - `output_index: number` + - `session: RealtimeTranslationSession` - The index of the output item in the response. + A Realtime translation session. Translation sessions continuously translate input + audio into the configured output language. - - `response_id: string` + - `id: string` - The ID of the response. + Unique identifier for the session that looks like `sess_1234567890abcdef`. - - `transcript: string` + - `audio: object { input, output }` - The final transcript of the audio. + Configuration for translation input and output audio. - - `type: "response.output_audio_transcript.done"` + - `input: optional object { noise_reduction, transcription }` - The event type, must be `response.output_audio_transcript.done`. + - `noise_reduction: optional object { type }` - - `"response.output_audio_transcript.done"` + Optional input noise reduction.
-### Response Cancel Event + - `type: NoiseReductionType` -- `ResponseCancelEvent object { type, event_id, response_id }` + Type of noise reduction. `near_field` is for close-talking microphones such as headphones, `far_field` is for far-field microphones such as laptop or conference room microphones. - Send this event to cancel an in-progress response. The server will respond - with a `response.done` event with a status of `response.status=cancelled`. If - there is no response to cancel, the server will respond with an error. It's safe + - `"near_field"` + + - `"far_field"` + + - `transcription: optional object { model }` + + Optional source-language transcription. When configured, the server emits + `session.input_transcript.delta` events. Translation itself still runs from + the input audio stream. + + - `model: string` + + The transcription model used for source transcript deltas. + + - `output: optional object { language }` + + - `language: optional string` + + Target language for translated output audio and transcript deltas. + + - `expires_at: number` + + Expiration timestamp for the session, in seconds since epoch. + + - `model: string` + + The Realtime translation model used for this session. This field is set at + session creation and cannot be changed with `session.update`. + + - `type: "translation"` + + The session type. Always `translation` for Realtime translation sessions. + + - `"translation"` + + - `value: string` + + The generated client secret value. + +### Realtime Translation Input Audio Buffer Append Event + +- `RealtimeTranslationInputAudioBufferAppendEvent object { audio, type, event_id }` + + Send this event to append audio bytes to the translation session input audio buffer. + + WebSocket translation sessions accept base64-encoded 24 kHz PCM16 mono + little-endian raw audio bytes. Unsupported websocket audio formats return a + validation error because lower-quality audio materially degrades translation + quality. 
+ + Translation consumes 200 ms engine frames. For best realtime behavior, append + audio in 200 ms chunks. If a chunk is shorter, the server buffers it until it + has enough audio for one frame. If a chunk is longer, the server splits it into + 200 ms frames and enqueues them back-to-back. + + Keep appending silence while the session is active. If a client stops sending + audio and later resumes, model time treats the resumed audio as contiguous with + the previous audio rather than as a real-world pause. + + - `audio: string` + + Base64-encoded 24 kHz PCM16 mono audio bytes. + + - `type: "session.input_audio_buffer.append"` + + The event type, must be `session.input_audio_buffer.append`. + + - `"session.input_audio_buffer.append"` + + - `event_id: optional string` + + Optional client-generated ID used to identify this event. + +### Realtime Translation Input Transcript Delta Event + +- `RealtimeTranslationInputTranscriptDeltaEvent object { delta, event_id, type, elapsed_ms }` + + Returned when optional source-language transcript text is available. This event + is emitted only when `audio.input.transcription` is configured. + + Transcript deltas are append-only text fragments. Clients should not insert + unconditional spaces between deltas. + + - `delta: string` + + Append-only source-language transcript text. + + - `event_id: string` + + The unique ID of the server event. + + - `type: "session.input_transcript.delta"` + + The event type, must be `session.input_transcript.delta`. + + - `"session.input_transcript.delta"` + + - `elapsed_ms: optional number` + + Timing metadata for stream alignment, derived from the translation frame + when available. It advances in 200 ms increments, but multiple transcript + deltas may share the same `elapsed_ms`. Treat it as alignment metadata, + not a unique transcript-delta identifier. 
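The 200 ms framing described for `session.input_audio_buffer.append` works out to 9,600 bytes per chunk (24,000 samples/s × 2 bytes × 0.2 s). The following is a minimal sketch of splitting raw audio into such chunks, not a definitive client. The helper name `append_events` and the choice to pad the final short chunk with zero-valued (silent) samples are assumptions for illustration; the reference only states that the server buffers short chunks.

```python
import base64
import json

SAMPLE_RATE_HZ = 24_000   # 24 kHz, as stated for WebSocket translation sessions
BYTES_PER_SAMPLE = 2      # PCM16, mono
FRAME_MS = 200            # engine frame length from the event description
FRAME_BYTES = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * FRAME_MS // 1000  # 9600


def append_events(pcm: bytes, pad_last: bool = True) -> list[str]:
    """Split raw PCM16 mono audio into 200 ms append events (JSON strings).

    `append_events` and `pad_last` are hypothetical names, not part of the API.
    """
    events = []
    for start in range(0, len(pcm), FRAME_BYTES):
        frame = pcm[start:start + FRAME_BYTES]
        if pad_last and len(frame) < FRAME_BYTES:
            # Assumption: fill the tail with silence so every chunk is one frame.
            frame += b"\x00" * (FRAME_BYTES - len(frame))
        events.append(json.dumps({
            "type": "session.input_audio_buffer.append",
            "audio": base64.b64encode(frame).decode("ascii"),
        }))
    return events
```

Each returned string can be sent as one WebSocket text message. Sending a frame of zero bytes at the same cadence is one way to keep appending silence while the session is active.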
+ +### Realtime Translation Output Audio Delta Event + +- `RealtimeTranslationOutputAudioDeltaEvent object { delta, event_id, type, 4 more }` + + Returned when translated output audio is available. Output audio deltas are + 200 ms frames of PCM16 audio. + + - `delta: string` + + Base64-encoded translated audio data. + + - `event_id: string` + + The unique ID of the server event. + + - `type: "session.output_audio.delta"` + + The event type, must be `session.output_audio.delta`. + + - `"session.output_audio.delta"` + + - `channels: optional number` + + Number of audio channels. + + - `elapsed_ms: optional number` + + Timing metadata for stream alignment, derived from the translation frame + when available. Treat `elapsed_ms` as alignment metadata, not a unique + event identifier. + + - `format: optional "pcm16"` + + Audio encoding for `delta`. + + - `"pcm16"` + + - `sample_rate: optional number` + + Sample rate of the audio delta. + +### Realtime Translation Output Transcript Delta Event + +- `RealtimeTranslationOutputTranscriptDeltaEvent object { delta, event_id, type, elapsed_ms }` + + Returned when translated transcript text is available. + + Transcript deltas are append-only text fragments. Clients should not insert + unconditional spaces between deltas. + + - `delta: string` + + Append-only transcript text for the translated output audio. + + - `event_id: string` + + The unique ID of the server event. + + - `type: "session.output_transcript.delta"` + + The event type, must be `session.output_transcript.delta`. + + - `"session.output_transcript.delta"` + + - `elapsed_ms: optional number` + + Timing metadata for stream alignment, derived from the translation frame + when available. It advances in 200 ms increments, but multiple transcript + deltas may share the same `elapsed_ms`. Treat it as alignment metadata, + not a unique transcript-delta identifier. 
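Because transcript deltas are append-only fragments, a client can rebuild the running text by plain concatenation. The sketch below assumes events arrive as JSON strings; the function name `accumulate_transcripts` and the sample payloads in the usage note are hypothetical, and only the `type`, `delta`, and `elapsed_ms` field names come from the reference above.

```python
import json

TRANSCRIPT_TYPES = (
    "session.input_transcript.delta",
    "session.output_transcript.delta",
)


def accumulate_transcripts(raw_events):
    """Concatenate transcript deltas per event type without inserting spaces."""
    transcripts = {event_type: "" for event_type in TRANSCRIPT_TYPES}
    for raw in raw_events:
        event = json.loads(raw)
        event_type = event.get("type")
        if event_type in transcripts:
            # Deltas may share an elapsed_ms value, so it is not used as a key.
            transcripts[event_type] += event["delta"]
    return transcripts
```

For example, with made-up payloads:

```python
sample = [
    '{"type": "session.output_transcript.delta", "event_id": "e1", "delta": "Hel", "elapsed_ms": 200}',
    '{"type": "session.output_transcript.delta", "event_id": "e2", "delta": "lo wor", "elapsed_ms": 200}',
    '{"type": "session.output_transcript.delta", "event_id": "e3", "delta": "ld", "elapsed_ms": 400}',
]
result = accumulate_transcripts(sample)
```

Here the first two deltas share the same `elapsed_ms`, which is why the sketch treats that field as alignment metadata only.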
+ +### Realtime Translation Server Event + +- `RealtimeTranslationServerEvent = RealtimeErrorEvent or RealtimeTranslationSessionCreatedEvent or RealtimeTranslationSessionUpdatedEvent or 4 more` + + A Realtime translation server event. + + - `RealtimeErrorEvent object { error, event_id, type }` + + Returned when an error occurs, which could be a client problem or a server + problem. Most errors are recoverable and the session will stay open; we + recommend that implementors monitor and log error messages by default. + + - `error: RealtimeError` + + Details of the error. + + - `message: string` + + A human-readable error message. + + - `type: string` + + The type of error (e.g., "invalid_request_error", "server_error"). + + - `code: optional string` + + Error code, if any. + + - `event_id: optional string` + + The event_id of the client event that caused the error, if applicable. + + - `param: optional string` + + Parameter related to the error, if any. + + - `event_id: string` + + The unique ID of the server event. + + - `type: "error"` + + The event type, must be `error`. + + - `"error"` + + - `RealtimeTranslationSessionCreatedEvent object { event_id, session, type }` + + Returned when a translation session is created. Emitted automatically when a + new connection is established as the first server event. This event contains + the default translation session configuration. + + - `event_id: string` + + The unique ID of the server event. + + - `session: RealtimeTranslationSession` + + The translation session configuration. + + - `id: string` + + Unique identifier for the session that looks like `sess_1234567890abcdef`. + + - `audio: object { input, output }` + + Configuration for translation input and output audio. + + - `input: optional object { noise_reduction, transcription }` + + - `noise_reduction: optional object { type }` + + Optional input noise reduction. + + - `type: NoiseReductionType` + + Type of noise reduction.
`near_field` is for close-talking microphones such as headphones, `far_field` is for far-field microphones such as laptop or conference room microphones. + + - `"near_field"` + + - `"far_field"` + + - `transcription: optional object { model }` + + Optional source-language transcription. When configured, the server emits + `session.input_transcript.delta` events. Translation itself still runs from + the input audio stream. + + - `model: string` + + The transcription model used for source transcript deltas. + + - `output: optional object { language }` + + - `language: optional string` + + Target language for translated output audio and transcript deltas. + + - `expires_at: number` + + Expiration timestamp for the session, in seconds since epoch. + + - `model: string` + + The Realtime translation model used for this session. This field is set at + session creation and cannot be changed with `session.update`. + + - `type: "translation"` + + The session type. Always `translation` for Realtime translation sessions. + + - `"translation"` + + - `type: "session.created"` + + The event type, must be `session.created`. + + - `"session.created"` + + - `RealtimeTranslationSessionUpdatedEvent object { event_id, session, type }` + + Returned when a translation session is updated with a `session.update` event, + unless there is an error. + + - `event_id: string` + + The unique ID of the server event. + + - `session: RealtimeTranslationSession` + + The translation session configuration. + + - `type: "session.updated"` + + The event type, must be `session.updated`. + + - `"session.updated"` + + - `RealtimeTranslationSessionClosedEvent object { event_id, type }` + + Returned when a realtime translation session is closed. + + - `event_id: string` + + The unique ID of the server event. + + - `type: "session.closed"` + + The event type, must be `session.closed`. 
+ + - `"session.closed"` + + - `RealtimeTranslationInputTranscriptDeltaEvent object { delta, event_id, type, elapsed_ms }` + + Returned when optional source-language transcript text is available. This event + is emitted only when `audio.input.transcription` is configured. + + Transcript deltas are append-only text fragments. Clients should not insert + unconditional spaces between deltas. + + - `delta: string` + + Append-only source-language transcript text. + + - `event_id: string` + + The unique ID of the server event. + + - `type: "session.input_transcript.delta"` + + The event type, must be `session.input_transcript.delta`. + + - `"session.input_transcript.delta"` + + - `elapsed_ms: optional number` + + Timing metadata for stream alignment, derived from the translation frame + when available. It advances in 200 ms increments, but multiple transcript + deltas may share the same `elapsed_ms`. Treat it as alignment metadata, + not a unique transcript-delta identifier. + + - `RealtimeTranslationOutputTranscriptDeltaEvent object { delta, event_id, type, elapsed_ms }` + + Returned when translated transcript text is available. + + Transcript deltas are append-only text fragments. Clients should not insert + unconditional spaces between deltas. + + - `delta: string` + + Append-only transcript text for the translated output audio. + + - `event_id: string` + + The unique ID of the server event. + + - `type: "session.output_transcript.delta"` + + The event type, must be `session.output_transcript.delta`. + + - `"session.output_transcript.delta"` + + - `elapsed_ms: optional number` + + Timing metadata for stream alignment, derived from the translation frame + when available. It advances in 200 ms increments, but multiple transcript + deltas may share the same `elapsed_ms`. Treat it as alignment metadata, + not a unique transcript-delta identifier. 
+ + - `RealtimeTranslationOutputAudioDeltaEvent object { delta, event_id, type, 4 more }` + + Returned when translated output audio is available. Output audio deltas are + 200 ms frames of PCM16 audio. + + - `delta: string` + + Base64-encoded translated audio data. + + - `event_id: string` + + The unique ID of the server event. + + - `type: "session.output_audio.delta"` + + The event type, must be `session.output_audio.delta`. + + - `"session.output_audio.delta"` + + - `channels: optional number` + + Number of audio channels. + + - `elapsed_ms: optional number` + + Timing metadata for stream alignment, derived from the translation frame + when available. Treat `elapsed_ms` as alignment metadata, not a unique + event identifier. + + - `format: optional "pcm16"` + + Audio encoding for `delta`. + + - `"pcm16"` + + - `sample_rate: optional number` + + Sample rate of the audio delta. + +### Realtime Translation Session + +- `RealtimeTranslationSession object { id, audio, expires_at, 2 more }` + + A Realtime translation session. Translation sessions continuously translate input + audio into the configured output language. + + - `id: string` + + Unique identifier for the session that looks like `sess_1234567890abcdef`. + + - `audio: object { input, output }` + + Configuration for translation input and output audio. + + - `input: optional object { noise_reduction, transcription }` + + - `noise_reduction: optional object { type }` + + Optional input noise reduction. + + - `type: NoiseReductionType` + + Type of noise reduction. `near_field` is for close-talking microphones such as headphones, `far_field` is for far-field microphones such as laptop or conference room microphones. + + - `"near_field"` + + - `"far_field"` + + - `transcription: optional object { model }` + + Optional source-language transcription. When configured, the server emits + `session.input_transcript.delta` events. Translation itself still runs from + the input audio stream. 
+ + - `model: string` + + The transcription model used for source transcript deltas. + + - `output: optional object { language }` + + - `language: optional string` + + Target language for translated output audio and transcript deltas. + + - `expires_at: number` + + Expiration timestamp for the session, in seconds since epoch. + + - `model: string` + + The Realtime translation model used for this session. This field is set at + session creation and cannot be changed with `session.update`. + + - `type: "translation"` + + The session type. Always `translation` for Realtime translation sessions. + + - `"translation"` + +### Realtime Translation Session Close Event + +- `RealtimeTranslationSessionCloseEvent object { type, event_id }` + + Gracefully close the realtime translation session. The server flushes pending + input audio and emits any remaining translated output before closing the + session. + + - `type: "session.close"` + + The event type, must be `session.close`. + + - `"session.close"` + + - `event_id: optional string` + + Optional client-generated ID used to identify this event. + +### Realtime Translation Session Closed Event + +- `RealtimeTranslationSessionClosedEvent object { event_id, type }` + + Returned when a realtime translation session is closed. + + - `event_id: string` + + The unique ID of the server event. + + - `type: "session.closed"` + + The event type, must be `session.closed`. + + - `"session.closed"` + +### Realtime Translation Session Create Request + +- `RealtimeTranslationSessionCreateRequest object { model, audio }` + + Realtime translation session configuration. Translation sessions stream source + audio in and translated audio plus transcript deltas out continuously. + + - `model: string` + + The Realtime translation model used for this session. + + - `audio: optional object { input, output }` + + Configuration for translation input and output audio. 
+ + - `input: optional object { noise_reduction, transcription }` + + - `noise_reduction: optional object { type }` + + Optional input noise reduction. Set to `null` to disable it. + + - `type: NoiseReductionType` + + Type of noise reduction. `near_field` is for close-talking microphones such as headphones, `far_field` is for far-field microphones such as laptop or conference room microphones. + + - `"near_field"` + + - `"far_field"` + + - `transcription: optional object { model }` + + Optional source-language transcription. When configured, the server emits + `session.input_transcript.delta` events. Translation itself still runs from + the input audio stream. + + - `model: string` + + The transcription model to use for source transcript deltas. + + - `output: optional object { language }` + + - `language: optional string` + + Target language for translated output audio and transcript deltas. + +### Realtime Translation Session Created Event + +- `RealtimeTranslationSessionCreatedEvent object { event_id, session, type }` + + Returned when a translation session is created. Emitted automatically when a + new connection is established as the first server event. This event contains + the default translation session configuration. + + - `event_id: string` + + The unique ID of the server event. + + - `session: RealtimeTranslationSession` + + The translation session configuration. + + - `id: string` + + Unique identifier for the session that looks like `sess_1234567890abcdef`. + + - `audio: object { input, output }` + + Configuration for translation input and output audio. + + - `input: optional object { noise_reduction, transcription }` + + - `noise_reduction: optional object { type }` + + Optional input noise reduction. + + - `type: NoiseReductionType` + + Type of noise reduction. `near_field` is for close-talking microphones such as headphones, `far_field` is for far-field microphones such as laptop or conference room microphones. 
+ + - `"near_field"` + + - `"far_field"` + + - `transcription: optional object { model }` + + Optional source-language transcription. When configured, the server emits + `session.input_transcript.delta` events. Translation itself still runs from + the input audio stream. + + - `model: string` + + The transcription model used for source transcript deltas. + + - `output: optional object { language }` + + - `language: optional string` + + Target language for translated output audio and transcript deltas. + + - `expires_at: number` + + Expiration timestamp for the session, in seconds since epoch. + + - `model: string` + + The Realtime translation model used for this session. This field is set at + session creation and cannot be changed with `session.update`. + + - `type: "translation"` + + The session type. Always `translation` for Realtime translation sessions. + + - `"translation"` + + - `type: "session.created"` + + The event type, must be `session.created`. + + - `"session.created"` + +### Realtime Translation Session Update Event + +- `RealtimeTranslationSessionUpdateEvent object { session, type, event_id }` + + Send this event to update the translation session configuration. Translation + sessions support updates to `audio.output.language`, `audio.input.transcription`, + and `audio.input.noise_reduction`. + + - `session: RealtimeTranslationSessionUpdateRequest` + + Translation session fields to update. The session `type` and `model` are set + at creation and cannot be changed with `session.update`. + + - `audio: optional object { input, output }` + + Configuration for translation input and output audio. + + - `input: optional object { noise_reduction, transcription }` + + - `noise_reduction: optional object { type }` + + Optional input noise reduction. Set to `null` to disable it. + + - `type: NoiseReductionType` + + Type of noise reduction. 
`near_field` is for close-talking microphones such as headphones, `far_field` is for far-field microphones such as laptop or conference room microphones. + + - `"near_field"` + + - `"far_field"` + + - `transcription: optional object { model }` + + Optional source-language transcription. When configured, the server emits + `session.input_transcript.delta` events. Translation itself still runs from + the input audio stream. + + - `model: string` + + The transcription model to use for source transcript deltas. + + - `output: optional object { language }` + + - `language: optional string` + + Target language for translated output audio and transcript deltas. + + - `type: "session.update"` + + The event type, must be `session.update`. + + - `"session.update"` + + - `event_id: optional string` + + Optional client-generated ID used to identify this event. + +### Realtime Translation Session Update Request + +- `RealtimeTranslationSessionUpdateRequest object { audio }` + + Realtime translation session fields that can be updated with `session.update`. + + - `audio: optional object { input, output }` + + Configuration for translation input and output audio. + + - `input: optional object { noise_reduction, transcription }` + + - `noise_reduction: optional object { type }` + + Optional input noise reduction. Set to `null` to disable it. + + - `type: NoiseReductionType` + + Type of noise reduction. `near_field` is for close-talking microphones such as headphones, `far_field` is for far-field microphones such as laptop or conference room microphones. + + - `"near_field"` + + - `"far_field"` + + - `transcription: optional object { model }` + + Optional source-language transcription. When configured, the server emits + `session.input_transcript.delta` events. Translation itself still runs from + the input audio stream. + + - `model: string` + + The transcription model to use for source transcript deltas. 
+ + - `output: optional object { language }` + + - `language: optional string` + + Target language for translated output audio and transcript deltas. + +### Realtime Translation Session Updated Event + +- `RealtimeTranslationSessionUpdatedEvent object { event_id, session, type }` + + Returned when a translation session is updated with a `session.update` event, + unless there is an error. + + - `event_id: string` + + The unique ID of the server event. + + - `session: RealtimeTranslationSession` + + The translation session configuration. + + - `id: string` + + Unique identifier for the session that looks like `sess_1234567890abcdef`. + + - `audio: object { input, output }` + + Configuration for translation input and output audio. + + - `input: optional object { noise_reduction, transcription }` + + - `noise_reduction: optional object { type }` + + Optional input noise reduction. + + - `type: NoiseReductionType` + + Type of noise reduction. `near_field` is for close-talking microphones such as headphones, `far_field` is for far-field microphones such as laptop or conference room microphones. + + - `"near_field"` + + - `"far_field"` + + - `transcription: optional object { model }` + + Optional source-language transcription. When configured, the server emits + `session.input_transcript.delta` events. Translation itself still runs from + the input audio stream. + + - `model: string` + + The transcription model used for source transcript deltas. + + - `output: optional object { language }` + + - `language: optional string` + + Target language for translated output audio and transcript deltas. + + - `expires_at: number` + + Expiration timestamp for the session, in seconds since epoch. + + - `model: string` + + The Realtime translation model used for this session. This field is set at + session creation and cannot be changed with `session.update`. + + - `type: "translation"` + + The session type. Always `translation` for Realtime translation sessions. 
+ + - `"translation"` + + - `type: "session.updated"` + + The event type, must be `session.updated`. + + - `"session.updated"` + +### Realtime Truncation + +- `RealtimeTruncation = "auto" or "disabled" or object { retention_ratio, type, token_limits }` + + When the number of tokens in a conversation exceeds the model's input token limit, the conversation will be truncated, meaning messages (starting from the oldest) will not be included in the model's context. A 32k context model with 4,096 max output tokens can only include 28,224 tokens in the context before truncation occurs. + + Clients can configure truncation behavior to truncate with a lower max token limit, which is an effective way to control token usage and cost. + + Truncation will reduce the number of cached tokens on the next turn (busting the cache), since messages are dropped from the beginning of the context. However, clients can also configure truncation to retain messages up to a fraction of the maximum context size, which will reduce the need for future truncations and thus improve the cache rate. + + Truncation can be disabled entirely, which means the server will never truncate but would instead return an error if the conversation exceeds the model's input token limit. + + - `"auto" or "disabled"` + + The truncation strategy to use for the session. `auto` is the default truncation strategy. `disabled` will disable truncation and emit errors when the conversation exceeds the input token limit. + + - `"auto"` + + - `"disabled"` + + - `RetentionRatioTruncation object { retention_ratio, type, token_limits }` + + Retain a fraction of the conversation tokens when the conversation exceeds the input token limit. This allows you to amortize truncations across multiple turns, which can help improve cached token usage. + + - `retention_ratio: number` + + Fraction of post-instruction conversation tokens to retain (`0.0` - `1.0`) when the conversation exceeds the input token limit.
Setting this to `0.8` means that messages will be dropped until 80% of the maximum allowed tokens are used. This helps reduce the frequency of truncations and improve cache rates. + + - `type: "retention_ratio"` + + Use retention ratio truncation. + + - `"retention_ratio"` + + - `token_limits: optional object { post_instructions }` + + Optional custom token limits for this truncation strategy. If not provided, the model's default token limits will be used. + + - `post_instructions: optional number` + + Maximum tokens allowed in the conversation after instructions (including tool definitions). For example, setting this to 5,000 would mean that truncation would occur when the conversation exceeds 5,000 tokens after instructions. This cannot be higher than the model's context window size minus the maximum output tokens. + +### Response Audio Delta Event + +- `ResponseAudioDeltaEvent object { content_index, delta, event_id, 4 more }` + + Returned when the model-generated audio is updated. + + - `content_index: number` + + The index of the content part in the item's content array. + + - `delta: string` + + Base64-encoded audio data delta. + + - `event_id: string` + + The unique ID of the server event. + + - `item_id: string` + + The ID of the item. + + - `output_index: number` + + The index of the output item in the response. + + - `response_id: string` + + The ID of the response. + + - `type: "response.output_audio.delta"` + + The event type, must be `response.output_audio.delta`. + + - `"response.output_audio.delta"` + +### Response Audio Done Event + +- `ResponseAudioDoneEvent object { content_index, event_id, item_id, 3 more }` + + Returned when the model-generated audio is done. Also emitted when a Response + is interrupted, incomplete, or cancelled. + + - `content_index: number` + + The index of the content part in the item's content array. + + - `event_id: string` + + The unique ID of the server event. + + - `item_id: string` + + The ID of the item.
+ + - `output_index: number` + + The index of the output item in the response. + + - `response_id: string` + + The ID of the response. + + - `type: "response.output_audio.done"` + + The event type, must be `response.output_audio.done`. + + - `"response.output_audio.done"` + +### Response Audio Transcript Delta Event + +- `ResponseAudioTranscriptDeltaEvent object { content_index, delta, event_id, 4 more }` + + Returned when the model-generated transcription of audio output is updated. + + - `content_index: number` + + The index of the content part in the item's content array. + + - `delta: string` + + The transcript delta. + + - `event_id: string` + + The unique ID of the server event. + + - `item_id: string` + + The ID of the item. + + - `output_index: number` + + The index of the output item in the response. + + - `response_id: string` + + The ID of the response. + + - `type: "response.output_audio_transcript.delta"` + + The event type, must be `response.output_audio_transcript.delta`. + + - `"response.output_audio_transcript.delta"` + +### Response Audio Transcript Done Event + +- `ResponseAudioTranscriptDoneEvent object { content_index, event_id, item_id, 4 more }` + + Returned when the model-generated transcription of audio output is done + streaming. Also emitted when a Response is interrupted, incomplete, or + cancelled. + + - `content_index: number` + + The index of the content part in the item's content array. + + - `event_id: string` + + The unique ID of the server event. + + - `item_id: string` + + The ID of the item. + + - `output_index: number` + + The index of the output item in the response. + + - `response_id: string` + + The ID of the response. + + - `transcript: string` + + The final transcript of the audio. + + - `type: "response.output_audio_transcript.done"` + + The event type, must be `response.output_audio_transcript.done`. 
+ + - `"response.output_audio_transcript.done"` + +### Response Cancel Event + +- `ResponseCancelEvent object { type, event_id, response_id }` + + Send this event to cancel an in-progress response. The server will respond + with a `response.done` event with a status of `response.status=cancelled`. If + there is no response to cancel, the server will respond with an error. It's safe to call `response.cancel` even if no response is in progress; an error will be returned, but the session will remain unaffected. @@ -14452,6 +15779,11 @@ - `"audio"` + - `parallel_tool_calls: optional boolean` + + Whether the model may call multiple tools in parallel. Only supported by + reasoning Realtime models such as `gpt-realtime-2`. + - `prompt: optional ResponsePrompt` Reference to a prompt template and its variables. @@ -14551,6 +15883,25 @@ Optional version of the prompt template. + - `reasoning: optional RealtimeReasoning` + + Configuration for reasoning-capable Realtime models such as `gpt-realtime-2`. + + - `effort: optional RealtimeReasoningEffort` + + Constrains effort on reasoning for reasoning-capable Realtime models such as + `gpt-realtime-2`. + + - `"minimal"` + + - `"low"` + + - `"medium"` + + - `"high"` + + - `"xhigh"` + - `tool_choice: optional ToolChoiceOptions or ToolChoiceFunction or ToolChoiceMcp` How the model chooses tools. Provide one of the string modes or force a specific @@ -17399,13 +18750,23 @@ The unique ID of the server event. - - `session: RealtimeSessionCreateRequest or RealtimeTranscriptionSessionCreateRequest` + - `session: RealtimeSessionCreateResponse or RealtimeTranscriptionSessionCreateResponse` The session configuration. - - `RealtimeSessionCreateRequest object { type, audio, include, 9 more }` + - `RealtimeSessionCreateResponse object { id, object, type, 13 more }` - Realtime session object configuration. + A Realtime session configuration object. + + - `id: string` + + Unique identifier for the session that looks like `sess_1234567890abcdef`.
+ + - `object: "realtime.session"` + + The object type. Always `realtime.session`. + + - `"realtime.session"` - `type: "realtime"` @@ -17413,11 +18774,11 @@ - `"realtime"` - - `audio: optional RealtimeAudioConfig` + - `audio: optional object { input, output }` Configuration for input and output audio. - - `input: optional RealtimeAudioConfigInput` + - `input: optional object { format, noise_reduction, transcription, turn_detection }` - `format: optional RealtimeAudioFormats` @@ -17473,25 +18834,23 @@ - `"far_field"` - - `transcription: optional AudioTranscription` + - `transcription: optional object { language, model, prompt }` Configuration for input audio transcription, defaults to off and can be set to `null` to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through [the /audio/transcriptions endpoint](/docs/api-reference/audio/createTranscription) and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription, these offer additional guidance to the transcription service. - `language: optional string` - The language of the input audio. Supplying the input language in - [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) (e.g. `en`) format - will improve accuracy and latency. + The language of the input audio. - - `model: optional string or "whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 2 more` + - `model: optional string or "whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 3 more` - The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. 
+ The model used for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. - `string` - - `"whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 2 more` + - `"whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 3 more` - The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. + The model used for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. - `"whisper-1"` @@ -17503,14 +18862,13 @@ - `"gpt-4o-transcribe-diarize"` + - `"gpt-realtime-whisper"` + - `prompt: optional string` - An optional text to guide the model's style or continue a previous audio - segment. - For `whisper-1`, the [prompt is a list of keywords](/docs/guides/speech-to-text#prompting). - For `gpt-4o-transcribe` models (excluding `gpt-4o-transcribe-diarize`), the prompt is a free text string, for example "expect words related to technology". + The prompt configured for input audio transcription, when present. - - `turn_detection: optional RealtimeAudioInputTurnDetection` + - `turn_detection: optional object { type, create_response, idle_timeout_ms, 4 more } or object { type, create_response, eagerness, interrupt_response }` Configuration for turn detection, either Server VAD or Semantic VAD. This can be set to `null` to turn off, in which case the client must manually trigger model response.
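The transcription hunks in this diff state three constraints for `gpt-realtime-whisper`: `delay` is only supported with that model, `prompt` is not supported with it, and turn detection must be `null`. The sketch below shows one way a client might check those rules before sending a configuration. The helper name `build_input_audio_config` and the choice to raise `ValueError` are hypothetical; the reference does not say how the server reacts to an invalid combination.

```python
import json

# Delay levels and model name as listed in the reference text of this diff.
DELAY_LEVELS = {"minimal", "low", "medium", "high", "xhigh"}
WHISPER_REALTIME = "gpt-realtime-whisper"


def build_input_audio_config(model, language=None, prompt=None, delay=None,
                             turn_detection=None):
    """Hypothetical client-side helper: build an `audio.input` dict and
    reject combinations the reference describes as unsupported."""
    if delay is not None:
        if delay not in DELAY_LEVELS:
            raise ValueError(f"unknown delay: {delay!r}")
        if model != WHISPER_REALTIME:
            raise ValueError("delay is only supported with gpt-realtime-whisper")
    if model == WHISPER_REALTIME:
        if prompt is not None:
            raise ValueError("prompt is not supported with gpt-realtime-whisper")
        if turn_detection is not None:
            raise ValueError("turn_detection must be null for gpt-realtime-whisper")

    transcription = {"model": model}
    for key, value in (("language", language), ("prompt", prompt), ("delay", delay)):
        if value is not None:
            transcription[key] = value
    # turn_detection is always emitted so that None serializes to JSON null.
    return {"transcription": transcription, "turn_detection": turn_detection}


config = build_input_audio_config(WHISPER_REALTIME, language="en", delay="low")
print(json.dumps(config, indent=2))
```

Validating on the client is a design choice made for this sketch only; it surfaces mistakes before any event is sent, but the server remains the source of truth for which combinations are accepted.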
@@ -17518,6 +18876,9 @@ Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency. + For `gpt-realtime-whisper` transcription sessions, turn detection must be + set to `null`; VAD is not supported. + - `ServerVad object { type, create_response, idle_timeout_ms, 4 more }` Server-side voice activity detection (VAD) which flips on when user speech is detected and off after a period of silence. @@ -17603,7 +18964,7 @@ Whether or not to automatically interrupt any ongoing response with output to the default conversation (i.e. `conversation` of `auto`) when a VAD start event occurs. - - `output: optional RealtimeAudioConfigOutput` + - `output: optional object { format, speed, voice }` - `format: optional RealtimeAudioFormats` @@ -17617,19 +18978,24 @@ This parameter is a post-processing adjustment to the audio after it is generated, it's also possible to prompt the model to speak faster or slower. - - `voice: optional string or "alloy" or "ash" or "ballad" or 7 more or object { id }` + - `voice: optional string or "alloy" or "ash" or "ballad" or 7 more` - The voice the model uses to respond. Supported built-in voices are - `alloy`, `ash`, `ballad`, `coral`, `echo`, `sage`, `shimmer`, `verse`, - `marin`, and `cedar`. You may also provide a custom voice object with - an `id`, for example `{ "id": "voice_1234" }`. Voice cannot be changed - during the session once the model has responded with audio at least once. - We recommend `marin` and `cedar` for best quality. + The voice the model uses to respond. 
Voice cannot be changed during the + session once the model has responded with audio at least once. Current + voice options are `alloy`, `ash`, `ballad`, `coral`, `echo`, `sage`, + `shimmer`, `verse`, `marin`, and `cedar`. We recommend `marin` and `cedar` for + best quality. - `string` - `"alloy" or "ash" or "ballad" or 7 more` + The voice the model uses to respond. Voice cannot be changed during the + session once the model has responded with audio at least once. Current + voice options are `alloy`, `ash`, `ballad`, `coral`, `echo`, `sage`, + `shimmer`, `verse`, `marin`, and `cedar`. We recommend `marin` and `cedar` for + best quality. + - `"alloy"` - `"ash"` @@ -17650,13 +19016,9 @@ - `"cedar"` - - `ID object { id }` - - Custom voice reference. - - - `id: string` + - `expires_at: optional number` - The custom voice ID, e.g. `voice_1234`. + Expiration timestamp for the session, in seconds since epoch. - `include: optional array of "item.input_audio_transcription.logprobs"` @@ -17685,13 +19047,13 @@ - `"inf"` - - `model: optional string or "gpt-realtime" or "gpt-realtime-1.5" or "gpt-realtime-2025-08-28" or 13 more` + - `model: optional string or "gpt-realtime" or "gpt-realtime-1.5" or "gpt-realtime-2" or 14 more` The Realtime model used for this session. - `string` - - `"gpt-realtime" or "gpt-realtime-1.5" or "gpt-realtime-2025-08-28" or 13 more` + - `"gpt-realtime" or "gpt-realtime-1.5" or "gpt-realtime-2" or 14 more` The Realtime model used for this session. @@ -17699,6 +19061,8 @@ - `"gpt-realtime-1.5"` + - `"gpt-realtime-2"` + - `"gpt-realtime-2025-08-28"` - `"gpt-4o-realtime-preview"` @@ -17836,7 +19200,26 @@ Optional version of the prompt template. - - `tool_choice: optional RealtimeToolChoiceConfig` + - `reasoning: optional RealtimeReasoning` + + Configuration for reasoning-capable Realtime models such as `gpt-realtime-2`. 
+ + - `effort: optional RealtimeReasoningEffort` + + Constrains effort on reasoning for reasoning-capable Realtime models such as + `gpt-realtime-2`. + + - `"minimal"` + + - `"low"` + + - `"medium"` + + - `"high"` + + - `"xhigh"` + + - `tool_choice: optional ToolChoiceOptions or ToolChoiceFunction or ToolChoiceMcp` How the model chooses tools. Provide one of the string modes or force a specific function/MCP tool. @@ -17890,7 +19273,7 @@ The name of the tool to call on the server. - - `tools: optional RealtimeToolsConfig` + - `tools: optional array of RealtimeFunctionTool or object { server_label, type, allowed_tools, 7 more }` Tools available to the model. @@ -18058,7 +19441,7 @@ The URL for the MCP server. One of `server_url` or `connector_id` must be provided. - - `tracing: optional RealtimeTracingConfig` + - `tracing: optional "auto" or object { group_id, metadata, workflow_name }` Realtime API can write session traces to the [Traces Dashboard](https://platform.openai.com/logs?api=traces). Set to null to disable tracing. Once tracing is enabled for a session, the configuration cannot be modified. @@ -18131,21 +19514,29 @@ Maximum tokens allowed in the conversation after instructions (including tool definitions). For example, setting this to 5,000 would mean that truncation would occur when the conversation exceeds 5,000 tokens after instructions. This cannot be higher than the model's context window size minus the maximum output tokens. - - `RealtimeTranscriptionSessionCreateRequest object { type, audio, include }` + - `RealtimeTranscriptionSessionCreateResponse object { id, object, type, 3 more }` - Realtime transcription session object configuration. + A Realtime transcription session configuration object. + + - `id: string` + + Unique identifier for the session that looks like `sess_1234567890abcdef`. + + - `object: string` + + The object type. Always `realtime.transcription_session`. - `type: "transcription"` - The type of session to create.
Always `transcription` for transcription sessions. + The type of session. Always `transcription` for transcription sessions. - `"transcription"` - - `audio: optional RealtimeTranscriptionSessionAudio` + - `audio: optional object { input }` - Configuration for input and output audio. + Configuration for input audio for the session. - - `input: optional RealtimeTranscriptionSessionAudioInput` + - `input: optional object { format, noise_reduction, transcription, turn_detection }` - `format: optional RealtimeAudioFormats` @@ -18153,116 +19544,82 @@ - `noise_reduction: optional object { type }` - Configuration for input audio noise reduction. This can be set to `null` to turn off. - Noise reduction filters audio added to the input audio buffer before it is sent to VAD and the model. - Filtering the audio can improve VAD and turn detection accuracy (reducing false positives) and model performance by improving perception of the input audio. + Configuration for input audio noise reduction. - `type: optional NoiseReductionType` Type of noise reduction. `near_field` is for close-talking microphones such as headphones, `far_field` is for far-field microphones such as laptop or conference room microphones. - - `transcription: optional AudioTranscription` - - Configuration for input audio transcription, defaults to off and can be set to `null` to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through [the /audio/transcriptions endpoint](/docs/api-reference/audio/createTranscription) and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription, these offer additional guidance to the transcription service. - - - `turn_detection: optional RealtimeTranscriptionSessionAudioInputTurnDetection` - - Configuration for turn detection, ether Server VAD or Semantic VAD. 
This can be set to `null` to turn off, in which case the client must manually trigger model response. - - Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech. - - Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency. - - - `ServerVad object { type, create_response, idle_timeout_ms, 4 more }` - - Server-side voice activity detection (VAD) which flips on when user speech is detected and off after a period of silence. - - - `type: "server_vad"` - - Type of turn detection, `server_vad` to turn on simple Server VAD. - - - `"server_vad"` - - - `create_response: optional boolean` - - Whether or not to automatically generate a response when a VAD stop event occurs. If `interrupt_response` is set to `false` this may fail to create a response if the model is already responding. - - If both `create_response` and `interrupt_response` are set to `false`, the model will never respond automatically but VAD events will still be emitted. + - `transcription: optional object { language, model, prompt }` - - `idle_timeout_ms: optional number` + Configuration of the transcription model. - Optional timeout after which a model response will be triggered automatically. This is - useful for situations in which a long pause from the user is unexpected, such as a phone - call. The model will effectively prompt the user to continue the conversation based - on the current context. + - `language: optional string` - The timeout value will be applied after the last model response's audio has finished playing, - i.e. 
it's set to the `response.done` time plus audio playback duration. + The language of the input audio. - An `input_audio_buffer.timeout_triggered` event (plus events - associated with the Response) will be emitted when the timeout is reached. - Idle timeout is currently only supported for `server_vad` mode. + - `model: optional string or "whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 3 more` - - `interrupt_response: optional boolean` + The model used for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. - Whether or not to automatically interrupt (cancel) any ongoing response with output to the default - conversation (i.e. `conversation` of `auto`) when a VAD start event occurs. If `true` then the response will be cancelled, otherwise it will continue until complete. + - `string` - If both `create_response` and `interrupt_response` are set to `false`, the model will never respond automatically but VAD events will still be emitted. + - `"whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 3 more` - - `prefix_padding_ms: optional number` + The model used for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. - Used only for `server_vad` mode. Amount of audio to include before the VAD detected speech (in - milliseconds). Defaults to 300ms. + - `"whisper-1"` - - `silence_duration_ms: optional number` + - `"gpt-4o-mini-transcribe"` - Used only for `server_vad` mode. Duration of silence to detect speech stop (in milliseconds). Defaults - to 500ms. With shorter values the model will respond more quickly, - but may jump in on short pauses from the user. 
+ - `"gpt-4o-mini-transcribe-2025-12-15"` - - `threshold: optional number` + - `"gpt-4o-transcribe"` - Used only for `server_vad` mode. Activation threshold for VAD (0.0 to 1.0), this defaults to 0.5. A - higher threshold will require louder audio to activate the model, and - thus might perform better in noisy environments. + - `"gpt-4o-transcribe-diarize"` - - `SemanticVad object { type, create_response, eagerness, interrupt_response }` + - `"gpt-realtime-whisper"` - Server-side semantic turn detection which uses a model to determine when the user has finished speaking. + - `prompt: optional string` - - `type: "semantic_vad"` + The prompt configured for input audio transcription, when present. - Type of turn detection, `semantic_vad` to turn on Semantic VAD. + - `turn_detection: optional RealtimeTranscriptionSessionTurnDetection` - - `"semantic_vad"` + Configuration for turn detection. Can be set to `null` to turn off. Server + VAD means that the model will detect the start and end of speech based on + audio volume and respond at the end of user speech. For `gpt-realtime-whisper`, this must be `null`; VAD is not supported. - - `create_response: optional boolean` + - `prefix_padding_ms: optional number` - Whether or not to automatically generate a response when a VAD stop event occurs. + Amount of audio to include before the VAD detected speech (in + milliseconds). Defaults to 300ms. - - `eagerness: optional "low" or "medium" or "high" or "auto"` + - `silence_duration_ms: optional number` - Used only for `semantic_vad` mode. The eagerness of the model to respond. `low` will wait longer for the user to continue speaking, `high` will respond more quickly. `auto` is the default and is equivalent to `medium`. `low`, `medium`, and `high` have max timeouts of 8s, 4s, and 2s respectively. + Duration of silence to detect speech stop (in milliseconds). Defaults + to 500ms. 
With shorter values the model will respond more quickly, + but may jump in on short pauses from the user. - - `"low"` + - `threshold: optional number` - - `"medium"` + Activation threshold for VAD (0.0 to 1.0), this defaults to 0.5. A + higher threshold will require louder audio to activate the model, and + thus might perform better in noisy environments. - - `"high"` + - `type: optional string` - - `"auto"` + Type of turn detection, only `server_vad` is currently supported. - - `interrupt_response: optional boolean` + - `expires_at: optional number` - Whether or not to automatically interrupt any ongoing response with output to the default - conversation (i.e. `conversation` of `auto`) when a VAD start event occurs. + Expiration timestamp for the session, in seconds since epoch. - `include: optional array of "item.input_audio_transcription.logprobs"` Additional fields to include in server outputs. - `item.input_audio_transcription.logprobs`: Include logprobs for input audio transcription. + - `item.input_audio_transcription.logprobs`: Include logprobs for input audio transcription. - `"item.input_audio_transcription.logprobs"` @@ -18291,7 +19648,7 @@ Update the Realtime session. Choose either a realtime session or a transcription session. - - `RealtimeSessionCreateRequest object { type, audio, include, 9 more }` + - `RealtimeSessionCreateRequest object { type, audio, include, 11 more }` Realtime session object configuration. @@ -18365,21 +19722,37 @@ Configuration for input audio transcription, defaults to off and can be set to `null` to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through [the /audio/transcriptions endpoint](/docs/api-reference/audio/createTranscription) and should be treated as guidance of input audio content rather than precisely what the model heard. 
The client can optionally set the language and prompt for transcription, these offer additional guidance to the transcription service. + - `delay: optional "minimal" or "low" or "medium" or 2 more` + + Controls how long the model waits before emitting transcription text. + Higher values can improve transcription accuracy at the cost of latency. + Only supported with `gpt-realtime-whisper` in GA Realtime sessions. + + - `"minimal"` + + - `"low"` + + - `"medium"` + + - `"high"` + + - `"xhigh"` + - `language: optional string` The language of the input audio. Supplying the input language in [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) (e.g. `en`) format will improve accuracy and latency. - - `model: optional string or "whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 2 more` + - `model: optional string or "whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 3 more` - The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. + The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. - `string` - - `"whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 2 more` + - `"whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 3 more` - The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. 
+ The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. - `"whisper-1"` @@ -18391,12 +19764,15 @@ - `"gpt-4o-transcribe-diarize"` + - `"gpt-realtime-whisper"` + - `prompt: optional string` An optional text to guide the model's style or continue a previous audio segment. For `whisper-1`, the [prompt is a list of keywords](/docs/guides/speech-to-text#prompting). For `gpt-4o-transcribe` models (excluding `gpt-4o-transcribe-diarize`), the prompt is a free text string, for example "expect words related to technology". + Prompt is not supported with `gpt-realtime-whisper` in GA Realtime sessions. - `turn_detection: optional RealtimeAudioInputTurnDetection` @@ -18406,6 +19782,9 @@ Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency. + For `gpt-realtime-whisper` transcription sessions, turn detection must be + set to `null`; VAD is not supported. + - `ServerVad object { type, create_response, idle_timeout_ms, 4 more }` Server-side voice activity detection (VAD) which flips on when user speech is detected and off after a period of silence. @@ -18573,13 +19952,13 @@ - `"inf"` - - `model: optional string or "gpt-realtime" or "gpt-realtime-1.5" or "gpt-realtime-2025-08-28" or 13 more` + - `model: optional string or "gpt-realtime" or "gpt-realtime-1.5" or "gpt-realtime-2" or 14 more` The Realtime model used for this session. 
- `string` - - `"gpt-realtime" or "gpt-realtime-1.5" or "gpt-realtime-2025-08-28" or 13 more` + - `"gpt-realtime" or "gpt-realtime-1.5" or "gpt-realtime-2" or 14 more` The Realtime model used for this session. @@ -18587,6 +19966,8 @@ - `"gpt-realtime-1.5"` + - `"gpt-realtime-2"` + - `"gpt-realtime-2025-08-28"` - `"gpt-4o-realtime-preview"` @@ -18625,6 +20006,11 @@ - `"audio"` + - `parallel_tool_calls: optional boolean` + + Whether the model may call multiple tools in parallel. Only supported by + reasoning Realtime models such as `gpt-realtime-2`. + - `prompt: optional ResponsePrompt` Reference to a prompt template and its variables. @@ -18724,6 +20110,25 @@ Optional version of the prompt template. + - `reasoning: optional RealtimeReasoning` + + Configuration for reasoning-capable Realtime models such as `gpt-realtime-2`. + + - `effort: optional RealtimeReasoningEffort` + + Constrains effort on reasoning for reasoning-capable Realtime models such as + `gpt-realtime-2`. + + - `"minimal"` + + - `"low"` + + - `"medium"` + + - `"high"` + + - `"xhigh"` + - `tool_choice: optional RealtimeToolChoiceConfig` How the model chooses tools. Provide one of the string modes or force a specific @@ -19061,6 +20466,9 @@ Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency. + For `gpt-realtime-whisper` transcription sessions, turn detection must be + set to `null`; VAD is not supported. + - `ServerVad object { type, create_response, idle_timeout_ms, 4 more }` Server-side voice activity detection (VAD) which flips on when user speech is detected and off after a period of silence. 
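The hunks above add three constraints tied to `gpt-realtime-whisper`: `delay` is only supported with that model, `prompt` is not supported with it, and turn detection must be `null`. A minimal sketch of a client-side check for those rules follows. The function name `check_audio_input` and the plain-dict layout are hypothetical and not part of any SDK. Treating an omitted `turn_detection` key the same as `null` is also an assumption, since the reference only says the field must be set to `null`.

```python
from typing import Any

# Values copied from the reference text in the hunks above.
DELAY_VALUES = {"minimal", "low", "medium", "high", "xhigh"}
WHISPER_REALTIME = "gpt-realtime-whisper"


def check_audio_input(audio_input: dict[str, Any]) -> list[str]:
    """Return a list of problems found in an `audio.input` style dict.

    Hypothetical helper: it only encodes the three rules stated in the
    reference text and does not call any API.
    """
    problems: list[str] = []
    transcription = audio_input.get("transcription") or {}
    model = transcription.get("model")
    delay = transcription.get("delay")

    if delay is not None and delay not in DELAY_VALUES:
        problems.append(f"unknown delay value: {delay!r}")
    if delay is not None and model != WHISPER_REALTIME:
        problems.append("delay is only supported with gpt-realtime-whisper")
    if model == WHISPER_REALTIME:
        if transcription.get("prompt") is not None:
            problems.append("prompt is not supported with gpt-realtime-whisper")
        # Assumption: an omitted key is treated like null here.
        if audio_input.get("turn_detection") is not None:
            problems.append("turn_detection must be null with gpt-realtime-whisper")
    return problems


config = {
    "transcription": {"model": "gpt-realtime-whisper", "delay": "low"},
    "turn_detection": None,
}
print(check_audio_input(config))  # prints []
```

Running the check before sending a session update can surface a rejected combination early, but the server remains the authority on what is accepted.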
@@ -19175,13 +20583,23 @@ The unique ID of the server event. - - `session: RealtimeSessionCreateRequest or RealtimeTranscriptionSessionCreateRequest` + - `session: RealtimeSessionCreateResponse or RealtimeTranscriptionSessionCreateResponse` The session configuration. - - `RealtimeSessionCreateRequest object { type, audio, include, 9 more }` + - `RealtimeSessionCreateResponse object { id, object, type, 13 more }` - Realtime session object configuration. + A Realtime session configuration object. + + - `id: string` + + Unique identifier for the session that looks like `sess_1234567890abcdef`. + + - `object: "realtime.session"` + + The object type. Always `realtime.session`. + + - `"realtime.session"` - `type: "realtime"` @@ -19189,11 +20607,11 @@ - `"realtime"` - - `audio: optional RealtimeAudioConfig` + - `audio: optional object { input, output }` Configuration for input and output audio. - - `input: optional RealtimeAudioConfigInput` + - `input: optional object { format, noise_reduction, transcription, turn_detection }` - `format: optional RealtimeAudioFormats` @@ -19249,25 +20667,23 @@ - `"far_field"` - - `transcription: optional AudioTranscription` + - `transcription: optional object { language, model, prompt }` Configuration for input audio transcription, defaults to off and can be set to `null` to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through [the /audio/transcriptions endpoint](/docs/api-reference/audio/createTranscription) and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription, these offer additional guidance to the transcription service. - `language: optional string` - The language of the input audio. Supplying the input language in - [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) (e.g. 
`en`) format - will improve accuracy and latency. + The language of the input audio. - - `model: optional string or "whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 2 more` + - `model: optional string or "whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 3 more` - The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. + The model used for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. - `string` - - `"whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 2 more` + - `"whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 3 more` - The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. + The model used for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. - `"whisper-1"` @@ -19279,14 +20695,13 @@ - `"gpt-4o-transcribe-diarize"` + - `"gpt-realtime-whisper"` + - `prompt: optional string` - An optional text to guide the model's style or continue a previous audio - segment. - For `whisper-1`, the [prompt is a list of keywords](/docs/guides/speech-to-text#prompting). - For `gpt-4o-transcribe` models (excluding `gpt-4o-transcribe-diarize`), the prompt is a free text string, for example "expect words related to technology". 
+ The prompt configured for input audio transcription, when present. - - `turn_detection: optional RealtimeAudioInputTurnDetection` + - `turn_detection: optional object { type, create_response, idle_timeout_ms, 4 more } or object { type, create_response, eagerness, interrupt_response }` Configuration for turn detection, ether Server VAD or Semantic VAD. This can be set to `null` to turn off, in which case the client must manually trigger model response. @@ -19294,6 +20709,9 @@ Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency. + For `gpt-realtime-whisper` transcription sessions, turn detection must be + set to `null`; VAD is not supported. + - `ServerVad object { type, create_response, idle_timeout_ms, 4 more }` Server-side voice activity detection (VAD) which flips on when user speech is detected and off after a period of silence. @@ -19379,7 +20797,7 @@ Whether or not to automatically interrupt any ongoing response with output to the default conversation (i.e. `conversation` of `auto`) when a VAD start event occurs. - - `output: optional RealtimeAudioConfigOutput` + - `output: optional object { format, speed, voice }` - `format: optional RealtimeAudioFormats` @@ -19393,19 +20811,24 @@ This parameter is a post-processing adjustment to the audio after it is generated, it's also possible to prompt the model to speak faster or slower. - - `voice: optional string or "alloy" or "ash" or "ballad" or 7 more or object { id }` + - `voice: optional string or "alloy" or "ash" or "ballad" or 7 more` - The voice the model uses to respond. 
Supported built-in voices are - `alloy`, `ash`, `ballad`, `coral`, `echo`, `sage`, `shimmer`, `verse`, - `marin`, and `cedar`. You may also provide a custom voice object with - an `id`, for example `{ "id": "voice_1234" }`. Voice cannot be changed - during the session once the model has responded with audio at least once. - We recommend `marin` and `cedar` for best quality. + The voice the model uses to respond. Voice cannot be changed during the + session once the model has responded with audio at least once. Current + voice options are `alloy`, `ash`, `ballad`, `coral`, `echo`, `sage`, + `shimmer`, `verse`, `marin`, and `cedar`. We recommend `marin` and `cedar` for + best quality. - `string` - `"alloy" or "ash" or "ballad" or 7 more` + The voice the model uses to respond. Voice cannot be changed during the + session once the model has responded with audio at least once. Current + voice options are `alloy`, `ash`, `ballad`, `coral`, `echo`, `sage`, + `shimmer`, `verse`, `marin`, and `cedar`. We recommend `marin` and `cedar` for + best quality. + - `"alloy"` - `"ash"` @@ -19426,13 +20849,9 @@ - `"cedar"` - - `ID object { id }` - - Custom voice reference. - - - `id: string` + - `expires_at: optional number` - The custom voice ID, e.g. `voice_1234`. + Expiration timestamp for the session, in seconds since epoch. - `include: optional array of "item.input_audio_transcription.logprobs"` @@ -19461,13 +20880,13 @@ - `"inf"` - - `model: optional string or "gpt-realtime" or "gpt-realtime-1.5" or "gpt-realtime-2025-08-28" or 13 more` + - `model: optional string or "gpt-realtime" or "gpt-realtime-1.5" or "gpt-realtime-2" or 14 more` The Realtime model used for this session. - `string` - - `"gpt-realtime" or "gpt-realtime-1.5" or "gpt-realtime-2025-08-28" or 13 more` + - `"gpt-realtime" or "gpt-realtime-1.5" or "gpt-realtime-2" or 14 more` The Realtime model used for this session. 
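The response object in the hunks above gains an `id`, an `object` field, and an optional `expires_at` described as seconds since epoch. As a rough illustration, the sketch below computes how long a session has left. The helper name `seconds_remaining` and the sample dict are hypothetical, and the timestamp is made up for the example.

```python
import time
from typing import Any, Optional


def seconds_remaining(session: dict[str, Any], now: Optional[float] = None) -> Optional[float]:
    """Seconds until `expires_at`, or None when the field is absent.

    Hypothetical helper: `expires_at` is optional in the reference, so a
    missing value is reported as None rather than guessed.
    """
    expires_at = session.get("expires_at")
    if expires_at is None:
        return None
    current = time.time() if now is None else now
    return expires_at - current


# Illustrative session payload; the field names follow the reference text,
# the values are invented.
session = {
    "id": "sess_1234567890abcdef",
    "object": "realtime.session",
    "type": "realtime",
    "expires_at": 1_700_000_060,
}
print(seconds_remaining(session, now=1_700_000_000))  # prints 60
```

Passing `now` explicitly keeps the function deterministic in tests; in normal use it falls back to `time.time()`.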
@@ -19475,6 +20894,8 @@ - `"gpt-realtime-1.5"` + - `"gpt-realtime-2"` + - `"gpt-realtime-2025-08-28"` - `"gpt-4o-realtime-preview"` @@ -19612,7 +21033,26 @@ Optional version of the prompt template. - - `tool_choice: optional RealtimeToolChoiceConfig` + - `reasoning: optional RealtimeReasoning` + + Configuration for reasoning-capable Realtime models such as `gpt-realtime-2`. + + - `effort: optional RealtimeReasoningEffort` + + Constrains effort on reasoning for reasoning-capable Realtime models such as + `gpt-realtime-2`. + + - `"minimal"` + + - `"low"` + + - `"medium"` + + - `"high"` + + - `"xhigh"` + + - `tool_choice: optional ToolChoiceOptions or ToolChoiceFunction or ToolChoiceMcp` How the model chooses tools. Provide one of the string modes or force a specific function/MCP tool. @@ -19666,7 +21106,7 @@ The name of the tool to call on the server. - - `tools: optional RealtimeToolsConfig` + - `tools: optional array of RealtimeFunctionTool or object { server_label, type, allowed_tools, 7 more }` Tools available to the model. @@ -19834,7 +21274,7 @@ The URL for the MCP server. One of `server_url` or `connector_id` must be provided. - - `tracing: optional RealtimeTracingConfig` + - `tracing: optional "auto" or object { group_id, metadata, workflow_name }` Realtime API can write session traces to the [Traces Dashboard](https://platform.openai.com/logs?api=traces). Set to null to disable tracing. Once tracing is enabled for a session, the configuration cannot be modified. @@ -19907,21 +21347,29 @@ Maximum tokens allowed in the conversation after instructions (which including tool definitions). For example, setting this to 5,000 would mean that truncation would occur when the conversation exceeds 5,000 tokens after instructions. This cannot be higher than the model's context window size minus the maximum output tokens. 
- - `RealtimeTranscriptionSessionCreateRequest object { type, audio, include }` + - `RealtimeTranscriptionSessionCreateResponse object { id, object, type, 3 more }` - Realtime transcription session object configuration. + A Realtime transcription session configuration object. + + - `id: string` + + Unique identifier for the session that looks like `sess_1234567890abcdef`. + + - `object: string` + + The object type. Always `realtime.transcription_session`. - `type: "transcription"` - The type of session to create. Always `transcription` for transcription sessions. + The type of session. Always `transcription` for transcription sessions. - `"transcription"` - - `audio: optional RealtimeTranscriptionSessionAudio` + - `audio: optional object { input }` - Configuration for input and output audio. + Configuration for input audio for the session. - - `input: optional RealtimeTranscriptionSessionAudioInput` + - `input: optional object { format, noise_reduction, transcription, turn_detection }` - `format: optional RealtimeAudioFormats` @@ -19929,116 +21377,82 @@ - `noise_reduction: optional object { type }` - Configuration for input audio noise reduction. This can be set to `null` to turn off. - Noise reduction filters audio added to the input audio buffer before it is sent to VAD and the model. - Filtering the audio can improve VAD and turn detection accuracy (reducing false positives) and model performance by improving perception of the input audio. + Configuration for input audio noise reduction. - `type: optional NoiseReductionType` Type of noise reduction. `near_field` is for close-talking microphones such as headphones, `far_field` is for far-field microphones such as laptop or conference room microphones. - - `transcription: optional AudioTranscription` - - Configuration for input audio transcription, defaults to off and can be set to `null` to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. 
Transcription runs asynchronously through [the /audio/transcriptions endpoint](/docs/api-reference/audio/createTranscription) and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription, these offer additional guidance to the transcription service. - - - `turn_detection: optional RealtimeTranscriptionSessionAudioInputTurnDetection` + - `transcription: optional object { language, model, prompt }` - Configuration for turn detection, ether Server VAD or Semantic VAD. This can be set to `null` to turn off, in which case the client must manually trigger model response. + Configuration of the transcription model. - Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech. + - `language: optional string` - Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency. + The language of the input audio. - - `ServerVad object { type, create_response, idle_timeout_ms, 4 more }` + - `model: optional string or "whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 3 more` - Server-side voice activity detection (VAD) which flips on when user speech is detected and off after a period of silence. + The model used for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. 
- - `type: "server_vad"` + - `string` - Type of turn detection, `server_vad` to turn on simple Server VAD. + - `"whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 3 more` - - `"server_vad"` + The model used for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. - - `create_response: optional boolean` + - `"whisper-1"` - Whether or not to automatically generate a response when a VAD stop event occurs. If `interrupt_response` is set to `false` this may fail to create a response if the model is already responding. + - `"gpt-4o-mini-transcribe"` - If both `create_response` and `interrupt_response` are set to `false`, the model will never respond automatically but VAD events will still be emitted. + - `"gpt-4o-mini-transcribe-2025-12-15"` - - `idle_timeout_ms: optional number` + - `"gpt-4o-transcribe"` - Optional timeout after which a model response will be triggered automatically. This is - useful for situations in which a long pause from the user is unexpected, such as a phone - call. The model will effectively prompt the user to continue the conversation based - on the current context. + - `"gpt-4o-transcribe-diarize"` - The timeout value will be applied after the last model response's audio has finished playing, - i.e. it's set to the `response.done` time plus audio playback duration. + - `"gpt-realtime-whisper"` - An `input_audio_buffer.timeout_triggered` event (plus events - associated with the Response) will be emitted when the timeout is reached. - Idle timeout is currently only supported for `server_vad` mode. + - `prompt: optional string` - - `interrupt_response: optional boolean` + The prompt configured for input audio transcription, when present. - Whether or not to automatically interrupt (cancel) any ongoing response with output to the default - conversation (i.e. 
`conversation` of `auto`) when a VAD start event occurs. If `true` then the response will be cancelled, otherwise it will continue until complete. + - `turn_detection: optional RealtimeTranscriptionSessionTurnDetection` - If both `create_response` and `interrupt_response` are set to `false`, the model will never respond automatically but VAD events will still be emitted. + Configuration for turn detection. Can be set to `null` to turn off. Server + VAD means that the model will detect the start and end of speech based on + audio volume and respond at the end of user speech. For `gpt-realtime-whisper`, this must be `null`; VAD is not supported. - `prefix_padding_ms: optional number` - Used only for `server_vad` mode. Amount of audio to include before the VAD detected speech (in + Amount of audio to include before the VAD detected speech (in milliseconds). Defaults to 300ms. - `silence_duration_ms: optional number` - Used only for `server_vad` mode. Duration of silence to detect speech stop (in milliseconds). Defaults + Duration of silence to detect speech stop (in milliseconds). Defaults to 500ms. With shorter values the model will respond more quickly, but may jump in on short pauses from the user. - `threshold: optional number` - Used only for `server_vad` mode. Activation threshold for VAD (0.0 to 1.0), this defaults to 0.5. A + Activation threshold for VAD (0.0 to 1.0), this defaults to 0.5. A higher threshold will require louder audio to activate the model, and thus might perform better in noisy environments. - - `SemanticVad object { type, create_response, eagerness, interrupt_response }` - - Server-side semantic turn detection which uses a model to determine when the user has finished speaking. - - - `type: "semantic_vad"` - - Type of turn detection, `semantic_vad` to turn on Semantic VAD. - - - `"semantic_vad"` - - - `create_response: optional boolean` - - Whether or not to automatically generate a response when a VAD stop event occurs. 
- - - `eagerness: optional "low" or "medium" or "high" or "auto"` - - Used only for `semantic_vad` mode. The eagerness of the model to respond. `low` will wait longer for the user to continue speaking, `high` will respond more quickly. `auto` is the default and is equivalent to `medium`. `low`, `medium`, and `high` have max timeouts of 8s, 4s, and 2s respectively. - - - `"low"` - - - `"medium"` - - - `"high"` + - `type: optional string` - - `"auto"` + Type of turn detection, only `server_vad` is currently supported. - - `interrupt_response: optional boolean` + - `expires_at: optional number` - Whether or not to automatically interrupt any ongoing response with output to the default - conversation (i.e. `conversation` of `auto`) when a VAD start event occurs. + Expiration timestamp for the session, in seconds since epoch. - `include: optional array of "item.input_audio_transcription.logprobs"` Additional fields to include in server outputs. - `item.input_audio_transcription.logprobs`: Include logprobs for input audio transcription. + - `item.input_audio_transcription.logprobs`: Include logprobs for input audio transcription. - `"item.input_audio_transcription.logprobs"` @@ -20093,7 +21507,23 @@ - `input_audio_transcription: optional AudioTranscription` - Configuration for input audio transcription. The client can optionally set the language and prompt for transcription, these offer additional guidance to the transcription service. + Configuration for input audio transcription. The client can optionally set the language and prompt for transcription, these offer additional guidance to the transcription service. + + - `delay: optional "minimal" or "low" or "medium" or 2 more` + + Controls how long the model waits before emitting transcription text. + Higher values can improve transcription accuracy at the cost of latency. + Only supported with `gpt-realtime-whisper` in GA Realtime sessions. 
+ + - `"minimal"` + + - `"low"` + + - `"medium"` + + - `"high"` + + - `"xhigh"` - `language: optional string` @@ -20101,15 +21531,15 @@ [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) (e.g. `en`) format will improve accuracy and latency. - - `model: optional string or "whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 2 more` + - `model: optional string or "whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 3 more` - The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. + The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. - `string` - - `"whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 2 more` + - `"whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 3 more` - The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. + The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. 
- `"whisper-1"` @@ -20121,12 +21551,15 @@ - `"gpt-4o-transcribe-diarize"` + - `"gpt-realtime-whisper"` + - `prompt: optional string` An optional text to guide the model's style or continue a previous audio segment. For `whisper-1`, the [prompt is a list of keywords](/docs/guides/speech-to-text#prompting). For `gpt-4o-transcribe` models (excluding `gpt-4o-transcribe-diarize`), the prompt is a free text string, for example "expect words related to technology". + Prompt is not supported with `gpt-realtime-whisper` in GA Realtime sessions. - `turn_detection: optional object { prefix_padding_ms, silence_duration_ms, threshold, type }` @@ -20204,25 +21637,23 @@ The format of input audio. Options are `pcm16`, `g711_ulaw`, or `g711_alaw`. - - `input_audio_transcription: optional AudioTranscription` + - `input_audio_transcription: optional object { language, model, prompt }` Configuration of the transcription model. - `language: optional string` - The language of the input audio. Supplying the input language in - [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) (e.g. `en`) format - will improve accuracy and latency. + The language of the input audio. - - `model: optional string or "whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 2 more` + - `model: optional string or "whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 3 more` - The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. + The model used for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. 
- `string` - - `"whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 2 more` + - `"whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 3 more` - The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. + The model used for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. - `"whisper-1"` @@ -20234,12 +21665,11 @@ - `"gpt-4o-transcribe-diarize"` + - `"gpt-realtime-whisper"` + - `prompt: optional string` - An optional text to guide the model's style or continue a previous audio - segment. - For `whisper-1`, the [prompt is a list of keywords](/docs/guides/speech-to-text#prompting). - For `gpt-4o-transcribe` models (excluding `gpt-4o-transcribe-diarize`), the prompt is a free text string, for example "expect words related to technology". + The prompt configured for input audio transcription, when present. - `modalities: optional array of "text" or "audio"` @@ -20327,7 +21757,7 @@ Returns the created client secret and the effective session object. The client s Session configuration to use for the client secret. Choose either a realtime session or a transcription session. - - `RealtimeSessionCreateRequest object { type, audio, include, 9 more }` + - `RealtimeSessionCreateRequest object { type, audio, include, 11 more }` Realtime session object configuration. @@ -20401,21 +21831,37 @@ Returns the created client secret and the effective session object. The client s Configuration for input audio transcription, defaults to off and can be set to `null` to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. 
Transcription runs asynchronously through [the /audio/transcriptions endpoint](/docs/api-reference/audio/createTranscription) and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription, these offer additional guidance to the transcription service. + - `delay: optional "minimal" or "low" or "medium" or 2 more` + + Controls how long the model waits before emitting transcription text. + Higher values can improve transcription accuracy at the cost of latency. + Only supported with `gpt-realtime-whisper` in GA Realtime sessions. + + - `"minimal"` + + - `"low"` + + - `"medium"` + + - `"high"` + + - `"xhigh"` + - `language: optional string` The language of the input audio. Supplying the input language in [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) (e.g. `en`) format will improve accuracy and latency. - - `model: optional string or "whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 2 more` + - `model: optional string or "whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 3 more` - The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. + The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. - `string` - - `"whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 2 more` + - `"whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 3 more` - The model to use for transcription. 
Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. + The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. - `"whisper-1"` @@ -20427,12 +21873,15 @@ Returns the created client secret and the effective session object. The client s - `"gpt-4o-transcribe-diarize"` + - `"gpt-realtime-whisper"` + - `prompt: optional string` An optional text to guide the model's style or continue a previous audio segment. For `whisper-1`, the [prompt is a list of keywords](/docs/guides/speech-to-text#prompting). For `gpt-4o-transcribe` models (excluding `gpt-4o-transcribe-diarize`), the prompt is a free text string, for example "expect words related to technology". + Prompt is not supported with `gpt-realtime-whisper` in GA Realtime sessions. - `turn_detection: optional RealtimeAudioInputTurnDetection` @@ -20442,6 +21891,9 @@ Returns the created client secret and the effective session object. The client s Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency. + For `gpt-realtime-whisper` transcription sessions, turn detection must be + set to `null`; VAD is not supported. 
+ - `ServerVad object { type, create_response, idle_timeout_ms, 4 more }` Server-side voice activity detection (VAD) which flips on when user speech is detected and off after a period of silence. @@ -20609,13 +22061,13 @@ Returns the created client secret and the effective session object. The client s - `"inf"` - - `model: optional string or "gpt-realtime" or "gpt-realtime-1.5" or "gpt-realtime-2025-08-28" or 13 more` + - `model: optional string or "gpt-realtime" or "gpt-realtime-1.5" or "gpt-realtime-2" or 14 more` The Realtime model used for this session. - `string` - - `"gpt-realtime" or "gpt-realtime-1.5" or "gpt-realtime-2025-08-28" or 13 more` + - `"gpt-realtime" or "gpt-realtime-1.5" or "gpt-realtime-2" or 14 more` The Realtime model used for this session. @@ -20623,6 +22075,8 @@ Returns the created client secret and the effective session object. The client s - `"gpt-realtime-1.5"` + - `"gpt-realtime-2"` + - `"gpt-realtime-2025-08-28"` - `"gpt-4o-realtime-preview"` @@ -20661,6 +22115,11 @@ Returns the created client secret and the effective session object. The client s - `"audio"` + - `parallel_tool_calls: optional boolean` + + Whether the model may call multiple tools in parallel. Only supported by + reasoning Realtime models such as `gpt-realtime-2`. + - `prompt: optional ResponsePrompt` Reference to a prompt template and its variables. @@ -20760,6 +22219,25 @@ Returns the created client secret and the effective session object. The client s Optional version of the prompt template. + - `reasoning: optional RealtimeReasoning` + + Configuration for reasoning-capable Realtime models such as `gpt-realtime-2`. + + - `effort: optional RealtimeReasoningEffort` + + Constrains effort on reasoning for reasoning-capable Realtime models such as + `gpt-realtime-2`. + + - `"minimal"` + + - `"low"` + + - `"medium"` + + - `"high"` + + - `"xhigh"` + - `tool_choice: optional RealtimeToolChoiceConfig` How the model chooses tools. 
Provide one of the string modes or force a specific @@ -21097,6 +22575,9 @@ Returns the created client secret and the effective session object. The client s Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency. + For `gpt-realtime-whisper` transcription sessions, turn detection must be + set to `null`; VAD is not supported. + - `ServerVad object { type, create_response, idle_timeout_ms, 4 more }` Server-side voice activity detection (VAD) which flips on when user speech is detected and off after a period of silence. @@ -21200,23 +22681,19 @@ Returns the created client secret and the effective session object. The client s The session configuration for either a realtime or transcription session. - - `RealtimeSessionCreateResponse object { client_secret, type, audio, 10 more }` + - `RealtimeSessionCreateResponse object { id, object, type, 13 more }` - A new Realtime session configuration, with an ephemeral key. Default TTL - for keys is one minute. - - - `client_secret: RealtimeSessionClientSecret` + A Realtime session configuration object. - Ephemeral key returned by the API. + - `id: string` - - `expires_at: number` + Unique identifier for the session that looks like `sess_1234567890abcdef`. - Timestamp for when the token expires. Currently, all tokens expire - after one minute. + - `object: "realtime.session"` - - `value: string` + The object type. Always `realtime.session`. - Ephemeral key usable in client environments to authenticate connections to the Realtime API. Use this in client-side environments rather than a standard API token, which should only be used server-side. 
+ - `"realtime.session"` - `type: "realtime"` @@ -21284,25 +22761,23 @@ Returns the created client secret and the effective session object. The client s - `"far_field"` - - `transcription: optional AudioTranscription` + - `transcription: optional object { language, model, prompt }` Configuration for input audio transcription, defaults to off and can be set to `null` to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through [the /audio/transcriptions endpoint](/docs/api-reference/audio/createTranscription) and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription, these offer additional guidance to the transcription service. - `language: optional string` - The language of the input audio. Supplying the input language in - [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) (e.g. `en`) format - will improve accuracy and latency. + The language of the input audio. - - `model: optional string or "whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 2 more` + - `model: optional string or "whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 3 more` - The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. + The model used for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. 
- `string` - - `"whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 2 more` + - `"whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 3 more` - The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. + The model used for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. - `"whisper-1"` @@ -21314,12 +22789,11 @@ Returns the created client secret and the effective session object. The client s - `"gpt-4o-transcribe-diarize"` + - `"gpt-realtime-whisper"` + - `prompt: optional string` - An optional text to guide the model's style or continue a previous audio - segment. - For `whisper-1`, the [prompt is a list of keywords](/docs/guides/speech-to-text#prompting). - For `gpt-4o-transcribe` models (excluding `gpt-4o-transcribe-diarize`), the prompt is a free text string, for example "expect words related to technology". + The prompt configured for input audio transcription, when present. - `turn_detection: optional object { type, create_response, idle_timeout_ms, 4 more } or object { type, create_response, eagerness, interrupt_response }` @@ -21329,6 +22803,9 @@ Returns the created client secret and the effective session object. The client s Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. 
This can be useful for more natural conversations, but may have a higher latency. + For `gpt-realtime-whisper` transcription sessions, turn detection must be + set to `null`; VAD is not supported. + - `ServerVad object { type, create_response, idle_timeout_ms, 4 more }` Server-side voice activity detection (VAD) which flips on when user speech is detected and off after a period of silence. @@ -21466,6 +22943,10 @@ Returns the created client secret and the effective session object. The client s - `"cedar"` + - `expires_at: optional number` + + Expiration timestamp for the session, in seconds since epoch. + - `include: optional array of "item.input_audio_transcription.logprobs"` Additional fields to include in server outputs. @@ -21493,13 +22974,13 @@ Returns the created client secret and the effective session object. The client s - `"inf"` - - `model: optional string or "gpt-realtime" or "gpt-realtime-1.5" or "gpt-realtime-2025-08-28" or 13 more` + - `model: optional string or "gpt-realtime" or "gpt-realtime-1.5" or "gpt-realtime-2" or 14 more` The Realtime model used for this session. - `string` - - `"gpt-realtime" or "gpt-realtime-1.5" or "gpt-realtime-2025-08-28" or 13 more` + - `"gpt-realtime" or "gpt-realtime-1.5" or "gpt-realtime-2" or 14 more` The Realtime model used for this session. @@ -21507,6 +22988,8 @@ Returns the created client secret and the effective session object. The client s - `"gpt-realtime-1.5"` + - `"gpt-realtime-2"` + - `"gpt-realtime-2025-08-28"` - `"gpt-4o-realtime-preview"` @@ -21644,6 +23127,25 @@ Returns the created client secret and the effective session object. The client s Optional version of the prompt template. + - `reasoning: optional RealtimeReasoning` + + Configuration for reasoning-capable Realtime models such as `gpt-realtime-2`. + + - `effort: optional RealtimeReasoningEffort` + + Constrains effort on reasoning for reasoning-capable Realtime models such as + `gpt-realtime-2`. 
+ + - `"minimal"` + + - `"low"` + + - `"medium"` + + - `"high"` + + - `"xhigh"` + - `tool_choice: optional ToolChoiceOptions or ToolChoiceFunction or ToolChoiceMcp` How the model chooses tools. Provide one of the string modes or force a specific @@ -21975,15 +23477,45 @@ Returns the created client secret and the effective session object. The client s Type of noise reduction. `near_field` is for close-talking microphones such as headphones, `far_field` is for far-field microphones such as laptop or conference room microphones. - - `transcription: optional AudioTranscription` + - `transcription: optional object { language, model, prompt }` Configuration of the transcription model. + - `language: optional string` + + The language of the input audio. + + - `model: optional string or "whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 3 more` + + The model used for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. + + - `string` + + - `"whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 3 more` + + The model used for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. + + - `"whisper-1"` + + - `"gpt-4o-mini-transcribe"` + + - `"gpt-4o-mini-transcribe-2025-12-15"` + + - `"gpt-4o-transcribe"` + + - `"gpt-4o-transcribe-diarize"` + + - `"gpt-realtime-whisper"` + + - `prompt: optional string` + + The prompt configured for input audio transcription, when present. + - `turn_detection: optional RealtimeTranscriptionSessionTurnDetection` Configuration for turn detection. Can be set to `null` to turn off. Server VAD means that the model will detect the start and end of speech based on - audio volume and respond at the end of user speech. 
+ audio volume and respond at the end of user speech. For `gpt-realtime-whisper`, this must be `null`; VAD is not supported. - `prefix_padding_ms: optional number` @@ -22037,10 +23569,8 @@ curl https://api.openai.com/v1/realtime/client_secrets \ { "expires_at": 0, "session": { - "client_secret": { - "expires_at": 0, - "value": "value" - }, + "id": "id", + "object": "realtime.session", "type": "realtime", "audio": { "input": { @@ -22075,6 +23605,7 @@ curl https://api.openai.com/v1/realtime/client_secrets \ "voice": "ash" } }, + "expires_at": 0, "include": [ "item.input_audio_transcription.logprobs" ], @@ -22091,6 +23622,9 @@ curl https://api.openai.com/v1/realtime/client_secrets \ }, "version": "version" }, + "reasoning": { + "effort": "minimal" + }, "tool_choice": "none", "tools": [ { @@ -22176,40 +23710,21 @@ curl -X POST https://api.openai.com/v1/realtime/client_secrets \ ## Domain Types -### Realtime Session Client Secret - -- `RealtimeSessionClientSecret object { expires_at, value }` - - Ephemeral key returned by the API. - - - `expires_at: number` - - Timestamp for when the token expires. Currently, all tokens expire - after one minute. - - - `value: string` - - Ephemeral key usable in client environments to authenticate connections to the Realtime API. Use this in client-side environments rather than a standard API token, which should only be used server-side. - ### Realtime Session Create Response -- `RealtimeSessionCreateResponse object { client_secret, type, audio, 10 more }` - - A new Realtime session configuration, with an ephemeral key. Default TTL - for keys is one minute. +- `RealtimeSessionCreateResponse object { id, object, type, 13 more }` - - `client_secret: RealtimeSessionClientSecret` + A Realtime session configuration object. - Ephemeral key returned by the API. + - `id: string` - - `expires_at: number` + Unique identifier for the session that looks like `sess_1234567890abcdef`. - Timestamp for when the token expires. 
Currently, all tokens expire - after one minute. + - `object: "realtime.session"` - - `value: string` + The object type. Always `realtime.session`. - Ephemeral key usable in client environments to authenticate connections to the Realtime API. Use this in client-side environments rather than a standard API token, which should only be used server-side. + - `"realtime.session"` - `type: "realtime"` @@ -22277,25 +23792,23 @@ curl -X POST https://api.openai.com/v1/realtime/client_secrets \ - `"far_field"` - - `transcription: optional AudioTranscription` + - `transcription: optional object { language, model, prompt }` Configuration for input audio transcription, defaults to off and can be set to `null` to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through [the /audio/transcriptions endpoint](/docs/api-reference/audio/createTranscription) and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription, these offer additional guidance to the transcription service. - `language: optional string` - The language of the input audio. Supplying the input language in - [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) (e.g. `en`) format - will improve accuracy and latency. + The language of the input audio. - - `model: optional string or "whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 2 more` + - `model: optional string or "whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 3 more` - The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. + The model used for transcription. 
Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. - `string` - - `"whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 2 more` + - `"whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 3 more` - The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. + The model used for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. - `"whisper-1"` @@ -22307,12 +23820,11 @@ curl -X POST https://api.openai.com/v1/realtime/client_secrets \ - `"gpt-4o-transcribe-diarize"` + - `"gpt-realtime-whisper"` + - `prompt: optional string` - An optional text to guide the model's style or continue a previous audio - segment. - For `whisper-1`, the [prompt is a list of keywords](/docs/guides/speech-to-text#prompting). - For `gpt-4o-transcribe` models (excluding `gpt-4o-transcribe-diarize`), the prompt is a free text string, for example "expect words related to technology". + The prompt configured for input audio transcription, when present. - `turn_detection: optional object { type, create_response, idle_timeout_ms, 4 more } or object { type, create_response, eagerness, interrupt_response }` @@ -22322,6 +23834,9 @@ curl -X POST https://api.openai.com/v1/realtime/client_secrets \ Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. 
For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency. + For `gpt-realtime-whisper` transcription sessions, turn detection must be + set to `null`; VAD is not supported. + - `ServerVad object { type, create_response, idle_timeout_ms, 4 more }` Server-side voice activity detection (VAD) which flips on when user speech is detected and off after a period of silence. @@ -22459,6 +23974,10 @@ curl -X POST https://api.openai.com/v1/realtime/client_secrets \ - `"cedar"` + - `expires_at: optional number` + + Expiration timestamp for the session, in seconds since epoch. + - `include: optional array of "item.input_audio_transcription.logprobs"` Additional fields to include in server outputs. @@ -22486,13 +24005,13 @@ curl -X POST https://api.openai.com/v1/realtime/client_secrets \ - `"inf"` - - `model: optional string or "gpt-realtime" or "gpt-realtime-1.5" or "gpt-realtime-2025-08-28" or 13 more` + - `model: optional string or "gpt-realtime" or "gpt-realtime-1.5" or "gpt-realtime-2" or 14 more` The Realtime model used for this session. - `string` - - `"gpt-realtime" or "gpt-realtime-1.5" or "gpt-realtime-2025-08-28" or 13 more` + - `"gpt-realtime" or "gpt-realtime-1.5" or "gpt-realtime-2" or 14 more` The Realtime model used for this session. @@ -22500,6 +24019,8 @@ curl -X POST https://api.openai.com/v1/realtime/client_secrets \ - `"gpt-realtime-1.5"` + - `"gpt-realtime-2"` + - `"gpt-realtime-2025-08-28"` - `"gpt-4o-realtime-preview"` @@ -22637,6 +24158,25 @@ curl -X POST https://api.openai.com/v1/realtime/client_secrets \ Optional version of the prompt template. + - `reasoning: optional RealtimeReasoning` + + Configuration for reasoning-capable Realtime models such as `gpt-realtime-2`. 
+ + - `effort: optional RealtimeReasoningEffort` + + Constrains effort on reasoning for reasoning-capable Realtime models such as + `gpt-realtime-2`. + + - `"minimal"` + + - `"low"` + + - `"medium"` + + - `"high"` + + - `"xhigh"` + - `tool_choice: optional ToolChoiceOptions or ToolChoiceFunction or ToolChoiceMcp` How the model chooses tools. Provide one of the string modes or force a specific @@ -23010,25 +24550,23 @@ curl -X POST https://api.openai.com/v1/realtime/client_secrets \ - `"far_field"` - - `transcription: optional AudioTranscription` + - `transcription: optional object { language, model, prompt }` Configuration of the transcription model. - `language: optional string` - The language of the input audio. Supplying the input language in - [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) (e.g. `en`) format - will improve accuracy and latency. + The language of the input audio. - - `model: optional string or "whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 2 more` + - `model: optional string or "whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 3 more` - The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. + The model used for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. - `string` - - `"whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 2 more` + - `"whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 3 more` - The model to use for transcription. 
Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. + The model used for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. - `"whisper-1"` @@ -23040,18 +24578,17 @@ curl -X POST https://api.openai.com/v1/realtime/client_secrets \ - `"gpt-4o-transcribe-diarize"` + - `"gpt-realtime-whisper"` + - `prompt: optional string` - An optional text to guide the model's style or continue a previous audio - segment. - For `whisper-1`, the [prompt is a list of keywords](/docs/guides/speech-to-text#prompting). - For `gpt-4o-transcribe` models (excluding `gpt-4o-transcribe-diarize`), the prompt is a free text string, for example "expect words related to technology". + The prompt configured for input audio transcription, when present. - `turn_detection: optional RealtimeTranscriptionSessionTurnDetection` Configuration for turn detection. Can be set to `null` to turn off. Server VAD means that the model will detect the start and end of speech based on - audio volume and respond at the end of user speech. + audio volume and respond at the end of user speech. For `gpt-realtime-whisper`, this must be `null`; VAD is not supported. - `prefix_padding_ms: optional number` @@ -23092,7 +24629,7 @@ curl -X POST https://api.openai.com/v1/realtime/client_secrets \ Configuration for turn detection. Can be set to `null` to turn off. Server VAD means that the model will detect the start and end of speech based on - audio volume and respond at the end of user speech. + audio volume and respond at the end of user speech. For `gpt-realtime-whisper`, this must be `null`; VAD is not supported. 
- `prefix_padding_ms: optional number` @@ -23129,23 +24666,19 @@ curl -X POST https://api.openai.com/v1/realtime/client_secrets \ The session configuration for either a realtime or transcription session. - - `RealtimeSessionCreateResponse object { client_secret, type, audio, 10 more }` + - `RealtimeSessionCreateResponse object { id, object, type, 13 more }` - A new Realtime session configuration, with an ephemeral key. Default TTL - for keys is one minute. - - - `client_secret: RealtimeSessionClientSecret` + A Realtime session configuration object. - Ephemeral key returned by the API. + - `id: string` - - `expires_at: number` + Unique identifier for the session that looks like `sess_1234567890abcdef`. - Timestamp for when the token expires. Currently, all tokens expire - after one minute. + - `object: "realtime.session"` - - `value: string` + The object type. Always `realtime.session`. - Ephemeral key usable in client environments to authenticate connections to the Realtime API. Use this in client-side environments rather than a standard API token, which should only be used server-side. + - `"realtime.session"` - `type: "realtime"` @@ -23213,25 +24746,23 @@ curl -X POST https://api.openai.com/v1/realtime/client_secrets \ - `"far_field"` - - `transcription: optional AudioTranscription` + - `transcription: optional object { language, model, prompt }` Configuration for input audio transcription, defaults to off and can be set to `null` to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through [the /audio/transcriptions endpoint](/docs/api-reference/audio/createTranscription) and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription, these offer additional guidance to the transcription service. - `language: optional string` - The language of the input audio. 
Supplying the input language in - [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) (e.g. `en`) format - will improve accuracy and latency. + The language of the input audio. - - `model: optional string or "whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 2 more` + - `model: optional string or "whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 3 more` - The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. + The model used for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. - `string` - - `"whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 2 more` + - `"whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 3 more` - The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. + The model used for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. - `"whisper-1"` @@ -23243,12 +24774,11 @@ curl -X POST https://api.openai.com/v1/realtime/client_secrets \ - `"gpt-4o-transcribe-diarize"` + - `"gpt-realtime-whisper"` + - `prompt: optional string` - An optional text to guide the model's style or continue a previous audio - segment. - For `whisper-1`, the [prompt is a list of keywords](/docs/guides/speech-to-text#prompting). 
- For `gpt-4o-transcribe` models (excluding `gpt-4o-transcribe-diarize`), the prompt is a free text string, for example "expect words related to technology". + The prompt configured for input audio transcription, when present. - `turn_detection: optional object { type, create_response, idle_timeout_ms, 4 more } or object { type, create_response, eagerness, interrupt_response }` @@ -23258,6 +24788,9 @@ curl -X POST https://api.openai.com/v1/realtime/client_secrets \ Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency. + For `gpt-realtime-whisper` transcription sessions, turn detection must be + set to `null`; VAD is not supported. + - `ServerVad object { type, create_response, idle_timeout_ms, 4 more }` Server-side voice activity detection (VAD) which flips on when user speech is detected and off after a period of silence. @@ -23395,6 +24928,10 @@ curl -X POST https://api.openai.com/v1/realtime/client_secrets \ - `"cedar"` + - `expires_at: optional number` + + Expiration timestamp for the session, in seconds since epoch. + - `include: optional array of "item.input_audio_transcription.logprobs"` Additional fields to include in server outputs. @@ -23422,13 +24959,13 @@ curl -X POST https://api.openai.com/v1/realtime/client_secrets \ - `"inf"` - - `model: optional string or "gpt-realtime" or "gpt-realtime-1.5" or "gpt-realtime-2025-08-28" or 13 more` + - `model: optional string or "gpt-realtime" or "gpt-realtime-1.5" or "gpt-realtime-2" or 14 more` The Realtime model used for this session. 
- `string` - - `"gpt-realtime" or "gpt-realtime-1.5" or "gpt-realtime-2025-08-28" or 13 more` + - `"gpt-realtime" or "gpt-realtime-1.5" or "gpt-realtime-2" or 14 more` The Realtime model used for this session. @@ -23436,6 +24973,8 @@ curl -X POST https://api.openai.com/v1/realtime/client_secrets \ - `"gpt-realtime-1.5"` + - `"gpt-realtime-2"` + - `"gpt-realtime-2025-08-28"` - `"gpt-4o-realtime-preview"` @@ -23573,6 +25112,25 @@ curl -X POST https://api.openai.com/v1/realtime/client_secrets \ Optional version of the prompt template. + - `reasoning: optional RealtimeReasoning` + + Configuration for reasoning-capable Realtime models such as `gpt-realtime-2`. + + - `effort: optional RealtimeReasoningEffort` + + Constrains effort on reasoning for reasoning-capable Realtime models such as + `gpt-realtime-2`. + + - `"minimal"` + + - `"low"` + + - `"medium"` + + - `"high"` + + - `"xhigh"` + - `tool_choice: optional ToolChoiceOptions or ToolChoiceFunction or ToolChoiceMcp` How the model chooses tools. Provide one of the string modes or force a specific @@ -23904,15 +25462,45 @@ curl -X POST https://api.openai.com/v1/realtime/client_secrets \ Type of noise reduction. `near_field` is for close-talking microphones such as headphones, `far_field` is for far-field microphones such as laptop or conference room microphones. - - `transcription: optional AudioTranscription` + - `transcription: optional object { language, model, prompt }` Configuration of the transcription model. + - `language: optional string` + + The language of the input audio. + + - `model: optional string or "whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 3 more` + + The model used for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. 
+ + - `string` + + - `"whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 3 more` + + The model used for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. + + - `"whisper-1"` + + - `"gpt-4o-mini-transcribe"` + + - `"gpt-4o-mini-transcribe-2025-12-15"` + + - `"gpt-4o-transcribe"` + + - `"gpt-4o-transcribe-diarize"` + + - `"gpt-realtime-whisper"` + + - `prompt: optional string` + + The prompt configured for input audio transcription, when present. + - `turn_detection: optional RealtimeTranscriptionSessionTurnDetection` Configuration for turn detection. Can be set to `null` to turn off. Server VAD means that the model will detect the start and end of speech based on - audio volume and respond at the end of user speech. + audio volume and respond at the end of user speech. For `gpt-realtime-whisper`, this must be `null`; VAD is not supported. - `prefix_padding_ms: optional number` @@ -24024,17 +25612,33 @@ handle it. Noise reduction filters audio added to the input audio buffer before it is sent to VAD and the model. Filtering the audio can improve VAD and turn detection accuracy (reducing false positives) and model performance by improving perception of the input audio. - - `type: optional NoiseReductionType` + - `type: optional NoiseReductionType` + + Type of noise reduction. `near_field` is for close-talking microphones such as headphones, `far_field` is for far-field microphones such as laptop or conference room microphones. + + - `"near_field"` + + - `"far_field"` + + - `transcription: optional AudioTranscription` + + Configuration for input audio transcription, defaults to off and can be set to `null` to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. 
Transcription runs asynchronously through [the /audio/transcriptions endpoint](/docs/api-reference/audio/createTranscription) and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription, these offer additional guidance to the transcription service. + + - `delay: optional "minimal" or "low" or "medium" or 2 more` + + Controls how long the model waits before emitting transcription text. + Higher values can improve transcription accuracy at the cost of latency. + Only supported with `gpt-realtime-whisper` in GA Realtime sessions. - Type of noise reduction. `near_field` is for close-talking microphones such as headphones, `far_field` is for far-field microphones such as laptop or conference room microphones. + - `"minimal"` - - `"near_field"` + - `"low"` - - `"far_field"` + - `"medium"` - - `transcription: optional AudioTranscription` + - `"high"` - Configuration for input audio transcription, defaults to off and can be set to `null` to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through [the /audio/transcriptions endpoint](/docs/api-reference/audio/createTranscription) and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription, these offer additional guidance to the transcription service. + - `"xhigh"` - `language: optional string` @@ -24042,15 +25646,15 @@ handle it. [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) (e.g. `en`) format will improve accuracy and latency. 
- - `model: optional string or "whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 2 more` + - `model: optional string or "whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 3 more` - The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. + The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. - `string` - - `"whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 2 more` + - `"whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 3 more` - The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. + The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. - `"whisper-1"` @@ -24062,12 +25666,15 @@ handle it. - `"gpt-4o-transcribe-diarize"` + - `"gpt-realtime-whisper"` + - `prompt: optional string` An optional text to guide the model's style or continue a previous audio segment. For `whisper-1`, the [prompt is a list of keywords](/docs/guides/speech-to-text#prompting). 
For `gpt-4o-transcribe` models (excluding `gpt-4o-transcribe-diarize`), the prompt is a free text string, for example "expect words related to technology". + Prompt is not supported with `gpt-realtime-whisper` in GA Realtime sessions. - `turn_detection: optional RealtimeAudioInputTurnDetection` @@ -24077,6 +25684,9 @@ handle it. Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency. + For `gpt-realtime-whisper` transcription sessions, turn detection must be + set to `null`; VAD is not supported. + - `ServerVad object { type, create_response, idle_timeout_ms, 4 more }` Server-side voice activity detection (VAD) which flips on when user speech is detected and off after a period of silence. @@ -24244,13 +25854,13 @@ handle it. - `"inf"` -- `model: optional string or "gpt-realtime" or "gpt-realtime-1.5" or "gpt-realtime-2025-08-28" or 13 more` +- `model: optional string or "gpt-realtime" or "gpt-realtime-1.5" or "gpt-realtime-2" or 14 more` The Realtime model used for this session. - `string` - - `"gpt-realtime" or "gpt-realtime-1.5" or "gpt-realtime-2025-08-28" or 13 more` + - `"gpt-realtime" or "gpt-realtime-1.5" or "gpt-realtime-2" or 14 more` The Realtime model used for this session. @@ -24258,6 +25868,8 @@ handle it. - `"gpt-realtime-1.5"` + - `"gpt-realtime-2"` + - `"gpt-realtime-2025-08-28"` - `"gpt-4o-realtime-preview"` @@ -24296,6 +25908,11 @@ handle it. - `"audio"` +- `parallel_tool_calls: optional boolean` + + Whether the model may call multiple tools in parallel. Only supported by + reasoning Realtime models such as `gpt-realtime-2`. 
+ - `prompt: optional ResponsePrompt` Reference to a prompt template and its variables. @@ -24395,6 +26012,25 @@ handle it. Optional version of the prompt template. +- `reasoning: optional RealtimeReasoning` + + Configuration for reasoning-capable Realtime models such as `gpt-realtime-2`. + + - `effort: optional RealtimeReasoningEffort` + + Constrains effort on reasoning for reasoning-capable Realtime models such as + `gpt-realtime-2`. + + - `"minimal"` + + - `"low"` + + - `"medium"` + + - `"high"` + + - `"xhigh"` + - `tool_choice: optional RealtimeToolChoiceConfig` How the model chooses tools. Provide one of the string modes or force a specific @@ -24811,6 +26447,253 @@ curl -X POST https://api.openai.com/v1/realtime/calls/$CALL_ID/reject \ -d '{"status_code": 486}' ``` +# Translations + +# Client Secrets + +## Create translation client secret + +**post** `/realtime/translations/client_secrets` + +Create a Realtime translation client secret with an associated translation session configuration. + +Client secrets are short-lived tokens that can be passed to a client app, +such as a web frontend or mobile client, which grants access to the Realtime +Translation API without leaking your main API key. You can configure a custom +TTL for each client secret. + +Returns the created client secret and the effective translation session object. +The client secret is a string that looks like `ek_1234`. + +### Body Parameters + +- `session: RealtimeTranslationSessionCreateRequest` + + Realtime translation session configuration. Translation sessions stream source + audio in and translated audio plus transcript deltas out continuously. + + - `model: string` + + The Realtime translation model used for this session. + + - `audio: optional object { input, output }` + + Configuration for translation input and output audio. + + - `input: optional object { noise_reduction, transcription }` + + - `noise_reduction: optional object { type }` + + Optional input noise reduction. 
Set to `null` to disable it. + + - `type: NoiseReductionType` + + Type of noise reduction. `near_field` is for close-talking microphones such as headphones, `far_field` is for far-field microphones such as laptop or conference room microphones. + + - `"near_field"` + + - `"far_field"` + + - `transcription: optional object { model }` + + Optional source-language transcription. When configured, the server emits + `session.input_transcript.delta` events. Translation itself still runs from + the input audio stream. + + - `model: string` + + The transcription model to use for source transcript deltas. + + - `output: optional object { language }` + + - `language: optional string` + + Target language for translated output audio and transcript deltas. + +- `expires_after: optional object { anchor, seconds }` + + Configuration for the client secret expiration. Expiration refers to the time after which + a client secret will no longer be valid for creating sessions. The session itself may + continue after that time once started. A secret can be used to create multiple sessions + until it expires. + + - `anchor: optional "created_at"` + + The anchor point for the client secret expiration, meaning that `seconds` will be added to the `created_at` time of the client secret to produce an expiration timestamp. Only `created_at` is currently supported. + + - `"created_at"` + + - `seconds: optional number` + + The number of seconds from the anchor point to the expiration. Select a value between `10` and `7200` (2 hours). This defaults to 600 seconds (10 minutes) if not specified. + +### Returns + +- `RealtimeTranslationClientSecretCreateResponse object { expires_at, session, value }` + + Response from creating a translation session and client secret for the Realtime API. + + - `expires_at: number` + + Expiration timestamp for the client secret, in seconds since epoch. + + - `session: RealtimeTranslationSession` + + A Realtime translation session.
Translation sessions continuously translate input + audio into the configured output language. + + - `id: string` + + Unique identifier for the session that looks like `sess_1234567890abcdef`. + + - `audio: object { input, output }` + + Configuration for translation input and output audio. + + - `input: optional object { noise_reduction, transcription }` + + - `noise_reduction: optional object { type }` + + Optional input noise reduction. + + - `type: NoiseReductionType` + + Type of noise reduction. `near_field` is for close-talking microphones such as headphones, `far_field` is for far-field microphones such as laptop or conference room microphones. + + - `"near_field"` + + - `"far_field"` + + - `transcription: optional object { model }` + + Optional source-language transcription. When configured, the server emits + `session.input_transcript.delta` events. Translation itself still runs from + the input audio stream. + + - `model: string` + + The transcription model used for source transcript deltas. + + - `output: optional object { language }` + + - `language: optional string` + + Target language for translated output audio and transcript deltas. + + - `expires_at: number` + + Expiration timestamp for the session, in seconds since epoch. + + - `model: string` + + The Realtime translation model used for this session. This field is set at + session creation and cannot be changed with `session.update`. + + - `type: "translation"` + + The session type. Always `translation` for Realtime translation sessions. + + - `"translation"` + + - `value: string` + + The generated client secret value. 
+ +### Example + +```http +curl https://api.openai.com/v1/realtime/translations/client_secrets \ + -H 'Content-Type: application/json' \ + -H "Authorization: Bearer $OPENAI_API_KEY" \ + -d '{ + "session": { + "model": "model" + } + }' +``` + +#### Response + +```json +{ + "expires_at": 0, + "session": { + "id": "id", + "audio": { + "input": { + "noise_reduction": { + "type": "near_field" + }, + "transcription": { + "model": "model" + } + }, + "output": { + "language": "language" + } + }, + "expires_at": 0, + "model": "model", + "type": "translation" + }, + "value": "value" +} +``` + +### Example + +```http +curl -X POST https://api.openai.com/v1/realtime/translations/client_secrets \ + -H "Authorization: Bearer $OPENAI_API_KEY" \ + -H "Content-Type: application/json" \ + -d '{ + "expires_after": { + "anchor": "created_at", + "seconds": 600 + }, + "session": { + "model": "gpt-realtime-translate", + "audio": { + "input": { + "transcription": { + "model": "gpt-realtime-whisper" + }, + "noise_reduction": null + }, + "output": { + "language": "es" + } + } + } + }' +``` + +#### Response + +```json +{ + "value": "ek_68af296e8e408191a1120ab6383263c2", + "expires_at": 1756310470, + "session": { + "id": "sess_C9CiUVUzUzYIssh3ELY1d", + "type": "translation", + "expires_at": 1756310470, + "model": "gpt-realtime-translate", + "audio": { + "input": { + "transcription": { + "model": "gpt-realtime-whisper" + }, + "noise_reduction": null + }, + "output": { + "language": "es" + } + } + } +} +``` + # Sessions ## Create session @@ -25233,25 +27116,23 @@ Returns the created Realtime session object, plus an ephemeral key. - `"far_field"` - - `transcription: optional AudioTranscription` + - `transcription: optional object { language, model, prompt }` Configuration for input audio transcription. - `language: optional string` - The language of the input audio. Supplying the input language in - [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) (e.g. 
`en`) format - will improve accuracy and latency. + The language of the input audio. - - `model: optional string or "whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 2 more` + - `model: optional string or "whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 3 more` - The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. + The model used for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. - `string` - - `"whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 2 more` + - `"whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 3 more` - The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. + The model used for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. - `"whisper-1"` @@ -25263,12 +27144,11 @@ Returns the created Realtime session object, plus an ephemeral key. - `"gpt-4o-transcribe-diarize"` + - `"gpt-realtime-whisper"` + - `prompt: optional string` - An optional text to guide the model's style or continue a previous audio - segment. - For `whisper-1`, the [prompt is a list of keywords](/docs/guides/speech-to-text#prompting). 
- For `gpt-4o-transcribe` models (excluding `gpt-4o-transcribe-diarize`), the prompt is a free text string, for example "expect words related to technology". + The prompt configured for input audio transcription, when present. - `turn_detection: optional object { prefix_padding_ms, silence_duration_ms, threshold, type }` @@ -25655,25 +27535,23 @@ curl -X POST https://api.openai.com/v1/realtime/sessions \ - `"far_field"` - - `transcription: optional AudioTranscription` + - `transcription: optional object { language, model, prompt }` Configuration for input audio transcription. - `language: optional string` - The language of the input audio. Supplying the input language in - [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) (e.g. `en`) format - will improve accuracy and latency. + The language of the input audio. - - `model: optional string or "whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 2 more` + - `model: optional string or "whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 3 more` - The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. + The model used for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. - `string` - - `"whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 2 more` + - `"whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 3 more` - The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. 
Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. + The model used for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. - `"whisper-1"` @@ -25685,12 +27563,11 @@ curl -X POST https://api.openai.com/v1/realtime/sessions \ - `"gpt-4o-transcribe-diarize"` + - `"gpt-realtime-whisper"` + - `prompt: optional string` - An optional text to guide the model's style or continue a previous audio - segment. - For `whisper-1`, the [prompt is a list of keywords](/docs/guides/speech-to-text#prompting). - For `gpt-4o-transcribe` models (excluding `gpt-4o-transcribe-diarize`), the prompt is a free text string, for example "expect words related to technology". + The prompt configured for input audio transcription, when present. - `turn_detection: optional object { prefix_padding_ms, silence_duration_ms, threshold, type }` @@ -25941,21 +27818,37 @@ Returns the created Realtime transcription session object, plus an ephemeral key Configuration for input audio transcription. The client can optionally set the language and prompt for transcription, these offer additional guidance to the transcription service. + - `delay: optional "minimal" or "low" or "medium" or 2 more` + + Controls how long the model waits before emitting transcription text. + Higher values can improve transcription accuracy at the cost of latency. + Only supported with `gpt-realtime-whisper` in GA Realtime sessions. + + - `"minimal"` + + - `"low"` + + - `"medium"` + + - `"high"` + + - `"xhigh"` + - `language: optional string` The language of the input audio. Supplying the input language in [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) (e.g. `en`) format will improve accuracy and latency. 
- - `model: optional string or "whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 2 more` + - `model: optional string or "whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 3 more` - The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. + The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. - `string` - - `"whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 2 more` + - `"whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 3 more` - The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. + The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. - `"whisper-1"` @@ -25967,12 +27860,15 @@ Returns the created Realtime transcription session object, plus an ephemeral key - `"gpt-4o-transcribe-diarize"` + - `"gpt-realtime-whisper"` + - `prompt: optional string` An optional text to guide the model's style or continue a previous audio segment. For `whisper-1`, the [prompt is a list of keywords](/docs/guides/speech-to-text#prompting). 
For `gpt-4o-transcribe` models (excluding `gpt-4o-transcribe-diarize`), the prompt is a free text string, for example "expect words related to technology". + Prompt is not supported with `gpt-realtime-whisper` in GA Realtime sessions. - `turn_detection: optional object { prefix_padding_ms, silence_duration_ms, threshold, type }` @@ -26023,25 +27919,23 @@ Returns the created Realtime transcription session object, plus an ephemeral key The format of input audio. Options are `pcm16`, `g711_ulaw`, or `g711_alaw`. -- `input_audio_transcription: optional AudioTranscription` +- `input_audio_transcription: optional object { language, model, prompt }` Configuration of the transcription model. - `language: optional string` - The language of the input audio. Supplying the input language in - [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) (e.g. `en`) format - will improve accuracy and latency. + The language of the input audio. - - `model: optional string or "whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 2 more` + - `model: optional string or "whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 3 more` - The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. + The model used for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. - `string` - - `"whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 2 more` + - `"whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 3 more` - The model to use for transcription. 
Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. + The model used for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. - `"whisper-1"` @@ -26053,12 +27947,11 @@ Returns the created Realtime transcription session object, plus an ephemeral key - `"gpt-4o-transcribe-diarize"` + - `"gpt-realtime-whisper"` + - `prompt: optional string` - An optional text to guide the model's style or continue a previous audio - segment. - For `whisper-1`, the [prompt is a list of keywords](/docs/guides/speech-to-text#prompting). - For `gpt-4o-transcribe` models (excluding `gpt-4o-transcribe-diarize`), the prompt is a free text string, for example "expect words related to technology". + The prompt configured for input audio transcription, when present. - `modalities: optional array of "text" or "audio"` @@ -26195,25 +28088,23 @@ curl -X POST https://api.openai.com/v1/realtime/transcription_sessions \ The format of input audio. Options are `pcm16`, `g711_ulaw`, or `g711_alaw`. - - `input_audio_transcription: optional AudioTranscription` + - `input_audio_transcription: optional object { language, model, prompt }` Configuration of the transcription model. - `language: optional string` - The language of the input audio. Supplying the input language in - [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) (e.g. `en`) format - will improve accuracy and latency. + The language of the input audio. 
- - `model: optional string or "whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 2 more` + - `model: optional string or "whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 3 more` - The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. + The model used for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. - `string` - - `"whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 2 more` + - `"whisper-1" or "gpt-4o-mini-transcribe" or "gpt-4o-mini-transcribe-2025-12-15" or 3 more` - The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels. + The model used for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. - `"whisper-1"` @@ -26225,12 +28116,11 @@ curl -X POST https://api.openai.com/v1/realtime/transcription_sessions \ - `"gpt-4o-transcribe-diarize"` + - `"gpt-realtime-whisper"` + - `prompt: optional string` - An optional text to guide the model's style or continue a previous audio - segment. - For `whisper-1`, the [prompt is a list of keywords](/docs/guides/speech-to-text#prompting). - For `gpt-4o-transcribe` models (excluding `gpt-4o-transcribe-diarize`), the prompt is a free text string, for example "expect words related to technology". 
+ The prompt configured for input audio transcription, when present. - `modalities: optional array of "text" or "audio"`