Realtime
Domain Types
Audio Transcription
-
class AudioTranscription:
-
Optional<Delay> delay: Controls how long the model waits before emitting transcription text. Higher values can improve transcription accuracy at the cost of latency. Only supported with gpt-realtime-whisper in GA Realtime sessions.
-
MINIMAL("minimal") -
LOW("low") -
MEDIUM("medium") -
HIGH("high") -
XHIGH("xhigh")
-
-
Optional<String> language: The language of the input audio. Supplying the input language in ISO-639-1 format (e.g. en) will improve accuracy and latency.
-
Optional<Model> model: The model to use for transcription. Current options are whisper-1, gpt-4o-mini-transcribe, gpt-4o-mini-transcribe-2025-12-15, gpt-4o-transcribe, gpt-4o-transcribe-diarize, and gpt-realtime-whisper. Use gpt-4o-transcribe-diarize when you need diarization with speaker labels.
-
WHISPER_1("whisper-1") -
GPT_4O_MINI_TRANSCRIBE("gpt-4o-mini-transcribe") -
GPT_4O_MINI_TRANSCRIBE_2025_12_15("gpt-4o-mini-transcribe-2025-12-15") -
GPT_4O_TRANSCRIBE("gpt-4o-transcribe") -
GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize") -
GPT_REALTIME_WHISPER("gpt-realtime-whisper")
-
-
Optional<String> prompt: An optional text to guide the model's style or continue a previous audio segment. For whisper-1, the prompt is a list of keywords. For gpt-4o-transcribe models (excluding gpt-4o-transcribe-diarize), the prompt is a free text string, for example "expect words related to technology". Prompt is not supported with gpt-realtime-whisper in GA Realtime sessions.
-
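The two compatibility notes above (delay only with gpt-realtime-whisper, prompt not with gpt-realtime-whisper) can be checked before a session is configured. The following is a minimal sketch, not SDK code: the class TranscriptionConfigCheck and its method are hypothetical, fields are passed as plain strings, and the "GA Realtime sessions" qualifier is ignored for simplicity.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; it does not use or mirror the SDK's AudioTranscription class. */
final class TranscriptionConfigCheck {
    static final String REALTIME_WHISPER = "gpt-realtime-whisper";

    /**
     * Returns human-readable problems for a (model, delay, prompt) combination.
     * Pass null for any field that is not set.
     */
    static List<String> problems(String model, String delay, String prompt) {
        List<String> out = new ArrayList<>();
        boolean realtimeWhisper = REALTIME_WHISPER.equals(model);
        // delay is documented as only supported with gpt-realtime-whisper.
        if (delay != null && !realtimeWhisper) {
            out.add("delay is only supported with " + REALTIME_WHISPER);
        }
        // prompt is documented as not supported with gpt-realtime-whisper.
        if (prompt != null && realtimeWhisper) {
            out.add("prompt is not supported with " + REALTIME_WHISPER);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(problems("gpt-4o-transcribe", "low", null));
        System.out.println(problems("gpt-realtime-whisper", "low", null));
    }
}
```

An empty list means the combination does not conflict with either note; the server remains the authority on what is actually accepted.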
Conversation Created Event
-
class ConversationCreatedEvent: Returned when a conversation is created. Emitted right after session creation.
-
Conversation conversation: The conversation resource.
-
Optional<String> id: The unique ID of the conversation.
-
Optional<Object> object_: The object type, must be realtime.conversation.
REALTIME_CONVERSATION("realtime.conversation")
-
-
String eventId: The unique ID of the server event.
-
JsonValue; type "conversation.created" (constant): The event type, must be conversation.created.
CONVERSATION_CREATED("conversation.created")
-
Conversation Item
-
class ConversationItem (union): A single item within a Realtime conversation. A class that can be one of several variants.
-
class RealtimeConversationItemSystemMessage: A system message in a Realtime conversation can be used to provide additional context or instructions to the model. This is similar but distinct from the instruction prompt provided at the start of a conversation, as system messages can be added at any point in the conversation. For major changes to the conversation's behavior, use instructions, but for smaller updates (e.g. "the user is now asking about a different topic"), use system messages.
-
List<Content> content: The content of the message.
-
Optional<String> text: The text content.
-
Optional<Type> type: The content type. Always input_text for system messages.
INPUT_TEXT("input_text")
-
-
JsonValue; role "system" (constant): The role of the message sender. Always system.
SYSTEM("system")
-
JsonValue; type "message" (constant): The type of the item. Always message.
MESSAGE("message")
-
Optional<String> id: The unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<Object> object_: Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
REALTIME_ITEM("realtime.item")
-
Optional<Status> status: The status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeConversationItemUserMessage: A user message item in a Realtime conversation.
-
List<Content> content: The content of the message.
-
Optional<String> audio: Base64-encoded audio bytes (for input_audio). These will be parsed as the format specified in the session input audio type configuration. This defaults to PCM 16-bit 24kHz mono if not specified.
-
Optional<Detail> detail: The detail level of the image (for input_image). auto will default to high.
-
AUTO("auto") -
LOW("low") -
HIGH("high")
-
-
Optional<String> imageUrl: Base64-encoded image bytes (for input_image) as a data URI. For example data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA.... Supported formats are PNG and JPEG.
-
Optional<String> text: The text content (for input_text).
-
Optional<String> transcript: Transcript of the audio (for input_audio). This is not sent to the model, but will be attached to the message item for reference.
-
Optional<Type> type: The content type (input_text, input_audio, or input_image).
-
INPUT_TEXT("input_text") -
INPUT_AUDIO("input_audio") -
INPUT_IMAGE("input_image")
-
-
-
JsonValue; role "user" (constant): The role of the message sender. Always user.
USER("user")
-
JsonValue; type "message" (constant): The type of the item. Always message.
MESSAGE("message")
-
Optional<String> id: The unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<Object> object_: Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
REALTIME_ITEM("realtime.item")
-
Optional<Status> status: The status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
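The imageUrl field of a user message expects image bytes wrapped in a base64 data URI. A minimal sketch of building such a value with the Java standard library follows; the class ImageDataUri and its method are hypothetical helpers, and the format check only reflects the "PNG and JPEG" note above.

```java
import java.util.Base64;

/** Hypothetical helper for producing the data URI form described for imageUrl. */
final class ImageDataUri {
    /** Builds a string such as data:image/png;base64,... from raw image bytes. */
    static String of(String mimeType, byte[] imageBytes) {
        // The reference lists PNG and JPEG as the supported formats.
        if (!mimeType.equals("image/png") && !mimeType.equals("image/jpeg")) {
            throw new IllegalArgumentException("Supported formats are PNG and JPEG");
        }
        return "data:" + mimeType + ";base64," + Base64.getEncoder().encodeToString(imageBytes);
    }

    public static void main(String[] args) {
        // The eight-byte PNG file signature, used here as stand-in data.
        byte[] pngSignature = {(byte) 0x89, 'P', 'N', 'G', '\r', '\n', 0x1A, '\n'};
        System.out.println(of("image/png", pngSignature));
    }
}
```

In practice the bytes would come from a complete image file; the signature alone is not a valid image.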
class RealtimeConversationItemAssistantMessage: An assistant message item in a Realtime conversation.
-
List<Content> content: The content of the message.
-
Optional<String> audio: Base64-encoded audio bytes. These will be parsed as the format specified in the session output audio type configuration. This defaults to PCM 16-bit 24kHz mono if not specified.
-
Optional<String> text: The text content.
-
Optional<String> transcript: The transcript of the audio content. This will always be present if the output type is audio.
-
Optional<Type> type: The content type, output_text or output_audio depending on the session output_modalities configuration.
-
OUTPUT_TEXT("output_text") -
OUTPUT_AUDIO("output_audio")
-
-
-
JsonValue; role "assistant" (constant): The role of the message sender. Always assistant.
ASSISTANT("assistant")
-
JsonValue; type "message" (constant): The type of the item. Always message.
MESSAGE("message")
-
Optional<String> id: The unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<Object> object_: Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
REALTIME_ITEM("realtime.item")
-
Optional<Status> status: The status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeConversationItemFunctionCall: A function call item in a Realtime conversation.
-
String arguments: The arguments of the function call. This is a JSON-encoded string representing the arguments passed to the function, for example {"arg1": "value1", "arg2": 42}.
-
String name: The name of the function being called.
-
JsonValue; type "function_call" (constant): The type of the item. Always function_call.
FUNCTION_CALL("function_call")
-
Optional<String> id: The unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<String> callId: The ID of the function call.
-
Optional<Object> object_: Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
REALTIME_ITEM("realtime.item")
-
Optional<Status> status: The status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeConversationItemFunctionCallOutput: A function call output item in a Realtime conversation.
-
String callId: The ID of the function call this output is for.
-
String output: The output of the function call. This is free text and can contain any information or simply be empty.
-
JsonValue; type "function_call_output" (constant): The type of the item. Always function_call_output.
FUNCTION_CALL_OUTPUT("function_call_output")
-
Optional<String> id: The unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<Object> object_: Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
REALTIME_ITEM("realtime.item")
-
Optional<Status> status: The status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
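A function call item and its function call output item are tied together by callId. The sketch below shows one way an application might route a call to a local handler and produce the matching output. The classes here are hypothetical stand-ins that only borrow the field names from the reference, and the arguments are handed to the handler as the raw JSON-encoded string rather than parsed.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical dispatcher; FunctionCall and FunctionCallOutput are not the SDK classes. */
final class FunctionCallDispatch {
    static final class FunctionCall {
        final String callId;
        final String name;
        final String arguments; // JSON-encoded string, passed through unparsed

        FunctionCall(String callId, String name, String arguments) {
            this.callId = callId;
            this.name = name;
            this.arguments = arguments;
        }
    }

    static final class FunctionCallOutput {
        final String callId;
        final String output; // free text, may be empty

        FunctionCallOutput(String callId, String output) {
            this.callId = callId;
            this.output = output;
        }
    }

    private final Map<String, Function<String, String>> handlers = new HashMap<>();

    void register(String name, Function<String, String> handler) {
        handlers.put(name, handler);
    }

    /** Runs the handler registered for call.name and echoes call.callId on the output. */
    FunctionCallOutput handle(FunctionCall call) {
        Function<String, String> handler = handlers.get(call.name);
        String output = (handler == null)
                ? "unknown function: " + call.name
                : handler.apply(call.arguments);
        return new FunctionCallOutput(call.callId, output);
    }

    public static void main(String[] args) {
        FunctionCallDispatch dispatch = new FunctionCallDispatch();
        dispatch.register("echo", rawArgs -> rawArgs);
        FunctionCallOutput out = dispatch.handle(
                new FunctionCall("call_1", "echo", "{\"arg1\": \"value1\", \"arg2\": 42}"));
        System.out.println(out.callId + " -> " + out.output);
    }
}
```

Reporting an unknown function as text in the output is a design choice of this sketch; an application could equally raise an error instead.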
class RealtimeMcpApprovalResponse: A Realtime item responding to an MCP approval request.
-
String id: The unique ID of the approval response.
-
String approvalRequestId: The ID of the approval request being answered.
-
boolean approve: Whether the request was approved.
-
JsonValue; type "mcp_approval_response" (constant): The type of the item. Always mcp_approval_response.
MCP_APPROVAL_RESPONSE("mcp_approval_response")
-
Optional<String> reason: Optional reason for the decision.
-
-
class RealtimeMcpListTools: A Realtime item listing tools available on an MCP server.
-
String serverLabel: The label of the MCP server.
-
List<Tool> tools: The tools available on the server.
-
JsonValue inputSchema: The JSON schema describing the tool's input.
-
String name: The name of the tool.
-
Optional<JsonValue> annotations: Additional annotations about the tool.
-
Optional<String> description: The description of the tool.
-
-
JsonValue; type "mcp_list_tools" (constant): The type of the item. Always mcp_list_tools.
MCP_LIST_TOOLS("mcp_list_tools")
-
Optional<String> id: The unique ID of the list.
-
-
class RealtimeMcpToolCall: A Realtime item representing an invocation of a tool on an MCP server.
-
String id: The unique ID of the tool call.
-
String arguments: A JSON string of the arguments passed to the tool.
-
String name: The name of the tool that was run.
-
String serverLabel: The label of the MCP server running the tool.
-
JsonValue; type "mcp_call" (constant): The type of the item. Always mcp_call.
MCP_CALL("mcp_call")
-
Optional<String> approvalRequestId: The ID of an associated approval request, if any.
-
Optional<Error> error: The error from the tool call, if any.
-
class RealtimeMcpProtocolError:
-
long code
-
String message
-
JsonValue; type "protocol_error" (constant)
PROTOCOL_ERROR("protocol_error")
-
-
class RealtimeMcpToolExecutionError:
-
String message
-
JsonValue; type "tool_execution_error" (constant)
TOOL_EXECUTION_ERROR("tool_execution_error")
-
-
class RealtimeMcphttpError:
-
long code
-
String message
-
JsonValue; type "http_error" (constant)
HTTP_ERROR("http_error")
-
-
-
Optional<String> output: The output from the tool call.
-
-
class RealtimeMcpApprovalRequest: A Realtime item requesting human approval of a tool invocation.
-
String id: The unique ID of the approval request.
-
String arguments: A JSON string of arguments for the tool.
-
String name: The name of the tool to run.
-
String serverLabel: The label of the MCP server making the request.
-
JsonValue; type "mcp_approval_request" (constant): The type of the item. Always mcp_approval_request.
MCP_APPROVAL_REQUEST("mcp_approval_request")
-
-
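An approval response answers an approval request by carrying the request's id in approvalRequestId, together with the approve flag and an optional reason. The sketch below illustrates that pairing with a simple allowlist policy. The classes and the allowlist are assumptions made for illustration, the response's own id field is omitted, and a real application would typically ask a human rather than decide automatically.

```java
import java.util.Set;

/** Hypothetical approval policy; the nested classes only borrow field names from the reference. */
final class McpApproval {
    static final class ApprovalRequest {
        final String id;
        final String serverLabel;
        final String name;
        final String arguments;

        ApprovalRequest(String id, String serverLabel, String name, String arguments) {
            this.id = id;
            this.serverLabel = serverLabel;
            this.name = name;
            this.arguments = arguments;
        }
    }

    static final class ApprovalResponse {
        final String approvalRequestId;
        final boolean approve;
        final String reason; // null when no reason is given

        ApprovalResponse(String approvalRequestId, boolean approve, String reason) {
            this.approvalRequestId = approvalRequestId;
            this.approve = approve;
            this.reason = reason;
        }
    }

    /** Approves the request only if the tool name is on the allowlist. */
    static ApprovalResponse decide(ApprovalRequest request, Set<String> allowedTools) {
        boolean approve = allowedTools.contains(request.name);
        String reason = approve ? null : "tool '" + request.name + "' is not on the allowlist";
        return new ApprovalResponse(request.id, approve, reason);
    }

    public static void main(String[] args) {
        ApprovalRequest request = new ApprovalRequest("req_1", "docs", "search", "{}");
        ApprovalResponse response = decide(request, Set.of("search"));
        System.out.println(response.approvalRequestId + " approved=" + response.approve);
    }
}
```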
Conversation Item Added
-
class ConversationItemAdded: Sent by the server when an Item is added to the default Conversation. This can happen in several cases:
- When the client sends a conversation.item.create event.
- When the input audio buffer is committed. In this case the item will be a user message containing the audio from the buffer.
- When the model is generating a Response. In this case the conversation.item.added event will be sent when the model starts generating a specific Item, and thus it will not yet have any content (and status will be in_progress).
The event will include the full content of the Item (except when the model is generating a Response), apart from audio data, which can be retrieved separately with a conversation.item.retrieve event if necessary.
-
String eventId: The unique ID of the server event.
-
ConversationItem item: A single item within a Realtime conversation.
-
class RealtimeConversationItemSystemMessage:A system message in a Realtime conversation can be used to provide additional context or instructions to the model. This is similar but distinct from the instruction prompt provided at the start of a conversation, as system messages can be added at any point in the conversation. For major changes to the conversation's behavior, use instructions, but for smaller updates (e.g. "the user is now asking about a different topic"), use system messages.
-
List<Content> contentThe content of the message.
-
Optional<String> textThe text content.
-
Optional<Type> typeThe content type. Always
input_textfor system messages.INPUT_TEXT("input_text")
-
-
JsonValue; role "system"constantThe role of the message sender. Always
system.SYSTEM("system")
-
JsonValue; type "message"constantThe type of the item. Always
message.MESSAGE("message")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeConversationItemUserMessage:A user message item in a Realtime conversation.
-
List<Content> contentThe content of the message.
-
Optional<String> audio: Base64-encoded audio bytes (for input_audio). These will be parsed as the format specified in the session input audio type configuration. This defaults to PCM 16-bit 24kHz mono if not specified.
-
Optional<Detail> detail: The detail level of the image (for input_image). auto will default to high.
-
AUTO("auto") -
LOW("low") -
HIGH("high")
-
-
Optional<String> imageUrl: Base64-encoded image bytes (for input_image) as a data URI. For example data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA.... Supported formats are PNG and JPEG.
-
Optional<String> text: The text content (for input_text).
-
Optional<String> transcript: Transcript of the audio (for input_audio). This is not sent to the model, but will be attached to the message item for reference.
-
Optional<Type> type: The content type (input_text, input_audio, or input_image).
-
INPUT_TEXT("input_text") -
INPUT_AUDIO("input_audio") -
INPUT_IMAGE("input_image")
-
-
-
JsonValue; role "user"constantThe role of the message sender. Always
user.USER("user")
-
JsonValue; type "message"constantThe type of the item. Always
message.MESSAGE("message")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeConversationItemAssistantMessage:An assistant message item in a Realtime conversation.
-
List<Content> contentThe content of the message.
-
Optional<String> audio: Base64-encoded audio bytes. These will be parsed as the format specified in the session output audio type configuration. This defaults to PCM 16-bit 24kHz mono if not specified.
-
Optional<String> text: The text content.
-
Optional<String> transcript: The transcript of the audio content. This will always be present if the output type is audio.
-
Optional<Type> type: The content type, output_text or output_audio depending on the session output_modalities configuration.
-
OUTPUT_TEXT("output_text") -
OUTPUT_AUDIO("output_audio")
-
-
-
JsonValue; role "assistant"constantThe role of the message sender. Always
assistant.ASSISTANT("assistant")
-
JsonValue; type "message"constantThe type of the item. Always
message.MESSAGE("message")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeConversationItemFunctionCall:A function call item in a Realtime conversation.
-
String arguments: The arguments of the function call. This is a JSON-encoded string representing the arguments passed to the function, for example {"arg1": "value1", "arg2": 42}.
-
String nameThe name of the function being called.
-
JsonValue; type "function_call"constantThe type of the item. Always
function_call.FUNCTION_CALL("function_call")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<String> callIdThe ID of the function call.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeConversationItemFunctionCallOutput:A function call output item in a Realtime conversation.
-
String callIdThe ID of the function call this output is for.
-
String outputThe output of the function call, this is free text and can contain any information or simply be empty.
-
JsonValue; type "function_call_output"constantThe type of the item. Always
function_call_output.FUNCTION_CALL_OUTPUT("function_call_output")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeMcpApprovalResponse:A Realtime item responding to an MCP approval request.
-
String idThe unique ID of the approval response.
-
String approvalRequestIdThe ID of the approval request being answered.
-
boolean approveWhether the request was approved.
-
JsonValue; type "mcp_approval_response"constantThe type of the item. Always
mcp_approval_response.MCP_APPROVAL_RESPONSE("mcp_approval_response")
-
Optional<String> reasonOptional reason for the decision.
-
-
class RealtimeMcpListTools:A Realtime item listing tools available on an MCP server.
-
String serverLabelThe label of the MCP server.
-
List<Tool> toolsThe tools available on the server.
-
JsonValue inputSchemaThe JSON schema describing the tool's input.
-
String nameThe name of the tool.
-
Optional<JsonValue> annotationsAdditional annotations about the tool.
-
Optional<String> descriptionThe description of the tool.
-
-
JsonValue; type "mcp_list_tools"constantThe type of the item. Always
mcp_list_tools.MCP_LIST_TOOLS("mcp_list_tools")
-
Optional<String> idThe unique ID of the list.
-
-
class RealtimeMcpToolCall:A Realtime item representing an invocation of a tool on an MCP server.
-
String idThe unique ID of the tool call.
-
String argumentsA JSON string of the arguments passed to the tool.
-
String nameThe name of the tool that was run.
-
String serverLabelThe label of the MCP server running the tool.
-
JsonValue; type "mcp_call"constantThe type of the item. Always
mcp_call.MCP_CALL("mcp_call")
-
Optional<String> approvalRequestIdThe ID of an associated approval request, if any.
-
Optional<Error> errorThe error from the tool call, if any.
-
class RealtimeMcpProtocolError:-
long code -
String message -
JsonValue; type "protocol_error"constantPROTOCOL_ERROR("protocol_error")
-
-
class RealtimeMcpToolExecutionError:-
String message -
JsonValue; type "tool_execution_error"constantTOOL_EXECUTION_ERROR("tool_execution_error")
-
-
class RealtimeMcphttpError:-
long code -
String message -
JsonValue; type "http_error"constantHTTP_ERROR("http_error")
-
-
-
Optional<String> outputThe output from the tool call.
-
-
class RealtimeMcpApprovalRequest:A Realtime item requesting human approval of a tool invocation.
-
String idThe unique ID of the approval request.
-
String argumentsA JSON string of arguments for the tool.
-
String nameThe name of the tool to run.
-
String serverLabelThe label of the MCP server making the request.
-
JsonValue; type "mcp_approval_request"constantThe type of the item. Always
mcp_approval_request.MCP_APPROVAL_REQUEST("mcp_approval_request")
-
-
-
JsonValue; type "conversation.item.added" (constant): The event type, must be conversation.item.added.
CONVERSATION_ITEM_ADDED("conversation.item.added")
-
Optional<String> previousItemId: The ID of the item that precedes this one, if any. This is used to maintain ordering when items are inserted.
-
Conversation Item Create Event
-
class ConversationItemCreateEvent: Add a new Item to the Conversation's context, including messages, function calls, and function call responses. This event can be used both to populate a "history" of the conversation and to add new items mid-stream, but has the current limitation that it cannot populate assistant audio messages.
If successful, the server will respond with a conversation.item.created event, otherwise an error event will be sent.
-
ConversationItem item: A single item within a Realtime conversation.
-
class RealtimeConversationItemSystemMessage:A system message in a Realtime conversation can be used to provide additional context or instructions to the model. This is similar but distinct from the instruction prompt provided at the start of a conversation, as system messages can be added at any point in the conversation. For major changes to the conversation's behavior, use instructions, but for smaller updates (e.g. "the user is now asking about a different topic"), use system messages.
-
List<Content> contentThe content of the message.
-
Optional<String> textThe text content.
-
Optional<Type> typeThe content type. Always
input_textfor system messages.INPUT_TEXT("input_text")
-
-
JsonValue; role "system"constantThe role of the message sender. Always
system.SYSTEM("system")
-
JsonValue; type "message"constantThe type of the item. Always
message.MESSAGE("message")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeConversationItemUserMessage:A user message item in a Realtime conversation.
-
List<Content> contentThe content of the message.
-
Optional<String> audio: Base64-encoded audio bytes (for input_audio). These will be parsed as the format specified in the session input audio type configuration. This defaults to PCM 16-bit 24kHz mono if not specified.
-
Optional<Detail> detail: The detail level of the image (for input_image). auto will default to high.
-
AUTO("auto") -
LOW("low") -
HIGH("high")
-
-
Optional<String> imageUrl: Base64-encoded image bytes (for input_image) as a data URI. For example data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA.... Supported formats are PNG and JPEG.
-
Optional<String> text: The text content (for input_text).
-
Optional<String> transcript: Transcript of the audio (for input_audio). This is not sent to the model, but will be attached to the message item for reference.
-
Optional<Type> type: The content type (input_text, input_audio, or input_image).
-
INPUT_TEXT("input_text") -
INPUT_AUDIO("input_audio") -
INPUT_IMAGE("input_image")
-
-
-
JsonValue; role "user"constantThe role of the message sender. Always
user.USER("user")
-
JsonValue; type "message"constantThe type of the item. Always
message.MESSAGE("message")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeConversationItemAssistantMessage:An assistant message item in a Realtime conversation.
-
List<Content> contentThe content of the message.
-
Optional<String> audio: Base64-encoded audio bytes. These will be parsed as the format specified in the session output audio type configuration. This defaults to PCM 16-bit 24kHz mono if not specified.
-
Optional<String> text: The text content.
-
Optional<String> transcript: The transcript of the audio content. This will always be present if the output type is audio.
-
Optional<Type> type: The content type, output_text or output_audio depending on the session output_modalities configuration.
-
OUTPUT_TEXT("output_text") -
OUTPUT_AUDIO("output_audio")
-
-
-
JsonValue; role "assistant"constantThe role of the message sender. Always
assistant.ASSISTANT("assistant")
-
JsonValue; type "message"constantThe type of the item. Always
message.MESSAGE("message")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeConversationItemFunctionCall:A function call item in a Realtime conversation.
-
String arguments: The arguments of the function call. This is a JSON-encoded string representing the arguments passed to the function, for example {"arg1": "value1", "arg2": 42}.
-
String nameThe name of the function being called.
-
JsonValue; type "function_call"constantThe type of the item. Always
function_call.FUNCTION_CALL("function_call")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<String> callIdThe ID of the function call.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeConversationItemFunctionCallOutput:A function call output item in a Realtime conversation.
-
String callIdThe ID of the function call this output is for.
-
String outputThe output of the function call, this is free text and can contain any information or simply be empty.
-
JsonValue; type "function_call_output"constantThe type of the item. Always
function_call_output.FUNCTION_CALL_OUTPUT("function_call_output")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeMcpApprovalResponse:A Realtime item responding to an MCP approval request.
-
String idThe unique ID of the approval response.
-
String approvalRequestIdThe ID of the approval request being answered.
-
boolean approveWhether the request was approved.
-
JsonValue; type "mcp_approval_response"constantThe type of the item. Always
mcp_approval_response.MCP_APPROVAL_RESPONSE("mcp_approval_response")
-
Optional<String> reasonOptional reason for the decision.
-
-
class RealtimeMcpListTools:A Realtime item listing tools available on an MCP server.
-
String serverLabelThe label of the MCP server.
-
List<Tool> toolsThe tools available on the server.
-
JsonValue inputSchemaThe JSON schema describing the tool's input.
-
String nameThe name of the tool.
-
Optional<JsonValue> annotationsAdditional annotations about the tool.
-
Optional<String> descriptionThe description of the tool.
-
-
JsonValue; type "mcp_list_tools"constantThe type of the item. Always
mcp_list_tools.MCP_LIST_TOOLS("mcp_list_tools")
-
Optional<String> idThe unique ID of the list.
-
-
class RealtimeMcpToolCall:A Realtime item representing an invocation of a tool on an MCP server.
-
String idThe unique ID of the tool call.
-
String argumentsA JSON string of the arguments passed to the tool.
-
String nameThe name of the tool that was run.
-
String serverLabelThe label of the MCP server running the tool.
-
JsonValue; type "mcp_call"constantThe type of the item. Always
mcp_call.MCP_CALL("mcp_call")
-
Optional<String> approvalRequestIdThe ID of an associated approval request, if any.
-
Optional<Error> errorThe error from the tool call, if any.
-
class RealtimeMcpProtocolError:-
long code -
String message -
JsonValue; type "protocol_error"constantPROTOCOL_ERROR("protocol_error")
-
-
class RealtimeMcpToolExecutionError:-
String message -
JsonValue; type "tool_execution_error"constantTOOL_EXECUTION_ERROR("tool_execution_error")
-
-
class RealtimeMcphttpError:-
long code -
String message -
JsonValue; type "http_error"constantHTTP_ERROR("http_error")
-
-
-
Optional<String> outputThe output from the tool call.
-
-
class RealtimeMcpApprovalRequest:A Realtime item requesting human approval of a tool invocation.
-
String idThe unique ID of the approval request.
-
String argumentsA JSON string of arguments for the tool.
-
String nameThe name of the tool to run.
-
String serverLabelThe label of the MCP server making the request.
-
JsonValue; type "mcp_approval_request"constantThe type of the item. Always
mcp_approval_request.MCP_APPROVAL_REQUEST("mcp_approval_request")
-
-
-
JsonValue; type "conversation.item.create" (constant): The event type, must be conversation.item.create.
CONVERSATION_ITEM_CREATE("conversation.item.create")
-
Optional<String> eventId: Optional client-generated ID used to identify this event.
-
Optional<String> previousItemId: The ID of the preceding item after which the new item will be inserted. If not set, the new item will be appended to the end of the conversation.
If set to root, the new item will be added to the beginning of the conversation.
If set to an existing ID, it allows an item to be inserted mid-conversation. If the ID cannot be found, an error will be returned and the item will not be added.
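The four previousItemId cases described above (unset, root, existing ID, unknown ID) can be modeled on a local list of item IDs, which may be useful when a client mirrors conversation order. This is a minimal sketch under that assumption: ItemOrdering is a hypothetical helper, and the exception merely stands in for the error event the server would return.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical client-side model of where a new item lands for a given previousItemId. */
final class ItemOrdering {
    /** Returns a new list with newItemId placed according to previousItemId. */
    static List<String> insert(List<String> itemIds, String newItemId, String previousItemId) {
        List<String> result = new ArrayList<>(itemIds);
        if (previousItemId == null) {
            // Not set: append to the end of the conversation.
            result.add(newItemId);
        } else if (previousItemId.equals("root")) {
            // "root": add to the beginning of the conversation.
            result.add(0, newItemId);
        } else {
            int index = result.indexOf(previousItemId);
            if (index < 0) {
                // Unknown ID: the item is not added (the server reports an error).
                throw new IllegalArgumentException("unknown previousItemId: " + previousItemId);
            }
            // Existing ID: insert directly after it.
            result.add(index + 1, newItemId);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> items = List.of("item_a", "item_b");
        System.out.println(insert(items, "item_x", null));
        System.out.println(insert(items, "item_x", "root"));
        System.out.println(insert(items, "item_x", "item_a"));
    }
}
```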
-
Conversation Item Created Event
-
class ConversationItemCreatedEvent: Returned when a conversation item is created. There are several scenarios that produce this event:
- The server is generating a Response, which if successful will produce either one or two Items, which will be of type message (role assistant) or type function_call.
- The input audio buffer has been committed, either by the client or the server (in server_vad mode). The server will take the content of the input audio buffer and add it to a new user message Item.
- The client has sent a conversation.item.create event to add a new Item to the Conversation.
-
String eventId: The unique ID of the server event.
-
ConversationItem item: A single item within a Realtime conversation.
-
class RealtimeConversationItemSystemMessage:A system message in a Realtime conversation can be used to provide additional context or instructions to the model. This is similar but distinct from the instruction prompt provided at the start of a conversation, as system messages can be added at any point in the conversation. For major changes to the conversation's behavior, use instructions, but for smaller updates (e.g. "the user is now asking about a different topic"), use system messages.
-
List<Content> contentThe content of the message.
-
Optional<String> textThe text content.
-
Optional<Type> typeThe content type. Always
input_textfor system messages.INPUT_TEXT("input_text")
-
-
JsonValue; role "system"constantThe role of the message sender. Always
system.SYSTEM("system")
-
JsonValue; type "message"constantThe type of the item. Always
message.MESSAGE("message")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeConversationItemUserMessage:A user message item in a Realtime conversation.
-
List<Content> contentThe content of the message.
-
Optional<String> audioBase64-encoded audio bytes (for
input_audio), these will be parsed as the format specified in the session input audio type configuration. This defaults to PCM 16-bit 24kHz mono if not specified. -
Optional<Detail> detailThe detail level of the image (for
input_image).autowill default tohigh.-
AUTO("auto") -
LOW("low") -
HIGH("high")
-
-
Optional<String> imageUrlBase64-encoded image bytes (for
input_image) as a data URI. For exampledata:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA.... Supported formats are PNG and JPEG. -
Optional<String> textThe text content (for
input_text). -
Optional<String> transcriptTranscript of the audio (for
input_audio). This is not sent to the model, but will be attached to the message item for reference. -
Optional<Type> typeThe content type (
input_text,input_audio, orinput_image).-
INPUT_TEXT("input_text") -
INPUT_AUDIO("input_audio") -
INPUT_IMAGE("input_image")
-
-
-
JsonValue; role "user"constantThe role of the message sender. Always
user.USER("user")
-
JsonValue; type "message"constantThe type of the item. Always
message.MESSAGE("message")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeConversationItemAssistantMessage:An assistant message item in a Realtime conversation.
-
List<Content> contentThe content of the message.
-
Optional<String> audioBase64-encoded audio bytes, these will be parsed as the format specified in the session output audio type configuration. This defaults to PCM 16-bit 24kHz mono if not specified.
-
Optional<String> textThe text content.
-
Optional<String> transcriptThe transcript of the audio content, this will always be present if the output type is
audio. -
Optional<Type> typeThe content type,
output_textoroutput_audiodepending on the sessionoutput_modalitiesconfiguration.-
OUTPUT_TEXT("output_text") -
OUTPUT_AUDIO("output_audio")
-
-
-
JsonValue; role "assistant"constantThe role of the message sender. Always
assistant.ASSISTANT("assistant")
-
JsonValue; type "message"constantThe type of the item. Always
message.MESSAGE("message")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeConversationItemFunctionCall:A function call item in a Realtime conversation.
-
String argumentsThe arguments of the function call. This is a JSON-encoded string representing the arguments passed to the function, for example
{"arg1": "value1", "arg2": 42}. -
String nameThe name of the function being called.
-
JsonValue; type "function_call"constantThe type of the item. Always
function_call.FUNCTION_CALL("function_call")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<String> callIdThe ID of the function call.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeConversationItemFunctionCallOutput:A function call output item in a Realtime conversation.
-
String callIdThe ID of the function call this output is for.
-
String outputThe output of the function call. This is free text and can contain any information or simply be empty.
-
JsonValue; type "function_call_output"constantThe type of the item. Always
function_call_output.FUNCTION_CALL_OUTPUT("function_call_output")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeMcpApprovalResponse:A Realtime item responding to an MCP approval request.
-
String idThe unique ID of the approval response.
-
String approvalRequestIdThe ID of the approval request being answered.
-
boolean approveWhether the request was approved.
-
JsonValue; type "mcp_approval_response"constantThe type of the item. Always
mcp_approval_response.MCP_APPROVAL_RESPONSE("mcp_approval_response")
-
Optional<String> reasonOptional reason for the decision.
-
-
class RealtimeMcpListTools:A Realtime item listing tools available on an MCP server.
-
String serverLabelThe label of the MCP server.
-
List<Tool> toolsThe tools available on the server.
-
JsonValue inputSchemaThe JSON schema describing the tool's input.
-
String nameThe name of the tool.
-
Optional<JsonValue> annotationsAdditional annotations about the tool.
-
Optional<String> descriptionThe description of the tool.
-
-
JsonValue; type "mcp_list_tools"constantThe type of the item. Always
mcp_list_tools.MCP_LIST_TOOLS("mcp_list_tools")
-
Optional<String> idThe unique ID of the list.
-
-
class RealtimeMcpToolCall:A Realtime item representing an invocation of a tool on an MCP server.
-
String idThe unique ID of the tool call.
-
String argumentsA JSON string of the arguments passed to the tool.
-
String nameThe name of the tool that was run.
-
String serverLabelThe label of the MCP server running the tool.
-
JsonValue; type "mcp_call"constantThe type of the item. Always
mcp_call.MCP_CALL("mcp_call")
-
Optional<String> approvalRequestIdThe ID of an associated approval request, if any.
-
Optional<Error> errorThe error from the tool call, if any.
-
class RealtimeMcpProtocolError:-
long code -
String message -
JsonValue; type "protocol_error"constantPROTOCOL_ERROR("protocol_error")
-
-
class RealtimeMcpToolExecutionError:-
String message -
JsonValue; type "tool_execution_error"constantTOOL_EXECUTION_ERROR("tool_execution_error")
-
-
class RealtimeMcphttpError:-
long code -
String message -
JsonValue; type "http_error"constantHTTP_ERROR("http_error")
-
-
-
Optional<String> outputThe output from the tool call.
-
-
class RealtimeMcpApprovalRequest:A Realtime item requesting human approval of a tool invocation.
-
String idThe unique ID of the approval request.
-
String argumentsA JSON string of arguments for the tool.
-
String nameThe name of the tool to run.
-
String serverLabelThe label of the MCP server making the request.
-
JsonValue; type "mcp_approval_request"constantThe type of the item. Always
mcp_approval_request.MCP_APPROVAL_REQUEST("mcp_approval_request")
-
-
-
JsonValue; type "conversation.item.created"constantThe event type, must be
conversation.item.created.CONVERSATION_ITEM_CREATED("conversation.item.created")
-
Optional<String> previousItemIdThe ID of the preceding item in the Conversation context, allows the client to understand the order of the conversation. Can be
nullif the item has no predecessor.
-
Conversation Item Delete Event
-
class ConversationItemDeleteEvent:Send this event when you want to remove any item from the conversation history. The server will respond with a
conversation.item.deletedevent, unless the item does not exist in the conversation history, in which case the server will respond with an error.-
String itemIdThe ID of the item to delete.
-
JsonValue; type "conversation.item.delete"constantThe event type, must be
conversation.item.delete.CONVERSATION_ITEM_DELETE("conversation.item.delete")
-
Optional<String> eventIdOptional client-generated ID used to identify this event.
-
Conversation Item Deleted Event
-
class ConversationItemDeletedEvent:Returned when an item in the conversation is deleted by the client with a
conversation.item.deleteevent. This event is used to synchronize the server's understanding of the conversation history with the client's view.-
String eventIdThe unique ID of the server event.
-
String itemIdThe ID of the item that was deleted.
-
JsonValue; type "conversation.item.deleted"constantThe event type, must be
conversation.item.deleted.CONVERSATION_ITEM_DELETED("conversation.item.deleted")
-
Conversation Item Done
-
class ConversationItemDone:Returned when a conversation item is finalized.
The event will include the full content of the Item except for audio data, which can be retrieved separately with a
conversation.item.retrieveevent if needed.-
String eventIdThe unique ID of the server event.
-
ConversationItem itemA single item within a Realtime conversation.
-
class RealtimeConversationItemSystemMessage:A system message in a Realtime conversation can be used to provide additional context or instructions to the model. This is similar but distinct from the instruction prompt provided at the start of a conversation, as system messages can be added at any point in the conversation. For major changes to the conversation's behavior, use instructions, but for smaller updates (e.g. "the user is now asking about a different topic"), use system messages.
-
List<Content> contentThe content of the message.
-
Optional<String> textThe text content.
-
Optional<Type> typeThe content type. Always
input_textfor system messages.INPUT_TEXT("input_text")
-
-
JsonValue; role "system"constantThe role of the message sender. Always
system.SYSTEM("system")
-
JsonValue; type "message"constantThe type of the item. Always
message.MESSAGE("message")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeConversationItemUserMessage:A user message item in a Realtime conversation.
-
List<Content> contentThe content of the message.
-
Optional<String> audioBase64-encoded audio bytes (for
input_audio), these will be parsed as the format specified in the session input audio type configuration. This defaults to PCM 16-bit 24kHz mono if not specified. -
Optional<Detail> detailThe detail level of the image (for
input_image).autowill default tohigh.-
AUTO("auto") -
LOW("low") -
HIGH("high")
-
-
Optional<String> imageUrlBase64-encoded image bytes (for
input_image) as a data URI. For exampledata:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA.... Supported formats are PNG and JPEG. -
Optional<String> textThe text content (for
input_text). -
Optional<String> transcriptTranscript of the audio (for
input_audio). This is not sent to the model, but will be attached to the message item for reference. -
Optional<Type> typeThe content type (
input_text,input_audio, orinput_image).-
INPUT_TEXT("input_text") -
INPUT_AUDIO("input_audio") -
INPUT_IMAGE("input_image")
-
-
-
JsonValue; role "user"constantThe role of the message sender. Always
user.USER("user")
-
JsonValue; type "message"constantThe type of the item. Always
message.MESSAGE("message")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeConversationItemAssistantMessage:An assistant message item in a Realtime conversation.
-
List<Content> contentThe content of the message.
-
Optional<String> audioBase64-encoded audio bytes, these will be parsed as the format specified in the session output audio type configuration. This defaults to PCM 16-bit 24kHz mono if not specified.
-
Optional<String> textThe text content.
-
Optional<String> transcriptThe transcript of the audio content, this will always be present if the output type is
audio. -
Optional<Type> typeThe content type,
output_textoroutput_audiodepending on the sessionoutput_modalitiesconfiguration.-
OUTPUT_TEXT("output_text") -
OUTPUT_AUDIO("output_audio")
-
-
-
JsonValue; role "assistant"constantThe role of the message sender. Always
assistant.ASSISTANT("assistant")
-
JsonValue; type "message"constantThe type of the item. Always
message.MESSAGE("message")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeConversationItemFunctionCall:A function call item in a Realtime conversation.
-
String argumentsThe arguments of the function call. This is a JSON-encoded string representing the arguments passed to the function, for example
{"arg1": "value1", "arg2": 42}. -
String nameThe name of the function being called.
-
JsonValue; type "function_call"constantThe type of the item. Always
function_call.FUNCTION_CALL("function_call")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<String> callIdThe ID of the function call.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeConversationItemFunctionCallOutput:A function call output item in a Realtime conversation.
-
String callIdThe ID of the function call this output is for.
-
String outputThe output of the function call. This is free text and can contain any information or simply be empty.
-
JsonValue; type "function_call_output"constantThe type of the item. Always
function_call_output.FUNCTION_CALL_OUTPUT("function_call_output")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeMcpApprovalResponse:A Realtime item responding to an MCP approval request.
-
String idThe unique ID of the approval response.
-
String approvalRequestIdThe ID of the approval request being answered.
-
boolean approveWhether the request was approved.
-
JsonValue; type "mcp_approval_response"constantThe type of the item. Always
mcp_approval_response.MCP_APPROVAL_RESPONSE("mcp_approval_response")
-
Optional<String> reasonOptional reason for the decision.
-
-
class RealtimeMcpListTools:A Realtime item listing tools available on an MCP server.
-
String serverLabelThe label of the MCP server.
-
List<Tool> toolsThe tools available on the server.
-
JsonValue inputSchemaThe JSON schema describing the tool's input.
-
String nameThe name of the tool.
-
Optional<JsonValue> annotationsAdditional annotations about the tool.
-
Optional<String> descriptionThe description of the tool.
-
-
JsonValue; type "mcp_list_tools"constantThe type of the item. Always
mcp_list_tools.MCP_LIST_TOOLS("mcp_list_tools")
-
Optional<String> idThe unique ID of the list.
-
-
class RealtimeMcpToolCall:A Realtime item representing an invocation of a tool on an MCP server.
-
String idThe unique ID of the tool call.
-
String argumentsA JSON string of the arguments passed to the tool.
-
String nameThe name of the tool that was run.
-
String serverLabelThe label of the MCP server running the tool.
-
JsonValue; type "mcp_call"constantThe type of the item. Always
mcp_call.MCP_CALL("mcp_call")
-
Optional<String> approvalRequestIdThe ID of an associated approval request, if any.
-
Optional<Error> errorThe error from the tool call, if any.
-
class RealtimeMcpProtocolError:-
long code -
String message -
JsonValue; type "protocol_error"constantPROTOCOL_ERROR("protocol_error")
-
-
class RealtimeMcpToolExecutionError:-
String message -
JsonValue; type "tool_execution_error"constantTOOL_EXECUTION_ERROR("tool_execution_error")
-
-
class RealtimeMcphttpError:-
long code -
String message -
JsonValue; type "http_error"constantHTTP_ERROR("http_error")
-
-
-
Optional<String> outputThe output from the tool call.
-
-
class RealtimeMcpApprovalRequest:A Realtime item requesting human approval of a tool invocation.
-
String idThe unique ID of the approval request.
-
String argumentsA JSON string of arguments for the tool.
-
String nameThe name of the tool to run.
-
String serverLabelThe label of the MCP server making the request.
-
JsonValue; type "mcp_approval_request"constantThe type of the item. Always
mcp_approval_request.MCP_APPROVAL_REQUEST("mcp_approval_request")
-
-
-
JsonValue; type "conversation.item.done"constantThe event type, must be
conversation.item.done.CONVERSATION_ITEM_DONE("conversation.item.done")
-
Optional<String> previousItemIdThe ID of the item that precedes this one, if any. This is used to maintain ordering when items are inserted.
-
Conversation Item Input Audio Transcription Completed Event
-
class ConversationItemInputAudioTranscriptionCompletedEvent:This event is the output of audio transcription for user audio written to the user audio buffer. Transcription begins when the input audio buffer is committed by the client or server (when VAD is enabled). Transcription runs asynchronously with Response creation, so this event may come before or after the Response events.
Realtime API models accept audio natively, and thus input transcription is a separate process run on a separate ASR (Automatic Speech Recognition) model. The transcript may diverge somewhat from the model's interpretation, and should be treated as a rough guide.
-
long contentIndexThe index of the content part containing the audio.
-
String eventIdThe unique ID of the server event.
-
String itemIdThe ID of the item containing the audio that is being transcribed.
-
String transcriptThe transcribed text.
-
JsonValue; type "conversation.item.input_audio_transcription.completed"constantThe event type, must be
conversation.item.input_audio_transcription.completed.CONVERSATION_ITEM_INPUT_AUDIO_TRANSCRIPTION_COMPLETED("conversation.item.input_audio_transcription.completed")
-
Usage usageUsage statistics for the transcription. This is billed according to the ASR model's pricing rather than the realtime model's pricing.
-
class TranscriptTextUsageTokens:Usage statistics for models billed by token usage.
-
long inputTokensNumber of input tokens billed for this request.
-
long outputTokensNumber of output tokens generated.
-
long totalTokensTotal number of tokens used (input + output).
-
JsonValue; type "tokens"constantThe type of the usage object. Always
tokensfor this variant.TOKENS("tokens")
-
Optional<InputTokenDetails> inputTokenDetailsDetails about the input tokens billed for this request.
-
Optional<Long> audioTokensNumber of audio tokens billed for this request.
-
Optional<Long> textTokensNumber of text tokens billed for this request.
-
-
-
class TranscriptTextUsageDuration:Usage statistics for models billed by audio input duration.
-
double secondsDuration of the input audio in seconds.
-
JsonValue; type "duration"constantThe type of the usage object. Always
durationfor this variant.DURATION("duration")
-
-
-
Optional<List<LogProbProperties>> logprobsThe log probabilities of the transcription.
-
String tokenThe token that was used to generate the log probability.
-
List<long> bytesThe bytes that were used to generate the log probability.
-
double logprobThe log probability of the token.
-
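A log probability can be turned back into a probability to flag uncertain tokens. The sketch below assumes the values are natural logarithms, which this page does not state, and uses a hypothetical TokenLogprob holder and threshold rather than the SDK's LogProbProperties class.

```java
import java.util.ArrayList;
import java.util.List;

class LogprobSketch {
    /** Hypothetical stand-in for one logprobs entry (token plus log probability). */
    static final class TokenLogprob {
        final String token;
        final double logprob;

        TokenLogprob(String token, double logprob) {
            this.token = token;
            this.logprob = logprob;
        }
    }

    /** Converts a natural-log probability to a probability in (0, 1]. */
    static double probability(double logprob) {
        return Math.exp(logprob);
    }

    /** Returns the tokens whose probability falls below minProbability. */
    static List<String> lowConfidenceTokens(List<TokenLogprob> entries, double minProbability) {
        List<String> flagged = new ArrayList<>();
        for (TokenLogprob entry : entries) {
            if (probability(entry.logprob) < minProbability) {
                flagged.add(entry.token);
            }
        }
        return flagged;
    }
}
```

A client could use a list like this to highlight words in the transcript that may deserve a second look.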
-
Conversation Item Input Audio Transcription Delta Event
-
class ConversationItemInputAudioTranscriptionDeltaEvent:Returned when the text value of an input audio transcription content part is updated with incremental transcription results.
-
String eventIdThe unique ID of the server event.
-
String itemIdThe ID of the item containing the audio that is being transcribed.
-
JsonValue; type "conversation.item.input_audio_transcription.delta"constantThe event type, must be
conversation.item.input_audio_transcription.delta.CONVERSATION_ITEM_INPUT_AUDIO_TRANSCRIPTION_DELTA("conversation.item.input_audio_transcription.delta")
-
Optional<Long> contentIndexThe index of the content part in the item's content array.
-
Optional<String> deltaThe text delta.
-
Optional<List<LogProbProperties>> logprobsThe log probabilities of the transcription. These can be enabled by configuring the session with
"include": ["item.input_audio_transcription.logprobs"]. Each entry in the array corresponds to a log probability of which token would be selected for this chunk of transcription. This can help to identify whether there were multiple valid options for a given chunk of transcription.-
String tokenThe token that was used to generate the log probability.
-
List<long> bytesThe bytes that were used to generate the log probability.
-
double logprobThe log probability of the token.
-
-
Conversation Item Input Audio Transcription Failed Event
-
class ConversationItemInputAudioTranscriptionFailedEvent:Returned when input audio transcription is configured, and a transcription request for a user message failed. These events are separate from other
errorevents so that the client can identify the related Item.-
long contentIndexThe index of the content part containing the audio.
-
Error errorDetails of the transcription error.
-
Optional<String> codeError code, if any.
-
Optional<String> messageA human-readable error message.
-
Optional<String> paramParameter related to the error, if any.
-
Optional<String> typeThe type of error.
-
-
String eventIdThe unique ID of the server event.
-
String itemIdThe ID of the user message item.
-
JsonValue; type "conversation.item.input_audio_transcription.failed"constantThe event type, must be
conversation.item.input_audio_transcription.failed.CONVERSATION_ITEM_INPUT_AUDIO_TRANSCRIPTION_FAILED("conversation.item.input_audio_transcription.failed")
-
Conversation Item Input Audio Transcription Segment
-
class ConversationItemInputAudioTranscriptionSegment:Returned when an input audio transcription segment is identified for an item.
-
String idThe segment identifier.
-
long contentIndexThe index of the input audio content part within the item.
-
double endEnd time of the segment in seconds.
-
String eventIdThe unique ID of the server event.
-
String itemIdThe ID of the item containing the input audio content.
-
String speakerThe detected speaker label for this segment.
-
double startStart time of the segment in seconds.
-
String textThe text for this segment.
-
JsonValue; type "conversation.item.input_audio_transcription.segment"constantThe event type, must be
conversation.item.input_audio_transcription.segment.CONVERSATION_ITEM_INPUT_AUDIO_TRANSCRIPTION_SEGMENT("conversation.item.input_audio_transcription.segment")
-
Conversation Item Retrieve Event
-
class ConversationItemRetrieveEvent:Send this event when you want to retrieve the server's representation of a specific item in the conversation history. This is useful, for example, to inspect user audio after noise cancellation and VAD. The server will respond with a
conversation.item.retrievedevent, unless the item does not exist in the conversation history, in which case the server will respond with an error.-
String itemIdThe ID of the item to retrieve.
-
JsonValue; type "conversation.item.retrieve"constantThe event type, must be
conversation.item.retrieve.CONVERSATION_ITEM_RETRIEVE("conversation.item.retrieve")
-
Optional<String> eventIdOptional client-generated ID used to identify this event.
-
Conversation Item Truncate Event
-
class ConversationItemTruncateEvent:Send this event to truncate a previous assistant message’s audio. The server will produce audio faster than realtime, so this event is useful when the user interrupts to truncate audio that has already been sent to the client but not yet played. This will synchronize the server's understanding of the audio with the client's playback.
Truncating audio will delete the server-side text transcript to ensure there is no text in the context that hasn't been heard by the user.
If successful, the server will respond with a
conversation.item.truncatedevent.-
long audioEndMsInclusive duration up to which audio is truncated, in milliseconds. If the audio_end_ms is greater than the actual audio duration, the server will respond with an error.
-
long contentIndexThe index of the content part to truncate. Set this to
0. -
String itemIdThe ID of the assistant message item to truncate. Only assistant message items can be truncated.
-
JsonValue; type "conversation.item.truncate"constantThe event type, must be
conversation.item.truncate.CONVERSATION_ITEM_TRUNCATE("conversation.item.truncate")
-
Optional<String> eventIdOptional client-generated ID used to identify this event.
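When the user interrupts, the client needs the playback position in milliseconds to fill audioEndMs. The sketch below assumes the output audio is PCM 16-bit 24kHz mono (the default mentioned for assistant audio on this page) and that the wire field names are item_id, content_index, and audio_end_ms. The class name and the hand-built JSON string are illustrative only, not the SDK's builder API.

```java
class TruncateSketch {
    static final int SAMPLE_RATE_HZ = 24_000; // assumed output sample rate
    static final int BYTES_PER_SAMPLE = 2;    // 16-bit mono PCM

    /** Converts the number of PCM bytes already played into whole milliseconds (rounded down). */
    static long playedMillis(long playedBytes) {
        long samples = playedBytes / BYTES_PER_SAMPLE;
        return samples * 1000L / SAMPLE_RATE_HZ;
    }

    /**
     * Builds a conversation.item.truncate payload as a JSON string.
     * Assumes itemId contains no characters that need JSON escaping.
     */
    static String truncateEventJson(String itemId, long audioEndMs) {
        return "{\"type\":\"conversation.item.truncate\","
            + "\"item_id\":\"" + itemId + "\","
            + "\"content_index\":0,"
            + "\"audio_end_ms\":" + audioEndMs + "}";
    }

    public static void main(String[] args) {
        long ms = playedMillis(48_000); // one second of audio at the assumed format
        System.out.println(truncateEventJson("item_123", ms));
    }
}
```

Rounding down keeps the value at or below the audio actually played, which avoids requesting a truncation point past the end of the audio.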
-
Conversation Item Truncated Event
-
class ConversationItemTruncatedEvent:Returned when an earlier assistant audio message item is truncated by the client with a
conversation.item.truncateevent. This event is used to synchronize the server's understanding of the audio with the client's playback. This action will truncate the audio and remove the server-side text transcript to ensure there is no text in the context that hasn't been heard by the user.
-
long audioEndMsThe duration up to which the audio was truncated, in milliseconds.
-
long contentIndexThe index of the content part that was truncated.
-
String eventIdThe unique ID of the server event.
-
String itemIdThe ID of the assistant message item that was truncated.
-
JsonValue; type "conversation.item.truncated"constantThe event type, must be
conversation.item.truncated.CONVERSATION_ITEM_TRUNCATED("conversation.item.truncated")
-
Conversation Item With Reference
-
class ConversationItemWithReference:The item to add to the conversation.
-
Optional<String> idFor an item of type (
message|function_call|function_call_output) this field allows the client to assign the unique ID of the item. It is not required because the server will generate one if not provided. For an item of type
item_reference, this field is required and is a reference to any item that has previously existed in the conversation. -
Optional<String> argumentsThe arguments of the function call (for
function_callitems). -
Optional<String> callIdThe ID of the function call (for
function_callandfunction_call_outputitems). If passed on afunction_call_outputitem, the server will check that afunction_callitem with the same ID exists in the conversation history. -
Optional<List<Content>> contentThe content of the message, applicable for
messageitems.-
Message items of role
systemsupport onlyinput_textcontent -
Message items of role
usersupportinput_textandinput_audiocontent -
Message items of role
assistantsupporttextcontent. -
Optional<String> idID of a previous conversation item to reference (for
item_referencecontent types inresponse.createevents). These can reference both client and server created items. -
Optional<String> audioBase64-encoded audio bytes, used for
input_audiocontent type. -
Optional<String> textThe text content, used for
input_textandtextcontent types. -
Optional<String> transcriptThe transcript of the audio, used for
input_audiocontent type. -
Optional<Type> typeThe content type (
input_text,input_audio,item_reference,text).-
INPUT_TEXT("input_text") -
INPUT_AUDIO("input_audio") -
ITEM_REFERENCE("item_reference") -
TEXT("text")
-
-
-
Optional<String> nameThe name of the function being called (for
function_callitems). -
Optional<Object> object_Identifier for the API object being returned - always
realtime.item.REALTIME_ITEM("realtime.item")
-
Optional<String> outputThe output of the function call (for
function_call_outputitems). -
Optional<Role> roleThe role of the message sender (
user,assistant,system), only applicable formessageitems.-
USER("user") -
ASSISTANT("assistant") -
SYSTEM("system")
-
-
Optional<Status> statusThe status of the item (
completed,incomplete,in_progress). These have no effect on the conversation, but are accepted for consistency with theconversation.item.createdevent.-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
Optional<Type> typeThe type of the item (
message,function_call,function_call_output,item_reference).-
MESSAGE("message") -
FUNCTION_CALL("function_call") -
FUNCTION_CALL_OUTPUT("function_call_output") -
ITEM_REFERENCE("item_reference")
-
-
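The role-to-content-type rules listed for message items can be checked on the client before sending an item. This is a minimal sketch of such a check. ContentTypeRules is a hypothetical helper, not part of the SDK, and it covers only the three message roles (the item_reference content type used in response.create events is left out).

```java
import java.util.Map;
import java.util.Set;

class ContentTypeRules {
    // Content types each message role supports, as listed for ConversationItemWithReference.
    static final Map<String, Set<String>> ALLOWED = Map.of(
        "system", Set.of("input_text"),
        "user", Set.of("input_text", "input_audio"),
        "assistant", Set.of("text"));

    /** Returns true if the given role may carry the given content type. */
    static boolean isAllowed(String role, String contentType) {
        Set<String> allowed = ALLOWED.get(role);
        return allowed != null && allowed.contains(contentType);
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("user", "input_audio"));
        System.out.println(isAllowed("system", "input_audio"));
    }
}
```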
Input Audio Buffer Append Event
-
class InputAudioBufferAppendEvent:Send this event to append audio bytes to the input audio buffer. The audio buffer is temporary storage you can write to and later commit. A "commit" will create a new user message item in the conversation history from the buffer content and clear the buffer. Input audio transcription (if enabled) will be generated when the buffer is committed.
If VAD is enabled the audio buffer is used to detect speech and the server will decide when to commit. When Server VAD is disabled, you must commit the audio buffer manually. Input audio noise reduction operates on writes to the audio buffer.
The client may choose how much audio to place in each event, up to a maximum of 15 MiB. For example, streaming smaller chunks from the client may allow the VAD to be more responsive. Unlike most other client events, the server will not send a confirmation response to this event.
-
String audioBase64-encoded audio bytes. This must be in the format specified by the
input_audio_formatfield in the session configuration. -
JsonValue; type "input_audio_buffer.append"constantThe event type, must be
input_audio_buffer.append.INPUT_AUDIO_BUFFER_APPEND("input_audio_buffer.append")
-
Optional<String> eventIdOptional client-generated ID used to identify this event.
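Splitting captured audio into small chunks and Base64-encoding each one is the usual way to feed this event. The sketch below builds the JSON payloads by hand with only the type and audio fields. AppendChunkSketch and the chunk size are assumptions for illustration, and the SDK's own event builders are not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Base64;
import java.util.List;

class AppendChunkSketch {
    /**
     * Splits raw audio bytes into chunks of at most chunkBytes and returns one
     * input_audio_buffer.append JSON payload per chunk, in order.
     */
    static List<String> appendEvents(byte[] audio, int chunkBytes) {
        if (chunkBytes <= 0) {
            throw new IllegalArgumentException("chunkBytes must be positive");
        }
        List<String> events = new ArrayList<>();
        for (int offset = 0; offset < audio.length; offset += chunkBytes) {
            int end = Math.min(audio.length, offset + chunkBytes);
            // Standard Base64 output needs no JSON escaping.
            String encoded = Base64.getEncoder().encodeToString(Arrays.copyOfRange(audio, offset, end));
            events.add("{\"type\":\"input_audio_buffer.append\",\"audio\":\"" + encoded + "\"}");
        }
        return events;
    }

    public static void main(String[] args) {
        // 4800 bytes is roughly 100 ms of 16-bit 24kHz mono PCM (an assumed format).
        byte[] silence = new byte[12_000];
        System.out.println(appendEvents(silence, 4_800).size() + " events");
    }
}
```

Each payload would then be sent over the session's connection in order. No per-event confirmation should be expected from the server.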
-
Input Audio Buffer Clear Event
-
class InputAudioBufferClearEvent:Send this event to clear the audio bytes in the buffer. The server will respond with an
input_audio_buffer.clearedevent.-
JsonValue; type "input_audio_buffer.clear"constantThe event type, must be
input_audio_buffer.clear.INPUT_AUDIO_BUFFER_CLEAR("input_audio_buffer.clear")
-
Optional<String> eventIdOptional client-generated ID used to identify this event.
-
Input Audio Buffer Cleared Event
-
class InputAudioBufferClearedEvent:Returned when the input audio buffer is cleared by the client with an
input_audio_buffer.clear event.-
String eventIdThe unique ID of the server event.
-
JsonValue; type "input_audio_buffer.cleared"constantThe event type, must be
input_audio_buffer.cleared.INPUT_AUDIO_BUFFER_CLEARED("input_audio_buffer.cleared")
-
Input Audio Buffer Commit Event
-
class InputAudioBufferCommitEvent:Send this event to commit the user input audio buffer, which will create a new user message item in the conversation. This event will produce an error if the input audio buffer is empty. When in Server VAD mode, the client does not need to send this event; the server will commit the audio buffer automatically.
Committing the input audio buffer will trigger input audio transcription (if enabled in session configuration), but it will not create a response from the model. The server will respond with an
input_audio_buffer.committedevent.-
JsonValue; type "input_audio_buffer.commit"constantThe event type, must be
input_audio_buffer.commit.INPUT_AUDIO_BUFFER_COMMIT("input_audio_buffer.commit")
-
Optional<String> eventIdOptional client-generated ID used to identify this event.
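Because committing an empty buffer produces an error, a client that commits manually may want to track whether anything has been appended. The following is a minimal sketch under that assumption; `ManualCommitSketch` and its methods are hypothetical names, not SDK types, and the JSON is built by hand.

```java
import java.util.Optional;

// Hypothetical client-side guard for manual commits (no Server VAD).
final class ManualCommitSketch {
    // Raw audio bytes appended since the last commit or clear.
    private long pendingBytes = 0;

    void onAppend(int rawBytes) {
        pendingBytes += rawBytes;
    }

    void onClear() {
        pendingBytes = 0;
    }

    // Returns the commit event to send, or empty if the buffer is known to be empty.
    Optional<String> nextCommitEvent() {
        if (pendingBytes == 0) {
            return Optional.empty();
        }
        pendingBytes = 0;
        return Optional.of("{\"type\":\"input_audio_buffer.commit\"}");
    }
}
```

Since a commit does not create a model response on its own, a caller would still need to request a response separately after sending this event.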
-
Input Audio Buffer Committed Event
-
class InputAudioBufferCommittedEvent:Returned when an input audio buffer is committed, either by the client or automatically in server VAD mode. The
item_id property is the ID of the user message item that will be created, so a conversation.item.created event will also be sent to the client.-
String eventIdThe unique ID of the server event.
-
String itemIdThe ID of the user message item that will be created.
-
JsonValue; type "input_audio_buffer.committed"constantThe event type, must be
input_audio_buffer.committed.INPUT_AUDIO_BUFFER_COMMITTED("input_audio_buffer.committed")
-
Optional<String> previousItemIdThe ID of the preceding item after which the new item will be inserted. Can be
nullif the item has no predecessor.
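One way a client might use `previousItemId` is to keep a local, ordered list of item IDs. This is a sketch only: `ItemOrderSketch` is a hypothetical class, and two behaviors are assumptions rather than documented rules: a `null` predecessor places the item at the front, and an unknown predecessor appends the item at the end.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical local mirror of conversation item order.
final class ItemOrderSketch {
    private final List<String> order = new ArrayList<>();

    // Inserts itemId directly after previousItemId.
    void onCommitted(String itemId, String previousItemId) {
        if (previousItemId == null) {
            order.add(0, itemId);          // assumption: no predecessor means first item
            return;
        }
        int idx = order.indexOf(previousItemId);
        if (idx < 0) {
            order.add(itemId);             // assumption: unknown predecessor, append
        } else {
            order.add(idx + 1, itemId);
        }
    }

    List<String> snapshot() {
        return new ArrayList<>(order);
    }
}
```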
-
Input Audio Buffer Dtmf Event Received Event
-
class InputAudioBufferDtmfEventReceivedEvent:SIP Only: Returned when a DTMF event is received. A DTMF event is a message that represents a telephone keypad press (0–9, *, #, A–D). The
event property is the keypad key that the user pressed. The received_at property is the UTC Unix timestamp at which the server received the event.-
String eventThe telephone keypad that was pressed by the user.
-
long receivedAtUTC Unix Timestamp when DTMF Event was received by server.
-
JsonValue; type "input_audio_buffer.dtmf_event_received"constantThe event type, must be
input_audio_buffer.dtmf_event_received.INPUT_AUDIO_BUFFER_DTMF_EVENT_RECEIVED("input_audio_buffer.dtmf_event_received")
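A client may want to sanity-check the `event` value before acting on it. The sketch below assumes the value is a single-character string drawn from the keypad set listed above (0–9, *, #, A–D); the class and method names are hypothetical.

```java
// Hypothetical validator for the DTMF `event` field.
final class DtmfSketch {
    // Keypad symbols named in the event description.
    private static final String KEYS = "0123456789*#ABCD";

    // True only for a single character from the keypad set.
    static boolean isValidKey(String event) {
        return event != null
                && event.length() == 1
                && KEYS.indexOf(event.charAt(0)) >= 0;
    }
}
```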
-
Input Audio Buffer Speech Started Event
-
class InputAudioBufferSpeechStartedEvent:Sent by the server when in
server_vad mode to indicate that speech has been detected in the audio buffer. This can happen any time audio is added to the buffer (unless speech is already detected). The client may want to use this event to interrupt audio playback or provide visual feedback to the user. The client should expect to receive an
input_audio_buffer.speech_stopped event when speech stops. The item_id property is the ID of the user message item that will be created when speech stops and will also be included in the input_audio_buffer.speech_stopped event (unless the client manually commits the audio buffer during VAD activation).-
long audioStartMsMilliseconds from the start of all audio written to the buffer during the session when speech was first detected. This will correspond to the beginning of audio sent to the model, and thus includes the
prefix_padding_msconfigured in the Session. -
String eventIdThe unique ID of the server event.
-
String itemIdThe ID of the user message item that will be created when speech stops.
-
JsonValue; type "input_audio_buffer.speech_started"constantThe event type, must be
input_audio_buffer.speech_started.INPUT_AUDIO_BUFFER_SPEECH_STARTED("input_audio_buffer.speech_started")
-
Input Audio Buffer Speech Stopped Event
-
class InputAudioBufferSpeechStoppedEvent:Returned in
server_vad mode when the server detects the end of speech in the audio buffer. The server will also send a conversation.item.created event with the user message item that is created from the audio buffer.-
long audioEndMsMilliseconds since the session started when speech stopped. This will correspond to the end of audio sent to the model, and thus includes the
min_silence_duration_msconfigured in the Session. -
String eventIdThe unique ID of the server event.
-
String itemIdThe ID of the user message item that will be created.
-
JsonValue; type "input_audio_buffer.speech_stopped"constantThe event type, must be
input_audio_buffer.speech_stopped.INPUT_AUDIO_BUFFER_SPEECH_STOPPED("input_audio_buffer.speech_stopped")
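The start and stop events share an `itemId`, so a client can pair them to estimate how long a detected speech segment lasted. This is a rough sketch: `SpeechSegmentSketch` is a hypothetical class, and it assumes `audioStartMs` and `audioEndMs` are measured on the same timeline. The result includes the configured padding and trailing silence mentioned in the field descriptions.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalLong;

// Hypothetical pairing of speech_started and speech_stopped events by item ID.
final class SpeechSegmentSketch {
    private final Map<String, Long> startByItem = new HashMap<>();

    void onSpeechStarted(String itemId, long audioStartMs) {
        startByItem.put(itemId, audioStartMs);
    }

    // Returns the segment length in ms, or empty if no matching start was seen.
    OptionalLong onSpeechStopped(String itemId, long audioEndMs) {
        Long start = startByItem.remove(itemId);
        return start == null
                ? OptionalLong.empty()
                : OptionalLong.of(audioEndMs - start);
    }
}
```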
-
Input Audio Buffer Timeout Triggered
-
class InputAudioBufferTimeoutTriggered:Returned when the Server VAD timeout is triggered for the input audio buffer. This is configured with
idle_timeout_ms in the turn_detection settings of the session, and it indicates that there hasn't been any speech detected for the configured duration. The
audio_start_ms and audio_end_ms fields indicate the segment of audio after the last model response up to the triggering time, as an offset from the beginning of audio written to the input audio buffer. This means it demarcates the segment of audio that was silent, and the difference between the start and end values will roughly match the configured timeout. The empty audio will be committed to the conversation as an
input_audio item (there will be an input_audio_buffer.committed event) and a model response will be generated. There may be speech that didn't trigger VAD but is still detected by the model, so the model may respond with something relevant to the conversation or a prompt to continue speaking.-
long audioEndMsMillisecond offset of audio written to the input audio buffer at the time the timeout was triggered.
-
long audioStartMsMillisecond offset of audio written to the input audio buffer that was after the playback time of the last model response.
-
String eventIdThe unique ID of the server event.
-
String itemIdThe ID of the item associated with this segment.
-
JsonValue; type "input_audio_buffer.timeout_triggered"constantThe event type, must be
input_audio_buffer.timeout_triggered.INPUT_AUDIO_BUFFER_TIMEOUT_TRIGGERED("input_audio_buffer.timeout_triggered")
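The relationship between the two offsets and the configured timeout can be checked with simple arithmetic. The sketch below is illustrative only; `IdleTimeoutSketch` and the `toleranceMs` parameter are hypothetical, since the reference only says the difference will "roughly" match.

```java
// Hypothetical check that a timeout_triggered segment matches idle_timeout_ms.
final class IdleTimeoutSketch {
    // Length of the silent segment reported by the event.
    static long silentSpanMs(long audioStartMs, long audioEndMs) {
        return audioEndMs - audioStartMs;
    }

    // True if the silent span is within toleranceMs of the configured timeout.
    static boolean roughlyMatches(long audioStartMs, long audioEndMs,
                                  long idleTimeoutMs, long toleranceMs) {
        long span = silentSpanMs(audioStartMs, audioEndMs);
        return Math.abs(span - idleTimeoutMs) <= toleranceMs;
    }
}
```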
-
Log Prob Properties
-
class LogProbProperties:A log probability object.
-
String tokenThe token that was used to generate the log probability.
-
List<long> bytesThe bytes that were used to generate the log probability.
-
double logprobThe log probability of the token.
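A log probability is often easier to read as a plain probability, and the `bytes` list can be turned back into text. The sketch below assumes natural-log values and assumes `bytes` holds the UTF-8 byte values (0–255) of the token; `LogProbSketch` is a hypothetical helper, not an SDK class.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical helpers for reading LogProbProperties values.
final class LogProbSketch {
    // Converts a natural-log probability to a probability in [0, 1].
    static double probability(double logprob) {
        return Math.exp(logprob);
    }

    // Decodes byte values (assumed UTF-8, each 0..255) into a string.
    static String decode(List<Long> bytes) {
        byte[] raw = new byte[bytes.size()];
        for (int i = 0; i < raw.length; i++) {
            raw[i] = (byte) bytes.get(i).longValue();
        }
        return new String(raw, StandardCharsets.UTF_8);
    }
}
```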
-
Mcp List Tools Completed
-
class McpListToolsCompleted:Returned when listing MCP tools has completed for an item.
-
String eventIdThe unique ID of the server event.
-
String itemIdThe ID of the MCP list tools item.
-
JsonValue; type "mcp_list_tools.completed"constantThe event type, must be
mcp_list_tools.completed.MCP_LIST_TOOLS_COMPLETED("mcp_list_tools.completed")
-
Mcp List Tools Failed
-
class McpListToolsFailed:Returned when listing MCP tools has failed for an item.
-
String eventIdThe unique ID of the server event.
-
String itemIdThe ID of the MCP list tools item.
-
JsonValue; type "mcp_list_tools.failed"constantThe event type, must be
mcp_list_tools.failed.MCP_LIST_TOOLS_FAILED("mcp_list_tools.failed")
-
Mcp List Tools In Progress
-
class McpListToolsInProgress:Returned when listing MCP tools is in progress for an item.
-
String eventIdThe unique ID of the server event.
-
String itemIdThe ID of the MCP list tools item.
-
JsonValue; type "mcp_list_tools.in_progress"constantThe event type, must be
mcp_list_tools.in_progress.MCP_LIST_TOOLS_IN_PROGRESS("mcp_list_tools.in_progress")
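The three MCP list-tools events share an `itemId`, so a client can track the latest state per item. The following is a minimal sketch keyed on the raw event type strings; `McpListToolsTracker` and its status labels are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical tracker for mcp_list_tools.* server events.
final class McpListToolsTracker {
    private final Map<String, String> statusByItem = new HashMap<>();

    // Records the event; returns false if the type is not an MCP list-tools event.
    boolean onEvent(String type, String itemId) {
        switch (type) {
            case "mcp_list_tools.in_progress":
                statusByItem.put(itemId, "in_progress");
                return true;
            case "mcp_list_tools.completed":
                statusByItem.put(itemId, "completed");
                return true;
            case "mcp_list_tools.failed":
                statusByItem.put(itemId, "failed");
                return true;
            default:
                return false;
        }
    }

    // Latest known status, or "unknown" if no event was seen for the item.
    String status(String itemId) {
        return statusByItem.getOrDefault(itemId, "unknown");
    }
}
```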
-
Noise Reduction Type
-
enum NoiseReductionType:Type of noise reduction.
near_fieldis for close-talking microphones such as headphones,far_fieldis for far-field microphones such as laptop or conference room microphones.-
NEAR_FIELD("near_field") -
FAR_FIELD("far_field")
-
Output Audio Buffer Clear Event
-
class OutputAudioBufferClearEvent:WebRTC/SIP Only: Emit to cut off the current audio response. This will trigger the server to stop generating audio and emit an
output_audio_buffer.cleared event. This event should be preceded by a response.cancel client event to stop the generation of the current response.-
JsonValue; type "output_audio_buffer.clear"constantThe event type, must be
output_audio_buffer.clear.OUTPUT_AUDIO_BUFFER_CLEAR("output_audio_buffer.clear")
-
Optional<String> eventIdThe unique ID of the client event used for error handling.
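The ordering described above (cancel first, then clear) can be captured as a two-event sequence. This sketch builds the payloads by hand; `InterruptSketch` is a hypothetical name and the events carry only their `type` field.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper that yields the interruption events in send order.
final class InterruptSketch {
    static List<String> interruptEvents() {
        return Arrays.asList(
                // 1. Stop generation of the current response.
                "{\"type\":\"response.cancel\"}",
                // 2. Cut off audio that is already buffered for output.
                "{\"type\":\"output_audio_buffer.clear\"}");
    }
}
```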
-
Rate Limits Updated Event
-
class RateLimitsUpdatedEvent:Emitted at the beginning of a Response to indicate the updated rate limits. When a Response is created, some tokens will be "reserved" for the output tokens. The rate limits shown here reflect that reservation, which is then adjusted accordingly once the Response is completed.
-
String eventIdThe unique ID of the server event.
-
List<RateLimit> rateLimitsList of rate limit information.
-
Optional<Long> limitThe maximum allowed value for the rate limit.
-
Optional<Name> nameThe name of the rate limit (
requests,tokens).-
REQUESTS("requests") -
TOKENS("tokens")
-
-
Optional<Long> remainingThe remaining value before the limit is reached.
-
Optional<Double> resetSecondsSeconds until the rate limit resets.
-
-
JsonValue; type "rate_limits.updated"constantThe event type, must be
rate_limits.updated.RATE_LIMITS_UPDATED("rate_limits.updated")
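A client might use `remaining` and `resetSeconds` to decide whether to pause before the next request. The following is one possible policy, not a documented one: `RateLimitSketch`, `waitMillis`, and the `needed` parameter are hypothetical.

```java
// Hypothetical back-off decision based on a single rate limit entry.
final class RateLimitSketch {
    // Returns 0 if enough capacity remains, otherwise the time until reset in ms.
    static long waitMillis(long remaining, double resetSeconds, long needed) {
        if (remaining >= needed) {
            return 0L;
        }
        // Round up so the caller never resumes before the reset.
        return (long) Math.ceil(resetSeconds * 1000.0);
    }
}
```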
-
Realtime Audio Config
-
class RealtimeAudioConfig:Configuration for input and output audio.
-
Optional<RealtimeAudioConfigInput> input-
Optional<RealtimeAudioFormats> formatThe format of the input audio.
-
AudioPcm-
Optional<Rate> rateThe sample rate of the audio. Always
24000._24000(24000)
-
Optional<Type> typeThe audio format. Always
audio/pcm.AUDIO_PCM("audio/pcm")
-
-
AudioPcmu-
Optional<Type> typeThe audio format. Always
audio/pcmu.AUDIO_PCMU("audio/pcmu")
-
-
AudioPcma-
Optional<Type> typeThe audio format. Always
audio/pcma.AUDIO_PCMA("audio/pcma")
-
-
-
Optional<NoiseReduction> noiseReductionConfiguration for input audio noise reduction. This can be set to
nullto turn off. Noise reduction filters audio added to the input audio buffer before it is sent to VAD and the model. Filtering the audio can improve VAD and turn detection accuracy (reducing false positives) and model performance by improving perception of the input audio.-
Optional<NoiseReductionType> typeType of noise reduction.
near_fieldis for close-talking microphones such as headphones,far_fieldis for far-field microphones such as laptop or conference room microphones.-
NEAR_FIELD("near_field") -
FAR_FIELD("far_field")
-
-
-
Optional<AudioTranscription> transcriptionConfiguration for input audio transcription, defaults to off and can be set to
null to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through the /audio/transcriptions endpoint and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service.-
Optional<Delay> delayControls how long the model waits before emitting transcription text. Higher values can improve transcription accuracy at the cost of latency. Only supported with
gpt-realtime-whisperin GA Realtime sessions.-
MINIMAL("minimal") -
LOW("low") -
MEDIUM("medium") -
HIGH("high") -
XHIGH("xhigh")
-
-
Optional<String> languageThe language of the input audio. Supplying the input language in ISO-639-1 (e.g.
en) format will improve accuracy and latency. -
Optional<Model> modelThe model to use for transcription. Current options are
whisper-1,gpt-4o-mini-transcribe,gpt-4o-mini-transcribe-2025-12-15,gpt-4o-transcribe,gpt-4o-transcribe-diarize, andgpt-realtime-whisper. Usegpt-4o-transcribe-diarizewhen you need diarization with speaker labels.-
WHISPER_1("whisper-1") -
GPT_4O_MINI_TRANSCRIBE("gpt-4o-mini-transcribe") -
GPT_4O_MINI_TRANSCRIBE_2025_12_15("gpt-4o-mini-transcribe-2025-12-15") -
GPT_4O_TRANSCRIBE("gpt-4o-transcribe") -
GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize") -
GPT_REALTIME_WHISPER("gpt-realtime-whisper")
-
-
Optional<String> promptAn optional text to guide the model's style or continue a previous audio segment. For
whisper-1, the prompt is a list of keywords. Forgpt-4o-transcribemodels (excludinggpt-4o-transcribe-diarize), the prompt is a free text string, for example "expect words related to technology". Prompt is not supported withgpt-realtime-whisperin GA Realtime sessions.
-
-
Optional<RealtimeAudioInputTurnDetection> turnDetectionConfiguration for turn detection, either Server VAD or Semantic VAD. This can be set to
null to turn off, in which case the client must manually trigger model response. Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.
Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.
For
gpt-realtime-whispertranscription sessions, turn detection must be set tonull; VAD is not supported.-
ServerVad-
JsonValue; type "server_vad"constantType of turn detection,
server_vadto turn on simple Server VAD.SERVER_VAD("server_vad")
-
Optional<Boolean> createResponseWhether or not to automatically generate a response when a VAD stop event occurs. If
interrupt_responseis set tofalsethis may fail to create a response if the model is already responding.If both
create_responseandinterrupt_responseare set tofalse, the model will never respond automatically but VAD events will still be emitted. -
Optional<Long> idleTimeoutMsOptional timeout after which a model response will be triggered automatically. This is useful for situations in which a long pause from the user is unexpected, such as a phone call. The model will effectively prompt the user to continue the conversation based on the current context.
The timeout value will be applied after the last model response's audio has finished playing, i.e. it's set to the
response.donetime plus audio playback duration.An
input_audio_buffer.timeout_triggeredevent (plus events associated with the Response) will be emitted when the timeout is reached. Idle timeout is currently only supported forserver_vadmode. -
Optional<Boolean> interruptResponseWhether or not to automatically interrupt (cancel) any ongoing response with output to the default conversation (i.e.
conversationofauto) when a VAD start event occurs. Iftruethen the response will be cancelled, otherwise it will continue until complete.If both
create_responseandinterrupt_responseare set tofalse, the model will never respond automatically but VAD events will still be emitted. -
Optional<Long> prefixPaddingMsUsed only for
server_vadmode. Amount of audio to include before the VAD detected speech (in milliseconds). Defaults to 300ms. -
Optional<Long> silenceDurationMsUsed only for
server_vadmode. Duration of silence to detect speech stop (in milliseconds). Defaults to 500ms. With shorter values the model will respond more quickly, but may jump in on short pauses from the user. -
Optional<Double> thresholdUsed only for
server_vad mode. Activation threshold for VAD (0.0 to 1.0); defaults to 0.5. A higher threshold will require louder audio to activate the model, and thus might perform better in noisy environments.
-
-
SemanticVad-
JsonValue; type "semantic_vad"constantType of turn detection,
semantic_vadto turn on Semantic VAD.SEMANTIC_VAD("semantic_vad")
-
Optional<Boolean> createResponseWhether or not to automatically generate a response when a VAD stop event occurs.
-
Optional<Eagerness> eagernessUsed only for
semantic_vadmode. The eagerness of the model to respond.lowwill wait longer for the user to continue speaking,highwill respond more quickly.autois the default and is equivalent tomedium.low,medium, andhighhave max timeouts of 8s, 4s, and 2s respectively.-
LOW("low") -
MEDIUM("medium") -
HIGH("high") -
AUTO("auto")
-
-
Optional<Boolean> interruptResponseWhether or not to automatically interrupt any ongoing response with output to the default conversation (i.e.
conversationofauto) when a VAD start event occurs.
-
-
-
-
Optional<RealtimeAudioConfigOutput> output-
Optional<RealtimeAudioFormats> formatThe format of the output audio.
-
Optional<Double> speedThe speed of the model's spoken response as a multiple of the original speed. 1.0 is the default speed. 0.25 is the minimum speed. 1.5 is the maximum speed. This value can only be changed in between model turns, not while a response is in progress.
This parameter is a post-processing adjustment to the audio after it is generated. It's also possible to prompt the model to speak faster or slower.
-
Optional<Voice> voiceThe voice the model uses to respond. Supported built-in voices are
alloy,ash,ballad,coral,echo,sage,shimmer,verse,marin, andcedar. You may also provide a custom voice object with anid, for example{ "id": "voice_1234" }. Voice cannot be changed during the session once the model has responded with audio at least once. We recommendmarinandcedarfor best quality.-
String -
enum UnionMember1:-
ALLOY("alloy") -
ASH("ash") -
BALLAD("ballad") -
CORAL("coral") -
ECHO("echo") -
SAGE("sage") -
SHIMMER("shimmer") -
VERSE("verse") -
MARIN("marin") -
CEDAR("cedar")
-
-
class Id:Custom voice reference.
-
String idThe custom voice ID, e.g.
voice_1234.
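For the output `speed` field described above, a client could validate or clamp a requested value before sending it. This is a minimal sketch using the documented bounds (0.25 to 1.5); `SpeedSketch` is a hypothetical helper, and clamping rather than rejecting out-of-range values is a design choice made here for illustration.

```java
// Hypothetical clamp for the output speed multiplier.
final class SpeedSketch {
    static final double MIN_SPEED = 0.25;
    static final double MAX_SPEED = 1.5;

    // Returns speed limited to [0.25, 1.5]; rejects NaN.
    static double clampSpeed(double speed) {
        if (Double.isNaN(speed)) {
            throw new IllegalArgumentException("speed must be a number");
        }
        return Math.min(MAX_SPEED, Math.max(MIN_SPEED, speed));
    }
}
```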
-
-
-
-
Realtime Audio Config Input
-
class RealtimeAudioConfigInput:-
Optional<RealtimeAudioFormats> formatThe format of the input audio.
-
AudioPcm-
Optional<Rate> rateThe sample rate of the audio. Always
24000._24000(24000)
-
Optional<Type> typeThe audio format. Always
audio/pcm.AUDIO_PCM("audio/pcm")
-
-
AudioPcmu-
Optional<Type> typeThe audio format. Always
audio/pcmu.AUDIO_PCMU("audio/pcmu")
-
-
AudioPcma-
Optional<Type> typeThe audio format. Always
audio/pcma.AUDIO_PCMA("audio/pcma")
-
-
-
Optional<NoiseReduction> noiseReductionConfiguration for input audio noise reduction. This can be set to
nullto turn off. Noise reduction filters audio added to the input audio buffer before it is sent to VAD and the model. Filtering the audio can improve VAD and turn detection accuracy (reducing false positives) and model performance by improving perception of the input audio.-
Optional<NoiseReductionType> typeType of noise reduction.
near_fieldis for close-talking microphones such as headphones,far_fieldis for far-field microphones such as laptop or conference room microphones.-
NEAR_FIELD("near_field") -
FAR_FIELD("far_field")
-
-
-
Optional<AudioTranscription> transcriptionConfiguration for input audio transcription, defaults to off and can be set to
null to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through the /audio/transcriptions endpoint and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service.-
Optional<Delay> delayControls how long the model waits before emitting transcription text. Higher values can improve transcription accuracy at the cost of latency. Only supported with
gpt-realtime-whisperin GA Realtime sessions.-
MINIMAL("minimal") -
LOW("low") -
MEDIUM("medium") -
HIGH("high") -
XHIGH("xhigh")
-
-
Optional<String> languageThe language of the input audio. Supplying the input language in ISO-639-1 (e.g.
en) format will improve accuracy and latency. -
Optional<Model> modelThe model to use for transcription. Current options are
whisper-1,gpt-4o-mini-transcribe,gpt-4o-mini-transcribe-2025-12-15,gpt-4o-transcribe,gpt-4o-transcribe-diarize, andgpt-realtime-whisper. Usegpt-4o-transcribe-diarizewhen you need diarization with speaker labels.-
WHISPER_1("whisper-1") -
GPT_4O_MINI_TRANSCRIBE("gpt-4o-mini-transcribe") -
GPT_4O_MINI_TRANSCRIBE_2025_12_15("gpt-4o-mini-transcribe-2025-12-15") -
GPT_4O_TRANSCRIBE("gpt-4o-transcribe") -
GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize") -
GPT_REALTIME_WHISPER("gpt-realtime-whisper")
-
-
Optional<String> promptAn optional text to guide the model's style or continue a previous audio segment. For
whisper-1, the prompt is a list of keywords. Forgpt-4o-transcribemodels (excludinggpt-4o-transcribe-diarize), the prompt is a free text string, for example "expect words related to technology". Prompt is not supported withgpt-realtime-whisperin GA Realtime sessions.
-
-
Optional<RealtimeAudioInputTurnDetection> turnDetectionConfiguration for turn detection, either Server VAD or Semantic VAD. This can be set to
null to turn off, in which case the client must manually trigger model response. Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.
Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.
For
gpt-realtime-whispertranscription sessions, turn detection must be set tonull; VAD is not supported.-
ServerVad-
JsonValue; type "server_vad"constantType of turn detection,
server_vadto turn on simple Server VAD.SERVER_VAD("server_vad")
-
Optional<Boolean> createResponseWhether or not to automatically generate a response when a VAD stop event occurs. If
interrupt_responseis set tofalsethis may fail to create a response if the model is already responding.If both
create_responseandinterrupt_responseare set tofalse, the model will never respond automatically but VAD events will still be emitted. -
Optional<Long> idleTimeoutMsOptional timeout after which a model response will be triggered automatically. This is useful for situations in which a long pause from the user is unexpected, such as a phone call. The model will effectively prompt the user to continue the conversation based on the current context.
The timeout value will be applied after the last model response's audio has finished playing, i.e. it's set to the
response.donetime plus audio playback duration.An
input_audio_buffer.timeout_triggeredevent (plus events associated with the Response) will be emitted when the timeout is reached. Idle timeout is currently only supported forserver_vadmode. -
Optional<Boolean> interruptResponseWhether or not to automatically interrupt (cancel) any ongoing response with output to the default conversation (i.e.
conversationofauto) when a VAD start event occurs. Iftruethen the response will be cancelled, otherwise it will continue until complete.If both
create_responseandinterrupt_responseare set tofalse, the model will never respond automatically but VAD events will still be emitted. -
Optional<Long> prefixPaddingMsUsed only for
server_vadmode. Amount of audio to include before the VAD detected speech (in milliseconds). Defaults to 300ms. -
Optional<Long> silenceDurationMsUsed only for
server_vadmode. Duration of silence to detect speech stop (in milliseconds). Defaults to 500ms. With shorter values the model will respond more quickly, but may jump in on short pauses from the user. -
Optional<Double> thresholdUsed only for
server_vad mode. Activation threshold for VAD (0.0 to 1.0); defaults to 0.5. A higher threshold will require louder audio to activate the model, and thus might perform better in noisy environments.
-
-
SemanticVad-
JsonValue; type "semantic_vad"constantType of turn detection,
semantic_vadto turn on Semantic VAD.SEMANTIC_VAD("semantic_vad")
-
Optional<Boolean> createResponseWhether or not to automatically generate a response when a VAD stop event occurs.
-
Optional<Eagerness> eagernessUsed only for
semantic_vadmode. The eagerness of the model to respond.lowwill wait longer for the user to continue speaking,highwill respond more quickly.autois the default and is equivalent tomedium.low,medium, andhighhave max timeouts of 8s, 4s, and 2s respectively.-
LOW("low") -
MEDIUM("medium") -
HIGH("high") -
AUTO("auto")
-
-
Optional<Boolean> interruptResponseWhether or not to automatically interrupt any ongoing response with output to the default conversation (i.e.
conversationofauto) when a VAD start event occurs.
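The Server VAD settings above can be illustrated by assembling a turn-detection payload by hand. This is a sketch, not the SDK builder: `ServerVadSketch` is hypothetical, and the JSON keys `threshold` and `silence_duration_ms` are assumed to be the snake_case forms of the Java field names. The defaults (0.5, 300 ms, 500 ms) are the ones stated in the field descriptions.

```java
// Hypothetical hand-built Server VAD turn detection payload.
final class ServerVadSketch {
    static String serverVadJson(double threshold, long prefixPaddingMs, long silenceDurationMs) {
        // The threshold range 0.0 to 1.0 comes from the field description.
        if (Double.isNaN(threshold) || threshold < 0.0 || threshold > 1.0) {
            throw new IllegalArgumentException("threshold must be in [0.0, 1.0]");
        }
        if (prefixPaddingMs < 0 || silenceDurationMs < 0) {
            throw new IllegalArgumentException("durations must be non-negative");
        }
        // Key names are assumptions (snake_case of the Java field names).
        return "{\"type\":\"server_vad\""
                + ",\"threshold\":" + threshold
                + ",\"prefix_padding_ms\":" + prefixPaddingMs
                + ",\"silence_duration_ms\":" + silenceDurationMs
                + "}";
    }

    // Payload using the documented defaults.
    static String defaults() {
        return serverVadJson(0.5, 300L, 500L);
    }
}
```

A shorter `silenceDurationMs` would make the model respond sooner, at the risk of cutting in on short pauses, as noted in the field description.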
-
-
-
Realtime Audio Config Output
-
class RealtimeAudioConfigOutput:-
Optional<RealtimeAudioFormats> formatThe format of the output audio.
-
AudioPcm-
Optional<Rate> rateThe sample rate of the audio. Always
24000._24000(24000)
-
Optional<Type> typeThe audio format. Always
audio/pcm.AUDIO_PCM("audio/pcm")
-
-
AudioPcmu-
Optional<Type> typeThe audio format. Always
audio/pcmu.AUDIO_PCMU("audio/pcmu")
-
-
AudioPcma-
Optional<Type> typeThe audio format. Always
audio/pcma.AUDIO_PCMA("audio/pcma")
-
-
-
Optional<Double> speedThe speed of the model's spoken response as a multiple of the original speed. 1.0 is the default speed. 0.25 is the minimum speed. 1.5 is the maximum speed. This value can only be changed in between model turns, not while a response is in progress.
This parameter is a post-processing adjustment to the audio after it is generated. It's also possible to prompt the model to speak faster or slower.
-
Optional<Voice> voiceThe voice the model uses to respond. Supported built-in voices are
alloy,ash,ballad,coral,echo,sage,shimmer,verse,marin, andcedar. You may also provide a custom voice object with anid, for example{ "id": "voice_1234" }. Voice cannot be changed during the session once the model has responded with audio at least once. We recommendmarinandcedarfor best quality.-
String -
enum UnionMember1:-
ALLOY("alloy") -
ASH("ash") -
BALLAD("ballad") -
CORAL("coral") -
ECHO("echo") -
SAGE("sage") -
SHIMMER("shimmer") -
VERSE("verse") -
MARIN("marin") -
CEDAR("cedar")
-
-
class Id:Custom voice reference.
-
String idThe custom voice ID, e.g.
voice_1234.
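The `voice` field accepts either a built-in voice name or a custom voice object with an `id`. The sketch below shows both shapes as hand-built JSON values. `VoiceSketch` is a hypothetical helper, and the restriction of custom IDs to letters, digits, `_`, and `-` is an assumption made so that no JSON escaping is needed.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper producing the JSON value for the `voice` field.
final class VoiceSketch {
    // Built-in voices listed in the field description.
    private static final Set<String> BUILT_IN = new HashSet<>(Arrays.asList(
            "alloy", "ash", "ballad", "coral", "echo",
            "sage", "shimmer", "verse", "marin", "cedar"));

    // JSON string value for a built-in voice, e.g. "marin".
    static String builtInVoiceJson(String name) {
        if (!BUILT_IN.contains(name)) {
            throw new IllegalArgumentException("not a built-in voice: " + name);
        }
        return "\"" + name + "\"";
    }

    // JSON object value for a custom voice, e.g. {"id":"voice_1234"}.
    static String customVoiceJson(String id) {
        if (id == null || !id.matches("[A-Za-z0-9_-]+")) {
            throw new IllegalArgumentException("unsupported voice id");
        }
        return "{\"id\":\"" + id + "\"}";
    }
}
```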
-
-
-
Realtime Audio Formats
-
class RealtimeAudioFormats: A class that can be one of several variants.unionThe PCM audio format. Only a 24kHz sample rate is supported.
-
AudioPcm-
Optional<Rate> rateThe sample rate of the audio. Always
24000._24000(24000)
-
Optional<Type> typeThe audio format. Always
audio/pcm.AUDIO_PCM("audio/pcm")
-
-
AudioPcmu-
Optional<Type> typeThe audio format. Always
audio/pcmu.AUDIO_PCMU("audio/pcmu")
-
-
AudioPcma-
Optional<Type> typeThe audio format. Always
audio/pcma.AUDIO_PCMA("audio/pcma")
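For the `audio/pcm` variant, the fixed 24000 Hz rate makes it easy to convert between byte counts and durations. This sketch assumes 16-bit mono samples, which gives 48 bytes per millisecond; `PcmDurationSketch` is a hypothetical helper and does not cover `audio/pcmu` or `audio/pcma`.

```java
// Hypothetical size/duration arithmetic for 24 kHz, 16-bit, mono PCM.
final class PcmDurationSketch {
    static final int SAMPLE_RATE_HZ = 24000;
    static final int BYTES_PER_SAMPLE = 2; // assumption: 16-bit mono
    static final int BYTES_PER_MS = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE / 1000; // 48

    // Whole milliseconds of audio contained in byteCount raw bytes.
    static long durationMs(long byteCount) {
        return byteCount / BYTES_PER_MS;
    }

    // Raw bytes needed to hold the given number of milliseconds.
    static long bytesForMs(long ms) {
        return ms * BYTES_PER_MS;
    }
}
```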
-
-
Realtime Audio Input Turn Detection
-
class RealtimeAudioInputTurnDetection: A class that can be one of several variants.unionConfiguration for turn detection, either Server VAD or Semantic VAD. This can be set to
null to turn off, in which case the client must manually trigger model response. Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.
Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.
For
gpt-realtime-whispertranscription sessions, turn detection must be set tonull; VAD is not supported.-
ServerVad-
JsonValue; type "server_vad"constantType of turn detection,
server_vadto turn on simple Server VAD.SERVER_VAD("server_vad")
-
Optional<Boolean> createResponseWhether or not to automatically generate a response when a VAD stop event occurs. If
interrupt_responseis set tofalsethis may fail to create a response if the model is already responding.If both
create_responseandinterrupt_responseare set tofalse, the model will never respond automatically but VAD events will still be emitted. -
Optional<Long> idleTimeoutMsOptional timeout after which a model response will be triggered automatically. This is useful for situations in which a long pause from the user is unexpected, such as a phone call. The model will effectively prompt the user to continue the conversation based on the current context.
The timeout value will be applied after the last model response's audio has finished playing, i.e. it's set to the
response.donetime plus audio playback duration.An
input_audio_buffer.timeout_triggeredevent (plus events associated with the Response) will be emitted when the timeout is reached. Idle timeout is currently only supported forserver_vadmode. -
Optional<Boolean> interruptResponseWhether or not to automatically interrupt (cancel) any ongoing response with output to the default conversation (i.e.
conversationofauto) when a VAD start event occurs. Iftruethen the response will be cancelled, otherwise it will continue until complete.If both
create_responseandinterrupt_responseare set tofalse, the model will never respond automatically but VAD events will still be emitted. -
Optional<Long> prefixPaddingMsUsed only for
server_vadmode. Amount of audio to include before the VAD detected speech (in milliseconds). Defaults to 300ms. -
Optional<Long> silenceDurationMsUsed only for
server_vadmode. Duration of silence to detect speech stop (in milliseconds). Defaults to 500ms. With shorter values the model will respond more quickly, but may jump in on short pauses from the user. -
Optional<Double> thresholdUsed only for
server_vad mode. Activation threshold for VAD (0.0 to 1.0); defaults to 0.5. A higher threshold will require louder audio to activate the model, and thus might perform better in noisy environments.
-
-
SemanticVad-
JsonValue; type "semantic_vad"constantType of turn detection,
semantic_vadto turn on Semantic VAD.SEMANTIC_VAD("semantic_vad")
-
Optional<Boolean> createResponseWhether or not to automatically generate a response when a VAD stop event occurs.
-
Optional<Eagerness> eagernessUsed only for
semantic_vadmode. The eagerness of the model to respond.lowwill wait longer for the user to continue speaking,highwill respond more quickly.autois the default and is equivalent tomedium.low,medium, andhighhave max timeouts of 8s, 4s, and 2s respectively.-
LOW("low") -
MEDIUM("medium") -
HIGH("high") -
AUTO("auto")
-
-
Optional<Boolean> interruptResponseWhether or not to automatically interrupt any ongoing response with output to the default conversation (i.e.
conversationofauto) when a VAD start event occurs.
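The Semantic VAD eagerness levels map to the maximum timeouts stated above (8 s, 4 s, and 2 s, with `auto` equivalent to `medium`). The lookup below is a small sketch of that mapping; `EagernessSketch` is a hypothetical helper that works on the raw string values.

```java
// Hypothetical lookup of the documented max timeout per eagerness level.
final class EagernessSketch {
    static int maxTimeoutSeconds(String eagerness) {
        switch (eagerness) {
            case "low":
                return 8;
            case "medium":
            case "auto": // documented as equivalent to medium
                return 4;
            case "high":
                return 2;
            default:
                throw new IllegalArgumentException("unknown eagerness: " + eagerness);
        }
    }
}
```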
-
-
Realtime Client Event
-
class RealtimeClientEvent: A class that can be one of several variants.unionA realtime client event.
-
class ConversationItemCreateEvent:Add a new Item to the Conversation's context, including messages, function calls, and function call responses. This event can be used both to populate a "history" of the conversation and to add new items mid-stream, but has the current limitation that it cannot populate assistant audio messages.
If successful, the server will respond with a
conversation.item.created event; otherwise an error event will be sent.-
ConversationItem itemA single item within a Realtime conversation.
-
class RealtimeConversationItemSystemMessage:A system message in a Realtime conversation can be used to provide additional context or instructions to the model. This is similar but distinct from the instruction prompt provided at the start of a conversation, as system messages can be added at any point in the conversation. For major changes to the conversation's behavior, use instructions, but for smaller updates (e.g. "the user is now asking about a different topic"), use system messages.
-
List<Content> contentThe content of the message.
-
Optional<String> textThe text content.
-
Optional<Type> typeThe content type. Always
input_textfor system messages.INPUT_TEXT("input_text")
-
-
JsonValue; role "system"constantThe role of the message sender. Always
system.SYSTEM("system")
-
JsonValue; type "message"constantThe type of the item. Always
message.MESSAGE("message")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeConversationItemUserMessage:A user message item in a Realtime conversation.
-
List<Content> contentThe content of the message.
-
Optional<String> audioBase64-encoded audio bytes (for
input_audio). These will be parsed as the format specified in the session input audio type configuration. This defaults to PCM 16-bit 24kHz mono if not specified. -
Optional<Detail> detailThe detail level of the image (for
input_image). auto will default to high.-
AUTO("auto") -
LOW("low") -
HIGH("high")
-
-
Optional<String> imageUrlBase64-encoded image bytes (for
input_image) as a data URI. For example, data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA.... Supported formats are PNG and JPEG. -
Optional<String> textThe text content (for
input_text). -
Optional<String> transcriptTranscript of the audio (for
input_audio). This is not sent to the model, but will be attached to the message item for reference. -
Optional<Type> typeThe content type (
input_text, input_audio, or input_image).-
INPUT_TEXT("input_text") -
INPUT_AUDIO("input_audio") -
INPUT_IMAGE("input_image")
-
-
-
JsonValue; role "user"constantThe role of the message sender. Always
user.USER("user")
-
JsonValue; type "message"constantThe type of the item. Always
message.MESSAGE("message")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeConversationItemAssistantMessage:An assistant message item in a Realtime conversation.
-
List<Content> contentThe content of the message.
-
Optional<String> audioBase64-encoded audio bytes. These will be parsed as the format specified in the session output audio type configuration. This defaults to PCM 16-bit 24kHz mono if not specified.
-
Optional<String> textThe text content.
-
Optional<String> transcriptThe transcript of the audio content. This will always be present if the output type is
audio. -
Optional<Type> typeThe content type,
output_text or output_audio depending on the session output_modalities configuration.-
OUTPUT_TEXT("output_text") -
OUTPUT_AUDIO("output_audio")
-
-
-
JsonValue; role "assistant"constantThe role of the message sender. Always
assistant.ASSISTANT("assistant")
-
JsonValue; type "message"constantThe type of the item. Always
message.MESSAGE("message")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeConversationItemFunctionCall:A function call item in a Realtime conversation.
-
String argumentsThe arguments of the function call. This is a JSON-encoded string representing the arguments passed to the function, for example
{"arg1": "value1", "arg2": 42}. -
String nameThe name of the function being called.
-
JsonValue; type "function_call"constantThe type of the item. Always
function_call.FUNCTION_CALL("function_call")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<String> callIdThe ID of the function call.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeConversationItemFunctionCallOutput:A function call output item in a Realtime conversation.
-
String callIdThe ID of the function call this output is for.
-
String outputThe output of the function call. This is free text and can contain any information or simply be empty.
-
JsonValue; type "function_call_output"constantThe type of the item. Always
function_call_output.FUNCTION_CALL_OUTPUT("function_call_output")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeMcpApprovalResponse:A Realtime item responding to an MCP approval request.
-
String idThe unique ID of the approval response.
-
String approvalRequestIdThe ID of the approval request being answered.
-
boolean approveWhether the request was approved.
-
JsonValue; type "mcp_approval_response"constantThe type of the item. Always
mcp_approval_response.MCP_APPROVAL_RESPONSE("mcp_approval_response")
-
Optional<String> reasonOptional reason for the decision.
-
-
class RealtimeMcpListTools:A Realtime item listing tools available on an MCP server.
-
String serverLabelThe label of the MCP server.
-
List<Tool> toolsThe tools available on the server.
-
JsonValue inputSchemaThe JSON schema describing the tool's input.
-
String nameThe name of the tool.
-
Optional<JsonValue> annotationsAdditional annotations about the tool.
-
Optional<String> descriptionThe description of the tool.
-
-
JsonValue; type "mcp_list_tools"constantThe type of the item. Always
mcp_list_tools.MCP_LIST_TOOLS("mcp_list_tools")
-
Optional<String> idThe unique ID of the list.
-
-
class RealtimeMcpToolCall:A Realtime item representing an invocation of a tool on an MCP server.
-
String idThe unique ID of the tool call.
-
String argumentsA JSON string of the arguments passed to the tool.
-
String nameThe name of the tool that was run.
-
String serverLabelThe label of the MCP server running the tool.
-
JsonValue; type "mcp_call"constantThe type of the item. Always
mcp_call.MCP_CALL("mcp_call")
-
Optional<String> approvalRequestIdThe ID of an associated approval request, if any.
-
Optional<Error> errorThe error from the tool call, if any.
-
class RealtimeMcpProtocolError:-
long code -
String message -
JsonValue; type "protocol_error"constantPROTOCOL_ERROR("protocol_error")
-
-
class RealtimeMcpToolExecutionError:-
String message -
JsonValue; type "tool_execution_error"constantTOOL_EXECUTION_ERROR("tool_execution_error")
-
-
class RealtimeMcphttpError:-
long code -
String message -
JsonValue; type "http_error"constantHTTP_ERROR("http_error")
-
-
-
Optional<String> outputThe output from the tool call.
-
-
class RealtimeMcpApprovalRequest:A Realtime item requesting human approval of a tool invocation.
-
String idThe unique ID of the approval request.
-
String argumentsA JSON string of arguments for the tool.
-
String nameThe name of the tool to run.
-
String serverLabelThe label of the MCP server making the request.
-
JsonValue; type "mcp_approval_request"constantThe type of the item. Always
mcp_approval_request.MCP_APPROVAL_REQUEST("mcp_approval_request")
-
-
-
JsonValue; type "conversation.item.create"constantThe event type, must be
conversation.item.create.CONVERSATION_ITEM_CREATE("conversation.item.create")
-
Optional<String> eventIdOptional client-generated ID used to identify this event.
-
Optional<String> previousItemIdThe ID of the preceding item after which the new item will be inserted. If not set, the new item will be appended to the end of the conversation.
If set to
root, the new item will be added to the beginning of the conversation. If set to an existing ID, it allows an item to be inserted mid-conversation. If the ID cannot be found, an error will be returned and the item will not be added.
-
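As an illustration of the insertion rules for previousItemId, the sketch below builds a conversation.item.create payload for a user text message as raw JSON. It rests on assumptions: the class and method names are hypothetical, the JSON is hand-built to avoid a library dependency, and the snake_case key previous_item_id is assumed to be the wire form of the previousItemId field listed above.

```java
/** Hypothetical helper, not an SDK type: builds conversation.item.create payloads as JSON text. */
final class ItemCreateSketch {
    private ItemCreateSketch() {}

    /** Quotes and escapes a string as a JSON string literal. */
    static String jsonString(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters need a four-digit hex escape.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append('"').toString();
    }

    /**
     * previousItemId == null appends to the end of the conversation,
     * "root" inserts at the beginning, and an existing ID inserts after that item.
     */
    static String userTextItemCreate(String text, String previousItemId) {
        StringBuilder sb = new StringBuilder("{\"type\":\"conversation.item.create\"");
        if (previousItemId != null) {
            sb.append(",\"previous_item_id\":").append(jsonString(previousItemId));
        }
        sb.append(",\"item\":{\"type\":\"message\",\"role\":\"user\",")
          .append("\"content\":[{\"type\":\"input_text\",\"text\":")
          .append(jsonString(text))
          .append("}]}}");
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(userTextItemCreate("Hello", null));   // appended at the end
        System.out.println(userTextItemCreate("Hello", "root")); // inserted at the beginning
    }
}
```

Omitting the key entirely, rather than sending an empty string, keeps the "not set" case distinct from an ID lookup that would fail with an error.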
-
class ConversationItemDeleteEvent:Send this event when you want to remove any item from the conversation history. The server will respond with a
conversation.item.deleted event, unless the item does not exist in the conversation history, in which case the server will respond with an error.-
String itemIdThe ID of the item to delete.
-
JsonValue; type "conversation.item.delete"constantThe event type, must be
conversation.item.delete.CONVERSATION_ITEM_DELETE("conversation.item.delete")
-
Optional<String> eventIdOptional client-generated ID used to identify this event.
-
-
class ConversationItemRetrieveEvent:Send this event when you want to retrieve the server's representation of a specific item in the conversation history. This is useful, for example, to inspect user audio after noise cancellation and VAD. The server will respond with a
conversation.item.retrieved event, unless the item does not exist in the conversation history, in which case the server will respond with an error.-
String itemIdThe ID of the item to retrieve.
-
JsonValue; type "conversation.item.retrieve"constantThe event type, must be
conversation.item.retrieve.CONVERSATION_ITEM_RETRIEVE("conversation.item.retrieve")
-
Optional<String> eventIdOptional client-generated ID used to identify this event.
-
-
class ConversationItemTruncateEvent:Send this event to truncate a previous assistant message’s audio. The server will produce audio faster than realtime, so this event is useful when the user interrupts to truncate audio that has already been sent to the client but not yet played. This will synchronize the server's understanding of the audio with the client's playback.
Truncating audio will delete the server-side text transcript to ensure there is no text in the context that hasn't been heard by the user.
If successful, the server will respond with a
conversation.item.truncated event.-
long audioEndMsInclusive duration up to which audio is truncated, in milliseconds. If the audio_end_ms is greater than the actual audio duration, the server will respond with an error.
-
long contentIndexThe index of the content part to truncate. Set this to
0. -
String itemIdThe ID of the assistant message item to truncate. Only assistant message items can be truncated.
-
JsonValue; type "conversation.item.truncate"constantThe event type, must be
conversation.item.truncate.CONVERSATION_ITEM_TRUNCATE("conversation.item.truncate")
-
Optional<String> eventIdOptional client-generated ID used to identify this event.
-
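To make the truncation arithmetic concrete, the following sketch converts the number of audio bytes the client has actually played into an audioEndMs value and builds the event as raw JSON. It is an assumption-laden illustration: the class and method names are hypothetical, the audio is assumed to use the default PCM 16-bit 24kHz mono output format, the snake_case keys are assumed wire names, and item IDs are restricted to a conservative character set so no JSON escaping is needed.

```java
/** Hypothetical helper, not an SDK type: derives audio_end_ms from client playback progress. */
final class TruncateSketch {
    private TruncateSketch() {}

    // Assumed default output format: PCM 16-bit, 24 kHz, mono.
    static final int SAMPLE_RATE_HZ = 24_000;
    static final int BYTES_PER_SAMPLE = 2;

    /** Milliseconds of audio represented by the PCM bytes already played, rounded down. */
    static long playedMs(long playedBytes) {
        if (playedBytes < 0) {
            throw new IllegalArgumentException("playedBytes must be non-negative");
        }
        long samples = playedBytes / BYTES_PER_SAMPLE;
        return samples * 1000 / SAMPLE_RATE_HZ;
    }

    /** Builds a conversation.item.truncate payload for content part 0 of an assistant item. */
    static String truncateEvent(String itemId, long playedBytes) {
        // Simplification for this sketch: accept only IDs that need no JSON escaping.
        if (!itemId.matches("[A-Za-z0-9_-]+")) {
            throw new IllegalArgumentException("Unexpected characters in item ID");
        }
        return String.format(
            "{\"type\":\"conversation.item.truncate\",\"item_id\":\"%s\",\"content_index\":0,\"audio_end_ms\":%d}",
            itemId, playedMs(playedBytes));
    }

    public static void main(String[] args) {
        // "item_123" is a made-up ID; 72000 bytes is 1.5 seconds at 48000 bytes per second.
        System.out.println(truncateEvent("item_123", 72_000));
    }
}
```

Rounding down is deliberate here: a value larger than the actual audio duration is rejected by the server, while a slightly smaller one only trims a fraction of a millisecond.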
-
class InputAudioBufferAppendEvent:Send this event to append audio bytes to the input audio buffer. The audio buffer is temporary storage you can write to and later commit. A "commit" will create a new user message item in the conversation history from the buffer content and clear the buffer. Input audio transcription (if enabled) will be generated when the buffer is committed.
If VAD is enabled, the audio buffer is used to detect speech and the server will decide when to commit. When Server VAD is disabled, you must commit the audio buffer manually. Input audio noise reduction operates on writes to the audio buffer.
The client may choose how much audio to place in each event, up to a maximum of 15 MiB. For example, streaming smaller chunks from the client may allow the VAD to be more responsive. Unlike most other client events, the server will not send a confirmation response to this event.
-
String audioBase64-encoded audio bytes. This must be in the format specified by the
input_audio_format field in the session configuration. -
JsonValue; type "input_audio_buffer.append"constantThe event type, must be
input_audio_buffer.append.INPUT_AUDIO_BUFFER_APPEND("input_audio_buffer.append")
-
Optional<String> eventIdOptional client-generated ID used to identify this event.
-
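The chunked streaming described above might look like the sketch below, which splits raw PCM audio into fixed-size pieces and wraps each one in an input_audio_buffer.append payload. The class name, method name, and the 100 ms chunk size are illustrative choices rather than anything the API prescribes, and the chunk size assumes the default PCM 16-bit 24kHz mono input format (48000 bytes per second).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Base64;
import java.util.List;

/** Hypothetical helper, not an SDK type: splits PCM audio into append events. */
final class AppendSketch {
    private AppendSketch() {}

    // 100 ms of 16-bit mono audio at 24 kHz: 24000 samples/s * 2 bytes * 0.1 s.
    static final int CHUNK_BYTES_100_MS = 4_800;

    /** Returns one input_audio_buffer.append JSON payload per chunk, in order. */
    static List<String> appendEvents(byte[] pcm, int chunkBytes) {
        if (chunkBytes <= 0) {
            throw new IllegalArgumentException("chunkBytes must be positive");
        }
        List<String> events = new ArrayList<>();
        for (int offset = 0; offset < pcm.length; offset += chunkBytes) {
            int end = Math.min(offset + chunkBytes, pcm.length);
            byte[] chunk = Arrays.copyOfRange(pcm, offset, end);
            // Standard Base64 output contains no characters that need JSON escaping.
            String audio = Base64.getEncoder().encodeToString(chunk);
            events.add("{\"type\":\"input_audio_buffer.append\",\"audio\":\"" + audio + "\"}");
        }
        return events;
    }

    public static void main(String[] args) {
        byte[] silence = new byte[10_000]; // placeholder audio for the demo
        List<String> events = appendEvents(silence, CHUNK_BYTES_100_MS);
        System.out.println(events.size() + " events");
    }
}
```

Because the server sends no confirmation for this event, the caller only needs to write the payloads in order; with VAD disabled, a commit event still has to follow.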
-
class InputAudioBufferClearEvent:Send this event to clear the audio bytes in the buffer. The server will respond with an
input_audio_buffer.cleared event.-
JsonValue; type "input_audio_buffer.clear"constantThe event type, must be
input_audio_buffer.clear.INPUT_AUDIO_BUFFER_CLEAR("input_audio_buffer.clear")
-
Optional<String> eventIdOptional client-generated ID used to identify this event.
-
-
class OutputAudioBufferClearEvent:WebRTC/SIP Only: Emit to cut off the current audio response. This will trigger the server to stop generating audio and emit an
output_audio_buffer.cleared event. This event should be preceded by a response.cancel client event to stop the generation of the current response.-
JsonValue; type "output_audio_buffer.clear"constantThe event type, must be
output_audio_buffer.clear.OUTPUT_AUDIO_BUFFER_CLEAR("output_audio_buffer.clear")
-
Optional<String> eventIdThe unique ID of the client event used for error handling.
-
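The ordering requirement above (cancel first, then clear the output buffer) can be sketched as a tiny helper. Everything except the two event type strings is an assumption: the class name is hypothetical, and the Consumer stands in for whatever transport the application uses to send JSON text to the server.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical helper, not an SDK type: sends the interruption events in the documented order. */
final class InterruptSketch {
    private InterruptSketch() {}

    /** send is a stand-in for the application's own transport. */
    static void interrupt(Consumer<String> send) {
        // 1. Stop generation of the current response.
        send.accept("{\"type\":\"response.cancel\"}");
        // 2. Then cut off audio that is already buffered (WebRTC/SIP only).
        send.accept("{\"type\":\"output_audio_buffer.clear\"}");
    }

    public static void main(String[] args) {
        List<String> sent = new ArrayList<>();
        interrupt(sent::add);
        sent.forEach(System.out::println);
    }
}
```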
-
class InputAudioBufferCommitEvent:Send this event to commit the user input audio buffer, which will create a new user message item in the conversation. This event will produce an error if the input audio buffer is empty. When in Server VAD mode, the client does not need to send this event; the server will commit the audio buffer automatically.
Committing the input audio buffer will trigger input audio transcription (if enabled in session configuration), but it will not create a response from the model. The server will respond with an
input_audio_buffer.committed event.-
JsonValue; type "input_audio_buffer.commit"constantThe event type, must be
input_audio_buffer.commit.INPUT_AUDIO_BUFFER_COMMIT("input_audio_buffer.commit")
-
Optional<String> eventIdOptional client-generated ID used to identify this event.
-
-
class ResponseCancelEvent:Send this event to cancel an in-progress response. The server will respond with a
response.done event with a status of response.status=cancelled. If there is no response to cancel, the server will respond with an error. It's safe to call response.cancel even if no response is in progress; an error will be returned, but the session will remain unaffected.-
JsonValue; type "response.cancel"constantThe event type, must be
response.cancel.RESPONSE_CANCEL("response.cancel")
-
Optional<String> eventIdOptional client-generated ID used to identify this event.
-
Optional<String> responseIdA specific response ID to cancel - if not provided, will cancel an in-progress response in the default conversation.
-
-
class ResponseCreateEvent:This event instructs the server to create a Response, which means triggering model inference. When in Server VAD mode, the server will create Responses automatically.
A Response will include at least one Item, and may have two, in which case the second will be a function call. These Items will be appended to the conversation history by default.
The server will respond with a response.created event, events for Items and content created, and finally a response.done event to indicate the Response is complete.
The response.create event includes inference configuration like instructions and tools. If these are set, they will override the Session's configuration for this Response only.
Responses can be created out-of-band of the default Conversation, meaning that they can have arbitrary input, and it's possible to disable writing the output to the Conversation. Only one Response can write to the default Conversation at a time, but otherwise multiple Responses can be created in parallel. The metadata field is a good way to disambiguate multiple simultaneous Responses.
Clients can set conversation to none to create a Response that does not write to the default Conversation. Arbitrary input can be provided with the input field, which is an array accepting raw Items and references to existing Items.-
JsonValue; type "response.create"constantThe event type, must be
response.create.RESPONSE_CREATE("response.create")
-
Optional<String> eventIdOptional client-generated ID used to identify this event.
-
Optional<RealtimeResponseCreateParams> responseCreate a new Realtime response with these parameters
-
Optional<RealtimeResponseCreateAudioOutput> audioConfiguration for audio input and output.
-
Optional<Output> output-
Optional<RealtimeAudioFormats> formatThe format of the output audio.
-
AudioPcm-
Optional<Rate> rateThe sample rate of the audio. Always
24000._24000(24000)
-
Optional<Type> typeThe audio format. Always
audio/pcm.AUDIO_PCM("audio/pcm")
-
-
AudioPcmu-
Optional<Type> typeThe audio format. Always
audio/pcmu.AUDIO_PCMU("audio/pcmu")
-
-
AudioPcma-
Optional<Type> typeThe audio format. Always
audio/pcma.AUDIO_PCMA("audio/pcma")
-
-
-
Optional<Voice> voiceThe voice the model uses to respond. Supported built-in voices are
alloy, ash, ballad, coral, echo, sage, shimmer, verse, marin, and cedar. You may also provide a custom voice object with an id, for example { "id": "voice_1234" }. Voice cannot be changed during the session once the model has responded with audio at least once. We recommend marin and cedar for best quality.-
String -
enum UnionMember1:-
ALLOY("alloy") -
ASH("ash") -
BALLAD("ballad") -
CORAL("coral") -
ECHO("echo") -
SAGE("sage") -
SHIMMER("shimmer") -
VERSE("verse") -
MARIN("marin") -
CEDAR("cedar")
-
-
class Id:Custom voice reference.
-
String idThe custom voice ID, e.g.
voice_1234.
-
-
-
-
-
Optional<Conversation> conversationControls which conversation the response is added to. Currently supports
auto and none, with auto as the default value. The auto value means that the contents of the response will be added to the default conversation. Set this to none to create an out-of-band response which will not add items to the default conversation.-
AUTO("auto") -
NONE("none")
-
-
Optional<List<ConversationItem>> inputInput items to include in the prompt for the model. Using this field creates a new context for this Response instead of using the default conversation. An empty array
[] will clear the context for this Response. Note that this can include references to items that previously appeared in the session using their id.-
class RealtimeConversationItemSystemMessage:A system message in a Realtime conversation can be used to provide additional context or instructions to the model. This is similar but distinct from the instruction prompt provided at the start of a conversation, as system messages can be added at any point in the conversation. For major changes to the conversation's behavior, use instructions, but for smaller updates (e.g. "the user is now asking about a different topic"), use system messages.
-
class RealtimeConversationItemUserMessage:A user message item in a Realtime conversation.
-
class RealtimeConversationItemAssistantMessage:An assistant message item in a Realtime conversation.
-
class RealtimeConversationItemFunctionCall:A function call item in a Realtime conversation.
-
class RealtimeConversationItemFunctionCallOutput:A function call output item in a Realtime conversation.
-
class RealtimeMcpApprovalResponse:A Realtime item responding to an MCP approval request.
-
class RealtimeMcpListTools:A Realtime item listing tools available on an MCP server.
-
class RealtimeMcpToolCall:A Realtime item representing an invocation of a tool on an MCP server.
-
class RealtimeMcpApprovalRequest:A Realtime item requesting human approval of a tool invocation.
-
-
Optional<String> instructionsThe default system instructions (i.e. system message) prepended to model calls. This field allows the client to guide the model on desired responses. The model can be instructed on response content and format (e.g. "be extremely succinct", "act friendly", "here are examples of good responses") and on audio behavior (e.g. "talk quickly", "inject emotion into your voice", "laugh frequently"). The instructions are not guaranteed to be followed by the model, but they provide guidance to the model on the desired behavior. Note that the server sets default instructions which will be used if this field is not set and are visible in the
session.created event at the start of the session. -
Optional<MaxOutputTokens> maxOutputTokensMaximum number of output tokens for a single assistant response, inclusive of tool calls. Provide an integer between 1 and 4096 to limit output tokens, or
inf for the maximum available tokens for a given model. Defaults to inf.-
long -
JsonValue;INF("inf")
-
-
Optional<Metadata> metadataSet of up to 16 key-value pairs that can be attached to an object. This can be useful for storing additional information about the object in a structured format, and querying for objects via API or the dashboard.
Keys are strings with a maximum length of 64 characters. Values are strings with a maximum length of 512 characters.
-
Optional<List<OutputModality>> outputModalitiesThe set of modalities the model uses to respond. Currently the only possible values are
["audio"] and ["text"]. Audio output always includes a text transcript. Setting the output to mode text will disable audio output from the model.-
TEXT("text") -
AUDIO("audio")
-
-
Optional<Boolean> parallelToolCallsWhether the model may call multiple tools in parallel. Only supported by reasoning Realtime models such as
gpt-realtime-2. -
Optional<ResponsePrompt> promptReference to a prompt template and its variables.
-
String idThe unique identifier of the prompt template to use.
-
Optional<Variables> variablesOptional map of values to substitute in for variables in your prompt. The substitution values can either be strings, or other Response input types like images or files.
-
String -
class ResponseInputText:A text input to the model.
-
String textThe text input to the model.
-
JsonValue; type "input_text"constantThe type of the input item. Always
input_text.INPUT_TEXT("input_text")
-
-
class ResponseInputImage:An image input to the model.
-
Detail detailThe detail level of the image to be sent to the model. One of
high, low, auto, or original. Defaults to auto.-
LOW("low") -
HIGH("high") -
AUTO("auto") -
ORIGINAL("original")
-
-
JsonValue; type "input_image"constantThe type of the input item. Always
input_image.INPUT_IMAGE("input_image")
-
Optional<String> fileIdThe ID of the file to be sent to the model.
-
Optional<String> imageUrlThe URL of the image to be sent to the model. A fully qualified URL or base64 encoded image in a data URL.
-
-
class ResponseInputFile:A file input to the model.
-
JsonValue; type "input_file"constantThe type of the input item. Always
input_file.INPUT_FILE("input_file")
-
Optional<Detail> detailThe detail level of the file to be sent to the model. Use
low for the default rendering behavior, or high to render the file at higher quality. Defaults to low.-
LOW("low") -
HIGH("high")
-
-
Optional<String> fileDataThe content of the file to be sent to the model.
-
Optional<String> fileIdThe ID of the file to be sent to the model.
-
Optional<String> fileUrlThe URL of the file to be sent to the model.
-
Optional<String> filenameThe name of the file to be sent to the model.
-
-
-
Optional<String> versionOptional version of the prompt template.
-
-
Optional<RealtimeReasoning> reasoningConfiguration for reasoning-capable Realtime models such as
gpt-realtime-2.-
Optional<RealtimeReasoningEffort> effortConstrains effort on reasoning for reasoning-capable Realtime models such as
gpt-realtime-2.-
MINIMAL("minimal") -
LOW("low") -
MEDIUM("medium") -
HIGH("high") -
XHIGH("xhigh")
-
-
-
Optional<ToolChoice> toolChoiceHow the model chooses tools. Provide one of the string modes or force a specific function/MCP tool.
-
enum ToolChoiceOptions:Controls which (if any) tool is called by the model.
none means the model will not call any tool and instead generates a message. auto means the model can pick between generating a message or calling one or more tools. required means the model must call one or more tools.-
NONE("none") -
AUTO("auto") -
REQUIRED("required")
-
-
class ToolChoiceFunction:Use this option to force the model to call a specific function.
-
String nameThe name of the function to call.
-
JsonValue; type "function"constantFor function calling, the type is always
function.FUNCTION("function")
-
-
class ToolChoiceMcp:Use this option to force the model to call a specific tool on a remote MCP server.
-
String serverLabelThe label of the MCP server to use.
-
JsonValue; type "mcp"constantFor MCP tools, the type is always
mcp.MCP("mcp")
-
Optional<String> nameThe name of the tool to call on the server.
-
-
-
Optional<List<Tool>> toolsTools available to the model.
-
class RealtimeFunctionTool:-
Optional<String> descriptionThe description of the function, including guidance on when and how to call it, and guidance about what to tell the user when calling (if anything).
-
Optional<String> nameThe name of the function.
-
Optional<JsonValue> parametersParameters of the function in JSON Schema.
-
Optional<Type> typeThe type of the tool, i.e.
function.FUNCTION("function")
-
-
class RealtimeResponseCreateMcpTool:Give the model access to additional tools via remote Model Context Protocol (MCP) servers.
-
String serverLabelA label for this MCP server, used to identify it in tool calls.
-
JsonValue; type "mcp"constantThe type of the MCP tool. Always
mcp.MCP("mcp")
-
Optional<AllowedTools> allowedToolsList of allowed tool names or a filter object.
-
List<String> -
class McpToolFilter:A filter object to specify which tools are allowed.
-
Optional<Boolean> readOnlyIndicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with
readOnlyHint, it will match this filter. -
Optional<List<String>> toolNamesList of allowed tool names.
-
-
-
Optional<String> authorizationAn OAuth access token that can be used with a remote MCP server, either with a custom MCP server URL or a service connector. Your application must handle the OAuth authorization flow and provide the token here.
-
Optional<ConnectorId> connectorIdIdentifier for service connectors, like those available in ChatGPT. One of
server_url or connector_id must be provided.
Currently supported connector_id values are:-
Dropbox: connector_dropbox -
Gmail: connector_gmail -
Google Calendar: connector_googlecalendar -
Google Drive: connector_googledrive -
Microsoft Teams: connector_microsoftteams -
Outlook Calendar: connector_outlookcalendar -
Outlook Email: connector_outlookemail -
SharePoint: connector_sharepoint -
CONNECTOR_DROPBOX("connector_dropbox") -
CONNECTOR_GMAIL("connector_gmail") -
CONNECTOR_GOOGLECALENDAR("connector_googlecalendar") -
CONNECTOR_GOOGLEDRIVE("connector_googledrive") -
CONNECTOR_MICROSOFTTEAMS("connector_microsoftteams") -
CONNECTOR_OUTLOOKCALENDAR("connector_outlookcalendar") -
CONNECTOR_OUTLOOKEMAIL("connector_outlookemail") -
CONNECTOR_SHAREPOINT("connector_sharepoint")
-
-
Optional<Boolean> deferLoadingWhether this MCP tool is deferred and discovered via tool search.
-
Optional<Headers> headersOptional HTTP headers to send to the MCP server. Use for authentication or other purposes.
-
Optional<RequireApproval> requireApprovalSpecify which of the MCP server's tools require approval.
-
class McpToolApprovalFilter:Specify which of the MCP server's tools require approval. Can be
always, never, or a filter object associated with tools that require approval.-
Optional<Always> alwaysA filter object to specify which tools are allowed.
-
Optional<Boolean> readOnlyIndicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with
readOnlyHint, it will match this filter. -
Optional<List<String>> toolNamesList of allowed tool names.
-
-
Optional<Never> neverA filter object to specify which tools are allowed.
-
Optional<Boolean> readOnlyIndicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with
readOnlyHint, it will match this filter. -
Optional<List<String>> toolNamesList of allowed tool names.
-
-
-
enum McpToolApprovalSetting:Specify a single approval policy for all tools. One of
always or never. When set to always, all tools will require approval. When set to never, no tools will require approval.-
ALWAYS("always") -
NEVER("never")
-
-
-
Optional<String> serverDescriptionOptional description of the MCP server, used to provide more context.
-
Optional<String> serverUrlThe URL for the MCP server. One of
server_url or connector_id must be provided.
-
-
-
-
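The out-of-band pattern described for response.create (conversation set to none, an explicit input array, and metadata to tell parallel Responses apart) can be sketched as follows. The class and method names are hypothetical, the JSON is hand-built to stay dependency-free, and the limits checked are the ones stated for the metadata field above, read as "at most 16 pairs".

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not an SDK type: builds an out-of-band response.create payload. */
final class OutOfBandResponseSketch {
    private OutOfBandResponseSketch() {}

    static final int MAX_PAIRS = 16;
    static final int MAX_KEY_LENGTH = 64;
    static final int MAX_VALUE_LENGTH = 512;

    /** Rejects metadata that exceeds the documented size limits. */
    static void validateMetadata(Map<String, String> metadata) {
        if (metadata.size() > MAX_PAIRS) {
            throw new IllegalArgumentException("At most " + MAX_PAIRS + " metadata pairs");
        }
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            if (e.getKey().length() > MAX_KEY_LENGTH) {
                throw new IllegalArgumentException("Metadata key too long: " + e.getKey());
            }
            if (e.getValue().length() > MAX_VALUE_LENGTH) {
                throw new IllegalArgumentException("Metadata value too long for key: " + e.getKey());
            }
        }
    }

    /** Quotes a string for JSON, escaping quotes, backslashes, and control characters. */
    static String jsonString(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            if (c == '"' || c == '\\') {
                sb.append('\\').append(c);
            } else if (c < 0x20) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    /** Empty input clears the context, so only the instructions reach the model. */
    static String responseCreate(String instructions, Map<String, String> metadata) {
        validateMetadata(metadata);
        StringBuilder sb = new StringBuilder(
            "{\"type\":\"response.create\",\"response\":{\"conversation\":\"none\",\"input\":[]");
        sb.append(",\"instructions\":").append(jsonString(instructions));
        sb.append(",\"metadata\":{");
        boolean first = true;
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            if (!first) {
                sb.append(',');
            }
            first = false;
            sb.append(jsonString(e.getKey())).append(':').append(jsonString(e.getValue()));
        }
        return sb.append("}}}").toString();
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("topic", "billing"); // made-up tag used to match the later response.done
        System.out.println(responseCreate("Classify the last user turn.", metadata));
    }
}
```

Validating locally is a convenience rather than a requirement; it simply surfaces an oversized key or value before the event is sent.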
-
class SessionUpdateEvent:Send this event to update the session’s configuration. The client may send this event at any time to update any field except for
voice and model. voice can be updated only if there have been no other audio outputs yet.
When the server receives a session.update, it will respond with a session.updated event showing the full, effective configuration. Only the fields that are present in the session.update are updated. To clear a field like instructions, pass an empty string. To clear a field like tools, pass an empty array. To clear a field like turn_detection, pass null.-
Session sessionUpdate the Realtime session. Choose either a realtime session or a transcription session.
-
class RealtimeSessionCreateRequest:Realtime session object configuration.
-
JsonValue; type "realtime"constantThe type of session to create. Always
realtimefor the Realtime API.REALTIME("realtime")
-
Optional<RealtimeAudioConfig> audioConfiguration for input and output audio.
-
Optional<RealtimeAudioConfigInput> input-
Optional<RealtimeAudioFormats> formatThe format of the input audio.
-
Optional<NoiseReduction> noiseReductionConfiguration for input audio noise reduction. This can be set to
null to turn off. Noise reduction filters audio added to the input audio buffer before it is sent to VAD and the model. Filtering the audio can improve VAD and turn detection accuracy (reducing false positives) and model performance by improving perception of the input audio.-
Optional<NoiseReductionType> typeType of noise reduction.
near_field is for close-talking microphones such as headphones, far_field is for far-field microphones such as laptop or conference room microphones.-
NEAR_FIELD("near_field") -
FAR_FIELD("far_field")
-
-
-
Optional<AudioTranscription> transcriptionConfiguration for input audio transcription, defaults to off and can be set to
null to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through the /audio/transcriptions endpoint and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service.-
Optional<Delay> delayControls how long the model waits before emitting transcription text. Higher values can improve transcription accuracy at the cost of latency. Only supported with
gpt-realtime-whisperin GA Realtime sessions.-
MINIMAL("minimal") -
LOW("low") -
MEDIUM("medium") -
HIGH("high") -
XHIGH("xhigh")
-
-
Optional<String> languageThe language of the input audio. Supplying the input language in ISO-639-1 format (e.g. en) will improve accuracy and latency. -
Optional<Model> modelThe model to use for transcription. Current options are whisper-1, gpt-4o-mini-transcribe, gpt-4o-mini-transcribe-2025-12-15, gpt-4o-transcribe, gpt-4o-transcribe-diarize, and gpt-realtime-whisper. Use gpt-4o-transcribe-diarize when you need diarization with speaker labels.-
WHISPER_1("whisper-1") -
GPT_4O_MINI_TRANSCRIBE("gpt-4o-mini-transcribe") -
GPT_4O_MINI_TRANSCRIBE_2025_12_15("gpt-4o-mini-transcribe-2025-12-15") -
GPT_4O_TRANSCRIBE("gpt-4o-transcribe") -
GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize") -
GPT_REALTIME_WHISPER("gpt-realtime-whisper")
-
-
Optional<String> promptAn optional text to guide the model's style or continue a previous audio segment. For whisper-1, the prompt is a list of keywords. For gpt-4o-transcribe models (excluding gpt-4o-transcribe-diarize), the prompt is a free text string, for example "expect words related to technology". Prompt is not supported with gpt-realtime-whisper in GA Realtime sessions.
-
-
Optional<RealtimeAudioInputTurnDetection> turnDetectionConfiguration for turn detection, either Server VAD or Semantic VAD. This can be set to null to turn off, in which case the client must manually trigger model response.
Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.
Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.
For gpt-realtime-whisper transcription sessions, turn detection must be set to null; VAD is not supported.-
ServerVad-
JsonValue; type "server_vad"constantType of turn detection, server_vad to turn on simple Server VAD. SERVER_VAD("server_vad")
-
Optional<Boolean> createResponseWhether or not to automatically generate a response when a VAD stop event occurs. If interrupt_response is set to false, this may fail to create a response if the model is already responding.
If both create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted. -
Optional<Long> idleTimeoutMsOptional timeout after which a model response will be triggered automatically. This is useful for situations in which a long pause from the user is unexpected, such as a phone call. The model will effectively prompt the user to continue the conversation based on the current context.
The timeout value will be applied after the last model response's audio has finished playing, i.e. it's set to the response.done time plus audio playback duration.
An input_audio_buffer.timeout_triggered event (plus events associated with the Response) will be emitted when the timeout is reached. Idle timeout is currently only supported for server_vad mode. -
Optional<Boolean> interruptResponseWhether or not to automatically interrupt (cancel) any ongoing response with output to the default conversation (i.e. conversation of auto) when a VAD start event occurs. If true, the response will be cancelled; otherwise it will continue until complete.
If both create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted. -
Optional<Long> prefixPaddingMsUsed only for server_vad mode. Amount of audio to include before the VAD detected speech (in milliseconds). Defaults to 300ms. -
Optional<Long> silenceDurationMsUsed only for server_vad mode. Duration of silence to detect speech stop (in milliseconds). Defaults to 500ms. With shorter values the model will respond more quickly, but may jump in on short pauses from the user. -
Optional<Double> thresholdUsed only for server_vad mode. Activation threshold for VAD (0.0 to 1.0). Defaults to 0.5. A higher threshold will require louder audio to activate the model, and thus might perform better in noisy environments.
-
-
SemanticVad-
JsonValue; type "semantic_vad"constantType of turn detection, semantic_vad to turn on Semantic VAD. SEMANTIC_VAD("semantic_vad")
-
Optional<Boolean> createResponseWhether or not to automatically generate a response when a VAD stop event occurs.
-
Optional<Eagerness> eagernessUsed only for semantic_vad mode. The eagerness of the model to respond. low will wait longer for the user to continue speaking; high will respond more quickly. auto is the default and is equivalent to medium. low, medium, and high have max timeouts of 8s, 4s, and 2s respectively.-
LOW("low") -
MEDIUM("medium") -
HIGH("high") -
AUTO("auto")
-
-
Optional<Boolean> interruptResponseWhether or not to automatically interrupt any ongoing response with output to the default conversation (i.e. conversation of auto) when a VAD start event occurs.
-
-
-
-
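The eagerness values described for semantic_vad can be summarized as a small lookup. The sketch below is only an illustration of the documented maximum timeouts (8s, 4s, 2s, with auto treated as medium); the EagernessTimeout enum and its methods are hypothetical names and are not part of the SDK.

```java
import java.time.Duration;

/** Hypothetical helper mirroring the documented eagerness values; not an SDK type. */
enum EagernessTimeout {
    LOW("low", Duration.ofSeconds(8)),
    MEDIUM("medium", Duration.ofSeconds(4)),
    HIGH("high", Duration.ofSeconds(2)),
    // The reference describes "auto" as equivalent to "medium".
    AUTO("auto", Duration.ofSeconds(4));

    private final String wireValue;
    private final Duration maxTimeout;

    EagernessTimeout(String wireValue, Duration maxTimeout) {
        this.wireValue = wireValue;
        this.maxTimeout = maxTimeout;
    }

    String wireValue() {
        return wireValue;
    }

    Duration maxTimeout() {
        return maxTimeout;
    }

    /** Looks up an eagerness level by the string value used on the wire. */
    static EagernessTimeout fromWire(String value) {
        for (EagernessTimeout e : values()) {
            if (e.wireValue.equals(value)) {
                return e;
            }
        }
        throw new IllegalArgumentException("Unknown eagerness: " + value);
    }
}
```

A client could use such a lookup to size its own UI timers, for example waiting at most EagernessTimeout.fromWire("low").maxTimeout() before showing a "still listening" hint.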
Optional<RealtimeAudioConfigOutput> output-
Optional<RealtimeAudioFormats> formatThe format of the output audio.
-
Optional<Double> speedThe speed of the model's spoken response as a multiple of the original speed. 1.0 is the default speed. 0.25 is the minimum speed. 1.5 is the maximum speed. This value can only be changed in between model turns, not while a response is in progress.
This parameter is a post-processing adjustment to the audio after it is generated; it's also possible to prompt the model to speak faster or slower.
-
Optional<Voice> voiceThe voice the model uses to respond. Supported built-in voices are alloy, ash, ballad, coral, echo, sage, shimmer, verse, marin, and cedar. You may also provide a custom voice object with an id, for example { "id": "voice_1234" }. Voice cannot be changed during the session once the model has responded with audio at least once. We recommend marin and cedar for best quality.-
String -
enum UnionMember1:-
ALLOY("alloy") -
ASH("ash") -
BALLAD("ballad") -
CORAL("coral") -
ECHO("echo") -
SAGE("sage") -
SHIMMER("shimmer") -
VERSE("verse") -
MARIN("marin") -
CEDAR("cedar")
-
-
class Id:Custom voice reference.
-
String idThe custom voice ID, e.g.
voice_1234.
-
-
-
-
-
Optional<List<Include>> includeAdditional fields to include in server outputs.
item.input_audio_transcription.logprobs: Include logprobs for input audio transcription.ITEM_INPUT_AUDIO_TRANSCRIPTION_LOGPROBS("item.input_audio_transcription.logprobs")
-
Optional<String> instructionsThe default system instructions (i.e. system message) prepended to model calls. This field allows the client to guide the model on desired responses. The model can be instructed on response content and format, (e.g. "be extremely succinct", "act friendly", "here are examples of good responses") and on audio behavior (e.g. "talk quickly", "inject emotion into your voice", "laugh frequently"). The instructions are not guaranteed to be followed by the model, but they provide guidance to the model on the desired behavior.
Note that the server sets default instructions which will be used if this field is not set and are visible in the session.created event at the start of the session. -
Optional<MaxOutputTokens> maxOutputTokensMaximum number of output tokens for a single assistant response, inclusive of tool calls. Provide an integer between 1 and 4096 to limit output tokens, or inf for the maximum available tokens for a given model. Defaults to inf.-
long -
JsonValue;INF("inf")
-
-
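As a rough illustration of the maxOutputTokens rule above (an integer from 1 to 4096, or inf), the following sketch validates a limit and renders the JSON literal a client might send. MaxOutputTokensSketch and toJsonLiteral are hypothetical names introduced for this example, and treating a missing limit as inf simply mirrors the documented default.

```java
/** Hypothetical validation helper; not part of the SDK. */
final class MaxOutputTokensSketch {

    /**
     * Returns the JSON literal for the limit: a bare number for 1..4096,
     * or the quoted string "inf" when no limit is supplied.
     */
    static String toJsonLiteral(Long limit) {
        if (limit == null) {
            return "\"inf\""; // documented default
        }
        if (limit < 1 || limit > 4096) {
            throw new IllegalArgumentException(
                "max output tokens must be between 1 and 4096, got " + limit);
        }
        return Long.toString(limit);
    }

    public static void main(String[] args) {
        System.out.println(toJsonLiteral(1024L)); // bare number
        System.out.println(toJsonLiteral(null));  // quoted "inf"
    }
}
```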
Optional<Model> modelThe Realtime model used for this session.
-
GPT_REALTIME("gpt-realtime") -
GPT_REALTIME_1_5("gpt-realtime-1.5") -
GPT_REALTIME_2("gpt-realtime-2") -
GPT_REALTIME_2025_08_28("gpt-realtime-2025-08-28") -
GPT_4O_REALTIME_PREVIEW("gpt-4o-realtime-preview") -
GPT_4O_REALTIME_PREVIEW_2024_10_01("gpt-4o-realtime-preview-2024-10-01") -
GPT_4O_REALTIME_PREVIEW_2024_12_17("gpt-4o-realtime-preview-2024-12-17") -
GPT_4O_REALTIME_PREVIEW_2025_06_03("gpt-4o-realtime-preview-2025-06-03") -
GPT_4O_MINI_REALTIME_PREVIEW("gpt-4o-mini-realtime-preview") -
GPT_4O_MINI_REALTIME_PREVIEW_2024_12_17("gpt-4o-mini-realtime-preview-2024-12-17") -
GPT_REALTIME_MINI("gpt-realtime-mini") -
GPT_REALTIME_MINI_2025_10_06("gpt-realtime-mini-2025-10-06") -
GPT_REALTIME_MINI_2025_12_15("gpt-realtime-mini-2025-12-15") -
GPT_AUDIO_1_5("gpt-audio-1.5") -
GPT_AUDIO_MINI("gpt-audio-mini") -
GPT_AUDIO_MINI_2025_10_06("gpt-audio-mini-2025-10-06") -
GPT_AUDIO_MINI_2025_12_15("gpt-audio-mini-2025-12-15")
-
-
Optional<List<OutputModality>> outputModalitiesThe set of modalities the model can respond with. It defaults to ["audio"], indicating that the model will respond with audio plus a transcript. ["text"] can be used to make the model respond with text only. It is not possible to request both text and audio at the same time.-
TEXT("text") -
AUDIO("audio")
-
-
Optional<Boolean> parallelToolCallsWhether the model may call multiple tools in parallel. Only supported by reasoning Realtime models such as
gpt-realtime-2. -
Optional<ResponsePrompt> promptReference to a prompt template and its variables. Learn more.
-
Optional<RealtimeReasoning> reasoningConfiguration for reasoning-capable Realtime models such as
gpt-realtime-2. -
Optional<RealtimeToolChoiceConfig> toolChoiceHow the model chooses tools. Provide one of the string modes or force a specific function/MCP tool.
-
enum ToolChoiceOptions:Controls which (if any) tool is called by the model. none means the model will not call any tool and instead generates a message. auto means the model can pick between generating a message or calling one or more tools. required means the model must call one or more tools. -
class ToolChoiceFunction:Use this option to force the model to call a specific function.
-
class ToolChoiceMcp:Use this option to force the model to call a specific tool on a remote MCP server.
-
-
Optional<List<RealtimeToolsConfigUnion>> toolsTools available to the model.
-
class RealtimeFunctionTool: -
Mcp-
String serverLabelA label for this MCP server, used to identify it in tool calls.
-
JsonValue; type "mcp"constantThe type of the MCP tool. Always
mcp.MCP("mcp")
-
Optional<AllowedTools> allowedToolsList of allowed tool names or a filter object.
-
List<String> -
class McpToolFilter:A filter object to specify which tools are allowed.
-
Optional<Boolean> readOnlyIndicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with
readOnlyHint, it will match this filter. -
Optional<List<String>> toolNamesList of allowed tool names.
-
-
-
Optional<String> authorizationAn OAuth access token that can be used with a remote MCP server, either with a custom MCP server URL or a service connector. Your application must handle the OAuth authorization flow and provide the token here.
-
Optional<ConnectorId> connectorIdIdentifier for service connectors, like those available in ChatGPT. One of server_url or connector_id must be provided. Learn more about service connectors here.
Currently supported connector_id values are:-
Dropbox:
connector_dropbox -
Gmail:
connector_gmail -
Google Calendar:
connector_googlecalendar -
Google Drive:
connector_googledrive -
Microsoft Teams:
connector_microsoftteams -
Outlook Calendar:
connector_outlookcalendar -
Outlook Email:
connector_outlookemail -
SharePoint:
connector_sharepoint -
CONNECTOR_DROPBOX("connector_dropbox") -
CONNECTOR_GMAIL("connector_gmail") -
CONNECTOR_GOOGLECALENDAR("connector_googlecalendar") -
CONNECTOR_GOOGLEDRIVE("connector_googledrive") -
CONNECTOR_MICROSOFTTEAMS("connector_microsoftteams") -
CONNECTOR_OUTLOOKCALENDAR("connector_outlookcalendar") -
CONNECTOR_OUTLOOKEMAIL("connector_outlookemail") -
CONNECTOR_SHAREPOINT("connector_sharepoint")
-
-
Optional<Boolean> deferLoadingWhether this MCP tool is deferred and discovered via tool search.
-
Optional<Headers> headersOptional HTTP headers to send to the MCP server. Use for authentication or other purposes.
-
Optional<RequireApproval> requireApprovalSpecify which of the MCP server's tools require approval.
-
class McpToolApprovalFilter:Specify which of the MCP server's tools require approval. Can be always, never, or a filter object associated with tools that require approval.-
Optional<Always> alwaysA filter object to specify which tools are allowed.
-
Optional<Boolean> readOnlyIndicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with
readOnlyHint, it will match this filter. -
Optional<List<String>> toolNamesList of allowed tool names.
-
-
Optional<Never> neverA filter object to specify which tools are allowed.
-
Optional<Boolean> readOnlyIndicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with
readOnlyHint, it will match this filter. -
Optional<List<String>> toolNamesList of allowed tool names.
-
-
-
enum McpToolApprovalSetting:Specify a single approval policy for all tools. One of always or never. When set to always, all tools will require approval. When set to never, no tools will require approval.-
ALWAYS("always") -
NEVER("never")
-
-
-
Optional<String> serverDescriptionOptional description of the MCP server, used to provide more context.
-
Optional<String> serverUrlThe URL for the MCP server. One of server_url or connector_id must be provided.
-
-
-
Optional<RealtimeTracingConfig> tracingRealtime API can write session traces to the Traces Dashboard. Set to null to disable tracing. Once tracing is enabled for a session, the configuration cannot be modified.
auto will create a trace for the session with default values for the workflow name, group id, and metadata.-
JsonValue;AUTO("auto")
-
TracingConfiguration-
Optional<String> groupIdThe group id to attach to this trace to enable filtering and grouping in the Traces Dashboard.
-
Optional<JsonValue> metadataThe arbitrary metadata to attach to this trace to enable filtering in the Traces Dashboard.
-
Optional<String> workflowNameThe name of the workflow to attach to this trace. This is used to name the trace in the Traces Dashboard.
-
-
-
Optional<RealtimeTruncation> truncationWhen the number of tokens in a conversation exceeds the model's input token limit, the conversation will be truncated, meaning messages (starting from the oldest) will not be included in the model's context. A 32k context model with 4,096 max output tokens can only include 28,224 tokens in the context before truncation occurs.
Clients can configure truncation behavior to truncate with a lower max token limit, which is an effective way to control token usage and cost.
Truncation will reduce the number of cached tokens on the next turn (busting the cache), since messages are dropped from the beginning of the context. However, clients can also configure truncation to retain messages up to a fraction of the maximum context size, which will reduce the need for future truncations and thus improve the cache rate.
Truncation can be disabled entirely, which means the server will never truncate but would instead return an error if the conversation exceeds the model's input token limit.
-
RealtimeTruncationStrategy-
AUTO("auto") -
DISABLED("disabled")
-
-
class RealtimeTruncationRetentionRatio:Retain a fraction of the conversation tokens when the conversation exceeds the input token limit. This allows you to amortize truncations across multiple turns, which can help improve cached token usage.
-
double retentionRatioFraction of post-instruction conversation tokens to retain (0.0 - 1.0) when the conversation exceeds the input token limit. Setting this to 0.8 means that messages will be dropped until 80% of the maximum allowed tokens are used. This helps reduce the frequency of truncations and improve cache rates. -
JsonValue; type "retention_ratio"constantUse retention ratio truncation.
RETENTION_RATIO("retention_ratio")
-
Optional<TokenLimits> tokenLimitsOptional custom token limits for this truncation strategy. If not provided, the model's default token limits will be used.
-
Optional<Long> postInstructionsMaximum tokens allowed in the conversation after instructions (which include tool definitions). For example, setting this to 5,000 would mean that truncation would occur when the conversation exceeds 5,000 tokens after instructions. This cannot be higher than the model's context window size minus the maximum output tokens.
-
-
-
-
-
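To make the retention ratio arithmetic concrete, here is a minimal simulation of the behavior described above: once the conversation exceeds the token limit, the oldest messages are dropped until the total fits within retentionRatio times the limit. This is only a sketch under stated assumptions; the reference does not specify the server's exact algorithm, and RetentionRatioSketch, truncate, and the per-message token counts are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical client-side simulation; not the server's implementation. */
final class RetentionRatioSketch {

    /**
     * Drops the oldest messages (front of the deque) until the remaining
     * token total is at most retentionRatio * tokenLimit.
     * Returns the number of messages dropped.
     */
    static int truncate(Deque<Integer> messageTokens, long tokenLimit, double retentionRatio) {
        long total = 0;
        for (int tokens : messageTokens) {
            total += tokens;
        }
        if (total <= tokenLimit) {
            return 0; // still within the limit, nothing to drop
        }
        long target = (long) Math.floor(tokenLimit * retentionRatio);
        int dropped = 0;
        while (total > target && !messageTokens.isEmpty()) {
            total -= messageTokens.removeFirst(); // oldest message goes first
            dropped++;
        }
        return dropped;
    }

    public static void main(String[] args) {
        Deque<Integer> tokens = new ArrayDeque<>();
        for (int i = 0; i < 6; i++) {
            tokens.addLast(1000); // six messages of 1,000 tokens each
        }
        // Limit of 5,000 post-instruction tokens with a 0.8 retention ratio.
        int dropped = truncate(tokens, 5000, 0.8);
        System.out.println(dropped + " dropped, " + tokens.size() + " kept");
    }
}
```

With a 5,000-token limit and a ratio of 0.8, the six 1,000-token messages are cut back to 4,000 tokens, so the next truncation is not needed until another 1,000 tokens accumulate. Dropping to the limit itself would instead trigger a truncation on nearly every turn.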
class RealtimeTranscriptionSessionCreateRequest:Realtime transcription session object configuration.
-
JsonValue; type "transcription"constantThe type of session to create. Always transcription for transcription sessions. TRANSCRIPTION("transcription")
-
Optional<RealtimeTranscriptionSessionAudio> audioConfiguration for input and output audio.
-
Optional<RealtimeTranscriptionSessionAudioInput> input-
Optional<RealtimeAudioFormats> formatThe PCM audio format. Only a 24kHz sample rate is supported.
-
Optional<NoiseReduction> noiseReductionConfiguration for input audio noise reduction. This can be set to null to turn off. Noise reduction filters audio added to the input audio buffer before it is sent to VAD and the model. Filtering the audio can improve VAD and turn detection accuracy (reducing false positives) and model performance by improving perception of the input audio.-
Optional<NoiseReductionType> typeType of noise reduction. near_field is for close-talking microphones such as headphones; far_field is for far-field microphones such as laptop or conference room microphones.
-
-
Optional<AudioTranscription> transcriptionConfiguration for input audio transcription. Defaults to off and can be set to null to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through the /audio/transcriptions endpoint and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service. -
Optional<RealtimeTranscriptionSessionAudioInputTurnDetection> turnDetectionConfiguration for turn detection, either Server VAD or Semantic VAD. This can be set to null to turn off, in which case the client must manually trigger model response.
Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.
Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.
For gpt-realtime-whisper transcription sessions, turn detection must be set to null; VAD is not supported.-
ServerVad-
JsonValue; type "server_vad"constantType of turn detection, server_vad to turn on simple Server VAD. SERVER_VAD("server_vad")
-
Optional<Boolean> createResponseWhether or not to automatically generate a response when a VAD stop event occurs. If interrupt_response is set to false, this may fail to create a response if the model is already responding.
If both create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted. -
Optional<Long> idleTimeoutMsOptional timeout after which a model response will be triggered automatically. This is useful for situations in which a long pause from the user is unexpected, such as a phone call. The model will effectively prompt the user to continue the conversation based on the current context.
The timeout value will be applied after the last model response's audio has finished playing, i.e. it's set to the response.done time plus audio playback duration.
An input_audio_buffer.timeout_triggered event (plus events associated with the Response) will be emitted when the timeout is reached. Idle timeout is currently only supported for server_vad mode. -
Optional<Boolean> interruptResponseWhether or not to automatically interrupt (cancel) any ongoing response with output to the default conversation (i.e. conversation of auto) when a VAD start event occurs. If true, the response will be cancelled; otherwise it will continue until complete.
If both create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted. -
Optional<Long> prefixPaddingMsUsed only for server_vad mode. Amount of audio to include before the VAD detected speech (in milliseconds). Defaults to 300ms. -
Optional<Long> silenceDurationMsUsed only for server_vad mode. Duration of silence to detect speech stop (in milliseconds). Defaults to 500ms. With shorter values the model will respond more quickly, but may jump in on short pauses from the user. -
Optional<Double> thresholdUsed only for server_vad mode. Activation threshold for VAD (0.0 to 1.0). Defaults to 0.5. A higher threshold will require louder audio to activate the model, and thus might perform better in noisy environments.
-
-
SemanticVad-
JsonValue; type "semantic_vad"constantType of turn detection, semantic_vad to turn on Semantic VAD. SEMANTIC_VAD("semantic_vad")
-
Optional<Boolean> createResponseWhether or not to automatically generate a response when a VAD stop event occurs.
-
Optional<Eagerness> eagernessUsed only for semantic_vad mode. The eagerness of the model to respond. low will wait longer for the user to continue speaking; high will respond more quickly. auto is the default and is equivalent to medium. low, medium, and high have max timeouts of 8s, 4s, and 2s respectively.-
LOW("low") -
MEDIUM("medium") -
HIGH("high") -
AUTO("auto")
-
-
Optional<Boolean> interruptResponseWhether or not to automatically interrupt any ongoing response with output to the default conversation (i.e. conversation of auto) when a VAD start event occurs.
-
-
-
-
-
Optional<List<Include>> includeAdditional fields to include in server outputs.
item.input_audio_transcription.logprobs: Include logprobs for input audio transcription.ITEM_INPUT_AUDIO_TRANSCRIPTION_LOGPROBS("item.input_audio_transcription.logprobs")
-
-
-
JsonValue; type "session.update"constantThe event type, must be session.update. SESSION_UPDATE("session.update")
-
Optional<String> eventIdOptional client-generated ID used to identify this event. This is an arbitrary string that a client may assign. It will be passed back if there is an error with the event, but the corresponding session.updated event will not include it.
-
-
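The clearing rules for session.update (empty string for instructions, empty array for tools, null for turn_detection) are easiest to see in a raw event payload. The sketch below builds such a payload by hand with plain string concatenation. It assumes snake_case wire names and that turn_detection sits under audio.input, following the nesting of the fields listed above; SessionUpdateSketch and clearFieldsEvent are hypothetical names, and a real client would normally use the SDK's builders or a JSON library instead.

```java
/** Hypothetical payload builder for illustration; not part of the SDK. */
final class SessionUpdateSketch {

    /**
     * Builds a session.update event that clears instructions (empty string),
     * tools (empty array), and turn detection (null). Fields that are not
     * present in the payload are left unchanged by the server.
     */
    static String clearFieldsEvent() {
        return "{"
            + "\"type\":\"session.update\","
            + "\"session\":{"
            +   "\"type\":\"realtime\","
            +   "\"instructions\":\"\","
            +   "\"tools\":[],"
            // Assumed nesting: turn detection lives under audio.input.
            +   "\"audio\":{\"input\":{\"turn_detection\":null}}"
            + "}"
            + "}";
    }

    public static void main(String[] args) {
        System.out.println(clearFieldsEvent());
    }
}
```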
Realtime Conversation Item Assistant Message
-
class RealtimeConversationItemAssistantMessage:An assistant message item in a Realtime conversation.
-
List<Content> contentThe content of the message.
-
Optional<String> audioBase64-encoded audio bytes. These will be parsed as the format specified in the session output audio type configuration. This defaults to PCM 16-bit 24kHz mono if not specified.
-
Optional<String> textThe text content.
-
Optional<String> transcriptThe transcript of the audio content. This will always be present if the output type is audio. -
Optional<Type> typeThe content type, output_text or output_audio depending on the session output_modalities configuration.-
OUTPUT_TEXT("output_text") -
OUTPUT_AUDIO("output_audio")
-
-
-
JsonValue; role "assistant"constantThe role of the message sender. Always
assistant.ASSISTANT("assistant")
-
JsonValue; type "message"constantThe type of the item. Always
message.MESSAGE("message")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
Realtime Conversation Item Function Call
-
class RealtimeConversationItemFunctionCall:A function call item in a Realtime conversation.
-
String argumentsThe arguments of the function call. This is a JSON-encoded string representing the arguments passed to the function, for example {"arg1": "value1", "arg2": 42}. -
String nameThe name of the function being called.
-
JsonValue; type "function_call"constantThe type of the item. Always
function_call.FUNCTION_CALL("function_call")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<String> callIdThe ID of the function call.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
Realtime Conversation Item Function Call Output
-
class RealtimeConversationItemFunctionCallOutput:A function call output item in a Realtime conversation.
-
String callIdThe ID of the function call this output is for.
-
String outputThe output of the function call. This is free text and can contain any information or simply be empty.
-
JsonValue; type "function_call_output"constantThe type of the item. Always
function_call_output.FUNCTION_CALL_OUTPUT("function_call_output")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
Realtime Conversation Item System Message
-
class RealtimeConversationItemSystemMessage:A system message in a Realtime conversation can be used to provide additional context or instructions to the model. This is similar to but distinct from the instruction prompt provided at the start of a conversation, as system messages can be added at any point in the conversation. For major changes to the conversation's behavior, use instructions, but for smaller updates (e.g. "the user is now asking about a different topic"), use system messages.
-
List<Content> contentThe content of the message.
-
Optional<String> textThe text content.
-
Optional<Type> typeThe content type. Always
input_textfor system messages.INPUT_TEXT("input_text")
-
-
JsonValue; role "system"constantThe role of the message sender. Always
system.SYSTEM("system")
-
JsonValue; type "message"constantThe type of the item. Always
message.MESSAGE("message")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
Realtime Conversation Item User Message
-
class RealtimeConversationItemUserMessage:A user message item in a Realtime conversation.
-
List<Content> contentThe content of the message.
-
Optional<String> audioBase64-encoded audio bytes (for input_audio). These will be parsed as the format specified in the session input audio type configuration. This defaults to PCM 16-bit 24kHz mono if not specified. -
Optional<Detail> detailThe detail level of the image (for input_image). auto will default to high.-
AUTO("auto") -
LOW("low") -
HIGH("high")
-
-
Optional<String> imageUrlBase64-encoded image bytes (for input_image) as a data URI. For example data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA.... Supported formats are PNG and JPEG. -
Optional<String> textThe text content (for input_text). -
Optional<String> transcriptTranscript of the audio (for input_audio). This is not sent to the model, but will be attached to the message item for reference. -
Optional<Type> typeThe content type (input_text, input_audio, or input_image).-
INPUT_TEXT("input_text") -
INPUT_AUDIO("input_audio") -
INPUT_IMAGE("input_image")
-
-
-
JsonValue; role "user"constantThe role of the message sender. Always
user.USER("user")
-
JsonValue; type "message"constantThe type of the item. Always
message.MESSAGE("message")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
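The imageUrl field above expects base64-encoded image bytes wrapped in a data URI. A minimal sketch of producing one with the standard java.util.Base64 encoder follows; ImageDataUri and toDataUri are hypothetical helper names, and the MIME type check simply reflects the two formats the reference lists as supported.

```java
import java.util.Base64;

/** Hypothetical helper for building an input_image data URI; not part of the SDK. */
final class ImageDataUri {

    /** Encodes raw image bytes as a data URI such as data:image/png;base64,... */
    static String toDataUri(byte[] imageBytes, String mimeType) {
        // The reference lists PNG and JPEG as the supported formats.
        if (!"image/png".equals(mimeType) && !"image/jpeg".equals(mimeType)) {
            throw new IllegalArgumentException("Unsupported image type: " + mimeType);
        }
        return "data:" + mimeType + ";base64,"
            + Base64.getEncoder().encodeToString(imageBytes);
    }

    public static void main(String[] args) {
        // The 8-byte PNG signature, used here only as sample input.
        byte[] pngSignature = {(byte) 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A};
        System.out.println(toDataUri(pngSignature, "image/png"));
        // prints data:image/png;base64,iVBORw0KGgo=
    }
}
```

In practice the bytes would come from a file (for example via java.nio.file.Files.readAllBytes), and the resulting string would be set as the image URL of an input_image content part.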
Realtime Error
-
class RealtimeError:Details of the error.
-
String messageA human-readable error message.
-
String typeThe type of error (e.g., "invalid_request_error", "server_error").
-
Optional<String> codeError code, if any.
-
Optional<String> eventIdThe event_id of the client event that caused the error, if applicable.
-
Optional<String> paramParameter related to the error, if any.
-
Realtime Error Event
-
class RealtimeErrorEvent:Returned when an error occurs, which could be a client problem or a server problem. Most errors are recoverable and the session will stay open. We recommend that implementors monitor and log error messages by default.
-
RealtimeError errorDetails of the error.
-
String messageA human-readable error message.
-
String typeThe type of error (e.g., "invalid_request_error", "server_error").
-
Optional<String> codeError code, if any.
-
Optional<String> eventIdThe event_id of the client event that caused the error, if applicable.
-
Optional<String> paramParameter related to the error, if any.
-
-
String eventIdThe unique ID of the server event.
-
JsonValue; type "error"constantThe event type, must be error. ERROR("error")
-
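Since most errors are recoverable, a client typically logs them and keeps the session open. The sketch below shows one way to format a log line from the documented error fields and to correlate the error's eventId with a client event sent earlier. ErrorLogSketch, ErrorInfo, trackSent, and describe are hypothetical names; ErrorInfo is a stand-in for the documented fields rather than the SDK's RealtimeError class.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical error logger for illustration; not part of the SDK. */
final class ErrorLogSketch {

    /** Stand-in for the documented RealtimeError fields (optional ones may be null). */
    static final class ErrorInfo {
        final String type;
        final String message;
        final String code;
        final String eventId;
        final String param;

        ErrorInfo(String type, String message, String code, String eventId, String param) {
            this.type = type;
            this.message = message;
            this.code = code;
            this.eventId = eventId;
            this.param = param;
        }
    }

    // Client event IDs we have sent, mapped to a short description.
    private final Map<String, String> pendingClientEvents = new HashMap<>();

    /** Remembers a client event so a later error can be traced back to it. */
    void trackSent(String clientEventId, String description) {
        pendingClientEvents.put(clientEventId, description);
    }

    /** Formats one log line, skipping optional fields that are absent. */
    String describe(ErrorInfo error) {
        StringBuilder line = new StringBuilder();
        line.append('[').append(error.type).append("] ").append(error.message);
        if (error.code != null) {
            line.append(" (code=").append(error.code).append(')');
        }
        if (error.param != null) {
            line.append(" (param=").append(error.param).append(')');
        }
        if (error.eventId != null) {
            String cause = pendingClientEvents.get(error.eventId);
            line.append(" caused by ").append(cause != null ? cause : "event " + error.eventId);
        }
        return line.toString();
    }
}
```

Assigning your own eventId to each client event is what makes this correlation possible, because the error's eventId field echoes the ID of the client event that caused it.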
Realtime Function Tool
-
class RealtimeFunctionTool:-
Optional<String> descriptionThe description of the function, including guidance on when and how to call it, and guidance about what to tell the user when calling (if anything).
-
Optional<String> nameThe name of the function.
-
Optional<JsonValue> parametersParameters of the function in JSON Schema.
-
Optional<Type> typeThe type of the tool, i.e. function. FUNCTION("function")
-
Realtime Mcp Approval Request
-
class RealtimeMcpApprovalRequest:A Realtime item requesting human approval of a tool invocation.
-
String idThe unique ID of the approval request.
-
String argumentsA JSON string of arguments for the tool.
-
String nameThe name of the tool to run.
-
String serverLabelThe label of the MCP server making the request.
-
JsonValue; type "mcp_approval_request"constantThe type of the item. Always
mcp_approval_request.MCP_APPROVAL_REQUEST("mcp_approval_request")
-
Realtime Mcp Approval Response
-
class RealtimeMcpApprovalResponse:A Realtime item responding to an MCP approval request.
-
String idThe unique ID of the approval response.
-
String approvalRequestIdThe ID of the approval request being answered.
-
boolean approveWhether the request was approved.
-
JsonValue; type "mcp_approval_response"constantThe type of the item. Always
mcp_approval_response.MCP_APPROVAL_RESPONSE("mcp_approval_response")
-
Optional<String> reasonOptional reason for the decision.
-
Realtime Mcp List Tools
-
class RealtimeMcpListTools:A Realtime item listing tools available on an MCP server.
-
String serverLabelThe label of the MCP server.
-
List<Tool> toolsThe tools available on the server.
-
JsonValue inputSchemaThe JSON schema describing the tool's input.
-
String nameThe name of the tool.
-
Optional<JsonValue> annotationsAdditional annotations about the tool.
-
Optional<String> descriptionThe description of the tool.
-
-
JsonValue; type "mcp_list_tools"constantThe type of the item. Always
mcp_list_tools.MCP_LIST_TOOLS("mcp_list_tools")
-
Optional<String> idThe unique ID of the list.
-
Realtime Mcp Protocol Error
-
class RealtimeMcpProtocolError:-
long code -
String message -
JsonValue; type "protocol_error"constantPROTOCOL_ERROR("protocol_error")
-
Realtime Mcp Tool Call
-
class RealtimeMcpToolCall:A Realtime item representing an invocation of a tool on an MCP server.
-
String idThe unique ID of the tool call.
-
String argumentsA JSON string of the arguments passed to the tool.
-
String nameThe name of the tool that was run.
-
String serverLabelThe label of the MCP server running the tool.
-
JsonValue; type "mcp_call"constantThe type of the item. Always
mcp_call.MCP_CALL("mcp_call")
-
Optional<String> approvalRequestIdThe ID of an associated approval request, if any.
-
Optional<Error> errorThe error from the tool call, if any.
-
class RealtimeMcpProtocolError:-
long code -
String message -
JsonValue; type "protocol_error"constantPROTOCOL_ERROR("protocol_error")
-
-
class RealtimeMcpToolExecutionError:-
String message -
JsonValue; type "tool_execution_error"constantTOOL_EXECUTION_ERROR("tool_execution_error")
-
-
class RealtimeMcphttpError:-
long code -
String message -
JsonValue; type "http_error"constantHTTP_ERROR("http_error")
-
-
-
Optional<String> outputThe output from the tool call.
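The `error` field above is a union of three shapes. The sketch below models that union with a sealed interface and dispatches on the variant, assuming Java 17 or later. All type names here are hypothetical simplifications, not the SDK's classes.

```java
// Hypothetical model of the three documented error variants.
sealed interface McpError permits ProtocolError, ToolExecutionError, HttpError {}

record ProtocolError(long code, String message) implements McpError {}

record ToolExecutionError(String message) implements McpError {}

record HttpError(long code, String message) implements McpError {}

final class McpErrorSketch {
    /** Prefixes the message with the variant's documented type constant. */
    static String describe(McpError error) {
        if (error instanceof ProtocolError p) {
            return "protocol_error " + p.code() + ": " + p.message();
        }
        if (error instanceof ToolExecutionError t) {
            return "tool_execution_error: " + t.message();
        }
        // The interface is sealed, so the only remaining variant is HttpError.
        HttpError h = (HttpError) error;
        return "http_error " + h.code() + ": " + h.message();
    }

    public static void main(String[] args) {
        System.out.println(describe(new HttpError(503, "Service Unavailable")));
        System.out.println(describe(new ToolExecutionError("tool timed out")));
    }
}
```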
-
Realtime Mcp Tool Execution Error
-
class RealtimeMcpToolExecutionError:-
String message -
JsonValue; type "tool_execution_error"constantTOOL_EXECUTION_ERROR("tool_execution_error")
-
Realtime Mcphttp Error
-
class RealtimeMcphttpError:-
long code -
String message -
JsonValue; type "http_error"constantHTTP_ERROR("http_error")
-
Realtime Reasoning
-
class RealtimeReasoning:Configuration for reasoning-capable Realtime models such as
gpt-realtime-2.-
Optional<RealtimeReasoningEffort> effortConstrains effort on reasoning for reasoning-capable Realtime models such as
gpt-realtime-2.-
MINIMAL("minimal") -
LOW("low") -
MEDIUM("medium") -
HIGH("high") -
XHIGH("xhigh")
-
-
Realtime Reasoning Effort
-
enum RealtimeReasoningEffort:Constrains effort on reasoning for reasoning-capable Realtime models such as
gpt-realtime-2.-
MINIMAL("minimal") -
LOW("low") -
MEDIUM("medium") -
HIGH("high") -
XHIGH("xhigh")
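Each constant above pairs a Java name with a wire value. The sketch below shows one way to map between the two, using a hypothetical `ReasoningEffort` enum that mirrors the listed values; it is not the SDK's `RealtimeReasoningEffort` type.

```java
import java.util.Optional;

// Hypothetical mirror of the documented constants and their wire values.
enum ReasoningEffort {
    MINIMAL("minimal"), LOW("low"), MEDIUM("medium"), HIGH("high"), XHIGH("xhigh");

    private final String wireValue;

    ReasoningEffort(String wireValue) {
        this.wireValue = wireValue;
    }

    String wireValue() {
        return wireValue;
    }

    /** Looks up a constant by wire value; unknown values yield an empty Optional. */
    static Optional<ReasoningEffort> fromWire(String value) {
        for (ReasoningEffort effort : values()) {
            if (effort.wireValue.equals(value)) {
                return Optional.of(effort);
            }
        }
        return Optional.empty();
    }
}

final class ReasoningEffortSketch {
    public static void main(String[] args) {
        System.out.println(ReasoningEffort.fromWire("xhigh"));
        System.out.println(ReasoningEffort.HIGH.wireValue());
    }
}
```

Returning an empty `Optional` for unrecognized values lets a client tolerate values added after it was built.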
-
Realtime Response
-
class RealtimeResponse:The response resource.
-
Optional<String> idThe unique ID of the response, which will look like
resp_1234. -
Optional<Audio> audioConfiguration for audio output.
-
Optional<Output> output-
Optional<RealtimeAudioFormats> formatThe format of the output audio.
-
AudioPcm-
Optional<Rate> rateThe sample rate of the audio. Always
24000._24000(24000)
-
Optional<Type> typeThe audio format. Always
audio/pcm.AUDIO_PCM("audio/pcm")
-
-
AudioPcmu-
Optional<Type> typeThe audio format. Always
audio/pcmu.AUDIO_PCMU("audio/pcmu")
-
-
AudioPcma-
Optional<Type> typeThe audio format. Always
audio/pcma.AUDIO_PCMA("audio/pcma")
-
-
-
Optional<Voice> voiceThe voice the model uses to respond. Voice cannot be changed during the session once the model has responded with audio at least once. Current voice options are
alloy, ash, ballad, coral, echo, sage, shimmer, verse, marin, and cedar. We recommend marin and cedar for best quality.-
ALLOY("alloy") -
ASH("ash") -
BALLAD("ballad") -
CORAL("coral") -
ECHO("echo") -
SAGE("sage") -
SHIMMER("shimmer") -
VERSE("verse") -
MARIN("marin") -
CEDAR("cedar")
-
-
-
-
Optional<String> conversationIdWhich conversation the response is added to, determined by the
conversation field in the response.create event. If auto, the response will be added to the default conversation and the value of conversation_id will be an id like conv_1234. If none, the response will not be added to any conversation and the value of conversation_id will be null. If responses are being triggered automatically by VAD, the response will be added to the default conversation. -
Optional<MaxOutputTokens> maxOutputTokensMaximum number of output tokens for a single assistant response, inclusive of tool calls, that was used in this response.
-
long -
JsonValue;INF("inf")
-
-
Optional<Metadata> metadataSet of up to 16 key-value pairs that can be attached to an object. This can be useful for storing additional information about the object in a structured format, and querying for objects via API or the dashboard.
Keys are strings with a maximum length of 64 characters. Values are strings with a maximum length of 512 characters.
-
Optional<Object> object_The object type, must be
realtime.response.REALTIME_RESPONSE("realtime.response")
-
Optional<List<ConversationItem>> outputThe list of output items generated by the response.
-
class RealtimeConversationItemSystemMessage:A system message in a Realtime conversation can be used to provide additional context or instructions to the model. This is similar to but distinct from the instruction prompt provided at the start of a conversation, as system messages can be added at any point in the conversation. For major changes to the conversation's behavior, use instructions, but for smaller updates (e.g. "the user is now asking about a different topic"), use system messages.
-
List<Content> contentThe content of the message.
-
Optional<String> textThe text content.
-
Optional<Type> typeThe content type. Always
input_textfor system messages.INPUT_TEXT("input_text")
-
-
JsonValue; role "system"constantThe role of the message sender. Always
system.SYSTEM("system")
-
JsonValue; type "message"constantThe type of the item. Always
message.MESSAGE("message")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeConversationItemUserMessage:A user message item in a Realtime conversation.
-
List<Content> contentThe content of the message.
-
Optional<String> audioBase64-encoded audio bytes (for
input_audio). These will be parsed as the format specified in the session input audio type configuration, which defaults to PCM 16-bit 24kHz mono if not specified. -
Optional<Detail> detailThe detail level of the image (for
input_image). auto will default to high.-
AUTO("auto") -
LOW("low") -
HIGH("high")
-
-
Optional<String> imageUrlBase64-encoded image bytes (for
input_image) as a data URI. For example, data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA.... Supported formats are PNG and JPEG. -
Optional<String> textThe text content (for
input_text). -
Optional<String> transcriptTranscript of the audio (for
input_audio). This is not sent to the model, but will be attached to the message item for reference. -
Optional<Type> typeThe content type (
input_text, input_audio, or input_image).-
INPUT_TEXT("input_text") -
INPUT_AUDIO("input_audio") -
INPUT_IMAGE("input_image")
-
-
-
JsonValue; role "user"constantThe role of the message sender. Always
user.USER("user")
-
JsonValue; type "message"constantThe type of the item. Always
message.MESSAGE("message")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeConversationItemAssistantMessage:An assistant message item in a Realtime conversation.
-
List<Content> contentThe content of the message.
-
Optional<String> audioBase64-encoded audio bytes. These will be parsed as the format specified in the session output audio type configuration, which defaults to PCM 16-bit 24kHz mono if not specified.
-
Optional<String> textThe text content.
-
Optional<String> transcriptThe transcript of the audio content. This will always be present if the output type is
audio. -
Optional<Type> typeThe content type,
output_text or output_audio depending on the session output_modalities configuration.-
OUTPUT_TEXT("output_text") -
OUTPUT_AUDIO("output_audio")
-
-
-
JsonValue; role "assistant"constantThe role of the message sender. Always
assistant.ASSISTANT("assistant")
-
JsonValue; type "message"constantThe type of the item. Always
message.MESSAGE("message")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeConversationItemFunctionCall:A function call item in a Realtime conversation.
-
String argumentsThe arguments of the function call. This is a JSON-encoded string representing the arguments passed to the function, for example
{"arg1": "value1", "arg2": 42}. -
String nameThe name of the function being called.
-
JsonValue; type "function_call"constantThe type of the item. Always
function_call.FUNCTION_CALL("function_call")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<String> callIdThe ID of the function call.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeConversationItemFunctionCallOutput:A function call output item in a Realtime conversation.
-
String callIdThe ID of the function call this output is for.
-
String outputThe output of the function call. This is free text and can contain any information or simply be empty.
-
JsonValue; type "function_call_output"constantThe type of the item. Always
function_call_output.FUNCTION_CALL_OUTPUT("function_call_output")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeMcpApprovalResponse:A Realtime item responding to an MCP approval request.
-
String idThe unique ID of the approval response.
-
String approvalRequestIdThe ID of the approval request being answered.
-
boolean approveWhether the request was approved.
-
JsonValue; type "mcp_approval_response"constantThe type of the item. Always
mcp_approval_response.MCP_APPROVAL_RESPONSE("mcp_approval_response")
-
Optional<String> reasonOptional reason for the decision.
-
-
class RealtimeMcpListTools:A Realtime item listing tools available on an MCP server.
-
String serverLabelThe label of the MCP server.
-
List<Tool> toolsThe tools available on the server.
-
JsonValue inputSchemaThe JSON schema describing the tool's input.
-
String nameThe name of the tool.
-
Optional<JsonValue> annotationsAdditional annotations about the tool.
-
Optional<String> descriptionThe description of the tool.
-
-
JsonValue; type "mcp_list_tools"constantThe type of the item. Always
mcp_list_tools.MCP_LIST_TOOLS("mcp_list_tools")
-
Optional<String> idThe unique ID of the list.
-
-
class RealtimeMcpToolCall:A Realtime item representing an invocation of a tool on an MCP server.
-
String idThe unique ID of the tool call.
-
String argumentsA JSON string of the arguments passed to the tool.
-
String nameThe name of the tool that was run.
-
String serverLabelThe label of the MCP server running the tool.
-
JsonValue; type "mcp_call"constantThe type of the item. Always
mcp_call.MCP_CALL("mcp_call")
-
Optional<String> approvalRequestIdThe ID of an associated approval request, if any.
-
Optional<Error> errorThe error from the tool call, if any.
-
class RealtimeMcpProtocolError:-
long code -
String message -
JsonValue; type "protocol_error"constantPROTOCOL_ERROR("protocol_error")
-
-
class RealtimeMcpToolExecutionError:-
String message -
JsonValue; type "tool_execution_error"constantTOOL_EXECUTION_ERROR("tool_execution_error")
-
-
class RealtimeMcphttpError:-
long code -
String message -
JsonValue; type "http_error"constantHTTP_ERROR("http_error")
-
-
-
Optional<String> outputThe output from the tool call.
-
-
class RealtimeMcpApprovalRequest:A Realtime item requesting human approval of a tool invocation.
-
String idThe unique ID of the approval request.
-
String argumentsA JSON string of arguments for the tool.
-
String nameThe name of the tool to run.
-
String serverLabelThe label of the MCP server making the request.
-
JsonValue; type "mcp_approval_request"constantThe type of the item. Always
mcp_approval_request.MCP_APPROVAL_REQUEST("mcp_approval_request")
-
-
-
Optional<List<OutputModality>> outputModalitiesThe set of modalities the model used to respond. Currently the only possible values are
["audio"] and ["text"]. Audio output always includes a text transcript. Setting the output mode to text will disable audio output from the model.-
TEXT("text") -
AUDIO("audio")
-
-
Optional<Status> statusThe final status of the response (
completed, cancelled, failed, incomplete, or in_progress).-
COMPLETED("completed") -
CANCELLED("cancelled") -
FAILED("failed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
Optional<RealtimeResponseStatus> statusDetailsAdditional details about the status.
-
Optional<Error> errorA description of the error that caused the response to fail, populated when the
status is failed.-
Optional<String> codeError code, if any.
-
Optional<String> typeThe type of error.
-
-
Optional<Reason> reasonThe reason the Response did not complete. For a
cancelled Response, one of turn_detected (the server VAD detected a new start of speech) or client_cancelled (the client sent a cancel event). For an incomplete Response, one of max_output_tokens or content_filter (the server-side safety filter activated and cut off the response).-
TURN_DETECTED("turn_detected") -
CLIENT_CANCELLED("client_cancelled") -
MAX_OUTPUT_TOKENS("max_output_tokens") -
CONTENT_FILTER("content_filter")
-
-
Optional<Type> typeThe type of error that caused the response to fail, corresponding with the
status field (completed, cancelled, incomplete, failed).-
COMPLETED("completed") -
CANCELLED("cancelled") -
INCOMPLETE("incomplete") -
FAILED("failed")
-
-
-
Optional<RealtimeResponseUsage> usageUsage statistics for the Response, which correspond to billing. A Realtime API session will maintain a conversation context and append new Items to the Conversation, so output from previous turns (text and audio tokens) will become the input for later turns.
-
Optional<RealtimeResponseUsageInputTokenDetails> inputTokenDetailsDetails about the input tokens used in the Response. Cached tokens are tokens from previous turns in the conversation that are included as context for the current response. Cached tokens here are counted as a subset of input tokens, meaning input tokens will include cached and uncached tokens.
-
Optional<Long> audioTokensThe number of audio tokens used as input for the Response.
-
Optional<Long> cachedTokensThe number of cached tokens used as input for the Response.
-
Optional<CachedTokensDetails> cachedTokensDetailsDetails about the cached tokens used as input for the Response.
-
Optional<Long> audioTokensThe number of cached audio tokens used as input for the Response.
-
Optional<Long> imageTokensThe number of cached image tokens used as input for the Response.
-
Optional<Long> textTokensThe number of cached text tokens used as input for the Response.
-
-
Optional<Long> imageTokensThe number of image tokens used as input for the Response.
-
Optional<Long> textTokensThe number of text tokens used as input for the Response.
-
-
Optional<Long> inputTokensThe number of input tokens used in the Response, including text and audio tokens.
-
Optional<RealtimeResponseUsageOutputTokenDetails> outputTokenDetailsDetails about the output tokens used in the Response.
-
Optional<Long> audioTokensThe number of audio tokens used in the Response.
-
Optional<Long> textTokensThe number of text tokens used in the Response.
-
-
Optional<Long> outputTokensThe number of output tokens sent in the Response, including text and audio tokens.
-
Optional<Long> totalTokensThe total number of tokens in the Response including input and output text and audio tokens.
-
-
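The usage fields above note that cached tokens are a subset of input tokens. The arithmetic can be sketched as follows, assuming Java 17 or later. `UsageSnapshot` is a hypothetical value type, the token counts are made up, and treating the total as input plus output is an assumption based on the field descriptions rather than a documented formula.

```java
// Hypothetical value type mirroring three of the documented usage counters.
record UsageSnapshot(long inputTokens, long cachedTokens, long outputTokens) {
    UsageSnapshot {
        // Cached tokens are documented as a subset of input tokens.
        if (cachedTokens < 0 || cachedTokens > inputTokens) {
            throw new IllegalArgumentException(
                    "cachedTokens must be between 0 and inputTokens");
        }
    }

    /** Input tokens that were not served from cache. */
    long uncachedInputTokens() {
        return inputTokens - cachedTokens;
    }

    /** Assumption: the total is the sum of input and output tokens. */
    long totalTokens() {
        return inputTokens + outputTokens;
    }
}

final class UsageSketch {
    public static void main(String[] args) {
        UsageSnapshot usage = new UsageSnapshot(1200, 800, 150);
        System.out.println("uncached input: " + usage.uncachedInputTokens());
        System.out.println("total: " + usage.totalTokens());
    }
}
```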
Realtime Response Create Audio Output
-
class RealtimeResponseCreateAudioOutput:Configuration for audio input and output.
-
Optional<Output> output-
Optional<RealtimeAudioFormats> formatThe format of the output audio.
-
AudioPcm-
Optional<Rate> rateThe sample rate of the audio. Always
24000._24000(24000)
-
Optional<Type> typeThe audio format. Always
audio/pcm.AUDIO_PCM("audio/pcm")
-
-
AudioPcmu-
Optional<Type> typeThe audio format. Always
audio/pcmu.AUDIO_PCMU("audio/pcmu")
-
-
AudioPcma-
Optional<Type> typeThe audio format. Always
audio/pcma.AUDIO_PCMA("audio/pcma")
-
-
-
Optional<Voice> voiceThe voice the model uses to respond. Supported built-in voices are
alloy, ash, ballad, coral, echo, sage, shimmer, verse, marin, and cedar. You may also provide a custom voice object with an id, for example { "id": "voice_1234" }. Voice cannot be changed during the session once the model has responded with audio at least once. We recommend marin and cedar for best quality.-
String -
enum UnionMember1:-
ALLOY("alloy") -
ASH("ash") -
BALLAD("ballad") -
CORAL("coral") -
ECHO("echo") -
SAGE("sage") -
SHIMMER("shimmer") -
VERSE("verse") -
MARIN("marin") -
CEDAR("cedar")
-
-
class Id:Custom voice reference.
-
String idThe custom voice ID, e.g.
voice_1234.
-
-
-
-
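The `voice` field above accepts either a voice name or a custom voice object with an `id`. A rough sketch of that union and its two JSON shapes follows, assuming Java 17 or later. `VoiceChoice`, `BuiltInVoice`, and `CustomVoice` are hypothetical names, and the helper does no string escaping.

```java
// Hypothetical model of the two documented voice shapes.
sealed interface VoiceChoice permits BuiltInVoice, CustomVoice {}

record BuiltInVoice(String name) implements VoiceChoice {}

record CustomVoice(String id) implements VoiceChoice {}

final class VoiceSketch {
    /** A voice name serializes as a JSON string, a custom voice as an object with an id. */
    static String toJson(VoiceChoice voice) {
        if (voice instanceof BuiltInVoice b) {
            return "\"" + b.name() + "\"";
        }
        // The interface is sealed, so the only remaining variant is CustomVoice.
        CustomVoice c = (CustomVoice) voice;
        return "{\"id\":\"" + c.id() + "\"}";
    }

    public static void main(String[] args) {
        System.out.println(toJson(new BuiltInVoice("marin")));
        System.out.println(toJson(new CustomVoice("voice_1234")));
    }
}
```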
Realtime Response Create Mcp Tool
-
class RealtimeResponseCreateMcpTool:Give the model access to additional tools via remote Model Context Protocol (MCP) servers. Learn more about MCP.
-
String serverLabelA label for this MCP server, used to identify it in tool calls.
-
JsonValue; type "mcp"constantThe type of the MCP tool. Always
mcp.MCP("mcp")
-
Optional<AllowedTools> allowedToolsList of allowed tool names or a filter object.
-
List<String> -
class McpToolFilter:A filter object to specify which tools are allowed.
-
Optional<Boolean> readOnlyIndicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with
readOnlyHint, it will match this filter. -
Optional<List<String>> toolNamesList of allowed tool names.
-
-
-
Optional<String> authorizationAn OAuth access token that can be used with a remote MCP server, either with a custom MCP server URL or a service connector. Your application must handle the OAuth authorization flow and provide the token here.
-
Optional<ConnectorId> connectorIdIdentifier for service connectors, like those available in ChatGPT. One of
server_url or connector_id must be provided. Learn more about service connectors here. Currently supported
connector_id values are:-
Dropbox:
connector_dropbox -
Gmail:
connector_gmail -
Google Calendar:
connector_googlecalendar -
Google Drive:
connector_googledrive -
Microsoft Teams:
connector_microsoftteams -
Outlook Calendar:
connector_outlookcalendar -
Outlook Email:
connector_outlookemail -
SharePoint:
connector_sharepoint -
CONNECTOR_DROPBOX("connector_dropbox") -
CONNECTOR_GMAIL("connector_gmail") -
CONNECTOR_GOOGLECALENDAR("connector_googlecalendar") -
CONNECTOR_GOOGLEDRIVE("connector_googledrive") -
CONNECTOR_MICROSOFTTEAMS("connector_microsoftteams") -
CONNECTOR_OUTLOOKCALENDAR("connector_outlookcalendar") -
CONNECTOR_OUTLOOKEMAIL("connector_outlookemail") -
CONNECTOR_SHAREPOINT("connector_sharepoint")
-
-
Optional<Boolean> deferLoadingWhether this MCP tool is deferred and discovered via tool search.
-
Optional<Headers> headersOptional HTTP headers to send to the MCP server. Use for authentication or other purposes.
-
Optional<RequireApproval> requireApprovalSpecify which of the MCP server's tools require approval.
-
class McpToolApprovalFilter:Specify which of the MCP server's tools require approval. Can be
always, never, or a filter object associated with tools that require approval.-
Optional<Always> alwaysA filter object to specify which tools are allowed.
-
Optional<Boolean> readOnlyIndicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with
readOnlyHint, it will match this filter. -
Optional<List<String>> toolNamesList of allowed tool names.
-
-
Optional<Never> neverA filter object to specify which tools are allowed.
-
Optional<Boolean> readOnlyIndicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with
readOnlyHint, it will match this filter. -
Optional<List<String>> toolNamesList of allowed tool names.
-
-
-
enum McpToolApprovalSetting:Specify a single approval policy for all tools. One of
always or never. When set to always, all tools will require approval. When set to never, no tools will require approval.-
ALWAYS("always") -
NEVER("never")
-
-
-
Optional<String> serverDescriptionOptional description of the MCP server, used to provide more context.
-
Optional<String> serverUrlThe URL for the MCP server. One of
server_url or connector_id must be provided.
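The requirement that one of `server_url` or `connector_id` be provided can be checked before a request is sent. The sketch below assumes Java 17 or later and uses a hypothetical `McpToolConfig` record with a hypothetical server URL; it is not the SDK's builder. It only rejects the case where both are missing, since the text above does not say whether supplying both is allowed.

```java
import java.util.Optional;

// Hypothetical mirror of three documented fields; not an SDK class.
record McpToolConfig(String serverLabel, Optional<String> serverUrl,
                     Optional<String> connectorId) {
    McpToolConfig {
        if (serverLabel == null || serverLabel.isBlank()) {
            throw new IllegalArgumentException("serverLabel is required");
        }
        // At least one way of locating the MCP server must be present.
        if (serverUrl.isEmpty() && connectorId.isEmpty()) {
            throw new IllegalArgumentException(
                    "one of serverUrl or connectorId must be provided");
        }
    }
}

final class McpToolConfigSketch {
    public static void main(String[] args) {
        McpToolConfig byConnector = new McpToolConfig(
                "calendar", Optional.empty(), Optional.of("connector_googlecalendar"));
        McpToolConfig byUrl = new McpToolConfig(
                "docs", Optional.of("https://mcp.example.com"), Optional.empty());
        System.out.println(byConnector);
        System.out.println(byUrl);
    }
}
```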
-
Realtime Response Create Params
-
class RealtimeResponseCreateParams:Create a new Realtime response with these parameters.
-
Optional<RealtimeResponseCreateAudioOutput> audioConfiguration for audio input and output.
-
Optional<Output> output-
Optional<RealtimeAudioFormats> formatThe format of the output audio.
-
AudioPcm-
Optional<Rate> rateThe sample rate of the audio. Always
24000._24000(24000)
-
Optional<Type> typeThe audio format. Always
audio/pcm.AUDIO_PCM("audio/pcm")
-
-
AudioPcmu-
Optional<Type> typeThe audio format. Always
audio/pcmu.AUDIO_PCMU("audio/pcmu")
-
-
AudioPcma-
Optional<Type> typeThe audio format. Always
audio/pcma.AUDIO_PCMA("audio/pcma")
-
-
-
Optional<Voice> voiceThe voice the model uses to respond. Supported built-in voices are
alloy, ash, ballad, coral, echo, sage, shimmer, verse, marin, and cedar. You may also provide a custom voice object with an id, for example { "id": "voice_1234" }. Voice cannot be changed during the session once the model has responded with audio at least once. We recommend marin and cedar for best quality.-
String -
enum UnionMember1:-
ALLOY("alloy") -
ASH("ash") -
BALLAD("ballad") -
CORAL("coral") -
ECHO("echo") -
SAGE("sage") -
SHIMMER("shimmer") -
VERSE("verse") -
MARIN("marin") -
CEDAR("cedar")
-
-
class Id:Custom voice reference.
-
String idThe custom voice ID, e.g.
voice_1234.
-
-
-
-
-
Optional<Conversation> conversationControls which conversation the response is added to. Currently supports
auto and none, with auto as the default value. The auto value means that the contents of the response will be added to the default conversation. Set this to none to create an out-of-band response which will not add items to the default conversation.-
AUTO("auto") -
NONE("none")
-
-
Optional<List<ConversationItem>> inputInput items to include in the prompt for the model. Using this field creates a new context for this Response instead of using the default conversation. An empty array
[] will clear the context for this Response. Note that this can include references to items that previously appeared in the session using their id.-
class RealtimeConversationItemSystemMessage:A system message in a Realtime conversation can be used to provide additional context or instructions to the model. This is similar to but distinct from the instruction prompt provided at the start of a conversation, as system messages can be added at any point in the conversation. For major changes to the conversation's behavior, use instructions, but for smaller updates (e.g. "the user is now asking about a different topic"), use system messages.
-
List<Content> contentThe content of the message.
-
Optional<String> textThe text content.
-
Optional<Type> typeThe content type. Always
input_textfor system messages.INPUT_TEXT("input_text")
-
-
JsonValue; role "system"constantThe role of the message sender. Always
system.SYSTEM("system")
-
JsonValue; type "message"constantThe type of the item. Always
message.MESSAGE("message")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeConversationItemUserMessage:A user message item in a Realtime conversation.
-
List<Content> contentThe content of the message.
-
Optional<String> audioBase64-encoded audio bytes (for
input_audio). These will be parsed as the format specified in the session input audio type configuration, which defaults to PCM 16-bit 24kHz mono if not specified. -
Optional<Detail> detailThe detail level of the image (for
input_image). auto will default to high.-
AUTO("auto") -
LOW("low") -
HIGH("high")
-
-
Optional<String> imageUrlBase64-encoded image bytes (for
input_image) as a data URI. For example, data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA.... Supported formats are PNG and JPEG. -
Optional<String> textThe text content (for
input_text). -
Optional<String> transcriptTranscript of the audio (for
input_audio). This is not sent to the model, but will be attached to the message item for reference. -
Optional<Type> typeThe content type (
input_text, input_audio, or input_image).-
INPUT_TEXT("input_text") -
INPUT_AUDIO("input_audio") -
INPUT_IMAGE("input_image")
-
-
-
JsonValue; role "user"constantThe role of the message sender. Always
user.USER("user")
-
JsonValue; type "message"constantThe type of the item. Always
message.MESSAGE("message")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeConversationItemAssistantMessage:An assistant message item in a Realtime conversation.
-
List<Content> contentThe content of the message.
-
Optional<String> audioBase64-encoded audio bytes. These will be parsed as the format specified in the session output audio type configuration, which defaults to PCM 16-bit 24kHz mono if not specified.
-
Optional<String> textThe text content.
-
Optional<String> transcriptThe transcript of the audio content. This will always be present if the output type is
audio. -
Optional<Type> typeThe content type,
output_text or output_audio depending on the session output_modalities configuration.-
OUTPUT_TEXT("output_text") -
OUTPUT_AUDIO("output_audio")
-
-
-
JsonValue; role "assistant"constantThe role of the message sender. Always
assistant.ASSISTANT("assistant")
-
JsonValue; type "message"constantThe type of the item. Always
message.MESSAGE("message")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeConversationItemFunctionCall:A function call item in a Realtime conversation.
-
String argumentsThe arguments of the function call. This is a JSON-encoded string representing the arguments passed to the function, for example
{"arg1": "value1", "arg2": 42}. -
String nameThe name of the function being called.
-
JsonValue; type "function_call"constantThe type of the item. Always
function_call.FUNCTION_CALL("function_call")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<String> callIdThe ID of the function call.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeConversationItemFunctionCallOutput:A function call output item in a Realtime conversation.
-
String callIdThe ID of the function call this output is for.
-
String outputThe output of the function call. This is free text and can contain any information or simply be empty.
-
JsonValue; type "function_call_output"constantThe type of the item. Always
function_call_output.FUNCTION_CALL_OUTPUT("function_call_output")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeMcpApprovalResponse:A Realtime item responding to an MCP approval request.
-
String idThe unique ID of the approval response.
-
String approvalRequestIdThe ID of the approval request being answered.
-
boolean approveWhether the request was approved.
-
JsonValue; type "mcp_approval_response"constantThe type of the item. Always
mcp_approval_response.MCP_APPROVAL_RESPONSE("mcp_approval_response")
-
Optional<String> reasonOptional reason for the decision.
-
-
class RealtimeMcpListTools:A Realtime item listing tools available on an MCP server.
-
String serverLabelThe label of the MCP server.
-
List<Tool> toolsThe tools available on the server.
-
JsonValue inputSchemaThe JSON schema describing the tool's input.
-
String nameThe name of the tool.
-
Optional<JsonValue> annotationsAdditional annotations about the tool.
-
Optional<String> descriptionThe description of the tool.
-
-
JsonValue; type "mcp_list_tools"constantThe type of the item. Always
mcp_list_tools.MCP_LIST_TOOLS("mcp_list_tools")
-
Optional<String> idThe unique ID of the list.
-
-
class RealtimeMcpToolCall:A Realtime item representing an invocation of a tool on an MCP server.
-
String idThe unique ID of the tool call.
-
String argumentsA JSON string of the arguments passed to the tool.
-
String nameThe name of the tool that was run.
-
String serverLabelThe label of the MCP server running the tool.
-
JsonValue; type "mcp_call"constantThe type of the item. Always
mcp_call.MCP_CALL("mcp_call")
-
Optional<String> approvalRequestIdThe ID of an associated approval request, if any.
-
Optional<Error> errorThe error from the tool call, if any.
-
class RealtimeMcpProtocolError:-
long code -
String message -
JsonValue; type "protocol_error"constantPROTOCOL_ERROR("protocol_error")
-
-
class RealtimeMcpToolExecutionError:-
String message -
JsonValue; type "tool_execution_error"constantTOOL_EXECUTION_ERROR("tool_execution_error")
-
-
class RealtimeMcphttpError:-
long code -
String message -
JsonValue; type "http_error"constantHTTP_ERROR("http_error")
-
-
-
Optional<String> outputThe output from the tool call.
-
-
class RealtimeMcpApprovalRequest:A Realtime item requesting human approval of a tool invocation.
-
String idThe unique ID of the approval request.
-
String argumentsA JSON string of arguments for the tool.
-
String nameThe name of the tool to run.
-
String serverLabelThe label of the MCP server making the request.
-
JsonValue; type "mcp_approval_request"constantThe type of the item. Always
mcp_approval_request.MCP_APPROVAL_REQUEST("mcp_approval_request")
-
-
-
Optional<String> instructionsThe default system instructions (i.e. system message) prepended to model calls. This field allows the client to guide the model on desired responses. The model can be instructed on response content and format (e.g. "be extremely succinct", "act friendly", "here are examples of good responses") and on audio behavior (e.g. "talk quickly", "inject emotion into your voice", "laugh frequently"). The instructions are not guaranteed to be followed by the model, but they provide guidance to the model on the desired behavior. Note that the server sets default instructions, which will be used if this field is not set and are visible in the
session.created event at the start of the session. -
Optional<MaxOutputTokens> maxOutputTokensMaximum number of output tokens for a single assistant response, inclusive of tool calls. Provide an integer between 1 and 4096 to limit output tokens, or
inf for the maximum available tokens for a given model. Defaults to inf.-
long -
JsonValue;INF("inf")
-
-
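As a rough illustration of the maxOutputTokens union described above (an integer from 1 to 4096, or inf), the following self-contained sketch validates a raw value before it is placed in a request. The class and method names (MaxOutputTokensSketch, normalize) are hypothetical and are not part of the SDK.

```java
// Hypothetical helper, not part of the SDK: validates a raw maxOutputTokens value.
final class MaxOutputTokensSketch {
    static final long MIN_TOKENS = 1;
    static final long MAX_TOKENS = 4096;

    // Returns "inf" or the decimal form of a valid limit; throws on anything else.
    static String normalize(String raw) {
        if (raw == null || raw.equals("inf")) {
            return "inf"; // the documented default when the field is unset
        }
        long value;
        try {
            value = Long.parseLong(raw.trim());
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("expected an integer or \"inf\": " + raw, e);
        }
        if (value < MIN_TOKENS || value > MAX_TOKENS) {
            throw new IllegalArgumentException("limit must be between 1 and 4096: " + value);
        }
        return Long.toString(value);
    }
}
```

Rejecting out-of-range values on the client is a design choice of this sketch; the server remains the authority on what it accepts.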
Optional<Metadata> metadataSet of 16 key-value pairs that can be attached to an object. This can be useful for storing additional information about the object in a structured format, and querying for objects via API or the dashboard.
Keys are strings with a maximum length of 64 characters. Values are strings with a maximum length of 512 characters.
-
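The metadata limits above (at most 16 pairs, keys up to 64 characters, values up to 512 characters) can be checked locally before sending. This is a minimal sketch under the assumption that String.length() is an acceptable approximation of how the API counts characters; MetadataSketch and isValid are hypothetical names.

```java
import java.util.Map;

// Hypothetical helper, not part of the SDK: checks the documented metadata limits.
final class MetadataSketch {
    static final int MAX_PAIRS = 16;
    static final int MAX_KEY_LENGTH = 64;
    static final int MAX_VALUE_LENGTH = 512;

    static boolean isValid(Map<String, String> metadata) {
        if (metadata.size() > MAX_PAIRS) {
            return false;
        }
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            // Assumption: length() (UTF-16 code units) approximates the API's character count.
            if (entry.getKey().length() > MAX_KEY_LENGTH) {
                return false;
            }
            if (entry.getValue().length() > MAX_VALUE_LENGTH) {
                return false;
            }
        }
        return true;
    }
}
```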
Optional<List<OutputModality>> outputModalitiesThe set of modalities the model used to respond. Currently the only possible values are
["audio"] and ["text"]. Audio output always includes a text transcript. Setting the output to mode text will disable audio output from the model.-
TEXT("text") -
AUDIO("audio")
-
-
Optional<Boolean> parallelToolCallsWhether the model may call multiple tools in parallel. Only supported by reasoning Realtime models such as
gpt-realtime-2. -
Optional<ResponsePrompt> promptReference to a prompt template and its variables. Learn more.
-
String idThe unique identifier of the prompt template to use.
-
Optional<Variables> variablesOptional map of values to substitute in for variables in your prompt. The substitution values can either be strings, or other Response input types like images or files.
-
String -
class ResponseInputText:A text input to the model.
-
String textThe text input to the model.
-
JsonValue; type "input_text"constantThe type of the input item. Always
input_text.INPUT_TEXT("input_text")
-
-
class ResponseInputImage:An image input to the model. Learn about image inputs.
-
Detail detailThe detail level of the image to be sent to the model. One of
high, low, auto, or original. Defaults to auto.-
LOW("low") -
HIGH("high") -
AUTO("auto") -
ORIGINAL("original")
-
-
JsonValue; type "input_image"constantThe type of the input item. Always
input_image.INPUT_IMAGE("input_image")
-
Optional<String> fileIdThe ID of the file to be sent to the model.
-
Optional<String> imageUrlThe URL of the image to be sent to the model. A fully qualified URL or base64 encoded image in a data URL.
-
-
class ResponseInputFile:A file input to the model.
-
JsonValue; type "input_file"constantThe type of the input item. Always
input_file.INPUT_FILE("input_file")
-
Optional<Detail> detailThe detail level of the file to be sent to the model. Use
low for the default rendering behavior, or high to render the file at higher quality. Defaults to low.-
LOW("low") -
HIGH("high")
-
-
Optional<String> fileDataThe content of the file to be sent to the model.
-
Optional<String> fileIdThe ID of the file to be sent to the model.
-
Optional<String> fileUrlThe URL of the file to be sent to the model.
-
Optional<String> filenameThe name of the file to be sent to the model.
-
-
-
Optional<String> versionOptional version of the prompt template.
-
-
Optional<RealtimeReasoning> reasoningConfiguration for reasoning-capable Realtime models such as
gpt-realtime-2.-
Optional<RealtimeReasoningEffort> effortConstrains effort on reasoning for reasoning-capable Realtime models such as
gpt-realtime-2.-
MINIMAL("minimal") -
LOW("low") -
MEDIUM("medium") -
HIGH("high") -
XHIGH("xhigh")
-
-
-
Optional<ToolChoice> toolChoiceHow the model chooses tools. Provide one of the string modes or force a specific function/MCP tool.
-
enum ToolChoiceOptions:Controls which (if any) tool is called by the model.
none means the model will not call any tool and instead generates a message. auto means the model can pick between generating a message or calling one or more tools. required means the model must call one or more tools.-
NONE("none") -
AUTO("auto") -
REQUIRED("required")
-
-
class ToolChoiceFunction:Use this option to force the model to call a specific function.
-
String nameThe name of the function to call.
-
JsonValue; type "function"constantFor function calling, the type is always
function.FUNCTION("function")
-
-
class ToolChoiceMcp:Use this option to force the model to call a specific tool on a remote MCP server.
-
String serverLabelThe label of the MCP server to use.
-
JsonValue; type "mcp"constantFor MCP tools, the type is always
mcp.MCP("mcp")
-
Optional<String> nameThe name of the tool to call on the server.
-
-
-
Optional<List<Tool>> toolsTools available to the model.
-
class RealtimeFunctionTool:-
Optional<String> descriptionThe description of the function, including guidance on when and how to call it, and guidance about what to tell the user when calling (if anything).
-
Optional<String> nameThe name of the function.
-
Optional<JsonValue> parametersParameters of the function in JSON Schema.
-
Optional<Type> typeThe type of the tool, i.e.
function.FUNCTION("function")
-
-
class RealtimeResponseCreateMcpTool:Give the model access to additional tools via remote Model Context Protocol (MCP) servers. Learn more about MCP.
-
String serverLabelA label for this MCP server, used to identify it in tool calls.
-
JsonValue; type "mcp"constantThe type of the MCP tool. Always
mcp.MCP("mcp")
-
Optional<AllowedTools> allowedToolsList of allowed tool names or a filter object.
-
List<String> -
class McpToolFilter:A filter object to specify which tools are allowed.
-
Optional<Boolean> readOnlyIndicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with
readOnlyHint, it will match this filter. -
Optional<List<String>> toolNamesList of allowed tool names.
-
-
-
Optional<String> authorizationAn OAuth access token that can be used with a remote MCP server, either with a custom MCP server URL or a service connector. Your application must handle the OAuth authorization flow and provide the token here.
-
Optional<ConnectorId> connectorIdIdentifier for service connectors, like those available in ChatGPT. One of
server_url or connector_id must be provided. Learn more about service connectors here. Currently supported
connector_id values are:-
Dropbox:
connector_dropbox -
Gmail:
connector_gmail -
Google Calendar:
connector_googlecalendar -
Google Drive:
connector_googledrive -
Microsoft Teams:
connector_microsoftteams -
Outlook Calendar:
connector_outlookcalendar -
Outlook Email:
connector_outlookemail -
SharePoint:
connector_sharepoint -
CONNECTOR_DROPBOX("connector_dropbox") -
CONNECTOR_GMAIL("connector_gmail") -
CONNECTOR_GOOGLECALENDAR("connector_googlecalendar") -
CONNECTOR_GOOGLEDRIVE("connector_googledrive") -
CONNECTOR_MICROSOFTTEAMS("connector_microsoftteams") -
CONNECTOR_OUTLOOKCALENDAR("connector_outlookcalendar") -
CONNECTOR_OUTLOOKEMAIL("connector_outlookemail") -
CONNECTOR_SHAREPOINT("connector_sharepoint")
-
-
Optional<Boolean> deferLoadingWhether this MCP tool is deferred and discovered via tool search.
-
Optional<Headers> headersOptional HTTP headers to send to the MCP server. Use for authentication or other purposes.
-
Optional<RequireApproval> requireApprovalSpecify which of the MCP server's tools require approval.
-
class McpToolApprovalFilter:Specify which of the MCP server's tools require approval. Can be
always, never, or a filter object associated with tools that require approval.-
Optional<Always> alwaysA filter object to specify which tools are allowed.
-
Optional<Boolean> readOnlyIndicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with
readOnlyHint, it will match this filter. -
Optional<List<String>> toolNamesList of allowed tool names.
-
-
Optional<Never> neverA filter object to specify which tools are allowed.
-
Optional<Boolean> readOnlyIndicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with
readOnlyHint, it will match this filter. -
Optional<List<String>> toolNamesList of allowed tool names.
-
-
-
enum McpToolApprovalSetting:Specify a single approval policy for all tools. One of
always or never. When set to always, all tools will require approval. When set to never, no tools will require approval.-
ALWAYS("always") -
NEVER("never")
-
-
-
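To make the filter form of requireApproval more concrete, the sketch below decides whether a given tool call should wait for approval, given the tool names listed under always and never. The fallback for tools that appear in neither list is not specified in this section; requiring approval in that case is an assumption chosen here because it is the conservative option. ApprovalPolicySketch and requiresApproval are hypothetical names.

```java
import java.util.Set;

// Hypothetical helper, not part of the SDK: evaluates an always/never tool-name filter.
final class ApprovalPolicySketch {
    static boolean requiresApproval(String toolName, Set<String> always, Set<String> never) {
        if (always.contains(toolName)) {
            return true;  // explicitly listed as always requiring approval
        }
        if (never.contains(toolName)) {
            return false; // explicitly listed as never requiring approval
        }
        // Assumption: tools in neither list are gated on approval.
        return true;
    }
}
```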
Optional<String> serverDescriptionOptional description of the MCP server, used to provide more context.
-
Optional<String> serverUrlThe URL for the MCP server. One of
server_url or connector_id must be provided.
-
-
-
Realtime Response Status
-
class RealtimeResponseStatus:Additional details about the status.
-
Optional<Error> errorA description of the error that caused the response to fail, populated when the
status is failed.-
Optional<String> codeError code, if any.
-
Optional<String> typeThe type of error.
-
-
Optional<Reason> reasonThe reason the Response did not complete. For a
cancelled Response, one of turn_detected (the server VAD detected a new start of speech) or client_cancelled (the client sent a cancel event). For an incomplete Response, one of max_output_tokens or content_filter (the server-side safety filter activated and cut off the response).-
TURN_DETECTED("turn_detected") -
CLIENT_CANCELLED("client_cancelled") -
MAX_OUTPUT_TOKENS("max_output_tokens") -
CONTENT_FILTER("content_filter")
-
-
Optional<Type> typeThe type of error that caused the response to fail, corresponding with the
status field (completed, cancelled, incomplete, failed).-
COMPLETED("completed") -
CANCELLED("cancelled") -
INCOMPLETE("incomplete") -
FAILED("failed")
-
-
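The pairing between the status type and the reason field can be expressed as a small lookup. The sketch below encodes only the combinations listed above (cancelled with turn_detected or client_cancelled, incomplete with max_output_tokens or content_filter). It assumes that completed and failed responses carry no reason, which this section implies but does not state outright. ResponseStatusSketch and isConsistent are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the SDK: checks a (type, reason) pair against the documented pairs.
final class ResponseStatusSketch {
    private static final Map<String, List<String>> REASONS_BY_TYPE = new HashMap<>();

    static {
        REASONS_BY_TYPE.put("cancelled", Arrays.asList("turn_detected", "client_cancelled"));
        REASONS_BY_TYPE.put("incomplete", Arrays.asList("max_output_tokens", "content_filter"));
    }

    static boolean isConsistent(String type, String reason) {
        List<String> allowed = REASONS_BY_TYPE.get(type);
        if (allowed == null) {
            // Assumption: "completed" and "failed" responses have no reason.
            return reason == null;
        }
        return allowed.contains(reason);
    }
}
```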
Realtime Response Usage
-
class RealtimeResponseUsage:Usage statistics for the Response, which correspond to billing. A Realtime API session maintains a conversation context and appends new Items to the Conversation, so output from previous turns (text and audio tokens) becomes the input for later turns.
-
Optional<RealtimeResponseUsageInputTokenDetails> inputTokenDetailsDetails about the input tokens used in the Response. Cached tokens are tokens from previous turns in the conversation that are included as context for the current response. Cached tokens here are counted as a subset of input tokens, meaning input tokens will include cached and uncached tokens.
-
Optional<Long> audioTokensThe number of audio tokens used as input for the Response.
-
Optional<Long> cachedTokensThe number of cached tokens used as input for the Response.
-
Optional<CachedTokensDetails> cachedTokensDetailsDetails about the cached tokens used as input for the Response.
-
Optional<Long> audioTokensThe number of cached audio tokens used as input for the Response.
-
Optional<Long> imageTokensThe number of cached image tokens used as input for the Response.
-
Optional<Long> textTokensThe number of cached text tokens used as input for the Response.
-
-
Optional<Long> imageTokensThe number of image tokens used as input for the Response.
-
Optional<Long> textTokensThe number of text tokens used as input for the Response.
-
-
Optional<Long> inputTokensThe number of input tokens used in the Response, including text and audio tokens.
-
Optional<RealtimeResponseUsageOutputTokenDetails> outputTokenDetailsDetails about the output tokens used in the Response.
-
Optional<Long> audioTokensThe number of audio tokens used in the Response.
-
Optional<Long> textTokensThe number of text tokens used in the Response.
-
-
Optional<Long> outputTokensThe number of output tokens sent in the Response, including text and audio tokens.
-
Optional<Long> totalTokensThe total number of tokens in the Response including input and output text and audio tokens.
-
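Because cached tokens are counted as a subset of input tokens, the uncached portion of a turn is the difference between the two. The following sketch shows that arithmetic on plain long values; ResponseUsageSketch and its methods are hypothetical names, and the values would in practice come from the inputTokens and cachedTokens fields described above.

```java
// Hypothetical helper, not part of the SDK: arithmetic on the documented usage fields.
final class ResponseUsageSketch {
    // Input tokens that were not served from cache.
    static long uncachedInputTokens(long inputTokens, long cachedTokens) {
        if (cachedTokens < 0 || cachedTokens > inputTokens) {
            throw new IllegalArgumentException("cachedTokens must be within [0, inputTokens]");
        }
        return inputTokens - cachedTokens;
    }

    // Share of the input that was cached, in the range [0, 1]; 0 when there is no input.
    static double cachedFraction(long inputTokens, long cachedTokens) {
        if (inputTokens == 0) {
            return 0.0;
        }
        return (double) (inputTokens - uncachedInputTokens(inputTokens, cachedTokens)) / inputTokens;
    }
}
```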
Realtime Response Usage Input Token Details
-
class RealtimeResponseUsageInputTokenDetails:Details about the input tokens used in the Response. Cached tokens are tokens from previous turns in the conversation that are included as context for the current response. Cached tokens here are counted as a subset of input tokens, meaning input tokens will include cached and uncached tokens.
-
Optional<Long> audioTokensThe number of audio tokens used as input for the Response.
-
Optional<Long> cachedTokensThe number of cached tokens used as input for the Response.
-
Optional<CachedTokensDetails> cachedTokensDetailsDetails about the cached tokens used as input for the Response.
-
Optional<Long> audioTokensThe number of cached audio tokens used as input for the Response.
-
Optional<Long> imageTokensThe number of cached image tokens used as input for the Response.
-
Optional<Long> textTokensThe number of cached text tokens used as input for the Response.
-
-
Optional<Long> imageTokensThe number of image tokens used as input for the Response.
-
Optional<Long> textTokensThe number of text tokens used as input for the Response.
-
Realtime Response Usage Output Token Details
-
class RealtimeResponseUsageOutputTokenDetails:Details about the output tokens used in the Response.
-
Optional<Long> audioTokensThe number of audio tokens used in the Response.
-
Optional<Long> textTokensThe number of text tokens used in the Response.
-
Realtime Server Event
-
class RealtimeServerEvent: A class that can be one of several variants (union). A realtime server event.
-
class ConversationCreatedEvent:Returned when a conversation is created. Emitted right after session creation.
-
Conversation conversationThe conversation resource.
-
Optional<String> idThe unique ID of the conversation.
-
Optional<Object> object_The object type, must be
realtime.conversation.REALTIME_CONVERSATION("realtime.conversation")
-
-
String eventIdThe unique ID of the server event.
-
JsonValue; type "conversation.created"constantThe event type, must be
conversation.created.CONVERSATION_CREATED("conversation.created")
-
-
class ConversationItemCreatedEvent:Returned when a conversation item is created. There are several scenarios that produce this event:
-
The server is generating a Response, which if successful will produce either one or two Items, which will be of type
message (role assistant) or type function_call. -
The input audio buffer has been committed, either by the client or the server (in
server_vad mode). The server will take the content of the input audio buffer and add it to a new user message Item. -
The client has sent a
conversation.item.create event to add a new Item to the Conversation. -
String eventIdThe unique ID of the server event.
-
ConversationItem itemA single item within a Realtime conversation.
-
class RealtimeConversationItemSystemMessage:A system message in a Realtime conversation can be used to provide additional context or instructions to the model. This is similar but distinct from the instruction prompt provided at the start of a conversation, as system messages can be added at any point in the conversation. For major changes to the conversation's behavior, use instructions, but for smaller updates (e.g. "the user is now asking about a different topic"), use system messages.
-
List<Content> contentThe content of the message.
-
Optional<String> textThe text content.
-
Optional<Type> typeThe content type. Always
input_textfor system messages.INPUT_TEXT("input_text")
-
-
JsonValue; role "system"constantThe role of the message sender. Always
system.SYSTEM("system")
-
JsonValue; type "message"constantThe type of the item. Always
message.MESSAGE("message")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeConversationItemUserMessage:A user message item in a Realtime conversation.
-
List<Content> contentThe content of the message.
-
Optional<String> audioBase64-encoded audio bytes (for
input_audio). These will be parsed as the format specified in the session input audio type configuration. This defaults to PCM 16-bit 24kHz mono if not specified. -
Optional<Detail> detailThe detail level of the image (for
input_image). auto will default to high.-
AUTO("auto") -
LOW("low") -
HIGH("high")
-
-
Optional<String> imageUrlBase64-encoded image bytes (for
input_image) as a data URI. For example data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA.... Supported formats are PNG and JPEG. -
Optional<String> textThe text content (for
input_text). -
Optional<String> transcriptTranscript of the audio (for
input_audio). This is not sent to the model, but will be attached to the message item for reference. -
Optional<Type> typeThe content type (
input_text,input_audio, orinput_image).-
INPUT_TEXT("input_text") -
INPUT_AUDIO("input_audio") -
INPUT_IMAGE("input_image")
-
-
-
JsonValue; role "user"constantThe role of the message sender. Always
user.USER("user")
-
JsonValue; type "message"constantThe type of the item. Always
message.MESSAGE("message")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeConversationItemAssistantMessage:An assistant message item in a Realtime conversation.
-
List<Content> contentThe content of the message.
-
Optional<String> audioBase64-encoded audio bytes. These will be parsed as the format specified in the session output audio type configuration. This defaults to PCM 16-bit 24kHz mono if not specified.
-
Optional<String> textThe text content.
-
Optional<String> transcriptThe transcript of the audio content. This will always be present if the output type is
audio. -
Optional<Type> typeThe content type,
output_text or output_audio depending on the session output_modalities configuration.-
OUTPUT_TEXT("output_text") -
OUTPUT_AUDIO("output_audio")
-
-
-
JsonValue; role "assistant"constantThe role of the message sender. Always
assistant.ASSISTANT("assistant")
-
JsonValue; type "message"constantThe type of the item. Always
message.MESSAGE("message")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeConversationItemFunctionCall:A function call item in a Realtime conversation.
-
String argumentsThe arguments of the function call. This is a JSON-encoded string representing the arguments passed to the function, for example
{"arg1": "value1", "arg2": 42}. -
String nameThe name of the function being called.
-
JsonValue; type "function_call"constantThe type of the item. Always
function_call.FUNCTION_CALL("function_call")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<String> callIdThe ID of the function call.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeConversationItemFunctionCallOutput:A function call output item in a Realtime conversation.
-
String callIdThe ID of the function call this output is for.
-
String outputThe output of the function call. This is free text and can contain any information or simply be empty.
-
JsonValue; type "function_call_output"constantThe type of the item. Always
function_call_output.FUNCTION_CALL_OUTPUT("function_call_output")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeMcpApprovalResponse:A Realtime item responding to an MCP approval request.
-
String idThe unique ID of the approval response.
-
String approvalRequestIdThe ID of the approval request being answered.
-
boolean approveWhether the request was approved.
-
JsonValue; type "mcp_approval_response"constantThe type of the item. Always
mcp_approval_response.MCP_APPROVAL_RESPONSE("mcp_approval_response")
-
Optional<String> reasonOptional reason for the decision.
-
-
class RealtimeMcpListTools:A Realtime item listing tools available on an MCP server.
-
String serverLabelThe label of the MCP server.
-
List<Tool> toolsThe tools available on the server.
-
JsonValue inputSchemaThe JSON schema describing the tool's input.
-
String nameThe name of the tool.
-
Optional<JsonValue> annotationsAdditional annotations about the tool.
-
Optional<String> descriptionThe description of the tool.
-
-
JsonValue; type "mcp_list_tools"constantThe type of the item. Always
mcp_list_tools.MCP_LIST_TOOLS("mcp_list_tools")
-
Optional<String> idThe unique ID of the list.
-
-
class RealtimeMcpToolCall:A Realtime item representing an invocation of a tool on an MCP server.
-
String idThe unique ID of the tool call.
-
String argumentsA JSON string of the arguments passed to the tool.
-
String nameThe name of the tool that was run.
-
String serverLabelThe label of the MCP server running the tool.
-
JsonValue; type "mcp_call"constantThe type of the item. Always
mcp_call.MCP_CALL("mcp_call")
-
Optional<String> approvalRequestIdThe ID of an associated approval request, if any.
-
Optional<Error> errorThe error from the tool call, if any.
-
class RealtimeMcpProtocolError:-
long code -
String message -
JsonValue; type "protocol_error"constantPROTOCOL_ERROR("protocol_error")
-
-
class RealtimeMcpToolExecutionError:-
String message -
JsonValue; type "tool_execution_error"constantTOOL_EXECUTION_ERROR("tool_execution_error")
-
-
class RealtimeMcphttpError:-
long code -
String message -
JsonValue; type "http_error"constantHTTP_ERROR("http_error")
-
-
-
Optional<String> outputThe output from the tool call.
-
-
class RealtimeMcpApprovalRequest:A Realtime item requesting human approval of a tool invocation.
-
String idThe unique ID of the approval request.
-
String argumentsA JSON string of arguments for the tool.
-
String nameThe name of the tool to run.
-
String serverLabelThe label of the MCP server making the request.
-
JsonValue; type "mcp_approval_request"constantThe type of the item. Always
mcp_approval_request.MCP_APPROVAL_REQUEST("mcp_approval_request")
-
-
-
JsonValue; type "conversation.item.created"constantThe event type, must be
conversation.item.created.CONVERSATION_ITEM_CREATED("conversation.item.created")
-
Optional<String> previousItemIdThe ID of the preceding item in the Conversation context, which allows the client to understand the order of the conversation. Can be
null if the item has no predecessor.
-
-
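A client can use previousItemId to keep a local copy of the conversation order. The sketch below inserts each new item directly after its predecessor, or at the front when previousItemId is null. Appending when the predecessor is unknown locally is an assumption of this sketch, not documented behavior. ConversationOrderSketch and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the SDK: mirrors item order from conversation.item.created events.
final class ConversationOrderSketch {
    private final List<String> itemIds = new ArrayList<>();

    void onItemCreated(String itemId, String previousItemId) {
        if (previousItemId == null) {
            itemIds.add(0, itemId); // no predecessor: the item goes first
            return;
        }
        int index = itemIds.indexOf(previousItemId);
        if (index < 0) {
            itemIds.add(itemId); // assumption: unknown predecessor, append at the end
        } else {
            itemIds.add(index + 1, itemId); // insert right after the predecessor
        }
    }

    // Returns a copy so callers cannot mutate the internal order.
    List<String> order() {
        return new ArrayList<>(itemIds);
    }
}
```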
class ConversationItemDeletedEvent:Returned when an item in the conversation is deleted by the client with a
conversation.item.delete event. This event is used to synchronize the server's understanding of the conversation history with the client's view.-
String eventIdThe unique ID of the server event.
-
String itemIdThe ID of the item that was deleted.
-
JsonValue; type "conversation.item.deleted"constantThe event type, must be
conversation.item.deleted.CONVERSATION_ITEM_DELETED("conversation.item.deleted")
-
-
class ConversationItemInputAudioTranscriptionCompletedEvent:This event is the output of audio transcription for user audio written to the user audio buffer. Transcription begins when the input audio buffer is committed by the client or server (when VAD is enabled). Transcription runs asynchronously with Response creation, so this event may come before or after the Response events.
Realtime API models accept audio natively, and thus input transcription is a separate process run on a separate ASR (Automatic Speech Recognition) model. The transcript may diverge somewhat from the model's interpretation, and should be treated as a rough guide.
-
long contentIndexThe index of the content part containing the audio.
-
String eventIdThe unique ID of the server event.
-
String itemIdThe ID of the item containing the audio that is being transcribed.
-
String transcriptThe transcribed text.
-
JsonValue; type "conversation.item.input_audio_transcription.completed"constantThe event type, must be
conversation.item.input_audio_transcription.completed.CONVERSATION_ITEM_INPUT_AUDIO_TRANSCRIPTION_COMPLETED("conversation.item.input_audio_transcription.completed")
-
Usage usageUsage statistics for the transcription. This is billed according to the ASR model's pricing rather than the realtime model's pricing.
-
class TranscriptTextUsageTokens:Usage statistics for models billed by token usage.
-
long inputTokensNumber of input tokens billed for this request.
-
long outputTokensNumber of output tokens generated.
-
long totalTokensTotal number of tokens used (input + output).
-
JsonValue; type "tokens"constantThe type of the usage object. Always
tokensfor this variant.TOKENS("tokens")
-
Optional<InputTokenDetails> inputTokenDetailsDetails about the input tokens billed for this request.
-
Optional<Long> audioTokensNumber of audio tokens billed for this request.
-
Optional<Long> textTokensNumber of text tokens billed for this request.
-
-
-
class TranscriptTextUsageDuration:Usage statistics for models billed by audio input duration.
-
double secondsDuration of the input audio in seconds.
-
JsonValue; type "duration"constantThe type of the usage object. Always
durationfor this variant.DURATION("duration")
-
-
-
Optional<List<LogProbProperties>> logprobsThe log probabilities of the transcription.
-
String tokenThe token that was used to generate the log probability.
-
List<long> bytesThe bytes that were used to generate the log probability.
-
double logprobThe log probability of the token.
-
-
-
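The usage field of the transcription completed event is a union of a token-based variant and a duration-based variant, distinguished by its type value (tokens or duration). The sketch below branches on that discriminator using plain values; TranscriptUsageSketch and describe are hypothetical names and do not reflect how the SDK exposes the union.

```java
// Hypothetical helper, not part of the SDK: branches on the documented usage "type" discriminator.
final class TranscriptUsageSketch {
    static String describe(String type, long totalTokens, double seconds) {
        switch (type) {
            case "tokens":
                return totalTokens + " tokens"; // model billed by token usage
            case "duration":
                return seconds + " s of audio"; // model billed by audio input duration
            default:
                throw new IllegalArgumentException("unknown usage type: " + type);
        }
    }
}
```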
class ConversationItemInputAudioTranscriptionDeltaEvent:Returned when the text value of an input audio transcription content part is updated with incremental transcription results.
-
String eventIdThe unique ID of the server event.
-
String itemIdThe ID of the item containing the audio that is being transcribed.
-
JsonValue; type "conversation.item.input_audio_transcription.delta"constantThe event type, must be
conversation.item.input_audio_transcription.delta.CONVERSATION_ITEM_INPUT_AUDIO_TRANSCRIPTION_DELTA("conversation.item.input_audio_transcription.delta")
-
Optional<Long> contentIndexThe index of the content part in the item's content array.
-
Optional<String> deltaThe text delta.
-
Optional<List<LogProbProperties>> logprobsThe log probabilities of the transcription. These can be enabled by configuring the session with
"include": ["item.input_audio_transcription.logprobs"]. Each entry in the array corresponds to a log probability of which token would be selected for this chunk of transcription. This can help to identify whether there were multiple valid options for a given chunk of transcription.-
String tokenThe token that was used to generate the log probability.
-
List<long> bytesThe bytes that were used to generate the log probability.
-
double logprobThe log probability of the token.
-
-
-
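Delta events carry incremental text, so a client typically concatenates them per item and content part until the completed event arrives. The sketch below does that and also converts a log probability to a plain probability, assuming natural-log values. TranscriptAccumulatorSketch and its methods are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of the SDK: accumulates transcription deltas per content part.
final class TranscriptAccumulatorSketch {
    private final Map<String, StringBuilder> textByPart = new HashMap<>();

    void onDelta(String itemId, long contentIndex, String delta) {
        if (delta == null) {
            return; // delta is optional on the event
        }
        textByPart.computeIfAbsent(key(itemId, contentIndex), k -> new StringBuilder()).append(delta);
    }

    String textFor(String itemId, long contentIndex) {
        StringBuilder text = textByPart.get(key(itemId, contentIndex));
        return text == null ? "" : text.toString();
    }

    // Assumption: logprob is a natural logarithm, so exp() recovers the probability.
    static double probability(double logprob) {
        return Math.exp(logprob);
    }

    private static String key(String itemId, long contentIndex) {
        return itemId + ":" + contentIndex;
    }
}
```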
class ConversationItemInputAudioTranscriptionFailedEvent:Returned when input audio transcription is configured, and a transcription request for a user message failed. These events are separate from other
error events so that the client can identify the related Item.-
long contentIndexThe index of the content part containing the audio.
-
Error errorDetails of the transcription error.
-
Optional<String> codeError code, if any.
-
Optional<String> messageA human-readable error message.
-
Optional<String> paramParameter related to the error, if any.
-
Optional<String> typeThe type of error.
-
-
String eventIdThe unique ID of the server event.
-
String itemIdThe ID of the user message item.
-
JsonValue; type "conversation.item.input_audio_transcription.failed"constantThe event type, must be
conversation.item.input_audio_transcription.failed.CONVERSATION_ITEM_INPUT_AUDIO_TRANSCRIPTION_FAILED("conversation.item.input_audio_transcription.failed")
-
-
ConversationItemRetrieved-
String eventIdThe unique ID of the server event.
-
ConversationItem itemA single item within a Realtime conversation.
-
JsonValue; type "conversation.item.retrieved"constantThe event type, must be
conversation.item.retrieved.CONVERSATION_ITEM_RETRIEVED("conversation.item.retrieved")
-
-
class ConversationItemTruncatedEvent:Returned when an earlier assistant audio message item is truncated by the client with a
conversation.item.truncate event. This event is used to synchronize the server's understanding of the audio with the client's playback. This action will truncate the audio and remove the server-side text transcript to ensure there is no text in the context that hasn't been heard by the user.
-
long audioEndMsThe duration up to which the audio was truncated, in milliseconds.
-
long contentIndexThe index of the content part that was truncated.
-
String eventIdThe unique ID of the server event.
-
String itemIdThe ID of the assistant message item that was truncated.
-
JsonValue; type "conversation.item.truncated"constantThe event type, must be
conversation.item.truncated.CONVERSATION_ITEM_TRUNCATED("conversation.item.truncated")
-
-
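When mirroring a truncation locally, audioEndMs can be converted to a byte offset in the audio already received. The sketch below assumes the default output format mentioned for message audio (PCM 16-bit, 24 kHz, mono), which works out to 48 bytes per millisecond; a session configured with another format would need different constants. TruncationSketch and pcmBytesForMs are hypothetical names.

```java
// Hypothetical helper, not part of the SDK: maps audioEndMs to a PCM byte offset.
final class TruncationSketch {
    static final int SAMPLE_RATE_HZ = 24_000; // assumption: default 24 kHz output
    static final int BYTES_PER_SAMPLE = 2;    // 16-bit mono samples

    static long pcmBytesForMs(long audioEndMs) {
        if (audioEndMs < 0) {
            throw new IllegalArgumentException("audioEndMs must be non-negative");
        }
        // 24,000 samples/s * 2 bytes/sample / 1,000 ms/s = 48 bytes per millisecond.
        return audioEndMs * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE / 1000;
    }
}
```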
class RealtimeErrorEvent:Returned when an error occurs, which could be a client problem or a server problem. Most errors are recoverable and the session will stay open. We recommend that implementors monitor and log error messages by default.
-
RealtimeError errorDetails of the error.
-
String messageA human-readable error message.
-
String typeThe type of error (e.g., "invalid_request_error", "server_error").
-
Optional<String> codeError code, if any.
-
Optional<String> eventIdThe event_id of the client event that caused the error, if applicable.
-
Optional<String> paramParameter related to the error, if any.
-
-
String eventIdThe unique ID of the server event.
-
JsonValue; type "error"constantThe event type, must be
error.ERROR("error")
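Since most errors leave the session open, a reasonable default is to log every error event and continue. The sketch below formats the documented error fields into one log line. ErrorLogFormatter and the log layout are hypothetical, and the parameters mirror the fields above instead of using the SDK classes directly.

```java
import java.util.Optional;

/** Hypothetical formatter for the fields of a RealtimeError. */
final class ErrorLogFormatter {
    static String format(String type, String message, Optional<String> code,
                         Optional<String> param, Optional<String> clientEventId) {
        StringBuilder sb = new StringBuilder("[realtime error] type=").append(type);
        // Optional fields are appended only when the server supplied them.
        code.ifPresent(c -> sb.append(" code=").append(c));
        param.ifPresent(p -> sb.append(" param=").append(p));
        clientEventId.ifPresent(id -> sb.append(" client_event_id=").append(id));
        sb.append(" message=\"").append(message).append('"');
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(format("invalid_request_error", "Bad request",
                Optional.of("invalid_value"), Optional.empty(), Optional.of("evt_1")));
    }
}
```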
-
-
class InputAudioBufferClearedEvent:Returned when the input audio buffer is cleared by the client with an
input_audio_buffer.clear event.-
String eventIdThe unique ID of the server event.
-
JsonValue; type "input_audio_buffer.cleared"constantThe event type, must be
input_audio_buffer.cleared.INPUT_AUDIO_BUFFER_CLEARED("input_audio_buffer.cleared")
-
-
class InputAudioBufferCommittedEvent:Returned when an input audio buffer is committed, either by the client or automatically in server VAD mode. The
item_id property is the ID of the user message item that will be created, thus a conversation.item.created event will also be sent to the client.-
String eventIdThe unique ID of the server event.
-
String itemIdThe ID of the user message item that will be created.
-
JsonValue; type "input_audio_buffer.committed"constantThe event type, must be
input_audio_buffer.committed.INPUT_AUDIO_BUFFER_COMMITTED("input_audio_buffer.committed")
-
Optional<String> previousItemIdThe ID of the preceding item after which the new item will be inserted. Can be
nullif the item has no predecessor.
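A client that mirrors the conversation locally can use previousItemId to place each committed item. This is a minimal sketch under the assumption that a null predecessor means the item goes first. ItemOrder is a hypothetical class, and appending when the predecessor is unknown is this sketch's own fallback, not documented behavior.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical local mirror of conversation item ordering. */
final class ItemOrder {
    private final List<String> itemIds = new ArrayList<>();

    /** Inserts itemId directly after previousItemId; a null predecessor places it first. */
    void insertAfter(String previousItemId, String itemId) {
        if (previousItemId == null) {
            itemIds.add(0, itemId);
            return;
        }
        int idx = itemIds.indexOf(previousItemId);
        if (idx < 0) {
            itemIds.add(itemId); // unknown predecessor: append (sketch-only fallback)
        } else {
            itemIds.add(idx + 1, itemId);
        }
    }

    List<String> snapshot() {
        return new ArrayList<>(itemIds);
    }

    public static void main(String[] args) {
        ItemOrder order = new ItemOrder();
        order.insertAfter(null, "item_a");
        order.insertAfter("item_a", "item_b");
        System.out.println(order.snapshot());
    }
}
```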
-
-
class InputAudioBufferDtmfEventReceivedEvent:SIP Only: Returned when a DTMF event is received. A DTMF event is a message that represents a telephone keypad press (0–9, *, #, A–D). The
event property is the keypad key that the user pressed. The received_at property is the UTC Unix timestamp at which the server received the event.-
String eventThe telephone keypad key that was pressed by the user.
-
long receivedAtUTC Unix timestamp at which the DTMF event was received by the server.
-
JsonValue; type "input_audio_buffer.dtmf_event_received"constantThe event type, must be
input_audio_buffer.dtmf_event_received.INPUT_AUDIO_BUFFER_DTMF_EVENT_RECEIVED("input_audio_buffer.dtmf_event_received")
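Each DTMF event carries a single key, so multi-digit input has to be assembled by the client. The sketch below collects keys until a terminator is pressed. Treating # as the terminator is a common IVR convention chosen here for illustration, not something this reference specifies, and DtmfCollector is a hypothetical class.

```java
/** Hypothetical collector that assembles DTMF key presses into an entry. */
final class DtmfCollector {
    private final StringBuilder digits = new StringBuilder();

    /** Appends one key; returns the completed entry when '#' arrives, otherwise null. */
    String onKey(String event) {
        // Valid keys per the event description: 0-9, *, #, A-D.
        if (event == null || !event.matches("[0-9*#A-D]")) {
            throw new IllegalArgumentException("not a DTMF key: " + event);
        }
        if (event.equals("#")) {
            String entry = digits.toString();
            digits.setLength(0); // reset for the next entry
            return entry;
        }
        digits.append(event);
        return null;
    }

    public static void main(String[] args) {
        DtmfCollector collector = new DtmfCollector();
        collector.onKey("4");
        collector.onKey("2");
        System.out.println(collector.onKey("#"));
    }
}
```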
-
-
class InputAudioBufferSpeechStartedEvent:Sent by the server when in
server_vad mode to indicate that speech has been detected in the audio buffer. This can happen any time audio is added to the buffer (unless speech is already detected). The client may want to use this event to interrupt audio playback or provide visual feedback to the user. The client should expect to receive an
input_audio_buffer.speech_stopped event when speech stops. The item_id property is the ID of the user message item that will be created when speech stops and will also be included in the input_audio_buffer.speech_stopped event (unless the client manually commits the audio buffer during VAD activation).-
long audioStartMsMilliseconds from the start of all audio written to the buffer during the session when speech was first detected. This will correspond to the beginning of audio sent to the model, and thus includes the
prefix_padding_ms configured in the Session. -
String eventIdThe unique ID of the server event.
-
String itemIdThe ID of the user message item that will be created when speech stops.
-
JsonValue; type "input_audio_buffer.speech_started"constantThe event type, must be
input_audio_buffer.speech_started.INPUT_AUDIO_BUFFER_SPEECH_STARTED("input_audio_buffer.speech_started")
-
-
class InputAudioBufferSpeechStoppedEvent:Returned in
server_vad mode when the server detects the end of speech in the audio buffer. The server will also send a conversation.item.created event with the user message item that is created from the audio buffer.-
long audioEndMsMilliseconds since the session started when speech stopped. This will correspond to the end of audio sent to the model, and thus includes the
min_silence_duration_ms configured in the Session. -
String eventIdThe unique ID of the server event.
-
String itemIdThe ID of the user message item that will be created.
-
JsonValue; type "input_audio_buffer.speech_stopped"constantThe event type, must be
input_audio_buffer.speech_stopped.INPUT_AUDIO_BUFFER_SPEECH_STOPPED("input_audio_buffer.speech_stopped")
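Because the speech-started and speech-stopped events share an item ID, the two offsets can be paired to measure a speech segment. This is a rough sketch that assumes audioStartMs and audioEndMs are measured on the same session timeline. SpeechSegments is a hypothetical class, and the result includes whatever padding and silence the server added to each boundary.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical tracker pairing speech_started and speech_stopped by item ID. */
final class SpeechSegments {
    private final Map<String, Long> startedAtMs = new HashMap<>();

    void onSpeechStarted(String itemId, long audioStartMs) {
        startedAtMs.put(itemId, audioStartMs);
    }

    /** Returns the segment length in ms, or -1 if no matching start was recorded. */
    long onSpeechStopped(String itemId, long audioEndMs) {
        Long start = startedAtMs.remove(itemId);
        return start == null ? -1 : audioEndMs - start;
    }

    public static void main(String[] args) {
        SpeechSegments segments = new SpeechSegments();
        segments.onSpeechStarted("item_1", 1_200);
        System.out.println(segments.onSpeechStopped("item_1", 4_700)); // prints 3500
    }
}
```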
-
-
class RateLimitsUpdatedEvent:Emitted at the beginning of a Response to indicate the updated rate limits. When a Response is created, some tokens are "reserved" for the output tokens. The rate limits shown here reflect that reservation, which is then adjusted accordingly once the Response is completed.
-
String eventIdThe unique ID of the server event.
-
List<RateLimit> rateLimitsList of rate limit information.
-
Optional<Long> limitThe maximum allowed value for the rate limit.
-
Optional<Name> nameThe name of the rate limit (
requests, tokens).-
REQUESTS("requests") -
TOKENS("tokens")
-
-
Optional<Long> remainingThe remaining value before the limit is reached.
-
Optional<Double> resetSecondsSeconds until the rate limit resets.
-
-
JsonValue; type "rate_limits.updated"constantThe event type, must be
rate_limits.updated.RATE_LIMITS_UPDATED("rate_limits.updated")
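One way to act on this event is to pause new requests when a limit is exhausted. The sketch below derives a wait time from the remaining and resetSeconds fields. RateLimitBackoff and its nested Limit class are hypothetical stand-ins for the SDK types, and waiting for the full reset window is an illustrative policy, not a documented requirement.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Hypothetical back-off helper driven by rate_limits.updated. */
final class RateLimitBackoff {
    /** Hypothetical mirror of one entry in rateLimits. */
    static final class Limit {
        final String name;
        final Optional<Long> remaining;
        final Optional<Double> resetSeconds;

        Limit(String name, Optional<Long> remaining, Optional<Double> resetSeconds) {
            this.name = name;
            this.remaining = remaining;
            this.resetSeconds = resetSeconds;
        }
    }

    /** Longest reset time in ms among exhausted limits; 0 when nothing is exhausted. */
    static long waitMs(List<Limit> limits) {
        long wait = 0;
        for (Limit limit : limits) {
            boolean exhausted = limit.remaining.isPresent() && limit.remaining.get() <= 0;
            if (exhausted && limit.resetSeconds.isPresent()) {
                wait = Math.max(wait, (long) Math.ceil(limit.resetSeconds.get() * 1000.0));
            }
        }
        return wait;
    }

    public static void main(String[] args) {
        List<Limit> limits = Arrays.asList(
                new Limit("requests", Optional.of(10L), Optional.of(1.0)),
                new Limit("tokens", Optional.of(0L), Optional.of(2.5)));
        System.out.println(waitMs(limits)); // prints 2500
    }
}
```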
-
-
class ResponseAudioDeltaEvent:Returned when the model-generated audio is updated.
-
long contentIndexThe index of the content part in the item's content array.
-
String deltaBase64-encoded audio data delta.
-
String eventIdThe unique ID of the server event.
-
String itemIdThe ID of the item.
-
long outputIndexThe index of the output item in the response.
-
String responseIdThe ID of the response.
-
JsonValue; type "response.output_audio.delta"constantThe event type, must be
response.output_audio.delta.RESPONSE_OUTPUT_AUDIO_DELTA("response.output_audio.delta")
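Each delta is a Base64 string, so the client has to decode it before playback or storage. The sketch below appends decoded chunks to an in-memory buffer using java.util.Base64. AudioAssembler is a hypothetical class, and a real player would more likely stream chunks to an audio device than buffer them all.

```java
import java.io.ByteArrayOutputStream;
import java.util.Base64;

/** Hypothetical buffer that decodes and concatenates audio deltas. */
final class AudioAssembler {
    private final ByteArrayOutputStream audio = new ByteArrayOutputStream();

    /** Decodes one Base64 delta and appends the raw bytes. */
    void onDelta(String base64Delta) {
        byte[] chunk = Base64.getDecoder().decode(base64Delta);
        audio.write(chunk, 0, chunk.length);
    }

    byte[] audio() {
        return audio.toByteArray();
    }

    public static void main(String[] args) {
        AudioAssembler assembler = new AudioAssembler();
        assembler.onDelta("AAEC"); // bytes 0, 1, 2
        assembler.onDelta("AwQ="); // bytes 3, 4
        System.out.println(assembler.audio().length); // prints 5
    }
}
```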
-
-
class ResponseAudioDoneEvent:Returned when the model-generated audio is done. Also emitted when a Response is interrupted, incomplete, or cancelled.
-
long contentIndexThe index of the content part in the item's content array.
-
String eventIdThe unique ID of the server event.
-
String itemIdThe ID of the item.
-
long outputIndexThe index of the output item in the response.
-
String responseIdThe ID of the response.
-
JsonValue; type "response.output_audio.done"constantThe event type, must be
response.output_audio.done.RESPONSE_OUTPUT_AUDIO_DONE("response.output_audio.done")
-
-
class ResponseAudioTranscriptDeltaEvent:Returned when the model-generated transcription of audio output is updated.
-
long contentIndexThe index of the content part in the item's content array.
-
String deltaThe transcript delta.
-
String eventIdThe unique ID of the server event.
-
String itemIdThe ID of the item.
-
long outputIndexThe index of the output item in the response.
-
String responseIdThe ID of the response.
-
JsonValue; type "response.output_audio_transcript.delta"constantThe event type, must be
response.output_audio_transcript.delta.RESPONSE_OUTPUT_AUDIO_TRANSCRIPT_DELTA("response.output_audio_transcript.delta")
-
-
class ResponseAudioTranscriptDoneEvent:Returned when the model-generated transcription of audio output is done streaming. Also emitted when a Response is interrupted, incomplete, or cancelled.
-
long contentIndexThe index of the content part in the item's content array.
-
String eventIdThe unique ID of the server event.
-
String itemIdThe ID of the item.
-
long outputIndexThe index of the output item in the response.
-
String responseIdThe ID of the response.
-
String transcriptThe final transcript of the audio.
-
JsonValue; type "response.output_audio_transcript.done"constantThe event type, must be
response.output_audio_transcript.done.RESPONSE_OUTPUT_AUDIO_TRANSCRIPT_DONE("response.output_audio_transcript.done")
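Transcript deltas can be accumulated per content part to show captions while audio streams. This sketch keys the partial text by item ID and content index, and it assumes the concatenated deltas equal the final transcript, which this reference does not guarantee. TranscriptAccumulator is a hypothetical class.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical accumulator for output audio transcript deltas. */
final class TranscriptAccumulator {
    private final Map<String, StringBuilder> partial = new HashMap<>();

    private static String key(String itemId, long contentIndex) {
        return itemId + "#" + contentIndex;
    }

    /** Appends a delta and returns the text accumulated so far. */
    String onDelta(String itemId, long contentIndex, String delta) {
        return partial.computeIfAbsent(key(itemId, contentIndex), k -> new StringBuilder())
                .append(delta)
                .toString();
    }

    /** Drops the partial text and reports whether it matched the final transcript. */
    boolean onDone(String itemId, long contentIndex, String transcript) {
        StringBuilder text = partial.remove(key(itemId, contentIndex));
        return text != null && text.toString().equals(transcript);
    }

    public static void main(String[] args) {
        TranscriptAccumulator acc = new TranscriptAccumulator();
        acc.onDelta("item_1", 0, "Hello, ");
        System.out.println(acc.onDelta("item_1", 0, "world."));
    }
}
```

If onDone returns false, preferring the transcript from the done event is the safer choice, since that field is described as final.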
-
-
class ResponseContentPartAddedEvent:Returned when a new content part is added to an assistant message item during response generation.
-
long contentIndexThe index of the content part in the item's content array.
-
String eventIdThe unique ID of the server event.
-
String itemIdThe ID of the item to which the content part was added.
-
long outputIndexThe index of the output item in the response.
-
Part partThe content part that was added.
-
Optional<String> audioBase64-encoded audio data (if type is "audio").
-
Optional<String> textThe text content (if type is "text").
-
Optional<String> transcriptThe transcript of the audio (if type is "audio").
-
Optional<Type> typeThe content type ("text", "audio").
-
TEXT("text") -
AUDIO("audio")
-
-
-
String responseIdThe ID of the response.
-
JsonValue; type "response.content_part.added"constantThe event type, must be
response.content_part.added.RESPONSE_CONTENT_PART_ADDED("response.content_part.added")
-
-
class ResponseContentPartDoneEvent:Returned when a content part is done streaming in an assistant message item. Also emitted when a Response is interrupted, incomplete, or cancelled.
-
long contentIndexThe index of the content part in the item's content array.
-
String eventIdThe unique ID of the server event.
-
String itemIdThe ID of the item.
-
long outputIndexThe index of the output item in the response.
-
Part partThe content part that is done.
-
Optional<String> audioBase64-encoded audio data (if type is "audio").
-
Optional<String> textThe text content (if type is "text").
-
Optional<String> transcriptThe transcript of the audio (if type is "audio").
-
Optional<Type> typeThe content type ("text", "audio").
-
TEXT("text") -
AUDIO("audio")
-
-
-
String responseIdThe ID of the response.
-
JsonValue; type "response.content_part.done"constantThe event type, must be
response.content_part.done.RESPONSE_CONTENT_PART_DONE("response.content_part.done")
-
-
class ResponseCreatedEvent:Returned when a new Response is created. The first event of response creation, where the response is in an initial state of
in_progress.-
String eventIdThe unique ID of the server event.
-
RealtimeResponse responseThe response resource.
-
Optional<String> idThe unique ID of the response, will look like
resp_1234. -
Optional<Audio> audioConfiguration for audio output.
-
Optional<Output> output-
Optional<RealtimeAudioFormats> formatThe format of the output audio.
-
AudioPcm-
Optional<Rate> rateThe sample rate of the audio. Always
24000._24000(24000)
-
Optional<Type> typeThe audio format. Always
audio/pcm.AUDIO_PCM("audio/pcm")
-
-
AudioPcmu-
Optional<Type> typeThe audio format. Always
audio/pcmu.AUDIO_PCMU("audio/pcmu")
-
-
AudioPcma-
Optional<Type> typeThe audio format. Always
audio/pcma.AUDIO_PCMA("audio/pcma")
-
-
-
Optional<Voice> voiceThe voice the model uses to respond. Voice cannot be changed during the session once the model has responded with audio at least once. Current voice options are
alloy,ash,ballad,coral,echo,sage,shimmer,verse,marin, andcedar. We recommendmarinandcedarfor best quality.-
ALLOY("alloy") -
ASH("ash") -
BALLAD("ballad") -
CORAL("coral") -
ECHO("echo") -
SAGE("sage") -
SHIMMER("shimmer") -
VERSE("verse") -
MARIN("marin") -
CEDAR("cedar")
-
-
-
-
Optional<String> conversationIdWhich conversation the response is added to, determined by the
conversation field in the response.create event. If auto, the response will be added to the default conversation and the value of conversation_id will be an id like conv_1234. If none, the response will not be added to any conversation and the value of conversation_id will be null. If responses are being triggered automatically by VAD, the response will be added to the default conversation. -
Optional<MaxOutputTokens> maxOutputTokensMaximum number of output tokens for a single assistant response, inclusive of tool calls, that was used in this response.
-
long -
JsonValue;INF("inf")
-
-
Optional<Metadata> metadataSet of up to 16 key-value pairs that can be attached to an object. This can be useful for storing additional information about the object in a structured format, and querying for objects via API or the dashboard.
Keys are strings with a maximum length of 64 characters. Values are strings with a maximum length of 512 characters.
-
Optional<Object> object_The object type, must be
realtime.response.REALTIME_RESPONSE("realtime.response")
-
Optional<List<ConversationItem>> outputThe list of output items generated by the response.
-
class RealtimeConversationItemSystemMessage:A system message in a Realtime conversation can be used to provide additional context or instructions to the model. This is similar but distinct from the instruction prompt provided at the start of a conversation, as system messages can be added at any point in the conversation. For major changes to the conversation's behavior, use instructions, but for smaller updates (e.g. "the user is now asking about a different topic"), use system messages.
-
class RealtimeConversationItemUserMessage:A user message item in a Realtime conversation.
-
class RealtimeConversationItemAssistantMessage:An assistant message item in a Realtime conversation.
-
class RealtimeConversationItemFunctionCall:A function call item in a Realtime conversation.
-
class RealtimeConversationItemFunctionCallOutput:A function call output item in a Realtime conversation.
-
class RealtimeMcpApprovalResponse:A Realtime item responding to an MCP approval request.
-
class RealtimeMcpListTools:A Realtime item listing tools available on an MCP server.
-
class RealtimeMcpToolCall:A Realtime item representing an invocation of a tool on an MCP server.
-
class RealtimeMcpApprovalRequest:A Realtime item requesting human approval of a tool invocation.
-
-
Optional<List<OutputModality>> outputModalitiesThe set of modalities the model used to respond. Currently the only possible values are
["audio"] and ["text"]. Audio output always includes a text transcript. Setting the output mode to text will disable audio output from the model.-
TEXT("text") -
AUDIO("audio")
-
-
Optional<Status> statusThe final status of the response (
completed, cancelled, failed, incomplete, or in_progress).-
COMPLETED("completed") -
CANCELLED("cancelled") -
FAILED("failed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
Optional<RealtimeResponseStatus> statusDetailsAdditional details about the status.
-
Optional<Error> errorA description of the error that caused the response to fail, populated when the
status is failed.-
Optional<String> codeError code, if any.
-
Optional<String> typeThe type of error.
-
-
Optional<Reason> reasonThe reason the Response did not complete. For a
cancelled Response, one of turn_detected (the server VAD detected a new start of speech) or client_cancelled (the client sent a cancel event). For an incomplete Response, one of max_output_tokens or content_filter (the server-side safety filter activated and cut off the response).-
TURN_DETECTED("turn_detected") -
CLIENT_CANCELLED("client_cancelled") -
MAX_OUTPUT_TOKENS("max_output_tokens") -
CONTENT_FILTER("content_filter")
-
-
Optional<Type> typeThe type of error that caused the response to fail, corresponding with the
status field (completed, cancelled, incomplete, failed).-
COMPLETED("completed") -
CANCELLED("cancelled") -
INCOMPLETE("incomplete") -
FAILED("failed")
-
-
-
Optional<RealtimeResponseUsage> usageUsage statistics for the Response; these correspond to billing. A Realtime API session maintains a conversation context and appends new Items to the Conversation, so output from previous turns (text and audio tokens) becomes the input for later turns.
-
Optional<RealtimeResponseUsageInputTokenDetails> inputTokenDetailsDetails about the input tokens used in the Response. Cached tokens are tokens from previous turns in the conversation that are included as context for the current response. Cached tokens here are counted as a subset of input tokens, meaning input tokens will include cached and uncached tokens.
-
Optional<Long> audioTokensThe number of audio tokens used as input for the Response.
-
Optional<Long> cachedTokensThe number of cached tokens used as input for the Response.
-
Optional<CachedTokensDetails> cachedTokensDetailsDetails about the cached tokens used as input for the Response.
-
Optional<Long> audioTokensThe number of cached audio tokens used as input for the Response.
-
Optional<Long> imageTokensThe number of cached image tokens used as input for the Response.
-
Optional<Long> textTokensThe number of cached text tokens used as input for the Response.
-
-
Optional<Long> imageTokensThe number of image tokens used as input for the Response.
-
Optional<Long> textTokensThe number of text tokens used as input for the Response.
-
-
Optional<Long> inputTokensThe number of input tokens used in the Response, including text and audio tokens.
-
Optional<RealtimeResponseUsageOutputTokenDetails> outputTokenDetailsDetails about the output tokens used in the Response.
-
Optional<Long> audioTokensThe number of audio tokens used in the Response.
-
Optional<Long> textTokensThe number of text tokens used in the Response.
-
-
Optional<Long> outputTokensThe number of output tokens sent in the Response, including text and audio tokens.
-
Optional<Long> totalTokensThe total number of tokens in the Response including input and output text and audio tokens.
-
-
-
JsonValue; type "response.created"constantThe event type, must be
response.created.RESPONSE_CREATED("response.created")
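Because cached tokens are counted as a subset of input tokens, the uncached portion has to be derived by subtraction. The sketch below does that with the Optional-typed usage fields. UsageMath is a hypothetical helper, and treating an absent field as zero is an assumption of this sketch.

```java
import java.util.Optional;

/** Hypothetical arithmetic over RealtimeResponseUsage fields. */
final class UsageMath {
    /** Input tokens that were not served from cache. */
    static long uncachedInputTokens(Optional<Long> inputTokens, Optional<Long> cachedTokens) {
        long input = inputTokens.orElse(0L);   // absent is treated as 0 (assumption)
        long cached = cachedTokens.orElse(0L);
        return Math.max(0L, input - cached);   // guard against inconsistent counts
    }

    public static void main(String[] args) {
        System.out.println(uncachedInputTokens(Optional.of(1_000L), Optional.of(400L))); // prints 600
    }
}
```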
-
-
class ResponseDoneEvent:Returned when a Response is done streaming. Always emitted, no matter the final state. The Response object included in the
response.done event will include all output Items in the Response but will omit the raw audio data. Clients should check the
status field of the Response to determine if it was successful (completed) or if there was another outcome: cancelled, failed, or incomplete. A response will contain all output items that were generated during the response, excluding any audio content.
-
String eventIdThe unique ID of the server event.
-
RealtimeResponse responseThe response resource.
-
JsonValue; type "response.done"constantThe event type, must be
response.done.RESPONSE_DONE("response.done")
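The status check described above can be sketched as a small branch over the documented status values. ResponseOutcome and the returned labels are hypothetical, and the status and reason arguments are plain strings standing in for the SDK enums.

```java
/** Hypothetical summary of a finished Response from its status and reason. */
final class ResponseOutcome {
    static String describe(String status, String reason) {
        if (status == null) {
            return "unknown";
        }
        switch (status) {
            case "completed":
                return "ok";
            case "cancelled":   // reason: turn_detected or client_cancelled
            case "incomplete":  // reason: max_output_tokens or content_filter
                return status + (reason == null ? "" : " (" + reason + ")");
            case "failed":
                return "failed";
            default:
                return "unexpected status: " + status;
        }
    }

    public static void main(String[] args) {
        System.out.println(describe("cancelled", "turn_detected"));
    }
}
```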
-
-
class ResponseFunctionCallArgumentsDeltaEvent:Returned when the model-generated function call arguments are updated.
-
String callIdThe ID of the function call.
-
String deltaThe arguments delta as a JSON string.
-
String eventIdThe unique ID of the server event.
-
String itemIdThe ID of the function call item.
-
long outputIndexThe index of the output item in the response.
-
String responseIdThe ID of the response.
-
JsonValue; type "response.function_call_arguments.delta"constantThe event type, must be
response.function_call_arguments.delta.RESPONSE_FUNCTION_CALL_ARGUMENTS_DELTA("response.function_call_arguments.delta")
-
-
class ResponseFunctionCallArgumentsDoneEvent:Returned when the model-generated function call arguments are done streaming. Also emitted when a Response is interrupted, incomplete, or cancelled.
-
String argumentsThe final arguments as a JSON string.
-
String callIdThe ID of the function call.
-
String eventIdThe unique ID of the server event.
-
String itemIdThe ID of the function call item.
-
String nameThe name of the function that was called.
-
long outputIndexThe index of the output item in the response.
-
String responseIdThe ID of the response.
-
JsonValue; type "response.function_call_arguments.done"constantThe event type, must be
response.function_call_arguments.done.RESPONSE_FUNCTION_CALL_ARGUMENTS_DONE("response.function_call_arguments.done")
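Once the arguments are done streaming, the client typically runs the named function with the final JSON string. The sketch below routes by function name to a registered handler. FunctionCallRouter and its error payload are hypothetical, and handlers receive the raw JSON string because this sketch does not assume any JSON library.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical dispatcher for completed function calls. */
final class FunctionCallRouter {
    private final Map<String, Function<String, String>> handlers = new HashMap<>();

    void register(String name, Function<String, String> handler) {
        handlers.put(name, handler);
    }

    /** Runs the handler registered for name with the final arguments JSON string. */
    String onArgumentsDone(String name, String argumentsJson) {
        Function<String, String> handler = handlers.get(name);
        if (handler == null) {
            // Illustrative error payload; the name is not escaped in this sketch.
            return "{\"error\":\"unknown function: " + name + "\"}";
        }
        return handler.apply(argumentsJson);
    }

    public static void main(String[] args) {
        FunctionCallRouter router = new FunctionCallRouter();
        router.register("echo", json -> json);
        System.out.println(router.onArgumentsDone("echo", "{\"x\":1}"));
    }
}
```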
-
-
class ResponseOutputItemAddedEvent:Returned when a new Item is created during Response generation.
-
String eventIdThe unique ID of the server event.
-
ConversationItem itemA single item within a Realtime conversation.
-
long outputIndexThe index of the output item in the Response.
-
String responseIdThe ID of the Response to which the item belongs.
-
JsonValue; type "response.output_item.added"constantThe event type, must be
response.output_item.added.RESPONSE_OUTPUT_ITEM_ADDED("response.output_item.added")
-
-
class ResponseOutputItemDoneEvent:Returned when an Item is done streaming. Also emitted when a Response is interrupted, incomplete, or cancelled.
-
String eventIdThe unique ID of the server event.
-
ConversationItem itemA single item within a Realtime conversation.
-
long outputIndexThe index of the output item in the Response.
-
String responseIdThe ID of the Response to which the item belongs.
-
JsonValue; type "response.output_item.done"constantThe event type, must be
response.output_item.done.RESPONSE_OUTPUT_ITEM_DONE("response.output_item.done")
-
-
class ResponseTextDeltaEvent:Returned when the text value of an "output_text" content part is updated.
-
long contentIndexThe index of the content part in the item's content array.
-
String deltaThe text delta.
-
String eventIdThe unique ID of the server event.
-
String itemIdThe ID of the item.
-
long outputIndexThe index of the output item in the response.
-
String responseIdThe ID of the response.
-
JsonValue; type "response.output_text.delta"constantThe event type, must be
response.output_text.delta.RESPONSE_OUTPUT_TEXT_DELTA("response.output_text.delta")
-
-
class ResponseTextDoneEvent:Returned when the text value of an "output_text" content part is done streaming. Also emitted when a Response is interrupted, incomplete, or cancelled.
-
long contentIndexThe index of the content part in the item's content array.
-
String eventIdThe unique ID of the server event.
-
String itemIdThe ID of the item.
-
long outputIndexThe index of the output item in the response.
-
String responseIdThe ID of the response.
-
String textThe final text content.
-
JsonValue; type "response.output_text.done"constantThe event type, must be
response.output_text.done.RESPONSE_OUTPUT_TEXT_DONE("response.output_text.done")
-
-
class SessionCreatedEvent:Returned when a Session is created. Emitted automatically when a new connection is established as the first server event. This event will contain the default Session configuration.
-
String eventIdThe unique ID of the server event.
-
Session sessionThe session configuration.
-
class RealtimeSessionCreateRequest:Realtime session object configuration.
-
JsonValue; type "realtime"constantThe type of session to create. Always
realtimefor the Realtime API.REALTIME("realtime")
-
Optional<RealtimeAudioConfig> audioConfiguration for input and output audio.
-
Optional<RealtimeAudioConfigInput> input-
Optional<RealtimeAudioFormats> formatThe format of the input audio.
-
Optional<NoiseReduction> noiseReductionConfiguration for input audio noise reduction. This can be set to
nullto turn off. Noise reduction filters audio added to the input audio buffer before it is sent to VAD and the model. Filtering the audio can improve VAD and turn detection accuracy (reducing false positives) and model performance by improving perception of the input audio.-
Optional<NoiseReductionType> typeType of noise reduction.
near_fieldis for close-talking microphones such as headphones,far_fieldis for far-field microphones such as laptop or conference room microphones.-
NEAR_FIELD("near_field") -
FAR_FIELD("far_field")
-
-
-
Optional<AudioTranscription> transcriptionConfiguration for input audio transcription, defaults to off and can be set to
null to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through the /audio/transcriptions endpoint and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service.-
Optional<Delay> delayControls how long the model waits before emitting transcription text. Higher values can improve transcription accuracy at the cost of latency. Only supported with
gpt-realtime-whisperin GA Realtime sessions.-
MINIMAL("minimal") -
LOW("low") -
MEDIUM("medium") -
HIGH("high") -
XHIGH("xhigh")
-
-
Optional<String> languageThe language of the input audio. Supplying the input language in ISO-639-1 (e.g.
en) format will improve accuracy and latency. -
Optional<Model> modelThe model to use for transcription. Current options are
whisper-1,gpt-4o-mini-transcribe,gpt-4o-mini-transcribe-2025-12-15,gpt-4o-transcribe,gpt-4o-transcribe-diarize, andgpt-realtime-whisper. Usegpt-4o-transcribe-diarizewhen you need diarization with speaker labels.-
WHISPER_1("whisper-1") -
GPT_4O_MINI_TRANSCRIBE("gpt-4o-mini-transcribe") -
GPT_4O_MINI_TRANSCRIBE_2025_12_15("gpt-4o-mini-transcribe-2025-12-15") -
GPT_4O_TRANSCRIBE("gpt-4o-transcribe") -
GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize") -
GPT_REALTIME_WHISPER("gpt-realtime-whisper")
-
-
Optional<String> promptAn optional text to guide the model's style or continue a previous audio segment. For
whisper-1, the prompt is a list of keywords. Forgpt-4o-transcribemodels (excludinggpt-4o-transcribe-diarize), the prompt is a free text string, for example "expect words related to technology". Prompt is not supported withgpt-realtime-whisperin GA Realtime sessions.
-
-
Optional<RealtimeAudioInputTurnDetection> turnDetectionConfiguration for turn detection, either Server VAD or Semantic VAD. This can be set to
null to turn off, in which case the client must manually trigger model response. Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.
Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.
For
gpt-realtime-whisper transcription sessions, turn detection must be set to null; VAD is not supported.-
ServerVad-
JsonValue; type "server_vad"constantType of turn detection,
server_vadto turn on simple Server VAD.SERVER_VAD("server_vad")
-
Optional<Boolean> createResponseWhether or not to automatically generate a response when a VAD stop event occurs. If
interrupt_response is set to false this may fail to create a response if the model is already responding. If both
create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted. -
Optional<Long> idleTimeoutMsOptional timeout after which a model response will be triggered automatically. This is useful for situations in which a long pause from the user is unexpected, such as a phone call. The model will effectively prompt the user to continue the conversation based on the current context.
The timeout value will be applied after the last model response's audio has finished playing, i.e. it's set to the
response.done time plus audio playback duration. An
input_audio_buffer.timeout_triggered event (plus events associated with the Response) will be emitted when the timeout is reached. Idle timeout is currently only supported for server_vad mode. -
Optional<Boolean> interruptResponseWhether or not to automatically interrupt (cancel) any ongoing response with output to the default conversation (i.e.
conversation of auto) when a VAD start event occurs. If true then the response will be cancelled, otherwise it will continue until complete. If both
create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted. -
Optional<Long> prefixPaddingMsUsed only for
server_vadmode. Amount of audio to include before the VAD detected speech (in milliseconds). Defaults to 300ms. -
Optional<Long> silenceDurationMsUsed only for
server_vadmode. Duration of silence to detect speech stop (in milliseconds). Defaults to 500ms. With shorter values the model will respond more quickly, but may jump in on short pauses from the user. -
Optional<Double> thresholdUsed only for
server_vad mode. Activation threshold for VAD (0.0 to 1.0); defaults to 0.5. A higher threshold will require louder audio to activate the model, and thus might perform better in noisy environments.
-
-
SemanticVad-
JsonValue; type "semantic_vad"constantType of turn detection,
semantic_vadto turn on Semantic VAD.SEMANTIC_VAD("semantic_vad")
-
Optional<Boolean> createResponseWhether or not to automatically generate a response when a VAD stop event occurs.
-
Optional<Eagerness> eagernessUsed only for
semantic_vad mode. The eagerness of the model to respond. low will wait longer for the user to continue speaking, high will respond more quickly. auto is the default and is equivalent to medium. low, medium, and high have max timeouts of 8s, 4s, and 2s respectively.-
LOW("low") -
MEDIUM("medium") -
HIGH("high") -
AUTO("auto")
-
-
Optional<Boolean> interruptResponseWhether or not to automatically interrupt any ongoing response with output to the default conversation (i.e.
conversationofauto) when a VAD start event occurs.
-
-
-
-
Optional<RealtimeAudioConfigOutput> output-
Optional<RealtimeAudioFormats> formatThe format of the output audio.
-
Optional<Double> speedThe speed of the model's spoken response as a multiple of the original speed. 1.0 is the default speed. 0.25 is the minimum speed. 1.5 is the maximum speed. This value can only be changed in between model turns, not while a response is in progress.
This parameter is a post-processing adjustment to the audio after it is generated; it's also possible to prompt the model to speak faster or slower.
-
Optional<Voice> voiceThe voice the model uses to respond. Supported built-in voices are
alloy,ash,ballad,coral,echo,sage,shimmer,verse,marin, andcedar. You may also provide a custom voice object with anid, for example{ "id": "voice_1234" }. Voice cannot be changed during the session once the model has responded with audio at least once. We recommendmarinandcedarfor best quality.-
String -
enum UnionMember1:-
ALLOY("alloy") -
ASH("ash") -
BALLAD("ballad") -
CORAL("coral") -
ECHO("echo") -
SAGE("sage") -
SHIMMER("shimmer") -
VERSE("verse") -
MARIN("marin") -
CEDAR("cedar")
-
-
class Id:Custom voice reference.
-
String idThe custom voice ID, e.g.
voice_1234.
-
-
-
-
-
Optional<List<Include>> includeAdditional fields to include in server outputs.
item.input_audio_transcription.logprobs: Include logprobs for input audio transcription.ITEM_INPUT_AUDIO_TRANSCRIPTION_LOGPROBS("item.input_audio_transcription.logprobs")
-
Optional<String> instructionsThe default system instructions (i.e. system message) prepended to model calls. This field allows the client to guide the model on desired responses. The model can be instructed on response content and format (e.g. "be extremely succinct", "act friendly", "here are examples of good responses") and on audio behavior (e.g. "talk quickly", "inject emotion into your voice", "laugh frequently"). The instructions are not guaranteed to be followed by the model, but they provide guidance to the model on the desired behavior.
Note that the server sets default instructions which will be used if this field is not set and are visible in the
session.createdevent at the start of the session. -
Optional<MaxOutputTokens> maxOutputTokensMaximum number of output tokens for a single assistant response, inclusive of tool calls. Provide an integer between 1 and 4096 to limit output tokens, or
inffor the maximum available tokens for a given model. Defaults toinf.-
long -
JsonValue;INF("inf")
-
-
Optional<Model> modelThe Realtime model used for this session.
-
GPT_REALTIME("gpt-realtime") -
GPT_REALTIME_1_5("gpt-realtime-1.5") -
GPT_REALTIME_2("gpt-realtime-2") -
GPT_REALTIME_2025_08_28("gpt-realtime-2025-08-28") -
GPT_4O_REALTIME_PREVIEW("gpt-4o-realtime-preview") -
GPT_4O_REALTIME_PREVIEW_2024_10_01("gpt-4o-realtime-preview-2024-10-01") -
GPT_4O_REALTIME_PREVIEW_2024_12_17("gpt-4o-realtime-preview-2024-12-17") -
GPT_4O_REALTIME_PREVIEW_2025_06_03("gpt-4o-realtime-preview-2025-06-03") -
GPT_4O_MINI_REALTIME_PREVIEW("gpt-4o-mini-realtime-preview") -
GPT_4O_MINI_REALTIME_PREVIEW_2024_12_17("gpt-4o-mini-realtime-preview-2024-12-17") -
GPT_REALTIME_MINI("gpt-realtime-mini") -
GPT_REALTIME_MINI_2025_10_06("gpt-realtime-mini-2025-10-06") -
GPT_REALTIME_MINI_2025_12_15("gpt-realtime-mini-2025-12-15") -
GPT_AUDIO_1_5("gpt-audio-1.5") -
GPT_AUDIO_MINI("gpt-audio-mini") -
GPT_AUDIO_MINI_2025_10_06("gpt-audio-mini-2025-10-06") -
GPT_AUDIO_MINI_2025_12_15("gpt-audio-mini-2025-12-15")
-
-
Optional<List<OutputModality>> outputModalitiesThe set of modalities the model can respond with. It defaults to
["audio"], indicating that the model will respond with audio plus a transcript.["text"]can be used to make the model respond with text only. It is not possible to request bothtextandaudioat the same time.-
TEXT("text") -
AUDIO("audio")
-
-
Optional<Boolean> parallelToolCallsWhether the model may call multiple tools in parallel. Only supported by reasoning Realtime models such as
gpt-realtime-2. -
Optional<ResponsePrompt> promptReference to a prompt template and its variables. Learn more.
-
String idThe unique identifier of the prompt template to use.
-
Optional<Variables> variablesOptional map of values to substitute in for variables in your prompt. The substitution values can either be strings, or other Response input types like images or files.
-
String -
class ResponseInputText:A text input to the model.
-
String textThe text input to the model.
-
JsonValue; type "input_text"constantThe type of the input item. Always
input_text.INPUT_TEXT("input_text")
-
-
class ResponseInputImage:An image input to the model. Learn about image inputs.
-
Detail detailThe detail level of the image to be sent to the model. One of
high,low,auto, ororiginal. Defaults toauto.-
LOW("low") -
HIGH("high") -
AUTO("auto") -
ORIGINAL("original")
-
-
JsonValue; type "input_image"constantThe type of the input item. Always
input_image.INPUT_IMAGE("input_image")
-
Optional<String> fileIdThe ID of the file to be sent to the model.
-
Optional<String> imageUrlThe URL of the image to be sent to the model. A fully qualified URL or base64 encoded image in a data URL.
-
-
class ResponseInputFile:A file input to the model.
-
JsonValue; type "input_file"constantThe type of the input item. Always
input_file.INPUT_FILE("input_file")
-
Optional<Detail> detailThe detail level of the file to be sent to the model. Use
lowfor the default rendering behavior, orhighto render the file at higher quality. Defaults tolow.-
LOW("low") -
HIGH("high")
-
-
Optional<String> fileDataThe content of the file to be sent to the model.
-
Optional<String> fileIdThe ID of the file to be sent to the model.
-
Optional<String> fileUrlThe URL of the file to be sent to the model.
-
Optional<String> filenameThe name of the file to be sent to the model.
-
-
-
Optional<String> versionOptional version of the prompt template.
-
-
Optional<RealtimeReasoning> reasoningConfiguration for reasoning-capable Realtime models such as
gpt-realtime-2.-
Optional<RealtimeReasoningEffort> effortConstrains effort on reasoning for reasoning-capable Realtime models such as
gpt-realtime-2.-
MINIMAL("minimal") -
LOW("low") -
MEDIUM("medium") -
HIGH("high") -
XHIGH("xhigh")
-
-
-
Optional<RealtimeToolChoiceConfig> toolChoiceHow the model chooses tools. Provide one of the string modes or force a specific function/MCP tool.
-
enum ToolChoiceOptions:Controls which (if any) tool is called by the model.
nonemeans the model will not call any tool and instead generates a message.automeans the model can pick between generating a message or calling one or more tools.requiredmeans the model must call one or more tools.-
NONE("none") -
AUTO("auto") -
REQUIRED("required")
-
-
class ToolChoiceFunction:Use this option to force the model to call a specific function.
-
String nameThe name of the function to call.
-
JsonValue; type "function"constantFor function calling, the type is always
function.FUNCTION("function")
-
-
class ToolChoiceMcp:Use this option to force the model to call a specific tool on a remote MCP server.
-
String serverLabelThe label of the MCP server to use.
-
JsonValue; type "mcp"constantFor MCP tools, the type is always
mcp.MCP("mcp")
-
Optional<String> nameThe name of the tool to call on the server.
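As a rough illustration of the three tool choice shapes above (a string mode, a forced function, a forced MCP tool), the following plain-Java sketch builds the corresponding JSON fragments by hand. `ToolChoiceSketch` and its methods are hypothetical; the snake_case key `server_label` is an assumption about the wire format, and names are assumed to be plain identifiers that need no JSON escaping.

```java
import java.util.Set;

// Hypothetical helper (not an SDK class): renders tool choice values as JSON text.
final class ToolChoiceSketch {
    private static final Set<String> MODES = Set.of("none", "auto", "required");

    // One of the string modes: "none", "auto", or "required".
    static String mode(String mode) {
        if (!MODES.contains(mode)) {
            throw new IllegalArgumentException("unknown tool choice mode: " + mode);
        }
        return "\"" + mode + "\"";
    }

    // Forces a specific function; the type is always "function".
    static String function(String name) {
        return "{\"type\":\"function\",\"name\":\"" + name + "\"}";
    }

    // Forces a tool on an MCP server; the tool name is optional.
    static String mcp(String serverLabel, String toolName) {
        StringBuilder sb = new StringBuilder();
        sb.append("{\"type\":\"mcp\",\"server_label\":\"").append(serverLabel).append("\"");
        if (toolName != null) {
            sb.append(",\"name\":\"").append(toolName).append("\"");
        }
        return sb.append("}").toString();
    }

    public static void main(String[] args) {
        System.out.println(mode("auto"));
        System.out.println(function("get_weather")); // hypothetical function name
        System.out.println(mcp("docs", null));       // hypothetical server label
    }
}
```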
-
-
-
Optional<List<RealtimeToolsConfigUnion>> toolsTools available to the model.
-
class RealtimeFunctionTool:-
Optional<String> descriptionThe description of the function, including guidance on when and how to call it, and guidance about what to tell the user when calling (if anything).
-
Optional<String> nameThe name of the function.
-
Optional<JsonValue> parametersParameters of the function in JSON Schema.
-
Optional<Type> typeThe type of the tool, i.e.
function.FUNCTION("function")
-
-
Mcp-
String serverLabelA label for this MCP server, used to identify it in tool calls.
-
JsonValue; type "mcp"constantThe type of the MCP tool. Always
mcp.MCP("mcp")
-
Optional<AllowedTools> allowedToolsList of allowed tool names or a filter object.
-
List<String> -
class McpToolFilter:A filter object to specify which tools are allowed.
-
Optional<Boolean> readOnlyIndicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with
readOnlyHint, it will match this filter. -
Optional<List<String>> toolNamesList of allowed tool names.
-
-
-
Optional<String> authorizationAn OAuth access token that can be used with a remote MCP server, either with a custom MCP server URL or a service connector. Your application must handle the OAuth authorization flow and provide the token here.
-
Optional<ConnectorId> connectorIdIdentifier for service connectors, like those available in ChatGPT. One of
server_urlorconnector_idmust be provided. Learn more about service connectors here.Currently supported
connector_idvalues are:-
Dropbox:
connector_dropbox -
Gmail:
connector_gmail -
Google Calendar:
connector_googlecalendar -
Google Drive:
connector_googledrive -
Microsoft Teams:
connector_microsoftteams -
Outlook Calendar:
connector_outlookcalendar -
Outlook Email:
connector_outlookemail -
SharePoint:
connector_sharepoint -
CONNECTOR_DROPBOX("connector_dropbox") -
CONNECTOR_GMAIL("connector_gmail") -
CONNECTOR_GOOGLECALENDAR("connector_googlecalendar") -
CONNECTOR_GOOGLEDRIVE("connector_googledrive") -
CONNECTOR_MICROSOFTTEAMS("connector_microsoftteams") -
CONNECTOR_OUTLOOKCALENDAR("connector_outlookcalendar") -
CONNECTOR_OUTLOOKEMAIL("connector_outlookemail") -
CONNECTOR_SHAREPOINT("connector_sharepoint")
-
-
Optional<Boolean> deferLoadingWhether this MCP tool is deferred and discovered via tool search.
-
Optional<Headers> headersOptional HTTP headers to send to the MCP server. Use for authentication or other purposes.
-
Optional<RequireApproval> requireApprovalSpecify which of the MCP server's tools require approval.
-
class McpToolApprovalFilter:Specify which of the MCP server's tools require approval. Can be
always,never, or a filter object associated with tools that require approval.-
Optional<Always> alwaysA filter object to specify which tools are allowed.
-
Optional<Boolean> readOnlyIndicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with
readOnlyHint, it will match this filter. -
Optional<List<String>> toolNamesList of allowed tool names.
-
-
Optional<Never> neverA filter object to specify which tools are allowed.
-
Optional<Boolean> readOnlyIndicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with
readOnlyHint, it will match this filter. -
Optional<List<String>> toolNamesList of allowed tool names.
-
-
-
enum McpToolApprovalSetting:Specify a single approval policy for all tools. One of
always or never. When set to always, all tools will require approval. When set to never, no tools will require approval.-
ALWAYS("always") -
NEVER("never")
-
-
-
Optional<String> serverDescriptionOptional description of the MCP server, used to provide more context.
-
Optional<String> serverUrlThe URL for the MCP server. One of
server_urlorconnector_idmust be provided.
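The "one of `server_url` or `connector_id` must be provided" rule can be checked on the client before sending a session update. The sketch below is hypothetical (`McpServerSketch` is not an SDK class) and reads the rule as "at least one"; whether supplying both is rejected is not stated here, so the sketch does not forbid it. The connector IDs are the ones listed above.

```java
import java.util.Set;

// Hypothetical client-side check (not an SDK class) for an MCP tool definition.
final class McpServerSketch {
    static final Set<String> CONNECTOR_IDS = Set.of(
        "connector_dropbox", "connector_gmail", "connector_googlecalendar",
        "connector_googledrive", "connector_microsoftteams",
        "connector_outlookcalendar", "connector_outlookemail", "connector_sharepoint");

    // True when at least one of the two is present and any connector ID is a listed one.
    static boolean isValid(String serverUrl, String connectorId) {
        if (serverUrl == null && connectorId == null) {
            return false;
        }
        return connectorId == null || CONNECTOR_IDS.contains(connectorId);
    }

    public static void main(String[] args) {
        System.out.println(isValid("https://example.com/mcp", null)); // placeholder URL
        System.out.println(isValid(null, null));
    }
}
```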
-
-
-
Optional<RealtimeTracingConfig> tracingRealtime API can write session traces to the Traces Dashboard. Set to null to disable tracing. Once tracing is enabled for a session, the configuration cannot be modified.
autowill create a trace for the session with default values for the workflow name, group id, and metadata.-
JsonValue;AUTO("auto")
-
TracingConfiguration-
Optional<String> groupIdThe group id to attach to this trace to enable filtering and grouping in the Traces Dashboard.
-
Optional<JsonValue> metadataThe arbitrary metadata to attach to this trace to enable filtering in the Traces Dashboard.
-
Optional<String> workflowNameThe name of the workflow to attach to this trace. This is used to name the trace in the Traces Dashboard.
-
-
-
Optional<RealtimeTruncation> truncationWhen the number of tokens in a conversation exceeds the model's input token limit, the conversation will be truncated, meaning messages (starting from the oldest) will not be included in the model's context. A 32k context model with 4,096 max output tokens can only include 28,224 tokens in the context before truncation occurs.
Clients can configure truncation behavior to truncate with a lower max token limit, which is an effective way to control token usage and cost.
Truncation will reduce the number of cached tokens on the next turn (busting the cache), since messages are dropped from the beginning of the context. However, clients can also configure truncation to retain messages up to a fraction of the maximum context size, which will reduce the need for future truncations and thus improve the cache rate.
Truncation can be disabled entirely, which means the server will never truncate but would instead return an error if the conversation exceeds the model's input token limit.
-
RealtimeTruncationStrategy-
AUTO("auto") -
DISABLED("disabled")
-
-
class RealtimeTruncationRetentionRatio:Retain a fraction of the conversation tokens when the conversation exceeds the input token limit. This allows you to amortize truncations across multiple turns, which can help improve cached token usage.
-
double retentionRatioFraction of post-instruction conversation tokens to retain (
0.0-1.0) when the conversation exceeds the input token limit. Setting this to0.8means that messages will be dropped until 80% of the maximum allowed tokens are used. This helps reduce the frequency of truncations and improve cache rates. -
JsonValue; type "retention_ratio"constantUse retention ratio truncation.
RETENTION_RATIO("retention_ratio")
-
Optional<TokenLimits> tokenLimitsOptional custom token limits for this truncation strategy. If not provided, the model's default token limits will be used.
-
Optional<Long> postInstructionsMaximum tokens allowed in the conversation after instructions (which include tool definitions). For example, setting this to 5,000 would mean that truncation would occur when the conversation exceeds 5,000 tokens after instructions. This cannot be higher than the model's context window size minus the maximum output tokens.
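To make the retention ratio arithmetic concrete, here is a small simulation in plain Java. It is a sketch under assumptions, not the server's algorithm: `RetentionRatioSketch` is a hypothetical name, messages are modeled only by their token counts, and whole messages are dropped from the oldest end until the total is at or below `retentionRatio` times the limit.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Hypothetical simulation (not an SDK class) of retention-ratio truncation.
final class RetentionRatioSketch {
    // messageTokens is ordered oldest first; returns the token counts that are kept.
    static Deque<Integer> truncate(List<Integer> messageTokens, long limit, double retentionRatio) {
        Deque<Integer> kept = new ArrayDeque<>(messageTokens);
        long total = 0;
        for (int tokens : messageTokens) {
            total += tokens;
        }
        if (total <= limit) {
            return kept; // under the limit: nothing is truncated
        }
        long target = (long) Math.floor(limit * retentionRatio);
        while (!kept.isEmpty() && total > target) {
            total -= kept.removeFirst(); // drop the oldest message
        }
        return kept;
    }

    public static void main(String[] args) {
        // 5,500 tokens against a 5,000 limit with ratio 0.8: the target is 4,000 tokens.
        System.out.println(truncate(List.of(1500, 1500, 1500, 1000), 5000, 0.8));
    }
}
```

Truncating down to 4,000 rather than just below 5,000 leaves headroom, so the next few turns do not each trigger another truncation.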
-
-
-
-
-
class RealtimeTranscriptionSessionCreateRequest:Realtime transcription session object configuration.
-
JsonValue; type "transcription"constantThe type of session to create. Always
transcriptionfor transcription sessions.TRANSCRIPTION("transcription")
-
Optional<RealtimeTranscriptionSessionAudio> audioConfiguration for input and output audio.
-
Optional<RealtimeTranscriptionSessionAudioInput> input-
Optional<RealtimeAudioFormats> formatThe PCM audio format. Only a 24kHz sample rate is supported.
-
Optional<NoiseReduction> noiseReductionConfiguration for input audio noise reduction. This can be set to
nullto turn off. Noise reduction filters audio added to the input audio buffer before it is sent to VAD and the model. Filtering the audio can improve VAD and turn detection accuracy (reducing false positives) and model performance by improving perception of the input audio.-
Optional<NoiseReductionType> typeType of noise reduction.
near_fieldis for close-talking microphones such as headphones,far_fieldis for far-field microphones such as laptop or conference room microphones.
-
-
Optional<AudioTranscription> transcriptionConfiguration for input audio transcription, defaults to off and can be set to
null to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through the /audio/transcriptions endpoint and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service. -
Optional<RealtimeTranscriptionSessionAudioInputTurnDetection> turnDetectionConfiguration for turn detection, either Server VAD or Semantic VAD. This can be set to
nullto turn off, in which case the client must manually trigger model response.Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.
Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.
For
gpt-realtime-whispertranscription sessions, turn detection must be set tonull; VAD is not supported.-
ServerVad-
JsonValue; type "server_vad"constantType of turn detection,
server_vadto turn on simple Server VAD.SERVER_VAD("server_vad")
-
Optional<Boolean> createResponseWhether or not to automatically generate a response when a VAD stop event occurs. If
interrupt_responseis set tofalsethis may fail to create a response if the model is already responding.If both
create_responseandinterrupt_responseare set tofalse, the model will never respond automatically but VAD events will still be emitted. -
Optional<Long> idleTimeoutMsOptional timeout after which a model response will be triggered automatically. This is useful for situations in which a long pause from the user is unexpected, such as a phone call. The model will effectively prompt the user to continue the conversation based on the current context.
The timeout value will be applied after the last model response's audio has finished playing, i.e. it's set to the
response.donetime plus audio playback duration.An
input_audio_buffer.timeout_triggeredevent (plus events associated with the Response) will be emitted when the timeout is reached. Idle timeout is currently only supported forserver_vadmode. -
Optional<Boolean> interruptResponseWhether or not to automatically interrupt (cancel) any ongoing response with output to the default conversation (i.e.
conversationofauto) when a VAD start event occurs. Iftruethen the response will be cancelled, otherwise it will continue until complete.If both
create_responseandinterrupt_responseare set tofalse, the model will never respond automatically but VAD events will still be emitted. -
Optional<Long> prefixPaddingMsUsed only for
server_vadmode. Amount of audio to include before the VAD detected speech (in milliseconds). Defaults to 300ms. -
Optional<Long> silenceDurationMsUsed only for
server_vadmode. Duration of silence to detect speech stop (in milliseconds). Defaults to 500ms. With shorter values the model will respond more quickly, but may jump in on short pauses from the user. -
Optional<Double> thresholdUsed only for
server_vad mode. Activation threshold for VAD (0.0 to 1.0); this defaults to 0.5. A higher threshold will require louder audio to activate the model, and thus might perform better in noisy environments.
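The interaction between `threshold` and `silenceDurationMs` can be pictured with a toy volume-based detector. This is purely illustrative and is not the server's VAD implementation: `ServerVadSketch`, the per-frame loudness levels in the range 0.0 to 1.0, and the fixed frame length are all assumptions made for the example.

```java
// Toy illustration (not the server's algorithm): speech stops once enough
// consecutive quiet frames follow detected speech.
final class ServerVadSketch {
    // Returns the index of the frame where speech stop would be declared, or -1 if none.
    static int speechStopFrame(double[] levels, int frameMs, double threshold, int silenceDurationMs) {
        boolean speaking = false;
        int silentMs = 0;
        for (int i = 0; i < levels.length; i++) {
            if (levels[i] >= threshold) {
                speaking = true; // loud enough: speech is (still) in progress
                silentMs = 0;
            } else if (speaking) {
                silentMs += frameMs; // quiet frame after speech has started
                if (silentMs >= silenceDurationMs) {
                    return i;
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        double[] levels = {0.1, 0.8, 0.9, 0.2, 0.1, 0.1, 0.1, 0.1};
        System.out.println(speechStopFrame(levels, 100, 0.5, 500)); // prints 7
        System.out.println(speechStopFrame(levels, 100, 0.5, 200)); // prints 4
    }
}
```

With the shorter silence duration the stop is declared three frames earlier, which matches the trade-off described above: faster responses, but a greater chance of cutting in on a short pause.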
-
-
SemanticVad-
JsonValue; type "semantic_vad"constantType of turn detection,
semantic_vadto turn on Semantic VAD.SEMANTIC_VAD("semantic_vad")
-
Optional<Boolean> createResponseWhether or not to automatically generate a response when a VAD stop event occurs.
-
Optional<Eagerness> eagernessUsed only for
semantic_vadmode. The eagerness of the model to respond.lowwill wait longer for the user to continue speaking,highwill respond more quickly.autois the default and is equivalent tomedium.low,medium, andhighhave max timeouts of 8s, 4s, and 2s respectively.-
LOW("low") -
MEDIUM("medium") -
HIGH("high") -
AUTO("auto")
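The eagerness levels and their maximum timeouts can be summarized as a lookup. The enum and method below are hypothetical stand-ins written for this sketch (not the SDK's `Eagerness` type); the values are taken from the description above, with `auto` treated as `medium`.

```java
// Hypothetical lookup (not an SDK class): eagerness level to maximum timeout.
final class EagernessSketch {
    enum Eagerness { LOW, MEDIUM, HIGH, AUTO }

    static int maxTimeoutSeconds(Eagerness eagerness) {
        switch (eagerness) {
            case LOW:
                return 8;
            case HIGH:
                return 2;
            case MEDIUM:
            case AUTO: // documented as equivalent to medium
            default:
                return 4;
        }
    }

    public static void main(String[] args) {
        for (Eagerness e : Eagerness.values()) {
            System.out.println(e + " -> " + maxTimeoutSeconds(e) + "s");
        }
    }
}
```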
-
-
Optional<Boolean> interruptResponseWhether or not to automatically interrupt any ongoing response with output to the default conversation (i.e.
conversationofauto) when a VAD start event occurs.
-
-
-
-
-
Optional<List<Include>> includeAdditional fields to include in server outputs.
item.input_audio_transcription.logprobs: Include logprobs for input audio transcription.ITEM_INPUT_AUDIO_TRANSCRIPTION_LOGPROBS("item.input_audio_transcription.logprobs")
-
-
-
JsonValue; type "session.created"constantThe event type, must be
session.created.SESSION_CREATED("session.created")
-
-
class SessionUpdatedEvent:Returned when a session is updated with a
session.updateevent, unless there is an error.-
String eventIdThe unique ID of the server event.
-
Session sessionThe session configuration.
-
class RealtimeSessionCreateRequest:Realtime session object configuration.
-
class RealtimeTranscriptionSessionCreateRequest:Realtime transcription session object configuration.
-
-
JsonValue; type "session.updated"constantThe event type, must be
session.updated.SESSION_UPDATED("session.updated")
-
-
OutputAudioBufferStarted-
String eventIdThe unique ID of the server event.
-
String responseIdThe unique ID of the response that produced the audio.
-
JsonValue; type "output_audio_buffer.started"constantThe event type, must be
output_audio_buffer.started.OUTPUT_AUDIO_BUFFER_STARTED("output_audio_buffer.started")
-
-
OutputAudioBufferStopped-
String eventIdThe unique ID of the server event.
-
String responseIdThe unique ID of the response that produced the audio.
-
JsonValue; type "output_audio_buffer.stopped"constantThe event type, must be
output_audio_buffer.stopped.OUTPUT_AUDIO_BUFFER_STOPPED("output_audio_buffer.stopped")
-
-
OutputAudioBufferCleared-
String eventIdThe unique ID of the server event.
-
String responseIdThe unique ID of the response that produced the audio.
-
JsonValue; type "output_audio_buffer.cleared"constantThe event type, must be
output_audio_buffer.cleared.OUTPUT_AUDIO_BUFFER_CLEARED("output_audio_buffer.cleared")
-
-
class ConversationItemAdded:Sent by the server when an Item is added to the default Conversation. This can happen in several cases:
- When the client sends a
conversation.item.createevent. - When the input audio buffer is committed. In this case the item will be a user message containing the audio from the buffer.
- When the model is generating a Response. In this case the
conversation.item.addedevent will be sent when the model starts generating a specific Item, and thus it will not yet have any content (andstatuswill bein_progress).
Except when the model is generating a Response, the event will include the full content of the Item apart from audio data, which can be retrieved separately with a
conversation.item.retrieveevent if necessary.-
String eventIdThe unique ID of the server event.
-
ConversationItem itemA single item within a Realtime conversation.
-
JsonValue; type "conversation.item.added"constantThe event type, must be
conversation.item.added.CONVERSATION_ITEM_ADDED("conversation.item.added")
-
Optional<String> previousItemIdThe ID of the item that precedes this one, if any. This is used to maintain ordering when items are inserted.
-
class ConversationItemDone:Returned when a conversation item is finalized.
The event will include the full content of the Item except for audio data, which can be retrieved separately with a
conversation.item.retrieveevent if needed.-
String eventIdThe unique ID of the server event.
-
ConversationItem itemA single item within a Realtime conversation.
-
JsonValue; type "conversation.item.done"constantThe event type, must be
conversation.item.done.CONVERSATION_ITEM_DONE("conversation.item.done")
-
Optional<String> previousItemIdThe ID of the item that precedes this one, if any. This is used to maintain ordering when items are inserted.
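One way a client might use `previousItemId` to keep a local copy of the conversation in order is sketched below. `ItemOrderSketch` is a hypothetical class, and two behaviors are assumptions rather than documented rules: an absent `previousItemId` places the item first, and an unknown `previousItemId` appends the item at the end.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical client-side ordering of conversation items (not an SDK class).
final class ItemOrderSketch {
    private final List<String> itemIds = new ArrayList<>();

    // Call with the item ID and previousItemId (or null) from an item added event.
    void onItemAdded(String itemId, String previousItemId) {
        if (previousItemId == null) {
            itemIds.add(0, itemId); // assumption: no predecessor means first
            return;
        }
        int index = itemIds.indexOf(previousItemId);
        if (index < 0) {
            itemIds.add(itemId); // assumption: unknown predecessor, append
        } else {
            itemIds.add(index + 1, itemId);
        }
    }

    List<String> order() {
        return List.copyOf(itemIds);
    }

    public static void main(String[] args) {
        ItemOrderSketch sketch = new ItemOrderSketch();
        sketch.onItemAdded("a", null);
        sketch.onItemAdded("c", "a");
        sketch.onItemAdded("b", "a"); // inserted between a and c
        System.out.println(sketch.order()); // prints [a, b, c]
    }
}
```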
-
-
class InputAudioBufferTimeoutTriggered:Returned when the Server VAD timeout is triggered for the input audio buffer. This is configured with
idle_timeout_msin theturn_detectionsettings of the session, and it indicates that there hasn't been any speech detected for the configured duration.The
audio_start_msandaudio_end_msfields indicate the segment of audio after the last model response up to the triggering time, as an offset from the beginning of audio written to the input audio buffer. This means it demarcates the segment of audio that was silent and the difference between the start and end values will roughly match the configured timeout.The empty audio will be committed to the conversation as an
input_audio item (there will be an input_audio_buffer.committed event) and a model response will be generated. There may be speech that didn't trigger VAD but is still detected by the model, so the model may respond with something relevant to the conversation or a prompt to continue speaking.-
long audioEndMsMillisecond offset of audio written to the input audio buffer at the time the timeout was triggered.
-
long audioStartMsMillisecond offset of audio written to the input audio buffer that was after the playback time of the last model response.
-
String eventIdThe unique ID of the server event.
-
String itemIdThe ID of the item associated with this segment.
-
JsonValue; type "input_audio_buffer.timeout_triggered"constantThe event type, must be
input_audio_buffer.timeout_triggered.INPUT_AUDIO_BUFFER_TIMEOUT_TRIGGERED("input_audio_buffer.timeout_triggered")
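Since the difference between `audioEndMs` and `audioStartMs` should roughly match the configured idle timeout, a client can sanity-check the event as sketched below. `IdleTimeoutSketch` and the tolerance parameter are hypothetical; no specific tolerance is stated for this event.

```java
// Hypothetical check (not an SDK class) on a timeout_triggered event's offsets.
final class IdleTimeoutSketch {
    // Length of the silent segment demarcated by the event, in milliseconds.
    static long silentMs(long audioStartMs, long audioEndMs) {
        return audioEndMs - audioStartMs;
    }

    // True when the silent segment is within toleranceMs of the configured timeout.
    static boolean roughlyMatches(long audioStartMs, long audioEndMs,
                                  long idleTimeoutMs, long toleranceMs) {
        return Math.abs(silentMs(audioStartMs, audioEndMs) - idleTimeoutMs) <= toleranceMs;
    }

    public static void main(String[] args) {
        System.out.println(silentMs(12_000, 17_050));                   // prints 5050
        System.out.println(roughlyMatches(12_000, 17_050, 5_000, 100)); // prints true
    }
}
```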
-
-
class ConversationItemInputAudioTranscriptionSegment:Returned when an input audio transcription segment is identified for an item.
-
String idThe segment identifier.
-
long contentIndexThe index of the input audio content part within the item.
-
double endEnd time of the segment in seconds.
-
String eventIdThe unique ID of the server event.
-
String itemIdThe ID of the item containing the input audio content.
-
String speakerThe detected speaker label for this segment.
-
double startStart time of the segment in seconds.
-
String textThe text for this segment.
-
JsonValue; type "conversation.item.input_audio_transcription.segment"constantThe event type, must be
conversation.item.input_audio_transcription.segment.CONVERSATION_ITEM_INPUT_AUDIO_TRANSCRIPTION_SEGMENT("conversation.item.input_audio_transcription.segment")
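Segment events carry a speaker label, start and end times in seconds, and text, so a client can assemble them into a readable transcript. The sketch below uses a hypothetical `Segment` holder rather than the SDK event class, and the speaker labels in the example are made up.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Hypothetical transcript renderer (not an SDK class) for diarized segments.
final class SegmentTranscriptSketch {
    static final class Segment {
        final String speaker;
        final double start; // seconds
        final double end;   // seconds
        final String text;

        Segment(String speaker, double start, double end, String text) {
            this.speaker = speaker;
            this.start = start;
            this.end = end;
            this.text = text;
        }
    }

    // Sorts by start time and renders one "[start-end] speaker: text" line per segment.
    static String render(List<Segment> segments) {
        List<Segment> sorted = new ArrayList<>(segments);
        sorted.sort(Comparator.comparingDouble(s -> s.start));
        StringBuilder sb = new StringBuilder();
        for (Segment s : sorted) {
            sb.append(String.format(Locale.ROOT, "[%.2f-%.2f] %s: %s\n",
                s.start, s.end, s.speaker, s.text));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(render(List.of(
            new Segment("speaker_1", 1.5, 2.0, "Hi"),
            new Segment("speaker_0", 0.0, 1.25, "Hello"))));
    }
}
```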
-
-
class McpListToolsInProgress:Returned when listing MCP tools is in progress for an item.
-
String eventIdThe unique ID of the server event.
-
String itemIdThe ID of the MCP list tools item.
-
JsonValue; type "mcp_list_tools.in_progress"constantThe event type, must be
mcp_list_tools.in_progress.MCP_LIST_TOOLS_IN_PROGRESS("mcp_list_tools.in_progress")
-
-
class McpListToolsCompleted:Returned when listing MCP tools has completed for an item.
-
String eventIdThe unique ID of the server event.
-
String itemIdThe ID of the MCP list tools item.
-
JsonValue; type "mcp_list_tools.completed"constantThe event type, must be
mcp_list_tools.completed.MCP_LIST_TOOLS_COMPLETED("mcp_list_tools.completed")
-
-
class McpListToolsFailed:Returned when listing MCP tools has failed for an item.
-
String eventIdThe unique ID of the server event.
-
String itemIdThe ID of the MCP list tools item.
-
JsonValue; type "mcp_list_tools.failed"constantThe event type, must be
mcp_list_tools.failed.MCP_LIST_TOOLS_FAILED("mcp_list_tools.failed")
-
-
class ResponseMcpCallArgumentsDelta:Returned when MCP tool call arguments are updated during response generation.
-
String deltaThe JSON-encoded arguments delta.
-
String eventIdThe unique ID of the server event.
-
String itemIdThe ID of the MCP tool call item.
-
long outputIndexThe index of the output item in the response.
-
String responseIdThe ID of the response.
-
JsonValue; type "response.mcp_call_arguments.delta"constantThe event type, must be
response.mcp_call_arguments.delta.RESPONSE_MCP_CALL_ARGUMENTS_DELTA("response.mcp_call_arguments.delta")
-
Optional<String> obfuscationIf present, indicates the delta text was obfuscated.
-
-
class ResponseMcpCallArgumentsDone:Returned when MCP tool call arguments are finalized during response generation.
-
String argumentsThe final JSON-encoded arguments string.
-
String eventIdThe unique ID of the server event.
-
String itemIdThe ID of the MCP tool call item.
-
long outputIndexThe index of the output item in the response.
-
String responseIdThe ID of the response.
-
JsonValue; type "response.mcp_call_arguments.done"constantThe event type, must be
response.mcp_call_arguments.done.RESPONSE_MCP_CALL_ARGUMENTS_DONE("response.mcp_call_arguments.done")
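The delta and done events above suggest a simple accumulation pattern: append each `delta` to a buffer keyed by `itemId`, then take the `arguments` string from the done event as the final value. The class below is a hypothetical sketch of that pattern, not an SDK type, and it ignores the optional `obfuscation` field.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical accumulator (not an SDK class) for streamed MCP call arguments.
final class McpArgumentsAccumulator {
    private final Map<String, StringBuilder> buffers = new HashMap<>();

    // Appends a JSON-encoded arguments delta for the given item.
    void onDelta(String itemId, String delta) {
        buffers.computeIfAbsent(itemId, key -> new StringBuilder()).append(delta);
    }

    // The done event carries the final arguments string; the partial buffer is discarded.
    String onDone(String itemId, String finalArguments) {
        buffers.remove(itemId);
        return finalArguments;
    }

    // Arguments received so far for an in-flight item (empty if none).
    String partial(String itemId) {
        StringBuilder buffer = buffers.get(itemId);
        return buffer == null ? "" : buffer.toString();
    }

    public static void main(String[] args) {
        McpArgumentsAccumulator acc = new McpArgumentsAccumulator();
        acc.onDelta("item_1", "{\"q\":");
        acc.onDelta("item_1", "\"cats\"}");
        System.out.println(acc.partial("item_1")); // prints {"q":"cats"}
    }
}
```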
-
-
class ResponseMcpCallInProgress:Returned when an MCP tool call has started and is in progress.
-
String eventIdThe unique ID of the server event.
-
String itemIdThe ID of the MCP tool call item.
-
long outputIndexThe index of the output item in the response.
-
JsonValue; type "response.mcp_call.in_progress"constantThe event type, must be
response.mcp_call.in_progress.RESPONSE_MCP_CALL_IN_PROGRESS("response.mcp_call.in_progress")
-
-
class ResponseMcpCallCompleted:Returned when an MCP tool call has completed successfully.
-
String eventIdThe unique ID of the server event.
-
String itemIdThe ID of the MCP tool call item.
-
long outputIndexThe index of the output item in the response.
-
JsonValue; type "response.mcp_call.completed"constantThe event type, must be
response.mcp_call.completed.RESPONSE_MCP_CALL_COMPLETED("response.mcp_call.completed")
-
-
class ResponseMcpCallFailed:Returned when an MCP tool call has failed.
-
String eventIdThe unique ID of the server event.
-
String itemIdThe ID of the MCP tool call item.
-
long outputIndexThe index of the output item in the response.
-
JsonValue; type "response.mcp_call.failed"constantThe event type, must be
response.mcp_call.failed.RESPONSE_MCP_CALL_FAILED("response.mcp_call.failed")
-
-
Realtime Session
-
class RealtimeSession:Realtime session object for the beta interface.
-
Optional<String> idUnique identifier for the session that looks like
sess_1234567890abcdef. -
Optional<Long> expiresAtExpiration timestamp for the session, in seconds since epoch.
-
Optional<List<Include>> includeAdditional fields to include in server outputs.
-
item.input_audio_transcription.logprobs: Include logprobs for input audio transcription. -
ITEM_INPUT_AUDIO_TRANSCRIPTION_LOGPROBS("item.input_audio_transcription.logprobs")
-
-
Optional<InputAudioFormat> inputAudioFormatThe format of input audio. Options are
pcm16,g711_ulaw, org711_alaw. Forpcm16, input audio must be 16-bit PCM at a 24kHz sample rate, single channel (mono), and little-endian byte order.-
PCM16("pcm16") -
G711_ULAW("g711_ulaw") -
G711_ALAW("g711_alaw")
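For `pcm16`, the stated layout (16-bit samples, 24kHz, mono, little-endian) translates into the following byte-level sketch. `Pcm16Sketch` is a hypothetical helper; it only packs samples and computes durations, and does not cover how the bytes are then encoded or sent to the API.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper (not an SDK class) for the pcm16 input layout.
final class Pcm16Sketch {
    static final int SAMPLE_RATE_HZ = 24_000;
    static final int BYTES_PER_SAMPLE = 2; // 16-bit, single channel

    // Packs mono samples as little-endian 16-bit PCM.
    static byte[] encode(short[] monoSamples) {
        ByteBuffer buffer = ByteBuffer
            .allocate(monoSamples.length * BYTES_PER_SAMPLE)
            .order(ByteOrder.LITTLE_ENDIAN);
        for (short sample : monoSamples) {
            buffer.putShort(sample);
        }
        return buffer.array();
    }

    // Duration in milliseconds represented by a pcm16 byte count.
    static long durationMs(int byteLength) {
        return (long) byteLength * 1000 / ((long) SAMPLE_RATE_HZ * BYTES_PER_SAMPLE);
    }

    public static void main(String[] args) {
        System.out.println(encode(new short[] {1, -2}).length); // prints 4
        System.out.println(durationMs(48_000));                 // prints 1000
    }
}
```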
-
-
Optional<InputAudioNoiseReduction> inputAudioNoiseReductionConfiguration for input audio noise reduction. This can be set to
nullto turn off. Noise reduction filters audio added to the input audio buffer before it is sent to VAD and the model. Filtering the audio can improve VAD and turn detection accuracy (reducing false positives) and model performance by improving perception of the input audio.-
Optional<NoiseReductionType> typeType of noise reduction.
near_fieldis for close-talking microphones such as headphones,far_fieldis for far-field microphones such as laptop or conference room microphones.-
NEAR_FIELD("near_field") -
FAR_FIELD("far_field")
-
-
-
Optional<AudioTranscription> inputAudioTranscriptionConfiguration for input audio transcription, defaults to off and can be set to
null to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through the /audio/transcriptions endpoint and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service.-
Optional<Delay> delayControls how long the model waits before emitting transcription text. Higher values can improve transcription accuracy at the cost of latency. Only supported with
gpt-realtime-whisperin GA Realtime sessions.-
MINIMAL("minimal") -
LOW("low") -
MEDIUM("medium") -
HIGH("high") -
XHIGH("xhigh")
-
-
Optional<String> languageThe language of the input audio. Supplying the input language in ISO-639-1 (e.g.
en) format will improve accuracy and latency. -
Optional<Model> modelThe model to use for transcription. Current options are
whisper-1,gpt-4o-mini-transcribe,gpt-4o-mini-transcribe-2025-12-15,gpt-4o-transcribe,gpt-4o-transcribe-diarize, andgpt-realtime-whisper. Usegpt-4o-transcribe-diarizewhen you need diarization with speaker labels.-
WHISPER_1("whisper-1") -
GPT_4O_MINI_TRANSCRIBE("gpt-4o-mini-transcribe") -
GPT_4O_MINI_TRANSCRIBE_2025_12_15("gpt-4o-mini-transcribe-2025-12-15") -
GPT_4O_TRANSCRIBE("gpt-4o-transcribe") -
GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize") -
GPT_REALTIME_WHISPER("gpt-realtime-whisper")
-
-
Optional<String> promptAn optional text to guide the model's style or continue a previous audio segment. For
whisper-1, the prompt is a list of keywords. Forgpt-4o-transcribemodels (excludinggpt-4o-transcribe-diarize), the prompt is a free text string, for example "expect words related to technology". Prompt is not supported withgpt-realtime-whisperin GA Realtime sessions.
-
-
Optional<String> instructionsThe default system instructions (i.e. system message) prepended to model calls. This field allows the client to guide the model on desired responses. The model can be instructed on response content and format, (e.g. "be extremely succinct", "act friendly", "here are examples of good responses") and on audio behavior (e.g. "talk quickly", "inject emotion into your voice", "laugh frequently"). The instructions are not guaranteed to be followed by the model, but they provide guidance to the model on the desired behavior.
Note that the server sets default instructions which will be used if this field is not set and are visible in the
session.createdevent at the start of the session. -
Optional<MaxResponseOutputTokens> maxResponseOutputTokensMaximum number of output tokens for a single assistant response, inclusive of tool calls. Provide an integer between 1 and 4096 to limit output tokens, or
inffor the maximum available tokens for a given model. Defaults toinf.-
long -
JsonValue;INF("inf")
-
-
Optional<List<Modality>> modalitiesThe set of modalities the model can respond with. To disable audio, set this to ["text"].
-
TEXT("text") -
AUDIO("audio")
-
-
Optional<Model> modelThe Realtime model used for this session.
-
GPT_REALTIME("gpt-realtime") -
GPT_REALTIME_1_5("gpt-realtime-1.5") -
GPT_REALTIME_2025_08_28("gpt-realtime-2025-08-28") -
GPT_4O_REALTIME_PREVIEW("gpt-4o-realtime-preview") -
GPT_4O_REALTIME_PREVIEW_2024_10_01("gpt-4o-realtime-preview-2024-10-01") -
GPT_4O_REALTIME_PREVIEW_2024_12_17("gpt-4o-realtime-preview-2024-12-17") -
GPT_4O_REALTIME_PREVIEW_2025_06_03("gpt-4o-realtime-preview-2025-06-03") -
GPT_4O_MINI_REALTIME_PREVIEW("gpt-4o-mini-realtime-preview") -
GPT_4O_MINI_REALTIME_PREVIEW_2024_12_17("gpt-4o-mini-realtime-preview-2024-12-17") -
GPT_REALTIME_MINI("gpt-realtime-mini") -
GPT_REALTIME_MINI_2025_10_06("gpt-realtime-mini-2025-10-06") -
GPT_REALTIME_MINI_2025_12_15("gpt-realtime-mini-2025-12-15") -
GPT_AUDIO_1_5("gpt-audio-1.5") -
GPT_AUDIO_MINI("gpt-audio-mini") -
GPT_AUDIO_MINI_2025_10_06("gpt-audio-mini-2025-10-06") -
GPT_AUDIO_MINI_2025_12_15("gpt-audio-mini-2025-12-15")
-
-
Optional<Object> object_The object type. Always
realtime.session.REALTIME_SESSION("realtime.session")
-
Optional<OutputAudioFormat> outputAudioFormatThe format of output audio. Options are
pcm16,g711_ulaw, org711_alaw. Forpcm16, output audio is sampled at a rate of 24kHz.-
PCM16("pcm16") -
G711_ULAW("g711_ulaw") -
G711_ALAW("g711_alaw")
-
-
Optional<ResponsePrompt> promptReference to a prompt template and its variables. Learn more.
-
String idThe unique identifier of the prompt template to use.
-
Optional<Variables> variablesOptional map of values to substitute in for variables in your prompt. The substitution values can either be strings, or other Response input types like images or files.
-
String -
class ResponseInputText:A text input to the model.
-
String textThe text input to the model.
-
JsonValue; type "input_text" constant The type of the input item. Always
input_text. INPUT_TEXT("input_text")
-
-
class ResponseInputImage:An image input to the model. Learn about image inputs.
-
Detail detail The detail level of the image to be sent to the model. One of
high, low, auto, or original. Defaults to auto. -
LOW("low") -
HIGH("high") -
AUTO("auto") -
ORIGINAL("original")
-
-
JsonValue; type "input_image" constant The type of the input item. Always
input_image. INPUT_IMAGE("input_image")
-
Optional<String> fileIdThe ID of the file to be sent to the model.
-
Optional<String> imageUrlThe URL of the image to be sent to the model. A fully qualified URL or base64 encoded image in a data URL.
-
-
class ResponseInputFile:A file input to the model.
-
JsonValue; type "input_file" constant The type of the input item. Always
input_file. INPUT_FILE("input_file")
-
Optional<Detail> detail The detail level of the file to be sent to the model. Use
low for the default rendering behavior, or high to render the file at higher quality. Defaults to low. -
LOW("low") -
HIGH("high")
-
-
Optional<String> fileDataThe content of the file to be sent to the model.
-
Optional<String> fileIdThe ID of the file to be sent to the model.
-
Optional<String> fileUrlThe URL of the file to be sent to the model.
-
Optional<String> filenameThe name of the file to be sent to the model.
-
-
-
Optional<String> versionOptional version of the prompt template.
-
-
Optional<Double> speedThe speed of the model's spoken response. 1.0 is the default speed. 0.25 is the minimum speed. 1.5 is the maximum speed. This value can only be changed in between model turns, not while a response is in progress.
-
Optional<Double> temperatureSampling temperature for the model, limited to [0.6, 1.2]. For audio models a temperature of 0.8 is highly recommended for best performance.
-
Optional<String> toolChoice How the model chooses tools. Options are
auto, none, required, or specify a function. -
Optional<List<RealtimeFunctionTool>> toolsTools (functions) available to the model.
-
Optional<String> descriptionThe description of the function, including guidance on when and how to call it, and guidance about what to tell the user when calling (if anything).
-
Optional<String> nameThe name of the function.
-
Optional<JsonValue> parametersParameters of the function in JSON Schema.
-
Optional<Type> type The type of the tool, i.e.
function. FUNCTION("function")
-
-
Optional<Tracing> tracing Configuration options for tracing. Set to null to disable tracing. Once tracing is enabled for a session, the configuration cannot be modified.
auto will create a trace for the session with default values for the workflow name, group id, and metadata. -
JsonValue;AUTO("auto")
-
class TracingConfiguration:Granular configuration for tracing.
-
Optional<String> groupIdThe group id to attach to this trace to enable filtering and grouping in the traces dashboard.
-
Optional<JsonValue> metadataThe arbitrary metadata to attach to this trace to enable filtering in the traces dashboard.
-
Optional<String> workflowNameThe name of the workflow to attach to this trace. This is used to name the trace in the traces dashboard.
-
-
-
Optional<TurnDetection> turnDetection Configuration for turn detection, either Server VAD or Semantic VAD. This can be set to
null to turn off, in which case the client must manually trigger a model response. Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.
Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.
For
gpt-realtime-whisper transcription sessions, turn detection must be set to null; VAD is not supported. -
class ServerVad:Server-side voice activity detection (VAD) which flips on when user speech is detected and off after a period of silence.
-
JsonValue; type "server_vad" constant Type of turn detection,
server_vad to turn on simple Server VAD. SERVER_VAD("server_vad")
-
Optional<Boolean> createResponse Whether or not to automatically generate a response when a VAD stop event occurs. If
interrupt_response is set to false, this may fail to create a response if the model is already responding. If both
create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted. -
Optional<Long> idleTimeoutMs Optional timeout after which a model response will be triggered automatically. This is useful for situations in which a long pause from the user is unexpected, such as a phone call. The model will effectively prompt the user to continue the conversation based on the current context.
The timeout value will be applied after the last model response's audio has finished playing, i.e. it's set to the
response.done time plus audio playback duration. An
input_audio_buffer.timeout_triggered event (plus events associated with the Response) will be emitted when the timeout is reached. Idle timeout is currently only supported for server_vad mode. -
Optional<Boolean> interruptResponse Whether or not to automatically interrupt (cancel) any ongoing response with output to the default conversation (i.e.
conversation of auto) when a VAD start event occurs. If true then the response will be cancelled, otherwise it will continue until complete. If both
create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted. -
Optional<Long> prefixPaddingMs Used only for
server_vad mode. Amount of audio to include before the VAD detected speech (in milliseconds). Defaults to 300ms. -
Optional<Long> silenceDurationMs Used only for
server_vad mode. Duration of silence to detect speech stop (in milliseconds). Defaults to 500ms. With shorter values the model will respond more quickly, but may jump in on short pauses from the user. -
Optional<Double> threshold Used only for
server_vad mode. Activation threshold for VAD (0.0 to 1.0); this defaults to 0.5. A higher threshold will require louder audio to activate the model, and thus might perform better in noisy environments.
-
-
class SemanticVad:Server-side semantic turn detection which uses a model to determine when the user has finished speaking.
-
JsonValue; type "semantic_vad" constant Type of turn detection,
semantic_vad to turn on Semantic VAD. SEMANTIC_VAD("semantic_vad")
-
Optional<Boolean> createResponseWhether or not to automatically generate a response when a VAD stop event occurs.
-
Optional<Eagerness> eagerness Used only for
semantic_vad mode. The eagerness of the model to respond. low will wait longer for the user to continue speaking, high will respond more quickly. auto is the default and is equivalent to medium. low, medium, and high have max timeouts of 8s, 4s, and 2s respectively. -
LOW("low") -
MEDIUM("medium") -
HIGH("high") -
AUTO("auto")
-
-
Optional<Boolean> interruptResponse Whether or not to automatically interrupt any ongoing response with output to the default conversation (i.e.
conversation of auto) when a VAD start event occurs.
-
-
-
Optional<Voice> voice The voice the model uses to respond. Voice cannot be changed during the session once the model has responded with audio at least once. Current voice options are
alloy, ash, ballad, coral, echo, sage, shimmer, and verse. -
ALLOY("alloy") -
ASH("ash") -
BALLAD("ballad") -
CORAL("coral") -
ECHO("echo") -
SAGE("sage") -
SHIMMER("shimmer") -
VERSE("verse") -
MARIN("marin") -
CEDAR("cedar")
-
-
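The numeric limits quoted for this session object (speed in [0.25, 1.5], temperature in [0.6, 1.2], and the Server VAD threshold in [0.0, 1.0]) can be checked on the client before a session update is sent. The following is a minimal sketch in plain Java; the class name `SessionRangeCheck` and its `check` method are hypothetical helpers written for illustration and are not part of the SDK.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not an SDK class: validates the ranges documented above.
class SessionRangeCheck {

    // Returns one message per out-of-range value; an empty list means all values are in range.
    static List<String> check(double speed, double temperature, double vadThreshold) {
        List<String> problems = new ArrayList<>();
        if (speed < 0.25 || speed > 1.5) {
            problems.add("speed must be within [0.25, 1.5]");
        }
        if (temperature < 0.6 || temperature > 1.2) {
            problems.add("temperature must be within [0.6, 1.2]");
        }
        if (vadThreshold < 0.0 || vadThreshold > 1.0) {
            problems.add("threshold must be within [0.0, 1.0]");
        }
        return problems;
    }
}
```

Calling `SessionRangeCheck.check(1.0, 0.8, 0.5)` with the documented defaults and recommended temperature yields an empty list, while a speed of `2.0` is reported as a single problem. Whether the server rejects or clamps out-of-range values is not stated above, so a client-side check like this is only a convenience.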
Realtime Session Create Request
-
class RealtimeSessionCreateRequest:Realtime session object configuration.
-
JsonValue; type "realtime" constant The type of session to create. Always
realtime for the Realtime API. REALTIME("realtime")
-
Optional<RealtimeAudioConfig> audioConfiguration for input and output audio.
-
Optional<RealtimeAudioConfigInput> input-
Optional<RealtimeAudioFormats> formatThe format of the input audio.
-
AudioPcm-
Optional<Rate> rate The sample rate of the audio. Always
24000. _24000(24000)
-
Optional<Type> type The audio format. Always
audio/pcm. AUDIO_PCM("audio/pcm")
-
-
AudioPcmu-
Optional<Type> type The audio format. Always
audio/pcmu. AUDIO_PCMU("audio/pcmu")
-
-
AudioPcma-
Optional<Type> type The audio format. Always
audio/pcma. AUDIO_PCMA("audio/pcma")
-
-
-
Optional<NoiseReduction> noiseReduction Configuration for input audio noise reduction. This can be set to
null to turn off. Noise reduction filters audio added to the input audio buffer before it is sent to VAD and the model. Filtering the audio can improve VAD and turn detection accuracy (reducing false positives) and model performance by improving perception of the input audio. -
Optional<NoiseReductionType> type Type of noise reduction.
near_field is for close-talking microphones such as headphones, far_field is for far-field microphones such as laptop or conference room microphones. -
NEAR_FIELD("near_field") -
FAR_FIELD("far_field")
-
-
-
Optional<AudioTranscription> transcription Configuration for input audio transcription, defaults to off and can be set to
null to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through the /audio/transcriptions endpoint and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service. -
Optional<Delay> delay Controls how long the model waits before emitting transcription text. Higher values can improve transcription accuracy at the cost of latency. Only supported with
gpt-realtime-whisper in GA Realtime sessions. -
MINIMAL("minimal") -
LOW("low") -
MEDIUM("medium") -
HIGH("high") -
XHIGH("xhigh")
-
-
Optional<String> languageThe language of the input audio. Supplying the input language in ISO-639-1 (e.g.
en) format will improve accuracy and latency. -
Optional<Model> model The model to use for transcription. Current options are
whisper-1, gpt-4o-mini-transcribe, gpt-4o-mini-transcribe-2025-12-15, gpt-4o-transcribe, gpt-4o-transcribe-diarize, and gpt-realtime-whisper. Use gpt-4o-transcribe-diarize when you need diarization with speaker labels. -
WHISPER_1("whisper-1") -
GPT_4O_MINI_TRANSCRIBE("gpt-4o-mini-transcribe") -
GPT_4O_MINI_TRANSCRIBE_2025_12_15("gpt-4o-mini-transcribe-2025-12-15") -
GPT_4O_TRANSCRIBE("gpt-4o-transcribe") -
GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize") -
GPT_REALTIME_WHISPER("gpt-realtime-whisper")
-
-
Optional<String> prompt An optional text to guide the model's style or continue a previous audio segment. For
whisper-1, the prompt is a list of keywords. For gpt-4o-transcribe models (excluding gpt-4o-transcribe-diarize), the prompt is a free text string, for example "expect words related to technology". Prompt is not supported with gpt-realtime-whisper in GA Realtime sessions.
-
-
Optional<RealtimeAudioInputTurnDetection> turnDetection Configuration for turn detection, either Server VAD or Semantic VAD. This can be set to
null to turn off, in which case the client must manually trigger a model response. Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.
Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.
For
gpt-realtime-whisper transcription sessions, turn detection must be set to null; VAD is not supported. -
ServerVad-
JsonValue; type "server_vad" constant Type of turn detection,
server_vad to turn on simple Server VAD. SERVER_VAD("server_vad")
-
Optional<Boolean> createResponse Whether or not to automatically generate a response when a VAD stop event occurs. If
interrupt_response is set to false, this may fail to create a response if the model is already responding. If both
create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted. -
Optional<Long> idleTimeoutMs Optional timeout after which a model response will be triggered automatically. This is useful for situations in which a long pause from the user is unexpected, such as a phone call. The model will effectively prompt the user to continue the conversation based on the current context.
The timeout value will be applied after the last model response's audio has finished playing, i.e. it's set to the
response.done time plus audio playback duration. An
input_audio_buffer.timeout_triggered event (plus events associated with the Response) will be emitted when the timeout is reached. Idle timeout is currently only supported for server_vad mode. -
Optional<Boolean> interruptResponse Whether or not to automatically interrupt (cancel) any ongoing response with output to the default conversation (i.e.
conversation of auto) when a VAD start event occurs. If true then the response will be cancelled, otherwise it will continue until complete. If both
create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted. -
Optional<Long> prefixPaddingMs Used only for
server_vad mode. Amount of audio to include before the VAD detected speech (in milliseconds). Defaults to 300ms. -
Optional<Long> silenceDurationMs Used only for
server_vad mode. Duration of silence to detect speech stop (in milliseconds). Defaults to 500ms. With shorter values the model will respond more quickly, but may jump in on short pauses from the user. -
Optional<Double> threshold Used only for
server_vad mode. Activation threshold for VAD (0.0 to 1.0); this defaults to 0.5. A higher threshold will require louder audio to activate the model, and thus might perform better in noisy environments.
-
-
SemanticVad-
JsonValue; type "semantic_vad" constant Type of turn detection,
semantic_vad to turn on Semantic VAD. SEMANTIC_VAD("semantic_vad")
-
Optional<Boolean> createResponseWhether or not to automatically generate a response when a VAD stop event occurs.
-
Optional<Eagerness> eagerness Used only for
semantic_vad mode. The eagerness of the model to respond. low will wait longer for the user to continue speaking, high will respond more quickly. auto is the default and is equivalent to medium. low, medium, and high have max timeouts of 8s, 4s, and 2s respectively. -
LOW("low") -
MEDIUM("medium") -
HIGH("high") -
AUTO("auto")
-
-
Optional<Boolean> interruptResponse Whether or not to automatically interrupt any ongoing response with output to the default conversation (i.e.
conversation of auto) when a VAD start event occurs.
-
-
-
-
Optional<RealtimeAudioConfigOutput> output-
Optional<RealtimeAudioFormats> formatThe format of the output audio.
-
Optional<Double> speed The speed of the model's spoken response as a multiple of the original speed. 1.0 is the default speed. 0.25 is the minimum speed. 1.5 is the maximum speed. This value can only be changed in between model turns, not while a response is in progress.
This parameter is a post-processing adjustment to the audio after it is generated; it is also possible to prompt the model to speak faster or slower.
-
Optional<Voice> voice The voice the model uses to respond. Supported built-in voices are
alloy, ash, ballad, coral, echo, sage, shimmer, verse, marin, and cedar. You may also provide a custom voice object with an id, for example { "id": "voice_1234" }. Voice cannot be changed during the session once the model has responded with audio at least once. We recommend marin and cedar for best quality. -
String -
enum UnionMember1:-
ALLOY("alloy") -
ASH("ash") -
BALLAD("ballad") -
CORAL("coral") -
ECHO("echo") -
SAGE("sage") -
SHIMMER("shimmer") -
VERSE("verse") -
MARIN("marin") -
CEDAR("cedar")
-
-
class Id:Custom voice reference.
-
String idThe custom voice ID, e.g.
voice_1234.
-
-
-
-
-
Optional<List<Include>> include Additional fields to include in server outputs.
item.input_audio_transcription.logprobs: Include logprobs for input audio transcription. ITEM_INPUT_AUDIO_TRANSCRIPTION_LOGPROBS("item.input_audio_transcription.logprobs")
-
Optional<String> instructionsThe default system instructions (i.e. system message) prepended to model calls. This field allows the client to guide the model on desired responses. The model can be instructed on response content and format, (e.g. "be extremely succinct", "act friendly", "here are examples of good responses") and on audio behavior (e.g. "talk quickly", "inject emotion into your voice", "laugh frequently"). The instructions are not guaranteed to be followed by the model, but they provide guidance to the model on the desired behavior.
Note that the server sets default instructions, which are used if this field is not set and are visible in the
session.created event at the start of the session. -
Optional<MaxOutputTokens> maxOutputTokens Maximum number of output tokens for a single assistant response, inclusive of tool calls. Provide an integer between 1 and 4096 to limit output tokens, or
inf for the maximum available tokens for a given model. Defaults to inf. -
long -
JsonValue;INF("inf")
-
-
Optional<Model> modelThe Realtime model used for this session.
-
GPT_REALTIME("gpt-realtime") -
GPT_REALTIME_1_5("gpt-realtime-1.5") -
GPT_REALTIME_2("gpt-realtime-2") -
GPT_REALTIME_2025_08_28("gpt-realtime-2025-08-28") -
GPT_4O_REALTIME_PREVIEW("gpt-4o-realtime-preview") -
GPT_4O_REALTIME_PREVIEW_2024_10_01("gpt-4o-realtime-preview-2024-10-01") -
GPT_4O_REALTIME_PREVIEW_2024_12_17("gpt-4o-realtime-preview-2024-12-17") -
GPT_4O_REALTIME_PREVIEW_2025_06_03("gpt-4o-realtime-preview-2025-06-03") -
GPT_4O_MINI_REALTIME_PREVIEW("gpt-4o-mini-realtime-preview") -
GPT_4O_MINI_REALTIME_PREVIEW_2024_12_17("gpt-4o-mini-realtime-preview-2024-12-17") -
GPT_REALTIME_MINI("gpt-realtime-mini") -
GPT_REALTIME_MINI_2025_10_06("gpt-realtime-mini-2025-10-06") -
GPT_REALTIME_MINI_2025_12_15("gpt-realtime-mini-2025-12-15") -
GPT_AUDIO_1_5("gpt-audio-1.5") -
GPT_AUDIO_MINI("gpt-audio-mini") -
GPT_AUDIO_MINI_2025_10_06("gpt-audio-mini-2025-10-06") -
GPT_AUDIO_MINI_2025_12_15("gpt-audio-mini-2025-12-15")
-
-
Optional<List<OutputModality>> outputModalities The set of modalities the model can respond with. It defaults to
["audio"], indicating that the model will respond with audio plus a transcript. ["text"] can be used to make the model respond with text only. It is not possible to request both text and audio at the same time. -
TEXT("text") -
AUDIO("audio")
-
-
Optional<Boolean> parallelToolCallsWhether the model may call multiple tools in parallel. Only supported by reasoning Realtime models such as
gpt-realtime-2. -
Optional<ResponsePrompt> promptReference to a prompt template and its variables. Learn more.
-
String idThe unique identifier of the prompt template to use.
-
Optional<Variables> variablesOptional map of values to substitute in for variables in your prompt. The substitution values can either be strings, or other Response input types like images or files.
-
String -
class ResponseInputText:A text input to the model.
-
String textThe text input to the model.
-
JsonValue; type "input_text" constant The type of the input item. Always
input_text. INPUT_TEXT("input_text")
-
-
class ResponseInputImage:An image input to the model. Learn about image inputs.
-
Detail detail The detail level of the image to be sent to the model. One of
high, low, auto, or original. Defaults to auto. -
LOW("low") -
HIGH("high") -
AUTO("auto") -
ORIGINAL("original")
-
-
JsonValue; type "input_image" constant The type of the input item. Always
input_image. INPUT_IMAGE("input_image")
-
Optional<String> fileIdThe ID of the file to be sent to the model.
-
Optional<String> imageUrlThe URL of the image to be sent to the model. A fully qualified URL or base64 encoded image in a data URL.
-
-
class ResponseInputFile:A file input to the model.
-
JsonValue; type "input_file" constant The type of the input item. Always
input_file. INPUT_FILE("input_file")
-
Optional<Detail> detail The detail level of the file to be sent to the model. Use
low for the default rendering behavior, or high to render the file at higher quality. Defaults to low. -
LOW("low") -
HIGH("high")
-
-
Optional<String> fileDataThe content of the file to be sent to the model.
-
Optional<String> fileIdThe ID of the file to be sent to the model.
-
Optional<String> fileUrlThe URL of the file to be sent to the model.
-
Optional<String> filenameThe name of the file to be sent to the model.
-
-
-
Optional<String> versionOptional version of the prompt template.
-
-
Optional<RealtimeReasoning> reasoningConfiguration for reasoning-capable Realtime models such as
gpt-realtime-2.-
Optional<RealtimeReasoningEffort> effortConstrains effort on reasoning for reasoning-capable Realtime models such as
gpt-realtime-2.-
MINIMAL("minimal") -
LOW("low") -
MEDIUM("medium") -
HIGH("high") -
XHIGH("xhigh")
-
-
-
Optional<RealtimeToolChoiceConfig> toolChoiceHow the model chooses tools. Provide one of the string modes or force a specific function/MCP tool.
-
enum ToolChoiceOptions: Controls which (if any) tool is called by the model.
none means the model will not call any tool and instead generates a message. auto means the model can pick between generating a message or calling one or more tools. required means the model must call one or more tools. -
NONE("none") -
AUTO("auto") -
REQUIRED("required")
-
-
class ToolChoiceFunction:Use this option to force the model to call a specific function.
-
String nameThe name of the function to call.
-
JsonValue; type "function" constant For function calling, the type is always
function. FUNCTION("function")
-
-
class ToolChoiceMcp:Use this option to force the model to call a specific tool on a remote MCP server.
-
String serverLabelThe label of the MCP server to use.
-
JsonValue; type "mcp" constant For MCP tools, the type is always
mcp. MCP("mcp")
-
Optional<String> nameThe name of the tool to call on the server.
-
-
-
Optional<List<RealtimeToolsConfigUnion>> toolsTools available to the model.
-
class RealtimeFunctionTool:-
Optional<String> descriptionThe description of the function, including guidance on when and how to call it, and guidance about what to tell the user when calling (if anything).
-
Optional<String> nameThe name of the function.
-
Optional<JsonValue> parametersParameters of the function in JSON Schema.
-
Optional<Type> type The type of the tool, i.e.
function. FUNCTION("function")
-
-
Mcp-
String serverLabelA label for this MCP server, used to identify it in tool calls.
-
JsonValue; type "mcp" constant The type of the MCP tool. Always
mcp. MCP("mcp")
-
Optional<AllowedTools> allowedToolsList of allowed tool names or a filter object.
-
List<String> -
class McpToolFilter:A filter object to specify which tools are allowed.
-
Optional<Boolean> readOnlyIndicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with
readOnlyHint, it will match this filter. -
Optional<List<String>> toolNamesList of allowed tool names.
-
-
-
Optional<String> authorizationAn OAuth access token that can be used with a remote MCP server, either with a custom MCP server URL or a service connector. Your application must handle the OAuth authorization flow and provide the token here.
-
Optional<ConnectorId> connectorId Identifier for service connectors, like those available in ChatGPT. One of
server_url or connector_id must be provided. Learn more about service connectors here. Currently supported
connector_id values are: -
Dropbox:
connector_dropbox -
Gmail:
connector_gmail -
Google Calendar:
connector_googlecalendar -
Google Drive:
connector_googledrive -
Microsoft Teams:
connector_microsoftteams -
Outlook Calendar:
connector_outlookcalendar -
Outlook Email:
connector_outlookemail -
SharePoint:
connector_sharepoint -
CONNECTOR_DROPBOX("connector_dropbox") -
CONNECTOR_GMAIL("connector_gmail") -
CONNECTOR_GOOGLECALENDAR("connector_googlecalendar") -
CONNECTOR_GOOGLEDRIVE("connector_googledrive") -
CONNECTOR_MICROSOFTTEAMS("connector_microsoftteams") -
CONNECTOR_OUTLOOKCALENDAR("connector_outlookcalendar") -
CONNECTOR_OUTLOOKEMAIL("connector_outlookemail") -
CONNECTOR_SHAREPOINT("connector_sharepoint")
-
-
Optional<Boolean> deferLoadingWhether this MCP tool is deferred and discovered via tool search.
-
Optional<Headers> headersOptional HTTP headers to send to the MCP server. Use for authentication or other purposes.
-
Optional<RequireApproval> requireApprovalSpecify which of the MCP server's tools require approval.
-
class McpToolApprovalFilter: Specify which of the MCP server's tools require approval. Can be
always, never, or a filter object associated with tools that require approval. -
Optional<Always> alwaysA filter object to specify which tools are allowed.
-
Optional<Boolean> readOnlyIndicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with
readOnlyHint, it will match this filter. -
Optional<List<String>> toolNamesList of allowed tool names.
-
-
Optional<Never> neverA filter object to specify which tools are allowed.
-
Optional<Boolean> readOnlyIndicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with
readOnlyHint, it will match this filter. -
Optional<List<String>> toolNamesList of allowed tool names.
-
-
-
enum McpToolApprovalSetting: Specify a single approval policy for all tools. One of
always or never. When set to always, all tools will require approval. When set to never, no tools will require approval. -
ALWAYS("always") -
NEVER("never")
-
-
-
Optional<String> serverDescriptionOptional description of the MCP server, used to provide more context.
-
Optional<String> serverUrl The URL for the MCP server. One of
server_url or connector_id must be provided.
-
-
-
Optional<RealtimeTracingConfig> tracing Realtime API can write session traces to the Traces Dashboard. Set to null to disable tracing. Once tracing is enabled for a session, the configuration cannot be modified.
auto will create a trace for the session with default values for the workflow name, group id, and metadata. -
JsonValue;AUTO("auto")
-
TracingConfiguration-
Optional<String> groupIdThe group id to attach to this trace to enable filtering and grouping in the Traces Dashboard.
-
Optional<JsonValue> metadataThe arbitrary metadata to attach to this trace to enable filtering in the Traces Dashboard.
-
Optional<String> workflowNameThe name of the workflow to attach to this trace. This is used to name the trace in the Traces Dashboard.
-
-
-
Optional<RealtimeTruncation> truncation When the number of tokens in a conversation exceeds the model's input token limit, the conversation will be truncated, meaning messages (starting from the oldest) will not be included in the model's context. A 32k context model with 4,096 max output tokens can only include 28,224 tokens in the context before truncation occurs.
Clients can configure truncation behavior to truncate with a lower max token limit, which is an effective way to control token usage and cost.
Truncation will reduce the number of cached tokens on the next turn (busting the cache), since messages are dropped from the beginning of the context. However, clients can also configure truncation to retain messages up to a fraction of the maximum context size, which will reduce the need for future truncations and thus improve the cache rate.
Truncation can be disabled entirely, which means the server will never truncate but would instead return an error if the conversation exceeds the model's input token limit.
-
RealtimeTruncationStrategy-
AUTO("auto") -
DISABLED("disabled")
-
-
class RealtimeTruncationRetentionRatio:Retain a fraction of the conversation tokens when the conversation exceeds the input token limit. This allows you to amortize truncations across multiple turns, which can help improve cached token usage.
-
double retentionRatio Fraction of post-instruction conversation tokens to retain (
0.0-1.0) when the conversation exceeds the input token limit. Setting this to 0.8 means that messages will be dropped until 80% of the maximum allowed tokens are used. This helps reduce the frequency of truncations and improve cache rates. -
JsonValue; type "retention_ratio" constant Use retention ratio truncation.
RETENTION_RATIO("retention_ratio")
-
Optional<TokenLimits> tokenLimitsOptional custom token limits for this truncation strategy. If not provided, the model's default token limits will be used.
-
Optional<Long> postInstructions Maximum tokens allowed in the conversation after instructions (which include tool definitions). For example, setting this to 5,000 would mean that truncation would occur when the conversation exceeds 5,000 tokens after instructions. This cannot be higher than the model's context window size minus the maximum output tokens.
-
-
-
-
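The retention ratio behavior described above can be illustrated with a small client-side simulation. This is a sketch under stated assumptions, not the server's implementation: it assumes whole messages are dropped oldest-first, that truncation starts only once the total exceeds the limit, and that dropping stops as soon as the total is at or below `retentionRatio * limit`. The class `RetentionRatioSketch` and its `truncate` method are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical simulation of retention-ratio truncation; not an SDK class.
class RetentionRatioSketch {

    // messageTokens: token count per message, oldest first.
    // limit: maximum post-instruction tokens before truncation is triggered.
    // retentionRatio: fraction of the limit to keep once truncation is triggered.
    static List<Integer> truncate(List<Integer> messageTokens, long limit, double retentionRatio) {
        Deque<Integer> kept = new ArrayDeque<>(messageTokens);
        long total = 0;
        for (int tokens : messageTokens) {
            total += tokens;
        }
        // No truncation while the conversation still fits within the limit.
        if (total <= limit) {
            return new ArrayList<>(kept);
        }
        // Drop the oldest messages until the total is at or below the retained budget.
        long target = (long) Math.floor(limit * retentionRatio);
        while (!kept.isEmpty() && total > target) {
            total -= kept.removeFirst();
        }
        return new ArrayList<>(kept);
    }
}
```

With a limit of 5,000 tokens and a ratio of 0.8, a conversation of 2,000, 2,000, and 1,500 tokens exceeds the limit, so the oldest message is dropped and 3,500 tokens remain, which is below the 4,000-token target. Dropping below the limit in one step, rather than just back to it, is what spaces out later truncations.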
Realtime Tool Choice Config
-
class RealtimeToolChoiceConfig: (union) A class that can be one of several variants. How the model chooses tools. Provide one of the string modes or force a specific function/MCP tool.
-
enum ToolChoiceOptions: Controls which (if any) tool is called by the model.
none means the model will not call any tool and instead generates a message. auto means the model can pick between generating a message or calling one or more tools. required means the model must call one or more tools. -
NONE("none") -
AUTO("auto") -
REQUIRED("required")
-
-
class ToolChoiceFunction:Use this option to force the model to call a specific function.
-
String nameThe name of the function to call.
-
JsonValue; type "function" constant For function calling, the type is always
function. FUNCTION("function")
-
-
class ToolChoiceMcp:Use this option to force the model to call a specific tool on a remote MCP server.
-
String serverLabelThe label of the MCP server to use.
-
JsonValue; type "mcp" constant For MCP tools, the type is always
mcp. MCP("mcp")
-
Optional<String> nameThe name of the tool to call on the server.
-
-
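The three tool choice variants above (a string mode, a forced function, or a forced MCP tool) can be pictured as the values a client would serialize. The sketch below uses plain Java maps rather than the SDK's own types; the class `ToolChoiceSketch` is hypothetical, and the wire key `server_label` is an assumption made by analogy with the snake_case names `server_url` and `connector_id` that appear in this reference.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of the tool choice union; not an SDK class.
class ToolChoiceSketch {

    // String mode: one of "none", "auto", or "required".
    static String mode(String mode) {
        if (!mode.equals("none") && !mode.equals("auto") && !mode.equals("required")) {
            throw new IllegalArgumentException("unknown tool choice mode: " + mode);
        }
        return mode;
    }

    // Forces a specific function; the type is always "function".
    static Map<String, Object> function(String name) {
        Map<String, Object> choice = new LinkedHashMap<>();
        choice.put("type", "function");
        choice.put("name", name);
        return choice;
    }

    // Forces a tool on a remote MCP server; the tool name is optional.
    static Map<String, Object> mcp(String serverLabel, String toolName) {
        Map<String, Object> choice = new LinkedHashMap<>();
        choice.put("type", "mcp");
        choice.put("server_label", serverLabel); // assumed wire name
        if (toolName != null) {
            choice.put("name", toolName);
        }
        return choice;
    }
}
```

For example, `ToolChoiceSketch.function("get_weather")` produces a map with `type` set to `function`, and `ToolChoiceSketch.mcp("docs", null)` omits the optional `name` key. The function and server names here are placeholders.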
Realtime Tools Config Union
-
class RealtimeToolsConfigUnion: (union) A class that can be one of several variants. Give the model access to additional tools via remote Model Context Protocol (MCP) servers. Learn more about MCP.
-
class RealtimeFunctionTool:-
Optional<String> descriptionThe description of the function, including guidance on when and how to call it, and guidance about what to tell the user when calling (if anything).
-
Optional<String> nameThe name of the function.
-
Optional<JsonValue> parametersParameters of the function in JSON Schema.
-
Optional<Type> type The type of the tool, i.e.
function. FUNCTION("function")
-
-
Mcp-
String serverLabelA label for this MCP server, used to identify it in tool calls.
-
JsonValue; type "mcp" constant The type of the MCP tool. Always
mcp. MCP("mcp")
-
Optional<AllowedTools> allowedToolsList of allowed tool names or a filter object.
-
List<String> -
class McpToolFilter:A filter object to specify which tools are allowed.
-
Optional<Boolean> readOnlyIndicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with readOnlyHint, it will match this filter. -
Optional<List<String>> toolNamesList of allowed tool names.
-
-
-
Optional<String> authorizationAn OAuth access token that can be used with a remote MCP server, either with a custom MCP server URL or a service connector. Your application must handle the OAuth authorization flow and provide the token here.
-
Optional<ConnectorId> connectorIdIdentifier for service connectors, like those available in ChatGPT. One of server_url or connector_id must be provided. Learn more about service connectors here.
Currently supported connector_id values are:-
Dropbox: connector_dropbox -
Gmail: connector_gmail -
Google Calendar: connector_googlecalendar -
Google Drive: connector_googledrive -
Microsoft Teams: connector_microsoftteams -
Outlook Calendar: connector_outlookcalendar -
Outlook Email: connector_outlookemail -
SharePoint: connector_sharepoint -
CONNECTOR_DROPBOX("connector_dropbox") -
CONNECTOR_GMAIL("connector_gmail") -
CONNECTOR_GOOGLECALENDAR("connector_googlecalendar") -
CONNECTOR_GOOGLEDRIVE("connector_googledrive") -
CONNECTOR_MICROSOFTTEAMS("connector_microsoftteams") -
CONNECTOR_OUTLOOKCALENDAR("connector_outlookcalendar") -
CONNECTOR_OUTLOOKEMAIL("connector_outlookemail") -
CONNECTOR_SHAREPOINT("connector_sharepoint")
-
-
Optional<Boolean> deferLoadingWhether this MCP tool is deferred and discovered via tool search.
-
Optional<Headers> headersOptional HTTP headers to send to the MCP server. Use for authentication or other purposes.
-
Optional<RequireApproval> requireApprovalSpecify which of the MCP server's tools require approval.
-
class McpToolApprovalFilter:Specify which of the MCP server's tools require approval. Can be always, never, or a filter object associated with tools that require approval.-
Optional<Always> alwaysA filter object to specify which tools are allowed.
-
Optional<Boolean> readOnlyIndicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with readOnlyHint, it will match this filter. -
Optional<List<String>> toolNamesList of allowed tool names.
-
-
Optional<Never> neverA filter object to specify which tools are allowed.
-
Optional<Boolean> readOnlyIndicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with readOnlyHint, it will match this filter. -
Optional<List<String>> toolNamesList of allowed tool names.
-
-
-
enum McpToolApprovalSetting:Specify a single approval policy for all tools. One of always or never. When set to always, all tools will require approval. When set to never, no tools will require approval.-
ALWAYS("always") -
NEVER("never")
-
-
-
Optional<String> serverDescriptionOptional description of the MCP server, used to provide more context.
-
Optional<String> serverUrlThe URL for the MCP server. One of server_url or connector_id must be provided.
-
-
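The rule that one of server_url or connector_id must be provided can be checked locally before a tool entry is sent. The following is a minimal sketch, not the SDK's own class: McpToolSketch and its constructor are hypothetical names, and rejecting an unlisted connector ID on the client side is an assumption rather than documented behavior.

```java
import java.util.Optional;
import java.util.Set;

/** Hypothetical local model of an MCP tool entry; not an SDK class. */
final class McpToolSketch {
    // Connector IDs copied from the list in this reference.
    static final Set<String> KNOWN_CONNECTORS = Set.of(
            "connector_dropbox", "connector_gmail", "connector_googlecalendar",
            "connector_googledrive", "connector_microsoftteams",
            "connector_outlookcalendar", "connector_outlookemail",
            "connector_sharepoint");

    final String serverLabel;
    final Optional<String> serverUrl;
    final Optional<String> connectorId;

    McpToolSketch(String serverLabel, String serverUrl, String connectorId) {
        if (serverLabel == null || serverLabel.isBlank()) {
            throw new IllegalArgumentException("serverLabel is required");
        }
        this.serverLabel = serverLabel;
        this.serverUrl = Optional.ofNullable(serverUrl);
        this.connectorId = Optional.ofNullable(connectorId);
        // "One of server_url or connector_id must be provided."
        if (this.serverUrl.isEmpty() && this.connectorId.isEmpty()) {
            throw new IllegalArgumentException("provide serverUrl or connectorId");
        }
        // Assumption: an unlisted connector ID is treated as a local error.
        if (this.connectorId.isPresent() && !KNOWN_CONNECTORS.contains(connectorId)) {
            throw new IllegalArgumentException("unknown connectorId: " + connectorId);
        }
    }
}
```

For example, new McpToolSketch("mail", null, "connector_gmail") is accepted, while passing null for both serverUrl and connectorId throws IllegalArgumentException. Whether the API rejects an entry that supplies both fields is not stated here, so the sketch allows it.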
Realtime Tracing Config
-
class RealtimeTracingConfig: A class that can be one of several variants.unionRealtime API can write session traces to the Traces Dashboard. Set to null to disable tracing. Once tracing is enabled for a session, the configuration cannot be modified.
auto will create a trace for the session with default values for the workflow name, group id, and metadata.-
JsonValue;AUTO("auto")
-
TracingConfiguration-
Optional<String> groupIdThe group id to attach to this trace to enable filtering and grouping in the Traces Dashboard.
-
Optional<JsonValue> metadataThe arbitrary metadata to attach to this trace to enable filtering in the Traces Dashboard.
-
Optional<String> workflowNameThe name of the workflow to attach to this trace. This is used to name the trace in the Traces Dashboard.
-
-
Realtime Transcription Session Audio
-
class RealtimeTranscriptionSessionAudio:Configuration for input and output audio.
-
Optional<RealtimeTranscriptionSessionAudioInput> input-
Optional<RealtimeAudioFormats> formatThe PCM audio format. Only a 24kHz sample rate is supported.
-
AudioPcm-
Optional<Rate> rateThe sample rate of the audio. Always 24000.
_24000(24000)
-
Optional<Type> typeThe audio format. Always audio/pcm.
AUDIO_PCM("audio/pcm")
-
-
AudioPcmu-
Optional<Type> typeThe audio format. Always audio/pcmu.
AUDIO_PCMU("audio/pcmu")
-
-
AudioPcma-
Optional<Type> typeThe audio format. Always audio/pcma.
AUDIO_PCMA("audio/pcma")
-
-
-
Optional<NoiseReduction> noiseReductionConfiguration for input audio noise reduction. This can be set to null to turn off. Noise reduction filters audio added to the input audio buffer before it is sent to VAD and the model. Filtering the audio can improve VAD and turn detection accuracy (reducing false positives) and model performance by improving perception of the input audio.-
Optional<NoiseReductionType> typeType of noise reduction. near_field is for close-talking microphones such as headphones; far_field is for far-field microphones such as laptop or conference room microphones.-
NEAR_FIELD("near_field") -
FAR_FIELD("far_field")
-
-
-
Optional<AudioTranscription> transcriptionConfiguration for input audio transcription. Defaults to off and can be set to null to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through the /audio/transcriptions endpoint and should be treated as guidance on the input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service.-
Optional<Delay> delayControls how long the model waits before emitting transcription text. Higher values can improve transcription accuracy at the cost of latency. Only supported with gpt-realtime-whisper in GA Realtime sessions.-
MINIMAL("minimal") -
LOW("low") -
MEDIUM("medium") -
HIGH("high") -
XHIGH("xhigh")
-
-
Optional<String> languageThe language of the input audio. Supplying the input language in ISO-639-1 format (e.g. en) will improve accuracy and latency. -
Optional<Model> modelThe model to use for transcription. Current options are whisper-1, gpt-4o-mini-transcribe, gpt-4o-mini-transcribe-2025-12-15, gpt-4o-transcribe, gpt-4o-transcribe-diarize, and gpt-realtime-whisper. Use gpt-4o-transcribe-diarize when you need diarization with speaker labels.-
WHISPER_1("whisper-1") -
GPT_4O_MINI_TRANSCRIBE("gpt-4o-mini-transcribe") -
GPT_4O_MINI_TRANSCRIBE_2025_12_15("gpt-4o-mini-transcribe-2025-12-15") -
GPT_4O_TRANSCRIBE("gpt-4o-transcribe") -
GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize") -
GPT_REALTIME_WHISPER("gpt-realtime-whisper")
-
-
Optional<String> promptAn optional text to guide the model's style or continue a previous audio segment. For whisper-1, the prompt is a list of keywords. For gpt-4o-transcribe models (excluding gpt-4o-transcribe-diarize), the prompt is a free text string, for example "expect words related to technology". Prompt is not supported with gpt-realtime-whisper in GA Realtime sessions.
-
-
Optional<RealtimeTranscriptionSessionAudioInputTurnDetection> turnDetectionConfiguration for turn detection, either Server VAD or Semantic VAD. This can be set to null to turn off, in which case the client must manually trigger model response.
Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.
Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.
For gpt-realtime-whisper transcription sessions, turn detection must be set to null; VAD is not supported.-
ServerVad-
JsonValue; type "server_vad"constantType of turn detection, server_vad to turn on simple Server VAD.
SERVER_VAD("server_vad")
-
Optional<Boolean> createResponseWhether or not to automatically generate a response when a VAD stop event occurs. If interrupt_response is set to false, this may fail to create a response if the model is already responding.
If both create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted. -
Optional<Long> idleTimeoutMsOptional timeout after which a model response will be triggered automatically. This is useful for situations in which a long pause from the user is unexpected, such as a phone call. The model will effectively prompt the user to continue the conversation based on the current context.
The timeout value will be applied after the last model response's audio has finished playing, i.e. it's set to the response.done time plus audio playback duration.
An input_audio_buffer.timeout_triggered event (plus events associated with the Response) will be emitted when the timeout is reached. Idle timeout is currently only supported for server_vad mode. -
Optional<Boolean> interruptResponseWhether or not to automatically interrupt (cancel) any ongoing response with output to the default conversation (i.e. conversation of auto) when a VAD start event occurs. If true then the response will be cancelled, otherwise it will continue until complete.
If both create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted. -
Optional<Long> prefixPaddingMsUsed only for server_vad mode. Amount of audio to include before the VAD detected speech (in milliseconds). Defaults to 300ms. -
Optional<Long> silenceDurationMsUsed only for server_vad mode. Duration of silence to detect speech stop (in milliseconds). Defaults to 500ms. With shorter values the model will respond more quickly, but may jump in on short pauses from the user. -
Optional<Double> thresholdUsed only for server_vad mode. Activation threshold for VAD (0.0 to 1.0); defaults to 0.5. A higher threshold will require louder audio to activate the model, and thus might perform better in noisy environments.
-
-
SemanticVad-
JsonValue; type "semantic_vad"constantType of turn detection, semantic_vad to turn on Semantic VAD.
SEMANTIC_VAD("semantic_vad")
-
Optional<Boolean> createResponseWhether or not to automatically generate a response when a VAD stop event occurs.
-
Optional<Eagerness> eagernessUsed only for semantic_vad mode. The eagerness of the model to respond. low will wait longer for the user to continue speaking; high will respond more quickly. auto is the default and is equivalent to medium. low, medium, and high have max timeouts of 8s, 4s, and 2s respectively.-
LOW("low") -
MEDIUM("medium") -
HIGH("high") -
AUTO("auto")
-
-
Optional<Boolean> interruptResponseWhether or not to automatically interrupt any ongoing response with output to the default conversation (i.e. conversation of auto) when a VAD start event occurs.
-
-
-
-
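To see how threshold, prefixPaddingMs, and silenceDurationMs interact in server_vad mode, a small offline simulation can help. The sketch below is an illustration only, not the server's detector: it assumes hypothetical per-frame activation scores in the range 0.0 to 1.0, and the ServerVadSketch class, Turn record, and detect method are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative threshold-based turn detector; not the server's algorithm. */
final class ServerVadSketch {
    /** One detected turn, in milliseconds from the start of the stream. */
    record Turn(long startMs, long endMs) {}

    static List<Turn> detect(double[] activation, long frameMs, double threshold,
                             long prefixPaddingMs, long silenceDurationMs) {
        List<Turn> turns = new ArrayList<>();
        boolean speaking = false;
        long startMs = 0;          // padded start of the current turn
        long lastVoicedEndMs = 0;  // end of the most recent voiced frame
        long silentMs = 0;         // silence accumulated since that frame
        for (int i = 0; i < activation.length; i++) {
            long frameStartMs = i * frameMs;
            if (activation[i] >= threshold) {
                if (!speaking) {
                    speaking = true;
                    // Include some audio from before speech was detected.
                    startMs = Math.max(0, frameStartMs - prefixPaddingMs);
                }
                silentMs = 0;
                lastVoicedEndMs = frameStartMs + frameMs;
            } else if (speaking) {
                silentMs += frameMs;
                // Enough continuous silence ends the turn.
                if (silentMs >= silenceDurationMs) {
                    turns.add(new Turn(startMs, lastVoicedEndMs));
                    speaking = false;
                    silentMs = 0;
                }
            }
        }
        // Close a turn that is still open when the stream ends.
        if (speaking) {
            turns.add(new Turn(startMs, lastVoicedEndMs));
        }
        return turns;
    }
}
```

With 100 ms frames and the documented defaults (threshold 0.5, 300 ms prefix padding, 500 ms silence), a single quiet frame between two voiced frames does not split the turn, because 100 ms of silence is below the 500 ms stop duration. Lowering silenceDurationMs makes the detector end turns sooner, which matches the trade-off described for that field.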
Realtime Transcription Session Audio Input
-
class RealtimeTranscriptionSessionAudioInput:-
Optional<RealtimeAudioFormats> formatThe PCM audio format. Only a 24kHz sample rate is supported.
-
AudioPcm-
Optional<Rate> rateThe sample rate of the audio. Always 24000.
_24000(24000)
-
Optional<Type> typeThe audio format. Always audio/pcm.
AUDIO_PCM("audio/pcm")
-
-
AudioPcmu-
Optional<Type> typeThe audio format. Always audio/pcmu.
AUDIO_PCMU("audio/pcmu")
-
-
AudioPcma-
Optional<Type> typeThe audio format. Always audio/pcma.
AUDIO_PCMA("audio/pcma")
-
-
-
Optional<NoiseReduction> noiseReductionConfiguration for input audio noise reduction. This can be set to null to turn off. Noise reduction filters audio added to the input audio buffer before it is sent to VAD and the model. Filtering the audio can improve VAD and turn detection accuracy (reducing false positives) and model performance by improving perception of the input audio.-
Optional<NoiseReductionType> typeType of noise reduction. near_field is for close-talking microphones such as headphones; far_field is for far-field microphones such as laptop or conference room microphones.-
NEAR_FIELD("near_field") -
FAR_FIELD("far_field")
-
-
-
Optional<AudioTranscription> transcriptionConfiguration for input audio transcription. Defaults to off and can be set to null to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through the /audio/transcriptions endpoint and should be treated as guidance on the input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service.-
Optional<Delay> delayControls how long the model waits before emitting transcription text. Higher values can improve transcription accuracy at the cost of latency. Only supported with gpt-realtime-whisper in GA Realtime sessions.-
MINIMAL("minimal") -
LOW("low") -
MEDIUM("medium") -
HIGH("high") -
XHIGH("xhigh")
-
-
Optional<String> languageThe language of the input audio. Supplying the input language in ISO-639-1 format (e.g. en) will improve accuracy and latency. -
Optional<Model> modelThe model to use for transcription. Current options are whisper-1, gpt-4o-mini-transcribe, gpt-4o-mini-transcribe-2025-12-15, gpt-4o-transcribe, gpt-4o-transcribe-diarize, and gpt-realtime-whisper. Use gpt-4o-transcribe-diarize when you need diarization with speaker labels.-
WHISPER_1("whisper-1") -
GPT_4O_MINI_TRANSCRIBE("gpt-4o-mini-transcribe") -
GPT_4O_MINI_TRANSCRIBE_2025_12_15("gpt-4o-mini-transcribe-2025-12-15") -
GPT_4O_TRANSCRIBE("gpt-4o-transcribe") -
GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize") -
GPT_REALTIME_WHISPER("gpt-realtime-whisper")
-
-
Optional<String> promptAn optional text to guide the model's style or continue a previous audio segment. For whisper-1, the prompt is a list of keywords. For gpt-4o-transcribe models (excluding gpt-4o-transcribe-diarize), the prompt is a free text string, for example "expect words related to technology". Prompt is not supported with gpt-realtime-whisper in GA Realtime sessions.
-
-
Optional<RealtimeTranscriptionSessionAudioInputTurnDetection> turnDetectionConfiguration for turn detection, either Server VAD or Semantic VAD. This can be set to null to turn off, in which case the client must manually trigger model response.
Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.
Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.
For gpt-realtime-whisper transcription sessions, turn detection must be set to null; VAD is not supported.-
ServerVad-
JsonValue; type "server_vad"constantType of turn detection, server_vad to turn on simple Server VAD.
SERVER_VAD("server_vad")
-
Optional<Boolean> createResponseWhether or not to automatically generate a response when a VAD stop event occurs. If interrupt_response is set to false, this may fail to create a response if the model is already responding.
If both create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted. -
Optional<Long> idleTimeoutMsOptional timeout after which a model response will be triggered automatically. This is useful for situations in which a long pause from the user is unexpected, such as a phone call. The model will effectively prompt the user to continue the conversation based on the current context.
The timeout value will be applied after the last model response's audio has finished playing, i.e. it's set to the response.done time plus audio playback duration.
An input_audio_buffer.timeout_triggered event (plus events associated with the Response) will be emitted when the timeout is reached. Idle timeout is currently only supported for server_vad mode. -
Optional<Boolean> interruptResponseWhether or not to automatically interrupt (cancel) any ongoing response with output to the default conversation (i.e. conversation of auto) when a VAD start event occurs. If true then the response will be cancelled, otherwise it will continue until complete.
If both create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted. -
Optional<Long> prefixPaddingMsUsed only for server_vad mode. Amount of audio to include before the VAD detected speech (in milliseconds). Defaults to 300ms. -
Optional<Long> silenceDurationMsUsed only for server_vad mode. Duration of silence to detect speech stop (in milliseconds). Defaults to 500ms. With shorter values the model will respond more quickly, but may jump in on short pauses from the user. -
Optional<Double> thresholdUsed only for server_vad mode. Activation threshold for VAD (0.0 to 1.0); defaults to 0.5. A higher threshold will require louder audio to activate the model, and thus might perform better in noisy environments.
-
-
SemanticVad-
JsonValue; type "semantic_vad"constantType of turn detection, semantic_vad to turn on Semantic VAD.
SEMANTIC_VAD("semantic_vad")
-
Optional<Boolean> createResponseWhether or not to automatically generate a response when a VAD stop event occurs.
-
Optional<Eagerness> eagernessUsed only for semantic_vad mode. The eagerness of the model to respond. low will wait longer for the user to continue speaking; high will respond more quickly. auto is the default and is equivalent to medium. low, medium, and high have max timeouts of 8s, 4s, and 2s respectively.-
LOW("low") -
MEDIUM("medium") -
HIGH("high") -
AUTO("auto")
-
-
Optional<Boolean> interruptResponseWhether or not to automatically interrupt any ongoing response with output to the default conversation (i.e. conversation of auto) when a VAD start event occurs.
-
-
-
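The idleTimeoutMs description says the timeout counts from the response.done time plus the audio playback duration. Read literally, the moment at which input_audio_buffer.timeout_triggered would be expected can be estimated as below. This is a minimal sketch of that arithmetic only; IdleTimeoutSketch and its method names are hypothetical, and the server's own clock is what actually decides.

```java
import java.time.Duration;
import java.time.Instant;

/** Illustrative arithmetic for the idle timeout; not an SDK class. */
final class IdleTimeoutSketch {
    /**
     * Estimated instant at which the idle timeout fires:
     * response.done time + audio playback duration + idle timeout.
     */
    static Instant timeoutDeadline(Instant responseDone, Duration playback, long idleTimeoutMs) {
        return responseDone.plus(playback).plusMillis(idleTimeoutMs);
    }

    /** True once the estimated deadline has been reached. */
    static boolean isExpired(Instant now, Instant responseDone, Duration playback, long idleTimeoutMs) {
        return !now.isBefore(timeoutDeadline(responseDone, playback, idleTimeoutMs));
    }
}
```

For a response that finished at 00:00:00 with 3 seconds of audio still to play and an idle timeout of 5000 ms, the estimated deadline is 00:00:08.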
Realtime Transcription Session Audio Input Turn Detection
-
class RealtimeTranscriptionSessionAudioInputTurnDetection: A class that can be one of several variants.unionConfiguration for turn detection, either Server VAD or Semantic VAD. This can be set to null to turn off, in which case the client must manually trigger model response.
Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.
Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.
For gpt-realtime-whisper transcription sessions, turn detection must be set to null; VAD is not supported.-
ServerVad-
JsonValue; type "server_vad"constantType of turn detection, server_vad to turn on simple Server VAD.
SERVER_VAD("server_vad")
-
Optional<Boolean> createResponseWhether or not to automatically generate a response when a VAD stop event occurs. If interrupt_response is set to false, this may fail to create a response if the model is already responding.
If both create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted. -
Optional<Long> idleTimeoutMsOptional timeout after which a model response will be triggered automatically. This is useful for situations in which a long pause from the user is unexpected, such as a phone call. The model will effectively prompt the user to continue the conversation based on the current context.
The timeout value will be applied after the last model response's audio has finished playing, i.e. it's set to the response.done time plus audio playback duration.
An input_audio_buffer.timeout_triggered event (plus events associated with the Response) will be emitted when the timeout is reached. Idle timeout is currently only supported for server_vad mode. -
Optional<Boolean> interruptResponseWhether or not to automatically interrupt (cancel) any ongoing response with output to the default conversation (i.e. conversation of auto) when a VAD start event occurs. If true then the response will be cancelled, otherwise it will continue until complete.
If both create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted. -
Optional<Long> prefixPaddingMsUsed only for server_vad mode. Amount of audio to include before the VAD detected speech (in milliseconds). Defaults to 300ms. -
Optional<Long> silenceDurationMsUsed only for server_vad mode. Duration of silence to detect speech stop (in milliseconds). Defaults to 500ms. With shorter values the model will respond more quickly, but may jump in on short pauses from the user. -
Optional<Double> thresholdUsed only for server_vad mode. Activation threshold for VAD (0.0 to 1.0); defaults to 0.5. A higher threshold will require louder audio to activate the model, and thus might perform better in noisy environments.
-
-
SemanticVad-
JsonValue; type "semantic_vad"constantType of turn detection, semantic_vad to turn on Semantic VAD.
SEMANTIC_VAD("semantic_vad")
-
Optional<Boolean> createResponseWhether or not to automatically generate a response when a VAD stop event occurs.
-
Optional<Eagerness> eagernessUsed only for semantic_vad mode. The eagerness of the model to respond. low will wait longer for the user to continue speaking; high will respond more quickly. auto is the default and is equivalent to medium. low, medium, and high have max timeouts of 8s, 4s, and 2s respectively.-
LOW("low") -
MEDIUM("medium") -
HIGH("high") -
AUTO("auto")
-
-
Optional<Boolean> interruptResponseWhether or not to automatically interrupt any ongoing response with output to the default conversation (i.e. conversation of auto) when a VAD start event occurs.
-
-
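The eagerness values and their documented maximum timeouts (8s for low, 4s for medium, 2s for high, with auto equivalent to medium) can be captured in a small lookup. The enum below is a hypothetical client-side helper for illustration, for example when sizing a UI "still listening" indicator; it is not the SDK's Eagerness type.

```java
import java.time.Duration;

/** Hypothetical mirror of the semantic_vad eagerness values; not the SDK enum. */
enum EagernessSketch {
    LOW("low"), MEDIUM("medium"), HIGH("high"), AUTO("auto");

    private final String wireValue;

    EagernessSketch(String wireValue) {
        this.wireValue = wireValue;
    }

    /** The string value shown in the reference, e.g. "low". */
    String wireValue() {
        return wireValue;
    }

    /** Maximum timeout as documented: low 8s, medium 4s, high 2s; auto behaves like medium. */
    Duration maxTimeout() {
        switch (this) {
            case LOW:
                return Duration.ofSeconds(8);
            case HIGH:
                return Duration.ofSeconds(2);
            case MEDIUM:
            case AUTO:
            default:
                return Duration.ofSeconds(4);
        }
    }
}
```

Calling EagernessSketch.AUTO.maxTimeout() yields the same 4 second duration as MEDIUM, which reflects the statement that auto is the default and equivalent to medium.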
Realtime Transcription Session Create Request
-
class RealtimeTranscriptionSessionCreateRequest:Realtime transcription session object configuration.
-
JsonValue; type "transcription"constantThe type of session to create. Always transcription for transcription sessions.
TRANSCRIPTION("transcription")
-
Optional<RealtimeTranscriptionSessionAudio> audioConfiguration for input and output audio.
-
Optional<RealtimeTranscriptionSessionAudioInput> input-
Optional<RealtimeAudioFormats> formatThe PCM audio format. Only a 24kHz sample rate is supported.
-
AudioPcm-
Optional<Rate> rateThe sample rate of the audio. Always 24000.
_24000(24000)
-
Optional<Type> typeThe audio format. Always audio/pcm.
AUDIO_PCM("audio/pcm")
-
-
AudioPcmu-
Optional<Type> typeThe audio format. Always audio/pcmu.
AUDIO_PCMU("audio/pcmu")
-
-
AudioPcma-
Optional<Type> typeThe audio format. Always audio/pcma.
AUDIO_PCMA("audio/pcma")
-
-
-
Optional<NoiseReduction> noiseReductionConfiguration for input audio noise reduction. This can be set to null to turn off. Noise reduction filters audio added to the input audio buffer before it is sent to VAD and the model. Filtering the audio can improve VAD and turn detection accuracy (reducing false positives) and model performance by improving perception of the input audio.-
Optional<NoiseReductionType> typeType of noise reduction. near_field is for close-talking microphones such as headphones; far_field is for far-field microphones such as laptop or conference room microphones.-
NEAR_FIELD("near_field") -
FAR_FIELD("far_field")
-
-
-
Optional<AudioTranscription> transcriptionConfiguration for input audio transcription. Defaults to off and can be set to null to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through the /audio/transcriptions endpoint and should be treated as guidance on the input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service.-
Optional<Delay> delayControls how long the model waits before emitting transcription text. Higher values can improve transcription accuracy at the cost of latency. Only supported with gpt-realtime-whisper in GA Realtime sessions.-
MINIMAL("minimal") -
LOW("low") -
MEDIUM("medium") -
HIGH("high") -
XHIGH("xhigh")
-
-
Optional<String> languageThe language of the input audio. Supplying the input language in ISO-639-1 format (e.g. en) will improve accuracy and latency. -
Optional<Model> modelThe model to use for transcription. Current options are whisper-1, gpt-4o-mini-transcribe, gpt-4o-mini-transcribe-2025-12-15, gpt-4o-transcribe, gpt-4o-transcribe-diarize, and gpt-realtime-whisper. Use gpt-4o-transcribe-diarize when you need diarization with speaker labels.-
WHISPER_1("whisper-1") -
GPT_4O_MINI_TRANSCRIBE("gpt-4o-mini-transcribe") -
GPT_4O_MINI_TRANSCRIBE_2025_12_15("gpt-4o-mini-transcribe-2025-12-15") -
GPT_4O_TRANSCRIBE("gpt-4o-transcribe") -
GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize") -
GPT_REALTIME_WHISPER("gpt-realtime-whisper")
-
-
Optional<String> promptAn optional text to guide the model's style or continue a previous audio segment. For whisper-1, the prompt is a list of keywords. For gpt-4o-transcribe models (excluding gpt-4o-transcribe-diarize), the prompt is a free text string, for example "expect words related to technology". Prompt is not supported with gpt-realtime-whisper in GA Realtime sessions.
-
-
Optional<RealtimeTranscriptionSessionAudioInputTurnDetection> turnDetectionConfiguration for turn detection, either Server VAD or Semantic VAD. This can be set to null to turn off, in which case the client must manually trigger model response.
Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.
Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.
For gpt-realtime-whisper transcription sessions, turn detection must be set to null; VAD is not supported.-
ServerVad-
JsonValue; type "server_vad"constantType of turn detection, server_vad to turn on simple Server VAD.
SERVER_VAD("server_vad")
-
Optional<Boolean> createResponseWhether or not to automatically generate a response when a VAD stop event occurs. If interrupt_response is set to false, this may fail to create a response if the model is already responding.
If both create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted. -
Optional<Long> idleTimeoutMsOptional timeout after which a model response will be triggered automatically. This is useful for situations in which a long pause from the user is unexpected, such as a phone call. The model will effectively prompt the user to continue the conversation based on the current context.
The timeout value will be applied after the last model response's audio has finished playing, i.e. it's set to the response.done time plus audio playback duration.
An input_audio_buffer.timeout_triggered event (plus events associated with the Response) will be emitted when the timeout is reached. Idle timeout is currently only supported for server_vad mode. -
Optional<Boolean> interruptResponseWhether or not to automatically interrupt (cancel) any ongoing response with output to the default conversation (i.e. conversation of auto) when a VAD start event occurs. If true then the response will be cancelled, otherwise it will continue until complete.
If both create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted. -
Optional<Long> prefixPaddingMsUsed only for server_vad mode. Amount of audio to include before the VAD detected speech (in milliseconds). Defaults to 300ms. -
Optional<Long> silenceDurationMsUsed only for server_vad mode. Duration of silence to detect speech stop (in milliseconds). Defaults to 500ms. With shorter values the model will respond more quickly, but may jump in on short pauses from the user. -
Optional<Double> thresholdUsed only for server_vad mode. Activation threshold for VAD (0.0 to 1.0); defaults to 0.5. A higher threshold will require louder audio to activate the model, and thus might perform better in noisy environments.
-
-
SemanticVad-
JsonValue; type "semantic_vad"constantType of turn detection, semantic_vad to turn on Semantic VAD.
SEMANTIC_VAD("semantic_vad")
-
Optional<Boolean> createResponseWhether or not to automatically generate a response when a VAD stop event occurs.
-
Optional<Eagerness> eagernessUsed only for semantic_vad mode. The eagerness of the model to respond. low will wait longer for the user to continue speaking; high will respond more quickly. auto is the default and is equivalent to medium. low, medium, and high have max timeouts of 8s, 4s, and 2s respectively.-
LOW("low") -
MEDIUM("medium") -
HIGH("high") -
AUTO("auto")
-
-
Optional<Boolean> interruptResponseWhether or not to automatically interrupt any ongoing response with output to the default conversation (i.e. conversation of auto) when a VAD start event occurs.
-
-
-
-
-
Optional<List<Include>> includeAdditional fields to include in server outputs.
item.input_audio_transcription.logprobs: Include logprobs for input audio transcription.
ITEM_INPUT_AUDIO_TRANSCRIPTION_LOGPROBS("item.input_audio_transcription.logprobs")
-
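Several fields of a transcription session constrain each other: delay is only supported with gpt-realtime-whisper, prompt is not supported with that model, and turn detection must be null for it. A client could check these combinations before creating a session. The following is a minimal sketch under those stated rules; TranscriptionConfigSketch and validate are hypothetical names, plain strings stand in for the SDK's types, and the server remains the authority on what it accepts.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical pre-flight check for transcription settings; not an SDK class. */
final class TranscriptionConfigSketch {
    static final String REALTIME_WHISPER = "gpt-realtime-whisper";

    /**
     * Returns a list of problems; an empty list means the combination is
     * consistent with the rules quoted in this reference. Pass null for
     * any field that is not set.
     */
    static List<String> validate(String model, String delay, String prompt,
                                 String turnDetectionType) {
        List<String> problems = new ArrayList<>();
        boolean realtimeWhisper = REALTIME_WHISPER.equals(model);
        if (delay != null && !realtimeWhisper) {
            problems.add("delay is only supported with " + REALTIME_WHISPER);
        }
        if (prompt != null && realtimeWhisper) {
            problems.add("prompt is not supported with " + REALTIME_WHISPER);
        }
        if (turnDetectionType != null && realtimeWhisper) {
            problems.add("turn detection must be null for " + REALTIME_WHISPER);
        }
        return problems;
    }
}
```

For instance, validate("gpt-realtime-whisper", "low", null, null) returns an empty list, while validate("whisper-1", "low", null, "server_vad") reports the unsupported delay.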
Realtime Translation Client Event
-
class RealtimeTranslationClientEvent: A class that can be one of several variants.unionA Realtime translation client event.
-
class RealtimeTranslationSessionUpdateEvent:Send this event to update the translation session configuration. Translation sessions support updates to audio.output.language, audio.input.transcription, and audio.input.noise_reduction.-
RealtimeTranslationSessionUpdateRequest sessionTranslation session fields to update. The session type and model are set at creation and cannot be changed with session.update.-
Optional<Audio> audioConfiguration for translation input and output audio.
-
Optional<Input> input-
Optional<NoiseReduction> noiseReductionOptional input noise reduction. Set to null to disable it.-
NoiseReductionType typeType of noise reduction. near_field is for close-talking microphones such as headphones; far_field is for far-field microphones such as laptop or conference room microphones.-
NEAR_FIELD("near_field") -
FAR_FIELD("far_field")
-
-
-
Optional<Transcription> transcriptionOptional source-language transcription. When configured, the server emits session.input_transcript.delta events. Translation itself still runs from the input audio stream.-
String modelThe transcription model to use for source transcript deltas.
-
-
-
Optional<Output> output-
Optional<String> languageTarget language for translated output audio and transcript deltas.
-
-
-
-
JsonValue; type "session.update"constantThe event type, must be session.update.
SESSION_UPDATE("session.update")
-
Optional<String> eventIdOptional client-generated ID used to identify this event.
-
-
class RealtimeTranslationInputAudioBufferAppendEvent:Send this event to append audio bytes to the translation session input audio buffer.
WebSocket translation sessions accept base64-encoded 24 kHz PCM16 mono little-endian raw audio bytes. Unsupported WebSocket audio formats return a validation error because lower-quality audio materially degrades translation quality.
Translation consumes 200 ms engine frames. For best realtime behavior, append audio in 200 ms chunks. If a chunk is shorter, the server buffers it until it has enough audio for one frame. If a chunk is longer, the server splits it into 200 ms frames and enqueues them back-to-back.
Keep appending silence while the session is active. If a client stops sending audio and later resumes, model time treats the resumed audio as contiguous with the previous audio rather than as a real-world pause.
-
String audioBase64-encoded 24 kHz PCM16 mono audio bytes.
-
JsonValue; type "session.input_audio_buffer.append"constantThe event type, must be
session.input_audio_buffer.append.SESSION_INPUT_AUDIO_BUFFER_APPEND("session.input_audio_buffer.append")
-
Optional<String> eventIdOptional client-generated ID used to identify this event.
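As a rough sketch of the "keep appending silence" guidance, the following builds one 200 ms frame of PCM16 silence and wraps it in a session.input_audio_buffer.append payload. The class and method names (SilenceFrame, appendEventJson) are hypothetical, the JSON is assembled by hand rather than through the SDK, and the "audio" wire field name is assumed to match the field listed above.

```java
import java.util.Base64;

final class SilenceFrame {
    static final int SAMPLE_RATE_HZ = 24_000; // 24 kHz, per the event description
    static final int BYTES_PER_SAMPLE = 2;    // PCM16, mono
    static final int FRAME_MS = 200;          // one translation engine frame
    static final int FRAME_BYTES = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * FRAME_MS / 1000;

    /** Returns one 200 ms frame of all-zero PCM16 samples, base64-encoded. */
    static String silenceBase64() {
        return Base64.getEncoder().encodeToString(new byte[FRAME_BYTES]);
    }

    /** Hand-built JSON for illustration; base64 text needs no JSON escaping. */
    static String appendEventJson(String base64Audio) {
        return "{\"type\":\"session.input_audio_buffer.append\",\"audio\":\"" + base64Audio + "\"}";
    }

    public static void main(String[] args) {
        String json = appendEventJson(silenceBase64());
        System.out.println(FRAME_BYTES + " bytes per frame, payload length " + json.length());
    }
}
```

A client could send such a frame on a 200 ms timer whenever no microphone audio is available, so that model time keeps pace with real time.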
-
-
class RealtimeTranslationSessionCloseEvent:Gracefully close the realtime translation session. The server flushes pending input audio and emits any remaining translated output before closing the session.
-
JsonValue; type "session.close"constantThe event type, must be
session.close.SESSION_CLOSE("session.close")
-
Optional<String> eventIdOptional client-generated ID used to identify this event.
-
-
Realtime Translation Client Secret Create Request
-
class RealtimeTranslationClientSecretCreateRequest:Create a translation session and client secret for the Realtime API.
-
RealtimeTranslationSessionCreateRequest sessionRealtime translation session configuration. Translation sessions stream source audio in and translated audio plus transcript deltas out continuously.
-
String modelThe Realtime translation model used for this session.
-
Optional<Audio> audioConfiguration for translation input and output audio.
-
Optional<Input> input-
Optional<NoiseReduction> noiseReductionOptional input noise reduction. Set to
null to disable it.-
NoiseReductionType typeType of noise reduction.
near_field is for close-talking microphones such as headphones; far_field is for far-field microphones such as laptop or conference room microphones.-
NEAR_FIELD("near_field") -
FAR_FIELD("far_field")
-
-
-
Optional<Transcription> transcriptionOptional source-language transcription. When configured, the server emits
session.input_transcript.delta events. Translation itself still runs from the input audio stream.-
String modelThe transcription model to use for source transcript deltas.
-
-
-
Optional<Output> output-
Optional<String> languageTarget language for translated output audio and transcript deltas.
-
-
-
-
Optional<ExpiresAfter> expiresAfterConfiguration for the client secret expiration. Expiration refers to the time after which a client secret will no longer be valid for creating sessions. The session itself may continue after that time once started. A secret can be used to create multiple sessions until it expires.
-
Optional<Anchor> anchorThe anchor point for the client secret expiration, meaning that
seconds will be added to the created_at time of the client secret to produce an expiration timestamp. Only created_at is currently supported. CREATED_AT("created_at")
-
Optional<Long> secondsThe number of seconds from the anchor point to the expiration. Select a value between
10 and 7200 (2 hours). This defaults to 600 seconds (10 minutes) if not specified.
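The expiration arithmetic described above can be sketched as follows. ClientSecretExpiry and expiresAt are hypothetical names for illustration; the bounds and default come from the field description, while the client-side validation is an assumption of this sketch (the server performs its own checks).

```java
final class ClientSecretExpiry {
    static final long MIN_SECONDS = 10;
    static final long MAX_SECONDS = 7_200;   // 2 hours
    static final long DEFAULT_SECONDS = 600; // 10 minutes

    /**
     * Returns the expiration timestamp (epoch seconds) for a secret whose
     * anchor is created_at. A null seconds value falls back to the default.
     */
    static long expiresAt(long createdAtEpochSeconds, Long seconds) {
        long ttl = seconds != null ? seconds : DEFAULT_SECONDS;
        if (ttl < MIN_SECONDS || ttl > MAX_SECONDS) {
            throw new IllegalArgumentException("seconds must be between 10 and 7200, got " + ttl);
        }
        return createdAtEpochSeconds + ttl;
    }

    public static void main(String[] args) {
        System.out.println(expiresAt(1_700_000_000L, null)); // created_at plus the 600 s default
    }
}
```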
-
-
Realtime Translation Client Secret Create Response
-
class RealtimeTranslationClientSecretCreateResponse:Response from creating a translation session and client secret for the Realtime API.
-
long expiresAtExpiration timestamp for the client secret, in seconds since epoch.
-
RealtimeTranslationSession sessionA Realtime translation session. Translation sessions continuously translate input audio into the configured output language.
-
String idUnique identifier for the session that looks like
sess_1234567890abcdef. -
Audio audioConfiguration for translation input and output audio.
-
Optional<Input> input-
Optional<NoiseReduction> noiseReductionOptional input noise reduction.
-
NoiseReductionType typeType of noise reduction.
near_field is for close-talking microphones such as headphones; far_field is for far-field microphones such as laptop or conference room microphones.-
NEAR_FIELD("near_field") -
FAR_FIELD("far_field")
-
-
-
Optional<Transcription> transcriptionOptional source-language transcription. When configured, the server emits
session.input_transcript.delta events. Translation itself still runs from the input audio stream.-
String modelThe transcription model used for source transcript deltas.
-
-
-
Optional<Output> output-
Optional<String> languageTarget language for translated output audio and transcript deltas.
-
-
-
long expiresAtExpiration timestamp for the session, in seconds since epoch.
-
String modelThe Realtime translation model used for this session. This field is set at session creation and cannot be changed with
session.update. -
JsonValue; type "translation"constantThe session type. Always
translation for Realtime translation sessions. TRANSLATION("translation")
-
-
String valueThe generated client secret value.
-
Realtime Translation Input Audio Buffer Append Event
-
class RealtimeTranslationInputAudioBufferAppendEvent:Send this event to append audio bytes to the translation session input audio buffer.
WebSocket translation sessions accept base64-encoded 24 kHz PCM16 mono little-endian raw audio bytes. Unsupported WebSocket audio formats return a validation error because lower-quality audio materially degrades translation quality.
Translation consumes 200 ms engine frames. For best realtime behavior, append audio in 200 ms chunks. If a chunk is shorter, the server buffers it until it has enough audio for one frame. If a chunk is longer, the server splits it into 200 ms frames and enqueues them back-to-back.
Keep appending silence while the session is active. If a client stops sending audio and later resumes, model time treats the resumed audio as contiguous with the previous audio rather than as a real-world pause.
-
String audioBase64-encoded 24 kHz PCM16 mono audio bytes.
-
JsonValue; type "session.input_audio_buffer.append"constantThe event type, must be
session.input_audio_buffer.append.SESSION_INPUT_AUDIO_BUFFER_APPEND("session.input_audio_buffer.append")
-
Optional<String> eventIdOptional client-generated ID used to identify this event.
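The 200 ms chunking recommendation above can be sketched as a small splitter. At 24 kHz PCM16 mono, one frame is 24,000 × 2 × 0.2 = 9,600 bytes. TranslationAudioChunker is a hypothetical helper, and zero-padding the trailing partial chunk is a choice made for this sketch rather than documented behavior (the server would otherwise buffer the short chunk).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Base64;
import java.util.List;

final class TranslationAudioChunker {
    // 24 kHz * 2 bytes per PCM16 sample * 0.2 s = 9600 bytes per mono frame.
    static final int FRAME_BYTES = 24_000 * 2 * 200 / 1000;

    /**
     * Splits raw PCM16 bytes into base64-encoded 200 ms chunks.
     * A trailing partial chunk is padded with zeros (silence) to a full frame.
     */
    static List<String> toBase64Frames(byte[] pcm) {
        List<String> frames = new ArrayList<>();
        for (int offset = 0; offset < pcm.length; offset += FRAME_BYTES) {
            int end = Math.min(offset + FRAME_BYTES, pcm.length);
            // copyOf pads with zeros when the slice is shorter than FRAME_BYTES.
            byte[] frame = Arrays.copyOf(Arrays.copyOfRange(pcm, offset, end), FRAME_BYTES);
            frames.add(Base64.getEncoder().encodeToString(frame));
        }
        return frames;
    }

    public static void main(String[] args) {
        byte[] halfSecond = new byte[24_000]; // 500 ms of silence
        System.out.println(toBase64Frames(halfSecond).size() + " frames");
    }
}
```

Each returned string would become the audio field of one session.input_audio_buffer.append event.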
-
Realtime Translation Input Transcript Delta Event
-
class RealtimeTranslationInputTranscriptDeltaEvent:Returned when optional source-language transcript text is available. This event is emitted only when
audio.input.transcription is configured. Transcript deltas are append-only text fragments. Clients should not insert unconditional spaces between deltas.
-
String deltaAppend-only source-language transcript text.
-
String eventIdThe unique ID of the server event.
-
JsonValue; type "session.input_transcript.delta"constantThe event type, must be
session.input_transcript.delta.SESSION_INPUT_TRANSCRIPT_DELTA("session.input_transcript.delta")
-
Optional<Long> elapsedMsTiming metadata for stream alignment, derived from the translation frame when available. It advances in 200 ms increments, but multiple transcript deltas may share the same
elapsed_ms. Treat it as alignment metadata, not a unique transcript-delta identifier.
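A minimal sketch of the append-only rule: each delta is concatenated verbatim, with no separator added by the client. TranscriptAccumulator is a hypothetical class, and the sample fragments are invented for illustration.

```java
final class TranscriptAccumulator {
    private final StringBuilder text = new StringBuilder();

    /** Appends a delta exactly as received; no space is inserted between fragments. */
    void onDelta(String delta) {
        text.append(delta);
    }

    String transcript() {
        return text.toString();
    }

    public static void main(String[] args) {
        TranscriptAccumulator acc = new TranscriptAccumulator();
        // Whitespace, where needed, arrives inside the deltas themselves.
        for (String delta : new String[] {"Bon", "jour", " tout", " le monde"}) {
            acc.onDelta(delta);
        }
        System.out.println(acc.transcript()); // Bonjour tout le monde
    }
}
```

Joining the same fragments with a space would yield "Bon jour  tout  le monde", which is why unconditional separators are discouraged.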
-
Realtime Translation Output Audio Delta Event
-
class RealtimeTranslationOutputAudioDeltaEvent:Returned when translated output audio is available. Output audio deltas are 200 ms frames of PCM16 audio.
-
String deltaBase64-encoded translated audio data.
-
String eventIdThe unique ID of the server event.
-
JsonValue; type "session.output_audio.delta"constantThe event type, must be
session.output_audio.delta.SESSION_OUTPUT_AUDIO_DELTA("session.output_audio.delta")
-
Optional<Long> channelsNumber of audio channels.
-
Optional<Long> elapsedMsTiming metadata for stream alignment, derived from the translation frame when available. Treat
elapsed_ms as alignment metadata, not a unique event identifier. -
Optional<Format> formatAudio encoding for
delta.PCM16("pcm16")
-
Optional<Long> sampleRateSample rate of the audio delta.
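As an illustration of how the optional format fields might be used, this sketch decodes a delta and computes its playback duration. OutputAudioDelta is a hypothetical helper; the fallback values (24000 Hz, 1 channel) used when the optional fields are absent are assumptions of this sketch, not documented defaults.

```java
import java.util.Base64;

final class OutputAudioDelta {
    /**
     * Decodes a base64 PCM16 delta and returns its duration in milliseconds.
     * sampleRate and channels mirror the optional event fields and may be null.
     */
    static long durationMs(String base64Delta, Long sampleRate, Long channels) {
        byte[] pcm = Base64.getDecoder().decode(base64Delta);
        long rate = sampleRate != null ? sampleRate : 24_000L; // assumed fallback
        long ch = channels != null ? channels : 1L;            // assumed fallback
        long samplesPerChannel = pcm.length / (2L * ch);       // 2 bytes per PCM16 sample
        return samplesPerChannel * 1000L / rate;
    }

    public static void main(String[] args) {
        String frame = Base64.getEncoder().encodeToString(new byte[9600]);
        System.out.println(durationMs(frame, 24_000L, 1L) + " ms");
    }
}
```

A 9,600-byte mono frame at 24 kHz works out to 200 ms, matching the frame size stated in the event description.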
-
Realtime Translation Output Transcript Delta Event
-
class RealtimeTranslationOutputTranscriptDeltaEvent:Returned when translated transcript text is available.
Transcript deltas are append-only text fragments. Clients should not insert unconditional spaces between deltas.
-
String deltaAppend-only transcript text for the translated output audio.
-
String eventIdThe unique ID of the server event.
-
JsonValue; type "session.output_transcript.delta"constantThe event type, must be
session.output_transcript.delta.SESSION_OUTPUT_TRANSCRIPT_DELTA("session.output_transcript.delta")
-
Optional<Long> elapsedMsTiming metadata for stream alignment, derived from the translation frame when available. It advances in 200 ms increments, but multiple transcript deltas may share the same
elapsed_ms. Treat it as alignment metadata, not a unique transcript-delta identifier.
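Because several deltas can share one elapsed_ms value, a client that aligns captions with audio might bucket text by that value instead of keying on it as an identifier. The sketch below shows one way to do this; TranscriptTimeline is hypothetical, and reusing the last seen value when elapsedMs is absent is an assumption of this sketch.

```java
import java.util.Map;
import java.util.TreeMap;

final class TranscriptTimeline {
    // Key: elapsed_ms bucket; value: all transcript text that arrived with that value.
    private final TreeMap<Long, StringBuilder> buckets = new TreeMap<>();
    private long lastElapsedMs = 0L;

    /** elapsedMs may be null because the field is optional; the last seen value is reused then. */
    void onDelta(String delta, Long elapsedMs) {
        if (elapsedMs != null) {
            lastElapsedMs = elapsedMs;
        }
        buckets.computeIfAbsent(lastElapsedMs, k -> new StringBuilder()).append(delta);
    }

    /** Concatenates every delta whose elapsed_ms is at or before the given audio position. */
    String textUpTo(long audioPositionMs) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<Long, StringBuilder> e : buckets.headMap(audioPositionMs, true).entrySet()) {
            out.append(e.getValue());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        TranscriptTimeline timeline = new TranscriptTimeline();
        timeline.onDelta("Hel", 200L);
        timeline.onDelta("lo", 200L); // same elapsed_ms as the previous delta
        timeline.onDelta(" world", 400L);
        System.out.println(timeline.textUpTo(200)); // Hello
    }
}
```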
-
Realtime Translation Server Event
-
class RealtimeTranslationServerEvent: A class that can be one of several variants.unionA Realtime translation server event.
-
class RealtimeErrorEvent:Returned when an error occurs, which could be a client problem or a server problem. Most errors are recoverable and the session will stay open; we recommend that implementors monitor and log error messages by default.
-
RealtimeError errorDetails of the error.
-
String messageA human-readable error message.
-
String typeThe type of error (e.g., "invalid_request_error", "server_error").
-
Optional<String> codeError code, if any.
-
Optional<String> eventIdThe event_id of the client event that caused the error, if applicable.
-
Optional<String> paramParameter related to the error, if any.
-
-
String eventIdThe unique ID of the server event.
-
JsonValue; type "error"constantThe event type, must be
error.ERROR("error")
-
-
class RealtimeTranslationSessionCreatedEvent:Returned when a translation session is created. Emitted automatically when a new connection is established as the first server event. This event contains the default translation session configuration.
-
String eventIdThe unique ID of the server event.
-
RealtimeTranslationSession sessionThe translation session configuration.
-
String idUnique identifier for the session that looks like
sess_1234567890abcdef. -
Audio audioConfiguration for translation input and output audio.
-
Optional<Input> input-
Optional<NoiseReduction> noiseReductionOptional input noise reduction.
-
NoiseReductionType typeType of noise reduction.
near_field is for close-talking microphones such as headphones; far_field is for far-field microphones such as laptop or conference room microphones.-
NEAR_FIELD("near_field") -
FAR_FIELD("far_field")
-
-
-
Optional<Transcription> transcriptionOptional source-language transcription. When configured, the server emits
session.input_transcript.delta events. Translation itself still runs from the input audio stream.-
String modelThe transcription model used for source transcript deltas.
-
-
-
Optional<Output> output-
Optional<String> languageTarget language for translated output audio and transcript deltas.
-
-
-
long expiresAtExpiration timestamp for the session, in seconds since epoch.
-
String modelThe Realtime translation model used for this session. This field is set at session creation and cannot be changed with
session.update. -
JsonValue; type "translation"constantThe session type. Always
translation for Realtime translation sessions. TRANSLATION("translation")
-
-
JsonValue; type "session.created"constantThe event type, must be
session.created.SESSION_CREATED("session.created")
-
-
class RealtimeTranslationSessionUpdatedEvent:Returned when a translation session is updated with a
session.update event, unless there is an error.-
String eventIdThe unique ID of the server event.
-
RealtimeTranslationSession sessionThe translation session configuration.
-
JsonValue; type "session.updated"constantThe event type, must be
session.updated.SESSION_UPDATED("session.updated")
-
-
class RealtimeTranslationSessionClosedEvent:Returned when a realtime translation session is closed.
-
String eventIdThe unique ID of the server event.
-
JsonValue; type "session.closed"constantThe event type, must be
session.closed.SESSION_CLOSED("session.closed")
-
-
class RealtimeTranslationInputTranscriptDeltaEvent:Returned when optional source-language transcript text is available. This event is emitted only when
audio.input.transcription is configured. Transcript deltas are append-only text fragments. Clients should not insert unconditional spaces between deltas.
-
String deltaAppend-only source-language transcript text.
-
String eventIdThe unique ID of the server event.
-
JsonValue; type "session.input_transcript.delta"constantThe event type, must be
session.input_transcript.delta.SESSION_INPUT_TRANSCRIPT_DELTA("session.input_transcript.delta")
-
Optional<Long> elapsedMsTiming metadata for stream alignment, derived from the translation frame when available. It advances in 200 ms increments, but multiple transcript deltas may share the same
elapsed_ms. Treat it as alignment metadata, not a unique transcript-delta identifier.
-
-
class RealtimeTranslationOutputTranscriptDeltaEvent:Returned when translated transcript text is available.
Transcript deltas are append-only text fragments. Clients should not insert unconditional spaces between deltas.
-
String deltaAppend-only transcript text for the translated output audio.
-
String eventIdThe unique ID of the server event.
-
JsonValue; type "session.output_transcript.delta"constantThe event type, must be
session.output_transcript.delta.SESSION_OUTPUT_TRANSCRIPT_DELTA("session.output_transcript.delta")
-
Optional<Long> elapsedMsTiming metadata for stream alignment, derived from the translation frame when available. It advances in 200 ms increments, but multiple transcript deltas may share the same
elapsed_ms. Treat it as alignment metadata, not a unique transcript-delta identifier.
-
-
class RealtimeTranslationOutputAudioDeltaEvent:Returned when translated output audio is available. Output audio deltas are 200 ms frames of PCM16 audio.
-
String deltaBase64-encoded translated audio data.
-
String eventIdThe unique ID of the server event.
-
JsonValue; type "session.output_audio.delta"constantThe event type, must be
session.output_audio.delta.SESSION_OUTPUT_AUDIO_DELTA("session.output_audio.delta")
-
Optional<Long> channelsNumber of audio channels.
-
Optional<Long> elapsedMsTiming metadata for stream alignment, derived from the translation frame when available. Treat
elapsed_ms as alignment metadata, not a unique event identifier. -
Optional<Format> formatAudio encoding for
delta.PCM16("pcm16")
-
Optional<Long> sampleRateSample rate of the audio delta.
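The server event variants listed above are distinguished by their type string. As a hedged sketch, a client working with raw JSON could route on that string as follows. TranslationServerEvents and its Kind categories are hypothetical groupings for illustration and not part of the SDK.

```java
final class TranslationServerEvents {
    enum Kind { ERROR, SESSION_LIFECYCLE, INPUT_TRANSCRIPT, OUTPUT_TRANSCRIPT, OUTPUT_AUDIO, UNKNOWN }

    /** Maps the wire-level "type" string of a server event to a coarse handler category. */
    static Kind classify(String type) {
        switch (type) {
            case "error":
                return Kind.ERROR;
            case "session.created":
            case "session.updated":
            case "session.closed":
                return Kind.SESSION_LIFECYCLE;
            case "session.input_transcript.delta":
                return Kind.INPUT_TRANSCRIPT;
            case "session.output_transcript.delta":
                return Kind.OUTPUT_TRANSCRIPT;
            case "session.output_audio.delta":
                return Kind.OUTPUT_AUDIO;
            default:
                return Kind.UNKNOWN; // tolerate event types this client does not know
        }
    }

    public static void main(String[] args) {
        System.out.println(classify("session.output_audio.delta"));
    }
}
```

Returning UNKNOWN instead of throwing keeps the client running if the server emits an event type it does not recognize.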
-
-
Realtime Translation Session
-
class RealtimeTranslationSession:A Realtime translation session. Translation sessions continuously translate input audio into the configured output language.
-
String idUnique identifier for the session that looks like
sess_1234567890abcdef. -
Audio audioConfiguration for translation input and output audio.
-
Optional<Input> input-
Optional<NoiseReduction> noiseReductionOptional input noise reduction.
-
NoiseReductionType typeType of noise reduction.
near_field is for close-talking microphones such as headphones; far_field is for far-field microphones such as laptop or conference room microphones.-
NEAR_FIELD("near_field") -
FAR_FIELD("far_field")
-
-
-
Optional<Transcription> transcriptionOptional source-language transcription. When configured, the server emits
session.input_transcript.delta events. Translation itself still runs from the input audio stream.-
String modelThe transcription model used for source transcript deltas.
-
-
-
Optional<Output> output-
Optional<String> languageTarget language for translated output audio and transcript deltas.
-
-
-
long expiresAtExpiration timestamp for the session, in seconds since epoch.
-
String modelThe Realtime translation model used for this session. This field is set at session creation and cannot be changed with
session.update. -
JsonValue; type "translation"constantThe session type. Always
translation for Realtime translation sessions. TRANSLATION("translation")
-
Realtime Translation Session Close Event
-
class RealtimeTranslationSessionCloseEvent:Gracefully close the realtime translation session. The server flushes pending input audio and emits any remaining translated output before closing the session.
-
JsonValue; type "session.close"constantThe event type, must be
session.close.SESSION_CLOSE("session.close")
-
Optional<String> eventIdOptional client-generated ID used to identify this event.
-
Realtime Translation Session Closed Event
-
class RealtimeTranslationSessionClosedEvent:Returned when a realtime translation session is closed.
-
String eventIdThe unique ID of the server event.
-
JsonValue; type "session.closed"constantThe event type, must be
session.closed.SESSION_CLOSED("session.closed")
-
Realtime Translation Session Create Request
-
class RealtimeTranslationSessionCreateRequest:Realtime translation session configuration. Translation sessions stream source audio in and translated audio plus transcript deltas out continuously.
-
String modelThe Realtime translation model used for this session.
-
Optional<Audio> audioConfiguration for translation input and output audio.
-
Optional<Input> input-
Optional<NoiseReduction> noiseReductionOptional input noise reduction. Set to
null to disable it.-
NoiseReductionType typeType of noise reduction.
near_field is for close-talking microphones such as headphones; far_field is for far-field microphones such as laptop or conference room microphones.-
NEAR_FIELD("near_field") -
FAR_FIELD("far_field")
-
-
-
Optional<Transcription> transcriptionOptional source-language transcription. When configured, the server emits
session.input_transcript.delta events. Translation itself still runs from the input audio stream.-
String modelThe transcription model to use for source transcript deltas.
-
-
-
Optional<Output> output-
Optional<String> languageTarget language for translated output audio and transcript deltas.
-
-
-
Realtime Translation Session Created Event
-
class RealtimeTranslationSessionCreatedEvent:Returned when a translation session is created. Emitted automatically when a new connection is established as the first server event. This event contains the default translation session configuration.
-
String eventIdThe unique ID of the server event.
-
RealtimeTranslationSession sessionThe translation session configuration.
-
String idUnique identifier for the session that looks like
sess_1234567890abcdef. -
Audio audioConfiguration for translation input and output audio.
-
Optional<Input> input-
Optional<NoiseReduction> noiseReductionOptional input noise reduction.
-
NoiseReductionType typeType of noise reduction.
near_field is for close-talking microphones such as headphones; far_field is for far-field microphones such as laptop or conference room microphones.-
NEAR_FIELD("near_field") -
FAR_FIELD("far_field")
-
-
-
Optional<Transcription> transcriptionOptional source-language transcription. When configured, the server emits
session.input_transcript.delta events. Translation itself still runs from the input audio stream.-
String modelThe transcription model used for source transcript deltas.
-
-
-
Optional<Output> output-
Optional<String> languageTarget language for translated output audio and transcript deltas.
-
-
-
long expiresAtExpiration timestamp for the session, in seconds since epoch.
-
String modelThe Realtime translation model used for this session. This field is set at session creation and cannot be changed with
session.update. -
JsonValue; type "translation"constantThe session type. Always
translation for Realtime translation sessions. TRANSLATION("translation")
-
-
JsonValue; type "session.created"constantThe event type, must be
session.created.SESSION_CREATED("session.created")
-
Realtime Translation Session Update Event
-
class RealtimeTranslationSessionUpdateEvent:Send this event to update the translation session configuration. Translation sessions support updates to
audio.output.language, audio.input.transcription, and audio.input.noise_reduction.-
RealtimeTranslationSessionUpdateRequest sessionTranslation session fields to update. The session
type and model are set at creation and cannot be changed with session.update.-
Optional<Audio> audioConfiguration for translation input and output audio.
-
Optional<Input> input-
Optional<NoiseReduction> noiseReductionOptional input noise reduction. Set to
null to disable it.-
NoiseReductionType typeType of noise reduction.
near_field is for close-talking microphones such as headphones; far_field is for far-field microphones such as laptop or conference room microphones.-
NEAR_FIELD("near_field") -
FAR_FIELD("far_field")
-
-
-
Optional<Transcription> transcriptionOptional source-language transcription. When configured, the server emits
session.input_transcript.delta events. Translation itself still runs from the input audio stream.-
String modelThe transcription model to use for source transcript deltas.
-
-
-
Optional<Output> output-
Optional<String> languageTarget language for translated output audio and transcript deltas.
-
-
-
-
JsonValue; type "session.update"constantThe event type, must be
session.update.SESSION_UPDATE("session.update")
-
Optional<String> eventIdOptional client-generated ID used to identify this event.
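For orientation, here is a hand-built sketch of the JSON a session.update event might carry when changing the target language and disabling noise reduction. TranslationSessionUpdate is a hypothetical helper that bypasses the SDK's builders; the nested field names follow the dotted paths named in the event description, and the language value "es" used in the example is an illustrative assumption, since the accepted format is not specified here.

```java
final class TranslationSessionUpdate {
    /**
     * Builds a session.update payload that sets the output language and,
     * optionally, disables noise reduction by sending an explicit null.
     * The language value is inserted as-is, so it must not need JSON escaping.
     */
    static String updateJson(String language, boolean disableNoiseReduction) {
        StringBuilder audio = new StringBuilder("{");
        if (disableNoiseReduction) {
            audio.append("\"input\":{\"noise_reduction\":null},");
        }
        audio.append("\"output\":{\"language\":\"").append(language).append("\"}}");
        return "{\"type\":\"session.update\",\"session\":{\"audio\":" + audio + "}}";
    }

    public static void main(String[] args) {
        System.out.println(updateJson("es", true));
    }
}
```

The payload omits type and model under session because those are fixed at creation.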
-
Realtime Translation Session Update Request
-
class RealtimeTranslationSessionUpdateRequest:Realtime translation session fields that can be updated with
session.update.-
Optional<Audio> audioConfiguration for translation input and output audio.
-
Optional<Input> input-
Optional<NoiseReduction> noiseReductionOptional input noise reduction. Set to
null to disable it.-
NoiseReductionType typeType of noise reduction.
near_field is for close-talking microphones such as headphones; far_field is for far-field microphones such as laptop or conference room microphones.-
NEAR_FIELD("near_field") -
FAR_FIELD("far_field")
-
-
-
Optional<Transcription> transcriptionOptional source-language transcription. When configured, the server emits
session.input_transcript.delta events. Translation itself still runs from the input audio stream.-
String modelThe transcription model to use for source transcript deltas.
-
-
-
Optional<Output> output-
Optional<String> languageTarget language for translated output audio and transcript deltas.
-
-
-
Realtime Translation Session Updated Event
-
class RealtimeTranslationSessionUpdatedEvent:Returned when a translation session is updated with a
session.update event, unless there is an error.-
String eventIdThe unique ID of the server event.
-
RealtimeTranslationSession sessionThe translation session configuration.
-
String idUnique identifier for the session that looks like
sess_1234567890abcdef. -
Audio audioConfiguration for translation input and output audio.
-
Optional<Input> input-
Optional<NoiseReduction> noiseReductionOptional input noise reduction.
-
NoiseReductionType typeType of noise reduction.
near_field is for close-talking microphones such as headphones; far_field is for far-field microphones such as laptop or conference room microphones.-
NEAR_FIELD("near_field") -
FAR_FIELD("far_field")
-
-
-
Optional<Transcription> transcriptionOptional source-language transcription. When configured, the server emits
session.input_transcript.delta events. Translation itself still runs from the input audio stream.-
String modelThe transcription model used for source transcript deltas.
-
-
-
Optional<Output> output-
Optional<String> languageTarget language for translated output audio and transcript deltas.
-
-
-
long expiresAtExpiration timestamp for the session, in seconds since epoch.
-
String modelThe Realtime translation model used for this session. This field is set at session creation and cannot be changed with
session.update. -
JsonValue; type "translation"constantThe session type. Always
translation for Realtime translation sessions. TRANSLATION("translation")
-
-
JsonValue; type "session.updated"constantThe event type, must be
session.updated.SESSION_UPDATED("session.updated")
-
Realtime Truncation
-
class RealtimeTruncation: A class that can be one of several variants.unionWhen the number of tokens in a conversation exceeds the model's input token limit, the conversation will be truncated, meaning messages (starting from the oldest) will not be included in the model's context. A 32k context model with 4,096 max output tokens can only include 28,224 tokens in the context before truncation occurs.
Clients can configure truncation behavior to truncate with a lower max token limit, which is an effective way to control token usage and cost.
Truncation will reduce the number of cached tokens on the next turn (busting the cache), since messages are dropped from the beginning of the context. However, clients can also configure truncation to retain messages up to a fraction of the maximum context size, which will reduce the need for future truncations and thus improve the cache rate.
Truncation can be disabled entirely, which means the server will never truncate but would instead return an error if the conversation exceeds the model's input token limit.
-
RealtimeTruncationStrategy-
AUTO("auto") -
DISABLED("disabled")
-
-
class RealtimeTruncationRetentionRatio:Retain a fraction of the conversation tokens when the conversation exceeds the input token limit. This allows you to amortize truncations across multiple turns, which can help improve cached token usage.
-
double retentionRatioFraction of post-instruction conversation tokens to retain (
0.0-1.0) when the conversation exceeds the input token limit. Setting this to 0.8 means that messages will be dropped until 80% of the maximum allowed tokens are used. This helps reduce the frequency of truncations and improve cache rates. -
JsonValue; type "retention_ratio"constantUse retention ratio truncation.
RETENTION_RATIO("retention_ratio")
-
Optional<TokenLimits> tokenLimitsOptional custom token limits for this truncation strategy. If not provided, the model's default token limits will be used.
-
Optional<Long> postInstructionsMaximum tokens allowed in the conversation after instructions (including tool definitions). For example, setting this to 5,000 would mean that truncation would occur when the conversation exceeds 5,000 tokens after instructions. This cannot be higher than the model's context window size minus the maximum output tokens.
-
-
-
Realtime Truncation Retention Ratio
-
class RealtimeTruncationRetentionRatio:Retain a fraction of the conversation tokens when the conversation exceeds the input token limit. This allows you to amortize truncations across multiple turns, which can help improve cached token usage.
-
double retentionRatioFraction of post-instruction conversation tokens to retain (
0.0-1.0) when the conversation exceeds the input token limit. Setting this to 0.8 means that messages will be dropped until 80% of the maximum allowed tokens are used. This helps reduce the frequency of truncations and improve cache rates. -
JsonValue; type "retention_ratio"constantUse retention ratio truncation.
RETENTION_RATIO("retention_ratio")
-
Optional<TokenLimits> tokenLimitsOptional custom token limits for this truncation strategy. If not provided, the model's default token limits will be used.
-
Optional<Long> postInstructionsMaximum tokens allowed in the conversation after instructions (including tool definitions). For example, setting this to 5,000 would mean that truncation would occur when the conversation exceeds 5,000 tokens after instructions. This cannot be higher than the model's context window size minus the maximum output tokens.
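To make the retention-ratio arithmetic concrete, the following client-side simulation drops the oldest messages once a limit is exceeded, continuing until the total falls to the retained fraction. It is only a model of the described semantics under stated assumptions: RetentionRatioSketch is hypothetical, per-message token counts are taken as given, and the server's actual procedure may differ.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

final class RetentionRatioSketch {
    /**
     * If the total exceeds maxTokens, removes messages from the front (oldest)
     * until the total is at or below retentionRatio * maxTokens.
     * Returns the number of messages dropped.
     */
    static int truncate(Deque<Integer> messageTokens, long maxTokens, double retentionRatio) {
        long total = 0;
        for (int tokens : messageTokens) {
            total += tokens;
        }
        if (total <= maxTokens) {
            return 0; // under the limit: nothing to truncate
        }
        long target = (long) (maxTokens * retentionRatio);
        int dropped = 0;
        while (total > target && !messageTokens.isEmpty()) {
            total -= messageTokens.removeFirst(); // oldest message first
            dropped++;
        }
        return dropped;
    }

    public static void main(String[] args) {
        Deque<Integer> history = new ArrayDeque<>(Arrays.asList(2000, 2000, 1500));
        // 5,500 tokens exceed the 5,000 limit; the target is 0.8 * 5,000 = 4,000.
        System.out.println(truncate(history, 5_000, 0.8) + " dropped, remaining " + history);
    }
}
```

Truncating below the limit rather than just to it leaves headroom, so the next few turns can reuse the same cached prefix before another truncation is needed.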
-
-
Response Audio Delta Event
-
class ResponseAudioDeltaEvent:Returned when the model-generated audio is updated.
-
long contentIndexThe index of the content part in the item's content array.
-
String deltaBase64-encoded audio data delta.
-
String eventIdThe unique ID of the server event.
-
String itemIdThe ID of the item.
-
long outputIndexThe index of the output item in the response.
-
String responseIdThe ID of the response.
-
JsonValue; type "response.output_audio.delta"constantThe event type, must be
response.output_audio.delta.RESPONSE_OUTPUT_AUDIO_DELTA("response.output_audio.delta")
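As a sketch of how a client might reassemble streamed audio, the class below decodes each delta on arrival and appends the bytes to a buffer keyed by item and content part. ResponseAudioBuffer is a hypothetical helper. Decoding each delta separately, rather than concatenating the base64 strings first, avoids problems with padding characters in the middle of the stream.

```java
import java.io.ByteArrayOutputStream;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

final class ResponseAudioBuffer {
    // One growing byte buffer per (itemId, contentIndex) pair.
    private final Map<String, ByteArrayOutputStream> buffers = new HashMap<>();

    private static String key(String itemId, long contentIndex) {
        return itemId + "#" + contentIndex;
    }

    /** Decodes one delta and appends its bytes to the matching buffer. */
    void onDelta(String itemId, long contentIndex, String base64Delta) {
        byte[] chunk = Base64.getDecoder().decode(base64Delta);
        buffers.computeIfAbsent(key(itemId, contentIndex), k -> new ByteArrayOutputStream())
               .write(chunk, 0, chunk.length);
    }

    /** Returns all audio bytes received so far for the given content part. */
    byte[] audioFor(String itemId, long contentIndex) {
        ByteArrayOutputStream out = buffers.get(key(itemId, contentIndex));
        return out == null ? new byte[0] : out.toByteArray();
    }

    public static void main(String[] args) {
        ResponseAudioBuffer buffer = new ResponseAudioBuffer();
        buffer.onDelta("item_1", 0, Base64.getEncoder().encodeToString(new byte[] {1, 2}));
        buffer.onDelta("item_1", 0, Base64.getEncoder().encodeToString(new byte[] {3}));
        System.out.println(buffer.audioFor("item_1", 0).length + " bytes buffered");
    }
}
```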
-
Response Audio Done Event
-
class ResponseAudioDoneEvent:Returned when the model-generated audio is done. Also emitted when a Response is interrupted, incomplete, or cancelled.
-
long contentIndexThe index of the content part in the item's content array.
-
String eventIdThe unique ID of the server event.
-
String itemIdThe ID of the item.
-
long outputIndexThe index of the output item in the response.
-
String responseIdThe ID of the response.
-
JsonValue; type "response.output_audio.done"constantThe event type, must be
response.output_audio.done.RESPONSE_OUTPUT_AUDIO_DONE("response.output_audio.done")
-
Response Audio Transcript Delta Event
-
class ResponseAudioTranscriptDeltaEvent:Returned when the model-generated transcription of audio output is updated.
-
long contentIndexThe index of the content part in the item's content array.
-
String deltaThe transcript delta.
-
String eventIdThe unique ID of the server event.
-
String itemIdThe ID of the item.
-
long outputIndexThe index of the output item in the response.
-
String responseIdThe ID of the response.
-
JsonValue; type "response.output_audio_transcript.delta"constantThe event type, must be
response.output_audio_transcript.delta.RESPONSE_OUTPUT_AUDIO_TRANSCRIPT_DELTA("response.output_audio_transcript.delta")
-
Response Audio Transcript Done Event
-
class ResponseAudioTranscriptDoneEvent:Returned when the model-generated transcription of audio output is done streaming. Also emitted when a Response is interrupted, incomplete, or cancelled.
-
long contentIndexThe index of the content part in the item's content array.
-
String eventIdThe unique ID of the server event.
-
String itemIdThe ID of the item.
-
long outputIndexThe index of the output item in the response.
-
String responseIdThe ID of the response.
-
String transcriptThe final transcript of the audio.
-
JsonValue; type "response.output_audio_transcript.done"constantThe event type, must be
response.output_audio_transcript.done.RESPONSE_OUTPUT_AUDIO_TRANSCRIPT_DONE("response.output_audio_transcript.done")
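The delta and done events can be combined on the client by showing the accumulated deltas while streaming and switching to the final transcript once the done event arrives. The sketch below illustrates that pattern; ResponseTranscriptTracker is a hypothetical class, and preferring the done event's transcript over the accumulated text is a design choice of this sketch.

```java
import java.util.HashMap;
import java.util.Map;

final class ResponseTranscriptTracker {
    private final Map<String, StringBuilder> partial = new HashMap<>();
    private final Map<String, String> completed = new HashMap<>();

    private static String key(String itemId, long contentIndex) {
        return itemId + "#" + contentIndex;
    }

    /** Appends a streamed transcript delta for the given content part. */
    void onDelta(String itemId, long contentIndex, String delta) {
        partial.computeIfAbsent(key(itemId, contentIndex), k -> new StringBuilder()).append(delta);
    }

    /** Records the final transcript and discards the partial text. */
    void onDone(String itemId, long contentIndex, String transcript) {
        String k = key(itemId, contentIndex);
        partial.remove(k);
        completed.put(k, transcript);
    }

    /** Returns the final transcript if known, otherwise the text streamed so far. */
    String current(String itemId, long contentIndex) {
        String k = key(itemId, contentIndex);
        if (completed.containsKey(k)) {
            return completed.get(k);
        }
        StringBuilder sb = partial.get(k);
        return sb == null ? "" : sb.toString();
    }

    public static void main(String[] args) {
        ResponseTranscriptTracker tracker = new ResponseTranscriptTracker();
        tracker.onDelta("item_1", 0, "Hel");
        tracker.onDelta("item_1", 0, "lo");
        System.out.println(tracker.current("item_1", 0)); // Hello
    }
}
```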
-
Response Cancel Event
-
class ResponseCancelEvent:Send this event to cancel an in-progress response. The server will respond with a
response.done event with a status of response.status=cancelled. If there is no response to cancel, the server will respond with an error. It's safe to call response.cancel even if no response is in progress; an error will be returned and the session will remain unaffected.-
JsonValue; type "response.cancel"constantThe event type, must be
response.cancel.RESPONSE_CANCEL("response.cancel")
-
Optional<String> eventIdOptional client-generated ID used to identify this event.
-
Optional<String> responseIdA specific response ID to cancel - if not provided, will cancel an in-progress response in the default conversation.
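A hand-built sketch of the two forms of this event, with and without a target response. ResponseCancel is a hypothetical helper, and the snake_case wire name response_id is assumed from the responseId field listed above.

```java
final class ResponseCancel {
    /**
     * Builds a response.cancel payload. Passing null omits response_id,
     * which targets the in-progress response in the default conversation.
     */
    static String cancelJson(String responseId) {
        if (responseId == null) {
            return "{\"type\":\"response.cancel\"}";
        }
        // The ID is inserted as-is, so it must not need JSON escaping.
        return "{\"type\":\"response.cancel\",\"response_id\":\"" + responseId + "\"}";
    }

    public static void main(String[] args) {
        System.out.println(cancelJson(null));
        System.out.println(cancelJson("resp_123")); // example ID, made up for illustration
    }
}
```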
-
Response Content Part Added Event
-
class ResponseContentPartAddedEvent:Returned when a new content part is added to an assistant message item during response generation.
-
long contentIndexThe index of the content part in the item's content array.
-
String eventIdThe unique ID of the server event.
-
String itemIdThe ID of the item to which the content part was added.
-
long outputIndexThe index of the output item in the response.
-
Part partThe content part that was added.
-
Optional<String> audioBase64-encoded audio data (if type is "audio").
-
Optional<String> textThe text content (if type is "text").
-
Optional<String> transcriptThe transcript of the audio (if type is "audio").
-
Optional<Type> typeThe content type ("text", "audio").
-
TEXT("text") -
AUDIO("audio")
-
-
-
String responseIdThe ID of the response.
-
JsonValue; type "response.content_part.added"constantThe event type, must be
response.content_part.added.RESPONSE_CONTENT_PART_ADDED("response.content_part.added")
-
Response Content Part Done Event
-
class ResponseContentPartDoneEvent:Returned when a content part is done streaming in an assistant message item. Also emitted when a Response is interrupted, incomplete, or cancelled.
-
long contentIndexThe index of the content part in the item's content array.
-
String eventIdThe unique ID of the server event.
-
String itemIdThe ID of the item.
-
long outputIndexThe index of the output item in the response.
-
Part partThe content part that is done.
-
Optional<String> audioBase64-encoded audio data (if type is "audio").
-
Optional<String> textThe text content (if type is "text").
-
Optional<String> transcriptThe transcript of the audio (if type is "audio").
-
Optional<Type> typeThe content type ("text", "audio").
-
TEXT("text") -
AUDIO("audio")
-
-
-
String responseIdThe ID of the response.
-
JsonValue; type "response.content_part.done"constantThe event type, must be
response.content_part.done.RESPONSE_CONTENT_PART_DONE("response.content_part.done")
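The part fields above depend on the part's type: text carries the content of a "text" part, while transcript and audio belong to an "audio" part. A minimal sketch of reading a part under that rule follows. ContentPartSketch and its nested Part class are hypothetical stand-ins for the documented shape, not SDK types.

```java
import java.util.Base64;
import java.util.Optional;

public class ContentPartSketch {

    // Hypothetical stand-in for the Part shape documented above.
    static final class Part {
        final Optional<String> type;
        final Optional<String> text;
        final Optional<String> transcript;
        final Optional<String> audio;

        Part(Optional<String> type, Optional<String> text,
             Optional<String> transcript, Optional<String> audio) {
            this.type = type;
            this.text = text;
            this.transcript = transcript;
            this.audio = audio;
        }
    }

    // Text to display: the text of a "text" part, or the transcript of an "audio" part.
    static Optional<String> displayText(Part part) {
        String type = part.type.orElse("");
        if (type.equals("text")) return part.text;
        if (type.equals("audio")) return part.transcript;
        return Optional.empty();
    }

    // Decode the Base64 audio payload; returns an empty array for non-audio parts
    // or for audio parts that carry no data.
    static byte[] audioBytes(Part part) {
        if (!part.type.orElse("").equals("audio")) return new byte[0];
        return part.audio.map(a -> Base64.getDecoder().decode(a)).orElse(new byte[0]);
    }

    public static void main(String[] args) {
        Part audio = new Part(Optional.of("audio"), Optional.empty(),
                Optional.of("Hello there"), Optional.of("AAEC"));
        System.out.println(displayText(audio).orElse("(none)"));
        System.out.println(audioBytes(audio).length);
    }
}
```

Since every field on the part is optional, the helpers return an empty value rather than throwing when a field is absent, which may suit parts delivered for interrupted or cancelled responses.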
-
Response Create Event
-
class ResponseCreateEvent:This event instructs the server to create a Response, which means triggering model inference. When in Server VAD mode, the server will create Responses automatically.
A Response will include at least one Item, and may have two, in which case the second will be a function call. These Items will be appended to the conversation history by default.
The server will respond with a response.created event, events for Items and content created, and finally a response.done event to indicate the Response is complete.
The response.create event includes inference configuration like instructions and tools. If these are set, they will override the Session's configuration for this Response only.
Responses can be created out-of-band of the default Conversation, meaning that they can have arbitrary input, and it's possible to disable writing the output to the Conversation. Only one Response can write to the default Conversation at a time, but otherwise multiple Responses can be created in parallel. The metadata field is a good way to disambiguate multiple simultaneous Responses.
Clients can set conversation to none to create a Response that does not write to the default Conversation. Arbitrary input can be provided with the input field, which is an array accepting raw Items and references to existing Items.-
JsonValue; type "response.create"constantThe event type, must be
response.create.RESPONSE_CREATE("response.create")
-
Optional<String> eventIdOptional client-generated ID used to identify this event.
-
Optional<RealtimeResponseCreateParams> responseCreate a new Realtime response with these parameters
-
Optional<RealtimeResponseCreateAudioOutput> audioConfiguration for audio input and output.
-
Optional<Output> output-
Optional<RealtimeAudioFormats> formatThe format of the output audio.
-
AudioPcm-
Optional<Rate> rateThe sample rate of the audio. Always
24000._24000(24000)
-
Optional<Type> typeThe audio format. Always
audio/pcm.AUDIO_PCM("audio/pcm")
-
-
AudioPcmu-
Optional<Type> typeThe audio format. Always
audio/pcmu.AUDIO_PCMU("audio/pcmu")
-
-
AudioPcma-
Optional<Type> typeThe audio format. Always
audio/pcma.AUDIO_PCMA("audio/pcma")
-
-
-
Optional<Voice> voiceThe voice the model uses to respond. Supported built-in voices are
alloy, ash, ballad, coral, echo, sage, shimmer, verse, marin, and cedar. You may also provide a custom voice object with an id, for example { "id": "voice_1234" }. Voice cannot be changed during the session once the model has responded with audio at least once. We recommend marin and cedar for best quality.-
String -
enum UnionMember1:-
ALLOY("alloy") -
ASH("ash") -
BALLAD("ballad") -
CORAL("coral") -
ECHO("echo") -
SAGE("sage") -
SHIMMER("shimmer") -
VERSE("verse") -
MARIN("marin") -
CEDAR("cedar")
-
-
class Id:Custom voice reference.
-
String idThe custom voice ID, e.g.
voice_1234.
-
-
-
-
-
Optional<Conversation> conversationControls which conversation the response is added to. Currently supports
auto and none, with auto as the default value. The auto value means that the contents of the response will be added to the default conversation. Set this to none to create an out-of-band response which will not add items to the default conversation.-
AUTO("auto") -
NONE("none")
-
-
Optional<List<ConversationItem>> inputInput items to include in the prompt for the model. Using this field creates a new context for this Response instead of using the default conversation. An empty array
[] will clear the context for this Response. Note that this can include references to items that previously appeared in the session using their id.-
class RealtimeConversationItemSystemMessage:A system message in a Realtime conversation can be used to provide additional context or instructions to the model. This is similar but distinct from the instruction prompt provided at the start of a conversation, as system messages can be added at any point in the conversation. For major changes to the conversation's behavior, use instructions, but for smaller updates (e.g. "the user is now asking about a different topic"), use system messages.
-
List<Content> contentThe content of the message.
-
Optional<String> textThe text content.
-
Optional<Type> typeThe content type. Always
input_textfor system messages.INPUT_TEXT("input_text")
-
-
JsonValue; role "system"constantThe role of the message sender. Always
system.SYSTEM("system")
-
JsonValue; type "message"constantThe type of the item. Always
message.MESSAGE("message")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeConversationItemUserMessage:A user message item in a Realtime conversation.
-
List<Content> contentThe content of the message.
-
Optional<String> audioBase64-encoded audio bytes (for
input_audio). These will be parsed as the format specified in the session input audio type configuration. This defaults to PCM 16-bit 24kHz mono if not specified. -
Optional<Detail> detailThe detail level of the image (for
input_image). auto will default to high.-
AUTO("auto") -
LOW("low") -
HIGH("high")
-
-
Optional<String> imageUrlBase64-encoded image bytes (for
input_image) as a data URI. For example data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA.... Supported formats are PNG and JPEG. -
Optional<String> textThe text content (for
input_text). -
Optional<String> transcriptTranscript of the audio (for
input_audio). This is not sent to the model, but will be attached to the message item for reference. -
Optional<Type> typeThe content type (
input_text, input_audio, or input_image).-
INPUT_TEXT("input_text") -
INPUT_AUDIO("input_audio") -
INPUT_IMAGE("input_image")
-
-
-
JsonValue; role "user"constantThe role of the message sender. Always
user.USER("user")
-
JsonValue; type "message"constantThe type of the item. Always
message.MESSAGE("message")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeConversationItemAssistantMessage:An assistant message item in a Realtime conversation.
-
List<Content> contentThe content of the message.
-
Optional<String> audioBase64-encoded audio bytes, these will be parsed as the format specified in the session output audio type configuration. This defaults to PCM 16-bit 24kHz mono if not specified.
-
Optional<String> textThe text content.
-
Optional<String> transcriptThe transcript of the audio content, this will always be present if the output type is
audio. -
Optional<Type> typeThe content type,
output_text or output_audio depending on the session output_modalities configuration.-
OUTPUT_TEXT("output_text") -
OUTPUT_AUDIO("output_audio")
-
-
-
JsonValue; role "assistant"constantThe role of the message sender. Always
assistant.ASSISTANT("assistant")
-
JsonValue; type "message"constantThe type of the item. Always
message.MESSAGE("message")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeConversationItemFunctionCall:A function call item in a Realtime conversation.
-
String argumentsThe arguments of the function call. This is a JSON-encoded string representing the arguments passed to the function, for example
{"arg1": "value1", "arg2": 42}. -
String nameThe name of the function being called.
-
JsonValue; type "function_call"constantThe type of the item. Always
function_call.FUNCTION_CALL("function_call")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<String> callIdThe ID of the function call.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeConversationItemFunctionCallOutput:A function call output item in a Realtime conversation.
-
String callIdThe ID of the function call this output is for.
-
String outputThe output of the function call, this is free text and can contain any information or simply be empty.
-
JsonValue; type "function_call_output"constantThe type of the item. Always
function_call_output.FUNCTION_CALL_OUTPUT("function_call_output")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeMcpApprovalResponse:A Realtime item responding to an MCP approval request.
-
String idThe unique ID of the approval response.
-
String approvalRequestIdThe ID of the approval request being answered.
-
boolean approveWhether the request was approved.
-
JsonValue; type "mcp_approval_response"constantThe type of the item. Always
mcp_approval_response.MCP_APPROVAL_RESPONSE("mcp_approval_response")
-
Optional<String> reasonOptional reason for the decision.
-
-
class RealtimeMcpListTools:A Realtime item listing tools available on an MCP server.
-
String serverLabelThe label of the MCP server.
-
List<Tool> toolsThe tools available on the server.
-
JsonValue inputSchemaThe JSON schema describing the tool's input.
-
String nameThe name of the tool.
-
Optional<JsonValue> annotationsAdditional annotations about the tool.
-
Optional<String> descriptionThe description of the tool.
-
-
JsonValue; type "mcp_list_tools"constantThe type of the item. Always
mcp_list_tools.MCP_LIST_TOOLS("mcp_list_tools")
-
Optional<String> idThe unique ID of the list.
-
-
class RealtimeMcpToolCall:A Realtime item representing an invocation of a tool on an MCP server.
-
String idThe unique ID of the tool call.
-
String argumentsA JSON string of the arguments passed to the tool.
-
String nameThe name of the tool that was run.
-
String serverLabelThe label of the MCP server running the tool.
-
JsonValue; type "mcp_call"constantThe type of the item. Always
mcp_call.MCP_CALL("mcp_call")
-
Optional<String> approvalRequestIdThe ID of an associated approval request, if any.
-
Optional<Error> errorThe error from the tool call, if any.
-
class RealtimeMcpProtocolError:-
long code -
String message -
JsonValue; type "protocol_error"constantPROTOCOL_ERROR("protocol_error")
-
-
class RealtimeMcpToolExecutionError:-
String message -
JsonValue; type "tool_execution_error"constantTOOL_EXECUTION_ERROR("tool_execution_error")
-
-
class RealtimeMcphttpError:-
long code -
String message -
JsonValue; type "http_error"constantHTTP_ERROR("http_error")
-
-
-
Optional<String> outputThe output from the tool call.
-
-
class RealtimeMcpApprovalRequest:A Realtime item requesting human approval of a tool invocation.
-
String idThe unique ID of the approval request.
-
String argumentsA JSON string of arguments for the tool.
-
String nameThe name of the tool to run.
-
String serverLabelThe label of the MCP server making the request.
-
JsonValue; type "mcp_approval_request"constantThe type of the item. Always
mcp_approval_request.MCP_APPROVAL_REQUEST("mcp_approval_request")
-
-
-
Optional<String> instructionsThe default system instructions (i.e. system message) prepended to model calls. This field allows the client to guide the model on desired responses. The model can be instructed on response content and format (e.g. "be extremely succinct", "act friendly", "here are examples of good responses") and on audio behavior (e.g. "talk quickly", "inject emotion into your voice", "laugh frequently"). The instructions are not guaranteed to be followed by the model, but they provide guidance to the model on the desired behavior. Note that the server sets default instructions which will be used if this field is not set and are visible in the
session.created event at the start of the session. -
Optional<MaxOutputTokens> maxOutputTokensMaximum number of output tokens for a single assistant response, inclusive of tool calls. Provide an integer between 1 and 4096 to limit output tokens, or
inf for the maximum available tokens for a given model. Defaults to inf.-
long -
JsonValue;INF("inf")
-
-
Optional<Metadata> metadataSet of 16 key-value pairs that can be attached to an object. This can be useful for storing additional information about the object in a structured format, and querying for objects via API or the dashboard.
Keys are strings with a maximum length of 64 characters. Values are strings with a maximum length of 512 characters.
-
Optional<List<OutputModality>> outputModalitiesThe set of modalities the model used to respond. Currently the only possible values are
["audio"] and ["text"]. Audio output always includes a text transcript. Setting the output mode to text will disable audio output from the model.-
TEXT("text") -
AUDIO("audio")
-
-
Optional<Boolean> parallelToolCallsWhether the model may call multiple tools in parallel. Only supported by reasoning Realtime models such as
gpt-realtime-2. -
Optional<ResponsePrompt> promptReference to a prompt template and its variables. Learn more.
-
String idThe unique identifier of the prompt template to use.
-
Optional<Variables> variablesOptional map of values to substitute in for variables in your prompt. The substitution values can either be strings, or other Response input types like images or files.
-
String -
class ResponseInputText:A text input to the model.
-
String textThe text input to the model.
-
JsonValue; type "input_text"constantThe type of the input item. Always
input_text.INPUT_TEXT("input_text")
-
-
class ResponseInputImage:An image input to the model. Learn about image inputs.
-
Detail detailThe detail level of the image to be sent to the model. One of
high, low, auto, or original. Defaults to auto.-
LOW("low") -
HIGH("high") -
AUTO("auto") -
ORIGINAL("original")
-
-
JsonValue; type "input_image"constantThe type of the input item. Always
input_image.INPUT_IMAGE("input_image")
-
Optional<String> fileIdThe ID of the file to be sent to the model.
-
Optional<String> imageUrlThe URL of the image to be sent to the model. A fully qualified URL or base64 encoded image in a data URL.
-
-
class ResponseInputFile:A file input to the model.
-
JsonValue; type "input_file"constantThe type of the input item. Always
input_file.INPUT_FILE("input_file")
-
Optional<Detail> detailThe detail level of the file to be sent to the model. Use
low for the default rendering behavior, or high to render the file at higher quality. Defaults to low.-
LOW("low") -
HIGH("high")
-
-
Optional<String> fileDataThe content of the file to be sent to the model.
-
Optional<String> fileIdThe ID of the file to be sent to the model.
-
Optional<String> fileUrlThe URL of the file to be sent to the model.
-
Optional<String> filenameThe name of the file to be sent to the model.
-
-
-
Optional<String> versionOptional version of the prompt template.
-
-
Optional<RealtimeReasoning> reasoningConfiguration for reasoning-capable Realtime models such as
gpt-realtime-2.-
Optional<RealtimeReasoningEffort> effortConstrains effort on reasoning for reasoning-capable Realtime models such as
gpt-realtime-2.-
MINIMAL("minimal") -
LOW("low") -
MEDIUM("medium") -
HIGH("high") -
XHIGH("xhigh")
-
-
-
Optional<ToolChoice> toolChoiceHow the model chooses tools. Provide one of the string modes or force a specific function/MCP tool.
-
enum ToolChoiceOptions:Controls which (if any) tool is called by the model.
none means the model will not call any tool and instead generates a message. auto means the model can pick between generating a message or calling one or more tools. required means the model must call one or more tools.-
NONE("none") -
AUTO("auto") -
REQUIRED("required")
-
-
class ToolChoiceFunction:Use this option to force the model to call a specific function.
-
String nameThe name of the function to call.
-
JsonValue; type "function"constantFor function calling, the type is always
function.FUNCTION("function")
-
-
class ToolChoiceMcp:Use this option to force the model to call a specific tool on a remote MCP server.
-
String serverLabelThe label of the MCP server to use.
-
JsonValue; type "mcp"constantFor MCP tools, the type is always
mcp.MCP("mcp")
-
Optional<String> nameThe name of the tool to call on the server.
-
-
-
Optional<List<Tool>> toolsTools available to the model.
-
class RealtimeFunctionTool:-
Optional<String> descriptionThe description of the function, including guidance on when and how to call it, and guidance about what to tell the user when calling (if anything).
-
Optional<String> nameThe name of the function.
-
Optional<JsonValue> parametersParameters of the function in JSON Schema.
-
Optional<Type> typeThe type of the tool, i.e.
function.FUNCTION("function")
-
-
class RealtimeResponseCreateMcpTool:Give the model access to additional tools via remote Model Context Protocol (MCP) servers. Learn more about MCP.
-
String serverLabelA label for this MCP server, used to identify it in tool calls.
-
JsonValue; type "mcp"constantThe type of the MCP tool. Always
mcp.MCP("mcp")
-
Optional<AllowedTools> allowedToolsList of allowed tool names or a filter object.
-
List<String> -
class McpToolFilter:A filter object to specify which tools are allowed.
-
Optional<Boolean> readOnlyIndicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with
readOnlyHint, it will match this filter. -
Optional<List<String>> toolNamesList of allowed tool names.
-
-
-
Optional<String> authorizationAn OAuth access token that can be used with a remote MCP server, either with a custom MCP server URL or a service connector. Your application must handle the OAuth authorization flow and provide the token here.
-
Optional<ConnectorId> connectorIdIdentifier for service connectors, like those available in ChatGPT. One of
server_url or connector_id must be provided. Learn more about service connectors here.
Currently supported connector_id values are:-
Dropbox: connector_dropbox -
Gmail: connector_gmail -
Google Calendar: connector_googlecalendar -
Google Drive: connector_googledrive -
Microsoft Teams: connector_microsoftteams -
Outlook Calendar: connector_outlookcalendar -
Outlook Email: connector_outlookemail -
SharePoint: connector_sharepoint -
CONNECTOR_DROPBOX("connector_dropbox") -
CONNECTOR_GMAIL("connector_gmail") -
CONNECTOR_GOOGLECALENDAR("connector_googlecalendar") -
CONNECTOR_GOOGLEDRIVE("connector_googledrive") -
CONNECTOR_MICROSOFTTEAMS("connector_microsoftteams") -
CONNECTOR_OUTLOOKCALENDAR("connector_outlookcalendar") -
CONNECTOR_OUTLOOKEMAIL("connector_outlookemail") -
CONNECTOR_SHAREPOINT("connector_sharepoint")
-
-
Optional<Boolean> deferLoadingWhether this MCP tool is deferred and discovered via tool search.
-
Optional<Headers> headersOptional HTTP headers to send to the MCP server. Use for authentication or other purposes.
-
Optional<RequireApproval> requireApprovalSpecify which of the MCP server's tools require approval.
-
class McpToolApprovalFilter:Specify which of the MCP server's tools require approval. Can be
always, never, or a filter object associated with tools that require approval.-
Optional<Always> alwaysA filter object to specify which tools are allowed.
-
Optional<Boolean> readOnlyIndicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with
readOnlyHint, it will match this filter. -
Optional<List<String>> toolNamesList of allowed tool names.
-
-
Optional<Never> neverA filter object to specify which tools are allowed.
-
Optional<Boolean> readOnlyIndicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with
readOnlyHint, it will match this filter. -
Optional<List<String>> toolNamesList of allowed tool names.
-
-
-
enum McpToolApprovalSetting:Specify a single approval policy for all tools. One of
always or never. When set to always, all tools will require approval. When set to never, no tools will require approval.-
ALWAYS("always") -
NEVER("never")
-
-
-
Optional<String> serverDescriptionOptional description of the MCP server, used to provide more context.
-
Optional<String> serverUrlThe URL for the MCP server. One of
server_url or connector_id must be provided.
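To make the out-of-band flow described for ResponseCreateEvent more concrete, here is a minimal sketch that assembles a response.create event with conversation set to none, a metadata tag for telling parallel Responses apart, and the documented limits on metadata and output tokens checked up front. OutOfBandResponseSketch and all of its methods are hypothetical helpers written against the standard library only, not SDK calls, and the wire names output_modalities and max_output_tokens are assumed to be the snake_case forms of the properties listed above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OutOfBandResponseSketch {

    // Minimal JSON string escaping: handles quotes and backslashes only,
    // so control characters are out of scope for this sketch.
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Enforce the documented metadata limits: 16 pairs, 64-character keys,
    // 512-character values. length() counts UTF-16 units, used here as an approximation.
    static void checkMetadata(Map<String, String> metadata) {
        if (metadata.size() > 16) throw new IllegalArgumentException("more than 16 metadata pairs");
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            if (e.getKey().length() > 64) throw new IllegalArgumentException("key longer than 64 characters");
            if (e.getValue().length() > 512) throw new IllegalArgumentException("value longer than 512 characters");
        }
    }

    // Render the token limit: null stands for "inf", otherwise an integer in 1..4096.
    static String maxOutputTokens(Integer limit) {
        if (limit == null) return "\"inf\"";
        if (limit < 1 || limit > 4096) throw new IllegalArgumentException("limit must be between 1 and 4096");
        return limit.toString();
    }

    // Build a text-only Response that does not write to the default Conversation.
    static String build(String instructions, Map<String, String> metadata, Integer limit) {
        checkMetadata(metadata);
        StringBuilder meta = new StringBuilder("{");
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            if (meta.length() > 1) meta.append(',');
            meta.append(quote(e.getKey())).append(':').append(quote(e.getValue()));
        }
        meta.append('}');
        return "{\"type\":\"response.create\",\"response\":{"
                + "\"conversation\":\"none\","
                + "\"output_modalities\":[\"text\"],"
                + "\"instructions\":" + quote(instructions) + ","
                + "\"metadata\":" + meta + ","
                + "\"max_output_tokens\":" + maxOutputTokens(limit)
                + "}}";
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("purpose", "topic_classification");
        System.out.println(build("Classify the topic of the conversation.", metadata, 256));
    }
}
```

Validating on the client before sending is a design choice of this sketch; the server remains the authority on what it accepts.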
-
-
-
-
Response Created Event
-
class ResponseCreatedEvent:Returned when a new Response is created. The first event of response creation, where the response is in an initial state of
in_progress.-
String eventIdThe unique ID of the server event.
-
RealtimeResponse responseThe response resource.
-
Optional<String> idThe unique ID of the response, will look like
resp_1234. -
Optional<Audio> audioConfiguration for audio output.
-
Optional<Output> output-
Optional<RealtimeAudioFormats> formatThe format of the output audio.
-
AudioPcm-
Optional<Rate> rateThe sample rate of the audio. Always
24000._24000(24000)
-
Optional<Type> typeThe audio format. Always
audio/pcm.AUDIO_PCM("audio/pcm")
-
-
AudioPcmu-
Optional<Type> typeThe audio format. Always
audio/pcmu.AUDIO_PCMU("audio/pcmu")
-
-
AudioPcma-
Optional<Type> typeThe audio format. Always
audio/pcma.AUDIO_PCMA("audio/pcma")
-
-
-
Optional<Voice> voiceThe voice the model uses to respond. Voice cannot be changed during the session once the model has responded with audio at least once. Current voice options are
alloy, ash, ballad, coral, echo, sage, shimmer, verse, marin, and cedar. We recommend marin and cedar for best quality.-
ALLOY("alloy") -
ASH("ash") -
BALLAD("ballad") -
CORAL("coral") -
ECHO("echo") -
SAGE("sage") -
SHIMMER("shimmer") -
VERSE("verse") -
MARIN("marin") -
CEDAR("cedar")
-
-
-
-
Optional<String> conversationIdWhich conversation the response is added to, determined by the
conversation field in the response.create event. If auto, the response will be added to the default conversation and the value of conversation_id will be an id like conv_1234. If none, the response will not be added to any conversation and the value of conversation_id will be null. If responses are being triggered automatically by VAD, the response will be added to the default conversation. -
Optional<MaxOutputTokens> maxOutputTokensMaximum number of output tokens for a single assistant response, inclusive of tool calls, that was used in this response.
-
long -
JsonValue;INF("inf")
-
-
Optional<Metadata> metadataSet of 16 key-value pairs that can be attached to an object. This can be useful for storing additional information about the object in a structured format, and querying for objects via API or the dashboard.
Keys are strings with a maximum length of 64 characters. Values are strings with a maximum length of 512 characters.
-
Optional<Object> object_The object type, must be
realtime.response.REALTIME_RESPONSE("realtime.response")
-
Optional<List<ConversationItem>> outputThe list of output items generated by the response.
-
class RealtimeConversationItemSystemMessage:A system message in a Realtime conversation can be used to provide additional context or instructions to the model. This is similar but distinct from the instruction prompt provided at the start of a conversation, as system messages can be added at any point in the conversation. For major changes to the conversation's behavior, use instructions, but for smaller updates (e.g. "the user is now asking about a different topic"), use system messages.
-
List<Content> contentThe content of the message.
-
Optional<String> textThe text content.
-
Optional<Type> typeThe content type. Always
input_textfor system messages.INPUT_TEXT("input_text")
-
-
JsonValue; role "system"constantThe role of the message sender. Always
system.SYSTEM("system")
-
JsonValue; type "message"constantThe type of the item. Always
message.MESSAGE("message")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeConversationItemUserMessage:A user message item in a Realtime conversation.
-
List<Content> contentThe content of the message.
-
Optional<String> audioBase64-encoded audio bytes (for
input_audio). These will be parsed as the format specified in the session input audio type configuration. This defaults to PCM 16-bit 24kHz mono if not specified. -
Optional<Detail> detailThe detail level of the image (for
input_image). auto will default to high.-
AUTO("auto") -
LOW("low") -
HIGH("high")
-
-
Optional<String> imageUrlBase64-encoded image bytes (for
input_image) as a data URI. For example data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA.... Supported formats are PNG and JPEG. -
Optional<String> textThe text content (for
input_text). -
Optional<String> transcriptTranscript of the audio (for
input_audio). This is not sent to the model, but will be attached to the message item for reference. -
Optional<Type> typeThe content type (
input_text, input_audio, or input_image).-
INPUT_TEXT("input_text") -
INPUT_AUDIO("input_audio") -
INPUT_IMAGE("input_image")
-
-
-
JsonValue; role "user"constantThe role of the message sender. Always
user.USER("user")
-
JsonValue; type "message"constantThe type of the item. Always
message.MESSAGE("message")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeConversationItemAssistantMessage:An assistant message item in a Realtime conversation.
-
List<Content> contentThe content of the message.
-
Optional<String> audioBase64-encoded audio bytes, these will be parsed as the format specified in the session output audio type configuration. This defaults to PCM 16-bit 24kHz mono if not specified.
-
Optional<String> textThe text content.
-
Optional<String> transcriptThe transcript of the audio content, this will always be present if the output type is
audio. -
Optional<Type> typeThe content type,
output_text or output_audio depending on the session output_modalities configuration.-
OUTPUT_TEXT("output_text") -
OUTPUT_AUDIO("output_audio")
-
-
-
JsonValue; role "assistant"constantThe role of the message sender. Always
assistant.ASSISTANT("assistant")
-
JsonValue; type "message"constantThe type of the item. Always
message.MESSAGE("message")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeConversationItemFunctionCall:A function call item in a Realtime conversation.
-
String argumentsThe arguments of the function call. This is a JSON-encoded string representing the arguments passed to the function, for example
{"arg1": "value1", "arg2": 42}. -
String nameThe name of the function being called.
-
JsonValue; type "function_call"constantThe type of the item. Always
function_call.FUNCTION_CALL("function_call")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<String> callIdThe ID of the function call.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeConversationItemFunctionCallOutput:A function call output item in a Realtime conversation.
-
String callIdThe ID of the function call this output is for.
-
String outputThe output of the function call, this is free text and can contain any information or simply be empty.
-
JsonValue; type "function_call_output"constantThe type of the item. Always
function_call_output.FUNCTION_CALL_OUTPUT("function_call_output")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeMcpApprovalResponse:A Realtime item responding to an MCP approval request.
-
String idThe unique ID of the approval response.
-
String approvalRequestIdThe ID of the approval request being answered.
-
boolean approveWhether the request was approved.
-
JsonValue; type "mcp_approval_response"constantThe type of the item. Always
mcp_approval_response.MCP_APPROVAL_RESPONSE("mcp_approval_response")
-
Optional<String> reasonOptional reason for the decision.
-
-
class RealtimeMcpListTools:A Realtime item listing tools available on an MCP server.
-
String serverLabelThe label of the MCP server.
-
List<Tool> toolsThe tools available on the server.
-
JsonValue inputSchemaThe JSON schema describing the tool's input.
-
String nameThe name of the tool.
-
Optional<JsonValue> annotationsAdditional annotations about the tool.
-
Optional<String> descriptionThe description of the tool.
-
-
JsonValue; type "mcp_list_tools"constantThe type of the item. Always
mcp_list_tools.MCP_LIST_TOOLS("mcp_list_tools")
-
Optional<String> idThe unique ID of the list.
-
-
class RealtimeMcpToolCall:A Realtime item representing an invocation of a tool on an MCP server.
-
String idThe unique ID of the tool call.
-
String argumentsA JSON string of the arguments passed to the tool.
-
String nameThe name of the tool that was run.
-
String serverLabelThe label of the MCP server running the tool.
-
JsonValue; type "mcp_call"constantThe type of the item. Always
mcp_call.MCP_CALL("mcp_call")
-
Optional<String> approvalRequestIdThe ID of an associated approval request, if any.
-
Optional<Error> errorThe error from the tool call, if any.
-
class RealtimeMcpProtocolError:-
long code -
String message -
JsonValue; type "protocol_error"constantPROTOCOL_ERROR("protocol_error")
-
-
class RealtimeMcpToolExecutionError:-
String message -
JsonValue; type "tool_execution_error"constantTOOL_EXECUTION_ERROR("tool_execution_error")
-
-
class RealtimeMcphttpError:-
long code -
String message -
JsonValue; type "http_error"constantHTTP_ERROR("http_error")
-
-
-
Optional<String> outputThe output from the tool call.
-
-
class RealtimeMcpApprovalRequest:A Realtime item requesting human approval of a tool invocation.
-
String idThe unique ID of the approval request.
-
String argumentsA JSON string of arguments for the tool.
-
String nameThe name of the tool to run.
-
String serverLabelThe label of the MCP server making the request.
-
JsonValue; type "mcp_approval_request"constantThe type of the item. Always
mcp_approval_request.MCP_APPROVAL_REQUEST("mcp_approval_request")
-
-
-
Optional<List<OutputModality>> outputModalitiesThe set of modalities the model used to respond. Currently the only possible values are ["audio"] and ["text"]. Audio output always includes a text transcript. Setting the output mode to text will disable audio output from the model. -
TEXT("text") -
AUDIO("audio")
-
-
Optional<Status> statusThe final status of the response (completed, cancelled, failed, incomplete, or in_progress). -
COMPLETED("completed") -
CANCELLED("cancelled") -
FAILED("failed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
Optional<RealtimeResponseStatus> statusDetailsAdditional details about the status.
-
Optional<Error> errorA description of the error that caused the response to fail, populated when the status is failed. -
Optional<String> codeError code, if any.
-
Optional<String> typeThe type of error.
-
-
Optional<Reason> reasonThe reason the Response did not complete. For a cancelled Response, one of turn_detected (the server VAD detected a new start of speech) or client_cancelled (the client sent a cancel event). For an incomplete Response, one of max_output_tokens or content_filter (the server-side safety filter activated and cut off the response). -
TURN_DETECTED("turn_detected") -
CLIENT_CANCELLED("client_cancelled") -
MAX_OUTPUT_TOKENS("max_output_tokens") -
CONTENT_FILTER("content_filter")
-
-
Optional<Type> typeThe type of error that caused the response to fail, corresponding with the status field (completed, cancelled, incomplete, failed). -
COMPLETED("completed") -
CANCELLED("cancelled") -
INCOMPLETE("incomplete") -
FAILED("failed")
-
-
-
Optional<RealtimeResponseUsage> usageUsage statistics for the Response; these correspond to billing. A Realtime API session maintains a conversation context and appends new Items to the Conversation, so output from previous turns (text and audio tokens) becomes the input for later turns.
-
Optional<RealtimeResponseUsageInputTokenDetails> inputTokenDetailsDetails about the input tokens used in the Response. Cached tokens are tokens from previous turns in the conversation that are included as context for the current response. Cached tokens here are counted as a subset of input tokens, meaning input tokens will include cached and uncached tokens.
-
Optional<Long> audioTokensThe number of audio tokens used as input for the Response.
-
Optional<Long> cachedTokensThe number of cached tokens used as input for the Response.
-
Optional<CachedTokensDetails> cachedTokensDetailsDetails about the cached tokens used as input for the Response.
-
Optional<Long> audioTokensThe number of cached audio tokens used as input for the Response.
-
Optional<Long> imageTokensThe number of cached image tokens used as input for the Response.
-
Optional<Long> textTokensThe number of cached text tokens used as input for the Response.
-
-
Optional<Long> imageTokensThe number of image tokens used as input for the Response.
-
Optional<Long> textTokensThe number of text tokens used as input for the Response.
-
-
Optional<Long> inputTokensThe number of input tokens used in the Response, including text and audio tokens.
-
Optional<RealtimeResponseUsageOutputTokenDetails> outputTokenDetailsDetails about the output tokens used in the Response.
-
Optional<Long> audioTokensThe number of audio tokens used in the Response.
-
Optional<Long> textTokensThe number of text tokens used in the Response.
-
-
Optional<Long> outputTokensThe number of output tokens sent in the Response, including text and audio tokens.
-
Optional<Long> totalTokensThe total number of tokens in the Response including input and output text and audio tokens.
-
-
-
JsonValue; type "response.created"constantThe event type, must be
response.created.RESPONSE_CREATED("response.created")
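The relationship between inputTokens and cachedTokens in the usage fields above can be illustrated with a short sketch. It relies only on the documented statement that cached tokens are counted as a subset of input tokens. The class UsageMath and its method are hypothetical helpers, not part of the SDK, and they take the Optional<Long> values directly rather than an SDK usage object. Treating an absent value as 0 is an assumption of this sketch.

```java
import java.util.Optional;

// Hypothetical helper for illustration; not part of the SDK.
final class UsageMath {
    private UsageMath() {}

    // Cached tokens are a subset of input tokens, so the uncached portion
    // is the difference. Absent values are treated as 0 (sketch assumption),
    // and the result is clamped so it never goes negative.
    static long uncachedInputTokens(Optional<Long> inputTokens, Optional<Long> cachedTokens) {
        long input = inputTokens.orElse(0L);
        long cached = cachedTokens.orElse(0L);
        return Math.max(0L, input - cached);
    }

    public static void main(String[] args) {
        long uncached = uncachedInputTokens(Optional.of(120L), Optional.of(80L));
        System.out.println("uncached input tokens: " + uncached);
    }
}
```

In real code the two values would come from the usage object's inputTokens and inputTokenDetails.cachedTokens fields listed above.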
-
Response Done Event
-
class ResponseDoneEvent:Returned when a Response is done streaming. Always emitted, no matter the final state. The Response object included in the response.done event will include all output Items that were generated during the Response but will omit the raw audio data. Clients should check the status field of the Response to determine if it was successful (completed) or if there was another outcome: cancelled, failed, or incomplete.
-
String eventIdThe unique ID of the server event.
-
RealtimeResponse responseThe response resource.
-
Optional<String> idThe unique ID of the response; it will look like resp_1234. -
Optional<Audio> audioConfiguration for audio output.
-
Optional<Output> output-
Optional<RealtimeAudioFormats> formatThe format of the output audio.
-
AudioPcm-
Optional<Rate> rateThe sample rate of the audio. Always
24000._24000(24000)
-
Optional<Type> typeThe audio format. Always
audio/pcm.AUDIO_PCM("audio/pcm")
-
-
AudioPcmu-
Optional<Type> typeThe audio format. Always
audio/pcmu.AUDIO_PCMU("audio/pcmu")
-
-
AudioPcma-
Optional<Type> typeThe audio format. Always
audio/pcma.AUDIO_PCMA("audio/pcma")
-
-
-
Optional<Voice> voiceThe voice the model uses to respond. Voice cannot be changed during the session once the model has responded with audio at least once. Current voice options are alloy, ash, ballad, coral, echo, sage, shimmer, verse, marin, and cedar. We recommend marin and cedar for best quality. -
ALLOY("alloy") -
ASH("ash") -
BALLAD("ballad") -
CORAL("coral") -
ECHO("echo") -
SAGE("sage") -
SHIMMER("shimmer") -
VERSE("verse") -
MARIN("marin") -
CEDAR("cedar")
-
-
-
-
Optional<String> conversationIdWhich conversation the response is added to, determined by the conversation field in the response.create event. If auto, the response will be added to the default conversation and the value of conversation_id will be an id like conv_1234. If none, the response will not be added to any conversation and the value of conversation_id will be null. If responses are being triggered automatically by VAD, the response will be added to the default conversation. -
Optional<MaxOutputTokens> maxOutputTokensMaximum number of output tokens for a single assistant response, inclusive of tool calls, that was used in this response.
-
long -
JsonValue;INF("inf")
-
-
Optional<Metadata> metadataSet of up to 16 key-value pairs that can be attached to an object. This can be useful for storing additional information about the object in a structured format, and querying for objects via API or the dashboard.
Keys are strings with a maximum length of 64 characters. Values are strings with a maximum length of 512 characters.
-
Optional<Object> object_The object type, must be
realtime.response.REALTIME_RESPONSE("realtime.response")
-
Optional<List<ConversationItem>> outputThe list of output items generated by the response.
-
class RealtimeConversationItemSystemMessage:A system message in a Realtime conversation can be used to provide additional context or instructions to the model. This is similar to but distinct from the instruction prompt provided at the start of a conversation, as system messages can be added at any point in the conversation. For major changes to the conversation's behavior, use instructions, but for smaller updates (e.g. "the user is now asking about a different topic"), use system messages.
-
List<Content> contentThe content of the message.
-
Optional<String> textThe text content.
-
Optional<Type> typeThe content type. Always
input_textfor system messages.INPUT_TEXT("input_text")
-
-
JsonValue; role "system"constantThe role of the message sender. Always
system.SYSTEM("system")
-
JsonValue; type "message"constantThe type of the item. Always
message.MESSAGE("message")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeConversationItemUserMessage:A user message item in a Realtime conversation.
-
List<Content> contentThe content of the message.
-
Optional<String> audioBase64-encoded audio bytes (for input_audio). These will be parsed as the format specified in the session input audio type configuration. This defaults to PCM 16-bit 24kHz mono if not specified. -
Optional<Detail> detailThe detail level of the image (for input_image). auto will default to high. -
AUTO("auto") -
LOW("low") -
HIGH("high")
-
-
Optional<String> imageUrlBase64-encoded image bytes (for input_image) as a data URI. For example data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA.... Supported formats are PNG and JPEG. -
Optional<String> textThe text content (for
input_text). -
Optional<String> transcriptTranscript of the audio (for
input_audio). This is not sent to the model, but will be attached to the message item for reference. -
Optional<Type> typeThe content type (input_text, input_audio, or input_image). -
INPUT_TEXT("input_text") -
INPUT_AUDIO("input_audio") -
INPUT_IMAGE("input_image")
-
-
-
JsonValue; role "user"constantThe role of the message sender. Always
user.USER("user")
-
JsonValue; type "message"constantThe type of the item. Always
message.MESSAGE("message")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeConversationItemAssistantMessage:An assistant message item in a Realtime conversation.
-
List<Content> contentThe content of the message.
-
Optional<String> audioBase64-encoded audio bytes. These will be parsed as the format specified in the session output audio type configuration. This defaults to PCM 16-bit 24kHz mono if not specified.
-
Optional<String> textThe text content.
-
Optional<String> transcriptThe transcript of the audio content. This will always be present if the output type is audio. -
Optional<Type> typeThe content type, output_text or output_audio depending on the session output_modalities configuration. -
OUTPUT_TEXT("output_text") -
OUTPUT_AUDIO("output_audio")
-
-
-
JsonValue; role "assistant"constantThe role of the message sender. Always
assistant.ASSISTANT("assistant")
-
JsonValue; type "message"constantThe type of the item. Always
message.MESSAGE("message")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeConversationItemFunctionCall:A function call item in a Realtime conversation.
-
String argumentsThe arguments of the function call. This is a JSON-encoded string representing the arguments passed to the function, for example
{"arg1": "value1", "arg2": 42}. -
String nameThe name of the function being called.
-
JsonValue; type "function_call"constantThe type of the item. Always
function_call.FUNCTION_CALL("function_call")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<String> callIdThe ID of the function call.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeConversationItemFunctionCallOutput:A function call output item in a Realtime conversation.
-
String callIdThe ID of the function call this output is for.
-
String outputThe output of the function call. This is free text and can contain any information or simply be empty.
-
JsonValue; type "function_call_output"constantThe type of the item. Always
function_call_output.FUNCTION_CALL_OUTPUT("function_call_output")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeMcpApprovalResponse:A Realtime item responding to an MCP approval request.
-
String idThe unique ID of the approval response.
-
String approvalRequestIdThe ID of the approval request being answered.
-
boolean approveWhether the request was approved.
-
JsonValue; type "mcp_approval_response"constantThe type of the item. Always
mcp_approval_response.MCP_APPROVAL_RESPONSE("mcp_approval_response")
-
Optional<String> reasonOptional reason for the decision.
-
-
class RealtimeMcpListTools:A Realtime item listing tools available on an MCP server.
-
String serverLabelThe label of the MCP server.
-
List<Tool> toolsThe tools available on the server.
-
JsonValue inputSchemaThe JSON schema describing the tool's input.
-
String nameThe name of the tool.
-
Optional<JsonValue> annotationsAdditional annotations about the tool.
-
Optional<String> descriptionThe description of the tool.
-
-
JsonValue; type "mcp_list_tools"constantThe type of the item. Always
mcp_list_tools.MCP_LIST_TOOLS("mcp_list_tools")
-
Optional<String> idThe unique ID of the list.
-
-
class RealtimeMcpToolCall:A Realtime item representing an invocation of a tool on an MCP server.
-
String idThe unique ID of the tool call.
-
String argumentsA JSON string of the arguments passed to the tool.
-
String nameThe name of the tool that was run.
-
String serverLabelThe label of the MCP server running the tool.
-
JsonValue; type "mcp_call"constantThe type of the item. Always
mcp_call.MCP_CALL("mcp_call")
-
Optional<String> approvalRequestIdThe ID of an associated approval request, if any.
-
Optional<Error> errorThe error from the tool call, if any.
-
class RealtimeMcpProtocolError:-
long code -
String message -
JsonValue; type "protocol_error"constantPROTOCOL_ERROR("protocol_error")
-
-
class RealtimeMcpToolExecutionError:-
String message -
JsonValue; type "tool_execution_error"constantTOOL_EXECUTION_ERROR("tool_execution_error")
-
-
class RealtimeMcphttpError:-
long code -
String message -
JsonValue; type "http_error"constantHTTP_ERROR("http_error")
-
-
-
Optional<String> outputThe output from the tool call.
-
-
class RealtimeMcpApprovalRequest:A Realtime item requesting human approval of a tool invocation.
-
String idThe unique ID of the approval request.
-
String argumentsA JSON string of arguments for the tool.
-
String nameThe name of the tool to run.
-
String serverLabelThe label of the MCP server making the request.
-
JsonValue; type "mcp_approval_request"constantThe type of the item. Always
mcp_approval_request.MCP_APPROVAL_REQUEST("mcp_approval_request")
-
-
-
Optional<List<OutputModality>> outputModalitiesThe set of modalities the model used to respond. Currently the only possible values are ["audio"] and ["text"]. Audio output always includes a text transcript. Setting the output mode to text will disable audio output from the model. -
TEXT("text") -
AUDIO("audio")
-
-
Optional<Status> statusThe final status of the response (completed, cancelled, failed, incomplete, or in_progress). -
COMPLETED("completed") -
CANCELLED("cancelled") -
FAILED("failed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
Optional<RealtimeResponseStatus> statusDetailsAdditional details about the status.
-
Optional<Error> errorA description of the error that caused the response to fail, populated when the status is failed. -
Optional<String> codeError code, if any.
-
Optional<String> typeThe type of error.
-
-
Optional<Reason> reasonThe reason the Response did not complete. For a cancelled Response, one of turn_detected (the server VAD detected a new start of speech) or client_cancelled (the client sent a cancel event). For an incomplete Response, one of max_output_tokens or content_filter (the server-side safety filter activated and cut off the response). -
TURN_DETECTED("turn_detected") -
CLIENT_CANCELLED("client_cancelled") -
MAX_OUTPUT_TOKENS("max_output_tokens") -
CONTENT_FILTER("content_filter")
-
-
Optional<Type> typeThe type of error that caused the response to fail, corresponding with the status field (completed, cancelled, incomplete, failed). -
COMPLETED("completed") -
CANCELLED("cancelled") -
INCOMPLETE("incomplete") -
FAILED("failed")
-
-
-
Optional<RealtimeResponseUsage> usageUsage statistics for the Response; these correspond to billing. A Realtime API session maintains a conversation context and appends new Items to the Conversation, so output from previous turns (text and audio tokens) becomes the input for later turns.
-
Optional<RealtimeResponseUsageInputTokenDetails> inputTokenDetailsDetails about the input tokens used in the Response. Cached tokens are tokens from previous turns in the conversation that are included as context for the current response. Cached tokens here are counted as a subset of input tokens, meaning input tokens will include cached and uncached tokens.
-
Optional<Long> audioTokensThe number of audio tokens used as input for the Response.
-
Optional<Long> cachedTokensThe number of cached tokens used as input for the Response.
-
Optional<CachedTokensDetails> cachedTokensDetailsDetails about the cached tokens used as input for the Response.
-
Optional<Long> audioTokensThe number of cached audio tokens used as input for the Response.
-
Optional<Long> imageTokensThe number of cached image tokens used as input for the Response.
-
Optional<Long> textTokensThe number of cached text tokens used as input for the Response.
-
-
Optional<Long> imageTokensThe number of image tokens used as input for the Response.
-
Optional<Long> textTokensThe number of text tokens used as input for the Response.
-
-
Optional<Long> inputTokensThe number of input tokens used in the Response, including text and audio tokens.
-
Optional<RealtimeResponseUsageOutputTokenDetails> outputTokenDetailsDetails about the output tokens used in the Response.
-
Optional<Long> audioTokensThe number of audio tokens used in the Response.
-
Optional<Long> textTokensThe number of text tokens used in the Response.
-
-
Optional<Long> outputTokensThe number of output tokens sent in the Response, including text and audio tokens.
-
Optional<Long> totalTokensThe total number of tokens in the Response including input and output text and audio tokens.
-
-
-
JsonValue; type "response.done"constantThe event type, must be
response.done.RESPONSE_DONE("response.done")
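The status check that clients are advised to perform on a response.done event can be sketched as follows. The enums below are hypothetical stand-ins that mirror the Status and Reason values listed above; they are not the SDK's own types, and the summary strings are illustrative only.

```java
import java.util.Locale;
import java.util.Optional;

// Hypothetical stand-ins mirroring the documented enum values; not SDK types.
final class ResponseOutcome {
    enum Status { COMPLETED, CANCELLED, FAILED, INCOMPLETE, IN_PROGRESS }
    enum Reason { TURN_DETECTED, CLIENT_CANCELLED, MAX_OUTPUT_TOKENS, CONTENT_FILTER }

    private ResponseOutcome() {}

    // Only "completed" counts as success; every other status is another outcome.
    static boolean isSuccess(Status status) {
        return status == Status.COMPLETED;
    }

    // Builds a short human-readable summary from the status and the optional
    // reason carried in statusDetails.
    static String summarize(Status status, Optional<Reason> reason) {
        String why = reason.map(r -> r.name().toLowerCase(Locale.ROOT)).orElse("unspecified");
        switch (status) {
            case COMPLETED:
                return "completed";
            case CANCELLED:
                return "cancelled (" + why + ")";
            case INCOMPLETE:
                return "incomplete (" + why + ")";
            case FAILED:
                return "failed (see statusDetails.error)";
            default:
                return "in_progress";
        }
    }

    public static void main(String[] args) {
        System.out.println(summarize(Status.CANCELLED, Optional.of(Reason.TURN_DETECTED)));
    }
}
```

Keeping the reason optional reflects that it is only documented for cancelled and incomplete Responses.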
-
Response Function Call Arguments Delta Event
-
class ResponseFunctionCallArgumentsDeltaEvent:Returned when the model-generated function call arguments are updated.
-
String callIdThe ID of the function call.
-
String deltaThe arguments delta as a JSON string.
-
String eventIdThe unique ID of the server event.
-
String itemIdThe ID of the function call item.
-
long outputIndexThe index of the output item in the response.
-
String responseIdThe ID of the response.
-
JsonValue; type "response.function_call_arguments.delta"constantThe event type, must be
response.function_call_arguments.delta.RESPONSE_FUNCTION_CALL_ARGUMENTS_DELTA("response.function_call_arguments.delta")
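A client that wants to show partial arguments while they stream could buffer the delta values per callId. The sketch below assumes that deltas for one call arrive in order and can simply be concatenated; FunctionCallArgumentBuffer is a hypothetical class, not part of the SDK, and it takes the callId and delta strings directly rather than an event object.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical buffer for illustration; not part of the SDK.
final class FunctionCallArgumentBuffer {
    private final Map<String, StringBuilder> partialByCallId = new HashMap<>();

    // Appends one delta to the text accumulated for the given call.
    void onDelta(String callId, String delta) {
        partialByCallId.computeIfAbsent(callId, id -> new StringBuilder()).append(delta);
    }

    // Returns the text accumulated so far, or "" if nothing was seen.
    String partial(String callId) {
        StringBuilder sb = partialByCallId.get(callId);
        return sb == null ? "" : sb.toString();
    }

    // Drops the buffer once the call is finished.
    void clear(String callId) {
        partialByCallId.remove(callId);
    }

    public static void main(String[] args) {
        FunctionCallArgumentBuffer buffer = new FunctionCallArgumentBuffer();
        buffer.onDelta("call_1", "{\"city\":");
        buffer.onDelta("call_1", "\"Paris\"}");
        System.out.println(buffer.partial("call_1"));
    }
}
```

The accumulated text is generally not valid JSON until the last delta has arrived, so it is best treated as display-only until then.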
-
Response Function Call Arguments Done Event
-
class ResponseFunctionCallArgumentsDoneEvent:Returned when the model-generated function call arguments are done streaming. Also emitted when a Response is interrupted, incomplete, or cancelled.
-
String argumentsThe final arguments as a JSON string.
-
String callIdThe ID of the function call.
-
String eventIdThe unique ID of the server event.
-
String itemIdThe ID of the function call item.
-
String nameThe name of the function that was called.
-
long outputIndexThe index of the output item in the response.
-
String responseIdThe ID of the response.
-
JsonValue; type "response.function_call_arguments.done"constantThe event type, must be
response.function_call_arguments.done.RESPONSE_FUNCTION_CALL_ARGUMENTS_DONE("response.function_call_arguments.done")
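Once the final arguments are available, a client typically runs the named function and pairs the result with the same callId for a function_call_output item. The sketch below shows one possible dispatch table. FunctionCallDispatcher, its Output holder, and the handler signature are hypothetical and not part of the SDK, and returning a plain error text for an unregistered name is a choice made for this sketch. Because this event is also emitted when a Response is interrupted, incomplete, or cancelled, a real client would likely confirm the Response status before executing anything.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical dispatcher for illustration; not part of the SDK.
final class FunctionCallDispatcher {
    // Pairs a callId with the free-text output destined for a function_call_output item.
    static final class Output {
        final String callId;
        final String output;

        Output(String callId, String output) {
            this.callId = callId;
            this.output = output;
        }
    }

    private final Map<String, Function<String, String>> handlers = new HashMap<>();

    // Registers a handler that receives the JSON arguments string and returns output text.
    void register(String name, Function<String, String> handler) {
        handlers.put(name, handler);
    }

    // Looks up the handler by function name and runs it on the final arguments.
    Output onArgumentsDone(String callId, String name, String arguments) {
        Function<String, String> handler = handlers.get(name);
        String output = handler == null
                ? "unknown function: " + name
                : handler.apply(arguments);
        return new Output(callId, output);
    }

    public static void main(String[] args) {
        FunctionCallDispatcher dispatcher = new FunctionCallDispatcher();
        dispatcher.register("echo", arguments -> arguments);
        Output out = dispatcher.onArgumentsDone("call_1", "echo", "{\"arg1\": \"value1\"}");
        System.out.println(out.callId + " -> " + out.output);
    }
}
```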
-
Response Mcp Call Arguments Delta
-
class ResponseMcpCallArgumentsDelta:Returned when MCP tool call arguments are updated during response generation.
-
String deltaThe JSON-encoded arguments delta.
-
String eventIdThe unique ID of the server event.
-
String itemIdThe ID of the MCP tool call item.
-
long outputIndexThe index of the output item in the response.
-
String responseIdThe ID of the response.
-
JsonValue; type "response.mcp_call_arguments.delta"constantThe event type, must be
response.mcp_call_arguments.delta.RESPONSE_MCP_CALL_ARGUMENTS_DELTA("response.mcp_call_arguments.delta")
-
Optional<String> obfuscationIf present, indicates the delta text was obfuscated.
-
Response Mcp Call Arguments Done
-
class ResponseMcpCallArgumentsDone:Returned when MCP tool call arguments are finalized during response generation.
-
String argumentsThe final JSON-encoded arguments string.
-
String eventIdThe unique ID of the server event.
-
String itemIdThe ID of the MCP tool call item.
-
long outputIndexThe index of the output item in the response.
-
String responseIdThe ID of the response.
-
JsonValue; type "response.mcp_call_arguments.done"constantThe event type, must be
response.mcp_call_arguments.done.RESPONSE_MCP_CALL_ARGUMENTS_DONE("response.mcp_call_arguments.done")
-
Response Mcp Call Completed
-
class ResponseMcpCallCompleted:Returned when an MCP tool call has completed successfully.
-
String eventIdThe unique ID of the server event.
-
String itemIdThe ID of the MCP tool call item.
-
long outputIndexThe index of the output item in the response.
-
JsonValue; type "response.mcp_call.completed"constantThe event type, must be
response.mcp_call.completed.RESPONSE_MCP_CALL_COMPLETED("response.mcp_call.completed")
-
Response Mcp Call Failed
-
class ResponseMcpCallFailed:Returned when an MCP tool call has failed.
-
String eventIdThe unique ID of the server event.
-
String itemIdThe ID of the MCP tool call item.
-
long outputIndexThe index of the output item in the response.
-
JsonValue; type "response.mcp_call.failed"constantThe event type, must be
response.mcp_call.failed.RESPONSE_MCP_CALL_FAILED("response.mcp_call.failed")
-
Response Mcp Call In Progress
-
class ResponseMcpCallInProgress:Returned when an MCP tool call has started and is in progress.
-
String eventIdThe unique ID of the server event.
-
String itemIdThe ID of the MCP tool call item.
-
long outputIndexThe index of the output item in the response.
-
JsonValue; type "response.mcp_call.in_progress"constantThe event type, must be
response.mcp_call.in_progress.RESPONSE_MCP_CALL_IN_PROGRESS("response.mcp_call.in_progress")
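The three MCP call lifecycle events above (in_progress, completed, failed) all carry an itemId, so a client can track each tool call's state in a small map. The sketch below is keyed on the documented event type strings; McpCallTracker and its State enum are hypothetical and not part of the SDK.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical tracker for illustration; not part of the SDK.
final class McpCallTracker {
    enum State { IN_PROGRESS, COMPLETED, FAILED }

    private final Map<String, State> stateByItemId = new HashMap<>();

    // Records the state implied by the event type. Returns false for event
    // types that are not MCP call lifecycle events.
    boolean onEvent(String type, String itemId) {
        switch (type) {
            case "response.mcp_call.in_progress":
                stateByItemId.put(itemId, State.IN_PROGRESS);
                return true;
            case "response.mcp_call.completed":
                stateByItemId.put(itemId, State.COMPLETED);
                return true;
            case "response.mcp_call.failed":
                stateByItemId.put(itemId, State.FAILED);
                return true;
            default:
                return false;
        }
    }

    // Returns the last recorded state, or null if the item was never seen.
    State stateOf(String itemId) {
        return stateByItemId.get(itemId);
    }

    public static void main(String[] args) {
        McpCallTracker tracker = new McpCallTracker();
        tracker.onEvent("response.mcp_call.in_progress", "item_1");
        tracker.onEvent("response.mcp_call.completed", "item_1");
        System.out.println(tracker.stateOf("item_1"));
    }
}
```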
-
Response Output Item Added Event
-
class ResponseOutputItemAddedEvent:Returned when a new Item is created during Response generation.
-
String eventIdThe unique ID of the server event.
-
ConversationItem itemA single item within a Realtime conversation.
-
class RealtimeConversationItemSystemMessage:A system message in a Realtime conversation can be used to provide additional context or instructions to the model. This is similar to but distinct from the instruction prompt provided at the start of a conversation, as system messages can be added at any point in the conversation. For major changes to the conversation's behavior, use instructions, but for smaller updates (e.g. "the user is now asking about a different topic"), use system messages.
-
List<Content> contentThe content of the message.
-
Optional<String> textThe text content.
-
Optional<Type> typeThe content type. Always
input_textfor system messages.INPUT_TEXT("input_text")
-
-
JsonValue; role "system"constantThe role of the message sender. Always
system.SYSTEM("system")
-
JsonValue; type "message"constantThe type of the item. Always
message.MESSAGE("message")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeConversationItemUserMessage:A user message item in a Realtime conversation.
-
List<Content> contentThe content of the message.
-
Optional<String> audioBase64-encoded audio bytes (for input_audio). These will be parsed as the format specified in the session input audio type configuration. This defaults to PCM 16-bit 24kHz mono if not specified. -
Optional<Detail> detailThe detail level of the image (for input_image). auto will default to high. -
AUTO("auto") -
LOW("low") -
HIGH("high")
-
-
Optional<String> imageUrlBase64-encoded image bytes (for input_image) as a data URI. For example data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA.... Supported formats are PNG and JPEG. -
Optional<String> textThe text content (for
input_text). -
Optional<String> transcriptTranscript of the audio (for
input_audio). This is not sent to the model, but will be attached to the message item for reference. -
Optional<Type> typeThe content type (input_text, input_audio, or input_image). -
INPUT_TEXT("input_text") -
INPUT_AUDIO("input_audio") -
INPUT_IMAGE("input_image")
-
-
-
JsonValue; role "user"constantThe role of the message sender. Always
user.USER("user")
-
JsonValue; type "message"constantThe type of the item. Always
message.MESSAGE("message")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeConversationItemAssistantMessage:An assistant message item in a Realtime conversation.
-
List<Content> contentThe content of the message.
-
Optional<String> audioBase64-encoded audio bytes. These will be parsed as the format specified in the session output audio type configuration. This defaults to PCM 16-bit 24kHz mono if not specified.
-
Optional<String> textThe text content.
-
Optional<String> transcriptThe transcript of the audio content. This will always be present if the output type is audio. -
Optional<Type> typeThe content type, output_text or output_audio depending on the session output_modalities configuration. -
OUTPUT_TEXT("output_text") -
OUTPUT_AUDIO("output_audio")
-
-
-
JsonValue; role "assistant"constantThe role of the message sender. Always
assistant.ASSISTANT("assistant")
-
JsonValue; type "message"constantThe type of the item. Always
message.MESSAGE("message")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeConversationItemFunctionCall:A function call item in a Realtime conversation.
-
String argumentsThe arguments of the function call. This is a JSON-encoded string representing the arguments passed to the function, for example
{"arg1": "value1", "arg2": 42}. -
String nameThe name of the function being called.
-
JsonValue; type "function_call"constantThe type of the item. Always
function_call.FUNCTION_CALL("function_call")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<String> callIdThe ID of the function call.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeConversationItemFunctionCallOutput:A function call output item in a Realtime conversation.
-
String callIdThe ID of the function call this output is for.
-
String outputThe output of the function call. This is free text and can contain any information or simply be empty.
-
JsonValue; type "function_call_output"constantThe type of the item. Always
function_call_output.FUNCTION_CALL_OUTPUT("function_call_output")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeMcpApprovalResponse:A Realtime item responding to an MCP approval request.
-
String idThe unique ID of the approval response.
-
String approvalRequestIdThe ID of the approval request being answered.
-
boolean approveWhether the request was approved.
-
JsonValue; type "mcp_approval_response"constantThe type of the item. Always
mcp_approval_response.MCP_APPROVAL_RESPONSE("mcp_approval_response")
-
Optional<String> reasonOptional reason for the decision.
-
-
class RealtimeMcpListTools:A Realtime item listing tools available on an MCP server.
-
String serverLabelThe label of the MCP server.
-
List<Tool> toolsThe tools available on the server.
-
JsonValue inputSchemaThe JSON schema describing the tool's input.
-
String nameThe name of the tool.
-
Optional<JsonValue> annotationsAdditional annotations about the tool.
-
Optional<String> descriptionThe description of the tool.
-
-
JsonValue; type "mcp_list_tools"constantThe type of the item. Always
mcp_list_tools.MCP_LIST_TOOLS("mcp_list_tools")
-
Optional<String> idThe unique ID of the list.
-
-
class RealtimeMcpToolCall:A Realtime item representing an invocation of a tool on an MCP server.
-
String idThe unique ID of the tool call.
-
String argumentsA JSON string of the arguments passed to the tool.
-
String nameThe name of the tool that was run.
-
String serverLabelThe label of the MCP server running the tool.
-
JsonValue; type "mcp_call"constantThe type of the item. Always
mcp_call.MCP_CALL("mcp_call")
-
Optional<String> approvalRequestIdThe ID of an associated approval request, if any.
-
Optional<Error> errorThe error from the tool call, if any.
-
class RealtimeMcpProtocolError:-
long code -
String message -
JsonValue; type "protocol_error"constantPROTOCOL_ERROR("protocol_error")
-
-
class RealtimeMcpToolExecutionError:-
String message -
JsonValue; type "tool_execution_error"constantTOOL_EXECUTION_ERROR("tool_execution_error")
-
-
class RealtimeMcphttpError:-
long code -
String message -
JsonValue; type "http_error"constantHTTP_ERROR("http_error")
-
-
-
Optional<String> outputThe output from the tool call.
-
-
class RealtimeMcpApprovalRequest:A Realtime item requesting human approval of a tool invocation.
-
String idThe unique ID of the approval request.
-
String argumentsA JSON string of arguments for the tool.
-
String nameThe name of the tool to run.
-
String serverLabelThe label of the MCP server making the request.
-
JsonValue; type "mcp_approval_request"constantThe type of the item. Always
mcp_approval_request.MCP_APPROVAL_REQUEST("mcp_approval_request")
-
-
-
long outputIndexThe index of the output item in the Response.
-
String responseIdThe ID of the Response to which the item belongs.
-
JsonValue; type "response.output_item.added"constantThe event type, must be
response.output_item.added.RESPONSE_OUTPUT_ITEM_ADDED("response.output_item.added")
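The approval items described above pair up through `approvalRequestId`. As a minimal sketch of that pairing, the Java below uses hypothetical stand-in records (`ApprovalRequest`, `ApprovalResponse`) and a hypothetical allow-list policy; none of these names are SDK types, and the response's own `id` field is left out for brevity.

```java
import java.util.Optional;
import java.util.Set;

/** Hypothetical helper; the records mirror documented fields but are not SDK types. */
final class McpApprovalSketch {
    /** Stand-in for the fields of an mcp_approval_request item. */
    record ApprovalRequest(String id, String serverLabel, String name, String arguments) {}

    /** Stand-in for an mcp_approval_response item (its own id is omitted in this sketch). */
    record ApprovalResponse(String approvalRequestId, boolean approve, Optional<String> reason) {}

    /** Approves only tools on the allow list and links the response to the request ID. */
    static ApprovalResponse decide(ApprovalRequest request, Set<String> allowedTools) {
        boolean approve = allowedTools.contains(request.name());
        Optional<String> reason = approve
                ? Optional.empty()
                : Optional.of("tool is not on the allow list: " + request.name());
        return new ApprovalResponse(request.id(), approve, reason);
    }

    public static void main(String[] args) {
        ApprovalRequest request =
                new ApprovalRequest("mcpr_1", "docs", "search", "{\"q\":\"vad\"}");
        System.out.println(decide(request, Set.of("search")));
    }
}
```

The allow list is only one possible policy; an application that needs human approval would replace `decide` with its own prompt and keep the same ID linkage.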
-
Response Output Item Done Event
-
class ResponseOutputItemDoneEvent:Returned when an Item is done streaming. Also emitted when a Response is interrupted, incomplete, or cancelled.
-
String eventIdThe unique ID of the server event.
-
ConversationItem itemA single item within a Realtime conversation.
-
class RealtimeConversationItemSystemMessage:A system message in a Realtime conversation can be used to provide additional context or instructions to the model. This is similar but distinct from the instruction prompt provided at the start of a conversation, as system messages can be added at any point in the conversation. For major changes to the conversation's behavior, use instructions, but for smaller updates (e.g. "the user is now asking about a different topic"), use system messages.
-
List<Content> contentThe content of the message.
-
Optional<String> textThe text content.
-
Optional<Type> typeThe content type. Always
input_text for system messages.INPUT_TEXT("input_text")
-
-
JsonValue; role "system"constantThe role of the message sender. Always
system.SYSTEM("system")
-
JsonValue; type "message"constantThe type of the item. Always
message.MESSAGE("message")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeConversationItemUserMessage:A user message item in a Realtime conversation.
-
List<Content> contentThe content of the message.
-
Optional<String> audioBase64-encoded audio bytes (for
input_audio). These are parsed in the format specified in the session input audio type configuration, which defaults to PCM 16-bit 24kHz mono if not specified. -
Optional<Detail> detailThe detail level of the image (for
input_image). auto will default to high.-
AUTO("auto") -
LOW("low") -
HIGH("high")
-
-
Optional<String> imageUrlBase64-encoded image bytes (for
input_image) as a data URI, for example data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA.... Supported formats are PNG and JPEG. -
Optional<String> textThe text content (for
input_text). -
Optional<String> transcriptTranscript of the audio (for
input_audio). This is not sent to the model, but will be attached to the message item for reference. -
Optional<Type> typeThe content type (
input_text, input_audio, or input_image).-
INPUT_TEXT("input_text") -
INPUT_AUDIO("input_audio") -
INPUT_IMAGE("input_image")
-
-
-
JsonValue; role "user"constantThe role of the message sender. Always
user.USER("user")
-
JsonValue; type "message"constantThe type of the item. Always
message.MESSAGE("message")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeConversationItemAssistantMessage:An assistant message item in a Realtime conversation.
-
List<Content> contentThe content of the message.
-
Optional<String> audioBase64-encoded audio bytes. These are parsed in the format specified in the session output audio type configuration, which defaults to PCM 16-bit 24kHz mono if not specified.
-
Optional<String> textThe text content.
-
Optional<String> transcriptThe transcript of the audio content. This is always present if the output type is
audio. -
Optional<Type> typeThe content type,
output_text or output_audio, depending on the session output_modalities configuration.-
OUTPUT_TEXT("output_text") -
OUTPUT_AUDIO("output_audio")
-
-
-
JsonValue; role "assistant"constantThe role of the message sender. Always
assistant.ASSISTANT("assistant")
-
JsonValue; type "message"constantThe type of the item. Always
message.MESSAGE("message")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeConversationItemFunctionCall:A function call item in a Realtime conversation.
-
String argumentsThe arguments of the function call. This is a JSON-encoded string representing the arguments passed to the function, for example
{"arg1": "value1", "arg2": 42}. -
String nameThe name of the function being called.
-
JsonValue; type "function_call"constantThe type of the item. Always
function_call.FUNCTION_CALL("function_call")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<String> callIdThe ID of the function call.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeConversationItemFunctionCallOutput:A function call output item in a Realtime conversation.
-
String callIdThe ID of the function call this output is for.
-
String outputThe output of the function call. This is free text and can contain any information, or it can simply be empty.
-
JsonValue; type "function_call_output"constantThe type of the item. Always
function_call_output.FUNCTION_CALL_OUTPUT("function_call_output")
-
Optional<String> idThe unique ID of the item. This may be provided by the client or generated by the server.
-
Optional<Object> object_Identifier for the API object being returned - always
realtime.item. Optional when creating a new item.REALTIME_ITEM("realtime.item")
-
Optional<Status> statusThe status of the item. Has no effect on the conversation.
-
COMPLETED("completed") -
INCOMPLETE("incomplete") -
IN_PROGRESS("in_progress")
-
-
-
class RealtimeMcpApprovalResponse:A Realtime item responding to an MCP approval request.
-
String idThe unique ID of the approval response.
-
String approvalRequestIdThe ID of the approval request being answered.
-
boolean approveWhether the request was approved.
-
JsonValue; type "mcp_approval_response"constantThe type of the item. Always
mcp_approval_response.MCP_APPROVAL_RESPONSE("mcp_approval_response")
-
Optional<String> reasonOptional reason for the decision.
-
-
class RealtimeMcpListTools:A Realtime item listing tools available on an MCP server.
-
String serverLabelThe label of the MCP server.
-
List<Tool> toolsThe tools available on the server.
-
JsonValue inputSchemaThe JSON schema describing the tool's input.
-
String nameThe name of the tool.
-
Optional<JsonValue> annotationsAdditional annotations about the tool.
-
Optional<String> descriptionThe description of the tool.
-
-
JsonValue; type "mcp_list_tools"constantThe type of the item. Always
mcp_list_tools.MCP_LIST_TOOLS("mcp_list_tools")
-
Optional<String> idThe unique ID of the list.
-
-
class RealtimeMcpToolCall:A Realtime item representing an invocation of a tool on an MCP server.
-
String idThe unique ID of the tool call.
-
String argumentsA JSON string of the arguments passed to the tool.
-
String nameThe name of the tool that was run.
-
String serverLabelThe label of the MCP server running the tool.
-
JsonValue; type "mcp_call"constantThe type of the item. Always
mcp_call.MCP_CALL("mcp_call")
-
Optional<String> approvalRequestIdThe ID of an associated approval request, if any.
-
Optional<Error> errorThe error from the tool call, if any.
-
class RealtimeMcpProtocolError:-
long code -
String message -
JsonValue; type "protocol_error"constantPROTOCOL_ERROR("protocol_error")
-
-
class RealtimeMcpToolExecutionError:-
String message -
JsonValue; type "tool_execution_error"constantTOOL_EXECUTION_ERROR("tool_execution_error")
-
-
class RealtimeMcphttpError:-
long code -
String message -
JsonValue; type "http_error"constantHTTP_ERROR("http_error")
-
-
-
Optional<String> outputThe output from the tool call.
-
-
class RealtimeMcpApprovalRequest:A Realtime item requesting human approval of a tool invocation.
-
String idThe unique ID of the approval request.
-
String argumentsA JSON string of arguments for the tool.
-
String nameThe name of the tool to run.
-
String serverLabelThe label of the MCP server making the request.
-
JsonValue; type "mcp_approval_request"constantThe type of the item. Always
mcp_approval_request.MCP_APPROVAL_REQUEST("mcp_approval_request")
-
-
-
long outputIndexThe index of the output item in the Response.
-
String responseIdThe ID of the Response to which the item belongs.
-
JsonValue; type "response.output_item.done"constantThe event type, must be
response.output_item.done.RESPONSE_OUTPUT_ITEM_DONE("response.output_item.done")
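When the finished item is a `function_call`, the client is expected to reply with a `function_call_output` whose `callId` matches. The sketch below illustrates only that pairing, using hypothetical stand-in records and a hypothetical handler map keyed by function name; it does not use SDK types and assumes `callId` is present on the call.

```java
import java.util.Map;
import java.util.function.Function;

/** Hypothetical helper; the records mirror documented fields but are not SDK types. */
final class FunctionCallSketch {
    /** Stand-in for a function_call item: arguments is a JSON-encoded string. */
    record FunctionCall(String callId, String name, String arguments) {}

    /** Stand-in for a function_call_output item: output is free text. */
    record FunctionCallOutput(String callId, String output) {}

    /** Runs the handler registered for the call's name and echoes the callId back. */
    static FunctionCallOutput run(FunctionCall call,
                                  Map<String, Function<String, String>> handlers) {
        Function<String, String> handler = handlers.get(call.name());
        String output = (handler == null)
                ? "unknown function: " + call.name()
                : handler.apply(call.arguments());
        return new FunctionCallOutput(call.callId(), output);
    }

    public static void main(String[] args) {
        // The "echo" handler is a placeholder that returns its raw JSON arguments.
        Map<String, Function<String, String>> handlers = Map.of("echo", rawArgs -> rawArgs);
        FunctionCall call =
                new FunctionCall("call_1", "echo", "{\"arg1\": \"value1\", \"arg2\": 42}");
        System.out.println(run(call, handlers));
    }
}
```

Because `output` is free text, an unknown function is reported as a plain message here rather than as an exception; a real client might prefer structured JSON.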
-
Response Text Delta Event
-
class ResponseTextDeltaEvent:Returned when the text value of an "output_text" content part is updated.
-
long contentIndexThe index of the content part in the item's content array.
-
String deltaThe text delta.
-
String eventIdThe unique ID of the server event.
-
String itemIdThe ID of the item.
-
long outputIndexThe index of the output item in the response.
-
String responseIdThe ID of the response.
-
JsonValue; type "response.output_text.delta"constantThe event type, must be
response.output_text.delta.RESPONSE_OUTPUT_TEXT_DELTA("response.output_text.delta")
-
Response Text Done Event
-
class ResponseTextDoneEvent:Returned when the text value of an "output_text" content part is done streaming. Also emitted when a Response is interrupted, incomplete, or cancelled.
-
long contentIndexThe index of the content part in the item's content array.
-
String eventIdThe unique ID of the server event.
-
String itemIdThe ID of the item.
-
long outputIndexThe index of the output item in the response.
-
String responseIdThe ID of the response.
-
String textThe final text content.
-
JsonValue; type "response.output_text.done"constantThe event type, must be
response.output_text.done.RESPONSE_OUTPUT_TEXT_DONE("response.output_text.done")
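The delta and done events above share `itemId` and `contentIndex`, so a client can buffer streamed text per content part and then defer to the `text` field of the done event. The following is a minimal sketch under that assumption; `OutputTextAccumulator` and its methods are hypothetical names, not part of the SDK.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper that buffers streamed text per (itemId, contentIndex). */
final class OutputTextAccumulator {
    private final Map<String, StringBuilder> buffers = new HashMap<>();

    private static String key(String itemId, long contentIndex) {
        return itemId + "#" + contentIndex;
    }

    /** Call once per response.output_text.delta event. */
    void onDelta(String itemId, long contentIndex, String delta) {
        buffers.computeIfAbsent(key(itemId, contentIndex), k -> new StringBuilder())
                .append(delta);
    }

    /** Returns the text buffered so far, or an empty string if none was seen. */
    String current(String itemId, long contentIndex) {
        StringBuilder buffer = buffers.get(key(itemId, contentIndex));
        return buffer == null ? "" : buffer.toString();
    }

    /** Call on response.output_text.done: drops the buffer and returns the final text. */
    String onDone(String itemId, long contentIndex, String finalText) {
        buffers.remove(key(itemId, contentIndex));
        return finalText;
    }

    public static void main(String[] args) {
        OutputTextAccumulator acc = new OutputTextAccumulator();
        acc.onDelta("item_1", 0, "Hel");
        acc.onDelta("item_1", 0, "lo");
        System.out.println(acc.current("item_1", 0));
        System.out.println(acc.onDone("item_1", 0, "Hello"));
    }
}
```

Treating the done event's text as authoritative keeps the client correct even when a response is interrupted or cancelled partway through streaming.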
-
Session Created Event
-
class SessionCreatedEvent:Returned when a Session is created. Emitted automatically when a new connection is established as the first server event. This event will contain the default Session configuration.
-
String eventIdThe unique ID of the server event.
-
Session sessionThe session configuration.
-
class RealtimeSessionCreateRequest:Realtime session object configuration.
-
JsonValue; type "realtime"constantThe type of session to create. Always
realtime for the Realtime API.REALTIME("realtime")
-
Optional<RealtimeAudioConfig> audioConfiguration for input and output audio.
-
Optional<RealtimeAudioConfigInput> input-
Optional<RealtimeAudioFormats> formatThe format of the input audio.
-
AudioPcm-
Optional<Rate> rateThe sample rate of the audio. Always
24000._24000(24000)
-
Optional<Type> typeThe audio format. Always
audio/pcm.AUDIO_PCM("audio/pcm")
-
-
AudioPcmu-
Optional<Type> typeThe audio format. Always
audio/pcmu.AUDIO_PCMU("audio/pcmu")
-
-
AudioPcma-
Optional<Type> typeThe audio format. Always
audio/pcma.AUDIO_PCMA("audio/pcma")
-
-
-
Optional<NoiseReduction> noiseReductionConfiguration for input audio noise reduction. This can be set to
null to turn off. Noise reduction filters audio added to the input audio buffer before it is sent to VAD and the model. Filtering the audio can improve VAD and turn detection accuracy (reducing false positives) and model performance by improving perception of the input audio.-
Optional<NoiseReductionType> typeType of noise reduction.
near_field is for close-talking microphones such as headphones; far_field is for far-field microphones such as laptop or conference room microphones.-
NEAR_FIELD("near_field") -
FAR_FIELD("far_field")
-
-
-
Optional<AudioTranscription> transcriptionConfiguration for input audio transcription. Defaults to off and can be set to
null to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through the /audio/transcriptions endpoint and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service.-
Optional<Delay> delayControls how long the model waits before emitting transcription text. Higher values can improve transcription accuracy at the cost of latency. Only supported with
gpt-realtime-whisper in GA Realtime sessions.-
MINIMAL("minimal") -
LOW("low") -
MEDIUM("medium") -
HIGH("high") -
XHIGH("xhigh")
-
-
Optional<String> languageThe language of the input audio. Supplying the input language in ISO-639-1 (e.g.
en) format will improve accuracy and latency. -
Optional<Model> modelThe model to use for transcription. Current options are
whisper-1, gpt-4o-mini-transcribe, gpt-4o-mini-transcribe-2025-12-15, gpt-4o-transcribe, gpt-4o-transcribe-diarize, and gpt-realtime-whisper. Use gpt-4o-transcribe-diarize when you need diarization with speaker labels.-
WHISPER_1("whisper-1") -
GPT_4O_MINI_TRANSCRIBE("gpt-4o-mini-transcribe") -
GPT_4O_MINI_TRANSCRIBE_2025_12_15("gpt-4o-mini-transcribe-2025-12-15") -
GPT_4O_TRANSCRIBE("gpt-4o-transcribe") -
GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize") -
GPT_REALTIME_WHISPER("gpt-realtime-whisper")
-
-
Optional<String> promptAn optional text to guide the model's style or continue a previous audio segment. For
whisper-1, the prompt is a list of keywords. For gpt-4o-transcribe models (excluding gpt-4o-transcribe-diarize), the prompt is a free text string, for example "expect words related to technology". Prompt is not supported with gpt-realtime-whisper in GA Realtime sessions.
-
-
Optional<RealtimeAudioInputTurnDetection> turnDetectionConfiguration for turn detection, either Server VAD or Semantic VAD. This can be set to
null to turn off, in which case the client must manually trigger model response.
Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.
Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.
For
gpt-realtime-whisper transcription sessions, turn detection must be set to null; VAD is not supported.-
ServerVad-
JsonValue; type "server_vad"constantType of turn detection,
server_vad to turn on simple Server VAD.SERVER_VAD("server_vad")
-
Optional<Boolean> createResponseWhether or not to automatically generate a response when a VAD stop event occurs. If
interrupt_response is set to false, this may fail to create a response if the model is already responding.
If both
create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted. -
Optional<Long> idleTimeoutMsOptional timeout after which a model response will be triggered automatically. This is useful for situations in which a long pause from the user is unexpected, such as a phone call. The model will effectively prompt the user to continue the conversation based on the current context.
The timeout value will be applied after the last model response's audio has finished playing, i.e. it's set to the
response.done time plus audio playback duration.
An
input_audio_buffer.timeout_triggered event (plus events associated with the Response) will be emitted when the timeout is reached. Idle timeout is currently only supported for server_vad mode. -
Optional<Boolean> interruptResponseWhether or not to automatically interrupt (cancel) any ongoing response with output to the default conversation (i.e.
conversation of auto) when a VAD start event occurs. If true, the response will be cancelled; otherwise it will continue until complete.
If both
create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted. -
Optional<Long> prefixPaddingMsUsed only for
server_vad mode. Amount of audio to include before the VAD detected speech (in milliseconds). Defaults to 300ms. -
Optional<Long> silenceDurationMsUsed only for
server_vad mode. Duration of silence to detect speech stop (in milliseconds). Defaults to 500ms. With shorter values the model will respond more quickly, but may jump in on short pauses from the user. -
Optional<Double> thresholdUsed only for
server_vad mode. Activation threshold for VAD (0.0 to 1.0). Defaults to 0.5. A higher threshold will require louder audio to activate the model, and thus might perform better in noisy environments.
-
-
SemanticVad-
JsonValue; type "semantic_vad"constantType of turn detection,
semantic_vad to turn on Semantic VAD.SEMANTIC_VAD("semantic_vad")
-
Optional<Boolean> createResponseWhether or not to automatically generate a response when a VAD stop event occurs.
-
Optional<Eagerness> eagernessUsed only for
semantic_vad mode. The eagerness of the model to respond. low will wait longer for the user to continue speaking, while high will respond more quickly. auto is the default and is equivalent to medium. low, medium, and high have max timeouts of 8s, 4s, and 2s respectively.-
LOW("low") -
MEDIUM("medium") -
HIGH("high") -
AUTO("auto")
-
-
Optional<Boolean> interruptResponseWhether or not to automatically interrupt any ongoing response with output to the default conversation (i.e.
conversation of auto) when a VAD start event occurs.
-
-
-
-
Optional<RealtimeAudioConfigOutput> output-
Optional<RealtimeAudioFormats> formatThe format of the output audio.
-
Optional<Double> speedThe speed of the model's spoken response as a multiple of the original speed. 1.0 is the default speed. 0.25 is the minimum speed. 1.5 is the maximum speed. This value can only be changed in between model turns, not while a response is in progress.
This parameter is a post-processing adjustment to the audio after it is generated; it's also possible to prompt the model to speak faster or slower.
-
Optional<Voice> voiceThe voice the model uses to respond. Supported built-in voices are
alloy, ash, ballad, coral, echo, sage, shimmer, verse, marin, and cedar. You may also provide a custom voice object with an id, for example { "id": "voice_1234" }. Voice cannot be changed during the session once the model has responded with audio at least once. We recommend marin and cedar for best quality.-
String -
enum UnionMember1:-
ALLOY("alloy") -
ASH("ash") -
BALLAD("ballad") -
CORAL("coral") -
ECHO("echo") -
SAGE("sage") -
SHIMMER("shimmer") -
VERSE("verse") -
MARIN("marin") -
CEDAR("cedar")
-
-
class Id:Custom voice reference.
-
String idThe custom voice ID, e.g.
voice_1234.
-
-
-
-
-
Optional<List<Include>> includeAdditional fields to include in server outputs.
item.input_audio_transcription.logprobs: Include logprobs for input audio transcription.ITEM_INPUT_AUDIO_TRANSCRIPTION_LOGPROBS("item.input_audio_transcription.logprobs")
-
Optional<String> instructionsThe default system instructions (i.e. system message) prepended to model calls. This field allows the client to guide the model on desired responses. The model can be instructed on response content and format (e.g. "be extremely succinct", "act friendly", "here are examples of good responses") and on audio behavior (e.g. "talk quickly", "inject emotion into your voice", "laugh frequently"). The instructions are not guaranteed to be followed by the model, but they provide guidance to the model on the desired behavior.
Note that the server sets default instructions which will be used if this field is not set and are visible in the
session.created event at the start of the session. -
Optional<MaxOutputTokens> maxOutputTokensMaximum number of output tokens for a single assistant response, inclusive of tool calls. Provide an integer between 1 and 4096 to limit output tokens, or
inf for the maximum available tokens for a given model. Defaults to inf.-
long -
JsonValue;INF("inf")
-
-
Optional<Model> modelThe Realtime model used for this session.
-
GPT_REALTIME("gpt-realtime") -
GPT_REALTIME_1_5("gpt-realtime-1.5") -
GPT_REALTIME_2("gpt-realtime-2") -
GPT_REALTIME_2025_08_28("gpt-realtime-2025-08-28") -
GPT_4O_REALTIME_PREVIEW("gpt-4o-realtime-preview") -
GPT_4O_REALTIME_PREVIEW_2024_10_01("gpt-4o-realtime-preview-2024-10-01") -
GPT_4O_REALTIME_PREVIEW_2024_12_17("gpt-4o-realtime-preview-2024-12-17") -
GPT_4O_REALTIME_PREVIEW_2025_06_03("gpt-4o-realtime-preview-2025-06-03") -
GPT_4O_MINI_REALTIME_PREVIEW("gpt-4o-mini-realtime-preview") -
GPT_4O_MINI_REALTIME_PREVIEW_2024_12_17("gpt-4o-mini-realtime-preview-2024-12-17") -
GPT_REALTIME_MINI("gpt-realtime-mini") -
GPT_REALTIME_MINI_2025_10_06("gpt-realtime-mini-2025-10-06") -
GPT_REALTIME_MINI_2025_12_15("gpt-realtime-mini-2025-12-15") -
GPT_AUDIO_1_5("gpt-audio-1.5") -
GPT_AUDIO_MINI("gpt-audio-mini") -
GPT_AUDIO_MINI_2025_10_06("gpt-audio-mini-2025-10-06") -
GPT_AUDIO_MINI_2025_12_15("gpt-audio-mini-2025-12-15")
-
-
Optional<List<OutputModality>> outputModalitiesThe set of modalities the model can respond with. It defaults to
["audio"], indicating that the model will respond with audio plus a transcript. ["text"] can be used to make the model respond with text only. It is not possible to request both text and audio at the same time.-
TEXT("text") -
AUDIO("audio")
-
-
Optional<Boolean> parallelToolCallsWhether the model may call multiple tools in parallel. Only supported by reasoning Realtime models such as
gpt-realtime-2. -
Optional<ResponsePrompt> promptReference to a prompt template and its variables. Learn more.
-
String idThe unique identifier of the prompt template to use.
-
Optional<Variables> variablesOptional map of values to substitute in for variables in your prompt. The substitution values can either be strings, or other Response input types like images or files.
-
String -
class ResponseInputText:A text input to the model.
-
String textThe text input to the model.
-
JsonValue; type "input_text"constantThe type of the input item. Always
input_text.INPUT_TEXT("input_text")
-
-
class ResponseInputImage:An image input to the model. Learn about image inputs.
-
Detail detailThe detail level of the image to be sent to the model. One of
high, low, auto, or original. Defaults to auto.-
LOW("low") -
HIGH("high") -
AUTO("auto") -
ORIGINAL("original")
-
-
JsonValue; type "input_image"constantThe type of the input item. Always
input_image.INPUT_IMAGE("input_image")
-
Optional<String> fileIdThe ID of the file to be sent to the model.
-
Optional<String> imageUrlThe URL of the image to be sent to the model. A fully qualified URL or base64 encoded image in a data URL.
-
-
class ResponseInputFile:A file input to the model.
-
JsonValue; type "input_file"constantThe type of the input item. Always
input_file.INPUT_FILE("input_file")
-
Optional<Detail> detailThe detail level of the file to be sent to the model. Use
low for the default rendering behavior, or high to render the file at higher quality. Defaults to low.-
LOW("low") -
HIGH("high")
-
-
Optional<String> fileDataThe content of the file to be sent to the model.
-
Optional<String> fileIdThe ID of the file to be sent to the model.
-
Optional<String> fileUrlThe URL of the file to be sent to the model.
-
Optional<String> filenameThe name of the file to be sent to the model.
-
-
-
Optional<String> versionOptional version of the prompt template.
-
-
Optional<RealtimeReasoning> reasoningConfiguration for reasoning-capable Realtime models such as
gpt-realtime-2.-
Optional<RealtimeReasoningEffort> effortConstrains effort on reasoning for reasoning-capable Realtime models such as
gpt-realtime-2.-
MINIMAL("minimal") -
LOW("low") -
MEDIUM("medium") -
HIGH("high") -
XHIGH("xhigh")
-
-
-
Optional<RealtimeToolChoiceConfig> toolChoiceHow the model chooses tools. Provide one of the string modes or force a specific function/MCP tool.
-
enum ToolChoiceOptions:Controls which (if any) tool is called by the model.
none means the model will not call any tool and instead generates a message. auto means the model can pick between generating a message or calling one or more tools. required means the model must call one or more tools.-
NONE("none") -
AUTO("auto") -
REQUIRED("required")
-
-
class ToolChoiceFunction:Use this option to force the model to call a specific function.
-
String nameThe name of the function to call.
-
JsonValue; type "function"constantFor function calling, the type is always
function.FUNCTION("function")
-
-
class ToolChoiceMcp:Use this option to force the model to call a specific tool on a remote MCP server.
-
String serverLabelThe label of the MCP server to use.
-
JsonValue; type "mcp"constantFor MCP tools, the type is always
mcp.MCP("mcp")
-
Optional<String> nameThe name of the tool to call on the server.
-
-
-
Optional<List<RealtimeToolsConfigUnion>> toolsTools available to the model.
-
class RealtimeFunctionTool:-
Optional<String> descriptionThe description of the function, including guidance on when and how to call it, and guidance about what to tell the user when calling (if anything).
-
Optional<String> nameThe name of the function.
-
Optional<JsonValue> parametersParameters of the function in JSON Schema.
-
Optional<Type> typeThe type of the tool, i.e.
function.FUNCTION("function")
-
-
Mcp-
String serverLabelA label for this MCP server, used to identify it in tool calls.
-
JsonValue; type "mcp"constantThe type of the MCP tool. Always
mcp.MCP("mcp")
-
Optional<AllowedTools> allowedToolsList of allowed tool names or a filter object.
-
List<String> -
class McpToolFilter:A filter object to specify which tools are allowed.
-
Optional<Boolean> readOnlyIndicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with
readOnlyHint, it will match this filter. -
Optional<List<String>> toolNamesList of allowed tool names.
-
-
-
Optional<String> authorizationAn OAuth access token that can be used with a remote MCP server, either with a custom MCP server URL or a service connector. Your application must handle the OAuth authorization flow and provide the token here.
-
Optional<ConnectorId> connectorIdIdentifier for service connectors, like those available in ChatGPT. One of
server_url or connector_id must be provided. Learn more about service connectors here.
Currently supported
connector_id values are:-
Dropbox:
connector_dropbox -
Gmail:
connector_gmail -
Google Calendar:
connector_googlecalendar -
Google Drive:
connector_googledrive -
Microsoft Teams:
connector_microsoftteams -
Outlook Calendar:
connector_outlookcalendar -
Outlook Email:
connector_outlookemail -
SharePoint:
connector_sharepoint -
CONNECTOR_DROPBOX("connector_dropbox") -
CONNECTOR_GMAIL("connector_gmail") -
CONNECTOR_GOOGLECALENDAR("connector_googlecalendar") -
CONNECTOR_GOOGLEDRIVE("connector_googledrive") -
CONNECTOR_MICROSOFTTEAMS("connector_microsoftteams") -
CONNECTOR_OUTLOOKCALENDAR("connector_outlookcalendar") -
CONNECTOR_OUTLOOKEMAIL("connector_outlookemail") -
CONNECTOR_SHAREPOINT("connector_sharepoint")
-
-
Optional<Boolean> deferLoadingWhether this MCP tool is deferred and discovered via tool search.
-
Optional<Headers> headersOptional HTTP headers to send to the MCP server. Use for authentication or other purposes.
-
Optional<RequireApproval> requireApprovalSpecify which of the MCP server's tools require approval.
-
class McpToolApprovalFilter:Specify which of the MCP server's tools require approval. Can be
always, never, or a filter object associated with tools that require approval.-
Optional<Always> alwaysA filter object to specify which tools are allowed.
-
Optional<Boolean> readOnlyIndicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with
readOnlyHint, it will match this filter. -
Optional<List<String>> toolNamesList of allowed tool names.
-
-
Optional<Never> neverA filter object to specify which tools are allowed.
-
Optional<Boolean> readOnlyIndicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with
readOnlyHint, it will match this filter. -
Optional<List<String>> toolNamesList of allowed tool names.
-
-
-
enum McpToolApprovalSetting:Specify a single approval policy for all tools. One of
always or never. When set to always, all tools will require approval. When set to never, no tools will require approval.-
ALWAYS("always") -
NEVER("never")
-
-
-
Optional<String> serverDescriptionOptional description of the MCP server, used to provide more context.
-
Optional<String> serverUrlThe URL for the MCP server. One of
server_url or connector_id must be provided.
-
-
-
Optional<RealtimeTracingConfig> tracingRealtime API can write session traces to the Traces Dashboard. Set to null to disable tracing. Once tracing is enabled for a session, the configuration cannot be modified.
auto will create a trace for the session with default values for the workflow name, group id, and metadata.-
JsonValue;AUTO("auto")
-
TracingConfiguration-
Optional<String> groupIdThe group id to attach to this trace to enable filtering and grouping in the Traces Dashboard.
-
Optional<JsonValue> metadataThe arbitrary metadata to attach to this trace to enable filtering in the Traces Dashboard.
-
Optional<String> workflowNameThe name of the workflow to attach to this trace. This is used to name the trace in the Traces Dashboard.
-
-
-
Optional<RealtimeTruncation> truncationWhen the number of tokens in a conversation exceeds the model's input token limit, the conversation will be truncated, meaning messages (starting from the oldest) will not be included in the model's context. A 32k context model with 4,096 max output tokens can only include 28,224 tokens in the context before truncation occurs.
Clients can configure truncation behavior to truncate with a lower max token limit, which is an effective way to control token usage and cost.
Truncation will reduce the number of cached tokens on the next turn (busting the cache), since messages are dropped from the beginning of the context. However, clients can also configure truncation to retain messages up to a fraction of the maximum context size, which will reduce the need for future truncations and thus improve the cache rate.
Truncation can be disabled entirely, which means the server will never truncate but would instead return an error if the conversation exceeds the model's input token limit.
-
RealtimeTruncationStrategy-
AUTO("auto") -
DISABLED("disabled")
-
-
class RealtimeTruncationRetentionRatio:Retain a fraction of the conversation tokens when the conversation exceeds the input token limit. This allows you to amortize truncations across multiple turns, which can help improve cached token usage.
-
double retentionRatioFraction of post-instruction conversation tokens to retain (
0.0-1.0) when the conversation exceeds the input token limit. Setting this to 0.8 means that messages will be dropped until 80% of the maximum allowed tokens are used. This helps reduce the frequency of truncations and improve cache rates. -
JsonValue; type "retention_ratio"constantUse retention ratio truncation.
RETENTION_RATIO("retention_ratio")
-
Optional<TokenLimits> tokenLimitsOptional custom token limits for this truncation strategy. If not provided, the model's default token limits will be used.
-
Optional<Long> postInstructionsMaximum tokens allowed in the conversation after instructions (including tool definitions). For example, setting this to 5,000 would mean that truncation would occur when the conversation exceeds 5,000 tokens after instructions. This cannot be higher than the model's context window size minus the maximum output tokens.
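The retention-ratio behaviour described above can be modelled in a few lines of plain Java. This is a sketch under stated assumptions, not the server's algorithm: the class `RetentionRatioSketch`, the method `truncate`, and the per-message token counts are hypothetical, and the real service may count tokens or choose the drop point differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of retention-ratio truncation; not the server's implementation.
final class RetentionRatioSketch {

    /**
     * Once the conversation exceeds {@code limit}, drops the oldest messages
     * until the remaining total is at most {@code retentionRatio * limit}.
     *
     * @param messageTokens  token count per message, oldest first (assumed input)
     * @param limit          assumed post-instruction token limit
     * @param retentionRatio fraction of the limit to retain, 0.0 to 1.0
     */
    static List<Integer> truncate(List<Integer> messageTokens, long limit, double retentionRatio) {
        if (retentionRatio < 0.0 || retentionRatio > 1.0) {
            throw new IllegalArgumentException("retentionRatio must be in [0.0, 1.0]");
        }
        List<Integer> kept = new ArrayList<>(messageTokens);
        long total = 0;
        for (int tokens : kept) {
            total += tokens;
        }
        if (total <= limit) {
            return kept; // under the limit: nothing is dropped
        }
        long target = (long) Math.floor(limit * retentionRatio);
        while (!kept.isEmpty() && total > target) {
            total -= kept.remove(0); // drop the oldest message first
        }
        return kept;
    }

    public static void main(String[] args) {
        // 5,300 tokens against a 5,000 limit with ratio 0.8: trim down to at most 4,000.
        System.out.println(truncate(List.of(2000, 1500, 1000, 800), 5000, 0.8));
    }
}
```

Because the sketch trims below the limit rather than exactly to it, the next few turns can append messages without triggering another truncation, which is the amortization the description refers to.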
-
-
-
-
-
class RealtimeTranscriptionSessionCreateRequest:Realtime transcription session object configuration.
-
JsonValue; type "transcription"constantThe type of session to create. Always
transcription for transcription sessions. TRANSCRIPTION("transcription")
-
Optional<RealtimeTranscriptionSessionAudio> audioConfiguration for input and output audio.
-
Optional<RealtimeTranscriptionSessionAudioInput> input-
Optional<RealtimeAudioFormats> formatThe PCM audio format. Only a 24kHz sample rate is supported.
-
Optional<NoiseReduction> noiseReductionConfiguration for input audio noise reduction. This can be set to
null to turn off. Noise reduction filters audio added to the input audio buffer before it is sent to VAD and the model. Filtering the audio can improve VAD and turn detection accuracy (reducing false positives) and model performance by improving perception of the input audio.-
Optional<NoiseReductionType> typeType of noise reduction.
near_field is for close-talking microphones such as headphones; far_field is for far-field microphones such as laptop or conference room microphones.
-
-
Optional<AudioTranscription> transcriptionConfiguration for input audio transcription, defaults to off and can be set to
null to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through the /audio/transcriptions endpoint and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service. -
Optional<RealtimeTranscriptionSessionAudioInputTurnDetection> turnDetectionConfiguration for turn detection, either Server VAD or Semantic VAD. This can be set to
null to turn off, in which case the client must manually trigger model response. Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.
Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.
For
gpt-realtime-whisper transcription sessions, turn detection must be set to null; VAD is not supported.-
ServerVad-
JsonValue; type "server_vad"constantType of turn detection,
server_vad to turn on simple Server VAD. SERVER_VAD("server_vad")
-
Optional<Boolean> createResponseWhether or not to automatically generate a response when a VAD stop event occurs. If
interrupt_response is set to false, this may fail to create a response if the model is already responding. If both
create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted. -
Optional<Long> idleTimeoutMsOptional timeout after which a model response will be triggered automatically. This is useful for situations in which a long pause from the user is unexpected, such as a phone call. The model will effectively prompt the user to continue the conversation based on the current context.
The timeout value will be applied after the last model response's audio has finished playing, i.e. it's set to the
response.done time plus audio playback duration. An
input_audio_buffer.timeout_triggered event (plus events associated with the Response) will be emitted when the timeout is reached. Idle timeout is currently only supported for server_vad mode. -
Optional<Boolean> interruptResponseWhether or not to automatically interrupt (cancel) any ongoing response with output to the default conversation (i.e.
conversation of auto) when a VAD start event occurs. If true, the response will be cancelled; otherwise it will continue until complete. If both
create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted. -
Optional<Long> prefixPaddingMsUsed only for
server_vad mode. Amount of audio to include before the VAD detected speech (in milliseconds). Defaults to 300ms. -
Optional<Long> silenceDurationMsUsed only for
server_vad mode. Duration of silence to detect speech stop (in milliseconds). Defaults to 500ms. With shorter values the model will respond more quickly, but may jump in on short pauses from the user. -
Optional<Double> thresholdUsed only for
server_vad mode. Activation threshold for VAD (0.0 to 1.0); this defaults to 0.5. A higher threshold will require louder audio to activate the model, and thus might perform better in noisy environments.
-
-
SemanticVad-
JsonValue; type "semantic_vad"constantType of turn detection,
semantic_vad to turn on Semantic VAD. SEMANTIC_VAD("semantic_vad")
-
Optional<Boolean> createResponseWhether or not to automatically generate a response when a VAD stop event occurs.
-
Optional<Eagerness> eagernessUsed only for
semantic_vad mode. The eagerness of the model to respond. low will wait longer for the user to continue speaking; high will respond more quickly. auto is the default and is equivalent to medium. low, medium, and high have max timeouts of 8s, 4s, and 2s respectively.-
LOW("low") -
MEDIUM("medium") -
HIGH("high") -
AUTO("auto")
-
-
Optional<Boolean> interruptResponseWhether or not to automatically interrupt any ongoing response with output to the default conversation (i.e.
conversation of auto) when a VAD start event occurs.
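How the Server VAD threshold and silence duration interact can be illustrated with a small simulation. This is only a sketch: the class `ServerVadSketch`, the fixed frame length, and the normalized per-frame levels are hypothetical, prefix padding is not modelled, and the server's actual detector is not documented here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical volume-threshold detector; not the server's VAD.
final class ServerVadSketch {

    /**
     * Scans per-frame audio levels (0.0 to 1.0) and reports speech start and
     * stop offsets in milliseconds, e.g. "start@100" and "stop@1000".
     */
    static List<String> detect(double[] levels, int frameMs, double threshold, int silenceDurationMs) {
        List<String> events = new ArrayList<>();
        boolean speaking = false;
        int silentMs = 0;
        for (int i = 0; i < levels.length; i++) {
            int frameStartMs = i * frameMs;
            boolean loud = levels[i] >= threshold;
            if (!speaking && loud) {
                speaking = true;
                silentMs = 0;
                events.add("start@" + frameStartMs);
            } else if (speaking && loud) {
                silentMs = 0; // speech resumed before the silence window elapsed
            } else if (speaking) {
                silentMs += frameMs;
                if (silentMs >= silenceDurationMs) {
                    speaking = false;
                    events.add("stop@" + (frameStartMs + frameMs));
                }
            }
        }
        return events;
    }

    public static void main(String[] args) {
        double[] levels = {0.1, 0.7, 0.8, 0.2, 0.6, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1};
        System.out.println(detect(levels, 100, 0.5, 500)); // one turn despite the short dip
        System.out.println(detect(levels, 100, 0.5, 100)); // the dip splits the turn in two
    }
}
```

With the 500 ms window the brief dip at 300 ms is absorbed into a single turn, whereas the 100 ms window ends the turn early, which mirrors the trade-off described for shorter silence durations.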
-
-
-
-
-
Optional<List<Include>> includeAdditional fields to include in server outputs.
item.input_audio_transcription.logprobs: Include logprobs for input audio transcription. ITEM_INPUT_AUDIO_TRANSCRIPTION_LOGPROBS("item.input_audio_transcription.logprobs")
-
-
-
JsonValue; type "session.created"constantThe event type, must be
session.created. SESSION_CREATED("session.created")
-
Session Update Event
-
class SessionUpdateEvent:Send this event to update the session’s configuration. The client may send this event at any time to update any field except for
voice and model. voice can be updated only if there have been no other audio outputs yet. When the server receives a
session.update, it will respond with a session.updated event showing the full, effective configuration. Only the fields that are present in the session.update are updated. To clear a field like instructions, pass an empty string. To clear a field like tools, pass an empty array. To clear a field like turn_detection, pass null.-
Session sessionUpdate the Realtime session. Choose either a realtime session or a transcription session.
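The clearing rules for session.update (empty string, empty array, null) can be shown as a raw JSON payload. This is a sketch with assumptions: the class `SessionUpdateClearSketch` is hypothetical, the snake_case wire names are inferred from the identifiers quoted in this reference, and placing turn_detection under audio.input mirrors the field nesting shown here rather than a confirmed wire format.

```java
// Hypothetical helper that builds a session.update payload by hand; not an SDK call.
final class SessionUpdateClearSketch {

    /** Returns a JSON event that clears instructions, tools, and turn detection. */
    static String clearPayload() {
        return String.join("\n",
            "{",
            "  \"type\": \"session.update\",",
            "  \"session\": {",
            "    \"type\": \"realtime\",",
            "    \"instructions\": \"\",",   // empty string clears instructions
            "    \"tools\": [],",            // empty array clears tools
            "    \"audio\": { \"input\": { \"turn_detection\": null } }", // null clears turn detection
            "  }",
            "}");
    }

    public static void main(String[] args) {
        System.out.println(clearPayload());
    }
}
```

Fields omitted from the payload are left unchanged, since only the fields present in the event are updated.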
-
class RealtimeSessionCreateRequest:Realtime session object configuration.
-
JsonValue; type "realtime"constantThe type of session to create. Always
realtime for the Realtime API. REALTIME("realtime")
-
Optional<RealtimeAudioConfig> audioConfiguration for input and output audio.
-
Optional<RealtimeAudioConfigInput> input-
Optional<RealtimeAudioFormats> formatThe format of the input audio.
-
AudioPcm-
Optional<Rate> rateThe sample rate of the audio. Always
24000. _24000(24000)
-
Optional<Type> typeThe audio format. Always
audio/pcm. AUDIO_PCM("audio/pcm")
-
-
AudioPcmu-
Optional<Type> typeThe audio format. Always
audio/pcmu. AUDIO_PCMU("audio/pcmu")
-
-
AudioPcma-
Optional<Type> typeThe audio format. Always
audio/pcma. AUDIO_PCMA("audio/pcma")
-
-
-
Optional<NoiseReduction> noiseReductionConfiguration for input audio noise reduction. This can be set to
null to turn off. Noise reduction filters audio added to the input audio buffer before it is sent to VAD and the model. Filtering the audio can improve VAD and turn detection accuracy (reducing false positives) and model performance by improving perception of the input audio.-
Optional<NoiseReductionType> typeType of noise reduction.
near_field is for close-talking microphones such as headphones; far_field is for far-field microphones such as laptop or conference room microphones.-
NEAR_FIELD("near_field") -
FAR_FIELD("far_field")
-
-
-
Optional<AudioTranscription> transcriptionConfiguration for input audio transcription, defaults to off and can be set to
null to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through the /audio/transcriptions endpoint and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service.-
Optional<Delay> delayControls how long the model waits before emitting transcription text. Higher values can improve transcription accuracy at the cost of latency. Only supported with
gpt-realtime-whisper in GA Realtime sessions.-
MINIMAL("minimal") -
LOW("low") -
MEDIUM("medium") -
HIGH("high") -
XHIGH("xhigh")
-
-
Optional<String> languageThe language of the input audio. Supplying the input language in ISO-639-1 (e.g.
en) format will improve accuracy and latency. -
Optional<Model> modelThe model to use for transcription. Current options are
whisper-1, gpt-4o-mini-transcribe, gpt-4o-mini-transcribe-2025-12-15, gpt-4o-transcribe, gpt-4o-transcribe-diarize, and gpt-realtime-whisper. Use gpt-4o-transcribe-diarize when you need diarization with speaker labels.-
WHISPER_1("whisper-1") -
GPT_4O_MINI_TRANSCRIBE("gpt-4o-mini-transcribe") -
GPT_4O_MINI_TRANSCRIBE_2025_12_15("gpt-4o-mini-transcribe-2025-12-15") -
GPT_4O_TRANSCRIBE("gpt-4o-transcribe") -
GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize") -
GPT_REALTIME_WHISPER("gpt-realtime-whisper")
-
-
Optional<String> promptAn optional text to guide the model's style or continue a previous audio segment. For
whisper-1, the prompt is a list of keywords. For gpt-4o-transcribe models (excluding gpt-4o-transcribe-diarize), the prompt is a free text string, for example "expect words related to technology". Prompt is not supported with gpt-realtime-whisper in GA Realtime sessions.
-
-
Optional<RealtimeAudioInputTurnDetection> turnDetectionConfiguration for turn detection, either Server VAD or Semantic VAD. This can be set to
null to turn off, in which case the client must manually trigger model response. Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.
Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.
For
gpt-realtime-whisper transcription sessions, turn detection must be set to null; VAD is not supported.-
ServerVad-
JsonValue; type "server_vad"constantType of turn detection,
server_vad to turn on simple Server VAD. SERVER_VAD("server_vad")
-
Optional<Boolean> createResponseWhether or not to automatically generate a response when a VAD stop event occurs. If
interrupt_response is set to false, this may fail to create a response if the model is already responding. If both
create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted. -
Optional<Long> idleTimeoutMsOptional timeout after which a model response will be triggered automatically. This is useful for situations in which a long pause from the user is unexpected, such as a phone call. The model will effectively prompt the user to continue the conversation based on the current context.
The timeout value will be applied after the last model response's audio has finished playing, i.e. it's set to the
response.done time plus audio playback duration. An
input_audio_buffer.timeout_triggered event (plus events associated with the Response) will be emitted when the timeout is reached. Idle timeout is currently only supported for server_vad mode. -
Optional<Boolean> interruptResponseWhether or not to automatically interrupt (cancel) any ongoing response with output to the default conversation (i.e.
conversation of auto) when a VAD start event occurs. If true, the response will be cancelled; otherwise it will continue until complete. If both
create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted. -
Optional<Long> prefixPaddingMsUsed only for
server_vad mode. Amount of audio to include before the VAD detected speech (in milliseconds). Defaults to 300ms. -
Optional<Long> silenceDurationMsUsed only for
server_vad mode. Duration of silence to detect speech stop (in milliseconds). Defaults to 500ms. With shorter values the model will respond more quickly, but may jump in on short pauses from the user. -
Optional<Double> thresholdUsed only for
server_vad mode. Activation threshold for VAD (0.0 to 1.0); this defaults to 0.5. A higher threshold will require louder audio to activate the model, and thus might perform better in noisy environments.
-
-
SemanticVad-
JsonValue; type "semantic_vad"constantType of turn detection,
semantic_vad to turn on Semantic VAD. SEMANTIC_VAD("semantic_vad")
-
Optional<Boolean> createResponseWhether or not to automatically generate a response when a VAD stop event occurs.
-
Optional<Eagerness> eagernessUsed only for
semantic_vad mode. The eagerness of the model to respond. low will wait longer for the user to continue speaking; high will respond more quickly. auto is the default and is equivalent to medium. low, medium, and high have max timeouts of 8s, 4s, and 2s respectively.-
LOW("low") -
MEDIUM("medium") -
HIGH("high") -
AUTO("auto")
-
-
Optional<Boolean> interruptResponseWhether or not to automatically interrupt any ongoing response with output to the default conversation (i.e.
conversation of auto) when a VAD start event occurs.
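The Semantic VAD eagerness levels and their documented maximum timeouts can be captured in a small lookup. This is a sketch only: `EagernessSketch` and `maxTimeoutSeconds` are hypothetical names, and the values are simply those stated in the eagerness description (auto treated as medium).

```java
// Hypothetical lookup of the documented maximum timeouts per eagerness level.
final class EagernessSketch {

    enum Eagerness { LOW, MEDIUM, HIGH, AUTO }

    /** Maximum wait in seconds before the turn is considered finished. */
    static int maxTimeoutSeconds(Eagerness eagerness) {
        switch (eagerness) {
            case LOW:
                return 8;
            case HIGH:
                return 2;
            case MEDIUM:
            case AUTO: // documented as equivalent to medium
            default:
                return 4;
        }
    }

    public static void main(String[] args) {
        for (Eagerness e : Eagerness.values()) {
            System.out.println(e + " -> " + maxTimeoutSeconds(e) + "s");
        }
    }
}
```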
-
-
-
-
Optional<RealtimeAudioConfigOutput> output-
Optional<RealtimeAudioFormats> formatThe format of the output audio.
-
Optional<Double> speedThe speed of the model's spoken response as a multiple of the original speed. 1.0 is the default speed. 0.25 is the minimum speed. 1.5 is the maximum speed. This value can only be changed in between model turns, not while a response is in progress.
This parameter is a post-processing adjustment to the audio after it is generated; it's also possible to prompt the model to speak faster or slower.
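The documented speed range can be checked on the client before it is sent. The sketch below is illustrative: `OutputSpeedSketch` and its methods are hypothetical, and the playback estimate assumes that duration scales inversely with speed, which follows from "a multiple of the original speed" but is not stated explicitly in this reference.

```java
// Hypothetical client-side check for the output speed range (0.25 to 1.5).
final class OutputSpeedSketch {

    static final double MIN_SPEED = 0.25;
    static final double MAX_SPEED = 1.5;

    /** Returns the speed unchanged if it lies in the documented range. */
    static double requireValidSpeed(double speed) {
        if (Double.isNaN(speed) || speed < MIN_SPEED || speed > MAX_SPEED) {
            throw new IllegalArgumentException(
                "speed must be between " + MIN_SPEED + " and " + MAX_SPEED + ", got " + speed);
        }
        return speed;
    }

    /** Estimated playback time of a clip after the speed adjustment (assumed inverse scaling). */
    static long playbackMillis(long originalMillis, double speed) {
        return Math.round(originalMillis / requireValidSpeed(speed));
    }

    public static void main(String[] args) {
        System.out.println(playbackMillis(6000, 1.0));  // unchanged
        System.out.println(playbackMillis(6000, 1.5));  // fastest allowed
        System.out.println(playbackMillis(6000, 0.25)); // slowest allowed
    }
}
```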
-
Optional<Voice> voiceThe voice the model uses to respond. Supported built-in voices are
alloy, ash, ballad, coral, echo, sage, shimmer, verse, marin, and cedar. You may also provide a custom voice object with an id, for example { "id": "voice_1234" }. Voice cannot be changed during the session once the model has responded with audio at least once. We recommend marin and cedar for best quality.-
String -
enum UnionMember1:-
ALLOY("alloy") -
ASH("ash") -
BALLAD("ballad") -
CORAL("coral") -
ECHO("echo") -
SAGE("sage") -
SHIMMER("shimmer") -
VERSE("verse") -
MARIN("marin") -
CEDAR("cedar")
-
-
class Id:Custom voice reference.
-
String idThe custom voice ID, e.g.
voice_1234.
-
-
-
-
-
Optional<List<Include>> includeAdditional fields to include in server outputs.
item.input_audio_transcription.logprobs: Include logprobs for input audio transcription. ITEM_INPUT_AUDIO_TRANSCRIPTION_LOGPROBS("item.input_audio_transcription.logprobs")
-
Optional<String> instructionsThe default system instructions (i.e. system message) prepended to model calls. This field allows the client to guide the model on desired responses. The model can be instructed on response content and format (e.g. "be extremely succinct", "act friendly", "here are examples of good responses") and on audio behavior (e.g. "talk quickly", "inject emotion into your voice", "laugh frequently"). The instructions are not guaranteed to be followed by the model, but they provide guidance to the model on the desired behavior.
Note that the server sets default instructions which will be used if this field is not set and are visible in the
session.created event at the start of the session. -
Optional<MaxOutputTokens> maxOutputTokensMaximum number of output tokens for a single assistant response, inclusive of tool calls. Provide an integer between 1 and 4096 to limit output tokens, or
inf for the maximum available tokens for a given model. Defaults to inf.-
long -
JsonValue;INF("inf")
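Because maxOutputTokens accepts either an integer or the literal inf, client code often needs to normalize it. The sketch below is one way to do that in plain Java; `MaxOutputTokensSketch` and `parse` are hypothetical names, and representing inf as an empty OptionalLong is a design choice of this example, not the SDK's representation.

```java
import java.util.OptionalLong;

// Hypothetical parser for the "integer between 1 and 4096, or inf" union.
final class MaxOutputTokensSketch {

    /** Returns the explicit cap, or an empty value when the input is "inf". */
    static OptionalLong parse(String raw) {
        if ("inf".equals(raw)) {
            return OptionalLong.empty(); // no explicit cap: use the model maximum
        }
        long value;
        try {
            value = Long.parseLong(raw);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("expected an integer or \"inf\", got: " + raw, e);
        }
        if (value < 1 || value > 4096) {
            throw new IllegalArgumentException("maxOutputTokens must be between 1 and 4096");
        }
        return OptionalLong.of(value);
    }

    public static void main(String[] args) {
        System.out.println(parse("inf"));
        System.out.println(parse("1024"));
    }
}
```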
-
-
Optional<Model> modelThe Realtime model used for this session.
-
GPT_REALTIME("gpt-realtime") -
GPT_REALTIME_1_5("gpt-realtime-1.5") -
GPT_REALTIME_2("gpt-realtime-2") -
GPT_REALTIME_2025_08_28("gpt-realtime-2025-08-28") -
GPT_4O_REALTIME_PREVIEW("gpt-4o-realtime-preview") -
GPT_4O_REALTIME_PREVIEW_2024_10_01("gpt-4o-realtime-preview-2024-10-01") -
GPT_4O_REALTIME_PREVIEW_2024_12_17("gpt-4o-realtime-preview-2024-12-17") -
GPT_4O_REALTIME_PREVIEW_2025_06_03("gpt-4o-realtime-preview-2025-06-03") -
GPT_4O_MINI_REALTIME_PREVIEW("gpt-4o-mini-realtime-preview") -
GPT_4O_MINI_REALTIME_PREVIEW_2024_12_17("gpt-4o-mini-realtime-preview-2024-12-17") -
GPT_REALTIME_MINI("gpt-realtime-mini") -
GPT_REALTIME_MINI_2025_10_06("gpt-realtime-mini-2025-10-06") -
GPT_REALTIME_MINI_2025_12_15("gpt-realtime-mini-2025-12-15") -
GPT_AUDIO_1_5("gpt-audio-1.5") -
GPT_AUDIO_MINI("gpt-audio-mini") -
GPT_AUDIO_MINI_2025_10_06("gpt-audio-mini-2025-10-06") -
GPT_AUDIO_MINI_2025_12_15("gpt-audio-mini-2025-12-15")
-
-
Optional<List<OutputModality>> outputModalitiesThe set of modalities the model can respond with. It defaults to
["audio"], indicating that the model will respond with audio plus a transcript. ["text"] can be used to make the model respond with text only. It is not possible to request both text and audio at the same time.-
TEXT("text") -
AUDIO("audio")
-
-
Optional<Boolean> parallelToolCallsWhether the model may call multiple tools in parallel. Only supported by reasoning Realtime models such as
gpt-realtime-2. -
Optional<ResponsePrompt> promptReference to a prompt template and its variables. Learn more.
-
String idThe unique identifier of the prompt template to use.
-
Optional<Variables> variablesOptional map of values to substitute in for variables in your prompt. The substitution values can either be strings, or other Response input types like images or files.
-
String -
class ResponseInputText:A text input to the model.
-
String textThe text input to the model.
-
JsonValue; type "input_text"constantThe type of the input item. Always
input_text. INPUT_TEXT("input_text")
-
-
class ResponseInputImage:An image input to the model. Learn about image inputs.
-
Detail detailThe detail level of the image to be sent to the model. One of
high, low, auto, or original. Defaults to auto.-
LOW("low") -
HIGH("high") -
AUTO("auto") -
ORIGINAL("original")
-
-
JsonValue; type "input_image"constantThe type of the input item. Always
input_image. INPUT_IMAGE("input_image")
-
Optional<String> fileIdThe ID of the file to be sent to the model.
-
Optional<String> imageUrlThe URL of the image to be sent to the model. A fully qualified URL or base64 encoded image in a data URL.
-
-
class ResponseInputFile:A file input to the model.
-
JsonValue; type "input_file"constantThe type of the input item. Always
input_file. INPUT_FILE("input_file")
-
Optional<Detail> detailThe detail level of the file to be sent to the model. Use
low for the default rendering behavior, or high to render the file at higher quality. Defaults to low.-
LOW("low") -
HIGH("high")
-
-
Optional<String> fileDataThe content of the file to be sent to the model.
-
Optional<String> fileIdThe ID of the file to be sent to the model.
-
Optional<String> fileUrlThe URL of the file to be sent to the model.
-
Optional<String> filenameThe name of the file to be sent to the model.
-
-
-
Optional<String> versionOptional version of the prompt template.
-
-
Optional<RealtimeReasoning> reasoningConfiguration for reasoning-capable Realtime models such as
gpt-realtime-2.-
Optional<RealtimeReasoningEffort> effortConstrains effort on reasoning for reasoning-capable Realtime models such as
gpt-realtime-2.-
MINIMAL("minimal") -
LOW("low") -
MEDIUM("medium") -
HIGH("high") -
XHIGH("xhigh")
-
-
-
Optional<RealtimeToolChoiceConfig> toolChoiceHow the model chooses tools. Provide one of the string modes or force a specific function/MCP tool.
-
enum ToolChoiceOptions:Controls which (if any) tool is called by the model.
none means the model will not call any tool and instead generates a message. auto means the model can pick between generating a message or calling one or more tools. required means the model must call one or more tools.-
NONE("none") -
AUTO("auto") -
REQUIRED("required")
-
-
class ToolChoiceFunction:Use this option to force the model to call a specific function.
-
String nameThe name of the function to call.
-
JsonValue; type "function"constantFor function calling, the type is always
function. FUNCTION("function")
-
-
class ToolChoiceMcp:Use this option to force the model to call a specific tool on a remote MCP server.
-
String serverLabelThe label of the MCP server to use.
-
JsonValue; type "mcp"constantFor MCP tools, the type is always
mcp. MCP("mcp")
-
Optional<String> nameThe name of the tool to call on the server.
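The three toolChoice shapes described above (a string mode, a forced function, a forced MCP tool) can be sketched as raw JSON values. The class `ToolChoiceSketch` and its methods are hypothetical, and the snake_case key server_label is an assumption inferred from the other wire names quoted in this reference.

```java
// Hypothetical builders for the three toolChoice shapes; not SDK calls.
final class ToolChoiceSketch {

    /** Minimal JSON string quoting (backslashes and double quotes only). */
    private static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    /** One of the string modes: none, auto, or required. */
    static String mode(String mode) {
        if (!mode.equals("none") && !mode.equals("auto") && !mode.equals("required")) {
            throw new IllegalArgumentException("unknown mode: " + mode);
        }
        return quote(mode);
    }

    /** Forces a call to a specific function. */
    static String function(String name) {
        return "{\"type\":\"function\",\"name\":" + quote(name) + "}";
    }

    /** Forces a tool on a remote MCP server; toolName may be null to omit it. */
    static String mcp(String serverLabel, String toolName) {
        StringBuilder json = new StringBuilder("{\"type\":\"mcp\",\"server_label\":")
            .append(quote(serverLabel));
        if (toolName != null) {
            json.append(",\"name\":").append(quote(toolName));
        }
        return json.append("}").toString();
    }

    public static void main(String[] args) {
        System.out.println(mode("auto"));
        System.out.println(function("get_weather")); // hypothetical function name
        System.out.println(mcp("docs", "search"));   // hypothetical server label and tool
    }
}
```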
-
-
-
Optional<List<RealtimeToolsConfigUnion>> toolsTools available to the model.
-
class RealtimeFunctionTool:-
Optional<String> descriptionThe description of the function, including guidance on when and how to call it, and guidance about what to tell the user when calling (if anything).
-
Optional<String> nameThe name of the function.
-
Optional<JsonValue> parametersParameters of the function in JSON Schema.
-
Optional<Type> typeThe type of the tool, i.e.
function. FUNCTION("function")
-
-
Mcp-
String serverLabelA label for this MCP server, used to identify it in tool calls.
-
JsonValue; type "mcp"constantThe type of the MCP tool. Always
mcp. MCP("mcp")
-
Optional<AllowedTools> allowedToolsList of allowed tool names or a filter object.
-
List<String> -
class McpToolFilter:A filter object to specify which tools are allowed.
-
Optional<Boolean> readOnlyIndicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with
readOnlyHint, it will match this filter. -
Optional<List<String>> toolNamesList of allowed tool names.
-
-
-
Optional<String> authorizationAn OAuth access token that can be used with a remote MCP server, either with a custom MCP server URL or a service connector. Your application must handle the OAuth authorization flow and provide the token here.
-
Optional<ConnectorId> connectorIdIdentifier for service connectors, like those available in ChatGPT. One of
server_url or connector_id must be provided. Learn more about service connectors here. Currently supported
connector_id values are:-
Dropbox:
connector_dropbox -
Gmail:
connector_gmail -
Google Calendar:
connector_googlecalendar -
Google Drive:
connector_googledrive -
Microsoft Teams:
connector_microsoftteams -
Outlook Calendar:
connector_outlookcalendar -
Outlook Email:
connector_outlookemail -
SharePoint:
connector_sharepoint -
CONNECTOR_DROPBOX("connector_dropbox") -
CONNECTOR_GMAIL("connector_gmail") -
CONNECTOR_GOOGLECALENDAR("connector_googlecalendar") -
CONNECTOR_GOOGLEDRIVE("connector_googledrive") -
CONNECTOR_MICROSOFTTEAMS("connector_microsoftteams") -
CONNECTOR_OUTLOOKCALENDAR("connector_outlookcalendar") -
CONNECTOR_OUTLOOKEMAIL("connector_outlookemail") -
CONNECTOR_SHAREPOINT("connector_sharepoint")
-
-
Optional<Boolean> deferLoadingWhether this MCP tool is deferred and discovered via tool search.
-
Optional<Headers> headersOptional HTTP headers to send to the MCP server. Use for authentication or other purposes.
-
Optional<RequireApproval> requireApprovalSpecify which of the MCP server's tools require approval.
-
class McpToolApprovalFilter:Specify which of the MCP server's tools require approval. Can be
always, never, or a filter object associated with tools that require approval.-
Optional<Always> alwaysA filter object to specify which tools are allowed.
-
Optional<Boolean> readOnlyIndicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with
readOnlyHint, it will match this filter. -
Optional<List<String>> toolNamesList of allowed tool names.
-
-
Optional<Never> neverA filter object to specify which tools are allowed.
-
Optional<Boolean> readOnlyIndicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with
readOnlyHint, it will match this filter. -
Optional<List<String>> toolNamesList of allowed tool names.
-
-
-
enum McpToolApprovalSetting:Specify a single approval policy for all tools. One of
always or never. When set to always, all tools will require approval. When set to never, no tools will require approval.-
ALWAYS("always") -
NEVER("never")
-
-
-
Optional<String> serverDescriptionOptional description of the MCP server, used to provide more context.
-
Optional<String> serverUrlThe URL for the MCP server. One of
server_url or connector_id must be provided.
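The constraints on an MCP tool entry (a required server label, plus one of server_url or connector_id) can be enforced before the configuration is sent. The sketch below is a plain-Java stand-in, not the SDK type: `McpToolSketch` and its members are hypothetical, only the single-policy approval setting is modelled, and it treats "one of" as "at least one" because this reference does not say whether both may be set together.

```java
import java.util.Objects;

// Hypothetical value object that validates an MCP tool entry; not the SDK class.
final class McpToolSketch {

    enum ApprovalSetting { ALWAYS, NEVER }

    final String serverLabel;
    final String serverUrl;   // may be null when connectorId is set
    final String connectorId; // may be null when serverUrl is set
    final ApprovalSetting approval;

    McpToolSketch(String serverLabel, String serverUrl, String connectorId, ApprovalSetting approval) {
        if (serverLabel == null || serverLabel.isEmpty()) {
            throw new IllegalArgumentException("serverLabel is required");
        }
        if (serverUrl == null && connectorId == null) {
            throw new IllegalArgumentException("one of serverUrl or connectorId must be provided");
        }
        this.serverLabel = serverLabel;
        this.serverUrl = serverUrl;
        this.connectorId = connectorId;
        this.approval = Objects.requireNonNull(approval, "approval");
    }

    /** With a single policy, every tool on the server shares the same answer. */
    boolean requiresApproval() {
        return approval == ApprovalSetting.ALWAYS;
    }

    public static void main(String[] args) {
        McpToolSketch gmail = new McpToolSketch("mail", null, "connector_gmail", ApprovalSetting.ALWAYS);
        System.out.println(gmail.serverLabel + " requires approval: " + gmail.requiresApproval());
    }
}
```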
-
-
-
Optional<RealtimeTracingConfig> tracingRealtime API can write session traces to the Traces Dashboard. Set to null to disable tracing. Once tracing is enabled for a session, the configuration cannot be modified.
auto will create a trace for the session with default values for the workflow name, group id, and metadata.-
JsonValue;AUTO("auto")
-
TracingConfiguration-
Optional<String> groupIdThe group id to attach to this trace to enable filtering and grouping in the Traces Dashboard.
-
Optional<JsonValue> metadataThe arbitrary metadata to attach to this trace to enable filtering in the Traces Dashboard.
-
Optional<String> workflowNameThe name of the workflow to attach to this trace. This is used to name the trace in the Traces Dashboard.
-
-
-
Optional<RealtimeTruncation> truncationWhen the number of tokens in a conversation exceeds the model's input token limit, the conversation will be truncated, meaning messages (starting from the oldest) will not be included in the model's context. A 32k context model with 4,096 max output tokens can only include 28,224 tokens in the context before truncation occurs.
Clients can configure truncation behavior to truncate with a lower max token limit, which is an effective way to control token usage and cost.
Truncation will reduce the number of cached tokens on the next turn (busting the cache), since messages are dropped from the beginning of the context. However, clients can also configure truncation to retain messages up to a fraction of the maximum context size, which will reduce the need for future truncations and thus improve the cache rate.
Truncation can be disabled entirely, which means the server will never truncate but would instead return an error if the conversation exceeds the model's input token limit.
-
RealtimeTruncationStrategy-
AUTO("auto") -
DISABLED("disabled")
-
-
class RealtimeTruncationRetentionRatio:Retain a fraction of the conversation tokens when the conversation exceeds the input token limit. This allows you to amortize truncations across multiple turns, which can help improve cached token usage.
-
double retentionRatioFraction of post-instruction conversation tokens to retain (
0.0-1.0) when the conversation exceeds the input token limit. Setting this to 0.8 means that messages will be dropped until 80% of the maximum allowed tokens are used. This helps reduce the frequency of truncations and improve cache rates. -
JsonValue; type "retention_ratio"constantUse retention ratio truncation.
RETENTION_RATIO("retention_ratio")
-
Optional<TokenLimits> tokenLimitsOptional custom token limits for this truncation strategy. If not provided, the model's default token limits will be used.
-
Optional<Long> postInstructionsMaximum tokens allowed in the conversation after instructions (including tool definitions). For example, setting this to 5,000 would mean that truncation would occur when the conversation exceeds 5,000 tokens after instructions. This cannot be higher than the model's context window size minus the maximum output tokens.
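The ceiling on postInstructions and the point at which truncation starts reduce to simple arithmetic. The sketch below uses made-up numbers for the context window and output cap; `TokenLimitSketch` and its methods are hypothetical, and any overhead the server reserves beyond the output tokens is not modelled.

```java
// Hypothetical arithmetic for a custom postInstructions limit; numbers are illustrative.
final class TokenLimitSketch {

    /** Upper bound for postInstructions: context window minus maximum output tokens. */
    static long maxPostInstructions(long contextWindow, long maxOutputTokens) {
        return contextWindow - maxOutputTokens;
    }

    /** Returns the limit unchanged if it respects the upper bound. */
    static long requireValidPostInstructions(long postInstructions, long contextWindow, long maxOutputTokens) {
        long ceiling = maxPostInstructions(contextWindow, maxOutputTokens);
        if (postInstructions < 1 || postInstructions > ceiling) {
            throw new IllegalArgumentException("postInstructions must be between 1 and " + ceiling);
        }
        return postInstructions;
    }

    /** Truncation starts once the conversation exceeds the limit. */
    static boolean shouldTruncate(long conversationTokens, long postInstructions) {
        return conversationTokens > postInstructions;
    }

    public static void main(String[] args) {
        long limit = requireValidPostInstructions(5_000, 16_000, 2_000); // assumed model sizes
        System.out.println(shouldTruncate(5_000, limit)); // at the limit: no truncation
        System.out.println(shouldTruncate(5_200, limit)); // over the limit: truncate
    }
}
```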
-
-
-
-
-
class RealtimeTranscriptionSessionCreateRequest:Realtime transcription session object configuration.
-
JsonValue; type "transcription"constantThe type of session to create. Always
transcription for transcription sessions. TRANSCRIPTION("transcription")
-
Optional<RealtimeTranscriptionSessionAudio> audioConfiguration for input and output audio.
-
Optional<RealtimeTranscriptionSessionAudioInput> input-
Optional<RealtimeAudioFormats> formatThe PCM audio format. Only a 24kHz sample rate is supported.
-
Optional<NoiseReduction> noiseReductionConfiguration for input audio noise reduction. This can be set to
null to turn off. Noise reduction filters audio added to the input audio buffer before it is sent to VAD and the model. Filtering the audio can improve VAD and turn detection accuracy (reducing false positives) and model performance by improving perception of the input audio.-
Optional<NoiseReductionType> typeType of noise reduction.
near_field is for close-talking microphones such as headphones; far_field is for far-field microphones such as laptop or conference room microphones.
-
-
Optional<AudioTranscription> transcriptionConfiguration for input audio transcription, defaults to off and can be set to
null to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through the /audio/transcriptions endpoint and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service. -
Optional<RealtimeTranscriptionSessionAudioInputTurnDetection> turnDetectionConfiguration for turn detection, either Server VAD or Semantic VAD. This can be set to
null to turn off, in which case the client must manually trigger a model response. Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.
Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.
For
gpt-realtime-whispertranscription sessions, turn detection must be set tonull; VAD is not supported.-
ServerVad-
JsonValue; type "server_vad"constantType of turn detection,
server_vadto turn on simple Server VAD.SERVER_VAD("server_vad")
-
Optional<Boolean> createResponseWhether or not to automatically generate a response when a VAD stop event occurs. If
interrupt_responseis set tofalsethis may fail to create a response if the model is already responding.If both
create_responseandinterrupt_responseare set tofalse, the model will never respond automatically but VAD events will still be emitted. -
Optional<Long> idleTimeoutMsOptional timeout after which a model response will be triggered automatically. This is useful for situations in which a long pause from the user is unexpected, such as a phone call. The model will effectively prompt the user to continue the conversation based on the current context.
The timeout value will be applied after the last model response's audio has finished playing, i.e. it's set to the
response.donetime plus audio playback duration.An
input_audio_buffer.timeout_triggeredevent (plus events associated with the Response) will be emitted when the timeout is reached. Idle timeout is currently only supported forserver_vadmode. -
Optional<Boolean> interruptResponseWhether or not to automatically interrupt (cancel) any ongoing response with output to the default conversation (i.e.
conversationofauto) when a VAD start event occurs. Iftruethen the response will be cancelled, otherwise it will continue until complete.If both
create_responseandinterrupt_responseare set tofalse, the model will never respond automatically but VAD events will still be emitted. -
Optional<Long> prefixPaddingMsUsed only for
server_vadmode. Amount of audio to include before the VAD detected speech (in milliseconds). Defaults to 300ms. -
Optional<Long> silenceDurationMsUsed only for
server_vadmode. Duration of silence to detect speech stop (in milliseconds). Defaults to 500ms. With shorter values the model will respond more quickly, but may jump in on short pauses from the user. -
Optional<Double> thresholdUsed only for
server_vadmode. Activation threshold for VAD (0.0 to 1.0), this defaults to 0.5. A higher threshold will require louder audio to activate the model, and thus might perform better in noisy environments.
-
-
SemanticVad-
JsonValue; type "semantic_vad"constantType of turn detection,
semantic_vadto turn on Semantic VAD.SEMANTIC_VAD("semantic_vad")
-
Optional<Boolean> createResponseWhether or not to automatically generate a response when a VAD stop event occurs.
-
Optional<Eagerness> eagernessUsed only for
semantic_vadmode. The eagerness of the model to respond.lowwill wait longer for the user to continue speaking,highwill respond more quickly.autois the default and is equivalent tomedium.low,medium, andhighhave max timeouts of 8s, 4s, and 2s respectively.-
LOW("low") -
MEDIUM("medium") -
HIGH("high") -
AUTO("auto")
-
-
Optional<Boolean> interruptResponseWhether or not to automatically interrupt any ongoing response with output to the default conversation (i.e.
conversationofauto) when a VAD start event occurs.
-
-
-
-
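The eagerness values and their maximum timeouts described above can be captured in a small lookup. This is only a sketch: `EagernessTimeout` and its methods are hypothetical names, not part of the SDK, and the durations are just the 8s, 4s, and 2s limits stated in the eagerness description, with `auto` treated as `medium`.

```java
import java.time.Duration;

// Hypothetical helper (not an SDK type): documented eagerness values
// paired with the maximum timeouts given in the description above.
enum EagernessTimeout {
    LOW("low", 8),
    MEDIUM("medium", 4),
    HIGH("high", 2),
    AUTO("auto", 4); // "auto" is documented as equivalent to "medium"

    private final String wireValue;
    private final Duration maxTimeout;

    EagernessTimeout(String wireValue, long maxSeconds) {
        this.wireValue = wireValue;
        this.maxTimeout = Duration.ofSeconds(maxSeconds);
    }

    String wireValue() {
        return wireValue;
    }

    Duration maxTimeout() {
        return maxTimeout;
    }

    // Looks up an entry by its string value, e.g. "low".
    static EagernessTimeout fromWire(String value) {
        for (EagernessTimeout e : values()) {
            if (e.wireValue.equals(value)) {
                return e;
            }
        }
        throw new IllegalArgumentException("Unknown eagerness: " + value);
    }

    public static void main(String[] args) {
        for (EagernessTimeout e : values()) {
            System.out.println(e.wireValue() + " -> " + e.maxTimeout().getSeconds() + "s");
        }
    }
}
```

A client could use such a lookup to estimate the longest pause before a turn is closed when choosing between `low` and `high`.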
-
Optional<List<Include>> includeAdditional fields to include in server outputs.
item.input_audio_transcription.logprobs: Include logprobs for input audio transcription.ITEM_INPUT_AUDIO_TRANSCRIPTION_LOGPROBS("item.input_audio_transcription.logprobs")
-
-
-
JsonValue; type "session.update"constantThe event type, must be
session.update.SESSION_UPDATE("session.update")
-
Optional<String> eventIdOptional client-generated ID used to identify this event. This is an arbitrary string that a client may assign. It will be passed back if there is an error with the event, but the corresponding
session.updatedevent will not include it.
-
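As an illustration of the event described above, the sketch below assembles the JSON text of a `session.update` event for a transcription session by hand. It assumes Java 15 or later (text blocks). The class and method names are hypothetical, and the snake_case wire names `event_id`, `audio`, `input`, `transcription`, and `turn_detection` are assumptions inferred from the field names in this reference rather than taken from it. Turn detection is set to null because the reference requires that for `gpt-realtime-whisper`, and no prompt is sent because prompts are not supported with that model.

```java
// Sketch only: builds the JSON text of a session.update client event.
// SessionUpdatePayload is a hypothetical helper, not an SDK class.
class SessionUpdatePayload {

    // eventId and language are inserted verbatim, so callers must pass
    // values that need no JSON escaping (no quotes or backslashes).
    static String transcriptionUpdate(String eventId, String language) {
        return """
            {
              "type": "session.update",
              "event_id": "%s",
              "session": {
                "type": "transcription",
                "audio": {
                  "input": {
                    "transcription": { "model": "gpt-realtime-whisper", "language": "%s" },
                    "turn_detection": null
                  }
                }
              }
            }
            """.formatted(eventId, language);
    }

    public static void main(String[] args) {
        System.out.print(transcriptionUpdate("evt_123", "en"));
    }
}
```

The client-generated `event_id` is echoed back only if the event produces an error, so it is mainly useful for correlating failures.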
Session Updated Event
-
class SessionUpdatedEvent:Returned when a session is updated with a
session.updateevent, unless there is an error.-
String eventIdThe unique ID of the server event.
-
Session sessionThe session configuration.
-
class RealtimeSessionCreateRequest:Realtime session object configuration.
-
JsonValue; type "realtime"constantThe type of session to create. Always
realtimefor the Realtime API.REALTIME("realtime")
-
Optional<RealtimeAudioConfig> audioConfiguration for input and output audio.
-
Optional<RealtimeAudioConfigInput> input-
Optional<RealtimeAudioFormats> formatThe format of the input audio.
-
AudioPcm-
Optional<Rate> rateThe sample rate of the audio. Always
24000._24000(24000)
-
Optional<Type> typeThe audio format. Always
audio/pcm.AUDIO_PCM("audio/pcm")
-
-
AudioPcmu-
Optional<Type> typeThe audio format. Always
audio/pcmu.AUDIO_PCMU("audio/pcmu")
-
-
AudioPcma-
Optional<Type> typeThe audio format. Always
audio/pcma.AUDIO_PCMA("audio/pcma")
-
-
-
Optional<NoiseReduction> noiseReductionConfiguration for input audio noise reduction. This can be set to
nullto turn off. Noise reduction filters audio added to the input audio buffer before it is sent to VAD and the model. Filtering the audio can improve VAD and turn detection accuracy (reducing false positives) and model performance by improving perception of the input audio.-
Optional<NoiseReductionType> typeType of noise reduction.
near_fieldis for close-talking microphones such as headphones,far_fieldis for far-field microphones such as laptop or conference room microphones.-
NEAR_FIELD("near_field") -
FAR_FIELD("far_field")
-
-
-
Optional<AudioTranscription> transcriptionConfiguration for input audio transcription, defaults to off and can be set to
null to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through the /audio/transcriptions endpoint and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service.-
Optional<Delay> delayControls how long the model waits before emitting transcription text. Higher values can improve transcription accuracy at the cost of latency. Only supported with
gpt-realtime-whisperin GA Realtime sessions.-
MINIMAL("minimal") -
LOW("low") -
MEDIUM("medium") -
HIGH("high") -
XHIGH("xhigh")
-
-
Optional<String> languageThe language of the input audio. Supplying the input language in ISO-639-1 (e.g.
en) format will improve accuracy and latency. -
Optional<Model> modelThe model to use for transcription. Current options are
whisper-1,gpt-4o-mini-transcribe,gpt-4o-mini-transcribe-2025-12-15,gpt-4o-transcribe,gpt-4o-transcribe-diarize, andgpt-realtime-whisper. Usegpt-4o-transcribe-diarizewhen you need diarization with speaker labels.-
WHISPER_1("whisper-1") -
GPT_4O_MINI_TRANSCRIBE("gpt-4o-mini-transcribe") -
GPT_4O_MINI_TRANSCRIBE_2025_12_15("gpt-4o-mini-transcribe-2025-12-15") -
GPT_4O_TRANSCRIBE("gpt-4o-transcribe") -
GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize") -
GPT_REALTIME_WHISPER("gpt-realtime-whisper")
-
-
Optional<String> promptAn optional text to guide the model's style or continue a previous audio segment. For
whisper-1, the prompt is a list of keywords. Forgpt-4o-transcribemodels (excludinggpt-4o-transcribe-diarize), the prompt is a free text string, for example "expect words related to technology". Prompt is not supported withgpt-realtime-whisperin GA Realtime sessions.
-
-
Optional<RealtimeAudioInputTurnDetection> turnDetectionConfiguration for turn detection, either Server VAD or Semantic VAD. This can be set to
null to turn off, in which case the client must manually trigger a model response. Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.
Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.
For
gpt-realtime-whispertranscription sessions, turn detection must be set tonull; VAD is not supported.-
ServerVad-
JsonValue; type "server_vad"constantType of turn detection,
server_vadto turn on simple Server VAD.SERVER_VAD("server_vad")
-
Optional<Boolean> createResponseWhether or not to automatically generate a response when a VAD stop event occurs. If
interrupt_responseis set tofalsethis may fail to create a response if the model is already responding.If both
create_responseandinterrupt_responseare set tofalse, the model will never respond automatically but VAD events will still be emitted. -
Optional<Long> idleTimeoutMsOptional timeout after which a model response will be triggered automatically. This is useful for situations in which a long pause from the user is unexpected, such as a phone call. The model will effectively prompt the user to continue the conversation based on the current context.
The timeout value will be applied after the last model response's audio has finished playing, i.e. it's set to the
response.donetime plus audio playback duration.An
input_audio_buffer.timeout_triggeredevent (plus events associated with the Response) will be emitted when the timeout is reached. Idle timeout is currently only supported forserver_vadmode. -
Optional<Boolean> interruptResponseWhether or not to automatically interrupt (cancel) any ongoing response with output to the default conversation (i.e.
conversationofauto) when a VAD start event occurs. Iftruethen the response will be cancelled, otherwise it will continue until complete.If both
create_responseandinterrupt_responseare set tofalse, the model will never respond automatically but VAD events will still be emitted. -
Optional<Long> prefixPaddingMsUsed only for
server_vadmode. Amount of audio to include before the VAD detected speech (in milliseconds). Defaults to 300ms. -
Optional<Long> silenceDurationMsUsed only for
server_vadmode. Duration of silence to detect speech stop (in milliseconds). Defaults to 500ms. With shorter values the model will respond more quickly, but may jump in on short pauses from the user. -
Optional<Double> thresholdUsed only for
server_vadmode. Activation threshold for VAD (0.0 to 1.0), this defaults to 0.5. A higher threshold will require louder audio to activate the model, and thus might perform better in noisy environments.
-
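The idle timeout timing described for Server VAD can be worked through with a little arithmetic. The sketch below follows one reading of that description: the timer starts at the `response.done` time plus the audio playback duration, and the timeout fires `idleTimeoutMs` later. `IdleTimeoutClock` and its parameter names are hypothetical and not part of the SDK.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper (not an SDK type) for reasoning about idle timeouts.
class IdleTimeoutClock {

    // Earliest moment the idle timeout can fire: the timer starts once the
    // last response's audio has finished playing, then runs for idleTimeoutMs.
    static Instant timeoutFiresAt(Instant responseDone, Duration playback, long idleTimeoutMs) {
        if (idleTimeoutMs < 0 || playback.isNegative()) {
            throw new IllegalArgumentException("Durations must be non-negative");
        }
        return responseDone.plus(playback).plusMillis(idleTimeoutMs);
    }

    public static void main(String[] args) {
        Instant done = Instant.parse("2025-01-01T12:00:00Z");
        // 2.5 s of audio still playing after response.done, 6 s idle timeout.
        Instant fires = timeoutFiresAt(done, Duration.ofMillis(2500), 6000);
        System.out.println(fires); // prints 2025-01-01T12:00:08.500Z
    }
}
```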
-
SemanticVad-
JsonValue; type "semantic_vad"constantType of turn detection,
semantic_vadto turn on Semantic VAD.SEMANTIC_VAD("semantic_vad")
-
Optional<Boolean> createResponseWhether or not to automatically generate a response when a VAD stop event occurs.
-
Optional<Eagerness> eagernessUsed only for
semantic_vadmode. The eagerness of the model to respond.lowwill wait longer for the user to continue speaking,highwill respond more quickly.autois the default and is equivalent tomedium.low,medium, andhighhave max timeouts of 8s, 4s, and 2s respectively.-
LOW("low") -
MEDIUM("medium") -
HIGH("high") -
AUTO("auto")
-
-
Optional<Boolean> interruptResponseWhether or not to automatically interrupt any ongoing response with output to the default conversation (i.e.
conversationofauto) when a VAD start event occurs.
-
-
-
-
Optional<RealtimeAudioConfigOutput> output-
Optional<RealtimeAudioFormats> formatThe format of the output audio.
-
Optional<Double> speedThe speed of the model's spoken response as a multiple of the original speed. 1.0 is the default speed. 0.25 is the minimum speed. 1.5 is the maximum speed. This value can only be changed in between model turns, not while a response is in progress.
This parameter is a post-processing adjustment to the audio after it is generated; it is also possible to prompt the model to speak faster or slower.
-
Optional<Voice> voiceThe voice the model uses to respond. Supported built-in voices are
alloy,ash,ballad,coral,echo,sage,shimmer,verse,marin, andcedar. You may also provide a custom voice object with anid, for example{ "id": "voice_1234" }. Voice cannot be changed during the session once the model has responded with audio at least once. We recommendmarinandcedarfor best quality.-
String -
enum UnionMember1:-
ALLOY("alloy") -
ASH("ash") -
BALLAD("ballad") -
CORAL("coral") -
ECHO("echo") -
SAGE("sage") -
SHIMMER("shimmer") -
VERSE("verse") -
MARIN("marin") -
CEDAR("cedar")
-
-
class Id:Custom voice reference.
-
String idThe custom voice ID, e.g.
voice_1234.
-
-
-
-
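A client may want to check the output speed against the documented bounds before sending it. The following is a minimal sketch that uses the 0.25 minimum, 1.5 maximum, and 1.0 default stated for `speed`. `OutputSpeed` is a hypothetical helper, and whether the server rejects or clamps out-of-range values is not stated in this reference.

```java
// Hypothetical client-side check (not an SDK type) for the speed field.
class OutputSpeed {
    static final double MIN = 0.25;
    static final double MAX = 1.5;
    static final double DEFAULT = 1.0;

    // Returns the speed to send: the documented default when unset,
    // otherwise the requested value if it lies within [MIN, MAX].
    static double validate(Double requested) {
        if (requested == null) {
            return DEFAULT;
        }
        if (requested.isNaN() || requested < MIN || requested > MAX) {
            throw new IllegalArgumentException(
                "speed must be between " + MIN + " and " + MAX + ", got " + requested);
        }
        return requested;
    }

    public static void main(String[] args) {
        System.out.println(validate(null)); // prints 1.0
        System.out.println(validate(1.25)); // prints 1.25
    }
}
```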
-
Optional<List<Include>> includeAdditional fields to include in server outputs.
item.input_audio_transcription.logprobs: Include logprobs for input audio transcription.ITEM_INPUT_AUDIO_TRANSCRIPTION_LOGPROBS("item.input_audio_transcription.logprobs")
-
Optional<String> instructionsThe default system instructions (i.e. system message) prepended to model calls. This field allows the client to guide the model on desired responses. The model can be instructed on response content and format, (e.g. "be extremely succinct", "act friendly", "here are examples of good responses") and on audio behavior (e.g. "talk quickly", "inject emotion into your voice", "laugh frequently"). The instructions are not guaranteed to be followed by the model, but they provide guidance to the model on the desired behavior.
Note that the server sets default instructions which will be used if this field is not set and are visible in the
session.createdevent at the start of the session. -
Optional<MaxOutputTokens> maxOutputTokensMaximum number of output tokens for a single assistant response, inclusive of tool calls. Provide an integer between 1 and 4096 to limit output tokens, or
inffor the maximum available tokens for a given model. Defaults toinf.-
long -
JsonValue;INF("inf")
-
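Because `maxOutputTokens` is either an integer from 1 to 4096 or the literal `inf`, client code that reads it from configuration has to handle both shapes. The sketch below is one way to do that. `MaxOutputTokens.parse` is a hypothetical helper, and representing `inf` as an empty `OptionalLong` is a choice made for this example only.

```java
import java.util.OptionalLong;

// Hypothetical parser (not an SDK type) for a max-output-tokens setting.
class MaxOutputTokens {

    // Returns empty for "inf" or an unset value (the documented default),
    // otherwise the limit after checking the documented 1..4096 range.
    static OptionalLong parse(String raw) {
        if (raw == null || raw.trim().equals("inf")) {
            return OptionalLong.empty();
        }
        long n = Long.parseLong(raw.trim()); // NumberFormatException on bad input
        if (n < 1 || n > 4096) {
            throw new IllegalArgumentException("maxOutputTokens must be 1..4096 or inf, got " + n);
        }
        return OptionalLong.of(n);
    }

    public static void main(String[] args) {
        System.out.println(parse("inf").isPresent());   // prints false
        System.out.println(parse("1024").getAsLong());  // prints 1024
    }
}
```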
-
Optional<Model> modelThe Realtime model used for this session.
-
GPT_REALTIME("gpt-realtime") -
GPT_REALTIME_1_5("gpt-realtime-1.5") -
GPT_REALTIME_2("gpt-realtime-2") -
GPT_REALTIME_2025_08_28("gpt-realtime-2025-08-28") -
GPT_4O_REALTIME_PREVIEW("gpt-4o-realtime-preview") -
GPT_4O_REALTIME_PREVIEW_2024_10_01("gpt-4o-realtime-preview-2024-10-01") -
GPT_4O_REALTIME_PREVIEW_2024_12_17("gpt-4o-realtime-preview-2024-12-17") -
GPT_4O_REALTIME_PREVIEW_2025_06_03("gpt-4o-realtime-preview-2025-06-03") -
GPT_4O_MINI_REALTIME_PREVIEW("gpt-4o-mini-realtime-preview") -
GPT_4O_MINI_REALTIME_PREVIEW_2024_12_17("gpt-4o-mini-realtime-preview-2024-12-17") -
GPT_REALTIME_MINI("gpt-realtime-mini") -
GPT_REALTIME_MINI_2025_10_06("gpt-realtime-mini-2025-10-06") -
GPT_REALTIME_MINI_2025_12_15("gpt-realtime-mini-2025-12-15") -
GPT_AUDIO_1_5("gpt-audio-1.5") -
GPT_AUDIO_MINI("gpt-audio-mini") -
GPT_AUDIO_MINI_2025_10_06("gpt-audio-mini-2025-10-06") -
GPT_AUDIO_MINI_2025_12_15("gpt-audio-mini-2025-12-15")
-
-
Optional<List<OutputModality>> outputModalitiesThe set of modalities the model can respond with. It defaults to
["audio"], indicating that the model will respond with audio plus a transcript.["text"]can be used to make the model respond with text only. It is not possible to request bothtextandaudioat the same time.-
TEXT("text") -
AUDIO("audio")
-
-
Optional<Boolean> parallelToolCallsWhether the model may call multiple tools in parallel. Only supported by reasoning Realtime models such as
gpt-realtime-2. -
Optional<ResponsePrompt> promptReference to a prompt template and its variables. Learn more.
-
String idThe unique identifier of the prompt template to use.
-
Optional<Variables> variablesOptional map of values to substitute in for variables in your prompt. The substitution values can either be strings, or other Response input types like images or files.
-
String -
class ResponseInputText:A text input to the model.
-
String textThe text input to the model.
-
JsonValue; type "input_text"constantThe type of the input item. Always
input_text.INPUT_TEXT("input_text")
-
-
class ResponseInputImage:An image input to the model. Learn about image inputs.
-
Detail detailThe detail level of the image to be sent to the model. One of
high,low,auto, ororiginal. Defaults toauto.-
LOW("low") -
HIGH("high") -
AUTO("auto") -
ORIGINAL("original")
-
-
JsonValue; type "input_image"constantThe type of the input item. Always
input_image.INPUT_IMAGE("input_image")
-
Optional<String> fileIdThe ID of the file to be sent to the model.
-
Optional<String> imageUrlThe URL of the image to be sent to the model. A fully qualified URL or base64 encoded image in a data URL.
-
-
class ResponseInputFile:A file input to the model.
-
JsonValue; type "input_file"constantThe type of the input item. Always
input_file.INPUT_FILE("input_file")
-
Optional<Detail> detailThe detail level of the file to be sent to the model. Use
lowfor the default rendering behavior, orhighto render the file at higher quality. Defaults tolow.-
LOW("low") -
HIGH("high")
-
-
Optional<String> fileDataThe content of the file to be sent to the model.
-
Optional<String> fileIdThe ID of the file to be sent to the model.
-
Optional<String> fileUrlThe URL of the file to be sent to the model.
-
Optional<String> filenameThe name of the file to be sent to the model.
-
-
-
Optional<String> versionOptional version of the prompt template.
-
-
Optional<RealtimeReasoning> reasoningConfiguration for reasoning-capable Realtime models such as
gpt-realtime-2.-
Optional<RealtimeReasoningEffort> effortConstrains effort on reasoning for reasoning-capable Realtime models such as
gpt-realtime-2.-
MINIMAL("minimal") -
LOW("low") -
MEDIUM("medium") -
HIGH("high") -
XHIGH("xhigh")
-
-
-
Optional<RealtimeToolChoiceConfig> toolChoiceHow the model chooses tools. Provide one of the string modes or force a specific function/MCP tool.
-
enum ToolChoiceOptions:Controls which (if any) tool is called by the model.
nonemeans the model will not call any tool and instead generates a message.automeans the model can pick between generating a message or calling one or more tools.requiredmeans the model must call one or more tools.-
NONE("none") -
AUTO("auto") -
REQUIRED("required")
-
-
class ToolChoiceFunction:Use this option to force the model to call a specific function.
-
String nameThe name of the function to call.
-
JsonValue; type "function"constantFor function calling, the type is always
function.FUNCTION("function")
-
-
class ToolChoiceMcp:Use this option to force the model to call a specific tool on a remote MCP server.
-
String serverLabelThe label of the MCP server to use.
-
JsonValue; type "mcp"constantFor MCP tools, the type is always
mcp.MCP("mcp")
-
Optional<String> nameThe name of the tool to call on the server.
-
-
-
Optional<List<RealtimeToolsConfigUnion>> toolsTools available to the model.
-
class RealtimeFunctionTool:-
Optional<String> descriptionThe description of the function, including guidance on when and how to call it, and guidance about what to tell the user when calling (if anything).
-
Optional<String> nameThe name of the function.
-
Optional<JsonValue> parametersParameters of the function in JSON Schema.
-
Optional<Type> typeThe type of the tool, i.e.
function.FUNCTION("function")
-
-
Mcp-
String serverLabelA label for this MCP server, used to identify it in tool calls.
-
JsonValue; type "mcp"constantThe type of the MCP tool. Always
mcp.MCP("mcp")
-
Optional<AllowedTools> allowedToolsList of allowed tool names or a filter object.
-
List<String> -
class McpToolFilter:A filter object to specify which tools are allowed.
-
Optional<Boolean> readOnlyIndicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with
readOnlyHint, it will match this filter. -
Optional<List<String>> toolNamesList of allowed tool names.
-
-
-
Optional<String> authorizationAn OAuth access token that can be used with a remote MCP server, either with a custom MCP server URL or a service connector. Your application must handle the OAuth authorization flow and provide the token here.
-
Optional<ConnectorId> connectorIdIdentifier for service connectors, like those available in ChatGPT. One of
server_urlorconnector_idmust be provided. Learn more about service connectors here.Currently supported
connector_idvalues are:-
Dropbox:
connector_dropbox -
Gmail:
connector_gmail -
Google Calendar:
connector_googlecalendar -
Google Drive:
connector_googledrive -
Microsoft Teams:
connector_microsoftteams -
Outlook Calendar:
connector_outlookcalendar -
Outlook Email:
connector_outlookemail -
SharePoint:
connector_sharepoint -
CONNECTOR_DROPBOX("connector_dropbox") -
CONNECTOR_GMAIL("connector_gmail") -
CONNECTOR_GOOGLECALENDAR("connector_googlecalendar") -
CONNECTOR_GOOGLEDRIVE("connector_googledrive") -
CONNECTOR_MICROSOFTTEAMS("connector_microsoftteams") -
CONNECTOR_OUTLOOKCALENDAR("connector_outlookcalendar") -
CONNECTOR_OUTLOOKEMAIL("connector_outlookemail") -
CONNECTOR_SHAREPOINT("connector_sharepoint")
-
-
Optional<Boolean> deferLoadingWhether this MCP tool is deferred and discovered via tool search.
-
Optional<Headers> headersOptional HTTP headers to send to the MCP server. Use for authentication or other purposes.
-
Optional<RequireApproval> requireApprovalSpecify which of the MCP server's tools require approval.
-
class McpToolApprovalFilter:Specify which of the MCP server's tools require approval. Can be
always,never, or a filter object associated with tools that require approval.-
Optional<Always> alwaysA filter object to specify which tools are allowed.
-
Optional<Boolean> readOnlyIndicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with
readOnlyHint, it will match this filter. -
Optional<List<String>> toolNamesList of allowed tool names.
-
-
Optional<Never> neverA filter object to specify which tools are allowed.
-
Optional<Boolean> readOnlyIndicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with
readOnlyHint, it will match this filter. -
Optional<List<String>> toolNamesList of allowed tool names.
-
-
-
enum McpToolApprovalSetting:Specify a single approval policy for all tools. One of
always or never. When set to always, all tools will require approval. When set to never, no tools will require approval.-
ALWAYS("always") -
NEVER("never")
-
-
-
Optional<String> serverDescriptionOptional description of the MCP server, used to provide more context.
-
Optional<String> serverUrlThe URL for the MCP server. One of
server_urlorconnector_idmust be provided.
-
-
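The rule that one of `server_url` or `connector_id` must be provided can be checked before an MCP tool definition is sent. The sketch below is a hypothetical client-side check, not SDK behavior. It treats blank strings as missing, accepts only the connector IDs listed above, and does not reject the case where both fields are set, since this reference does not say whether that is allowed.

```java
import java.util.Set;

// Hypothetical pre-flight check (not an SDK type) for an MCP tool definition.
class McpToolCheck {
    // Connector IDs copied from the list in this reference.
    static final Set<String> CONNECTOR_IDS = Set.of(
        "connector_dropbox", "connector_gmail", "connector_googlecalendar",
        "connector_googledrive", "connector_microsoftteams",
        "connector_outlookcalendar", "connector_outlookemail", "connector_sharepoint");

    static void validate(String serverUrl, String connectorId) {
        boolean hasUrl = serverUrl != null && !serverUrl.isBlank();
        boolean hasConnector = connectorId != null && !connectorId.isBlank();
        if (!hasUrl && !hasConnector) {
            throw new IllegalArgumentException("One of server_url or connector_id must be provided");
        }
        if (hasConnector && !CONNECTOR_IDS.contains(connectorId)) {
            throw new IllegalArgumentException("Unsupported connector_id: " + connectorId);
        }
    }

    public static void main(String[] args) {
        validate(null, "connector_gmail");          // passes
        validate("https://example.com/mcp", null);  // passes (placeholder URL)
        System.out.println("ok");
    }
}
```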
-
Optional<RealtimeTracingConfig> tracingRealtime API can write session traces to the Traces Dashboard. Set to null to disable tracing. Once tracing is enabled for a session, the configuration cannot be modified.
autowill create a trace for the session with default values for the workflow name, group id, and metadata.-
JsonValue;AUTO("auto")
-
TracingConfiguration-
Optional<String> groupIdThe group id to attach to this trace to enable filtering and grouping in the Traces Dashboard.
-
Optional<JsonValue> metadataThe arbitrary metadata to attach to this trace to enable filtering in the Traces Dashboard.
-
Optional<String> workflowNameThe name of the workflow to attach to this trace. This is used to name the trace in the Traces Dashboard.
-
-
-
Optional<RealtimeTruncation> truncationWhen the number of tokens in a conversation exceeds the model's input token limit, the conversation will be truncated, meaning messages (starting from the oldest) will not be included in the model's context. A 32k context model with 4,096 max output tokens can only include 28,224 tokens in the context before truncation occurs.
Clients can configure truncation behavior to truncate with a lower max token limit, which is an effective way to control token usage and cost.
Truncation will reduce the number of cached tokens on the next turn (busting the cache), since messages are dropped from the beginning of the context. However, clients can also configure truncation to retain messages up to a fraction of the maximum context size, which will reduce the need for future truncations and thus improve the cache rate.
Truncation can be disabled entirely, which means the server will never truncate but would instead return an error if the conversation exceeds the model's input token limit.
-
RealtimeTruncationStrategy-
AUTO("auto") -
DISABLED("disabled")
-
-
class RealtimeTruncationRetentionRatio:Retain a fraction of the conversation tokens when the conversation exceeds the input token limit. This allows you to amortize truncations across multiple turns, which can help improve cached token usage.
-
double retentionRatioFraction of post-instruction conversation tokens to retain (
0.0-1.0) when the conversation exceeds the input token limit. Setting this to0.8means that messages will be dropped until 80% of the maximum allowed tokens are used. This helps reduce the frequency of truncations and improve cache rates. -
JsonValue; type "retention_ratio"constantUse retention ratio truncation.
RETENTION_RATIO("retention_ratio")
-
Optional<TokenLimits> tokenLimitsOptional custom token limits for this truncation strategy. If not provided, the model's default token limits will be used.
-
Optional<Long> postInstructionsMaximum tokens allowed in the conversation after instructions (which includes tool definitions). For example, setting this to 5,000 would mean that truncation would occur when the conversation exceeds 5,000 tokens after instructions. This cannot be higher than the model's context window size minus the maximum output tokens.
-
-
-
-
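The retention ratio behavior described above can be illustrated with a small simulation. This is a sketch of the documented rule only, not the server's implementation: once the conversation exceeds the post-instruction limit, the oldest messages are dropped until the total is at or below `retentionRatio` times that limit. The class name, the per-message token counts, and the choice to drop whole messages are assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Hypothetical simulation (not an SDK type) of retention-ratio truncation.
class RetentionRatioTruncation {

    // messageTokens lists per-message token counts, oldest first.
    // Returns the token counts of the messages that remain in context.
    static List<Integer> truncate(List<Integer> messageTokens, long postInstructionsLimit,
                                  double retentionRatio) {
        if (retentionRatio < 0.0 || retentionRatio > 1.0) {
            throw new IllegalArgumentException("retentionRatio must be within 0.0-1.0");
        }
        Deque<Integer> kept = new ArrayDeque<>(messageTokens);
        long total = messageTokens.stream().mapToLong(Integer::longValue).sum();
        if (total <= postInstructionsLimit) {
            return List.copyOf(kept); // under the limit: nothing is dropped
        }
        long target = (long) Math.floor(postInstructionsLimit * retentionRatio);
        while (!kept.isEmpty() && total > target) {
            total -= kept.removeFirst(); // drop the oldest message
        }
        return List.copyOf(kept);
    }

    public static void main(String[] args) {
        // 5,500 tokens against a 5,000 limit with ratio 0.8: target is 4,000.
        System.out.println(truncate(List.of(1500, 1500, 1500, 1000), 5000, 0.8));
        // prints [1500, 1500, 1000]
    }
}
```

Dropping down to 4,000 tokens rather than just under 5,000 leaves headroom, so the next few turns do not each trigger another truncation and invalidate the cache again.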
-
class RealtimeTranscriptionSessionCreateRequest:Realtime transcription session object configuration.
-
JsonValue; type "transcription"constantThe type of session to create. Always
transcriptionfor transcription sessions.TRANSCRIPTION("transcription")
-
Optional<RealtimeTranscriptionSessionAudio> audioConfiguration for input and output audio.
-
Optional<RealtimeTranscriptionSessionAudioInput> input-
Optional<RealtimeAudioFormats> formatThe PCM audio format. Only a 24kHz sample rate is supported.
-
Optional<NoiseReduction> noiseReductionConfiguration for input audio noise reduction. This can be set to
nullto turn off. Noise reduction filters audio added to the input audio buffer before it is sent to VAD and the model. Filtering the audio can improve VAD and turn detection accuracy (reducing false positives) and model performance by improving perception of the input audio.-
Optional<NoiseReductionType> typeType of noise reduction.
near_fieldis for close-talking microphones such as headphones,far_fieldis for far-field microphones such as laptop or conference room microphones.
-
-
Optional<AudioTranscription> transcriptionConfiguration for input audio transcription, defaults to off and can be set to
null to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through the /audio/transcriptions endpoint and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service. -
Optional<RealtimeTranscriptionSessionAudioInputTurnDetection> turnDetectionConfiguration for turn detection, either Server VAD or Semantic VAD. This can be set to
null to turn off, in which case the client must manually trigger a model response. Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.
Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.
For
gpt-realtime-whispertranscription sessions, turn detection must be set tonull; VAD is not supported.-
ServerVad-
JsonValue; type "server_vad"constantType of turn detection,
server_vadto turn on simple Server VAD.SERVER_VAD("server_vad")
-
Optional<Boolean> createResponseWhether or not to automatically generate a response when a VAD stop event occurs. If
interrupt_responseis set tofalsethis may fail to create a response if the model is already responding.If both
create_responseandinterrupt_responseare set tofalse, the model will never respond automatically but VAD events will still be emitted. -
Optional<Long> idleTimeoutMsOptional timeout after which a model response will be triggered automatically. This is useful for situations in which a long pause from the user is unexpected, such as a phone call. The model will effectively prompt the user to continue the conversation based on the current context.
The timeout value will be applied after the last model response's audio has finished playing, i.e. it's set to the
response.donetime plus audio playback duration.An
input_audio_buffer.timeout_triggeredevent (plus events associated with the Response) will be emitted when the timeout is reached. Idle timeout is currently only supported forserver_vadmode. -
Optional<Boolean> interruptResponseWhether or not to automatically interrupt (cancel) any ongoing response with output to the default conversation (i.e.
conversationofauto) when a VAD start event occurs. Iftruethen the response will be cancelled, otherwise it will continue until complete.If both
create_responseandinterrupt_responseare set tofalse, the model will never respond automatically but VAD events will still be emitted. -
Optional<Long> prefixPaddingMsUsed only for
server_vadmode. Amount of audio to include before the VAD detected speech (in milliseconds). Defaults to 300ms. -
Optional<Long> silenceDurationMsUsed only for
server_vadmode. Duration of silence to detect speech stop (in milliseconds). Defaults to 500ms. With shorter values the model will respond more quickly, but may jump in on short pauses from the user. -
Optional<Double> thresholdUsed only for
server_vadmode. Activation threshold for VAD (0.0 to 1.0), this defaults to 0.5. A higher threshold will require louder audio to activate the model, and thus might perform better in noisy environments.
-
-
SemanticVad-
JsonValue; type "semantic_vad"constantType of turn detection,
semantic_vadto turn on Semantic VAD.SEMANTIC_VAD("semantic_vad")
-
Optional<Boolean> createResponseWhether or not to automatically generate a response when a VAD stop event occurs.
-
Optional<Eagerness> eagernessUsed only for
semantic_vad mode. The eagerness of the model to respond. low will wait longer for the user to continue speaking, high will respond more quickly. auto is the default and is equivalent to medium. low, medium, and high have max timeouts of 8s, 4s, and 2s respectively.-
LOW("low") -
MEDIUM("medium") -
HIGH("high") -
AUTO("auto")
-
-
Optional<Boolean> interruptResponseWhether or not to automatically interrupt any ongoing response with output to the default conversation (i.e.
conversation of auto) when a VAD start event occurs.
-
-
-
-
-
Optional<List<Include>> includeAdditional fields to include in server outputs.
item.input_audio_transcription.logprobs: Include logprobs for input audio transcription.ITEM_INPUT_AUDIO_TRANSCRIPTION_LOGPROBS("item.input_audio_transcription.logprobs")
-
-
-
JsonValue; type "session.updated"constantThe event type, must be
session.updated.SESSION_UPDATED("session.updated")
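The semantic_vad eagerness levels described above can be summarised in a small lookup. The following is a minimal sketch, not SDK code: the class EagernessTimeouts and its maxTimeout method are hypothetical names introduced for illustration, and only the 8s/4s/2s values and the rule that auto behaves like medium come from the field description.

```java
import java.time.Duration;

/** Hypothetical helper: maps a semantic_vad eagerness value to its documented max timeout. */
final class EagernessTimeouts {
    private EagernessTimeouts() {}

    static Duration maxTimeout(String eagerness) {
        switch (eagerness) {
            case "low":
                return Duration.ofSeconds(8);
            case "medium":
            case "auto": // "auto" is documented as equivalent to "medium"
                return Duration.ofSeconds(4);
            case "high":
                return Duration.ofSeconds(2);
            default:
                throw new IllegalArgumentException("Unknown eagerness: " + eagerness);
        }
    }

    public static void main(String[] args) {
        // Print the documented upper bound on waiting time for each level.
        for (String level : new String[] {"low", "medium", "high", "auto"}) {
            System.out.println(level + " -> up to " + maxTimeout(level).getSeconds() + "s");
        }
    }
}
```

These are maximum timeouts; the description implies the model may respond sooner when it judges the turn to be complete.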
-
Transcription Session Update
-
class TranscriptionSessionUpdate:Send this event to update a transcription session.
-
Session sessionRealtime transcription session object configuration.
-
Optional<List<Include>> includeThe set of items to include in the transcription. Current available items are:
item.input_audio_transcription.logprobsITEM_INPUT_AUDIO_TRANSCRIPTION_LOGPROBS("item.input_audio_transcription.logprobs")
-
Optional<InputAudioFormat> inputAudioFormatThe format of input audio. Options are
pcm16, g711_ulaw, or g711_alaw. For pcm16, input audio must be 16-bit PCM at a 24kHz sample rate, single channel (mono), and little-endian byte order.-
PCM16("pcm16") -
G711_ULAW("g711_ulaw") -
G711_ALAW("g711_alaw")
-
-
Optional<InputAudioNoiseReduction> inputAudioNoiseReductionConfiguration for input audio noise reduction. This can be set to
null to turn off. Noise reduction filters audio added to the input audio buffer before it is sent to VAD and the model. Filtering the audio can improve VAD and turn detection accuracy (reducing false positives) and model performance by improving perception of the input audio.-
Optional<NoiseReductionType> typeType of noise reduction.
near_field is for close-talking microphones such as headphones, far_field is for far-field microphones such as laptop or conference room microphones.-
NEAR_FIELD("near_field") -
FAR_FIELD("far_field")
-
-
-
Optional<AudioTranscription> inputAudioTranscriptionConfiguration for input audio transcription. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service.
-
Optional<Delay> delayControls how long the model waits before emitting transcription text. Higher values can improve transcription accuracy at the cost of latency. Only supported with
gpt-realtime-whisper in GA Realtime sessions.-
MINIMAL("minimal") -
LOW("low") -
MEDIUM("medium") -
HIGH("high") -
XHIGH("xhigh")
-
-
Optional<String> languageThe language of the input audio. Supplying the input language in ISO-639-1 (e.g.
en) format will improve accuracy and latency. -
Optional<Model> modelThe model to use for transcription. Current options are
whisper-1, gpt-4o-mini-transcribe, gpt-4o-mini-transcribe-2025-12-15, gpt-4o-transcribe, gpt-4o-transcribe-diarize, and gpt-realtime-whisper. Use gpt-4o-transcribe-diarize when you need diarization with speaker labels.-
WHISPER_1("whisper-1") -
GPT_4O_MINI_TRANSCRIBE("gpt-4o-mini-transcribe") -
GPT_4O_MINI_TRANSCRIBE_2025_12_15("gpt-4o-mini-transcribe-2025-12-15") -
GPT_4O_TRANSCRIBE("gpt-4o-transcribe") -
GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize") -
GPT_REALTIME_WHISPER("gpt-realtime-whisper")
-
-
Optional<String> promptAn optional text to guide the model's style or continue a previous audio segment. For
whisper-1, the prompt is a list of keywords. For gpt-4o-transcribe models (excluding gpt-4o-transcribe-diarize), the prompt is a free text string, for example "expect words related to technology". Prompt is not supported with gpt-realtime-whisper in GA Realtime sessions.
-
-
Optional<TurnDetection> turnDetectionConfiguration for turn detection. Can be set to
null to turn off. Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.-
Optional<Long> prefixPaddingMsAmount of audio to include before the VAD detected speech (in milliseconds). Defaults to 300ms.
-
Optional<Long> silenceDurationMsDuration of silence to detect speech stop (in milliseconds). Defaults to 500ms. With shorter values the model will respond more quickly, but may jump in on short pauses from the user.
-
Optional<Double> thresholdActivation threshold for VAD (0.0 to 1.0), this defaults to 0.5. A higher threshold will require louder audio to activate the model, and thus might perform better in noisy environments.
-
Optional<Type> typeType of turn detection. Only
server_vad is currently supported for transcription sessions.SERVER_VAD("server_vad")
-
-
-
JsonValue; type "transcription_session.update"constantThe event type, must be
transcription_session.update.TRANSCRIPTION_SESSION_UPDATE("transcription_session.update")
-
Optional<String> eventIdOptional client-generated ID used to identify this event.
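The pcm16 requirement above (16-bit samples, 24kHz, mono, little-endian) determines both the byte layout and the data rate of the audio a client sends. The sketch below illustrates that arithmetic using only the Java standard library; the class Pcm16Encoding and its method names are hypothetical and are not part of the SDK.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper illustrating the pcm16 layout: 16-bit, 24kHz, mono, little-endian. */
final class Pcm16Encoding {
    static final int SAMPLE_RATE_HZ = 24_000;
    static final int BYTES_PER_SAMPLE = 2; // 16-bit samples, single channel

    private Pcm16Encoding() {}

    /** Serialises samples with the low byte first, as little-endian order requires. */
    static byte[] toLittleEndian(short[] samples) {
        ByteBuffer buffer = ByteBuffer
                .allocate(samples.length * BYTES_PER_SAMPLE)
                .order(ByteOrder.LITTLE_ENDIAN);
        for (short sample : samples) {
            buffer.putShort(sample);
        }
        return buffer.array();
    }

    /** Number of bytes needed to hold the given duration of mono pcm16 audio. */
    static long bytesForMillis(long millis) {
        return (long) SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * millis / 1000;
    }

    public static void main(String[] args) {
        System.out.println("Bytes per second: " + bytesForMillis(1000));
        System.out.println("Bytes per 100 ms chunk: " + bytesForMillis(100));
    }
}
```

One second of audio in this format occupies 24,000 samples at 2 bytes each, which is a useful figure when sizing chunks for the input audio buffer.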
-
Transcription Session Updated Event
-
class TranscriptionSessionUpdatedEvent:Returned when a transcription session is updated with a
transcription_session.update event, unless there is an error.-
String eventIdThe unique ID of the server event.
-
Session sessionA new Realtime transcription session configuration.
When a session is created on the server via REST API, the session object also contains an ephemeral key. Default TTL for keys is 10 minutes. This property is not present when a session is updated via the WebSocket API.
-
ClientSecret clientSecretEphemeral key returned by the API. Only present when the session is created on the server via REST API.
-
long expiresAtTimestamp for when the token expires. Currently, all tokens expire after one minute.
-
String valueEphemeral key usable in client environments to authenticate connections to the Realtime API. Use this in client-side environments rather than a standard API token, which should only be used server-side.
-
-
Optional<String> inputAudioFormatThe format of input audio. Options are
pcm16, g711_ulaw, or g711_alaw. -
Optional<AudioTranscription> inputAudioTranscription-
Optional<Delay> delayControls how long the model waits before emitting transcription text. Higher values can improve transcription accuracy at the cost of latency. Only supported with
gpt-realtime-whisper in GA Realtime sessions.-
MINIMAL("minimal") -
LOW("low") -
MEDIUM("medium") -
HIGH("high") -
XHIGH("xhigh")
-
-
Optional<String> languageThe language of the input audio. Supplying the input language in ISO-639-1 (e.g.
en) format will improve accuracy and latency. -
Optional<Model> modelThe model to use for transcription. Current options are
whisper-1, gpt-4o-mini-transcribe, gpt-4o-mini-transcribe-2025-12-15, gpt-4o-transcribe, gpt-4o-transcribe-diarize, and gpt-realtime-whisper. Use gpt-4o-transcribe-diarize when you need diarization with speaker labels.-
WHISPER_1("whisper-1") -
GPT_4O_MINI_TRANSCRIBE("gpt-4o-mini-transcribe") -
GPT_4O_MINI_TRANSCRIBE_2025_12_15("gpt-4o-mini-transcribe-2025-12-15") -
GPT_4O_TRANSCRIBE("gpt-4o-transcribe") -
GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize") -
GPT_REALTIME_WHISPER("gpt-realtime-whisper")
-
-
Optional<String> promptAn optional text to guide the model's style or continue a previous audio segment. For
whisper-1, the prompt is a list of keywords. For gpt-4o-transcribe models (excluding gpt-4o-transcribe-diarize), the prompt is a free text string, for example "expect words related to technology". Prompt is not supported with gpt-realtime-whisper in GA Realtime sessions.
-
-
Optional<List<Modality>> modalitiesThe set of modalities the model can respond with. To disable audio, set this to ["text"].
-
TEXT("text") -
AUDIO("audio")
-
-
Optional<TurnDetection> turnDetectionConfiguration for turn detection. Can be set to
null to turn off. Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.-
Optional<Long> prefixPaddingMsAmount of audio to include before the VAD detected speech (in milliseconds). Defaults to 300ms.
-
Optional<Long> silenceDurationMsDuration of silence to detect speech stop (in milliseconds). Defaults to 500ms. With shorter values the model will respond more quickly, but may jump in on short pauses from the user.
-
Optional<Double> thresholdActivation threshold for VAD (0.0 to 1.0), this defaults to 0.5. A higher threshold will require louder audio to activate the model, and thus might perform better in noisy environments.
-
Optional<String> typeType of turn detection, only
server_vad is currently supported.
-
-
-
JsonValue; type "transcription_session.updated"constantThe event type, must be
transcription_session.updated.TRANSCRIPTION_SESSION_UPDATED("transcription_session.updated")
-
Client Secrets
Create client secret
ClientSecretCreateResponse realtime().clientSecrets().create(ClientSecretCreateParams params = ClientSecretCreateParams.none(), RequestOptions requestOptions = RequestOptions.none())
post /realtime/client_secrets
Create a Realtime client secret with an associated session configuration.
Client secrets are short-lived tokens that can be passed to a client app, such as a web frontend or mobile client, which grants access to the Realtime API without leaking your main API key. You can configure a custom TTL for each client secret.
You can also attach session configuration options to the client secret, which will be applied to any sessions created using that client secret, but these can also be overridden by the client connection.
Learn more about authentication with client secrets over WebRTC.
Returns the created client secret and the effective session object. The client secret is a string that looks like ek_1234.
Parameters
-
ClientSecretCreateParams params-
Optional<ExpiresAfter> expiresAfterConfiguration for the client secret expiration. Expiration refers to the time after which a client secret will no longer be valid for creating sessions. The session itself may continue after that time once started. A secret can be used to create multiple sessions until it expires.
-
Optional<Anchor> anchorThe anchor point for the client secret expiration, meaning that
seconds will be added to the created_at time of the client secret to produce an expiration timestamp. Only created_at is currently supported.CREATED_AT("created_at")
-
Optional<Long> secondsThe number of seconds from the anchor point to the expiration. Select a value between
10 and 7200 (2 hours). This defaults to 600 seconds (10 minutes) if not specified.
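The expiration rule above (anchor time plus seconds, with a 10 to 7200 range and a 600 second default) can be modelled on the client side to predict when a secret stops being usable. This is a minimal sketch under stated assumptions: ClientSecretExpiry is a hypothetical helper rather than an SDK class, and the server remains the authority on the actual expiration timestamp.

```java
import java.time.Instant;

/** Hypothetical helper mirroring the documented expiresAfter rules on the client side. */
final class ClientSecretExpiry {
    static final long MIN_SECONDS = 10;
    static final long MAX_SECONDS = 7_200;   // 2 hours
    static final long DEFAULT_SECONDS = 600; // 10 minutes

    private ClientSecretExpiry() {}

    /**
     * Adds the TTL to the created_at anchor. A null TTL falls back to the documented default;
     * values outside the documented range are rejected before any request is made.
     */
    static Instant expiresAt(Instant createdAt, Long seconds) {
        long ttl = (seconds == null) ? DEFAULT_SECONDS : seconds;
        if (ttl < MIN_SECONDS || ttl > MAX_SECONDS) {
            throw new IllegalArgumentException(
                    "seconds must be between " + MIN_SECONDS + " and " + MAX_SECONDS + ": " + ttl);
        }
        return createdAt.plusSeconds(ttl);
    }

    public static void main(String[] args) {
        Instant createdAt = Instant.now();
        System.out.println("Default expiry: " + expiresAt(createdAt, null));
        System.out.println("One-hour expiry: " + expiresAt(createdAt, 3_600L));
    }
}
```

Note that expiration only limits when new sessions can be created with the secret; as stated above, a session that has already started may continue past that time.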
-
-
Optional<Session> sessionSession configuration to use for the client secret. Choose either a realtime session or a transcription session.
-
class RealtimeSessionCreateRequest:Realtime session object configuration.
-
JsonValue; type "realtime"constantThe type of session to create. Always
realtime for the Realtime API.REALTIME("realtime")
-
Optional<RealtimeAudioConfig> audioConfiguration for input and output audio.
-
Optional<RealtimeAudioConfigInput> input-
Optional<RealtimeAudioFormats> formatThe format of the input audio.
-
AudioPcm-
Optional<Rate> rateThe sample rate of the audio. Always
24000._24000(24000)
-
Optional<Type> typeThe audio format. Always
audio/pcm.AUDIO_PCM("audio/pcm")
-
-
AudioPcmu-
Optional<Type> typeThe audio format. Always
audio/pcmu.AUDIO_PCMU("audio/pcmu")
-
-
AudioPcma-
Optional<Type> typeThe audio format. Always
audio/pcma.AUDIO_PCMA("audio/pcma")
-
-
-
Optional<NoiseReduction> noiseReductionConfiguration for input audio noise reduction. This can be set to
null to turn off. Noise reduction filters audio added to the input audio buffer before it is sent to VAD and the model. Filtering the audio can improve VAD and turn detection accuracy (reducing false positives) and model performance by improving perception of the input audio.-
Optional<NoiseReductionType> typeType of noise reduction.
near_field is for close-talking microphones such as headphones, far_field is for far-field microphones such as laptop or conference room microphones.-
NEAR_FIELD("near_field") -
FAR_FIELD("far_field")
-
-
-
Optional<AudioTranscription> transcriptionConfiguration for input audio transcription, defaults to off and can be set to
null to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through the /audio/transcriptions endpoint and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service.-
Optional<Delay> delayControls how long the model waits before emitting transcription text. Higher values can improve transcription accuracy at the cost of latency. Only supported with
gpt-realtime-whisper in GA Realtime sessions.-
MINIMAL("minimal") -
LOW("low") -
MEDIUM("medium") -
HIGH("high") -
XHIGH("xhigh")
-
-
Optional<String> languageThe language of the input audio. Supplying the input language in ISO-639-1 (e.g.
en) format will improve accuracy and latency. -
Optional<Model> modelThe model to use for transcription. Current options are
whisper-1, gpt-4o-mini-transcribe, gpt-4o-mini-transcribe-2025-12-15, gpt-4o-transcribe, gpt-4o-transcribe-diarize, and gpt-realtime-whisper. Use gpt-4o-transcribe-diarize when you need diarization with speaker labels.-
WHISPER_1("whisper-1") -
GPT_4O_MINI_TRANSCRIBE("gpt-4o-mini-transcribe") -
GPT_4O_MINI_TRANSCRIBE_2025_12_15("gpt-4o-mini-transcribe-2025-12-15") -
GPT_4O_TRANSCRIBE("gpt-4o-transcribe") -
GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize") -
GPT_REALTIME_WHISPER("gpt-realtime-whisper")
-
-
Optional<String> promptAn optional text to guide the model's style or continue a previous audio segment. For
whisper-1, the prompt is a list of keywords. For gpt-4o-transcribe models (excluding gpt-4o-transcribe-diarize), the prompt is a free text string, for example "expect words related to technology". Prompt is not supported with gpt-realtime-whisper in GA Realtime sessions.
-
-
Optional<RealtimeAudioInputTurnDetection> turnDetectionConfiguration for turn detection, either Server VAD or Semantic VAD. This can be set to
null to turn off, in which case the client must manually trigger model response. Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.
Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.
For
gpt-realtime-whisper transcription sessions, turn detection must be set to null; VAD is not supported.-
ServerVad-
JsonValue; type "server_vad"constantType of turn detection,
server_vadto turn on simple Server VAD.SERVER_VAD("server_vad")
-
Optional<Boolean> createResponseWhether or not to automatically generate a response when a VAD stop event occurs. If
interrupt_response is set to false this may fail to create a response if the model is already responding. If both
create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted. -
Optional<Long> idleTimeoutMsOptional timeout after which a model response will be triggered automatically. This is useful for situations in which a long pause from the user is unexpected, such as a phone call. The model will effectively prompt the user to continue the conversation based on the current context.
The timeout value will be applied after the last model response's audio has finished playing, i.e. it's set to the
response.done time plus audio playback duration. An
input_audio_buffer.timeout_triggered event (plus events associated with the Response) will be emitted when the timeout is reached. Idle timeout is currently only supported for server_vad mode. -
Optional<Boolean> interruptResponseWhether or not to automatically interrupt (cancel) any ongoing response with output to the default conversation (i.e.
conversation of auto) when a VAD start event occurs. If true then the response will be cancelled, otherwise it will continue until complete. If both
create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted. -
Optional<Long> prefixPaddingMsUsed only for
server_vad mode. Amount of audio to include before the VAD detected speech (in milliseconds). Defaults to 300ms. -
Optional<Long> silenceDurationMsUsed only for
server_vad mode. Duration of silence to detect speech stop (in milliseconds). Defaults to 500ms. With shorter values the model will respond more quickly, but may jump in on short pauses from the user. -
Optional<Double> thresholdUsed only for
server_vad mode. Activation threshold for VAD (0.0 to 1.0), this defaults to 0.5. A higher threshold will require louder audio to activate the model, and thus might perform better in noisy environments.
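The idle timeout timing described above can be estimated on the client: the timer is anchored at the response.done time plus the audio playback duration, and the timeout fires idleTimeoutMs later if the user stays silent. The sketch below is one reading of that description, not SDK behaviour; IdleTimeoutEstimate and its methods are hypothetical names, and the server decides when input_audio_buffer.timeout_triggered is actually emitted.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical estimate of when the server-side idle timeout would fire. */
final class IdleTimeoutEstimate {
    private IdleTimeoutEstimate() {}

    /** The timer starts once the last response's audio has finished playing. */
    static Instant timerStart(Instant responseDone, Duration audioPlayback) {
        return responseDone.plus(audioPlayback);
    }

    /** Earliest moment the timeout could trigger if no speech is detected. */
    static Instant expectedTrigger(Instant responseDone, Duration audioPlayback, long idleTimeoutMs) {
        return timerStart(responseDone, audioPlayback).plusMillis(idleTimeoutMs);
    }

    public static void main(String[] args) {
        Instant responseDone = Instant.now();
        Duration playback = Duration.ofSeconds(3); // assumed length of the spoken response
        System.out.println("Timer starts at: " + timerStart(responseDone, playback));
        System.out.println("Timeout expected at: " + expectedTrigger(responseDone, playback, 5_000));
    }
}
```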
-
-
SemanticVad-
JsonValue; type "semantic_vad"constantType of turn detection,
semantic_vadto turn on Semantic VAD.SEMANTIC_VAD("semantic_vad")
-
Optional<Boolean> createResponseWhether or not to automatically generate a response when a VAD stop event occurs.
-
Optional<Eagerness> eagernessUsed only for
semantic_vad mode. The eagerness of the model to respond. low will wait longer for the user to continue speaking, high will respond more quickly. auto is the default and is equivalent to medium. low, medium, and high have max timeouts of 8s, 4s, and 2s respectively.-
LOW("low") -
MEDIUM("medium") -
HIGH("high") -
AUTO("auto")
-
-
Optional<Boolean> interruptResponseWhether or not to automatically interrupt any ongoing response with output to the default conversation (i.e.
conversation of auto) when a VAD start event occurs.
-
-
-
-
Optional<RealtimeAudioConfigOutput> output-
Optional<RealtimeAudioFormats> formatThe format of the output audio.
-
Optional<Double> speedThe speed of the model's spoken response as a multiple of the original speed. 1.0 is the default speed. 0.25 is the minimum speed. 1.5 is the maximum speed. This value can only be changed in between model turns, not while a response is in progress.
This parameter is a post-processing adjustment to the audio after it is generated; it's also possible to prompt the model to speak faster or slower.
-
Optional<Voice> voiceThe voice the model uses to respond. Supported built-in voices are
alloy, ash, ballad, coral, echo, sage, shimmer, verse, marin, and cedar. You may also provide a custom voice object with an id, for example { "id": "voice_1234" }. Voice cannot be changed during the session once the model has responded with audio at least once. We recommend marin and cedar for best quality.-
String -
enum UnionMember1:-
ALLOY("alloy") -
ASH("ash") -
BALLAD("ballad") -
CORAL("coral") -
ECHO("echo") -
SAGE("sage") -
SHIMMER("shimmer") -
VERSE("verse") -
MARIN("marin") -
CEDAR("cedar")
-
-
class Id:Custom voice reference.
-
String idThe custom voice ID, e.g.
voice_1234.
-
-
-
-
-
Optional<List<Include>> includeAdditional fields to include in server outputs.
item.input_audio_transcription.logprobs: Include logprobs for input audio transcription.ITEM_INPUT_AUDIO_TRANSCRIPTION_LOGPROBS("item.input_audio_transcription.logprobs")
-
Optional<String> instructionsThe default system instructions (i.e. system message) prepended to model calls. This field allows the client to guide the model on desired responses. The model can be instructed on response content and format (e.g. "be extremely succinct", "act friendly", "here are examples of good responses") and on audio behavior (e.g. "talk quickly", "inject emotion into your voice", "laugh frequently"). The instructions are not guaranteed to be followed by the model, but they provide guidance to the model on the desired behavior.
Note that the server sets default instructions which will be used if this field is not set and are visible in the
session.created event at the start of the session. -
Optional<MaxOutputTokens> maxOutputTokensMaximum number of output tokens for a single assistant response, inclusive of tool calls. Provide an integer between 1 and 4096 to limit output tokens, or
inf for the maximum available tokens for a given model. Defaults to inf.-
long -
JsonValue;INF("inf")
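Because maxOutputTokens accepts either an integer from 1 to 4096 or the literal inf, client code that reads this setting from configuration has to handle both shapes. A minimal sketch follows, assuming the value arrives as a string; MaxOutputTokensParser is a hypothetical helper and not part of the SDK.

```java
import java.util.OptionalLong;

/** Hypothetical parser for a max output tokens setting read from text configuration. */
final class MaxOutputTokensParser {
    static final long MIN_TOKENS = 1;
    static final long MAX_TOKENS = 4096;

    private MaxOutputTokensParser() {}

    /**
     * Returns an empty OptionalLong for "inf" (or a missing value, since inf is the default),
     * otherwise the validated integer limit.
     */
    static OptionalLong parse(String raw) {
        if (raw == null || raw.equals("inf")) {
            return OptionalLong.empty();
        }
        long value = Long.parseLong(raw.trim()); // throws NumberFormatException on bad input
        if (value < MIN_TOKENS || value > MAX_TOKENS) {
            throw new IllegalArgumentException(
                    "max output tokens must be between " + MIN_TOKENS + " and " + MAX_TOKENS + ": " + value);
        }
        return OptionalLong.of(value);
    }

    public static void main(String[] args) {
        System.out.println("inf  -> " + parse("inf"));
        System.out.println("1024 -> " + parse("1024"));
    }
}
```

Representing inf as an empty OptionalLong is a design choice of this sketch; it keeps "no explicit cap" distinct from any numeric limit.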
-
-
Optional<Model> modelThe Realtime model used for this session.
-
GPT_REALTIME("gpt-realtime") -
GPT_REALTIME_1_5("gpt-realtime-1.5") -
GPT_REALTIME_2("gpt-realtime-2") -
GPT_REALTIME_2025_08_28("gpt-realtime-2025-08-28") -
GPT_4O_REALTIME_PREVIEW("gpt-4o-realtime-preview") -
GPT_4O_REALTIME_PREVIEW_2024_10_01("gpt-4o-realtime-preview-2024-10-01") -
GPT_4O_REALTIME_PREVIEW_2024_12_17("gpt-4o-realtime-preview-2024-12-17") -
GPT_4O_REALTIME_PREVIEW_2025_06_03("gpt-4o-realtime-preview-2025-06-03") -
GPT_4O_MINI_REALTIME_PREVIEW("gpt-4o-mini-realtime-preview") -
GPT_4O_MINI_REALTIME_PREVIEW_2024_12_17("gpt-4o-mini-realtime-preview-2024-12-17") -
GPT_REALTIME_MINI("gpt-realtime-mini") -
GPT_REALTIME_MINI_2025_10_06("gpt-realtime-mini-2025-10-06") -
GPT_REALTIME_MINI_2025_12_15("gpt-realtime-mini-2025-12-15") -
GPT_AUDIO_1_5("gpt-audio-1.5") -
GPT_AUDIO_MINI("gpt-audio-mini") -
GPT_AUDIO_MINI_2025_10_06("gpt-audio-mini-2025-10-06") -
GPT_AUDIO_MINI_2025_12_15("gpt-audio-mini-2025-12-15")
-
-
Optional<List<OutputModality>> outputModalitiesThe set of modalities the model can respond with. It defaults to
["audio"], indicating that the model will respond with audio plus a transcript. ["text"] can be used to make the model respond with text only. It is not possible to request both text and audio at the same time.-
TEXT("text") -
AUDIO("audio")
-
-
Optional<Boolean> parallelToolCallsWhether the model may call multiple tools in parallel. Only supported by reasoning Realtime models such as
gpt-realtime-2. -
Optional<ResponsePrompt> promptReference to a prompt template and its variables. Learn more.
-
String idThe unique identifier of the prompt template to use.
-
Optional<Variables> variablesOptional map of values to substitute in for variables in your prompt. The substitution values can either be strings, or other Response input types like images or files.
-
String -
class ResponseInputText:A text input to the model.
-
String textThe text input to the model.
-
JsonValue; type "input_text"constantThe type of the input item. Always
input_text.INPUT_TEXT("input_text")
-
-
class ResponseInputImage:An image input to the model. Learn about image inputs.
-
Detail detailThe detail level of the image to be sent to the model. One of
high, low, auto, or original. Defaults to auto.-
LOW("low") -
HIGH("high") -
AUTO("auto") -
ORIGINAL("original")
-
-
JsonValue; type "input_image"constantThe type of the input item. Always
input_image.INPUT_IMAGE("input_image")
-
Optional<String> fileIdThe ID of the file to be sent to the model.
-
Optional<String> imageUrlThe URL of the image to be sent to the model. A fully qualified URL or base64 encoded image in a data URL.
-
-
class ResponseInputFile:A file input to the model.
-
JsonValue; type "input_file"constantThe type of the input item. Always
input_file.INPUT_FILE("input_file")
-
Optional<Detail> detailThe detail level of the file to be sent to the model. Use
low for the default rendering behavior, or high to render the file at higher quality. Defaults to low.-
LOW("low") -
HIGH("high")
-
-
Optional<String> fileDataThe content of the file to be sent to the model.
-
Optional<String> fileIdThe ID of the file to be sent to the model.
-
Optional<String> fileUrlThe URL of the file to be sent to the model.
-
Optional<String> filenameThe name of the file to be sent to the model.
-
-
-
Optional<String> versionOptional version of the prompt template.
-
-
Optional<RealtimeReasoning> reasoningConfiguration for reasoning-capable Realtime models such as
gpt-realtime-2.-
Optional<RealtimeReasoningEffort> effortConstrains effort on reasoning for reasoning-capable Realtime models such as
gpt-realtime-2.-
MINIMAL("minimal") -
LOW("low") -
MEDIUM("medium") -
HIGH("high") -
XHIGH("xhigh")
-
-
-
Optional<RealtimeToolChoiceConfig> toolChoiceHow the model chooses tools. Provide one of the string modes or force a specific function/MCP tool.
-
enum ToolChoiceOptions:Controls which (if any) tool is called by the model.
none means the model will not call any tool and instead generates a message. auto means the model can pick between generating a message or calling one or more tools. required means the model must call one or more tools.-
NONE("none") -
AUTO("auto") -
REQUIRED("required")
-
-
class ToolChoiceFunction:Use this option to force the model to call a specific function.
-
String nameThe name of the function to call.
-
JsonValue; type "function"constantFor function calling, the type is always
function.FUNCTION("function")
-
-
class ToolChoiceMcp:Use this option to force the model to call a specific tool on a remote MCP server.
-
String serverLabelThe label of the MCP server to use.
-
JsonValue; type "mcp"constantFor MCP tools, the type is always
mcp.MCP("mcp")
-
Optional<String> nameThe name of the tool to call on the server.
-
-
-
Optional<List<RealtimeToolsConfigUnion>> toolsTools available to the model.
-
class RealtimeFunctionTool:-
Optional<String> descriptionThe description of the function, including guidance on when and how to call it, and guidance about what to tell the user when calling (if anything).
-
Optional<String> nameThe name of the function.
-
Optional<JsonValue> parametersParameters of the function in JSON Schema.
-
Optional<Type> typeThe type of the tool, i.e.
function.FUNCTION("function")
-
-
Mcp-
String serverLabelA label for this MCP server, used to identify it in tool calls.
-
JsonValue; type "mcp"constantThe type of the MCP tool. Always
mcp.MCP("mcp")
-
Optional<AllowedTools> allowedToolsList of allowed tool names or a filter object.
-
List<String> -
class McpToolFilter:A filter object to specify which tools are allowed.
-
Optional<Boolean> readOnlyIndicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with
readOnlyHint, it will match this filter. -
Optional<List<String>> toolNamesList of allowed tool names.
-
-
-
Optional<String> authorizationAn OAuth access token that can be used with a remote MCP server, either with a custom MCP server URL or a service connector. Your application must handle the OAuth authorization flow and provide the token here.
-
Optional<ConnectorId> connectorIdIdentifier for service connectors, like those available in ChatGPT. One of
server_url or connector_id must be provided. Learn more about service connectors here. Currently supported
connector_id values are:-
Dropbox:
connector_dropbox -
Gmail:
connector_gmail -
Google Calendar:
connector_googlecalendar -
Google Drive:
connector_googledrive -
Microsoft Teams:
connector_microsoftteams -
Outlook Calendar:
connector_outlookcalendar -
Outlook Email:
connector_outlookemail -
SharePoint:
connector_sharepoint -
CONNECTOR_DROPBOX("connector_dropbox") -
CONNECTOR_GMAIL("connector_gmail") -
CONNECTOR_GOOGLECALENDAR("connector_googlecalendar") -
CONNECTOR_GOOGLEDRIVE("connector_googledrive") -
CONNECTOR_MICROSOFTTEAMS("connector_microsoftteams") -
CONNECTOR_OUTLOOKCALENDAR("connector_outlookcalendar") -
CONNECTOR_OUTLOOKEMAIL("connector_outlookemail") -
CONNECTOR_SHAREPOINT("connector_sharepoint")
-
-
Optional<Boolean> deferLoadingWhether this MCP tool is deferred and discovered via tool search.
-
Optional<Headers> headersOptional HTTP headers to send to the MCP server. Use for authentication or other purposes.
-
Optional<RequireApproval> requireApprovalSpecify which of the MCP server's tools require approval.
-
class McpToolApprovalFilter:Specify which of the MCP server's tools require approval. Can be
always, never, or a filter object associated with tools that require approval.-
Optional<Always> alwaysA filter object to specify which tools are allowed.
-
Optional<Boolean> readOnlyIndicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with
readOnlyHint, it will match this filter. -
Optional<List<String>> toolNamesList of allowed tool names.
-
-
Optional<Never> neverA filter object to specify which tools are allowed.
-
Optional<Boolean> readOnlyIndicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with
readOnlyHint, it will match this filter. -
Optional<List<String>> toolNamesList of allowed tool names.
-
-
-
enum McpToolApprovalSetting:Specify a single approval policy for all tools. One of
always or never. When set to always, all tools will require approval. When set to never, no tools will require approval.-
ALWAYS("always") -
NEVER("never")
-
-
-
Optional<String> serverDescriptionOptional description of the MCP server, used to provide more context.
-
Optional<String> serverUrlThe URL for the MCP server. One of
server_url or connector_id must be provided.
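The requirement that an MCP tool supply server_url or connector_id can be checked before a session request is built. The sketch below treats the rule as "at least one must be set"; whether both may be supplied together is not stated in the text above, so that case is left permissive. McpServerTarget is a hypothetical helper, not an SDK class.

```java
/** Hypothetical pre-flight check for an MCP tool's target fields. */
final class McpServerTarget {
    private McpServerTarget() {}

    /** True when at least one of server_url or connector_id carries a non-blank value. */
    static boolean hasTarget(String serverUrl, String connectorId) {
        return isSet(serverUrl) || isSet(connectorId);
    }

    private static boolean isSet(String value) {
        return value != null && !value.trim().isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(hasTarget(null, "connector_gmail")); // connector only
        System.out.println(hasTarget(null, null));              // neither field set
    }
}
```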
-
-
-
Optional<RealtimeTracingConfig> tracingRealtime API can write session traces to the Traces Dashboard. Set to null to disable tracing. Once tracing is enabled for a session, the configuration cannot be modified.
auto will create a trace for the session with default values for the workflow name, group id, and metadata.-
JsonValue;AUTO("auto")
-
TracingConfiguration-
Optional<String> groupIdThe group id to attach to this trace to enable filtering and grouping in the Traces Dashboard.
-
Optional<JsonValue> metadataThe arbitrary metadata to attach to this trace to enable filtering in the Traces Dashboard.
-
Optional<String> workflowNameThe name of the workflow to attach to this trace. This is used to name the trace in the Traces Dashboard.
-
-
-
Optional<RealtimeTruncation> truncationWhen the number of tokens in a conversation exceeds the model's input token limit, the conversation will be truncated, meaning messages (starting from the oldest) will not be included in the model's context. A 32k context model with 4,096 max output tokens can only include 28,224 tokens in the context before truncation occurs.
Clients can configure truncation behavior to truncate with a lower max token limit, which is an effective way to control token usage and cost.
Truncation will reduce the number of cached tokens on the next turn (busting the cache), since messages are dropped from the beginning of the context. However, clients can also configure truncation to retain messages up to a fraction of the maximum context size, which will reduce the need for future truncations and thus improve the cache rate.
Truncation can be disabled entirely, which means the server will never truncate but would instead return an error if the conversation exceeds the model's input token limit.
-
RealtimeTruncationStrategy-
AUTO("auto") -
DISABLED("disabled")
-
-
class RealtimeTruncationRetentionRatio:Retain a fraction of the conversation tokens when the conversation exceeds the input token limit. This allows you to amortize truncations across multiple turns, which can help improve cached token usage.
-
double retentionRatioFraction of post-instruction conversation tokens to retain (
0.0-1.0) when the conversation exceeds the input token limit. Setting this to 0.8 means that messages will be dropped until 80% of the maximum allowed tokens are used. This helps reduce the frequency of truncations and improve cache rates. -
JsonValue; type "retention_ratio"constantUse retention ratio truncation.
RETENTION_RATIO("retention_ratio")
-
Optional<TokenLimits> tokenLimitsOptional custom token limits for this truncation strategy. If not provided, the model's default token limits will be used.
-
Optional<Long> postInstructionsMaximum tokens allowed in the conversation after instructions (including tool definitions). For example, setting this to 5,000 would mean that truncation would occur when the conversation exceeds 5,000 tokens after instructions. This cannot be higher than the model's context window size minus the maximum output tokens.
-
-
-
-
-
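The retention-ratio behavior described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the server's actual algorithm: the class `RetentionRatioSketch`, the method `truncate`, and the per-message token counts are hypothetical, and the sketch assumes whole messages are dropped, oldest first, until the total is at or below `retentionRatio` times the post-instructions limit.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical client-side model of retention-ratio truncation; not the server implementation. */
final class RetentionRatioSketch {
    private RetentionRatioSketch() {}

    /**
     * Returns the token counts of the messages that remain after truncation.
     * messageTokens is ordered oldest first.
     */
    static List<Integer> truncate(List<Integer> messageTokens, int postInstructionsLimit, double retentionRatio) {
        long total = 0;
        for (int tokens : messageTokens) {
            total += tokens;
        }
        // Nothing is dropped while the conversation fits within the limit.
        if (total <= postInstructionsLimit) {
            return new ArrayList<>(messageTokens);
        }
        // Once the limit is exceeded, drop the oldest messages until the target is reached.
        double target = postInstructionsLimit * retentionRatio;
        int firstKept = 0;
        while (firstKept < messageTokens.size() && total > target) {
            total -= messageTokens.get(firstKept);
            firstKept++;
        }
        return new ArrayList<>(messageTokens.subList(firstKept, messageTokens.size()));
    }

    public static void main(String[] args) {
        // 5,300 tokens exceed the 5,000 limit, so messages are dropped until at most 4,000 remain.
        List<Integer> kept = truncate(List.of(2000, 1500, 1000, 800), 5000, 0.8);
        System.out.println(kept); // prints [1500, 1000, 800]
    }
}
```

Dropping down to a fraction of the limit, rather than to just under it, leaves headroom so that the next few turns do not each trigger another truncation.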
class RealtimeTranscriptionSessionCreateRequest:Realtime transcription session object configuration.
-
JsonValue; type "transcription"constantThe type of session to create. Always
transcriptionfor transcription sessions.TRANSCRIPTION("transcription")
-
Optional<RealtimeTranscriptionSessionAudio> audioConfiguration for input and output audio.
-
Optional<RealtimeTranscriptionSessionAudioInput> input-
Optional<RealtimeAudioFormats> formatThe PCM audio format. Only a 24kHz sample rate is supported.
-
Optional<NoiseReduction> noiseReductionConfiguration for input audio noise reduction. This can be set to
nullto turn off. Noise reduction filters audio added to the input audio buffer before it is sent to VAD and the model. Filtering the audio can improve VAD and turn detection accuracy (reducing false positives) and model performance by improving perception of the input audio.-
Optional<NoiseReductionType> typeType of noise reduction.
near_fieldis for close-talking microphones such as headphones,far_fieldis for far-field microphones such as laptop or conference room microphones.
-
-
Optional<AudioTranscription> transcriptionConfiguration for input audio transcription, defaults to off and can be set to
nullto turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through the /audio/transcriptions endpoint and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service. -
Optional<RealtimeTranscriptionSessionAudioInputTurnDetection> turnDetectionConfiguration for turn detection, either Server VAD or Semantic VAD. This can be set to
nullto turn off, in which case the client must manually trigger model response.Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.
Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.
For
gpt-realtime-whispertranscription sessions, turn detection must be set tonull; VAD is not supported.-
ServerVad-
JsonValue; type "server_vad"constantType of turn detection,
server_vadto turn on simple Server VAD.SERVER_VAD("server_vad")
-
Optional<Boolean> createResponseWhether or not to automatically generate a response when a VAD stop event occurs. If
interrupt_responseis set tofalsethis may fail to create a response if the model is already responding.If both
create_responseandinterrupt_responseare set tofalse, the model will never respond automatically but VAD events will still be emitted. -
Optional<Long> idleTimeoutMsOptional timeout after which a model response will be triggered automatically. This is useful for situations in which a long pause from the user is unexpected, such as a phone call. The model will effectively prompt the user to continue the conversation based on the current context.
The timeout value will be applied after the last model response's audio has finished playing, i.e. it's set to the
response.donetime plus audio playback duration.An
input_audio_buffer.timeout_triggeredevent (plus events associated with the Response) will be emitted when the timeout is reached. Idle timeout is currently only supported forserver_vadmode. -
Optional<Boolean> interruptResponseWhether or not to automatically interrupt (cancel) any ongoing response with output to the default conversation (i.e.
conversationofauto) when a VAD start event occurs. Iftruethen the response will be cancelled, otherwise it will continue until complete.If both
create_responseandinterrupt_responseare set tofalse, the model will never respond automatically but VAD events will still be emitted. -
Optional<Long> prefixPaddingMsUsed only for
server_vadmode. Amount of audio to include before the VAD detected speech (in milliseconds). Defaults to 300ms. -
Optional<Long> silenceDurationMsUsed only for
server_vadmode. Duration of silence to detect speech stop (in milliseconds). Defaults to 500ms. With shorter values the model will respond more quickly, but may jump in on short pauses from the user. -
Optional<Double> thresholdUsed only for
server_vadmode. Activation threshold for VAD (0.0 to 1.0). Defaults to 0.5. A higher threshold will require louder audio to activate the model, and thus might perform better in noisy environments.
-
-
SemanticVad-
JsonValue; type "semantic_vad"constantType of turn detection,
semantic_vadto turn on Semantic VAD.SEMANTIC_VAD("semantic_vad")
-
Optional<Boolean> createResponseWhether or not to automatically generate a response when a VAD stop event occurs.
-
Optional<Eagerness> eagernessUsed only for
semantic_vadmode. The eagerness of the model to respond.lowwill wait longer for the user to continue speaking,highwill respond more quickly.autois the default and is equivalent tomedium.low,medium, andhighhave max timeouts of 8s, 4s, and 2s respectively.-
LOW("low") -
MEDIUM("medium") -
HIGH("high") -
AUTO("auto")
-
-
Optional<Boolean> interruptResponseWhether or not to automatically interrupt any ongoing response with output to the default conversation (i.e.
conversationofauto) when a VAD start event occurs.
-
-
-
-
-
Optional<List<Include>> includeAdditional fields to include in server outputs.
item.input_audio_transcription.logprobs: Include logprobs for input audio transcription.ITEM_INPUT_AUDIO_TRANSCRIPTION_LOGPROBS("item.input_audio_transcription.logprobs")
-
-
-
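The interaction between the Server VAD settings described above (threshold, prefix padding, and silence duration) can be illustrated with a toy model. This is only a sketch: the real Server VAD is not specified at this level of detail, and the class `ServerVadSketch`, the method `detect`, the fixed frame length, and the per-frame loudness values are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of volume-based turn detection; not the actual Server VAD. */
final class ServerVadSketch {
    private ServerVadSketch() {}

    /**
     * Returns detected speech segments as {startMs, endMs} pairs.
     * levels holds one hypothetical loudness value (0.0 to 1.0) per frame of frameMs milliseconds.
     */
    static List<long[]> detect(double[] levels, long frameMs, double threshold,
                               long prefixPaddingMs, long silenceDurationMs) {
        List<long[]> segments = new ArrayList<>();
        boolean inSpeech = false;
        long startMs = 0;
        long silentMs = 0;
        for (int i = 0; i < levels.length; i++) {
            long frameStartMs = i * frameMs;
            boolean loud = levels[i] >= threshold;
            if (!inSpeech) {
                if (loud) {
                    // Speech start: include some audio from before the detected onset.
                    inSpeech = true;
                    startMs = Math.max(0, frameStartMs - prefixPaddingMs);
                    silentMs = 0;
                }
            } else if (loud) {
                silentMs = 0;
            } else {
                silentMs += frameMs;
                if (silentMs >= silenceDurationMs) {
                    // Speech stop: the segment ends where the silence began.
                    long endMs = frameStartMs + frameMs - silentMs;
                    segments.add(new long[] {startMs, endMs});
                    inSpeech = false;
                }
            }
        }
        return segments;
    }

    public static void main(String[] args) {
        // Speech with a 200 ms pause in the middle, sampled in 100 ms frames.
        double[] levels = {0.1, 0.1, 0.1, 0.1, 0.1, 0.9, 0.9, 0.1, 0.1, 0.9, 0.1, 0.1, 0.1, 0.1, 0.1};
        System.out.println(detect(levels, 100, 0.5, 300, 500).size()); // prints 1
        System.out.println(detect(levels, 100, 0.5, 300, 200).size()); // prints 2
    }
}
```

With a 500 ms silence duration the short pause is absorbed into a single turn. With 200 ms the same audio is split into two turns, which mirrors the note that shorter values respond faster but may jump in on short pauses.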
Returns
-
class ClientSecretCreateResponse:Response from creating a session and client secret for the Realtime API.
-
long expiresAtExpiration timestamp for the client secret, in seconds since epoch.
-
Session sessionThe session configuration for either a realtime or transcription session.
-
class RealtimeSessionCreateResponse:A Realtime session configuration object.
-
String idUnique identifier for the session that looks like
sess_1234567890abcdef. -
JsonValue; object_ "realtime.session"constantThe object type. Always
realtime.session.REALTIME_SESSION("realtime.session")
-
JsonValue; type "realtime"constantThe type of session to create. Always
realtimefor the Realtime API.REALTIME("realtime")
-
Optional<Audio> audioConfiguration for input and output audio.
-
Optional<Input> input-
Optional<RealtimeAudioFormats> formatThe format of the input audio.
-
AudioPcm-
Optional<Rate> rateThe sample rate of the audio. Always
24000._24000(24000)
-
Optional<Type> typeThe audio format. Always
audio/pcm.AUDIO_PCM("audio/pcm")
-
-
AudioPcmu-
Optional<Type> typeThe audio format. Always
audio/pcmu.AUDIO_PCMU("audio/pcmu")
-
-
AudioPcma-
Optional<Type> typeThe audio format. Always
audio/pcma.AUDIO_PCMA("audio/pcma")
-
-
-
Optional<NoiseReduction> noiseReductionConfiguration for input audio noise reduction. This can be set to
nullto turn off. Noise reduction filters audio added to the input audio buffer before it is sent to VAD and the model. Filtering the audio can improve VAD and turn detection accuracy (reducing false positives) and model performance by improving perception of the input audio.-
Optional<NoiseReductionType> typeType of noise reduction.
near_fieldis for close-talking microphones such as headphones,far_fieldis for far-field microphones such as laptop or conference room microphones.-
NEAR_FIELD("near_field") -
FAR_FIELD("far_field")
-
-
-
Optional<AudioTranscription> transcription-
Optional<Delay> delayControls how long the model waits before emitting transcription text. Higher values can improve transcription accuracy at the cost of latency. Only supported with
gpt-realtime-whisperin GA Realtime sessions.-
MINIMAL("minimal") -
LOW("low") -
MEDIUM("medium") -
HIGH("high") -
XHIGH("xhigh")
-
-
Optional<String> languageThe language of the input audio. Supplying the input language in ISO-639-1 (e.g.
en) format will improve accuracy and latency. -
Optional<Model> modelThe model to use for transcription. Current options are
whisper-1,gpt-4o-mini-transcribe,gpt-4o-mini-transcribe-2025-12-15,gpt-4o-transcribe,gpt-4o-transcribe-diarize, andgpt-realtime-whisper. Usegpt-4o-transcribe-diarizewhen you need diarization with speaker labels.-
WHISPER_1("whisper-1") -
GPT_4O_MINI_TRANSCRIBE("gpt-4o-mini-transcribe") -
GPT_4O_MINI_TRANSCRIBE_2025_12_15("gpt-4o-mini-transcribe-2025-12-15") -
GPT_4O_TRANSCRIBE("gpt-4o-transcribe") -
GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize") -
GPT_REALTIME_WHISPER("gpt-realtime-whisper")
-
-
Optional<String> promptAn optional text to guide the model's style or continue a previous audio segment. For
whisper-1, the prompt is a list of keywords. Forgpt-4o-transcribemodels (excludinggpt-4o-transcribe-diarize), the prompt is a free text string, for example "expect words related to technology". Prompt is not supported withgpt-realtime-whisperin GA Realtime sessions.
-
-
Optional<TurnDetection> turnDetectionConfiguration for turn detection, either Server VAD or Semantic VAD. This can be set to
nullto turn off, in which case the client must manually trigger model response.Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.
Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.
For
gpt-realtime-whispertranscription sessions, turn detection must be set tonull; VAD is not supported.-
class ServerVad:Server-side voice activity detection (VAD) which flips on when user speech is detected and off after a period of silence.
-
JsonValue; type "server_vad"constantType of turn detection,
server_vadto turn on simple Server VAD.SERVER_VAD("server_vad")
-
Optional<Boolean> createResponseWhether or not to automatically generate a response when a VAD stop event occurs. If
interrupt_responseis set tofalsethis may fail to create a response if the model is already responding.If both
create_responseandinterrupt_responseare set tofalse, the model will never respond automatically but VAD events will still be emitted. -
Optional<Long> idleTimeoutMsOptional timeout after which a model response will be triggered automatically. This is useful for situations in which a long pause from the user is unexpected, such as a phone call. The model will effectively prompt the user to continue the conversation based on the current context.
The timeout value will be applied after the last model response's audio has finished playing, i.e. it's set to the
response.donetime plus audio playback duration.An
input_audio_buffer.timeout_triggeredevent (plus events associated with the Response) will be emitted when the timeout is reached. Idle timeout is currently only supported forserver_vadmode. -
Optional<Boolean> interruptResponseWhether or not to automatically interrupt (cancel) any ongoing response with output to the default conversation (i.e.
conversationofauto) when a VAD start event occurs. Iftruethen the response will be cancelled, otherwise it will continue until complete.If both
create_responseandinterrupt_responseare set tofalse, the model will never respond automatically but VAD events will still be emitted. -
Optional<Long> prefixPaddingMsUsed only for
server_vadmode. Amount of audio to include before the VAD detected speech (in milliseconds). Defaults to 300ms. -
Optional<Long> silenceDurationMsUsed only for
server_vadmode. Duration of silence to detect speech stop (in milliseconds). Defaults to 500ms. With shorter values the model will respond more quickly, but may jump in on short pauses from the user. -
Optional<Double> thresholdUsed only for
server_vadmode. Activation threshold for VAD (0.0 to 1.0). Defaults to 0.5. A higher threshold will require louder audio to activate the model, and thus might perform better in noisy environments.
-
-
class SemanticVad:Server-side semantic turn detection which uses a model to determine when the user has finished speaking.
-
JsonValue; type "semantic_vad"constantType of turn detection,
semantic_vadto turn on Semantic VAD.SEMANTIC_VAD("semantic_vad")
-
Optional<Boolean> createResponseWhether or not to automatically generate a response when a VAD stop event occurs.
-
Optional<Eagerness> eagernessUsed only for
semantic_vadmode. The eagerness of the model to respond.lowwill wait longer for the user to continue speaking,highwill respond more quickly.autois the default and is equivalent tomedium.low,medium, andhighhave max timeouts of 8s, 4s, and 2s respectively.-
LOW("low") -
MEDIUM("medium") -
HIGH("high") -
AUTO("auto")
-
-
Optional<Boolean> interruptResponseWhether or not to automatically interrupt any ongoing response with output to the default conversation (i.e.
conversationofauto) when a VAD start event occurs.
-
-
-
-
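The eagerness values and their maximum timeouts listed above can be captured in a small lookup. This is a sketch only: `EagernessTimeouts` and its nested `Eagerness` enum are hypothetical names, not SDK types, and the durations simply restate the 8s, 4s, and 2s figures from the description, with `auto` treated as `medium`.

```java
import java.time.Duration;

/** Hypothetical lookup of the maximum Semantic VAD timeouts stated in the description. */
final class EagernessTimeouts {
    private EagernessTimeouts() {}

    enum Eagerness { LOW, MEDIUM, HIGH, AUTO }

    static Duration maxTimeout(Eagerness eagerness) {
        switch (eagerness) {
            case LOW:
                return Duration.ofSeconds(8);
            case HIGH:
                return Duration.ofSeconds(2);
            case MEDIUM:
            case AUTO: // auto is documented as equivalent to medium
            default:
                return Duration.ofSeconds(4);
        }
    }

    public static void main(String[] args) {
        for (Eagerness eagerness : Eagerness.values()) {
            System.out.println(eagerness + " -> " + maxTimeout(eagerness).getSeconds() + "s");
        }
    }
}
```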
Optional<Output> output-
Optional<RealtimeAudioFormats> formatThe format of the output audio.
-
Optional<Double> speedThe speed of the model's spoken response as a multiple of the original speed. 1.0 is the default speed. 0.25 is the minimum speed. 1.5 is the maximum speed. This value can only be changed in between model turns, not while a response is in progress.
This parameter is a post-processing adjustment to the audio after it is generated; it's also possible to prompt the model to speak faster or slower.
-
Optional<Voice> voiceThe voice the model uses to respond. Voice cannot be changed during the session once the model has responded with audio at least once. Current voice options are
alloy,ash,ballad,coral,echo,sage,shimmer,verse,marin, andcedar. We recommendmarinandcedarfor best quality.-
ALLOY("alloy") -
ASH("ash") -
BALLAD("ballad") -
CORAL("coral") -
ECHO("echo") -
SAGE("sage") -
SHIMMER("shimmer") -
VERSE("verse") -
MARIN("marin") -
CEDAR("cedar")
-
-
-
-
Optional<Long> expiresAtExpiration timestamp for the session, in seconds since epoch.
-
Optional<List<Include>> includeAdditional fields to include in server outputs.
item.input_audio_transcription.logprobs: Include logprobs for input audio transcription.ITEM_INPUT_AUDIO_TRANSCRIPTION_LOGPROBS("item.input_audio_transcription.logprobs")
-
Optional<String> instructionsThe default system instructions (i.e. system message) prepended to model calls. This field allows the client to guide the model on desired responses. The model can be instructed on response content and format (e.g. "be extremely succinct", "act friendly", "here are examples of good responses") and on audio behavior (e.g. "talk quickly", "inject emotion into your voice", "laugh frequently"). The instructions are not guaranteed to be followed by the model, but they provide guidance to the model on the desired behavior.
Note that the server sets default instructions which will be used if this field is not set and are visible in the
session.createdevent at the start of the session. -
Optional<MaxOutputTokens> maxOutputTokensMaximum number of output tokens for a single assistant response, inclusive of tool calls. Provide an integer between 1 and 4096 to limit output tokens, or
inffor the maximum available tokens for a given model. Defaults toinf.-
long -
JsonValue;INF("inf")
-
-
Optional<Model> modelThe Realtime model used for this session.
-
GPT_REALTIME("gpt-realtime") -
GPT_REALTIME_1_5("gpt-realtime-1.5") -
GPT_REALTIME_2("gpt-realtime-2") -
GPT_REALTIME_2025_08_28("gpt-realtime-2025-08-28") -
GPT_4O_REALTIME_PREVIEW("gpt-4o-realtime-preview") -
GPT_4O_REALTIME_PREVIEW_2024_10_01("gpt-4o-realtime-preview-2024-10-01") -
GPT_4O_REALTIME_PREVIEW_2024_12_17("gpt-4o-realtime-preview-2024-12-17") -
GPT_4O_REALTIME_PREVIEW_2025_06_03("gpt-4o-realtime-preview-2025-06-03") -
GPT_4O_MINI_REALTIME_PREVIEW("gpt-4o-mini-realtime-preview") -
GPT_4O_MINI_REALTIME_PREVIEW_2024_12_17("gpt-4o-mini-realtime-preview-2024-12-17") -
GPT_REALTIME_MINI("gpt-realtime-mini") -
GPT_REALTIME_MINI_2025_10_06("gpt-realtime-mini-2025-10-06") -
GPT_REALTIME_MINI_2025_12_15("gpt-realtime-mini-2025-12-15") -
GPT_AUDIO_1_5("gpt-audio-1.5") -
GPT_AUDIO_MINI("gpt-audio-mini") -
GPT_AUDIO_MINI_2025_10_06("gpt-audio-mini-2025-10-06") -
GPT_AUDIO_MINI_2025_12_15("gpt-audio-mini-2025-12-15")
-
-
Optional<List<OutputModality>> outputModalitiesThe set of modalities the model can respond with. It defaults to
["audio"], indicating that the model will respond with audio plus a transcript.["text"]can be used to make the model respond with text only. It is not possible to request bothtextandaudioat the same time.-
TEXT("text") -
AUDIO("audio")
-
-
Optional<ResponsePrompt> promptReference to a prompt template and its variables. Learn more.
-
String idThe unique identifier of the prompt template to use.
-
Optional<Variables> variablesOptional map of values to substitute in for variables in your prompt. The substitution values can either be strings, or other Response input types like images or files.
-
String -
class ResponseInputText:A text input to the model.
-
String textThe text input to the model.
-
JsonValue; type "input_text"constantThe type of the input item. Always
input_text.INPUT_TEXT("input_text")
-
-
class ResponseInputImage:An image input to the model. Learn about image inputs.
-
Detail detailThe detail level of the image to be sent to the model. One of
high,low,auto, ororiginal. Defaults toauto.-
LOW("low") -
HIGH("high") -
AUTO("auto") -
ORIGINAL("original")
-
-
JsonValue; type "input_image"constantThe type of the input item. Always
input_image.INPUT_IMAGE("input_image")
-
Optional<String> fileIdThe ID of the file to be sent to the model.
-
Optional<String> imageUrlThe URL of the image to be sent to the model. A fully qualified URL or base64 encoded image in a data URL.
-
-
class ResponseInputFile:A file input to the model.
-
JsonValue; type "input_file"constantThe type of the input item. Always
input_file.INPUT_FILE("input_file")
-
Optional<Detail> detailThe detail level of the file to be sent to the model. Use
lowfor the default rendering behavior, orhighto render the file at higher quality. Defaults tolow.-
LOW("low") -
HIGH("high")
-
-
Optional<String> fileDataThe content of the file to be sent to the model.
-
Optional<String> fileIdThe ID of the file to be sent to the model.
-
Optional<String> fileUrlThe URL of the file to be sent to the model.
-
Optional<String> filenameThe name of the file to be sent to the model.
-
-
-
Optional<String> versionOptional version of the prompt template.
-
-
Optional<RealtimeReasoning> reasoningConfiguration for reasoning-capable Realtime models such as
gpt-realtime-2.-
Optional<RealtimeReasoningEffort> effortConstrains effort on reasoning for reasoning-capable Realtime models such as
gpt-realtime-2.-
MINIMAL("minimal") -
LOW("low") -
MEDIUM("medium") -
HIGH("high") -
XHIGH("xhigh")
-
-
-
Optional<ToolChoice> toolChoiceHow the model chooses tools. Provide one of the string modes or force a specific function/MCP tool.
-
enum ToolChoiceOptions:Controls which (if any) tool is called by the model.
nonemeans the model will not call any tool and instead generates a message.automeans the model can pick between generating a message or calling one or more tools.requiredmeans the model must call one or more tools.-
NONE("none") -
AUTO("auto") -
REQUIRED("required")
-
-
class ToolChoiceFunction:Use this option to force the model to call a specific function.
-
String nameThe name of the function to call.
-
JsonValue; type "function"constantFor function calling, the type is always
function.FUNCTION("function")
-
-
class ToolChoiceMcp:Use this option to force the model to call a specific tool on a remote MCP server.
-
String serverLabelThe label of the MCP server to use.
-
JsonValue; type "mcp"constantFor MCP tools, the type is always
mcp.MCP("mcp")
-
Optional<String> nameThe name of the tool to call on the server.
-
-
-
Optional<List<Tool>> toolsTools available to the model.
-
class RealtimeFunctionTool:-
Optional<String> descriptionThe description of the function, including guidance on when and how to call it, and guidance about what to tell the user when calling (if anything).
-
Optional<String> nameThe name of the function.
-
Optional<JsonValue> parametersParameters of the function in JSON Schema.
-
Optional<Type> typeThe type of the tool, i.e.
function.FUNCTION("function")
-
-
class McpTool:Give the model access to additional tools via remote Model Context Protocol (MCP) servers. Learn more about MCP.
-
String serverLabelA label for this MCP server, used to identify it in tool calls.
-
JsonValue; type "mcp"constantThe type of the MCP tool. Always
mcp.MCP("mcp")
-
Optional<AllowedTools> allowedToolsList of allowed tool names or a filter object.
-
List<String> -
class McpToolFilter:A filter object to specify which tools are allowed.
-
Optional<Boolean> readOnlyIndicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with
readOnlyHint, it will match this filter. -
Optional<List<String>> toolNamesList of allowed tool names.
-
-
-
Optional<String> authorizationAn OAuth access token that can be used with a remote MCP server, either with a custom MCP server URL or a service connector. Your application must handle the OAuth authorization flow and provide the token here.
-
Optional<ConnectorId> connectorIdIdentifier for service connectors, like those available in ChatGPT. One of
server_urlorconnector_idmust be provided. Learn more about service connectors here.Currently supported
connector_idvalues are:-
Dropbox:
connector_dropbox -
Gmail:
connector_gmail -
Google Calendar:
connector_googlecalendar -
Google Drive:
connector_googledrive -
Microsoft Teams:
connector_microsoftteams -
Outlook Calendar:
connector_outlookcalendar -
Outlook Email:
connector_outlookemail -
SharePoint:
connector_sharepoint -
CONNECTOR_DROPBOX("connector_dropbox") -
CONNECTOR_GMAIL("connector_gmail") -
CONNECTOR_GOOGLECALENDAR("connector_googlecalendar") -
CONNECTOR_GOOGLEDRIVE("connector_googledrive") -
CONNECTOR_MICROSOFTTEAMS("connector_microsoftteams") -
CONNECTOR_OUTLOOKCALENDAR("connector_outlookcalendar") -
CONNECTOR_OUTLOOKEMAIL("connector_outlookemail") -
CONNECTOR_SHAREPOINT("connector_sharepoint")
-
-
Optional<Boolean> deferLoadingWhether this MCP tool is deferred and discovered via tool search.
-
Optional<Headers> headersOptional HTTP headers to send to the MCP server. Use for authentication or other purposes.
-
Optional<RequireApproval> requireApprovalSpecify which of the MCP server's tools require approval.
-
class McpToolApprovalFilter:Specify which of the MCP server's tools require approval. Can be
always,never, or a filter object associated with tools that require approval.-
Optional<Always> alwaysA filter object to specify which tools are allowed.
-
Optional<Boolean> readOnlyIndicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with
readOnlyHint, it will match this filter. -
Optional<List<String>> toolNamesList of allowed tool names.
-
-
Optional<Never> neverA filter object to specify which tools are allowed.
-
Optional<Boolean> readOnlyIndicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with
readOnlyHint, it will match this filter. -
Optional<List<String>> toolNamesList of allowed tool names.
-
-
-
enum McpToolApprovalSetting:Specify a single approval policy for all tools. One of
alwaysornever. When set toalways, all tools will require approval. When set tonever, no tools will require approval.-
ALWAYS("always") -
NEVER("never")
-
-
-
Optional<String> serverDescriptionOptional description of the MCP server, used to provide more context.
-
Optional<String> serverUrlThe URL for the MCP server. One of
server_urlorconnector_idmust be provided.
-
-
-
Optional<Tracing> tracingRealtime API can write session traces to the Traces Dashboard. Set to null to disable tracing. Once tracing is enabled for a session, the configuration cannot be modified.
autowill create a trace for the session with default values for the workflow name, group id, and metadata.-
JsonValue;AUTO("auto")
-
class TracingConfiguration:Granular configuration for tracing.
-
Optional<String> groupIdThe group id to attach to this trace to enable filtering and grouping in the Traces Dashboard.
-
Optional<JsonValue> metadataThe arbitrary metadata to attach to this trace to enable filtering in the Traces Dashboard.
-
Optional<String> workflowNameThe name of the workflow to attach to this trace. This is used to name the trace in the Traces Dashboard.
-
-
-
Optional<RealtimeTruncation> truncationWhen the number of tokens in a conversation exceeds the model's input token limit, the conversation will be truncated, meaning messages (starting from the oldest) will not be included in the model's context. A 32k context model with 4,096 max output tokens can only include 28,224 tokens in the context before truncation occurs.
Clients can configure truncation behavior to truncate with a lower max token limit, which is an effective way to control token usage and cost.
Truncation will reduce the number of cached tokens on the next turn (busting the cache), since messages are dropped from the beginning of the context. However, clients can also configure truncation to retain messages up to a fraction of the maximum context size, which will reduce the need for future truncations and thus improve the cache rate.
Truncation can be disabled entirely, which means the server will never truncate and will instead return an error if the conversation exceeds the model's input token limit.
-
RealtimeTruncationStrategy-
AUTO("auto") -
DISABLED("disabled")
-
-
class RealtimeTruncationRetentionRatio:Retain a fraction of the conversation tokens when the conversation exceeds the input token limit. This allows you to amortize truncations across multiple turns, which can help improve cached token usage.
-
double retentionRatioFraction of post-instruction conversation tokens to retain (
0.0-1.0) when the conversation exceeds the input token limit. Setting this to0.8means that messages will be dropped until 80% of the maximum allowed tokens are used. This helps reduce the frequency of truncations and improve cache rates. -
JsonValue; type "retention_ratio"constantUse retention ratio truncation.
RETENTION_RATIO("retention_ratio")
-
Optional<TokenLimits> tokenLimitsOptional custom token limits for this truncation strategy. If not provided, the model's default token limits will be used.
-
Optional<Long> postInstructionsMaximum tokens allowed in the conversation after instructions (including tool definitions). For example, setting this to 5,000 would mean that truncation would occur when the conversation exceeds 5,000 tokens after instructions. This cannot be higher than the model's context window size minus the maximum output tokens.
-
-
-
-
-
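Several of the session fields above carry documented value constraints: `speed` between 0.25 and 1.5, `max_output_tokens` as an integer from 1 to 4096 or `inf`, and output modalities of either text or audio but not both. A client could check these before sending a configuration. The sketch below assumes nothing about the SDK: `SessionConfigChecks` and its methods are hypothetical helpers, and the server remains the authority on what is accepted.

```java
import java.util.List;

/** Hypothetical client-side checks for documented session constraints. */
final class SessionConfigChecks {
    private SessionConfigChecks() {}

    /** Speed is documented as a multiple between 0.25 and 1.5. */
    static boolean isValidSpeed(double speed) {
        return speed >= 0.25 && speed <= 1.5;
    }

    /** max_output_tokens is documented as an integer from 1 to 4096, or the string "inf". */
    static boolean isValidMaxOutputTokens(String value) {
        if ("inf".equals(value)) {
            return true;
        }
        try {
            long tokens = Long.parseLong(value);
            return tokens >= 1 && tokens <= 4096;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    /** Text and audio cannot be requested together, so exactly one modality is expected. */
    static boolean isValidOutputModalities(List<String> modalities) {
        return modalities.equals(List.of("audio")) || modalities.equals(List.of("text"));
    }

    public static void main(String[] args) {
        System.out.println(isValidSpeed(1.0));                                 // prints true
        System.out.println(isValidMaxOutputTokens("8192"));                    // prints false
        System.out.println(isValidOutputModalities(List.of("text", "audio"))); // prints false
    }
}
```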
class RealtimeTranscriptionSessionCreateResponse:A Realtime transcription session configuration object.
-
String idUnique identifier for the session that looks like
sess_1234567890abcdef. -
String object_The object type. Always
realtime.transcription_session. -
JsonValue; type "transcription"constantThe type of session. Always
transcriptionfor transcription sessions.TRANSCRIPTION("transcription")
-
Optional<Audio> audioConfiguration for input audio for the session.
-
Optional<Input> input-
Optional<RealtimeAudioFormats> formatThe PCM audio format. Only a 24kHz sample rate is supported.
-
Optional<NoiseReduction> noiseReductionConfiguration for input audio noise reduction.
-
Optional<NoiseReductionType> typeType of noise reduction.
near_fieldis for close-talking microphones such as headphones,far_fieldis for far-field microphones such as laptop or conference room microphones.
-
-
Optional<AudioTranscription> transcription -
Optional<RealtimeTranscriptionSessionTurnDetection> turnDetectionConfiguration for turn detection. Can be set to
nullto turn off. Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech. Forgpt-realtime-whisper, this must benull; VAD is not supported.-
Optional<Long> prefixPaddingMsAmount of audio to include before the VAD detected speech (in milliseconds). Defaults to 300ms.
-
Optional<Long> silenceDurationMsDuration of silence to detect speech stop (in milliseconds). Defaults to 500ms. With shorter values the model will respond more quickly, but may jump in on short pauses from the user.
-
Optional<Double> thresholdActivation threshold for VAD (0.0 to 1.0). Defaults to 0.5. A higher threshold will require louder audio to activate the model, and thus might perform better in noisy environments.
-
Optional<String> typeType of turn detection, only
server_vadis currently supported.
-
-
-
-
Optional<Long> expiresAtExpiration timestamp for the session, in seconds since epoch.
-
Optional<List<Include>> includeAdditional fields to include in server outputs.
-
item.input_audio_transcription.logprobs: Include logprobs for input audio transcription. -
ITEM_INPUT_AUDIO_TRANSCRIPTION_LOGPROBS("item.input_audio_transcription.logprobs")
-
-
-
-
String valueThe generated client secret value.
-
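Because `expiresAt` is expressed in seconds since the epoch, it converts directly to `java.time.Instant`. The following sketch shows one way a client might check whether a secret is still usable. `ClientSecretExpiry` and its methods are hypothetical helpers, not part of the SDK, and the timestamps in `main` are made-up values.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helpers for working with an expiresAt value in epoch seconds. */
final class ClientSecretExpiry {
    private ClientSecretExpiry() {}

    /** True once the given time has reached or passed the expiration timestamp. */
    static boolean isExpired(long expiresAtEpochSeconds, Instant now) {
        return !now.isBefore(Instant.ofEpochSecond(expiresAtEpochSeconds));
    }

    /** Time left before expiration, or zero if the secret has already expired. */
    static Duration remaining(long expiresAtEpochSeconds, Instant now) {
        Duration left = Duration.between(now, Instant.ofEpochSecond(expiresAtEpochSeconds));
        return left.isNegative() ? Duration.ZERO : left;
    }

    public static void main(String[] args) {
        long expiresAt = 1_700_000_600L; // made-up timestamp for illustration
        Instant now = Instant.ofEpochSecond(1_700_000_000L);
        System.out.println(isExpired(expiresAt, now));              // prints false
        System.out.println(remaining(expiresAt, now).getSeconds()); // prints 600
    }
}
```

Passing the current time in as a parameter, rather than calling `Instant.now()` inside the helper, keeps the check easy to test.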
Example
package com.openai.example;

import com.openai.client.OpenAIClient;
import com.openai.client.okhttp.OpenAIOkHttpClient;
import com.openai.models.realtime.clientsecrets.ClientSecretCreateParams;
import com.openai.models.realtime.clientsecrets.ClientSecretCreateResponse;

public final class Main {
    private Main() {}

    public static void main(String[] args) {
        // Configure the client from environment variables.
        OpenAIClient client = OpenAIOkHttpClient.fromEnv();

        // Create a client secret using the default request parameters.
        ClientSecretCreateResponse clientSecret = client.realtime().clientSecrets().create();
    }
}
Response
{
  "expires_at": 0,
  "session": {
    "id": "id",
    "object": "realtime.session",
    "type": "realtime",
    "audio": {
      "input": {
        "format": {
          "rate": 24000,
          "type": "audio/pcm"
        },
        "noise_reduction": {
          "type": "near_field"
        },
        "transcription": {
          "delay": "minimal",
          "language": "language",
          "model": "string",
          "prompt": "prompt"
        },
        "turn_detection": {
          "type": "server_vad",
          "create_response": true,
          "idle_timeout_ms": 5000,
          "interrupt_response": true,
          "prefix_padding_ms": 0,
          "silence_duration_ms": 0,
          "threshold": 0
        }
      },
      "output": {
        "format": {
          "rate": 24000,
          "type": "audio/pcm"
        },
        "speed": 0.25,
        "voice": "ash"
      }
    },
    "expires_at": 0,
    "include": [
      "item.input_audio_transcription.logprobs"
    ],
    "instructions": "instructions",
    "max_output_tokens": 0,
    "model": "string",
    "output_modalities": [
      "text"
    ],
    "prompt": {
      "id": "id",
      "variables": {
        "foo": "string"
      },
      "version": "version"
    },
    "reasoning": {
      "effort": "minimal"
    },
    "tool_choice": "none",
    "tools": [
      {
        "description": "description",
        "name": "name",
        "parameters": {},
        "type": "function"
      }
    ],
    "tracing": "auto",
    "truncation": "auto"
  },
  "value": "value"
}
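For readers who want to see the underlying HTTP call rather than the SDK wrapper, the request can be sketched with the JDK's built-in `java.net.http` types. Everything specific here is an assumption rather than something stated in this section: the endpoint URL, the bearer-token header, and the minimal JSON body are illustrative, and `ClientSecretRequestSketch` is a hypothetical class. The sketch only builds the request and does not send it.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Hypothetical sketch of the raw HTTP request behind client secret creation. */
final class ClientSecretRequestSketch {
    private ClientSecretRequestSketch() {}

    // Assumed endpoint URL; it is not given in this section.
    static final String ENDPOINT = "https://api.openai.com/v1/realtime/client_secrets";

    /** Builds, but does not send, a POST request carrying the given JSON body. */
    static HttpRequest build(String apiKey, String jsonBody) {
        return HttpRequest.newBuilder(URI.create(ENDPOINT))
                .header("Authorization", "Bearer " + apiKey)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
                .build();
    }

    public static void main(String[] args) {
        // Illustrative body only; see the parameter reference for the supported fields.
        String body = "{\"session\":{\"type\":\"realtime\"}}";
        HttpRequest request = build("sk-placeholder", body);
        System.out.println(request.method() + " " + request.uri());
        // To send it: HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())
    }
}
```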
Domain Types
Realtime Session Create Response
-
class RealtimeSessionCreateResponse:A Realtime session configuration object.
-
String idUnique identifier for the session that looks like
sess_1234567890abcdef. -
JsonValue; object_ "realtime.session"constantThe object type. Always
realtime.session.REALTIME_SESSION("realtime.session")
-
JsonValue; type "realtime"constantThe type of session to create. Always
realtimefor the Realtime API.REALTIME("realtime")
-
Optional<Audio> audioConfiguration for input and output audio.
-
Optional<Input> input-
Optional<RealtimeAudioFormats> formatThe format of the input audio.
-
AudioPcm-
Optional<Rate> rateThe sample rate of the audio. Always
24000._24000(24000)
-
Optional<Type> typeThe audio format. Always
audio/pcm.AUDIO_PCM("audio/pcm")
-
-
AudioPcmu-
Optional<Type> typeThe audio format. Always
audio/pcmu.AUDIO_PCMU("audio/pcmu")
-
-
AudioPcma-
Optional<Type> typeThe audio format. Always
audio/pcma.AUDIO_PCMA("audio/pcma")
-
-
-
Optional<NoiseReduction> noiseReductionConfiguration for input audio noise reduction. This can be set to
nullto turn off. Noise reduction filters audio added to the input audio buffer before it is sent to VAD and the model. Filtering the audio can improve VAD and turn detection accuracy (reducing false positives) and model performance by improving perception of the input audio.-
Optional<NoiseReductionType> typeType of noise reduction.
near_fieldis for close-talking microphones such as headphones,far_fieldis for far-field microphones such as laptop or conference room microphones.-
NEAR_FIELD("near_field") -
FAR_FIELD("far_field")
-
-
-
Optional<AudioTranscription> transcription-
Optional<Delay> delayControls how long the model waits before emitting transcription text. Higher values can improve transcription accuracy at the cost of latency. Only supported with
gpt-realtime-whisperin GA Realtime sessions.-
MINIMAL("minimal") -
LOW("low") -
MEDIUM("medium") -
HIGH("high") -
XHIGH("xhigh")
-
-
Optional<String> languageThe language of the input audio. Supplying the input language in ISO-639-1 (e.g.
en) format will improve accuracy and latency. -
Optional<Model> modelThe model to use for transcription. Current options are
whisper-1,gpt-4o-mini-transcribe,gpt-4o-mini-transcribe-2025-12-15,gpt-4o-transcribe,gpt-4o-transcribe-diarize, andgpt-realtime-whisper. Usegpt-4o-transcribe-diarizewhen you need diarization with speaker labels.-
WHISPER_1("whisper-1") -
GPT_4O_MINI_TRANSCRIBE("gpt-4o-mini-transcribe") -
GPT_4O_MINI_TRANSCRIBE_2025_12_15("gpt-4o-mini-transcribe-2025-12-15") -
GPT_4O_TRANSCRIBE("gpt-4o-transcribe") -
GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize") -
GPT_REALTIME_WHISPER("gpt-realtime-whisper")
-
-
Optional<String> promptAn optional text to guide the model's style or continue a previous audio segment. For
whisper-1, the prompt is a list of keywords. Forgpt-4o-transcribemodels (excludinggpt-4o-transcribe-diarize), the prompt is a free text string, for example "expect words related to technology". Prompt is not supported withgpt-realtime-whisperin GA Realtime sessions.
-
-
Optional<TurnDetection> turnDetectionConfiguration for turn detection, either Server VAD or Semantic VAD. This can be set to
null to turn off, in which case the client must manually trigger a model response. Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.
Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.
For
gpt-realtime-whispertranscription sessions, turn detection must be set tonull; VAD is not supported.-
class ServerVad:Server-side voice activity detection (VAD) which flips on when user speech is detected and off after a period of silence.
-
JsonValue; type "server_vad"constantType of turn detection,
server_vadto turn on simple Server VAD.SERVER_VAD("server_vad")
-
Optional<Boolean> createResponseWhether or not to automatically generate a response when a VAD stop event occurs. If
interrupt_responseis set tofalsethis may fail to create a response if the model is already responding.If both
create_responseandinterrupt_responseare set tofalse, the model will never respond automatically but VAD events will still be emitted. -
Optional<Long> idleTimeoutMsOptional timeout after which a model response will be triggered automatically. This is useful for situations in which a long pause from the user is unexpected, such as a phone call. The model will effectively prompt the user to continue the conversation based on the current context.
The timeout value will be applied after the last model response's audio has finished playing, i.e. it's set to the
response.donetime plus audio playback duration.An
input_audio_buffer.timeout_triggeredevent (plus events associated with the Response) will be emitted when the timeout is reached. Idle timeout is currently only supported forserver_vadmode. -
Optional<Boolean> interruptResponseWhether or not to automatically interrupt (cancel) any ongoing response with output to the default conversation (i.e.
conversationofauto) when a VAD start event occurs. Iftruethen the response will be cancelled, otherwise it will continue until complete.If both
create_responseandinterrupt_responseare set tofalse, the model will never respond automatically but VAD events will still be emitted. -
Optional<Long> prefixPaddingMsUsed only for
server_vadmode. Amount of audio to include before the VAD detected speech (in milliseconds). Defaults to 300ms. -
Optional<Long> silenceDurationMsUsed only for
server_vadmode. Duration of silence to detect speech stop (in milliseconds). Defaults to 500ms. With shorter values the model will respond more quickly, but may jump in on short pauses from the user. -
Optional<Double> thresholdUsed only for
server_vad mode. Activation threshold for VAD (0.0 to 1.0); defaults to 0.5. A higher threshold will require louder audio to activate the model, and thus might perform better in noisy environments.
-
-
class SemanticVad:Server-side semantic turn detection which uses a model to determine when the user has finished speaking.
-
JsonValue; type "semantic_vad"constantType of turn detection,
semantic_vadto turn on Semantic VAD.SEMANTIC_VAD("semantic_vad")
-
Optional<Boolean> createResponseWhether or not to automatically generate a response when a VAD stop event occurs.
-
Optional<Eagerness> eagernessUsed only for
semantic_vadmode. The eagerness of the model to respond.lowwill wait longer for the user to continue speaking,highwill respond more quickly.autois the default and is equivalent tomedium.low,medium, andhighhave max timeouts of 8s, 4s, and 2s respectively.-
LOW("low") -
MEDIUM("medium") -
HIGH("high") -
AUTO("auto")
-
-
Optional<Boolean> interruptResponseWhether or not to automatically interrupt any ongoing response with output to the default conversation (i.e.
conversationofauto) when a VAD start event occurs.
-
-
-
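The `eagerness` values described for Semantic VAD can be summarized as a lookup from level to maximum timeout. The sketch below only encodes the numbers stated above (8s, 4s, 2s, with `auto` equivalent to `medium`); the class and method names are hypothetical, and treating a missing value as `auto` is an assumption based on `auto` being the documented default.

```java
// Hypothetical helper, not part of the SDK.
final class EagernessTimeouts {
    private EagernessTimeouts() {}

    /** Returns the documented maximum timeout, in seconds, for an eagerness level. */
    static int maxTimeoutSeconds(String eagerness) {
        String level = (eagerness == null) ? "auto" : eagerness; // assumption: unset means auto
        switch (level) {
            case "low":
                return 8;
            case "medium":
            case "auto": // documented as equivalent to medium
                return 4;
            case "high":
                return 2;
            default:
                throw new IllegalArgumentException("Unknown eagerness: " + level);
        }
    }

    public static void main(String[] args) {
        System.out.println(maxTimeoutSeconds("low"));  // prints 8
        System.out.println(maxTimeoutSeconds("auto")); // prints 4
        System.out.println(maxTimeoutSeconds("high")); // prints 2
    }
}
```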
-
Optional<Output> output-
Optional<RealtimeAudioFormats> formatThe format of the output audio.
-
Optional<Double> speedThe speed of the model's spoken response as a multiple of the original speed. 1.0 is the default speed. 0.25 is the minimum speed. 1.5 is the maximum speed. This value can only be changed in between model turns, not while a response is in progress.
This parameter is a post-processing adjustment to the audio after it is generated; it's also possible to prompt the model to speak faster or slower.
-
Optional<Voice> voiceThe voice the model uses to respond. Voice cannot be changed during the session once the model has responded with audio at least once. Current voice options are
alloy,ash,ballad,coral,echo,sage,shimmer,verse,marin, andcedar. We recommendmarinandcedarfor best quality.-
ALLOY("alloy") -
ASH("ash") -
BALLAD("ballad") -
CORAL("coral") -
ECHO("echo") -
SAGE("sage") -
SHIMMER("shimmer") -
VERSE("verse") -
MARIN("marin") -
CEDAR("cedar")
-
-
-
-
Optional<Long> expiresAtExpiration timestamp for the session, in seconds since epoch.
-
Optional<List<Include>> includeAdditional fields to include in server outputs.
item.input_audio_transcription.logprobs: Include logprobs for input audio transcription.ITEM_INPUT_AUDIO_TRANSCRIPTION_LOGPROBS("item.input_audio_transcription.logprobs")
-
Optional<String> instructionsThe default system instructions (i.e. system message) prepended to model calls. This field allows the client to guide the model on desired responses. The model can be instructed on response content and format, (e.g. "be extremely succinct", "act friendly", "here are examples of good responses") and on audio behavior (e.g. "talk quickly", "inject emotion into your voice", "laugh frequently"). The instructions are not guaranteed to be followed by the model, but they provide guidance to the model on the desired behavior.
Note that the server sets default instructions which will be used if this field is not set and are visible in the
session.createdevent at the start of the session. -
Optional<MaxOutputTokens> maxOutputTokensMaximum number of output tokens for a single assistant response, inclusive of tool calls. Provide an integer between 1 and 4096 to limit output tokens, or
inffor the maximum available tokens for a given model. Defaults toinf.-
long -
JsonValue;INF("inf")
-
-
Optional<Model> modelThe Realtime model used for this session.
-
GPT_REALTIME("gpt-realtime") -
GPT_REALTIME_1_5("gpt-realtime-1.5") -
GPT_REALTIME_2("gpt-realtime-2") -
GPT_REALTIME_2025_08_28("gpt-realtime-2025-08-28") -
GPT_4O_REALTIME_PREVIEW("gpt-4o-realtime-preview") -
GPT_4O_REALTIME_PREVIEW_2024_10_01("gpt-4o-realtime-preview-2024-10-01") -
GPT_4O_REALTIME_PREVIEW_2024_12_17("gpt-4o-realtime-preview-2024-12-17") -
GPT_4O_REALTIME_PREVIEW_2025_06_03("gpt-4o-realtime-preview-2025-06-03") -
GPT_4O_MINI_REALTIME_PREVIEW("gpt-4o-mini-realtime-preview") -
GPT_4O_MINI_REALTIME_PREVIEW_2024_12_17("gpt-4o-mini-realtime-preview-2024-12-17") -
GPT_REALTIME_MINI("gpt-realtime-mini") -
GPT_REALTIME_MINI_2025_10_06("gpt-realtime-mini-2025-10-06") -
GPT_REALTIME_MINI_2025_12_15("gpt-realtime-mini-2025-12-15") -
GPT_AUDIO_1_5("gpt-audio-1.5") -
GPT_AUDIO_MINI("gpt-audio-mini") -
GPT_AUDIO_MINI_2025_10_06("gpt-audio-mini-2025-10-06") -
GPT_AUDIO_MINI_2025_12_15("gpt-audio-mini-2025-12-15")
-
-
Optional<List<OutputModality>> outputModalitiesThe set of modalities the model can respond with. It defaults to
["audio"], indicating that the model will respond with audio plus a transcript.["text"]can be used to make the model respond with text only. It is not possible to request bothtextandaudioat the same time.-
TEXT("text") -
AUDIO("audio")
-
-
Optional<ResponsePrompt> promptReference to a prompt template and its variables. Learn more.
-
String idThe unique identifier of the prompt template to use.
-
Optional<Variables> variablesOptional map of values to substitute in for variables in your prompt. The substitution values can either be strings, or other Response input types like images or files.
-
String -
class ResponseInputText:A text input to the model.
-
String textThe text input to the model.
-
JsonValue; type "input_text"constantThe type of the input item. Always
input_text.INPUT_TEXT("input_text")
-
-
class ResponseInputImage:An image input to the model. Learn about image inputs.
-
Detail detailThe detail level of the image to be sent to the model. One of
high,low,auto, ororiginal. Defaults toauto.-
LOW("low") -
HIGH("high") -
AUTO("auto") -
ORIGINAL("original")
-
-
JsonValue; type "input_image"constantThe type of the input item. Always
input_image.INPUT_IMAGE("input_image")
-
Optional<String> fileIdThe ID of the file to be sent to the model.
-
Optional<String> imageUrlThe URL of the image to be sent to the model. A fully qualified URL or base64 encoded image in a data URL.
-
-
class ResponseInputFile:A file input to the model.
-
JsonValue; type "input_file"constantThe type of the input item. Always
input_file.INPUT_FILE("input_file")
-
Optional<Detail> detailThe detail level of the file to be sent to the model. Use
lowfor the default rendering behavior, orhighto render the file at higher quality. Defaults tolow.-
LOW("low") -
HIGH("high")
-
-
Optional<String> fileDataThe content of the file to be sent to the model.
-
Optional<String> fileIdThe ID of the file to be sent to the model.
-
Optional<String> fileUrlThe URL of the file to be sent to the model.
-
Optional<String> filenameThe name of the file to be sent to the model.
-
-
-
Optional<String> versionOptional version of the prompt template.
-
-
Optional<RealtimeReasoning> reasoningConfiguration for reasoning-capable Realtime models such as
gpt-realtime-2.-
Optional<RealtimeReasoningEffort> effortConstrains effort on reasoning for reasoning-capable Realtime models such as
gpt-realtime-2.-
MINIMAL("minimal") -
LOW("low") -
MEDIUM("medium") -
HIGH("high") -
XHIGH("xhigh")
-
-
-
Optional<ToolChoice> toolChoiceHow the model chooses tools. Provide one of the string modes or force a specific function/MCP tool.
-
enum ToolChoiceOptions:Controls which (if any) tool is called by the model.
nonemeans the model will not call any tool and instead generates a message.automeans the model can pick between generating a message or calling one or more tools.requiredmeans the model must call one or more tools.-
NONE("none") -
AUTO("auto") -
REQUIRED("required")
-
-
class ToolChoiceFunction:Use this option to force the model to call a specific function.
-
String nameThe name of the function to call.
-
JsonValue; type "function"constantFor function calling, the type is always
function.FUNCTION("function")
-
-
class ToolChoiceMcp:Use this option to force the model to call a specific tool on a remote MCP server.
-
String serverLabelThe label of the MCP server to use.
-
JsonValue; type "mcp"constantFor MCP tools, the type is always
mcp.MCP("mcp")
-
Optional<String> nameThe name of the tool to call on the server.
-
-
-
Optional<List<Tool>> toolsTools available to the model.
-
class RealtimeFunctionTool:-
Optional<String> descriptionThe description of the function, including guidance on when and how to call it, and guidance about what to tell the user when calling (if anything).
-
Optional<String> nameThe name of the function.
-
Optional<JsonValue> parametersParameters of the function in JSON Schema.
-
Optional<Type> typeThe type of the tool, i.e.
function.FUNCTION("function")
-
-
class McpTool:Give the model access to additional tools via remote Model Context Protocol (MCP) servers. Learn more about MCP.
-
String serverLabelA label for this MCP server, used to identify it in tool calls.
-
JsonValue; type "mcp"constantThe type of the MCP tool. Always
mcp.MCP("mcp")
-
Optional<AllowedTools> allowedToolsList of allowed tool names or a filter object.
-
List<String> -
class McpToolFilter:A filter object to specify which tools are allowed.
-
Optional<Boolean> readOnlyIndicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with
readOnlyHint, it will match this filter. -
Optional<List<String>> toolNamesList of allowed tool names.
-
-
-
Optional<String> authorizationAn OAuth access token that can be used with a remote MCP server, either with a custom MCP server URL or a service connector. Your application must handle the OAuth authorization flow and provide the token here.
-
Optional<ConnectorId> connectorIdIdentifier for service connectors, like those available in ChatGPT. One of
server_urlorconnector_idmust be provided. Learn more about service connectors here.Currently supported
connector_idvalues are:-
Dropbox:
connector_dropbox -
Gmail:
connector_gmail -
Google Calendar:
connector_googlecalendar -
Google Drive:
connector_googledrive -
Microsoft Teams:
connector_microsoftteams -
Outlook Calendar:
connector_outlookcalendar -
Outlook Email:
connector_outlookemail -
SharePoint:
connector_sharepoint -
CONNECTOR_DROPBOX("connector_dropbox") -
CONNECTOR_GMAIL("connector_gmail") -
CONNECTOR_GOOGLECALENDAR("connector_googlecalendar") -
CONNECTOR_GOOGLEDRIVE("connector_googledrive") -
CONNECTOR_MICROSOFTTEAMS("connector_microsoftteams") -
CONNECTOR_OUTLOOKCALENDAR("connector_outlookcalendar") -
CONNECTOR_OUTLOOKEMAIL("connector_outlookemail") -
CONNECTOR_SHAREPOINT("connector_sharepoint")
-
-
Optional<Boolean> deferLoadingWhether this MCP tool is deferred and discovered via tool search.
-
Optional<Headers> headersOptional HTTP headers to send to the MCP server. Use for authentication or other purposes.
-
Optional<RequireApproval> requireApprovalSpecify which of the MCP server's tools require approval.
-
class McpToolApprovalFilter:Specify which of the MCP server's tools require approval. Can be
always,never, or a filter object associated with tools that require approval.-
Optional<Always> alwaysA filter object to specify which tools are allowed.
-
Optional<Boolean> readOnlyIndicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with
readOnlyHint, it will match this filter. -
Optional<List<String>> toolNamesList of allowed tool names.
-
-
Optional<Never> neverA filter object to specify which tools are allowed.
-
Optional<Boolean> readOnlyIndicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with
readOnlyHint, it will match this filter. -
Optional<List<String>> toolNamesList of allowed tool names.
-
-
-
enum McpToolApprovalSetting:Specify a single approval policy for all tools. One of
always or never. When set to always, all tools will require approval. When set to never, no tools will require approval.-
ALWAYS("always") -
NEVER("never")
-
-
-
Optional<String> serverDescriptionOptional description of the MCP server, used to provide more context.
-
Optional<String> serverUrlThe URL for the MCP server. One of
server_urlorconnector_idmust be provided.
-
-
-
Optional<Tracing> tracingRealtime API can write session traces to the Traces Dashboard. Set to null to disable tracing. Once tracing is enabled for a session, the configuration cannot be modified.
autowill create a trace for the session with default values for the workflow name, group id, and metadata.-
JsonValue;AUTO("auto")
-
class TracingConfiguration:Granular configuration for tracing.
-
Optional<String> groupIdThe group id to attach to this trace to enable filtering and grouping in the Traces Dashboard.
-
Optional<JsonValue> metadataThe arbitrary metadata to attach to this trace to enable filtering in the Traces Dashboard.
-
Optional<String> workflowNameThe name of the workflow to attach to this trace. This is used to name the trace in the Traces Dashboard.
-
-
-
Optional<RealtimeTruncation> truncationWhen the number of tokens in a conversation exceeds the model's input token limit, the conversation will be truncated, meaning messages (starting from the oldest) will not be included in the model's context. A 32k context model with 4,096 max output tokens can only include 28,224 tokens in the context before truncation occurs.
Clients can configure truncation behavior to truncate with a lower max token limit, which is an effective way to control token usage and cost.
Truncation will reduce the number of cached tokens on the next turn (busting the cache), since messages are dropped from the beginning of the context. However, clients can also configure truncation to retain messages up to a fraction of the maximum context size, which will reduce the need for future truncations and thus improve the cache rate.
Truncation can be disabled entirely, which means the server will never truncate but would instead return an error if the conversation exceeds the model's input token limit.
-
RealtimeTruncationStrategy-
AUTO("auto") -
DISABLED("disabled")
-
-
class RealtimeTruncationRetentionRatio:Retain a fraction of the conversation tokens when the conversation exceeds the input token limit. This allows you to amortize truncations across multiple turns, which can help improve cached token usage.
-
double retentionRatioFraction of post-instruction conversation tokens to retain (
0.0-1.0) when the conversation exceeds the input token limit. Setting this to0.8means that messages will be dropped until 80% of the maximum allowed tokens are used. This helps reduce the frequency of truncations and improve cache rates. -
JsonValue; type "retention_ratio"constantUse retention ratio truncation.
RETENTION_RATIO("retention_ratio")
-
Optional<TokenLimits> tokenLimitsOptional custom token limits for this truncation strategy. If not provided, the model's default token limits will be used.
-
Optional<Long> postInstructionsMaximum tokens allowed in the conversation after instructions (which include tool definitions). For example, setting this to 5,000 would mean that truncation would occur when the conversation exceeds 5,000 tokens after instructions. This cannot be higher than the model's context window size minus the maximum output tokens.
-
-
-
-
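The retention ratio arithmetic described for `RealtimeTruncationRetentionRatio` can be illustrated with a small client-side simulation. This is only a sketch of the stated rule (once the conversation exceeds the limit, drop the oldest messages until at most `retentionRatio` times the limit remains); it is not the server's implementation, and the class name, the per-message token counts, and the 5,000-token limit are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

// Hypothetical simulation, not part of the SDK.
final class RetentionRatioSketch {
    private RetentionRatioSketch() {}

    /**
     * Drops the oldest messages once the total exceeds tokenLimit, continuing
     * until at most retentionRatio * tokenLimit tokens remain.
     * Returns the number of messages dropped.
     */
    static int truncate(Deque<Integer> messageTokens, long tokenLimit, double retentionRatio) {
        long total = 0;
        for (int tokens : messageTokens) {
            total += tokens;
        }
        if (total <= tokenLimit) {
            return 0; // under the limit: nothing to truncate
        }
        long target = (long) Math.floor(tokenLimit * retentionRatio);
        int dropped = 0;
        while (total > target && !messageTokens.isEmpty()) {
            total -= messageTokens.removeFirst(); // oldest message goes first
            dropped++;
        }
        return dropped;
    }

    public static void main(String[] args) {
        // Five messages totalling 5,400 tokens against a 5,000-token limit.
        Deque<Integer> conversation =
                new ArrayDeque<>(Arrays.asList(1500, 1200, 1000, 900, 800));
        int dropped = truncate(conversation, 5_000, 0.8);
        // Target is 4,000 tokens; dropping the 1,500-token message leaves 3,900.
        System.out.println(dropped + " dropped, remaining " + conversation);
    }
}
```

Because the conversation is cut back to 80% of the limit rather than just below it, several more turns can pass before the next truncation, which is the amortization effect the description refers to.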
Realtime Transcription Session Create Response
-
class RealtimeTranscriptionSessionCreateResponse:A Realtime transcription session configuration object.
-
String idUnique identifier for the session that looks like
sess_1234567890abcdef. -
String object_The object type. Always
realtime.transcription_session. -
JsonValue; type "transcription"constantThe type of session. Always
transcriptionfor transcription sessions.TRANSCRIPTION("transcription")
-
Optional<Audio> audioConfiguration for input audio for the session.
-
Optional<Input> input-
Optional<RealtimeAudioFormats> formatThe PCM audio format. Only a 24kHz sample rate is supported.
-
AudioPcm-
Optional<Rate> rateThe sample rate of the audio. Always
24000._24000(24000)
-
Optional<Type> typeThe audio format. Always
audio/pcm.AUDIO_PCM("audio/pcm")
-
-
AudioPcmu-
Optional<Type> typeThe audio format. Always
audio/pcmu.AUDIO_PCMU("audio/pcmu")
-
-
AudioPcma-
Optional<Type> typeThe audio format. Always
audio/pcma.AUDIO_PCMA("audio/pcma")
-
-
-
Optional<NoiseReduction> noiseReductionConfiguration for input audio noise reduction.
-
Optional<NoiseReductionType> typeType of noise reduction.
near_fieldis for close-talking microphones such as headphones,far_fieldis for far-field microphones such as laptop or conference room microphones.-
NEAR_FIELD("near_field") -
FAR_FIELD("far_field")
-
-
-
Optional<AudioTranscription> transcription-
Optional<Delay> delayControls how long the model waits before emitting transcription text. Higher values can improve transcription accuracy at the cost of latency. Only supported with
gpt-realtime-whisperin GA Realtime sessions.-
MINIMAL("minimal") -
LOW("low") -
MEDIUM("medium") -
HIGH("high") -
XHIGH("xhigh")
-
-
Optional<String> languageThe language of the input audio. Supplying the input language in ISO-639-1 (e.g.
en) format will improve accuracy and latency. -
Optional<Model> modelThe model to use for transcription. Current options are
whisper-1,gpt-4o-mini-transcribe,gpt-4o-mini-transcribe-2025-12-15,gpt-4o-transcribe,gpt-4o-transcribe-diarize, andgpt-realtime-whisper. Usegpt-4o-transcribe-diarizewhen you need diarization with speaker labels.-
WHISPER_1("whisper-1") -
GPT_4O_MINI_TRANSCRIBE("gpt-4o-mini-transcribe") -
GPT_4O_MINI_TRANSCRIBE_2025_12_15("gpt-4o-mini-transcribe-2025-12-15") -
GPT_4O_TRANSCRIBE("gpt-4o-transcribe") -
GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize") -
GPT_REALTIME_WHISPER("gpt-realtime-whisper")
-
-
Optional<String> promptAn optional text to guide the model's style or continue a previous audio segment. For
whisper-1, the prompt is a list of keywords. Forgpt-4o-transcribemodels (excludinggpt-4o-transcribe-diarize), the prompt is a free text string, for example "expect words related to technology". Prompt is not supported withgpt-realtime-whisperin GA Realtime sessions.
-
-
Optional<RealtimeTranscriptionSessionTurnDetection> turnDetectionConfiguration for turn detection. Can be set to
nullto turn off. Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech. Forgpt-realtime-whisper, this must benull; VAD is not supported.-
Optional<Long> prefixPaddingMsAmount of audio to include before the VAD detected speech (in milliseconds). Defaults to 300ms.
-
Optional<Long> silenceDurationMsDuration of silence to detect speech stop (in milliseconds). Defaults to 500ms. With shorter values the model will respond more quickly, but may jump in on short pauses from the user.
-
Optional<Double> thresholdActivation threshold for VAD (0.0 to 1.0); defaults to 0.5. A higher threshold will require louder audio to activate the model, and thus might perform better in noisy environments.
-
Optional<String> typeType of turn detection, only
server_vadis currently supported.
-
-
-
-
Optional<Long> expiresAtExpiration timestamp for the session, in seconds since epoch.
-
Optional<List<Include>> includeAdditional fields to include in server outputs.
-
item.input_audio_transcription.logprobs: Include logprobs for input audio transcription. -
ITEM_INPUT_AUDIO_TRANSCRIPTION_LOGPROBS("item.input_audio_transcription.logprobs")
-
-
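The transcription fields above carry three model-specific constraints: `delay` is only supported with `gpt-realtime-whisper`, `prompt` is not supported with it, and turn detection must be `null` for it. A minimal pre-flight check of those rules might look like the following. The class, method, and parameter names are hypothetical, plain strings stand in for the SDK types, and the server remains the authority on what it accepts.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical client-side check, not part of the SDK.
final class TranscriptionConfigCheck {
    static final String REALTIME_WHISPER = "gpt-realtime-whisper";

    private TranscriptionConfigCheck() {}

    /** Returns human-readable problems; an empty list means no documented rule is violated. */
    static List<String> problems(String model, String delay, String prompt,
                                 boolean turnDetectionEnabled) {
        List<String> found = new ArrayList<>();
        boolean isRealtimeWhisper = REALTIME_WHISPER.equals(model);
        if (delay != null && !isRealtimeWhisper) {
            found.add("delay is only supported with " + REALTIME_WHISPER);
        }
        if (prompt != null && isRealtimeWhisper) {
            found.add("prompt is not supported with " + REALTIME_WHISPER);
        }
        if (turnDetectionEnabled && isRealtimeWhisper) {
            found.add("turn detection must be null for " + REALTIME_WHISPER);
        }
        return found;
    }

    public static void main(String[] args) {
        // Valid: realtime whisper with a delay, no prompt, no turn detection.
        System.out.println(problems(REALTIME_WHISPER, "low", null, false)); // prints []
        // Invalid: delay on a model that is not documented to support it.
        System.out.println(problems("gpt-4o-transcribe", "low", null, true));
    }
}
```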
Realtime Transcription Session Turn Detection
-
class RealtimeTranscriptionSessionTurnDetection:Configuration for turn detection. Can be set to
nullto turn off. Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech. Forgpt-realtime-whisper, this must benull; VAD is not supported.-
Optional<Long> prefixPaddingMsAmount of audio to include before the VAD detected speech (in milliseconds). Defaults to 300ms.
-
Optional<Long> silenceDurationMsDuration of silence to detect speech stop (in milliseconds). Defaults to 500ms. With shorter values the model will respond more quickly, but may jump in on short pauses from the user.
-
Optional<Double> thresholdActivation threshold for VAD (0.0 to 1.0); defaults to 0.5. A higher threshold will require louder audio to activate the model, and thus might perform better in noisy environments.
-
Optional<String> typeType of turn detection, only
server_vadis currently supported.
-
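The documented Server VAD defaults (300ms prefix padding, 500ms silence duration, 0.5 threshold) and the 0.0 to 1.0 threshold range can be captured in a small value holder. This is a sketch only: `ServerVadSettings` is a hypothetical class rather than an SDK type, and rejecting negative durations is an assumption not stated in the reference.

```java
// Hypothetical value holder, not part of the SDK.
final class ServerVadSettings {
    static final long DEFAULT_PREFIX_PADDING_MS = 300L;
    static final long DEFAULT_SILENCE_DURATION_MS = 500L;
    static final double DEFAULT_THRESHOLD = 0.5;

    final long prefixPaddingMs;
    final long silenceDurationMs;
    final double threshold;

    /** Any argument may be null, in which case the documented default is used. */
    ServerVadSettings(Long prefixPaddingMs, Long silenceDurationMs, Double threshold) {
        this.prefixPaddingMs =
                prefixPaddingMs != null ? prefixPaddingMs : DEFAULT_PREFIX_PADDING_MS;
        this.silenceDurationMs =
                silenceDurationMs != null ? silenceDurationMs : DEFAULT_SILENCE_DURATION_MS;
        this.threshold = threshold != null ? threshold : DEFAULT_THRESHOLD;

        // Assumption: negative durations are meaningless, so reject them early.
        if (this.prefixPaddingMs < 0 || this.silenceDurationMs < 0) {
            throw new IllegalArgumentException("durations must be non-negative");
        }
        // Documented range is 0.0 to 1.0; the negated form also rejects NaN.
        if (!(this.threshold >= 0.0 && this.threshold <= 1.0)) {
            throw new IllegalArgumentException("threshold must be between 0.0 and 1.0");
        }
    }

    public static void main(String[] args) {
        ServerVadSettings defaults = new ServerVadSettings(null, null, null);
        System.out.println(defaults.prefixPaddingMs + " " + defaults.silenceDurationMs
                + " " + defaults.threshold); // prints 300 500 0.5
    }
}
```

A shorter `silenceDurationMs` makes the model respond sooner but increases the chance of cutting in on brief pauses, so a noisy phone line might pair a higher threshold with a longer silence duration.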
Calls
Accept call
realtime().calls().accept(CallAcceptParamsparams, RequestOptionsrequestOptions = RequestOptions.none())
post /realtime/calls/{call_id}/accept
Accept an incoming SIP call and configure the realtime session that will handle it.
Parameters
-
CallAcceptParams params-
Optional<String> callId -
RealtimeSessionCreateRequest realtimeSessionCreateRequestRealtime session object configuration.
-
Example
package com.openai.example;

import com.openai.client.OpenAIClient;
import com.openai.client.okhttp.OpenAIOkHttpClient;
import com.openai.models.realtime.RealtimeSessionCreateRequest;
import com.openai.models.realtime.calls.CallAcceptParams;

public final class Main {
    private Main() {}

    public static void main(String[] args) {
        OpenAIClient client = OpenAIOkHttpClient.fromEnv();
        CallAcceptParams params = CallAcceptParams.builder()
            .callId("call_id")
            .realtimeSessionCreateRequest(RealtimeSessionCreateRequest.builder().build())
            .build();
        client.realtime().calls().accept(params);
    }
}
Hang up call
realtime().calls().hangup(CallHangupParamsparams = CallHangupParams.none(), RequestOptionsrequestOptions = RequestOptions.none())
post /realtime/calls/{call_id}/hangup
End an active Realtime API call, whether it was initiated over SIP or WebRTC.
Parameters
-
CallHangupParams paramsOptional<String> callId
Example
package com.openai.example;

import com.openai.client.OpenAIClient;
import com.openai.client.okhttp.OpenAIOkHttpClient;
import com.openai.models.realtime.calls.CallHangupParams;

public final class Main {
    private Main() {}

    public static void main(String[] args) {
        OpenAIClient client = OpenAIOkHttpClient.fromEnv();
        client.realtime().calls().hangup("call_id");
    }
}
Refer call
realtime().calls().refer(CallReferParamsparams, RequestOptionsrequestOptions = RequestOptions.none())
post /realtime/calls/{call_id}/refer
Transfer an active SIP call to a new destination using the SIP REFER verb.
Parameters
-
CallReferParams params-
Optional<String> callId -
String targetUriURI that should appear in the SIP Refer-To header. Supports values like
tel:+14155550123orsip:agent@example.com.
-
Example
package com.openai.example;

import com.openai.client.OpenAIClient;
import com.openai.client.okhttp.OpenAIOkHttpClient;
import com.openai.models.realtime.calls.CallReferParams;

public final class Main {
    private Main() {}

    public static void main(String[] args) {
        OpenAIClient client = OpenAIOkHttpClient.fromEnv();
        CallReferParams params = CallReferParams.builder()
            .callId("call_id")
            .targetUri("tel:+14155550123")
            .build();
        client.realtime().calls().refer(params);
    }
}
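Since `targetUri` ends up in the SIP Refer-To header, it can help to sanity-check the value before sending the request. The sketch below accepts only the two schemes shown in the parameter description (`tel:` and `sip:`). That restriction is an assumption for illustration, since the reference says "values like" and may accept other forms, and `ReferTargetCheck` is a hypothetical helper rather than an SDK class.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of the SDK.
final class ReferTargetCheck {
    private ReferTargetCheck() {}

    /** Returns true if the value parses as a URI with a tel or sip scheme. */
    static boolean looksLikeReferTarget(String targetUri) {
        if (targetUri == null) {
            return false;
        }
        try {
            String scheme = new URI(targetUri).getScheme();
            return "tel".equalsIgnoreCase(scheme) || "sip".equalsIgnoreCase(scheme);
        } catch (URISyntaxException e) {
            return false; // not parseable as a URI at all
        }
    }

    public static void main(String[] args) {
        System.out.println(looksLikeReferTarget("tel:+14155550123"));      // prints true
        System.out.println(looksLikeReferTarget("sip:agent@example.com")); // prints true
        System.out.println(looksLikeReferTarget("agent@example.com"));     // prints false
    }
}
```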
Reject call
realtime().calls().reject(CallRejectParamsparams = CallRejectParams.none(), RequestOptionsrequestOptions = RequestOptions.none())
post /realtime/calls/{call_id}/reject
Decline an incoming SIP call by returning a SIP status code to the caller.
Parameters
-
CallRejectParams params-
Optional<String> callId -
Optional<Long> statusCodeSIP response code to send back to the caller. Defaults to
603(Decline) when omitted.
-
Example
package com.openai.example;

import com.openai.client.OpenAIClient;
import com.openai.client.okhttp.OpenAIOkHttpClient;
import com.openai.models.realtime.calls.CallRejectParams;

public final class Main {
    private Main() {}

    public static void main(String[] args) {
        OpenAIClient client = OpenAIOkHttpClient.fromEnv();
        client.realtime().calls().reject("call_id");
    }
}
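The example above omits `statusCode`, so the documented default of `603` (Decline) applies. As a small illustration of that defaulting rule, the sketch below resolves an optional status code the way the parameter description states. `RejectStatus` is a hypothetical helper, the defaulting actually happens on the server, and `486` (the SIP "Busy Here" code) is used only as a sample value that this reference does not confirm as accepted.

```java
import java.util.Optional;

// Hypothetical helper, not part of the SDK.
final class RejectStatus {
    static final long DEFAULT_STATUS_CODE = 603L; // Decline, per the parameter description

    private RejectStatus() {}

    /** Returns the code that would be sent to the caller under the documented default. */
    static long effectiveStatusCode(Optional<Long> statusCode) {
        return statusCode.orElse(DEFAULT_STATUS_CODE);
    }

    public static void main(String[] args) {
        System.out.println(effectiveStatusCode(Optional.empty())); // prints 603
        System.out.println(effectiveStatusCode(Optional.of(486L))); // prints 486
    }
}
```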