Realtime

Domain Types

Audio Transcription

class AudioTranscription:
- Optional<Delay> delay
  
  Controls how long the model waits before emitting transcription text. Higher values can improve transcription accuracy at the cost of latency. Only supported with gpt-realtime-whisper in GA Realtime sessions.
  - MINIMAL("minimal")
  - LOW("low")
  - MEDIUM("medium")
  - HIGH("high")
  - XHIGH("xhigh")
- Optional<String> language
  
  The language of the input audio. Supplying the input language in ISO-639-1 (e.g. en) format will improve accuracy and latency.
- Optional<Model> model
  
  The model to use for transcription. Current options are whisper-1, gpt-4o-mini-transcribe, gpt-4o-mini-transcribe-2025-12-15, gpt-4o-transcribe, gpt-4o-transcribe-diarize, and gpt-realtime-whisper. Use gpt-4o-transcribe-diarize when you need diarization with speaker labels.
  - WHISPER_1("whisper-1")
  - GPT_4O_MINI_TRANSCRIBE("gpt-4o-mini-transcribe")
  - GPT_4O_MINI_TRANSCRIBE_2025_12_15("gpt-4o-mini-transcribe-2025-12-15")
  - GPT_4O_TRANSCRIBE("gpt-4o-transcribe")
  - GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize")
  - GPT_REALTIME_WHISPER("gpt-realtime-whisper")
- Optional<String> prompt
  
  An optional text to guide the model's style or continue a previous audio segment. For whisper-1, the prompt is a list of keywords. For gpt-4o-transcribe models (excluding gpt-4o-transcribe-diarize), the prompt is a free text string, for example "expect words related to technology". Prompt is not supported with gpt-realtime-whisper in GA Realtime sessions.
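
The model-specific constraints described above (delay only with gpt-realtime-whisper, no prompt with gpt-realtime-whisper) can be checked on the client before a session is configured. The following is a minimal sketch using only the Java standard library. `TranscriptionConfigCheck` and its `problems` method are hypothetical names and are not part of the SDK, and the language check only tests for the two-letter shape of an ISO-639-1 code. How the server reacts to an unsupported combination is not stated here, so treating these as client-side errors is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not an SDK class: checks the documented constraints
// on delay, prompt, and language for a transcription configuration.
final class TranscriptionConfigCheck {
    static final String REALTIME_WHISPER = "gpt-realtime-whisper";

    // Pass null for any optional field that is not set.
    static List<String> problems(String model, String delay, String language, String prompt) {
        List<String> out = new ArrayList<>();
        boolean realtimeWhisper = REALTIME_WHISPER.equals(model);
        if (delay != null && !realtimeWhisper) {
            out.add("delay is only supported with " + REALTIME_WHISPER);
        }
        if (prompt != null && realtimeWhisper) {
            out.add("prompt is not supported with " + REALTIME_WHISPER);
        }
        // Shape check only: two lowercase letters, e.g. "en".
        if (language != null && !language.matches("[a-z]{2}")) {
            out.add("language should be an ISO-639-1 code such as \"en\"");
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(problems("gpt-realtime-whisper", "low", "en", null));
        System.out.println(problems("whisper-1", "low", "english", null));
    }
}
```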

Conversation Created Event

class ConversationCreatedEvent:

Returned when a conversation is created. Emitted right after session creation.
- Conversation conversation
  
  The conversation resource.
  - Optional<String> id
    
    The unique ID of the conversation.
  - Optional<Object> object_
    
    The object type; must be realtime.conversation.
    - REALTIME_CONVERSATION("realtime.conversation")
- String eventId
  
  The unique ID of the server event.
- JsonValue; type "conversation.created" (constant)
  
  The event type; must be conversation.created.
  - CONVERSATION_CREATED("conversation.created")

Conversation Item

class ConversationItem: A class that can be one of several variants (union).

A single item within a Realtime conversation.
- class RealtimeConversationItemSystemMessage:
  
  A system message in a Realtime conversation can be used to provide additional context or instructions to the model. This is similar to, but distinct from, the instruction prompt provided at the start of a conversation, because system messages can be added at any point in the conversation. For major changes to the conversation's behavior, use instructions; for smaller updates (e.g. "the user is now asking about a different topic"), use system messages.
  - List<Content> content
    
    The content of the message.
    - Optional<String> text
      
      The text content.
    - Optional<Type> type
      
      The content type. Always input_text for system messages.
      - INPUT_TEXT("input_text")
  - JsonValue; role "system" (constant)
    
    The role of the message sender. Always system.
    - SYSTEM("system")
  - JsonValue; type "message" (constant)
    
    The type of the item. Always message.
    - MESSAGE("message")
  - Optional<String> id
    
    The unique ID of the item. This may be provided by the client or generated by the server.
  - Optional<Object> object_
    
    Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
    - REALTIME_ITEM("realtime.item")
  - Optional<Status> status
    
    The status of the item. Has no effect on the conversation.
    - COMPLETED("completed")
    - INCOMPLETE("incomplete")
    - IN_PROGRESS("in_progress")
- class RealtimeConversationItemUserMessage:
  
  A user message item in a Realtime conversation.
  - List<Content> content
    
    The content of the message.
    - Optional<String> audio
      
      Base64-encoded audio bytes (for input_audio). These are parsed as the format specified in the session's input audio type configuration, which defaults to PCM 16-bit 24kHz mono if not specified.
    - Optional<Detail> detail
      
      The detail level of the image (for input_image). auto will default to high.
      - AUTO("auto")
      - LOW("low")
      - HIGH("high")
    - Optional<String> imageUrl
      
      Base64-encoded image bytes (for input_image) as a data URI. For example data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA.... Supported formats are PNG and JPEG.
    - Optional<String> text
      
      The text content (for input_text).
    - Optional<String> transcript
      
      Transcript of the audio (for input_audio). This is not sent to the model, but will be attached to the message item for reference.
    - Optional<Type> type
      
      The content type (input_text, input_audio, or input_image).
      - INPUT_TEXT("input_text")
      - INPUT_AUDIO("input_audio")
      - INPUT_IMAGE("input_image")
  - JsonValue; role "user" (constant)
    
    The role of the message sender. Always user.
    - USER("user")
  - JsonValue; type "message" (constant)
    
    The type of the item. Always message.
    - MESSAGE("message")
  - Optional<String> id
    
    The unique ID of the item. This may be provided by the client or generated by the server.
  - Optional<Object> object_
    
    Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
    - REALTIME_ITEM("realtime.item")
  - Optional<Status> status
    
    The status of the item. Has no effect on the conversation.
    - COMPLETED("completed")
    - INCOMPLETE("incomplete")
    - IN_PROGRESS("in_progress")
- class RealtimeConversationItemAssistantMessage:
  
  An assistant message item in a Realtime conversation.
  - List<Content> content
    
    The content of the message.
    - Optional<String> audio
      
      Base64-encoded audio bytes. These are parsed as the format specified in the session's output audio type configuration, which defaults to PCM 16-bit 24kHz mono if not specified.
    - Optional<String> text
      
      The text content.
    - Optional<String> transcript
      
      The transcript of the audio content. Always present if the output type is audio.
    - Optional<Type> type
      
      The content type, output_text or output_audio depending on the session output_modalities configuration.
      - OUTPUT_TEXT("output_text")
      - OUTPUT_AUDIO("output_audio")
  - JsonValue; role "assistant" (constant)
    
    The role of the message sender. Always assistant.
    - ASSISTANT("assistant")
  - JsonValue; type "message" (constant)
    
    The type of the item. Always message.
    - MESSAGE("message")
  - Optional<String> id
    
    The unique ID of the item. This may be provided by the client or generated by the server.
  - Optional<Object> object_
    
    Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
    - REALTIME_ITEM("realtime.item")
  - Optional<Status> status
    
    The status of the item. Has no effect on the conversation.
    - COMPLETED("completed")
    - INCOMPLETE("incomplete")
    - IN_PROGRESS("in_progress")
- class RealtimeConversationItemFunctionCall:
  
  A function call item in a Realtime conversation.
  - String arguments
    
    The arguments of the function call. This is a JSON-encoded string representing the arguments passed to the function, for example {"arg1": "value1", "arg2": 42}.
  - String name
    
    The name of the function being called.
  - JsonValue; type "function_call" (constant)
    
    The type of the item. Always function_call.
    - FUNCTION_CALL("function_call")
  - Optional<String> id
    
    The unique ID of the item. This may be provided by the client or generated by the server.
  - Optional<String> callId
    
    The ID of the function call.
  - Optional<Object> object_
    
    Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
    - REALTIME_ITEM("realtime.item")
  - Optional<Status> status
    
    The status of the item. Has no effect on the conversation.
    - COMPLETED("completed")
    - INCOMPLETE("incomplete")
    - IN_PROGRESS("in_progress")
- class RealtimeConversationItemFunctionCallOutput:
  
  A function call output item in a Realtime conversation.
  - String callId
    
    The ID of the function call this output is for.
  - String output
    
    The output of the function call. This is free text and can contain any information or simply be empty.
  - JsonValue; type "function_call_output" (constant)
    
    The type of the item. Always function_call_output.
    - FUNCTION_CALL_OUTPUT("function_call_output")
  - Optional<String> id
    
    The unique ID of the item. This may be provided by the client or generated by the server.
  - Optional<Object> object_
    
    Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
    - REALTIME_ITEM("realtime.item")
  - Optional<Status> status
    
    The status of the item. Has no effect on the conversation.
    - COMPLETED("completed")
    - INCOMPLETE("incomplete")
    - IN_PROGRESS("in_progress")
- class RealtimeMcpApprovalResponse:
  
  A Realtime item responding to an MCP approval request.
  - String id
    
    The unique ID of the approval response.
  - String approvalRequestId
    
    The ID of the approval request being answered.
  - boolean approve
    
    Whether the request was approved.
  - JsonValue; type "mcp_approval_response" (constant)
    
    The type of the item. Always mcp_approval_response.
    - MCP_APPROVAL_RESPONSE("mcp_approval_response")
  - Optional<String> reason
    
    Optional reason for the decision.
- class RealtimeMcpListTools:
  
  A Realtime item listing tools available on an MCP server.
  - String serverLabel
    
    The label of the MCP server.
  - List<Tool> tools
    
    The tools available on the server.
    - JsonValue inputSchema
      
      The JSON schema describing the tool's input.
    - String name
      
      The name of the tool.
    - Optional<JsonValue> annotations
      
      Additional annotations about the tool.
    - Optional<String> description
      
      The description of the tool.
  - JsonValue; type "mcp_list_tools" (constant)
    
    The type of the item. Always mcp_list_tools.
    - MCP_LIST_TOOLS("mcp_list_tools")
  - Optional<String> id
    
    The unique ID of the list.
- class RealtimeMcpToolCall:
  
  A Realtime item representing an invocation of a tool on an MCP server.
  - String id
    
    The unique ID of the tool call.
  - String arguments
    
    A JSON string of the arguments passed to the tool.
  - String name
    
    The name of the tool that was run.
  - String serverLabel
    
    The label of the MCP server running the tool.
  - JsonValue; type "mcp_call" (constant)
    
    The type of the item. Always mcp_call.
    - MCP_CALL("mcp_call")
  - Optional<String> approvalRequestId
    
    The ID of an associated approval request, if any.
  - Optional<Error> error
    
    The error from the tool call, if any.
    - class RealtimeMcpProtocolError:
      - long code
      - String message
      - JsonValue; type "protocol_error" (constant)
        - PROTOCOL_ERROR("protocol_error")
    - class RealtimeMcpToolExecutionError:
      - String message
      - JsonValue; type "tool_execution_error" (constant)
        - TOOL_EXECUTION_ERROR("tool_execution_error")
    - class RealtimeMcphttpError:
      - long code
      - String message
      - JsonValue; type "http_error" (constant)
        - HTTP_ERROR("http_error")
  - Optional<String> output
    
    The output from the tool call.
- class RealtimeMcpApprovalRequest:
  
  A Realtime item requesting human approval of a tool invocation.
  - String id
    
    The unique ID of the approval request.
  - String arguments
    
    A JSON string of arguments for the tool.
  - String name
    
    The name of the tool to run.
  - String serverLabel
    
    The label of the MCP server making the request.
  - JsonValue; type "mcp_approval_request" (constant)
    
    The type of the item. Always mcp_approval_request.
    - MCP_APPROVAL_REQUEST("mcp_approval_request")
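
The audio and imageUrl content fields of a user message both carry Base64 data: raw audio bytes in the first case and a data URI in the second. The sketch below shows one way to produce both with `java.util.Base64`, and how the default format (PCM 16-bit, 24kHz, mono) translates into a duration. `ContentEncodingSketch` and its method names are hypothetical, and the duration calculation assumes the session uses the default audio format.

```java
import java.util.Base64;

// Hypothetical helper, not an SDK class.
final class ContentEncodingSketch {
    // Default format per the field description: PCM 16-bit, 24 kHz, mono.
    static final int SAMPLE_RATE_HZ = 24_000;
    static final int BYTES_PER_SAMPLE = 2;

    // Base64 text suitable for the audio field of an input_audio content part.
    static String encodeAudio(byte[] pcm16) {
        return Base64.getEncoder().encodeToString(pcm16);
    }

    // Data URI suitable for the imageUrl field, e.g. mimeType "image/png" or "image/jpeg".
    static String imageDataUri(String mimeType, byte[] imageBytes) {
        return "data:" + mimeType + ";base64," + Base64.getEncoder().encodeToString(imageBytes);
    }

    // Duration of Base64-encoded audio, assuming the default format above.
    static double durationSeconds(String base64Audio) {
        byte[] raw = Base64.getDecoder().decode(base64Audio);
        return raw.length / (double) (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE);
    }

    public static void main(String[] args) {
        byte[] oneSecondOfSilence = new byte[SAMPLE_RATE_HZ * BYTES_PER_SAMPLE];
        System.out.println(durationSeconds(encodeAudio(oneSecondOfSilence)));
        System.out.println(imageDataUri("image/png", new byte[] {(byte) 0x89, 'P', 'N', 'G'}));
    }
}
```

At the default format, one second of mono audio is 24,000 samples of 2 bytes each, so 48,000 raw bytes before Base64 encoding.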

Conversation Item Added

class ConversationItemAdded:

Sent by the server when an Item is added to the default Conversation. This can happen in several cases:
- When the client sends a conversation.item.create event.
- When the input audio buffer is committed. In this case the item will be a user message containing the audio from the buffer.
- When the model is generating a Response. In this case the conversation.item.added event will be sent when the model starts generating a specific Item, and thus it will not yet have any content (and status will be in_progress).
The event will include the full content of the Item, except when the model is generating a Response (in which case the Item has no content yet). Audio data is omitted in all cases and can be retrieved separately with a conversation.item.retrieve event if necessary.
- String eventId
  
  The unique ID of the server event.
- ConversationItem item
  
  A single item within a Realtime conversation.
  - class RealtimeConversationItemSystemMessage:
    
    A system message in a Realtime conversation can be used to provide additional context or instructions to the model. This is similar to, but distinct from, the instruction prompt provided at the start of a conversation, because system messages can be added at any point in the conversation. For major changes to the conversation's behavior, use instructions; for smaller updates (e.g. "the user is now asking about a different topic"), use system messages.
    - List<Content> content
      
      The content of the message.
      - Optional<String> text
        
        The text content.
      - Optional<Type> type
        
        The content type. Always input_text for system messages.
        - INPUT_TEXT("input_text")
    - JsonValue; role "system" (constant)
      
      The role of the message sender. Always system.
      - SYSTEM("system")
    - JsonValue; type "message" (constant)
      
      The type of the item. Always message.
      - MESSAGE("message")
    - Optional<String> id
      
      The unique ID of the item. This may be provided by the client or generated by the server.
    - Optional<Object> object_
      
      Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
      - REALTIME_ITEM("realtime.item")
    - Optional<Status> status
      
      The status of the item. Has no effect on the conversation.
      - COMPLETED("completed")
      - INCOMPLETE("incomplete")
      - IN_PROGRESS("in_progress")
  - class RealtimeConversationItemUserMessage:
    
    A user message item in a Realtime conversation.
    - List<Content> content
      
      The content of the message.
      - Optional<String> audio
        
        Base64-encoded audio bytes (for input_audio). These are parsed as the format specified in the session's input audio type configuration, which defaults to PCM 16-bit 24kHz mono if not specified.
      - Optional<Detail> detail
        
        The detail level of the image (for input_image). auto will default to high.
        - AUTO("auto")
        - LOW("low")
        - HIGH("high")
      - Optional<String> imageUrl
        
        Base64-encoded image bytes (for input_image) as a data URI. For example data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA.... Supported formats are PNG and JPEG.
      - Optional<String> text
        
        The text content (for input_text).
      - Optional<String> transcript
        
        Transcript of the audio (for input_audio). This is not sent to the model, but will be attached to the message item for reference.
      - Optional<Type> type
        
        The content type (input_text, input_audio, or input_image).
        - INPUT_TEXT("input_text")
        - INPUT_AUDIO("input_audio")
        - INPUT_IMAGE("input_image")
    - JsonValue; role "user" (constant)
      
      The role of the message sender. Always user.
      - USER("user")
    - JsonValue; type "message" (constant)
      
      The type of the item. Always message.
      - MESSAGE("message")
    - Optional<String> id
      
      The unique ID of the item. This may be provided by the client or generated by the server.
    - Optional<Object> object_
      
      Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
      - REALTIME_ITEM("realtime.item")
    - Optional<Status> status
      
      The status of the item. Has no effect on the conversation.
      - COMPLETED("completed")
      - INCOMPLETE("incomplete")
      - IN_PROGRESS("in_progress")
  - class RealtimeConversationItemAssistantMessage:
    
    An assistant message item in a Realtime conversation.
    - List<Content> content
      
      The content of the message.
      - Optional<String> audio
        
        Base64-encoded audio bytes. These are parsed as the format specified in the session's output audio type configuration, which defaults to PCM 16-bit 24kHz mono if not specified.
      - Optional<String> text
        
        The text content.
      - Optional<String> transcript
        
        The transcript of the audio content. Always present if the output type is audio.
      - Optional<Type> type
        
        The content type, output_text or output_audio depending on the session output_modalities configuration.
        - OUTPUT_TEXT("output_text")
        - OUTPUT_AUDIO("output_audio")
    - JsonValue; role "assistant" (constant)
      
      The role of the message sender. Always assistant.
      - ASSISTANT("assistant")
    - JsonValue; type "message" (constant)
      
      The type of the item. Always message.
      - MESSAGE("message")
    - Optional<String> id
      
      The unique ID of the item. This may be provided by the client or generated by the server.
    - Optional<Object> object_
      
      Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
      - REALTIME_ITEM("realtime.item")
    - Optional<Status> status
      
      The status of the item. Has no effect on the conversation.
      - COMPLETED("completed")
      - INCOMPLETE("incomplete")
      - IN_PROGRESS("in_progress")
  - class RealtimeConversationItemFunctionCall:
    
    A function call item in a Realtime conversation.
    - String arguments
      
      The arguments of the function call. This is a JSON-encoded string representing the arguments passed to the function, for example {"arg1": "value1", "arg2": 42}.
    - String name
      
      The name of the function being called.
    - JsonValue; type "function_call" (constant)
      
      The type of the item. Always function_call.
      - FUNCTION_CALL("function_call")
    - Optional<String> id
      
      The unique ID of the item. This may be provided by the client or generated by the server.
    - Optional<String> callId
      
      The ID of the function call.
    - Optional<Object> object_
      
      Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
      - REALTIME_ITEM("realtime.item")
    - Optional<Status> status
      
      The status of the item. Has no effect on the conversation.
      - COMPLETED("completed")
      - INCOMPLETE("incomplete")
      - IN_PROGRESS("in_progress")
  - class RealtimeConversationItemFunctionCallOutput:
    
    A function call output item in a Realtime conversation.
    - String callId
      
      The ID of the function call this output is for.
    - String output
      
      The output of the function call. This is free text and can contain any information or simply be empty.
    - JsonValue; type "function_call_output" (constant)
      
      The type of the item. Always function_call_output.
      - FUNCTION_CALL_OUTPUT("function_call_output")
    - Optional<String> id
      
      The unique ID of the item. This may be provided by the client or generated by the server.
    - Optional<Object> object_
      
      Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
      - REALTIME_ITEM("realtime.item")
    - Optional<Status> status
      
      The status of the item. Has no effect on the conversation.
      - COMPLETED("completed")
      - INCOMPLETE("incomplete")
      - IN_PROGRESS("in_progress")
  - class RealtimeMcpApprovalResponse:
    
    A Realtime item responding to an MCP approval request.
    - String id
      
      The unique ID of the approval response.
    - String approvalRequestId
      
      The ID of the approval request being answered.
    - boolean approve
      
      Whether the request was approved.
    - JsonValue; type "mcp_approval_response" (constant)
      
      The type of the item. Always mcp_approval_response.
      - MCP_APPROVAL_RESPONSE("mcp_approval_response")
    - Optional<String> reason
      
      Optional reason for the decision.
  - class RealtimeMcpListTools:
    
    A Realtime item listing tools available on an MCP server.
    - String serverLabel
      
      The label of the MCP server.
    - List<Tool> tools
      
      The tools available on the server.
      - JsonValue inputSchema
        
        The JSON schema describing the tool's input.
      - String name
        
        The name of the tool.
      - Optional<JsonValue> annotations
        
        Additional annotations about the tool.
      - Optional<String> description
        
        The description of the tool.
    - JsonValue; type "mcp_list_tools" (constant)
      
      The type of the item. Always mcp_list_tools.
      - MCP_LIST_TOOLS("mcp_list_tools")
    - Optional<String> id
      
      The unique ID of the list.
  - class RealtimeMcpToolCall:
    
    A Realtime item representing an invocation of a tool on an MCP server.
    - String id
      
      The unique ID of the tool call.
    - String arguments
      
      A JSON string of the arguments passed to the tool.
    - String name
      
      The name of the tool that was run.
    - String serverLabel
      
      The label of the MCP server running the tool.
    - JsonValue; type "mcp_call" (constant)
      
      The type of the item. Always mcp_call.
      - MCP_CALL("mcp_call")
    - Optional<String> approvalRequestId
      
      The ID of an associated approval request, if any.
    - Optional<Error> error
      
      The error from the tool call, if any.
      - class RealtimeMcpProtocolError:
        - long code
        - String message
        - JsonValue; type "protocol_error" (constant)
          - PROTOCOL_ERROR("protocol_error")
      - class RealtimeMcpToolExecutionError:
        - String message
        - JsonValue; type "tool_execution_error" (constant)
          - TOOL_EXECUTION_ERROR("tool_execution_error")
      - class RealtimeMcphttpError:
        - long code
        - String message
        - JsonValue; type "http_error" (constant)
          - HTTP_ERROR("http_error")
    - Optional<String> output
      
      The output from the tool call.
  - class RealtimeMcpApprovalRequest:
    
    A Realtime item requesting human approval of a tool invocation.
    - String id
      
      The unique ID of the approval request.
    - String arguments
      
      A JSON string of arguments for the tool.
    - String name
      
      The name of the tool to run.
    - String serverLabel
      
      The label of the MCP server making the request.
    - JsonValue; type "mcp_approval_request" (constant)
      
      The type of the item. Always mcp_approval_request.
      - MCP_APPROVAL_REQUEST("mcp_approval_request")
- JsonValue; type "conversation.item.added" (constant)
  
  The event type; must be conversation.item.added.
  - CONVERSATION_ITEM_ADDED("conversation.item.added")
- Optional<String> previousItemId
  
  The ID of the item that precedes this one, if any. This is used to maintain ordering when items are inserted.
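
A client that mirrors the conversation locally can use previousItemId to place each added item in the right position. The sketch below keeps an ordered list of item IDs. `ItemOrderSketch` and its methods are hypothetical names. Two behaviors are assumptions rather than documented rules: an absent previousItemId is treated as "no predecessor", so the item goes first, and an unknown previousItemId falls back to appending at the end.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical client-side mirror of conversation ordering, not an SDK class.
final class ItemOrderSketch {
    private final List<String> itemIds = new ArrayList<>();

    // Call once per conversation.item.added event; previousItemId may be null.
    void onItemAdded(String itemId, String previousItemId) {
        if (previousItemId == null) {
            // Assumption: no predecessor means the item is first.
            itemIds.add(0, itemId);
            return;
        }
        int anchor = itemIds.indexOf(previousItemId);
        if (anchor < 0) {
            // Assumption: unknown predecessor, so append at the end.
            itemIds.add(itemId);
        } else {
            itemIds.add(anchor + 1, itemId);
        }
    }

    List<String> orderedIds() {
        return Collections.unmodifiableList(itemIds);
    }

    public static void main(String[] args) {
        ItemOrderSketch order = new ItemOrderSketch();
        order.onItemAdded("item_a", null);
        order.onItemAdded("item_b", "item_a");
        order.onItemAdded("item_c", "item_a"); // inserted between item_a and item_b
        System.out.println(order.orderedIds()); // prints [item_a, item_c, item_b]
    }
}
```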

Conversation Item Create Event

class ConversationItemCreateEvent:

Add a new Item to the Conversation's context, including messages, function calls, and function call responses. This event can be used both to populate a "history" of the conversation and to add new items mid-stream, but has the current limitation that it cannot populate assistant audio messages.

If successful, the server will respond with a conversation.item.created event; otherwise an error event will be sent.
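
To make the shape of this event concrete, the sketch below builds the JSON for a conversation.item.create event that carries a single user text message. It uses only the standard library, with a small hand-written string quoter; real code would normally rely on the SDK's own builders or a JSON library. `ItemCreatePayloadSketch` and its methods are hypothetical names, and the JSON keys (type, item, role, content, text) are assumed to match the field names listed in this reference.

```java
// Hypothetical payload builder, not an SDK class.
final class ItemCreatePayloadSketch {
    // Minimal JSON string quoting for illustration only.
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"': sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append('"').toString();
    }

    // Builds a conversation.item.create event with one input_text content part.
    static String userTextMessage(String text) {
        return "{\"type\":\"conversation.item.create\","
            + "\"item\":{\"type\":\"message\",\"role\":\"user\","
            + "\"content\":[{\"type\":\"input_text\",\"text\":" + quote(text) + "}]}}";
    }

    public static void main(String[] args) {
        System.out.println(userTextMessage("What is the weather like?"));
    }
}
```
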
- ConversationItem item
  
  A single item within a Realtime conversation.
  - class RealtimeConversationItemSystemMessage:
    
    A system message in a Realtime conversation can be used to provide additional context or instructions to the model. This is similar to, but distinct from, the instruction prompt provided at the start of a conversation, because system messages can be added at any point in the conversation. For major changes to the conversation's behavior, use instructions; for smaller updates (e.g. "the user is now asking about a different topic"), use system messages.
    - List<Content> content
      
      The content of the message.
      - Optional<String> text
        
        The text content.
      - Optional<Type> type
        
        The content type. Always input_text for system messages.
        - INPUT_TEXT("input_text")
    - JsonValue; role "system" (constant)
      
      The role of the message sender. Always system.
      - SYSTEM("system")
    - JsonValue; type "message" (constant)
      
      The type of the item. Always message.
      - MESSAGE("message")
    - Optional<String> id
      
      The unique ID of the item. This may be provided by the client or generated by the server.
    - Optional<Object> object_
      
      Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
      - REALTIME_ITEM("realtime.item")
    - Optional<Status> status
      
      The status of the item. Has no effect on the conversation.
      - COMPLETED("completed")
      - INCOMPLETE("incomplete")
      - IN_PROGRESS("in_progress")
  - class RealtimeConversationItemUserMessage:
    
    A user message item in a Realtime conversation.
    - List<Content> content
      
      The content of the message.
      - Optional<String> audio
        
        Base64-encoded audio bytes (for input_audio). These are parsed as the format specified in the session's input audio type configuration, which defaults to PCM 16-bit 24kHz mono if not specified.
      - Optional<Detail> detail
        
        The detail level of the image (for input_image). auto will default to high.
        - AUTO("auto")
        - LOW("low")
        - HIGH("high")
      - Optional<String> imageUrl
        
        Base64-encoded image bytes (for input_image) as a data URI. For example data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA.... Supported formats are PNG and JPEG.
      - Optional<String> text
        
        The text content (for input_text).
      - Optional<String> transcript
        
        Transcript of the audio (for input_audio). This is not sent to the model, but will be attached to the message item for reference.
      - Optional<Type> type
        
        The content type (input_text, input_audio, or input_image).
        - INPUT_TEXT("input_text")
        - INPUT_AUDIO("input_audio")
        - INPUT_IMAGE("input_image")
    - JsonValue; role "user" (constant)
      
      The role of the message sender. Always user.
      - USER("user")
    - JsonValue; type "message" (constant)
      
      The type of the item. Always message.
      - MESSAGE("message")
    - Optional<String> id
      
      The unique ID of the item. This may be provided by the client or generated by the server.
    - Optional<Object> object_
      
      Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
      - REALTIME_ITEM("realtime.item")
    - Optional<Status> status
      
      The status of the item. Has no effect on the conversation.
      - COMPLETED("completed")
      - INCOMPLETE("incomplete")
      - IN_PROGRESS("in_progress")
  - class RealtimeConversationItemAssistantMessage:
    
    An assistant message item in a Realtime conversation.
    - List<Content> content
      
      The content of the message.
      - Optional<String> audio
        
        Base64-encoded audio bytes. These are parsed as the format specified in the session's output audio type configuration, which defaults to PCM 16-bit 24kHz mono if not specified.
      - Optional<String> text
        
        The text content.
      - Optional<String> transcript
        
        The transcript of the audio content. Always present if the output type is audio.
      - Optional<Type> type
        
        The content type, output_text or output_audio depending on the session output_modalities configuration.
        - OUTPUT_TEXT("output_text")
        - OUTPUT_AUDIO("output_audio")
    - JsonValue; role "assistant" (constant)
      
      The role of the message sender. Always assistant.
      - ASSISTANT("assistant")
    - JsonValue; type "message" (constant)
      
      The type of the item. Always message.
      - MESSAGE("message")
    - Optional<String> id
      
      The unique ID of the item. This may be provided by the client or generated by the server.
    - Optional<Object> object_
      
      Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
      - REALTIME_ITEM("realtime.item")
    - Optional<Status> status
      
      The status of the item. Has no effect on the conversation.
      - COMPLETED("completed")
      - INCOMPLETE("incomplete")
      - IN_PROGRESS("in_progress")
  - class RealtimeConversationItemFunctionCall:
    
    A function call item in a Realtime conversation.
    - String arguments
      
      The arguments of the function call. This is a JSON-encoded string representing the arguments passed to the function, for example {"arg1": "value1", "arg2": 42}.
    - String name
      
      The name of the function being called.
    - JsonValue; type "function_call" (constant)
      
      The type of the item. Always function_call.
      - FUNCTION_CALL("function_call")
    - Optional<String> id
      
      The unique ID of the item. This may be provided by the client or generated by the server.
    - Optional<String> callId
      
      The ID of the function call.
    - Optional<Object> object_
      
      Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
      - REALTIME_ITEM("realtime.item")
    - Optional<Status> status
      
      The status of the item. Has no effect on the conversation.
      - COMPLETED("completed")
      - INCOMPLETE("incomplete")
      - IN_PROGRESS("in_progress")
  - class RealtimeConversationItemFunctionCallOutput:
    
    A function call output item in a Realtime conversation.
    - String callId
      
      The ID of the function call this output is for.
    - String output
      
      The output of the function call. This is free text and can contain any information or simply be empty.
    - JsonValue; type "function_call_output" (constant)
      
      The type of the item. Always function_call_output.
      - FUNCTION_CALL_OUTPUT("function_call_output")
    - Optional<String> id
      
      The unique ID of the item. This may be provided by the client or generated by the server.
    - Optional<Object> object_
      
      Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
      - REALTIME_ITEM("realtime.item")
    - Optional<Status> status
      
      The status of the item. Has no effect on the conversation.
      - COMPLETED("completed")
      - INCOMPLETE("incomplete")
      - IN_PROGRESS("in_progress")
  - class RealtimeMcpApprovalResponse:
    
    A Realtime item responding to an MCP approval request.
    - String id
      
      The unique ID of the approval response.
    - String approvalRequestId
      
      The ID of the approval request being answered.
    - boolean approve
      
      Whether the request was approved.
    - JsonValue; type "mcp_approval_response"constant
      
      The type of the item. Always mcp_approval_response.
      - MCP_APPROVAL_RESPONSE("mcp_approval_response")
    - Optional<String> reason
      
      Optional reason for the decision.
  - class RealtimeMcpListTools:
    
    A Realtime item listing tools available on an MCP server.
    - String serverLabel
      
      The label of the MCP server.
    - List<Tool> tools
      
      The tools available on the server.
      - JsonValue inputSchema
        
        The JSON schema describing the tool's input.
      - String name
        
        The name of the tool.
      - Optional<JsonValue> annotations
        
        Additional annotations about the tool.
      - Optional<String> description
        
        The description of the tool.
    - JsonValue; type "mcp_list_tools"constant
      
      The type of the item. Always mcp_list_tools.
      - MCP_LIST_TOOLS("mcp_list_tools")
    - Optional<String> id
      
      The unique ID of the list.
  - class RealtimeMcpToolCall:
    
    A Realtime item representing an invocation of a tool on an MCP server.
    - String id
      
      The unique ID of the tool call.
    - String arguments
      
      A JSON string of the arguments passed to the tool.
    - String name
      
      The name of the tool that was run.
    - String serverLabel
      
      The label of the MCP server running the tool.
    - JsonValue; type "mcp_call"constant
      
      The type of the item. Always mcp_call.
      - MCP_CALL("mcp_call")
    - Optional<String> approvalRequestId
      
      The ID of an associated approval request, if any.
    - Optional<Error> error
      
      The error from the tool call, if any.
      - class RealtimeMcpProtocolError:
        - long code
        - String message
        - JsonValue; type "protocol_error"constant
          - PROTOCOL_ERROR("protocol_error")
      - class RealtimeMcpToolExecutionError:
        - String message
        - JsonValue; type "tool_execution_error"constant
          - TOOL_EXECUTION_ERROR("tool_execution_error")
      - class RealtimeMcphttpError:
        - long code
        - String message
        - JsonValue; type "http_error"constant
          - HTTP_ERROR("http_error")
    - Optional<String> output
      
      The output from the tool call.
  - class RealtimeMcpApprovalRequest:
    
    A Realtime item requesting human approval of a tool invocation.
    - String id
      
      The unique ID of the approval request.
    - String arguments
      
      A JSON string of arguments for the tool.
    - String name
      
      The name of the tool to run.
    - String serverLabel
      
      The label of the MCP server making the request.
    - JsonValue; type "mcp_approval_request"constant
      
      The type of the item. Always mcp_approval_request.
      - MCP_APPROVAL_REQUEST("mcp_approval_request")
- JsonValue; type "conversation.item.create"constant
  
  The event type, must be conversation.item.create.
  - CONVERSATION_ITEM_CREATE("conversation.item.create")
- Optional<String> eventId
  
  Optional client-generated ID used to identify this event.
- Optional<String> previousItemId
  
  The ID of the preceding item after which the new item will be inserted. If not set, the new item will be appended to the end of the conversation.
  
  If set to root, the new item will be added to the beginning of the conversation.
  
  If set to an existing ID, it allows an item to be inserted mid-conversation. If the ID cannot be found, an error will be returned and the item will not be added.
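
The three previousItemId cases described above can be sketched as a small client-side model. This is only an illustration of the placement rules, not SDK code: the class ItemPlacement and its insert method are hypothetical names, and the exception stands in for the error the server would return.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of where conversation.item.create places a new item. */
final class ItemPlacement {

    /**
     * Returns a copy of itemIds with newItemId placed according to previousItemId:
     * null appends, "root" prepends, an existing ID inserts directly after it.
     * An unknown ID is rejected (the server would return an error and add nothing).
     */
    static List<String> insert(List<String> itemIds, String newItemId, String previousItemId) {
        List<String> result = new ArrayList<>(itemIds);
        if (previousItemId == null) {
            result.add(newItemId);
        } else if (previousItemId.equals("root")) {
            result.add(0, newItemId);
        } else {
            int index = result.indexOf(previousItemId);
            if (index < 0) {
                throw new IllegalArgumentException("unknown previousItemId: " + previousItemId);
            }
            result.add(index + 1, newItemId);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> ids = new ArrayList<>();
        ids.add("item_a");
        ids.add("item_b");
        // Inserts item_c between item_a and item_b.
        System.out.println(insert(ids, "item_c", "item_a"));
    }
}
```

The method copies the list instead of mutating it, so a rejected insert leaves the caller's view of the conversation unchanged, mirroring the "item will not be added" behavior.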

Conversation Item Created Event

class ConversationItemCreatedEvent:

Returned when a conversation item is created. There are several scenarios that produce this event:
- The server is generating a Response, which, if successful, will produce one or two Items of type message (role assistant) or type function_call.
- The input audio buffer has been committed, either by the client or the server (in server_vad mode). The server will take the content of the input audio buffer and add it to a new user message Item.
- The client has sent a conversation.item.create event to add a new Item to the Conversation.
- String eventId
  
  The unique ID of the server event.
- ConversationItem item
  
  A single item within a Realtime conversation.
  - class RealtimeConversationItemSystemMessage:
    
    A system message in a Realtime conversation can be used to provide additional context or instructions to the model. This is similar to, but distinct from, the instruction prompt provided at the start of a conversation, as system messages can be added at any point in the conversation. For major changes to the conversation's behavior, use instructions, but for smaller updates (e.g. "the user is now asking about a different topic"), use system messages.
    - List<Content> content
      
      The content of the message.
      - Optional<String> text
        
        The text content.
      - Optional<Type> type
        
        The content type. Always input_text for system messages.
        
        INPUT_TEXT("input_text")
    - JsonValue; role "system"constant
      
      The role of the message sender. Always system.
      - SYSTEM("system")
    - JsonValue; type "message"constant
      
      The type of the item. Always message.
      - MESSAGE("message")
    - Optional<String> id
      
      The unique ID of the item. This may be provided by the client or generated by the server.
    - Optional<Object> object_
      
      Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
      - REALTIME_ITEM("realtime.item")
    - Optional<Status> status
      
      The status of the item. Has no effect on the conversation.
      - COMPLETED("completed")
      - INCOMPLETE("incomplete")
      - IN_PROGRESS("in_progress")
  - class RealtimeConversationItemUserMessage:
    
    A user message item in a Realtime conversation.
    - List<Content> content
      
      The content of the message.
      - Optional<String> audio
        
        Base64-encoded audio bytes (for input_audio). These are parsed in the format specified in the session input audio type configuration, which defaults to PCM 16-bit 24kHz mono if not specified.
      - Optional<Detail> detail
        
        The detail level of the image (for input_image). auto will default to high.
        
        AUTO("auto")
        
        LOW("low")
        
        HIGH("high")
      - Optional<String> imageUrl
        
        Base64-encoded image bytes (for input_image) as a data URI. For example data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA.... Supported formats are PNG and JPEG.
      - Optional<String> text
        
        The text content (for input_text).
      - Optional<String> transcript
        
        Transcript of the audio (for input_audio). This is not sent to the model, but will be attached to the message item for reference.
      - Optional<Type> type
        
        The content type (input_text, input_audio, or input_image).
        
        INPUT_TEXT("input_text")
        
        INPUT_AUDIO("input_audio")
        
        INPUT_IMAGE("input_image")
    - JsonValue; role "user"constant
      
      The role of the message sender. Always user.
      - USER("user")
    - JsonValue; type "message"constant
      
      The type of the item. Always message.
      - MESSAGE("message")
    - Optional<String> id
      
      The unique ID of the item. This may be provided by the client or generated by the server.
    - Optional<Object> object_
      
      Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
      - REALTIME_ITEM("realtime.item")
    - Optional<Status> status
      
      The status of the item. Has no effect on the conversation.
      - COMPLETED("completed")
      - INCOMPLETE("incomplete")
      - IN_PROGRESS("in_progress")
  - class RealtimeConversationItemAssistantMessage:
    
    An assistant message item in a Realtime conversation.
    - List<Content> content
      
      The content of the message.
      - Optional<String> audio
        
        Base64-encoded audio bytes. These are parsed in the format specified in the session output audio type configuration, which defaults to PCM 16-bit 24kHz mono if not specified.
      - Optional<String> text
        
        The text content.
      - Optional<String> transcript
        
        The transcript of the audio content. This is always present if the output type is audio.
      - Optional<Type> type
        
        The content type, output_text or output_audio depending on the session output_modalities configuration.
        
        OUTPUT_TEXT("output_text")
        
        OUTPUT_AUDIO("output_audio")
    - JsonValue; role "assistant"constant
      
      The role of the message sender. Always assistant.
      - ASSISTANT("assistant")
    - JsonValue; type "message"constant
      
      The type of the item. Always message.
      - MESSAGE("message")
    - Optional<String> id
      
      The unique ID of the item. This may be provided by the client or generated by the server.
    - Optional<Object> object_
      
      Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
      - REALTIME_ITEM("realtime.item")
    - Optional<Status> status
      
      The status of the item. Has no effect on the conversation.
      - COMPLETED("completed")
      - INCOMPLETE("incomplete")
      - IN_PROGRESS("in_progress")
  - class RealtimeConversationItemFunctionCall:
    
    A function call item in a Realtime conversation.
    - String arguments
      
      The arguments of the function call. This is a JSON-encoded string representing the arguments passed to the function, for example {"arg1": "value1", "arg2": 42}.
    - String name
      
      The name of the function being called.
    - JsonValue; type "function_call"constant
      
      The type of the item. Always function_call.
      - FUNCTION_CALL("function_call")
    - Optional<String> id
      
      The unique ID of the item. This may be provided by the client or generated by the server.
    - Optional<String> callId
      
      The ID of the function call.
    - Optional<Object> object_
      
      Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
      - REALTIME_ITEM("realtime.item")
    - Optional<Status> status
      
      The status of the item. Has no effect on the conversation.
      - COMPLETED("completed")
      - INCOMPLETE("incomplete")
      - IN_PROGRESS("in_progress")
  - class RealtimeConversationItemFunctionCallOutput:
    
    A function call output item in a Realtime conversation.
    - String callId
      
      The ID of the function call this output is for.
    - String output
      
      The output of the function call. This is free text and can contain any information, or simply be empty.
    - JsonValue; type "function_call_output"constant
      
      The type of the item. Always function_call_output.
      - FUNCTION_CALL_OUTPUT("function_call_output")
    - Optional<String> id
      
      The unique ID of the item. This may be provided by the client or generated by the server.
    - Optional<Object> object_
      
      Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
      - REALTIME_ITEM("realtime.item")
    - Optional<Status> status
      
      The status of the item. Has no effect on the conversation.
      - COMPLETED("completed")
      - INCOMPLETE("incomplete")
      - IN_PROGRESS("in_progress")
  - class RealtimeMcpApprovalResponse:
    
    A Realtime item responding to an MCP approval request.
    - String id
      
      The unique ID of the approval response.
    - String approvalRequestId
      
      The ID of the approval request being answered.
    - boolean approve
      
      Whether the request was approved.
    - JsonValue; type "mcp_approval_response"constant
      
      The type of the item. Always mcp_approval_response.
      - MCP_APPROVAL_RESPONSE("mcp_approval_response")
    - Optional<String> reason
      
      Optional reason for the decision.
  - class RealtimeMcpListTools:
    
    A Realtime item listing tools available on an MCP server.
    - String serverLabel
      
      The label of the MCP server.
    - List<Tool> tools
      
      The tools available on the server.
      - JsonValue inputSchema
        
        The JSON schema describing the tool's input.
      - String name
        
        The name of the tool.
      - Optional<JsonValue> annotations
        
        Additional annotations about the tool.
      - Optional<String> description
        
        The description of the tool.
    - JsonValue; type "mcp_list_tools"constant
      
      The type of the item. Always mcp_list_tools.
      - MCP_LIST_TOOLS("mcp_list_tools")
    - Optional<String> id
      
      The unique ID of the list.
  - class RealtimeMcpToolCall:
    
    A Realtime item representing an invocation of a tool on an MCP server.
    - String id
      
      The unique ID of the tool call.
    - String arguments
      
      A JSON string of the arguments passed to the tool.
    - String name
      
      The name of the tool that was run.
    - String serverLabel
      
      The label of the MCP server running the tool.
    - JsonValue; type "mcp_call"constant
      
      The type of the item. Always mcp_call.
      - MCP_CALL("mcp_call")
    - Optional<String> approvalRequestId
      
      The ID of an associated approval request, if any.
    - Optional<Error> error
      
      The error from the tool call, if any.
      - class RealtimeMcpProtocolError:
        - long code
        - String message
        - JsonValue; type "protocol_error"constant
          - PROTOCOL_ERROR("protocol_error")
      - class RealtimeMcpToolExecutionError:
        - String message
        - JsonValue; type "tool_execution_error"constant
          - TOOL_EXECUTION_ERROR("tool_execution_error")
      - class RealtimeMcphttpError:
        - long code
        - String message
        - JsonValue; type "http_error"constant
          - HTTP_ERROR("http_error")
    - Optional<String> output
      
      The output from the tool call.
  - class RealtimeMcpApprovalRequest:
    
    A Realtime item requesting human approval of a tool invocation.
    - String id
      
      The unique ID of the approval request.
    - String arguments
      
      A JSON string of arguments for the tool.
    - String name
      
      The name of the tool to run.
    - String serverLabel
      
      The label of the MCP server making the request.
    - JsonValue; type "mcp_approval_request"constant
      
      The type of the item. Always mcp_approval_request.
      - MCP_APPROVAL_REQUEST("mcp_approval_request")
- JsonValue; type "conversation.item.created"constant
  
  The event type, must be conversation.item.created.
  - CONVERSATION_ITEM_CREATED("conversation.item.created")
- Optional<String> previousItemId
  
  The ID of the preceding item in the Conversation context, which allows the client to understand the order of the conversation. Can be null if the item has no predecessor.
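
The user message content parts listed above carry audio as Base64-encoded bytes and images as Base64 data URIs. As a rough sketch using only the JDK's java.util.Base64, the helpers below build a PNG data URI and estimate the duration of audio in the default PCM 16-bit 24kHz mono format (2 bytes per sample, 24,000 samples per second). The class and method names are hypothetical and not part of the SDK.

```java
import java.util.Base64;

/** Hypothetical helpers for the Base64 content fields of a user message. */
final class ContentEncoding {
    static final int SAMPLE_RATE_HZ = 24_000; // default: 24kHz
    static final int BYTES_PER_SAMPLE = 2;    // default: 16-bit mono

    /** Wraps raw PNG bytes in a data URI of the form expected by imageUrl. */
    static String pngDataUri(byte[] pngBytes) {
        return "data:image/png;base64," + Base64.getEncoder().encodeToString(pngBytes);
    }

    /** Duration in seconds of Base64-encoded PCM 16-bit 24kHz mono audio. */
    static double pcm16DurationSeconds(String base64Audio) {
        byte[] pcm = Base64.getDecoder().decode(base64Audio);
        return (double) pcm.length / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE);
    }

    public static void main(String[] args) {
        // The 8-byte PNG file signature, used here instead of a full image.
        byte[] pngSignature = {(byte) 0x89, 'P', 'N', 'G', '\r', '\n', 0x1A, '\n'};
        System.out.println(pngDataUri(pngSignature));

        // One second of silence: 24,000 samples * 2 bytes each.
        byte[] silence = new byte[SAMPLE_RATE_HZ * BYTES_PER_SAMPLE];
        System.out.println(pcm16DurationSeconds(Base64.getEncoder().encodeToString(silence)));
    }
}
```

If the session configures a different audio format, the duration arithmetic no longer applies and the constants would need to change accordingly.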

Conversation Item Delete Event

class ConversationItemDeleteEvent:

Send this event when you want to remove any item from the conversation history. The server will respond with a conversation.item.deleted event, unless the item does not exist in the conversation history, in which case the server will respond with an error.
- String itemId
  
  The ID of the item to delete.
- JsonValue; type "conversation.item.delete"constant
  
  The event type, must be conversation.item.delete.
  - CONVERSATION_ITEM_DELETE("conversation.item.delete")
- Optional<String> eventId
  
  Optional client-generated ID used to identify this event.
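
For illustration, the JSON body of this client event might be assembled as follows. This sketch assumes the wire format uses snake_case keys (item_id, event_id) for the fields shown above as itemId and eventId; the class DeleteEventJson is a hypothetical helper, and real code would normally rely on the SDK or a JSON library rather than string building.

```java
/** Hypothetical builder for a conversation.item.delete payload. */
final class DeleteEventJson {

    /** Builds the event JSON; eventId may be null because it is optional. */
    static String build(String itemId, String eventId) {
        StringBuilder json = new StringBuilder("{\"type\":\"conversation.item.delete\"");
        json.append(",\"item_id\":\"").append(escape(itemId)).append('"');
        if (eventId != null) {
            json.append(",\"event_id\":\"").append(escape(eventId)).append('"');
        }
        return json.append('}').toString();
    }

    /** Minimal escaping of backslashes and quotes only; not a full JSON encoder. */
    private static String escape(String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    public static void main(String[] args) {
        System.out.println(build("item_123", null));
    }
}
```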

Conversation Item Deleted Event

class ConversationItemDeletedEvent:

Returned when an item in the conversation is deleted by the client with a conversation.item.delete event. This event is used to synchronize the server's understanding of the conversation history with the client's view.
- String eventId
  
  The unique ID of the server event.
- String itemId
  
  The ID of the item that was deleted.
- JsonValue; type "conversation.item.deleted"constant
  
  The event type, must be conversation.item.deleted.
  - CONVERSATION_ITEM_DELETED("conversation.item.deleted")

Conversation Item Done

class ConversationItemDone:

Returned when a conversation item is finalized.

The event will include the full content of the Item except for audio data, which can be retrieved separately with a conversation.item.retrieve event if needed.
- String eventId
  
  The unique ID of the server event.
- ConversationItem item
  
  A single item within a Realtime conversation.
  - class RealtimeConversationItemSystemMessage:
    
    A system message in a Realtime conversation can be used to provide additional context or instructions to the model. This is similar to, but distinct from, the instruction prompt provided at the start of a conversation, as system messages can be added at any point in the conversation. For major changes to the conversation's behavior, use instructions, but for smaller updates (e.g. "the user is now asking about a different topic"), use system messages.
    - List<Content> content
      
      The content of the message.
      - Optional<String> text
        
        The text content.
      - Optional<Type> type
        
        The content type. Always input_text for system messages.
        
        INPUT_TEXT("input_text")
    - JsonValue; role "system"constant
      
      The role of the message sender. Always system.
      - SYSTEM("system")
    - JsonValue; type "message"constant
      
      The type of the item. Always message.
      - MESSAGE("message")
    - Optional<String> id
      
      The unique ID of the item. This may be provided by the client or generated by the server.
    - Optional<Object> object_
      
      Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
      - REALTIME_ITEM("realtime.item")
    - Optional<Status> status
      
      The status of the item. Has no effect on the conversation.
      - COMPLETED("completed")
      - INCOMPLETE("incomplete")
      - IN_PROGRESS("in_progress")
  - class RealtimeConversationItemUserMessage:
    
    A user message item in a Realtime conversation.
    - List<Content> content
      
      The content of the message.
      - Optional<String> audio
        
        Base64-encoded audio bytes (for input_audio). These are parsed in the format specified in the session input audio type configuration, which defaults to PCM 16-bit 24kHz mono if not specified.
      - Optional<Detail> detail
        
        The detail level of the image (for input_image). auto will default to high.
        
        AUTO("auto")
        
        LOW("low")
        
        HIGH("high")
      - Optional<String> imageUrl
        
        Base64-encoded image bytes (for input_image) as a data URI. For example data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA.... Supported formats are PNG and JPEG.
      - Optional<String> text
        
        The text content (for input_text).
      - Optional<String> transcript
        
        Transcript of the audio (for input_audio). This is not sent to the model, but will be attached to the message item for reference.
      - Optional<Type> type
        
        The content type (input_text, input_audio, or input_image).
        
        INPUT_TEXT("input_text")
        
        INPUT_AUDIO("input_audio")
        
        INPUT_IMAGE("input_image")
    - JsonValue; role "user"constant
      
      The role of the message sender. Always user.
      - USER("user")
    - JsonValue; type "message"constant
      
      The type of the item. Always message.
      - MESSAGE("message")
    - Optional<String> id
      
      The unique ID of the item. This may be provided by the client or generated by the server.
    - Optional<Object> object_
      
      Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
      - REALTIME_ITEM("realtime.item")
    - Optional<Status> status
      
      The status of the item. Has no effect on the conversation.
      - COMPLETED("completed")
      - INCOMPLETE("incomplete")
      - IN_PROGRESS("in_progress")
  - class RealtimeConversationItemAssistantMessage:
    
    An assistant message item in a Realtime conversation.
    - List<Content> content
      
      The content of the message.
      - Optional<String> audio
        
        Base64-encoded audio bytes. These are parsed in the format specified in the session output audio type configuration, which defaults to PCM 16-bit 24kHz mono if not specified.
      - Optional<String> text
        
        The text content.
      - Optional<String> transcript
        
        The transcript of the audio content. This is always present if the output type is audio.
      - Optional<Type> type
        
        The content type, output_text or output_audio depending on the session output_modalities configuration.
        
        OUTPUT_TEXT("output_text")
        
        OUTPUT_AUDIO("output_audio")
    - JsonValue; role "assistant"constant
      
      The role of the message sender. Always assistant.
      - ASSISTANT("assistant")
    - JsonValue; type "message"constant
      
      The type of the item. Always message.
      - MESSAGE("message")
    - Optional<String> id
      
      The unique ID of the item. This may be provided by the client or generated by the server.
    - Optional<Object> object_
      
      Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
      - REALTIME_ITEM("realtime.item")
    - Optional<Status> status
      
      The status of the item. Has no effect on the conversation.
      - COMPLETED("completed")
      - INCOMPLETE("incomplete")
      - IN_PROGRESS("in_progress")
  - class RealtimeConversationItemFunctionCall:
    
    A function call item in a Realtime conversation.
    - String arguments
      
      The arguments of the function call. This is a JSON-encoded string representing the arguments passed to the function, for example {"arg1": "value1", "arg2": 42}.
    - String name
      
      The name of the function being called.
    - JsonValue; type "function_call"constant
      
      The type of the item. Always function_call.
      - FUNCTION_CALL("function_call")
    - Optional<String> id
      
      The unique ID of the item. This may be provided by the client or generated by the server.
    - Optional<String> callId
      
      The ID of the function call.
    - Optional<Object> object_
      
      Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
      - REALTIME_ITEM("realtime.item")
    - Optional<Status> status
      
      The status of the item. Has no effect on the conversation.
      - COMPLETED("completed")
      - INCOMPLETE("incomplete")
      - IN_PROGRESS("in_progress")
  - class RealtimeConversationItemFunctionCallOutput:
    
    A function call output item in a Realtime conversation.
    - String callId
      
      The ID of the function call this output is for.
    - String output
      
      The output of the function call. This is free text and can contain any information, or simply be empty.
    - JsonValue; type "function_call_output"constant
      
      The type of the item. Always function_call_output.
      - FUNCTION_CALL_OUTPUT("function_call_output")
    - Optional<String> id
      
      The unique ID of the item. This may be provided by the client or generated by the server.
    - Optional<Object> object_
      
      Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
      - REALTIME_ITEM("realtime.item")
    - Optional<Status> status
      
      The status of the item. Has no effect on the conversation.
      - COMPLETED("completed")
      - INCOMPLETE("incomplete")
      - IN_PROGRESS("in_progress")
  - class RealtimeMcpApprovalResponse:
    
    A Realtime item responding to an MCP approval request.
    - String id
      
      The unique ID of the approval response.
    - String approvalRequestId
      
      The ID of the approval request being answered.
    - boolean approve
      
      Whether the request was approved.
    - JsonValue; type "mcp_approval_response"constant
      
      The type of the item. Always mcp_approval_response.
      - MCP_APPROVAL_RESPONSE("mcp_approval_response")
    - Optional<String> reason
      
      Optional reason for the decision.
  - class RealtimeMcpListTools:
    
    A Realtime item listing tools available on an MCP server.
    - String serverLabel
      
      The label of the MCP server.
    - List<Tool> tools
      
      The tools available on the server.
      - JsonValue inputSchema
        
        The JSON schema describing the tool's input.
      - String name
        
        The name of the tool.
      - Optional<JsonValue> annotations
        
        Additional annotations about the tool.
      - Optional<String> description
        
        The description of the tool.
    - JsonValue; type "mcp_list_tools"constant
      
      The type of the item. Always mcp_list_tools.
      - MCP_LIST_TOOLS("mcp_list_tools")
    - Optional<String> id
      
      The unique ID of the list.
  - class RealtimeMcpToolCall:
    
    A Realtime item representing an invocation of a tool on an MCP server.
    - String id
      
      The unique ID of the tool call.
    - String arguments
      
      A JSON string of the arguments passed to the tool.
    - String name
      
      The name of the tool that was run.
    - String serverLabel
      
      The label of the MCP server running the tool.
    - JsonValue; type "mcp_call"constant
      
      The type of the item. Always mcp_call.
      - MCP_CALL("mcp_call")
    - Optional<String> approvalRequestId
      
      The ID of an associated approval request, if any.
    - Optional<Error> error
      
      The error from the tool call, if any.
      - class RealtimeMcpProtocolError:
        - long code
        - String message
        - JsonValue; type "protocol_error"constant
          - PROTOCOL_ERROR("protocol_error")
      - class RealtimeMcpToolExecutionError:
        - String message
        - JsonValue; type "tool_execution_error"constant
          - TOOL_EXECUTION_ERROR("tool_execution_error")
      - class RealtimeMcphttpError:
        - long code
        - String message
        - JsonValue; type "http_error"constant
          - HTTP_ERROR("http_error")
    - Optional<String> output
      
      The output from the tool call.
  - class RealtimeMcpApprovalRequest:
    
    A Realtime item requesting human approval of a tool invocation.
    - String id
      
      The unique ID of the approval request.
    - String arguments
      
      A JSON string of arguments for the tool.
    - String name
      
      The name of the tool to run.
    - String serverLabel
      
      The label of the MCP server making the request.
    - JsonValue; type "mcp_approval_request"constant
      
      The type of the item. Always mcp_approval_request.
      - MCP_APPROVAL_REQUEST("mcp_approval_request")
- JsonValue; type "conversation.item.done"constant
  
  The event type, must be conversation.item.done.
  - CONVERSATION_ITEM_DONE("conversation.item.done")
- Optional<String> previousItemId
  
  The ID of the item that precedes this one, if any. This is used to maintain ordering when items are inserted.
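
The item field can hold any of the variants listed above, which are distinguished by their type constant and, for messages, by role. A minimal sketch of that dispatch is shown below, working on the raw type and role strings. The class ItemVariants and its method are hypothetical; the SDK's own union type likely offers accessors for this, which are not shown here.

```java
/** Hypothetical mapping from an item's type/role constants to its variant class name. */
final class ItemVariants {

    /** role is only consulted when type is "message"; it may be null otherwise. */
    static String variantFor(String type, String role) {
        switch (type) {
            case "message":
                if ("system".equals(role)) return "RealtimeConversationItemSystemMessage";
                if ("user".equals(role)) return "RealtimeConversationItemUserMessage";
                if ("assistant".equals(role)) return "RealtimeConversationItemAssistantMessage";
                throw new IllegalArgumentException("unknown message role: " + role);
            case "function_call":
                return "RealtimeConversationItemFunctionCall";
            case "function_call_output":
                return "RealtimeConversationItemFunctionCallOutput";
            case "mcp_approval_response":
                return "RealtimeMcpApprovalResponse";
            case "mcp_list_tools":
                return "RealtimeMcpListTools";
            case "mcp_call":
                return "RealtimeMcpToolCall";
            case "mcp_approval_request":
                return "RealtimeMcpApprovalRequest";
            default:
                throw new IllegalArgumentException("unknown item type: " + type);
        }
    }

    public static void main(String[] args) {
        System.out.println(variantFor("message", "assistant"));
        System.out.println(variantFor("mcp_call", null));
    }
}
```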

Conversation Item Input Audio Transcription Completed Event

class ConversationItemInputAudioTranscriptionCompletedEvent:

This event is the output of audio transcription for user audio written to the user audio buffer. Transcription begins when the input audio buffer is committed by the client or server (when VAD is enabled). Transcription runs asynchronously with Response creation, so this event may come before or after the Response events.

Realtime API models accept audio natively, and thus input transcription is a separate process run on a separate ASR (Automatic Speech Recognition) model. The transcript may diverge somewhat from the model's interpretation, and should be treated as a rough guide.
- long contentIndex
  
  The index of the content part containing the audio.
- String eventId
  
  The unique ID of the server event.
- String itemId
  
  The ID of the item containing the audio that is being transcribed.
- String transcript
  
  The transcribed text.
- JsonValue; type "conversation.item.input_audio_transcription.completed"constant
  
  The event type, must be conversation.item.input_audio_transcription.completed.
  - CONVERSATION_ITEM_INPUT_AUDIO_TRANSCRIPTION_COMPLETED("conversation.item.input_audio_transcription.completed")
- Usage usage
  
  Usage statistics for the transcription. This is billed according to the ASR model's pricing rather than the realtime model's pricing.
  - class TranscriptTextUsageTokens:
    
    Usage statistics for models billed by token usage.
    - long inputTokens
      
      Number of input tokens billed for this request.
    - long outputTokens
      
      Number of output tokens generated.
    - long totalTokens
      
      Total number of tokens used (input + output).
    - JsonValue; type "tokens"constant
      
      The type of the usage object. Always tokens for this variant.
      - TOKENS("tokens")
    - Optional<InputTokenDetails> inputTokenDetails
      
      Details about the input tokens billed for this request.
      - Optional<Long> audioTokens
        
        Number of audio tokens billed for this request.
      - Optional<Long> textTokens
        
        Number of text tokens billed for this request.
  - class TranscriptTextUsageDuration:
    
    Usage statistics for models billed by audio input duration.
    - double seconds
      
      Duration of the input audio in seconds.
    - JsonValue; type "duration"constant
      
      The type of the usage object. Always duration for this variant.
      - DURATION("duration")
- Optional<List<LogProbProperties>> logprobs
  
  The log probabilities of the transcription.
  - String token
    
    The token that was used to generate the log probability.
  - List<long> bytes
    
    The bytes that were used to generate the log probability.
  - double logprob
    
    The log probability of the token.
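
The usage field above is a union of a token-billed variant and a duration-billed variant, discriminated by the type constant ("tokens" or "duration"). One way a client might summarize either variant is sketched below. TranscriptionUsage is a hypothetical stand-in class rather than the SDK's Usage type, and it models only the required fields.

```java
/** Hypothetical stand-in for the tokens/duration usage union. */
final class TranscriptionUsage {
    final String type;       // "tokens" or "duration"
    final long inputTokens;  // meaningful only for "tokens"
    final long outputTokens; // meaningful only for "tokens"
    final double seconds;    // meaningful only for "duration"

    private TranscriptionUsage(String type, long inputTokens, long outputTokens, double seconds) {
        this.type = type;
        this.inputTokens = inputTokens;
        this.outputTokens = outputTokens;
        this.seconds = seconds;
    }

    static TranscriptionUsage tokens(long inputTokens, long outputTokens) {
        return new TranscriptionUsage("tokens", inputTokens, outputTokens, 0.0);
    }

    static TranscriptionUsage duration(double seconds) {
        return new TranscriptionUsage("duration", 0L, 0L, seconds);
    }

    /** Total tokens is defined as input + output. */
    long totalTokens() {
        return inputTokens + outputTokens;
    }

    /** Human-readable summary that branches on the type discriminator. */
    String describe() {
        if ("tokens".equals(type)) {
            return totalTokens() + " tokens (" + inputTokens + " in, " + outputTokens + " out)";
        }
        return seconds + " s of audio";
    }

    public static void main(String[] args) {
        System.out.println(tokens(10, 4).describe());
        System.out.println(duration(2.5).describe());
    }
}
```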

Conversation Item Input Audio Transcription Delta Event

class ConversationItemInputAudioTranscriptionDeltaEvent:

Returned when the text value of an input audio transcription content part is updated with incremental transcription results.
- String eventId
  
  The unique ID of the server event.
- String itemId
  
  The ID of the item containing the audio that is being transcribed.
- JsonValue; type "conversation.item.input_audio_transcription.delta"constant
  
  The event type, must be conversation.item.input_audio_transcription.delta.
  - CONVERSATION_ITEM_INPUT_AUDIO_TRANSCRIPTION_DELTA("conversation.item.input_audio_transcription.delta")
- Optional<Long> contentIndex
  
  The index of the content part in the item's content array.
- Optional<String> delta
  
  The text delta.
- Optional<List<LogProbProperties>> logprobs
  
  The log probabilities of the transcription. These can be enabled by configuring the session with "include": ["item.input_audio_transcription.logprobs"]. Each entry in the array corresponds to a log probability of which token would be selected for this chunk of transcription. This can help identify whether there were multiple valid options for a given chunk of transcription.
  - String token
    
    The token that was used to generate the log probability.
  - List<long> bytes
    
    The bytes that were used to generate the log probability.
  - double logprob
    
    The log probability of the token.
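
As a rough illustration of how these values might be used, the sketch below converts each log probability to a linear probability and flags tokens that fall below a threshold. It assumes the values are natural logarithms; TranscriptionConfidence, TokenLogProb, and the threshold are hypothetical names and choices for illustration, not part of the SDK.

```java
import java.util.ArrayList;
import java.util.List;

final class TranscriptionConfidence {
    /** Hypothetical stand-in for LogProbProperties; only the fields used here. */
    record TokenLogProb(String token, double logprob) {}

    /** Converts a log probability (assumed natural log) to a probability in [0, 1]. */
    static double probability(double logprob) {
        return Math.exp(logprob);
    }

    /** Returns the tokens whose linear probability is below the given threshold. */
    static List<String> lowConfidenceTokens(List<TokenLogProb> logprobs, double threshold) {
        List<String> flagged = new ArrayList<>();
        for (TokenLogProb lp : logprobs) {
            if (probability(lp.logprob()) < threshold) {
                flagged.add(lp.token());
            }
        }
        return flagged;
    }
}
```

A client could use the flagged tokens to highlight uncertain words in a live caption view; the right threshold depends on the application.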

Conversation Item Input Audio Transcription Failed Event

class ConversationItemInputAudioTranscriptionFailedEvent:

Returned when input audio transcription is configured, and a transcription request for a user message failed. These events are separate from other error events so that the client can identify the related Item.
- long contentIndex
  
  The index of the content part containing the audio.
- Error error
  
  Details of the transcription error.
  - Optional<String> code
    
    Error code, if any.
  - Optional<String> message
    
    A human-readable error message.
  - Optional<String> param
    
    Parameter related to the error, if any.
  - Optional<String> type
    
    The type of error.
- String eventId
  
  The unique ID of the server event.
- String itemId
  
  The ID of the user message item.
- JsonValue; type "conversation.item.input_audio_transcription.failed"constant
  
  The event type, must be conversation.item.input_audio_transcription.failed.
  - CONVERSATION_ITEM_INPUT_AUDIO_TRANSCRIPTION_FAILED("conversation.item.input_audio_transcription.failed")

Conversation Item Input Audio Transcription Segment

class ConversationItemInputAudioTranscriptionSegment:

Returned when an input audio transcription segment is identified for an item.
- String id
  
  The segment identifier.
- long contentIndex
  
  The index of the input audio content part within the item.
- double end
  
  End time of the segment in seconds.
- String eventId
  
  The unique ID of the server event.
- String itemId
  
  The ID of the item containing the input audio content.
- String speaker
  
  The detected speaker label for this segment.
- double start
  
  Start time of the segment in seconds.
- String text
  
  The text for this segment.
- JsonValue; type "conversation.item.input_audio_transcription.segment"constant
  
  The event type, must be conversation.item.input_audio_transcription.segment.
  - CONVERSATION_ITEM_INPUT_AUDIO_TRANSCRIPTION_SEGMENT("conversation.item.input_audio_transcription.segment")

Conversation Item Retrieve Event

class ConversationItemRetrieveEvent:

Send this event when you want to retrieve the server's representation of a specific item in the conversation history. This is useful, for example, to inspect user audio after noise cancellation and VAD. The server will respond with a conversation.item.retrieved event, unless the item does not exist in the conversation history, in which case the server will respond with an error.
- String itemId
  
  The ID of the item to retrieve.
- JsonValue; type "conversation.item.retrieve"constant
  
  The event type, must be conversation.item.retrieve.
  - CONVERSATION_ITEM_RETRIEVE("conversation.item.retrieve")
- Optional<String> eventId
  
  Optional client-generated ID used to identify this event.

Conversation Item Truncate Event

class ConversationItemTruncateEvent:

Send this event to truncate a previous assistant message’s audio. The server will produce audio faster than realtime, so this event is useful when the user interrupts to truncate audio that has already been sent to the client but not yet played. This will synchronize the server's understanding of the audio with the client's playback.

Truncating audio will delete the server-side text transcript to ensure there is no text in the context that hasn't been heard by the user.

If successful, the server will respond with a conversation.item.truncated event.
- long audioEndMs
  
  Inclusive duration up to which audio is truncated, in milliseconds. If the audio_end_ms is greater than the actual audio duration, the server will respond with an error.
- long contentIndex
  
  The index of the content part to truncate. Set this to 0.
- String itemId
  
  The ID of the assistant message item to truncate. Only assistant message items can be truncated.
- JsonValue; type "conversation.item.truncate"constant
  
  The event type, must be conversation.item.truncate.
  - CONVERSATION_ITEM_TRUNCATE("conversation.item.truncate")
- Optional<String> eventId
  
  Optional client-generated ID used to identify this event.
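
As a rough illustration of how a client might compute audio_end_ms when the user interrupts, the sketch below converts the number of samples already played into milliseconds and builds the event as JSON text. It assumes 24 kHz output (the audio/pcm rate listed in this reference), that the JSON field names are the snake_case forms of the fields above, and that the item ID needs no JSON escaping. TruncateOnInterrupt and its methods are hypothetical names, not part of the SDK.

```java
final class TruncateOnInterrupt {
    /** Assumed output sample rate in Hz (the audio/pcm rate in this reference). */
    static final int SAMPLE_RATE_HZ = 24_000;

    /** Milliseconds of audio covered by the samples the client has actually played. */
    static long playedMs(long samplesPlayed) {
        return samplesPlayed * 1000L / SAMPLE_RATE_HZ;
    }

    /** Builds a conversation.item.truncate event; snake_case names are assumed wire names. */
    static String truncateEvent(String itemId, long samplesPlayed) {
        return "{\"type\":\"conversation.item.truncate\""
            + ",\"item_id\":\"" + itemId + "\""
            + ",\"content_index\":0"
            + ",\"audio_end_ms\":" + playedMs(samplesPlayed) + "}";
    }
}
```

Tracking played samples on the client, rather than received samples, is what keeps the server's view aligned with what the user actually heard.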

Conversation Item Truncated Event

class ConversationItemTruncatedEvent:

Returned when an earlier assistant audio message item is truncated by the client with a conversation.item.truncate event. This event is used to synchronize the server's understanding of the audio with the client's playback.

This action will truncate the audio and remove the server-side text transcript to ensure there is no text in the context that hasn't been heard by the user.
- long audioEndMs
  
  The duration up to which the audio was truncated, in milliseconds.
- long contentIndex
  
  The index of the content part that was truncated.
- String eventId
  
  The unique ID of the server event.
- String itemId
  
  The ID of the assistant message item that was truncated.
- JsonValue; type "conversation.item.truncated"constant
  
  The event type, must be conversation.item.truncated.
  - CONVERSATION_ITEM_TRUNCATED("conversation.item.truncated")

Conversation Item With Reference

class ConversationItemWithReference:

The item to add to the conversation.
- Optional<String> id
  
  For an item of type (message | function_call | function_call_output) this field allows the client to assign the unique ID of the item. It is not required because the server will generate one if not provided.
  
  For an item of type item_reference, this field is required and is a reference to any item that has previously existed in the conversation.
- Optional<String> arguments
  
  The arguments of the function call (for function_call items).
- Optional<String> callId
  
  The ID of the function call (for function_call and function_call_output items). If passed on a function_call_output item, the server will check that a function_call item with the same ID exists in the conversation history.
- Optional<List<Content>> content
  
  The content of the message, applicable for message items.
  - Message items of role system support only input_text content
  - Message items of role user support input_text and input_audio content
  - Message items of role assistant support text content.
  - Optional<String> id
    
    ID of a previous conversation item to reference (for item_reference content types in response.create events). These can reference both client and server created items.
  - Optional<String> audio
    
    Base64-encoded audio bytes, used for input_audio content type.
  - Optional<String> text
    
    The text content, used for input_text and text content types.
  - Optional<String> transcript
    
    The transcript of the audio, used for input_audio content type.
  - Optional<Type> type
    
    The content type (input_text, input_audio, item_reference, text).
    - INPUT_TEXT("input_text")
    - INPUT_AUDIO("input_audio")
    - ITEM_REFERENCE("item_reference")
    - TEXT("text")
- Optional<String> name
  
  The name of the function being called (for function_call items).
- Optional<Object> object_
  
  Identifier for the API object being returned - always realtime.item.
  - REALTIME_ITEM("realtime.item")
- Optional<String> output
  
  The output of the function call (for function_call_output items).
- Optional<Role> role
  
  The role of the message sender (user, assistant, system), only applicable for message items.
  - USER("user")
  - ASSISTANT("assistant")
  - SYSTEM("system")
- Optional<Status> status
  
  The status of the item (completed, incomplete, in_progress). These have no effect on the conversation, but are accepted for consistency with the conversation.item.created event.
  - COMPLETED("completed")
  - INCOMPLETE("incomplete")
  - IN_PROGRESS("in_progress")
- Optional<Type> type
  
  The type of the item (message, function_call, function_call_output, item_reference).
  - MESSAGE("message")
  - FUNCTION_CALL("function_call")
  - FUNCTION_CALL_OUTPUT("function_call_output")
  - ITEM_REFERENCE("item_reference")

Input Audio Buffer Append Event

class InputAudioBufferAppendEvent:

Send this event to append audio bytes to the input audio buffer. The audio buffer is temporary storage you can write to and later commit. A "commit" will create a new user message item in the conversation history from the buffer content and clear the buffer. Input audio transcription (if enabled) will be generated when the buffer is committed.

If VAD is enabled, the audio buffer is used to detect speech and the server will decide when to commit. When Server VAD is disabled, you must commit the audio buffer manually. Input audio noise reduction operates on writes to the audio buffer.

The client may choose how much audio to place in each event, up to a maximum of 15 MiB. For example, streaming smaller chunks from the client may allow the VAD to be more responsive. Unlike most other client events, the server will not send a confirmation response to this event.
- String audio
  
  Base64-encoded audio bytes. This must be in the format specified by the input_audio_format field in the session configuration.
- JsonValue; type "input_audio_buffer.append"constant
  
  The event type, must be input_audio_buffer.append.
  - INPUT_AUDIO_BUFFER_APPEND("input_audio_buffer.append")
- Optional<String> eventId
  
  Optional client-generated ID used to identify this event.
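
The chunked streaming described above can be sketched as follows: raw audio bytes are split into fixed-size pieces, each piece is Base64-encoded, and one append event is produced per piece. This is a minimal sketch that builds the events as JSON text; AudioAppendChunker and the chunk size are hypothetical, and sending the events over the connection is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Base64;
import java.util.List;

final class AudioAppendChunker {
    /** Splits raw audio into input_audio_buffer.append events of at most chunkBytes each. */
    static List<String> appendEvents(byte[] audio, int chunkBytes) {
        if (chunkBytes <= 0) {
            throw new IllegalArgumentException("chunkBytes must be positive");
        }
        List<String> events = new ArrayList<>();
        int offset = 0;
        while (offset < audio.length) {
            // Widen to long so offset + chunkBytes cannot overflow.
            int end = (int) Math.min(audio.length, (long) offset + chunkBytes);
            byte[] chunk = Arrays.copyOfRange(audio, offset, end);
            String b64 = Base64.getEncoder().encodeToString(chunk);
            events.add("{\"type\":\"input_audio_buffer.append\",\"audio\":\"" + b64 + "\"}");
            offset = end;
        }
        return events;
    }
}
```

Assuming 16-bit mono PCM at 24 kHz, a chunk of 4800 bytes would carry about 100 ms of audio; smaller chunks trade more events for lower latency.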

Input Audio Buffer Clear Event

class InputAudioBufferClearEvent:

Send this event to clear the audio bytes in the buffer. The server will respond with an input_audio_buffer.cleared event.
- JsonValue; type "input_audio_buffer.clear"constant
  
  The event type, must be input_audio_buffer.clear.
  - INPUT_AUDIO_BUFFER_CLEAR("input_audio_buffer.clear")
- Optional<String> eventId
  
  Optional client-generated ID used to identify this event.

Input Audio Buffer Cleared Event

class InputAudioBufferClearedEvent:

Returned when the input audio buffer is cleared by the client with an input_audio_buffer.clear event.
- String eventId
  
  The unique ID of the server event.
- JsonValue; type "input_audio_buffer.cleared"constant
  
  The event type, must be input_audio_buffer.cleared.
  - INPUT_AUDIO_BUFFER_CLEARED("input_audio_buffer.cleared")

Input Audio Buffer Commit Event

class InputAudioBufferCommitEvent:

Send this event to commit the user input audio buffer, which will create a new user message item in the conversation. This event will produce an error if the input audio buffer is empty. When in Server VAD mode, the client does not need to send this event; the server will commit the audio buffer automatically.

Committing the input audio buffer will trigger input audio transcription (if enabled in session configuration), but it will not create a response from the model. The server will respond with an input_audio_buffer.committed event.
- JsonValue; type "input_audio_buffer.commit"constant
  
  The event type, must be input_audio_buffer.commit.
  - INPUT_AUDIO_BUFFER_COMMIT("input_audio_buffer.commit")
- Optional<String> eventId
  
  Optional client-generated ID used to identify this event.

Input Audio Buffer Committed Event

class InputAudioBufferCommittedEvent:

Returned when an input audio buffer is committed, either by the client or automatically in server VAD mode. The item_id property is the ID of the user message item that will be created, thus a conversation.item.created event will also be sent to the client.
- String eventId
  
  The unique ID of the server event.
- String itemId
  
  The ID of the user message item that will be created.
- JsonValue; type "input_audio_buffer.committed"constant
  
  The event type, must be input_audio_buffer.committed.
  - INPUT_AUDIO_BUFFER_COMMITTED("input_audio_buffer.committed")
- Optional<String> previousItemId
  
  The ID of the preceding item after which the new item will be inserted. Can be null if the item has no predecessor.

Input Audio Buffer Dtmf Event Received Event

class InputAudioBufferDtmfEventReceivedEvent:

SIP Only: Returned when a DTMF event is received. A DTMF event is a message that represents a telephone keypad press (0–9, *, #, A–D). The event property is the key that the user pressed. The received_at property is the UTC Unix timestamp at which the server received the event.
- String event
  
  The telephone keypad key that was pressed by the user.
- long receivedAt
  
  UTC Unix timestamp when the DTMF event was received by the server.
- JsonValue; type "input_audio_buffer.dtmf_event_received"constant
  
  The event type, must be input_audio_buffer.dtmf_event_received.
  - INPUT_AUDIO_BUFFER_DTMF_EVENT_RECEIVED("input_audio_buffer.dtmf_event_received")

Input Audio Buffer Speech Started Event

class InputAudioBufferSpeechStartedEvent:

Sent by the server when in server_vad mode to indicate that speech has been detected in the audio buffer. This can happen any time audio is added to the buffer (unless speech is already detected). The client may want to use this event to interrupt audio playback or provide visual feedback to the user.

The client should expect to receive an input_audio_buffer.speech_stopped event when speech stops. The item_id property is the ID of the user message item that will be created when speech stops and will also be included in the input_audio_buffer.speech_stopped event (unless the client manually commits the audio buffer during VAD activation).
- long audioStartMs
  
  Milliseconds from the start of all audio written to the buffer during the session when speech was first detected. This will correspond to the beginning of audio sent to the model, and thus includes the prefix_padding_ms configured in the Session.
- String eventId
  
  The unique ID of the server event.
- String itemId
  
  The ID of the user message item that will be created when speech stops.
- JsonValue; type "input_audio_buffer.speech_started"constant
  
  The event type, must be input_audio_buffer.speech_started.
  - INPUT_AUDIO_BUFFER_SPEECH_STARTED("input_audio_buffer.speech_started")

Input Audio Buffer Speech Stopped Event

class InputAudioBufferSpeechStoppedEvent:

Returned in server_vad mode when the server detects the end of speech in the audio buffer. The server will also send a conversation.item.created event with the user message item that is created from the audio buffer.
- long audioEndMs
  
  Milliseconds since the session started when speech stopped. This will correspond to the end of audio sent to the model, and thus includes the min_silence_duration_ms configured in the Session.
- String eventId
  
  The unique ID of the server event.
- String itemId
  
  The ID of the user message item that will be created.
- JsonValue; type "input_audio_buffer.speech_stopped"constant
  
  The event type, must be input_audio_buffer.speech_stopped.
  - INPUT_AUDIO_BUFFER_SPEECH_STOPPED("input_audio_buffer.speech_stopped")
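
Because the speech_started and speech_stopped events share an item ID, a client can pair them to estimate how long each detected speech segment lasted. The sketch below is one hedged way to do that; SpeechSegmentTracker is a hypothetical helper, and it assumes audio_start_ms and audio_end_ms are offsets on the same timeline.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalLong;

final class SpeechSegmentTracker {
    /** Start offsets (ms) keyed by item ID, kept until the matching stop arrives. */
    private final Map<String, Long> startMsByItem = new HashMap<>();

    /** Call when input_audio_buffer.speech_started is received. */
    void onSpeechStarted(String itemId, long audioStartMs) {
        startMsByItem.put(itemId, audioStartMs);
    }

    /** Call when input_audio_buffer.speech_stopped is received.
     *  Returns the segment length in ms, or empty if no start was recorded. */
    OptionalLong onSpeechStopped(String itemId, long audioEndMs) {
        Long start = startMsByItem.remove(itemId);
        return start == null ? OptionalLong.empty() : OptionalLong.of(audioEndMs - start);
    }
}
```

Per the field descriptions above, the resulting duration includes the configured prefix padding and the trailing silence, so it overstates the time the user was actually speaking.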

Input Audio Buffer Timeout Triggered

class InputAudioBufferTimeoutTriggered:

Returned when the Server VAD timeout is triggered for the input audio buffer. This is configured with idle_timeout_ms in the turn_detection settings of the session, and it indicates that there hasn't been any speech detected for the configured duration.

The audio_start_ms and audio_end_ms fields indicate the segment of audio after the last model response up to the triggering time, as an offset from the beginning of audio written to the input audio buffer. This means it demarcates the segment of audio that was silent and the difference between the start and end values will roughly match the configured timeout.

The empty audio will be committed to the conversation as an input_audio item (there will be an input_audio_buffer.committed event) and a model response will be generated. There may be speech that didn't trigger VAD but is still detected by the model, so the model may respond with something relevant to the conversation or a prompt to continue speaking.
- long audioEndMs
  
  Millisecond offset of audio written to the input audio buffer at the time the timeout was triggered.
- long audioStartMs
  
  Millisecond offset of audio written to the input audio buffer that was after the playback time of the last model response.
- String eventId
  
  The unique ID of the server event.
- String itemId
  
  The ID of the item associated with this segment.
- JsonValue; type "input_audio_buffer.timeout_triggered"constant
  
  The event type, must be input_audio_buffer.timeout_triggered.
  - INPUT_AUDIO_BUFFER_TIMEOUT_TRIGGERED("input_audio_buffer.timeout_triggered")

Log Prob Properties

class LogProbProperties:

A log probability object.
- String token
  
  The token that was used to generate the log probability.
- List<long> bytes
  
  The bytes that were used to generate the log probability.
- double logprob
  
  The log probability of the token.

Mcp List Tools Completed

class McpListToolsCompleted:

Returned when listing MCP tools has completed for an item.
- String eventId
  
  The unique ID of the server event.
- String itemId
  
  The ID of the MCP list tools item.
- JsonValue; type "mcp_list_tools.completed"constant
  
  The event type, must be mcp_list_tools.completed.
  - MCP_LIST_TOOLS_COMPLETED("mcp_list_tools.completed")

Mcp List Tools Failed

class McpListToolsFailed:

Returned when listing MCP tools has failed for an item.
- String eventId
  
  The unique ID of the server event.
- String itemId
  
  The ID of the MCP list tools item.
- JsonValue; type "mcp_list_tools.failed"constant
  
  The event type, must be mcp_list_tools.failed.
  - MCP_LIST_TOOLS_FAILED("mcp_list_tools.failed")

Mcp List Tools In Progress

class McpListToolsInProgress:

Returned when listing MCP tools is in progress for an item.
- String eventId
  
  The unique ID of the server event.
- String itemId
  
  The ID of the MCP list tools item.
- JsonValue; type "mcp_list_tools.in_progress"constant
  
  The event type, must be mcp_list_tools.in_progress.
  - MCP_LIST_TOOLS_IN_PROGRESS("mcp_list_tools.in_progress")

Noise Reduction Type

enum NoiseReductionType:

Type of noise reduction. near_field is for close-talking microphones such as headphones; far_field is for far-field microphones such as laptop or conference room microphones.
- NEAR_FIELD("near_field")
- FAR_FIELD("far_field")

Output Audio Buffer Clear Event

class OutputAudioBufferClearEvent:

WebRTC/SIP Only: Emit to cut off the current audio response. This will trigger the server to stop generating audio and emit an output_audio_buffer.cleared event. This event should be preceded by a response.cancel client event to stop the generation of the current response.
- JsonValue; type "output_audio_buffer.clear"constant
  
  The event type, must be output_audio_buffer.clear.
  - OUTPUT_AUDIO_BUFFER_CLEAR("output_audio_buffer.clear")
- Optional<String> eventId
  
  The unique ID of the client event used for error handling.

Rate Limits Updated Event

class RateLimitsUpdatedEvent:

Emitted at the beginning of a Response to indicate the updated rate limits. When a Response is created, some tokens are "reserved" for the output tokens. The rate limits shown here reflect that reservation, which is then adjusted accordingly once the Response is completed.
- String eventId
  
  The unique ID of the server event.
- List<RateLimit> rateLimits
  
  List of rate limit information.
  - Optional<Long> limit
    
    The maximum allowed value for the rate limit.
  - Optional<Name> name
    
    The name of the rate limit (requests, tokens).
    - REQUESTS("requests")
    - TOKENS("tokens")
  - Optional<Long> remaining
    
    The remaining value before the limit is reached.
  - Optional<Double> resetSeconds
    
    Seconds until the rate limit resets.
- JsonValue; type "rate_limits.updated"constant
  
  The event type, must be rate_limits.updated.
  - RATE_LIMITS_UPDATED("rate_limits.updated")
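
One way a client might act on this event is to pause before starting another response once a limit is exhausted. The sketch below is a hypothetical policy, not SDK behavior: RateLimitBackoff and its RateLimit record are illustrative stand-ins, and the Optional fields above are flattened to plain values for brevity.

```java
import java.util.List;

final class RateLimitBackoff {
    /** Hypothetical flattened view of one entry in rateLimits. */
    record RateLimit(String name, long limit, long remaining, double resetSeconds) {}

    /** Longest reset time among exhausted limits, in ms; 0 when nothing is exhausted. */
    static long suggestedWaitMs(List<RateLimit> limits) {
        double waitSeconds = 0.0;
        for (RateLimit rl : limits) {
            if (rl.remaining() <= 0) {
                waitSeconds = Math.max(waitSeconds, rl.resetSeconds());
            }
        }
        return (long) Math.ceil(waitSeconds * 1000.0);
    }
}
```

Since the reported values include the reservation for output tokens, the remaining count can rise again once the Response completes, so a real client may prefer to re-check the next rate_limits.updated event rather than wait the full time.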

Realtime Audio Config

class RealtimeAudioConfig:

Configuration for input and output audio.
- Optional<RealtimeAudioConfigInput> input
  - Optional<RealtimeAudioFormats> format
    
    The format of the input audio.
    - AudioPcm
      - Optional<Rate> rate
        
        The sample rate of the audio. Always 24000.
        - _24000(24000)
      - Optional<Type> type
        
        The audio format. Always audio/pcm.
        - AUDIO_PCM("audio/pcm")
    - AudioPcmu
      - Optional<Type> type
        
        The audio format. Always audio/pcmu.
        - AUDIO_PCMU("audio/pcmu")
    - AudioPcma
      - Optional<Type> type
        
        The audio format. Always audio/pcma.
        - AUDIO_PCMA("audio/pcma")
  - Optional<NoiseReduction> noiseReduction
    
    Configuration for input audio noise reduction. This can be set to null to turn off. Noise reduction filters audio added to the input audio buffer before it is sent to VAD and the model. Filtering the audio can improve VAD and turn detection accuracy (reducing false positives) and model performance by improving perception of the input audio.
    - Optional<NoiseReductionType> type
      
      Type of noise reduction. near_field is for close-talking microphones such as headphones; far_field is for far-field microphones such as laptop or conference room microphones.
      - NEAR_FIELD("near_field")
      - FAR_FIELD("far_field")
  - Optional<AudioTranscription> transcription
    
    Configuration for input audio transcription. Defaults to off and can be set to null to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through the /audio/transcriptions endpoint and should be treated as guidance on the input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service.
    - Optional<Delay> delay
      
      Controls how long the model waits before emitting transcription text. Higher values can improve transcription accuracy at the cost of latency. Only supported with gpt-realtime-whisper in GA Realtime sessions.
      - MINIMAL("minimal")
      - LOW("low")
      - MEDIUM("medium")
      - HIGH("high")
      - XHIGH("xhigh")
    - Optional<String> language
      
      The language of the input audio. Supplying the input language in ISO-639-1 (e.g. en) format will improve accuracy and latency.
    - Optional<Model> model
      
      The model to use for transcription. Current options are whisper-1, gpt-4o-mini-transcribe, gpt-4o-mini-transcribe-2025-12-15, gpt-4o-transcribe, gpt-4o-transcribe-diarize, and gpt-realtime-whisper. Use gpt-4o-transcribe-diarize when you need diarization with speaker labels.
      - WHISPER_1("whisper-1")
      - GPT_4O_MINI_TRANSCRIBE("gpt-4o-mini-transcribe")
      - GPT_4O_MINI_TRANSCRIBE_2025_12_15("gpt-4o-mini-transcribe-2025-12-15")
      - GPT_4O_TRANSCRIBE("gpt-4o-transcribe")
      - GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize")
      - GPT_REALTIME_WHISPER("gpt-realtime-whisper")
    - Optional<String> prompt
      
      An optional text to guide the model's style or continue a previous audio segment. For whisper-1, the prompt is a list of keywords. For gpt-4o-transcribe models (excluding gpt-4o-transcribe-diarize), the prompt is a free text string, for example "expect words related to technology". Prompt is not supported with gpt-realtime-whisper in GA Realtime sessions.
  - Optional<RealtimeAudioInputTurnDetection> turnDetection
    
    Configuration for turn detection, either Server VAD or Semantic VAD. This can be set to null to turn off, in which case the client must manually trigger the model response.
    
    Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.
    
    Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.
    
    For gpt-realtime-whisper transcription sessions, turn detection must be set to null; VAD is not supported.
    - ServerVad
      - JsonValue; type "server_vad"constant
        
        Type of turn detection, server_vad to turn on simple Server VAD.
        - SERVER_VAD("server_vad")
      - Optional<Boolean> createResponse
        
        Whether or not to automatically generate a response when a VAD stop event occurs. If interrupt_response is set to false this may fail to create a response if the model is already responding.
        
        If both create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted.
      - Optional<Long> idleTimeoutMs
        
        Optional timeout after which a model response will be triggered automatically. This is useful for situations in which a long pause from the user is unexpected, such as a phone call. The model will effectively prompt the user to continue the conversation based on the current context.
        
        The timeout value will be applied after the last model response's audio has finished playing, i.e. it's set to the response.done time plus audio playback duration.
        
        An input_audio_buffer.timeout_triggered event (plus events associated with the Response) will be emitted when the timeout is reached. Idle timeout is currently only supported for server_vad mode.
      - Optional<Boolean> interruptResponse
        
        Whether or not to automatically interrupt (cancel) any ongoing response with output to the default conversation (i.e. conversation of auto) when a VAD start event occurs. If true then the response will be cancelled, otherwise it will continue until complete.
        
        If both create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted.
      - Optional<Long> prefixPaddingMs
        
        Used only for server_vad mode. Amount of audio to include before the VAD detected speech (in milliseconds). Defaults to 300ms.
      - Optional<Long> silenceDurationMs
        
        Used only for server_vad mode. Duration of silence to detect speech stop (in milliseconds). Defaults to 500ms. With shorter values the model will respond more quickly, but may jump in on short pauses from the user.
      - Optional<Double> threshold
        
        Used only for server_vad mode. Activation threshold for VAD (0.0 to 1.0). Defaults to 0.5. A higher threshold will require louder audio to activate the model, and thus might perform better in noisy environments.
    - SemanticVad
      - JsonValue; type "semantic_vad"constant
        
        Type of turn detection, semantic_vad to turn on Semantic VAD.
        - SEMANTIC_VAD("semantic_vad")
      - Optional<Boolean> createResponse
        
        Whether or not to automatically generate a response when a VAD stop event occurs.
      - Optional<Eagerness> eagerness
        
        Used only for semantic_vad mode. The eagerness of the model to respond. low will wait longer for the user to continue speaking, high will respond more quickly. auto is the default and is equivalent to medium. low, medium, and high have max timeouts of 8s, 4s, and 2s respectively.
        - LOW("low")
        - MEDIUM("medium")
        - HIGH("high")
        - AUTO("auto")
      - Optional<Boolean> interruptResponse
        
        Whether or not to automatically interrupt any ongoing response with output to the default conversation (i.e. conversation of auto) when a VAD start event occurs.
- Optional<RealtimeAudioConfigOutput> output
  - Optional<RealtimeAudioFormats> format
    
    The format of the output audio.
  - Optional<Double> speed
    
    The speed of the model's spoken response as a multiple of the original speed. 1.0 is the default speed. 0.25 is the minimum speed. 1.5 is the maximum speed. This value can only be changed in between model turns, not while a response is in progress.
    
    This parameter is a post-processing adjustment to the audio after it is generated; it's also possible to prompt the model to speak faster or slower.
  - Optional<Voice> voice
    
    The voice the model uses to respond. Supported built-in voices are alloy, ash, ballad, coral, echo, sage, shimmer, verse, marin, and cedar. You may also provide a custom voice object with an id, for example { "id": "voice_1234" }. Voice cannot be changed during the session once the model has responded with audio at least once. We recommend marin and cedar for best quality.
    - String
    - enum UnionMember1:
      - ALLOY("alloy")
      - ASH("ash")
      - BALLAD("ballad")
      - CORAL("coral")
      - ECHO("echo")
      - SAGE("sage")
      - SHIMMER("shimmer")
      - VERSE("verse")
      - MARIN("marin")
      - CEDAR("cedar")
    - class Id:
      
      Custom voice reference.
      - String id
        
        The custom voice ID, e.g. voice_1234.
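
Several of the fields above carry documented ranges or mappings that a client may want to check before sending a configuration. The sketch below captures two of them: the output speed bounds (0.25 to 1.5) and the maximum timeout for each semantic_vad eagerness value. AudioConfigChecks is a hypothetical helper, and clamping out-of-range speeds (rather than rejecting them) is a design choice made only for this illustration.

```java
final class AudioConfigChecks {
    /** Documented bounds for the output speed multiplier. */
    static final double MIN_SPEED = 0.25;
    static final double MAX_SPEED = 1.5;

    /** Clamps a requested speed into the documented range. */
    static double clampSpeed(double speed) {
        return Math.max(MIN_SPEED, Math.min(MAX_SPEED, speed));
    }

    /** Documented max timeout in seconds per eagerness value; auto is equivalent to medium. */
    static int maxTimeoutSeconds(String eagerness) {
        switch (eagerness) {
            case "low":
                return 8;
            case "medium":
            case "auto":
                return 4;
            case "high":
                return 2;
            default:
                throw new IllegalArgumentException("unknown eagerness: " + eagerness);
        }
    }
}
```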

Realtime Audio Config Input

class RealtimeAudioConfigInput:
- Optional<RealtimeAudioFormats> format
  
  The format of the input audio.
  - AudioPcm
    - Optional<Rate> rate
      
      The sample rate of the audio. Always 24000.
      - _24000(24000)
    - Optional<Type> type
      
      The audio format. Always audio/pcm.
      - AUDIO_PCM("audio/pcm")
  - AudioPcmu
    - Optional<Type> type
      
      The audio format. Always audio/pcmu.
      - AUDIO_PCMU("audio/pcmu")
  - AudioPcma
    - Optional<Type> type
      
      The audio format. Always audio/pcma.
      - AUDIO_PCMA("audio/pcma")
- Optional<NoiseReduction> noiseReduction
  
  Configuration for input audio noise reduction. This can be set to null to turn off. Noise reduction filters audio added to the input audio buffer before it is sent to VAD and the model. Filtering the audio can improve VAD and turn detection accuracy (reducing false positives) and model performance by improving perception of the input audio.
  - Optional<NoiseReductionType> type
    
    Type of noise reduction. near_field is for close-talking microphones such as headphones; far_field is for far-field microphones such as laptop or conference room microphones.
    - NEAR_FIELD("near_field")
    - FAR_FIELD("far_field")
- Optional<AudioTranscription> transcription
  
  Configuration for input audio transcription. Defaults to off and can be set to null to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through the /audio/transcriptions endpoint and should be treated as guidance on the input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service.
  - Optional<Delay> delay
    
    Controls how long the model waits before emitting transcription text. Higher values can improve transcription accuracy at the cost of latency. Only supported with gpt-realtime-whisper in GA Realtime sessions.
    - MINIMAL("minimal")
    - LOW("low")
    - MEDIUM("medium")
    - HIGH("high")
    - XHIGH("xhigh")
  - Optional<String> language
    
    The language of the input audio. Supplying the input language in ISO-639-1 (e.g. en) format will improve accuracy and latency.
  - Optional<Model> model
    
    The model to use for transcription. Current options are whisper-1, gpt-4o-mini-transcribe, gpt-4o-mini-transcribe-2025-12-15, gpt-4o-transcribe, gpt-4o-transcribe-diarize, and gpt-realtime-whisper. Use gpt-4o-transcribe-diarize when you need diarization with speaker labels.
    - WHISPER_1("whisper-1")
    - GPT_4O_MINI_TRANSCRIBE("gpt-4o-mini-transcribe")
    - GPT_4O_MINI_TRANSCRIBE_2025_12_15("gpt-4o-mini-transcribe-2025-12-15")
    - GPT_4O_TRANSCRIBE("gpt-4o-transcribe")
    - GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize")
    - GPT_REALTIME_WHISPER("gpt-realtime-whisper")
  - Optional<String> prompt
    
    An optional text to guide the model's style or continue a previous audio segment. For whisper-1, the prompt is a list of keywords. For gpt-4o-transcribe models (excluding gpt-4o-transcribe-diarize), the prompt is a free text string, for example "expect words related to technology". Prompt is not supported with gpt-realtime-whisper in GA Realtime sessions.
- Optional<RealtimeAudioInputTurnDetection> turnDetection
  
  Configuration for turn detection, either Server VAD or Semantic VAD. This can be set to null to turn off, in which case the client must manually trigger model response.
  
  Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.
  
  Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.
  
  For gpt-realtime-whisper transcription sessions, turn detection must be set to null; VAD is not supported.
  - ServerVad
    - JsonValue; type "server_vad"constant
      
      Type of turn detection, server_vad to turn on simple Server VAD.
      - SERVER_VAD("server_vad")
    - Optional<Boolean> createResponse
      
      Whether or not to automatically generate a response when a VAD stop event occurs. If interrupt_response is set to false this may fail to create a response if the model is already responding.
      
      If both create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted.
    - Optional<Long> idleTimeoutMs
      
      Optional timeout after which a model response will be triggered automatically. This is useful for situations in which a long pause from the user is unexpected, such as a phone call. The model will effectively prompt the user to continue the conversation based on the current context.
      
      The timeout value will be applied after the last model response's audio has finished playing, i.e. it's set to the response.done time plus audio playback duration.
      
      An input_audio_buffer.timeout_triggered event (plus events associated with the Response) will be emitted when the timeout is reached. Idle timeout is currently only supported for server_vad mode.
    - Optional<Boolean> interruptResponse
      
      Whether or not to automatically interrupt (cancel) any ongoing response with output to the default conversation (i.e. conversation of auto) when a VAD start event occurs. If true then the response will be cancelled, otherwise it will continue until complete.
      
      If both create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted.
    - Optional<Long> prefixPaddingMs
      
      Used only for server_vad mode. Amount of audio to include before the VAD detected speech (in milliseconds). Defaults to 300ms.
    - Optional<Long> silenceDurationMs
      
      Used only for server_vad mode. Duration of silence to detect speech stop (in milliseconds). Defaults to 500ms. With shorter values the model will respond more quickly, but may jump in on short pauses from the user.
    - Optional<Double> threshold
      
      Used only for server_vad mode. Activation threshold for VAD (0.0 to 1.0). Defaults to 0.5. A higher threshold will require louder audio to activate the model, and thus might perform better in noisy environments.
  - SemanticVad
    - JsonValue; type "semantic_vad"constant
      
      Type of turn detection, semantic_vad to turn on Semantic VAD.
      - SEMANTIC_VAD("semantic_vad")
    - Optional<Boolean> createResponse
      
      Whether or not to automatically generate a response when a VAD stop event occurs.
    - Optional<Eagerness> eagerness
      
      Used only for semantic_vad mode. The eagerness of the model to respond. low will wait longer for the user to continue speaking, high will respond more quickly. auto is the default and is equivalent to medium. low, medium, and high have max timeouts of 8s, 4s, and 2s respectively.
      - LOW("low")
      - MEDIUM("medium")
      - HIGH("high")
      - AUTO("auto")
    - Optional<Boolean> interruptResponse
      
      Whether or not to automatically interrupt any ongoing response with output to the default conversation (i.e. conversation of auto) when a VAD start event occurs.
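
As a rough illustration of how the fields above fit together, the following sketch assembles an input audio configuration as raw JSON using only the Java standard library. It does not use the SDK's own builders. The snake_case key names (noise_reduction, turn_detection, prefix_padding_ms, silence_duration_ms) are assumptions inferred from the create_response and interrupt_response names used in the descriptions above, and AudioInputConfigSketch is a hypothetical helper class.

```java
import java.util.Locale;

/** Sketch: assembles an audio input configuration as a JSON string. */
final class AudioInputConfigSketch {

    /**
     * Builds an input config with PCM format, near-field noise reduction,
     * and Server VAD. Key names are assumed snake_case wire names.
     */
    static String serverVadInput(double threshold, long prefixPaddingMs, long silenceDurationMs) {
        // The documented activation threshold range is 0.0 to 1.0.
        if (threshold < 0.0 || threshold > 1.0) {
            throw new IllegalArgumentException("threshold must be within 0.0 to 1.0");
        }
        return String.format(Locale.ROOT,
            "{\"format\":{\"type\":\"audio/pcm\",\"rate\":24000},"
            + "\"noise_reduction\":{\"type\":\"near_field\"},"
            + "\"turn_detection\":{\"type\":\"server_vad\","
            + "\"threshold\":%.2f,\"prefix_padding_ms\":%d,\"silence_duration_ms\":%d,"
            + "\"create_response\":true,\"interrupt_response\":true}}",
            threshold, prefixPaddingMs, silenceDurationMs);
    }

    public static void main(String[] args) {
        // Uses the documented defaults: threshold 0.5, 300 ms padding, 500 ms silence.
        System.out.println(serverVadInput(0.5, 300, 500));
    }
}
```

Lowering silenceDurationMs makes the model respond sooner but, as noted above, increases the chance that it jumps in on short pauses.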

Realtime Audio Config Output

class RealtimeAudioConfigOutput:
- Optional<RealtimeAudioFormats> format
  
  The format of the output audio.
  - AudioPcm
    - Optional<Rate> rate
      
      The sample rate of the audio. Always 24000.
      - _24000(24000)
    - Optional<Type> type
      
      The audio format. Always audio/pcm.
      - AUDIO_PCM("audio/pcm")
  - AudioPcmu
    - Optional<Type> type
      
      The audio format. Always audio/pcmu.
      - AUDIO_PCMU("audio/pcmu")
  - AudioPcma
    - Optional<Type> type
      
      The audio format. Always audio/pcma.
      - AUDIO_PCMA("audio/pcma")
- Optional<Double> speed
  
  The speed of the model's spoken response as a multiple of the original speed. 1.0 is the default speed. 0.25 is the minimum speed. 1.5 is the maximum speed. This value can only be changed in between model turns, not while a response is in progress.
  
  This parameter is a post-processing adjustment applied to the audio after it is generated. It is also possible to prompt the model to speak faster or slower.
- Optional<Voice> voice
  
  The voice the model uses to respond. Supported built-in voices are alloy, ash, ballad, coral, echo, sage, shimmer, verse, marin, and cedar. You may also provide a custom voice object with an id, for example { "id": "voice_1234" }. Voice cannot be changed during the session once the model has responded with audio at least once. We recommend marin and cedar for best quality.
  - String
  - enum UnionMember1:
    - ALLOY("alloy")
    - ASH("ash")
    - BALLAD("ballad")
    - CORAL("coral")
    - ECHO("echo")
    - SAGE("sage")
    - SHIMMER("shimmer")
    - VERSE("verse")
    - MARIN("marin")
    - CEDAR("cedar")
  - class Id:
    
    Custom voice reference.
    - String id
      
      The custom voice ID, e.g. voice_1234.
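
The speed bounds described above can be enforced on the client before a value is sent. This is a minimal sketch with a hypothetical helper class, OutputSpeedSketch. Clamping rather than rejecting out-of-range values is a design choice made here for illustration, not documented API behavior.

```java
/** Sketch: keeps a requested playback speed inside the documented range. */
final class OutputSpeedSketch {
    static final double MIN_SPEED = 0.25; // documented minimum
    static final double MAX_SPEED = 1.5;  // documented maximum

    /** Clamps a requested speed into the 0.25 to 1.5 range. */
    static double clampSpeed(double requested) {
        if (Double.isNaN(requested)) {
            throw new IllegalArgumentException("speed must be a number");
        }
        return Math.max(MIN_SPEED, Math.min(MAX_SPEED, requested));
    }

    public static void main(String[] args) {
        System.out.println(clampSpeed(2.0));  // above the maximum
        System.out.println(clampSpeed(1.0));  // the default speed, unchanged
    }
}
```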

Realtime Audio Formats

class RealtimeAudioFormats: A class that can be one of several variants.union

The PCM audio format. Only a 24kHz sample rate is supported.
- AudioPcm
  - Optional<Rate> rate
    
    The sample rate of the audio. Always 24000.
    - _24000(24000)
  - Optional<Type> type
    
    The audio format. Always audio/pcm.
    - AUDIO_PCM("audio/pcm")
- AudioPcmu
  - Optional<Type> type
    
    The audio format. Always audio/pcmu.
    - AUDIO_PCMU("audio/pcmu")
- AudioPcma
  - Optional<Type> type
    
    The audio format. Always audio/pcma.
    - AUDIO_PCMA("audio/pcma")

Realtime Audio Input Turn Detection

class RealtimeAudioInputTurnDetection: A class that can be one of several variants.union

Configuration for turn detection, either Server VAD or Semantic VAD. This can be set to null to turn off, in which case the client must manually trigger model response.

Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.

Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.

For gpt-realtime-whisper transcription sessions, turn detection must be set to null; VAD is not supported.
- ServerVad
  - JsonValue; type "server_vad"constant
    
    Type of turn detection, server_vad to turn on simple Server VAD.
    - SERVER_VAD("server_vad")
  - Optional<Boolean> createResponse
    
    Whether or not to automatically generate a response when a VAD stop event occurs. If interrupt_response is set to false this may fail to create a response if the model is already responding.
    
    If both create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted.
  - Optional<Long> idleTimeoutMs
    
    Optional timeout after which a model response will be triggered automatically. This is useful for situations in which a long pause from the user is unexpected, such as a phone call. The model will effectively prompt the user to continue the conversation based on the current context.
    
    The timeout value will be applied after the last model response's audio has finished playing, i.e. it's set to the response.done time plus audio playback duration.
    
    An input_audio_buffer.timeout_triggered event (plus events associated with the Response) will be emitted when the timeout is reached. Idle timeout is currently only supported for server_vad mode.
  - Optional<Boolean> interruptResponse
    
    Whether or not to automatically interrupt (cancel) any ongoing response with output to the default conversation (i.e. conversation of auto) when a VAD start event occurs. If true then the response will be cancelled, otherwise it will continue until complete.
    
    If both create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted.
  - Optional<Long> prefixPaddingMs
    
    Used only for server_vad mode. Amount of audio to include before the VAD detected speech (in milliseconds). Defaults to 300ms.
  - Optional<Long> silenceDurationMs
    
    Used only for server_vad mode. Duration of silence to detect speech stop (in milliseconds). Defaults to 500ms. With shorter values the model will respond more quickly, but may jump in on short pauses from the user.
  - Optional<Double> threshold
    
    Used only for server_vad mode. Activation threshold for VAD (0.0 to 1.0). Defaults to 0.5. A higher threshold will require louder audio to activate the model, and thus might perform better in noisy environments.
- SemanticVad
  - JsonValue; type "semantic_vad"constant
    
    Type of turn detection, semantic_vad to turn on Semantic VAD.
    - SEMANTIC_VAD("semantic_vad")
  - Optional<Boolean> createResponse
    
    Whether or not to automatically generate a response when a VAD stop event occurs.
  - Optional<Eagerness> eagerness
    
    Used only for semantic_vad mode. The eagerness of the model to respond. low will wait longer for the user to continue speaking, high will respond more quickly. auto is the default and is equivalent to medium. low, medium, and high have max timeouts of 8s, 4s, and 2s respectively.
    - LOW("low")
    - MEDIUM("medium")
    - HIGH("high")
    - AUTO("auto")
  - Optional<Boolean> interruptResponse
    
    Whether or not to automatically interrupt any ongoing response with output to the default conversation (i.e. conversation of auto) when a VAD start event occurs.
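
The relationship between eagerness and the maximum timeout can be written down as a small lookup. The sketch below uses a local, hypothetical Eagerness enum rather than the SDK's own type, and encodes only the values stated above: 8s for low, 4s for medium, 2s for high, with auto treated as medium.

```java
import java.time.Duration;

/** Sketch: maps Semantic VAD eagerness to its documented maximum timeout. */
final class EagernessSketch {
    /** Local stand-in for the SDK enum; hypothetical, for illustration only. */
    enum Eagerness { LOW, MEDIUM, HIGH, AUTO }

    static Duration maxTimeout(Eagerness eagerness) {
        switch (eagerness) {
            case LOW:
                return Duration.ofSeconds(8);
            case HIGH:
                return Duration.ofSeconds(2);
            case MEDIUM:
            case AUTO: // auto is documented as equivalent to medium
            default:
                return Duration.ofSeconds(4);
        }
    }

    public static void main(String[] args) {
        for (Eagerness e : Eagerness.values()) {
            System.out.println(e + " -> " + maxTimeout(e).getSeconds() + "s");
        }
    }
}
```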

Realtime Client Event

class RealtimeClientEvent: A class that can be one of several variants.union

A realtime client event.
- class ConversationItemCreateEvent:
  
  Add a new Item to the Conversation's context, including messages, function calls, and function call responses. This event can be used both to populate a "history" of the conversation and to add new items mid-stream, but has the current limitation that it cannot populate assistant audio messages.
  
  If successful, the server will respond with a conversation.item.created event, otherwise an error event will be sent.
  - ConversationItem item
    
    A single item within a Realtime conversation.
    - class RealtimeConversationItemSystemMessage:
      
      A system message in a Realtime conversation can be used to provide additional context or instructions to the model. This is similar but distinct from the instruction prompt provided at the start of a conversation, as system messages can be added at any point in the conversation. For major changes to the conversation's behavior, use instructions, but for smaller updates (e.g. "the user is now asking about a different topic"), use system messages.
      - List<Content> content
        
        The content of the message.
        
        Optional<String> text
        
        The text content.
        
        Optional<Type> type
        
        The content type. Always input_text for system messages.
        
        INPUT_TEXT("input_text")
      - JsonValue; role "system"constant
        
        The role of the message sender. Always system.
        
        SYSTEM("system")
      - JsonValue; type "message"constant
        
        The type of the item. Always message.
        
        MESSAGE("message")
      - Optional<String> id
        
        The unique ID of the item. This may be provided by the client or generated by the server.
      - Optional<Object> object_
        
        Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
        
        REALTIME_ITEM("realtime.item")
      - Optional<Status> status
        
        The status of the item. Has no effect on the conversation.
        
        COMPLETED("completed")
        
        INCOMPLETE("incomplete")
        
        IN_PROGRESS("in_progress")
    - class RealtimeConversationItemUserMessage:
      
      A user message item in a Realtime conversation.
      - List<Content> content
        
        The content of the message.
        
        Optional<String> audio
        
        Base64-encoded audio bytes (for input_audio). These will be parsed as the format specified in the session input audio type configuration. This defaults to PCM 16-bit 24kHz mono if not specified.
        
        Optional<Detail> detail
        
        The detail level of the image (for input_image). auto will default to high.
        
        AUTO("auto")
        
        LOW("low")
        
        HIGH("high")
        
        Optional<String> imageUrl
        
        Base64-encoded image bytes (for input_image) as a data URI. For example data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA.... Supported formats are PNG and JPEG.
        
        Optional<String> text
        
        The text content (for input_text).
        
        Optional<String> transcript
        
        Transcript of the audio (for input_audio). This is not sent to the model, but will be attached to the message item for reference.
        
        Optional<Type> type
        
        The content type (input_text, input_audio, or input_image).
        
        INPUT_TEXT("input_text")
        
        INPUT_AUDIO("input_audio")
        
        INPUT_IMAGE("input_image")
      - JsonValue; role "user"constant
        
        The role of the message sender. Always user.
        
        USER("user")
      - JsonValue; type "message"constant
        
        The type of the item. Always message.
        
        MESSAGE("message")
      - Optional<String> id
        
        The unique ID of the item. This may be provided by the client or generated by the server.
      - Optional<Object> object_
        
        Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
        
        REALTIME_ITEM("realtime.item")
      - Optional<Status> status
        
        The status of the item. Has no effect on the conversation.
        
        COMPLETED("completed")
        
        INCOMPLETE("incomplete")
        
        IN_PROGRESS("in_progress")
    - class RealtimeConversationItemAssistantMessage:
      
      An assistant message item in a Realtime conversation.
      - List<Content> content
        
        The content of the message.
        
        Optional<String> audio
        
        Base64-encoded audio bytes. These will be parsed as the format specified in the session output audio type configuration. This defaults to PCM 16-bit 24kHz mono if not specified.
        
        Optional<String> text
        
        The text content.
        
        Optional<String> transcript
        
        The transcript of the audio content. This will always be present if the output type is audio.
        
        Optional<Type> type
        
        The content type, output_text or output_audio depending on the session output_modalities configuration.
        
        OUTPUT_TEXT("output_text")
        
        OUTPUT_AUDIO("output_audio")
      - JsonValue; role "assistant"constant
        
        The role of the message sender. Always assistant.
        
        ASSISTANT("assistant")
      - JsonValue; type "message"constant
        
        The type of the item. Always message.
        
        MESSAGE("message")
      - Optional<String> id
        
        The unique ID of the item. This may be provided by the client or generated by the server.
      - Optional<Object> object_
        
        Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
        
        REALTIME_ITEM("realtime.item")
      - Optional<Status> status
        
        The status of the item. Has no effect on the conversation.
        
        COMPLETED("completed")
        
        INCOMPLETE("incomplete")
        
        IN_PROGRESS("in_progress")
    - class RealtimeConversationItemFunctionCall:
      
      A function call item in a Realtime conversation.
      - String arguments
        
        The arguments of the function call. This is a JSON-encoded string representing the arguments passed to the function, for example {"arg1": "value1", "arg2": 42}.
      - String name
        
        The name of the function being called.
      - JsonValue; type "function_call"constant
        
        The type of the item. Always function_call.
        
        FUNCTION_CALL("function_call")
      - Optional<String> id
        
        The unique ID of the item. This may be provided by the client or generated by the server.
      - Optional<String> callId
        
        The ID of the function call.
      - Optional<Object> object_
        
        Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
        
        REALTIME_ITEM("realtime.item")
      - Optional<Status> status
        
        The status of the item. Has no effect on the conversation.
        
        COMPLETED("completed")
        
        INCOMPLETE("incomplete")
        
        IN_PROGRESS("in_progress")
    - class RealtimeConversationItemFunctionCallOutput:
      
      A function call output item in a Realtime conversation.
      - String callId
        
        The ID of the function call this output is for.
      - String output
        
        The output of the function call. This is free text and can contain any information or simply be empty.
      - JsonValue; type "function_call_output"constant
        
        The type of the item. Always function_call_output.
        
        FUNCTION_CALL_OUTPUT("function_call_output")
      - Optional<String> id
        
        The unique ID of the item. This may be provided by the client or generated by the server.
      - Optional<Object> object_
        
        Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
        
        REALTIME_ITEM("realtime.item")
      - Optional<Status> status
        
        The status of the item. Has no effect on the conversation.
        
        COMPLETED("completed")
        
        INCOMPLETE("incomplete")
        
        IN_PROGRESS("in_progress")
    - class RealtimeMcpApprovalResponse:
      
      A Realtime item responding to an MCP approval request.
      - String id
        
        The unique ID of the approval response.
      - String approvalRequestId
        
        The ID of the approval request being answered.
      - boolean approve
        
        Whether the request was approved.
      - JsonValue; type "mcp_approval_response"constant
        
        The type of the item. Always mcp_approval_response.
        
        MCP_APPROVAL_RESPONSE("mcp_approval_response")
      - Optional<String> reason
        
        Optional reason for the decision.
    - class RealtimeMcpListTools:
      
      A Realtime item listing tools available on an MCP server.
      - String serverLabel
        
        The label of the MCP server.
      - List<Tool> tools
        
        The tools available on the server.
        
        JsonValue inputSchema
        
        The JSON schema describing the tool's input.
        
        String name
        
        The name of the tool.
        
        Optional<JsonValue> annotations
        
        Additional annotations about the tool.
        
        Optional<String> description
        
        The description of the tool.
      - JsonValue; type "mcp_list_tools"constant
        
        The type of the item. Always mcp_list_tools.
        
        MCP_LIST_TOOLS("mcp_list_tools")
      - Optional<String> id
        
        The unique ID of the list.
    - class RealtimeMcpToolCall:
      
      A Realtime item representing an invocation of a tool on an MCP server.
      - String id
        
        The unique ID of the tool call.
      - String arguments
        
        A JSON string of the arguments passed to the tool.
      - String name
        
        The name of the tool that was run.
      - String serverLabel
        
        The label of the MCP server running the tool.
      - JsonValue; type "mcp_call"constant
        
        The type of the item. Always mcp_call.
        
        MCP_CALL("mcp_call")
      - Optional<String> approvalRequestId
        
        The ID of an associated approval request, if any.
      - Optional<Error> error
        
        The error from the tool call, if any.
        
        class RealtimeMcpProtocolError:
        
        long code
        
        String message
        
        JsonValue; type "protocol_error"constant
        
        PROTOCOL_ERROR("protocol_error")
        
        class RealtimeMcpToolExecutionError:
        
        String message
        
        JsonValue; type "tool_execution_error"constant
        
        TOOL_EXECUTION_ERROR("tool_execution_error")
        
        class RealtimeMcphttpError:
        
        long code
        
        String message
        
        JsonValue; type "http_error"constant
        
        HTTP_ERROR("http_error")
      - Optional<String> output
        
        The output from the tool call.
    - class RealtimeMcpApprovalRequest:
      
      A Realtime item requesting human approval of a tool invocation.
      - String id
        
        The unique ID of the approval request.
      - String arguments
        
        A JSON string of arguments for the tool.
      - String name
        
        The name of the tool to run.
      - String serverLabel
        
        The label of the MCP server making the request.
      - JsonValue; type "mcp_approval_request"constant
        
        The type of the item. Always mcp_approval_request.
        
        MCP_APPROVAL_REQUEST("mcp_approval_request")
  - JsonValue; type "conversation.item.create"constant
    
    The event type, must be conversation.item.create.
    - CONVERSATION_ITEM_CREATE("conversation.item.create")
  - Optional<String> eventId
    
    Optional client-generated ID used to identify this event.
  - Optional<String> previousItemId
    
    The ID of the preceding item after which the new item will be inserted. If not set, the new item will be appended to the end of the conversation.
    
    If set to root, the new item will be added to the beginning of the conversation.
    
    If set to an existing ID, it allows an item to be inserted mid-conversation. If the ID cannot be found, an error will be returned and the item will not be added.
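
To make the shape of a conversation.item.create event concrete, the following sketch builds one for a user text message as raw JSON with the Java standard library. It is not the SDK's builder API. The previous_item_id and item key names are assumed snake_case wire names, and ItemCreateSketch with its methods is hypothetical.

```java
/** Sketch: builds a conversation.item.create event for a user text message. */
final class ItemCreateSketch {

    /** Escapes a string for embedding inside a JSON string literal. */
    static String jsonEscape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    /**
     * previousItemId may be null (append to the end), "root" (insert at the
     * beginning), or an existing item ID (insert after that item).
     */
    static String userTextItemCreate(String text, String previousItemId) {
        StringBuilder sb = new StringBuilder("{\"type\":\"conversation.item.create\",");
        if (previousItemId != null) {
            sb.append("\"previous_item_id\":\"").append(jsonEscape(previousItemId)).append("\",");
        }
        sb.append("\"item\":{\"type\":\"message\",\"role\":\"user\",")
          .append("\"content\":[{\"type\":\"input_text\",\"text\":\"")
          .append(jsonEscape(text)).append("\"}]}}");
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(userTextItemCreate("Hello there", null));
        System.out.println(userTextItemCreate("Earlier context", "root"));
    }
}
```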
- class ConversationItemDeleteEvent:
  
  Send this event when you want to remove any item from the conversation history. The server will respond with a conversation.item.deleted event, unless the item does not exist in the conversation history, in which case the server will respond with an error.
  - String itemId
    
    The ID of the item to delete.
  - JsonValue; type "conversation.item.delete"constant
    
    The event type, must be conversation.item.delete.
    - CONVERSATION_ITEM_DELETE("conversation.item.delete")
  - Optional<String> eventId
    
    Optional client-generated ID used to identify this event.
- class ConversationItemRetrieveEvent:
  
  Send this event when you want to retrieve the server's representation of a specific item in the conversation history. This is useful, for example, to inspect user audio after noise cancellation and VAD. The server will respond with a conversation.item.retrieved event, unless the item does not exist in the conversation history, in which case the server will respond with an error.
  - String itemId
    
    The ID of the item to retrieve.
  - JsonValue; type "conversation.item.retrieve"constant
    
    The event type, must be conversation.item.retrieve.
    - CONVERSATION_ITEM_RETRIEVE("conversation.item.retrieve")
  - Optional<String> eventId
    
    Optional client-generated ID used to identify this event.
- class ConversationItemTruncateEvent:
  
  Send this event to truncate a previous assistant message’s audio. The server will produce audio faster than realtime, so this event is useful when the user interrupts to truncate audio that has already been sent to the client but not yet played. This will synchronize the server's understanding of the audio with the client's playback.
  
  Truncating audio will delete the server-side text transcript to ensure there is no text in the context that hasn't been heard by the user.
  
  If successful, the server will respond with a conversation.item.truncated event.
  - long audioEndMs
    
    Inclusive duration up to which audio is truncated, in milliseconds. If the audio_end_ms is greater than the actual audio duration, the server will respond with an error.
  - long contentIndex
    
    The index of the content part to truncate. Set this to 0.
  - String itemId
    
    The ID of the assistant message item to truncate. Only assistant message items can be truncated.
  - JsonValue; type "conversation.item.truncate"constant
    
    The event type, must be conversation.item.truncate.
    - CONVERSATION_ITEM_TRUNCATE("conversation.item.truncate")
  - Optional<String> eventId
    
    Optional client-generated ID used to identify this event.
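
A client typically derives audio_end_ms from how much audio it has actually played. The sketch below assumes PCM 16-bit 24kHz mono output, so 48 bytes correspond to one millisecond, and builds the truncate event as raw JSON. The item_id, content_index, and audio_end_ms key names are assumed snake_case wire names, TruncateSketch is a hypothetical helper, and item IDs are assumed to contain no characters that need JSON escaping.

```java
/** Sketch: derives audio_end_ms from played bytes and builds a truncate event. */
final class TruncateSketch {
    static final int SAMPLE_RATE_HZ = 24_000;
    static final int BYTES_PER_SAMPLE = 2; // 16-bit mono PCM (assumed format)

    /** Converts played PCM bytes to whole milliseconds, rounding down. */
    static long playedMs(long playedBytes) {
        return playedBytes * 1000L / ((long) SAMPLE_RATE_HZ * BYTES_PER_SAMPLE);
    }

    static String truncateEvent(String itemId, long playedBytes) {
        return "{\"type\":\"conversation.item.truncate\",\"item_id\":\"" + itemId
            + "\",\"content_index\":0,\"audio_end_ms\":" + playedMs(playedBytes) + "}";
    }

    public static void main(String[] args) {
        // 72,000 bytes of 24 kHz 16-bit mono audio is 1.5 seconds of playback.
        System.out.println(truncateEvent("item_123", 72_000));
    }
}
```

Rounding down keeps audio_end_ms at or below the played duration, which avoids the error returned when the value exceeds the actual audio length.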
- class InputAudioBufferAppendEvent:
  
  Send this event to append audio bytes to the input audio buffer. The audio buffer is temporary storage you can write to and later commit. A "commit" will create a new user message item in the conversation history from the buffer content and clear the buffer. Input audio transcription (if enabled) will be generated when the buffer is committed.
  
  If VAD is enabled the audio buffer is used to detect speech and the server will decide when to commit. When Server VAD is disabled, you must commit the audio buffer manually. Input audio noise reduction operates on writes to the audio buffer.
  
  The client may choose how much audio to place in each event, up to a maximum of 15 MiB. For example, streaming smaller chunks from the client may allow the VAD to be more responsive. Unlike most other client events, the server will not send a confirmation response to this event.
  - String audio
    
    Base64-encoded audio bytes. This must be in the format specified by the input_audio_format field in the session configuration.
  - JsonValue; type "input_audio_buffer.append"constant
    
    The event type, must be input_audio_buffer.append.
    - INPUT_AUDIO_BUFFER_APPEND("input_audio_buffer.append")
  - Optional<String> eventId
    
    Optional client-generated ID used to identify this event.
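
Streaming audio in small chunks, as suggested above, can be sketched as follows. The example splits a raw PCM byte array into fixed-size pieces, Base64-encodes each with java.util.Base64, and wraps it in an input_audio_buffer.append event as raw JSON. AppendChunksSketch is a hypothetical helper, and the 4,800-byte chunk in main is simply 100 ms of 16-bit 24kHz mono audio, chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Base64;
import java.util.List;

/** Sketch: splits PCM audio into Base64-encoded append events. */
final class AppendChunksSketch {

    /** chunkBytes should be a multiple of the sample size (2 bytes for 16-bit PCM). */
    static List<String> appendEvents(byte[] pcm, int chunkBytes) {
        if (chunkBytes <= 0) {
            throw new IllegalArgumentException("chunkBytes must be positive");
        }
        List<String> events = new ArrayList<>();
        for (int offset = 0; offset < pcm.length; offset += chunkBytes) {
            int end = Math.min(pcm.length, offset + chunkBytes);
            String audio = Base64.getEncoder()
                .encodeToString(Arrays.copyOfRange(pcm, offset, end));
            events.add("{\"type\":\"input_audio_buffer.append\",\"audio\":\"" + audio + "\"}");
        }
        return events;
    }

    public static void main(String[] args) {
        byte[] oneSecondOfSilence = new byte[48_000]; // 24 kHz * 2 bytes * 1 s
        List<String> events = appendEvents(oneSecondOfSilence, 4_800);
        System.out.println(events.size() + " append events");
    }
}
```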
- class InputAudioBufferClearEvent:
  
  Send this event to clear the audio bytes in the buffer. The server will respond with an input_audio_buffer.cleared event.
  - JsonValue; type "input_audio_buffer.clear"constant
    
    The event type, must be input_audio_buffer.clear.
    - INPUT_AUDIO_BUFFER_CLEAR("input_audio_buffer.clear")
  - Optional<String> eventId
    
    Optional client-generated ID used to identify this event.
- class OutputAudioBufferClearEvent:
  
  WebRTC/SIP Only: Emit to cut off the current audio response. This will trigger the server to stop generating audio and emit an output_audio_buffer.cleared event. This event should be preceded by a response.cancel client event to stop the generation of the current response.
  - JsonValue; type "output_audio_buffer.clear"constant
    
    The event type, must be output_audio_buffer.clear.
    - OUTPUT_AUDIO_BUFFER_CLEAR("output_audio_buffer.clear")
  - Optional<String> eventId
    
    The unique ID of the client event used for error handling.
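
The ordering described above, response.cancel first and output_audio_buffer.clear second, can be captured in a small helper that returns the two events in send order. This is a sketch using raw JSON and a hypothetical InterruptSketch class. The response_id key is an assumed snake_case wire name, and the ID is assumed to need no JSON escaping.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch: the two client events for interrupting playback, in send order. */
final class InterruptSketch {

    /** responseId may be null to cancel the in-progress default response. */
    static List<String> interruptEvents(String responseId) {
        String cancel = responseId == null
            ? "{\"type\":\"response.cancel\"}"
            : "{\"type\":\"response.cancel\",\"response_id\":\"" + responseId + "\"}";
        // Cancel generation first, then clear audio that is already buffered.
        return Arrays.asList(cancel, "{\"type\":\"output_audio_buffer.clear\"}");
    }

    public static void main(String[] args) {
        for (String event : interruptEvents(null)) {
            System.out.println(event);
        }
    }
}
```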
- class InputAudioBufferCommitEvent:
  
  Send this event to commit the user input audio buffer, which will create a new user message item in the conversation. This event will produce an error if the input audio buffer is empty. When in Server VAD mode, the client does not need to send this event; the server will commit the audio buffer automatically.
  
  Committing the input audio buffer will trigger input audio transcription (if enabled in session configuration), but it will not create a response from the model. The server will respond with an input_audio_buffer.committed event.
  - JsonValue; type "input_audio_buffer.commit"constant
    
    The event type, must be input_audio_buffer.commit.
    - INPUT_AUDIO_BUFFER_COMMIT("input_audio_buffer.commit")
  - Optional<String> eventId
    
    Optional client-generated ID used to identify this event.
- class ResponseCancelEvent:
  
  Send this event to cancel an in-progress response. The server will respond with a response.done event with a status of response.status=cancelled. If there is no response to cancel, the server will respond with an error. It is still safe to send response.cancel when no response is in progress: an error will be returned, but the session will remain unaffected.
  - JsonValue; type "response.cancel"constant
    
    The event type, must be response.cancel.
    - RESPONSE_CANCEL("response.cancel")
  - Optional<String> eventId
    
    Optional client-generated ID used to identify this event.
  - Optional<String> responseId
    
    A specific response ID to cancel. If not provided, the in-progress response in the default conversation is cancelled.
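  
  The effect of the optional responseId can be sketched as follows, assuming raw JSON events with a snake_case response_id field. CancelEvents is a hypothetical helper, not an SDK class.
  
  ```java
  import java.util.Optional;

  // Hypothetical helper: builds a response.cancel event as a JSON string.
  // The response ID is assumed to contain no characters that need escaping.
  final class CancelEvents {
      static String cancel(Optional<String> responseId) {
          StringBuilder json = new StringBuilder("{\"type\":\"response.cancel\"");
          // Only include response_id when a specific response is targeted.
          responseId.ifPresent(id -> json.append(",\"response_id\":\"").append(id).append('"'));
          return json.append('}').toString();
      }

      public static void main(String[] args) {
          System.out.println(cancel(Optional.empty()));
          System.out.println(cancel(Optional.of("resp_123")));
      }
  }
  ```
  
  Omitting the ID targets whatever response is in progress in the default conversation. Passing one is useful when several out-of-band responses are running in parallel.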
- class ResponseCreateEvent:
  
  This event instructs the server to create a Response, which means triggering model inference. When in Server VAD mode, the server will create Responses automatically.
  
  A Response will include at least one Item, and may have two, in which case the second will be a function call. These Items will be appended to the conversation history by default.
  
  The server will respond with a response.created event, events for Items and content created, and finally a response.done event to indicate the Response is complete.
  
  The response.create event includes inference configuration like instructions and tools. If these are set, they will override the Session's configuration for this Response only.
  
  Responses can be created out-of-band of the default Conversation, meaning that they can have arbitrary input, and it's possible to disable writing the output to the Conversation. Only one Response can write to the default Conversation at a time, but otherwise multiple Responses can be created in parallel. The metadata field is a good way to disambiguate multiple simultaneous Responses.
  
  Clients can set conversation to none to create a Response that does not write to the default Conversation. Arbitrary input can be provided with the input field, which is an array accepting raw Items and references to existing Items.
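  
  A rough sketch of an out-of-band request, assuming raw JSON events. OutOfBandResponse is a hypothetical helper, and the item_reference shape used for references to existing Items is an assumption, since this section does not spell it out. The metadata check mirrors the limits given for the metadata field, read here as at most 16 pairs, with lengths counted via String.length().
  
  ```java
  import java.util.List;
  import java.util.Map;
  import java.util.stream.Collectors;

  // Hypothetical helper; keys, values, and IDs are assumed to need no JSON escaping.
  final class OutOfBandResponse {
      /** Rejects metadata that exceeds the documented limits. */
      static void checkMetadata(Map<String, String> metadata) {
          if (metadata.size() > 16) {
              throw new IllegalArgumentException("more than 16 metadata pairs");
          }
          metadata.forEach((key, value) -> {
              if (key.length() > 64) throw new IllegalArgumentException("key longer than 64 characters");
              if (value.length() > 512) throw new IllegalArgumentException("value longer than 512 characters");
          });
      }

      /** Builds a response.create event that does not write to the default Conversation. */
      static String build(List<String> itemIds, Map<String, String> metadata) {
          checkMetadata(metadata);
          // Assumed shape for a reference to an existing Item.
          String input = itemIds.stream()
                  .map(id -> "{\"type\":\"item_reference\",\"id\":\"" + id + "\"}")
                  .collect(Collectors.joining(",", "[", "]"));
          String meta = metadata.entrySet().stream()
                  .map(e -> "\"" + e.getKey() + "\":\"" + e.getValue() + "\"")
                  .collect(Collectors.joining(",", "{", "}"));
          return "{\"type\":\"response.create\",\"response\":{"
                  + "\"conversation\":\"none\","
                  + "\"input\":" + input + ","
                  + "\"metadata\":" + meta + "}}";
      }

      public static void main(String[] args) {
          System.out.println(build(List.of("item_1"), Map.of("topic", "classification")));
      }
  }
  ```
  
  The metadata value can then be matched against the response events to tell simultaneous Responses apart.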
  - JsonValue; type "response.create"constant
    
    The event type, must be response.create.
    - RESPONSE_CREATE("response.create")
  - Optional<String> eventId
    
    Optional client-generated ID used to identify this event.
  - Optional<RealtimeResponseCreateParams> response
    
    Create a new Realtime response with these parameters
    - Optional<RealtimeResponseCreateAudioOutput> audio
      
      Configuration for audio input and output.
      - Optional<Output> output
        
        Optional<RealtimeAudioFormats> format
        
        The format of the output audio.
        
        AudioPcm
        
        Optional<Rate> rate
        
        The sample rate of the audio. Always 24000.
        
        _24000(24000)
        
        Optional<Type> type
        
        The audio format. Always audio/pcm.
        
        AUDIO_PCM("audio/pcm")
        
        AudioPcmu
        
        Optional<Type> type
        
        The audio format. Always audio/pcmu.
        
        AUDIO_PCMU("audio/pcmu")
        
        AudioPcma
        
        Optional<Type> type
        
        The audio format. Always audio/pcma.
        
        AUDIO_PCMA("audio/pcma")
        
        Optional<Voice> voice
        
        The voice the model uses to respond. Supported built-in voices are alloy, ash, ballad, coral, echo, sage, shimmer, verse, marin, and cedar. You may also provide a custom voice object with an id, for example { "id": "voice_1234" }. Voice cannot be changed during the session once the model has responded with audio at least once. We recommend marin and cedar for best quality.
        
        String
        
        enum UnionMember1:
        
        ALLOY("alloy")
        
        ASH("ash")
        
        BALLAD("ballad")
        
        CORAL("coral")
        
        ECHO("echo")
        
        SAGE("sage")
        
        SHIMMER("shimmer")
        
        VERSE("verse")
        
        MARIN("marin")
        
        CEDAR("cedar")
        
        class Id:
        
        Custom voice reference.
        
        String id
        
        The custom voice ID, e.g. voice_1234.
    - Optional<Conversation> conversation
      
      Controls which conversation the response is added to. Currently supports auto and none, with auto as the default value. The auto value means that the contents of the response will be added to the default conversation. Set this to none to create an out-of-band response which will not add items to the default conversation.
      - AUTO("auto")
      - NONE("none")
    - Optional<List<ConversationItem>> input
      
      Input items to include in the prompt for the model. Using this field creates a new context for this Response instead of using the default conversation. An empty array [] will clear the context for this Response. Note that this can include references to items that previously appeared in the session using their id.
      - class RealtimeConversationItemSystemMessage:
        
        A system message in a Realtime conversation can be used to provide additional context or instructions to the model. This is similar but distinct from the instruction prompt provided at the start of a conversation, as system messages can be added at any point in the conversation. For major changes to the conversation's behavior, use instructions, but for smaller updates (e.g. "the user is now asking about a different topic"), use system messages.
      - class RealtimeConversationItemUserMessage:
        
        A user message item in a Realtime conversation.
      - class RealtimeConversationItemAssistantMessage:
        
        An assistant message item in a Realtime conversation.
      - class RealtimeConversationItemFunctionCall:
        
        A function call item in a Realtime conversation.
      - class RealtimeConversationItemFunctionCallOutput:
        
        A function call output item in a Realtime conversation.
      - class RealtimeMcpApprovalResponse:
        
        A Realtime item responding to an MCP approval request.
      - class RealtimeMcpListTools:
        
        A Realtime item listing tools available on an MCP server.
      - class RealtimeMcpToolCall:
        
        A Realtime item representing an invocation of a tool on an MCP server.
      - class RealtimeMcpApprovalRequest:
        
        A Realtime item requesting human approval of a tool invocation.
    - Optional<String> instructions
      
      The default system instructions (i.e. system message) prepended to model calls. This field allows the client to guide the model on desired responses. The model can be instructed on response content and format (e.g. "be extremely succinct", "act friendly", "here are examples of good responses") and on audio behavior (e.g. "talk quickly", "inject emotion into your voice", "laugh frequently"). The instructions are not guaranteed to be followed by the model, but they provide guidance to the model on the desired behavior. Note that the server sets default instructions which will be used if this field is not set and are visible in the session.created event at the start of the session.
    - Optional<MaxOutputTokens> maxOutputTokens
      
      Maximum number of output tokens for a single assistant response, inclusive of tool calls. Provide an integer between 1 and 4096 to limit output tokens, or inf for the maximum available tokens for a given model. Defaults to inf.
      - long
      - JsonValue;
        
        INF("inf")
    - Optional<Metadata> metadata
      
      Set of up to 16 key-value pairs that can be attached to an object. This can be useful for storing additional information about the object in a structured format, and querying for objects via API or the dashboard.
      
      Keys are strings with a maximum length of 64 characters. Values are strings with a maximum length of 512 characters.
    - Optional<List<OutputModality>> outputModalities
      
      The set of modalities the model uses to respond. Currently the only possible values are ["audio"] and ["text"]. Audio output always includes a text transcript. Setting the output modality to text will disable audio output from the model.
      - TEXT("text")
      - AUDIO("audio")
    - Optional<Boolean> parallelToolCalls
      
      Whether the model may call multiple tools in parallel. Only supported by reasoning Realtime models such as gpt-realtime-2.
    - Optional<ResponsePrompt> prompt
      
      Reference to a prompt template and its variables.
      - String id
        
        The unique identifier of the prompt template to use.
      - Optional<Variables> variables
        
        Optional map of values to substitute in for variables in your prompt. The substitution values can either be strings, or other Response input types like images or files.
        
        String
        
        class ResponseInputText:
        
        A text input to the model.
        
        String text
        
        The text input to the model.
        
        JsonValue; type "input_text"constant
        
        The type of the input item. Always input_text.
        
        INPUT_TEXT("input_text")
        
        class ResponseInputImage:
        
        An image input to the model. Learn about image inputs.
        
        Detail detail
        
        The detail level of the image to be sent to the model. One of high, low, auto, or original. Defaults to auto.
        
        LOW("low")
        
        HIGH("high")
        
        AUTO("auto")
        
        ORIGINAL("original")
        
        JsonValue; type "input_image"constant
        
        The type of the input item. Always input_image.
        
        INPUT_IMAGE("input_image")
        
        Optional<String> fileId
        
        The ID of the file to be sent to the model.
        
        Optional<String> imageUrl
        
        The URL of the image to be sent to the model. A fully qualified URL or base64 encoded image in a data URL.
        
        class ResponseInputFile:
        
        A file input to the model.
        
        JsonValue; type "input_file"constant
        
        The type of the input item. Always input_file.
        
        INPUT_FILE("input_file")
        
        Optional<Detail> detail
        
        The detail level of the file to be sent to the model. Use low for the default rendering behavior, or high to render the file at higher quality. Defaults to low.
        
        LOW("low")
        
        HIGH("high")
        
        Optional<String> fileData
        
        The content of the file to be sent to the model.
        
        Optional<String> fileId
        
        The ID of the file to be sent to the model.
        
        Optional<String> fileUrl
        
        The URL of the file to be sent to the model.
        
        Optional<String> filename
        
        The name of the file to be sent to the model.
      - Optional<String> version
        
        Optional version of the prompt template.
    - Optional<RealtimeReasoning> reasoning
      
      Configuration for reasoning-capable Realtime models such as gpt-realtime-2.
      - Optional<RealtimeReasoningEffort> effort
        
        Constrains effort on reasoning for reasoning-capable Realtime models such as gpt-realtime-2.
        
        MINIMAL("minimal")
        
        LOW("low")
        
        MEDIUM("medium")
        
        HIGH("high")
        
        XHIGH("xhigh")
    - Optional<ToolChoice> toolChoice
      
      How the model chooses tools. Provide one of the string modes or force a specific function/MCP tool.
      - enum ToolChoiceOptions:
        
        Controls which (if any) tool is called by the model.
        
        none means the model will not call any tool and instead generates a message.
        
        auto means the model can pick between generating a message or calling one or more tools.
        
        required means the model must call one or more tools.
        
        NONE("none")
        
        AUTO("auto")
        
        REQUIRED("required")
      - class ToolChoiceFunction:
        
        Use this option to force the model to call a specific function.
        
        String name
        
        The name of the function to call.
        
        JsonValue; type "function"constant
        
        For function calling, the type is always function.
        
        FUNCTION("function")
      - class ToolChoiceMcp:
        
        Use this option to force the model to call a specific tool on a remote MCP server.
        
        String serverLabel
        
        The label of the MCP server to use.
        
        JsonValue; type "mcp"constant
        
        For MCP tools, the type is always mcp.
        
        MCP("mcp")
        
        Optional<String> name
        
        The name of the tool to call on the server.
    - Optional<List<Tool>> tools
      
      Tools available to the model.
      - class RealtimeFunctionTool:
        
        Optional<String> description
        
        The description of the function, including guidance on when and how to call it, and guidance about what to tell the user when calling (if anything).
        
        Optional<String> name
        
        The name of the function.
        
        Optional<JsonValue> parameters
        
        Parameters of the function in JSON Schema.
        
        Optional<Type> type
        
        The type of the tool, i.e. function.
        
        FUNCTION("function")
      - class RealtimeResponseCreateMcpTool:
        
        Give the model access to additional tools via remote Model Context Protocol (MCP) servers. Learn more about MCP.
        
        String serverLabel
        
        A label for this MCP server, used to identify it in tool calls.
        
        JsonValue; type "mcp"constant
        
        The type of the MCP tool. Always mcp.
        
        MCP("mcp")
        
        Optional<AllowedTools> allowedTools
        
        List of allowed tool names or a filter object.
        
        List<String>
        
        class McpToolFilter:
        
        A filter object to specify which tools are allowed.
        
        Optional<Boolean> readOnly
        
        Indicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with readOnlyHint, it will match this filter.
        
        Optional<List<String>> toolNames
        
        List of allowed tool names.
        
        Optional<String> authorization
        
        An OAuth access token that can be used with a remote MCP server, either with a custom MCP server URL or a service connector. Your application must handle the OAuth authorization flow and provide the token here.
        
        Optional<ConnectorId> connectorId
        
        Identifier for service connectors, like those available in ChatGPT. One of server_url or connector_id must be provided. Learn more about service connectors here.
        
        Currently supported connector_id values are:
        
        Dropbox: connector_dropbox
        
        Gmail: connector_gmail
        
        Google Calendar: connector_googlecalendar
        
        Google Drive: connector_googledrive
        
        Microsoft Teams: connector_microsoftteams
        
        Outlook Calendar: connector_outlookcalendar
        
        Outlook Email: connector_outlookemail
        
        SharePoint: connector_sharepoint
        
        CONNECTOR_DROPBOX("connector_dropbox")
        
        CONNECTOR_GMAIL("connector_gmail")
        
        CONNECTOR_GOOGLECALENDAR("connector_googlecalendar")
        
        CONNECTOR_GOOGLEDRIVE("connector_googledrive")
        
        CONNECTOR_MICROSOFTTEAMS("connector_microsoftteams")
        
        CONNECTOR_OUTLOOKCALENDAR("connector_outlookcalendar")
        
        CONNECTOR_OUTLOOKEMAIL("connector_outlookemail")
        
        CONNECTOR_SHAREPOINT("connector_sharepoint")
        
        Optional<Boolean> deferLoading
        
        Whether this MCP tool is deferred and discovered via tool search.
        
        Optional<Headers> headers
        
        Optional HTTP headers to send to the MCP server. Use for authentication or other purposes.
        
        Optional<RequireApproval> requireApproval
        
        Specify which of the MCP server's tools require approval.
        
        class McpToolApprovalFilter:
        
        Specify which of the MCP server's tools require approval. Can be always, never, or a filter object associated with tools that require approval.
        
        Optional<Always> always
        
        A filter object to specify which tools are allowed.
        
        Optional<Boolean> readOnly
        
        Indicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with readOnlyHint, it will match this filter.
        
        Optional<List<String>> toolNames
        
        List of allowed tool names.
        
        Optional<Never> never
        
        A filter object to specify which tools are allowed.
        
        Optional<Boolean> readOnly
        
        Indicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with readOnlyHint, it will match this filter.
        
        Optional<List<String>> toolNames
        
        List of allowed tool names.
        
        enum McpToolApprovalSetting:
        
        Specify a single approval policy for all tools. One of always or never. When set to always, all tools will require approval. When set to never, no tools will require approval.
        
        ALWAYS("always")
        
        NEVER("never")
        
        Optional<String> serverDescription
        
        Optional description of the MCP server, used to provide more context.
        
        Optional<String> serverUrl
        
        The URL for the MCP server. One of server_url or connector_id must be provided.
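  
  For the MCP tool entry above, the rule that one of server_url or connector_id must be provided can be made structural by offering one factory per target. This is only a sketch: McpToolJson is a hypothetical helper, the JSON field names are assumed to be the snake_case forms used in the descriptions, and the connector list is copied from the values documented above.
  
  ```java
  import java.util.Set;

  // Hypothetical helper; labels, URLs, and IDs are assumed to need no JSON escaping.
  final class McpToolJson {
      static final Set<String> CONNECTORS = Set.of(
              "connector_dropbox", "connector_gmail", "connector_googlecalendar",
              "connector_googledrive", "connector_microsoftteams",
              "connector_outlookcalendar", "connector_outlookemail", "connector_sharepoint");

      /** MCP tool that targets a custom server URL. */
      static String forServerUrl(String serverLabel, String serverUrl) {
          return base(serverLabel) + ",\"server_url\":\"" + serverUrl + "\"}";
      }

      /** MCP tool that targets one of the documented service connectors. */
      static String forConnector(String serverLabel, String connectorId) {
          if (!CONNECTORS.contains(connectorId)) {
              throw new IllegalArgumentException("unsupported connector_id: " + connectorId);
          }
          return base(serverLabel) + ",\"connector_id\":\"" + connectorId + "\"}";
      }

      // Shared prefix: the constant type plus the required server label.
      private static String base(String serverLabel) {
          return "{\"type\":\"mcp\",\"server_label\":\"" + serverLabel + "\"";
      }

      public static void main(String[] args) {
          System.out.println(forServerUrl("docs", "https://example.com/mcp"));
          System.out.println(forConnector("mail", "connector_gmail"));
      }
  }
  ```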
- class SessionUpdateEvent:
  
  Send this event to update the session’s configuration. The client may send this event at any time to update any field except for voice and model. voice can be updated only if there have been no other audio outputs yet.
  
  When the server receives a session.update, it will respond with a session.updated event showing the full, effective configuration. Only the fields that are present in the session.update are updated. To clear a field like instructions, pass an empty string. To clear a field like tools, pass an empty array. To clear a field like turn_detection, pass null.
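  
  The three clearing conventions above can be sketched as raw JSON payloads. SessionUpdates is a hypothetical helper, and the snake_case field names and the audio.input nesting for turn_detection are assumptions based on the field layout in this reference.
  
  ```java
  // Hypothetical helper: one payload per clearing convention.
  final class SessionUpdates {
      private static String wrap(String sessionFields) {
          return "{\"type\":\"session.update\",\"session\":{\"type\":\"realtime\"," + sessionFields + "}}";
      }

      /** Clears instructions by passing an empty string. */
      static String clearInstructions() {
          return wrap("\"instructions\":\"\"");
      }

      /** Clears tools by passing an empty array. */
      static String clearTools() {
          return wrap("\"tools\":[]");
      }

      /** Clears turn detection by passing null. */
      static String clearTurnDetection() {
          return wrap("\"audio\":{\"input\":{\"turn_detection\":null}}");
      }

      public static void main(String[] args) {
          System.out.println(clearInstructions());
          System.out.println(clearTools());
          System.out.println(clearTurnDetection());
      }
  }
  ```
  
  Fields left out of the payload keep their current values, so each update carries only the field being cleared.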
  - Session session
    
    Update the Realtime session. Choose either a realtime session or a transcription session.
    - class RealtimeSessionCreateRequest:
      
      Realtime session object configuration.
      - JsonValue; type "realtime"constant
        
        The type of session to create. Always realtime for the Realtime API.
        
        REALTIME("realtime")
      - Optional<RealtimeAudioConfig> audio
        
        Configuration for input and output audio.
        
        Optional<RealtimeAudioConfigInput> input
        
        Optional<RealtimeAudioFormats> format
        
        The format of the input audio.
        
        Optional<NoiseReduction> noiseReduction
        
        Configuration for input audio noise reduction. This can be set to null to turn off. Noise reduction filters audio added to the input audio buffer before it is sent to VAD and the model. Filtering the audio can improve VAD and turn detection accuracy (reducing false positives) and model performance by improving perception of the input audio.
        
        Optional<NoiseReductionType> type
        
        Type of noise reduction. near_field is for close-talking microphones such as headphones, far_field is for far-field microphones such as laptop or conference room microphones.
        
        NEAR_FIELD("near_field")
        
        FAR_FIELD("far_field")
        
        Optional<AudioTranscription> transcription
        
        Configuration for input audio transcription. Defaults to off and can be set to null to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through the /audio/transcriptions endpoint and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service.
        
        Optional<Delay> delay
        
        Controls how long the model waits before emitting transcription text. Higher values can improve transcription accuracy at the cost of latency. Only supported with gpt-realtime-whisper in GA Realtime sessions.
        
        MINIMAL("minimal")
        
        LOW("low")
        
        MEDIUM("medium")
        
        HIGH("high")
        
        XHIGH("xhigh")
        
        Optional<String> language
        
        The language of the input audio. Supplying the input language in ISO-639-1 (e.g. en) format will improve accuracy and latency.
        
        Optional<Model> model
        
        The model to use for transcription. Current options are whisper-1, gpt-4o-mini-transcribe, gpt-4o-mini-transcribe-2025-12-15, gpt-4o-transcribe, gpt-4o-transcribe-diarize, and gpt-realtime-whisper. Use gpt-4o-transcribe-diarize when you need diarization with speaker labels.
        
        WHISPER_1("whisper-1")
        
        GPT_4O_MINI_TRANSCRIBE("gpt-4o-mini-transcribe")
        
        GPT_4O_MINI_TRANSCRIBE_2025_12_15("gpt-4o-mini-transcribe-2025-12-15")
        
        GPT_4O_TRANSCRIBE("gpt-4o-transcribe")
        
        GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize")
        
        GPT_REALTIME_WHISPER("gpt-realtime-whisper")
        
        Optional<String> prompt
        
        An optional text to guide the model's style or continue a previous audio segment. For whisper-1, the prompt is a list of keywords. For gpt-4o-transcribe models (excluding gpt-4o-transcribe-diarize), the prompt is a free text string, for example "expect words related to technology". Prompt is not supported with gpt-realtime-whisper in GA Realtime sessions.
        
        Optional<RealtimeAudioInputTurnDetection> turnDetection
        
        Configuration for turn detection, either Server VAD or Semantic VAD. This can be set to null to turn off, in which case the client must manually trigger a model response.
        
        Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.
        
        Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.
        
        For gpt-realtime-whisper transcription sessions, turn detection must be set to null; VAD is not supported.
        
        ServerVad
        
        JsonValue; type "server_vad"constant
        
        Type of turn detection, server_vad to turn on simple Server VAD.
        
        SERVER_VAD("server_vad")
        
        Optional<Boolean> createResponse
        
        Whether or not to automatically generate a response when a VAD stop event occurs. If interrupt_response is set to false, this may fail to create a response if the model is already responding.
        
        If both create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted.
        
        Optional<Long> idleTimeoutMs
        
        Optional timeout after which a model response will be triggered automatically. This is useful for situations in which a long pause from the user is unexpected, such as a phone call. The model will effectively prompt the user to continue the conversation based on the current context.
        
        The timeout value will be applied after the last model response's audio has finished playing, i.e. it's set to the response.done time plus audio playback duration.
        
        An input_audio_buffer.timeout_triggered event (plus events associated with the Response) will be emitted when the timeout is reached. Idle timeout is currently only supported for server_vad mode.
        
        Optional<Boolean> interruptResponse
        
        Whether or not to automatically interrupt (cancel) any ongoing response with output to the default conversation (i.e. conversation of auto) when a VAD start event occurs. If true then the response will be cancelled, otherwise it will continue until complete.
        
        If both create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted.
        
        Optional<Long> prefixPaddingMs
        
        Used only for server_vad mode. Amount of audio to include before the VAD detected speech (in milliseconds). Defaults to 300ms.
        
        Optional<Long> silenceDurationMs
        
        Used only for server_vad mode. Duration of silence to detect speech stop (in milliseconds). Defaults to 500ms. With shorter values the model will respond more quickly, but may jump in on short pauses from the user.
        
        Optional<Double> threshold
        
        Used only for server_vad mode. Activation threshold for VAD (0.0 to 1.0). Defaults to 0.5. A higher threshold will require louder audio to activate the model, and thus might perform better in noisy environments.
        
        SemanticVad
        
        JsonValue; type "semantic_vad"constant
        
        Type of turn detection, semantic_vad to turn on Semantic VAD.
        
        SEMANTIC_VAD("semantic_vad")
        
        Optional<Boolean> createResponse
        
        Whether or not to automatically generate a response when a VAD stop event occurs.
        
        Optional<Eagerness> eagerness
        
        Used only for semantic_vad mode. The eagerness of the model to respond. low will wait longer for the user to continue speaking, high will respond more quickly. auto is the default and is equivalent to medium. low, medium, and high have max timeouts of 8s, 4s, and 2s respectively.
        
        LOW("low")
        
        MEDIUM("medium")
        
        HIGH("high")
        
        AUTO("auto")
        
        Optional<Boolean> interruptResponse
        
        Whether or not to automatically interrupt any ongoing response with output to the default conversation (i.e. conversation of auto) when a VAD start event occurs.
        
        Optional<RealtimeAudioConfigOutput> output
        
        Optional<RealtimeAudioFormats> format
        
        The format of the output audio.
        
        Optional<Double> speed
        
        The speed of the model's spoken response as a multiple of the original speed. 1.0 is the default speed. 0.25 is the minimum speed. 1.5 is the maximum speed. This value can only be changed in between model turns, not while a response is in progress.
        
        This parameter is a post-processing adjustment to the audio after it is generated; it's also possible to prompt the model to speak faster or slower.
        
        Optional<Voice> voice
        
        The voice the model uses to respond. Supported built-in voices are alloy, ash, ballad, coral, echo, sage, shimmer, verse, marin, and cedar. You may also provide a custom voice object with an id, for example { "id": "voice_1234" }. Voice cannot be changed during the session once the model has responded with audio at least once. We recommend marin and cedar for best quality.
        
        String
        
        enum UnionMember1:
        
        ALLOY("alloy")
        
        ASH("ash")
        
        BALLAD("ballad")
        
        CORAL("coral")
        
        ECHO("echo")
        
        SAGE("sage")
        
        SHIMMER("shimmer")
        
        VERSE("verse")
        
        MARIN("marin")
        
        CEDAR("cedar")
        
        class Id:
        
        Custom voice reference.
        
        String id
        
        The custom voice ID, e.g. voice_1234.
      - Optional<List<Include>> include
        
        Additional fields to include in server outputs.
        
        item.input_audio_transcription.logprobs: Include logprobs for input audio transcription.
        
        ITEM_INPUT_AUDIO_TRANSCRIPTION_LOGPROBS("item.input_audio_transcription.logprobs")
      - Optional<String> instructions
        
        The default system instructions (i.e. system message) prepended to model calls. This field allows the client to guide the model on desired responses. The model can be instructed on response content and format (e.g. "be extremely succinct", "act friendly", "here are examples of good responses") and on audio behavior (e.g. "talk quickly", "inject emotion into your voice", "laugh frequently"). The instructions are not guaranteed to be followed by the model, but they provide guidance to the model on the desired behavior.
        
        Note that the server sets default instructions which will be used if this field is not set and are visible in the session.created event at the start of the session.
      - Optional<MaxOutputTokens> maxOutputTokens
        
        Maximum number of output tokens for a single assistant response, inclusive of tool calls. Provide an integer between 1 and 4096 to limit output tokens, or inf for the maximum available tokens for a given model. Defaults to inf.
        
        long
        
        JsonValue;
        
        INF("inf")
      - Optional<Model> model
        
        The Realtime model used for this session.
        
        GPT_REALTIME("gpt-realtime")
        
        GPT_REALTIME_1_5("gpt-realtime-1.5")
        
        GPT_REALTIME_2("gpt-realtime-2")
        
        GPT_REALTIME_2025_08_28("gpt-realtime-2025-08-28")
        
        GPT_4O_REALTIME_PREVIEW("gpt-4o-realtime-preview")
        
        GPT_4O_REALTIME_PREVIEW_2024_10_01("gpt-4o-realtime-preview-2024-10-01")
        
        GPT_4O_REALTIME_PREVIEW_2024_12_17("gpt-4o-realtime-preview-2024-12-17")
        
        GPT_4O_REALTIME_PREVIEW_2025_06_03("gpt-4o-realtime-preview-2025-06-03")
        
        GPT_4O_MINI_REALTIME_PREVIEW("gpt-4o-mini-realtime-preview")
        
        GPT_4O_MINI_REALTIME_PREVIEW_2024_12_17("gpt-4o-mini-realtime-preview-2024-12-17")
        
        GPT_REALTIME_MINI("gpt-realtime-mini")
        
        GPT_REALTIME_MINI_2025_10_06("gpt-realtime-mini-2025-10-06")
        
        GPT_REALTIME_MINI_2025_12_15("gpt-realtime-mini-2025-12-15")
        
        GPT_AUDIO_1_5("gpt-audio-1.5")
        
        GPT_AUDIO_MINI("gpt-audio-mini")
        
        GPT_AUDIO_MINI_2025_10_06("gpt-audio-mini-2025-10-06")
        
        GPT_AUDIO_MINI_2025_12_15("gpt-audio-mini-2025-12-15")
      - Optional<List<OutputModality>> outputModalities
        
        The set of modalities the model can respond with. It defaults to ["audio"], indicating that the model will respond with audio plus a transcript. ["text"] can be used to make the model respond with text only. It is not possible to request both text and audio at the same time.
        
        TEXT("text")
        
        AUDIO("audio")
      - Optional<Boolean> parallelToolCalls
        
        Whether the model may call multiple tools in parallel. Only supported by reasoning Realtime models such as gpt-realtime-2.
      - Optional<ResponsePrompt> prompt
        
        Reference to a prompt template and its variables.
      - Optional<RealtimeReasoning> reasoning
        
        Configuration for reasoning-capable Realtime models such as gpt-realtime-2.
      - Optional<RealtimeToolChoiceConfig> toolChoice
        
        How the model chooses tools. Provide one of the string modes or force a specific function/MCP tool.
        
        enum ToolChoiceOptions:
        
        Controls which (if any) tool is called by the model.
        
        none means the model will not call any tool and instead generates a message.
        
        auto means the model can pick between generating a message or calling one or more tools.
        
        required means the model must call one or more tools.
        
        class ToolChoiceFunction:
        
        Use this option to force the model to call a specific function.
        
        class ToolChoiceMcp:
        
        Use this option to force the model to call a specific tool on a remote MCP server.
      - Optional<List<RealtimeToolsConfigUnion>> tools
        
        Tools available to the model.
        
        class RealtimeFunctionTool:
        
        Mcp
        
        String serverLabel
        
        A label for this MCP server, used to identify it in tool calls.
        
        JsonValue; type "mcp"constant
        
        The type of the MCP tool. Always mcp.
        
        MCP("mcp")
        
        Optional<AllowedTools> allowedTools
        
        List of allowed tool names or a filter object.
        
        List<String>
        
        class McpToolFilter:
        
        A filter object to specify which tools are allowed.
        
        Optional<Boolean> readOnly
        
        Indicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with readOnlyHint, it will match this filter.
        
        Optional<List<String>> toolNames
        
        List of allowed tool names.
        
        Optional<String> authorization
        
        An OAuth access token that can be used with a remote MCP server, either with a custom MCP server URL or a service connector. Your application must handle the OAuth authorization flow and provide the token here.
        
        Optional<ConnectorId> connectorId
        
        Identifier for service connectors, like those available in ChatGPT. One of server_url or connector_id must be provided.
        
        Currently supported connector_id values are:
        
        Dropbox: connector_dropbox
        
        Gmail: connector_gmail
        
        Google Calendar: connector_googlecalendar
        
        Google Drive: connector_googledrive
        
        Microsoft Teams: connector_microsoftteams
        
        Outlook Calendar: connector_outlookcalendar
        
        Outlook Email: connector_outlookemail
        
        SharePoint: connector_sharepoint
        
        CONNECTOR_DROPBOX("connector_dropbox")
        
        CONNECTOR_GMAIL("connector_gmail")
        
        CONNECTOR_GOOGLECALENDAR("connector_googlecalendar")
        
        CONNECTOR_GOOGLEDRIVE("connector_googledrive")
        
        CONNECTOR_MICROSOFTTEAMS("connector_microsoftteams")
        
        CONNECTOR_OUTLOOKCALENDAR("connector_outlookcalendar")
        
        CONNECTOR_OUTLOOKEMAIL("connector_outlookemail")
        
        CONNECTOR_SHAREPOINT("connector_sharepoint")
        
        Optional<Boolean> deferLoading
        
        Whether this MCP tool is deferred and discovered via tool search.
        
        Optional<Headers> headers
        
        Optional HTTP headers to send to the MCP server. Use for authentication or other purposes.
        
        Optional<RequireApproval> requireApproval
        
        Specify which of the MCP server's tools require approval.
        
        class McpToolApprovalFilter:
        
        Specify which of the MCP server's tools require approval. Can be always, never, or a filter object associated with tools that require approval.
        
        Optional<Always> always
        
        A filter object to specify which tools are allowed.
        
        Optional<Boolean> readOnly
        
        Indicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with readOnlyHint, it will match this filter.
        
        Optional<List<String>> toolNames
        
        List of allowed tool names.
        
        Optional<Never> never
        
        A filter object to specify which tools are allowed.
        
        Optional<Boolean> readOnly
        
        Indicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with readOnlyHint, it will match this filter.
        
        Optional<List<String>> toolNames
        
        List of allowed tool names.
        
        enum McpToolApprovalSetting:
        
        Specify a single approval policy for all tools. One of always or never. When set to always, all tools will require approval. When set to never, no tools will require approval.
        
        ALWAYS("always")
        
        NEVER("never")
        
        Optional<String> serverDescription
        
        Optional description of the MCP server, used to provide more context.
        
        Optional<String> serverUrl
        
        The URL for the MCP server. One of server_url or connector_id must be provided.
      - Optional<RealtimeTracingConfig> tracing
        
        Realtime API can write session traces to the Traces Dashboard. Set to null to disable tracing. Once tracing is enabled for a session, the configuration cannot be modified.
        
        auto will create a trace for the session with default values for the workflow name, group id, and metadata.
        
        JsonValue;
        
        AUTO("auto")
        
        TracingConfiguration
        
        Optional<String> groupId
        
        The group id to attach to this trace to enable filtering and grouping in the Traces Dashboard.
        
        Optional<JsonValue> metadata
        
        The arbitrary metadata to attach to this trace to enable filtering in the Traces Dashboard.
        
        Optional<String> workflowName
        
        The name of the workflow to attach to this trace. This is used to name the trace in the Traces Dashboard.
      - Optional<RealtimeTruncation> truncation
        
        When the number of tokens in a conversation exceeds the model's input token limit, the conversation will be truncated, meaning messages (starting from the oldest) will not be included in the model's context. A 32k context model with 4,096 max output tokens can only include 28,224 tokens in the context before truncation occurs.
        
        Clients can configure truncation behavior to truncate with a lower max token limit, which is an effective way to control token usage and cost.
        
        Truncation will reduce the number of cached tokens on the next turn (busting the cache), since messages are dropped from the beginning of the context. However, clients can also configure truncation to retain messages up to a fraction of the maximum context size, which will reduce the need for future truncations and thus improve the cache rate.
        
        Truncation can be disabled entirely, which means the server will never truncate but would instead return an error if the conversation exceeds the model's input token limit.
        
        RealtimeTruncationStrategy
        
        AUTO("auto")
        
        DISABLED("disabled")
        
        class RealtimeTruncationRetentionRatio:
        
        Retain a fraction of the conversation tokens when the conversation exceeds the input token limit. This allows you to amortize truncations across multiple turns, which can help improve cached token usage.
        
        double retentionRatio
        
        Fraction of post-instruction conversation tokens to retain (0.0 - 1.0) when the conversation exceeds the input token limit. Setting this to 0.8 means that messages will be dropped until 80% of the maximum allowed tokens are used. This helps reduce the frequency of truncations and improve cache rates.
        
        JsonValue; type "retention_ratio"constant
        
        Use retention ratio truncation.
        
        RETENTION_RATIO("retention_ratio")
        
        Optional<TokenLimits> tokenLimits
        
        Optional custom token limits for this truncation strategy. If not provided, the model's default token limits will be used.
        
        Optional<Long> postInstructions
        
        Maximum tokens allowed in the conversation after instructions (which include tool definitions). For example, setting this to 5,000 would mean that truncation would occur when the conversation exceeds 5,000 tokens after instructions. This cannot be higher than the model's context window size minus the maximum output tokens.
    - class RealtimeTranscriptionSessionCreateRequest:
      
      Realtime transcription session object configuration.
      - JsonValue; type "transcription"constant
        
        The type of session to create. Always transcription for transcription sessions.
        
        TRANSCRIPTION("transcription")
      - Optional<RealtimeTranscriptionSessionAudio> audio
        
        Configuration for input and output audio.
        
        Optional<RealtimeTranscriptionSessionAudioInput> input
        
        Optional<RealtimeAudioFormats> format
        
        The PCM audio format. Only a 24kHz sample rate is supported.
        
        Optional<NoiseReduction> noiseReduction
        
        Configuration for input audio noise reduction. This can be set to null to turn off. Noise reduction filters audio added to the input audio buffer before it is sent to VAD and the model. Filtering the audio can improve VAD and turn detection accuracy (reducing false positives) and model performance by improving perception of the input audio.
        
        Optional<NoiseReductionType> type
        
        Type of noise reduction. near_field is for close-talking microphones such as headphones, far_field is for far-field microphones such as laptop or conference room microphones.
        
        Optional<AudioTranscription> transcription
        
        Configuration for input audio transcription. Defaults to off and can be set to null to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through the /audio/transcriptions endpoint and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service.
        
        Optional<RealtimeTranscriptionSessionAudioInputTurnDetection> turnDetection
        
        Configuration for turn detection, either Server VAD or Semantic VAD. This can be set to null to turn off, in which case the client must manually trigger model response.
        
        Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.
        
        Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.
        
        For gpt-realtime-whisper transcription sessions, turn detection must be set to null; VAD is not supported.
        
        ServerVad
        
        JsonValue; type "server_vad"constant
        
        Type of turn detection, server_vad to turn on simple Server VAD.
        
        SERVER_VAD("server_vad")
        
        Optional<Boolean> createResponse
        
        Whether or not to automatically generate a response when a VAD stop event occurs. If interrupt_response is set to false this may fail to create a response if the model is already responding.
        
        If both create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted.
        
        Optional<Long> idleTimeoutMs
        
        Optional timeout after which a model response will be triggered automatically. This is useful for situations in which a long pause from the user is unexpected, such as a phone call. The model will effectively prompt the user to continue the conversation based on the current context.
        
        The timeout value will be applied after the last model response's audio has finished playing, i.e. it's set to the response.done time plus audio playback duration.
        
        An input_audio_buffer.timeout_triggered event (plus events associated with the Response) will be emitted when the timeout is reached. Idle timeout is currently only supported for server_vad mode.
        
        Optional<Boolean> interruptResponse
        
        Whether or not to automatically interrupt (cancel) any ongoing response with output to the default conversation (i.e. conversation of auto) when a VAD start event occurs. If true then the response will be cancelled, otherwise it will continue until complete.
        
        If both create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted.
        
        Optional<Long> prefixPaddingMs
        
        Used only for server_vad mode. Amount of audio to include before the VAD detected speech (in milliseconds). Defaults to 300ms.
        
        Optional<Long> silenceDurationMs
        
        Used only for server_vad mode. Duration of silence to detect speech stop (in milliseconds). Defaults to 500ms. With shorter values the model will respond more quickly, but may jump in on short pauses from the user.
        
        Optional<Double> threshold
        
        Used only for server_vad mode. Activation threshold for VAD (0.0 to 1.0). Defaults to 0.5. A higher threshold will require louder audio to activate the model, and thus might perform better in noisy environments.
        
        SemanticVad
        
        JsonValue; type "semantic_vad"constant
        
        Type of turn detection, semantic_vad to turn on Semantic VAD.
        
        SEMANTIC_VAD("semantic_vad")
        
        Optional<Boolean> createResponse
        
        Whether or not to automatically generate a response when a VAD stop event occurs.
        
        Optional<Eagerness> eagerness
        
        Used only for semantic_vad mode. The eagerness of the model to respond. low will wait longer for the user to continue speaking, high will respond more quickly. auto is the default and is equivalent to medium. low, medium, and high have max timeouts of 8s, 4s, and 2s respectively.
        
        LOW("low")
        
        MEDIUM("medium")
        
        HIGH("high")
        
        AUTO("auto")
        
        Optional<Boolean> interruptResponse
        
        Whether or not to automatically interrupt any ongoing response with output to the default conversation (i.e. conversation of auto) when a VAD start event occurs.
      - Optional<List<Include>> include
        
        Additional fields to include in server outputs.
        
        item.input_audio_transcription.logprobs: Include logprobs for input audio transcription.
        
        ITEM_INPUT_AUDIO_TRANSCRIPTION_LOGPROBS("item.input_audio_transcription.logprobs")
  - JsonValue; type "session.update"constant
    
    The event type, must be session.update.
    - SESSION_UPDATE("session.update")
  - Optional<String> eventId
    
    Optional client-generated ID used to identify this event. This is an arbitrary string that a client may assign. It will be passed back if there is an error with the event, but the corresponding session.updated event will not include it.
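
The retention-ratio truncation described under truncation above can be modelled on the client to estimate how many messages survive a truncation. The following is a minimal sketch, not the server's implementation: the class TruncationSketch, the per-message token counts, and the rule of dropping whole messages oldest-first are assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

/** Hypothetical client-side model of retention-ratio truncation; not part of the SDK. */
final class TruncationSketch {
    /**
     * Returns the token counts of the messages that remain, oldest first.
     * Nothing is dropped unless the total exceeds postInstructionLimit.
     */
    static List<Integer> truncate(List<Integer> messageTokens, long postInstructionLimit, double retentionRatio) {
        Deque<Integer> kept = new ArrayDeque<>(messageTokens);
        long total = messageTokens.stream().mapToLong(Integer::longValue).sum();
        if (total <= postInstructionLimit) {
            return List.copyOf(kept); // under the limit: no truncation
        }
        // Assumed rule: drop oldest messages until usage is at or below ratio * limit.
        long target = (long) Math.floor(postInstructionLimit * retentionRatio);
        while (!kept.isEmpty() && total > target) {
            total -= kept.removeFirst();
        }
        return List.copyOf(kept);
    }
}
```

With a limit of 5,000 tokens and a ratio of 0.8, messages of 2,000, 2,000, and 1,500 tokens exceed the limit, so the oldest message is dropped and 3,500 tokens remain, below the 4,000-token target.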

Realtime Conversation Item Assistant Message

class RealtimeConversationItemAssistantMessage:

An assistant message item in a Realtime conversation.
- List<Content> content
  
  The content of the message.
  - Optional<String> audio
    
    Base64-encoded audio bytes. These are parsed as the format specified in the session output audio type configuration, which defaults to PCM 16-bit 24kHz mono if not specified.
  - Optional<String> text
    
    The text content.
  - Optional<String> transcript
    
    The transcript of the audio content. This is always present if the output type is audio.
  - Optional<Type> type
    
    The content type, output_text or output_audio depending on the session output_modalities configuration.
    - OUTPUT_TEXT("output_text")
    - OUTPUT_AUDIO("output_audio")
- JsonValue; role "assistant"constant
  
  The role of the message sender. Always assistant.
  - ASSISTANT("assistant")
- JsonValue; type "message"constant
  
  The type of the item. Always message.
  - MESSAGE("message")
- Optional<String> id
  
  The unique ID of the item. This may be provided by the client or generated by the server.
- Optional<Object> object_
  
  Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
  - REALTIME_ITEM("realtime.item")
- Optional<Status> status
  
  The status of the item. Has no effect on the conversation.
  - COMPLETED("completed")
  - INCOMPLETE("incomplete")
  - IN_PROGRESS("in_progress")
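
As an illustration of the audio field above, the sketch below decodes a Base64 payload and estimates its playback duration. It assumes the default format named in the field description (PCM 16-bit 24kHz mono); PcmAudioSketch and its methods are hypothetical helpers, not SDK classes.

```java
import java.util.Base64;

/** Hypothetical helper for Base64 PCM audio; assumes 16-bit, 24 kHz, mono. */
final class PcmAudioSketch {
    static final int SAMPLE_RATE_HZ = 24_000;
    static final int BYTES_PER_SAMPLE = 2; // 16-bit samples, one channel

    /** Decodes the Base64 string carried in the audio field into raw PCM bytes. */
    static byte[] decode(String base64Audio) {
        return Base64.getDecoder().decode(base64Audio);
    }

    /** Playback duration in milliseconds for raw PCM bytes in the assumed format. */
    static double durationMillis(byte[] pcm) {
        return (pcm.length / (double) BYTES_PER_SAMPLE) * 1000.0 / SAMPLE_RATE_HZ;
    }
}
```

Under these assumptions, 48,000 decoded bytes correspond to 24,000 samples, or one second of audio.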

Realtime Conversation Item Function Call

class RealtimeConversationItemFunctionCall:

A function call item in a Realtime conversation.
- String arguments
  
  The arguments of the function call. This is a JSON-encoded string representing the arguments passed to the function, for example {"arg1": "value1", "arg2": 42}.
- String name
  
  The name of the function being called.
- JsonValue; type "function_call"constant
  
  The type of the item. Always function_call.
  - FUNCTION_CALL("function_call")
- Optional<String> id
  
  The unique ID of the item. This may be provided by the client or generated by the server.
- Optional<String> callId
  
  The ID of the function call.
- Optional<Object> object_
  
  Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
  - REALTIME_ITEM("realtime.item")
- Optional<Status> status
  
  The status of the item. Has no effect on the conversation.
  - COMPLETED("completed")
  - INCOMPLETE("incomplete")
  - IN_PROGRESS("in_progress")

Realtime Conversation Item Function Call Output

class RealtimeConversationItemFunctionCallOutput:

A function call output item in a Realtime conversation.
- String callId
  
  The ID of the function call this output is for.
- String output
  
  The output of the function call. This is free text and can contain any information or simply be empty.
- JsonValue; type "function_call_output"constant
  
  The type of the item. Always function_call_output.
  - FUNCTION_CALL_OUTPUT("function_call_output")
- Optional<String> id
  
  The unique ID of the item. This may be provided by the client or generated by the server.
- Optional<Object> object_
  
  Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
  - REALTIME_ITEM("realtime.item")
- Optional<Status> status
  
  The status of the item. Has no effect on the conversation.
  - COMPLETED("completed")
  - INCOMPLETE("incomplete")
  - IN_PROGRESS("in_progress")
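
A function call item and its output item are linked by callId. The sketch below shows one way a client might route a call to a local handler and build the matching output. The records and the handler map are hypothetical stand-ins for illustration, not the SDK's classes, and the arguments string is passed through unparsed.

```java
import java.util.Map;
import java.util.function.Function;

/** Hypothetical model of pairing a function call with its output via callId. */
final class FunctionCallSketch {
    record FunctionCall(String callId, String name, String arguments) {}
    record FunctionCallOutput(String callId, String output) {}

    /**
     * Looks up a handler by function name and applies it to the JSON-encoded
     * arguments string. The output reuses the call's callId so the two items match.
     */
    static FunctionCallOutput run(FunctionCall call, Map<String, Function<String, String>> handlers) {
        Function<String, String> handler = handlers.get(call.name());
        String output = handler == null
                ? "unknown function: " + call.name() // output is free text
                : handler.apply(call.arguments());
        return new FunctionCallOutput(call.callId(), output);
    }
}
```

Reporting an unknown function as plain text in the output, rather than throwing, is a design choice of this sketch; it keeps every call answered by an output with the same callId.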

Realtime Conversation Item System Message

class RealtimeConversationItemSystemMessage:

A system message in a Realtime conversation can be used to provide additional context or instructions to the model. This is similar to but distinct from the instruction prompt provided at the start of a conversation, as system messages can be added at any point in the conversation. For major changes to the conversation's behavior, use instructions, but for smaller updates (e.g. "the user is now asking about a different topic"), use system messages.
- List<Content> content
  
  The content of the message.
  - Optional<String> text
    
    The text content.
  - Optional<Type> type
    
    The content type. Always input_text for system messages.
    - INPUT_TEXT("input_text")
- JsonValue; role "system"constant
  
  The role of the message sender. Always system.
  - SYSTEM("system")
- JsonValue; type "message"constant
  
  The type of the item. Always message.
  - MESSAGE("message")
- Optional<String> id
  
  The unique ID of the item. This may be provided by the client or generated by the server.
- Optional<Object> object_
  
  Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
  - REALTIME_ITEM("realtime.item")
- Optional<Status> status
  
  The status of the item. Has no effect on the conversation.
  - COMPLETED("completed")
  - INCOMPLETE("incomplete")
  - IN_PROGRESS("in_progress")

Realtime Conversation Item User Message

class RealtimeConversationItemUserMessage:

A user message item in a Realtime conversation.
- List<Content> content
  
  The content of the message.
  - Optional<String> audio
    
    Base64-encoded audio bytes (for input_audio). These are parsed as the format specified in the session input audio type configuration, which defaults to PCM 16-bit 24kHz mono if not specified.
  - Optional<Detail> detail
    
    The detail level of the image (for input_image). auto will default to high.
    - AUTO("auto")
    - LOW("low")
    - HIGH("high")
  - Optional<String> imageUrl
    
    Base64-encoded image bytes (for input_image) as a data URI. For example data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA.... Supported formats are PNG and JPEG.
  - Optional<String> text
    
    The text content (for input_text).
  - Optional<String> transcript
    
    Transcript of the audio (for input_audio). This is not sent to the model, but will be attached to the message item for reference.
  - Optional<Type> type
    
    The content type (input_text, input_audio, or input_image).
    - INPUT_TEXT("input_text")
    - INPUT_AUDIO("input_audio")
    - INPUT_IMAGE("input_image")
- JsonValue; role "user"constant
  
  The role of the message sender. Always user.
  - USER("user")
- JsonValue; type "message"constant
  
  The type of the item. Always message.
  - MESSAGE("message")
- Optional<String> id
  
  The unique ID of the item. This may be provided by the client or generated by the server.
- Optional<Object> object_
  
  Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
  - REALTIME_ITEM("realtime.item")
- Optional<Status> status
  
  The status of the item. Has no effect on the conversation.
  - COMPLETED("completed")
  - INCOMPLETE("incomplete")
  - IN_PROGRESS("in_progress")
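
The imageUrl field above expects image bytes wrapped in a data URI. A minimal sketch of building one follows; ImageDataUriSketch is a hypothetical helper, and the MIME type strings image/png and image/jpeg are the standard ones for the two formats the field description lists.

```java
import java.util.Base64;

/** Hypothetical helper that wraps raw image bytes in a Base64 data URI. */
final class ImageDataUriSketch {
    static String toDataUri(byte[] imageBytes, String mimeType) {
        // The field description lists PNG and JPEG as the supported formats.
        if (!mimeType.equals("image/png") && !mimeType.equals("image/jpeg")) {
            throw new IllegalArgumentException("Supported formats are PNG and JPEG, got: " + mimeType);
        }
        return "data:" + mimeType + ";base64," + Base64.getEncoder().encodeToString(imageBytes);
    }
}
```

Encoding the 8-byte PNG signature this way yields data:image/png;base64,iVBORw0KGgo=, which matches the prefix of the example in the field description.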

Realtime Error

class RealtimeError:

Details of the error.
- String message
  
  A human-readable error message.
- String type
  
  The type of error (e.g., "invalid_request_error", "server_error").
- Optional<String> code
  
  Error code, if any.
- Optional<String> eventId
  
  The event_id of the client event that caused the error, if applicable.
- Optional<String> param
  
  Parameter related to the error, if any.

Realtime Error Event

class RealtimeErrorEvent:

Returned when an error occurs, which could be a client problem or a server problem. Most errors are recoverable and the session will stay open. We recommend that implementors monitor and log error messages by default.
- RealtimeError error
  
  Details of the error.
  - String message
    
    A human-readable error message.
  - String type
    
    The type of error (e.g., "invalid_request_error", "server_error").
  - Optional<String> code
    
    Error code, if any.
  - Optional<String> eventId
    
    The event_id of the client event that caused the error, if applicable.
  - Optional<String> param
    
    Parameter related to the error, if any.
- String eventId
  
  The unique ID of the server event.
- JsonValue; type "error"constant
  
  The event type, must be error.
  - ERROR("error")
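
Since the description above recommends logging errors by default, here is a minimal sketch of turning an error event into a single log line. ErrorDetails mirrors the RealtimeError fields listed above but is a hypothetical record, not the SDK class, and the line format is an arbitrary choice.

```java
import java.util.Optional;

/** Hypothetical formatter for error events; field names mirror the listing above. */
final class ErrorLogSketch {
    record ErrorDetails(String message, String type, Optional<String> code,
                        Optional<String> eventId, Optional<String> param) {}

    /** Builds one log line; optional fields are appended only when present. */
    static String describe(String serverEventId, ErrorDetails error) {
        StringBuilder sb = new StringBuilder();
        sb.append('[').append(serverEventId).append("] ")
          .append(error.type()).append(": ").append(error.message());
        error.code().ifPresent(c -> sb.append(" code=").append(c));
        error.param().ifPresent(p -> sb.append(" param=").append(p));
        // eventId refers to the client event that caused the error, if any.
        error.eventId().ifPresent(id -> sb.append(" caused_by=").append(id));
        return sb.toString();
    }
}
```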

Realtime Function Tool

class RealtimeFunctionTool:
- Optional<String> description
  
  The description of the function, including guidance on when and how to call it, and guidance about what to tell the user when calling (if anything).
- Optional<String> name
  
  The name of the function.
- Optional<JsonValue> parameters
  
  Parameters of the function in JSON Schema.
- Optional<Type> type
  
  The type of the tool, i.e. function.
  - FUNCTION("function")

Realtime Mcp Approval Request

class RealtimeMcpApprovalRequest:

A Realtime item requesting human approval of a tool invocation.
- String id
  
  The unique ID of the approval request.
- String arguments
  
  A JSON string of arguments for the tool.
- String name
  
  The name of the tool to run.
- String serverLabel
  
  The label of the MCP server making the request.
- JsonValue; type "mcp_approval_request"constant
  
  The type of the item. Always mcp_approval_request.
  - MCP_APPROVAL_REQUEST("mcp_approval_request")

Realtime Mcp Approval Response

class RealtimeMcpApprovalResponse:

A Realtime item responding to an MCP approval request.
- String id
  
  The unique ID of the approval response.
- String approvalRequestId
  
  The ID of the approval request being answered.
- boolean approve
  
  Whether the request was approved.
- JsonValue; type "mcp_approval_response"constant
  
  The type of the item. Always mcp_approval_response.
  - MCP_APPROVAL_RESPONSE("mcp_approval_response")
- Optional<String> reason
  
  Optional reason for the decision.
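
An approval response refers back to its request through approvalRequestId. The sketch below answers a request from a client-side list of trusted tool names. The records, the trusted-tools set, and the caller-supplied response ID are assumptions for illustration only; a real application would more likely ask a human to decide.

```java
import java.util.Optional;
import java.util.Set;

/** Hypothetical model of answering an MCP approval request; not SDK classes. */
final class McpApprovalSketch {
    record ApprovalRequest(String id, String name, String serverLabel, String arguments) {}
    record ApprovalResponse(String id, String approvalRequestId, boolean approve, Optional<String> reason) {}

    /** Approves the request only if the tool name is on the trusted list. */
    static ApprovalResponse decide(ApprovalRequest request, Set<String> trustedTools, String responseId) {
        boolean approve = trustedTools.contains(request.name());
        Optional<String> reason = approve
                ? Optional.empty()
                : Optional.of("tool '" + request.name() + "' is not on the trusted list");
        // approvalRequestId links the response back to the request being answered.
        return new ApprovalResponse(responseId, request.id(), approve, reason);
    }
}
```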

Realtime Mcp List Tools

class RealtimeMcpListTools:

A Realtime item listing tools available on an MCP server.
- String serverLabel
  
  The label of the MCP server.
- List<Tool> tools
  
  The tools available on the server.
  - JsonValue inputSchema
    
    The JSON schema describing the tool's input.
  - String name
    
    The name of the tool.
  - Optional<JsonValue> annotations
    
    Additional annotations about the tool.
  - Optional<String> description
    
    The description of the tool.
- JsonValue; type "mcp_list_tools"constant
  
  The type of the item. Always mcp_list_tools.
  - MCP_LIST_TOOLS("mcp_list_tools")
- Optional<String> id
  
  The unique ID of the list.

Realtime Mcp Protocol Error

class RealtimeMcpProtocolError:
- long code
- String message
- JsonValue; type "protocol_error"constant
  - PROTOCOL_ERROR("protocol_error")

Realtime Mcp Tool Call

class RealtimeMcpToolCall:

A Realtime item representing an invocation of a tool on an MCP server.
- String id
  
  The unique ID of the tool call.
- String arguments
  
  A JSON string of the arguments passed to the tool.
- String name
  
  The name of the tool that was run.
- String serverLabel
  
  The label of the MCP server running the tool.
- JsonValue; type "mcp_call"constant
  
  The type of the item. Always mcp_call.
  - MCP_CALL("mcp_call")
- Optional<String> approvalRequestId
  
  The ID of an associated approval request, if any.
- Optional<Error> error
  
  The error from the tool call, if any.
  - class RealtimeMcpProtocolError:
    - long code
    - String message
    - JsonValue; type "protocol_error"constant
      - PROTOCOL_ERROR("protocol_error")
  - class RealtimeMcpToolExecutionError:
    - String message
    - JsonValue; type "tool_execution_error"constant
      - TOOL_EXECUTION_ERROR("tool_execution_error")
  - class RealtimeMcphttpError:
    - long code
    - String message
    - JsonValue; type "http_error"constant
      - HTTP_ERROR("http_error")
- Optional<String> output
  
  The output from the tool call.

Realtime Mcp Tool Execution Error

class RealtimeMcpToolExecutionError:
- String message
- JsonValue; type "tool_execution_error"constant
  - TOOL_EXECUTION_ERROR("tool_execution_error")

Realtime Mcphttp Error

class RealtimeMcphttpError:
- long code
- String message
- JsonValue; type "http_error"constant
  - HTTP_ERROR("http_error")
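
The three MCP error classes above share a message and differ in their type constant and in whether they carry a code. The sketch below models them as variants of one interface and renders each as a one-line summary. The nested records are hypothetical stand-ins for the SDK classes, and the code requires Java 16 or later for records and instanceof patterns.

```java
/** Hypothetical model of the three MCP tool call error variants. */
final class McpErrorSketch {
    interface McpError { String message(); }
    record ProtocolError(long code, String message) implements McpError {}
    record ToolExecutionError(String message) implements McpError {}
    record HttpError(long code, String message) implements McpError {}

    /** Prefixes each summary with the variant's type constant from the listing above. */
    static String summarize(McpError error) {
        if (error instanceof ProtocolError p) {
            return "protocol_error " + p.code() + ": " + p.message();
        }
        if (error instanceof ToolExecutionError t) {
            return "tool_execution_error: " + t.message(); // this variant has no code
        }
        if (error instanceof HttpError h) {
            return "http_error " + h.code() + ": " + h.message();
        }
        throw new IllegalArgumentException("unknown error variant: " + error);
    }
}
```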

Realtime Reasoning

class RealtimeReasoning:

Configuration for reasoning-capable Realtime models such as gpt-realtime-2.
- Optional<RealtimeReasoningEffort> effort
  
  Constrains effort on reasoning for reasoning-capable Realtime models such as gpt-realtime-2.
  - MINIMAL("minimal")
  - LOW("low")
  - MEDIUM("medium")
  - HIGH("high")
  - XHIGH("xhigh")

Realtime Reasoning Effort

enum RealtimeReasoningEffort:

Constrains effort on reasoning for reasoning-capable Realtime models such as gpt-realtime-2.
- MINIMAL("minimal")
- LOW("low")
- MEDIUM("medium")
- HIGH("high")
- XHIGH("xhigh")

Realtime Response

class RealtimeResponse:

The response resource.
- Optional<String> id
  
  The unique ID of the response, which will look like resp_1234.
- Optional<Audio> audio
  
  Configuration for audio output.
  - Optional<Output> output
    - Optional<RealtimeAudioFormats> format
      
      The format of the output audio.
      - AudioPcm
        
        Optional<Rate> rate
        
        The sample rate of the audio. Always 24000.
        
        _24000(24000)
        
        Optional<Type> type
        
        The audio format. Always audio/pcm.
        
        AUDIO_PCM("audio/pcm")
      - AudioPcmu
        
        Optional<Type> type
        
        The audio format. Always audio/pcmu.
        
        AUDIO_PCMU("audio/pcmu")
      - AudioPcma
        
        Optional<Type> type
        
        The audio format. Always audio/pcma.
        
        AUDIO_PCMA("audio/pcma")
    - Optional<Voice> voice
      
      The voice the model uses to respond. Voice cannot be changed during the session once the model has responded with audio at least once. Current voice options are alloy, ash, ballad, coral, echo, sage, shimmer, verse, marin, and cedar. We recommend marin and cedar for best quality.
      - ALLOY("alloy")
      - ASH("ash")
      - BALLAD("ballad")
      - CORAL("coral")
      - ECHO("echo")
      - SAGE("sage")
      - SHIMMER("shimmer")
      - VERSE("verse")
      - MARIN("marin")
      - CEDAR("cedar")
- Optional<String> conversationId
  
  Which conversation the response is added to, determined by the conversation field in the response.create event. If auto, the response will be added to the default conversation and the value of conversation_id will be an id like conv_1234. If none, the response will not be added to any conversation and the value of conversation_id will be null. If responses are being triggered automatically by VAD, the response will be added to the default conversation.
- Optional<MaxOutputTokens> maxOutputTokens
  
  Maximum number of output tokens for a single assistant response, inclusive of tool calls, that was used in this response.
  - long
  - JsonValue;
    - INF("inf")
- Optional<Metadata> metadata
  
  Set of up to 16 key-value pairs that can be attached to an object. This can be useful for storing additional information about the object in a structured format, and querying for objects via API or the dashboard.
  
  Keys are strings with a maximum length of 64 characters. Values are strings with a maximum length of 512 characters.
- Optional<Object> object_
  
  The object type, must be realtime.response.
  - REALTIME_RESPONSE("realtime.response")
- Optional<List<ConversationItem>> output
  
  The list of output items generated by the response.
  - class RealtimeConversationItemSystemMessage:
    
    A system message in a Realtime conversation can be used to provide additional context or instructions to the model. This is similar to but distinct from the instruction prompt provided at the start of a conversation, as system messages can be added at any point in the conversation. For major changes to the conversation's behavior, use instructions, but for smaller updates (e.g. "the user is now asking about a different topic"), use system messages.
    - List<Content> content
      
      The content of the message.
      - Optional<String> text
        
        The text content.
      - Optional<Type> type
        
        The content type. Always input_text for system messages.
        
        INPUT_TEXT("input_text")
    - JsonValue; role "system"constant
      
      The role of the message sender. Always system.
      - SYSTEM("system")
    - JsonValue; type "message"constant
      
      The type of the item. Always message.
      - MESSAGE("message")
    - Optional<String> id
      
      The unique ID of the item. This may be provided by the client or generated by the server.
    - Optional<Object> object_
      
      Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
      - REALTIME_ITEM("realtime.item")
    - Optional<Status> status
      
      The status of the item. Has no effect on the conversation.
      - COMPLETED("completed")
      - INCOMPLETE("incomplete")
      - IN_PROGRESS("in_progress")
  - class RealtimeConversationItemUserMessage:
    
    A user message item in a Realtime conversation.
    - List<Content> content
      
      The content of the message.
      - Optional<String> audio
        
        Base64-encoded audio bytes (for input_audio). These are parsed in the format specified in the session input audio type configuration, which defaults to PCM 16-bit 24kHz mono if not specified.
      - Optional<Detail> detail
        
        The detail level of the image (for input_image). auto will default to high.
        
        AUTO("auto")
        
        LOW("low")
        
        HIGH("high")
      - Optional<String> imageUrl
        
        Base64-encoded image bytes (for input_image) as a data URI. For example data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA.... Supported formats are PNG and JPEG.
      - Optional<String> text
        
        The text content (for input_text).
      - Optional<String> transcript
        
        Transcript of the audio (for input_audio). This is not sent to the model, but will be attached to the message item for reference.
      - Optional<Type> type
        
        The content type (input_text, input_audio, or input_image).
        
        INPUT_TEXT("input_text")
        
        INPUT_AUDIO("input_audio")
        
        INPUT_IMAGE("input_image")
    - JsonValue; role "user"constant
      
      The role of the message sender. Always user.
      - USER("user")
    - JsonValue; type "message"constant
      
      The type of the item. Always message.
      - MESSAGE("message")
    - Optional<String> id
      
      The unique ID of the item. This may be provided by the client or generated by the server.
    - Optional<Object> object_
      
      Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
      - REALTIME_ITEM("realtime.item")
    - Optional<Status> status
      
      The status of the item. Has no effect on the conversation.
      - COMPLETED("completed")
      - INCOMPLETE("incomplete")
      - IN_PROGRESS("in_progress")
  - class RealtimeConversationItemAssistantMessage:
    
    An assistant message item in a Realtime conversation.
    - List<Content> content
      
      The content of the message.
      - Optional<String> audio
        
        Base64-encoded audio bytes. These are parsed in the format specified in the session output audio type configuration, which defaults to PCM 16-bit 24kHz mono if not specified.
      - Optional<String> text
        
        The text content.
      - Optional<String> transcript
        
        The transcript of the audio content. This is always present if the output type is audio.
      - Optional<Type> type
        
        The content type, output_text or output_audio depending on the session output_modalities configuration.
        
        OUTPUT_TEXT("output_text")
        
        OUTPUT_AUDIO("output_audio")
    - JsonValue; role "assistant"constant
      
      The role of the message sender. Always assistant.
      - ASSISTANT("assistant")
    - JsonValue; type "message"constant
      
      The type of the item. Always message.
      - MESSAGE("message")
    - Optional<String> id
      
      The unique ID of the item. This may be provided by the client or generated by the server.
    - Optional<Object> object_
      
      Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
      - REALTIME_ITEM("realtime.item")
    - Optional<Status> status
      
      The status of the item. Has no effect on the conversation.
      - COMPLETED("completed")
      - INCOMPLETE("incomplete")
      - IN_PROGRESS("in_progress")
  - class RealtimeConversationItemFunctionCall:
    
    A function call item in a Realtime conversation.
    - String arguments
      
      The arguments of the function call. This is a JSON-encoded string representing the arguments passed to the function, for example {"arg1": "value1", "arg2": 42}.
    - String name
      
      The name of the function being called.
    - JsonValue; type "function_call"constant
      
      The type of the item. Always function_call.
      - FUNCTION_CALL("function_call")
    - Optional<String> id
      
      The unique ID of the item. This may be provided by the client or generated by the server.
    - Optional<String> callId
      
      The ID of the function call.
    - Optional<Object> object_
      
      Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
      - REALTIME_ITEM("realtime.item")
    - Optional<Status> status
      
      The status of the item. Has no effect on the conversation.
      - COMPLETED("completed")
      - INCOMPLETE("incomplete")
      - IN_PROGRESS("in_progress")
  - class RealtimeConversationItemFunctionCallOutput:
    
    A function call output item in a Realtime conversation.
    - String callId
      
      The ID of the function call this output is for.
    - String output
      
      The output of the function call. This is free text and can contain any information or simply be empty.
    - JsonValue; type "function_call_output"constant
      
      The type of the item. Always function_call_output.
      - FUNCTION_CALL_OUTPUT("function_call_output")
    - Optional<String> id
      
      The unique ID of the item. This may be provided by the client or generated by the server.
    - Optional<Object> object_
      
      Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
      - REALTIME_ITEM("realtime.item")
    - Optional<Status> status
      
      The status of the item. Has no effect on the conversation.
      - COMPLETED("completed")
      - INCOMPLETE("incomplete")
      - IN_PROGRESS("in_progress")
  - class RealtimeMcpApprovalResponse:
    
    A Realtime item responding to an MCP approval request.
    - String id
      
      The unique ID of the approval response.
    - String approvalRequestId
      
      The ID of the approval request being answered.
    - boolean approve
      
      Whether the request was approved.
    - JsonValue; type "mcp_approval_response"constant
      
      The type of the item. Always mcp_approval_response.
      - MCP_APPROVAL_RESPONSE("mcp_approval_response")
    - Optional<String> reason
      
      Optional reason for the decision.
  - class RealtimeMcpListTools:
    
    A Realtime item listing tools available on an MCP server.
    - String serverLabel
      
      The label of the MCP server.
    - List<Tool> tools
      
      The tools available on the server.
      - JsonValue inputSchema
        
        The JSON schema describing the tool's input.
      - String name
        
        The name of the tool.
      - Optional<JsonValue> annotations
        
        Additional annotations about the tool.
      - Optional<String> description
        
        The description of the tool.
    - JsonValue; type "mcp_list_tools"constant
      
      The type of the item. Always mcp_list_tools.
      - MCP_LIST_TOOLS("mcp_list_tools")
    - Optional<String> id
      
      The unique ID of the list.
  - class RealtimeMcpToolCall:
    
    A Realtime item representing an invocation of a tool on an MCP server.
    - String id
      
      The unique ID of the tool call.
    - String arguments
      
      A JSON string of the arguments passed to the tool.
    - String name
      
      The name of the tool that was run.
    - String serverLabel
      
      The label of the MCP server running the tool.
    - JsonValue; type "mcp_call"constant
      
      The type of the item. Always mcp_call.
      - MCP_CALL("mcp_call")
    - Optional<String> approvalRequestId
      
      The ID of an associated approval request, if any.
    - Optional<Error> error
      
      The error from the tool call, if any.
      - class RealtimeMcpProtocolError:
        
        long code
        
        String message
        
        JsonValue; type "protocol_error"constant
        
        PROTOCOL_ERROR("protocol_error")
      - class RealtimeMcpToolExecutionError:
        
        String message
        
        JsonValue; type "tool_execution_error"constant
        
        TOOL_EXECUTION_ERROR("tool_execution_error")
      - class RealtimeMcphttpError:
        
        long code
        
        String message
        
        JsonValue; type "http_error"constant
        
        HTTP_ERROR("http_error")
    - Optional<String> output
      
      The output from the tool call.
  - class RealtimeMcpApprovalRequest:
    
    A Realtime item requesting human approval of a tool invocation.
    - String id
      
      The unique ID of the approval request.
    - String arguments
      
      A JSON string of arguments for the tool.
    - String name
      
      The name of the tool to run.
    - String serverLabel
      
      The label of the MCP server making the request.
    - JsonValue; type "mcp_approval_request"constant
      
      The type of the item. Always mcp_approval_request.
      - MCP_APPROVAL_REQUEST("mcp_approval_request")
- Optional<List<OutputModality>> outputModalities
  
  The set of modalities the model used to respond. Currently the only possible values are ["audio"] and ["text"]. Audio output always includes a text transcript. Setting the output modality to text disables audio output from the model.
  - TEXT("text")
  - AUDIO("audio")
- Optional<Status> status
  
  The final status of the response (completed, cancelled, failed, incomplete, or in_progress).
  - COMPLETED("completed")
  - CANCELLED("cancelled")
  - FAILED("failed")
  - INCOMPLETE("incomplete")
  - IN_PROGRESS("in_progress")
- Optional<RealtimeResponseStatus> statusDetails
  
  Additional details about the status.
  - Optional<Error> error
    
    A description of the error that caused the response to fail, populated when the status is failed.
    - Optional<String> code
      
      Error code, if any.
    - Optional<String> type
      
      The type of error.
  - Optional<Reason> reason
    
    The reason the Response did not complete. For a cancelled Response, one of turn_detected (the server VAD detected a new start of speech) or client_cancelled (the client sent a cancel event). For an incomplete Response, one of max_output_tokens or content_filter (the server-side safety filter activated and cut off the response).
    - TURN_DETECTED("turn_detected")
    - CLIENT_CANCELLED("client_cancelled")
    - MAX_OUTPUT_TOKENS("max_output_tokens")
    - CONTENT_FILTER("content_filter")
  - Optional<Type> type
    
    The type of error that caused the response to fail, corresponding with the status field (completed, cancelled, incomplete, failed).
    - COMPLETED("completed")
    - CANCELLED("cancelled")
    - INCOMPLETE("incomplete")
    - FAILED("failed")
- Optional<RealtimeResponseUsage> usage
  
  Usage statistics for the Response, which correspond to billing. A Realtime API session maintains a conversation context and appends new Items to the Conversation, so output from previous turns (text and audio tokens) becomes input for later turns.
  - Optional<RealtimeResponseUsageInputTokenDetails> inputTokenDetails
    
    Details about the input tokens used in the Response. Cached tokens are tokens from previous turns in the conversation that are included as context for the current response. Cached tokens here are counted as a subset of input tokens, meaning input tokens will include cached and uncached tokens.
    - Optional<Long> audioTokens
      
      The number of audio tokens used as input for the Response.
    - Optional<Long> cachedTokens
      
      The number of cached tokens used as input for the Response.
    - Optional<CachedTokensDetails> cachedTokensDetails
      
      Details about the cached tokens used as input for the Response.
      - Optional<Long> audioTokens
        
        The number of cached audio tokens used as input for the Response.
      - Optional<Long> imageTokens
        
        The number of cached image tokens used as input for the Response.
      - Optional<Long> textTokens
        
        The number of cached text tokens used as input for the Response.
    - Optional<Long> imageTokens
      
      The number of image tokens used as input for the Response.
    - Optional<Long> textTokens
      
      The number of text tokens used as input for the Response.
  - Optional<Long> inputTokens
    
    The number of input tokens used in the Response, including text and audio tokens.
  - Optional<RealtimeResponseUsageOutputTokenDetails> outputTokenDetails
    
    Details about the output tokens used in the Response.
    - Optional<Long> audioTokens
      
      The number of audio tokens used in the Response.
    - Optional<Long> textTokens
      
      The number of text tokens used in the Response.
  - Optional<Long> outputTokens
    
    The number of output tokens sent in the Response, including text and audio tokens.
  - Optional<Long> totalTokens
    
    The total number of tokens in the Response including input and output text and audio tokens.
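
The relationship between the usage counters described above can be cross-checked with simple arithmetic. The following is a minimal sketch in plain Java, not SDK code: the class UsageSketch and its methods are hypothetical names. It relies on the statement that cached tokens are a subset of input tokens, and it assumes (this is not stated explicitly above) that totalTokens equals inputTokens plus outputTokens.

```java
// Hypothetical helper, not part of the SDK: arithmetic over the usage counters.
final class UsageSketch {
    private UsageSketch() {}

    /** Input tokens not served from cache; cached tokens are a subset of input tokens. */
    static long uncachedInputTokens(long inputTokens, long cachedTokens) {
        if (cachedTokens < 0 || cachedTokens > inputTokens) {
            throw new IllegalArgumentException("cachedTokens must be between 0 and inputTokens");
        }
        return inputTokens - cachedTokens;
    }

    /** Assumption: the total is the sum of input and output tokens. */
    static long expectedTotalTokens(long inputTokens, long outputTokens) {
        return Math.addExact(inputTokens, outputTokens);
    }

    public static void main(String[] args) {
        long input = 1200;
        long cached = 800;
        long output = 300;
        System.out.println("uncached input: " + uncachedInputTokens(input, cached));
        System.out.println("expected total: " + expectedTotalTokens(input, output));
    }
}
```

In real code the inputs would come from the Optional<Long> fields of the usage object, so each value should be checked for presence before it is passed in.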

Realtime Response Create Audio Output

class RealtimeResponseCreateAudioOutput:

Configuration for audio input and output.
- Optional<Output> output
  - Optional<RealtimeAudioFormats> format
    
    The format of the output audio.
    - AudioPcm
      - Optional<Rate> rate
        
        The sample rate of the audio. Always 24000.
        
        _24000(24000)
      - Optional<Type> type
        
        The audio format. Always audio/pcm.
        
        AUDIO_PCM("audio/pcm")
    - AudioPcmu
      - Optional<Type> type
        
        The audio format. Always audio/pcmu.
        
        AUDIO_PCMU("audio/pcmu")
    - AudioPcma
      - Optional<Type> type
        
        The audio format. Always audio/pcma.
        
        AUDIO_PCMA("audio/pcma")
  - Optional<Voice> voice
    
    The voice the model uses to respond. Supported built-in voices are alloy, ash, ballad, coral, echo, sage, shimmer, verse, marin, and cedar. You may also provide a custom voice object with an id, for example { "id": "voice_1234" }. Voice cannot be changed during the session once the model has responded with audio at least once. We recommend marin and cedar for best quality.
    - String
    - enum UnionMember1:
      - ALLOY("alloy")
      - ASH("ash")
      - BALLAD("ballad")
      - CORAL("coral")
      - ECHO("echo")
      - SAGE("sage")
      - SHIMMER("shimmer")
      - VERSE("verse")
      - MARIN("marin")
      - CEDAR("cedar")
    - class Id:
      
      Custom voice reference.
      - String id
        
        The custom voice ID, e.g. voice_1234.
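
To make the shape of this configuration concrete, here is a rough sketch that assembles the JSON for an audio output block with either a built-in voice or a custom voice reference. It is plain Java, not SDK code: AudioOutputSketch and its methods are hypothetical, and the JSON key names (output, format, type, rate, voice, id) are assumed from the field names listed above. The helper does no JSON escaping, so it assumes simple identifiers as input.

```java
// Hypothetical helper, not part of the SDK: builds the assumed JSON shape by hand.
final class AudioOutputSketch {
    private AudioOutputSketch() {}

    /** A built-in voice is sent as a plain JSON string, e.g. "marin". */
    static String builtInVoice(String name) {
        return "\"" + name + "\"";
    }

    /** A custom voice is sent as an object carrying the voice ID. */
    static String customVoice(String id) {
        return "{\"id\":\"" + id + "\"}";
    }

    /** PCM output at 24000 Hz, the only PCM rate listed above. */
    static String pcmAudioOutput(String voiceJson) {
        return "{\"output\":{\"format\":{\"type\":\"audio/pcm\",\"rate\":24000},"
                + "\"voice\":" + voiceJson + "}}";
    }

    public static void main(String[] args) {
        System.out.println(pcmAudioOutput(builtInVoice("marin")));
        System.out.println(pcmAudioOutput(customVoice("voice_1234")));
    }
}
```

A production client would use a JSON library or the SDK's own builders instead of string concatenation; the sketch only illustrates the two voice variants.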

Realtime Response Create Mcp Tool

class RealtimeResponseCreateMcpTool:

Give the model access to additional tools via remote Model Context Protocol (MCP) servers. Learn more about MCP.
- String serverLabel
  
  A label for this MCP server, used to identify it in tool calls.
- JsonValue; type "mcp"constant
  
  The type of the MCP tool. Always mcp.
  - MCP("mcp")
- Optional<AllowedTools> allowedTools
  
  List of allowed tool names or a filter object.
  - List<String>
  - class McpToolFilter:
    
    A filter object to specify which tools are allowed.
    - Optional<Boolean> readOnly
      
      Indicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with readOnlyHint, it will match this filter.
    - Optional<List<String>> toolNames
      
      List of allowed tool names.
- Optional<String> authorization
  
  An OAuth access token that can be used with a remote MCP server, either with a custom MCP server URL or a service connector. Your application must handle the OAuth authorization flow and provide the token here.
- Optional<ConnectorId> connectorId
  
  Identifier for service connectors, like those available in ChatGPT. One of server_url or connector_id must be provided.
  
  Currently supported connector_id values are:
  - Dropbox: connector_dropbox
  - Gmail: connector_gmail
  - Google Calendar: connector_googlecalendar
  - Google Drive: connector_googledrive
  - Microsoft Teams: connector_microsoftteams
  - Outlook Calendar: connector_outlookcalendar
  - Outlook Email: connector_outlookemail
  - SharePoint: connector_sharepoint
  - CONNECTOR_DROPBOX("connector_dropbox")
  - CONNECTOR_GMAIL("connector_gmail")
  - CONNECTOR_GOOGLECALENDAR("connector_googlecalendar")
  - CONNECTOR_GOOGLEDRIVE("connector_googledrive")
  - CONNECTOR_MICROSOFTTEAMS("connector_microsoftteams")
  - CONNECTOR_OUTLOOKCALENDAR("connector_outlookcalendar")
  - CONNECTOR_OUTLOOKEMAIL("connector_outlookemail")
  - CONNECTOR_SHAREPOINT("connector_sharepoint")
- Optional<Boolean> deferLoading
  
  Whether this MCP tool is deferred and discovered via tool search.
- Optional<Headers> headers
  
  Optional HTTP headers to send to the MCP server. Use for authentication or other purposes.
- Optional<RequireApproval> requireApproval
  
  Specify which of the MCP server's tools require approval.
  - class McpToolApprovalFilter:
    
    Specify which of the MCP server's tools require approval. Can be always, never, or a filter object associated with tools that require approval.
    - Optional<Always> always
      
      A filter object to specify which tools always require approval.
      - Optional<Boolean> readOnly
        
        Indicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with readOnlyHint, it will match this filter.
      - Optional<List<String>> toolNames
        
        List of allowed tool names.
    - Optional<Never> never
      
      A filter object to specify which tools never require approval.
      - Optional<Boolean> readOnly
        
        Indicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with readOnlyHint, it will match this filter.
      - Optional<List<String>> toolNames
        
        List of allowed tool names.
  - enum McpToolApprovalSetting:
    
    Specify a single approval policy for all tools. One of always or never. When set to always, all tools will require approval. When set to never, no tools will require approval.
    - ALWAYS("always")
    - NEVER("never")
- Optional<String> serverDescription
  
  Optional description of the MCP server, used to provide more context.
- Optional<String> serverUrl
  
  The URL for the MCP server. One of server_url or connector_id must be provided.
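
The rule that one of server_url or connector_id must be provided can be checked on the client before a request is sent. The sketch below is plain Java, not SDK code: McpToolSketch and its methods are hypothetical. It reads "one of" as exactly one, which is an assumption, since the text above does not say whether supplying both is allowed. The connector set is copied from the list of currently supported values above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical pre-flight checks for an MCP tool configuration; not part of the SDK.
final class McpToolSketch {
    private McpToolSketch() {}

    // Connector IDs listed in the reference above.
    private static final Set<String> SUPPORTED_CONNECTORS =
            Collections.unmodifiableSet(new HashSet<String>(Arrays.asList(
                    "connector_dropbox", "connector_gmail", "connector_googlecalendar",
                    "connector_googledrive", "connector_microsoftteams",
                    "connector_outlookcalendar", "connector_outlookemail",
                    "connector_sharepoint")));

    /** Assumption: exactly one of the two targets should be set. */
    static boolean hasExactlyOneTarget(String serverUrl, String connectorId) {
        boolean hasUrl = serverUrl != null && !serverUrl.isEmpty();
        boolean hasConnector = connectorId != null && !connectorId.isEmpty();
        return hasUrl ^ hasConnector;
    }

    /** True if the ID is one of the connector values listed above. */
    static boolean isSupportedConnector(String connectorId) {
        return SUPPORTED_CONNECTORS.contains(connectorId);
    }

    public static void main(String[] args) {
        System.out.println(hasExactlyOneTarget("https://example.com/mcp", null));
        System.out.println(hasExactlyOneTarget(null, null));
        System.out.println(isSupportedConnector("connector_gmail"));
    }
}
```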

Realtime Response Create Params

class RealtimeResponseCreateParams:

Create a new Realtime response with these parameters.
- Optional<RealtimeResponseCreateAudioOutput> audio
  
  Configuration for audio input and output.
  - Optional<Output> output
    - Optional<RealtimeAudioFormats> format
      
      The format of the output audio.
      - AudioPcm
        
        Optional<Rate> rate
        
        The sample rate of the audio. Always 24000.
        
        _24000(24000)
        
        Optional<Type> type
        
        The audio format. Always audio/pcm.
        
        AUDIO_PCM("audio/pcm")
      - AudioPcmu
        
        Optional<Type> type
        
        The audio format. Always audio/pcmu.
        
        AUDIO_PCMU("audio/pcmu")
      - AudioPcma
        
        Optional<Type> type
        
        The audio format. Always audio/pcma.
        
        AUDIO_PCMA("audio/pcma")
    - Optional<Voice> voice
      
      The voice the model uses to respond. Supported built-in voices are alloy, ash, ballad, coral, echo, sage, shimmer, verse, marin, and cedar. You may also provide a custom voice object with an id, for example { "id": "voice_1234" }. Voice cannot be changed during the session once the model has responded with audio at least once. We recommend marin and cedar for best quality.
      - String
      - enum UnionMember1:
        
        ALLOY("alloy")
        
        ASH("ash")
        
        BALLAD("ballad")
        
        CORAL("coral")
        
        ECHO("echo")
        
        SAGE("sage")
        
        SHIMMER("shimmer")
        
        VERSE("verse")
        
        MARIN("marin")
        
        CEDAR("cedar")
      - class Id:
        
        Custom voice reference.
        
        String id
        
        The custom voice ID, e.g. voice_1234.
- Optional<Conversation> conversation
  
  Controls which conversation the response is added to. Currently supports auto and none, with auto as the default value. The auto value means that the contents of the response will be added to the default conversation. Set this to none to create an out-of-band response, which will not add items to the default conversation.
  - AUTO("auto")
  - NONE("none")
- Optional<List<ConversationItem>> input
  
  Input items to include in the prompt for the model. Using this field creates a new context for this Response instead of using the default conversation. An empty array [] will clear the context for this Response. Note that this can include references to items that previously appeared in the session using their id.
  - class RealtimeConversationItemSystemMessage:
    
    A system message in a Realtime conversation can be used to provide additional context or instructions to the model. This is similar but distinct from the instruction prompt provided at the start of a conversation, as system messages can be added at any point in the conversation. For major changes to the conversation's behavior, use instructions, but for smaller updates (e.g. "the user is now asking about a different topic"), use system messages.
    - List<Content> content
      
      The content of the message.
      - Optional<String> text
        
        The text content.
      - Optional<Type> type
        
        The content type. Always input_text for system messages.
        
        INPUT_TEXT("input_text")
    - JsonValue; role "system"constant
      
      The role of the message sender. Always system.
      - SYSTEM("system")
    - JsonValue; type "message"constant
      
      The type of the item. Always message.
      - MESSAGE("message")
    - Optional<String> id
      
      The unique ID of the item. This may be provided by the client or generated by the server.
    - Optional<Object> object_
      
      Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
      - REALTIME_ITEM("realtime.item")
    - Optional<Status> status
      
      The status of the item. Has no effect on the conversation.
      - COMPLETED("completed")
      - INCOMPLETE("incomplete")
      - IN_PROGRESS("in_progress")
  - class RealtimeConversationItemUserMessage:
    
    A user message item in a Realtime conversation.
    - List<Content> content
      
      The content of the message.
      - Optional<String> audio
        
        Base64-encoded audio bytes (for input_audio). These are parsed in the format specified in the session input audio type configuration, which defaults to PCM 16-bit 24kHz mono if not specified.
      - Optional<Detail> detail
        
        The detail level of the image (for input_image). auto will default to high.
        
        AUTO("auto")
        
        LOW("low")
        
        HIGH("high")
      - Optional<String> imageUrl
        
        Base64-encoded image bytes (for input_image) as a data URI. For example data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA.... Supported formats are PNG and JPEG.
      - Optional<String> text
        
        The text content (for input_text).
      - Optional<String> transcript
        
        Transcript of the audio (for input_audio). This is not sent to the model, but will be attached to the message item for reference.
      - Optional<Type> type
        
        The content type (input_text, input_audio, or input_image).
        
        INPUT_TEXT("input_text")
        
        INPUT_AUDIO("input_audio")
        
        INPUT_IMAGE("input_image")
    - JsonValue; role "user"constant
      
      The role of the message sender. Always user.
      - USER("user")
    - JsonValue; type "message"constant
      
      The type of the item. Always message.
      - MESSAGE("message")
    - Optional<String> id
      
      The unique ID of the item. This may be provided by the client or generated by the server.
    - Optional<Object> object_
      
      Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
      - REALTIME_ITEM("realtime.item")
    - Optional<Status> status
      
      The status of the item. Has no effect on the conversation.
      - COMPLETED("completed")
      - INCOMPLETE("incomplete")
      - IN_PROGRESS("in_progress")
  - class RealtimeConversationItemAssistantMessage:
    
    An assistant message item in a Realtime conversation.
    - List<Content> content
      
      The content of the message.
      - Optional<String> audio
        
        Base64-encoded audio bytes. These are parsed in the format specified in the session output audio type configuration, which defaults to PCM 16-bit 24kHz mono if not specified.
      - Optional<String> text
        
        The text content.
      - Optional<String> transcript
        
        The transcript of the audio content. This is always present if the output type is audio.
      - Optional<Type> type
        
        The content type, output_text or output_audio depending on the session output_modalities configuration.
        
        OUTPUT_TEXT("output_text")
        
        OUTPUT_AUDIO("output_audio")
    - JsonValue; role "assistant"constant
      
      The role of the message sender. Always assistant.
      - ASSISTANT("assistant")
    - JsonValue; type "message"constant
      
      The type of the item. Always message.
      - MESSAGE("message")
    - Optional<String> id
      
      The unique ID of the item. This may be provided by the client or generated by the server.
    - Optional<Object> object_
      
      Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
      - REALTIME_ITEM("realtime.item")
    - Optional<Status> status
      
      The status of the item. Has no effect on the conversation.
      - COMPLETED("completed")
      - INCOMPLETE("incomplete")
      - IN_PROGRESS("in_progress")
  - class RealtimeConversationItemFunctionCall:
    
    A function call item in a Realtime conversation.
    - String arguments
      
      The arguments of the function call. This is a JSON-encoded string representing the arguments passed to the function, for example {"arg1": "value1", "arg2": 42}.
    - String name
      
      The name of the function being called.
    - JsonValue; type "function_call"constant
      
      The type of the item. Always function_call.
      - FUNCTION_CALL("function_call")
    - Optional<String> id
      
      The unique ID of the item. This may be provided by the client or generated by the server.
    - Optional<String> callId
      
      The ID of the function call.
    - Optional<Object> object_
      
      Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
      - REALTIME_ITEM("realtime.item")
    - Optional<Status> status
      
      The status of the item. Has no effect on the conversation.
      - COMPLETED("completed")
      - INCOMPLETE("incomplete")
      - IN_PROGRESS("in_progress")
  - class RealtimeConversationItemFunctionCallOutput:
    
    A function call output item in a Realtime conversation.
    - String callId
      
      The ID of the function call this output is for.
    - String output
      
      The output of the function call. This is free text and can contain any information or simply be empty.
    - JsonValue; type "function_call_output"constant
      
      The type of the item. Always function_call_output.
      - FUNCTION_CALL_OUTPUT("function_call_output")
    - Optional<String> id
      
      The unique ID of the item. This may be provided by the client or generated by the server.
    - Optional<Object> object_
      
      Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
      - REALTIME_ITEM("realtime.item")
    - Optional<Status> status
      
      The status of the item. Has no effect on the conversation.
      - COMPLETED("completed")
      - INCOMPLETE("incomplete")
      - IN_PROGRESS("in_progress")
  - class RealtimeMcpApprovalResponse:
    
    A Realtime item responding to an MCP approval request.
    - String id
      
      The unique ID of the approval response.
    - String approvalRequestId
      
      The ID of the approval request being answered.
    - boolean approve
      
      Whether the request was approved.
    - JsonValue; type "mcp_approval_response"constant
      
      The type of the item. Always mcp_approval_response.
      - MCP_APPROVAL_RESPONSE("mcp_approval_response")
    - Optional<String> reason
      
      Optional reason for the decision.
  - class RealtimeMcpListTools:
    
    A Realtime item listing tools available on an MCP server.
    - String serverLabel
      
      The label of the MCP server.
    - List<Tool> tools
      
      The tools available on the server.
      - JsonValue inputSchema
        
        The JSON schema describing the tool's input.
      - String name
        
        The name of the tool.
      - Optional<JsonValue> annotations
        
        Additional annotations about the tool.
      - Optional<String> description
        
        The description of the tool.
    - JsonValue; type "mcp_list_tools"constant
      
      The type of the item. Always mcp_list_tools.
      - MCP_LIST_TOOLS("mcp_list_tools")
    - Optional<String> id
      
      The unique ID of the list.
  - class RealtimeMcpToolCall:
    
    A Realtime item representing an invocation of a tool on an MCP server.
    - String id
      
      The unique ID of the tool call.
    - String arguments
      
      A JSON string of the arguments passed to the tool.
    - String name
      
      The name of the tool that was run.
    - String serverLabel
      
      The label of the MCP server running the tool.
    - JsonValue; type "mcp_call"constant
      
      The type of the item. Always mcp_call.
      - MCP_CALL("mcp_call")
    - Optional<String> approvalRequestId
      
      The ID of an associated approval request, if any.
    - Optional<Error> error
      
      The error from the tool call, if any.
      - class RealtimeMcpProtocolError:
        
        long code
        
        String message
        
        JsonValue; type "protocol_error"constant
        
        PROTOCOL_ERROR("protocol_error")
      - class RealtimeMcpToolExecutionError:
        
        String message
        
        JsonValue; type "tool_execution_error"constant
        
        TOOL_EXECUTION_ERROR("tool_execution_error")
      - class RealtimeMcphttpError:
        
        long code
        
        String message
        
        JsonValue; type "http_error"constant
        
        HTTP_ERROR("http_error")
    - Optional<String> output
      
      The output from the tool call.
  - class RealtimeMcpApprovalRequest:
    
    A Realtime item requesting human approval of a tool invocation.
    - String id
      
      The unique ID of the approval request.
    - String arguments
      
      A JSON string of arguments for the tool.
    - String name
      
      The name of the tool to run.
    - String serverLabel
      
      The label of the MCP server making the request.
    - JsonValue; type "mcp_approval_request"constant
      
      The type of the item. Always mcp_approval_request.
      - MCP_APPROVAL_REQUEST("mcp_approval_request")
- Optional<String> instructions
  
  The default system instructions (i.e. system message) prepended to model calls. This field allows the client to guide the model on desired responses. The model can be instructed on response content and format (e.g. "be extremely succinct", "act friendly", "here are examples of good responses") and on audio behavior (e.g. "talk quickly", "inject emotion into your voice", "laugh frequently"). The instructions are not guaranteed to be followed by the model, but they provide guidance on the desired behavior. Note that the server sets default instructions, which are used if this field is not set and are visible in the session.created event at the start of the session.
- Optional<MaxOutputTokens> maxOutputTokens
  
  Maximum number of output tokens for a single assistant response, inclusive of tool calls. Provide an integer between 1 and 4096 to limit output tokens, or inf for the maximum available tokens for a given model. Defaults to inf.
  - long
  - JsonValue;
    - INF("inf")
- Optional<Metadata> metadata
  
  Set of up to 16 key-value pairs that can be attached to an object. This can be useful for storing additional information about the object in a structured format, and querying for objects via API or the dashboard.
  
  Keys are strings with a maximum length of 64 characters. Values are strings with a maximum length of 512 characters.
- Optional<List<OutputModality>> outputModalities
  
  The set of modalities the model uses to respond. Currently the only possible values are ["audio"] and ["text"]. Audio output always includes a text transcript. Setting the output modality to text disables audio output from the model.
  - TEXT("text")
  - AUDIO("audio")
- Optional<Boolean> parallelToolCalls
  
  Whether the model may call multiple tools in parallel. Only supported by reasoning Realtime models such as gpt-realtime-2.
- Optional<ResponsePrompt> prompt
  
  Reference to a prompt template and its variables.
  - String id
    
    The unique identifier of the prompt template to use.
  - Optional<Variables> variables
    
    Optional map of values to substitute in for variables in your prompt. The substitution values can either be strings, or other Response input types like images or files.
    - String
    - class ResponseInputText:
      
      A text input to the model.
      - String text
        
        The text input to the model.
      - JsonValue; type "input_text"constant
        
        The type of the input item. Always input_text.
        
        INPUT_TEXT("input_text")
    - class ResponseInputImage:
      
      An image input to the model.
      - Detail detail
        
        The detail level of the image to be sent to the model. One of high, low, auto, or original. Defaults to auto.
        
        LOW("low")
        
        HIGH("high")
        
        AUTO("auto")
        
        ORIGINAL("original")
      - JsonValue; type "input_image"constant
        
        The type of the input item. Always input_image.
        
        INPUT_IMAGE("input_image")
      - Optional<String> fileId
        
        The ID of the file to be sent to the model.
      - Optional<String> imageUrl
        
        The URL of the image to be sent to the model. A fully qualified URL or base64 encoded image in a data URL.
    - class ResponseInputFile:
      
      A file input to the model.
      - JsonValue; type "input_file"constant
        
        The type of the input item. Always input_file.
        
        INPUT_FILE("input_file")
      - Optional<Detail> detail
        
        The detail level of the file to be sent to the model. Use low for the default rendering behavior, or high to render the file at higher quality. Defaults to low.
        
        LOW("low")
        
        HIGH("high")
      - Optional<String> fileData
        
        The content of the file to be sent to the model.
      - Optional<String> fileId
        
        The ID of the file to be sent to the model.
      - Optional<String> fileUrl
        
        The URL of the file to be sent to the model.
      - Optional<String> filename
        
        The name of the file to be sent to the model.
  - Optional<String> version
    
    Optional version of the prompt template.
- Optional<RealtimeReasoning> reasoning
  
  Configuration for reasoning-capable Realtime models such as gpt-realtime-2.
  - Optional<RealtimeReasoningEffort> effort
    
    Constrains effort on reasoning for reasoning-capable Realtime models such as gpt-realtime-2.
    - MINIMAL("minimal")
    - LOW("low")
    - MEDIUM("medium")
    - HIGH("high")
    - XHIGH("xhigh")
- Optional<ToolChoice> toolChoice
  
  How the model chooses tools. Provide one of the string modes or force a specific function/MCP tool.
  - enum ToolChoiceOptions:
    
    Controls which (if any) tool is called by the model.
    
    none means the model will not call any tool and instead generates a message.
    
    auto means the model can pick between generating a message or calling one or more tools.
    
    required means the model must call one or more tools.
    - NONE("none")
    - AUTO("auto")
    - REQUIRED("required")
  - class ToolChoiceFunction:
    
    Use this option to force the model to call a specific function.
    - String name
      
      The name of the function to call.
    - JsonValue; type "function"constant
      
      For function calling, the type is always function.
      - FUNCTION("function")
  - class ToolChoiceMcp:
    
    Use this option to force the model to call a specific tool on a remote MCP server.
    - String serverLabel
      
      The label of the MCP server to use.
    - JsonValue; type "mcp"constant
      
      For MCP tools, the type is always mcp.
      - MCP("mcp")
    - Optional<String> name
      
      The name of the tool to call on the server.
- Optional<List<Tool>> tools
  
  Tools available to the model.
  - class RealtimeFunctionTool:
    - Optional<String> description
      
      The description of the function, including guidance on when and how to call it, and guidance about what to tell the user when calling (if anything).
    - Optional<String> name
      
      The name of the function.
    - Optional<JsonValue> parameters
      
      Parameters of the function in JSON Schema.
    - Optional<Type> type
      
      The type of the tool, i.e. function.
      - FUNCTION("function")
  - class RealtimeResponseCreateMcpTool:
    
    Give the model access to additional tools via remote Model Context Protocol (MCP) servers.
    - String serverLabel
      
      A label for this MCP server, used to identify it in tool calls.
    - JsonValue; type "mcp"constant
      
      The type of the MCP tool. Always mcp.
      - MCP("mcp")
    - Optional<AllowedTools> allowedTools
      
      List of allowed tool names or a filter object.
      - List<String>
      - class McpToolFilter:
        
        A filter object to specify which tools are allowed.
        
        Optional<Boolean> readOnly
        
        Indicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with readOnlyHint, it will match this filter.
        
        Optional<List<String>> toolNames
        
        List of allowed tool names.
    - Optional<String> authorization
      
      An OAuth access token that can be used with a remote MCP server, either with a custom MCP server URL or a service connector. Your application must handle the OAuth authorization flow and provide the token here.
    - Optional<ConnectorId> connectorId
      
      Identifier for service connectors, like those available in ChatGPT. One of server_url or connector_id must be provided.
      
      Currently supported connector_id values are:
      - Dropbox: connector_dropbox
      - Gmail: connector_gmail
      - Google Calendar: connector_googlecalendar
      - Google Drive: connector_googledrive
      - Microsoft Teams: connector_microsoftteams
      - Outlook Calendar: connector_outlookcalendar
      - Outlook Email: connector_outlookemail
      - SharePoint: connector_sharepoint
      - CONNECTOR_DROPBOX("connector_dropbox")
      - CONNECTOR_GMAIL("connector_gmail")
      - CONNECTOR_GOOGLECALENDAR("connector_googlecalendar")
      - CONNECTOR_GOOGLEDRIVE("connector_googledrive")
      - CONNECTOR_MICROSOFTTEAMS("connector_microsoftteams")
      - CONNECTOR_OUTLOOKCALENDAR("connector_outlookcalendar")
      - CONNECTOR_OUTLOOKEMAIL("connector_outlookemail")
      - CONNECTOR_SHAREPOINT("connector_sharepoint")
    - Optional<Boolean> deferLoading
      
      Whether this MCP tool is deferred and discovered via tool search.
    - Optional<Headers> headers
      
      Optional HTTP headers to send to the MCP server. Use for authentication or other purposes.
    - Optional<RequireApproval> requireApproval
      
      Specify which of the MCP server's tools require approval.
      - class McpToolApprovalFilter:
        
        Specify which of the MCP server's tools require approval. Can be always, never, or a filter object associated with tools that require approval.
        
        Optional<Always> always
        
        A filter object to specify which tools are allowed.
        
        Optional<Boolean> readOnly
        
        Indicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with readOnlyHint, it will match this filter.
        
        Optional<List<String>> toolNames
        
        List of allowed tool names.
        
        Optional<Never> never
        
        A filter object to specify which tools are allowed.
        
        Optional<Boolean> readOnly
        
        Indicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with readOnlyHint, it will match this filter.
        
        Optional<List<String>> toolNames
        
        List of allowed tool names.
      - enum McpToolApprovalSetting:
        
        Specify a single approval policy for all tools. One of always or never. When set to always, all tools require approval. When set to never, no tools require approval.
        
        ALWAYS("always")
        
        NEVER("never")
    - Optional<String> serverDescription
      
      Optional description of the MCP server, used to provide more context.
    - Optional<String> serverUrl
      
      The URL for the MCP server. One of server_url or connector_id must be provided.
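
Several of the fields above carry constraints that a client may want to check before sending a request. The following is a minimal sketch of such checks, not part of the SDK: the class and method names are hypothetical, the metadata limit is read as "at most 16 pairs", and "one of server_url or connector_id" is assumed to mean that at least one of the two is set.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical client-side checks; names are illustrative and not SDK API.
final class ResponseParamChecksSketch {

    /** maxOutputTokens: an integer between 1 and 4096, or the literal "inf". */
    static boolean isValidMaxOutputTokens(String raw) {
        if ("inf".equals(raw)) {
            return true;
        }
        try {
            long value = Long.parseLong(raw);
            return value >= 1 && value <= 4096;
        } catch (NumberFormatException e) {
            // Covers null and non-numeric input.
            return false;
        }
    }

    /** metadata: at most 16 pairs, keys up to 64 characters, values up to 512. */
    static boolean isValidMetadata(Map<String, String> metadata) {
        if (metadata.size() > 16) {
            return false;
        }
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            // String.length() counts UTF-16 code units; how the server counts
            // characters is an assumption here.
            if (entry.getKey().length() > 64 || entry.getValue().length() > 512) {
                return false;
            }
        }
        return true;
    }

    /** MCP tool: assumes at least one of serverUrl / connectorId must be non-null. */
    static boolean hasMcpTarget(String serverUrl, String connectorId) {
        return serverUrl != null || connectorId != null;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("customer", "acme");
        System.out.println(isValidMaxOutputTokens("inf"));
        System.out.println(isValidMetadata(metadata));
        System.out.println(hasMcpTarget(null, "connector_gmail"));
    }
}
```

The server remains the authority on what it accepts; checks like these only surface obvious mistakes earlier.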

Realtime Response Status

class RealtimeResponseStatus:

Additional details about the status.
- Optional<Error> error
  
  A description of the error that caused the response to fail, populated when the status is failed.
  - Optional<String> code
    
    Error code, if any.
  - Optional<String> type
    
    The type of error.
- Optional<Reason> reason
  
  The reason the Response did not complete. For a cancelled Response, one of turn_detected (the server VAD detected a new start of speech) or client_cancelled (the client sent a cancel event). For an incomplete Response, one of max_output_tokens or content_filter (the server-side safety filter activated and cut off the response).
  - TURN_DETECTED("turn_detected")
  - CLIENT_CANCELLED("client_cancelled")
  - MAX_OUTPUT_TOKENS("max_output_tokens")
  - CONTENT_FILTER("content_filter")
- Optional<Type> type
  
  The type of error that caused the response to fail, corresponding with the status field (completed, cancelled, incomplete, failed).
  - COMPLETED("completed")
  - CANCELLED("cancelled")
  - INCOMPLETE("incomplete")
  - FAILED("failed")
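
The reason description pairs each reason value with a status type. A minimal sketch of that pairing is shown here, using plain strings rather than the SDK enums; the class and method names are hypothetical.

```java
import java.util.Optional;

// Hypothetical helper; mirrors the reason-to-status pairing described above.
final class ResponseStatusSketch {

    /** Returns the status type a reason belongs to, or empty for unknown reasons. */
    static Optional<String> statusForReason(String reason) {
        switch (reason) {
            case "turn_detected":      // server VAD detected a new start of speech
            case "client_cancelled":   // client sent a cancel event
                return Optional.of("cancelled");
            case "max_output_tokens":  // output token limit reached
            case "content_filter":     // server-side safety filter cut off the response
                return Optional.of("incomplete");
            default:
                return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(statusForReason("turn_detected").orElse("unknown"));
    }
}
```

A completed or failed Response is not covered by any reason value listed above, so the sketch returns empty for anything else.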

Realtime Response Usage

class RealtimeResponseUsage:

Usage statistics for the Response; these correspond to billing. A Realtime API session maintains a conversation context and appends new Items to the Conversation, so output from previous turns (text and audio tokens) becomes input for later turns.
- Optional<RealtimeResponseUsageInputTokenDetails> inputTokenDetails
  
  Details about the input tokens used in the Response. Cached tokens are tokens from previous turns in the conversation that are included as context for the current response. Cached tokens here are counted as a subset of input tokens, meaning input tokens will include cached and uncached tokens.
  - Optional<Long> audioTokens
    
    The number of audio tokens used as input for the Response.
  - Optional<Long> cachedTokens
    
    The number of cached tokens used as input for the Response.
  - Optional<CachedTokensDetails> cachedTokensDetails
    
    Details about the cached tokens used as input for the Response.
    - Optional<Long> audioTokens
      
      The number of cached audio tokens used as input for the Response.
    - Optional<Long> imageTokens
      
      The number of cached image tokens used as input for the Response.
    - Optional<Long> textTokens
      
      The number of cached text tokens used as input for the Response.
  - Optional<Long> imageTokens
    
    The number of image tokens used as input for the Response.
  - Optional<Long> textTokens
    
    The number of text tokens used as input for the Response.
- Optional<Long> inputTokens
  
  The number of input tokens used in the Response, including text and audio tokens.
- Optional<RealtimeResponseUsageOutputTokenDetails> outputTokenDetails
  
  Details about the output tokens used in the Response.
  - Optional<Long> audioTokens
    
    The number of audio tokens used in the Response.
  - Optional<Long> textTokens
    
    The number of text tokens used in the Response.
- Optional<Long> outputTokens
  
  The number of output tokens sent in the Response, including text and audio tokens.
- Optional<Long> totalTokens
  
  The total number of tokens in the Response including input and output text and audio tokens.
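
Because cached tokens are counted as a subset of input tokens, the uncached portion can be derived by subtraction. The sketch below illustrates that arithmetic on plain long values; the class and method names are hypothetical and not part of the SDK.

```java
// Hypothetical helper showing the cached-versus-uncached input token arithmetic.
final class UsageSketch {

    /** Input tokens that were not served from cache. */
    static long uncachedInputTokens(long inputTokens, long cachedTokens) {
        if (cachedTokens < 0 || cachedTokens > inputTokens) {
            throw new IllegalArgumentException("cachedTokens must be within [0, inputTokens]");
        }
        return inputTokens - cachedTokens;
    }

    /** Fraction of input tokens served from cache; 0.0 when there is no input. */
    static double cachedFraction(long inputTokens, long cachedTokens) {
        if (inputTokens == 0) {
            return 0.0;
        }
        return (double) cachedTokens / (double) inputTokens;
    }

    public static void main(String[] args) {
        System.out.println(uncachedInputTokens(1000, 400));
        System.out.println(cachedFraction(1000, 250));
    }
}
```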

Realtime Response Usage Input Token Details

class RealtimeResponseUsageInputTokenDetails:

Details about the input tokens used in the Response. Cached tokens are tokens from previous turns in the conversation that are included as context for the current response. Cached tokens here are counted as a subset of input tokens, meaning input tokens will include cached and uncached tokens.
- Optional<Long> audioTokens
  
  The number of audio tokens used as input for the Response.
- Optional<Long> cachedTokens
  
  The number of cached tokens used as input for the Response.
- Optional<CachedTokensDetails> cachedTokensDetails
  
  Details about the cached tokens used as input for the Response.
  - Optional<Long> audioTokens
    
    The number of cached audio tokens used as input for the Response.
  - Optional<Long> imageTokens
    
    The number of cached image tokens used as input for the Response.
  - Optional<Long> textTokens
    
    The number of cached text tokens used as input for the Response.
- Optional<Long> imageTokens
  
  The number of image tokens used as input for the Response.
- Optional<Long> textTokens
  
  The number of text tokens used as input for the Response.

Realtime Response Usage Output Token Details

class RealtimeResponseUsageOutputTokenDetails:

Details about the output tokens used in the Response.
- Optional<Long> audioTokens
  
  The number of audio tokens used in the Response.
- Optional<Long> textTokens
  
  The number of text tokens used in the Response.

Realtime Server Event

class RealtimeServerEvent: A class that can be one of several variants.union

A realtime server event.
- class ConversationCreatedEvent:
  
  Returned when a conversation is created. Emitted right after session creation.
  - Conversation conversation
    
    The conversation resource.
    - Optional<String> id
      
      The unique ID of the conversation.
    - Optional<Object> object_
      
      The object type, must be realtime.conversation.
      - REALTIME_CONVERSATION("realtime.conversation")
  - String eventId
    
    The unique ID of the server event.
  - JsonValue; type "conversation.created"constant
    
    The event type, must be conversation.created.
    - CONVERSATION_CREATED("conversation.created")
- class ConversationItemCreatedEvent:
  
  Returned when a conversation item is created. There are several scenarios that produce this event:
  - The server is generating a Response, which if successful will produce either one or two Items, which will be of type message (role assistant) or type function_call.
  - The input audio buffer has been committed, either by the client or the server (in server_vad mode). The server will take the content of the input audio buffer and add it to a new user message Item.
  - The client has sent a conversation.item.create event to add a new Item to the Conversation.
  - String eventId
    
    The unique ID of the server event.
  - ConversationItem item
    
    A single item within a Realtime conversation.
    - class RealtimeConversationItemSystemMessage:
      
      A system message in a Realtime conversation can be used to provide additional context or instructions to the model. This is similar to but distinct from the instruction prompt provided at the start of a conversation, as system messages can be added at any point in the conversation. For major changes to the conversation's behavior, use instructions, but for smaller updates (e.g. "the user is now asking about a different topic"), use system messages.
      - List<Content> content
        
        The content of the message.
        
        Optional<String> text
        
        The text content.
        
        Optional<Type> type
        
        The content type. Always input_text for system messages.
        
        INPUT_TEXT("input_text")
      - JsonValue; role "system"constant
        
        The role of the message sender. Always system.
        
        SYSTEM("system")
      - JsonValue; type "message"constant
        
        The type of the item. Always message.
        
        MESSAGE("message")
      - Optional<String> id
        
        The unique ID of the item. This may be provided by the client or generated by the server.
      - Optional<Object> object_
        
        Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
        
        REALTIME_ITEM("realtime.item")
      - Optional<Status> status
        
        The status of the item. Has no effect on the conversation.
        
        COMPLETED("completed")
        
        INCOMPLETE("incomplete")
        
        IN_PROGRESS("in_progress")
    - class RealtimeConversationItemUserMessage:
      
      A user message item in a Realtime conversation.
      - List<Content> content
        
        The content of the message.
        
        Optional<String> audio
        
        Base64-encoded audio bytes (for input_audio). These are parsed in the format specified in the session input audio type configuration, which defaults to PCM 16-bit 24kHz mono if not specified.
        
        Optional<Detail> detail
        
        The detail level of the image (for input_image). auto will default to high.
        
        AUTO("auto")
        
        LOW("low")
        
        HIGH("high")
        
        Optional<String> imageUrl
        
        Base64-encoded image bytes (for input_image) as a data URI. For example data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA.... Supported formats are PNG and JPEG.
        
        Optional<String> text
        
        The text content (for input_text).
        
        Optional<String> transcript
        
        Transcript of the audio (for input_audio). This is not sent to the model, but will be attached to the message item for reference.
        
        Optional<Type> type
        
        The content type (input_text, input_audio, or input_image).
        
        INPUT_TEXT("input_text")
        
        INPUT_AUDIO("input_audio")
        
        INPUT_IMAGE("input_image")
      - JsonValue; role "user"constant
        
        The role of the message sender. Always user.
        
        USER("user")
      - JsonValue; type "message"constant
        
        The type of the item. Always message.
        
        MESSAGE("message")
      - Optional<String> id
        
        The unique ID of the item. This may be provided by the client or generated by the server.
      - Optional<Object> object_
        
        Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
        
        REALTIME_ITEM("realtime.item")
      - Optional<Status> status
        
        The status of the item. Has no effect on the conversation.
        
        COMPLETED("completed")
        
        INCOMPLETE("incomplete")
        
        IN_PROGRESS("in_progress")
    - class RealtimeConversationItemAssistantMessage:
      
      An assistant message item in a Realtime conversation.
      - List<Content> content
        
        The content of the message.
        
        Optional<String> audio
        
        Base64-encoded audio bytes. These are parsed in the format specified in the session output audio type configuration, which defaults to PCM 16-bit 24kHz mono if not specified.
        
        Optional<String> text
        
        The text content.
        
        Optional<String> transcript
        
        The transcript of the audio content. This is always present if the output type is audio.
        
        Optional<Type> type
        
        The content type, output_text or output_audio depending on the session output_modalities configuration.
        
        OUTPUT_TEXT("output_text")
        
        OUTPUT_AUDIO("output_audio")
      - JsonValue; role "assistant"constant
        
        The role of the message sender. Always assistant.
        
        ASSISTANT("assistant")
      - JsonValue; type "message"constant
        
        The type of the item. Always message.
        
        MESSAGE("message")
      - Optional<String> id
        
        The unique ID of the item. This may be provided by the client or generated by the server.
      - Optional<Object> object_
        
        Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
        
        REALTIME_ITEM("realtime.item")
      - Optional<Status> status
        
        The status of the item. Has no effect on the conversation.
        
        COMPLETED("completed")
        
        INCOMPLETE("incomplete")
        
        IN_PROGRESS("in_progress")
    - class RealtimeConversationItemFunctionCall:
      
      A function call item in a Realtime conversation.
      - String arguments
        
        The arguments of the function call. This is a JSON-encoded string representing the arguments passed to the function, for example {"arg1": "value1", "arg2": 42}.
      - String name
        
        The name of the function being called.
      - JsonValue; type "function_call"constant
        
        The type of the item. Always function_call.
        
        FUNCTION_CALL("function_call")
      - Optional<String> id
        
        The unique ID of the item. This may be provided by the client or generated by the server.
      - Optional<String> callId
        
        The ID of the function call.
      - Optional<Object> object_
        
        Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
        
        REALTIME_ITEM("realtime.item")
      - Optional<Status> status
        
        The status of the item. Has no effect on the conversation.
        
        COMPLETED("completed")
        
        INCOMPLETE("incomplete")
        
        IN_PROGRESS("in_progress")
    - class RealtimeConversationItemFunctionCallOutput:
      
      A function call output item in a Realtime conversation.
      - String callId
        
        The ID of the function call this output is for.
      - String output
        
        The output of the function call. This is free text and can contain any information or simply be empty.
      - JsonValue; type "function_call_output"constant
        
        The type of the item. Always function_call_output.
        
        FUNCTION_CALL_OUTPUT("function_call_output")
      - Optional<String> id
        
        The unique ID of the item. This may be provided by the client or generated by the server.
      - Optional<Object> object_
        
        Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
        
        REALTIME_ITEM("realtime.item")
      - Optional<Status> status
        
        The status of the item. Has no effect on the conversation.
        
        COMPLETED("completed")
        
        INCOMPLETE("incomplete")
        
        IN_PROGRESS("in_progress")
    - class RealtimeMcpApprovalResponse:
      
      A Realtime item responding to an MCP approval request.
      - String id
        
        The unique ID of the approval response.
      - String approvalRequestId
        
        The ID of the approval request being answered.
      - boolean approve
        
        Whether the request was approved.
      - JsonValue; type "mcp_approval_response"constant
        
        The type of the item. Always mcp_approval_response.
        
        MCP_APPROVAL_RESPONSE("mcp_approval_response")
      - Optional<String> reason
        
        Optional reason for the decision.
    - class RealtimeMcpListTools:
      
      A Realtime item listing tools available on an MCP server.
      - String serverLabel
        
        The label of the MCP server.
      - List<Tool> tools
        
        The tools available on the server.
        
        JsonValue inputSchema
        
        The JSON schema describing the tool's input.
        
        String name
        
        The name of the tool.
        
        Optional<JsonValue> annotations
        
        Additional annotations about the tool.
        
        Optional<String> description
        
        The description of the tool.
      - JsonValue; type "mcp_list_tools"constant
        
        The type of the item. Always mcp_list_tools.
        
        MCP_LIST_TOOLS("mcp_list_tools")
      - Optional<String> id
        
        The unique ID of the list.
    - class RealtimeMcpToolCall:
      
      A Realtime item representing an invocation of a tool on an MCP server.
      - String id
        
        The unique ID of the tool call.
      - String arguments
        
        A JSON string of the arguments passed to the tool.
      - String name
        
        The name of the tool that was run.
      - String serverLabel
        
        The label of the MCP server running the tool.
      - JsonValue; type "mcp_call"constant
        
        The type of the item. Always mcp_call.
        
        MCP_CALL("mcp_call")
      - Optional<String> approvalRequestId
        
        The ID of an associated approval request, if any.
      - Optional<Error> error
        
        The error from the tool call, if any.
        
        class RealtimeMcpProtocolError:
        
        long code
        
        String message
        
        JsonValue; type "protocol_error"constant
        
        PROTOCOL_ERROR("protocol_error")
        
        class RealtimeMcpToolExecutionError:
        
        String message
        
        JsonValue; type "tool_execution_error"constant
        
        TOOL_EXECUTION_ERROR("tool_execution_error")
        
        class RealtimeMcphttpError:
        
        long code
        
        String message
        
        JsonValue; type "http_error"constant
        
        HTTP_ERROR("http_error")
      - Optional<String> output
        
        The output from the tool call.
    - class RealtimeMcpApprovalRequest:
      
      A Realtime item requesting human approval of a tool invocation.
      - String id
        
        The unique ID of the approval request.
      - String arguments
        
        A JSON string of arguments for the tool.
      - String name
        
        The name of the tool to run.
      - String serverLabel
        
        The label of the MCP server making the request.
      - JsonValue; type "mcp_approval_request"constant
        
        The type of the item. Always mcp_approval_request.
        
        MCP_APPROVAL_REQUEST("mcp_approval_request")
  - JsonValue; type "conversation.item.created"constant
    
    The event type, must be conversation.item.created.
    - CONVERSATION_ITEM_CREATED("conversation.item.created")
  - Optional<String> previousItemId
    
    The ID of the preceding item in the Conversation context, which allows the client to understand the order of the conversation. Can be null if the item has no predecessor.
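
A client can use previousItemId to keep a local, ordered view of the conversation. The following is a minimal sketch under stated assumptions: the class is hypothetical, items are tracked by ID only, a null previousItemId places the item first, and an unknown predecessor falls back to appending at the end.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical client-side tracker; not part of the SDK.
final class ConversationOrderSketch {
    private final List<String> itemIds = new ArrayList<>();

    /** Records a created item directly after its predecessor. */
    void onItemCreated(String itemId, String previousItemId) {
        if (previousItemId == null) {
            // No predecessor: the item goes at the front.
            itemIds.add(0, itemId);
            return;
        }
        int index = itemIds.indexOf(previousItemId);
        if (index < 0) {
            // Predecessor not seen locally: appending is an assumption of this sketch.
            itemIds.add(itemId);
        } else {
            itemIds.add(index + 1, itemId);
        }
    }

    /** Returns a copy of the current item order. */
    List<String> order() {
        return new ArrayList<>(itemIds);
    }

    public static void main(String[] args) {
        ConversationOrderSketch conversation = new ConversationOrderSketch();
        conversation.onItemCreated("item_a", null);
        conversation.onItemCreated("item_b", "item_a");
        System.out.println(conversation.order());
    }
}
```
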
- class ConversationItemDeletedEvent:
  
  Returned when an item in the conversation is deleted by the client with a conversation.item.delete event. This event is used to synchronize the server's understanding of the conversation history with the client's view.
  - String eventId
    
    The unique ID of the server event.
  - String itemId
    
    The ID of the item that was deleted.
  - JsonValue; type "conversation.item.deleted"constant
    
    The event type, must be conversation.item.deleted.
    - CONVERSATION_ITEM_DELETED("conversation.item.deleted")
- class ConversationItemInputAudioTranscriptionCompletedEvent:
  
  This event is the output of audio transcription for user audio written to the user audio buffer. Transcription begins when the input audio buffer is committed by the client or server (when VAD is enabled). Transcription runs asynchronously with Response creation, so this event may come before or after the Response events.
  
  Realtime API models accept audio natively, and thus input transcription is a separate process run on a separate ASR (Automatic Speech Recognition) model. The transcript may diverge somewhat from the model's interpretation, and should be treated as a rough guide.
  - long contentIndex
    
    The index of the content part containing the audio.
  - String eventId
    
    The unique ID of the server event.
  - String itemId
    
    The ID of the item containing the audio that is being transcribed.
  - String transcript
    
    The transcribed text.
  - JsonValue; type "conversation.item.input_audio_transcription.completed"constant
    
    The event type, must be conversation.item.input_audio_transcription.completed.
    - CONVERSATION_ITEM_INPUT_AUDIO_TRANSCRIPTION_COMPLETED("conversation.item.input_audio_transcription.completed")
  - Usage usage
    
    Usage statistics for the transcription. This is billed according to the ASR model's pricing rather than the realtime model's pricing.
    - class TranscriptTextUsageTokens:
      
      Usage statistics for models billed by token usage.
      - long inputTokens
        
        Number of input tokens billed for this request.
      - long outputTokens
        
        Number of output tokens generated.
      - long totalTokens
        
        Total number of tokens used (input + output).
      - JsonValue; type "tokens"constant
        
        The type of the usage object. Always tokens for this variant.
        
        TOKENS("tokens")
      - Optional<InputTokenDetails> inputTokenDetails
        
        Details about the input tokens billed for this request.
        
        Optional<Long> audioTokens
        
        Number of audio tokens billed for this request.
        
        Optional<Long> textTokens
        
        Number of text tokens billed for this request.
    - class TranscriptTextUsageDuration:
      
      Usage statistics for models billed by audio input duration.
      - double seconds
        
        Duration of the input audio in seconds.
      - JsonValue; type "duration"constant
        
        The type of the usage object. Always duration for this variant.
        
        DURATION("duration")
  - Optional<List<LogProbProperties>> logprobs
    
    The log probabilities of the transcription.
    - String token
      
      The token that was used to generate the log probability.
    - List<long> bytes
      
      The bytes that were used to generate the log probability.
    - double logprob
      
      The log probability of the token.
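The usage field of this event is a union of two shapes distinguished by the type constant ("tokens" or "duration"). A minimal sketch of branching on that discriminator follows. The Usage interface, the two records, and the describe helper are hypothetical stand-ins for illustration, not the SDK's own classes.

```java
import java.util.Locale;

// Hypothetical stand-ins for the two usage variants; not the SDK's own classes.
final class TranscriptionUsageSketch {
    interface Usage {
        String type();
    }

    // Mirrors the token-billed variant. The total is computed here for brevity,
    // whereas the real object carries it as its own field.
    record TokenUsage(long inputTokens, long outputTokens) implements Usage {
        public String type() { return "tokens"; }
        long totalTokens() { return inputTokens + outputTokens; }
    }

    // Mirrors the duration-billed variant.
    record DurationUsage(double seconds) implements Usage {
        public String type() { return "duration"; }
    }

    // Branch on the "type" discriminator to summarize what was billed.
    static String describe(Usage usage) {
        switch (usage.type()) {
            case "tokens": {
                TokenUsage tokens = (TokenUsage) usage;
                return "tokens: " + tokens.totalTokens();
            }
            case "duration": {
                DurationUsage duration = (DurationUsage) usage;
                return String.format(Locale.ROOT, "duration: %.1fs", duration.seconds());
            }
            default:
                return "unknown usage type: " + usage.type();
        }
    }
}
```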
- class ConversationItemInputAudioTranscriptionDeltaEvent:
  
  Returned when the text value of an input audio transcription content part is updated with incremental transcription results.
  - String eventId
    
    The unique ID of the server event.
  - String itemId
    
    The ID of the item containing the audio that is being transcribed.
  - JsonValue; type "conversation.item.input_audio_transcription.delta"constant
    
    The event type, must be conversation.item.input_audio_transcription.delta.
    - CONVERSATION_ITEM_INPUT_AUDIO_TRANSCRIPTION_DELTA("conversation.item.input_audio_transcription.delta")
  - Optional<Long> contentIndex
    
    The index of the content part in the item's content array.
  - Optional<String> delta
    
    The text delta.
  - Optional<List<LogProbProperties>> logprobs
    
    The log probabilities of the transcription. These can be enabled by configuring the session with "include": ["item.input_audio_transcription.logprobs"]. Each entry in the array corresponds to a log probability of which token would be selected for this chunk of transcription. This can help identify whether there were multiple valid options for a given chunk of transcription.
    - String token
      
      The token that was used to generate the log probability.
    - List<long> bytes
      
      The bytes that were used to generate the log probability.
    - double logprob
      
      The log probability of the token.
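One way to use the logprobs entries is to convert each log probability into a linear probability and flag tokens the transcription model was unsure about. The sketch below assumes the values are natural logarithms. The TokenLogprob record, the helper names, and the 0.5 threshold in the usage are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;

final class LogprobSketch {
    // Hypothetical mirror of one logprobs entry (token plus its log probability).
    record TokenLogprob(String token, double logprob) {}

    // Convert a natural-log probability into a linear probability in (0, 1].
    static double probability(double logprob) {
        return Math.exp(logprob);
    }

    // Collect the tokens whose probability falls below the given threshold.
    static List<String> uncertainTokens(List<TokenLogprob> entries, double minProbability) {
        List<String> uncertain = new ArrayList<>();
        for (TokenLogprob entry : entries) {
            if (probability(entry.logprob()) < minProbability) {
                uncertain.add(entry.token());
            }
        }
        return uncertain;
    }
}
```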
- class ConversationItemInputAudioTranscriptionFailedEvent:
  
  Returned when input audio transcription is configured, and a transcription request for a user message failed. These events are separate from other error events so that the client can identify the related Item.
  - long contentIndex
    
    The index of the content part containing the audio.
  - Error error
    
    Details of the transcription error.
    - Optional<String> code
      
      Error code, if any.
    - Optional<String> message
      
      A human-readable error message.
    - Optional<String> param
      
      Parameter related to the error, if any.
    - Optional<String> type
      
      The type of error.
  - String eventId
    
    The unique ID of the server event.
  - String itemId
    
    The ID of the user message item.
  - JsonValue; type "conversation.item.input_audio_transcription.failed"constant
    
    The event type, must be conversation.item.input_audio_transcription.failed.
    - CONVERSATION_ITEM_INPUT_AUDIO_TRANSCRIPTION_FAILED("conversation.item.input_audio_transcription.failed")
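The delta, completed, and failed transcription events all identify their target by item ID and content index, so a client can keep one transcript state per content part. The following is a rough sketch of that bookkeeping. InputTranscriptTracker and its method names are hypothetical, and the event fields are passed as plain arguments rather than SDK event objects.

```java
import java.util.HashMap;
import java.util.Map;

final class InputTranscriptTracker {
    private final Map<String, StringBuilder> partial = new HashMap<>();
    private final Map<String, String> completed = new HashMap<>();
    private final Map<String, String> failed = new HashMap<>();

    private static String key(String itemId, long contentIndex) {
        return itemId + "#" + contentIndex;
    }

    // Append an incremental delta to the in-progress transcript.
    void onDelta(String itemId, long contentIndex, String delta) {
        partial.computeIfAbsent(key(itemId, contentIndex), k -> new StringBuilder()).append(delta);
    }

    // The completed event carries the full transcript, so it replaces the partial text.
    void onCompleted(String itemId, long contentIndex, String transcript) {
        String k = key(itemId, contentIndex);
        partial.remove(k);
        completed.put(k, transcript);
    }

    // Record the failure message and discard any partial text.
    void onFailed(String itemId, long contentIndex, String message) {
        String k = key(itemId, contentIndex);
        partial.remove(k);
        failed.put(k, message);
    }

    boolean hasFailed(String itemId, long contentIndex) {
        return failed.containsKey(key(itemId, contentIndex));
    }

    // Best text known so far: the final transcript if present, else the partial one.
    String currentText(String itemId, long contentIndex) {
        String k = key(itemId, contentIndex);
        if (completed.containsKey(k)) {
            return completed.get(k);
        }
        StringBuilder builder = partial.get(k);
        return builder == null ? "" : builder.toString();
    }
}
```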
- ConversationItemRetrieved
  - String eventId
    
    The unique ID of the server event.
  - ConversationItem item
    
    A single item within a Realtime conversation.
  - JsonValue; type "conversation.item.retrieved"constant
    
    The event type, must be conversation.item.retrieved.
    - CONVERSATION_ITEM_RETRIEVED("conversation.item.retrieved")
- class ConversationItemTruncatedEvent:
  
  Returned when an earlier assistant audio message item is truncated by the client with a conversation.item.truncate event. This event is used to synchronize the server's understanding of the audio with the client's playback.
  
  This action will truncate the audio and remove the server-side text transcript to ensure there is no text in the context that hasn't been heard by the user.
  - long audioEndMs
    
    The duration up to which the audio was truncated, in milliseconds.
  - long contentIndex
    
    The index of the content part that was truncated.
  - String eventId
    
    The unique ID of the server event.
  - String itemId
    
    The ID of the assistant message item that was truncated.
  - JsonValue; type "conversation.item.truncated"constant
    
    The event type, must be conversation.item.truncated.
    - CONVERSATION_ITEM_TRUNCATED("conversation.item.truncated")
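To truncate at the point the user actually heard, a client needs the playback position in milliseconds. The sketch below derives it from the number of PCM bytes already played. It uses the 24000 Hz rate documented for audio/pcm and assumes 16-bit mono samples, which this reference does not state. TruncationSketch and playedMillis are hypothetical names.

```java
final class TruncationSketch {
    static final int SAMPLE_RATE_HZ = 24_000; // rate documented for audio/pcm
    static final int BYTES_PER_SAMPLE = 2;    // assumption: 16-bit mono samples

    // Convert a count of played PCM bytes into elapsed playback milliseconds.
    static long playedMillis(long playedBytes) {
        long samples = playedBytes / BYTES_PER_SAMPLE;
        return samples * 1000L / SAMPLE_RATE_HZ;
    }
}
```

The result is the kind of value a client would send as the audio end position when truncating, and would expect to see echoed in audioEndMs.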
- class RealtimeErrorEvent:
  
  Returned when an error occurs, which could be a client problem or a server problem. Most errors are recoverable and the session will stay open. We recommend that implementors monitor and log error messages by default.
  - RealtimeError error
    
    Details of the error.
    - String message
      
      A human-readable error message.
    - String type
      
      The type of error (e.g., "invalid_request_error", "server_error").
    - Optional<String> code
      
      Error code, if any.
    - Optional<String> eventId
      
      The event_id of the client event that caused the error, if applicable.
    - Optional<String> param
      
      Parameter related to the error, if any.
  - String eventId
    
    The unique ID of the server event.
  - JsonValue; type "error"constant
    
    The event type, must be error.
    - ERROR("error")
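Because the error payload can carry the event_id of the client event that caused it, a client can log errors alongside the request that triggered them. A minimal sketch follows, assuming the client remembers a short description of each event it sends. ErrorCorrelator and its methods are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

final class ErrorCorrelator {
    // Client event ID -> short description of what was sent.
    private final Map<String, String> sent = new HashMap<>();

    void onClientEventSent(String clientEventId, String description) {
        sent.put(clientEventId, description);
    }

    // Build a log line, naming the causing client event when it is known.
    String describe(String type, String message, Optional<String> causingEventId) {
        String line = type + ": " + message;
        Optional<String> cause = causingEventId.map(sent::get);
        return cause.map(description -> line + " (caused by " + description + ")").orElse(line);
    }
}
```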
- class InputAudioBufferClearedEvent:
  
  Returned when the input audio buffer is cleared by the client with an input_audio_buffer.clear event.
  - String eventId
    
    The unique ID of the server event.
  - JsonValue; type "input_audio_buffer.cleared"constant
    
    The event type, must be input_audio_buffer.cleared.
    - INPUT_AUDIO_BUFFER_CLEARED("input_audio_buffer.cleared")
- class InputAudioBufferCommittedEvent:
  
  Returned when an input audio buffer is committed, either by the client or automatically in server VAD mode. The item_id property is the ID of the user message item that will be created, thus a conversation.item.created event will also be sent to the client.
  - String eventId
    
    The unique ID of the server event.
  - String itemId
    
    The ID of the user message item that will be created.
  - JsonValue; type "input_audio_buffer.committed"constant
    
    The event type, must be input_audio_buffer.committed.
    - INPUT_AUDIO_BUFFER_COMMITTED("input_audio_buffer.committed")
  - Optional<String> previousItemId
    
    The ID of the preceding item after which the new item will be inserted. Can be null if the item has no predecessor.
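A client that mirrors conversation order locally can use previousItemId to place each new item. The sketch below treats a null predecessor as "insert at the front" and appends when the predecessor is unknown. The second rule is an assumption, not documented behavior. ConversationOrder is a hypothetical helper.

```java
import java.util.ArrayList;
import java.util.List;

final class ConversationOrder {
    private final List<String> itemIds = new ArrayList<>();

    // Insert itemId directly after previousItemId.
    void insertAfter(String previousItemId, String itemId) {
        if (previousItemId == null) {
            itemIds.add(0, itemId); // no predecessor: the item goes first
            return;
        }
        int index = itemIds.indexOf(previousItemId);
        if (index < 0) {
            itemIds.add(itemId); // assumption: unknown predecessor, append at the end
        } else {
            itemIds.add(index + 1, itemId);
        }
    }

    List<String> snapshot() {
        return List.copyOf(itemIds);
    }
}
```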
- class InputAudioBufferDtmfEventReceivedEvent:
  
  SIP only: Returned when a DTMF event is received. A DTMF event is a message that represents a telephone keypad press (0–9, *, #, A–D). The event property is the key that the user pressed. The received_at property is the UTC Unix timestamp at which the server received the event.
  - String event
    
    The telephone keypad key that was pressed by the user.
  - long receivedAt
    
    UTC Unix timestamp at which the DTMF event was received by the server.
  - JsonValue; type "input_audio_buffer.dtmf_event_received"constant
    
    The event type, must be input_audio_buffer.dtmf_event_received.
    - INPUT_AUDIO_BUFFER_DTMF_EVENT_RECEIVED("input_audio_buffer.dtmf_event_received")
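Since each DTMF event carries a single key press, multi-digit input has to be assembled on the client. The sketch below uses a common IVR convention, which is an assumption here and not part of this API: "#" submits the entry and "*" clears it. DtmfCollector is a hypothetical helper.

```java
import java.util.Optional;

final class DtmfCollector {
    private final StringBuilder digits = new StringBuilder();

    // Feed one key press. Returns the finished entry when "#" arrives.
    Optional<String> onKey(String event) {
        if ("#".equals(event)) {
            String entry = digits.toString();
            digits.setLength(0);
            return Optional.of(entry);
        }
        if ("*".equals(event)) {
            digits.setLength(0); // assumed convention: "*" clears the entry
            return Optional.empty();
        }
        digits.append(event);
        return Optional.empty();
    }
}
```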
- class InputAudioBufferSpeechStartedEvent:
  
  Sent by the server when in server_vad mode to indicate that speech has been detected in the audio buffer. This can happen any time audio is added to the buffer (unless speech is already detected). The client may want to use this event to interrupt audio playback or provide visual feedback to the user.
  
  The client should expect to receive an input_audio_buffer.speech_stopped event when speech stops. The item_id property is the ID of the user message item that will be created when speech stops and will also be included in the input_audio_buffer.speech_stopped event (unless the client manually commits the audio buffer during VAD activation).
  - long audioStartMs
    
    Milliseconds from the start of all audio written to the buffer during the session when speech was first detected. This will correspond to the beginning of audio sent to the model, and thus includes the prefix_padding_ms configured in the Session.
  - String eventId
    
    The unique ID of the server event.
  - String itemId
    
    The ID of the user message item that will be created when speech stops.
  - JsonValue; type "input_audio_buffer.speech_started"constant
    
    The event type, must be input_audio_buffer.speech_started.
    - INPUT_AUDIO_BUFFER_SPEECH_STARTED("input_audio_buffer.speech_started")
- class InputAudioBufferSpeechStoppedEvent:
  
  Returned in server_vad mode when the server detects the end of speech in the audio buffer. The server will also send a conversation.item.created event with the user message item that is created from the audio buffer.
  - long audioEndMs
    
    Milliseconds since the session started when speech stopped. This will correspond to the end of audio sent to the model, and thus includes the min_silence_duration_ms configured in the Session.
  - String eventId
    
    The unique ID of the server event.
  - String itemId
    
    The ID of the user message item that will be created.
  - JsonValue; type "input_audio_buffer.speech_stopped"constant
    
    The event type, must be input_audio_buffer.speech_stopped.
    - INPUT_AUDIO_BUFFER_SPEECH_STOPPED("input_audio_buffer.speech_stopped")
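The speech_started and speech_stopped events share an item ID, so pairing them gives the length of each detected speech segment. The sketch below assumes audioStartMs and audioEndMs are measured on the same session timeline, and the result includes the configured padding and silence described above. SpeechSegments is a hypothetical helper.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalLong;

final class SpeechSegments {
    private final Map<String, Long> startMsByItem = new HashMap<>();

    void onSpeechStarted(String itemId, long audioStartMs) {
        startMsByItem.put(itemId, audioStartMs);
    }

    // Returns the segment length in milliseconds, or empty if no start was seen.
    OptionalLong onSpeechStopped(String itemId, long audioEndMs) {
        Long startMs = startMsByItem.remove(itemId);
        if (startMs == null) {
            return OptionalLong.empty();
        }
        return OptionalLong.of(audioEndMs - startMs);
    }
}
```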
- class RateLimitsUpdatedEvent:
  
  Emitted at the beginning of a Response to indicate the updated rate limits. When a Response is created, some tokens are "reserved" for the output tokens. The rate limits shown here reflect that reservation, which is then adjusted accordingly once the Response is completed.
  - String eventId
    
    The unique ID of the server event.
  - List<RateLimit> rateLimits
    
    List of rate limit information.
    - Optional<Long> limit
      
      The maximum allowed value for the rate limit.
    - Optional<Name> name
      
      The name of the rate limit (requests, tokens).
      - REQUESTS("requests")
      - TOKENS("tokens")
    - Optional<Long> remaining
      
      The remaining value before the limit is reached.
    - Optional<Double> resetSeconds
      
      Seconds until the rate limit resets.
  - JsonValue; type "rate_limits.updated"constant
    
    The event type, must be rate_limits.updated.
    - RATE_LIMITS_UPDATED("rate_limits.updated")
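A client can use the rateLimits list to decide whether to pause before sending more work. The sketch below waits for the longest reset among exhausted limits. The RateLimit record uses plain values where the real fields are optional, and the back-off policy itself is an illustrative assumption.

```java
import java.util.List;

final class RateLimitSketch {
    // Hypothetical mirror of one rate limit entry, with optional fields flattened.
    record RateLimit(String name, long limit, long remaining, double resetSeconds) {}

    // Seconds to wait before sending more work; 0 when every limit has capacity left.
    static double secondsToWait(List<RateLimit> limits) {
        double wait = 0.0;
        for (RateLimit rateLimit : limits) {
            if (rateLimit.remaining() <= 0) {
                wait = Math.max(wait, rateLimit.resetSeconds());
            }
        }
        return wait;
    }

    // Share of the limit still available, in the range 0 to 1.
    static double fractionRemaining(RateLimit rateLimit) {
        return rateLimit.limit() == 0 ? 0.0 : (double) rateLimit.remaining() / rateLimit.limit();
    }
}
```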
- class ResponseAudioDeltaEvent:
  
  Returned when the model-generated audio is updated.
  - long contentIndex
    
    The index of the content part in the item's content array.
  - String delta
    
    Base64-encoded audio data delta.
  - String eventId
    
    The unique ID of the server event.
  - String itemId
    
    The ID of the item.
  - long outputIndex
    
    The index of the output item in the response.
  - String responseId
    
    The ID of the response.
  - JsonValue; type "response.output_audio.delta"constant
    
    The event type, must be response.output_audio.delta.
    - RESPONSE_OUTPUT_AUDIO_DELTA("response.output_audio.delta")
- class ResponseAudioDoneEvent:
  
  Returned when the model-generated audio is done. Also emitted when a Response is interrupted, incomplete, or cancelled.
  - long contentIndex
    
    The index of the content part in the item's content array.
  - String eventId
    
    The unique ID of the server event.
  - String itemId
    
    The ID of the item.
  - long outputIndex
    
    The index of the output item in the response.
  - String responseId
    
    The ID of the response.
  - JsonValue; type "response.output_audio.done"constant
    
    The event type, must be response.output_audio.done.
    - RESPONSE_OUTPUT_AUDIO_DONE("response.output_audio.done")
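Each audio delta is a Base64-encoded chunk, so playback or storage requires decoding and concatenating the chunks until the done event arrives. A minimal sketch using the JDK's java.util.Base64 follows. AudioAssembler and its method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.util.Base64;

final class AudioAssembler {
    private final ByteArrayOutputStream audio = new ByteArrayOutputStream();

    // Decode one Base64 delta and append the raw bytes.
    void onDelta(String base64Delta) {
        byte[] chunk = Base64.getDecoder().decode(base64Delta);
        audio.write(chunk, 0, chunk.length);
    }

    // Return everything received so far and reset for the next content part.
    byte[] onDone() {
        byte[] all = audio.toByteArray();
        audio.reset();
        return all;
    }
}
```

A real client would more likely stream each decoded chunk to the audio device as it arrives. Buffering everything is shown here only to keep the example small.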
- class ResponseAudioTranscriptDeltaEvent:
  
  Returned when the model-generated transcription of audio output is updated.
  - long contentIndex
    
    The index of the content part in the item's content array.
  - String delta
    
    The transcript delta.
  - String eventId
    
    The unique ID of the server event.
  - String itemId
    
    The ID of the item.
  - long outputIndex
    
    The index of the output item in the response.
  - String responseId
    
    The ID of the response.
  - JsonValue; type "response.output_audio_transcript.delta"constant
    
    The event type, must be response.output_audio_transcript.delta.
    - RESPONSE_OUTPUT_AUDIO_TRANSCRIPT_DELTA("response.output_audio_transcript.delta")
- class ResponseAudioTranscriptDoneEvent:
  
  Returned when the model-generated transcription of audio output is done streaming. Also emitted when a Response is interrupted, incomplete, or cancelled.
  - long contentIndex
    
    The index of the content part in the item's content array.
  - String eventId
    
    The unique ID of the server event.
  - String itemId
    
    The ID of the item.
  - long outputIndex
    
    The index of the output item in the response.
  - String responseId
    
    The ID of the response.
  - String transcript
    
    The final transcript of the audio.
  - JsonValue; type "response.output_audio_transcript.done"constant
    
    The event type, must be response.output_audio_transcript.done.
    - RESPONSE_OUTPUT_AUDIO_TRANSCRIPT_DONE("response.output_audio_transcript.done")
- class ResponseContentPartAddedEvent:
  
  Returned when a new content part is added to an assistant message item during response generation.
  - long contentIndex
    
    The index of the content part in the item's content array.
  - String eventId
    
    The unique ID of the server event.
  - String itemId
    
    The ID of the item to which the content part was added.
  - long outputIndex
    
    The index of the output item in the response.
  - Part part
    
    The content part that was added.
    - Optional<String> audio
      
      Base64-encoded audio data (if type is "audio").
    - Optional<String> text
      
      The text content (if type is "text").
    - Optional<String> transcript
      
      The transcript of the audio (if type is "audio").
    - Optional<Type> type
      
      The content type ("text", "audio").
      - TEXT("text")
      - AUDIO("audio")
  - String responseId
    
    The ID of the response.
  - JsonValue; type "response.content_part.added"constant
    
    The event type, must be response.content_part.added.
    - RESPONSE_CONTENT_PART_ADDED("response.content_part.added")
- class ResponseContentPartDoneEvent:
  
  Returned when a content part is done streaming in an assistant message item. Also emitted when a Response is interrupted, incomplete, or cancelled.
  - long contentIndex
    
    The index of the content part in the item's content array.
  - String eventId
    
    The unique ID of the server event.
  - String itemId
    
    The ID of the item.
  - long outputIndex
    
    The index of the output item in the response.
  - Part part
    
    The content part that is done.
    - Optional<String> audio
      
      Base64-encoded audio data (if type is "audio").
    - Optional<String> text
      
      The text content (if type is "text").
    - Optional<String> transcript
      
      The transcript of the audio (if type is "audio").
    - Optional<Type> type
      
      The content type ("text", "audio").
      - TEXT("text")
      - AUDIO("audio")
  - String responseId
    
    The ID of the response.
  - JsonValue; type "response.content_part.done"constant
    
    The event type, must be response.content_part.done.
    - RESPONSE_CONTENT_PART_DONE("response.content_part.done")
- class ResponseCreatedEvent:
  
  Returned when a new Response is created. The first event of response creation, where the response is in an initial state of in_progress.
  - String eventId
    
    The unique ID of the server event.
  - RealtimeResponse response
    
    The response resource.
    - Optional<String> id
      
      The unique ID of the response, which will look like resp_1234.
    - Optional<Audio> audio
      
      Configuration for audio output.
      - Optional<Output> output
        
        Optional<RealtimeAudioFormats> format
        
        The format of the output audio.
        
        AudioPcm
        
        Optional<Rate> rate
        
        The sample rate of the audio. Always 24000.
        
        _24000(24000)
        
        Optional<Type> type
        
        The audio format. Always audio/pcm.
        
        AUDIO_PCM("audio/pcm")
        
        AudioPcmu
        
        Optional<Type> type
        
        The audio format. Always audio/pcmu.
        
        AUDIO_PCMU("audio/pcmu")
        
        AudioPcma
        
        Optional<Type> type
        
        The audio format. Always audio/pcma.
        
        AUDIO_PCMA("audio/pcma")
        
        Optional<Voice> voice
        
        The voice the model uses to respond. Voice cannot be changed during the session once the model has responded with audio at least once. Current voice options are alloy, ash, ballad, coral, echo, sage, shimmer, verse, marin, and cedar. We recommend marin and cedar for best quality.
        
        ALLOY("alloy")
        
        ASH("ash")
        
        BALLAD("ballad")
        
        CORAL("coral")
        
        ECHO("echo")
        
        SAGE("sage")
        
        SHIMMER("shimmer")
        
        VERSE("verse")
        
        MARIN("marin")
        
        CEDAR("cedar")
    - Optional<String> conversationId
      
      Which conversation the response is added to, determined by the conversation field in the response.create event. If auto, the response will be added to the default conversation and the value of conversation_id will be an id like conv_1234. If none, the response will not be added to any conversation and the value of conversation_id will be null. If responses are being triggered automatically by VAD, the response will be added to the default conversation.
    - Optional<MaxOutputTokens> maxOutputTokens
      
      Maximum number of output tokens for a single assistant response, inclusive of tool calls, that was used in this response.
      - long
      - JsonValue;
        
        INF("inf")
    - Optional<Metadata> metadata
      
      Set of up to 16 key-value pairs that can be attached to an object. This can be useful for storing additional information about the object in a structured format, and querying for objects via API or the dashboard.
      
      Keys are strings with a maximum length of 64 characters. Values are strings with a maximum length of 512 characters.
    - Optional<Object> object_
      
      The object type, must be realtime.response.
      - REALTIME_RESPONSE("realtime.response")
    - Optional<List<ConversationItem>> output
      
      The list of output items generated by the response.
      - class RealtimeConversationItemSystemMessage:
        
        A system message in a Realtime conversation can be used to provide additional context or instructions to the model. This is similar but distinct from the instruction prompt provided at the start of a conversation, as system messages can be added at any point in the conversation. For major changes to the conversation's behavior, use instructions, but for smaller updates (e.g. "the user is now asking about a different topic"), use system messages.
      - class RealtimeConversationItemUserMessage:
        
        A user message item in a Realtime conversation.
      - class RealtimeConversationItemAssistantMessage:
        
        An assistant message item in a Realtime conversation.
      - class RealtimeConversationItemFunctionCall:
        
        A function call item in a Realtime conversation.
      - class RealtimeConversationItemFunctionCallOutput:
        
        A function call output item in a Realtime conversation.
      - class RealtimeMcpApprovalResponse:
        
        A Realtime item responding to an MCP approval request.
      - class RealtimeMcpListTools:
        
        A Realtime item listing tools available on an MCP server.
      - class RealtimeMcpToolCall:
        
        A Realtime item representing an invocation of a tool on an MCP server.
      - class RealtimeMcpApprovalRequest:
        
        A Realtime item requesting human approval of a tool invocation.
    - Optional<List<OutputModality>> outputModalities
      
      The set of modalities the model used to respond. Currently the only possible values are ["audio"] and ["text"]. Audio output always includes a text transcript. Setting the output mode to text will disable audio output from the model.
      - TEXT("text")
      - AUDIO("audio")
    - Optional<Status> status
      
      The final status of the response (completed, cancelled, failed, incomplete, or in_progress).
      - COMPLETED("completed")
      - CANCELLED("cancelled")
      - FAILED("failed")
      - INCOMPLETE("incomplete")
      - IN_PROGRESS("in_progress")
    - Optional<RealtimeResponseStatus> statusDetails
      
      Additional details about the status.
      - Optional<Error> error
        
        A description of the error that caused the response to fail, populated when the status is failed.
        
        Optional<String> code
        
        Error code, if any.
        
        Optional<String> type
        
        The type of error.
      - Optional<Reason> reason
        
        The reason the Response did not complete. For a cancelled Response, one of turn_detected (the server VAD detected a new start of speech) or client_cancelled (the client sent a cancel event). For an incomplete Response, one of max_output_tokens or content_filter (the server-side safety filter activated and cut off the response).
        
        TURN_DETECTED("turn_detected")
        
        CLIENT_CANCELLED("client_cancelled")
        
        MAX_OUTPUT_TOKENS("max_output_tokens")
        
        CONTENT_FILTER("content_filter")
      - Optional<Type> type
        
        The type of error that caused the response to fail, corresponding with the status field (completed, cancelled, incomplete, failed).
        
        COMPLETED("completed")
        
        CANCELLED("cancelled")
        
        INCOMPLETE("incomplete")
        
        FAILED("failed")
    - Optional<RealtimeResponseUsage> usage
      
      Usage statistics for the Response. This will correspond to billing. A Realtime API session will maintain a conversation context and append new Items to the Conversation, thus output from previous turns (text and audio tokens) will become the input for later turns.
      - Optional<RealtimeResponseUsageInputTokenDetails> inputTokenDetails
        
        Details about the input tokens used in the Response. Cached tokens are tokens from previous turns in the conversation that are included as context for the current response. Cached tokens here are counted as a subset of input tokens, meaning input tokens will include cached and uncached tokens.
        
        Optional<Long> audioTokens
        
        The number of audio tokens used as input for the Response.
        
        Optional<Long> cachedTokens
        
        The number of cached tokens used as input for the Response.
        
        Optional<CachedTokensDetails> cachedTokensDetails
        
        Details about the cached tokens used as input for the Response.
        
        Optional<Long> audioTokens
        
        The number of cached audio tokens used as input for the Response.
        
        Optional<Long> imageTokens
        
        The number of cached image tokens used as input for the Response.
        
        Optional<Long> textTokens
        
        The number of cached text tokens used as input for the Response.
        
        Optional<Long> imageTokens
        
        The number of image tokens used as input for the Response.
        
        Optional<Long> textTokens
        
        The number of text tokens used as input for the Response.
      - Optional<Long> inputTokens
        
        The number of input tokens used in the Response, including text and audio tokens.
      - Optional<RealtimeResponseUsageOutputTokenDetails> outputTokenDetails
        
        Details about the output tokens used in the Response.
        
        Optional<Long> audioTokens
        
        The number of audio tokens used in the Response.
        
        Optional<Long> textTokens
        
        The number of text tokens used in the Response.
      - Optional<Long> outputTokens
        
        The number of output tokens sent in the Response, including text and audio tokens.
      - Optional<Long> totalTokens
        
        The total number of tokens in the Response including input and output text and audio tokens.
  - JsonValue; type "response.created"constant
    
    The event type, must be response.created.
    - RESPONSE_CREATED("response.created")
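Because cached tokens are counted as a subset of input tokens, the uncached portion has to be derived by subtraction. The sketch below shows that arithmetic on plain long values. UsageSketch and its helpers are hypothetical, and real code would first unwrap the optional fields.

```java
final class UsageSketch {
    // Input tokens that were not served from cache (cached is a subset of input).
    static long uncachedInputTokens(long inputTokens, long cachedTokens) {
        return Math.max(0, inputTokens - cachedTokens);
    }

    // Share of input tokens that came from cache, in the range 0 to 1.
    static double cachedFraction(long inputTokens, long cachedTokens) {
        return inputTokens == 0 ? 0.0 : (double) cachedTokens / inputTokens;
    }

    // Total tokens as described above: input plus output.
    static long totalTokens(long inputTokens, long outputTokens) {
        return inputTokens + outputTokens;
    }
}
```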
- class ResponseDoneEvent:
  
  Returned when a Response is done streaming. Always emitted, no matter the final state. The Response object included in the response.done event will include all output Items in the Response but will omit the raw audio data.
  
  Clients should check the status field of the Response to determine if it was successful (completed) or if there was another outcome: cancelled, failed, or incomplete.
  
  A response will contain all output items that were generated during the response, excluding any audio content.
  - String eventId
    
    The unique ID of the server event.
  - RealtimeResponse response
    
    The response resource.
  - JsonValue; type "response.done"constant
    
    The event type, must be response.done.
    - RESPONSE_DONE("response.done")
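Checking the status field on response.done can be sketched as a small switch over the documented status values, pulling in the reason from the status details for cancelled and incomplete responses. ResponseOutcome and summarize are hypothetical names, and the fields are passed as plain strings rather than SDK enums.

```java
final class ResponseOutcome {
    // Summarize the final state of a response from its status and optional reason.
    static String summarize(String status, String reason) {
        switch (status) {
            case "completed":
                return "completed";
            case "cancelled":
            case "incomplete":
                // reason is one of turn_detected, client_cancelled,
                // max_output_tokens, or content_filter when present
                return status + ": " + (reason == null ? "unknown reason" : reason);
            case "failed":
                return "failed";
            default:
                return "unexpected status: " + status;
        }
    }
}
```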
- class ResponseFunctionCallArgumentsDeltaEvent:
  
  Returned when the model-generated function call arguments are updated.
  - String callId
    
    The ID of the function call.
  - String delta
    
    The arguments delta as a JSON string.
  - String eventId
    
    The unique ID of the server event.
  - String itemId
    
    The ID of the function call item.
  - long outputIndex
    
    The index of the output item in the response.
  - String responseId
    
    The ID of the response.
  - JsonValue; type "response.function_call_arguments.delta"constant
    
    The event type, must be response.function_call_arguments.delta.
    - RESPONSE_FUNCTION_CALL_ARGUMENTS_DELTA("response.function_call_arguments.delta")
- class ResponseFunctionCallArgumentsDoneEvent:
  
  Returned when the model-generated function call arguments are done streaming. Also emitted when a Response is interrupted, incomplete, or cancelled.
  - String arguments
    
    The final arguments as a JSON string.
  - String callId
    
    The ID of the function call.
  - String eventId
    
    The unique ID of the server event.
  - String itemId
    
    The ID of the function call item.
  - String name
    
    The name of the function that was called.
  - long outputIndex
    
    The index of the output item in the response.
  - String responseId
    
    The ID of the response.
  - JsonValue; type "response.function_call_arguments.done"constant
    
    The event type, must be response.function_call_arguments.done.
    - RESPONSE_FUNCTION_CALL_ARGUMENTS_DONE("response.function_call_arguments.done")
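Function call arguments arrive as JSON string fragments, so they are only safe to parse once the done event delivers the full string. The sketch below buffers deltas per call ID for progress display and checks them against the final arguments. FunctionCallArgs is a hypothetical helper, and JSON parsing is left out.

```java
import java.util.HashMap;
import java.util.Map;

final class FunctionCallArgs {
    private final Map<String, StringBuilder> byCallId = new HashMap<>();

    // Append one streamed fragment of the JSON arguments string.
    void onDelta(String callId, String delta) {
        byCallId.computeIfAbsent(callId, k -> new StringBuilder()).append(delta);
    }

    // Text streamed so far for this call; may not yet be valid JSON.
    String partial(String callId) {
        StringBuilder builder = byCallId.get(callId);
        return builder == null ? "" : builder.toString();
    }

    // Drop the buffer and report whether the streamed text matched the final arguments.
    boolean onDone(String callId, String arguments) {
        StringBuilder builder = byCallId.remove(callId);
        return builder != null && builder.toString().equals(arguments);
    }
}
```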
- class ResponseOutputItemAddedEvent:
  
  Returned when a new Item is created during Response generation.
  - String eventId
    
    The unique ID of the server event.
  - ConversationItem item
    
    A single item within a Realtime conversation.
  - long outputIndex
    
    The index of the output item in the Response.
  - String responseId
    
    The ID of the Response to which the item belongs.
  - JsonValue; type "response.output_item.added"constant
    
    The event type, must be response.output_item.added.
    - RESPONSE_OUTPUT_ITEM_ADDED("response.output_item.added")
- class ResponseOutputItemDoneEvent:
  
  Returned when an Item is done streaming. Also emitted when a Response is interrupted, incomplete, or cancelled.
  - String eventId
    
    The unique ID of the server event.
  - ConversationItem item
    
    A single item within a Realtime conversation.
  - long outputIndex
    
    The index of the output item in the Response.
  - String responseId
    
    The ID of the Response to which the item belongs.
  - JsonValue; type "response.output_item.done"constant
    
    The event type, must be response.output_item.done.
    - RESPONSE_OUTPUT_ITEM_DONE("response.output_item.done")
- class ResponseTextDeltaEvent:
  
  Returned when the text value of an "output_text" content part is updated.
  - long contentIndex
    
    The index of the content part in the item's content array.
  - String delta
    
    The text delta.
  - String eventId
    
    The unique ID of the server event.
  - String itemId
    
    The ID of the item.
  - long outputIndex
    
    The index of the output item in the response.
  - String responseId
    
    The ID of the response.
  - JsonValue; type "response.output_text.delta"constant
    
    The event type, must be response.output_text.delta.
    - RESPONSE_OUTPUT_TEXT_DELTA("response.output_text.delta")
- class ResponseTextDoneEvent:
  
  Returned when the text value of an "output_text" content part is done streaming. Also emitted when a Response is interrupted, incomplete, or cancelled.
  - long contentIndex
    
    The index of the content part in the item's content array.
  - String eventId
    
    The unique ID of the server event.
  - String itemId
    
    The ID of the item.
  - long outputIndex
    
    The index of the output item in the response.
  - String responseId
    
    The ID of the response.
  - String text
    
    The final text content.
  - JsonValue; type "response.output_text.done"constant
    
    The event type, must be response.output_text.done.
    - RESPONSE_OUTPUT_TEXT_DONE("response.output_text.done")
- class SessionCreatedEvent:
  
  Returned when a Session is created. Emitted automatically when a new connection is established as the first server event. This event will contain the default Session configuration.
  - String eventId
    
    The unique ID of the server event.
  - Session session
    
    The session configuration.
    - class RealtimeSessionCreateRequest:
      
      Realtime session object configuration.
      - JsonValue; type "realtime"constant
        
        The type of session to create. Always realtime for the Realtime API.
        
        REALTIME("realtime")
      - Optional<RealtimeAudioConfig> audio
        
        Configuration for input and output audio.
        
        Optional<RealtimeAudioConfigInput> input
        
        Optional<RealtimeAudioFormats> format
        
        The format of the input audio.
        
        Optional<NoiseReduction> noiseReduction
        
        Configuration for input audio noise reduction. This can be set to null to turn off. Noise reduction filters audio added to the input audio buffer before it is sent to VAD and the model. Filtering the audio can improve VAD and turn detection accuracy (reducing false positives) and model performance by improving perception of the input audio.
        
        Optional<NoiseReductionType> type
        
        Type of noise reduction. near_field is for close-talking microphones such as headphones, far_field is for far-field microphones such as laptop or conference room microphones.
        
        NEAR_FIELD("near_field")
        
        FAR_FIELD("far_field")
        
        Optional<AudioTranscription> transcription
        
        Configuration for input audio transcription. Defaults to off, and can be set to null to turn it off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through the /audio/transcriptions endpoint and should be treated as guidance on the input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service.
        
        Optional<Delay> delay
        
        Controls how long the model waits before emitting transcription text. Higher values can improve transcription accuracy at the cost of latency. Only supported with gpt-realtime-whisper in GA Realtime sessions.
        
        MINIMAL("minimal")
        
        LOW("low")
        
        MEDIUM("medium")
        
        HIGH("high")
        
        XHIGH("xhigh")
        
        Optional<String> language
        
        The language of the input audio. Supplying the input language in ISO-639-1 (e.g. en) format will improve accuracy and latency.
        
        Optional<Model> model
        
        The model to use for transcription. Current options are whisper-1, gpt-4o-mini-transcribe, gpt-4o-mini-transcribe-2025-12-15, gpt-4o-transcribe, gpt-4o-transcribe-diarize, and gpt-realtime-whisper. Use gpt-4o-transcribe-diarize when you need diarization with speaker labels.
        
        WHISPER_1("whisper-1")
        
        GPT_4O_MINI_TRANSCRIBE("gpt-4o-mini-transcribe")
        
        GPT_4O_MINI_TRANSCRIBE_2025_12_15("gpt-4o-mini-transcribe-2025-12-15")
        
        GPT_4O_TRANSCRIBE("gpt-4o-transcribe")
        
        GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize")
        
        GPT_REALTIME_WHISPER("gpt-realtime-whisper")
        
        Optional<String> prompt
        
        An optional text to guide the model's style or continue a previous audio segment. For whisper-1, the prompt is a list of keywords. For gpt-4o-transcribe models (excluding gpt-4o-transcribe-diarize), the prompt is a free text string, for example "expect words related to technology". Prompt is not supported with gpt-realtime-whisper in GA Realtime sessions.
        
        Optional<RealtimeAudioInputTurnDetection> turnDetection
        
        Configuration for turn detection, either Server VAD or Semantic VAD. This can be set to null to turn off, in which case the client must manually trigger a model response.
        
        Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.
        
        Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.
        
        For gpt-realtime-whisper transcription sessions, turn detection must be set to null; VAD is not supported.
        
        ServerVad
        
        JsonValue; type "server_vad"constant
        
        Type of turn detection, server_vad to turn on simple Server VAD.
        
        SERVER_VAD("server_vad")
        
        Optional<Boolean> createResponse
        
        Whether or not to automatically generate a response when a VAD stop event occurs. If interrupt_response is set to false, this may fail to create a response if the model is already responding.
        
        If both create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted.
        
        Optional<Long> idleTimeoutMs
        
        Optional timeout after which a model response will be triggered automatically. This is useful for situations in which a long pause from the user is unexpected, such as a phone call. The model will effectively prompt the user to continue the conversation based on the current context.
        
        The timeout value will be applied after the last model response's audio has finished playing, i.e. it's set to the response.done time plus audio playback duration.
        
        An input_audio_buffer.timeout_triggered event (plus events associated with the Response) will be emitted when the timeout is reached. Idle timeout is currently only supported for server_vad mode.
        
        Optional<Boolean> interruptResponse
        
        Whether or not to automatically interrupt (cancel) any ongoing response with output to the default conversation (i.e. conversation of auto) when a VAD start event occurs. If true, the response will be cancelled; otherwise it will continue until complete.
        
        If both create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted.
        
        Optional<Long> prefixPaddingMs
        
        Used only for server_vad mode. Amount of audio to include before the VAD detected speech (in milliseconds). Defaults to 300ms.
        
        Optional<Long> silenceDurationMs
        
        Used only for server_vad mode. Duration of silence to detect speech stop (in milliseconds). Defaults to 500ms. With shorter values the model will respond more quickly, but may jump in on short pauses from the user.
        
        Optional<Double> threshold
        
        Used only for server_vad mode. Activation threshold for VAD (0.0 to 1.0). Defaults to 0.5. A higher threshold will require louder audio to activate the model, and thus might perform better in noisy environments.
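
        As a rough illustration of the Server VAD numbers above, the sketch below collects the documented defaults, checks the threshold range, and computes when an idle timeout would elapse under one reading of the description (the timer starts at the response.done time plus the audio playback duration). The class ServerVadSketch and its method names are hypothetical and are not part of the SDK.

        ```java
        /** Hypothetical helper for illustration only; not an SDK type. */
        final class ServerVadSketch {
            // Defaults quoted in the field descriptions above.
            static final long DEFAULT_PREFIX_PADDING_MS = 300;
            static final long DEFAULT_SILENCE_DURATION_MS = 500;
            static final double DEFAULT_THRESHOLD = 0.5;

            /** Rejects thresholds outside the documented 0.0 to 1.0 range. */
            static double checkThreshold(double threshold) {
                if (Double.isNaN(threshold) || threshold < 0.0 || threshold > 1.0) {
                    throw new IllegalArgumentException("threshold must be within [0.0, 1.0]: " + threshold);
                }
                return threshold;
            }

            /**
             * Assumed reading of idleTimeoutMs: the timer starts once the last response's
             * audio has finished playing (response.done time plus playback duration).
             */
            static long idleTimeoutElapsesAtMs(long responseDoneAtMs, long playbackDurationMs, long idleTimeoutMs) {
                if (playbackDurationMs < 0 || idleTimeoutMs < 0) {
                    throw new IllegalArgumentException("durations must be non-negative");
                }
                return responseDoneAtMs + playbackDurationMs + idleTimeoutMs;
            }

            public static void main(String[] args) {
                System.out.println(checkThreshold(DEFAULT_THRESHOLD));
                System.out.println(idleTimeoutElapsesAtMs(10_000L, 2_500L, 5_000L));
            }
        }
        ```

        A client that plays audio locally could use a calculation like this to estimate when to expect input_audio_buffer.timeout_triggered, though the server remains the source of truth for the actual timing.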
        
        SemanticVad
        
        JsonValue; type "semantic_vad"constant
        
        Type of turn detection, semantic_vad to turn on Semantic VAD.
        
        SEMANTIC_VAD("semantic_vad")
        
        Optional<Boolean> createResponse
        
        Whether or not to automatically generate a response when a VAD stop event occurs.
        
        Optional<Eagerness> eagerness
        
        Used only for semantic_vad mode. The eagerness of the model to respond. low will wait longer for the user to continue speaking, high will respond more quickly. auto is the default and is equivalent to medium. low, medium, and high have max timeouts of 8s, 4s, and 2s respectively.
        
        LOW("low")
        
        MEDIUM("medium")
        
        HIGH("high")
        
        AUTO("auto")
        
        Optional<Boolean> interruptResponse
        
        Whether or not to automatically interrupt any ongoing response with output to the default conversation (i.e. conversation of auto) when a VAD start event occurs.
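
        The eagerness values described above map to maximum timeouts of 8s, 4s, and 2s, with auto treated as medium. A minimal sketch of that lookup follows; EagernessSketch is a hypothetical name used only for illustration, and the lookup works on the wire strings rather than on the SDK enum.

        ```java
        import java.util.Locale;

        /** Hypothetical helper for illustration only; not an SDK type. */
        final class EagernessSketch {
            /** Returns the documented maximum timeout, in seconds, for an eagerness value. */
            static int maxTimeoutSeconds(String eagerness) {
                switch (eagerness.toLowerCase(Locale.ROOT)) {
                    case "low":
                        return 8;
                    case "auto":   // documented as equivalent to medium
                    case "medium":
                        return 4;
                    case "high":
                        return 2;
                    default:
                        throw new IllegalArgumentException("unknown eagerness: " + eagerness);
                }
            }

            public static void main(String[] args) {
                System.out.println(maxTimeoutSeconds("auto"));
            }
        }
        ```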
        
        Optional<RealtimeAudioConfigOutput> output
        
        Optional<RealtimeAudioFormats> format
        
        The format of the output audio.
        
        Optional<Double> speed
        
        The speed of the model's spoken response as a multiple of the original speed. 1.0 is the default speed. 0.25 is the minimum speed. 1.5 is the maximum speed. This value can only be changed in between model turns, not while a response is in progress.
        
        This parameter is a post-processing adjustment applied to the audio after it is generated. It is also possible to prompt the model to speak faster or slower.
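
        The documented speed bounds (0.25 minimum, 1.5 maximum, 1.0 default) can be enforced on the client before a value is sent. The sketch below clamps out-of-range requests; clamping is a design choice made here for illustration, and the server may instead reject such values. OutputSpeedSketch is a hypothetical name, not an SDK type.

        ```java
        /** Hypothetical helper for illustration only; not an SDK type. */
        final class OutputSpeedSketch {
            static final double MIN_SPEED = 0.25;
            static final double MAX_SPEED = 1.5;
            static final double DEFAULT_SPEED = 1.0;

            /** Clamps a requested speed into the documented range; NaN falls back to the default. */
            static double clampSpeed(double requested) {
                if (Double.isNaN(requested)) {
                    return DEFAULT_SPEED;
                }
                return Math.max(MIN_SPEED, Math.min(MAX_SPEED, requested));
            }

            public static void main(String[] args) {
                System.out.println(clampSpeed(2.0));
            }
        }
        ```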
        
        Optional<Voice> voice
        
        The voice the model uses to respond. Supported built-in voices are alloy, ash, ballad, coral, echo, sage, shimmer, verse, marin, and cedar. You may also provide a custom voice object with an id, for example { "id": "voice_1234" }. Voice cannot be changed during the session once the model has responded with audio at least once. We recommend marin and cedar for best quality.
        
        String
        
        enum UnionMember1:
        
        ALLOY("alloy")
        
        ASH("ash")
        
        BALLAD("ballad")
        
        CORAL("coral")
        
        ECHO("echo")
        
        SAGE("sage")
        
        SHIMMER("shimmer")
        
        VERSE("verse")
        
        MARIN("marin")
        
        CEDAR("cedar")
        
        class Id:
        
        Custom voice reference.
        
        String id
        
        The custom voice ID, e.g. voice_1234.
      - Optional<List<Include>> include
        
        Additional fields to include in server outputs.
        
        item.input_audio_transcription.logprobs: Include logprobs for input audio transcription.
        
        ITEM_INPUT_AUDIO_TRANSCRIPTION_LOGPROBS("item.input_audio_transcription.logprobs")
      - Optional<String> instructions
        
        The default system instructions (i.e. system message) prepended to model calls. This field allows the client to guide the model on desired responses. The model can be instructed on response content and format (e.g. "be extremely succinct", "act friendly", "here are examples of good responses") and on audio behavior (e.g. "talk quickly", "inject emotion into your voice", "laugh frequently"). The instructions are not guaranteed to be followed by the model, but they provide guidance to the model on the desired behavior.
        
        Note that the server sets default instructions which will be used if this field is not set and are visible in the session.created event at the start of the session.
      - Optional<MaxOutputTokens> maxOutputTokens
        
        Maximum number of output tokens for a single assistant response, inclusive of tool calls. Provide an integer between 1 and 4096 to limit output tokens, or inf for the maximum available tokens for a given model. Defaults to inf.
        
        long
        
        JsonValue;
        
        INF("inf")
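
        Because maxOutputTokens accepts either an integer between 1 and 4096 or the literal inf, a client that reads the value from configuration has to handle both shapes. The following sketch parses a raw string into an OptionalLong, using an empty result to stand for inf; MaxOutputTokensSketch and the string-based input are assumptions for illustration and do not reflect how the SDK models this union.

        ```java
        import java.util.OptionalLong;

        /** Hypothetical helper for illustration only; not an SDK type. */
        final class MaxOutputTokensSketch {
            static final long MIN_TOKENS = 1;
            static final long MAX_TOKENS = 4096;

            /** Returns empty for "inf" (no explicit cap), otherwise the validated integer limit. */
            static OptionalLong parse(String raw) {
                if ("inf".equals(raw)) {
                    return OptionalLong.empty();
                }
                long value = Long.parseLong(raw); // NumberFormatException for non-numeric input
                if (value < MIN_TOKENS || value > MAX_TOKENS) {
                    throw new IllegalArgumentException("expected 1..4096 or inf, got " + raw);
                }
                return OptionalLong.of(value);
            }

            public static void main(String[] args) {
                System.out.println(parse("inf").isPresent());
                System.out.println(parse("1024").getAsLong());
            }
        }
        ```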
      - Optional<Model> model
        
        The Realtime model used for this session.
        
        GPT_REALTIME("gpt-realtime")
        
        GPT_REALTIME_1_5("gpt-realtime-1.5")
        
        GPT_REALTIME_2("gpt-realtime-2")
        
        GPT_REALTIME_2025_08_28("gpt-realtime-2025-08-28")
        
        GPT_4O_REALTIME_PREVIEW("gpt-4o-realtime-preview")
        
        GPT_4O_REALTIME_PREVIEW_2024_10_01("gpt-4o-realtime-preview-2024-10-01")
        
        GPT_4O_REALTIME_PREVIEW_2024_12_17("gpt-4o-realtime-preview-2024-12-17")
        
        GPT_4O_REALTIME_PREVIEW_2025_06_03("gpt-4o-realtime-preview-2025-06-03")
        
        GPT_4O_MINI_REALTIME_PREVIEW("gpt-4o-mini-realtime-preview")
        
        GPT_4O_MINI_REALTIME_PREVIEW_2024_12_17("gpt-4o-mini-realtime-preview-2024-12-17")
        
        GPT_REALTIME_MINI("gpt-realtime-mini")
        
        GPT_REALTIME_MINI_2025_10_06("gpt-realtime-mini-2025-10-06")
        
        GPT_REALTIME_MINI_2025_12_15("gpt-realtime-mini-2025-12-15")
        
        GPT_AUDIO_1_5("gpt-audio-1.5")
        
        GPT_AUDIO_MINI("gpt-audio-mini")
        
        GPT_AUDIO_MINI_2025_10_06("gpt-audio-mini-2025-10-06")
        
        GPT_AUDIO_MINI_2025_12_15("gpt-audio-mini-2025-12-15")
      - Optional<List<OutputModality>> outputModalities
        
        The set of modalities the model can respond with. It defaults to ["audio"], indicating that the model will respond with audio plus a transcript. ["text"] can be used to make the model respond with text only. It is not possible to request both text and audio at the same time.
        
        TEXT("text")
        
        AUDIO("audio")
      - Optional<Boolean> parallelToolCalls
        
        Whether the model may call multiple tools in parallel. Only supported by reasoning Realtime models such as gpt-realtime-2.
      - Optional<ResponsePrompt> prompt
        
        Reference to a prompt template and its variables.
        
        String id
        
        The unique identifier of the prompt template to use.
        
        Optional<Variables> variables
        
        Optional map of values to substitute in for variables in your prompt. The substitution values can either be strings, or other Response input types like images or files.
        
        String
        
        class ResponseInputText:
        
        A text input to the model.
        
        String text
        
        The text input to the model.
        
        JsonValue; type "input_text"constant
        
        The type of the input item. Always input_text.
        
        INPUT_TEXT("input_text")
        
        class ResponseInputImage:
        
        An image input to the model. Learn about image inputs.
        
        Detail detail
        
        The detail level of the image to be sent to the model. One of high, low, auto, or original. Defaults to auto.
        
        LOW("low")
        
        HIGH("high")
        
        AUTO("auto")
        
        ORIGINAL("original")
        
        JsonValue; type "input_image"constant
        
        The type of the input item. Always input_image.
        
        INPUT_IMAGE("input_image")
        
        Optional<String> fileId
        
        The ID of the file to be sent to the model.
        
        Optional<String> imageUrl
        
        The URL of the image to be sent to the model. A fully qualified URL or base64 encoded image in a data URL.
        
        class ResponseInputFile:
        
        A file input to the model.
        
        JsonValue; type "input_file"constant
        
        The type of the input item. Always input_file.
        
        INPUT_FILE("input_file")
        
        Optional<Detail> detail
        
        The detail level of the file to be sent to the model. Use low for the default rendering behavior, or high to render the file at higher quality. Defaults to low.
        
        LOW("low")
        
        HIGH("high")
        
        Optional<String> fileData
        
        The content of the file to be sent to the model.
        
        Optional<String> fileId
        
        The ID of the file to be sent to the model.
        
        Optional<String> fileUrl
        
        The URL of the file to be sent to the model.
        
        Optional<String> filename
        
        The name of the file to be sent to the model.
        
        Optional<String> version
        
        Optional version of the prompt template.
      - Optional<RealtimeReasoning> reasoning
        
        Configuration for reasoning-capable Realtime models such as gpt-realtime-2.
        
        Optional<RealtimeReasoningEffort> effort
        
        Constrains effort on reasoning for reasoning-capable Realtime models such as gpt-realtime-2.
        
        MINIMAL("minimal")
        
        LOW("low")
        
        MEDIUM("medium")
        
        HIGH("high")
        
        XHIGH("xhigh")
      - Optional<RealtimeToolChoiceConfig> toolChoice
        
        How the model chooses tools. Provide one of the string modes or force a specific function/MCP tool.
        
        enum ToolChoiceOptions:
        
        Controls which (if any) tool is called by the model.
        
        none means the model will not call any tool and instead generates a message.
        
        auto means the model can pick between generating a message or calling one or more tools.
        
        required means the model must call one or more tools.
        
        NONE("none")
        
        AUTO("auto")
        
        REQUIRED("required")
        
        class ToolChoiceFunction:
        
        Use this option to force the model to call a specific function.
        
        String name
        
        The name of the function to call.
        
        JsonValue; type "function"constant
        
        For function calling, the type is always function.
        
        FUNCTION("function")
        
        class ToolChoiceMcp:
        
        Use this option to force the model to call a specific tool on a remote MCP server.
        
        String serverLabel
        
        The label of the MCP server to use.
        
        JsonValue; type "mcp"constant
        
        For MCP tools, the type is always mcp.
        
        MCP("mcp")
        
        Optional<String> name
        
        The name of the tool to call on the server.
      - Optional<List<RealtimeToolsConfigUnion>> tools
        
        Tools available to the model.
        
        class RealtimeFunctionTool:
        
        Optional<String> description
        
        The description of the function, including guidance on when and how to call it, and guidance about what to tell the user when calling (if anything).
        
        Optional<String> name
        
        The name of the function.
        
        Optional<JsonValue> parameters
        
        Parameters of the function in JSON Schema.
        
        Optional<Type> type
        
        The type of the tool, i.e. function.
        
        FUNCTION("function")
        
        Mcp
        
        String serverLabel
        
        A label for this MCP server, used to identify it in tool calls.
        
        JsonValue; type "mcp"constant
        
        The type of the MCP tool. Always mcp.
        
        MCP("mcp")
        
        Optional<AllowedTools> allowedTools
        
        List of allowed tool names or a filter object.
        
        List<String>
        
        class McpToolFilter:
        
        A filter object to specify which tools are allowed.
        
        Optional<Boolean> readOnly
        
        Indicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with readOnlyHint, it will match this filter.
        
        Optional<List<String>> toolNames
        
        List of allowed tool names.
        
        Optional<String> authorization
        
        An OAuth access token that can be used with a remote MCP server, either with a custom MCP server URL or a service connector. Your application must handle the OAuth authorization flow and provide the token here.
        
        Optional<ConnectorId> connectorId
        
        Identifier for service connectors, like those available in ChatGPT. One of server_url or connector_id must be provided.
        
        Currently supported connector_id values are:
        
        Dropbox: connector_dropbox
        
        Gmail: connector_gmail
        
        Google Calendar: connector_googlecalendar
        
        Google Drive: connector_googledrive
        
        Microsoft Teams: connector_microsoftteams
        
        Outlook Calendar: connector_outlookcalendar
        
        Outlook Email: connector_outlookemail
        
        SharePoint: connector_sharepoint
        
        CONNECTOR_DROPBOX("connector_dropbox")
        
        CONNECTOR_GMAIL("connector_gmail")
        
        CONNECTOR_GOOGLECALENDAR("connector_googlecalendar")
        
        CONNECTOR_GOOGLEDRIVE("connector_googledrive")
        
        CONNECTOR_MICROSOFTTEAMS("connector_microsoftteams")
        
        CONNECTOR_OUTLOOKCALENDAR("connector_outlookcalendar")
        
        CONNECTOR_OUTLOOKEMAIL("connector_outlookemail")
        
        CONNECTOR_SHAREPOINT("connector_sharepoint")
        
        Optional<Boolean> deferLoading
        
        Whether this MCP tool is deferred and discovered via tool search.
        
        Optional<Headers> headers
        
        Optional HTTP headers to send to the MCP server. Use for authentication or other purposes.
        
        Optional<RequireApproval> requireApproval
        
        Specify which of the MCP server's tools require approval.
        
        class McpToolApprovalFilter:
        
        Specify which of the MCP server's tools require approval. Can be always, never, or a filter object associated with tools that require approval.
        
        Optional<Always> always
        
        A filter object to specify which tools are allowed.
        
        Optional<Boolean> readOnly
        
        Indicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with readOnlyHint, it will match this filter.
        
        Optional<List<String>> toolNames
        
        List of allowed tool names.
        
        Optional<Never> never
        
        A filter object to specify which tools are allowed.
        
        Optional<Boolean> readOnly
        
        Indicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with readOnlyHint, it will match this filter.
        
        Optional<List<String>> toolNames
        
        List of allowed tool names.
        
        enum McpToolApprovalSetting:
        
        Specify a single approval policy for all tools. One of always or never. When set to always, all tools will require approval. When set to never, no tools will require approval.
        
        ALWAYS("always")
        
        NEVER("never")
        
        Optional<String> serverDescription
        
        Optional description of the MCP server, used to provide more context.
        
        Optional<String> serverUrl
        
        The URL for the MCP server. One of server_url or connector_id must be provided.
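
        Since an MCP tool needs one of server_url or connector_id, a client can check this before sending the session configuration. The sketch below only verifies that at least one is present; whether supplying both is accepted is not stated above, so that case is deliberately left unchecked. McpTargetSketch and hasTarget are hypothetical names.

        ```java
        import java.util.Optional;

        /** Hypothetical helper for illustration only; not an SDK type. */
        final class McpTargetSketch {
            /** True when at least one of serverUrl or connectorId is present and non-blank. */
            static boolean hasTarget(Optional<String> serverUrl, Optional<String> connectorId) {
                return isSet(serverUrl) || isSet(connectorId);
            }

            private static boolean isSet(Optional<String> value) {
                return value.isPresent() && !value.get().trim().isEmpty();
            }

            public static void main(String[] args) {
                System.out.println(hasTarget(Optional.empty(), Optional.of("connector_gmail")));
            }
        }
        ```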
      - Optional<RealtimeTracingConfig> tracing
        
        Realtime API can write session traces to the Traces Dashboard. Set to null to disable tracing. Once tracing is enabled for a session, the configuration cannot be modified.
        
        auto will create a trace for the session with default values for the workflow name, group id, and metadata.
        
        JsonValue;
        
        AUTO("auto")
        
        TracingConfiguration
        
        Optional<String> groupId
        
        The group id to attach to this trace to enable filtering and grouping in the Traces Dashboard.
        
        Optional<JsonValue> metadata
        
        The arbitrary metadata to attach to this trace to enable filtering in the Traces Dashboard.
        
        Optional<String> workflowName
        
        The name of the workflow to attach to this trace. This is used to name the trace in the Traces Dashboard.
      - Optional<RealtimeTruncation> truncation
        
        When the number of tokens in a conversation exceeds the model's input token limit, the conversation will be truncated, meaning messages (starting from the oldest) will not be included in the model's context. A 32k context model with 4,096 max output tokens can only include 28,224 tokens in the context before truncation occurs.
        
        Clients can configure truncation behavior to truncate with a lower max token limit, which is an effective way to control token usage and cost.
        
        Truncation will reduce the number of cached tokens on the next turn (busting the cache), since messages are dropped from the beginning of the context. However, clients can also configure truncation to retain messages up to a fraction of the maximum context size, which will reduce the need for future truncations and thus improve the cache rate.
        
        Truncation can be disabled entirely, which means the server will never truncate but would instead return an error if the conversation exceeds the model's input token limit.
        
        RealtimeTruncationStrategy
        
        AUTO("auto")
        
        DISABLED("disabled")
        
        class RealtimeTruncationRetentionRatio:
        
        Retain a fraction of the conversation tokens when the conversation exceeds the input token limit. This allows you to amortize truncations across multiple turns, which can help improve cached token usage.
        
        double retentionRatio
        
        Fraction of post-instruction conversation tokens to retain (0.0 - 1.0) when the conversation exceeds the input token limit. Setting this to 0.8 means that messages will be dropped until 80% of the maximum allowed tokens are used. This helps reduce the frequency of truncations and improve cache rates.
        
        JsonValue; type "retention_ratio"constant
        
        Use retention ratio truncation.
        
        RETENTION_RATIO("retention_ratio")
        
        Optional<TokenLimits> tokenLimits
        
        Optional custom token limits for this truncation strategy. If not provided, the model's default token limits will be used.
        
        Optional<Long> postInstructions
        
        Maximum tokens allowed in the conversation after instructions (which include tool definitions). For example, setting this to 5,000 would mean that truncation would occur when the conversation exceeds 5,000 tokens after instructions. This cannot be higher than the model's context window size minus the maximum output tokens.
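
        To make the retention ratio strategy concrete, the sketch below simulates it on a list of per-message token counts, oldest first: once the total exceeds the limit, the oldest messages are dropped until the total is at or below retentionRatio times the limit. This is a simplified client-side model under stated assumptions (whole messages are dropped, instructions are excluded from the counts); it is not the server's implementation, and RetentionRatioSketch is a hypothetical name.

        ```java
        import java.util.ArrayList;
        import java.util.List;

        /** Hypothetical simulation for illustration only; not an SDK type. */
        final class RetentionRatioSketch {
            /**
             * Returns the token counts of the messages that remain in context.
             * messageTokens is ordered oldest first; maxTokens is the post-instruction limit.
             */
            static List<Integer> truncate(List<Integer> messageTokens, long maxTokens, double retentionRatio) {
                if (retentionRatio < 0.0 || retentionRatio > 1.0) {
                    throw new IllegalArgumentException("retentionRatio must be within [0.0, 1.0]");
                }
                long total = 0;
                for (int tokens : messageTokens) {
                    total += tokens;
                }
                if (total <= maxTokens) {
                    return new ArrayList<>(messageTokens); // under the limit: nothing is dropped
                }
                // Over the limit: drop oldest messages until the retained fraction is reached.
                long target = (long) Math.floor(maxTokens * retentionRatio);
                int firstKept = 0;
                while (firstKept < messageTokens.size() && total > target) {
                    total -= messageTokens.get(firstKept);
                    firstKept++;
                }
                return new ArrayList<>(messageTokens.subList(firstKept, messageTokens.size()));
            }

            public static void main(String[] args) {
                List<Integer> history = new ArrayList<>();
                history.add(400);
                history.add(300);
                history.add(200);
                history.add(200);
                System.out.println(truncate(history, 1000, 0.8));
            }
        }
        ```

        Dropping below the limit rather than just to it leaves headroom, so the next few turns can be appended without another truncation. That is the amortization effect described above for cache rates.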
    - class RealtimeTranscriptionSessionCreateRequest:
      
      Realtime transcription session object configuration.
      - JsonValue; type "transcription"constant
        
        The type of session to create. Always transcription for transcription sessions.
        
        TRANSCRIPTION("transcription")
      - Optional<RealtimeTranscriptionSessionAudio> audio
        
        Configuration for input and output audio.
        
        Optional<RealtimeTranscriptionSessionAudioInput> input
        
        Optional<RealtimeAudioFormats> format
        
        The PCM audio format. Only a 24kHz sample rate is supported.
        
        Optional<NoiseReduction> noiseReduction
        
        Configuration for input audio noise reduction. This can be set to null to turn off. Noise reduction filters audio added to the input audio buffer before it is sent to VAD and the model. Filtering the audio can improve VAD and turn detection accuracy (reducing false positives) and model performance by improving perception of the input audio.
        
        Optional<NoiseReductionType> type
        
        Type of noise reduction. near_field is for close-talking microphones such as headphones, far_field is for far-field microphones such as laptop or conference room microphones.
        
        Optional<AudioTranscription> transcription
        
        Configuration for input audio transcription. Defaults to off, and can be set to null to turn it off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through the /audio/transcriptions endpoint and should be treated as guidance on the input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service.
        
        Optional<RealtimeTranscriptionSessionAudioInputTurnDetection> turnDetection
        
        Configuration for turn detection, either Server VAD or Semantic VAD. This can be set to null to turn off, in which case the client must manually trigger a model response.
        
        Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.
        
        Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.
        
        For gpt-realtime-whisper transcription sessions, turn detection must be set to null; VAD is not supported.
        
        ServerVad
        
        JsonValue; type "server_vad"constant
        
        Type of turn detection, server_vad to turn on simple Server VAD.
        
        SERVER_VAD("server_vad")
        
        Optional<Boolean> createResponse
        
        Whether or not to automatically generate a response when a VAD stop event occurs. If interrupt_response is set to false, this may fail to create a response if the model is already responding.
        
        If both create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted.
        
        Optional<Long> idleTimeoutMs
        
        Optional timeout after which a model response will be triggered automatically. This is useful for situations in which a long pause from the user is unexpected, such as a phone call. The model will effectively prompt the user to continue the conversation based on the current context.
        
        The timeout value will be applied after the last model response's audio has finished playing, i.e. it's set to the response.done time plus audio playback duration.
        
        An input_audio_buffer.timeout_triggered event (plus events associated with the Response) will be emitted when the timeout is reached. Idle timeout is currently only supported for server_vad mode.
        
        Optional<Boolean> interruptResponse
        
        Whether or not to automatically interrupt (cancel) any ongoing response with output to the default conversation (i.e. conversation of auto) when a VAD start event occurs. If true, the response will be cancelled; otherwise it will continue until complete.
        
        If both create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted.
        
        Optional<Long> prefixPaddingMs
        
        Used only for server_vad mode. Amount of audio to include before the VAD detected speech (in milliseconds). Defaults to 300ms.
        
        Optional<Long> silenceDurationMs
        
        Used only for server_vad mode. Duration of silence to detect speech stop (in milliseconds). Defaults to 500ms. With shorter values the model will respond more quickly, but may jump in on short pauses from the user.
        
        Optional<Double> threshold
        
        Used only for server_vad mode. Activation threshold for VAD (0.0 to 1.0). Defaults to 0.5. A higher threshold will require louder audio to activate the model, and thus might perform better in noisy environments.
        
        SemanticVad
        
        JsonValue; type "semantic_vad"constant
        
        Type of turn detection, semantic_vad to turn on Semantic VAD.
        
        SEMANTIC_VAD("semantic_vad")
        
        Optional<Boolean> createResponse
        
        Whether or not to automatically generate a response when a VAD stop event occurs.
        
        Optional<Eagerness> eagerness
        
        Used only for semantic_vad mode. The eagerness of the model to respond. low will wait longer for the user to continue speaking, high will respond more quickly. auto is the default and is equivalent to medium. low, medium, and high have max timeouts of 8s, 4s, and 2s respectively.
        
        LOW("low")
        
        MEDIUM("medium")
        
        HIGH("high")
        
        AUTO("auto")
        
        Optional<Boolean> interruptResponse
        
        Whether or not to automatically interrupt any ongoing response with output to the default conversation (i.e. conversation of auto) when a VAD start event occurs.
      - Optional<List<Include>> include
        
        Additional fields to include in server outputs.
        
        item.input_audio_transcription.logprobs: Include logprobs for input audio transcription.
        
        ITEM_INPUT_AUDIO_TRANSCRIPTION_LOGPROBS("item.input_audio_transcription.logprobs")
  - JsonValue; type "session.created"constant
    
    The event type, must be session.created.
    - SESSION_CREATED("session.created")
- class SessionUpdatedEvent:
  
  Returned when a session is updated with a session.update event, unless there is an error.
  - String eventId
    
    The unique ID of the server event.
  - Session session
    
    The session configuration.
    - class RealtimeSessionCreateRequest:
      
      Realtime session object configuration.
    - class RealtimeTranscriptionSessionCreateRequest:
      
      Realtime transcription session object configuration.
  - JsonValue; type "session.updated"constant
    
    The event type, must be session.updated.
    - SESSION_UPDATED("session.updated")
- OutputAudioBufferStarted
  - String eventId
    
    The unique ID of the server event.
  - String responseId
    
    The unique ID of the response that produced the audio.
  - JsonValue; type "output_audio_buffer.started"constant
    
    The event type, must be output_audio_buffer.started.
    - OUTPUT_AUDIO_BUFFER_STARTED("output_audio_buffer.started")
- OutputAudioBufferStopped
  - String eventId
    
    The unique ID of the server event.
  - String responseId
    
    The unique ID of the response that produced the audio.
  - JsonValue; type "output_audio_buffer.stopped"constant
    
    The event type, must be output_audio_buffer.stopped.
    - OUTPUT_AUDIO_BUFFER_STOPPED("output_audio_buffer.stopped")
- OutputAudioBufferCleared
  - String eventId
    
    The unique ID of the server event.
  - String responseId
    
    The unique ID of the response that produced the audio.
  - JsonValue; type "output_audio_buffer.cleared"constant
    
    The event type, must be output_audio_buffer.cleared.
    - OUTPUT_AUDIO_BUFFER_CLEARED("output_audio_buffer.cleared")
- class ConversationItemAdded:
  
  Sent by the server when an Item is added to the default Conversation. This can happen in several cases:
  - When the client sends a conversation.item.create event.
  - When the input audio buffer is committed. In this case the item will be a user message containing the audio from the buffer.
  - When the model is generating a Response. In this case the conversation.item.added event will be sent when the model starts generating a specific Item, and thus it will not yet have any content (and status will be in_progress).
  The event will include the full content of the Item (except when the model is generating a Response), excluding audio data, which can be retrieved separately with a conversation.item.retrieve event if necessary.
  - String eventId
    
    The unique ID of the server event.
  - ConversationItem item
    
    A single item within a Realtime conversation.
  - JsonValue; type "conversation.item.added"constant
    
    The event type, must be conversation.item.added.
    - CONVERSATION_ITEM_ADDED("conversation.item.added")
  - Optional<String> previousItemId
    
    The ID of the item that precedes this one, if any. This is used to maintain ordering when items are inserted.
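The ordering role of previousItemId can be illustrated with a small client-side sketch. ItemOrdering and its methods are hypothetical names, not part of the SDK, and treating a missing or unknown previousItemId as "append at the end" is an assumption of this sketch rather than documented behavior.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical client-side helper; not part of the SDK. */
final class ItemOrdering {
    private final List<String> orderedIds = new ArrayList<>();

    /**
     * Inserts itemId directly after previousItemId.
     * Assumption: a null or unknown previousItemId means "append at the end".
     */
    void onItemAdded(String itemId, String previousItemId) {
        int anchor = previousItemId == null ? -1 : orderedIds.indexOf(previousItemId);
        if (anchor < 0) {
            orderedIds.add(itemId);
        } else {
            orderedIds.add(anchor + 1, itemId);
        }
    }

    /** Returns a copy of the current item order. */
    List<String> snapshot() {
        return new ArrayList<>(orderedIds);
    }

    public static void main(String[] args) {
        ItemOrdering ordering = new ItemOrdering();
        ordering.onItemAdded("item_a", null);
        ordering.onItemAdded("item_c", "item_a");
        ordering.onItemAdded("item_b", "item_a"); // lands between item_a and item_c
        System.out.println(ordering.snapshot()); // [item_a, item_b, item_c]
    }
}
```

Keeping only IDs in the ordered list keeps the sketch small; a real client would likely store the ConversationItem alongside each ID.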
- class ConversationItemDone:
  
  Returned when a conversation item is finalized.
  
  The event will include the full content of the Item except for audio data, which can be retrieved separately with a conversation.item.retrieve event if needed.
  - String eventId
    
    The unique ID of the server event.
  - ConversationItem item
    
    A single item within a Realtime conversation.
  - JsonValue; type "conversation.item.done"constant
    
    The event type, must be conversation.item.done.
    - CONVERSATION_ITEM_DONE("conversation.item.done")
  - Optional<String> previousItemId
    
    The ID of the item that precedes this one, if any. This is used to maintain ordering when items are inserted.
- class InputAudioBufferTimeoutTriggered:
  
  Returned when the Server VAD timeout is triggered for the input audio buffer. This is configured with idle_timeout_ms in the turn_detection settings of the session, and it indicates that there hasn't been any speech detected for the configured duration.
  
  The audio_start_ms and audio_end_ms fields indicate the segment of audio after the last model response up to the triggering time, as an offset from the beginning of audio written to the input audio buffer. This means it demarcates the segment of audio that was silent and the difference between the start and end values will roughly match the configured timeout.
  
  The empty audio will be committed to the conversation as an input_audio item (there will be an input_audio_buffer.committed event) and a model response will be generated. There may be speech that didn't trigger VAD but is still detected by the model, so the model may respond with something relevant to the conversation or a prompt to continue speaking.
  - long audioEndMs
    
    Millisecond offset of audio written to the input audio buffer at the time the timeout was triggered.
  - long audioStartMs
    
    Millisecond offset of audio written to the input audio buffer that was after the playback time of the last model response.
  - String eventId
    
    The unique ID of the server event.
  - String itemId
    
    The ID of the item associated with this segment.
  - JsonValue; type "input_audio_buffer.timeout_triggered"constant
    
    The event type, must be input_audio_buffer.timeout_triggered.
    - INPUT_AUDIO_BUFFER_TIMEOUT_TRIGGERED("input_audio_buffer.timeout_triggered")
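The relationship between audioStartMs, audioEndMs, and the configured idle timeout can be sketched as plain arithmetic. IdleTimeoutSegment is a hypothetical helper, and the tolerance parameter is a caller-chosen value; the reference only says the difference will "roughly match" the timeout and does not define a tolerance.

```java
/** Hypothetical helper for reasoning about input_audio_buffer.timeout_triggered. */
final class IdleTimeoutSegment {
    /** Length of the silent segment in milliseconds. */
    static long silentDurationMs(long audioStartMs, long audioEndMs) {
        if (audioEndMs < audioStartMs) {
            throw new IllegalArgumentException("audioEndMs must not precede audioStartMs");
        }
        return audioEndMs - audioStartMs;
    }

    /**
     * Checks whether the silent segment is within toleranceMs of the
     * configured idle timeout. The tolerance is an assumption of this sketch.
     */
    static boolean roughlyMatches(long audioStartMs, long audioEndMs,
                                  long idleTimeoutMs, long toleranceMs) {
        long diff = Math.abs(silentDurationMs(audioStartMs, audioEndMs) - idleTimeoutMs);
        return diff <= toleranceMs;
    }

    public static void main(String[] args) {
        System.out.println(silentDurationMs(12_000, 17_050));           // 5050
        System.out.println(roughlyMatches(12_000, 17_050, 5_000, 200)); // true
    }
}
```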
- class ConversationItemInputAudioTranscriptionSegment:
  
  Returned when an input audio transcription segment is identified for an item.
  - String id
    
    The segment identifier.
  - long contentIndex
    
    The index of the input audio content part within the item.
  - double end
    
    End time of the segment in seconds.
  - String eventId
    
    The unique ID of the server event.
  - String itemId
    
    The ID of the item containing the input audio content.
  - String speaker
    
    The detected speaker label for this segment.
  - double start
    
    Start time of the segment in seconds.
  - String text
    
    The text for this segment.
  - JsonValue; type "conversation.item.input_audio_transcription.segment"constant
    
    The event type, must be conversation.item.input_audio_transcription.segment.
    - CONVERSATION_ITEM_INPUT_AUDIO_TRANSCRIPTION_SEGMENT("conversation.item.input_audio_transcription.segment")
- class McpListToolsInProgress:
  
  Returned when listing MCP tools is in progress for an item.
  - String eventId
    
    The unique ID of the server event.
  - String itemId
    
    The ID of the MCP list tools item.
  - JsonValue; type "mcp_list_tools.in_progress"constant
    
    The event type, must be mcp_list_tools.in_progress.
    - MCP_LIST_TOOLS_IN_PROGRESS("mcp_list_tools.in_progress")
- class McpListToolsCompleted:
  
  Returned when listing MCP tools has completed for an item.
  - String eventId
    
    The unique ID of the server event.
  - String itemId
    
    The ID of the MCP list tools item.
  - JsonValue; type "mcp_list_tools.completed"constant
    
    The event type, must be mcp_list_tools.completed.
    - MCP_LIST_TOOLS_COMPLETED("mcp_list_tools.completed")
- class McpListToolsFailed:
  
  Returned when listing MCP tools has failed for an item.
  - String eventId
    
    The unique ID of the server event.
  - String itemId
    
    The ID of the MCP list tools item.
  - JsonValue; type "mcp_list_tools.failed"constant
    
    The event type, must be mcp_list_tools.failed.
    - MCP_LIST_TOOLS_FAILED("mcp_list_tools.failed")
- class ResponseMcpCallArgumentsDelta:
  
  Returned when MCP tool call arguments are updated during response generation.
  - String delta
    
    The JSON-encoded arguments delta.
  - String eventId
    
    The unique ID of the server event.
  - String itemId
    
    The ID of the MCP tool call item.
  - long outputIndex
    
    The index of the output item in the response.
  - String responseId
    
    The ID of the response.
  - JsonValue; type "response.mcp_call_arguments.delta"constant
    
    The event type, must be response.mcp_call_arguments.delta.
    - RESPONSE_MCP_CALL_ARGUMENTS_DELTA("response.mcp_call_arguments.delta")
  - Optional<String> obfuscation
    
    If present, indicates the delta text was obfuscated.
- class ResponseMcpCallArgumentsDone:
  
  Returned when MCP tool call arguments are finalized during response generation.
  - String arguments
    
    The final JSON-encoded arguments string.
  - String eventId
    
    The unique ID of the server event.
  - String itemId
    
    The ID of the MCP tool call item.
  - long outputIndex
    
    The index of the output item in the response.
  - String responseId
    
    The ID of the response.
  - JsonValue; type "response.mcp_call_arguments.done"constant
    
    The event type, must be response.mcp_call_arguments.done.
    - RESPONSE_MCP_CALL_ARGUMENTS_DONE("response.mcp_call_arguments.done")
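One way a client might handle the delta and done events for MCP call arguments is to buffer deltas per item and treat the arguments string on the done event as authoritative. The sketch below uses hypothetical names (McpArgumentsAccumulator, onDelta, onDone) and plain strings rather than SDK event types.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical accumulator keyed by itemId; not part of the SDK. */
final class McpArgumentsAccumulator {
    private final Map<String, StringBuilder> partial = new HashMap<>();

    /** Appends one JSON-encoded arguments delta for the given item. */
    void onDelta(String itemId, String delta) {
        partial.computeIfAbsent(itemId, id -> new StringBuilder()).append(delta);
    }

    /** Partial arguments seen so far, e.g. for progress display. */
    String preview(String itemId) {
        StringBuilder buffer = partial.get(itemId);
        return buffer == null ? "" : buffer.toString();
    }

    /**
     * Drops the buffered deltas and returns the final arguments string
     * carried by the done event, which this sketch treats as authoritative.
     */
    String onDone(String itemId, String finalArguments) {
        partial.remove(itemId);
        return finalArguments;
    }

    public static void main(String[] args) {
        McpArgumentsAccumulator acc = new McpArgumentsAccumulator();
        acc.onDelta("item_1", "{\"city\":");
        acc.onDelta("item_1", "\"Paris\"}");
        System.out.println(acc.preview("item_1")); // {"city":"Paris"}
        System.out.println(acc.onDone("item_1", "{\"city\":\"Paris\"}"));
    }
}
```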
- class ResponseMcpCallInProgress:
  
  Returned when an MCP tool call has started and is in progress.
  - String eventId
    
    The unique ID of the server event.
  - String itemId
    
    The ID of the MCP tool call item.
  - long outputIndex
    
    The index of the output item in the response.
  - JsonValue; type "response.mcp_call.in_progress"constant
    
    The event type, must be response.mcp_call.in_progress.
    - RESPONSE_MCP_CALL_IN_PROGRESS("response.mcp_call.in_progress")
- class ResponseMcpCallCompleted:
  
  Returned when an MCP tool call has completed successfully.
  - String eventId
    
    The unique ID of the server event.
  - String itemId
    
    The ID of the MCP tool call item.
  - long outputIndex
    
    The index of the output item in the response.
  - JsonValue; type "response.mcp_call.completed"constant
    
    The event type, must be response.mcp_call.completed.
    - RESPONSE_MCP_CALL_COMPLETED("response.mcp_call.completed")
- class ResponseMcpCallFailed:
  
  Returned when an MCP tool call has failed.
  - String eventId
    
    The unique ID of the server event.
  - String itemId
    
    The ID of the MCP tool call item.
  - long outputIndex
    
    The index of the output item in the response.
  - JsonValue; type "response.mcp_call.failed"constant
    
    The event type, must be response.mcp_call.failed.
    - RESPONSE_MCP_CALL_FAILED("response.mcp_call.failed")

Realtime Session

class RealtimeSession:

Realtime session object for the beta interface.
- Optional<String> id
  
  Unique identifier for the session that looks like sess_1234567890abcdef.
- Optional<Long> expiresAt
  
  Expiration timestamp for the session, in seconds since epoch.
- Optional<List<Include>> include
  
  Additional fields to include in server outputs.
  - item.input_audio_transcription.logprobs: Include logprobs for input audio transcription.
  - ITEM_INPUT_AUDIO_TRANSCRIPTION_LOGPROBS("item.input_audio_transcription.logprobs")
- Optional<InputAudioFormat> inputAudioFormat
  
  The format of input audio. Options are pcm16, g711_ulaw, or g711_alaw. For pcm16, input audio must be 16-bit PCM at a 24kHz sample rate, single channel (mono), and little-endian byte order.
  - PCM16("pcm16")
  - G711_ULAW("g711_ulaw")
  - G711_ALAW("g711_alaw")
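The pcm16 requirements above (16-bit samples, 24kHz, mono, little-endian) imply 48,000 bytes per second of audio. The following sketch shows that arithmetic and a little-endian encoding step; Pcm16Audio and its methods are hypothetical helpers, not SDK APIs.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper for the pcm16 input format; not part of the SDK. */
final class Pcm16Audio {
    static final int SAMPLE_RATE_HZ = 24_000;
    static final int BYTES_PER_SAMPLE = 2; // 16-bit samples
    static final int CHANNELS = 1;         // mono
    static final int BYTES_PER_SECOND = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS;

    /** Duration in milliseconds of a pcm16 buffer of the given size. */
    static long durationMs(long byteCount) {
        if (byteCount < 0 || byteCount % BYTES_PER_SAMPLE != 0) {
            throw new IllegalArgumentException("byteCount must be a non-negative multiple of 2");
        }
        return byteCount * 1000 / BYTES_PER_SECOND;
    }

    /** Encodes samples as little-endian bytes, as the format requires. */
    static byte[] toLittleEndian(short[] samples) {
        ByteBuffer buffer = ByteBuffer.allocate(samples.length * BYTES_PER_SAMPLE)
                .order(ByteOrder.LITTLE_ENDIAN);
        for (short sample : samples) {
            buffer.putShort(sample);
        }
        return buffer.array();
    }

    public static void main(String[] args) {
        System.out.println(BYTES_PER_SECOND);   // 48000
        System.out.println(durationMs(4_800));  // 100
    }
}
```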
- Optional<InputAudioNoiseReduction> inputAudioNoiseReduction
  
  Configuration for input audio noise reduction. This can be set to null to turn off. Noise reduction filters audio added to the input audio buffer before it is sent to VAD and the model. Filtering the audio can improve VAD and turn detection accuracy (reducing false positives) and model performance by improving perception of the input audio.
  - Optional<NoiseReductionType> type
    
    Type of noise reduction. near_field is for close-talking microphones such as headphones, far_field is for far-field microphones such as laptop or conference room microphones.
    - NEAR_FIELD("near_field")
    - FAR_FIELD("far_field")
- Optional<AudioTranscription> inputAudioTranscription
  
  Configuration for input audio transcription. Defaults to off and can be set to null to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through the /audio/transcriptions endpoint and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service.
  - Optional<Delay> delay
    
    Controls how long the model waits before emitting transcription text. Higher values can improve transcription accuracy at the cost of latency. Only supported with gpt-realtime-whisper in GA Realtime sessions.
    - MINIMAL("minimal")
    - LOW("low")
    - MEDIUM("medium")
    - HIGH("high")
    - XHIGH("xhigh")
  - Optional<String> language
    
    The language of the input audio. Supplying the input language in ISO-639-1 (e.g. en) format will improve accuracy and latency.
  - Optional<Model> model
    
    The model to use for transcription. Current options are whisper-1, gpt-4o-mini-transcribe, gpt-4o-mini-transcribe-2025-12-15, gpt-4o-transcribe, gpt-4o-transcribe-diarize, and gpt-realtime-whisper. Use gpt-4o-transcribe-diarize when you need diarization with speaker labels.
    - WHISPER_1("whisper-1")
    - GPT_4O_MINI_TRANSCRIBE("gpt-4o-mini-transcribe")
    - GPT_4O_MINI_TRANSCRIBE_2025_12_15("gpt-4o-mini-transcribe-2025-12-15")
    - GPT_4O_TRANSCRIBE("gpt-4o-transcribe")
    - GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize")
    - GPT_REALTIME_WHISPER("gpt-realtime-whisper")
  - Optional<String> prompt
    
    An optional text to guide the model's style or continue a previous audio segment. For whisper-1, the prompt is a list of keywords. For gpt-4o-transcribe models (excluding gpt-4o-transcribe-diarize), the prompt is a free text string, for example "expect words related to technology". Prompt is not supported with gpt-realtime-whisper in GA Realtime sessions.
- Optional<String> instructions
  
  The default system instructions (i.e. system message) prepended to model calls. This field allows the client to guide the model on desired responses. The model can be instructed on response content and format (e.g. "be extremely succinct", "act friendly", "here are examples of good responses") and on audio behavior (e.g. "talk quickly", "inject emotion into your voice", "laugh frequently"). The instructions are not guaranteed to be followed by the model, but they provide guidance to the model on the desired behavior.
  
  Note that the server sets default instructions which will be used if this field is not set and are visible in the session.created event at the start of the session.
- Optional<MaxResponseOutputTokens> maxResponseOutputTokens
  
  Maximum number of output tokens for a single assistant response, inclusive of tool calls. Provide an integer between 1 and 4096 to limit output tokens, or inf for the maximum available tokens for a given model. Defaults to inf.
  - long
  - JsonValue;
    - INF("inf")
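Because this field accepts either an integer between 1 and 4096 or the constant inf, a client may want to validate the value before sending it. The sketch below models the two cases with a hypothetical MaxOutputTokens class; it is not the SDK's MaxResponseOutputTokens type, and toWireValue is an illustrative name.

```java
/** Hypothetical model of "integer in [1, 4096] or inf"; not the SDK type. */
final class MaxOutputTokens {
    static final long MIN = 1;
    static final long MAX = 4096;

    private final Long limit; // null represents "inf"

    private MaxOutputTokens(Long limit) {
        this.limit = limit;
    }

    /** The documented default: no explicit limit. */
    static MaxOutputTokens inf() {
        return new MaxOutputTokens(null);
    }

    /** An explicit limit, validated against the documented range. */
    static MaxOutputTokens of(long tokens) {
        if (tokens < MIN || tokens > MAX) {
            throw new IllegalArgumentException("tokens must be between 1 and 4096");
        }
        return new MaxOutputTokens(tokens);
    }

    boolean isInf() {
        return limit == null;
    }

    /** Value as it would appear in a request: a number or the string "inf". */
    Object toWireValue() {
        if (limit == null) {
            return "inf";
        }
        return limit;
    }
}
```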
- Optional<List<Modality>> modalities
  
  The set of modalities the model can respond with. To disable audio, set this to ["text"].
  - TEXT("text")
  - AUDIO("audio")
- Optional<Model> model
  
  The Realtime model used for this session.
  - GPT_REALTIME("gpt-realtime")
  - GPT_REALTIME_1_5("gpt-realtime-1.5")
  - GPT_REALTIME_2025_08_28("gpt-realtime-2025-08-28")
  - GPT_4O_REALTIME_PREVIEW("gpt-4o-realtime-preview")
  - GPT_4O_REALTIME_PREVIEW_2024_10_01("gpt-4o-realtime-preview-2024-10-01")
  - GPT_4O_REALTIME_PREVIEW_2024_12_17("gpt-4o-realtime-preview-2024-12-17")
  - GPT_4O_REALTIME_PREVIEW_2025_06_03("gpt-4o-realtime-preview-2025-06-03")
  - GPT_4O_MINI_REALTIME_PREVIEW("gpt-4o-mini-realtime-preview")
  - GPT_4O_MINI_REALTIME_PREVIEW_2024_12_17("gpt-4o-mini-realtime-preview-2024-12-17")
  - GPT_REALTIME_MINI("gpt-realtime-mini")
  - GPT_REALTIME_MINI_2025_10_06("gpt-realtime-mini-2025-10-06")
  - GPT_REALTIME_MINI_2025_12_15("gpt-realtime-mini-2025-12-15")
  - GPT_AUDIO_1_5("gpt-audio-1.5")
  - GPT_AUDIO_MINI("gpt-audio-mini")
  - GPT_AUDIO_MINI_2025_10_06("gpt-audio-mini-2025-10-06")
  - GPT_AUDIO_MINI_2025_12_15("gpt-audio-mini-2025-12-15")
- Optional<Object> object_
  
  The object type. Always realtime.session.
  - REALTIME_SESSION("realtime.session")
- Optional<OutputAudioFormat> outputAudioFormat
  
  The format of output audio. Options are pcm16, g711_ulaw, or g711_alaw. For pcm16, output audio is sampled at a rate of 24kHz.
  - PCM16("pcm16")
  - G711_ULAW("g711_ulaw")
  - G711_ALAW("g711_alaw")
- Optional<ResponsePrompt> prompt
  
  Reference to a prompt template and its variables.
  - String id
    
    The unique identifier of the prompt template to use.
  - Optional<Variables> variables
    
    Optional map of values to substitute in for variables in your prompt. The substitution values can either be strings, or other Response input types like images or files.
    - String
    - class ResponseInputText:
      
      A text input to the model.
      - String text
        
        The text input to the model.
      - JsonValue; type "input_text"constant
        
        The type of the input item. Always input_text.
        
        INPUT_TEXT("input_text")
    - class ResponseInputImage:
      
      An image input to the model.
      - Detail detail
        
        The detail level of the image to be sent to the model. One of high, low, auto, or original. Defaults to auto.
        
        LOW("low")
        
        HIGH("high")
        
        AUTO("auto")
        
        ORIGINAL("original")
      - JsonValue; type "input_image"constant
        
        The type of the input item. Always input_image.
        
        INPUT_IMAGE("input_image")
      - Optional<String> fileId
        
        The ID of the file to be sent to the model.
      - Optional<String> imageUrl
        
        The URL of the image to be sent to the model. A fully qualified URL or base64 encoded image in a data URL.
    - class ResponseInputFile:
      
      A file input to the model.
      - JsonValue; type "input_file"constant
        
        The type of the input item. Always input_file.
        
        INPUT_FILE("input_file")
      - Optional<Detail> detail
        
        The detail level of the file to be sent to the model. Use low for the default rendering behavior, or high to render the file at higher quality. Defaults to low.
        
        LOW("low")
        
        HIGH("high")
      - Optional<String> fileData
        
        The content of the file to be sent to the model.
      - Optional<String> fileId
        
        The ID of the file to be sent to the model.
      - Optional<String> fileUrl
        
        The URL of the file to be sent to the model.
      - Optional<String> filename
        
        The name of the file to be sent to the model.
  - Optional<String> version
    
    Optional version of the prompt template.
- Optional<Double> speed
  
  The speed of the model's spoken response. 1.0 is the default speed. 0.25 is the minimum speed. 1.5 is the maximum speed. This value can only be changed in between model turns, not while a response is in progress.
- Optional<Double> temperature
  
  Sampling temperature for the model, limited to [0.6, 1.2]. For audio models a temperature of 0.8 is highly recommended for best performance.
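The documented ranges for speed (0.25 to 1.5) and temperature (0.6 to 1.2) can be enforced on the client before a session update is sent. The sketch below clamps out-of-range values; SessionTuning is a hypothetical helper, and clamping (rather than rejecting) is a design choice of this sketch, since the reference does not say how the server treats out-of-range values.

```java
/** Hypothetical range helpers for speed and temperature; not part of the SDK. */
final class SessionTuning {
    static final double MIN_SPEED = 0.25;
    static final double MAX_SPEED = 1.5;
    static final double MIN_TEMPERATURE = 0.6;
    static final double MAX_TEMPERATURE = 1.2;

    private static double clamp(double value, double min, double max) {
        if (Double.isNaN(value)) {
            throw new IllegalArgumentException("value must be a number");
        }
        return Math.max(min, Math.min(max, value));
    }

    /** Clamps a requested speed into the documented range. */
    static double clampSpeed(double speed) {
        return clamp(speed, MIN_SPEED, MAX_SPEED);
    }

    /** Clamps a requested temperature into the documented range. */
    static double clampTemperature(double temperature) {
        return clamp(temperature, MIN_TEMPERATURE, MAX_TEMPERATURE);
    }

    public static void main(String[] args) {
        System.out.println(clampSpeed(2.0));       // 1.5
        System.out.println(clampTemperature(0.8)); // 0.8
    }
}
```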
- Optional<String> toolChoice
  
  How the model chooses tools. Options are auto, none, required, or specify a function.
- Optional<List<RealtimeFunctionTool>> tools
  
  Tools (functions) available to the model.
  - Optional<String> description
    
    The description of the function, including guidance on when and how to call it, and guidance about what to tell the user when calling (if anything).
  - Optional<String> name
    
    The name of the function.
  - Optional<JsonValue> parameters
    
    Parameters of the function in JSON Schema.
  - Optional<Type> type
    
    The type of the tool, i.e. function.
    - FUNCTION("function")
- Optional<Tracing> tracing
  
  Configuration options for tracing. Set to null to disable tracing. Once tracing is enabled for a session, the configuration cannot be modified.
  
  auto will create a trace for the session with default values for the workflow name, group id, and metadata.
  - JsonValue;
    - AUTO("auto")
  - class TracingConfiguration:
    
    Granular configuration for tracing.
    - Optional<String> groupId
      
      The group id to attach to this trace to enable filtering and grouping in the traces dashboard.
    - Optional<JsonValue> metadata
      
      The arbitrary metadata to attach to this trace to enable filtering in the traces dashboard.
    - Optional<String> workflowName
      
      The name of the workflow to attach to this trace. This is used to name the trace in the traces dashboard.
- Optional<TurnDetection> turnDetection
  
  Configuration for turn detection, either Server VAD or Semantic VAD. This can be set to null to turn off, in which case the client must manually trigger a model response.
  
  Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.
  
  Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.
  
  For gpt-realtime-whisper transcription sessions, turn detection must be set to null; VAD is not supported.
  - class ServerVad:
    
    Server-side voice activity detection (VAD) which flips on when user speech is detected and off after a period of silence.
    - JsonValue; type "server_vad"constant
      
      Type of turn detection, server_vad to turn on simple Server VAD.
      - SERVER_VAD("server_vad")
    - Optional<Boolean> createResponse
      
      Whether or not to automatically generate a response when a VAD stop event occurs. If interrupt_response is set to false this may fail to create a response if the model is already responding.
      
      If both create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted.
    - Optional<Long> idleTimeoutMs
      
      Optional timeout after which a model response will be triggered automatically. This is useful for situations in which a long pause from the user is unexpected, such as a phone call. The model will effectively prompt the user to continue the conversation based on the current context.
      
      The timeout value will be applied after the last model response's audio has finished playing, i.e. it's set to the response.done time plus audio playback duration.
      
      An input_audio_buffer.timeout_triggered event (plus events associated with the Response) will be emitted when the timeout is reached. Idle timeout is currently only supported for server_vad mode.
    - Optional<Boolean> interruptResponse
      
      Whether or not to automatically interrupt (cancel) any ongoing response with output to the default conversation (i.e. conversation of auto) when a VAD start event occurs. If true then the response will be cancelled, otherwise it will continue until complete.
      
      If both create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted.
    - Optional<Long> prefixPaddingMs
      
      Used only for server_vad mode. Amount of audio to include before the VAD detected speech (in milliseconds). Defaults to 300ms.
    - Optional<Long> silenceDurationMs
      
      Used only for server_vad mode. Duration of silence to detect speech stop (in milliseconds). Defaults to 500ms. With shorter values the model will respond more quickly, but may jump in on short pauses from the user.
    - Optional<Double> threshold
      
      Used only for server_vad mode. Activation threshold for VAD (0.0 to 1.0). Defaults to 0.5. A higher threshold requires louder audio to activate the model, and thus might perform better in noisy environments.
  - class SemanticVad:
    
    Server-side semantic turn detection which uses a model to determine when the user has finished speaking.
    - JsonValue; type "semantic_vad"constant
      
      Type of turn detection, semantic_vad to turn on Semantic VAD.
      - SEMANTIC_VAD("semantic_vad")
    - Optional<Boolean> createResponse
      
      Whether or not to automatically generate a response when a VAD stop event occurs.
    - Optional<Eagerness> eagerness
      
      Used only for semantic_vad mode. The eagerness of the model to respond. low will wait longer for the user to continue speaking, high will respond more quickly. auto is the default and is equivalent to medium. low, medium, and high have max timeouts of 8s, 4s, and 2s respectively.
      - LOW("low")
      - MEDIUM("medium")
      - HIGH("high")
      - AUTO("auto")
    - Optional<Boolean> interruptResponse
      
      Whether or not to automatically interrupt any ongoing response with output to the default conversation (i.e. conversation of auto) when a VAD start event occurs.
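The eagerness values and their documented maximum timeouts (low 8s, medium 4s, high 2s, with auto equivalent to medium) can be captured in a small lookup. EagernessLevel below is a hypothetical mirror of those values for illustration and is not the SDK's Eagerness type.

```java
import java.time.Duration;

/** Hypothetical mirror of the eagerness values; not the SDK's Eagerness type. */
enum EagernessLevel {
    LOW("low", Duration.ofSeconds(8)),
    MEDIUM("medium", Duration.ofSeconds(4)),
    HIGH("high", Duration.ofSeconds(2)),
    AUTO("auto", Duration.ofSeconds(4)); // documented as equivalent to medium

    private final String wireValue;
    private final Duration maxTimeout;

    EagernessLevel(String wireValue, Duration maxTimeout) {
        this.wireValue = wireValue;
        this.maxTimeout = maxTimeout;
    }

    String wireValue() {
        return wireValue;
    }

    /** Longest wait before the turn is considered finished. */
    Duration maxTimeout() {
        return maxTimeout;
    }

    /** Looks up a level by its wire value, e.g. "high". */
    static EagernessLevel fromWire(String value) {
        for (EagernessLevel level : values()) {
            if (level.wireValue.equals(value)) {
                return level;
            }
        }
        throw new IllegalArgumentException("Unknown eagerness: " + value);
    }
}
```

For example, EagernessLevel.fromWire("auto").maxTimeout() yields the same four-second duration as MEDIUM.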
- Optional<Voice> voice
  
  The voice the model uses to respond. Voice cannot be changed during the session once the model has responded with audio at least once. Current voice options are alloy, ash, ballad, coral, echo, sage, shimmer, verse, marin, and cedar.
  - ALLOY("alloy")
  - ASH("ash")
  - BALLAD("ballad")
  - CORAL("coral")
  - ECHO("echo")
  - SAGE("sage")
  - SHIMMER("shimmer")
  - VERSE("verse")
  - MARIN("marin")
  - CEDAR("cedar")

Realtime Session Create Request

class RealtimeSessionCreateRequest:

Realtime session object configuration.
- JsonValue; type "realtime"constant
  
  The type of session to create. Always realtime for the Realtime API.
  - REALTIME("realtime")
- Optional<RealtimeAudioConfig> audio
  
  Configuration for input and output audio.
  - Optional<RealtimeAudioConfigInput> input
    - Optional<RealtimeAudioFormats> format
      
      The format of the input audio.
      - AudioPcm
        
        Optional<Rate> rate
        
        The sample rate of the audio. Always 24000.
        
        _24000(24000)
        
        Optional<Type> type
        
        The audio format. Always audio/pcm.
        
        AUDIO_PCM("audio/pcm")
      - AudioPcmu
        
        Optional<Type> type
        
        The audio format. Always audio/pcmu.
        
        AUDIO_PCMU("audio/pcmu")
      - AudioPcma
        
        Optional<Type> type
        
        The audio format. Always audio/pcma.
        
        AUDIO_PCMA("audio/pcma")
    - Optional<NoiseReduction> noiseReduction
      
      Configuration for input audio noise reduction. This can be set to null to turn off. Noise reduction filters audio added to the input audio buffer before it is sent to VAD and the model. Filtering the audio can improve VAD and turn detection accuracy (reducing false positives) and model performance by improving perception of the input audio.
      - Optional<NoiseReductionType> type
        
        Type of noise reduction. near_field is for close-talking microphones such as headphones, far_field is for far-field microphones such as laptop or conference room microphones.
        
        NEAR_FIELD("near_field")
        
        FAR_FIELD("far_field")
    - Optional<AudioTranscription> transcription
      
      Configuration for input audio transcription. Defaults to off and can be set to null to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through the /audio/transcriptions endpoint and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service.
      - Optional<Delay> delay
        
        Controls how long the model waits before emitting transcription text. Higher values can improve transcription accuracy at the cost of latency. Only supported with gpt-realtime-whisper in GA Realtime sessions.
        
        MINIMAL("minimal")
        
        LOW("low")
        
        MEDIUM("medium")
        
        HIGH("high")
        
        XHIGH("xhigh")
      - Optional<String> language
        
        The language of the input audio. Supplying the input language in ISO-639-1 (e.g. en) format will improve accuracy and latency.
      - Optional<Model> model
        
        The model to use for transcription. Current options are whisper-1, gpt-4o-mini-transcribe, gpt-4o-mini-transcribe-2025-12-15, gpt-4o-transcribe, gpt-4o-transcribe-diarize, and gpt-realtime-whisper. Use gpt-4o-transcribe-diarize when you need diarization with speaker labels.
        
        WHISPER_1("whisper-1")
        
        GPT_4O_MINI_TRANSCRIBE("gpt-4o-mini-transcribe")
        
        GPT_4O_MINI_TRANSCRIBE_2025_12_15("gpt-4o-mini-transcribe-2025-12-15")
        
        GPT_4O_TRANSCRIBE("gpt-4o-transcribe")
        
        GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize")
        
        GPT_REALTIME_WHISPER("gpt-realtime-whisper")
      - Optional<String> prompt
        
        An optional text to guide the model's style or continue a previous audio segment. For whisper-1, the prompt is a list of keywords. For gpt-4o-transcribe models (excluding gpt-4o-transcribe-diarize), the prompt is a free text string, for example "expect words related to technology". Prompt is not supported with gpt-realtime-whisper in GA Realtime sessions.
    - Optional<RealtimeAudioInputTurnDetection> turnDetection
      
      Configuration for turn detection, either Server VAD or Semantic VAD. This can be set to null to turn off, in which case the client must manually trigger a model response.
      
      Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.
      
      Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.
      
      For gpt-realtime-whisper transcription sessions, turn detection must be set to null; VAD is not supported.
      - ServerVad
        
        JsonValue; type "server_vad"constant
        
        Type of turn detection, server_vad to turn on simple Server VAD.
        
        SERVER_VAD("server_vad")
        
        Optional<Boolean> createResponse
        
        Whether or not to automatically generate a response when a VAD stop event occurs. If interrupt_response is set to false this may fail to create a response if the model is already responding.
        
        If both create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted.
        
        Optional<Long> idleTimeoutMs
        
        Optional timeout after which a model response will be triggered automatically. This is useful for situations in which a long pause from the user is unexpected, such as a phone call. The model will effectively prompt the user to continue the conversation based on the current context.
        
        The timeout value will be applied after the last model response's audio has finished playing, i.e. it's set to the response.done time plus audio playback duration.
        
        An input_audio_buffer.timeout_triggered event (plus events associated with the Response) will be emitted when the timeout is reached. Idle timeout is currently only supported for server_vad mode.
        
        Optional<Boolean> interruptResponse
        
        Whether or not to automatically interrupt (cancel) any ongoing response with output to the default conversation (i.e. conversation of auto) when a VAD start event occurs. If true then the response will be cancelled, otherwise it will continue until complete.
        
        If both create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted.
        
        Optional<Long> prefixPaddingMs
        
        Used only for server_vad mode. Amount of audio to include before the VAD detected speech (in milliseconds). Defaults to 300ms.
        
        Optional<Long> silenceDurationMs
        
        Used only for server_vad mode. Duration of silence to detect speech stop (in milliseconds). Defaults to 500ms. With shorter values the model will respond more quickly, but may jump in on short pauses from the user.
        
        Optional<Double> threshold
        
        Used only for server_vad mode. Activation threshold for VAD (0.0 to 1.0); defaults to 0.5. A higher threshold requires louder audio to activate the model, and thus might perform better in noisy environments.
      - SemanticVad
        
        JsonValue; type "semantic_vad" constant
        
        Type of turn detection, semantic_vad to turn on Semantic VAD.
        
        SEMANTIC_VAD("semantic_vad")
        
        Optional<Boolean> createResponse
        
        Whether or not to automatically generate a response when a VAD stop event occurs.
        
        Optional<Eagerness> eagerness
        
        Used only for semantic_vad mode. The eagerness of the model to respond. low will wait longer for the user to continue speaking, high will respond more quickly. auto is the default and is equivalent to medium. low, medium, and high have max timeouts of 8s, 4s, and 2s respectively.
        
        LOW("low")
        
        MEDIUM("medium")
        
        HIGH("high")
        
        AUTO("auto")
        
        Optional<Boolean> interruptResponse
        
        Whether or not to automatically interrupt any ongoing response with output to the default conversation (i.e. conversation of auto) when a VAD start event occurs.
  - Optional<RealtimeAudioConfigOutput> output
    - Optional<RealtimeAudioFormats> format
      
      The format of the output audio.
    - Optional<Double> speed
      
      The speed of the model's spoken response as a multiple of the original speed. 1.0 is the default speed. 0.25 is the minimum speed. 1.5 is the maximum speed. This value can only be changed in between model turns, not while a response is in progress.
      
      This parameter is a post-processing adjustment applied to the audio after it is generated; it is also possible to prompt the model to speak faster or slower.
    - Optional<Voice> voice
      
      The voice the model uses to respond. Supported built-in voices are alloy, ash, ballad, coral, echo, sage, shimmer, verse, marin, and cedar. You may also provide a custom voice object with an id, for example { "id": "voice_1234" }. Voice cannot be changed during the session once the model has responded with audio at least once. We recommend marin and cedar for best quality.
      - String
      - enum UnionMember1:
        
        ALLOY("alloy")
        
        ASH("ash")
        
        BALLAD("ballad")
        
        CORAL("coral")
        
        ECHO("echo")
        
        SAGE("sage")
        
        SHIMMER("shimmer")
        
        VERSE("verse")
        
        MARIN("marin")
        
        CEDAR("cedar")
      - class Id:
        
        Custom voice reference.
        
        String id
        
        The custom voice ID, e.g. voice_1234.
- Optional<List<Include>> include
  
  Additional fields to include in server outputs.
  
  item.input_audio_transcription.logprobs: Include logprobs for input audio transcription.
  - ITEM_INPUT_AUDIO_TRANSCRIPTION_LOGPROBS("item.input_audio_transcription.logprobs")
- Optional<String> instructions
  
  The default system instructions (i.e. system message) prepended to model calls. This field allows the client to guide the model on desired responses. The model can be instructed on response content and format (e.g. "be extremely succinct", "act friendly", "here are examples of good responses") and on audio behavior (e.g. "talk quickly", "inject emotion into your voice", "laugh frequently"). The instructions are not guaranteed to be followed by the model, but they provide guidance to the model on the desired behavior.
  
  Note that the server sets default instructions which will be used if this field is not set and are visible in the session.created event at the start of the session.
- Optional<MaxOutputTokens> maxOutputTokens
  
  Maximum number of output tokens for a single assistant response, inclusive of tool calls. Provide an integer between 1 and 4096 to limit output tokens, or inf for the maximum available tokens for a given model. Defaults to inf.
  - long
  - JsonValue;
    - INF("inf")
- Optional<Model> model
  
  The Realtime model used for this session.
  - GPT_REALTIME("gpt-realtime")
  - GPT_REALTIME_1_5("gpt-realtime-1.5")
  - GPT_REALTIME_2("gpt-realtime-2")
  - GPT_REALTIME_2025_08_28("gpt-realtime-2025-08-28")
  - GPT_4O_REALTIME_PREVIEW("gpt-4o-realtime-preview")
  - GPT_4O_REALTIME_PREVIEW_2024_10_01("gpt-4o-realtime-preview-2024-10-01")
  - GPT_4O_REALTIME_PREVIEW_2024_12_17("gpt-4o-realtime-preview-2024-12-17")
  - GPT_4O_REALTIME_PREVIEW_2025_06_03("gpt-4o-realtime-preview-2025-06-03")
  - GPT_4O_MINI_REALTIME_PREVIEW("gpt-4o-mini-realtime-preview")
  - GPT_4O_MINI_REALTIME_PREVIEW_2024_12_17("gpt-4o-mini-realtime-preview-2024-12-17")
  - GPT_REALTIME_MINI("gpt-realtime-mini")
  - GPT_REALTIME_MINI_2025_10_06("gpt-realtime-mini-2025-10-06")
  - GPT_REALTIME_MINI_2025_12_15("gpt-realtime-mini-2025-12-15")
  - GPT_AUDIO_1_5("gpt-audio-1.5")
  - GPT_AUDIO_MINI("gpt-audio-mini")
  - GPT_AUDIO_MINI_2025_10_06("gpt-audio-mini-2025-10-06")
  - GPT_AUDIO_MINI_2025_12_15("gpt-audio-mini-2025-12-15")
- Optional<List<OutputModality>> outputModalities
  
  The set of modalities the model can respond with. It defaults to ["audio"], indicating that the model will respond with audio plus a transcript. ["text"] can be used to make the model respond with text only. It is not possible to request both text and audio at the same time.
  - TEXT("text")
  - AUDIO("audio")
- Optional<Boolean> parallelToolCalls
  
  Whether the model may call multiple tools in parallel. Only supported by reasoning Realtime models such as gpt-realtime-2.
- Optional<ResponsePrompt> prompt
  
  Reference to a prompt template and its variables.
  - String id
    
    The unique identifier of the prompt template to use.
  - Optional<Variables> variables
    
    Optional map of values to substitute in for variables in your prompt. The substitution values can either be strings, or other Response input types like images or files.
    - String
    - class ResponseInputText:
      
      A text input to the model.
      - String text
        
        The text input to the model.
      - JsonValue; type "input_text" constant
        
        The type of the input item. Always input_text.
        
        INPUT_TEXT("input_text")
    - class ResponseInputImage:
      
      An image input to the model. Learn about image inputs.
      - Detail detail
        
        The detail level of the image to be sent to the model. One of high, low, auto, or original. Defaults to auto.
        
        LOW("low")
        
        HIGH("high")
        
        AUTO("auto")
        
        ORIGINAL("original")
      - JsonValue; type "input_image" constant
        
        The type of the input item. Always input_image.
        
        INPUT_IMAGE("input_image")
      - Optional<String> fileId
        
        The ID of the file to be sent to the model.
      - Optional<String> imageUrl
        
        The URL of the image to be sent to the model. A fully qualified URL or base64 encoded image in a data URL.
    - class ResponseInputFile:
      
      A file input to the model.
      - JsonValue; type "input_file" constant
        
        The type of the input item. Always input_file.
        
        INPUT_FILE("input_file")
      - Optional<Detail> detail
        
        The detail level of the file to be sent to the model. Use low for the default rendering behavior, or high to render the file at higher quality. Defaults to low.
        
        LOW("low")
        
        HIGH("high")
      - Optional<String> fileData
        
        The content of the file to be sent to the model.
      - Optional<String> fileId
        
        The ID of the file to be sent to the model.
      - Optional<String> fileUrl
        
        The URL of the file to be sent to the model.
      - Optional<String> filename
        
        The name of the file to be sent to the model.
  - Optional<String> version
    
    Optional version of the prompt template.
- Optional<RealtimeReasoning> reasoning
  
  Configuration for reasoning-capable Realtime models such as gpt-realtime-2.
  - Optional<RealtimeReasoningEffort> effort
    
    Constrains effort on reasoning for reasoning-capable Realtime models such as gpt-realtime-2.
    - MINIMAL("minimal")
    - LOW("low")
    - MEDIUM("medium")
    - HIGH("high")
    - XHIGH("xhigh")
- Optional<RealtimeToolChoiceConfig> toolChoice
  
  How the model chooses tools. Provide one of the string modes or force a specific function/MCP tool.
  - enum ToolChoiceOptions:
    
    Controls which (if any) tool is called by the model.
    
    none means the model will not call any tool and instead generates a message.
    
    auto means the model can pick between generating a message or calling one or more tools.
    
    required means the model must call one or more tools.
    - NONE("none")
    - AUTO("auto")
    - REQUIRED("required")
  - class ToolChoiceFunction:
    
    Use this option to force the model to call a specific function.
    - String name
      
      The name of the function to call.
    - JsonValue; type "function" constant
      
      For function calling, the type is always function.
      - FUNCTION("function")
  - class ToolChoiceMcp:
    
    Use this option to force the model to call a specific tool on a remote MCP server.
    - String serverLabel
      
      The label of the MCP server to use.
    - JsonValue; type "mcp" constant
      
      For MCP tools, the type is always mcp.
      - MCP("mcp")
    - Optional<String> name
      
      The name of the tool to call on the server.
- Optional<List<RealtimeToolsConfigUnion>> tools
  
  Tools available to the model.
  - class RealtimeFunctionTool:
    - Optional<String> description
      
      The description of the function, including guidance on when and how to call it, and guidance about what to tell the user when calling (if anything).
    - Optional<String> name
      
      The name of the function.
    - Optional<JsonValue> parameters
      
      Parameters of the function in JSON Schema.
    - Optional<Type> type
      
      The type of the tool, i.e. function.
      - FUNCTION("function")
  - Mcp
    - String serverLabel
      
      A label for this MCP server, used to identify it in tool calls.
    - JsonValue; type "mcp" constant
      
      The type of the MCP tool. Always mcp.
      - MCP("mcp")
    - Optional<AllowedTools> allowedTools
      
      List of allowed tool names or a filter object.
      - List<String>
      - class McpToolFilter:
        
        A filter object to specify which tools are allowed.
        
        Optional<Boolean> readOnly
        
        Indicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with readOnlyHint, it will match this filter.
        
        Optional<List<String>> toolNames
        
        List of allowed tool names.
    - Optional<String> authorization
      
      An OAuth access token that can be used with a remote MCP server, either with a custom MCP server URL or a service connector. Your application must handle the OAuth authorization flow and provide the token here.
    - Optional<ConnectorId> connectorId
      
      Identifier for service connectors, like those available in ChatGPT. One of server_url or connector_id must be provided.
      
      Currently supported connector_id values are:
      - Dropbox: connector_dropbox
      - Gmail: connector_gmail
      - Google Calendar: connector_googlecalendar
      - Google Drive: connector_googledrive
      - Microsoft Teams: connector_microsoftteams
      - Outlook Calendar: connector_outlookcalendar
      - Outlook Email: connector_outlookemail
      - SharePoint: connector_sharepoint
      - CONNECTOR_DROPBOX("connector_dropbox")
      - CONNECTOR_GMAIL("connector_gmail")
      - CONNECTOR_GOOGLECALENDAR("connector_googlecalendar")
      - CONNECTOR_GOOGLEDRIVE("connector_googledrive")
      - CONNECTOR_MICROSOFTTEAMS("connector_microsoftteams")
      - CONNECTOR_OUTLOOKCALENDAR("connector_outlookcalendar")
      - CONNECTOR_OUTLOOKEMAIL("connector_outlookemail")
      - CONNECTOR_SHAREPOINT("connector_sharepoint")
    - Optional<Boolean> deferLoading
      
      Whether this MCP tool is deferred and discovered via tool search.
    - Optional<Headers> headers
      
      Optional HTTP headers to send to the MCP server. Use for authentication or other purposes.
    - Optional<RequireApproval> requireApproval
      
      Specify which of the MCP server's tools require approval.
      - class McpToolApprovalFilter:
        
        Specify which of the MCP server's tools require approval. Can be always, never, or a filter object associated with tools that require approval.
        
        Optional<Always> always
        
        A filter object to specify which tools are allowed.
        
        Optional<Boolean> readOnly
        
        Indicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with readOnlyHint, it will match this filter.
        
        Optional<List<String>> toolNames
        
        List of allowed tool names.
        
        Optional<Never> never
        
        A filter object to specify which tools are allowed.
        
        Optional<Boolean> readOnly
        
        Indicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with readOnlyHint, it will match this filter.
        
        Optional<List<String>> toolNames
        
        List of allowed tool names.
      - enum McpToolApprovalSetting:
        
        Specify a single approval policy for all tools. One of always or never. When set to always, all tools will require approval. When set to never, no tools will require approval.
        
        ALWAYS("always")
        
        NEVER("never")
    - Optional<String> serverDescription
      
      Optional description of the MCP server, used to provide more context.
    - Optional<String> serverUrl
      
      The URL for the MCP server. One of server_url or connector_id must be provided.
- Optional<RealtimeTracingConfig> tracing
  
  Realtime API can write session traces to the Traces Dashboard. Set to null to disable tracing. Once tracing is enabled for a session, the configuration cannot be modified.
  
  auto will create a trace for the session with default values for the workflow name, group id, and metadata.
  - JsonValue;
    - AUTO("auto")
  - TracingConfiguration
    - Optional<String> groupId
      
      The group id to attach to this trace to enable filtering and grouping in the Traces Dashboard.
    - Optional<JsonValue> metadata
      
      The arbitrary metadata to attach to this trace to enable filtering in the Traces Dashboard.
    - Optional<String> workflowName
      
      The name of the workflow to attach to this trace. This is used to name the trace in the Traces Dashboard.
- Optional<RealtimeTruncation> truncation
  
  When the number of tokens in a conversation exceeds the model's input token limit, the conversation will be truncated, meaning messages (starting from the oldest) will not be included in the model's context. A 32k context model with 4,096 max output tokens can only include 28,224 tokens in the context before truncation occurs.
  
  Clients can configure truncation behavior to truncate with a lower max token limit, which is an effective way to control token usage and cost.
  
  Truncation will reduce the number of cached tokens on the next turn (busting the cache), since messages are dropped from the beginning of the context. However, clients can also configure truncation to retain messages up to a fraction of the maximum context size, which will reduce the need for future truncations and thus improve the cache rate.
  
  Truncation can be disabled entirely, which means the server will never truncate but would instead return an error if the conversation exceeds the model's input token limit.
  - RealtimeTruncationStrategy
    - AUTO("auto")
    - DISABLED("disabled")
  - class RealtimeTruncationRetentionRatio:
    
    Retain a fraction of the conversation tokens when the conversation exceeds the input token limit. This allows you to amortize truncations across multiple turns, which can help improve cached token usage.
    - double retentionRatio
      
      Fraction of post-instruction conversation tokens to retain (0.0 - 1.0) when the conversation exceeds the input token limit. Setting this to 0.8 means that messages will be dropped until 80% of the maximum allowed tokens are used. This helps reduce the frequency of truncations and improve cache rates.
    - JsonValue; type "retention_ratio" constant
      
      Use retention ratio truncation.
      - RETENTION_RATIO("retention_ratio")
    - Optional<TokenLimits> tokenLimits
      
      Optional custom token limits for this truncation strategy. If not provided, the model's default token limits will be used.
      - Optional<Long> postInstructions
        
        Maximum tokens allowed in the conversation after instructions (which include tool definitions). For example, setting this to 5,000 would mean that truncation would occur when the conversation exceeds 5,000 tokens after instructions. This cannot be higher than the model's context window size minus the maximum output tokens.
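
The retention-ratio behavior described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the class RetentionRatioSketch, the truncate method, and the per-message token counts are hypothetical, and the server's real token accounting may differ. It follows the description in the text: once the conversation exceeds the post-instructions limit, the oldest messages are dropped until the total is at or below retentionRatio times that limit.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Hypothetical client-side model of retention_ratio truncation; not part of the SDK.
class RetentionRatioSketch {

    /**
     * messageTokens holds the token count of each message, oldest first.
     * Returns the token counts of the messages that survive truncation.
     */
    public static List<Integer> truncate(List<Integer> messageTokens,
                                         long postInstructionsLimit,
                                         double retentionRatio) {
        long total = 0;
        for (int tokens : messageTokens) {
            total += tokens;
        }
        if (total <= postInstructionsLimit) {
            return List.copyOf(messageTokens); // under the limit: nothing is dropped
        }
        // Over the limit: drop oldest messages until we are at or below the target.
        long target = (long) Math.floor(postInstructionsLimit * retentionRatio);
        Deque<Integer> kept = new ArrayDeque<>(messageTokens);
        while (!kept.isEmpty() && total > target) {
            total -= kept.removeFirst();
        }
        return List.copyOf(kept);
    }

    public static void main(String[] args) {
        // 5,700 tokens against a 5,000 limit with ratio 0.8 (target 4,000).
        List<Integer> tokens = List.of(2000, 1500, 1000, 800, 400);
        System.out.println(truncate(tokens, 5000, 0.8));
    }
}
```

With a ratio below 1.0, each truncation leaves headroom below the limit, so several turns can pass before the next truncation, which is the amortization effect the description refers to.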

Realtime Tool Choice Config

class RealtimeToolChoiceConfig: A class that can be one of several variants (union).

How the model chooses tools. Provide one of the string modes or force a specific function/MCP tool.
- enum ToolChoiceOptions:
  
  Controls which (if any) tool is called by the model.
  
  none means the model will not call any tool and instead generates a message.
  
  auto means the model can pick between generating a message or calling one or more tools.
  
  required means the model must call one or more tools.
  - NONE("none")
  - AUTO("auto")
  - REQUIRED("required")
- class ToolChoiceFunction:
  
  Use this option to force the model to call a specific function.
  - String name
    
    The name of the function to call.
  - JsonValue; type "function" constant
    
    For function calling, the type is always function.
    - FUNCTION("function")
- class ToolChoiceMcp:
  
  Use this option to force the model to call a specific tool on a remote MCP server.
  - String serverLabel
    
    The label of the MCP server to use.
  - JsonValue; type "mcp" constant
    
    For MCP tools, the type is always mcp.
    - MCP("mcp")
  - Optional<String> name
    
    The name of the tool to call on the server.
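
The three variants above can be pictured as three JSON shapes: a bare string mode, a function selector, and an MCP selector. The following is a minimal sketch that builds those shapes by hand; ToolChoiceSketch and its methods are hypothetical helpers, not SDK classes, and the snake_case key server_label is an assumption made by analogy with the server_url and connector_id names used elsewhere in this reference.

```java
// Hypothetical helper that renders each tool-choice variant as a JSON string.
class ToolChoiceSketch {

    /** String mode variant: one of "none", "auto", or "required". */
    public static String mode(String mode) {
        if (!mode.equals("none") && !mode.equals("auto") && !mode.equals("required")) {
            throw new IllegalArgumentException("unknown mode: " + mode);
        }
        return quote(mode);
    }

    /** Function variant: forces a call to the named function. */
    public static String function(String name) {
        return "{\"type\":\"function\",\"name\":" + quote(name) + "}";
    }

    /** MCP variant: serverLabel is required, toolName may be null. */
    public static String mcp(String serverLabel, String toolName) {
        StringBuilder sb = new StringBuilder("{\"type\":\"mcp\",\"server_label\":");
        sb.append(quote(serverLabel)); // key name is an assumption, see lead-in
        if (toolName != null) {
            sb.append(",\"name\":").append(quote(toolName));
        }
        return sb.append("}").toString();
    }

    // Minimal JSON string quoting: escapes backslashes and double quotes only.
    private static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    public static void main(String[] args) {
        System.out.println(mode("auto"));
        System.out.println(function("get_weather"));
        System.out.println(mcp("docs", null));
    }
}
```

In real code the SDK's own builder types would normally replace this string assembly; the sketch only shows how the type constant distinguishes the two object variants from the string modes.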

Realtime Tools Config Union

class RealtimeToolsConfigUnion: A class that can be one of several variants (union).

Give the model access to additional tools via remote Model Context Protocol (MCP) servers. Learn more about MCP.
- class RealtimeFunctionTool:
  - Optional<String> description
    
    The description of the function, including guidance on when and how to call it, and guidance about what to tell the user when calling (if anything).
  - Optional<String> name
    
    The name of the function.
  - Optional<JsonValue> parameters
    
    Parameters of the function in JSON Schema.
  - Optional<Type> type
    
    The type of the tool, i.e. function.
    - FUNCTION("function")
- Mcp
  - String serverLabel
    
    A label for this MCP server, used to identify it in tool calls.
  - JsonValue; type "mcp" constant
    
    The type of the MCP tool. Always mcp.
    - MCP("mcp")
  - Optional<AllowedTools> allowedTools
    
    List of allowed tool names or a filter object.
    - List<String>
    - class McpToolFilter:
      
      A filter object to specify which tools are allowed.
      - Optional<Boolean> readOnly
        
        Indicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with readOnlyHint, it will match this filter.
      - Optional<List<String>> toolNames
        
        List of allowed tool names.
  - Optional<String> authorization
    
    An OAuth access token that can be used with a remote MCP server, either with a custom MCP server URL or a service connector. Your application must handle the OAuth authorization flow and provide the token here.
  - Optional<ConnectorId> connectorId
    
    Identifier for service connectors, like those available in ChatGPT. One of server_url or connector_id must be provided.
    
    Currently supported connector_id values are:
    - Dropbox: connector_dropbox
    - Gmail: connector_gmail
    - Google Calendar: connector_googlecalendar
    - Google Drive: connector_googledrive
    - Microsoft Teams: connector_microsoftteams
    - Outlook Calendar: connector_outlookcalendar
    - Outlook Email: connector_outlookemail
    - SharePoint: connector_sharepoint
    - CONNECTOR_DROPBOX("connector_dropbox")
    - CONNECTOR_GMAIL("connector_gmail")
    - CONNECTOR_GOOGLECALENDAR("connector_googlecalendar")
    - CONNECTOR_GOOGLEDRIVE("connector_googledrive")
    - CONNECTOR_MICROSOFTTEAMS("connector_microsoftteams")
    - CONNECTOR_OUTLOOKCALENDAR("connector_outlookcalendar")
    - CONNECTOR_OUTLOOKEMAIL("connector_outlookemail")
    - CONNECTOR_SHAREPOINT("connector_sharepoint")
  - Optional<Boolean> deferLoading
    
    Whether this MCP tool is deferred and discovered via tool search.
  - Optional<Headers> headers
    
    Optional HTTP headers to send to the MCP server. Use for authentication or other purposes.
  - Optional<RequireApproval> requireApproval
    
    Specify which of the MCP server's tools require approval.
    - class McpToolApprovalFilter:
      
      Specify which of the MCP server's tools require approval. Can be always, never, or a filter object associated with tools that require approval.
      - Optional<Always> always
        
        A filter object to specify which tools are allowed.
        
        Optional<Boolean> readOnly
        
        Indicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with readOnlyHint, it will match this filter.
        
        Optional<List<String>> toolNames
        
        List of allowed tool names.
      - Optional<Never> never
        
        A filter object to specify which tools are allowed.
        
        Optional<Boolean> readOnly
        
        Indicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with readOnlyHint, it will match this filter.
        
        Optional<List<String>> toolNames
        
        List of allowed tool names.
    - enum McpToolApprovalSetting:
      
      Specify a single approval policy for all tools. One of always or never. When set to always, all tools will require approval. When set to never, no tools will require approval.
      - ALWAYS("always")
      - NEVER("never")
  - Optional<String> serverDescription
    
    Optional description of the MCP server, used to provide more context.
  - Optional<String> serverUrl
    
    The URL for the MCP server. One of server_url or connector_id must be provided.
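
The Mcp fields above carry a few constraints that are easy to check before sending a configuration: serverLabel is required, one of server_url or connector_id must be provided, and connector_id must be one of the listed values. The sketch below is a hypothetical client-side pre-check; McpToolCheckSketch and validate are illustrative names, not SDK APIs. The text does not say whether supplying both server_url and connector_id is accepted, so this sketch only rejects the case where neither is set.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical pre-flight validation of an MCP tool configuration.
class McpToolCheckSketch {

    // Connector ids copied from the list of currently supported values above.
    static final Set<String> CONNECTOR_IDS = Set.of(
            "connector_dropbox", "connector_gmail", "connector_googlecalendar",
            "connector_googledrive", "connector_microsoftteams",
            "connector_outlookcalendar", "connector_outlookemail",
            "connector_sharepoint");

    /** Returns a list of problems; an empty list means the basic checks passed. */
    public static List<String> validate(String serverLabel, String serverUrl, String connectorId) {
        List<String> problems = new ArrayList<>();
        if (serverLabel == null || serverLabel.isEmpty()) {
            problems.add("serverLabel is required");
        }
        if (serverUrl == null && connectorId == null) {
            problems.add("one of server_url or connector_id must be provided");
        }
        if (connectorId != null && !CONNECTOR_IDS.contains(connectorId)) {
            problems.add("unsupported connector_id: " + connectorId);
        }
        return problems;
    }

    public static void main(String[] args) {
        // example.com is a placeholder URL used only for illustration.
        System.out.println(validate("docs", "https://example.com/mcp", null));
        System.out.println(validate("mail", null, "connector_gmail"));
        System.out.println(validate("broken", null, null));
    }
}
```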

Realtime Tracing Config

class RealtimeTracingConfig: A class that can be one of several variants (union).

Realtime API can write session traces to the Traces Dashboard. Set to null to disable tracing. Once tracing is enabled for a session, the configuration cannot be modified.

auto will create a trace for the session with default values for the workflow name, group id, and metadata.
- JsonValue;
  - AUTO("auto")
- TracingConfiguration
  - Optional<String> groupId
    
    The group id to attach to this trace to enable filtering and grouping in the Traces Dashboard.
  - Optional<JsonValue> metadata
    
    The arbitrary metadata to attach to this trace to enable filtering in the Traces Dashboard.
  - Optional<String> workflowName
    
    The name of the workflow to attach to this trace. This is used to name the trace in the Traces Dashboard.

Realtime Transcription Session Audio

class RealtimeTranscriptionSessionAudio:

Configuration for input and output audio.
- Optional<RealtimeTranscriptionSessionAudioInput> input
  - Optional<RealtimeAudioFormats> format
    
    The PCM audio format. Only a 24kHz sample rate is supported.
    - AudioPcm
      - Optional<Rate> rate
        
        The sample rate of the audio. Always 24000.
        
        _24000(24000)
      - Optional<Type> type
        
        The audio format. Always audio/pcm.
        
        AUDIO_PCM("audio/pcm")
    - AudioPcmu
      - Optional<Type> type
        
        The audio format. Always audio/pcmu.
        
        AUDIO_PCMU("audio/pcmu")
    - AudioPcma
      - Optional<Type> type
        
        The audio format. Always audio/pcma.
        
        AUDIO_PCMA("audio/pcma")
  - Optional<NoiseReduction> noiseReduction
    
    Configuration for input audio noise reduction. This can be set to null to turn off. Noise reduction filters audio added to the input audio buffer before it is sent to VAD and the model. Filtering the audio can improve VAD and turn detection accuracy (reducing false positives) and model performance by improving perception of the input audio.
    - Optional<NoiseReductionType> type
      
      Type of noise reduction. near_field is for close-talking microphones such as headphones, far_field is for far-field microphones such as laptop or conference room microphones.
      - NEAR_FIELD("near_field")
      - FAR_FIELD("far_field")
  - Optional<AudioTranscription> transcription
    
    Configuration for input audio transcription, defaults to off and can be set to null to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through the /audio/transcriptions endpoint and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service.
    - Optional<Delay> delay
      
      Controls how long the model waits before emitting transcription text. Higher values can improve transcription accuracy at the cost of latency. Only supported with gpt-realtime-whisper in GA Realtime sessions.
      - MINIMAL("minimal")
      - LOW("low")
      - MEDIUM("medium")
      - HIGH("high")
      - XHIGH("xhigh")
    - Optional<String> language
      
      The language of the input audio. Supplying the input language in ISO-639-1 (e.g. en) format will improve accuracy and latency.
    - Optional<Model> model
      
      The model to use for transcription. Current options are whisper-1, gpt-4o-mini-transcribe, gpt-4o-mini-transcribe-2025-12-15, gpt-4o-transcribe, gpt-4o-transcribe-diarize, and gpt-realtime-whisper. Use gpt-4o-transcribe-diarize when you need diarization with speaker labels.
      - WHISPER_1("whisper-1")
      - GPT_4O_MINI_TRANSCRIBE("gpt-4o-mini-transcribe")
      - GPT_4O_MINI_TRANSCRIBE_2025_12_15("gpt-4o-mini-transcribe-2025-12-15")
      - GPT_4O_TRANSCRIBE("gpt-4o-transcribe")
      - GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize")
      - GPT_REALTIME_WHISPER("gpt-realtime-whisper")
    - Optional<String> prompt
      
      An optional text to guide the model's style or continue a previous audio segment. For whisper-1, the prompt is a list of keywords. For gpt-4o-transcribe models (excluding gpt-4o-transcribe-diarize), the prompt is a free text string, for example "expect words related to technology". Prompt is not supported with gpt-realtime-whisper in GA Realtime sessions.
  - Optional<RealtimeTranscriptionSessionAudioInputTurnDetection> turnDetection
    
    Configuration for turn detection, either Server VAD or Semantic VAD. This can be set to null to turn off, in which case the client must manually trigger a model response.
    
    Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.
    
    Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.
    
    For gpt-realtime-whisper transcription sessions, turn detection must be set to null; VAD is not supported.
    - ServerVad
      - JsonValue; type "server_vad" constant
        
        Type of turn detection, server_vad to turn on simple Server VAD.
        
        SERVER_VAD("server_vad")
      - Optional<Boolean> createResponse
        
        Whether or not to automatically generate a response when a VAD stop event occurs. If interrupt_response is set to false this may fail to create a response if the model is already responding.
        
        If both create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted.
      - Optional<Long> idleTimeoutMs
        
        Optional timeout after which a model response will be triggered automatically. This is useful for situations in which a long pause from the user is unexpected, such as a phone call. The model will effectively prompt the user to continue the conversation based on the current context.
        
        The timeout value will be applied after the last model response's audio has finished playing, i.e. it's set to the response.done time plus audio playback duration.
        
        An input_audio_buffer.timeout_triggered event (plus events associated with the Response) will be emitted when the timeout is reached. Idle timeout is currently only supported for server_vad mode.
      - Optional<Boolean> interruptResponse
        
        Whether or not to automatically interrupt (cancel) any ongoing response with output to the default conversation (i.e. conversation of auto) when a VAD start event occurs. If true then the response will be cancelled, otherwise it will continue until complete.
        
        If both create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted.
      - Optional<Long> prefixPaddingMs
        
        Used only for server_vad mode. Amount of audio to include before the VAD detected speech (in milliseconds). Defaults to 300ms.
      - Optional<Long> silenceDurationMs
        
        Used only for server_vad mode. Duration of silence to detect speech stop (in milliseconds). Defaults to 500ms. With shorter values the model will respond more quickly, but may jump in on short pauses from the user.
      - Optional<Double> threshold
        
        Used only for server_vad mode. Activation threshold for VAD (0.0 to 1.0). Defaults to 0.5. A higher threshold will require louder audio to activate the model, and thus might perform better in noisy environments.
    - SemanticVad
      - JsonValue; type "semantic_vad" (constant)
        
        Type of turn detection, semantic_vad to turn on Semantic VAD.
        - SEMANTIC_VAD("semantic_vad")
      - Optional<Boolean> createResponse
        
        Whether or not to automatically generate a response when a VAD stop event occurs.
      - Optional<Eagerness> eagerness
        
        Used only for semantic_vad mode. The eagerness of the model to respond. low will wait longer for the user to continue speaking, high will respond more quickly. auto is the default and is equivalent to medium. low, medium, and high have max timeouts of 8s, 4s, and 2s respectively.
        - LOW("low")
        - MEDIUM("medium")
        - HIGH("high")
        - AUTO("auto")
      - Optional<Boolean> interruptResponse
        
        Whether or not to automatically interrupt any ongoing response with output to the default conversation (i.e. conversation of auto) when a VAD start event occurs.

Realtime Transcription Session Audio Input

class RealtimeTranscriptionSessionAudioInput:
- Optional<RealtimeAudioFormats> format
  
  The PCM audio format. Only a 24kHz sample rate is supported.
  - AudioPcm
    - Optional<Rate> rate
      
      The sample rate of the audio. Always 24000.
      - _24000(24000)
    - Optional<Type> type
      
      The audio format. Always audio/pcm.
      - AUDIO_PCM("audio/pcm")
  - AudioPcmu
    - Optional<Type> type
      
      The audio format. Always audio/pcmu.
      - AUDIO_PCMU("audio/pcmu")
  - AudioPcma
    - Optional<Type> type
      
      The audio format. Always audio/pcma.
      - AUDIO_PCMA("audio/pcma")
- Optional<NoiseReduction> noiseReduction
  
  Configuration for input audio noise reduction. This can be set to null to turn off. Noise reduction filters audio added to the input audio buffer before it is sent to VAD and the model. Filtering the audio can improve VAD and turn detection accuracy (reducing false positives) and model performance by improving perception of the input audio.
  - Optional<NoiseReductionType> type
    
    Type of noise reduction. near_field is for close-talking microphones such as headphones, far_field is for far-field microphones such as laptop or conference room microphones.
    - NEAR_FIELD("near_field")
    - FAR_FIELD("far_field")
- Optional<AudioTranscription> transcription
  
  Configuration for input audio transcription. Defaults to off; once enabled, it can be set to null to turn it off again. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through the /audio/transcriptions endpoint and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service.
  - Optional<Delay> delay
    
    Controls how long the model waits before emitting transcription text. Higher values can improve transcription accuracy at the cost of latency. Only supported with gpt-realtime-whisper in GA Realtime sessions.
    - MINIMAL("minimal")
    - LOW("low")
    - MEDIUM("medium")
    - HIGH("high")
    - XHIGH("xhigh")
  - Optional<String> language
    
    The language of the input audio. Supplying the input language in ISO-639-1 (e.g. en) format will improve accuracy and latency.
  - Optional<Model> model
    
    The model to use for transcription. Current options are whisper-1, gpt-4o-mini-transcribe, gpt-4o-mini-transcribe-2025-12-15, gpt-4o-transcribe, gpt-4o-transcribe-diarize, and gpt-realtime-whisper. Use gpt-4o-transcribe-diarize when you need diarization with speaker labels.
    - WHISPER_1("whisper-1")
    - GPT_4O_MINI_TRANSCRIBE("gpt-4o-mini-transcribe")
    - GPT_4O_MINI_TRANSCRIBE_2025_12_15("gpt-4o-mini-transcribe-2025-12-15")
    - GPT_4O_TRANSCRIBE("gpt-4o-transcribe")
    - GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize")
    - GPT_REALTIME_WHISPER("gpt-realtime-whisper")
  - Optional<String> prompt
    
    An optional text to guide the model's style or continue a previous audio segment. For whisper-1, the prompt is a list of keywords. For gpt-4o-transcribe models (excluding gpt-4o-transcribe-diarize), the prompt is a free text string, for example "expect words related to technology". Prompt is not supported with gpt-realtime-whisper in GA Realtime sessions.
- Optional<RealtimeTranscriptionSessionAudioInputTurnDetection> turnDetection
  
  Configuration for turn detection, either Server VAD or Semantic VAD. This can be set to null to turn off, in which case the client must manually trigger model response.
  
  Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.
  
  Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.
  
  For gpt-realtime-whisper transcription sessions, turn detection must be set to null; VAD is not supported.
  - ServerVad
    - JsonValue; type "server_vad" (constant)
      
      Type of turn detection, server_vad to turn on simple Server VAD.
      - SERVER_VAD("server_vad")
    - Optional<Boolean> createResponse
      
      Whether or not to automatically generate a response when a VAD stop event occurs. If interrupt_response is set to false this may fail to create a response if the model is already responding.
      
      If both create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted.
    - Optional<Long> idleTimeoutMs
      
      Optional timeout after which a model response will be triggered automatically. This is useful for situations in which a long pause from the user is unexpected, such as a phone call. The model will effectively prompt the user to continue the conversation based on the current context.
      
      The timeout value will be applied after the last model response's audio has finished playing, i.e. it's set to the response.done time plus audio playback duration.
      
      An input_audio_buffer.timeout_triggered event (plus events associated with the Response) will be emitted when the timeout is reached. Idle timeout is currently only supported for server_vad mode.
    - Optional<Boolean> interruptResponse
      
      Whether or not to automatically interrupt (cancel) any ongoing response with output to the default conversation (i.e. conversation of auto) when a VAD start event occurs. If true then the response will be cancelled, otherwise it will continue until complete.
      
      If both create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted.
    - Optional<Long> prefixPaddingMs
      
      Used only for server_vad mode. Amount of audio to include before the VAD detected speech (in milliseconds). Defaults to 300ms.
    - Optional<Long> silenceDurationMs
      
      Used only for server_vad mode. Duration of silence to detect speech stop (in milliseconds). Defaults to 500ms. With shorter values the model will respond more quickly, but may jump in on short pauses from the user.
    - Optional<Double> threshold
      
      Used only for server_vad mode. Activation threshold for VAD (0.0 to 1.0). Defaults to 0.5. A higher threshold will require louder audio to activate the model, and thus might perform better in noisy environments.
  - SemanticVad
    - JsonValue; type "semantic_vad" (constant)
      
      Type of turn detection, semantic_vad to turn on Semantic VAD.
      - SEMANTIC_VAD("semantic_vad")
    - Optional<Boolean> createResponse
      
      Whether or not to automatically generate a response when a VAD stop event occurs.
    - Optional<Eagerness> eagerness
      
      Used only for semantic_vad mode. The eagerness of the model to respond. low will wait longer for the user to continue speaking, high will respond more quickly. auto is the default and is equivalent to medium. low, medium, and high have max timeouts of 8s, 4s, and 2s respectively.
      - LOW("low")
      - MEDIUM("medium")
      - HIGH("high")
      - AUTO("auto")
    - Optional<Boolean> interruptResponse
      
      Whether or not to automatically interrupt any ongoing response with output to the default conversation (i.e. conversation of auto) when a VAD start event occurs.
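
As a rough illustration of how the Server VAD fields above fit together, the sketch below assembles a server_vad object as a JSON string using the documented defaults (threshold 0.5, 300 ms prefix padding, 500 ms silence duration). ServerVadSketch and its methods are hypothetical helpers, not part of the SDK, and the snake_case keys prefix_padding_ms and silence_duration_ms are assumed wire names chosen by analogy with create_response and interrupt_response.

```java
import java.util.Locale;

/** Hypothetical helper for illustration; not part of the SDK. */
final class ServerVadSketch {
    // Defaults quoted in the field descriptions above.
    static final double DEFAULT_THRESHOLD = 0.5;
    static final long DEFAULT_PREFIX_PADDING_MS = 300;
    static final long DEFAULT_SILENCE_DURATION_MS = 500;

    /**
     * Builds a server_vad turn-detection object as JSON.
     * The keys prefix_padding_ms and silence_duration_ms are assumed wire names.
     */
    static String toJson(double threshold, long prefixPaddingMs, long silenceDurationMs) {
        if (Double.isNaN(threshold) || threshold < 0.0 || threshold > 1.0) {
            throw new IllegalArgumentException("threshold must be between 0.0 and 1.0");
        }
        if (prefixPaddingMs < 0 || silenceDurationMs < 0) {
            throw new IllegalArgumentException("durations must not be negative");
        }
        // Locale.ROOT keeps the decimal separator a '.' regardless of the default locale.
        return String.format(Locale.ROOT,
                "{\"type\":\"server_vad\",\"threshold\":%.2f,"
                        + "\"prefix_padding_ms\":%d,\"silence_duration_ms\":%d}",
                threshold, prefixPaddingMs, silenceDurationMs);
    }

    /** The same object with every tunable left at its documented default. */
    static String defaults() {
        return toJson(DEFAULT_THRESHOLD, DEFAULT_PREFIX_PADDING_MS, DEFAULT_SILENCE_DURATION_MS);
    }

    public static void main(String[] args) {
        System.out.println(defaults());
        // A higher threshold and shorter silence window: responds sooner,
        // but may cut in on brief pauses from the user.
        System.out.println(toJson(0.7, 300, 250));
    }
}
```

Validating the threshold range on the client side is a design choice of this sketch; the server may apply its own validation.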

Realtime Transcription Session Audio Input Turn Detection

class RealtimeTranscriptionSessionAudioInputTurnDetection: A class that can be one of several variants (union).

Configuration for turn detection, either Server VAD or Semantic VAD. This can be set to null to turn off, in which case the client must manually trigger model response.

Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.

Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.

For gpt-realtime-whisper transcription sessions, turn detection must be set to null; VAD is not supported.
- ServerVad
  - JsonValue; type "server_vad" (constant)
    
    Type of turn detection, server_vad to turn on simple Server VAD.
    - SERVER_VAD("server_vad")
  - Optional<Boolean> createResponse
    
    Whether or not to automatically generate a response when a VAD stop event occurs. If interrupt_response is set to false this may fail to create a response if the model is already responding.
    
    If both create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted.
  - Optional<Long> idleTimeoutMs
    
    Optional timeout after which a model response will be triggered automatically. This is useful for situations in which a long pause from the user is unexpected, such as a phone call. The model will effectively prompt the user to continue the conversation based on the current context.
    
    The timeout value will be applied after the last model response's audio has finished playing, i.e. it's set to the response.done time plus audio playback duration.
    
    An input_audio_buffer.timeout_triggered event (plus events associated with the Response) will be emitted when the timeout is reached. Idle timeout is currently only supported for server_vad mode.
  - Optional<Boolean> interruptResponse
    
    Whether or not to automatically interrupt (cancel) any ongoing response with output to the default conversation (i.e. conversation of auto) when a VAD start event occurs. If true then the response will be cancelled, otherwise it will continue until complete.
    
    If both create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted.
  - Optional<Long> prefixPaddingMs
    
    Used only for server_vad mode. Amount of audio to include before the VAD detected speech (in milliseconds). Defaults to 300ms.
  - Optional<Long> silenceDurationMs
    
    Used only for server_vad mode. Duration of silence to detect speech stop (in milliseconds). Defaults to 500ms. With shorter values the model will respond more quickly, but may jump in on short pauses from the user.
  - Optional<Double> threshold
    
    Used only for server_vad mode. Activation threshold for VAD (0.0 to 1.0). Defaults to 0.5. A higher threshold will require louder audio to activate the model, and thus might perform better in noisy environments.
- SemanticVad
  - JsonValue; type "semantic_vad" (constant)
    
    Type of turn detection, semantic_vad to turn on Semantic VAD.
    - SEMANTIC_VAD("semantic_vad")
  - Optional<Boolean> createResponse
    
    Whether or not to automatically generate a response when a VAD stop event occurs.
  - Optional<Eagerness> eagerness
    
    Used only for semantic_vad mode. The eagerness of the model to respond. low will wait longer for the user to continue speaking, high will respond more quickly. auto is the default and is equivalent to medium. low, medium, and high have max timeouts of 8s, 4s, and 2s respectively.
    - LOW("low")
    - MEDIUM("medium")
    - HIGH("high")
    - AUTO("auto")
  - Optional<Boolean> interruptResponse
    
    Whether or not to automatically interrupt any ongoing response with output to the default conversation (i.e. conversation of auto) when a VAD start event occurs.
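
The eagerness description above maps each setting to a maximum timeout. A minimal sketch of that mapping follows; EagernessSketch is a hypothetical illustration and does not reproduce the SDK's own Eagerness type.

```java
import java.time.Duration;

/** Hypothetical illustration of the eagerness-to-timeout mapping described above. */
final class EagernessSketch {
    enum Eagerness { LOW, MEDIUM, HIGH, AUTO }

    /** Returns the documented maximum timeout; AUTO is treated as MEDIUM. */
    static Duration maxTimeout(Eagerness eagerness) {
        switch (eagerness) {
            case LOW:
                return Duration.ofSeconds(8);
            case HIGH:
                return Duration.ofSeconds(2);
            case MEDIUM:
            case AUTO:
            default:
                return Duration.ofSeconds(4);
        }
    }

    public static void main(String[] args) {
        for (Eagerness e : Eagerness.values()) {
            System.out.println(e + " -> " + maxTimeout(e).getSeconds() + "s");
        }
    }
}
```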

Realtime Transcription Session Create Request

class RealtimeTranscriptionSessionCreateRequest:

Realtime transcription session object configuration.
- JsonValue; type "transcription" (constant)
  
  The type of session to create. Always transcription for transcription sessions.
  - TRANSCRIPTION("transcription")
- Optional<RealtimeTranscriptionSessionAudio> audio
  
  Configuration for input and output audio.
  - Optional<RealtimeTranscriptionSessionAudioInput> input
    - Optional<RealtimeAudioFormats> format
      
      The PCM audio format. Only a 24kHz sample rate is supported.
      - AudioPcm
        - Optional<Rate> rate
          
          The sample rate of the audio. Always 24000.
          - _24000(24000)
        - Optional<Type> type
          
          The audio format. Always audio/pcm.
          - AUDIO_PCM("audio/pcm")
      - AudioPcmu
        - Optional<Type> type
          
          The audio format. Always audio/pcmu.
          - AUDIO_PCMU("audio/pcmu")
      - AudioPcma
        - Optional<Type> type
          
          The audio format. Always audio/pcma.
          - AUDIO_PCMA("audio/pcma")
    - Optional<NoiseReduction> noiseReduction
      
      Configuration for input audio noise reduction. This can be set to null to turn off. Noise reduction filters audio added to the input audio buffer before it is sent to VAD and the model. Filtering the audio can improve VAD and turn detection accuracy (reducing false positives) and model performance by improving perception of the input audio.
      - Optional<NoiseReductionType> type
        
        Type of noise reduction. near_field is for close-talking microphones such as headphones, far_field is for far-field microphones such as laptop or conference room microphones.
        - NEAR_FIELD("near_field")
        - FAR_FIELD("far_field")
    - Optional<AudioTranscription> transcription
      
      Configuration for input audio transcription. Defaults to off; once enabled, it can be set to null to turn it off again. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through the /audio/transcriptions endpoint and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service.
      - Optional<Delay> delay
        
        Controls how long the model waits before emitting transcription text. Higher values can improve transcription accuracy at the cost of latency. Only supported with gpt-realtime-whisper in GA Realtime sessions.
        - MINIMAL("minimal")
        - LOW("low")
        - MEDIUM("medium")
        - HIGH("high")
        - XHIGH("xhigh")
      - Optional<String> language
        
        The language of the input audio. Supplying the input language in ISO-639-1 (e.g. en) format will improve accuracy and latency.
      - Optional<Model> model
        
        The model to use for transcription. Current options are whisper-1, gpt-4o-mini-transcribe, gpt-4o-mini-transcribe-2025-12-15, gpt-4o-transcribe, gpt-4o-transcribe-diarize, and gpt-realtime-whisper. Use gpt-4o-transcribe-diarize when you need diarization with speaker labels.
        - WHISPER_1("whisper-1")
        - GPT_4O_MINI_TRANSCRIBE("gpt-4o-mini-transcribe")
        - GPT_4O_MINI_TRANSCRIBE_2025_12_15("gpt-4o-mini-transcribe-2025-12-15")
        - GPT_4O_TRANSCRIBE("gpt-4o-transcribe")
        - GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize")
        - GPT_REALTIME_WHISPER("gpt-realtime-whisper")
      - Optional<String> prompt
        
        An optional text to guide the model's style or continue a previous audio segment. For whisper-1, the prompt is a list of keywords. For gpt-4o-transcribe models (excluding gpt-4o-transcribe-diarize), the prompt is a free text string, for example "expect words related to technology". Prompt is not supported with gpt-realtime-whisper in GA Realtime sessions.
    - Optional<RealtimeTranscriptionSessionAudioInputTurnDetection> turnDetection
      
      Configuration for turn detection, either Server VAD or Semantic VAD. This can be set to null to turn off, in which case the client must manually trigger model response.
      
      Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.
      
      Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.
      
      For gpt-realtime-whisper transcription sessions, turn detection must be set to null; VAD is not supported.
      - ServerVad
        - JsonValue; type "server_vad" (constant)
          
          Type of turn detection, server_vad to turn on simple Server VAD.
          - SERVER_VAD("server_vad")
        - Optional<Boolean> createResponse
          
          Whether or not to automatically generate a response when a VAD stop event occurs. If interrupt_response is set to false this may fail to create a response if the model is already responding.
          
          If both create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted.
        - Optional<Long> idleTimeoutMs
          
          Optional timeout after which a model response will be triggered automatically. This is useful for situations in which a long pause from the user is unexpected, such as a phone call. The model will effectively prompt the user to continue the conversation based on the current context.
          
          The timeout value will be applied after the last model response's audio has finished playing, i.e. it's set to the response.done time plus audio playback duration.
          
          An input_audio_buffer.timeout_triggered event (plus events associated with the Response) will be emitted when the timeout is reached. Idle timeout is currently only supported for server_vad mode.
        - Optional<Boolean> interruptResponse
          
          Whether or not to automatically interrupt (cancel) any ongoing response with output to the default conversation (i.e. conversation of auto) when a VAD start event occurs. If true then the response will be cancelled, otherwise it will continue until complete.
          
          If both create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted.
        - Optional<Long> prefixPaddingMs
          
          Used only for server_vad mode. Amount of audio to include before the VAD detected speech (in milliseconds). Defaults to 300ms.
        - Optional<Long> silenceDurationMs
          
          Used only for server_vad mode. Duration of silence to detect speech stop (in milliseconds). Defaults to 500ms. With shorter values the model will respond more quickly, but may jump in on short pauses from the user.
        - Optional<Double> threshold
          
          Used only for server_vad mode. Activation threshold for VAD (0.0 to 1.0). Defaults to 0.5. A higher threshold will require louder audio to activate the model, and thus might perform better in noisy environments.
      - SemanticVad
        - JsonValue; type "semantic_vad" (constant)
          
          Type of turn detection, semantic_vad to turn on Semantic VAD.
          - SEMANTIC_VAD("semantic_vad")
        - Optional<Boolean> createResponse
          
          Whether or not to automatically generate a response when a VAD stop event occurs.
        - Optional<Eagerness> eagerness
          
          Used only for semantic_vad mode. The eagerness of the model to respond. low will wait longer for the user to continue speaking, high will respond more quickly. auto is the default and is equivalent to medium. low, medium, and high have max timeouts of 8s, 4s, and 2s respectively.
          - LOW("low")
          - MEDIUM("medium")
          - HIGH("high")
          - AUTO("auto")
        - Optional<Boolean> interruptResponse
          
          Whether or not to automatically interrupt any ongoing response with output to the default conversation (i.e. conversation of auto) when a VAD start event occurs.
- Optional<List<Include>> include
  
  Additional fields to include in server outputs.
  
  item.input_audio_transcription.logprobs: Include logprobs for input audio transcription.
  - ITEM_INPUT_AUDIO_TRANSCRIPTION_LOGPROBS("item.input_audio_transcription.logprobs")
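
To show how the pieces of a transcription session request relate, the sketch below hand-assembles a request body as JSON. It is an illustration under assumptions rather than the SDK's builder API: TranscriptionSessionSketch is hypothetical, the key turn_detection is an assumed wire name, and the inputs are assumed to be plain identifiers that need no JSON escaping. It also encodes the rule stated above that gpt-realtime-whisper sessions must set turn detection to null.

```java
/** Hypothetical helper for illustration; not part of the SDK. */
final class TranscriptionSessionSketch {
    /**
     * Builds a transcription session body.
     * Inputs are assumed to contain no characters that require JSON escaping.
     */
    static String createRequestJson(String model, String language, boolean includeLogprobs) {
        if (model == null || model.isEmpty()) {
            throw new IllegalArgumentException("model is required in this sketch");
        }
        StringBuilder json = new StringBuilder();
        json.append("{\"type\":\"transcription\",\"audio\":{\"input\":{");
        json.append("\"transcription\":{\"model\":\"").append(model).append("\"");
        if (language != null) {
            // ISO-639-1 code such as "en", per the language field description.
            json.append(",\"language\":\"").append(language).append("\"");
        }
        json.append("}");
        if ("gpt-realtime-whisper".equals(model)) {
            // VAD is not supported for this model, so turn detection is explicitly null.
            json.append(",\"turn_detection\":null");
        }
        json.append("}}");
        if (includeLogprobs) {
            json.append(",\"include\":[\"item.input_audio_transcription.logprobs\"]");
        }
        json.append("}");
        return json.toString();
    }

    public static void main(String[] args) {
        System.out.println(createRequestJson("gpt-4o-transcribe", "en", true));
        System.out.println(createRequestJson("gpt-realtime-whisper", null, false));
    }
}
```

In real code a JSON library or the SDK's own request types would handle escaping and field naming; string concatenation is used here only to keep the sketch dependency-free.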

Realtime Translation Client Event

class RealtimeTranslationClientEvent: A class that can be one of several variants (union).

A Realtime translation client event.
- class RealtimeTranslationSessionUpdateEvent:
  
  Send this event to update the translation session configuration. Translation sessions support updates to audio.output.language, audio.input.transcription, and audio.input.noise_reduction.
  - RealtimeTranslationSessionUpdateRequest session
    
    Translation session fields to update. The session type and model are set at creation and cannot be changed with session.update.
    - Optional<Audio> audio
      
      Configuration for translation input and output audio.
      - Optional<Input> input
        - Optional<NoiseReduction> noiseReduction
          
          Optional input noise reduction. Set to null to disable it.
          - NoiseReductionType type
            
            Type of noise reduction. near_field is for close-talking microphones such as headphones, far_field is for far-field microphones such as laptop or conference room microphones.
            - NEAR_FIELD("near_field")
            - FAR_FIELD("far_field")
        - Optional<Transcription> transcription
          
          Optional source-language transcription. When configured, the server emits session.input_transcript.delta events. Translation itself still runs from the input audio stream.
          - String model
            
            The transcription model to use for source transcript deltas.
      - Optional<Output> output
        - Optional<String> language
          
          Target language for translated output audio and transcript deltas.
  - JsonValue; type "session.update" (constant)
    
    The event type, must be session.update.
    - SESSION_UPDATE("session.update")
  - Optional<String> eventId
    
    Optional client-generated ID used to identify this event.
- class RealtimeTranslationInputAudioBufferAppendEvent:
  
  Send this event to append audio bytes to the translation session input audio buffer.
  
  WebSocket translation sessions accept base64-encoded 24 kHz PCM16 mono little-endian raw audio bytes. Unsupported WebSocket audio formats return a validation error because lower-quality audio materially degrades translation quality.
  
  Translation consumes 200 ms engine frames. For best realtime behavior, append audio in 200 ms chunks. If a chunk is shorter, the server buffers it until it has enough audio for one frame. If a chunk is longer, the server splits it into 200 ms frames and enqueues them back-to-back.
  
  Keep appending silence while the session is active. If a client stops sending audio and later resumes, model time treats the resumed audio as contiguous with the previous audio rather than as a real-world pause.
  - String audio
    
    Base64-encoded 24 kHz PCM16 mono audio bytes.
  - JsonValue; type "session.input_audio_buffer.append" (constant)
    
    The event type, must be session.input_audio_buffer.append.
    - SESSION_INPUT_AUDIO_BUFFER_APPEND("session.input_audio_buffer.append")
  - Optional<String> eventId
    
    Optional client-generated ID used to identify this event.
- class RealtimeTranslationSessionCloseEvent:
  
  Gracefully close the realtime translation session. The server flushes pending input audio and emits any remaining translated output before closing the session.
  - JsonValue; type "session.close" (constant)
    
    The event type, must be session.close.
    - SESSION_CLOSE("session.close")
  - Optional<String> eventId
    
    Optional client-generated ID used to identify this event.
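
For orientation, here is a minimal sketch of two of the client events above written as raw JSON: a session.update that changes audio.output.language, and a session.close. TranslationClientEventsSketch is a hypothetical helper, the top-level session key is assumed from the field name, and the language value is assumed to be a short code that needs no escaping.

```java
/** Hypothetical helper for illustration; not part of the SDK. */
final class TranslationClientEventsSketch {
    /** session.update carrying only audio.output.language (an updatable field per the text above). */
    static String sessionUpdateOutputLanguage(String language) {
        if (language == null || language.isEmpty()) {
            throw new IllegalArgumentException("language is required");
        }
        return "{\"type\":\"session.update\",\"session\":{\"audio\":{\"output\":{\"language\":\""
                + language + "\"}}}}";
    }

    /** session.close asks the server to flush pending audio and end the session. */
    static String sessionClose() {
        return "{\"type\":\"session.close\"}";
    }

    public static void main(String[] args) {
        System.out.println(sessionUpdateOutputLanguage("es"));
        System.out.println(sessionClose());
    }
}
```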

Realtime Translation Client Secret Create Request

class RealtimeTranslationClientSecretCreateRequest:

Create a translation session and client secret for the Realtime API.
- RealtimeTranslationSessionCreateRequest session
  
  Realtime translation session configuration. Translation sessions stream source audio in and translated audio plus transcript deltas out continuously.
  - String model
    
    The Realtime translation model used for this session.
  - Optional<Audio> audio
    
    Configuration for translation input and output audio.
    - Optional<Input> input
      - Optional<NoiseReduction> noiseReduction
        
        Optional input noise reduction. Set to null to disable it.
        - NoiseReductionType type
          
          Type of noise reduction. near_field is for close-talking microphones such as headphones, far_field is for far-field microphones such as laptop or conference room microphones.
          - NEAR_FIELD("near_field")
          - FAR_FIELD("far_field")
      - Optional<Transcription> transcription
        
        Optional source-language transcription. When configured, the server emits session.input_transcript.delta events. Translation itself still runs from the input audio stream.
        - String model
          
          The transcription model to use for source transcript deltas.
    - Optional<Output> output
      - Optional<String> language
        
        Target language for translated output audio and transcript deltas.
- Optional<ExpiresAfter> expiresAfter
  
  Configuration for the client secret expiration. Expiration refers to the time after which a client secret will no longer be valid for creating sessions. The session itself may continue after that time once started. A secret can be used to create multiple sessions until it expires.
  - Optional<Anchor> anchor
    
    The anchor point for the client secret expiration, meaning that seconds will be added to the created_at time of the client secret to produce an expiration timestamp. Only created_at is currently supported.
    - CREATED_AT("created_at")
  - Optional<Long> seconds
    
    The number of seconds from the anchor point to the expiration. Select a value between 10 and 7200 (2 hours). Defaults to 600 seconds (10 minutes) if not specified.
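
The expiration arithmetic described above (anchor plus seconds, with a 10 to 7200 range and a 600-second default) can be sketched as follows. ClientSecretExpirySketch is a hypothetical helper, and rejecting out-of-range values on the client is an assumption of this sketch rather than documented behavior.

```java
/** Hypothetical helper for illustration; not part of the SDK. */
final class ClientSecretExpirySketch {
    static final long DEFAULT_SECONDS = 600;  // 10 minutes
    static final long MIN_SECONDS = 10;
    static final long MAX_SECONDS = 7200;     // 2 hours

    /**
     * Computes the expiration timestamp for the created_at anchor.
     * Pass null for seconds to use the documented default.
     */
    static long expiresAt(long createdAtEpochSeconds, Long seconds) {
        long ttl = (seconds == null) ? DEFAULT_SECONDS : seconds;
        if (ttl < MIN_SECONDS || ttl > MAX_SECONDS) {
            throw new IllegalArgumentException("seconds must be between 10 and 7200");
        }
        return Math.addExact(createdAtEpochSeconds, ttl);
    }

    public static void main(String[] args) {
        long createdAt = 1_700_000_000L;
        System.out.println(expiresAt(createdAt, null));   // default lifetime
        System.out.println(expiresAt(createdAt, 7200L));  // maximum lifetime
    }
}
```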

Realtime Translation Client Secret Create Response

class RealtimeTranslationClientSecretCreateResponse:

Response from creating a translation session and client secret for the Realtime API.
- long expiresAt
  
  Expiration timestamp for the client secret, in seconds since epoch.
- RealtimeTranslationSession session
  
  A Realtime translation session. Translation sessions continuously translate input audio into the configured output language.
  - String id
    
    Unique identifier for the session that looks like sess_1234567890abcdef.
  - Audio audio
    
    Configuration for translation input and output audio.
    - Optional<Input> input
      - Optional<NoiseReduction> noiseReduction
        
        Optional input noise reduction.
        - NoiseReductionType type
          
          Type of noise reduction. near_field is for close-talking microphones such as headphones, far_field is for far-field microphones such as laptop or conference room microphones.
          - NEAR_FIELD("near_field")
          - FAR_FIELD("far_field")
      - Optional<Transcription> transcription
        
        Optional source-language transcription. When configured, the server emits session.input_transcript.delta events. Translation itself still runs from the input audio stream.
        - String model
          
          The transcription model used for source transcript deltas.
    - Optional<Output> output
      - Optional<String> language
        
        Target language for translated output audio and transcript deltas.
  - long expiresAt
    
    Expiration timestamp for the session, in seconds since epoch.
  - String model
    
    The Realtime translation model used for this session. This field is set at session creation and cannot be changed with session.update.
  - JsonValue; type "translation" (constant)
    
    The session type. Always translation for Realtime translation sessions.
    - TRANSLATION("translation")
- String value
  
  The generated client secret value.
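
Because the top-level expiresAt is expressed in seconds since epoch, a client can check whether a secret is still usable for creating sessions before attempting to connect. The sketch below is a hypothetical helper; it assumes the secret stops being valid exactly at expiresAt, since the boundary behavior is not stated above.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper for illustration; not part of the SDK. */
final class ClientSecretValiditySketch {
    /** Assumes the secret is no longer valid once now reaches expiresAt. */
    static boolean canCreateSession(long expiresAtEpochSeconds, long nowEpochSeconds) {
        return nowEpochSeconds < expiresAtEpochSeconds;
    }

    /** Remaining lifetime, clamped at zero for already-expired secrets. */
    static Duration remaining(long expiresAtEpochSeconds, long nowEpochSeconds) {
        return Duration.ofSeconds(Math.max(0L, expiresAtEpochSeconds - nowEpochSeconds));
    }

    public static void main(String[] args) {
        long now = Instant.now().getEpochSecond();
        long expiresAt = now + 600; // e.g. a secret created just now with the default lifetime
        System.out.println(canCreateSession(expiresAt, now));
        System.out.println(remaining(expiresAt, now));
    }
}
```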

Realtime Translation Input Audio Buffer Append Event

class RealtimeTranslationInputAudioBufferAppendEvent:

Send this event to append audio bytes to the translation session input audio buffer.

WebSocket translation sessions accept base64-encoded 24 kHz PCM16 mono little-endian raw audio bytes. Unsupported WebSocket audio formats return a validation error because lower-quality audio materially degrades translation quality.

Translation consumes 200 ms engine frames. For best realtime behavior, append audio in 200 ms chunks. If a chunk is shorter, the server buffers it until it has enough audio for one frame. If a chunk is longer, the server splits it into 200 ms frames and enqueues them back-to-back.

Keep appending silence while the session is active. If a client stops sending audio and later resumes, model time treats the resumed audio as contiguous with the previous audio rather than as a real-world pause.
- String audio
  
  Base64-encoded 24 kHz PCM16 mono audio bytes.
- JsonValue; type "session.input_audio_buffer.append" (constant)
  
  The event type, must be session.input_audio_buffer.append.
  - SESSION_INPUT_AUDIO_BUFFER_APPEND("session.input_audio_buffer.append")
- Optional<String> eventId
  
  Optional client-generated ID used to identify this event.
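
The framing guidance above works out to 9,600 bytes per 200 ms chunk (24,000 samples per second × 2 bytes per PCM16 sample × 0.2 s, mono). The sketch below splits raw PCM16 audio into chunks of that size, base64-encodes each one, and wraps it in an append event. TranslationAudioChunker is a hypothetical helper, not part of the SDK; a final shorter chunk is left as-is because the server is described as buffering partial frames.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Base64;
import java.util.List;

/** Hypothetical helper for illustration; not part of the SDK. */
final class TranslationAudioChunker {
    static final int SAMPLE_RATE_HZ = 24_000;
    static final int BYTES_PER_SAMPLE = 2; // PCM16, mono
    static final int FRAME_MS = 200;
    static final int FRAME_BYTES = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * FRAME_MS / 1000; // 9600

    /** Splits raw PCM16 bytes into base64-encoded chunks of at most 200 ms each. */
    static List<String> toBase64Chunks(byte[] pcm16) {
        List<String> chunks = new ArrayList<>();
        for (int offset = 0; offset < pcm16.length; offset += FRAME_BYTES) {
            int end = Math.min(offset + FRAME_BYTES, pcm16.length);
            byte[] chunk = Arrays.copyOfRange(pcm16, offset, end);
            chunks.add(Base64.getEncoder().encodeToString(chunk));
        }
        return chunks;
    }

    /** One 200 ms chunk of digital silence (all-zero samples), for keeping the stream contiguous. */
    static String silenceChunk() {
        return Base64.getEncoder().encodeToString(new byte[FRAME_BYTES]);
    }

    /** Wraps a base64 chunk in a session.input_audio_buffer.append event. */
    static String appendEventJson(String base64Audio) {
        return "{\"type\":\"session.input_audio_buffer.append\",\"audio\":\"" + base64Audio + "\"}";
    }

    public static void main(String[] args) {
        byte[] halfSecond = new byte[SAMPLE_RATE_HZ * BYTES_PER_SAMPLE / 2]; // 500 ms of silence
        List<String> chunks = toBase64Chunks(halfSecond);
        System.out.println(chunks.size() + " chunks, frame size " + FRAME_BYTES + " bytes");
        System.out.println(appendEventJson(chunks.get(0)).length() + " characters in first event");
    }
}
```

Sending silenceChunk() at a 200 ms cadence while the user is quiet is one way to follow the advice to keep appending silence during an active session.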

Realtime Translation Input Transcript Delta Event

class RealtimeTranslationInputTranscriptDeltaEvent:

Returned when optional source-language transcript text is available. This event is emitted only when audio.input.transcription is configured.

Transcript deltas are append-only text fragments. Clients should not insert unconditional spaces between deltas.
- String delta
  
  Append-only source-language transcript text.
- String eventId
  
  The unique ID of the server event.
- JsonValue; type "session.input_transcript.delta"constant
  
  The event type, must be session.input_transcript.delta.
  - SESSION_INPUT_TRANSCRIPT_DELTA("session.input_transcript.delta")
- Optional<Long> elapsedMs
  
  Timing metadata for stream alignment, derived from the translation frame when available. It advances in 200 ms increments, but multiple transcript deltas may share the same elapsed_ms. Treat it as alignment metadata, not a unique transcript-delta identifier.

Realtime Translation Output Audio Delta Event

class RealtimeTranslationOutputAudioDeltaEvent:

Returned when translated output audio is available. Output audio deltas are 200 ms frames of PCM16 audio.
- String delta
  
  Base64-encoded translated audio data.
- String eventId
  
  The unique ID of the server event.
- JsonValue; type "session.output_audio.delta"constant
  
  The event type, must be session.output_audio.delta.
  - SESSION_OUTPUT_AUDIO_DELTA("session.output_audio.delta")
- Optional<Long> channels
  
  Number of audio channels.
- Optional<Long> elapsedMs
  
  Timing metadata for stream alignment, derived from the translation frame when available. Treat elapsed_ms as alignment metadata, not a unique event identifier.
- Optional<Format> format
  
  Audio encoding for delta.
  - PCM16("pcm16")
- Optional<Long> sampleRate
  
  Sample rate of the audio delta.
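
As a rough sketch of how a client might consume these deltas, the helper below decodes the base64 payload and computes the playback duration of a PCM16 buffer from the sampleRate and channels fields. The class and method names are hypothetical, and the caller is assumed to have already read sampleRate and channels from the event.

```java
import java.util.Base64;

// Hypothetical helper, not part of the SDK.
final class OutputAudioFrames {
    /** Decodes the base64 delta field into raw PCM16 bytes. */
    static byte[] decode(String delta) {
        return Base64.getDecoder().decode(delta);
    }

    /** Duration in milliseconds of a PCM16 buffer (2 bytes per sample per channel). */
    static long durationMs(int byteLength, long sampleRate, long channels) {
        long samplesPerChannel = byteLength / (2L * channels);
        return samplesPerChannel * 1000L / sampleRate;
    }

    public static void main(String[] args) {
        // 9,600 bytes of mono PCM16 at 24 kHz corresponds to one 200 ms frame.
        System.out.println(durationMs(9600, 24_000, 1) + " ms");
    }
}
```

Comparing the computed duration against the documented 200 ms frame length is one way to sanity-check that the client's format assumptions match the stream.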

Realtime Translation Output Transcript Delta Event

class RealtimeTranslationOutputTranscriptDeltaEvent:

Returned when translated transcript text is available.

Transcript deltas are append-only text fragments. Clients should not insert unconditional spaces between deltas.
- String delta
  
  Append-only transcript text for the translated output audio.
- String eventId
  
  The unique ID of the server event.
- JsonValue; type "session.output_transcript.delta"constant
  
  The event type, must be session.output_transcript.delta.
  - SESSION_OUTPUT_TRANSCRIPT_DELTA("session.output_transcript.delta")
- Optional<Long> elapsedMs
  
  Timing metadata for stream alignment, derived from the translation frame when available. It advances in 200 ms increments, but multiple transcript deltas may share the same elapsed_ms. Treat it as alignment metadata, not a unique transcript-delta identifier.
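
The append-only rule for transcript deltas can be illustrated with a small accumulator. This is a sketch under the assumption that the client has already extracted delta and elapsedMs from each event; TranscriptAccumulator is a hypothetical name. The same approach applies to both input and output transcript deltas.

```java
// Hypothetical helper, not part of the SDK.
final class TranscriptAccumulator {
    private final StringBuilder text = new StringBuilder();
    private long lastElapsedMs = -1; // -1 until a delta carries timing metadata

    /** Appends the delta verbatim; deltas carry their own whitespace, so no separator is added. */
    void onDelta(String delta, Long elapsedMs) {
        text.append(delta);
        if (elapsedMs != null) {
            // Alignment metadata only: several deltas may share the same value.
            lastElapsedMs = elapsedMs;
        }
    }

    String text() {
        return text.toString();
    }

    long lastElapsedMs() {
        return lastElapsedMs;
    }

    public static void main(String[] args) {
        TranscriptAccumulator acc = new TranscriptAccumulator();
        acc.onDelta("Hel", 200L);
        acc.onDelta("lo", 200L);
        acc.onDelta(" world", null);
        System.out.println(acc.text());
    }
}
```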

Realtime Translation Server Event

class RealtimeTranslationServerEvent: A class that can be one of several variants.union

A Realtime translation server event.
- class RealtimeErrorEvent:
  
  Returned when an error occurs, which could be a client problem or a server problem. Most errors are recoverable and the session will stay open. We recommend that implementors monitor and log error messages by default.
  - RealtimeError error
    
    Details of the error.
    - String message
      
      A human-readable error message.
    - String type
      
      The type of error (e.g., "invalid_request_error", "server_error").
    - Optional<String> code
      
      Error code, if any.
    - Optional<String> eventId
      
      The event_id of the client event that caused the error, if applicable.
    - Optional<String> param
      
      Parameter related to the error, if any.
  - String eventId
    
    The unique ID of the server event.
  - JsonValue; type "error"constant
    
    The event type, must be error.
    - ERROR("error")
- class RealtimeTranslationSessionCreatedEvent:
  
  Returned when a translation session is created. Emitted automatically when a new connection is established as the first server event. This event contains the default translation session configuration.
  - String eventId
    
    The unique ID of the server event.
  - RealtimeTranslationSession session
    
    The translation session configuration.
    - String id
      
      Unique identifier for the session that looks like sess_1234567890abcdef.
    - Audio audio
      
      Configuration for translation input and output audio.
      - Optional<Input> input
        - Optional<NoiseReduction> noiseReduction
          
          Optional input noise reduction.
          - NoiseReductionType type
            
            Type of noise reduction. near_field is for close-talking microphones such as headphones, far_field is for far-field microphones such as laptop or conference room microphones.
            - NEAR_FIELD("near_field")
            - FAR_FIELD("far_field")
        - Optional<Transcription> transcription
          
          Optional source-language transcription. When configured, the server emits session.input_transcript.delta events. Translation itself still runs from the input audio stream.
          - String model
            
            The transcription model used for source transcript deltas.
      - Optional<Output> output
        - Optional<String> language
          
          Target language for translated output audio and transcript deltas.
    - long expiresAt
      
      Expiration timestamp for the session, in seconds since epoch.
    - String model
      
      The Realtime translation model used for this session. This field is set at session creation and cannot be changed with session.update.
    - JsonValue; type "translation"constant
      
      The session type. Always translation for Realtime translation sessions.
      - TRANSLATION("translation")
  - JsonValue; type "session.created"constant
    
    The event type, must be session.created.
    - SESSION_CREATED("session.created")
- class RealtimeTranslationSessionUpdatedEvent:
  
  Returned when a translation session is updated with a session.update event, unless there is an error.
  - String eventId
    
    The unique ID of the server event.
  - RealtimeTranslationSession session
    
    The translation session configuration.
  - JsonValue; type "session.updated"constant
    
    The event type, must be session.updated.
    - SESSION_UPDATED("session.updated")
- class RealtimeTranslationSessionClosedEvent:
  
  Returned when a realtime translation session is closed.
  - String eventId
    
    The unique ID of the server event.
  - JsonValue; type "session.closed"constant
    
    The event type, must be session.closed.
    - SESSION_CLOSED("session.closed")
- class RealtimeTranslationInputTranscriptDeltaEvent:
  
  Returned when optional source-language transcript text is available. This event is emitted only when audio.input.transcription is configured.
  
  Transcript deltas are append-only text fragments. Clients should not insert unconditional spaces between deltas.
  - String delta
    
    Append-only source-language transcript text.
  - String eventId
    
    The unique ID of the server event.
  - JsonValue; type "session.input_transcript.delta"constant
    
    The event type, must be session.input_transcript.delta.
    - SESSION_INPUT_TRANSCRIPT_DELTA("session.input_transcript.delta")
  - Optional<Long> elapsedMs
    
    Timing metadata for stream alignment, derived from the translation frame when available. It advances in 200 ms increments, but multiple transcript deltas may share the same elapsed_ms. Treat it as alignment metadata, not a unique transcript-delta identifier.
- class RealtimeTranslationOutputTranscriptDeltaEvent:
  
  Returned when translated transcript text is available.
  
  Transcript deltas are append-only text fragments. Clients should not insert unconditional spaces between deltas.
  - String delta
    
    Append-only transcript text for the translated output audio.
  - String eventId
    
    The unique ID of the server event.
  - JsonValue; type "session.output_transcript.delta"constant
    
    The event type, must be session.output_transcript.delta.
    - SESSION_OUTPUT_TRANSCRIPT_DELTA("session.output_transcript.delta")
  - Optional<Long> elapsedMs
    
    Timing metadata for stream alignment, derived from the translation frame when available. It advances in 200 ms increments, but multiple transcript deltas may share the same elapsed_ms. Treat it as alignment metadata, not a unique transcript-delta identifier.
- class RealtimeTranslationOutputAudioDeltaEvent:
  
  Returned when translated output audio is available. Output audio deltas are 200 ms frames of PCM16 audio.
  - String delta
    
    Base64-encoded translated audio data.
  - String eventId
    
    The unique ID of the server event.
  - JsonValue; type "session.output_audio.delta"constant
    
    The event type, must be session.output_audio.delta.
    - SESSION_OUTPUT_AUDIO_DELTA("session.output_audio.delta")
  - Optional<Long> channels
    
    Number of audio channels.
  - Optional<Long> elapsedMs
    
    Timing metadata for stream alignment, derived from the translation frame when available. Treat elapsed_ms as alignment metadata, not a unique event identifier.
  - Optional<Format> format
    
    Audio encoding for delta.
    - PCM16("pcm16")
  - Optional<Long> sampleRate
    
    Sample rate of the audio delta.
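
Because the server event is a union discriminated by its type field, a client typically branches on that string. The sketch below assumes the type has already been read from the incoming JSON; TranslationEventRouter and its Kind enum are hypothetical names, while the type strings are the ones listed above.

```java
// Hypothetical helper, not part of the SDK.
final class TranslationEventRouter {
    enum Kind { ERROR, SESSION_LIFECYCLE, INPUT_TRANSCRIPT, OUTPUT_TRANSCRIPT, OUTPUT_AUDIO, UNKNOWN }

    /** Maps a server event type string to a coarse category. */
    static Kind classify(String type) {
        if (type == null) {
            return Kind.UNKNOWN;
        }
        switch (type) {
            case "error":
                return Kind.ERROR;
            case "session.created":
            case "session.updated":
            case "session.closed":
                return Kind.SESSION_LIFECYCLE;
            case "session.input_transcript.delta":
                return Kind.INPUT_TRANSCRIPT;
            case "session.output_transcript.delta":
                return Kind.OUTPUT_TRANSCRIPT;
            case "session.output_audio.delta":
                return Kind.OUTPUT_AUDIO;
            default:
                return Kind.UNKNOWN; // tolerate event types added later
        }
    }

    /** True once the server reports that the session is closed. */
    static boolean isTerminal(String type) {
        return "session.closed".equals(type);
    }

    public static void main(String[] args) {
        System.out.println(classify("session.output_audio.delta"));
    }
}
```

Returning UNKNOWN for unrecognized types, rather than throwing, is a design choice that keeps the client running if new event types appear.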

Realtime Translation Session

class RealtimeTranslationSession:

A Realtime translation session. Translation sessions continuously translate input audio into the configured output language.
- String id
  
  Unique identifier for the session that looks like sess_1234567890abcdef.
- Audio audio
  
  Configuration for translation input and output audio.
  - Optional<Input> input
    - Optional<NoiseReduction> noiseReduction
      
      Optional input noise reduction.
      - NoiseReductionType type
        
        Type of noise reduction. near_field is for close-talking microphones such as headphones, far_field is for far-field microphones such as laptop or conference room microphones.
        - NEAR_FIELD("near_field")
        - FAR_FIELD("far_field")
    - Optional<Transcription> transcription
      
      Optional source-language transcription. When configured, the server emits session.input_transcript.delta events. Translation itself still runs from the input audio stream.
      - String model
        
        The transcription model used for source transcript deltas.
  - Optional<Output> output
    - Optional<String> language
      
      Target language for translated output audio and transcript deltas.
- long expiresAt
  
  Expiration timestamp for the session, in seconds since epoch.
- String model
  
  The Realtime translation model used for this session. This field is set at session creation and cannot be changed with session.update.
- JsonValue; type "translation"constant
  
  The session type. Always translation for Realtime translation sessions.
  - TRANSLATION("translation")

Realtime Translation Session Close Event

class RealtimeTranslationSessionCloseEvent:

Gracefully close the realtime translation session. The server flushes pending input audio and emits any remaining translated output before closing the session.
- JsonValue; type "session.close"constant
  
  The event type, must be session.close.
  - SESSION_CLOSE("session.close")
- Optional<String> eventId
  
  Optional client-generated ID used to identify this event.

Realtime Translation Session Closed Event

class RealtimeTranslationSessionClosedEvent:

Returned when a realtime translation session is closed.
- String eventId
  
  The unique ID of the server event.
- JsonValue; type "session.closed"constant
  
  The event type, must be session.closed.
  - SESSION_CLOSED("session.closed")

Realtime Translation Session Create Request

class RealtimeTranslationSessionCreateRequest:

Realtime translation session configuration. Translation sessions stream source audio in and translated audio plus transcript deltas out continuously.
- String model
  
  The Realtime translation model used for this session.
- Optional<Audio> audio
  
  Configuration for translation input and output audio.
  - Optional<Input> input
    - Optional<NoiseReduction> noiseReduction
      
      Optional input noise reduction. Set to null to disable it.
      - NoiseReductionType type
        
        Type of noise reduction. near_field is for close-talking microphones such as headphones, far_field is for far-field microphones such as laptop or conference room microphones.
        - NEAR_FIELD("near_field")
        - FAR_FIELD("far_field")
    - Optional<Transcription> transcription
      
      Optional source-language transcription. When configured, the server emits session.input_transcript.delta events. Translation itself still runs from the input audio stream.
      - String model
        
        The transcription model to use for source transcript deltas.
  - Optional<Output> output
    - Optional<String> language
      
      Target language for translated output audio and transcript deltas.

Realtime Translation Session Created Event

class RealtimeTranslationSessionCreatedEvent:

Returned when a translation session is created. Emitted automatically when a new connection is established as the first server event. This event contains the default translation session configuration.
- String eventId
  
  The unique ID of the server event.
- RealtimeTranslationSession session
  
  The translation session configuration.
  - String id
    
    Unique identifier for the session that looks like sess_1234567890abcdef.
  - Audio audio
    
    Configuration for translation input and output audio.
    - Optional<Input> input
      - Optional<NoiseReduction> noiseReduction
        
        Optional input noise reduction.
        - NoiseReductionType type
          
          Type of noise reduction. near_field is for close-talking microphones such as headphones, far_field is for far-field microphones such as laptop or conference room microphones.
          - NEAR_FIELD("near_field")
          - FAR_FIELD("far_field")
      - Optional<Transcription> transcription
        
        Optional source-language transcription. When configured, the server emits session.input_transcript.delta events. Translation itself still runs from the input audio stream.
        - String model
          
          The transcription model used for source transcript deltas.
    - Optional<Output> output
      - Optional<String> language
        
        Target language for translated output audio and transcript deltas.
  - long expiresAt
    
    Expiration timestamp for the session, in seconds since epoch.
  - String model
    
    The Realtime translation model used for this session. This field is set at session creation and cannot be changed with session.update.
  - JsonValue; type "translation"constant
    
    The session type. Always translation for Realtime translation sessions.
    - TRANSLATION("translation")
- JsonValue; type "session.created"constant
  
  The event type, must be session.created.
  - SESSION_CREATED("session.created")

Realtime Translation Session Update Event

class RealtimeTranslationSessionUpdateEvent:

Send this event to update the translation session configuration. Translation sessions support updates to audio.output.language, audio.input.transcription, and audio.input.noise_reduction.
- RealtimeTranslationSessionUpdateRequest session
  
  Translation session fields to update. The session type and model are set at creation and cannot be changed with session.update.
  - Optional<Audio> audio
    
    Configuration for translation input and output audio.
    - Optional<Input> input
      - Optional<NoiseReduction> noiseReduction
        
        Optional input noise reduction. Set to null to disable it.
        - NoiseReductionType type
          
          Type of noise reduction. near_field is for close-talking microphones such as headphones, far_field is for far-field microphones such as laptop or conference room microphones.
          - NEAR_FIELD("near_field")
          - FAR_FIELD("far_field")
      - Optional<Transcription> transcription
        
        Optional source-language transcription. When configured, the server emits session.input_transcript.delta events. Translation itself still runs from the input audio stream.
        - String model
          
          The transcription model to use for source transcript deltas.
    - Optional<Output> output
      - Optional<String> language
        
        Target language for translated output audio and transcript deltas.
- JsonValue; type "session.update"constant
  
  The event type, must be session.update.
  - SESSION_UPDATE("session.update")
- Optional<String> eventId
  
  Optional client-generated ID used to identify this event.
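
To make the wire shape of a session.update concrete, the sketch below builds two payloads: one that changes only audio.output.language and one that disables input noise reduction by sending null. The class name is hypothetical, the language value "es" is an assumed example (the accepted language format is not specified here), and hand-built JSON stands in for whatever JSON library a real client would use.

```java
// Hypothetical helper, not part of the SDK.
final class TranslationSessionUpdateJson {
    /** Escapes backslashes and double quotes so the value can sit inside a JSON string literal. */
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /** session.update that changes only audio.output.language. */
    static String setOutputLanguage(String language) {
        return "{\"type\":\"session.update\",\"session\":{\"audio\":{\"output\":{\"language\":\""
                + escape(language) + "\"}}}}";
    }

    /** session.update that disables input noise reduction by setting it to null. */
    static String disableNoiseReduction() {
        return "{\"type\":\"session.update\",\"session\":{\"audio\":{\"input\":{\"noise_reduction\":null}}}}";
    }

    public static void main(String[] args) {
        System.out.println(setOutputLanguage("es")); // "es" is an assumed example value
        System.out.println(disableNoiseReduction());
    }
}
```

Neither payload includes the session type or model, since those are fixed at creation.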

Realtime Translation Session Update Request

class RealtimeTranslationSessionUpdateRequest:

Realtime translation session fields that can be updated with session.update.
- Optional<Audio> audio
  
  Configuration for translation input and output audio.
  - Optional<Input> input
    - Optional<NoiseReduction> noiseReduction
      
      Optional input noise reduction. Set to null to disable it.
      - NoiseReductionType type
        
        Type of noise reduction. near_field is for close-talking microphones such as headphones, far_field is for far-field microphones such as laptop or conference room microphones.
        - NEAR_FIELD("near_field")
        - FAR_FIELD("far_field")
    - Optional<Transcription> transcription
      
      Optional source-language transcription. When configured, the server emits session.input_transcript.delta events. Translation itself still runs from the input audio stream.
      - String model
        
        The transcription model to use for source transcript deltas.
  - Optional<Output> output
    - Optional<String> language
      
      Target language for translated output audio and transcript deltas.

Realtime Translation Session Updated Event

class RealtimeTranslationSessionUpdatedEvent:

Returned when a translation session is updated with a session.update event, unless there is an error.
- String eventId
  
  The unique ID of the server event.
- RealtimeTranslationSession session
  
  The translation session configuration.
  - String id
    
    Unique identifier for the session that looks like sess_1234567890abcdef.
  - Audio audio
    
    Configuration for translation input and output audio.
    - Optional<Input> input
      - Optional<NoiseReduction> noiseReduction
        
        Optional input noise reduction.
        - NoiseReductionType type
          
          Type of noise reduction. near_field is for close-talking microphones such as headphones, far_field is for far-field microphones such as laptop or conference room microphones.
          - NEAR_FIELD("near_field")
          - FAR_FIELD("far_field")
      - Optional<Transcription> transcription
        
        Optional source-language transcription. When configured, the server emits session.input_transcript.delta events. Translation itself still runs from the input audio stream.
        - String model
          
          The transcription model used for source transcript deltas.
    - Optional<Output> output
      - Optional<String> language
        
        Target language for translated output audio and transcript deltas.
  - long expiresAt
    
    Expiration timestamp for the session, in seconds since epoch.
  - String model
    
    The Realtime translation model used for this session. This field is set at session creation and cannot be changed with session.update.
  - JsonValue; type "translation"constant
    
    The session type. Always translation for Realtime translation sessions.
    - TRANSLATION("translation")
- JsonValue; type "session.updated"constant
  
  The event type, must be session.updated.
  - SESSION_UPDATED("session.updated")

Realtime Truncation

class RealtimeTruncation: A class that can be one of several variants.union

When the number of tokens in a conversation exceeds the model's input token limit, the conversation will be truncated, meaning messages (starting from the oldest) will not be included in the model's context. A 32k context model with 4,096 max output tokens can only include 28,224 tokens in the context before truncation occurs.

Clients can configure truncation behavior to truncate with a lower max token limit, which is an effective way to control token usage and cost.

Truncation will reduce the number of cached tokens on the next turn (busting the cache), since messages are dropped from the beginning of the context. However, clients can also configure truncation to retain messages up to a fraction of the maximum context size, which will reduce the need for future truncations and thus improve the cache rate.

Truncation can be disabled entirely, which means the server will never truncate but would instead return an error if the conversation exceeds the model's input token limit.
- RealtimeTruncationStrategy
  - AUTO("auto")
  - DISABLED("disabled")
- class RealtimeTruncationRetentionRatio:
  
  Retain a fraction of the conversation tokens when the conversation exceeds the input token limit. This allows you to amortize truncations across multiple turns, which can help improve cached token usage.
  - double retentionRatio
    
    Fraction of post-instruction conversation tokens to retain (0.0 - 1.0) when the conversation exceeds the input token limit. Setting this to 0.8 means that messages will be dropped until 80% of the maximum allowed tokens are used. This helps reduce the frequency of truncations and improve cache rates.
  - JsonValue; type "retention_ratio"constant
    
    Use retention ratio truncation.
    - RETENTION_RATIO("retention_ratio")
  - Optional<TokenLimits> tokenLimits
    
    Optional custom token limits for this truncation strategy. If not provided, the model's default token limits will be used.
    - Optional<Long> postInstructions
      
      Maximum tokens allowed in the conversation after instructions (which include tool definitions). For example, setting this to 5,000 would mean that truncation would occur when the conversation exceeds 5,000 tokens after instructions. This cannot be higher than the model's context window size minus the maximum output tokens.

Realtime Truncation Retention Ratio

class RealtimeTruncationRetentionRatio:

Retain a fraction of the conversation tokens when the conversation exceeds the input token limit. This allows you to amortize truncations across multiple turns, which can help improve cached token usage.
- double retentionRatio
  
  Fraction of post-instruction conversation tokens to retain (0.0 - 1.0) when the conversation exceeds the input token limit. Setting this to 0.8 means that messages will be dropped until 80% of the maximum allowed tokens are used. This helps reduce the frequency of truncations and improve cache rates.
- JsonValue; type "retention_ratio"constant
  
  Use retention ratio truncation.
  - RETENTION_RATIO("retention_ratio")
- Optional<TokenLimits> tokenLimits
  
  Optional custom token limits for this truncation strategy. If not provided, the model's default token limits will be used.
  - Optional<Long> postInstructions
    
    Maximum tokens allowed in the conversation after instructions (which include tool definitions). For example, setting this to 5,000 would mean that truncation would occur when the conversation exceeds 5,000 tokens after instructions. This cannot be higher than the model's context window size minus the maximum output tokens.
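
The retention-ratio arithmetic can be illustrated with a small simulation. This is only a sketch of the behavior described above, not the server's actual algorithm: it assumes per-message token counts are known, treats limit as the post-instruction token limit, and uses a hypothetical class name.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

// Hypothetical illustration, not part of the SDK or the server implementation.
final class RetentionRatioSketch {
    /**
     * If the conversation exceeds the limit, drops the oldest messages until the
     * total is at most retentionRatio * limit. Returns the token total that remains.
     */
    static long truncate(Deque<Integer> messageTokens, long limit, double retentionRatio) {
        long total = 0;
        for (int tokens : messageTokens) {
            total += tokens;
        }
        if (total <= limit) {
            return total; // under the limit: nothing is dropped
        }
        long target = (long) Math.floor(limit * retentionRatio);
        while (total > target && !messageTokens.isEmpty()) {
            total -= messageTokens.removeFirst(); // oldest message first
        }
        return total;
    }

    public static void main(String[] args) {
        Deque<Integer> conversation = new ArrayDeque<>(Arrays.asList(2000, 1500, 1000, 1000));
        // 5,500 tokens exceeds the 5,000 limit, so messages are dropped down to at most 4,000.
        System.out.println(truncate(conversation, 5000, 0.8) + " tokens kept");
    }
}
```

Dropping below the limit rather than exactly to it leaves headroom, so the next few turns do not each trigger another truncation.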

Response Audio Delta Event

class ResponseAudioDeltaEvent:

Returned when the model-generated audio is updated.
- long contentIndex
  
  The index of the content part in the item's content array.
- String delta
  
  Base64-encoded audio data delta.
- String eventId
  
  The unique ID of the server event.
- String itemId
  
  The ID of the item.
- long outputIndex
  
  The index of the output item in the response.
- String responseId
  
  The ID of the response.
- JsonValue; type "response.output_audio.delta"constant
  
  The event type, must be response.output_audio.delta.
  - RESPONSE_OUTPUT_AUDIO_DELTA("response.output_audio.delta")

Response Audio Done Event

class ResponseAudioDoneEvent:

Returned when the model-generated audio is done. Also emitted when a Response is interrupted, incomplete, or cancelled.
- long contentIndex
  
  The index of the content part in the item's content array.
- String eventId
  
  The unique ID of the server event.
- String itemId
  
  The ID of the item.
- long outputIndex
  
  The index of the output item in the response.
- String responseId
  
  The ID of the response.
- JsonValue; type "response.output_audio.done"constant
  
  The event type, must be response.output_audio.done.
  - RESPONSE_OUTPUT_AUDIO_DONE("response.output_audio.done")
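
One way a client might pair response.output_audio.delta with response.output_audio.done is to buffer decoded audio per content part and release it when the done event arrives. The sketch below assumes itemId, contentIndex, and delta have already been read from each event; ResponseAudioBuffer is a hypothetical name.

```java
import java.io.ByteArrayOutputStream;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of the SDK.
final class ResponseAudioBuffer {
    private final Map<String, ByteArrayOutputStream> buffers = new HashMap<>();

    private static String key(String itemId, long contentIndex) {
        return itemId + "#" + contentIndex;
    }

    /** Decodes one base64 delta and appends it to the buffer for its content part. */
    void onDelta(String itemId, long contentIndex, String delta) {
        byte[] bytes = Base64.getDecoder().decode(delta);
        buffers.computeIfAbsent(key(itemId, contentIndex), k -> new ByteArrayOutputStream())
                .write(bytes, 0, bytes.length);
    }

    /** Returns and forgets the audio collected for a content part; empty if none was received. */
    byte[] onDone(String itemId, long contentIndex) {
        ByteArrayOutputStream out = buffers.remove(key(itemId, contentIndex));
        return out == null ? new byte[0] : out.toByteArray();
    }

    public static void main(String[] args) {
        ResponseAudioBuffer buffer = new ResponseAudioBuffer();
        buffer.onDelta("item_1", 0, "AAAA");
        buffer.onDelta("item_1", 0, "AQID");
        System.out.println(buffer.onDone("item_1", 0).length + " bytes buffered");
    }
}
```

A latency-sensitive client would more likely play each delta as it arrives. Buffering until done is shown here only because it makes the pairing of the two events explicit.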

Response Audio Transcript Delta Event

class ResponseAudioTranscriptDeltaEvent:

Returned when the model-generated transcription of audio output is updated.
- long contentIndex
  
  The index of the content part in the item's content array.
- String delta
  
  The transcript delta.
- String eventId
  
  The unique ID of the server event.
- String itemId
  
  The ID of the item.
- long outputIndex
  
  The index of the output item in the response.
- String responseId
  
  The ID of the response.
- JsonValue; type "response.output_audio_transcript.delta"constant
  
  The event type, must be response.output_audio_transcript.delta.
  - RESPONSE_OUTPUT_AUDIO_TRANSCRIPT_DELTA("response.output_audio_transcript.delta")

Response Audio Transcript Done Event

class ResponseAudioTranscriptDoneEvent:

Returned when the model-generated transcription of audio output is done streaming. Also emitted when a Response is interrupted, incomplete, or cancelled.
- long contentIndex
  
  The index of the content part in the item's content array.
- String eventId
  
  The unique ID of the server event.
- String itemId
  
  The ID of the item.
- long outputIndex
  
  The index of the output item in the response.
- String responseId
  
  The ID of the response.
- String transcript
  
  The final transcript of the audio.
- JsonValue; type "response.output_audio_transcript.done"constant
  
  The event type, must be response.output_audio_transcript.done.
  - RESPONSE_OUTPUT_AUDIO_TRANSCRIPT_DONE("response.output_audio_transcript.done")

Response Cancel Event

class ResponseCancelEvent:

Send this event to cancel an in-progress response. The server will respond with a response.done event with response.status=cancelled. If there is no response to cancel, the server will respond with an error. It's safe to call response.cancel even if no response is in progress: an error will be returned, but the session will remain unaffected.
- JsonValue; type "response.cancel"constant
  
  The event type, must be response.cancel.
  - RESPONSE_CANCEL("response.cancel")
- Optional<String> eventId
  
  Optional client-generated ID used to identify this event.
- Optional<String> responseId
  
  A specific response ID to cancel. If not provided, the in-progress response in the default conversation is cancelled.

Response Content Part Added Event

class ResponseContentPartAddedEvent:

Returned when a new content part is added to an assistant message item during response generation.
- long contentIndex
  
  The index of the content part in the item's content array.
- String eventId
  
  The unique ID of the server event.
- String itemId
  
  The ID of the item to which the content part was added.
- long outputIndex
  
  The index of the output item in the response.
- Part part
  
  The content part that was added.
  - Optional<String> audio
    
    Base64-encoded audio data (if type is "audio").
  - Optional<String> text
    
    The text content (if type is "text").
  - Optional<String> transcript
    
    The transcript of the audio (if type is "audio").
  - Optional<Type> type
    
    The content type ("text", "audio").
    - TEXT("text")
    - AUDIO("audio")
- String responseId
  
  The ID of the response.
- JsonValue; type "response.content_part.added"constant
  
  The event type, must be response.content_part.added.
  - RESPONSE_CONTENT_PART_ADDED("response.content_part.added")

Response Content Part Done Event

class ResponseContentPartDoneEvent:

Returned when a content part is done streaming in an assistant message item. Also emitted when a Response is interrupted, incomplete, or cancelled.
- long contentIndex
  
  The index of the content part in the item's content array.
- String eventId
  
  The unique ID of the server event.
- String itemId
  
  The ID of the item.
- long outputIndex
  
  The index of the output item in the response.
- Part part
  
  The content part that is done.
  - Optional<String> audio
    
    Base64-encoded audio data (if type is "audio").
  - Optional<String> text
    
    The text content (if type is "text").
  - Optional<String> transcript
    
    The transcript of the audio (if type is "audio").
  - Optional<Type> type
    
    The content type ("text", "audio").
    - TEXT("text")
    - AUDIO("audio")
- String responseId
  
  The ID of the response.
- JsonValue; type "response.content_part.done"constant
  
  The event type, must be response.content_part.done.
  - RESPONSE_CONTENT_PART_DONE("response.content_part.done")

Response Create Event

class ResponseCreateEvent:

This event instructs the server to create a Response, which means triggering model inference. When in Server VAD mode, the server will create Responses automatically.

A Response will include at least one Item, and may have two, in which case the second will be a function call. These Items will be appended to the conversation history by default.

The server will respond with a response.created event, events for Items and content created, and finally a response.done event to indicate the Response is complete.

The response.create event includes inference configuration like instructions and tools. If these are set, they will override the Session's configuration for this Response only.

Responses can be created out-of-band of the default Conversation, meaning that they can have arbitrary input, and it's possible to disable writing the output to the Conversation. Only one Response can write to the default Conversation at a time, but otherwise multiple Responses can be created in parallel. The metadata field is a good way to disambiguate multiple simultaneous Responses.

Clients can set conversation to none to create a Response that does not write to the default Conversation. Arbitrary input can be provided with the input field, which is an array accepting raw Items and references to existing Items.
- JsonValue; type "response.create"constant
  
  The event type, must be response.create.
  - RESPONSE_CREATE("response.create")
- Optional<String> eventId
  
  Optional client-generated ID used to identify this event.
- Optional<RealtimeResponseCreateParams> response
  
  Create a new Realtime response with these parameters.
  - Optional<RealtimeResponseCreateAudioOutput> audio
    
    Configuration for audio input and output.
    - Optional<Output> output
      - Optional<RealtimeAudioFormats> format
        
        The format of the output audio.
        - AudioPcm
          - Optional<Rate> rate
            
            The sample rate of the audio. Always 24000.
            - _24000(24000)
          - Optional<Type> type
            
            The audio format. Always audio/pcm.
            - AUDIO_PCM("audio/pcm")
        - AudioPcmu
          - Optional<Type> type
            
            The audio format. Always audio/pcmu.
            - AUDIO_PCMU("audio/pcmu")
        - AudioPcma
          - Optional<Type> type
            
            The audio format. Always audio/pcma.
            - AUDIO_PCMA("audio/pcma")
      - Optional<Voice> voice
        
        The voice the model uses to respond. Supported built-in voices are alloy, ash, ballad, coral, echo, sage, shimmer, verse, marin, and cedar. You may also provide a custom voice object with an id, for example { "id": "voice_1234" }. Voice cannot be changed during the session once the model has responded with audio at least once. We recommend marin and cedar for best quality.
        - String
        - enum UnionMember1:
          - ALLOY("alloy")
          - ASH("ash")
          - BALLAD("ballad")
          - CORAL("coral")
          - ECHO("echo")
          - SAGE("sage")
          - SHIMMER("shimmer")
          - VERSE("verse")
          - MARIN("marin")
          - CEDAR("cedar")
        - class Id:
          
          Custom voice reference.
          - String id
            
            The custom voice ID, e.g. voice_1234.
  - Optional<Conversation> conversation
    
    Controls which conversation the response is added to. Currently supports auto and none, with auto as the default value. The auto value means that the contents of the response will be added to the default conversation. Set this to none to create an out-of-band response that does not add items to the default conversation.
    - AUTO("auto")
    - NONE("none")
  - Optional<List<ConversationItem>> input
    
    Input items to include in the prompt for the model. Using this field creates a new context for this Response instead of using the default conversation. An empty array [] will clear the context for this Response. Note that this can include references to items that previously appeared in the session using their id.
    - class RealtimeConversationItemSystemMessage:
      
      A system message in a Realtime conversation can be used to provide additional context or instructions to the model. This is similar to, but distinct from, the instruction prompt provided at the start of a conversation, as system messages can be added at any point in the conversation. For major changes to the conversation's behavior, use instructions, but for smaller updates (e.g. "the user is now asking about a different topic"), use system messages.
      - List<Content> content
        
        The content of the message.
        
        Optional<String> text
        
        The text content.
        
        Optional<Type> type
        
        The content type. Always input_text for system messages.
        
        INPUT_TEXT("input_text")
      - JsonValue; role "system"constant
        
        The role of the message sender. Always system.
        
        SYSTEM("system")
      - JsonValue; type "message"constant
        
        The type of the item. Always message.
        
        MESSAGE("message")
      - Optional<String> id
        
        The unique ID of the item. This may be provided by the client or generated by the server.
      - Optional<Object> object_
        
        Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
        
        REALTIME_ITEM("realtime.item")
      - Optional<Status> status
        
        The status of the item. Has no effect on the conversation.
        
        COMPLETED("completed")
        
        INCOMPLETE("incomplete")
        
        IN_PROGRESS("in_progress")
    - class RealtimeConversationItemUserMessage:
      
      A user message item in a Realtime conversation.
      - List<Content> content
        
        The content of the message.
        
        Optional<String> audio
        
        Base64-encoded audio bytes (for input_audio). These are parsed in the format specified by the session input audio type configuration, which defaults to PCM 16-bit 24kHz mono if not specified.
        
        Optional<Detail> detail
        
        The detail level of the image (for input_image). auto will default to high.
        
        AUTO("auto")
        
        LOW("low")
        
        HIGH("high")
        
        Optional<String> imageUrl
        
        Base64-encoded image bytes (for input_image) as a data URI. For example data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA.... Supported formats are PNG and JPEG.
        
        Optional<String> text
        
        The text content (for input_text).
        
        Optional<String> transcript
        
        Transcript of the audio (for input_audio). This is not sent to the model, but will be attached to the message item for reference.
        
        Optional<Type> type
        
        The content type (input_text, input_audio, or input_image).
        
        INPUT_TEXT("input_text")
        
        INPUT_AUDIO("input_audio")
        
        INPUT_IMAGE("input_image")
      - JsonValue; role "user"constant
        
        The role of the message sender. Always user.
        
        USER("user")
      - JsonValue; type "message"constant
        
        The type of the item. Always message.
        
        MESSAGE("message")
      - Optional<String> id
        
        The unique ID of the item. This may be provided by the client or generated by the server.
      - Optional<Object> object_
        
        Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
        
        REALTIME_ITEM("realtime.item")
      - Optional<Status> status
        
        The status of the item. Has no effect on the conversation.
        
        COMPLETED("completed")
        
        INCOMPLETE("incomplete")
        
        IN_PROGRESS("in_progress")
    - class RealtimeConversationItemAssistantMessage:
      
      An assistant message item in a Realtime conversation.
      - List<Content> content
        
        The content of the message.
        
        Optional<String> audio
        
        Base64-encoded audio bytes. These are parsed in the format specified by the session output audio type configuration, which defaults to PCM 16-bit 24kHz mono if not specified.
        
        Optional<String> text
        
        The text content.
        
        Optional<String> transcript
        
        The transcript of the audio content. It is always present if the output type is audio.
        
        Optional<Type> type
        
        The content type, output_text or output_audio depending on the session output_modalities configuration.
        
        OUTPUT_TEXT("output_text")
        
        OUTPUT_AUDIO("output_audio")
      - JsonValue; role "assistant"constant
        
        The role of the message sender. Always assistant.
        
        ASSISTANT("assistant")
      - JsonValue; type "message"constant
        
        The type of the item. Always message.
        
        MESSAGE("message")
      - Optional<String> id
        
        The unique ID of the item. This may be provided by the client or generated by the server.
      - Optional<Object> object_
        
        Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
        
        REALTIME_ITEM("realtime.item")
      - Optional<Status> status
        
        The status of the item. Has no effect on the conversation.
        
        COMPLETED("completed")
        
        INCOMPLETE("incomplete")
        
        IN_PROGRESS("in_progress")
    - class RealtimeConversationItemFunctionCall:
      
      A function call item in a Realtime conversation.
      - String arguments
        
        The arguments of the function call. This is a JSON-encoded string representing the arguments passed to the function, for example {"arg1": "value1", "arg2": 42}.
      - String name
        
        The name of the function being called.
      - JsonValue; type "function_call"constant
        
        The type of the item. Always function_call.
        
        FUNCTION_CALL("function_call")
      - Optional<String> id
        
        The unique ID of the item. This may be provided by the client or generated by the server.
      - Optional<String> callId
        
        The ID of the function call.
      - Optional<Object> object_
        
        Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
        
        REALTIME_ITEM("realtime.item")
      - Optional<Status> status
        
        The status of the item. Has no effect on the conversation.
        
        COMPLETED("completed")
        
        INCOMPLETE("incomplete")
        
        IN_PROGRESS("in_progress")
    - class RealtimeConversationItemFunctionCallOutput:
      
      A function call output item in a Realtime conversation.
      - String callId
        
        The ID of the function call this output is for.
      - String output
        
        The output of the function call. This is free text and can contain any information or simply be empty.
      - JsonValue; type "function_call_output"constant
        
        The type of the item. Always function_call_output.
        
        FUNCTION_CALL_OUTPUT("function_call_output")
      - Optional<String> id
        
        The unique ID of the item. This may be provided by the client or generated by the server.
      - Optional<Object> object_
        
        Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
        
        REALTIME_ITEM("realtime.item")
      - Optional<Status> status
        
        The status of the item. Has no effect on the conversation.
        
        COMPLETED("completed")
        
        INCOMPLETE("incomplete")
        
        IN_PROGRESS("in_progress")
    - class RealtimeMcpApprovalResponse:
      
      A Realtime item responding to an MCP approval request.
      - String id
        
        The unique ID of the approval response.
      - String approvalRequestId
        
        The ID of the approval request being answered.
      - boolean approve
        
        Whether the request was approved.
      - JsonValue; type "mcp_approval_response"constant
        
        The type of the item. Always mcp_approval_response.
        
        MCP_APPROVAL_RESPONSE("mcp_approval_response")
      - Optional<String> reason
        
        Optional reason for the decision.
    - class RealtimeMcpListTools:
      
      A Realtime item listing tools available on an MCP server.
      - String serverLabel
        
        The label of the MCP server.
      - List<Tool> tools
        
        The tools available on the server.
        
        JsonValue inputSchema
        
        The JSON schema describing the tool's input.
        
        String name
        
        The name of the tool.
        
        Optional<JsonValue> annotations
        
        Additional annotations about the tool.
        
        Optional<String> description
        
        The description of the tool.
      - JsonValue; type "mcp_list_tools"constant
        
        The type of the item. Always mcp_list_tools.
        
        MCP_LIST_TOOLS("mcp_list_tools")
      - Optional<String> id
        
        The unique ID of the list.
    - class RealtimeMcpToolCall:
      
      A Realtime item representing an invocation of a tool on an MCP server.
      - String id
        
        The unique ID of the tool call.
      - String arguments
        
        A JSON string of the arguments passed to the tool.
      - String name
        
        The name of the tool that was run.
      - String serverLabel
        
        The label of the MCP server running the tool.
      - JsonValue; type "mcp_call"constant
        
        The type of the item. Always mcp_call.
        
        MCP_CALL("mcp_call")
      - Optional<String> approvalRequestId
        
        The ID of an associated approval request, if any.
      - Optional<Error> error
        
        The error from the tool call, if any.
        
        class RealtimeMcpProtocolError:
        
        long code
        
        String message
        
        JsonValue; type "protocol_error"constant
        
        PROTOCOL_ERROR("protocol_error")
        
        class RealtimeMcpToolExecutionError:
        
        String message
        
        JsonValue; type "tool_execution_error"constant
        
        TOOL_EXECUTION_ERROR("tool_execution_error")
        
        class RealtimeMcphttpError:
        
        long code
        
        String message
        
        JsonValue; type "http_error"constant
        
        HTTP_ERROR("http_error")
      - Optional<String> output
        
        The output from the tool call.
    - class RealtimeMcpApprovalRequest:
      
      A Realtime item requesting human approval of a tool invocation.
      - String id
        
        The unique ID of the approval request.
      - String arguments
        
        A JSON string of arguments for the tool.
      - String name
        
        The name of the tool to run.
      - String serverLabel
        
        The label of the MCP server making the request.
      - JsonValue; type "mcp_approval_request"constant
        
        The type of the item. Always mcp_approval_request.
        
        MCP_APPROVAL_REQUEST("mcp_approval_request")
  - Optional<String> instructions
    
    The default system instructions (i.e. system message) prepended to model calls. This field allows the client to guide the model on desired responses. The model can be instructed on response content and format (e.g. "be extremely succinct", "act friendly", "here are examples of good responses") and on audio behavior (e.g. "talk quickly", "inject emotion into your voice", "laugh frequently"). The instructions are not guaranteed to be followed by the model, but they provide guidance on the desired behavior. Note that the server sets default instructions, which are used if this field is not set and are visible in the session.created event at the start of the session.
  - Optional<MaxOutputTokens> maxOutputTokens
    
    Maximum number of output tokens for a single assistant response, inclusive of tool calls. Provide an integer between 1 and 4096 to limit output tokens, or inf for the maximum available tokens for a given model. Defaults to inf.
    - long
    - JsonValue;
      - INF("inf")
  - Optional<Metadata> metadata
    
    Set of up to 16 key-value pairs that can be attached to an object. This can be useful for storing additional information about the object in a structured format, and querying for objects via API or the dashboard.
    
    Keys are strings with a maximum length of 64 characters. Values are strings with a maximum length of 512 characters.
  - Optional<List<OutputModality>> outputModalities
    
    The set of modalities the model uses to respond. Currently the only possible values are ["audio"] and ["text"]. Audio output always includes a text transcript. Setting the output mode to text disables audio output from the model.
    - TEXT("text")
    - AUDIO("audio")
  - Optional<Boolean> parallelToolCalls
    
    Whether the model may call multiple tools in parallel. Only supported by reasoning Realtime models such as gpt-realtime-2.
  - Optional<ResponsePrompt> prompt
    
    Reference to a prompt template and its variables.
    - String id
      
      The unique identifier of the prompt template to use.
    - Optional<Variables> variables
      
      Optional map of values to substitute in for variables in your prompt. The substitution values can either be strings, or other Response input types like images or files.
      - String
      - class ResponseInputText:
        
        A text input to the model.
        
        String text
        
        The text input to the model.
        
        JsonValue; type "input_text"constant
        
        The type of the input item. Always input_text.
        
        INPUT_TEXT("input_text")
      - class ResponseInputImage:
        
        An image input to the model.
        
        Detail detail
        
        The detail level of the image to be sent to the model. One of high, low, auto, or original. Defaults to auto.
        
        LOW("low")
        
        HIGH("high")
        
        AUTO("auto")
        
        ORIGINAL("original")
        
        JsonValue; type "input_image"constant
        
        The type of the input item. Always input_image.
        
        INPUT_IMAGE("input_image")
        
        Optional<String> fileId
        
        The ID of the file to be sent to the model.
        
        Optional<String> imageUrl
        
        The URL of the image to be sent to the model. A fully qualified URL or base64 encoded image in a data URL.
      - class ResponseInputFile:
        
        A file input to the model.
        
        JsonValue; type "input_file"constant
        
        The type of the input item. Always input_file.
        
        INPUT_FILE("input_file")
        
        Optional<Detail> detail
        
        The detail level of the file to be sent to the model. Use low for the default rendering behavior, or high to render the file at higher quality. Defaults to low.
        
        LOW("low")
        
        HIGH("high")
        
        Optional<String> fileData
        
        The content of the file to be sent to the model.
        
        Optional<String> fileId
        
        The ID of the file to be sent to the model.
        
        Optional<String> fileUrl
        
        The URL of the file to be sent to the model.
        
        Optional<String> filename
        
        The name of the file to be sent to the model.
    - Optional<String> version
      
      Optional version of the prompt template.
  - Optional<RealtimeReasoning> reasoning
    
    Configuration for reasoning-capable Realtime models such as gpt-realtime-2.
    - Optional<RealtimeReasoningEffort> effort
      
      Constrains effort on reasoning for reasoning-capable Realtime models such as gpt-realtime-2.
      - MINIMAL("minimal")
      - LOW("low")
      - MEDIUM("medium")
      - HIGH("high")
      - XHIGH("xhigh")
  - Optional<ToolChoice> toolChoice
    
    How the model chooses tools. Provide one of the string modes or force a specific function/MCP tool.
    - enum ToolChoiceOptions:
      
      Controls which (if any) tool is called by the model.
      
      none means the model will not call any tool and instead generates a message.
      
      auto means the model can pick between generating a message or calling one or more tools.
      
      required means the model must call one or more tools.
      - NONE("none")
      - AUTO("auto")
      - REQUIRED("required")
    - class ToolChoiceFunction:
      
      Use this option to force the model to call a specific function.
      - String name
        
        The name of the function to call.
      - JsonValue; type "function"constant
        
        For function calling, the type is always function.
        
        FUNCTION("function")
    - class ToolChoiceMcp:
      
      Use this option to force the model to call a specific tool on a remote MCP server.
      - String serverLabel
        
        The label of the MCP server to use.
      - JsonValue; type "mcp"constant
        
        For MCP tools, the type is always mcp.
        
        MCP("mcp")
      - Optional<String> name
        
        The name of the tool to call on the server.
  - Optional<List<Tool>> tools
    
    Tools available to the model.
    - class RealtimeFunctionTool:
      - Optional<String> description
        
        The description of the function, including guidance on when and how to call it, and guidance about what to tell the user when calling (if anything).
      - Optional<String> name
        
        The name of the function.
      - Optional<JsonValue> parameters
        
        Parameters of the function in JSON Schema.
      - Optional<Type> type
        
        The type of the tool, i.e. function.
        
        FUNCTION("function")
    - class RealtimeResponseCreateMcpTool:
      
      Give the model access to additional tools via remote Model Context Protocol (MCP) servers.
      - String serverLabel
        
        A label for this MCP server, used to identify it in tool calls.
      - JsonValue; type "mcp"constant
        
        The type of the MCP tool. Always mcp.
        
        MCP("mcp")
      - Optional<AllowedTools> allowedTools
        
        List of allowed tool names or a filter object.
        
        List<String>
        
        class McpToolFilter:
        
        A filter object to specify which tools are allowed.
        
        Optional<Boolean> readOnly
        
        Indicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with readOnlyHint, it will match this filter.
        
        Optional<List<String>> toolNames
        
        List of allowed tool names.
      - Optional<String> authorization
        
        An OAuth access token that can be used with a remote MCP server, either with a custom MCP server URL or a service connector. Your application must handle the OAuth authorization flow and provide the token here.
      - Optional<ConnectorId> connectorId
        
        Identifier for service connectors, like those available in ChatGPT. One of server_url or connector_id must be provided.
        
        Currently supported connector_id values are:
        
        Dropbox: connector_dropbox
        
        Gmail: connector_gmail
        
        Google Calendar: connector_googlecalendar
        
        Google Drive: connector_googledrive
        
        Microsoft Teams: connector_microsoftteams
        
        Outlook Calendar: connector_outlookcalendar
        
        Outlook Email: connector_outlookemail
        
        SharePoint: connector_sharepoint
        
        CONNECTOR_DROPBOX("connector_dropbox")
        
        CONNECTOR_GMAIL("connector_gmail")
        
        CONNECTOR_GOOGLECALENDAR("connector_googlecalendar")
        
        CONNECTOR_GOOGLEDRIVE("connector_googledrive")
        
        CONNECTOR_MICROSOFTTEAMS("connector_microsoftteams")
        
        CONNECTOR_OUTLOOKCALENDAR("connector_outlookcalendar")
        
        CONNECTOR_OUTLOOKEMAIL("connector_outlookemail")
        
        CONNECTOR_SHAREPOINT("connector_sharepoint")
      - Optional<Boolean> deferLoading
        
        Whether this MCP tool is deferred and discovered via tool search.
      - Optional<Headers> headers
        
        Optional HTTP headers to send to the MCP server. Use for authentication or other purposes.
      - Optional<RequireApproval> requireApproval
        
        Specify which of the MCP server's tools require approval.
        
        class McpToolApprovalFilter:
        
        Specify which of the MCP server's tools require approval. Can be always, never, or a filter object associated with tools that require approval.
        
        Optional<Always> always
        
        A filter object to specify which tools are allowed.
        
        Optional<Boolean> readOnly
        
        Indicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with readOnlyHint, it will match this filter.
        
        Optional<List<String>> toolNames
        
        List of allowed tool names.
        
        Optional<Never> never
        
        A filter object to specify which tools are allowed.
        
        Optional<Boolean> readOnly
        
        Indicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with readOnlyHint, it will match this filter.
        
        Optional<List<String>> toolNames
        
        List of allowed tool names.
        
        enum McpToolApprovalSetting:
        
        Specify a single approval policy for all tools. One of always or never. When set to always, all tools require approval. When set to never, no tools require approval.
        
        ALWAYS("always")
        
        NEVER("never")
      - Optional<String> serverDescription
        
        Optional description of the MCP server, used to provide more context.
      - Optional<String> serverUrl
        
        The URL for the MCP server. One of server_url or connector_id must be provided.
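
Several of the response.create fields above carry explicit limits: metadata (at most 16 pairs, keys up to 64 characters, values up to 512 characters), maxOutputTokens (an integer between 1 and 4096, or inf), and the MCP tool requirement that one of server_url or connector_id be provided. A minimal client-side pre-flight check might look like the sketch below. The class and method names are hypothetical, String.length() is used as an approximation of "characters", and the MCP check assumes "one of" means at least one, which this reference does not spell out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical pre-flight checks; the server remains the source of truth.
class ResponseCreateChecks {

    // Limits quoted from the field descriptions above.
    static final int MAX_METADATA_PAIRS = 16;
    static final int MAX_KEY_LENGTH = 64;
    static final int MAX_VALUE_LENGTH = 512;
    static final long MAX_OUTPUT_TOKENS = 4096;

    // Returns a list of human-readable problems; an empty list means the metadata passed.
    public static List<String> checkMetadata(Map<String, String> metadata) {
        List<String> problems = new ArrayList<>();
        if (metadata.size() > MAX_METADATA_PAIRS) {
            problems.add("metadata has " + metadata.size() + " pairs; at most " + MAX_METADATA_PAIRS + " are allowed");
        }
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            if (e.getKey().length() > MAX_KEY_LENGTH) {
                problems.add("metadata key longer than " + MAX_KEY_LENGTH + " characters: " + e.getKey());
            }
            if (e.getValue().length() > MAX_VALUE_LENGTH) {
                problems.add("metadata value longer than " + MAX_VALUE_LENGTH + " characters for key: " + e.getKey());
            }
        }
        return problems;
    }

    // maxOutputTokens accepts an integer in [1, 4096] or the string "inf".
    public static boolean isValidMaxOutputTokens(Object value) {
        if ("inf".equals(value)) {
            return true;
        }
        if (value instanceof Integer || value instanceof Long) {
            long n = ((Number) value).longValue();
            return n >= 1 && n <= MAX_OUTPUT_TOKENS;
        }
        return false;
    }

    // Assumption: an MCP tool is acceptable when at least one of the two targets is set.
    public static boolean hasMcpTarget(String serverUrl, String connectorId) {
        return isPresent(serverUrl) || isPresent(connectorId);
    }

    private static boolean isPresent(String s) {
        return s != null && !s.isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(checkMetadata(Map.of("purpose", "summary")));
        System.out.println(isValidMaxOutputTokens(4096L));
        System.out.println(isValidMaxOutputTokens("inf"));
        System.out.println(hasMcpTarget(null, "connector_gmail"));
    }
}
```

Checks like these only catch obvious mistakes early; a request that passes them can still be rejected by the server for other reasons.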

Response Created Event

class ResponseCreatedEvent:

Returned when a new Response is created. The first event of response creation, where the response is in an initial state of in_progress.
- String eventId
  
  The unique ID of the server event.
- RealtimeResponse response
  
  The response resource.
  - Optional<String> id
    
    The unique ID of the response, for example resp_1234.
  - Optional<Audio> audio
    
    Configuration for audio output.
    - Optional<Output> output
      - Optional<RealtimeAudioFormats> format
        
        The format of the output audio.
        
        AudioPcm
        
        Optional<Rate> rate
        
        The sample rate of the audio. Always 24000.
        
        _24000(24000)
        
        Optional<Type> type
        
        The audio format. Always audio/pcm.
        
        AUDIO_PCM("audio/pcm")
        
        AudioPcmu
        
        Optional<Type> type
        
        The audio format. Always audio/pcmu.
        
        AUDIO_PCMU("audio/pcmu")
        
        AudioPcma
        
        Optional<Type> type
        
        The audio format. Always audio/pcma.
        
        AUDIO_PCMA("audio/pcma")
      - Optional<Voice> voice
        
        The voice the model uses to respond. Voice cannot be changed during the session once the model has responded with audio at least once. Current voice options are alloy, ash, ballad, coral, echo, sage, shimmer, verse, marin, and cedar. We recommend marin and cedar for best quality.
        
        ALLOY("alloy")
        
        ASH("ash")
        
        BALLAD("ballad")
        
        CORAL("coral")
        
        ECHO("echo")
        
        SAGE("sage")
        
        SHIMMER("shimmer")
        
        VERSE("verse")
        
        MARIN("marin")
        
        CEDAR("cedar")
  - Optional<String> conversationId
    
    Which conversation the response is added to, determined by the conversation field in the response.create event. If auto, the response will be added to the default conversation and the value of conversation_id will be an id like conv_1234. If none, the response will not be added to any conversation and the value of conversation_id will be null. If responses are being triggered automatically by VAD, the response will be added to the default conversation.
  - Optional<MaxOutputTokens> maxOutputTokens
    
    Maximum number of output tokens for a single assistant response, inclusive of tool calls, that was used in this response.
    - long
    - JsonValue;
      - INF("inf")
  - Optional<Metadata> metadata
    
    Set of up to 16 key-value pairs that can be attached to an object. This can be useful for storing additional information about the object in a structured format, and querying for objects via API or the dashboard.
    
    Keys are strings with a maximum length of 64 characters. Values are strings with a maximum length of 512 characters.
  - Optional<Object> object_
    
    The object type, must be realtime.response.
    - REALTIME_RESPONSE("realtime.response")
  - Optional<List<ConversationItem>> output
    
    The list of output items generated by the response.
    - class RealtimeConversationItemSystemMessage:
      
      A system message in a Realtime conversation can be used to provide additional context or instructions to the model. This is similar to, but distinct from, the instruction prompt provided at the start of a conversation, as system messages can be added at any point in the conversation. For major changes to the conversation's behavior, use instructions, but for smaller updates (e.g. "the user is now asking about a different topic"), use system messages.
      - List<Content> content
        
        The content of the message.
        
        Optional<String> text
        
        The text content.
        
        Optional<Type> type
        
        The content type. Always input_text for system messages.
        
        INPUT_TEXT("input_text")
      - JsonValue; role "system"constant
        
        The role of the message sender. Always system.
        
        SYSTEM("system")
      - JsonValue; type "message"constant
        
        The type of the item. Always message.
        
        MESSAGE("message")
      - Optional<String> id
        
        The unique ID of the item. This may be provided by the client or generated by the server.
      - Optional<Object> object_
        
        Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
        
        REALTIME_ITEM("realtime.item")
      - Optional<Status> status
        
        The status of the item. Has no effect on the conversation.
        
        COMPLETED("completed")
        
        INCOMPLETE("incomplete")
        
        IN_PROGRESS("in_progress")
    - class RealtimeConversationItemUserMessage:
      
      A user message item in a Realtime conversation.
      - List<Content> content
        
        The content of the message.
        
        Optional<String> audio
        
        Base64-encoded audio bytes (for input_audio). These are parsed in the format specified by the session input audio type configuration, which defaults to PCM 16-bit 24kHz mono if not specified.
        
        Optional<Detail> detail
        
        The detail level of the image (for input_image). auto will default to high.
        
        AUTO("auto")
        
        LOW("low")
        
        HIGH("high")
        
        Optional<String> imageUrl
        
        Base64-encoded image bytes (for input_image) as a data URI. For example data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA.... Supported formats are PNG and JPEG.
        
        Optional<String> text
        
        The text content (for input_text).
        
        Optional<String> transcript
        
        Transcript of the audio (for input_audio). This is not sent to the model, but will be attached to the message item for reference.
        
        Optional<Type> type
        
        The content type (input_text, input_audio, or input_image).
        
        INPUT_TEXT("input_text")
        
        INPUT_AUDIO("input_audio")
        
        INPUT_IMAGE("input_image")
      - JsonValue; role "user"constant
        
        The role of the message sender. Always user.
        
        USER("user")
      - JsonValue; type "message"constant
        
        The type of the item. Always message.
        
        MESSAGE("message")
      - Optional<String> id
        
        The unique ID of the item. This may be provided by the client or generated by the server.
      - Optional<Object> object_
        
        Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
        
        REALTIME_ITEM("realtime.item")
      - Optional<Status> status
        
        The status of the item. Has no effect on the conversation.
        
        COMPLETED("completed")
        
        INCOMPLETE("incomplete")
        
        IN_PROGRESS("in_progress")
    - class RealtimeConversationItemAssistantMessage:
      
      An assistant message item in a Realtime conversation.
      - List<Content> content
        
        The content of the message.
        
        Optional<String> audio
        
        Base64-encoded audio bytes. These are parsed in the format specified by the session output audio type configuration, which defaults to PCM 16-bit 24kHz mono if not specified.
        
        Optional<String> text
        
        The text content.
        
        Optional<String> transcript
        
        The transcript of the audio content. It is always present if the output type is audio.
        
        Optional<Type> type
        
        The content type, output_text or output_audio depending on the session output_modalities configuration.
        
        OUTPUT_TEXT("output_text")
        
        OUTPUT_AUDIO("output_audio")
      - JsonValue; role "assistant"constant
        
        The role of the message sender. Always assistant.
        
        ASSISTANT("assistant")
      - JsonValue; type "message"constant
        
        The type of the item. Always message.
        
        MESSAGE("message")
      - Optional<String> id
        
        The unique ID of the item. This may be provided by the client or generated by the server.
      - Optional<Object> object_
        
        Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
        
        REALTIME_ITEM("realtime.item")
      - Optional<Status> status
        
        The status of the item. Has no effect on the conversation.
        
        COMPLETED("completed")
        
        INCOMPLETE("incomplete")
        
        IN_PROGRESS("in_progress")
    - class RealtimeConversationItemFunctionCall:
      
      A function call item in a Realtime conversation.
      - String arguments
        
        The arguments of the function call. This is a JSON-encoded string representing the arguments passed to the function, for example {"arg1": "value1", "arg2": 42}.
      - String name
        
        The name of the function being called.
      - JsonValue; type "function_call"constant
        
        The type of the item. Always function_call.
        
        FUNCTION_CALL("function_call")
      - Optional<String> id
        
        The unique ID of the item. This may be provided by the client or generated by the server.
      - Optional<String> callId
        
        The ID of the function call.
      - Optional<Object> object_
        
        Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
        
        REALTIME_ITEM("realtime.item")
      - Optional<Status> status
        
        The status of the item. Has no effect on the conversation.
        
        COMPLETED("completed")
        
        INCOMPLETE("incomplete")
        
        IN_PROGRESS("in_progress")
    - class RealtimeConversationItemFunctionCallOutput:
      
      A function call output item in a Realtime conversation.
      - String callId
        
        The ID of the function call this output is for.
      - String output
        
        The output of the function call. This is free text; it can contain any information or be empty.
      - JsonValue; type "function_call_output"constant
        
        The type of the item. Always function_call_output.
        
        FUNCTION_CALL_OUTPUT("function_call_output")
      - Optional<String> id
        
        The unique ID of the item. This may be provided by the client or generated by the server.
      - Optional<Object> object_
        
        Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
        
        REALTIME_ITEM("realtime.item")
      - Optional<Status> status
        
        The status of the item. Has no effect on the conversation.
        
        COMPLETED("completed")
        
        INCOMPLETE("incomplete")
        
        IN_PROGRESS("in_progress")
    - class RealtimeMcpApprovalResponse:
      
      A Realtime item responding to an MCP approval request.
      - String id
        
        The unique ID of the approval response.
      - String approvalRequestId
        
        The ID of the approval request being answered.
      - boolean approve
        
        Whether the request was approved.
      - JsonValue; type "mcp_approval_response"constant
        
        The type of the item. Always mcp_approval_response.
        
        MCP_APPROVAL_RESPONSE("mcp_approval_response")
      - Optional<String> reason
        
        Optional reason for the decision.
    - class RealtimeMcpListTools:
      
      A Realtime item listing tools available on an MCP server.
      - String serverLabel
        
        The label of the MCP server.
      - List<Tool> tools
        
        The tools available on the server.
        
        JsonValue inputSchema
        
        The JSON schema describing the tool's input.
        
        String name
        
        The name of the tool.
        
        Optional<JsonValue> annotations
        
        Additional annotations about the tool.
        
        Optional<String> description
        
        The description of the tool.
      - JsonValue; type "mcp_list_tools"constant
        
        The type of the item. Always mcp_list_tools.
        
        MCP_LIST_TOOLS("mcp_list_tools")
      - Optional<String> id
        
        The unique ID of the list.
    - class RealtimeMcpToolCall:
      
      A Realtime item representing an invocation of a tool on an MCP server.
      - String id
        
        The unique ID of the tool call.
      - String arguments
        
        A JSON string of the arguments passed to the tool.
      - String name
        
        The name of the tool that was run.
      - String serverLabel
        
        The label of the MCP server running the tool.
      - JsonValue; type "mcp_call"constant
        
        The type of the item. Always mcp_call.
        
        MCP_CALL("mcp_call")
      - Optional<String> approvalRequestId
        
        The ID of an associated approval request, if any.
      - Optional<Error> error
        
        The error from the tool call, if any.
        
        class RealtimeMcpProtocolError:
        
        long code
        
        String message
        
        JsonValue; type "protocol_error"constant
        
        PROTOCOL_ERROR("protocol_error")
        
        class RealtimeMcpToolExecutionError:
        
        String message
        
        JsonValue; type "tool_execution_error"constant
        
        TOOL_EXECUTION_ERROR("tool_execution_error")
        
        class RealtimeMcphttpError:
        
        long code
        
        String message
        
        JsonValue; type "http_error"constant
        
        HTTP_ERROR("http_error")
      - Optional<String> output
        
        The output from the tool call.
    - class RealtimeMcpApprovalRequest:
      
      A Realtime item requesting human approval of a tool invocation.
      - String id
        
        The unique ID of the approval request.
      - String arguments
        
        A JSON string of arguments for the tool.
      - String name
        
        The name of the tool to run.
      - String serverLabel
        
        The label of the MCP server making the request.
      - JsonValue; type "mcp_approval_request"constant
        
        The type of the item. Always mcp_approval_request.
        
        MCP_APPROVAL_REQUEST("mcp_approval_request")
  - Optional<List<OutputModality>> outputModalities
    
    The set of modalities the model used to respond. Currently the only possible values are ["audio"] and ["text"]. Audio output always includes a text transcript. Setting the output mode to text disables audio output from the model.
    - TEXT("text")
    - AUDIO("audio")
  - Optional<Status> status
    
    The status of the response: completed, cancelled, failed, incomplete, or in_progress.
    - COMPLETED("completed")
    - CANCELLED("cancelled")
    - FAILED("failed")
    - INCOMPLETE("incomplete")
    - IN_PROGRESS("in_progress")
  - Optional<RealtimeResponseStatus> statusDetails
    
    Additional details about the status.
    - Optional<Error> error
      
      A description of the error that caused the response to fail, populated when the status is failed.
      - Optional<String> code
        
        Error code, if any.
      - Optional<String> type
        
        The type of error.
    - Optional<Reason> reason
      
      The reason the Response did not complete. For a cancelled Response, one of turn_detected (the server VAD detected a new start of speech) or client_cancelled (the client sent a cancel event). For an incomplete Response, one of max_output_tokens or content_filter (the server-side safety filter activated and cut off the response).
      - TURN_DETECTED("turn_detected")
      - CLIENT_CANCELLED("client_cancelled")
      - MAX_OUTPUT_TOKENS("max_output_tokens")
      - CONTENT_FILTER("content_filter")
    - Optional<Type> type
      
      The type of the status details, corresponding to the response's status field (completed, cancelled, incomplete, or failed).
      - COMPLETED("completed")
      - CANCELLED("cancelled")
      - INCOMPLETE("incomplete")
      - FAILED("failed")
  - Optional<RealtimeResponseUsage> usage
    
    Usage statistics for the Response; these correspond to billing. A Realtime API session maintains a conversation context and appends new Items to the Conversation, so output from previous turns (text and audio tokens) becomes input for later turns.
    - Optional<RealtimeResponseUsageInputTokenDetails> inputTokenDetails
      
      Details about the input tokens used in the Response. Cached tokens are tokens from previous turns in the conversation that are included as context for the current response. Cached tokens here are counted as a subset of input tokens, meaning input tokens will include cached and uncached tokens.
      - Optional<Long> audioTokens
        
        The number of audio tokens used as input for the Response.
      - Optional<Long> cachedTokens
        
        The number of cached tokens used as input for the Response.
      - Optional<CachedTokensDetails> cachedTokensDetails
        
        Details about the cached tokens used as input for the Response.
        
        Optional<Long> audioTokens
        
        The number of cached audio tokens used as input for the Response.
        
        Optional<Long> imageTokens
        
        The number of cached image tokens used as input for the Response.
        
        Optional<Long> textTokens
        
        The number of cached text tokens used as input for the Response.
      - Optional<Long> imageTokens
        
        The number of image tokens used as input for the Response.
      - Optional<Long> textTokens
        
        The number of text tokens used as input for the Response.
    - Optional<Long> inputTokens
      
      The number of input tokens used in the Response, including text and audio tokens.
    - Optional<RealtimeResponseUsageOutputTokenDetails> outputTokenDetails
      
      Details about the output tokens used in the Response.
      - Optional<Long> audioTokens
        
        The number of audio tokens used in the Response.
      - Optional<Long> textTokens
        
        The number of text tokens used in the Response.
    - Optional<Long> outputTokens
      
      The number of output tokens sent in the Response, including text and audio tokens.
    - Optional<Long> totalTokens
      
      The total number of tokens in the Response including input and output text and audio tokens.
- JsonValue; type "response.created"constant
  
  The event type, must be response.created.
  - RESPONSE_CREATED("response.created")
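
The usage description above states that cached tokens are counted as a subset of input tokens. A minimal sketch of that arithmetic follows. `UsageMath` is a hypothetical helper, not part of the SDK, and it takes raw `long` counts rather than reading them from `RealtimeResponseUsage`.

```java
// Hypothetical helper (not part of the SDK): works on raw token counts.
final class UsageMath {
    private UsageMath() {}

    // Input tokens that were not served from the cache.
    // Cached tokens are a subset of input tokens, so inconsistent
    // counts are rejected instead of producing a negative result.
    static long uncachedInputTokens(long inputTokens, long cachedTokens) {
        if (inputTokens < 0 || cachedTokens < 0 || cachedTokens > inputTokens) {
            throw new IllegalArgumentException(
                    "cachedTokens must be between 0 and inputTokens");
        }
        return inputTokens - cachedTokens;
    }

    // Fraction of input tokens that were cached, in the range [0.0, 1.0].
    static double cacheHitRatio(long inputTokens, long cachedTokens) {
        uncachedInputTokens(inputTokens, cachedTokens); // reuse the validation
        return inputTokens == 0 ? 0.0 : (double) cachedTokens / inputTokens;
    }
}
```

For example, with 1200 input tokens of which 900 were cached, `uncachedInputTokens(1200, 900)` returns 300.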

Response Done Event

class ResponseDoneEvent:

Returned when a Response is done streaming. Always emitted, no matter the final state. The Response object included in the response.done event will include all output Items in the Response but will omit the raw audio data.

Clients should check the status field of the Response to determine if it was successful (completed) or if there was another outcome: cancelled, failed, or incomplete.

- String eventId
  
  The unique ID of the server event.
- RealtimeResponse response
  
  The response resource.
  - Optional<String> id
    
    The unique ID of the response, for example resp_1234.
  - Optional<Audio> audio
    
    Configuration for audio output.
    - Optional<Output> output
      - Optional<RealtimeAudioFormats> format
        
        The format of the output audio.
        
        AudioPcm
        
        Optional<Rate> rate
        
        The sample rate of the audio. Always 24000.
        
        _24000(24000)
        
        Optional<Type> type
        
        The audio format. Always audio/pcm.
        
        AUDIO_PCM("audio/pcm")
        
        AudioPcmu
        
        Optional<Type> type
        
        The audio format. Always audio/pcmu.
        
        AUDIO_PCMU("audio/pcmu")
        
        AudioPcma
        
        Optional<Type> type
        
        The audio format. Always audio/pcma.
        
        AUDIO_PCMA("audio/pcma")
      - Optional<Voice> voice
        
        The voice the model uses to respond. Voice cannot be changed during the session once the model has responded with audio at least once. Current voice options are alloy, ash, ballad, coral, echo, sage, shimmer, verse, marin, and cedar. We recommend marin and cedar for best quality.
        
        ALLOY("alloy")
        
        ASH("ash")
        
        BALLAD("ballad")
        
        CORAL("coral")
        
        ECHO("echo")
        
        SAGE("sage")
        
        SHIMMER("shimmer")
        
        VERSE("verse")
        
        MARIN("marin")
        
        CEDAR("cedar")
  - Optional<String> conversationId
    
    Which conversation the response is added to, determined by the conversation field in the response.create event. If auto, the response will be added to the default conversation and the value of conversation_id will be an ID like conv_1234. If none, the response will not be added to any conversation and the value of conversation_id will be null. If responses are being triggered automatically by VAD, the response will be added to the default conversation.
  - Optional<MaxOutputTokens> maxOutputTokens
    
    Maximum number of output tokens for a single assistant response, inclusive of tool calls, that was used in this response.
    - long
    - JsonValue;
      - INF("inf")
  - Optional<Metadata> metadata
    
    Set of up to 16 key-value pairs that can be attached to an object. This can be useful for storing additional information about the object in a structured format, and querying for objects via API or the dashboard.
    
    Keys are strings with a maximum length of 64 characters. Values are strings with a maximum length of 512 characters.
  - Optional<Object> object_
    
    The object type, must be realtime.response.
    - REALTIME_RESPONSE("realtime.response")
  - Optional<List<ConversationItem>> output
    
    The list of output items generated by the response.
    - class RealtimeConversationItemSystemMessage:
      
      A system message in a Realtime conversation can be used to provide additional context or instructions to the model. This is similar but distinct from the instruction prompt provided at the start of a conversation, as system messages can be added at any point in the conversation. For major changes to the conversation's behavior, use instructions, but for smaller updates (e.g. "the user is now asking about a different topic"), use system messages.
      - List<Content> content
        
        The content of the message.
        
        Optional<String> text
        
        The text content.
        
        Optional<Type> type
        
        The content type. Always input_text for system messages.
        
        INPUT_TEXT("input_text")
      - JsonValue; role "system"constant
        
        The role of the message sender. Always system.
        
        SYSTEM("system")
      - JsonValue; type "message"constant
        
        The type of the item. Always message.
        
        MESSAGE("message")
      - Optional<String> id
        
        The unique ID of the item. This may be provided by the client or generated by the server.
      - Optional<Object> object_
        
        Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
        
        REALTIME_ITEM("realtime.item")
      - Optional<Status> status
        
        The status of the item. Has no effect on the conversation.
        
        COMPLETED("completed")
        
        INCOMPLETE("incomplete")
        
        IN_PROGRESS("in_progress")
    - class RealtimeConversationItemUserMessage:
      
      A user message item in a Realtime conversation.
      - List<Content> content
        
        The content of the message.
        
        Optional<String> audio
        
        Base64-encoded audio bytes (for input_audio). These are parsed in the format specified by the session's input audio type configuration, which defaults to PCM 16-bit 24 kHz mono.
        
        Optional<Detail> detail
        
        The detail level of the image (for input_image). auto will default to high.
        
        AUTO("auto")
        
        LOW("low")
        
        HIGH("high")
        
        Optional<String> imageUrl
        
        Base64-encoded image bytes (for input_image) as a data URI. For example data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA.... Supported formats are PNG and JPEG.
        
        Optional<String> text
        
        The text content (for input_text).
        
        Optional<String> transcript
        
        Transcript of the audio (for input_audio). This is not sent to the model, but will be attached to the message item for reference.
        
        Optional<Type> type
        
        The content type (input_text, input_audio, or input_image).
        
        INPUT_TEXT("input_text")
        
        INPUT_AUDIO("input_audio")
        
        INPUT_IMAGE("input_image")
      - JsonValue; role "user"constant
        
        The role of the message sender. Always user.
        
        USER("user")
      - JsonValue; type "message"constant
        
        The type of the item. Always message.
        
        MESSAGE("message")
      - Optional<String> id
        
        The unique ID of the item. This may be provided by the client or generated by the server.
      - Optional<Object> object_
        
        Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
        
        REALTIME_ITEM("realtime.item")
      - Optional<Status> status
        
        The status of the item. Has no effect on the conversation.
        
        COMPLETED("completed")
        
        INCOMPLETE("incomplete")
        
        IN_PROGRESS("in_progress")
    - class RealtimeConversationItemAssistantMessage:
      
      An assistant message item in a Realtime conversation.
      - List<Content> content
        
        The content of the message.
        
        Optional<String> audio
        
        Base64-encoded audio bytes. These are parsed in the format specified by the session's output audio type configuration, which defaults to PCM 16-bit 24 kHz mono.
        
        Optional<String> text
        
        The text content.
        
        Optional<String> transcript
        
        The transcript of the audio content. Always present if the output type is audio.
        
        Optional<Type> type
        
        The content type, output_text or output_audio depending on the session output_modalities configuration.
        
        OUTPUT_TEXT("output_text")
        
        OUTPUT_AUDIO("output_audio")
      - JsonValue; role "assistant"constant
        
        The role of the message sender. Always assistant.
        
        ASSISTANT("assistant")
      - JsonValue; type "message"constant
        
        The type of the item. Always message.
        
        MESSAGE("message")
      - Optional<String> id
        
        The unique ID of the item. This may be provided by the client or generated by the server.
      - Optional<Object> object_
        
        Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
        
        REALTIME_ITEM("realtime.item")
      - Optional<Status> status
        
        The status of the item. Has no effect on the conversation.
        
        COMPLETED("completed")
        
        INCOMPLETE("incomplete")
        
        IN_PROGRESS("in_progress")
    - class RealtimeConversationItemFunctionCall:
      
      A function call item in a Realtime conversation.
      - String arguments
        
        The arguments of the function call. This is a JSON-encoded string representing the arguments passed to the function, for example {"arg1": "value1", "arg2": 42}.
      - String name
        
        The name of the function being called.
      - JsonValue; type "function_call"constant
        
        The type of the item. Always function_call.
        
        FUNCTION_CALL("function_call")
      - Optional<String> id
        
        The unique ID of the item. This may be provided by the client or generated by the server.
      - Optional<String> callId
        
        The ID of the function call.
      - Optional<Object> object_
        
        Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
        
        REALTIME_ITEM("realtime.item")
      - Optional<Status> status
        
        The status of the item. Has no effect on the conversation.
        
        COMPLETED("completed")
        
        INCOMPLETE("incomplete")
        
        IN_PROGRESS("in_progress")
    - class RealtimeConversationItemFunctionCallOutput:
      
      A function call output item in a Realtime conversation.
      - String callId
        
        The ID of the function call this output is for.
      - String output
        
        The output of the function call. This is free text; it can contain any information or be empty.
      - JsonValue; type "function_call_output"constant
        
        The type of the item. Always function_call_output.
        
        FUNCTION_CALL_OUTPUT("function_call_output")
      - Optional<String> id
        
        The unique ID of the item. This may be provided by the client or generated by the server.
      - Optional<Object> object_
        
        Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
        
        REALTIME_ITEM("realtime.item")
      - Optional<Status> status
        
        The status of the item. Has no effect on the conversation.
        
        COMPLETED("completed")
        
        INCOMPLETE("incomplete")
        
        IN_PROGRESS("in_progress")
    - class RealtimeMcpApprovalResponse:
      
      A Realtime item responding to an MCP approval request.
      - String id
        
        The unique ID of the approval response.
      - String approvalRequestId
        
        The ID of the approval request being answered.
      - boolean approve
        
        Whether the request was approved.
      - JsonValue; type "mcp_approval_response"constant
        
        The type of the item. Always mcp_approval_response.
        
        MCP_APPROVAL_RESPONSE("mcp_approval_response")
      - Optional<String> reason
        
        Optional reason for the decision.
    - class RealtimeMcpListTools:
      
      A Realtime item listing tools available on an MCP server.
      - String serverLabel
        
        The label of the MCP server.
      - List<Tool> tools
        
        The tools available on the server.
        
        JsonValue inputSchema
        
        The JSON schema describing the tool's input.
        
        String name
        
        The name of the tool.
        
        Optional<JsonValue> annotations
        
        Additional annotations about the tool.
        
        Optional<String> description
        
        The description of the tool.
      - JsonValue; type "mcp_list_tools"constant
        
        The type of the item. Always mcp_list_tools.
        
        MCP_LIST_TOOLS("mcp_list_tools")
      - Optional<String> id
        
        The unique ID of the list.
    - class RealtimeMcpToolCall:
      
      A Realtime item representing an invocation of a tool on an MCP server.
      - String id
        
        The unique ID of the tool call.
      - String arguments
        
        A JSON string of the arguments passed to the tool.
      - String name
        
        The name of the tool that was run.
      - String serverLabel
        
        The label of the MCP server running the tool.
      - JsonValue; type "mcp_call"constant
        
        The type of the item. Always mcp_call.
        
        MCP_CALL("mcp_call")
      - Optional<String> approvalRequestId
        
        The ID of an associated approval request, if any.
      - Optional<Error> error
        
        The error from the tool call, if any.
        
        class RealtimeMcpProtocolError:
        
        long code
        
        String message
        
        JsonValue; type "protocol_error"constant
        
        PROTOCOL_ERROR("protocol_error")
        
        class RealtimeMcpToolExecutionError:
        
        String message
        
        JsonValue; type "tool_execution_error"constant
        
        TOOL_EXECUTION_ERROR("tool_execution_error")
        
        class RealtimeMcphttpError:
        
        long code
        
        String message
        
        JsonValue; type "http_error"constant
        
        HTTP_ERROR("http_error")
      - Optional<String> output
        
        The output from the tool call.
    - class RealtimeMcpApprovalRequest:
      
      A Realtime item requesting human approval of a tool invocation.
      - String id
        
        The unique ID of the approval request.
      - String arguments
        
        A JSON string of arguments for the tool.
      - String name
        
        The name of the tool to run.
      - String serverLabel
        
        The label of the MCP server making the request.
      - JsonValue; type "mcp_approval_request"constant
        
        The type of the item. Always mcp_approval_request.
        
        MCP_APPROVAL_REQUEST("mcp_approval_request")
  - Optional<List<OutputModality>> outputModalities
    
    The set of modalities the model used to respond. Currently the only possible values are ["audio"] and ["text"]. Audio output always includes a text transcript. Setting the output mode to text disables audio output from the model.
    - TEXT("text")
    - AUDIO("audio")
  - Optional<Status> status
    
    The status of the response: completed, cancelled, failed, incomplete, or in_progress.
    - COMPLETED("completed")
    - CANCELLED("cancelled")
    - FAILED("failed")
    - INCOMPLETE("incomplete")
    - IN_PROGRESS("in_progress")
  - Optional<RealtimeResponseStatus> statusDetails
    
    Additional details about the status.
    - Optional<Error> error
      
      A description of the error that caused the response to fail, populated when the status is failed.
      - Optional<String> code
        
        Error code, if any.
      - Optional<String> type
        
        The type of error.
    - Optional<Reason> reason
      
      The reason the Response did not complete. For a cancelled Response, one of turn_detected (the server VAD detected a new start of speech) or client_cancelled (the client sent a cancel event). For an incomplete Response, one of max_output_tokens or content_filter (the server-side safety filter activated and cut off the response).
      - TURN_DETECTED("turn_detected")
      - CLIENT_CANCELLED("client_cancelled")
      - MAX_OUTPUT_TOKENS("max_output_tokens")
      - CONTENT_FILTER("content_filter")
    - Optional<Type> type
      
      The type of the status details, corresponding to the response's status field (completed, cancelled, incomplete, or failed).
      - COMPLETED("completed")
      - CANCELLED("cancelled")
      - INCOMPLETE("incomplete")
      - FAILED("failed")
  - Optional<RealtimeResponseUsage> usage
    
    Usage statistics for the Response; these correspond to billing. A Realtime API session maintains a conversation context and appends new Items to the Conversation, so output from previous turns (text and audio tokens) becomes input for later turns.
    - Optional<RealtimeResponseUsageInputTokenDetails> inputTokenDetails
      
      Details about the input tokens used in the Response. Cached tokens are tokens from previous turns in the conversation that are included as context for the current response. Cached tokens here are counted as a subset of input tokens, meaning input tokens will include cached and uncached tokens.
      - Optional<Long> audioTokens
        
        The number of audio tokens used as input for the Response.
      - Optional<Long> cachedTokens
        
        The number of cached tokens used as input for the Response.
      - Optional<CachedTokensDetails> cachedTokensDetails
        
        Details about the cached tokens used as input for the Response.
        
        Optional<Long> audioTokens
        
        The number of cached audio tokens used as input for the Response.
        
        Optional<Long> imageTokens
        
        The number of cached image tokens used as input for the Response.
        
        Optional<Long> textTokens
        
        The number of cached text tokens used as input for the Response.
      - Optional<Long> imageTokens
        
        The number of image tokens used as input for the Response.
      - Optional<Long> textTokens
        
        The number of text tokens used as input for the Response.
    - Optional<Long> inputTokens
      
      The number of input tokens used in the Response, including text and audio tokens.
    - Optional<RealtimeResponseUsageOutputTokenDetails> outputTokenDetails
      
      Details about the output tokens used in the Response.
      - Optional<Long> audioTokens
        
        The number of audio tokens used in the Response.
      - Optional<Long> textTokens
        
        The number of text tokens used in the Response.
    - Optional<Long> outputTokens
      
      The number of output tokens sent in the Response, including text and audio tokens.
    - Optional<Long> totalTokens
      
      The total number of tokens in the Response including input and output text and audio tokens.
- JsonValue; type "response.done"constant
  
  The event type, must be response.done.
  - RESPONSE_DONE("response.done")
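
Since response.done is emitted for every final state, a client typically branches on the status field and, for cancelled or incomplete responses, on statusDetails.reason. The sketch below shows one way to do that. `ResponseOutcome` is a hypothetical helper that works on the documented string values rather than on the SDK's enum types, and the description strings it returns are illustrative only.

```java
// Hypothetical helper (not part of the SDK): maps the documented status and
// reason strings to a short human-readable description.
final class ResponseOutcome {
    private ResponseOutcome() {}

    static String describe(String status, String reason) {
        if (status == null) {
            return "unknown: status missing";
        }
        switch (status) {
            case "completed":
                return "ok";
            case "cancelled":
                if ("turn_detected".equals(reason)) {
                    return "cancelled: server VAD detected new speech";
                }
                if ("client_cancelled".equals(reason)) {
                    return "cancelled: client sent a cancel event";
                }
                return "cancelled";
            case "incomplete":
                if ("max_output_tokens".equals(reason)) {
                    return "incomplete: output token limit reached";
                }
                if ("content_filter".equals(reason)) {
                    return "incomplete: content filter cut off the response";
                }
                return "incomplete";
            case "failed":
                // Details live in statusDetails.error (code and type).
                return "failed: inspect statusDetails.error";
            default:
                return "unknown: " + status;
        }
    }
}
```

Keeping a default branch means an unrecognized status value is reported rather than silently treated as success.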

Response Function Call Arguments Delta Event

class ResponseFunctionCallArgumentsDeltaEvent:

Returned when the model-generated function call arguments are updated.
- String callId
  
  The ID of the function call.
- String delta
  
  The arguments delta as a JSON string.
- String eventId
  
  The unique ID of the server event.
- String itemId
  
  The ID of the function call item.
- long outputIndex
  
  The index of the output item in the response.
- String responseId
  
  The ID of the response.
- JsonValue; type "response.function_call_arguments.delta"constant
  
  The event type, must be response.function_call_arguments.delta.
  - RESPONSE_FUNCTION_CALL_ARGUMENTS_DELTA("response.function_call_arguments.delta")
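
Each delta event carries only a fragment of the arguments string, so a client that wants to show partial arguments has to concatenate fragments per function call. A minimal sketch, assuming fragments arrive in order: `FunctionCallArgumentBuffer` is a hypothetical class, and its `callId` and `delta` parameters mirror the event fields above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical accumulator (not part of the SDK), keyed by callId.
final class FunctionCallArgumentBuffer {
    private final Map<String, StringBuilder> buffers = new HashMap<>();

    // Append one delta fragment to the buffer of its function call.
    void onDelta(String callId, String delta) {
        buffers.computeIfAbsent(callId, id -> new StringBuilder()).append(delta);
    }

    // Arguments received so far. The text may be incomplete JSON until
    // streaming has finished, so it should not be parsed eagerly.
    String current(String callId) {
        StringBuilder buffer = buffers.get(callId);
        return buffer == null ? "" : buffer.toString();
    }
}
```

Keying by `callId` keeps concurrent function calls in the same response from interleaving their fragments.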

Response Function Call Arguments Done Event

class ResponseFunctionCallArgumentsDoneEvent:

Returned when the model-generated function call arguments are done streaming. Also emitted when a Response is interrupted, incomplete, or cancelled.
- String arguments
  
  The final arguments as a JSON string.
- String callId
  
  The ID of the function call.
- String eventId
  
  The unique ID of the server event.
- String itemId
  
  The ID of the function call item.
- String name
  
  The name of the function that was called.
- long outputIndex
  
  The index of the output item in the response.
- String responseId
  
  The ID of the response.
- JsonValue; type "response.function_call_arguments.done"constant
  
  The event type, must be response.function_call_arguments.done.
  - RESPONSE_FUNCTION_CALL_ARGUMENTS_DONE("response.function_call_arguments.done")
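
The delta and done events above can be combined on the client to show partial arguments while they stream. The following is a minimal sketch, not part of the SDK: the class FunctionCallArgumentBuffer and its methods are hypothetical, and it assumes that each delta is a fragment to be appended in arrival order. Because the done event is also emitted when a Response is interrupted, incomplete, or cancelled, the sketch treats the arguments value carried by the done event as authoritative rather than the buffered fragments.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not an SDK class): buffers streamed argument fragments per call ID.
final class FunctionCallArgumentBuffer {
    private final Map<String, StringBuilder> partial = new HashMap<>();

    // Call once for each response.function_call_arguments.delta event.
    void onDelta(String callId, String delta) {
        partial.computeIfAbsent(callId, id -> new StringBuilder()).append(delta);
    }

    // Returns the fragments received so far for this call, or "" if none.
    String peek(String callId) {
        StringBuilder sb = partial.get(callId);
        return sb == null ? "" : sb.toString();
    }

    // Call for response.function_call_arguments.done. Drops the buffered fragments
    // and returns the final arguments reported by the server.
    String onDone(String callId, String finalArguments) {
        partial.remove(callId);
        return finalArguments;
    }

    public static void main(String[] args) {
        FunctionCallArgumentBuffer buffer = new FunctionCallArgumentBuffer();
        buffer.onDelta("call_1", "{\"city\":");
        buffer.onDelta("call_1", "\"Paris\"}");
        System.out.println(buffer.peek("call_1"));
        System.out.println(buffer.onDone("call_1", "{\"city\":\"Paris\"}"));
    }
}
```

Keying the buffer by callId keeps fragments from concurrent function calls in the same Response separate.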

Response Mcp Call Arguments Delta

class ResponseMcpCallArgumentsDelta:

Returned when MCP tool call arguments are updated during response generation.
- String delta
  
  The JSON-encoded arguments delta.
- String eventId
  
  The unique ID of the server event.
- String itemId
  
  The ID of the MCP tool call item.
- long outputIndex
  
  The index of the output item in the response.
- String responseId
  
  The ID of the response.
- JsonValue; type "response.mcp_call_arguments.delta"constant
  
  The event type, must be response.mcp_call_arguments.delta.
  - RESPONSE_MCP_CALL_ARGUMENTS_DELTA("response.mcp_call_arguments.delta")
- Optional<String> obfuscation
  
  If present, indicates the delta text was obfuscated.

Response Mcp Call Arguments Done

class ResponseMcpCallArgumentsDone:

Returned when MCP tool call arguments are finalized during response generation.
- String arguments
  
  The final JSON-encoded arguments string.
- String eventId
  
  The unique ID of the server event.
- String itemId
  
  The ID of the MCP tool call item.
- long outputIndex
  
  The index of the output item in the response.
- String responseId
  
  The ID of the response.
- JsonValue; type "response.mcp_call_arguments.done"constant
  
  The event type, must be response.mcp_call_arguments.done.
  - RESPONSE_MCP_CALL_ARGUMENTS_DONE("response.mcp_call_arguments.done")

Response Mcp Call Completed

class ResponseMcpCallCompleted:

Returned when an MCP tool call has completed successfully.
- String eventId
  
  The unique ID of the server event.
- String itemId
  
  The ID of the MCP tool call item.
- long outputIndex
  
  The index of the output item in the response.
- JsonValue; type "response.mcp_call.completed"constant
  
  The event type, must be response.mcp_call.completed.
  - RESPONSE_MCP_CALL_COMPLETED("response.mcp_call.completed")

Response Mcp Call Failed

class ResponseMcpCallFailed:

Returned when an MCP tool call has failed.
- String eventId
  
  The unique ID of the server event.
- String itemId
  
  The ID of the MCP tool call item.
- long outputIndex
  
  The index of the output item in the response.
- JsonValue; type "response.mcp_call.failed"constant
  
  The event type, must be response.mcp_call.failed.
  - RESPONSE_MCP_CALL_FAILED("response.mcp_call.failed")

Response Mcp Call In Progress

class ResponseMcpCallInProgress:

Returned when an MCP tool call has started and is in progress.
- String eventId
  
  The unique ID of the server event.
- String itemId
  
  The ID of the MCP tool call item.
- long outputIndex
  
  The index of the output item in the response.
- JsonValue; type "response.mcp_call.in_progress"constant
  
  The event type, must be response.mcp_call.in_progress.
  - RESPONSE_MCP_CALL_IN_PROGRESS("response.mcp_call.in_progress")
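
The three MCP call lifecycle events above share the same fields (eventId, itemId, outputIndex) and differ only in their type constant. As a rough sketch, a client could track the latest state of each MCP tool call by item ID. The class McpCallTracker and its Status enum are hypothetical names introduced for illustration; only the three event type strings come from this reference.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not an SDK class): records the latest lifecycle state per MCP call item.
final class McpCallTracker {
    enum Status { IN_PROGRESS, COMPLETED, FAILED }

    private final Map<String, Status> byItemId = new HashMap<>();

    // Records the state implied by the event type.
    // Returns false when the type is not one of the three MCP call lifecycle events.
    boolean onEvent(String type, String itemId) {
        Status status;
        switch (type) {
            case "response.mcp_call.in_progress":
                status = Status.IN_PROGRESS;
                break;
            case "response.mcp_call.completed":
                status = Status.COMPLETED;
                break;
            case "response.mcp_call.failed":
                status = Status.FAILED;
                break;
            default:
                return false;
        }
        byItemId.put(itemId, status);
        return true;
    }

    // Returns the last recorded state, or null if no lifecycle event was seen for the item.
    Status statusOf(String itemId) {
        return byItemId.get(itemId);
    }

    public static void main(String[] args) {
        McpCallTracker tracker = new McpCallTracker();
        tracker.onEvent("response.mcp_call.in_progress", "item_1");
        tracker.onEvent("response.mcp_call.completed", "item_1");
        System.out.println(tracker.statusOf("item_1"));
    }
}
```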

Response Output Item Added Event

class ResponseOutputItemAddedEvent:

Returned when a new Item is created during Response generation.
- String eventId
  
  The unique ID of the server event.
- ConversationItem item
  
  A single item within a Realtime conversation.
  - class RealtimeConversationItemSystemMessage:
    
    A system message in a Realtime conversation can be used to provide additional context or instructions to the model. This is similar but distinct from the instruction prompt provided at the start of a conversation, as system messages can be added at any point in the conversation. For major changes to the conversation's behavior, use instructions, but for smaller updates (e.g. "the user is now asking about a different topic"), use system messages.
    - List<Content> content
      
      The content of the message.
      - Optional<String> text
        
        The text content.
      - Optional<Type> type
        
        The content type. Always input_text for system messages.
        
        INPUT_TEXT("input_text")
    - JsonValue; role "system"constant
      
      The role of the message sender. Always system.
      - SYSTEM("system")
    - JsonValue; type "message"constant
      
      The type of the item. Always message.
      - MESSAGE("message")
    - Optional<String> id
      
      The unique ID of the item. This may be provided by the client or generated by the server.
    - Optional<Object> object_
      
      Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
      - REALTIME_ITEM("realtime.item")
    - Optional<Status> status
      
      The status of the item. Has no effect on the conversation.
      - COMPLETED("completed")
      - INCOMPLETE("incomplete")
      - IN_PROGRESS("in_progress")
  - class RealtimeConversationItemUserMessage:
    
    A user message item in a Realtime conversation.
    - List<Content> content
      
      The content of the message.
      - Optional<String> audio
        
        Base64-encoded audio bytes (for input_audio). These are parsed in the format specified in the session input audio type configuration, which defaults to PCM 16-bit 24kHz mono if not specified.
      - Optional<Detail> detail
        
        The detail level of the image (for input_image). auto will default to high.
        
        AUTO("auto")
        
        LOW("low")
        
        HIGH("high")
      - Optional<String> imageUrl
        
        Base64-encoded image bytes (for input_image) as a data URI. For example data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA.... Supported formats are PNG and JPEG.
      - Optional<String> text
        
        The text content (for input_text).
      - Optional<String> transcript
        
        Transcript of the audio (for input_audio). This is not sent to the model, but will be attached to the message item for reference.
      - Optional<Type> type
        
        The content type (input_text, input_audio, or input_image).
        
        INPUT_TEXT("input_text")
        
        INPUT_AUDIO("input_audio")
        
        INPUT_IMAGE("input_image")
    - JsonValue; role "user"constant
      
      The role of the message sender. Always user.
      - USER("user")
    - JsonValue; type "message"constant
      
      The type of the item. Always message.
      - MESSAGE("message")
    - Optional<String> id
      
      The unique ID of the item. This may be provided by the client or generated by the server.
    - Optional<Object> object_
      
      Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
      - REALTIME_ITEM("realtime.item")
    - Optional<Status> status
      
      The status of the item. Has no effect on the conversation.
      - COMPLETED("completed")
      - INCOMPLETE("incomplete")
      - IN_PROGRESS("in_progress")
  - class RealtimeConversationItemAssistantMessage:
    
    An assistant message item in a Realtime conversation.
    - List<Content> content
      
      The content of the message.
      - Optional<String> audio
        
        Base64-encoded audio bytes. These are parsed in the format specified in the session output audio type configuration, which defaults to PCM 16-bit 24kHz mono if not specified.
      - Optional<String> text
        
        The text content.
      - Optional<String> transcript
        
        The transcript of the audio content. This is always present if the output type is audio.
      - Optional<Type> type
        
        The content type, output_text or output_audio depending on the session output_modalities configuration.
        
        OUTPUT_TEXT("output_text")
        
        OUTPUT_AUDIO("output_audio")
    - JsonValue; role "assistant"constant
      
      The role of the message sender. Always assistant.
      - ASSISTANT("assistant")
    - JsonValue; type "message"constant
      
      The type of the item. Always message.
      - MESSAGE("message")
    - Optional<String> id
      
      The unique ID of the item. This may be provided by the client or generated by the server.
    - Optional<Object> object_
      
      Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
      - REALTIME_ITEM("realtime.item")
    - Optional<Status> status
      
      The status of the item. Has no effect on the conversation.
      - COMPLETED("completed")
      - INCOMPLETE("incomplete")
      - IN_PROGRESS("in_progress")
  - class RealtimeConversationItemFunctionCall:
    
    A function call item in a Realtime conversation.
    - String arguments
      
      The arguments of the function call. This is a JSON-encoded string representing the arguments passed to the function, for example {"arg1": "value1", "arg2": 42}.
    - String name
      
      The name of the function being called.
    - JsonValue; type "function_call"constant
      
      The type of the item. Always function_call.
      - FUNCTION_CALL("function_call")
    - Optional<String> id
      
      The unique ID of the item. This may be provided by the client or generated by the server.
    - Optional<String> callId
      
      The ID of the function call.
    - Optional<Object> object_
      
      Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
      - REALTIME_ITEM("realtime.item")
    - Optional<Status> status
      
      The status of the item. Has no effect on the conversation.
      - COMPLETED("completed")
      - INCOMPLETE("incomplete")
      - IN_PROGRESS("in_progress")
  - class RealtimeConversationItemFunctionCallOutput:
    
    A function call output item in a Realtime conversation.
    - String callId
      
      The ID of the function call this output is for.
    - String output
      
      The output of the function call. This is free text and can contain any information or simply be empty.
    - JsonValue; type "function_call_output"constant
      
      The type of the item. Always function_call_output.
      - FUNCTION_CALL_OUTPUT("function_call_output")
    - Optional<String> id
      
      The unique ID of the item. This may be provided by the client or generated by the server.
    - Optional<Object> object_
      
      Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
      - REALTIME_ITEM("realtime.item")
    - Optional<Status> status
      
      The status of the item. Has no effect on the conversation.
      - COMPLETED("completed")
      - INCOMPLETE("incomplete")
      - IN_PROGRESS("in_progress")
  - class RealtimeMcpApprovalResponse:
    
    A Realtime item responding to an MCP approval request.
    - String id
      
      The unique ID of the approval response.
    - String approvalRequestId
      
      The ID of the approval request being answered.
    - boolean approve
      
      Whether the request was approved.
    - JsonValue; type "mcp_approval_response"constant
      
      The type of the item. Always mcp_approval_response.
      - MCP_APPROVAL_RESPONSE("mcp_approval_response")
    - Optional<String> reason
      
      Optional reason for the decision.
  - class RealtimeMcpListTools:
    
    A Realtime item listing tools available on an MCP server.
    - String serverLabel
      
      The label of the MCP server.
    - List<Tool> tools
      
      The tools available on the server.
      - JsonValue inputSchema
        
        The JSON schema describing the tool's input.
      - String name
        
        The name of the tool.
      - Optional<JsonValue> annotations
        
        Additional annotations about the tool.
      - Optional<String> description
        
        The description of the tool.
    - JsonValue; type "mcp_list_tools"constant
      
      The type of the item. Always mcp_list_tools.
      - MCP_LIST_TOOLS("mcp_list_tools")
    - Optional<String> id
      
      The unique ID of the list.
  - class RealtimeMcpToolCall:
    
    A Realtime item representing an invocation of a tool on an MCP server.
    - String id
      
      The unique ID of the tool call.
    - String arguments
      
      A JSON string of the arguments passed to the tool.
    - String name
      
      The name of the tool that was run.
    - String serverLabel
      
      The label of the MCP server running the tool.
    - JsonValue; type "mcp_call"constant
      
      The type of the item. Always mcp_call.
      - MCP_CALL("mcp_call")
    - Optional<String> approvalRequestId
      
      The ID of an associated approval request, if any.
    - Optional<Error> error
      
      The error from the tool call, if any.
      - class RealtimeMcpProtocolError:
        
        long code
        
        String message
        
        JsonValue; type "protocol_error"constant
        
        PROTOCOL_ERROR("protocol_error")
      - class RealtimeMcpToolExecutionError:
        
        String message
        
        JsonValue; type "tool_execution_error"constant
        
        TOOL_EXECUTION_ERROR("tool_execution_error")
      - class RealtimeMcphttpError:
        
        long code
        
        String message
        
        JsonValue; type "http_error"constant
        
        HTTP_ERROR("http_error")
    - Optional<String> output
      
      The output from the tool call.
  - class RealtimeMcpApprovalRequest:
    
    A Realtime item requesting human approval of a tool invocation.
    - String id
      
      The unique ID of the approval request.
    - String arguments
      
      A JSON string of arguments for the tool.
    - String name
      
      The name of the tool to run.
    - String serverLabel
      
      The label of the MCP server making the request.
    - JsonValue; type "mcp_approval_request"constant
      
      The type of the item. Always mcp_approval_request.
      - MCP_APPROVAL_REQUEST("mcp_approval_request")
- long outputIndex
  
  The index of the output item in the Response.
- String responseId
  
  The ID of the Response to which the item belongs.
- JsonValue; type "response.output_item.added"constant
  
  The event type, must be response.output_item.added.
  - RESPONSE_OUTPUT_ITEM_ADDED("response.output_item.added")
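
The audio content fields above carry Base64-encoded bytes that default to PCM 16-bit 24kHz mono. Under that default, the playback duration follows directly from the decoded byte count: two bytes per sample, 24,000 samples per second. The sketch below illustrates the arithmetic with a hypothetical helper class, PcmAudio. It does not apply to sessions configured for audio/pcmu or audio/pcma.

```java
import java.util.Base64;

// Hypothetical helper (not an SDK class): duration of Base64-encoded PCM 16-bit 24kHz mono audio.
final class PcmAudio {
    static final int SAMPLE_RATE_HZ = 24_000;
    static final int BYTES_PER_SAMPLE = 2; // 16-bit samples, one channel

    // Decodes the Base64 payload and converts its length to milliseconds.
    // A trailing odd byte, if any, is ignored by the integer division.
    static double durationMillis(String base64Audio) {
        byte[] pcm = Base64.getDecoder().decode(base64Audio);
        long samples = pcm.length / BYTES_PER_SAMPLE;
        return samples * 1000.0 / SAMPLE_RATE_HZ;
    }

    public static void main(String[] args) {
        // 48,000 bytes of silence = 24,000 samples = one second of audio.
        String oneSecond = Base64.getEncoder().encodeToString(new byte[48_000]);
        System.out.println(durationMillis(oneSecond) + " ms");
    }
}
```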

Response Output Item Done Event

class ResponseOutputItemDoneEvent:

Returned when an Item is done streaming. Also emitted when a Response is interrupted, incomplete, or cancelled.
- String eventId
  
  The unique ID of the server event.
- ConversationItem item
  
  A single item within a Realtime conversation.
  - class RealtimeConversationItemSystemMessage:
    
    A system message in a Realtime conversation can be used to provide additional context or instructions to the model. This is similar but distinct from the instruction prompt provided at the start of a conversation, as system messages can be added at any point in the conversation. For major changes to the conversation's behavior, use instructions, but for smaller updates (e.g. "the user is now asking about a different topic"), use system messages.
    - List<Content> content
      
      The content of the message.
      - Optional<String> text
        
        The text content.
      - Optional<Type> type
        
        The content type. Always input_text for system messages.
        
        INPUT_TEXT("input_text")
    - JsonValue; role "system"constant
      
      The role of the message sender. Always system.
      - SYSTEM("system")
    - JsonValue; type "message"constant
      
      The type of the item. Always message.
      - MESSAGE("message")
    - Optional<String> id
      
      The unique ID of the item. This may be provided by the client or generated by the server.
    - Optional<Object> object_
      
      Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
      - REALTIME_ITEM("realtime.item")
    - Optional<Status> status
      
      The status of the item. Has no effect on the conversation.
      - COMPLETED("completed")
      - INCOMPLETE("incomplete")
      - IN_PROGRESS("in_progress")
  - class RealtimeConversationItemUserMessage:
    
    A user message item in a Realtime conversation.
    - List<Content> content
      
      The content of the message.
      - Optional<String> audio
        
        Base64-encoded audio bytes (for input_audio). These are parsed in the format specified in the session input audio type configuration, which defaults to PCM 16-bit 24kHz mono if not specified.
      - Optional<Detail> detail
        
        The detail level of the image (for input_image). auto will default to high.
        
        AUTO("auto")
        
        LOW("low")
        
        HIGH("high")
      - Optional<String> imageUrl
        
        Base64-encoded image bytes (for input_image) as a data URI. For example data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA.... Supported formats are PNG and JPEG.
      - Optional<String> text
        
        The text content (for input_text).
      - Optional<String> transcript
        
        Transcript of the audio (for input_audio). This is not sent to the model, but will be attached to the message item for reference.
      - Optional<Type> type
        
        The content type (input_text, input_audio, or input_image).
        
        INPUT_TEXT("input_text")
        
        INPUT_AUDIO("input_audio")
        
        INPUT_IMAGE("input_image")
    - JsonValue; role "user"constant
      
      The role of the message sender. Always user.
      - USER("user")
    - JsonValue; type "message"constant
      
      The type of the item. Always message.
      - MESSAGE("message")
    - Optional<String> id
      
      The unique ID of the item. This may be provided by the client or generated by the server.
    - Optional<Object> object_
      
      Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
      - REALTIME_ITEM("realtime.item")
    - Optional<Status> status
      
      The status of the item. Has no effect on the conversation.
      - COMPLETED("completed")
      - INCOMPLETE("incomplete")
      - IN_PROGRESS("in_progress")
  - class RealtimeConversationItemAssistantMessage:
    
    An assistant message item in a Realtime conversation.
    - List<Content> content
      
      The content of the message.
      - Optional<String> audio
        
        Base64-encoded audio bytes. These are parsed in the format specified in the session output audio type configuration, which defaults to PCM 16-bit 24kHz mono if not specified.
      - Optional<String> text
        
        The text content.
      - Optional<String> transcript
        
        The transcript of the audio content. This is always present if the output type is audio.
      - Optional<Type> type
        
        The content type, output_text or output_audio depending on the session output_modalities configuration.
        
        OUTPUT_TEXT("output_text")
        
        OUTPUT_AUDIO("output_audio")
    - JsonValue; role "assistant"constant
      
      The role of the message sender. Always assistant.
      - ASSISTANT("assistant")
    - JsonValue; type "message"constant
      
      The type of the item. Always message.
      - MESSAGE("message")
    - Optional<String> id
      
      The unique ID of the item. This may be provided by the client or generated by the server.
    - Optional<Object> object_
      
      Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
      - REALTIME_ITEM("realtime.item")
    - Optional<Status> status
      
      The status of the item. Has no effect on the conversation.
      - COMPLETED("completed")
      - INCOMPLETE("incomplete")
      - IN_PROGRESS("in_progress")
  - class RealtimeConversationItemFunctionCall:
    
    A function call item in a Realtime conversation.
    - String arguments
      
      The arguments of the function call. This is a JSON-encoded string representing the arguments passed to the function, for example {"arg1": "value1", "arg2": 42}.
    - String name
      
      The name of the function being called.
    - JsonValue; type "function_call"constant
      
      The type of the item. Always function_call.
      - FUNCTION_CALL("function_call")
    - Optional<String> id
      
      The unique ID of the item. This may be provided by the client or generated by the server.
    - Optional<String> callId
      
      The ID of the function call.
    - Optional<Object> object_
      
      Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
      - REALTIME_ITEM("realtime.item")
    - Optional<Status> status
      
      The status of the item. Has no effect on the conversation.
      - COMPLETED("completed")
      - INCOMPLETE("incomplete")
      - IN_PROGRESS("in_progress")
  - class RealtimeConversationItemFunctionCallOutput:
    
    A function call output item in a Realtime conversation.
    - String callId
      
      The ID of the function call this output is for.
    - String output
      
      The output of the function call. This is free text and can contain any information or simply be empty.
    - JsonValue; type "function_call_output"constant
      
      The type of the item. Always function_call_output.
      - FUNCTION_CALL_OUTPUT("function_call_output")
    - Optional<String> id
      
      The unique ID of the item. This may be provided by the client or generated by the server.
    - Optional<Object> object_
      
      Identifier for the API object being returned - always realtime.item. Optional when creating a new item.
      - REALTIME_ITEM("realtime.item")
    - Optional<Status> status
      
      The status of the item. Has no effect on the conversation.
      - COMPLETED("completed")
      - INCOMPLETE("incomplete")
      - IN_PROGRESS("in_progress")
  - class RealtimeMcpApprovalResponse:
    
    A Realtime item responding to an MCP approval request.
    - String id
      
      The unique ID of the approval response.
    - String approvalRequestId
      
      The ID of the approval request being answered.
    - boolean approve
      
      Whether the request was approved.
    - JsonValue; type "mcp_approval_response"constant
      
      The type of the item. Always mcp_approval_response.
      - MCP_APPROVAL_RESPONSE("mcp_approval_response")
    - Optional<String> reason
      
      Optional reason for the decision.
  - class RealtimeMcpListTools:
    
    A Realtime item listing tools available on an MCP server.
    - String serverLabel
      
      The label of the MCP server.
    - List<Tool> tools
      
      The tools available on the server.
      - JsonValue inputSchema
        
        The JSON schema describing the tool's input.
      - String name
        
        The name of the tool.
      - Optional<JsonValue> annotations
        
        Additional annotations about the tool.
      - Optional<String> description
        
        The description of the tool.
    - JsonValue; type "mcp_list_tools"constant
      
      The type of the item. Always mcp_list_tools.
      - MCP_LIST_TOOLS("mcp_list_tools")
    - Optional<String> id
      
      The unique ID of the list.
  - class RealtimeMcpToolCall:
    
    A Realtime item representing an invocation of a tool on an MCP server.
    - String id
      
      The unique ID of the tool call.
    - String arguments
      
      A JSON string of the arguments passed to the tool.
    - String name
      
      The name of the tool that was run.
    - String serverLabel
      
      The label of the MCP server running the tool.
    - JsonValue; type "mcp_call"constant
      
      The type of the item. Always mcp_call.
      - MCP_CALL("mcp_call")
    - Optional<String> approvalRequestId
      
      The ID of an associated approval request, if any.
    - Optional<Error> error
      
      The error from the tool call, if any.
      - class RealtimeMcpProtocolError:
        
        long code
        
        String message
        
        JsonValue; type "protocol_error"constant
        
        PROTOCOL_ERROR("protocol_error")
      - class RealtimeMcpToolExecutionError:
        
        String message
        
        JsonValue; type "tool_execution_error"constant
        
        TOOL_EXECUTION_ERROR("tool_execution_error")
      - class RealtimeMcphttpError:
        
        long code
        
        String message
        
        JsonValue; type "http_error"constant
        
        HTTP_ERROR("http_error")
    - Optional<String> output
      
      The output from the tool call.
  - class RealtimeMcpApprovalRequest:
    
    A Realtime item requesting human approval of a tool invocation.
    - String id
      
      The unique ID of the approval request.
    - String arguments
      
      A JSON string of arguments for the tool.
    - String name
      
      The name of the tool to run.
    - String serverLabel
      
      The label of the MCP server making the request.
    - JsonValue; type "mcp_approval_request"constant
      
      The type of the item. Always mcp_approval_request.
      - MCP_APPROVAL_REQUEST("mcp_approval_request")
- long outputIndex
  
  The index of the output item in the Response.
- String responseId
  
  The ID of the Response to which the item belongs.
- JsonValue; type "response.output_item.done"constant
  
  The event type, must be response.output_item.done.
  - RESPONSE_OUTPUT_ITEM_DONE("response.output_item.done")
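
The imageUrl field of a user message content part expects image bytes as a Base64 data URI, with PNG and JPEG as the supported formats. A minimal sketch of building such a value with the standard library follows. The class ImageDataUri and its method are hypothetical, and the MIME type check simply mirrors the two formats named in this reference.

```java
import java.util.Base64;

// Hypothetical helper (not an SDK class): builds a data URI suitable for an input_image part.
final class ImageDataUri {
    // Encodes the raw image bytes as data:<mimeType>;base64,<payload>.
    static String of(String mimeType, byte[] imageBytes) {
        if (!mimeType.equals("image/png") && !mimeType.equals("image/jpeg")) {
            throw new IllegalArgumentException("Unsupported image type: " + mimeType);
        }
        return "data:" + mimeType + ";base64," + Base64.getEncoder().encodeToString(imageBytes);
    }

    public static void main(String[] args) {
        // The eight-byte PNG file signature, used here only as sample input.
        byte[] pngSignature = {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};
        System.out.println(of("image/png", pngSignature));
    }
}
```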

Response Text Delta Event

class ResponseTextDeltaEvent:

Returned when the text value of an "output_text" content part is updated.
- long contentIndex
  
  The index of the content part in the item's content array.
- String delta
  
  The text delta.
- String eventId
  
  The unique ID of the server event.
- String itemId
  
  The ID of the item.
- long outputIndex
  
  The index of the output item in the response.
- String responseId
  
  The ID of the response.
- JsonValue; type "response.output_text.delta"constant
  
  The event type, must be response.output_text.delta.
  - RESPONSE_OUTPUT_TEXT_DELTA("response.output_text.delta")

Response Text Done Event

class ResponseTextDoneEvent:

Returned when the text value of an "output_text" content part is done streaming. Also emitted when a Response is interrupted, incomplete, or cancelled.
- long contentIndex
  
  The index of the content part in the item's content array.
- String eventId
  
  The unique ID of the server event.
- String itemId
  
  The ID of the item.
- long outputIndex
  
  The index of the output item in the response.
- String responseId
  
  The ID of the response.
- String text
  
  The final text content.
- JsonValue; type "response.output_text.done"constant
  
  The event type, must be response.output_text.done.
  - RESPONSE_OUTPUT_TEXT_DONE("response.output_text.done")
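
Text deltas are addressed by both itemId and contentIndex, so a client that renders partial text needs to keep one buffer per content part. The sketch below shows one way to do that. OutputTextAssembler is a hypothetical class, and it assumes deltas are appended in arrival order. As with other done events, the text field of response.output_text.done is taken as the final value, since that event is also emitted when a Response is interrupted, incomplete, or cancelled.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not an SDK class): assembles streamed output text per content part.
final class OutputTextAssembler {
    private final Map<String, StringBuilder> parts = new HashMap<>();

    // One buffer per (itemId, contentIndex) pair.
    private static String key(String itemId, long contentIndex) {
        return itemId + "#" + contentIndex;
    }

    // Call once for each response.output_text.delta event.
    void onDelta(String itemId, long contentIndex, String delta) {
        parts.computeIfAbsent(key(itemId, contentIndex), k -> new StringBuilder()).append(delta);
    }

    // Returns the text received so far for the content part, or "" if none.
    String current(String itemId, long contentIndex) {
        StringBuilder sb = parts.get(key(itemId, contentIndex));
        return sb == null ? "" : sb.toString();
    }

    // Call for response.output_text.done. Releases the buffer and returns the final text.
    String onDone(String itemId, long contentIndex, String finalText) {
        parts.remove(key(itemId, contentIndex));
        return finalText;
    }

    public static void main(String[] args) {
        OutputTextAssembler assembler = new OutputTextAssembler();
        assembler.onDelta("item_1", 0, "Hello, ");
        assembler.onDelta("item_1", 0, "world");
        System.out.println(assembler.current("item_1", 0));
    }
}
```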

Session Created Event

class SessionCreatedEvent:

Returned when a Session is created. Emitted automatically when a new connection is established as the first server event. This event will contain the default Session configuration.
- String eventId
  
  The unique ID of the server event.
- Session session
  
  The session configuration.
  - class RealtimeSessionCreateRequest:
    
    Realtime session object configuration.
    - JsonValue; type "realtime"constant
      
      The type of session to create. Always realtime for the Realtime API.
      - REALTIME("realtime")
    - Optional<RealtimeAudioConfig> audio
      
      Configuration for input and output audio.
      - Optional<RealtimeAudioConfigInput> input
        
        Optional<RealtimeAudioFormats> format
        
        The format of the input audio.
        
        AudioPcm
        
        Optional<Rate> rate
        
        The sample rate of the audio. Always 24000.
        
        _24000(24000)
        
        Optional<Type> type
        
        The audio format. Always audio/pcm.
        
        AUDIO_PCM("audio/pcm")
        
        AudioPcmu
        
        Optional<Type> type
        
        The audio format. Always audio/pcmu.
        
        AUDIO_PCMU("audio/pcmu")
        
        AudioPcma
        
        Optional<Type> type
        
        The audio format. Always audio/pcma.
        
        AUDIO_PCMA("audio/pcma")
        
        Optional<NoiseReduction> noiseReduction
        
        Configuration for input audio noise reduction. This can be set to null to turn off. Noise reduction filters audio added to the input audio buffer before it is sent to VAD and the model. Filtering the audio can improve VAD and turn detection accuracy (reducing false positives) and model performance by improving perception of the input audio.
        
        Optional<NoiseReductionType> type
        
        Type of noise reduction. near_field is for close-talking microphones such as headphones, far_field is for far-field microphones such as laptop or conference room microphones.
        
        NEAR_FIELD("near_field")
        
        FAR_FIELD("far_field")
        
        Optional<AudioTranscription> transcription
        
        Configuration for input audio transcription. Defaults to off and, once enabled, can be set to null to turn it off again. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through the /audio/transcriptions endpoint and should be treated as guidance on the input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service.
        
        Optional<Delay> delay
        
        Controls how long the model waits before emitting transcription text. Higher values can improve transcription accuracy at the cost of latency. Only supported with gpt-realtime-whisper in GA Realtime sessions.
        
        MINIMAL("minimal")
        
        LOW("low")
        
        MEDIUM("medium")
        
        HIGH("high")
        
        XHIGH("xhigh")
        
        Optional<String> language
        
        The language of the input audio. Supplying the input language in ISO-639-1 (e.g. en) format will improve accuracy and latency.
        
        Optional<Model> model
        
        The model to use for transcription. Current options are whisper-1, gpt-4o-mini-transcribe, gpt-4o-mini-transcribe-2025-12-15, gpt-4o-transcribe, gpt-4o-transcribe-diarize, and gpt-realtime-whisper. Use gpt-4o-transcribe-diarize when you need diarization with speaker labels.
        
        WHISPER_1("whisper-1")
        
        GPT_4O_MINI_TRANSCRIBE("gpt-4o-mini-transcribe")
        
        GPT_4O_MINI_TRANSCRIBE_2025_12_15("gpt-4o-mini-transcribe-2025-12-15")
        
        GPT_4O_TRANSCRIBE("gpt-4o-transcribe")
        
        GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize")
        
        GPT_REALTIME_WHISPER("gpt-realtime-whisper")
        
        Optional<String> prompt
        
        An optional text to guide the model's style or continue a previous audio segment. For whisper-1, the prompt is a list of keywords. For gpt-4o-transcribe models (excluding gpt-4o-transcribe-diarize), the prompt is a free text string, for example "expect words related to technology". Prompt is not supported with gpt-realtime-whisper in GA Realtime sessions.
        
        Optional<RealtimeAudioInputTurnDetection> turnDetection
        
        Configuration for turn detection, either Server VAD or Semantic VAD. This can be set to null to turn off, in which case the client must manually trigger a model response.
        
        Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.
        
        Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.
        
        For gpt-realtime-whisper transcription sessions, turn detection must be set to null; VAD is not supported.
        
        ServerVad
        
        JsonValue; type "server_vad" constant
        
        Type of turn detection, server_vad to turn on simple Server VAD.
        
        SERVER_VAD("server_vad")
        
        Optional<Boolean> createResponse
        
        Whether or not to automatically generate a response when a VAD stop event occurs. If interrupt_response is set to false this may fail to create a response if the model is already responding.
        
        If both create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted.
        
        Optional<Long> idleTimeoutMs
        
        Optional timeout after which a model response will be triggered automatically. This is useful for situations in which a long pause from the user is unexpected, such as a phone call. The model will effectively prompt the user to continue the conversation based on the current context.
        
        The timeout value will be applied after the last model response's audio has finished playing, i.e. it's set to the response.done time plus audio playback duration.
        
        An input_audio_buffer.timeout_triggered event (plus events associated with the Response) will be emitted when the timeout is reached. Idle timeout is currently only supported for server_vad mode.
        
        Optional<Boolean> interruptResponse
        
        Whether or not to automatically interrupt (cancel) any ongoing response with output to the default conversation (i.e. conversation of auto) when a VAD start event occurs. If true then the response will be cancelled, otherwise it will continue until complete.
        
        If both create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted.
        
        Optional<Long> prefixPaddingMs
        
        Used only for server_vad mode. Amount of audio to include before the VAD detected speech (in milliseconds). Defaults to 300ms.
        
        Optional<Long> silenceDurationMs
        
        Used only for server_vad mode. Duration of silence to detect speech stop (in milliseconds). Defaults to 500ms. With shorter values the model will respond more quickly, but may jump in on short pauses from the user.
        
        Optional<Double> threshold
        
        Used only for server_vad mode. Activation threshold for VAD (0.0 to 1.0). Defaults to 0.5. A higher threshold will require louder audio to activate the model, and thus might perform better in noisy environments.
        
        SemanticVad
        
        JsonValue; type "semantic_vad" constant
        
        Type of turn detection, semantic_vad to turn on Semantic VAD.
        
        SEMANTIC_VAD("semantic_vad")
        
        Optional<Boolean> createResponse
        
        Whether or not to automatically generate a response when a VAD stop event occurs.
        
        Optional<Eagerness> eagerness
        
        Used only for semantic_vad mode. The eagerness of the model to respond. low will wait longer for the user to continue speaking, high will respond more quickly. auto is the default and is equivalent to medium. low, medium, and high have max timeouts of 8s, 4s, and 2s respectively.
        
        LOW("low")
        
        MEDIUM("medium")
        
        HIGH("high")
        
        AUTO("auto")
        
        Optional<Boolean> interruptResponse
        
        Whether or not to automatically interrupt any ongoing response with output to the default conversation (i.e. conversation of auto) when a VAD start event occurs.
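
        The eagerness levels and their documented maximum timeouts can be summarized as a small lookup. This is a minimal sketch: the class and method names are hypothetical and not part of the SDK, and only the 8s, 4s, and 2s values and the auto-equals-medium rule come from the description above.

        ```java
        import java.util.Locale;

        /** Illustrative helper (hypothetical, not part of the SDK). */
        final class EagernessTimeouts {
            // Returns the documented maximum wait, in seconds, for an eagerness level.
            static int maxTimeoutSeconds(String eagerness) {
                switch (eagerness.toLowerCase(Locale.ROOT)) {
                    case "low":
                        return 8;
                    case "medium":
                    case "auto": // auto is documented as equivalent to medium
                        return 4;
                    case "high":
                        return 2;
                    default:
                        throw new IllegalArgumentException("Unknown eagerness: " + eagerness);
                }
            }

            public static void main(String[] args) {
                for (String level : new String[] {"low", "medium", "high", "auto"}) {
                    System.out.println(level + " -> " + maxTimeoutSeconds(level) + "s");
                }
            }
        }
        ```

        These are upper bounds; the actual wait is set dynamically from the estimated probability that the user has finished speaking.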
      - Optional<RealtimeAudioConfigOutput> output
        
        Optional<RealtimeAudioFormats> format
        
        The format of the output audio.
        
        Optional<Double> speed
        
        The speed of the model's spoken response as a multiple of the original speed. 1.0 is the default speed. 0.25 is the minimum speed. 1.5 is the maximum speed. This value can only be changed in between model turns, not while a response is in progress.
        
        This parameter is a post-processing adjustment applied to the audio after it is generated; it's also possible to prompt the model to speak faster or slower.
        
        Optional<Voice> voice
        
        The voice the model uses to respond. Supported built-in voices are alloy, ash, ballad, coral, echo, sage, shimmer, verse, marin, and cedar. You may also provide a custom voice object with an id, for example { "id": "voice_1234" }. Voice cannot be changed during the session once the model has responded with audio at least once. We recommend marin and cedar for best quality.
        
        String
        
        enum UnionMember1:
        
        ALLOY("alloy")
        
        ASH("ash")
        
        BALLAD("ballad")
        
        CORAL("coral")
        
        ECHO("echo")
        
        SAGE("sage")
        
        SHIMMER("shimmer")
        
        VERSE("verse")
        
        MARIN("marin")
        
        CEDAR("cedar")
        
        class Id:
        
        Custom voice reference.
        
        String id
        
        The custom voice ID, e.g. voice_1234.
    - Optional<List<Include>> include
      
      Additional fields to include in server outputs.
      
      item.input_audio_transcription.logprobs: Include logprobs for input audio transcription.
      - ITEM_INPUT_AUDIO_TRANSCRIPTION_LOGPROBS("item.input_audio_transcription.logprobs")
    - Optional<String> instructions
      
      The default system instructions (i.e. system message) prepended to model calls. This field allows the client to guide the model on desired responses. The model can be instructed on response content and format (e.g. "be extremely succinct", "act friendly", "here are examples of good responses") and on audio behavior (e.g. "talk quickly", "inject emotion into your voice", "laugh frequently"). The instructions are not guaranteed to be followed by the model, but they provide guidance to the model on the desired behavior.
      
      Note that the server sets default instructions which will be used if this field is not set and are visible in the session.created event at the start of the session.
    - Optional<MaxOutputTokens> maxOutputTokens
      
      Maximum number of output tokens for a single assistant response, inclusive of tool calls. Provide an integer between 1 and 4096 to limit output tokens, or inf for the maximum available tokens for a given model. Defaults to inf.
      - long
      - JsonValue;
        
        INF("inf")
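
      The accepted values (an integer between 1 and 4096, or inf) can be checked on the client before a request is sent. A minimal sketch, assuming the value arrives as a string; the class and method names are hypothetical and not part of the SDK.

      ```java
      /** Illustrative client-side check (hypothetical, not part of the SDK). */
      final class MaxOutputTokensCheck {
          static final long MIN_TOKENS = 1;
          static final long MAX_TOKENS = 4096;

          // Accepts the literal "inf" or an integer in the documented range [1, 4096].
          static boolean isValid(String value) {
              if ("inf".equals(value)) {
                  return true;
              }
              try {
                  long n = Long.parseLong(value);
                  return n >= MIN_TOKENS && n <= MAX_TOKENS;
              } catch (NumberFormatException e) {
                  return false; // not an integer and not "inf"
              }
          }

          public static void main(String[] args) {
              for (String v : new String[] {"inf", "1", "4096", "0", "4097", "many"}) {
                  System.out.println(v + " valid? " + isValid(v));
              }
          }
      }
      ```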
    - Optional<Model> model
      
      The Realtime model used for this session.
      - GPT_REALTIME("gpt-realtime")
      - GPT_REALTIME_1_5("gpt-realtime-1.5")
      - GPT_REALTIME_2("gpt-realtime-2")
      - GPT_REALTIME_2025_08_28("gpt-realtime-2025-08-28")
      - GPT_4O_REALTIME_PREVIEW("gpt-4o-realtime-preview")
      - GPT_4O_REALTIME_PREVIEW_2024_10_01("gpt-4o-realtime-preview-2024-10-01")
      - GPT_4O_REALTIME_PREVIEW_2024_12_17("gpt-4o-realtime-preview-2024-12-17")
      - GPT_4O_REALTIME_PREVIEW_2025_06_03("gpt-4o-realtime-preview-2025-06-03")
      - GPT_4O_MINI_REALTIME_PREVIEW("gpt-4o-mini-realtime-preview")
      - GPT_4O_MINI_REALTIME_PREVIEW_2024_12_17("gpt-4o-mini-realtime-preview-2024-12-17")
      - GPT_REALTIME_MINI("gpt-realtime-mini")
      - GPT_REALTIME_MINI_2025_10_06("gpt-realtime-mini-2025-10-06")
      - GPT_REALTIME_MINI_2025_12_15("gpt-realtime-mini-2025-12-15")
      - GPT_AUDIO_1_5("gpt-audio-1.5")
      - GPT_AUDIO_MINI("gpt-audio-mini")
      - GPT_AUDIO_MINI_2025_10_06("gpt-audio-mini-2025-10-06")
      - GPT_AUDIO_MINI_2025_12_15("gpt-audio-mini-2025-12-15")
    - Optional<List<OutputModality>> outputModalities
      
      The set of modalities the model can respond with. It defaults to ["audio"], indicating that the model will respond with audio plus a transcript. ["text"] can be used to make the model respond with text only. It is not possible to request both text and audio at the same time.
      - TEXT("text")
      - AUDIO("audio")
    - Optional<Boolean> parallelToolCalls
      
      Whether the model may call multiple tools in parallel. Only supported by reasoning Realtime models such as gpt-realtime-2.
    - Optional<ResponsePrompt> prompt
      
      Reference to a prompt template and its variables.
      - String id
        
        The unique identifier of the prompt template to use.
      - Optional<Variables> variables
        
        Optional map of values to substitute in for variables in your prompt. The substitution values can either be strings, or other Response input types like images or files.
        
        String
        
        class ResponseInputText:
        
        A text input to the model.
        
        String text
        
        The text input to the model.
        
        JsonValue; type "input_text" constant
        
        The type of the input item. Always input_text.
        
        INPUT_TEXT("input_text")
        
        class ResponseInputImage:
        
        An image input to the model. Learn about image inputs.
        
        Detail detail
        
        The detail level of the image to be sent to the model. One of high, low, auto, or original. Defaults to auto.
        
        LOW("low")
        
        HIGH("high")
        
        AUTO("auto")
        
        ORIGINAL("original")
        
        JsonValue; type "input_image" constant
        
        The type of the input item. Always input_image.
        
        INPUT_IMAGE("input_image")
        
        Optional<String> fileId
        
        The ID of the file to be sent to the model.
        
        Optional<String> imageUrl
        
        The URL of the image to be sent to the model. A fully qualified URL or base64 encoded image in a data URL.
        
        class ResponseInputFile:
        
        A file input to the model.
        
        JsonValue; type "input_file" constant
        
        The type of the input item. Always input_file.
        
        INPUT_FILE("input_file")
        
        Optional<Detail> detail
        
        The detail level of the file to be sent to the model. Use low for the default rendering behavior, or high to render the file at higher quality. Defaults to low.
        
        LOW("low")
        
        HIGH("high")
        
        Optional<String> fileData
        
        The content of the file to be sent to the model.
        
        Optional<String> fileId
        
        The ID of the file to be sent to the model.
        
        Optional<String> fileUrl
        
        The URL of the file to be sent to the model.
        
        Optional<String> filename
        
        The name of the file to be sent to the model.
      - Optional<String> version
        
        Optional version of the prompt template.
    - Optional<RealtimeReasoning> reasoning
      
      Configuration for reasoning-capable Realtime models such as gpt-realtime-2.
      - Optional<RealtimeReasoningEffort> effort
        
        Constrains effort on reasoning for reasoning-capable Realtime models such as gpt-realtime-2.
        
        MINIMAL("minimal")
        
        LOW("low")
        
        MEDIUM("medium")
        
        HIGH("high")
        
        XHIGH("xhigh")
    - Optional<RealtimeToolChoiceConfig> toolChoice
      
      How the model chooses tools. Provide one of the string modes or force a specific function/MCP tool.
      - enum ToolChoiceOptions:
        
        Controls which (if any) tool is called by the model.
        
        none means the model will not call any tool and instead generates a message.
        
        auto means the model can pick between generating a message or calling one or more tools.
        
        required means the model must call one or more tools.
        
        NONE("none")
        
        AUTO("auto")
        
        REQUIRED("required")
      - class ToolChoiceFunction:
        
        Use this option to force the model to call a specific function.
        
        String name
        
        The name of the function to call.
        
        JsonValue; type "function" constant
        
        For function calling, the type is always function.
        
        FUNCTION("function")
      - class ToolChoiceMcp:
        
        Use this option to force the model to call a specific tool on a remote MCP server.
        
        String serverLabel
        
        The label of the MCP server to use.
        
        JsonValue; type "mcp" constant
        
        For MCP tools, the type is always mcp.
        
        MCP("mcp")
        
        Optional<String> name
        
        The name of the tool to call on the server.
    - Optional<List<RealtimeToolsConfigUnion>> tools
      
      Tools available to the model.
      - class RealtimeFunctionTool:
        
        Optional<String> description
        
        The description of the function, including guidance on when and how to call it, and guidance about what to tell the user when calling (if anything).
        
        Optional<String> name
        
        The name of the function.
        
        Optional<JsonValue> parameters
        
        Parameters of the function in JSON Schema.
        
        Optional<Type> type
        
        The type of the tool, i.e. function.
        
        FUNCTION("function")
      - Mcp
        
        String serverLabel
        
        A label for this MCP server, used to identify it in tool calls.
        
        JsonValue; type "mcp" constant
        
        The type of the MCP tool. Always mcp.
        
        MCP("mcp")
        
        Optional<AllowedTools> allowedTools
        
        List of allowed tool names or a filter object.
        
        List<String>
        
        class McpToolFilter:
        
        A filter object to specify which tools are allowed.
        
        Optional<Boolean> readOnly
        
        Indicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with readOnlyHint, it will match this filter.
        
        Optional<List<String>> toolNames
        
        List of allowed tool names.
        
        Optional<String> authorization
        
        An OAuth access token that can be used with a remote MCP server, either with a custom MCP server URL or a service connector. Your application must handle the OAuth authorization flow and provide the token here.
        
        Optional<ConnectorId> connectorId
        
        Identifier for service connectors, like those available in ChatGPT. One of server_url or connector_id must be provided.
        
        Currently supported connector_id values are:
        
        Dropbox: connector_dropbox
        
        Gmail: connector_gmail
        
        Google Calendar: connector_googlecalendar
        
        Google Drive: connector_googledrive
        
        Microsoft Teams: connector_microsoftteams
        
        Outlook Calendar: connector_outlookcalendar
        
        Outlook Email: connector_outlookemail
        
        SharePoint: connector_sharepoint
        
        CONNECTOR_DROPBOX("connector_dropbox")
        
        CONNECTOR_GMAIL("connector_gmail")
        
        CONNECTOR_GOOGLECALENDAR("connector_googlecalendar")
        
        CONNECTOR_GOOGLEDRIVE("connector_googledrive")
        
        CONNECTOR_MICROSOFTTEAMS("connector_microsoftteams")
        
        CONNECTOR_OUTLOOKCALENDAR("connector_outlookcalendar")
        
        CONNECTOR_OUTLOOKEMAIL("connector_outlookemail")
        
        CONNECTOR_SHAREPOINT("connector_sharepoint")
        
        Optional<Boolean> deferLoading
        
        Whether this MCP tool is deferred and discovered via tool search.
        
        Optional<Headers> headers
        
        Optional HTTP headers to send to the MCP server. Use for authentication or other purposes.
        
        Optional<RequireApproval> requireApproval
        
        Specify which of the MCP server's tools require approval.
        
        class McpToolApprovalFilter:
        
        Specify which of the MCP server's tools require approval. Can be always, never, or a filter object associated with tools that require approval.
        
        Optional<Always> always
        
        A filter object to specify which tools are allowed.
        
        Optional<Boolean> readOnly
        
        Indicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with readOnlyHint, it will match this filter.
        
        Optional<List<String>> toolNames
        
        List of allowed tool names.
        
        Optional<Never> never
        
        A filter object to specify which tools are allowed.
        
        Optional<Boolean> readOnly
        
        Indicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with readOnlyHint, it will match this filter.
        
        Optional<List<String>> toolNames
        
        List of allowed tool names.
        
        enum McpToolApprovalSetting:
        
        Specify a single approval policy for all tools. One of always or never. When set to always, all tools will require approval. When set to never, no tools will require approval.
        
        ALWAYS("always")
        
        NEVER("never")
        
        Optional<String> serverDescription
        
        Optional description of the MCP server, used to provide more context.
        
        Optional<String> serverUrl
        
        The URL for the MCP server. One of server_url or connector_id must be provided.
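
      The requirement that one of server_url or connector_id be provided for an MCP tool can be verified before the configuration is sent. A minimal sketch with hypothetical names (not part of the SDK) and a placeholder URL. Whether both may be supplied together is not stated above, so this check only requires that at least one is present.

      ```java
      /** Illustrative pre-flight check for an MCP tool entry (hypothetical helper). */
      final class McpToolCheck {
          // True when at least one of the two server references is present and non-blank.
          static boolean hasServerReference(String serverUrl, String connectorId) {
              return isSet(serverUrl) || isSet(connectorId);
          }

          private static boolean isSet(String s) {
              return s != null && !s.trim().isEmpty();
          }

          public static void main(String[] args) {
              // "https://example.com/mcp" is a placeholder URL used only for illustration.
              System.out.println(hasServerReference("https://example.com/mcp", null));
              System.out.println(hasServerReference(null, "connector_gmail"));
              System.out.println(hasServerReference(null, null));
          }
      }
      ```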
    - Optional<RealtimeTracingConfig> tracing
      
      Realtime API can write session traces to the Traces Dashboard. Set to null to disable tracing. Once tracing is enabled for a session, the configuration cannot be modified.
      
      auto will create a trace for the session with default values for the workflow name, group id, and metadata.
      - JsonValue;
        
        AUTO("auto")
      - TracingConfiguration
        
        Optional<String> groupId
        
        The group id to attach to this trace to enable filtering and grouping in the Traces Dashboard.
        
        Optional<JsonValue> metadata
        
        The arbitrary metadata to attach to this trace to enable filtering in the Traces Dashboard.
        
        Optional<String> workflowName
        
        The name of the workflow to attach to this trace. This is used to name the trace in the Traces Dashboard.
    - Optional<RealtimeTruncation> truncation
      
      When the number of tokens in a conversation exceeds the model's input token limit, the conversation will be truncated, meaning messages (starting from the oldest) will not be included in the model's context. A 32k context model with 4,096 max output tokens can only include 28,224 tokens in the context before truncation occurs.
      
      Clients can configure truncation behavior to truncate with a lower max token limit, which is an effective way to control token usage and cost.
      
      Truncation will reduce the number of cached tokens on the next turn (busting the cache), since messages are dropped from the beginning of the context. However, clients can also configure truncation to retain messages up to a fraction of the maximum context size, which will reduce the need for future truncations and thus improve the cache rate.
      
      Truncation can be disabled entirely, which means the server will never truncate but would instead return an error if the conversation exceeds the model's input token limit.
      - RealtimeTruncationStrategy
        
        AUTO("auto")
        
        DISABLED("disabled")
      - class RealtimeTruncationRetentionRatio:
        
        Retain a fraction of the conversation tokens when the conversation exceeds the input token limit. This allows you to amortize truncations across multiple turns, which can help improve cached token usage.
        
        double retentionRatio
        
        Fraction of post-instruction conversation tokens to retain (0.0 - 1.0) when the conversation exceeds the input token limit. Setting this to 0.8 means that messages will be dropped until 80% of the maximum allowed tokens are used. This helps reduce the frequency of truncations and improve cache rates.
        
        JsonValue; type "retention_ratio" constant
        
        Use retention ratio truncation.
        
        RETENTION_RATIO("retention_ratio")
        
        Optional<TokenLimits> tokenLimits
        
        Optional custom token limits for this truncation strategy. If not provided, the model's default token limits will be used.
        
        Optional<Long> postInstructions
        
        Maximum tokens allowed in the conversation after instructions (which include tool definitions). For example, setting this to 5,000 would mean that truncation would occur when the conversation exceeds 5,000 tokens after instructions. This cannot be higher than the model's context window size minus the maximum output tokens.
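
      The retention-ratio behavior described above can be illustrated with a small simulation. This is a sketch under stated assumptions, not the server's implementation: it assumes whole messages are dropped oldest-first, that per-message token counts are known, and that truncation only starts once the total exceeds the limit. All names are hypothetical.

      ```java
      import java.util.ArrayDeque;
      import java.util.ArrayList;
      import java.util.Arrays;
      import java.util.Deque;
      import java.util.List;

      /** Illustrative simulation of retention-ratio truncation (hypothetical helper). */
      final class RetentionRatioSketch {
          // Returns the token counts of the messages kept after truncation, oldest first.
          static List<Integer> truncate(List<Integer> messageTokens, long tokenLimit, double retentionRatio) {
              Deque<Integer> kept = new ArrayDeque<>(messageTokens);
              long total = 0;
              for (int tokens : messageTokens) {
                  total += tokens;
              }
              if (total <= tokenLimit) {
                  return new ArrayList<>(kept); // under the limit, nothing is dropped
              }
              // Drop oldest messages until usage is at or below ratio * limit.
              long target = (long) Math.floor(tokenLimit * retentionRatio);
              while (!kept.isEmpty() && total > target) {
                  total -= kept.removeFirst();
              }
              return new ArrayList<>(kept);
          }

          public static void main(String[] args) {
              // 5,300 tokens against a 5,000-token limit with a 0.8 ratio (target 4,000).
              List<Integer> kept = truncate(Arrays.asList(1000, 2000, 1500, 800), 5000, 0.8);
              System.out.println(kept);
          }
      }
      ```

      Because the conversation is cut back below the limit rather than just to it, several further turns can pass before the next truncation, which is what preserves cached tokens.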
  - class RealtimeTranscriptionSessionCreateRequest:
    
    Realtime transcription session object configuration.
    - JsonValue; type "transcription" constant
      
      The type of session to create. Always transcription for transcription sessions.
      - TRANSCRIPTION("transcription")
    - Optional<RealtimeTranscriptionSessionAudio> audio
      
      Configuration for input and output audio.
      - Optional<RealtimeTranscriptionSessionAudioInput> input
        
        Optional<RealtimeAudioFormats> format
        
        The PCM audio format. Only a 24kHz sample rate is supported.
        
        Optional<NoiseReduction> noiseReduction
        
        Configuration for input audio noise reduction. This can be set to null to turn off. Noise reduction filters audio added to the input audio buffer before it is sent to VAD and the model. Filtering the audio can improve VAD and turn detection accuracy (reducing false positives) and model performance by improving perception of the input audio.
        
        Optional<NoiseReductionType> type
        
        Type of noise reduction. near_field is for close-talking microphones such as headphones, far_field is for far-field microphones such as laptop or conference room microphones.
        
        Optional<AudioTranscription> transcription
        
        Configuration for input audio transcription, defaults to off and can be set to null to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through the /audio/transcriptions endpoint and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service.
        
        Optional<RealtimeTranscriptionSessionAudioInputTurnDetection> turnDetection
        
        Configuration for turn detection, either Server VAD or Semantic VAD. This can be set to null to turn off, in which case the client must manually trigger a model response.
        
        Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.
        
        Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.
        
        For gpt-realtime-whisper transcription sessions, turn detection must be set to null; VAD is not supported.
        
        ServerVad
        
        JsonValue; type "server_vad" constant
        
        Type of turn detection, server_vad to turn on simple Server VAD.
        
        SERVER_VAD("server_vad")
        
        Optional<Boolean> createResponse
        
        Whether or not to automatically generate a response when a VAD stop event occurs. If interrupt_response is set to false this may fail to create a response if the model is already responding.
        
        If both create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted.
        
        Optional<Long> idleTimeoutMs
        
        Optional timeout after which a model response will be triggered automatically. This is useful for situations in which a long pause from the user is unexpected, such as a phone call. The model will effectively prompt the user to continue the conversation based on the current context.
        
        The timeout value will be applied after the last model response's audio has finished playing, i.e. it's set to the response.done time plus audio playback duration.
        
        An input_audio_buffer.timeout_triggered event (plus events associated with the Response) will be emitted when the timeout is reached. Idle timeout is currently only supported for server_vad mode.
        
        Optional<Boolean> interruptResponse
        
        Whether or not to automatically interrupt (cancel) any ongoing response with output to the default conversation (i.e. conversation of auto) when a VAD start event occurs. If true then the response will be cancelled, otherwise it will continue until complete.
        
        If both create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted.
        
        Optional<Long> prefixPaddingMs
        
        Used only for server_vad mode. Amount of audio to include before the VAD detected speech (in milliseconds). Defaults to 300ms.
        
        Optional<Long> silenceDurationMs
        
        Used only for server_vad mode. Duration of silence to detect speech stop (in milliseconds). Defaults to 500ms. With shorter values the model will respond more quickly, but may jump in on short pauses from the user.
        
        Optional<Double> threshold
        
        Used only for server_vad mode. Activation threshold for VAD (0.0 to 1.0). Defaults to 0.5. A higher threshold will require louder audio to activate the model, and thus might perform better in noisy environments.
        
        SemanticVad
        
        JsonValue; type "semantic_vad" constant
        
        Type of turn detection, semantic_vad to turn on Semantic VAD.
        
        SEMANTIC_VAD("semantic_vad")
        
        Optional<Boolean> createResponse
        
        Whether or not to automatically generate a response when a VAD stop event occurs.
        
        Optional<Eagerness> eagerness
        
        Used only for semantic_vad mode. The eagerness of the model to respond. low will wait longer for the user to continue speaking, high will respond more quickly. auto is the default and is equivalent to medium. low, medium, and high have max timeouts of 8s, 4s, and 2s respectively.
        
        LOW("low")
        
        MEDIUM("medium")
        
        HIGH("high")
        
        AUTO("auto")
        
        Optional<Boolean> interruptResponse
        
        Whether or not to automatically interrupt any ongoing response with output to the default conversation (i.e. conversation of auto) when a VAD start event occurs.
    - Optional<List<Include>> include
      
      Additional fields to include in server outputs.
      
      item.input_audio_transcription.logprobs: Include logprobs for input audio transcription.
      - ITEM_INPUT_AUDIO_TRANSCRIPTION_LOGPROBS("item.input_audio_transcription.logprobs")
- JsonValue; type "session.created" constant
  
  The event type, must be session.created.
  - SESSION_CREATED("session.created")

Session Update Event

class SessionUpdateEvent:

Send this event to update the session’s configuration. The client may send this event at any time to update any field except for voice and model. voice can be updated only if there have been no other audio outputs yet.

When the server receives a session.update, it will respond with a session.updated event showing the full, effective configuration. Only the fields that are present in the session.update are updated. To clear a field like instructions, pass an empty string. To clear a field like tools, pass an empty array. To clear a field like turn_detection, pass null.
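
The clearing rules above (empty string for instructions, empty array for tools, null for turn_detection) can be seen in the shape of the JSON event itself. The sketch below builds the payload as a plain string rather than through SDK builders. The placement of turn_detection under audio.input is an assumption inferred from the field layout in this reference, and the class name is hypothetical.

```java
/** Illustrative session.update payload that clears three fields (hypothetical helper). */
final class SessionUpdatePayload {
    // Only fields present in the payload are updated; each value below clears its field.
    static String clearFieldsEvent() {
        return "{"
            + "\"type\":\"session.update\","
            + "\"session\":{"
            + "\"type\":\"realtime\","
            + "\"instructions\":\"\","                              // empty string clears instructions
            + "\"tools\":[],"                                       // empty array clears tools
            + "\"audio\":{\"input\":{\"turn_detection\":null}}"     // null turns off turn detection (assumed nesting)
            + "}}";
    }

    public static void main(String[] args) {
        System.out.println(clearFieldsEvent());
    }
}
```

Fields left out of the payload, such as the output voice, keep their current values; the session.updated event that follows reports the full effective configuration.
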
- Session session
  
  Update the Realtime session. Choose either a realtime session or a transcription session.
  - class RealtimeSessionCreateRequest:
    
    Realtime session object configuration.
    - JsonValue; type "realtime" constant
      
      The type of session to create. Always realtime for the Realtime API.
      - REALTIME("realtime")
    - Optional<RealtimeAudioConfig> audio
      
      Configuration for input and output audio.
      - Optional<RealtimeAudioConfigInput> input
        
        Optional<RealtimeAudioFormats> format
        
        The format of the input audio.
        
        AudioPcm
        
        Optional<Rate> rate
        
        The sample rate of the audio. Always 24000.
        
        _24000(24000)
        
        Optional<Type> type
        
        The audio format. Always audio/pcm.
        
        AUDIO_PCM("audio/pcm")
        
        AudioPcmu
        
        Optional<Type> type
        
        The audio format. Always audio/pcmu.
        
        AUDIO_PCMU("audio/pcmu")
        
        AudioPcma
        
        Optional<Type> type
        
        The audio format. Always audio/pcma.
        
        AUDIO_PCMA("audio/pcma")
        
        Optional<NoiseReduction> noiseReduction
        
        Configuration for input audio noise reduction. This can be set to null to turn off. Noise reduction filters audio added to the input audio buffer before it is sent to VAD and the model. Filtering the audio can improve VAD and turn detection accuracy (reducing false positives) and model performance by improving perception of the input audio.
        
        Optional<NoiseReductionType> type
        
        Type of noise reduction. near_field is for close-talking microphones such as headphones, far_field is for far-field microphones such as laptop or conference room microphones.
        
        NEAR_FIELD("near_field")
        
        FAR_FIELD("far_field")
        
        Optional<AudioTranscription> transcription
        
        Configuration for input audio transcription, defaults to off and can be set to null to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through the /audio/transcriptions endpoint and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service.
        
        Optional<Delay> delay
        
        Controls how long the model waits before emitting transcription text. Higher values can improve transcription accuracy at the cost of latency. Only supported with gpt-realtime-whisper in GA Realtime sessions.
        
        MINIMAL("minimal")
        
        LOW("low")
        
        MEDIUM("medium")
        
        HIGH("high")
        
        XHIGH("xhigh")
        
        Optional<String> language
        
        The language of the input audio. Supplying the input language in ISO-639-1 (e.g. en) format will improve accuracy and latency.
        
        Optional<Model> model
        
        The model to use for transcription. Current options are whisper-1, gpt-4o-mini-transcribe, gpt-4o-mini-transcribe-2025-12-15, gpt-4o-transcribe, gpt-4o-transcribe-diarize, and gpt-realtime-whisper. Use gpt-4o-transcribe-diarize when you need diarization with speaker labels.
        
        WHISPER_1("whisper-1")
        
        GPT_4O_MINI_TRANSCRIBE("gpt-4o-mini-transcribe")
        
        GPT_4O_MINI_TRANSCRIBE_2025_12_15("gpt-4o-mini-transcribe-2025-12-15")
        
        GPT_4O_TRANSCRIBE("gpt-4o-transcribe")
        
        GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize")
        
        GPT_REALTIME_WHISPER("gpt-realtime-whisper")
        
        Optional<String> prompt
        
        An optional text to guide the model's style or continue a previous audio segment. For whisper-1, the prompt is a list of keywords. For gpt-4o-transcribe models (excluding gpt-4o-transcribe-diarize), the prompt is a free text string, for example "expect words related to technology". Prompt is not supported with gpt-realtime-whisper in GA Realtime sessions.
        
        Optional<RealtimeAudioInputTurnDetection> turnDetection
        
        Configuration for turn detection, either Server VAD or Semantic VAD. This can be set to null to turn off, in which case the client must manually trigger a model response.
        
        Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.
        
        Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.
        
        For gpt-realtime-whisper transcription sessions, turn detection must be set to null; VAD is not supported.
        
        ServerVad
        
        JsonValue; type "server_vad"constant
        
        Type of turn detection, server_vad to turn on simple Server VAD.
        
        SERVER_VAD("server_vad")
        
        Optional<Boolean> createResponse
        
        Whether or not to automatically generate a response when a VAD stop event occurs. If interrupt_response is set to false, this may fail to create a response if the model is already responding.
        
        If both create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted.
        
        Optional<Long> idleTimeoutMs
        
        Optional timeout after which a model response will be triggered automatically. This is useful for situations in which a long pause from the user is unexpected, such as a phone call. The model will effectively prompt the user to continue the conversation based on the current context.
        
        The timeout value will be applied after the last model response's audio has finished playing, i.e. it's set to the response.done time plus audio playback duration.
        
        An input_audio_buffer.timeout_triggered event (plus events associated with the Response) will be emitted when the timeout is reached. Idle timeout is currently only supported for server_vad mode.
        
        Optional<Boolean> interruptResponse
        
        Whether or not to automatically interrupt (cancel) any ongoing response with output to the default conversation (i.e. conversation of auto) when a VAD start event occurs. If true then the response will be cancelled, otherwise it will continue until complete.
        
        If both create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted.
        
        Optional<Long> prefixPaddingMs
        
        Used only for server_vad mode. Amount of audio to include before the VAD detected speech (in milliseconds). Defaults to 300ms.
        
        Optional<Long> silenceDurationMs
        
        Used only for server_vad mode. Duration of silence to detect speech stop (in milliseconds). Defaults to 500ms. With shorter values the model will respond more quickly, but may jump in on short pauses from the user.
        
        Optional<Double> threshold
        
        Used only for server_vad mode. Activation threshold for VAD (0.0 to 1.0). Defaults to 0.5. A higher threshold will require louder audio to activate the model, and thus might perform better in noisy environments.
        
        SemanticVad
        
        JsonValue; type "semantic_vad"constant
        
        Type of turn detection, semantic_vad to turn on Semantic VAD.
        
        SEMANTIC_VAD("semantic_vad")
        
        Optional<Boolean> createResponse
        
        Whether or not to automatically generate a response when a VAD stop event occurs.
        
        Optional<Eagerness> eagerness
        
        Used only for semantic_vad mode. The eagerness of the model to respond. low will wait longer for the user to continue speaking, high will respond more quickly. auto is the default and is equivalent to medium. low, medium, and high have max timeouts of 8s, 4s, and 2s respectively.
        
        LOW("low")
        
        MEDIUM("medium")
        
        HIGH("high")
        
        AUTO("auto")
        
        Optional<Boolean> interruptResponse
        
        Whether or not to automatically interrupt any ongoing response with output to the default conversation (i.e. conversation of auto) when a VAD start event occurs.
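
        As a rough illustration of how threshold and silence_duration_ms interact in Server VAD, the toy detector below marks speech start when a frame's level reaches the threshold and speech stop after enough consecutive quiet time. It is a sketch under stated assumptions, not the server's actual algorithm: the ToyServerVad class, the fixed frame length, the normalized 0.0 to 1.0 frame levels, and the event strings are all hypothetical.

        ```java
        import java.util.ArrayList;
        import java.util.List;

        // Hypothetical toy model of volume-based turn detection. Not part of
        // any SDK and not the detector the server actually runs.
        final class ToyServerVad {

            // frameLevels: one normalized loudness value (0.0 to 1.0) per frame.
            // frameMs: assumed fixed duration of each frame in milliseconds.
            // Returns "speech_started@<ms>" / "speech_stopped@<ms>" markers.
            static List<String> detect(double[] frameLevels, int frameMs,
                                       double threshold, int silenceDurationMs) {
                List<String> events = new ArrayList<>();
                boolean speaking = false;
                int silentMs = 0;
                for (int i = 0; i < frameLevels.length; i++) {
                    boolean loud = frameLevels[i] >= threshold;
                    if (loud) {
                        // Any loud frame resets the silence counter.
                        silentMs = 0;
                        if (!speaking) {
                            speaking = true;
                            events.add("speech_started@" + (i * frameMs));
                        }
                    } else if (speaking) {
                        // Quiet frame during speech: accumulate silence.
                        silentMs += frameMs;
                        if (silentMs >= silenceDurationMs) {
                            speaking = false;
                            silentMs = 0;
                            events.add("speech_stopped@" + ((i + 1) * frameMs));
                        }
                    }
                }
                return events;
            }

            public static void main(String[] args) {
                // 100 ms frames with a brief dip at frame 3.
                double[] levels = {0.1, 0.7, 0.8, 0.2, 0.9, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1};
                System.out.println(detect(levels, 100, 0.5, 500));
                System.out.println(detect(levels, 100, 0.5, 100));
            }
        }
        ```

        With the same input, a 500 ms silence window yields a single speech segment, while a 100 ms window splits it at the brief dip, which mirrors the trade-off described for silenceDurationMs.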
      - Optional<RealtimeAudioConfigOutput> output
        
        Optional<RealtimeAudioFormats> format
        
        The format of the output audio.
        
        Optional<Double> speed
        
        The speed of the model's spoken response as a multiple of the original speed. 1.0 is the default speed. 0.25 is the minimum speed. 1.5 is the maximum speed. This value can only be changed in between model turns, not while a response is in progress.
        
        This parameter is a post-processing adjustment applied to the audio after it is generated; it's also possible to prompt the model to speak faster or slower.
        
        Optional<Voice> voice
        
        The voice the model uses to respond. Supported built-in voices are alloy, ash, ballad, coral, echo, sage, shimmer, verse, marin, and cedar. You may also provide a custom voice object with an id, for example { "id": "voice_1234" }. Voice cannot be changed during the session once the model has responded with audio at least once. We recommend marin and cedar for best quality.
        
        String
        
        enum UnionMember1:
        
        ALLOY("alloy")
        
        ASH("ash")
        
        BALLAD("ballad")
        
        CORAL("coral")
        
        ECHO("echo")
        
        SAGE("sage")
        
        SHIMMER("shimmer")
        
        VERSE("verse")
        
        MARIN("marin")
        
        CEDAR("cedar")
        
        class Id:
        
        Custom voice reference.
        
        String id
        
        The custom voice ID, e.g. voice_1234.
    - Optional<List<Include>> include
      
      Additional fields to include in server outputs.
      
      item.input_audio_transcription.logprobs: Include logprobs for input audio transcription.
      - ITEM_INPUT_AUDIO_TRANSCRIPTION_LOGPROBS("item.input_audio_transcription.logprobs")
    - Optional<String> instructions
      
      The default system instructions (i.e. system message) prepended to model calls. This field allows the client to guide the model on desired responses. The model can be instructed on response content and format (e.g. "be extremely succinct", "act friendly", "here are examples of good responses") and on audio behavior (e.g. "talk quickly", "inject emotion into your voice", "laugh frequently"). The instructions are not guaranteed to be followed by the model, but they provide guidance to the model on the desired behavior.
      
      Note that the server sets default instructions which will be used if this field is not set and are visible in the session.created event at the start of the session.
    - Optional<MaxOutputTokens> maxOutputTokens
      
      Maximum number of output tokens for a single assistant response, inclusive of tool calls. Provide an integer between 1 and 4096 to limit output tokens, or inf for the maximum available tokens for a given model. Defaults to inf.
      - long
      - JsonValue;
        
        INF("inf")
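
      A client may want to check this setting before sending it. The sketch below accepts either the literal inf or an integer in the documented 1 to 4096 range. MaxOutputTokensSketch and its parse method are hypothetical names used only for illustration, and taking the value as a string is an assumption of the sketch.

      ```java
      import java.util.OptionalLong;

      // Hypothetical helper, not part of any SDK: validates a max output
      // token setting against the documented range.
      final class MaxOutputTokensSketch {
          static final long MIN = 1;
          static final long MAX = 4096;

          // Returns an empty OptionalLong for "inf" (no explicit cap), or the
          // parsed limit when it lies within 1..4096. Throws otherwise.
          static OptionalLong parse(String raw) {
              if ("inf".equals(raw)) {
                  return OptionalLong.empty();
              }
              final long value;
              try {
                  value = Long.parseLong(raw);
              } catch (NumberFormatException e) {
                  throw new IllegalArgumentException(
                      "expected an integer or \"inf\": " + raw, e);
              }
              if (value < MIN || value > MAX) {
                  throw new IllegalArgumentException(
                      "must be between 1 and 4096: " + value);
              }
              return OptionalLong.of(value);
          }

          public static void main(String[] args) {
              System.out.println(parse("inf").isPresent());
              System.out.println(parse("1024").getAsLong());
          }
      }
      ```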
    - Optional<Model> model
      
      The Realtime model used for this session.
      - GPT_REALTIME("gpt-realtime")
      - GPT_REALTIME_1_5("gpt-realtime-1.5")
      - GPT_REALTIME_2("gpt-realtime-2")
      - GPT_REALTIME_2025_08_28("gpt-realtime-2025-08-28")
      - GPT_4O_REALTIME_PREVIEW("gpt-4o-realtime-preview")
      - GPT_4O_REALTIME_PREVIEW_2024_10_01("gpt-4o-realtime-preview-2024-10-01")
      - GPT_4O_REALTIME_PREVIEW_2024_12_17("gpt-4o-realtime-preview-2024-12-17")
      - GPT_4O_REALTIME_PREVIEW_2025_06_03("gpt-4o-realtime-preview-2025-06-03")
      - GPT_4O_MINI_REALTIME_PREVIEW("gpt-4o-mini-realtime-preview")
      - GPT_4O_MINI_REALTIME_PREVIEW_2024_12_17("gpt-4o-mini-realtime-preview-2024-12-17")
      - GPT_REALTIME_MINI("gpt-realtime-mini")
      - GPT_REALTIME_MINI_2025_10_06("gpt-realtime-mini-2025-10-06")
      - GPT_REALTIME_MINI_2025_12_15("gpt-realtime-mini-2025-12-15")
      - GPT_AUDIO_1_5("gpt-audio-1.5")
      - GPT_AUDIO_MINI("gpt-audio-mini")
      - GPT_AUDIO_MINI_2025_10_06("gpt-audio-mini-2025-10-06")
      - GPT_AUDIO_MINI_2025_12_15("gpt-audio-mini-2025-12-15")
    - Optional<List<OutputModality>> outputModalities
      
      The set of modalities the model can respond with. It defaults to ["audio"], indicating that the model will respond with audio plus a transcript. ["text"] can be used to make the model respond with text only. It is not possible to request both text and audio at the same time.
      - TEXT("text")
      - AUDIO("audio")
    - Optional<Boolean> parallelToolCalls
      
      Whether the model may call multiple tools in parallel. Only supported by reasoning Realtime models such as gpt-realtime-2.
    - Optional<ResponsePrompt> prompt
      
      Reference to a prompt template and its variables.
      - String id
        
        The unique identifier of the prompt template to use.
      - Optional<Variables> variables
        
        Optional map of values to substitute in for variables in your prompt. The substitution values can either be strings, or other Response input types like images or files.
        
        String
        
        class ResponseInputText:
        
        A text input to the model.
        
        String text
        
        The text input to the model.
        
        JsonValue; type "input_text"constant
        
        The type of the input item. Always input_text.
        
        INPUT_TEXT("input_text")
        
        class ResponseInputImage:
        
        An image input to the model.
        
        Detail detail
        
        The detail level of the image to be sent to the model. One of high, low, auto, or original. Defaults to auto.
        
        LOW("low")
        
        HIGH("high")
        
        AUTO("auto")
        
        ORIGINAL("original")
        
        JsonValue; type "input_image"constant
        
        The type of the input item. Always input_image.
        
        INPUT_IMAGE("input_image")
        
        Optional<String> fileId
        
        The ID of the file to be sent to the model.
        
        Optional<String> imageUrl
        
        The URL of the image to be sent to the model. A fully qualified URL or base64 encoded image in a data URL.
        
        class ResponseInputFile:
        
        A file input to the model.
        
        JsonValue; type "input_file"constant
        
        The type of the input item. Always input_file.
        
        INPUT_FILE("input_file")
        
        Optional<Detail> detail
        
        The detail level of the file to be sent to the model. Use low for the default rendering behavior, or high to render the file at higher quality. Defaults to low.
        
        LOW("low")
        
        HIGH("high")
        
        Optional<String> fileData
        
        The content of the file to be sent to the model.
        
        Optional<String> fileId
        
        The ID of the file to be sent to the model.
        
        Optional<String> fileUrl
        
        The URL of the file to be sent to the model.
        
        Optional<String> filename
        
        The name of the file to be sent to the model.
      - Optional<String> version
        
        Optional version of the prompt template.
    - Optional<RealtimeReasoning> reasoning
      
      Configuration for reasoning-capable Realtime models such as gpt-realtime-2.
      - Optional<RealtimeReasoningEffort> effort
        
        Constrains effort on reasoning for reasoning-capable Realtime models such as gpt-realtime-2.
        
        MINIMAL("minimal")
        
        LOW("low")
        
        MEDIUM("medium")
        
        HIGH("high")
        
        XHIGH("xhigh")
    - Optional<RealtimeToolChoiceConfig> toolChoice
      
      How the model chooses tools. Provide one of the string modes or force a specific function/MCP tool.
      - enum ToolChoiceOptions:
        
        Controls which (if any) tool is called by the model.
        
        none means the model will not call any tool and instead generates a message.
        
        auto means the model can pick between generating a message or calling one or more tools.
        
        required means the model must call one or more tools.
        
        NONE("none")
        
        AUTO("auto")
        
        REQUIRED("required")
      - class ToolChoiceFunction:
        
        Use this option to force the model to call a specific function.
        
        String name
        
        The name of the function to call.
        
        JsonValue; type "function"constant
        
        For function calling, the type is always function.
        
        FUNCTION("function")
      - class ToolChoiceMcp:
        
        Use this option to force the model to call a specific tool on a remote MCP server.
        
        String serverLabel
        
        The label of the MCP server to use.
        
        JsonValue; type "mcp"constant
        
        For MCP tools, the type is always mcp.
        
        MCP("mcp")
        
        Optional<String> name
        
        The name of the tool to call on the server.
    - Optional<List<RealtimeToolsConfigUnion>> tools
      
      Tools available to the model.
      - class RealtimeFunctionTool:
        
        Optional<String> description
        
        The description of the function, including guidance on when and how to call it, and guidance about what to tell the user when calling (if anything).
        
        Optional<String> name
        
        The name of the function.
        
        Optional<JsonValue> parameters
        
        Parameters of the function in JSON Schema.
        
        Optional<Type> type
        
        The type of the tool, i.e. function.
        
        FUNCTION("function")
      - Mcp
        
        String serverLabel
        
        A label for this MCP server, used to identify it in tool calls.
        
        JsonValue; type "mcp"constant
        
        The type of the MCP tool. Always mcp.
        
        MCP("mcp")
        
        Optional<AllowedTools> allowedTools
        
        List of allowed tool names or a filter object.
        
        List<String>
        
        class McpToolFilter:
        
        A filter object to specify which tools are allowed.
        
        Optional<Boolean> readOnly
        
        Indicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with readOnlyHint, it will match this filter.
        
        Optional<List<String>> toolNames
        
        List of allowed tool names.
        
        Optional<String> authorization
        
        An OAuth access token that can be used with a remote MCP server, either with a custom MCP server URL or a service connector. Your application must handle the OAuth authorization flow and provide the token here.
        
        Optional<ConnectorId> connectorId
        
        Identifier for service connectors, like those available in ChatGPT. One of server_url or connector_id must be provided.
        
        Currently supported connector_id values are:
        
        Dropbox: connector_dropbox
        
        Gmail: connector_gmail
        
        Google Calendar: connector_googlecalendar
        
        Google Drive: connector_googledrive
        
        Microsoft Teams: connector_microsoftteams
        
        Outlook Calendar: connector_outlookcalendar
        
        Outlook Email: connector_outlookemail
        
        SharePoint: connector_sharepoint
        
        CONNECTOR_DROPBOX("connector_dropbox")
        
        CONNECTOR_GMAIL("connector_gmail")
        
        CONNECTOR_GOOGLECALENDAR("connector_googlecalendar")
        
        CONNECTOR_GOOGLEDRIVE("connector_googledrive")
        
        CONNECTOR_MICROSOFTTEAMS("connector_microsoftteams")
        
        CONNECTOR_OUTLOOKCALENDAR("connector_outlookcalendar")
        
        CONNECTOR_OUTLOOKEMAIL("connector_outlookemail")
        
        CONNECTOR_SHAREPOINT("connector_sharepoint")
        
        Optional<Boolean> deferLoading
        
        Whether this MCP tool is deferred and discovered via tool search.
        
        Optional<Headers> headers
        
        Optional HTTP headers to send to the MCP server. Use for authentication or other purposes.
        
        Optional<RequireApproval> requireApproval
        
        Specify which of the MCP server's tools require approval.
        
        class McpToolApprovalFilter:
        
        Specify which of the MCP server's tools require approval. Can be always, never, or a filter object associated with tools that require approval.
        
        Optional<Always> always
        
        A filter object to specify which tools are allowed.
        
        Optional<Boolean> readOnly
        
        Indicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with readOnlyHint, it will match this filter.
        
        Optional<List<String>> toolNames
        
        List of allowed tool names.
        
        Optional<Never> never
        
        A filter object to specify which tools are allowed.
        
        Optional<Boolean> readOnly
        
        Indicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with readOnlyHint, it will match this filter.
        
        Optional<List<String>> toolNames
        
        List of allowed tool names.
        
        enum McpToolApprovalSetting:
        
        Specify a single approval policy for all tools. One of always or never. When set to always, all tools will require approval. When set to never, no tools will require approval.
        
        ALWAYS("always")
        
        NEVER("never")
        
        Optional<String> serverDescription
        
        Optional description of the MCP server, used to provide more context.
        
        Optional<String> serverUrl
        
        The URL for the MCP server. One of server_url or connector_id must be provided.
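
      The rule that one of server_url or connector_id must be provided can be checked on the client before an MCP tool is sent. The sketch below is a hypothetical helper (McpToolCheck is not an SDK class). It rejects a tool with neither value and a connector_id outside the list above. Whether both values may be set together is not stated here, so the sketch does not reject that case.

      ```java
      import java.util.Set;

      // Hypothetical helper, not part of any SDK: pre-flight check for an
      // MCP tool definition.
      final class McpToolCheck {

          // Connector IDs listed as currently supported.
          static final Set<String> KNOWN_CONNECTORS = Set.of(
              "connector_dropbox",
              "connector_gmail",
              "connector_googlecalendar",
              "connector_googledrive",
              "connector_microsoftteams",
              "connector_outlookcalendar",
              "connector_outlookemail",
              "connector_sharepoint");

          // Throws if neither value is present or the connector is unknown.
          static void validate(String serverUrl, String connectorId) {
              if (serverUrl == null && connectorId == null) {
                  throw new IllegalArgumentException(
                      "one of server_url or connector_id must be provided");
              }
              if (connectorId != null && !KNOWN_CONNECTORS.contains(connectorId)) {
                  throw new IllegalArgumentException(
                      "unsupported connector_id: " + connectorId);
              }
          }

          public static void main(String[] args) {
              validate(null, "connector_gmail");           // passes
              validate("https://example.com/mcp", null);   // passes (placeholder URL)
              System.out.println("ok");
          }
      }
      ```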
    - Optional<RealtimeTracingConfig> tracing
      
      The Realtime API can write session traces to the Traces Dashboard. Set to null to disable tracing. Once tracing is enabled for a session, the configuration cannot be modified.
      
      auto will create a trace for the session with default values for the workflow name, group id, and metadata.
      - JsonValue;
        
        AUTO("auto")
      - TracingConfiguration
        
        Optional<String> groupId
        
        The group id to attach to this trace to enable filtering and grouping in the Traces Dashboard.
        
        Optional<JsonValue> metadata
        
        The arbitrary metadata to attach to this trace to enable filtering in the Traces Dashboard.
        
        Optional<String> workflowName
        
        The name of the workflow to attach to this trace. This is used to name the trace in the Traces Dashboard.
    - Optional<RealtimeTruncation> truncation
      
      When the number of tokens in a conversation exceeds the model's input token limit, the conversation will be truncated, meaning messages (starting from the oldest) will not be included in the model's context. A 32k context model with 4,096 max output tokens can only include 28,224 tokens in the context before truncation occurs.
      
      Clients can configure truncation behavior to truncate with a lower max token limit, which is an effective way to control token usage and cost.
      
      Truncation will reduce the number of cached tokens on the next turn (busting the cache), since messages are dropped from the beginning of the context. However, clients can also configure truncation to retain messages up to a fraction of the maximum context size, which will reduce the need for future truncations and thus improve the cache rate.
      
      Truncation can be disabled entirely, which means the server will never truncate but would instead return an error if the conversation exceeds the model's input token limit.
      - RealtimeTruncationStrategy
        
        AUTO("auto")
        
        DISABLED("disabled")
      - class RealtimeTruncationRetentionRatio:
        
        Retain a fraction of the conversation tokens when the conversation exceeds the input token limit. This allows you to amortize truncations across multiple turns, which can help improve cached token usage.
        
        double retentionRatio
        
        Fraction of post-instruction conversation tokens to retain (0.0 - 1.0) when the conversation exceeds the input token limit. Setting this to 0.8 means that messages will be dropped until 80% of the maximum allowed tokens are used. This helps reduce the frequency of truncations and improve cache rates.
        
        JsonValue; type "retention_ratio"constant
        
        Use retention ratio truncation.
        
        RETENTION_RATIO("retention_ratio")
        
        Optional<TokenLimits> tokenLimits
        
        Optional custom token limits for this truncation strategy. If not provided, the model's default token limits will be used.
        
        Optional<Long> postInstructions
        
        Maximum tokens allowed in the conversation after instructions (which include tool definitions). For example, setting this to 5,000 would mean that truncation would occur when the conversation exceeds 5,000 tokens after instructions. This cannot be higher than the model's context window size minus the maximum output tokens.
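
      To make the retention ratio arithmetic concrete, the sketch below models a conversation as per-message token counts and reports how many of the oldest messages would be dropped. It is an illustration under assumptions rather than the server's implementation: RetentionRatioSketch is a hypothetical class, whole messages are dropped oldest first, and the retention target is rounded down.

      ```java
      // Hypothetical helper, not part of any SDK: models retention-ratio
      // truncation over per-message token counts, oldest message first.
      final class RetentionRatioSketch {

          // Returns how many of the oldest messages to drop. Nothing is
          // dropped while the total stays within tokenLimit. Once the limit
          // is exceeded, messages are dropped until the total is at most
          // tokenLimit * retentionRatio (rounded down, an assumption).
          static int messagesToDrop(long[] messageTokens, long tokenLimit,
                                    double retentionRatio) {
              if (!(retentionRatio >= 0.0 && retentionRatio <= 1.0)) {
                  throw new IllegalArgumentException(
                      "retentionRatio must be within 0.0 to 1.0");
              }
              long total = 0;
              for (long tokens : messageTokens) {
                  total += tokens;
              }
              if (total <= tokenLimit) {
                  return 0;
              }
              long target = (long) Math.floor(tokenLimit * retentionRatio);
              int dropped = 0;
              while (dropped < messageTokens.length && total > target) {
                  total -= messageTokens[dropped];
                  dropped++;
              }
              return dropped;
          }

          public static void main(String[] args) {
              // 5,400 tokens against a 5,000 token limit with a 0.8 ratio:
              // the target is 4,000 tokens, so the two oldest messages go.
              long[] conversation = {1200, 900, 1500, 1000, 800};
              System.out.println(messagesToDrop(conversation, 5000, 0.8));
          }
      }
      ```

      Dropping to 80% of the limit rather than just under it leaves headroom, so the following turns can reuse the cached prefix before another truncation is needed.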
  - class RealtimeTranscriptionSessionCreateRequest:
    
    Realtime transcription session object configuration.
    - JsonValue; type "transcription"constant
      
      The type of session to create. Always transcription for transcription sessions.
      - TRANSCRIPTION("transcription")
    - Optional<RealtimeTranscriptionSessionAudio> audio
      
      Configuration for input and output audio.
      - Optional<RealtimeTranscriptionSessionAudioInput> input
        
        Optional<RealtimeAudioFormats> format
        
        The PCM audio format. Only a 24kHz sample rate is supported.
        
        Optional<NoiseReduction> noiseReduction
        
        Configuration for input audio noise reduction. This can be set to null to turn off. Noise reduction filters audio added to the input audio buffer before it is sent to VAD and the model. Filtering the audio can improve VAD and turn detection accuracy (reducing false positives) and model performance by improving perception of the input audio.
        
        Optional<NoiseReductionType> type
        
        Type of noise reduction. near_field is for close-talking microphones such as headphones, far_field is for far-field microphones such as laptop or conference room microphones.
        
        Optional<AudioTranscription> transcription
        
        Configuration for input audio transcription. Defaults to off, and can be set to null to turn it off once enabled. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through the /audio/transcriptions endpoint and should be treated as guidance on the input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service.
        
        Optional<RealtimeTranscriptionSessionAudioInputTurnDetection> turnDetection
        
        Configuration for turn detection, either Server VAD or Semantic VAD. This can be set to null to turn off, in which case the client must manually trigger a model response.
        
        Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.
        
        Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.
        
        For gpt-realtime-whisper transcription sessions, turn detection must be set to null; VAD is not supported.
        
        ServerVad
        
        JsonValue; type "server_vad"constant
        
        Type of turn detection, server_vad to turn on simple Server VAD.
        
        SERVER_VAD("server_vad")
        
        Optional<Boolean> createResponse
        
        Whether or not to automatically generate a response when a VAD stop event occurs. If interrupt_response is set to false, this may fail to create a response if the model is already responding.
        
        If both create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted.
        
        Optional<Long> idleTimeoutMs
        
        Optional timeout after which a model response will be triggered automatically. This is useful for situations in which a long pause from the user is unexpected, such as a phone call. The model will effectively prompt the user to continue the conversation based on the current context.
        
        The timeout value will be applied after the last model response's audio has finished playing, i.e. it's set to the response.done time plus audio playback duration.
        
        An input_audio_buffer.timeout_triggered event (plus events associated with the Response) will be emitted when the timeout is reached. Idle timeout is currently only supported for server_vad mode.
        
        Optional<Boolean> interruptResponse
        
        Whether or not to automatically interrupt (cancel) any ongoing response with output to the default conversation (i.e. conversation of auto) when a VAD start event occurs. If true then the response will be cancelled, otherwise it will continue until complete.
        
        If both create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted.
        
        Optional<Long> prefixPaddingMs
        
        Used only for server_vad mode. Amount of audio to include before the VAD detected speech (in milliseconds). Defaults to 300ms.
        
        Optional<Long> silenceDurationMs
        
        Used only for server_vad mode. Duration of silence to detect speech stop (in milliseconds). Defaults to 500ms. With shorter values the model will respond more quickly, but may jump in on short pauses from the user.
        
        Optional<Double> threshold
        
        Used only for server_vad mode. Activation threshold for VAD (0.0 to 1.0). Defaults to 0.5. A higher threshold will require louder audio to activate the model, and thus might perform better in noisy environments.
        
        SemanticVad
        
        JsonValue; type "semantic_vad"constant
        
        Type of turn detection, semantic_vad to turn on Semantic VAD.
        
        SEMANTIC_VAD("semantic_vad")
        
        Optional<Boolean> createResponse
        
        Whether or not to automatically generate a response when a VAD stop event occurs.
        
        Optional<Eagerness> eagerness
        
        Used only for semantic_vad mode. The eagerness of the model to respond. low will wait longer for the user to continue speaking, high will respond more quickly. auto is the default and is equivalent to medium. low, medium, and high have max timeouts of 8s, 4s, and 2s respectively.
        
        LOW("low")
        
        MEDIUM("medium")
        
        HIGH("high")
        
        AUTO("auto")
        
        Optional<Boolean> interruptResponse
        
        Whether or not to automatically interrupt any ongoing response with output to the default conversation (i.e. conversation of auto) when a VAD start event occurs.
    - Optional<List<Include>> include
      
      Additional fields to include in server outputs.
      
      item.input_audio_transcription.logprobs: Include logprobs for input audio transcription.
      - ITEM_INPUT_AUDIO_TRANSCRIPTION_LOGPROBS("item.input_audio_transcription.logprobs")
- JsonValue; type "session.update"constant
  
  The event type, must be session.update.
  - SESSION_UPDATE("session.update")
- Optional<String> eventId
  
  Optional client-generated ID used to identify this event. This is an arbitrary string that a client may assign. It will be passed back if there is an error with the event, but the corresponding session.updated event will not include it.
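
To show roughly what a session.update event looks like on the wire, the sketch below assembles the JSON text by hand for a session that enables Server VAD. It is not the SDK's builder API. SessionUpdateSketch is a hypothetical class, and the wire names event_id, turn_detection, prefix_padding_ms, and silence_duration_ms are assumed to be the snake_case forms of the fields documented above.

```java
// Hypothetical helper, not part of any SDK: assembles the JSON text of a
// session.update client event by hand so the wire shape is visible.
final class SessionUpdateSketch {

    // Builds a session.update event that turns on Server VAD.
    // eventId is restricted to [A-Za-z0-9_-] so no JSON escaping is needed.
    static String serverVadUpdate(String eventId, double threshold,
                                  long prefixPaddingMs, long silenceDurationMs) {
        if (!eventId.matches("[A-Za-z0-9_-]+")) {
            throw new IllegalArgumentException("unsupported eventId: " + eventId);
        }
        if (!(threshold >= 0.0 && threshold <= 1.0)) {
            throw new IllegalArgumentException("threshold must be within 0.0 to 1.0");
        }
        // Wire names below are assumed snake_case forms of the Java fields.
        String turnDetection = "{\"type\":\"server_vad\""
            + ",\"threshold\":" + threshold
            + ",\"prefix_padding_ms\":" + prefixPaddingMs
            + ",\"silence_duration_ms\":" + silenceDurationMs + "}";
        return "{\"type\":\"session.update\""
            + ",\"event_id\":\"" + eventId + "\""
            + ",\"session\":{\"type\":\"realtime\""
            + ",\"audio\":{\"input\":{\"turn_detection\":" + turnDetection + "}}}}";
    }

    public static void main(String[] args) {
        // Uses the documented Server VAD defaults: 0.5, 300 ms, 500 ms.
        System.out.println(serverVadUpdate("evt_1", 0.5, 300, 500));
    }
}
```

Because the corresponding session.updated event does not echo the client-generated ID, the ID is mainly useful for matching an error back to the update that caused it.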

Session Updated Event

class SessionUpdatedEvent:

Returned when a session is updated with a session.update event, unless there is an error.
- String eventId
  
  The unique ID of the server event.
- Session session
  
  The session configuration.
  - class RealtimeSessionCreateRequest:
    
    Realtime session object configuration.
    - JsonValue; type "realtime"constant
      
      The type of session to create. Always realtime for the Realtime API.
      - REALTIME("realtime")
    - Optional<RealtimeAudioConfig> audio
      
      Configuration for input and output audio.
      - Optional<RealtimeAudioConfigInput> input
        
        Optional<RealtimeAudioFormats> format
        
        The format of the input audio.
        
        AudioPcm
        
        Optional<Rate> rate
        
        The sample rate of the audio. Always 24000.
        
        _24000(24000)
        
        Optional<Type> type
        
        The audio format. Always audio/pcm.
        
        AUDIO_PCM("audio/pcm")
        
        AudioPcmu
        
        Optional<Type> type
        
        The audio format. Always audio/pcmu.
        
        AUDIO_PCMU("audio/pcmu")
        
        AudioPcma
        
        Optional<Type> type
        
        The audio format. Always audio/pcma.
        
        AUDIO_PCMA("audio/pcma")
        
        Optional<NoiseReduction> noiseReduction
        
        Configuration for input audio noise reduction. This can be set to null to turn off. Noise reduction filters audio added to the input audio buffer before it is sent to VAD and the model. Filtering the audio can improve VAD and turn detection accuracy (reducing false positives) and model performance by improving perception of the input audio.
        
        Optional<NoiseReductionType> type
        
        Type of noise reduction. near_field is for close-talking microphones such as headphones, far_field is for far-field microphones such as laptop or conference room microphones.
        
        NEAR_FIELD("near_field")
        
        FAR_FIELD("far_field")
        
        Optional<AudioTranscription> transcription
        
        Configuration for input audio transcription. Defaults to off, and can be set to null to turn it off once enabled. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through the /audio/transcriptions endpoint and should be treated as guidance on the input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service.
        
        Optional<Delay> delay
        
        Controls how long the model waits before emitting transcription text. Higher values can improve transcription accuracy at the cost of latency. Only supported with gpt-realtime-whisper in GA Realtime sessions.
        
        MINIMAL("minimal")
        
        LOW("low")
        
        MEDIUM("medium")
        
        HIGH("high")
        
        XHIGH("xhigh")
        
        Optional<String> language
        
        The language of the input audio. Supplying the input language in ISO-639-1 (e.g. en) format will improve accuracy and latency.
        
        Optional<Model> model
        
        The model to use for transcription. Current options are whisper-1, gpt-4o-mini-transcribe, gpt-4o-mini-transcribe-2025-12-15, gpt-4o-transcribe, gpt-4o-transcribe-diarize, and gpt-realtime-whisper. Use gpt-4o-transcribe-diarize when you need diarization with speaker labels.
        
        WHISPER_1("whisper-1")
        
        GPT_4O_MINI_TRANSCRIBE("gpt-4o-mini-transcribe")
        
        GPT_4O_MINI_TRANSCRIBE_2025_12_15("gpt-4o-mini-transcribe-2025-12-15")
        
        GPT_4O_TRANSCRIBE("gpt-4o-transcribe")
        
        GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize")
        
        GPT_REALTIME_WHISPER("gpt-realtime-whisper")
        
        Optional<String> prompt
        
        An optional text to guide the model's style or continue a previous audio segment. For whisper-1, the prompt is a list of keywords. For gpt-4o-transcribe models (excluding gpt-4o-transcribe-diarize), the prompt is a free text string, for example "expect words related to technology". Prompt is not supported with gpt-realtime-whisper in GA Realtime sessions.
        
        Optional<RealtimeAudioInputTurnDetection> turnDetection
        
        Configuration for turn detection, either Server VAD or Semantic VAD. This can be set to null to turn off, in which case the client must manually trigger a model response.
        
        Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.
        
        Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.
        
        For gpt-realtime-whisper transcription sessions, turn detection must be set to null; VAD is not supported.
        
        ServerVad
        
        JsonValue; type "server_vad"constant
        
        Type of turn detection, server_vad to turn on simple Server VAD.
        
        SERVER_VAD("server_vad")
        
        Optional<Boolean> createResponse
        
        Whether or not to automatically generate a response when a VAD stop event occurs. If interrupt_response is set to false this may fail to create a response if the model is already responding.
        
        If both create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted.
        
        Optional<Long> idleTimeoutMs
        
        Optional timeout after which a model response will be triggered automatically. This is useful for situations in which a long pause from the user is unexpected, such as a phone call. The model will effectively prompt the user to continue the conversation based on the current context.
        
        The timeout value will be applied after the last model response's audio has finished playing, i.e. it's set to the response.done time plus audio playback duration.
        
        An input_audio_buffer.timeout_triggered event (plus events associated with the Response) will be emitted when the timeout is reached. Idle timeout is currently only supported for server_vad mode.
        
        Optional<Boolean> interruptResponse
        
        Whether or not to automatically interrupt (cancel) any ongoing response with output to the default conversation (i.e. conversation of auto) when a VAD start event occurs. If true then the response will be cancelled, otherwise it will continue until complete.
        
        If both create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted.
        
        Optional<Long> prefixPaddingMs
        
        Used only for server_vad mode. Amount of audio to include before the VAD detected speech (in milliseconds). Defaults to 300ms.
        
        Optional<Long> silenceDurationMs
        
        Used only for server_vad mode. Duration of silence to detect speech stop (in milliseconds). Defaults to 500ms. With shorter values the model will respond more quickly, but may jump in on short pauses from the user.
        
        Optional<Double> threshold
        
        Used only for server_vad mode. Activation threshold for VAD (0.0 to 1.0). Defaults to 0.5. A higher threshold will require louder audio to activate the model, and thus might perform better in noisy environments.
        
        SemanticVad
        
        JsonValue; type "semantic_vad"constant
        
        Type of turn detection, semantic_vad to turn on Semantic VAD.
        
        SEMANTIC_VAD("semantic_vad")
        
        Optional<Boolean> createResponse
        
        Whether or not to automatically generate a response when a VAD stop event occurs.
        
        Optional<Eagerness> eagerness
        
        Used only for semantic_vad mode. The eagerness of the model to respond. low will wait longer for the user to continue speaking, high will respond more quickly. auto is the default and is equivalent to medium. low, medium, and high have max timeouts of 8s, 4s, and 2s respectively.
        
        LOW("low")
        
        MEDIUM("medium")
        
        HIGH("high")
        
        AUTO("auto")
        
        Optional<Boolean> interruptResponse
        
        Whether or not to automatically interrupt any ongoing response with output to the default conversation (i.e. conversation of auto) when a VAD start event occurs.
      - Optional<RealtimeAudioConfigOutput> output
        
        Optional<RealtimeAudioFormats> format
        
        The format of the output audio.
        
        Optional<Double> speed
        
        The speed of the model's spoken response as a multiple of the original speed. 1.0 is the default speed. 0.25 is the minimum speed. 1.5 is the maximum speed. This value can only be changed in between model turns, not while a response is in progress.
        
        This parameter is a post-processing adjustment applied to the audio after it is generated; it is also possible to prompt the model to speak faster or slower.
        
        Optional<Voice> voice
        
        The voice the model uses to respond. Supported built-in voices are alloy, ash, ballad, coral, echo, sage, shimmer, verse, marin, and cedar. You may also provide a custom voice object with an id, for example { "id": "voice_1234" }. Voice cannot be changed during the session once the model has responded with audio at least once. We recommend marin and cedar for best quality.
        
        String
        
        enum UnionMember1:
        
        ALLOY("alloy")
        
        ASH("ash")
        
        BALLAD("ballad")
        
        CORAL("coral")
        
        ECHO("echo")
        
        SAGE("sage")
        
        SHIMMER("shimmer")
        
        VERSE("verse")
        
        MARIN("marin")
        
        CEDAR("cedar")
        
        class Id:
        
        Custom voice reference.
        
        String id
        
        The custom voice ID, e.g. voice_1234.
    - Optional<List<Include>> include
      
      Additional fields to include in server outputs.
      
      item.input_audio_transcription.logprobs: Include logprobs for input audio transcription.
      - ITEM_INPUT_AUDIO_TRANSCRIPTION_LOGPROBS("item.input_audio_transcription.logprobs")
    - Optional<String> instructions
      
      The default system instructions (i.e. system message) prepended to model calls. This field allows the client to guide the model on desired responses. The model can be instructed on response content and format, (e.g. "be extremely succinct", "act friendly", "here are examples of good responses") and on audio behavior (e.g. "talk quickly", "inject emotion into your voice", "laugh frequently"). The instructions are not guaranteed to be followed by the model, but they provide guidance to the model on the desired behavior.
      
      Note that the server sets default instructions which will be used if this field is not set and are visible in the session.created event at the start of the session.
    - Optional<MaxOutputTokens> maxOutputTokens
      
      Maximum number of output tokens for a single assistant response, inclusive of tool calls. Provide an integer between 1 and 4096 to limit output tokens, or inf for the maximum available tokens for a given model. Defaults to inf.
      - long
      - JsonValue;
        
        INF("inf")
    - Optional<Model> model
      
      The Realtime model used for this session.
      - GPT_REALTIME("gpt-realtime")
      - GPT_REALTIME_1_5("gpt-realtime-1.5")
      - GPT_REALTIME_2("gpt-realtime-2")
      - GPT_REALTIME_2025_08_28("gpt-realtime-2025-08-28")
      - GPT_4O_REALTIME_PREVIEW("gpt-4o-realtime-preview")
      - GPT_4O_REALTIME_PREVIEW_2024_10_01("gpt-4o-realtime-preview-2024-10-01")
      - GPT_4O_REALTIME_PREVIEW_2024_12_17("gpt-4o-realtime-preview-2024-12-17")
      - GPT_4O_REALTIME_PREVIEW_2025_06_03("gpt-4o-realtime-preview-2025-06-03")
      - GPT_4O_MINI_REALTIME_PREVIEW("gpt-4o-mini-realtime-preview")
      - GPT_4O_MINI_REALTIME_PREVIEW_2024_12_17("gpt-4o-mini-realtime-preview-2024-12-17")
      - GPT_REALTIME_MINI("gpt-realtime-mini")
      - GPT_REALTIME_MINI_2025_10_06("gpt-realtime-mini-2025-10-06")
      - GPT_REALTIME_MINI_2025_12_15("gpt-realtime-mini-2025-12-15")
      - GPT_AUDIO_1_5("gpt-audio-1.5")
      - GPT_AUDIO_MINI("gpt-audio-mini")
      - GPT_AUDIO_MINI_2025_10_06("gpt-audio-mini-2025-10-06")
      - GPT_AUDIO_MINI_2025_12_15("gpt-audio-mini-2025-12-15")
    - Optional<List<OutputModality>> outputModalities
      
      The set of modalities the model can respond with. It defaults to ["audio"], indicating that the model will respond with audio plus a transcript. ["text"] can be used to make the model respond with text only. It is not possible to request both text and audio at the same time.
      - TEXT("text")
      - AUDIO("audio")
    - Optional<Boolean> parallelToolCalls
      
      Whether the model may call multiple tools in parallel. Only supported by reasoning Realtime models such as gpt-realtime-2.
    - Optional<ResponsePrompt> prompt
      
      Reference to a prompt template and its variables.
      - String id
        
        The unique identifier of the prompt template to use.
      - Optional<Variables> variables
        
        Optional map of values to substitute in for variables in your prompt. The substitution values can either be strings, or other Response input types like images or files.
        
        String
        
        class ResponseInputText:
        
        A text input to the model.
        
        String text
        
        The text input to the model.
        
        JsonValue; type "input_text"constant
        
        The type of the input item. Always input_text.
        
        INPUT_TEXT("input_text")
        
        class ResponseInputImage:
        
        An image input to the model.
        
        Detail detail
        
        The detail level of the image to be sent to the model. One of high, low, auto, or original. Defaults to auto.
        
        LOW("low")
        
        HIGH("high")
        
        AUTO("auto")
        
        ORIGINAL("original")
        
        JsonValue; type "input_image"constant
        
        The type of the input item. Always input_image.
        
        INPUT_IMAGE("input_image")
        
        Optional<String> fileId
        
        The ID of the file to be sent to the model.
        
        Optional<String> imageUrl
        
        The URL of the image to be sent to the model. A fully qualified URL or base64 encoded image in a data URL.
        
        class ResponseInputFile:
        
        A file input to the model.
        
        JsonValue; type "input_file"constant
        
        The type of the input item. Always input_file.
        
        INPUT_FILE("input_file")
        
        Optional<Detail> detail
        
        The detail level of the file to be sent to the model. Use low for the default rendering behavior, or high to render the file at higher quality. Defaults to low.
        
        LOW("low")
        
        HIGH("high")
        
        Optional<String> fileData
        
        The content of the file to be sent to the model.
        
        Optional<String> fileId
        
        The ID of the file to be sent to the model.
        
        Optional<String> fileUrl
        
        The URL of the file to be sent to the model.
        
        Optional<String> filename
        
        The name of the file to be sent to the model.
      - Optional<String> version
        
        Optional version of the prompt template.
    - Optional<RealtimeReasoning> reasoning
      
      Configuration for reasoning-capable Realtime models such as gpt-realtime-2.
      - Optional<RealtimeReasoningEffort> effort
        
        Constrains effort on reasoning for reasoning-capable Realtime models such as gpt-realtime-2.
        
        MINIMAL("minimal")
        
        LOW("low")
        
        MEDIUM("medium")
        
        HIGH("high")
        
        XHIGH("xhigh")
    - Optional<RealtimeToolChoiceConfig> toolChoice
      
      How the model chooses tools. Provide one of the string modes or force a specific function/MCP tool.
      - enum ToolChoiceOptions:
        
        Controls which (if any) tool is called by the model.
        
        none means the model will not call any tool and instead generates a message.
        
        auto means the model can pick between generating a message or calling one or more tools.
        
        required means the model must call one or more tools.
        
        NONE("none")
        
        AUTO("auto")
        
        REQUIRED("required")
      - class ToolChoiceFunction:
        
        Use this option to force the model to call a specific function.
        
        String name
        
        The name of the function to call.
        
        JsonValue; type "function"constant
        
        For function calling, the type is always function.
        
        FUNCTION("function")
      - class ToolChoiceMcp:
        
        Use this option to force the model to call a specific tool on a remote MCP server.
        
        String serverLabel
        
        The label of the MCP server to use.
        
        JsonValue; type "mcp"constant
        
        For MCP tools, the type is always mcp.
        
        MCP("mcp")
        
        Optional<String> name
        
        The name of the tool to call on the server.
    - Optional<List<RealtimeToolsConfigUnion>> tools
      
      Tools available to the model.
      - class RealtimeFunctionTool:
        
        Optional<String> description
        
        The description of the function, including guidance on when and how to call it, and guidance about what to tell the user when calling (if anything).
        
        Optional<String> name
        
        The name of the function.
        
        Optional<JsonValue> parameters
        
        Parameters of the function in JSON Schema.
        
        Optional<Type> type
        
        The type of the tool, i.e. function.
        
        FUNCTION("function")
      - Mcp
        
        String serverLabel
        
        A label for this MCP server, used to identify it in tool calls.
        
        JsonValue; type "mcp"constant
        
        The type of the MCP tool. Always mcp.
        
        MCP("mcp")
        
        Optional<AllowedTools> allowedTools
        
        List of allowed tool names or a filter object.
        
        List<String>
        
        class McpToolFilter:
        
        A filter object to specify which tools are allowed.
        
        Optional<Boolean> readOnly
        
        Indicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with readOnlyHint, it will match this filter.
        
        Optional<List<String>> toolNames
        
        List of allowed tool names.
        
        Optional<String> authorization
        
        An OAuth access token that can be used with a remote MCP server, either with a custom MCP server URL or a service connector. Your application must handle the OAuth authorization flow and provide the token here.
        
        Optional<ConnectorId> connectorId
        
        Identifier for service connectors, like those available in ChatGPT. One of server_url or connector_id must be provided.
        
        Currently supported connector_id values are:
        
        Dropbox: connector_dropbox
        
        Gmail: connector_gmail
        
        Google Calendar: connector_googlecalendar
        
        Google Drive: connector_googledrive
        
        Microsoft Teams: connector_microsoftteams
        
        Outlook Calendar: connector_outlookcalendar
        
        Outlook Email: connector_outlookemail
        
        SharePoint: connector_sharepoint
        
        CONNECTOR_DROPBOX("connector_dropbox")
        
        CONNECTOR_GMAIL("connector_gmail")
        
        CONNECTOR_GOOGLECALENDAR("connector_googlecalendar")
        
        CONNECTOR_GOOGLEDRIVE("connector_googledrive")
        
        CONNECTOR_MICROSOFTTEAMS("connector_microsoftteams")
        
        CONNECTOR_OUTLOOKCALENDAR("connector_outlookcalendar")
        
        CONNECTOR_OUTLOOKEMAIL("connector_outlookemail")
        
        CONNECTOR_SHAREPOINT("connector_sharepoint")
        
        Optional<Boolean> deferLoading
        
        Whether this MCP tool is deferred and discovered via tool search.
        
        Optional<Headers> headers
        
        Optional HTTP headers to send to the MCP server. Use for authentication or other purposes.
        
        Optional<RequireApproval> requireApproval
        
        Specify which of the MCP server's tools require approval.
        
        class McpToolApprovalFilter:
        
        Specify which of the MCP server's tools require approval. Can be always, never, or a filter object associated with tools that require approval.
        
        Optional<Always> always
        
        A filter object to specify which tools are allowed.
        
        Optional<Boolean> readOnly
        
        Indicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with readOnlyHint, it will match this filter.
        
        Optional<List<String>> toolNames
        
        List of allowed tool names.
        
        Optional<Never> never
        
        A filter object to specify which tools are allowed.
        
        Optional<Boolean> readOnly
        
        Indicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with readOnlyHint, it will match this filter.
        
        Optional<List<String>> toolNames
        
        List of allowed tool names.
        
        enum McpToolApprovalSetting:
        
        Specify a single approval policy for all tools. One of always or never. When set to always, all tools will require approval. When set to never, no tools will require approval.
        
        ALWAYS("always")
        
        NEVER("never")
        
        Optional<String> serverDescription
        
        Optional description of the MCP server, used to provide more context.
        
        Optional<String> serverUrl
        
        The URL for the MCP server. One of server_url or connector_id must be provided.
    - Optional<RealtimeTracingConfig> tracing
      
      Realtime API can write session traces to the Traces Dashboard. Set to null to disable tracing. Once tracing is enabled for a session, the configuration cannot be modified.
      
      auto will create a trace for the session with default values for the workflow name, group id, and metadata.
      - JsonValue;
        
        AUTO("auto")
      - TracingConfiguration
        
        Optional<String> groupId
        
        The group id to attach to this trace to enable filtering and grouping in the Traces Dashboard.
        
        Optional<JsonValue> metadata
        
        The arbitrary metadata to attach to this trace to enable filtering in the Traces Dashboard.
        
        Optional<String> workflowName
        
        The name of the workflow to attach to this trace. This is used to name the trace in the Traces Dashboard.
    - Optional<RealtimeTruncation> truncation
      
      When the number of tokens in a conversation exceeds the model's input token limit, the conversation will be truncated, meaning messages (starting from the oldest) will not be included in the model's context. A 32k context model with 4,096 max output tokens can only include 28,224 tokens in the context before truncation occurs.
      
      Clients can configure truncation behavior to truncate with a lower max token limit, which is an effective way to control token usage and cost.
      
      Truncation will reduce the number of cached tokens on the next turn (busting the cache), since messages are dropped from the beginning of the context. However, clients can also configure truncation to retain messages up to a fraction of the maximum context size, which will reduce the need for future truncations and thus improve the cache rate.
      
      Truncation can be disabled entirely, which means the server will never truncate but would instead return an error if the conversation exceeds the model's input token limit.
      - RealtimeTruncationStrategy
        
        AUTO("auto")
        
        DISABLED("disabled")
      - class RealtimeTruncationRetentionRatio:
        
        Retain a fraction of the conversation tokens when the conversation exceeds the input token limit. This allows you to amortize truncations across multiple turns, which can help improve cached token usage.
        
        double retentionRatio
        
        Fraction of post-instruction conversation tokens to retain (0.0 - 1.0) when the conversation exceeds the input token limit. Setting this to 0.8 means that messages will be dropped until 80% of the maximum allowed tokens are used. This helps reduce the frequency of truncations and improve cache rates.
        
        JsonValue; type "retention_ratio"constant
        
        Use retention ratio truncation.
        
        RETENTION_RATIO("retention_ratio")
        
        Optional<TokenLimits> tokenLimits
        
        Optional custom token limits for this truncation strategy. If not provided, the model's default token limits will be used.
        
        Optional<Long> postInstructions
        
        Maximum tokens allowed in the conversation after instructions (which include tool definitions). For example, setting this to 5,000 would mean that truncation would occur when the conversation exceeds 5,000 tokens after instructions. This cannot be higher than the model's context window size minus the maximum output tokens.
  - class RealtimeTranscriptionSessionCreateRequest:
    
    Realtime transcription session object configuration.
    - JsonValue; type "transcription"constant
      
      The type of session to create. Always transcription for transcription sessions.
      - TRANSCRIPTION("transcription")
    - Optional<RealtimeTranscriptionSessionAudio> audio
      
      Configuration for input and output audio.
      - Optional<RealtimeTranscriptionSessionAudioInput> input
        
        Optional<RealtimeAudioFormats> format
        
        The PCM audio format. Only a 24kHz sample rate is supported.
        
        Optional<NoiseReduction> noiseReduction
        
        Configuration for input audio noise reduction. This can be set to null to turn off. Noise reduction filters audio added to the input audio buffer before it is sent to VAD and the model. Filtering the audio can improve VAD and turn detection accuracy (reducing false positives) and model performance by improving perception of the input audio.
        
        Optional<NoiseReductionType> type
        
        Type of noise reduction. near_field is for close-talking microphones such as headphones, far_field is for far-field microphones such as laptop or conference room microphones.
        
        Optional<AudioTranscription> transcription
        
        Configuration for input audio transcription. Defaults to off, and can be set to null to turn it off once enabled. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through the /audio/transcriptions endpoint and should be treated as guidance on the input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service.
        
        Optional<RealtimeTranscriptionSessionAudioInputTurnDetection> turnDetection
        
        Configuration for turn detection, either Server VAD or Semantic VAD. This can be set to null to turn off, in which case the client must manually trigger a model response.
        
        Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.
        
        Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.
        
        For gpt-realtime-whisper transcription sessions, turn detection must be set to null; VAD is not supported.
        
        ServerVad
        
        JsonValue; type "server_vad"constant
        
        Type of turn detection, server_vad to turn on simple Server VAD.
        
        SERVER_VAD("server_vad")
        
        Optional<Boolean> createResponse
        
        Whether or not to automatically generate a response when a VAD stop event occurs. If interrupt_response is set to false this may fail to create a response if the model is already responding.
        
        If both create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted.
        
        Optional<Long> idleTimeoutMs
        
        Optional timeout after which a model response will be triggered automatically. This is useful for situations in which a long pause from the user is unexpected, such as a phone call. The model will effectively prompt the user to continue the conversation based on the current context.
        
        The timeout value will be applied after the last model response's audio has finished playing, i.e. it's set to the response.done time plus audio playback duration.
        
        An input_audio_buffer.timeout_triggered event (plus events associated with the Response) will be emitted when the timeout is reached. Idle timeout is currently only supported for server_vad mode.
        
        Optional<Boolean> interruptResponse
        
        Whether or not to automatically interrupt (cancel) any ongoing response with output to the default conversation (i.e. conversation of auto) when a VAD start event occurs. If true then the response will be cancelled, otherwise it will continue until complete.
        
        If both create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted.
        
        Optional<Long> prefixPaddingMs
        
        Used only for server_vad mode. Amount of audio to include before the VAD detected speech (in milliseconds). Defaults to 300ms.
        
        Optional<Long> silenceDurationMs
        
        Used only for server_vad mode. Duration of silence to detect speech stop (in milliseconds). Defaults to 500ms. With shorter values the model will respond more quickly, but may jump in on short pauses from the user.
        
        Optional<Double> threshold
        
        Used only for server_vad mode. Activation threshold for VAD (0.0 to 1.0). Defaults to 0.5. A higher threshold will require louder audio to activate the model, and thus might perform better in noisy environments.
        
        SemanticVad
        
        JsonValue; type "semantic_vad"constant
        
        Type of turn detection, semantic_vad to turn on Semantic VAD.
        
        SEMANTIC_VAD("semantic_vad")
        
        Optional<Boolean> createResponse
        
        Whether or not to automatically generate a response when a VAD stop event occurs.
        
        Optional<Eagerness> eagerness
        
        Used only for semantic_vad mode. The eagerness of the model to respond. low will wait longer for the user to continue speaking, high will respond more quickly. auto is the default and is equivalent to medium. low, medium, and high have max timeouts of 8s, 4s, and 2s respectively.
        
        LOW("low")
        
        MEDIUM("medium")
        
        HIGH("high")
        
        AUTO("auto")
        
        Optional<Boolean> interruptResponse
        
        Whether or not to automatically interrupt any ongoing response with output to the default conversation (i.e. conversation of auto) when a VAD start event occurs.
    - Optional<List<Include>> include
      
      Additional fields to include in server outputs.
      
      item.input_audio_transcription.logprobs: Include logprobs for input audio transcription.
      - ITEM_INPUT_AUDIO_TRANSCRIPTION_LOGPROBS("item.input_audio_transcription.logprobs")
- JsonValue; type "session.updated" constant
  
  The event type, must be session.updated.
  - SESSION_UPDATED("session.updated")
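
The semantic_vad eagerness setting documented for Realtime sessions maps each level to a maximum timeout (8s for low, 4s for medium, 2s for high, with auto equivalent to medium). As a rough illustration only, that mapping can be written as a lookup. The class and method names below are hypothetical and are not part of the SDK.

```java
import java.util.Locale;

/** Hypothetical helper, not part of the SDK: documented max timeouts per eagerness level. */
final class SemanticVadTimeouts {

    /** Returns the documented maximum timeout, in seconds, for a semantic_vad eagerness value. */
    static int maxTimeoutSeconds(String eagerness) {
        switch (eagerness.toLowerCase(Locale.ROOT)) {
            case "low":
                return 8;
            case "medium":
            case "auto": // auto is documented as equivalent to medium
                return 4;
            case "high":
                return 2;
            default:
                throw new IllegalArgumentException("Unknown eagerness: " + eagerness);
        }
    }

    public static void main(String[] args) {
        for (String level : new String[] {"low", "medium", "high", "auto"}) {
            System.out.println(level + " -> " + maxTimeoutSeconds(level) + "s");
        }
    }
}
```

These are upper bounds as documented; the actual wait is set dynamically by the turn detection model.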

Transcription Session Update

class TranscriptionSessionUpdate:

Send this event to update a transcription session.
- Session session
  
  Realtime transcription session object configuration.
  - Optional<List<Include>> include
    
    The set of items to include in the transcription. Current available items are: item.input_audio_transcription.logprobs
    - ITEM_INPUT_AUDIO_TRANSCRIPTION_LOGPROBS("item.input_audio_transcription.logprobs")
  - Optional<InputAudioFormat> inputAudioFormat
    
    The format of input audio. Options are pcm16, g711_ulaw, or g711_alaw. For pcm16, input audio must be 16-bit PCM at a 24kHz sample rate, single channel (mono), and little-endian byte order.
    - PCM16("pcm16")
    - G711_ULAW("g711_ulaw")
    - G711_ALAW("g711_alaw")
  - Optional<InputAudioNoiseReduction> inputAudioNoiseReduction
    
    Configuration for input audio noise reduction. This can be set to null to turn off. Noise reduction filters audio added to the input audio buffer before it is sent to VAD and the model. Filtering the audio can improve VAD and turn detection accuracy (reducing false positives) and model performance by improving perception of the input audio.
    - Optional<NoiseReductionType> type
      
      Type of noise reduction. near_field is for close-talking microphones such as headphones, far_field is for far-field microphones such as laptop or conference room microphones.
      - NEAR_FIELD("near_field")
      - FAR_FIELD("far_field")
  - Optional<AudioTranscription> inputAudioTranscription
    
    Configuration for input audio transcription. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service.
    - Optional<Delay> delay
      
      Controls how long the model waits before emitting transcription text. Higher values can improve transcription accuracy at the cost of latency. Only supported with gpt-realtime-whisper in GA Realtime sessions.
      - MINIMAL("minimal")
      - LOW("low")
      - MEDIUM("medium")
      - HIGH("high")
      - XHIGH("xhigh")
    - Optional<String> language
      
      The language of the input audio. Supplying the input language in ISO-639-1 (e.g. en) format will improve accuracy and latency.
    - Optional<Model> model
      
      The model to use for transcription. Current options are whisper-1, gpt-4o-mini-transcribe, gpt-4o-mini-transcribe-2025-12-15, gpt-4o-transcribe, gpt-4o-transcribe-diarize, and gpt-realtime-whisper. Use gpt-4o-transcribe-diarize when you need diarization with speaker labels.
      - WHISPER_1("whisper-1")
      - GPT_4O_MINI_TRANSCRIBE("gpt-4o-mini-transcribe")
      - GPT_4O_MINI_TRANSCRIBE_2025_12_15("gpt-4o-mini-transcribe-2025-12-15")
      - GPT_4O_TRANSCRIBE("gpt-4o-transcribe")
      - GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize")
      - GPT_REALTIME_WHISPER("gpt-realtime-whisper")
    - Optional<String> prompt
      
      An optional text to guide the model's style or continue a previous audio segment. For whisper-1, the prompt is a list of keywords. For gpt-4o-transcribe models (excluding gpt-4o-transcribe-diarize), the prompt is a free text string, for example "expect words related to technology". Prompt is not supported with gpt-realtime-whisper in GA Realtime sessions.
  - Optional<TurnDetection> turnDetection
    
    Configuration for turn detection. Can be set to null to turn off. Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.
    - Optional<Long> prefixPaddingMs
      
      Amount of audio to include before the VAD detected speech (in milliseconds). Defaults to 300ms.
    - Optional<Long> silenceDurationMs
      
      Duration of silence to detect speech stop (in milliseconds). Defaults to 500ms. With shorter values the model will respond more quickly, but may jump in on short pauses from the user.
    - Optional<Double> threshold
      
      Activation threshold for VAD, from 0.0 to 1.0. Defaults to 0.5. A higher threshold requires louder audio to activate the model, and so might perform better in noisy environments.
    - Optional<Type> type
      
      Type of turn detection. Only server_vad is currently supported for transcription sessions.
      - SERVER_VAD("server_vad")
- JsonValue; type "transcription_session.update" constant
  
  The event type, must be transcription_session.update.
  - TRANSCRIPTION_SESSION_UPDATE("transcription_session.update")
- Optional<String> eventId
  
  Optional client-generated ID used to identify this event.
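
To make the shape of this event concrete, the sketch below assembles a transcription_session.update payload as a JSON string using only the standard library. It deliberately avoids the SDK builders. The snake_case wire names (input_audio_format, input_audio_transcription, turn_detection, prefix_padding_ms, silence_duration_ms) are assumptions inferred from the Java field names above, and the class name is hypothetical.

```java
/** Hypothetical sketch: builds a transcription_session.update payload by hand. */
final class TranscriptionSessionUpdateJson {

    /**
     * Builds the event JSON. Values are assumed to need no JSON escaping
     * (plain model names and ISO-639-1 language codes).
     */
    static String build(String model, String language, double threshold,
                        long prefixPaddingMs, long silenceDurationMs) {
        // The documented VAD activation threshold range is 0.0 to 1.0 (also rejects NaN).
        if (!(threshold >= 0.0 && threshold <= 1.0)) {
            throw new IllegalArgumentException("threshold must be within [0.0, 1.0]");
        }
        return "{"
            + "\"type\":\"transcription_session.update\","
            + "\"session\":{"
            + "\"input_audio_format\":\"pcm16\","
            + "\"input_audio_transcription\":{"
            + "\"model\":\"" + model + "\","
            + "\"language\":\"" + language + "\"},"
            + "\"turn_detection\":{"
            + "\"type\":\"server_vad\","
            + "\"threshold\":" + threshold + ","
            + "\"prefix_padding_ms\":" + prefixPaddingMs + ","
            + "\"silence_duration_ms\":" + silenceDurationMs + "}"
            + "}}";
    }

    public static void main(String[] args) {
        // Uses the documented defaults: threshold 0.5, 300 ms padding, 500 ms silence.
        System.out.println(build("gpt-4o-transcribe", "en", 0.5, 300, 500));
    }
}
```

In real code a JSON library or the SDK's own request types would be a safer choice than string concatenation; the point here is only the nesting of session, input audio transcription, and turn detection.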

Transcription Session Updated Event

class TranscriptionSessionUpdatedEvent:

Returned when a transcription session is updated with a transcription_session.update event, unless there is an error.
- String eventId
  
  The unique ID of the server event.
- Session session
  
  A new Realtime transcription session configuration.
  
  When a session is created on the server via REST API, the session object also contains an ephemeral key. Default TTL for keys is 10 minutes. This property is not present when a session is updated via the WebSocket API.
  - ClientSecret clientSecret
    
    Ephemeral key returned by the API. Only present when the session is created on the server via REST API.
    - long expiresAt
      
      Timestamp for when the token expires. Currently, all tokens expire after one minute.
    - String value
      
      Ephemeral key usable in client environments to authenticate connections to the Realtime API. Use this in client-side environments rather than a standard API token, which should only be used server-side.
  - Optional<String> inputAudioFormat
    
    The format of input audio. Options are pcm16, g711_ulaw, or g711_alaw.
  - Optional<AudioTranscription> inputAudioTranscription
    - Optional<Delay> delay
      
      Controls how long the model waits before emitting transcription text. Higher values can improve transcription accuracy at the cost of latency. Only supported with gpt-realtime-whisper in GA Realtime sessions.
      - MINIMAL("minimal")
      - LOW("low")
      - MEDIUM("medium")
      - HIGH("high")
      - XHIGH("xhigh")
    - Optional<String> language
      
      The language of the input audio. Supplying the input language in ISO-639-1 (e.g. en) format will improve accuracy and latency.
    - Optional<Model> model
      
      The model to use for transcription. Current options are whisper-1, gpt-4o-mini-transcribe, gpt-4o-mini-transcribe-2025-12-15, gpt-4o-transcribe, gpt-4o-transcribe-diarize, and gpt-realtime-whisper. Use gpt-4o-transcribe-diarize when you need diarization with speaker labels.
      - WHISPER_1("whisper-1")
      - GPT_4O_MINI_TRANSCRIBE("gpt-4o-mini-transcribe")
      - GPT_4O_MINI_TRANSCRIBE_2025_12_15("gpt-4o-mini-transcribe-2025-12-15")
      - GPT_4O_TRANSCRIBE("gpt-4o-transcribe")
      - GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize")
      - GPT_REALTIME_WHISPER("gpt-realtime-whisper")
    - Optional<String> prompt
      
      An optional text to guide the model's style or continue a previous audio segment. For whisper-1, the prompt is a list of keywords. For gpt-4o-transcribe models (excluding gpt-4o-transcribe-diarize), the prompt is a free text string, for example "expect words related to technology". Prompt is not supported with gpt-realtime-whisper in GA Realtime sessions.
  - Optional<List<Modality>> modalities
    
    The set of modalities the model can respond with. To disable audio, set this to ["text"].
    - TEXT("text")
    - AUDIO("audio")
  - Optional<TurnDetection> turnDetection
    
    Configuration for turn detection. Can be set to null to turn off. Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.
    - Optional<Long> prefixPaddingMs
      
      Amount of audio to include before the VAD detected speech (in milliseconds). Defaults to 300ms.
    - Optional<Long> silenceDurationMs
      
      Duration of silence to detect speech stop (in milliseconds). Defaults to 500ms. With shorter values the model will respond more quickly, but may jump in on short pauses from the user.
    - Optional<Double> threshold
      
      Activation threshold for VAD, from 0.0 to 1.0. Defaults to 0.5. A higher threshold requires louder audio to activate the model, and so might perform better in noisy environments.
    - Optional<String> type
      
      Type of turn detection, only server_vad is currently supported.
- JsonValue; type "transcription_session.updated" constant
  
  The event type, must be transcription_session.updated.
  - TRANSCRIPTION_SESSION_UPDATED("transcription_session.updated")
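
When a transcription session reports pcm16 as its input format, the audio a client sends must match the layout documented for that format: 16-bit samples, 24kHz sample rate, mono, little-endian byte order. The following is a minimal sketch of packing samples that way and estimating the duration of a chunk. The class and method names are hypothetical and not part of the SDK.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper: packs mono 16-bit samples as little-endian PCM at 24 kHz. */
final class Pcm16Encoding {

    /** Sample rate documented for the pcm16 input format. */
    static final int SAMPLE_RATE_HZ = 24_000;

    /** Converts signed 16-bit mono samples to little-endian bytes (2 bytes per sample). */
    static byte[] toLittleEndianBytes(short[] samples) {
        ByteBuffer buffer = ByteBuffer.allocate(samples.length * 2).order(ByteOrder.LITTLE_ENDIAN);
        for (short sample : samples) {
            buffer.putShort(sample);
        }
        return buffer.array();
    }

    /** Duration in milliseconds of a mono pcm16 chunk; a trailing odd byte is ignored. */
    static double durationMs(int byteCount) {
        int sampleCount = byteCount / 2;
        return sampleCount * 1000.0 / SAMPLE_RATE_HZ;
    }

    public static void main(String[] args) {
        byte[] chunk = toLittleEndianBytes(new short[SAMPLE_RATE_HZ]); // one second of silence
        System.out.println(chunk.length + " bytes, " + durationMs(chunk.length) + " ms");
    }
}
```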

Client Secrets

Create client secret

ClientSecretCreateResponse realtime().clientSecrets().create(ClientSecretCreateParams params = ClientSecretCreateParams.none(), RequestOptions requestOptions = RequestOptions.none())

post /realtime/client_secrets

Create a Realtime client secret with an associated session configuration.

Client secrets are short-lived tokens that can be passed to a client app, such as a web frontend or mobile client, to grant access to the Realtime API without leaking your main API key. You can configure a custom TTL for each client secret.

You can also attach session configuration options to the client secret. These are applied to any session created with that client secret, but the client connection can override them.

Learn more about authentication with client secrets over WebRTC.

Returns the created client secret and the effective session object. The client secret is a string that looks like ek_1234.
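
The custom TTL is controlled by the expiresAfter parameter listed under Parameters: a number of seconds (between 10 and 7200, defaulting to 600) is added to the secret's created_at time. The sketch below illustrates that arithmetic only. It assumes created_at is a Unix timestamp in seconds, which this reference does not state, and the class name is hypothetical.

```java
/** Hypothetical helper: computes when a client secret expires from created_at plus a TTL. */
final class ClientSecretExpiry {

    static final long DEFAULT_SECONDS = 600;  // documented default (10 minutes)
    static final long MIN_SECONDS = 10;       // documented minimum
    static final long MAX_SECONDS = 7200;     // documented maximum (2 hours)

    /**
     * Returns the expiration timestamp. createdAtEpochSeconds is ASSUMED to be Unix
     * seconds; pass null for seconds to use the documented default.
     */
    static long expiresAt(long createdAtEpochSeconds, Long seconds) {
        long ttl = (seconds == null) ? DEFAULT_SECONDS : seconds;
        if (ttl < MIN_SECONDS || ttl > MAX_SECONDS) {
            throw new IllegalArgumentException(
                "seconds must be between " + MIN_SECONDS + " and " + MAX_SECONDS);
        }
        return createdAtEpochSeconds + ttl;
    }

    public static void main(String[] args) {
        long createdAt = 1_700_000_000L; // example value only
        System.out.println("default TTL expires at " + expiresAt(createdAt, null));
        System.out.println("2 hour TTL expires at " + expiresAt(createdAt, 7200L));
    }
}
```

Expiration only limits when the secret can be used to create new sessions; as noted under expiresAfter, a session that has already started may continue past that time.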

Parameters

ClientSecretCreateParams params
- Optional<ExpiresAfter> expiresAfter
  
  Configuration for the client secret expiration. Expiration refers to the time after which a client secret will no longer be valid for creating sessions. The session itself may continue after that time once started. A secret can be used to create multiple sessions until it expires.
  - Optional<Anchor> anchor
    
    The anchor point for the client secret expiration, meaning that seconds will be added to the created_at time of the client secret to produce an expiration timestamp. Only created_at is currently supported.
    - CREATED_AT("created_at")
  - Optional<Long> seconds
    
    The number of seconds from the anchor point to the expiration. Select a value between 10 and 7200 (2 hours). Defaults to 600 seconds (10 minutes) if not specified.
- Optional<Session> session
  
  Session configuration to use for the client secret. Choose either a realtime session or a transcription session.
  - class RealtimeSessionCreateRequest:
    
    Realtime session object configuration.
    - JsonValue; type "realtime" constant
      
      The type of session to create. Always realtime for the Realtime API.
      - REALTIME("realtime")
    - Optional<RealtimeAudioConfig> audio
      
      Configuration for input and output audio.
      - Optional<RealtimeAudioConfigInput> input
        
        Optional<RealtimeAudioFormats> format
        
        The format of the input audio.
        
        AudioPcm
        
        Optional<Rate> rate
        
        The sample rate of the audio. Always 24000.
        
        _24000(24000)
        
        Optional<Type> type
        
        The audio format. Always audio/pcm.
        
        AUDIO_PCM("audio/pcm")
        
        AudioPcmu
        
        Optional<Type> type
        
        The audio format. Always audio/pcmu.
        
        AUDIO_PCMU("audio/pcmu")
        
        AudioPcma
        
        Optional<Type> type
        
        The audio format. Always audio/pcma.
        
        AUDIO_PCMA("audio/pcma")
        
        Optional<NoiseReduction> noiseReduction
        
        Configuration for input audio noise reduction. This can be set to null to turn off. Noise reduction filters audio added to the input audio buffer before it is sent to VAD and the model. Filtering the audio can improve VAD and turn detection accuracy (reducing false positives) and model performance by improving perception of the input audio.
        
        Optional<NoiseReductionType> type
        
        Type of noise reduction. near_field is for close-talking microphones such as headphones, far_field is for far-field microphones such as laptop or conference room microphones.
        
        NEAR_FIELD("near_field")
        
        FAR_FIELD("far_field")
        
        Optional<AudioTranscription> transcription
        
        Configuration for input audio transcription. Defaults to off, and can be set to null to turn it off once enabled. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through the /audio/transcriptions endpoint and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service.
        
        Optional<Delay> delay
        
        Controls how long the model waits before emitting transcription text. Higher values can improve transcription accuracy at the cost of latency. Only supported with gpt-realtime-whisper in GA Realtime sessions.
        
        MINIMAL("minimal")
        
        LOW("low")
        
        MEDIUM("medium")
        
        HIGH("high")
        
        XHIGH("xhigh")
        
        Optional<String> language
        
        The language of the input audio. Supplying the input language in ISO-639-1 (e.g. en) format will improve accuracy and latency.
        
        Optional<Model> model
        
        The model to use for transcription. Current options are whisper-1, gpt-4o-mini-transcribe, gpt-4o-mini-transcribe-2025-12-15, gpt-4o-transcribe, gpt-4o-transcribe-diarize, and gpt-realtime-whisper. Use gpt-4o-transcribe-diarize when you need diarization with speaker labels.
        
        WHISPER_1("whisper-1")
        
        GPT_4O_MINI_TRANSCRIBE("gpt-4o-mini-transcribe")
        
        GPT_4O_MINI_TRANSCRIBE_2025_12_15("gpt-4o-mini-transcribe-2025-12-15")
        
        GPT_4O_TRANSCRIBE("gpt-4o-transcribe")
        
        GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize")
        
        GPT_REALTIME_WHISPER("gpt-realtime-whisper")
        
        Optional<String> prompt
        
        An optional text to guide the model's style or continue a previous audio segment. For whisper-1, the prompt is a list of keywords. For gpt-4o-transcribe models (excluding gpt-4o-transcribe-diarize), the prompt is a free text string, for example "expect words related to technology". Prompt is not supported with gpt-realtime-whisper in GA Realtime sessions.
        
        Optional<RealtimeAudioInputTurnDetection> turnDetection
        
        Configuration for turn detection, either Server VAD or Semantic VAD. This can be set to null to turn off, in which case the client must manually trigger model response.
        
        Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.
        
        Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.
        
        For gpt-realtime-whisper transcription sessions, turn detection must be set to null; VAD is not supported.
        
        ServerVad
        
        JsonValue; type "server_vad" constant
        
        Type of turn detection, server_vad to turn on simple Server VAD.
        
        SERVER_VAD("server_vad")
        
        Optional<Boolean> createResponse
        
        Whether or not to automatically generate a response when a VAD stop event occurs. If interrupt_response is set to false this may fail to create a response if the model is already responding.
        
        If both create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted.
        
        Optional<Long> idleTimeoutMs
        
        Optional timeout after which a model response will be triggered automatically. This is useful for situations in which a long pause from the user is unexpected, such as a phone call. The model will effectively prompt the user to continue the conversation based on the current context.
        
        The timeout value will be applied after the last model response's audio has finished playing, i.e. it's set to the response.done time plus audio playback duration.
        
        An input_audio_buffer.timeout_triggered event (plus events associated with the Response) will be emitted when the timeout is reached. Idle timeout is currently only supported for server_vad mode.
        
        Optional<Boolean> interruptResponse
        
        Whether or not to automatically interrupt (cancel) any ongoing response with output to the default conversation (i.e. conversation of auto) when a VAD start event occurs. If true then the response will be cancelled, otherwise it will continue until complete.
        
        If both create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted.
        
        Optional<Long> prefixPaddingMs
        
        Used only for server_vad mode. Amount of audio to include before the VAD detected speech (in milliseconds). Defaults to 300ms.
        
        Optional<Long> silenceDurationMs
        
        Used only for server_vad mode. Duration of silence to detect speech stop (in milliseconds). Defaults to 500ms. With shorter values the model will respond more quickly, but may jump in on short pauses from the user.
        
        Optional<Double> threshold
        
        Used only for server_vad mode. Activation threshold for VAD, from 0.0 to 1.0. Defaults to 0.5. A higher threshold requires louder audio to activate the model, and so might perform better in noisy environments.
        
        SemanticVad
        
        JsonValue; type "semantic_vad" constant
        
        Type of turn detection, semantic_vad to turn on Semantic VAD.
        
        SEMANTIC_VAD("semantic_vad")
        
        Optional<Boolean> createResponse
        
        Whether or not to automatically generate a response when a VAD stop event occurs.
        
        Optional<Eagerness> eagerness
        
        Used only for semantic_vad mode. The eagerness of the model to respond. low will wait longer for the user to continue speaking, high will respond more quickly. auto is the default and is equivalent to medium. low, medium, and high have max timeouts of 8s, 4s, and 2s respectively.
        
        LOW("low")
        
        MEDIUM("medium")
        
        HIGH("high")
        
        AUTO("auto")
        
        Optional<Boolean> interruptResponse
        
        Whether or not to automatically interrupt any ongoing response with output to the default conversation (i.e. conversation of auto) when a VAD start event occurs.
      - Optional<RealtimeAudioConfigOutput> output
        
        Optional<RealtimeAudioFormats> format
        
        The format of the output audio.
        
        Optional<Double> speed
        
        The speed of the model's spoken response as a multiple of the original speed. 1.0 is the default speed. 0.25 is the minimum speed. 1.5 is the maximum speed. This value can only be changed in between model turns, not while a response is in progress.
        
        This parameter is a post-processing adjustment applied to the audio after it is generated. It is also possible to prompt the model to speak faster or slower.
        
        Optional<Voice> voice
        
        The voice the model uses to respond. Supported built-in voices are alloy, ash, ballad, coral, echo, sage, shimmer, verse, marin, and cedar. You may also provide a custom voice object with an id, for example { "id": "voice_1234" }. Voice cannot be changed during the session once the model has responded with audio at least once. We recommend marin and cedar for best quality.
        
        String
        
        enum UnionMember1:
        
        ALLOY("alloy")
        
        ASH("ash")
        
        BALLAD("ballad")
        
        CORAL("coral")
        
        ECHO("echo")
        
        SAGE("sage")
        
        SHIMMER("shimmer")
        
        VERSE("verse")
        
        MARIN("marin")
        
        CEDAR("cedar")
        
        class Id:
        
        Custom voice reference.
        
        String id
        
        The custom voice ID, e.g. voice_1234.
    - Optional<List<Include>> include
      
      Additional fields to include in server outputs.
      
      item.input_audio_transcription.logprobs: Include logprobs for input audio transcription.
      - ITEM_INPUT_AUDIO_TRANSCRIPTION_LOGPROBS("item.input_audio_transcription.logprobs")
    - Optional<String> instructions
      
      The default system instructions (i.e. system message) prepended to model calls. This field allows the client to guide the model on desired responses. The model can be instructed on response content and format (e.g. "be extremely succinct", "act friendly", "here are examples of good responses") and on audio behavior (e.g. "talk quickly", "inject emotion into your voice", "laugh frequently"). The instructions are not guaranteed to be followed by the model, but they provide guidance to the model on the desired behavior.
      
      Note that the server sets default instructions which will be used if this field is not set and are visible in the session.created event at the start of the session.
    - Optional<MaxOutputTokens> maxOutputTokens
      
      Maximum number of output tokens for a single assistant response, inclusive of tool calls. Provide an integer between 1 and 4096 to limit output tokens, or inf for the maximum available tokens for a given model. Defaults to inf.
      - long
      - JsonValue;
        
        INF("inf")
    - Optional<Model> model
      
      The Realtime model used for this session.
      - GPT_REALTIME("gpt-realtime")
      - GPT_REALTIME_1_5("gpt-realtime-1.5")
      - GPT_REALTIME_2("gpt-realtime-2")
      - GPT_REALTIME_2025_08_28("gpt-realtime-2025-08-28")
      - GPT_4O_REALTIME_PREVIEW("gpt-4o-realtime-preview")
      - GPT_4O_REALTIME_PREVIEW_2024_10_01("gpt-4o-realtime-preview-2024-10-01")
      - GPT_4O_REALTIME_PREVIEW_2024_12_17("gpt-4o-realtime-preview-2024-12-17")
      - GPT_4O_REALTIME_PREVIEW_2025_06_03("gpt-4o-realtime-preview-2025-06-03")
      - GPT_4O_MINI_REALTIME_PREVIEW("gpt-4o-mini-realtime-preview")
      - GPT_4O_MINI_REALTIME_PREVIEW_2024_12_17("gpt-4o-mini-realtime-preview-2024-12-17")
      - GPT_REALTIME_MINI("gpt-realtime-mini")
      - GPT_REALTIME_MINI_2025_10_06("gpt-realtime-mini-2025-10-06")
      - GPT_REALTIME_MINI_2025_12_15("gpt-realtime-mini-2025-12-15")
      - GPT_AUDIO_1_5("gpt-audio-1.5")
      - GPT_AUDIO_MINI("gpt-audio-mini")
      - GPT_AUDIO_MINI_2025_10_06("gpt-audio-mini-2025-10-06")
      - GPT_AUDIO_MINI_2025_12_15("gpt-audio-mini-2025-12-15")
    - Optional<List<OutputModality>> outputModalities
      
      The set of modalities the model can respond with. It defaults to ["audio"], indicating that the model will respond with audio plus a transcript. ["text"] can be used to make the model respond with text only. It is not possible to request both text and audio at the same time.
      - TEXT("text")
      - AUDIO("audio")
    - Optional<Boolean> parallelToolCalls
      
      Whether the model may call multiple tools in parallel. Only supported by reasoning Realtime models such as gpt-realtime-2.
    - Optional<ResponsePrompt> prompt
      
      Reference to a prompt template and its variables. Learn more.
      - String id
        
        The unique identifier of the prompt template to use.
      - Optional<Variables> variables
        
        Optional map of values to substitute in for variables in your prompt. The substitution values can either be strings, or other Response input types like images or files.
        
        String
        
        class ResponseInputText:
        
        A text input to the model.
        
        String text
        
        The text input to the model.
        
        JsonValue; type "input_text" constant
        
        The type of the input item. Always input_text.
        
        INPUT_TEXT("input_text")
        
        class ResponseInputImage:
        
        An image input to the model. Learn about image inputs.
        
        Detail detail
        
        The detail level of the image to be sent to the model. One of high, low, auto, or original. Defaults to auto.
        
        LOW("low")
        
        HIGH("high")
        
        AUTO("auto")
        
        ORIGINAL("original")
        
        JsonValue; type "input_image" constant
        
        The type of the input item. Always input_image.
        
        INPUT_IMAGE("input_image")
        
        Optional<String> fileId
        
        The ID of the file to be sent to the model.
        
        Optional<String> imageUrl
        
        The URL of the image to be sent to the model. A fully qualified URL or base64 encoded image in a data URL.
        
        class ResponseInputFile:
        
        A file input to the model.
        
        JsonValue; type "input_file" constant
        
        The type of the input item. Always input_file.
        
        INPUT_FILE("input_file")
        
        Optional<Detail> detail
        
        The detail level of the file to be sent to the model. Use low for the default rendering behavior, or high to render the file at higher quality. Defaults to low.
        
        LOW("low")
        
        HIGH("high")
        
        Optional<String> fileData
        
        The content of the file to be sent to the model.
        
        Optional<String> fileId
        
        The ID of the file to be sent to the model.
        
        Optional<String> fileUrl
        
        The URL of the file to be sent to the model.
        
        Optional<String> filename
        
        The name of the file to be sent to the model.
      - Optional<String> version
        
        Optional version of the prompt template.
    - Optional<RealtimeReasoning> reasoning
      
      Configuration for reasoning-capable Realtime models such as gpt-realtime-2.
      - Optional<RealtimeReasoningEffort> effort
        
        Constrains effort on reasoning for reasoning-capable Realtime models such as gpt-realtime-2.
        
        MINIMAL("minimal")
        
        LOW("low")
        
        MEDIUM("medium")
        
        HIGH("high")
        
        XHIGH("xhigh")
    - Optional<RealtimeToolChoiceConfig> toolChoice
      
      How the model chooses tools. Provide one of the string modes or force a specific function/MCP tool.
      - enum ToolChoiceOptions:
        
        Controls which (if any) tool is called by the model.
        
        none means the model will not call any tool and instead generates a message.
        
        auto means the model can pick between generating a message or calling one or more tools.
        
        required means the model must call one or more tools.
        
        NONE("none")
        
        AUTO("auto")
        
        REQUIRED("required")
      - class ToolChoiceFunction:
        
        Use this option to force the model to call a specific function.
        
        String name
        
        The name of the function to call.
        
        JsonValue; type "function" constant
        
        For function calling, the type is always function.
        
        FUNCTION("function")
      - class ToolChoiceMcp:
        
        Use this option to force the model to call a specific tool on a remote MCP server.
        
        String serverLabel
        
        The label of the MCP server to use.
        
        JsonValue; type "mcp" constant
        
        For MCP tools, the type is always mcp.
        
        MCP("mcp")
        
        Optional<String> name
        
        The name of the tool to call on the server.
    - Optional<List<RealtimeToolsConfigUnion>> tools
      
      Tools available to the model.
      - class RealtimeFunctionTool:
        
        Optional<String> description
        
        The description of the function, including guidance on when and how to call it, and guidance about what to tell the user when calling (if anything).
        
        Optional<String> name
        
        The name of the function.
        
        Optional<JsonValue> parameters
        
        Parameters of the function in JSON Schema.
        
        Optional<Type> type
        
        The type of the tool, i.e. function.
        
        FUNCTION("function")
      - Mcp
        
        String serverLabel
        
        A label for this MCP server, used to identify it in tool calls.
        
        JsonValue; type "mcp" constant
        
        The type of the MCP tool. Always mcp.
        
        MCP("mcp")
        
        Optional<AllowedTools> allowedTools
        
        List of allowed tool names or a filter object.
        
        List<String>
        
        class McpToolFilter:
        
        A filter object to specify which tools are allowed.
        
        Optional<Boolean> readOnly
        
        Indicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with readOnlyHint, it will match this filter.
        
        Optional<List<String>> toolNames
        
        List of allowed tool names.
        
        Optional<String> authorization
        
        An OAuth access token that can be used with a remote MCP server, either with a custom MCP server URL or a service connector. Your application must handle the OAuth authorization flow and provide the token here.
        
        Optional<ConnectorId> connectorId
        
        Identifier for service connectors, like those available in ChatGPT. One of server_url or connector_id must be provided. Learn more about service connectors here.
        
        Currently supported connector_id values are:
        
        Dropbox: connector_dropbox
        
        Gmail: connector_gmail
        
        Google Calendar: connector_googlecalendar
        
        Google Drive: connector_googledrive
        
        Microsoft Teams: connector_microsoftteams
        
        Outlook Calendar: connector_outlookcalendar
        
        Outlook Email: connector_outlookemail
        
        SharePoint: connector_sharepoint
        
        CONNECTOR_DROPBOX("connector_dropbox")
        
        CONNECTOR_GMAIL("connector_gmail")
        
        CONNECTOR_GOOGLECALENDAR("connector_googlecalendar")
        
        CONNECTOR_GOOGLEDRIVE("connector_googledrive")
        
        CONNECTOR_MICROSOFTTEAMS("connector_microsoftteams")
        
        CONNECTOR_OUTLOOKCALENDAR("connector_outlookcalendar")
        
        CONNECTOR_OUTLOOKEMAIL("connector_outlookemail")
        
        CONNECTOR_SHAREPOINT("connector_sharepoint")
        
        Optional<Boolean> deferLoading
        
        Whether this MCP tool is deferred and discovered via tool search.
        
        Optional<Headers> headers
        
        Optional HTTP headers to send to the MCP server. Use for authentication or other purposes.
        
        Optional<RequireApproval> requireApproval
        
        Specify which of the MCP server's tools require approval.
        
        class McpToolApprovalFilter:
        
        Specify which of the MCP server's tools require approval. Can be always, never, or a filter object associated with tools that require approval.
        
        Optional<Always> always
        
        A filter object to specify which tools are allowed.
        
        Optional<Boolean> readOnly
        
        Indicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with readOnlyHint, it will match this filter.
        
        Optional<List<String>> toolNames
        
        List of allowed tool names.
        
        Optional<Never> never
        
        A filter object to specify which tools are allowed.
        
        Optional<Boolean> readOnly
        
        Indicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with readOnlyHint, it will match this filter.
        
        Optional<List<String>> toolNames
        
        List of allowed tool names.
        
        enum McpToolApprovalSetting:
        
        Specify a single approval policy for all tools. One of always or never. When set to always, all tools will require approval. When set to never, no tools will require approval.
        
        ALWAYS("always")
        
        NEVER("never")
        
        Optional<String> serverDescription
        
        Optional description of the MCP server, used to provide more context.
        
        Optional<String> serverUrl
        
        The URL for the MCP server. One of server_url or connector_id must be provided.
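
The rule that one of server_url or connector_id must be provided can be checked on the client before a session request is sent. The following is a minimal sketch only: McpToolCheck and validate are hypothetical names rather than SDK types, and because the reference does not say whether both fields may be supplied together, the check only rejects the case where neither is set.

```java
import java.util.Optional;
import java.util.Set;

// Hypothetical helper, not part of any SDK: mirrors the documented McpTool fields.
final class McpToolCheck {
    // connector_id values listed in the reference above.
    static final Set<String> CONNECTOR_IDS = Set.of(
        "connector_dropbox", "connector_gmail", "connector_googlecalendar",
        "connector_googledrive", "connector_microsoftteams",
        "connector_outlookcalendar", "connector_outlookemail", "connector_sharepoint");

    /** Returns an error message, or an empty Optional if no problem was found. */
    static Optional<String> validate(String serverLabel, String serverUrl, String connectorId) {
        if (serverLabel == null || serverLabel.isBlank()) {
            return Optional.of("server_label is required");
        }
        if (serverUrl == null && connectorId == null) {
            return Optional.of("one of server_url or connector_id must be provided");
        }
        if (connectorId != null && !CONNECTOR_IDS.contains(connectorId)) {
            return Optional.of("unsupported connector_id: " + connectorId);
        }
        return Optional.empty();
    }
}
```

The list of connector ids is copied from this page and may lag behind the service, so a production check would likely treat an unknown id as a warning rather than a hard failure.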
    - Optional<RealtimeTracingConfig> tracing
      
      Realtime API can write session traces to the Traces Dashboard. Set to null to disable tracing. Once tracing is enabled for a session, the configuration cannot be modified.
      
      auto will create a trace for the session with default values for the workflow name, group id, and metadata.
      - JsonValue;
        
        AUTO("auto")
      - TracingConfiguration
        
        Optional<String> groupId
        
        The group id to attach to this trace to enable filtering and grouping in the Traces Dashboard.
        
        Optional<JsonValue> metadata
        
        The arbitrary metadata to attach to this trace to enable filtering in the Traces Dashboard.
        
        Optional<String> workflowName
        
        The name of the workflow to attach to this trace. This is used to name the trace in the Traces Dashboard.
    - Optional<RealtimeTruncation> truncation
      
      When the number of tokens in a conversation exceeds the model's input token limit, the conversation will be truncated, meaning messages (starting from the oldest) will not be included in the model's context. A 32k context model with 4,096 max output tokens can only include 28,224 tokens in the context before truncation occurs.
      
      Clients can configure truncation behavior to truncate with a lower max token limit, which is an effective way to control token usage and cost.
      
      Truncation will reduce the number of cached tokens on the next turn (busting the cache), since messages are dropped from the beginning of the context. However, clients can also configure truncation to retain messages up to a fraction of the maximum context size, which will reduce the need for future truncations and thus improve the cache rate.
      
      Truncation can be disabled entirely, which means the server will never truncate but would instead return an error if the conversation exceeds the model's input token limit.
      - RealtimeTruncationStrategy
        
        AUTO("auto")
        
        DISABLED("disabled")
      - class RealtimeTruncationRetentionRatio:
        
        Retain a fraction of the conversation tokens when the conversation exceeds the input token limit. This allows you to amortize truncations across multiple turns, which can help improve cached token usage.
        
        double retentionRatio
        
        Fraction of post-instruction conversation tokens to retain (0.0 - 1.0) when the conversation exceeds the input token limit. Setting this to 0.8 means that messages will be dropped until 80% of the maximum allowed tokens are used. This helps reduce the frequency of truncations and improve cache rates.
        
        JsonValue; type "retention_ratio"constant
        
        Use retention ratio truncation.
        
        RETENTION_RATIO("retention_ratio")
        
        Optional<TokenLimits> tokenLimits
        
        Optional custom token limits for this truncation strategy. If not provided, the model's default token limits will be used.
        
        Optional<Long> postInstructions
        
        Maximum tokens allowed in the conversation after instructions (which include tool definitions). For example, setting this to 5,000 would mean that truncation would occur when the conversation exceeds 5,000 tokens after instructions. This cannot be higher than the model's context window size minus the maximum output tokens.
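
To make the retention-ratio arithmetic concrete, here is a rough sketch of the behavior described above. RetentionRatioSketch and messagesToDrop are hypothetical names, per-message token counts are assumed to be known to the caller, and the server's actual token accounting may differ.

```java
import java.util.List;

// Hypothetical model of retention-ratio truncation, for illustration only.
final class RetentionRatioSketch {
    /**
     * Returns how many of the oldest messages would be dropped.
     * messageTokens is ordered oldest first; limit plays the role of post_instructions.
     */
    static int messagesToDrop(List<Integer> messageTokens, long limit, double retentionRatio) {
        long total = 0;
        for (int tokens : messageTokens) {
            total += tokens;
        }
        if (total <= limit) {
            return 0; // still under the limit, so nothing is truncated
        }
        // Once the limit is exceeded, drop until only ratio * limit tokens remain.
        long target = (long) Math.floor(limit * retentionRatio);
        int dropped = 0;
        while (dropped < messageTokens.size() && total > target) {
            total -= messageTokens.get(dropped);
            dropped++;
        }
        return dropped;
    }
}
```

With a limit of 5,000 and a retention ratio of 0.8, a 5,500-token conversation made of five 1,000-token messages and one 500-token message would lose its two oldest messages under this model, leaving 3,500 tokens and some headroom before the next truncation.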
  - class RealtimeTranscriptionSessionCreateRequest:
    
    Realtime transcription session object configuration.
    - JsonValue; type "transcription"constant
      
      The type of session to create. Always transcription for transcription sessions.
      - TRANSCRIPTION("transcription")
    - Optional<RealtimeTranscriptionSessionAudio> audio
      
      Configuration for input and output audio.
      - Optional<RealtimeTranscriptionSessionAudioInput> input
        
        Optional<RealtimeAudioFormats> format
        
        The PCM audio format. Only a 24kHz sample rate is supported.
        
        Optional<NoiseReduction> noiseReduction
        
        Configuration for input audio noise reduction. This can be set to null to turn off. Noise reduction filters audio added to the input audio buffer before it is sent to VAD and the model. Filtering the audio can improve VAD and turn detection accuracy (reducing false positives) and model performance by improving perception of the input audio.
        
        Optional<NoiseReductionType> type
        
        Type of noise reduction. near_field is for close-talking microphones such as headphones, far_field is for far-field microphones such as laptop or conference room microphones.
        
        Optional<AudioTranscription> transcription
        
        Configuration for input audio transcription. Defaults to off and can be set to null to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through the /audio/transcriptions endpoint and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service.
        
        Optional<RealtimeTranscriptionSessionAudioInputTurnDetection> turnDetection
        
        Configuration for turn detection, either Server VAD or Semantic VAD. This can be set to null to turn off, in which case the client must manually trigger model response.
        
        Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.
        
        Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.
        
        For gpt-realtime-whisper transcription sessions, turn detection must be set to null; VAD is not supported.
        
        ServerVad
        
        JsonValue; type "server_vad"constant
        
        Type of turn detection, server_vad to turn on simple Server VAD.
        
        SERVER_VAD("server_vad")
        
        Optional<Boolean> createResponse
        
        Whether or not to automatically generate a response when a VAD stop event occurs. If interrupt_response is set to false this may fail to create a response if the model is already responding.
        
        If both create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted.
        
        Optional<Long> idleTimeoutMs
        
        Optional timeout after which a model response will be triggered automatically. This is useful for situations in which a long pause from the user is unexpected, such as a phone call. The model will effectively prompt the user to continue the conversation based on the current context.
        
        The timeout value will be applied after the last model response's audio has finished playing, i.e. it's set to the response.done time plus audio playback duration.
        
        An input_audio_buffer.timeout_triggered event (plus events associated with the Response) will be emitted when the timeout is reached. Idle timeout is currently only supported for server_vad mode.
        
        Optional<Boolean> interruptResponse
        
        Whether or not to automatically interrupt (cancel) any ongoing response with output to the default conversation (i.e. conversation of auto) when a VAD start event occurs. If true then the response will be cancelled, otherwise it will continue until complete.
        
        If both create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted.
        
        Optional<Long> prefixPaddingMs
        
        Used only for server_vad mode. Amount of audio to include before the VAD detected speech (in milliseconds). Defaults to 300ms.
        
        Optional<Long> silenceDurationMs
        
        Used only for server_vad mode. Duration of silence to detect speech stop (in milliseconds). Defaults to 500ms. With shorter values the model will respond more quickly, but may jump in on short pauses from the user.
        
        Optional<Double> threshold
        
        Used only for server_vad mode. Activation threshold for VAD (0.0 to 1.0). Defaults to 0.5. A higher threshold will require louder audio to activate the model, and thus might perform better in noisy environments.
        
        SemanticVad
        
        JsonValue; type "semantic_vad"constant
        
        Type of turn detection, semantic_vad to turn on Semantic VAD.
        
        SEMANTIC_VAD("semantic_vad")
        
        Optional<Boolean> createResponse
        
        Whether or not to automatically generate a response when a VAD stop event occurs.
        
        Optional<Eagerness> eagerness
        
        Used only for semantic_vad mode. The eagerness of the model to respond. low will wait longer for the user to continue speaking, high will respond more quickly. auto is the default and is equivalent to medium. low, medium, and high have max timeouts of 8s, 4s, and 2s respectively.
        
        LOW("low")
        
        MEDIUM("medium")
        
        HIGH("high")
        
        AUTO("auto")
        
        Optional<Boolean> interruptResponse
        
        Whether or not to automatically interrupt any ongoing response with output to the default conversation (i.e. conversation of auto) when a VAD start event occurs.
    - Optional<List<Include>> include
      
      Additional fields to include in server outputs.
      
      item.input_audio_transcription.logprobs: Include logprobs for input audio transcription.
      - ITEM_INPUT_AUDIO_TRANSCRIPTION_LOGPROBS("item.input_audio_transcription.logprobs")
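
The gpt-realtime-whisper constraints scattered through this reference (turn detection must be null, prompt is not supported, and delay is only supported with that model) can be gathered into one client-side check. This is a sketch under assumptions: WhisperSessionRules and check are hypothetical names, arguments are plain strings rather than SDK types, and the delay and prompt rules are taken from the AudioTranscription description, which scopes them to GA Realtime sessions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical pre-flight check, not an SDK API.
final class WhisperSessionRules {
    static final String WHISPER = "gpt-realtime-whisper";

    /** Null arguments mean "not set". Returns the problems found; an empty list means none. */
    static List<String> check(String model, String delay, String prompt, String turnDetectionType) {
        List<String> problems = new ArrayList<>();
        boolean whisper = WHISPER.equals(model);
        if (whisper && turnDetectionType != null) {
            problems.add("turn_detection must be null for " + WHISPER);
        }
        if (whisper && prompt != null) {
            problems.add("prompt is not supported with " + WHISPER);
        }
        if (!whisper && delay != null) {
            problems.add("delay is only supported with " + WHISPER);
        }
        return problems;
    }
}
```

Returning a list instead of throwing on the first problem lets a caller report every conflicting setting at once.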

Returns

class ClientSecretCreateResponse:

Response from creating a session and client secret for the Realtime API.
- long expiresAt
  
  Expiration timestamp for the client secret, in seconds since epoch.
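
Because expiresAt is expressed in seconds since epoch, it converts directly to a java.time.Instant. The sketch below shows one way a client might compute the time left before a secret expires; ClientSecretExpiry and remaining are hypothetical names, not part of the SDK.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper for working with the expiresAt field.
final class ClientSecretExpiry {
    /** Time left before expiresAt (seconds since epoch); never negative. */
    static Duration remaining(long expiresAt, Instant now) {
        Duration left = Duration.between(now, Instant.ofEpochSecond(expiresAt));
        return left.isNegative() ? Duration.ZERO : left;
    }
}
```

Passing the current time as a parameter, for example Instant.now(), keeps the helper easy to test.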
- Session session
  
  The session configuration for either a realtime or transcription session.
  - class RealtimeSessionCreateResponse:
    
    A Realtime session configuration object.
    - String id
      
      Unique identifier for the session that looks like sess_1234567890abcdef.
    - JsonValue; object_ "realtime.session"constant
      
      The object type. Always realtime.session.
      - REALTIME_SESSION("realtime.session")
    - JsonValue; type "realtime"constant
      
      The type of session to create. Always realtime for the Realtime API.
      - REALTIME("realtime")
    - Optional<Audio> audio
      
      Configuration for input and output audio.
      - Optional<Input> input
        
        Optional<RealtimeAudioFormats> format
        
        The format of the input audio.
        
        AudioPcm
        
        Optional<Rate> rate
        
        The sample rate of the audio. Always 24000.
        
        _24000(24000)
        
        Optional<Type> type
        
        The audio format. Always audio/pcm.
        
        AUDIO_PCM("audio/pcm")
        
        AudioPcmu
        
        Optional<Type> type
        
        The audio format. Always audio/pcmu.
        
        AUDIO_PCMU("audio/pcmu")
        
        AudioPcma
        
        Optional<Type> type
        
        The audio format. Always audio/pcma.
        
        AUDIO_PCMA("audio/pcma")
        
        Optional<NoiseReduction> noiseReduction
        
        Configuration for input audio noise reduction. This can be set to null to turn off. Noise reduction filters audio added to the input audio buffer before it is sent to VAD and the model. Filtering the audio can improve VAD and turn detection accuracy (reducing false positives) and model performance by improving perception of the input audio.
        
        Optional<NoiseReductionType> type
        
        Type of noise reduction. near_field is for close-talking microphones such as headphones, far_field is for far-field microphones such as laptop or conference room microphones.
        
        NEAR_FIELD("near_field")
        
        FAR_FIELD("far_field")
        
        Optional<AudioTranscription> transcription
        
        Optional<Delay> delay
        
        Controls how long the model waits before emitting transcription text. Higher values can improve transcription accuracy at the cost of latency. Only supported with gpt-realtime-whisper in GA Realtime sessions.
        
        MINIMAL("minimal")
        
        LOW("low")
        
        MEDIUM("medium")
        
        HIGH("high")
        
        XHIGH("xhigh")
        
        Optional<String> language
        
        The language of the input audio. Supplying the input language in ISO-639-1 (e.g. en) format will improve accuracy and latency.
        
        Optional<Model> model
        
        The model to use for transcription. Current options are whisper-1, gpt-4o-mini-transcribe, gpt-4o-mini-transcribe-2025-12-15, gpt-4o-transcribe, gpt-4o-transcribe-diarize, and gpt-realtime-whisper. Use gpt-4o-transcribe-diarize when you need diarization with speaker labels.
        
        WHISPER_1("whisper-1")
        
        GPT_4O_MINI_TRANSCRIBE("gpt-4o-mini-transcribe")
        
        GPT_4O_MINI_TRANSCRIBE_2025_12_15("gpt-4o-mini-transcribe-2025-12-15")
        
        GPT_4O_TRANSCRIBE("gpt-4o-transcribe")
        
        GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize")
        
        GPT_REALTIME_WHISPER("gpt-realtime-whisper")
        
        Optional<String> prompt
        
        An optional text to guide the model's style or continue a previous audio segment. For whisper-1, the prompt is a list of keywords. For gpt-4o-transcribe models (excluding gpt-4o-transcribe-diarize), the prompt is a free text string, for example "expect words related to technology". Prompt is not supported with gpt-realtime-whisper in GA Realtime sessions.
        
        Optional<TurnDetection> turnDetection
        
        Configuration for turn detection, either Server VAD or Semantic VAD. This can be set to null to turn off, in which case the client must manually trigger model response.
        
        Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.
        
        Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.
        
        For gpt-realtime-whisper transcription sessions, turn detection must be set to null; VAD is not supported.
        
        class ServerVad:
        
        Server-side voice activity detection (VAD) which flips on when user speech is detected and off after a period of silence.
        
        JsonValue; type "server_vad"constant
        
        Type of turn detection, server_vad to turn on simple Server VAD.
        
        SERVER_VAD("server_vad")
        
        Optional<Boolean> createResponse
        
        Whether or not to automatically generate a response when a VAD stop event occurs. If interrupt_response is set to false this may fail to create a response if the model is already responding.
        
        If both create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted.
        
        Optional<Long> idleTimeoutMs
        
        Optional timeout after which a model response will be triggered automatically. This is useful for situations in which a long pause from the user is unexpected, such as a phone call. The model will effectively prompt the user to continue the conversation based on the current context.
        
        The timeout value will be applied after the last model response's audio has finished playing, i.e. it's set to the response.done time plus audio playback duration.
        
        An input_audio_buffer.timeout_triggered event (plus events associated with the Response) will be emitted when the timeout is reached. Idle timeout is currently only supported for server_vad mode.
        
        Optional<Boolean> interruptResponse
        
        Whether or not to automatically interrupt (cancel) any ongoing response with output to the default conversation (i.e. conversation of auto) when a VAD start event occurs. If true then the response will be cancelled, otherwise it will continue until complete.
        
        If both create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted.
        
        Optional<Long> prefixPaddingMs
        
        Used only for server_vad mode. Amount of audio to include before the VAD detected speech (in milliseconds). Defaults to 300ms.
        
        Optional<Long> silenceDurationMs
        
        Used only for server_vad mode. Duration of silence to detect speech stop (in milliseconds). Defaults to 500ms. With shorter values the model will respond more quickly, but may jump in on short pauses from the user.
        
        Optional<Double> threshold
        
        Used only for server_vad mode. Activation threshold for VAD (0.0 to 1.0). Defaults to 0.5. A higher threshold will require louder audio to activate the model, and thus might perform better in noisy environments.
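
The interplay of threshold, prefixPaddingMs, and silenceDurationMs is easier to see in a toy model. The code below is not the server's algorithm; it is a simplified illustration that assumes audio has already been reduced to one normalized loudness value (0.0 to 1.0) per fixed-length frame. ToyServerVad, Segment, and detect are hypothetical names.

```java
// Toy volume-based VAD, for intuition only; the real service is not specified here.
final class ToyServerVad {
    /** Detected speech bounds in milliseconds; -1 means "not detected". */
    static final class Segment {
        final long startMs;
        final long stopMs;
        Segment(long startMs, long stopMs) {
            this.startMs = startMs;
            this.stopMs = stopMs;
        }
    }

    static Segment detect(double[] levels, long frameMs, double threshold,
                          long prefixPaddingMs, long silenceDurationMs) {
        long start = -1;
        long silentMs = 0;
        for (int i = 0; i < levels.length; i++) {
            long t = i * frameMs;
            boolean loud = levels[i] >= threshold;
            if (start < 0) {
                // Speech begins at the first loud frame, minus the prefix padding.
                if (loud) start = Math.max(0, t - prefixPaddingMs);
            } else if (loud) {
                silentMs = 0; // a short pause ended; keep listening
            } else {
                silentMs += frameMs;
                if (silentMs >= silenceDurationMs) {
                    return new Segment(start, t + frameMs); // enough silence: speech stopped
                }
            }
        }
        return new Segment(start, -1);
    }
}
```

In this model a lower silenceDurationMs ends the turn sooner but also cuts in on brief pauses, which matches the trade-off described for that field.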
        
        class SemanticVad:
        
        Server-side semantic turn detection which uses a model to determine when the user has finished speaking.
        
        JsonValue; type "semantic_vad"constant
        
        Type of turn detection, semantic_vad to turn on Semantic VAD.
        
        SEMANTIC_VAD("semantic_vad")
        
        Optional<Boolean> createResponse
        
        Whether or not to automatically generate a response when a VAD stop event occurs.
        
        Optional<Eagerness> eagerness
        
        Used only for semantic_vad mode. The eagerness of the model to respond. low will wait longer for the user to continue speaking, high will respond more quickly. auto is the default and is equivalent to medium. low, medium, and high have max timeouts of 8s, 4s, and 2s respectively.
        
        LOW("low")
        
        MEDIUM("medium")
        
        HIGH("high")
        
        AUTO("auto")
        
        Optional<Boolean> interruptResponse
        
        Whether or not to automatically interrupt any ongoing response with output to the default conversation (i.e. conversation of auto) when a VAD start event occurs.
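
The eagerness values and their documented maximum timeouts can be written down as a small lookup. This is only a sketch of the mapping stated above (auto behaves like medium; low, medium, and high cap at 8s, 4s, and 2s); EagernessTimeouts and maxTimeout are hypothetical names, and the enum is a local stand-in rather than the SDK's Eagerness type.

```java
import java.time.Duration;

// Hypothetical lookup of the documented maximum timeouts per eagerness level.
final class EagernessTimeouts {
    enum Eagerness { LOW, MEDIUM, HIGH, AUTO }

    static Duration maxTimeout(Eagerness eagerness) {
        switch (eagerness) {
            case LOW:
                return Duration.ofSeconds(8);
            case HIGH:
                return Duration.ofSeconds(2);
            case MEDIUM:
            case AUTO: // documented as equivalent to medium
            default:
                return Duration.ofSeconds(4);
        }
    }
}
```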
      - Optional<Output> output
        
        Optional<RealtimeAudioFormats> format
        
        The format of the output audio.
        
        Optional<Double> speed
        
        The speed of the model's spoken response as a multiple of the original speed. 1.0 is the default speed. 0.25 is the minimum speed. 1.5 is the maximum speed. This value can only be changed in between model turns, not while a response is in progress.
        
        This parameter is a post-processing adjustment to the audio after it is generated. It is also possible to prompt the model to speak faster or slower.
        
        Optional<Voice> voice
        
        The voice the model uses to respond. Voice cannot be changed during the session once the model has responded with audio at least once. Current voice options are alloy, ash, ballad, coral, echo, sage, shimmer, verse, marin, and cedar. We recommend marin and cedar for best quality.
        
        ALLOY("alloy")
        
        ASH("ash")
        
        BALLAD("ballad")
        
        CORAL("coral")
        
        ECHO("echo")
        
        SAGE("sage")
        
        SHIMMER("shimmer")
        
        VERSE("verse")
        
        MARIN("marin")
        
        CEDAR("cedar")
    - Optional<Long> expiresAt
      
      Expiration timestamp for the session, in seconds since epoch.
    - Optional<List<Include>> include
      
      Additional fields to include in server outputs.
      
      item.input_audio_transcription.logprobs: Include logprobs for input audio transcription.
      - ITEM_INPUT_AUDIO_TRANSCRIPTION_LOGPROBS("item.input_audio_transcription.logprobs")
    - Optional<String> instructions
      
      The default system instructions (i.e. system message) prepended to model calls. This field allows the client to guide the model on desired responses. The model can be instructed on response content and format (e.g. "be extremely succinct", "act friendly", "here are examples of good responses") and on audio behavior (e.g. "talk quickly", "inject emotion into your voice", "laugh frequently"). The instructions are not guaranteed to be followed by the model, but they provide guidance to the model on the desired behavior.
      
      Note that the server sets default instructions which will be used if this field is not set and are visible in the session.created event at the start of the session.
    - Optional<MaxOutputTokens> maxOutputTokens
      
      Maximum number of output tokens for a single assistant response, inclusive of tool calls. Provide an integer between 1 and 4096 to limit output tokens, or inf for the maximum available tokens for a given model. Defaults to inf.
      - long
      - JsonValue;
        
        INF("inf")
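
Since maxOutputTokens is either an integer between 1 and 4096 or the literal inf, a client that reads the value from configuration needs to handle both shapes. A minimal sketch, assuming the value arrives as a string; MaxOutputTokensSketch and parse are hypothetical names, and an empty OptionalLong is used here as a stand-in for inf.

```java
import java.util.OptionalLong;

// Hypothetical parser for the two documented shapes of maxOutputTokens.
final class MaxOutputTokensSketch {
    /** Parses "inf" or an integer in [1, 4096]; an empty result stands for "inf". */
    static OptionalLong parse(String raw) {
        if ("inf".equals(raw)) {
            return OptionalLong.empty();
        }
        long value;
        try {
            value = Long.parseLong(raw);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("expected an integer or \"inf\": " + raw, e);
        }
        if (value < 1 || value > 4096) {
            throw new IllegalArgumentException("must be between 1 and 4096: " + value);
        }
        return OptionalLong.of(value);
    }
}
```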
    - Optional<Model> model
      
      The Realtime model used for this session.
      - GPT_REALTIME("gpt-realtime")
      - GPT_REALTIME_1_5("gpt-realtime-1.5")
      - GPT_REALTIME_2("gpt-realtime-2")
      - GPT_REALTIME_2025_08_28("gpt-realtime-2025-08-28")
      - GPT_4O_REALTIME_PREVIEW("gpt-4o-realtime-preview")
      - GPT_4O_REALTIME_PREVIEW_2024_10_01("gpt-4o-realtime-preview-2024-10-01")
      - GPT_4O_REALTIME_PREVIEW_2024_12_17("gpt-4o-realtime-preview-2024-12-17")
      - GPT_4O_REALTIME_PREVIEW_2025_06_03("gpt-4o-realtime-preview-2025-06-03")
      - GPT_4O_MINI_REALTIME_PREVIEW("gpt-4o-mini-realtime-preview")
      - GPT_4O_MINI_REALTIME_PREVIEW_2024_12_17("gpt-4o-mini-realtime-preview-2024-12-17")
      - GPT_REALTIME_MINI("gpt-realtime-mini")
      - GPT_REALTIME_MINI_2025_10_06("gpt-realtime-mini-2025-10-06")
      - GPT_REALTIME_MINI_2025_12_15("gpt-realtime-mini-2025-12-15")
      - GPT_AUDIO_1_5("gpt-audio-1.5")
      - GPT_AUDIO_MINI("gpt-audio-mini")
      - GPT_AUDIO_MINI_2025_10_06("gpt-audio-mini-2025-10-06")
      - GPT_AUDIO_MINI_2025_12_15("gpt-audio-mini-2025-12-15")
    - Optional<List<OutputModality>> outputModalities
      
      The set of modalities the model can respond with. It defaults to ["audio"], indicating that the model will respond with audio plus a transcript. ["text"] can be used to make the model respond with text only. It is not possible to request both text and audio at the same time.
      - TEXT("text")
      - AUDIO("audio")
    - Optional<ResponsePrompt> prompt
      
      Reference to a prompt template and its variables.
      - String id
        
        The unique identifier of the prompt template to use.
      - Optional<Variables> variables
        
        Optional map of values to substitute in for variables in your prompt. The substitution values can either be strings, or other Response input types like images or files.
        
        String
        
        class ResponseInputText:
        
        A text input to the model.
        
        String text
        
        The text input to the model.
        
        JsonValue; type "input_text"constant
        
        The type of the input item. Always input_text.
        
        INPUT_TEXT("input_text")
        
        class ResponseInputImage:
        
        An image input to the model. Learn about image inputs.
        
        Detail detail
        
        The detail level of the image to be sent to the model. One of high, low, auto, or original. Defaults to auto.
        
        LOW("low")
        
        HIGH("high")
        
        AUTO("auto")
        
        ORIGINAL("original")
        
        JsonValue; type "input_image"constant
        
        The type of the input item. Always input_image.
        
        INPUT_IMAGE("input_image")
        
        Optional<String> fileId
        
        The ID of the file to be sent to the model.
        
        Optional<String> imageUrl
        
        The URL of the image to be sent to the model. A fully qualified URL or base64 encoded image in a data URL.
        
        class ResponseInputFile:
        
        A file input to the model.
        
        JsonValue; type "input_file"constant
        
        The type of the input item. Always input_file.
        
        INPUT_FILE("input_file")
        
        Optional<Detail> detail
        
        The detail level of the file to be sent to the model. Use low for the default rendering behavior, or high to render the file at higher quality. Defaults to low.
        
        LOW("low")
        
        HIGH("high")
        
        Optional<String> fileData
        
        The content of the file to be sent to the model.
        
        Optional<String> fileId
        
        The ID of the file to be sent to the model.
        
        Optional<String> fileUrl
        
        The URL of the file to be sent to the model.
        
        Optional<String> filename
        
        The name of the file to be sent to the model.
      - Optional<String> version
        
        Optional version of the prompt template.
    - Optional<RealtimeReasoning> reasoning
      
      Configuration for reasoning-capable Realtime models such as gpt-realtime-2.
      - Optional<RealtimeReasoningEffort> effort
        
        Constrains effort on reasoning for reasoning-capable Realtime models such as gpt-realtime-2.
        
        MINIMAL("minimal")
        
        LOW("low")
        
        MEDIUM("medium")
        
        HIGH("high")
        
        XHIGH("xhigh")
    - Optional<ToolChoice> toolChoice
      
      How the model chooses tools. Provide one of the string modes or force a specific function/MCP tool.
      - enum ToolChoiceOptions:
        
        Controls which (if any) tool is called by the model.
        
        none means the model will not call any tool and instead generates a message.
        
        auto means the model can pick between generating a message or calling one or more tools.
        
        required means the model must call one or more tools.
        
        NONE("none")
        
        AUTO("auto")
        
        REQUIRED("required")
      - class ToolChoiceFunction:
        
        Use this option to force the model to call a specific function.
        
        String name
        
        The name of the function to call.
        
        JsonValue; type "function"constant
        
        For function calling, the type is always function.
        
        FUNCTION("function")
      - class ToolChoiceMcp:
        
        Use this option to force the model to call a specific tool on a remote MCP server.
        
        String serverLabel
        
        The label of the MCP server to use.
        
        JsonValue; type "mcp"constant
        
        For MCP tools, the type is always mcp.
        
        MCP("mcp")
        
        Optional<String> name
        
        The name of the tool to call on the server.
    - Optional<List<Tool>> tools
      
      Tools available to the model.
      - class RealtimeFunctionTool:
        
        Optional<String> description
        
        The description of the function, including guidance on when and how to call it, and guidance about what to tell the user when calling (if anything).
        
        Optional<String> name
        
        The name of the function.
        
        Optional<JsonValue> parameters
        
        Parameters of the function in JSON Schema.
        
        Optional<Type> type
        
        The type of the tool, i.e. function.
        
        FUNCTION("function")
      - class McpTool:
        
        Give the model access to additional tools via remote Model Context Protocol (MCP) servers. Learn more about MCP.
        
        String serverLabel
        
        A label for this MCP server, used to identify it in tool calls.
        
        JsonValue; type "mcp"constant
        
        The type of the MCP tool. Always mcp.
        
        MCP("mcp")
        
        Optional<AllowedTools> allowedTools
        
        List of allowed tool names or a filter object.
        
        List<String>
        
        class McpToolFilter:
        
        A filter object to specify which tools are allowed.
        
        Optional<Boolean> readOnly
        
        Indicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with readOnlyHint, it will match this filter.
        
        Optional<List<String>> toolNames
        
        List of allowed tool names.
        
        Optional<String> authorization
        
        An OAuth access token that can be used with a remote MCP server, either with a custom MCP server URL or a service connector. Your application must handle the OAuth authorization flow and provide the token here.
        
        Optional<ConnectorId> connectorId
        
        Identifier for service connectors, like those available in ChatGPT. One of server_url or connector_id must be provided. Learn more about service connectors here.
        
        Currently supported connector_id values are:
        
        Dropbox: connector_dropbox
        
        Gmail: connector_gmail
        
        Google Calendar: connector_googlecalendar
        
        Google Drive: connector_googledrive
        
        Microsoft Teams: connector_microsoftteams
        
        Outlook Calendar: connector_outlookcalendar
        
        Outlook Email: connector_outlookemail
        
        SharePoint: connector_sharepoint
        
        CONNECTOR_DROPBOX("connector_dropbox")
        
        CONNECTOR_GMAIL("connector_gmail")
        
        CONNECTOR_GOOGLECALENDAR("connector_googlecalendar")
        
        CONNECTOR_GOOGLEDRIVE("connector_googledrive")
        
        CONNECTOR_MICROSOFTTEAMS("connector_microsoftteams")
        
        CONNECTOR_OUTLOOKCALENDAR("connector_outlookcalendar")
        
        CONNECTOR_OUTLOOKEMAIL("connector_outlookemail")
        
        CONNECTOR_SHAREPOINT("connector_sharepoint")
        
        Optional<Boolean> deferLoading
        
        Whether this MCP tool is deferred and discovered via tool search.
        
        Optional<Headers> headers
        
        Optional HTTP headers to send to the MCP server. Use for authentication or other purposes.
        
        Optional<RequireApproval> requireApproval
        
        Specify which of the MCP server's tools require approval.
        
        class McpToolApprovalFilter:
        
        Specify which of the MCP server's tools require approval. Can be always, never, or a filter object associated with tools that require approval.
        
        Optional<Always> always
        
        A filter object to specify which tools are allowed.
        
        Optional<Boolean> readOnly
        
        Indicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with readOnlyHint, it will match this filter.
        
        Optional<List<String>> toolNames
        
        List of allowed tool names.
        
        Optional<Never> never
        
        A filter object to specify which tools are allowed.
        
        Optional<Boolean> readOnly
        
        Indicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with readOnlyHint, it will match this filter.
        
        Optional<List<String>> toolNames
        
        List of allowed tool names.
        
        enum McpToolApprovalSetting:
        
        Specify a single approval policy for all tools. One of always or never. When set to always, all tools will require approval. When set to never, no tools will require approval.
        
        ALWAYS("always")
        
        NEVER("never")
        
        Optional<String> serverDescription
        
        Optional description of the MCP server, used to provide more context.
        
        Optional<String> serverUrl
        
        The URL for the MCP server. One of server_url or connector_id must be provided.
    - Optional<Tracing> tracing
      
      Realtime API can write session traces to the Traces Dashboard. Set to null to disable tracing. Once tracing is enabled for a session, the configuration cannot be modified.
      
      auto will create a trace for the session with default values for the workflow name, group id, and metadata.
      - JsonValue;
        
        AUTO("auto")
      - class TracingConfiguration:
        
        Granular configuration for tracing.
        
        Optional<String> groupId
        
        The group id to attach to this trace to enable filtering and grouping in the Traces Dashboard.
        
        Optional<JsonValue> metadata
        
        The arbitrary metadata to attach to this trace to enable filtering in the Traces Dashboard.
        
        Optional<String> workflowName
        
        The name of the workflow to attach to this trace. This is used to name the trace in the Traces Dashboard.
    - Optional<RealtimeTruncation> truncation
      
      When the number of tokens in a conversation exceeds the model's input token limit, the conversation will be truncated, meaning messages (starting from the oldest) will not be included in the model's context. A 32k context model with 4,096 max output tokens can only include 28,224 tokens in the context before truncation occurs.
      
      Clients can configure truncation behavior to truncate with a lower max token limit, which is an effective way to control token usage and cost.
      
      Truncation will reduce the number of cached tokens on the next turn (busting the cache), since messages are dropped from the beginning of the context. However, clients can also configure truncation to retain messages up to a fraction of the maximum context size, which will reduce the need for future truncations and thus improve the cache rate.
      
      Truncation can be disabled entirely, which means the server will never truncate but would instead return an error if the conversation exceeds the model's input token limit.
      - RealtimeTruncationStrategy
        
        AUTO("auto")
        
        DISABLED("disabled")
      - class RealtimeTruncationRetentionRatio:
        
        Retain a fraction of the conversation tokens when the conversation exceeds the input token limit. This allows you to amortize truncations across multiple turns, which can help improve cached token usage.
        
        double retentionRatio
        
        Fraction of post-instruction conversation tokens to retain (0.0 - 1.0) when the conversation exceeds the input token limit. Setting this to 0.8 means that messages will be dropped until 80% of the maximum allowed tokens are used. This helps reduce the frequency of truncations and improve cache rates.
        
        JsonValue; type "retention_ratio"constant
        
        Use retention ratio truncation.
        
        RETENTION_RATIO("retention_ratio")
        
        Optional<TokenLimits> tokenLimits
        
        Optional custom token limits for this truncation strategy. If not provided, the model's default token limits will be used.
        
        Optional<Long> postInstructions
        
        Maximum tokens allowed in the conversation after instructions (which include tool definitions). For example, setting this to 5,000 would mean that truncation would occur when the conversation exceeds 5,000 tokens after instructions. This cannot be higher than the model's context window size minus the maximum output tokens.
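The retention ratio behavior described above can be approximated client-side. The following is only a sketch under stated assumptions: the class RetentionRatioSketch and its truncate method are hypothetical names, messages are modeled as plain token counts, and the server's actual accounting may differ.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

/** Hypothetical client-side model of retention-ratio truncation; not part of the SDK. */
final class RetentionRatioSketch {
    /**
     * Once the total exceeds maxTokens, drops the oldest messages until the total is
     * at most retentionRatio * maxTokens. Returns the token counts of retained messages.
     */
    static List<Integer> truncate(List<Integer> messageTokens, long maxTokens, double retentionRatio) {
        if (retentionRatio < 0.0 || retentionRatio > 1.0) {
            throw new IllegalArgumentException("retentionRatio must be between 0.0 and 1.0");
        }
        Deque<Integer> kept = new ArrayDeque<>(messageTokens);
        long total = 0;
        for (int tokens : messageTokens) {
            total += tokens;
        }
        if (total <= maxTokens) {
            return List.copyOf(kept); // Under the limit: nothing is dropped.
        }
        long target = (long) Math.floor(retentionRatio * maxTokens);
        while (!kept.isEmpty() && total > target) {
            total -= kept.removeFirst(); // The oldest message is dropped first.
        }
        return List.copyOf(kept);
    }

    public static void main(String[] args) {
        // Five 300-token messages against a 1,000-token limit with a 0.8 ratio.
        List<Integer> kept = truncate(List.of(300, 300, 300, 300, 300), 1000, 0.8);
        System.out.println(kept.size() + " messages retained");
    }
}
```

Dropping below the limit rather than just to it leaves headroom, so the next few turns should not trigger another truncation and the cached prefix stays stable for longer.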
  - class RealtimeTranscriptionSessionCreateResponse:
    
    A Realtime transcription session configuration object.
    - String id
      
      Unique identifier for the session that looks like sess_1234567890abcdef.
    - String object_
      
      The object type. Always realtime.transcription_session.
    - JsonValue; type "transcription"constant
      
      The type of session. Always transcription for transcription sessions.
      - TRANSCRIPTION("transcription")
    - Optional<Audio> audio
      
      Configuration for input audio for the session.
      - Optional<Input> input
        
        Optional<RealtimeAudioFormats> format
        
        The PCM audio format. Only a 24kHz sample rate is supported.
        
        Optional<NoiseReduction> noiseReduction
        
        Configuration for input audio noise reduction.
        
        Optional<NoiseReductionType> type
        
        Type of noise reduction. near_field is for close-talking microphones such as headphones, far_field is for far-field microphones such as laptop or conference room microphones.
        
        Optional<AudioTranscription> transcription
        
        Optional<RealtimeTranscriptionSessionTurnDetection> turnDetection
        
        Configuration for turn detection. Can be set to null to turn off. Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech. For gpt-realtime-whisper, this must be null; VAD is not supported.
        
        Optional<Long> prefixPaddingMs
        
        Amount of audio to include before the VAD detected speech (in milliseconds). Defaults to 300ms.
        
        Optional<Long> silenceDurationMs
        
        Duration of silence to detect speech stop (in milliseconds). Defaults to 500ms. With shorter values the model will respond more quickly, but may jump in on short pauses from the user.
        
        Optional<Double> threshold
        
        Activation threshold for VAD (0.0 to 1.0). Defaults to 0.5. A higher threshold will require louder audio to activate the model, and thus might perform better in noisy environments.
        
        Optional<String> type
        
        Type of turn detection, only server_vad is currently supported.
    - Optional<Long> expiresAt
      
      Expiration timestamp for the session, in seconds since epoch.
    - Optional<List<Include>> include
      
      Additional fields to include in server outputs.
      - item.input_audio_transcription.logprobs: Include logprobs for input audio transcription.
      - ITEM_INPUT_AUDIO_TRANSCRIPTION_LOGPROBS("item.input_audio_transcription.logprobs")
- String value
  
  The generated client secret value.

Example

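package com.openai.example;

import com.openai.client.OpenAIClient;
import com.openai.client.okhttp.OpenAIOkHttpClient;
import com.openai.models.realtime.clientsecrets.ClientSecretCreateParams;
import com.openai.models.realtime.clientsecrets.ClientSecretCreateResponse;

public final class Main {
    private Main() {}

    public static void main(String[] args) {
        // Builds a client from environment variables such as OPENAI_API_KEY.
        OpenAIClient client = OpenAIOkHttpClient.fromEnv();

        // Creates a client secret using the default session configuration.
        ClientSecretCreateResponse clientSecret = client.realtime().clientSecrets().create();

        // The secret itself is available via clientSecret.value(); avoid logging it.
        System.out.println("Client secret expires at (epoch seconds): " + clientSecret.expiresAt());
    }
}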

Response

{
  "expires_at": 0,
  "session": {
    "id": "id",
    "object": "realtime.session",
    "type": "realtime",
    "audio": {
      "input": {
        "format": {
          "rate": 24000,
          "type": "audio/pcm"
        },
        "noise_reduction": {
          "type": "near_field"
        },
        "transcription": {
          "delay": "minimal",
          "language": "language",
          "model": "string",
          "prompt": "prompt"
        },
        "turn_detection": {
          "type": "server_vad",
          "create_response": true,
          "idle_timeout_ms": 5000,
          "interrupt_response": true,
          "prefix_padding_ms": 0,
          "silence_duration_ms": 0,
          "threshold": 0
        }
      },
      "output": {
        "format": {
          "rate": 24000,
          "type": "audio/pcm"
        },
        "speed": 0.25,
        "voice": "ash"
      }
    },
    "expires_at": 0,
    "include": [
      "item.input_audio_transcription.logprobs"
    ],
    "instructions": "instructions",
    "max_output_tokens": 0,
    "model": "string",
    "output_modalities": [
      "text"
    ],
    "prompt": {
      "id": "id",
      "variables": {
        "foo": "string"
      },
      "version": "version"
    },
    "reasoning": {
      "effort": "minimal"
    },
    "tool_choice": "none",
    "tools": [
      {
        "description": "description",
        "name": "name",
        "parameters": {},
        "type": "function"
      }
    ],
    "tracing": "auto",
    "truncation": "auto"
  },
  "value": "value"
}

Domain Types

Realtime Session Create Response

class RealtimeSessionCreateResponse:

A Realtime session configuration object.
- String id
  
  Unique identifier for the session that looks like sess_1234567890abcdef.
- JsonValue; object_ "realtime.session"constant
  
  The object type. Always realtime.session.
  - REALTIME_SESSION("realtime.session")
- JsonValue; type "realtime"constant
  
  The type of session to create. Always realtime for the Realtime API.
  - REALTIME("realtime")
- Optional<Audio> audio
  
  Configuration for input and output audio.
  - Optional<Input> input
    - Optional<RealtimeAudioFormats> format
      
      The format of the input audio.
      - AudioPcm
        
        Optional<Rate> rate
        
        The sample rate of the audio. Always 24000.
        
        _24000(24000)
        
        Optional<Type> type
        
        The audio format. Always audio/pcm.
        
        AUDIO_PCM("audio/pcm")
      - AudioPcmu
        
        Optional<Type> type
        
        The audio format. Always audio/pcmu.
        
        AUDIO_PCMU("audio/pcmu")
      - AudioPcma
        
        Optional<Type> type
        
        The audio format. Always audio/pcma.
        
        AUDIO_PCMA("audio/pcma")
    - Optional<NoiseReduction> noiseReduction
      
      Configuration for input audio noise reduction. This can be set to null to turn off. Noise reduction filters audio added to the input audio buffer before it is sent to VAD and the model. Filtering the audio can improve VAD and turn detection accuracy (reducing false positives) and model performance by improving perception of the input audio.
      - Optional<NoiseReductionType> type
        
        Type of noise reduction. near_field is for close-talking microphones such as headphones, far_field is for far-field microphones such as laptop or conference room microphones.
        
        NEAR_FIELD("near_field")
        
        FAR_FIELD("far_field")
    - Optional<AudioTranscription> transcription
      - Optional<Delay> delay
        
        Controls how long the model waits before emitting transcription text. Higher values can improve transcription accuracy at the cost of latency. Only supported with gpt-realtime-whisper in GA Realtime sessions.
        
        MINIMAL("minimal")
        
        LOW("low")
        
        MEDIUM("medium")
        
        HIGH("high")
        
        XHIGH("xhigh")
      - Optional<String> language
        
        The language of the input audio. Supplying the input language in ISO-639-1 (e.g. en) format will improve accuracy and latency.
      - Optional<Model> model
        
        The model to use for transcription. Current options are whisper-1, gpt-4o-mini-transcribe, gpt-4o-mini-transcribe-2025-12-15, gpt-4o-transcribe, gpt-4o-transcribe-diarize, and gpt-realtime-whisper. Use gpt-4o-transcribe-diarize when you need diarization with speaker labels.
        
        WHISPER_1("whisper-1")
        
        GPT_4O_MINI_TRANSCRIBE("gpt-4o-mini-transcribe")
        
        GPT_4O_MINI_TRANSCRIBE_2025_12_15("gpt-4o-mini-transcribe-2025-12-15")
        
        GPT_4O_TRANSCRIBE("gpt-4o-transcribe")
        
        GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize")
        
        GPT_REALTIME_WHISPER("gpt-realtime-whisper")
      - Optional<String> prompt
        
        An optional text to guide the model's style or continue a previous audio segment. For whisper-1, the prompt is a list of keywords. For gpt-4o-transcribe models (excluding gpt-4o-transcribe-diarize), the prompt is a free text string, for example "expect words related to technology". Prompt is not supported with gpt-realtime-whisper in GA Realtime sessions.
    - Optional<TurnDetection> turnDetection
      
      Configuration for turn detection, either Server VAD or Semantic VAD. This can be set to null to turn off, in which case the client must manually trigger a model response.
      
      Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech.
      
      Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.
      
      For gpt-realtime-whisper transcription sessions, turn detection must be set to null; VAD is not supported.
      - class ServerVad:
        
        Server-side voice activity detection (VAD) which flips on when user speech is detected and off after a period of silence.
        
        JsonValue; type "server_vad"constant
        
        Type of turn detection, server_vad to turn on simple Server VAD.
        
        SERVER_VAD("server_vad")
        
        Optional<Boolean> createResponse
        
        Whether or not to automatically generate a response when a VAD stop event occurs. If interrupt_response is set to false this may fail to create a response if the model is already responding.
        
        If both create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted.
        
        Optional<Long> idleTimeoutMs
        
        Optional timeout after which a model response will be triggered automatically. This is useful for situations in which a long pause from the user is unexpected, such as a phone call. The model will effectively prompt the user to continue the conversation based on the current context.
        
        The timeout value will be applied after the last model response's audio has finished playing, i.e. it's set to the response.done time plus audio playback duration.
        
        An input_audio_buffer.timeout_triggered event (plus events associated with the Response) will be emitted when the timeout is reached. Idle timeout is currently only supported for server_vad mode.
        
        Optional<Boolean> interruptResponse
        
        Whether or not to automatically interrupt (cancel) any ongoing response with output to the default conversation (i.e. conversation of auto) when a VAD start event occurs. If true then the response will be cancelled, otherwise it will continue until complete.
        
        If both create_response and interrupt_response are set to false, the model will never respond automatically but VAD events will still be emitted.
        
        Optional<Long> prefixPaddingMs
        
        Used only for server_vad mode. Amount of audio to include before the VAD detected speech (in milliseconds). Defaults to 300ms.
        
        Optional<Long> silenceDurationMs
        
        Used only for server_vad mode. Duration of silence to detect speech stop (in milliseconds). Defaults to 500ms. With shorter values the model will respond more quickly, but may jump in on short pauses from the user.
        
        Optional<Double> threshold
        
        Used only for server_vad mode. Activation threshold for VAD (0.0 to 1.0). Defaults to 0.5. A higher threshold will require louder audio to activate the model, and thus might perform better in noisy environments.
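To build intuition for how threshold and silence_duration_ms interact, here is a toy volume-threshold detector. It is a sketch only: ToyServerVad, its detect method, the normalized per-frame volume input, and the event strings are all hypothetical, and it is not the server's actual algorithm. Prefix padding is not modeled.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy volume-threshold VAD; a hypothetical illustration, not the server's algorithm. */
final class ToyServerVad {
    /**
     * Scans fixed-length frames of normalized volume (0.0 to 1.0) and returns
     * "speech_started@<ms>" and "speech_stopped@<ms>" events in order.
     */
    static List<String> detect(double[] frameLevels, int frameMs, double threshold, int silenceDurationMs) {
        List<String> events = new ArrayList<>();
        boolean speaking = false;
        int silentMs = 0;
        for (int i = 0; i < frameLevels.length; i++) {
            boolean loud = frameLevels[i] >= threshold;
            if (!speaking && loud) {
                speaking = true;
                silentMs = 0;
                events.add("speech_started@" + (i * frameMs));
            } else if (speaking && loud) {
                silentMs = 0; // A loud frame resets the silence counter.
            } else if (speaking) {
                silentMs += frameMs;
                if (silentMs >= silenceDurationMs) {
                    speaking = false;
                    events.add("speech_stopped@" + ((i + 1) * frameMs));
                }
            }
        }
        return events;
    }

    public static void main(String[] args) {
        // 100 ms frames: one quiet frame, two loud frames, then sustained quiet.
        double[] levels = {0.1, 0.7, 0.8, 0.2, 0.1, 0.1, 0.1, 0.1, 0.1};
        System.out.println(detect(levels, 100, 0.5, 500));
    }
}
```

Lowering silenceDurationMs in this sketch makes the stop event fire sooner, which mirrors the documented trade-off: quicker responses, but a greater chance of cutting in during short pauses.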
      - class SemanticVad:
        
        Server-side semantic turn detection which uses a model to determine when the user has finished speaking.
        
        JsonValue; type "semantic_vad"constant
        
        Type of turn detection, semantic_vad to turn on Semantic VAD.
        
        SEMANTIC_VAD("semantic_vad")
        
        Optional<Boolean> createResponse
        
        Whether or not to automatically generate a response when a VAD stop event occurs.
        
        Optional<Eagerness> eagerness
        
        Used only for semantic_vad mode. The eagerness of the model to respond. low will wait longer for the user to continue speaking, high will respond more quickly. auto is the default and is equivalent to medium. low, medium, and high have max timeouts of 8s, 4s, and 2s respectively.
        
        LOW("low")
        
        MEDIUM("medium")
        
        HIGH("high")
        
        AUTO("auto")
        
        Optional<Boolean> interruptResponse
        
        Whether or not to automatically interrupt any ongoing response with output to the default conversation (i.e. conversation of auto) when a VAD start event occurs.
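The eagerness levels and their documented maximum timeouts can be captured in a small lookup. This is a sketch: EagernessTimeouts and maxTimeoutSeconds are hypothetical helper names, not SDK members.

```java
/** Hypothetical lookup of the documented maximum timeouts per eagerness level. */
final class EagernessTimeouts {
    static int maxTimeoutSeconds(String eagerness) {
        switch (eagerness) {
            case "low":
                return 8;
            case "medium":
            case "auto": // auto is documented as equivalent to medium
                return 4;
            case "high":
                return 2;
            default:
                throw new IllegalArgumentException("Unknown eagerness: " + eagerness);
        }
    }

    public static void main(String[] args) {
        for (String level : new String[] {"low", "medium", "high", "auto"}) {
            System.out.println(level + " -> " + maxTimeoutSeconds(level) + "s");
        }
    }
}
```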
  - Optional<Output> output
    - Optional<RealtimeAudioFormats> format
      
      The format of the output audio.
    - Optional<Double> speed
      
      The speed of the model's spoken response as a multiple of the original speed. 1.0 is the default speed. 0.25 is the minimum speed. 1.5 is the maximum speed. This value can only be changed in between model turns, not while a response is in progress.
      
      This parameter is a post-processing adjustment to the audio after it is generated. It is also possible to prompt the model to speak faster or slower.
    - Optional<Voice> voice
      
      The voice the model uses to respond. Voice cannot be changed during the session once the model has responded with audio at least once. Current voice options are alloy, ash, ballad, coral, echo, sage, shimmer, verse, marin, and cedar. We recommend marin and cedar for best quality.
      - ALLOY("alloy")
      - ASH("ash")
      - BALLAD("ballad")
      - CORAL("coral")
      - ECHO("echo")
      - SAGE("sage")
      - SHIMMER("shimmer")
      - VERSE("verse")
      - MARIN("marin")
      - CEDAR("cedar")
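A client may want to check the speed value against the documented range of 0.25 to 1.5 before sending it. The sketch below assumes a hypothetical OutputSpeed helper; whether to clamp or reject out-of-range values is an application choice, not something the reference prescribes.

```java
/** Hypothetical helper for the documented output speed range; not part of the SDK. */
final class OutputSpeed {
    static final double MIN = 0.25;
    static final double MAX = 1.5;
    static final double DEFAULT = 1.0;

    static boolean isValid(double speed) {
        return speed >= MIN && speed <= MAX; // NaN fails both comparisons.
    }

    /** Pulls an out-of-range value to the nearest documented bound. */
    static double clamp(double speed) {
        if (Double.isNaN(speed)) {
            throw new IllegalArgumentException("speed must be a number");
        }
        return Math.max(MIN, Math.min(MAX, speed));
    }

    public static void main(String[] args) {
        System.out.println(clamp(2.0)); // above the maximum
        System.out.println(clamp(0.1)); // below the minimum
        System.out.println(isValid(DEFAULT));
    }
}
```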
- Optional<Long> expiresAt
  
  Expiration timestamp for the session, in seconds since epoch.
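Because expiresAt is expressed in seconds since the epoch, it maps directly onto java.time.Instant. A minimal sketch, assuming a hypothetical SessionExpiry helper and treating the expiry instant itself as already expired:

```java
import java.time.Instant;

/** Hypothetical helper for interpreting an epoch-seconds expiry; not part of the SDK. */
final class SessionExpiry {
    /** True once the given instant has reached or passed the expiry time. */
    static boolean isExpired(long expiresAtEpochSeconds, Instant now) {
        return !now.isBefore(Instant.ofEpochSecond(expiresAtEpochSeconds));
    }

    public static void main(String[] args) {
        long expiresAt = Instant.now().plusSeconds(60).getEpochSecond();
        System.out.println("Expired now? " + isExpired(expiresAt, Instant.now()));
    }
}
```

Passing the clock value in as a parameter keeps the check easy to test without waiting for real time to pass.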
- Optional<List<Include>> include
  
  Additional fields to include in server outputs.
  
  item.input_audio_transcription.logprobs: Include logprobs for input audio transcription.
  - ITEM_INPUT_AUDIO_TRANSCRIPTION_LOGPROBS("item.input_audio_transcription.logprobs")
- Optional<String> instructions
  
  The default system instructions (i.e. system message) prepended to model calls. This field allows the client to guide the model on desired responses. The model can be instructed on response content and format (e.g. "be extremely succinct", "act friendly", "here are examples of good responses") and on audio behavior (e.g. "talk quickly", "inject emotion into your voice", "laugh frequently"). The instructions are not guaranteed to be followed by the model, but they provide guidance to the model on the desired behavior.
  
  Note that the server sets default instructions which will be used if this field is not set and are visible in the session.created event at the start of the session.
- Optional<MaxOutputTokens> maxOutputTokens
  
  Maximum number of output tokens for a single assistant response, inclusive of tool calls. Provide an integer between 1 and 4096 to limit output tokens, or inf for the maximum available tokens for a given model. Defaults to inf.
  - long
  - JsonValue;
    - INF("inf")
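The two accepted shapes of this field, an integer between 1 and 4096 or the literal inf, can be validated from a raw string before building a request. This is a sketch: MaxOutputTokensSketch and parse are hypothetical names, and an empty OptionalLong is used here to stand for inf.

```java
import java.util.OptionalLong;

/** Hypothetical validator for the max output tokens setting; not part of the SDK. */
final class MaxOutputTokensSketch {
    /** Returns empty for "inf" (no explicit cap), or the validated integer limit. */
    static OptionalLong parse(String raw) {
        if ("inf".equals(raw)) {
            return OptionalLong.empty();
        }
        long value = Long.parseLong(raw); // Throws NumberFormatException for non-numeric input.
        if (value < 1 || value > 4096) {
            throw new IllegalArgumentException("Expected an integer between 1 and 4096, or inf");
        }
        return OptionalLong.of(value);
    }

    public static void main(String[] args) {
        System.out.println(parse("inf"));
        System.out.println(parse("1024"));
    }
}
```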
- Optional<Model> model
  
  The Realtime model used for this session.
  - GPT_REALTIME("gpt-realtime")
  - GPT_REALTIME_1_5("gpt-realtime-1.5")
  - GPT_REALTIME_2("gpt-realtime-2")
  - GPT_REALTIME_2025_08_28("gpt-realtime-2025-08-28")
  - GPT_4O_REALTIME_PREVIEW("gpt-4o-realtime-preview")
  - GPT_4O_REALTIME_PREVIEW_2024_10_01("gpt-4o-realtime-preview-2024-10-01")
  - GPT_4O_REALTIME_PREVIEW_2024_12_17("gpt-4o-realtime-preview-2024-12-17")
  - GPT_4O_REALTIME_PREVIEW_2025_06_03("gpt-4o-realtime-preview-2025-06-03")
  - GPT_4O_MINI_REALTIME_PREVIEW("gpt-4o-mini-realtime-preview")
  - GPT_4O_MINI_REALTIME_PREVIEW_2024_12_17("gpt-4o-mini-realtime-preview-2024-12-17")
  - GPT_REALTIME_MINI("gpt-realtime-mini")
  - GPT_REALTIME_MINI_2025_10_06("gpt-realtime-mini-2025-10-06")
  - GPT_REALTIME_MINI_2025_12_15("gpt-realtime-mini-2025-12-15")
  - GPT_AUDIO_1_5("gpt-audio-1.5")
  - GPT_AUDIO_MINI("gpt-audio-mini")
  - GPT_AUDIO_MINI_2025_10_06("gpt-audio-mini-2025-10-06")
  - GPT_AUDIO_MINI_2025_12_15("gpt-audio-mini-2025-12-15")
- Optional<List<OutputModality>> outputModalities
  
  The set of modalities the model can respond with. It defaults to ["audio"], indicating that the model will respond with audio plus a transcript. ["text"] can be used to make the model respond with text only. It is not possible to request both text and audio at the same time.
  - TEXT("text")
  - AUDIO("audio")
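Since text and audio cannot be requested together, a client-side check can catch invalid combinations early. The sketch below uses a hypothetical OutputModalitiesSketch helper and assumes that a set value must contain exactly one entry, which is an inference from the description rather than a documented rule.

```java
import java.util.List;

/** Hypothetical client-side check for output modalities; not part of the SDK. */
final class OutputModalitiesSketch {
    static List<String> resolve(List<String> requested) {
        if (requested == null) {
            return List.of("audio"); // Documented default when the field is unset.
        }
        // Assumption: exactly one entry is expected, since text and audio cannot be combined.
        if (requested.size() != 1) {
            throw new IllegalArgumentException("Request exactly one of \"text\" or \"audio\"");
        }
        String modality = requested.get(0);
        if (!"text".equals(modality) && !"audio".equals(modality)) {
            throw new IllegalArgumentException("Unknown modality: " + modality);
        }
        return List.of(modality);
    }

    public static void main(String[] args) {
        System.out.println(resolve(null));
        System.out.println(resolve(List.of("text")));
    }
}
```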
- Optional<ResponsePrompt> prompt
  
  Reference to a prompt template and its variables. Learn more.
  - String id
    
    The unique identifier of the prompt template to use.
  - Optional<Variables> variables
    
    Optional map of values to substitute in for variables in your prompt. The substitution values can either be strings, or other Response input types like images or files.
    - String
    - class ResponseInputText:
      
      A text input to the model.
      - String text
        
        The text input to the model.
      - JsonValue; type "input_text"constant
        
        The type of the input item. Always input_text.
        
        INPUT_TEXT("input_text")
    - class ResponseInputImage:
      
      An image input to the model. Learn about image inputs.
      - Detail detail
        
        The detail level of the image to be sent to the model. One of high, low, auto, or original. Defaults to auto.
        
        LOW("low")
        
        HIGH("high")
        
        AUTO("auto")
        
        ORIGINAL("original")
      - JsonValue; type "input_image"constant
        
        The type of the input item. Always input_image.
        
        INPUT_IMAGE("input_image")
      - Optional<String> fileId
        
        The ID of the file to be sent to the model.
      - Optional<String> imageUrl
        
        The URL of the image to be sent to the model. A fully qualified URL or base64 encoded image in a data URL.
    - class ResponseInputFile:
      
      A file input to the model.
      - JsonValue; type "input_file"constant
        
        The type of the input item. Always input_file.
        
        INPUT_FILE("input_file")
      - Optional<Detail> detail
        
        The detail level of the file to be sent to the model. Use low for the default rendering behavior, or high to render the file at higher quality. Defaults to low.
        
        LOW("low")
        
        HIGH("high")
      - Optional<String> fileData
        
        The content of the file to be sent to the model.
      - Optional<String> fileId
        
        The ID of the file to be sent to the model.
      - Optional<String> fileUrl
        
        The URL of the file to be sent to the model.
      - Optional<String> filename
        
        The name of the file to be sent to the model.
  - Optional<String> version
    
    Optional version of the prompt template.
- Optional<RealtimeReasoning> reasoning
  
  Configuration for reasoning-capable Realtime models such as gpt-realtime-2.
  - Optional<RealtimeReasoningEffort> effort
    
    Constrains effort on reasoning for reasoning-capable Realtime models such as gpt-realtime-2.
    - MINIMAL("minimal")
    - LOW("low")
    - MEDIUM("medium")
    - HIGH("high")
    - XHIGH("xhigh")
- Optional<ToolChoice> toolChoice
  
  How the model chooses tools. Provide one of the string modes or force a specific function/MCP tool.
  - enum ToolChoiceOptions:
    
    Controls which (if any) tool is called by the model.
    
    none means the model will not call any tool and instead generates a message.
    
    auto means the model can pick between generating a message or calling one or more tools.
    
    required means the model must call one or more tools.
    - NONE("none")
    - AUTO("auto")
    - REQUIRED("required")
  - class ToolChoiceFunction:
    
    Use this option to force the model to call a specific function.
    - String name
      
      The name of the function to call.
    - JsonValue; type "function"constant
      
      For function calling, the type is always function.
      - FUNCTION("function")
  - class ToolChoiceMcp:
    
    Use this option to force the model to call a specific tool on a remote MCP server.
    - String serverLabel
      
      The label of the MCP server to use.
    - JsonValue; type "mcp"constant
      
      For MCP tools, the type is always mcp.
      - MCP("mcp")
    - Optional<String> name
      
      The name of the tool to call on the server.
- Optional<List<Tool>> tools
  
  Tools available to the model.
  - class RealtimeFunctionTool:
    - Optional<String> description
      
      The description of the function, including guidance on when and how to call it, and guidance about what to tell the user when calling (if anything).
    - Optional<String> name
      
      The name of the function.
    - Optional<JsonValue> parameters
      
      Parameters of the function in JSON Schema.
    - Optional<Type> type
      
      The type of the tool, i.e. function.
      - FUNCTION("function")
  - class McpTool:
    
    Give the model access to additional tools via remote Model Context Protocol (MCP) servers. Learn more about MCP.
    - String serverLabel
      
      A label for this MCP server, used to identify it in tool calls.
    - JsonValue; type "mcp"constant
      
      The type of the MCP tool. Always mcp.
      - MCP("mcp")
    - Optional<AllowedTools> allowedTools
      
      List of allowed tool names or a filter object.
      - List<String>
      - class McpToolFilter:
        
        A filter object to specify which tools are allowed.
        
        Optional<Boolean> readOnly
        
        Indicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with readOnlyHint, it will match this filter.
        
        Optional<List<String>> toolNames
        
        List of allowed tool names.
    - Optional<String> authorization
      
      An OAuth access token that can be used with a remote MCP server, either with a custom MCP server URL or a service connector. Your application must handle the OAuth authorization flow and provide the token here.
    - Optional<ConnectorId> connectorId
      
      Identifier for service connectors, like those available in ChatGPT. One of server_url or connector_id must be provided. Learn more about service connectors here.
      
      Currently supported connector_id values are:
      - Dropbox: connector_dropbox
      - Gmail: connector_gmail
      - Google Calendar: connector_googlecalendar
      - Google Drive: connector_googledrive
      - Microsoft Teams: connector_microsoftteams
      - Outlook Calendar: connector_outlookcalendar
      - Outlook Email: connector_outlookemail
      - SharePoint: connector_sharepoint
      - CONNECTOR_DROPBOX("connector_dropbox")
      - CONNECTOR_GMAIL("connector_gmail")
      - CONNECTOR_GOOGLECALENDAR("connector_googlecalendar")
      - CONNECTOR_GOOGLEDRIVE("connector_googledrive")
      - CONNECTOR_MICROSOFTTEAMS("connector_microsoftteams")
      - CONNECTOR_OUTLOOKCALENDAR("connector_outlookcalendar")
      - CONNECTOR_OUTLOOKEMAIL("connector_outlookemail")
      - CONNECTOR_SHAREPOINT("connector_sharepoint")
    - Optional<Boolean> deferLoading
      
      Whether this MCP tool is deferred and discovered via tool search.
    - Optional<Headers> headers
      
      Optional HTTP headers to send to the MCP server. Use for authentication or other purposes.
    - Optional<RequireApproval> requireApproval
      
      Specify which of the MCP server's tools require approval.
      - class McpToolApprovalFilter:
        
        Specify which of the MCP server's tools require approval. Can be always, never, or a filter object associated with tools that require approval.
        
        Optional<Always> always
        
        A filter object to specify which tools are allowed.
        
        Optional<Boolean> readOnly
        
        Indicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with readOnlyHint, it will match this filter.
        
        Optional<List<String>> toolNames
        
        List of allowed tool names.
        
        Optional<Never> never
        
        A filter object to specify which tools are allowed.
        
        Optional<Boolean> readOnly
        
        Indicates whether or not a tool modifies data or is read-only. If an MCP server is annotated with readOnlyHint, it will match this filter.
        
        Optional<List<String>> toolNames
        
        List of allowed tool names.
      - enum McpToolApprovalSetting:
        
        Specify a single approval policy for all tools. One of always or never. When set to always, all tools will require approval. When set to never, no tools will require approval.
        
        ALWAYS("always")
        
        NEVER("never")
    - Optional<String> serverDescription
      
      Optional description of the MCP server, used to provide more context.
    - Optional<String> serverUrl
      
      The URL for the MCP server. One of server_url or connector_id must be provided.
- Optional<Tracing> tracing
  
  Realtime API can write session traces to the Traces Dashboard. Set to null to disable tracing. Once tracing is enabled for a session, the configuration cannot be modified.
  
  auto will create a trace for the session with default values for the workflow name, group id, and metadata.
  - JsonValue;
    - AUTO("auto")
  - class TracingConfiguration:
    
    Granular configuration for tracing.
    - Optional<String> groupId
      
      The group id to attach to this trace to enable filtering and grouping in the Traces Dashboard.
    - Optional<JsonValue> metadata
      
      The arbitrary metadata to attach to this trace to enable filtering in the Traces Dashboard.
    - Optional<String> workflowName
      
      The name of the workflow to attach to this trace. This is used to name the trace in the Traces Dashboard.
- Optional<RealtimeTruncation> truncation
  
  When the number of tokens in a conversation exceeds the model's input token limit, the conversation will be truncated, meaning messages (starting from the oldest) will not be included in the model's context. A 32k context model with 4,096 max output tokens can only include 28,224 tokens in the context before truncation occurs.
  
  Clients can configure truncation behavior to truncate with a lower max token limit, which is an effective way to control token usage and cost.
  
  Truncation will reduce the number of cached tokens on the next turn (busting the cache), since messages are dropped from the beginning of the context. However, clients can also configure truncation to retain messages up to a fraction of the maximum context size, which will reduce the need for future truncations and thus improve the cache rate.
  
  Truncation can be disabled entirely, which means the server will never truncate but would instead return an error if the conversation exceeds the model's input token limit.
  - RealtimeTruncationStrategy
    - AUTO("auto")
    - DISABLED("disabled")
  - class RealtimeTruncationRetentionRatio:
    
    Retain a fraction of the conversation tokens when the conversation exceeds the input token limit. This allows you to amortize truncations across multiple turns, which can help improve cached token usage.
    - double retentionRatio
      
      Fraction of post-instruction conversation tokens to retain (0.0 - 1.0) when the conversation exceeds the input token limit. Setting this to 0.8 means that messages will be dropped until 80% of the maximum allowed tokens are used. This helps reduce the frequency of truncations and improve cache rates.
    - JsonValue; type "retention_ratio"constant
      
      Use retention ratio truncation.
      - RETENTION_RATIO("retention_ratio")
    - Optional<TokenLimits> tokenLimits
      
      Optional custom token limits for this truncation strategy. If not provided, the model's default token limits will be used.
      - Optional<Long> postInstructions
        
        Maximum tokens allowed in the conversation after instructions (which include tool definitions). For example, setting this to 5,000 would mean that truncation would occur when the conversation exceeds 5,000 tokens after instructions. This cannot be higher than the model's context window size minus the maximum output tokens.
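
The retention-ratio arithmetic described above can be illustrated client-side. This is a minimal sketch, assuming per-message token counts are already known and that the limit applies only to post-instruction conversation tokens. The names TruncationSketch and truncate are hypothetical, and the code does not claim to reproduce the server's actual behavior.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Hypothetical model of retention-ratio truncation; not the server's implementation. */
final class TruncationSketch {
    private TruncationSketch() {}

    /**
     * If the conversation exceeds maxTokens, drops the oldest messages until the
     * total is at or below retentionRatio * maxTokens. Otherwise returns it unchanged.
     */
    static List<Integer> truncate(List<Integer> messageTokens, long maxTokens, double retentionRatio) {
        Deque<Integer> kept = new ArrayDeque<>(messageTokens);
        long total = 0;
        for (int tokens : messageTokens) {
            total += tokens;
        }
        if (total <= maxTokens) {
            return new ArrayList<>(kept); // under the limit: nothing is dropped
        }
        long target = (long) Math.floor(maxTokens * retentionRatio);
        while (!kept.isEmpty() && total > target) {
            total -= kept.removeFirst(); // oldest message goes first
        }
        return new ArrayList<>(kept);
    }

    public static void main(String[] args) {
        // 6,000 tokens of history against a 5,000-token post-instruction limit.
        List<Integer> history = Arrays.asList(2000, 1500, 1000, 800, 700);
        System.out.println(truncate(history, 5000, 0.8));
    }
}
```

With a 5,000-token limit and a ratio of 0.8, the 6,000-token history above is cut back to 4,000 tokens by dropping only the oldest message, which leaves headroom for several more turns before the next truncation.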

Realtime Transcription Session Create Response

class RealtimeTranscriptionSessionCreateResponse:

A Realtime transcription session configuration object.
- String id
  
  Unique identifier for the session that looks like sess_1234567890abcdef.
- String object_
  
  The object type. Always realtime.transcription_session.
- JsonValue; type "transcription"constant
  
  The type of session. Always transcription for transcription sessions.
  - TRANSCRIPTION("transcription")
- Optional<Audio> audio
  
  Configuration for input audio for the session.
  - Optional<Input> input
    - Optional<RealtimeAudioFormats> format
      
      The PCM audio format. Only a 24kHz sample rate is supported.
      - AudioPcm
        
        Optional<Rate> rate
        
        The sample rate of the audio. Always 24000.
        
        _24000(24000)
        
        Optional<Type> type
        
        The audio format. Always audio/pcm.
        
        AUDIO_PCM("audio/pcm")
      - AudioPcmu
        
        Optional<Type> type
        
        The audio format. Always audio/pcmu.
        
        AUDIO_PCMU("audio/pcmu")
      - AudioPcma
        
        Optional<Type> type
        
        The audio format. Always audio/pcma.
        
        AUDIO_PCMA("audio/pcma")
    - Optional<NoiseReduction> noiseReduction
      
      Configuration for input audio noise reduction.
      - Optional<NoiseReductionType> type
        
        Type of noise reduction. near_field is for close-talking microphones such as headphones, far_field is for far-field microphones such as laptop or conference room microphones.
        
        NEAR_FIELD("near_field")
        
        FAR_FIELD("far_field")
    - Optional<AudioTranscription> transcription
      - Optional<Delay> delay
        
        Controls how long the model waits before emitting transcription text. Higher values can improve transcription accuracy at the cost of latency. Only supported with gpt-realtime-whisper in GA Realtime sessions.
        
        MINIMAL("minimal")
        
        LOW("low")
        
        MEDIUM("medium")
        
        HIGH("high")
        
        XHIGH("xhigh")
      - Optional<String> language
        
        The language of the input audio. Supplying the input language in ISO-639-1 (e.g. en) format will improve accuracy and latency.
      - Optional<Model> model
        
        The model to use for transcription. Current options are whisper-1, gpt-4o-mini-transcribe, gpt-4o-mini-transcribe-2025-12-15, gpt-4o-transcribe, gpt-4o-transcribe-diarize, and gpt-realtime-whisper. Use gpt-4o-transcribe-diarize when you need diarization with speaker labels.
        
        WHISPER_1("whisper-1")
        
        GPT_4O_MINI_TRANSCRIBE("gpt-4o-mini-transcribe")
        
        GPT_4O_MINI_TRANSCRIBE_2025_12_15("gpt-4o-mini-transcribe-2025-12-15")
        
        GPT_4O_TRANSCRIBE("gpt-4o-transcribe")
        
        GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize")
        
        GPT_REALTIME_WHISPER("gpt-realtime-whisper")
      - Optional<String> prompt
        
        An optional text to guide the model's style or continue a previous audio segment. For whisper-1, the prompt is a list of keywords. For gpt-4o-transcribe models (excluding gpt-4o-transcribe-diarize), the prompt is a free text string, for example "expect words related to technology". Prompt is not supported with gpt-realtime-whisper in GA Realtime sessions.
    - Optional<RealtimeTranscriptionSessionTurnDetection> turnDetection
      
      Configuration for turn detection. Can be set to null to turn off. Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech. For gpt-realtime-whisper, this must be null; VAD is not supported.
      - Optional<Long> prefixPaddingMs
        
        Amount of audio to include before the VAD detected speech (in milliseconds). Defaults to 300ms.
      - Optional<Long> silenceDurationMs
        
        Duration of silence to detect speech stop (in milliseconds). Defaults to 500ms. With shorter values the model will respond more quickly, but may jump in on short pauses from the user.
      - Optional<Double> threshold
        
        Activation threshold for VAD (0.0 to 1.0). Defaults to 0.5. A higher threshold will require louder audio to activate the model, and thus might perform better in noisy environments.
      - Optional<String> type
        
        Type of turn detection. Only server_vad is currently supported.
- Optional<Long> expiresAt
  
  Expiration timestamp for the session, in seconds since epoch.
- Optional<List<Include>> include
  
  Additional fields to include in server outputs.
  - item.input_audio_transcription.logprobs: Include logprobs for input audio transcription.
  - ITEM_INPUT_AUDIO_TRANSCRIPTION_LOGPROBS("item.input_audio_transcription.logprobs")
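
Because expiresAt is expressed in seconds since the epoch, it converts directly to a java.time.Instant. The sketch below shows one way a client might compute the time left on a session. The class and method names are hypothetical, and the Optional<Long> argument simply stands in for the expiresAt field of the response.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Optional;

/** Hypothetical helper for interpreting an expiresAt value (seconds since epoch). */
final class SessionExpirySketch {
    private SessionExpirySketch() {}

    /**
     * Returns the time left before expiry, Duration.ZERO if the session has already
     * expired, or an empty Optional when no expiry timestamp was provided.
     */
    static Optional<Duration> remaining(Optional<Long> expiresAt, Instant now) {
        return expiresAt.map(seconds -> {
            Duration left = Duration.between(now, Instant.ofEpochSecond(seconds));
            return left.isNegative() ? Duration.ZERO : left;
        });
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        long inOneMinute = now.getEpochSecond() + 60;
        System.out.println(remaining(Optional.of(inOneMinute), now));
    }
}
```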

Realtime Transcription Session Turn Detection

class RealtimeTranscriptionSessionTurnDetection:

Configuration for turn detection. Can be set to null to turn off. Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech. For gpt-realtime-whisper, this must be null; VAD is not supported.
- Optional<Long> prefixPaddingMs
  
  Amount of audio to include before the VAD detected speech (in milliseconds). Defaults to 300ms.
- Optional<Long> silenceDurationMs
  
  Duration of silence to detect speech stop (in milliseconds). Defaults to 500ms. With shorter values the model will respond more quickly, but may jump in on short pauses from the user.
- Optional<Double> threshold
  
  Activation threshold for VAD (0.0 to 1.0). Defaults to 0.5. A higher threshold will require louder audio to activate the model, and thus might perform better in noisy environments.
- Optional<String> type
  
  Type of turn detection. Only server_vad is currently supported.
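
To build intuition for how threshold, prefixPaddingMs, and silenceDurationMs interact, the following is a deliberately simplified, volume-based detector. It is an illustration only: the frame representation (one normalized level per fixed-length frame) and all names are assumptions, and the server's actual VAD is not documented here.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified, hypothetical volume-based VAD; not the server's algorithm. */
final class ServerVadSketch {
    private ServerVadSketch() {}

    /**
     * Scans per-frame levels (0.0 to 1.0) and returns {startMs, endMs} pairs.
     * startMs is pulled back by prefixPaddingMs; endMs is the moment the stop is
     * declared, after silenceDurationMs of continuous quiet frames. Speech still
     * in progress when the input ends is not reported.
     */
    static List<long[]> detect(double[] levels, long frameMs, double threshold,
                               long prefixPaddingMs, long silenceDurationMs) {
        List<long[]> segments = new ArrayList<>();
        long speechStart = -1; // -1 while no speech is in progress
        long silentMs = 0;
        for (int i = 0; i < levels.length; i++) {
            long frameStart = i * frameMs;
            boolean loud = levels[i] >= threshold;
            if (speechStart < 0) {
                if (loud) {
                    speechStart = Math.max(0, frameStart - prefixPaddingMs);
                    silentMs = 0;
                }
            } else if (loud) {
                silentMs = 0; // a pause shorter than the limit is bridged
            } else {
                silentMs += frameMs;
                if (silentMs >= silenceDurationMs) {
                    segments.add(new long[] {speechStart, frameStart + frameMs});
                    speechStart = -1;
                }
            }
        }
        return segments;
    }

    public static void main(String[] args) {
        // 100 ms frames: quiet, speech, a 200 ms pause, more speech, then silence.
        double[] levels = {0.1, 0.1, 0.1, 0.1, 0.1, 0.9, 0.9, 0.9, 0.9,
                           0.1, 0.1, 0.9, 0.9, 0.1, 0.1, 0.1, 0.1, 0.1};
        System.out.println(detect(levels, 100, 0.5, 300, 500).size()); // one turn
        System.out.println(detect(levels, 100, 0.5, 300, 200).size()); // split in two
    }
}
```

With the default-like values (300 ms padding, 500 ms silence) the 200 ms pause is bridged and the input yields a single turn. Lowering the silence duration to 200 ms ends the turn at the pause, which mirrors the trade-off described above: faster responses, but a higher chance of cutting in on short pauses.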

Calls

Accept call

realtime().calls().accept(CallAcceptParams params, RequestOptions requestOptions = RequestOptions.none())

post /realtime/calls/{call_id}/accept

Accept an incoming SIP call and configure the realtime session that will handle it.

Parameters

CallAcceptParams params
- Optional<String> callId
- RealtimeSessionCreateRequest realtimeSessionCreateRequest
  
  Realtime session object configuration.

Example

package com.openai.example;

import com.openai.client.OpenAIClient;
import com.openai.client.okhttp.OpenAIOkHttpClient;
import com.openai.models.realtime.RealtimeSessionCreateRequest;
import com.openai.models.realtime.calls.CallAcceptParams;

public final class Main {
    private Main() {}

    public static void main(String[] args) {
        OpenAIClient client = OpenAIOkHttpClient.fromEnv();

        CallAcceptParams params = CallAcceptParams.builder()
            .callId("call_id")
            .realtimeSessionCreateRequest(RealtimeSessionCreateRequest.builder().build())
            .build();
        client.realtime().calls().accept(params);
    }
}

Hang up call

realtime().calls().hangup(CallHangupParams params = CallHangupParams.none(), RequestOptions requestOptions = RequestOptions.none())

post /realtime/calls/{call_id}/hangup

End an active Realtime API call, whether it was initiated over SIP or WebRTC.

Parameters

CallHangupParams params
- Optional<String> callId

Example

package com.openai.example;

import com.openai.client.OpenAIClient;
import com.openai.client.okhttp.OpenAIOkHttpClient;
import com.openai.models.realtime.calls.CallHangupParams;

public final class Main {
    private Main() {}

    public static void main(String[] args) {
        OpenAIClient client = OpenAIOkHttpClient.fromEnv();

        client.realtime().calls().hangup("call_id");
    }
}

Refer call

realtime().calls().refer(CallReferParams params, RequestOptions requestOptions = RequestOptions.none())

post /realtime/calls/{call_id}/refer

Transfer an active SIP call to a new destination using the SIP REFER verb.

Parameters

CallReferParams params
- Optional<String> callId
- String targetUri
  
  URI that should appear in the SIP Refer-To header. Supports values like tel:+14155550123 or sip:agent@example.com.

Example

package com.openai.example;

import com.openai.client.OpenAIClient;
import com.openai.client.okhttp.OpenAIOkHttpClient;
import com.openai.models.realtime.calls.CallReferParams;

public final class Main {
    private Main() {}

    public static void main(String[] args) {
        OpenAIClient client = OpenAIOkHttpClient.fromEnv();

        CallReferParams params = CallReferParams.builder()
            .callId("call_id")
            .targetUri("tel:+14155550123")
            .build();
        client.realtime().calls().refer(params);
    }
}
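
Before issuing a refer, a client may want to sanity-check the target URI locally. The sketch below uses only java.net.URI and accepts the two schemes shown in the targetUri description (tel and sip). Whether the API accepts other schemes is not stated here, so treat the allow-list as an assumption; the class and method names are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical pre-flight check for a SIP Refer-To target. */
final class ReferTargetSketch {
    private ReferTargetSketch() {}

    /** Returns true when the value parses as a URI with a tel or sip scheme. */
    static boolean looksLikeReferTarget(String targetUri) {
        try {
            String scheme = new URI(targetUri).getScheme();
            return "tel".equalsIgnoreCase(scheme) || "sip".equalsIgnoreCase(scheme);
        } catch (URISyntaxException e) {
            return false; // not a syntactically valid URI
        }
    }

    public static void main(String[] args) {
        System.out.println(looksLikeReferTarget("tel:+14155550123"));
        System.out.println(looksLikeReferTarget("sip:agent@example.com"));
        System.out.println(looksLikeReferTarget("agent@example.com"));
    }
}
```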

Reject call

realtime().calls().reject(CallRejectParams params = CallRejectParams.none(), RequestOptions requestOptions = RequestOptions.none())

post /realtime/calls/{call_id}/reject

Decline an incoming SIP call by returning a SIP status code to the caller.

Parameters

CallRejectParams params
- Optional<String> callId
- Optional<Long> statusCode
  
  SIP response code to send back to the caller. Defaults to 603 (Decline) when omitted.

Example

package com.openai.example;

import com.openai.client.OpenAIClient;
import com.openai.client.okhttp.OpenAIOkHttpClient;
import com.openai.models.realtime.calls.CallRejectParams;

public final class Main {
    private Main() {}

    public static void main(String[] args) {
        OpenAIClient client = OpenAIOkHttpClient.fromEnv();

        client.realtime().calls().reject("call_id");
    }
}
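
The defaulting rule for statusCode can be modeled in a few lines. This is a hedged sketch with hypothetical names: it mirrors the documented default of 603 (Decline) and adds a range check based on the SIP convention that 4xx, 5xx, and 6xx are failure classes. Whether the API enforces such a range is an assumption, not something stated above.

```java
import java.util.Optional;

/** Hypothetical helper mirroring the documented statusCode default. */
final class RejectStatusSketch {
    private RejectStatusSketch() {}

    /** 603 (Decline), the documented default when statusCode is omitted. */
    static final long DEFAULT_STATUS = 603L;

    /** Returns the code that would be sent to the caller. */
    static long effectiveStatus(Optional<Long> statusCode) {
        return statusCode.orElse(DEFAULT_STATUS);
    }

    /** Assumption: only SIP failure classes (4xx to 6xx) make sense for a rejection. */
    static boolean isSipFailureCode(long code) {
        return code >= 400 && code <= 699;
    }

    public static void main(String[] args) {
        System.out.println(effectiveStatus(Optional.empty()));  // default applies
        System.out.println(effectiveStatus(Optional.of(486L))); // 486 is Busy Here
    }
}
```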

Translations

Client Secrets

Sessions

Transcription Sessions

java/resources/realtime/index.md +1848 −101

6 6

7- `class AudioTranscription:`7- `class AudioTranscription:`

8 8

9 - `Optional<Delay> delay`

11 Controls how long the model waits before emitting transcription text.

12 Higher values can improve transcription accuracy at the cost of latency.

13 Only supported with `gpt-realtime-whisper` in GA Realtime sessions.

15 - `MINIMAL("minimal")`

17 - `LOW("low")`

19 - `MEDIUM("medium")`

21 - `HIGH("high")`

23 - `XHIGH("xhigh")`

9 - `Optional<String> language`25 - `Optional<String> language`

10 26

11 The language of the input audio. Supplying the input language in27 The language of the input audio. Supplying the input language in

14 30

15 - `Optional<Model> model`31 - `Optional<Model> model`

16 32

17 The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels.33 The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels.

18 34

19 - `WHISPER_1("whisper-1")`35 - `WHISPER_1("whisper-1")`

20 36

26 42

27 - `GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize")`43 - `GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize")`

28 44

45 - `GPT_REALTIME_WHISPER("gpt-realtime-whisper")`

29 - `Optional<String> prompt`47 - `Optional<String> prompt`

30 48

31 An optional text to guide the model's style or continue a previous audio49 An optional text to guide the model's style or continue a previous audio

32 segment.50 segment.

33 For `whisper-1`, the [prompt is a list of keywords](https://platform.openai.com/docs/guides/speech-to-text#prompting).51 For `whisper-1`, the [prompt is a list of keywords](https://platform.openai.com/docs/guides/speech-to-text#prompting).

34 For `gpt-4o-transcribe` models (excluding `gpt-4o-transcribe-diarize`), the prompt is a free text string, for example "expect words related to technology".52 For `gpt-4o-transcribe` models (excluding `gpt-4o-transcribe-diarize`), the prompt is a free text string, for example "expect words related to technology".

53 Prompt is not supported with `gpt-realtime-whisper` in GA Realtime sessions.

35 54

36### Conversation Created Event55### Conversation Created Event

37 56

3254 3273

3255 Configuration for input audio transcription, defaults to off and can be set to `null` to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through [the /audio/transcriptions endpoint](https://platform.openai.com/docs/api-reference/audio/createTranscription) and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription, these offer additional guidance to the transcription service.3274 Configuration for input audio transcription, defaults to off and can be set to `null` to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through [the /audio/transcriptions endpoint](https://platform.openai.com/docs/api-reference/audio/createTranscription) and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription, these offer additional guidance to the transcription service.

3256 3275

3276 - `Optional<Delay> delay`

3277

3278 Controls how long the model waits before emitting transcription text.

3279 Higher values can improve transcription accuracy at the cost of latency.

3280 Only supported with `gpt-realtime-whisper` in GA Realtime sessions.

3281

3282 - `MINIMAL("minimal")`

3283

3284 - `LOW("low")`

3285

3286 - `MEDIUM("medium")`

3287

3288 - `HIGH("high")`

3289

3290 - `XHIGH("xhigh")`

3291

3257 - `Optional<String> language`3292 - `Optional<String> language`

3258 3293

3259 The language of the input audio. Supplying the input language in3294 The language of the input audio. Supplying the input language in

3262 3297

3263 - `Optional<Model> model`3298 - `Optional<Model> model`

3264 3299

3265 The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels.3300 The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels.

3266 3301

3267 - `WHISPER_1("whisper-1")`3302 - `WHISPER_1("whisper-1")`

3268 3303

3274 3309

3275 - `GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize")`3310 - `GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize")`

3276 3311

3312 - `GPT_REALTIME_WHISPER("gpt-realtime-whisper")`

3313

3277 - `Optional<String> prompt`3314 - `Optional<String> prompt`

3278 3315

3279 An optional text to guide the model's style or continue a previous audio3316 An optional text to guide the model's style or continue a previous audio

3280 segment.3317 segment.

3281 For `whisper-1`, the [prompt is a list of keywords](https://platform.openai.com/docs/guides/speech-to-text#prompting).3318 For `whisper-1`, the [prompt is a list of keywords](https://platform.openai.com/docs/guides/speech-to-text#prompting).

3282 For `gpt-4o-transcribe` models (excluding `gpt-4o-transcribe-diarize`), the prompt is a free text string, for example "expect words related to technology".3319 For `gpt-4o-transcribe` models (excluding `gpt-4o-transcribe-diarize`), the prompt is a free text string, for example "expect words related to technology".

3320 Prompt is not supported with `gpt-realtime-whisper` in GA Realtime sessions.

3283 3321

3284 - `Optional<RealtimeAudioInputTurnDetection> turnDetection`3322 - `Optional<RealtimeAudioInputTurnDetection> turnDetection`

3285 3323

3289 3327

3290 Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.3328 Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.

3291 3329

3330 For `gpt-realtime-whisper` transcription sessions, turn detection must be

3331 set to `null`; VAD is not supported.

3332

3292 - `ServerVad`3333 - `ServerVad`

3293 3334

3294 - `JsonValue; type "server_vad"constant`3335 - `JsonValue; type "server_vad"constant`

3481 3522

3482 Configuration for input audio transcription, defaults to off and can be set to `null` to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through [the /audio/transcriptions endpoint](https://platform.openai.com/docs/api-reference/audio/createTranscription) and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription, these offer additional guidance to the transcription service.3523 Configuration for input audio transcription, defaults to off and can be set to `null` to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through [the /audio/transcriptions endpoint](https://platform.openai.com/docs/api-reference/audio/createTranscription) and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription, these offer additional guidance to the transcription service.

3483 3524

3525 - `Optional<Delay> delay`

3526

3527 Controls how long the model waits before emitting transcription text.

3528 Higher values can improve transcription accuracy at the cost of latency.

3529 Only supported with `gpt-realtime-whisper` in GA Realtime sessions.

3530

3531 - `MINIMAL("minimal")`

3532

3533 - `LOW("low")`

3534

3535 - `MEDIUM("medium")`

3536

3537 - `HIGH("high")`

3538

3539 - `XHIGH("xhigh")`

3540

3484 - `Optional<String> language`3541 - `Optional<String> language`

3485 3542

3486 The language of the input audio. Supplying the input language in3543 The language of the input audio. Supplying the input language in

3489 3546

3490 - `Optional<Model> model`3547 - `Optional<Model> model`

3491 3548

3492 The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels.3549 The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels.

3493 3550

3494 - `WHISPER_1("whisper-1")`3551 - `WHISPER_1("whisper-1")`

3495 3552

3501 3558

3502 - `GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize")`3559 - `GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize")`

3503 3560

3561 - `GPT_REALTIME_WHISPER("gpt-realtime-whisper")`

3562

3504 - `Optional<String> prompt`3563 - `Optional<String> prompt`

3505 3564

3506 An optional text to guide the model's style or continue a previous audio3565 An optional text to guide the model's style or continue a previous audio

3507 segment.3566 segment.

3508 For `whisper-1`, the [prompt is a list of keywords](https://platform.openai.com/docs/guides/speech-to-text#prompting).3567 For `whisper-1`, the [prompt is a list of keywords](https://platform.openai.com/docs/guides/speech-to-text#prompting).

3509 For `gpt-4o-transcribe` models (excluding `gpt-4o-transcribe-diarize`), the prompt is a free text string, for example "expect words related to technology".3568 For `gpt-4o-transcribe` models (excluding `gpt-4o-transcribe-diarize`), the prompt is a free text string, for example "expect words related to technology".

3569 Prompt is not supported with `gpt-realtime-whisper` in GA Realtime sessions.

3510 3570

3511 - `Optional<RealtimeAudioInputTurnDetection> turnDetection`3571 - `Optional<RealtimeAudioInputTurnDetection> turnDetection`

3512 3572

3516 3576

3517 Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.3577 Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.

3518 3578

3579 For `gpt-realtime-whisper` transcription sessions, turn detection must be

3580 set to `null`; VAD is not supported.

3581

3519 - `ServerVad`3582 - `ServerVad`

3520 3583

3521 - `JsonValue; type "server_vad"constant`3584 - `JsonValue; type "server_vad"constant`

3730 3793

3731 Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.3794 Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.

3732 3795

3796 For `gpt-realtime-whisper` transcription sessions, turn detection must be

3797 set to `null`; VAD is not supported.

3798

3733 - `ServerVad`3799 - `ServerVad`

3734 3800

3735 - `JsonValue; type "server_vad"constant`3801 - `JsonValue; type "server_vad"constant`

4671 4737

4672 - `AUDIO("audio")`4738 - `AUDIO("audio")`

4673 4739

4740 - `Optional<Boolean> parallelToolCalls`

4741

4742 Whether the model may call multiple tools in parallel. Only supported by

4743 reasoning Realtime models such as `gpt-realtime-2`.

4744

4674 - `Optional<ResponsePrompt> prompt`4745 - `Optional<ResponsePrompt> prompt`

4675 4746

4676 Reference to a prompt template and its variables.4747 Reference to a prompt template and its variables.

4770 4841

4771 Optional version of the prompt template.4842 Optional version of the prompt template.

4772 4843

4844 - `Optional<RealtimeReasoning> reasoning`

4845

4846 Configuration for reasoning-capable Realtime models such as `gpt-realtime-2`.

4847

4848 - `Optional<RealtimeReasoningEffort> effort`

4849

4850 Constrains effort on reasoning for reasoning-capable Realtime models such as

4851 `gpt-realtime-2`.

4852

4853 - `MINIMAL("minimal")`

4854

4855 - `LOW("low")`

4856

4857 - `MEDIUM("medium")`

4858

4859 - `HIGH("high")`

4860

4861 - `XHIGH("xhigh")`

4862

4773 - `Optional<ToolChoice> toolChoice`4863 - `Optional<ToolChoice> toolChoice`

4774 4864

4775 How the model chooses tools. Provide one of the string modes or force a specific4865 How the model chooses tools. Provide one of the string modes or force a specific

5045 5135

5046 Configuration for input audio transcription, defaults to off and can be set to `null` to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through [the /audio/transcriptions endpoint](https://platform.openai.com/docs/api-reference/audio/createTranscription) and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription, these offer additional guidance to the transcription service.5136 Configuration for input audio transcription, defaults to off and can be set to `null` to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through [the /audio/transcriptions endpoint](https://platform.openai.com/docs/api-reference/audio/createTranscription) and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription, these offer additional guidance to the transcription service.

5047 5137

5138 - `Optional<Delay> delay`

5139

5140 Controls how long the model waits before emitting transcription text.

5141 Higher values can improve transcription accuracy at the cost of latency.

5142 Only supported with `gpt-realtime-whisper` in GA Realtime sessions.

5143

5144 - `MINIMAL("minimal")`

5145

5146 - `LOW("low")`

5147

5148 - `MEDIUM("medium")`

5149

5150 - `HIGH("high")`

5151

5152 - `XHIGH("xhigh")`

5153

5048 - `Optional<String> language`5154 - `Optional<String> language`

5049 5155

5050 The language of the input audio. Supplying the input language in5156 The language of the input audio. Supplying the input language in

5053 5159

5054 - `Optional<Model> model`5160 - `Optional<Model> model`

5055 5161

5056 The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels.5162 The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels.

5057 5163

5058 - `WHISPER_1("whisper-1")`5164 - `WHISPER_1("whisper-1")`

5059 5165

5065 5171

5066 - `GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize")`5172 - `GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize")`

5067 5173

5174 - `GPT_REALTIME_WHISPER("gpt-realtime-whisper")`

5175

5068 5176 - `Optional<String> prompt`

5069 5177

5070 5178 An optional text to guide the model's style or continue a previous audio

5071 5179 segment.

5072 5180 For `whisper-1`, the [prompt is a list of keywords](https://platform.openai.com/docs/guides/speech-to-text#prompting).

5073 5181 For `gpt-4o-transcribe` models (excluding `gpt-4o-transcribe-diarize`), the prompt is a free text string, for example "expect words related to technology".

5182 Prompt is not supported with `gpt-realtime-whisper` in GA Realtime sessions.

5074 5183

5075 5184 - `Optional<RealtimeAudioInputTurnDetection> turnDetection`

5076 5185

5080 5189

5081 5190 Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.

5082 5191

5192 For `gpt-realtime-whisper` transcription sessions, turn detection must be

5193 set to `null`; VAD is not supported.

5194

5083 5195 - `ServerVad`

5084 5196

5085 5197 - `JsonValue; type "server_vad"constant`
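The `gpt-realtime-whisper` notes in this hunk (delay only with that model, no prompt, turn detection set to `null`) can be checked before a session is configured. The following is a minimal sketch under stated assumptions: `TranscriptionConfigCheck` and `violations` are hypothetical names that are not part of the SDK, the model and delay are passed as plain wire strings, and the GA-versus-beta session distinction mentioned in the text is ignored.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Hypothetical helper (not an SDK class) that mirrors the documented gpt-realtime-whisper notes. */
final class TranscriptionConfigCheck {
    static final String REALTIME_WHISPER = "gpt-realtime-whisper";

    /**
     * Returns one message per violated note; an empty list means no conflict was found.
     * turnDetectionEnabled is true when turn detection is configured (that is, not null).
     */
    static List<String> violations(String model, Optional<String> delay,
                                   Optional<String> prompt, boolean turnDetectionEnabled) {
        List<String> problems = new ArrayList<>();
        boolean whisper = REALTIME_WHISPER.equals(model);
        if (delay.isPresent() && !whisper) {
            problems.add("delay is only supported with gpt-realtime-whisper");
        }
        if (whisper && prompt.isPresent()) {
            problems.add("prompt is not supported with gpt-realtime-whisper");
        }
        if (whisper && turnDetectionEnabled) {
            problems.add("turn detection must be null for gpt-realtime-whisper");
        }
        return problems;
    }

    public static void main(String[] args) {
        // Consistent: whisper model with a delay, no prompt, no turn detection.
        System.out.println(violations(REALTIME_WHISPER, Optional.of("low"), Optional.empty(), false));
        // Inconsistent: delay requested on a model other than gpt-realtime-whisper.
        System.out.println(violations("whisper-1", Optional.of("low"), Optional.empty(), false));
    }
}
```

Returning a list rather than throwing on the first problem lets a caller report every conflict at once; whether the server rejects or ignores such combinations is not stated in the text.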

5251 5363

5252 5364 - `GPT_REALTIME_1_5("gpt-realtime-1.5")`

5253 5365

5366 - `GPT_REALTIME_2("gpt-realtime-2")`

5367

5254 5368 - `GPT_REALTIME_2025_08_28("gpt-realtime-2025-08-28")`

5255 5369

5256 5370 - `GPT_4O_REALTIME_PREVIEW("gpt-4o-realtime-preview")`

5289 5403

5290 5404 - `AUDIO("audio")`

5291 5405

5406 - `Optional<Boolean> parallelToolCalls`

5407

5408 Whether the model may call multiple tools in parallel. Only supported by

5409 reasoning Realtime models such as `gpt-realtime-2`.

5410

5292 5411 - `Optional<ResponsePrompt> prompt`

5293 5412

5294 5413 Reference to a prompt template and its variables.

5295 5414 [Learn more](https://platform.openai.com/docs/guides/text?api-mode=responses#reusable-prompts).

5296 5415

5416 - `Optional<RealtimeReasoning> reasoning`

5417

5418 Configuration for reasoning-capable Realtime models such as `gpt-realtime-2`.

5419

5297 5420 - `Optional<RealtimeToolChoiceConfig> toolChoice`

5298 5421

5299 5422 How the model chooses tools. Provide one of the string modes or force a specific
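The added `parallelToolCalls` and `reasoning` fields are both described as applying only to reasoning-capable Realtime models. A small pre-flight check can make that restriction explicit in client code. This is a sketch only: `ReasoningOptionsCheck`, `isAllowed`, and the placeholder model name are hypothetical, the set of reasoning models is an assumption seeded with the one example the text names, and treating any explicitly set value as "using" the option is also an assumption.

```java
import java.util.Set;

/** Hypothetical pre-flight check, not an SDK API. */
final class ReasoningOptionsCheck {
    // Assumption: the caller maintains this set. The text names only gpt-realtime-2 as an example.
    static final Set<String> REASONING_MODELS = Set.of("gpt-realtime-2");

    /**
     * Returns true when the requested options are consistent with the documented restriction.
     * A null argument means the option was left unset.
     */
    static boolean isAllowed(String model, Boolean parallelToolCalls, String reasoningEffort) {
        boolean usesReasoningOnlyOption = parallelToolCalls != null || reasoningEffort != null;
        return !usesReasoningOnlyOption || REASONING_MODELS.contains(model);
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("gpt-realtime-2", true, "low"));
        // "example-non-reasoning-model" is a placeholder, not a real model name.
        System.out.println(isAllowed("example-non-reasoning-model", true, null));
        System.out.println(isAllowed("example-non-reasoning-model", null, null));
    }
}
```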

5570 5693

5571 5694 Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.

5572 5695

5696 For `gpt-realtime-whisper` transcription sessions, turn detection must be

5697 set to `null`; VAD is not supported.

5698

5573 5699 - `ServerVad`

5574 5700

5575 5701 - `JsonValue; type "server_vad"constant`

6235 6361

6236 6362 - `HTTP_ERROR("http_error")`

6237 6363

6364 ### Realtime Reasoning

6365

6366 - `class RealtimeReasoning:`

6367

6368 Configuration for reasoning-capable Realtime models such as `gpt-realtime-2`.

6369

6370 - `Optional<RealtimeReasoningEffort> effort`

6371

6372 Constrains effort on reasoning for reasoning-capable Realtime models such as

6373 `gpt-realtime-2`.

6374

6375 - `MINIMAL("minimal")`

6376

6377 - `LOW("low")`

6378

6379 - `MEDIUM("medium")`

6380

6381 - `HIGH("high")`

6382

6383 - `XHIGH("xhigh")`

6384

6385 ### Realtime Reasoning Effort

6386

6387 - `enum RealtimeReasoningEffort:`

6388

6389 Constrains effort on reasoning for reasoning-capable Realtime models such as

6390 `gpt-realtime-2`.

6391

6392 - `MINIMAL("minimal")`

6393

6394 - `LOW("low")`

6395

6396 - `MEDIUM("medium")`

6397

6398 - `HIGH("high")`

6399

6400 - `XHIGH("xhigh")`

6401
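The `NAME("value")` notation used for `RealtimeReasoningEffort` pairs a Java constant with the string sent on the wire. A minimal sketch of that pattern follows, assuming a stand-in enum named `ReasoningEffortSketch` with hypothetical `wireValue` and `fromWire` members; the SDK's own class may expose different accessors.

```java
import java.util.Arrays;
import java.util.Optional;

/** Illustrative stand-in for the documented enum; names and methods here are assumptions. */
enum ReasoningEffortSketch {
    MINIMAL("minimal"), LOW("low"), MEDIUM("medium"), HIGH("high"), XHIGH("xhigh");

    private final String wireValue;

    ReasoningEffortSketch(String wireValue) {
        this.wireValue = wireValue;
    }

    /** The string form that appears in JSON payloads. */
    String wireValue() {
        return wireValue;
    }

    /** Looks up a constant by its wire string; empty when the value is unknown. */
    static Optional<ReasoningEffortSketch> fromWire(String value) {
        return Arrays.stream(values())
                .filter(e -> e.wireValue.equals(value))
                .findFirst();
    }
}

final class ReasoningEffortDemo {
    public static void main(String[] args) {
        System.out.println(ReasoningEffortSketch.fromWire("xhigh"));
        System.out.println(ReasoningEffortSketch.fromWire("extreme"));
    }
}
```

Returning `Optional` for unknown strings keeps the lookup tolerant of values added in later API versions.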

6238 6402 ### Realtime Response

6239 6403

6240 6404 - `class RealtimeResponse:`

7703 7867

7704 7868 - `AUDIO("audio")`

7705 7869

7870 - `Optional<Boolean> parallelToolCalls`

7871

7872 Whether the model may call multiple tools in parallel. Only supported by

7873 reasoning Realtime models such as `gpt-realtime-2`.

7874

7706 7875 - `Optional<ResponsePrompt> prompt`

7707 7876

7708 7877 Reference to a prompt template and its variables.

7802 7971

7803 7972 Optional version of the prompt template.

7804 7973

7974 - `Optional<RealtimeReasoning> reasoning`

7975

7976 Configuration for reasoning-capable Realtime models such as `gpt-realtime-2`.

7977

7978 - `Optional<RealtimeReasoningEffort> effort`

7979

7980 Constrains effort on reasoning for reasoning-capable Realtime models such as

7981 `gpt-realtime-2`.

7982

7983 - `MINIMAL("minimal")`

7984

7985 - `LOW("low")`

7986

7987 - `MEDIUM("medium")`

7988

7989 - `HIGH("high")`

7990

7991 - `XHIGH("xhigh")`

7992

7805 7993 - `Optional<ToolChoice> toolChoice`

7806 7994

7807 7995 How the model chooses tools. Provide one of the string modes or force a specific

9963 10151

9964 10152 Configuration for input audio transcription. Defaults to off and can be set to `null` to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through [the /audio/transcriptions endpoint](https://platform.openai.com/docs/api-reference/audio/createTranscription) and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service.

9965 10153

10154 - `Optional<Delay> delay`

10155

10156 Controls how long the model waits before emitting transcription text.

10157 Higher values can improve transcription accuracy at the cost of latency.

10158 Only supported with `gpt-realtime-whisper` in GA Realtime sessions.

10159

10160 - `MINIMAL("minimal")`

10161

10162 - `LOW("low")`

10163

10164 - `MEDIUM("medium")`

10165

10166 - `HIGH("high")`

10167

10168 - `XHIGH("xhigh")`

10169

9966 10170 - `Optional<String> language`

9967 10171

9968 10172 The language of the input audio. Supplying the input language in

9971 10175

9972 10176 - `Optional<Model> model`

9973 10177

9974 The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels.

10178 The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels.

9975 10179

9976 10180 - `WHISPER_1("whisper-1")`

9977 10181

9983 10187

9984 10188 - `GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize")`

9985 10189

10190 - `GPT_REALTIME_WHISPER("gpt-realtime-whisper")`

10191

9986 10192 - `Optional<String> prompt`

9987 10193

9988 10194 An optional text to guide the model's style or continue a previous audio

9989 10195 segment.

9990 10196 For `whisper-1`, the [prompt is a list of keywords](https://platform.openai.com/docs/guides/speech-to-text#prompting).

9991 10197 For `gpt-4o-transcribe` models (excluding `gpt-4o-transcribe-diarize`), the prompt is a free text string, for example "expect words related to technology".

10198 Prompt is not supported with `gpt-realtime-whisper` in GA Realtime sessions.

9992 10199

9993 10200 - `Optional<RealtimeAudioInputTurnDetection> turnDetection`

9994 10201

9998 10205

9999 10206 Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.

10000 10207

10208 For `gpt-realtime-whisper` transcription sessions, turn detection must be

10209 set to `null`; VAD is not supported.

10210

10001 10211 - `ServerVad`

10002 10212

10003 10213 - `JsonValue; type "server_vad"constant`

10169 10379

10170 10380 - `GPT_REALTIME_1_5("gpt-realtime-1.5")`

10171 10381

10382 - `GPT_REALTIME_2("gpt-realtime-2")`

10383

10172 10384 - `GPT_REALTIME_2025_08_28("gpt-realtime-2025-08-28")`

10173 10385

10174 10386 - `GPT_4O_REALTIME_PREVIEW("gpt-4o-realtime-preview")`

10207 10419

10208 10420 - `AUDIO("audio")`

10209 10421

10422 - `Optional<Boolean> parallelToolCalls`

10423

10424 Whether the model may call multiple tools in parallel. Only supported by

10425 reasoning Realtime models such as `gpt-realtime-2`.

10426

10210 10427 - `Optional<ResponsePrompt> prompt`

10211 10428

10212 10429 Reference to a prompt template and its variables.

10306 10523

10307 10524 Optional version of the prompt template.

10308 10525

10526 - `Optional<RealtimeReasoning> reasoning`

10527

10528 Configuration for reasoning-capable Realtime models such as `gpt-realtime-2`.

10529

10530 - `Optional<RealtimeReasoningEffort> effort`

10531

10532 Constrains effort on reasoning for reasoning-capable Realtime models such as

10533 `gpt-realtime-2`.

10534

10535 - `MINIMAL("minimal")`

10536

10537 - `LOW("low")`

10538

10539 - `MEDIUM("medium")`

10540

10541 - `HIGH("high")`

10542

10543 - `XHIGH("xhigh")`

10544

10309 10545 - `Optional<RealtimeToolChoiceConfig> toolChoice`

10310 10546

10311 10547 How the model chooses tools. Provide one of the string modes or force a specific

10632 10868

10633 10869 Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.

10634 10870

10871 For `gpt-realtime-whisper` transcription sessions, turn detection must be

10872 set to `null`; VAD is not supported.

10873

10635 10874 - `ServerVad`

10636 10875

10637 10876 - `JsonValue; type "server_vad"constant`

11172 11411

11173 11412 Configuration for input audio transcription. Defaults to off and can be set to `null` to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through [the /audio/transcriptions endpoint](https://platform.openai.com/docs/api-reference/audio/createTranscription) and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service.

11174 11413

11414 - `Optional<Delay> delay`

11415

11416 Controls how long the model waits before emitting transcription text.

11417 Higher values can improve transcription accuracy at the cost of latency.

11418 Only supported with `gpt-realtime-whisper` in GA Realtime sessions.

11419

11420 - `MINIMAL("minimal")`

11421

11422 - `LOW("low")`

11423

11424 - `MEDIUM("medium")`

11425

11426 - `HIGH("high")`

11427

11428 - `XHIGH("xhigh")`

11429

11175 11430 - `Optional<String> language`

11176 11431

11177 11432 The language of the input audio. Supplying the input language in

11180 11435

11181 11436 - `Optional<Model> model`

11182 11437

11183 The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels.

11438 The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels.

11184 11439

11185 11440 - `WHISPER_1("whisper-1")`

11186 11441

11192 11447

11193 11448 - `GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize")`

11194 11449

11450 - `GPT_REALTIME_WHISPER("gpt-realtime-whisper")`

11451

11195 11452 - `Optional<String> prompt`

11196 11453

11197 11454 An optional text to guide the model's style or continue a previous audio

11198 11455 segment.

11199 11456 For `whisper-1`, the [prompt is a list of keywords](https://platform.openai.com/docs/guides/speech-to-text#prompting).

11200 11457 For `gpt-4o-transcribe` models (excluding `gpt-4o-transcribe-diarize`), the prompt is a free text string, for example "expect words related to technology".

11458 Prompt is not supported with `gpt-realtime-whisper` in GA Realtime sessions.

11201 11459

11202 11460 - `Optional<String> instructions`

11203 11461

11466 11724

11467 11725 Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.

11468 11726

11727 For `gpt-realtime-whisper` transcription sessions, turn detection must be

11728 set to `null`; VAD is not supported.

11729

11469 11730 - `class ServerVad:`

11470 11731

11471 11732 Server-side voice activity detection (VAD) which flips on when user speech is detected and off after a period of silence.

11648 11909

11649 11910 Configuration for input audio transcription. Defaults to off and can be set to `null` to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through [the /audio/transcriptions endpoint](https://platform.openai.com/docs/api-reference/audio/createTranscription) and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service.

11650 11911

11912 - `Optional<Delay> delay`

11913

11914 Controls how long the model waits before emitting transcription text.

11915 Higher values can improve transcription accuracy at the cost of latency.

11916 Only supported with `gpt-realtime-whisper` in GA Realtime sessions.

11917

11918 - `MINIMAL("minimal")`

11919

11920 - `LOW("low")`

11921

11922 - `MEDIUM("medium")`

11923

11924 - `HIGH("high")`

11925

11926 - `XHIGH("xhigh")`

11927

11651 11928 - `Optional<String> language`

11652 11929

11653 11930 The language of the input audio. Supplying the input language in

11656 11933

11657 11934 - `Optional<Model> model`

11658 11935

11659 The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels.

11936 The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels.

11660 11937

11661 11938 - `WHISPER_1("whisper-1")`

11662 11939

11668 11945

11669 11946 - `GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize")`

11670 11947

11948 - `GPT_REALTIME_WHISPER("gpt-realtime-whisper")`

11949

11671 11950 - `Optional<String> prompt`

11672 11951

11673 11952 An optional text to guide the model's style or continue a previous audio

11674 11953 segment.

11675 11954 For `whisper-1`, the [prompt is a list of keywords](https://platform.openai.com/docs/guides/speech-to-text#prompting).

11676 11955 For `gpt-4o-transcribe` models (excluding `gpt-4o-transcribe-diarize`), the prompt is a free text string, for example "expect words related to technology".

11956 Prompt is not supported with `gpt-realtime-whisper` in GA Realtime sessions.

11677 11957

11678 11958 - `Optional<RealtimeAudioInputTurnDetection> turnDetection`

11679 11959

11683 11963

11684 11964 Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.

11685 11965

11966 For `gpt-realtime-whisper` transcription sessions, turn detection must be

11967 set to `null`; VAD is not supported.

11968

11686 11969 - `ServerVad`

11687 11970

11688 11971 - `JsonValue; type "server_vad"constant`

11854 12137

11855 12138 - `GPT_REALTIME_1_5("gpt-realtime-1.5")`

11856 12139

12140 - `GPT_REALTIME_2("gpt-realtime-2")`

12141

11857 12142 - `GPT_REALTIME_2025_08_28("gpt-realtime-2025-08-28")`

11858 12143

11859 12144 - `GPT_4O_REALTIME_PREVIEW("gpt-4o-realtime-preview")`

11892 12177

11893 12178 - `AUDIO("audio")`

11894 12179

12180 - `Optional<Boolean> parallelToolCalls`

12181

12182 Whether the model may call multiple tools in parallel. Only supported by

12183 reasoning Realtime models such as `gpt-realtime-2`.

12184

11895 12185 - `Optional<ResponsePrompt> prompt`

11896 12186

11897 12187 Reference to a prompt template and its variables.

11991 12281

11992 12282 Optional version of the prompt template.

11993 12283

12284 - `Optional<RealtimeReasoning> reasoning`

12285

12286 Configuration for reasoning-capable Realtime models such as `gpt-realtime-2`.

12287

12288 - `Optional<RealtimeReasoningEffort> effort`

12289

12290 Constrains effort on reasoning for reasoning-capable Realtime models such as

12291 `gpt-realtime-2`.

12292

12293 - `MINIMAL("minimal")`

12294

12295 - `LOW("low")`

12296

12297 - `MEDIUM("medium")`

12298

12299 - `HIGH("high")`

12300

12301 - `XHIGH("xhigh")`

12302

11994 12303 - `Optional<RealtimeToolChoiceConfig> toolChoice`

11995 12304

11996 12305 How the model chooses tools. Provide one of the string modes or force a specific

12588 12897

12589 12898 Configuration for input audio transcription. Defaults to off and can be set to `null` to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through [the /audio/transcriptions endpoint](https://platform.openai.com/docs/api-reference/audio/createTranscription) and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service.

12590 12899

12900 - `Optional<Delay> delay`

12901

12902 Controls how long the model waits before emitting transcription text.

12903 Higher values can improve transcription accuracy at the cost of latency.

12904 Only supported with `gpt-realtime-whisper` in GA Realtime sessions.

12905

12906 - `MINIMAL("minimal")`

12907

12908 - `LOW("low")`

12909

12910 - `MEDIUM("medium")`

12911

12912 - `HIGH("high")`

12913

12914 - `XHIGH("xhigh")`

12915

12591 12916 - `Optional<String> language`

12592 12917

12593 12918 The language of the input audio. Supplying the input language in

12596 12921

12597 12922 - `Optional<Model> model`

12598 12923

12599 The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels.

12924 The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels.

12600 12925

12601 12926 - `WHISPER_1("whisper-1")`

12602 12927

12608 12933

12609 12934 - `GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize")`

12610 12935

12936 - `GPT_REALTIME_WHISPER("gpt-realtime-whisper")`

12937

12611 12938 - `Optional<String> prompt`

12612 12939

12613 12940 An optional text to guide the model's style or continue a previous audio

12614 12941 segment.

12615 12942 For `whisper-1`, the [prompt is a list of keywords](https://platform.openai.com/docs/guides/speech-to-text#prompting).

12616 12943 For `gpt-4o-transcribe` models (excluding `gpt-4o-transcribe-diarize`), the prompt is a free text string, for example "expect words related to technology".

12944 Prompt is not supported with `gpt-realtime-whisper` in GA Realtime sessions.

12617 12945

12618 12946 - `Optional<RealtimeTranscriptionSessionAudioInputTurnDetection> turnDetection`

12619 12947

12623 12951

12624 12952 Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.

12625 12953

12954 For `gpt-realtime-whisper` transcription sessions, turn detection must be

12955 set to `null`; VAD is not supported.

12956

12626 - `ServerVad`12957 - `ServerVad`

12627 12958

12628 - `JsonValue; type "server_vad"constant`12959 - `JsonValue; type "server_vad"constant`

12760 13091

12761 Configuration for input audio transcription. Defaults to off, and can be set to `null` to turn it off once enabled. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through [the /audio/transcriptions endpoint](https://platform.openai.com/docs/api-reference/audio/createTranscription) and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service.13092 Configuration for input audio transcription. Defaults to off, and can be set to `null` to turn it off once enabled. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through [the /audio/transcriptions endpoint](https://platform.openai.com/docs/api-reference/audio/createTranscription) and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service.

12762 13093

13094 - `Optional<Delay> delay`

13095

13096 Controls how long the model waits before emitting transcription text.

13097 Higher values can improve transcription accuracy at the cost of latency.

13098 Only supported with `gpt-realtime-whisper` in GA Realtime sessions.

13099

13100 - `MINIMAL("minimal")`

13101

13102 - `LOW("low")`

13103

13104 - `MEDIUM("medium")`

13105

13106 - `HIGH("high")`

13107

13108 - `XHIGH("xhigh")`

13109

12763 - `Optional<String> language`13110 - `Optional<String> language`

12764 13111

12765 The language of the input audio. Supplying the input language in13112 The language of the input audio. Supplying the input language in

12768 13115

12769 - `Optional<Model> model`13116 - `Optional<Model> model`

12770 13117

12771 The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels.13118 The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels.

12772 13119

12773 - `WHISPER_1("whisper-1")`13120 - `WHISPER_1("whisper-1")`

12774 13121

12780 13127

12781 - `GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize")`13128 - `GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize")`

12782 13129

13130 - `GPT_REALTIME_WHISPER("gpt-realtime-whisper")`

13131

12783 - `Optional<String> prompt`13132 - `Optional<String> prompt`

12784 13133

12785 An optional text to guide the model's style or continue a previous audio13134 An optional text to guide the model's style or continue a previous audio

12786 segment.13135 segment.

12787 For `whisper-1`, the [prompt is a list of keywords](https://platform.openai.com/docs/guides/speech-to-text#prompting).13136 For `whisper-1`, the [prompt is a list of keywords](https://platform.openai.com/docs/guides/speech-to-text#prompting).

12788 For `gpt-4o-transcribe` models (excluding `gpt-4o-transcribe-diarize`), the prompt is a free text string, for example "expect words related to technology".13137 For `gpt-4o-transcribe` models (excluding `gpt-4o-transcribe-diarize`), the prompt is a free text string, for example "expect words related to technology".

13138 Prompt is not supported with `gpt-realtime-whisper` in GA Realtime sessions.

12789 13139

12790 - `Optional<RealtimeTranscriptionSessionAudioInputTurnDetection> turnDetection`13140 - `Optional<RealtimeTranscriptionSessionAudioInputTurnDetection> turnDetection`

12791 13141

12795 13145

12796 Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.13146 Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.

12797 13147

13148 For `gpt-realtime-whisper` transcription sessions, turn detection must be

13149 set to `null`; VAD is not supported.

13150

12798 - `ServerVad`13151 - `ServerVad`

12799 13152

12800 - `JsonValue; type "server_vad"constant`13153 - `JsonValue; type "server_vad"constant`

12886 13239

12887 Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.13240 Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.

12888 13241

13242 For `gpt-realtime-whisper` transcription sessions, turn detection must be

13243 set to `null`; VAD is not supported.

13244

12889 - `ServerVad`13245 - `ServerVad`

12890 13246

12891 - `JsonValue; type "server_vad"constant`13247 - `JsonValue; type "server_vad"constant`

13037 13393

13038 Configuration for input audio transcription. Defaults to off, and can be set to `null` to turn it off once enabled. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through [the /audio/transcriptions endpoint](https://platform.openai.com/docs/api-reference/audio/createTranscription) and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service.13394 Configuration for input audio transcription. Defaults to off, and can be set to `null` to turn it off once enabled. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through [the /audio/transcriptions endpoint](https://platform.openai.com/docs/api-reference/audio/createTranscription) and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service.

13039 13395

13396 - `Optional<Delay> delay`

13397

13398 Controls how long the model waits before emitting transcription text.

13399 Higher values can improve transcription accuracy at the cost of latency.

13400 Only supported with `gpt-realtime-whisper` in GA Realtime sessions.

13401

13402 - `MINIMAL("minimal")`

13403

13404 - `LOW("low")`

13405

13406 - `MEDIUM("medium")`

13407

13408 - `HIGH("high")`

13409

13410 - `XHIGH("xhigh")`

13411

13040 - `Optional<String> language`13412 - `Optional<String> language`

13041 13413

13042 The language of the input audio. Supplying the input language in13414 The language of the input audio. Supplying the input language in

13045 13417

13046 - `Optional<Model> model`13418 - `Optional<Model> model`

13047 13419

13048 The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels.13420 The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels.

13049 13421

13050 - `WHISPER_1("whisper-1")`13422 - `WHISPER_1("whisper-1")`

13051 13423

13057 13429

13058 - `GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize")`13430 - `GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize")`

13059 13431

13432 - `GPT_REALTIME_WHISPER("gpt-realtime-whisper")`

13433

13060 - `Optional<String> prompt`13434 - `Optional<String> prompt`

13061 13435

13062 An optional text to guide the model's style or continue a previous audio13436 An optional text to guide the model's style or continue a previous audio

13063 segment.13437 segment.

13064 For `whisper-1`, the [prompt is a list of keywords](https://platform.openai.com/docs/guides/speech-to-text#prompting).13438 For `whisper-1`, the [prompt is a list of keywords](https://platform.openai.com/docs/guides/speech-to-text#prompting).

13065 For `gpt-4o-transcribe` models (excluding `gpt-4o-transcribe-diarize`), the prompt is a free text string, for example "expect words related to technology".13439 For `gpt-4o-transcribe` models (excluding `gpt-4o-transcribe-diarize`), the prompt is a free text string, for example "expect words related to technology".

13440 Prompt is not supported with `gpt-realtime-whisper` in GA Realtime sessions.

13066 13441

13067 - `Optional<RealtimeTranscriptionSessionAudioInputTurnDetection> turnDetection`13442 - `Optional<RealtimeTranscriptionSessionAudioInputTurnDetection> turnDetection`

13068 13443

13072 13447

13073 Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.13448 Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.

13074 13449

13450 For `gpt-realtime-whisper` transcription sessions, turn detection must be

13451 set to `null`; VAD is not supported.

13452

13075 - `ServerVad`13453 - `ServerVad`

13076 13454

13077 - `JsonValue; type "server_vad"constant`13455 - `JsonValue; type "server_vad"constant`

13161 13539

13162 - `ITEM_INPUT_AUDIO_TRANSCRIPTION_LOGPROBS("item.input_audio_transcription.logprobs")`13540 - `ITEM_INPUT_AUDIO_TRANSCRIPTION_LOGPROBS("item.input_audio_transcription.logprobs")`

13163 13541

13164### Realtime Truncation13542### Realtime Translation Client Event

13165 13543

13166- `class RealtimeTruncation: A class that can be one of several variants.union`13544- `class RealtimeTranslationClientEvent: A class that can be one of several variants.union`

13167 13545

13168 When the number of tokens in a conversation exceeds the model's input token limit, the conversation be truncated, meaning messages (starting from the oldest) will not be included in the model's context. A 32k context model with 4,096 max output tokens can only include 28,224 tokens in the context before truncation occurs.13546 A Realtime translation client event.

13169 13547

13170 Clients can configure truncation behavior to truncate with a lower max token limit, which is an effective way to control token usage and cost.13548 - `class RealtimeTranslationSessionUpdateEvent:`

13171 13549

13172 Truncation will reduce the number of cached tokens on the next turn (busting the cache), since messages are dropped from the beginning of the context. However, clients can also configure truncation to retain messages up to a fraction of the maximum context size, which will reduce the need for future truncations and thus improve the cache rate.13550 Send this event to update the translation session configuration. Translation

13551 sessions support updates to `audio.output.language`, `audio.input.transcription`,

13552 and `audio.input.noise_reduction`.

13173 13553

13174 Truncation can be disabled entirely, which means the server will never truncate but would instead return an error if the conversation exceeds the model's input token limit.13554 - `RealtimeTranslationSessionUpdateRequest session`

13175 13555

13176 - `RealtimeTruncationStrategy`13556 Translation session fields to update. The session `type` and `model` are set

13557 at creation and cannot be changed with `session.update`.

13177 13558

13178 - `AUTO("auto")`13559 - `Optional<Audio> audio`

13179 13560

13180 - `DISABLED("disabled")`13561 Configuration for translation input and output audio.

13181 13562

13182 - `class RealtimeTruncationRetentionRatio:`13563 - `Optional<Input> input`

13183 13564

13184 Retain a fraction of the conversation tokens when the conversation exceeds the input token limit. This allows you to amortize truncations across multiple turns, which can help improve cached token usage.13565 - `Optional<NoiseReduction> noiseReduction`

13185 13566

13186 - `double retentionRatio`13567 Optional input noise reduction. Set to `null` to disable it.

13187 13568

13188 Fraction of post-instruction conversation tokens to retain (`0.0` - `1.0`) when the conversation exceeds the input token limit. Setting this to `0.8` means that messages will be dropped until 80% of the maximum allowed tokens are used. This helps reduce the frequency of truncations and improve cache rates.13569 - `NoiseReductionType type`

13189 13570

13190 - `JsonValue; type "retention_ratio"constant`13571 Type of noise reduction. `near_field` is for close-talking microphones such as headphones, `far_field` is for far-field microphones such as laptop or conference room microphones.

13191 13572

13192 Use retention ratio truncation.13573 - `NEAR_FIELD("near_field")`

13193 13574

13194 - `RETENTION_RATIO("retention_ratio")`13575 - `FAR_FIELD("far_field")`

13195 13576

13196 - `Optional<TokenLimits> tokenLimits`13577 - `Optional<Transcription> transcription`

13197 13578

13198 Optional custom token limits for this truncation strategy. If not provided, the model's default token limits will be used.13579 Optional source-language transcription. When configured, the server emits

13580 `session.input_transcript.delta` events. Translation itself still runs from

13581 the input audio stream.

13199 13582

13200 - `Optional<Long> postInstructions`13583 - `String model`

13201 13584

13202 Maximum tokens allowed in the conversation after instructions (which including tool definitions). For example, setting this to 5,000 would mean that truncation would occur when the conversation exceeds 5,000 tokens after instructions. This cannot be higher than the model's context window size minus the maximum output tokens.13585 The transcription model to use for source transcript deltas.

13203 13586

13204### Realtime Truncation Retention Ratio13587 - `Optional<Output> output`

13205 13588

13206- `class RealtimeTruncationRetentionRatio:`13589 - `Optional<String> language`

13207 13590

13208 Retain a fraction of the conversation tokens when the conversation exceeds the input token limit. This allows you to amortize truncations across multiple turns, which can help improve cached token usage.13591 Target language for translated output audio and transcript deltas.

13209 13592

13210 - `double retentionRatio`13593 - `JsonValue; type "session.update"constant`

13211 13594

13212 Fraction of post-instruction conversation tokens to retain (`0.0` - `1.0`) when the conversation exceeds the input token limit. Setting this to `0.8` means that messages will be dropped until 80% of the maximum allowed tokens are used. This helps reduce the frequency of truncations and improve cache rates.13595 The event type, must be `session.update`.

13213 13596

13214 - `JsonValue; type "retention_ratio"constant`13597 - `SESSION_UPDATE("session.update")`

13215 13598

13216 Use retention ratio truncation.13599 - `Optional<String> eventId`

13217 13600

13218 - `RETENTION_RATIO("retention_ratio")`13601 Optional client-generated ID used to identify this event.

13219 13602

13220 - `Optional<TokenLimits> tokenLimits`13603 - `class RealtimeTranslationInputAudioBufferAppendEvent:`

13221 13604

13222 Optional custom token limits for this truncation strategy. If not provided, the model's default token limits will be used.13605 Send this event to append audio bytes to the translation session input audio buffer.

13223 13606

13224 - `Optional<Long> postInstructions`13607 WebSocket translation sessions accept base64-encoded 24 kHz PCM16 mono

13608 little-endian raw audio bytes. Unsupported WebSocket audio formats return a

13609 validation error because lower-quality audio materially degrades translation

13610 quality.

13225 13611

13226 Maximum tokens allowed in the conversation after instructions (which including tool definitions). For example, setting this to 5,000 would mean that truncation would occur when the conversation exceeds 5,000 tokens after instructions. This cannot be higher than the model's context window size minus the maximum output tokens.13612 Translation consumes 200 ms engine frames. For best realtime behavior, append

13613 audio in 200 ms chunks. If a chunk is shorter, the server buffers it until it

13614 has enough audio for one frame. If a chunk is longer, the server splits it into

13615 200 ms frames and enqueues them back-to-back.

13227 13616

13228### Response Audio Delta Event13617 Keep appending silence while the session is active. If a client stops sending

13618 audio and later resumes, model time treats the resumed audio as contiguous with

13619 the previous audio rather than as a real-world pause.

13620

13621 - `String audio`

13622

13623 Base64-encoded 24 kHz PCM16 mono audio bytes.

13624

13625 - `JsonValue; type "session.input_audio_buffer.append"constant`

13626

13627 The event type, must be `session.input_audio_buffer.append`.

13628

13629 - `SESSION_INPUT_AUDIO_BUFFER_APPEND("session.input_audio_buffer.append")`

13630

13631 - `Optional<String> eventId`

13632

13633 Optional client-generated ID used to identify this event.

13634

13635 - `class RealtimeTranslationSessionCloseEvent:`

13636

13637 Gracefully close the realtime translation session. The server flushes pending

13638 input audio and emits any remaining translated output before closing the

13639 session.

13640

13641 - `JsonValue; type "session.close"constant`

13642

13643 The event type, must be `session.close`.

13644

13645 - `SESSION_CLOSE("session.close")`

13646

13647 - `Optional<String> eventId`

13648

13649 Optional client-generated ID used to identify this event.

13650

13651### Realtime Translation Client Secret Create Request

13652

13653- `class RealtimeTranslationClientSecretCreateRequest:`

13654

13655 Create a translation session and client secret for the Realtime API.

13656

13657 - `RealtimeTranslationSessionCreateRequest session`

13658

13659 Realtime translation session configuration. Translation sessions stream source

13660 audio in and translated audio plus transcript deltas out continuously.

13661

13662 - `String model`

13663

13664 The Realtime translation model used for this session.

13665

13666 - `Optional<Audio> audio`

13667

13668 Configuration for translation input and output audio.

13669

13670 - `Optional<Input> input`

13671

13672 - `Optional<NoiseReduction> noiseReduction`

13673

13674 Optional input noise reduction. Set to `null` to disable it.

13675

13676 - `NoiseReductionType type`

13677

13678 Type of noise reduction. `near_field` is for close-talking microphones such as headphones, `far_field` is for far-field microphones such as laptop or conference room microphones.

13679

13680 - `NEAR_FIELD("near_field")`

13681

13682 - `FAR_FIELD("far_field")`

13683

13684 - `Optional<Transcription> transcription`

13685

13686 Optional source-language transcription. When configured, the server emits

13687 `session.input_transcript.delta` events. Translation itself still runs from

13688 the input audio stream.

13689

13690 - `String model`

13691

13692 The transcription model to use for source transcript deltas.

13693

13694 - `Optional<Output> output`

13695

13696 - `Optional<String> language`

13697

13698 Target language for translated output audio and transcript deltas.

13699

13700 - `Optional<ExpiresAfter> expiresAfter`

13701

13702 Configuration for the client secret expiration. Expiration refers to the time after which

13703 a client secret will no longer be valid for creating sessions. The session itself may

13704 continue after that time once started. A secret can be used to create multiple sessions

13705 until it expires.

13706

13707 - `Optional<Anchor> anchor`

13708

13709 The anchor point for the client secret expiration, meaning that `seconds` will be added to the `created_at` time of the client secret to produce an expiration timestamp. Only `created_at` is currently supported.

13710

13711 - `CREATED_AT("created_at")`

13712

13713 - `Optional<Long> seconds`

13714

13715 The number of seconds from the anchor point to the expiration. Select a value between `10` and `7200` (2 hours). This defaults to 600 seconds (10 minutes) if not specified.
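
The expiration rule described above (`created_at` plus `seconds`, with a 10 to 7200 second range and a 600 second default) can be sketched as plain arithmetic. This is a minimal illustration only: `ClientSecretExpiry` and `expiresAt` are hypothetical names, not SDK types, and the server remains the authority on the actual `expiresAt` value it returns.

```java
import java.util.Optional;

/** Hypothetical helper for illustration; not part of the SDK. */
final class ClientSecretExpiry {
    static final long MIN_SECONDS = 10;       // lower bound stated in the docs
    static final long MAX_SECONDS = 7200;     // upper bound (2 hours)
    static final long DEFAULT_SECONDS = 600;  // default (10 minutes)

    /**
     * Computes the expected expiration timestamp in seconds since epoch,
     * anchored at the secret's created_at time.
     */
    static long expiresAt(long createdAtEpochSeconds, Optional<Long> seconds) {
        long ttl = seconds.orElse(DEFAULT_SECONDS);
        if (ttl < MIN_SECONDS || ttl > MAX_SECONDS) {
            throw new IllegalArgumentException(
                "seconds must be between " + MIN_SECONDS + " and " + MAX_SECONDS);
        }
        return createdAtEpochSeconds + ttl;
    }

    public static void main(String[] args) {
        long createdAt = 1_700_000_000L; // arbitrary example timestamp
        System.out.println(expiresAt(createdAt, Optional.empty()));
        System.out.println(expiresAt(createdAt, Optional.of(7200L)));
    }
}
```

Validating the range on the client before sending the request is a design choice in this sketch; it only mirrors the documented bounds.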

13716

13717### Realtime Translation Client Secret Create Response

13718

13719- `class RealtimeTranslationClientSecretCreateResponse:`

13720

13721 Response from creating a translation session and client secret for the Realtime API.

13722

13723 - `long expiresAt`

13724

13725 Expiration timestamp for the client secret, in seconds since epoch.

13726

13727 - `RealtimeTranslationSession session`

13728

13729 A Realtime translation session. Translation sessions continuously translate input

13730 audio into the configured output language.

13731

13732 - `String id`

13733

13734 Unique identifier for the session that looks like `sess_1234567890abcdef`.

13735

13736 - `Audio audio`

13737

13738 Configuration for translation input and output audio.

13739

13740 - `Optional<Input> input`

13741

13742 - `Optional<NoiseReduction> noiseReduction`

13743

13744 Optional input noise reduction.

13745

13746 - `NoiseReductionType type`

13747

13748 Type of noise reduction. `near_field` is for close-talking microphones such as headphones, `far_field` is for far-field microphones such as laptop or conference room microphones.

13749

13750 - `NEAR_FIELD("near_field")`

13751

13752 - `FAR_FIELD("far_field")`

13753

13754 - `Optional<Transcription> transcription`

13755

13756 Optional source-language transcription. When configured, the server emits

13757 `session.input_transcript.delta` events. Translation itself still runs from

13758 the input audio stream.

13759

13760 - `String model`

13761

13762 The transcription model used for source transcript deltas.

13763

13764 - `Optional<Output> output`

13765

13766 - `Optional<String> language`

13767

13768 Target language for translated output audio and transcript deltas.

13769

13770 - `long expiresAt`

13771

13772 Expiration timestamp for the session, in seconds since epoch.

13773

13774 - `String model`

13775

13776 The Realtime translation model used for this session. This field is set at

13777 session creation and cannot be changed with `session.update`.

13778

13779 - `JsonValue; type "translation"constant`

13780

13781 The session type. Always `translation` for Realtime translation sessions.

13782

13783 - `TRANSLATION("translation")`

13784

13785 - `String value`

13786

13787 The generated client secret value.

13788

13789### Realtime Translation Input Audio Buffer Append Event

13790

13791- `class RealtimeTranslationInputAudioBufferAppendEvent:`

13792

13793 Send this event to append audio bytes to the translation session input audio buffer.

13794

13795 WebSocket translation sessions accept base64-encoded 24 kHz PCM16 mono

13796 little-endian raw audio bytes. Unsupported WebSocket audio formats return a

13797 validation error because lower-quality audio materially degrades translation

13798 quality.

13799

13800 Translation consumes 200 ms engine frames. For best realtime behavior, append

13801 audio in 200 ms chunks. If a chunk is shorter, the server buffers it until it

13802 has enough audio for one frame. If a chunk is longer, the server splits it into

13803 200 ms frames and enqueues them back-to-back.

13804

13805 Keep appending silence while the session is active. If a client stops sending

13806 audio and later resumes, model time treats the resumed audio as contiguous with

13807 the previous audio rather than as a real-world pause.

13808

13809 - `String audio`

13810

13811 Base64-encoded 24 kHz PCM16 mono audio bytes.

13812

13813 - `JsonValue; type "session.input_audio_buffer.append"constant`

13814

13815 The event type, must be `session.input_audio_buffer.append`.

13816

13817 - `SESSION_INPUT_AUDIO_BUFFER_APPEND("session.input_audio_buffer.append")`

13818

13819 - `Optional<String> eventId`

13820

13821 Optional client-generated ID used to identify this event.
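
The chunking guidance above implies a fixed frame size: 24,000 samples per second, 2 bytes per PCM16 sample, one channel, and 200 ms per frame give 9,600 bytes per frame. The sketch below splits raw PCM16 bytes into base64-encoded 200 ms chunks suitable for the `audio` field. `TranslationAudioFramer` is a hypothetical helper, not an SDK class, and zero-padding the final short chunk with silence is an assumption of this sketch (the server would otherwise buffer a short chunk until more audio arrives).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Base64;
import java.util.List;

/** Hypothetical helper for illustration; not part of the SDK. */
final class TranslationAudioFramer {
    static final int SAMPLE_RATE_HZ = 24_000; // 24 kHz input audio
    static final int BYTES_PER_SAMPLE = 2;    // PCM16 uses 16-bit samples
    static final int CHANNELS = 1;            // mono
    static final int FRAME_MS = 200;          // engine frame duration
    // 24,000 * 2 * 1 * 200 / 1000 = 9,600 bytes per 200 ms frame
    static final int FRAME_BYTES =
        SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS * FRAME_MS / 1000;

    /** Splits raw PCM16 bytes into base64-encoded 200 ms chunks. */
    static List<String> toBase64Frames(byte[] pcm) {
        List<String> frames = new ArrayList<>();
        for (int offset = 0; offset < pcm.length; offset += FRAME_BYTES) {
            // copyOfRange zero-fills past the end of the source array;
            // zero bytes are silence in PCM16 (assumed padding strategy).
            byte[] frame = Arrays.copyOfRange(pcm, offset, offset + FRAME_BYTES);
            frames.add(Base64.getEncoder().encodeToString(frame));
        }
        return frames;
    }

    /** One 200 ms frame of silence, for keeping the stream contiguous. */
    static String silenceFrame() {
        return Base64.getEncoder().encodeToString(new byte[FRAME_BYTES]);
    }

    public static void main(String[] args) {
        byte[] halfSecond = new byte[SAMPLE_RATE_HZ * BYTES_PER_SAMPLE / 2];
        System.out.println(toBase64Frames(halfSecond).size() + " frames");
    }
}
```

Each returned string would be sent as the `audio` value of one `session.input_audio_buffer.append` event; `silenceFrame()` covers the case where the microphone is idle but the session is still active.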

13822

13823### Realtime Translation Input Transcript Delta Event

13824

13825- `class RealtimeTranslationInputTranscriptDeltaEvent:`

13826

13827 Returned when optional source-language transcript text is available. This event

13828 is emitted only when `audio.input.transcription` is configured.

13829

13830 Transcript deltas are append-only text fragments. Clients should not insert

13831 unconditional spaces between deltas.

13832

13833 - `String delta`

13834

13835 Append-only source-language transcript text.

13836

13837 - `String eventId`

13838

13839 The unique ID of the server event.

13840

13841 - `JsonValue; type "session.input_transcript.delta"constant`

13842

13843 The event type, must be `session.input_transcript.delta`.

13844

13845 - `SESSION_INPUT_TRANSCRIPT_DELTA("session.input_transcript.delta")`

13846

13847 - `Optional<Long> elapsedMs`

13848

13849 Timing metadata for stream alignment, derived from the translation frame

13850 when available. It advances in 200 ms increments, but multiple transcript

13851 deltas may share the same `elapsed_ms`. Treat it as alignment metadata,

13852 not a unique transcript-delta identifier.
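
Because transcript deltas are append-only fragments that already carry their own whitespace, a client can accumulate them by plain concatenation. The sketch below shows one way to do that while tracking the most recent `elapsed_ms` for alignment. `TranscriptAccumulator` is a hypothetical class, not an SDK type, and the delta values in `main` are made-up sample data.

```java
/** Hypothetical accumulator for illustration; not part of the SDK. */
final class TranscriptAccumulator {
    private final StringBuilder text = new StringBuilder();
    private long lastElapsedMs = -1; // -1 means no timing metadata seen yet

    /** Appends a delta verbatim; no separator is inserted between deltas. */
    void onDelta(String delta, Long elapsedMs) {
        text.append(delta);
        // elapsedMs is optional and may repeat across deltas, so it is
        // stored only as alignment metadata, never used as a key.
        if (elapsedMs != null) {
            lastElapsedMs = elapsedMs;
        }
    }

    String text() {
        return text.toString();
    }

    long lastElapsedMs() {
        return lastElapsedMs;
    }

    public static void main(String[] args) {
        TranscriptAccumulator acc = new TranscriptAccumulator();
        acc.onDelta("Hel", 200L);
        acc.onDelta("lo", 200L);     // same elapsed_ms as the previous delta
        acc.onDelta(" world", 400L); // the leading space comes from the delta itself
        System.out.println(acc.text());
    }
}
```

Joining the same fragments with a space would produce "Hel lo  world", which is why unconditional separators are discouraged.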

13853

13854### Realtime Translation Output Audio Delta Event

13855

13856- `class RealtimeTranslationOutputAudioDeltaEvent:`

13857

13858 Returned when translated output audio is available. Output audio deltas are

13859 200 ms frames of PCM16 audio.

13860

13861 - `String delta`

13862

13863 Base64-encoded translated audio data.

13864

13865 - `String eventId`

13866

13867 The unique ID of the server event.

13868

13869 - `JsonValue; type "session.output_audio.delta"constant`

13870

13871 The event type, must be `session.output_audio.delta`.

13872

13873 - `SESSION_OUTPUT_AUDIO_DELTA("session.output_audio.delta")`

13874

13875 - `Optional<Long> channels`

13876

13877 Number of audio channels.

13878

13879 - `Optional<Long> elapsedMs`

13880

13881 Timing metadata for stream alignment, derived from the translation frame

13882 when available. Treat `elapsed_ms` as alignment metadata, not a unique

13883 event identifier.

13884

13885 - `Optional<Format> format`

13886

13887 Audio encoding for `delta`.

13888

13889 - `PCM16("pcm16")`

13890

13891 - `Optional<Long> sampleRate`

13892

13893 Sample rate of the audio delta.
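
A client that plays or stores the translated audio needs to turn each base64 `delta` back into samples. The sketch below decodes one delta into 16-bit samples and computes its duration from the sample rate and channel count. `OutputAudioDelta` is a hypothetical helper, not an SDK class, and little-endian byte order is an assumption carried over from the documented input format; the output byte order is not stated here.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Base64;

/** Hypothetical helper for illustration; not part of the SDK. */
final class OutputAudioDelta {
    /** Decodes a base64 PCM16 delta into samples (assumes little-endian). */
    static short[] decodePcm16(String base64Delta) {
        byte[] raw = Base64.getDecoder().decode(base64Delta);
        short[] samples = new short[raw.length / 2]; // 2 bytes per sample
        ByteBuffer.wrap(raw)
            .order(ByteOrder.LITTLE_ENDIAN)
            .asShortBuffer()
            .get(samples);
        return samples;
    }

    /** Duration in milliseconds of interleaved samples at the given rate. */
    static long durationMs(int sampleCount, long sampleRate, long channels) {
        return sampleCount * 1000L / (sampleRate * channels);
    }

    public static void main(String[] args) {
        // 4,800 mono samples at 24 kHz correspond to one 200 ms frame.
        String delta = Base64.getEncoder().encodeToString(new byte[4800 * 2]);
        short[] samples = decodePcm16(delta);
        System.out.println(durationMs(samples.length, 24_000, 1) + " ms");
    }
}
```

When the event carries `sampleRate` and `channels`, those values should be passed through rather than hard-coded, since both fields are optional metadata on the delta.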

13894

13895### Realtime Translation Output Transcript Delta Event

13896

13897- `class RealtimeTranslationOutputTranscriptDeltaEvent:`

13898

13899 Returned when translated transcript text is available.

13900

13901 Transcript deltas are append-only text fragments. Clients should not insert

13902 unconditional spaces between deltas.

13903

13904 - `String delta`

13905

13906 Append-only transcript text for the translated output audio.

13907

13908 - `String eventId`

13909

13910 The unique ID of the server event.

13911

13912 - `JsonValue; type "session.output_transcript.delta"constant`

13913

13914 The event type, must be `session.output_transcript.delta`.

13915

13916 - `SESSION_OUTPUT_TRANSCRIPT_DELTA("session.output_transcript.delta")`

13917

13918 - `Optional<Long> elapsedMs`

13919

13920 Timing metadata for stream alignment, derived from the translation frame

13921 when available. It advances in 200 ms increments, but multiple transcript

13922 deltas may share the same `elapsed_ms`. Treat it as alignment metadata,

13923 not a unique transcript-delta identifier.

13924

13925### Realtime Translation Server Event

13926

13927- `class RealtimeTranslationServerEvent: A class that can be one of several variants.union`

13928

13929 A Realtime translation server event.

13930

13931 - `class RealtimeErrorEvent:`

13932

13933 Returned when an error occurs, which could be a client problem or a server

13934 problem. Most errors are recoverable and the session will stay open; we

13935 recommend that implementors monitor and log error messages by default.

13936

13937 - `RealtimeError error`

13938

13939 Details of the error.

13940

13941 - `String message`

13942

13943 A human-readable error message.

13944

13945 - `String type`

13946

13947 The type of error (e.g., "invalid_request_error", "server_error").

13948

13949 - `Optional<String> code`

13950

13951 Error code, if any.

13952

13953 - `Optional<String> eventId`

13954

13955 The event_id of the client event that caused the error, if applicable.

13956

13957 - `Optional<String> param`

13958

13959 Parameter related to the error, if any.

13960

13961 - `String eventId`

13962

13963 The unique ID of the server event.

13964

13965 - `JsonValue; type "error"constant`

13966

13967 The event type, must be `error`.

13968

13969 - `ERROR("error")`

13970

13971 - `class RealtimeTranslationSessionCreatedEvent:`

13972

13973 Returned when a translation session is created. Emitted automatically when a

13974 new connection is established as the first server event. This event contains

13975 the default translation session configuration.

13976

13977 - `String eventId`

13978

13979 The unique ID of the server event.

13980

13981 - `RealtimeTranslationSession session`

13982

13983 The translation session configuration.

13984

13985 - `String id`

13986

13987 Unique identifier for the session that looks like `sess_1234567890abcdef`.

13988

13989 - `Audio audio`

13990

13991 Configuration for translation input and output audio.

13992

13993 - `Optional<Input> input`

13994

13995 - `Optional<NoiseReduction> noiseReduction`

13996

13997 Optional input noise reduction.

13998

13999 - `NoiseReductionType type`

14000

14001 Type of noise reduction. `near_field` is for close-talking microphones such as headphones, `far_field` is for far-field microphones such as laptop or conference room microphones.

14002

14003 - `NEAR_FIELD("near_field")`

14004

14005 - `FAR_FIELD("far_field")`

14006

14007 - `Optional<Transcription> transcription`

14008

14009 Optional source-language transcription. When configured, the server emits

14010 `session.input_transcript.delta` events. Translation itself still runs from

14011 the input audio stream.

14012

14013 - `String model`

14014

14015 The transcription model used for source transcript deltas.

14016

14017 - `Optional<Output> output`

14018

14019 - `Optional<String> language`

14020

14021 Target language for translated output audio and transcript deltas.

14022

14023 - `long expiresAt`

14024

14025 Expiration timestamp for the session, in seconds since epoch.

14026

14027 - `String model`

14028

14029 The Realtime translation model used for this session. This field is set at

14030 session creation and cannot be changed with `session.update`.

14031

14032 - `JsonValue; type "translation"constant`

14033

14034 The session type. Always `translation` for Realtime translation sessions.

14035

14036 - `TRANSLATION("translation")`

14037

14038 - `JsonValue; type "session.created"constant`

14039

14040 The event type, must be `session.created`.

14041

14042 - `SESSION_CREATED("session.created")`

14043

14044 - `class RealtimeTranslationSessionUpdatedEvent:`

14045

14046 Returned when a translation session is updated with a `session.update` event,

14047 unless there is an error.

14048

14049 - `String eventId`

14050

14051 The unique ID of the server event.

14052

14053 - `RealtimeTranslationSession session`

14054

14055 The translation session configuration.

14056

14057 - `JsonValue; type "session.updated"constant`

14058

14059 The event type, must be `session.updated`.

14060

14061 - `SESSION_UPDATED("session.updated")`

14062

14063 - `class RealtimeTranslationSessionClosedEvent:`

14064

14065 Returned when a realtime translation session is closed.

14066

14067 - `String eventId`

14068

14069 The unique ID of the server event.

14070

14071 - `JsonValue; type "session.closed"constant`

14072

14073 The event type, must be `session.closed`.

14074

14075 - `SESSION_CLOSED("session.closed")`

14076

14077 - `class RealtimeTranslationInputTranscriptDeltaEvent:`

14078

14079 Returned when optional source-language transcript text is available. This event

14080 is emitted only when `audio.input.transcription` is configured.

14081

14082 Transcript deltas are append-only text fragments. Clients should not insert

14083 unconditional spaces between deltas.

14084

14085 - `String delta`

14086

14087 Append-only source-language transcript text.

14088

14089 - `String eventId`

14090

14091 The unique ID of the server event.

14092

14093 - `JsonValue; type "session.input_transcript.delta"constant`

14094

14095 The event type, must be `session.input_transcript.delta`.

14096

14097 - `SESSION_INPUT_TRANSCRIPT_DELTA("session.input_transcript.delta")`

14098

14099 - `Optional<Long> elapsedMs`

14100

14101 Timing metadata for stream alignment, derived from the translation frame

14102 when available. It advances in 200 ms increments, but multiple transcript

14103 deltas may share the same `elapsed_ms`. Treat it as alignment metadata,

14104 not a unique transcript-delta identifier.

14105

14106 - `class RealtimeTranslationOutputTranscriptDeltaEvent:`

14107

14108 Returned when translated transcript text is available.

14109

14110 Transcript deltas are append-only text fragments. Clients should not insert

14111 unconditional spaces between deltas.

14112

14113 - `String delta`

14114

14115 Append-only transcript text for the translated output audio.

14116

14117 - `String eventId`

14118

14119 The unique ID of the server event.

14120

14121 - `JsonValue; type "session.output_transcript.delta"constant`

14122

14123 The event type, must be `session.output_transcript.delta`.

14124

14125 - `SESSION_OUTPUT_TRANSCRIPT_DELTA("session.output_transcript.delta")`

14126

14127 - `Optional<Long> elapsedMs`

14128

14129 Timing metadata for stream alignment, derived from the translation frame

14130 when available. It advances in 200 ms increments, but multiple transcript

14131 deltas may share the same `elapsed_ms`. Treat it as alignment metadata,

14132 not a unique transcript-delta identifier.

14133

14134 - `class RealtimeTranslationOutputAudioDeltaEvent:`

14135

14136 Returned when translated output audio is available. Output audio deltas are

14137 200 ms frames of PCM16 audio.

14138

14139 - `String delta`

14140

14141 Base64-encoded translated audio data.

14142

14143 - `String eventId`

14144

14145 The unique ID of the server event.

14146

14147 - `JsonValue; type "session.output_audio.delta"constant`

14148

14149 The event type, must be `session.output_audio.delta`.

14150

14151 - `SESSION_OUTPUT_AUDIO_DELTA("session.output_audio.delta")`

14152

14153 - `Optional<Long> channels`

14154

14155 Number of audio channels.

14156

14157 - `Optional<Long> elapsedMs`

14158

14159 Timing metadata for stream alignment, derived from the translation frame

14160 when available. Treat `elapsed_ms` as alignment metadata, not a unique

14161 event identifier.

14162

14163 - `Optional<Format> format`

14164

14165 Audio encoding for `delta`.

14166

14167 - `PCM16("pcm16")`

14168

14169 - `Optional<Long> sampleRate`

14170

14171 Sample rate of the audio delta.

14172
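
The delta events above can be consumed along these lines. This is a minimal sketch, not SDK code: the class `TranslationDeltaSketch` and its method names are hypothetical, and it assumes the PCM16 samples are little-endian and that the sample rate is 24,000 Hz mono in the usage example (the event's own `sampleRate` and `channels` fields should be preferred when present).

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;
import java.util.Base64;
import java.util.List;

// Illustrative helper; not part of any SDK.
final class TranslationDeltaSketch {

    // Concatenate transcript deltas exactly as received: no separator is inserted.
    static String joinDeltas(List<String> deltas) {
        StringBuilder transcript = new StringBuilder();
        for (String delta : deltas) {
            transcript.append(delta);
        }
        return transcript.toString();
    }

    // Decode a Base64 audio `delta` into 16-bit samples.
    // Assumption: samples are little-endian.
    static short[] decodePcm16(String base64Delta) {
        byte[] raw = Base64.getDecoder().decode(base64Delta);
        short[] samples = new short[raw.length / 2];
        ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN).asShortBuffer().get(samples);
        return samples;
    }

    // Duration of one decoded frame in milliseconds.
    static long frameDurationMs(int sampleCount, long sampleRate, long channels) {
        return (sampleCount * 1000L) / (sampleRate * channels);
    }

    public static void main(String[] args) {
        System.out.println(joinDeltas(Arrays.asList("Hola", ",", " mundo")));
        // 9,600 bytes = 4,800 samples, which is 200 ms at an assumed 24,000 Hz mono.
        byte[] frame = new byte[9600];
        short[] samples = decodePcm16(Base64.getEncoder().encodeToString(frame));
        System.out.println(frameDurationMs(samples.length, 24_000, 1) + " ms");
    }
}
```

Because several deltas may share one `elapsed_ms` value, the sketch keys nothing on it; ordering comes from arrival order alone.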

### Realtime Translation Session

- `class RealtimeTranslationSession:`

14176

14177 A Realtime translation session. Translation sessions continuously translate input

14178 audio into the configured output language.

14179

14180 - `String id`

14181

14182 Unique identifier for the session that looks like `sess_1234567890abcdef`.

14183

14184 - `Audio audio`

14185

14186 Configuration for translation input and output audio.

14187

14188 - `Optional<Input> input`

14189

14190 - `Optional<NoiseReduction> noiseReduction`

14191

14192 Optional input noise reduction.

14193

14194 - `NoiseReductionType type`

14195

14196 Type of noise reduction. `near_field` is for close-talking microphones such as headphones, `far_field` is for far-field microphones such as laptop or conference room microphones.

14197

14198 - `NEAR_FIELD("near_field")`

14199

14200 - `FAR_FIELD("far_field")`

14201

14202 - `Optional<Transcription> transcription`

14203

14204 Optional source-language transcription. When configured, the server emits

14205 `session.input_transcript.delta` events. Translation itself still runs from

14206 the input audio stream.

14207

14208 - `String model`

14209

14210 The transcription model used for source transcript deltas.

14211

14212 - `Optional<Output> output`

14213

14214 - `Optional<String> language`

14215

14216 Target language for translated output audio and transcript deltas.

14217

14218 - `long expiresAt`

14219

14220 Expiration timestamp for the session, in seconds since epoch.

14221

14222 - `String model`

14223

14224 The Realtime translation model used for this session. This field is set at

14225 session creation and cannot be changed with `session.update`.

14226

14227 - `JsonValue; type "translation"constant`

14228

14229 The session type. Always `translation` for Realtime translation sessions.

14230

14231 - `TRANSLATION("translation")`

14232

### Realtime Translation Session Close Event

- `class RealtimeTranslationSessionCloseEvent:`

14236

14237 Gracefully close the realtime translation session. The server flushes pending

14238 input audio and emits any remaining translated output before closing the

14239 session.

14240

14241 - `JsonValue; type "session.close"constant`

14242

14243 The event type, must be `session.close`.

14244

14245 - `SESSION_CLOSE("session.close")`

14246

14247 - `Optional<String> eventId`

14248

14249 Optional client-generated ID used to identify this event.

14250
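
Since the server flushes pending audio and emits remaining output before closing, a client would likely keep reading events after sending `session.close` until `session.closed` arrives. The sketch below tracks that handshake. The class `GracefulCloseSketch` and its methods are hypothetical, and the JSON text assumes the wire field is named `type`, as the event type strings above suggest.

```java
// Illustrative client-side state tracker; not part of any SDK.
final class GracefulCloseSketch {

    enum State { OPEN, CLOSING, CLOSED }

    private State state = State.OPEN;

    // Returns the JSON text of a `session.close` client event
    // and marks the session as closing.
    String requestClose() {
        state = State.CLOSING;
        return "{\"type\":\"session.close\"}";
    }

    // Feed every server event type here. Returns true once
    // `session.closed` has been seen and the transport can be released.
    boolean onServerEvent(String type) {
        if ("session.closed".equals(type)) {
            state = State.CLOSED;
        }
        return state == State.CLOSED;
    }

    State state() {
        return state;
    }

    public static void main(String[] args) {
        GracefulCloseSketch session = new GracefulCloseSketch();
        System.out.println(session.requestClose());
        // Remaining translated output may still arrive while closing.
        System.out.println(session.onServerEvent("session.output_audio.delta"));
        System.out.println(session.onServerEvent("session.closed"));
    }
}
```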

### Realtime Translation Session Closed Event

- `class RealtimeTranslationSessionClosedEvent:`

14254

14255 Returned when a realtime translation session is closed.

14256

14257 - `String eventId`

14258

14259 The unique ID of the server event.

14260

14261 - `JsonValue; type "session.closed"constant`

14262

14263 The event type, must be `session.closed`.

14264

14265 - `SESSION_CLOSED("session.closed")`

14266

### Realtime Translation Session Create Request

- `class RealtimeTranslationSessionCreateRequest:`

14270

14271 Realtime translation session configuration. Translation sessions stream source

14272 audio in and translated audio plus transcript deltas out continuously.

14273

14274 - `String model`

14275

14276 The Realtime translation model used for this session.

14277

14278 - `Optional<Audio> audio`

14279

14280 Configuration for translation input and output audio.

14281

14282 - `Optional<Input> input`

14283

14284 - `Optional<NoiseReduction> noiseReduction`

14285

14286 Optional input noise reduction. Set to `null` to disable it.

14287

14288 - `NoiseReductionType type`

14289

14290 Type of noise reduction. `near_field` is for close-talking microphones such as headphones, `far_field` is for far-field microphones such as laptop or conference room microphones.

14291

14292 - `NEAR_FIELD("near_field")`

14293

14294 - `FAR_FIELD("far_field")`

14295

14296 - `Optional<Transcription> transcription`

14297

14298 Optional source-language transcription. When configured, the server emits

14299 `session.input_transcript.delta` events. Translation itself still runs from

14300 the input audio stream.

14301

14302 - `String model`

14303

14304 The transcription model to use for source transcript deltas.

14305

14306 - `Optional<Output> output`

14307

14308 - `Optional<String> language`

14309

14310 Target language for translated output audio and transcript deltas.

14311

### Realtime Translation Session Created Event

- `class RealtimeTranslationSessionCreatedEvent:`

14315

14316 Returned when a translation session is created. Emitted automatically when a

14317 new connection is established as the first server event. This event contains

14318 the default translation session configuration.

14319

14320 - `String eventId`

14321

14322 The unique ID of the server event.

14323

14324 - `RealtimeTranslationSession session`

14325

14326 The translation session configuration.

14327

14328 - `String id`

14329

14330 Unique identifier for the session that looks like `sess_1234567890abcdef`.

14331

14332 - `Audio audio`

14333

14334 Configuration for translation input and output audio.

14335

14336 - `Optional<Input> input`

14337

14338 - `Optional<NoiseReduction> noiseReduction`

14339

14340 Optional input noise reduction.

14341

14342 - `NoiseReductionType type`

14343

14344 Type of noise reduction. `near_field` is for close-talking microphones such as headphones, `far_field` is for far-field microphones such as laptop or conference room microphones.

14345

14346 - `NEAR_FIELD("near_field")`

14347

14348 - `FAR_FIELD("far_field")`

14349

14350 - `Optional<Transcription> transcription`

14351

14352 Optional source-language transcription. When configured, the server emits

14353 `session.input_transcript.delta` events. Translation itself still runs from

14354 the input audio stream.

14355

14356 - `String model`

14357

14358 The transcription model used for source transcript deltas.

14359

14360 - `Optional<Output> output`

14361

14362 - `Optional<String> language`

14363

14364 Target language for translated output audio and transcript deltas.

14365

14366 - `long expiresAt`

14367

14368 Expiration timestamp for the session, in seconds since epoch.

14369

14370 - `String model`

14371

14372 The Realtime translation model used for this session. This field is set at

14373 session creation and cannot be changed with `session.update`.

14374

14375 - `JsonValue; type "translation"constant`

14376

14377 The session type. Always `translation` for Realtime translation sessions.

14378

14379 - `TRANSLATION("translation")`

14380

14381 - `JsonValue; type "session.created"constant`

14382

14383 The event type, must be `session.created`.

14384

14385 - `SESSION_CREATED("session.created")`

14386

### Realtime Translation Session Update Event

- `class RealtimeTranslationSessionUpdateEvent:`

14390

14391 Send this event to update the translation session configuration. Translation

14392 sessions support updates to `audio.output.language`, `audio.input.transcription`,

14393 and `audio.input.noise_reduction`.

14394

14395 - `RealtimeTranslationSessionUpdateRequest session`

14396

14397 Translation session fields to update. The session `type` and `model` are set

14398 at creation and cannot be changed with `session.update`.

14399

14400 - `Optional<Audio> audio`

14401

14402 Configuration for translation input and output audio.

14403

14404 - `Optional<Input> input`

14405

14406 - `Optional<NoiseReduction> noiseReduction`

14407

14408 Optional input noise reduction. Set to `null` to disable it.

14409

14410 - `NoiseReductionType type`

14411

14412 Type of noise reduction. `near_field` is for close-talking microphones such as headphones, `far_field` is for far-field microphones such as laptop or conference room microphones.

14413

14414 - `NEAR_FIELD("near_field")`

14415

14416 - `FAR_FIELD("far_field")`

14417

14418 - `Optional<Transcription> transcription`

14419

14420 Optional source-language transcription. When configured, the server emits

14421 `session.input_transcript.delta` events. Translation itself still runs from

14422 the input audio stream.

14423

14424 - `String model`

14425

14426 The transcription model to use for source transcript deltas.

14427

14428 - `Optional<Output> output`

14429

14430 - `Optional<String> language`

14431

14432 Target language for translated output audio and transcript deltas.

14433

14434 - `JsonValue; type "session.update"constant`

14435

14436 The event type, must be `session.update`.

14437

14438 - `SESSION_UPDATE("session.update")`

14439

14440 - `Optional<String> eventId`

14441

14442 Optional client-generated ID used to identify this event.

14443
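
As a rough illustration of the update shapes described above, the sketch below builds two `session.update` payloads as JSON text: one changing `audio.output.language` and one disabling noise reduction with `null`. It does not use the SDK's builders. The class `SessionUpdateSketch`, its method names, the `"es"` language value, and the `evt_1` ID are hypothetical. The `type` and `model` of the session are omitted because they cannot be changed.

```java
// Illustrative payload builder; not part of any SDK.
final class SessionUpdateSketch {

    // Minimal JSON string quoting: escapes backslashes and double quotes only.
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // session.update that changes the target language of the translation.
    static String outputLanguageUpdate(String eventId, String language) {
        return "{\"type\":\"session.update\","
                + "\"event_id\":" + quote(eventId) + ","
                + "\"session\":{\"audio\":{\"output\":{\"language\":"
                + quote(language) + "}}}}";
    }

    // session.update that turns input noise reduction off by sending null.
    static String disableNoiseReduction() {
        return "{\"type\":\"session.update\","
                + "\"session\":{\"audio\":{\"input\":{\"noise_reduction\":null}}}}";
    }

    public static void main(String[] args) {
        System.out.println(outputLanguageUpdate("evt_1", "es"));
        System.out.println(disableNoiseReduction());
    }
}
```

A successful update would be acknowledged by a `session.updated` server event; a failed one by an `error` event whose `eventId` can be matched against the client-generated ID.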

### Realtime Translation Session Update Request

- `class RealtimeTranslationSessionUpdateRequest:`

14447

14448 Realtime translation session fields that can be updated with `session.update`.

14449

14450 - `Optional<Audio> audio`

14451

14452 Configuration for translation input and output audio.

14453

14454 - `Optional<Input> input`

14455

14456 - `Optional<NoiseReduction> noiseReduction`

14457

14458 Optional input noise reduction. Set to `null` to disable it.

14459

14460 - `NoiseReductionType type`

14461

14462 Type of noise reduction. `near_field` is for close-talking microphones such as headphones, `far_field` is for far-field microphones such as laptop or conference room microphones.

14463

14464 - `NEAR_FIELD("near_field")`

14465

14466 - `FAR_FIELD("far_field")`

14467

14468 - `Optional<Transcription> transcription`

14469

14470 Optional source-language transcription. When configured, the server emits

14471 `session.input_transcript.delta` events. Translation itself still runs from

14472 the input audio stream.

14473

14474 - `String model`

14475

14476 The transcription model to use for source transcript deltas.

14477

14478 - `Optional<Output> output`

14479

14480 - `Optional<String> language`

14481

14482 Target language for translated output audio and transcript deltas.

14483

### Realtime Translation Session Updated Event

- `class RealtimeTranslationSessionUpdatedEvent:`

14487

14488 Returned when a translation session is updated with a `session.update` event,

14489 unless there is an error.

14490

14491 - `String eventId`

14492

14493 The unique ID of the server event.

14494

14495 - `RealtimeTranslationSession session`

14496

14497 The translation session configuration.

14498

14499 - `String id`

14500

14501 Unique identifier for the session that looks like `sess_1234567890abcdef`.

14502

14503 - `Audio audio`

14504

14505 Configuration for translation input and output audio.

14506

14507 - `Optional<Input> input`

14508

14509 - `Optional<NoiseReduction> noiseReduction`

14510

14511 Optional input noise reduction.

14512

14513 - `NoiseReductionType type`

14514

14515 Type of noise reduction. `near_field` is for close-talking microphones such as headphones, `far_field` is for far-field microphones such as laptop or conference room microphones.

14516

14517 - `NEAR_FIELD("near_field")`

14518

14519 - `FAR_FIELD("far_field")`

14520

14521 - `Optional<Transcription> transcription`

14522

14523 Optional source-language transcription. When configured, the server emits

14524 `session.input_transcript.delta` events. Translation itself still runs from

14525 the input audio stream.

14526

14527 - `String model`

14528

14529 The transcription model used for source transcript deltas.

14530

14531 - `Optional<Output> output`

14532

14533 - `Optional<String> language`

14534

14535 Target language for translated output audio and transcript deltas.

14536

14537 - `long expiresAt`

14538

14539 Expiration timestamp for the session, in seconds since epoch.

14540

14541 - `String model`

14542

14543 The Realtime translation model used for this session. This field is set at

14544 session creation and cannot be changed with `session.update`.

14545

14546 - `JsonValue; type "translation"constant`

14547

14548 The session type. Always `translation` for Realtime translation sessions.

14549

14550 - `TRANSLATION("translation")`

14551

14552 - `JsonValue; type "session.updated"constant`

14553

14554 The event type, must be `session.updated`.

14555

14556 - `SESSION_UPDATED("session.updated")`

14557

### Realtime Truncation

- `class RealtimeTruncation:` A class that can be one of several variants (union).

14561

14562 When the number of tokens in a conversation exceeds the model's input token limit, the conversation will be truncated, meaning messages (starting from the oldest) will not be included in the model's context. A 32k context model with 4,096 max output tokens can only include 28,224 tokens in the context before truncation occurs.

14563

14564 Clients can configure truncation behavior to truncate with a lower max token limit, which is an effective way to control token usage and cost.

14565

14566 Truncation will reduce the number of cached tokens on the next turn (busting the cache), since messages are dropped from the beginning of the context. However, clients can also configure truncation to retain messages up to a fraction of the maximum context size, which will reduce the need for future truncations and thus improve the cache rate.

14567

14568 Truncation can be disabled entirely, which means the server will never truncate but would instead return an error if the conversation exceeds the model's input token limit.

14569

14570 - `RealtimeTruncationStrategy`

14571

14572 - `AUTO("auto")`

14573

14574 - `DISABLED("disabled")`

14575

14576 - `class RealtimeTruncationRetentionRatio:`

14577

14578 Retain a fraction of the conversation tokens when the conversation exceeds the input token limit. This allows you to amortize truncations across multiple turns, which can help improve cached token usage.

14579

14580 - `double retentionRatio`

14581

14582 Fraction of post-instruction conversation tokens to retain (`0.0` - `1.0`) when the conversation exceeds the input token limit. Setting this to `0.8` means that messages will be dropped until 80% of the maximum allowed tokens are used. This helps reduce the frequency of truncations and improve cache rates.

14583

14584 - `JsonValue; type "retention_ratio"constant`

14585

14586 Use retention ratio truncation.

14587

14588 - `RETENTION_RATIO("retention_ratio")`

14589

14590 - `Optional<TokenLimits> tokenLimits`

14591

14592 Optional custom token limits for this truncation strategy. If not provided, the model's default token limits will be used.

14593

14594 - `Optional<Long> postInstructions`

14595

14596 Maximum tokens allowed in the conversation after instructions (which include tool definitions). For example, setting this to 5,000 would mean that truncation would occur when the conversation exceeds 5,000 tokens after instructions. This cannot be higher than the model's context window size minus the maximum output tokens.

14597

### Realtime Truncation Retention Ratio

- `class RealtimeTruncationRetentionRatio:`

14601

14602 Retain a fraction of the conversation tokens when the conversation exceeds the input token limit. This allows you to amortize truncations across multiple turns, which can help improve cached token usage.

14603

14604 - `double retentionRatio`

14605

14606 Fraction of post-instruction conversation tokens to retain (`0.0` - `1.0`) when the conversation exceeds the input token limit. Setting this to `0.8` means that messages will be dropped until 80% of the maximum allowed tokens are used. This helps reduce the frequency of truncations and improve cache rates.

14607

14608 - `JsonValue; type "retention_ratio"constant`

14609

14610 Use retention ratio truncation.

14611

14612 - `RETENTION_RATIO("retention_ratio")`

14613

14614 - `Optional<TokenLimits> tokenLimits`

14615

14616 Optional custom token limits for this truncation strategy. If not provided, the model's default token limits will be used.

14617

14618 - `Optional<Long> postInstructions`

14619

14620 Maximum tokens allowed in the conversation after instructions (which include tool definitions). For example, setting this to 5,000 would mean that truncation would occur when the conversation exceeds 5,000 tokens after instructions. This cannot be higher than the model's context window size minus the maximum output tokens.

14621
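
The retention arithmetic described above can be sketched as follows. This is a minimal illustration, not the server's actual algorithm: the class `RetentionRatioSketch`, its method name, and the per-message token counts are hypothetical. It assumes whole messages are dropped oldest first until the total is at or below `retentionRatio` times the post-instruction limit.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Illustrative simulation; not part of any SDK.
final class RetentionRatioSketch {

    // messageTokens: token count per message, oldest first.
    // Returns the token counts of the messages that remain in context.
    static List<Integer> truncate(List<Integer> messageTokens,
                                  long postInstructionsLimit,
                                  double retentionRatio) {
        long total = 0;
        for (int tokens : messageTokens) {
            total += tokens;
        }
        // Nothing is dropped while the conversation fits within the limit.
        if (total <= postInstructionsLimit) {
            return new ArrayList<>(messageTokens);
        }
        // Once the limit is exceeded, drop oldest messages down to the retained fraction.
        long target = (long) Math.floor(postInstructionsLimit * retentionRatio);
        Deque<Integer> kept = new ArrayDeque<>(messageTokens);
        while (!kept.isEmpty() && total > target) {
            total -= kept.removeFirst();
        }
        return new ArrayList<>(kept);
    }

    public static void main(String[] args) {
        // 5,500 tokens against a 5,000-token limit with ratio 0.8: the target is 4,000.
        System.out.println(truncate(Arrays.asList(1500, 1500, 1500, 1000), 5000, 0.8));
    }
}
```

Dropping below the limit rather than just to it leaves headroom, so the next few turns can be appended without another truncation, which is what keeps the cached prefix stable for longer.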

### Response Audio Delta Event

- `class ResponseAudioDeltaEvent:`

14625

15503

15504 - `TEXT("text")`

15505

15506 - `AUDIO("audio")`

15507

15508 - `Optional<Boolean> parallelToolCalls`

15509

15510 Whether the model may call multiple tools in parallel. Only supported by

15511 reasoning Realtime models such as `gpt-realtime-2`.

15512

15513 - `Optional<ResponsePrompt> prompt`

15514

15609

15610 Optional version of the prompt template.

15611

15612 - `Optional<RealtimeReasoning> reasoning`

15613

15614 Configuration for reasoning-capable Realtime models such as `gpt-realtime-2`.

15615

15616 - `Optional<RealtimeReasoningEffort> effort`

15617

15618 Constrains effort on reasoning for reasoning-capable Realtime models such as

15619 `gpt-realtime-2`.

15620

15621 - `MINIMAL("minimal")`

15622

15623 - `LOW("low")`

15624

15625 - `MEDIUM("medium")`

15626

15627 - `HIGH("high")`

15628

15629 - `XHIGH("xhigh")`

15630

15631 - `Optional<ToolChoice> toolChoice`

15632

15633 How the model chooses tools. Provide one of the string modes or force a specific

18514

18515 Configuration for input audio transcription. Defaults to off and can be set to `null` to turn it off once enabled. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through [the /audio/transcriptions endpoint](https://platform.openai.com/docs/api-reference/audio/createTranscription) and should be treated as guidance about the input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription; these offer additional guidance to the transcription service.

18516

18517 - `Optional<Delay> delay`

18518

18519 Controls how long the model waits before emitting transcription text.

18520 Higher values can improve transcription accuracy at the cost of latency.

18521 Only supported with `gpt-realtime-whisper` in GA Realtime sessions.

18522

18523 - `MINIMAL("minimal")`

18524

18525 - `LOW("low")`

18526

18527 - `MEDIUM("medium")`

18528

18529 - `HIGH("high")`

18530

18531 - `XHIGH("xhigh")`

18532

18533 - `Optional<String> language`

18534

18535 The language of the input audio. Supplying the input language in

18538

18539 - `Optional<Model> model`

18540

18541 The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels.

18542

18543 - `WHISPER_1("whisper-1")`

18544

18550

18551 - `GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize")`

18552

18553 - `GPT_REALTIME_WHISPER("gpt-realtime-whisper")`

18554

18555 - `Optional<String> prompt`

18556

18557 An optional text to guide the model's style or continue a previous audio

18558 segment.

18559 For `whisper-1`, the [prompt is a list of keywords](https://platform.openai.com/docs/guides/speech-to-text#prompting).

18560 For `gpt-4o-transcribe` models (excluding `gpt-4o-transcribe-diarize`), the prompt is a free text string, for example "expect words related to technology".

18561 Prompt is not supported with `gpt-realtime-whisper` in GA Realtime sessions.

18562

18563 - `Optional<RealtimeAudioInputTurnDetection> turnDetection`

18564

18568

18569 Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.

18570

18571 For `gpt-realtime-whisper` transcription sessions, turn detection must be

18572 set to `null`; VAD is not supported.

18573

18574 - `ServerVad`

18575

18576 - `JsonValue; type "server_vad"constant`

18742

18743 - `GPT_REALTIME_1_5("gpt-realtime-1.5")`

18744

18745 - `GPT_REALTIME_2("gpt-realtime-2")`

18746

18747 - `GPT_REALTIME_2025_08_28("gpt-realtime-2025-08-28")`

18748

18749 - `GPT_4O_REALTIME_PREVIEW("gpt-4o-realtime-preview")`

18782

18783 - `AUDIO("audio")`

18784

18785 - `Optional<Boolean> parallelToolCalls`

18786

18787 Whether the model may call multiple tools in parallel. Only supported by

18788 reasoning Realtime models such as `gpt-realtime-2`.

18789

18790 - `Optional<ResponsePrompt> prompt`

18791

18792 Reference to a prompt template and its variables.

18886

18887 Optional version of the prompt template.

18888

18889 - `Optional<RealtimeReasoning> reasoning`

18890

18891 Configuration for reasoning-capable Realtime models such as `gpt-realtime-2`.

18892

18893 - `Optional<RealtimeReasoningEffort> effort`

18894

18895 Constrains effort on reasoning for reasoning-capable Realtime models such as

18896 `gpt-realtime-2`.

18897

18898 - `MINIMAL("minimal")`

18899

18900 - `LOW("low")`

18901

18902 - `MEDIUM("medium")`

18903

18904 - `HIGH("high")`

18905

18906 - `XHIGH("xhigh")`

18907

18908 - `Optional<RealtimeToolChoiceConfig> toolChoice`

18909

18910 How the model chooses tools. Provide one of the string modes or force a specific

19231

19232 Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.

19233

19234 For `gpt-realtime-whisper` transcription sessions, turn detection must be

19235 set to `null`; VAD is not supported.

19236

17768 - `ServerVad`19237 - `ServerVad`

17769 19238

17770 - `JsonValue; type "server_vad"constant`19239 - `JsonValue; type "server_vad"constant`

17947 19416

17948 19417 Configuration for input audio transcription, defaults to off and can be set to `null` to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through [the /audio/transcriptions endpoint](https://platform.openai.com/docs/api-reference/audio/createTranscription) and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription, these offer additional guidance to the transcription service.

17949 19418

19419 - `Optional<Delay> delay`

19420

19421 Controls how long the model waits before emitting transcription text.

19422 Higher values can improve transcription accuracy at the cost of latency.

19423 Only supported with `gpt-realtime-whisper` in GA Realtime sessions.

19424

19425 - `MINIMAL("minimal")`

19426

19427 - `LOW("low")`

19428

19429 - `MEDIUM("medium")`

19430

19431 - `HIGH("high")`

19432

19433 - `XHIGH("xhigh")`

19434

17950 19435 - `Optional<String> language`

17951 19436

17952 19437 The language of the input audio. Supplying the input language in

17955 19440

17956 19441 - `Optional<Model> model`

17957 19442

17958 The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels.

19443 The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels.

17959 19444

17960 19445 - `WHISPER_1("whisper-1")`

17961 19446

17967 19452

17968 19453 - `GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize")`

17969 19454

19455 - `GPT_REALTIME_WHISPER("gpt-realtime-whisper")`

19456

17970 19457 - `Optional<String> prompt`

17971 19458

17972 19459 An optional text to guide the model's style or continue a previous audio

17973 19460 segment.

17974 19461 For `whisper-1`, the [prompt is a list of keywords](https://platform.openai.com/docs/guides/speech-to-text#prompting).

17975 19462 For `gpt-4o-transcribe` models (excluding `gpt-4o-transcribe-diarize`), the prompt is a free text string, for example "expect words related to technology".

19463 Prompt is not supported with `gpt-realtime-whisper` in GA Realtime sessions.
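The added notes place three constraints on `gpt-realtime-whisper`: `delay` is only supported with it, `prompt` is not supported with it, and turn detection must be `null`. A minimal sketch of checking those rules before sending a configuration follows. `TranscriptionSettings` and `TranscriptionChecks` are hypothetical helper types, not SDK classes, and the sketch assumes a GA Realtime transcription session rather than modelling session type.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical plain-data view of the documented fields; null means "not set".
final class TranscriptionSettings {
    final String model;         // e.g. "whisper-1" or "gpt-realtime-whisper"
    final String delay;         // e.g. "low"
    final String prompt;        // free text or keyword list
    final String turnDetection; // e.g. "server_vad"; null when disabled

    TranscriptionSettings(String model, String delay, String prompt, String turnDetection) {
        this.model = model;
        this.delay = delay;
        this.prompt = prompt;
        this.turnDetection = turnDetection;
    }
}

final class TranscriptionChecks {
    static final String REALTIME_WHISPER = "gpt-realtime-whisper";

    // Returns one message per documented constraint that the settings violate.
    static List<String> violations(TranscriptionSettings s) {
        List<String> problems = new ArrayList<>();
        boolean realtimeWhisper = REALTIME_WHISPER.equals(s.model);
        if (s.delay != null && !realtimeWhisper) {
            problems.add("delay is only supported with " + REALTIME_WHISPER);
        }
        if (s.prompt != null && realtimeWhisper) {
            problems.add("prompt is not supported with " + REALTIME_WHISPER);
        }
        if (s.turnDetection != null && realtimeWhisper) {
            problems.add("turn detection must be null with " + REALTIME_WHISPER);
        }
        return problems;
    }

    public static void main(String[] args) {
        TranscriptionSettings settings =
                new TranscriptionSettings("whisper-1", "low", "ZyntriQix, Digique Plus", "server_vad");
        // Prints each violated constraint on its own line.
        violations(settings).forEach(System.out::println);
    }
}
```

Collecting every violation, instead of failing on the first, lets a caller report all configuration problems in one pass.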

17976 19464

17977 19465 - `Optional<RealtimeAudioInputTurnDetection> turnDetection`

17978 19466

17982 19470

17983 19471 Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.

17984 19472

19473 For `gpt-realtime-whisper` transcription sessions, turn detection must be

19474 set to `null`; VAD is not supported.

19475

17985 19476 - `ServerVad`

17986 19477

17987 19478 - `JsonValue; type "server_vad" constant`

18153 19644

18154 19645 - `GPT_REALTIME_1_5("gpt-realtime-1.5")`

18155 19646

19647 - `GPT_REALTIME_2("gpt-realtime-2")`

19648

18156 19649 - `GPT_REALTIME_2025_08_28("gpt-realtime-2025-08-28")`

18157 19650

18158 19651 - `GPT_4O_REALTIME_PREVIEW("gpt-4o-realtime-preview")`

18191 19684

18192 19685 - `AUDIO("audio")`

18193 19686

19687 - `Optional<Boolean> parallelToolCalls`

19688

19689 Whether the model may call multiple tools in parallel. Only supported by

19690 reasoning Realtime models such as `gpt-realtime-2`.

19691

18194 19692 - `Optional<ResponsePrompt> prompt`

18195 19693

18196 19694 Reference to a prompt template and its variables.

18290 19788

18291 19789 Optional version of the prompt template.

18292 19790

19791 - `Optional<RealtimeReasoning> reasoning`

19792

19793 Configuration for reasoning-capable Realtime models such as `gpt-realtime-2`.

19794

19795 - `Optional<RealtimeReasoningEffort> effort`

19796

19797 Constrains effort on reasoning for reasoning-capable Realtime models such as

19798 `gpt-realtime-2`.

19799

19800 - `MINIMAL("minimal")`

19801

19802 - `LOW("low")`

19803

19804 - `MEDIUM("medium")`

19805

19806 - `HIGH("high")`

19807

19808 - `XHIGH("xhigh")`

19809

18293 19810 - `Optional<RealtimeToolChoiceConfig> toolChoice`

18294 19811

18295 19812 How the model chooses tools. Provide one of the string modes or force a specific

18616 20133

18617 20134 Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.

18618 20135

20136 For `gpt-realtime-whisper` transcription sessions, turn detection must be

20137 set to `null`; VAD is not supported.

20138

18619 20139 - `ServerVad`

18620 20140

18621 20141 - `JsonValue; type "server_vad" constant`

18798 20318

18799 20319 Configuration for input audio transcription, defaults to off and can be set to `null` to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through [the /audio/transcriptions endpoint](https://platform.openai.com/docs/api-reference/audio/createTranscription) and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription, these offer additional guidance to the transcription service.

18800 20320

20321 - `Optional<Delay> delay`

20322

20323 Controls how long the model waits before emitting transcription text.

20324 Higher values can improve transcription accuracy at the cost of latency.

20325 Only supported with `gpt-realtime-whisper` in GA Realtime sessions.

20326

20327 - `MINIMAL("minimal")`

20328

20329 - `LOW("low")`

20330

20331 - `MEDIUM("medium")`

20332

20333 - `HIGH("high")`

20334

20335 - `XHIGH("xhigh")`

20336

18801 20337 - `Optional<String> language`

18802 20338

18803 20339 The language of the input audio. Supplying the input language in

18806 20342

18807 20343 - `Optional<Model> model`

18808 20344

18809 The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels.

20345 The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels.

18810 20346

18811 20347 - `WHISPER_1("whisper-1")`

18812 20348

18818 20354

18819 20355 - `GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize")`

18820 20356

20357 - `GPT_REALTIME_WHISPER("gpt-realtime-whisper")`

20358

18821 20359 - `Optional<String> prompt`

18822 20360

18823 20361 An optional text to guide the model's style or continue a previous audio

18824 20362 segment.

18825 20363 For `whisper-1`, the [prompt is a list of keywords](https://platform.openai.com/docs/guides/speech-to-text#prompting).

18826 20364 For `gpt-4o-transcribe` models (excluding `gpt-4o-transcribe-diarize`), the prompt is a free text string, for example "expect words related to technology".

20365 Prompt is not supported with `gpt-realtime-whisper` in GA Realtime sessions.

18827 20366

18828 20367 - `Optional<RealtimeAudioInputTurnDetection> turnDetection`

18829 20368

18833 20372

18834 20373 Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.

18835 20374

20375 For `gpt-realtime-whisper` transcription sessions, turn detection must be

20376 set to `null`; VAD is not supported.

20377

18836 20378 - `ServerVad`

18837 20379

18838 20380 - `JsonValue; type "server_vad" constant`

19004 20546

19005 20547 - `GPT_REALTIME_1_5("gpt-realtime-1.5")`

19006 20548

20549 - `GPT_REALTIME_2("gpt-realtime-2")`

20550

19007 20551 - `GPT_REALTIME_2025_08_28("gpt-realtime-2025-08-28")`

19008 20552

19009 20553 - `GPT_4O_REALTIME_PREVIEW("gpt-4o-realtime-preview")`

19042 20586

19043 20587 - `AUDIO("audio")`

19044 20588

20589 - `Optional<Boolean> parallelToolCalls`

20590

20591 Whether the model may call multiple tools in parallel. Only supported by

20592 reasoning Realtime models such as `gpt-realtime-2`.

20593

19045 20594 - `Optional<ResponsePrompt> prompt`

19046 20595

19047 20596 Reference to a prompt template and its variables.

19141 20690

19142 20691 Optional version of the prompt template.

19143 20692

20693 - `Optional<RealtimeReasoning> reasoning`

20694

20695 Configuration for reasoning-capable Realtime models such as `gpt-realtime-2`.

20696

20697 - `Optional<RealtimeReasoningEffort> effort`

20698

20699 Constrains effort on reasoning for reasoning-capable Realtime models such as

20700 `gpt-realtime-2`.

20701

20702 - `MINIMAL("minimal")`

20703

20704 - `LOW("low")`

20705

20706 - `MEDIUM("medium")`

20707

20708 - `HIGH("high")`

20709

20710 - `XHIGH("xhigh")`

20711

19144 20712 - `Optional<RealtimeToolChoiceConfig> toolChoice`

19145 20713

19146 20714 How the model chooses tools. Provide one of the string modes or force a specific

19467 21035

19468 21036 Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.

19469 21037

21038 For `gpt-realtime-whisper` transcription sessions, turn detection must be

21039 set to `null`; VAD is not supported.

21040

19470 21041 - `ServerVad`

19471 21042

19472 21043 - `JsonValue; type "server_vad" constant`

19609 21180

19610 21181 Configuration for input audio transcription. The client can optionally set the language and prompt for transcription, these offer additional guidance to the transcription service.

19611 21182

21183 - `Optional<Delay> delay`

21184

21185 Controls how long the model waits before emitting transcription text.

21186 Higher values can improve transcription accuracy at the cost of latency.

21187 Only supported with `gpt-realtime-whisper` in GA Realtime sessions.

21188

21189 - `MINIMAL("minimal")`

21190

21191 - `LOW("low")`

21192

21193 - `MEDIUM("medium")`

21194

21195 - `HIGH("high")`

21196

21197 - `XHIGH("xhigh")`

21198

19612 21199 - `Optional<String> language`

19613 21200

19614 21201 The language of the input audio. Supplying the input language in

19617 21204

19618 21205 - `Optional<Model> model`

19619 21206

19620 The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels.

21207 The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels.

19621 21208

19622 21209 - `WHISPER_1("whisper-1")`

19623 21210

19629 21216

19630 21217 - `GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize")`

19631 21218

21219 - `GPT_REALTIME_WHISPER("gpt-realtime-whisper")`

21220

19632 21221 - `Optional<String> prompt`

19633 21222

19634 21223 An optional text to guide the model's style or continue a previous audio

19635 21224 segment.

19636 21225 For `whisper-1`, the [prompt is a list of keywords](https://platform.openai.com/docs/guides/speech-to-text#prompting).

19637 21226 For `gpt-4o-transcribe` models (excluding `gpt-4o-transcribe-diarize`), the prompt is a free text string, for example "expect words related to technology".

21227 Prompt is not supported with `gpt-realtime-whisper` in GA Realtime sessions.

19638 21228

19639 21229 - `Optional<TurnDetection> turnDetection`

19640 21230

19714 21304

19715 21305 - `Optional<AudioTranscription> inputAudioTranscription`

19716 21306

19717 Configuration of the transcription model.

21307 - `Optional<Delay> delay`

21308

21309 Controls how long the model waits before emitting transcription text.

21310 Higher values can improve transcription accuracy at the cost of latency.

21311 Only supported with `gpt-realtime-whisper` in GA Realtime sessions.

21312

21313 - `MINIMAL("minimal")`

21314

21315 - `LOW("low")`

21316

21317 - `MEDIUM("medium")`

21318

21319 - `HIGH("high")`

21320

21321 - `XHIGH("xhigh")`

19718 21322

19719 21323 - `Optional<String> language`

19720 21324

19724 21328

19725 21329 - `Optional<Model> model`

19726 21330

19727 The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels.

21331 The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels.

19728 21332

19729 21333 - `WHISPER_1("whisper-1")`

19730 21334

19736 21340

19737 21341 - `GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize")`

19738 21342

21343 - `GPT_REALTIME_WHISPER("gpt-realtime-whisper")`

21344

19739 21345 - `Optional<String> prompt`

19740 21346

19741 21347 An optional text to guide the model's style or continue a previous audio

19742 21348 segment.

19743 21349 For `whisper-1`, the [prompt is a list of keywords](https://platform.openai.com/docs/guides/speech-to-text#prompting).

19744 21350 For `gpt-4o-transcribe` models (excluding `gpt-4o-transcribe-diarize`), the prompt is a free text string, for example "expect words related to technology".

21351 Prompt is not supported with `gpt-realtime-whisper` in GA Realtime sessions.

19745 21352

19746 21353 - `Optional<List<Modality>> modalities`

19747 21354

19901 21508

19902 21509 Configuration for input audio transcription, defaults to off and can be set to `null` to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through [the /audio/transcriptions endpoint](https://platform.openai.com/docs/api-reference/audio/createTranscription) and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription, these offer additional guidance to the transcription service.

19903 21510

21511 - `Optional<Delay> delay`

21512

21513 Controls how long the model waits before emitting transcription text.

21514 Higher values can improve transcription accuracy at the cost of latency.

21515 Only supported with `gpt-realtime-whisper` in GA Realtime sessions.

21516

21517 - `MINIMAL("minimal")`

21518

21519 - `LOW("low")`

21520

21521 - `MEDIUM("medium")`

21522

21523 - `HIGH("high")`

21524

21525 - `XHIGH("xhigh")`

21526

19904 21527 - `Optional<String> language`

19905 21528

19906 21529 The language of the input audio. Supplying the input language in

19909 21532

19910 21533 - `Optional<Model> model`

19911 21534

19912 The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels.

21535 The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels.

19913 21536

19914 21537 - `WHISPER_1("whisper-1")`

19915 21538

19921 21544

19922 21545 - `GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize")`

19923 21546

21547 - `GPT_REALTIME_WHISPER("gpt-realtime-whisper")`

21548

19924 21549 - `Optional<String> prompt`

19925 21550

19926 21551 An optional text to guide the model's style or continue a previous audio

19927 21552 segment.

19928 21553 For `whisper-1`, the [prompt is a list of keywords](https://platform.openai.com/docs/guides/speech-to-text#prompting).

19929 21554 For `gpt-4o-transcribe` models (excluding `gpt-4o-transcribe-diarize`), the prompt is a free text string, for example "expect words related to technology".

21555 Prompt is not supported with `gpt-realtime-whisper` in GA Realtime sessions.

19930 21556

19931 21557 - `Optional<RealtimeAudioInputTurnDetection> turnDetection`

19932 21558

19936 21562

19937 21563 Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.

19938 21564

21565 For `gpt-realtime-whisper` transcription sessions, turn detection must be

21566 set to `null`; VAD is not supported.

21567

19939 21568 - `ServerVad`

19940 21569

19941 21570 - `JsonValue; type "server_vad" constant`

20107 21736

20108 21737 - `GPT_REALTIME_1_5("gpt-realtime-1.5")`

20109 21738

21739 - `GPT_REALTIME_2("gpt-realtime-2")`

21740

20110 21741 - `GPT_REALTIME_2025_08_28("gpt-realtime-2025-08-28")`

20111 21742

20112 21743 - `GPT_4O_REALTIME_PREVIEW("gpt-4o-realtime-preview")`

20145 21776

20146 21777 - `AUDIO("audio")`

20147 21778

21779 - `Optional<Boolean> parallelToolCalls`

21780

21781 Whether the model may call multiple tools in parallel. Only supported by

21782 reasoning Realtime models such as `gpt-realtime-2`.

21783

20148 21784 - `Optional<ResponsePrompt> prompt`

20149 21785

20150 21786 Reference to a prompt template and its variables.

20244 21880

20245 21881 Optional version of the prompt template.

20246 21882

21883 - `Optional<RealtimeReasoning> reasoning`

21884

21885 Configuration for reasoning-capable Realtime models such as `gpt-realtime-2`.

21886

21887 - `Optional<RealtimeReasoningEffort> effort`

21888

21889 Constrains effort on reasoning for reasoning-capable Realtime models such as

21890 `gpt-realtime-2`.

21891

21892 - `MINIMAL("minimal")`

21893

21894 - `LOW("low")`

21895

21896 - `MEDIUM("medium")`

21897

21898 - `HIGH("high")`

21899

21900 - `XHIGH("xhigh")`

21901

20247 21902 - `Optional<RealtimeToolChoiceConfig> toolChoice`

20248 21903

20249 21904 How the model chooses tools. Provide one of the string modes or force a specific

20570 22225

20571 22226 Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.

20572 22227

22228 For `gpt-realtime-whisper` transcription sessions, turn detection must be

22229 set to `null`; VAD is not supported.

22230

20573 22231 - `ServerVad`

20574 22232

20575 22233 - `JsonValue; type "server_vad" constant`

20675 22333

20676 22334 - `class RealtimeSessionCreateResponse:`

20677 22335

20678 A new Realtime session configuration, with an ephemeral key. Default TTL

20679 for keys is one minute.

22336 A Realtime session configuration object.

20680 22337

20681 - `RealtimeSessionClientSecret clientSecret`

~~20682~~

20683 Ephemeral key returned by the API.

22338 - `String id`

20684 22339

20685 - `long expiresAt`

22340 Unique identifier for the session that looks like `sess_1234567890abcdef`.

20686 22341

20687 Timestamp for when the token expires. Currently, all tokens expire

20688 after one minute.

22342 - `JsonValue; object_ "realtime.session" constant`

20689 22343

20690 - `String value`

22344 The object type. Always `realtime.session`.

20691 22345

20692 Ephemeral key usable in client environments to authenticate connections to the Realtime API. Use this in client-side environments rather than a standard API token, which should only be used server-side.

22346 - `REALTIME_SESSION("realtime.session")`

20693 22347

20694 22348 - `JsonValue; type "realtime" constant`

20695 22349

20753 22407

20754 22408 - `Optional<AudioTranscription> transcription`

20755 22409

20756 Configuration for input audio transcription, defaults to off and can be set to `null` to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through [the /audio/transcriptions endpoint](https://platform.openai.com/docs/api-reference/audio/createTranscription) and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription, these offer additional guidance to the transcription service.

22410 - `Optional<Delay> delay`

22411

22412 Controls how long the model waits before emitting transcription text.

22413 Higher values can improve transcription accuracy at the cost of latency.

22414 Only supported with `gpt-realtime-whisper` in GA Realtime sessions.

22415

22416 - `MINIMAL("minimal")`

22417

22418 - `LOW("low")`

22419

22420 - `MEDIUM("medium")`

22421

22422 - `HIGH("high")`

22423

22424 - `XHIGH("xhigh")`

20757 22425

20758 - `Optional<String> language`22426 - `Optional<String> language`

20759 22427

20763 22431

20764 - `Optional<Model> model`22432 - `Optional<Model> model`

20765 22433

20766 The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels.22434 The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels.

20767 22435

20768 - `WHISPER_1("whisper-1")`22436 - `WHISPER_1("whisper-1")`

20769 22437

20775 22443

20776 - `GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize")`22444 - `GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize")`

20777 22445

22446 - `GPT_REALTIME_WHISPER("gpt-realtime-whisper")`

22447

20778 - `Optional<String> prompt`22448 - `Optional<String> prompt`

20779 22449

20780 An optional text to guide the model's style or continue a previous audio22450 An optional text to guide the model's style or continue a previous audio

20781 segment.22451 segment.

20782 For `whisper-1`, the [prompt is a list of keywords](https://platform.openai.com/docs/guides/speech-to-text#prompting).22452 For `whisper-1`, the [prompt is a list of keywords](https://platform.openai.com/docs/guides/speech-to-text#prompting).

20783 For `gpt-4o-transcribe` models (excluding `gpt-4o-transcribe-diarize`), the prompt is a free text string, for example "expect words related to technology".22453 For `gpt-4o-transcribe` models (excluding `gpt-4o-transcribe-diarize`), the prompt is a free text string, for example "expect words related to technology".

22454 Prompt is not supported with `gpt-realtime-whisper` in GA Realtime sessions.

20784 22455

20785 - `Optional<TurnDetection> turnDetection`22456 - `Optional<TurnDetection> turnDetection`

20786 22457

20790 22461

20791 Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.22462 Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.

20792 22463

22464 For `gpt-realtime-whisper` transcription sessions, turn detection must be

22465 set to `null`; VAD is not supported.

22466

20793 - `class ServerVad:`22467 - `class ServerVad:`

20794 22468

20795 Server-side voice activity detection (VAD) which flips on when user speech is detected and off after a period of silence.22469 Server-side voice activity detection (VAD) which flips on when user speech is detected and off after a period of silence.

20917 22591

20918 - `CEDAR("cedar")`22592 - `CEDAR("cedar")`

20919 22593

22594 - `Optional<Long> expiresAt`

22595

22596 Expiration timestamp for the session, in seconds since epoch.

22597

20920 - `Optional<List<Include>> include`22598 - `Optional<List<Include>> include`

20921 22599

20922 Additional fields to include in server outputs.22600 Additional fields to include in server outputs.

20952 22630

20953 - `GPT_REALTIME_1_5("gpt-realtime-1.5")`22631 - `GPT_REALTIME_1_5("gpt-realtime-1.5")`

20954 22632

22633 - `GPT_REALTIME_2("gpt-realtime-2")`

22634

20955 - `GPT_REALTIME_2025_08_28("gpt-realtime-2025-08-28")`22635 - `GPT_REALTIME_2025_08_28("gpt-realtime-2025-08-28")`

20956 22636

20957 - `GPT_4O_REALTIME_PREVIEW("gpt-4o-realtime-preview")`22637 - `GPT_4O_REALTIME_PREVIEW("gpt-4o-realtime-preview")`

21089 22769

21090 Optional version of the prompt template.22770 Optional version of the prompt template.

21091 22771

22772 - `Optional<RealtimeReasoning> reasoning`

22773

22774 Configuration for reasoning-capable Realtime models such as `gpt-realtime-2`.

22775

22776 - `Optional<RealtimeReasoningEffort> effort`

22777

22778 Constrains effort on reasoning for reasoning-capable Realtime models such as

22779 `gpt-realtime-2`.

22780

22781 - `MINIMAL("minimal")`

22782

22783 - `LOW("low")`

22784

22785 - `MEDIUM("medium")`

22786

22787 - `HIGH("high")`

22788

22789 - `XHIGH("xhigh")`

22790

21092 - `Optional<ToolChoice> toolChoice`22791 - `Optional<ToolChoice> toolChoice`

21093 22792

21094 How the model chooses tools. Provide one of the string modes or force a specific22793 How the model chooses tools. Provide one of the string modes or force a specific

21416 23115

21417 - `Optional<AudioTranscription> transcription`23116 - `Optional<AudioTranscription> transcription`

21418 23117

21419 Configuration of the transcription model.

~~21420~~

21421 - `Optional<RealtimeTranscriptionSessionTurnDetection> turnDetection`23118 - `Optional<RealtimeTranscriptionSessionTurnDetection> turnDetection`

21422 23119

21423 Configuration for turn detection. Can be set to `null` to turn off. Server23120 Configuration for turn detection. Can be set to `null` to turn off. Server

21424 VAD means that the model will detect the start and end of speech based on23121 VAD means that the model will detect the start and end of speech based on

21425 audio volume and respond at the end of user speech.23122 audio volume and respond at the end of user speech. For `gpt-realtime-whisper`, this must be `null`; VAD is not supported.

21426 23123

21427 - `Optional<Long> prefixPaddingMs`23124 - `Optional<Long> prefixPaddingMs`

21428 23125

21488 { 23185 {

21489 "expires_at": 0,23186 "expires_at": 0,

21490 "session": {23187 "session": {

21491 "client_secret": {23188 "id": "id",

21492 "expires_at": 0,23189 "object": "realtime.session",

21493 "value": "value"

21494 },

21495 "type": "realtime",23190 "type": "realtime",

21496 "audio": {23191 "audio": {

21497 "input": {23192 "input": {

21503 "type": "near_field"23198 "type": "near_field"

21504 },23199 },

21505 "transcription": {23200 "transcription": {

23201 "delay": "minimal",

21506 "language": "language",23202 "language": "language",

21507 "model": "string",23203 "model": "string",

21508 "prompt": "prompt"23204 "prompt": "prompt"

21526 "voice": "ash"23222 "voice": "ash"

21527 }23223 }

21528 },23224 },

23225 "expires_at": 0,

21529 "include": [23226 "include": [

21530 "item.input_audio_transcription.logprobs"23227 "item.input_audio_transcription.logprobs"

21531 ],23228 ],

21542 },23239 },

21543 "version": "version"23240 "version": "version"

21544 },23241 },

23242 "reasoning": {

23243 "effort": "minimal"

23244 },

21545 "tool_choice": "none",23245 "tool_choice": "none",

21546 "tools": [23246 "tools": [

21547 {23247 {

21560 23260
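The session object now carries an optional `expiresAt` field, documented as an expiration timestamp in seconds since epoch. The following is a minimal sketch of interpreting such a value with `java.time`. The `SessionExpiry` class and its method names are hypothetical helpers for illustration and are not part of the SDK; the sketch only assumes the value arrives as an `Optional<Long>`.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Optional;

/** Hypothetical helper (not part of the SDK) for an expires_at value in seconds since epoch. */
final class SessionExpiry {
    private SessionExpiry() {}

    /** Converts an optional epoch-seconds timestamp into an Instant. */
    static Optional<Instant> toInstant(Optional<Long> expiresAt) {
        return expiresAt.map(Instant::ofEpochSecond);
    }

    /** True only when a timestamp is present and is not after 'now'. */
    static boolean isExpired(Optional<Long> expiresAt, Instant now) {
        return toInstant(expiresAt).map(t -> !t.isAfter(now)).orElse(false);
    }

    /** Remaining lifetime clamped at zero; empty when no expiry was reported. */
    static Optional<Duration> remaining(Optional<Long> expiresAt, Instant now) {
        return toInstant(expiresAt).map(t -> {
            Duration left = Duration.between(now, t);
            return left.isNegative() ? Duration.ZERO : left;
        });
    }
}
```

Passing `now` explicitly, rather than calling `Instant.now()` inside the helper, keeps the check deterministic in tests. Treating a missing `expiresAt` as "not expired" is a design choice of this sketch, not behavior stated by the reference.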

21561 ## Domain Types 23261 ## Domain Types

21562 23262

21563 ### Realtime Session Client Secret

~~21564~~

21565 - `class RealtimeSessionClientSecret:`

~~21566~~

21567 Ephemeral key returned by the API.

~~21568~~

21569 - `long expiresAt`

~~21570~~

21571 Timestamp for when the token expires. Currently, all tokens expire

21572 after one minute.

~~21573~~

21574 - `String value`

~~21575~~

21576 Ephemeral key usable in client environments to authenticate connections to the Realtime API. Use this in client-side environments rather than a standard API token, which should only be used server-side.

~~21577~~

21578 ### Realtime Session Create Response 23263 ### Realtime Session Create Response

21579 23264

21580 - `class RealtimeSessionCreateResponse:` 23265 - `class RealtimeSessionCreateResponse:`

21581 23266

21582 A new Realtime session configuration, with an ephemeral key. Default TTL23267 A Realtime session configuration object.

21583 for keys is one minute.

~~21584~~

21585 - `RealtimeSessionClientSecret clientSecret`

21586 23268

21587 Ephemeral key returned by the API.23269 - `String id`

21588 23270

21589 - `long expiresAt`23271 Unique identifier for the session that looks like `sess_1234567890abcdef`.

21590 23272

21591 Timestamp for when the token expires. Currently, all tokens expire23273 - `JsonValue; object_ "realtime.session"constant`

21592 after one minute.

21593 23274

21594 - `String value`23275 The object type. Always `realtime.session`.

21595 23276

21596 Ephemeral key usable in client environments to authenticate connections to the Realtime API. Use this in client-side environments rather than a standard API token, which should only be used server-side.23277 - `REALTIME_SESSION("realtime.session")`

21597 23278

21598 - `JsonValue; type "realtime"constant`23279 - `JsonValue; type "realtime"constant`

21599 23280

21657 23338

21658 - `Optional<AudioTranscription> transcription`23339 - `Optional<AudioTranscription> transcription`

21659 23340

21660 Configuration for input audio transcription, defaults to off and can be set to `null` to turn off once on. Input audio transcription is not native to the model, since the model consumes audio directly. Transcription runs asynchronously through [the /audio/transcriptions endpoint](https://platform.openai.com/docs/api-reference/audio/createTranscription) and should be treated as guidance of input audio content rather than precisely what the model heard. The client can optionally set the language and prompt for transcription, these offer additional guidance to the transcription service.23341 - `Optional<Delay> delay`

23342

23343 Controls how long the model waits before emitting transcription text.

23344 Higher values can improve transcription accuracy at the cost of latency.

23345 Only supported with `gpt-realtime-whisper` in GA Realtime sessions.

23346

23347 - `MINIMAL("minimal")`

23348

23349 - `LOW("low")`

23350

23351 - `MEDIUM("medium")`

23352

23353 - `HIGH("high")`

23354

23355 - `XHIGH("xhigh")`

21661 23356

21662 - `Optional<String> language`23357 - `Optional<String> language`

21663 23358

21667 23362

21668 - `Optional<Model> model`23363 - `Optional<Model> model`

21669 23364

21670 The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels.23365 The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels.

21671 23366

21672 - `WHISPER_1("whisper-1")`23367 - `WHISPER_1("whisper-1")`

21673 23368

21679 23374

21680 - `GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize")`23375 - `GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize")`

21681 23376

23377 - `GPT_REALTIME_WHISPER("gpt-realtime-whisper")`

23378

21682 - `Optional<String> prompt`23379 - `Optional<String> prompt`

21683 23380

21684 An optional text to guide the model's style or continue a previous audio23381 An optional text to guide the model's style or continue a previous audio

21685 segment.23382 segment.

21686 For `whisper-1`, the [prompt is a list of keywords](https://platform.openai.com/docs/guides/speech-to-text#prompting).23383 For `whisper-1`, the [prompt is a list of keywords](https://platform.openai.com/docs/guides/speech-to-text#prompting).

21687 For `gpt-4o-transcribe` models (excluding `gpt-4o-transcribe-diarize`), the prompt is a free text string, for example "expect words related to technology".23384 For `gpt-4o-transcribe` models (excluding `gpt-4o-transcribe-diarize`), the prompt is a free text string, for example "expect words related to technology".

23385 Prompt is not supported with `gpt-realtime-whisper` in GA Realtime sessions.

21688 23386

21689 - `Optional<TurnDetection> turnDetection`23387 - `Optional<TurnDetection> turnDetection`

21690 23388

21694 23392

21695 Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.23393 Semantic VAD is more advanced and uses a turn detection model (in conjunction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency.

21696 23394

23395 For `gpt-realtime-whisper` transcription sessions, turn detection must be

23396 set to `null`; VAD is not supported.

23397

21697 - `class ServerVad:`23398 - `class ServerVad:`

21698 23399

21699 Server-side voice activity detection (VAD) which flips on when user speech is detected and off after a period of silence.23400 Server-side voice activity detection (VAD) which flips on when user speech is detected and off after a period of silence.

21821 23522

21822 - `CEDAR("cedar")`23523 - `CEDAR("cedar")`

21823 23524

23525 - `Optional<Long> expiresAt`

23526

23527 Expiration timestamp for the session, in seconds since epoch.

23528

21824 - `Optional<List<Include>> include`23529 - `Optional<List<Include>> include`

21825 23530

21826 Additional fields to include in server outputs.23531 Additional fields to include in server outputs.

21856 23561

21857 - `GPT_REALTIME_1_5("gpt-realtime-1.5")`23562 - `GPT_REALTIME_1_5("gpt-realtime-1.5")`

21858 23563

23564 - `GPT_REALTIME_2("gpt-realtime-2")`

23565

21859 - `GPT_REALTIME_2025_08_28("gpt-realtime-2025-08-28")`23566 - `GPT_REALTIME_2025_08_28("gpt-realtime-2025-08-28")`

21860 23567

21861 - `GPT_4O_REALTIME_PREVIEW("gpt-4o-realtime-preview")`23568 - `GPT_4O_REALTIME_PREVIEW("gpt-4o-realtime-preview")`

21993 23700

21994 Optional version of the prompt template.23701 Optional version of the prompt template.

21995 23702

23703 - `Optional<RealtimeReasoning> reasoning`

23704

23705 Configuration for reasoning-capable Realtime models such as `gpt-realtime-2`.

23706

23707 - `Optional<RealtimeReasoningEffort> effort`

23708

23709 Constrains effort on reasoning for reasoning-capable Realtime models such as

23710 `gpt-realtime-2`.

23711

23712 - `MINIMAL("minimal")`

23713

23714 - `LOW("low")`

23715

23716 - `MEDIUM("medium")`

23717

23718 - `HIGH("high")`

23719

23720 - `XHIGH("xhigh")`

23721

21996 - `Optional<ToolChoice> toolChoice`23722 - `Optional<ToolChoice> toolChoice`

21997 23723

21998 How the model chooses tools. Provide one of the string modes or force a specific23724 How the model chooses tools. Provide one of the string modes or force a specific

22356 24082

22357 - `Optional<AudioTranscription> transcription`24083 - `Optional<AudioTranscription> transcription`

22358 24084

22359 Configuration of the transcription model.24085 - `Optional<Delay> delay`

24086

24087 Controls how long the model waits before emitting transcription text.

24088 Higher values can improve transcription accuracy at the cost of latency.

24089 Only supported with `gpt-realtime-whisper` in GA Realtime sessions.

24090

24091 - `MINIMAL("minimal")`

24092

24093 - `LOW("low")`

24094

24095 - `MEDIUM("medium")`

24096

24097 - `HIGH("high")`

24098

24099 - `XHIGH("xhigh")`

22360 24100

22361 - `Optional<String> language`24101 - `Optional<String> language`

22362 24102

22366 24106

22367 - `Optional<Model> model`24107 - `Optional<Model> model`

22368 24108

22369 The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, and `gpt-4o-transcribe-diarize`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels.24109 The model to use for transcription. Current options are `whisper-1`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `gpt-4o-transcribe`, `gpt-4o-transcribe-diarize`, and `gpt-realtime-whisper`. Use `gpt-4o-transcribe-diarize` when you need diarization with speaker labels.

22370 24110

22371 - `WHISPER_1("whisper-1")`24111 - `WHISPER_1("whisper-1")`

22372 24112

22378 24118

22379 - `GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize")`24119 - `GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize")`

22380 24120

24121 - `GPT_REALTIME_WHISPER("gpt-realtime-whisper")`

24122

22381 - `Optional<String> prompt`24123 - `Optional<String> prompt`

22382 24124

22383 An optional text to guide the model's style or continue a previous audio24125 An optional text to guide the model's style or continue a previous audio

22384 segment.24126 segment.

22385 For `whisper-1`, the [prompt is a list of keywords](https://platform.openai.com/docs/guides/speech-to-text#prompting).24127 For `whisper-1`, the [prompt is a list of keywords](https://platform.openai.com/docs/guides/speech-to-text#prompting).

22386 For `gpt-4o-transcribe` models (excluding `gpt-4o-transcribe-diarize`), the prompt is a free text string, for example "expect words related to technology".24128 For `gpt-4o-transcribe` models (excluding `gpt-4o-transcribe-diarize`), the prompt is a free text string, for example "expect words related to technology".

24129 Prompt is not supported with `gpt-realtime-whisper` in GA Realtime sessions.

22387 24130

22388 - `Optional<RealtimeTranscriptionSessionTurnDetection> turnDetection`24131 - `Optional<RealtimeTranscriptionSessionTurnDetection> turnDetection`

22389 24132

22390 Configuration for turn detection. Can be set to `null` to turn off. Server24133 Configuration for turn detection. Can be set to `null` to turn off. Server

22391 VAD means that the model will detect the start and end of speech based on24134 VAD means that the model will detect the start and end of speech based on

22392 audio volume and respond at the end of user speech.24135 audio volume and respond at the end of user speech. For `gpt-realtime-whisper`, this must be `null`; VAD is not supported.

22393 24136

22394 - `Optional<Long> prefixPaddingMs`24137 - `Optional<Long> prefixPaddingMs`

22395 24138

22430 24173

22431 Configuration for turn detection. Can be set to `null` to turn off. Server24174 Configuration for turn detection. Can be set to `null` to turn off. Server

22432 VAD means that the model will detect the start and end of speech based on24175 VAD means that the model will detect the start and end of speech based on

22433 audio volume and respond at the end of user speech.24176 audio volume and respond at the end of user speech. For `gpt-realtime-whisper`, this must be `null`; VAD is not supported.

22434 24177

22435 - `Optional<Long> prefixPaddingMs`24178 - `Optional<Long> prefixPaddingMs`

22436 24179

22616}24359}

22617```24360```

22618 24361
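The updated field descriptions state three constraints tied to `gpt-realtime-whisper`: `delay` is only supported with that model, `prompt` is not supported with it, and turn detection must be `null` because VAD is not supported. The following is a minimal sketch of a client-side pre-flight check for those rules. `WhisperConstraints` and its nested `Settings` class are hypothetical stand-ins and not SDK types, and the reference does not say whether the server rejects or ignores an unsupported combination.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Hypothetical pre-flight check (not part of the SDK) for the documented gpt-realtime-whisper rules. */
final class WhisperConstraints {
    static final String WHISPER = "gpt-realtime-whisper";

    private WhisperConstraints() {}

    /** Simplified stand-in for the transcription-related settings of a session. */
    static final class Settings {
        final String model;
        final Optional<String> delay;   // e.g. "minimal", "low", "medium", "high", "xhigh"
        final Optional<String> prompt;
        final boolean turnDetectionSet; // false models turn detection being null

        Settings(String model, Optional<String> delay, Optional<String> prompt, boolean turnDetectionSet) {
            this.model = model;
            this.delay = delay;
            this.prompt = prompt;
            this.turnDetectionSet = turnDetectionSet;
        }
    }

    /** Returns one message per documented constraint the settings violate; an empty list means none. */
    static List<String> violations(Settings s) {
        List<String> out = new ArrayList<>();
        boolean whisper = WHISPER.equals(s.model);
        if (!whisper && s.delay.isPresent()) {
            out.add("delay is only supported with " + WHISPER);
        }
        if (whisper && s.prompt.isPresent()) {
            out.add("prompt is not supported with " + WHISPER);
        }
        if (whisper && s.turnDetectionSet) {
            out.add("turn detection must be null with " + WHISPER);
        }
        return out;
    }
}
```

Returning a list of messages instead of throwing on the first problem lets a caller report every conflicting field at once before opening a session.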

24362 # Translations

24363

24364 # Client Secrets

24365

22619 # Sessions 24366 # Sessions

22620 24367

22621 # Transcription Sessions 24368 # Transcription Sessions

java/resources/realtime/index.md 2026-05-05 23:00 UTC to 2026-05-07 21:57 UTC

Realtime

Domain Types

Audio Transcription

Conversation Created Event

Conversation Item

Conversation Item Added

Conversation Item Create Event

Conversation Item Created Event

Conversation Item Delete Event

Conversation Item Deleted Event

Conversation Item Done

Conversation Item Input Audio Transcription Completed Event

Conversation Item Input Audio Transcription Delta Event

Conversation Item Input Audio Transcription Failed Event

Conversation Item Input Audio Transcription Segment

Conversation Item Retrieve Event

Conversation Item Truncate Event

Conversation Item Truncated Event

Conversation Item With Reference

Input Audio Buffer Append Event

Input Audio Buffer Clear Event

Input Audio Buffer Cleared Event

Input Audio Buffer Commit Event

Input Audio Buffer Committed Event

Input Audio Buffer Dtmf Event Received Event

Input Audio Buffer Speech Started Event

Input Audio Buffer Speech Stopped Event

Input Audio Buffer Timeout Triggered

Log Prob Properties

Mcp List Tools Completed

Mcp List Tools Failed

Mcp List Tools In Progress

Noise Reduction Type

Output Audio Buffer Clear Event

Rate Limits Updated Event

Realtime Audio Config

Realtime Audio Config Input

Realtime Audio Config Output

Realtime Audio Formats

Realtime Audio Input Turn Detection

Realtime Client Event

Realtime Conversation Item Assistant Message

Realtime Conversation Item Function Call

Realtime Conversation Item Function Call Output

Realtime Conversation Item System Message

Realtime Conversation Item User Message

Realtime Error

Realtime Error Event

Realtime Function Tool

Realtime Mcp Approval Request

Realtime Mcp Approval Response

Realtime Mcp List Tools

Realtime Mcp Protocol Error

Realtime Mcp Tool Call

Realtime Mcp Tool Execution Error

Realtime Mcphttp Error

Realtime Reasoning

Realtime Reasoning Effort

Realtime Response

Realtime Response Create Audio Output

Realtime Response Create Mcp Tool

Realtime Response Create Params

Realtime Response Status

Realtime Response Usage

Realtime Response Usage Input Token Details

Realtime Response Usage Output Token Details

Realtime Server Event

Realtime Session

Realtime Session Create Request

Realtime Tool Choice Config

Realtime Tools Config Union

Realtime Tracing Config

Realtime Transcription Session Audio

Realtime Transcription Session Audio Input

Realtime Transcription Session Audio Input Turn Detection

Realtime Transcription Session Create Request

Realtime Translation Client Event

Realtime Translation Client Secret Create Request

Realtime Translation Client Secret Create Response