Audio

Domain Types

Audio Model

enum AudioModel:
- WHISPER_1("whisper-1")
- GPT_4O_TRANSCRIBE("gpt-4o-transcribe")
- GPT_4O_MINI_TRANSCRIBE("gpt-4o-mini-transcribe")
- GPT_4O_MINI_TRANSCRIBE_2025_12_15("gpt-4o-mini-transcribe-2025-12-15")
- GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize")

Audio Response Format

enum AudioResponseFormat:

The format of the output, in one of these options: json, text, srt, verbose_json, vtt, or diarized_json. For gpt-4o-transcribe and gpt-4o-mini-transcribe, the only supported format is json. For gpt-4o-transcribe-diarize, the supported formats are json, text, and diarized_json, with diarized_json required to receive speaker annotations.
- JSON("json")
- TEXT("text")
- SRT("srt")
- VERBOSE_JSON("verbose_json")
- VTT("vtt")
- DIARIZED_JSON("diarized_json")
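
The compatibility rules in the description above can be checked before a request is sent. The following is a minimal sketch using only the JDK. The class name ResponseFormatRules and the method isSupported are hypothetical and not part of the SDK, and the sketch assumes that models without a restriction stated above should not be rejected client-side.

```java
import java.util.Map;
import java.util.Set;

/** Sketch: checks a model/response_format pair against the restrictions stated above. */
final class ResponseFormatRules {
    // Only the models whose format restrictions are spelled out in the description.
    private static final Map<String, Set<String>> SUPPORTED = Map.of(
        "gpt-4o-transcribe", Set.of("json"),
        "gpt-4o-mini-transcribe", Set.of("json"),
        "gpt-4o-transcribe-diarize", Set.of("json", "text", "diarized_json"));

    private ResponseFormatRules() {}

    /** Returns true when the pair is allowed, or when the model has no documented restriction. */
    static boolean isSupported(String model, String format) {
        Set<String> allowed = SUPPORTED.get(model);
        // Assumption: a model without a documented restriction is left for the server to validate.
        return allowed == null || allowed.contains(format);
    }

    public static void main(String[] args) {
        System.out.println(isSupported("gpt-4o-transcribe", "srt"));
        System.out.println(isSupported("gpt-4o-transcribe-diarize", "diarized_json"));
    }
}
```

A check like this only mirrors the documented rules; the server remains the authority on which combinations are accepted.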

Transcriptions

Create transcription

TranscriptionCreateResponse audio().transcriptions().create(TranscriptionCreateParams params, RequestOptions requestOptions = RequestOptions.none())

post /audio/transcriptions

Transcribes audio into the input language.

Returns a transcription object in json, diarized_json, or verbose_json format, or a stream of transcript events.

Parameters

TranscriptionCreateParams params
- String file
  
  The audio file object (not file name) to transcribe, in one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm.
- AudioModel model
  
  ID of the model to use. The options are gpt-4o-transcribe, gpt-4o-mini-transcribe, gpt-4o-mini-transcribe-2025-12-15, whisper-1 (which is powered by our open source Whisper V2 model), and gpt-4o-transcribe-diarize.
- Optional<ChunkingStrategy> chunkingStrategy
  
  Controls how the audio is cut into chunks. When set to "auto", the server first normalizes loudness and then uses voice activity detection (VAD) to choose boundaries. A server_vad object can be provided to tweak VAD detection parameters manually. If unset, the audio is transcribed as a single block. Required when using gpt-4o-transcribe-diarize for inputs longer than 30 seconds.
  - JsonValue;
    - AUTO("auto")
  - class VadConfig:
    - Type type
      
      Must be set to server_vad to enable manual chunking using server side VAD.
      - SERVER_VAD("server_vad")
    - Optional<Long> prefixPaddingMs
      
      Amount of audio to include before the VAD detected speech (in milliseconds).
    - Optional<Long> silenceDurationMs
      
      Duration of silence to detect speech stop (in milliseconds). With shorter values the model will respond more quickly, but may jump in on short pauses from the user.
    - Optional<Double> threshold
      
      Sensitivity threshold (0.0 to 1.0) for voice activity detection. A higher threshold will require louder audio to activate the model, and thus might perform better in noisy environments.
- Optional<List<TranscriptionInclude>> include
  
  Additional information to include in the transcription response. logprobs will return the log probabilities of the tokens in the response to understand the model's confidence in the transcription. logprobs only works with response_format set to json and only with the models gpt-4o-transcribe, gpt-4o-mini-transcribe, and gpt-4o-mini-transcribe-2025-12-15. This field is not supported when using gpt-4o-transcribe-diarize.
  - LOGPROBS("logprobs")
- Optional<List<String>> knownSpeakerNames
  
  Optional list of speaker names that correspond to the audio samples provided in known_speaker_references[]. Each entry should be a short identifier (for example customer or agent). Up to 4 speakers are supported.
- Optional<List<String>> knownSpeakerReferences
  
  Optional list of audio samples (as data URLs) that contain known speaker references matching known_speaker_names[]. Each sample must be between 2 and 10 seconds, and can use any of the same input audio formats supported by file.
- Optional<String> language
  
  The language of the input audio. Supplying the input language in ISO-639-1 (e.g. en) format will improve accuracy and latency.
- Optional<String> prompt
  
  An optional text to guide the model's style or continue a previous audio segment. The prompt should match the audio language. This field is not supported when using gpt-4o-transcribe-diarize.
- Optional<AudioResponseFormat> responseFormat
  
  The format of the output, in one of these options: json, text, srt, verbose_json, vtt, or diarized_json. For gpt-4o-transcribe and gpt-4o-mini-transcribe, the only supported format is json. For gpt-4o-transcribe-diarize, the supported formats are json, text, and diarized_json, with diarized_json required to receive speaker annotations.
- Optional<Double> temperature
  
  The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use log probability to automatically increase the temperature until certain thresholds are hit.
- Optional<List<TimestampGranularity>> timestampGranularities
  
  The timestamp granularities to populate for this transcription. response_format must be set to verbose_json to use timestamp granularities. Either or both of these options are supported: word and segment. Note: There is no additional latency for segment timestamps, but generating word timestamps incurs additional latency. This option is not available for gpt-4o-transcribe-diarize.
  - WORD("word")
  - SEGMENT("segment")
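
The knownSpeakerReferences parameter expects audio samples as data URLs, paired with entries in knownSpeakerNames. The sketch below shows one way to build such values with the JDK alone. The class SpeakerReferences and its methods are hypothetical helpers, the one-to-one pairing check is an assumption drawn from the wording above, and the 2 to 10 second duration requirement is not checked because it cannot be derived from raw bytes without decoding the audio.

```java
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;

/** Sketch: turns raw audio samples into data URLs for known speaker references. */
final class SpeakerReferences {
    static final int MAX_SPEAKERS = 4; // "Up to 4 speakers are supported."

    private SpeakerReferences() {}

    /** Encodes raw audio bytes as a base64 data URL of the given MIME type. */
    static String toDataUrl(byte[] audio, String mimeType) {
        return "data:" + mimeType + ";base64," + Base64.getEncoder().encodeToString(audio);
    }

    /** Checks the name/sample pairing and the speaker limit, then encodes every sample. */
    static List<String> toDataUrls(List<String> names, List<byte[]> samples, String mimeType) {
        // Assumption: each name corresponds to exactly one sample, in the same order.
        if (names.size() != samples.size()) {
            throw new IllegalArgumentException("names and samples must have the same length");
        }
        if (names.size() > MAX_SPEAKERS) {
            throw new IllegalArgumentException("at most " + MAX_SPEAKERS + " speakers are supported");
        }
        List<String> urls = new ArrayList<>();
        for (byte[] sample : samples) {
            urls.add(toDataUrl(sample, mimeType));
        }
        return urls;
    }
}
```

The resulting strings would be passed as the knownSpeakerReferences list, with the matching identifiers in knownSpeakerNames.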

Returns

class TranscriptionCreateResponse: A class that can be one of several variants (union).

Represents a transcription response returned by the model, based on the provided input.
- class Transcription:
  
  Represents a transcription response returned by the model, based on the provided input.
  - String text
    
    The transcribed text.
  - Optional<List<Logprob>> logprobs
    
    The log probabilities of the tokens in the transcription. Only returned with the models gpt-4o-transcribe and gpt-4o-mini-transcribe if logprobs is added to the include array.
    - Optional<String> token
      
      The token in the transcription.
    - Optional<List<Double>> bytes
      
      The bytes of the token.
    - Optional<Double> logprob
      
      The log probability of the token.
  - Optional<Usage> usage
    
    Token usage statistics for the request.
    - class Tokens:
      
      Usage statistics for models billed by token usage.
      - long inputTokens
        
        Number of input tokens billed for this request.
      - long outputTokens
        
        Number of output tokens generated.
      - long totalTokens
        
        Total number of tokens used (input + output).
      - JsonValue; type "tokens" (constant)
        
        The type of the usage object. Always tokens for this variant.
        - TOKENS("tokens")
      - Optional<InputTokenDetails> inputTokenDetails
        
        Details about the input tokens billed for this request.
        - Optional<Long> audioTokens
          
          Number of audio tokens billed for this request.
        - Optional<Long> textTokens
          
          Number of text tokens billed for this request.
    - class Duration:
      
      Usage statistics for models billed by audio input duration.
      - double seconds
        
        Duration of the input audio in seconds.
      - JsonValue; type "duration" (constant)
        
        The type of the usage object. Always duration for this variant.
        - DURATION("duration")
- class TranscriptionDiarized:
  
  Represents a diarized transcription response returned by the model, including the combined transcript and speaker-segment annotations.
  - double duration
    
    Duration of the input audio in seconds.
  - List<TranscriptionDiarizedSegment> segments
    
    Segments of the transcript annotated with timestamps and speaker labels.
    - String id
      
      Unique identifier for the segment.
    - double end
      
      End timestamp of the segment in seconds.
    - String speaker
      
      Speaker label for this segment. When known speakers are provided, the label matches known_speaker_names[]. Otherwise speakers are labeled sequentially using capital letters (A, B, ...).
    - double start
      
      Start timestamp of the segment in seconds.
    - String text
      
      Transcript text for this segment.
    - JsonValue; type "transcript.text.segment" (constant)
      
      The type of the segment. Always transcript.text.segment.
      - TRANSCRIPT_TEXT_SEGMENT("transcript.text.segment")
  - JsonValue; task "transcribe" (constant)
    
    The type of task that was run. Always transcribe.
    - TRANSCRIBE("transcribe")
  - String text
    
    The concatenated transcript text for the entire audio input.
  - Optional<Usage> usage
    
    Token or duration usage statistics for the request.
    - class Tokens:
      
      Usage statistics for models billed by token usage.
      - long inputTokens
        
        Number of input tokens billed for this request.
      - long outputTokens
        
        Number of output tokens generated.
      - long totalTokens
        
        Total number of tokens used (input + output).
      - JsonValue; type "tokens" (constant)
        
        The type of the usage object. Always tokens for this variant.
        - TOKENS("tokens")
      - Optional<InputTokenDetails> inputTokenDetails
        
        Details about the input tokens billed for this request.
        - Optional<Long> audioTokens
          
          Number of audio tokens billed for this request.
        - Optional<Long> textTokens
          
          Number of text tokens billed for this request.
    - class Duration:
      
      Usage statistics for models billed by audio input duration.
      - double seconds
        
        Duration of the input audio in seconds.
      - JsonValue; type "duration" (constant)
        
        The type of the usage object. Always duration for this variant.
        - DURATION("duration")
- class TranscriptionVerbose:
  
  Represents a verbose_json transcription response returned by the model, based on the provided input.
  - double duration
    
    The duration of the input audio.
  - String language
    
    The language of the input audio.
  - String text
    
    The transcribed text.
  - Optional<List<TranscriptionSegment>> segments
    
    Segments of the transcribed text and their corresponding details.
    - long id
      
      Unique identifier of the segment.
    - double avgLogprob
      
      Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.
    - double compressionRatio
      
      Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.
    - double end
      
      End time of the segment in seconds.
    - double noSpeechProb
      
      Probability of no speech in the segment. If the value is higher than 1.0 and the avg_logprob is below -1, consider this segment silent.
    - long seek
      
      Seek offset of the segment.
    - double start
      
      Start time of the segment in seconds.
    - double temperature
      
      Temperature parameter used for generating the segment.
    - String text
      
      Text content of the segment.
    - List<Long> tokens
      
      Array of token IDs for the text content.
  - Optional<Usage> usage
    
    Usage statistics for models billed by audio input duration.
    - double seconds
      
      Duration of the input audio in seconds.
    - JsonValue; type "duration" (constant)
      
      The type of the usage object. Always duration for this variant.
      - DURATION("duration")
  - Optional<List<TranscriptionWord>> words
    
    Extracted words and their corresponding timestamps.
    - double end
      
      End time of the word in seconds.
    - double start
      
      Start time of the word in seconds.
    - String word
      
      The text content of the word.

Example

package com.openai.example;

import com.openai.client.OpenAIClient;
import com.openai.client.okhttp.OpenAIOkHttpClient;
import com.openai.models.audio.AudioModel;
import com.openai.models.audio.transcriptions.TranscriptionCreateParams;
import com.openai.models.audio.transcriptions.TranscriptionCreateResponse;
import java.io.ByteArrayInputStream;

public final class Main {
    private Main() {}

    public static void main(String[] args) {
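        // Reads the API key and related settings from environment variables.
        OpenAIClient client = OpenAIOkHttpClient.fromEnv();

        TranscriptionCreateParams params = TranscriptionCreateParams.builder()
            // Placeholder bytes; substitute the contents of a real audio file.
            .file(new ByteArrayInputStream("Example data".getBytes()))
            .model(AudioModel.GPT_4O_TRANSCRIBE)
            .build();
        // Sends the request; the response is a union of the transcription variants.
        TranscriptionCreateResponse transcription = client.audio().transcriptions().create(params);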
    }
}

Response

{
  "text": "text",
  "logprobs": [
    {
      "token": "token",
      "bytes": [
        0
      ],
      "logprob": 0
    }
  ],
  "usage": {
    "input_tokens": 0,
    "output_tokens": 0,
    "total_tokens": 0,
    "type": "tokens",
    "input_token_details": {
      "audio_tokens": 0,
      "text_tokens": 0
    }
  }
}

Domain Types

Transcription

class Transcription:

Represents a transcription response returned by the model, based on the provided input.
- String text
  
  The transcribed text.
- Optional<List<Logprob>> logprobs
  
  The log probabilities of the tokens in the transcription. Only returned with the models gpt-4o-transcribe and gpt-4o-mini-transcribe if logprobs is added to the include array.
  - Optional<String> token
    
    The token in the transcription.
  - Optional<List<Double>> bytes
    
    The bytes of the token.
  - Optional<Double> logprob
    
    The log probability of the token.
- Optional<Usage> usage
  
  Token usage statistics for the request.
  - class Tokens:
    
    Usage statistics for models billed by token usage.
    - long inputTokens
      
      Number of input tokens billed for this request.
    - long outputTokens
      
      Number of output tokens generated.
    - long totalTokens
      
      Total number of tokens used (input + output).
    - JsonValue; type "tokens" (constant)
      
      The type of the usage object. Always tokens for this variant.
      - TOKENS("tokens")
    - Optional<InputTokenDetails> inputTokenDetails
      
      Details about the input tokens billed for this request.
      - Optional<Long> audioTokens
        
        Number of audio tokens billed for this request.
      - Optional<Long> textTokens
        
        Number of text tokens billed for this request.
  - class Duration:
    
    Usage statistics for models billed by audio input duration.
    - double seconds
      
      Duration of the input audio in seconds.
    - JsonValue; type "duration" (constant)
      
      The type of the usage object. Always duration for this variant.
      - DURATION("duration")
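
The logprobs field is described as a way to understand the model's confidence in the transcription. One common way to condense per-token log probabilities into a single number is the geometric mean of the token probabilities, exp(mean(logprob)). The sketch below is an illustration under that assumption, not something the API defines; the class LogprobConfidence is hypothetical and takes plain Double values rather than SDK Logprob objects.

```java
import java.util.List;

/** Sketch: summarizes token log probabilities as a geometric-mean token probability. */
final class LogprobConfidence {
    private LogprobConfidence() {}

    /** Returns exp(mean(logprobs)), a value in (0, 1] when every logprob is at most 0. */
    static double meanTokenProbability(List<Double> logprobs) {
        if (logprobs.isEmpty()) {
            throw new IllegalArgumentException("at least one logprob is required");
        }
        double sum = 0.0;
        for (double logprob : logprobs) {
            sum += logprob;
        }
        return Math.exp(sum / logprobs.size());
    }

    public static void main(String[] args) {
        // Two tokens, each with probability 0.5.
        System.out.println(meanTokenProbability(List.of(Math.log(0.5), Math.log(0.5))));
    }
}
```

Because the token, bytes, and logprob fields are all optional, real code would need to skip entries whose logprob is absent before calling a helper like this.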

Transcription Diarized

class TranscriptionDiarized:

Represents a diarized transcription response returned by the model, including the combined transcript and speaker-segment annotations.
- double duration
  
  Duration of the input audio in seconds.
- List<TranscriptionDiarizedSegment> segments
  
  Segments of the transcript annotated with timestamps and speaker labels.
  - String id
    
    Unique identifier for the segment.
  - double end
    
    End timestamp of the segment in seconds.
  - String speaker
    
    Speaker label for this segment. When known speakers are provided, the label matches known_speaker_names[]. Otherwise speakers are labeled sequentially using capital letters (A, B, ...).
  - double start
    
    Start timestamp of the segment in seconds.
  - String text
    
    Transcript text for this segment.
  - JsonValue; type "transcript.text.segment" (constant)
    
    The type of the segment. Always transcript.text.segment.
    - TRANSCRIPT_TEXT_SEGMENT("transcript.text.segment")
- JsonValue; task "transcribe" (constant)
  
  The type of task that was run. Always transcribe.
  - TRANSCRIBE("transcribe")
- String text
  
  The concatenated transcript text for the entire audio input.
- Optional<Usage> usage
  
  Token or duration usage statistics for the request.
  - class Tokens:
    
    Usage statistics for models billed by token usage.
    - long inputTokens
      
      Number of input tokens billed for this request.
    - long outputTokens
      
      Number of output tokens generated.
    - long totalTokens
      
      Total number of tokens used (input + output).
    - JsonValue; type "tokens" (constant)
      
      The type of the usage object. Always tokens for this variant.
      - TOKENS("tokens")
    - Optional<InputTokenDetails> inputTokenDetails
      
      Details about the input tokens billed for this request.
      - Optional<Long> audioTokens
        
        Number of audio tokens billed for this request.
      - Optional<Long> textTokens
        
        Number of text tokens billed for this request.
  - class Duration:
    
    Usage statistics for models billed by audio input duration.
    - double seconds
      
      Duration of the input audio in seconds.
    - JsonValue; type "duration" (constant)
      
      The type of the usage object. Always duration for this variant.
      - DURATION("duration")

Transcription Diarized Segment

class TranscriptionDiarizedSegment:

A segment of diarized transcript text with speaker metadata.
- String id
  
  Unique identifier for the segment.
- double end
  
  End timestamp of the segment in seconds.
- String speaker
  
  Speaker label for this segment. When known speakers are provided, the label matches known_speaker_names[]. Otherwise speakers are labeled sequentially using capital letters (A, B, ...).
- double start
  
  Start timestamp of the segment in seconds.
- String text
  
  Transcript text for this segment.
- JsonValue; type "transcript.text.segment" (constant)
  
  The type of the segment. Always transcript.text.segment.
  - TRANSCRIPT_TEXT_SEGMENT("transcript.text.segment")
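
Diarized segments carry a speaker label, start and end timestamps, and text, so they can be folded into a readable, speaker-attributed transcript. The sketch below merges consecutive segments from the same speaker into one line. The nested Segment class is a hypothetical stand-in holding only the fields used here, not the SDK's TranscriptionDiarizedSegment, and the output layout is an arbitrary choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Sketch: renders diarized segments as "[start-end] speaker: text" lines. */
final class DiarizedTranscriptFormatter {
    /** Hypothetical stand-in for a diarized segment; mirrors speaker, start, end, and text. */
    static final class Segment {
        final String speaker;
        final double start;
        final double end;
        final String text;

        Segment(String speaker, double start, double end, String text) {
            this.speaker = speaker;
            this.start = start;
            this.end = end;
            this.text = text;
        }
    }

    private DiarizedTranscriptFormatter() {}

    /** Merges consecutive segments that share a speaker and formats one line per turn. */
    static List<String> format(List<Segment> segments) {
        List<String> lines = new ArrayList<>();
        String speaker = null;
        double start = 0.0;
        double end = 0.0;
        StringBuilder text = new StringBuilder();
        for (Segment segment : segments) {
            if (speaker != null && speaker.equals(segment.speaker)) {
                // Same speaker keeps talking: extend the current turn.
                end = segment.end;
                text.append(' ').append(segment.text.trim());
            } else {
                if (speaker != null) {
                    lines.add(line(speaker, start, end, text));
                }
                speaker = segment.speaker;
                start = segment.start;
                end = segment.end;
                text = new StringBuilder(segment.text.trim());
            }
        }
        if (speaker != null) {
            lines.add(line(speaker, start, end, text));
        }
        return lines;
    }

    private static String line(String speaker, double start, double end, CharSequence text) {
        return String.format(Locale.ROOT, "[%.2f-%.2f] %s: %s", start, end, speaker, text);
    }
}
```

With known speakers supplied, the speaker field would carry the provided names; otherwise the lines would be labeled A, B, and so on, as described above.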

Transcription Include

enum TranscriptionInclude:
- LOGPROBS("logprobs")

Transcription Segment

class TranscriptionSegment:
- long id
  
  Unique identifier of the segment.
- double avgLogprob
  
  Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.
- double compressionRatio
  
  Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.
- double end
  
  End time of the segment in seconds.
- double noSpeechProb
  
  Probability of no speech in the segment. If the value is higher than 1.0 and the avg_logprob is below -1, consider this segment silent.
- long seek
  
  Seek offset of the segment.
- double start
  
  Start time of the segment in seconds.
- double temperature
  
  Temperature parameter used for generating the segment.
- String text
  
  Text content of the segment.
- List<Long> tokens
  
  Array of token IDs for the text content.
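
The avgLogprob and compressionRatio descriptions above each name a threshold beyond which a segment should be considered failed. A minimal sketch of applying those two thresholds follows. The class SegmentQualityFilter and its nested Segment type are hypothetical and hold only the fields needed here; the sketch deliberately leaves out the noSpeechProb rule.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: keeps only segments that pass the avgLogprob and compressionRatio thresholds. */
final class SegmentQualityFilter {
    static final double MIN_AVG_LOGPROB = -1.0;      // below this, "consider the logprobs failed"
    static final double MAX_COMPRESSION_RATIO = 2.4; // above this, "consider the compression failed"

    /** Hypothetical stand-in for a transcription segment; not the SDK class. */
    static final class Segment {
        final String text;
        final double avgLogprob;
        final double compressionRatio;

        Segment(String text, double avgLogprob, double compressionRatio) {
            this.text = text;
            this.avgLogprob = avgLogprob;
            this.compressionRatio = compressionRatio;
        }
    }

    private SegmentQualityFilter() {}

    /** True when neither documented failure condition applies to the segment. */
    static boolean looksReliable(Segment segment) {
        return segment.avgLogprob >= MIN_AVG_LOGPROB
            && segment.compressionRatio <= MAX_COMPRESSION_RATIO;
    }

    /** Returns the reliable segments in their original order. */
    static List<Segment> reliableOnly(List<Segment> segments) {
        List<Segment> kept = new ArrayList<>();
        for (Segment segment : segments) {
            if (looksReliable(segment)) {
                kept.add(segment);
            }
        }
        return kept;
    }
}
```

Whether to drop, flag, or re-transcribe a failed segment is an application decision; the filter here only separates the two groups.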

Transcription Stream Event

class TranscriptionStreamEvent: A class that can be one of several variants (union).

Emitted when a diarized transcription returns a completed segment with speaker information. Only emitted when you create a transcription with stream set to true and response_format set to diarized_json.
- class TranscriptionTextSegmentEvent:
  
  Emitted when a diarized transcription returns a completed segment with speaker information. Only emitted when you create a transcription with stream set to true and response_format set to diarized_json.
  - String id
    
    Unique identifier for the segment.
  - double end
    
    End timestamp of the segment in seconds.
  - String speaker
    
    Speaker label for this segment.
  - double start
    
    Start timestamp of the segment in seconds.
  - String text
    
    Transcript text for this segment.
  - JsonValue; type "transcript.text.segment" (constant)
    
    The type of the event. Always transcript.text.segment.
    - TRANSCRIPT_TEXT_SEGMENT("transcript.text.segment")
- class TranscriptionTextDeltaEvent:
  
  Emitted when there is an additional text delta. This is also the first event emitted when the transcription starts. Only emitted when you create a transcription with the stream parameter set to true.
  - String delta
    
    The text delta that was additionally transcribed.
  - JsonValue; type "transcript.text.delta" (constant)
    
    The type of the event. Always transcript.text.delta.
    - TRANSCRIPT_TEXT_DELTA("transcript.text.delta")
  - Optional<List<Logprob>> logprobs
    
    The log probabilities of the delta. Only included if you create a transcription with the include[] parameter set to logprobs.
    - Optional<String> token
      
      The token that was used to generate the log probability.
    - Optional<List<Long>> bytes
      
      The bytes that were used to generate the log probability.
    - Optional<Double> logprob
      
      The log probability of the token.
  - Optional<String> segmentId
    
    Identifier of the diarized segment that this delta belongs to. Only present when using gpt-4o-transcribe-diarize.
- class TranscriptionTextDoneEvent:
  
  Emitted when the transcription is complete. Contains the complete transcription text. Only emitted when you create a transcription with the stream parameter set to true.
  - String text
    
    The text that was transcribed.
  - JsonValue; type "transcript.text.done" (constant)
    
    The type of the event. Always transcript.text.done.
    - TRANSCRIPT_TEXT_DONE("transcript.text.done")
  - Optional<List<Logprob>> logprobs
    
    The log probabilities of the individual tokens in the transcription. Only included if you create a transcription with the include[] parameter set to logprobs.
    - Optional<String> token
      
      The token that was used to generate the log probability.
    - Optional<List<Long>> bytes
      
      The bytes that were used to generate the log probability.
    - Optional<Double> logprob
      
      The log probability of the token.
  - Optional<Usage> usage
    
    Usage statistics for models billed by token usage.
    - long inputTokens
      
      Number of input tokens billed for this request.
    - long outputTokens
      
      Number of output tokens generated.
    - long totalTokens
      
      Total number of tokens used (input + output).
    - JsonValue; type "tokens" (constant)
      
      The type of the usage object. Always tokens for this variant.
      - TOKENS("tokens")
    - Optional<InputTokenDetails> inputTokenDetails
      
      Details about the input tokens billed for this request.
      - Optional<Long> audioTokens
        
        Number of audio tokens billed for this request.
      - Optional<Long> textTokens
        
        Number of text tokens billed for this request.
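
When streaming, delta events arrive first and a done event carries the complete text at the end. The sketch below shows one way a consumer might accumulate them. It is deliberately decoupled from the SDK: TranscriptAccumulator is a hypothetical class, and each event is reduced to its type string plus a single payload (the delta for transcript.text.delta, the text for transcript.text.done), which is a simplification of the event classes above.

```java
/** Sketch: accumulates streamed transcript text from simplified (type, payload) events. */
final class TranscriptAccumulator {
    private final StringBuilder partial = new StringBuilder();
    private String finalText; // set once a transcript.text.done event is seen

    /** Feeds one event; payload is the delta or the final text depending on the type. */
    void onEvent(String type, String payload) {
        switch (type) {
            case "transcript.text.delta":
                partial.append(payload);
                break;
            case "transcript.text.done":
                finalText = payload;
                break;
            default:
                // Other event types, such as transcript.text.segment, are ignored in this sketch.
                break;
        }
    }

    /** True once the done event has been received. */
    boolean isDone() {
        return finalText != null;
    }

    /** The final text if available, otherwise everything accumulated so far. */
    String text() {
        return finalText != null ? finalText : partial.toString();
    }
}
```

Preferring the done event's text over the concatenated deltas avoids relying on the deltas adding up exactly to the final transcript, which the descriptions above do not promise.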

Transcription Text Delta Event

class TranscriptionTextDeltaEvent:

Emitted when there is an additional text delta. This is also the first event emitted when the transcription starts. Only emitted when you create a transcription with the stream parameter set to true.
- String delta
  
  The text delta that was additionally transcribed.
- JsonValue; type "transcript.text.delta" (constant)
  
  The type of the event. Always transcript.text.delta.
  - TRANSCRIPT_TEXT_DELTA("transcript.text.delta")
- Optional<List<Logprob>> logprobs
  
  The log probabilities of the delta. Only included if you create a transcription with the include[] parameter set to logprobs.
  - Optional<String> token
    
    The token that was used to generate the log probability.
  - Optional<List<Long>> bytes
    
    The bytes that were used to generate the log probability.
  - Optional<Double> logprob
    
    The log probability of the token.
- Optional<String> segmentId
  
  Identifier of the diarized segment that this delta belongs to. Only present when using gpt-4o-transcribe-diarize.

Transcription Text Done Event

class TranscriptionTextDoneEvent:

Emitted when the transcription is complete. Contains the complete transcription text. Only emitted when you create a transcription with the stream parameter set to true.
- String text
  
  The text that was transcribed.
- JsonValue; type "transcript.text.done" (constant)
  
  The type of the event. Always transcript.text.done.
  - TRANSCRIPT_TEXT_DONE("transcript.text.done")
- Optional<List<Logprob>> logprobs
  
  The log probabilities of the individual tokens in the transcription. Only included if you create a transcription with the include[] parameter set to logprobs.
  - Optional<String> token
    
    The token that was used to generate the log probability.
  - Optional<List<Long>> bytes
    
    The bytes that were used to generate the log probability.
  - Optional<Double> logprob
    
    The log probability of the token.
- Optional<Usage> usage
  
  Usage statistics for models billed by token usage.
  - long inputTokens
    
    Number of input tokens billed for this request.
  - long outputTokens
    
    Number of output tokens generated.
  - long totalTokens
    
    Total number of tokens used (input + output).
  - JsonValue; type "tokens" (constant)
    
    The type of the usage object. Always tokens for this variant.
    - TOKENS("tokens")
  - Optional<InputTokenDetails> inputTokenDetails
    
    Details about the input tokens billed for this request.
    - Optional<Long> audioTokens
      
      Number of audio tokens billed for this request.
    - Optional<Long> textTokens
      
      Number of text tokens billed for this request.

Transcription Text Segment Event

class TranscriptionTextSegmentEvent:

Emitted when a diarized transcription returns a completed segment with speaker information. Only emitted when you create a transcription with stream set to true and response_format set to diarized_json.
- String id
  
  Unique identifier for the segment.
- double end
  
  End timestamp of the segment in seconds.
- String speaker
  
  Speaker label for this segment.
- double start
  
  Start timestamp of the segment in seconds.
- String text
  
  Transcript text for this segment.
- JsonValue; type "transcript.text.segment" (constant)
  
  The type of the event. Always transcript.text.segment.
  - TRANSCRIPT_TEXT_SEGMENT("transcript.text.segment")

Transcription Verbose

class TranscriptionVerbose:

Represents a verbose_json transcription response returned by the model, based on the provided input.
- double duration
  
  The duration of the input audio.
- String language
  
  The language of the input audio.
- String text
  
  The transcribed text.
- Optional<List<TranscriptionSegment>> segments
  
  Segments of the transcribed text and their corresponding details.
  - long id
    
    Unique identifier of the segment.
  - double avgLogprob
    
    Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.
  - double compressionRatio
    
    Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.
  - double end
    
    End time of the segment in seconds.
  - double noSpeechProb
    
    Probability of no speech in the segment. If the value is higher than 1.0 and the avg_logprob is below -1, consider this segment silent.
  - long seek
    
    Seek offset of the segment.
  - double start
    
    Start time of the segment in seconds.
  - double temperature
    
    Temperature parameter used for generating the segment.
  - String text
    
    Text content of the segment.
  - List<Long> tokens
    
    Array of token IDs for the text content.
- Optional<Usage> usage
  
  Usage statistics for models billed by audio input duration.
  - double seconds
    
    Duration of the input audio in seconds.
  - JsonValue; type "duration" (constant)
    
    The type of the usage object. Always duration for this variant.
    - DURATION("duration")
- Optional<List<TranscriptionWord>> words
  
  Extracted words and their corresponding timestamps.
  - double end
    
    End time of the word in seconds.
  - double start
    
    Start time of the word in seconds.
  - String word
    
    The text content of the word.

Transcription Word

class TranscriptionWord:
- double end
  
  End time of the word in seconds.
- double start
  
  Start time of the word in seconds.
- String word
  
  The text content of the word.
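
Word timestamps are expressed in seconds, while caption formats such as SubRip (srt) use an HH:MM:SS,mmm layout. The sketch below converts a start/end pair into a single caption cue. SrtTimestamps is a hypothetical helper that works on plain numbers and strings rather than SDK TranscriptionWord objects; how words are grouped into cues is left to the caller.

```java
import java.util.Locale;

/** Sketch: formats second-based timestamps as SubRip (srt) caption cues. */
final class SrtTimestamps {
    private SrtTimestamps() {}

    /** Formats a time in seconds as HH:MM:SS,mmm, rounding to the nearest millisecond. */
    static String format(double seconds) {
        long totalMillis = Math.round(seconds * 1000.0);
        long hours = totalMillis / 3_600_000;
        long minutes = (totalMillis / 60_000) % 60;
        long secs = (totalMillis / 1000) % 60;
        long millis = totalMillis % 1000;
        return String.format(Locale.ROOT, "%02d:%02d:%02d,%03d", hours, minutes, secs, millis);
    }

    /** Builds one cue: index line, time range line, then the caption text. */
    static String cue(int index, double start, double end, String text) {
        return index + "\n" + format(start) + " --> " + format(end) + "\n" + text + "\n";
    }

    public static void main(String[] args) {
        System.out.print(cue(1, 0.0, 1.25, "Hello"));
    }
}
```

When the API can return srt directly, requesting that response format is simpler; a helper like this is only relevant when custom cue grouping is built from word-level timestamps.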

Translations

Create translation

TranslationCreateResponse audio().translations().create(TranslationCreateParams params, RequestOptions requestOptions = RequestOptions.none())

post /audio/translations

Translates audio into English.

Parameters

TranslationCreateParams params
- String file
  
  The audio file object (not file name) to translate, in one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm.
- AudioModel model
  
  ID of the model to use. Only whisper-1 (which is powered by our open source Whisper V2 model) is currently available.
- Optional<String> prompt
  
  An optional text to guide the model's style or continue a previous audio segment. The prompt should be in English.
- Optional<ResponseFormat> responseFormat
  
  The format of the output, in one of these options: json, text, srt, verbose_json, or vtt.
  - JSON("json")
  - TEXT("text")
  - SRT("srt")
  - VERBOSE_JSON("verbose_json")
  - VTT("vtt")
- Optional<Double> temperature
  
  The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use log probability to automatically increase the temperature until certain thresholds are hit.

Returns

class TranslationCreateResponse: A class that can be one of several variants (union).
- class Translation:
  - String text
- class TranslationVerbose:
  - double duration
    
    The duration of the input audio.
  - String language
    
    The language of the output translation (always english).
  - String text
    
    The translated text.
  - Optional<List<TranscriptionSegment>> segments
    
    Segments of the translated text and their corresponding details.
    - long id
      
      Unique identifier of the segment.
    - double avgLogprob
      
      Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.
    - double compressionRatio
      
      Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.
    - double end
      
      End time of the segment in seconds.
    - double noSpeechProb
      
      Probability of no speech in the segment. If the value is higher than 1.0 and the avg_logprob is below -1, consider this segment silent.
    - long seek
      
      Seek offset of the segment.
    - double start
      
      Start time of the segment in seconds.
    - double temperature
      
      Temperature parameter used for generating the segment.
    - String text
      
      Text content of the segment.
    - List<Long> tokens
      
      Array of token IDs for the text content.

Example

package com.openai.example;

import com.openai.client.OpenAIClient;
import com.openai.client.okhttp.OpenAIOkHttpClient;
import com.openai.models.audio.AudioModel;
import com.openai.models.audio.translations.TranslationCreateParams;
import com.openai.models.audio.translations.TranslationCreateResponse;
import java.io.ByteArrayInputStream;

public final class Main {
    private Main() {}

    public static void main(String[] args) {
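        // Reads the API key and related settings from environment variables.
        OpenAIClient client = OpenAIOkHttpClient.fromEnv();

        TranslationCreateParams params = TranslationCreateParams.builder()
            // Placeholder bytes; substitute the contents of a real audio file.
            .file(new ByteArrayInputStream("Example data".getBytes()))
            .model(AudioModel.WHISPER_1)
            .build();
        // Sends the request; the response is a union of the translation variants.
        TranslationCreateResponse translation = client.audio().translations().create(params);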
    }
}

Response

{
  "text": "text"
}

Domain Types

Translation

class Translation:
- String text

Translation Verbose

class TranslationVerbose:
- double duration
  
  The duration of the input audio.
- String language
  
  The language of the output translation (always english).
- String text
  
  The translated text.
- Optional<List<TranscriptionSegment>> segments
  
  Segments of the translated text and their corresponding details.
  - long id
    
    Unique identifier of the segment.
  - double avgLogprob
    
    Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.
  - double compressionRatio
    
    Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.
  - double end
    
    End time of the segment in seconds.
  - double noSpeechProb
    
    Probability of no speech in the segment. If the value is higher than 1.0 and the avg_logprob is below -1, consider this segment silent.
  - long seek
    
    Seek offset of the segment.
  - double start
    
    Start time of the segment in seconds.
  - double temperature
    
    Temperature parameter used for generating the segment.
  - String text
    
    Text content of the segment.
  - List<long> tokens
    
    Array of token IDs for the text content.
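The thresholds quoted above for `avgLogprob` and `compressionRatio` can be applied as a client-side filter. The following is a minimal sketch, not part of the SDK: `Segment` is a hypothetical stand-in that carries only the fields it needs, and discarding a segment when either threshold is crossed is an assumption about how a caller might use these values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SegmentQuality {
    // Hypothetical stand-in for a verbose segment; not the SDK's TranscriptionSegment.
    static final class Segment {
        final long id;
        final double avgLogprob;
        final double compressionRatio;
        final String text;

        Segment(long id, double avgLogprob, double compressionRatio, String text) {
            this.id = id;
            this.avgLogprob = avgLogprob;
            this.compressionRatio = compressionRatio;
            this.text = text;
        }
    }

    // Thresholds taken from the field descriptions in this reference.
    static final double MIN_AVG_LOGPROB = -1.0;
    static final double MAX_COMPRESSION_RATIO = 2.4;

    // A segment is suspect if either documented threshold is crossed.
    static boolean isSuspect(Segment s) {
        return s.avgLogprob < MIN_AVG_LOGPROB || s.compressionRatio > MAX_COMPRESSION_RATIO;
    }

    // Returns the segments that pass both checks, in their original order.
    static List<Segment> reliable(List<Segment> segments) {
        List<Segment> kept = new ArrayList<>();
        for (Segment s : segments) {
            if (!isSuspect(s)) {
                kept.add(s);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Segment> segments = Arrays.asList(
            new Segment(0, -0.25, 1.3, "Hello there."),
            new Segment(1, -1.40, 1.1, "(low confidence)"),
            new Segment(2, -0.30, 3.0, "la la la la la"));
        for (Segment s : reliable(segments)) {
            System.out.println(s.id + ": " + s.text);
        }
    }
}
```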

Speech

Create speech

HttpResponse audio().speech().create(SpeechCreateParams params, RequestOptions requestOptions = RequestOptions.none())

post /audio/speech

Generates audio from the input text.

Returns the audio file content, or a stream of audio events.

Parameters

SpeechCreateParams params
- String input
  
  The text to generate audio for. The maximum length is 4096 characters.
- SpeechModel model
  
  One of the available TTS models: tts-1, tts-1-hd, gpt-4o-mini-tts, or gpt-4o-mini-tts-2025-12-15.
- Voice voice
  
  The voice to use when generating the audio. Supported built-in voices are alloy, ash, ballad, coral, echo, fable, onyx, nova, sage, shimmer, verse, marin, and cedar. You may also provide a custom voice object with an id, for example { "id": "voice_1234" }. Previews of the voices are available in the Text to speech guide.
  - String
  - enum UnionMember1:
    - ALLOY("alloy")
    - ASH("ash")
    - BALLAD("ballad")
    - CORAL("coral")
    - ECHO("echo")
    - SAGE("sage")
    - SHIMMER("shimmer")
    - VERSE("verse")
    - MARIN("marin")
    - CEDAR("cedar")
  - class Id:
    
    Custom voice reference.
    - String id
      
      The custom voice ID, e.g. voice_1234.
- Optional<String> instructions
  
  Control the voice of your generated audio with additional instructions. Does not work with tts-1 or tts-1-hd.
- Optional<ResponseFormat> responseFormat
  
  The format of the generated audio. Supported formats are mp3, opus, aac, flac, wav, and pcm.
  - MP3("mp3")
  - OPUS("opus")
  - AAC("aac")
  - FLAC("flac")
  - WAV("wav")
  - PCM("pcm")
- Optional<Double> speed
  
  The speed of the generated audio. Select a value from 0.25 to 4.0. 1.0 is the default.
- Optional<StreamFormat> streamFormat
  
  The format to stream the audio in. Supported formats are sse and audio. sse is not supported for tts-1 or tts-1-hd.
  - SSE("sse")
  - AUDIO("audio")
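The constraints listed for these parameters can be checked before a request is sent. The sketch below is an illustration only: `SpeechRequestChecks` and its method names are hypothetical, models are passed as plain strings rather than SDK enums, and counting "characters" as Unicode code points is an assumption, since the reference does not say how the limit is measured.

```java
import java.util.ArrayList;
import java.util.List;

final class SpeechRequestChecks {
    static final int MAX_INPUT_CHARS = 4096;
    static final double MIN_SPEED = 0.25;
    static final double MAX_SPEED = 4.0;

    // The two models for which instructions and sse are documented as unsupported.
    static boolean isLegacyTts(String model) {
        return "tts-1".equals(model) || "tts-1-hd".equals(model);
    }

    // Returns a list of human-readable problems; an empty list means no documented rule is violated.
    static List<String> problems(String input, String model, Double speed,
                                 String instructions, String streamFormat) {
        List<String> out = new ArrayList<>();
        // Assumption: the 4096-character limit is counted in code points.
        if (input.codePointCount(0, input.length()) > MAX_INPUT_CHARS) {
            out.add("input exceeds " + MAX_INPUT_CHARS + " characters");
        }
        if (speed != null && (speed < MIN_SPEED || speed > MAX_SPEED)) {
            out.add("speed must be between " + MIN_SPEED + " and " + MAX_SPEED);
        }
        if (instructions != null && isLegacyTts(model)) {
            out.add("instructions are not supported by " + model);
        }
        if ("sse".equals(streamFormat) && isLegacyTts(model)) {
            out.add("sse streaming is not supported by " + model);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(problems("Hello", "tts-1", 5.0, null, "sse"));
        System.out.println(problems("Hello", "gpt-4o-mini-tts", 1.0, "Speak calmly.", "sse"));
    }
}
```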

Example

package com.openai.example;

import com.openai.client.OpenAIClient;
import com.openai.client.okhttp.OpenAIOkHttpClient;
import com.openai.core.http.HttpResponse;
import com.openai.models.audio.speech.SpeechCreateParams;
import com.openai.models.audio.speech.SpeechModel;

public final class Main {
    private Main() {}

    public static void main(String[] args) {
        OpenAIClient client = OpenAIOkHttpClient.fromEnv();

        SpeechCreateParams params = SpeechCreateParams.builder()
            .input("input")
            .model(SpeechModel.TTS_1)
            .voice(SpeechCreateParams.Voice.UnionMember1.ALLOY)
            .build();
        HttpResponse speech = client.audio().speech().create(params);
    }
}

Domain Types

Speech Model

enum SpeechModel:
- TTS_1("tts-1")
- TTS_1_HD("tts-1-hd")
- GPT_4O_MINI_TTS("gpt-4o-mini-tts")
- GPT_4O_MINI_TTS_2025_12_15("gpt-4o-mini-tts-2025-12-15")

Voices

Voice Consents

java/resources/audio/index.md +1452 −0 created

# Audio

## Domain Types

### Audio Model

- `enum AudioModel:`

  - `WHISPER_1("whisper-1")`

  - `GPT_4O_TRANSCRIBE("gpt-4o-transcribe")`

  - `GPT_4O_MINI_TRANSCRIBE("gpt-4o-mini-transcribe")`

  - `GPT_4O_MINI_TRANSCRIBE_2025_12_15("gpt-4o-mini-transcribe-2025-12-15")`

  - `GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize")`

### Audio Response Format

- `enum AudioResponseFormat:`

  The format of the output, in one of these options: `json`, `text`, `srt`, `verbose_json`, `vtt`, or `diarized_json`. For `gpt-4o-transcribe` and `gpt-4o-mini-transcribe`, the only supported format is `json`. For `gpt-4o-transcribe-diarize`, the supported formats are `json`, `text`, and `diarized_json`, with `diarized_json` required to receive speaker annotations.

  - `JSON("json")`

  - `TEXT("text")`

  - `SRT("srt")`

  - `VERBOSE_JSON("verbose_json")`

  - `VTT("vtt")`

  - `DIARIZED_JSON("diarized_json")`

# Transcriptions

## Create transcription

`TranscriptionCreateResponse audio().transcriptions().create(TranscriptionCreateParams params, RequestOptions requestOptions = RequestOptions.none())`

**post** `/audio/transcriptions`

Transcribes audio into the input language.

Returns a transcription object in `json`, `diarized_json`, or `verbose_json`
format, or a stream of transcript events.

### Parameters

- `TranscriptionCreateParams params`

  - `String file`

    The audio file object (not file name) to transcribe, in one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm.

  - `AudioModel model`

    ID of the model to use. The options are `gpt-4o-transcribe`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `whisper-1` (which is powered by our open source Whisper V2 model), and `gpt-4o-transcribe-diarize`.

  - `Optional<ChunkingStrategy> chunkingStrategy`

    Controls how the audio is cut into chunks. When set to `"auto"`, the server first normalizes loudness and then uses voice activity detection (VAD) to choose boundaries. A `server_vad` object can be provided to tweak VAD detection parameters manually. If unset, the audio is transcribed as a single block. Required when using `gpt-4o-transcribe-diarize` for inputs longer than 30 seconds.

    - `JsonValue;`

      - `AUTO("auto")`

    - `class VadConfig:`

      - `Type type`

        Must be set to `server_vad` to enable manual chunking using server side VAD.

        - `SERVER_VAD("server_vad")`

      - `Optional<Long> prefixPaddingMs`

        Amount of audio to include before the VAD detected speech (in
        milliseconds).

      - `Optional<Long> silenceDurationMs`

        Duration of silence to detect speech stop (in milliseconds).
        With shorter values the model will respond more quickly,
        but may jump in on short pauses from the user.

      - `Optional<Double> threshold`

        Sensitivity threshold (0.0 to 1.0) for voice activity detection. A
        higher threshold will require louder audio to activate the model, and
        thus might perform better in noisy environments.

  - `Optional<List<TranscriptionInclude>> include`

    Additional information to include in the transcription response.
    `logprobs` will return the log probabilities of the tokens in the
    response to understand the model's confidence in the transcription.
    `logprobs` only works with response_format set to `json` and only with
    the models `gpt-4o-transcribe`, `gpt-4o-mini-transcribe`, and `gpt-4o-mini-transcribe-2025-12-15`. This field is not supported when using `gpt-4o-transcribe-diarize`.

    - `LOGPROBS("logprobs")`

  - `Optional<List<String>> knownSpeakerNames`

    Optional list of speaker names that correspond to the audio samples provided in `known_speaker_references[]`. Each entry should be a short identifier (for example `customer` or `agent`). Up to 4 speakers are supported.

  - `Optional<List<String>> knownSpeakerReferences`

    Optional list of audio samples (as [data URLs](https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/Data_URLs)) that contain known speaker references matching `known_speaker_names[]`. Each sample must be between 2 and 10 seconds, and can use any of the same input audio formats supported by `file`.

  - `Optional<String> language`

    The language of the input audio. Supplying the input language in [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) (e.g. `en`) format will improve accuracy and latency.

  - `Optional<String> prompt`

    An optional text to guide the model's style or continue a previous audio segment. The [prompt](https://platform.openai.com/docs/guides/speech-to-text#prompting) should match the audio language. This field is not supported when using `gpt-4o-transcribe-diarize`.

  - `Optional<AudioResponseFormat> responseFormat`

    The format of the output, in one of these options: `json`, `text`, `srt`, `verbose_json`, `vtt`, or `diarized_json`. For `gpt-4o-transcribe` and `gpt-4o-mini-transcribe`, the only supported format is `json`. For `gpt-4o-transcribe-diarize`, the supported formats are `json`, `text`, and `diarized_json`, with `diarized_json` required to receive speaker annotations.

  - `Optional<Double> temperature`

    The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use [log probability](https://en.wikipedia.org/wiki/Log_probability) to automatically increase the temperature until certain thresholds are hit.

  - `Optional<List<TimestampGranularity>> timestampGranularities`

    The timestamp granularities to populate for this transcription. `response_format` must be set to `verbose_json` to use timestamp granularities. Either or both of these options are supported: `word` and `segment`. Note: There is no additional latency for segment timestamps, but generating word timestamps incurs additional latency.
    This option is not available for `gpt-4o-transcribe-diarize`.

    - `WORD("word")`

    - `SEGMENT("segment")`
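The `knownSpeakerReferences` parameter expects each audio sample as a data URL. A minimal sketch of building one with the standard library follows; `SpeakerReferences` and `toDataUrl` are hypothetical helper names, and the `audio/wav` MIME type is an assumption that should match the actual sample format.

```java
import java.util.Base64;

final class SpeakerReferences {
    // Encodes raw audio bytes as a base64 data URL (RFC 2397 form: data:<mime>;base64,<payload>).
    static String toDataUrl(byte[] audio, String mimeType) {
        return "data:" + mimeType + ";base64," + Base64.getEncoder().encodeToString(audio);
    }

    public static void main(String[] args) {
        // Placeholder bytes; a real sample would be 2 to 10 seconds of audio read from a file.
        byte[] sample = {1, 2, 3};
        System.out.println(toDataUrl(sample, "audio/wav"));
    }
}
```

Each resulting string would be paired, by position, with the matching entry in `knownSpeakerNames`.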

### Returns

- `class TranscriptionCreateResponse: A class that can be one of several variants (union).`

  Represents a transcription response returned by the model, based on the provided input.

  - `class Transcription:`

    Represents a transcription response returned by the model, based on the provided input.

    - `String text`

      The transcribed text.

    - `Optional<List<Logprob>> logprobs`

      The log probabilities of the tokens in the transcription. Only returned with the models `gpt-4o-transcribe` and `gpt-4o-mini-transcribe` if `logprobs` is added to the `include` array.

      - `Optional<String> token`

        The token in the transcription.

      - `Optional<List<Double>> bytes`

        The bytes of the token.

      - `Optional<Double> logprob`

        The log probability of the token.

    - `Optional<Usage> usage`

      Token usage statistics for the request.

      - `class Tokens:`

        Usage statistics for models billed by token usage.

        - `long inputTokens`

          Number of input tokens billed for this request.

        - `long outputTokens`

          Number of output tokens generated.

        - `long totalTokens`

          Total number of tokens used (input + output).

        - `JsonValue; type "tokens" (constant)`

          The type of the usage object. Always `tokens` for this variant.

          - `TOKENS("tokens")`

        - `Optional<InputTokenDetails> inputTokenDetails`

          Details about the input tokens billed for this request.

          - `Optional<Long> audioTokens`

            Number of audio tokens billed for this request.

          - `Optional<Long> textTokens`

            Number of text tokens billed for this request.

      - `class Duration:`

        Usage statistics for models billed by audio input duration.

        - `double seconds`

          Duration of the input audio in seconds.

        - `JsonValue; type "duration" (constant)`

          The type of the usage object. Always `duration` for this variant.

          - `DURATION("duration")`

  - `class TranscriptionDiarized:`

    Represents a diarized transcription response returned by the model, including the combined transcript and speaker-segment annotations.

    - `double duration`

      Duration of the input audio in seconds.

    - `List<TranscriptionDiarizedSegment> segments`

      Segments of the transcript annotated with timestamps and speaker labels.

      - `String id`

        Unique identifier for the segment.

      - `double end`

        End timestamp of the segment in seconds.

      - `String speaker`

        Speaker label for this segment. When known speakers are provided, the label matches `known_speaker_names[]`. Otherwise speakers are labeled sequentially using capital letters (`A`, `B`, ...).

      - `double start`

        Start timestamp of the segment in seconds.

      - `String text`

        Transcript text for this segment.

      - `JsonValue; type "transcript.text.segment" (constant)`

        The type of the segment. Always `transcript.text.segment`.

        - `TRANSCRIPT_TEXT_SEGMENT("transcript.text.segment")`

    - `JsonValue; task "transcribe" (constant)`

      The type of task that was run. Always `transcribe`.

      - `TRANSCRIBE("transcribe")`

    - `String text`

      The concatenated transcript text for the entire audio input.

    - `Optional<Usage> usage`

      Token or duration usage statistics for the request.

      - `class Tokens:`

        Usage statistics for models billed by token usage.

        - `long inputTokens`

          Number of input tokens billed for this request.

        - `long outputTokens`

          Number of output tokens generated.

        - `long totalTokens`

          Total number of tokens used (input + output).

        - `JsonValue; type "tokens" (constant)`

          The type of the usage object. Always `tokens` for this variant.

          - `TOKENS("tokens")`

        - `Optional<InputTokenDetails> inputTokenDetails`

          Details about the input tokens billed for this request.

          - `Optional<Long> audioTokens`

            Number of audio tokens billed for this request.

          - `Optional<Long> textTokens`

            Number of text tokens billed for this request.

      - `class Duration:`

        Usage statistics for models billed by audio input duration.

        - `double seconds`

          Duration of the input audio in seconds.

        - `JsonValue; type "duration" (constant)`

          The type of the usage object. Always `duration` for this variant.

          - `DURATION("duration")`

  - `class TranscriptionVerbose:`

    Represents a verbose json transcription response returned by the model, based on the provided input.

    - `double duration`

      The duration of the input audio.

    - `String language`

      The language of the input audio.

    - `String text`

      The transcribed text.

    - `Optional<List<TranscriptionSegment>> segments`

      Segments of the transcribed text and their corresponding details.

      - `long id`

        Unique identifier of the segment.

      - `double avgLogprob`

        Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.

      - `double compressionRatio`

        Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.

      - `double end`

        End time of the segment in seconds.

      - `double noSpeechProb`

        Probability of no speech in the segment. If the value is higher than 1.0 and the `avg_logprob` is below -1, consider this segment silent.

      - `long seek`

        Seek offset of the segment.

      - `double start`

        Start time of the segment in seconds.

      - `double temperature`

        Temperature parameter used for generating the segment.

      - `String text`

        Text content of the segment.

      - `List<long> tokens`

        Array of token IDs for the text content.

    - `Optional<Usage> usage`

      Usage statistics for models billed by audio input duration.

      - `double seconds`

        Duration of the input audio in seconds.

      - `JsonValue; type "duration" (constant)`

        The type of the usage object. Always `duration` for this variant.

        - `DURATION("duration")`

    - `Optional<List<TranscriptionWord>> words`

      Extracted words and their corresponding timestamps.

      - `double end`

        End time of the word in seconds.

      - `double start`

        Start time of the word in seconds.

      - `String word`

        The text content of the word.
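Word and segment timestamps are returned as seconds in a `double`. When displaying them, a caller might convert each value to an `HH:MM:SS.mmm` string. The sketch below shows one way to do that; `WordTimestamps` and `format` are hypothetical names, and rounding to the nearest millisecond is an assumption.

```java
import java.util.Locale;

final class WordTimestamps {
    // Converts seconds to HH:MM:SS.mmm, rounding to the nearest millisecond.
    static String format(double seconds) {
        long totalMs = Math.round(seconds * 1000.0);
        long ms = totalMs % 1000;
        long totalSec = totalMs / 1000;
        long s = totalSec % 60;
        long m = (totalSec / 60) % 60;
        long h = totalSec / 3600;
        return String.format(Locale.ROOT, "%02d:%02d:%02d.%03d", h, m, s, ms);
    }

    public static void main(String[] args) {
        // Example values standing in for a word's start and end fields.
        System.out.println(format(1.5) + " --> " + format(3661.5) + "  hello");
    }
}
```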

### Example

```java
package com.openai.example;

import com.openai.client.OpenAIClient;
import com.openai.client.okhttp.OpenAIOkHttpClient;
import com.openai.models.audio.AudioModel;
import com.openai.models.audio.transcriptions.TranscriptionCreateParams;
import com.openai.models.audio.transcriptions.TranscriptionCreateResponse;
import java.io.ByteArrayInputStream;

public final class Main {
    private Main() {}

    public static void main(String[] args) {
        OpenAIClient client = OpenAIOkHttpClient.fromEnv();

        TranscriptionCreateParams params = TranscriptionCreateParams.builder()
            .file(new ByteArrayInputStream("Example data".getBytes()))
            .model(AudioModel.GPT_4O_TRANSCRIBE)
            .build();
        TranscriptionCreateResponse transcription = client.audio().transcriptions().create(params);
    }
}
```

#### Response

```json
{
  "text": "text",
  "logprobs": [
    {
      "token": "token",
      "bytes": [
        0
      ],
      "logprob": 0
    }
  ],
  "usage": {
    "input_tokens": 0,
    "output_tokens": 0,
    "total_tokens": 0,
    "type": "tokens",
    "input_token_details": {
      "audio_tokens": 0,
      "text_tokens": 0
    }
  }
}
```

## Domain Types

### Transcription

- `class Transcription:`

  Represents a transcription response returned by the model, based on the provided input.

  - `String text`

    The transcribed text.

  - `Optional<List<Logprob>> logprobs`

    The log probabilities of the tokens in the transcription. Only returned with the models `gpt-4o-transcribe` and `gpt-4o-mini-transcribe` if `logprobs` is added to the `include` array.

    - `Optional<String> token`

      The token in the transcription.

    - `Optional<List<Double>> bytes`

      The bytes of the token.

    - `Optional<Double> logprob`

      The log probability of the token.

  - `Optional<Usage> usage`

    Token usage statistics for the request.

    - `class Tokens:`

      Usage statistics for models billed by token usage.

      - `long inputTokens`

        Number of input tokens billed for this request.

      - `long outputTokens`

        Number of output tokens generated.

      - `long totalTokens`

        Total number of tokens used (input + output).

      - `JsonValue; type "tokens" (constant)`

        The type of the usage object. Always `tokens` for this variant.

        - `TOKENS("tokens")`

      - `Optional<InputTokenDetails> inputTokenDetails`

        Details about the input tokens billed for this request.

        - `Optional<Long> audioTokens`

          Number of audio tokens billed for this request.

        - `Optional<Long> textTokens`

          Number of text tokens billed for this request.

    - `class Duration:`

      Usage statistics for models billed by audio input duration.

      - `double seconds`

        Duration of the input audio in seconds.

      - `JsonValue; type "duration" (constant)`

        The type of the usage object. Always `duration` for this variant.

        - `DURATION("duration")`

### Transcription Diarized

- `class TranscriptionDiarized:`

  Represents a diarized transcription response returned by the model, including the combined transcript and speaker-segment annotations.

  - `double duration`

    Duration of the input audio in seconds.

  - `List<TranscriptionDiarizedSegment> segments`

    Segments of the transcript annotated with timestamps and speaker labels.

    - `String id`

      Unique identifier for the segment.

    - `double end`

      End timestamp of the segment in seconds.

    - `String speaker`

      Speaker label for this segment. When known speakers are provided, the label matches `known_speaker_names[]`. Otherwise speakers are labeled sequentially using capital letters (`A`, `B`, ...).

    - `double start`

      Start timestamp of the segment in seconds.

    - `String text`

      Transcript text for this segment.

    - `JsonValue; type "transcript.text.segment" (constant)`

      The type of the segment. Always `transcript.text.segment`.

      - `TRANSCRIPT_TEXT_SEGMENT("transcript.text.segment")`

  - `JsonValue; task "transcribe" (constant)`

    The type of task that was run. Always `transcribe`.

    - `TRANSCRIBE("transcribe")`

  - `String text`

    The concatenated transcript text for the entire audio input.

  - `Optional<Usage> usage`

    Token or duration usage statistics for the request.

    - `class Tokens:`

      Usage statistics for models billed by token usage.

      - `long inputTokens`

        Number of input tokens billed for this request.

      - `long outputTokens`

        Number of output tokens generated.

      - `long totalTokens`

        Total number of tokens used (input + output).

      - `JsonValue; type "tokens" (constant)`

        The type of the usage object. Always `tokens` for this variant.

        - `TOKENS("tokens")`

      - `Optional<InputTokenDetails> inputTokenDetails`

        Details about the input tokens billed for this request.

        - `Optional<Long> audioTokens`

          Number of audio tokens billed for this request.

        - `Optional<Long> textTokens`

          Number of text tokens billed for this request.

    - `class Duration:`

      Usage statistics for models billed by audio input duration.

      - `double seconds`

        Duration of the input audio in seconds.

      - `JsonValue; type "duration" (constant)`

        The type of the usage object. Always `duration` for this variant.

        - `DURATION("duration")`

### Transcription Diarized Segment

- `class TranscriptionDiarizedSegment:`

  A segment of diarized transcript text with speaker metadata.

  - `String id`

    Unique identifier for the segment.

  - `double end`

    End timestamp of the segment in seconds.

  - `String speaker`

    Speaker label for this segment. When known speakers are provided, the label matches `known_speaker_names[]`. Otherwise speakers are labeled sequentially using capital letters (`A`, `B`, ...).

  - `double start`

    Start timestamp of the segment in seconds.

  - `String text`

    Transcript text for this segment.

  - `JsonValue; type "transcript.text.segment" (constant)`

    The type of the segment. Always `transcript.text.segment`.

    - `TRANSCRIPT_TEXT_SEGMENT("transcript.text.segment")`
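Because each diarized segment carries a speaker label with start and end timestamps, per-speaker talk time can be derived by summing segment lengths. The following is a minimal sketch: `Segment` here is a hypothetical stand-in with only three fields, not the SDK's `TranscriptionDiarizedSegment`.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SpeakerTime {
    // Hypothetical stand-in carrying only the fields this example needs.
    static final class Segment {
        final String speaker;
        final double start;
        final double end;

        Segment(String speaker, double start, double end) {
            this.speaker = speaker;
            this.start = start;
            this.end = end;
        }
    }

    // Sums (end - start) per speaker label, keeping first-seen speaker order.
    static Map<String, Double> totals(List<Segment> segments) {
        Map<String, Double> out = new LinkedHashMap<>();
        for (Segment s : segments) {
            out.merge(s.speaker, s.end - s.start, Double::sum);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Segment> segments = Arrays.asList(
            new Segment("A", 0.0, 2.5),
            new Segment("B", 2.5, 4.0),
            new Segment("A", 4.0, 5.0));
        System.out.println(totals(segments));
    }
}
```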

### Transcription Include

- `enum TranscriptionInclude:`

  - `LOGPROBS("logprobs")`

### Transcription Segment

- `class TranscriptionSegment:`

  - `long id`

    Unique identifier of the segment.

  - `double avgLogprob`

    Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.

  - `double compressionRatio`

    Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.

  - `double end`

    End time of the segment in seconds.

  - `double noSpeechProb`

    Probability of no speech in the segment. If the value is higher than 1.0 and the `avg_logprob` is below -1, consider this segment silent.

  - `long seek`

    Seek offset of the segment.

  - `double start`

    Start time of the segment in seconds.

  - `double temperature`

    Temperature parameter used for generating the segment.

  - `String text`

    Text content of the segment.

  - `List<long> tokens`

    Array of token IDs for the text content.

### Transcription Stream Event

- `class TranscriptionStreamEvent: A class that can be one of several variants (union).`

  Emitted when a diarized transcription returns a completed segment with speaker information. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with `stream` set to `true` and `response_format` set to `diarized_json`.

  - `class TranscriptionTextSegmentEvent:`

    Emitted when a diarized transcription returns a completed segment with speaker information. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with `stream` set to `true` and `response_format` set to `diarized_json`.

    - `String id`

      Unique identifier for the segment.

    - `double end`

      End timestamp of the segment in seconds.

    - `String speaker`

      Speaker label for this segment.

    - `double start`

      Start timestamp of the segment in seconds.

    - `String text`

      Transcript text for this segment.

    - `JsonValue; type "transcript.text.segment" (constant)`

      The type of the event. Always `transcript.text.segment`.

      - `TRANSCRIPT_TEXT_SEGMENT("transcript.text.segment")`

  - `class TranscriptionTextDeltaEvent:`

    Emitted when there is an additional text delta. This is also the first event emitted when the transcription starts. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `stream` parameter set to `true`.

    - `String delta`

      The text delta that was additionally transcribed.

    - `JsonValue; type "transcript.text.delta" (constant)`

      The type of the event. Always `transcript.text.delta`.

      - `TRANSCRIPT_TEXT_DELTA("transcript.text.delta")`

    - `Optional<List<Logprob>> logprobs`

      The log probabilities of the delta. Only included if you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `include[]` parameter set to `logprobs`.

      - `Optional<String> token`

        The token that was used to generate the log probability.

      - `Optional<List<Long>> bytes`

        The bytes that were used to generate the log probability.

      - `Optional<Double> logprob`

        The log probability of the token.

    - `Optional<String> segmentId`

      Identifier of the diarized segment that this delta belongs to. Only present when using `gpt-4o-transcribe-diarize`.

  - `class TranscriptionTextDoneEvent:`

    Emitted when the transcription is complete. Contains the complete transcription text. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `stream` parameter set to `true`.

    - `String text`

      The text that was transcribed.

    - `JsonValue; type "transcript.text.done" (constant)`

      The type of the event. Always `transcript.text.done`.

      - `TRANSCRIPT_TEXT_DONE("transcript.text.done")`

    - `Optional<List<Logprob>> logprobs`

      The log probabilities of the individual tokens in the transcription. Only included if you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `include[]` parameter set to `logprobs`.

      - `Optional<String> token`

        The token that was used to generate the log probability.

      - `Optional<List<Long>> bytes`

        The bytes that were used to generate the log probability.

      - `Optional<Double> logprob`

        The log probability of the token.

    - `Optional<Usage> usage`

      Usage statistics for models billed by token usage.

      - `long inputTokens`

        Number of input tokens billed for this request.

      - `long outputTokens`

        Number of output tokens generated.

      - `long totalTokens`

        Total number of tokens used (input + output).

      - `JsonValue; type "tokens" (constant)`

        The type of the usage object. Always `tokens` for this variant.

        - `TOKENS("tokens")`

      - `Optional<InputTokenDetails> inputTokenDetails`

        Details about the input tokens billed for this request.

        - `Optional<Long> audioTokens`

          Number of audio tokens billed for this request.

        - `Optional<Long> textTokens`

          Number of text tokens billed for this request.

### Transcription Text Delta Event

- `class TranscriptionTextDeltaEvent:`

  Emitted when there is an additional text delta. This is also the first event emitted when the transcription starts. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `stream` parameter set to `true`.

  - `String delta`

    The text delta that was additionally transcribed.

  - `JsonValue; type "transcript.text.delta" (constant)`

    The type of the event. Always `transcript.text.delta`.

    - `TRANSCRIPT_TEXT_DELTA("transcript.text.delta")`

  - `Optional<List<Logprob>> logprobs`

    The log probabilities of the delta. Only included if you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `include[]` parameter set to `logprobs`.

    - `Optional<String> token`

      The token that was used to generate the log probability.

    - `Optional<List<Long>> bytes`

      The bytes that were used to generate the log probability.

    - `Optional<Double> logprob`

      The log probability of the token.

  - `Optional<String> segmentId`

    Identifier of the diarized segment that this delta belongs to. Only present when using `gpt-4o-transcribe-diarize`.

### Transcription Text Done Event

- `class TranscriptionTextDoneEvent:`

  Emitted when the transcription is complete. Contains the complete transcription text. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `stream` parameter set to `true`.

  - `String text`

    The text that was transcribed.

  - `JsonValue; type "transcript.text.done" (constant)`

    The type of the event. Always `transcript.text.done`.

    - `TRANSCRIPT_TEXT_DONE("transcript.text.done")`

  - `Optional<List<Logprob>> logprobs`

    The log probabilities of the individual tokens in the transcription. Only included if you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `include[]` parameter set to `logprobs`.

    - `Optional<String> token`

      The token that was used to generate the log probability.

    - `Optional<List<Long>> bytes`

      The bytes that were used to generate the log probability.

    - `Optional<Double> logprob`

      The log probability of the token.

  - `Optional<Usage> usage`

    Usage statistics for models billed by token usage.

    - `long inputTokens`

      Number of input tokens billed for this request.

    - `long outputTokens`

      Number of output tokens generated.

    - `long totalTokens`

      Total number of tokens used (input + output).

    - `JsonValue; type "tokens" (constant)`

      The type of the usage object. Always `tokens` for this variant.

      - `TOKENS("tokens")`

    - `Optional<InputTokenDetails> inputTokenDetails`

      Details about the input tokens billed for this request.

      - `Optional<Long> audioTokens`

        Number of audio tokens billed for this request.

      - `Optional<Long> textTokens`

        Number of text tokens billed for this request.

963
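The relationship between delta events and the done event can be sketched as follows: deltas are concatenated in arrival order, and the done event supplies the complete text. This is a minimal illustration under stated assumptions. The `Event` record is a hypothetical stand-in keyed on the `type` string, not the SDK's event classes.

```java
import java.util.List;

// Hypothetical stand-in types; the SDK's streaming classes are not used here.
final class TranscriptAssembler {
    private TranscriptAssembler() {}

    // Minimal event shape: the `type` discriminator plus its text payload
    // (`delta` for delta events, `text` for the done event).
    record Event(String type, String payload) {}

    // Concatenates deltas in arrival order. If a done event arrives, its
    // complete text is returned instead of the accumulated deltas.
    static String assemble(List<Event> events) {
        StringBuilder partial = new StringBuilder();
        for (Event e : events) {
            switch (e.type()) {
                case "transcript.text.delta" -> partial.append(e.payload());
                case "transcript.text.done" -> {
                    return e.payload();
                }
                default -> {
                    // Other event types (for example segments) are ignored.
                }
            }
        }
        // The stream ended without a done event: return what was received.
        return partial.toString();
    }

    public static void main(String[] args) {
        List<Event> events = List.of(
            new Event("transcript.text.delta", "Hello"),
            new Event("transcript.text.delta", " world"),
            new Event("transcript.text.done", "Hello world"));
        System.out.println(assemble(events));
    }
}
```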

### Transcription Text Segment Event

- `class TranscriptionTextSegmentEvent:`

  Emitted when a diarized transcription returns a completed segment with speaker information. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with `stream` set to `true` and `response_format` set to `diarized_json`.

  - `String id`

    Unique identifier for the segment.

  - `double end`

    End timestamp of the segment in seconds.

  - `String speaker`

    Speaker label for this segment.

  - `double start`

    Start timestamp of the segment in seconds.

  - `String text`

    Transcript text for this segment.

  - `JsonValue; type "transcript.text.segment" constant`

    The type of the event. Always `transcript.text.segment`.

    - `TRANSCRIPT_TEXT_SEGMENT("transcript.text.segment")`
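One common use of segment events is to turn them into a readable, speaker-labelled transcript. The sketch below merges consecutive segments from the same speaker into one line. The `Segment` record is a hypothetical stand-in mirroring the fields listed above, not the SDK class, and the speaker labels in `main` are made up.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: groups diarized segments into speaker turns.
final class SpeakerTurns {
    private SpeakerTurns() {}

    // Mirrors the documented fields: id, start, end, speaker, text.
    record Segment(String id, double start, double end, String speaker, String text) {}

    // Merges consecutive segments with the same speaker label into one
    // "speaker: text" line, preserving arrival order.
    static List<String> toTurns(List<Segment> segments) {
        List<String> turns = new ArrayList<>();
        String currentSpeaker = null;
        StringBuilder currentText = new StringBuilder();
        for (Segment s : segments) {
            if (!s.speaker().equals(currentSpeaker)) {
                if (currentSpeaker != null) {
                    turns.add(currentSpeaker + ": " + currentText);
                }
                currentSpeaker = s.speaker();
                currentText.setLength(0);
            } else {
                currentText.append(' ');
            }
            currentText.append(s.text().trim());
        }
        if (currentSpeaker != null) {
            turns.add(currentSpeaker + ": " + currentText);
        }
        return turns;
    }

    public static void main(String[] args) {
        List<Segment> segments = List.of(
            new Segment("seg_1", 0.0, 1.2, "A", "Hi"),
            new Segment("seg_2", 1.2, 2.0, "A", " there"),
            new Segment("seg_3", 2.1, 3.0, "B", "Hello"));
        toTurns(segments).forEach(System.out::println);
    }
}
```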

### Transcription Verbose

- `class TranscriptionVerbose:`

  Represents a verbose JSON transcription response returned by the model, based on the provided input.

  - `double duration`

    The duration of the input audio.

  - `String language`

    The language of the input audio.

  - `String text`

    The transcribed text.

  - `Optional<List<TranscriptionSegment>> segments`

    Segments of the transcribed text and their corresponding details.

    - `long id`

      Unique identifier of the segment.

    - `double avgLogprob`

      Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.

    - `double compressionRatio`

      Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.

    - `double end`

      End time of the segment in seconds.

    - `double noSpeechProb`

      Probability of no speech in the segment. If the value is higher than 1.0 and the `avg_logprob` is below -1, consider this segment silent.

    - `long seek`

      Seek offset of the segment.

    - `double start`

      Start time of the segment in seconds.

    - `double temperature`

      Temperature parameter used for generating the segment.

    - `String text`

      Text content of the segment.

    - `List<Long> tokens`

      Array of token IDs for the text content.

  - `Optional<Usage> usage`

    Usage statistics for models billed by audio input duration.

    - `double seconds`

      Duration of the input audio in seconds.

    - `JsonValue; type "duration" constant`

      The type of the usage object. Always `duration` for this variant.

      - `DURATION("duration")`

  - `Optional<List<TranscriptionWord>> words`

    Extracted words and their corresponding timestamps.

    - `double end`

      End time of the word in seconds.

    - `double start`

      Start time of the word in seconds.

    - `String word`

      The text content of the word.
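The per-segment heuristics described for `avgLogprob`, `compressionRatio`, and `noSpeechProb` can be combined into a single check. The following is a sketch under stated assumptions, not the service's own logic. `SegmentFilter` and its `Segment` record are hypothetical, the -1 and 2.4 bounds are taken from the field descriptions, and the no-speech cutoff is passed in by the caller (the 0.6 used in `main` is an illustrative assumption rather than a documented value).

```java
// Hypothetical helper: classifies a verbose-transcription segment using the
// thresholds mentioned in the field descriptions.
final class SegmentFilter {
    private SegmentFilter() {}

    enum Verdict { OK, SILENT, LOW_CONFIDENCE, COMPRESSION_FAILED }

    // Only the fields needed for the check are mirrored here.
    record Segment(double avgLogprob, double compressionRatio, double noSpeechProb, String text) {}

    static final double MIN_AVG_LOGPROB = -1.0;      // from the avgLogprob description
    static final double MAX_COMPRESSION_RATIO = 2.4; // from the compressionRatio description

    // noSpeechThreshold is supplied by the caller (assumption, see lead-in).
    static Verdict classify(Segment s, double noSpeechThreshold) {
        boolean lowConfidence = s.avgLogprob() < MIN_AVG_LOGPROB;
        if (lowConfidence && s.noSpeechProb() > noSpeechThreshold) {
            return Verdict.SILENT;
        }
        if (lowConfidence) {
            return Verdict.LOW_CONFIDENCE;
        }
        if (s.compressionRatio() > MAX_COMPRESSION_RATIO) {
            return Verdict.COMPRESSION_FAILED;
        }
        return Verdict.OK;
    }

    public static void main(String[] args) {
        Segment segment = new Segment(-0.25, 1.4, 0.02, "Hello there.");
        System.out.println(classify(segment, 0.6));
    }
}
```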

### Transcription Word

- `class TranscriptionWord:`

  - `double end`

    End time of the word in seconds.

  - `double start`

    Start time of the word in seconds.

  - `String word`

    The text content of the word.
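Word timestamps are given in seconds, so rendering them usually involves a conversion step. As one illustration, the sketch below formats a word as an SRT-style cue (`HH:MM:SS,mmm`). `SrtTimestamps` and its `Word` record are hypothetical names, and the code assumes non-negative timestamps.

```java
import java.util.Locale;

// Hypothetical helper: renders word timestamps as SRT-style cues.
final class SrtTimestamps {
    private SrtTimestamps() {}

    // Mirrors the documented fields: start, end, word.
    record Word(double start, double end, String word) {}

    // Formats non-negative seconds as HH:MM:SS,mmm, rounding to milliseconds.
    static String format(double seconds) {
        long totalMs = Math.round(seconds * 1000.0);
        long hours = totalMs / 3_600_000;
        long minutes = (totalMs % 3_600_000) / 60_000;
        long secs = (totalMs % 60_000) / 1_000;
        long millis = totalMs % 1_000;
        return String.format(Locale.ROOT, "%02d:%02d:%02d,%03d", hours, minutes, secs, millis);
    }

    // Builds one cue: index line, time range line, text line.
    static String cue(int index, Word w) {
        return index + "\n" + format(w.start()) + " --> " + format(w.end()) + "\n" + w.word() + "\n";
    }

    public static void main(String[] args) {
        System.out.print(cue(1, new Word(0.0, 0.5, "Hello")));
    }
}
```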

# Translations

## Create translation

`TranslationCreateResponse audio().translations().create(TranslationCreateParams params, RequestOptions requestOptions = RequestOptions.none())`

**post** `/audio/translations`

Translates audio into English.

### Parameters

- `TranslationCreateParams params`

  - `String file`

    The audio file object (not file name) to translate, in one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm.

  - `AudioModel model`

    ID of the model to use. Only `whisper-1` (which is powered by our open source Whisper V2 model) is currently available.

  - `Optional<String> prompt`

    An optional text to guide the model's style or continue a previous audio segment. The [prompt](https://platform.openai.com/docs/guides/speech-to-text#prompting) should be in English.

  - `Optional<ResponseFormat> responseFormat`

    The format of the output, in one of these options: `json`, `text`, `srt`, `verbose_json`, or `vtt`.

    - `JSON("json")`

    - `TEXT("text")`

    - `SRT("srt")`

    - `VERBOSE_JSON("verbose_json")`

    - `VTT("vtt")`

  - `Optional<Double> temperature`

    The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use [log probability](https://en.wikipedia.org/wiki/Log_probability) to automatically increase the temperature until certain thresholds are hit.

### Returns

- `class TranslationCreateResponse:`

  A union class that can be one of several variants.

  - `class Translation:`

    - `String text`

  - `class TranslationVerbose:`

    - `double duration`

      The duration of the input audio.

    - `String language`

      The language of the output translation (always `english`).

    - `String text`

      The translated text.

    - `Optional<List<TranscriptionSegment>> segments`

      Segments of the translated text and their corresponding details.

      - `long id`

        Unique identifier of the segment.

      - `double avgLogprob`

        Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.

      - `double compressionRatio`

        Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.

      - `double end`

        End time of the segment in seconds.

      - `double noSpeechProb`

        Probability of no speech in the segment. If the value is higher than 1.0 and the `avg_logprob` is below -1, consider this segment silent.

      - `long seek`

        Seek offset of the segment.

      - `double start`

        Start time of the segment in seconds.

      - `double temperature`

        Temperature parameter used for generating the segment.

      - `String text`

        Text content of the segment.

      - `List<Long> tokens`

        Array of token IDs for the text content.

### Example

```java
package com.openai.example;

import com.openai.client.OpenAIClient;
import com.openai.client.okhttp.OpenAIOkHttpClient;
import com.openai.models.audio.AudioModel;
import com.openai.models.audio.translations.TranslationCreateParams;
import com.openai.models.audio.translations.TranslationCreateResponse;
import java.io.ByteArrayInputStream;

public final class Main {
    private Main() {}

    public static void main(String[] args) {
        // Configures the client from environment variables.
        OpenAIClient client = OpenAIOkHttpClient.fromEnv();

        // "Example data" is a placeholder; supply the bytes of a real audio file.
        TranslationCreateParams params = TranslationCreateParams.builder()
            .file(new ByteArrayInputStream("Example data".getBytes()))
            .model(AudioModel.WHISPER_1)
            .build();
        TranslationCreateResponse translation = client.audio().translations().create(params);
    }
}
```

#### Response

```json
{
  "text": "text"
}
```

## Domain Types

### Translation

- `class Translation:`

  - `String text`

### Translation Verbose

- `class TranslationVerbose:`

  - `double duration`

    The duration of the input audio.

  - `String language`

    The language of the output translation (always `english`).

  - `String text`

    The translated text.

  - `Optional<List<TranscriptionSegment>> segments`

    Segments of the translated text and their corresponding details.

    - `long id`

      Unique identifier of the segment.

    - `double avgLogprob`

      Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.

    - `double compressionRatio`

      Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.

    - `double end`

      End time of the segment in seconds.

    - `double noSpeechProb`

      Probability of no speech in the segment. If the value is higher than 1.0 and the `avg_logprob` is below -1, consider this segment silent.

    - `long seek`

      Seek offset of the segment.

    - `double start`

      Start time of the segment in seconds.

    - `double temperature`

      Temperature parameter used for generating the segment.

    - `String text`

      Text content of the segment.

    - `List<Long> tokens`

      Array of token IDs for the text content.

# Speech

## Create speech

`HttpResponse audio().speech().create(SpeechCreateParams params, RequestOptions requestOptions = RequestOptions.none())`

**post** `/audio/speech`

Generates audio from the input text.

Returns the audio file content, or a stream of audio events.

### Parameters

- `SpeechCreateParams params`

  - `String input`

    The text to generate audio for. The maximum length is 4096 characters.

  - `SpeechModel model`

    One of the available [TTS models](https://platform.openai.com/docs/models#tts): `tts-1`, `tts-1-hd`, `gpt-4o-mini-tts`, or `gpt-4o-mini-tts-2025-12-15`.

  - `Voice voice`

    The voice to use when generating the audio. Supported built-in voices are `alloy`, `ash`, `ballad`, `coral`, `echo`, `fable`, `onyx`, `nova`, `sage`, `shimmer`, `verse`, `marin`, and `cedar`. You may also provide a custom voice object with an `id`, for example `{ "id": "voice_1234" }`. Previews of the voices are available in the [Text to speech guide](https://platform.openai.com/docs/guides/text-to-speech#voice-options).

    - `String`

    - `enum UnionMember1:`

      - `ALLOY("alloy")`

      - `ASH("ash")`

      - `BALLAD("ballad")`

      - `CORAL("coral")`

      - `ECHO("echo")`

      - `SAGE("sage")`

      - `SHIMMER("shimmer")`

      - `VERSE("verse")`

      - `MARIN("marin")`

      - `CEDAR("cedar")`

    - `class Id:`

      Custom voice reference.

      - `String id`

        The custom voice ID, e.g. `voice_1234`.

  - `Optional<String> instructions`

    Control the voice of your generated audio with additional instructions. Does not work with `tts-1` or `tts-1-hd`.

  - `Optional<ResponseFormat> responseFormat`

    The format of the generated audio. Supported formats are `mp3`, `opus`, `aac`, `flac`, `wav`, and `pcm`.

    - `MP3("mp3")`

    - `OPUS("opus")`

    - `AAC("aac")`

    - `FLAC("flac")`

    - `WAV("wav")`

    - `PCM("pcm")`

  - `Optional<Double> speed`

    The speed of the generated audio. Select a value from `0.25` to `4.0`. `1.0` is the default.

  - `Optional<StreamFormat> streamFormat`

    The format to stream the audio in. Supported formats are `sse` and `audio`. `sse` is not supported for `tts-1` or `tts-1-hd`.

    - `SSE("sse")`

    - `AUDIO("audio")`
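The constraints stated in the parameter descriptions (input length, speed range, and the features that `tts-1` and `tts-1-hd` do not support) can be checked on the client before sending a request. The following is a minimal sketch of such a pre-flight check. `SpeechRequestCheck` is a hypothetical helper working on plain strings, not an SDK class, and it assumes the 4096-character limit can be approximated with `String.length()`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical pre-flight check built from the parameter descriptions above.
final class SpeechRequestCheck {
    private SpeechRequestCheck() {}

    static final int MAX_INPUT_CHARS = 4096;
    static final double MIN_SPEED = 0.25;
    static final double MAX_SPEED = 4.0;
    // Models for which `instructions` and `sse` streaming are documented as unsupported.
    static final Set<String> RESTRICTED_MODELS = Set.of("tts-1", "tts-1-hd");

    // speed, instructions, and streamFormat may be null when unset.
    // Returns a list of problems; an empty list means no issue was found.
    static List<String> validate(
            String input, String model, Double speed, String instructions, String streamFormat) {
        List<String> problems = new ArrayList<>();
        // Assumption: length() (UTF-16 code units) approximates the character limit.
        if (input.length() > MAX_INPUT_CHARS) {
            problems.add("input exceeds " + MAX_INPUT_CHARS + " characters");
        }
        if (speed != null && (speed < MIN_SPEED || speed > MAX_SPEED)) {
            problems.add("speed must be between 0.25 and 4.0");
        }
        boolean restricted = RESTRICTED_MODELS.contains(model);
        if (restricted && instructions != null) {
            problems.add("instructions are not supported by " + model);
        }
        if (restricted && "sse".equals(streamFormat)) {
            problems.add("sse streaming is not supported by " + model);
        }
        return problems;
    }

    public static void main(String[] args) {
        System.out.println(validate("Hello", "tts-1", 5.0, "Speak calmly.", "sse"));
    }
}
```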

### Example

```java
package com.openai.example;

import com.openai.client.OpenAIClient;
import com.openai.client.okhttp.OpenAIOkHttpClient;
import com.openai.core.http.HttpResponse;
import com.openai.models.audio.speech.SpeechCreateParams;
import com.openai.models.audio.speech.SpeechModel;

public final class Main {
    private Main() {}

    public static void main(String[] args) {
        // Configures the client from environment variables.
        OpenAIClient client = OpenAIOkHttpClient.fromEnv();

        SpeechCreateParams params = SpeechCreateParams.builder()
            .input("input")
            .model(SpeechModel.TTS_1)
            .voice(SpeechCreateParams.Voice.UnionMember1.ALLOY)
            .build();
        // The response carries the generated audio file content.
        HttpResponse speech = client.audio().speech().create(params);
    }
}
```

## Domain Types

### Speech Model

- `enum SpeechModel:`

  - `TTS_1("tts-1")`

  - `TTS_1_HD("tts-1-hd")`

  - `GPT_4O_MINI_TTS("gpt-4o-mini-tts")`

  - `GPT_4O_MINI_TTS_2025_12_15("gpt-4o-mini-tts-2025-12-15")`

# Voices

# Voice Consents

java/resources/audio/index.md 2026-06-10 15:48 UTC to 2026-06-12 00:01 UTC
