Audio

Domain Types

Audio Model

audio_model: "whisper-1" or "gpt-4o-transcribe" or "gpt-4o-mini-transcribe" or 2 more
- "whisper-1"
- "gpt-4o-transcribe"
- "gpt-4o-mini-transcribe"
- "gpt-4o-mini-transcribe-2025-12-15"
- "gpt-4o-transcribe-diarize"

Audio Response Format

audio_response_format: "json" or "text" or "srt" or 3 more

The format of the output, in one of these options: json, text, srt, verbose_json, vtt, or diarized_json. For gpt-4o-transcribe and gpt-4o-mini-transcribe, the only supported format is json. For gpt-4o-transcribe-diarize, the supported formats are json, text, and diarized_json, with diarized_json required to receive speaker annotations.
- "json"
- "text"
- "srt"
- "verbose_json"
- "vtt"
- "diarized_json"
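
The model and format rules above can be checked before a request is sent. The following is a minimal sketch, not part of the CLI: the table and the `is_format_supported` helper are hypothetical, and the entries for `whisper-1` and the dated snapshot are assumptions that the description above does not state.

```python
# Format rules transcribed from the description above.
ALL_FORMATS = {"json", "text", "srt", "verbose_json", "vtt", "diarized_json"}

SUPPORTED_FORMATS = {
    "gpt-4o-transcribe": {"json"},
    "gpt-4o-mini-transcribe": {"json"},
    # Assumption: the dated snapshot follows the same rule as its alias.
    "gpt-4o-mini-transcribe-2025-12-15": {"json"},
    "gpt-4o-transcribe-diarize": {"json", "text", "diarized_json"},
    # Assumption: whisper-1 accepts every format except diarized_json.
    "whisper-1": ALL_FORMATS - {"diarized_json"},
}


def is_format_supported(model: str, response_format: str) -> bool:
    """Return True if the table above lists response_format for model."""
    return response_format in SUPPORTED_FORMATS.get(model, set())
```

Unknown model IDs return False here; a caller that wants to allow custom model strings would need a different default.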

Transcriptions

Create transcription

$ openai audio:transcriptions create

post /audio/transcriptions

Transcribes audio into the input language.

Returns a transcription object in json, diarized_json, or verbose_json format, or a stream of transcript events.

Parameters

--file: string

The audio file object (not file name) to transcribe, in one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm.
--model: string or AudioModel

ID of the model to use. The options are gpt-4o-transcribe, gpt-4o-mini-transcribe, gpt-4o-mini-transcribe-2025-12-15, whisper-1 (which is powered by our open source Whisper V2 model), and gpt-4o-transcribe-diarize.
--chunking-strategy: optional "auto" or object { type, prefix_padding_ms, silence_duration_ms, threshold }

Controls how the audio is cut into chunks. When set to "auto", the server first normalizes loudness and then uses voice activity detection (VAD) to choose boundaries. A server_vad object can be provided to tweak VAD detection parameters manually. If unset, the audio is transcribed as a single block. Required when using gpt-4o-transcribe-diarize for inputs longer than 30 seconds.
--include: optional array of TranscriptionInclude

Additional information to include in the transcription response. logprobs will return the log probabilities of the tokens in the response to understand the model's confidence in the transcription. logprobs only works with response_format set to json and only with the models gpt-4o-transcribe, gpt-4o-mini-transcribe, and gpt-4o-mini-transcribe-2025-12-15. This field is not supported when using gpt-4o-transcribe-diarize.
--known-speaker-name: optional array of string

Optional list of speaker names that correspond to the audio samples provided in known_speaker_references[]. Each entry should be a short identifier (for example customer or agent). Up to 4 speakers are supported.
--known-speaker-reference: optional array of string

Optional list of audio samples (as data URLs) that contain known speaker references matching known_speaker_names[]. Each sample must be between 2 and 10 seconds, and can use any of the same input audio formats supported by file.
--language: optional string

The language of the input audio. Supplying the input language in ISO-639-1 (e.g. en) format will improve accuracy and latency.
--prompt: optional string

An optional text to guide the model's style or continue a previous audio segment. The prompt should match the audio language. This field is not supported when using gpt-4o-transcribe-diarize.
--response-format: optional "json" or "text" or "srt" or 3 more

The format of the output, in one of these options: json, text, srt, verbose_json, vtt, or diarized_json. For gpt-4o-transcribe and gpt-4o-mini-transcribe, the only supported format is json. For gpt-4o-transcribe-diarize, the supported formats are json, text, and diarized_json, with diarized_json required to receive speaker annotations.
--temperature: optional number

The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use log probability to automatically increase the temperature until certain thresholds are hit.
--timestamp-granularity: optional array of "word" or "segment"

The timestamp granularities to populate for this transcription. response_format must be set to verbose_json to use timestamp granularities. Either or both of these options are supported: word, or segment. Note: There is no additional latency for segment timestamps, but generating word timestamps incurs additional latency. This option is not available for gpt-4o-transcribe-diarize.
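
Several of the parameters above constrain one another. The sketch below collects those constraints into one pre-flight check. It is an illustration under stated assumptions: `check_transcription_args` and its argument names are hypothetical, and the audio duration is assumed to be known to the caller.

```python
from typing import List, Optional, Sequence

DIARIZE_MODEL = "gpt-4o-transcribe-diarize"
# Models that the --include description lists as supporting logprobs.
LOGPROB_MODELS = {
    "gpt-4o-transcribe",
    "gpt-4o-mini-transcribe",
    "gpt-4o-mini-transcribe-2025-12-15",
}


def check_transcription_args(
    model: str,
    response_format: str,
    prompt: Optional[str] = None,
    include: Sequence[str] = (),
    timestamp_granularities: Sequence[str] = (),
    chunking_strategy_set: bool = False,
    audio_seconds: Optional[float] = None,
) -> List[str]:
    """Return a list of human-readable problems; empty means no conflict found."""
    problems = []
    if timestamp_granularities and response_format != "verbose_json":
        problems.append("timestamp granularities require response_format verbose_json")
    if "logprobs" in include:
        if response_format != "json":
            problems.append("logprobs requires response_format json")
        if model not in LOGPROB_MODELS:
            problems.append(f"logprobs is not supported by {model}")
    if model == DIARIZE_MODEL:
        if prompt is not None:
            problems.append("prompt is not supported by the diarize model")
        if timestamp_granularities:
            problems.append("timestamp granularities are not available for the diarize model")
        if audio_seconds is not None and audio_seconds > 30 and not chunking_strategy_set:
            problems.append("chunking strategy is required for diarize inputs longer than 30 seconds")
    return problems
```

The function only reports conflicts that the parameter descriptions spell out; the server remains the authority on what it accepts.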

Returns

AudioTranscriptionNewResponse: Transcription or TranscriptionDiarized or TranscriptionVerbose

Represents a transcription response returned by the model, based on the provided input.
- transcription: object { text, logprobs, usage }
  
  Represents a transcription response returned by the model, based on the provided input.
  - text: string
    
    The transcribed text.
  - logprobs: optional array of object { token, bytes, logprob }
    
    The log probabilities of the tokens in the transcription. Only returned with the models gpt-4o-transcribe and gpt-4o-mini-transcribe if logprobs is added to the include array.
    - token: optional string
      
      The token in the transcription.
    - bytes: optional array of number
      
      The bytes of the token.
    - logprob: optional number
      
      The log probability of the token.
  - usage: optional object { input_tokens, output_tokens, total_tokens, 2 more } or object { seconds, type }
    
    Token or duration usage statistics for the request.
    - tokens: object { input_tokens, output_tokens, total_tokens, 2 more }
      
      Usage statistics for models billed by token usage.
      - input_tokens: number
        
        Number of input tokens billed for this request.
      - output_tokens: number
        
        Number of output tokens generated.
      - total_tokens: number
        
        Total number of tokens used (input + output).
      - type: "tokens"
        
        The type of the usage object. Always tokens for this variant.
      - input_token_details: optional object { audio_tokens, text_tokens }
        
        Details about the input tokens billed for this request.
        - audio_tokens: optional number
          
          Number of audio tokens billed for this request.
        - text_tokens: optional number
          
          Number of text tokens billed for this request.
    - duration: object { seconds, type }
      
      Usage statistics for models billed by audio input duration.
      - seconds: number
        
        Duration of the input audio in seconds.
      - type: "duration"
        
        The type of the usage object. Always duration for this variant.
- transcription_diarized: object { duration, segments, task, 2 more }
  
  Represents a diarized transcription response returned by the model, including the combined transcript and speaker-segment annotations.
  - duration: number
    
    Duration of the input audio in seconds.
  - segments: array of TranscriptionDiarizedSegment
    
    Segments of the transcript annotated with timestamps and speaker labels.
    - id: string
      
      Unique identifier for the segment.
    - end: number
      
      End timestamp of the segment in seconds.
    - speaker: string
      
      Speaker label for this segment. When known speakers are provided, the label matches known_speaker_names[]. Otherwise speakers are labeled sequentially using capital letters (A, B, ...).
    - start: number
      
      Start timestamp of the segment in seconds.
    - text: string
      
      Transcript text for this segment.
    - type: "transcript.text.segment"
      
      The type of the segment. Always transcript.text.segment.
  - task: "transcribe"
    
    The type of task that was run. Always transcribe.
  - text: string
    
    The concatenated transcript text for the entire audio input.
  - usage: optional object { input_tokens, output_tokens, total_tokens, 2 more } or object { seconds, type }
    
    Token or duration usage statistics for the request.
    - tokens: object { input_tokens, output_tokens, total_tokens, 2 more }
      
      Usage statistics for models billed by token usage.
      - input_tokens: number
        
        Number of input tokens billed for this request.
      - output_tokens: number
        
        Number of output tokens generated.
      - total_tokens: number
        
        Total number of tokens used (input + output).
      - type: "tokens"
        
        The type of the usage object. Always tokens for this variant.
      - input_token_details: optional object { audio_tokens, text_tokens }
        
        Details about the input tokens billed for this request.
        - audio_tokens: optional number
          
          Number of audio tokens billed for this request.
        - text_tokens: optional number
          
          Number of text tokens billed for this request.
    - duration: object { seconds, type }
      
      Usage statistics for models billed by audio input duration.
      - seconds: number
        
        Duration of the input audio in seconds.
      - type: "duration"
        
        The type of the usage object. Always duration for this variant.
- transcription_verbose: object { duration, language, text, 3 more }
  
  Represents a verbose_json transcription response returned by the model, based on the provided input.
  - duration: number
    
    The duration of the input audio.
  - language: string
    
    The language of the input audio.
  - text: string
    
    The transcribed text.
  - segments: optional array of TranscriptionSegment
    
    Segments of the transcribed text and their corresponding details.
    - id: number
      
      Unique identifier of the segment.
    - avg_logprob: number
      
      Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.
    - compression_ratio: number
      
      Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.
    - end: number
      
      End time of the segment in seconds.
    - no_speech_prob: number
      
      Probability of no speech in the segment. If the value is higher than 1.0 and the avg_logprob is below -1, consider this segment silent.
    - seek: number
      
      Seek offset of the segment.
    - start: number
      
      Start time of the segment in seconds.
    - temperature: number
      
      Temperature parameter used for generating the segment.
    - text: string
      
      Text content of the segment.
    - tokens: array of number
      
      Array of token IDs for the text content.
  - usage: optional object { seconds, type }
    
    Usage statistics for models billed by audio input duration.
    - seconds: number
      
      Duration of the input audio in seconds.
    - type: "duration"
      
      The type of the usage object. Always duration for this variant.
  - words: optional array of TranscriptionWord
    
    Extracted words and their corresponding timestamps.
    - end: number
      
      End time of the word in seconds.
    - start: number
      
      Start time of the word in seconds.
    - word: string
      
      The text content of the word.

Example

openai audio:transcriptions create \
  --api-key 'My API Key' \
  --file 'Example data' \
  --model gpt-4o-transcribe

Response

{
  "text": "text",
  "logprobs": [
    {
      "token": "token",
      "bytes": [
        0
      ],
      "logprob": 0
    }
  ],
  "usage": {
    "input_tokens": 0,
    "output_tokens": 0,
    "total_tokens": 0,
    "type": "tokens",
    "input_token_details": {
      "audio_tokens": 0,
      "text_tokens": 0
    }
  }
}
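
The `usage` object in a response like the one above is a union, and its `type` field says which variant was returned. A minimal sketch of reading it follows; the `summarize_usage` helper and its output wording are hypothetical.

```python
import json


def summarize_usage(usage):
    """Describe a usage object, branching on its documented `type` field."""
    if usage is None:
        return "usage not reported"  # usage is optional in every response type
    if usage["type"] == "tokens":
        return (
            f"{usage['total_tokens']} tokens "
            f"({usage['input_tokens']} in, {usage['output_tokens']} out)"
        )
    if usage["type"] == "duration":
        return f"{usage['seconds']} s of audio"
    raise ValueError(f"unknown usage type: {usage['type']!r}")


# Placeholder payload with the same shape as the sample response.
sample = json.loads(
    '{"text": "text", "usage": {"input_tokens": 0, "output_tokens": 0, '
    '"total_tokens": 0, "type": "tokens"}}'
)
print(summarize_usage(sample.get("usage")))
```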

Domain Types

Transcription

transcription: object { text, logprobs, usage }

Represents a transcription response returned by the model, based on the provided input.
- text: string
  
  The transcribed text.
- logprobs: optional array of object { token, bytes, logprob }
  
  The log probabilities of the tokens in the transcription. Only returned with the models gpt-4o-transcribe and gpt-4o-mini-transcribe if logprobs is added to the include array.
  - token: optional string
    
    The token in the transcription.
  - bytes: optional array of number
    
    The bytes of the token.
  - logprob: optional number
    
    The log probability of the token.
- usage: optional object { input_tokens, output_tokens, total_tokens, 2 more } or object { seconds, type }
  
  Token or duration usage statistics for the request.
  - tokens: object { input_tokens, output_tokens, total_tokens, 2 more }
    
    Usage statistics for models billed by token usage.
    - input_tokens: number
      
      Number of input tokens billed for this request.
    - output_tokens: number
      
      Number of output tokens generated.
    - total_tokens: number
      
      Total number of tokens used (input + output).
    - type: "tokens"
      
      The type of the usage object. Always tokens for this variant.
    - input_token_details: optional object { audio_tokens, text_tokens }
      
      Details about the input tokens billed for this request.
      - audio_tokens: optional number
        
        Number of audio tokens billed for this request.
      - text_tokens: optional number
        
        Number of text tokens billed for this request.
  - duration: object { seconds, type }
    
    Usage statistics for models billed by audio input duration.
    - seconds: number
      
      Duration of the input audio in seconds.
    - type: "duration"
      
      The type of the usage object. Always duration for this variant.
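
Because `logprob` is a natural-log probability, `exp(logprob)` gives the token probability. The sketch below flags tokens the model was unsure about. The `low_confidence_tokens` helper and its 0.5 cutoff are illustrative assumptions, not values from this reference.

```python
import math


def low_confidence_tokens(logprobs, min_probability=0.5):
    """Return (token, probability) pairs whose probability is below the cutoff."""
    flagged = []
    for entry in logprobs or []:  # logprobs is optional and may be absent
        logprob = entry.get("logprob")
        if logprob is None:  # each field of an entry is optional too
            continue
        probability = math.exp(logprob)
        if probability < min_probability:
            flagged.append((entry.get("token"), probability))
    return flagged
```

A lower cutoff flags fewer tokens; a sensible value depends on the audio and is best chosen empirically.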

Transcription Diarized

transcription_diarized: object { duration, segments, task, 2 more }

Represents a diarized transcription response returned by the model, including the combined transcript and speaker-segment annotations.
- duration: number
  
  Duration of the input audio in seconds.
- segments: array of TranscriptionDiarizedSegment
  
  Segments of the transcript annotated with timestamps and speaker labels.
  - id: string
    
    Unique identifier for the segment.
  - end: number
    
    End timestamp of the segment in seconds.
  - speaker: string
    
    Speaker label for this segment. When known speakers are provided, the label matches known_speaker_names[]. Otherwise speakers are labeled sequentially using capital letters (A, B, ...).
  - start: number
    
    Start timestamp of the segment in seconds.
  - text: string
    
    Transcript text for this segment.
  - type: "transcript.text.segment"
    
    The type of the segment. Always transcript.text.segment.
- task: "transcribe"
  
  The type of task that was run. Always transcribe.
- text: string
  
  The concatenated transcript text for the entire audio input.
- usage: optional object { input_tokens, output_tokens, total_tokens, 2 more } or object { seconds, type }
  
  Token or duration usage statistics for the request.
  - tokens: object { input_tokens, output_tokens, total_tokens, 2 more }
    
    Usage statistics for models billed by token usage.
    - input_tokens: number
      
      Number of input tokens billed for this request.
    - output_tokens: number
      
      Number of output tokens generated.
    - total_tokens: number
      
      Total number of tokens used (input + output).
    - type: "tokens"
      
      The type of the usage object. Always tokens for this variant.
    - input_token_details: optional object { audio_tokens, text_tokens }
      
      Details about the input tokens billed for this request.
      - audio_tokens: optional number
        
        Number of audio tokens billed for this request.
      - text_tokens: optional number
        
        Number of text tokens billed for this request.
  - duration: object { seconds, type }
    
    Usage statistics for models billed by audio input duration.
    - seconds: number
      
      Duration of the input audio in seconds.
    - type: "duration"
      
      The type of the usage object. Always duration for this variant.

Transcription Diarized Segment

transcription_diarized_segment: object { id, end, speaker, 3 more }

A segment of diarized transcript text with speaker metadata.
- id: string
  
  Unique identifier for the segment.
- end: number
  
  End timestamp of the segment in seconds.
- speaker: string
  
  Speaker label for this segment. When known speakers are provided, the label matches known_speaker_names[]. Otherwise speakers are labeled sequentially using capital letters (A, B, ...).
- start: number
  
  Start timestamp of the segment in seconds.
- text: string
  
  Transcript text for this segment.
- type: "transcript.text.segment"
  
  The type of the segment. Always transcript.text.segment.
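
Diarized segments carry `speaker`, `start`, and `end`, which is enough to total the speaking time per speaker. A minimal sketch, with a hypothetical helper name and made-up sample segments:

```python
from collections import defaultdict


def talk_time_by_speaker(segments):
    """Sum (end - start) in seconds for each speaker label."""
    totals = defaultdict(float)
    for segment in segments:
        totals[segment["speaker"]] += segment["end"] - segment["start"]
    return dict(totals)


# Illustrative segments, not real API output.
segments = [
    {"speaker": "A", "start": 0.0, "end": 2.5, "text": "Hi."},
    {"speaker": "B", "start": 2.5, "end": 4.0, "text": "Hello."},
    {"speaker": "A", "start": 4.0, "end": 5.0, "text": "Bye."},
]
print(talk_time_by_speaker(segments))
```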

Transcription Include

transcription_include: "logprobs"
- "logprobs"

Transcription Segment

transcription_segment: object { id, avg_logprob, compression_ratio, 7 more }
- id: number
  
  Unique identifier of the segment.
- avg_logprob: number
  
  Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.
- compression_ratio: number
  
  Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.
- end: number
  
  End time of the segment in seconds.
- no_speech_prob: number
  
  Probability of no speech in the segment. If the value is higher than 1.0 and the avg_logprob is below -1, consider this segment silent.
- seek: number
  
  Seek offset of the segment.
- start: number
  
  Start time of the segment in seconds.
- temperature: number
  
  Temperature parameter used for generating the segment.
- text: string
  
  Text content of the segment.
- tokens: array of number
  
  Array of token IDs for the text content.
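
The thresholds quoted for `avg_logprob` and `compression_ratio` can be applied directly to a list of segments. This is a sketch only: `suspect_segments` is a hypothetical helper, and what to do with a flagged segment is left to the caller.

```python
def suspect_segments(segments):
    """Return (segment id, reasons) for segments that cross a documented threshold."""
    flagged = []
    for segment in segments:
        reasons = []
        if segment["avg_logprob"] < -1:
            reasons.append("avg_logprob below -1")
        if segment["compression_ratio"] > 2.4:
            reasons.append("compression_ratio above 2.4")
        if reasons:
            flagged.append((segment["id"], reasons))
    return flagged
```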

Transcription Stream Event

transcription_stream_event: TranscriptionTextSegmentEvent or TranscriptionTextDeltaEvent or TranscriptionTextDoneEvent

An event emitted while streaming a transcription. It is one of the three event types described below, distinguished by the type field.
- transcription_text_segment_event: object { id, end, speaker, 3 more }
  
  Emitted when a diarized transcription returns a completed segment with speaker information. Only emitted when you create a transcription with stream set to true and response_format set to diarized_json.
  - id: string
    
    Unique identifier for the segment.
  - end: number
    
    End timestamp of the segment in seconds.
  - speaker: string
    
    Speaker label for this segment.
  - start: number
    
    Start timestamp of the segment in seconds.
  - text: string
    
    Transcript text for this segment.
  - type: "transcript.text.segment"
    
    The type of the event. Always transcript.text.segment.
- transcription_text_delta_event: object { delta, type, logprobs, segment_id }
  
  Emitted when there is an additional text delta. This is also the first event emitted when the transcription starts. Only emitted when you create a transcription with the Stream parameter set to true.
  - delta: string
    
    The text delta that was additionally transcribed.
  - type: "transcript.text.delta"
    
    The type of the event. Always transcript.text.delta.
  - logprobs: optional array of object { token, bytes, logprob }
    
    The log probabilities of the delta. Only included if you create a transcription with the include[] parameter set to logprobs.
    - token: optional string
      
      The token that was used to generate the log probability.
    - bytes: optional array of number
      
      The bytes that were used to generate the log probability.
    - logprob: optional number
      
      The log probability of the token.
  - segment_id: optional string
    
    Identifier of the diarized segment that this delta belongs to. Only present when using gpt-4o-transcribe-diarize.
- transcription_text_done_event: object { text, type, logprobs, usage }
  
  Emitted when the transcription is complete. Contains the complete transcription text. Only emitted when you create a transcription with the Stream parameter set to true.
  - text: string
    
    The text that was transcribed.
  - type: "transcript.text.done"
    
    The type of the event. Always transcript.text.done.
  - logprobs: optional array of object { token, bytes, logprob }
    
    The log probabilities of the individual tokens in the transcription. Only included if you create a transcription with the include[] parameter set to logprobs.
    - token: optional string
      
      The token that was used to generate the log probability.
    - bytes: optional array of number
      
      The bytes that were used to generate the log probability.
    - logprob: optional number
      
      The log probability of the token.
  - usage: optional object { input_tokens, output_tokens, total_tokens, 2 more }
    
    Usage statistics for models billed by token usage.
    - input_tokens: number
      
      Number of input tokens billed for this request.
    - output_tokens: number
      
      Number of output tokens generated.
    - total_tokens: number
      
      Total number of tokens used (input + output).
    - type: "tokens"
      
      The type of the usage object. Always tokens for this variant.
    - input_token_details: optional object { audio_tokens, text_tokens }
      
      Details about the input tokens billed for this request.
      - audio_tokens: optional number
        
        Number of audio tokens billed for this request.
      - text_tokens: optional number
        
        Number of text tokens billed for this request.
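
A consumer of the stream can dispatch on each event's `type` field. The sketch below assumes the events have already been decoded into dictionaries; how the CLI or an HTTP client delivers them is outside its scope, and `collect_stream` is a hypothetical helper.

```python
def collect_stream(events):
    """Fold decoded stream events into (transcript text, diarized segments)."""
    deltas = []
    segments = []
    final_text = None
    for event in events:
        kind = event["type"]
        if kind == "transcript.text.delta":
            deltas.append(event["delta"])  # incremental text
        elif kind == "transcript.text.segment":
            segments.append(event)  # completed diarized segment
        elif kind == "transcript.text.done":
            final_text = event["text"]  # complete transcript
    # Prefer the done event's text; fall back to the deltas if the stream was cut short.
    text = final_text if final_text is not None else "".join(deltas)
    return text, segments
```

Unknown event types are ignored rather than rejected, so the loop keeps working if new event types appear.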

Transcription Text Delta Event

transcription_text_delta_event: object { delta, type, logprobs, segment_id }

Emitted when there is an additional text delta. This is also the first event emitted when the transcription starts. Only emitted when you create a transcription with the Stream parameter set to true.
- delta: string
  
  The text delta that was additionally transcribed.
- type: "transcript.text.delta"
  
  The type of the event. Always transcript.text.delta.
- logprobs: optional array of object { token, bytes, logprob }
  
  The log probabilities of the delta. Only included if you create a transcription with the include[] parameter set to logprobs.
  - token: optional string
    
    The token that was used to generate the log probability.
  - bytes: optional array of number
    
    The bytes that were used to generate the log probability.
  - logprob: optional number
    
    The log probability of the token.
- segment_id: optional string
  
  Identifier of the diarized segment that this delta belongs to. Only present when using gpt-4o-transcribe-diarize.

Transcription Text Done Event

transcription_text_done_event: object { text, type, logprobs, usage }

Emitted when the transcription is complete. Contains the complete transcription text. Only emitted when you create a transcription with the Stream parameter set to true.
- text: string
  
  The text that was transcribed.
- type: "transcript.text.done"
  
  The type of the event. Always transcript.text.done.
- logprobs: optional array of object { token, bytes, logprob }
  
  The log probabilities of the individual tokens in the transcription. Only included if you create a transcription with the include[] parameter set to logprobs.
  - token: optional string
    
    The token that was used to generate the log probability.
  - bytes: optional array of number
    
    The bytes that were used to generate the log probability.
  - logprob: optional number
    
    The log probability of the token.
- usage: optional object { input_tokens, output_tokens, total_tokens, 2 more }
  
  Usage statistics for models billed by token usage.
  - input_tokens: number
    
    Number of input tokens billed for this request.
  - output_tokens: number
    
    Number of output tokens generated.
  - total_tokens: number
    
    Total number of tokens used (input + output).
  - type: "tokens"
    
    The type of the usage object. Always tokens for this variant.
  - input_token_details: optional object { audio_tokens, text_tokens }
    
    Details about the input tokens billed for this request.
    - audio_tokens: optional number
      
      Number of audio tokens billed for this request.
    - text_tokens: optional number
      
      Number of text tokens billed for this request.

Transcription Text Segment Event

transcription_text_segment_event: object { id, end, speaker, 3 more }

Emitted when a diarized transcription returns a completed segment with speaker information. Only emitted when you create a transcription with stream set to true and response_format set to diarized_json.
- id: string
  
  Unique identifier for the segment.
- end: number
  
  End timestamp of the segment in seconds.
- speaker: string
  
  Speaker label for this segment.
- start: number
  
  Start timestamp of the segment in seconds.
- text: string
  
  Transcript text for this segment.
- type: "transcript.text.segment"
  
  The type of the event. Always transcript.text.segment.

Transcription Verbose

transcription_verbose: object { duration, language, text, 3 more }

Represents a verbose_json transcription response returned by the model, based on the provided input.
- duration: number
  
  The duration of the input audio.
- language: string
  
  The language of the input audio.
- text: string
  
  The transcribed text.
- segments: optional array of TranscriptionSegment
  
  Segments of the transcribed text and their corresponding details.
  - id: number
    
    Unique identifier of the segment.
  - avg_logprob: number
    
    Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.
  - compression_ratio: number
    
    Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.
  - end: number
    
    End time of the segment in seconds.
  - no_speech_prob: number
    
    Probability of no speech in the segment. If the value is higher than 1.0 and the avg_logprob is below -1, consider this segment silent.
  - seek: number
    
    Seek offset of the segment.
  - start: number
    
    Start time of the segment in seconds.
  - temperature: number
    
    Temperature parameter used for generating the segment.
  - text: string
    
    Text content of the segment.
  - tokens: array of number
    
    Array of token IDs for the text content.
- usage: optional object { seconds, type }
  
  Usage statistics for models billed by audio input duration.
  - seconds: number
    
    Duration of the input audio in seconds.
  - type: "duration"
    
    The type of the usage object. Always duration for this variant.
- words: optional array of TranscriptionWord
  
  Extracted words and their corresponding timestamps.
  - end: number
    
    End time of the word in seconds.
  - start: number
    
    Start time of the word in seconds.
  - word: string
    
    The text content of the word.
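
When a verbose_json response is already in hand, its `segments` can be turned into SubRip captions locally. The sketch below is one way to do that and is not how the server produces its own srt output; both function names are hypothetical.

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Render segments (each with start, end, text) as numbered SRT cues."""
    cues = []
    for index, segment in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(segment['start'])} --> {srt_timestamp(segment['end'])}\n"
            f"{segment['text'].strip()}\n"
        )
    return "\n".join(cues)  # a blank line separates consecutive cues
```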

Transcription Word

transcription_word: object { end, start, word }
- end: number
  
  End time of the word in seconds.
- start: number
  
  Start time of the word in seconds.
- word: string
  
  The text content of the word.
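
Word-level timestamps make it possible to pull out what was said in a given time range. A minimal sketch, assuming a list of word objects with the three fields above; `words_in_window` is a hypothetical helper.

```python
def words_in_window(words, start, end):
    """Return the words that lie entirely within [start, end] seconds."""
    return [w["word"] for w in words if w["start"] >= start and w["end"] <= end]


# Illustrative word list, not real API output.
words = [
    {"word": "hello", "start": 0.0, "end": 0.4},
    {"word": "there", "start": 0.5, "end": 0.9},
    {"word": "world", "start": 1.2, "end": 1.6},
]
print(words_in_window(words, 0.0, 1.0))
```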

Translations

Create translation

$ openai audio:translations create

post /audio/translations

Translates audio into English.

Parameters

--file: string

The audio file object (not file name) to translate, in one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm.
--model: string or AudioModel

ID of the model to use. Only whisper-1 (which is powered by our open source Whisper V2 model) is currently available.
--prompt: optional string

An optional text to guide the model's style or continue a previous audio segment. The prompt should be in English.
--response-format: optional "json" or "text" or "srt" or 2 more

The format of the output, in one of these options: json, text, srt, verbose_json, or vtt.
--temperature: optional number

The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use log probability to automatically increase the temperature until certain thresholds are hit.

Returns

unnamed_schema_1: Translation or TranslationVerbose
- translation: object { text }
  - text: string
- translation_verbose: object { duration, language, text, segments }
  - duration: number
    
    The duration of the input audio.
  - language: string
    
    The language of the output translation (always English).
  - text: string
    
    The translated text.
  - segments: optional array of TranscriptionSegment
    
    Segments of the translated text and their corresponding details.
    - id: number
      
      Unique identifier of the segment.
    - avg_logprob: number
      
      Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.
    - compression_ratio: number
      
      Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.
    - end: number
      
      End time of the segment in seconds.
    - no_speech_prob: number
      
      Probability of no speech in the segment. If the value is higher than 1.0 and the avg_logprob is below -1, consider this segment silent.
    - seek: number
      
      Seek offset of the segment.
    - start: number
      
      Start time of the segment in seconds.
    - temperature: number
      
      Temperature parameter used for generating the segment.
    - text: string
      
      Text content of the segment.
    - tokens: array of number
      
      Array of token IDs for the text content.

Example

openai audio:translations create \
  --api-key 'My API Key' \
  --file 'Example data' \
  --model whisper-1

Response

{
  "text": "text"
}

Domain Types

Translation

translation: object { text }
- text: string

Translation Verbose

translation_verbose: object { duration, language, text, segments }
- duration: number
  
  The duration of the input audio.
- language: string
  
  The language of the output translation (always English).
- text: string
  
  The translated text.
- segments: optional array of TranscriptionSegment
  
  Segments of the translated text and their corresponding details.
  - id: number
    
    Unique identifier of the segment.
  - avg_logprob: number
    
    Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.
  - compression_ratio: number
    
    Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.
  - end: number
    
    End time of the segment in seconds.
  - no_speech_prob: number
    
    Probability of no speech in the segment. If the value is higher than 1.0 and the avg_logprob is below -1, consider this segment silent.
  - seek: number
    
    Seek offset of the segment.
  - start: number
    
    Start time of the segment in seconds.
  - temperature: number
    
    Temperature parameter used for generating the segment.
  - text: string
    
    Text content of the segment.
  - tokens: array of number
    
    Array of token IDs for the text content.

Speech

Create speech

$ openai audio:speech create

post /audio/speech

Generates audio from the input text.

Returns the audio file content, or a stream of audio events.

Parameters

--input: string

The text to generate audio for. The maximum length is 4096 characters.
--model: string or SpeechModel

One of the available TTS models: tts-1, tts-1-hd, gpt-4o-mini-tts, or gpt-4o-mini-tts-2025-12-15.
--voice: string or "alloy" or "ash" or "ballad" or 7 more or object { id }

The voice to use when generating the audio. Supported built-in voices are alloy, ash, ballad, coral, echo, fable, onyx, nova, sage, shimmer, verse, marin, and cedar. You may also provide a custom voice object with an id, for example { "id": "voice_1234" }. Previews of the voices are available in the Text to speech guide.
--instructions: optional string

Control the voice of your generated audio with additional instructions. Does not work with tts-1 or tts-1-hd.
--response-format: optional "mp3" or "opus" or "aac" or 3 more

The format of the generated audio. Supported formats are mp3, opus, aac, flac, wav, and pcm.
--speed: optional number

The speed of the generated audio. Select a value from 0.25 to 4.0. 1.0 is the default.
--stream-format: optional "sse" or "audio"

The format to stream the audio in. Supported formats are sse and audio. sse is not supported for tts-1 or tts-1-hd.
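The constraints listed for these parameters can be checked before a request is sent. The sketch below is one possible client-side check in Python, assuming only the limits stated above; `check_speech_request` is a hypothetical helper, not something the CLI provides.

```python
# Limits and model restrictions as stated in the parameter descriptions above.
LEGACY_TTS_MODELS = {"tts-1", "tts-1-hd"}
RESPONSE_FORMATS = {"mp3", "opus", "aac", "flac", "wav", "pcm"}
STREAM_FORMATS = {"sse", "audio"}
MAX_INPUT_CHARS = 4096
MIN_SPEED, MAX_SPEED = 0.25, 4.0


def check_speech_request(input_text, model, *, speed=1.0, instructions=None,
                         response_format=None, stream_format=None):
    """Return a list of human-readable problems; an empty list means no issue was found."""
    problems = []
    if len(input_text) > MAX_INPUT_CHARS:
        problems.append(f"input is {len(input_text)} characters; the limit is {MAX_INPUT_CHARS}")
    if not MIN_SPEED <= speed <= MAX_SPEED:
        problems.append(f"speed {speed} is outside {MIN_SPEED} to {MAX_SPEED}")
    if response_format is not None and response_format not in RESPONSE_FORMATS:
        problems.append(f"unsupported response format {response_format!r}")
    if stream_format is not None and stream_format not in STREAM_FORMATS:
        problems.append(f"unsupported stream format {stream_format!r}")
    if model in LEGACY_TTS_MODELS:
        if instructions is not None:
            problems.append(f"instructions do not work with {model}")
        if stream_format == "sse":
            problems.append(f"sse streaming is not supported for {model}")
    return problems
```

For example, `check_speech_request("hi", "tts-1", instructions="calm")` reports one problem, while the same call with `gpt-4o-mini-tts` reports none.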

Returns

unnamed_schema_2: file path

Example

openai audio:speech create \
  --api-key 'My API Key' \
  --input input \
  --model tts-1 \
  --voice string

Domain Types

Speech Model

speech_model: "tts-1" or "tts-1-hd" or "gpt-4o-mini-tts" or "gpt-4o-mini-tts-2025-12-15"
- "tts-1"
- "tts-1-hd"
- "gpt-4o-mini-tts"
- "gpt-4o-mini-tts-2025-12-15"

Voices

Voice Consents

cli/resources/audio/index.md +1258 −0 created

# Audio

## Domain Types

### Audio Model

- `audio_model: "whisper-1" or "gpt-4o-transcribe" or "gpt-4o-mini-transcribe" or 2 more`

  - `"whisper-1"`

  - `"gpt-4o-transcribe"`

  - `"gpt-4o-mini-transcribe"`

  - `"gpt-4o-mini-transcribe-2025-12-15"`

  - `"gpt-4o-transcribe-diarize"`

### Audio Response Format

- `audio_response_format: "json" or "text" or "srt" or 3 more`

  The format of the output, in one of these options: `json`, `text`, `srt`, `verbose_json`, `vtt`, or `diarized_json`. For `gpt-4o-transcribe` and `gpt-4o-mini-transcribe`, the only supported format is `json`. For `gpt-4o-transcribe-diarize`, the supported formats are `json`, `text`, and `diarized_json`, with `diarized_json` required to receive speaker annotations.

  - `"json"`

  - `"text"`

  - `"srt"`

  - `"verbose_json"`

  - `"vtt"`

  - `"diarized_json"`

# Transcriptions

## Create transcription

`$ openai audio:transcriptions create`

**post** `/audio/transcriptions`

Transcribes audio into the input language.

Returns a transcription object in `json`, `diarized_json`, or `verbose_json` format, or a stream of transcript events.

### Parameters

- `--file: string`

  The audio file object (not file name) to transcribe, in one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm.

- `--model: string or AudioModel`

  ID of the model to use. The options are `gpt-4o-transcribe`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `whisper-1` (which is powered by our open source Whisper V2 model), and `gpt-4o-transcribe-diarize`.

- `--chunking-strategy: optional "auto" or object { type, prefix_padding_ms, silence_duration_ms, threshold }`

  Controls how the audio is cut into chunks. When set to `"auto"`, the server first normalizes loudness and then uses voice activity detection (VAD) to choose boundaries. A `server_vad` object can be provided to tweak VAD detection parameters manually. If unset, the audio is transcribed as a single block. Required when using `gpt-4o-transcribe-diarize` for inputs longer than 30 seconds.

- `--include: optional array of TranscriptionInclude`

  Additional information to include in the transcription response. `logprobs` will return the log probabilities of the tokens in the response to understand the model's confidence in the transcription. `logprobs` only works with `response_format` set to `json` and only with the models `gpt-4o-transcribe`, `gpt-4o-mini-transcribe`, and `gpt-4o-mini-transcribe-2025-12-15`. This field is not supported when using `gpt-4o-transcribe-diarize`.

- `--known-speaker-name: optional array of string`

  Optional list of speaker names that correspond to the audio samples provided in `known_speaker_references[]`. Each entry should be a short identifier (for example `customer` or `agent`). Up to 4 speakers are supported.

- `--known-speaker-reference: optional array of string`

  Optional list of audio samples (as [data URLs](https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/Data_URLs)) that contain known speaker references matching `known_speaker_names[]`. Each sample must be between 2 and 10 seconds, and can use any of the same input audio formats supported by `file`.

- `--language: optional string`

  The language of the input audio. Supplying the input language in [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) (e.g. `en`) format will improve accuracy and latency.

- `--prompt: optional string`

  An optional text to guide the model's style or continue a previous audio segment. The [prompt](https://platform.openai.com/docs/guides/speech-to-text#prompting) should match the audio language. This field is not supported when using `gpt-4o-transcribe-diarize`.

- `--response-format: optional "json" or "text" or "srt" or 3 more`

  The format of the output, in one of these options: `json`, `text`, `srt`, `verbose_json`, `vtt`, or `diarized_json`. For `gpt-4o-transcribe` and `gpt-4o-mini-transcribe`, the only supported format is `json`. For `gpt-4o-transcribe-diarize`, the supported formats are `json`, `text`, and `diarized_json`, with `diarized_json` required to receive speaker annotations.

- `--temperature: optional number`

  The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use [log probability](https://en.wikipedia.org/wiki/Log_probability) to automatically increase the temperature until certain thresholds are hit.

- `--timestamp-granularity: optional array of "word" or "segment"`

  The timestamp granularities to populate for this transcription. `response_format` must be set to `verbose_json` to use timestamp granularities. Either or both of these options are supported: `word`, or `segment`. Note: There is no additional latency for segment timestamps, but generating word timestamps incurs additional latency. This option is not available for `gpt-4o-transcribe-diarize`.
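Several of the parameters above only work with particular models or response formats. The following Python sketch collects those stated rules into one client-side check. It assumes the caller always passes an explicit response format, since no default is given here, and `check_transcription_request` is a hypothetical helper rather than part of the CLI.

```python
# Compatibility rules as stated in the parameter descriptions above.
LOGPROB_MODELS = {
    "gpt-4o-transcribe",
    "gpt-4o-mini-transcribe",
    "gpt-4o-mini-transcribe-2025-12-15",
}
JSON_ONLY_MODELS = {"gpt-4o-transcribe", "gpt-4o-mini-transcribe"}
DIARIZE_MODEL = "gpt-4o-transcribe-diarize"
DIARIZE_FORMATS = {"json", "text", "diarized_json"}
MAX_KNOWN_SPEAKERS = 4


def check_transcription_request(model, response_format, *, include=(),
                                timestamp_granularities=(), prompt=None,
                                known_speaker_names=()):
    """Return a list of problems; an empty list means no stated rule is violated."""
    problems = []
    if model in JSON_ONLY_MODELS and response_format != "json":
        problems.append(f"{model} only supports response_format json")
    if model == DIARIZE_MODEL:
        if response_format not in DIARIZE_FORMATS:
            problems.append(f"{model} supports only json, text, and diarized_json")
        if prompt is not None:
            problems.append(f"prompt is not supported with {model}")
        if timestamp_granularities:
            problems.append(f"timestamp granularities are not available for {model}")
        if "logprobs" in include:
            problems.append(f"logprobs is not supported with {model}")
    elif "logprobs" in include:
        if model not in LOGPROB_MODELS or response_format != "json":
            problems.append("logprobs requires response_format json and a supported gpt-4o model")
    if timestamp_granularities and response_format != "verbose_json":
        problems.append("timestamp granularities require response_format verbose_json")
    if len(known_speaker_names) > MAX_KNOWN_SPEAKERS:
        problems.append(f"at most {MAX_KNOWN_SPEAKERS} known speakers are supported")
    return problems
```

The check is deliberately conservative: it only encodes restrictions written in this section, so an empty result does not guarantee the server will accept the request.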

### Returns

- `AudioTranscriptionNewResponse: Transcription or TranscriptionDiarized or TranscriptionVerbose`

  Represents a transcription response returned by the model, based on the provided input.

  - `transcription: object { text, logprobs, usage }`

    Represents a transcription response returned by the model, based on the provided input.

    - `text: string`

      The transcribed text.

    - `logprobs: optional array of object { token, bytes, logprob }`

      The log probabilities of the tokens in the transcription. Only returned with the models `gpt-4o-transcribe` and `gpt-4o-mini-transcribe` if `logprobs` is added to the `include` array.

      - `token: optional string`

        The token in the transcription.

      - `bytes: optional array of number`

        The bytes of the token.

      - `logprob: optional number`

        The log probability of the token.

    - `usage: optional object { input_tokens, output_tokens, total_tokens, 2 more } or object { seconds, type }`

      Token usage statistics for the request.

      - `tokens: object { input_tokens, output_tokens, total_tokens, 2 more }`

        Usage statistics for models billed by token usage.

        - `input_tokens: number`

          Number of input tokens billed for this request.

        - `output_tokens: number`

          Number of output tokens generated.

        - `total_tokens: number`

          Total number of tokens used (input + output).

        - `type: "tokens"`

          The type of the usage object. Always `tokens` for this variant.

        - `input_token_details: optional object { audio_tokens, text_tokens }`

          Details about the input tokens billed for this request.

          - `audio_tokens: optional number`

            Number of audio tokens billed for this request.

          - `text_tokens: optional number`

            Number of text tokens billed for this request.

      - `duration: object { seconds, type }`

        Usage statistics for models billed by audio input duration.

        - `seconds: number`

          Duration of the input audio in seconds.

        - `type: "duration"`

          The type of the usage object. Always `duration` for this variant.

  - `transcription_diarized: object { duration, segments, task, 2 more }`

    Represents a diarized transcription response returned by the model, including the combined transcript and speaker-segment annotations.

    - `duration: number`

      Duration of the input audio in seconds.

    - `segments: array of TranscriptionDiarizedSegment`

      Segments of the transcript annotated with timestamps and speaker labels.

      - `id: string`

        Unique identifier for the segment.

      - `end: number`

        End timestamp of the segment in seconds.

      - `speaker: string`

        Speaker label for this segment. When known speakers are provided, the label matches `known_speaker_names[]`. Otherwise speakers are labeled sequentially using capital letters (`A`, `B`, ...).

      - `start: number`

        Start timestamp of the segment in seconds.

      - `text: string`

        Transcript text for this segment.

      - `type: "transcript.text.segment"`

        The type of the segment. Always `transcript.text.segment`.

    - `task: "transcribe"`

      The type of task that was run. Always `transcribe`.

    - `text: string`

      The concatenated transcript text for the entire audio input.

    - `usage: optional object { input_tokens, output_tokens, total_tokens, 2 more } or object { seconds, type }`

      Token or duration usage statistics for the request.

      - `tokens: object { input_tokens, output_tokens, total_tokens, 2 more }`

        Usage statistics for models billed by token usage.

        - `input_tokens: number`

          Number of input tokens billed for this request.

        - `output_tokens: number`

          Number of output tokens generated.

        - `total_tokens: number`

          Total number of tokens used (input + output).

        - `type: "tokens"`

          The type of the usage object. Always `tokens` for this variant.

        - `input_token_details: optional object { audio_tokens, text_tokens }`

          Details about the input tokens billed for this request.

          - `audio_tokens: optional number`

            Number of audio tokens billed for this request.

          - `text_tokens: optional number`

            Number of text tokens billed for this request.

      - `duration: object { seconds, type }`

        Usage statistics for models billed by audio input duration.

        - `seconds: number`

          Duration of the input audio in seconds.

        - `type: "duration"`

          The type of the usage object. Always `duration` for this variant.

  - `transcription_verbose: object { duration, language, text, 3 more }`

    Represents a verbose JSON transcription response returned by the model, based on the provided input.

    - `duration: number`

      The duration of the input audio.

    - `language: string`

      The language of the input audio.

    - `text: string`

      The transcribed text.

    - `segments: optional array of TranscriptionSegment`

      Segments of the transcribed text and their corresponding details.

      - `id: number`

        Unique identifier of the segment.

      - `avg_logprob: number`

        Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.

      - `compression_ratio: number`

        Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.

      - `end: number`

        End time of the segment in seconds.

      - `no_speech_prob: number`

        Probability of no speech in the segment. If the value is higher than 1.0 and the `avg_logprob` is below -1, consider this segment silent.

      - `seek: number`

        Seek offset of the segment.

      - `start: number`

        Start time of the segment in seconds.

      - `temperature: number`

        Temperature parameter used for generating the segment.

      - `text: string`

        Text content of the segment.

      - `tokens: array of number`

        Array of token IDs for the text content.

    - `usage: optional object { seconds, type }`

      Usage statistics for models billed by audio input duration.

      - `seconds: number`

        Duration of the input audio in seconds.

      - `type: "duration"`

        The type of the usage object. Always `duration` for this variant.

    - `words: optional array of TranscriptionWord`

      Extracted words and their corresponding timestamps.

      - `end: number`

        End time of the word in seconds.

      - `start: number`

        Start time of the word in seconds.

      - `word: string`

        The text content of the word.

### Example

```cli
openai audio:transcriptions create \
  --api-key 'My API Key' \
  --file 'Example data' \
  --model gpt-4o-transcribe
```

#### Response

```json
{
  "text": "text",
  "logprobs": [
    {
      "token": "token",
      "bytes": [
        0
      ],
      "logprob": 0
    }
  ],
  "usage": {
    "input_tokens": 0,
    "output_tokens": 0,
    "total_tokens": 0,
    "type": "tokens",
    "input_token_details": {
      "audio_tokens": 0,
      "text_tokens": 0
    }
  }
}
```

## Domain Types

### Transcription

- `transcription: object { text, logprobs, usage }`

  Represents a transcription response returned by the model, based on the provided input.

  - `text: string`

    The transcribed text.

  - `logprobs: optional array of object { token, bytes, logprob }`

    The log probabilities of the tokens in the transcription. Only returned with the models `gpt-4o-transcribe` and `gpt-4o-mini-transcribe` if `logprobs` is added to the `include` array.

    - `token: optional string`

      The token in the transcription.

    - `bytes: optional array of number`

      The bytes of the token.

    - `logprob: optional number`

      The log probability of the token.

  - `usage: optional object { input_tokens, output_tokens, total_tokens, 2 more } or object { seconds, type }`

    Token usage statistics for the request.

    - `tokens: object { input_tokens, output_tokens, total_tokens, 2 more }`

      Usage statistics for models billed by token usage.

      - `input_tokens: number`

        Number of input tokens billed for this request.

      - `output_tokens: number`

        Number of output tokens generated.

      - `total_tokens: number`

        Total number of tokens used (input + output).

      - `type: "tokens"`

        The type of the usage object. Always `tokens` for this variant.

      - `input_token_details: optional object { audio_tokens, text_tokens }`

        Details about the input tokens billed for this request.

        - `audio_tokens: optional number`

          Number of audio tokens billed for this request.

        - `text_tokens: optional number`

          Number of text tokens billed for this request.

    - `duration: object { seconds, type }`

      Usage statistics for models billed by audio input duration.

      - `seconds: number`

        Duration of the input audio in seconds.

      - `type: "duration"`

        The type of the usage object. Always `duration` for this variant.
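Because `usage` is a union of two shapes, a consumer has to branch on its `type` field before reading anything else. A minimal Python sketch of that branching, assuming the response has been parsed into a dictionary; `describe_usage` is a hypothetical name used only for illustration.

```python
from typing import Any, Optional


def describe_usage(usage: Optional[dict[str, Any]]) -> str:
    """Summarize a usage object by switching on its `type` discriminator."""
    if usage is None:
        # `usage` is optional, so it may be absent from the response.
        return "no usage reported"
    if usage["type"] == "tokens":
        summary = (
            f"{usage['total_tokens']} tokens "
            f"({usage['input_tokens']} in, {usage['output_tokens']} out)"
        )
        # `input_token_details` and both of its fields are optional.
        details = usage.get("input_token_details") or {}
        if "audio_tokens" in details:
            summary += f", {details['audio_tokens']} audio input tokens"
        return summary
    if usage["type"] == "duration":
        return f"{usage['seconds']} seconds of audio"
    raise ValueError(f"unknown usage type: {usage['type']!r}")
```

Raising on an unrecognized `type` is a design choice for this sketch; a more lenient client might log and ignore it instead.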

### Transcription Diarized

- `transcription_diarized: object { duration, segments, task, 2 more }`

  Represents a diarized transcription response returned by the model, including the combined transcript and speaker-segment annotations.

  - `duration: number`

    Duration of the input audio in seconds.

  - `segments: array of TranscriptionDiarizedSegment`

    Segments of the transcript annotated with timestamps and speaker labels.

    - `id: string`

      Unique identifier for the segment.

    - `end: number`

      End timestamp of the segment in seconds.

    - `speaker: string`

      Speaker label for this segment. When known speakers are provided, the label matches `known_speaker_names[]`. Otherwise speakers are labeled sequentially using capital letters (`A`, `B`, ...).

    - `start: number`

      Start timestamp of the segment in seconds.

    - `text: string`

      Transcript text for this segment.

    - `type: "transcript.text.segment"`

      The type of the segment. Always `transcript.text.segment`.

  - `task: "transcribe"`

    The type of task that was run. Always `transcribe`.

  - `text: string`

    The concatenated transcript text for the entire audio input.

  - `usage: optional object { input_tokens, output_tokens, total_tokens, 2 more } or object { seconds, type }`

    Token or duration usage statistics for the request.

    - `tokens: object { input_tokens, output_tokens, total_tokens, 2 more }`

      Usage statistics for models billed by token usage.

      - `input_tokens: number`

        Number of input tokens billed for this request.

      - `output_tokens: number`

        Number of output tokens generated.

      - `total_tokens: number`

        Total number of tokens used (input + output).

      - `type: "tokens"`

        The type of the usage object. Always `tokens` for this variant.

      - `input_token_details: optional object { audio_tokens, text_tokens }`

        Details about the input tokens billed for this request.

        - `audio_tokens: optional number`

          Number of audio tokens billed for this request.

        - `text_tokens: optional number`

          Number of text tokens billed for this request.

    - `duration: object { seconds, type }`

      Usage statistics for models billed by audio input duration.

      - `seconds: number`

        Duration of the input audio in seconds.

      - `type: "duration"`

        The type of the usage object. Always `duration` for this variant.

### Transcription Diarized Segment

- `transcription_diarized_segment: object { id, end, speaker, 3 more }`

  A segment of diarized transcript text with speaker metadata.

  - `id: string`

    Unique identifier for the segment.

  - `end: number`

    End timestamp of the segment in seconds.

  - `speaker: string`

    Speaker label for this segment. When known speakers are provided, the label matches `known_speaker_names[]`. Otherwise speakers are labeled sequentially using capital letters (`A`, `B`, ...).

  - `start: number`

    Start timestamp of the segment in seconds.

  - `text: string`

    Transcript text for this segment.

  - `type: "transcript.text.segment"`

    The type of the segment. Always `transcript.text.segment`.
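Diarized segments carry a speaker label and timestamps, so consecutive segments from the same speaker can be merged into turns for display. The following Python sketch shows one way to do that, assuming segments are dictionaries in time order; `merge_turns` and `render_turns` are hypothetical helpers, and the output layout is an arbitrary choice.

```python
from typing import Any


def merge_turns(segments: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Merge adjacent segments that share a speaker into a single turn."""
    turns: list[dict[str, Any]] = []
    for seg in segments:
        text = seg["text"].strip()
        if turns and turns[-1]["speaker"] == seg["speaker"]:
            # Same speaker keeps talking: extend the current turn.
            turns[-1]["end"] = seg["end"]
            turns[-1]["text"] = f"{turns[-1]['text']} {text}"
        else:
            turns.append({
                "speaker": seg["speaker"],
                "start": seg["start"],
                "end": seg["end"],
                "text": text,
            })
    return turns


def render_turns(turns: list[dict[str, Any]]) -> list[str]:
    """Format each turn as 'SPEAKER [start-end]: text' with times in seconds."""
    return [
        f"{t['speaker']} [{t['start']:.2f}-{t['end']:.2f}]: {t['text']}"
        for t in turns
    ]


# Hypothetical sample data for illustration only.
sample_segments = [
    {"id": "seg_0", "speaker": "A", "start": 0.0, "end": 1.5, "text": " Hello.", "type": "transcript.text.segment"},
    {"id": "seg_1", "speaker": "A", "start": 1.5, "end": 3.0, "text": " How are you?", "type": "transcript.text.segment"},
    {"id": "seg_2", "speaker": "B", "start": 3.0, "end": 4.25, "text": " Fine.", "type": "transcript.text.segment"},
]
lines = render_turns(merge_turns(sample_segments))
```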

### Transcription Include

- `transcription_include: "logprobs"`

  - `"logprobs"`

### Transcription Segment

- `transcription_segment: object { id, avg_logprob, compression_ratio, 7 more }`

  - `id: number`

    Unique identifier of the segment.

  - `avg_logprob: number`

    Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.

  - `compression_ratio: number`

    Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.

  - `end: number`

    End time of the segment in seconds.

  - `no_speech_prob: number`

    Probability of no speech in the segment. If the value is higher than 1.0 and the `avg_logprob` is below -1, consider this segment silent.

  - `seek: number`

    Seek offset of the segment.

  - `start: number`

    Start time of the segment in seconds.

  - `temperature: number`

    Temperature parameter used for generating the segment.

  - `text: string`

    Text content of the segment.

  - `tokens: array of number`

    Array of token IDs for the text content.

### Transcription Stream Event

- `transcription_stream_event: TranscriptionTextSegmentEvent or TranscriptionTextDeltaEvent or TranscriptionTextDoneEvent`

  Emitted when a diarized transcription returns a completed segment with speaker information. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with `stream` set to `true` and `response_format` set to `diarized_json`.

  - `transcription_text_segment_event: object { id, end, speaker, 3 more }`

    Emitted when a diarized transcription returns a completed segment with speaker information. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with `stream` set to `true` and `response_format` set to `diarized_json`.

    - `id: string`

      Unique identifier for the segment.

    - `end: number`

      End timestamp of the segment in seconds.

    - `speaker: string`

      Speaker label for this segment.

    - `start: number`

      Start timestamp of the segment in seconds.

    - `text: string`

      Transcript text for this segment.

    - `type: "transcript.text.segment"`

      The type of the event. Always `transcript.text.segment`.

  - `transcription_text_delta_event: object { delta, type, logprobs, segment_id }`

    Emitted when there is an additional text delta. This is also the first event emitted when the transcription starts. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `Stream` parameter set to `true`.

    - `delta: string`

      The text delta that was additionally transcribed.

    - `type: "transcript.text.delta"`

      The type of the event. Always `transcript.text.delta`.

    - `logprobs: optional array of object { token, bytes, logprob }`

      The log probabilities of the delta. Only included if you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `include[]` parameter set to `logprobs`.

      - `token: optional string`

        The token that was used to generate the log probability.

      - `bytes: optional array of number`

        The bytes that were used to generate the log probability.

      - `logprob: optional number`

        The log probability of the token.

    - `segment_id: optional string`

      Identifier of the diarized segment that this delta belongs to. Only present when using `gpt-4o-transcribe-diarize`.

  - `transcription_text_done_event: object { text, type, logprobs, usage }`

    Emitted when the transcription is complete. Contains the complete transcription text. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `Stream` parameter set to `true`.

    - `text: string`

      The text that was transcribed.

    - `type: "transcript.text.done"`

      The type of the event. Always `transcript.text.done`.

    - `logprobs: optional array of object { token, bytes, logprob }`

      The log probabilities of the individual tokens in the transcription. Only included if you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `include[]` parameter set to `logprobs`.

      - `token: optional string`

        The token that was used to generate the log probability.

      - `bytes: optional array of number`

        The bytes that were used to generate the log probability.

      - `logprob: optional number`

        The log probability of the token.

    - `usage: optional object { input_tokens, output_tokens, total_tokens, 2 more }`

      Usage statistics for models billed by token usage.

      - `input_tokens: number`

        Number of input tokens billed for this request.

      - `output_tokens: number`

        Number of output tokens generated.

      - `total_tokens: number`

        Total number of tokens used (input + output).

      - `type: "tokens"`

        The type of the usage object. Always `tokens` for this variant.

      - `input_token_details: optional object { audio_tokens, text_tokens }`

        Details about the input tokens billed for this request.

        - `audio_tokens: optional number`

          Number of audio tokens billed for this request.

        - `text_tokens: optional number`

          Number of text tokens billed for this request.
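The three event variants are distinguished by their `type` value, so a stream consumer can dispatch on it. The Python sketch below assumes the events have already been decoded into dictionaries (how the stream is read and parsed is out of scope here); `collect_transcript` is a hypothetical helper.

```python
from typing import Any, Iterable


def collect_transcript(events: Iterable[dict[str, Any]]) -> dict[str, Any]:
    """Fold a sequence of transcription stream events into one result."""
    deltas: list[str] = []
    segments: list[dict[str, Any]] = []
    final_text = None
    for event in events:
        kind = event["type"]
        if kind == "transcript.text.delta":
            # Incremental text; concatenated only if no done event arrives.
            deltas.append(event["delta"])
        elif kind == "transcript.text.segment":
            # Completed diarized segment with speaker information.
            segments.append(event)
        elif kind == "transcript.text.done":
            # The done event carries the complete transcription text.
            final_text = event["text"]
    return {
        "text": final_text if final_text is not None else "".join(deltas),
        "segments": segments,
        "complete": final_text is not None,
    }


# Hypothetical sample events for illustration only.
sample_events = [
    {"type": "transcript.text.delta", "delta": "Hello"},
    {"type": "transcript.text.delta", "delta": " world"},
    {"type": "transcript.text.done", "text": "Hello world"},
]
result = collect_transcript(sample_events)
```

Falling back to the joined deltas when no done event was seen is a choice made for this sketch, so that a truncated stream still yields partial text with `complete` set to `False`.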

### Transcription Text Delta Event

- `transcription_text_delta_event: object { delta, type, logprobs, segment_id }`

  Emitted when there is an additional text delta. This is also the first event emitted when the transcription starts. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `Stream` parameter set to `true`.

  - `delta: string`

    The text delta that was additionally transcribed.

  - `type: "transcript.text.delta"`

    The type of the event. Always `transcript.text.delta`.

  - `logprobs: optional array of object { token, bytes, logprob }`

    The log probabilities of the delta. Only included if you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `include[]` parameter set to `logprobs`.

    - `token: optional string`

      The token that was used to generate the log probability.

    - `bytes: optional array of number`

      The bytes that were used to generate the log probability.

    - `logprob: optional number`

      The log probability of the token.

  - `segment_id: optional string`

    Identifier of the diarized segment that this delta belongs to. Only present when using `gpt-4o-transcribe-diarize`.

### Transcription Text Done Event

- `transcription_text_done_event: object { text, type, logprobs, usage }`

  Emitted when the transcription is complete. Contains the complete transcription text. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `Stream` parameter set to `true`.

  - `text: string`

    The text that was transcribed.

  - `type: "transcript.text.done"`

    The type of the event. Always `transcript.text.done`.

  - `logprobs: optional array of object { token, bytes, logprob }`

    The log probabilities of the individual tokens in the transcription. Only included if you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `include[]` parameter set to `logprobs`.

    - `token: optional string`

      The token that was used to generate the log probability.

    - `bytes: optional array of number`

      The bytes that were used to generate the log probability.

    - `logprob: optional number`

      The log probability of the token.

  - `usage: optional object { input_tokens, output_tokens, total_tokens, 2 more }`

    Usage statistics for models billed by token usage.

    - `input_tokens: number`

      Number of input tokens billed for this request.

    - `output_tokens: number`

      Number of output tokens generated.

    - `total_tokens: number`

      Total number of tokens used (input + output).

    - `type: "tokens"`

      The type of the usage object. Always `tokens` for this variant.

    - `input_token_details: optional object { audio_tokens, text_tokens }`

      Details about the input tokens billed for this request.

      - `audio_tokens: optional number`

        Number of audio tokens billed for this request.

      - `text_tokens: optional number`

        Number of text tokens billed for this request.

### Transcription Text Segment Event

- `transcription_text_segment_event: object { id, end, speaker, 3 more }`

  Emitted when a diarized transcription returns a completed segment with speaker information. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with `stream` set to `true` and `response_format` set to `diarized_json`.

  - `id: string`

    Unique identifier for the segment.

  - `end: number`

    End timestamp of the segment in seconds.

  - `speaker: string`

    Speaker label for this segment.

  - `start: number`

    Start timestamp of the segment in seconds.

  - `text: string`

    Transcript text for this segment.

  - `type: "transcript.text.segment"`

    The type of the event. Always `transcript.text.segment`.

### Transcription Verbose

- `transcription_verbose: object { duration, language, text, 3 more }`

  Represents a verbose JSON transcription response returned by the model, based on the provided input.

  - `duration: number`

    The duration of the input audio.

  - `language: string`

    The language of the input audio.

  - `text: string`

    The transcribed text.

  - `segments: optional array of TranscriptionSegment`

    Segments of the transcribed text and their corresponding details.

    - `id: number`

      Unique identifier of the segment.

    - `avg_logprob: number`

      Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.

    - `compression_ratio: number`

      Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.

    - `end: number`

      End time of the segment in seconds.

    - `no_speech_prob: number`

      Probability of no speech in the segment. If the value is higher than 1.0 and the `avg_logprob` is below -1, consider this segment silent.

    - `seek: number`

      Seek offset of the segment.

    - `start: number`

      Start time of the segment in seconds.

    - `temperature: number`

      Temperature parameter used for generating the segment.

    - `text: string`

      Text content of the segment.

    - `tokens: array of number`

      Array of token IDs for the text content.

  - `usage: optional object { seconds, type }`

    Usage statistics for models billed by audio input duration.

    - `seconds: number`

      Duration of the input audio in seconds.

    - `type: "duration"`

      The type of the usage object. Always `duration` for this variant.

  - `words: optional array of TranscriptionWord`

    Extracted words and their corresponding timestamps.

    - `end: number`

      End time of the word in seconds.

    - `start: number`

      Start time of the word in seconds.

    - `word: string`

      The text content of the word.

### Transcription Word

- `transcription_word: object { end, start, word }`

  - `end: number`

    End time of the word in seconds.

  - `start: number`

    Start time of the word in seconds.

  - `word: string`

    The text content of the word.
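
Because each word carries `start` and `end` times in seconds, the gaps between consecutive words can be computed directly. The sketch below assumes the words are available as plain dictionaries with the three documented keys; the `find_pauses` function and its `min_gap` default are hypothetical.

```python
def find_pauses(words, min_gap=0.5):
    """Return (previous_word, next_word, gap_seconds) for gaps of at least min_gap."""
    pauses = []
    # Pair each word with its successor and measure the silence between them.
    for prev, nxt in zip(words, words[1:]):
        gap = nxt["start"] - prev["end"]
        if gap >= min_gap:
            pauses.append((prev["word"], nxt["word"], round(gap, 3)))
    return pauses
```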

# Translations

## Create translation

`$ openai audio:translations create`

**post** `/audio/translations`

Translates audio into English.

### Parameters

- `--file: string`

  The audio file object (not file name) to translate, in one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm.

- `--model: string or AudioModel`

  ID of the model to use. Only `whisper-1` (which is powered by our open source Whisper V2 model) is currently available.

- `--prompt: optional string`

  An optional text to guide the model's style or continue a previous audio segment. The [prompt](https://platform.openai.com/docs/guides/speech-to-text#prompting) should be in English.

- `--response-format: optional "json" or "text" or "srt" or 2 more`

  The format of the output, in one of these options: `json`, `text`, `srt`, `verbose_json`, or `vtt`.

- `--temperature: optional number`

  The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use [log probability](https://en.wikipedia.org/wiki/Log_probability) to automatically increase the temperature until certain thresholds are hit.
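
The constraints listed for these parameters can be checked before a request is sent. This is a rough sketch rather than anything the CLI provides: `validate_translation_request` and `ALLOWED_FORMATS` are hypothetical names, and `None` stands for an option that was not passed.

```python
ALLOWED_FORMATS = {"json", "text", "srt", "verbose_json", "vtt"}


def validate_translation_request(model, response_format=None, temperature=None):
    """Return a list of problems; an empty list means the arguments look valid."""
    problems = []
    if model != "whisper-1":
        problems.append("model must be whisper-1")
    if response_format is not None and response_format not in ALLOWED_FORMATS:
        problems.append(f"unsupported response_format: {response_format}")
    if temperature is not None and not 0 <= temperature <= 1:
        problems.append("temperature must be between 0 and 1")
    return problems
```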

### Returns

- `unnamed_schema_1: Translation or TranslationVerbose`

  - `translation: object { text }`

    - `text: string`

  - `translation_verbose: object { duration, language, text, segments }`

    - `duration: number`

      The duration of the input audio.

    - `language: string`

      The language of the output translation (always `english`).

    - `text: string`

      The translated text.

    - `segments: optional array of TranscriptionSegment`

      Segments of the translated text and their corresponding details.

      - `id: number`

        Unique identifier of the segment.

      - `avg_logprob: number`

        Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.

      - `compression_ratio: number`

        Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.

      - `end: number`

        End time of the segment in seconds.

      - `no_speech_prob: number`

        Probability of no speech in the segment. If the value is higher than 1.0 and the `avg_logprob` is below -1, consider this segment silent.

      - `seek: number`

        Seek offset of the segment.

      - `start: number`

        Start time of the segment in seconds.

      - `temperature: number`

        Temperature parameter used for generating the segment.

      - `text: string`

        Text content of the segment.

      - `tokens: array of number`

        Array of token IDs for the text content.

### Example

```cli
openai audio:translations create \
  --api-key 'My API Key' \
  --file 'Example data' \
  --model whisper-1
```

#### Response

```json
{
  "text": "text"
}
```
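
Since the endpoint returns either a `Translation` or a `TranslationVerbose` object, a client that reads the JSON output may want to tell the two apart. The sketch below is one possible approach under stated assumptions: the output is a JSON body, the dataclasses and `parse_translation` are hypothetical names, and distinguishing the shapes by the presence of `duration` and `language` is a convention of this example rather than documented behavior.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Translation:
    text: str


@dataclass
class TranslationVerbose:
    duration: float
    language: str
    text: str
    segments: list = field(default_factory=list)


def parse_translation(body):
    """Parse a JSON response body into one of the two documented shapes."""
    data = json.loads(body)
    # Assumption: only the verbose shape carries duration and language.
    if "duration" in data and "language" in data:
        return TranslationVerbose(
            duration=data["duration"],
            language=data["language"],
            text=data["text"],
            segments=data.get("segments", []),
        )
    return Translation(text=data["text"])
```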

## Domain Types

### Translation

- `translation: object { text }`

  - `text: string`

### Translation Verbose

- `translation_verbose: object { duration, language, text, segments }`

  - `duration: number`

    The duration of the input audio.

  - `language: string`

    The language of the output translation (always `english`).

  - `text: string`

    The translated text.

  - `segments: optional array of TranscriptionSegment`

    Segments of the translated text and their corresponding details.

    - `id: number`

      Unique identifier of the segment.

    - `avg_logprob: number`

      Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.

    - `compression_ratio: number`

      Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.

    - `end: number`

      End time of the segment in seconds.

    - `no_speech_prob: number`

      Probability of no speech in the segment. If the value is higher than 1.0 and the `avg_logprob` is below -1, consider this segment silent.

    - `seek: number`

      Seek offset of the segment.

    - `start: number`

      Start time of the segment in seconds.

    - `temperature: number`

      Temperature parameter used for generating the segment.

    - `text: string`

      Text content of the segment.

    - `tokens: array of number`

      Array of token IDs for the text content.
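
The `start`, `end`, and `text` fields of each segment are enough to build subtitle cues locally from a verbose response. The following sketch illustrates that idea; `srt_timestamp` and `segments_to_srt` are hypothetical helpers, segments are assumed to be plain dictionaries, and the result is not claimed to match what the service produces for the `srt` response format.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Render segments (dicts with start, end, text) as numbered SRT cues."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    # Cues are separated by a blank line.
    return "\n".join(cues)
```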

# Speech

## Create speech

`$ openai audio:speech create`

**post** `/audio/speech`

Generates audio from the input text.

Returns the audio file content, or a stream of audio events.

### Parameters

- `--input: string`

  The text to generate audio for. The maximum length is 4096 characters.

- `--model: string or SpeechModel`

  One of the available [TTS models](https://platform.openai.com/docs/models#tts): `tts-1`, `tts-1-hd`, `gpt-4o-mini-tts`, or `gpt-4o-mini-tts-2025-12-15`.

- `--voice: string or "alloy" or "ash" or "ballad" or 7 more or object { id }`

  The voice to use when generating the audio. Supported built-in voices are `alloy`, `ash`, `ballad`, `coral`, `echo`, `fable`, `onyx`, `nova`, `sage`, `shimmer`, `verse`, `marin`, and `cedar`. You may also provide a custom voice object with an `id`, for example `{ "id": "voice_1234" }`. Previews of the voices are available in the [Text to speech guide](https://platform.openai.com/docs/guides/text-to-speech#voice-options).

- `--instructions: optional string`

  Control the voice of your generated audio with additional instructions. Does not work with `tts-1` or `tts-1-hd`.

- `--response-format: optional "mp3" or "opus" or "aac" or 3 more`

  The format of the generated audio. Supported formats are `mp3`, `opus`, `aac`, `flac`, `wav`, and `pcm`.

- `--speed: optional number`

  The speed of the generated audio. Select a value from `0.25` to `4.0`. `1.0` is the default.

- `--stream-format: optional "sse" or "audio"`

  The format to stream the audio in. Supported formats are `sse` and `audio`. `sse` is not supported for `tts-1` or `tts-1-hd`.
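
Several of these parameters carry limits or model restrictions that can be checked before calling the endpoint. The sketch below collects them in one place; `validate_speech_request` and the constant names are hypothetical, `None` stands for an option that was not passed, and the voice is not checked because custom voices are also accepted.

```python
LEGACY_MODELS = {"tts-1", "tts-1-hd"}
AUDIO_FORMATS = {"mp3", "opus", "aac", "flac", "wav", "pcm"}


def validate_speech_request(text, model, response_format=None, speed=None,
                            instructions=None, stream_format=None):
    """Return a list of problems; an empty list means the arguments look valid."""
    problems = []
    if len(text) > 4096:
        problems.append("input exceeds 4096 characters")
    if response_format is not None and response_format not in AUDIO_FORMATS:
        problems.append(f"unsupported response_format: {response_format}")
    if speed is not None and not 0.25 <= speed <= 4.0:
        problems.append("speed must be between 0.25 and 4.0")
    # instructions and sse streaming are documented as unavailable on tts-1 and tts-1-hd.
    if model in LEGACY_MODELS:
        if instructions is not None:
            problems.append(f"instructions does not work with {model}")
        if stream_format == "sse":
            problems.append(f"sse streaming is not supported for {model}")
    return problems
```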

### Returns

- `unnamed_schema_2: file path`

### Example

```cli
openai audio:speech create \
  --api-key 'My API Key' \
  --input input \
  --model tts-1 \
  --voice string
```
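
Text longer than the 4096-character limit on `--input` has to be split across several requests. The following is one possible way to do that, assuming a split at spaces is acceptable; `split_for_speech` is a hypothetical helper, and it falls back to a hard cut when a run of text contains no space.

```python
def split_for_speech(text, limit=4096):
    """Split text into pieces of at most `limit` characters, preferring spaces."""
    chunks = []
    remaining = text.strip()
    while len(remaining) > limit:
        # Look for the last space that keeps the chunk within the limit.
        cut = remaining.rfind(" ", 0, limit + 1)
        if cut <= 0:
            cut = limit  # no usable space: hard split at the limit
        chunks.append(remaining[:cut].rstrip())
        remaining = remaining[cut:].lstrip()
    if remaining:
        chunks.append(remaining)
    return chunks
```

Each returned piece can then be passed as the `--input` of a separate request.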

## Domain Types

### Speech Model

- `speech_model: "tts-1" or "tts-1-hd" or "gpt-4o-mini-tts" or "gpt-4o-mini-tts-2025-12-15"`

  - `"tts-1"`

  - `"tts-1-hd"`

  - `"gpt-4o-mini-tts"`

  - `"gpt-4o-mini-tts-2025-12-15"`

# Voices

# Voice Consents

cli/resources/audio/index.md 2026-05-05 23:00 UTC to 2026-05-07 21:57 UTC
