cli/resources/audio/index.md 2026-05-05 23:00 UTC to 2026-05-07 21:57 UTC

Audio

Domain Types

Audio Model

  • audio_model: "whisper-1" or "gpt-4o-transcribe" or "gpt-4o-mini-transcribe" or 2 more

    • "whisper-1"

    • "gpt-4o-transcribe"

    • "gpt-4o-mini-transcribe"

    • "gpt-4o-mini-transcribe-2025-12-15"

    • "gpt-4o-transcribe-diarize"

Audio Response Format

  • audio_response_format: "json" or "text" or "srt" or 3 more

    The format of the output, in one of these options: json, text, srt, verbose_json, vtt, or diarized_json. For gpt-4o-transcribe and gpt-4o-mini-transcribe, the only supported format is json. For gpt-4o-transcribe-diarize, the supported formats are json, text, and diarized_json, with diarized_json required to receive speaker annotations.

    • "json"

    • "text"

    • "srt"

    • "verbose_json"

    • "vtt"

    • "diarized_json"
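
The model and format rules above can be written down as a small lookup table. This is only a sketch: `SUPPORTED_FORMATS` and `is_supported` are hypothetical names, and two entries are assumptions rather than statements from this page (that whisper-1 accepts every non-diarized format, and that the dated snapshot gpt-4o-mini-transcribe-2025-12-15 follows gpt-4o-mini-transcribe).

```python
# Compatibility table derived from the description above.
# Assumptions (not stated on this page): whisper-1 accepts all non-diarized
# formats, and the dated mini snapshot behaves like gpt-4o-mini-transcribe.
SUPPORTED_FORMATS = {
    "whisper-1": {"json", "text", "srt", "verbose_json", "vtt"},
    "gpt-4o-transcribe": {"json"},
    "gpt-4o-mini-transcribe": {"json"},
    "gpt-4o-mini-transcribe-2025-12-15": {"json"},
    "gpt-4o-transcribe-diarize": {"json", "text", "diarized_json"},
}


def is_supported(model: str, response_format: str) -> bool:
    """Return True when the table allows this model/format pair."""
    # Unknown models map to an empty set, so they are never reported as supported.
    return response_format in SUPPORTED_FORMATS.get(model, set())
```

A client could call `is_supported(model, fmt)` before sending a request and fall back to json when the pair is rejected.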

Transcriptions

Create transcription

$ openai audio:transcriptions create

post /audio/transcriptions

Transcribes audio into the input language.

Returns a transcription object in json, diarized_json, or verbose_json format, or a stream of transcript events.

Parameters

  • --file: string

    The audio file object (not file name) to transcribe, in one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm.

  • --model: string or AudioModel

    ID of the model to use. The options are gpt-4o-transcribe, gpt-4o-mini-transcribe, gpt-4o-mini-transcribe-2025-12-15, whisper-1 (which is powered by our open source Whisper V2 model), and gpt-4o-transcribe-diarize.

  • --chunking-strategy: optional "auto" or object { type, prefix_padding_ms, silence_duration_ms, threshold }

    Controls how the audio is cut into chunks. When set to "auto", the server first normalizes loudness and then uses voice activity detection (VAD) to choose boundaries. A server_vad object can be provided to tweak VAD detection parameters manually. If unset, the audio is transcribed as a single block. Required when using gpt-4o-transcribe-diarize for inputs longer than 30 seconds.

  • --include: optional array of TranscriptionInclude

    Additional information to include in the transcription response. logprobs will return the log probabilities of the tokens in the response to understand the model's confidence in the transcription. logprobs only works with response_format set to json and only with the models gpt-4o-transcribe, gpt-4o-mini-transcribe, and gpt-4o-mini-transcribe-2025-12-15. This field is not supported when using gpt-4o-transcribe-diarize.

  • --known-speaker-name: optional array of string

    Optional list of speaker names that correspond to the audio samples provided in known_speaker_references[]. Each entry should be a short identifier (for example customer or agent). Up to 4 speakers are supported.

  • --known-speaker-reference: optional array of string

    Optional list of audio samples (as data URLs) that contain known speaker references matching known_speaker_names[]. Each sample must be between 2 and 10 seconds, and can use any of the same input audio formats supported by file.

  • --language: optional string

    The language of the input audio. Supplying the input language in ISO-639-1 (e.g. en) format will improve accuracy and latency.

  • --prompt: optional string

    An optional text to guide the model's style or continue a previous audio segment. The prompt should match the audio language. This field is not supported when using gpt-4o-transcribe-diarize.

  • --response-format: optional "json" or "text" or "srt" or 3 more

    The format of the output, in one of these options: json, text, srt, verbose_json, vtt, or diarized_json. For gpt-4o-transcribe and gpt-4o-mini-transcribe, the only supported format is json. For gpt-4o-transcribe-diarize, the supported formats are json, text, and diarized_json, with diarized_json required to receive speaker annotations.

  • --temperature: optional number

    The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use log probability to automatically increase the temperature until certain thresholds are hit.

  • --timestamp-granularity: optional array of "word" or "segment"

    The timestamp granularities to populate for this transcription. response_format must be set to verbose_json to use timestamp granularities. Either or both of these options are supported: word, or segment. Note: There is no additional latency for segment timestamps, but generating word timestamps incurs additional latency. This option is not available for gpt-4o-transcribe-diarize.
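
Several of the parameters above constrain one another, especially with gpt-4o-transcribe-diarize. The sketch below collects those constraints in one client-side check. `check_options` and its argument names are hypothetical, the audio length is assumed to be known to the caller, and the server remains the authority on what it accepts.

```python
DIARIZE_MODEL = "gpt-4o-transcribe-diarize"


def check_options(model, response_format, *, audio_seconds=None,
                  chunking_strategy=None, include=(), prompt=None,
                  timestamp_granularities=(), known_speaker_names=()):
    """Return a list of conflicts found; an empty list means none were detected."""
    problems = []
    # Timestamp granularities are only populated for verbose_json output.
    if timestamp_granularities and response_format != "verbose_json":
        problems.append("timestamp granularities require verbose_json")
    # The parameter description caps known speakers at 4.
    if len(known_speaker_names) > 4:
        problems.append("at most 4 known speakers are supported")
    if model == DIARIZE_MODEL:
        if include:
            problems.append("include is not supported with the diarize model")
        if prompt is not None:
            problems.append("prompt is not supported with the diarize model")
        if timestamp_granularities:
            problems.append("timestamp granularities are not available with the diarize model")
        # Chunking is required for diarized inputs longer than 30 seconds.
        if (audio_seconds is not None and audio_seconds > 30
                and chunking_strategy is None):
            problems.append("chunking_strategy is required for inputs over 30 seconds")
    return problems
```

Returning a list rather than raising on the first conflict lets a caller report every problem at once.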

Returns

  • AudioTranscriptionNewResponse: Transcription or TranscriptionDiarized or TranscriptionVerbose

    Represents a transcription response returned by the model, based on the provided input.

    • transcription: object { text, logprobs, usage }

      Represents a transcription response returned by the model, based on the provided input.

      • text: string

        The transcribed text.

      • logprobs: optional array of object { token, bytes, logprob }

        The log probabilities of the tokens in the transcription. Only returned with the models gpt-4o-transcribe and gpt-4o-mini-transcribe if logprobs is added to the include array.

        • token: optional string

          The token in the transcription.

        • bytes: optional array of number

          The bytes of the token.

        • logprob: optional number

          The log probability of the token.

      • usage: optional object { input_tokens, output_tokens, total_tokens, 2 more } or object { seconds, type }

        Token usage statistics for the request.

        • tokens: object { input_tokens, output_tokens, total_tokens, 2 more }

          Usage statistics for models billed by token usage.

          • input_tokens: number

            Number of input tokens billed for this request.

          • output_tokens: number

            Number of output tokens generated.

          • total_tokens: number

            Total number of tokens used (input + output).

          • type: "tokens"

            The type of the usage object. Always tokens for this variant.

          • input_token_details: optional object { audio_tokens, text_tokens }

            Details about the input tokens billed for this request.

            • audio_tokens: optional number

              Number of audio tokens billed for this request.

            • text_tokens: optional number

              Number of text tokens billed for this request.

        • duration: object { seconds, type }

          Usage statistics for models billed by audio input duration.

          • seconds: number

            Duration of the input audio in seconds.

          • type: "duration"

            The type of the usage object. Always duration for this variant.

    • transcription_diarized: object { duration, segments, task, 2 more }

      Represents a diarized transcription response returned by the model, including the combined transcript and speaker-segment annotations.

      • duration: number

        Duration of the input audio in seconds.

      • segments: array of TranscriptionDiarizedSegment

        Segments of the transcript annotated with timestamps and speaker labels.

        • id: string

          Unique identifier for the segment.

        • end: number

          End timestamp of the segment in seconds.

        • speaker: string

          Speaker label for this segment. When known speakers are provided, the label matches known_speaker_names[]. Otherwise speakers are labeled sequentially using capital letters (A, B, ...).

        • start: number

          Start timestamp of the segment in seconds.

        • text: string

          Transcript text for this segment.

        • type: "transcript.text.segment"

          The type of the segment. Always transcript.text.segment.

      • task: "transcribe"

        The type of task that was run. Always transcribe.

      • text: string

        The concatenated transcript text for the entire audio input.

      • usage: optional object { input_tokens, output_tokens, total_tokens, 2 more } or object { seconds, type }

        Token or duration usage statistics for the request.

        • tokens: object { input_tokens, output_tokens, total_tokens, 2 more }

          Usage statistics for models billed by token usage.

          • input_tokens: number

            Number of input tokens billed for this request.

          • output_tokens: number

            Number of output tokens generated.

          • total_tokens: number

            Total number of tokens used (input + output).

          • type: "tokens"

            The type of the usage object. Always tokens for this variant.

          • input_token_details: optional object { audio_tokens, text_tokens }

            Details about the input tokens billed for this request.

            • audio_tokens: optional number

              Number of audio tokens billed for this request.

            • text_tokens: optional number

              Number of text tokens billed for this request.

        • duration: object { seconds, type }

          Usage statistics for models billed by audio input duration.

          • seconds: number

            Duration of the input audio in seconds.

          • type: "duration"

            The type of the usage object. Always duration for this variant.

    • transcription_verbose: object { duration, language, text, 3 more }

      Represents a verbose json transcription response returned by the model, based on the provided input.

      • duration: number

        The duration of the input audio.

      • language: string

        The language of the input audio.

      • text: string

        The transcribed text.

      • segments: optional array of TranscriptionSegment

        Segments of the transcribed text and their corresponding details.

        • id: number

          Unique identifier of the segment.

        • avg_logprob: number

          Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.

        • compression_ratio: number

          Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.

        • end: number

          End time of the segment in seconds.

        • no_speech_prob: number

          Probability of no speech in the segment. If the value is higher than 1.0 and the avg_logprob is below -1, consider this segment silent.

        • seek: number

          Seek offset of the segment.

        • start: number

          Start time of the segment in seconds.

        • temperature: number

          Temperature parameter used for generating the segment.

        • text: string

          Text content of the segment.

        • tokens: array of number

          Array of token IDs for the text content.

      • usage: optional object { seconds, type }

        Usage statistics for models billed by audio input duration.

        • seconds: number

          Duration of the input audio in seconds.

        • type: "duration"

          The type of the usage object. Always duration for this variant.

      • words: optional array of TranscriptionWord

        Extracted words and their corresponding timestamps.

        • end: number

          End time of the word in seconds.

        • start: number

          Start time of the word in seconds.

        • word: string

          The text content of the word.

Example

openai audio:transcriptions create \
  --api-key 'My API Key' \
  --file 'Example data' \
  --model gpt-4o-transcribe

Response

{
  "text": "text",
  "logprobs": [
    {
      "token": "token",
      "bytes": [
        0
      ],
      "logprob": 0
    }
  ],
  "usage": {
    "input_tokens": 0,
    "output_tokens": 0,
    "total_tokens": 0,
    "type": "tokens",
    "input_token_details": {
      "audio_tokens": 0,
      "text_tokens": 0
    }
  }
}

Domain Types

Transcription

  • transcription: object { text, logprobs, usage }

    Represents a transcription response returned by the model, based on the provided input.

    • text: string

      The transcribed text.

    • logprobs: optional array of object { token, bytes, logprob }

      The log probabilities of the tokens in the transcription. Only returned with the models gpt-4o-transcribe and gpt-4o-mini-transcribe if logprobs is added to the include array.

      • token: optional string

        The token in the transcription.

      • bytes: optional array of number

        The bytes of the token.

      • logprob: optional number

        The log probability of the token.

    • usage: optional object { input_tokens, output_tokens, total_tokens, 2 more } or object { seconds, type }

      Token usage statistics for the request.

      • tokens: object { input_tokens, output_tokens, total_tokens, 2 more }

        Usage statistics for models billed by token usage.

        • input_tokens: number

          Number of input tokens billed for this request.

        • output_tokens: number

          Number of output tokens generated.

        • total_tokens: number

          Total number of tokens used (input + output).

        • type: "tokens"

          The type of the usage object. Always tokens for this variant.

        • input_token_details: optional object { audio_tokens, text_tokens }

          Details about the input tokens billed for this request.

          • audio_tokens: optional number

            Number of audio tokens billed for this request.

          • text_tokens: optional number

            Number of text tokens billed for this request.

      • duration: object { seconds, type }

        Usage statistics for models billed by audio input duration.

        • seconds: number

          Duration of the input audio in seconds.

        • type: "duration"

          The type of the usage object. Always duration for this variant.
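
The usage field is a union of two shapes distinguished by its type value. A minimal sketch of branching on that discriminator, assuming the response has already been parsed into a Python dict (`describe_usage` is a hypothetical helper, not part of the CLI):

```python
def describe_usage(usage):
    """Summarize a usage object, which is either token-based or duration-based."""
    if usage is None:
        # usage is optional on the transcription object.
        return "no usage reported"
    kind = usage.get("type")
    if kind == "tokens":
        # input_token_details and its fields are optional.
        details = usage.get("input_token_details") or {}
        audio = details.get("audio_tokens")
        suffix = f" ({audio} audio)" if audio is not None else ""
        return (f"{usage['input_tokens']} input{suffix} + "
                f"{usage['output_tokens']} output = "
                f"{usage['total_tokens']} tokens")
    if kind == "duration":
        return f"{usage['seconds']} seconds of audio"
    raise ValueError(f"unknown usage type: {kind!r}")
```

Raising on an unrecognized type is a design choice for this sketch; a tolerant client might prefer to log and continue.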

Transcription Diarized

  • transcription_diarized: object { duration, segments, task, 2 more }

    Represents a diarized transcription response returned by the model, including the combined transcript and speaker-segment annotations.

    • duration: number

      Duration of the input audio in seconds.

    • segments: array of TranscriptionDiarizedSegment

      Segments of the transcript annotated with timestamps and speaker labels.

      • id: string

        Unique identifier for the segment.

      • end: number

        End timestamp of the segment in seconds.

      • speaker: string

        Speaker label for this segment. When known speakers are provided, the label matches known_speaker_names[]. Otherwise speakers are labeled sequentially using capital letters (A, B, ...).

      • start: number

        Start timestamp of the segment in seconds.

      • text: string

        Transcript text for this segment.

      • type: "transcript.text.segment"

        The type of the segment. Always transcript.text.segment.

    • task: "transcribe"

      The type of task that was run. Always transcribe.

    • text: string

      The concatenated transcript text for the entire audio input.

    • usage: optional object { input_tokens, output_tokens, total_tokens, 2 more } or object { seconds, type }

      Token or duration usage statistics for the request.

      • tokens: object { input_tokens, output_tokens, total_tokens, 2 more }

        Usage statistics for models billed by token usage.

        • input_tokens: number

          Number of input tokens billed for this request.

        • output_tokens: number

          Number of output tokens generated.

        • total_tokens: number

          Total number of tokens used (input + output).

        • type: "tokens"

          The type of the usage object. Always tokens for this variant.

        • input_token_details: optional object { audio_tokens, text_tokens }

          Details about the input tokens billed for this request.

          • audio_tokens: optional number

            Number of audio tokens billed for this request.

          • text_tokens: optional number

            Number of text tokens billed for this request.

      • duration: object { seconds, type }

        Usage statistics for models billed by audio input duration.

        • seconds: number

          Duration of the input audio in seconds.

        • type: "duration"

          The type of the usage object. Always duration for this variant.

Transcription Diarized Segment

  • transcription_diarized_segment: object { id, end, speaker, 3 more }

    A segment of diarized transcript text with speaker metadata.

    • id: string

      Unique identifier for the segment.

    • end: number

      End timestamp of the segment in seconds.

    • speaker: string

      Speaker label for this segment. When known speakers are provided, the label matches known_speaker_names[]. Otherwise speakers are labeled sequentially using capital letters (A, B, ...).

    • start: number

      Start timestamp of the segment in seconds.

    • text: string

      Transcript text for this segment.

    • type: "transcript.text.segment"

      The type of the segment. Always transcript.text.segment.
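
To illustrate how the segment fields fit together, the sketch below turns a list of diarized segments into speaker-labelled dialogue lines. It assumes the segments are plain dicts with the fields listed above; `to_dialogue` is a hypothetical helper.

```python
def to_dialogue(segments):
    """Merge consecutive segments from the same speaker into one line each."""
    turns = []  # list of (speaker, [text parts]) in chronological order
    for seg in sorted(segments, key=lambda s: s["start"]):
        text = seg["text"].strip()
        if turns and turns[-1][0] == seg["speaker"]:
            # Same speaker keeps talking: extend the current turn.
            turns[-1][1].append(text)
        else:
            turns.append((seg["speaker"], [text]))
    return [f"{speaker}: {' '.join(parts)}" for speaker, parts in turns]
```

Sorting by start makes the result independent of the order in which segments arrive.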

Transcription Include

  • transcription_include: "logprobs"

    • "logprobs"

Transcription Segment

  • transcription_segment: object { id, avg_logprob, compression_ratio, 7 more }

    • id: number

      Unique identifier of the segment.

    • avg_logprob: number

      Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.

    • compression_ratio: number

      Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.

    • end: number

      End time of the segment in seconds.

    • no_speech_prob: number

      Probability of no speech in the segment. If the value is higher than 1.0 and the avg_logprob is below -1, consider this segment silent.

    • seek: number

      Seek offset of the segment.

    • start: number

      Start time of the segment in seconds.

    • temperature: number

      Temperature parameter used for generating the segment.

    • text: string

      Text content of the segment.

    • tokens: array of number

      Array of token IDs for the text content.
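
The avg_logprob and compression_ratio thresholds quoted above can be applied as a simple quality filter. This is a sketch with a hypothetical helper name, `flag_segment`; the default thresholds come from the field descriptions, and the no_speech_prob rule is deliberately left out.

```python
def flag_segment(segment, min_avg_logprob=-1.0, max_compression_ratio=2.4):
    """Return the quality flags raised by one TranscriptionSegment-shaped dict."""
    flags = []
    # Below -1, the description says to consider the logprobs failed.
    if segment["avg_logprob"] < min_avg_logprob:
        flags.append("low_logprob")
    # Above 2.4, the description says to consider the compression failed.
    if segment["compression_ratio"] > max_compression_ratio:
        flags.append("high_compression")
    return flags
```

A caller might drop or re-transcribe any segment for which the returned list is non-empty.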

Transcription Stream Event

  • transcription_stream_event: TranscriptionTextSegmentEvent or TranscriptionTextDeltaEvent or TranscriptionTextDoneEvent

    Emitted when a diarized transcription returns a completed segment with speaker information. Only emitted when you create a transcription with stream set to true and response_format set to diarized_json.

    • transcription_text_segment_event: object { id, end, speaker, 3 more }

      Emitted when a diarized transcription returns a completed segment with speaker information. Only emitted when you create a transcription with stream set to true and response_format set to diarized_json.

      • id: string

        Unique identifier for the segment.

      • end: number

        End timestamp of the segment in seconds.

      • speaker: string

        Speaker label for this segment.

      • start: number

        Start timestamp of the segment in seconds.

      • text: string

        Transcript text for this segment.

      • type: "transcript.text.segment"

        The type of the event. Always transcript.text.segment.

    • transcription_text_delta_event: object { delta, type, logprobs, segment_id }

      Emitted when there is an additional text delta. This is also the first event emitted when the transcription starts. Only emitted when you create a transcription with the Stream parameter set to true.

      • delta: string

        The text delta that was additionally transcribed.

      • type: "transcript.text.delta"

        The type of the event. Always transcript.text.delta.

      • logprobs: optional array of object { token, bytes, logprob }

        The log probabilities of the delta. Only included if you create a transcription with the include[] parameter set to logprobs.

        • token: optional string

          The token that was used to generate the log probability.

        • bytes: optional array of number

          The bytes that were used to generate the log probability.

        • logprob: optional number

          The log probability of the token.

      • segment_id: optional string

        Identifier of the diarized segment that this delta belongs to. Only present when using gpt-4o-transcribe-diarize.

    • transcription_text_done_event: object { text, type, logprobs, usage }

      Emitted when the transcription is complete. Contains the complete transcription text. Only emitted when you create a transcription with the Stream parameter set to true.

      • text: string

        The text that was transcribed.

      • type: "transcript.text.done"

        The type of the event. Always transcript.text.done.

      • logprobs: optional array of object { token, bytes, logprob }

        The log probabilities of the individual tokens in the transcription. Only included if you create a transcription with the include[] parameter set to logprobs.

        • token: optional string

          The token that was used to generate the log probability.

        • bytes: optional array of number

          The bytes that were used to generate the log probability.

        • logprob: optional number

          The log probability of the token.

      • usage: optional object { input_tokens, output_tokens, total_tokens, 2 more }

        Usage statistics for models billed by token usage.

        • input_tokens: number

          Number of input tokens billed for this request.

        • output_tokens: number

          Number of output tokens generated.

        • total_tokens: number

          Total number of tokens used (input + output).

        • type: "tokens"

          The type of the usage object. Always tokens for this variant.

        • input_token_details: optional object { audio_tokens, text_tokens }

          Details about the input tokens billed for this request.

          • audio_tokens: optional number

            Number of audio tokens billed for this request.

          • text_tokens: optional number

            Number of text tokens billed for this request.
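
A consumer of the stream can dispatch on each event's type field. The sketch below assumes the events have already been decoded into dicts (how they are read off the wire is out of scope here), and `collect_stream` is a hypothetical name.

```python
def collect_stream(events):
    """Fold a sequence of stream events into the text and any diarized segments."""
    deltas, segments, final_text = [], [], None
    for event in events:
        kind = event["type"]
        if kind == "transcript.text.delta":
            deltas.append(event["delta"])
        elif kind == "transcript.text.segment":
            segments.append(event)
        elif kind == "transcript.text.done":
            # The done event carries the complete transcription text.
            final_text = event["text"]
            break
    # If the stream ended without a done event, fall back to the joined deltas.
    text = final_text if final_text is not None else "".join(deltas)
    return {"text": text, "segments": segments}
```

Preferring the text of the done event over the joined deltas is a choice made for this sketch, since the done event is described as containing the complete transcription.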

Transcription Text Delta Event

  • transcription_text_delta_event: object { delta, type, logprobs, segment_id }

    Emitted when there is an additional text delta. This is also the first event emitted when the transcription starts. Only emitted when you create a transcription with the Stream parameter set to true.

    • delta: string

      The text delta that was additionally transcribed.

    • type: "transcript.text.delta"

      The type of the event. Always transcript.text.delta.

    • logprobs: optional array of object { token, bytes, logprob }

      The log probabilities of the delta. Only included if you create a transcription with the include[] parameter set to logprobs.

      • token: optional string

        The token that was used to generate the log probability.

      • bytes: optional array of number

        The bytes that were used to generate the log probability.

      • logprob: optional number

        The log probability of the token.

    • segment_id: optional string

      Identifier of the diarized segment that this delta belongs to. Only present when using gpt-4o-transcribe-diarize.

Transcription Text Done Event

  • transcription_text_done_event: object { text, type, logprobs, usage }

    Emitted when the transcription is complete. Contains the complete transcription text. Only emitted when you create a transcription with the Stream parameter set to true.

    • text: string

      The text that was transcribed.

    • type: "transcript.text.done"

      The type of the event. Always transcript.text.done.

    • logprobs: optional array of object { token, bytes, logprob }

      The log probabilities of the individual tokens in the transcription. Only included if you create a transcription with the include[] parameter set to logprobs.

      • token: optional string

        The token that was used to generate the log probability.

      • bytes: optional array of number

        The bytes that were used to generate the log probability.

      • logprob: optional number

        The log probability of the token.

    • usage: optional object { input_tokens, output_tokens, total_tokens, 2 more }

      Usage statistics for models billed by token usage.

      • input_tokens: number

        Number of input tokens billed for this request.

      • output_tokens: number

        Number of output tokens generated.

      • total_tokens: number

        Total number of tokens used (input + output).

      • type: "tokens"

        The type of the usage object. Always tokens for this variant.

      • input_token_details: optional object { audio_tokens, text_tokens }

        Details about the input tokens billed for this request.

        • audio_tokens: optional number

          Number of audio tokens billed for this request.

        • text_tokens: optional number

          Number of text tokens billed for this request.

Transcription Text Segment Event

  • transcription_text_segment_event: object { id, end, speaker, 3 more }

    Emitted when a diarized transcription returns a completed segment with speaker information. Only emitted when you create a transcription with stream set to true and response_format set to diarized_json.

    • id: string

      Unique identifier for the segment.

    • end: number

      End timestamp of the segment in seconds.

    • speaker: string

      Speaker label for this segment.

    • start: number

      Start timestamp of the segment in seconds.

    • text: string

      Transcript text for this segment.

    • type: "transcript.text.segment"

      The type of the event. Always transcript.text.segment.

Transcription Verbose

  • transcription_verbose: object { duration, language, text, 3 more }

    Represents a verbose json transcription response returned by the model, based on the provided input.

    • duration: number

      The duration of the input audio.

    • language: string

      The language of the input audio.

    • text: string

      The transcribed text.

    • segments: optional array of TranscriptionSegment

      Segments of the transcribed text and their corresponding details.

      • id: number

        Unique identifier of the segment.

      • avg_logprob: number

        Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.

      • compression_ratio: number

        Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.

      • end: number

        End time of the segment in seconds.

      • no_speech_prob: number

        Probability of no speech in the segment. If the value is higher than 1.0 and the avg_logprob is below -1, consider this segment silent.

      • seek: number

        Seek offset of the segment.

      • start: number

        Start time of the segment in seconds.

      • temperature: number

        Temperature parameter used for generating the segment.

      • text: string

        Text content of the segment.

      • tokens: array of number

        Array of token IDs for the text content.

    • usage: optional object { seconds, type }

      Usage statistics for models billed by audio input duration.

      • seconds: number

        Duration of the input audio in seconds.

      • type: "duration"

        The type of the usage object. Always duration for this variant.

    • words: optional array of TranscriptionWord

      Extracted words and their corresponding timestamps.

      • end: number

        End time of the word in seconds.

      • start: number

        Start time of the word in seconds.

      • word: string

        The text content of the word.

Transcription Word

  • transcription_word: object { end, start, word }

    • end: number

      End time of the word in seconds.

    • start: number

      Start time of the word in seconds.

    • word: string

      The text content of the word.
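
Word-level timestamps are enough to build subtitle cues on the client. The sketch below groups words into SubRip-style cues; the grouping size `per_cue` and both function names are illustrative assumptions, and the output is not claimed to match what the server returns for response_format srt.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words, per_cue=3):
    """Group TranscriptionWord-shaped dicts into numbered subtitle cues."""
    cues = []
    for i in range(0, len(words), per_cue):
        group = words[i:i + per_cue]
        # A cue spans from its first word's start to its last word's end.
        start = srt_timestamp(group[0]["start"])
        end = srt_timestamp(group[-1]["end"])
        text = " ".join(w["word"] for w in group)
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}")
    return "\n\n".join(cues) + ("\n" if cues else "")
```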

Translations

Create translation

$ openai audio:translations create

post /audio/translations

Translates audio into English.

Parameters

  • --file: string

    The audio file object (not file name) to translate, in one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm.

  • --model: string or AudioModel

    ID of the model to use. Only whisper-1 (which is powered by our open source Whisper V2 model) is currently available.

  • --prompt: optional string

    An optional text to guide the model's style or continue a previous audio segment. The prompt should be in English.

  • --response-format: optional "json" or "text" or "srt" or 2 more

    The format of the output, in one of these options: json, text, srt, verbose_json, or vtt.

  • --temperature: optional number

    The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use log probability to automatically increase the temperature until certain thresholds are hit.

Returns

  • unnamed_schema_1: Translation or TranslationVerbose

    • translation: object { text }

      • text: string

    • translation_verbose: object { duration, language, text, segments }

      • duration: number

        The duration of the input audio.

      • language: string

        The language of the output translation (always English).

      • text: string

        The translated text.

      • segments: optional array of TranscriptionSegment

        Segments of the translated text and their corresponding details.

        • id: number

          Unique identifier of the segment.

        • avg_logprob: number

          Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.

        • compression_ratio: number

          Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.

        • end: number

          End time of the segment in seconds.

        • no_speech_prob: number

          Probability of no speech in the segment. If the value is higher than 1.0 and the avg_logprob is below -1, consider this segment silent.

        • seek: number

          Seek offset of the segment.

        • start: number

          Start time of the segment in seconds.

        • temperature: number

          Temperature parameter used for generating the segment.

        • text: string

          Text content of the segment.

        • tokens: array of number

          Array of token IDs for the text content.

Example

openai audio:translations create \
  --api-key 'My API Key' \
  --file 'Example data' \
  --model whisper-1

Response

{
  "text": "text"
}

Domain Types

Translation

  • translation: object { text }

    • text: string

Translation Verbose

  • translation_verbose: object { duration, language, text, segments }

    • duration: number

      The duration of the input audio.

    • language: string

      The language of the output translation (always English).

    • text: string

      The translated text.

    • segments: optional array of TranscriptionSegment

      Segments of the translated text and their corresponding details.

      • id: number

        Unique identifier of the segment.

      • avg_logprob: number

        Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.

      • compression_ratio: number

        Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.

      • end: number

        End time of the segment in seconds.

      • no_speech_prob: number

        Probability of no speech in the segment. If the value is higher than 1.0 and the avg_logprob is below -1, consider this segment silent.

      • seek: number

        Seek offset of the segment.

      • start: number

        Start time of the segment in seconds.

      • temperature: number

        Temperature parameter used for generating the segment.

      • text: string

        Text content of the segment.

      • tokens: array of number

        Array of token IDs for the text content.

Speech

Create speech

$ openai audio:speech create

post /audio/speech

Generates audio from the input text.

Returns the audio file content, or a stream of audio events.

Parameters

  • --input: string

    The text to generate audio for. The maximum length is 4096 characters.

  • --model: string or SpeechModel

    One of the available TTS models: tts-1, tts-1-hd, gpt-4o-mini-tts, or gpt-4o-mini-tts-2025-12-15.

  • --voice: string or "alloy" or "ash" or "ballad" or 7 more or object { id }

    The voice to use when generating the audio. Supported built-in voices are alloy, ash, ballad, coral, echo, fable, onyx, nova, sage, shimmer, verse, marin, and cedar. You may also provide a custom voice object with an id, for example { "id": "voice_1234" }. Previews of the voices are available in the Text to speech guide.

  • --instructions: optional string

    Control the voice of your generated audio with additional instructions. Does not work with tts-1 or tts-1-hd.

  • --response-format: optional "mp3" or "opus" or "aac" or 3 more

    The format of the generated audio. Supported formats are mp3, opus, aac, flac, wav, and pcm.

  • --speed: optional number

    The speed of the generated audio. Select a value from 0.25 to 4.0. 1.0 is the default.

  • --stream-format: optional "sse" or "audio"

    The format to stream the audio in. Supported formats are sse and audio. sse is not supported for tts-1 or tts-1-hd.
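
The limits in the parameter descriptions above lend themselves to a quick pre-flight check. This is a sketch only: `check_speech_options` is a hypothetical helper, and it assumes the 4096-character limit can be approximated with Python's `len`, which counts Unicode code points.

```python
# Models for which instructions and sse streaming are documented as unsupported.
TTS1_MODELS = {"tts-1", "tts-1-hd"}


def check_speech_options(text, model, *, instructions=None, speed=1.0,
                         stream_format=None):
    """Return a list of conflicts found; an empty list means none were detected."""
    problems = []
    if len(text) > 4096:
        problems.append("input exceeds 4096 characters")
    if not 0.25 <= speed <= 4.0:
        problems.append("speed must be between 0.25 and 4.0")
    if model in TTS1_MODELS:
        if instructions is not None:
            problems.append("instructions do not work with tts-1 or tts-1-hd")
        if stream_format == "sse":
            problems.append("sse is not supported for tts-1 or tts-1-hd")
    return problems
```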

Returns

  • unnamed_schema_2: file path

Example

openai audio:speech create \
  --api-key 'My API Key' \
  --input input \
  --model tts-1 \
  --voice string

Domain Types

Speech Model

  • speech_model: "tts-1" or "tts-1-hd" or "gpt-4o-mini-tts" or "gpt-4o-mini-tts-2025-12-15"

    • "tts-1"

    • "tts-1-hd"

    • "gpt-4o-mini-tts"

    • "gpt-4o-mini-tts-2025-12-15"

Voices

Voice Consents