cli/resources/audio/index.md 2026-05-18 22:01 UTC to 2026-05-19 06:34 UTC

0 added, 1258 removed.

This document has no rendered page for this history range.

cli/resources/audio/index.md +0 −1258 deleted

~~1# Audio~~

~~3## Domain Types~~

~~5### Audio Model~~

~~7- `audio_model: "whisper-1" or "gpt-4o-transcribe" or "gpt-4o-mini-transcribe" or 2 more`~~

~~9 - `"whisper-1"`~~

~~11 - `"gpt-4o-transcribe"`~~

~~13 - `"gpt-4o-mini-transcribe"`~~

~~15 - `"gpt-4o-mini-transcribe-2025-12-15"`~~

~~17 - `"gpt-4o-transcribe-diarize"`~~

~~19### Audio Response Format~~

~~21- `audio_response_format: "json" or "text" or "srt" or 3 more`~~

23 The format of the output, in one of these options: `json`, `text`, `srt`, `verbose_json`, `vtt`, or `diarized_json`. For `gpt-4o-transcribe` and `gpt-4o-mini-transcribe`, the only supported format is `json`. For `gpt-4o-transcribe-diarize`, the supported formats are `json`, `text`, and `diarized_json`, with `diarized_json` required to receive speaker annotations.

~~25 - `"json"`~~

~~27 - `"text"`~~

~~29 - `"srt"`~~

~~31 - `"verbose_json"`~~

~~33 - `"vtt"`~~

~~35 - `"diarized_json"`~~
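
The compatibility rules in the response-format description can be captured in a small lookup table. The following is a minimal sketch, not part of the CLI: the names `FORMATS_BY_MODEL` and `is_format_supported` are hypothetical, and models for which the text states no rule are left out rather than guessed.

```python
from typing import Optional

# Rules copied from the response-format description; models that the text
# does not mention are deliberately absent instead of being guessed.
FORMATS_BY_MODEL = {
    "gpt-4o-transcribe": {"json"},
    "gpt-4o-mini-transcribe": {"json"},
    "gpt-4o-transcribe-diarize": {"json", "text", "diarized_json"},
}


def is_format_supported(model: str, response_format: str) -> Optional[bool]:
    """Return True or False when a rule is stated, None when it is not."""
    allowed = FORMATS_BY_MODEL.get(model)
    if allowed is None:
        return None
    return response_format in allowed
```

Returning `None` for unlisted models keeps "unsupported" distinct from "not specified here", so a caller can decide whether to let the server make the final call.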

~~37# Transcriptions~~

~~39## Create transcription~~

~~41`$ openai audio:transcriptions create`~~

~~43**post** `/audio/transcriptions`~~

Transcribes audio into the input language.

Returns a transcription object in `json`, `diarized_json`, or `verbose_json` format, or a stream of transcript events.

~~50### Parameters~~

~~52- `--file: string`~~

~~54 The audio file object (not file name) to transcribe, in one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm.~~

~~56- `--model: string or AudioModel`~~

58 ID of the model to use. The options are `gpt-4o-transcribe`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `whisper-1` (which is powered by our open source Whisper V2 model), and `gpt-4o-transcribe-diarize`.

~~60- `--chunking-strategy: optional "auto" or object { type, prefix_padding_ms, silence_duration_ms, threshold }`~~

  Controls how the audio is cut into chunks. When set to `"auto"`, the server first normalizes loudness and then uses voice activity detection (VAD) to choose boundaries. A `server_vad` object can be provided instead to tune the VAD parameters (`prefix_padding_ms`, `silence_duration_ms`, `threshold`) manually. If unset, the audio is transcribed as a single block. Required when using `gpt-4o-transcribe-diarize` for inputs longer than 30 seconds.

~~64- `--include: optional array of TranscriptionInclude`~~

  Additional information to include in the transcription response. `logprobs` returns the log probabilities of the tokens in the response, which indicate the model's confidence in the transcription. `logprobs` only works with `response_format` set to `json` and only with the models `gpt-4o-transcribe`, `gpt-4o-mini-transcribe`, and `gpt-4o-mini-transcribe-2025-12-15`. This field is not supported when using `gpt-4o-transcribe-diarize`.

~~72- `--known-speaker-name: optional array of string`~~

74 Optional list of speaker names that correspond to the audio samples provided in `known_speaker_references[]`. Each entry should be a short identifier (for example `customer` or `agent`). Up to 4 speakers are supported.

~~76- `--known-speaker-reference: optional array of string`~~

78 Optional list of audio samples (as [data URLs](https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/Data_URLs)) that contain known speaker references matching `known_speaker_names[]`. Each sample must be between 2 and 10 seconds, and can use any of the same input audio formats supported by `file`.

~~80- `--language: optional string`~~

~~82 The language of the input audio. Supplying the input language in [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) (e.g. `en`) format will improve accuracy and latency.~~

~~84- `--prompt: optional string`~~

86 An optional text to guide the model's style or continue a previous audio segment. The [prompt](https://platform.openai.com/docs/guides/speech-to-text#prompting) should match the audio language. This field is not supported when using `gpt-4o-transcribe-diarize`.

~~88- `--response-format: optional "json" or "text" or "srt" or 3 more`~~

90 The format of the output, in one of these options: `json`, `text`, `srt`, `verbose_json`, `vtt`, or `diarized_json`. For `gpt-4o-transcribe` and `gpt-4o-mini-transcribe`, the only supported format is `json`. For `gpt-4o-transcribe-diarize`, the supported formats are `json`, `text`, and `diarized_json`, with `diarized_json` required to receive speaker annotations.

~~92- `--temperature: optional number`~~

94 The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use [log probability](https://en.wikipedia.org/wiki/Log_probability) to automatically increase the temperature until certain thresholds are hit.

~~96- `--timestamp-granularity: optional array of "word" or "segment"`~~

  The timestamp granularities to populate for this transcription. `response_format` must be set to `verbose_json` to use timestamp granularities. Either or both of these options are supported: `word` or `segment`. Note: segment timestamps add no latency, but generating word timestamps incurs additional latency.

~~99 This option is not available for `gpt-4o-transcribe-diarize`.~~
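
Several of the parameters above constrain one another. As a rough illustration, the sketch below collects those documented rules into one client-side check. The function `check_request` and its argument names are hypothetical, the one-to-one pairing of speaker names and references is an assumption, and the audio length must be supplied by the caller because the sketch does not inspect the file.

```python
from typing import List, Optional

DIARIZE_MODEL = "gpt-4o-transcribe-diarize"
LOGPROB_MODELS = {
    "gpt-4o-transcribe",
    "gpt-4o-mini-transcribe",
    "gpt-4o-mini-transcribe-2025-12-15",
}


def check_request(
    model: str,
    response_format: str,
    include: Optional[List[str]] = None,
    prompt: Optional[str] = None,
    timestamp_granularities: Optional[List[str]] = None,
    known_speaker_names: Optional[List[str]] = None,
    known_speaker_references: Optional[List[str]] = None,
    chunking_strategy: Optional[object] = None,
    audio_seconds: Optional[float] = None,
) -> List[str]:
    """Return a list of violated rules; an empty list means none were found."""
    problems: List[str] = []
    diarize = model == DIARIZE_MODEL

    if "logprobs" in (include or []):
        if model not in LOGPROB_MODELS or response_format != "json":
            problems.append("logprobs requires response_format=json and a supported model")
    if timestamp_granularities:
        if diarize:
            problems.append("timestamp granularities are not available for the diarize model")
        elif response_format != "verbose_json":
            problems.append("timestamp granularities require response_format=verbose_json")
    if prompt is not None and diarize:
        problems.append("prompt is not supported by the diarize model")

    names = known_speaker_names or []
    references = known_speaker_references or []
    if len(names) > 4:
        problems.append("at most 4 known speakers are supported")
    # Assumption: names and reference samples are paired one-to-one.
    if len(names) != len(references):
        problems.append("known speaker names and references differ in length")

    # The audio length comes from the caller; this sketch does not read the file.
    if diarize and chunking_strategy is None and audio_seconds is not None and audio_seconds > 30:
        problems.append("chunking_strategy is required for diarize inputs over 30 seconds")
    return problems
```

A check like this only mirrors the text of this reference; the server remains the authority on what a request may contain.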

### Returns

103- `AudioTranscriptionNewResponse: Transcription or TranscriptionDiarized or TranscriptionVerbose`

~~104~~

  Represents a transcription response returned by the model, based on the provided input.

  - `transcription: object { text, logprobs, usage }`

    Represents a transcription response returned by the model, based on the provided input.

~~110~~

111 - `text: string`

~~112~~

113 The transcribed text.

~~114~~

115 - `logprobs: optional array of object { token, bytes, logprob }`

~~116~~

117 The log probabilities of the tokens in the transcription. Only returned with the models `gpt-4o-transcribe` and `gpt-4o-mini-transcribe` if `logprobs` is added to the `include` array.

~~118~~

119 - `token: optional string`

~~120~~

121 The token in the transcription.

~~122~~

123 - `bytes: optional array of number`

~~124~~

125 The bytes of the token.

~~126~~

127 - `logprob: optional number`

~~128~~

129 The log probability of the token.

~~130~~

131 - `usage: optional object { input_tokens, output_tokens, total_tokens, 2 more } or object { seconds, type }`

~~132~~

      Token or duration usage statistics for the request.

~~134~~

135 - `tokens: object { input_tokens, output_tokens, total_tokens, 2 more }`

~~136~~

137 Usage statistics for models billed by token usage.

~~138~~

139 - `input_tokens: number`

~~140~~

141 Number of input tokens billed for this request.

~~142~~

143 - `output_tokens: number`

~~144~~

145 Number of output tokens generated.

~~146~~

147 - `total_tokens: number`

~~148~~

149 Total number of tokens used (input + output).

~~150~~

151 - `type: "tokens"`

~~152~~

153 The type of the usage object. Always `tokens` for this variant.

~~154~~

155 - `input_token_details: optional object { audio_tokens, text_tokens }`

~~156~~

157 Details about the input tokens billed for this request.

~~158~~

159 - `audio_tokens: optional number`

~~160~~

161 Number of audio tokens billed for this request.

~~162~~

163 - `text_tokens: optional number`

~~164~~

165 Number of text tokens billed for this request.

~~166~~

167 - `duration: object { seconds, type }`

~~168~~

169 Usage statistics for models billed by audio input duration.

~~170~~

171 - `seconds: number`

~~172~~

173 Duration of the input audio in seconds.

~~174~~

175 - `type: "duration"`

~~176~~

177 The type of the usage object. Always `duration` for this variant.

~~178~~

179 - `transcription_diarized: object { duration, segments, task, 2 more }`

~~180~~

181 Represents a diarized transcription response returned by the model, including the combined transcript and speaker-segment annotations.

~~182~~

183 - `duration: number`

~~184~~

185 Duration of the input audio in seconds.

~~186~~

187 - `segments: array of TranscriptionDiarizedSegment`

~~188~~

189 Segments of the transcript annotated with timestamps and speaker labels.

~~190~~

191 - `id: string`

~~192~~

193 Unique identifier for the segment.

~~194~~

195 - `end: number`

~~196~~

197 End timestamp of the segment in seconds.

~~198~~

199 - `speaker: string`

~~200~~

201 Speaker label for this segment. When known speakers are provided, the label matches `known_speaker_names[]`. Otherwise speakers are labeled sequentially using capital letters (`A`, `B`, ...).

~~202~~

203 - `start: number`

~~204~~

205 Start timestamp of the segment in seconds.

~~206~~

207 - `text: string`

~~208~~

209 Transcript text for this segment.

~~210~~

211 - `type: "transcript.text.segment"`

~~212~~

213 The type of the segment. Always `transcript.text.segment`.

~~214~~

215 - `task: "transcribe"`

~~216~~

217 The type of task that was run. Always `transcribe`.

~~218~~

219 - `text: string`

~~220~~

221 The concatenated transcript text for the entire audio input.

~~222~~

223 - `usage: optional object { input_tokens, output_tokens, total_tokens, 2 more } or object { seconds, type }`

~~224~~

225 Token or duration usage statistics for the request.

~~226~~

227 - `tokens: object { input_tokens, output_tokens, total_tokens, 2 more }`

~~228~~

229 Usage statistics for models billed by token usage.

~~230~~

231 - `input_tokens: number`

~~232~~

233 Number of input tokens billed for this request.

~~234~~

235 - `output_tokens: number`

~~236~~

237 Number of output tokens generated.

~~238~~

239 - `total_tokens: number`

~~240~~

241 Total number of tokens used (input + output).

~~242~~

243 - `type: "tokens"`

~~244~~

245 The type of the usage object. Always `tokens` for this variant.

~~246~~

247 - `input_token_details: optional object { audio_tokens, text_tokens }`

~~248~~

249 Details about the input tokens billed for this request.

~~250~~

251 - `audio_tokens: optional number`

~~252~~

253 Number of audio tokens billed for this request.

~~254~~

255 - `text_tokens: optional number`

~~256~~

257 Number of text tokens billed for this request.

~~258~~

259 - `duration: object { seconds, type }`

~~260~~

261 Usage statistics for models billed by audio input duration.

~~262~~

263 - `seconds: number`

~~264~~

265 Duration of the input audio in seconds.

~~266~~

267 - `type: "duration"`

~~268~~

269 The type of the usage object. Always `duration` for this variant.

~~270~~

271 - `transcription_verbose: object { duration, language, text, 3 more }`

~~272~~

    Represents a verbose JSON transcription response returned by the model, based on the provided input.

~~274~~

275 - `duration: number`

~~276~~

277 The duration of the input audio.

~~278~~

279 - `language: string`

~~280~~

281 The language of the input audio.

~~282~~

283 - `text: string`

~~284~~

285 The transcribed text.

~~286~~

287 - `segments: optional array of TranscriptionSegment`

~~288~~

289 Segments of the transcribed text and their corresponding details.

~~290~~

291 - `id: number`

~~292~~

293 Unique identifier of the segment.

~~294~~

295 - `avg_logprob: number`

~~296~~

297 Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.

~~298~~

299 - `compression_ratio: number`

~~300~~

301 Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.

~~302~~

303 - `end: number`

~~304~~

305 End time of the segment in seconds.

~~306~~

307 - `no_speech_prob: number`

~~308~~

309 Probability of no speech in the segment. If the value is higher than 1.0 and the `avg_logprob` is below -1, consider this segment silent.

~~310~~

311 - `seek: number`

~~312~~

313 Seek offset of the segment.

~~314~~

315 - `start: number`

~~316~~

317 Start time of the segment in seconds.

~~318~~

319 - `temperature: number`

~~320~~

321 Temperature parameter used for generating the segment.

~~322~~

323 - `text: string`

~~324~~

325 Text content of the segment.

~~326~~

327 - `tokens: array of number`

~~328~~

329 Array of token IDs for the text content.

~~330~~

331 - `usage: optional object { seconds, type }`

~~332~~

333 Usage statistics for models billed by audio input duration.

~~334~~

335 - `seconds: number`

~~336~~

337 Duration of the input audio in seconds.

~~338~~

339 - `type: "duration"`

~~340~~

341 The type of the usage object. Always `duration` for this variant.

~~342~~

343 - `words: optional array of TranscriptionWord`

~~344~~

345 Extracted words and their corresponding timestamps.

~~346~~

347 - `end: number`

~~348~~

349 End time of the word in seconds.

~~350~~

351 - `start: number`

~~352~~

353 Start time of the word in seconds.

~~354~~

355 - `word: string`

~~356~~

357 The text content of the word.

### Example

```cli
openai audio:transcriptions create \
  --api-key 'My API Key' \
  --file 'Example data' \
  --model gpt-4o-transcribe
```

#### Response

```json
{
  "text": "text",
  "logprobs": [
    {
      "token": "token",
      "bytes": [
        0
      ],
      "logprob": 0
    }
  ],
  "usage": {
    "input_tokens": 0,
    "output_tokens": 0,
    "total_tokens": 0,
    "type": "tokens",
    "input_token_details": {
      "audio_tokens": 0,
      "text_tokens": 0
    }
  }
}
```

## Domain Types

### Transcription

399- `transcription: object { text, logprobs, usage }`

~~400~~

  Represents a transcription response returned by the model, based on the provided input.

~~402~~

403 - `text: string`

~~404~~

405 The transcribed text.

~~406~~

407 - `logprobs: optional array of object { token, bytes, logprob }`

~~408~~

409 The log probabilities of the tokens in the transcription. Only returned with the models `gpt-4o-transcribe` and `gpt-4o-mini-transcribe` if `logprobs` is added to the `include` array.

~~410~~

411 - `token: optional string`

~~412~~

413 The token in the transcription.

~~414~~

415 - `bytes: optional array of number`

~~416~~

417 The bytes of the token.

~~418~~

419 - `logprob: optional number`

~~420~~

421 The log probability of the token.

~~422~~

423 - `usage: optional object { input_tokens, output_tokens, total_tokens, 2 more } or object { seconds, type }`

~~424~~

    Token or duration usage statistics for the request.

~~426~~

427 - `tokens: object { input_tokens, output_tokens, total_tokens, 2 more }`

~~428~~

429 Usage statistics for models billed by token usage.

~~430~~

431 - `input_tokens: number`

~~432~~

433 Number of input tokens billed for this request.

~~434~~

435 - `output_tokens: number`

~~436~~

437 Number of output tokens generated.

~~438~~

439 - `total_tokens: number`

~~440~~

441 Total number of tokens used (input + output).

~~442~~

443 - `type: "tokens"`

~~444~~

445 The type of the usage object. Always `tokens` for this variant.

~~446~~

447 - `input_token_details: optional object { audio_tokens, text_tokens }`

~~448~~

449 Details about the input tokens billed for this request.

~~450~~

451 - `audio_tokens: optional number`

~~452~~

453 Number of audio tokens billed for this request.

~~454~~

455 - `text_tokens: optional number`

~~456~~

457 Number of text tokens billed for this request.

~~458~~

459 - `duration: object { seconds, type }`

~~460~~

461 Usage statistics for models billed by audio input duration.

~~462~~

463 - `seconds: number`

~~464~~

465 Duration of the input audio in seconds.

~~466~~

467 - `type: "duration"`

~~468~~

469 The type of the usage object. Always `duration` for this variant.
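
Because `usage` is a union of two shapes, a consumer has to branch on the `type` field before reading anything else. The sketch below shows one way to do that on an already-parsed response; `describe_usage` is a hypothetical helper, and the sample values in the usage note are invented for illustration.

```python
from typing import Any, Mapping


def describe_usage(usage: Mapping[str, Any]) -> str:
    """Summarize either usage variant, discriminating on the `type` field."""
    kind = usage.get("type")
    if kind == "tokens":
        # input_token_details and its fields are optional, so default to 0.
        details = usage.get("input_token_details") or {}
        audio = details.get("audio_tokens", 0)
        text = details.get("text_tokens", 0)
        return (
            f"{usage['total_tokens']} tokens "
            f"({usage['input_tokens']} in: {audio} audio + {text} text, "
            f"{usage['output_tokens']} out)"
        )
    if kind == "duration":
        return f"{usage['seconds']} s of audio"
    raise ValueError(f"unknown usage type: {kind!r}")
```

For example, `describe_usage({"type": "duration", "seconds": 12.5})` returns `"12.5 s of audio"`.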

### Transcription Diarized

473- `transcription_diarized: object { duration, segments, task, 2 more }`

~~474~~

475 Represents a diarized transcription response returned by the model, including the combined transcript and speaker-segment annotations.

~~476~~

477 - `duration: number`

~~478~~

479 Duration of the input audio in seconds.

~~480~~

481 - `segments: array of TranscriptionDiarizedSegment`

~~482~~

483 Segments of the transcript annotated with timestamps and speaker labels.

~~484~~

485 - `id: string`

~~486~~

487 Unique identifier for the segment.

~~488~~

489 - `end: number`

~~490~~

491 End timestamp of the segment in seconds.

~~492~~

493 - `speaker: string`

~~494~~

495 Speaker label for this segment. When known speakers are provided, the label matches `known_speaker_names[]`. Otherwise speakers are labeled sequentially using capital letters (`A`, `B`, ...).

~~496~~

497 - `start: number`

~~498~~

499 Start timestamp of the segment in seconds.

~~500~~

501 - `text: string`

~~502~~

503 Transcript text for this segment.

~~504~~

505 - `type: "transcript.text.segment"`

~~506~~

507 The type of the segment. Always `transcript.text.segment`.

~~508~~

509 - `task: "transcribe"`

~~510~~

511 The type of task that was run. Always `transcribe`.

~~512~~

513 - `text: string`

~~514~~

515 The concatenated transcript text for the entire audio input.

~~516~~

517 - `usage: optional object { input_tokens, output_tokens, total_tokens, 2 more } or object { seconds, type }`

~~518~~

519 Token or duration usage statistics for the request.

~~520~~

521 - `tokens: object { input_tokens, output_tokens, total_tokens, 2 more }`

~~522~~

523 Usage statistics for models billed by token usage.

~~524~~

525 - `input_tokens: number`

~~526~~

527 Number of input tokens billed for this request.

~~528~~

529 - `output_tokens: number`

~~530~~

531 Number of output tokens generated.

~~532~~

533 - `total_tokens: number`

~~534~~

535 Total number of tokens used (input + output).

~~536~~

537 - `type: "tokens"`

~~538~~

539 The type of the usage object. Always `tokens` for this variant.

~~540~~

541 - `input_token_details: optional object { audio_tokens, text_tokens }`

~~542~~

543 Details about the input tokens billed for this request.

~~544~~

545 - `audio_tokens: optional number`

~~546~~

547 Number of audio tokens billed for this request.

~~548~~

549 - `text_tokens: optional number`

~~550~~

551 Number of text tokens billed for this request.

~~552~~

553 - `duration: object { seconds, type }`

~~554~~

555 Usage statistics for models billed by audio input duration.

~~556~~

557 - `seconds: number`

~~558~~

559 Duration of the input audio in seconds.

~~560~~

561 - `type: "duration"`

~~562~~

563 The type of the usage object. Always `duration` for this variant.
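
A diarized response is usually consumed segment by segment. As a small sketch under the field names listed above, the helpers below render a speaker-labeled transcript and total the speaking time per speaker. Both function names are hypothetical and the input is assumed to be an already-parsed `diarized_json` response.

```python
from typing import Any, Dict, List, Mapping


def format_diarized(response: Mapping[str, Any]) -> List[str]:
    """Render each segment as '[start-end] speaker: text'."""
    return [
        f"[{seg['start']:.2f}-{seg['end']:.2f}] {seg['speaker']}: {seg['text'].strip()}"
        for seg in response["segments"]
    ]


def talk_time(response: Mapping[str, Any]) -> Dict[str, float]:
    """Sum segment durations (end - start, in seconds) per speaker label."""
    totals: Dict[str, float] = {}
    for seg in response["segments"]:
        totals[seg["speaker"]] = totals.get(seg["speaker"], 0.0) + (seg["end"] - seg["start"])
    return totals
```

Speaker labels are used as-is, so the same code works whether they are sequential letters or the names supplied through `known_speaker_names[]`.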

### Transcription Diarized Segment

567- `transcription_diarized_segment: object { id, end, speaker, 3 more }`

~~568~~

569 A segment of diarized transcript text with speaker metadata.

~~570~~

571 - `id: string`

~~572~~

573 Unique identifier for the segment.

~~574~~

575 - `end: number`

~~576~~

577 End timestamp of the segment in seconds.

~~578~~

579 - `speaker: string`

~~580~~

581 Speaker label for this segment. When known speakers are provided, the label matches `known_speaker_names[]`. Otherwise speakers are labeled sequentially using capital letters (`A`, `B`, ...).

~~582~~

583 - `start: number`

~~584~~

585 Start timestamp of the segment in seconds.

~~586~~

587 - `text: string`

~~588~~

589 Transcript text for this segment.

~~590~~

591 - `type: "transcript.text.segment"`

~~592~~

593 The type of the segment. Always `transcript.text.segment`.

### Transcription Include

597- `transcription_include: "logprobs"`

~~598~~

599 - `"logprobs"`

### Transcription Segment

603- `transcription_segment: object { id, avg_logprob, compression_ratio, 7 more }`

~~604~~

605 - `id: number`

~~606~~

607 Unique identifier of the segment.

~~608~~

609 - `avg_logprob: number`

~~610~~

611 Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.

~~612~~

613 - `compression_ratio: number`

~~614~~

615 Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.

~~616~~

617 - `end: number`

~~618~~

619 End time of the segment in seconds.

~~620~~

621 - `no_speech_prob: number`

~~622~~

623 Probability of no speech in the segment. If the value is higher than 1.0 and the `avg_logprob` is below -1, consider this segment silent.

~~624~~

625 - `seek: number`

~~626~~

627 Seek offset of the segment.

~~628~~

629 - `start: number`

~~630~~

631 Start time of the segment in seconds.

~~632~~

633 - `temperature: number`

~~634~~

635 Temperature parameter used for generating the segment.

~~636~~

637 - `text: string`

~~638~~

639 Text content of the segment.

~~640~~

641 - `tokens: array of number`

~~642~~

643 Array of token IDs for the text content.
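
The thresholds quoted for `avg_logprob` and `compression_ratio` translate directly into a per-segment quality check. The sketch below is one possible reading, with a hypothetical function name and flag strings. Since a probability cannot exceed 1.0, the no-speech rule is applied only when the caller passes an explicit `no_speech_threshold`, which is an assumption of this example rather than something this reference specifies.

```python
from typing import Any, List, Mapping, Optional


def segment_flags(
    seg: Mapping[str, Any],
    no_speech_threshold: Optional[float] = None,
) -> List[str]:
    """Return quality flags for one segment of a verbose_json transcription."""
    flags: List[str] = []
    low_confidence = seg["avg_logprob"] < -1
    if low_confidence:
        flags.append("logprobs_failed")
    if seg["compression_ratio"] > 2.4:
        flags.append("compression_failed")
    # Applied only when the caller opts in with a threshold of their choosing.
    if (
        no_speech_threshold is not None
        and seg["no_speech_prob"] > no_speech_threshold
        and low_confidence
    ):
        flags.append("silent")
    return flags
```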

### Transcription Stream Event

647- `transcription_stream_event: TranscriptionTextSegmentEvent or TranscriptionTextDeltaEvent or TranscriptionTextDoneEvent`

~~648~~

649 Emitted when a diarized transcription returns a completed segment with speaker information. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with `stream` set to `true` and `response_format` set to `diarized_json`.

~~650~~

651 - `transcription_text_segment_event: object { id, end, speaker, 3 more }`

~~652~~

653 Emitted when a diarized transcription returns a completed segment with speaker information. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with `stream` set to `true` and `response_format` set to `diarized_json`.

~~654~~

655 - `id: string`

~~656~~

657 Unique identifier for the segment.

~~658~~

659 - `end: number`

~~660~~

661 End timestamp of the segment in seconds.

~~662~~

663 - `speaker: string`

~~664~~

665 Speaker label for this segment.

~~666~~

667 - `start: number`

~~668~~

669 Start timestamp of the segment in seconds.

~~670~~

671 - `text: string`

~~672~~

673 Transcript text for this segment.

~~674~~

675 - `type: "transcript.text.segment"`

~~676~~

677 The type of the event. Always `transcript.text.segment`.

~~678~~

679 - `transcription_text_delta_event: object { delta, type, logprobs, segment_id }`

~~680~~

    Emitted when there is an additional text delta. This is also the first event emitted when the transcription starts. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `stream` parameter set to `true`.

~~682~~

683 - `delta: string`

~~684~~

685 The text delta that was additionally transcribed.

~~686~~

687 - `type: "transcript.text.delta"`

~~688~~

689 The type of the event. Always `transcript.text.delta`.

~~690~~

691 - `logprobs: optional array of object { token, bytes, logprob }`

~~692~~

693 The log probabilities of the delta. Only included if you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `include[]` parameter set to `logprobs`.

~~694~~

695 - `token: optional string`

~~696~~

697 The token that was used to generate the log probability.

~~698~~

699 - `bytes: optional array of number`

~~700~~

701 The bytes that were used to generate the log probability.

~~702~~

703 - `logprob: optional number`

~~704~~

705 The log probability of the token.

~~706~~

707 - `segment_id: optional string`

~~708~~

709 Identifier of the diarized segment that this delta belongs to. Only present when using `gpt-4o-transcribe-diarize`.

~~710~~

711 - `transcription_text_done_event: object { text, type, logprobs, usage }`

~~712~~

    Emitted when the transcription is complete. Contains the complete transcription text. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `stream` parameter set to `true`.

~~714~~

715 - `text: string`

~~716~~

717 The text that was transcribed.

~~718~~

719 - `type: "transcript.text.done"`

~~720~~

721 The type of the event. Always `transcript.text.done`.

~~722~~

723 - `logprobs: optional array of object { token, bytes, logprob }`

~~724~~

725 The log probabilities of the individual tokens in the transcription. Only included if you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `include[]` parameter set to `logprobs`.

~~726~~

727 - `token: optional string`

~~728~~

729 The token that was used to generate the log probability.

~~730~~

731 - `bytes: optional array of number`

~~732~~

733 The bytes that were used to generate the log probability.

~~734~~

735 - `logprob: optional number`

~~736~~

737 The log probability of the token.

~~738~~

739 - `usage: optional object { input_tokens, output_tokens, total_tokens, 2 more }`

~~740~~

741 Usage statistics for models billed by token usage.

~~742~~

743 - `input_tokens: number`

~~744~~

745 Number of input tokens billed for this request.

~~746~~

747 - `output_tokens: number`

~~748~~

749 Number of output tokens generated.

~~750~~

751 - `total_tokens: number`

~~752~~

753 Total number of tokens used (input + output).

~~754~~

755 - `type: "tokens"`

~~756~~

757 The type of the usage object. Always `tokens` for this variant.

~~758~~

759 - `input_token_details: optional object { audio_tokens, text_tokens }`

~~760~~

761 Details about the input tokens billed for this request.

~~762~~

763 - `audio_tokens: optional number`

~~764~~

765 Number of audio tokens billed for this request.

~~766~~

767 - `text_tokens: optional number`

~~768~~

769 Number of text tokens billed for this request.
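
The three event variants share a `type` field, so a stream consumer can dispatch on it. The sketch below folds a sequence of events into a single result. It assumes the events have already been decoded into dictionaries (transport and decoding are out of scope here), and `consume_events` is a hypothetical name.

```python
from typing import Any, Dict, Iterable, List, Optional


def consume_events(events: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Fold already-decoded stream events into text, segments, and a status."""
    partial: List[str] = []
    segments: List[Dict[str, Any]] = []
    final_text: Optional[str] = None
    for event in events:
        kind = event["type"]
        if kind == "transcript.text.delta":
            partial.append(event["delta"])
        elif kind == "transcript.text.segment":
            segments.append(event)
        elif kind == "transcript.text.done":
            final_text = event["text"]
    # Prefer the text of the done event; fall back to the joined deltas
    # when the stream ended before a done event arrived.
    text = final_text if final_text is not None else "".join(partial)
    return {"text": text, "segments": segments, "complete": final_text is not None}
```

Keeping the joined deltas as a fallback means a truncated stream still yields the partial transcript, with `complete` set to `False`.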

### Transcription Text Delta Event

773- `transcription_text_delta_event: object { delta, type, logprobs, segment_id }`

~~774~~

  Emitted when there is an additional text delta. This is also the first event emitted when the transcription starts. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `stream` parameter set to `true`.

~~776~~

777 - `delta: string`

~~778~~

779 The text delta that was additionally transcribed.

~~780~~

781 - `type: "transcript.text.delta"`

~~782~~

783 The type of the event. Always `transcript.text.delta`.

~~784~~

785 - `logprobs: optional array of object { token, bytes, logprob }`

~~786~~

787 The log probabilities of the delta. Only included if you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `include[]` parameter set to `logprobs`.

~~788~~

789 - `token: optional string`

~~790~~

791 The token that was used to generate the log probability.

~~792~~

793 - `bytes: optional array of number`

~~794~~

795 The bytes that were used to generate the log probability.

~~796~~

797 - `logprob: optional number`

~~798~~

799 The log probability of the token.

~~800~~

801 - `segment_id: optional string`

~~802~~

803 Identifier of the diarized segment that this delta belongs to. Only present when using `gpt-4o-transcribe-diarize`.

### Transcription Text Done Event

807- `transcription_text_done_event: object { text, type, logprobs, usage }`

~~808~~

  Emitted when the transcription is complete. Contains the complete transcription text. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `stream` parameter set to `true`.

~~810~~

811 - `text: string`

~~812~~

813 The text that was transcribed.

~~814~~

815 - `type: "transcript.text.done"`

~~816~~

817 The type of the event. Always `transcript.text.done`.

~~818~~

819 - `logprobs: optional array of object { token, bytes, logprob }`

~~820~~

821 The log probabilities of the individual tokens in the transcription. Only included if you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `include[]` parameter set to `logprobs`.

~~822~~

823 - `token: optional string`

~~824~~

825 The token that was used to generate the log probability.

~~826~~

827 - `bytes: optional array of number`

~~828~~

829 The bytes that were used to generate the log probability.

~~830~~

831 - `logprob: optional number`

~~832~~

833 The log probability of the token.

~~834~~

835 - `usage: optional object { input_tokens, output_tokens, total_tokens, 2 more }`

~~836~~

837 Usage statistics for models billed by token usage.

~~838~~

839 - `input_tokens: number`

~~840~~

841 Number of input tokens billed for this request.

~~842~~

843 - `output_tokens: number`

~~844~~

845 Number of output tokens generated.

~~846~~

847 - `total_tokens: number`

~~848~~

849 Total number of tokens used (input + output).

~~850~~

851 - `type: "tokens"`

~~852~~

853 The type of the usage object. Always `tokens` for this variant.

~~854~~

855 - `input_token_details: optional object { audio_tokens, text_tokens }`

~~856~~

857 Details about the input tokens billed for this request.

~~858~~

859 - `audio_tokens: optional number`

~~860~~

861 Number of audio tokens billed for this request.

~~862~~

863 - `text_tokens: optional number`

~~864~~

865 Number of text tokens billed for this request.
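
When `logprobs` is present on the done event, each entry holds a natural-log probability, so `exp(logprob)` recovers the per-token probability. The sketch below averages those probabilities as a rough confidence figure. The helper name is hypothetical, and treating the mean as a confidence score is a simplification chosen for this example, not something the API defines.

```python
import math
from typing import Any, Mapping, Optional, Sequence


def mean_token_probability(
    logprobs: Optional[Sequence[Mapping[str, Any]]],
) -> Optional[float]:
    """Average exp(logprob) over tokens; None when no logprobs were returned."""
    values = [
        item["logprob"]
        for item in (logprobs or [])
        if item.get("logprob") is not None
    ]
    if not values:
        return None
    return sum(math.exp(value) for value in values) / len(values)
```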

### Transcription Text Segment Event

869- `transcription_text_segment_event: object { id, end, speaker, 3 more }`

~~870~~

871 Emitted when a diarized transcription returns a completed segment with speaker information. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with `stream` set to `true` and `response_format` set to `diarized_json`.

~~872~~

873 - `id: string`

~~874~~

875 Unique identifier for the segment.

~~876~~

877 - `end: number`

~~878~~

879 End timestamp of the segment in seconds.

~~880~~

881 - `speaker: string`

~~882~~

883 Speaker label for this segment.

~~884~~

885 - `start: number`

~~886~~

887 Start timestamp of the segment in seconds.

~~888~~

889 - `text: string`

~~890~~

891 Transcript text for this segment.

~~892~~

893 - `type: "transcript.text.segment"`

~~894~~

895 The type of the event. Always `transcript.text.segment`.

### Transcription Verbose

899- `transcription_verbose: object { duration, language, text, 3 more }`

~~900~~

  Represents a verbose JSON transcription response returned by the model, based on the provided input.

~~902~~

903 - `duration: number`

~~904~~

905 The duration of the input audio.

~~906~~

907 - `language: string`

~~908~~

909 The language of the input audio.

~~910~~

911 - `text: string`

~~912~~

913 The transcribed text.

~~914~~

915 - `segments: optional array of TranscriptionSegment`

~~916~~

917 Segments of the transcribed text and their corresponding details.

~~918~~

919 - `id: number`

~~920~~

921 Unique identifier of the segment.

~~922~~

923 - `avg_logprob: number`

~~924~~

925 Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.

~~926~~

927 - `compression_ratio: number`

~~928~~

929 Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.

~~930~~

931 - `end: number`

~~932~~

933 End time of the segment in seconds.

~~934~~

935 - `no_speech_prob: number`

~~936~~

937 Probability of no speech in the segment. If the value is higher than 1.0 and the `avg_logprob` is below -1, consider this segment silent.

~~938~~

939 - `seek: number`

~~940~~

941 Seek offset of the segment.

~~942~~

943 - `start: number`

~~944~~

945 Start time of the segment in seconds.

~~946~~

947 - `temperature: number`

~~948~~

949 Temperature parameter used for generating the segment.

~~950~~

951 - `text: string`

~~952~~

953 Text content of the segment.

~~954~~

955 - `tokens: array of number`

~~956~~

957 Array of token IDs for the text content.

~~958~~

959 - `usage: optional object { seconds, type }`

~~960~~

961 Usage statistics for models billed by audio input duration.

~~962~~

963 - `seconds: number`

~~964~~

965 Duration of the input audio in seconds.

~~966~~

967 - `type: "duration"`

~~968~~

969 The type of the usage object. Always `duration` for this variant.

~~970~~

971 - `words: optional array of TranscriptionWord`

~~972~~

973 Extracted words and their corresponding timestamps.

~~974~~

975 - `end: number`

~~976~~

977 End time of the word in seconds.

~~978~~

979 - `start: number`

~~980~~

981 Start time of the word in seconds.

~~982~~

983 - `word: string`

~~984~~

985 The text content of the word.

~~986~~
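
The per-segment quality hints described above (`avg_logprob` below -1, `compression_ratio` above 2.4) can be applied mechanically once a verbose response has been parsed. The sketch below assumes the response is already a Python dictionary shaped like `transcription_verbose`; the function names and flag strings are illustrative, not part of any official tooling.

```python
def flag_segment(segment: dict) -> list[str]:
    """Return the quality flags that apply to one TranscriptionSegment."""
    flags = []
    if segment["avg_logprob"] < -1:
        flags.append("logprob_failed")
    if segment["compression_ratio"] > 2.4:
        flags.append("compression_failed")
    return flags


def usable_segments(transcription: dict) -> list[dict]:
    """Keep only segments with no quality flags; `segments` is optional."""
    segments = transcription.get("segments") or []
    return [segment for segment in segments if not flag_segment(segment)]
```

Whether a flagged segment should be dropped, retried, or merely highlighted depends on the application; the filter here simply drops it.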

### Transcription Word

- `transcription_word: object { end, start, word }`

  - `end: number`

    End time of the word in seconds.

  - `start: number`

    Start time of the word in seconds.

  - `word: string`

    The text content of the word.
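
Word-level timestamps are enough to build simple subtitle cues by hand. As a rough sketch, assuming a list of dictionaries shaped like `transcription_word`, the code below groups words into fixed-size chunks and renders them in SubRip (`srt`) layout. The chunk size of seven words and both function names are arbitrary choices for illustration.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as HH:MM:SS,mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def words_to_srt(words: list[dict], max_words: int = 7) -> str:
    """Group words into cues of at most `max_words` and render them as SRT."""
    cues = []
    for index in range(0, len(words), max_words):
        chunk = words[index:index + max_words]
        start = srt_timestamp(chunk[0]["start"])
        end = srt_timestamp(chunk[-1]["end"])
        text = " ".join(item["word"] for item in chunk)
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)
```

When the built-in `srt` response format is available for the chosen model, requesting it directly is simpler; a hand-rolled version like this is mainly useful when custom cue lengths are needed.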

# Translations

## Create translation

`$ openai audio:translations create`

**post** `/audio/translations`

Translates audio into English.

### Parameters

- `--file: string`

  The audio file object (not file name) to translate, in one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm.

- `--model: string or AudioModel`

  ID of the model to use. Only `whisper-1` (which is powered by our open source Whisper V2 model) is currently available.

- `--prompt: optional string`

  An optional text to guide the model's style or continue a previous audio segment. The [prompt](https://platform.openai.com/docs/guides/speech-to-text#prompting) should be in English.

- `--response-format: optional "json" or "text" or "srt" or 2 more`

  The format of the output, in one of these options: `json`, `text`, `srt`, `verbose_json`, or `vtt`.

- `--temperature: optional number`

  The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use [log probability](https://en.wikipedia.org/wiki/Log_probability) to automatically increase the temperature until certain thresholds are hit.
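
When the command is driven from a script, the documented constraints on the parameters above can be checked before the CLI is invoked. The following sketch builds an argument list from the flags listed in this section and rejects out-of-range values. The helper name is hypothetical, and the value passed to `--file` is treated as an opaque string because this section does not spell out how the CLI resolves it.

```python
from typing import Optional

TRANSLATION_RESPONSE_FORMATS = {"json", "text", "srt", "verbose_json", "vtt"}


def translation_args(
    file: str,
    model: str = "whisper-1",
    prompt: Optional[str] = None,
    response_format: Optional[str] = None,
    temperature: Optional[float] = None,
) -> list[str]:
    """Build an argv list for `openai audio:translations create`."""
    args = ["openai", "audio:translations", "create", "--file", file, "--model", model]
    if prompt is not None:
        args += ["--prompt", prompt]
    if response_format is not None:
        if response_format not in TRANSLATION_RESPONSE_FORMATS:
            raise ValueError(f"unsupported response format: {response_format}")
        args += ["--response-format", response_format]
    if temperature is not None:
        # The documented range is 0 to 1 inclusive.
        if not 0 <= temperature <= 1:
            raise ValueError("temperature must be between 0 and 1")
        args += ["--temperature", str(temperature)]
    return args
```

The resulting list can be handed to `subprocess.run` without shell quoting concerns, since each flag and value is a separate element.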

### Returns

- `unnamed_schema_1: Translation or TranslationVerbose`

  - `translation: object { text }`

    - `text: string`

  - `translation_verbose: object { duration, language, text, segments }`

    - `duration: number`

      The duration of the input audio.

    - `language: string`

      The language of the output translation (always `english`).

    - `text: string`

      The translated text.

    - `segments: optional array of TranscriptionSegment`

      Segments of the translated text and their corresponding details.

      - `id: number`

        Unique identifier of the segment.

      - `avg_logprob: number`

        Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.

      - `compression_ratio: number`

        Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.

      - `end: number`

        End time of the segment in seconds.

      - `no_speech_prob: number`

        Probability of no speech in the segment. If the value is higher than 1.0 and the `avg_logprob` is below -1, consider this segment silent.

      - `seek: number`

        Seek offset of the segment.

      - `start: number`

        Start time of the segment in seconds.

      - `temperature: number`

        Temperature parameter used for generating the segment.

      - `text: string`

        Text content of the segment.

      - `tokens: array of number`

        Array of token IDs for the text content.

### Example

```cli
openai audio:translations create \
  --api-key 'My API Key' \
  --file 'Example data' \
  --model whisper-1
```

#### Response

```json
{
  "text": "text"
}
```
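
Because the command can return either a `Translation` or a `TranslationVerbose`, a consumer has to tell the two apart. One way to do that, sketched here under the assumption that the output has been captured as a JSON string, is to check for the `duration` field that only the verbose variant carries. The function name and the keys of the returned summary are illustrative.

```python
import json


def summarize_translation(payload: str) -> dict:
    """Summarize a Translation or TranslationVerbose JSON payload."""
    data = json.loads(payload)
    # Only the verbose variant includes `duration`.
    summary = {"text": data["text"], "verbose": "duration" in data}
    if summary["verbose"]:
        summary["duration"] = data["duration"]
        # `segments` is optional even in the verbose variant.
        summary["segment_count"] = len(data.get("segments") or [])
    return summary
```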

## Domain Types

### Translation

- `translation: object { text }`

  - `text: string`

### Translation Verbose

- `translation_verbose: object { duration, language, text, segments }`

  - `duration: number`

    The duration of the input audio.

  - `language: string`

    The language of the output translation (always `english`).

  - `text: string`

    The translated text.

  - `segments: optional array of TranscriptionSegment`

    Segments of the translated text and their corresponding details.

    - `id: number`

      Unique identifier of the segment.

    - `avg_logprob: number`

      Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.

    - `compression_ratio: number`

      Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.

    - `end: number`

      End time of the segment in seconds.

    - `no_speech_prob: number`

      Probability of no speech in the segment. If the value is higher than 1.0 and the `avg_logprob` is below -1, consider this segment silent.

    - `seek: number`

      Seek offset of the segment.

    - `start: number`

      Start time of the segment in seconds.

    - `temperature: number`

      Temperature parameter used for generating the segment.

    - `text: string`

      Text content of the segment.

    - `tokens: array of number`

      Array of token IDs for the text content.

# Speech

## Create speech

`$ openai audio:speech create`

**post** `/audio/speech`

Generates audio from the input text.

Returns the audio file content, or a stream of audio events.

### Parameters

- `--input: string`

  The text to generate audio for. The maximum length is 4096 characters.

- `--model: string or SpeechModel`

  One of the available [TTS models](https://platform.openai.com/docs/models#tts): `tts-1`, `tts-1-hd`, `gpt-4o-mini-tts`, or `gpt-4o-mini-tts-2025-12-15`.

- `--voice: string or "alloy" or "ash" or "ballad" or 7 more or object { id }`

  The voice to use when generating the audio. Supported built-in voices are `alloy`, `ash`, `ballad`, `coral`, `echo`, `fable`, `onyx`, `nova`, `sage`, `shimmer`, `verse`, `marin`, and `cedar`. You may also provide a custom voice object with an `id`, for example `{ "id": "voice_1234" }`. Previews of the voices are available in the [Text to speech guide](https://platform.openai.com/docs/guides/text-to-speech#voice-options).

- `--instructions: optional string`

  Control the voice of your generated audio with additional instructions. Does not work with `tts-1` or `tts-1-hd`.

- `--response-format: optional "mp3" or "opus" or "aac" or 3 more`

  The format of the generated audio. Supported formats are `mp3`, `opus`, `aac`, `flac`, `wav`, and `pcm`.

- `--speed: optional number`

  The speed of the generated audio. Select a value from `0.25` to `4.0`. `1.0` is the default.

- `--stream-format: optional "sse" or "audio"`

  The format to stream the audio in. Supported formats are `sse` and `audio`. `sse` is not supported for `tts-1` or `tts-1-hd`.
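
Several of the parameters above interact: `--instructions` and the `sse` stream format are both unavailable on `tts-1` and `tts-1-hd`, and `--input` and `--speed` have hard limits. The sketch below gathers those documented rules into one pre-flight check. It is only an illustration of the constraints stated in this section, and the function and constant names are made up for the example.

```python
from typing import Optional

TTS1_MODELS = {"tts-1", "tts-1-hd"}
SPEECH_RESPONSE_FORMATS = {"mp3", "opus", "aac", "flac", "wav", "pcm"}
STREAM_FORMATS = {"sse", "audio"}


def check_speech_params(
    input_text: str,
    model: str,
    instructions: Optional[str] = None,
    response_format: Optional[str] = None,
    speed: Optional[float] = None,
    stream_format: Optional[str] = None,
) -> list[str]:
    """Return a list of human-readable problems; empty means no rule is violated."""
    problems = []
    if len(input_text) > 4096:
        problems.append("input exceeds 4096 characters")
    if instructions is not None and model in TTS1_MODELS:
        problems.append(f"instructions do not work with {model}")
    if response_format is not None and response_format not in SPEECH_RESPONSE_FORMATS:
        problems.append(f"unsupported response format: {response_format}")
    if speed is not None and not 0.25 <= speed <= 4.0:
        problems.append("speed must be between 0.25 and 4.0")
    if stream_format is not None:
        if stream_format not in STREAM_FORMATS:
            problems.append(f"unsupported stream format: {stream_format}")
        elif stream_format == "sse" and model in TTS1_MODELS:
            problems.append(f"sse streaming is not supported for {model}")
    return problems
```

Returning a list rather than raising on the first violation lets a caller report every problem at once.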

### Returns

- `unnamed_schema_2: file path`

### Example

```cli
openai audio:speech create \
  --api-key 'My API Key' \
  --input input \
  --model tts-1 \
  --voice string
```

## Domain Types

### Speech Model

- `speech_model: "tts-1" or "tts-1-hd" or "gpt-4o-mini-tts" or "gpt-4o-mini-tts-2025-12-15"`

  - `"tts-1"`

  - `"tts-1-hd"`

  - `"gpt-4o-mini-tts"`

  - `"gpt-4o-mini-tts-2025-12-15"`

# Voices

# Voice Consents