cli/resources/audio/index.md +0 −1258 deleted
1# Audio
2
3## Domain Types
4
5### Audio Model
6
7- `audio_model: "whisper-1" or "gpt-4o-transcribe" or "gpt-4o-mini-transcribe" or 2 more`
8
9 - `"whisper-1"`
10
11 - `"gpt-4o-transcribe"`
12
13 - `"gpt-4o-mini-transcribe"`
14
15 - `"gpt-4o-mini-transcribe-2025-12-15"`
16
17 - `"gpt-4o-transcribe-diarize"`
18
19### Audio Response Format
20
21- `audio_response_format: "json" or "text" or "srt" or 3 more`
22
23 The format of the output, in one of these options: `json`, `text`, `srt`, `verbose_json`, `vtt`, or `diarized_json`. For `gpt-4o-transcribe` and `gpt-4o-mini-transcribe`, the only supported format is `json`. For `gpt-4o-transcribe-diarize`, the supported formats are `json`, `text`, and `diarized_json`, with `diarized_json` required to receive speaker annotations.
24
25 - `"json"`
26
27 - `"text"`
28
29 - `"srt"`
30
31 - `"verbose_json"`
32
33 - `"vtt"`
34
35 - `"diarized_json"`
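
The model and format restrictions described above can be captured in a small lookup. The following is a minimal sketch, not part of the CLI: the names `SUPPORTED_FORMATS` and `format_is_supported` are hypothetical, and models for which the text states no restriction are left undecided rather than guessed.

```python
from typing import Optional

# Formats each model accepts, as stated in the description above.
# Models not mentioned there are deliberately left out rather than guessed.
SUPPORTED_FORMATS = {
    "gpt-4o-transcribe": {"json"},
    "gpt-4o-mini-transcribe": {"json"},
    "gpt-4o-transcribe-diarize": {"json", "text", "diarized_json"},
}


def format_is_supported(model: str, response_format: str) -> Optional[bool]:
    """Return True/False when the text above settles it, None when it does not."""
    allowed = SUPPORTED_FORMATS.get(model)
    if allowed is None:
        return None  # no documented restriction for this model
    return response_format in allowed
```

Returning `None` for unlisted models keeps the helper from rejecting combinations that this page does not rule out.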
36
37# Transcriptions
38
39## Create transcription
40
41`$ openai audio:transcriptions create`
42
43**post** `/audio/transcriptions`
44
45Transcribes audio into the input language.
46
47Returns a transcription object in `json`, `diarized_json`, or `verbose_json`
48format, or a stream of transcript events.
49
50### Parameters
51
52- `--file: string`
53
54 The audio file object (not file name) to transcribe, in one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm.
55
56- `--model: string or AudioModel`
57
58 ID of the model to use. The options are `gpt-4o-transcribe`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `whisper-1` (which is powered by our open source Whisper V2 model), and `gpt-4o-transcribe-diarize`.
59
60- `--chunking-strategy: optional "auto" or object { type, prefix_padding_ms, silence_duration_ms, threshold }`
61
62 Controls how the audio is cut into chunks. When set to `"auto"`, the server first normalizes loudness and then uses voice activity detection (VAD) to choose boundaries. A `server_vad` object can be provided instead to tune the VAD parameters manually. If unset, the audio is transcribed as a single block. Required when using `gpt-4o-transcribe-diarize` for inputs longer than 30 seconds.
63
64- `--include: optional array of TranscriptionInclude`
65
66 Additional information to include in the transcription response.
67 `logprobs` returns the log probabilities of the tokens in the
68 response, which indicate the model's confidence in the transcription.
69 `logprobs` only works with `response_format` set to `json` and only with
70 the models `gpt-4o-transcribe`, `gpt-4o-mini-transcribe`, and `gpt-4o-mini-transcribe-2025-12-15`. This field is not supported when using `gpt-4o-transcribe-diarize`.
71
72- `--known-speaker-name: optional array of string`
73
74 Optional list of speaker names that correspond to the audio samples provided in `known_speaker_references[]`. Each entry should be a short identifier (for example `customer` or `agent`). Up to 4 speakers are supported.
75
76- `--known-speaker-reference: optional array of string`
77
78 Optional list of audio samples (as [data URLs](https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/Data_URLs)) that contain known speaker references matching `known_speaker_names[]`. Each sample must be between 2 and 10 seconds, and can use any of the same input audio formats supported by `file`.
79
80- `--language: optional string`
81
82 The language of the input audio. Supplying the input language in [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) (e.g. `en`) format will improve accuracy and latency.
83
84- `--prompt: optional string`
85
86 An optional text to guide the model's style or continue a previous audio segment. The [prompt](https://platform.openai.com/docs/guides/speech-to-text#prompting) should match the audio language. This field is not supported when using `gpt-4o-transcribe-diarize`.
87
88- `--response-format: optional "json" or "text" or "srt" or 3 more`
89
90 The format of the output, in one of these options: `json`, `text`, `srt`, `verbose_json`, `vtt`, or `diarized_json`. For `gpt-4o-transcribe` and `gpt-4o-mini-transcribe`, the only supported format is `json`. For `gpt-4o-transcribe-diarize`, the supported formats are `json`, `text`, and `diarized_json`, with `diarized_json` required to receive speaker annotations.
91
92- `--temperature: optional number`
93
94 The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use [log probability](https://en.wikipedia.org/wiki/Log_probability) to automatically increase the temperature until certain thresholds are hit.
95
96- `--timestamp-granularity: optional array of "word" or "segment"`
97
98 The timestamp granularities to populate for this transcription. `response_format` must be set to `verbose_json` to use timestamp granularities. Either or both of these options are supported: `word` and `segment`. Note: There is no additional latency for segment timestamps, but generating word timestamps incurs additional latency.
99 This option is not available for `gpt-4o-transcribe-diarize`.
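
Several of the parameters above constrain one another. As a rough illustration, the checks can be written as a pre-flight function. This is a sketch under assumptions: `check_params` and the dictionary keys (`include`, `known_speaker_names`, `timestamp_granularities`, and so on) are hypothetical names chosen for this example, and only rules stated in this section are encoded.

```python
from typing import Any, Dict, List

LOGPROB_MODELS = {
    "gpt-4o-transcribe",
    "gpt-4o-mini-transcribe",
    "gpt-4o-mini-transcribe-2025-12-15",
}
DIARIZE_MODEL = "gpt-4o-transcribe-diarize"


def check_params(params: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems: List[str] = []
    model = params.get("model")
    fmt = params.get("response_format")  # None means the caller left it unset

    include = params.get("include") or []
    if "logprobs" in include:
        if model not in LOGPROB_MODELS:
            problems.append("logprobs is not supported by this model")
        # An unset format is not flagged: the default is not stated in this section.
        if fmt not in (None, "json"):
            problems.append("logprobs requires response_format json")

    names = params.get("known_speaker_names") or []
    if len(names) > 4:
        problems.append("at most 4 known speakers are supported")

    granularities = params.get("timestamp_granularities") or []
    if granularities and fmt != "verbose_json":
        problems.append("timestamp granularities require response_format verbose_json")

    temperature = params.get("temperature")
    if temperature is not None and not 0 <= temperature <= 1:
        problems.append("temperature must be between 0 and 1")

    if model == DIARIZE_MODEL:
        if params.get("prompt"):
            problems.append("prompt is not supported with the diarize model")
        if granularities:
            problems.append("timestamp granularities are not available with the diarize model")
    return problems
```

The function collects every problem instead of stopping at the first one, so a caller can report all of them in a single pass. The 30-second chunking rule is omitted because it depends on the audio length, which this sketch does not inspect.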
100
101### Returns
102
103- `AudioTranscriptionNewResponse: Transcription or TranscriptionDiarized or TranscriptionVerbose`
104
105 Represents a transcription response returned by the model, based on the provided input.
106
107 - `transcription: object { text, logprobs, usage }`
108
109 Represents a transcription response returned by the model, based on the provided input.
110
111 - `text: string`
112
113 The transcribed text.
114
115 - `logprobs: optional array of object { token, bytes, logprob }`
116
117 The log probabilities of the tokens in the transcription. Only returned with the models `gpt-4o-transcribe` and `gpt-4o-mini-transcribe` if `logprobs` is added to the `include` array.
118
119 - `token: optional string`
120
121 The token in the transcription.
122
123 - `bytes: optional array of number`
124
125 The bytes of the token.
126
127 - `logprob: optional number`
128
129 The log probability of the token.
130
131 - `usage: optional object { input_tokens, output_tokens, total_tokens, 2 more } or object { seconds, type }`
132
133 Token usage statistics for the request.
134
135 - `tokens: object { input_tokens, output_tokens, total_tokens, 2 more }`
136
137 Usage statistics for models billed by token usage.
138
139 - `input_tokens: number`
140
141 Number of input tokens billed for this request.
142
143 - `output_tokens: number`
144
145 Number of output tokens generated.
146
147 - `total_tokens: number`
148
149 Total number of tokens used (input + output).
150
151 - `type: "tokens"`
152
153 The type of the usage object. Always `tokens` for this variant.
154
155 - `input_token_details: optional object { audio_tokens, text_tokens }`
156
157 Details about the input tokens billed for this request.
158
159 - `audio_tokens: optional number`
160
161 Number of audio tokens billed for this request.
162
163 - `text_tokens: optional number`
164
165 Number of text tokens billed for this request.
166
167 - `duration: object { seconds, type }`
168
169 Usage statistics for models billed by audio input duration.
170
171 - `seconds: number`
172
173 Duration of the input audio in seconds.
174
175 - `type: "duration"`
176
177 The type of the usage object. Always `duration` for this variant.
178
179 - `transcription_diarized: object { duration, segments, task, 2 more }`
180
181 Represents a diarized transcription response returned by the model, including the combined transcript and speaker-segment annotations.
182
183 - `duration: number`
184
185 Duration of the input audio in seconds.
186
187 - `segments: array of TranscriptionDiarizedSegment`
188
189 Segments of the transcript annotated with timestamps and speaker labels.
190
191 - `id: string`
192
193 Unique identifier for the segment.
194
195 - `end: number`
196
197 End timestamp of the segment in seconds.
198
199 - `speaker: string`
200
201 Speaker label for this segment. When known speakers are provided, the label matches `known_speaker_names[]`. Otherwise speakers are labeled sequentially using capital letters (`A`, `B`, ...).
202
203 - `start: number`
204
205 Start timestamp of the segment in seconds.
206
207 - `text: string`
208
209 Transcript text for this segment.
210
211 - `type: "transcript.text.segment"`
212
213 The type of the segment. Always `transcript.text.segment`.
214
215 - `task: "transcribe"`
216
217 The type of task that was run. Always `transcribe`.
218
219 - `text: string`
220
221 The concatenated transcript text for the entire audio input.
222
223 - `usage: optional object { input_tokens, output_tokens, total_tokens, 2 more } or object { seconds, type }`
224
225 Token or duration usage statistics for the request.
226
227 - `tokens: object { input_tokens, output_tokens, total_tokens, 2 more }`
228
229 Usage statistics for models billed by token usage.
230
231 - `input_tokens: number`
232
233 Number of input tokens billed for this request.
234
235 - `output_tokens: number`
236
237 Number of output tokens generated.
238
239 - `total_tokens: number`
240
241 Total number of tokens used (input + output).
242
243 - `type: "tokens"`
244
245 The type of the usage object. Always `tokens` for this variant.
246
247 - `input_token_details: optional object { audio_tokens, text_tokens }`
248
249 Details about the input tokens billed for this request.
250
251 - `audio_tokens: optional number`
252
253 Number of audio tokens billed for this request.
254
255 - `text_tokens: optional number`
256
257 Number of text tokens billed for this request.
258
259 - `duration: object { seconds, type }`
260
261 Usage statistics for models billed by audio input duration.
262
263 - `seconds: number`
264
265 Duration of the input audio in seconds.
266
267 - `type: "duration"`
268
269 The type of the usage object. Always `duration` for this variant.
270
271 - `transcription_verbose: object { duration, language, text, 3 more }`
272
273 Represents a `verbose_json` transcription response returned by the model, based on the provided input.
274
275 - `duration: number`
276
277 The duration of the input audio.
278
279 - `language: string`
280
281 The language of the input audio.
282
283 - `text: string`
284
285 The transcribed text.
286
287 - `segments: optional array of TranscriptionSegment`
288
289 Segments of the transcribed text and their corresponding details.
290
291 - `id: number`
292
293 Unique identifier of the segment.
294
295 - `avg_logprob: number`
296
297 Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.
298
299 - `compression_ratio: number`
300
301 Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.
302
303 - `end: number`
304
305 End time of the segment in seconds.
306
307 - `no_speech_prob: number`
308
309 Probability of no speech in the segment. If the value is higher than 1.0 and the `avg_logprob` is below -1, consider this segment silent.
310
311 - `seek: number`
312
313 Seek offset of the segment.
314
315 - `start: number`
316
317 Start time of the segment in seconds.
318
319 - `temperature: number`
320
321 Temperature parameter used for generating the segment.
322
323 - `text: string`
324
325 Text content of the segment.
326
327 - `tokens: array of number`
328
329 Array of token IDs for the text content.
330
331 - `usage: optional object { seconds, type }`
332
333 Usage statistics for models billed by audio input duration.
334
335 - `seconds: number`
336
337 Duration of the input audio in seconds.
338
339 - `type: "duration"`
340
341 The type of the usage object. Always `duration` for this variant.
342
343 - `words: optional array of TranscriptionWord`
344
345 Extracted words and their corresponding timestamps.
346
347 - `end: number`
348
349 End time of the word in seconds.
350
351 - `start: number`
352
353 Start time of the word in seconds.
354
355 - `word: string`
356
357 The text content of the word.
358
359### Example
360
```cli
openai audio:transcriptions create \
  --api-key 'My API Key' \
  --file 'Example data' \
  --model gpt-4o-transcribe
```
367
368#### Response
369
```json
{
  "text": "text",
  "logprobs": [
    {
      "token": "token",
      "bytes": [
        0
      ],
      "logprob": 0
    }
  ],
  "usage": {
    "input_tokens": 0,
    "output_tokens": 0,
    "total_tokens": 0,
    "type": "tokens",
    "input_token_details": {
      "audio_tokens": 0,
      "text_tokens": 0
    }
  }
}
```
394
395## Domain Types
396
397### Transcription
398
399- `transcription: object { text, logprobs, usage }`
400
401 Represents a transcription response returned by the model, based on the provided input.
402
403 - `text: string`
404
405 The transcribed text.
406
407 - `logprobs: optional array of object { token, bytes, logprob }`
408
409 The log probabilities of the tokens in the transcription. Only returned with the models `gpt-4o-transcribe` and `gpt-4o-mini-transcribe` if `logprobs` is added to the `include` array.
410
411 - `token: optional string`
412
413 The token in the transcription.
414
415 - `bytes: optional array of number`
416
417 The bytes of the token.
418
419 - `logprob: optional number`
420
421 The log probability of the token.
422
423 - `usage: optional object { input_tokens, output_tokens, total_tokens, 2 more } or object { seconds, type }`
424
425 Token usage statistics for the request.
426
427 - `tokens: object { input_tokens, output_tokens, total_tokens, 2 more }`
428
429 Usage statistics for models billed by token usage.
430
431 - `input_tokens: number`
432
433 Number of input tokens billed for this request.
434
435 - `output_tokens: number`
436
437 Number of output tokens generated.
438
439 - `total_tokens: number`
440
441 Total number of tokens used (input + output).
442
443 - `type: "tokens"`
444
445 The type of the usage object. Always `tokens` for this variant.
446
447 - `input_token_details: optional object { audio_tokens, text_tokens }`
448
449 Details about the input tokens billed for this request.
450
451 - `audio_tokens: optional number`
452
453 Number of audio tokens billed for this request.
454
455 - `text_tokens: optional number`
456
457 Number of text tokens billed for this request.
458
459 - `duration: object { seconds, type }`
460
461 Usage statistics for models billed by audio input duration.
462
463 - `seconds: number`
464
465 Duration of the input audio in seconds.
466
467 - `type: "duration"`
468
469 The type of the usage object. Always `duration` for this variant.
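
Because `usage` is either the `tokens` variant or the `duration` variant, a consumer typically branches on the `type` field. The sketch below assumes the response has already been parsed into a Python dictionary; `describe_usage` is a hypothetical helper name.

```python
from typing import Any, Dict, Optional


def describe_usage(usage: Optional[Dict[str, Any]]) -> str:
    """Summarize a `usage` object by dispatching on its `type` field."""
    if usage is None:
        return "no usage reported"  # the field is optional
    kind = usage.get("type")
    if kind == "tokens":
        # input_token_details is optional, so fall back to an empty mapping.
        details = usage.get("input_token_details") or {}
        audio = details.get("audio_tokens")
        extra = f" ({audio} audio)" if audio is not None else ""
        return (
            f"{usage['input_tokens']} input{extra} + "
            f"{usage['output_tokens']} output = {usage['total_tokens']} tokens"
        )
    if kind == "duration":
        return f"{usage['seconds']} seconds of audio"
    return f"unrecognized usage type: {kind!r}"
```

Handling an unrecognized `type` explicitly keeps the helper from raising if another variant is ever returned.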
470
471### Transcription Diarized
472
473- `transcription_diarized: object { duration, segments, task, 2 more }`
474
475 Represents a diarized transcription response returned by the model, including the combined transcript and speaker-segment annotations.
476
477 - `duration: number`
478
479 Duration of the input audio in seconds.
480
481 - `segments: array of TranscriptionDiarizedSegment`
482
483 Segments of the transcript annotated with timestamps and speaker labels.
484
485 - `id: string`
486
487 Unique identifier for the segment.
488
489 - `end: number`
490
491 End timestamp of the segment in seconds.
492
493 - `speaker: string`
494
495 Speaker label for this segment. When known speakers are provided, the label matches `known_speaker_names[]`. Otherwise speakers are labeled sequentially using capital letters (`A`, `B`, ...).
496
497 - `start: number`
498
499 Start timestamp of the segment in seconds.
500
501 - `text: string`
502
503 Transcript text for this segment.
504
505 - `type: "transcript.text.segment"`
506
507 The type of the segment. Always `transcript.text.segment`.
508
509 - `task: "transcribe"`
510
511 The type of task that was run. Always `transcribe`.
512
513 - `text: string`
514
515 The concatenated transcript text for the entire audio input.
516
517 - `usage: optional object { input_tokens, output_tokens, total_tokens, 2 more } or object { seconds, type }`
518
519 Token or duration usage statistics for the request.
520
521 - `tokens: object { input_tokens, output_tokens, total_tokens, 2 more }`
522
523 Usage statistics for models billed by token usage.
524
525 - `input_tokens: number`
526
527 Number of input tokens billed for this request.
528
529 - `output_tokens: number`
530
531 Number of output tokens generated.
532
533 - `total_tokens: number`
534
535 Total number of tokens used (input + output).
536
537 - `type: "tokens"`
538
539 The type of the usage object. Always `tokens` for this variant.
540
541 - `input_token_details: optional object { audio_tokens, text_tokens }`
542
543 Details about the input tokens billed for this request.
544
545 - `audio_tokens: optional number`
546
547 Number of audio tokens billed for this request.
548
549 - `text_tokens: optional number`
550
551 Number of text tokens billed for this request.
552
553 - `duration: object { seconds, type }`
554
555 Usage statistics for models billed by audio input duration.
556
557 - `seconds: number`
558
559 Duration of the input audio in seconds.
560
561 - `type: "duration"`
562
563 The type of the usage object. Always `duration` for this variant.
564
565### Transcription Diarized Segment
566
567- `transcription_diarized_segment: object { id, end, speaker, 3 more }`
568
569 A segment of diarized transcript text with speaker metadata.
570
571 - `id: string`
572
573 Unique identifier for the segment.
574
575 - `end: number`
576
577 End timestamp of the segment in seconds.
578
579 - `speaker: string`
580
581 Speaker label for this segment. When known speakers are provided, the label matches `known_speaker_names[]`. Otherwise speakers are labeled sequentially using capital letters (`A`, `B`, ...).
582
583 - `start: number`
584
585 Start timestamp of the segment in seconds.
586
587 - `text: string`
588
589 Transcript text for this segment.
590
591 - `type: "transcript.text.segment"`
592
593 The type of the segment. Always `transcript.text.segment`.
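
Diarized segments carry a speaker label plus start and end timestamps, which is enough to derive per-speaker talk time or a readable dialogue. The following is an illustrative sketch that assumes the segments are already parsed into dictionaries with the fields listed above; the function names are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List


def talk_time_by_speaker(segments: Iterable[Dict[str, Any]]) -> Dict[str, float]:
    """Sum (end - start) in seconds for each speaker label."""
    totals: Dict[str, float] = defaultdict(float)
    for seg in segments:
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    return dict(totals)


def as_dialogue(segments: Iterable[Dict[str, Any]]) -> List[str]:
    """Render segments in time order, merging consecutive turns by one speaker."""
    lines: List[str] = []
    last_speaker = None
    for seg in sorted(segments, key=lambda s: s["start"]):
        text = seg["text"].strip()
        if seg["speaker"] == last_speaker:
            lines[-1] += " " + text  # same speaker keeps talking
        else:
            lines.append(f"{seg['speaker']}: {text}")
            last_speaker = seg["speaker"]
    return lines
```

Sorting by `start` is a defensive choice for this sketch; the order in which the API returns segments is not stated here.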
594
595### Transcription Include
596
597- `transcription_include: "logprobs"`
598
599 - `"logprobs"`
600
601### Transcription Segment
602
603- `transcription_segment: object { id, avg_logprob, compression_ratio, 7 more }`
604
605 - `id: number`
606
607 Unique identifier of the segment.
608
609 - `avg_logprob: number`
610
611 Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.
612
613 - `compression_ratio: number`
614
615 Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.
616
617 - `end: number`
618
619 End time of the segment in seconds.
620
621 - `no_speech_prob: number`
622
623 Probability of no speech in the segment. If the value is higher than 1.0 and the `avg_logprob` is below -1, consider this segment silent.
624
625 - `seek: number`
626
627 Seek offset of the segment.
628
629 - `start: number`
630
631 Start time of the segment in seconds.
632
633 - `temperature: number`
634
635 Temperature parameter used for generating the segment.
636
637 - `text: string`
638
639 Text content of the segment.
640
641 - `tokens: array of number`
642
643 Array of token IDs for the text content.
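
The thresholds mentioned in the field descriptions above can be applied as a simple per-segment check. This is a minimal sketch: `flag_segment` and its flag names are hypothetical, and the defaults mirror the documented values (-1, 2.4, and 1.0). Since a probability cannot exceed 1.0, the `silent` flag never fires at the default threshold, so the sketch exposes it as a parameter that callers may lower at their own discretion.

```python
from typing import Any, Dict, List


def flag_segment(
    segment: Dict[str, Any],
    logprob_floor: float = -1.0,
    compression_ceiling: float = 2.4,
    no_speech_threshold: float = 1.0,
) -> List[str]:
    """Return quality flags for one segment, using the documented heuristics."""
    flags: List[str] = []
    low_confidence = segment["avg_logprob"] < logprob_floor
    if low_confidence:
        flags.append("low_confidence")
    if segment["compression_ratio"] > compression_ceiling:
        flags.append("high_compression")
    # "Silent" needs both conditions: likely no speech and a low average logprob.
    if low_confidence and segment["no_speech_prob"] > no_speech_threshold:
        flags.append("silent")
    return flags
```

A caller might drop or re-transcribe segments that come back with any flag set.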
644
645### Transcription Stream Event
646
647- `transcription_stream_event: TranscriptionTextSegmentEvent or TranscriptionTextDeltaEvent or TranscriptionTextDoneEvent`
648
649 A streamed transcription event: a completed diarized segment, an incremental text delta, or the final done event. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with `stream` set to `true`.
650
651 - `transcription_text_segment_event: object { id, end, speaker, 3 more }`
652
653 Emitted when a diarized transcription returns a completed segment with speaker information. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with `stream` set to `true` and `response_format` set to `diarized_json`.
654
655 - `id: string`
656
657 Unique identifier for the segment.
658
659 - `end: number`
660
661 End timestamp of the segment in seconds.
662
663 - `speaker: string`
664
665 Speaker label for this segment.
666
667 - `start: number`
668
669 Start timestamp of the segment in seconds.
670
671 - `text: string`
672
673 Transcript text for this segment.
674
675 - `type: "transcript.text.segment"`
676
677 The type of the event. Always `transcript.text.segment`.
678
679 - `transcription_text_delta_event: object { delta, type, logprobs, segment_id }`
680
681 Emitted when there is an additional text delta. This is also the first event emitted when the transcription starts. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `stream` parameter set to `true`.
682
683 - `delta: string`
684
685 The text delta that was additionally transcribed.
686
687 - `type: "transcript.text.delta"`
688
689 The type of the event. Always `transcript.text.delta`.
690
691 - `logprobs: optional array of object { token, bytes, logprob }`
692
693 The log probabilities of the delta. Only included if you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `include[]` parameter set to `logprobs`.
694
695 - `token: optional string`
696
697 The token that was used to generate the log probability.
698
699 - `bytes: optional array of number`
700
701 The bytes that were used to generate the log probability.
702
703 - `logprob: optional number`
704
705 The log probability of the token.
706
707 - `segment_id: optional string`
708
709 Identifier of the diarized segment that this delta belongs to. Only present when using `gpt-4o-transcribe-diarize`.
710
711 - `transcription_text_done_event: object { text, type, logprobs, usage }`
712
713 Emitted when the transcription is complete. Contains the complete transcription text. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `stream` parameter set to `true`.
714
715 - `text: string`
716
717 The text that was transcribed.
718
719 - `type: "transcript.text.done"`
720
721 The type of the event. Always `transcript.text.done`.
722
723 - `logprobs: optional array of object { token, bytes, logprob }`
724
725 The log probabilities of the individual tokens in the transcription. Only included if you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `include[]` parameter set to `logprobs`.
726
727 - `token: optional string`
728
729 The token that was used to generate the log probability.
730
731 - `bytes: optional array of number`
732
733 The bytes that were used to generate the log probability.
734
735 - `logprob: optional number`
736
737 The log probability of the token.
738
739 - `usage: optional object { input_tokens, output_tokens, total_tokens, 2 more }`
740
741 Usage statistics for models billed by token usage.
742
743 - `input_tokens: number`
744
745 Number of input tokens billed for this request.
746
747 - `output_tokens: number`
748
749 Number of output tokens generated.
750
751 - `total_tokens: number`
752
753 Total number of tokens used (input + output).
754
755 - `type: "tokens"`
756
757 The type of the usage object. Always `tokens` for this variant.
758
759 - `input_token_details: optional object { audio_tokens, text_tokens }`
760
761 Details about the input tokens billed for this request.
762
763 - `audio_tokens: optional number`
764
765 Number of audio tokens billed for this request.
766
767 - `text_tokens: optional number`
768
769 Number of text tokens billed for this request.
770
771### Transcription Text Delta Event
772
773- `transcription_text_delta_event: object { delta, type, logprobs, segment_id }`
774
775 Emitted when there is an additional text delta. This is also the first event emitted when the transcription starts. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `stream` parameter set to `true`.
776
777 - `delta: string`
778
779 The text delta that was additionally transcribed.
780
781 - `type: "transcript.text.delta"`
782
783 The type of the event. Always `transcript.text.delta`.
784
785 - `logprobs: optional array of object { token, bytes, logprob }`
786
787 The log probabilities of the delta. Only included if you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `include[]` parameter set to `logprobs`.
788
789 - `token: optional string`
790
791 The token that was used to generate the log probability.
792
793 - `bytes: optional array of number`
794
795 The bytes that were used to generate the log probability.
796
797 - `logprob: optional number`
798
799 The log probability of the token.
800
801 - `segment_id: optional string`
802
803 Identifier of the diarized segment that this delta belongs to. Only present when using `gpt-4o-transcribe-diarize`.
804
805### Transcription Text Done Event
806
807- `transcription_text_done_event: object { text, type, logprobs, usage }`
808
809 Emitted when the transcription is complete. Contains the complete transcription text. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `stream` parameter set to `true`.
810
811 - `text: string`
812
813 The text that was transcribed.
814
815 - `type: "transcript.text.done"`
816
817 The type of the event. Always `transcript.text.done`.
818
819 - `logprobs: optional array of object { token, bytes, logprob }`
820
821 The log probabilities of the individual tokens in the transcription. Only included if you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `include[]` parameter set to `logprobs`.
822
823 - `token: optional string`
824
825 The token that was used to generate the log probability.
826
827 - `bytes: optional array of number`
828
829 The bytes that were used to generate the log probability.
830
831 - `logprob: optional number`
832
833 The log probability of the token.
834
835 - `usage: optional object { input_tokens, output_tokens, total_tokens, 2 more }`
836
837 Usage statistics for models billed by token usage.
838
839 - `input_tokens: number`
840
841 Number of input tokens billed for this request.
842
843 - `output_tokens: number`
844
845 Number of output tokens generated.
846
847 - `total_tokens: number`
848
849 Total number of tokens used (input + output).
850
851 - `type: "tokens"`
852
853 The type of the usage object. Always `tokens` for this variant.
854
855 - `input_token_details: optional object { audio_tokens, text_tokens }`
856
857 Details about the input tokens billed for this request.
858
859 - `audio_tokens: optional number`
860
861 Number of audio tokens billed for this request.
862
863 - `text_tokens: optional number`
864
865 Number of text tokens billed for this request.
866
867### Transcription Text Segment Event
868
869- `transcription_text_segment_event: object { id, end, speaker, 3 more }`
870
871 Emitted when a diarized transcription returns a completed segment with speaker information. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with `stream` set to `true` and `response_format` set to `diarized_json`.
872
873 - `id: string`
874
875 Unique identifier for the segment.
876
877 - `end: number`
878
879 End timestamp of the segment in seconds.
880
881 - `speaker: string`
882
883 Speaker label for this segment.
884
885 - `start: number`
886
887 Start timestamp of the segment in seconds.
888
889 - `text: string`
890
891 Transcript text for this segment.
892
893 - `type: "transcript.text.segment"`
894
895 The type of the event. Always `transcript.text.segment`.
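
The three event types above can be folded into a final result by dispatching on `type`. The sketch below assumes the events have already been decoded into dictionaries; how they are read off the wire is outside its scope, and `collect_stream` is a hypothetical name.

```python
from typing import Any, Dict, Iterable, List, Tuple


def collect_stream(events: Iterable[Dict[str, Any]]) -> Tuple[str, List[Dict[str, Any]]]:
    """Fold stream events into (transcript text, completed diarized segments)."""
    pieces: List[str] = []
    segments: List[Dict[str, Any]] = []
    final_text = None
    for event in events:
        kind = event.get("type")
        if kind == "transcript.text.delta":
            pieces.append(event["delta"])
        elif kind == "transcript.text.segment":
            segments.append(event)
        elif kind == "transcript.text.done":
            final_text = event["text"]
        # Unknown event types are ignored so newer events do not break the loop.
    # Prefer the text on the done event; fall back to the deltas seen so far.
    text = final_text if final_text is not None else "".join(pieces)
    return text, segments
```

Falling back to the joined deltas means a stream that is cut off before `transcript.text.done` still yields the partial transcript.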
896
897### Transcription Verbose
898
899- `transcription_verbose: object { duration, language, text, 3 more }`
900
901 Represents a `verbose_json` transcription response returned by the model, based on the provided input.
902
903 - `duration: number`
904
905 The duration of the input audio.
906
907 - `language: string`
908
909 The language of the input audio.
910
911 - `text: string`
912
913 The transcribed text.
914
915 - `segments: optional array of TranscriptionSegment`
916
917 Segments of the transcribed text and their corresponding details.
918
919 - `id: number`
920
921 Unique identifier of the segment.
922
923 - `avg_logprob: number`
924
925 Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.
926
927 - `compression_ratio: number`
928
929 Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.
930
931 - `end: number`
932
933 End time of the segment in seconds.
934
935 - `no_speech_prob: number`
936
937 Probability of no speech in the segment. If the value is higher than 1.0 and the `avg_logprob` is below -1, consider this segment silent.
938
939 - `seek: number`
940
941 Seek offset of the segment.
942
943 - `start: number`
944
945 Start time of the segment in seconds.
946
947 - `temperature: number`
948
949 Temperature parameter used for generating the segment.
950
951 - `text: string`
952
953 Text content of the segment.
954
955 - `tokens: array of number`
956
957 Array of token IDs for the text content.
958
959 - `usage: optional object { seconds, type }`
960
961 Usage statistics for models billed by audio input duration.
962
963 - `seconds: number`
964
965 Duration of the input audio in seconds.
966
967 - `type: "duration"`
968
969 The type of the usage object. Always `duration` for this variant.
970
971 - `words: optional array of TranscriptionWord`
972
973 Extracted words and their corresponding timestamps.
974
975 - `end: number`
976
977 End time of the word in seconds.
978
979 - `start: number`
980
981 Start time of the word in seconds.
982
983 - `word: string`
984
985 The text content of the word.
986
987### Transcription Word
988
989- `transcription_word: object { end, start, word }`
990
991 - `end: number`
992
993 End time of the word in seconds.
994
995 - `start: number`
996
997 Start time of the word in seconds.
998
999 - `word: string`
1000
1001 The text content of the word.
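
Word-level timestamps make it possible to build captions with a custom cue size. As a hedged illustration, the sketch below groups words into fixed-size cues and prints them with SubRip-style `HH:MM:SS,mmm` timestamps. The helper names and the `words_per_cue` parameter are hypothetical, and the input is assumed to be a parsed list of `TranscriptionWord` objects.

```python
from typing import Any, Dict, List


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as HH:MM:SS,mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def words_to_srt(words: List[Dict[str, Any]], words_per_cue: int = 7) -> str:
    """Group words into numbered cues of at most `words_per_cue` words each."""
    cues: List[str] = []
    for index in range(0, len(words), words_per_cue):
        chunk = words[index:index + words_per_cue]
        start = srt_timestamp(chunk[0]["start"])
        end = srt_timestamp(chunk[-1]["end"])
        text = " ".join(w["word"] for w in chunk)
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    # A blank line separates consecutive cues.
    return "\n".join(cues)
```

When the default cue layout is acceptable, requesting the `srt` response format directly avoids this post-processing.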
1002
1003# Translations
1004
1005## Create translation
1006
1007`$ openai audio:translations create`
1008
1009**post** `/audio/translations`
1010
1011Translates audio into English.
1012
1013### Parameters
1014
1015- `--file: string`
1016
1017 The audio file object (not file name) to translate, in one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm.
1018
1019- `--model: string or AudioModel`
1020
1021 ID of the model to use. Only `whisper-1` (which is powered by our open source Whisper V2 model) is currently available.
1022
1023- `--prompt: optional string`
1024
1025 An optional text to guide the model's style or continue a previous audio segment. The [prompt](https://platform.openai.com/docs/guides/speech-to-text#prompting) should be in English.
1026
1027- `--response-format: optional "json" or "text" or "srt" or 2 more`
1028
1029 The format of the output, in one of these options: `json`, `text`, `srt`, `verbose_json`, or `vtt`.
1030
1031- `--temperature: optional number`
1032
1033 The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use [log probability](https://en.wikipedia.org/wiki/Log_probability) to automatically increase the temperature until certain thresholds are hit.
1034
1035### Returns
1036
1037- `unnamed_schema_1: Translation or TranslationVerbose`
1038
1039 - `translation: object { text }`
1040
1041 - `text: string`
1042
1043 - `translation_verbose: object { duration, language, text, segments }`
1044
1045 - `duration: number`
1046
1047 The duration of the input audio.
1048
1049 - `language: string`
1050
1051 The language of the output translation (always `english`).
1052
1053 - `text: string`
1054
1055 The translated text.
1056
1057 - `segments: optional array of TranscriptionSegment`
1058
1059 Segments of the translated text and their corresponding details.
1060
1061 - `id: number`
1062
1063 Unique identifier of the segment.
1064
1065 - `avg_logprob: number`
1066
1067 Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.
1068
1069 - `compression_ratio: number`
1070
1071 Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.
1072
1073 - `end: number`
1074
1075 End time of the segment in seconds.
1076
1077 - `no_speech_prob: number`
1078
1079 Probability of no speech in the segment. If the value is higher than 1.0 and the `avg_logprob` is below -1, consider this segment silent.
1080
1081 - `seek: number`
1082
1083 Seek offset of the segment.
1084
1085 - `start: number`
1086
1087 Start time of the segment in seconds.
1088
1089 - `temperature: number`
1090
1091 Temperature parameter used for generating the segment.
1092
1093 - `text: string`
1094
1095 Text content of the segment.
1096
1097 - `tokens: array of number`
1098
1099 Array of token IDs for the text content.
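
The thresholds quoted for `avg_logprob` and `compression_ratio` lend themselves to a simple quality check over the returned segments. The sketch below is one possible way to apply them; the function name `flag_segment` and the sample segments are hypothetical, and the segments are abridged to the fields used here.

```python
from typing import Any, Dict, List

# Thresholds quoted from the field descriptions above.
MIN_AVG_LOGPROB = -1.0
MAX_COMPRESSION_RATIO = 2.4


def flag_segment(segment: Dict[str, Any]) -> List[str]:
    """Return the documented failure conditions that a segment meets."""
    flags = []
    if segment["avg_logprob"] < MIN_AVG_LOGPROB:
        flags.append("logprobs failed")
    if segment["compression_ratio"] > MAX_COMPRESSION_RATIO:
        flags.append("compression failed")
    return flags


# Hypothetical, abridged segments (only the fields used here are filled in).
segments = [
    {"id": 0, "avg_logprob": -0.25, "compression_ratio": 1.5, "text": " Hello."},
    {"id": 1, "avg_logprob": -1.5, "compression_ratio": 3.0, "text": " la la la la"},
]

for seg in segments:
    print(seg["id"], flag_segment(seg))
```

Returning the list of flags, rather than a single boolean, leaves it to the caller to decide whether a flagged segment should be dropped or only reviewed.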
1100
1101### Example
1102
```cli
openai audio:translations create \
  --api-key 'My API Key' \
  --file 'Example data' \
  --model whisper-1
```
1109
1110#### Response
1111
```json
{
  "text": "text"
}
```
1117
1118## Domain Types
1119
1120### Translation
1121
1122- `translation: object { text }`
1123
1124 - `text: string`
1125
1126### Translation Verbose
1127
1128- `translation_verbose: object { duration, language, text, segments }`
1129
1130 - `duration: number`
1131
1132 The duration of the input audio.
1133
1134 - `language: string`
1135
1136 The language of the output translation (always `english`).
1137
1138 - `text: string`
1139
1140 The translated text.
1141
1142 - `segments: optional array of TranscriptionSegment`
1143
1144 Segments of the translated text and their corresponding details.
1145
1146 - `id: number`
1147
1148 Unique identifier of the segment.
1149
1150 - `avg_logprob: number`
1151
1152 Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.
1153
1154 - `compression_ratio: number`
1155
1156 Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.
1157
1158 - `end: number`
1159
1160 End time of the segment in seconds.
1161
1162 - `no_speech_prob: number`
1163
1164 Probability of no speech in the segment. If the value is higher than 1.0 and the `avg_logprob` is below -1, consider this segment silent.
1165
1166 - `seek: number`
1167
1168 Seek offset of the segment.
1169
1170 - `start: number`
1171
1172 Start time of the segment in seconds.
1173
1174 - `temperature: number`
1175
1176 Temperature parameter used for generating the segment.
1177
1178 - `text: string`
1179
1180 Text content of the segment.
1181
1182 - `tokens: array of number`
1183
1184 Array of token IDs for the text content.
1185
1186# Speech
1187
1188## Create speech
1189
1190`$ openai audio:speech create`
1191
1192**post** `/audio/speech`
1193
1194Generates audio from the input text.
1195
1196Returns the audio file content, or a stream of audio events.
1197
1198### Parameters
1199
1200- `--input: string`
1201
1202 The text to generate audio for. The maximum length is 4096 characters.
1203
1204- `--model: string or SpeechModel`
1205
1206 One of the available [TTS models](https://platform.openai.com/docs/models#tts): `tts-1`, `tts-1-hd`, `gpt-4o-mini-tts`, or `gpt-4o-mini-tts-2025-12-15`.
1207
1208- `--voice: string or "alloy" or "ash" or "ballad" or 7 more or object { id }`
1209
1210 The voice to use when generating the audio. Supported built-in voices are `alloy`, `ash`, `ballad`, `coral`, `echo`, `fable`, `onyx`, `nova`, `sage`, `shimmer`, `verse`, `marin`, and `cedar`. You may also provide a custom voice object with an `id`, for example `{ "id": "voice_1234" }`. Previews of the voices are available in the [Text to speech guide](https://platform.openai.com/docs/guides/text-to-speech#voice-options).
1211
- `--instructions: optional string`

  Additional instructions that control the voice of the generated audio. Not supported with `tts-1` or `tts-1-hd`.
1215
- `--response-format: optional "mp3" or "opus" or "aac" or 3 more`

  The format of the generated audio. Supported formats are `mp3`, `opus`, `aac`, `flac`, `wav`, and `pcm`.
1219
1220- `--speed: optional number`
1221
1222 The speed of the generated audio. Select a value from `0.25` to `4.0`. `1.0` is the default.
1223
1224- `--stream-format: optional "sse" or "audio"`
1225
1226 The format to stream the audio in. Supported formats are `sse` and `audio`. `sse` is not supported for `tts-1` or `tts-1-hd`.
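
Several of the parameters above interact with the chosen model, so it can help to check a request before sending it. The sketch below encodes only the limits stated in this section; the function name `check_speech_args` and the constant names are hypothetical and not part of the CLI.

```python
from typing import List, Optional

# Limits taken from the parameter descriptions above.
MAX_INPUT_CHARS = 4096
SPEECH_RESPONSE_FORMATS = {"mp3", "opus", "aac", "flac", "wav", "pcm"}
STREAM_FORMATS = {"sse", "audio"}
# Models for which `--instructions` and the `sse` stream format are not supported.
RESTRICTED_MODELS = {"tts-1", "tts-1-hd"}


def check_speech_args(
    input_text: str,
    model: str,
    response_format: Optional[str] = None,
    speed: Optional[float] = None,
    instructions: Optional[str] = None,
    stream_format: Optional[str] = None,
) -> List[str]:
    """Return a list of problems; an empty list means the arguments look plausible."""
    problems = []
    if len(input_text) > MAX_INPUT_CHARS:
        problems.append(f"input is longer than {MAX_INPUT_CHARS} characters")
    if response_format is not None and response_format not in SPEECH_RESPONSE_FORMATS:
        problems.append(f"unsupported response format: {response_format}")
    if speed is not None and not 0.25 <= speed <= 4.0:
        problems.append(f"speed out of range: {speed}")
    if instructions is not None and model in RESTRICTED_MODELS:
        problems.append(f"instructions are not supported for {model}")
    if stream_format is not None and stream_format not in STREAM_FORMATS:
        problems.append(f"unsupported stream format: {stream_format}")
    elif stream_format == "sse" and model in RESTRICTED_MODELS:
        problems.append(f"sse streaming is not supported for {model}")
    return problems


print(check_speech_args("Hello there.", "tts-1", instructions="Speak calmly."))
```

The voice is deliberately not validated here, because `--voice` also accepts arbitrary strings and custom voice objects.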
1227
1228### Returns
1229
1230- `unnamed_schema_2: file path`
1231
1232### Example
1233
```cli
openai audio:speech create \
  --api-key 'My API Key' \
  --input input \
  --model tts-1 \
  --voice string
```
1241
1242## Domain Types
1243
1244### Speech Model
1245
1246- `speech_model: "tts-1" or "tts-1-hd" or "gpt-4o-mini-tts" or "gpt-4o-mini-tts-2025-12-15"`
1247
1248 - `"tts-1"`
1249
1250 - `"tts-1-hd"`
1251
1252 - `"gpt-4o-mini-tts"`
1253
1254 - `"gpt-4o-mini-tts-2025-12-15"`
1255
1256# Voices
1257
1258# Voice Consents