python/resources/audio/index.md +0 −2213 deleted
# Audio

## Domain Types

### Audio Model
6
7- `Literal["whisper-1", "gpt-4o-transcribe", "gpt-4o-mini-transcribe", 2 more]`
8
9 - `"whisper-1"`
10
11 - `"gpt-4o-transcribe"`
12
13 - `"gpt-4o-mini-transcribe"`
14
15 - `"gpt-4o-mini-transcribe-2025-12-15"`
16
17 - `"gpt-4o-transcribe-diarize"`
18
### Audio Response Format
20
21- `Literal["json", "text", "srt", 3 more]`
22
23 The format of the output, in one of these options: `json`, `text`, `srt`, `verbose_json`, `vtt`, or `diarized_json`. For `gpt-4o-transcribe` and `gpt-4o-mini-transcribe`, the only supported format is `json`. For `gpt-4o-transcribe-diarize`, the supported formats are `json`, `text`, and `diarized_json`, with `diarized_json` required to receive speaker annotations.
24
25 - `"json"`
26
27 - `"text"`
28
29 - `"srt"`
30
31 - `"verbose_json"`
32
33 - `"vtt"`
34
35 - `"diarized_json"`
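The format restrictions above can be checked on the client before a request is sent. The sketch below is only an illustration: `SUPPORTED_FORMATS` and `is_format_supported` are hypothetical names, not part of the SDK, and the table records only the constraints stated in this section. For models this section does not constrain (for example `whisper-1`), the helper returns `None`, meaning "not stated here".

```python
from typing import Optional

# Hypothetical lookup table; it records only the restrictions stated above.
SUPPORTED_FORMATS = {
    "gpt-4o-transcribe": {"json"},
    "gpt-4o-mini-transcribe": {"json"},
    "gpt-4o-transcribe-diarize": {"json", "text", "diarized_json"},
}


def is_format_supported(model: str, response_format: str) -> Optional[bool]:
    """Return True or False when this section states a rule, otherwise None."""
    allowed = SUPPORTED_FORMATS.get(model)
    if allowed is None:
        return None  # no restriction documented in this section
    return response_format in allowed


print(is_format_supported("gpt-4o-transcribe", "srt"))
print(is_format_supported("gpt-4o-transcribe-diarize", "diarized_json"))
```

Returning `None` rather than `True` for unlisted models keeps the helper from claiming more than the text says.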
36
# Transcriptions

## Create transcription

`audio.transcriptions.create(TranscriptionCreateParams**kwargs) -> TranscriptionCreateResponse`

**post** `/audio/transcriptions`

Transcribes audio into the input language.

Returns a transcription object in `json`, `diarized_json`, or `verbose_json`
format, or a stream of transcript events.

### Parameters
51
52- `file: FileTypes`
53
54 The audio file object (not file name) to transcribe, in one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm.
55
56- `model: Union[str, AudioModel]`
57
58 ID of the model to use. The options are `gpt-4o-transcribe`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `whisper-1` (which is powered by our open source Whisper V2 model), and `gpt-4o-transcribe-diarize`.
59
60 - `str`
61
62 - `Literal["whisper-1", "gpt-4o-transcribe", "gpt-4o-mini-transcribe", 2 more]`
63
64 - `"whisper-1"`
65
66 - `"gpt-4o-transcribe"`
67
68 - `"gpt-4o-mini-transcribe"`
69
70 - `"gpt-4o-mini-transcribe-2025-12-15"`
71
72 - `"gpt-4o-transcribe-diarize"`
73
74- `chunking_strategy: Optional[ChunkingStrategy]`
75
  Controls how the audio is cut into chunks. When set to `"auto"`, the server first normalizes loudness and then uses voice activity detection (VAD) to choose boundaries. A `server_vad` object can be provided instead to tune the VAD parameters manually. If unset, the audio is transcribed as a single block. Required when using `gpt-4o-transcribe-diarize` for inputs longer than 30 seconds.
77
78 - `Literal["auto"]`
79
80 Automatically set chunking parameters based on the audio. Must be set to `"auto"`.
81
82 - `"auto"`
83
84 - `class ChunkingStrategyVadConfig: …`
85
86 - `type: Literal["server_vad"]`
87
88 Must be set to `server_vad` to enable manual chunking using server side VAD.
89
90 - `"server_vad"`
91
92 - `prefix_padding_ms: Optional[int]`
93
94 Amount of audio to include before the VAD detected speech (in
95 milliseconds).
96
97 - `silence_duration_ms: Optional[int]`
98
99 Duration of silence to detect speech stop (in milliseconds).
100 With shorter values the model will respond more quickly,
101 but may jump in on short pauses from the user.
102
103 - `threshold: Optional[float]`
104
105 Sensitivity threshold (0.0 to 1.0) for voice activity detection. A
106 higher threshold will require louder audio to activate the model, and
107 thus might perform better in noisy environments.
108
109- `include: Optional[List[TranscriptionInclude]]`
110
  Additional information to include in the transcription response.
  `logprobs` returns the log probabilities of the tokens in the
  response, which indicate the model's confidence in the transcription.
  `logprobs` only works with `response_format` set to `json` and only with
  the models `gpt-4o-transcribe`, `gpt-4o-mini-transcribe`, and `gpt-4o-mini-transcribe-2025-12-15`. This field is not supported when using `gpt-4o-transcribe-diarize`.
116
117 - `"logprobs"`
118
119- `known_speaker_names: Optional[Sequence[str]]`
120
121 Optional list of speaker names that correspond to the audio samples provided in `known_speaker_references[]`. Each entry should be a short identifier (for example `customer` or `agent`). Up to 4 speakers are supported.
122
123- `known_speaker_references: Optional[Sequence[str]]`
124
125 Optional list of audio samples (as [data URLs](https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/Data_URLs)) that contain known speaker references matching `known_speaker_names[]`. Each sample must be between 2 and 10 seconds, and can use any of the same input audio formats supported by `file`.
126
127- `language: Optional[str]`
128
129 The language of the input audio. Supplying the input language in [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) (e.g. `en`) format will improve accuracy and latency.
130
131- `prompt: Optional[str]`
132
133 An optional text to guide the model's style or continue a previous audio segment. The [prompt](https://platform.openai.com/docs/guides/speech-to-text#prompting) should match the audio language. This field is not supported when using `gpt-4o-transcribe-diarize`.
134
135- `response_format: Optional[AudioResponseFormat]`
136
137 The format of the output, in one of these options: `json`, `text`, `srt`, `verbose_json`, `vtt`, or `diarized_json`. For `gpt-4o-transcribe` and `gpt-4o-mini-transcribe`, the only supported format is `json`. For `gpt-4o-transcribe-diarize`, the supported formats are `json`, `text`, and `diarized_json`, with `diarized_json` required to receive speaker annotations.
138
139 - `"json"`
140
141 - `"text"`
142
143 - `"srt"`
144
145 - `"verbose_json"`
146
147 - `"vtt"`
148
149 - `"diarized_json"`
150
- `stream: Optional[Literal[False]]`

  If set to `True`, the model response data will be streamed to the client
  as it is generated using [server-sent events](https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events/Using_server-sent_events#Event_stream_format).
  See the [Streaming section of the Speech-to-Text guide](https://platform.openai.com/docs/guides/speech-to-text?lang=curl#streaming-transcriptions)
  for more information.

  Note: Streaming is not supported for the `whisper-1` model and will be ignored.

  - `False`
161
162- `temperature: Optional[float]`
163
164 The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use [log probability](https://en.wikipedia.org/wiki/Log_probability) to automatically increase the temperature until certain thresholds are hit.
165
166- `timestamp_granularities: Optional[List[Literal["word", "segment"]]]`
167
  The timestamp granularities to populate for this transcription. `response_format` must be set to `verbose_json` to use timestamp granularities. Either or both of these options are supported: `word` and `segment`. Note: There is no additional latency for segment timestamps, but generating word timestamps incurs additional latency.
  This option is not available for `gpt-4o-transcribe-diarize`.
170
171 - `"word"`
172
173 - `"segment"`
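As a rough illustration of the two `chunking_strategy` shapes described above, the sketch below returns `"auto"` when no VAD setting is customized and a `server_vad` object otherwise. `build_chunking_strategy` is a hypothetical helper, not an SDK function. The field names come from the parameter list, and the range check on `threshold` reflects the documented 0.0 to 1.0 range.

```python
from typing import Any, Dict, Optional, Union


def build_chunking_strategy(
    prefix_padding_ms: Optional[int] = None,
    silence_duration_ms: Optional[int] = None,
    threshold: Optional[float] = None,
) -> Union[str, Dict[str, Any]]:
    """Return "auto" when nothing is customized, else a server_vad config dict."""
    if threshold is not None and not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")

    overrides = {
        "prefix_padding_ms": prefix_padding_ms,
        "silence_duration_ms": silence_duration_ms,
        "threshold": threshold,
    }
    # Keep only the settings the caller actually supplied.
    overrides = {key: value for key, value in overrides.items() if value is not None}
    if not overrides:
        return "auto"
    return {"type": "server_vad", **overrides}


print(build_chunking_strategy())
print(build_chunking_strategy(silence_duration_ms=400, threshold=0.6))
```

The returned value would be passed as the `chunking_strategy` argument of `client.audio.transcriptions.create(...)`.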
174
### Returns

- `TranscriptionCreateResponse`

  Represents a transcription response returned by the model, based on the provided input.

  - `class Transcription: …`

    Represents a transcription response returned by the model, based on the provided input.
184
185 - `text: str`
186
187 The transcribed text.
188
189 - `logprobs: Optional[List[Logprob]]`
190
191 The log probabilities of the tokens in the transcription. Only returned with the models `gpt-4o-transcribe` and `gpt-4o-mini-transcribe` if `logprobs` is added to the `include` array.
192
193 - `token: Optional[str]`
194
195 The token in the transcription.
196
197 - `bytes: Optional[List[float]]`
198
199 The bytes of the token.
200
201 - `logprob: Optional[float]`
202
203 The log probability of the token.
204
205 - `usage: Optional[Usage]`
206
207 Token usage statistics for the request.
208
209 - `class UsageTokens: …`
210
211 Usage statistics for models billed by token usage.
212
213 - `input_tokens: int`
214
215 Number of input tokens billed for this request.
216
217 - `output_tokens: int`
218
219 Number of output tokens generated.
220
221 - `total_tokens: int`
222
223 Total number of tokens used (input + output).
224
225 - `type: Literal["tokens"]`
226
227 The type of the usage object. Always `tokens` for this variant.
228
229 - `"tokens"`
230
231 - `input_token_details: Optional[UsageTokensInputTokenDetails]`
232
233 Details about the input tokens billed for this request.
234
235 - `audio_tokens: Optional[int]`
236
237 Number of audio tokens billed for this request.
238
239 - `text_tokens: Optional[int]`
240
241 Number of text tokens billed for this request.
242
243 - `class UsageDuration: …`
244
245 Usage statistics for models billed by audio input duration.
246
247 - `seconds: float`
248
249 Duration of the input audio in seconds.
250
251 - `type: Literal["duration"]`
252
253 The type of the usage object. Always `duration` for this variant.
254
255 - `"duration"`
256
257 - `class TranscriptionDiarized: …`
258
259 Represents a diarized transcription response returned by the model, including the combined transcript and speaker-segment annotations.
260
261 - `duration: float`
262
263 Duration of the input audio in seconds.
264
265 - `segments: List[TranscriptionDiarizedSegment]`
266
267 Segments of the transcript annotated with timestamps and speaker labels.
268
269 - `id: str`
270
271 Unique identifier for the segment.
272
273 - `end: float`
274
275 End timestamp of the segment in seconds.
276
277 - `speaker: str`
278
279 Speaker label for this segment. When known speakers are provided, the label matches `known_speaker_names[]`. Otherwise speakers are labeled sequentially using capital letters (`A`, `B`, ...).
280
281 - `start: float`
282
283 Start timestamp of the segment in seconds.
284
285 - `text: str`
286
287 Transcript text for this segment.
288
289 - `type: Literal["transcript.text.segment"]`
290
291 The type of the segment. Always `transcript.text.segment`.
292
293 - `"transcript.text.segment"`
294
295 - `task: Literal["transcribe"]`
296
297 The type of task that was run. Always `transcribe`.
298
299 - `"transcribe"`
300
301 - `text: str`
302
303 The concatenated transcript text for the entire audio input.
304
305 - `usage: Optional[Usage]`
306
307 Token or duration usage statistics for the request.
308
309 - `class UsageTokens: …`
310
311 Usage statistics for models billed by token usage.
312
313 - `input_tokens: int`
314
315 Number of input tokens billed for this request.
316
317 - `output_tokens: int`
318
319 Number of output tokens generated.
320
321 - `total_tokens: int`
322
323 Total number of tokens used (input + output).
324
325 - `type: Literal["tokens"]`
326
327 The type of the usage object. Always `tokens` for this variant.
328
329 - `"tokens"`
330
331 - `input_token_details: Optional[UsageTokensInputTokenDetails]`
332
333 Details about the input tokens billed for this request.
334
335 - `audio_tokens: Optional[int]`
336
337 Number of audio tokens billed for this request.
338
339 - `text_tokens: Optional[int]`
340
341 Number of text tokens billed for this request.
342
343 - `class UsageDuration: …`
344
345 Usage statistics for models billed by audio input duration.
346
347 - `seconds: float`
348
349 Duration of the input audio in seconds.
350
351 - `type: Literal["duration"]`
352
353 The type of the usage object. Always `duration` for this variant.
354
355 - `"duration"`
356
357 - `class TranscriptionVerbose: …`
358
    Represents a verbose JSON transcription response returned by the model, based on the provided input.
360
361 - `duration: float`
362
363 The duration of the input audio.
364
365 - `language: str`
366
367 The language of the input audio.
368
369 - `text: str`
370
371 The transcribed text.
372
373 - `segments: Optional[List[TranscriptionSegment]]`
374
375 Segments of the transcribed text and their corresponding details.
376
377 - `id: int`
378
379 Unique identifier of the segment.
380
381 - `avg_logprob: float`
382
383 Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.
384
385 - `compression_ratio: float`
386
387 Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.
388
389 - `end: float`
390
391 End time of the segment in seconds.
392
393 - `no_speech_prob: float`
394
395 Probability of no speech in the segment. If the value is higher than 1.0 and the `avg_logprob` is below -1, consider this segment silent.
396
397 - `seek: int`
398
399 Seek offset of the segment.
400
401 - `start: float`
402
403 Start time of the segment in seconds.
404
405 - `temperature: float`
406
407 Temperature parameter used for generating the segment.
408
409 - `text: str`
410
411 Text content of the segment.
412
413 - `tokens: List[int]`
414
415 Array of token IDs for the text content.
416
417 - `usage: Optional[Usage]`
418
419 Usage statistics for models billed by audio input duration.
420
421 - `seconds: float`
422
423 Duration of the input audio in seconds.
424
425 - `type: Literal["duration"]`
426
427 The type of the usage object. Always `duration` for this variant.
428
429 - `"duration"`
430
431 - `words: Optional[List[TranscriptionWord]]`
432
433 Extracted words and their corresponding timestamps.
434
435 - `end: float`
436
437 End time of the word in seconds.
438
439 - `start: float`
440
441 Start time of the word in seconds.
442
443 - `word: str`
444
445 The text content of the word.
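The `usage` field is a union of two shapes distinguished by `type` (`tokens` or `duration`). A minimal sketch of branching on that discriminator follows. It works on plain dictionaries shaped like the JSON responses in this document, and `describe_usage` is a hypothetical helper name.

```python
from typing import Any, Dict


def describe_usage(usage: Dict[str, Any]) -> str:
    """Summarize a usage payload by branching on its `type` discriminator."""
    if usage["type"] == "tokens":
        # input_token_details is optional, so fall back to an empty dict.
        details = usage.get("input_token_details") or {}
        audio = details.get("audio_tokens", 0)
        return (
            f"{usage['total_tokens']} tokens "
            f"({usage['input_tokens']} in, {usage['output_tokens']} out, {audio} audio)"
        )
    if usage["type"] == "duration":
        return f"{usage['seconds']} seconds of audio"
    raise ValueError(f"unknown usage type: {usage['type']!r}")


print(describe_usage({"type": "duration", "seconds": 27}))
```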
446
### Example

```python
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY"),  # This is the default and can be omitted
)

# Without stream=True the call returns a single transcription object, not an iterator.
transcription = client.audio.transcriptions.create(
    file=b"Example data",
    model="gpt-4o-transcribe",
)
print(transcription)
```
462
#### Response

```json
{
  "text": "text",
  "logprobs": [
    {
      "token": "token",
      "bytes": [
        0
      ],
      "logprob": 0
    }
  ],
  "usage": {
    "input_tokens": 0,
    "output_tokens": 0,
    "total_tokens": 0,
    "type": "tokens",
    "input_token_details": {
      "audio_tokens": 0,
      "text_tokens": 0
    }
  }
}
```
489
### Example

```python
from openai import OpenAI

client = OpenAI()

# Open the file in binary mode; the context manager closes it after the request.
with open("speech.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
    )
```
502
#### Response

```json
{
  "text": "Imagine the wildest idea that you've ever had, and you're curious about how it might scale to something that's a 100, a 1,000 times bigger. This is a place where you can get to do that.",
  "usage": {
    "type": "tokens",
    "input_tokens": 14,
    "input_token_details": {
      "text_tokens": 0,
      "audio_tokens": 14
    },
    "output_tokens": 45,
    "total_tokens": 59
  }
}
```
520
### Diarization

```python
import base64
from openai import OpenAI

client = OpenAI()


def to_data_url(path: str) -> str:
    # Encode a WAV reference clip as a base64 data URL.
    with open(path, "rb") as fh:
        return "data:audio/wav;base64," + base64.b64encode(fh.read()).decode("utf-8")


with open("meeting.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe-diarize",
        file=audio_file,
        response_format="diarized_json",
        chunking_strategy="auto",
        extra_body={
            "known_speaker_names": ["agent"],
            "known_speaker_references": [to_data_url("agent.wav")],
        },
    )

print(transcript.segments)
```
547
#### Response

```json
{
  "task": "transcribe",
  "duration": 27.4,
  "text": "Agent: Thanks for calling OpenAI support.\nA: Hi, I'm trying to enable diarization.\nAgent: Happy to walk you through the steps.",
  "segments": [
    {
      "type": "transcript.text.segment",
      "id": "seg_001",
      "start": 0.0,
      "end": 4.7,
      "text": "Thanks for calling OpenAI support.",
      "speaker": "agent"
    },
    {
      "type": "transcript.text.segment",
      "id": "seg_002",
      "start": 4.7,
      "end": 11.8,
      "text": "Hi, I'm trying to enable diarization.",
      "speaker": "A"
    },
    {
      "type": "transcript.text.segment",
      "id": "seg_003",
      "start": 12.1,
      "end": 18.5,
      "text": "Happy to walk you through the steps.",
      "speaker": "agent"
    }
  ],
  "usage": {
    "type": "duration",
    "seconds": 27
  }
}
```
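A diarized response like the one above is usually post-processed per speaker. The sketch below is one possible approach, working on plain dictionaries shaped like the `segments` array. `format_diarized` and `speaking_time` are hypothetical helper names, not part of the SDK.

```python
from typing import Any, Dict, List


def format_diarized(segments: List[Dict[str, Any]]) -> str:
    """Render one line per segment: [start-end] speaker: text."""
    return "\n".join(
        f"[{seg['start']:.1f}-{seg['end']:.1f}] {seg['speaker']}: {seg['text']}"
        for seg in segments
    )


def speaking_time(segments: List[Dict[str, Any]]) -> Dict[str, float]:
    """Sum segment durations (in seconds) for each speaker label."""
    totals: Dict[str, float] = {}
    for seg in segments:
        duration = seg["end"] - seg["start"]
        totals[seg["speaker"]] = totals.get(seg["speaker"], 0.0) + duration
    return totals


example_segments = [
    {"start": 0.0, "end": 4.7, "speaker": "agent", "text": "Thanks for calling OpenAI support."},
    {"start": 4.7, "end": 11.8, "speaker": "A", "text": "Hi, I'm trying to enable diarization."},
]
print(format_diarized(example_segments))
```

With the SDK, `transcript.segments` holds objects rather than dictionaries, so attribute access (`seg.start`) would replace the key lookups used here.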
587
### Streaming

```python
from openai import OpenAI

client = OpenAI()

with open("speech.mp3", "rb") as audio_file:
    # stream=True returns an iterator of transcript events instead of a single object.
    stream = client.audio.transcriptions.create(
        file=audio_file,
        model="gpt-4o-mini-transcribe",
        stream=True,
    )

    for event in stream:
        print(event)
```
604
#### Response

```json
data: {"type":"transcript.text.delta","delta":"I","logprobs":[{"token":"I","logprob":-0.00007588794,"bytes":[73]}]}

data: {"type":"transcript.text.delta","delta":" see","logprobs":[{"token":" see","logprob":-3.1281633e-7,"bytes":[32,115,101,101]}]}

data: {"type":"transcript.text.delta","delta":" skies","logprobs":[{"token":" skies","logprob":-2.3392786e-6,"bytes":[32,115,107,105,101,115]}]}

data: {"type":"transcript.text.delta","delta":" of","logprobs":[{"token":" of","logprob":-3.1281633e-7,"bytes":[32,111,102]}]}

data: {"type":"transcript.text.delta","delta":" blue","logprobs":[{"token":" blue","logprob":-1.0280384e-6,"bytes":[32,98,108,117,101]}]}

data: {"type":"transcript.text.delta","delta":" and","logprobs":[{"token":" and","logprob":-0.0005108566,"bytes":[32,97,110,100]}]}

data: {"type":"transcript.text.delta","delta":" clouds","logprobs":[{"token":" clouds","logprob":-1.9361265e-7,"bytes":[32,99,108,111,117,100,115]}]}

data: {"type":"transcript.text.delta","delta":" of","logprobs":[{"token":" of","logprob":-1.9361265e-7,"bytes":[32,111,102]}]}

data: {"type":"transcript.text.delta","delta":" white","logprobs":[{"token":" white","logprob":-7.89631e-7,"bytes":[32,119,104,105,116,101]}]}

data: {"type":"transcript.text.delta","delta":",","logprobs":[{"token":",","logprob":-0.0014890312,"bytes":[44]}]}

data: {"type":"transcript.text.delta","delta":" the","logprobs":[{"token":" the","logprob":-0.0110956915,"bytes":[32,116,104,101]}]}

data: {"type":"transcript.text.delta","delta":" bright","logprobs":[{"token":" bright","logprob":0.0,"bytes":[32,98,114,105,103,104,116]}]}

data: {"type":"transcript.text.delta","delta":" blessed","logprobs":[{"token":" blessed","logprob":-0.000045848617,"bytes":[32,98,108,101,115,115,101,100]}]}

data: {"type":"transcript.text.delta","delta":" days","logprobs":[{"token":" days","logprob":-0.000010802739,"bytes":[32,100,97,121,115]}]}

data: {"type":"transcript.text.delta","delta":",","logprobs":[{"token":",","logprob":-0.00001700133,"bytes":[44]}]}

data: {"type":"transcript.text.delta","delta":" the","logprobs":[{"token":" the","logprob":-0.0000118755715,"bytes":[32,116,104,101]}]}

data: {"type":"transcript.text.delta","delta":" dark","logprobs":[{"token":" dark","logprob":-5.5122365e-7,"bytes":[32,100,97,114,107]}]}

data: {"type":"transcript.text.delta","delta":" sacred","logprobs":[{"token":" sacred","logprob":-5.4385737e-6,"bytes":[32,115,97,99,114,101,100]}]}

data: {"type":"transcript.text.delta","delta":" nights","logprobs":[{"token":" nights","logprob":-4.00813e-6,"bytes":[32,110,105,103,104,116,115]}]}

data: {"type":"transcript.text.delta","delta":",","logprobs":[{"token":",","logprob":-0.0036910512,"bytes":[44]}]}

data: {"type":"transcript.text.delta","delta":" and","logprobs":[{"token":" and","logprob":-0.0031903093,"bytes":[32,97,110,100]}]}

data: {"type":"transcript.text.delta","delta":" I","logprobs":[{"token":" I","logprob":-1.504853e-6,"bytes":[32,73]}]}

data: {"type":"transcript.text.delta","delta":" think","logprobs":[{"token":" think","logprob":-4.3202e-7,"bytes":[32,116,104,105,110,107]}]}

data: {"type":"transcript.text.delta","delta":" to","logprobs":[{"token":" to","logprob":-1.9361265e-7,"bytes":[32,116,111]}]}

data: {"type":"transcript.text.delta","delta":" myself","logprobs":[{"token":" myself","logprob":-1.7432603e-6,"bytes":[32,109,121,115,101,108,102]}]}

data: {"type":"transcript.text.delta","delta":",","logprobs":[{"token":",","logprob":-0.29254505,"bytes":[44]}]}

data: {"type":"transcript.text.delta","delta":" what","logprobs":[{"token":" what","logprob":-0.016815351,"bytes":[32,119,104,97,116]}]}

data: {"type":"transcript.text.delta","delta":" a","logprobs":[{"token":" a","logprob":-3.1281633e-7,"bytes":[32,97]}]}

data: {"type":"transcript.text.delta","delta":" wonderful","logprobs":[{"token":" wonderful","logprob":-2.1008714e-6,"bytes":[32,119,111,110,100,101,114,102,117,108]}]}

data: {"type":"transcript.text.delta","delta":" world","logprobs":[{"token":" world","logprob":-8.180258e-6,"bytes":[32,119,111,114,108,100]}]}

data: {"type":"transcript.text.delta","delta":".","logprobs":[{"token":".","logprob":-0.014231676,"bytes":[46]}]}

data: {"type":"transcript.text.done","text":"I see skies of blue and clouds of white, the bright blessed days, the dark sacred nights, and I think to myself, what a wonderful world.","logprobs":[{"token":"I","logprob":-0.00007588794,"bytes":[73]},{"token":" see","logprob":-3.1281633e-7,"bytes":[32,115,101,101]},{"token":" skies","logprob":-2.3392786e-6,"bytes":[32,115,107,105,101,115]},{"token":" of","logprob":-3.1281633e-7,"bytes":[32,111,102]},{"token":" blue","logprob":-1.0280384e-6,"bytes":[32,98,108,117,101]},{"token":" and","logprob":-0.0005108566,"bytes":[32,97,110,100]},{"token":" clouds","logprob":-1.9361265e-7,"bytes":[32,99,108,111,117,100,115]},{"token":" of","logprob":-1.9361265e-7,"bytes":[32,111,102]},{"token":" white","logprob":-7.89631e-7,"bytes":[32,119,104,105,116,101]},{"token":",","logprob":-0.0014890312,"bytes":[44]},{"token":" the","logprob":-0.0110956915,"bytes":[32,116,104,101]},{"token":" bright","logprob":0.0,"bytes":[32,98,114,105,103,104,116]},{"token":" blessed","logprob":-0.000045848617,"bytes":[32,98,108,101,115,115,101,100]},{"token":" days","logprob":-0.000010802739,"bytes":[32,100,97,121,115]},{"token":",","logprob":-0.00001700133,"bytes":[44]},{"token":" the","logprob":-0.0000118755715,"bytes":[32,116,104,101]},{"token":" dark","logprob":-5.5122365e-7,"bytes":[32,100,97,114,107]},{"token":" sacred","logprob":-5.4385737e-6,"bytes":[32,115,97,99,114,101,100]},{"token":" nights","logprob":-4.00813e-6,"bytes":[32,110,105,103,104,116,115]},{"token":",","logprob":-0.0036910512,"bytes":[44]},{"token":" and","logprob":-0.0031903093,"bytes":[32,97,110,100]},{"token":" I","logprob":-1.504853e-6,"bytes":[32,73]},{"token":" think","logprob":-4.3202e-7,"bytes":[32,116,104,105,110,107]},{"token":" to","logprob":-1.9361265e-7,"bytes":[32,116,111]},{"token":" myself","logprob":-1.7432603e-6,"bytes":[32,109,121,115,101,108,102]},{"token":",","logprob":-0.29254505,"bytes":[44]},{"token":" what","logprob":-0.016815351,"bytes":[32,119,104,97,116]},{"token":" a","logprob":-3.1281633e-7,"bytes":[32,97]},{"token":" wonderful","logprob":-2.1008714e-6,"bytes":[32,119,111,110,100,101,114,102,117,108]},{"token":" world","logprob":-8.180258e-6,"bytes":[32,119,111,114,108,100]},{"token":".","logprob":-0.014231676,"bytes":[46]}],"usage":{"input_tokens":14,"input_token_details":{"text_tokens":0,"audio_tokens":14},"output_tokens":45,"total_tokens":59}}
```
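The event stream above can be reduced to a final transcript by concatenating `transcript.text.delta` payloads, or by taking `text` from the closing `transcript.text.done` event. The sketch below assumes raw `data:` lines like those shown. The SDK iterator yields parsed event objects instead, so `collect_transcript` is a hypothetical helper for illustration only.

```python
import json
from typing import Iterable, List


def collect_transcript(sse_lines: Iterable[str]) -> str:
    """Rebuild the transcript from raw server-sent-event lines."""
    parts: List[str] = []
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and any non-data fields
        event = json.loads(line[len("data:"):])
        if event.get("type") == "transcript.text.delta":
            parts.append(event["delta"])
        elif event.get("type") == "transcript.text.done":
            return event["text"]  # the done event carries the full text
    # No done event seen: fall back to the deltas received so far.
    return "".join(parts)


sample = [
    'data: {"type":"transcript.text.delta","delta":"I"}',
    "",
    'data: {"type":"transcript.text.delta","delta":" see"}',
]
print(collect_transcript(sample))
```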
672
### Logprobs

```python
from openai import OpenAI

client = OpenAI()

with open("speech.mp3", "rb") as audio_file:
    # include=["logprobs"] requires response_format="json".
    transcript = client.audio.transcriptions.create(
        file=audio_file,
        model="gpt-4o-transcribe",
        response_format="json",
        include=["logprobs"],
    )

print(transcript)
```
689
#### Response

```json
{
  "text": "Hey, my knee is hurting and I want to see the doctor tomorrow ideally.",
  "logprobs": [
    { "token": "Hey", "logprob": -1.0415299, "bytes": [72, 101, 121] },
    { "token": ",", "logprob": -9.805982e-5, "bytes": [44] },
    { "token": " my", "logprob": -0.00229799, "bytes": [32, 109, 121] },
    {
      "token": " knee",
      "logprob": -4.7159858e-5,
      "bytes": [32, 107, 110, 101, 101]
    },
    { "token": " is", "logprob": -0.043909557, "bytes": [32, 105, 115] },
    {
      "token": " hurting",
      "logprob": -1.1041146e-5,
      "bytes": [32, 104, 117, 114, 116, 105, 110, 103]
    },
    { "token": " and", "logprob": -0.011076359, "bytes": [32, 97, 110, 100] },
    { "token": " I", "logprob": -5.3193703e-6, "bytes": [32, 73] },
    {
      "token": " want",
      "logprob": -0.0017156356,
      "bytes": [32, 119, 97, 110, 116]
    },
    { "token": " to", "logprob": -7.89631e-7, "bytes": [32, 116, 111] },
    { "token": " see", "logprob": -5.5122365e-7, "bytes": [32, 115, 101, 101] },
    { "token": " the", "logprob": -0.0040786397, "bytes": [32, 116, 104, 101] },
    {
      "token": " doctor",
      "logprob": -2.3392786e-6,
      "bytes": [32, 100, 111, 99, 116, 111, 114]
    },
    {
      "token": " tomorrow",
      "logprob": -7.89631e-7,
      "bytes": [32, 116, 111, 109, 111, 114, 114, 111, 119]
    },
    {
      "token": " ideally",
      "logprob": -0.5800861,
      "bytes": [32, 105, 100, 101, 97, 108, 108, 121]
    },
    { "token": ".", "logprob": -0.00011093382, "bytes": [46] }
  ],
  "usage": {
    "type": "tokens",
    "input_tokens": 14,
    "input_token_details": {
      "text_tokens": 0,
      "audio_tokens": 14
    },
    "output_tokens": 45,
    "total_tokens": 59
  }
}
```
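Log probabilities are easier to read once converted to probabilities with `exp`. The sketch below flags tokens whose probability falls under a cutoff. `low_confidence_tokens` is a hypothetical helper, and the default cutoff of 0.9 is an arbitrary assumption for illustration, not a documented recommendation.

```python
import math
from typing import Any, Dict, List, Tuple


def low_confidence_tokens(
    logprobs: List[Dict[str, Any]], min_probability: float = 0.9
) -> List[Tuple[str, float]]:
    """Return (token, probability) pairs whose probability is below the cutoff."""
    flagged = []
    for entry in logprobs:
        probability = math.exp(entry["logprob"])  # logprob is a natural log
        if probability < min_probability:
            # Decode the raw bytes so multi-byte characters display correctly.
            token = bytes(int(b) for b in entry["bytes"]).decode("utf-8", errors="replace")
            flagged.append((token, probability))
    return flagged


sample = [
    {"token": "Hey", "logprob": -1.0415299, "bytes": [72, 101, 121]},
    {"token": ",", "logprob": -9.805982e-5, "bytes": [44]},
    {"token": " ideally", "logprob": -0.5800861, "bytes": [32, 105, 100, 101, 97, 108, 108, 121]},
]
for token, probability in low_confidence_tokens(sample):
    print(repr(token), round(probability, 3))
```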
749
### Word timestamps

```python
from openai import OpenAI

client = OpenAI()

with open("speech.mp3", "rb") as audio_file:
    # Timestamp granularities require response_format="verbose_json".
    transcript = client.audio.transcriptions.create(
        file=audio_file,
        model="whisper-1",
        response_format="verbose_json",
        timestamp_granularities=["word"],
    )

print(transcript.words)
```
766
#### Response

```json
{
  "task": "transcribe",
  "language": "english",
  "duration": 8.470000267028809,
  "text": "The beach was a popular spot on a hot summer day. People were swimming in the ocean, building sandcastles, and playing beach volleyball.",
  "words": [
    {
      "word": "The",
      "start": 0.0,
      "end": 0.23999999463558197
    },
    ...
    {
      "word": "volleyball",
      "start": 7.400000095367432,
      "end": 7.900000095367432
    }
  ],
  "usage": {
    "type": "duration",
    "seconds": 9
  }
}
```
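Word timestamps arrive as floating-point seconds, which often need converting for display or subtitle work. As a minimal sketch, the helpers below turn each word into an SRT-style cue. `to_srt_timestamp` and `words_to_srt` are hypothetical names, and emitting one cue per word is a simplification chosen for brevity.

```python
from typing import Any, Dict, List


def to_srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm (the SRT timestamp layout)."""
    total_ms = round(seconds * 1000)
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def words_to_srt(words: List[Dict[str, Any]]) -> str:
    """Emit one numbered SRT cue per word."""
    cues = []
    for index, word in enumerate(words, start=1):
        start = to_srt_timestamp(word["start"])
        end = to_srt_timestamp(word["end"])
        cues.append(f"{index}\n{start} --> {end}\n{word['word']}\n")
    return "\n".join(cues)


print(words_to_srt([{"word": "The", "start": 0.0, "end": 0.23999999463558197}]))
```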
794
### Segment timestamps

```python
from openai import OpenAI

client = OpenAI()

with open("speech.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        file=audio_file,
        model="whisper-1",
        response_format="verbose_json",
        timestamp_granularities=["segment"],
    )

# Segment-level timestamps are returned in `segments`, not `words`.
print(transcript.segments)
```
811
#### Response

```json
{
  "task": "transcribe",
  "language": "english",
  "duration": 8.470000267028809,
  "text": "The beach was a popular spot on a hot summer day. People were swimming in the ocean, building sandcastles, and playing beach volleyball.",
  "segments": [
    {
      "id": 0,
      "seek": 0,
      "start": 0.0,
      "end": 3.319999933242798,
      "text": " The beach was a popular spot on a hot summer day.",
      "tokens": [
        50364, 440, 7534, 390, 257, 3743, 4008, 322, 257, 2368, 4266, 786, 13, 50530
      ],
      "temperature": 0.0,
      "avg_logprob": -0.2860786020755768,
      "compression_ratio": 1.2363636493682861,
      "no_speech_prob": 0.00985979475080967
    },
    ...
  ],
  "usage": {
    "type": "duration",
    "seconds": 9
  }
}
```
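The field descriptions for `avg_logprob` and `compression_ratio` give rule-of-thumb failure thresholds (below -1 and above 2.4). The sketch below applies them to segment dictionaries shaped like the response above. `flag_segments` is a hypothetical helper, and treating a flagged segment as a candidate for review is an assumption rather than documented behavior.

```python
from typing import Any, Dict, List, Tuple


def flag_segments(
    segments: List[Dict[str, Any]],
    min_avg_logprob: float = -1.0,
    max_compression_ratio: float = 2.4,
) -> List[Tuple[int, List[str]]]:
    """Return (segment id, reasons) for segments that trip a documented threshold."""
    flagged = []
    for seg in segments:
        reasons = []
        if seg["avg_logprob"] < min_avg_logprob:
            reasons.append("low avg_logprob")
        if seg["compression_ratio"] > max_compression_ratio:
            reasons.append("high compression_ratio")
        if reasons:
            flagged.append((seg["id"], reasons))
    return flagged


segments = [
    {"id": 0, "avg_logprob": -0.2860786020755768, "compression_ratio": 1.2363636493682861},
    {"id": 1, "avg_logprob": -1.4, "compression_ratio": 2.9},  # made-up values for the demo
]
print(flag_segments(segments))
```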
843
## Domain Types

### Transcription

- `class Transcription: …`

  Represents a transcription response returned by the model, based on the provided input.
851
852 - `text: str`
853
854 The transcribed text.
855
856 - `logprobs: Optional[List[Logprob]]`
857
858 The log probabilities of the tokens in the transcription. Only returned with the models `gpt-4o-transcribe` and `gpt-4o-mini-transcribe` if `logprobs` is added to the `include` array.
859
860 - `token: Optional[str]`
861
862 The token in the transcription.
863
864 - `bytes: Optional[List[float]]`
865
866 The bytes of the token.
867
868 - `logprob: Optional[float]`
869
870 The log probability of the token.
871
872 - `usage: Optional[Usage]`
873
874 Token usage statistics for the request.
875
876 - `class UsageTokens: …`
877
878 Usage statistics for models billed by token usage.
879
880 - `input_tokens: int`
881
882 Number of input tokens billed for this request.
883
884 - `output_tokens: int`
885
886 Number of output tokens generated.
887
888 - `total_tokens: int`
889
890 Total number of tokens used (input + output).
891
892 - `type: Literal["tokens"]`
893
894 The type of the usage object. Always `tokens` for this variant.
895
896 - `"tokens"`
897
898 - `input_token_details: Optional[UsageTokensInputTokenDetails]`
899
900 Details about the input tokens billed for this request.
901
902 - `audio_tokens: Optional[int]`
903
904 Number of audio tokens billed for this request.
905
906 - `text_tokens: Optional[int]`
907
908 Number of text tokens billed for this request.
909
910 - `class UsageDuration: …`
911
912 Usage statistics for models billed by audio input duration.
913
914 - `seconds: float`
915
916 Duration of the input audio in seconds.
917
918 - `type: Literal["duration"]`
919
920 The type of the usage object. Always `duration` for this variant.
921
922 - `"duration"`
923
### Transcription Diarized
925
926- `class TranscriptionDiarized: …`
927
928 Represents a diarized transcription response returned by the model, including the combined transcript and speaker-segment annotations.
929
930 - `duration: float`
931
932 Duration of the input audio in seconds.
933
934 - `segments: List[TranscriptionDiarizedSegment]`
935
936 Segments of the transcript annotated with timestamps and speaker labels.
937
938 - `id: str`
939
940 Unique identifier for the segment.
941
942 - `end: float`
943
944 End timestamp of the segment in seconds.
945
946 - `speaker: str`
947
948 Speaker label for this segment. When known speakers are provided, the label matches `known_speaker_names[]`. Otherwise speakers are labeled sequentially using capital letters (`A`, `B`, ...).
949
950 - `start: float`
951
952 Start timestamp of the segment in seconds.
953
954 - `text: str`
955
956 Transcript text for this segment.
957
958 - `type: Literal["transcript.text.segment"]`
959
960 The type of the segment. Always `transcript.text.segment`.
961
962 - `"transcript.text.segment"`
963
964 - `task: Literal["transcribe"]`
965
966 The type of task that was run. Always `transcribe`.
967
968 - `"transcribe"`
969
970 - `text: str`
971
972 The concatenated transcript text for the entire audio input.
973
974 - `usage: Optional[Usage]`
975
976 Token or duration usage statistics for the request.
977
978 - `class UsageTokens: …`
979
980 Usage statistics for models billed by token usage.
981
982 - `input_tokens: int`
983
984 Number of input tokens billed for this request.
985
986 - `output_tokens: int`
987
988 Number of output tokens generated.
989
990 - `total_tokens: int`
991
992 Total number of tokens used (input + output).
993
994 - `type: Literal["tokens"]`
995
996 The type of the usage object. Always `tokens` for this variant.
997
998 - `"tokens"`
999
1000 - `input_token_details: Optional[UsageTokensInputTokenDetails]`
1001
1002 Details about the input tokens billed for this request.
1003
1004 - `audio_tokens: Optional[int]`
1005
1006 Number of audio tokens billed for this request.
1007
1008 - `text_tokens: Optional[int]`
1009
1010 Number of text tokens billed for this request.
1011
1012 - `class UsageDuration: …`
1013
1014 Usage statistics for models billed by audio input duration.
1015
1016 - `seconds: float`
1017
1018 Duration of the input audio in seconds.
1019
1020 - `type: Literal["duration"]`
1021
1022 The type of the usage object. Always `duration` for this variant.
1023
1024 - `"duration"`
1025
### Transcription Diarized Segment
1027
1028- `class TranscriptionDiarizedSegment: …`
1029
1030 A segment of diarized transcript text with speaker metadata.
1031
1032 - `id: str`
1033
1034 Unique identifier for the segment.
1035
1036 - `end: float`
1037
1038 End timestamp of the segment in seconds.
1039
1040 - `speaker: str`
1041
1042 Speaker label for this segment. When known speakers are provided, the label matches `known_speaker_names[]`. Otherwise speakers are labeled sequentially using capital letters (`A`, `B`, ...).
1043
1044 - `start: float`
1045
1046 Start timestamp of the segment in seconds.
1047
1048 - `text: str`
1049
1050 Transcript text for this segment.
1051
1052 - `type: Literal["transcript.text.segment"]`
1053
1054 The type of the segment. Always `transcript.text.segment`.
1055
1056 - `"transcript.text.segment"`
1057
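Diarized segments arrive as short spans, so one speaker's turn is often spread over several consecutive segments. The sketch below merges adjacent segments that share a `speaker` label. The `Segment` dataclass and `merge_speaker_turns` helper are hypothetical stand-ins that only mirror the field names documented above; they are not part of the SDK.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Segment:
    """Hypothetical stand-in mirroring the TranscriptionDiarizedSegment fields."""
    id: str
    start: float
    end: float
    speaker: str
    text: str


def merge_speaker_turns(segments: Iterable[Segment]) -> List[Segment]:
    """Merge consecutive segments from the same speaker into one turn."""
    turns: List[Segment] = []
    for seg in segments:
        if turns and turns[-1].speaker == seg.speaker:
            # Same speaker as the previous turn: extend it.
            last = turns[-1]
            last.end = seg.end
            last.text = f"{last.text} {seg.text.strip()}"
        else:
            # New speaker: start a turn from a copy so the input is not mutated.
            turns.append(Segment(seg.id, seg.start, seg.end, seg.speaker, seg.text.strip()))
    return turns


# Illustrative data only; real segments come from a `diarized_json` response.
segments = [
    Segment("seg_0", 0.0, 1.5, "A", "Hello there."),
    Segment("seg_1", 1.5, 3.0, "A", "How are you?"),
    Segment("seg_2", 3.0, 4.2, "B", "Fine, thanks."),
]
turns = merge_speaker_turns(segments)
for turn in turns:
    print(f"[{turn.start:.1f}-{turn.end:.1f}] {turn.speaker}: {turn.text}")
```

Each merged turn keeps the `id` of its first segment, which is a simplification chosen for this sketch.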
### Transcription Include
1059
1060- `Literal["logprobs"]`
1061
1062 - `"logprobs"`
1063
### Transcription Segment
1065
1066- `class TranscriptionSegment: …`
1067
1068 - `id: int`
1069
1070 Unique identifier of the segment.
1071
1072 - `avg_logprob: float`
1073
1074 Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.
1075
1076 - `compression_ratio: float`
1077
1078 Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.
1079
1080 - `end: float`
1081
1082 End time of the segment in seconds.
1083
1084 - `no_speech_prob: float`
1085
    Probability of no speech in the segment. Because a probability cannot exceed 1.0, treat a value close to 1.0 combined with an `avg_logprob` below -1 as an indication that this segment is silent.
1087
1088 - `seek: int`
1089
1090 Seek offset of the segment.
1091
1092 - `start: float`
1093
1094 Start time of the segment in seconds.
1095
1096 - `temperature: float`
1097
1098 Temperature parameter used for generating the segment.
1099
1100 - `text: str`
1101
1102 Text content of the segment.
1103
1104 - `tokens: List[int]`
1105
1106 Array of token IDs for the text content.
1107
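The `avg_logprob` and `compression_ratio` descriptions above each name a failure threshold (-1 and 2.4). A minimal sketch of applying both checks to a list of segments follows. `SegmentStats` and `quality_flags` are hypothetical names used for illustration, and the sample values are invented.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SegmentStats:
    """Hypothetical stand-in for the TranscriptionSegment fields used here."""
    id: int
    text: str
    avg_logprob: float
    compression_ratio: float


def quality_flags(seg: SegmentStats) -> List[str]:
    """Return the documented failure conditions that this segment meets."""
    flags: List[str] = []
    if seg.avg_logprob < -1:
        flags.append("low_logprob")
    if seg.compression_ratio > 2.4:
        flags.append("high_compression")
    return flags


# Invented sample values, chosen to trigger each condition once.
samples = [
    SegmentStats(0, "Welcome back.", -0.25, 1.3),
    SegmentStats(1, "uh uh uh uh uh uh", -0.4, 3.1),
    SegmentStats(2, "[inaudible]", -1.7, 0.9),
]
for seg in samples:
    print(seg.id, quality_flags(seg) or "ok")
```

What to do with a flagged segment (drop it, retry it, or show it with a warning) is left to the application.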
### Transcription Stream Event
1109
1110- `TranscriptionStreamEvent`
1111
  A streaming event emitted while a transcription is generated with `stream` set to `true`. It is one of the following event types.
1113
1114 - `class TranscriptionTextSegmentEvent: …`
1115
1116 Emitted when a diarized transcription returns a completed segment with speaker information. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with `stream` set to `true` and `response_format` set to `diarized_json`.
1117
1118 - `id: str`
1119
1120 Unique identifier for the segment.
1121
1122 - `end: float`
1123
1124 End timestamp of the segment in seconds.
1125
1126 - `speaker: str`
1127
1128 Speaker label for this segment.
1129
1130 - `start: float`
1131
1132 Start timestamp of the segment in seconds.
1133
1134 - `text: str`
1135
1136 Transcript text for this segment.
1137
1138 - `type: Literal["transcript.text.segment"]`
1139
1140 The type of the event. Always `transcript.text.segment`.
1141
1142 - `"transcript.text.segment"`
1143
1144 - `class TranscriptionTextDeltaEvent: …`
1145
    Emitted when there is an additional text delta. This is also the first event emitted when the transcription starts. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `stream` parameter set to `true`.
1147
1148 - `delta: str`
1149
1150 The text delta that was additionally transcribed.
1151
1152 - `type: Literal["transcript.text.delta"]`
1153
1154 The type of the event. Always `transcript.text.delta`.
1155
1156 - `"transcript.text.delta"`
1157
1158 - `logprobs: Optional[List[Logprob]]`
1159
1160 The log probabilities of the delta. Only included if you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `include[]` parameter set to `logprobs`.
1161
1162 - `token: Optional[str]`
1163
1164 The token that was used to generate the log probability.
1165
1166 - `bytes: Optional[List[int]]`
1167
1168 The bytes that were used to generate the log probability.
1169
1170 - `logprob: Optional[float]`
1171
1172 The log probability of the token.
1173
1174 - `segment_id: Optional[str]`
1175
1176 Identifier of the diarized segment that this delta belongs to. Only present when using `gpt-4o-transcribe-diarize`.
1177
1178 - `class TranscriptionTextDoneEvent: …`
1179
    Emitted when the transcription is complete. Contains the complete transcription text. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `stream` parameter set to `true`.
1181
1182 - `text: str`
1183
1184 The text that was transcribed.
1185
1186 - `type: Literal["transcript.text.done"]`
1187
1188 The type of the event. Always `transcript.text.done`.
1189
1190 - `"transcript.text.done"`
1191
1192 - `logprobs: Optional[List[Logprob]]`
1193
1194 The log probabilities of the individual tokens in the transcription. Only included if you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `include[]` parameter set to `logprobs`.
1195
1196 - `token: Optional[str]`
1197
1198 The token that was used to generate the log probability.
1199
1200 - `bytes: Optional[List[int]]`
1201
1202 The bytes that were used to generate the log probability.
1203
1204 - `logprob: Optional[float]`
1205
1206 The log probability of the token.
1207
1208 - `usage: Optional[Usage]`
1209
1210 Usage statistics for models billed by token usage.
1211
1212 - `input_tokens: int`
1213
1214 Number of input tokens billed for this request.
1215
1216 - `output_tokens: int`
1217
1218 Number of output tokens generated.
1219
1220 - `total_tokens: int`
1221
1222 Total number of tokens used (input + output).
1223
1224 - `type: Literal["tokens"]`
1225
1226 The type of the usage object. Always `tokens` for this variant.
1227
1228 - `"tokens"`
1229
1230 - `input_token_details: Optional[UsageInputTokenDetails]`
1231
1232 Details about the input tokens billed for this request.
1233
1234 - `audio_tokens: Optional[int]`
1235
1236 Number of audio tokens billed for this request.
1237
1238 - `text_tokens: Optional[int]`
1239
1240 Number of text tokens billed for this request.
1241
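The three event variants above are distinguished by their `type` field. The sketch below shows one way to dispatch on that field. It uses plain dictionaries with the documented field names so that it runs without network access; with the SDK the events would be objects with attributes of the same names. `consume_events` and the sample events are hypothetical.

```python
from typing import Any, Dict, Iterable, List, Optional, Tuple


def consume_events(events: Iterable[Dict[str, Any]]) -> Tuple[str, List[Dict[str, Any]]]:
    """Collect the transcript text and any diarized segments from a stream of events."""
    deltas: List[str] = []
    segments: List[Dict[str, Any]] = []
    final_text: Optional[str] = None

    for event in events:
        kind = event["type"]
        if kind == "transcript.text.delta":
            deltas.append(event["delta"])  # incremental text
        elif kind == "transcript.text.segment":
            segments.append(event)  # completed segment with speaker information
        elif kind == "transcript.text.done":
            final_text = event["text"]  # complete transcription text

    # Fall back to the accumulated deltas if the stream ended before the done event.
    if final_text is None:
        final_text = "".join(deltas)
    return final_text, segments


# Invented events for illustration.
sample_events = [
    {"type": "transcript.text.delta", "delta": "Hello"},
    {"type": "transcript.text.delta", "delta": " world"},
    {"type": "transcript.text.segment", "id": "seg_0", "start": 0.0, "end": 1.2,
     "speaker": "A", "text": "Hello world"},
    {"type": "transcript.text.done", "text": "Hello world"},
]
text, speaker_segments = consume_events(sample_events)
print(text, len(speaker_segments))
```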
### Transcription Text Delta Event
1243
1244- `class TranscriptionTextDeltaEvent: …`
1245
  Emitted when there is an additional text delta. This is also the first event emitted when the transcription starts. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `stream` parameter set to `true`.
1247
1248 - `delta: str`
1249
1250 The text delta that was additionally transcribed.
1251
1252 - `type: Literal["transcript.text.delta"]`
1253
1254 The type of the event. Always `transcript.text.delta`.
1255
1256 - `"transcript.text.delta"`
1257
1258 - `logprobs: Optional[List[Logprob]]`
1259
1260 The log probabilities of the delta. Only included if you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `include[]` parameter set to `logprobs`.
1261
1262 - `token: Optional[str]`
1263
1264 The token that was used to generate the log probability.
1265
1266 - `bytes: Optional[List[int]]`
1267
1268 The bytes that were used to generate the log probability.
1269
1270 - `logprob: Optional[float]`
1271
1272 The log probability of the token.
1273
1274 - `segment_id: Optional[str]`
1275
1276 Identifier of the diarized segment that this delta belongs to. Only present when using `gpt-4o-transcribe-diarize`.
1277
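When `logprobs` is included, each entry pairs a `token` with its `logprob`. A log probability can be turned back into a probability with `exp`, which makes it easier to spot uncertain tokens. The sketch below assumes the values are natural logarithms; the 0.5 cutoff and the `low_confidence_tokens` helper are arbitrary choices for illustration, not part of the API.

```python
import math
from typing import Any, Dict, List


def low_confidence_tokens(logprobs: List[Dict[str, Any]], threshold: float = 0.5) -> List[str]:
    """Return tokens whose probability, exp(logprob), falls below the threshold."""
    flagged: List[str] = []
    for entry in logprobs:
        token = entry.get("token")
        logprob = entry.get("logprob")
        # Both fields are Optional in the schema, so skip incomplete entries.
        if token is None or logprob is None:
            continue
        if math.exp(logprob) < threshold:
            flagged.append(token)
    return flagged


# Invented values for illustration.
sample_logprobs = [
    {"token": "Hello", "logprob": -0.01},
    {"token": " wrld", "logprob": -2.3},
    {"token": None, "logprob": None},
]
print(low_confidence_tokens(sample_logprobs))
```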
### Transcription Text Done Event
1279
1280- `class TranscriptionTextDoneEvent: …`
1281
  Emitted when the transcription is complete. Contains the complete transcription text. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `stream` parameter set to `true`.
1283
1284 - `text: str`
1285
1286 The text that was transcribed.
1287
1288 - `type: Literal["transcript.text.done"]`
1289
1290 The type of the event. Always `transcript.text.done`.
1291
1292 - `"transcript.text.done"`
1293
1294 - `logprobs: Optional[List[Logprob]]`
1295
1296 The log probabilities of the individual tokens in the transcription. Only included if you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `include[]` parameter set to `logprobs`.
1297
1298 - `token: Optional[str]`
1299
1300 The token that was used to generate the log probability.
1301
1302 - `bytes: Optional[List[int]]`
1303
1304 The bytes that were used to generate the log probability.
1305
1306 - `logprob: Optional[float]`
1307
1308 The log probability of the token.
1309
1310 - `usage: Optional[Usage]`
1311
1312 Usage statistics for models billed by token usage.
1313
1314 - `input_tokens: int`
1315
1316 Number of input tokens billed for this request.
1317
1318 - `output_tokens: int`
1319
1320 Number of output tokens generated.
1321
1322 - `total_tokens: int`
1323
1324 Total number of tokens used (input + output).
1325
1326 - `type: Literal["tokens"]`
1327
1328 The type of the usage object. Always `tokens` for this variant.
1329
1330 - `"tokens"`
1331
1332 - `input_token_details: Optional[UsageInputTokenDetails]`
1333
1334 Details about the input tokens billed for this request.
1335
1336 - `audio_tokens: Optional[int]`
1337
1338 Number of audio tokens billed for this request.
1339
1340 - `text_tokens: Optional[int]`
1341
1342 Number of text tokens billed for this request.
1343
### Transcription Text Segment Event
1345
1346- `class TranscriptionTextSegmentEvent: …`
1347
1348 Emitted when a diarized transcription returns a completed segment with speaker information. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with `stream` set to `true` and `response_format` set to `diarized_json`.
1349
1350 - `id: str`
1351
1352 Unique identifier for the segment.
1353
1354 - `end: float`
1355
1356 End timestamp of the segment in seconds.
1357
1358 - `speaker: str`
1359
1360 Speaker label for this segment.
1361
1362 - `start: float`
1363
1364 Start timestamp of the segment in seconds.
1365
1366 - `text: str`
1367
1368 Transcript text for this segment.
1369
1370 - `type: Literal["transcript.text.segment"]`
1371
1372 The type of the event. Always `transcript.text.segment`.
1373
1374 - `"transcript.text.segment"`
1375
### Transcription Verbose
1377
1378- `class TranscriptionVerbose: …`
1379
  Represents a verbose JSON transcription response returned by the model, based on the provided input.
1381
1382 - `duration: float`
1383
    The duration of the input audio, in seconds.
1385
1386 - `language: str`
1387
1388 The language of the input audio.
1389
1390 - `text: str`
1391
1392 The transcribed text.
1393
1394 - `segments: Optional[List[TranscriptionSegment]]`
1395
1396 Segments of the transcribed text and their corresponding details.
1397
1398 - `id: int`
1399
1400 Unique identifier of the segment.
1401
1402 - `avg_logprob: float`
1403
1404 Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.
1405
1406 - `compression_ratio: float`
1407
1408 Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.
1409
1410 - `end: float`
1411
1412 End time of the segment in seconds.
1413
1414 - `no_speech_prob: float`
1415
      Probability of no speech in the segment. Because a probability cannot exceed 1.0, treat a value close to 1.0 combined with an `avg_logprob` below -1 as an indication that this segment is silent.
1417
1418 - `seek: int`
1419
1420 Seek offset of the segment.
1421
1422 - `start: float`
1423
1424 Start time of the segment in seconds.
1425
1426 - `temperature: float`
1427
1428 Temperature parameter used for generating the segment.
1429
1430 - `text: str`
1431
1432 Text content of the segment.
1433
1434 - `tokens: List[int]`
1435
1436 Array of token IDs for the text content.
1437
1438 - `usage: Optional[Usage]`
1439
1440 Usage statistics for models billed by audio input duration.
1441
1442 - `seconds: float`
1443
1444 Duration of the input audio in seconds.
1445
1446 - `type: Literal["duration"]`
1447
1448 The type of the usage object. Always `duration` for this variant.
1449
1450 - `"duration"`
1451
1452 - `words: Optional[List[TranscriptionWord]]`
1453
1454 Extracted words and their corresponding timestamps.
1455
1456 - `end: float`
1457
1458 End time of the word in seconds.
1459
1460 - `start: float`
1461
1462 Start time of the word in seconds.
1463
1464 - `word: str`
1465
1466 The text content of the word.
1467
### Transcription Word
1469
1470- `class TranscriptionWord: …`
1471
1472 - `end: float`
1473
1474 End time of the word in seconds.
1475
1476 - `start: float`
1477
1478 Start time of the word in seconds.
1479
1480 - `word: str`
1481
1482 The text content of the word.
1483
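Word-level timestamps can be regrouped on the client, for example into short caption lines. The sketch below groups a list of words into fixed-size lines and carries over the first `start` and last `end` of each group. It uses dictionaries with the documented `word`, `start`, and `end` keys; `words_to_lines` and the three-word limit are hypothetical choices.

```python
from typing import Any, Dict, List


def words_to_lines(words: List[Dict[str, Any]], max_words: int = 3) -> List[Dict[str, Any]]:
    """Group timestamped words into caption lines of at most `max_words` words."""
    lines: List[Dict[str, Any]] = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        lines.append({
            "start": chunk[0]["start"],  # start of the first word in the line
            "end": chunk[-1]["end"],     # end of the last word in the line
            "text": " ".join(w["word"] for w in chunk),
        })
    return lines


# Invented timings for illustration.
sample_words = [
    {"word": "The", "start": 0.0, "end": 0.2},
    {"word": "quick", "start": 0.2, "end": 0.5},
    {"word": "brown", "start": 0.5, "end": 0.9},
    {"word": "fox", "start": 0.9, "end": 1.3},
]
caption_lines = words_to_lines(sample_words)
for line in caption_lines:
    print(line)
```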
### Transcription Create Response
1485
1486- `TranscriptionCreateResponse`
1487
  Represents a transcription response returned by the model, based on the provided input.
1489
1490 - `class Transcription: …`
1491
    Represents a transcription response returned by the model, based on the provided input.
1493
1494 - `text: str`
1495
1496 The transcribed text.
1497
1498 - `logprobs: Optional[List[Logprob]]`
1499
1500 The log probabilities of the tokens in the transcription. Only returned with the models `gpt-4o-transcribe` and `gpt-4o-mini-transcribe` if `logprobs` is added to the `include` array.
1501
1502 - `token: Optional[str]`
1503
1504 The token in the transcription.
1505
1506 - `bytes: Optional[List[float]]`
1507
1508 The bytes of the token.
1509
1510 - `logprob: Optional[float]`
1511
1512 The log probability of the token.
1513
1514 - `usage: Optional[Usage]`
1515
1516 Token usage statistics for the request.
1517
1518 - `class UsageTokens: …`
1519
1520 Usage statistics for models billed by token usage.
1521
1522 - `input_tokens: int`
1523
1524 Number of input tokens billed for this request.
1525
1526 - `output_tokens: int`
1527
1528 Number of output tokens generated.
1529
1530 - `total_tokens: int`
1531
1532 Total number of tokens used (input + output).
1533
1534 - `type: Literal["tokens"]`
1535
1536 The type of the usage object. Always `tokens` for this variant.
1537
1538 - `"tokens"`
1539
1540 - `input_token_details: Optional[UsageTokensInputTokenDetails]`
1541
1542 Details about the input tokens billed for this request.
1543
1544 - `audio_tokens: Optional[int]`
1545
1546 Number of audio tokens billed for this request.
1547
1548 - `text_tokens: Optional[int]`
1549
1550 Number of text tokens billed for this request.
1551
1552 - `class UsageDuration: …`
1553
1554 Usage statistics for models billed by audio input duration.
1555
1556 - `seconds: float`
1557
1558 Duration of the input audio in seconds.
1559
1560 - `type: Literal["duration"]`
1561
1562 The type of the usage object. Always `duration` for this variant.
1563
1564 - `"duration"`
1565
1566 - `class TranscriptionDiarized: …`
1567
1568 Represents a diarized transcription response returned by the model, including the combined transcript and speaker-segment annotations.
1569
1570 - `duration: float`
1571
1572 Duration of the input audio in seconds.
1573
1574 - `segments: List[TranscriptionDiarizedSegment]`
1575
1576 Segments of the transcript annotated with timestamps and speaker labels.
1577
1578 - `id: str`
1579
1580 Unique identifier for the segment.
1581
1582 - `end: float`
1583
1584 End timestamp of the segment in seconds.
1585
1586 - `speaker: str`
1587
1588 Speaker label for this segment. When known speakers are provided, the label matches `known_speaker_names[]`. Otherwise speakers are labeled sequentially using capital letters (`A`, `B`, ...).
1589
1590 - `start: float`
1591
1592 Start timestamp of the segment in seconds.
1593
1594 - `text: str`
1595
1596 Transcript text for this segment.
1597
1598 - `type: Literal["transcript.text.segment"]`
1599
1600 The type of the segment. Always `transcript.text.segment`.
1601
1602 - `"transcript.text.segment"`
1603
1604 - `task: Literal["transcribe"]`
1605
1606 The type of task that was run. Always `transcribe`.
1607
1608 - `"transcribe"`
1609
1610 - `text: str`
1611
1612 The concatenated transcript text for the entire audio input.
1613
1614 - `usage: Optional[Usage]`
1615
1616 Token or duration usage statistics for the request.
1617
1618 - `class UsageTokens: …`
1619
1620 Usage statistics for models billed by token usage.
1621
1622 - `input_tokens: int`
1623
1624 Number of input tokens billed for this request.
1625
1626 - `output_tokens: int`
1627
1628 Number of output tokens generated.
1629
1630 - `total_tokens: int`
1631
1632 Total number of tokens used (input + output).
1633
1634 - `type: Literal["tokens"]`
1635
1636 The type of the usage object. Always `tokens` for this variant.
1637
1638 - `"tokens"`
1639
1640 - `input_token_details: Optional[UsageTokensInputTokenDetails]`
1641
1642 Details about the input tokens billed for this request.
1643
1644 - `audio_tokens: Optional[int]`
1645
1646 Number of audio tokens billed for this request.
1647
1648 - `text_tokens: Optional[int]`
1649
1650 Number of text tokens billed for this request.
1651
1652 - `class UsageDuration: …`
1653
1654 Usage statistics for models billed by audio input duration.
1655
1656 - `seconds: float`
1657
1658 Duration of the input audio in seconds.
1659
1660 - `type: Literal["duration"]`
1661
1662 The type of the usage object. Always `duration` for this variant.
1663
1664 - `"duration"`
1665
1666 - `class TranscriptionVerbose: …`
1667
    Represents a verbose JSON transcription response returned by the model, based on the provided input.
1669
1670 - `duration: float`
1671
      The duration of the input audio, in seconds.
1673
1674 - `language: str`
1675
1676 The language of the input audio.
1677
1678 - `text: str`
1679
1680 The transcribed text.
1681
1682 - `segments: Optional[List[TranscriptionSegment]]`
1683
1684 Segments of the transcribed text and their corresponding details.
1685
1686 - `id: int`
1687
1688 Unique identifier of the segment.
1689
1690 - `avg_logprob: float`
1691
1692 Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.
1693
1694 - `compression_ratio: float`
1695
1696 Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.
1697
1698 - `end: float`
1699
1700 End time of the segment in seconds.
1701
1702 - `no_speech_prob: float`
1703
        Probability of no speech in the segment. Because a probability cannot exceed 1.0, treat a value close to 1.0 combined with an `avg_logprob` below -1 as an indication that this segment is silent.
1705
1706 - `seek: int`
1707
1708 Seek offset of the segment.
1709
1710 - `start: float`
1711
1712 Start time of the segment in seconds.
1713
1714 - `temperature: float`
1715
1716 Temperature parameter used for generating the segment.
1717
1718 - `text: str`
1719
1720 Text content of the segment.
1721
1722 - `tokens: List[int]`
1723
1724 Array of token IDs for the text content.
1725
1726 - `usage: Optional[Usage]`
1727
1728 Usage statistics for models billed by audio input duration.
1729
1730 - `seconds: float`
1731
1732 Duration of the input audio in seconds.
1733
1734 - `type: Literal["duration"]`
1735
1736 The type of the usage object. Always `duration` for this variant.
1737
1738 - `"duration"`
1739
1740 - `words: Optional[List[TranscriptionWord]]`
1741
1742 Extracted words and their corresponding timestamps.
1743
1744 - `end: float`
1745
1746 End time of the word in seconds.
1747
1748 - `start: float`
1749
1750 Start time of the word in seconds.
1751
1752 - `word: str`
1753
1754 The text content of the word.
1755
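The `usage` field is either a token-based or a duration-based object, told apart by its `type` (`tokens` or `duration`), and it may be absent. A minimal sketch of handling both variants follows. It works on dictionaries with the documented keys so that it runs standalone; `summarize_usage` is a hypothetical helper and the numbers are invented.

```python
from typing import Any, Dict, Optional


def summarize_usage(usage: Optional[Dict[str, Any]]) -> str:
    """Describe a usage object, branching on its `type` discriminator."""
    if usage is None:
        return "no usage reported"
    if usage["type"] == "tokens":
        return (
            f"{usage['total_tokens']} tokens "
            f"({usage['input_tokens']} in, {usage['output_tokens']} out)"
        )
    if usage["type"] == "duration":
        return f"{usage['seconds']:.1f} s of audio"
    raise ValueError(f"unexpected usage type: {usage['type']!r}")


# Invented values for illustration.
token_usage = {"type": "tokens", "input_tokens": 14, "output_tokens": 45, "total_tokens": 59}
duration_usage = {"type": "duration", "seconds": 12.34}
print(summarize_usage(token_usage))
print(summarize_usage(duration_usage))
print(summarize_usage(None))
```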
# Translations

## Create translation

`audio.translations.create(TranslationCreateParams**kwargs) -> TranslationCreateResponse`

**post** `/audio/translations`

Translates audio into English.

### Parameters
1767
1768- `file: FileTypes`
1769
  The audio file object (not file name) to translate, in one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm.
1771
1772- `model: Union[str, AudioModel]`
1773
1774 ID of the model to use. Only `whisper-1` (which is powered by our open source Whisper V2 model) is currently available.
1775
1776 - `str`
1777
1778 - `Literal["whisper-1", "gpt-4o-transcribe", "gpt-4o-mini-transcribe", 2 more]`
1779
1780 - `"whisper-1"`
1781
1782 - `"gpt-4o-transcribe"`
1783
1784 - `"gpt-4o-mini-transcribe"`
1785
1786 - `"gpt-4o-mini-transcribe-2025-12-15"`
1787
1788 - `"gpt-4o-transcribe-diarize"`
1789
1790- `prompt: Optional[str]`
1791
1792 An optional text to guide the model's style or continue a previous audio segment. The [prompt](https://platform.openai.com/docs/guides/speech-to-text#prompting) should be in English.
1793
1794- `response_format: Optional[Literal["json", "text", "srt", 2 more]]`
1795
1796 The format of the output, in one of these options: `json`, `text`, `srt`, `verbose_json`, or `vtt`.
1797
1798 - `"json"`
1799
1800 - `"text"`
1801
1802 - `"srt"`
1803
1804 - `"verbose_json"`
1805
1806 - `"vtt"`
1807
1808- `temperature: Optional[float]`
1809
1810 The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use [log probability](https://en.wikipedia.org/wiki/Log_probability) to automatically increase the temperature until certain thresholds are hit.
1811
### Returns
1813
1814- `TranslationCreateResponse`
1815
1816 - `class Translation: …`
1817
1818 - `text: str`
1819
1820 - `class TranslationVerbose: …`
1821
1822 - `duration: float`
1823
      The duration of the input audio, in seconds.
1825
1826 - `language: str`
1827
1828 The language of the output translation (always `english`).
1829
1830 - `text: str`
1831
1832 The translated text.
1833
1834 - `segments: Optional[List[TranscriptionSegment]]`
1835
1836 Segments of the translated text and their corresponding details.
1837
1838 - `id: int`
1839
1840 Unique identifier of the segment.
1841
1842 - `avg_logprob: float`
1843
1844 Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.
1845
1846 - `compression_ratio: float`
1847
1848 Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.
1849
1850 - `end: float`
1851
1852 End time of the segment in seconds.
1853
1854 - `no_speech_prob: float`
1855
        Probability of no speech in the segment. Because a probability cannot exceed 1.0, treat a value close to 1.0 combined with an `avg_logprob` below -1 as an indication that this segment is silent.
1857
1858 - `seek: int`
1859
1860 Seek offset of the segment.
1861
1862 - `start: float`
1863
1864 Start time of the segment in seconds.
1865
1866 - `temperature: float`
1867
1868 Temperature parameter used for generating the segment.
1869
1870 - `text: str`
1871
1872 Text content of the segment.
1873
1874 - `tokens: List[int]`
1875
1876 Array of token IDs for the text content.
1877
### Example

```python
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY"),  # This is the default and can be omitted
)
translation = client.audio.translations.create(
    # Placeholder bytes; pass real audio content or an open binary file here.
    file=b"Example data",
    model="whisper-1",
)
print(translation)
```

#### Response

```json
{
  "text": "text"
}
```
1901
### Example

```python
from openai import OpenAI

client = OpenAI()

# Open the audio in binary mode; the context manager closes the file afterwards.
with open("speech.mp3", "rb") as audio_file:
    translation = client.audio.translations.create(
        model="whisper-1",
        file=audio_file,
    )

# `Translation.text` holds the translated English text.
print(translation.text)
```

#### Response

```json
{
  "text": "Hello, my name is Wolfgang and I come from Germany. Where are you heading today?"
}
```
1922
## Domain Types

### Translation
1926
1927- `class Translation: …`
1928
1929 - `text: str`
1930
### Translation Verbose
1932
1933- `class TranslationVerbose: …`
1934
1935 - `duration: float`
1936
    The duration of the input audio, in seconds.
1938
1939 - `language: str`
1940
1941 The language of the output translation (always `english`).
1942
1943 - `text: str`
1944
1945 The translated text.
1946
1947 - `segments: Optional[List[TranscriptionSegment]]`
1948
1949 Segments of the translated text and their corresponding details.
1950
1951 - `id: int`
1952
1953 Unique identifier of the segment.
1954
1955 - `avg_logprob: float`
1956
1957 Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.
1958
1959 - `compression_ratio: float`
1960
1961 Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.
1962
1963 - `end: float`
1964
1965 End time of the segment in seconds.
1966
1967 - `no_speech_prob: float`
1968
      Probability of no speech in the segment. Because a probability cannot exceed 1.0, treat a value close to 1.0 combined with an `avg_logprob` below -1 as an indication that this segment is silent.
1970
1971 - `seek: int`
1972
1973 Seek offset of the segment.
1974
1975 - `start: float`
1976
1977 Start time of the segment in seconds.
1978
1979 - `temperature: float`
1980
1981 Temperature parameter used for generating the segment.
1982
1983 - `text: str`
1984
1985 Text content of the segment.
1986
1987 - `tokens: List[int]`
1988
1989 Array of token IDs for the text content.
1990
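The `start`, `end`, and `text` fields of each segment are enough to build subtitles on the client. The API can already return `srt` directly through `response_format`, so the sketch below is only useful when a `verbose_json` result is on hand and captions are needed afterwards. `srt_timestamp` and `segments_to_srt` are hypothetical helpers, and the segments are represented as dictionaries with the documented keys.

```python
from typing import Any, Dict, List


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments: List[Dict[str, Any]]) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}"
        )
    return "\n\n".join(cues) + "\n"


# Invented segments for illustration.
sample_segments = [
    {"start": 0.0, "end": 2.5, "text": " Hello."},
    {"start": 2.5, "end": 4.0, "text": " Goodbye."},
]
print(segments_to_srt(sample_segments))
```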
### Translation Create Response
1992
1993- `TranslationCreateResponse`
1994
1995 - `class Translation: …`
1996
1997 - `text: str`
1998
1999 - `class TranslationVerbose: …`
2000
2001 - `duration: float`
2002
      The duration of the input audio, in seconds.
2004
2005 - `language: str`
2006
2007 The language of the output translation (always `english`).
2008
2009 - `text: str`
2010
2011 The translated text.
2012
2013 - `segments: Optional[List[TranscriptionSegment]]`
2014
2015 Segments of the translated text and their corresponding details.
2016
2017 - `id: int`
2018
2019 Unique identifier of the segment.
2020
2021 - `avg_logprob: float`
2022
2023 Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.
2024
2025 - `compression_ratio: float`
2026
2027 Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.
2028
2029 - `end: float`
2030
2031 End time of the segment in seconds.
2032
2033 - `no_speech_prob: float`
2034
        Probability of no speech in the segment. Because a probability cannot exceed 1.0, treat a value close to 1.0 combined with an `avg_logprob` below -1 as an indication that this segment is silent.
2036
2037 - `seek: int`
2038
2039 Seek offset of the segment.
2040
2041 - `start: float`
2042
2043 Start time of the segment in seconds.
2044
2045 - `temperature: float`
2046
2047 Temperature parameter used for generating the segment.
2048
2049 - `text: str`
2050
2051 Text content of the segment.
2052
2053 - `tokens: List[int]`
2054
2055 Array of token IDs for the text content.
2056
# Speech

## Create speech

`audio.speech.create(SpeechCreateParams**kwargs) -> BinaryResponseContent`

**post** `/audio/speech`

Generates audio from the input text.

Returns the audio file content, or a stream of audio events.

### Parameters
2070
2071- `input: str`
2072
2073 The text to generate audio for. The maximum length is 4096 characters.
2074
2075- `model: Union[str, SpeechModel]`
2076
2077 One of the available [TTS models](https://platform.openai.com/docs/models#tts): `tts-1`, `tts-1-hd`, `gpt-4o-mini-tts`, or `gpt-4o-mini-tts-2025-12-15`.
2078
2079 - `str`
2080
2081 - `Literal["tts-1", "tts-1-hd", "gpt-4o-mini-tts", "gpt-4o-mini-tts-2025-12-15"]`
2082
2083 - `"tts-1"`
2084
2085 - `"tts-1-hd"`
2086
2087 - `"gpt-4o-mini-tts"`
2088
2089 - `"gpt-4o-mini-tts-2025-12-15"`
2090
2091- `voice: Voice`
2092
2093 The voice to use when generating the audio. Supported built-in voices are `alloy`, `ash`, `ballad`, `coral`, `echo`, `fable`, `onyx`, `nova`, `sage`, `shimmer`, `verse`, `marin`, and `cedar`. You may also provide a custom voice object with an `id`, for example `{ "id": "voice_1234" }`. Previews of the voices are available in the [Text to speech guide](https://platform.openai.com/docs/guides/text-to-speech#voice-options).
2094
2095 - `str`
2096
2097 - `Literal["alloy", "ash", "ballad", 7 more]`
2098
2099 - `"alloy"`
2100
2101 - `"ash"`
2102
2103 - `"ballad"`
2104
2105 - `"coral"`
2106
2107 - `"echo"`
2108
2109 - `"sage"`
2110
2111 - `"shimmer"`
2112
2113 - `"verse"`
2114
2115 - `"marin"`
2116
2117 - `"cedar"`
2118
2119 - `class VoiceID: …`
2120
2121 Custom voice reference.
2122
2123 - `id: str`
2124
2125 The custom voice ID, e.g. `voice_1234`.
2126
2127- `instructions: Optional[str]`
2128
2129 Control the voice of your generated audio with additional instructions. Does not work with `tts-1` or `tts-1-hd`.
2130
2131- `response_format: Optional[Literal["mp3", "opus", "aac", 3 more]]`
2132
  The format to return the audio in. Supported formats are `mp3`, `opus`, `aac`, `flac`, `wav`, and `pcm`.
2134
2135 - `"mp3"`
2136
2137 - `"opus"`
2138
2139 - `"aac"`
2140
2141 - `"flac"`
2142
2143 - `"wav"`
2144
2145 - `"pcm"`
2146
2147- `speed: Optional[float]`
2148
2149 The speed of the generated audio. Select a value from `0.25` to `4.0`. `1.0` is the default.
2150
2151- `stream_format: Optional[Literal["sse", "audio"]]`
2152
2153 The format to stream the audio in. Supported formats are `sse` and `audio`. `sse` is not supported for `tts-1` or `tts-1-hd`.
2154
2155 - `"sse"`
2156
2157 - `"audio"`
2158
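Several of the parameters above carry limits: `input` is capped at 4096 characters, `speed` must lie between `0.25` and `4.0`, and `instructions` and the `sse` stream format are not available with `tts-1` or `tts-1-hd`. A minimal sketch of checking these before sending a request follows. `check_speech_params` is a hypothetical helper, not an SDK function, and the server remains the authority on what is accepted.

```python
from typing import List, Optional

# Models that the parameter notes single out as lacking `instructions` and `sse`.
LEGACY_TTS_MODELS = {"tts-1", "tts-1-hd"}
MAX_INPUT_CHARS = 4096


def check_speech_params(
    text: str,
    model: str,
    speed: float = 1.0,
    instructions: Optional[str] = None,
    stream_format: Optional[str] = None,
) -> List[str]:
    """Return readable problems; an empty list means no documented limit is violated."""
    problems: List[str] = []
    if len(text) > MAX_INPUT_CHARS:
        problems.append(f"input is {len(text)} characters; the maximum is {MAX_INPUT_CHARS}")
    if not 0.25 <= speed <= 4.0:
        problems.append(f"speed {speed} is outside 0.25 to 4.0")
    if instructions is not None and model in LEGACY_TTS_MODELS:
        problems.append(f"instructions does not work with {model}")
    if stream_format == "sse" and model in LEGACY_TTS_MODELS:
        problems.append(f"stream_format 'sse' is not supported for {model}")
    return problems


print(check_speech_params("Hello", "gpt-4o-mini-tts", speed=1.25))
print(check_speech_params("Hello", "tts-1", speed=5.0, instructions="Speak cheerfully"))
```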
### Returns
2160
2161- `BinaryResponseContent`
2162
### Example

```python
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY"),  # This is the default and can be omitted
)
speech = client.audio.speech.create(
    input="The quick brown fox jumped over the lazy dog.",
    model="gpt-4o-mini-tts",
    voice="alloy",
)
# `speech` is a BinaryResponseContent; read() returns the audio as bytes.
content = speech.read()
print(f"Received {len(content)} bytes of audio")
```
2181
### Example

```python
from pathlib import Path
import openai

speech_file_path = Path(__file__).parent / "speech.mp3"
with openai.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="alloy",
    input="The quick brown fox jumped over the lazy dog."
) as response:
    response.stream_to_file(speech_file_path)
```
2196
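Because `input` is limited to 4096 characters, longer text has to be split across several requests. The sketch below shows one simple way to do that, breaking at the last space within the limit and falling back to a hard cut when there is none. `split_for_tts` is a hypothetical helper; splitting on sentence boundaries would likely sound better but is not attempted here.

```python
from typing import List


def split_for_tts(text: str, limit: int = 4096) -> List[str]:
    """Split text into chunks of at most `limit` characters, preferring whitespace breaks."""
    chunks: List[str] = []
    remaining = text.strip()
    while len(remaining) > limit:
        # Look for the last space at or before the limit.
        cut = remaining.rfind(" ", 0, limit + 1)
        if cut <= 0:
            cut = limit  # no space available: cut mid-word
        chunks.append(remaining[:cut].rstrip())
        remaining = remaining[cut:].lstrip()
    if remaining:
        chunks.append(remaining)
    return chunks


# A small limit makes the behaviour visible; real calls would keep the default.
print(split_for_tts("aaa bbb ccc", limit=7))
print(split_for_tts("abcdefghij", limit=4))
```

Each chunk could then be passed as `input` to a separate `audio.speech.create` call and the resulting audio concatenated by the caller.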
## Domain Types

### Speech Model
2200
2201- `Literal["tts-1", "tts-1-hd", "gpt-4o-mini-tts", "gpt-4o-mini-tts-2025-12-15"]`
2202
2203 - `"tts-1"`
2204
2205 - `"tts-1-hd"`
2206
2207 - `"gpt-4o-mini-tts"`
2208
2209 - `"gpt-4o-mini-tts-2025-12-15"`
2210
# Voices

# Voice Consents