ruby/resources/audio/index.md +0 −1806 deleted
# Audio
2
## Domain Types
4
### Audio Model
6
7- `AudioModel = :"whisper-1" | :"gpt-4o-transcribe" | :"gpt-4o-mini-transcribe" | 2 more`
8
9 - `:"whisper-1"`
10
11 - `:"gpt-4o-transcribe"`
12
13 - `:"gpt-4o-mini-transcribe"`
14
15 - `:"gpt-4o-mini-transcribe-2025-12-15"`
16
17 - `:"gpt-4o-transcribe-diarize"`
18
### Audio Response Format
20
21- `AudioResponseFormat = :json | :text | :srt | 3 more`
22
23 The format of the output, in one of these options: `json`, `text`, `srt`, `verbose_json`, `vtt`, or `diarized_json`. For `gpt-4o-transcribe` and `gpt-4o-mini-transcribe`, the only supported format is `json`. For `gpt-4o-transcribe-diarize`, the supported formats are `json`, `text`, and `diarized_json`, with `diarized_json` required to receive speaker annotations.
24
25 - `:json`
26
27 - `:text`
28
29 - `:srt`
30
31 - `:verbose_json`
32
33 - `:vtt`
34
35 - `:diarized_json`
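
The format restrictions described above can be checked before a request is sent. The following is a minimal sketch, not part of the SDK: `FORMAT_RULES` and `format_supported?` are hypothetical names, and any model without a stated restriction (such as `whisper-1`) is assumed to be unconstrained.

```ruby
# Hypothetical helper, not part of the SDK. It encodes only the
# restrictions stated in the Audio Response Format description.
FORMAT_RULES = {
  "gpt-4o-transcribe" => %i[json],
  "gpt-4o-mini-transcribe" => %i[json],
  "gpt-4o-transcribe-diarize" => %i[json text diarized_json]
}.freeze

# Returns true when the model/format pair is not ruled out by FORMAT_RULES.
# Models without an entry are assumed to accept any format.
def format_supported?(model, response_format)
  allowed = FORMAT_RULES[model.to_s]
  return true if allowed.nil?

  allowed.include?(response_format.to_sym)
end

puts format_supported?(:"gpt-4o-transcribe", :json)
puts format_supported?(:"gpt-4o-transcribe", :srt)
```

Accepting both strings and symbols mirrors the `String | AudioModel` parameter type, so the check works for either spelling.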
36
# Transcriptions
38
## Create transcription
40
41`audio.transcriptions.create(**kwargs) -> TranscriptionCreateResponse`
42
43**post** `/audio/transcriptions`
44
45Transcribes audio into the input language.
46
47Returns a transcription object in `json`, `diarized_json`, or `verbose_json`
48format, or a stream of transcript events.
49
### Parameters
51
52- `file: String`
53
54 The audio file object (not file name) to transcribe, in one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm.
55
56- `model: String | AudioModel`
57
58 ID of the model to use. The options are `gpt-4o-transcribe`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `whisper-1` (which is powered by our open source Whisper V2 model), and `gpt-4o-transcribe-diarize`.
59
60 - `String = String`
61
62 - `AudioModel = :"whisper-1" | :"gpt-4o-transcribe" | :"gpt-4o-mini-transcribe" | 2 more`
63
64 - `:"whisper-1"`
65
66 - `:"gpt-4o-transcribe"`
67
68 - `:"gpt-4o-mini-transcribe"`
69
70 - `:"gpt-4o-mini-transcribe-2025-12-15"`
71
72 - `:"gpt-4o-transcribe-diarize"`
73
74- `chunking_strategy: :auto | VadConfig{ type, prefix_padding_ms, silence_duration_ms, threshold}`
75
 Controls how the audio is cut into chunks. When set to `"auto"`, the server first normalizes loudness and then uses voice activity detection (VAD) to choose boundaries. A `server_vad` object can be provided instead to tune the VAD parameters manually. If unset, the audio is transcribed as a single block. Required when using `gpt-4o-transcribe-diarize` for inputs longer than 30 seconds.
77
78 - `ChunkingStrategy = :auto`
79
80 Automatically set chunking parameters based on the audio. Must be set to `"auto"`.
81
82 - `:auto`
83
84 - `class VadConfig`
85
86 - `type: :server_vad`
87
88 Must be set to `server_vad` to enable manual chunking using server side VAD.
89
90 - `:server_vad`
91
92 - `prefix_padding_ms: Integer`
93
 Amount of audio to include before the VAD-detected speech, in milliseconds.
96
97 - `silence_duration_ms: Integer`
98
99 Duration of silence to detect speech stop (in milliseconds).
100 With shorter values the model will respond more quickly,
101 but may jump in on short pauses from the user.
102
103 - `threshold: Float`
104
105 Sensitivity threshold (0.0 to 1.0) for voice activity detection. A
106 higher threshold will require louder audio to activate the model, and
107 thus might perform better in noisy environments.
108
109- `include: Array[TranscriptionInclude]`
110
 Additional information to include in the transcription response. `logprobs` returns the log probabilities of the tokens in the response, which indicate the model's confidence in the transcription. `logprobs` only works with `response_format` set to `json` and only with the models `gpt-4o-transcribe`, `gpt-4o-mini-transcribe`, and `gpt-4o-mini-transcribe-2025-12-15`. This field is not supported when using `gpt-4o-transcribe-diarize`.
116
117 - `:logprobs`
118
119- `known_speaker_names: Array[String]`
120
121 Optional list of speaker names that correspond to the audio samples provided in `known_speaker_references[]`. Each entry should be a short identifier (for example `customer` or `agent`). Up to 4 speakers are supported.
122
123- `known_speaker_references: Array[String]`
124
125 Optional list of audio samples (as [data URLs](https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/Data_URLs)) that contain known speaker references matching `known_speaker_names[]`. Each sample must be between 2 and 10 seconds, and can use any of the same input audio formats supported by `file`.
126
127- `language: String`
128
129 The language of the input audio. Supplying the input language in [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) (e.g. `en`) format will improve accuracy and latency.
130
131- `prompt: String`
132
133 An optional text to guide the model's style or continue a previous audio segment. The [prompt](https://platform.openai.com/docs/guides/speech-to-text#prompting) should match the audio language. This field is not supported when using `gpt-4o-transcribe-diarize`.
134
135- `response_format: AudioResponseFormat`
136
137 The format of the output, in one of these options: `json`, `text`, `srt`, `verbose_json`, `vtt`, or `diarized_json`. For `gpt-4o-transcribe` and `gpt-4o-mini-transcribe`, the only supported format is `json`. For `gpt-4o-transcribe-diarize`, the supported formats are `json`, `text`, and `diarized_json`, with `diarized_json` required to receive speaker annotations.
138
139 - `:json`
140
141 - `:text`
142
143 - `:srt`
144
145 - `:verbose_json`
146
147 - `:vtt`
148
149 - `:diarized_json`
150
151- `stream: bool`
152
153 If set to true, the model response data will be streamed to the client
154 as it is generated using [server-sent events](https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events/Using_server-sent_events#Event_stream_format).
155 See the [Streaming section of the Speech-to-Text guide](https://platform.openai.com/docs/guides/speech-to-text?lang=curl#streaming-transcriptions)
156 for more information.
157
 Note: Streaming is not supported for the `whisper-1` model; the `stream` parameter is ignored for it.
159
160- `temperature: Float`
161
162 The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use [log probability](https://en.wikipedia.org/wiki/Log_probability) to automatically increase the temperature until certain thresholds are hit.
163
164- `timestamp_granularities: Array[:word | :segment]`
165
 The timestamp granularities to populate for this transcription. `response_format` must be set to `verbose_json` to use timestamp granularities. Either or both of these options are supported: `word` and `segment`. Note: There is no additional latency for segment timestamps, but generating word timestamps incurs additional latency. This option is not available for `gpt-4o-transcribe-diarize`.
168
169 - `:word`
170
171 - `:segment`
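
The diarization-related parameters above interact: `diarized_json` is needed for speaker annotations, `chunking_strategy` is required for longer inputs, and known speakers are passed as parallel arrays of names and data URLs. The sketch below only assembles the keyword arguments and sends no request. `data_url` and `diarize_params` are hypothetical helpers, and the `audio/wav` MIME type is an assumption that should match the real sample format.

```ruby
require "stringio"

# Hypothetical helpers, not part of the SDK. They build the keyword
# arguments described in the parameter list; no request is sent.

# Encodes raw audio bytes as a base64 data URL. The MIME type is an
# assumption and should match the actual sample format.
def data_url(bytes, mime: "audio/wav")
  "data:#{mime};base64,#{[bytes].pack('m0')}"
end

# `speakers` maps a short name (e.g. "agent") to the raw bytes of a
# reference sample for that speaker.
def diarize_params(file:, speakers: {})
  raise ArgumentError, "up to 4 known speakers are supported" if speakers.size > 4

  params = {
    file: file,
    model: :"gpt-4o-transcribe-diarize",
    response_format: :diarized_json, # required for speaker annotations
    chunking_strategy: :auto         # required for inputs longer than 30 seconds
  }
  unless speakers.empty?
    params[:known_speaker_names] = speakers.keys.map(&:to_s)
    params[:known_speaker_references] = speakers.values.map { |bytes| data_url(bytes) }
  end
  params
end

# StringIO and "abc" are placeholders for real audio data.
params = diarize_params(file: StringIO.new("Example data"), speakers: { "agent" => "abc" })
puts params[:known_speaker_references].first
```

The resulting hash could then be splatted into the call, as in `openai.audio.transcriptions.create(**params)`.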
172
### Returns
174
175- `TranscriptionCreateResponse = Transcription | TranscriptionDiarized | TranscriptionVerbose`
176
 Represents a transcription response returned by the model, based on the provided input.
178
179 - `class Transcription`
180
 Represents a transcription response returned by the model, based on the provided input.
182
183 - `text: String`
184
185 The transcribed text.
186
187 - `logprobs: Array[Logprob{ token, bytes, logprob}]`
188
189 The log probabilities of the tokens in the transcription. Only returned with the models `gpt-4o-transcribe` and `gpt-4o-mini-transcribe` if `logprobs` is added to the `include` array.
190
191 - `token: String`
192
193 The token in the transcription.
194
195 - `bytes: Array[Float]`
196
197 The bytes of the token.
198
199 - `logprob: Float`
200
201 The log probability of the token.
202
203 - `usage: Tokens{ input_tokens, output_tokens, total_tokens, 2 more} | Duration{ seconds, type}`
204
 Token or duration usage statistics for the request.
206
207 - `class Tokens`
208
209 Usage statistics for models billed by token usage.
210
211 - `input_tokens: Integer`
212
213 Number of input tokens billed for this request.
214
215 - `output_tokens: Integer`
216
217 Number of output tokens generated.
218
219 - `total_tokens: Integer`
220
221 Total number of tokens used (input + output).
222
223 - `type: :tokens`
224
225 The type of the usage object. Always `tokens` for this variant.
226
227 - `:tokens`
228
229 - `input_token_details: InputTokenDetails{ audio_tokens, text_tokens}`
230
231 Details about the input tokens billed for this request.
232
233 - `audio_tokens: Integer`
234
235 Number of audio tokens billed for this request.
236
237 - `text_tokens: Integer`
238
239 Number of text tokens billed for this request.
240
241 - `class Duration`
242
243 Usage statistics for models billed by audio input duration.
244
245 - `seconds: Float`
246
247 Duration of the input audio in seconds.
248
249 - `type: :duration`
250
251 The type of the usage object. Always `duration` for this variant.
252
253 - `:duration`
254
255 - `class TranscriptionDiarized`
256
257 Represents a diarized transcription response returned by the model, including the combined transcript and speaker-segment annotations.
258
259 - `duration: Float`
260
261 Duration of the input audio in seconds.
262
263 - `segments: Array[TranscriptionDiarizedSegment]`
264
265 Segments of the transcript annotated with timestamps and speaker labels.
266
267 - `id: String`
268
269 Unique identifier for the segment.
270
271 - `end_: Float`
272
273 End timestamp of the segment in seconds.
274
275 - `speaker: String`
276
277 Speaker label for this segment. When known speakers are provided, the label matches `known_speaker_names[]`. Otherwise speakers are labeled sequentially using capital letters (`A`, `B`, ...).
278
279 - `start: Float`
280
281 Start timestamp of the segment in seconds.
282
283 - `text: String`
284
285 Transcript text for this segment.
286
287 - `type: :"transcript.text.segment"`
288
289 The type of the segment. Always `transcript.text.segment`.
290
291 - `:"transcript.text.segment"`
292
293 - `task: :transcribe`
294
295 The type of task that was run. Always `transcribe`.
296
297 - `:transcribe`
298
299 - `text: String`
300
301 The concatenated transcript text for the entire audio input.
302
303 - `usage: Tokens{ input_tokens, output_tokens, total_tokens, 2 more} | Duration{ seconds, type}`
304
305 Token or duration usage statistics for the request.
306
307 - `class Tokens`
308
309 Usage statistics for models billed by token usage.
310
311 - `input_tokens: Integer`
312
313 Number of input tokens billed for this request.
314
315 - `output_tokens: Integer`
316
317 Number of output tokens generated.
318
319 - `total_tokens: Integer`
320
321 Total number of tokens used (input + output).
322
323 - `type: :tokens`
324
325 The type of the usage object. Always `tokens` for this variant.
326
327 - `:tokens`
328
329 - `input_token_details: InputTokenDetails{ audio_tokens, text_tokens}`
330
331 Details about the input tokens billed for this request.
332
333 - `audio_tokens: Integer`
334
335 Number of audio tokens billed for this request.
336
337 - `text_tokens: Integer`
338
339 Number of text tokens billed for this request.
340
341 - `class Duration`
342
343 Usage statistics for models billed by audio input duration.
344
345 - `seconds: Float`
346
347 Duration of the input audio in seconds.
348
349 - `type: :duration`
350
351 The type of the usage object. Always `duration` for this variant.
352
353 - `:duration`
354
355 - `class TranscriptionVerbose`
356
 Represents a `verbose_json` transcription response returned by the model, based on the provided input.
358
359 - `duration: Float`
360
361 The duration of the input audio.
362
363 - `language: String`
364
365 The language of the input audio.
366
367 - `text: String`
368
369 The transcribed text.
370
371 - `segments: Array[TranscriptionSegment]`
372
373 Segments of the transcribed text and their corresponding details.
374
375 - `id: Integer`
376
377 Unique identifier of the segment.
378
379 - `avg_logprob: Float`
380
381 Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.
382
383 - `compression_ratio: Float`
384
385 Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.
386
387 - `end_: Float`
388
389 End time of the segment in seconds.
390
391 - `no_speech_prob: Float`
392
 Probability of no speech in the segment, between 0 and 1. If the value is high (the open-source Whisper implementation uses a default threshold of 0.6) and the `avg_logprob` is below -1, consider this segment silent.
394
395 - `seek: Integer`
396
397 Seek offset of the segment.
398
399 - `start: Float`
400
401 Start time of the segment in seconds.
402
403 - `temperature: Float`
404
405 Temperature parameter used for generating the segment.
406
407 - `text: String`
408
409 Text content of the segment.
410
411 - `tokens: Array[Integer]`
412
413 Array of token IDs for the text content.
414
415 - `usage: Usage{ seconds, type}`
416
417 Usage statistics for models billed by audio input duration.
418
419 - `seconds: Float`
420
421 Duration of the input audio in seconds.
422
423 - `type: :duration`
424
425 The type of the usage object. Always `duration` for this variant.
426
427 - `:duration`
428
429 - `words: Array[TranscriptionWord]`
430
431 Extracted words and their corresponding timestamps.
432
433 - `end_: Float`
434
435 End time of the word in seconds.
436
437 - `start: Float`
438
439 Start time of the word in seconds.
440
441 - `word: String`
442
443 The text content of the word.
444
### Example
446
```ruby
require "openai"
require "stringio"

# Replace the placeholder with a real API key.
openai = OpenAI::Client.new(api_key: "My API Key")

# `file` takes an IO-like object; StringIO stands in for real audio data here.
transcription = openai.audio.transcriptions.create(
  file: StringIO.new("Example data"),
  model: :"gpt-4o-transcribe"
)

puts(transcription)
```
456
#### Response
458
```json
{
  "text": "text",
  "logprobs": [
    {
      "token": "token",
      "bytes": [
        0
      ],
      "logprob": 0
    }
  ],
  "usage": {
    "input_tokens": 0,
    "output_tokens": 0,
    "total_tokens": 0,
    "type": "tokens",
    "input_token_details": {
      "audio_tokens": 0,
      "text_tokens": 0
    }
  }
}
```
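
The `usage` object in a response comes in two variants, distinguished by its `type` field. The sketch below shows one way to summarize either variant from a parsed JSON body. `usage_summary` is a hypothetical helper, and the numbers in the sample body are invented placeholders rather than real API output.

```ruby
require "json"

# Hypothetical helper: summarizes either `usage` variant of a parsed response.
def usage_summary(usage)
  case usage["type"]
  when "tokens"
    details = usage.fetch("input_token_details", {})
    "#{usage['total_tokens']} tokens " \
      "(#{details.fetch('audio_tokens', 0)} audio, #{details.fetch('text_tokens', 0)} text input)"
  when "duration"
    "#{usage['seconds']} seconds of audio"
  else
    "unknown usage type"
  end
end

# Placeholder body shaped like the response above; the numbers are invented.
body = JSON.parse(<<~BODY)
  {
    "text": "Hello world",
    "usage": {
      "input_tokens": 14,
      "output_tokens": 5,
      "total_tokens": 19,
      "type": "tokens",
      "input_token_details": { "audio_tokens": 12, "text_tokens": 2 }
    }
  }
BODY

puts usage_summary(body["usage"])
```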
483
## Domain Types
485
### Transcription
487
488- `class Transcription`
489
 Represents a transcription response returned by the model, based on the provided input.
491
492 - `text: String`
493
494 The transcribed text.
495
496 - `logprobs: Array[Logprob{ token, bytes, logprob}]`
497
498 The log probabilities of the tokens in the transcription. Only returned with the models `gpt-4o-transcribe` and `gpt-4o-mini-transcribe` if `logprobs` is added to the `include` array.
499
500 - `token: String`
501
502 The token in the transcription.
503
504 - `bytes: Array[Float]`
505
506 The bytes of the token.
507
508 - `logprob: Float`
509
510 The log probability of the token.
511
512 - `usage: Tokens{ input_tokens, output_tokens, total_tokens, 2 more} | Duration{ seconds, type}`
513
 Token or duration usage statistics for the request.
515
516 - `class Tokens`
517
518 Usage statistics for models billed by token usage.
519
520 - `input_tokens: Integer`
521
522 Number of input tokens billed for this request.
523
524 - `output_tokens: Integer`
525
526 Number of output tokens generated.
527
528 - `total_tokens: Integer`
529
530 Total number of tokens used (input + output).
531
532 - `type: :tokens`
533
534 The type of the usage object. Always `tokens` for this variant.
535
536 - `:tokens`
537
538 - `input_token_details: InputTokenDetails{ audio_tokens, text_tokens}`
539
540 Details about the input tokens billed for this request.
541
542 - `audio_tokens: Integer`
543
544 Number of audio tokens billed for this request.
545
546 - `text_tokens: Integer`
547
548 Number of text tokens billed for this request.
549
550 - `class Duration`
551
552 Usage statistics for models billed by audio input duration.
553
554 - `seconds: Float`
555
556 Duration of the input audio in seconds.
557
558 - `type: :duration`
559
560 The type of the usage object. Always `duration` for this variant.
561
562 - `:duration`
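
Token log probabilities can be turned into a rough confidence figure. The sketch below computes the geometric mean of the token probabilities, which is the exponential of the average `logprob`. `mean_token_probability` is a hypothetical helper, plain hashes stand in for the SDK's logprob objects, and the sample values are illustrative only.

```ruby
# Hypothetical helper: converts token log probabilities into a rough
# confidence score (the geometric mean of the token probabilities).
def mean_token_probability(logprobs)
  return nil if logprobs.empty?

  avg_logprob = logprobs.sum { |lp| lp[:logprob].to_f } / logprobs.size
  Math.exp(avg_logprob)
end

# Illustrative values only, shaped like the `logprobs` array above.
logprobs = [
  { token: "Hello", bytes: [72, 101, 108, 108, 111], logprob: -0.1 },
  { token: " world", bytes: [32, 119, 111, 114, 108, 100], logprob: -0.3 }
]

puts mean_token_probability(logprobs).round(3)
```

Averaging in log space before exponentiating avoids the underflow that multiplying many small probabilities would cause on long transcripts.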
563
### Transcription Diarized
565
566- `class TranscriptionDiarized`
567
568 Represents a diarized transcription response returned by the model, including the combined transcript and speaker-segment annotations.
569
570 - `duration: Float`
571
572 Duration of the input audio in seconds.
573
574 - `segments: Array[TranscriptionDiarizedSegment]`
575
576 Segments of the transcript annotated with timestamps and speaker labels.
577
578 - `id: String`
579
580 Unique identifier for the segment.
581
582 - `end_: Float`
583
584 End timestamp of the segment in seconds.
585
586 - `speaker: String`
587
588 Speaker label for this segment. When known speakers are provided, the label matches `known_speaker_names[]`. Otherwise speakers are labeled sequentially using capital letters (`A`, `B`, ...).
589
590 - `start: Float`
591
592 Start timestamp of the segment in seconds.
593
594 - `text: String`
595
596 Transcript text for this segment.
597
598 - `type: :"transcript.text.segment"`
599
600 The type of the segment. Always `transcript.text.segment`.
601
602 - `:"transcript.text.segment"`
603
604 - `task: :transcribe`
605
606 The type of task that was run. Always `transcribe`.
607
608 - `:transcribe`
609
610 - `text: String`
611
612 The concatenated transcript text for the entire audio input.
613
614 - `usage: Tokens{ input_tokens, output_tokens, total_tokens, 2 more} | Duration{ seconds, type}`
615
616 Token or duration usage statistics for the request.
617
618 - `class Tokens`
619
620 Usage statistics for models billed by token usage.
621
622 - `input_tokens: Integer`
623
624 Number of input tokens billed for this request.
625
626 - `output_tokens: Integer`
627
628 Number of output tokens generated.
629
630 - `total_tokens: Integer`
631
632 Total number of tokens used (input + output).
633
634 - `type: :tokens`
635
636 The type of the usage object. Always `tokens` for this variant.
637
638 - `:tokens`
639
640 - `input_token_details: InputTokenDetails{ audio_tokens, text_tokens}`
641
642 Details about the input tokens billed for this request.
643
644 - `audio_tokens: Integer`
645
646 Number of audio tokens billed for this request.
647
648 - `text_tokens: Integer`
649
650 Number of text tokens billed for this request.
651
652 - `class Duration`
653
654 Usage statistics for models billed by audio input duration.
655
656 - `seconds: Float`
657
658 Duration of the input audio in seconds.
659
660 - `type: :duration`
661
662 The type of the usage object. Always `duration` for this variant.
663
664 - `:duration`
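
A common use of the diarized `segments` array is to total the speaking time per speaker. The following is a minimal sketch under stated assumptions: `DiarizedSegment` is a local stand-in struct rather than the SDK class, `talk_time_by_speaker` is a hypothetical helper, and the sample segments are invented.

```ruby
# Local stand-in for the SDK's segment objects; only the fields used here.
DiarizedSegment = Struct.new(:speaker, :start, :end_, :text, keyword_init: true)

# Sums the speaking time of each speaker label, in seconds.
def talk_time_by_speaker(segments)
  segments.each_with_object(Hash.new(0.0)) do |seg, totals|
    totals[seg.speaker] += seg.end_ - seg.start
  end
end

# Invented sample data using the sequential labels described above.
segments = [
  DiarizedSegment.new(speaker: "A", start: 0.0, end_: 2.5, text: "Hi, how can I help?"),
  DiarizedSegment.new(speaker: "B", start: 2.5, end_: 4.0, text: "I have a question."),
  DiarizedSegment.new(speaker: "A", start: 4.0, end_: 5.0, text: "Sure.")
]

talk_time_by_speaker(segments).each { |speaker, secs| puts "#{speaker}: #{secs}s" }
```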
665
### Transcription Diarized Segment
667
668- `class TranscriptionDiarizedSegment`
669
670 A segment of diarized transcript text with speaker metadata.
671
672 - `id: String`
673
674 Unique identifier for the segment.
675
676 - `end_: Float`
677
678 End timestamp of the segment in seconds.
679
680 - `speaker: String`
681
682 Speaker label for this segment. When known speakers are provided, the label matches `known_speaker_names[]`. Otherwise speakers are labeled sequentially using capital letters (`A`, `B`, ...).
683
684 - `start: Float`
685
686 Start timestamp of the segment in seconds.
687
688 - `text: String`
689
690 Transcript text for this segment.
691
692 - `type: :"transcript.text.segment"`
693
694 The type of the segment. Always `transcript.text.segment`.
695
696 - `:"transcript.text.segment"`
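
Adjacent segments often share a speaker, so a readable transcript usually merges them into turns. The sketch below is one possible approach, not SDK behavior: `Turn` and `merge_turns` are hypothetical names, and segments are represented as plain hashes with the field names listed above.

```ruby
# Hypothetical container for a run of consecutive segments by one speaker.
Turn = Struct.new(:speaker, :start, :end_, :text, keyword_init: true)

# Merges adjacent segments that share a speaker label into single turns.
def merge_turns(segments)
  segments.each_with_object([]) do |seg, turns|
    last = turns.last
    if last && last.speaker == seg[:speaker]
      # Same speaker keeps talking: extend the current turn.
      last.end_ = seg[:end_]
      last.text = "#{last.text} #{seg[:text]}"
    else
      turns << Turn.new(speaker: seg[:speaker], start: seg[:start],
                        end_: seg[:end_], text: seg[:text])
    end
  end
end

# Invented sample data.
segments = [
  { speaker: "A", start: 0.0, end_: 1.5, text: "Hello there." },
  { speaker: "A", start: 1.5, end_: 3.0, text: "How are you?" },
  { speaker: "B", start: 3.0, end_: 4.0, text: "Fine, thanks." }
]

merge_turns(segments).each { |turn| puts "#{turn.speaker}: #{turn.text}" }
```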
697
### Transcription Include
699
700- `TranscriptionInclude = :logprobs`
701
702 - `:logprobs`
703
### Transcription Segment
705
706- `class TranscriptionSegment`
707
708 - `id: Integer`
709
710 Unique identifier of the segment.
711
712 - `avg_logprob: Float`
713
714 Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.
715
716 - `compression_ratio: Float`
717
718 Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.
719
720 - `end_: Float`
721
722 End time of the segment in seconds.
723
724 - `no_speech_prob: Float`
725
 Probability of no speech in the segment, between 0 and 1. If the value is high (the open-source Whisper implementation uses a default threshold of 0.6) and the `avg_logprob` is below -1, consider this segment silent.
727
728 - `seek: Integer`
729
730 Seek offset of the segment.
731
732 - `start: Float`
733
734 Start time of the segment in seconds.
735
736 - `temperature: Float`
737
738 Temperature parameter used for generating the segment.
739
740 - `text: String`
741
742 Text content of the segment.
743
744 - `tokens: Array[Integer]`
745
746 Array of token IDs for the text content.
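
The `avg_logprob` and `compression_ratio` thresholds mentioned above can be applied as a simple quality filter. The following is a sketch, not SDK functionality: `reliable_segments` and the two constants are hypothetical names, segments are plain hashes standing in for `TranscriptionSegment` objects, and the sample values are invented.

```ruby
# Thresholds taken from the field descriptions above.
MIN_AVG_LOGPROB = -1.0
MAX_COMPRESSION_RATIO = 2.4

# Returns the segments that pass both checks. Segments are plain hashes
# here, standing in for the SDK's TranscriptionSegment objects.
def reliable_segments(segments)
  segments.select do |seg|
    seg[:avg_logprob] >= MIN_AVG_LOGPROB &&
      seg[:compression_ratio] <= MAX_COMPRESSION_RATIO
  end
end

# Invented sample data.
segments = [
  { id: 0, text: "Clear speech.", avg_logprob: -0.2, compression_ratio: 1.3 },
  { id: 1, text: "Mumbled part.", avg_logprob: -1.4, compression_ratio: 1.1 },
  { id: 2, text: "la la la la la", avg_logprob: -0.3, compression_ratio: 3.0 }
]

reliable_segments(segments).each { |seg| puts seg[:text] }
```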
747
### Transcription Stream Event
749
750- `TranscriptionStreamEvent = TranscriptionTextSegmentEvent | TranscriptionTextDeltaEvent | TranscriptionTextDoneEvent`
751
 An event emitted while a transcription is streamed, that is, when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with `stream` set to `true`. It is one of the three event classes below.
753
754 - `class TranscriptionTextSegmentEvent`
755
756 Emitted when a diarized transcription returns a completed segment with speaker information. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with `stream` set to `true` and `response_format` set to `diarized_json`.
757
758 - `id: String`
759
760 Unique identifier for the segment.
761
762 - `end_: Float`
763
764 End timestamp of the segment in seconds.
765
766 - `speaker: String`
767
768 Speaker label for this segment.
769
770 - `start: Float`
771
772 Start timestamp of the segment in seconds.
773
774 - `text: String`
775
776 Transcript text for this segment.
777
778 - `type: :"transcript.text.segment"`
779
780 The type of the event. Always `transcript.text.segment`.
781
782 - `:"transcript.text.segment"`
783
784 - `class TranscriptionTextDeltaEvent`
785
 Emitted when there is an additional text delta. This is also the first event emitted when the transcription starts. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `stream` parameter set to `true`.
787
788 - `delta: String`
789
790 The text delta that was additionally transcribed.
791
792 - `type: :"transcript.text.delta"`
793
794 The type of the event. Always `transcript.text.delta`.
795
796 - `:"transcript.text.delta"`
797
798 - `logprobs: Array[Logprob{ token, bytes, logprob}]`
799
800 The log probabilities of the delta. Only included if you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `include[]` parameter set to `logprobs`.
801
802 - `token: String`
803
804 The token that was used to generate the log probability.
805
806 - `bytes: Array[Integer]`
807
808 The bytes that were used to generate the log probability.
809
810 - `logprob: Float`
811
812 The log probability of the token.
813
814 - `segment_id: String`
815
816 Identifier of the diarized segment that this delta belongs to. Only present when using `gpt-4o-transcribe-diarize`.
817
818 - `class TranscriptionTextDoneEvent`
819
 Emitted when the transcription is complete. Contains the complete transcription text. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `stream` parameter set to `true`.
821
822 - `text: String`
823
824 The text that was transcribed.
825
826 - `type: :"transcript.text.done"`
827
828 The type of the event. Always `transcript.text.done`.
829
830 - `:"transcript.text.done"`
831
832 - `logprobs: Array[Logprob{ token, bytes, logprob}]`
833
834 The log probabilities of the individual tokens in the transcription. Only included if you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `include[]` parameter set to `logprobs`.
835
836 - `token: String`
837
838 The token that was used to generate the log probability.
839
840 - `bytes: Array[Integer]`
841
842 The bytes that were used to generate the log probability.
843
844 - `logprob: Float`
845
846 The log probability of the token.
847
848 - `usage: Usage{ input_tokens, output_tokens, total_tokens, 2 more}`
849
850 Usage statistics for models billed by token usage.
851
852 - `input_tokens: Integer`
853
854 Number of input tokens billed for this request.
855
856 - `output_tokens: Integer`
857
858 Number of output tokens generated.
859
860 - `total_tokens: Integer`
861
862 Total number of tokens used (input + output).
863
864 - `type: :tokens`
865
866 The type of the usage object. Always `tokens` for this variant.
867
868 - `:tokens`
869
870 - `input_token_details: InputTokenDetails{ audio_tokens, text_tokens}`
871
872 Details about the input tokens billed for this request.
873
874 - `audio_tokens: Integer`
875
876 Number of audio tokens billed for this request.
877
878 - `text_tokens: Integer`
879
880 Number of text tokens billed for this request.
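
A consumer of these events typically branches on the `type` field. The sketch below folds a sequence of events into the final transcript text. It makes several assumptions: `collect_transcript` is a hypothetical helper, events are plain hashes rather than the typed objects the SDK yields, and the sample events are invented.

```ruby
# Folds a sequence of stream events into the final transcript text.
# Events are plain hashes here; the SDK yields typed event objects instead.
def collect_transcript(events)
  buffer = +""
  events.each do |event|
    case event[:type].to_s
    when "transcript.text.delta"
      buffer << event[:delta]
    when "transcript.text.segment"
      # Diarized segments arrive complete; this sketch ignores them.
      next
    when "transcript.text.done"
      return event[:text] # the done event carries the complete text
    end
  end
  buffer # the stream ended without a done event
end

# Invented sample events.
events = [
  { type: :"transcript.text.delta", delta: "Hel" },
  { type: :"transcript.text.delta", delta: "lo" },
  { type: :"transcript.text.done", text: "Hello" }
]

puts collect_transcript(events)
```

Preferring the text of the done event over the accumulated deltas keeps the result correct even if a delta was missed.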
881
### Transcription Text Delta Event
883
884- `class TranscriptionTextDeltaEvent`
885
 Emitted when there is an additional text delta. This is also the first event emitted when the transcription starts. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `stream` parameter set to `true`.
887
888 - `delta: String`
889
890 The text delta that was additionally transcribed.
891
892 - `type: :"transcript.text.delta"`
893
894 The type of the event. Always `transcript.text.delta`.
895
896 - `:"transcript.text.delta"`
897
898 - `logprobs: Array[Logprob{ token, bytes, logprob}]`
899
900 The log probabilities of the delta. Only included if you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `include[]` parameter set to `logprobs`.
901
902 - `token: String`
903
904 The token that was used to generate the log probability.
905
906 - `bytes: Array[Integer]`
907
908 The bytes that were used to generate the log probability.
909
910 - `logprob: Float`
911
912 The log probability of the token.
913
914 - `segment_id: String`
915
916 Identifier of the diarized segment that this delta belongs to. Only present when using `gpt-4o-transcribe-diarize`.
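
When streaming with `gpt-4o-transcribe-diarize`, the `segment_id` field makes it possible to reassemble deltas per segment. The following is a minimal sketch: `text_by_segment` is a hypothetical helper, events are plain hashes rather than SDK objects, and the segment identifiers are invented.

```ruby
# Groups streamed delta text by the diarized segment it belongs to.
# Events are plain hashes standing in for TranscriptionTextDeltaEvent objects.
def text_by_segment(deltas)
  grouped = Hash.new { |hash, key| hash[key] = +"" }
  deltas.each_with_object(grouped) do |event, acc|
    acc[event[:segment_id]] << event[:delta]
  end
end

# Invented sample events; the identifiers are placeholders.
deltas = [
  { type: "transcript.text.delta", segment_id: "seg_1", delta: "Hi" },
  { type: "transcript.text.delta", segment_id: "seg_2", delta: "Yes" },
  { type: "transcript.text.delta", segment_id: "seg_1", delta: " there" }
]

text_by_segment(deltas).each { |id, text| puts "#{id}: #{text}" }
```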
917
### Transcription Text Done Event
919
920- `class TranscriptionTextDoneEvent`
921
 Emitted when the transcription is complete. Contains the complete transcription text. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `stream` parameter set to `true`.
923
924 - `text: String`
925
926 The text that was transcribed.
927
928 - `type: :"transcript.text.done"`
929
930 The type of the event. Always `transcript.text.done`.
931
932 - `:"transcript.text.done"`
933
934 - `logprobs: Array[Logprob{ token, bytes, logprob}]`
935
936 The log probabilities of the individual tokens in the transcription. Only included if you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `include[]` parameter set to `logprobs`.
937
938 - `token: String`
939
940 The token that was used to generate the log probability.
941
942 - `bytes: Array[Integer]`
943
944 The bytes that were used to generate the log probability.
945
946 - `logprob: Float`
947
948 The log probability of the token.
949
950 - `usage: Usage{ input_tokens, output_tokens, total_tokens, 2 more}`
951
952 Usage statistics for models billed by token usage.
953
954 - `input_tokens: Integer`
955
956 Number of input tokens billed for this request.
957
958 - `output_tokens: Integer`
959
960 Number of output tokens generated.
961
962 - `total_tokens: Integer`
963
964 Total number of tokens used (input + output).
965
966 - `type: :tokens`
967
968 The type of the usage object. Always `tokens` for this variant.
969
970 - `:tokens`
971
972 - `input_token_details: InputTokenDetails{ audio_tokens, text_tokens}`
973
974 Details about the input tokens billed for this request.
975
976 - `audio_tokens: Integer`
977
978 Number of audio tokens billed for this request.
979
980 - `text_tokens: Integer`
981
982 Number of text tokens billed for this request.
983
### Transcription Text Segment Event
985
986- `class TranscriptionTextSegmentEvent`
987
988 Emitted when a diarized transcription returns a completed segment with speaker information. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with `stream` set to `true` and `response_format` set to `diarized_json`.
989
990 - `id: String`
991
992 Unique identifier for the segment.
993
994 - `end_: Float`
995
996 End timestamp of the segment in seconds.
997
998 - `speaker: String`
999
1000 Speaker label for this segment.
1001
1002 - `start: Float`
1003
1004 Start timestamp of the segment in seconds.
1005
1006 - `text: String`
1007
1008 Transcript text for this segment.
1009
1010 - `type: :"transcript.text.segment"`
1011
1012 The type of the event. Always `transcript.text.segment`.
1013
1014 - `:"transcript.text.segment"`
1015
### Transcription Verbose
1017
1018- `class TranscriptionVerbose`
1019
 Represents a `verbose_json` transcription response returned by the model, based on the provided input.
1021
1022 - `duration: Float`
1023
1024 The duration of the input audio.
1025
1026 - `language: String`
1027
1028 The language of the input audio.
1029
1030 - `text: String`
1031
1032 The transcribed text.
1033
1034 - `segments: Array[TranscriptionSegment]`
1035
1036 Segments of the transcribed text and their corresponding details.
1037
1038 - `id: Integer`
1039
1040 Unique identifier of the segment.
1041
1042 - `avg_logprob: Float`
1043
1044 Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.
1045
1046 - `compression_ratio: Float`
1047
1048 Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.
1049
1050 - `end_: Float`
1051
1052 End time of the segment in seconds.
1053
1054 - `no_speech_prob: Float`
1055
 Probability of no speech in the segment, between 0 and 1. If the value is high (the open-source Whisper implementation uses a default threshold of 0.6) and the `avg_logprob` is below -1, consider this segment silent.
1057
1058 - `seek: Integer`
1059
1060 Seek offset of the segment.
1061
1062 - `start: Float`
1063
1064 Start time of the segment in seconds.
1065
1066 - `temperature: Float`
1067
1068 Temperature parameter used for generating the segment.
1069
1070 - `text: String`
1071
1072 Text content of the segment.
1073
1074 - `tokens: Array[Integer]`
1075
1076 Array of token IDs for the text content.
1077
1078 - `usage: Usage{ seconds, type}`
1079
1080 Usage statistics for models billed by audio input duration.
1081
1082 - `seconds: Float`
1083
1084 Duration of the input audio in seconds.
1085
1086 - `type: :duration`
1087
1088 The type of the usage object. Always `duration` for this variant.
1089
1090 - `:duration`
1091
1092 - `words: Array[TranscriptionWord]`
1093
1094 Extracted words and their corresponding timestamps.
1095
1096 - `end_: Float`
1097
1098 End time of the word in seconds.
1099
1100 - `start: Float`
1101
1102 Start time of the word in seconds.
1103
1104 - `word: String`
1105
1106 The text content of the word.
1107
### Transcription Word
1109
1110- `class TranscriptionWord`
1111
1112 - `end_: Float`
1113
1114 End time of the word in seconds.
1115
1116 - `start: Float`
1117
1118 Start time of the word in seconds.
1119
1120 - `word: String`
1121
1122 The text content of the word.
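
Word-level timestamps are enough to build simple captions. The sketch below groups words into SRT cues of a fixed size. `srt_time` and `words_to_srt` are hypothetical helpers, words are plain hashes with the field names listed above, and the fixed words-per-cue grouping is an arbitrary choice made for illustration.

```ruby
# Formats a time in seconds as an SRT timestamp (HH:MM:SS,mmm).
def srt_time(seconds)
  ms = (seconds * 1000).round
  format("%02d:%02d:%02d,%03d", ms / 3_600_000, ms / 60_000 % 60, ms / 1000 % 60, ms % 1000)
end

# Groups word timestamps into SRT cues of at most `per_cue` words.
# Words are plain hashes standing in for TranscriptionWord objects.
def words_to_srt(words, per_cue: 7)
  words.each_slice(per_cue).with_index(1).map do |chunk, index|
    text = chunk.map { |w| w[:word] }.join(" ")
    "#{index}\n#{srt_time(chunk.first[:start])} --> #{srt_time(chunk.last[:end_])}\n#{text}\n"
  end.join("\n")
end

# Invented sample data.
words = [
  { word: "Hello", start: 0.0, end_: 0.4 },
  { word: "there", start: 0.5, end_: 0.9 },
  { word: "world", start: 1.0, end_: 1.5 }
]

puts words_to_srt(words, per_cue: 2)
```

Working in integer milliseconds after a single rounding step keeps the timestamp fields consistent with each other.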
1123
### Transcription Create Response
1125
1126- `TranscriptionCreateResponse = Transcription | TranscriptionDiarized | TranscriptionVerbose`
1127
 Represents a transcription response returned by the model, based on the provided input.
1129
1130 - `class Transcription`
1131
1132 Represents a transcription response returned by the model, based on the provided input.
1133
1134 - `text: String`
1135
1136 The transcribed text.
1137
1138 - `logprobs: Array[Logprob{ token, bytes, logprob}]`
1139
1140 The log probabilities of the tokens in the transcription. Only returned with the models `gpt-4o-transcribe` and `gpt-4o-mini-transcribe` if `logprobs` is added to the `include` array.
1141
1142 - `token: String`
1143
1144 The token in the transcription.
1145
1146 - `bytes: Array[Float]`
1147
1148 The bytes of the token.
1149
1150 - `logprob: Float`
1151
1152 The log probability of the token.
1153
1154 - `usage: Tokens{ input_tokens, output_tokens, total_tokens, 2 more} | Duration{ seconds, type}`
1155
1156 Token usage statistics for the request.
1157
1158 - `class Tokens`
1159
1160 Usage statistics for models billed by token usage.
1161
1162 - `input_tokens: Integer`
1163
1164 Number of input tokens billed for this request.
1165
1166 - `output_tokens: Integer`
1167
1168 Number of output tokens generated.
1169
1170 - `total_tokens: Integer`
1171
1172 Total number of tokens used (input + output).
1173
1174 - `type: :tokens`
1175
1176 The type of the usage object. Always `tokens` for this variant.
1177
1178 - `:tokens`
1179
1180 - `input_token_details: InputTokenDetails{ audio_tokens, text_tokens}`
1181
1182 Details about the input tokens billed for this request.
1183
1184 - `audio_tokens: Integer`
1185
1186 Number of audio tokens billed for this request.
1187
1188 - `text_tokens: Integer`
1189
1190 Number of text tokens billed for this request.
1191
1192 - `class Duration`
1193
1194 Usage statistics for models billed by audio input duration.
1195
1196 - `seconds: Float`
1197
1198 Duration of the input audio in seconds.
1199
1200 - `type: :duration`
1201
1202 The type of the usage object. Always `duration` for this variant.
1203
1204 - `:duration`
1205
1206 - `class TranscriptionDiarized`
1207
1208 Represents a diarized transcription response returned by the model, including the combined transcript and speaker-segment annotations.
1209
1210 - `duration: Float`
1211
1212 Duration of the input audio in seconds.
1213
1214 - `segments: Array[TranscriptionDiarizedSegment]`
1215
1216 Segments of the transcript annotated with timestamps and speaker labels.
1217
1218 - `id: String`
1219
1220 Unique identifier for the segment.
1221
1222 - `end_: Float`
1223
1224 End timestamp of the segment in seconds.
1225
1226 - `speaker: String`
1227
1228 Speaker label for this segment. When known speakers are provided, the label matches `known_speaker_names[]`. Otherwise speakers are labeled sequentially using capital letters (`A`, `B`, ...).
1229
1230 - `start: Float`
1231
1232 Start timestamp of the segment in seconds.
1233
1234 - `text: String`
1235
1236 Transcript text for this segment.
1237
1238 - `type: :"transcript.text.segment"`
1239
1240 The type of the segment. Always `transcript.text.segment`.
1241
1242 - `:"transcript.text.segment"`
1243
1244 - `task: :transcribe`
1245
1246 The type of task that was run. Always `transcribe`.
1247
1248 - `:transcribe`
1249
1250 - `text: String`
1251
1252 The concatenated transcript text for the entire audio input.
1253
1254 - `usage: Tokens{ input_tokens, output_tokens, total_tokens, 2 more} | Duration{ seconds, type}`
1255
1256 Token or duration usage statistics for the request.
1257
1258 - `class Tokens`
1259
1260 Usage statistics for models billed by token usage.
1261
1262 - `input_tokens: Integer`
1263
1264 Number of input tokens billed for this request.
1265
1266 - `output_tokens: Integer`
1267
1268 Number of output tokens generated.
1269
1270 - `total_tokens: Integer`
1271
1272 Total number of tokens used (input + output).
1273
1274 - `type: :tokens`
1275
1276 The type of the usage object. Always `tokens` for this variant.
1277
1278 - `:tokens`
1279
1280 - `input_token_details: InputTokenDetails{ audio_tokens, text_tokens}`
1281
1282 Details about the input tokens billed for this request.
1283
1284 - `audio_tokens: Integer`
1285
1286 Number of audio tokens billed for this request.
1287
1288 - `text_tokens: Integer`
1289
1290 Number of text tokens billed for this request.
1291
1292 - `class Duration`
1293
1294 Usage statistics for models billed by audio input duration.
1295
1296 - `seconds: Float`
1297
1298 Duration of the input audio in seconds.
1299
1300 - `type: :duration`
1301
1302 The type of the usage object. Always `duration` for this variant.
1303
1304 - `:duration`
1305
1306 - `class TranscriptionVerbose`
1307
1308 Represents a verbose JSON transcription response returned by the model, based on the provided input.
1309
1310 - `duration: Float`
1311
1312 The duration of the input audio.
1313
1314 - `language: String`
1315
1316 The language of the input audio.
1317
1318 - `text: String`
1319
1320 The transcribed text.
1321
1322 - `segments: Array[TranscriptionSegment]`
1323
1324 Segments of the transcribed text and their corresponding details.
1325
1326 - `id: Integer`
1327
1328 Unique identifier of the segment.
1329
1330 - `avg_logprob: Float`
1331
1332 Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.
1333
1334 - `compression_ratio: Float`
1335
1336 Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.
1337
1338 - `end_: Float`
1339
1340 End time of the segment in seconds.
1341
1342 - `no_speech_prob: Float`
1343
1344 Probability of no speech in the segment. If the value is higher than 1.0 and the `avg_logprob` is below -1, consider this segment silent.
1345
1346 - `seek: Integer`
1347
1348 Seek offset of the segment.
1349
1350 - `start: Float`
1351
1352 Start time of the segment in seconds.
1353
1354 - `temperature: Float`
1355
1356 Temperature parameter used for generating the segment.
1357
1358 - `text: String`
1359
1360 Text content of the segment.
1361
1362 - `tokens: Array[Integer]`
1363
1364 Array of token IDs for the text content.
1365
1366 - `usage: Usage{ seconds, type}`
1367
1368 Usage statistics for models billed by audio input duration.
1369
1370 - `seconds: Float`
1371
1372 Duration of the input audio in seconds.
1373
1374 - `type: :duration`
1375
1376 The type of the usage object. Always `duration` for this variant.
1377
1378 - `:duration`
1379
1380 - `words: Array[TranscriptionWord]`
1381
1382 Extracted words and their corresponding timestamps.
1383
1384 - `end_: Float`
1385
1386 End time of the word in seconds.
1387
1388 - `start: Float`
1389
1390 Start time of the word in seconds.
1391
1392 - `word: String`
1393
1394 The text content of the word.
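
For the `TranscriptionDiarized` variant, consecutive segments often belong to the same speaker. The sketch below merges such runs into speaker turns. `DiarizedSegment` is a hypothetical Struct carrying only the `speaker`, `start`, `end_`, and `text` fields documented above, and `speaker_turns` is an illustrative helper rather than an SDK method.

```ruby
# Hypothetical stand-in mirroring part of TranscriptionDiarizedSegment.
DiarizedSegment = Struct.new(:speaker, :start, :end_, :text, keyword_init: true)

# Merges consecutive segments from the same speaker into a single turn.
def speaker_turns(segments)
  segments.chunk_while { |a, b| a.speaker == b.speaker }.map do |run|
    {
      speaker: run.first.speaker,
      start: run.first.start,
      end_: run.last.end_,
      text: run.map { |segment| segment.text.strip }.join(" ")
    }
  end
end

# Made-up sample data for illustration only.
SAMPLE_DIARIZED = [
  DiarizedSegment.new(speaker: "A", start: 0.0, end_: 1.2, text: "Hi there."),
  DiarizedSegment.new(speaker: "A", start: 1.3, end_: 2.5, text: " How are you?"),
  DiarizedSegment.new(speaker: "B", start: 2.8, end_: 4.0, text: "Fine, thanks.")
].freeze

speaker_turns(SAMPLE_DIARIZED).each do |turn|
  puts "#{turn[:speaker]}: #{turn[:text]}"
end
```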
1395
1396# Translations
1397
1398## Create translation
1399
1400`audio.translations.create(**kwargs) -> TranslationCreateResponse`
1401
1402**post** `/audio/translations`
1403
1404Translates audio into English.
1405
1406### Parameters
1407
1408- `file: String`
1409
1410 The audio file object (not file name) to translate, in one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm.
1411
1412- `model: String | AudioModel`
1413
1414 ID of the model to use. Only `whisper-1` (which is powered by our open source Whisper V2 model) is currently available.
1415
1416 - `String = String`
1417
1418 - `AudioModel = :"whisper-1" | :"gpt-4o-transcribe" | :"gpt-4o-mini-transcribe" | 2 more`
1419
1420 - `:"whisper-1"`
1421
1422 - `:"gpt-4o-transcribe"`
1423
1424 - `:"gpt-4o-mini-transcribe"`
1425
1426 - `:"gpt-4o-mini-transcribe-2025-12-15"`
1427
1428 - `:"gpt-4o-transcribe-diarize"`
1429
1430- `prompt: String`
1431
1432 An optional text to guide the model's style or continue a previous audio segment. The [prompt](https://platform.openai.com/docs/guides/speech-to-text#prompting) should be in English.
1433
1434- `response_format: :json | :text | :srt | 2 more`
1435
1436 The format of the output, in one of these options: `json`, `text`, `srt`, `verbose_json`, or `vtt`.
1437
1438 - `:json`
1439
1440 - `:text`
1441
1442 - `:srt`
1443
1444 - `:verbose_json`
1445
1446 - `:vtt`
1447
1448- `temperature: Float`
1449
1450 The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use [log probability](https://en.wikipedia.org/wiki/Log_probability) to automatically increase the temperature until certain thresholds are hit.
1451
1452### Returns
1453
1454- `TranslationCreateResponse = Translation | TranslationVerbose`
1455
1456 - `class Translation`
1457
1458 - `text: String`
1459
1460 - `class TranslationVerbose`
1461
1462 - `duration: Float`
1463
1464 The duration of the input audio.
1465
1466 - `language: String`
1467
1468 The language of the output translation (always `english`).
1469
1470 - `text: String`
1471
1472 The translated text.
1473
1474 - `segments: Array[TranscriptionSegment]`
1475
1476 Segments of the translated text and their corresponding details.
1477
1478 - `id: Integer`
1479
1480 Unique identifier of the segment.
1481
1482 - `avg_logprob: Float`
1483
1484 Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.
1485
1486 - `compression_ratio: Float`
1487
1488 Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.
1489
1490 - `end_: Float`
1491
1492 End time of the segment in seconds.
1493
1494 - `no_speech_prob: Float`
1495
1496 Probability of no speech in the segment. If the value is higher than 1.0 and the `avg_logprob` is below -1, consider this segment silent.
1497
1498 - `seek: Integer`
1499
1500 Seek offset of the segment.
1501
1502 - `start: Float`
1503
1504 Start time of the segment in seconds.
1505
1506 - `temperature: Float`
1507
1508 Temperature parameter used for generating the segment.
1509
1510 - `text: String`
1511
1512 Text content of the segment.
1513
1514 - `tokens: Array[Integer]`
1515
1516 Array of token IDs for the text content.
1517
1518### Example
1519
```ruby
require "openai"
require "stringio"

openai = OpenAI::Client.new(api_key: "My API Key")

# StringIO.new("Example data") is placeholder input; pass real audio in a supported format.
translation = openai.audio.translations.create(file: StringIO.new("Example data"), model: :"whisper-1")

puts(translation)
```
1529
1530#### Response
1531
```json
{
  "text": "text"
}
```
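
The parameters above restrict `response_format` to five values and `temperature` to the range 0 to 1. The sketch below checks those constraints locally before a request is made. `translation_params` and `TRANSLATION_FORMATS` are hypothetical names introduced for illustration; the SDK may perform its own validation.

```ruby
require "stringio"

# Formats listed for the translations endpoint in the parameters above.
TRANSLATION_FORMATS = %i[json text srt verbose_json vtt].freeze

# Hypothetical helper: validates the documented constraints and returns a
# keyword hash, omitting optional parameters that were not supplied.
def translation_params(file:, model: :"whisper-1", response_format: :json,
                       temperature: nil, prompt: nil)
  unless TRANSLATION_FORMATS.include?(response_format)
    raise ArgumentError, "unsupported response_format: #{response_format.inspect}"
  end
  if temperature && !(0.0..1.0).cover?(temperature)
    raise ArgumentError, "temperature must be between 0 and 1"
  end

  {
    file: file,
    model: model,
    response_format: response_format,
    temperature: temperature,
    prompt: prompt
  }.compact
end

params = translation_params(
  file: StringIO.new("Example data"), # placeholder; use real audio content
  response_format: :verbose_json,
  temperature: 0.2
)
puts params.keys.inspect
```

The resulting hash could then be splatted into the call, for example `openai.audio.translations.create(**params)`.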
1537
1538## Domain Types
1539
1540### Translation
1541
1542- `class Translation`
1543
1544 - `text: String`
1545
1546### Translation Verbose
1547
1548- `class TranslationVerbose`
1549
1550 - `duration: Float`
1551
1552 The duration of the input audio.
1553
1554 - `language: String`
1555
1556 The language of the output translation (always `english`).
1557
1558 - `text: String`
1559
1560 The translated text.
1561
1562 - `segments: Array[TranscriptionSegment]`
1563
1564 Segments of the translated text and their corresponding details.
1565
1566 - `id: Integer`
1567
1568 Unique identifier of the segment.
1569
1570 - `avg_logprob: Float`
1571
1572 Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.
1573
1574 - `compression_ratio: Float`
1575
1576 Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.
1577
1578 - `end_: Float`
1579
1580 End time of the segment in seconds.
1581
1582 - `no_speech_prob: Float`
1583
1584 Probability of no speech in the segment. If the value is higher than 1.0 and the `avg_logprob` is below -1, consider this segment silent.
1585
1586 - `seek: Integer`
1587
1588 Seek offset of the segment.
1589
1590 - `start: Float`
1591
1592 Start time of the segment in seconds.
1593
1594 - `temperature: Float`
1595
1596 Temperature parameter used for generating the segment.
1597
1598 - `text: String`
1599
1600 Text content of the segment.
1601
1602 - `tokens: Array[Integer]`
1603
1604 Array of token IDs for the text content.
1605
1606### Translation Create Response
1607
1608- `TranslationCreateResponse = Translation | TranslationVerbose`
1609
1610 - `class Translation`
1611
1612 - `text: String`
1613
1614 - `class TranslationVerbose`
1615
1616 - `duration: Float`
1617
1618 The duration of the input audio.
1619
1620 - `language: String`
1621
1622 The language of the output translation (always `english`).
1623
1624 - `text: String`
1625
1626 The translated text.
1627
1628 - `segments: Array[TranscriptionSegment]`
1629
1630 Segments of the translated text and their corresponding details.
1631
1632 - `id: Integer`
1633
1634 Unique identifier of the segment.
1635
1636 - `avg_logprob: Float`
1637
1638 Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.
1639
1640 - `compression_ratio: Float`
1641
1642 Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.
1643
1644 - `end_: Float`
1645
1646 End time of the segment in seconds.
1647
1648 - `no_speech_prob: Float`
1649
1650 Probability of no speech in the segment. If the value is higher than 1.0 and the `avg_logprob` is below -1, consider this segment silent.
1651
1652 - `seek: Integer`
1653
1654 Seek offset of the segment.
1655
1656 - `start: Float`
1657
1658 Start time of the segment in seconds.
1659
1660 - `temperature: Float`
1661
1662 Temperature parameter used for generating the segment.
1663
1664 - `text: String`
1665
1666 Text content of the segment.
1667
1668 - `tokens: Array[Integer]`
1669
1670 Array of token IDs for the text content.
1671
1672# Speech
1673
1674## Create speech
1675
1676`audio.speech.create(**kwargs) -> StringIO`
1677
1678**post** `/audio/speech`
1679
1680Generates audio from the input text.
1681
1682Returns the audio file content, or a stream of audio events.
1683
1684### Parameters
1685
1686- `input: String`
1687
1688 The text to generate audio for. The maximum length is 4096 characters.
1689
1690- `model: String | SpeechModel`
1691
1692 One of the available [TTS models](https://platform.openai.com/docs/models#tts): `tts-1`, `tts-1-hd`, `gpt-4o-mini-tts`, or `gpt-4o-mini-tts-2025-12-15`.
1693
1694 - `String = String`
1695
1696 - `SpeechModel = :"tts-1" | :"tts-1-hd" | :"gpt-4o-mini-tts" | :"gpt-4o-mini-tts-2025-12-15"`
1697
1698 - `:"tts-1"`
1699
1700 - `:"tts-1-hd"`
1701
1702 - `:"gpt-4o-mini-tts"`
1703
1704 - `:"gpt-4o-mini-tts-2025-12-15"`
1705
1706- `voice: String | :alloy | :ash | :ballad | 7 more | ID{ id}`
1707
1708 The voice to use when generating the audio. Supported built-in voices are `alloy`, `ash`, `ballad`, `coral`, `echo`, `fable`, `onyx`, `nova`, `sage`, `shimmer`, `verse`, `marin`, and `cedar`. You may also provide a custom voice object with an `id`, for example `{ "id": "voice_1234" }`. Previews of the voices are available in the [Text to speech guide](https://platform.openai.com/docs/guides/text-to-speech#voice-options).
1709
1710 - `String = String`
1711
1712 - `Voice = :alloy | :ash | :ballad | 7 more`
1713
1714 - `:alloy`
1715
1716 - `:ash`
1717
1718 - `:ballad`
1719
1720 - `:coral`
1721
1722 - `:echo`
1723
1724 - `:sage`
1725
1726 - `:shimmer`
1727
1728 - `:verse`
1729
1730 - `:marin`
1731
1732 - `:cedar`
1733
1734 - `class ID`
1735
1736 Custom voice reference.
1737
1738 - `id: String`
1739
1740 The custom voice ID, e.g. `voice_1234`.
1741
1742- `instructions: String`
1743
1744 Control the voice of your generated audio with additional instructions. Does not work with `tts-1` or `tts-1-hd`.
1745
1746- `response_format: :mp3 | :opus | :aac | 3 more`
1747
1748 The format to return the audio in. Supported formats are `mp3`, `opus`, `aac`, `flac`, `wav`, and `pcm`.
1749
1750 - `:mp3`
1751
1752 - `:opus`
1753
1754 - `:aac`
1755
1756 - `:flac`
1757
1758 - `:wav`
1759
1760 - `:pcm`
1761
1762- `speed: Float`
1763
1764 The speed of the generated audio. Select a value from `0.25` to `4.0`. `1.0` is the default.
1765
1766- `stream_format: :sse | :audio`
1767
1768 The format to stream the audio in. Supported formats are `sse` and `audio`. `sse` is not supported for `tts-1` or `tts-1-hd`.
1769
1770 - `:sse`
1771
1772 - `:audio`
1773
1774### Returns
1775
1776- `StringIO`
1777
1778### Example
1779
```ruby
require "openai"

openai = OpenAI::Client.new(api_key: "My API Key")

speech = openai.audio.speech.create(input: "input", model: :"tts-1", voice: :alloy, response_format: :mp3)

# The call returns a StringIO holding the audio bytes; write them to a file.
File.binwrite("speech.mp3", speech.read)
```
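
Because `input` is limited to 4096 characters, longer text has to be split before it is sent. The sketch below shows one way to do that, breaking on whitespace where possible. `chunk_input` and `MAX_INPUT_CHARS` are hypothetical names for illustration; note that this approach collapses runs of whitespace into single spaces.

```ruby
# Maximum `input` length stated in the parameters above.
MAX_INPUT_CHARS = 4096

# Splits text into chunks of at most `limit` characters, packing whole words
# greedily. A single word longer than the limit is hard-split.
def chunk_input(text, limit: MAX_INPUT_CHARS)
  pieces = text.split.flat_map { |word| word.scan(/.{1,#{limit}}/m) }
  pieces.each_with_object([]) do |piece, chunks|
    if chunks.any? && chunks.last.length + 1 + piece.length <= limit
      chunks.last << " " << piece
    else
      chunks << piece.dup
    end
  end
end

# Made-up long input for illustration only.
LONG_TEXT = (["word"] * 3000).join(" ")

chunks = chunk_input(LONG_TEXT)
puts "#{chunks.size} chunks, longest #{chunks.map(&:length).max} characters"
```

Each chunk could then be passed as `input:` to a separate `audio.speech.create` call and the resulting audio concatenated by the caller.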
1789
1790## Domain Types
1791
1792### Speech Model
1793
1794- `SpeechModel = :"tts-1" | :"tts-1-hd" | :"gpt-4o-mini-tts" | :"gpt-4o-mini-tts-2025-12-15"`
1795
1796 - `:"tts-1"`
1797
1798 - `:"tts-1-hd"`
1799
1800 - `:"gpt-4o-mini-tts"`
1801
1802 - `:"gpt-4o-mini-tts-2025-12-15"`
1803
1804# Voices
1805
1806# Voice Consents