java/resources/audio/index.md +0 −1480 deleted
# Audio

## Domain Types

### Audio Model

- `enum AudioModel:`

  - `WHISPER_1("whisper-1")`

  - `GPT_4O_TRANSCRIBE("gpt-4o-transcribe")`

  - `GPT_4O_MINI_TRANSCRIBE("gpt-4o-mini-transcribe")`

  - `GPT_4O_MINI_TRANSCRIBE_2025_12_15("gpt-4o-mini-transcribe-2025-12-15")`

  - `GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize")`

### Audio Response Format

- `enum AudioResponseFormat:`

  The format of the output, in one of these options: `json`, `text`, `srt`, `verbose_json`, `vtt`, or `diarized_json`. For `gpt-4o-transcribe` and `gpt-4o-mini-transcribe`, the only supported format is `json`. For `gpt-4o-transcribe-diarize`, the supported formats are `json`, `text`, and `diarized_json`, with `diarized_json` required to receive speaker annotations.

  - `JSON("json")`

  - `TEXT("text")`

  - `SRT("srt")`

  - `VERBOSE_JSON("verbose_json")`

  - `VTT("vtt")`

  - `DIARIZED_JSON("diarized_json")`
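The model and format restrictions described above can be checked before a request is sent. The following is a minimal sketch, not part of the SDK: `FormatCompatibility`, its nested `Format` enum, and both methods are hypothetical stand-ins, and the fallback for models this section does not mention is an assumption.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical helper, not an SDK class: encodes only the rules stated above.
final class FormatCompatibility {
    /** Local stand-in for the AudioResponseFormat enum. */
    enum Format { JSON, TEXT, SRT, VERBOSE_JSON, VTT, DIARIZED_JSON }

    static Set<Format> supportedFormats(String model) {
        switch (model) {
            case "gpt-4o-transcribe":
            case "gpt-4o-mini-transcribe":
                // Stated above: json is the only supported format.
                return EnumSet.of(Format.JSON);
            case "gpt-4o-transcribe-diarize":
                // Stated above: json, text, and diarized_json.
                return EnumSet.of(Format.JSON, Format.TEXT, Format.DIARIZED_JSON);
            default:
                // Assumption: no restriction is documented here for other
                // models, so this sketch does not reject anything for them.
                return EnumSet.allOf(Format.class);
        }
    }

    static boolean isSupported(String model, Format format) {
        return supportedFormats(model).contains(format);
    }

    public static void main(String[] args) {
        System.out.println(isSupported("gpt-4o-transcribe", Format.SRT));
        System.out.println(isSupported("gpt-4o-transcribe-diarize", Format.DIARIZED_JSON));
    }
}
```

A client-side check like this only mirrors the documentation; the server remains the authority on which combinations are accepted.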

# Transcriptions

## Create transcription

`TranscriptionCreateResponse audio().transcriptions().create(TranscriptionCreateParams params, RequestOptions requestOptions = RequestOptions.none())`

**post** `/audio/transcriptions`

Transcribes audio into the input language.

Returns a transcription object in `json`, `diarized_json`, or `verbose_json`
format, or a stream of transcript events.

### Parameters

- `TranscriptionCreateParams params`

  - `String file`

    The audio file object (not file name) to transcribe, in one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm.

  - `AudioModel model`

    ID of the model to use. The options are `gpt-4o-transcribe`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `whisper-1` (which is powered by our open source Whisper V2 model), and `gpt-4o-transcribe-diarize`.

    - `WHISPER_1("whisper-1")`

    - `GPT_4O_TRANSCRIBE("gpt-4o-transcribe")`

    - `GPT_4O_MINI_TRANSCRIBE("gpt-4o-mini-transcribe")`

    - `GPT_4O_MINI_TRANSCRIBE_2025_12_15("gpt-4o-mini-transcribe-2025-12-15")`

    - `GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize")`

  - `Optional<ChunkingStrategy> chunkingStrategy`

    Controls how the audio is cut into chunks. When set to `"auto"`, the server first normalizes loudness and then uses voice activity detection (VAD) to choose boundaries. A `server_vad` object can be provided to tweak VAD detection parameters manually. If unset, the audio is transcribed as a single block. Required when using `gpt-4o-transcribe-diarize` for inputs longer than 30 seconds.

    - `JsonValue;`

      - `AUTO("auto")`

    - `class VadConfig:`

      - `Type type`

        Must be set to `server_vad` to enable manual chunking using server side VAD.

        - `SERVER_VAD("server_vad")`

      - `Optional<Long> prefixPaddingMs`

        Amount of audio to include before the VAD detected speech (in
        milliseconds).

      - `Optional<Long> silenceDurationMs`

        Duration of silence to detect speech stop (in milliseconds).
        With shorter values the model will respond more quickly,
        but may jump in on short pauses from the user.

      - `Optional<Double> threshold`

        Sensitivity threshold (0.0 to 1.0) for voice activity detection. A
        higher threshold will require louder audio to activate the model, and
        thus might perform better in noisy environments.

  - `Optional<List<TranscriptionInclude>> include`

    Additional information to include in the transcription response.
    `logprobs` will return the log probabilities of the tokens in the
    response to understand the model's confidence in the transcription.
    `logprobs` only works with `response_format` set to `json` and only with
    the models `gpt-4o-transcribe`, `gpt-4o-mini-transcribe`, and `gpt-4o-mini-transcribe-2025-12-15`. This field is not supported when using `gpt-4o-transcribe-diarize`.

    - `LOGPROBS("logprobs")`

  - `Optional<List<String>> knownSpeakerNames`

    Optional list of speaker names that correspond to the audio samples provided in `known_speaker_references[]`. Each entry should be a short identifier (for example `customer` or `agent`). Up to 4 speakers are supported.

  - `Optional<List<String>> knownSpeakerReferences`

    Optional list of audio samples (as [data URLs](https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/Data_URLs)) that contain known speaker references matching `known_speaker_names[]`. Each sample must be between 2 and 10 seconds, and can use any of the same input audio formats supported by `file`.

  - `Optional<String> language`

    The language of the input audio. Supplying the input language in [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) (e.g. `en`) format will improve accuracy and latency.

  - `Optional<String> prompt`

    An optional text to guide the model's style or continue a previous audio segment. The [prompt](https://platform.openai.com/docs/guides/speech-to-text#prompting) should match the audio language. This field is not supported when using `gpt-4o-transcribe-diarize`.

  - `Optional<AudioResponseFormat> responseFormat`

    The format of the output, in one of these options: `json`, `text`, `srt`, `verbose_json`, `vtt`, or `diarized_json`. For `gpt-4o-transcribe` and `gpt-4o-mini-transcribe`, the only supported format is `json`. For `gpt-4o-transcribe-diarize`, the supported formats are `json`, `text`, and `diarized_json`, with `diarized_json` required to receive speaker annotations.

  - `Optional<Double> temperature`

    The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use [log probability](https://en.wikipedia.org/wiki/Log_probability) to automatically increase the temperature until certain thresholds are hit.

  - `Optional<List<TimestampGranularity>> timestampGranularities`

    The timestamp granularities to populate for this transcription. `response_format` must be set to `verbose_json` to use timestamp granularities. Either or both of these options are supported: `word` and `segment`. Note: There is no additional latency for segment timestamps, but generating word timestamps incurs additional latency.
    This option is not available for `gpt-4o-transcribe-diarize`.

    - `WORD("word")`

    - `SEGMENT("segment")`
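The `knownSpeakerReferences` entries above are data URLs, which can be assembled with the standard library. The sketch below is illustrative only: `SpeakerReferences`, `toDataUrl`, and `checkPairing` are hypothetical names rather than SDK methods, and requiring the two lists to have equal length is an assumption drawn from the statement that names correspond to samples.

```java
import java.util.Base64;
import java.util.List;

// Hypothetical helper, not an SDK class.
final class SpeakerReferences {
    /** Limit stated in the knownSpeakerNames description above. */
    static final int MAX_SPEAKERS = 4;

    /** Builds a base64 data URL such as "data:audio/wav;base64,...". */
    static String toDataUrl(String mimeType, byte[] audio) {
        return "data:" + mimeType + ";base64," + Base64.getEncoder().encodeToString(audio);
    }

    /** Rejects lists that cannot line up one name per reference sample. */
    static void checkPairing(List<String> names, List<String> references) {
        // Assumption: one name per sample, so the sizes must match.
        if (names.size() != references.size()) {
            throw new IllegalArgumentException("names and references must have the same length");
        }
        if (names.size() > MAX_SPEAKERS) {
            throw new IllegalArgumentException("at most " + MAX_SPEAKERS + " speakers are supported");
        }
    }

    public static void main(String[] args) {
        byte[] placeholder = {1, 2, 3}; // stands in for 2 to 10 seconds of real audio
        System.out.println(toDataUrl("audio/wav", placeholder));
    }
}
```

The sample duration limit (2 to 10 seconds) is not checked here because it depends on decoding the audio.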

### Returns

- `class TranscriptionCreateResponse:` A class that can be one of several variants (union).

  Represents a transcription response returned by the model, based on the provided input.

  - `class Transcription:`

    Represents a transcription response returned by the model, based on the provided input.

    - `String text`

      The transcribed text.

    - `Optional<List<Logprob>> logprobs`

      The log probabilities of the tokens in the transcription. Only returned with the models `gpt-4o-transcribe` and `gpt-4o-mini-transcribe` if `logprobs` is added to the `include` array.

      - `Optional<String> token`

        The token in the transcription.

      - `Optional<List<Double>> bytes`

        The bytes of the token.

      - `Optional<Double> logprob`

        The log probability of the token.

    - `Optional<Usage> usage`

      Token usage statistics for the request.

      - `class Tokens:`

        Usage statistics for models billed by token usage.

        - `long inputTokens`

          Number of input tokens billed for this request.

        - `long outputTokens`

          Number of output tokens generated.

        - `long totalTokens`

          Total number of tokens used (input + output).

        - `JsonValue; type "tokens"` (constant)

          The type of the usage object. Always `tokens` for this variant.

          - `TOKENS("tokens")`

        - `Optional<InputTokenDetails> inputTokenDetails`

          Details about the input tokens billed for this request.

          - `Optional<Long> audioTokens`

            Number of audio tokens billed for this request.

          - `Optional<Long> textTokens`

            Number of text tokens billed for this request.

      - `class Duration:`

        Usage statistics for models billed by audio input duration.

        - `double seconds`

          Duration of the input audio in seconds.

        - `JsonValue; type "duration"` (constant)

          The type of the usage object. Always `duration` for this variant.

          - `DURATION("duration")`

  - `class TranscriptionDiarized:`

    Represents a diarized transcription response returned by the model, including the combined transcript and speaker-segment annotations.

    - `double duration`

      Duration of the input audio in seconds.

    - `List<TranscriptionDiarizedSegment> segments`

      Segments of the transcript annotated with timestamps and speaker labels.

      - `String id`

        Unique identifier for the segment.

      - `double end`

        End timestamp of the segment in seconds.

      - `String speaker`

        Speaker label for this segment. When known speakers are provided, the label matches `known_speaker_names[]`. Otherwise speakers are labeled sequentially using capital letters (`A`, `B`, ...).

      - `double start`

        Start timestamp of the segment in seconds.

      - `String text`

        Transcript text for this segment.

      - `JsonValue; type "transcript.text.segment"` (constant)

        The type of the segment. Always `transcript.text.segment`.

        - `TRANSCRIPT_TEXT_SEGMENT("transcript.text.segment")`

    - `JsonValue; task "transcribe"` (constant)

      The type of task that was run. Always `transcribe`.

      - `TRANSCRIBE("transcribe")`

    - `String text`

      The concatenated transcript text for the entire audio input.

    - `Optional<Usage> usage`

      Token or duration usage statistics for the request.

      - `class Tokens:`

        Usage statistics for models billed by token usage.

        - `long inputTokens`

          Number of input tokens billed for this request.

        - `long outputTokens`

          Number of output tokens generated.

        - `long totalTokens`

          Total number of tokens used (input + output).

        - `JsonValue; type "tokens"` (constant)

          The type of the usage object. Always `tokens` for this variant.

          - `TOKENS("tokens")`

        - `Optional<InputTokenDetails> inputTokenDetails`

          Details about the input tokens billed for this request.

          - `Optional<Long> audioTokens`

            Number of audio tokens billed for this request.

          - `Optional<Long> textTokens`

            Number of text tokens billed for this request.

      - `class Duration:`

        Usage statistics for models billed by audio input duration.

        - `double seconds`

          Duration of the input audio in seconds.

        - `JsonValue; type "duration"` (constant)

          The type of the usage object. Always `duration` for this variant.

          - `DURATION("duration")`

  - `class TranscriptionVerbose:`

    Represents a verbose JSON transcription response returned by the model, based on the provided input.

    - `double duration`

      The duration of the input audio.

    - `String language`

      The language of the input audio.

    - `String text`

      The transcribed text.

    - `Optional<List<TranscriptionSegment>> segments`

      Segments of the transcribed text and their corresponding details.

      - `long id`

        Unique identifier of the segment.

      - `double avgLogprob`

        Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.

      - `double compressionRatio`

        Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.

      - `double end`

        End time of the segment in seconds.

      - `double noSpeechProb`

        Probability of no speech in the segment. If the value is higher than 1.0 and the `avg_logprob` is below -1, consider this segment silent.

      - `long seek`

        Seek offset of the segment.

      - `double start`

        Start time of the segment in seconds.

      - `double temperature`

        Temperature parameter used for generating the segment.

      - `String text`

        Text content of the segment.

      - `List<Long> tokens`

        Array of token IDs for the text content.

    - `Optional<Usage> usage`

      Usage statistics for models billed by audio input duration.

      - `double seconds`

        Duration of the input audio in seconds.

      - `JsonValue; type "duration"` (constant)

        The type of the usage object. Always `duration` for this variant.

        - `DURATION("duration")`

    - `Optional<List<TranscriptionWord>> words`

      Extracted words and their corresponding timestamps.

      - `double end`

        End time of the word in seconds.

      - `double start`

        Start time of the word in seconds.

      - `String word`

        The text content of the word.
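The `usage` field above is itself a two-variant union: token-billed or duration-billed. One way to consume such a union is to branch on the concrete variant, as in this minimal sketch. The `UsageSummary` class and its nested `Usage`, `Tokens`, and `Duration` types are local stand-ins written for illustration; they are not the SDK's generated classes, whose accessors may differ.

```java
// Hypothetical local model of the usage union, not the SDK's types.
final class UsageSummary {
    interface Usage {}

    /** Stand-in for the "tokens" variant. */
    static final class Tokens implements Usage {
        final long inputTokens;
        final long outputTokens;
        final long totalTokens;

        Tokens(long inputTokens, long outputTokens) {
            this.inputTokens = inputTokens;
            this.outputTokens = outputTokens;
            this.totalTokens = inputTokens + outputTokens; // input + output
        }
    }

    /** Stand-in for the "duration" variant. */
    static final class Duration implements Usage {
        final double seconds;

        Duration(double seconds) {
            this.seconds = seconds;
        }
    }

    /** Produces a one-line summary for whichever variant is present. */
    static String describe(Usage usage) {
        if (usage instanceof Tokens) {
            Tokens t = (Tokens) usage;
            return "tokens: " + t.inputTokens + " in + " + t.outputTokens + " out = " + t.totalTokens;
        }
        if (usage instanceof Duration) {
            return "duration: " + ((Duration) usage).seconds + " s";
        }
        throw new IllegalArgumentException("unknown usage variant");
    }

    public static void main(String[] args) {
        System.out.println(describe(new Tokens(10, 5)));
        System.out.println(describe(new Duration(2.5)));
    }
}
```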

### Example

```java
package com.openai.example;

import com.openai.client.OpenAIClient;
import com.openai.client.okhttp.OpenAIOkHttpClient;
import com.openai.models.audio.AudioModel;
import com.openai.models.audio.transcriptions.TranscriptionCreateParams;
import com.openai.models.audio.transcriptions.TranscriptionCreateResponse;
import java.io.ByteArrayInputStream;

public final class Main {
    private Main() {}

    public static void main(String[] args) {
        // Configure the client from environment variables.
        OpenAIClient client = OpenAIOkHttpClient.fromEnv();

        // "Example data" is placeholder content; supply real audio bytes here.
        TranscriptionCreateParams params = TranscriptionCreateParams.builder()
            .file(new ByteArrayInputStream("Example data".getBytes()))
            .model(AudioModel.GPT_4O_TRANSCRIBE)
            .build();
        TranscriptionCreateResponse transcription = client.audio().transcriptions().create(params);
    }
}
```

#### Response

```json
{
  "text": "text",
  "logprobs": [
    {
      "token": "token",
      "bytes": [
        0
      ],
      "logprob": 0
    }
  ],
  "usage": {
    "input_tokens": 0,
    "output_tokens": 0,
    "total_tokens": 0,
    "type": "tokens",
    "input_token_details": {
      "audio_tokens": 0,
      "text_tokens": 0
    }
  }
}
```

## Domain Types

### Transcription

- `class Transcription:`

  Represents a transcription response returned by the model, based on the provided input.

  - `String text`

    The transcribed text.

  - `Optional<List<Logprob>> logprobs`

    The log probabilities of the tokens in the transcription. Only returned with the models `gpt-4o-transcribe` and `gpt-4o-mini-transcribe` if `logprobs` is added to the `include` array.

    - `Optional<String> token`

      The token in the transcription.

    - `Optional<List<Double>> bytes`

      The bytes of the token.

    - `Optional<Double> logprob`

      The log probability of the token.

  - `Optional<Usage> usage`

    Token usage statistics for the request.

    - `class Tokens:`

      Usage statistics for models billed by token usage.

      - `long inputTokens`

        Number of input tokens billed for this request.

      - `long outputTokens`

        Number of output tokens generated.

      - `long totalTokens`

        Total number of tokens used (input + output).

      - `JsonValue; type "tokens"` (constant)

        The type of the usage object. Always `tokens` for this variant.

        - `TOKENS("tokens")`

      - `Optional<InputTokenDetails> inputTokenDetails`

        Details about the input tokens billed for this request.

        - `Optional<Long> audioTokens`

          Number of audio tokens billed for this request.

        - `Optional<Long> textTokens`

          Number of text tokens billed for this request.

    - `class Duration:`

      Usage statistics for models billed by audio input duration.

      - `double seconds`

        Duration of the input audio in seconds.

      - `JsonValue; type "duration"` (constant)

        The type of the usage object. Always `duration` for this variant.

        - `DURATION("duration")`

### Transcription Diarized

- `class TranscriptionDiarized:`

  Represents a diarized transcription response returned by the model, including the combined transcript and speaker-segment annotations.

  - `double duration`

    Duration of the input audio in seconds.

  - `List<TranscriptionDiarizedSegment> segments`

    Segments of the transcript annotated with timestamps and speaker labels.

    - `String id`

      Unique identifier for the segment.

    - `double end`

      End timestamp of the segment in seconds.

    - `String speaker`

      Speaker label for this segment. When known speakers are provided, the label matches `known_speaker_names[]`. Otherwise speakers are labeled sequentially using capital letters (`A`, `B`, ...).

    - `double start`

      Start timestamp of the segment in seconds.

    - `String text`

      Transcript text for this segment.

    - `JsonValue; type "transcript.text.segment"` (constant)

      The type of the segment. Always `transcript.text.segment`.

      - `TRANSCRIPT_TEXT_SEGMENT("transcript.text.segment")`

  - `JsonValue; task "transcribe"` (constant)

    The type of task that was run. Always `transcribe`.

    - `TRANSCRIBE("transcribe")`

  - `String text`

    The concatenated transcript text for the entire audio input.

  - `Optional<Usage> usage`

    Token or duration usage statistics for the request.

    - `class Tokens:`

      Usage statistics for models billed by token usage.

      - `long inputTokens`

        Number of input tokens billed for this request.

      - `long outputTokens`

        Number of output tokens generated.

      - `long totalTokens`

        Total number of tokens used (input + output).

      - `JsonValue; type "tokens"` (constant)

        The type of the usage object. Always `tokens` for this variant.

        - `TOKENS("tokens")`

      - `Optional<InputTokenDetails> inputTokenDetails`

        Details about the input tokens billed for this request.

        - `Optional<Long> audioTokens`

          Number of audio tokens billed for this request.

        - `Optional<Long> textTokens`

          Number of text tokens billed for this request.

    - `class Duration:`

      Usage statistics for models billed by audio input duration.

      - `double seconds`

        Duration of the input audio in seconds.

      - `JsonValue; type "duration"` (constant)

        The type of the usage object. Always `duration` for this variant.

        - `DURATION("duration")`

### Transcription Diarized Segment

- `class TranscriptionDiarizedSegment:`

  A segment of diarized transcript text with speaker metadata.

  - `String id`

    Unique identifier for the segment.

  - `double end`

    End timestamp of the segment in seconds.

  - `String speaker`

    Speaker label for this segment. When known speakers are provided, the label matches `known_speaker_names[]`. Otherwise speakers are labeled sequentially using capital letters (`A`, `B`, ...).

  - `double start`

    Start timestamp of the segment in seconds.

  - `String text`

    Transcript text for this segment.

  - `JsonValue; type "transcript.text.segment"` (constant)

    The type of the segment. Always `transcript.text.segment`.

    - `TRANSCRIPT_TEXT_SEGMENT("transcript.text.segment")`
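Because each diarized segment carries a `speaker` label plus `start` and `end` timestamps, per-speaker talk time follows from a simple fold over the segment list. The sketch below uses a local `Segment` stand-in with the same field names; `SpeakerTalkTime` and `secondsBySpeaker` are hypothetical and not part of the SDK.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not an SDK class.
final class SpeakerTalkTime {
    /** Local stand-in mirroring the fields of a diarized segment. */
    static final class Segment {
        final String speaker;
        final double start;
        final double end;
        final String text;

        Segment(String speaker, double start, double end, String text) {
            this.speaker = speaker;
            this.start = start;
            this.end = end;
            this.text = text;
        }
    }

    /** Sums (end - start) per speaker, keeping first-seen speaker order. */
    static Map<String, Double> secondsBySpeaker(List<Segment> segments) {
        Map<String, Double> totals = new LinkedHashMap<>();
        for (Segment s : segments) {
            totals.merge(s.speaker, s.end - s.start, Double::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Segment> segments = Arrays.asList(
            new Segment("A", 0.0, 2.0, "Hello."),
            new Segment("B", 2.0, 5.0, "Hi, how can I help?"),
            new Segment("A", 5.0, 6.0, "Thanks."));
        System.out.println(secondsBySpeaker(segments));
    }
}
```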

### Transcription Include

- `enum TranscriptionInclude:`

  - `LOGPROBS("logprobs")`

### Transcription Segment

- `class TranscriptionSegment:`

  - `long id`

    Unique identifier of the segment.

  - `double avgLogprob`

    Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.

  - `double compressionRatio`

    Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.

  - `double end`

    End time of the segment in seconds.

  - `double noSpeechProb`

    Probability of no speech in the segment. If the value is higher than 1.0 and the `avg_logprob` is below -1, consider this segment silent.

  - `long seek`

    Seek offset of the segment.

  - `double start`

    Start time of the segment in seconds.

  - `double temperature`

    Temperature parameter used for generating the segment.

  - `String text`

    Text content of the segment.

  - `List<Long> tokens`

    Array of token IDs for the text content.
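The `avgLogprob` and `compressionRatio` thresholds quoted above lend themselves to a small filter. This is a sketch under stated assumptions: `SegmentQuality` and `looksReliable` are hypothetical names, only the two thresholds given in the field descriptions are applied, and the `noSpeechProb` rule is deliberately left out.

```java
// Hypothetical helper, not an SDK class.
final class SegmentQuality {
    /** Below this average logprob, the description says to treat logprobs as failed. */
    static final double MIN_AVG_LOGPROB = -1.0;
    /** Above this compression ratio, the description says to treat compression as failed. */
    static final double MAX_COMPRESSION_RATIO = 2.4;

    /** True when neither documented failure condition applies. */
    static boolean looksReliable(double avgLogprob, double compressionRatio) {
        boolean logprobFailed = avgLogprob < MIN_AVG_LOGPROB;
        boolean compressionFailed = compressionRatio > MAX_COMPRESSION_RATIO;
        return !logprobFailed && !compressionFailed;
    }

    public static void main(String[] args) {
        System.out.println(looksReliable(-0.3, 1.2));
        System.out.println(looksReliable(-1.5, 1.2));
    }
}
```

Values exactly at a threshold pass, since the descriptions use "lower than" and "greater than".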

### Transcription Stream Event

- `class TranscriptionStreamEvent:` A class that can be one of several variants (union).

  Emitted when a diarized transcription returns a completed segment with speaker information. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with `stream` set to `true` and `response_format` set to `diarized_json`.

  - `class TranscriptionTextSegmentEvent:`

    Emitted when a diarized transcription returns a completed segment with speaker information. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with `stream` set to `true` and `response_format` set to `diarized_json`.

    - `String id`

      Unique identifier for the segment.

    - `double end`

      End timestamp of the segment in seconds.

    - `String speaker`

      Speaker label for this segment.

    - `double start`

      Start timestamp of the segment in seconds.

    - `String text`

      Transcript text for this segment.

    - `JsonValue; type "transcript.text.segment"` (constant)

      The type of the event. Always `transcript.text.segment`.

      - `TRANSCRIPT_TEXT_SEGMENT("transcript.text.segment")`

  - `class TranscriptionTextDeltaEvent:`

    Emitted when there is an additional text delta. This is also the first event emitted when the transcription starts. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `stream` parameter set to `true`.

    - `String delta`

      The text delta that was additionally transcribed.

    - `JsonValue; type "transcript.text.delta"` (constant)

      The type of the event. Always `transcript.text.delta`.

      - `TRANSCRIPT_TEXT_DELTA("transcript.text.delta")`

    - `Optional<List<Logprob>> logprobs`

      The log probabilities of the delta. Only included if you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `include[]` parameter set to `logprobs`.

      - `Optional<String> token`

        The token that was used to generate the log probability.

      - `Optional<List<Long>> bytes`

        The bytes that were used to generate the log probability.

      - `Optional<Double> logprob`

        The log probability of the token.

    - `Optional<String> segmentId`

      Identifier of the diarized segment that this delta belongs to. Only present when using `gpt-4o-transcribe-diarize`.

  - `class TranscriptionTextDoneEvent:`

    Emitted when the transcription is complete. Contains the complete transcription text. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `stream` parameter set to `true`.

    - `String text`

      The text that was transcribed.

    - `JsonValue; type "transcript.text.done"` (constant)

      The type of the event. Always `transcript.text.done`.

      - `TRANSCRIPT_TEXT_DONE("transcript.text.done")`

    - `Optional<List<Logprob>> logprobs`

      The log probabilities of the individual tokens in the transcription. Only included if you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `include[]` parameter set to `logprobs`.

      - `Optional<String> token`

        The token that was used to generate the log probability.

      - `Optional<List<Long>> bytes`

        The bytes that were used to generate the log probability.

      - `Optional<Double> logprob`

        The log probability of the token.

    - `Optional<Usage> usage`

      Usage statistics for models billed by token usage.

      - `long inputTokens`

        Number of input tokens billed for this request.

      - `long outputTokens`

        Number of output tokens generated.

      - `long totalTokens`

        Total number of tokens used (input + output).

      - `JsonValue; type "tokens"` (constant)

        The type of the usage object. Always `tokens` for this variant.

        - `TOKENS("tokens")`

      - `Optional<InputTokenDetails> inputTokenDetails`

        Details about the input tokens billed for this request.

        - `Optional<Long> audioTokens`

          Number of audio tokens billed for this request.

        - `Optional<Long> textTokens`

          Number of text tokens billed for this request.

### Transcription Text Delta Event

- `class TranscriptionTextDeltaEvent:`

  Emitted when there is an additional text delta. This is also the first event emitted when the transcription starts. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `stream` parameter set to `true`.

  - `String delta`

    The text delta that was additionally transcribed.

  - `JsonValue; type "transcript.text.delta"` (constant)

    The type of the event. Always `transcript.text.delta`.

    - `TRANSCRIPT_TEXT_DELTA("transcript.text.delta")`

  - `Optional<List<Logprob>> logprobs`

    The log probabilities of the delta. Only included if you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `include[]` parameter set to `logprobs`.

    - `Optional<String> token`

      The token that was used to generate the log probability.

    - `Optional<List<Long>> bytes`

      The bytes that were used to generate the log probability.

    - `Optional<Double> logprob`

      The log probability of the token.

  - `Optional<String> segmentId`

    Identifier of the diarized segment that this delta belongs to. Only present when using `gpt-4o-transcribe-diarize`.

### Transcription Text Done Event

- `class TranscriptionTextDoneEvent:`

  Emitted when the transcription is complete. Contains the complete transcription text. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `stream` parameter set to `true`.

  - `String text`

    The text that was transcribed.

  - `JsonValue; type "transcript.text.done"` (constant)

    The type of the event. Always `transcript.text.done`.

    - `TRANSCRIPT_TEXT_DONE("transcript.text.done")`

  - `Optional<List<Logprob>> logprobs`

    The log probabilities of the individual tokens in the transcription. Only included if you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `include[]` parameter set to `logprobs`.

    - `Optional<String> token`

      The token that was used to generate the log probability.

    - `Optional<List<Long>> bytes`

      The bytes that were used to generate the log probability.

    - `Optional<Double> logprob`

      The log probability of the token.

  - `Optional<Usage> usage`

    Usage statistics for models billed by token usage.

    - `long inputTokens`

      Number of input tokens billed for this request.

    - `long outputTokens`

      Number of output tokens generated.

    - `long totalTokens`

      Total number of tokens used (input + output).

    - `JsonValue; type "tokens"` (constant)

      The type of the usage object. Always `tokens` for this variant.

      - `TOKENS("tokens")`

    - `Optional<InputTokenDetails> inputTokenDetails`

      Details about the input tokens billed for this request.

      - `Optional<Long> audioTokens`

        Number of audio tokens billed for this request.

      - `Optional<Long> textTokens`

        Number of text tokens billed for this request.
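The delta and done events described above suggest a simple consumer: append each `delta` as it arrives, then prefer the `text` of the done event once it shows up. The following sketch models that flow with local stand-in types. `TranscriptAccumulator`, `Event`, `Delta`, and `Done` are hypothetical and do not reflect how the SDK exposes streamed events.

```java
// Hypothetical consumer of streamed transcript events, not an SDK class.
final class TranscriptAccumulator {
    interface Event {}

    /** Stand-in for a transcript.text.delta event. */
    static final class Delta implements Event {
        final String delta;
        Delta(String delta) { this.delta = delta; }
    }

    /** Stand-in for a transcript.text.done event. */
    static final class Done implements Event {
        final String text;
        Done(String text) { this.text = text; }
    }

    private final StringBuilder partial = new StringBuilder();
    private String finalText; // stays null until a Done event arrives

    void accept(Event event) {
        if (event instanceof Delta) {
            partial.append(((Delta) event).delta);
        } else if (event instanceof Done) {
            // The done event carries the complete text, so it replaces the partial one.
            finalText = ((Done) event).text;
        }
    }

    boolean isComplete() {
        return finalText != null;
    }

    String currentText() {
        return finalText != null ? finalText : partial.toString();
    }

    public static void main(String[] args) {
        TranscriptAccumulator acc = new TranscriptAccumulator();
        acc.accept(new Delta("Hel"));
        acc.accept(new Delta("lo"));
        System.out.println(acc.currentText());
        acc.accept(new Done("Hello."));
        System.out.println(acc.currentText());
    }
}
```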

### Transcription Text Segment Event

- `class TranscriptionTextSegmentEvent:`

  Emitted when a diarized transcription returns a completed segment with speaker information. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with `stream` set to `true` and `response_format` set to `diarized_json`.

  - `String id`

    Unique identifier for the segment.

  - `double end`

    End timestamp of the segment in seconds.

  - `String speaker`

    Speaker label for this segment.

  - `double start`

    Start timestamp of the segment in seconds.

  - `String text`

    Transcript text for this segment.

  - `JsonValue; type "transcript.text.segment"` (constant)

    The type of the event. Always `transcript.text.segment`.

    - `TRANSCRIPT_TEXT_SEGMENT("transcript.text.segment")`

### Transcription Verbose

- `class TranscriptionVerbose:`

  Represents a verbose JSON transcription response returned by the model, based on the provided input.

  - `double duration`

    The duration of the input audio.

  - `String language`

    The language of the input audio.

  - `String text`

    The transcribed text.

  - `Optional<List<TranscriptionSegment>> segments`

    Segments of the transcribed text and their corresponding details.

    - `long id`

      Unique identifier of the segment.

    - `double avgLogprob`

      Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.

    - `double compressionRatio`

      Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.

    - `double end`

      End time of the segment in seconds.

    - `double noSpeechProb`

      Probability of no speech in the segment. If the value is higher than 1.0 and the `avg_logprob` is below -1, consider this segment silent.

    - `long seek`

      Seek offset of the segment.

    - `double start`

      Start time of the segment in seconds.

    - `double temperature`

      Temperature parameter used for generating the segment.

    - `String text`

      Text content of the segment.

    - `List<Long> tokens`

      Array of token IDs for the text content.

  - `Optional<Usage> usage`

    Usage statistics for models billed by audio input duration.

    - `double seconds`

      Duration of the input audio in seconds.

    - `JsonValue; type "duration"` (constant)

      The type of the usage object. Always `duration` for this variant.

      - `DURATION("duration")`

  - `Optional<List<TranscriptionWord>> words`

    Extracted words and their corresponding timestamps.

    - `double end`

      End time of the word in seconds.

    - `double start`

      Start time of the word in seconds.

    - `String word`

      The text content of the word.
1097
1098### Transcription Word
1099
1100- `class TranscriptionWord:`
1101
1102 - `double end`
1103
1104 End time of the word in seconds.
1105
1106 - `double start`
1107
1108 Start time of the word in seconds.
1109
1110 - `String word`
1111
1112 The text content of the word.
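Because `start` and `end` are given in seconds, turning word timestamps into subtitle-style timings takes only a little arithmetic. The sketch below is illustrative only: `WordTimestamps`, `toSrtTime`, and `toCueLine` are hypothetical names, and the values are passed in directly instead of being read from a `TranscriptionWord`. It assumes non-negative times.

```java
import java.util.Locale;

/** Hypothetical helper (not an SDK class) for formatting word timestamps. */
final class WordTimestamps {
    private WordTimestamps() {}

    /** Formats a non-negative time in seconds as HH:MM:SS,mmm (the SRT timestamp layout). */
    static String toSrtTime(double seconds) {
        long totalMillis = Math.round(seconds * 1000.0);
        long hours = totalMillis / 3_600_000;
        long minutes = (totalMillis / 60_000) % 60;
        long secs = (totalMillis / 1000) % 60;
        long millis = totalMillis % 1000;
        return String.format(Locale.ROOT, "%02d:%02d:%02d,%03d", hours, minutes, secs, millis);
    }

    /** Builds a single line of the form "start --> end word". */
    static String toCueLine(double start, double end, String word) {
        return toSrtTime(start) + " --> " + toSrtTime(end) + " " + word;
    }
}
```

For example, `WordTimestamps.toCueLine(0.5, 1.25, "hello")` yields `00:00:00,500 --> 00:00:01,250 hello`.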
1113
1114# Translations
1115
1116## Create translation
1117
1118`TranslationCreateResponse audio().translations().create(TranslationCreateParams params, RequestOptions requestOptions = RequestOptions.none())`
1119
1120**post** `/audio/translations`
1121
1122Translates audio into English.
1123
1124### Parameters
1125
1126- `TranslationCreateParams params`
1127
1128 - `String file`
1129
1130 The audio file object (not file name) to translate, in one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm.
1131
1132 - `AudioModel model`
1133
1134 ID of the model to use. Only `whisper-1` (which is powered by our open source Whisper V2 model) is currently available.
1135
1136 - `WHISPER_1("whisper-1")`
1137
1138 - `GPT_4O_TRANSCRIBE("gpt-4o-transcribe")`
1139
1140 - `GPT_4O_MINI_TRANSCRIBE("gpt-4o-mini-transcribe")`
1141
1142 - `GPT_4O_MINI_TRANSCRIBE_2025_12_15("gpt-4o-mini-transcribe-2025-12-15")`
1143
1144 - `GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize")`
1145
1146 - `Optional<String> prompt`
1147
1148 An optional text to guide the model's style or continue a previous audio segment. The [prompt](https://platform.openai.com/docs/guides/speech-to-text#prompting) should be in English.
1149
1150 - `Optional<ResponseFormat> responseFormat`
1151
1152 The format of the output, in one of these options: `json`, `text`, `srt`, `verbose_json`, or `vtt`.
1153
1154 - `JSON("json")`
1155
1156 - `TEXT("text")`
1157
1158 - `SRT("srt")`
1159
1160 - `VERBOSE_JSON("verbose_json")`
1161
1162 - `VTT("vtt")`
1163
1164 - `Optional<Double> temperature`
1165
1166 The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use [log probability](https://en.wikipedia.org/wiki/Log_probability) to automatically increase the temperature until certain thresholds are hit.
1167
1168### Returns
1169
1170- `class TranslationCreateResponse:` A union class that can be one of several variants.
1171
1172 - `class Translation:`
1173
1174 - `String text`
1175
1176 - `class TranslationVerbose:`
1177
1178 - `double duration`
1179
1180 The duration of the input audio.
1181
1182 - `String language`
1183
1184 The language of the output translation (always `english`).
1185
1186 - `String text`
1187
1188 The translated text.
1189
1190 - `Optional<List<TranscriptionSegment>> segments`
1191
1192 Segments of the translated text and their corresponding details.
1193
1194 - `long id`
1195
1196 Unique identifier of the segment.
1197
1198 - `double avgLogprob`
1199
1200 Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.
1201
1202 - `double compressionRatio`
1203
1204 Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.
1205
1206 - `double end`
1207
1208 End time of the segment in seconds.
1209
1210 - `double noSpeechProb`
1211
1212 Probability of no speech in the segment. If the value is higher than 1.0 and the `avg_logprob` is below -1, consider this segment silent.
1213
1214 - `long seek`
1215
1216 Seek offset of the segment.
1217
1218 - `double start`
1219
1220 Start time of the segment in seconds.
1221
1222 - `double temperature`
1223
1224 Temperature parameter used for generating the segment.
1225
1226 - `String text`
1227
1228 Text content of the segment.
1229
1230 - `List<Long> tokens`
1231
1232 Array of token IDs for the text content.
1233
1234### Example
1235
```java
package com.openai.example;

import com.openai.client.OpenAIClient;
import com.openai.client.okhttp.OpenAIOkHttpClient;
import com.openai.models.audio.AudioModel;
import com.openai.models.audio.translations.TranslationCreateParams;
import com.openai.models.audio.translations.TranslationCreateResponse;
import java.io.ByteArrayInputStream;

public final class Main {
    private Main() {}

    public static void main(String[] args) {
        // Build a client configured from environment variables.
        OpenAIClient client = OpenAIOkHttpClient.fromEnv();

        // "Example data" is placeholder content; supply the bytes of a real audio file in practice.
        TranslationCreateParams params = TranslationCreateParams.builder()
            .file(new ByteArrayInputStream("Example data".getBytes()))
            .model(AudioModel.WHISPER_1)
            .build();
        TranslationCreateResponse translation = client.audio().translations().create(params);
    }
}
```
1260
1261#### Response
1262
```json
{
  "text": "text"
}
```
1268
1269## Domain Types
1270
1271### Translation
1272
1273- `class Translation:`
1274
1275 - `String text`
1276
1277### Translation Verbose
1278
1279- `class TranslationVerbose:`
1280
1281 - `double duration`
1282
1283 The duration of the input audio.
1284
1285 - `String language`
1286
1287 The language of the output translation (always `english`).
1288
1289 - `String text`
1290
1291 The translated text.
1292
1293 - `Optional<List<TranscriptionSegment>> segments`
1294
1295 Segments of the translated text and their corresponding details.
1296
1297 - `long id`
1298
1299 Unique identifier of the segment.
1300
1301 - `double avgLogprob`
1302
1303 Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.
1304
1305 - `double compressionRatio`
1306
1307 Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.
1308
1309 - `double end`
1310
1311 End time of the segment in seconds.
1312
1313 - `double noSpeechProb`
1314
1315 Probability of no speech in the segment. If the value is higher than 1.0 and the `avg_logprob` is below -1, consider this segment silent.
1316
1317 - `long seek`
1318
1319 Seek offset of the segment.
1320
1321 - `double start`
1322
1323 Start time of the segment in seconds.
1324
1325 - `double temperature`
1326
1327 Temperature parameter used for generating the segment.
1328
1329 - `String text`
1330
1331 Text content of the segment.
1332
1333 - `List<Long> tokens`
1334
1335 Array of token IDs for the text content.
1336
1337# Speech
1338
1339## Create speech
1340
1341`HttpResponse audio().speech().create(SpeechCreateParams params, RequestOptions requestOptions = RequestOptions.none())`
1342
1343**post** `/audio/speech`
1344
1345Generates audio from the input text.
1346
1347Returns the audio file content, or a stream of audio events.
1348
1349### Parameters
1350
1351- `SpeechCreateParams params`
1352
1353 - `String input`
1354
1355 The text to generate audio for. The maximum length is 4096 characters.
1356
1357 - `SpeechModel model`
1358
1359 One of the available [TTS models](https://platform.openai.com/docs/models#tts): `tts-1`, `tts-1-hd`, `gpt-4o-mini-tts`, or `gpt-4o-mini-tts-2025-12-15`.
1360
1361 - `TTS_1("tts-1")`
1362
1363 - `TTS_1_HD("tts-1-hd")`
1364
1365 - `GPT_4O_MINI_TTS("gpt-4o-mini-tts")`
1366
1367 - `GPT_4O_MINI_TTS_2025_12_15("gpt-4o-mini-tts-2025-12-15")`
1368
1369 - `Voice voice`
1370
1371 The voice to use when generating the audio. Supported built-in voices are `alloy`, `ash`, `ballad`, `coral`, `echo`, `fable`, `onyx`, `nova`, `sage`, `shimmer`, `verse`, `marin`, and `cedar`. You may also provide a custom voice object with an `id`, for example `{ "id": "voice_1234" }`. Previews of the voices are available in the [Text to speech guide](https://platform.openai.com/docs/guides/text-to-speech#voice-options).
1372
1373 - `String`
1374
1375 - `enum UnionMember1:`
1376
1377 - `ALLOY("alloy")`
1378
1379 - `ASH("ash")`
1380
1381 - `BALLAD("ballad")`
1382
1383 - `CORAL("coral")`
1384
1385 - `ECHO("echo")`
1386
1387 - `SAGE("sage")`
1388
1389 - `SHIMMER("shimmer")`
1390
1391 - `VERSE("verse")`
1392
1393 - `MARIN("marin")`
1394
1395 - `CEDAR("cedar")`
1396
1397 - `class Id:`
1398
1399 Custom voice reference.
1400
1401 - `String id`
1402
1403 The custom voice ID, e.g. `voice_1234`.
1404
1405 - `Optional<String> instructions`
1406
1407 Control the voice of your generated audio with additional instructions. Does not work with `tts-1` or `tts-1-hd`.
1408
1409 - `Optional<ResponseFormat> responseFormat`
1410
1411 The format to return the audio in. Supported formats are `mp3`, `opus`, `aac`, `flac`, `wav`, and `pcm`.
1412
1413 - `MP3("mp3")`
1414
1415 - `OPUS("opus")`
1416
1417 - `AAC("aac")`
1418
1419 - `FLAC("flac")`
1420
1421 - `WAV("wav")`
1422
1423 - `PCM("pcm")`
1424
1425 - `Optional<Double> speed`
1426
1427 The speed of the generated audio. Select a value from `0.25` to `4.0`. `1.0` is the default.
1428
1429 - `Optional<StreamFormat> streamFormat`
1430
1431 The format to stream the audio in. Supported formats are `sse` and `audio`. `sse` is not supported for `tts-1` or `tts-1-hd`.
1432
1433 - `SSE("sse")`
1434
1435 - `AUDIO("audio")`
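The `input` limit of 4096 characters and the `speed` range of `0.25` to `4.0` can be checked before a request is built. This is a minimal sketch that makes no assumptions about the SDK: `SpeechInput`, `split`, and `isValidSpeed` are hypothetical names, and length is counted in Java `char` units (UTF-16), which may not match how the API counts characters.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not an SDK class) for pre-checking speech parameters. */
final class SpeechInput {
    // Limits copied from the parameter descriptions above.
    static final int MAX_INPUT_CHARS = 4096;
    static final double MIN_SPEED = 0.25;
    static final double MAX_SPEED = 4.0;

    private SpeechInput() {}

    /** True when speed lies within the documented range, bounds included. */
    static boolean isValidSpeed(double speed) {
        return speed >= MIN_SPEED && speed <= MAX_SPEED;
    }

    /**
     * Splits text into chunks of at most maxChars UTF-16 units, breaking after the
     * last space in each window when one exists. Concatenating the chunks
     * reproduces the original text.
     */
    static List<String> split(String text, int maxChars) {
        if (maxChars <= 0) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        List<String> chunks = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int end = Math.min(start + maxChars, text.length());
            if (end < text.length()) {
                // Look backwards for a space so that words are not cut in half.
                int space = text.lastIndexOf(' ', end - 1);
                if (space > start) {
                    end = space + 1; // keep the space at the end of this chunk
                }
            }
            chunks.add(text.substring(start, end));
            start = end;
        }
        return chunks;
    }
}
```

Each chunk returned by `SpeechInput.split(text, SpeechInput.MAX_INPUT_CHARS)` could then be sent as the `input` of its own request.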
1436
1437### Example
1438
```java
package com.openai.example;

import com.openai.client.OpenAIClient;
import com.openai.client.okhttp.OpenAIOkHttpClient;
import com.openai.core.http.HttpResponse;
import com.openai.models.audio.speech.SpeechCreateParams;
import com.openai.models.audio.speech.SpeechModel;

public final class Main {
    private Main() {}

    public static void main(String[] args) {
        // Build a client configured from environment variables.
        OpenAIClient client = OpenAIOkHttpClient.fromEnv();

        // "input" is placeholder text; "alloy" is one of the built-in voices listed above.
        SpeechCreateParams params = SpeechCreateParams.builder()
            .input("input")
            .model(SpeechModel.TTS_1)
            .voice("alloy")
            .build();
        // The call returns the raw HTTP response that carries the generated audio.
        HttpResponse speech = client.audio().speech().create(params);
    }
}
```
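Since the call returns the audio file content, a common next step is to write it to disk. The sketch below uses only the Java standard library and assumes the response body can be obtained as a `java.io.InputStream`; `AudioFileWriter` and `save` are hypothetical names, not SDK classes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper (not an SDK class) that stores streamed audio bytes in a file. */
final class AudioFileWriter {
    private AudioFileWriter() {}

    /**
     * Copies every byte from the stream to the target path, replacing any existing
     * file, and closes the stream afterwards. Returns the number of bytes written.
     */
    static long save(InputStream audio, Path target) throws IOException {
        try (InputStream in = audio) {
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```

The file extension of `target` should match the requested `responseFormat`, for example `.mp3` or `.wav`.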
1463
1464## Domain Types
1465
1466### Speech Model
1467
1468- `enum SpeechModel:`
1469
1470 - `TTS_1("tts-1")`
1471
1472 - `TTS_1_HD("tts-1-hd")`
1473
1474 - `GPT_4O_MINI_TTS("gpt-4o-mini-tts")`
1475
1476 - `GPT_4O_MINI_TTS_2025_12_15("gpt-4o-mini-tts-2025-12-15")`
1477
1478# Voices
1479
1480# Voice Consents