java/resources/audio/subresources/transcriptions/index.md +0 −1076 deleted
# Transcriptions

## Create transcription

`TranscriptionCreateResponse audio().transcriptions().create(TranscriptionCreateParams params, RequestOptions requestOptions = RequestOptions.none())`

**post** `/audio/transcriptions`

Transcribes audio into the input language.

Returns a transcription object in `json`, `diarized_json`, or `verbose_json`
format, or a stream of transcript events.

### Parameters

- `TranscriptionCreateParams params`

  - `String file`

    The audio file object (not file name) to transcribe, in one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm.

  - `AudioModel model`

    ID of the model to use. The options are `gpt-4o-transcribe`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `whisper-1` (which is powered by our open source Whisper V2 model), and `gpt-4o-transcribe-diarize`.

    - `WHISPER_1("whisper-1")`

    - `GPT_4O_TRANSCRIBE("gpt-4o-transcribe")`

    - `GPT_4O_MINI_TRANSCRIBE("gpt-4o-mini-transcribe")`

    - `GPT_4O_MINI_TRANSCRIBE_2025_12_15("gpt-4o-mini-transcribe-2025-12-15")`

    - `GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize")`

  - `Optional<ChunkingStrategy> chunkingStrategy`

    Controls how the audio is cut into chunks. When set to `"auto"`, the server first normalizes loudness and then uses voice activity detection (VAD) to choose boundaries. A `server_vad` object can be provided to tweak VAD detection parameters manually. If unset, the audio is transcribed as a single block. Required when using `gpt-4o-transcribe-diarize` for inputs longer than 30 seconds.

    - `JsonValue`

      - `AUTO("auto")`

    - `class VadConfig:`

      - `Type type`

        Must be set to `server_vad` to enable manual chunking using server side VAD.

        - `SERVER_VAD("server_vad")`

      - `Optional<Long> prefixPaddingMs`

        Amount of audio to include before the VAD detected speech (in
        milliseconds).

      - `Optional<Long> silenceDurationMs`

        Duration of silence to detect speech stop (in milliseconds).
        With shorter values the model will respond more quickly,
        but may jump in on short pauses from the user.

      - `Optional<Double> threshold`

        Sensitivity threshold (0.0 to 1.0) for voice activity detection. A
        higher threshold will require louder audio to activate the model, and
        thus might perform better in noisy environments.

  - `Optional<List<TranscriptionInclude>> include`

    Additional information to include in the transcription response.
    `logprobs` will return the log probabilities of the tokens in the
    response to understand the model's confidence in the transcription.
    `logprobs` only works with `response_format` set to `json` and only with
    the models `gpt-4o-transcribe`, `gpt-4o-mini-transcribe`, and `gpt-4o-mini-transcribe-2025-12-15`. This field is not supported when using `gpt-4o-transcribe-diarize`.

    - `LOGPROBS("logprobs")`

  - `Optional<List<String>> knownSpeakerNames`

    Optional list of speaker names that correspond to the audio samples provided in `known_speaker_references[]`. Each entry should be a short identifier (for example `customer` or `agent`). Up to 4 speakers are supported.

  - `Optional<List<String>> knownSpeakerReferences`

    Optional list of audio samples (as [data URLs](https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/Data_URLs)) that contain known speaker references matching `known_speaker_names[]`. Each sample must be between 2 and 10 seconds, and can use any of the same input audio formats supported by `file`.

  - `Optional<String> language`

    The language of the input audio. Supplying the input language in [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) (e.g. `en`) format will improve accuracy and latency.

  - `Optional<String> prompt`

    An optional text to guide the model's style or continue a previous audio segment. The [prompt](https://platform.openai.com/docs/guides/speech-to-text#prompting) should match the audio language. This field is not supported when using `gpt-4o-transcribe-diarize`.

  - `Optional<AudioResponseFormat> responseFormat`

    The format of the output, in one of these options: `json`, `text`, `srt`, `verbose_json`, `vtt`, or `diarized_json`. For `gpt-4o-transcribe` and `gpt-4o-mini-transcribe`, the only supported format is `json`. For `gpt-4o-transcribe-diarize`, the supported formats are `json`, `text`, and `diarized_json`, with `diarized_json` required to receive speaker annotations.

  - `Optional<Double> temperature`

    The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use [log probability](https://en.wikipedia.org/wiki/Log_probability) to automatically increase the temperature until certain thresholds are hit.

  - `Optional<List<TimestampGranularity>> timestampGranularities`

    The timestamp granularities to populate for this transcription. `response_format` must be set to `verbose_json` to use timestamp granularities. Either or both of these options are supported: `word` and `segment`. Note: There is no additional latency for segment timestamps, but generating word timestamps incurs additional latency.
    This option is not available for `gpt-4o-transcribe-diarize`.

    - `WORD("word")`

    - `SEGMENT("segment")`
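
The parameter descriptions above state several cross-field constraints (which response formats each model accepts, when `logprobs`, `prompt`, and timestamp granularities are allowed). A minimal sketch of a client-side pre-flight check follows. `TranscriptionRequestCheck` and its `problems` method are hypothetical names, not part of the SDK; the sketch works on raw API strings and encodes only the rules written in this section, so the server remains the authority.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical pre-flight check; not part of the SDK. */
final class TranscriptionRequestCheck {
    private static final List<String> DIARIZE_FORMATS =
            Arrays.asList("json", "text", "diarized_json");
    private static final List<String> LOGPROB_MODELS = Arrays.asList(
            "gpt-4o-transcribe", "gpt-4o-mini-transcribe", "gpt-4o-mini-transcribe-2025-12-15");

    private TranscriptionRequestCheck() {}

    /**
     * Returns human-readable problems; an empty list means none of the rules
     * documented above is violated. responseFormat is the resolved format string.
     */
    static List<String> problems(String model, String responseFormat,
                                 boolean includeLogprobs, boolean hasPrompt,
                                 boolean hasTimestampGranularities) {
        List<String> out = new ArrayList<>();
        boolean diarize = "gpt-4o-transcribe-diarize".equals(model);

        if (("gpt-4o-transcribe".equals(model) || "gpt-4o-mini-transcribe".equals(model))
                && !"json".equals(responseFormat)) {
            out.add(model + " supports only response_format=json");
        }
        if (diarize && !DIARIZE_FORMATS.contains(responseFormat)) {
            out.add("diarize model supports json, text, or diarized_json");
        }
        if (includeLogprobs
                && !(LOGPROB_MODELS.contains(model) && "json".equals(responseFormat))) {
            out.add("logprobs requires response_format=json and a supported model");
        }
        if (hasPrompt && diarize) {
            out.add("prompt is not supported with the diarize model");
        }
        if (hasTimestampGranularities
                && (diarize || !"verbose_json".equals(responseFormat))) {
            out.add("timestamp_granularities requires verbose_json and a non-diarize model");
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(problems("whisper-1", "verbose_json", false, false, true));
        System.out.println(problems("gpt-4o-transcribe-diarize", "srt", true, true, false));
    }
}
```

Returning a list of problems rather than throwing on the first one lets a caller report every violated rule at once.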

### Returns

- `class TranscriptionCreateResponse:` A class that can be one of several variants (union).

  Represents a transcription response returned by the model, based on the provided input.

  - `class Transcription:`

    Represents a transcription response returned by the model, based on the provided input.

    - `String text`

      The transcribed text.

    - `Optional<List<Logprob>> logprobs`

      The log probabilities of the tokens in the transcription. Only returned with the models `gpt-4o-transcribe` and `gpt-4o-mini-transcribe` if `logprobs` is added to the `include` array.

      - `Optional<String> token`

        The token in the transcription.

      - `Optional<List<Double>> bytes`

        The bytes of the token.

      - `Optional<Double> logprob`

        The log probability of the token.

    - `Optional<Usage> usage`

      Token usage statistics for the request.

      - `class Tokens:`

        Usage statistics for models billed by token usage.

        - `long inputTokens`

          Number of input tokens billed for this request.

        - `long outputTokens`

          Number of output tokens generated.

        - `long totalTokens`

          Total number of tokens used (input + output).

        - `JsonValue type` (constant `"tokens"`)

          The type of the usage object. Always `tokens` for this variant.

          - `TOKENS("tokens")`

        - `Optional<InputTokenDetails> inputTokenDetails`

          Details about the input tokens billed for this request.

          - `Optional<Long> audioTokens`

            Number of audio tokens billed for this request.

          - `Optional<Long> textTokens`

            Number of text tokens billed for this request.

      - `class Duration:`

        Usage statistics for models billed by audio input duration.

        - `double seconds`

          Duration of the input audio in seconds.

        - `JsonValue type` (constant `"duration"`)

          The type of the usage object. Always `duration` for this variant.

          - `DURATION("duration")`

  - `class TranscriptionDiarized:`

    Represents a diarized transcription response returned by the model, including the combined transcript and speaker-segment annotations.

    - `double duration`

      Duration of the input audio in seconds.

    - `List<TranscriptionDiarizedSegment> segments`

      Segments of the transcript annotated with timestamps and speaker labels.

      - `String id`

        Unique identifier for the segment.

      - `double end`

        End timestamp of the segment in seconds.

      - `String speaker`

        Speaker label for this segment. When known speakers are provided, the label matches `known_speaker_names[]`. Otherwise speakers are labeled sequentially using capital letters (`A`, `B`, ...).

      - `double start`

        Start timestamp of the segment in seconds.

      - `String text`

        Transcript text for this segment.

      - `JsonValue type` (constant `"transcript.text.segment"`)

        The type of the segment. Always `transcript.text.segment`.

        - `TRANSCRIPT_TEXT_SEGMENT("transcript.text.segment")`

    - `JsonValue task` (constant `"transcribe"`)

      The type of task that was run. Always `transcribe`.

      - `TRANSCRIBE("transcribe")`

    - `String text`

      The concatenated transcript text for the entire audio input.

    - `Optional<Usage> usage`

      Token or duration usage statistics for the request.

      - `class Tokens:`

        Usage statistics for models billed by token usage.

        - `long inputTokens`

          Number of input tokens billed for this request.

        - `long outputTokens`

          Number of output tokens generated.

        - `long totalTokens`

          Total number of tokens used (input + output).

        - `JsonValue type` (constant `"tokens"`)

          The type of the usage object. Always `tokens` for this variant.

          - `TOKENS("tokens")`

        - `Optional<InputTokenDetails> inputTokenDetails`

          Details about the input tokens billed for this request.

          - `Optional<Long> audioTokens`

            Number of audio tokens billed for this request.

          - `Optional<Long> textTokens`

            Number of text tokens billed for this request.

      - `class Duration:`

        Usage statistics for models billed by audio input duration.

        - `double seconds`

          Duration of the input audio in seconds.

        - `JsonValue type` (constant `"duration"`)

          The type of the usage object. Always `duration` for this variant.

          - `DURATION("duration")`

  - `class TranscriptionVerbose:`

    Represents a verbose JSON transcription response returned by the model, based on the provided input.

    - `double duration`

      The duration of the input audio.

    - `String language`

      The language of the input audio.

    - `String text`

      The transcribed text.

    - `Optional<List<TranscriptionSegment>> segments`

      Segments of the transcribed text and their corresponding details.

      - `long id`

        Unique identifier of the segment.

      - `double avgLogprob`

        Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.

      - `double compressionRatio`

        Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.

      - `double end`

        End time of the segment in seconds.

      - `double noSpeechProb`

        Probability of no speech in the segment. If the value is higher than 1.0 and the `avg_logprob` is below -1, consider this segment silent.

      - `long seek`

        Seek offset of the segment.

      - `double start`

        Start time of the segment in seconds.

      - `double temperature`

        Temperature parameter used for generating the segment.

      - `String text`

        Text content of the segment.

      - `List<Long> tokens`

        Array of token IDs for the text content.

    - `Optional<Usage> usage`

      Usage statistics for models billed by audio input duration.

      - `double seconds`

        Duration of the input audio in seconds.

      - `JsonValue type` (constant `"duration"`)

        The type of the usage object. Always `duration` for this variant.

        - `DURATION("duration")`

    - `Optional<List<TranscriptionWord>> words`

      Extracted words and their corresponding timestamps.

      - `double end`

        End time of the word in seconds.

      - `double start`

        Start time of the word in seconds.

      - `String word`

        The text content of the word.

### Example

```java
package com.openai.example;

import com.openai.client.OpenAIClient;
import com.openai.client.okhttp.OpenAIOkHttpClient;
import com.openai.models.audio.AudioModel;
import com.openai.models.audio.transcriptions.TranscriptionCreateParams;
import com.openai.models.audio.transcriptions.TranscriptionCreateResponse;
import java.io.ByteArrayInputStream;

public final class Main {
    private Main() {}

    public static void main(String[] args) {
        // Configure the client from environment variables.
        OpenAIClient client = OpenAIOkHttpClient.fromEnv();

        // Placeholder bytes; a real request needs audio in a supported format.
        TranscriptionCreateParams params = TranscriptionCreateParams.builder()
            .file(new ByteArrayInputStream("Example data".getBytes()))
            .model(AudioModel.GPT_4O_TRANSCRIBE)
            .build();
        TranscriptionCreateResponse transcription = client.audio().transcriptions().create(params);
    }
}
```

#### Response

```json
{
  "text": "text",
  "logprobs": [
    {
      "token": "token",
      "bytes": [
        0
      ],
      "logprob": 0
    }
  ],
  "usage": {
    "input_tokens": 0,
    "output_tokens": 0,
    "total_tokens": 0,
    "type": "tokens",
    "input_token_details": {
      "audio_tokens": 0,
      "text_tokens": 0
    }
  }
}
```
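
Requests that use `knownSpeakerReferences` need each audio sample as a data URL. The following sketch shows one way to build such a string with the standard library only. `SpeakerReference.toDataUrl` is a hypothetical helper, not an SDK method, and the `audio/wav` MIME type is an assumption about the sample's format.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical helper; not part of the SDK. */
final class SpeakerReference {
    private SpeakerReference() {}

    /** Encodes raw audio bytes as a base64 data URL with the given MIME type. */
    static String toDataUrl(byte[] audio, String mimeType) {
        return "data:" + mimeType + ";base64," + Base64.getEncoder().encodeToString(audio);
    }

    public static void main(String[] args) {
        // Placeholder bytes; a real reference would be 2 to 10 seconds of audio read from disk.
        byte[] sample = "Example data".getBytes(StandardCharsets.UTF_8);
        System.out.println(toDataUrl(sample, "audio/wav"));
    }
}
```

Each resulting string would be one entry in the speaker references list, paired by position with an entry in `knownSpeakerNames`.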

## Domain Types

### Transcription

- `class Transcription:`

  Represents a transcription response returned by the model, based on the provided input.

  - `String text`

    The transcribed text.

  - `Optional<List<Logprob>> logprobs`

    The log probabilities of the tokens in the transcription. Only returned with the models `gpt-4o-transcribe` and `gpt-4o-mini-transcribe` if `logprobs` is added to the `include` array.

    - `Optional<String> token`

      The token in the transcription.

    - `Optional<List<Double>> bytes`

      The bytes of the token.

    - `Optional<Double> logprob`

      The log probability of the token.

  - `Optional<Usage> usage`

    Token usage statistics for the request.

    - `class Tokens:`

      Usage statistics for models billed by token usage.

      - `long inputTokens`

        Number of input tokens billed for this request.

      - `long outputTokens`

        Number of output tokens generated.

      - `long totalTokens`

        Total number of tokens used (input + output).

      - `JsonValue type` (constant `"tokens"`)

        The type of the usage object. Always `tokens` for this variant.

        - `TOKENS("tokens")`

      - `Optional<InputTokenDetails> inputTokenDetails`

        Details about the input tokens billed for this request.

        - `Optional<Long> audioTokens`

          Number of audio tokens billed for this request.

        - `Optional<Long> textTokens`

          Number of text tokens billed for this request.

    - `class Duration:`

      Usage statistics for models billed by audio input duration.

      - `double seconds`

        Duration of the input audio in seconds.

      - `JsonValue type` (constant `"duration"`)

        The type of the usage object. Always `duration` for this variant.

        - `DURATION("duration")`
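
The `logprobs` field carries one log probability per token. One common way to read these as a rough confidence signal is to exponentiate each value and average the results. The sketch below assumes the `logprob` values have already been extracted into a plain list; `LogprobConfidence` is a hypothetical name, and the averaging choice is an illustration, not something the API prescribes.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper; not part of the SDK. */
final class LogprobConfidence {
    private LogprobConfidence() {}

    /** Mean per-token probability: exp(logprob) averaged over tokens; NaN for an empty list. */
    static double meanTokenProbability(List<Double> logprobs) {
        if (logprobs.isEmpty()) {
            return Double.NaN;
        }
        double sum = 0.0;
        for (double lp : logprobs) {
            sum += Math.exp(lp); // a log probability of 0 corresponds to probability 1
        }
        return sum / logprobs.size();
    }

    public static void main(String[] args) {
        // Illustrative values only, not taken from a real response.
        System.out.println(meanTokenProbability(Arrays.asList(-0.05, -0.30, -1.20)));
    }
}
```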

### Transcription Diarized

- `class TranscriptionDiarized:`

  Represents a diarized transcription response returned by the model, including the combined transcript and speaker-segment annotations.

  - `double duration`

    Duration of the input audio in seconds.

  - `List<TranscriptionDiarizedSegment> segments`

    Segments of the transcript annotated with timestamps and speaker labels.

    - `String id`

      Unique identifier for the segment.

    - `double end`

      End timestamp of the segment in seconds.

    - `String speaker`

      Speaker label for this segment. When known speakers are provided, the label matches `known_speaker_names[]`. Otherwise speakers are labeled sequentially using capital letters (`A`, `B`, ...).

    - `double start`

      Start timestamp of the segment in seconds.

    - `String text`

      Transcript text for this segment.

    - `JsonValue type` (constant `"transcript.text.segment"`)

      The type of the segment. Always `transcript.text.segment`.

      - `TRANSCRIPT_TEXT_SEGMENT("transcript.text.segment")`

  - `JsonValue task` (constant `"transcribe"`)

    The type of task that was run. Always `transcribe`.

    - `TRANSCRIBE("transcribe")`

  - `String text`

    The concatenated transcript text for the entire audio input.

  - `Optional<Usage> usage`

    Token or duration usage statistics for the request.

    - `class Tokens:`

      Usage statistics for models billed by token usage.

      - `long inputTokens`

        Number of input tokens billed for this request.

      - `long outputTokens`

        Number of output tokens generated.

      - `long totalTokens`

        Total number of tokens used (input + output).

      - `JsonValue type` (constant `"tokens"`)

        The type of the usage object. Always `tokens` for this variant.

        - `TOKENS("tokens")`

      - `Optional<InputTokenDetails> inputTokenDetails`

        Details about the input tokens billed for this request.

        - `Optional<Long> audioTokens`

          Number of audio tokens billed for this request.

        - `Optional<Long> textTokens`

          Number of text tokens billed for this request.

    - `class Duration:`

      Usage statistics for models billed by audio input duration.

      - `double seconds`

        Duration of the input audio in seconds.

      - `JsonValue type` (constant `"duration"`)

        The type of the usage object. Always `duration` for this variant.

        - `DURATION("duration")`

### Transcription Diarized Segment

- `class TranscriptionDiarizedSegment:`

  A segment of diarized transcript text with speaker metadata.

  - `String id`

    Unique identifier for the segment.

  - `double end`

    End timestamp of the segment in seconds.

  - `String speaker`

    Speaker label for this segment. When known speakers are provided, the label matches `known_speaker_names[]`. Otherwise speakers are labeled sequentially using capital letters (`A`, `B`, ...).

  - `double start`

    Start timestamp of the segment in seconds.

  - `String text`

    Transcript text for this segment.

  - `JsonValue type` (constant `"transcript.text.segment"`)

    The type of the segment. Always `transcript.text.segment`.

    - `TRANSCRIPT_TEXT_SEGMENT("transcript.text.segment")`
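
Diarized segments arrive as a flat list, so turning them into a readable dialogue usually means merging consecutive segments from the same speaker. The sketch below does that with a small stand-in class that carries only the `speaker`, `start`, `end`, and `text` fields described above. `DiarizedTranscript` and its nested `Segment` are hypothetical and are not the SDK's `TranscriptionDiarizedSegment`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper; not part of the SDK. */
final class DiarizedTranscript {
    private DiarizedTranscript() {}

    /** Minimal stand-in for the documented segment fields used here. */
    static final class Segment {
        final String speaker;
        final double start;
        final double end;
        final String text;

        Segment(String speaker, double start, double end, String text) {
            this.speaker = speaker;
            this.start = start;
            this.end = end;
            this.text = text;
        }
    }

    /** Merges consecutive segments from the same speaker into one "speaker: text" line. */
    static List<String> render(List<Segment> segments) {
        List<String> lines = new ArrayList<>();
        String currentSpeaker = null;
        StringBuilder current = new StringBuilder();
        for (Segment s : segments) {
            if (!s.speaker.equals(currentSpeaker)) {
                // Speaker changed: flush the previous speaker's line, then start a new one.
                if (currentSpeaker != null) {
                    lines.add(currentSpeaker + ": " + current);
                }
                currentSpeaker = s.speaker;
                current = new StringBuilder(s.text.trim());
            } else {
                current.append(' ').append(s.text.trim());
            }
        }
        if (currentSpeaker != null) {
            lines.add(currentSpeaker + ": " + current);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Illustrative segments only.
        List<Segment> segments = Arrays.asList(
                new Segment("A", 0.0, 1.2, "Hello"),
                new Segment("A", 1.2, 2.0, "there."),
                new Segment("B", 2.1, 3.0, "Hi!"));
        for (String line : render(segments)) {
            System.out.println(line);
        }
    }
}
```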

### Transcription Include

- `enum TranscriptionInclude:`

  - `LOGPROBS("logprobs")`

### Transcription Segment

- `class TranscriptionSegment:`

  - `long id`

    Unique identifier of the segment.

  - `double avgLogprob`

    Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.

  - `double compressionRatio`

    Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.

  - `double end`

    End time of the segment in seconds.

  - `double noSpeechProb`

    Probability of no speech in the segment. If the value is higher than 1.0 and the `avg_logprob` is below -1, consider this segment silent.

  - `long seek`

    Seek offset of the segment.

  - `double start`

    Start time of the segment in seconds.

  - `double temperature`

    Temperature parameter used for generating the segment.

  - `String text`

    Text content of the segment.

  - `List<Long> tokens`

    Array of token IDs for the text content.
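
The `avgLogprob` and `compressionRatio` descriptions give two thresholds (-1 and 2.4) for treating a segment as failed. A minimal sketch of applying them as a filter follows. `SegmentQuality` is a hypothetical helper that takes the two numbers directly instead of an SDK object, and combining the two checks with a logical AND is an assumption made for illustration.

```java
/** Hypothetical helper; not part of the SDK. */
final class SegmentQuality {
    // Thresholds quoted from the field descriptions above.
    static final double MIN_AVG_LOGPROB = -1.0;
    static final double MAX_COMPRESSION_RATIO = 2.4;

    private SegmentQuality() {}

    /** True when neither documented failure condition applies to the segment. */
    static boolean looksReliable(double avgLogprob, double compressionRatio) {
        return avgLogprob >= MIN_AVG_LOGPROB && compressionRatio <= MAX_COMPRESSION_RATIO;
    }

    public static void main(String[] args) {
        // Illustrative values only.
        System.out.println(looksReliable(-0.3, 1.5));
        System.out.println(looksReliable(-1.2, 1.5));
        System.out.println(looksReliable(-0.3, 3.0));
    }
}
```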

### Transcription Stream Event

- `class TranscriptionStreamEvent:` A class that can be one of several variants (union).

  Emitted when a diarized transcription returns a completed segment with speaker information. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with `stream` set to `true` and `response_format` set to `diarized_json`.

  - `class TranscriptionTextSegmentEvent:`

    Emitted when a diarized transcription returns a completed segment with speaker information. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with `stream` set to `true` and `response_format` set to `diarized_json`.

    - `String id`

      Unique identifier for the segment.

    - `double end`

      End timestamp of the segment in seconds.

    - `String speaker`

      Speaker label for this segment.

    - `double start`

      Start timestamp of the segment in seconds.

    - `String text`

      Transcript text for this segment.

    - `JsonValue type` (constant `"transcript.text.segment"`)

      The type of the event. Always `transcript.text.segment`.

      - `TRANSCRIPT_TEXT_SEGMENT("transcript.text.segment")`

  - `class TranscriptionTextDeltaEvent:`

    Emitted when there is an additional text delta. This is also the first event emitted when the transcription starts. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `stream` parameter set to `true`.

    - `String delta`

      The text delta that was additionally transcribed.

    - `JsonValue type` (constant `"transcript.text.delta"`)

      The type of the event. Always `transcript.text.delta`.

      - `TRANSCRIPT_TEXT_DELTA("transcript.text.delta")`

    - `Optional<List<Logprob>> logprobs`

      The log probabilities of the delta. Only included if you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `include[]` parameter set to `logprobs`.

      - `Optional<String> token`

        The token that was used to generate the log probability.

      - `Optional<List<Long>> bytes`

        The bytes that were used to generate the log probability.

      - `Optional<Double> logprob`

        The log probability of the token.

    - `Optional<String> segmentId`

      Identifier of the diarized segment that this delta belongs to. Only present when using `gpt-4o-transcribe-diarize`.

  - `class TranscriptionTextDoneEvent:`

    Emitted when the transcription is complete. Contains the complete transcription text. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `stream` parameter set to `true`.

    - `String text`

      The text that was transcribed.

    - `JsonValue type` (constant `"transcript.text.done"`)

      The type of the event. Always `transcript.text.done`.

      - `TRANSCRIPT_TEXT_DONE("transcript.text.done")`

    - `Optional<List<Logprob>> logprobs`

      The log probabilities of the individual tokens in the transcription. Only included if you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `include[]` parameter set to `logprobs`.

      - `Optional<String> token`

        The token that was used to generate the log probability.

      - `Optional<List<Long>> bytes`

        The bytes that were used to generate the log probability.

      - `Optional<Double> logprob`

        The log probability of the token.

    - `Optional<Usage> usage`

      Usage statistics for models billed by token usage.

      - `long inputTokens`

        Number of input tokens billed for this request.

      - `long outputTokens`

        Number of output tokens generated.

      - `long totalTokens`

        Total number of tokens used (input + output).

      - `JsonValue type` (constant `"tokens"`)

        The type of the usage object. Always `tokens` for this variant.

        - `TOKENS("tokens")`

      - `Optional<InputTokenDetails> inputTokenDetails`

        Details about the input tokens billed for this request.

        - `Optional<Long> audioTokens`

          Number of audio tokens billed for this request.

        - `Optional<Long> textTokens`

          Number of text tokens billed for this request.
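
The three event variants above suggest a simple consumption pattern: append each `transcript.text.delta` as it arrives and treat the `text` of `transcript.text.done` as the final result. The sketch below models events as a plain type string plus payload so it runs without the SDK. `StreamAccumulator` and its `Event` class are hypothetical stand-ins, not the SDK's `TranscriptionStreamEvent`.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper; not part of the SDK. */
final class StreamAccumulator {
    private StreamAccumulator() {}

    /** Simplified event: the documented type string plus its `delta` or `text` payload. */
    static final class Event {
        final String type;
        final String payload;

        Event(String type, String payload) {
            this.type = type;
            this.payload = payload;
        }
    }

    /** Returns the done event's text if one is seen, otherwise the concatenated deltas. */
    static String collect(List<Event> events) {
        StringBuilder partial = new StringBuilder();
        for (Event e : events) {
            if ("transcript.text.delta".equals(e.type)) {
                partial.append(e.payload);
            } else if ("transcript.text.done".equals(e.type)) {
                return e.payload;
            }
            // Other event types (for example transcript.text.segment) are ignored here.
        }
        return partial.toString();
    }

    public static void main(String[] args) {
        // Illustrative events only.
        List<Event> events = Arrays.asList(
                new Event("transcript.text.delta", "Hel"),
                new Event("transcript.text.delta", "lo"),
                new Event("transcript.text.done", "Hello"));
        System.out.println(collect(events));
    }
}
```

Falling back to the accumulated deltas keeps partial text available if the stream ends before a done event arrives.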

### Transcription Text Delta Event

- `class TranscriptionTextDeltaEvent:`

  Emitted when there is an additional text delta. This is also the first event emitted when the transcription starts. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `stream` parameter set to `true`.

  - `String delta`

    The text delta that was additionally transcribed.

  - `JsonValue type` (constant `"transcript.text.delta"`)

    The type of the event. Always `transcript.text.delta`.

    - `TRANSCRIPT_TEXT_DELTA("transcript.text.delta")`

  - `Optional<List<Logprob>> logprobs`

    The log probabilities of the delta. Only included if you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `include[]` parameter set to `logprobs`.

    - `Optional<String> token`

      The token that was used to generate the log probability.

    - `Optional<List<Long>> bytes`

      The bytes that were used to generate the log probability.

    - `Optional<Double> logprob`

      The log probability of the token.

  - `Optional<String> segmentId`

    Identifier of the diarized segment that this delta belongs to. Only present when using `gpt-4o-transcribe-diarize`.

### Transcription Text Done Event

- `class TranscriptionTextDoneEvent:`

  Emitted when the transcription is complete. Contains the complete transcription text. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `stream` parameter set to `true`.

  - `String text`

    The text that was transcribed.

  - `JsonValue type` (constant `"transcript.text.done"`)

    The type of the event. Always `transcript.text.done`.

    - `TRANSCRIPT_TEXT_DONE("transcript.text.done")`

  - `Optional<List<Logprob>> logprobs`

    The log probabilities of the individual tokens in the transcription. Only included if you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `include[]` parameter set to `logprobs`.

    - `Optional<String> token`

      The token that was used to generate the log probability.

    - `Optional<List<Long>> bytes`

      The bytes that were used to generate the log probability.

    - `Optional<Double> logprob`

      The log probability of the token.

  - `Optional<Usage> usage`

    Usage statistics for models billed by token usage.

    - `long inputTokens`

      Number of input tokens billed for this request.

    - `long outputTokens`

      Number of output tokens generated.

    - `long totalTokens`

      Total number of tokens used (input + output).

    - `JsonValue type` (constant `"tokens"`)

      The type of the usage object. Always `tokens` for this variant.

      - `TOKENS("tokens")`

    - `Optional<InputTokenDetails> inputTokenDetails`

      Details about the input tokens billed for this request.

      - `Optional<Long> audioTokens`

        Number of audio tokens billed for this request.

      - `Optional<Long> textTokens`

        Number of text tokens billed for this request.

### Transcription Text Segment Event

- `class TranscriptionTextSegmentEvent:`

  Emitted when a diarized transcription returns a completed segment with speaker information. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with `stream` set to `true` and `response_format` set to `diarized_json`.

  - `String id`

    Unique identifier for the segment.

  - `double end`

    End timestamp of the segment in seconds.

  - `String speaker`

    Speaker label for this segment.

  - `double start`

    Start timestamp of the segment in seconds.

  - `String text`

    Transcript text for this segment.

  - `JsonValue type` (constant `"transcript.text.segment"`)

    The type of the event. Always `transcript.text.segment`.

    - `TRANSCRIPT_TEXT_SEGMENT("transcript.text.segment")`

### Transcription Verbose

- `class TranscriptionVerbose:`

  Represents a verbose JSON transcription response returned by the model, based on the provided input.

  - `double duration`

    The duration of the input audio.

  - `String language`

    The language of the input audio.

  - `String text`

    The transcribed text.

  - `Optional<List<TranscriptionSegment>> segments`

    Segments of the transcribed text and their corresponding details.

    - `long id`

      Unique identifier of the segment.

    - `double avgLogprob`

      Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.

    - `double compressionRatio`

      Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.

    - `double end`

      End time of the segment in seconds.

    - `double noSpeechProb`

      Probability of no speech in the segment. If the value is higher than 1.0 and the `avg_logprob` is below -1, consider this segment silent.

    - `long seek`

      Seek offset of the segment.

    - `double start`

      Start time of the segment in seconds.

    - `double temperature`

      Temperature parameter used for generating the segment.

    - `String text`

      Text content of the segment.

    - `List<Long> tokens`

      Array of token IDs for the text content.

  - `Optional<Usage> usage`

    Usage statistics for models billed by audio input duration.

    - `double seconds`

      Duration of the input audio in seconds.

    - `JsonValue type` (constant `"duration"`)

      The type of the usage object. Always `duration` for this variant.

      - `DURATION("duration")`

  - `Optional<List<TranscriptionWord>> words`

    Extracted words and their corresponding timestamps.

    - `double end`

      End time of the word in seconds.

    - `double start`

      Start time of the word in seconds.

    - `String word`

      The text content of the word.

### Transcription Word

- `class TranscriptionWord:`

  - `double end`

    End time of the word in seconds.

  - `double start`

    Start time of the word in seconds.

  - `String word`

    The text content of the word.
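
Word entries carry `start` and `end` in seconds, which is enough to build subtitle-style cues on the client. The sketch below formats those values as SRT-style timestamps (`HH:MM:SS,mmm`). `WordTimestamps` is a hypothetical helper operating on plain numbers and strings, not on the SDK's `TranscriptionWord`, and rounding to the nearest millisecond is an assumption made for illustration.

```java
import java.util.Locale;

/** Hypothetical helper; not part of the SDK. */
final class WordTimestamps {
    private WordTimestamps() {}

    /** Formats seconds as an SRT-style timestamp, HH:MM:SS,mmm, rounded to the millisecond. */
    static String srtTime(double seconds) {
        long totalMs = Math.round(seconds * 1000.0);
        long ms = totalMs % 1000;
        long s = (totalMs / 1000) % 60;
        long m = (totalMs / 60000) % 60;
        long h = totalMs / 3600000;
        return String.format(Locale.ROOT, "%02d:%02d:%02d,%03d", h, m, s, ms);
    }

    /** Builds one numbered cue from a word and its start and end times. */
    static String cue(int index, double start, double end, String word) {
        return index + "\n" + srtTime(start) + " --> " + srtTime(end) + "\n" + word + "\n";
    }

    public static void main(String[] args) {
        // Illustrative values only.
        System.out.print(cue(1, 0.25, 0.8, "Hello"));
    }
}
```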