
java/resources/audio/index.md 2026-05-18 22:01 UTC to 2026-05-19 06:34 UTC

0 added, 1480 removed.


This document has no rendered page for this history range.

java/resources/audio/index.md +0 −1480 deleted


~~1# Audio~~

~~3## Domain Types~~

~~5### Audio Model~~

~~7- `enum AudioModel:`~~

~~9 - `WHISPER_1("whisper-1")`~~

~~11 - `GPT_4O_TRANSCRIBE("gpt-4o-transcribe")`~~

~~13 - `GPT_4O_MINI_TRANSCRIBE("gpt-4o-mini-transcribe")`~~

~~15 - `GPT_4O_MINI_TRANSCRIBE_2025_12_15("gpt-4o-mini-transcribe-2025-12-15")`~~

~~17 - `GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize")`~~

~~19### Audio Response Format~~

~~21- `enum AudioResponseFormat:`~~

23 The format of the output, in one of these options: `json`, `text`, `srt`, `verbose_json`, `vtt`, or `diarized_json`. For `gpt-4o-transcribe` and `gpt-4o-mini-transcribe`, the only supported format is `json`. For `gpt-4o-transcribe-diarize`, the supported formats are `json`, `text`, and `diarized_json`, with `diarized_json` required to receive speaker annotations.

~~25 - `JSON("json")`~~

~~27 - `TEXT("text")`~~

~~29 - `SRT("srt")`~~

~~31 - `VERBOSE_JSON("verbose_json")`~~

~~33 - `VTT("vtt")`~~

~~35 - `DIARIZED_JSON("diarized_json")`~~
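The model and format rules described above can be checked client-side before a request is sent. The following is a minimal sketch, not part of the SDK: `ResponseFormatRules` and `isSupported` are hypothetical names. Two rules in it are assumptions that the text above does not state: that `whisper-1` accepts every format except `diarized_json`, and that the dated `gpt-4o-mini-transcribe-2025-12-15` snapshot behaves like `gpt-4o-mini-transcribe`.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of the OpenAI Java SDK.
final class ResponseFormatRules {
    private static final Set<String> JSON_ONLY = Set.of("json");

    // Model ID -> response formats it accepts.
    private static final Map<String, Set<String>> SUPPORTED = Map.of(
            "gpt-4o-transcribe", JSON_ONLY,
            "gpt-4o-mini-transcribe", JSON_ONLY,
            // Assumption: the dated snapshot behaves like its base model.
            "gpt-4o-mini-transcribe-2025-12-15", JSON_ONLY,
            "gpt-4o-transcribe-diarize", Set.of("json", "text", "diarized_json"),
            // Assumption: whisper-1 accepts every non-diarized format.
            "whisper-1", Set.of("json", "text", "srt", "verbose_json", "vtt"));

    private ResponseFormatRules() {}

    /** Returns true if the model is known and accepts the given response format. */
    static boolean isSupported(String model, String responseFormat) {
        Set<String> formats = SUPPORTED.get(model);
        return formats != null && formats.contains(responseFormat);
    }
}
```

A caller could run this check before building the request parameters, so that an unsupported combination fails before any network call is made.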

~~37# Transcriptions~~

~~39## Create transcription~~

~~41`TranscriptionCreateResponse audio().transcriptions().create(TranscriptionCreateParams params, RequestOptions requestOptions = RequestOptions.none())`~~

~~43**post** `/audio/transcriptions`~~

~~45Transcribes audio into the input language.~~

~~47Returns a transcription object in `json`, `diarized_json`, or `verbose_json`~~

~~48format, or a stream of transcript events.~~

~~50### Parameters~~

~~52- `TranscriptionCreateParams params`~~

~~54 - `String file`~~

~~56 The audio file object (not file name) to transcribe, in one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm.~~

~~58 - `AudioModel model`~~

60 ID of the model to use. The options are `gpt-4o-transcribe`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `whisper-1` (which is powered by our open source Whisper V2 model), and `gpt-4o-transcribe-diarize`.

~~62 - `WHISPER_1("whisper-1")`~~

~~64 - `GPT_4O_TRANSCRIBE("gpt-4o-transcribe")`~~

~~66 - `GPT_4O_MINI_TRANSCRIBE("gpt-4o-mini-transcribe")`~~

~~68 - `GPT_4O_MINI_TRANSCRIBE_2025_12_15("gpt-4o-mini-transcribe-2025-12-15")`~~

~~70 - `GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize")`~~

~~72 - `Optional<ChunkingStrategy> chunkingStrategy`~~

74 Controls how the audio is cut into chunks. When set to `"auto"`, the server first normalizes loudness and then uses voice activity detection (VAD) to choose boundaries. A `server_vad` object can be provided to tweak VAD detection parameters manually. If unset, the audio is transcribed as a single block. Required when using `gpt-4o-transcribe-diarize` for inputs longer than 30 seconds.

~~76 - `JsonValue;`~~

~~78 - `AUTO("auto")`~~

~~80 - `class VadConfig:`~~

~~82 - `Type type`~~

~~84 Must be set to `server_vad` to enable manual chunking using server side VAD.~~

~~86 - `SERVER_VAD("server_vad")`~~

~~88 - `Optional<Long> prefixPaddingMs`~~

~~90 Amount of audio to include before the VAD detected speech (in~~

~~91 milliseconds).~~

~~93 - `Optional<Long> silenceDurationMs`~~

~~95 Duration of silence to detect speech stop (in milliseconds).~~

~~96 With shorter values the model will respond more quickly,~~

~~97 but may jump in on short pauses from the user.~~

~~99 - `Optional<Double> threshold`~~

~~100~~

101 Sensitivity threshold (0.0 to 1.0) for voice activity detection. A

102 higher threshold will require louder audio to activate the model, and

103 thus might perform better in noisy environments.

~~104~~

105 - `Optional<List<TranscriptionInclude>> include`

~~106~~

107 Additional information to include in the transcription response.

108 `logprobs` will return the log probabilities of the tokens in the

109 response to understand the model's confidence in the transcription.

110 `logprobs` only works with `response_format` set to `json` and only with

111 the models `gpt-4o-transcribe`, `gpt-4o-mini-transcribe`, and `gpt-4o-mini-transcribe-2025-12-15`. This field is not supported when using `gpt-4o-transcribe-diarize`.

~~112~~

113 - `LOGPROBS("logprobs")`

~~114~~

115 - `Optional<List<String>> knownSpeakerNames`

~~116~~

117 Optional list of speaker names that correspond to the audio samples provided in `known_speaker_references[]`. Each entry should be a short identifier (for example `customer` or `agent`). Up to 4 speakers are supported.

~~118~~

119 - `Optional<List<String>> knownSpeakerReferences`

~~120~~

121 Optional list of audio samples (as [data URLs](https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/Data_URLs)) that contain known speaker references matching `known_speaker_names[]`. Each sample must be between 2 and 10 seconds, and can use any of the same input audio formats supported by `file`.

~~122~~

123 - `Optional<String> language`

~~124~~

125 The language of the input audio. Supplying the input language in [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) (e.g. `en`) format will improve accuracy and latency.

~~126~~

127 - `Optional<String> prompt`

~~128~~

129 An optional text to guide the model's style or continue a previous audio segment. The [prompt](https://platform.openai.com/docs/guides/speech-to-text#prompting) should match the audio language. This field is not supported when using `gpt-4o-transcribe-diarize`.

~~130~~

131 - `Optional<AudioResponseFormat> responseFormat`

~~132~~

133 The format of the output, in one of these options: `json`, `text`, `srt`, `verbose_json`, `vtt`, or `diarized_json`. For `gpt-4o-transcribe` and `gpt-4o-mini-transcribe`, the only supported format is `json`. For `gpt-4o-transcribe-diarize`, the supported formats are `json`, `text`, and `diarized_json`, with `diarized_json` required to receive speaker annotations.

~~134~~

135 - `Optional<Double> temperature`

~~136~~

137 The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use [log probability](https://en.wikipedia.org/wiki/Log_probability) to automatically increase the temperature until certain thresholds are hit.

~~138~~

139 - `Optional<List<TimestampGranularity>> timestampGranularities`

~~140~~

141 The timestamp granularities to populate for this transcription. `response_format` must be set to `verbose_json` to use timestamp granularities. Either or both of these options are supported: `word` and `segment`. Note: There is no additional latency for segment timestamps, but generating word timestamps incurs additional latency.

142 This option is not available for `gpt-4o-transcribe-diarize`.

~~143~~

144 - `WORD("word")`

~~145~~

146 - `SEGMENT("segment")`

~~147~~
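The `knownSpeakerReferences` entries in the parameter list above are data URLs. A minimal sketch of building one from raw audio bytes with the JDK's `Base64` encoder follows. The `SpeakerReferences` class, the `toDataUrl` method, and the `audio/wav` MIME type in the usage note are illustrative assumptions, and the sketch does not check the 2 to 10 second length requirement.

```java
import java.util.Base64;

// Hypothetical helper, not part of the OpenAI Java SDK.
final class SpeakerReferences {
    private SpeakerReferences() {}

    /** Encodes raw audio bytes as a base64 data URL with the given MIME type. */
    static String toDataUrl(String mimeType, byte[] audio) {
        return "data:" + mimeType + ";base64," + Base64.getEncoder().encodeToString(audio);
    }
}
```

For example, `SpeakerReferences.toDataUrl("audio/wav", bytes)` yields a string starting with `data:audio/wav;base64,`, which could then be paired with the matching entry in `knownSpeakerNames`.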

148### Returns

~~149~~

150- `class TranscriptionCreateResponse:` A union class that can be one of several variants.

~~151~~

152 Represents a transcription response returned by the model, based on the provided input.

~~153~~

154 - `class Transcription:`

~~155~~

156 Represents a transcription response returned by the model, based on the provided input.

~~157~~

158 - `String text`

~~159~~

160 The transcribed text.

~~161~~

162 - `Optional<List<Logprob>> logprobs`

~~163~~

164 The log probabilities of the tokens in the transcription. Only returned with the models `gpt-4o-transcribe` and `gpt-4o-mini-transcribe` if `logprobs` is added to the `include` array.

~~165~~

166 - `Optional<String> token`

~~167~~

168 The token in the transcription.

~~169~~

170 - `Optional<List<Double>> bytes`

~~171~~

172 The bytes of the token.

~~173~~

174 - `Optional<Double> logprob`

~~175~~

176 The log probability of the token.

~~177~~

178 - `Optional<Usage> usage`

~~179~~

180 Token usage statistics for the request.

~~181~~

182 - `class Tokens:`

~~183~~

184 Usage statistics for models billed by token usage.

~~185~~

186 - `long inputTokens`

~~187~~

188 Number of input tokens billed for this request.

~~189~~

190 - `long outputTokens`

~~191~~

192 Number of output tokens generated.

~~193~~

194 - `long totalTokens`

~~195~~

196 Total number of tokens used (input + output).

~~197~~

198 - `JsonValue type` (constant `"tokens"`)

~~199~~

200 The type of the usage object. Always `tokens` for this variant.

~~201~~

202 - `TOKENS("tokens")`

~~203~~

204 - `Optional<InputTokenDetails> inputTokenDetails`

~~205~~

206 Details about the input tokens billed for this request.

~~207~~

208 - `Optional<Long> audioTokens`

~~209~~

210 Number of audio tokens billed for this request.

~~211~~

212 - `Optional<Long> textTokens`

~~213~~

214 Number of text tokens billed for this request.

~~215~~

216 - `class Duration:`

~~217~~

218 Usage statistics for models billed by audio input duration.

~~219~~

220 - `double seconds`

~~221~~

222 Duration of the input audio in seconds.

~~223~~

224 - `JsonValue type` (constant `"duration"`)

~~225~~

226 The type of the usage object. Always `duration` for this variant.

~~227~~

228 - `DURATION("duration")`

~~229~~

230 - `class TranscriptionDiarized:`

~~231~~

232 Represents a diarized transcription response returned by the model, including the combined transcript and speaker-segment annotations.

~~233~~

234 - `double duration`

~~235~~

236 Duration of the input audio in seconds.

~~237~~

238 - `List<TranscriptionDiarizedSegment> segments`

~~239~~

240 Segments of the transcript annotated with timestamps and speaker labels.

~~241~~

242 - `String id`

~~243~~

244 Unique identifier for the segment.

~~245~~

246 - `double end`

~~247~~

248 End timestamp of the segment in seconds.

~~249~~

250 - `String speaker`

~~251~~

252 Speaker label for this segment. When known speakers are provided, the label matches `known_speaker_names[]`. Otherwise speakers are labeled sequentially using capital letters (`A`, `B`, ...).

~~253~~

254 - `double start`

~~255~~

256 Start timestamp of the segment in seconds.

~~257~~

258 - `String text`

~~259~~

260 Transcript text for this segment.

~~261~~

262 - `JsonValue type` (constant `"transcript.text.segment"`)

~~263~~

264 The type of the segment. Always `transcript.text.segment`.

~~265~~

266 - `TRANSCRIPT_TEXT_SEGMENT("transcript.text.segment")`

~~267~~

268 - `JsonValue task` (constant `"transcribe"`)

~~269~~

270 The type of task that was run. Always `transcribe`.

~~271~~

272 - `TRANSCRIBE("transcribe")`

~~273~~

274 - `String text`

~~275~~

276 The concatenated transcript text for the entire audio input.

~~277~~

278 - `Optional<Usage> usage`

~~279~~

280 Token or duration usage statistics for the request.

~~281~~

282 - `class Tokens:`

~~283~~

284 Usage statistics for models billed by token usage.

~~285~~

286 - `long inputTokens`

~~287~~

288 Number of input tokens billed for this request.

~~289~~

290 - `long outputTokens`

~~291~~

292 Number of output tokens generated.

~~293~~

294 - `long totalTokens`

~~295~~

296 Total number of tokens used (input + output).

~~297~~

298 - `JsonValue type` (constant `"tokens"`)

~~299~~

300 The type of the usage object. Always `tokens` for this variant.

~~301~~

302 - `TOKENS("tokens")`

~~303~~

304 - `Optional<InputTokenDetails> inputTokenDetails`

~~305~~

306 Details about the input tokens billed for this request.

~~307~~

308 - `Optional<Long> audioTokens`

~~309~~

310 Number of audio tokens billed for this request.

~~311~~

312 - `Optional<Long> textTokens`

~~313~~

314 Number of text tokens billed for this request.

~~315~~

316 - `class Duration:`

~~317~~

318 Usage statistics for models billed by audio input duration.

~~319~~

320 - `double seconds`

~~321~~

322 Duration of the input audio in seconds.

~~323~~

324 - `JsonValue type` (constant `"duration"`)

~~325~~

326 The type of the usage object. Always `duration` for this variant.

~~327~~

328 - `DURATION("duration")`

~~329~~

330 - `class TranscriptionVerbose:`

~~331~~

332 Represents a verbose JSON transcription response returned by the model, based on the provided input.

~~333~~

334 - `double duration`

~~335~~

336 The duration of the input audio.

~~337~~

338 - `String language`

~~339~~

340 The language of the input audio.

~~341~~

342 - `String text`

~~343~~

344 The transcribed text.

~~345~~

346 - `Optional<List<TranscriptionSegment>> segments`

~~347~~

348 Segments of the transcribed text and their corresponding details.

~~349~~

350 - `long id`

~~351~~

352 Unique identifier of the segment.

~~353~~

354 - `double avgLogprob`

~~355~~

356 Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.

~~357~~

358 - `double compressionRatio`

~~359~~

360 Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.

~~361~~

362 - `double end`

~~363~~

364 End time of the segment in seconds.

~~365~~

366 - `double noSpeechProb`

~~367~~

368 Probability of no speech in the segment. If the value is higher than 1.0 and the `avg_logprob` is below -1, consider this segment silent.

~~369~~

370 - `long seek`

~~371~~

372 Seek offset of the segment.

~~373~~

374 - `double start`

~~375~~

376 Start time of the segment in seconds.

~~377~~

378 - `double temperature`

~~379~~

380 Temperature parameter used for generating the segment.

~~381~~

382 - `String text`

~~383~~

384 Text content of the segment.

~~385~~

386 - `List<Long> tokens`

~~387~~

388 Array of token IDs for the text content.

~~389~~

390 - `Optional<Usage> usage`

~~391~~

392 Usage statistics for models billed by audio input duration.

~~393~~

394 - `double seconds`

~~395~~

396 Duration of the input audio in seconds.

~~397~~

398 - `JsonValue type` (constant `"duration"`)

~~399~~

400 The type of the usage object. Always `duration` for this variant.

~~401~~

402 - `DURATION("duration")`

~~403~~

404 - `Optional<List<TranscriptionWord>> words`

~~405~~

406 Extracted words and their corresponding timestamps.

~~407~~

408 - `double end`

~~409~~

410 End time of the word in seconds.

~~411~~

412 - `double start`

~~413~~

414 Start time of the word in seconds.

~~415~~

416 - `String word`

~~417~~

418 The text content of the word.

~~419~~
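The `start` and `end` fields of the words and segments above are offsets in seconds. As an illustration of consuming them client-side, this sketch formats a pair of offsets as an SRT-style cue timing line (`HH:MM:SS,mmm --> HH:MM:SS,mmm`). `SrtTime` is a hypothetical helper, not part of the SDK, and it assumes non-negative offsets.

```java
import java.util.Locale;

// Hypothetical helper, not part of the OpenAI Java SDK.
final class SrtTime {
    private SrtTime() {}

    /** Formats a non-negative offset in seconds as an SRT time code, HH:MM:SS,mmm. */
    static String format(double seconds) {
        long totalMillis = Math.round(seconds * 1000.0);
        long hours = totalMillis / 3_600_000;
        long minutes = (totalMillis / 60_000) % 60;
        long secs = (totalMillis / 1000) % 60;
        long millis = totalMillis % 1000;
        return String.format(Locale.ROOT, "%02d:%02d:%02d,%03d", hours, minutes, secs, millis);
    }

    /** Formats a start/end pair as an SRT cue timing line. */
    static String cueLine(double start, double end) {
        return format(start) + " --> " + format(end);
    }
}
```

Rounding to whole milliseconds before splitting into fields avoids carrying floating-point error into the individual components.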

420### Example

~~421~~

```java
package com.openai.example;

import com.openai.client.OpenAIClient;
import com.openai.client.okhttp.OpenAIOkHttpClient;
import com.openai.models.audio.AudioModel;
import com.openai.models.audio.transcriptions.TranscriptionCreateParams;
import com.openai.models.audio.transcriptions.TranscriptionCreateResponse;
import java.io.ByteArrayInputStream;

public final class Main {
    private Main() {}

    public static void main(String[] args) {
        // Configure the client from environment variables.
        OpenAIClient client = OpenAIOkHttpClient.fromEnv();

        // "Example data" is placeholder content; a real request needs audio
        // bytes in one of the supported formats (flac, mp3, wav, ...).
        TranscriptionCreateParams params = TranscriptionCreateParams.builder()
            .file(new ByteArrayInputStream("Example data".getBytes()))
            .model(AudioModel.GPT_4O_TRANSCRIBE)
            .build();
        TranscriptionCreateResponse transcription = client.audio().transcriptions().create(params);
    }
}
```

~~446~~

447#### Response

~~448~~

```json
{
  "text": "text",
  "logprobs": [
    {
      "token": "token",
      "bytes": [
        0
      ],
      "logprob": 0
    }
  ],
  "usage": {
    "input_tokens": 0,
    "output_tokens": 0,
    "total_tokens": 0,
    "type": "tokens",
    "input_token_details": {
      "audio_tokens": 0,
      "text_tokens": 0
    }
  }
}
```

~~473~~

474## Domain Types

~~475~~

476### Transcription

~~477~~

478- `class Transcription:`

~~479~~

480 Represents a transcription response returned by the model, based on the provided input.

~~481~~

482 - `String text`

~~483~~

484 The transcribed text.

~~485~~

486 - `Optional<List<Logprob>> logprobs`

~~487~~

488 The log probabilities of the tokens in the transcription. Only returned with the models `gpt-4o-transcribe` and `gpt-4o-mini-transcribe` if `logprobs` is added to the `include` array.

~~489~~

490 - `Optional<String> token`

~~491~~

492 The token in the transcription.

~~493~~

494 - `Optional<List<Double>> bytes`

~~495~~

496 The bytes of the token.

~~497~~

498 - `Optional<Double> logprob`

~~499~~

500 The log probability of the token.

~~501~~

502 - `Optional<Usage> usage`

~~503~~

504 Token usage statistics for the request.

~~505~~

506 - `class Tokens:`

~~507~~

508 Usage statistics for models billed by token usage.

~~509~~

510 - `long inputTokens`

~~511~~

512 Number of input tokens billed for this request.

~~513~~

514 - `long outputTokens`

~~515~~

516 Number of output tokens generated.

~~517~~

518 - `long totalTokens`

~~519~~

520 Total number of tokens used (input + output).

~~521~~

522 - `JsonValue type` (constant `"tokens"`)

~~523~~

524 The type of the usage object. Always `tokens` for this variant.

~~525~~

526 - `TOKENS("tokens")`

~~527~~

528 - `Optional<InputTokenDetails> inputTokenDetails`

~~529~~

530 Details about the input tokens billed for this request.

~~531~~

532 - `Optional<Long> audioTokens`

~~533~~

534 Number of audio tokens billed for this request.

~~535~~

536 - `Optional<Long> textTokens`

~~537~~

538 Number of text tokens billed for this request.

~~539~~

540 - `class Duration:`

~~541~~

542 Usage statistics for models billed by audio input duration.

~~543~~

544 - `double seconds`

~~545~~

546 Duration of the input audio in seconds.

~~547~~

548 - `JsonValue type` (constant `"duration"`)

~~549~~

550 The type of the usage object. Always `duration` for this variant.

~~551~~

552 - `DURATION("duration")`

~~553~~
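The `usage` field above is a union of a token-based and a duration-based variant. The sketch below shows one way a caller might branch on the variant. The nested `Usage`, `Tokens`, and `Duration` types are simplified, hypothetical stand-ins that mirror the field names in the listing; they are not the SDK's classes, whose accessors may differ.

```java
// Hypothetical stand-ins for the usage variants; not the SDK's own classes.
final class UsageSummary {
    interface Usage {}

    /** Mirrors the token-billed variant (type "tokens"). */
    static final class Tokens implements Usage {
        final long inputTokens;
        final long outputTokens;
        final long totalTokens;

        Tokens(long inputTokens, long outputTokens) {
            this.inputTokens = inputTokens;
            this.outputTokens = outputTokens;
            this.totalTokens = inputTokens + outputTokens; // total = input + output
        }
    }

    /** Mirrors the duration-billed variant (type "duration"). */
    static final class Duration implements Usage {
        final double seconds;

        Duration(double seconds) {
            this.seconds = seconds;
        }
    }

    private UsageSummary() {}

    /** Returns a one-line summary for either usage variant. */
    static String describe(Usage usage) {
        if (usage instanceof Tokens) {
            Tokens t = (Tokens) usage;
            return "tokens: " + t.totalTokens + " (" + t.inputTokens + " in, " + t.outputTokens + " out)";
        }
        if (usage instanceof Duration) {
            return "duration: " + ((Duration) usage).seconds + "s";
        }
        throw new IllegalArgumentException("unknown usage variant");
    }
}
```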

554### Transcription Diarized

~~555~~

556- `class TranscriptionDiarized:`

~~557~~

558 Represents a diarized transcription response returned by the model, including the combined transcript and speaker-segment annotations.

~~559~~

560 - `double duration`

~~561~~

562 Duration of the input audio in seconds.

~~563~~

564 - `List<TranscriptionDiarizedSegment> segments`

~~565~~

566 Segments of the transcript annotated with timestamps and speaker labels.

~~567~~

568 - `String id`

~~569~~

570 Unique identifier for the segment.

~~571~~

572 - `double end`

~~573~~

574 End timestamp of the segment in seconds.

~~575~~

576 - `String speaker`

~~577~~

578 Speaker label for this segment. When known speakers are provided, the label matches `known_speaker_names[]`. Otherwise speakers are labeled sequentially using capital letters (`A`, `B`, ...).

~~579~~

580 - `double start`

~~581~~

582 Start timestamp of the segment in seconds.

~~583~~

584 - `String text`

~~585~~

586 Transcript text for this segment.

~~587~~

588 - `JsonValue type` (constant `"transcript.text.segment"`)

~~589~~

590 The type of the segment. Always `transcript.text.segment`.

~~591~~

592 - `TRANSCRIPT_TEXT_SEGMENT("transcript.text.segment")`

~~593~~

594 - `JsonValue task` (constant `"transcribe"`)

~~595~~

596 The type of task that was run. Always `transcribe`.

~~597~~

598 - `TRANSCRIBE("transcribe")`

~~599~~

600 - `String text`

~~601~~

602 The concatenated transcript text for the entire audio input.

~~603~~

604 - `Optional<Usage> usage`

~~605~~

606 Token or duration usage statistics for the request.

~~607~~

608 - `class Tokens:`

~~609~~

610 Usage statistics for models billed by token usage.

~~611~~

612 - `long inputTokens`

~~613~~

614 Number of input tokens billed for this request.

~~615~~

616 - `long outputTokens`

~~617~~

618 Number of output tokens generated.

~~619~~

620 - `long totalTokens`

~~621~~

622 Total number of tokens used (input + output).

~~623~~

624 - `JsonValue type` (constant `"tokens"`)

~~625~~

626 The type of the usage object. Always `tokens` for this variant.

~~627~~

628 - `TOKENS("tokens")`

~~629~~

630 - `Optional<InputTokenDetails> inputTokenDetails`

~~631~~

632 Details about the input tokens billed for this request.

~~633~~

634 - `Optional<Long> audioTokens`

~~635~~

636 Number of audio tokens billed for this request.

~~637~~

638 - `Optional<Long> textTokens`

~~639~~

640 Number of text tokens billed for this request.

~~641~~

642 - `class Duration:`

~~643~~

644 Usage statistics for models billed by audio input duration.

~~645~~

646 - `double seconds`

~~647~~

648 Duration of the input audio in seconds.

~~649~~

650 - `JsonValue type` (constant `"duration"`)

~~651~~

652 The type of the usage object. Always `duration` for this variant.

~~653~~

654 - `DURATION("duration")`

~~655~~

656### Transcription Diarized Segment

~~657~~

658- `class TranscriptionDiarizedSegment:`

~~659~~

660 A segment of diarized transcript text with speaker metadata.

~~661~~

662 - `String id`

~~663~~

664 Unique identifier for the segment.

~~665~~

666 - `double end`

~~667~~

668 End timestamp of the segment in seconds.

~~669~~

670 - `String speaker`

~~671~~

672 Speaker label for this segment. When known speakers are provided, the label matches `known_speaker_names[]`. Otherwise speakers are labeled sequentially using capital letters (`A`, `B`, ...).

~~673~~

674 - `double start`

~~675~~

676 Start timestamp of the segment in seconds.

~~677~~

678 - `String text`

~~679~~

680 Transcript text for this segment.

~~681~~

682 - `JsonValue type` (constant `"transcript.text.segment"`)

~~683~~

684 The type of the segment. Always `transcript.text.segment`.

~~685~~

686 - `TRANSCRIPT_TEXT_SEGMENT("transcript.text.segment")`

~~687~~
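Diarized segments carry a `speaker` label and a `text` fragment, so a readable transcript can be built by merging consecutive segments from the same speaker. The sketch below assumes a simplified, hypothetical `Segment` type holding only those two fields; it is not the SDK's `TranscriptionDiarizedSegment` class.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the OpenAI Java SDK.
final class DiarizedTranscript {
    /** Simplified stand-in for a diarized segment (speaker and text only). */
    static final class Segment {
        final String speaker;
        final String text;

        Segment(String speaker, String text) {
            this.speaker = speaker;
            this.text = text;
        }
    }

    private DiarizedTranscript() {}

    /** Merges consecutive segments by the same speaker into "speaker: text" lines. */
    static List<String> toLines(List<Segment> segments) {
        List<String> lines = new ArrayList<>();
        String currentSpeaker = null;
        StringBuilder current = new StringBuilder();
        for (Segment s : segments) {
            if (!s.speaker.equals(currentSpeaker)) {
                // Speaker changed: flush the previous speaker's line, if any.
                if (currentSpeaker != null) {
                    lines.add(currentSpeaker + ": " + current);
                }
                currentSpeaker = s.speaker;
                current = new StringBuilder(s.text.trim());
            } else {
                current.append(' ').append(s.text.trim());
            }
        }
        if (currentSpeaker != null) {
            lines.add(currentSpeaker + ": " + current);
        }
        return lines;
    }
}
```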

688### Transcription Include

~~689~~

690- `enum TranscriptionInclude:`

~~691~~

692 - `LOGPROBS("logprobs")`

~~693~~

694### Transcription Segment

~~695~~

696- `class TranscriptionSegment:`

~~697~~

698 - `long id`

~~699~~

700 Unique identifier of the segment.

~~701~~

702 - `double avgLogprob`

~~703~~

704 Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.

~~705~~

706 - `double compressionRatio`

~~707~~

708 Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.

~~709~~

710 - `double end`

~~711~~

712 End time of the segment in seconds.

~~713~~

714 - `double noSpeechProb`

~~715~~

716 Probability of no speech in the segment. If the value is higher than 1.0 and the `avg_logprob` is below -1, consider this segment silent.

~~717~~

718 - `long seek`

~~719~~

720 Seek offset of the segment.

~~721~~

722 - `double start`

~~723~~

724 Start time of the segment in seconds.

~~725~~

726 - `double temperature`

~~727~~

728 Temperature parameter used for generating the segment.

~~729~~

730 - `String text`

~~731~~

732 Text content of the segment.

~~733~~

734 - `List<Long> tokens`

~~735~~

736 Array of token IDs for the text content.

~~737~~
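The `avgLogprob` and `compressionRatio` descriptions above each give a threshold past which a segment should be treated as failed. A minimal sketch of applying both heuristics follows. `SegmentQuality` and `looksUnreliable` are hypothetical names, not part of the SDK, and the method takes the two raw values rather than an SDK object.

```java
// Hypothetical helper, not part of the OpenAI Java SDK.
final class SegmentQuality {
    private SegmentQuality() {}

    /** Returns true if either documented failure heuristic fires for a segment. */
    static boolean looksUnreliable(double avgLogprob, double compressionRatio) {
        boolean logprobFailed = avgLogprob < -1.0;         // "lower than -1"
        boolean compressionFailed = compressionRatio > 2.4; // "greater than 2.4"
        return logprobFailed || compressionFailed;
    }
}
```

A caller might use this to flag segments for manual review instead of discarding them outright.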

738### Transcription Stream Event

~~739~~

740- `class TranscriptionStreamEvent:` A union class that can be one of several variants.

~~741~~

742 Emitted when a diarized transcription returns a completed segment with speaker information. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with `stream` set to `true` and `response_format` set to `diarized_json`.

~~743~~

744 - `class TranscriptionTextSegmentEvent:`

~~745~~

746 Emitted when a diarized transcription returns a completed segment with speaker information. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with `stream` set to `true` and `response_format` set to `diarized_json`.

~~747~~

748 - `String id`

~~749~~

750 Unique identifier for the segment.

~~751~~

752 - `double end`

~~753~~

754 End timestamp of the segment in seconds.

~~755~~

756 - `String speaker`

~~757~~

758 Speaker label for this segment.

~~759~~

760 - `double start`

~~761~~

762 Start timestamp of the segment in seconds.

~~763~~

764 - `String text`

~~765~~

766 Transcript text for this segment.

~~767~~

768 - `JsonValue type` (constant `"transcript.text.segment"`)

~~769~~

770 The type of the event. Always `transcript.text.segment`.

~~771~~

772 - `TRANSCRIPT_TEXT_SEGMENT("transcript.text.segment")`

~~773~~

774 - `class TranscriptionTextDeltaEvent:`

~~775~~

776 Emitted when there is an additional text delta. This is also the first event emitted when the transcription starts. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `stream` parameter set to `true`.

~~777~~

778 - `String delta`

~~779~~

780 The text delta that was additionally transcribed.

~~781~~

782 - `JsonValue type` (constant `"transcript.text.delta"`)

~~783~~

784 The type of the event. Always `transcript.text.delta`.

~~785~~

786 - `TRANSCRIPT_TEXT_DELTA("transcript.text.delta")`

~~787~~

788 - `Optional<List<Logprob>> logprobs`

~~789~~

790 The log probabilities of the delta. Only included if you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `include[]` parameter set to `logprobs`.

~~791~~

792 - `Optional<String> token`

~~793~~

794 The token that was used to generate the log probability.

~~795~~

796 - `Optional<List<Long>> bytes`

~~797~~

798 The bytes that were used to generate the log probability.

~~799~~

800 - `Optional<Double> logprob`

~~801~~

802 The log probability of the token.

~~803~~

804 - `Optional<String> segmentId`

~~805~~

806 Identifier of the diarized segment that this delta belongs to. Only present when using `gpt-4o-transcribe-diarize`.

~~807~~

808 - `class TranscriptionTextDoneEvent:`

~~809~~

810 Emitted when the transcription is complete. Contains the complete transcription text. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `stream` parameter set to `true`.

~~811~~

812 - `String text`

~~813~~

814 The text that was transcribed.

~~815~~

816 - `JsonValue type` (constant `"transcript.text.done"`)

~~817~~

818 The type of the event. Always `transcript.text.done`.

~~819~~

820 - `TRANSCRIPT_TEXT_DONE("transcript.text.done")`

~~821~~

822 - `Optional<List<Logprob>> logprobs`

~~823~~

824 The log probabilities of the individual tokens in the transcription. Only included if you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `include[]` parameter set to `logprobs`.

~~825~~

826 - `Optional<String> token`

~~827~~

828 The token that was used to generate the log probability.

~~829~~

830 - `Optional<List<Long>> bytes`

~~831~~

832 The bytes that were used to generate the log probability.

~~833~~

834 - `Optional<Double> logprob`

~~835~~

836 The log probability of the token.

~~837~~

838 - `Optional<Usage> usage`

~~839~~

840 Usage statistics for models billed by token usage.

~~841~~

842 - `long inputTokens`

~~843~~

844 Number of input tokens billed for this request.

~~845~~

846 - `long outputTokens`

~~847~~

848 Number of output tokens generated.

~~849~~

850 - `long totalTokens`

~~851~~

852 Total number of tokens used (input + output).

~~853~~

854 - `JsonValue type` (constant `"tokens"`)

~~855~~

856 The type of the usage object. Always `tokens` for this variant.

~~857~~

858 - `TOKENS("tokens")`

~~859~~

860 - `Optional<InputTokenDetails> inputTokenDetails`

~~861~~

862 Details about the input tokens billed for this request.

~~863~~

864 - `Optional<Long> audioTokens`

~~865~~

866 Number of audio tokens billed for this request.

~~867~~

868 - `Optional<Long> textTokens`

~~869~~

870 Number of text tokens billed for this request.

~~871~~
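The three stream event variants above are distinguished by their `type` constant. The sketch below shows one way a consumer might assemble the transcript: append each `transcript.text.delta`, and prefer the complete text carried by `transcript.text.done` once it arrives. The `Event` type is a simplified, hypothetical stand-in with a `type` and a single string payload; it is not the SDK's `TranscriptionStreamEvent`.

```java
import java.util.List;

// Hypothetical helper, not part of the OpenAI Java SDK.
final class TranscriptAssembler {
    /** Simplified event: the `type` discriminator plus its `delta` or `text` payload. */
    static final class Event {
        final String type;
        final String payload;

        Event(String type, String payload) {
            this.type = type;
            this.payload = payload;
        }
    }

    private TranscriptAssembler() {}

    /** Returns the final text if a done event is seen, else the deltas received so far. */
    static String assemble(List<Event> events) {
        StringBuilder partial = new StringBuilder();
        for (Event e : events) {
            switch (e.type) {
                case "transcript.text.delta":
                    partial.append(e.payload);
                    break;
                case "transcript.text.done":
                    return e.payload; // the done event carries the complete text
                default:
                    break; // e.g. transcript.text.segment is ignored in this sketch
            }
        }
        return partial.toString();
    }
}
```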

872### Transcription Text Delta Event

~~873~~

874- `class TranscriptionTextDeltaEvent:`

~~875~~

876 Emitted when there is an additional text delta. This is also the first event emitted when the transcription starts. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `stream` parameter set to `true`.

~~877~~

878 - `String delta`

~~879~~

880 The text delta that was additionally transcribed.

~~881~~

882 - `JsonValue type` (constant `"transcript.text.delta"`)

~~883~~

884 The type of the event. Always `transcript.text.delta`.

~~885~~

886 - `TRANSCRIPT_TEXT_DELTA("transcript.text.delta")`

~~887~~

888 - `Optional<List<Logprob>> logprobs`

~~889~~

890 The log probabilities of the delta. Only included if you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `include[]` parameter set to `logprobs`.

~~891~~

892 - `Optional<String> token`

~~893~~

894 The token that was used to generate the log probability.

~~895~~

896 - `Optional<List<Long>> bytes`

~~897~~

898 The bytes that were used to generate the log probability.

~~899~~

900 - `Optional<Double> logprob`

~~901~~

902 The log probability of the token.

~~903~~

904 - `Optional<String> segmentId`

~~905~~

906 Identifier of the diarized segment that this delta belongs to. Only present when using `gpt-4o-transcribe-diarize`.

~~907~~

908### Transcription Text Done Event

~~909~~

910- `class TranscriptionTextDoneEvent:`

~~911~~

912 Emitted when the transcription is complete. Contains the complete transcription text. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `stream` parameter set to `true`.

~~913~~

914 - `String text`

~~915~~

916 The text that was transcribed.

~~917~~

918 - `JsonValue type` (constant `"transcript.text.done"`)

~~919~~

920 The type of the event. Always `transcript.text.done`.

~~921~~

922 - `TRANSCRIPT_TEXT_DONE("transcript.text.done")`

~~923~~

924 - `Optional<List<Logprob>> logprobs`

~~925~~

926 The log probabilities of the individual tokens in the transcription. Only included if you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `include[]` parameter set to `logprobs`.

~~927~~

928 - `Optional<String> token`

~~929~~

930 The token that was used to generate the log probability.

~~931~~

932 - `Optional<List<Long>> bytes`

~~933~~

934 The bytes that were used to generate the log probability.

~~935~~

936 - `Optional<Double> logprob`

~~937~~

938 The log probability of the token.

~~939~~

940 - `Optional<Usage> usage`

~~941~~

942 Usage statistics for models billed by token usage.

~~943~~

944 - `long inputTokens`

~~945~~

946 Number of input tokens billed for this request.

~~947~~

948 - `long outputTokens`

~~949~~

950 Number of output tokens generated.

~~951~~

952 - `long totalTokens`

~~953~~

954 Total number of tokens used (input + output).

~~955~~

956 - `JsonValue type` (constant `"tokens"`)

~~957~~

958 The type of the usage object. Always `tokens` for this variant.

~~959~~

960 - `TOKENS("tokens")`

~~961~~

962 - `Optional<InputTokenDetails> inputTokenDetails`

~~963~~

964 Details about the input tokens billed for this request.

~~965~~

966 - `Optional<Long> audioTokens`

~~967~~

968 Number of audio tokens billed for this request.

~~969~~

970 - `Optional<Long> textTokens`

~~971~~

972 Number of text tokens billed for this request.

~~973~~

974### Transcription Text Segment Event

~~975~~

976- `class TranscriptionTextSegmentEvent:`

~~977~~

978 Emitted when a diarized transcription returns a completed segment with speaker information. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with `stream` set to `true` and `response_format` set to `diarized_json`.

~~979~~

980 - `String id`

~~981~~

982 Unique identifier for the segment.

~~983~~

984 - `double end`

~~985~~

986 End timestamp of the segment in seconds.

~~987~~

988 - `String speaker`

~~989~~

990 Speaker label for this segment.

~~991~~

992 - `double start`

~~993~~

994 Start timestamp of the segment in seconds.

~~995~~

996 - `String text`

~~997~~

998 Transcript text for this segment.

~~999~~

1000 - `JsonValue; type "transcript.text.segment"` (constant)

~~1001~~

1002 The type of the event. Always `transcript.text.segment`.

~~1003~~

1004 - `TRANSCRIPT_TEXT_SEGMENT("transcript.text.segment")`
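Because each segment event carries `speaker`, `start`, and `end`, a consumer can aggregate talk time per speaker as events arrive. The sketch below uses a hypothetical `Segment` stand-in for those fields rather than the SDK class, so it only illustrates the aggregation step.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the SDK.
final class SpeakerTalkTimeSketch {
    private SpeakerTalkTimeSketch() {}

    /** Stand-in for the speaker/start/end/text fields of a segment event. */
    static final class Segment {
        final String speaker;
        final double start;
        final double end;
        final String text;

        Segment(String speaker, double start, double end, String text) {
            this.speaker = speaker;
            this.start = start;
            this.end = end;
            this.text = text;
        }
    }

    /** Sums (end - start) per speaker label, keeping first-seen speaker order. */
    static Map<String, Double> talkTimeBySpeaker(List<Segment> segments) {
        Map<String, Double> totals = new LinkedHashMap<>();
        for (Segment s : segments) {
            totals.merge(s.speaker, s.end - s.start, Double::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Illustrative data; the speaker labels are made up for this sketch.
        List<Segment> segments = List.of(
                new Segment("A", 0.0, 2.0, "Hello there."),
                new Segment("B", 2.0, 3.5, "Hi."),
                new Segment("A", 3.5, 5.0, "How are you?"));
        talkTimeBySpeaker(segments)
                .forEach((speaker, seconds) -> System.out.println(speaker + ": " + seconds + " s"));
    }
}
```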

~~1005~~

1006### Transcription Verbose

~~1007~~

1008- `class TranscriptionVerbose:`

~~1009~~

1010 Represents a verbose JSON transcription response returned by the model, based on the provided input.

~~1011~~

1012 - `double duration`

~~1013~~

1014 The duration of the input audio.

~~1015~~

1016 - `String language`

~~1017~~

1018 The language of the input audio.

~~1019~~

1020 - `String text`

~~1021~~

1022 The transcribed text.

~~1023~~

1024 - `Optional<List<TranscriptionSegment>> segments`

~~1025~~

1026 Segments of the transcribed text and their corresponding details.

~~1027~~

1028 - `long id`

~~1029~~

1030 Unique identifier of the segment.

~~1031~~

1032 - `double avgLogprob`

~~1033~~

1034 Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.

~~1035~~

1036 - `double compressionRatio`

~~1037~~

1038 Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.

~~1039~~

1040 - `double end`

~~1041~~

1042 End time of the segment in seconds.

~~1043~~

1044 - `double noSpeechProb`

~~1045~~

1046 Probability of no speech in the segment. If the value is higher than 1.0 and the `avg_logprob` is below -1, consider this segment silent.

~~1047~~

1048 - `long seek`

~~1049~~

1050 Seek offset of the segment.

~~1051~~

1052 - `double start`

~~1053~~

1054 Start time of the segment in seconds.

~~1055~~

1056 - `double temperature`

~~1057~~

1058 Temperature parameter used for generating the segment.

~~1059~~

1060 - `String text`

~~1061~~

1062 Text content of the segment.

~~1063~~

1064 - `List<Long> tokens`

~~1065~~

1066 Array of token IDs for the text content.

~~1067~~

1068 - `Optional<Usage> usage`

~~1069~~

1070 Usage statistics for models billed by audio input duration.

~~1071~~

1072 - `double seconds`

~~1073~~

1074 Duration of the input audio in seconds.

~~1075~~

1076 - `JsonValue; type "duration"` (constant)

~~1077~~

1078 The type of the usage object. Always `duration` for this variant.

~~1079~~

1080 - `DURATION("duration")`

~~1081~~

1082 - `Optional<List<TranscriptionWord>> words`

~~1083~~

1084 Extracted words and their corresponding timestamps.

~~1085~~

1086 - `double end`

~~1087~~

1088 End time of the word in seconds.

~~1089~~

1090 - `double start`

~~1091~~

1092 Start time of the word in seconds.

~~1093~~

1094 - `String word`

~~1095~~

1096 The text content of the word.
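The segment fields above come with two documented rules of thumb: an `avgLogprob` below -1 and a `compressionRatio` above 2.4 each indicate a failed segment. A small sketch of applying those two thresholds is shown here; the class and method names are hypothetical, and the inputs are plain doubles rather than SDK objects.

```java
// Hypothetical helper, not part of the SDK: applies the two documented thresholds.
final class SegmentQualitySketch {
    private SegmentQualitySketch() {}

    // Thresholds quoted from the field descriptions above.
    static final double MIN_AVG_LOGPROB = -1.0;
    static final double MAX_COMPRESSION_RATIO = 2.4;

    /** True when the average logprob is lower than -1. */
    static boolean logprobsFailed(double avgLogprob) {
        return avgLogprob < MIN_AVG_LOGPROB;
    }

    /** True when the compression ratio is greater than 2.4. */
    static boolean compressionFailed(double compressionRatio) {
        return compressionRatio > MAX_COMPRESSION_RATIO;
    }

    /** A segment passes only when neither check fails. */
    static boolean looksReliable(double avgLogprob, double compressionRatio) {
        return !logprobsFailed(avgLogprob) && !compressionFailed(compressionRatio);
    }

    public static void main(String[] args) {
        // Illustrative values only.
        System.out.println(looksReliable(-0.3, 1.5));
        System.out.println(looksReliable(-1.4, 1.5));
        System.out.println(looksReliable(-0.3, 3.1));
    }
}
```

What to do with a segment that fails (drop it, flag it, or re-run the request) is an application decision that this reference does not prescribe.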

~~1097~~

1098### Transcription Word

~~1099~~

1100- `class TranscriptionWord:`

~~1101~~

1102 - `double end`

~~1103~~

1104 End time of the word in seconds.

~~1105~~

1106 - `double start`

~~1107~~

1108 Start time of the word in seconds.

~~1109~~

1110 - `String word`

~~1111~~

1112 The text content of the word.
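Word-level `start` and `end` values are given in seconds as doubles, so rendering them as subtitle cues needs a conversion step. The following sketch formats a timed piece of text as an SRT-style cue; it is an illustration with hypothetical names, not how the API produces its own `srt` output.

```java
import java.util.Locale;

// Hypothetical helper, not part of the SDK.
final class SrtCueSketch {
    private SrtCueSketch() {}

    /** Formats seconds as HH:MM:SS,mmm, rounding to the nearest millisecond. */
    static String timestamp(double seconds) {
        long totalMillis = Math.round(seconds * 1000.0);
        long hours = totalMillis / 3_600_000;
        long minutes = (totalMillis / 60_000) % 60;
        long secs = (totalMillis / 1000) % 60;
        long millis = totalMillis % 1000;
        return String.format(Locale.ROOT, "%02d:%02d:%02d,%03d", hours, minutes, secs, millis);
    }

    /** Builds one cue: index line, time range line, text line. */
    static String cue(int index, double start, double end, String text) {
        return index + "\n" + timestamp(start) + " --> " + timestamp(end) + "\n" + text + "\n";
    }

    public static void main(String[] args) {
        // Illustrative values standing in for a word's start, end, and text.
        System.out.print(cue(1, 0.0, 1.25, "Hello"));
    }
}
```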

~~1113~~

1114# Translations

~~1115~~

1116## Create translation

~~1117~~

1118`TranslationCreateResponse audio().translations().create(TranslationCreateParams params, RequestOptions requestOptions = RequestOptions.none())`

~~1119~~

1120**post** `/audio/translations`

~~1121~~

1122Translates audio into English.

~~1123~~

1124### Parameters

~~1125~~

1126- `TranslationCreateParams params`

~~1127~~

1128 - `String file`

~~1129~~

1130 The audio file object (not file name) to translate, in one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm.

~~1131~~

1132 - `AudioModel model`

~~1133~~

1134 ID of the model to use. Only `whisper-1` (which is powered by our open source Whisper V2 model) is currently available.

~~1135~~

1136 - `WHISPER_1("whisper-1")`

~~1137~~

1138 - `GPT_4O_TRANSCRIBE("gpt-4o-transcribe")`

~~1139~~

1140 - `GPT_4O_MINI_TRANSCRIBE("gpt-4o-mini-transcribe")`

~~1141~~

1142 - `GPT_4O_MINI_TRANSCRIBE_2025_12_15("gpt-4o-mini-transcribe-2025-12-15")`

~~1143~~

1144 - `GPT_4O_TRANSCRIBE_DIARIZE("gpt-4o-transcribe-diarize")`

~~1145~~

1146 - `Optional<String> prompt`

~~1147~~

1148 An optional text to guide the model's style or continue a previous audio segment. The [prompt](https://platform.openai.com/docs/guides/speech-to-text#prompting) should be in English.

~~1149~~

1150 - `Optional<ResponseFormat> responseFormat`

~~1151~~

1152 The format of the output, in one of these options: `json`, `text`, `srt`, `verbose_json`, or `vtt`.

~~1153~~

1154 - `JSON("json")`

~~1155~~

1156 - `TEXT("text")`

~~1157~~

1158 - `SRT("srt")`

~~1159~~

1160 - `VERBOSE_JSON("verbose_json")`

~~1161~~

1162 - `VTT("vtt")`

~~1163~~

1164 - `Optional<Double> temperature`

~~1165~~

1166 The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use [log probability](https://en.wikipedia.org/wiki/Log_probability) to automatically increase the temperature until certain thresholds are hit.

~~1167~~

1168### Returns

~~1169~~

1170- `class TranslationCreateResponse:` A class that can be one of several variants (union).

~~1171~~

1172 - `class Translation:`

~~1173~~

1174 - `String text`

~~1175~~

1176 - `class TranslationVerbose:`

~~1177~~

1178 - `double duration`

~~1179~~

1180 The duration of the input audio.

~~1181~~

1182 - `String language`

~~1183~~

1184 The language of the output translation (always `english`).

~~1185~~

1186 - `String text`

~~1187~~

1188 The translated text.

~~1189~~

1190 - `Optional<List<TranscriptionSegment>> segments`

~~1191~~

1192 Segments of the translated text and their corresponding details.

~~1193~~

1194 - `long id`

~~1195~~

1196 Unique identifier of the segment.

~~1197~~

1198 - `double avgLogprob`

~~1199~~

1200 Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.

~~1201~~

1202 - `double compressionRatio`

~~1203~~

1204 Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.

~~1205~~

1206 - `double end`

~~1207~~

1208 End time of the segment in seconds.

~~1209~~

1210 - `double noSpeechProb`

~~1211~~

1212 Probability of no speech in the segment. If the value is higher than 1.0 and the `avg_logprob` is below -1, consider this segment silent.

~~1213~~

1214 - `long seek`

~~1215~~

1216 Seek offset of the segment.

~~1217~~

1218 - `double start`

~~1219~~

1220 Start time of the segment in seconds.

~~1221~~

1222 - `double temperature`

~~1223~~

1224 Temperature parameter used for generating the segment.

~~1225~~

1226 - `String text`

~~1227~~

1228 Text content of the segment.

~~1229~~

1230 - `List<Long> tokens`

~~1231~~

1232 Array of token IDs for the text content.
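Since the return type is a union of `Translation` and `TranslationVerbose`, calling code has to branch on which variant it received. The sketch below shows that branching with hypothetical stand-in classes (`Plain` and `Verbose`); the accessors the SDK offers on `TranslationCreateResponse` are not listed in this reference, so none are assumed here.

```java
import java.util.Locale;

// Hypothetical stand-ins for the two documented variants; not SDK classes.
final class TranslationVariantSketch {
    private TranslationVariantSketch() {}

    interface Result {
        String text();
    }

    /** Mirrors `Translation`: text only. */
    static final class Plain implements Result {
        private final String text;

        Plain(String text) {
            this.text = text;
        }

        @Override
        public String text() {
            return text;
        }
    }

    /** Mirrors the scalar fields of `TranslationVerbose`. */
    static final class Verbose implements Result {
        final double duration;
        final String language;
        private final String text;

        Verbose(double duration, String language, String text) {
            this.duration = duration;
            this.language = language;
            this.text = text;
        }

        @Override
        public String text() {
            return text;
        }
    }

    /** Adds language and duration when the verbose variant is present. */
    static String describe(Result result) {
        if (result instanceof Verbose) {
            Verbose v = (Verbose) result;
            return String.format(Locale.ROOT, "[%s, %.1fs] %s", v.language, v.duration, v.text());
        }
        return result.text();
    }

    public static void main(String[] args) {
        System.out.println(describe(new Plain("Hello")));
        System.out.println(describe(new Verbose(2.5, "english", "Hello")));
    }
}
```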

~~1233~~

1234### Example

~~1235~~

```java
package com.openai.example;

import com.openai.client.OpenAIClient;
import com.openai.client.okhttp.OpenAIOkHttpClient;
import com.openai.models.audio.AudioModel;
import com.openai.models.audio.translations.TranslationCreateParams;
import com.openai.models.audio.translations.TranslationCreateResponse;
import java.io.ByteArrayInputStream;

public final class Main {
    private Main() {}

    public static void main(String[] args) {
        // Builds a client configured from environment variables.
        OpenAIClient client = OpenAIOkHttpClient.fromEnv();

        TranslationCreateParams params = TranslationCreateParams.builder()
            // Placeholder bytes; pass the contents of a real audio file here.
            .file(new ByteArrayInputStream("Example data".getBytes()))
            .model(AudioModel.WHISPER_1)
            .build();
        // Sends the request and returns one of the variants listed under Returns.
        TranslationCreateResponse translation = client.audio().translations().create(params);
    }
}
```

~~1260~~

1261#### Response

~~1262~~

```json
{
  "text": "text"
}
```

~~1268~~

1269## Domain Types

~~1270~~

1271### Translation

~~1272~~

1273- `class Translation:`

~~1274~~

1275 - `String text`

~~1276~~

1277### Translation Verbose

~~1278~~

1279- `class TranslationVerbose:`

~~1280~~

1281 - `double duration`

~~1282~~

1283 The duration of the input audio.

~~1284~~

1285 - `String language`

~~1286~~

1287 The language of the output translation (always `english`).

~~1288~~

1289 - `String text`

~~1290~~

1291 The translated text.

~~1292~~

1293 - `Optional<List<TranscriptionSegment>> segments`

~~1294~~

1295 Segments of the translated text and their corresponding details.

~~1296~~

1297 - `long id`

~~1298~~

1299 Unique identifier of the segment.

~~1300~~

1301 - `double avgLogprob`

~~1302~~

1303 Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.

~~1304~~

1305 - `double compressionRatio`

~~1306~~

1307 Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.

~~1308~~

1309 - `double end`

~~1310~~

1311 End time of the segment in seconds.

~~1312~~

1313 - `double noSpeechProb`

~~1314~~

1315 Probability of no speech in the segment. If the value is higher than 1.0 and the `avg_logprob` is below -1, consider this segment silent.

~~1316~~

1317 - `long seek`

~~1318~~

1319 Seek offset of the segment.

~~1320~~

1321 - `double start`

~~1322~~

1323 Start time of the segment in seconds.

~~1324~~

1325 - `double temperature`

~~1326~~

1327 Temperature parameter used for generating the segment.

~~1328~~

1329 - `String text`

~~1330~~

1331 Text content of the segment.

~~1332~~

1333 - `List<Long> tokens`

~~1334~~

1335 Array of token IDs for the text content.

~~1336~~

1337# Speech

~~1338~~

1339## Create speech

~~1340~~

1341`HttpResponse audio().speech().create(SpeechCreateParams params, RequestOptions requestOptions = RequestOptions.none())`

~~1342~~

1343**post** `/audio/speech`

~~1344~~

1345Generates audio from the input text.

~~1346~~

1347Returns the audio file content, or a stream of audio events.

~~1348~~

1349### Parameters

~~1350~~

1351- `SpeechCreateParams params`

~~1352~~

1353 - `String input`

~~1354~~

1355 The text to generate audio for. The maximum length is 4096 characters.

~~1356~~

1357 - `SpeechModel model`

~~1358~~

1359 One of the available [TTS models](https://platform.openai.com/docs/models#tts): `tts-1`, `tts-1-hd`, `gpt-4o-mini-tts`, or `gpt-4o-mini-tts-2025-12-15`.

~~1360~~

1361 - `TTS_1("tts-1")`

~~1362~~

1363 - `TTS_1_HD("tts-1-hd")`

~~1364~~

1365 - `GPT_4O_MINI_TTS("gpt-4o-mini-tts")`

~~1366~~

1367 - `GPT_4O_MINI_TTS_2025_12_15("gpt-4o-mini-tts-2025-12-15")`

~~1368~~

1369 - `Voice voice`

~~1370~~

1371 The voice to use when generating the audio. Supported built-in voices are `alloy`, `ash`, `ballad`, `coral`, `echo`, `fable`, `onyx`, `nova`, `sage`, `shimmer`, `verse`, `marin`, and `cedar`. You may also provide a custom voice object with an `id`, for example `{ "id": "voice_1234" }`. Previews of the voices are available in the [Text to speech guide](https://platform.openai.com/docs/guides/text-to-speech#voice-options).

~~1372~~

1373 - `String`

~~1374~~

1375 - `enum UnionMember1:`

~~1376~~

1377 - `ALLOY("alloy")`

~~1378~~

1379 - `ASH("ash")`

~~1380~~

1381 - `BALLAD("ballad")`

~~1382~~

1383 - `CORAL("coral")`

~~1384~~

1385 - `ECHO("echo")`

~~1386~~

1387 - `SAGE("sage")`

~~1388~~

1389 - `SHIMMER("shimmer")`

~~1390~~

1391 - `VERSE("verse")`

~~1392~~

1393 - `MARIN("marin")`

~~1394~~

1395 - `CEDAR("cedar")`

~~1396~~

1397 - `class Id:`

~~1398~~

1399 Custom voice reference.

~~1400~~

1401 - `String id`

~~1402~~

1403 The custom voice ID, e.g. `voice_1234`.

~~1404~~

1405 - `Optional<String> instructions`

~~1406~~

1407 Control the voice of your generated audio with additional instructions. Does not work with `tts-1` or `tts-1-hd`.

~~1408~~

1409 - `Optional<ResponseFormat> responseFormat`

~~1410~~

1411 The format to return the audio in. Supported formats are `mp3`, `opus`, `aac`, `flac`, `wav`, and `pcm`.

~~1412~~

1413 - `MP3("mp3")`

~~1414~~

1415 - `OPUS("opus")`

~~1416~~

1417 - `AAC("aac")`

~~1418~~

1419 - `FLAC("flac")`

~~1420~~

1421 - `WAV("wav")`

~~1422~~

1423 - `PCM("pcm")`

~~1424~~

1425 - `Optional<Double> speed`

~~1426~~

1427 The speed of the generated audio. Select a value from `0.25` to `4.0`. `1.0` is the default.

~~1428~~

1429 - `Optional<StreamFormat> streamFormat`

~~1430~~

1431 The format to stream the audio in. Supported formats are `sse` and `audio`. `sse` is not supported for `tts-1` or `tts-1-hd`.

~~1432~~

1433 - `SSE("sse")`

~~1434~~

1435 - `AUDIO("audio")`
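Several of the parameters above have constraints that can be checked before a request is sent: the 4096-character input limit, the `0.25` to `4.0` speed range, and the features that do not work with `tts-1` or `tts-1-hd`. The sketch below is a hypothetical client-side pre-check over plain values; it is not part of the SDK, and the server remains the authority on what is accepted.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical pre-flight check, not part of the SDK.
final class SpeechParamsCheckSketch {
    private SpeechParamsCheckSketch() {}

    static final int MAX_INPUT_CHARS = 4096;

    /**
     * Returns human-readable problems; an empty list means no documented rule was violated.
     * `instructions` and `streamFormat` may be null when the caller does not set them.
     */
    static List<String> problems(
            String input, String model, double speed, String instructions, String streamFormat) {
        List<String> out = new ArrayList<>();
        boolean legacyModel = model.equals("tts-1") || model.equals("tts-1-hd");
        // Assumption: length() counts UTF-16 code units, which may differ from the API's notion of a character.
        if (input.length() > MAX_INPUT_CHARS) {
            out.add("input exceeds " + MAX_INPUT_CHARS + " characters");
        }
        if (speed < 0.25 || speed > 4.0) {
            out.add("speed must be between 0.25 and 4.0");
        }
        if (legacyModel && instructions != null) {
            out.add("instructions do not work with " + model);
        }
        if (legacyModel && "sse".equals(streamFormat)) {
            out.add("sse streaming is not supported for " + model);
        }
        return out;
    }

    public static void main(String[] args) {
        // Three rules are violated here: speed, instructions, and sse with tts-1.
        problems("Hello", "tts-1", 5.0, "Speak calmly.", "sse").forEach(System.out::println);
    }
}
```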

~~1436~~

1437### Example

~~1438~~

```java
package com.openai.example;

import com.openai.client.OpenAIClient;
import com.openai.client.okhttp.OpenAIOkHttpClient;
import com.openai.core.http.HttpResponse;
import com.openai.models.audio.speech.SpeechCreateParams;
import com.openai.models.audio.speech.SpeechModel;

public final class Main {
    private Main() {}

    public static void main(String[] args) {
        // Builds a client configured from environment variables.
        OpenAIClient client = OpenAIOkHttpClient.fromEnv();

        SpeechCreateParams params = SpeechCreateParams.builder()
            .input("input")
            .model(SpeechModel.TTS_1)
            // Placeholder; replace with a supported voice name such as "alloy".
            .voice("string")
            .build();
        // The response carries the generated audio content.
        HttpResponse speech = client.audio().speech().create(params);
    }
}
```
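The example above stops once the response is received. A minimal sketch of persisting the audio follows, assuming the response body can be obtained as an `InputStream` (how the SDK exposes it is not shown in this reference); the class name, method name, and file names are hypothetical, and a byte array stands in for the real body.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of the SDK.
final class SaveAudioSketch {
    private SaveAudioSketch() {}

    /** Copies the stream to the target file, closes the stream, and returns the bytes written. */
    static long save(InputStream audio, Path target) {
        try (InputStream in = audio) {
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in bytes; in real use the stream would come from the speech response.
        Path target = Files.createTempFile("speech-", ".mp3");
        long written = save(new ByteArrayInputStream(new byte[] {1, 2, 3}), target);
        System.out.println("wrote " + written + " bytes to " + target);
    }
}
```

Streaming straight to disk avoids holding the whole audio file in memory, which matters more for long inputs or uncompressed formats such as `wav` and `pcm`.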

~~1463~~

1464## Domain Types

~~1465~~

1466### Speech Model

~~1467~~

1468- `enum SpeechModel:`

~~1469~~

1470 - `TTS_1("tts-1")`

~~1471~~

1472 - `TTS_1_HD("tts-1-hd")`

~~1473~~

1474 - `GPT_4O_MINI_TTS("gpt-4o-mini-tts")`

~~1475~~

1476 - `GPT_4O_MINI_TTS_2025_12_15("gpt-4o-mini-tts-2025-12-15")`

~~1477~~

1478# Voices

~~1479~~

1480# Voice Consents