
ruby/resources/audio/index.md 2026-05-18 22:01 UTC to 2026-05-19 06:34 UTC

0 added, 1806 removed.

This document has no rendered page for this history range.

ruby/resources/audio/index.md +0 −1806 deleted

~~1# Audio~~

~~3## Domain Types~~

~~5### Audio Model~~

~~7- `AudioModel = :"whisper-1" | :"gpt-4o-transcribe" | :"gpt-4o-mini-transcribe" | 2 more`~~

~~9 - `:"whisper-1"`~~

~~11 - `:"gpt-4o-transcribe"`~~

~~13 - `:"gpt-4o-mini-transcribe"`~~

~~15 - `:"gpt-4o-mini-transcribe-2025-12-15"`~~

~~17 - `:"gpt-4o-transcribe-diarize"`~~

~~19### Audio Response Format~~

~~21- `AudioResponseFormat = :json | :text | :srt | 3 more`~~

23 The format of the output, in one of these options: `json`, `text`, `srt`, `verbose_json`, `vtt`, or `diarized_json`. For `gpt-4o-transcribe` and `gpt-4o-mini-transcribe`, the only supported format is `json`. For `gpt-4o-transcribe-diarize`, the supported formats are `json`, `text`, and `diarized_json`, with `diarized_json` required to receive speaker annotations.

~~25 - `:json`~~

~~27 - `:text`~~

~~29 - `:srt`~~

~~31 - `:verbose_json`~~

~~33 - `:vtt`~~

~~35 - `:diarized_json`~~
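The format restrictions described above can be captured in a small lookup. This is a minimal sketch rather than part of the SDK: `SUPPORTED_FORMATS` and `format_supported?` are hypothetical names, and only the rules stated in the description are encoded. Models whose accepted formats are not spelled out (for example `whisper-1`) yield `nil` instead of a guess.

```ruby
# Hypothetical lookup: encodes only the rules stated in the description above.
SUPPORTED_FORMATS = {
  "gpt-4o-transcribe" => %i[json],
  "gpt-4o-mini-transcribe" => %i[json],
  "gpt-4o-transcribe-diarize" => %i[json text diarized_json]
}.freeze

# Returns true or false for models with a documented rule, and nil when the
# description does not say which formats the model accepts.
def format_supported?(model, response_format)
  formats = SUPPORTED_FORMATS[model.to_s]
  return nil if formats.nil?

  formats.include?(response_format.to_sym)
end

puts format_supported?(:"gpt-4o-transcribe", :json)
puts format_supported?(:"gpt-4o-transcribe-diarize", :srt)
p format_supported?(:"whisper-1", :vtt)
```

A check like this could run before the request is sent, so that an unsupported combination fails locally instead of at the API.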

~~37# Transcriptions~~

~~39## Create transcription~~

~~41`audio.transcriptions.create(**kwargs) -> TranscriptionCreateResponse`~~

~~43**post** `/audio/transcriptions`~~

~~45Transcribes audio into the input language.~~

~~47Returns a transcription object in `json`, `diarized_json`, or `verbose_json`~~

~~48format, or a stream of transcript events.~~

~~50### Parameters~~

~~52- `file: String`~~

~~54 The audio file object (not file name) to transcribe, in one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm.~~

~~56- `model: String | AudioModel`~~

58 ID of the model to use. The options are `gpt-4o-transcribe`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `whisper-1` (which is powered by our open source Whisper V2 model), and `gpt-4o-transcribe-diarize`.

~~60 - `String = String`~~

~~62 - `AudioModel = :"whisper-1" | :"gpt-4o-transcribe" | :"gpt-4o-mini-transcribe" | 2 more`~~

~~64 - `:"whisper-1"`~~

~~66 - `:"gpt-4o-transcribe"`~~

~~68 - `:"gpt-4o-mini-transcribe"`~~

~~70 - `:"gpt-4o-mini-transcribe-2025-12-15"`~~

~~72 - `:"gpt-4o-transcribe-diarize"`~~

~~74- `chunking_strategy: :auto | VadConfig{ type, prefix_padding_ms, silence_duration_ms, threshold}`~~

76 Controls how the audio is cut into chunks. When set to `"auto"`, the server first normalizes loudness and then uses voice activity detection (VAD) to choose boundaries. A `server_vad` object can be provided instead to tune the VAD parameters manually. If unset, the audio is transcribed as a single block. Required when using `gpt-4o-transcribe-diarize` for inputs longer than 30 seconds.

~~78 - `ChunkingStrategy = :auto`~~

~~80 Automatically set chunking parameters based on the audio. Must be set to `"auto"`.~~

~~82 - `:auto`~~

~~84 - `class VadConfig`~~

~~86 - `type: :server_vad`~~

~~88 Must be set to `server_vad` to enable manual chunking using server side VAD.~~

~~90 - `:server_vad`~~

~~92 - `prefix_padding_ms: Integer`~~

~~94 Amount of audio to include before the VAD detected speech (in~~

~~95 milliseconds).~~

~~97 - `silence_duration_ms: Integer`~~

~~99 Duration of silence to detect speech stop (in milliseconds).~~

100 With shorter values the model will respond more quickly,

101 but may jump in on short pauses from the user.

~~102~~

103 - `threshold: Float`

~~104~~

105 Sensitivity threshold (0.0 to 1.0) for voice activity detection. A

106 higher threshold will require louder audio to activate the model, and

107 thus might perform better in noisy environments.

~~108~~

109- `include: Array[TranscriptionInclude]`

~~110~~

111 Additional information to include in the transcription response.

112 `logprobs` will return the log probabilities of the tokens in the

113 response to understand the model's confidence in the transcription.

114 `logprobs` only works with `response_format` set to `json` and only with

115 the models `gpt-4o-transcribe`, `gpt-4o-mini-transcribe`, and `gpt-4o-mini-transcribe-2025-12-15`. This field is not supported when using `gpt-4o-transcribe-diarize`.

~~116~~

117 - `:logprobs`

~~118~~

119- `known_speaker_names: Array[String]`

~~120~~

121 Optional list of speaker names that correspond to the audio samples provided in `known_speaker_references[]`. Each entry should be a short identifier (for example `customer` or `agent`). Up to 4 speakers are supported.

~~122~~

123- `known_speaker_references: Array[String]`

~~124~~

125 Optional list of audio samples (as [data URLs](https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/Data_URLs)) that contain known speaker references matching `known_speaker_names[]`. Each sample must be between 2 and 10 seconds, and can use any of the same input audio formats supported by `file`.

~~126~~

127- `language: String`

~~128~~

129 The language of the input audio. Supplying the input language in [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) (e.g. `en`) format will improve accuracy and latency.

~~130~~

131- `prompt: String`

~~132~~

133 An optional text to guide the model's style or continue a previous audio segment. The [prompt](https://platform.openai.com/docs/guides/speech-to-text#prompting) should match the audio language. This field is not supported when using `gpt-4o-transcribe-diarize`.

~~134~~

135- `response_format: AudioResponseFormat`

~~136~~

137 The format of the output, in one of these options: `json`, `text`, `srt`, `verbose_json`, `vtt`, or `diarized_json`. For `gpt-4o-transcribe` and `gpt-4o-mini-transcribe`, the only supported format is `json`. For `gpt-4o-transcribe-diarize`, the supported formats are `json`, `text`, and `diarized_json`, with `diarized_json` required to receive speaker annotations.

~~138~~

139 - `:json`

~~140~~

141 - `:text`

~~142~~

143 - `:srt`

~~144~~

145 - `:verbose_json`

~~146~~

147 - `:vtt`

~~148~~

149 - `:diarized_json`

~~150~~

151- `stream: bool`

~~152~~

153 If set to true, the model response data will be streamed to the client

154 as it is generated using [server-sent events](https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events/Using_server-sent_events#Event_stream_format).

155 See the [Streaming section of the Speech-to-Text guide](https://platform.openai.com/docs/guides/speech-to-text?lang=curl#streaming-transcriptions)

156 for more information.

~~157~~

158 Note: Streaming is not supported for the `whisper-1` model and will be ignored.

~~159~~

160- `temperature: Float`

~~161~~

162 The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use [log probability](https://en.wikipedia.org/wiki/Log_probability) to automatically increase the temperature until certain thresholds are hit.

~~163~~

164- `timestamp_granularities: Array[:word | :segment]`

~~165~~

166 The timestamp granularities to populate for this transcription. `response_format` must be set to `verbose_json` to use timestamp granularities. Either or both of these options are supported: `word` and `segment`. Note: There is no additional latency for segment timestamps, but generating word timestamps incurs additional latency.

167 This option is not available for `gpt-4o-transcribe-diarize`.

~~168~~

169 - `:word`

~~170~~

171 - `:segment`

~~172~~
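The diarization-related parameters above interact: speaker names pair up with data-URL samples, at most 4 speakers are supported, and `diarized_json` is needed for speaker annotations. The sketch below assembles such keyword arguments under those stated rules. `audio_data_url` and `diarization_params` are hypothetical helpers, not SDK methods, and the `audio/wav` MIME type is an assumption for illustration.

```ruby
# Hypothetical helper: wraps raw audio bytes in a data URL, the encoding
# that `known_speaker_references` expects. The MIME type is an assumption.
def audio_data_url(bytes, mime_type: "audio/wav")
  # pack("m0") is Base64 encoding without line breaks
  "data:#{mime_type};base64,#{[bytes].pack('m0')}"
end

# Hypothetical helper: builds keyword arguments for a diarized request.
# `speakers` maps a short speaker name to that speaker's raw sample bytes.
def diarization_params(speakers)
  raise ArgumentError, "up to 4 speakers are supported" if speakers.size > 4

  {
    model: :"gpt-4o-transcribe-diarize",
    response_format: :diarized_json, # required to receive speaker annotations
    chunking_strategy: :auto,        # required for inputs longer than 30 seconds
    known_speaker_names: speakers.keys.map(&:to_s),
    known_speaker_references: speakers.values.map { |bytes| audio_data_url(bytes) }
  }
end

# Placeholder strings stand in for real audio sample bytes.
params = diarization_params("agent" => "fake-wav-bytes", "customer" => "more-bytes")
puts params[:known_speaker_names].inspect
puts params[:known_speaker_references].first
```

Since the method signature is `create(**kwargs)`, the resulting hash could presumably be splatted into the call alongside `file:`. The sketch does not check the 2 to 10 second sample length, which would require decoding the audio.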

173### Returns

~~174~~

175- `TranscriptionCreateResponse = Transcription | TranscriptionDiarized | TranscriptionVerbose`

~~176~~

177 Represents a transcription response returned by the model, based on the provided input.

~~178~~

179 - `class Transcription`

~~180~~

181 Represents a transcription response returned by the model, based on the provided input.

~~182~~

183 - `text: String`

~~184~~

185 The transcribed text.

~~186~~

187 - `logprobs: Array[Logprob{ token, bytes, logprob}]`

~~188~~

189 The log probabilities of the tokens in the transcription. Only returned with the models `gpt-4o-transcribe` and `gpt-4o-mini-transcribe` if `logprobs` is added to the `include` array.

~~190~~

191 - `token: String`

~~192~~

193 The token in the transcription.

~~194~~

195 - `bytes: Array[Float]`

~~196~~

197 The bytes of the token.

~~198~~

199 - `logprob: Float`

~~200~~

201 The log probability of the token.

~~202~~

203 - `usage: Tokens{ input_tokens, output_tokens, total_tokens, 2 more} | Duration{ seconds, type}`

~~204~~

205 Token usage statistics for the request.

~~206~~

207 - `class Tokens`

~~208~~

209 Usage statistics for models billed by token usage.

~~210~~

211 - `input_tokens: Integer`

~~212~~

213 Number of input tokens billed for this request.

~~214~~

215 - `output_tokens: Integer`

~~216~~

217 Number of output tokens generated.

~~218~~

219 - `total_tokens: Integer`

~~220~~

221 Total number of tokens used (input + output).

~~222~~

223 - `type: :tokens`

~~224~~

225 The type of the usage object. Always `tokens` for this variant.

~~226~~

227 - `:tokens`

~~228~~

229 - `input_token_details: InputTokenDetails{ audio_tokens, text_tokens}`

~~230~~

231 Details about the input tokens billed for this request.

~~232~~

233 - `audio_tokens: Integer`

~~234~~

235 Number of audio tokens billed for this request.

~~236~~

237 - `text_tokens: Integer`

~~238~~

239 Number of text tokens billed for this request.

~~240~~

241 - `class Duration`

~~242~~

243 Usage statistics for models billed by audio input duration.

~~244~~

245 - `seconds: Float`

~~246~~

247 Duration of the input audio in seconds.

~~248~~

249 - `type: :duration`

~~250~~

251 The type of the usage object. Always `duration` for this variant.

~~252~~

253 - `:duration`

~~254~~

255 - `class TranscriptionDiarized`

~~256~~

257 Represents a diarized transcription response returned by the model, including the combined transcript and speaker-segment annotations.

~~258~~

259 - `duration: Float`

~~260~~

261 Duration of the input audio in seconds.

~~262~~

263 - `segments: Array[TranscriptionDiarizedSegment]`

~~264~~

265 Segments of the transcript annotated with timestamps and speaker labels.

~~266~~

267 - `id: String`

~~268~~

269 Unique identifier for the segment.

~~270~~

271 - `end_: Float`

~~272~~

273 End timestamp of the segment in seconds.

~~274~~

275 - `speaker: String`

~~276~~

277 Speaker label for this segment. When known speakers are provided, the label matches `known_speaker_names[]`. Otherwise speakers are labeled sequentially using capital letters (`A`, `B`, ...).

~~278~~

279 - `start: Float`

~~280~~

281 Start timestamp of the segment in seconds.

~~282~~

283 - `text: String`

~~284~~

285 Transcript text for this segment.

~~286~~

287 - `type: :"transcript.text.segment"`

~~288~~

289 The type of the segment. Always `transcript.text.segment`.

~~290~~

291 - `:"transcript.text.segment"`

~~292~~

293 - `task: :transcribe`

~~294~~

295 The type of task that was run. Always `transcribe`.

~~296~~

297 - `:transcribe`

~~298~~

299 - `text: String`

~~300~~

301 The concatenated transcript text for the entire audio input.

~~302~~

303 - `usage: Tokens{ input_tokens, output_tokens, total_tokens, 2 more} | Duration{ seconds, type}`

~~304~~

305 Token or duration usage statistics for the request.

~~306~~

307 - `class Tokens`

~~308~~

309 Usage statistics for models billed by token usage.

~~310~~

311 - `input_tokens: Integer`

~~312~~

313 Number of input tokens billed for this request.

~~314~~

315 - `output_tokens: Integer`

~~316~~

317 Number of output tokens generated.

~~318~~

319 - `total_tokens: Integer`

~~320~~

321 Total number of tokens used (input + output).

~~322~~

323 - `type: :tokens`

~~324~~

325 The type of the usage object. Always `tokens` for this variant.

~~326~~

327 - `:tokens`

~~328~~

329 - `input_token_details: InputTokenDetails{ audio_tokens, text_tokens}`

~~330~~

331 Details about the input tokens billed for this request.

~~332~~

333 - `audio_tokens: Integer`

~~334~~

335 Number of audio tokens billed for this request.

~~336~~

337 - `text_tokens: Integer`

~~338~~

339 Number of text tokens billed for this request.

~~340~~

341 - `class Duration`

~~342~~

343 Usage statistics for models billed by audio input duration.

~~344~~

345 - `seconds: Float`

~~346~~

347 Duration of the input audio in seconds.

~~348~~

349 - `type: :duration`

~~350~~

351 The type of the usage object. Always `duration` for this variant.

~~352~~

353 - `:duration`

~~354~~

355 - `class TranscriptionVerbose`

~~356~~

357 Represents a `verbose_json` transcription response returned by the model, based on the provided input.

~~358~~

359 - `duration: Float`

~~360~~

361 The duration of the input audio.

~~362~~

363 - `language: String`

~~364~~

365 The language of the input audio.

~~366~~

367 - `text: String`

~~368~~

369 The transcribed text.

~~370~~

371 - `segments: Array[TranscriptionSegment]`

~~372~~

373 Segments of the transcribed text and their corresponding details.

~~374~~

375 - `id: Integer`

~~376~~

377 Unique identifier of the segment.

~~378~~

379 - `avg_logprob: Float`

~~380~~

381 Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.

~~382~~

383 - `compression_ratio: Float`

~~384~~

385 Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.

~~386~~

387 - `end_: Float`

~~388~~

389 End time of the segment in seconds.

~~390~~

391 - `no_speech_prob: Float`

~~392~~

393 Probability of no speech in the segment. If the value is higher than 1.0 and the `avg_logprob` is below -1, consider this segment silent.

~~394~~

395 - `seek: Integer`

~~396~~

397 Seek offset of the segment.

~~398~~

399 - `start: Float`

~~400~~

401 Start time of the segment in seconds.

~~402~~

403 - `temperature: Float`

~~404~~

405 Temperature parameter used for generating the segment.

~~406~~

407 - `text: String`

~~408~~

409 Text content of the segment.

~~410~~

411 - `tokens: Array[Integer]`

~~412~~

413 Array of token IDs for the text content.

~~414~~

415 - `usage: Usage{ seconds, type}`

~~416~~

417 Usage statistics for models billed by audio input duration.

~~418~~

419 - `seconds: Float`

~~420~~

421 Duration of the input audio in seconds.

~~422~~

423 - `type: :duration`

~~424~~

425 The type of the usage object. Always `duration` for this variant.

~~426~~

427 - `:duration`

~~428~~

429 - `words: Array[TranscriptionWord]`

~~430~~

431 Extracted words and their corresponding timestamps.

~~432~~

433 - `end_: Float`

~~434~~

435 End time of the word in seconds.

~~436~~

437 - `start: Float`

~~438~~

439 Start time of the word in seconds.

~~440~~

441 - `word: String`

~~442~~

443 The text content of the word.

~~444~~

445### Example

~~446~~

```ruby
require "openai"

openai = OpenAI::Client.new(api_key: "My API Key")

transcription = openai.audio.transcriptions.create(file: StringIO.new("Example data"), model: :"gpt-4o-transcribe")

puts(transcription)
```

~~456~~

457#### Response

~~458~~

```json
{
  "text": "text",
  "logprobs": [
    {
      "token": "token",
      "bytes": [
        0
      ],
      "logprob": 0
    }
  ],
  "usage": {
    "input_tokens": 0,
    "output_tokens": 0,
    "total_tokens": 0,
    "type": "tokens",
    "input_token_details": {
      "audio_tokens": 0,
      "text_tokens": 0
    }
  }
}
```

~~483~~
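The `usage` object in a response is either the `tokens` variant or the `duration` variant, distinguished by its `type` field. As a rough illustration, the sketch below parses a raw JSON body and branches on that field. `usage_summary` is a hypothetical helper, the numbers are made up, and working on parsed JSON hashes (rather than SDK model objects) is an assumption made to keep the example self-contained.

```ruby
require "json"

# Hypothetical helper: summarizes the `usage` object of a parsed response.
def usage_summary(usage)
  case usage["type"]
  when "tokens"
    "#{usage['total_tokens']} tokens (#{usage['input_tokens']} in, #{usage['output_tokens']} out)"
  when "duration"
    "#{usage['seconds']} seconds of audio"
  else
    "unknown usage type: #{usage['type'].inspect}"
  end
end

# Illustrative body only; the values are invented.
body = '{"text": "text", "usage": {"type": "tokens", "input_tokens": 14, "output_tokens": 45, "total_tokens": 59}}'
puts usage_summary(JSON.parse(body)["usage"])
puts usage_summary({ "type" => "duration", "seconds" => 12.5 })
```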

484## Domain Types

~~485~~

486### Transcription

~~487~~

488- `class Transcription`

~~489~~

490 Represents a transcription response returned by the model, based on the provided input.

~~491~~

492 - `text: String`

~~493~~

494 The transcribed text.

~~495~~

496 - `logprobs: Array[Logprob{ token, bytes, logprob}]`

~~497~~

498 The log probabilities of the tokens in the transcription. Only returned with the models `gpt-4o-transcribe` and `gpt-4o-mini-transcribe` if `logprobs` is added to the `include` array.

~~499~~

500 - `token: String`

~~501~~

502 The token in the transcription.

~~503~~

504 - `bytes: Array[Float]`

~~505~~

506 The bytes of the token.

~~507~~

508 - `logprob: Float`

~~509~~

510 The log probability of the token.

~~511~~

512 - `usage: Tokens{ input_tokens, output_tokens, total_tokens, 2 more} | Duration{ seconds, type}`

~~513~~

514 Token usage statistics for the request.

~~515~~

516 - `class Tokens`

~~517~~

518 Usage statistics for models billed by token usage.

~~519~~

520 - `input_tokens: Integer`

~~521~~

522 Number of input tokens billed for this request.

~~523~~

524 - `output_tokens: Integer`

~~525~~

526 Number of output tokens generated.

~~527~~

528 - `total_tokens: Integer`

~~529~~

530 Total number of tokens used (input + output).

~~531~~

532 - `type: :tokens`

~~533~~

534 The type of the usage object. Always `tokens` for this variant.

~~535~~

536 - `:tokens`

~~537~~

538 - `input_token_details: InputTokenDetails{ audio_tokens, text_tokens}`

~~539~~

540 Details about the input tokens billed for this request.

~~541~~

542 - `audio_tokens: Integer`

~~543~~

544 Number of audio tokens billed for this request.

~~545~~

546 - `text_tokens: Integer`

~~547~~

548 Number of text tokens billed for this request.

~~549~~

550 - `class Duration`

~~551~~

552 Usage statistics for models billed by audio input duration.

~~553~~

554 - `seconds: Float`

~~555~~

556 Duration of the input audio in seconds.

~~557~~

558 - `type: :duration`

~~559~~

560 The type of the usage object. Always `duration` for this variant.

~~561~~

562 - `:duration`

~~563~~
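Each `logprobs` entry carries a natural-log probability, so exponentiating it gives the linear probability the model assigned to that token. The following sketch shows one way to turn these values into a rough confidence signal. `token_confidences` and `mean_probability` are hypothetical helpers, and plain hashes stand in for the SDK's `Logprob` objects as an assumption.

```ruby
# Hypothetical helper: converts each token's log probability into a linear
# probability between 0 and 1.
def token_confidences(logprobs)
  logprobs.map { |lp| { token: lp[:token], probability: Math.exp(lp[:logprob]) } }
end

# Hypothetical helper: the mean per-token probability, or nil for no tokens.
def mean_probability(logprobs)
  return nil if logprobs.empty?

  logprobs.sum { |lp| Math.exp(lp[:logprob]) } / logprobs.size
end

# Made-up sample values for illustration.
sample = [
  { token: "Hello", logprob: -0.05 },
  { token: " world", logprob: -1.2 }
]
token_confidences(sample).each do |c|
  puts format("%-10s %.3f", c[:token].inspect, c[:probability])
end
```

Averaging linear probabilities is only one possible aggregate; averaging the log probabilities themselves is a common alternative.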

564### Transcription Diarized

~~565~~

566- `class TranscriptionDiarized`

~~567~~

568 Represents a diarized transcription response returned by the model, including the combined transcript and speaker-segment annotations.

~~569~~

570 - `duration: Float`

~~571~~

572 Duration of the input audio in seconds.

~~573~~

574 - `segments: Array[TranscriptionDiarizedSegment]`

~~575~~

576 Segments of the transcript annotated with timestamps and speaker labels.

~~577~~

578 - `id: String`

~~579~~

580 Unique identifier for the segment.

~~581~~

582 - `end_: Float`

~~583~~

584 End timestamp of the segment in seconds.

~~585~~

586 - `speaker: String`

~~587~~

588 Speaker label for this segment. When known speakers are provided, the label matches `known_speaker_names[]`. Otherwise speakers are labeled sequentially using capital letters (`A`, `B`, ...).

~~589~~

590 - `start: Float`

~~591~~

592 Start timestamp of the segment in seconds.

~~593~~

594 - `text: String`

~~595~~

596 Transcript text for this segment.

~~597~~

598 - `type: :"transcript.text.segment"`

~~599~~

600 The type of the segment. Always `transcript.text.segment`.

~~601~~

602 - `:"transcript.text.segment"`

~~603~~

604 - `task: :transcribe`

~~605~~

606 The type of task that was run. Always `transcribe`.

~~607~~

608 - `:transcribe`

~~609~~

610 - `text: String`

~~611~~

612 The concatenated transcript text for the entire audio input.

~~613~~

614 - `usage: Tokens{ input_tokens, output_tokens, total_tokens, 2 more} | Duration{ seconds, type}`

~~615~~

616 Token or duration usage statistics for the request.

~~617~~

618 - `class Tokens`

~~619~~

620 Usage statistics for models billed by token usage.

~~621~~

622 - `input_tokens: Integer`

~~623~~

624 Number of input tokens billed for this request.

~~625~~

626 - `output_tokens: Integer`

~~627~~

628 Number of output tokens generated.

~~629~~

630 - `total_tokens: Integer`

~~631~~

632 Total number of tokens used (input + output).

~~633~~

634 - `type: :tokens`

~~635~~

636 The type of the usage object. Always `tokens` for this variant.

~~637~~

638 - `:tokens`

~~639~~

640 - `input_token_details: InputTokenDetails{ audio_tokens, text_tokens}`

~~641~~

642 Details about the input tokens billed for this request.

~~643~~

644 - `audio_tokens: Integer`

~~645~~

646 Number of audio tokens billed for this request.

~~647~~

648 - `text_tokens: Integer`

~~649~~

650 Number of text tokens billed for this request.

~~651~~

652 - `class Duration`

~~653~~

654 Usage statistics for models billed by audio input duration.

~~655~~

656 - `seconds: Float`

~~657~~

658 Duration of the input audio in seconds.

~~659~~

660 - `type: :duration`

~~661~~

662 The type of the usage object. Always `duration` for this variant.

~~663~~

664 - `:duration`

~~665~~

666### Transcription Diarized Segment

~~667~~

668- `class TranscriptionDiarizedSegment`

~~669~~

670 A segment of diarized transcript text with speaker metadata.

~~671~~

672 - `id: String`

~~673~~

674 Unique identifier for the segment.

~~675~~

676 - `end_: Float`

~~677~~

678 End timestamp of the segment in seconds.

~~679~~

680 - `speaker: String`

~~681~~

682 Speaker label for this segment. When known speakers are provided, the label matches `known_speaker_names[]`. Otherwise speakers are labeled sequentially using capital letters (`A`, `B`, ...).

~~683~~

684 - `start: Float`

~~685~~

686 Start timestamp of the segment in seconds.

~~687~~

688 - `text: String`

~~689~~

690 Transcript text for this segment.

~~691~~

692 - `type: :"transcript.text.segment"`

~~693~~

694 The type of the segment. Always `transcript.text.segment`.

~~695~~

696 - `:"transcript.text.segment"`

~~697~~
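Because every diarized segment carries `speaker`, `start`, `end_`, and `text`, a transcript can be regrouped per speaker with ordinary collection methods. The sketch below is illustrative only: `speaking_time` and `render_dialogue` are hypothetical helpers, and hashes keyed by the documented field names stand in for `TranscriptionDiarizedSegment` objects.

```ruby
# Hypothetical helper: total seconds spoken per speaker label.
def speaking_time(segments)
  segments
    .group_by { |s| s[:speaker] }
    .transform_values { |segs| segs.sum { |s| s[:end_] - s[:start] }.round(2) }
end

# Hypothetical helper: renders segments as a simple timestamped dialogue.
def render_dialogue(segments)
  segments.map { |s| "[#{format('%.1f', s[:start])}s] #{s[:speaker]}: #{s[:text]}" }.join("\n")
end

# Made-up segments; labels follow the sequential `A`, `B` convention.
segments = [
  { speaker: "A", start: 0.0, end_: 2.5, text: "Hi there." },
  { speaker: "B", start: 2.5, end_: 4.0, text: "Hello!" },
  { speaker: "A", start: 4.0, end_: 5.5, text: "How are you?" }
]
puts render_dialogue(segments)
p speaking_time(segments)
```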

698### Transcription Include

~~699~~

700- `TranscriptionInclude = :logprobs`

~~701~~

702 - `:logprobs`

~~703~~

704### Transcription Segment

~~705~~

706- `class TranscriptionSegment`

~~707~~

708 - `id: Integer`

~~709~~

710 Unique identifier of the segment.

~~711~~

712 - `avg_logprob: Float`

~~713~~

714 Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.

~~715~~

716 - `compression_ratio: Float`

~~717~~

718 Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.

~~719~~

720 - `end_: Float`

~~721~~

722 End time of the segment in seconds.

~~723~~

724 - `no_speech_prob: Float`

~~725~~

726 Probability of no speech in the segment. If the value is higher than 1.0 and the `avg_logprob` is below -1, consider this segment silent.

~~727~~

728 - `seek: Integer`

~~729~~

730 Seek offset of the segment.

~~731~~

732 - `start: Float`

~~733~~

734 Start time of the segment in seconds.

~~735~~

736 - `temperature: Float`

~~737~~

738 Temperature parameter used for generating the segment.

~~739~~

740 - `text: String`

~~741~~

742 Text content of the segment.

~~743~~

744 - `tokens: Array[Integer]`

~~745~~

746 Array of token IDs for the text content.

~~747~~
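The field descriptions above give two concrete thresholds: an `avg_logprob` below -1 and a `compression_ratio` above 2.4 both indicate a failed segment. A minimal sketch of applying them follows. `segment_issues` and the constant names are hypothetical, and hashes stand in for `TranscriptionSegment` objects.

```ruby
# Thresholds taken from the field descriptions above.
AVG_LOGPROB_FLOOR = -1.0
COMPRESSION_RATIO_CEILING = 2.4

# Hypothetical helper: lists which documented checks a segment fails.
def segment_issues(segment)
  issues = []
  issues << :low_avg_logprob if segment[:avg_logprob] < AVG_LOGPROB_FLOOR
  issues << :high_compression_ratio if segment[:compression_ratio] > COMPRESSION_RATIO_CEILING
  issues
end

# Made-up segments for illustration.
segments = [
  { id: 0, avg_logprob: -0.3, compression_ratio: 1.4, text: "Clear speech." },
  { id: 1, avg_logprob: -1.6, compression_ratio: 3.1, text: "la la la la la la" }
]
segments.each do |segment|
  issues = segment_issues(segment)
  puts "segment #{segment[:id]}: #{issues.empty? ? 'ok' : issues.join(', ')}"
end
```

Flagged segments might be dropped or sent for review, depending on the application.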

748### Transcription Stream Event

~~749~~

750- `TranscriptionStreamEvent = TranscriptionTextSegmentEvent | TranscriptionTextDeltaEvent | TranscriptionTextDoneEvent`

~~751~~

752 Emitted when a diarized transcription returns a completed segment with speaker information. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with `stream` set to `true` and `response_format` set to `diarized_json`.

~~753~~

754 - `class TranscriptionTextSegmentEvent`

~~755~~

756 Emitted when a diarized transcription returns a completed segment with speaker information. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with `stream` set to `true` and `response_format` set to `diarized_json`.

~~757~~

758 - `id: String`

~~759~~

760 Unique identifier for the segment.

~~761~~

762 - `end_: Float`

~~763~~

764 End timestamp of the segment in seconds.

~~765~~

766 - `speaker: String`

~~767~~

768 Speaker label for this segment.

~~769~~

770 - `start: Float`

~~771~~

772 Start timestamp of the segment in seconds.

~~773~~

774 - `text: String`

~~775~~

776 Transcript text for this segment.

~~777~~

778 - `type: :"transcript.text.segment"`

~~779~~

780 The type of the event. Always `transcript.text.segment`.

~~781~~

782 - `:"transcript.text.segment"`

~~783~~

784 - `class TranscriptionTextDeltaEvent`

~~785~~

786 Emitted when there is an additional text delta. This is also the first event emitted when the transcription starts. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `stream` parameter set to `true`.

~~787~~

788 - `delta: String`

~~789~~

790 The text delta that was additionally transcribed.

~~791~~

792 - `type: :"transcript.text.delta"`

~~793~~

794 The type of the event. Always `transcript.text.delta`.

~~795~~

796 - `:"transcript.text.delta"`

~~797~~

798 - `logprobs: Array[Logprob{ token, bytes, logprob}]`

~~799~~

800 The log probabilities of the delta. Only included if you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `include[]` parameter set to `logprobs`.

~~801~~

802 - `token: String`

~~803~~

804 The token that was used to generate the log probability.

~~805~~

806 - `bytes: Array[Integer]`

~~807~~

808 The bytes that were used to generate the log probability.

~~809~~

810 - `logprob: Float`

~~811~~

812 The log probability of the token.

~~813~~

814 - `segment_id: String`

~~815~~

816 Identifier of the diarized segment that this delta belongs to. Only present when using `gpt-4o-transcribe-diarize`.

~~817~~

818 - `class TranscriptionTextDoneEvent`

~~819~~

820 Emitted when the transcription is complete. Contains the complete transcription text. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `stream` parameter set to `true`.

~~821~~

822 - `text: String`

~~823~~

824 The text that was transcribed.

~~825~~

826 - `type: :"transcript.text.done"`

~~827~~

828 The type of the event. Always `transcript.text.done`.

~~829~~

830 - `:"transcript.text.done"`

~~831~~

832 - `logprobs: Array[Logprob{ token, bytes, logprob}]`

~~833~~

834 The log probabilities of the individual tokens in the transcription. Only included if you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `include[]` parameter set to `logprobs`.

~~835~~

836 - `token: String`

~~837~~

838 The token that was used to generate the log probability.

~~839~~

840 - `bytes: Array[Integer]`

~~841~~

842 The bytes that were used to generate the log probability.

~~843~~

844 - `logprob: Float`

~~845~~

846 The log probability of the token.

~~847~~

848 - `usage: Usage{ input_tokens, output_tokens, total_tokens, 2 more}`

~~849~~

850 Usage statistics for models billed by token usage.

~~851~~

852 - `input_tokens: Integer`

~~853~~

854 Number of input tokens billed for this request.

~~855~~

856 - `output_tokens: Integer`

~~857~~

858 Number of output tokens generated.

~~859~~

860 - `total_tokens: Integer`

~~861~~

862 Total number of tokens used (input + output).

~~863~~

864 - `type: :tokens`

~~865~~

866 The type of the usage object. Always `tokens` for this variant.

~~867~~

868 - `:tokens`

~~869~~

870 - `input_token_details: InputTokenDetails{ audio_tokens, text_tokens}`

~~871~~

872 Details about the input tokens billed for this request.

~~873~~

874 - `audio_tokens: Integer`

~~875~~

876 Number of audio tokens billed for this request.

~~877~~

878 - `text_tokens: Integer`

~~879~~

880 Number of text tokens billed for this request.

~~881~~
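The three event types above suggest a simple consumption pattern: append each `transcript.text.delta` to a buffer and prefer the full text from `transcript.text.done` once it arrives. The sketch below illustrates that pattern on an in-memory list. `assemble_transcript` is a hypothetical helper, hashes stand in for the SDK's event objects, and no network call is made.

```ruby
# Hypothetical helper: rebuilds the transcript from a sequence of events.
def assemble_transcript(events)
  buffer = +"" # unfrozen string that accumulates delta text
  final_text = nil

  events.each do |event|
    case event[:type].to_s
    when "transcript.text.delta"
      buffer << event[:delta]
    when "transcript.text.done"
      final_text = event[:text]
    end
    # Other event types, such as transcript.text.segment, are ignored here.
  end

  # Fall back to the accumulated deltas if no done event was seen.
  final_text || buffer
end

# Made-up events for illustration.
events = [
  { type: :"transcript.text.delta", delta: "Hello" },
  { type: :"transcript.text.delta", delta: ", world" },
  { type: :"transcript.text.done", text: "Hello, world" }
]
puts assemble_transcript(events)
```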

882### Transcription Text Delta Event

~~883~~

884- `class TranscriptionTextDeltaEvent`

~~885~~

886 Emitted when there is an additional text delta. This is also the first event emitted when the transcription starts. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `stream` parameter set to `true`.

~~887~~

888 - `delta: String`

~~889~~

890 The text delta that was additionally transcribed.

~~891~~

892 - `type: :"transcript.text.delta"`

~~893~~

894 The type of the event. Always `transcript.text.delta`.

~~895~~

896 - `:"transcript.text.delta"`

~~897~~

898 - `logprobs: Array[Logprob{ token, bytes, logprob}]`

~~899~~

900 The log probabilities of the delta. Only included if you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `include[]` parameter set to `logprobs`.

~~901~~

902 - `token: String`

~~903~~

904 The token that was used to generate the log probability.

~~905~~

906 - `bytes: Array[Integer]`

~~907~~

908 The bytes that were used to generate the log probability.

~~909~~

910 - `logprob: Float`

~~911~~

912 The log probability of the token.

~~913~~

914 - `segment_id: String`

~~915~~

916 Identifier of the diarized segment that this delta belongs to. Only present when using `gpt-4o-transcribe-diarize`.

~~917~~

918### Transcription Text Done Event

~~919~~

920- `class TranscriptionTextDoneEvent`

~~921~~

922 Emitted when the transcription is complete. Contains the complete transcription text. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `stream` parameter set to `true`.

~~923~~

924 - `text: String`

~~925~~

926 The text that was transcribed.

~~927~~

928 - `type: :"transcript.text.done"`

~~929~~

930 The type of the event. Always `transcript.text.done`.

~~931~~

932 - `:"transcript.text.done"`

~~933~~

934 - `logprobs: Array[Logprob{ token, bytes, logprob}]`

~~935~~

936 The log probabilities of the individual tokens in the transcription. Only included if you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `include[]` parameter set to `logprobs`.

~~937~~

938 - `token: String`

~~939~~

940 The token that was used to generate the log probability.

~~941~~

942 - `bytes: Array[Integer]`

~~943~~

944 The bytes that were used to generate the log probability.

~~945~~

946 - `logprob: Float`

~~947~~

948 The log probability of the token.

~~949~~

950 - `usage: Usage{ input_tokens, output_tokens, total_tokens, 2 more}`

~~951~~

952 Usage statistics for models billed by token usage.

~~953~~

954 - `input_tokens: Integer`

~~955~~

956 Number of input tokens billed for this request.

~~957~~

958 - `output_tokens: Integer`

~~959~~

960 Number of output tokens generated.

~~961~~

962 - `total_tokens: Integer`

~~963~~

964 Total number of tokens used (input + output).

~~965~~

966 - `type: :tokens`

~~967~~

968 The type of the usage object. Always `tokens` for this variant.

~~969~~

970 - `:tokens`

~~971~~

972 - `input_token_details: InputTokenDetails{ audio_tokens, text_tokens}`

~~973~~

974 Details about the input tokens billed for this request.

~~975~~

976 - `audio_tokens: Integer`

~~977~~

978 Number of audio tokens billed for this request.

~~979~~

980 - `text_tokens: Integer`

~~981~~

982 Number of text tokens billed for this request.

~~983~~

984### Transcription Text Segment Event

~~985~~

986- `class TranscriptionTextSegmentEvent`

~~987~~

988 Emitted when a diarized transcription returns a completed segment with speaker information. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with `stream` set to `true` and `response_format` set to `diarized_json`.

~~989~~

990 - `id: String`

~~991~~

992 Unique identifier for the segment.

~~993~~

994 - `end_: Float`

~~995~~

996 End timestamp of the segment in seconds.

~~997~~

998 - `speaker: String`

~~999~~

1000 Speaker label for this segment.

~~1001~~

1002 - `start: Float`

~~1003~~

1004 Start timestamp of the segment in seconds.

~~1005~~

1006 - `text: String`

~~1007~~

1008 Transcript text for this segment.

~~1009~~

1010 - `type: :"transcript.text.segment"`

~~1011~~

1012 The type of the event. Always `transcript.text.segment`.

~~1013~~

1014 - `:"transcript.text.segment"`

~~1015~~
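One way to consume segment events is to render each one as a speaker-labelled line. This is a minimal sketch that models events as plain Hashes rather than SDK objects (an assumption); `format_segment` is a hypothetical helper and the sample data is invented.

```ruby
# Hypothetical segment events modeled as plain Hashes; `end_` mirrors the field name above.
events = [
  { id: "seg_1", type: :"transcript.text.segment", speaker: "A", start: 0.0, end_: 2.5, text: " Hello there." },
  { id: "seg_2", type: :"transcript.text.segment", speaker: "B", start: 2.5, end_: 4.0, text: " Hi!" }
]

# Renders one segment as "[start-end] speaker: text".
def format_segment(event)
  format("[%.2fs-%.2fs] %s: %s", event[:start], event[:end_], event[:speaker], event[:text].strip)
end

# Keep only segment events, then print one line per segment.
events
  .select { |event| event[:type] == :"transcript.text.segment" }
  .each { |event| puts format_segment(event) }
```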

1016### Transcription Verbose

~~1017~~

1018- `class TranscriptionVerbose`

~~1019~~

1020 Represents a verbose JSON transcription response returned by the model, based on the provided input.

~~1021~~

1022 - `duration: Float`

~~1023~~

1024 The duration of the input audio.

~~1025~~

1026 - `language: String`

~~1027~~

1028 The language of the input audio.

~~1029~~

1030 - `text: String`

~~1031~~

1032 The transcribed text.

~~1033~~

1034 - `segments: Array[TranscriptionSegment]`

~~1035~~

1036 Segments of the transcribed text and their corresponding details.

~~1037~~

1038 - `id: Integer`

~~1039~~

1040 Unique identifier of the segment.

~~1041~~

1042 - `avg_logprob: Float`

~~1043~~

1044 Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.

~~1045~~

1046 - `compression_ratio: Float`

~~1047~~

1048 Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.

~~1049~~

1050 - `end_: Float`

~~1051~~

1052 End time of the segment in seconds.

~~1053~~

1054 - `no_speech_prob: Float`

~~1055~~

1056 Probability of no speech in the segment. If the value is higher than 1.0 and the `avg_logprob` is below -1, consider this segment silent.

~~1057~~

1058 - `seek: Integer`

~~1059~~

1060 Seek offset of the segment.

~~1061~~

1062 - `start: Float`

~~1063~~

1064 Start time of the segment in seconds.

~~1065~~

1066 - `temperature: Float`

~~1067~~

1068 Temperature parameter used for generating the segment.

~~1069~~

1070 - `text: String`

~~1071~~

1072 Text content of the segment.

~~1073~~

1074 - `tokens: Array[Integer]`

~~1075~~

1076 Array of token IDs for the text content.

~~1077~~

1078 - `usage: Usage{ seconds, type}`

~~1079~~

1080 Usage statistics for models billed by audio input duration.

~~1081~~

1082 - `seconds: Float`

~~1083~~

1084 Duration of the input audio in seconds.

~~1085~~

1086 - `type: :duration`

~~1087~~

1088 The type of the usage object. Always `duration` for this variant.

~~1089~~

1090 - `:duration`

~~1091~~

1092 - `words: Array[TranscriptionWord]`

~~1093~~

1094 Extracted words and their corresponding timestamps.

~~1095~~

1096 - `end_: Float`

~~1097~~

1098 End time of the word in seconds.

~~1099~~

1100 - `start: Float`

~~1101~~

1102 Start time of the word in seconds.

~~1103~~

1104 - `word: String`

~~1105~~

1106 The text content of the word.

~~1107~~
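The `avg_logprob` and `compression_ratio` descriptions above suggest simple quality checks on each segment. The sketch below applies those two thresholds to Hash stand-ins for `TranscriptionSegment` (an assumption; the SDK returns typed objects). The constant names, `suspect_segment?`, and the sample values are all hypothetical.

```ruby
LOGPROB_FLOOR = -1.0       # below this, "consider the logprobs failed"
COMPRESSION_CEILING = 2.4  # above this, "consider the compression failed"

# True when a segment trips either of the two documented thresholds.
def suspect_segment?(segment)
  segment[:avg_logprob] < LOGPROB_FLOOR || segment[:compression_ratio] > COMPRESSION_CEILING
end

# Invented sample segments with only the fields the check needs.
segments = [
  { id: 0, text: " Clear speech.", avg_logprob: -0.25, compression_ratio: 1.3 },
  { id: 1, text: " la la la la la", avg_logprob: -0.4, compression_ratio: 3.1 },
  { id: 2, text: " (mumbling)", avg_logprob: -1.6, compression_ratio: 1.1 }
]

kept, flagged = segments.partition { |segment| !suspect_segment?(segment) }
puts "kept: #{kept.map { |s| s[:id] }.inspect}"
puts "flagged: #{flagged.map { |s| s[:id] }.inspect}"
```

Whether to drop, retry, or merely log flagged segments is an application decision; the thresholds here are only the ones quoted in the field descriptions.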

1108### Transcription Word

~~1109~~

1110- `class TranscriptionWord`

~~1111~~

1112 - `end_: Float`

~~1113~~

1114 End time of the word in seconds.

~~1115~~

1116 - `start: Float`

~~1117~~

1118 Start time of the word in seconds.

~~1119~~

1120 - `word: String`

~~1121~~

1122 The text content of the word.

~~1123~~

1124### Transcription Create Response

~~1125~~

1126- `TranscriptionCreateResponse = Transcription | TranscriptionDiarized | TranscriptionVerbose`

~~1127~~

1128 Represents a transcription response returned by the model, based on the provided input.

~~1129~~

1130 - `class Transcription`

~~1131~~

1132 Represents a transcription response returned by the model, based on the provided input.

~~1133~~

1134 - `text: String`

~~1135~~

1136 The transcribed text.

~~1137~~

1138 - `logprobs: Array[Logprob{ token, bytes, logprob}]`

~~1139~~

1140 The log probabilities of the tokens in the transcription. Only returned with the models `gpt-4o-transcribe` and `gpt-4o-mini-transcribe` if `logprobs` is added to the `include` array.

~~1141~~

1142 - `token: String`

~~1143~~

1144 The token in the transcription.

~~1145~~

1146 - `bytes: Array[Float]`

~~1147~~

1148 The bytes of the token.

~~1149~~

1150 - `logprob: Float`

~~1151~~

1152 The log probability of the token.

~~1153~~

1154 - `usage: Tokens{ input_tokens, output_tokens, total_tokens, 2 more} | Duration{ seconds, type}`

~~1155~~

1156 Token or duration usage statistics for the request.

~~1157~~

1158 - `class Tokens`

~~1159~~

1160 Usage statistics for models billed by token usage.

~~1161~~

1162 - `input_tokens: Integer`

~~1163~~

1164 Number of input tokens billed for this request.

~~1165~~

1166 - `output_tokens: Integer`

~~1167~~

1168 Number of output tokens generated.

~~1169~~

1170 - `total_tokens: Integer`

~~1171~~

1172 Total number of tokens used (input + output).

~~1173~~

1174 - `type: :tokens`

~~1175~~

1176 The type of the usage object. Always `tokens` for this variant.

~~1177~~

1178 - `:tokens`

~~1179~~

1180 - `input_token_details: InputTokenDetails{ audio_tokens, text_tokens}`

~~1181~~

1182 Details about the input tokens billed for this request.

~~1183~~

1184 - `audio_tokens: Integer`

~~1185~~

1186 Number of audio tokens billed for this request.

~~1187~~

1188 - `text_tokens: Integer`

~~1189~~

1190 Number of text tokens billed for this request.

~~1191~~

1192 - `class Duration`

~~1193~~

1194 Usage statistics for models billed by audio input duration.

~~1195~~

1196 - `seconds: Float`

~~1197~~

1198 Duration of the input audio in seconds.

~~1199~~

1200 - `type: :duration`

~~1201~~

1202 The type of the usage object. Always `duration` for this variant.

~~1203~~

1204 - `:duration`

~~1205~~

1206 - `class TranscriptionDiarized`

~~1207~~

1208 Represents a diarized transcription response returned by the model, including the combined transcript and speaker-segment annotations.

~~1209~~

1210 - `duration: Float`

~~1211~~

1212 Duration of the input audio in seconds.

~~1213~~

1214 - `segments: Array[TranscriptionDiarizedSegment]`

~~1215~~

1216 Segments of the transcript annotated with timestamps and speaker labels.

~~1217~~

1218 - `id: String`

~~1219~~

1220 Unique identifier for the segment.

~~1221~~

1222 - `end_: Float`

~~1223~~

1224 End timestamp of the segment in seconds.

~~1225~~

1226 - `speaker: String`

~~1227~~

1228 Speaker label for this segment. When known speakers are provided, the label matches `known_speaker_names[]`. Otherwise speakers are labeled sequentially using capital letters (`A`, `B`, ...).

~~1229~~

1230 - `start: Float`

~~1231~~

1232 Start timestamp of the segment in seconds.

~~1233~~

1234 - `text: String`

~~1235~~

1236 Transcript text for this segment.

~~1237~~

1238 - `type: :"transcript.text.segment"`

~~1239~~

1240 The type of the segment. Always `transcript.text.segment`.

~~1241~~

1242 - `:"transcript.text.segment"`

~~1243~~

1244 - `task: :transcribe`

~~1245~~

1246 The type of task that was run. Always `transcribe`.

~~1247~~

1248 - `:transcribe`

~~1249~~

1250 - `text: String`

~~1251~~

1252 The concatenated transcript text for the entire audio input.

~~1253~~

1254 - `usage: Tokens{ input_tokens, output_tokens, total_tokens, 2 more} | Duration{ seconds, type}`

~~1255~~

1256 Token or duration usage statistics for the request.

~~1257~~

1258 - `class Tokens`

~~1259~~

1260 Usage statistics for models billed by token usage.

~~1261~~

1262 - `input_tokens: Integer`

~~1263~~

1264 Number of input tokens billed for this request.

~~1265~~

1266 - `output_tokens: Integer`

~~1267~~

1268 Number of output tokens generated.

~~1269~~

1270 - `total_tokens: Integer`

~~1271~~

1272 Total number of tokens used (input + output).

~~1273~~

1274 - `type: :tokens`

~~1275~~

1276 The type of the usage object. Always `tokens` for this variant.

~~1277~~

1278 - `:tokens`

~~1279~~

1280 - `input_token_details: InputTokenDetails{ audio_tokens, text_tokens}`

~~1281~~

1282 Details about the input tokens billed for this request.

~~1283~~

1284 - `audio_tokens: Integer`

~~1285~~

1286 Number of audio tokens billed for this request.

~~1287~~

1288 - `text_tokens: Integer`

~~1289~~

1290 Number of text tokens billed for this request.

~~1291~~

1292 - `class Duration`

~~1293~~

1294 Usage statistics for models billed by audio input duration.

~~1295~~

1296 - `seconds: Float`

~~1297~~

1298 Duration of the input audio in seconds.

~~1299~~

1300 - `type: :duration`

~~1301~~

1302 The type of the usage object. Always `duration` for this variant.

~~1303~~

1304 - `:duration`

~~1305~~

1306 - `class TranscriptionVerbose`

~~1307~~

1308 Represents a verbose JSON transcription response returned by the model, based on the provided input.

~~1309~~

1310 - `duration: Float`

~~1311~~

1312 The duration of the input audio.

~~1313~~

1314 - `language: String`

~~1315~~

1316 The language of the input audio.

~~1317~~

1318 - `text: String`

~~1319~~

1320 The transcribed text.

~~1321~~

1322 - `segments: Array[TranscriptionSegment]`

~~1323~~

1324 Segments of the transcribed text and their corresponding details.

~~1325~~

1326 - `id: Integer`

~~1327~~

1328 Unique identifier of the segment.

~~1329~~

1330 - `avg_logprob: Float`

~~1331~~

1332 Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.

~~1333~~

1334 - `compression_ratio: Float`

~~1335~~

1336 Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.

~~1337~~

1338 - `end_: Float`

~~1339~~

1340 End time of the segment in seconds.

~~1341~~

1342 - `no_speech_prob: Float`

~~1343~~

1344 Probability of no speech in the segment. If the value is higher than 1.0 and the `avg_logprob` is below -1, consider this segment silent.

~~1345~~

1346 - `seek: Integer`

~~1347~~

1348 Seek offset of the segment.

~~1349~~

1350 - `start: Float`

~~1351~~

1352 Start time of the segment in seconds.

~~1353~~

1354 - `temperature: Float`

~~1355~~

1356 Temperature parameter used for generating the segment.

~~1357~~

1358 - `text: String`

~~1359~~

1360 Text content of the segment.

~~1361~~

1362 - `tokens: Array[Integer]`

~~1363~~

1364 Array of token IDs for the text content.

~~1365~~

1366 - `usage: Usage{ seconds, type}`

~~1367~~

1368 Usage statistics for models billed by audio input duration.

~~1369~~

1370 - `seconds: Float`

~~1371~~

1372 Duration of the input audio in seconds.

~~1373~~

1374 - `type: :duration`

~~1375~~

1376 The type of the usage object. Always `duration` for this variant.

~~1377~~

1378 - `:duration`

~~1379~~

1380 - `words: Array[TranscriptionWord]`

~~1381~~

1382 Extracted words and their corresponding timestamps.

~~1383~~

1384 - `end_: Float`

~~1385~~

1386 End time of the word in seconds.

~~1387~~

1388 - `start: Float`

~~1389~~

1390 Start time of the word in seconds.

~~1391~~

1392 - `word: String`

~~1393~~

1394 The text content of the word.

~~1395~~
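Because `usage` can be either the `Tokens` or the `Duration` variant, calling code typically branches on its `type` field. The following is a minimal sketch that assumes the usage object has been converted to a plain Hash; `describe_usage` is a hypothetical helper, not part of the SDK.

```ruby
# Summarizes a usage Hash by branching on its `type` discriminator.
def describe_usage(usage)
  case usage[:type]
  when :tokens
    "#{usage[:total_tokens]} tokens (#{usage[:input_tokens]} in, #{usage[:output_tokens]} out)"
  when :duration
    format("%.1f seconds of audio", usage[:seconds])
  else
    raise ArgumentError, "unknown usage type: #{usage[:type].inspect}"
  end
end

# Invented sample values for each variant.
puts describe_usage({ type: :tokens, input_tokens: 14, output_tokens: 45, total_tokens: 59 })
puts describe_usage({ type: :duration, seconds: 12.0 })
```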

1396# Translations

~~1397~~

1398## Create translation

~~1399~~

1400`audio.translations.create(**kwargs) -> TranslationCreateResponse`

~~1401~~

1402**post** `/audio/translations`

~~1403~~

1404Translates audio into English.

~~1405~~

1406### Parameters

~~1407~~

1408- `file: String`

~~1409~~

1410 The audio file object (not file name) to translate, in one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm.

~~1411~~

1412- `model: String | AudioModel`

~~1413~~

1414 ID of the model to use. Only `whisper-1` (which is powered by our open source Whisper V2 model) is currently available.

~~1415~~

1416 - `String = String`

~~1417~~

1418 - `AudioModel = :"whisper-1" | :"gpt-4o-transcribe" | :"gpt-4o-mini-transcribe" | 2 more`

~~1419~~

1420 - `:"whisper-1"`

~~1421~~

1422 - `:"gpt-4o-transcribe"`

~~1423~~

1424 - `:"gpt-4o-mini-transcribe"`

~~1425~~

1426 - `:"gpt-4o-mini-transcribe-2025-12-15"`

~~1427~~

1428 - `:"gpt-4o-transcribe-diarize"`

~~1429~~

1430- `prompt: String`

~~1431~~

1432 An optional text to guide the model's style or continue a previous audio segment. The [prompt](https://platform.openai.com/docs/guides/speech-to-text#prompting) should be in English.

~~1433~~

1434- `response_format: :json | :text | :srt | 2 more`

~~1435~~

1436 The format of the output, in one of these options: `json`, `text`, `srt`, `verbose_json`, or `vtt`.

~~1437~~

1438 - `:json`

~~1439~~

1440 - `:text`

~~1441~~

1442 - `:srt`

~~1443~~

1444 - `:verbose_json`

~~1445~~

1446 - `:vtt`

~~1447~~

1448- `temperature: Float`

~~1449~~

1450 The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use [log probability](https://en.wikipedia.org/wiki/Log_probability) to automatically increase the temperature until certain thresholds are hit.

~~1451~~
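The parameter constraints above can be checked before a request is sent. This sketch is a hypothetical pre-flight helper (`translation_params` and `TRANSLATION_FORMATS` are illustrative names, not SDK API) that validates `response_format` and `temperature` and returns keyword arguments that could be splatted into `audio.translations.create`.

```ruby
TRANSLATION_FORMATS = %i[json text srt verbose_json vtt].freeze

# Hypothetical helper: validates options and returns keyword arguments for the call.
def translation_params(file:, response_format: nil, temperature: nil, prompt: nil)
  if response_format && !TRANSLATION_FORMATS.include?(response_format)
    raise ArgumentError, "unsupported response_format: #{response_format.inspect}"
  end
  if temperature && !(0.0..1.0).cover?(temperature)
    raise ArgumentError, "temperature must be between 0 and 1"
  end

  # Only whisper-1 is documented as available for translations.
  params = { file: file, model: :"whisper-1" }
  params[:response_format] = response_format if response_format
  params[:temperature] = temperature if temperature
  params[:prompt] = prompt if prompt
  params
end

# "audio-placeholder" stands in for a real audio file object.
p translation_params(file: "audio-placeholder", response_format: :srt, temperature: 0.2)
```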

1452### Returns

~~1453~~

1454- `TranslationCreateResponse = Translation | TranslationVerbose`

~~1455~~

1456 - `class Translation`

~~1457~~

1458 - `text: String`

~~1459~~

1460 - `class TranslationVerbose`

~~1461~~

1462 - `duration: Float`

~~1463~~

1464 The duration of the input audio.

~~1465~~

1466 - `language: String`

~~1467~~

1468 The language of the output translation (always `english`).

~~1469~~

1470 - `text: String`

~~1471~~

1472 The translated text.

~~1473~~

1474 - `segments: Array[TranscriptionSegment]`

~~1475~~

1476 Segments of the translated text and their corresponding details.

~~1477~~

1478 - `id: Integer`

~~1479~~

1480 Unique identifier of the segment.

~~1481~~

1482 - `avg_logprob: Float`

~~1483~~

1484 Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.

~~1485~~

1486 - `compression_ratio: Float`

~~1487~~

1488 Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.

~~1489~~

1490 - `end_: Float`

~~1491~~

1492 End time of the segment in seconds.

~~1493~~

1494 - `no_speech_prob: Float`

~~1495~~

1496 Probability of no speech in the segment. If the value is higher than 1.0 and the `avg_logprob` is below -1, consider this segment silent.

~~1497~~

1498 - `seek: Integer`

~~1499~~

1500 Seek offset of the segment.

~~1501~~

1502 - `start: Float`

~~1503~~

1504 Start time of the segment in seconds.

~~1505~~

1506 - `temperature: Float`

~~1507~~

1508 Temperature parameter used for generating the segment.

~~1509~~

1510 - `text: String`

~~1511~~

1512 Text content of the segment.

~~1513~~

1514 - `tokens: Array[Integer]`

~~1515~~

1516 Array of token IDs for the text content.

~~1517~~

1518### Example

~~1519~~

```ruby
require "openai"

openai = OpenAI::Client.new(api_key: "My API Key")

translation = openai.audio.translations.create(file: StringIO.new("Example data"), model: :"whisper-1")

puts(translation)
```

~~1529~~

1530#### Response

~~1531~~

```json
{
  "text": "text"
}
```

~~1537~~

1538## Domain Types

~~1539~~

1540### Translation

~~1541~~

1542- `class Translation`

~~1543~~

1544 - `text: String`

~~1545~~

1546### Translation Verbose

~~1547~~

1548- `class TranslationVerbose`

~~1549~~

1550 - `duration: Float`

~~1551~~

1552 The duration of the input audio.

~~1553~~

1554 - `language: String`

~~1555~~

1556 The language of the output translation (always `english`).

~~1557~~

1558 - `text: String`

~~1559~~

1560 The translated text.

~~1561~~

1562 - `segments: Array[TranscriptionSegment]`

~~1563~~

1564 Segments of the translated text and their corresponding details.

~~1565~~

1566 - `id: Integer`

~~1567~~

1568 Unique identifier of the segment.

~~1569~~

1570 - `avg_logprob: Float`

~~1571~~

1572 Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.

~~1573~~

1574 - `compression_ratio: Float`

~~1575~~

1576 Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.

~~1577~~

1578 - `end_: Float`

~~1579~~

1580 End time of the segment in seconds.

~~1581~~

1582 - `no_speech_prob: Float`

~~1583~~

1584 Probability of no speech in the segment. If the value is higher than 1.0 and the `avg_logprob` is below -1, consider this segment silent.

~~1585~~

1586 - `seek: Integer`

~~1587~~

1588 Seek offset of the segment.

~~1589~~

1590 - `start: Float`

~~1591~~

1592 Start time of the segment in seconds.

~~1593~~

1594 - `temperature: Float`

~~1595~~

1596 Temperature parameter used for generating the segment.

~~1597~~

1598 - `text: String`

~~1599~~

1600 Text content of the segment.

~~1601~~

1602 - `tokens: Array[Integer]`

~~1603~~

1604 Array of token IDs for the text content.

~~1605~~

1606### Translation Create Response

~~1607~~

1608- `TranslationCreateResponse = Translation | TranslationVerbose`

~~1609~~

1610 - `class Translation`

~~1611~~

1612 - `text: String`

~~1613~~

1614 - `class TranslationVerbose`

~~1615~~

1616 - `duration: Float`

~~1617~~

1618 The duration of the input audio.

~~1619~~

1620 - `language: String`

~~1621~~

1622 The language of the output translation (always `english`).

~~1623~~

1624 - `text: String`

~~1625~~

1626 The translated text.

~~1627~~

1628 - `segments: Array[TranscriptionSegment]`

~~1629~~

1630 Segments of the translated text and their corresponding details.

~~1631~~

1632 - `id: Integer`

~~1633~~

1634 Unique identifier of the segment.

~~1635~~

1636 - `avg_logprob: Float`

~~1637~~

1638 Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.

~~1639~~

1640 - `compression_ratio: Float`

~~1641~~

1642 Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.

~~1643~~

1644 - `end_: Float`

~~1645~~

1646 End time of the segment in seconds.

~~1647~~

1648 - `no_speech_prob: Float`

~~1649~~

1650 Probability of no speech in the segment. If the value is higher than 1.0 and the `avg_logprob` is below -1, consider this segment silent.

~~1651~~

1652 - `seek: Integer`

~~1653~~

1654 Seek offset of the segment.

~~1655~~

1656 - `start: Float`

~~1657~~

1658 Start time of the segment in seconds.

~~1659~~

1660 - `temperature: Float`

~~1661~~

1662 Temperature parameter used for generating the segment.

~~1663~~

1664 - `text: String`

~~1665~~

1666 Text content of the segment.

~~1667~~

1668 - `tokens: Array[Integer]`

~~1669~~

1670 Array of token IDs for the text content.

~~1671~~

1672# Speech

~~1673~~

1674## Create speech

~~1675~~

1676`audio.speech.create(**kwargs) -> StringIO`

~~1677~~

1678**post** `/audio/speech`

~~1679~~

1680Generates audio from the input text.

~~1681~~

1682Returns the audio file content, or a stream of audio events.

~~1683~~

1684### Parameters

~~1685~~

1686- `input: String`

~~1687~~

1688 The text to generate audio for. The maximum length is 4096 characters.

~~1689~~

1690- `model: String | SpeechModel`

~~1691~~

1692 One of the available [TTS models](https://platform.openai.com/docs/models#tts): `tts-1`, `tts-1-hd`, `gpt-4o-mini-tts`, or `gpt-4o-mini-tts-2025-12-15`.

~~1693~~

1694 - `String = String`

~~1695~~

1696 - `SpeechModel = :"tts-1" | :"tts-1-hd" | :"gpt-4o-mini-tts" | :"gpt-4o-mini-tts-2025-12-15"`

~~1697~~

1698 - `:"tts-1"`

~~1699~~

1700 - `:"tts-1-hd"`

~~1701~~

1702 - `:"gpt-4o-mini-tts"`

~~1703~~

1704 - `:"gpt-4o-mini-tts-2025-12-15"`

~~1705~~

1706- `voice: String | :alloy | :ash | :ballad | 7 more | ID{ id}`

~~1707~~

1708 The voice to use when generating the audio. Supported built-in voices are `alloy`, `ash`, `ballad`, `coral`, `echo`, `fable`, `onyx`, `nova`, `sage`, `shimmer`, `verse`, `marin`, and `cedar`. You may also provide a custom voice object with an `id`, for example `{ "id": "voice_1234" }`. Previews of the voices are available in the [Text to speech guide](https://platform.openai.com/docs/guides/text-to-speech#voice-options).

~~1709~~

1710 - `String = String`

~~1711~~

1712 - `Voice = :alloy | :ash | :ballad | 7 more`

~~1713~~

1714 - `:alloy`

~~1715~~

1716 - `:ash`

~~1717~~

1718 - `:ballad`

~~1719~~

1720 - `:coral`

~~1721~~

1722 - `:echo`

~~1723~~

1724 - `:sage`

~~1725~~

1726 - `:shimmer`

~~1727~~

1728 - `:verse`

~~1729~~

1730 - `:marin`

~~1731~~

1732 - `:cedar`

~~1733~~

1734 - `class ID`

~~1735~~

1736 Custom voice reference.

~~1737~~

1738 - `id: String`

~~1739~~

1740 The custom voice ID, e.g. `voice_1234`.

~~1741~~

1742- `instructions: String`

~~1743~~

1744 Control the voice of your generated audio with additional instructions. Does not work with `tts-1` or `tts-1-hd`.

~~1745~~

1746- `response_format: :mp3 | :opus | :aac | 3 more`

~~1747~~

1748 The format of the generated audio. Supported formats are `mp3`, `opus`, `aac`, `flac`, `wav`, and `pcm`.

~~1749~~

1750 - `:mp3`

~~1751~~

1752 - `:opus`

~~1753~~

1754 - `:aac`

~~1755~~

1756 - `:flac`

~~1757~~

1758 - `:wav`

~~1759~~

1760 - `:pcm`

~~1761~~

1762- `speed: Float`

~~1763~~

1764 The speed of the generated audio. Select a value from `0.25` to `4.0`. `1.0` is the default.

~~1765~~

1766- `stream_format: :sse | :audio`

~~1767~~

1768 The format to stream the audio in. Supported formats are `sse` and `audio`. `sse` is not supported for `tts-1` or `tts-1-hd`.

~~1769~~

1770 - `:sse`

~~1771~~

1772 - `:audio`

~~1773~~
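Several of the speech parameters carry documented limits: the 4096-character input cap, the `0.25` to `4.0` speed range, and the features that do not work with `tts-1` or `tts-1-hd`. The sketch below gathers them into one hypothetical pre-flight check. `speech_param_problems` and the constant names are illustrative assumptions, not part of the SDK.

```ruby
MAX_INPUT_CHARS = 4096
SPEED_RANGE = (0.25..4.0)
TTS_1_MODELS = %i[tts-1 tts-1-hd].freeze

# Hypothetical pre-flight check; returns a list of problems (empty when none are found).
def speech_param_problems(input:, model:, speed: nil, instructions: nil, stream_format: nil)
  problems = []
  problems << "input exceeds #{MAX_INPUT_CHARS} characters" if input.length > MAX_INPUT_CHARS
  problems << "speed must be within 0.25 to 4.0" if speed && !SPEED_RANGE.cover?(speed)
  if TTS_1_MODELS.include?(model.to_sym)
    problems << "instructions do not work with #{model}" if instructions
    problems << "sse streaming is not supported for #{model}" if stream_format == :sse
  end
  problems
end

puts speech_param_problems(input: "Hello", model: :"tts-1", speed: 5.0, instructions: "Speak warmly")
```

Returning a list rather than raising on the first issue lets a caller report every problem at once; raising would be an equally reasonable design.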

1774### Returns

~~1775~~

1776- `StringIO`

~~1777~~

1778### Example

~~1779~~

```ruby
require "openai"

openai = OpenAI::Client.new(api_key: "My API Key")

# `:alloy` is one of the built-in voices listed above.
speech = openai.audio.speech.create(input: "input", model: :"tts-1", voice: :alloy)

puts(speech)
```

~~1789~~
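Since the call returns a `StringIO`, a common next step is writing its bytes to disk. This minimal sketch uses a fake `StringIO` in place of a real API response so it runs offline; `save_audio` is a hypothetical helper and the file name is arbitrary.

```ruby
require "stringio"
require "tmpdir"

# Stand-in for the StringIO returned by `audio.speech.create`; the bytes are fake.
speech = StringIO.new("fake-mp3-bytes".b)

# Writes the IO's remaining bytes in binary mode and returns the byte count.
def save_audio(io, path)
  File.binwrite(path, io.read)
end

# A temporary directory keeps the example from leaving files behind.
Dir.mktmpdir do |dir|
  path = File.join(dir, "speech.mp3")
  bytes_written = save_audio(speech, path)
  puts "wrote #{bytes_written} bytes to #{path}"
end
```

Binary mode matters here: audio data is not text, and text-mode writes could alter bytes on some platforms.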

1790## Domain Types

~~1791~~

1792### Speech Model

~~1793~~

1794- `SpeechModel = :"tts-1" | :"tts-1-hd" | :"gpt-4o-mini-tts" | :"gpt-4o-mini-tts-2025-12-15"`

~~1795~~

1796 - `:"tts-1"`

~~1797~~

1798 - `:"tts-1-hd"`

~~1799~~

1800 - `:"gpt-4o-mini-tts"`

~~1801~~

1802 - `:"gpt-4o-mini-tts-2025-12-15"`

~~1803~~

1804# Voices

~~1805~~

1806# Voice Consents