
python/resources/audio/index.md 2026-05-18 22:01 UTC to 2026-05-19 06:34 UTC

0 added, 2213 removed.


This document has no rendered page for this history range.

python/resources/audio/index.md +0 −2213 deleted


~~1# Audio~~

~~3## Domain Types~~

~~5### Audio Model~~

~~7- `Literal["whisper-1", "gpt-4o-transcribe", "gpt-4o-mini-transcribe", 2 more]`~~

~~9 - `"whisper-1"`~~

~~11 - `"gpt-4o-transcribe"`~~

~~13 - `"gpt-4o-mini-transcribe"`~~

~~15 - `"gpt-4o-mini-transcribe-2025-12-15"`~~

~~17 - `"gpt-4o-transcribe-diarize"`~~

~~19### Audio Response Format~~

~~21- `Literal["json", "text", "srt", 3 more]`~~

23 The format of the output, in one of these options: `json`, `text`, `srt`, `verbose_json`, `vtt`, or `diarized_json`. For `gpt-4o-transcribe` and `gpt-4o-mini-transcribe`, the only supported format is `json`. For `gpt-4o-transcribe-diarize`, the supported formats are `json`, `text`, and `diarized_json`, with `diarized_json` required to receive speaker annotations.

~~25 - `"json"`~~

~~27 - `"text"`~~

~~29 - `"srt"`~~

~~31 - `"verbose_json"`~~

~~33 - `"vtt"`~~

~~35 - `"diarized_json"`~~
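The compatibility rules in the description above can be checked on the client before a request is sent. What follows is a minimal sketch, not part of the SDK: `ALL_FORMATS`, `SUPPORTED_FORMATS`, and `check_response_format` are hypothetical names. The table encodes only what the description states. It does not say which formats `whisper-1` or `gpt-4o-mini-transcribe-2025-12-15` accept, so the sketch assumes those models are unrestricted.

```python
from typing import Dict, FrozenSet

# Every value of the AudioResponseFormat literal listed above.
ALL_FORMATS: FrozenSet[str] = frozenset(
    {"json", "text", "srt", "verbose_json", "vtt", "diarized_json"}
)

# Restrictions taken from the description above. Models that are not listed
# are ASSUMED to be unrestricted; the description does not cover them.
SUPPORTED_FORMATS: Dict[str, FrozenSet[str]] = {
    "gpt-4o-transcribe": frozenset({"json"}),
    "gpt-4o-mini-transcribe": frozenset({"json"}),
    "gpt-4o-transcribe-diarize": frozenset({"json", "text", "diarized_json"}),
}


def check_response_format(model: str, response_format: str) -> None:
    """Raise ValueError if the format is unknown or not allowed for the model."""
    if response_format not in ALL_FORMATS:
        raise ValueError(f"unknown response_format: {response_format!r}")
    allowed = SUPPORTED_FORMATS.get(model, ALL_FORMATS)
    if response_format not in allowed:
        raise ValueError(
            f"{model} supports only {sorted(allowed)}, got {response_format!r}"
        )


check_response_format("gpt-4o-transcribe-diarize", "diarized_json")  # passes silently
```

Failing early like this avoids uploading an audio file only to have the request rejected for an unsupported format.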

~~37# Transcriptions~~

~~39## Create transcription~~

~~41`audio.transcriptions.create(TranscriptionCreateParams**kwargs) -> TranscriptionCreateResponse`~~

~~43**post** `/audio/transcriptions`~~

~~45Transcribes audio into the input language.~~

~~47Returns a transcription object in `json`, `diarized_json`, or `verbose_json`~~

~~48format, or a stream of transcript events.~~

~~50### Parameters~~

~~52- `file: FileTypes`~~

~~54 The audio file object (not file name) to transcribe, in one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm.~~

~~56- `model: Union[str, AudioModel]`~~

58 ID of the model to use. The options are `gpt-4o-transcribe`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `whisper-1` (which is powered by our open source Whisper V2 model), and `gpt-4o-transcribe-diarize`.

~~60 - `str`~~

~~62 - `Literal["whisper-1", "gpt-4o-transcribe", "gpt-4o-mini-transcribe", 2 more]`~~

~~64 - `"whisper-1"`~~

~~66 - `"gpt-4o-transcribe"`~~

~~68 - `"gpt-4o-mini-transcribe"`~~

~~70 - `"gpt-4o-mini-transcribe-2025-12-15"`~~

~~72 - `"gpt-4o-transcribe-diarize"`~~

~~74- `chunking_strategy: Optional[ChunkingStrategy]`~~

76 Controls how the audio is cut into chunks. When set to `"auto"`, the server first normalizes loudness and then uses voice activity detection (VAD) to choose boundaries. A `server_vad` object can be provided to tune the VAD parameters manually. If unset, the audio is transcribed as a single block. Required when using `gpt-4o-transcribe-diarize` for inputs longer than 30 seconds.

~~78 - `Literal["auto"]`~~

~~80 Automatically set chunking parameters based on the audio. Must be set to `"auto"`.~~

~~82 - `"auto"`~~

~~84 - `class ChunkingStrategyVadConfig: …`~~

~~86 - `type: Literal["server_vad"]`~~

~~88 Must be set to `server_vad` to enable manual chunking using server side VAD.~~

~~90 - `"server_vad"`~~

~~92 - `prefix_padding_ms: Optional[int]`~~

~~94 Amount of audio to include before the VAD detected speech (in~~

~~95 milliseconds).~~

~~97 - `silence_duration_ms: Optional[int]`~~

~~99 Duration of silence to detect speech stop (in milliseconds).~~

100 With shorter values the model will respond more quickly,

101 but may jump in on short pauses from the user.

~~102~~

103 - `threshold: Optional[float]`

~~104~~

105 Sensitivity threshold (0.0 to 1.0) for voice activity detection. A

106 higher threshold will require louder audio to activate the model, and

107 thus might perform better in noisy environments.

~~108~~

109- `include: Optional[List[TranscriptionInclude]]`

~~110~~

111 Additional information to include in the transcription response.

112 `logprobs` will return the log probabilities of the tokens in the

113 response to understand the model's confidence in the transcription.

114 `logprobs` only works with `response_format` set to `json` and only with

115 the models `gpt-4o-transcribe`, `gpt-4o-mini-transcribe`, and `gpt-4o-mini-transcribe-2025-12-15`. This field is not supported when using `gpt-4o-transcribe-diarize`.

~~116~~

117 - `"logprobs"`

~~118~~

119- `known_speaker_names: Optional[Sequence[str]]`

~~120~~

121 Optional list of speaker names that correspond to the audio samples provided in `known_speaker_references[]`. Each entry should be a short identifier (for example `customer` or `agent`). Up to 4 speakers are supported.

~~122~~

123- `known_speaker_references: Optional[Sequence[str]]`

~~124~~

125 Optional list of audio samples (as [data URLs](https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/Data_URLs)) that contain known speaker references matching `known_speaker_names[]`. Each sample must be between 2 and 10 seconds, and can use any of the same input audio formats supported by `file`.

~~126~~

127- `language: Optional[str]`

~~128~~

129 The language of the input audio. Supplying the input language in [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) (e.g. `en`) format will improve accuracy and latency.

~~130~~

131- `prompt: Optional[str]`

~~132~~

133 An optional text to guide the model's style or continue a previous audio segment. The [prompt](https://platform.openai.com/docs/guides/speech-to-text#prompting) should match the audio language. This field is not supported when using `gpt-4o-transcribe-diarize`.

~~134~~

135- `response_format: Optional[AudioResponseFormat]`

~~136~~

137 The format of the output, in one of these options: `json`, `text`, `srt`, `verbose_json`, `vtt`, or `diarized_json`. For `gpt-4o-transcribe` and `gpt-4o-mini-transcribe`, the only supported format is `json`. For `gpt-4o-transcribe-diarize`, the supported formats are `json`, `text`, and `diarized_json`, with `diarized_json` required to receive speaker annotations.

~~138~~

139 - `"json"`

~~140~~

141 - `"text"`

~~142~~

143 - `"srt"`

~~144~~

145 - `"verbose_json"`

~~146~~

147 - `"vtt"`

~~148~~

149 - `"diarized_json"`

~~150~~

151- `stream: Optional[Literal[false]]`

~~152~~

153 If set to true, the model response data will be streamed to the client

154 as it is generated using [server-sent events](https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events/Using_server-sent_events#Event_stream_format).

155 See the [Streaming section of the Speech-to-Text guide](https://platform.openai.com/docs/guides/speech-to-text?lang=curl#streaming-transcriptions)

156 for more information.

~~157~~

158 Note: Streaming is not supported for the `whisper-1` model and will be ignored.

~~159~~

160 - `false`

~~161~~

162- `temperature: Optional[float]`

~~163~~

164 The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use [log probability](https://en.wikipedia.org/wiki/Log_probability) to automatically increase the temperature until certain thresholds are hit.

~~165~~

166- `timestamp_granularities: Optional[List[Literal["word", "segment"]]]`

~~167~~

168 The timestamp granularities to populate for this transcription. `response_format` must be set to `verbose_json` to use timestamp granularities. Either or both of these options are supported: `word` and `segment`. Note: There is no additional latency for segment timestamps, but generating word timestamps incurs additional latency.

169 This option is not available for `gpt-4o-transcribe-diarize`.

~~170~~

171 - `"word"`

~~172~~

173 - `"segment"`
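Several of the parameters above constrain one another. The sketch below collects those constraints into two helpers that run locally before a request is made. `server_vad` and `find_conflicts` are hypothetical names rather than SDK functions. Two points are assumptions and not stated in the parameter list: an unset `response_format` is not reported as a conflict with `logprobs`, and speaker names are expected to pair one-to-one with speaker references.

```python
from typing import Any, Dict, List, Optional, Sequence

LOGPROB_MODELS = {
    "gpt-4o-transcribe",
    "gpt-4o-mini-transcribe",
    "gpt-4o-mini-transcribe-2025-12-15",
}
DIARIZE_MODEL = "gpt-4o-transcribe-diarize"


def server_vad(
    threshold: Optional[float] = None,
    prefix_padding_ms: Optional[int] = None,
    silence_duration_ms: Optional[int] = None,
) -> Dict[str, Any]:
    """Build a `server_vad` chunking strategy, leaving unset fields out."""
    if threshold is not None and not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    strategy: Dict[str, Any] = {"type": "server_vad"}
    optional = {
        "threshold": threshold,
        "prefix_padding_ms": prefix_padding_ms,
        "silence_duration_ms": silence_duration_ms,
    }
    strategy.update({k: v for k, v in optional.items() if v is not None})
    return strategy


def find_conflicts(
    model: str,
    response_format: Optional[str] = None,
    include: Sequence[str] = (),
    timestamp_granularities: Sequence[str] = (),
    prompt: Optional[str] = None,
    known_speaker_names: Sequence[str] = (),
    known_speaker_references: Sequence[str] = (),
) -> List[str]:
    """Return human-readable conflicts between the given parameters."""
    problems: List[str] = []
    diarize = model == DIARIZE_MODEL
    if "logprobs" in include:
        if model not in LOGPROB_MODELS:
            problems.append(f"logprobs is not supported by {model}")
        # Assumption: only an explicit non-json format is reported.
        if response_format not in (None, "json"):
            problems.append("logprobs requires response_format='json'")
    if timestamp_granularities:
        if diarize:
            problems.append(f"timestamp_granularities is not available for {model}")
        if response_format != "verbose_json":
            problems.append(
                "timestamp_granularities requires response_format='verbose_json'"
            )
    if prompt is not None and diarize:
        problems.append(f"prompt is not supported by {model}")
    if len(known_speaker_names) > 4:
        problems.append("at most 4 known speakers are supported")
    # Assumption: each name pairs with exactly one reference sample.
    if len(known_speaker_names) != len(known_speaker_references):
        problems.append(
            "known_speaker_names and known_speaker_references differ in length"
        )
    return problems


print(server_vad(threshold=0.6, silence_duration_ms=300))
print(find_conflicts("whisper-1", "verbose_json", timestamp_granularities=["word"]))
```

Returning a list of problems instead of raising on the first one lets a caller report every conflict at once.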

~~174~~

175### Returns

~~176~~

177- `TranscriptionCreateResponse`

~~178~~

179 Represents a transcription response returned by the model, based on the provided input.

~~180~~

181 - `class Transcription: …`

~~182~~

183 Represents a transcription response returned by the model, based on the provided input.

~~184~~

185 - `text: str`

~~186~~

187 The transcribed text.

~~188~~

189 - `logprobs: Optional[List[Logprob]]`

~~190~~

191 The log probabilities of the tokens in the transcription. Only returned with the models `gpt-4o-transcribe` and `gpt-4o-mini-transcribe` if `logprobs` is added to the `include` array.

~~192~~

193 - `token: Optional[str]`

~~194~~

195 The token in the transcription.

~~196~~

197 - `bytes: Optional[List[float]]`

~~198~~

199 The bytes of the token.

~~200~~

201 - `logprob: Optional[float]`

~~202~~

203 The log probability of the token.

~~204~~

205 - `usage: Optional[Usage]`

~~206~~

207 Token usage statistics for the request.

~~208~~

209 - `class UsageTokens: …`

~~210~~

211 Usage statistics for models billed by token usage.

~~212~~

213 - `input_tokens: int`

~~214~~

215 Number of input tokens billed for this request.

~~216~~

217 - `output_tokens: int`

~~218~~

219 Number of output tokens generated.

~~220~~

221 - `total_tokens: int`

~~222~~

223 Total number of tokens used (input + output).

~~224~~

225 - `type: Literal["tokens"]`

~~226~~

227 The type of the usage object. Always `tokens` for this variant.

~~228~~

229 - `"tokens"`

~~230~~

231 - `input_token_details: Optional[UsageTokensInputTokenDetails]`

~~232~~

233 Details about the input tokens billed for this request.

~~234~~

235 - `audio_tokens: Optional[int]`

~~236~~

237 Number of audio tokens billed for this request.

~~238~~

239 - `text_tokens: Optional[int]`

~~240~~

241 Number of text tokens billed for this request.

~~242~~

243 - `class UsageDuration: …`

~~244~~

245 Usage statistics for models billed by audio input duration.

~~246~~

247 - `seconds: float`

~~248~~

249 Duration of the input audio in seconds.

~~250~~

251 - `type: Literal["duration"]`

~~252~~

253 The type of the usage object. Always `duration` for this variant.

~~254~~

255 - `"duration"`

~~256~~

257 - `class TranscriptionDiarized: …`

~~258~~

259 Represents a diarized transcription response returned by the model, including the combined transcript and speaker-segment annotations.

~~260~~

261 - `duration: float`

~~262~~

263 Duration of the input audio in seconds.

~~264~~

265 - `segments: List[TranscriptionDiarizedSegment]`

~~266~~

267 Segments of the transcript annotated with timestamps and speaker labels.

~~268~~

269 - `id: str`

~~270~~

271 Unique identifier for the segment.

~~272~~

273 - `end: float`

~~274~~

275 End timestamp of the segment in seconds.

~~276~~

277 - `speaker: str`

~~278~~

279 Speaker label for this segment. When known speakers are provided, the label matches `known_speaker_names[]`. Otherwise speakers are labeled sequentially using capital letters (`A`, `B`, ...).

~~280~~

281 - `start: float`

~~282~~

283 Start timestamp of the segment in seconds.

~~284~~

285 - `text: str`

~~286~~

287 Transcript text for this segment.

~~288~~

289 - `type: Literal["transcript.text.segment"]`

~~290~~

291 The type of the segment. Always `transcript.text.segment`.

~~292~~

293 - `"transcript.text.segment"`

~~294~~

295 - `task: Literal["transcribe"]`

~~296~~

297 The type of task that was run. Always `transcribe`.

~~298~~

299 - `"transcribe"`

~~300~~

301 - `text: str`

~~302~~

303 The concatenated transcript text for the entire audio input.

~~304~~

305 - `usage: Optional[Usage]`

~~306~~

307 Token or duration usage statistics for the request.

~~308~~

309 - `class UsageTokens: …`

~~310~~

311 Usage statistics for models billed by token usage.

~~312~~

313 - `input_tokens: int`

~~314~~

315 Number of input tokens billed for this request.

~~316~~

317 - `output_tokens: int`

~~318~~

319 Number of output tokens generated.

~~320~~

321 - `total_tokens: int`

~~322~~

323 Total number of tokens used (input + output).

~~324~~

325 - `type: Literal["tokens"]`

~~326~~

327 The type of the usage object. Always `tokens` for this variant.

~~328~~

329 - `"tokens"`

~~330~~

331 - `input_token_details: Optional[UsageTokensInputTokenDetails]`

~~332~~

333 Details about the input tokens billed for this request.

~~334~~

335 - `audio_tokens: Optional[int]`

~~336~~

337 Number of audio tokens billed for this request.

~~338~~

339 - `text_tokens: Optional[int]`

~~340~~

341 Number of text tokens billed for this request.

~~342~~

343 - `class UsageDuration: …`

~~344~~

345 Usage statistics for models billed by audio input duration.

~~346~~

347 - `seconds: float`

~~348~~

349 Duration of the input audio in seconds.

~~350~~

351 - `type: Literal["duration"]`

~~352~~

353 The type of the usage object. Always `duration` for this variant.

~~354~~

355 - `"duration"`

~~356~~

357 - `class TranscriptionVerbose: …`

~~358~~

359 Represents a verbose JSON transcription response returned by the model, based on the provided input.

~~360~~

361 - `duration: float`

~~362~~

363 The duration of the input audio.

~~364~~

365 - `language: str`

~~366~~

367 The language of the input audio.

~~368~~

369 - `text: str`

~~370~~

371 The transcribed text.

~~372~~

373 - `segments: Optional[List[TranscriptionSegment]]`

~~374~~

375 Segments of the transcribed text and their corresponding details.

~~376~~

377 - `id: int`

~~378~~

379 Unique identifier of the segment.

~~380~~

381 - `avg_logprob: float`

~~382~~

383 Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.

~~384~~

385 - `compression_ratio: float`

~~386~~

387 Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.

~~388~~

389 - `end: float`

~~390~~

391 End time of the segment in seconds.

~~392~~

393 - `no_speech_prob: float`

~~394~~

395 Probability of no speech in the segment. If the value is higher than 1.0 and the `avg_logprob` is below -1, consider this segment silent.

~~396~~

397 - `seek: int`

~~398~~

399 Seek offset of the segment.

~~400~~

401 - `start: float`

~~402~~

403 Start time of the segment in seconds.

~~404~~

405 - `temperature: float`

~~406~~

407 Temperature parameter used for generating the segment.

~~408~~

409 - `text: str`

~~410~~

411 Text content of the segment.

~~412~~

413 - `tokens: List[int]`

~~414~~

415 Array of token IDs for the text content.

~~416~~

417 - `usage: Optional[Usage]`

~~418~~

419 Usage statistics for models billed by audio input duration.

~~420~~

421 - `seconds: float`

~~422~~

423 Duration of the input audio in seconds.

~~424~~

425 - `type: Literal["duration"]`

~~426~~

427 The type of the usage object. Always `duration` for this variant.

~~428~~

429 - `"duration"`

~~430~~

431 - `words: Optional[List[TranscriptionWord]]`

~~432~~

433 Extracted words and their corresponding timestamps.

~~434~~

435 - `end: float`

~~436~~

437 End time of the word in seconds.

~~438~~

439 - `start: float`

~~440~~

441 Start time of the word in seconds.

~~442~~

443 - `word: str`

~~444~~

445 The text content of the word.

~~446~~

447### Example

~~448~~

```python
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY"),  # This is the default and can be omitted
)
# Without stream=True, create() returns a single transcription object.
transcription = client.audio.transcriptions.create(
    file=b"Example data",
    model="gpt-4o-transcribe",
)
print(transcription)
```

~~462~~

463#### Response

~~464~~

```json
{
  "text": "text",
  "logprobs": [
    {
      "token": "token",
      "bytes": [
        0
      ],
      "logprob": 0
    }
  ],
  "usage": {
    "input_tokens": 0,
    "output_tokens": 0,
    "total_tokens": 0,
    "type": "tokens",
    "input_token_details": {
      "audio_tokens": 0,
      "text_tokens": 0
    }
  }
}
```

~~489~~

490### Example

~~491~~

```python
from openai import OpenAI

client = OpenAI()

# Open the audio in binary mode; the context manager closes the file afterwards.
with open("speech.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
    )
```

~~502~~

503#### Response

~~504~~

```json
{
  "text": "Imagine the wildest idea that you've ever had, and you're curious about how it might scale to something that's a 100, a 1,000 times bigger. This is a place where you can get to do that.",
  "usage": {
    "type": "tokens",
    "input_tokens": 14,
    "input_token_details": {
      "text_tokens": 0,
      "audio_tokens": 14
    },
    "output_tokens": 45,
    "total_tokens": 59
  }
}
```
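The `usage` object comes in two variants, distinguished by its `type` field: `tokens` and `duration`. A minimal sketch of handling both follows. It assumes the response has already been parsed into a plain `dict`, and `describe_usage` is a hypothetical helper name. The token counts are copied from the response above.

```python
from typing import Any, Dict


def describe_usage(usage: Dict[str, Any]) -> str:
    """Summarize either usage variant as a single line of text."""
    kind = usage.get("type")
    if kind == "tokens":
        # input_token_details is optional, so fall back to an empty dict.
        details = usage.get("input_token_details") or {}
        audio = details.get("audio_tokens", 0)
        text = details.get("text_tokens", 0)
        return (
            f"{usage['input_tokens']} input tokens ({audio} audio, {text} text), "
            f"{usage['output_tokens']} output, {usage['total_tokens']} total"
        )
    if kind == "duration":
        return f"{usage['seconds']} seconds of audio"
    raise ValueError(f"unknown usage type: {kind!r}")


usage = {
    "type": "tokens",
    "input_tokens": 14,
    "input_token_details": {"text_tokens": 0, "audio_tokens": 14},
    "output_tokens": 45,
    "total_tokens": 59,
}
print(describe_usage(usage))
```

Branching on `type` instead of probing for keys keeps the two billing models apart if one variant later gains fields.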

~~520~~

521### Diarization

~~522~~

```python
import base64
from openai import OpenAI

client = OpenAI()

def to_data_url(path: str) -> str:
    with open(path, "rb") as fh:
        return "data:audio/wav;base64," + base64.b64encode(fh.read()).decode("utf-8")

with open("meeting.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe-diarize",
        file=audio_file,
        response_format="diarized_json",
        chunking_strategy="auto",
        extra_body={
            "known_speaker_names": ["agent"],
            "known_speaker_references": [to_data_url("agent.wav")],
        },
    )

print(transcript.segments)
```

~~547~~

548#### Response

~~549~~

```json
{
  "task": "transcribe",
  "duration": 27.4,
  "text": "Agent: Thanks for calling OpenAI support.\nA: Hi, I'm trying to enable diarization.\nAgent: Happy to walk you through the steps.",
  "segments": [
    {
      "type": "transcript.text.segment",
      "id": "seg_001",
      "start": 0.0,
      "end": 4.7,
      "text": "Thanks for calling OpenAI support.",
      "speaker": "agent"
    },
    {
      "type": "transcript.text.segment",
      "id": "seg_002",
      "start": 4.7,
      "end": 11.8,
      "text": "Hi, I'm trying to enable diarization.",
      "speaker": "A"
    },
    {
      "type": "transcript.text.segment",
      "id": "seg_003",
      "start": 12.1,
      "end": 18.5,
      "text": "Happy to walk you through the steps.",
      "speaker": "agent"
    }
  ],
  "usage": {
    "type": "duration",
    "seconds": 27
  }
}
```
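Once a diarized response is available, the `segments` list is usually the part worth processing. The sketch below renders segments as dialogue lines and totals speaking time per speaker. It assumes the segments were parsed from the JSON into plain dicts; the SDK returns objects, so attribute access would replace the key lookups there. `render_dialogue` and `talk_time` are hypothetical helper names, and the sample data is copied from the response above.

```python
from collections import defaultdict
from typing import Any, Dict, List


def render_dialogue(segments: List[Dict[str, Any]]) -> List[str]:
    """Format each segment as '[start-end] speaker: text'."""
    return [
        f"[{s['start']:.1f}-{s['end']:.1f}] {s['speaker']}: {s['text']}"
        for s in segments
    ]


def talk_time(segments: List[Dict[str, Any]]) -> Dict[str, float]:
    """Sum segment durations (in seconds) per speaker label."""
    totals: Dict[str, float] = defaultdict(float)
    for s in segments:
        totals[s["speaker"]] += s["end"] - s["start"]
    return dict(totals)


segments = [
    {"id": "seg_001", "start": 0.0, "end": 4.7, "speaker": "agent",
     "text": "Thanks for calling OpenAI support."},
    {"id": "seg_002", "start": 4.7, "end": 11.8, "speaker": "A",
     "text": "Hi, I'm trying to enable diarization."},
    {"id": "seg_003", "start": 12.1, "end": 18.5, "speaker": "agent",
     "text": "Happy to walk you through the steps."},
]

for line in render_dialogue(segments):
    print(line)
print(talk_time(segments))
```

The speaker label `agent` comes from `known_speaker_names`, while `A` is a label assigned to a speaker with no reference sample, so both kinds of key can appear in the totals.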

~~587~~

588### Streaming

~~589~~

```python
from openai import OpenAI

client = OpenAI()

with open("speech.mp3", "rb") as audio_file:
    stream = client.audio.transcriptions.create(
        file=audio_file,
        model="gpt-4o-mini-transcribe",
        stream=True,
    )

    # Each event is either a text delta or the final "done" event.
    for event in stream:
        print(event)
```

~~604~~

605#### Response

~~606~~

607```json

608data: {"type":"transcript.text.delta","delta":"I","logprobs":[{"token":"I","logprob":-0.00007588794,"bytes":[73]}]}

~~609~~

610data: {"type":"transcript.text.delta","delta":" see","logprobs":[{"token":" see","logprob":-3.1281633e-7,"bytes":[32,115,101,101]}]}

~~611~~

612data: {"type":"transcript.text.delta","delta":" skies","logprobs":[{"token":" skies","logprob":-2.3392786e-6,"bytes":[32,115,107,105,101,115]}]}

~~613~~

614data: {"type":"transcript.text.delta","delta":" of","logprobs":[{"token":" of","logprob":-3.1281633e-7,"bytes":[32,111,102]}]}

~~615~~

616data: {"type":"transcript.text.delta","delta":" blue","logprobs":[{"token":" blue","logprob":-1.0280384e-6,"bytes":[32,98,108,117,101]}]}

~~617~~

618data: {"type":"transcript.text.delta","delta":" and","logprobs":[{"token":" and","logprob":-0.0005108566,"bytes":[32,97,110,100]}]}

~~619~~

620data: {"type":"transcript.text.delta","delta":" clouds","logprobs":[{"token":" clouds","logprob":-1.9361265e-7,"bytes":[32,99,108,111,117,100,115]}]}

~~621~~

622data: {"type":"transcript.text.delta","delta":" of","logprobs":[{"token":" of","logprob":-1.9361265e-7,"bytes":[32,111,102]}]}

~~623~~

624data: {"type":"transcript.text.delta","delta":" white","logprobs":[{"token":" white","logprob":-7.89631e-7,"bytes":[32,119,104,105,116,101]}]}

~~625~~

626data: {"type":"transcript.text.delta","delta":",","logprobs":[{"token":",","logprob":-0.0014890312,"bytes":[44]}]}

~~627~~

628data: {"type":"transcript.text.delta","delta":" the","logprobs":[{"token":" the","logprob":-0.0110956915,"bytes":[32,116,104,101]}]}

~~629~~

630data: {"type":"transcript.text.delta","delta":" bright","logprobs":[{"token":" bright","logprob":0.0,"bytes":[32,98,114,105,103,104,116]}]}

~~631~~

632data: {"type":"transcript.text.delta","delta":" blessed","logprobs":[{"token":" blessed","logprob":-0.000045848617,"bytes":[32,98,108,101,115,115,101,100]}]}

~~633~~

634data: {"type":"transcript.text.delta","delta":" days","logprobs":[{"token":" days","logprob":-0.000010802739,"bytes":[32,100,97,121,115]}]}

~~635~~

636data: {"type":"transcript.text.delta","delta":",","logprobs":[{"token":",","logprob":-0.00001700133,"bytes":[44]}]}

~~637~~

638data: {"type":"transcript.text.delta","delta":" the","logprobs":[{"token":" the","logprob":-0.0000118755715,"bytes":[32,116,104,101]}]}

~~639~~

640data: {"type":"transcript.text.delta","delta":" dark","logprobs":[{"token":" dark","logprob":-5.5122365e-7,"bytes":[32,100,97,114,107]}]}

~~641~~

642data: {"type":"transcript.text.delta","delta":" sacred","logprobs":[{"token":" sacred","logprob":-5.4385737e-6,"bytes":[32,115,97,99,114,101,100]}]}

~~643~~

644data: {"type":"transcript.text.delta","delta":" nights","logprobs":[{"token":" nights","logprob":-4.00813e-6,"bytes":[32,110,105,103,104,116,115]}]}

~~645~~

646data: {"type":"transcript.text.delta","delta":",","logprobs":[{"token":",","logprob":-0.0036910512,"bytes":[44]}]}

~~647~~

648data: {"type":"transcript.text.delta","delta":" and","logprobs":[{"token":" and","logprob":-0.0031903093,"bytes":[32,97,110,100]}]}

~~649~~

650data: {"type":"transcript.text.delta","delta":" I","logprobs":[{"token":" I","logprob":-1.504853e-6,"bytes":[32,73]}]}

~~651~~

652data: {"type":"transcript.text.delta","delta":" think","logprobs":[{"token":" think","logprob":-4.3202e-7,"bytes":[32,116,104,105,110,107]}]}

~~653~~

654data: {"type":"transcript.text.delta","delta":" to","logprobs":[{"token":" to","logprob":-1.9361265e-7,"bytes":[32,116,111]}]}

~~655~~

656data: {"type":"transcript.text.delta","delta":" myself","logprobs":[{"token":" myself","logprob":-1.7432603e-6,"bytes":[32,109,121,115,101,108,102]}]}

~~657~~

658data: {"type":"transcript.text.delta","delta":",","logprobs":[{"token":",","logprob":-0.29254505,"bytes":[44]}]}

~~659~~

660data: {"type":"transcript.text.delta","delta":" what","logprobs":[{"token":" what","logprob":-0.016815351,"bytes":[32,119,104,97,116]}]}

~~661~~

662data: {"type":"transcript.text.delta","delta":" a","logprobs":[{"token":" a","logprob":-3.1281633e-7,"bytes":[32,97]}]}

~~663~~

664data: {"type":"transcript.text.delta","delta":" wonderful","logprobs":[{"token":" wonderful","logprob":-2.1008714e-6,"bytes":[32,119,111,110,100,101,114,102,117,108]}]}

~~665~~

666data: {"type":"transcript.text.delta","delta":" world","logprobs":[{"token":" world","logprob":-8.180258e-6,"bytes":[32,119,111,114,108,100]}]}

~~667~~

668data: {"type":"transcript.text.delta","delta":".","logprobs":[{"token":".","logprob":-0.014231676,"bytes":[46]}]}

~~669~~

670data: {"type":"transcript.text.done","text":"I see skies of blue and clouds of white, the bright blessed days, the dark sacred nights, and I think to myself, what a wonderful world.","logprobs":[{"token":"I","logprob":-0.00007588794,"bytes":[73]},{"token":" see","logprob":-3.1281633e-7,"bytes":[32,115,101,101]},{"token":" skies","logprob":-2.3392786e-6,"bytes":[32,115,107,105,101,115]},{"token":" of","logprob":-3.1281633e-7,"bytes":[32,111,102]},{"token":" blue","logprob":-1.0280384e-6,"bytes":[32,98,108,117,101]},{"token":" and","logprob":-0.0005108566,"bytes":[32,97,110,100]},{"token":" clouds","logprob":-1.9361265e-7,"bytes":[32,99,108,111,117,100,115]},{"token":" of","logprob":-1.9361265e-7,"bytes":[32,111,102]},{"token":" white","logprob":-7.89631e-7,"bytes":[32,119,104,105,116,101]},{"token":",","logprob":-0.0014890312,"bytes":[44]},{"token":" the","logprob":-0.0110956915,"bytes":[32,116,104,101]},{"token":" bright","logprob":0.0,"bytes":[32,98,114,105,103,104,116]},{"token":" blessed","logprob":-0.000045848617,"bytes":[32,98,108,101,115,115,101,100]},{"token":" days","logprob":-0.000010802739,"bytes":[32,100,97,121,115]},{"token":",","logprob":-0.00001700133,"bytes":[44]},{"token":" the","logprob":-0.0000118755715,"bytes":[32,116,104,101]},{"token":" dark","logprob":-5.5122365e-7,"bytes":[32,100,97,114,107]},{"token":" sacred","logprob":-5.4385737e-6,"bytes":[32,115,97,99,114,101,100]},{"token":" nights","logprob":-4.00813e-6,"bytes":[32,110,105,103,104,116,115]},{"token":",","logprob":-0.0036910512,"bytes":[44]},{"token":" and","logprob":-0.0031903093,"bytes":[32,97,110,100]},{"token":" I","logprob":-1.504853e-6,"bytes":[32,73]},{"token":" think","logprob":-4.3202e-7,"bytes":[32,116,104,105,110,107]},{"token":" to","logprob":-1.9361265e-7,"bytes":[32,116,111]},{"token":" myself","logprob":-1.7432603e-6,"bytes":[32,109,121,115,101,108,102]},{"token":",","logprob":-0.29254505,"bytes":[44]},{"token":" 
what","logprob":-0.016815351,"bytes":[32,119,104,97,116]},{"token":" a","logprob":-3.1281633e-7,"bytes":[32,97]},{"token":" wonderful","logprob":-2.1008714e-6,"bytes":[32,119,111,110,100,101,114,102,117,108]},{"token":" world","logprob":-8.180258e-6,"bytes":[32,119,111,114,108,100]},{"token":".","logprob":-0.014231676,"bytes":[46]}],"usage":{"input_tokens":14,"input_token_details":{"text_tokens":0,"audio_tokens":14},"output_tokens":45,"total_tokens":59}}

671```
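The response above is a server-sent event stream, not a single JSON document: each `data:` line carries one JSON event. When using the Python SDK, events arrive already parsed. The sketch below instead handles the raw wire format, which is useful when reading captured output such as the listing above. `assemble_transcript` is a hypothetical helper name, and the sample lines are a shortened, illustrative version of the stream with `logprobs` omitted.

```python
import json
from typing import Iterable, List, Optional, Tuple


def assemble_transcript(lines: Iterable[str]) -> Tuple[str, Optional[str]]:
    """Return (text rebuilt from deltas, text of the final `done` event)."""
    parts: List[str] = []
    final: Optional[str] = None
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip the blank separator lines between events
        event = json.loads(line[len("data:"):])
        if event.get("type") == "transcript.text.delta":
            parts.append(event["delta"])
        elif event.get("type") == "transcript.text.done":
            final = event["text"]
    return "".join(parts), final


# Shortened, illustrative sample in the same shape as the stream above.
sample = [
    'data: {"type":"transcript.text.delta","delta":"I"}',
    "",
    'data: {"type":"transcript.text.delta","delta":" see"}',
    "",
    'data: {"type":"transcript.text.delta","delta":" skies"}',
    "",
    'data: {"type":"transcript.text.done","text":"I see skies"}',
]

streamed, final = assemble_transcript(sample)
print(streamed == final)
```

Concatenating the deltas should reproduce the `text` of the `transcript.text.done` event, which makes a simple consistency check for a stream consumer.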

~~672~~

673### Logprobs

~~674~~

```python
from openai import OpenAI

client = OpenAI()

with open("speech.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        file=audio_file,
        model="gpt-4o-transcribe",
        response_format="json",
        include=["logprobs"],  # request per-token log probabilities
    )

print(transcript)
```

~~689~~

690#### Response

~~691~~

```json
{
  "text": "Hey, my knee is hurting and I want to see the doctor tomorrow ideally.",
  "logprobs": [
    { "token": "Hey", "logprob": -1.0415299, "bytes": [72, 101, 121] },
    { "token": ",", "logprob": -9.805982e-5, "bytes": [44] },
    { "token": " my", "logprob": -0.00229799, "bytes": [32, 109, 121] },
    {
      "token": " knee",
      "logprob": -4.7159858e-5,
      "bytes": [32, 107, 110, 101, 101]
    },
    { "token": " is", "logprob": -0.043909557, "bytes": [32, 105, 115] },
    {
      "token": " hurting",
      "logprob": -1.1041146e-5,
      "bytes": [32, 104, 117, 114, 116, 105, 110, 103]
    },
    { "token": " and", "logprob": -0.011076359, "bytes": [32, 97, 110, 100] },
    { "token": " I", "logprob": -5.3193703e-6, "bytes": [32, 73] },
    {
      "token": " want",
      "logprob": -0.0017156356,
      "bytes": [32, 119, 97, 110, 116]
    },
    { "token": " to", "logprob": -7.89631e-7, "bytes": [32, 116, 111] },
    { "token": " see", "logprob": -5.5122365e-7, "bytes": [32, 115, 101, 101] },
    { "token": " the", "logprob": -0.0040786397, "bytes": [32, 116, 104, 101] },
    {
      "token": " doctor",
      "logprob": -2.3392786e-6,
      "bytes": [32, 100, 111, 99, 116, 111, 114]
    },
    {
      "token": " tomorrow",
      "logprob": -7.89631e-7,
      "bytes": [32, 116, 111, 109, 111, 114, 114, 111, 119]
    },
    {
      "token": " ideally",
      "logprob": -0.5800861,
      "bytes": [32, 105, 100, 101, 97, 108, 108, 121]
    },
    { "token": ".", "logprob": -0.00011093382, "bytes": [46] }
  ],
  "usage": {
    "type": "tokens",
    "input_tokens": 14,
    "input_token_details": {
      "text_tokens": 0,
      "audio_tokens": 14
    },
    "output_tokens": 45,
    "total_tokens": 59
  }
}
```
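A log probability is easier to interpret after converting it to a probability with `exp`. The sketch below flags tokens the model was less sure about and rebuilds a token from its `bytes` field. It assumes plain-dict input. `low_confidence_tokens` and `decode_token` are hypothetical helper names, the 0.9 cutoff is an arbitrary illustration and not a documented value, and the four entries are copied from the response above.

```python
import math
from typing import Any, Dict, List


def low_confidence_tokens(
    logprobs: List[Dict[str, Any]], min_probability: float = 0.9
) -> List[str]:
    """Return tokens whose probability exp(logprob) is below the cutoff."""
    return [
        entry["token"]
        for entry in logprobs
        if math.exp(entry["logprob"]) < min_probability
    ]


def decode_token(entry: Dict[str, Any]) -> str:
    """Rebuild the token text from its UTF-8 byte values."""
    # The field is typed as a list of floats, so cast each value to int.
    return bytes(int(b) for b in entry["bytes"]).decode("utf-8")


logprobs = [
    {"token": "Hey", "logprob": -1.0415299, "bytes": [72, 101, 121]},
    {"token": ",", "logprob": -9.805982e-5, "bytes": [44]},
    {"token": " is", "logprob": -0.043909557, "bytes": [32, 105, 115]},
    {"token": " ideally", "logprob": -0.5800861,
     "bytes": [32, 105, 100, 101, 97, 108, 108, 121]},
]

print(low_confidence_tokens(logprobs))
print(decode_token(logprobs[0]))
```

In this sample the opening `Hey` and the trailing ` ideally` fall below the cutoff, which suggests a way to pick out words worth a manual review.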

~~749~~

750### Word timestamps

~~751~~

```python
from openai import OpenAI

client = OpenAI()

with open("speech.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        file=audio_file,
        model="whisper-1",
        response_format="verbose_json",  # required for timestamp granularities
        timestamp_granularities=["word"],
    )

print(transcript.words)
```

~~766~~

767#### Response

~~768~~

```json
{
  "task": "transcribe",
  "language": "english",
  "duration": 8.470000267028809,
  "text": "The beach was a popular spot on a hot summer day. People were swimming in the ocean, building sandcastles, and playing beach volleyball.",
  "words": [
    {
      "word": "The",
      "start": 0.0,
      "end": 0.23999999463558197
    },
    ...
    {
      "word": "volleyball",
      "start": 7.400000095367432,
      "end": 7.900000095367432
    }
  ],
  "usage": {
    "type": "duration",
    "seconds": 9
  }
}
```
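Word timestamps arrive as floating-point seconds, which are awkward to read directly. The sketch below turns each word into a subtitle-style cue line. It assumes plain-dict input. `format_timestamp` and `word_cues` are hypothetical helper names, and the `HH:MM:SS,mmm` layout follows the common SRT convention, chosen here for illustration and not prescribed by the API. The two words are copied from the response above.

```python
from typing import Any, Dict, List


def format_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm, rounding to the nearest millisecond."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def word_cues(words: List[Dict[str, Any]]) -> List[str]:
    """Render one 'start --> end word' line per word."""
    return [
        f"{format_timestamp(w['start'])} --> {format_timestamp(w['end'])} {w['word']}"
        for w in words
    ]


words = [
    {"word": "The", "start": 0.0, "end": 0.23999999463558197},
    {"word": "volleyball", "start": 7.400000095367432, "end": 7.900000095367432},
]

for cue in word_cues(words):
    print(cue)
```

Rounding to whole milliseconds hides the float noise visible in values such as `0.23999999463558197`.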

~~794~~

795### Segment timestamps

~~796~~

```python
from openai import OpenAI

client = OpenAI()

with open("speech.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        file=audio_file,
        model="whisper-1",
        response_format="verbose_json",  # required for timestamp granularities
        timestamp_granularities=["segment"],
    )

# Segment-level timestamps are returned in `segments`, not `words`.
print(transcript.segments)
```

~~811~~

812#### Response

~~813~~

```json
{
  "task": "transcribe",
  "language": "english",
  "duration": 8.470000267028809,
  "text": "The beach was a popular spot on a hot summer day. People were swimming in the ocean, building sandcastles, and playing beach volleyball.",
  "segments": [
    {
      "id": 0,
      "seek": 0,
      "start": 0.0,
      "end": 3.319999933242798,
      "text": " The beach was a popular spot on a hot summer day.",
      "tokens": [
        50364, 440, 7534, 390, 257, 3743, 4008, 322, 257, 2368, 4266, 786, 13, 50530
      ],
      "temperature": 0.0,
      "avg_logprob": -0.2860786020755768,
      "compression_ratio": 1.2363636493682861,
      "no_speech_prob": 0.00985979475080967
    },
    ...
  ],
  "usage": {
    "type": "duration",
    "seconds": 9
  }
}
```
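The `TranscriptionSegment` field descriptions give two rules of thumb for spotting unreliable segments: an `avg_logprob` below -1 and a `compression_ratio` above 2.4. The sketch below applies both to plain-dict segments. `flag_segment` and the constant names are hypothetical. The `no_speech_prob` check is left out on purpose, because its documented threshold of 1.0 cannot be exceeded by a probability. The first segment is copied from the response above and the second is invented to trigger both flags.

```python
from typing import Any, Dict, List

# Thresholds quoted from the TranscriptionSegment field descriptions.
AVG_LOGPROB_FLOOR = -1.0
COMPRESSION_RATIO_CEILING = 2.4


def flag_segment(segment: Dict[str, Any]) -> List[str]:
    """Return the names of the quality checks that this segment fails."""
    flags: List[str] = []
    if segment["avg_logprob"] < AVG_LOGPROB_FLOOR:
        flags.append("low_avg_logprob")
    if segment["compression_ratio"] > COMPRESSION_RATIO_CEILING:
        flags.append("high_compression_ratio")
    return flags


good = {
    "id": 0,
    "avg_logprob": -0.2860786020755768,
    "compression_ratio": 1.2363636493682861,
}
# Invented values, used only to show both checks firing.
bad = {"id": 1, "avg_logprob": -1.3, "compression_ratio": 2.9}

print(flag_segment(good))
print(flag_segment(bad))
```

A very high compression ratio usually points to repetitive text, so a flagged segment is a reasonable candidate for a second pass or a manual review.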

~~843~~

844## Domain Types

~~845~~

846### Transcription

~~847~~

848- `class Transcription: …`

~~849~~

850 Represents a transcription response returned by the model, based on the provided input.

~~851~~

852 - `text: str`

~~853~~

854 The transcribed text.

~~855~~

856 - `logprobs: Optional[List[Logprob]]`

~~857~~

858 The log probabilities of the tokens in the transcription. Only returned with the models `gpt-4o-transcribe` and `gpt-4o-mini-transcribe` if `logprobs` is added to the `include` array.

~~859~~

860 - `token: Optional[str]`

~~861~~

862 The token in the transcription.

~~863~~

864 - `bytes: Optional[List[float]]`

~~865~~

866 The bytes of the token.

~~867~~

868 - `logprob: Optional[float]`

~~869~~

870 The log probability of the token.

~~871~~

872 - `usage: Optional[Usage]`

~~873~~

874 Token usage statistics for the request.

~~875~~

876 - `class UsageTokens: …`

~~877~~

878 Usage statistics for models billed by token usage.

~~879~~

880 - `input_tokens: int`

~~881~~

882 Number of input tokens billed for this request.

~~883~~

884 - `output_tokens: int`

~~885~~

886 Number of output tokens generated.

~~887~~

888 - `total_tokens: int`

~~889~~

890 Total number of tokens used (input + output).

~~891~~

892 - `type: Literal["tokens"]`

~~893~~

894 The type of the usage object. Always `tokens` for this variant.

~~895~~

896 - `"tokens"`

~~897~~

898 - `input_token_details: Optional[UsageTokensInputTokenDetails]`

~~899~~

900 Details about the input tokens billed for this request.

~~901~~

902 - `audio_tokens: Optional[int]`

~~903~~

904 Number of audio tokens billed for this request.

~~905~~

906 - `text_tokens: Optional[int]`

~~907~~

908 Number of text tokens billed for this request.

~~909~~

910 - `class UsageDuration: …`

~~911~~

912 Usage statistics for models billed by audio input duration.

~~913~~

914 - `seconds: float`

~~915~~

916 Duration of the input audio in seconds.

~~917~~

918 - `type: Literal["duration"]`

~~919~~

920 The type of the usage object. Always `duration` for this variant.

~~921~~

922 - `"duration"`
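
The `usage` field above is a union of two shapes that share a `type` discriminator (`tokens` or `duration`). As a minimal sketch, assuming the usage object has already been converted to a plain dictionary with the field names listed above, the two variants can be told apart like this. The helper name `describe_usage` is hypothetical and not part of the SDK.

```python
from typing import Any, Mapping, Optional


def describe_usage(usage: Optional[Mapping[str, Any]]) -> str:
    """Summarise a usage payload by branching on its `type` discriminator."""
    if usage is None:
        return "usage not reported"
    kind = usage["type"]
    if kind == "tokens":
        # `input_token_details` and its fields are optional, so default to empty.
        details = usage.get("input_token_details") or {}
        audio = details.get("audio_tokens")
        suffix = f", {audio} audio" if audio is not None else ""
        return (
            f"{usage['total_tokens']} tokens "
            f"({usage['input_tokens']} in, {usage['output_tokens']} out{suffix})"
        )
    if kind == "duration":
        return f"{usage['seconds']} seconds of audio"
    raise ValueError(f"unexpected usage type: {kind!r}")


print(describe_usage({"type": "duration", "seconds": 9}))
```

Raising on an unknown `type` is a design choice for this sketch; a caller that prefers to tolerate future variants could return a generic string instead.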

~~923~~

924### Transcription Diarized

~~925~~

926- `class TranscriptionDiarized: …`

~~927~~

928 Represents a diarized transcription response returned by the model, including the combined transcript and speaker-segment annotations.

~~929~~

930 - `duration: float`

~~931~~

932 Duration of the input audio in seconds.

~~933~~

934 - `segments: List[TranscriptionDiarizedSegment]`

~~935~~

936 Segments of the transcript annotated with timestamps and speaker labels.

~~937~~

938 - `id: str`

~~939~~

940 Unique identifier for the segment.

~~941~~

942 - `end: float`

~~943~~

944 End timestamp of the segment in seconds.

~~945~~

946 - `speaker: str`

~~947~~

948 Speaker label for this segment. When known speakers are provided, the label matches `known_speaker_names[]`. Otherwise speakers are labeled sequentially using capital letters (`A`, `B`, ...).

~~949~~

950 - `start: float`

~~951~~

952 Start timestamp of the segment in seconds.

~~953~~

954 - `text: str`

~~955~~

956 Transcript text for this segment.

~~957~~

958 - `type: Literal["transcript.text.segment"]`

~~959~~

960 The type of the segment. Always `transcript.text.segment`.

~~961~~

962 - `"transcript.text.segment"`

~~963~~

964 - `task: Literal["transcribe"]`

~~965~~

966 The type of task that was run. Always `transcribe`.

~~967~~

968 - `"transcribe"`

~~969~~

970 - `text: str`

~~971~~

972 The concatenated transcript text for the entire audio input.

~~973~~

974 - `usage: Optional[Usage]`

~~975~~

976 Token or duration usage statistics for the request.

~~977~~

978 - `class UsageTokens: …`

~~979~~

980 Usage statistics for models billed by token usage.

~~981~~

982 - `input_tokens: int`

~~983~~

984 Number of input tokens billed for this request.

~~985~~

986 - `output_tokens: int`

~~987~~

988 Number of output tokens generated.

~~989~~

990 - `total_tokens: int`

~~991~~

992 Total number of tokens used (input + output).

~~993~~

994 - `type: Literal["tokens"]`

~~995~~

996 The type of the usage object. Always `tokens` for this variant.

~~997~~

998 - `"tokens"`

~~999~~

1000 - `input_token_details: Optional[UsageTokensInputTokenDetails]`

~~1001~~

1002 Details about the input tokens billed for this request.

~~1003~~

1004 - `audio_tokens: Optional[int]`

~~1005~~

1006 Number of audio tokens billed for this request.

~~1007~~

1008 - `text_tokens: Optional[int]`

~~1009~~

1010 Number of text tokens billed for this request.

~~1011~~

1012 - `class UsageDuration: …`

~~1013~~

1014 Usage statistics for models billed by audio input duration.

~~1015~~

1016 - `seconds: float`

~~1017~~

1018 Duration of the input audio in seconds.

~~1019~~

1020 - `type: Literal["duration"]`

~~1021~~

1022 The type of the usage object. Always `duration` for this variant.

~~1023~~

1024 - `"duration"`
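
Because each diarized segment carries `speaker`, `start`, and `end`, per-speaker talk time can be derived from `segments` alone. The following is a sketch under the assumption that the segments are available as plain dictionaries with those keys; `talk_time` is a hypothetical helper, not an SDK function.

```python
from typing import Any, Dict, Iterable, Mapping


def talk_time(segments: Iterable[Mapping[str, Any]]) -> Dict[str, float]:
    """Sum the duration (end - start, in seconds) of every segment per speaker."""
    totals: Dict[str, float] = {}
    for seg in segments:
        speaker = seg["speaker"]
        totals[speaker] = totals.get(speaker, 0.0) + (seg["end"] - seg["start"])
    return totals


example_segments = [
    {"speaker": "A", "start": 0.0, "end": 1.5},
    {"speaker": "B", "start": 1.5, "end": 2.0},
    {"speaker": "A", "start": 2.0, "end": 4.0},
]
print(talk_time(example_segments))
```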

~~1025~~

1026### Transcription Diarized Segment

~~1027~~

1028- `class TranscriptionDiarizedSegment: …`

~~1029~~

1030 A segment of diarized transcript text with speaker metadata.

~~1031~~

1032 - `id: str`

~~1033~~

1034 Unique identifier for the segment.

~~1035~~

1036 - `end: float`

~~1037~~

1038 End timestamp of the segment in seconds.

~~1039~~

1040 - `speaker: str`

~~1041~~

1042 Speaker label for this segment. When known speakers are provided, the label matches `known_speaker_names[]`. Otherwise speakers are labeled sequentially using capital letters (`A`, `B`, ...).

~~1043~~

1044 - `start: float`

~~1045~~

1046 Start timestamp of the segment in seconds.

~~1047~~

1048 - `text: str`

~~1049~~

1050 Transcript text for this segment.

~~1051~~

1052 - `type: Literal["transcript.text.segment"]`

~~1053~~

1054 The type of the segment. Always `transcript.text.segment`.

~~1055~~

1056 - `"transcript.text.segment"`
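
A common way to present diarized output is to merge consecutive segments from the same speaker into one turn. The sketch below assumes a small stand-in dataclass, `Segment`, that mirrors only the `speaker`, `start`, `end`, and `text` fields described above; both `Segment` and the helper names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Segment:
    """Hypothetical stand-in for the diarized segment fields used in this sketch."""
    speaker: str
    start: float
    end: float
    text: str


def merge_turns(segments: Iterable[Segment]) -> List[Segment]:
    """Merge adjacent segments that share a speaker label into a single turn."""
    turns: List[Segment] = []
    for seg in segments:
        text = seg.text.strip()
        if turns and turns[-1].speaker == seg.speaker:
            last = turns[-1]
            # Extend the previous turn: keep its start, take the new end.
            turns[-1] = Segment(last.speaker, last.start, seg.end, f"{last.text} {text}")
        else:
            turns.append(Segment(seg.speaker, seg.start, seg.end, text))
    return turns


def format_turns(turns: Iterable[Segment]) -> str:
    """Render one line per turn as '[start-end] speaker: text'."""
    return "\n".join(f"[{t.start:.1f}-{t.end:.1f}] {t.speaker}: {t.text}" for t in turns)


demo = [
    Segment("A", 0.0, 1.5, "Hello"),
    Segment("A", 1.5, 3.0, " there."),
    Segment("B", 3.0, 4.2, "Hi."),
]
print(format_turns(merge_turns(demo)))
```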

~~1057~~

1058### Transcription Include

~~1059~~

1060- `Literal["logprobs"]`

~~1061~~

1062 - `"logprobs"`
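
When `logprobs` is requested, each entry exposes a natural-log probability in its `logprob` field, so `exp(logprob)` recovers the token probability. As an illustrative sketch (the helper names are hypothetical, and the entries are assumed to be plain dictionaries with `token` and `logprob` keys), per-token probabilities and a geometric-mean confidence can be computed as follows.

```python
import math
from typing import Any, Iterable, List, Mapping, Optional, Tuple


def token_probabilities(
    logprobs: Iterable[Mapping[str, Any]],
) -> List[Tuple[Optional[str], float]]:
    """Convert each log probability to a probability in [0, 1]."""
    return [
        (lp.get("token"), math.exp(lp["logprob"]))
        for lp in logprobs
        if lp.get("logprob") is not None
    ]


def mean_token_probability(logprobs: Iterable[Mapping[str, Any]]) -> Optional[float]:
    """Geometric mean of token probabilities, i.e. exp of the mean logprob."""
    values = [lp["logprob"] for lp in logprobs if lp.get("logprob") is not None]
    if not values:
        return None
    return math.exp(sum(values) / len(values))


sample = [
    {"token": "Hi", "logprob": 0.0},
    {"token": "!", "logprob": math.log(0.25)},
]
print(token_probabilities(sample))
print(mean_token_probability(sample))
```

Entries whose `logprob` is missing are skipped, since the field is documented as optional.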

~~1063~~

1064### Transcription Segment

~~1065~~

1066- `class TranscriptionSegment: …`

~~1067~~

1068 - `id: int`

~~1069~~

1070 Unique identifier of the segment.

~~1071~~

1072 - `avg_logprob: float`

~~1073~~

1074 Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.

~~1075~~

1076 - `compression_ratio: float`

~~1077~~

1078 Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.

~~1079~~

1080 - `end: float`

~~1081~~

1082 End time of the segment in seconds.

~~1083~~

1084 - `no_speech_prob: float`

~~1085~~

1086 Probability of no speech in the segment. If the value is higher than 1.0 and the `avg_logprob` is below -1, consider this segment silent.

~~1087~~

1088 - `seek: int`

~~1089~~

1090 Seek offset of the segment.

~~1091~~

1092 - `start: float`

~~1093~~

1094 Start time of the segment in seconds.

~~1095~~

1096 - `temperature: float`

~~1097~~

1098 Temperature parameter used for generating the segment.

~~1099~~

1100 - `text: str`

~~1101~~

1102 Text content of the segment.

~~1103~~

1104 - `tokens: List[int]`

~~1105~~

1106 Array of token IDs for the text content.
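
The field descriptions above give two rule-of-thumb thresholds: an `avg_logprob` below -1 and a `compression_ratio` above 2.4. A minimal sketch of applying them, assuming segments are plain dictionaries and using a hypothetical helper name, might look like this. The thresholds are exposed as parameters because they are guidance rather than hard limits.

```python
from typing import Any, List, Mapping


def segment_warnings(
    segment: Mapping[str, Any],
    min_avg_logprob: float = -1.0,
    max_compression_ratio: float = 2.4,
) -> List[str]:
    """Return a list of quality warnings for one transcription segment."""
    warnings: List[str] = []
    if segment["avg_logprob"] < min_avg_logprob:
        warnings.append("low avg_logprob")
    if segment["compression_ratio"] > max_compression_ratio:
        warnings.append("high compression_ratio")
    return warnings


print(segment_warnings({"avg_logprob": -1.3, "compression_ratio": 2.9}))
```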

~~1107~~

1108### Transcription Stream Event

~~1109~~

1110- `TranscriptionStreamEvent`

~~1111~~

1112 A streamed transcription event. It is one of `TranscriptionTextSegmentEvent`, `TranscriptionTextDeltaEvent`, or `TranscriptionTextDoneEvent`, distinguished by the `type` field.

~~1113~~

1114 - `class TranscriptionTextSegmentEvent: …`

~~1115~~

1116 Emitted when a diarized transcription returns a completed segment with speaker information. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with `stream` set to `true` and `response_format` set to `diarized_json`.

~~1117~~

1118 - `id: str`

~~1119~~

1120 Unique identifier for the segment.

~~1121~~

1122 - `end: float`

~~1123~~

1124 End timestamp of the segment in seconds.

~~1125~~

1126 - `speaker: str`

~~1127~~

1128 Speaker label for this segment.

~~1129~~

1130 - `start: float`

~~1131~~

1132 Start timestamp of the segment in seconds.

~~1133~~

1134 - `text: str`

~~1135~~

1136 Transcript text for this segment.

~~1137~~

1138 - `type: Literal["transcript.text.segment"]`

~~1139~~

1140 The type of the event. Always `transcript.text.segment`.

~~1141~~

1142 - `"transcript.text.segment"`

~~1143~~

1144 - `class TranscriptionTextDeltaEvent: …`

~~1145~~

1146 Emitted when there is an additional text delta. This is also the first event emitted when the transcription starts. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `stream` parameter set to `true`.

~~1147~~

1148 - `delta: str`

~~1149~~

1150 The text delta that was additionally transcribed.

~~1151~~

1152 - `type: Literal["transcript.text.delta"]`

~~1153~~

1154 The type of the event. Always `transcript.text.delta`.

~~1155~~

1156 - `"transcript.text.delta"`

~~1157~~

1158 - `logprobs: Optional[List[Logprob]]`

~~1159~~

1160 The log probabilities of the delta. Only included if you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `include[]` parameter set to `logprobs`.

~~1161~~

1162 - `token: Optional[str]`

~~1163~~

1164 The token that was used to generate the log probability.

~~1165~~

1166 - `bytes: Optional[List[int]]`

~~1167~~

1168 The bytes that were used to generate the log probability.

~~1169~~

1170 - `logprob: Optional[float]`

~~1171~~

1172 The log probability of the token.

~~1173~~

1174 - `segment_id: Optional[str]`

~~1175~~

1176 Identifier of the diarized segment that this delta belongs to. Only present when using `gpt-4o-transcribe-diarize`.

~~1177~~

1178 - `class TranscriptionTextDoneEvent: …`

~~1179~~

1180 Emitted when the transcription is complete. Contains the complete transcription text. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `stream` parameter set to `true`.

~~1181~~

1182 - `text: str`

~~1183~~

1184 The text that was transcribed.

~~1185~~

1186 - `type: Literal["transcript.text.done"]`

~~1187~~

1188 The type of the event. Always `transcript.text.done`.

~~1189~~

1190 - `"transcript.text.done"`

~~1191~~

1192 - `logprobs: Optional[List[Logprob]]`

~~1193~~

1194 The log probabilities of the individual tokens in the transcription. Only included if you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `include[]` parameter set to `logprobs`.

~~1195~~

1196 - `token: Optional[str]`

~~1197~~

1198 The token that was used to generate the log probability.

~~1199~~

1200 - `bytes: Optional[List[int]]`

~~1201~~

1202 The bytes that were used to generate the log probability.

~~1203~~

1204 - `logprob: Optional[float]`

~~1205~~

1206 The log probability of the token.

~~1207~~

1208 - `usage: Optional[Usage]`

~~1209~~

1210 Usage statistics for models billed by token usage.

~~1211~~

1212 - `input_tokens: int`

~~1213~~

1214 Number of input tokens billed for this request.

~~1215~~

1216 - `output_tokens: int`

~~1217~~

1218 Number of output tokens generated.

~~1219~~

1220 - `total_tokens: int`

~~1221~~

1222 Total number of tokens used (input + output).

~~1223~~

1224 - `type: Literal["tokens"]`

~~1225~~

1226 The type of the usage object. Always `tokens` for this variant.

~~1227~~

1228 - `"tokens"`

~~1229~~

1230 - `input_token_details: Optional[UsageInputTokenDetails]`

~~1231~~

1232 Details about the input tokens billed for this request.

~~1233~~

1234 - `audio_tokens: Optional[int]`

~~1235~~

1236 Number of audio tokens billed for this request.

~~1237~~

1238 - `text_tokens: Optional[int]`

~~1239~~

1240 Number of text tokens billed for this request.
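
The three stream event variants above are distinguished by their `type` value, so a consumer can dispatch on it. The sketch below assumes the events have been decoded into plain dictionaries with the documented field names; `collect_stream` is a hypothetical helper and the sample events are made up for illustration.

```python
from typing import Any, Dict, Iterable, List, Mapping, Optional, Tuple


def collect_stream(events: Iterable[Mapping[str, Any]]) -> Dict[str, Any]:
    """Accumulate deltas and diarized segments; prefer the final `done` text."""
    deltas: List[str] = []
    segments: List[Tuple[str, str]] = []
    final_text: Optional[str] = None
    for event in events:
        kind = event["type"]
        if kind == "transcript.text.delta":
            deltas.append(event["delta"])
        elif kind == "transcript.text.segment":
            segments.append((event["speaker"], event["text"]))
        elif kind == "transcript.text.done":
            final_text = event["text"]
        # Unknown event types are ignored in this sketch.
    text = final_text if final_text is not None else "".join(deltas)
    return {"text": text, "segments": segments}


sample_events = [
    {"type": "transcript.text.delta", "delta": "Hello"},
    {"type": "transcript.text.delta", "delta": " world"},
    {"type": "transcript.text.segment", "speaker": "A", "text": "Hello world",
     "id": "seg_0", "start": 0.0, "end": 1.0},
    {"type": "transcript.text.done", "text": "Hello world"},
]
print(collect_stream(sample_events))
```

Falling back to the joined deltas when no `transcript.text.done` event arrives keeps partial output usable if the stream is cut short.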

~~1241~~

1242### Transcription Text Delta Event

~~1243~~

1244- `class TranscriptionTextDeltaEvent: …`

~~1245~~

1246 Emitted when there is an additional text delta. This is also the first event emitted when the transcription starts. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `stream` parameter set to `true`.

~~1247~~

1248 - `delta: str`

~~1249~~

1250 The text delta that was additionally transcribed.

~~1251~~

1252 - `type: Literal["transcript.text.delta"]`

~~1253~~

1254 The type of the event. Always `transcript.text.delta`.

~~1255~~

1256 - `"transcript.text.delta"`

~~1257~~

1258 - `logprobs: Optional[List[Logprob]]`

~~1259~~

1260 The log probabilities of the delta. Only included if you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `include[]` parameter set to `logprobs`.

~~1261~~

1262 - `token: Optional[str]`

~~1263~~

1264 The token that was used to generate the log probability.

~~1265~~

1266 - `bytes: Optional[List[int]]`

~~1267~~

1268 The bytes that were used to generate the log probability.

~~1269~~

1270 - `logprob: Optional[float]`

~~1271~~

1272 The log probability of the token.

~~1273~~

1274 - `segment_id: Optional[str]`

~~1275~~

1276 Identifier of the diarized segment that this delta belongs to. Only present when using `gpt-4o-transcribe-diarize`.
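
Since `segment_id` ties a delta to its diarized segment, deltas can be regrouped per segment while streaming. This is a sketch only, assuming delta events are plain dictionaries; `group_deltas` is a hypothetical helper, and deltas without a `segment_id` are collected under the key `None`.

```python
from typing import Any, Dict, Iterable, Mapping, Optional


def group_deltas(deltas: Iterable[Mapping[str, Any]]) -> Dict[Optional[str], str]:
    """Concatenate delta text per `segment_id`, preserving arrival order."""
    grouped: Dict[Optional[str], str] = {}
    for event in deltas:
        key = event.get("segment_id")
        grouped[key] = grouped.get(key, "") + event["delta"]
    return grouped


delta_events = [
    {"type": "transcript.text.delta", "delta": "Good ", "segment_id": "seg_0"},
    {"type": "transcript.text.delta", "delta": "morning.", "segment_id": "seg_0"},
    {"type": "transcript.text.delta", "delta": "Hi.", "segment_id": "seg_1"},
]
print(group_deltas(delta_events))
```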

~~1277~~

1278### Transcription Text Done Event

~~1279~~

1280- `class TranscriptionTextDoneEvent: …`

~~1281~~

1282 Emitted when the transcription is complete. Contains the complete transcription text. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `stream` parameter set to `true`.

~~1283~~

1284 - `text: str`

~~1285~~

1286 The text that was transcribed.

~~1287~~

1288 - `type: Literal["transcript.text.done"]`

~~1289~~

1290 The type of the event. Always `transcript.text.done`.

~~1291~~

1292 - `"transcript.text.done"`

~~1293~~

1294 - `logprobs: Optional[List[Logprob]]`

~~1295~~

1296 The log probabilities of the individual tokens in the transcription. Only included if you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `include[]` parameter set to `logprobs`.

~~1297~~

1298 - `token: Optional[str]`

~~1299~~

1300 The token that was used to generate the log probability.

~~1301~~

1302 - `bytes: Optional[List[int]]`

~~1303~~

1304 The bytes that were used to generate the log probability.

~~1305~~

1306 - `logprob: Optional[float]`

~~1307~~

1308 The log probability of the token.

~~1309~~

1310 - `usage: Optional[Usage]`

~~1311~~

1312 Usage statistics for models billed by token usage.

~~1313~~

1314 - `input_tokens: int`

~~1315~~

1316 Number of input tokens billed for this request.

~~1317~~

1318 - `output_tokens: int`

~~1319~~

1320 Number of output tokens generated.

~~1321~~

1322 - `total_tokens: int`

~~1323~~

1324 Total number of tokens used (input + output).

~~1325~~

1326 - `type: Literal["tokens"]`

~~1327~~

1328 The type of the usage object. Always `tokens` for this variant.

~~1329~~

1330 - `"tokens"`

~~1331~~

1332 - `input_token_details: Optional[UsageInputTokenDetails]`

~~1333~~

1334 Details about the input tokens billed for this request.

~~1335~~

1336 - `audio_tokens: Optional[int]`

~~1337~~

1338 Number of audio tokens billed for this request.

~~1339~~

1340 - `text_tokens: Optional[int]`

~~1341~~

1342 Number of text tokens billed for this request.

~~1343~~

1344### Transcription Text Segment Event

~~1345~~

1346- `class TranscriptionTextSegmentEvent: …`

~~1347~~

1348 Emitted when a diarized transcription returns a completed segment with speaker information. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with `stream` set to `true` and `response_format` set to `diarized_json`.

~~1349~~

1350 - `id: str`

~~1351~~

1352 Unique identifier for the segment.

~~1353~~

1354 - `end: float`

~~1355~~

1356 End timestamp of the segment in seconds.

~~1357~~

1358 - `speaker: str`

~~1359~~

1360 Speaker label for this segment.

~~1361~~

1362 - `start: float`

~~1363~~

1364 Start timestamp of the segment in seconds.

~~1365~~

1366 - `text: str`

~~1367~~

1368 Transcript text for this segment.

~~1369~~

1370 - `type: Literal["transcript.text.segment"]`

~~1371~~

1372 The type of the event. Always `transcript.text.segment`.

~~1373~~

1374 - `"transcript.text.segment"`

~~1375~~

1376### Transcription Verbose

~~1377~~

1378- `class TranscriptionVerbose: …`

~~1379~~

1380 Represents a verbose JSON transcription response returned by the model, based on the provided input.

~~1381~~

1382 - `duration: float`

~~1383~~

1384 The duration of the input audio.

~~1385~~

1386 - `language: str`

~~1387~~

1388 The language of the input audio.

~~1389~~

1390 - `text: str`

~~1391~~

1392 The transcribed text.

~~1393~~

1394 - `segments: Optional[List[TranscriptionSegment]]`

~~1395~~

1396 Segments of the transcribed text and their corresponding details.

~~1397~~

1398 - `id: int`

~~1399~~

1400 Unique identifier of the segment.

~~1401~~

1402 - `avg_logprob: float`

~~1403~~

1404 Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.

~~1405~~

1406 - `compression_ratio: float`

~~1407~~

1408 Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.

~~1409~~

1410 - `end: float`

~~1411~~

1412 End time of the segment in seconds.

~~1413~~

1414 - `no_speech_prob: float`

~~1415~~

1416 Probability of no speech in the segment. If the value is higher than 1.0 and the `avg_logprob` is below -1, consider this segment silent.

~~1417~~

1418 - `seek: int`

~~1419~~

1420 Seek offset of the segment.

~~1421~~

1422 - `start: float`

~~1423~~

1424 Start time of the segment in seconds.

~~1425~~

1426 - `temperature: float`

~~1427~~

1428 Temperature parameter used for generating the segment.

~~1429~~

1430 - `text: str`

~~1431~~

1432 Text content of the segment.

~~1433~~

1434 - `tokens: List[int]`

~~1435~~

1436 Array of token IDs for the text content.

~~1437~~

1438 - `usage: Optional[Usage]`

~~1439~~

1440 Usage statistics for models billed by audio input duration.

~~1441~~

1442 - `seconds: float`

~~1443~~

1444 Duration of the input audio in seconds.

~~1445~~

1446 - `type: Literal["duration"]`

~~1447~~

1448 The type of the usage object. Always `duration` for this variant.

~~1449~~

1450 - `"duration"`

~~1451~~

1452 - `words: Optional[List[TranscriptionWord]]`

~~1453~~

1454 Extracted words and their corresponding timestamps.

~~1455~~

1456 - `end: float`

~~1457~~

1458 End time of the word in seconds.

~~1459~~

1460 - `start: float`

~~1461~~

1462 Start time of the word in seconds.

~~1463~~

1464 - `word: str`

~~1465~~

1466 The text content of the word.

~~1467~~

1468### Transcription Word

~~1469~~

1470- `class TranscriptionWord: …`

~~1471~~

1472 - `end: float`

~~1473~~

1474 End time of the word in seconds.

~~1475~~

1476 - `start: float`

~~1477~~

1478 Start time of the word in seconds.

~~1479~~

1480 - `word: str`

~~1481~~

1482 The text content of the word.
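
Word-level timestamps (`start`, `end`, `word`) are enough to build simple subtitle cues client-side. The following sketch groups words into fixed-size cues and renders them in SubRip (SRT) layout. It assumes words are plain dictionaries with the three fields above; the helper names and the `max_words` grouping rule are illustrative choices, not SDK behaviour.

```python
from typing import Any, List, Mapping, Sequence


def srt_timestamp(seconds: float) -> str:
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def words_to_srt(words: Sequence[Mapping[str, Any]], max_words: int = 7) -> str:
    """Group consecutive words into cues of at most `max_words` words each."""
    cues: List[str] = []
    for number, offset in enumerate(range(0, len(words), max_words), start=1):
        chunk = words[offset:offset + max_words]
        span = f"{srt_timestamp(chunk[0]['start'])} --> {srt_timestamp(chunk[-1]['end'])}"
        text = " ".join(w["word"] for w in chunk)
        cues.append(f"{number}\n{span}\n{text}")
    return "\n\n".join(cues) + "\n" if cues else ""


demo_words = [
    {"word": "Hello", "start": 0.0, "end": 0.5},
    {"word": "world", "start": 0.5, "end": 1.0},
    {"word": "again", "start": 1.25, "end": 1.75},
]
print(words_to_srt(demo_words, max_words=2))
```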

~~1483~~

1484### Transcription Create Response

~~1485~~

1486- `TranscriptionCreateResponse`

~~1487~~

1488 Represents a transcription response returned by the model, based on the provided input.

~~1489~~

1490 - `class Transcription: …`

~~1491~~

1492 Represents a transcription response returned by the model, based on the provided input.

~~1493~~

1494 - `text: str`

~~1495~~

1496 The transcribed text.

~~1497~~

1498 - `logprobs: Optional[List[Logprob]]`

~~1499~~

1500 The log probabilities of the tokens in the transcription. Only returned with the models `gpt-4o-transcribe` and `gpt-4o-mini-transcribe` if `logprobs` is added to the `include` array.

~~1501~~

1502 - `token: Optional[str]`

~~1503~~

1504 The token in the transcription.

~~1505~~

1506 - `bytes: Optional[List[float]]`

~~1507~~

1508 The bytes of the token.

~~1509~~

1510 - `logprob: Optional[float]`

~~1511~~

1512 The log probability of the token.

~~1513~~

1514 - `usage: Optional[Usage]`

~~1515~~

1516 Token usage statistics for the request.

~~1517~~

1518 - `class UsageTokens: …`

~~1519~~

1520 Usage statistics for models billed by token usage.

~~1521~~

1522 - `input_tokens: int`

~~1523~~

1524 Number of input tokens billed for this request.

~~1525~~

1526 - `output_tokens: int`

~~1527~~

1528 Number of output tokens generated.

~~1529~~

1530 - `total_tokens: int`

~~1531~~

1532 Total number of tokens used (input + output).

~~1533~~

1534 - `type: Literal["tokens"]`

~~1535~~

1536 The type of the usage object. Always `tokens` for this variant.

~~1537~~

1538 - `"tokens"`

~~1539~~

1540 - `input_token_details: Optional[UsageTokensInputTokenDetails]`

~~1541~~

1542 Details about the input tokens billed for this request.

~~1543~~

1544 - `audio_tokens: Optional[int]`

~~1545~~

1546 Number of audio tokens billed for this request.

~~1547~~

1548 - `text_tokens: Optional[int]`

~~1549~~

1550 Number of text tokens billed for this request.

~~1551~~

1552 - `class UsageDuration: …`

~~1553~~

1554 Usage statistics for models billed by audio input duration.

~~1555~~

1556 - `seconds: float`

~~1557~~

1558 Duration of the input audio in seconds.

~~1559~~

1560 - `type: Literal["duration"]`

~~1561~~

1562 The type of the usage object. Always `duration` for this variant.

~~1563~~

1564 - `"duration"`

~~1565~~

1566 - `class TranscriptionDiarized: …`

~~1567~~

1568 Represents a diarized transcription response returned by the model, including the combined transcript and speaker-segment annotations.

~~1569~~

1570 - `duration: float`

~~1571~~

1572 Duration of the input audio in seconds.

~~1573~~

1574 - `segments: List[TranscriptionDiarizedSegment]`

~~1575~~

1576 Segments of the transcript annotated with timestamps and speaker labels.

~~1577~~

1578 - `id: str`

~~1579~~

1580 Unique identifier for the segment.

~~1581~~

1582 - `end: float`

~~1583~~

1584 End timestamp of the segment in seconds.

~~1585~~

1586 - `speaker: str`

~~1587~~

1588 Speaker label for this segment. When known speakers are provided, the label matches `known_speaker_names[]`. Otherwise speakers are labeled sequentially using capital letters (`A`, `B`, ...).

~~1589~~

1590 - `start: float`

~~1591~~

1592 Start timestamp of the segment in seconds.

~~1593~~

1594 - `text: str`

~~1595~~

1596 Transcript text for this segment.

~~1597~~

1598 - `type: Literal["transcript.text.segment"]`

~~1599~~

1600 The type of the segment. Always `transcript.text.segment`.

~~1601~~

1602 - `"transcript.text.segment"`

~~1603~~

1604 - `task: Literal["transcribe"]`

~~1605~~

1606 The type of task that was run. Always `transcribe`.

~~1607~~

1608 - `"transcribe"`

~~1609~~

1610 - `text: str`

~~1611~~

1612 The concatenated transcript text for the entire audio input.

~~1613~~

1614 - `usage: Optional[Usage]`

~~1615~~

1616 Token or duration usage statistics for the request.

~~1617~~

1618 - `class UsageTokens: …`

~~1619~~

1620 Usage statistics for models billed by token usage.

~~1621~~

1622 - `input_tokens: int`

~~1623~~

1624 Number of input tokens billed for this request.

~~1625~~

1626 - `output_tokens: int`

~~1627~~

1628 Number of output tokens generated.

~~1629~~

1630 - `total_tokens: int`

~~1631~~

1632 Total number of tokens used (input + output).

~~1633~~

1634 - `type: Literal["tokens"]`

~~1635~~

1636 The type of the usage object. Always `tokens` for this variant.

~~1637~~

1638 - `"tokens"`

~~1639~~

1640 - `input_token_details: Optional[UsageTokensInputTokenDetails]`

~~1641~~

1642 Details about the input tokens billed for this request.

~~1643~~

1644 - `audio_tokens: Optional[int]`

~~1645~~

1646 Number of audio tokens billed for this request.

~~1647~~

1648 - `text_tokens: Optional[int]`

~~1649~~

1650 Number of text tokens billed for this request.

~~1651~~

1652 - `class UsageDuration: …`

~~1653~~

1654 Usage statistics for models billed by audio input duration.

~~1655~~

1656 - `seconds: float`

~~1657~~

1658 Duration of the input audio in seconds.

~~1659~~

1660 - `type: Literal["duration"]`

~~1661~~

1662 The type of the usage object. Always `duration` for this variant.

~~1663~~

1664 - `"duration"`

~~1665~~

1666 - `class TranscriptionVerbose: …`

~~1667~~

1668 Represents a verbose JSON transcription response returned by the model, based on the provided input.

~~1669~~

1670 - `duration: float`

~~1671~~

1672 The duration of the input audio.

~~1673~~

1674 - `language: str`

~~1675~~

1676 The language of the input audio.

~~1677~~

1678 - `text: str`

~~1679~~

1680 The transcribed text.

~~1681~~

1682 - `segments: Optional[List[TranscriptionSegment]]`

~~1683~~

1684 Segments of the transcribed text and their corresponding details.

~~1685~~

1686 - `id: int`

~~1687~~

1688 Unique identifier of the segment.

~~1689~~

1690 - `avg_logprob: float`

~~1691~~

1692 Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.

~~1693~~

1694 - `compression_ratio: float`

~~1695~~

1696 Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.

~~1697~~

1698 - `end: float`

~~1699~~

1700 End time of the segment in seconds.

~~1701~~

1702 - `no_speech_prob: float`

~~1703~~

1704 Probability of no speech in the segment. If the value is higher than 1.0 and the `avg_logprob` is below -1, consider this segment silent.

~~1705~~

1706 - `seek: int`

~~1707~~

1708 Seek offset of the segment.

~~1709~~

1710 - `start: float`

~~1711~~

1712 Start time of the segment in seconds.

~~1713~~

1714 - `temperature: float`

~~1715~~

1716 Temperature parameter used for generating the segment.

~~1717~~

1718 - `text: str`

~~1719~~

1720 Text content of the segment.

~~1721~~

1722 - `tokens: List[int]`

~~1723~~

1724 Array of token IDs for the text content.

~~1725~~

1726 - `usage: Optional[Usage]`

~~1727~~

1728 Usage statistics for models billed by audio input duration.

~~1729~~

1730 - `seconds: float`

~~1731~~

1732 Duration of the input audio in seconds.

~~1733~~

1734 - `type: Literal["duration"]`

~~1735~~

1736 The type of the usage object. Always `duration` for this variant.

~~1737~~

1738 - `"duration"`

~~1739~~

1740 - `words: Optional[List[TranscriptionWord]]`

~~1741~~

1742 Extracted words and their corresponding timestamps.

~~1743~~

1744 - `end: float`

~~1745~~

1746 End time of the word in seconds.

~~1747~~

1748 - `start: float`

~~1749~~

1750 Start time of the word in seconds.

~~1751~~

1752 - `word: str`

~~1753~~

1754 The text content of the word.
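
`TranscriptionCreateResponse` is a union of three shapes, so code that receives a decoded response may need to work out which one it holds. The sketch below is a heuristic based only on the fields documented above (`language` appears only on the verbose variant, `task` only on the diarized one). It assumes a plain dictionary payload, and `response_kind` is a hypothetical helper.

```python
from typing import Any, Mapping


def response_kind(payload: Mapping[str, Any]) -> str:
    """Guess which documented response shape a decoded payload matches."""
    if "language" in payload:
        return "verbose"      # TranscriptionVerbose: duration, language, text, ...
    if "task" in payload:
        return "diarized"     # TranscriptionDiarized: duration, segments, task, text
    if "text" in payload:
        return "basic"        # Transcription: text, optional logprobs and usage
    raise ValueError("payload does not look like a transcription response")


print(response_kind({"text": "hello"}))
```

When working with the SDK's typed objects directly, an `isinstance` check against the corresponding class is the more direct option; the key-based check here is only useful for already-serialised data.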

~~1755~~

1756# Translations

~~1757~~

1758## Create translation

~~1759~~

1760`audio.translations.create(TranslationCreateParams**kwargs) -> TranslationCreateResponse`

~~1761~~

1762**post** `/audio/translations`

~~1763~~

1764Translates audio into English.

~~1765~~

1766### Parameters

~~1767~~

1768- `file: FileTypes`

~~1769~~

1770 The audio file object (not file name) to translate, in one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm.

~~1771~~

1772- `model: Union[str, AudioModel]`

~~1773~~

1774 ID of the model to use. Only `whisper-1` (which is powered by our open source Whisper V2 model) is currently available.

~~1775~~

1776 - `str`

~~1777~~

1778 - `Literal["whisper-1", "gpt-4o-transcribe", "gpt-4o-mini-transcribe", 2 more]`

~~1779~~

1780 - `"whisper-1"`

~~1781~~

1782 - `"gpt-4o-transcribe"`

~~1783~~

1784 - `"gpt-4o-mini-transcribe"`

~~1785~~

1786 - `"gpt-4o-mini-transcribe-2025-12-15"`

~~1787~~

1788 - `"gpt-4o-transcribe-diarize"`

~~1789~~

1790- `prompt: Optional[str]`

~~1791~~

1792 An optional text to guide the model's style or continue a previous audio segment. The [prompt](https://platform.openai.com/docs/guides/speech-to-text#prompting) should be in English.

~~1793~~

1794- `response_format: Optional[Literal["json", "text", "srt", 2 more]]`

~~1795~~

1796 The format of the output, in one of these options: `json`, `text`, `srt`, `verbose_json`, or `vtt`.

~~1797~~

1798 - `"json"`

~~1799~~

1800 - `"text"`

~~1801~~

1802 - `"srt"`

~~1803~~

1804 - `"verbose_json"`

~~1805~~

1806 - `"vtt"`

~~1807~~

1808- `temperature: Optional[float]`

~~1809~~

1810 The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use [log probability](https://en.wikipedia.org/wiki/Log_probability) to automatically increase the temperature until certain thresholds are hit.
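
The optional parameters above have documented constraints: `response_format` must be one of five values and `temperature` must lie between 0 and 1. As a sketch, a small client-side check can catch mistakes before a request is sent. `translation_kwargs` is a hypothetical helper, not part of the SDK, and it deliberately leaves out the required `file` argument.

```python
from typing import Any, Dict, Optional

RESPONSE_FORMATS = ("json", "text", "srt", "verbose_json", "vtt")


def translation_kwargs(
    model: str = "whisper-1",
    prompt: Optional[str] = None,
    response_format: Optional[str] = None,
    temperature: Optional[float] = None,
) -> Dict[str, Any]:
    """Validate optional translation parameters and drop the ones left unset."""
    if response_format is not None and response_format not in RESPONSE_FORMATS:
        raise ValueError(f"response_format must be one of {RESPONSE_FORMATS}")
    if temperature is not None and not 0.0 <= temperature <= 1.0:
        raise ValueError("temperature must be between 0 and 1")
    candidates = {
        "model": model,
        "prompt": prompt,
        "response_format": response_format,
        "temperature": temperature,
    }
    return {key: value for key, value in candidates.items() if value is not None}


print(translation_kwargs(response_format="srt", temperature=0.2))
```

The resulting dictionary could then be unpacked into the create call alongside `file`, for example `client.audio.translations.create(file=audio_file, **kwargs)`.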

~~1811~~

1812### Returns

~~1813~~

1814- `TranslationCreateResponse`

~~1815~~

1816 - `class Translation: …`

~~1817~~

1818 - `text: str`

~~1819~~

1820 - `class TranslationVerbose: …`

~~1821~~

1822 - `duration: float`

~~1823~~

1824 The duration of the input audio.

~~1825~~

1826 - `language: str`

~~1827~~

1828 The language of the output translation (always `english`).

~~1829~~

1830 - `text: str`

~~1831~~

1832 The translated text.

~~1833~~

1834 - `segments: Optional[List[TranscriptionSegment]]`

~~1835~~

1836 Segments of the translated text and their corresponding details.

~~1837~~

1838 - `id: int`

~~1839~~

1840 Unique identifier of the segment.

~~1841~~

1842 - `avg_logprob: float`

~~1843~~

1844 Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.

~~1845~~

1846 - `compression_ratio: float`

~~1847~~

1848 Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.

~~1849~~

1850 - `end: float`

~~1851~~

1852 End time of the segment in seconds.

~~1853~~

1854 - `no_speech_prob: float`

~~1855~~

1856 Probability of no speech in the segment. If the value is higher than 1.0 and the `avg_logprob` is below -1, consider this segment silent.

~~1857~~

1858 - `seek: int`

~~1859~~

1860 Seek offset of the segment.

~~1861~~

1862 - `start: float`

~~1863~~

1864 Start time of the segment in seconds.

~~1865~~

1866 - `temperature: float`

~~1867~~

1868 Temperature parameter used for generating the segment.

~~1869~~

1870 - `text: str`

~~1871~~

1872 Text content of the segment.

~~1873~~

1874 - `tokens: List[int]`

~~1875~~

1876 Array of token IDs for the text content.

~~1877~~

1878### Example

~~1879~~

```python
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY"),  # This is the default and can be omitted
)

# Placeholder bytes; pass real audio content or an open binary file in practice.
translation = client.audio.translations.create(
    file=b"Example data",
    model="whisper-1",
)
print(translation)
```

~~1893~~

1894#### Response

~~1895~~

```json
{
  "text": "text"
}
```

~~1901~~

1902### Example

~~1903~~

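```python
from openai import OpenAI

client = OpenAI()

# Open the audio in binary mode; the context manager closes the file afterwards.
with open("speech.mp3", "rb") as audio_file:
    translation = client.audio.translations.create(
        model="whisper-1",
        file=audio_file,
    )

# The default `json` response format exposes the translated text as `text`.
print(translation.text)
```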

~~1914~~

1915#### Response

~~1916~~

```json
{
  "text": "Hello, my name is Wolfgang and I come from Germany. Where are you heading today?"
}
```

~~1922~~

1923## Domain Types

~~1924~~

1925### Translation

~~1926~~

1927- `class Translation: …`

~~1928~~

1929 - `text: str`

~~1930~~

1931### Translation Verbose

~~1932~~

1933- `class TranslationVerbose: …`

~~1934~~

1935 - `duration: float`

~~1936~~

1937 The duration of the input audio.

~~1938~~

1939 - `language: str`

~~1940~~

1941 The language of the output translation (always `english`).

~~1942~~

1943 - `text: str`

~~1944~~

1945 The translated text.

~~1946~~

1947 - `segments: Optional[List[TranscriptionSegment]]`

~~1948~~

1949 Segments of the translated text and their corresponding details.

~~1950~~

1951 - `id: int`

~~1952~~

1953 Unique identifier of the segment.

~~1954~~

1955 - `avg_logprob: float`

~~1956~~

1957 Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.

~~1958~~

1959 - `compression_ratio: float`

~~1960~~

1961 Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.

~~1962~~

1963 - `end: float`

~~1964~~

1965 End time of the segment in seconds.

~~1966~~

1967 - `no_speech_prob: float`

~~1968~~

1969 Probability of no speech in the segment. If the value is higher than 1.0 and the `avg_logprob` is below -1, consider this segment silent.

~~1970~~

1971 - `seek: int`

~~1972~~

1973 Seek offset of the segment.

~~1974~~

1975 - `start: float`

~~1976~~

1977 Start time of the segment in seconds.

~~1978~~

1979 - `temperature: float`

~~1980~~

1981 Temperature parameter used for generating the segment.

~~1982~~

1983 - `text: str`

~~1984~~

1985 Text content of the segment.

~~1986~~

1987 - `tokens: List[int]`

~~1988~~

1989 Array of token IDs for the text content.

~~1990~~

1991### Translation Create Response

~~1992~~

1993- `TranslationCreateResponse`

~~1994~~

1995 - `class Translation: …`

~~1996~~

1997 - `text: str`

~~1998~~

1999 - `class TranslationVerbose: …`

~~2000~~

2001 - `duration: float`

~~2002~~

2003 The duration of the input audio.

~~2004~~

2005 - `language: str`

~~2006~~

2007 The language of the output translation (always `english`).

~~2008~~

2009 - `text: str`

~~2010~~

2011 The translated text.

~~2012~~

2013 - `segments: Optional[List[TranscriptionSegment]]`

~~2014~~

2015 Segments of the translated text and their corresponding details.

~~2016~~

2017 - `id: int`

~~2018~~

2019 Unique identifier of the segment.

~~2020~~

2021 - `avg_logprob: float`

~~2022~~

2023 Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.

~~2024~~

2025 - `compression_ratio: float`

~~2026~~

2027 Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.

~~2028~~

2029 - `end: float`

~~2030~~

2031 End time of the segment in seconds.

~~2032~~

2033 - `no_speech_prob: float`

~~2034~~

2035 Probability of no speech in the segment. If the value is close to 1.0 and the `avg_logprob` is below -1, consider this segment silent.

~~2036~~

2037 - `seek: int`

~~2038~~

2039 Seek offset of the segment.

~~2040~~

2041 - `start: float`

~~2042~~

2043 Start time of the segment in seconds.

~~2044~~

2045 - `temperature: float`

~~2046~~

2047 Temperature parameter used for generating the segment.

~~2048~~

2049 - `text: str`

~~2050~~

2051 Text content of the segment.

~~2052~~

2053 - `tokens: List[int]`

~~2054~~

2055 Array of token IDs for the text content.
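
Because each segment carries `start`, `end`, and `text`, a verbose translation can be turned into subtitle cues on the client side. The following is a minimal sketch under stated assumptions: the helper names `srt_timestamp` and `segments_to_srt` are hypothetical, and segments are passed as plain `(start, end, text)` tuples rather than SDK objects.

```python
from typing import Iterable, Tuple


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments: Iterable[Tuple[float, float, str]]) -> str:
    """Render (start, end, text) tuples as numbered SRT cues."""
    cues = []
    for index, (start, end, text) in enumerate(segments, start=1):
        cues.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text.strip()}\n"
        )
    # Cues are separated by a blank line.
    return "\n".join(cues)
```

With SDK objects, the tuples could be built from each segment's `start`, `end`, and `text` attributes before calling `segments_to_srt`.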

~~2056~~

2057# Speech

~~2058~~

2059## Create speech

~~2060~~

2061`audio.speech.create(SpeechCreateParams**kwargs) -> BinaryResponseContent`

~~2062~~

2063**post** `/audio/speech`

~~2064~~

2065Generates audio from the input text.

~~2066~~

2067Returns the audio file content, or a stream of audio events.

~~2068~~

2069### Parameters

~~2070~~

2071- `input: str`

~~2072~~

2073 The text to generate audio for. The maximum length is 4096 characters.

~~2074~~

2075- `model: Union[str, SpeechModel]`

~~2076~~

2077 One of the available [TTS models](https://platform.openai.com/docs/models#tts): `tts-1`, `tts-1-hd`, `gpt-4o-mini-tts`, or `gpt-4o-mini-tts-2025-12-15`.

~~2078~~

2079 - `str`

~~2080~~

2081 - `Literal["tts-1", "tts-1-hd", "gpt-4o-mini-tts", "gpt-4o-mini-tts-2025-12-15"]`

~~2082~~

2083 - `"tts-1"`

~~2084~~

2085 - `"tts-1-hd"`

~~2086~~

2087 - `"gpt-4o-mini-tts"`

~~2088~~

2089 - `"gpt-4o-mini-tts-2025-12-15"`

~~2090~~

2091- `voice: Voice`

~~2092~~

2093 The voice to use when generating the audio. Supported built-in voices are `alloy`, `ash`, `ballad`, `coral`, `echo`, `fable`, `onyx`, `nova`, `sage`, `shimmer`, `verse`, `marin`, and `cedar`. You may also provide a custom voice object with an `id`, for example `{ "id": "voice_1234" }`. Previews of the voices are available in the [Text to speech guide](https://platform.openai.com/docs/guides/text-to-speech#voice-options).

~~2094~~

2095 - `str`

~~2096~~

2097 - `Literal["alloy", "ash", "ballad", 7 more]`

~~2098~~

2099 - `"alloy"`

~~2100~~

2101 - `"ash"`

~~2102~~

2103 - `"ballad"`

~~2104~~

2105 - `"coral"`

~~2106~~

2107 - `"echo"`

~~2108~~

2109 - `"sage"`

~~2110~~

2111 - `"shimmer"`

~~2112~~

2113 - `"verse"`

~~2114~~

2115 - `"marin"`

~~2116~~

2117 - `"cedar"`

~~2118~~

2119 - `class VoiceID: …`

~~2120~~

2121 Custom voice reference.

~~2122~~

2123 - `id: str`

~~2124~~

2125 The custom voice ID, e.g. `voice_1234`.

~~2126~~

2127- `instructions: Optional[str]`

~~2128~~

2129 Control the voice of your generated audio with additional instructions. Does not work with `tts-1` or `tts-1-hd`.

~~2130~~

2131- `response_format: Optional[Literal["mp3", "opus", "aac", 3 more]]`

~~2132~~

2133 The format to return the audio in. Supported formats are `mp3`, `opus`, `aac`, `flac`, `wav`, and `pcm`.

~~2134~~

2135 - `"mp3"`

~~2136~~

2137 - `"opus"`

~~2138~~

2139 - `"aac"`

~~2140~~

2141 - `"flac"`

~~2142~~

2143 - `"wav"`

~~2144~~

2145 - `"pcm"`

~~2146~~

2147- `speed: Optional[float]`

~~2148~~

2149 The speed of the generated audio. Select a value from `0.25` to `4.0`. `1.0` is the default.

~~2150~~

2151- `stream_format: Optional[Literal["sse", "audio"]]`

~~2152~~

2153 The format to stream the audio in. Supported formats are `sse` and `audio`. `sse` is not supported for `tts-1` or `tts-1-hd`.

~~2154~~

2155 - `"sse"`

~~2156~~

2157 - `"audio"`
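
Several of the parameter constraints above (the 4096-character `input` limit, the `0.25` to `4.0` range for `speed`, and the `tts-1`/`tts-1-hd` restrictions on `instructions` and `sse`) can be checked before a request is sent. The sketch below is one possible pre-flight check; `check_speech_params` is a hypothetical helper, not part of the SDK, and the API itself remains the authority on what is accepted.

```python
from typing import List, Optional

MAX_INPUT_CHARS = 4096
# Models for which `instructions` and `stream_format="sse"` are documented as unsupported.
LEGACY_MODELS = {"tts-1", "tts-1-hd"}


def check_speech_params(
    text: str,
    model: str,
    speed: Optional[float] = None,
    instructions: Optional[str] = None,
    stream_format: Optional[str] = None,
) -> List[str]:
    """Return human-readable problems found in the arguments (empty if none)."""
    problems = []
    if len(text) > MAX_INPUT_CHARS:
        problems.append(f"input is {len(text)} characters; the maximum is {MAX_INPUT_CHARS}")
    if speed is not None and not (0.25 <= speed <= 4.0):
        problems.append("speed must be between 0.25 and 4.0")
    if model in LEGACY_MODELS:
        if instructions is not None:
            problems.append(f"instructions does not work with {model}")
        if stream_format == "sse":
            problems.append(f"sse streaming is not supported for {model}")
    return problems
```

An empty return value means none of the documented constraints covered here were violated.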

~~2158~~

2159### Returns

~~2160~~

2161- `BinaryResponseContent`

~~2162~~

2163### Example

~~2164~~

```python
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY"),  # This is the default and can be omitted
)
speech = client.audio.speech.create(
    input="input",
    model="gpt-4o-mini-tts",  # any documented speech model
    voice="alloy",  # any supported built-in voice
)
print(speech)
# Read the generated audio as raw bytes.
content = speech.read()
print(content)
```

~~2181~~

2182### Example

~~2183~~

```python
from pathlib import Path
import openai

speech_file_path = Path(__file__).parent / "speech.mp3"
with openai.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="alloy",
    input="The quick brown fox jumped over the lazy dog.",
) as response:
    response.stream_to_file(speech_file_path)
```
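
Since `input` is limited to 4096 characters, longer text has to be split into several requests. The sketch below shows one simple approach, greedy packing on whitespace; `split_for_tts` is a hypothetical helper, and it assumes that collapsing runs of whitespace into single spaces is acceptable for the text being spoken.

```python
from typing import List


def split_for_tts(text: str, limit: int = 4096) -> List[str]:
    """Split text into chunks of at most `limit` characters, breaking on whitespace."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    chunks: List[str] = []
    current = ""
    for word in text.split():
        # A single word longer than the limit is hard-split into limit-sized pieces.
        while len(word) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:limit])
            word = word[limit:]
        if not word:
            continue
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= limit:
            current = candidate
        else:
            # The word does not fit; close the current chunk and start a new one.
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk could then be passed as `input` in its own request. Splitting on sentence boundaries instead would likely give more natural-sounding audio at the joins.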

~~2196~~

2197## Domain Types

~~2198~~

2199### Speech Model

~~2200~~

2201- `Literal["tts-1", "tts-1-hd", "gpt-4o-mini-tts", "gpt-4o-mini-tts-2025-12-15"]`

~~2202~~

2203 - `"tts-1"`

~~2204~~

2205 - `"tts-1-hd"`

~~2206~~

2207 - `"gpt-4o-mini-tts"`

~~2208~~

2209 - `"gpt-4o-mini-tts-2025-12-15"`

~~2210~~

2211# Voices

~~2212~~

2213# Voice Consents