Audio

Domain Types

Audio Model

Literal["whisper-1", "gpt-4o-transcribe", "gpt-4o-mini-transcribe", "gpt-4o-mini-transcribe-2025-12-15", "gpt-4o-transcribe-diarize"]
- "whisper-1"
- "gpt-4o-transcribe"
- "gpt-4o-mini-transcribe"
- "gpt-4o-mini-transcribe-2025-12-15"
- "gpt-4o-transcribe-diarize"

Audio Response Format

Literal["json", "text", "srt", "verbose_json", "vtt", "diarized_json"]

The format of the output, in one of these options: json, text, srt, verbose_json, vtt, or diarized_json. For gpt-4o-transcribe and gpt-4o-mini-transcribe, the only supported format is json. For gpt-4o-transcribe-diarize, the supported formats are json, text, and diarized_json, with diarized_json required to receive speaker annotations.
- "json"
- "text"
- "srt"
- "verbose_json"
- "vtt"
- "diarized_json"
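
The compatibility rules above can be captured in a small lookup. This is a minimal sketch, not part of the SDK: SUPPORTED_FORMATS and check_response_format are hypothetical names, and the entries for whisper-1 and for the dated gpt-4o-mini-transcribe snapshot are assumptions, since the description does not state them explicitly.

```python
# Hypothetical helper, not part of the openai package.
SUPPORTED_FORMATS = {
    # Assumption: whisper-1 accepts every format except diarized_json.
    "whisper-1": {"json", "text", "srt", "verbose_json", "vtt"},
    "gpt-4o-transcribe": {"json"},
    "gpt-4o-mini-transcribe": {"json"},
    # Assumption: the dated snapshot behaves like gpt-4o-mini-transcribe.
    "gpt-4o-mini-transcribe-2025-12-15": {"json"},
    "gpt-4o-transcribe-diarize": {"json", "text", "diarized_json"},
}


def check_response_format(model: str, response_format: str) -> bool:
    """Return True if the format is listed for the model in the table above."""
    return response_format in SUPPORTED_FORMATS.get(model, set())
```

Unknown model IDs return False here, which is a conservative choice for a local check rather than documented server behavior.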

Transcriptions

Create transcription

audio.transcriptions.create(TranscriptionCreateParams**kwargs) -> TranscriptionCreateResponse

post /audio/transcriptions

Transcribes audio into the input language.

Returns a transcription object in json, diarized_json, or verbose_json format, or a stream of transcript events.

Parameters

file: FileTypes

The audio file object (not file name) to transcribe, in one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm.
model: Union[str, AudioModel]

ID of the model to use. The options are gpt-4o-transcribe, gpt-4o-mini-transcribe, gpt-4o-mini-transcribe-2025-12-15, whisper-1 (which is powered by our open source Whisper V2 model), and gpt-4o-transcribe-diarize.
- str
- Literal["whisper-1", "gpt-4o-transcribe", "gpt-4o-mini-transcribe", "gpt-4o-mini-transcribe-2025-12-15", "gpt-4o-transcribe-diarize"]
  - "whisper-1"
  - "gpt-4o-transcribe"
  - "gpt-4o-mini-transcribe"
  - "gpt-4o-mini-transcribe-2025-12-15"
  - "gpt-4o-transcribe-diarize"
chunking_strategy: Optional[ChunkingStrategy]

Controls how the audio is cut into chunks. When set to "auto", the server first normalizes loudness and then uses voice activity detection (VAD) to choose boundaries. A server_vad object can be provided instead to tune the VAD parameters manually. If unset, the audio is transcribed as a single block. Required when using gpt-4o-transcribe-diarize for inputs longer than 30 seconds.
- Literal["auto"]
  
  Automatically set chunking parameters based on the audio. Must be set to "auto".
  - "auto"
- class ChunkingStrategyVadConfig: …
  - type: Literal["server_vad"]
    
    Must be set to server_vad to enable manual chunking using server side VAD.
    - "server_vad"
  - prefix_padding_ms: Optional[int]
    
    Amount of audio to include before the VAD detected speech (in milliseconds).
  - silence_duration_ms: Optional[int]
    
    Duration of silence to detect speech stop (in milliseconds). With shorter values the model will respond more quickly, but may jump in on short pauses from the user.
  - threshold: Optional[float]
    
    Sensitivity threshold (0.0 to 1.0) for voice activity detection. A higher threshold will require louder audio to activate the model, and thus might perform better in noisy environments.
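
A manual VAD configuration is a plain object with the fields listed above. The sketch below builds one and checks the documented threshold range; make_server_vad is a hypothetical helper, and the numeric values are illustrative only, not documented defaults.

```python
from typing import Optional


def make_server_vad(
    threshold: Optional[float] = None,
    prefix_padding_ms: Optional[int] = None,
    silence_duration_ms: Optional[int] = None,
) -> dict:
    """Build a chunking_strategy value for manual server-side VAD."""
    if threshold is not None and not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    config = {"type": "server_vad"}
    # Only include the optional fields the caller actually set.
    for key, value in (
        ("threshold", threshold),
        ("prefix_padding_ms", prefix_padding_ms),
        ("silence_duration_ms", silence_duration_ms),
    ):
        if value is not None:
            config[key] = value
    return config


# Illustrative values only; they are not documented defaults.
strategy = make_server_vad(threshold=0.6, silence_duration_ms=400)
```

The resulting dictionary would be passed as chunking_strategy=strategy in place of "auto".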
include: Optional[List[TranscriptionInclude]]

Additional information to include in the transcription response. logprobs will return the log probabilities of the tokens in the response to understand the model's confidence in the transcription. logprobs only works with response_format set to json and only with the models gpt-4o-transcribe, gpt-4o-mini-transcribe, and gpt-4o-mini-transcribe-2025-12-15. This field is not supported when using gpt-4o-transcribe-diarize.
- "logprobs"
known_speaker_names: Optional[Sequence[str]]

Optional list of speaker names that correspond to the audio samples provided in known_speaker_references[]. Each entry should be a short identifier (for example customer or agent). Up to 4 speakers are supported.
known_speaker_references: Optional[Sequence[str]]

Optional list of audio samples (as data URLs) that contain known speaker references matching known_speaker_names[]. Each sample must be between 2 and 10 seconds, and can use any of the same input audio formats supported by file.
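
The limits on known speakers (at most 4 entries, each sample between 2 and 10 seconds, passed as data URLs) can be checked locally before sending a request. The sketch below assumes WAV samples so that the standard-library wave module can measure their duration; wav_seconds and build_known_speakers are hypothetical helpers, not SDK functions.

```python
import base64
import io
import wave


def wav_seconds(data: bytes) -> float:
    """Return the duration of an in-memory WAV file in seconds."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def build_known_speakers(samples: dict) -> dict:
    """Map {name: wav_bytes} to the two known_speaker_* request fields."""
    if len(samples) > 4:
        raise ValueError("up to 4 known speakers are supported")
    names, references = [], []
    for name, data in samples.items():
        seconds = wav_seconds(data)
        if not 2.0 <= seconds <= 10.0:
            raise ValueError(f"sample for {name!r} is {seconds:.1f}s, expected 2-10s")
        names.append(name)
        references.append(
            "data:audio/wav;base64," + base64.b64encode(data).decode("ascii")
        )
    return {"known_speaker_names": names, "known_speaker_references": references}
```

Because both lists are built in the same loop, each name stays aligned with its reference sample.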
language: Optional[str]

The language of the input audio. Supplying the input language in ISO-639-1 (e.g. en) format will improve accuracy and latency.
prompt: Optional[str]

An optional text to guide the model's style or continue a previous audio segment. The prompt should match the audio language. This field is not supported when using gpt-4o-transcribe-diarize.
response_format: Optional[AudioResponseFormat]

The format of the output, in one of these options: json, text, srt, verbose_json, vtt, or diarized_json. For gpt-4o-transcribe and gpt-4o-mini-transcribe, the only supported format is json. For gpt-4o-transcribe-diarize, the supported formats are json, text, and diarized_json, with diarized_json required to receive speaker annotations.
- "json"
- "text"
- "srt"
- "verbose_json"
- "vtt"
- "diarized_json"
stream: Optional[Literal[false]]

If set to true, the model response data will be streamed to the client as it is generated using server-sent events. See the Streaming section of the Speech-to-Text guide for more information.

Note: Streaming is not supported for the whisper-1 model and will be ignored.
- false
temperature: Optional[float]

The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use log probability to automatically increase the temperature until certain thresholds are hit.
timestamp_granularities: Optional[List[Literal["word", "segment"]]]

The timestamp granularities to populate for this transcription. response_format must be set to verbose_json to use timestamp granularities. Either or both of these options are supported: word and segment. Note: There is no additional latency for segment timestamps, but generating word timestamps incurs additional latency. This option is not available for gpt-4o-transcribe-diarize.
- "word"
- "segment"
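
Several parameters above are only valid in combination with particular models or response formats. The following sketch collects those documented constraints into one local check. find_conflicts is a hypothetical helper, and the default of "json" for response_format is an assumption made for this sketch.

```python
from typing import List, Optional, Sequence

LOGPROB_MODELS = {
    "gpt-4o-transcribe",
    "gpt-4o-mini-transcribe",
    "gpt-4o-mini-transcribe-2025-12-15",
}
DIARIZE_MODEL = "gpt-4o-transcribe-diarize"


def find_conflicts(
    model: str,
    response_format: str = "json",  # assumed default for this sketch
    include: Sequence[str] = (),
    prompt: Optional[str] = None,
    timestamp_granularities: Sequence[str] = (),
) -> List[str]:
    """Return human-readable problems for a planned request; empty if none."""
    problems = []
    if "logprobs" in include:
        if model not in LOGPROB_MODELS:
            problems.append(f"logprobs is not supported by {model}")
        if response_format != "json":
            problems.append("logprobs requires response_format='json'")
    if timestamp_granularities:
        if response_format != "verbose_json":
            problems.append(
                "timestamp_granularities requires response_format='verbose_json'"
            )
        if model == DIARIZE_MODEL:
            problems.append(
                "timestamp_granularities is not available for the diarize model"
            )
    if prompt is not None and model == DIARIZE_MODEL:
        problems.append("prompt is not supported by the diarize model")
    return problems
```

Returning a list instead of raising on the first problem lets a caller report every conflict at once.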

Returns

TranscriptionCreateResponse

Represents a transcription response returned by the model, based on the provided input.
- class Transcription: …
  
  Represents a transcription response returned by the model, based on the provided input.
  - text: str
    
    The transcribed text.
  - logprobs: Optional[List[Logprob]]
    
    The log probabilities of the tokens in the transcription. Only returned with the models gpt-4o-transcribe and gpt-4o-mini-transcribe if logprobs is added to the include array.
    - token: Optional[str]
      
      The token in the transcription.
    - bytes: Optional[List[float]]
      
      The bytes of the token.
    - logprob: Optional[float]
      
      The log probability of the token.
  - usage: Optional[Usage]
    
    Token usage statistics for the request.
    - class UsageTokens: …
      
      Usage statistics for models billed by token usage.
      - input_tokens: int
        
        Number of input tokens billed for this request.
      - output_tokens: int
        
        Number of output tokens generated.
      - total_tokens: int
        
        Total number of tokens used (input + output).
      - type: Literal["tokens"]
        
        The type of the usage object. Always tokens for this variant.
        - "tokens"
      - input_token_details: Optional[UsageTokensInputTokenDetails]
        
        Details about the input tokens billed for this request.
        - audio_tokens: Optional[int]
          
          Number of audio tokens billed for this request.
        - text_tokens: Optional[int]
          
          Number of text tokens billed for this request.
    - class UsageDuration: …
      
      Usage statistics for models billed by audio input duration.
      - seconds: float
        
        Duration of the input audio in seconds.
      - type: Literal["duration"]
        
        The type of the usage object. Always duration for this variant.
        - "duration"
- class TranscriptionDiarized: …
  
  Represents a diarized transcription response returned by the model, including the combined transcript and speaker-segment annotations.
  - duration: float
    
    Duration of the input audio in seconds.
  - segments: List[TranscriptionDiarizedSegment]
    
    Segments of the transcript annotated with timestamps and speaker labels.
    - id: str
      
      Unique identifier for the segment.
    - end: float
      
      End timestamp of the segment in seconds.
    - speaker: str
      
      Speaker label for this segment. When known speakers are provided, the label matches known_speaker_names[]. Otherwise speakers are labeled sequentially using capital letters (A, B, ...).
    - start: float
      
      Start timestamp of the segment in seconds.
    - text: str
      
      Transcript text for this segment.
    - type: Literal["transcript.text.segment"]
      
      The type of the segment. Always transcript.text.segment.
      - "transcript.text.segment"
  - task: Literal["transcribe"]
    
    The type of task that was run. Always transcribe.
    - "transcribe"
  - text: str
    
    The concatenated transcript text for the entire audio input.
  - usage: Optional[Usage]
    
    Token or duration usage statistics for the request.
    - class UsageTokens: …
      
      Usage statistics for models billed by token usage.
      - input_tokens: int
        
        Number of input tokens billed for this request.
      - output_tokens: int
        
        Number of output tokens generated.
      - total_tokens: int
        
        Total number of tokens used (input + output).
      - type: Literal["tokens"]
        
        The type of the usage object. Always tokens for this variant.
        - "tokens"
      - input_token_details: Optional[UsageTokensInputTokenDetails]
        
        Details about the input tokens billed for this request.
        - audio_tokens: Optional[int]
          
          Number of audio tokens billed for this request.
        - text_tokens: Optional[int]
          
          Number of text tokens billed for this request.
    - class UsageDuration: …
      
      Usage statistics for models billed by audio input duration.
      - seconds: float
        
        Duration of the input audio in seconds.
      - type: Literal["duration"]
        
        The type of the usage object. Always duration for this variant.
        - "duration"
- class TranscriptionVerbose: …
  
  Represents a verbose_json transcription response returned by the model, based on the provided input.
  - duration: float
    
    The duration of the input audio.
  - language: str
    
    The language of the input audio.
  - text: str
    
    The transcribed text.
  - segments: Optional[List[TranscriptionSegment]]
    
    Segments of the transcribed text and their corresponding details.
    - id: int
      
      Unique identifier of the segment.
    - avg_logprob: float
      
      Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.
    - compression_ratio: float
      
      Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.
    - end: float
      
      End time of the segment in seconds.
    - no_speech_prob: float
      
      Probability of no speech in the segment. If the value is higher than 1.0 and the avg_logprob is below -1, consider this segment silent.
    - seek: int
      
      Seek offset of the segment.
    - start: float
      
      Start time of the segment in seconds.
    - temperature: float
      
      Temperature parameter used for generating the segment.
    - text: str
      
      Text content of the segment.
    - tokens: List[int]
      
      Array of token IDs for the text content.
  - usage: Optional[Usage]
    
    Usage statistics for models billed by audio input duration.
    - seconds: float
      
      Duration of the input audio in seconds.
    - type: Literal["duration"]
      
      The type of the usage object. Always duration for this variant.
      - "duration"
  - words: Optional[List[TranscriptionWord]]
    
    Extracted words and their corresponding timestamps.
    - end: float
      
      End time of the word in seconds.
    - start: float
      
      Start time of the word in seconds.
    - word: str
      
      The text content of the word.

Example

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY"),  # This is the default and can be omitted
)
# Without stream=True the call returns a single transcription object, not an iterator.
transcription = client.audio.transcriptions.create(
    file=b"Example data",
    model="gpt-4o-transcribe",
)
print(transcription)

Response

{
  "text": "text",
  "logprobs": [
    {
      "token": "token",
      "bytes": [
        0
      ],
      "logprob": 0
    }
  ],
  "usage": {
    "input_tokens": 0,
    "output_tokens": 0,
    "total_tokens": 0,
    "type": "tokens",
    "input_token_details": {
      "audio_tokens": 0,
      "text_tokens": 0
    }
  }
}

Example

from openai import OpenAI
client = OpenAI()

audio_file = open("speech.mp3", "rb")
transcript = client.audio.transcriptions.create(
  model="gpt-4o-transcribe",
  file=audio_file
)

Response

{
  "text": "Imagine the wildest idea that you've ever had, and you're curious about how it might scale to something that's a 100, a 1,000 times bigger. This is a place where you can get to do that.",
  "usage": {
    "type": "tokens",
    "input_tokens": 14,
    "input_token_details": {
      "text_tokens": 0,
      "audio_tokens": 14
    },
    "output_tokens": 45,
    "total_tokens": 59
  }
}

Diarization

import base64
from openai import OpenAI

client = OpenAI()

def to_data_url(path: str) -> str:
  with open(path, "rb") as fh:
    return "data:audio/wav;base64," + base64.b64encode(fh.read()).decode("utf-8")

with open("meeting.wav", "rb") as audio_file:
  transcript = client.audio.transcriptions.create(
    model="gpt-4o-transcribe-diarize",
    file=audio_file,
    response_format="diarized_json",
    chunking_strategy="auto",
    extra_body={
      "known_speaker_names": ["agent"],
      "known_speaker_references": [to_data_url("agent.wav")],
    },
  )

print(transcript.segments)

Response

{
  "task": "transcribe",
  "duration": 27.4,
  "text": "Agent: Thanks for calling OpenAI support.\nA: Hi, I'm trying to enable diarization.\nAgent: Happy to walk you through the steps.",
  "segments": [
    {
      "type": "transcript.text.segment",
      "id": "seg_001",
      "start": 0.0,
      "end": 4.7,
      "text": "Thanks for calling OpenAI support.",
      "speaker": "agent"
    },
    {
      "type": "transcript.text.segment",
      "id": "seg_002",
      "start": 4.7,
      "end": 11.8,
      "text": "Hi, I'm trying to enable diarization.",
      "speaker": "A"
    },
    {
      "type": "transcript.text.segment",
      "id": "seg_003",
      "start": 12.1,
      "end": 18.5,
      "text": "Happy to walk you through the steps.",
      "speaker": "agent"
    }
  ],
  "usage": {
    "type": "duration",
    "seconds": 27
  }
}
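
The segments in a diarized response can be turned into a readable dialogue. The sketch below merges consecutive segments from the same speaker. format_turns is a hypothetical helper, and it assumes the segments are plain dictionaries (for example, parsed from the JSON above) rather than SDK objects.

```python
def format_turns(segments):
    """Merge consecutive segments by the same speaker into 'speaker: text' lines."""
    turns = []
    for seg in segments:
        if turns and turns[-1][0] == seg["speaker"]:
            turns[-1][1].append(seg["text"])
        else:
            turns.append((seg["speaker"], [seg["text"]]))
    return [f"{speaker}: {' '.join(texts)}" for speaker, texts in turns]


# Trimmed copy of the segments from the sample response above.
segments = [
    {"speaker": "agent", "text": "Thanks for calling OpenAI support."},
    {"speaker": "A", "text": "Hi, I'm trying to enable diarization."},
    {"speaker": "agent", "text": "Happy to walk you through the steps."},
]
for line in format_turns(segments):
    print(line)
```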

Streaming

from openai import OpenAI
client = OpenAI()

audio_file = open("speech.mp3", "rb")
stream = client.audio.transcriptions.create(
  file=audio_file,
  model="gpt-4o-mini-transcribe",
  stream=True
)

for event in stream:
  print(event)

Response

data: {"type":"transcript.text.delta","delta":"I","logprobs":[{"token":"I","logprob":-0.00007588794,"bytes":[73]}]}

data: {"type":"transcript.text.delta","delta":" see","logprobs":[{"token":" see","logprob":-3.1281633e-7,"bytes":[32,115,101,101]}]}

data: {"type":"transcript.text.delta","delta":" skies","logprobs":[{"token":" skies","logprob":-2.3392786e-6,"bytes":[32,115,107,105,101,115]}]}

data: {"type":"transcript.text.delta","delta":" of","logprobs":[{"token":" of","logprob":-3.1281633e-7,"bytes":[32,111,102]}]}

data: {"type":"transcript.text.delta","delta":" blue","logprobs":[{"token":" blue","logprob":-1.0280384e-6,"bytes":[32,98,108,117,101]}]}

data: {"type":"transcript.text.delta","delta":" and","logprobs":[{"token":" and","logprob":-0.0005108566,"bytes":[32,97,110,100]}]}

data: {"type":"transcript.text.delta","delta":" clouds","logprobs":[{"token":" clouds","logprob":-1.9361265e-7,"bytes":[32,99,108,111,117,100,115]}]}

data: {"type":"transcript.text.delta","delta":" of","logprobs":[{"token":" of","logprob":-1.9361265e-7,"bytes":[32,111,102]}]}

data: {"type":"transcript.text.delta","delta":" white","logprobs":[{"token":" white","logprob":-7.89631e-7,"bytes":[32,119,104,105,116,101]}]}

data: {"type":"transcript.text.delta","delta":",","logprobs":[{"token":",","logprob":-0.0014890312,"bytes":[44]}]}

data: {"type":"transcript.text.delta","delta":" the","logprobs":[{"token":" the","logprob":-0.0110956915,"bytes":[32,116,104,101]}]}

data: {"type":"transcript.text.delta","delta":" bright","logprobs":[{"token":" bright","logprob":0.0,"bytes":[32,98,114,105,103,104,116]}]}

data: {"type":"transcript.text.delta","delta":" blessed","logprobs":[{"token":" blessed","logprob":-0.000045848617,"bytes":[32,98,108,101,115,115,101,100]}]}

data: {"type":"transcript.text.delta","delta":" days","logprobs":[{"token":" days","logprob":-0.000010802739,"bytes":[32,100,97,121,115]}]}

data: {"type":"transcript.text.delta","delta":",","logprobs":[{"token":",","logprob":-0.00001700133,"bytes":[44]}]}

data: {"type":"transcript.text.delta","delta":" the","logprobs":[{"token":" the","logprob":-0.0000118755715,"bytes":[32,116,104,101]}]}

data: {"type":"transcript.text.delta","delta":" dark","logprobs":[{"token":" dark","logprob":-5.5122365e-7,"bytes":[32,100,97,114,107]}]}

data: {"type":"transcript.text.delta","delta":" sacred","logprobs":[{"token":" sacred","logprob":-5.4385737e-6,"bytes":[32,115,97,99,114,101,100]}]}

data: {"type":"transcript.text.delta","delta":" nights","logprobs":[{"token":" nights","logprob":-4.00813e-6,"bytes":[32,110,105,103,104,116,115]}]}

data: {"type":"transcript.text.delta","delta":",","logprobs":[{"token":",","logprob":-0.0036910512,"bytes":[44]}]}

data: {"type":"transcript.text.delta","delta":" and","logprobs":[{"token":" and","logprob":-0.0031903093,"bytes":[32,97,110,100]}]}

data: {"type":"transcript.text.delta","delta":" I","logprobs":[{"token":" I","logprob":-1.504853e-6,"bytes":[32,73]}]}

data: {"type":"transcript.text.delta","delta":" think","logprobs":[{"token":" think","logprob":-4.3202e-7,"bytes":[32,116,104,105,110,107]}]}

data: {"type":"transcript.text.delta","delta":" to","logprobs":[{"token":" to","logprob":-1.9361265e-7,"bytes":[32,116,111]}]}

data: {"type":"transcript.text.delta","delta":" myself","logprobs":[{"token":" myself","logprob":-1.7432603e-6,"bytes":[32,109,121,115,101,108,102]}]}

data: {"type":"transcript.text.delta","delta":",","logprobs":[{"token":",","logprob":-0.29254505,"bytes":[44]}]}

data: {"type":"transcript.text.delta","delta":" what","logprobs":[{"token":" what","logprob":-0.016815351,"bytes":[32,119,104,97,116]}]}

data: {"type":"transcript.text.delta","delta":" a","logprobs":[{"token":" a","logprob":-3.1281633e-7,"bytes":[32,97]}]}

data: {"type":"transcript.text.delta","delta":" wonderful","logprobs":[{"token":" wonderful","logprob":-2.1008714e-6,"bytes":[32,119,111,110,100,101,114,102,117,108]}]}

data: {"type":"transcript.text.delta","delta":" world","logprobs":[{"token":" world","logprob":-8.180258e-6,"bytes":[32,119,111,114,108,100]}]}

data: {"type":"transcript.text.delta","delta":".","logprobs":[{"token":".","logprob":-0.014231676,"bytes":[46]}]}

data: {"type":"transcript.text.done","text":"I see skies of blue and clouds of white, the bright blessed days, the dark sacred nights, and I think to myself, what a wonderful world.","logprobs":[{"token":"I","logprob":-0.00007588794,"bytes":[73]},{"token":" see","logprob":-3.1281633e-7,"bytes":[32,115,101,101]},{"token":" skies","logprob":-2.3392786e-6,"bytes":[32,115,107,105,101,115]},{"token":" of","logprob":-3.1281633e-7,"bytes":[32,111,102]},{"token":" blue","logprob":-1.0280384e-6,"bytes":[32,98,108,117,101]},{"token":" and","logprob":-0.0005108566,"bytes":[32,97,110,100]},{"token":" clouds","logprob":-1.9361265e-7,"bytes":[32,99,108,111,117,100,115]},{"token":" of","logprob":-1.9361265e-7,"bytes":[32,111,102]},{"token":" white","logprob":-7.89631e-7,"bytes":[32,119,104,105,116,101]},{"token":",","logprob":-0.0014890312,"bytes":[44]},{"token":" the","logprob":-0.0110956915,"bytes":[32,116,104,101]},{"token":" bright","logprob":0.0,"bytes":[32,98,114,105,103,104,116]},{"token":" blessed","logprob":-0.000045848617,"bytes":[32,98,108,101,115,115,101,100]},{"token":" days","logprob":-0.000010802739,"bytes":[32,100,97,121,115]},{"token":",","logprob":-0.00001700133,"bytes":[44]},{"token":" the","logprob":-0.0000118755715,"bytes":[32,116,104,101]},{"token":" dark","logprob":-5.5122365e-7,"bytes":[32,100,97,114,107]},{"token":" sacred","logprob":-5.4385737e-6,"bytes":[32,115,97,99,114,101,100]},{"token":" nights","logprob":-4.00813e-6,"bytes":[32,110,105,103,104,116,115]},{"token":",","logprob":-0.0036910512,"bytes":[44]},{"token":" and","logprob":-0.0031903093,"bytes":[32,97,110,100]},{"token":" I","logprob":-1.504853e-6,"bytes":[32,73]},{"token":" think","logprob":-4.3202e-7,"bytes":[32,116,104,105,110,107]},{"token":" to","logprob":-1.9361265e-7,"bytes":[32,116,111]},{"token":" myself","logprob":-1.7432603e-6,"bytes":[32,109,121,115,101,108,102]},{"token":",","logprob":-0.29254505,"bytes":[44]},{"token":" what","logprob":-0.016815351,"bytes":[32,119,104,97,116]},{"token":" a","logprob":-3.1281633e-7,"bytes":[32,97]},{"token":" wonderful","logprob":-2.1008714e-6,"bytes":[32,119,111,110,100,101,114,102,117,108]},{"token":" world","logprob":-8.180258e-6,"bytes":[32,119,111,114,108,100]},{"token":".","logprob":-0.014231676,"bytes":[46]}],"usage":{"input_tokens":14,"input_token_details":{"text_tokens":0,"audio_tokens":14},"output_tokens":45,"total_tokens":59}}
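
When the raw event stream is consumed directly rather than through the SDK iterator, each data: line carries one JSON event. The sketch below accumulates the delta events and picks up the final text from the done event. collect_transcript is a hypothetical helper, and the sample lines are a shortened, hand-written stand-in with the same shape as the events above.

```python
import json


def collect_transcript(sse_lines):
    """Accumulate delta events and return (joined_deltas, done_text)."""
    parts, done_text = [], None
    for line in sse_lines:
        if not line.startswith("data: "):
            continue  # skip blank separators and non-data lines
        event = json.loads(line[len("data: "):])
        if event["type"] == "transcript.text.delta":
            parts.append(event["delta"])
        elif event["type"] == "transcript.text.done":
            done_text = event["text"]
    return "".join(parts), done_text


# Shortened, illustrative input in the same shape as the events above.
sample = [
    'data: {"type":"transcript.text.delta","delta":"I"}',
    "",
    'data: {"type":"transcript.text.delta","delta":" see"}',
    "",
    'data: {"type":"transcript.text.done","text":"I see"}',
]
joined, final = collect_transcript(sample)
```

In the full stream above, the concatenated deltas and the text field of the done event carry the same transcript, so either can serve as the final result.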

Logprobs

from openai import OpenAI
client = OpenAI()

audio_file = open("speech.mp3", "rb")
transcript = client.audio.transcriptions.create(
  file=audio_file,
  model="gpt-4o-transcribe",
  response_format="json",
  include=["logprobs"]
)

print(transcript)

Response

{
  "text": "Hey, my knee is hurting and I want to see the doctor tomorrow ideally.",
  "logprobs": [
    { "token": "Hey", "logprob": -1.0415299, "bytes": [72, 101, 121] },
    { "token": ",", "logprob": -9.805982e-5, "bytes": [44] },
    { "token": " my", "logprob": -0.00229799, "bytes": [32, 109, 121] },
    {
      "token": " knee",
      "logprob": -4.7159858e-5,
      "bytes": [32, 107, 110, 101, 101]
    },
    { "token": " is", "logprob": -0.043909557, "bytes": [32, 105, 115] },
    {
      "token": " hurting",
      "logprob": -1.1041146e-5,
      "bytes": [32, 104, 117, 114, 116, 105, 110, 103]
    },
    { "token": " and", "logprob": -0.011076359, "bytes": [32, 97, 110, 100] },
    { "token": " I", "logprob": -5.3193703e-6, "bytes": [32, 73] },
    {
      "token": " want",
      "logprob": -0.0017156356,
      "bytes": [32, 119, 97, 110, 116]
    },
    { "token": " to", "logprob": -7.89631e-7, "bytes": [32, 116, 111] },
    { "token": " see", "logprob": -5.5122365e-7, "bytes": [32, 115, 101, 101] },
    { "token": " the", "logprob": -0.0040786397, "bytes": [32, 116, 104, 101] },
    {
      "token": " doctor",
      "logprob": -2.3392786e-6,
      "bytes": [32, 100, 111, 99, 116, 111, 114]
    },
    {
      "token": " tomorrow",
      "logprob": -7.89631e-7,
      "bytes": [32, 116, 111, 109, 111, 114, 114, 111, 119]
    },
    {
      "token": " ideally",
      "logprob": -0.5800861,
      "bytes": [32, 105, 100, 101, 97, 108, 108, 121]
    },
    { "token": ".", "logprob": -0.00011093382, "bytes": [46] }
  ],
  "usage": {
    "type": "tokens",
    "input_tokens": 14,
    "input_token_details": {
      "text_tokens": 0,
      "audio_tokens": 14
    },
    "output_tokens": 45,
    "total_tokens": 59
  }
}
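
Log probabilities are easier to read once converted to probabilities with exp. The sketch below flags tokens whose probability falls under a cutoff. low_confidence_tokens is a hypothetical helper and the 0.6 cutoff is an arbitrary choice for illustration, not a documented threshold.

```python
import math


def low_confidence_tokens(logprobs, min_probability=0.6):
    """Return (token, probability) pairs whose probability is below the cutoff."""
    flagged = []
    for entry in logprobs:
        probability = math.exp(entry["logprob"])
        if probability < min_probability:
            flagged.append((entry["token"], round(probability, 3)))
    return flagged


# A few entries copied from the sample response above.
logprobs = [
    {"token": "Hey", "logprob": -1.0415299},
    {"token": ",", "logprob": -9.805982e-5},
    {"token": " is", "logprob": -0.043909557},
    {"token": " ideally", "logprob": -0.5800861},
]
print(low_confidence_tokens(logprobs))
```

With this cutoff, only "Hey" and " ideally" from the sample are flagged; the other tokens have probabilities close to 1.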

Word timestamps

from openai import OpenAI
client = OpenAI()

audio_file = open("speech.mp3", "rb")
transcript = client.audio.transcriptions.create(
  file=audio_file,
  model="whisper-1",
  response_format="verbose_json",
  timestamp_granularities=["word"]
)

print(transcript.words)

Response

{
  "task": "transcribe",
  "language": "english",
  "duration": 8.470000267028809,
  "text": "The beach was a popular spot on a hot summer day. People were swimming in the ocean, building sandcastles, and playing beach volleyball.",
  "words": [
    {
      "word": "The",
      "start": 0.0,
      "end": 0.23999999463558197
    },
    ...
    {
      "word": "volleyball",
      "start": 7.400000095367432,
      "end": 7.900000095367432
    }
  ],
  "usage": {
    "type": "duration",
    "seconds": 9
  }
}

Segment timestamps

from openai import OpenAI
client = OpenAI()

audio_file = open("speech.mp3", "rb")
transcript = client.audio.transcriptions.create(
  file=audio_file,
  model="whisper-1",
  response_format="verbose_json",
  timestamp_granularities=["segment"]
)

print(transcript.segments)

Response

{
  "task": "transcribe",
  "language": "english",
  "duration": 8.470000267028809,
  "text": "The beach was a popular spot on a hot summer day. People were swimming in the ocean, building sandcastles, and playing beach volleyball.",
  "segments": [
    {
      "id": 0,
      "seek": 0,
      "start": 0.0,
      "end": 3.319999933242798,
      "text": " The beach was a popular spot on a hot summer day.",
      "tokens": [
        50364, 440, 7534, 390, 257, 3743, 4008, 322, 257, 2368, 4266, 786, 13, 50530
      ],
      "temperature": 0.0,
      "avg_logprob": -0.2860786020755768,
      "compression_ratio": 1.2363636493682861,
      "no_speech_prob": 0.00985979475080967
    },
    ...
  ],
  "usage": {
    "type": "duration",
    "seconds": 9
  }
}

Domain Types

Transcription

class Transcription: …

Represents a transcription response returned by the model, based on the provided input.
- text: str
  
  The transcribed text.
- logprobs: Optional[List[Logprob]]
  
  The log probabilities of the tokens in the transcription. Only returned with the models gpt-4o-transcribe and gpt-4o-mini-transcribe if logprobs is added to the include array.
  - token: Optional[str]
    
    The token in the transcription.
  - bytes: Optional[List[float]]
    
    The bytes of the token.
  - logprob: Optional[float]
    
    The log probability of the token.
- usage: Optional[Usage]
  
  Token usage statistics for the request.
  - class UsageTokens: …
    
    Usage statistics for models billed by token usage.
    - input_tokens: int
      
      Number of input tokens billed for this request.
    - output_tokens: int
      
      Number of output tokens generated.
    - total_tokens: int
      
      Total number of tokens used (input + output).
    - type: Literal["tokens"]
      
      The type of the usage object. Always tokens for this variant.
      - "tokens"
    - input_token_details: Optional[UsageTokensInputTokenDetails]
      
      Details about the input tokens billed for this request.
      - audio_tokens: Optional[int]
        
        Number of audio tokens billed for this request.
      - text_tokens: Optional[int]
        
        Number of text tokens billed for this request.
  - class UsageDuration: …
    
    Usage statistics for models billed by audio input duration.
    - seconds: float
      
      Duration of the input audio in seconds.
    - type: Literal["duration"]
      
      The type of the usage object. Always duration for this variant.
      - "duration"
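
Because usage is one of two variants, code that reads it has to branch on the type field. A minimal sketch, assuming the usage object has been converted to a plain dictionary; describe_usage is a hypothetical helper.

```python
def describe_usage(usage: dict) -> str:
    """Summarize either usage variant, dispatching on its type field."""
    if usage["type"] == "tokens":
        details = usage.get("input_token_details") or {}
        audio = details.get("audio_tokens") or 0  # optional field, may be absent
        return (
            f"{usage['total_tokens']} tokens "
            f"({usage['input_tokens']} in, {usage['output_tokens']} out, "
            f"{audio} audio)"
        )
    if usage["type"] == "duration":
        return f"{usage['seconds']} seconds of audio"
    raise ValueError(f"unknown usage type: {usage['type']!r}")
```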

Transcription Diarized

class TranscriptionDiarized: …

Represents a diarized transcription response returned by the model, including the combined transcript and speaker-segment annotations.
- duration: float
  
  Duration of the input audio in seconds.
- segments: List[TranscriptionDiarizedSegment]
  
  Segments of the transcript annotated with timestamps and speaker labels.
  - id: str
    
    Unique identifier for the segment.
  - end: float
    
    End timestamp of the segment in seconds.
  - speaker: str
    
    Speaker label for this segment. When known speakers are provided, the label matches known_speaker_names[]. Otherwise speakers are labeled sequentially using capital letters (A, B, ...).
  - start: float
    
    Start timestamp of the segment in seconds.
  - text: str
    
    Transcript text for this segment.
  - type: Literal["transcript.text.segment"]
    
    The type of the segment. Always transcript.text.segment.
    - "transcript.text.segment"
- task: Literal["transcribe"]
  
  The type of task that was run. Always transcribe.
  - "transcribe"
- text: str
  
  The concatenated transcript text for the entire audio input.
- usage: Optional[Usage]
  
  Token or duration usage statistics for the request.
  - class UsageTokens: …
    
    Usage statistics for models billed by token usage.
    - input_tokens: int
      
      Number of input tokens billed for this request.
    - output_tokens: int
      
      Number of output tokens generated.
    - total_tokens: int
      
      Total number of tokens used (input + output).
    - type: Literal["tokens"]
      
      The type of the usage object. Always tokens for this variant.
      - "tokens"
    - input_token_details: Optional[UsageTokensInputTokenDetails]
      
      Details about the input tokens billed for this request.
      - audio_tokens: Optional[int]
        
        Number of audio tokens billed for this request.
      - text_tokens: Optional[int]
        
        Number of text tokens billed for this request.
  - class UsageDuration: …
    
    Usage statistics for models billed by audio input duration.
    - seconds: float
      
      Duration of the input audio in seconds.
    - type: Literal["duration"]
      
      The type of the usage object. Always duration for this variant.
      - "duration"

Transcription Diarized Segment

class TranscriptionDiarizedSegment: …

A segment of diarized transcript text with speaker metadata.
- id: str
  
  Unique identifier for the segment.
- end: float
  
  End timestamp of the segment in seconds.
- speaker: str
  
  Speaker label for this segment. When known speakers are provided, the label matches known_speaker_names[]. Otherwise speakers are labeled sequentially using capital letters (A, B, ...).
- start: float
  
  Start timestamp of the segment in seconds.
- text: str
  
  Transcript text for this segment.
- type: Literal["transcript.text.segment"]
  
  The type of the segment. Always transcript.text.segment.
  - "transcript.text.segment"

Transcription Include

Literal["logprobs"]
- "logprobs"

Transcription Segment

class TranscriptionSegment: …
- id: int
  
  Unique identifier of the segment.
- avg_logprob: float
  
  Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.
- compression_ratio: float
  
  Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.
- end: float
  
  End time of the segment in seconds.
- no_speech_prob: float
  
  Probability of no speech in the segment. If the value is higher than 1.0 and the avg_logprob is below -1, consider this segment silent.
- seek: int
  
  Seek offset of the segment.
- start: float
  
  Start time of the segment in seconds.
- temperature: float
  
  Temperature parameter used for generating the segment.
- text: str
  
  Text content of the segment.
- tokens: List[int]
  
  Array of token IDs for the text content.

Transcription Stream Event

TranscriptionStreamEvent

An event emitted while a transcription is being streamed. Only emitted when you create a transcription with stream set to true. It is one of the following variants:
- class TranscriptionTextSegmentEvent: …
  
  Emitted when a diarized transcription returns a completed segment with speaker information. Only emitted when you create a transcription with stream set to true and response_format set to diarized_json.
  - id: str
    
    Unique identifier for the segment.
  - end: float
    
    End timestamp of the segment in seconds.
  - speaker: str
    
    Speaker label for this segment.
  - start: float
    
    Start timestamp of the segment in seconds.
  - text: str
    
    Transcript text for this segment.
  - type: Literal["transcript.text.segment"]
    
    The type of the event. Always transcript.text.segment.
    - "transcript.text.segment"
- class TranscriptionTextDeltaEvent: …
  
  Emitted when there is an additional text delta. This is also the first event emitted when the transcription starts. Only emitted when you create a transcription with the stream parameter set to true.
  - delta: str
    
    The text delta that was additionally transcribed.
  - type: Literal["transcript.text.delta"]
    
    The type of the event. Always transcript.text.delta.
    - "transcript.text.delta"
  - logprobs: Optional[List[Logprob]]
    
    The log probabilities of the delta. Only included if you create a transcription with the include[] parameter set to logprobs.
    - token: Optional[str]
      
      The token that was used to generate the log probability.
    - bytes: Optional[List[int]]
      
      The bytes that were used to generate the log probability.
    - logprob: Optional[float]
      
      The log probability of the token.
  - segment_id: Optional[str]
    
    Identifier of the diarized segment that this delta belongs to. Only present when using gpt-4o-transcribe-diarize.
- class TranscriptionTextDoneEvent: …
  
  Emitted when the transcription is complete. Contains the complete transcription text. Only emitted when you create a transcription with the stream parameter set to true.
  - text: str
    
    The text that was transcribed.
  - type: Literal["transcript.text.done"]
    
    The type of the event. Always transcript.text.done.
    - "transcript.text.done"
  - logprobs: Optional[List[Logprob]]
    
    The log probabilities of the individual tokens in the transcription. Only included if you create a transcription with the include[] parameter set to logprobs.
    - token: Optional[str]
      
      The token that was used to generate the log probability.
    - bytes: Optional[List[int]]
      
      The bytes that were used to generate the log probability.
    - logprob: Optional[float]
      
      The log probability of the token.
  - usage: Optional[Usage]
    
    Usage statistics for models billed by token usage.
    - input_tokens: int
      
      Number of input tokens billed for this request.
    - output_tokens: int
      
      Number of output tokens generated.
    - total_tokens: int
      
      Total number of tokens used (input + output).
    - type: Literal["tokens"]
      
      The type of the usage object. Always tokens for this variant.
      - "tokens"
    - input_token_details: Optional[UsageInputTokenDetails]
      
      Details about the input tokens billed for this request.
      - audio_tokens: Optional[int]
        
        Number of audio tokens billed for this request.
      - text_tokens: Optional[int]
        
        Number of text tokens billed for this request.
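
One way to consume these events is to branch on the type field and fold the stream into a final result. The sketch below is a minimal illustration under stated assumptions: consume_events is a hypothetical helper, and the events are stand-in objects carrying only the fields documented above. In real use the events would come from iterating the result of a transcription request made with stream set to true.

```python
from types import SimpleNamespace


def consume_events(events):
    """Fold a stream of transcription events into (text, segments, usage)."""
    pieces, segments = [], []
    final_text, usage = None, None
    for event in events:
        if event.type == "transcript.text.delta":
            pieces.append(event.delta)
        elif event.type == "transcript.text.segment":
            segments.append((event.speaker, event.start, event.end, event.text))
        elif event.type == "transcript.text.done":
            final_text = event.text
            usage = getattr(event, "usage", None)
    # Prefer the text of the done event; fall back to the joined deltas.
    text = final_text if final_text is not None else "".join(pieces)
    return text, segments, usage


# Hypothetical events shaped like the classes documented above.
events = [
    SimpleNamespace(type="transcript.text.delta", delta="Hel"),
    SimpleNamespace(type="transcript.text.delta", delta="lo"),
    SimpleNamespace(type="transcript.text.segment", speaker="A",
                    start=0.0, end=0.5, text="Hello"),
    SimpleNamespace(type="transcript.text.done", text="Hello"),
]

text, segments, usage = consume_events(events)
print(text, segments, usage)
```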

Transcription Text Delta Event

class TranscriptionTextDeltaEvent: …

Emitted when there is an additional text delta. This is also the first event emitted when the transcription starts. Only emitted when you create a transcription with the stream parameter set to true.
- delta: str
  
  The text delta that was additionally transcribed.
- type: Literal["transcript.text.delta"]
  
  The type of the event. Always transcript.text.delta.
  - "transcript.text.delta"
- logprobs: Optional[List[Logprob]]
  
  The log probabilities of the delta. Only included if you create a transcription with the include[] parameter set to logprobs.
  - token: Optional[str]
    
    The token that was used to generate the log probability.
  - bytes: Optional[List[int]]
    
    The bytes that were used to generate the log probability.
  - logprob: Optional[float]
    
    The log probability of the token.
- segment_id: Optional[str]
  
  Identifier of the diarized segment that this delta belongs to. Only present when using gpt-4o-transcribe-diarize.

Transcription Text Done Event

class TranscriptionTextDoneEvent: …

Emitted when the transcription is complete. Contains the complete transcription text. Only emitted when you create a transcription with the stream parameter set to true.
- text: str
  
  The text that was transcribed.
- type: Literal["transcript.text.done"]
  
  The type of the event. Always transcript.text.done.
  - "transcript.text.done"
- logprobs: Optional[List[Logprob]]
  
  The log probabilities of the individual tokens in the transcription. Only included if you create a transcription with the include[] parameter set to logprobs.
  - token: Optional[str]
    
    The token that was used to generate the log probability.
  - bytes: Optional[List[int]]
    
    The bytes that were used to generate the log probability.
  - logprob: Optional[float]
    
    The log probability of the token.
- usage: Optional[Usage]
  
  Usage statistics for models billed by token usage.
  - input_tokens: int
    
    Number of input tokens billed for this request.
  - output_tokens: int
    
    Number of output tokens generated.
  - total_tokens: int
    
    Total number of tokens used (input + output).
  - type: Literal["tokens"]
    
    The type of the usage object. Always tokens for this variant.
    - "tokens"
  - input_token_details: Optional[UsageInputTokenDetails]
    
    Details about the input tokens billed for this request.
    - audio_tokens: Optional[int]
      
      Number of audio tokens billed for this request.
    - text_tokens: Optional[int]
      
      Number of text tokens billed for this request.

Transcription Text Segment Event

class TranscriptionTextSegmentEvent: …

Emitted when a diarized transcription returns a completed segment with speaker information. Only emitted when you create a transcription with stream set to true and response_format set to diarized_json.
- id: str
  
  Unique identifier for the segment.
- end: float
  
  End timestamp of the segment in seconds.
- speaker: str
  
  Speaker label for this segment.
- start: float
  
  Start timestamp of the segment in seconds.
- text: str
  
  Transcript text for this segment.
- type: Literal["transcript.text.segment"]
  
  The type of the event. Always transcript.text.segment.
  - "transcript.text.segment"

Transcription Verbose

class TranscriptionVerbose: …

Represents a verbose json transcription response returned by the model, based on the provided input.
- duration: float
  
  The duration of the input audio.
- language: str
  
  The language of the input audio.
- text: str
  
  The transcribed text.
- segments: Optional[List[TranscriptionSegment]]
  
  Segments of the transcribed text and their corresponding details.
  - id: int
    
    Unique identifier of the segment.
  - avg_logprob: float
    
    Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.
  - compression_ratio: float
    
    Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.
  - end: float
    
    End time of the segment in seconds.
  - no_speech_prob: float
    
    Probability of no speech in the segment. If the value is higher than 1.0 and the avg_logprob is below -1, consider this segment silent.
  - seek: int
    
    Seek offset of the segment.
  - start: float
    
    Start time of the segment in seconds.
  - temperature: float
    
    Temperature parameter used for generating the segment.
  - text: str
    
    Text content of the segment.
  - tokens: List[int]
    
    Array of token IDs for the text content.
- usage: Optional[Usage]
  
  Usage statistics for models billed by audio input duration.
  - seconds: float
    
    Duration of the input audio in seconds.
  - type: Literal["duration"]
    
    The type of the usage object. Always duration for this variant.
    - "duration"
- words: Optional[List[TranscriptionWord]]
  
  Extracted words and their corresponding timestamps.
  - end: float
    
    End time of the word in seconds.
  - start: float
    
    Start time of the word in seconds.
  - word: str
    
    The text content of the word.

Transcription Word

class TranscriptionWord: …
- end: float
  
  End time of the word in seconds.
- start: float
  
  Start time of the word in seconds.
- word: str
  
  The text content of the word.

Transcription Create Response

TranscriptionCreateResponse

Represents a transcription response returned by the model, based on the provided input.
- class Transcription: …
  
  Represents a transcription response returned by the model, based on the provided input.
  - text: str
    
    The transcribed text.
  - logprobs: Optional[List[Logprob]]
    
    The log probabilities of the tokens in the transcription. Only returned with the models gpt-4o-transcribe and gpt-4o-mini-transcribe if logprobs is added to the include array.
    - token: Optional[str]
      
      The token in the transcription.
    - bytes: Optional[List[float]]
      
      The bytes of the token.
    - logprob: Optional[float]
      
      The log probability of the token.
  - usage: Optional[Usage]
    
    Token usage statistics for the request.
    - class UsageTokens: …
      
      Usage statistics for models billed by token usage.
      - input_tokens: int
        
        Number of input tokens billed for this request.
      - output_tokens: int
        
        Number of output tokens generated.
      - total_tokens: int
        
        Total number of tokens used (input + output).
      - type: Literal["tokens"]
        
        The type of the usage object. Always tokens for this variant.
        - "tokens"
      - input_token_details: Optional[UsageTokensInputTokenDetails]
        
        Details about the input tokens billed for this request.
        - audio_tokens: Optional[int]
          
          Number of audio tokens billed for this request.
        - text_tokens: Optional[int]
          
          Number of text tokens billed for this request.
    - class UsageDuration: …
      
      Usage statistics for models billed by audio input duration.
      - seconds: float
        
        Duration of the input audio in seconds.
      - type: Literal["duration"]
        
        The type of the usage object. Always duration for this variant.
        - "duration"
- class TranscriptionDiarized: …
  
  Represents a diarized transcription response returned by the model, including the combined transcript and speaker-segment annotations.
  - duration: float
    
    Duration of the input audio in seconds.
  - segments: List[TranscriptionDiarizedSegment]
    
    Segments of the transcript annotated with timestamps and speaker labels.
    - id: str
      
      Unique identifier for the segment.
    - end: float
      
      End timestamp of the segment in seconds.
    - speaker: str
      
      Speaker label for this segment. When known speakers are provided, the label matches known_speaker_names[]. Otherwise speakers are labeled sequentially using capital letters (A, B, ...).
    - start: float
      
      Start timestamp of the segment in seconds.
    - text: str
      
      Transcript text for this segment.
    - type: Literal["transcript.text.segment"]
      
      The type of the segment. Always transcript.text.segment.
      - "transcript.text.segment"
  - task: Literal["transcribe"]
    
    The type of task that was run. Always transcribe.
    - "transcribe"
  - text: str
    
    The concatenated transcript text for the entire audio input.
  - usage: Optional[Usage]
    
    Token or duration usage statistics for the request.
    - class UsageTokens: …
      
      Usage statistics for models billed by token usage.
      - input_tokens: int
        
        Number of input tokens billed for this request.
      - output_tokens: int
        
        Number of output tokens generated.
      - total_tokens: int
        
        Total number of tokens used (input + output).
      - type: Literal["tokens"]
        
        The type of the usage object. Always tokens for this variant.
        - "tokens"
      - input_token_details: Optional[UsageTokensInputTokenDetails]
        
        Details about the input tokens billed for this request.
        - audio_tokens: Optional[int]
          
          Number of audio tokens billed for this request.
        - text_tokens: Optional[int]
          
          Number of text tokens billed for this request.
    - class UsageDuration: …
      
      Usage statistics for models billed by audio input duration.
      - seconds: float
        
        Duration of the input audio in seconds.
      - type: Literal["duration"]
        
        The type of the usage object. Always duration for this variant.
        - "duration"
- class TranscriptionVerbose: …
  
  Represents a verbose json transcription response returned by the model, based on the provided input.
  - duration: float
    
    The duration of the input audio.
  - language: str
    
    The language of the input audio.
  - text: str
    
    The transcribed text.
  - segments: Optional[List[TranscriptionSegment]]
    
    Segments of the transcribed text and their corresponding details.
    - id: int
      
      Unique identifier of the segment.
    - avg_logprob: float
      
      Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.
    - compression_ratio: float
      
      Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.
    - end: float
      
      End time of the segment in seconds.
    - no_speech_prob: float
      
      Probability of no speech in the segment. If the value is higher than 1.0 and the avg_logprob is below -1, consider this segment silent.
    - seek: int
      
      Seek offset of the segment.
    - start: float
      
      Start time of the segment in seconds.
    - temperature: float
      
      Temperature parameter used for generating the segment.
    - text: str
      
      Text content of the segment.
    - tokens: List[int]
      
      Array of token IDs for the text content.
  - usage: Optional[Usage]
    
    Usage statistics for models billed by audio input duration.
    - seconds: float
      
      Duration of the input audio in seconds.
    - type: Literal["duration"]
      
      The type of the usage object. Always duration for this variant.
      - "duration"
  - words: Optional[List[TranscriptionWord]]
    
    Extracted words and their corresponding timestamps.
    - end: float
      
      End time of the word in seconds.
    - start: float
      
      Start time of the word in seconds.
    - word: str
      
      The text content of the word.
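
Because usage can be either the token-billed or the duration-billed variant, callers typically branch on its type field. The sketch below assumes a hypothetical describe_usage helper and stand-in objects that carry only the documented attributes; it is an illustration, not SDK code.

```python
from types import SimpleNamespace


def describe_usage(usage):
    """Summarize a usage object by branching on its type discriminator."""
    if usage is None:
        return "no usage reported"
    if usage.type == "tokens":
        summary = (f"{usage.input_tokens} input + {usage.output_tokens} output"
                   f" = {usage.total_tokens} tokens")
        details = getattr(usage, "input_token_details", None)
        if details is not None and details.audio_tokens is not None:
            summary += f" ({details.audio_tokens} audio input tokens)"
        return summary
    if usage.type == "duration":
        return f"{usage.seconds:.1f} seconds of audio"
    raise ValueError(f"unexpected usage type: {usage.type!r}")


# Hypothetical usage objects, one per documented variant.
token_usage = SimpleNamespace(
    type="tokens", input_tokens=120, output_tokens=30, total_tokens=150,
    input_token_details=SimpleNamespace(audio_tokens=100, text_tokens=20),
)
duration_usage = SimpleNamespace(type="duration", seconds=12.34)

print(describe_usage(token_usage))
print(describe_usage(duration_usage))
```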

Translations

Create translation

audio.translations.create(TranslationCreateParams**kwargs) -> TranslationCreateResponse

post /audio/translations

Translates audio into English.

Parameters

file: FileTypes

The audio file object (not file name) to translate, in one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm.
model: Union[str, AudioModel]

ID of the model to use. Only whisper-1 (which is powered by our open source Whisper V2 model) is currently available.
- str
- Literal["whisper-1", "gpt-4o-transcribe", "gpt-4o-mini-transcribe", 2 more]
  - "whisper-1"
  - "gpt-4o-transcribe"
  - "gpt-4o-mini-transcribe"
  - "gpt-4o-mini-transcribe-2025-12-15"
  - "gpt-4o-transcribe-diarize"
prompt: Optional[str]

An optional text to guide the model's style or continue a previous audio segment. The prompt should be in English.
response_format: Optional[Literal["json", "text", "srt", 2 more]]

The format of the output, in one of these options: json, text, srt, verbose_json, or vtt.
- "json"
- "text"
- "srt"
- "verbose_json"
- "vtt"
temperature: Optional[float]

The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use log probability to automatically increase the temperature until certain thresholds are hit.

Returns

TranslationCreateResponse
- class Translation: …
  - text: str
- class TranslationVerbose: …
  - duration: float
    
    The duration of the input audio.
  - language: str
    
    The language of the output translation (always English).
  - text: str
    
    The translated text.
  - segments: Optional[List[TranscriptionSegment]]
    
    Segments of the translated text and their corresponding details.
    - id: int
      
      Unique identifier of the segment.
    - avg_logprob: float
      
      Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.
    - compression_ratio: float
      
      Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.
    - end: float
      
      End time of the segment in seconds.
    - no_speech_prob: float
      
      Probability of no speech in the segment. If the value is higher than 1.0 and the avg_logprob is below -1, consider this segment silent.
    - seek: int
      
      Seek offset of the segment.
    - start: float
      
      Start time of the segment in seconds.
    - temperature: float
      
      Temperature parameter used for generating the segment.
    - text: str
      
      Text content of the segment.
    - tokens: List[int]
      
      Array of token IDs for the text content.

Example

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY"),  # This is the default and can be omitted
)
translation = client.audio.translations.create(
    file=b"Example data",
    model="whisper-1",
)
print(translation)

Response

{
  "text": "text"
}

Example

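from openai import OpenAI

client = OpenAI()

# Open the file in binary mode; the with block closes it after the request.
with open("speech.mp3", "rb") as audio_file:
    translation = client.audio.translations.create(
        model="whisper-1",
        file=audio_file,
    )

print(translation.text)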

Response

{
  "text": "Hello, my name is Wolfgang and I come from Germany. Where are you heading today?"
}

Domain Types

Translation

class Translation: …
- text: str

Translation Verbose

class TranslationVerbose: …
- duration: float
  
  The duration of the input audio.
- language: str
  
  The language of the output translation (always English).
- text: str
  
  The translated text.
- segments: Optional[List[TranscriptionSegment]]
  
  Segments of the translated text and their corresponding details.
  - id: int
    
    Unique identifier of the segment.
  - avg_logprob: float
    
    Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.
  - compression_ratio: float
    
    Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.
  - end: float
    
    End time of the segment in seconds.
  - no_speech_prob: float
    
    Probability of no speech in the segment. If the value is higher than 1.0 and the avg_logprob is below -1, consider this segment silent.
  - seek: int
    
    Seek offset of the segment.
  - start: float
    
    Start time of the segment in seconds.
  - temperature: float
    
    Temperature parameter used for generating the segment.
  - text: str
    
    Text content of the segment.
  - tokens: List[int]
    
    Array of token IDs for the text content.

Translation Create Response

TranslationCreateResponse
- class Translation: …
  - text: str
- class TranslationVerbose: …
  - duration: float
    
    The duration of the input audio.
  - language: str
    
    The language of the output translation (always English).
  - text: str
    
    The translated text.
  - segments: Optional[List[TranscriptionSegment]]
    
    Segments of the translated text and their corresponding details.
    - id: int
      
      Unique identifier of the segment.
    - avg_logprob: float
      
      Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.
    - compression_ratio: float
      
      Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.
    - end: float
      
      End time of the segment in seconds.
    - no_speech_prob: float
      
      Probability of no speech in the segment. If the value is higher than 1.0 and the avg_logprob is below -1, consider this segment silent.
    - seek: int
      
      Seek offset of the segment.
    - start: float
      
      Start time of the segment in seconds.
    - temperature: float
      
      Temperature parameter used for generating the segment.
    - text: str
      
      Text content of the segment.
    - tokens: List[int]
      
      Array of token IDs for the text content.

Speech

Create speech

audio.speech.create(SpeechCreateParams**kwargs) -> BinaryResponseContent

post /audio/speech

Generates audio from the input text.

Returns the audio file content, or a stream of audio events.

Parameters

input: str

The text to generate audio for. The maximum length is 4096 characters.
model: Union[str, SpeechModel]

One of the available TTS models: tts-1, tts-1-hd, gpt-4o-mini-tts, or gpt-4o-mini-tts-2025-12-15.
- str
- Literal["tts-1", "tts-1-hd", "gpt-4o-mini-tts", "gpt-4o-mini-tts-2025-12-15"]
  - "tts-1"
  - "tts-1-hd"
  - "gpt-4o-mini-tts"
  - "gpt-4o-mini-tts-2025-12-15"
voice: Voice

The voice to use when generating the audio. Supported built-in voices are alloy, ash, ballad, coral, echo, fable, onyx, nova, sage, shimmer, verse, marin, and cedar. You may also provide a custom voice object with an id, for example { "id": "voice_1234" }. Previews of the voices are available in the Text to speech guide.
- str
- Literal["alloy", "ash", "ballad", 7 more]
  - "alloy"
  - "ash"
  - "ballad"
  - "coral"
  - "echo"
  - "sage"
  - "shimmer"
  - "verse"
  - "marin"
  - "cedar"
- class VoiceID: …
  
  Custom voice reference.
  - id: str
    
    The custom voice ID, e.g. voice_1234.
instructions: Optional[str]

Control the voice of your generated audio with additional instructions. Does not work with tts-1 or tts-1-hd.
response_format: Optional[Literal["mp3", "opus", "aac", 3 more]]

The format to return the audio in. Supported formats are mp3, opus, aac, flac, wav, and pcm.
- "mp3"
- "opus"
- "aac"
- "flac"
- "wav"
- "pcm"
speed: Optional[float]

The speed of the generated audio. Select a value from 0.25 to 4.0. 1.0 is the default.
stream_format: Optional[Literal["sse", "audio"]]

The format to stream the audio in. Supported formats are sse and audio. sse is not supported for tts-1 or tts-1-hd.
- "sse"
- "audio"

Returns

BinaryResponseContent
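
The parameter constraints above (input length, speed range, and the options that tts-1 and tts-1-hd do not support) can be checked locally before sending a request. This is a minimal sketch: check_speech_params is a hypothetical helper, and counting characters with len() is an assumption about how the 4096-character limit is measured.

```python
LEGACY_TTS_MODELS = {"tts-1", "tts-1-hd"}
MAX_INPUT_CHARS = 4096


def check_speech_params(*, text, model, speed=1.0, instructions=None,
                        stream_format=None):
    """Return a list of problems found in the arguments; empty means none found."""
    problems = []
    if len(text) > MAX_INPUT_CHARS:
        problems.append(
            f"input has {len(text)} characters; the maximum is {MAX_INPUT_CHARS}")
    if not 0.25 <= speed <= 4.0:
        problems.append("speed must be between 0.25 and 4.0")
    if model in LEGACY_TTS_MODELS:
        # These two options are documented as unavailable on the older models.
        if instructions is not None:
            problems.append(f"instructions does not work with {model}")
        if stream_format == "sse":
            problems.append(f"stream_format 'sse' is not supported for {model}")
    return problems


print(check_speech_params(text="Hello!", model="gpt-4o-mini-tts",
                          instructions="Speak calmly."))
print(check_speech_params(text="Hello!", model="tts-1", speed=5.0,
                          instructions="Speak calmly.", stream_format="sse"))
```

Returning a list rather than raising on the first problem lets a caller report every issue at once.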

Example

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY"),  # This is the default and can be omitted
)
speech = client.audio.speech.create(
    input="input",
    model="tts-1",
    voice="alloy",
)
print(speech)
content = speech.read()
print(content)

Example

from pathlib import Path
import openai

speech_file_path = Path(__file__).parent / "speech.mp3"
with openai.audio.speech.with_streaming_response.create(
  model="gpt-4o-mini-tts",
  voice="alloy",
  input="The quick brown fox jumped over the lazy dog."
) as response:
  response.stream_to_file(speech_file_path)

Domain Types

Speech Model

Literal["tts-1", "tts-1-hd", "gpt-4o-mini-tts", "gpt-4o-mini-tts-2025-12-15"]
- "tts-1"
- "tts-1-hd"
- "gpt-4o-mini-tts"
- "gpt-4o-mini-tts-2025-12-15"

Voices

Voice Consents

python/resources/audio/index.md +2213 −0 created

# Audio

## Domain Types

### Audio Model

- `Literal["whisper-1", "gpt-4o-transcribe", "gpt-4o-mini-transcribe", 2 more]`

  - `"whisper-1"`

  - `"gpt-4o-transcribe"`

  - `"gpt-4o-mini-transcribe"`

  - `"gpt-4o-mini-transcribe-2025-12-15"`

  - `"gpt-4o-transcribe-diarize"`

### Audio Response Format

- `Literal["json", "text", "srt", 3 more]`

  The format of the output, in one of these options: `json`, `text`, `srt`, `verbose_json`, `vtt`, or `diarized_json`. For `gpt-4o-transcribe` and `gpt-4o-mini-transcribe`, the only supported format is `json`. For `gpt-4o-transcribe-diarize`, the supported formats are `json`, `text`, and `diarized_json`, with `diarized_json` required to receive speaker annotations.

  - `"json"`

  - `"text"`

  - `"srt"`

  - `"verbose_json"`

  - `"vtt"`

  - `"diarized_json"`

# Transcriptions

## Create transcription

`audio.transcriptions.create(TranscriptionCreateParams**kwargs) -> TranscriptionCreateResponse`

**post** `/audio/transcriptions`

Transcribes audio into the input language.

Returns a transcription object in `json`, `diarized_json`, or `verbose_json`
format, or a stream of transcript events.

### Parameters

- `file: FileTypes`

  The audio file object (not file name) to transcribe, in one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm.

- `model: Union[str, AudioModel]`

  ID of the model to use. The options are `gpt-4o-transcribe`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `whisper-1` (which is powered by our open source Whisper V2 model), and `gpt-4o-transcribe-diarize`.

  - `str`

  - `Literal["whisper-1", "gpt-4o-transcribe", "gpt-4o-mini-transcribe", 2 more]`

    - `"whisper-1"`

    - `"gpt-4o-transcribe"`

    - `"gpt-4o-mini-transcribe"`

    - `"gpt-4o-mini-transcribe-2025-12-15"`

    - `"gpt-4o-transcribe-diarize"`

- `chunking_strategy: Optional[ChunkingStrategy]`

  Controls how the audio is cut into chunks. When set to `"auto"`, the server first normalizes loudness and then uses voice activity detection (VAD) to choose boundaries. `server_vad` object can be provided to tweak VAD detection parameters manually. If unset, the audio is transcribed as a single block. Required when using `gpt-4o-transcribe-diarize` for inputs longer than 30 seconds.

  - `Literal["auto"]`

    Automatically set chunking parameters based on the audio. Must be set to `"auto"`.

    - `"auto"`

  - `class ChunkingStrategyVadConfig: …`

    - `type: Literal["server_vad"]`

      Must be set to `server_vad` to enable manual chunking using server side VAD.

      - `"server_vad"`

    - `prefix_padding_ms: Optional[int]`

      Amount of audio to include before the VAD detected speech (in
      milliseconds).

    - `silence_duration_ms: Optional[int]`

      Duration of silence to detect speech stop (in milliseconds).
      With shorter values the model will respond more quickly,
      but may jump in on short pauses from the user.

    - `threshold: Optional[float]`

      Sensitivity threshold (0.0 to 1.0) for voice activity detection. A
      higher threshold will require louder audio to activate the model, and
      thus might perform better in noisy environments.
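
A manual VAD configuration for `chunking_strategy` can be assembled as a plain dictionary. The sketch below is illustrative only: `vad_chunking` is a hypothetical helper, and the numeric values in the usage line are arbitrary examples, not recommended settings. Only the key names and the 0.0 to 1.0 range for `threshold` come from the parameter descriptions above.

```python
def vad_chunking(threshold=None, prefix_padding_ms=None, silence_duration_ms=None):
    """Build a server_vad chunking_strategy dict, omitting options left unset."""
    if threshold is not None and not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    strategy = {"type": "server_vad"}
    optional = {
        "threshold": threshold,
        "prefix_padding_ms": prefix_padding_ms,
        "silence_duration_ms": silence_duration_ms,
    }
    # Leave unset options out so the server-side defaults apply.
    strategy.update({key: value for key, value in optional.items() if value is not None})
    return strategy


# Example values are placeholders chosen for illustration.
print(vad_chunking(threshold=0.6, silence_duration_ms=400))
```

The resulting dictionary would be passed as `chunking_strategy=...`; passing the string `"auto"` instead lets the server choose the parameters.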

- `include: Optional[List[TranscriptionInclude]]`

  Additional information to include in the transcription response.
  `logprobs` will return the log probabilities of the tokens in the
  response to understand the model's confidence in the transcription.
  `logprobs` only works with response_format set to `json` and only with
  the models `gpt-4o-transcribe`, `gpt-4o-mini-transcribe`, and `gpt-4o-mini-transcribe-2025-12-15`. This field is not supported when using `gpt-4o-transcribe-diarize`.

  - `"logprobs"`

- `known_speaker_names: Optional[Sequence[str]]`

  Optional list of speaker names that correspond to the audio samples provided in `known_speaker_references[]`. Each entry should be a short identifier (for example `customer` or `agent`). Up to 4 speakers are supported.

- `known_speaker_references: Optional[Sequence[str]]`

  Optional list of audio samples (as [data URLs](https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/Data_URLs)) that contain known speaker references matching `known_speaker_names[]`. Each sample must be between 2 and 10 seconds, and can use any of the same input audio formats supported by `file`.
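
Since `known_speaker_references` expects data URLs, a reference clip has to be base64-encoded first. The sketch below shows one way to do that with the standard library. `to_data_url` is a hypothetical helper, the `audio/wav` MIME type is an assumption about the clip format, and the four-byte payload is a placeholder rather than real audio.

```python
import base64


def to_data_url(audio_bytes, mime_type="audio/wav"):
    """Encode raw audio bytes as a base64 data URL."""
    encoded = base64.b64encode(audio_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# Placeholder bytes; a real reference clip would be 2 to 10 seconds of audio.
sample = b"RIFF"
print(to_data_url(sample))
```

Each data URL would be listed in `known_speaker_references` in the same order as its label in `known_speaker_names`.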

- `language: Optional[str]`

  The language of the input audio. Supplying the input language in [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) (e.g. `en`) format will improve accuracy and latency.

- `prompt: Optional[str]`

  An optional text to guide the model's style or continue a previous audio segment. The [prompt](https://platform.openai.com/docs/guides/speech-to-text#prompting) should match the audio language. This field is not supported when using `gpt-4o-transcribe-diarize`.

- `response_format: Optional[AudioResponseFormat]`

  The format of the output, in one of these options: `json`, `text`, `srt`, `verbose_json`, `vtt`, or `diarized_json`. For `gpt-4o-transcribe` and `gpt-4o-mini-transcribe`, the only supported format is `json`. For `gpt-4o-transcribe-diarize`, the supported formats are `json`, `text`, and `diarized_json`, with `diarized_json` required to receive speaker annotations.

  - `"json"`

  - `"text"`

  - `"srt"`

  - `"verbose_json"`

  - `"vtt"`

  - `"diarized_json"`

- `stream: Optional[Literal[false]]`

  If set to true, the model response data will be streamed to the client
  as it is generated using [server-sent events](https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events/Using_server-sent_events#Event_stream_format).
  See the [Streaming section of the Speech-to-Text guide](https://platform.openai.com/docs/guides/speech-to-text?lang=curl#streaming-transcriptions)
  for more information.

  Note: Streaming is not supported for the `whisper-1` model and will be ignored.

  - `false`

- `temperature: Optional[float]`

  The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use [log probability](https://en.wikipedia.org/wiki/Log_probability) to automatically increase the temperature until certain thresholds are hit.

- `timestamp_granularities: Optional[List[Literal["word", "segment"]]]`

  The timestamp granularities to populate for this transcription. `response_format` must be set to `verbose_json` to use timestamp granularities. Either or both of these options are supported: `word`, or `segment`. Note: There is no additional latency for segment timestamps, but generating word timestamps incurs additional latency.
  This option is not available for `gpt-4o-transcribe-diarize`.

  - `"word"`

  - `"segment"`

### Returns

- `TranscriptionCreateResponse`

  Represents a transcription response returned by the model, based on the provided input.

  - `class Transcription: …`

    Represents a transcription response returned by the model, based on the provided input.

    - `text: str`

      The transcribed text.

    - `logprobs: Optional[List[Logprob]]`

      The log probabilities of the tokens in the transcription. Only returned with the models `gpt-4o-transcribe` and `gpt-4o-mini-transcribe` if `logprobs` is added to the `include` array.

      - `token: Optional[str]`

        The token in the transcription.

      - `bytes: Optional[List[float]]`

        The bytes of the token.

      - `logprob: Optional[float]`

        The log probability of the token.

    - `usage: Optional[Usage]`

      Token usage statistics for the request.

      - `class UsageTokens: …`

        Usage statistics for models billed by token usage.

        - `input_tokens: int`

          Number of input tokens billed for this request.

        - `output_tokens: int`

          Number of output tokens generated.

        - `total_tokens: int`

          Total number of tokens used (input + output).

        - `type: Literal["tokens"]`

          The type of the usage object. Always `tokens` for this variant.

          - `"tokens"`

        - `input_token_details: Optional[UsageTokensInputTokenDetails]`

          Details about the input tokens billed for this request.

          - `audio_tokens: Optional[int]`

            Number of audio tokens billed for this request.

          - `text_tokens: Optional[int]`

            Number of text tokens billed for this request.

      - `class UsageDuration: …`

        Usage statistics for models billed by audio input duration.

        - `seconds: float`

          Duration of the input audio in seconds.

        - `type: Literal["duration"]`

          The type of the usage object. Always `duration` for this variant.

          - `"duration"`

257 - `class TranscriptionDiarized: …`

258

259 Represents a diarized transcription response returned by the model, including the combined transcript and speaker-segment annotations.

260

261 - `duration: float`

262

263 Duration of the input audio in seconds.

264

265 - `segments: List[TranscriptionDiarizedSegment]`

266

267 Segments of the transcript annotated with timestamps and speaker labels.

268

269 - `id: str`

270

271 Unique identifier for the segment.

272

273 - `end: float`

274

275 End timestamp of the segment in seconds.

276

277 - `speaker: str`

278

279 Speaker label for this segment. When known speakers are provided, the label matches `known_speaker_names[]`. Otherwise speakers are labeled sequentially using capital letters (`A`, `B`, ...).

280

281 - `start: float`

282

283 Start timestamp of the segment in seconds.

284

285 - `text: str`

286

287 Transcript text for this segment.

288

289 - `type: Literal["transcript.text.segment"]`

290

291 The type of the segment. Always `transcript.text.segment`.

292

293 - `"transcript.text.segment"`

294

295 - `task: Literal["transcribe"]`

296

297 The type of task that was run. Always `transcribe`.

298

299 - `"transcribe"`

300

301 - `text: str`

302

303 The concatenated transcript text for the entire audio input.

304

305 - `usage: Optional[Usage]`

306

307 Token or duration usage statistics for the request.

308

309 - `class UsageTokens: …`

310

311 Usage statistics for models billed by token usage.

312

313 - `input_tokens: int`

314

315 Number of input tokens billed for this request.

316

317 - `output_tokens: int`

318

319 Number of output tokens generated.

320

321 - `total_tokens: int`

322

323 Total number of tokens used (input + output).

324

325 - `type: Literal["tokens"]`

326

327 The type of the usage object. Always `tokens` for this variant.

328

329 - `"tokens"`

330

331 - `input_token_details: Optional[UsageTokensInputTokenDetails]`

332

333 Details about the input tokens billed for this request.

334

335 - `audio_tokens: Optional[int]`

336

337 Number of audio tokens billed for this request.

338

339 - `text_tokens: Optional[int]`

340

341 Number of text tokens billed for this request.

342

343 - `class UsageDuration: …`

344

345 Usage statistics for models billed by audio input duration.

346

347 - `seconds: float`

348

349 Duration of the input audio in seconds.

350

351 - `type: Literal["duration"]`

352

353 The type of the usage object. Always `duration` for this variant.

354

355 - `"duration"`

356

357 - `class TranscriptionVerbose: …`

358

 Represents a verbose JSON transcription response returned by the model, based on the provided input.

360

361 - `duration: float`

362

363 The duration of the input audio.

364

365 - `language: str`

366

367 The language of the input audio.

368

369 - `text: str`

370

371 The transcribed text.

372

373 - `segments: Optional[List[TranscriptionSegment]]`

374

375 Segments of the transcribed text and their corresponding details.

376

377 - `id: int`

378

379 Unique identifier of the segment.

380

381 - `avg_logprob: float`

382

383 Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.

384

385 - `compression_ratio: float`

386

387 Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.

388

389 - `end: float`

390

391 End time of the segment in seconds.

392

393 - `no_speech_prob: float`

394

 Probability of no speech in the segment. If the value is close to 1.0 and the `avg_logprob` is below -1, consider this segment silent.

396

397 - `seek: int`

398

399 Seek offset of the segment.

400

401 - `start: float`

402

403 Start time of the segment in seconds.

404

405 - `temperature: float`

406

407 Temperature parameter used for generating the segment.

408

409 - `text: str`

410

411 Text content of the segment.

412

413 - `tokens: List[int]`

414

415 Array of token IDs for the text content.

416

417 - `usage: Optional[Usage]`

418

419 Usage statistics for models billed by audio input duration.

420

421 - `seconds: float`

422

423 Duration of the input audio in seconds.

424

425 - `type: Literal["duration"]`

426

427 The type of the usage object. Always `duration` for this variant.

428

429 - `"duration"`

430

431 - `words: Optional[List[TranscriptionWord]]`

432

433 Extracted words and their corresponding timestamps.

434

435 - `end: float`

436

437 End time of the word in seconds.

438

439 - `start: float`

440

441 Start time of the word in seconds.

442

443 - `word: str`

444

445 The text content of the word.

### Example

```python
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY"),  # This is the default and can be omitted
)

# A non-streaming call returns a single transcription object, not an iterable.
with open("speech.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        file=audio_file,
        model="gpt-4o-transcribe",
    )

print(transcription)
```

#### Response

```json
{
  "text": "text",
  "logprobs": [
    {
      "token": "token",
      "bytes": [
        0
      ],
      "logprob": 0
    }
  ],
  "usage": {
    "input_tokens": 0,
    "output_tokens": 0,
    "total_tokens": 0,
    "type": "tokens",
    "input_token_details": {
      "audio_tokens": 0,
      "text_tokens": 0
    }
  }
}
```

### Example

```python
from openai import OpenAI

client = OpenAI()

# Open the audio in binary mode; the context manager closes it afterwards.
with open("speech.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
    )
```

#### Response

```json
{
  "text": "Imagine the wildest idea that you've ever had, and you're curious about how it might scale to something that's a 100, a 1,000 times bigger. This is a place where you can get to do that.",
  "usage": {
    "type": "tokens",
    "input_tokens": 14,
    "input_token_details": {
      "text_tokens": 0,
      "audio_tokens": 14
    },
    "output_tokens": 45,
    "total_tokens": 59
  }
}
```

### Diarization

```python
import base64
from openai import OpenAI

client = OpenAI()


def to_data_url(path: str) -> str:
    # Encode a WAV reference clip as a base64 data URL.
    with open(path, "rb") as fh:
        return "data:audio/wav;base64," + base64.b64encode(fh.read()).decode("utf-8")


with open("meeting.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe-diarize",
        file=audio_file,
        response_format="diarized_json",
        chunking_strategy="auto",
        extra_body={
            "known_speaker_names": ["agent"],
            "known_speaker_references": [to_data_url("agent.wav")],
        },
    )

print(transcript.segments)
```

#### Response

```json
{
  "task": "transcribe",
  "duration": 27.4,
  "text": "Agent: Thanks for calling OpenAI support.\nA: Hi, I'm trying to enable diarization.\nAgent: Happy to walk you through the steps.",
  "segments": [
    {
      "type": "transcript.text.segment",
      "id": "seg_001",
      "start": 0.0,
      "end": 4.7,
      "text": "Thanks for calling OpenAI support.",
      "speaker": "agent"
    },
    {
      "type": "transcript.text.segment",
      "id": "seg_002",
      "start": 4.7,
      "end": 11.8,
      "text": "Hi, I'm trying to enable diarization.",
      "speaker": "A"
    },
    {
      "type": "transcript.text.segment",
      "id": "seg_003",
      "start": 12.1,
      "end": 18.5,
      "text": "Happy to walk you through the steps.",
      "speaker": "agent"
    }
  ],
  "usage": {
    "type": "duration",
    "seconds": 27
  }
}
```
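
A diarized response is often post-processed per speaker. The sketch below works on plain dictionaries shaped like the `segments` array above. `speaking_time` and `render_dialogue` are hypothetical helper names and not part of the SDK.

```python
from collections import defaultdict
from typing import Dict, List


def speaking_time(segments: List[dict]) -> Dict[str, float]:
    """Sum segment durations (end - start, in seconds) per speaker label."""
    totals: Dict[str, float] = defaultdict(float)
    for seg in segments:
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    return dict(totals)


def render_dialogue(segments: List[dict]) -> str:
    """Render one 'speaker: text' line per segment, ordered by start time."""
    ordered = sorted(segments, key=lambda seg: seg["start"])
    return "\n".join(f'{seg["speaker"]}: {seg["text"]}' for seg in ordered)


# Segment values copied from the sample response.
segments = [
    {"speaker": "agent", "start": 0.0, "end": 4.7, "text": "Thanks for calling OpenAI support."},
    {"speaker": "A", "start": 4.7, "end": 11.8, "text": "Hi, I'm trying to enable diarization."},
    {"speaker": "agent", "start": 12.1, "end": 18.5, "text": "Happy to walk you through the steps."},
]

totals = speaking_time(segments)
print(render_dialogue(segments))
```

With the SDK's typed objects, attribute access (`seg.speaker`) would presumably replace the dictionary lookups.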

### Streaming

```python
from openai import OpenAI

client = OpenAI()

with open("speech.mp3", "rb") as audio_file:
    stream = client.audio.transcriptions.create(
        file=audio_file,
        model="gpt-4o-mini-transcribe",
        stream=True,
    )

    # Each item is a transcript event (delta or done).
    for event in stream:
        print(event)
```

#### Response

```json
data: {"type":"transcript.text.delta","delta":"I","logprobs":[{"token":"I","logprob":-0.00007588794,"bytes":[73]}]}

data: {"type":"transcript.text.delta","delta":" see","logprobs":[{"token":" see","logprob":-3.1281633e-7,"bytes":[32,115,101,101]}]}

data: {"type":"transcript.text.delta","delta":" skies","logprobs":[{"token":" skies","logprob":-2.3392786e-6,"bytes":[32,115,107,105,101,115]}]}

data: {"type":"transcript.text.delta","delta":" of","logprobs":[{"token":" of","logprob":-3.1281633e-7,"bytes":[32,111,102]}]}

data: {"type":"transcript.text.delta","delta":" blue","logprobs":[{"token":" blue","logprob":-1.0280384e-6,"bytes":[32,98,108,117,101]}]}

data: {"type":"transcript.text.delta","delta":" and","logprobs":[{"token":" and","logprob":-0.0005108566,"bytes":[32,97,110,100]}]}

data: {"type":"transcript.text.delta","delta":" clouds","logprobs":[{"token":" clouds","logprob":-1.9361265e-7,"bytes":[32,99,108,111,117,100,115]}]}

data: {"type":"transcript.text.delta","delta":" of","logprobs":[{"token":" of","logprob":-1.9361265e-7,"bytes":[32,111,102]}]}

data: {"type":"transcript.text.delta","delta":" white","logprobs":[{"token":" white","logprob":-7.89631e-7,"bytes":[32,119,104,105,116,101]}]}

data: {"type":"transcript.text.delta","delta":",","logprobs":[{"token":",","logprob":-0.0014890312,"bytes":[44]}]}

data: {"type":"transcript.text.delta","delta":" the","logprobs":[{"token":" the","logprob":-0.0110956915,"bytes":[32,116,104,101]}]}

data: {"type":"transcript.text.delta","delta":" bright","logprobs":[{"token":" bright","logprob":0.0,"bytes":[32,98,114,105,103,104,116]}]}

data: {"type":"transcript.text.delta","delta":" blessed","logprobs":[{"token":" blessed","logprob":-0.000045848617,"bytes":[32,98,108,101,115,115,101,100]}]}

data: {"type":"transcript.text.delta","delta":" days","logprobs":[{"token":" days","logprob":-0.000010802739,"bytes":[32,100,97,121,115]}]}

data: {"type":"transcript.text.delta","delta":",","logprobs":[{"token":",","logprob":-0.00001700133,"bytes":[44]}]}

data: {"type":"transcript.text.delta","delta":" the","logprobs":[{"token":" the","logprob":-0.0000118755715,"bytes":[32,116,104,101]}]}

data: {"type":"transcript.text.delta","delta":" dark","logprobs":[{"token":" dark","logprob":-5.5122365e-7,"bytes":[32,100,97,114,107]}]}

data: {"type":"transcript.text.delta","delta":" sacred","logprobs":[{"token":" sacred","logprob":-5.4385737e-6,"bytes":[32,115,97,99,114,101,100]}]}

data: {"type":"transcript.text.delta","delta":" nights","logprobs":[{"token":" nights","logprob":-4.00813e-6,"bytes":[32,110,105,103,104,116,115]}]}

data: {"type":"transcript.text.delta","delta":",","logprobs":[{"token":",","logprob":-0.0036910512,"bytes":[44]}]}

data: {"type":"transcript.text.delta","delta":" and","logprobs":[{"token":" and","logprob":-0.0031903093,"bytes":[32,97,110,100]}]}

data: {"type":"transcript.text.delta","delta":" I","logprobs":[{"token":" I","logprob":-1.504853e-6,"bytes":[32,73]}]}

data: {"type":"transcript.text.delta","delta":" think","logprobs":[{"token":" think","logprob":-4.3202e-7,"bytes":[32,116,104,105,110,107]}]}

data: {"type":"transcript.text.delta","delta":" to","logprobs":[{"token":" to","logprob":-1.9361265e-7,"bytes":[32,116,111]}]}

data: {"type":"transcript.text.delta","delta":" myself","logprobs":[{"token":" myself","logprob":-1.7432603e-6,"bytes":[32,109,121,115,101,108,102]}]}

data: {"type":"transcript.text.delta","delta":",","logprobs":[{"token":",","logprob":-0.29254505,"bytes":[44]}]}

data: {"type":"transcript.text.delta","delta":" what","logprobs":[{"token":" what","logprob":-0.016815351,"bytes":[32,119,104,97,116]}]}

data: {"type":"transcript.text.delta","delta":" a","logprobs":[{"token":" a","logprob":-3.1281633e-7,"bytes":[32,97]}]}

data: {"type":"transcript.text.delta","delta":" wonderful","logprobs":[{"token":" wonderful","logprob":-2.1008714e-6,"bytes":[32,119,111,110,100,101,114,102,117,108]}]}

data: {"type":"transcript.text.delta","delta":" world","logprobs":[{"token":" world","logprob":-8.180258e-6,"bytes":[32,119,111,114,108,100]}]}

data: {"type":"transcript.text.delta","delta":".","logprobs":[{"token":".","logprob":-0.014231676,"bytes":[46]}]}

data: {"type":"transcript.text.done","text":"I see skies of blue and clouds of white, the bright blessed days, the dark sacred nights, and I think to myself, what a wonderful world.","logprobs":[{"token":"I","logprob":-0.00007588794,"bytes":[73]},{"token":" see","logprob":-3.1281633e-7,"bytes":[32,115,101,101]},{"token":" skies","logprob":-2.3392786e-6,"bytes":[32,115,107,105,101,115]},{"token":" of","logprob":-3.1281633e-7,"bytes":[32,111,102]},{"token":" blue","logprob":-1.0280384e-6,"bytes":[32,98,108,117,101]},{"token":" and","logprob":-0.0005108566,"bytes":[32,97,110,100]},{"token":" clouds","logprob":-1.9361265e-7,"bytes":[32,99,108,111,117,100,115]},{"token":" of","logprob":-1.9361265e-7,"bytes":[32,111,102]},{"token":" white","logprob":-7.89631e-7,"bytes":[32,119,104,105,116,101]},{"token":",","logprob":-0.0014890312,"bytes":[44]},{"token":" the","logprob":-0.0110956915,"bytes":[32,116,104,101]},{"token":" bright","logprob":0.0,"bytes":[32,98,114,105,103,104,116]},{"token":" blessed","logprob":-0.000045848617,"bytes":[32,98,108,101,115,115,101,100]},{"token":" days","logprob":-0.000010802739,"bytes":[32,100,97,121,115]},{"token":",","logprob":-0.00001700133,"bytes":[44]},{"token":" the","logprob":-0.0000118755715,"bytes":[32,116,104,101]},{"token":" dark","logprob":-5.5122365e-7,"bytes":[32,100,97,114,107]},{"token":" sacred","logprob":-5.4385737e-6,"bytes":[32,115,97,99,114,101,100]},{"token":" nights","logprob":-4.00813e-6,"bytes":[32,110,105,103,104,116,115]},{"token":",","logprob":-0.0036910512,"bytes":[44]},{"token":" and","logprob":-0.0031903093,"bytes":[32,97,110,100]},{"token":" I","logprob":-1.504853e-6,"bytes":[32,73]},{"token":" think","logprob":-4.3202e-7,"bytes":[32,116,104,105,110,107]},{"token":" to","logprob":-1.9361265e-7,"bytes":[32,116,111]},{"token":" myself","logprob":-1.7432603e-6,"bytes":[32,109,121,115,101,108,102]},{"token":",","logprob":-0.29254505,"bytes":[44]},{"token":" what","logprob":-0.016815351,"bytes":[32,119,104,97,116]},{"token":" a","logprob":-3.1281633e-7,"bytes":[32,97]},{"token":" wonderful","logprob":-2.1008714e-6,"bytes":[32,119,111,110,100,101,114,102,117,108]},{"token":" world","logprob":-8.180258e-6,"bytes":[32,119,111,114,108,100]},{"token":".","logprob":-0.014231676,"bytes":[46]}],"usage":{"input_tokens":14,"input_token_details":{"text_tokens":0,"audio_tokens":14},"output_tokens":45,"total_tokens":59}}
```
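
When the endpoint is called over raw HTTP rather than through the SDK, the stream arrives as `data:` lines like those above. The sketch below assumes one JSON payload per `data:` line, as in the sample, and shows how the deltas might be reassembled and compared with the final `transcript.text.done` text. `iter_sse_events` and `accumulate` are hypothetical names.

```python
import json
from typing import Iterable, Iterator, Tuple


def iter_sse_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the JSON payload of each `data:` line; skip blank separators."""
    for line in lines:
        line = line.strip()
        if line.startswith("data:"):
            yield json.loads(line[len("data:"):])


def accumulate(events: Iterable[dict]) -> Tuple[str, str]:
    """Return (text rebuilt from deltas, text reported by the done event)."""
    parts = []
    final_text = ""
    for event in events:
        if event["type"] == "transcript.text.delta":
            parts.append(event["delta"])
        elif event["type"] == "transcript.text.done":
            final_text = event["text"]
    return "".join(parts), final_text


# Shortened stand-in for the stream shown above.
raw = [
    'data: {"type":"transcript.text.delta","delta":"I"}',
    "",
    'data: {"type":"transcript.text.delta","delta":" see"}',
    "",
    'data: {"type":"transcript.text.delta","delta":" skies"}',
    "",
    'data: {"type":"transcript.text.done","text":"I see skies"}',
]

built, final = accumulate(iter_sse_events(raw))
print(built == final)
```

The Python SDK already yields parsed event objects, so this parsing step is only relevant for hand-rolled HTTP clients.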

### Logprobs

```python
from openai import OpenAI

client = OpenAI()

with open("speech.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        file=audio_file,
        model="gpt-4o-transcribe",
        response_format="json",
        include=["logprobs"],
    )

print(transcript)
```

#### Response

```json
{
  "text": "Hey, my knee is hurting and I want to see the doctor tomorrow ideally.",
  "logprobs": [
    { "token": "Hey", "logprob": -1.0415299, "bytes": [72, 101, 121] },
    { "token": ",", "logprob": -9.805982e-5, "bytes": [44] },
    { "token": " my", "logprob": -0.00229799, "bytes": [32, 109, 121] },
    {
      "token": " knee",
      "logprob": -4.7159858e-5,
      "bytes": [32, 107, 110, 101, 101]
    },
    { "token": " is", "logprob": -0.043909557, "bytes": [32, 105, 115] },
    {
      "token": " hurting",
      "logprob": -1.1041146e-5,
      "bytes": [32, 104, 117, 114, 116, 105, 110, 103]
    },
    { "token": " and", "logprob": -0.011076359, "bytes": [32, 97, 110, 100] },
    { "token": " I", "logprob": -5.3193703e-6, "bytes": [32, 73] },
    {
      "token": " want",
      "logprob": -0.0017156356,
      "bytes": [32, 119, 97, 110, 116]
    },
    { "token": " to", "logprob": -7.89631e-7, "bytes": [32, 116, 111] },
    { "token": " see", "logprob": -5.5122365e-7, "bytes": [32, 115, 101, 101] },
    { "token": " the", "logprob": -0.0040786397, "bytes": [32, 116, 104, 101] },
    {
      "token": " doctor",
      "logprob": -2.3392786e-6,
      "bytes": [32, 100, 111, 99, 116, 111, 114]
    },
    {
      "token": " tomorrow",
      "logprob": -7.89631e-7,
      "bytes": [32, 116, 111, 109, 111, 114, 114, 111, 119]
    },
    {
      "token": " ideally",
      "logprob": -0.5800861,
      "bytes": [32, 105, 100, 101, 97, 108, 108, 121]
    },
    { "token": ".", "logprob": -0.00011093382, "bytes": [46] }
  ],
  "usage": {
    "type": "tokens",
    "input_tokens": 14,
    "input_token_details": {
      "text_tokens": 0,
      "audio_tokens": 14
    },
    "output_tokens": 45,
    "total_tokens": 59
  }
}
```
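
A log probability converts to a probability with `exp(logprob)`, which makes it easy to flag tokens the model was unsure about. The sketch below is illustrative only: `low_confidence_tokens` is a hypothetical helper and the `0.6` cutoff is an arbitrary assumption, not a documented threshold.

```python
import math
from typing import List


def low_confidence_tokens(logprobs: List[dict], min_prob: float = 0.6) -> List[str]:
    """Return the text of tokens whose probability exp(logprob) is below min_prob."""
    flagged = []
    for entry in logprobs:
        prob = math.exp(entry["logprob"])
        if prob < min_prob:
            # Rebuild the token text from its byte values; a lone token may
            # hold a partial multi-byte character, hence errors="replace".
            raw = bytes(int(b) for b in entry["bytes"])
            flagged.append(raw.decode("utf-8", errors="replace"))
    return flagged


# A subset of the entries from the sample response.
logprobs = [
    {"token": "Hey", "logprob": -1.0415299, "bytes": [72, 101, 121]},
    {"token": ",", "logprob": -9.805982e-5, "bytes": [44]},
    {"token": " my", "logprob": -0.00229799, "bytes": [32, 109, 121]},
    {"token": " ideally", "logprob": -0.5800861, "bytes": [32, 105, 100, 101, 97, 108, 108, 121]},
]

print(low_confidence_tokens(logprobs))
```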

### Word timestamps

```python
from openai import OpenAI

client = OpenAI()

with open("speech.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        file=audio_file,
        model="whisper-1",
        response_format="verbose_json",
        timestamp_granularities=["word"],
    )

print(transcript.words)
```

#### Response

```json
{
  "task": "transcribe",
  "language": "english",
  "duration": 8.470000267028809,
  "text": "The beach was a popular spot on a hot summer day. People were swimming in the ocean, building sandcastles, and playing beach volleyball.",
  "words": [
    {
      "word": "The",
      "start": 0.0,
      "end": 0.23999999463558197
    },
    ...
    {
      "word": "volleyball",
      "start": 7.400000095367432,
      "end": 7.900000095367432
    }
  ],
  "usage": {
    "type": "duration",
    "seconds": 9
  }
}
```
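
Word timestamps are commonly regrouped into short caption cues. The sketch below packs consecutive words greedily into cues of bounded length. `group_words` and the two-second limit are illustrative assumptions, and the sample timings after the first word are made up for the example.

```python
from typing import List


def group_words(words: List[dict], max_seconds: float = 2.0) -> List[dict]:
    """Greedily pack consecutive words into cues no longer than max_seconds."""
    cues: List[dict] = []
    for w in words:
        if cues and w["end"] - cues[-1]["start"] <= max_seconds:
            # The word still fits: extend the current cue.
            cues[-1]["text"] += " " + w["word"]
            cues[-1]["end"] = w["end"]
        else:
            # Start a new cue with this word.
            cues.append({"start": w["start"], "end": w["end"], "text": w["word"]})
    return cues


words = [
    {"word": "The", "start": 0.0, "end": 0.24},
    {"word": "beach", "start": 0.24, "end": 0.6},     # hypothetical timing
    {"word": "was", "start": 0.6, "end": 0.8},        # hypothetical timing
    {"word": "popular", "start": 2.5, "end": 3.0},    # hypothetical timing
]

cues = group_words(words)
print(cues)
```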

### Segment timestamps

```python
from openai import OpenAI

client = OpenAI()

with open("speech.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        file=audio_file,
        model="whisper-1",
        response_format="verbose_json",
        timestamp_granularities=["segment"],
    )

# Segment granularity populates `segments`, not `words`.
print(transcript.segments)
```

#### Response

```json
{
  "task": "transcribe",
  "language": "english",
  "duration": 8.470000267028809,
  "text": "The beach was a popular spot on a hot summer day. People were swimming in the ocean, building sandcastles, and playing beach volleyball.",
  "segments": [
    {
      "id": 0,
      "seek": 0,
      "start": 0.0,
      "end": 3.319999933242798,
      "text": " The beach was a popular spot on a hot summer day.",
      "tokens": [
        50364, 440, 7534, 390, 257, 3743, 4008, 322, 257, 2368, 4266, 786, 13, 50530
      ],
      "temperature": 0.0,
      "avg_logprob": -0.2860786020755768,
      "compression_ratio": 1.2363636493682861,
      "no_speech_prob": 0.00985979475080967
    },
    ...
  ],
  "usage": {
    "type": "duration",
    "seconds": 9
  }
}
```
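
If a `verbose_json` response is already in hand, its segments can be turned into SubRip cues locally. The sketch below is illustrative and the helper names are hypothetical; requesting `response_format="srt"` directly is the alternative when the model supports that format.

```python
from typing import List


def srt_timestamp(seconds: float) -> str:
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments: List[dict]) -> str:
    """Build numbered SRT cues from segment start, end, and text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}"
        )
    return "\n\n".join(blocks) + "\n"


# First segment from the sample response.
segments = [
    {
        "start": 0.0,
        "end": 3.319999933242798,
        "text": " The beach was a popular spot on a hot summer day.",
    }
]

print(segments_to_srt(segments))
```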

## Domain Types

### Transcription

- `class Transcription: …`

  Represents a transcription response returned by the model, based on the provided input.

851

852 - `text: str`

853

854 The transcribed text.

855

856 - `logprobs: Optional[List[Logprob]]`

857

858 The log probabilities of the tokens in the transcription. Only returned with the models `gpt-4o-transcribe` and `gpt-4o-mini-transcribe` if `logprobs` is added to the `include` array.

859

860 - `token: Optional[str]`

861

862 The token in the transcription.

863

864 - `bytes: Optional[List[float]]`

865

866 The bytes of the token.

867

868 - `logprob: Optional[float]`

869

870 The log probability of the token.

871

872 - `usage: Optional[Usage]`

873

874 Token usage statistics for the request.

875

876 - `class UsageTokens: …`

877

878 Usage statistics for models billed by token usage.

879

880 - `input_tokens: int`

881

882 Number of input tokens billed for this request.

883

884 - `output_tokens: int`

885

886 Number of output tokens generated.

887

888 - `total_tokens: int`

889

890 Total number of tokens used (input + output).

891

892 - `type: Literal["tokens"]`

893

894 The type of the usage object. Always `tokens` for this variant.

895

896 - `"tokens"`

897

898 - `input_token_details: Optional[UsageTokensInputTokenDetails]`

899

900 Details about the input tokens billed for this request.

901

902 - `audio_tokens: Optional[int]`

903

904 Number of audio tokens billed for this request.

905

906 - `text_tokens: Optional[int]`

907

908 Number of text tokens billed for this request.

909

910 - `class UsageDuration: …`

911

912 Usage statistics for models billed by audio input duration.

913

914 - `seconds: float`

915

916 Duration of the input audio in seconds.

917

918 - `type: Literal["duration"]`

919

920 The type of the usage object. Always `duration` for this variant.

921

922 - `"duration"`
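
Because `usage` is a union of `UsageTokens` and `UsageDuration`, client code usually branches on the `type` field. The following is a minimal sketch over plain dictionaries. `describe_usage` is a hypothetical helper, not an SDK function.

```python
def describe_usage(usage: dict) -> str:
    """Summarize a usage object, branching on its `type` discriminator."""
    if usage["type"] == "tokens":
        details = usage.get("input_token_details") or {}
        audio = details.get("audio_tokens") or 0  # optional field
        return (
            f"{usage['total_tokens']} tokens "
            f"({usage['input_tokens']} in, {usage['output_tokens']} out, {audio} audio)"
        )
    if usage["type"] == "duration":
        return f"{usage['seconds']} seconds of audio"
    raise ValueError(f"unknown usage type: {usage['type']!r}")


token_usage = {
    "type": "tokens",
    "input_tokens": 14,
    "input_token_details": {"text_tokens": 0, "audio_tokens": 14},
    "output_tokens": 45,
    "total_tokens": 59,
}
duration_usage = {"type": "duration", "seconds": 9}

print(describe_usage(token_usage))
print(describe_usage(duration_usage))
```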

### Transcription Diarized

925

926- `class TranscriptionDiarized: …`

927

928 Represents a diarized transcription response returned by the model, including the combined transcript and speaker-segment annotations.

929

930 - `duration: float`

931

932 Duration of the input audio in seconds.

933

934 - `segments: List[TranscriptionDiarizedSegment]`

935

936 Segments of the transcript annotated with timestamps and speaker labels.

937

938 - `id: str`

939

940 Unique identifier for the segment.

941

942 - `end: float`

943

944 End timestamp of the segment in seconds.

945

946 - `speaker: str`

947

948 Speaker label for this segment. When known speakers are provided, the label matches `known_speaker_names[]`. Otherwise speakers are labeled sequentially using capital letters (`A`, `B`, ...).

949

950 - `start: float`

951

952 Start timestamp of the segment in seconds.

953

954 - `text: str`

955

956 Transcript text for this segment.

957

958 - `type: Literal["transcript.text.segment"]`

959

960 The type of the segment. Always `transcript.text.segment`.

961

962 - `"transcript.text.segment"`

963

964 - `task: Literal["transcribe"]`

965

966 The type of task that was run. Always `transcribe`.

967

968 - `"transcribe"`

969

970 - `text: str`

971

972 The concatenated transcript text for the entire audio input.

973

974 - `usage: Optional[Usage]`

975

976 Token or duration usage statistics for the request.

977

978 - `class UsageTokens: …`

979

980 Usage statistics for models billed by token usage.

981

982 - `input_tokens: int`

983

984 Number of input tokens billed for this request.

985

986 - `output_tokens: int`

987

988 Number of output tokens generated.

989

990 - `total_tokens: int`

991

992 Total number of tokens used (input + output).

993

994 - `type: Literal["tokens"]`

995

996 The type of the usage object. Always `tokens` for this variant.

997

998 - `"tokens"`

999

1000 - `input_token_details: Optional[UsageTokensInputTokenDetails]`

1001

1002 Details about the input tokens billed for this request.

1003

1004 - `audio_tokens: Optional[int]`

1005

1006 Number of audio tokens billed for this request.

1007

1008 - `text_tokens: Optional[int]`

1009

1010 Number of text tokens billed for this request.

1011

1012 - `class UsageDuration: …`

1013

1014 Usage statistics for models billed by audio input duration.

1015

1016 - `seconds: float`

1017

1018 Duration of the input audio in seconds.

1019

1020 - `type: Literal["duration"]`

1021

1022 The type of the usage object. Always `duration` for this variant.

1023

1024 - `"duration"`

### Transcription Diarized Segment

1027

1028- `class TranscriptionDiarizedSegment: …`

1029

1030 A segment of diarized transcript text with speaker metadata.

1031

1032 - `id: str`

1033

1034 Unique identifier for the segment.

1035

1036 - `end: float`

1037

1038 End timestamp of the segment in seconds.

1039

1040 - `speaker: str`

1041

1042 Speaker label for this segment. When known speakers are provided, the label matches `known_speaker_names[]`. Otherwise speakers are labeled sequentially using capital letters (`A`, `B`, ...).

1043

1044 - `start: float`

1045

1046 Start timestamp of the segment in seconds.

1047

1048 - `text: str`

1049

1050 Transcript text for this segment.

1051

1052 - `type: Literal["transcript.text.segment"]`

1053

1054 The type of the segment. Always `transcript.text.segment`.

1055

1056 - `"transcript.text.segment"`

### Transcription Include

- `Literal["logprobs"]`

  - `"logprobs"`

### Transcription Segment

1065

1066- `class TranscriptionSegment: …`

1067

1068 - `id: int`

1069

1070 Unique identifier of the segment.

1071

1072 - `avg_logprob: float`

1073

1074 Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.

1075

1076 - `compression_ratio: float`

1077

1078 Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.

1079

1080 - `end: float`

1081

1082 End time of the segment in seconds.

1083

1084 - `no_speech_prob: float`

1085

 Probability of no speech in the segment. If the value is close to 1.0 and the `avg_logprob` is below -1, consider this segment silent.

1087

1088 - `seek: int`

1089

1090 Seek offset of the segment.

1091

1092 - `start: float`

1093

1094 Start time of the segment in seconds.

1095

1096 - `temperature: float`

1097

1098 Temperature parameter used for generating the segment.

1099

1100 - `text: str`

1101

1102 Text content of the segment.

1103

1104 - `tokens: List[int]`

1105

1106 Array of token IDs for the text content.
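
The quality hints in the field descriptions above can be applied mechanically to each segment. In the sketch below, the `-1` and `2.4` cutoffs come from those descriptions. The `0.6` cutoff for `no_speech_prob` is an assumption borrowed from the default `no_speech_threshold` of the open-source Whisper implementation, and `segment_flags` is a hypothetical helper.

```python
from typing import List

LOGPROB_THRESHOLD = -1.0           # from the avg_logprob description
COMPRESSION_RATIO_THRESHOLD = 2.4  # from the compression_ratio description
NO_SPEECH_THRESHOLD = 0.6          # assumption: open-source Whisper default


def segment_flags(seg: dict) -> List[str]:
    """Return the quality warnings that apply to one segment."""
    flags = []
    low_logprob = seg["avg_logprob"] < LOGPROB_THRESHOLD
    if low_logprob:
        flags.append("low_logprob")
    if seg["compression_ratio"] > COMPRESSION_RATIO_THRESHOLD:
        flags.append("high_compression")
    if seg["no_speech_prob"] > NO_SPEECH_THRESHOLD and low_logprob:
        flags.append("likely_silent")
    return flags


# Values from the segment timestamps sample response.
good = {
    "avg_logprob": -0.2860786020755768,
    "compression_ratio": 1.2363636493682861,
    "no_speech_prob": 0.00985979475080967,
}
# Made-up values that trip every check.
bad = {"avg_logprob": -1.3, "compression_ratio": 2.9, "no_speech_prob": 0.9}

print(segment_flags(good), segment_flags(bad))
```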

### Transcription Stream Event

- `TranscriptionStreamEvent`

  An event emitted while streaming a transcription. It is one of the following event classes, distinguished by the `type` field.

1113

1114 - `class TranscriptionTextSegmentEvent: …`

1115

1116 Emitted when a diarized transcription returns a completed segment with speaker information. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with `stream` set to `true` and `response_format` set to `diarized_json`.

1117

1118 - `id: str`

1119

1120 Unique identifier for the segment.

1121

1122 - `end: float`

1123

1124 End timestamp of the segment in seconds.

1125

1126 - `speaker: str`

1127

1128 Speaker label for this segment.

1129

1130 - `start: float`

1131

1132 Start timestamp of the segment in seconds.

1133

1134 - `text: str`

1135

1136 Transcript text for this segment.

1137

1138 - `type: Literal["transcript.text.segment"]`

1139

1140 The type of the event. Always `transcript.text.segment`.

1141

1142 - `"transcript.text.segment"`

1143

1144 - `class TranscriptionTextDeltaEvent: …`

1145

 Emitted when there is an additional text delta. This is also the first event emitted when the transcription starts. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `stream` parameter set to `true`.

1147

1148 - `delta: str`

1149

1150 The text delta that was additionally transcribed.

1151

1152 - `type: Literal["transcript.text.delta"]`

1153

1154 The type of the event. Always `transcript.text.delta`.

1155

1156 - `"transcript.text.delta"`

1157

1158 - `logprobs: Optional[List[Logprob]]`

1159

1160 The log probabilities of the delta. Only included if you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `include[]` parameter set to `logprobs`.

1161

1162 - `token: Optional[str]`

1163

1164 The token that was used to generate the log probability.

1165

1166 - `bytes: Optional[List[int]]`

1167

1168 The bytes that were used to generate the log probability.

1169

1170 - `logprob: Optional[float]`

1171

1172 The log probability of the token.

1173

1174 - `segment_id: Optional[str]`

1175

1176 Identifier of the diarized segment that this delta belongs to. Only present when using `gpt-4o-transcribe-diarize`.

1177

1178 - `class TranscriptionTextDoneEvent: …`

1179

 Emitted when the transcription is complete. Contains the complete transcription text. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `stream` parameter set to `true`.

1181

1182 - `text: str`

1183

1184 The text that was transcribed.

1185

1186 - `type: Literal["transcript.text.done"]`

1187

1188 The type of the event. Always `transcript.text.done`.

1189

1190 - `"transcript.text.done"`

1191

1192 - `logprobs: Optional[List[Logprob]]`

1193

1194 The log probabilities of the individual tokens in the transcription. Only included if you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `include[]` parameter set to `logprobs`.

1195

1196 - `token: Optional[str]`

1197

1198 The token that was used to generate the log probability.

1199

1200 - `bytes: Optional[List[int]]`

1201

1202 The bytes that were used to generate the log probability.

1203

1204 - `logprob: Optional[float]`

1205

1206 The log probability of the token.

1207

1208 - `usage: Optional[Usage]`

1209

1210 Usage statistics for models billed by token usage.

1211

1212 - `input_tokens: int`

1213

1214 Number of input tokens billed for this request.

1215

1216 - `output_tokens: int`

1217

1218 Number of output tokens generated.

1219

1220 - `total_tokens: int`

1221

1222 Total number of tokens used (input + output).

1223

1224 - `type: Literal["tokens"]`

1225

1226 The type of the usage object. Always `tokens` for this variant.

1227

1228 - `"tokens"`

1229

1230 - `input_token_details: Optional[UsageInputTokenDetails]`

1231

1232 Details about the input tokens billed for this request.

1233

1234 - `audio_tokens: Optional[int]`

1235

1236 Number of audio tokens billed for this request.

1237

1238 - `text_tokens: Optional[int]`

1239

1240 Number of text tokens billed for this request.

### Transcription Text Delta Event

- `class TranscriptionTextDeltaEvent: …`

  Emitted when there is an additional text delta. This is also the first event emitted when the transcription starts. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `stream` parameter set to `true`.

1247

1248 - `delta: str`

1249

1250 The text delta that was additionally transcribed.

1251

1252 - `type: Literal["transcript.text.delta"]`

1253

1254 The type of the event. Always `transcript.text.delta`.

1255

1256 - `"transcript.text.delta"`

1257

1258 - `logprobs: Optional[List[Logprob]]`

1259

1260 The log probabilities of the delta. Only included if you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `include[]` parameter set to `logprobs`.

1261

1262 - `token: Optional[str]`

1263

1264 The token that was used to generate the log probability.

1265

1266 - `bytes: Optional[List[int]]`

1267

1268 The bytes that were used to generate the log probability.

1269

1270 - `logprob: Optional[float]`

1271

1272 The log probability of the token.

1273

1274 - `segment_id: Optional[str]`

1275

1276 Identifier of the diarized segment that this delta belongs to. Only present when using `gpt-4o-transcribe-diarize`.
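
As a rough illustration of how delta events might be consumed, the sketch below concatenates `delta` strings in arrival order. `FakeDeltaEvent` and `join_deltas` are hypothetical names introduced here so the snippet runs without calling the API; they only mirror the fields documented above and are not part of the SDK.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class FakeDeltaEvent:
    """Hypothetical stand-in mirroring the documented fields of a delta event."""

    delta: str
    type: str = "transcript.text.delta"
    segment_id: Optional[str] = None


def join_deltas(events: Iterable[object]) -> str:
    """Concatenate the text of every `transcript.text.delta` event, in order."""
    parts = []
    for event in events:
        # Skip anything that is not a delta (for example the final done event).
        if getattr(event, "type", None) == "transcript.text.delta":
            parts.append(event.delta)
    return "".join(parts)


events = [FakeDeltaEvent("Hello"), FakeDeltaEvent(", "), FakeDeltaEvent("world.")]
print(join_deltas(events))
```

With a real streaming response, the same loop would presumably run over the events yielded by the stream instead of a prebuilt list.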

### Transcription Text Done Event

- `class TranscriptionTextDoneEvent: …`

  Emitted when the transcription is complete. Contains the complete transcription text. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `stream` parameter set to `true`.

  - `text: str`

    The text that was transcribed.

  - `type: Literal["transcript.text.done"]`

    The type of the event. Always `transcript.text.done`.

    - `"transcript.text.done"`

  - `logprobs: Optional[List[Logprob]]`

    The log probabilities of the individual tokens in the transcription. Only included if you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `include[]` parameter set to `logprobs`.

    - `token: Optional[str]`

      The token that was used to generate the log probability.

    - `bytes: Optional[List[int]]`

      The bytes that were used to generate the log probability.

    - `logprob: Optional[float]`

      The log probability of the token.

  - `usage: Optional[Usage]`

    Usage statistics for models billed by token usage.

    - `input_tokens: int`

      Number of input tokens billed for this request.

    - `output_tokens: int`

      Number of output tokens generated.

    - `total_tokens: int`

      Total number of tokens used (input + output).

    - `type: Literal["tokens"]`

      The type of the usage object. Always `tokens` for this variant.

      - `"tokens"`

    - `input_token_details: Optional[UsageInputTokenDetails]`

      Details about the input tokens billed for this request.

      - `audio_tokens: Optional[int]`

        Number of audio tokens billed for this request.

      - `text_tokens: Optional[int]`

        Number of text tokens billed for this request.
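
The `logprob` values can be easier to read as probabilities. The sketch below assumes they are natural-log values (so the probability is `exp(logprob)`), which this reference does not state explicitly. The helper names and sample numbers are made up for illustration.

```python
import math
from typing import List, Optional, Sequence


def token_probabilities(logprobs: Sequence[Optional[float]]) -> List[float]:
    """Map log probabilities to probabilities, skipping missing values.

    Assumes natural-log inputs, so probability = exp(logprob).
    """
    return [math.exp(lp) for lp in logprobs if lp is not None]


def mean_logprob(logprobs: Sequence[Optional[float]]) -> Optional[float]:
    """Average the available log probabilities, or return None if there are none."""
    present = [lp for lp in logprobs if lp is not None]
    if not present:
        return None
    return sum(present) / len(present)


sample = [0.0, math.log(0.5), None]  # made-up values, not real model output
print(token_probabilities(sample))
print(mean_logprob(sample))
```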

### Transcription Text Segment Event

- `class TranscriptionTextSegmentEvent: …`

  Emitted when a diarized transcription returns a completed segment with speaker information. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with `stream` set to `true` and `response_format` set to `diarized_json`.

  - `id: str`

    Unique identifier for the segment.

  - `end: float`

    End timestamp of the segment in seconds.

  - `speaker: str`

    Speaker label for this segment.

  - `start: float`

    Start timestamp of the segment in seconds.

  - `text: str`

    Transcript text for this segment.

  - `type: Literal["transcript.text.segment"]`

    The type of the event. Always `transcript.text.segment`.

    - `"transcript.text.segment"`
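
One possible way to use the segment fields is to print a speaker-labelled transcript and total the speaking time per speaker. In this sketch, `FakeSegmentEvent`, `format_segment`, and `speaking_time` are hypothetical names, and the sample segments are invented; only the field names come from the description above.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class FakeSegmentEvent:
    """Hypothetical stand-in with the documented fields of a segment event."""

    id: str
    start: float
    end: float
    speaker: str
    text: str
    type: str = "transcript.text.segment"


def format_segment(segment: FakeSegmentEvent) -> str:
    """Render one segment as '[start-end] speaker: text'."""
    return f"[{segment.start:.2f}-{segment.end:.2f}] {segment.speaker}: {segment.text.strip()}"


def speaking_time(segments: Iterable[FakeSegmentEvent]) -> Dict[str, float]:
    """Sum (end - start) in seconds for each speaker label."""
    totals: Dict[str, float] = {}
    for segment in segments:
        totals[segment.speaker] = totals.get(segment.speaker, 0.0) + (segment.end - segment.start)
    return totals


segments = [
    FakeSegmentEvent("seg_0", 0.0, 1.5, "A", " Hello there."),
    FakeSegmentEvent("seg_1", 1.5, 3.0, "B", " Hi!"),
    FakeSegmentEvent("seg_2", 3.0, 4.0, "A", " How are you?"),
]
for segment in segments:
    print(format_segment(segment))
print(speaking_time(segments))
```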

### Transcription Verbose

- `class TranscriptionVerbose: …`

  Represents a verbose json transcription response returned by the model, based on the provided input.

  - `duration: float`

    The duration of the input audio.

  - `language: str`

    The language of the input audio.

  - `text: str`

    The transcribed text.

  - `segments: Optional[List[TranscriptionSegment]]`

    Segments of the transcribed text and their corresponding details.

    - `id: int`

      Unique identifier of the segment.

    - `avg_logprob: float`

      Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.

    - `compression_ratio: float`

      Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.

    - `end: float`

      End time of the segment in seconds.

    - `no_speech_prob: float`

      Probability of no speech in the segment. If the value is higher than 1.0 and the `avg_logprob` is below -1, consider this segment silent.

    - `seek: int`

      Seek offset of the segment.

    - `start: float`

      Start time of the segment in seconds.

    - `temperature: float`

      Temperature parameter used for generating the segment.

    - `text: str`

      Text content of the segment.

    - `tokens: List[int]`

      Array of token IDs for the text content.

  - `usage: Optional[Usage]`

    Usage statistics for models billed by audio input duration.

    - `seconds: float`

      Duration of the input audio in seconds.

    - `type: Literal["duration"]`

      The type of the usage object. Always `duration` for this variant.

      - `"duration"`

  - `words: Optional[List[TranscriptionWord]]`

    Extracted words and their corresponding timestamps.

    - `end: float`

      End time of the word in seconds.

    - `start: float`

      Start time of the word in seconds.

    - `word: str`

      The text content of the word.
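
The `avg_logprob` and `compression_ratio` thresholds quoted above can be applied as a simple quality check. The following is a minimal sketch that takes segments as plain dictionaries; `suspect_reasons` is a hypothetical helper and the sample values are invented.

```python
from typing import Any, Dict, List

MIN_AVG_LOGPROB = -1.0  # below this, "consider the logprobs failed"
MAX_COMPRESSION_RATIO = 2.4  # above this, "consider the compression failed"


def suspect_reasons(segment: Dict[str, Any]) -> List[str]:
    """Return the documented failure conditions that a segment triggers."""
    reasons = []
    if segment["avg_logprob"] < MIN_AVG_LOGPROB:
        reasons.append("low avg_logprob")
    if segment["compression_ratio"] > MAX_COMPRESSION_RATIO:
        reasons.append("high compression_ratio")
    return reasons


segments = [
    {"id": 0, "text": " Hello.", "avg_logprob": -0.25, "compression_ratio": 1.1},
    {"id": 1, "text": " la la la la", "avg_logprob": -1.4, "compression_ratio": 3.0},
]
# Keep only the segments that trip at least one check, keyed by segment id.
flagged = {s["id"]: suspect_reasons(s) for s in segments if suspect_reasons(s)}
print(flagged)
```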

### Transcription Word

- `class TranscriptionWord: …`

  - `end: float`

    End time of the word in seconds.

  - `start: float`

    Start time of the word in seconds.

  - `word: str`

    The text content of the word.
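
Because `start` and `end` are plain seconds, turning words into subtitle-style cues is mostly a formatting exercise. The sketch below converts seconds to the `HH:MM:SS,mmm` form used by SRT files. `srt_timestamp` and `word_cue` are hypothetical helpers, and the word is passed as a dictionary rather than an SDK object.

```python
from typing import Any, Dict


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def word_cue(index: int, word: Dict[str, Any]) -> str:
    """Build one numbered cue from a word with `start`, `end`, and `word` keys."""
    return (
        f"{index}\n"
        f"{srt_timestamp(word['start'])} --> {srt_timestamp(word['end'])}\n"
        f"{word['word']}\n"
    )


print(word_cue(1, {"word": "Hello", "start": 0.0, "end": 0.5}))
```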

### Transcription Create Response

- `TranscriptionCreateResponse`

  Represents a transcription response returned by the model, based on the provided input.

  - `class Transcription: …`

    Represents a transcription response returned by the model, based on the provided input.

    - `text: str`

      The transcribed text.

    - `logprobs: Optional[List[Logprob]]`

      The log probabilities of the tokens in the transcription. Only returned with the models `gpt-4o-transcribe` and `gpt-4o-mini-transcribe` if `logprobs` is added to the `include` array.

      - `token: Optional[str]`

        The token in the transcription.

      - `bytes: Optional[List[float]]`

        The bytes of the token.

      - `logprob: Optional[float]`

        The log probability of the token.

    - `usage: Optional[Usage]`

      Token usage statistics for the request.

      - `class UsageTokens: …`

        Usage statistics for models billed by token usage.

        - `input_tokens: int`

          Number of input tokens billed for this request.

        - `output_tokens: int`

          Number of output tokens generated.

        - `total_tokens: int`

          Total number of tokens used (input + output).

        - `type: Literal["tokens"]`

          The type of the usage object. Always `tokens` for this variant.

          - `"tokens"`

        - `input_token_details: Optional[UsageTokensInputTokenDetails]`

          Details about the input tokens billed for this request.

          - `audio_tokens: Optional[int]`

            Number of audio tokens billed for this request.

          - `text_tokens: Optional[int]`

            Number of text tokens billed for this request.

      - `class UsageDuration: …`

        Usage statistics for models billed by audio input duration.

        - `seconds: float`

          Duration of the input audio in seconds.

        - `type: Literal["duration"]`

          The type of the usage object. Always `duration` for this variant.

          - `"duration"`

  - `class TranscriptionDiarized: …`

    Represents a diarized transcription response returned by the model, including the combined transcript and speaker-segment annotations.

    - `duration: float`

      Duration of the input audio in seconds.

    - `segments: List[TranscriptionDiarizedSegment]`

      Segments of the transcript annotated with timestamps and speaker labels.

      - `id: str`

        Unique identifier for the segment.

      - `end: float`

        End timestamp of the segment in seconds.

      - `speaker: str`

        Speaker label for this segment. When known speakers are provided, the label matches `known_speaker_names[]`. Otherwise speakers are labeled sequentially using capital letters (`A`, `B`, ...).

      - `start: float`

        Start timestamp of the segment in seconds.

      - `text: str`

        Transcript text for this segment.

      - `type: Literal["transcript.text.segment"]`

        The type of the segment. Always `transcript.text.segment`.

        - `"transcript.text.segment"`

    - `task: Literal["transcribe"]`

      The type of task that was run. Always `transcribe`.

      - `"transcribe"`

    - `text: str`

      The concatenated transcript text for the entire audio input.

    - `usage: Optional[Usage]`

      Token or duration usage statistics for the request.

      - `class UsageTokens: …`

        Usage statistics for models billed by token usage.

        - `input_tokens: int`

          Number of input tokens billed for this request.

        - `output_tokens: int`

          Number of output tokens generated.

        - `total_tokens: int`

          Total number of tokens used (input + output).

        - `type: Literal["tokens"]`

          The type of the usage object. Always `tokens` for this variant.

          - `"tokens"`

        - `input_token_details: Optional[UsageTokensInputTokenDetails]`

          Details about the input tokens billed for this request.

          - `audio_tokens: Optional[int]`

            Number of audio tokens billed for this request.

          - `text_tokens: Optional[int]`

            Number of text tokens billed for this request.

      - `class UsageDuration: …`

        Usage statistics for models billed by audio input duration.

        - `seconds: float`

          Duration of the input audio in seconds.

        - `type: Literal["duration"]`

          The type of the usage object. Always `duration` for this variant.

          - `"duration"`

  - `class TranscriptionVerbose: …`

    Represents a verbose json transcription response returned by the model, based on the provided input.

    - `duration: float`

      The duration of the input audio.

    - `language: str`

      The language of the input audio.

    - `text: str`

      The transcribed text.

    - `segments: Optional[List[TranscriptionSegment]]`

      Segments of the transcribed text and their corresponding details.

      - `id: int`

        Unique identifier of the segment.

      - `avg_logprob: float`

        Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.

      - `compression_ratio: float`

        Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.

      - `end: float`

        End time of the segment in seconds.

      - `no_speech_prob: float`

        Probability of no speech in the segment. If the value is higher than 1.0 and the `avg_logprob` is below -1, consider this segment silent.

      - `seek: int`

        Seek offset of the segment.

      - `start: float`

        Start time of the segment in seconds.

      - `temperature: float`

        Temperature parameter used for generating the segment.

      - `text: str`

        Text content of the segment.

      - `tokens: List[int]`

        Array of token IDs for the text content.

    - `usage: Optional[Usage]`

      Usage statistics for models billed by audio input duration.

      - `seconds: float`

        Duration of the input audio in seconds.

      - `type: Literal["duration"]`

        The type of the usage object. Always `duration` for this variant.

        - `"duration"`

    - `words: Optional[List[TranscriptionWord]]`

      Extracted words and their corresponding timestamps.

      - `end: float`

        End time of the word in seconds.

      - `start: float`

        Start time of the word in seconds.

      - `word: str`

        The text content of the word.
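
Since `usage` may be either the `tokens` or the `duration` variant, or absent, code that reports usage probably needs to branch on the `type` field. Below is a minimal sketch using plain dictionaries in place of SDK objects; `describe_usage` is a hypothetical helper and the numbers are invented.

```python
from typing import Any, Dict, Optional


def describe_usage(usage: Optional[Dict[str, Any]]) -> str:
    """Summarize a usage payload by switching on its `type` discriminator."""
    if usage is None:
        return "no usage reported"
    if usage["type"] == "tokens":
        return (
            f"{usage['input_tokens']} input + {usage['output_tokens']} output "
            f"= {usage['total_tokens']} tokens"
        )
    if usage["type"] == "duration":
        return f"{usage['seconds']:.1f} seconds of audio"
    raise ValueError(f"unknown usage type: {usage['type']!r}")


print(describe_usage({"type": "tokens", "input_tokens": 14, "output_tokens": 45, "total_tokens": 59}))
print(describe_usage({"type": "duration", "seconds": 12.5}))
print(describe_usage(None))
```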

# Translations

## Create translation

`audio.translations.create(TranslationCreateParams**kwargs) -> TranslationCreateResponse`

**post** `/audio/translations`

Translates audio into English.

### Parameters

- `file: FileTypes`

  The audio file object (not file name) to translate, in one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm.

- `model: Union[str, AudioModel]`

  ID of the model to use. Only `whisper-1` (which is powered by our open source Whisper V2 model) is currently available.

  - `str`

  - `Literal["whisper-1", "gpt-4o-transcribe", "gpt-4o-mini-transcribe", 2 more]`

    - `"whisper-1"`

    - `"gpt-4o-transcribe"`

    - `"gpt-4o-mini-transcribe"`

    - `"gpt-4o-mini-transcribe-2025-12-15"`

    - `"gpt-4o-transcribe-diarize"`

- `prompt: Optional[str]`

  An optional text to guide the model's style or continue a previous audio segment. The [prompt](https://platform.openai.com/docs/guides/speech-to-text#prompting) should be in English.

- `response_format: Optional[Literal["json", "text", "srt", 2 more]]`

  The format of the output, in one of these options: `json`, `text`, `srt`, `verbose_json`, or `vtt`.

  - `"json"`

  - `"text"`

  - `"srt"`

  - `"verbose_json"`

  - `"vtt"`

- `temperature: Optional[float]`

  The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use [log probability](https://en.wikipedia.org/wiki/Log_probability) to automatically increase the temperature until certain thresholds are hit.
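
The constraints listed above can be checked locally before a request is sent. This is only a sketch: `check_translation_params` is a hypothetical helper, and judging the audio format by file extension is an assumption made here for illustration, not something this reference specifies.

```python
from pathlib import Path
from typing import List

SUPPORTED_EXTENSIONS = {"flac", "mp3", "mp4", "mpeg", "mpga", "m4a", "ogg", "wav", "webm"}
RESPONSE_FORMATS = {"json", "text", "srt", "verbose_json", "vtt"}


def check_translation_params(
    filename: str, response_format: str = "json", temperature: float = 0.0
) -> List[str]:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    # Assumption: the extension is a good enough proxy for the container format.
    extension = Path(filename).suffix.lower().lstrip(".")
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported audio format: {extension or 'none'}")
    if response_format not in RESPONSE_FORMATS:
        problems.append(f"unsupported response_format: {response_format}")
    if not 0.0 <= temperature <= 1.0:
        problems.append(f"temperature out of range: {temperature}")
    return problems


print(check_translation_params("speech.mp3"))
print(check_translation_params("notes.txt", response_format="diarized_json", temperature=1.5))
```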

### Returns

- `TranslationCreateResponse`

  - `class Translation: …`

    - `text: str`

  - `class TranslationVerbose: …`

    - `duration: float`

      The duration of the input audio.

    - `language: str`

      The language of the output translation (always `english`).

    - `text: str`

      The translated text.

    - `segments: Optional[List[TranscriptionSegment]]`

      Segments of the translated text and their corresponding details.

      - `id: int`

        Unique identifier of the segment.

      - `avg_logprob: float`

        Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.

      - `compression_ratio: float`

        Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.

      - `end: float`

        End time of the segment in seconds.

      - `no_speech_prob: float`

        Probability of no speech in the segment. If the value is higher than 1.0 and the `avg_logprob` is below -1, consider this segment silent.

      - `seek: int`

        Seek offset of the segment.

      - `start: float`

        Start time of the segment in seconds.

      - `temperature: float`

        Temperature parameter used for generating the segment.

      - `text: str`

        Text content of the segment.

      - `tokens: List[int]`

        Array of token IDs for the text content.

### Example

```python
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY"),  # This is the default and can be omitted
)
translation = client.audio.translations.create(
    file=b"Example data",
    model="whisper-1",
)
print(translation)
```

#### Response

```json
{
  "text": "text"
}
```

### Example

```python
from openai import OpenAI

client = OpenAI()

# Open the audio in binary mode; the context manager closes the file afterwards.
with open("speech.mp3", "rb") as audio_file:
    translation = client.audio.translations.create(
        model="whisper-1",
        file=audio_file,
    )

print(translation.text)
```

#### Response

```json
{
  "text": "Hello, my name is Wolfgang and I come from Germany. Where are you heading today?"
}
```

## Domain Types

### Translation

- `class Translation: …`

  - `text: str`

### Translation Verbose

- `class TranslationVerbose: …`

  - `duration: float`

    The duration of the input audio.

  - `language: str`

    The language of the output translation (always `english`).

  - `text: str`

    The translated text.

  - `segments: Optional[List[TranscriptionSegment]]`

    Segments of the translated text and their corresponding details.

    - `id: int`

      Unique identifier of the segment.

    - `avg_logprob: float`

      Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.

    - `compression_ratio: float`

      Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.

    - `end: float`

      End time of the segment in seconds.

    - `no_speech_prob: float`

      Probability of no speech in the segment. If the value is higher than 1.0 and the `avg_logprob` is below -1, consider this segment silent.

    - `seek: int`

      Seek offset of the segment.

    - `start: float`

      Start time of the segment in seconds.

    - `temperature: float`

      Temperature parameter used for generating the segment.

    - `text: str`

      Text content of the segment.

    - `tokens: List[int]`

      Array of token IDs for the text content.

### Translation Create Response

- `TranslationCreateResponse`

  - `class Translation: …`

    - `text: str`

  - `class TranslationVerbose: …`

    - `duration: float`

      The duration of the input audio.

    - `language: str`

      The language of the output translation (always `english`).

    - `text: str`

      The translated text.

    - `segments: Optional[List[TranscriptionSegment]]`

      Segments of the translated text and their corresponding details.

      - `id: int`

        Unique identifier of the segment.

      - `avg_logprob: float`

        Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.

      - `compression_ratio: float`

        Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.

      - `end: float`

        End time of the segment in seconds.

      - `no_speech_prob: float`

        Probability of no speech in the segment. If the value is higher than 1.0 and the `avg_logprob` is below -1, consider this segment silent.

      - `seek: int`

        Seek offset of the segment.

      - `start: float`

        Start time of the segment in seconds.

      - `temperature: float`

        Temperature parameter used for generating the segment.

      - `text: str`

        Text content of the segment.

      - `tokens: List[int]`

        Array of token IDs for the text content.

# Speech

## Create speech

`audio.speech.create(SpeechCreateParams**kwargs) -> BinaryResponseContent`

**post** `/audio/speech`

Generates audio from the input text.

Returns the audio file content, or a stream of audio events.

### Parameters

- `input: str`

  The text to generate audio for. The maximum length is 4096 characters.

- `model: Union[str, SpeechModel]`

  One of the available [TTS models](https://platform.openai.com/docs/models#tts): `tts-1`, `tts-1-hd`, `gpt-4o-mini-tts`, or `gpt-4o-mini-tts-2025-12-15`.

  - `str`

  - `Literal["tts-1", "tts-1-hd", "gpt-4o-mini-tts", "gpt-4o-mini-tts-2025-12-15"]`

    - `"tts-1"`

    - `"tts-1-hd"`

    - `"gpt-4o-mini-tts"`

    - `"gpt-4o-mini-tts-2025-12-15"`

- `voice: Voice`

  The voice to use when generating the audio. Supported built-in voices are `alloy`, `ash`, `ballad`, `coral`, `echo`, `fable`, `onyx`, `nova`, `sage`, `shimmer`, `verse`, `marin`, and `cedar`. You may also provide a custom voice object with an `id`, for example `{ "id": "voice_1234" }`. Previews of the voices are available in the [Text to speech guide](https://platform.openai.com/docs/guides/text-to-speech#voice-options).

  - `str`

  - `Literal["alloy", "ash", "ballad", 7 more]`

    - `"alloy"`

    - `"ash"`

    - `"ballad"`

    - `"coral"`

    - `"echo"`

    - `"sage"`

    - `"shimmer"`

    - `"verse"`

    - `"marin"`

    - `"cedar"`

  - `class VoiceID: …`

    Custom voice reference.

    - `id: str`

      The custom voice ID, e.g. `voice_1234`.

- `instructions: Optional[str]`

  Control the voice of your generated audio with additional instructions. Does not work with `tts-1` or `tts-1-hd`.

- `response_format: Optional[Literal["mp3", "opus", "aac", 3 more]]`

  The format to return the audio in. Supported formats are `mp3`, `opus`, `aac`, `flac`, `wav`, and `pcm`.

  - `"mp3"`

  - `"opus"`

  - `"aac"`

  - `"flac"`

  - `"wav"`

  - `"pcm"`

- `speed: Optional[float]`

  The speed of the generated audio. Select a value from `0.25` to `4.0`. `1.0` is the default.

- `stream_format: Optional[Literal["sse", "audio"]]`

  The format to stream the audio in. Supported formats are `sse` and `audio`. `sse` is not supported for `tts-1` or `tts-1-hd`.

  - `"sse"`

  - `"audio"`
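
Text longer than the 4096-character `input` limit has to be sent as several requests. The sketch below shows one way to split it on whitespace and to check `speed` against the documented range. `split_input` and `is_valid_speed` are hypothetical helpers, and counting the limit in Python string characters is an assumption.

```python
from typing import List

MAX_INPUT_CHARS = 4096  # documented maximum length of `input`


def split_input(text: str, limit: int = MAX_INPUT_CHARS) -> List[str]:
    """Split text into chunks of at most `limit` characters, preferring spaces."""
    chunks = []
    remaining = text.strip()
    while len(remaining) > limit:
        # Look for the last space that keeps the chunk within the limit.
        cut = remaining.rfind(" ", 0, limit + 1)
        if cut <= 0:
            cut = limit  # no space available: fall back to a hard split
        chunks.append(remaining[:cut].rstrip())
        remaining = remaining[cut:].lstrip()
    if remaining:
        chunks.append(remaining)
    return chunks


def is_valid_speed(speed: float) -> bool:
    """Check `speed` against the documented 0.25 to 4.0 range."""
    return 0.25 <= speed <= 4.0


print(split_input("aaa bbb ccc", limit=7))
print(is_valid_speed(1.0), is_valid_speed(5.0))
```

Each chunk could then be passed as `input` in its own request, with the resulting audio joined by the caller.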

### Returns

- `BinaryResponseContent`

### Example

```python
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY"),  # This is the default and can be omitted
)
speech = client.audio.speech.create(
    input="input",
    model="tts-1",
    voice="alloy",
)
print(speech)
content = speech.read()
print(content)
```

### Example

```python
from pathlib import Path
import openai

speech_file_path = Path(__file__).parent / "speech.mp3"
with openai.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="alloy",
    input="The quick brown fox jumped over the lazy dog.",
) as response:
    response.stream_to_file(speech_file_path)
```

## Domain Types

### Speech Model

- `Literal["tts-1", "tts-1-hd", "gpt-4o-mini-tts", "gpt-4o-mini-tts-2025-12-15"]`

  - `"tts-1"`

  - `"tts-1-hd"`

  - `"gpt-4o-mini-tts"`

  - `"gpt-4o-mini-tts-2025-12-15"`

# Voices

# Voice Consents

python/resources/audio/index.md 2026-06-10 15:48 UTC to 2026-06-12 00:01 UTC
