Audio

Domain Types

Audio Model

AudioModel = :"whisper-1" | :"gpt-4o-transcribe" | :"gpt-4o-mini-transcribe" | :"gpt-4o-mini-transcribe-2025-12-15" | :"gpt-4o-transcribe-diarize"
- :"whisper-1"
- :"gpt-4o-transcribe"
- :"gpt-4o-mini-transcribe"
- :"gpt-4o-mini-transcribe-2025-12-15"
- :"gpt-4o-transcribe-diarize"

Audio Response Format

AudioResponseFormat = :json | :text | :srt | :verbose_json | :vtt | :diarized_json

The format of the output, in one of these options: json, text, srt, verbose_json, vtt, or diarized_json. For gpt-4o-transcribe and gpt-4o-mini-transcribe, the only supported format is json. For gpt-4o-transcribe-diarize, the supported formats are json, text, and diarized_json, with diarized_json required to receive speaker annotations.
- :json
- :text
- :srt
- :verbose_json
- :vtt
- :diarized_json
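
The per-model limits described above could be checked on the client before a request is sent. The following is a minimal sketch, not part of the SDK: `SUPPORTED_FORMATS` and `format_supported?` are hypothetical names, and the table covers only the three models whose formats are stated in this section. Any other model returns nil because its formats are not listed here.

```ruby
# Formats stated above for specific models. Models whose formats this
# section does not spell out are deliberately left out of the table.
SUPPORTED_FORMATS = {
  "gpt-4o-transcribe" => %i[json],
  "gpt-4o-mini-transcribe" => %i[json],
  "gpt-4o-transcribe-diarize" => %i[json text diarized_json]
}.freeze

# Hypothetical helper: true or false when the model's limits are known,
# nil when the table has no entry for the model.
def format_supported?(model, response_format)
  formats = SUPPORTED_FORMATS[model.to_s]
  return nil if formats.nil?

  formats.include?(response_format.to_sym)
end

puts format_supported?(:"gpt-4o-transcribe", :json)
puts format_supported?(:"gpt-4o-transcribe-diarize", :srt)
```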

Transcriptions

Create transcription

audio.transcriptions.create(**kwargs) -> TranscriptionCreateResponse

post /audio/transcriptions

Transcribes audio into the input language.

Returns a transcription object in json, diarized_json, or verbose_json format, or a stream of transcript events.

Parameters

file: String

The audio file object (not file name) to transcribe, in one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm.
model: String | AudioModel

ID of the model to use. The options are gpt-4o-transcribe, gpt-4o-mini-transcribe, gpt-4o-mini-transcribe-2025-12-15, whisper-1 (which is powered by our open source Whisper V2 model), and gpt-4o-transcribe-diarize.
- String = String
- AudioModel = :"whisper-1" | :"gpt-4o-transcribe" | :"gpt-4o-mini-transcribe" | :"gpt-4o-mini-transcribe-2025-12-15" | :"gpt-4o-transcribe-diarize"
  - :"whisper-1"
  - :"gpt-4o-transcribe"
  - :"gpt-4o-mini-transcribe"
  - :"gpt-4o-mini-transcribe-2025-12-15"
  - :"gpt-4o-transcribe-diarize"
chunking_strategy: :auto | VadConfig{ type, prefix_padding_ms, silence_duration_ms, threshold}

Controls how the audio is cut into chunks. When set to "auto", the server first normalizes loudness and then uses voice activity detection (VAD) to choose boundaries. A server_vad object can be provided to tweak VAD detection parameters manually. If unset, the audio is transcribed as a single block. Required when using gpt-4o-transcribe-diarize for inputs longer than 30 seconds.
- ChunkingStrategy = :auto
  
  Automatically set chunking parameters based on the audio. Must be set to "auto".
  - :auto
- class VadConfig
  - type: :server_vad
    
    Must be set to server_vad to enable manual chunking using server side VAD.
    - :server_vad
  - prefix_padding_ms: Integer
    
    Amount of audio to include before the VAD detected speech (in milliseconds).
  - silence_duration_ms: Integer
    
    Duration of silence to detect speech stop (in milliseconds). With shorter values the model will respond more quickly, but may jump in on short pauses from the user.
  - threshold: Float
    
    Sensitivity threshold (0.0 to 1.0) for voice activity detection. A higher threshold will require louder audio to activate the model, and thus might perform better in noisy environments.
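
The two shapes accepted by chunking_strategy can be produced by one small builder. This is a sketch under assumptions: the helper name `chunking_strategy` is hypothetical, the numeric values in the usage lines are illustrative only, and the range check simply mirrors the 0.0 to 1.0 bound stated for threshold.

```ruby
# Hypothetical helper that builds a value for the chunking_strategy
# parameter. With no options it returns :auto; with any option it returns
# a server_vad hash containing only the fields that were given.
def chunking_strategy(prefix_padding_ms: nil, silence_duration_ms: nil, threshold: nil)
  options = {
    prefix_padding_ms: prefix_padding_ms,
    silence_duration_ms: silence_duration_ms,
    threshold: threshold
  }.compact
  return :auto if options.empty?

  # The threshold is documented as a value between 0.0 and 1.0.
  if threshold && !(0.0..1.0).cover?(threshold)
    raise ArgumentError, "threshold must be between 0.0 and 1.0"
  end

  { type: :server_vad, **options }
end

p chunking_strategy
p chunking_strategy(silence_duration_ms: 300, threshold: 0.6)
```

The result can then be passed as the chunking_strategy keyword of audio.transcriptions.create.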
include: Array[TranscriptionInclude]

Additional information to include in the transcription response. logprobs will return the log probabilities of the tokens in the response to understand the model's confidence in the transcription. logprobs only works with response_format set to json and only with the models gpt-4o-transcribe, gpt-4o-mini-transcribe, and gpt-4o-mini-transcribe-2025-12-15. This field is not supported when using gpt-4o-transcribe-diarize.
- :logprobs
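
Because logprobs is only honored for certain model and format combinations, a caller may want to decide up front whether to request it. A minimal sketch, assuming a hypothetical helper `include_for`; the model list is copied from the description above.

```ruby
# Models for which the description above says logprobs works.
LOGPROB_MODELS = %w[
  gpt-4o-transcribe
  gpt-4o-mini-transcribe
  gpt-4o-mini-transcribe-2025-12-15
].freeze

# Hypothetical helper: returns the include array to send, or an empty
# array when the model/format combination does not support logprobs.
def include_for(model:, response_format:)
  supported = LOGPROB_MODELS.include?(model.to_s) && response_format.to_sym == :json
  supported ? [:logprobs] : []
end

p include_for(model: :"gpt-4o-transcribe", response_format: :json)
p include_for(model: :"gpt-4o-transcribe-diarize", response_format: :json)
```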
known_speaker_names: Array[String]

Optional list of speaker names that correspond to the audio samples provided in known_speaker_references[]. Each entry should be a short identifier (for example customer or agent). Up to 4 speakers are supported.
known_speaker_references: Array[String]

Optional list of audio samples (as data URLs) that contain known speaker references matching known_speaker_names[]. Each sample must be between 2 and 10 seconds, and can use any of the same input audio formats supported by file.
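
The two known-speaker arrays must line up index by index, so it can help to build them together. The sketch below is illustrative only: `known_speakers` is a hypothetical helper, the audio/wav MIME type is an assumption, and the byte strings are placeholders rather than real audio. It does not check the 2 to 10 second length rule, which cannot be verified from raw bytes alone.

```ruby
# Hypothetical helper that pairs speaker names with reference clips and
# encodes each clip as a base64 data URL. The MIME type is an assumption.
def known_speakers(samples, mime_type: "audio/wav")
  raise ArgumentError, "up to 4 speakers are supported" if samples.size > 4

  names = samples.keys.map(&:to_s)
  references = samples.values.map do |bytes|
    # pack("m0") is strict base64 without line breaks.
    "data:#{mime_type};base64,#{[bytes].pack('m0')}"
  end
  { known_speaker_names: names, known_speaker_references: references }
end

# Placeholder bytes stand in for real audio clips.
params = known_speakers({ "agent" => "placeholder-1", "customer" => "placeholder-2" })
p params[:known_speaker_names]
```

The resulting hash could be merged into the keyword arguments of the create call, for example with `**params`.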
language: String

The language of the input audio. Supplying the input language in ISO-639-1 (e.g. en) format will improve accuracy and latency.
prompt: String

An optional text to guide the model's style or continue a previous audio segment. The prompt should match the audio language. This field is not supported when using gpt-4o-transcribe-diarize.
response_format: AudioResponseFormat

The format of the output, in one of these options: json, text, srt, verbose_json, vtt, or diarized_json. For gpt-4o-transcribe and gpt-4o-mini-transcribe, the only supported format is json. For gpt-4o-transcribe-diarize, the supported formats are json, text, and diarized_json, with diarized_json required to receive speaker annotations.
- :json
- :text
- :srt
- :verbose_json
- :vtt
- :diarized_json
stream: bool

If set to true, the model response data will be streamed to the client as it is generated using server-sent events. See the Streaming section of the Speech-to-Text guide for more information.

Note: Streaming is not supported for the whisper-1 model and will be ignored.
temperature: Float

The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use log probability to automatically increase the temperature until certain thresholds are hit.
timestamp_granularities: Array[:word | :segment]

The timestamp granularities to populate for this transcription. response_format must be set to verbose_json to use timestamp granularities. Either or both of these options are supported: word or segment. Note: There is no additional latency for segment timestamps, but generating word timestamps incurs additional latency. This option is not available for gpt-4o-transcribe-diarize.
- :word
- :segment
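
The constraints on timestamp_granularities can be expressed as a pre-flight check. This is a sketch, not SDK behavior: `check_timestamp_granularities!` is a hypothetical name, and it only enforces the rules stated in the description above.

```ruby
ALLOWED_GRANULARITIES = %i[word segment].freeze

# Hypothetical pre-flight check. Returns the granularities unchanged when
# the combination is allowed and raises ArgumentError otherwise.
def check_timestamp_granularities!(model:, response_format:, granularities:)
  unknown = granularities.map(&:to_sym) - ALLOWED_GRANULARITIES
  raise ArgumentError, "unknown granularities: #{unknown.inspect}" unless unknown.empty?
  return granularities if granularities.empty?

  if model.to_s == "gpt-4o-transcribe-diarize"
    raise ArgumentError, "timestamp_granularities is not available for gpt-4o-transcribe-diarize"
  end
  unless response_format.to_sym == :verbose_json
    raise ArgumentError, "response_format must be verbose_json"
  end

  granularities
end

p check_timestamp_granularities!(
  model: :"whisper-1", response_format: :verbose_json, granularities: %i[word segment]
)
```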

Returns

TranscriptionCreateResponse = Transcription | TranscriptionDiarized | TranscriptionVerbose

Represents a transcription response returned by the model, based on the provided input.
- class Transcription
  
  Represents a transcription response returned by the model, based on the provided input.
  - text: String
    
    The transcribed text.
  - logprobs: Array[Logprob{ token, bytes, logprob}]
    
    The log probabilities of the tokens in the transcription. Only returned with the models gpt-4o-transcribe and gpt-4o-mini-transcribe if logprobs is added to the include array.
    - token: String
      
      The token in the transcription.
    - bytes: Array[Float]
      
      The bytes of the token.
    - logprob: Float
      
      The log probability of the token.
  - usage: Tokens{ input_tokens, output_tokens, total_tokens, 2 more} | Duration{ seconds, type}
    
    Token usage statistics for the request.
    - class Tokens
      
      Usage statistics for models billed by token usage.
      - input_tokens: Integer
        
        Number of input tokens billed for this request.
      - output_tokens: Integer
        
        Number of output tokens generated.
      - total_tokens: Integer
        
        Total number of tokens used (input + output).
      - type: :tokens
        
        The type of the usage object. Always tokens for this variant.
        - :tokens
      - input_token_details: InputTokenDetails{ audio_tokens, text_tokens}
        
        Details about the input tokens billed for this request.
        - audio_tokens: Integer
          
          Number of audio tokens billed for this request.
        - text_tokens: Integer
          
          Number of text tokens billed for this request.
    - class Duration
      
      Usage statistics for models billed by audio input duration.
      - seconds: Float
        
        Duration of the input audio in seconds.
      - type: :duration
        
        The type of the usage object. Always duration for this variant.
        - :duration
- class TranscriptionDiarized
  
  Represents a diarized transcription response returned by the model, including the combined transcript and speaker-segment annotations.
  - duration: Float
    
    Duration of the input audio in seconds.
  - segments: Array[TranscriptionDiarizedSegment]
    
    Segments of the transcript annotated with timestamps and speaker labels.
    - id: String
      
      Unique identifier for the segment.
    - end_: Float
      
      End timestamp of the segment in seconds.
    - speaker: String
      
      Speaker label for this segment. When known speakers are provided, the label matches known_speaker_names[]. Otherwise speakers are labeled sequentially using capital letters (A, B, ...).
    - start: Float
      
      Start timestamp of the segment in seconds.
    - text: String
      
      Transcript text for this segment.
    - type: :"transcript.text.segment"
      
      The type of the segment. Always transcript.text.segment.
      - :"transcript.text.segment"
  - task: :transcribe
    
    The type of task that was run. Always transcribe.
    - :transcribe
  - text: String
    
    The concatenated transcript text for the entire audio input.
  - usage: Tokens{ input_tokens, output_tokens, total_tokens, 2 more} | Duration{ seconds, type}
    
    Token or duration usage statistics for the request.
    - class Tokens
      
      Usage statistics for models billed by token usage.
      - input_tokens: Integer
        
        Number of input tokens billed for this request.
      - output_tokens: Integer
        
        Number of output tokens generated.
      - total_tokens: Integer
        
        Total number of tokens used (input + output).
      - type: :tokens
        
        The type of the usage object. Always tokens for this variant.
        - :tokens
      - input_token_details: InputTokenDetails{ audio_tokens, text_tokens}
        
        Details about the input tokens billed for this request.
        - audio_tokens: Integer
          
          Number of audio tokens billed for this request.
        - text_tokens: Integer
          
          Number of text tokens billed for this request.
    - class Duration
      
      Usage statistics for models billed by audio input duration.
      - seconds: Float
        
        Duration of the input audio in seconds.
      - type: :duration
        
        The type of the usage object. Always duration for this variant.
        - :duration
- class TranscriptionVerbose
  
  Represents a verbose_json transcription response returned by the model, based on the provided input.
  - duration: Float
    
    The duration of the input audio.
  - language: String
    
    The language of the input audio.
  - text: String
    
    The transcribed text.
  - segments: Array[TranscriptionSegment]
    
    Segments of the transcribed text and their corresponding details.
    - id: Integer
      
      Unique identifier of the segment.
    - avg_logprob: Float
      
      Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.
    - compression_ratio: Float
      
      Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.
    - end_: Float
      
      End time of the segment in seconds.
    - no_speech_prob: Float
      
      Probability of no speech in the segment. If the value is higher than 1.0 and the avg_logprob is below -1, consider this segment silent.
    - seek: Integer
      
      Seek offset of the segment.
    - start: Float
      
      Start time of the segment in seconds.
    - temperature: Float
      
      Temperature parameter used for generating the segment.
    - text: String
      
      Text content of the segment.
    - tokens: Array[Integer]
      
      Array of token IDs for the text content.
  - usage: Usage{ seconds, type}
    
    Usage statistics for models billed by audio input duration.
    - seconds: Float
      
      Duration of the input audio in seconds.
    - type: :duration
      
      The type of the usage object. Always duration for this variant.
      - :duration
  - words: Array[TranscriptionWord]
    
    Extracted words and their corresponding timestamps.
    - end_: Float
      
      End time of the word in seconds.
    - start: Float
      
      Start time of the word in seconds.
    - word: String
      
      The text content of the word.

Example

require "openai"

openai = OpenAI::Client.new(api_key: "My API Key")

transcription = openai.audio.transcriptions.create(file: StringIO.new("Example data"), model: :"gpt-4o-transcribe")

puts(transcription)

Response

{
  "text": "text",
  "logprobs": [
    {
      "token": "token",
      "bytes": [
        0
      ],
      "logprob": 0
    }
  ],
  "usage": {
    "input_tokens": 0,
    "output_tokens": 0,
    "total_tokens": 0,
    "type": "tokens",
    "input_token_details": {
      "audio_tokens": 0,
      "text_tokens": 0
    }
  }
}

Domain Types

Transcription

class Transcription

Represents a transcription response returned by the model, based on the provided input.
- text: String
  
  The transcribed text.
- logprobs: Array[Logprob{ token, bytes, logprob}]
  
  The log probabilities of the tokens in the transcription. Only returned with the models gpt-4o-transcribe and gpt-4o-mini-transcribe if logprobs is added to the include array.
  - token: String
    
    The token in the transcription.
  - bytes: Array[Float]
    
    The bytes of the token.
  - logprob: Float
    
    The log probability of the token.
- usage: Tokens{ input_tokens, output_tokens, total_tokens, 2 more} | Duration{ seconds, type}
  
  Token usage statistics for the request.
  - class Tokens
    
    Usage statistics for models billed by token usage.
    - input_tokens: Integer
      
      Number of input tokens billed for this request.
    - output_tokens: Integer
      
      Number of output tokens generated.
    - total_tokens: Integer
      
      Total number of tokens used (input + output).
    - type: :tokens
      
      The type of the usage object. Always tokens for this variant.
      - :tokens
    - input_token_details: InputTokenDetails{ audio_tokens, text_tokens}
      
      Details about the input tokens billed for this request.
      - audio_tokens: Integer
        
        Number of audio tokens billed for this request.
      - text_tokens: Integer
        
        Number of text tokens billed for this request.
  - class Duration
    
    Usage statistics for models billed by audio input duration.
    - seconds: Float
      
      Duration of the input audio in seconds.
    - type: :duration
      
      The type of the usage object. Always duration for this variant.
      - :duration
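
Since usage is either a Tokens or a Duration object, code that reports cost-related numbers has to branch on the type field. A minimal sketch: `TokensUsage` and `DurationUsage` are stand-in Structs (not the SDK classes) that model only the fields used here, `describe_usage` is a hypothetical helper, and the numbers are made up for illustration.

```ruby
# Stand-ins for the two usage variants; only the fields used below are modeled.
TokensUsage = Struct.new(:type, :input_tokens, :output_tokens, :total_tokens, keyword_init: true)
DurationUsage = Struct.new(:type, :seconds, keyword_init: true)

# Hypothetical helper that describes either variant in one line,
# switching on the type discriminator.
def describe_usage(usage)
  case usage.type.to_sym
  when :tokens
    "#{usage.total_tokens} tokens (#{usage.input_tokens} in, #{usage.output_tokens} out)"
  when :duration
    "#{usage.seconds} seconds of audio"
  else
    "unknown usage type: #{usage.type}"
  end
end

puts describe_usage(TokensUsage.new(type: :tokens, input_tokens: 14, output_tokens: 45, total_tokens: 59))
puts describe_usage(DurationUsage.new(type: :duration, seconds: 12.5))
```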

Transcription Diarized

class TranscriptionDiarized

Represents a diarized transcription response returned by the model, including the combined transcript and speaker-segment annotations.
- duration: Float
  
  Duration of the input audio in seconds.
- segments: Array[TranscriptionDiarizedSegment]
  
  Segments of the transcript annotated with timestamps and speaker labels.
  - id: String
    
    Unique identifier for the segment.
  - end_: Float
    
    End timestamp of the segment in seconds.
  - speaker: String
    
    Speaker label for this segment. When known speakers are provided, the label matches known_speaker_names[]. Otherwise speakers are labeled sequentially using capital letters (A, B, ...).
  - start: Float
    
    Start timestamp of the segment in seconds.
  - text: String
    
    Transcript text for this segment.
  - type: :"transcript.text.segment"
    
    The type of the segment. Always transcript.text.segment.
    - :"transcript.text.segment"
- task: :transcribe
  
  The type of task that was run. Always transcribe.
  - :transcribe
- text: String
  
  The concatenated transcript text for the entire audio input.
- usage: Tokens{ input_tokens, output_tokens, total_tokens, 2 more} | Duration{ seconds, type}
  
  Token or duration usage statistics for the request.
  - class Tokens
    
    Usage statistics for models billed by token usage.
    - input_tokens: Integer
      
      Number of input tokens billed for this request.
    - output_tokens: Integer
      
      Number of output tokens generated.
    - total_tokens: Integer
      
      Total number of tokens used (input + output).
    - type: :tokens
      
      The type of the usage object. Always tokens for this variant.
      - :tokens
    - input_token_details: InputTokenDetails{ audio_tokens, text_tokens}
      
      Details about the input tokens billed for this request.
      - audio_tokens: Integer
        
        Number of audio tokens billed for this request.
      - text_tokens: Integer
        
        Number of text tokens billed for this request.
  - class Duration
    
    Usage statistics for models billed by audio input duration.
    - seconds: Float
      
      Duration of the input audio in seconds.
    - type: :duration
      
      The type of the usage object. Always duration for this variant.
      - :duration

Transcription Diarized Segment

class TranscriptionDiarizedSegment

A segment of diarized transcript text with speaker metadata.
- id: String
  
  Unique identifier for the segment.
- end_: Float
  
  End timestamp of the segment in seconds.
- speaker: String
  
  Speaker label for this segment. When known speakers are provided, the label matches known_speaker_names[]. Otherwise speakers are labeled sequentially using capital letters (A, B, ...).
- start: Float
  
  Start timestamp of the segment in seconds.
- text: String
  
  Transcript text for this segment.
- type: :"transcript.text.segment"
  
  The type of the segment. Always transcript.text.segment.
  - :"transcript.text.segment"
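
Diarized segments carry a speaker label plus start and end times, which is enough to derive per-speaker talk time or a readable dialogue. The sketch below uses a stand-in `DiarizedSegment` Struct rather than the SDK class, and the helper names and sample values are hypothetical.

```ruby
# Stand-in for a diarized segment with the fields described above.
DiarizedSegment = Struct.new(:id, :start, :end_, :speaker, :text, keyword_init: true)

# Total speaking time per speaker label, in seconds.
def speaking_time(segments)
  segments.each_with_object(Hash.new(0.0)) do |segment, totals|
    totals[segment.speaker] += segment.end_ - segment.start
  end
end

# One "speaker: text" line per segment, ordered by start time.
def render_dialogue(segments)
  segments.sort_by(&:start).map { |s| "#{s.speaker}: #{s.text.strip}" }.join("\n")
end

segments = [
  DiarizedSegment.new(id: "seg_1", start: 0.0, end_: 2.5, speaker: "A", text: "Hi there."),
  DiarizedSegment.new(id: "seg_2", start: 2.5, end_: 4.0, speaker: "B", text: "Hello!"),
  DiarizedSegment.new(id: "seg_3", start: 4.0, end_: 6.0, speaker: "A", text: "How can I help?")
]

puts render_dialogue(segments)
p speaking_time(segments)
```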

Transcription Include

TranscriptionInclude = :logprobs
- :logprobs

Transcription Segment

class TranscriptionSegment
- id: Integer
  
  Unique identifier of the segment.
- avg_logprob: Float
  
  Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.
- compression_ratio: Float
  
  Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.
- end_: Float
  
  End time of the segment in seconds.
- no_speech_prob: Float
  
  Probability of no speech in the segment. If the value is higher than 1.0 and the avg_logprob is below -1, consider this segment silent.
- seek: Integer
  
  Seek offset of the segment.
- start: Float
  
  Start time of the segment in seconds.
- temperature: Float
  
  Temperature parameter used for generating the segment.
- text: String
  
  Text content of the segment.
- tokens: Array[Integer]
  
  Array of token IDs for the text content.
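
The thresholds quoted for avg_logprob and compression_ratio can be turned into a simple per-segment check. This is a sketch that covers only those two thresholds; `Segment` is a stand-in Struct, `segment_warnings` is a hypothetical helper, and the sample values are invented.

```ruby
# Stand-in for a transcription segment; only the fields used below are modeled.
Segment = Struct.new(:id, :avg_logprob, :compression_ratio, :text, keyword_init: true)

# Returns the warnings that apply to one segment, based on the two
# thresholds quoted in the field descriptions above.
def segment_warnings(segment)
  warnings = []
  warnings << :low_avg_logprob if segment.avg_logprob < -1
  warnings << :high_compression_ratio if segment.compression_ratio > 2.4
  warnings
end

good = Segment.new(id: 0, avg_logprob: -0.25, compression_ratio: 1.5, text: "Hello.")
bad = Segment.new(id: 1, avg_logprob: -1.5, compression_ratio: 3.0, text: "la la la la")

p segment_warnings(good)
p segment_warnings(bad)
```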

Transcription Stream Event

TranscriptionStreamEvent = TranscriptionTextSegmentEvent | TranscriptionTextDeltaEvent | TranscriptionTextDoneEvent

Emitted when a diarized transcription returns a completed segment with speaker information. Only emitted when you create a transcription with stream set to true and response_format set to diarized_json.
- class TranscriptionTextSegmentEvent
  
  Emitted when a diarized transcription returns a completed segment with speaker information. Only emitted when you create a transcription with stream set to true and response_format set to diarized_json.
  - id: String
    
    Unique identifier for the segment.
  - end_: Float
    
    End timestamp of the segment in seconds.
  - speaker: String
    
    Speaker label for this segment.
  - start: Float
    
    Start timestamp of the segment in seconds.
  - text: String
    
    Transcript text for this segment.
  - type: :"transcript.text.segment"
    
    The type of the event. Always transcript.text.segment.
    - :"transcript.text.segment"
- class TranscriptionTextDeltaEvent
  
  Emitted when there is an additional text delta. This is also the first event emitted when the transcription starts. Only emitted when you create a transcription with the stream parameter set to true.
  - delta: String
    
    The text delta that was additionally transcribed.
  - type: :"transcript.text.delta"
    
    The type of the event. Always transcript.text.delta.
    - :"transcript.text.delta"
  - logprobs: Array[Logprob{ token, bytes, logprob}]
    
    The log probabilities of the delta. Only included if you create a transcription with the include[] parameter set to logprobs.
    - token: String
      
      The token that was used to generate the log probability.
    - bytes: Array[Integer]
      
      The bytes that were used to generate the log probability.
    - logprob: Float
      
      The log probability of the token.
  - segment_id: String
    
    Identifier of the diarized segment that this delta belongs to. Only present when using gpt-4o-transcribe-diarize.
- class TranscriptionTextDoneEvent
  
  Emitted when the transcription is complete. Contains the complete transcription text. Only emitted when you create a transcription with the stream parameter set to true.
  - text: String
    
    The text that was transcribed.
  - type: :"transcript.text.done"
    
    The type of the event. Always transcript.text.done.
    - :"transcript.text.done"
  - logprobs: Array[Logprob{ token, bytes, logprob}]
    
    The log probabilities of the individual tokens in the transcription. Only included if you create a transcription with the include[] parameter set to logprobs.
    - token: String
      
      The token that was used to generate the log probability.
    - bytes: Array[Integer]
      
      The bytes that were used to generate the log probability.
    - logprob: Float
      
      The log probability of the token.
  - usage: Usage{ input_tokens, output_tokens, total_tokens, 2 more}
    
    Usage statistics for models billed by token usage.
    - input_tokens: Integer
      
      Number of input tokens billed for this request.
    - output_tokens: Integer
      
      Number of output tokens generated.
    - total_tokens: Integer
      
      Total number of tokens used (input + output).
    - type: :tokens
      
      The type of the usage object. Always tokens for this variant.
      - :tokens
    - input_token_details: InputTokenDetails{ audio_tokens, text_tokens}
      
      Details about the input tokens billed for this request.
      - audio_tokens: Integer
        
        Number of audio tokens billed for this request.
      - text_tokens: Integer
        
        Number of text tokens billed for this request.
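
A consumer of these events typically switches on the type field, accumulating deltas and stopping at the done event. The sketch below works on any enumerable of event-like objects; `StreamEvent` is a stand-in Struct and `collect_transcript` is a hypothetical helper. How the event stream is obtained from the client is left out on purpose.

```ruby
# Stand-in for a stream event; only the fields used below are modeled.
StreamEvent = Struct.new(:type, :delta, :text, keyword_init: true)

# Consumes an enumerable of events and returns the final transcript.
# Deltas are yielded as they arrive and also accumulated as a fallback
# in case the enumerable ends without a done event.
def collect_transcript(events)
  partial = +""
  events.each do |event|
    case event.type.to_s
    when "transcript.text.delta"
      partial << event.delta
      yield event.delta if block_given?
    when "transcript.text.done"
      return event.text
    end
  end
  partial
end

events = [
  StreamEvent.new(type: :"transcript.text.delta", delta: "Hel"),
  StreamEvent.new(type: :"transcript.text.delta", delta: "lo"),
  StreamEvent.new(type: :"transcript.text.done", text: "Hello")
]

final = collect_transcript(events) { |delta| print delta }
puts
puts final
```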

Transcription Text Delta Event

class TranscriptionTextDeltaEvent

Emitted when there is an additional text delta. This is also the first event emitted when the transcription starts. Only emitted when you create a transcription with the stream parameter set to true.
- delta: String
  
  The text delta that was additionally transcribed.
- type: :"transcript.text.delta"
  
  The type of the event. Always transcript.text.delta.
  - :"transcript.text.delta"
- logprobs: Array[Logprob{ token, bytes, logprob}]
  
  The log probabilities of the delta. Only included if you create a transcription with the include[] parameter set to logprobs.
  - token: String
    
    The token that was used to generate the log probability.
  - bytes: Array[Integer]
    
    The bytes that were used to generate the log probability.
  - logprob: Float
    
    The log probability of the token.
- segment_id: String
  
  Identifier of the diarized segment that this delta belongs to. Only present when using gpt-4o-transcribe-diarize.
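
Each logprobs entry can be converted into a more readable confidence value. The sketch below rests on two assumptions that this reference does not state: that logprob is a natural logarithm, and that bytes holds the UTF-8 encoding of the token. `Logprob` is a stand-in Struct and the helper names are hypothetical.

```ruby
# Stand-in for a logprobs entry with the three documented fields.
Logprob = Struct.new(:token, :bytes, :logprob, keyword_init: true)

# Converts a log probability to a probability.
# Assumption: the value is a natural logarithm.
def probability(entry)
  Math.exp(entry.logprob)
end

# Rebuilds the token text from its byte values.
# Assumption: the bytes are the UTF-8 encoding of the token.
def token_text(entry)
  entry.bytes.pack("C*").force_encoding(Encoding::UTF_8)
end

entry = Logprob.new(token: "Hi", bytes: [72, 105], logprob: -0.1)
puts token_text(entry)
puts probability(entry).round(3)
```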

Transcription Text Done Event

class TranscriptionTextDoneEvent

Emitted when the transcription is complete. Contains the complete transcription text. Only emitted when you create a transcription with the stream parameter set to true.
- text: String
  
  The text that was transcribed.
- type: :"transcript.text.done"
  
  The type of the event. Always transcript.text.done.
  - :"transcript.text.done"
- logprobs: Array[Logprob{ token, bytes, logprob}]
  
  The log probabilities of the individual tokens in the transcription. Only included if you create a transcription with the include[] parameter set to logprobs.
  - token: String
    
    The token that was used to generate the log probability.
  - bytes: Array[Integer]
    
    The bytes that were used to generate the log probability.
  - logprob: Float
    
    The log probability of the token.
- usage: Usage{ input_tokens, output_tokens, total_tokens, 2 more}
  
  Usage statistics for models billed by token usage.
  - input_tokens: Integer
    
    Number of input tokens billed for this request.
  - output_tokens: Integer
    
    Number of output tokens generated.
  - total_tokens: Integer
    
    Total number of tokens used (input + output).
  - type: :tokens
    
    The type of the usage object. Always tokens for this variant.
    - :tokens
  - input_token_details: InputTokenDetails{ audio_tokens, text_tokens}
    
    Details about the input tokens billed for this request.
    - audio_tokens: Integer
      
      Number of audio tokens billed for this request.
    - text_tokens: Integer
      
      Number of text tokens billed for this request.

Transcription Text Segment Event

class TranscriptionTextSegmentEvent

Emitted when a diarized transcription returns a completed segment with speaker information. Only emitted when you create a transcription with stream set to true and response_format set to diarized_json.
- id: String
  
  Unique identifier for the segment.
- end_: Float
  
  End timestamp of the segment in seconds.
- speaker: String
  
  Speaker label for this segment.
- start: Float
  
  Start timestamp of the segment in seconds.
- text: String
  
  Transcript text for this segment.
- type: :"transcript.text.segment"
  
  The type of the event. Always transcript.text.segment.
  - :"transcript.text.segment"

Transcription Verbose

class TranscriptionVerbose

Represents a verbose_json transcription response returned by the model, based on the provided input.
- duration: Float
  
  The duration of the input audio.
- language: String
  
  The language of the input audio.
- text: String
  
  The transcribed text.
- segments: Array[TranscriptionSegment]
  
  Segments of the transcribed text and their corresponding details.
  - id: Integer
    
    Unique identifier of the segment.
  - avg_logprob: Float
    
    Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.
  - compression_ratio: Float
    
    Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.
  - end_: Float
    
    End time of the segment in seconds.
  - no_speech_prob: Float
    
    Probability of no speech in the segment. If the value is higher than 1.0 and the avg_logprob is below -1, consider this segment silent.
  - seek: Integer
    
    Seek offset of the segment.
  - start: Float
    
    Start time of the segment in seconds.
  - temperature: Float
    
    Temperature parameter used for generating the segment.
  - text: String
    
    Text content of the segment.
  - tokens: Array[Integer]
    
    Array of token IDs for the text content.
- usage: Usage{ seconds, type}
  
  Usage statistics for models billed by audio input duration.
  - seconds: Float
    
    Duration of the input audio in seconds.
  - type: :duration
    
    The type of the usage object. Always duration for this variant.
    - :duration
- words: Array[TranscriptionWord]
  
  Extracted words and their corresponding timestamps.
  - end_: Float
    
    End time of the word in seconds.
  - start: Float
    
    Start time of the word in seconds.
  - word: String
    
    The text content of the word.

Transcription Word

class TranscriptionWord
- end_: Float
  
  End time of the word in seconds.
- start: Float
  
  Start time of the word in seconds.
- word: String
  
  The text content of the word.
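
Word-level timestamps are enough to build custom caption cues on the client. The following sketch groups words into fixed-size cues and renders them in SRT-style text; `Word` is a stand-in Struct, the helper names and the default of seven words per cue are arbitrary choices for illustration, and the sample timings are invented.

```ruby
# Stand-in for a transcription word with the three documented fields.
Word = Struct.new(:word, :start, :end_, keyword_init: true)

# Formats a time in seconds as an SRT timestamp (HH:MM:SS,mmm).
def srt_time(seconds)
  total_ms = (seconds * 1000).round
  hours, rest = total_ms.divmod(3_600_000)
  minutes, rest = rest.divmod(60_000)
  secs, millis = rest.divmod(1000)
  format("%02d:%02d:%02d,%03d", hours, minutes, secs, millis)
end

# Groups words into cues of at most per_cue words and renders SRT-style text.
def words_to_srt(words, per_cue: 7)
  words.each_slice(per_cue).with_index(1).map do |chunk, index|
    text = chunk.map(&:word).join(" ")
    "#{index}\n#{srt_time(chunk.first.start)} --> #{srt_time(chunk.last.end_)}\n#{text}\n"
  end.join("\n")
end

words = [
  Word.new(word: "Hello", start: 0.0, end_: 0.5),
  Word.new(word: "world", start: 0.5, end_: 1.25)
]

puts words_to_srt(words, per_cue: 2)
```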

Transcription Create Response

TranscriptionCreateResponse = Transcription | TranscriptionDiarized | TranscriptionVerbose

Represents a transcription response returned by the model, based on the provided input.
- class Transcription
  
  Represents a transcription response returned by the model, based on the provided input.
  - text: String
    
    The transcribed text.
  - logprobs: Array[Logprob{ token, bytes, logprob}]
    
    The log probabilities of the tokens in the transcription. Only returned with the models gpt-4o-transcribe and gpt-4o-mini-transcribe if logprobs is added to the include array.
    - token: String
      
      The token in the transcription.
    - bytes: Array[Float]
      
      The bytes of the token.
    - logprob: Float
      
      The log probability of the token.
  - usage: Tokens{ input_tokens, output_tokens, total_tokens, 2 more} | Duration{ seconds, type}
    
    Token usage statistics for the request.
    - class Tokens
      
      Usage statistics for models billed by token usage.
      - input_tokens: Integer
        
        Number of input tokens billed for this request.
      - output_tokens: Integer
        
        Number of output tokens generated.
      - total_tokens: Integer
        
        Total number of tokens used (input + output).
      - type: :tokens
        
        The type of the usage object. Always tokens for this variant.
        - :tokens
      - input_token_details: InputTokenDetails{ audio_tokens, text_tokens}
        
        Details about the input tokens billed for this request.
        - audio_tokens: Integer
          
          Number of audio tokens billed for this request.
        - text_tokens: Integer
          
          Number of text tokens billed for this request.
    - class Duration
      
      Usage statistics for models billed by audio input duration.
      - seconds: Float
        
        Duration of the input audio in seconds.
      - type: :duration
        
        The type of the usage object. Always duration for this variant.
        - :duration
- class TranscriptionDiarized
  
  Represents a diarized transcription response returned by the model, including the combined transcript and speaker-segment annotations.
  - duration: Float
    
    Duration of the input audio in seconds.
  - segments: Array[TranscriptionDiarizedSegment]
    
    Segments of the transcript annotated with timestamps and speaker labels.
    - id: String
      
      Unique identifier for the segment.
    - end_: Float
      
      End timestamp of the segment in seconds.
    - speaker: String
      
      Speaker label for this segment. When known speakers are provided, the label matches known_speaker_names[]. Otherwise speakers are labeled sequentially using capital letters (A, B, ...).
    - start: Float
      
      Start timestamp of the segment in seconds.
    - text: String
      
      Transcript text for this segment.
    - type: :"transcript.text.segment"
      
      The type of the segment. Always transcript.text.segment.
      - :"transcript.text.segment"
  - task: :transcribe
    
    The type of task that was run. Always transcribe.
    - :transcribe
  - text: String
    
    The concatenated transcript text for the entire audio input.
  - usage: Tokens{ input_tokens, output_tokens, total_tokens, 2 more} | Duration{ seconds, type}
    
    Token or duration usage statistics for the request.
    - class Tokens
      
      Usage statistics for models billed by token usage.
      - input_tokens: Integer
        
        Number of input tokens billed for this request.
      - output_tokens: Integer
        
        Number of output tokens generated.
      - total_tokens: Integer
        
        Total number of tokens used (input + output).
      - type: :tokens
        
        The type of the usage object. Always tokens for this variant.
        - :tokens
      - input_token_details: InputTokenDetails{ audio_tokens, text_tokens}
        
        Details about the input tokens billed for this request.
        - audio_tokens: Integer
          
          Number of audio tokens billed for this request.
        - text_tokens: Integer
          
          Number of text tokens billed for this request.
    - class Duration
      
      Usage statistics for models billed by audio input duration.
      - seconds: Float
        
        Duration of the input audio in seconds.
      - type: :duration
        
        The type of the usage object. Always duration for this variant.
        
        :duration
- class TranscriptionVerbose
  
  Represents a verbose JSON transcription response returned by the model, based on the provided input.
  - duration: Float
    
    The duration of the input audio.
  - language: String
    
    The language of the input audio.
  - text: String
    
    The transcribed text.
  - segments: Array[TranscriptionSegment]
    
    Segments of the transcribed text and their corresponding details.
    - id: Integer
      
      Unique identifier of the segment.
    - avg_logprob: Float
      
      Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.
    - compression_ratio: Float
      
      Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.
    - end_: Float
      
      End time of the segment in seconds.
    - no_speech_prob: Float
      
      Probability of no speech in the segment. If the value is higher than 1.0 and the avg_logprob is below -1, consider this segment silent.
    - seek: Integer
      
      Seek offset of the segment.
    - start: Float
      
      Start time of the segment in seconds.
    - temperature: Float
      
      Temperature parameter used for generating the segment.
    - text: String
      
      Text content of the segment.
    - tokens: Array[Integer]
      
      Array of token IDs for the text content.
  - usage: Usage{ seconds, type}
    
    Usage statistics for models billed by audio input duration.
    - seconds: Float
      
      Duration of the input audio in seconds.
    - type: :duration
      
      The type of the usage object. Always duration for this variant.
      - :duration
  - words: Array[TranscriptionWord]
    
    Extracted words and their corresponding timestamps.
    - end_: Float
      
      End time of the word in seconds.
    - start: Float
      
      Start time of the word in seconds.
    - word: String
      
      The text content of the word.
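
The segment-level thresholds quoted above (an avg_logprob below -1, a compression_ratio above 2.4) can be applied on the client. The following is a minimal sketch under stated assumptions: `Segment` is a stand-in struct rather than the SDK's TranscriptionSegment class, and the helper name `suspect_segments` is hypothetical.

```ruby
# Stand-in for the SDK's segment objects; only the fields used here are modeled.
Segment = Struct.new(:id, :avg_logprob, :compression_ratio, :text, keyword_init: true)

# Thresholds quoted from the field descriptions above.
LOGPROB_FLOOR = -1.0
COMPRESSION_CEILING = 2.4

# Returns the segments whose metrics fall outside the documented thresholds.
def suspect_segments(segments)
  segments.select do |seg|
    seg.avg_logprob < LOGPROB_FLOOR || seg.compression_ratio > COMPRESSION_CEILING
  end
end

# Made-up sample data for illustration only.
segments = [
  Segment.new(id: 0, avg_logprob: -0.21, compression_ratio: 1.3, text: "Hello there."),
  Segment.new(id: 1, avg_logprob: -1.45, compression_ratio: 1.1, text: "(inaudible)"),
  Segment.new(id: 2, avg_logprob: -0.30, compression_ratio: 3.0, text: "la la la la la la")
]

flagged_ids = suspect_segments(segments).map(&:id)
puts flagged_ids.inspect
```

Flagged segments are candidates for review or re-transcription; what to do with them is an application decision, not something the API prescribes.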

Translations

Create translation

audio.translations.create(**kwargs) -> TranslationCreateResponse

post /audio/translations

Translates audio into English.

Parameters

file: String

The audio file object (not file name) to translate, in one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm.
model: String | AudioModel

ID of the model to use. Only whisper-1 (which is powered by our open source Whisper V2 model) is currently available.
- String = String
- AudioModel = :"whisper-1" | :"gpt-4o-transcribe" | :"gpt-4o-mini-transcribe" | 2 more
  - :"whisper-1"
  - :"gpt-4o-transcribe"
  - :"gpt-4o-mini-transcribe"
  - :"gpt-4o-mini-transcribe-2025-12-15"
  - :"gpt-4o-transcribe-diarize"
prompt: String

An optional text to guide the model's style or continue a previous audio segment. The prompt should be in English.
response_format: :json | :text | :srt | 2 more

The format of the output, in one of these options: json, text, srt, verbose_json, or vtt.
- :json
- :text
- :srt
- :verbose_json
- :vtt
temperature: Float

The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use log probability to automatically increase the temperature until certain thresholds are hit.
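
The constraints listed for response_format and temperature can be checked before a request is sent. This is only a sketch: `translation_param_errors` is a hypothetical helper, not part of the SDK, and the server remains the authority on what it accepts.

```ruby
# Allowed values copied from the parameter descriptions above.
TRANSLATION_FORMATS = %i[json text srt verbose_json vtt].freeze
TEMPERATURE_RANGE = (0.0..1.0).freeze

# Hypothetical helper: returns a list of problems found in the given arguments.
def translation_param_errors(response_format:, temperature:)
  errors = []
  unless TRANSLATION_FORMATS.include?(response_format)
    errors << "unsupported response_format: #{response_format.inspect}"
  end
  errors << "temperature must be between 0 and 1" unless TEMPERATURE_RANGE.cover?(temperature)
  errors
end

ok = translation_param_errors(response_format: :srt, temperature: 0.2)
bad = translation_param_errors(response_format: :diarized_json, temperature: 1.5)

puts ok.inspect
puts bad.inspect
```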

Returns

TranslationCreateResponse = Translation | TranslationVerbose
- class Translation
  - text: String
- class TranslationVerbose
  - duration: Float
    
    The duration of the input audio.
  - language: String
    
    The language of the output translation (always english).
  - text: String
    
    The translated text.
  - segments: Array[TranscriptionSegment]
    
    Segments of the translated text and their corresponding details.
    - id: Integer
      
      Unique identifier of the segment.
    - avg_logprob: Float
      
      Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.
    - compression_ratio: Float
      
      Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.
    - end_: Float
      
      End time of the segment in seconds.
    - no_speech_prob: Float
      
      Probability of no speech in the segment. If the value is higher than 1.0 and the avg_logprob is below -1, consider this segment silent.
    - seek: Integer
      
      Seek offset of the segment.
    - start: Float
      
      Start time of the segment in seconds.
    - temperature: Float
      
      Temperature parameter used for generating the segment.
    - text: String
      
      Text content of the segment.
    - tokens: Array[Integer]
      
      Array of token IDs for the text content.

Example

require "openai"

openai = OpenAI::Client.new(api_key: "My API Key")

translation = openai.audio.translations.create(file: StringIO.new("Example data"), model: :"whisper-1")

puts(translation)

Response

{
  "text": "text"
}

Domain Types

Translation

class Translation
- text: String

Translation Verbose

class TranslationVerbose
- duration: Float
  
  The duration of the input audio.
- language: String
  
  The language of the output translation (always english).
- text: String
  
  The translated text.
- segments: Array[TranscriptionSegment]
  
  Segments of the translated text and their corresponding details.
  - id: Integer
    
    Unique identifier of the segment.
  - avg_logprob: Float
    
    Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.
  - compression_ratio: Float
    
    Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.
  - end_: Float
    
    End time of the segment in seconds.
  - no_speech_prob: Float
    
    Probability of no speech in the segment. If the value is higher than 1.0 and the avg_logprob is below -1, consider this segment silent.
  - seek: Integer
    
    Seek offset of the segment.
  - start: Float
    
    Start time of the segment in seconds.
  - temperature: Float
    
    Temperature parameter used for generating the segment.
  - text: String
    
    Text content of the segment.
  - tokens: Array[Integer]
    
    Array of token IDs for the text content.

Translation Create Response

TranslationCreateResponse = Translation | TranslationVerbose
- class Translation
  - text: String
- class TranslationVerbose
  - duration: Float
    
    The duration of the input audio.
  - language: String
    
    The language of the output translation (always english).
  - text: String
    
    The translated text.
  - segments: Array[TranscriptionSegment]
    
    Segments of the translated text and their corresponding details.
    - id: Integer
      
      Unique identifier of the segment.
    - avg_logprob: Float
      
      Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.
    - compression_ratio: Float
      
      Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.
    - end_: Float
      
      End time of the segment in seconds.
    - no_speech_prob: Float
      
      Probability of no speech in the segment. If the value is higher than 1.0 and the avg_logprob is below -1, consider this segment silent.
    - seek: Integer
      
      Seek offset of the segment.
    - start: Float
      
      Start time of the segment in seconds.
    - temperature: Float
      
      Temperature parameter used for generating the segment.
    - text: String
      
      Text content of the segment.
    - tokens: Array[Integer]
      
      Array of token IDs for the text content.

Speech

Create speech

audio.speech.create(**kwargs) -> StringIO

post /audio/speech

Generates audio from the input text.

Returns the audio file content, or a stream of audio events.

Parameters

input: String

The text to generate audio for. The maximum length is 4096 characters.
model: String | SpeechModel

One of the available TTS models: tts-1, tts-1-hd, gpt-4o-mini-tts, or gpt-4o-mini-tts-2025-12-15.
- String = String
- SpeechModel = :"tts-1" | :"tts-1-hd" | :"gpt-4o-mini-tts" | :"gpt-4o-mini-tts-2025-12-15"
  - :"tts-1"
  - :"tts-1-hd"
  - :"gpt-4o-mini-tts"
  - :"gpt-4o-mini-tts-2025-12-15"
voice: String | :alloy | :ash | :ballad | 7 more | ID{ id}

The voice to use when generating the audio. Supported built-in voices are alloy, ash, ballad, coral, echo, fable, onyx, nova, sage, shimmer, verse, marin, and cedar. You may also provide a custom voice object with an id, for example { "id": "voice_1234" }. Previews of the voices are available in the Text to speech guide.
- String = String
- Voice = :alloy | :ash | :ballad | 7 more
  - :alloy
  - :ash
  - :ballad
  - :coral
  - :echo
  - :sage
  - :shimmer
  - :verse
  - :marin
  - :cedar
- class ID
  
  Custom voice reference.
  - id: String
    
    The custom voice ID, e.g. voice_1234.
instructions: String

Control the voice of your generated audio with additional instructions. Does not work with tts-1 or tts-1-hd.
response_format: :mp3 | :opus | :aac | 3 more

The format of the generated audio. Supported formats are mp3, opus, aac, flac, wav, and pcm.
- :mp3
- :opus
- :aac
- :flac
- :wav
- :pcm
speed: Float

The speed of the generated audio. Select a value from 0.25 to 4.0. 1.0 is the default.
stream_format: :sse | :audio

The format to stream the audio in. Supported formats are sse and audio. sse is not supported for tts-1 or tts-1-hd.
- :sse
- :audio
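
Several of the parameters above carry constraints (input length, speed range, features that tts-1 and tts-1-hd do not support). A rough client-side check might look like the following; `speech_param_errors` is a hypothetical helper written for this sketch and is not provided by the SDK.

```ruby
# Limits copied from the parameter descriptions above.
LEGACY_TTS_MODELS = %i[tts-1 tts-1-hd].freeze
MAX_INPUT_CHARS = 4096
SPEED_RANGE = (0.25..4.0).freeze

# Hypothetical helper: returns a list of problems found in the given arguments.
def speech_param_errors(input:, model:, speed: 1.0, instructions: nil, stream_format: nil)
  legacy = LEGACY_TTS_MODELS.include?(model.to_sym)
  errors = []
  errors << "input exceeds #{MAX_INPUT_CHARS} characters" if input.length > MAX_INPUT_CHARS
  errors << "speed must be between 0.25 and 4.0" unless SPEED_RANGE.cover?(speed)
  errors << "instructions are not supported by #{model}" if legacy && instructions
  errors << "sse streaming is not supported by #{model}" if legacy && stream_format&.to_sym == :sse
  errors
end

fine = speech_param_errors(input: "Hello", model: :"gpt-4o-mini-tts", instructions: "Speak calmly.")
broken = speech_param_errors(input: "Hello", model: :"tts-1", speed: 5.0, stream_format: :sse)

puts fine.inspect
puts broken.inspect
```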

Returns

StringIO

Example

require "openai"

openai = OpenAI::Client.new(api_key: "My API Key")

speech = openai.audio.speech.create(input: "input", model: :"tts-1", voice: :alloy)

puts(speech)
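
Because the call returns a StringIO, printing it is rarely what you want; the bytes are usually written to a file. The sketch below uses a stand-in StringIO with placeholder bytes instead of a real API response, and the output path is an arbitrary choice.

```ruby
require "stringio"
require "tmpdir"

# Stand-in for the StringIO returned by audio.speech.create.
# A real response would hold encoded audio (mp3, wav, ...), not this placeholder.
speech = StringIO.new("fake audio bytes".b)

# Write in binary mode so the audio bytes are not altered by encoding conversion.
path = File.join(Dir.tmpdir, "speech_example.mp3")
File.binwrite(path, speech.read)

bytes_written = File.size(path)
puts "wrote #{bytes_written} bytes to #{path}"
```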

Domain Types

Speech Model

SpeechModel = :"tts-1" | :"tts-1-hd" | :"gpt-4o-mini-tts" | :"gpt-4o-mini-tts-2025-12-15"
- :"tts-1"
- :"tts-1-hd"
- :"gpt-4o-mini-tts"
- :"gpt-4o-mini-tts-2025-12-15"

Voices

Voice Consents

ruby/resources/audio/index.md +1806 −0 created

# Audio

## Domain Types

### Audio Model

- `AudioModel = :"whisper-1" | :"gpt-4o-transcribe" | :"gpt-4o-mini-transcribe" | 2 more`

  - `:"whisper-1"`

  - `:"gpt-4o-transcribe"`

  - `:"gpt-4o-mini-transcribe"`

  - `:"gpt-4o-mini-transcribe-2025-12-15"`

  - `:"gpt-4o-transcribe-diarize"`

### Audio Response Format

- `AudioResponseFormat = :json | :text | :srt | 3 more`

  The format of the output, in one of these options: `json`, `text`, `srt`, `verbose_json`, `vtt`, or `diarized_json`. For `gpt-4o-transcribe` and `gpt-4o-mini-transcribe`, the only supported format is `json`. For `gpt-4o-transcribe-diarize`, the supported formats are `json`, `text`, and `diarized_json`, with `diarized_json` required to receive speaker annotations.

  - `:json`

  - `:text`

  - `:srt`

  - `:verbose_json`

  - `:vtt`

  - `:diarized_json`

# Transcriptions

## Create transcription

`audio.transcriptions.create(**kwargs) -> TranscriptionCreateResponse`

**post** `/audio/transcriptions`

Transcribes audio into the input language.

Returns a transcription object in `json`, `diarized_json`, or `verbose_json`
format, or a stream of transcript events.

### Parameters

- `file: String`

  The audio file object (not file name) to transcribe, in one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm.

- `model: String | AudioModel`

  ID of the model to use. The options are `gpt-4o-transcribe`, `gpt-4o-mini-transcribe`, `gpt-4o-mini-transcribe-2025-12-15`, `whisper-1` (which is powered by our open source Whisper V2 model), and `gpt-4o-transcribe-diarize`.

  - `String = String`

  - `AudioModel = :"whisper-1" | :"gpt-4o-transcribe" | :"gpt-4o-mini-transcribe" | 2 more`

    - `:"whisper-1"`

    - `:"gpt-4o-transcribe"`

    - `:"gpt-4o-mini-transcribe"`

    - `:"gpt-4o-mini-transcribe-2025-12-15"`

    - `:"gpt-4o-transcribe-diarize"`

- `chunking_strategy: :auto | VadConfig{ type, prefix_padding_ms, silence_duration_ms, threshold}`

  Controls how the audio is cut into chunks. When set to `"auto"`, the server first normalizes loudness and then uses voice activity detection (VAD) to choose boundaries. A `server_vad` object can be provided to tweak VAD detection parameters manually. If unset, the audio is transcribed as a single block. Required when using `gpt-4o-transcribe-diarize` for inputs longer than 30 seconds.

  - `ChunkingStrategy = :auto`

    Automatically set chunking parameters based on the audio. Must be set to `"auto"`.

    - `:auto`

  - `class VadConfig`

    - `type: :server_vad`

      Must be set to `server_vad` to enable manual chunking using server side VAD.

      - `:server_vad`

    - `prefix_padding_ms: Integer`

      Amount of audio to include before the VAD detected speech (in
      milliseconds).

    - `silence_duration_ms: Integer`

      Duration of silence to detect speech stop (in milliseconds).
      With shorter values the model will respond more quickly,
      but may jump in on short pauses from the user.

    - `threshold: Float`

      Sensitivity threshold (0.0 to 1.0) for voice activity detection. A
      higher threshold will require louder audio to activate the model, and
      thus might perform better in noisy environments.

- `include: Array[TranscriptionInclude]`

  Additional information to include in the transcription response.
  `logprobs` will return the log probabilities of the tokens in the
  response to understand the model's confidence in the transcription.
  `logprobs` only works with response_format set to `json` and only with
  the models `gpt-4o-transcribe`, `gpt-4o-mini-transcribe`, and `gpt-4o-mini-transcribe-2025-12-15`. This field is not supported when using `gpt-4o-transcribe-diarize`.

  - `:logprobs`

- `known_speaker_names: Array[String]`

  Optional list of speaker names that correspond to the audio samples provided in `known_speaker_references[]`. Each entry should be a short identifier (for example `customer` or `agent`). Up to 4 speakers are supported.

- `known_speaker_references: Array[String]`

  Optional list of audio samples (as [data URLs](https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/Data_URLs)) that contain known speaker references matching `known_speaker_names[]`. Each sample must be between 2 and 10 seconds, and can use any of the same input audio formats supported by `file`.

- `language: String`

  The language of the input audio. Supplying the input language in [ISO-639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) (e.g. `en`) format will improve accuracy and latency.

- `prompt: String`

  An optional text to guide the model's style or continue a previous audio segment. The [prompt](https://platform.openai.com/docs/guides/speech-to-text#prompting) should match the audio language. This field is not supported when using `gpt-4o-transcribe-diarize`.

- `response_format: AudioResponseFormat`

  The format of the output, in one of these options: `json`, `text`, `srt`, `verbose_json`, `vtt`, or `diarized_json`. For `gpt-4o-transcribe` and `gpt-4o-mini-transcribe`, the only supported format is `json`. For `gpt-4o-transcribe-diarize`, the supported formats are `json`, `text`, and `diarized_json`, with `diarized_json` required to receive speaker annotations.

  - `:json`

  - `:text`

  - `:srt`

  - `:verbose_json`

  - `:vtt`

  - `:diarized_json`

- `stream: bool`

  If set to true, the model response data will be streamed to the client
  as it is generated using [server-sent events](https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events/Using_server-sent_events#Event_stream_format).
  See the [Streaming section of the Speech-to-Text guide](https://platform.openai.com/docs/guides/speech-to-text?lang=curl#streaming-transcriptions)
  for more information.

  Note: Streaming is not supported for the `whisper-1` model and will be ignored.

- `temperature: Float`

  The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use [log probability](https://en.wikipedia.org/wiki/Log_probability) to automatically increase the temperature until certain thresholds are hit.

- `timestamp_granularities: Array[:word | :segment]`

  The timestamp granularities to populate for this transcription. `response_format` must be set to `verbose_json` to use timestamp granularities. Either or both of these options are supported: `word`, or `segment`. Note: There is no additional latency for segment timestamps, but generating word timestamps incurs additional latency.
  This option is not available for `gpt-4o-transcribe-diarize`.

  - `:word`

  - `:segment`
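
The `known_speaker_references` parameter above expects audio samples encoded as data URLs. One way to build such a URL and assemble the keyword arguments for a diarized request is sketched below. `audio_data_url` is a hypothetical helper, the sample bytes are a placeholder, and the `audio/wav` MIME type is an assumption about the sample's format.

```ruby
# Hypothetical helper: wraps raw audio bytes in a data URL.
def audio_data_url(bytes, mime_type)
  encoded = [bytes].pack("m0") # strict Base64 without line breaks
  "data:#{mime_type};base64,#{encoded}"
end

# Placeholder bytes; in practice these would come from File.binread
# on a 2 to 10 second recording of the known speaker.
sample = "RIFF....WAVE".b
url = audio_data_url(sample, "audio/wav")

# Keyword arguments shaped after the parameter list above.
request = {
  model: :"gpt-4o-transcribe-diarize",
  response_format: :diarized_json,
  chunking_strategy: :auto,
  known_speaker_names: ["agent"],
  known_speaker_references: [url]
}

puts request[:known_speaker_references].first
```

The names and references arrays are matched by position, so they should have the same length and order.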

### Returns

- `TranscriptionCreateResponse = Transcription | TranscriptionDiarized | TranscriptionVerbose`

  Represents a transcription response returned by the model, based on the provided input.

  - `class Transcription`

    Represents a transcription response returned by the model, based on the provided input.

    - `text: String`

      The transcribed text.

    - `logprobs: Array[Logprob{ token, bytes, logprob}]`

      The log probabilities of the tokens in the transcription. Only returned with the models `gpt-4o-transcribe` and `gpt-4o-mini-transcribe` if `logprobs` is added to the `include` array.

      - `token: String`

        The token in the transcription.

      - `bytes: Array[Float]`

        The bytes of the token.

      - `logprob: Float`

        The log probability of the token.

    - `usage: Tokens{ input_tokens, output_tokens, total_tokens, 2 more} | Duration{ seconds, type}`

      Token usage statistics for the request.

      - `class Tokens`

        Usage statistics for models billed by token usage.

        - `input_tokens: Integer`

          Number of input tokens billed for this request.

        - `output_tokens: Integer`

          Number of output tokens generated.

        - `total_tokens: Integer`

          Total number of tokens used (input + output).

        - `type: :tokens`

          The type of the usage object. Always `tokens` for this variant.

          - `:tokens`

        - `input_token_details: InputTokenDetails{ audio_tokens, text_tokens}`

          Details about the input tokens billed for this request.

          - `audio_tokens: Integer`

            Number of audio tokens billed for this request.

          - `text_tokens: Integer`

            Number of text tokens billed for this request.

      - `class Duration`

        Usage statistics for models billed by audio input duration.

        - `seconds: Float`

          Duration of the input audio in seconds.

        - `type: :duration`

          The type of the usage object. Always `duration` for this variant.

          - `:duration`

  - `class TranscriptionDiarized`

    Represents a diarized transcription response returned by the model, including the combined transcript and speaker-segment annotations.

    - `duration: Float`

      Duration of the input audio in seconds.

    - `segments: Array[TranscriptionDiarizedSegment]`

      Segments of the transcript annotated with timestamps and speaker labels.

      - `id: String`

        Unique identifier for the segment.

      - `end_: Float`

        End timestamp of the segment in seconds.

      - `speaker: String`

        Speaker label for this segment. When known speakers are provided, the label matches `known_speaker_names[]`. Otherwise speakers are labeled sequentially using capital letters (`A`, `B`, ...).

      - `start: Float`

        Start timestamp of the segment in seconds.

      - `text: String`

        Transcript text for this segment.

      - `type: :"transcript.text.segment"`

        The type of the segment. Always `transcript.text.segment`.

        - `:"transcript.text.segment"`

    - `task: :transcribe`

      The type of task that was run. Always `transcribe`.

      - `:transcribe`

    - `text: String`

      The concatenated transcript text for the entire audio input.

    - `usage: Tokens{ input_tokens, output_tokens, total_tokens, 2 more} | Duration{ seconds, type}`

      Token or duration usage statistics for the request.

      - `class Tokens`

        Usage statistics for models billed by token usage.

        - `input_tokens: Integer`

          Number of input tokens billed for this request.

        - `output_tokens: Integer`

          Number of output tokens generated.

        - `total_tokens: Integer`

          Total number of tokens used (input + output).

        - `type: :tokens`

          The type of the usage object. Always `tokens` for this variant.

          - `:tokens`

        - `input_token_details: InputTokenDetails{ audio_tokens, text_tokens}`

          Details about the input tokens billed for this request.

          - `audio_tokens: Integer`

            Number of audio tokens billed for this request.

          - `text_tokens: Integer`

            Number of text tokens billed for this request.

      - `class Duration`

        Usage statistics for models billed by audio input duration.

        - `seconds: Float`

          Duration of the input audio in seconds.

        - `type: :duration`

          The type of the usage object. Always `duration` for this variant.

          - `:duration`

  - `class TranscriptionVerbose`

    Represents a verbose JSON transcription response returned by the model, based on the provided input.

    - `duration: Float`

      The duration of the input audio.

    - `language: String`

      The language of the input audio.

    - `text: String`

      The transcribed text.

    - `segments: Array[TranscriptionSegment]`

      Segments of the transcribed text and their corresponding details.

      - `id: Integer`

        Unique identifier of the segment.

      - `avg_logprob: Float`

        Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.

      - `compression_ratio: Float`

        Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.

      - `end_: Float`

        End time of the segment in seconds.

      - `no_speech_prob: Float`

        Probability of no speech in the segment. If the value is higher than 1.0 and the `avg_logprob` is below -1, consider this segment silent.

      - `seek: Integer`

        Seek offset of the segment.

      - `start: Float`

        Start time of the segment in seconds.

      - `temperature: Float`

        Temperature parameter used for generating the segment.

      - `text: String`

        Text content of the segment.

      - `tokens: Array[Integer]`

        Array of token IDs for the text content.

    - `usage: Usage{ seconds, type}`

      Usage statistics for models billed by audio input duration.

      - `seconds: Float`

        Duration of the input audio in seconds.

      - `type: :duration`

        The type of the usage object. Always `duration` for this variant.

        - `:duration`

    - `words: Array[TranscriptionWord]`

      Extracted words and their corresponding timestamps.

      - `end_: Float`

        End time of the word in seconds.

      - `start: Float`

        Start time of the word in seconds.

      - `word: String`

        The text content of the word.

### Example

```ruby
require "openai"

openai = OpenAI::Client.new(api_key: "My API Key")

transcription = openai.audio.transcriptions.create(file: StringIO.new("Example data"), model: :"gpt-4o-transcribe")

puts(transcription)
```

#### Response

```json
{
  "text": "text",
  "logprobs": [
    {
      "token": "token",
      "bytes": [
        0
      ],
      "logprob": 0
    }
  ],
  "usage": {
    "input_tokens": 0,
    "output_tokens": 0,
    "total_tokens": 0,
    "type": "tokens",
    "input_token_details": {
      "audio_tokens": 0,
      "text_tokens": 0
    }
  }
}
```
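
The `logprobs` entries in a response like the one above are natural-log probabilities, so `Math.exp` turns each one into a probability between 0 and 1. The sketch below parses a hand-written JSON body (the token values are made up for illustration) rather than a live API response.

```ruby
require "json"

# Hand-written body shaped like the response above; values are illustrative.
raw = <<~JSON
  {
    "text": "Hello.",
    "logprobs": [
      { "token": "Hello", "bytes": [72, 101, 108, 108, 111], "logprob": -0.05 },
      { "token": ".", "bytes": [46], "logprob": -0.7 }
    ]
  }
JSON

body = JSON.parse(raw, symbolize_names: true)

# Convert each log probability into a plain probability.
confidences = body[:logprobs].map do |entry|
  { token: entry[:token], probability: Math.exp(entry[:logprob]).round(3) }
end

least_certain = confidences.min_by { |c| c[:probability] }

puts confidences.inspect
puts "least certain token: #{least_certain[:token].inspect}"
```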

## Domain Types

### Transcription

- `class Transcription`

  Represents a transcription response returned by the model, based on the provided input.

  - `text: String`

    The transcribed text.

  - `logprobs: Array[Logprob{ token, bytes, logprob}]`

    The log probabilities of the tokens in the transcription. Only returned with the models `gpt-4o-transcribe` and `gpt-4o-mini-transcribe` if `logprobs` is added to the `include` array.

    - `token: String`

      The token in the transcription.

    - `bytes: Array[Float]`

      The bytes of the token.

    - `logprob: Float`

      The log probability of the token.

  - `usage: Tokens{ input_tokens, output_tokens, total_tokens, 2 more} | Duration{ seconds, type}`

    Token usage statistics for the request.

    - `class Tokens`

      Usage statistics for models billed by token usage.

      - `input_tokens: Integer`

        Number of input tokens billed for this request.

      - `output_tokens: Integer`

        Number of output tokens generated.

      - `total_tokens: Integer`

        Total number of tokens used (input + output).

      - `type: :tokens`

        The type of the usage object. Always `tokens` for this variant.

        - `:tokens`

      - `input_token_details: InputTokenDetails{ audio_tokens, text_tokens}`

        Details about the input tokens billed for this request.

        - `audio_tokens: Integer`

          Number of audio tokens billed for this request.

        - `text_tokens: Integer`

          Number of text tokens billed for this request.

    - `class Duration`

      Usage statistics for models billed by audio input duration.

      - `seconds: Float`

        Duration of the input audio in seconds.

      - `type: :duration`

        The type of the usage object. Always `duration` for this variant.

        - `:duration`

### Transcription Diarized

- `class TranscriptionDiarized`

  Represents a diarized transcription response returned by the model, including the combined transcript and speaker-segment annotations.

  - `duration: Float`

    Duration of the input audio in seconds.

  - `segments: Array[TranscriptionDiarizedSegment]`

    Segments of the transcript annotated with timestamps and speaker labels.

    - `id: String`

      Unique identifier for the segment.

    - `end_: Float`

      End timestamp of the segment in seconds.

    - `speaker: String`

      Speaker label for this segment. When known speakers are provided, the label matches `known_speaker_names[]`. Otherwise speakers are labeled sequentially using capital letters (`A`, `B`, ...).

    - `start: Float`

      Start timestamp of the segment in seconds.

    - `text: String`

      Transcript text for this segment.

    - `type: :"transcript.text.segment"`

      The type of the segment. Always `transcript.text.segment`.

      - `:"transcript.text.segment"`

  - `task: :transcribe`

    The type of task that was run. Always `transcribe`.

    - `:transcribe`

  - `text: String`

    The concatenated transcript text for the entire audio input.

  - `usage: Tokens{ input_tokens, output_tokens, total_tokens, 2 more} | Duration{ seconds, type}`

    Token or duration usage statistics for the request.

    - `class Tokens`

      Usage statistics for models billed by token usage.

      - `input_tokens: Integer`

        Number of input tokens billed for this request.

      - `output_tokens: Integer`

        Number of output tokens generated.

      - `total_tokens: Integer`

        Total number of tokens used (input + output).

      - `type: :tokens`

        The type of the usage object. Always `tokens` for this variant.

        - `:tokens`

      - `input_token_details: InputTokenDetails{ audio_tokens, text_tokens}`

        Details about the input tokens billed for this request.

        - `audio_tokens: Integer`

          Number of audio tokens billed for this request.

        - `text_tokens: Integer`

          Number of text tokens billed for this request.

    - `class Duration`

      Usage statistics for models billed by audio input duration.

      - `seconds: Float`

        Duration of the input audio in seconds.

      - `type: :duration`

        The type of the usage object. Always `duration` for this variant.

        - `:duration`

### Transcription Diarized Segment

- `class TranscriptionDiarizedSegment`

  A segment of diarized transcript text with speaker metadata.

  - `id: String`

    Unique identifier for the segment.

  - `end_: Float`

    End timestamp of the segment in seconds.

  - `speaker: String`

    Speaker label for this segment. When known speakers are provided, the label matches `known_speaker_names[]`. Otherwise speakers are labeled sequentially using capital letters (`A`, `B`, ...).

  - `start: Float`

    Start timestamp of the segment in seconds.

  - `text: String`

    Transcript text for this segment.

  - `type: :"transcript.text.segment"`

    The type of the segment. Always `transcript.text.segment`.

    - `:"transcript.text.segment"`
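
Given the `speaker`, `start`, and `end_` fields described above, per-speaker talk time and a readable transcript can be derived in a few lines. This is a sketch only: `DiarizedSegment` is a stand-in struct, not the SDK's `TranscriptionDiarizedSegment`, and the segment IDs and text are invented.

```ruby
# Stand-in for the SDK's diarized segment objects.
DiarizedSegment = Struct.new(:id, :speaker, :start, :end_, :text, keyword_init: true)

# Made-up sample data for illustration only.
segments = [
  DiarizedSegment.new(id: "seg_0", speaker: "A", start: 0.0, end_: 2.5, text: "Hi, how can I help?"),
  DiarizedSegment.new(id: "seg_1", speaker: "B", start: 2.5, end_: 6.0, text: "My order has not arrived."),
  DiarizedSegment.new(id: "seg_2", speaker: "A", start: 6.0, end_: 7.0, text: "Let me check.")
]

# Total speaking time in seconds, keyed by speaker label.
talk_time = segments.each_with_object(Hash.new(0.0)) do |seg, totals|
  totals[seg.speaker] += seg.end_ - seg.start
end

# A simple "speaker: text" rendering of the conversation.
lines = segments.map { |seg| "#{seg.speaker}: #{seg.text}" }

puts talk_time.inspect
puts lines
```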

### Transcription Include

- `TranscriptionInclude = :logprobs`

  - `:logprobs`

### Transcription Segment

- `class TranscriptionSegment`

  - `id: Integer`

    Unique identifier of the segment.

  - `avg_logprob: Float`

    Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.

  - `compression_ratio: Float`

    Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.

  - `end_: Float`

    End time of the segment in seconds.

  - `no_speech_prob: Float`

    Probability of no speech in the segment. If the value is higher than 1.0 and the `avg_logprob` is below -1, consider this segment silent.

  - `seek: Integer`

    Seek offset of the segment.

  - `start: Float`

    Start time of the segment in seconds.

  - `temperature: Float`

    Temperature parameter used for generating the segment.

  - `text: String`

    Text content of the segment.

  - `tokens: Array[Integer]`

    Array of token IDs for the text content.
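
The `start`, `end_`, and `text` fields of a segment are enough to build subtitle cues on the client. The sketch below approximates the SubRip (SRT) layout; it uses a stand-in `TimedSegment` struct and hypothetical helper names, and the server's own `srt` output may differ in detail.

```ruby
# Stand-in for the SDK's segment objects; only timing and text are modeled.
TimedSegment = Struct.new(:start, :end_, :text, keyword_init: true)

# Formats a number of seconds as an SRT timestamp (HH:MM:SS,mmm).
def srt_timestamp(seconds)
  total_ms = (seconds * 1000).round
  ms = total_ms % 1000
  total_s = total_ms / 1000
  format("%02d:%02d:%02d,%03d", total_s / 3600, (total_s % 3600) / 60, total_s % 60, ms)
end

# Renders segments as numbered cues separated by blank lines.
def to_srt(segments)
  segments.each_with_index.map do |seg, index|
    "#{index + 1}\n#{srt_timestamp(seg.start)} --> #{srt_timestamp(seg.end_)}\n#{seg.text.strip}\n"
  end.join("\n")
end

# Made-up sample data for illustration only.
segments = [
  TimedSegment.new(start: 0.0, end_: 2.5, text: " Hello there."),
  TimedSegment.new(start: 2.5, end_: 3661.5, text: " A very long pause.")
]

srt = to_srt(segments)
puts srt
```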

747

748### Transcription Stream Event

749

750- `TranscriptionStreamEvent = TranscriptionTextSegmentEvent | TranscriptionTextDeltaEvent | TranscriptionTextDoneEvent`

751

752 Emitted when a diarized transcription returns a completed segment with speaker information. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with `stream` set to `true` and `response_format` set to `diarized_json`.

753

754 - `class TranscriptionTextSegmentEvent`

755

756 Emitted when a diarized transcription returns a completed segment with speaker information. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with `stream` set to `true` and `response_format` set to `diarized_json`.

757

758 - `id: String`

759

760 Unique identifier for the segment.

761

762 - `end_: Float`

763

764 End timestamp of the segment in seconds.

765

766 - `speaker: String`

767

768 Speaker label for this segment.

769

770 - `start: Float`

771

772 Start timestamp of the segment in seconds.

773

774 - `text: String`

775

776 Transcript text for this segment.

777

778 - `type: :"transcript.text.segment"`

779

780 The type of the event. Always `transcript.text.segment`.

781

782 - `:"transcript.text.segment"`

783

784 - `class TranscriptionTextDeltaEvent`

785

    Emitted when there is an additional text delta. This is also the first event emitted when the transcription starts. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `stream` parameter set to `true`.

787

788 - `delta: String`

789

790 The text delta that was additionally transcribed.

791

792 - `type: :"transcript.text.delta"`

793

794 The type of the event. Always `transcript.text.delta`.

795

796 - `:"transcript.text.delta"`

797

798 - `logprobs: Array[Logprob{ token, bytes, logprob}]`

799

800 The log probabilities of the delta. Only included if you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `include[]` parameter set to `logprobs`.

801

802 - `token: String`

803

804 The token that was used to generate the log probability.

805

806 - `bytes: Array[Integer]`

807

808 The bytes that were used to generate the log probability.

809

810 - `logprob: Float`

811

812 The log probability of the token.

813

814 - `segment_id: String`

815

816 Identifier of the diarized segment that this delta belongs to. Only present when using `gpt-4o-transcribe-diarize`.

817

818 - `class TranscriptionTextDoneEvent`

819

    Emitted when the transcription is complete. Contains the complete transcription text. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `stream` parameter set to `true`.

821

822 - `text: String`

823

824 The text that was transcribed.

825

826 - `type: :"transcript.text.done"`

827

828 The type of the event. Always `transcript.text.done`.

829

830 - `:"transcript.text.done"`

831

832 - `logprobs: Array[Logprob{ token, bytes, logprob}]`

833

834 The log probabilities of the individual tokens in the transcription. Only included if you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `include[]` parameter set to `logprobs`.

835

836 - `token: String`

837

838 The token that was used to generate the log probability.

839

840 - `bytes: Array[Integer]`

841

842 The bytes that were used to generate the log probability.

843

844 - `logprob: Float`

845

846 The log probability of the token.

847

848 - `usage: Usage{ input_tokens, output_tokens, total_tokens, 2 more}`

849

850 Usage statistics for models billed by token usage.

851

852 - `input_tokens: Integer`

853

854 Number of input tokens billed for this request.

855

856 - `output_tokens: Integer`

857

858 Number of output tokens generated.

859

860 - `total_tokens: Integer`

861

862 Total number of tokens used (input + output).

863

864 - `type: :tokens`

865

866 The type of the usage object. Always `tokens` for this variant.

867

868 - `:tokens`

869

870 - `input_token_details: InputTokenDetails{ audio_tokens, text_tokens}`

871

872 Details about the input tokens billed for this request.

873

874 - `audio_tokens: Integer`

875

876 Number of audio tokens billed for this request.

877

878 - `text_tokens: Integer`

879

880 Number of text tokens billed for this request.
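
Because the three event variants are distinguished by their `type` field, a consumer can dispatch on it. The sketch below is illustrative only: the `Struct` definitions are hypothetical stand-ins for the SDK event classes, and `collect_transcript` is an assumed helper name. It accumulates deltas, records diarized segments, and prefers the final text from the done event when one arrives.

```ruby
# Hypothetical stand-ins for the three stream event classes.
DeltaEvent = Struct.new(:type, :delta, keyword_init: true)
SegmentEvent = Struct.new(:type, :speaker, :text, keyword_init: true)
DoneEvent = Struct.new(:type, :text, keyword_init: true)

# Folds a sequence of stream events into the transcript text and speaker segments.
def collect_transcript(events)
  partial = +""
  segments = []
  final = nil

  events.each do |event|
    case event.type.to_sym
    when :"transcript.text.delta" then partial << event.delta
    when :"transcript.text.segment" then segments << [event.speaker, event.text]
    when :"transcript.text.done" then final = event.text
    end
  end

  # Fall back to the concatenated deltas if no done event was seen.
  { text: final || partial, segments: segments }
end
```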

### Transcription Text Delta Event

884- `class TranscriptionTextDeltaEvent`

885

  Emitted when there is an additional text delta. This is also the first event emitted when the transcription starts. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `stream` parameter set to `true`.

887

888 - `delta: String`

889

890 The text delta that was additionally transcribed.

891

892 - `type: :"transcript.text.delta"`

893

894 The type of the event. Always `transcript.text.delta`.

895

896 - `:"transcript.text.delta"`

897

898 - `logprobs: Array[Logprob{ token, bytes, logprob}]`

899

900 The log probabilities of the delta. Only included if you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `include[]` parameter set to `logprobs`.

901

902 - `token: String`

903

904 The token that was used to generate the log probability.

905

906 - `bytes: Array[Integer]`

907

908 The bytes that were used to generate the log probability.

909

910 - `logprob: Float`

911

912 The log probability of the token.

913

914 - `segment_id: String`

915

916 Identifier of the diarized segment that this delta belongs to. Only present when using `gpt-4o-transcribe-diarize`.
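
Each `logprobs` entry carries a `token`, its raw `bytes`, and a `logprob`. A minimal sketch of two common uses follows, assuming a hypothetical `Logprob` struct in place of the SDK object: rebuilding text from the byte arrays (useful when a multi-byte character is split across tokens) and summarising confidence as the geometric mean of the token probabilities. The helper names are illustrative.

```ruby
# Hypothetical stand-in for one logprobs entry.
Logprob = Struct.new(:token, :bytes, :logprob, keyword_init: true)

# Joins the raw bytes of all tokens and interprets them as UTF-8 text.
def bytes_to_text(logprobs)
  logprobs.flat_map(&:bytes).pack("C*").force_encoding(Encoding::UTF_8)
end

# Geometric mean of token probabilities: exp of the average log probability.
def mean_probability(logprobs)
  return nil if logprobs.empty?

  Math.exp(logprobs.sum(&:logprob) / logprobs.size)
end
```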

### Transcription Text Done Event

920- `class TranscriptionTextDoneEvent`

921

  Emitted when the transcription is complete. Contains the complete transcription text. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `stream` parameter set to `true`.

923

924 - `text: String`

925

926 The text that was transcribed.

927

928 - `type: :"transcript.text.done"`

929

930 The type of the event. Always `transcript.text.done`.

931

932 - `:"transcript.text.done"`

933

934 - `logprobs: Array[Logprob{ token, bytes, logprob}]`

935

936 The log probabilities of the individual tokens in the transcription. Only included if you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with the `include[]` parameter set to `logprobs`.

937

938 - `token: String`

939

940 The token that was used to generate the log probability.

941

942 - `bytes: Array[Integer]`

943

944 The bytes that were used to generate the log probability.

945

946 - `logprob: Float`

947

948 The log probability of the token.

949

950 - `usage: Usage{ input_tokens, output_tokens, total_tokens, 2 more}`

951

952 Usage statistics for models billed by token usage.

953

954 - `input_tokens: Integer`

955

956 Number of input tokens billed for this request.

957

958 - `output_tokens: Integer`

959

960 Number of output tokens generated.

961

962 - `total_tokens: Integer`

963

964 Total number of tokens used (input + output).

965

966 - `type: :tokens`

967

968 The type of the usage object. Always `tokens` for this variant.

969

970 - `:tokens`

971

972 - `input_token_details: InputTokenDetails{ audio_tokens, text_tokens}`

973

974 Details about the input tokens billed for this request.

975

976 - `audio_tokens: Integer`

977

978 Number of audio tokens billed for this request.

979

980 - `text_tokens: Integer`

981

982 Number of text tokens billed for this request.

### Transcription Text Segment Event

986- `class TranscriptionTextSegmentEvent`

987

988 Emitted when a diarized transcription returns a completed segment with speaker information. Only emitted when you [create a transcription](https://platform.openai.com/docs/api-reference/audio/create-transcription) with `stream` set to `true` and `response_format` set to `diarized_json`.

989

990 - `id: String`

991

992 Unique identifier for the segment.

993

994 - `end_: Float`

995

996 End timestamp of the segment in seconds.

997

998 - `speaker: String`

999

1000 Speaker label for this segment.

1001

1002 - `start: Float`

1003

1004 Start timestamp of the segment in seconds.

1005

1006 - `text: String`

1007

1008 Transcript text for this segment.

1009

1010 - `type: :"transcript.text.segment"`

1011

1012 The type of the event. Always `transcript.text.segment`.

1013

1014 - `:"transcript.text.segment"`
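
Segment events arrive one per completed segment, so the same speaker often appears in several consecutive events. The sketch below shows one way to merge them into speaker turns and to total the talk time per speaker. `DiarizedSegment`, `speaker_turns`, and `talk_time` are hypothetical names used for illustration, not SDK classes or methods.

```ruby
# Hypothetical stand-in carrying the fields of a diarized segment.
DiarizedSegment = Struct.new(:speaker, :start, :end_, :text, keyword_init: true)

# Merges consecutive segments from the same speaker into a single turn.
def speaker_turns(segments)
  segments.chunk_while { |a, b| a.speaker == b.speaker }.map do |run|
    {
      speaker: run.first.speaker,
      start: run.first.start,
      end_: run.last.end_,
      text: run.map(&:text).join(" ")
    }
  end
end

# Sums the duration in seconds of all segments for each speaker label.
def talk_time(segments)
  segments.each_with_object(Hash.new(0.0)) do |segment, totals|
    totals[segment.speaker] += segment.end_ - segment.start
  end
end
```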

### Transcription Verbose

1018- `class TranscriptionVerbose`

1019

  Represents a `verbose_json` transcription response returned by the model, based on the provided input.

1021

1022 - `duration: Float`

1023

1024 The duration of the input audio.

1025

1026 - `language: String`

1027

1028 The language of the input audio.

1029

1030 - `text: String`

1031

1032 The transcribed text.

1033

1034 - `segments: Array[TranscriptionSegment]`

1035

1036 Segments of the transcribed text and their corresponding details.

1037

1038 - `id: Integer`

1039

1040 Unique identifier of the segment.

1041

1042 - `avg_logprob: Float`

1043

1044 Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.

1045

1046 - `compression_ratio: Float`

1047

1048 Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.

1049

1050 - `end_: Float`

1051

1052 End time of the segment in seconds.

1053

1054 - `no_speech_prob: Float`

1055

1056 Probability of no speech in the segment. If the value is higher than 1.0 and the `avg_logprob` is below -1, consider this segment silent.

1057

1058 - `seek: Integer`

1059

1060 Seek offset of the segment.

1061

1062 - `start: Float`

1063

1064 Start time of the segment in seconds.

1065

1066 - `temperature: Float`

1067

1068 Temperature parameter used for generating the segment.

1069

1070 - `text: String`

1071

1072 Text content of the segment.

1073

1074 - `tokens: Array[Integer]`

1075

1076 Array of token IDs for the text content.

1077

1078 - `usage: Usage{ seconds, type}`

1079

1080 Usage statistics for models billed by audio input duration.

1081

1082 - `seconds: Float`

1083

1084 Duration of the input audio in seconds.

1085

1086 - `type: :duration`

1087

1088 The type of the usage object. Always `duration` for this variant.

1089

1090 - `:duration`

1091

1092 - `words: Array[TranscriptionWord]`

1093

1094 Extracted words and their corresponding timestamps.

1095

1096 - `end_: Float`

1097

1098 End time of the word in seconds.

1099

1100 - `start: Float`

1101

1102 Start time of the word in seconds.

1103

1104 - `word: String`

1105

1106 The text content of the word.
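
The word-level timestamps in `words` are enough to build simple captions. As a rough sketch, the code below groups words into fixed-size cues and renders them in SubRip (SRT) style. `Word` is a hypothetical stand-in for `TranscriptionWord`, and `srt_timestamp`, `words_to_srt`, and the `per_cue` keyword are illustrative choices rather than anything the API prescribes.

```ruby
# Hypothetical stand-in for TranscriptionWord.
Word = Struct.new(:word, :start, :end_, keyword_init: true)

# Formats a time in seconds as an SRT timestamp (HH:MM:SS,mmm).
def srt_timestamp(seconds)
  total_ms = (seconds * 1000).round
  hours, rest = total_ms.divmod(3_600_000)
  minutes, rest = rest.divmod(60_000)
  secs, millis = rest.divmod(1000)
  format("%02d:%02d:%02d,%03d", hours, minutes, secs, millis)
end

# Groups words into cues of at most `per_cue` words and renders SRT text.
def words_to_srt(words, per_cue: 7)
  cues = words.each_slice(per_cue).with_index(1).map do |chunk, index|
    range = "#{srt_timestamp(chunk.first.start)} --> #{srt_timestamp(chunk.last.end_)}"
    "#{index}\n#{range}\n#{chunk.map(&:word).join(' ')}\n"
  end
  cues.join("\n")
end
```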

### Transcription Word

1110- `class TranscriptionWord`

1111

1112 - `end_: Float`

1113

1114 End time of the word in seconds.

1115

1116 - `start: Float`

1117

1118 Start time of the word in seconds.

1119

1120 - `word: String`

1121

1122 The text content of the word.

### Transcription Create Response

1126- `TranscriptionCreateResponse = Transcription | TranscriptionDiarized | TranscriptionVerbose`

1127

  Represents a transcription response returned by the model, based on the provided input.

  - `class Transcription`

    Represents a transcription response returned by the model, based on the provided input.

1133

1134 - `text: String`

1135

1136 The transcribed text.

1137

1138 - `logprobs: Array[Logprob{ token, bytes, logprob}]`

1139

1140 The log probabilities of the tokens in the transcription. Only returned with the models `gpt-4o-transcribe` and `gpt-4o-mini-transcribe` if `logprobs` is added to the `include` array.

1141

1142 - `token: String`

1143

1144 The token in the transcription.

1145

1146 - `bytes: Array[Float]`

1147

1148 The bytes of the token.

1149

1150 - `logprob: Float`

1151

1152 The log probability of the token.

1153

1154 - `usage: Tokens{ input_tokens, output_tokens, total_tokens, 2 more} | Duration{ seconds, type}`

1155

1156 Token usage statistics for the request.

1157

1158 - `class Tokens`

1159

1160 Usage statistics for models billed by token usage.

1161

1162 - `input_tokens: Integer`

1163

1164 Number of input tokens billed for this request.

1165

1166 - `output_tokens: Integer`

1167

1168 Number of output tokens generated.

1169

1170 - `total_tokens: Integer`

1171

1172 Total number of tokens used (input + output).

1173

1174 - `type: :tokens`

1175

1176 The type of the usage object. Always `tokens` for this variant.

1177

1178 - `:tokens`

1179

1180 - `input_token_details: InputTokenDetails{ audio_tokens, text_tokens}`

1181

1182 Details about the input tokens billed for this request.

1183

1184 - `audio_tokens: Integer`

1185

1186 Number of audio tokens billed for this request.

1187

1188 - `text_tokens: Integer`

1189

1190 Number of text tokens billed for this request.

1191

1192 - `class Duration`

1193

1194 Usage statistics for models billed by audio input duration.

1195

1196 - `seconds: Float`

1197

1198 Duration of the input audio in seconds.

1199

1200 - `type: :duration`

1201

1202 The type of the usage object. Always `duration` for this variant.

1203

1204 - `:duration`

1205

1206 - `class TranscriptionDiarized`

1207

1208 Represents a diarized transcription response returned by the model, including the combined transcript and speaker-segment annotations.

1209

1210 - `duration: Float`

1211

1212 Duration of the input audio in seconds.

1213

1214 - `segments: Array[TranscriptionDiarizedSegment]`

1215

1216 Segments of the transcript annotated with timestamps and speaker labels.

1217

1218 - `id: String`

1219

1220 Unique identifier for the segment.

1221

1222 - `end_: Float`

1223

1224 End timestamp of the segment in seconds.

1225

1226 - `speaker: String`

1227

1228 Speaker label for this segment. When known speakers are provided, the label matches `known_speaker_names[]`. Otherwise speakers are labeled sequentially using capital letters (`A`, `B`, ...).

1229

1230 - `start: Float`

1231

1232 Start timestamp of the segment in seconds.

1233

1234 - `text: String`

1235

1236 Transcript text for this segment.

1237

1238 - `type: :"transcript.text.segment"`

1239

1240 The type of the segment. Always `transcript.text.segment`.

1241

1242 - `:"transcript.text.segment"`

1243

1244 - `task: :transcribe`

1245

1246 The type of task that was run. Always `transcribe`.

1247

1248 - `:transcribe`

1249

1250 - `text: String`

1251

1252 The concatenated transcript text for the entire audio input.

1253

1254 - `usage: Tokens{ input_tokens, output_tokens, total_tokens, 2 more} | Duration{ seconds, type}`

1255

1256 Token or duration usage statistics for the request.

1257

1258 - `class Tokens`

1259

1260 Usage statistics for models billed by token usage.

1261

1262 - `input_tokens: Integer`

1263

1264 Number of input tokens billed for this request.

1265

1266 - `output_tokens: Integer`

1267

1268 Number of output tokens generated.

1269

1270 - `total_tokens: Integer`

1271

1272 Total number of tokens used (input + output).

1273

1274 - `type: :tokens`

1275

1276 The type of the usage object. Always `tokens` for this variant.

1277

1278 - `:tokens`

1279

1280 - `input_token_details: InputTokenDetails{ audio_tokens, text_tokens}`

1281

1282 Details about the input tokens billed for this request.

1283

1284 - `audio_tokens: Integer`

1285

1286 Number of audio tokens billed for this request.

1287

1288 - `text_tokens: Integer`

1289

1290 Number of text tokens billed for this request.

1291

1292 - `class Duration`

1293

1294 Usage statistics for models billed by audio input duration.

1295

1296 - `seconds: Float`

1297

1298 Duration of the input audio in seconds.

1299

1300 - `type: :duration`

1301

1302 The type of the usage object. Always `duration` for this variant.

1303

1304 - `:duration`

1305

1306 - `class TranscriptionVerbose`

1307

    Represents a `verbose_json` transcription response returned by the model, based on the provided input.

1309

1310 - `duration: Float`

1311

1312 The duration of the input audio.

1313

1314 - `language: String`

1315

1316 The language of the input audio.

1317

1318 - `text: String`

1319

1320 The transcribed text.

1321

1322 - `segments: Array[TranscriptionSegment]`

1323

1324 Segments of the transcribed text and their corresponding details.

1325

1326 - `id: Integer`

1327

1328 Unique identifier of the segment.

1329

1330 - `avg_logprob: Float`

1331

1332 Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.

1333

1334 - `compression_ratio: Float`

1335

1336 Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.

1337

1338 - `end_: Float`

1339

1340 End time of the segment in seconds.

1341

1342 - `no_speech_prob: Float`

1343

1344 Probability of no speech in the segment. If the value is higher than 1.0 and the `avg_logprob` is below -1, consider this segment silent.

1345

1346 - `seek: Integer`

1347

1348 Seek offset of the segment.

1349

1350 - `start: Float`

1351

1352 Start time of the segment in seconds.

1353

1354 - `temperature: Float`

1355

1356 Temperature parameter used for generating the segment.

1357

1358 - `text: String`

1359

1360 Text content of the segment.

1361

1362 - `tokens: Array[Integer]`

1363

1364 Array of token IDs for the text content.

1365

1366 - `usage: Usage{ seconds, type}`

1367

1368 Usage statistics for models billed by audio input duration.

1369

1370 - `seconds: Float`

1371

1372 Duration of the input audio in seconds.

1373

1374 - `type: :duration`

1375

1376 The type of the usage object. Always `duration` for this variant.

1377

1378 - `:duration`

1379

1380 - `words: Array[TranscriptionWord]`

1381

1382 Extracted words and their corresponding timestamps.

1383

1384 - `end_: Float`

1385

1386 End time of the word in seconds.

1387

1388 - `start: Float`

1389

1390 Start time of the word in seconds.

1391

1392 - `word: String`

1393

1394 The text content of the word.
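
The `usage` field is a union of a token-billed variant and a duration-billed variant, distinguished by `type`. A minimal sketch of handling both follows. The structs are hypothetical stand-ins for the SDK's usage objects, and `describe_usage` is an illustrative helper name.

```ruby
# Hypothetical stand-ins for the two usage variants.
TokenUsage = Struct.new(:type, :input_tokens, :output_tokens, :total_tokens,
                        keyword_init: true)
DurationUsage = Struct.new(:type, :seconds, keyword_init: true)

# Produces a one-line summary for either usage variant.
def describe_usage(usage)
  case usage.type.to_sym
  when :tokens
    "#{usage.total_tokens} tokens (#{usage.input_tokens} in, #{usage.output_tokens} out)"
  when :duration
    format("%.1f seconds of audio", usage.seconds)
  else
    "unknown usage type: #{usage.type}"
  end
end
```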

# Translations

## Create translation

`audio.translations.create(**kwargs) -> TranslationCreateResponse`

**post** `/audio/translations`

Translates audio into English.

### Parameters

- `file: String`

  The audio file object (not file name) to translate, in one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm.

1411

1412- `model: String | AudioModel`

1413

1414 ID of the model to use. Only `whisper-1` (which is powered by our open source Whisper V2 model) is currently available.

1415

1416 - `String = String`

1417

1418 - `AudioModel = :"whisper-1" | :"gpt-4o-transcribe" | :"gpt-4o-mini-transcribe" | 2 more`

1419

1420 - `:"whisper-1"`

1421

1422 - `:"gpt-4o-transcribe"`

1423

1424 - `:"gpt-4o-mini-transcribe"`

1425

1426 - `:"gpt-4o-mini-transcribe-2025-12-15"`

1427

1428 - `:"gpt-4o-transcribe-diarize"`

1429

1430- `prompt: String`

1431

1432 An optional text to guide the model's style or continue a previous audio segment. The [prompt](https://platform.openai.com/docs/guides/speech-to-text#prompting) should be in English.

1433

1434- `response_format: :json | :text | :srt | 2 more`

1435

1436 The format of the output, in one of these options: `json`, `text`, `srt`, `verbose_json`, or `vtt`.

1437

1438 - `:json`

1439

1440 - `:text`

1441

1442 - `:srt`

1443

1444 - `:verbose_json`

1445

1446 - `:vtt`

1447

1448- `temperature: Float`

1449

1450 The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use [log probability](https://en.wikipedia.org/wiki/Log_probability) to automatically increase the temperature until certain thresholds are hit.

### Returns

1454- `TranslationCreateResponse = Translation | TranslationVerbose`

1455

1456 - `class Translation`

1457

1458 - `text: String`

1459

1460 - `class TranslationVerbose`

1461

1462 - `duration: Float`

1463

1464 The duration of the input audio.

1465

1466 - `language: String`

1467

1468 The language of the output translation (always `english`).

1469

1470 - `text: String`

1471

1472 The translated text.

1473

1474 - `segments: Array[TranscriptionSegment]`

1475

1476 Segments of the translated text and their corresponding details.

1477

1478 - `id: Integer`

1479

1480 Unique identifier of the segment.

1481

1482 - `avg_logprob: Float`

1483

1484 Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.

1485

1486 - `compression_ratio: Float`

1487

1488 Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.

1489

1490 - `end_: Float`

1491

1492 End time of the segment in seconds.

1493

1494 - `no_speech_prob: Float`

1495

1496 Probability of no speech in the segment. If the value is higher than 1.0 and the `avg_logprob` is below -1, consider this segment silent.

1497

1498 - `seek: Integer`

1499

1500 Seek offset of the segment.

1501

1502 - `start: Float`

1503

1504 Start time of the segment in seconds.

1505

1506 - `temperature: Float`

1507

1508 Temperature parameter used for generating the segment.

1509

1510 - `text: String`

1511

1512 Text content of the segment.

1513

1514 - `tokens: Array[Integer]`

1515

1516 Array of token IDs for the text content.

### Example

```ruby
require "openai"

openai = OpenAI::Client.new(api_key: "My API Key")

translation = openai.audio.translations.create(file: StringIO.new("Example data"), model: :"whisper-1")

puts(translation)
```

#### Response

```json
{
  "text": "text"
}
```
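
The parameter constraints listed for this endpoint (five response formats, a temperature between 0 and 1) can be checked on the client before a request is sent. The helper below is a sketch under that assumption. `translation_params` and `TRANSLATION_FORMATS` are hypothetical names, not part of the SDK, and the server remains the authority on what is accepted.

```ruby
# Response formats listed for the translations endpoint.
TRANSLATION_FORMATS = %i[json text srt verbose_json vtt].freeze

# Builds a keyword hash for audio.translations.create after basic validation.
def translation_params(file:, model: :"whisper-1", response_format: :json,
                       temperature: nil, prompt: nil)
  format = response_format.to_sym
  unless TRANSLATION_FORMATS.include?(format)
    raise ArgumentError, "unsupported response_format: #{response_format.inspect}"
  end
  if temperature && !(0.0..1.0).cover?(temperature)
    raise ArgumentError, "temperature must be between 0 and 1"
  end

  params = { file: file, model: model, response_format: format }
  params[:temperature] = temperature if temperature
  params[:prompt] = prompt if prompt
  params
end
```

Since the method signature is `audio.translations.create(**kwargs)`, the resulting hash could be splatted into the call, with `file` being an IO-like object such as the `StringIO` used in the example above.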

## Domain Types

### Translation

- `class Translation`

  - `text: String`

### Translation Verbose

1548- `class TranslationVerbose`

1549

1550 - `duration: Float`

1551

1552 The duration of the input audio.

1553

1554 - `language: String`

1555

1556 The language of the output translation (always `english`).

1557

1558 - `text: String`

1559

1560 The translated text.

1561

1562 - `segments: Array[TranscriptionSegment]`

1563

1564 Segments of the translated text and their corresponding details.

1565

1566 - `id: Integer`

1567

1568 Unique identifier of the segment.

1569

1570 - `avg_logprob: Float`

1571

1572 Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.

1573

1574 - `compression_ratio: Float`

1575

1576 Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.

1577

1578 - `end_: Float`

1579

1580 End time of the segment in seconds.

1581

1582 - `no_speech_prob: Float`

1583

1584 Probability of no speech in the segment. If the value is higher than 1.0 and the `avg_logprob` is below -1, consider this segment silent.

1585

1586 - `seek: Integer`

1587

1588 Seek offset of the segment.

1589

1590 - `start: Float`

1591

1592 Start time of the segment in seconds.

1593

1594 - `temperature: Float`

1595

1596 Temperature parameter used for generating the segment.

1597

1598 - `text: String`

1599

1600 Text content of the segment.

1601

1602 - `tokens: Array[Integer]`

1603

1604 Array of token IDs for the text content.

### Translation Create Response

1608- `TranslationCreateResponse = Translation | TranslationVerbose`

1609

1610 - `class Translation`

1611

1612 - `text: String`

1613

1614 - `class TranslationVerbose`

1615

1616 - `duration: Float`

1617

1618 The duration of the input audio.

1619

1620 - `language: String`

1621

1622 The language of the output translation (always `english`).

1623

1624 - `text: String`

1625

1626 The translated text.

1627

1628 - `segments: Array[TranscriptionSegment]`

1629

1630 Segments of the translated text and their corresponding details.

1631

1632 - `id: Integer`

1633

1634 Unique identifier of the segment.

1635

1636 - `avg_logprob: Float`

1637

1638 Average logprob of the segment. If the value is lower than -1, consider the logprobs failed.

1639

1640 - `compression_ratio: Float`

1641

1642 Compression ratio of the segment. If the value is greater than 2.4, consider the compression failed.

1643

1644 - `end_: Float`

1645

1646 End time of the segment in seconds.

1647

1648 - `no_speech_prob: Float`

1649

1650 Probability of no speech in the segment. If the value is higher than 1.0 and the `avg_logprob` is below -1, consider this segment silent.

1651

1652 - `seek: Integer`

1653

1654 Seek offset of the segment.

1655

1656 - `start: Float`

1657

1658 Start time of the segment in seconds.

1659

1660 - `temperature: Float`

1661

1662 Temperature parameter used for generating the segment.

1663

1664 - `text: String`

1665

1666 Text content of the segment.

1667

1668 - `tokens: Array[Integer]`

1669

1670 Array of token IDs for the text content.

# Speech

## Create speech

`audio.speech.create(**kwargs) -> StringIO`

**post** `/audio/speech`

Generates audio from the input text.

Returns the audio file content, or a stream of audio events.

### Parameters

- `input: String`

  The text to generate audio for. The maximum length is 4096 characters.

1689

1690- `model: String | SpeechModel`

1691

1692 One of the available [TTS models](https://platform.openai.com/docs/models#tts): `tts-1`, `tts-1-hd`, `gpt-4o-mini-tts`, or `gpt-4o-mini-tts-2025-12-15`.

1693

1694 - `String = String`

1695

1696 - `SpeechModel = :"tts-1" | :"tts-1-hd" | :"gpt-4o-mini-tts" | :"gpt-4o-mini-tts-2025-12-15"`

1697

1698 - `:"tts-1"`

1699

1700 - `:"tts-1-hd"`

1701

1702 - `:"gpt-4o-mini-tts"`

1703

1704 - `:"gpt-4o-mini-tts-2025-12-15"`

1705

1706- `voice: String | :alloy | :ash | :ballad | 7 more | ID{ id}`

1707

  The voice to use when generating the audio. Supported built-in voices are `alloy`, `ash`, `ballad`, `coral`, `echo`, `fable`, `onyx`, `nova`, `sage`, `shimmer`, `verse`, `marin`, and `cedar`. You may also provide a custom voice object with an `id`, for example `{ "id": "voice_1234" }`. Previews of the voices are available in the [Text to speech guide](https://platform.openai.com/docs/guides/text-to-speech#voice-options).

1709

1710 - `String = String`

1711

1712 - `Voice = :alloy | :ash | :ballad | 7 more`

1713

1714 - `:alloy`

1715

1716 - `:ash`

1717

1718 - `:ballad`

1719

1720 - `:coral`

1721

1722 - `:echo`

1723

1724 - `:sage`

1725

1726 - `:shimmer`

1727

1728 - `:verse`

1729

1730 - `:marin`

1731

1732 - `:cedar`

1733

1734 - `class ID`

1735

1736 Custom voice reference.

1737

1738 - `id: String`

1739

1740 The custom voice ID, e.g. `voice_1234`.

1741

1742- `instructions: String`

1743

1744 Control the voice of your generated audio with additional instructions. Does not work with `tts-1` or `tts-1-hd`.

1745

1746- `response_format: :mp3 | :opus | :aac | 3 more`

1747

  The format of the generated audio. Supported formats are `mp3`, `opus`, `aac`, `flac`, `wav`, and `pcm`.

1749

1750 - `:mp3`

1751

1752 - `:opus`

1753

1754 - `:aac`

1755

1756 - `:flac`

1757

1758 - `:wav`

1759

1760 - `:pcm`

1761

1762- `speed: Float`

1763

1764 The speed of the generated audio. Select a value from `0.25` to `4.0`. `1.0` is the default.

1765

1766- `stream_format: :sse | :audio`

1767

1768 The format to stream the audio in. Supported formats are `sse` and `audio`. `sse` is not supported for `tts-1` or `tts-1-hd`.

1769

1770 - `:sse`

1771

1772 - `:audio`

### Returns

- `StringIO`

### Example

```ruby
require "openai"

openai = OpenAI::Client.new(api_key: "My API Key")

speech = openai.audio.speech.create(input: "input", model: :"tts-1", voice: :alloy)

puts(speech)
```
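
Because `input` is limited to 4096 characters and `speed` to the range 0.25 to 4.0, longer texts need to be split across several requests. One possible approach is sketched below, assuming sentence boundaries are acceptable split points. `split_speech_input`, `valid_speed?`, and the constants are hypothetical helpers for illustration, not SDK features.

```ruby
# Limits quoted in the parameter descriptions for this endpoint.
MAX_INPUT_CHARS = 4096
SPEED_RANGE = (0.25..4.0).freeze

# Splits text into chunks of at most `limit` characters, preferring sentence ends.
def split_speech_input(text, limit: MAX_INPUT_CHARS)
  chunks = []
  current = ""

  text.split(/(?<=[.!?])\s+/).each do |sentence|
    # A single sentence longer than the limit is hard-split into slices.
    pieces = sentence.length > limit ? sentence.scan(/.{1,#{limit}}/m) : [sentence]
    pieces.each do |piece|
      candidate = current.empty? ? piece : "#{current} #{piece}"
      if candidate.length <= limit
        current = candidate
      else
        chunks << current
        current = piece
      end
    end
  end

  chunks << current unless current.empty?
  chunks
end

# True when the speed falls inside the documented range.
def valid_speed?(speed)
  SPEED_RANGE.cover?(speed)
end
```

Each chunk could then be passed as `input` to a separate `audio.speech.create` call and the resulting audio concatenated by the caller.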

## Domain Types

### Speech Model

- `SpeechModel = :"tts-1" | :"tts-1-hd" | :"gpt-4o-mini-tts" | :"gpt-4o-mini-tts-2025-12-15"`

  - `:"tts-1"`

  - `:"tts-1-hd"`

  - `:"gpt-4o-mini-tts"`

  - `:"gpt-4o-mini-tts-2025-12-15"`

# Voices

# Voice Consents

ruby/resources/audio/index.md 2026-06-10 15:48 UTC to 2026-06-12 00:01 UTC
