
input_audio_buffer.append

Send audio data to the model.
The model splits incoming audio into chunks at its own fixed interval and transcribes them in order from the beginning.
Any remainder shorter than the interval is buffered in the session and processed together with the audio from the next event.

For example, if you send 100 ms of audio and the model's chunk interval is 80 ms, only the first 80 ms is transcribed, with results returned via conversation.item.input_audio_transcription.delta, while the remaining 20 ms is buffered.
When the next input_audio_buffer.append adds another 100 ms of audio, the buffered 20 ms is joined with the first 60 ms of the new audio and transcribed, and the remaining 40 ms is buffered.
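The buffering arithmetic above can be sketched as a small function. This is purely illustrative: the 80 ms chunk interval is an assumption taken from the example, and the real interval is chosen by the model.

```python
def append_audio(buffered_ms: int, new_ms: int, chunk_interval_ms: int = 80):
    """Return (transcribed_ms, remaining_ms) after appending new audio.

    chunk_interval_ms is an illustrative assumption; the model picks
    its own interval.
    """
    total = buffered_ms + new_ms
    # Only whole chunks are transcribed; the tail stays buffered.
    transcribed = (total // chunk_interval_ms) * chunk_interval_ms
    return transcribed, total - transcribed

# First append of 100 ms: 80 ms transcribed, 20 ms buffered.
print(append_audio(0, 100))    # (80, 20)
# Second append of 100 ms: 20 ms + 60 ms transcribed, 40 ms buffered.
print(append_audio(20, 100))   # (80, 40)
```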

If the end user turns off the microphone (or you have no more audio to send), send input_audio_buffer.commit to transcribe all buffered audio.
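For instance, a client can build and send the commit event as a plain JSON message. The `ws.send` call is a placeholder for whatever WebSocket client you use:

```python
import json

# Build the event that flushes any remaining buffered audio for transcription.
commit_event = {"type": "input_audio_buffer.commit"}
payload = json.dumps(commit_event)

# ws.send(payload)  # send over your existing WebSocket connection (assumed)
print(payload)
```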

Unlike other client events, there is no explicit server acknowledgement event for this message.

event_idstring

Optional identifier you can use to correlate this event with transcription results.

Allowed characters are alphanumerics, underscore (_), and hyphen (-) only.

conversation.item.input_audio_transcription.delta and conversation.item.input_audio_transcription.completed will include this value in item_id (or null if omitted).

Use a unique value per event if you want to correlate results. Duplicate values are accepted without error. Note: if a transcription spans multiple item_id values, the last event_id may be applied.

Possible values: <= 36 characters
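A client can validate an event_id against these constraints before sending. This is a sketch of client-side validation, not something the API requires you to do:

```python
import re

# Alphanumerics, underscore, and hyphen only; at most 36 characters.
EVENT_ID_RE = re.compile(r"^[A-Za-z0-9_-]{1,36}$")

def is_valid_event_id(event_id: str) -> bool:
    """Check an event_id against the documented character and length rules."""
    return bool(EVENT_ID_RE.fullmatch(event_id))
```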

typestringrequired

Fixed value

Possible values: [input_audio_buffer.append]

audiostringrequired

Base64-encoded audio bytes. Audio must follow the input_audio_format and input_audio_sample_rate set in transcription_session.update.

You can send up to roughly 8 seconds of audio per event. Larger chunks increase latency, so we recommend streaming the smallest practical chunks.
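To cap chunk sizes client-side, you can estimate how many raw bytes roughly 8 seconds corresponds to for your format. The sample rate, sample width, and channel count below are assumptions for the example; use the values you set in transcription_session.update:

```python
def max_bytes_per_event(sample_rate: int = 16000, bytes_per_sample: int = 2,
                        channels: int = 1, max_seconds: float = 8.0) -> int:
    """Approximate raw-PCM byte budget for one append event.

    Defaults (16 kHz, 16-bit, mono) are illustrative assumptions.
    """
    return int(sample_rate * bytes_per_sample * channels * max_seconds)

print(max_bytes_per_event())  # 256000 bytes for 16 kHz 16-bit mono
```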

input_audio_buffer.append
{
  "event_id": "random_unique_string",
  "type": "input_audio_buffer.append",
  "audio": "Base64EncodedAudioData"
}
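Putting it together, a client-side streaming loop can slice raw PCM into short chunks, base64-encode each, and emit one append event per chunk. The audio format (16 kHz, 16-bit mono), the 100 ms chunk size, the event_id scheme, and the commented-out `ws.send` are all illustrative assumptions; the format must match your transcription_session.update settings:

```python
import base64
import json

SAMPLE_RATE = 16000      # assumed; must match input_audio_sample_rate
BYTES_PER_SAMPLE = 2     # assumed 16-bit mono PCM
CHUNK_MS = 100           # small chunks keep latency low

def pcm_chunks(pcm: bytes, chunk_ms: int = CHUNK_MS):
    """Yield fixed-duration slices of raw PCM audio."""
    step = SAMPLE_RATE * BYTES_PER_SAMPLE * chunk_ms // 1000
    for i in range(0, len(pcm), step):
        yield pcm[i:i + step]

def append_events(pcm: bytes):
    """Yield serialized input_audio_buffer.append events, one per chunk."""
    for n, chunk in enumerate(pcm_chunks(pcm)):
        yield json.dumps({
            "event_id": f"chunk-{n}",  # any unique id, <= 36 characters
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        })

# Usage with a connected WebSocket (connection setup not shown):
# for message in append_events(raw_pcm):
#     ws.send(message)
```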