Realtime ASR (Speech-to-Text) — WebSocket
This page documents Kotoba's Realtime ASR (speech-to-text) API over WebSocket.
Endpoint
wss://api.kotobatech.ai/v1/realtime
Authentication
Server-side (recommended)
Use your API key as a Bearer token:
Authorization: Bearer <KOTOBA_API_KEY>
Browser / client-side (do NOT embed API keys)
If you need to connect from a browser or other client-side environment, first create a short-lived client secret on your server:
POST https://api.kotobatech.ai/v1/realtime/transcription_sessions
Then connect from the browser, passing the client secret through the WebSocket subprotocol (browsers cannot set custom headers such as Authorization on a WebSocket connection):
sec-websocket-protocol: realtime, kotoba-insecure-api-key.<CLIENT_SECRET_VALUE>
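The two steps above can be sketched as a pair of small helpers. This is a minimal sketch, not a definitive implementation: buildMintRequest and buildSubprotocols are hypothetical names, and the sketch assumes the POST endpoint accepts the same Bearer header as the server-side WebSocket. This page does not specify the request body or which response field holds the client secret, so both are left out.

```javascript
// Hypothetical helper (not part of any Kotoba SDK): builds the arguments
// for fetch() to mint a short-lived client secret on your server.
// Assumption: the endpoint accepts the same Bearer header as the WebSocket.
function buildMintRequest(apiKey) {
  return {
    url: "https://api.kotobatech.ai/v1/realtime/transcription_sessions",
    init: {
      method: "POST",
      headers: { Authorization: `Bearer ${apiKey}` },
    },
  };
}

// Hypothetical helper: builds the subprotocol list that the browser passes
// as the second argument of `new WebSocket(url, protocols)`.
function buildSubprotocols(clientSecret) {
  return ["realtime", `kotoba-insecure-api-key.${clientSecret}`];
}
```

On the server, the result would be passed as fetch(req.url, req.init). Only the extracted client secret should be returned to the browser, never the API key itself.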
Errors & error codes
Realtime errors are delivered as JSON events with:
- type: "error"
- error.code: one of the KotobaErrorCode values (see the Errors schema page in the sidebar)
Common codes for realtime ASR:
- invalid_api_key (HTTP 401 / error event): missing or invalid API key
- rate_limit_exceeded / too_many_concurrent_requests (HTTP 429 / error event): throttled or concurrency limit reached
- quota_exceeded: insufficient credits
- invalid_parameters: invalid session config or event payload
- invalid_event: unknown type in a client event
- invalid_json: invalid JSON message
- payload_too_large: audio chunk too large (split into smaller chunks)
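One way a client might act on these codes is sketched below. The mapping from code to action is an assumption made for illustration (this page does not prescribe retry behaviour), and classifyRealtimeError is a hypothetical name. The event shape follows the type and error.code fields described above.

```javascript
// Assumed grouping of the documented codes; adjust to your own policy.
const RETRYABLE = new Set(["rate_limit_exceeded", "too_many_concurrent_requests"]);
const FATAL = new Set(["invalid_api_key", "quota_exceeded"]);
const CLIENT_BUGS = new Set(["invalid_parameters", "invalid_event", "invalid_json"]);

// Hypothetical helper: maps a parsed server event to a suggested client action.
function classifyRealtimeError(event) {
  if (!event || event.type !== "error") return "not_an_error";
  const code = event.error && event.error.code;
  if (RETRYABLE.has(code)) return "retry_with_backoff";
  if (code === "payload_too_large") return "split_chunk";
  if (FATAL.has(code)) return "stop";
  if (CLIENT_BUGS.has(code)) return "fix_request";
  return "unknown"; // any KotobaErrorCode value not listed on this page
}
```

For example, classifyRealtimeError({ type: "error", error: { code: "payload_too_large" } }) returns "split_chunk".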
High-level protocol
- Connect to the WebSocket.
- Wait for server event transcription_session.created.
- Send exactly once: transcription_session.update to configure the session (audio format, language, etc.).
- Stream audio chunks with input_audio_buffer.append.
- Receive transcription deltas as conversation.item.input_audio_transcription.delta (and/or .completed when enabled).
- When done, send input_audio_buffer.commit, then close the connection.
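The ordering of these steps can be sketched as a small transport-agnostic driver. This is an illustrative sketch under stated assumptions: createTranscriptionFlow is a hypothetical name, send stands for any function that serializes and transmits one client event (for example (e) => ws.send(JSON.stringify(e))), and the sketch assumes audio may be appended as soon as the update has been sent, which this page does not state explicitly.

```javascript
// Hypothetical driver that enforces the order: created -> update (once) -> append -> commit.
function createTranscriptionFlow(send, sessionConfig) {
  let configured = false;
  return {
    // Feed every parsed server event here.
    onServerEvent(event) {
      if (event.type === "transcription_session.created" && !configured) {
        configured = true; // the update must be sent exactly once
        send({ type: "transcription_session.update", session: sessionConfig });
      }
    },
    // Sends one base64-encoded audio chunk.
    appendAudio(base64Audio) {
      if (!configured) throw new Error("session has not been configured yet");
      send({ type: "input_audio_buffer.append", audio: base64Audio });
    },
    // Signals the end of the audio; the caller closes the socket afterwards.
    finish() {
      send({ type: "input_audio_buffer.commit" });
    },
  };
}
```

Keeping the driver independent of the socket makes the ordering rules easy to exercise with a fake send function before wiring in a real connection.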
Client → Server events
transcription_session.update (required)
{
  "type": "transcription_session.update",
  "session": {
    "input_audio_format": "pcm16",
    "input_audio_sample_rate": 24000,
    "input_audio_number_of_channels": 1,
    "input_audio_transcription": {
      "language": "ja",
      "target_language": "ja"
    }
  }
}
input_audio_buffer.append
Send the audio bytes, base64-encoded, in the audio field:
{
  "event_id": "optional_any_id_1",
  "type": "input_audio_buffer.append",
  "audio": "Base64EncodedAudioData"
}
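Building these events from raw audio can be sketched as follows in Node.js. With the session settings shown earlier (pcm16, 24000 Hz, 1 channel), one second of audio is 24000 × 2 × 1 = 48000 bytes, so 4800 bytes is roughly 100 ms. That chunk size is an assumption chosen for illustration, since this page does not state the maximum allowed chunk size, and toAppendEvents is a hypothetical name.

```javascript
// Hypothetical helper: splits a Node.js Buffer of raw PCM16 audio into
// input_audio_buffer.append events. chunkBytes defaults to about 100 ms of
// 24 kHz mono PCM16 (an assumed value, not a documented limit).
function toAppendEvents(pcmBuffer, chunkBytes = 4800) {
  // PCM16 samples are 2 bytes each, so keep chunk boundaries sample-aligned.
  if (!Number.isInteger(chunkBytes) || chunkBytes <= 0 || chunkBytes % 2 !== 0) {
    throw new RangeError("chunkBytes must be a positive even integer");
  }
  const events = [];
  for (let offset = 0; offset < pcmBuffer.length; offset += chunkBytes) {
    // subarray clamps the end index, so the last chunk may be shorter.
    const chunk = pcmBuffer.subarray(offset, offset + chunkBytes);
    events.push({
      type: "input_audio_buffer.append",
      audio: chunk.toString("base64"),
    });
  }
  return events;
}
```

Each returned event can be sent with ws.send(JSON.stringify(event)). If the server responds with payload_too_large, a smaller chunkBytes is the natural adjustment.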
input_audio_buffer.commit
{
  "type": "input_audio_buffer.commit"
}
Server → Client events (common)
- transcription_session.created
- transcription_session.updated
- conversation.item.created
- conversation.item.input_audio_transcription.delta
- conversation.item.input_audio_transcription.completed
- input_audio_buffer.committed
- error
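A client typically dispatches on the type field of each server event. The sketch below accumulates transcript text from delta events. It assumes the text fragment is carried in a delta field, which this page does not specify, so check the event schema before relying on it. createTranscriptCollector is a hypothetical name.

```javascript
// Hypothetical collector: accumulates transcript text from delta events.
function createTranscriptCollector() {
  let text = "";
  return {
    // Feed every parsed server event here; returns the transcript so far.
    handle(event) {
      switch (event.type) {
        case "conversation.item.input_audio_transcription.delta":
          // Assumption: the fragment lives in `event.delta` (not confirmed on this page).
          if (typeof event.delta === "string") text += event.delta;
          break;
        case "error":
          throw new Error(`realtime error: ${event.error && event.error.code}`);
        default:
          break; // other event types are ignored in this sketch
      }
      return text;
    },
    get text() {
      return text;
    },
  };
}
```

In a Node.js client this would be called as collector.handle(JSON.parse(msg.toString())) inside the message handler.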
Example: Node.js (server-side auth header)
import WebSocket from "ws";
const ws = new WebSocket("wss://api.kotobatech.ai/v1/realtime", {
  headers: { Authorization: `Bearer ${process.env.KOTOBA_API_KEY}` },
});
ws.on("open", () => {
  // In production, wait for the transcription_session.created server event before
  // sending transcription_session.update. This simple demo sends it immediately.
  ws.send(JSON.stringify({
    type: "transcription_session.update",
    session: {
      input_audio_format: "pcm16",
      input_audio_sample_rate: 24000,
      input_audio_number_of_channels: 1,
      input_audio_transcription: { language: "en", target_language: "en" }
    }
  }));
});
ws.on("message", (msg) => {
  // Each message is one JSON-encoded server event.
  console.log(JSON.parse(msg.toString()));
});
Example: Browser (client secret via subprotocol)
// CLIENT_SECRET_VALUE must be minted server-side via:
// POST https://api.kotobatech.ai/v1/realtime/transcription_sessions
const ws = new WebSocket(
  "wss://api.kotobatech.ai/v1/realtime",
  ["realtime", "kotoba-insecure-api-key." + CLIENT_SECRET_VALUE]
);
ws.onmessage = (m) => console.log(JSON.parse(m.data));