
Realtime ASR (Speech-to-Text) — WebSocket

This page documents Kotoba's Realtime ASR (speech-to-text) API over WebSocket.

Endpoint

wss://api.kotobatech.ai/v1/realtime

Authentication

Use your API key as a Bearer token:

  • Authorization: Bearer <KOTOBA_API_KEY>

Browser / client-side (do NOT embed API keys)

If you need to connect from a browser or other client-side environment, first create a short-lived client secret on your server:

POST https://api.kotobatech.ai/v1/realtime/transcription_sessions

Then connect from the browser and pass the client secret in the WebSocket subprotocol, because browsers cannot set arbitrary headers on WebSocket connections:

  • sec-websocket-protocol: realtime, kotoba-insecure-api-key.<CLIENT_SECRET_VALUE>
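The two steps above can be sketched as follows. This is a minimal illustration, not a reference client: the helper names (buildMintRequest, mintClientSecret, buildSubprotocols) are hypothetical, and the sketch assumes that the POST endpoint accepts the same Bearer API key as the WebSocket endpoint and needs no request body. The response shape is not described on this page, so the parsed JSON is returned unchanged. A global fetch (Node.js 18+) is assumed.

```javascript
// Sketch only. Assumptions (not confirmed by this page): the POST endpoint accepts
// the same Bearer API key as the WebSocket endpoint and needs no request body.
const SESSIONS_URL = "https://api.kotobatech.ai/v1/realtime/transcription_sessions";

// Server side: describe the request that mints a short-lived client secret.
function buildMintRequest(apiKey) {
  if (!apiKey) throw new Error("API key is required");
  return {
    url: SESSIONS_URL,
    init: { method: "POST", headers: { Authorization: `Bearer ${apiKey}` } },
  };
}

// Server side: perform the request. The response shape is not documented here,
// so the parsed JSON is returned as-is for the caller to inspect.
async function mintClientSecret(apiKey) {
  const { url, init } = buildMintRequest(apiKey);
  const res = await fetch(url, init);
  if (!res.ok) throw new Error(`Minting failed with HTTP ${res.status}`);
  return res.json();
}

// Browser side: subprotocol list that carries the client secret.
function buildSubprotocols(clientSecret) {
  return ["realtime", `kotoba-insecure-api-key.${clientSecret}`];
}
```

Keeping mintClientSecret on the server means the long-lived API key never reaches the browser; only the short-lived secret does.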

Errors & error codes

Realtime errors are delivered as JSON events with:

  • type: "error"
  • error.code: one of the KotobaErrorCode values (see the Errors schema page in the sidebar)

Common codes for realtime ASR:

  • invalid_api_key (HTTP 401 / error event): missing/invalid API key
  • rate_limit_exceeded / too_many_concurrent_requests (HTTP 429 / error event): throttled or concurrency limit reached
  • quota_exceeded: insufficient credits
  • invalid_parameters: invalid session config / event payload
  • invalid_event: unknown type in a client event
  • invalid_json: invalid JSON message
  • payload_too_large: audio chunk too large (split into smaller chunks)
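A client usually needs to decide whether an error event is worth retrying. The sketch below reads only the documented fields (type and error.code). Treating the two throttling codes as retryable is an assumption of this example, not documented Kotoba behaviour, and classifyRealtimeError is a hypothetical helper name.

```javascript
// Codes come from the list above. Marking the throttling codes as retryable
// is an assumption of this sketch, not documented server behaviour.
const RETRYABLE_CODES = new Set([
  "rate_limit_exceeded",
  "too_many_concurrent_requests",
]);

// Returns null for non-error events, otherwise the code and a retry hint.
function classifyRealtimeError(event) {
  if (!event || event.type !== "error" || !event.error) return null;
  const code = event.error.code;
  return { code, retryable: RETRYABLE_CODES.has(code) };
}
```

For example, an event with error.code "invalid_api_key" is reported as not retryable, since retrying with the same key cannot succeed.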

High-level protocol

  1. Connect to the WebSocket.
  2. Wait for server event transcription_session.created.
  3. Send transcription_session.update exactly once to configure the session (audio format, language, etc.).
  4. Stream audio chunks with input_audio_buffer.append.
  5. Receive transcription deltas as conversation.item.input_audio_transcription.delta (and/or conversation.item.input_audio_transcription.completed, when enabled).
  6. When done, send input_audio_buffer.commit, then close the connection.
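The ordering in the steps above can be enforced with a small driver that sits between the socket and the application. This is a sketch under stated assumptions: createSessionDriver and its send callback are hypothetical names, and the sketch allows audio as soon as the update has been sent, since this page does not say whether a client must first wait for transcription_session.updated.

```javascript
// Hypothetical driver enforcing the documented order:
// created -> update (once) -> append... -> commit.
// `send` is any function that transmits one event object (for example, as JSON over the socket).
function createSessionDriver(sessionConfig, send) {
  let configured = false;
  return {
    // Call for every parsed server event.
    onServerEvent(event) {
      if (event.type === "transcription_session.created" && !configured) {
        configured = true; // guarantees the update is sent exactly once
        send({ type: "transcription_session.update", session: sessionConfig });
      }
    },
    // Call for every base64-encoded audio chunk.
    appendAudio(base64Audio) {
      if (!configured) throw new Error("Session is not configured yet");
      send({ type: "input_audio_buffer.append", audio: base64Audio });
    },
    // Call once when the audio stream ends; close the socket afterwards.
    finish() {
      if (!configured) throw new Error("Session is not configured yet");
      send({ type: "input_audio_buffer.commit" });
    },
  };
}
```

With a real socket, send would typically be (event) => ws.send(JSON.stringify(event)).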

Client → Server events

transcription_session.update (required)

{
  "type": "transcription_session.update",
  "session": {
    "input_audio_format": "pcm16",
    "input_audio_sample_rate": 24000,
    "input_audio_number_of_channels": 1,
    "input_audio_transcription": {
      "language": "ja",
      "target_language": "ja"
    }
  }
}

input_audio_buffer.append

Send base64-encoded audio bytes:

{
  "event_id": "optional_any_id_1",
  "type": "input_audio_buffer.append",
  "audio": "Base64EncodedAudioData"
}
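Because an oversized chunk is rejected with payload_too_large, raw audio is normally split before encoding. The sketch below turns raw PCM16 bytes into a list of append events in Node.js. buildAppendEvents is a hypothetical helper, and the default maxChunkBytes of 16384 is an arbitrary illustrative value; this page does not state the server's actual limit.

```javascript
// Splits raw PCM16 bytes (Buffer or Uint8Array) into input_audio_buffer.append events.
// maxChunkBytes is an assumed, caller-chosen limit, not a documented server value.
function buildAppendEvents(pcmBytes, maxChunkBytes = 16384) {
  // An even chunk size keeps 16-bit mono samples from being split across events.
  if (!Number.isInteger(maxChunkBytes) || maxChunkBytes <= 0 || maxChunkBytes % 2 !== 0) {
    throw new RangeError("maxChunkBytes must be a positive even integer");
  }
  const events = [];
  for (let offset = 0; offset < pcmBytes.length; offset += maxChunkBytes) {
    const chunk = pcmBytes.subarray(offset, offset + maxChunkBytes);
    events.push({
      type: "input_audio_buffer.append",
      audio: Buffer.from(chunk).toString("base64"),
    });
  }
  return events;
}
```

At 24000 Hz, 16-bit mono, one second of audio is 48000 bytes, so smaller chunks trade lower latency for more messages.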

input_audio_buffer.commit

{
  "type": "input_audio_buffer.commit"
}

Server → Client events (common)

  • transcription_session.created
  • transcription_session.updated
  • conversation.item.created
  • conversation.item.input_audio_transcription.delta
  • conversation.item.input_audio_transcription.completed
  • input_audio_buffer.committed
  • error
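One way to consume these events is to route each message by its type field. The sketch below relies only on type, because the payload fields of the individual events are not described on this page. createEventRouter and the default fallback key are hypothetical names chosen for this example.

```javascript
// Hypothetical router: maps a server event's `type` to a handler function.
// Only the `type` field is used; other payload fields are left to the handlers.
function createEventRouter(handlers) {
  return function route(rawMessage) {
    const event = JSON.parse(rawMessage);
    // Own-property check avoids matching inherited keys such as "constructor".
    const handler = Object.hasOwn(handlers, event.type)
      ? handlers[event.type]
      : handlers.default;
    if (handler) handler(event);
    return event.type;
  };
}
```

With the ws package, the router can be attached as ws.on("message", (msg) => route(msg.toString())).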

Example: Node.js (server-side auth header)

import WebSocket from "ws";

// Server-side only: the API key travels in the Authorization header.
const ws = new WebSocket("wss://api.kotobatech.ai/v1/realtime", {
  headers: { Authorization: `Bearer ${process.env.KOTOBA_API_KEY}` },
});

ws.on("message", (msg) => {
  const event = JSON.parse(msg.toString());
  console.log(event);

  // Configure the session once the server has sent transcription_session.created.
  // (A simple demo can send the update immediately on "open" instead.)
  if (event.type === "transcription_session.created") {
    ws.send(JSON.stringify({
      type: "transcription_session.update",
      session: {
        input_audio_format: "pcm16",
        input_audio_sample_rate: 24000,
        input_audio_number_of_channels: 1,
        input_audio_transcription: { language: "en", target_language: "en" },
      },
    }));
  }
});

// Surface connection-level failures instead of crashing on an unhandled "error" event.
ws.on("error", (err) => console.error("WebSocket error:", err));

Example: Browser (client secret via subprotocol)

// CLIENT_SECRET_VALUE must be minted server-side via:
// POST https://api.kotobatech.ai/v1/realtime/transcription_sessions
const ws = new WebSocket(
  "wss://api.kotobatech.ai/v1/realtime",
  ["realtime", "kotoba-insecure-api-key." + CLIENT_SECRET_VALUE]
);

ws.onmessage = (m) => console.log(JSON.parse(m.data));