Realtime ASR (Speech-to-Text) API
wss://api.kotobatech.ai/v1/realtime
Convert microphone audio, desktop audio, or phone calls into text with low latency over WebSocket.
For a step-by-step guide (auth, client secret, event flow, examples), see:
Common use cases
- Generate live captions or meeting minutes
- Build voice agents (including phone calls) and voice-enabled apps
- Detect turns (speech boundaries)
Currently, only a WebSocket interface is provided.
Rate limits
For the Realtime API, we enforce concurrent connection limits per account (API key) and globally.
If you attempt to connect after reaching a limit, the connection may be rejected, or you may receive an error event with error.code=rate_limit_error (see Errors).
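Since a rejected connection is expected behavior at the limit, clients should detect the rate-limit error and retry with backoff rather than reconnecting immediately. A minimal sketch: the error.code value is from this page, but the exact nesting of the error event and the retry policy are assumptions.

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter for reconnect attempts."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def should_retry(event: dict) -> bool:
    """Retry only on the documented rate-limit error code.

    ASSUMPTION: the error event is shaped like
    {"type": "error", "error": {"code": "rate_limit_error"}}.
    """
    return (
        event.get("type") == "error"
        and event.get("error", {}).get("code") == "rate_limit_error"
    )

# Example: a rate-limit error event triggers a retry after a jittered delay.
evt = {"type": "error", "error": {"code": "rate_limit_error"}}
if should_retry(evt):
    delay = backoff_delay(attempt=3)  # somewhere in [0, 8) seconds
```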
How to connect: WebSocket
WebSockets are widely used for real-time data transfer and are available on virtually any platform that can make HTTP requests (the connection starts as an HTTP upgrade).
You can use this API from end-user environments (browser/mobile), but never embed your API key in client-side code. Use the key only in a secure server environment.
If you need to connect from an end-user client environment, first mint a short-lived client secret on your server via:
Then pass that client secret to the client and use it for authentication.
| URL | wss://api.kotobatech.ai/v1/realtime |
| Request headers | Auth header (for browser-like environments, see Browser clients) |
Browser clients (client secret via subprotocol)
Browsers cannot set arbitrary headers for WebSocket. Use the WebSocket subprotocol for auth:
sec-websocket-protocol: realtime, kotoba-insecure-api-key.<CLIENT_SECRET_VALUE>
See the full example on /asr-streaming.
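The subprotocol list has two entries: the literal token realtime and the client secret prefixed with kotoba-insecure-api-key. (as shown above). In a browser these are passed as the protocols argument of the WebSocket constructor; from Python, websocket-client accepts the same list via its subprotocols parameter. A small sketch of assembling the values (the secret value here is a placeholder):

```python
def auth_subprotocols(client_secret: str) -> list[str]:
    """Build the Sec-WebSocket-Protocol values used for client-secret auth."""
    return ["realtime", "kotoba-insecure-api-key." + client_secret]

protocols = auth_subprotocols("cs_example_123")  # placeholder secret
# Browser:  new WebSocket(url, protocols)
# Python:   websocket.create_connection(url, subprotocols=protocols)
```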
Example: using a client secret (server endpoint)
Using the Create a client secret endpoint as a reference, create a server-side endpoint like the following, then call it from the client to obtain the client_secret.
Example implementation with Node.js / Express:
```javascript
import express from "express";

const app = express();

// Mint a short-lived client secret: call the protected REST endpoint
// with the server-side API key and return only the secret value.
app.get("/session", async (req, res) => {
  const r = await fetch("https://api.kotobatech.ai/v1/realtime/transcription_sessions", {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${process.env.KOTOBA_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({}),
  });
  const data = await r.json();
  res.send(data.client_secret.value);
});

app.listen(3000);
```
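The client then fetches this endpoint and uses the returned value as its client secret. As a server-free sketch of that round trip, the snippet below stands in a local HTTP handler for the /session endpoint (the secret value is a fake placeholder, not a real one):

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

FAKE_SECRET = "cs_demo_secret"  # stand-in for data.client_secret.value

class SessionHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Mimic the server endpoint: return the bare secret string.
        body = FAKE_SECRET.encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence request logging
        pass

server = HTTPServer(("127.0.0.1", 0), SessionHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side: fetch the secret, then use it for WebSocket auth.
port = server.server_address[1]
with urllib.request.urlopen(f"http://127.0.0.1:{port}/session") as resp:
    client_secret = resp.read().decode()
server.shutdown()
```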
Server-side examples
Python: connect and initialize the session
```python
# requirements:
#   pip install websocket-client
import os
import json
import websocket

url = "wss://api.kotobatech.ai/v1/realtime"
headers = [
    "Authorization: Bearer " + os.environ["KOTOBA_API_KEY"],
]

def on_open(ws):
    print("Connected to server.")

def on_message(ws, message):
    data = json.loads(message)
    print("Received event:", json.dumps(data, indent=2))
    if data["type"] == "transcription_session.created":
        # Configure the session once the server has created it.
        update_event = {
            "type": "transcription_session.update",
            "session": {
                "input_audio_format": "pcm16",
                "input_audio_sample_rate": 24000,
                "input_audio_number_of_channels": 1,
                "input_audio_transcription": {"language": "ja", "target_language": "ja"},
            },
        }
        ws.send(json.dumps(update_event))
    elif data["type"] == "transcription_session.updated":
        # After this, send audio to start transcription.
        pass

ws = websocket.WebSocketApp(
    url,
    header=headers,
    on_open=on_open,
    on_message=on_message,
)
ws.run_forever()
```
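The session above requests pcm16 audio at 24 kHz, mono. If your capture pipeline produces float samples, they must be converted to little-endian 16-bit PCM before sending. A sketch of that conversion; note that the input_audio_buffer.append event name and base64 audio field below are assumptions not confirmed by this page — check the event reference for the exact shape.

```python
import base64
import json
import struct

def float_to_pcm16(samples: list[float]) -> bytes:
    """Convert float samples in [-1.0, 1.0] to little-endian 16-bit PCM."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    return struct.pack("<%dh" % len(clipped), *(int(s * 32767) for s in clipped))

def audio_event(pcm: bytes) -> str:
    # ASSUMPTION: event name and field are illustrative, not from this page.
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm).decode("ascii"),
    })

pcm = float_to_pcm16([0.0, 0.5, -0.5, 1.0])  # 4 samples -> 8 bytes
# ws.send(audio_event(pcm))  # would go after transcription_session.updated
```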