Overview
Converts audio (mp3, wav, m4a, flac, ogg, webm, mp4) into text. Returns transcripts with per-segment timestamps, automatic language identification, and optional SRT/VTT output ready for subtitling.
Available models
Two variants of Whisper large-v3 are served by the same endpoint. They share the encoder; turbo cuts the decoder from 32 layers to 4 (pruned, then fine-tuned), so the choice is a trade-off between accuracy and latency.
| Model | Decoder | When to use |
|---|---|---|
| `whisper-large-v3` | 32 layers (full) | Dictation, podcasts, medical/legal transcription, async batch jobs where quality matters most |
| `whisper-large-v3-turbo` | 4 layers (pruned) | Live captioning, voice agents, voicebots, any flow where latency matters most |
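The table's decision rule can be captured in a small helper so the model choice lives in one place. This is only a sketch: the two model identifiers come from the table, while the function name and its `latency_sensitive` flag are hypothetical.

```python
# Model identifiers as listed in the table above.
FULL_MODEL = "whisper-large-v3"
TURBO_MODEL = "whisper-large-v3-turbo"


def choose_model(latency_sensitive: bool) -> str:
    """Pick turbo for interactive flows and the full model otherwise.

    `choose_model` is a hypothetical helper, not part of any SDK.
    """
    return TURBO_MODEL if latency_sensitive else FULL_MODEL
```

A voicebot would call `choose_model(True)`, while a nightly batch job that transcribes recorded podcasts would call `choose_model(False)`.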
Endpoint and model
POST `https://api.tesseraai.cloud/v1/audio/transcriptions` as `multipart/form-data`. Form field `model`: `whisper-large-v3` or `whisper-large-v3-turbo`.
| Attribute | Value |
|---|---|
| Native languages | English, Spanish, Portuguese, Catalan (plus 95 with degraded quality) |
| Input formats | mp3, wav, m4a, flac, ogg, webm, mp4 |
| Output formats | json (default), text, srt, vtt, verbose_json |
| Max size | 25 MB per file |
| Licence | MIT |
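The limits in the table lend themselves to a client-side check before uploading. The sketch below uses only the Python standard library to validate the extension and size, then encodes the `multipart/form-data` body by hand. Several things here are assumptions rather than documented behaviour: that "25 MB" means 25 × 1024 × 1024 bytes, that the file part is named `file` and authenticated with a bearer token (both taken from the curl example in the Request section), and all function names, which are hypothetical.

```python
import os
import urllib.request
import uuid

API_URL = "https://api.tesseraai.cloud/v1/audio/transcriptions"
SUPPORTED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac", ".ogg", ".webm", ".mp4"}
# Assumption: the documented "25 MB" is read as 25 * 1024 * 1024 bytes.
MAX_BYTES = 25 * 1024 * 1024


def check_upload(filename: str, size_bytes: int) -> None:
    """Raise ValueError if the extension or size falls outside the documented limits."""
    ext = os.path.splitext(filename)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"unsupported format: {ext or '(none)'}")
    if size_bytes > MAX_BYTES:
        raise ValueError(f"file is {size_bytes} bytes, limit is {MAX_BYTES}")


def build_multipart(fields: dict, filename: str, data: bytes, boundary: str = ""):
    """Encode text fields plus one file part; return (body, content_type)."""
    boundary = boundary or uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        text_part = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n"
        )
        parts.append(text_part.encode())
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    )
    parts.append(file_header.encode() + data + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def transcribe(path: str, api_key: str, model: str = "whisper-large-v3-turbo") -> bytes:
    """Validate the file, upload it, and return the raw response body."""
    with open(path, "rb") as fh:
        data = fh.read()
    check_upload(path, len(data))
    body, content_type = build_multipart({"model": model}, os.path.basename(path), data)
    request = urllib.request.Request(
        API_URL,
        data=body,
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read()
```

Checking locally avoids uploading a file that would be rejected anyway. An HTTP client library with built-in multipart support would replace `build_multipart` in most real code; it is written out here only to keep the example dependency-free.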
Request
curl https://api.tesseraai.cloud/v1/audio/transcriptions \
  -H "Authorization: Bearer $TESSERA_API_KEY" \
  -F "file=@call.mp3" \
  -F "model=whisper-large-v3-turbo" \
  -F "language=en" \
  -F "response_format=json"
Response
The default format returns `{"text": "..."}`. With `response_format=verbose_json` you also get `language`, `duration`, and `segments[]`, where each segment carries its own timestamps.
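On the client side, a `verbose_json` body can be reduced to a list of timed segments with a few lines of Python. This is a minimal sketch that assumes the response has the shape of the sample in this section; the function name is hypothetical, and the fallback to an empty list when `segments` is missing is an assumption about the default `json` format.

```python
import json


def segments_from_verbose_json(raw: str):
    """Return (start, end, text) tuples from a verbose_json response body."""
    payload = json.loads(raw)
    # `segments` is expected only with response_format=verbose_json;
    # a plain json body yields an empty list here.
    return [
        (seg["start"], seg["end"], seg["text"].strip())
        for seg in payload.get("segments", [])
    ]
```

Segment text in the sample starts with a space, so the sketch strips it before returning. The resulting tuples are enough to drive a caption overlay or to align the transcript with the audio.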
{
  "text": "Artificial intelligence is changing the way we work.",
  "language": "en",
  "duration": 3.984,
  "segments": [
    {
      "id": 1,
      "start": 0.0,
      "end": 3.76,
      "text": " Artificial intelligence is changing...",
      "avg_logprob": -0.05
    }
  ]
}
Best practices
- If you know the language up front, pass it in the `language` field (ISO 639-1). It improves both accuracy and latency.
- For long audio (>10 min), split it into 5–10 min chunks and concatenate the text client-side. Whisper hallucinates less on shorter chunks.
- Clean audio (16 kHz mono PCM in a WAV container) yields better quality than heavily compressed mp3.
- For live captioning, prefer the (forthcoming) streaming endpoint over batch.
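The chunking advice above can be sketched as two small steps: plan fixed-length windows over the recording, then stitch the per-chunk transcripts back together. Both function names are hypothetical, the 480-second default is just one value inside the suggested 5–10 min range, and cutting the audio itself (for example with an external tool) is deliberately left out.

```python
def plan_chunks(duration_s: float, chunk_s: float = 480.0):
    """Return (start, end) windows in seconds that cover the whole recording."""
    if duration_s <= 0 or chunk_s <= 0:
        raise ValueError("duration_s and chunk_s must be positive")
    windows = []
    start = 0.0
    while start < duration_s:
        # The final window is shortened so it never runs past the end.
        end = min(start + chunk_s, duration_s)
        windows.append((start, end))
        start = end
    return windows


def join_transcripts(texts):
    """Concatenate per-chunk transcripts with single spaces, skipping empty ones."""
    return " ".join(t.strip() for t in texts if t.strip())
```

For a 25-minute file, `plan_chunks(1500, 600)` gives three windows, the last one five minutes long. Fixed windows can cut a word in half at a boundary, so this is a starting point rather than a finished splitter.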