API Reference

/v1/audio/transcriptions

Audio-to-text transcription. The endpoint serves two models: `whisper-large-v3` (lowest word error rate) and `whisper-large-v3-turbo` (distilled decoder, up to ~54% faster on long audio). Multilingual, optimised for English, Spanish, Portuguese and Catalan.

Overview

Converts audio (mp3, wav, m4a, flac, ogg, webm, mp4) into text. Returns transcripts with per-segment timestamps, automatic language identification and optional SRT/VTT formats ready for subtitling.

Available models

Two flavours of Whisper large-v3 are served by the same endpoint. They share the encoder; turbo trims the decoder from 32 layers to 4 via distillation. The trade-off is accuracy vs latency.

| Model | Decoder | When to use |
|---|---|---|
| `whisper-large-v3` | 32 layers (full) | Dictation, podcasts, medical/legal transcription, async batch where quality wins |
| `whisper-large-v3-turbo` | 4 layers (distilled) | Live captioning, voice agents, voicebots, any flow where latency matters |

Endpoint and model

POST `https://api.tesseraai.cloud/v1/audio/transcriptions` as `multipart/form-data`. Form field `model`: `whisper-large-v3` or `whisper-large-v3-turbo`.

| Attribute | Value |
|---|---|
| Native languages | English, Spanish, Portuguese, Catalan (plus 95 with degraded quality) |
| Input formats | mp3, wav, m4a, flac, ogg, webm, mp4 |
| Output formats | json (default), text, srt, vtt, verbose_json |
| Max size | 25 MB per file |
| Licence | MIT |
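The format and size limits above can be checked client-side before uploading, which avoids a round trip on files that would be rejected. A minimal pre-flight sketch (the helper name and error messages are illustrative, not part of the API):

```python
from pathlib import Path

# Limits taken from the attributes table above
ALLOWED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac", ".ogg", ".webm", ".mp4"}
MAX_BYTES = 25 * 1024 * 1024  # 25 MB per file

def validate_audio_file(path: str) -> None:
    """Raise ValueError if the file would be rejected by the endpoint."""
    p = Path(path)
    if p.suffix.lower() not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported format: {p.suffix}")
    size = p.stat().st_size
    if size > MAX_BYTES:
        raise ValueError(f"file exceeds 25 MB limit: {size} bytes")
```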

Request

POST /v1/audio/transcriptions
curl https://api.tesseraai.cloud/v1/audio/transcriptions \
  -H "Authorization: Bearer $TESSERA_API_KEY" \
  -F "file=@call.mp3" \
  -F "model=whisper-large-v3-turbo" \
  -F "language=en" \
  -F "response_format=json"

Response

By default the response is `{"text": "..."}`. With `response_format=verbose_json` you also get `language`, `duration` and `segments[]` with per-segment timestamps.

verbose_json
{
  "text": "Artificial intelligence is changing the way we work.",
  "language": "en",
  "duration": 3.984,
  "segments": [
    {
      "id": 1,
      "start": 0.0,
      "end": 3.76,
      "text": " Artificial intelligence is changing...",
      "avg_logprob": -0.05
    }
  ]
}
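The `segments[]` fields shown above are enough to build subtitles client-side (though `response_format=srt` does this server-side). A sketch that assumes only the `start`, `end` and `text` fields from the example response:

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments: list[dict]) -> str:
    """Render verbose_json segments as an SRT subtitle file."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)
```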

Best practices

  • If you know the language up front, pass it in the `language` field (ISO 639-1). It improves both accuracy and latency.
  • For long audio (>10 min) split into 5–10 min chunks and concatenate text client-side. Whisper hallucinates less on shorter chunks.
  • Clean audio (e.g. 16 kHz mono PCM WAV) yields better quality than heavily compressed mp3.
  • For live captioning, prefer the (forthcoming) streaming endpoint over batch.
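For WAV input, the chunking advice above needs nothing beyond the standard library. A sketch that splits a file into consecutive chunks (chunk length is a parameter; the output naming scheme is illustrative):

```python
import wave

def split_wav(path: str, chunk_seconds: int = 300) -> list[str]:
    """Split a WAV file into chunks of at most chunk_seconds each.

    Returns the paths of the chunk files written next to the input.
    Transcribe each chunk, then concatenate the texts client-side.
    """
    out_paths = []
    with wave.open(path, "rb") as src:
        params = src.getparams()
        frames_per_chunk = params.framerate * chunk_seconds
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            out_path = f"{path}.chunk{index:03d}.wav"
            with wave.open(out_path, "wb") as dst:
                dst.setparams(params)  # nframes is patched on close
                dst.writeframes(frames)
            out_paths.append(out_path)
            index += 1
    return out_paths
```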