API Reference

/v1/audio/transcriptions

Audio-to-text transcription. The endpoint serves two models: `whisper-large-v3` (lowest word error rate) and `whisper-large-v3-turbo` (distilled decoder, up to ~54% faster on long audio). Multilingual, optimised for English, Spanish, Portuguese and Catalan.

Overview

Converts audio (mp3, wav, m4a, flac, ogg, webm, mp4) into text. Returns transcripts with per-segment timestamps, automatic language identification and optional SRT/VTT formats ready for subtitling.

Available models

Two flavours of Whisper large-v3 are served by the same endpoint. They share the encoder; turbo trims the decoder from 32 layers to 4 via distillation. The trade-off is accuracy vs latency.

| Model | Decoder | When to use |
|---|---|---|
| `whisper-large-v3` | 32 layers (full) | Dictation, podcasts, medical/legal transcription, async batch where quality wins |
| `whisper-large-v3-turbo` | 4 layers (distilled) | Live captioning, voice agents, voicebots, any flow where latency matters |

Endpoint and model

POST `https://api.tesseraai.cloud/v1/audio/transcriptions` as `multipart/form-data`. Form field `model`: `whisper-large-v3` or `whisper-large-v3-turbo`.

| Attribute | Value |
|---|---|
| Native languages | English, Spanish, Portuguese, Catalan (plus 95 with degraded quality) |
| Input formats | mp3, wav, m4a, flac, ogg, webm, mp4 |
| Output formats | json (default), text, srt, vtt, verbose_json |
| Max size | 25 MB per file |
| Licence | MIT |
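The format and size limits above can be checked client-side before uploading, which avoids a round trip on files that would be rejected. A minimal pre-flight sketch (the helper name and error messages are illustrative, not part of the API):

```python
from pathlib import Path

# Limits taken from the attributes table above
ALLOWED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac", ".ogg", ".webm", ".mp4"}
MAX_BYTES = 25 * 1024 * 1024  # 25 MB per file

def validate_audio_file(path: str) -> None:
    """Raise ValueError if the file would be rejected by the endpoint."""
    p = Path(path)
    if p.suffix.lower() not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported format: {p.suffix}")
    size = p.stat().st_size
    if size > MAX_BYTES:
        raise ValueError(f"file exceeds 25 MB limit: {size} bytes")
```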

Request

POST /v1/audio/transcriptions
curl https://api.tesseraai.cloud/v1/audio/transcriptions \
  -H "Authorization: Bearer $TESSERA_API_KEY" \
  -F "file=@call.mp3" \
  -F "model=whisper-large-v3-turbo" \
  -F "language=en" \
  -F "response_format=json"

Response

By default the response is `{"text": "..."}`. With `response_format=verbose_json` you also get `language`, `duration` and `segments[]` with per-segment timestamps.

verbose_json
{
  "text": "Artificial intelligence is changing the way we work.",
  "language": "en",
  "duration": 3.984,
  "segments": [
    {
      "id": 1,
      "start": 0.0,
      "end": 3.76,
      "text": " Artificial intelligence is changing...",
      "avg_logprob": -0.05
    }
  ]
}
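The `segments[]` fields shown above are enough to build subtitles client-side (though `response_format=srt` does this server-side). A sketch that assumes only the `start`, `end` and `text` fields from the example response:

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments: list[dict]) -> str:
    """Render verbose_json segments as an SRT subtitle file."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)
```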

Best practices

  • If you know the language up front, pass it in the `language` field (ISO 639-1). It improves both accuracy and latency.
  • For long audio (>10 min) split into 5–10 min chunks and concatenate text client-side. Whisper hallucinates less on shorter chunks.
  • Clean audio (e.g. 16 kHz mono PCM WAV) yields better quality than heavily compressed mp3.
  • For live captioning, prefer the (forthcoming) streaming endpoint over batch.
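For WAV input, the chunking advice above needs nothing beyond the standard library. A sketch that splits a file into consecutive chunks (chunk length is a parameter; the output naming scheme is illustrative):

```python
import wave

def split_wav(path: str, chunk_seconds: int = 300) -> list[str]:
    """Split a WAV file into chunks of at most chunk_seconds each.

    Returns the paths of the chunk files written next to the input.
    Transcribe each chunk, then concatenate the texts client-side.
    """
    out_paths = []
    with wave.open(path, "rb") as src:
        params = src.getparams()
        frames_per_chunk = params.framerate * chunk_seconds
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            out_path = f"{path}.chunk{index:03d}.wav"
            with wave.open(out_path, "wb") as dst:
                dst.setparams(params)  # nframes is patched on close
                dst.writeframes(frames)
            out_paths.append(out_path)
            index += 1
    return out_paths
```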