Rate-limiting model
Each tier has a sustained RPM (requests per minute) and a burst capacity that allows short spikes. Sustained RPM is measured over a rolling 60-second window. Burst capacity can be used for up to 5 minutes per hour; once those 5 minutes are used up, requests above the sustained rate return HTTP 429.
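To stay under the sustained rate without relying on 429 responses, a client can mirror the rolling window locally. The following is a minimal sketch, not an official client: the class name `RollingWindowThrottle` and its parameters are hypothetical, and it tracks only the sustained limit, ignoring burst capacity.

```python
import time
from collections import deque


class RollingWindowThrottle:
    """Client-side guard that keeps calls under a per-window request limit."""

    def __init__(self, rpm, window_s=60.0, clock=time.monotonic, sleep=time.sleep):
        self.rpm = rpm
        self.window_s = window_s
        self._clock = clock      # injectable for testing
        self._sleep = sleep      # injectable for testing
        self._sent = deque()     # timestamps of requests still inside the window

    def _evict(self, now):
        # Drop timestamps that have left the rolling window.
        while self._sent and now - self._sent[0] >= self.window_s:
            self._sent.popleft()

    def wait_time(self):
        """Seconds until one more request fits in the window (0.0 if it fits now)."""
        now = self._clock()
        self._evict(now)
        if len(self._sent) < self.rpm:
            return 0.0
        return self.window_s - (now - self._sent[0])

    def acquire(self):
        """Block until a slot is free, then record the request."""
        while True:
            delay = self.wait_time()
            if delay <= 0:
                break
            self._sleep(delay)
        self._sent.append(self._clock())
```

Call `throttle.acquire()` before each request, for example with `RollingWindowThrottle(rpm=50)` on the Lite tier. The server remains the source of truth, so this complements 429 handling and does not replace it.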
Limits by tier
| Tier | Sustained RPM | Burst | Per-service sublimits |
|---|---|---|---|
| Async | — (job queue) | — | LLM + embeddings only; no Whisper/TTS |
| Lite (€450/mo) | 50 | 100 (5 min/h) | Embeddings 100 RPM · Whisper 10 RPM · TTS 10 RPM |
| Pro (€650/mo) | 200 | 400 (5 min/h) | Embeddings 400 RPM · Whisper 30 RPM · TTS 30 RPM |
| Scale | 500 | 700 (5 min/h) | No sublimits — full bundle |
| Enterprise | 5,000+ | Negotiable per hardware | No sublimits |
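For capacity planning, the table can be turned into a rough hourly ceiling per tier. The sketch below assumes the Burst column is a per-minute rate that replaces the sustained rate during the 5 burst minutes; the table does not state this explicitly, so treat the result as an estimate. The names `TIERS` and `max_requests_per_hour` are illustrative.

```python
# Sustained and burst values copied from the tier table (Lite, Pro, Scale).
TIERS = {
    "Lite": {"sustained_rpm": 50, "burst_rpm": 100},
    "Pro": {"sustained_rpm": 200, "burst_rpm": 400},
    "Scale": {"sustained_rpm": 500, "burst_rpm": 700},
}

BURST_MINUTES_PER_HOUR = 5


def max_requests_per_hour(tier):
    """Upper bound on requests in one hour, assuming burst is a per-minute rate."""
    limits = TIERS[tier]
    burst = limits["burst_rpm"] * BURST_MINUTES_PER_HOUR
    sustained = limits["sustained_rpm"] * (60 - BURST_MINUTES_PER_HOUR)
    return burst + sustained


for name in TIERS:
    print(name, max_requests_per_hour(name))
```

Under that assumption, Lite tops out at 5 × 100 + 55 × 50 = 3,250 requests per hour, Pro at 13,000, and Scale at 31,000. Per-service sublimits are not included in this estimate.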
Response headers
Every API response carries headers that report your current quota state, so a client can slow down before it hits a 429.
| Header | Meaning |
|---|---|
| `X-RateLimit-Limit-Requests` | Your sustained RPM for this endpoint |
| `X-RateLimit-Remaining-Requests` | Remaining requests in the current window |
| `X-RateLimit-Reset-Requests` | Seconds until the quota resets |
| `Retry-After` (only on 429) | Seconds to wait before retrying |
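These headers are enough for a client to pause proactively. The sketch below is one possible approach, assuming the header values are plain numbers as the table describes; the function name `seconds_to_pause` and the `min_remaining` parameter are hypothetical.

```python
def seconds_to_pause(headers, min_remaining=1):
    """Return how long to wait before the next request, based on quota headers.

    `headers` is any mapping of header name to string value. A plain dict is
    case-sensitive, so pass your HTTP client's case-insensitive header mapping
    if it provides one.
    """
    # A 429 response carries Retry-After; honour it first.
    retry_after = headers.get("Retry-After")
    if retry_after is not None:
        return float(retry_after)

    remaining = headers.get("X-RateLimit-Remaining-Requests")
    reset = headers.get("X-RateLimit-Reset-Requests")
    if remaining is None or reset is None:
        return 0.0  # no quota information available; do not throttle

    # Quota exhausted (or below the safety margin): wait for the reset.
    if int(remaining) < min_remaining:
        return float(reset)
    return 0.0
```

Raising `min_remaining` leaves headroom when several workers share one API key, since each worker only sees the headers of its own responses.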
Handling 429
When you receive a 429, the response includes a `Retry-After` header. Wait at least that long before retrying. We recommend exponential backoff with jitter, using `Retry-After` as the minimum delay:
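```python
import random
import time


def call_with_retry(fn, max_retries=5):
    """Call fn(), retrying on 429 with exponential backoff and jitter."""
    # RateLimitError stands for whatever exception your HTTP client raises
    # on a 429; it is assumed to expose the response as e.response.
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitError as e:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the 429 to the caller
            # Wait at least Retry-After seconds, or the backoff delay if longer.
            delay = max(
                float(e.response.headers.get("Retry-After", "1")),
                (2 ** attempt) + random.random(),
            )
            time.sleep(delay)
```

Upgrading or downgrading tiers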
- Upgrades take effect within 60s of confirmation. The new RPM applies from the next rate-limit window.
- Downgrades apply on the next billing cycle to avoid mid-month cuts.
- For temporary upgrades (campaign, event), ask support for an "extended burst window" instead of changing tiers.