API Reference

/v1/rerank

Second-stage reranking for retrieval pipelines. Model: `Qwen/Qwen3-Reranker-4B` (4B parameters, multilingual, Cohere-compatible response shape).

Overview

Reorders a list of candidate documents by relevance to a query. The typical RAG pattern is two-stage: an embedding model retrieves the top-K candidates from your vector store (fast, lossy), then a reranker scores each `(query, doc)` pair end-to-end on the raw text (slower, much more precise). Reranking is the single biggest precision win on top of a vector index — especially when your embeddings are stored at reduced dimensionality (e.g. 1,536-dim into pgvector + HNSW), because the reranker re-reads the full text and recovers the precision that truncation cost.

Endpoint and model

POST `https://api.tesseraai.cloud/v1/rerank`. Pass `model: "Qwen/Qwen3-Reranker-4B"` in the request body. Response shape follows the Cohere `/rerank` contract.

| Attribute | Value |
| --- | --- |
| Parameters | 4B |
| Languages | Multilingual (100+ languages, including EN, ES, PT, CA, IT, FR, DE) |
| Family alignment | Trained jointly with `Qwen3-Embedding-8B` — same tokenizer, same instruction template. Optimal when paired with our embedding model. |
| Max context per pair | 8,192 tokens (`query` + one `document`) |
| Quantisation | Q8_0 GGUF (near-FP16 quality) |
| Licence | Apache 2.0 |
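The 8,192-token per-pair budget is worth guarding in code before you send long documents. A minimal pre-flight sketch, assuming the rough ~4-characters-per-token heuristic (the exact count depends on the Qwen3 tokenizer; the helper names here are illustrative, not part of any SDK):

```python
def fits_pair_budget(query: str, document: str, max_tokens: int = 8192) -> bool:
    """Rough check that query + one document stays under the per-pair cap.

    Uses the ~4 chars/token heuristic; tokenize with the actual Qwen3
    tokenizer if you need exact counts.
    """
    est_tokens = (len(query) + len(document)) // 4
    return est_tokens <= max_tokens


def truncate_to_budget(query: str, document: str, max_tokens: int = 8192) -> str:
    """Pre-truncate an oversized document so the pair fits the budget."""
    budget_chars = max(0, (max_tokens * 4) - len(query))
    return document[:budget_chars]
```

Truncating the tail of a document is lossy, of course; chunking long documents at index time is usually the better fix.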

When reranking actually helps

Reranking is worth wiring in when at least one of these is true:

  • You retrieve from a vector store and the top-1 isn't reliably the right document — the right answer is in the top-K but ranked 5th, 12th, sometimes 30th.
  • You truncated your embeddings to fit a storage tier (e.g. `dimensions: 1536` to fit pgvector's HNSW cap). The reranker re-reads the raw text and recovers the precision the truncation traded away.
  • Your domain has near-paraphrase noise (legal, support, knowledge bases) where two documents look semantically close but only one actually answers the query.
  • You're paying Cohere / Voyage / Jina for reranking today and want the same Cohere-compatible API on EU/LATAM infrastructure under your flat Tessera bill.

Request

Send the query plus the list of candidate documents returned by your vector store. The reranker returns each document's index and a relevance score in `[0, 1]`. Typical pipeline: retrieve top 50–100 from your vector store, rerank, keep top 10 for the LLM context window.

POST /v1/rerank
curl https://api.tesseraai.cloud/v1/rerank \
  -H "Authorization: Bearer $TESSERA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-Reranker-4B",
    "query": "how do I reset my password",
    "documents": [
      "To reset your password, click forgot password on the login screen.",
      "Our office hours are Monday to Friday, 9 to 18.",
      "Password requirements: 12+ characters, one number, one symbol."
    ]
  }'

Response

Cohere-canonical envelope. `results[].index` references the original `documents` array order; sort `results` by `relevance_score` descending and keep the top-N for downstream use.

Response
{
  "id": "a1f2…",
  "results": [
    {"index": 0, "relevance_score": 0.9902},
    {"index": 2, "relevance_score": 0.4118},
    {"index": 1, "relevance_score": 0.00003}
  ],
  "meta": {
    "billed_units": {"total_tokens": 176},
    "tokens": {"input_tokens": 176, "output_tokens": 0}
  }
}
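The sort-and-map step can be sketched in a few lines against the example response above (the `keep_top_n` helper is illustrative, not part of any SDK):

```python
def keep_top_n(response: dict, documents: list[str], n: int) -> list[str]:
    """Sort results by relevance_score descending, then map each result's
    index back into the original documents list."""
    ranked = sorted(response["results"],
                    key=lambda r: r["relevance_score"], reverse=True)
    return [documents[r["index"]] for r in ranked[:n]]


docs = [
    "To reset your password, click forgot password on the login screen.",
    "Our office hours are Monday to Friday, 9 to 18.",
    "Password requirements: 12+ characters, one number, one symbol.",
]
response = {"results": [
    {"index": 0, "relevance_score": 0.9902},
    {"index": 2, "relevance_score": 0.4118},
    {"index": 1, "relevance_score": 0.00003},
]}
top2 = keep_top_n(response, docs, 2)
# → password-reset doc first, then the password-requirements doc
```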

Pairing with /v1/embeddings

Tessera's embedding and reranker models are from the same Qwen3 family and are trained together. Using both is the recommended setup — same tokenizer, same instruction template, no quality penalty from mixing vendor families.

End-to-end RAG snippet
import os

import httpx
from openai import OpenAI

client = OpenAI(
    base_url="https://api.tesseraai.cloud/v1",
    api_key=os.environ["TESSERA_API_KEY"],
)

# Stage 1 — embed and retrieve top-50 from your vector store
# (`vector_store` stands in for whatever client you use: pgvector, Qdrant, etc.).
query_vec = client.embeddings.create(
    model="Qwen/Qwen3-Embedding-8B",
    input=[user_query],
    dimensions=1536,           # fits pgvector HNSW
).data[0].embedding
candidates = vector_store.query(query_vec, top_k=50)

# Stage 2 — rerank the 50 candidates and keep the top 10.
resp = httpx.post(
    "https://api.tesseraai.cloud/v1/rerank",
    headers={"Authorization": f"Bearer {os.environ['TESSERA_API_KEY']}"},
    json={
        "model": "Qwen/Qwen3-Reranker-4B",
        "query": user_query,
        "documents": [c.text for c in candidates],
    },
    timeout=30.0,
).json()
top10 = sorted(resp["results"], key=lambda r: -r["relevance_score"])[:10]
context = [candidates[r["index"]].text for r in top10]
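From there, the reranked `context` gets stitched into the LLM prompt. One way to do it, sketched below; the numbering convention is a suggestion, not anything the API requires:

```python
def build_rag_prompt(query: str, context: list[str]) -> str:
    """Number the reranked passages so the model can cite them as [n]."""
    sources = "\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(context))
    return (
        "Answer the question using only the sources below. "
        "Cite sources as [n].\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {query}"
    )
```

Keeping the reranker's ordering inside the prompt matters: most models weight earlier context more heavily, so the best-scoring passage should come first.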

Typical use cases

  • RAG quality boost: keep your existing embedding index, drop the reranker between retrieval and the LLM — usually the highest-leverage one-line change for answer accuracy.
  • Truncated-embedding recovery: if you indexed at 768 / 1,024 / 1,536 dimensions to fit a storage tier, rerank to claw back the precision that dimensionality reduction lost.
  • Cross-language retrieval: query in English, retrieve documents in Spanish (or vice versa) — the reranker handles cross-lingual relevance natively.
  • Migration from Cohere / Voyage / Jina: response shape matches Cohere's, so swapping is a `base_url` + `model` change in most clients.
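For the migration case, the request body is unchanged from a Cohere `/rerank` call; only the URL and the model name move. A sketch of the swap, assuming you were calling Cohere's REST endpoint directly (SDK users change `base_url` instead; `rerank_request` is a hypothetical helper):

```python
def rerank_request(query: str, documents: list[str]) -> dict:
    """Build the same JSON body a Cohere /rerank call sends;
    only the url and model differ after migration."""
    return {
        "url": "https://api.tesseraai.cloud/v1/rerank",  # was Cohere's rerank endpoint
        "json": {
            "model": "Qwen/Qwen3-Reranker-4B",  # was a Cohere rerank model
            "query": query,
            "documents": documents,
        },
    }
```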