Overview
Reorders a list of candidate documents by relevance to a query. The typical RAG pattern is two-stage: an embedding model retrieves the top-K candidates from your vector store (fast, lossy), then a reranker scores each `(query, doc)` pair end-to-end on the raw text (slower, much more precise). Reranking is usually the single biggest precision win you can layer on top of a vector index — especially when your embeddings are stored at reduced dimensionality (e.g. 1,536-dim into pgvector + HNSW), because the reranker re-reads the full text and recovers the precision the truncation cost.
Endpoint and model
POST `https://api.tesseraai.cloud/v1/rerank`. Pass `model: "Qwen/Qwen3-Reranker-4B"` in the request body. Response shape follows the Cohere `/rerank` contract.
| Attribute | Value |
|---|---|
| Parameters | 4B |
| Languages | Multilingual (100+ languages, including EN, ES, PT, CA, IT, FR, DE) |
| Family alignment | Trained jointly with `Qwen3-Embedding-8B` — same tokenizer, same instruction template. Optimal when paired with our embedding model. |
| Max context per pair | 8,192 tokens (`query` + one `document`) |
| Quantisation | Q8_0 GGUF (near-FP16 quality) |
| Licence | Apache 2.0 |
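The 8,192-token ceiling in the table applies per `(query, document)` pair, not to the batch as a whole. If your chunks can run long, a defensive client-side truncation keeps each pair under the limit. A minimal sketch; the characters-per-token ratio is a rough heuristic for mixed-language text, not the model's actual tokenizer:

```python
MAX_PAIR_TOKENS = 8192
CHARS_PER_TOKEN = 3.5  # rough average; varies by language and script

def truncate_for_rerank(query: str, documents: list[str], margin: int = 256) -> list[str]:
    """Clip each candidate so query + document stays under the pair limit.

    Heuristic only: for exact budgeting, count tokens with the model's tokenizer.
    """
    query_tokens = len(query) / CHARS_PER_TOKEN
    doc_budget = int((MAX_PAIR_TOKENS - margin - query_tokens) * CHARS_PER_TOKEN)
    return [doc[:doc_budget] for doc in documents]
```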
When reranking actually helps
Reranking is worth wiring in when at least one of these is true:
- You retrieve from a vector store and the top-1 isn't reliably the right document — the right answer is in the top-K but ranked 5th, 12th, sometimes 30th. This is measurable; see the sketch after this list.
- You truncated your embeddings to fit a storage tier (e.g. `dimensions: 1536` to fit pgvector's HNSW cap). The reranker re-reads the raw text and recovers the precision the truncation traded away.
- Your domain has near-paraphrase noise (legal, support, knowledge bases) where two documents look semantically close but only one actually answers the query.
- You're paying Cohere / Voyage / Jina for reranking today and want the same Cohere-compatible API on EU/LATAM infrastructure under your flat Tessera bill.
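A quick way to test the first condition before wiring anything in: for a small labeled sample, record where the known-correct document lands in your current retrieval. The sketch below is illustrative; `vector_store`, `embed`, and `labeled_queries` are placeholders for your own setup, not Tessera APIs.

```python
def rank_of_gold(query_vec, gold_id: str, top_k: int = 50):
    """1-based rank of the known-correct document in retrieval, or None if missed."""
    retrieved = vector_store.query(query_vec, top_k=top_k)
    for rank, hit in enumerate(retrieved, start=1):
        if hit.id == gold_id:
            return rank
    return None

# labeled_queries: a list of (query_text, gold_doc_id) pairs you trust.
ranks = [rank_of_gold(embed(q), gold) for q, gold in labeled_queries]
found = [r for r in ranks if r is not None]
print(f"recall@50:      {len(found) / len(ranks):.0%}")
print(f"top-1 hit rate: {sum(r == 1 for r in found) / len(ranks):.0%}")
print(f"MRR@50:         {sum(1 / r for r in found) / len(ranks):.3f}")
# High recall@50 with a low top-1 rate means the answers are being retrieved
# but mis-ranked: exactly the gap a reranker closes.
```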
Request
Send the query plus the list of candidate documents returned by your vector store. The reranker returns each document's index and a relevance score in `[0, 1]`. Typical pipeline: retrieve top 50–100 from your vector store, rerank, keep top 10 for the LLM context window.
```bash
curl https://api.tesseraai.cloud/v1/rerank \
  -H "Authorization: Bearer $TESSERA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-Reranker-4B",
    "query": "how do I reset my password",
    "documents": [
      "To reset your password, click forgot password on the login screen.",
      "Our office hours are Monday to Friday, 9 to 18.",
      "Password requirements: 12+ characters, one number, one symbol."
    ]
  }'
```
Response
Cohere-canonical envelope. `results[].index` references the original `documents` array order; sort `results` by `relevance_score` descending and keep the top-N for downstream use.
```json
{
  "id": "a1f2…",
  "results": [
    {"index": 0, "relevance_score": 0.9902},
    {"index": 2, "relevance_score": 0.4118},
    {"index": 1, "relevance_score": 0.00003}
  ],
  "meta": {
    "billed_units": {"total_tokens": 176},
    "tokens": {"input_tokens": 176, "output_tokens": 0}
  }
}
```
Pairing with /v1/embeddings
Tessera's embedding and reranker models are from the same Qwen3 family and are trained together. Using both is the recommended setup — same tokenizer, same instruction template, no quality penalty from mixing vendor families.
```python
import os

import httpx
from openai import OpenAI

client = OpenAI(
    base_url="https://api.tesseraai.cloud/v1",
    api_key=os.environ["TESSERA_API_KEY"],
)

# Stage 1 — embed and retrieve top-50 from your vector store.
query_vec = client.embeddings.create(
    model="Qwen/Qwen3-Embedding-8B",
    input=[user_query],
    dimensions=1536,  # fits pgvector HNSW
).data[0].embedding
candidates = vector_store.query(query_vec, top_k=50)

# Stage 2 — rerank the 50 candidates and keep the top 10.
resp = httpx.post(
    "https://api.tesseraai.cloud/v1/rerank",
    headers={"Authorization": f"Bearer {os.environ['TESSERA_API_KEY']}"},
    json={
        "model": "Qwen/Qwen3-Reranker-4B",
        "query": user_query,
        "documents": [c.text for c in candidates],
    },
    timeout=30.0,
).json()

top10 = sorted(resp["results"], key=lambda r: -r["relevance_score"])[:10]
context = [candidates[r["index"]].text for r in top10]
```
Typical use cases
- RAG quality boost: keep your existing embedding index, drop the reranker between retrieval and the LLM — usually the highest-leverage one-line change for answer accuracy.
- Truncated-embedding recovery: if you indexed at 768 / 1,024 / 1,536 dimensions to fit a storage tier, rerank to claw back the precision that dimensionality reduction lost.
- Cross-language retrieval: query in English, retrieve documents in Spanish (or vice versa) — the reranker handles cross-lingual relevance natively.
- Migration from Cohere / Voyage / Jina: response shape matches Cohere's, so swapping is a `base_url` + `model` change in most clients — shown in the sketch below, which also exercises the cross-lingual case.
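A minimal migration sketch, pointing the official Cohere Python SDK at Tessera instead of hand-rolling the `httpx` call. It assumes a recent `cohere` release whose client accepts a `base_url` override and routes `rerank` to `{base_url}/v1/rerank`; if your SDK version differs, the raw `httpx` request shown earlier works unchanged. The payload doubles as a cross-lingual demo: English query, Spanish candidates.

```python
import os

import cohere  # pip install cohere

# Assumes the v1 client posts to {base_url}/v1/rerank, matching Tessera's path.
co = cohere.Client(
    api_key=os.environ["TESSERA_API_KEY"],
    base_url="https://api.tesseraai.cloud",
)

resp = co.rerank(
    model="Qwen/Qwen3-Reranker-4B",
    query="how do I cancel my subscription",  # English query...
    documents=[  # ...Spanish candidates, scored cross-lingually
        "Para cancelar tu suscripción, ve a Ajustes > Facturación y pulsa Cancelar.",
        "Nuestro horario de atención es de lunes a viernes, de 9 a 18.",
        "Los requisitos de contraseña son: 12+ caracteres, un número, un símbolo.",
    ],
)

for r in sorted(resp.results, key=lambda r: -r.relevance_score):
    print(r.index, round(r.relevance_score, 4))
```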