API Reference

/v1/embeddings

Embeddings for retrieval, clustering and semantic search. Model: `Qwen/Qwen3-Embedding-8B` (4,096 dimensions, multilingual, optimised for long contexts).

Overview

Converts one or more strings into dense vectors of up to 4,096 dimensions. Designed for RAG, semantic search, clustering and deduplication. Multilingual: works equally well on English, Spanish, Portuguese, Catalan and many more. Supports the `dimensions` parameter (Matryoshka) to shrink the vector without retraining.

Endpoint and model

POST `https://api.tesseraai.cloud/v1/embeddings`. Pass `model: "Qwen/Qwen3-Embedding-8B"` in the request body.

Attribute       Value
Dimensions      4,096 native; configurable downward (32–4,096) via the `dimensions` parameter
Languages       Multilingual (100+ languages, including EN, ES, PT, CA, IT, FR, DE)
Input type      string or array of strings
Max length      8,192 tokens per input
Quantisation    Q8_0 GGUF (near-FP16 quality)
Licence         Apache 2.0
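
As the table notes, `input` accepts either a single string or an array of strings. A sketch of both request-body shapes (the example texts are illustrative):

```python
import json

# `input` as a single string...
single = {"model": "Qwen/Qwen3-Embedding-8B",
          "input": "Cancel my subscription"}

# ...or as an array of strings, embedded in one batched call.
batch = {"model": "Qwen/Qwen3-Embedding-8B",
         "input": ["I want to know my invoice balance",
                   "Cancel my subscription"]}

print(json.dumps(batch, indent=2))
```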

Output dimensions

The model is trained with Matryoshka Representation Learning (MRL): the first N components of the vector form a self-contained representation of the input. You can request shorter vectors without retraining and with minimal quality loss. Pass the OpenAI-standard `dimensions` parameter and Tessera truncates the vector and L2-renormalises it server-side, ready to use with cosine similarity.

  • `dimensions` is optional. Without it you get the native 4,096-dim vector.
  • Accepts any value between 32 and 4,096. Common values: 1,536 (under pgvector's 2,000-dimension HNSW index limit), 1,024, 768, 512.
  • No extra inference cost — truncation happens post-inference in the gateway.
  • Returned vectors are L2-normalised — ready for `cosine_distance` in pgvector / Qdrant / Weaviate without extra steps.
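
The server-side step is simple to reason about: keep the first N components, then rescale to unit length. A minimal sketch of that truncate-and-renormalise logic (not Tessera's actual gateway code, just the same maths):

```python
import math

def truncate_and_renormalise(vec, dimensions):
    # Matryoshka shortening: keep only the leading `dimensions` components...
    head = vec[:dimensions]
    # ...then L2-renormalise so cosine similarity still behaves as expected.
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

# Toy 4-dim "embedding" shortened to 2 dims; the result is unit-length.
short = truncate_and_renormalise([0.6, 0.8, 0.0, 0.0], 2)
```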

Requesting 1,536 dimensions
# cURL
curl https://api.tesseraai.cloud/v1/embeddings \
  -H "Authorization: Bearer $TESSERA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-Embedding-8B",
    "input": ["Text 1", "Text 2"],
    "dimensions": 1536
  }'

# Python with the OpenAI SDK
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.tesseraai.cloud/v1",
    api_key=os.environ["TESSERA_API_KEY"],
)

resp = client.embeddings.create(
    model="Qwen/Qwen3-Embedding-8B",
    input=["Text 1", "Text 2"],
    dimensions=1536,
)
# resp.data[0].embedding → list of 1,536 floats, L2-normalised
# Ready for INSERT INTO docs (embedding) VALUES (%s) on pgvector

Request

POST /v1/embeddings
curl https://api.tesseraai.cloud/v1/embeddings \
  -H "Authorization: Bearer $TESSERA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-Embedding-8B",
    "input": [
      "I want to know my invoice balance",
      "Cancel my subscription"
    ]
  }'

Response

Identical structure to OpenAI: `data[].embedding` is a 4,096-float array per input (or whatever was requested via `dimensions`), L2-normalised.

Response
{
  "object": "list",
  "model": "Qwen/Qwen3-Embedding-8B",
  "data": [
    {"object": "embedding", "index": 0, "embedding": [0.013, 0.0062, ...]},
    {"object": "embedding", "index": 1, "embedding": [-0.022, 0.041, ...]}
  ],
  "usage": {"prompt_tokens": 12, "total_tokens": 12}
}
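
Because the returned vectors are L2-normalised, cosine similarity between two `data[].embedding` arrays reduces to a plain dot product. A sketch with toy unit vectors standing in for real 4,096-dim embeddings:

```python
def cosine(a, b):
    # For unit-length vectors, cosine similarity is just the dot product.
    return sum(x * y for x, y in zip(a, b))

# Two toy unit vectors in place of response embeddings:
score = cosine([1.0, 0.0], [0.8, 0.6])  # 0.8
```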

Typical use cases

  • RAG: index your corpus in Qdrant / pgvector / Weaviate and retrieve passages by similarity before sending them to the chat model.
  • Semantic search: a replacement for keyword search where "unpaid invoice" should match "outstanding balance".
  • Clustering / topic modelling: group tickets, documents or emails by similarity for unsupervised analysis.
  • Deduplication: detect near-identical content (duplicate FAQ entries, repeated leads) by cosine-similarity threshold.
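
The deduplication case can be sketched as a greedy pass: keep an item only if no already-kept item exceeds the similarity threshold. This assumes unit-length vectors (so cosine is a dot product), and the 0.95 threshold is illustrative:

```python
def deduplicate(embeddings, threshold=0.95):
    """Greedy near-duplicate detection over L2-normalised embeddings."""
    kept, duplicates = [], []
    for i, vec in enumerate(embeddings):
        # Cosine similarity against every item kept so far
        # (a dot product, since the vectors are unit-length).
        match = next((j for j in kept
                      if sum(x * y for x, y in zip(embeddings[j], vec)) > threshold),
                     None)
        if match is None:
            kept.append(i)
        else:
            duplicates.append((i, match))
    return kept, duplicates

# Toy 2-dim unit vectors: items 0 and 1 are near-identical, item 2 is distinct.
embs = [[1.0, 0.0], [0.999, 0.0447], [0.0, 1.0]]
print(deduplicate(embs))  # ([0, 2], [(1, 0)])
```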