Direct vs thinking mode

Operational difference

Direct (default)

Fast, economical responses for assistants, RAG, classification, extraction, and interactive flows.

Thinking (opt-in)

Spends more internal tokens and adds latency; suited to planning, complex analysis, or multi-step tasks.

Latency cost

In internal benchmarks, thinking was roughly 10x slower and generated many more tokens than direct. Tessera therefore keeps it outside the TTFT SLA and controls it by quota.

Mode	Recommended use	TTFT SLA	Quota
Direct	Interactive production	Covered by tier	Included
Thinking	Occasional reasoning	Not covered	Per tier
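To check the latency gap against your own workload, time-to-first-token can be measured on a streamed response. The following is a minimal sketch, not a Tessera tool: `time_to_first_token` is a hypothetical helper, and it assumes the stream yields OpenAI-style chunks whose answer text arrives in `choices[0].delta.content`. How thinking tokens themselves are streamed is not specified here, so the helper only measures the time until the first visible answer text.

```python
import time
from typing import Any, Callable, Iterable, Optional


def time_to_first_token(
    start_stream: Callable[[], Iterable[Any]],
    clock: Callable[[], float] = time.perf_counter,
) -> Optional[float]:
    """Return seconds from issuing the request to the first non-empty answer text.

    `start_stream` is a zero-argument callable that issues the request and
    returns the chunk iterable, for example a lambda wrapping
    client.chat.completions.create(..., stream=True).
    Returns None if the stream ends without any answer text.
    """
    start = clock()
    for chunk in start_stream():
        # Skip chunks that carry no choices at all.
        if not chunk.choices:
            continue
        # Skip chunks whose delta has no (or empty) answer text.
        if getattr(chunk.choices[0].delta, "content", None):
            return clock() - start
    return None
```

Run it once per mode with the same prompt, changing only `enable_thinking`, and compare the two values. A single run is noisy, so repeat the measurement several times before drawing conclusions.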

How to enable or disable it

The real switch is `chat_template_kwargs.enable_thinking`, a field Tessera forwards to Qwen's chat template on the backend. OpenAI's `reasoning_effort` parameter is **not translated** to this flag and is silently ignored. If you use the official OpenAI SDK, pass `chat_template_kwargs` inside `extra_body`, as the example shows. To guarantee direct mode in production, pass `enable_thinking: false` explicitly rather than relying on the default.

Enable thinking (Python SDK)

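from openai import OpenAI

client = OpenAI()  # assumes OPENAI_BASE_URL and OPENAI_API_KEY point at Tessera

response = client.chat.completions.create(
    model="Qwen/Qwen3.6-35B-A3B",
    messages=[{"role": "user", "content": "Evaluate this plan"}],
    # Not a standard OpenAI parameter, so it travels in extra_body.
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)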
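The advice to pass `false` explicitly can be sketched as a small helper that always sends the flag, so no request depends on the server-side default. This is a minimal sketch, not part of Tessera's API: `chat_kwargs` and `ask` are hypothetical names, and `client` is assumed to be an OpenAI SDK client already configured for your Tessera endpoint.

```python
from typing import Any


def chat_kwargs(prompt: str, thinking: bool = False) -> dict[str, Any]:
    """Build keyword arguments for client.chat.completions.create().

    The flag is always sent, for direct mode as well as thinking mode,
    so the request never relies on the backend default.
    """
    return {
        "model": "Qwen/Qwen3.6-35B-A3B",
        "messages": [{"role": "user", "content": prompt}],
        # reasoning_effort is deliberately absent: it is ignored by the backend.
        "extra_body": {
            "chat_template_kwargs": {"enable_thinking": bool(thinking)}
        },
    }


def ask(client: Any, prompt: str, thinking: bool = False) -> str:
    """Send one request and return the answer text (hypothetical helper)."""
    response = client.chat.completions.create(**chat_kwargs(prompt, thinking))
    return response.choices[0].message.content
```

With this in place, `ask(client, "Classify this ticket")` runs in direct mode, and `ask(client, "Evaluate this plan", thinking=True)` opts in for a single call. Keeping the default at `False` inside the helper means thinking mode is only ever used where the caller asks for it.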