## Operational difference
- **Direct**: fast, economical responses for assistants, RAG, classification, extraction, and interactive flows.
- **Thinking**: more internal tokens and higher latency for planning, complex analysis, or multi-step tasks.
## Latency cost
In internal benchmarks, thinking mode was roughly 10x slower and generated substantially more tokens than direct mode. Tessera therefore keeps it outside the TTFT SLA and meters it with a per-tier quota.
| Mode | Recommended use | TTFT SLA | Quota |
|---|---|---|---|
| Direct | Interactive production | Covered by tier | Included |
| Thinking | Occasional reasoning | Not covered | Per tier |
## How to enable or disable it
The real switch is `chat_template_kwargs.enable_thinking` — a field Tessera forwards to Qwen's chat template on the backend. OpenAI's `reasoning_effort` parameter is **not translated** to this flag and is silently ignored. If you use the official OpenAI SDK, pass it inside `extra_body` as the example shows. To guarantee direct mode in production, pass `false` explicitly — don't assume the default.
```python
# `client` is an instance of openai.OpenAI pointed at the Tessera endpoint.
client.chat.completions.create(
    model="Qwen/Qwen3.6-35B-A3B",
    messages=[{"role": "user", "content": "Evaluate this plan"}],
    # Forwarded verbatim to Qwen's chat template; True enables thinking mode.
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)
```
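To pin direct mode in production, pass `False` explicitly rather than relying on the default. A minimal sketch, assuming a small helper of our own (`direct_mode_kwargs` is hypothetical, not part of any SDK); the commented call mirrors the request shape above:

```python
def direct_mode_kwargs() -> dict:
    """Build the extra_body payload that pins direct mode.

    Passing enable_thinking=False explicitly avoids depending on
    the backend's default behavior.
    """
    return {"chat_template_kwargs": {"enable_thinking": False}}


# Usage with the official OpenAI SDK (client construction omitted):
# client.chat.completions.create(
#     model="Qwen/Qwen3.6-35B-A3B",
#     messages=[{"role": "user", "content": "Classify this ticket"}],
#     extra_body=direct_mode_kwargs(),
# )
```

Because the flag lives inside `extra_body`, it survives SDK upgrades that would reject unknown top-level parameters.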