Concepts

Direct vs thinking mode

Qwen 3.6 exposes two modes in the same model. Direct is recommended for interactive flows; thinking enables extended reasoning when you can accept higher latency. The switch is `chat_template_kwargs.enable_thinking`. Pass it explicitly on every request rather than relying on silent defaults.

Operational difference

Direct (default)

Fast, economical responses for assistants, RAG, classification, extraction and interactive flows.

Thinking (opt-in)

More internal tokens and higher latency, suited to planning, complex analysis or multi-step tasks.
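
One way to apply this split is a small routing helper that decides the flag per request. The sketch below is only illustrative: the task labels and the function names are assumptions derived from the use cases listed above, not part of any Tessera or Qwen API.

```python
# Hypothetical task labels, taken from the use cases listed above.
DIRECT_TASKS = {"assistant", "rag", "classification", "extraction", "interactive"}
THINKING_TASKS = {"planning", "complex_analysis", "multi_step"}


def choose_enable_thinking(task: str) -> bool:
    """Return the value to send as chat_template_kwargs.enable_thinking."""
    key = task.strip().lower()
    if key in THINKING_TASKS:
        return True
    if key in DIRECT_TASKS:
        return False
    # Unknown workloads fall back to direct mode, the cheaper option.
    return False


def chat_template_kwargs_for(task: str) -> dict:
    """Always emit the flag explicitly instead of leaving it to a default."""
    return {"chat_template_kwargs": {"enable_thinking": choose_enable_thinking(task)}}
```

The returned dictionary is shaped to be passed as `extra_body` when using the OpenAI SDK. Falling back to direct mode for unknown labels is a design choice of this sketch, made because thinking is the slower, quota-controlled mode.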

Latency cost

In internal benchmarks, thinking was roughly 10x slower and generated many more tokens than direct. Tessera therefore keeps it outside the TTFT SLA and controls it by quota.

| Mode | Recommended use | TTFT SLA | Quota |
| --- | --- | --- | --- |
| Direct | Interactive production | Covered by tier | Included |
| Thinking | Occasional reasoning | Not covered | Per tier |
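
To check the latency gap on your own workload, you can time a streamed response in each mode. The helper below is a minimal sketch, not a Tessera tool: `measure_stream` is a hypothetical name, it accepts any iterable of chunks, and it treats the arrival of the first chunk as a proxy for TTFT, which may differ from the first visible content token.

```python
import time


def measure_stream(chunks, clock=time.perf_counter):
    """Consume a stream and return (ttft_seconds, total_seconds, chunk_count).

    ttft_seconds is None when the stream yields nothing.
    """
    start = clock()
    ttft = None
    count = 0
    for _ in chunks:
        if ttft is None:
            # Time elapsed until the first chunk arrived.
            ttft = clock() - start
        count += 1
    total = clock() - start
    return ttft, total, count
```

With the OpenAI SDK, the iterable would be the object returned by `client.chat.completions.create(..., stream=True)`. Run it once with `enable_thinking` set to `False` and once with `True`, then compare the results. The injectable `clock` argument is there only to make the helper easy to test.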

How to enable or disable it

The real switch is `chat_template_kwargs.enable_thinking`, a field Tessera forwards to Qwen's chat template on the backend. OpenAI's `reasoning_effort` parameter is **not translated** to this flag and is silently ignored. If you use the official OpenAI SDK, pass `chat_template_kwargs` inside `extra_body`, as the example shows. To guarantee direct mode in production, set `enable_thinking` to `false` explicitly rather than assuming the default.

Enable thinking (Python SDK)
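from openai import OpenAI

client = OpenAI(base_url="<TESSERA_BASE_URL>", api_key="<TESSERA_API_KEY>")  # placeholders: substitute your own values
response = client.chat.completions.create(
    model="Qwen/Qwen3.6-35B-A3B",
    messages=[{"role": "user", "content": "Evaluate this plan"}],
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)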
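
The opposite case, forcing direct mode, can be wrapped in a helper that refuses to leave the flag implicit. This is a sketch under stated assumptions: `build_chat_request` is a hypothetical helper, not part of the OpenAI SDK or of Tessera. It only assembles keyword arguments, and it drops `reasoning_effort` with a warning because the text above says that parameter is ignored.

```python
import warnings


def build_chat_request(model, messages, enable_thinking, **params):
    """Build kwargs for client.chat.completions.create with an explicit mode."""
    if not isinstance(enable_thinking, bool):
        raise TypeError("enable_thinking must be True or False, not left implicit")
    if "reasoning_effort" in params:
        # Per the text above, this parameter is not mapped to enable_thinking.
        warnings.warn("reasoning_effort is ignored; use enable_thinking instead")
        params.pop("reasoning_effort")
    # Copy any caller-supplied extra_body so the caller's dicts are not mutated.
    extra_body = dict(params.pop("extra_body", {}))
    template_kwargs = dict(extra_body.get("chat_template_kwargs", {}))
    template_kwargs["enable_thinking"] = enable_thinking
    extra_body["chat_template_kwargs"] = template_kwargs
    return {"model": model, "messages": messages, "extra_body": extra_body, **params}


# Direct mode, stated explicitly.
request = build_chat_request(
    model="Qwen/Qwen3.6-35B-A3B",
    messages=[{"role": "user", "content": "Classify this ticket"}],
    enable_thinking=False,
)
```

The result can then be sent with `client.chat.completions.create(**request)`.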