Qwen 3.5: The Local-Inference Darling with an Agentic Flaw

Qwen 3.5 — Crowd Intelligence Report

SEO Brief

SEO title: Qwen 3.5: The Local-Inference Darling with an Agentic Flaw
Meta description: Qwen 3.5 runs at 35 tok/s on an RTX 4090 under Apache 2.0, free. But the 4B model loops forever and thinking mode cannot be turned off.
Canonical path: /research/qwen35
Primary search intent: Understand whether Qwen 3.5 is reliable enough for local inference and agentic workflows, or if it only works on simple tasks.
Target keywords: Qwen 3.5 review, Qwen 3.5 vs Llama 4, Qwen 3.5 local inference, Qwen 3.5 tool calling, is Qwen 3.5 good, Qwen 3.5 thinking problem, Qwen 3.5 coding, Alibaba Qwen review

Report Status

Readiness: publishableseed (90.0/100)
Generated: 2026-06-03T09:37:47.034169+00:00
Entity type: topic
Industry: Artificial Intelligence / Foundation Models
Data foundation: 1,505 content items, 994 extracted opinion units, 50 entity insights, 32 sampled evidence links.

Runs on Your Hardware, Breaks on Your Workflows

"I built a coding agent that gets 87% on benchmarks with a 4B parameter model." That Reddit post, from a developer who created SmallCode because every existing coding agent assumed you were running GPT-5.4 or Claude Opus, captures both the promise and the audience for Qwen 3.5. These are developers who refuse to accept that useful AI requires renting someone else's hardware.

Qwen 3.5 is a family of open-weight language models from Alibaba's Qwen team, released in stages between February and March 2026. The lineup spans nine sizes: a flagship 397-billion-parameter MoE model (397B-A17B, released February 16), a medium tier including a 122B MoE and a 27B dense model (released February 24), and a small-model series from 0.8B to 9B (released March 2). All are released under the Apache 2.0 license, one of the most permissive options available, with no restrictions on commercial use. No 700-million-user caps like Llama. No "Built with Qwen" branding requirements. Just download and use.

The models are built on a Mixture-of-Experts architecture with always-on reasoning and strong tool-calling benchmark scores. The 27B variant hits roughly 35 tokens per second on an RTX 4090 with Q4 quantization, making it the speed king at its size tier. Through Alibaba Cloud's API, pricing starts at $0.033 per million input tokens for the Turbo tier and scales to $1.04 per million for Qwen-Max. But the real action is local: developers running these models on their own machines, filing bug reports when they break, and comparing results against everything else in the open-weight ecosystem.
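To put those figures in practical terms, the arithmetic can be sketched in a few lines of Python. The throughput and prices are the ones quoted above; the workload sizes (a 700-token answer, 10 million input tokens) and the function names are illustrative assumptions, and output-token pricing is left out because the report does not give it.

```python
# Figures quoted in the report; everything else is an illustrative assumption.
TOKENS_PER_SECOND = 35.0        # 27B, Q4 quantization, RTX 4090 (approximate)
TURBO_USD_PER_M_INPUT = 0.033   # Turbo tier, per million input tokens
MAX_USD_PER_M_INPUT = 1.04      # Qwen-Max, per million input tokens


def generation_seconds(output_tokens: int,
                       tokens_per_second: float = TOKENS_PER_SECOND) -> float:
    """Wall-clock time to generate output_tokens at a steady decode rate."""
    return output_tokens / tokens_per_second


def input_cost_usd(input_tokens: int, usd_per_million: float) -> float:
    """API cost for input tokens only (output pricing is not modeled)."""
    return input_tokens / 1_000_000 * usd_per_million


if __name__ == "__main__":
    # Hypothetical workload: one 700-token answer, 10M input tokens per month.
    print(f"700-token answer, local: {generation_seconds(700):.1f} s")
    print(f"10M input tokens, Turbo: ${input_cost_usd(10_000_000, TURBO_USD_PER_M_INPUT):.2f}")
    print(f"10M input tokens, Qwen-Max: ${input_cost_usd(10_000_000, MAX_USD_PER_M_INPUT):.2f}")
```

At a steady 35 tokens per second, a 700-token answer takes about 20 seconds, and Qwen-Max input pricing is a little over 30 times the Turbo rate. Real decode speed varies with context length and prompt processing, so treat the local figure as a rough estimate.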

"Every coding agent assumes you're running GPT5.4 or Claude Opus. I was frustrated." Reddit developer who built SmallCode, a localfirst coding agent using Qwen 3.5 4B

r/LocalLLaMA: I built a coding agent that gets 87% on benchmarks with a 4B parameter model

Qwen 3.5 complete guide all model sizes and capabilities

The Model That Gets Stuck Brainstorming

Qwen 3.5 has earned a reputation for strong tool-calling capabilities on standard benchmarks, and that reputation is what draws developers in. The flagship 397B-A17B competes with frontier closed models on reasoning and agentic benchmarks. Qwen 3.6 Plus leads on MCPMark at 48.2% for tool-calling reliability. The numbers look good on paper.

Then you try to build something with it.

On GitHub, a user testing Qwen 3.5 4B in Perplexica/Vane found the model stuck indefinitely on its brainstorming phase, never triggering the search tool that the entire workflow depends on. The 35B variant throws parsing errors during podcast generation. The 9B variant requires repeated workarounds and retries just to complete basic agent loops. A separate GitHub issue proposes an entire multi-model fallback cascade in which the system switches to Qwen 3.5 4B only after two JSON validation failures from the primary model, treating it as the safety net, not the first choice.
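The fallback cascade described in that issue can be sketched roughly as follows. This is a minimal illustration, not the issue's actual implementation: the names call_with_fallback and parse_json_object are hypothetical, each model is represented by a plain callable that returns text, and "validation" is reduced to checking that the reply parses as a JSON object.

```python
import json
from typing import Callable, Optional, Tuple

# A "model" here is any callable mapping a prompt to raw text. In practice it
# would wrap a local inference server; that wiring is deliberately omitted.
Model = Callable[[str], str]


def parse_json_object(text: str) -> Optional[dict]:
    """Return the parsed object if text is a valid JSON object, else None."""
    try:
        value = json.loads(text)
    except json.JSONDecodeError:
        return None
    return value if isinstance(value, dict) else None


def call_with_fallback(prompt: str, primary: Model, fallback: Model,
                       max_primary_failures: int = 2) -> Tuple[dict, str]:
    """Try the primary model; after repeated invalid replies, try the fallback once."""
    for _ in range(max_primary_failures):
        parsed = parse_json_object(primary(prompt))
        if parsed is not None:
            return parsed, "primary"
    parsed = parse_json_object(fallback(prompt))
    if parsed is None:
        raise ValueError("primary and fallback both returned invalid JSON")
    return parsed, "fallback"


if __name__ == "__main__":
    # Stub models standing in for real inference calls.
    chatty = lambda prompt: "Sure! Here is the JSON you asked for"
    strict = lambda prompt: '{"tool": "search", "query": "qwen"}'
    print(call_with_fallback("find docs", chatty, strict))
```

Returning the source label alongside the parsed object makes it easy to log how often the safety net is actually used, which is the number that matters when deciding whether the primary model is fit for the job.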

"Qwen 3.5 4B gets stuck on 'Brainstorming' and never calls the search tool." GitHub issue on FastFlowLM, documenting a workflow that works perfectly with Qwen 3 VL 4B

One Reddit post comparing Qwen 3.5 122B to the newer Qwen 3.6 35B found that the larger, older model still outperforms "by a large margin" because the 3.6 variant "gets lost as long as the task requires a couple more steps." The pattern is consistent across model sizes: Qwen 3.5 benchmarks well on single-turn tool calls but degrades on the multi-step, multi-tool workflows that agentic applications actually require.

GitHub: Qwen 3.5 4B versus Qwen 3 VL 4B IT in Perplexica/Vane

r/LocalLLaMA: Qwen 3.6 35B A3B vs Qwen 3.5 122B A10B

GitHub: Multi-model fallback cascade

You Cannot Turn Off Thinking

If there is one issue that unites Qwen 3.5 users across every platform, it is the inability to reliably disable the model's chain-of-thought reasoning. Qwen 3 supported /think and /no_think soft switches that gave developers explicit control. Qwen 3.5 removed that capability. The Qwen team has confirmed this is by design: "Qwen3.5 does not officially support the soft switch of Qwen3."

The result is that every response begins with an extended <think> block, regardless of whether the developer requested it or the task warrants it. Setting enable_thinking: false in API parameters does not reliably suppress it. In vLLM, a documented bug shows the parameter being silently ignored. In LM Studio, the thinking is hidden from the user interface but still generated, consuming tokens and adding latency. In llama.cpp, the flag works but only with a specific server configuration that most users do not know about.
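Because the flag cannot be trusted on every runtime, one defensive option is to inspect the returned text instead of assuming thinking was suppressed. The sketch below is an illustration under stated assumptions: split_thinking is a hypothetical helper, it only sees reasoning that arrives inline between think tags (some servers return it in a separate field or hide it entirely), and it counts characters as a rough proxy for tokens.

```python
import re
from typing import Tuple

# Matches a complete <think>...</think> block, or an unterminated one that
# runs to the end of the text (for example when generation was cut off).
THINK_BLOCK = re.compile(r"<think>.*?(?:</think>|\Z)", re.DOTALL)


def split_thinking(raw: str) -> Tuple[str, int]:
    """Return (answer without think blocks, characters spent inside them)."""
    spent = sum(len(match.group(0)) for match in THINK_BLOCK.finditer(raw))
    answer = THINK_BLOCK.sub("", raw).strip()
    return answer, spent


if __name__ == "__main__":
    reply = "<think>plan the joke</think>\nWhy did the GPU cross the road?"
    answer, spent = split_thinking(reply)
    # A non-zero count means reasoning was generated despite the request.
    print(f"answer={answer!r} thinking_chars={spent}")
```

Logging the character count per request gives a cheap signal of whether a given runtime honored the setting. A count of zero is not proof that no reasoning was generated, only that none appeared inline in the text the client received.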

"Disabling thinking only hides the model's thought process rather than disabling the feature itself." YouTube commenter @wrenmolot

On YouTube, a user reported that asking Qwen 3.5 32B-A3B to tell 10 jokes produces "page after page of thinking" before any jokes appear. Another user observed that turning off thinking makes the results noticeably worse, creating a lose-lose situation where you either accept the latency penalty or accept degraded output.

For latency-sensitive applications, the problem is severe. When a model is inside a tool-calling loop, you need it to execute reliably and move to the next action, not deliberate internally about whether to proceed. The thinking mode that makes Qwen 3.5 good at reasoning makes it unreliable at acting.
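One way to keep a deliberating model from stalling a workflow is to bound the loop from the outside. The following sketch assumes a hypothetical next_step callable that wraps the model and returns a dict with a "tool" key (None when the model only produced text, "finish" when it is done). Neither the step shape nor the thresholds come from Qwen or from any agent framework.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical step shape: {"tool": tool name or None, "text": model output}.
Step = Dict[str, Optional[str]]


def run_bounded_loop(next_step: Callable[[List[Step]], Step],
                     max_steps: int = 8,
                     max_idle_steps: int = 2) -> List[Step]:
    """Run an agent loop, aborting if the model keeps talking without acting."""
    history: List[Step] = []
    idle = 0  # consecutive steps that produced no tool call
    for _ in range(max_steps):
        step = next_step(history)
        history.append(step)
        if step.get("tool") == "finish":
            return history
        idle = 0 if step.get("tool") else idle + 1
        if idle >= max_idle_steps:
            raise RuntimeError(f"no tool call in {idle} consecutive steps; aborting")
    raise RuntimeError(f"step budget of {max_steps} exhausted")


if __name__ == "__main__":
    # Scripted stand-in for a model that searches once and then finishes.
    script = iter([
        {"tool": "search", "text": ""},
        {"tool": "finish", "text": "done"},
    ])
    print(run_bounded_loop(lambda history: next(script)))
```

Failing loudly after a couple of idle steps turns an indefinite "Brainstorming" stall into an error the caller can handle, for instance by retrying or switching models. A wall-clock timeout around each model call would address latency more directly; it is left out here to keep the sketch short.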

GitHub: Qwen 3.5 ignores thinking-off via LM Studio API

GitHub: Qwen 3.5 Model Switching Between Thinking and Non-Thinking (LM Studio bug tracker)

vLLM issue: Qwen3.5 cannot close thinking by enable_thinking: false

HuggingFace: How to disable or reduce thinking on Qwen3.5-9B

The Local Community Builds Anyway

Qwen 3.5's natural audience is the local-inference community: developers running models on their own hardware for privacy, cost, or latency reasons. This community is vocal, technical, and remarkably forgiving of rough edges. They are building real products on Qwen: coding agents that pass 87% of benchmark tasks, smart home controllers with tiered model fallback, podcast generators, and offline coding backends. That kind of hands-on adoption is what separates a model that gets attention from one that gets used.

The Apache 2.0 license is a major factor. In comparisons between Qwen 3.5, Llama 4, and Gemma 4, every guide recommends Qwen for "unrestricted use": no commercial caps, no branding requirements, no EU geographic restrictions. For a local coding assistant, multiple reviews recommend Qwen 3.5 27B specifically: it has the best SWE-bench score at its size tier, the fastest inference, and good compatibility with tools like Continue.dev.

But the same community that builds on Qwen also documents its failures. GGUF loading fails with 500 errors in Ollama. Performance has regressed: Qwen 3.5 runs much slower than Qwen 3 on Intel SYCL hardware, and concurrency in Ollama is effectively serialized even on high-memory Apple Silicon. Checkpoint loading in llama.cpp crashes after a specific pull request. The 9B model's quantized lm_head stopped working in vLLM.

"Qwen 3.5 is missing from the leaderboard, an embarrassing omission given its agentic performance." - GitHub issue on ServiceNow EnterpriseOpsGym

A quieter but strategically important signal: Qwen 3.5 is being omitted from the very benchmark leaderboards where developers make their model decisions. Multiple GitHub issues call this out. For a model family that competes primarily on capability-per-dollar, invisibility in the comparison tables is a material adoption problem.

GitHub: Leaderboard missing Qwen 3.5, an embarrassing omission

GitHub: Qwen 3.5 models from HuggingFace don't work in Ollama

GitHub: Qwen 3.5 Performance on Arrow Lake iGPU/dGPU with SYCL

Alibaba's Advantage and Alibaba's Problem

There is a conversation that happens around every Chinese-origin AI model, and Qwen is not exempt from it. Some developers hesitate over data sovereignty concerns. Others worry about long-term support if geopolitical tensions escalate. These concerns are real, but the data suggests they are not the primary barrier to adoption; the technical issues are.

Alibaba has a genuine strategic advantage in the open-weight race. The company is iterating faster than any competitor: Qwen 3.5 began shipping in February, Qwen 3.6 followed in April, and Qwen 3.7 Max was announced in May. The model family spans from 0.8B models that run on phones to 397B flagships that compete with closed-source frontier models. The pricing is aggressive: free weights, cheap API access, and a permissive license that places no restrictions on commercial use.

But speed of release is different from reliability of release. Alibaba keeps shipping new model generations before the tooling for the current one stabilizes. Qwen 3.5's thinking controls do not work correctly. Its small models fail on agentic workflows. Its GGUF loading is broken in Ollama. And by the time these issues get fixed, the community's attention has already shifted to Qwen 3.6, which has its own problems.

The developers who are most excited about Qwen 3.5 are the ones building