Llama 4: The Open-Source Model Too Big to Run and Too Fudged to Trust

Llama 4 — Crowd Intelligence Report

SEO Brief

SEO title: Llama 4 Review: Fudged Benchmarks, a $10K Hardware Problem, and Meta's Open-Source Retreat
Meta description: Meta's departing AI chief confirmed Llama 4 benchmarks were "fudged a little bit." The smallest model needs 50GB+ of RAM. Developers say it writes worse than Llama 3.3. And Meta's pivot to proprietary models suggests the open-weight era may be ending.
Canonical path: /research/llama4
Primary search intent: Understand whether Llama 4 lives up to Meta's claims after the benchmark cheating allegations and hardware accessibility issues.
Target keywords: Llama 4 review, Llama 4 benchmark cheating, Llama 4 vs Llama 3, Llama 4 Scout local, is Llama 4 good, Llama 4 hardware requirements, Meta AI benchmark scandal, Llama 4 worth it

Report Status

Readiness: publishableseed (90.0/100)
Generated: 2026-06-03T09:37:39.317513+00:00
Entity type: topic
Industry: Artificial Intelligence / Foundation Models
Data foundation: 1,139 content items, 1,083 extracted opinion units, 60 entity insights, 37 sampled evidence links.

"Results Were Fudged": Meta's Own AI Chief

The quote that turned Meta's biggest AI launch into a credibility crisis came not from a competitor or a journalist but from the man who built Meta's AI research program. Yann LeCun, departing after more than a decade to start his own venture, told the Financial Times that Llama 4's benchmark results were "fudged a little bit": different versions of the model had been submitted to different benchmarks, each one chosen to score best on that particular test.

Normally, researchers submit a single version of a new model for all benchmarks. Meta did not do that. And when LeCun said so publicly, the reaction inside the company was immediate. "Mark was really upset," LeCun told the Financial Times, "and basically lost confidence in everyone who was involved in this. And so basically sidelined the entire GenAI organization."

Llama 4 is Meta's fourth-generation family of open-weight language models, released on April 5, 2025. It ships in three variants: Scout (17 billion active parameters, 109 billion total, 10-million-token context window), Maverick (17 billion active parameters, 400 billion total across 128 experts, 1-million-token context), and Behemoth (288 billion active parameters, 2 trillion total, still forthcoming). The models use a Mixture-of-Experts architecture, are natively multimodal, and can be downloaded from llama.com and Hugging Face. Meta's hosted API is currently free. Third-party providers like DeepInfra and Fireworks charge as little as $0.15 per million input tokens for Maverick.

The pitch is straightforward: frontier-class capability at zero licensing cost, running on your own hardware. And that pitch worked until the benchmarks it rested on fell apart.

"Meta cheats on Llama 4 benchmark" - Hacker News front-page headline, later confirmed by LeCun himself

Slashdot: "Results Were Fudged" - Departing Meta AI Chief Confirms Llama 4 Benchmark Manipulation

Neowin: Unmodified Llama 4 Maverick ranks below rivals following Meta cheating allegations

Fast Company: Yann LeCun says Meta "fudged a little bit" when benchmark-testing Llama 4

The Community That Built Llama Feels Betrayed

The Llama ecosystem was built by people who value transparency above almost everything else. Researchers who publish their training methodologies. Developers who file reproducibility reports. Hobbyists who spend weekends quantizing models on consumer GPUs so everyone can participate. When that community learned that the benchmarks they used to justify choosing Llama over proprietary alternatives had been manipulated, the response was not just disappointment; it was a specific, technical kind of anger.

LMArena data showed unmodified Llama 4 Maverick ranking below rivals after the allegations surfaced. Hacker News ran multiple front-page stories. On r/LocalLLaMA, the subreddit where Llama's most dedicated users gather, the hands-on reviews were brutal.

"Its writing quality is poor and seems on par with or worse than Llama 3." - r/LocalLLaMA user brown2green, who also noted Scout requires 50+ GB even in 4-bit quantization

Another Redditor, RedRedditorReddit, offered a more measured but equally damaging assessment: the quality is "close to, but not quite as good as, Llama 3.3." For a model that was supposed to be a generational leap, being compared unfavorably to its predecessor is a product failure regardless of benchmark scores.

Users also report that Llama 4 feels more selectively censored than Llama 3. The combination of worse writing, more censorship, and fabricated benchmarks has created a trust deficit that Meta cannot fix with a press release.

"As far as I can tell there is no 'open source' here, just open weights. Earlier releases came with more transparency about training infrastructure." - YouTube commenter on a Llama 4 review

The Open Source Initiative has been explicit: Llama's licenses do not meet the Open Source Definition. The Llama 4 Community License restricts commercial use above 700 million monthly active users, prohibits use of outputs to train competing models, and, critically, excludes individuals and companies domiciled in the European Union from using the multimodal capabilities. Meta calls it "open." The OSI calls it openwashing.

r/LocalLLaMA: Back to local - what's your experience with Llama 4?

OSI: Meta's LLaMA license is still not Open Source

The $10,000 Mac Studio Problem

Llama 4's hardware requirements have become a defining barrier, and an ironic one for a model family whose entire pitch is that you can download and run it yourself.

Scout, the smallest variant, requires over 50 GB in 4-bit quantization. That means it will not fit in the VRAM of any consumer GPU. It requires either a unified-memory system like Apple Silicon with enough RAM, or a multi-GPU server setup. On an M3 Ultra with maximum memory, users report around 47 tokens per second, which is usable, but only on a machine that costs upward of $7,000.

"I now need a $10,000 M3 Mac Studio with 512GB of RAM to run Llama 4. Excited to play the lottery." - YouTube commenter @JohnSmith762A11B

The Mixture-of-Experts architecture is part of the problem. MoE models load all parameters into memory, not just the active ones. Scout has 17 billion active parameters but 109 billion total, and every one of those 109 billion needs to sit in memory. Maverick is 400 billion total. Behemoth is 2 trillion. On Reddit, one user pointed out that Behemoth "is 2T parameters, and they do not have any hardware that can handle it unless they rent GPU servers from data centers."
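The arithmetic behind that paragraph can be sketched in a few lines of Python. This is a back-of-the-envelope estimate, not a measurement: it assumes every parameter is stored at the same bit width, counts weights only (no KV cache or runtime overhead), and uses a hypothetical 24 GB consumer GPU, such as an RTX 3090, as the reference point. The function and constant names are illustrative.

```python
# Rough weight-only memory estimate for a Mixture-of-Experts model.
# Assumptions: uniform bit width for all parameters, weights only
# (KV cache and runtime overhead are ignored), 1 GB = 1e9 bytes.

def weight_memory_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate memory needed to hold the weights, in gigabytes."""
    total_bytes = params_billion * 1e9 * bits_per_param / 8
    return total_bytes / 1e9

# Total parameter counts quoted in the article, in billions.
TOTAL_PARAMS_B = {"Scout": 109, "Maverick": 400, "Behemoth": 2000}

# Hypothetical reference point: a 24 GB consumer GPU.
CONSUMER_VRAM_GB = 24

for name, total_b in TOTAL_PARAMS_B.items():
    gb = weight_memory_gb(total_b, bits_per_param=4)
    print(f"{name}: ~{gb:.1f} GB at 4-bit, fits in {CONSUMER_VRAM_GB} GB: {gb <= CONSUMER_VRAM_GB}")

# For contrast: if only Scout's 17B active parameters had to be resident,
# the footprint would be far smaller. MoE inference does not work that way,
# because any expert may be selected for any token.
active_only_gb = weight_memory_gb(17, bits_per_param=4)
print(f"Scout, active parameters only: ~{active_only_gb:.1f} GB")
```

Under these assumptions Scout's weights alone come to roughly 54.5 GB at 4-bit, which is consistent with the "over 50 GB" figure users report, while the active-parameter count on its own would suggest a model that fits comfortably on a single consumer card.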

For the local-inference community (the people running models on RTX 3090s and Mac Minis and AMD AI Max laptops), Llama 4 is effectively inaccessible. The model that is supposed to be the open-weight alternative to GPT and Claude now requires the same kind of cloud infrastructure that GPT and Claude run on.

"A bit too big to justify for local inference at present if you don't have a unified memory system." - r/LocalLLaMA user Thellton

YouTube: Llama 4 hardware requirements deep-dive

BIZON: Llama 4 GPU System Requirements

Who Is Actually Running It?

If Llama 4 is too big for most local setups, the question becomes: where is it actually being used?

The answer, overwhelmingly, is cloud APIs. On launch day, Llama 4 received support from AWS, Google Cloud, Azure, Together AI, Groq, and Fireworks. Groq hosts Scout and runs it at over 460 tokens per second. Together AI has Maverick on serverless. Fireworks offers approximately 1 million tokens of context for Scout. DeepInfra provides the cheapest Maverick access at $0.15 per million input tokens and $0.60 per million output tokens, roughly 90% cheaper than Claude Sonnet.
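To make the per-token pricing concrete, here is a minimal cost sketch in Python using the DeepInfra Maverick prices quoted above. The workload figures (80 million input tokens and 20 million output tokens per month) are hypothetical and chosen only for illustration; provider prices change, so treat the constants as a snapshot of what the article reports rather than a current rate card.

```python
# Per-million-token prices for Llama 4 Maverick on DeepInfra, as quoted
# in the article (USD). These may not reflect current pricing.
INPUT_PRICE_PER_M = 0.15
OUTPUT_PRICE_PER_M = 0.60

def api_cost_usd(input_tokens: int, output_tokens: int,
                 input_price: float = INPUT_PRICE_PER_M,
                 output_price: float = OUTPUT_PRICE_PER_M) -> float:
    """Cost of a workload billed separately for input and output tokens."""
    return (input_tokens / 1e6) * input_price + (output_tokens / 1e6) * output_price

# Hypothetical monthly workload: 80M input tokens, 20M output tokens.
monthly_cost = api_cost_usd(80_000_000, 20_000_000)
print(f"Estimated monthly API cost: ${monthly_cost:.2f}")
```

Because output tokens cost four times as much as input tokens at these rates, the input/output mix of a workload matters as much as its total volume when comparing providers.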

For organizations processing more than about 50 to 100 million tokens per month, self-hosting on rented H100s can break even against managed APIs within 6 to 12 months. Below that volume, the managed APIs are almost always cheaper.

But this creates a philosophical problem for the Llama project. Meta's stated goal, "build the world's leading AI, open source it, and make it universally accessible," rings hollow when the smallest model in the family requires a $7,000 computer and the largest one requires a data center. The developers who made Llama popular were running it on machines they already owned. Llama 4 requires machines most of them cannot afford.

"What open-weights mean today: you can download the model, but you might need to rent the hardware to run it." - Threads post on @arcforai

Groq: Llama 4 Now Live - build fast at the lowest cost

LLMWise: Llama 4 API Pricing 2026 - Meta AI Cost vs Self-Hosting

Why Does Meta Give Away Models?

Mark Zuckerberg's public explanation has been consistent: "Our goal is to build the world's leading AI, open source it, and make it universally accessible so that everyone in the world benefits." The business logic is less altruistic. Open-weight models create an ecosystem that depends on Meta's infrastructure, training pipeline, and model releases, even if the weights themselves are free. Every developer who builds on Llama is a developer who is not building on a closed competitor.

But there are signs that the open-weight era at Meta may be ending. In April 2026, Meta released Muse Spark, a proprietary model and a deliberate departure from the open-weights strategy. When asked whether future Llama models would continue to be open, Meta said only that "existing Llama models would remain available." Whether the Llama family will continue to be developed at all was left ambiguous.

"This might be the last frontier of open AI." - Threads post reacting to the Llama 4 launch

For Meta's business model, the shift makes sense. A company planning to equip billions of users with personal AI agents has reasons to keep control of the underlying model. But for the developer community that built its workflows around Llama's openness, the Muse Spark precedent