Local Python on the Mac: Gemma 4, Its Coding Finetune, and a Current Specialist

Gemma 4 12B is celebrated in the local-model community as the most complete local model. For pure Python the interesting question is whether the multimodal generalist holds up against specialized coders. I ran three models against each other on an M3 Max, with MLX and real execution of the generated code, from simple functions to an arithmetic parser that none of the three solves reliably.

The three candidates

Three models with different profiles are in the race.

Gemma 4 12B (base). Google’s encoder-free, multimodal generalist, released June 3, 2026, under Apache 2.0.¹ A single model for text, image, video, and audio. Google never marketed it as a coding model.

Gemma 4 12B Coder (Fable5/Composer). A community finetune of the base, trained specifically on verified Python. It was distilled from real reasoning traces of Cursor Composer 2.5, keeping only those whose code passed the tests, plus a “second attempt” set from Claude Fable 5 for exactly the cases Composer got wrong.² In effect it is a local model that learned a piece of the big cloud models’ reasoning, shortly before Fable 5 was shut down.

Qwen3-Coder-30B-A3B. Alibaba’s current, dedicated coding specialist. Despite 30 billion parameters it is a Mixture-of-Experts that activates only about 3 billion per token. That is exactly what makes it interesting on the Mac.

Setup

All three run locally via MLX. One technical detail up front: Gemma 4 is a vision-language model with the gemma4_unified architecture. The usual mlx_lm loader refuses it; you need mlx_vlm. There is no ready-made MLX build of the coder finetune, so I quantized the bf16 master weights to MLX 4bit myself (affine, group size 64, which comes out at 4.5 bits per weight). Qwen3-Coder runs as a pure text model directly via mlx_lm.


Hardware	Apple M3 Max, 64 GB unified memory
Gemma 4 base	`mlx-community/gemma-4-12B-it-qat-4bit` (`mlx_vlm`)
Gemma 4 coder	`yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1`, self-converted to MLX 4bit
Qwen3-Coder	`mlx-community/Qwen3-Coder-30B-A3B-Instruct-4bit` (`mlx_lm`)
Parameters	temp 0.2, up to 1536 tokens
Tasks	10 Python functions, each checked with asserts
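
The 4.5 bits per weight of the self-converted coder model can be reproduced with a little arithmetic. The sketch below assumes that group-wise affine quantization stores one 16-bit scale and one 16-bit bias per group of 64 weights; that storage layout and the nominal 12 billion quantized weights are assumptions for illustration, and the function names are hypothetical.

```python
def effective_bits_per_weight(bits: int, group_size: int,
                              scale_bits: int = 16, bias_bits: int = 16) -> float:
    """Average storage per weight for group-wise affine quantization.

    Assumption: each group carries one scale and one bias on top of the
    quantized values, both stored with 16 bits.
    """
    overhead = (scale_bits + bias_bits) / group_size
    return bits + overhead


def quantized_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (10^9 bytes), ignoring unquantized layers."""
    return n_params * bits_per_weight / 8 / 1e9


bpw = effective_bits_per_weight(bits=4, group_size=64)
print(f"{bpw} bits per weight")                      # 4 + 32/64
print(f"{quantized_size_gb(12e9, bpw):.2f} GB for a nominal 12B model")
```

Under these assumptions the group metadata adds half a bit per weight, which is where the gap between "4bit" and 4.5 bits per weight comes from.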

The test is strict: for each task the generated code is extracted from the model's reply, executed in a separate subprocess, and checked against asserts. A pass means the code runs and returns the correct result for every test case.
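
A check of this kind can be sketched in a few lines. This is not the harness used for the measurements, only a minimal illustration: the helper names, the temporary file, and the 10-second timeout are assumptions, and the reply is assumed to contain at most one fenced code block.

```python
import re
import subprocess
import sys
import tempfile
from pathlib import Path

FENCE = "`" * 3  # built at runtime so this example can itself sit inside a fence
CODE_BLOCK = re.compile(FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(reply: str) -> str:
    """Return the first fenced code block, or the whole reply if there is none."""
    match = CODE_BLOCK.search(reply)
    return match.group(1) if match else reply


def run_against_asserts(code: str, asserts: str, timeout: float = 10.0) -> tuple[bool, str]:
    """Run code plus asserts in a fresh interpreter; a pass means exit code 0."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(code + "\n\n" + asserts + "\n", encoding="utf-8")
        try:
            proc = subprocess.run([sys.executable, str(script)],
                                  capture_output=True, text=True, timeout=timeout)
        except subprocess.TimeoutExpired:
            return False, "timeout"
    return proc.returncode == 0, proc.stderr.strip()


# Hypothetical model reply for the is_palindrome task.
reply = ("Here you go:\n" + FENCE + "python\n"
         "def is_palindrome(s):\n    return s == s[::-1]\n" + FENCE)
ok, err = run_against_asserts(
    extract_code(reply),
    "assert is_palindrome('abba')\nassert not is_palindrome('abc')",
)
print(ok, err)
```

Running each candidate in its own interpreter means a syntax error, a RecursionError, or a failed assert all surface the same way, as a non-zero exit code, and a hanging solution is cut off by the timeout.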

Results

Ten tasks, from is_palindrome through two_sum and the longest common subsequence to two hard ones: an arithmetic parser with precedence, parentheses, and unary minus, and the canonicalization of a Unix path.

Task	Gemma 4 base	Gemma 4 coder	Qwen3-Coder 30B-A3B
8 standard and medium tasks	8/8	8/8	8/8
simplify_path	PASS	PASS	PASS
calc (parser, 5 runs)	1/5	0/5	0/5
Avg. decode speed	~29 tok/s	~44 tok/s	~92 tok/s

On the eight standard to medium tasks all three are reliable (isolated single-run outliers, for instance on the version comparison or the Roman numerals, vanished on repetition). The field separates in two places: at the parser and on speed.

The parser none of them solves reliably

The calc task requires a full expression evaluator with operator precedence, parentheses, and unary minus. To rule out good or bad luck, I ran it five times per model. The result is clear: Gemma 4 base solved it in one of five runs, the coder finetune and Qwen3-Coder in none.

The failure modes differ:

Gemma 4 base produced runnable code with a wrong result in four of five runs, correct once. So the error is a logic bug in precedence or associativity, not a syntax error. The model writes a plausible parser that simply computes wrong.
Gemma 4 coder failed in mixed ways: sometimes a syntax error, sometimes a RecursionError, sometimes a wrong result.
Qwen3-Coder consistently delivered runnable but wrongly computing code.

To keep this fair: I checked the test cases against a correct reference implementation beforehand, and they are consistent. The parser is simply too complex for a single attempt at these model sizes at low temperature. A palindrome test never shows this: on standard tasks local models of this class are strong, but on a task with precedence, parentheses, and unary minus they hit the wall as a group.
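
For orientation, this is roughly what a correct solution has to get right. The reference implementation used for the check is not reproduced here; the following is a separate minimal recursive-descent sketch. It assumes that calc takes a string and supports +, -, *, / on numbers, which is an assumption about the task, not its exact specification.

```python
import re

TOKEN = re.compile(r"\s*(\d+\.?\d*|[-+*/()])")


def tokenize(text: str) -> list[str]:
    """Split the input into numbers, operators, and parentheses."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def calc(text: str) -> float:
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = peek()
        pos += 1
        return tok

    def expr():   # expr := term (('+' | '-') term)*, looping keeps it left-associative
        value = term()
        while peek() in ("+", "-"):
            if take() == "+":
                value += term()
            else:
                value -= term()
        return value

    def term():   # term := unary (('*' | '/') unary)*, binds tighter than expr
        value = unary()
        while peek() in ("*", "/"):
            if take() == "*":
                value *= unary()
            else:
                value /= unary()
        return value

    def unary():  # unary := '-' unary | atom
        if peek() == "-":
            take()
            return -unary()
        return atom()

    def atom():   # atom := NUMBER | '(' expr ')'
        tok = take()
        if tok == "(":
            value = expr()
            if take() != ")":
                raise ValueError("missing closing parenthesis")
            return value
        if tok is None or tok in "+-*/)":
            raise ValueError(f"unexpected token {tok!r}")
        return float(tok)

    result = expr()
    if peek() is not None:
        raise ValueError(f"unexpected token {peek()!r}")
    return result


print(calc("2 + 3 * 4"), calc("8 - 3 - 2"), calc("-(2 + 3) * 2"), calc("2 * -3"))
```

The three requirements map onto three grammar levels: precedence comes from term sitting below expr, left associativity from the while loops, and unary minus from its own rule between term and atom. A parser that merges or reorders these levels still runs but computes wrong results, which matches the failure pattern described above.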

Speed: why the biggest is the fastest

On throughput the order is clear but counterintuitive: by far the largest model is also the fastest. Qwen3-Coder with 30 billion parameters generates around 92 tokens per second, the Gemma coder around 44, the Gemma base around 29.

The key is the Mixture-of-Experts architecture. During token generation the active weights have to be read from memory for every single token, so the bottleneck is memory bandwidth, not compute.³ A dense 12B model like Gemma reads all 12 billion weights. Qwen3-Coder is 30B in size but activates only about 3 billion parameters per token, so it reads far less per step and runs correspondingly faster. It is not total size that determines speed, but the number of active parameters.
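
The bandwidth argument can be turned into a rough upper bound. The sketch below assumes that every active weight is read exactly once per token, that both models sit at 4.5 bits per weight, and that memory delivers a nominal 400 GB/s; the bandwidth figure in particular is an assumed value to be replaced with that of the actual machine, and the function name is hypothetical.

```python
def decode_ceiling_tok_s(active_params: float, bits_per_weight: float,
                         bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed if each active weight is read once per token."""
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


BANDWIDTH_GB_S = 400.0   # assumption: nominal peak memory bandwidth
BITS_PER_WEIGHT = 4.5    # 4-bit affine quantization with group size 64

dense = decode_ceiling_tok_s(12e9, BITS_PER_WEIGHT, BANDWIDTH_GB_S)  # dense 12B
moe = decode_ceiling_tok_s(3e9, BITS_PER_WEIGHT, BANDWIDTH_GB_S)     # ~3B active

print(f"dense 12B ceiling:      {dense:.0f} tok/s")
print(f"MoE ~3B active ceiling: {moe:.0f} tok/s")
print(f"ratio of ceilings:      {moe / dense:.1f}x")
```

The ratio of the two ceilings depends only on the active parameter counts, not on the assumed bandwidth. Measured speeds stay below such ceilings because KV cache reads, activations, expert routing, and other overhead also cost time, so this is an order-of-magnitude check and not a prediction.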

Between the two Gemma variants a second effect shows: the same 12B base, but the coding finetune is about one and a half times as fast as the base. The multimodal base uses a pronounced internal thinking channel and reasons before every answer, while the coder gets to the code more directly.

Conclusion

For pure Python the result is easy to summarize.

On correctness all three are close. They reliably solve standard and medium tasks and fail as a group on the genuinely hard parser. So what decides the best local Python model is not the hit rate on everyday tasks but what happens at the edges: speed, and the question of what else is inside the model.

On speed the current specialist wins clearly. Thanks to MoE, Qwen3-Coder-30B-A3B is more than three times as fast as the Gemma base and therefore the most pleasant choice for an interactive workflow. Anyone using Gemma 4 locally for coding should also pick the specialized finetune over the base model: same hit rate, but noticeably faster, because it skips the detour through the thinking channel.

And above it all stands the parser. No local model of this size solved it reliably. That is the most important finding of the test: for standard Python these models are fit for everyday use, but genuinely tangled logic still requires the jump to the cloud or to much larger hardware. Gemma’s real strength lies elsewhere anyway, in the multimodality a pure code specialist does not have.

One limitation of this test is that only the parser was measured repeatedly; the other tasks ran once per model at low temperature. Single outliers are therefore possible. What holds is the pattern: same league on standard tasks, the finetune beats the base on speed, the MoE specialist is by far the fastest, and the genuinely hard task overwhelms all three.

As of June 19, 2026. I measured all correctness and performance figures myself on an Apple M3 Max (64 GB), with execution of the generated code; the test cases were checked against a correct reference. For model and technical references, see the footnotes.

¹ “Google DeepMind Releases Gemma 4 12B: An Encoder-Free Multimodal Model”, MarkTechPost, June 2026, and Google’s announcement “Introducing Gemma 4 12B” (release June 3, 2026, Apache 2.0, QAT-Q4 on June 5). https://www.marktechpost.com/2026/06/03/google-deepmind-releases-gemma-4-12b-an-encoder-free-multimodal-model-with-native-audio-that-runs-on-a-16-gb-laptop/
² yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1, Hugging Face. Finetune of google/gemma-4-12B-it on verified Python, distilled from Cursor Composer 2.5 (passing solutions only) and a Claude Fable 5 “second attempt” set. https://huggingface.co/yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1
³ “Ollama’s highest performance on Apple Silicon yet with MLX”, Ollama Blog, June 2026, on the point that generation throughput is bound by memory bandwidth. https://ollama.com/blog/mlx-performance