Yes, of course the model with custom kernels is faster, whether it's written by a human or an AI.
Generally, PyTorch inference is meant for use during training and when running metrics, not for deployment. For deployment, you should export to ONNX and then compile the ONNX model to the device's native format.
If you aren't familiar with the pipeline for ML deployment, this is the equivalent of comparing interpreted code to compiled code.
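Roughly, that export step looks something like this (a sketch only; the model, names, and shapes are placeholders, and the device-specific compile step depends on the target runtime):

    import torch
    import torch.nn as nn

    # Placeholder model; in practice this would be your trained network.
    model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10)).eval()
    dummy = torch.randn(1, 128)

    # Export to ONNX; model.onnx can then be compiled with a device-specific
    # toolchain (e.g. TensorRT, Core ML tools, or OpenVINO).
    torch.onnx.export(
        model, dummy, "model.onnx",
        input_names=["input"], output_names=["logits"],
        dynamic_axes={"input": {0: "batch"}},
    )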
1. I don't have an M4, but I have an M1 Pro, and I tried running the VisionAttention example with the claimed 18x speedup and I get close to identical runtimes. This example has more issues: the main optimization the LLM is doing is a fusion, so not comparing to torch.compile is a bit sus. The numerics are off as well, and I suspect the atols were way too big. Finally, MultiHeadAttention is a deprecated API, so using neither SDPA nor torch.compile is a weird choice.
2. In general, 18x (and even the 100x speedups claimed near the end) is just a smell that some kernel is incorrect. The typical way you get speedups like this is that you don't warm up or you forget to synchronize. PyTorch has a lot of benchmarking footguns, which is why sharing the exact eval scripts is helpful.
3. Speaking of footguns, the shapes I saw in the examples were tiny; in that regime you're mostly measuring noise, since the primary bottleneck is not compute or memory but overhead.
4. Generating many random shapes is also not so safe; some input distributions can make certain kernels trivial. For example, torch.randn() by default generates samples from a normal distribution with mean 0 and variance 1, so if you take the mean of a large vector you're almost guaranteed to get roughly 0, especially if your tolerance is too high (see the sketch after this list).
5. KernelBench levels measure vastly different things; if you want to compare to PyTorch operators, focus on Level 1. Level 2 is fusions, so the right baseline is torch.compile (which is more reliable on nightlies). The Mamba 2 example (which I didn't run) also acknowledges that the primary thing it does is fusions, which, assuming everything is correct, would still be strange to baseline against eager.
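A toy illustration of point 4 (sizes and tolerances here are arbitrary, not taken from the writeup):

    import torch

    x = torch.randn(1_000_000)                  # mean 0, variance 1 by default
    ref = x.mean()                               # reference output, ~N(0, 1e-6)
    fake = torch.zeros_like(ref)                 # a trivially wrong "kernel"
    print(ref.item())                            # typically on the order of 1e-3
    print(torch.allclose(fake, ref, atol=1e-2, rtol=1e-2))  # almost always True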
So please, for everyone's sanity: if you find a kernel that's 10-100x faster, share the exact code and benchmarking methodology with your smartest performance friends. You should be extremely skeptical of such results; often you can discard some numbers based on a simple speed-of-light analysis. We all desperately want faster kernels, but to get them we have to be really fanatical about correctness.
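For reference, the kind of harness I mean looks roughly like this (a sketch only: shapes and iteration counts are made up, scaled_dot_product_attention stands in for the real module, and torch.compile on MPS may need a recent or nightly build):

    import time
    import torch
    import torch.nn.functional as F

    device = "mps" if torch.backends.mps.is_available() else "cpu"

    def bench(fn, *args, warmup=10, iters=100):
        for _ in range(warmup):          # warm up: caches, lazy init, compilation
            fn(*args)
        if device == "mps":
            torch.mps.synchronize()      # make sure warmup work has finished
        start = time.perf_counter()
        for _ in range(iters):
            fn(*args)
        if device == "mps":
            torch.mps.synchronize()      # don't time queued-but-unfinished work
        return (time.perf_counter() - start) / iters

    # Non-trivial shapes so compute/memory, not launch overhead, dominates.
    q, k, v = (torch.randn(8, 16, 1024, 64, device=device) for _ in range(3))

    eager = lambda q, k, v: F.scaled_dot_product_attention(q, k, v)
    compiled = torch.compile(eager)      # the fair baseline for fused kernels

    print("eager   :", bench(eager, q, k, v))
    print("compiled:", bench(compiled, q, k, v))

Even something this simple catches the two biggest footguns: timing warmup/compilation, and timing work that hasn't actually finished on the GPU.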
Still, I can't help but think we should bet on something like Mojo instead for the long run.
wouldn't the model stop working properly if the kernels are even slightly off?
weren't kernels part of the training stack for these models? am I missing anything?
Not to take away from the nice writeup, but for anyone not getting far enough into it, this is essentially taking https://github.com/ScalingIntelligence/KernelBench and seeing if it can generate Metal kernels in addition to the CUDA kernels the benchmark is written for. The dataset was released in November 2024, it looks like, with a paper on arXiv in February and a bunch of discussion at the time[1], so it's worth keeping the likelihood of inclusion in training data in mind when comparing models.
The different levels are interesting. Levels 1 and 3 are successfully (5-shot) translated to Metal kernels by gpt5 97% and 88% of the time, but in both cases, the majority of generated kernels are slower than the reference compiled PyTorch versions. The speculation about simpler op-fusion opportunities in the Level 2 kernels vs the very simple Level 1 kernels and the complex-architecture Level 3 kernels seems plausible. From the KernelBench paper, it looks like Level 2 kernels were mostly automatically generated by randomly picking operators and then getting an LLM to generate a kernel combining them, while Level 1 kernels were mostly hand written and Level 3 kernels came from well-known ML architectures.
The swarm part seemed a bit of a stretch. They fired off requests to 8 different models to do the translation, and the "supervisor" benchmarked the returned kernels and picked the fastest one. Technically a swarm, I guess, but feels like we're devaluing the term :)
The correctness testing used made my eye twitch a bit:
> We tested the generated kernel's output against the default implementation's output on 100 random inputs. We set a 0.01 tolerance for both relative and absolute. Let a be the generated kernel output, and b be the reference kernel output. Outputs were considered equal if for every element in the output, absolute(a - b) ≤ (atol + rtol * absolute(b)) held true.
For a numerical kernel, this seems way too loose, but it turns out those bounds come straight from KernelBench, which only tested for correctness on 5 random inputs by default in its harness, not the 100 used here. KernelBench mentions the clear tradeoff between how strictly correctness is defined and kernel performance, but for Level 1 kernels in particular, which are really just single operations, it seems like the bounds should be multiple orders of magnitude smaller to ensure robust translation. For instance, the all-0s "optimization" mentioned in the writeup, which allows trivially "translating" the kernel, looks like it's due to those loose tolerances[2], and there's a KernelBench PR looking to make the evaluation more robust.
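To make that concrete, here's a rough sketch of the quoted check (the helper name is mine) and how much slack 0.01 leaves for a single op on unit-scale data:

    import torch

    def outputs_equal(a, b, atol=1e-2, rtol=1e-2):
        # every element must satisfy |a - b| <= atol + rtol * |b|
        return bool(torch.all(torch.abs(a - b) <= atol + rtol * torch.abs(b)))

    b = torch.randn(1024, 1024)                  # unit-scale reference output
    a = b + 1e-3 * torch.randn_like(b)           # a "kernel" with ~1e-3 error
    print(outputs_equal(a, b))                       # True under these bounds
    print(outputs_equal(a, b, atol=1e-5, rtol=1e-5)) # False under tighter bounds

Float32 rounding error for a single elementwise op is around 1e-7 relative, so errors that pass at 1e-2 can be several orders of magnitude larger than anything a correct single-op kernel should produce.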
[1] Like https://metr.org/blog/2025-02-14-measuring-automated-kernel-...
[2] https://github.com/ScalingIntelligence/KernelBench/pull/25
I initially thought they were writing custom kernels for proprietary models like GPT-5. They aren't - they're using proprietary models to write kernels for a set of ~250 open PyTorch modules.