If you had just written your SIMD code in CUDA 15 years ago, NVIDIA's compilers would have given you maximum performance across all NVIDIA GPUs, instead of forcing you to write and rewrite it for SSE vs AVX vs AVX-512.
GPU SIMD is still SIMD. Just... better at it. I think AMD and Intel GPUs can keep up, btw. But the software advantage and the long-term benefit of having rewritten into CUDA are plainly apparent.
Intel's ISPC is a great project, btw, if you need high-level code that targets SSE, AVX, AVX-512, and even ARM NEON from one codebase, with automatic compilation for all of those architectures.
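To make that concrete, here's a rough sketch of the workflow, not a definitive recipe: the kernel is written once in ISPC's C-like dialect, compiled for several ISAs in one invocation, and ISPC emits per-target objects plus a small runtime dispatcher, so the C++ caller never needs to know which ISA was selected. The target names below are illustrative and vary between ISPC releases, and I'm assuming the generated header wraps the declarations in an ispc namespace for C++.

    // scale.ispc -- written once in ISPC's C-like dialect (shown as a comment
    // to keep this block plain C++):
    //
    //   export void scale(uniform float out[], uniform float in[],
    //                     uniform float factor, uniform int count) {
    //       foreach (i = 0 ... count) { out[i] = in[i] * factor; }
    //   }
    //
    // Compile once for several ISAs (names illustrative, release-dependent):
    //   ispc scale.ispc --target=sse4-i32x4,avx2-i32x8,avx512skx-x16 \
    //        -o scale.o -h scale.h
    //
    // main.cpp -- the caller never sees which ISA the dispatcher picked.
    #include "scale.h"
    #include <vector>

    int main() {
        std::vector<float> in(1024, 1.0f), out(1024);
        ispc::scale(out.data(), in.data(), 2.0f, static_cast<int>(in.size()));
    }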
-------
Intel's AVX-512 is pretty good at the hardware level. But a software methodology for interacting with SIMD through GPU-like languages should be a priority.
Intrinsics are good for maximum performance but they are too hard for mainstream programmers.
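To illustrate "too hard": here's what even a trivial float array sum looks like with raw AVX2 intrinsics. A minimal sketch that assumes the length is a multiple of 8 just to stay short; real code also needs a scalar tail and separate paths for each ISA you care about.

    #include <immintrin.h>

    // Sum a float array with AVX2 intrinsics. Assumes n is a multiple of 8
    // purely to keep the example short.
    float sum_avx2(const float* data, int n) {
        __m256 acc = _mm256_setzero_ps();
        for (int i = 0; i < n; i += 8) {
            acc = _mm256_add_ps(acc, _mm256_loadu_ps(data + i));
        }
        // Horizontal reduction of the 8 lanes.
        __m128 lo = _mm256_castps256_ps128(acc);
        __m128 hi = _mm256_extractf128_ps(acc, 1);
        __m128 s  = _mm_add_ps(lo, hi);
        s = _mm_hadd_ps(s, s);
        s = _mm_hadd_ps(s, s);
        return _mm_cvtss_f32(s);
    }

    // The scalar equivalent is one line in a loop:
    //   for (int i = 0; i < n; ++i) total += data[i];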
Thank you for saying it out loud. x86's XLAT/XLATB is positively tame compared to, e.g., RISC-V's vrgather.vv/vrgatherei16.vv.
You haven't linked to or explained what Mojo is. There's also a lot going on with the different products mentioned (Modular, Unum Cloud, SimSIMD) that aren't contextualised either. While I'm at it, where do the others come in (Ovadia, Lemire, Lattner)? You all worked on SimSIMD, I guess?
That said, this is a great article, thanks.
Edit: Mojo is a programming language with Python-like syntax, and is a product by Modular: https://github.com/modularml/mojo
As this would only use one lane: if you have multiple of these to normalize, perhaps you could vectorize across them instead (see the sketch below).
On the image: https://www.modular.com/blog/understanding-simd-infinite-com...
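I don't know the exact shape of the data in the post, but as a hypothetical illustration of vectorizing across instances rather than within one: store many 3D vectors as separate x/y/z arrays (SoA) and normalize eight of them per iteration, one per AVX lane. Names and layout are invented for the example; it assumes the count is a multiple of 8 and that FMA is available.

    #include <immintrin.h>

    // Normalize n 3D vectors stored as separate x/y/z arrays (SoA),
    // processing 8 vectors per iteration, one per AVX lane.
    // Assumes n is a multiple of 8.
    void normalize_batch(float* x, float* y, float* z, int n) {
        for (int i = 0; i < n; i += 8) {
            __m256 vx = _mm256_loadu_ps(x + i);
            __m256 vy = _mm256_loadu_ps(y + i);
            __m256 vz = _mm256_loadu_ps(z + i);
            // len2 = x*x + y*y + z*z, per lane
            __m256 len2 = _mm256_fmadd_ps(vx, vx,
                          _mm256_fmadd_ps(vy, vy, _mm256_mul_ps(vz, vz)));
            __m256 inv = _mm256_div_ps(_mm256_set1_ps(1.0f),
                                       _mm256_sqrt_ps(len2));
            _mm256_storeu_ps(x + i, _mm256_mul_ps(vx, inv));
            _mm256_storeu_ps(y + i, _mm256_mul_ps(vy, inv));
            _mm256_storeu_ps(z + i, _mm256_mul_ps(vz, inv));
        }
    }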
Also, the feature set being all over the place (e.g. integer support is fairly recent) doesn't help either.
ISPC is a good idea, but the execution is meh... it's hard to set up and integrate.
Ideally you would want to be able to easily use this from other popular languages, like Java, Python, or JavaScript, without having to resort to linking a library written in C/C++.
Granted, language extensions may be required to approach something like that in an ergonomic way, but most somehow end up just mimicking what C++ does and exposing a pseudo-assembler (roughly the style sketched below).
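For reference, the C++-style exposure being mimicked looks roughly like this: std::experimental::simd from the Parallelism TS (shipped in libstdc++'s <experimental/simd>; availability varies by compiler). You still reason in register-sized chunks and lane counts yourself, just with nicer types than raw intrinsics. A minimal sketch, assuming the length is a multiple of the native lane count:

    #include <experimental/simd>
    namespace stdx = std::experimental;

    // Explicit-vector style: you step through the data one register-sized
    // chunk at a time. Assumes n is a multiple of the native lane count.
    void scale(float* data, int n, float factor) {
        using V = stdx::native_simd<float>;
        for (int i = 0; i < n; i += static_cast<int>(V::size())) {
            V v;
            v.copy_from(data + i, stdx::element_aligned);  // load a chunk
            v *= factor;                                   // lane-wise multiply
            v.copy_to(data + i, stdx::element_aligned);    // store it back
        }
    }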
SIMD on the CPU is most compelling to me due to the latency characteristics. You are nanoseconds away from the control flow. If the GPU needs some updated state regarding the outside world, it takes significantly longer to propagate this information.
For most use cases, the GPU will win that trade-off. But there is a reason you don't hear much about systems like order-matching engines using them.