If you had just written your SIMD code in CUDA 15 years ago, NVIDIA's compilers would have given you maximum performance across all NVIDIA GPUs, instead of forcing you to write and rewrite it for SSE vs AVX vs AVX-512.
GPU SIMD is still SIMD. Just... better at it. I think AMD and Intel GPUs can keep up, btw. But the software advantage and the long-term benefit of having rewritten into CUDA are plainly apparent.
Intel's ISPC is a great project, btw, if you need high-level code that targets SSE, AVX, AVX-512, and even ARM NEON from one codebase, with automatic compilation for all of those architectures.
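To make that concrete, here's a rough sketch of the workflow, not a definitive recipe: the kernel is written once in ISPC's C-like dialect, compiled for several ISAs in one invocation, and ISPC emits per-target objects plus a small runtime dispatcher, so the C++ caller never needs to know which ISA was selected. The target names below are illustrative and vary between ISPC releases, and I'm assuming the generated header wraps the declarations in an ispc namespace for C++.

    // scale.ispc -- written once in ISPC's C-like dialect (shown as a comment
    // to keep this block plain C++):
    //
    //   export void scale(uniform float out[], uniform float in[],
    //                     uniform float factor, uniform int count) {
    //       foreach (i = 0 ... count) { out[i] = in[i] * factor; }
    //   }
    //
    // Compile once for several ISAs (names illustrative, release-dependent):
    //   ispc scale.ispc --target=sse4-i32x4,avx2-i32x8,avx512skx-x16 \
    //        -o scale.o -h scale.h
    //
    // main.cpp -- the caller never sees which ISA the dispatcher picked.
    #include "scale.h"
    #include <vector>

    int main() {
        std::vector<float> in(1024, 1.0f), out(1024);
        ispc::scale(out.data(), in.data(), 2.0f, static_cast<int>(in.size()));
    }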
-------
Intel's AVX-512 is pretty good at the hardware level. But a software methodology for interacting with SIMD through GPU-like languages should be a priority.
Intrinsics are good for maximum performance but they are too hard for mainstream programmers.
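To illustrate "too hard": here's what even a trivial float array sum looks like with raw AVX2 intrinsics. A minimal sketch that assumes the length is a multiple of 8 just to stay short; real code also needs a scalar tail and separate paths for each ISA you care about.

    #include <immintrin.h>

    // Sum a float array with AVX2 intrinsics. Assumes n is a multiple of 8
    // purely to keep the example short.
    float sum_avx2(const float* data, int n) {
        __m256 acc = _mm256_setzero_ps();
        for (int i = 0; i < n; i += 8) {
            acc = _mm256_add_ps(acc, _mm256_loadu_ps(data + i));
        }
        // Horizontal reduction of the 8 lanes.
        __m128 lo = _mm256_castps256_ps128(acc);
        __m128 hi = _mm256_extractf128_ps(acc, 1);
        __m128 s  = _mm_add_ps(lo, hi);
        s = _mm_hadd_ps(s, s);
        s = _mm_hadd_ps(s, s);
        return _mm_cvtss_f32(s);
    }

    // The scalar equivalent is one line in a loop:
    //   for (int i = 0; i < n; ++i) total += data[i];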
Thank you for saying it out loud. x86's XLAT/XLATB is positively tame compared to, e.g., RISC-V's vrgather.vv/vrgatherei16.vv.
You haven't linked to or explained what Mojo is. There's also a lot going on with the different products mentioned (Modular, Unum Cloud, SimSIMD) that aren't contextualised either. While I'm at it, where do the others come in (Ovadia, Lemire, Lattner)? You all worked on SimSIMD, I guess?
That said, this is a great article, thanks.
Edit: Mojo is a programming language with Python-like syntax, and is a product by Modular: https://github.com/modularml/mojo
As this would only use one lane: if you have multiple of these to normalize, perhaps you could vectorize across them instead (see the sketch below).
On the image: https://www.modular.com/blog/understanding-simd-infinite-com...
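I don't know the exact shape of the data in the post, but as a hypothetical illustration of vectorizing across instances rather than within one: store many 3D vectors as separate x/y/z arrays (SoA) and normalize eight of them per iteration, one per AVX lane. Names and layout are invented for the example; it assumes the count is a multiple of 8 and that FMA is available.

    #include <immintrin.h>

    // Normalize n 3D vectors stored as separate x/y/z arrays (SoA),
    // processing 8 vectors per iteration, one per AVX lane.
    // Assumes n is a multiple of 8.
    void normalize_batch(float* x, float* y, float* z, int n) {
        for (int i = 0; i < n; i += 8) {
            __m256 vx = _mm256_loadu_ps(x + i);
            __m256 vy = _mm256_loadu_ps(y + i);
            __m256 vz = _mm256_loadu_ps(z + i);
            // len2 = x*x + y*y + z*z, per lane
            __m256 len2 = _mm256_fmadd_ps(vx, vx,
                          _mm256_fmadd_ps(vy, vy, _mm256_mul_ps(vz, vz)));
            __m256 inv = _mm256_div_ps(_mm256_set1_ps(1.0f),
                                       _mm256_sqrt_ps(len2));
            _mm256_storeu_ps(x + i, _mm256_mul_ps(vx, inv));
            _mm256_storeu_ps(y + i, _mm256_mul_ps(vy, inv));
            _mm256_storeu_ps(z + i, _mm256_mul_ps(vz, inv));
        }
    }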
Also, the feature set being all over the place (e.g. integer support is fairly recent) doesn't help either.
ISPC is a good idea, but the execution is meh... it's hard to set up and integrate.
Ideally you would want to be able to easily use this from other popular languages, like Java, Python, or JavaScript, without having to resort to linking a library written in C/C++.
Granted, language extensions may be required to approach something like that in an ergonomic way, but most somehow end up just mimicking what C++ does and exposing a pseudo-assembler (roughly the style sketched below).
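For reference, the C++-style exposure being mimicked looks roughly like this: std::experimental::simd from the Parallelism TS (shipped in libstdc++'s <experimental/simd>; availability varies by compiler). You still reason in register-sized chunks and lane counts yourself, just with nicer types than raw intrinsics. A minimal sketch, assuming the length is a multiple of the native lane count:

    #include <experimental/simd>
    namespace stdx = std::experimental;

    // Explicit-vector style: you step through the data one register-sized
    // chunk at a time. Assumes n is a multiple of the native lane count.
    void scale(float* data, int n, float factor) {
        using V = stdx::native_simd<float>;
        for (int i = 0; i < n; i += static_cast<int>(V::size())) {
            V v;
            v.copy_from(data + i, stdx::element_aligned);  // load a chunk
            v *= factor;                                   // lane-wise multiply
            v.copy_to(data + i, stdx::element_aligned);    // store it back
        }
    }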
SIMD on the CPU is most compelling to me due to the latency characteristics. You are nanoseconds away from the control flow. If the GPU needs some updated state regarding the outside world, it takes significantly longer to propagate this information.
For most use cases, the GPU will win that trade-off. But there is a reason you don't hear much about systems like order-matching engines using them.