Hacker News

Prefix sums at gigabytes per second with ARM NEON

64 points by mfiguiere

by hayley-patton

1 subcomments

The article doesn't mention it, but if you want the general form of this algorithm, it is a Hillis-Steele prefix sum: <https://en.wikipedia.org/wiki/Prefix_sum#Algorithm_1:_Shorte...>
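
For anyone who hasn't seen it, here is a minimal scalar sketch of the Hillis-Steele idea in plain C. It is not the article's NEON code; the function name `hillis_steele_scan` and the temporary buffer are my own assumptions for illustration. Each pass adds the element `stride` positions to the left, and the stride doubles every pass, so an inclusive prefix sum takes about log2(n) passes whose inner iterations are independent of each other.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: inclusive prefix sum, Hillis-Steele style.
   Each pass is free of loop-carried dependencies, which is what makes
   the scheme a natural fit for SIMD lanes or parallel threads. */
void hillis_steele_scan(uint32_t *data, size_t n) {
    if (n < 2) return;

    /* Double buffering: every pass must read only values from the previous pass. */
    uint32_t *tmp = malloc(n * sizeof *tmp);
    assert(tmp != NULL);

    for (size_t stride = 1; stride < n; stride *= 2) {
        for (size_t i = 0; i < n; i++) {
            /* Elements closer than `stride` to the start are already final. */
            tmp[i] = (i >= stride) ? data[i] + data[i - stride] : data[i];
        }
        memcpy(data, tmp, n * sizeof *tmp);
    }
    free(tmp);
}
```

The trade-off is that this does O(n log n) additions in total instead of the n - 1 a sequential loop needs, so it only pays off when the passes actually run in parallel, for example across the lanes of a vector register.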

by Jeffrin-dev

0 subcomment

The interleaved load trick is clever; I never thought about using vld4 that way. I always assumed SIMD would struggle with sequential dependencies like this, since each value depends on the previous one. Curious how this holds up on older ARM chips: would you see similar gains, or does it depend heavily on the M4's specific pipeline? I also wonder if there's a similar approach for AVX on x86, or if the instruction set makes it more awkward.

by vardump

2 subcomments

What's going on with SVE[2] support in ARM land? It's weird that even Apple's M5 still doesn't support it (other than SME[2]).

by flykespice

0 subcomment

Couldn't this be written in pure C so that compilers can take advantage of vector optimization and produce equally optimized code?
I have been discouraged from writing hand-written assembly SIMD code, because netizens say you can barely outsmart compiler-optimized assembly code nowadays.
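
For reference, the pure C version I have in mind is roughly the loop below (a sketch; the name `prefix_sum_scalar` is just for illustration). Each iteration reads the value the previous iteration just wrote, so there is a loop-carried dependency. Whether a given compiler manages to vectorize that is something to check in the generated assembly rather than assume.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical baseline: in-place inclusive prefix sum in plain C.
   data[i] depends on the freshly updated data[i - 1], so the iterations
   cannot simply be run side by side without restructuring the loop. */
void prefix_sum_scalar(uint32_t *data, size_t n) {
    for (size_t i = 1; i < n; i++) {
        data[i] += data[i - 1];
    }
}
```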

by ryan14975

1 subcomments

[flagged]