Hacker News

I rebuilt FlashAttention in Triton to understand the performance archaeology

90 points by amindiro

by amindiro

1 subcomment

by sheepscreek

4 subcomments

by hyperbovine

1 subcomment

I still don't understand why certain performance aspects of the CUDA platform are so poorly documented. Why is successfully pushing the hw to its performance envelope considered a novel research result? Shouldn't I be able to look this stuff up on the Nvidia website?

by fancy_pantser

0 subcomments

When OpenAI announced the Triton language, I was worried I'd be confused one day while reading something because of Nvidia's open-source Triton inference server. I made it quite a long time, but it finally happened today! I was so intrigued for the first few pages and then deeply confused.

by rishabhaiover

1 subcomment

I did an experiment on FlashAttention in Triton to measure the impact of caching tiles in shared memory. Surprisingly, performance had a non-monotonic relationship with prefetching these tiles, and the effect was kernel-dependent: the attention kernel benefits from prefetching, while MLP W1 doesn't.

by raphaelty

0 subcomments

Very interesting. I wonder if there are other heavily used algorithms that could benefit a lot from a "Flash" version but don't have one today.

by npalli

1 subcomment

Seems very detailed and comprehensive. Maybe I missed it, but was there a performance comparison to the PyTorch version at the top?