- Interesting choice from PyTorch to release yet another DSL. On the positive side it's one more point in the design space; on the other hand it makes it even harder to choose the right technology among Triton, Gluon, CuTe, ThunderKittens, and a few others.
- The developers also gave a talk about Helion on GPU Mode:
https://www.youtube.com/watch?v=1zKvCLuvUYc
- It's good to see more effort toward making things not device-specific, but I only see benchmarks for NVIDIA B200 and AMD MI350X. Also, what's the experience of using one of these Python DSLs like? Are the tools good enough to make code completion, jump to definition, setting breakpoints, watching variables, copying as expression, etc. nice?
- Asking as someone who is really out of the loop: how much of ML development these days touches these “lower level” parts of the stack? I’d expect that by now most of the work would be high level, and the infra would be mostly commoditized.
by dachworker
3 subcomments
- I'm super excited to give this one a spin. It seems like a neat idea: Triton, but simpler and with automatic autotuning (rough sketch at the end of this comment). My head is spinning with options right now. I love how everyone was hyping up CUDA this and CUDA that a couple of years ago, and now CUDA is all but irrelevant. There are now so many different and opinionated takes on how you should write high-performance accelerator-cluster code. I love it.
It's also kind of ironic that right now in 2025 we have all this diversity in tooling, but at the same time the ML architecture space has collapsed entirely and everyone is just using transformers.
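A minimal sketch of what that "Triton, but simpler" pitch looks like, going off the Helion announcement and README as I remember them; the decorator and hl.tile names are taken from there, but treat the exact signatures as assumptions rather than verified API:

    import torch
    import helion
    import helion.language as hl

    # The body reads like plain PyTorch indexing; block sizes, launch
    # geometry, etc. are left to Helion's autotuner rather than written
    # out by hand as they would be in Triton.
    @helion.kernel()
    def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        out = torch.empty_like(x)
        for tile in hl.tile(out.size()):
            out[tile] = x[tile] + y[tile]
        return out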
- I switched from PyTorch to JAX just before Triton appeared. Does anyone know how JAX compares to this autotuning machinery in PyTorch? I know JAX does JIT, but I don't have a good intuition for whether JIT is better than this type of autotuning.
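Not a full answer, but a minimal sketch of the distinction as I understand it: jax.jit hands the traced program to XLA, which chooses fusion and tiling by compiler heuristics at compile time, whereas Triton/Helion-style autotuning actually benchmarks candidate configs on the device. The function below is just illustrative:

    import jax
    import jax.numpy as jnp

    @jax.jit
    def scaled_sum(x, w):
        # XLA compiles this once per input shape/dtype; fusion and tiling
        # come from compiler heuristics, not from timing candidate block
        # sizes on the GPU the way an autotuner does.
        return jnp.sum(x * w, axis=-1)

    x = jnp.ones((4096, 4096), dtype=jnp.float32)
    w = jnp.ones((4096, 4096), dtype=jnp.float32)
    print(scaled_sum(x, w).shape)  # first call traces and compiles; later calls reuse the binary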
- Compiling a kernel after assemblage in low-level object-oriented languages either uses a stable kernel or the cargo-fuzzed raw_spinlock code. Helion abstracts syntax and design for calculating λ-functions, which converts the language into a kernel config.
by mshockwave
1 subcomments
- Is it normal to spend 10 minutes on tuning nowadays? Do we need to spend another 10 minutes every time the code changes?
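For what it's worth, my understanding from the docs is that the search is a one-time cost per kernel and hardware target, and the winning config can be hard-coded afterwards so it isn't rerun on every edit. A hypothetical sketch of that idea follows; the names helion.Config, block_sizes, and num_warps are assumptions from memory, not a verified signature:

    import torch
    import helion
    import helion.language as hl

    # Hypothetical: pin a previously autotuned config so the search is skipped.
    @helion.kernel(config=helion.Config(block_sizes=[1024], num_warps=4))
    def scale(x: torch.Tensor) -> torch.Tensor:
        out = torch.empty_like(x)
        for tile in hl.tile(out.size()):
            out[tile] = x[tile] * 2
        return out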
- I don't get the point of Helion compared to its alternatives like Gluon.
For best performance I would presume one needs low-level access to hardware knobs, and these kernel primitives are written once and reused. So what is the point of a DSL that dumbs things down as a wrapper around Triton?
by singularity2001
1 subcomments
- Anything, as long as I don't have to touch proprietary CUDA and mpx
- How does this compare against other DSLs?
- I posted this 5 days ago, how did this resurface?
- numba for gpu kernels... cool!
- Tangential question related to the example kernel: in GPU programming, is it idiomatic/standard to initialize the out array as zeros rather than empty? Are the performance savings negligible?
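The usual answer: if the kernel writes every element of the output, torch.empty is sufficient, and torch.zeros just adds an extra fill kernel over the whole buffer; whether that matters depends on the output size. A rough way to measure the difference yourself (assumes a CUDA device; numbers will vary):

    import torch

    # empty only allocates (often reusing cached memory); zeros also launches
    # a fill kernel over the whole buffer, which is pure overhead if the
    # compute kernel overwrites every element anyway.
    def alloc_cost(make, n=64 * 1024 * 1024, iters=100):
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            make(n, device="cuda")
        end.record()
        torch.cuda.synchronize()
        return start.elapsed_time(end) / iters  # ms per allocation

    if torch.cuda.is_available():
        print("empty:", alloc_cost(torch.empty), "ms")
        print("zeros:", alloc_cost(torch.zeros), "ms")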
by doctorpangloss
1 subcomments
- Is contributing to Triton so bad? It looks like the blocker is usually LLVM.