Hacker News

CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs

105 points by matt_d

by rahen

2 subcomments

Strictly speaking, this is very domain-specific and doesn't enable any performance that Triton couldn't already achieve (eliminating global memory round-trips via epilogue fusion is nothing new). The real takeaway is the design shift for LLM-driven codegen rather than handcrafted kernels.
LLMs are still bad at low-level hardware optimizations, but really good at high-level composition. Designing compiler abstractions with a restricted, composable API so an LLM can easily glue expert-written blocks together is a smart move. I suspect this will eventually become the norm for codegens as we move to agentic development.

by augment_me

0 subcomment

TLDR:
Authors realize that global row-wise dependent functions like RMSNorm/LayerNorm have baked-in scales that are commutative in certain setups, so they can be moved out after a subsequent projection and be partially aggregated on tiles of rows.
So (W1 @ gamma * globally_computed_scale) * W2 can be written as (W1 @ gamma * W2) * globally_computed_scale as long as we have row-only interactions for the scale.
This was usually not done before because left-to-right graph compilers like torch.compile can't assume that a global row-wise reduction between GEMMs can be commutative.
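
A minimal sketch of that identity in plain Python, assuming RMSNorm followed by a single projection. This is not the paper's kernel; the function names, the tile size, and the toy inputs are all made up for illustration. It checks that applying the per-row scale after the projection gives the same result as normalizing first, with the sum of squares accumulated tile by tile along the hidden dimension:

```python
import math

def matmul(a, b):
    """Plain nested-list matrix product: (n x k) @ (k x m) -> (n x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def rmsnorm_then_project(x, gamma, w, eps=1e-6):
    """Reference order: normalize each row, apply gamma, then project."""
    normed = []
    for row in x:
        inv_rms = 1.0 / math.sqrt(sum(v * v for v in row) / len(row) + eps)
        normed.append([v * inv_rms * g for v, g in zip(row, gamma)])
    return matmul(normed, w)

def project_then_scale(x, gamma, w, tile=2, eps=1e-6):
    """Reordered: project (x * gamma) first, apply the per-row scale afterwards.

    The sum of squares is accumulated in chunks of `tile` elements
    (hypothetical tile size), the way a GEMM mainloop walks the K dimension.
    """
    scaled = [[v * g for v, g in zip(row, gamma)] for row in x]
    out = matmul(scaled, w)
    result = []
    for row, out_row in zip(x, out):
        sumsq = 0.0
        for start in range(0, len(row), tile):
            sumsq += sum(v * v for v in row[start:start + tile])
        inv_rms = 1.0 / math.sqrt(sumsq / len(row) + eps)
        # The per-row scalar is applied once, after the projection.
        result.append([o * inv_rms for o in out_row])
    return result

# Toy inputs, chosen arbitrarily.
x = [[1.0, -2.0, 3.0, 0.5], [0.0, 4.0, -1.0, 2.0]]
gamma = [0.5, 1.5, 1.0, 2.0]
w = [[1.0, 0.0], [0.5, -1.0], [2.0, 1.0], [-1.0, 3.0]]

ref = rmsnorm_then_project(x, gamma, w)
alt = project_then_scale(x, gamma, w)
max_diff = max(abs(a - b) for ra, rb in zip(ref, alt) for a, b in zip(ra, rb))
```

The two orderings agree up to floating-point rounding because the scale is a single scalar per row, so it factors out of every dot product in that row. The same trick would not work for a scale that varies per column of the input, which is why gamma has to stay on the input side of the projection.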

by saagarjha

0 subcomment

Guys who have only written CUTLASS GEMM epilogue fusions, seeing their second kernel: Getting a lot of "GEMM epilogue fusion" vibes from this

by maxignol

0 subcomment

« LLMs can successfully author CODA kernels » That might speed up progress in this area then

by cold_harbor

0 subcomment

synthesis-only is the hard part. with execution feedback — run, profile, patch — the gap closes fast. it's basically an RL problem in disguise
