You’ve got a frozen transformer and a second module still trained with SGD, so how exactly does that solve forgetting instead of just relocating it?
We are not at the end of AI :)
Also, someone claimed that NVIDA combined diffusion and autoregression, making it 6 times faster, but couldn't find a source. Big if true!