I doubt the VUs can help with inference though, given their small scratchpads and limited instruction set, haha.
Curious about 2 things if you can share:
1. What's your per-token latency on real hardware?
2. How much quality loss came from PSNT quantization vs. the fp16 baseline?

Either way, this is peak hacker energy. Shipping on actual hardware makes it 10x cooler.