it's always surprising for me how absurdly efficient "highly specialized VM/instruction interpreters" are
like e.g. two independent research projects into how to have better (fast, more compact) serialization in rust ended up with something like a VM/interpreter for serialization instructions leading to both higher performance and more compact code size while still being cable of supporting similar feature sets as serde(1)
(in general monomorphisation and double dispatch (e.g. serde) can bring you very far, but the best approach is like always not the extrem. Neither allays monomorphisation nor dynamic dispatch but a balance between taking advantage of the strength of both. And specialized mini VMs are in a certain way an extra flexible form of dynamic dispatch.)
---
(1): More compact code size on normal to large project, not necessary on micro projects as the "fixed overhead" is often slightly larger while the per serialization type/protocol overhead can be smaller.
(1b): They have been experimental research project, not sure if any of them got published to GitHub, non are suited for usage in production or similar.
Tail recursion opens up for people to write really really neat looping facilities using macros.
Last year I was working on a tail-call interpreter (https://github.com/anematode/b-jvm/blob/main/vm/interpreter2...) and found a similar regression on WASM when transforming it from a switch-dispatch loop to tail calls. SpiderMonkey did the best with almost no regression, while V8 and JSC totally crapped out – same finding as the blog post. Because I was targeting both native and WASM I wrote a convoluted macro system that would do a switch-dispatch on WASM and tail calls on native.
Ultimately, because V8's register allocation couldn't handle the switch-loop and was spilling everything, I basically manually outlined all the bytecodes whose implementations were too bloated. But V8 would still inline those implementations and shoot itself in the foot, so I wrote a wasm-opt pass to indirect them through a __funcref table, which prevented inlining.
One trick, to get a little more perf out of the WASM tail-call version, is to use a typed __funcref table. This was really horrible to set up and I actually had to write a wasm-opt pass for this, but basically, if you just naively do a tail call of a "function pointer" (which in WASM is usually an index into some global table), the VM has to check for the validity of the pointer as well as a matching signature. With a __funcref table you can guarantee that the function is valid, avoiding all these annoying checks.
I wonder why they went with a new keyword; I assumed the compiler would opportunistically do TCO when it thinks it's possible, and I figured that the simplest way to require TCO (or else fail compilation) could be done with an attribute.
(Not sure if the article addressed that... I only skimmed it.)