But the things it unlocks in a product I’m building are mind-blowing. Millisecond inference, even on much older models, changes the whole game: enough to run inference on every. Single. API call. Without noticeable overhead. This sh*t is wild.
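To make that concrete, here's a rough sketch of what per-call inference could look like as API middleware. Everything here is a placeholder: `classify_request` stands in for whatever "instant" endpoint you'd actually hit, and none of this is the real product code.

```python
import time
from fastapi import FastAPI, Request

app = FastAPI()

def classify_request(path: str) -> str:
    """Hypothetical stand-in for a millisecond-latency model call.
    In a real setup this would hit your fast inference endpoint."""
    return "ok"  # placeholder verdict

@app.middleware("http")
async def inference_on_every_call(request: Request, call_next):
    # Run a model pass on every single API call.
    start = time.perf_counter()
    verdict = classify_request(request.url.path)
    elapsed_ms = (time.perf_counter() - start) * 1000

    response = await call_next(request)
    # At millisecond latency, this overhead is barely noticeable per request.
    response.headers["x-inference-ms"] = f"{elapsed_ms:.2f}"
    response.headers["x-inference-verdict"] = verdict
    return response
```

The point of the sketch: if a single model call costs a millisecond or two, you can afford to put it in the hot path of every request instead of sampling or batching offline.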
I know, I know... But they're the ones labeling them "instant". There is a real need for a refresh of the datacenter workhorse that is GPT-4.1.
Also, how TF are you going to have an "instant" model release and not mention the latency characteristics at all?