I know Yann LeCun is trying to do a completely different architecture and I think that's expected to take 2-3 years before showing commercial results, right? Is that why they're finding it quicker to change the hardware?
Models aren't trained across their context, their context is their short term memory at runtime, right? Nothing to do with training. They are trained on a static dataset.