Most of the open models that were benchmarked are old, at least 6 months older than the latest open weights models launched in recent months.
The only big and recent open model that I see mentioned is GLM 5.1.
For such a study to be credible, it must benchmark all of the many open weights models that have been launched during the last 3 months, and in their full versions. I also want to see a list of the exact model versions that were benchmarked and what was used for inference.
Only then could one draw a meaningful conclusion about the current delay in months.
In any case, a single value for how far behind the open models are does not tell much, because actual performance depends heavily on the specific problem.
Following the links from TFA to private benchmarks, which are supposed to be more trustworthy, I see benchmarks where GPT 5.5 and Opus 4.6 are beaten by models like Qwen 3.7, e.g. in SimpleBench.
Of course, I assume that on average the OpenAI and Anthropic models are better, but this does not guarantee that they will be better than the open weights models on any particular problem.
So the value of the delay in "months" provides little information. If the OpenAI and Anthropic models had an advantage so great that they beat the Chinese models in every benchmark, that would be newsworthy. As long as their advantage is only probabilistic, i.e. they win more benchmarks than they lose, it is not decisive, and you cannot be certain that paying for them will really get you the best results.
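To make the distinction concrete, here is a minimal Python sketch of the difference between "wins more benchmarks than it loses" and "wins every benchmark". The benchmark names and scores are invented purely for illustration and are not measured results; `compare` is a hypothetical helper, not anything from TFA.

```python
# Hypothetical scores, invented for illustration only; NOT real benchmark results.
scores = {
    "bench_a": {"closed": 0.82, "open": 0.74},
    "bench_b": {"closed": 0.61, "open": 0.66},
    "bench_c": {"closed": 0.90, "open": 0.85},
}

def compare(scores):
    """Count per-benchmark wins and losses of the closed model against the open one."""
    wins = sum(1 for s in scores.values() if s["closed"] > s["open"])
    losses = sum(1 for s in scores.values() if s["closed"] < s["open"])
    return {
        "wins": wins,
        "losses": losses,
        # Probabilistic advantage: ahead on more benchmarks than behind.
        "wins_more_often": wins > losses,
        # Decisive advantage: strictly ahead on every single benchmark.
        "wins_everywhere": wins == len(scores),
    }

result = compare(scores)
print(result)
```

With these made-up numbers the closed model wins more often but not everywhere, which is the case where a single "months behind" figure hides the benchmarks on which the open model is actually ahead.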
How far behind are open models compared to Sonnet?
It may be that the absolute SOTA models are way ahead of open models, but the gap in the mid tier really does feel like it's compressing. I'd love to see empirical data about it though.