The models released a few days ago, Opus 4.6 and GPT 5.3, haven't been tested yet, but given the performance on other micro-benchmarks, they probably won't score much differently on this one.
Then go ahead and use AI to fix this: https://gitlab.gnome.org/GNOME/mutter/-/issues/4051