I tested Qwen3.6, Gemma4, and Nemotron3-nano-omni. All of them fully hallucinate x,y coordinates. (I have not tried GLM-5V yet.)
GPT-5.5 can do it easily. But Vocaela, a tiny 500M model, is also quite good at it. I hope they improve the training for x,y clicking soon on the smallish multi-modals.
Recently slopped together an HTTP service just so my local models can click, instead of relying on all the wild ways agents currently hack into the browser (browser-use, browser-harness, agent-browser, dev-browser, etc.): https://github.com/julius/vocaela-click-coords-http
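The idea is just: the model emits coordinates, and a plain HTTP call performs the click. A minimal client sketch, assuming a hypothetical `/click` route that accepts JSON `{"x": ..., "y": ...}` (the linked repo's actual API may differ):

```python
import json
import urllib.request

def build_click_payload(x: int, y: int) -> bytes:
    """Encode the coordinates the model produced as a JSON request body."""
    return json.dumps({"x": x, "y": y}).encode("utf-8")

def send_click(base_url: str, x: int, y: int) -> None:
    """POST the click to the service. The /click route is an assumption."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/click",  # hypothetical endpoint name
        data=build_click_payload(x, y),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        resp.read()
```

Keeping the surface this small is the point: any local model that can print two numbers can drive it, with no browser automation framework in the loop.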
Comprehensive evaluation results at https://gertlabs.com/rankings
However, both Kimi and GLM can end up in doom loops, so be careful how you use them. Without a proper harness the agent can easily get into tricky situations with no escape.
We had to develop new heuristics in our cloud harness just because of this, but I am glad we did: the platform now feels more robust.
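These are not our actual heuristics, but a minimal sketch of the simplest such guard: flag the agent as stuck when it repeats the exact same action several times in a row.

```python
from collections import deque

class DoomLoopGuard:
    """Crude doom-loop detector: suspects a loop when the last `window`
    recorded actions are all identical. A real harness would use richer
    signals (page state, action similarity, time budgets)."""

    def __init__(self, window: int = 3):
        self.window = window
        self.recent = deque(maxlen=window)  # rolling action history

    def record(self, action: str) -> bool:
        """Record an action; return True if a doom loop is suspected."""
        self.recent.append(action)
        return len(self.recent) == self.window and len(set(self.recent)) == 1
```

Usage: call `record()` after every agent step; when it returns True, escalate (re-prompt, reset the page, or hand off to a human) instead of letting the model grind on.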
A small price to pay for model plug & play!
Turbo makes a huge difference in everyday use: it saves time, and you are not always in the mood to wait endlessly.