Show HN: I taught GPT-OSS-120B to see using Google Lens and OpenCV
41 points by vkaufmann
by l1am0
3 subcomments
I don't get this. Isn't this the same as saying "I taught my 5-year-old to calculate integrals by typing them into Wolfram Alpha"? The actual relevant cognitive task (integrals in my example, "seeing" in yours) is outsourced to an external API.
Why do I need gpt-oss-120B at all in this scenario? Couldn't I just call, e.g., the gemini-3-pro API directly from the Python script?
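The direct-call version is a short sketch (using the google-genai SDK; the model name is just the one from my comment, and the prompt and filename are placeholders):

    from google import genai
    from google.genai import types

    client = genai.Client(api_key="YOUR_API_KEY")

    with open("desk.jpg", "rb") as f:
        image_bytes = f.read()

    response = client.models.generate_content(
        model="gemini-3-pro",  # placeholder name from my comment; use any vision-capable model
        contents=[
            types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"),
            "What objects are on this desk?",
        ],
    )
    print(response.text)

One call, no Lens scraping, and the model actually sees the pixels.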
by leumon
2 subcomments
Next, try actually teaching it to see by training a projector with a vision encoder on gpt-oss.
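That's the LLaVA-style recipe: freeze the vision encoder and the LLM, and train only a small projector that maps patch embeddings into the LLM's token-embedding space on image-caption pairs. Rough sketch (dimensions are illustrative; check gpt-oss's actual hidden size in its config):

    import torch.nn as nn

    class VisionProjector(nn.Module):
        # Maps frozen vision-encoder patch embeddings into the LLM's
        # token-embedding space; only this module gets gradient updates.
        def __init__(self, vision_dim=1024, llm_dim=2880):
            super().__init__()
            self.proj = nn.Sequential(
                nn.Linear(vision_dim, llm_dim),
                nn.GELU(),
                nn.Linear(llm_dim, llm_dim),
            )

        def forward(self, patch_embeds):
            # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
            return self.proj(patch_embeds)

The projected patch embeddings get prepended to the text embeddings, so the frozen LLM attends over "image tokens" as if they were words.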
by vessenes
1 subcomment
Confused as to why you wouldn’t integrate a local VLM if you want a local LLM as the backbone. Plenty of 8B–30B VLMs out there that are visually competent.
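With the ollama Python client, for example (a sketch; llava here stands in for whichever local VLM you actually pull):

    import ollama  # pip install ollama

    response = ollama.chat(
        model="llava",  # any locally pulled vision model
        messages=[{
            "role": "user",
            "content": "What objects are on this desk?",
            "images": ["desk.jpg"],  # local path to the photo
        }],
    )
    print(response["message"]["content"])

Same local-only setup, no network round-trip to Google.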
by magic_hamster
0 subcomments
> GPT-OSS-120B, a text-only model with zero vision support, correctly identified an NVIDIA DGX Spark and a SanDisk USB drive from a desk photo.
But wasn't it Google Lens that actually identified them?
by N_Lens
4 subcomments
Looks like a TOS violation to me to scrape Google directly like that. While the concept of giving a text-only model 'pseudo-vision' is clever, I think the solution in its current form is a bit fragile. SerpAPI, the Google Custom Search API, etc. exist for a reason; for anything beyond personal tinkering, this is a liability.
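For comparison, the sanctioned route is roughly (a sketch assuming SerpAPI's google_lens engine; the parameter and field names are from their docs, so verify against the current API):

    from serpapi import GoogleSearch  # pip install google-search-results

    params = {
        "engine": "google_lens",
        "url": "https://example.com/desk-photo.jpg",  # image must be publicly reachable
        "api_key": "YOUR_SERPAPI_KEY",
    }
    results = GoogleSearch(params).get_dict()

    # Top visual matches, roughly what Lens shows in the browser
    for match in results.get("visual_matches", [])[:5]:
        print(match.get("title"), "-", match.get("link"))

It costs money, but it won't break the moment Google changes its markup or rate-limits you.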
by villgax
0 subcomments
Booyah yourself. This is like being able to call two APIs and calling it learning? I thought you did some VLM stuff with a projection.
by tanduv
0 subcomments
You eventually get hit with a CAPTCHA with the Playwright approach.
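At minimum you want to detect the block instead of feeding the CAPTCHA page to the model. A sketch (the endpoint and the /sorry/ check are assumptions based on Google's usual bot interstitial, not something I've verified against Lens specifically):

    from playwright.sync_api import sync_playwright

    def fetch_lens_results(image_url: str) -> str:
        with sync_playwright() as p:
            browser = p.chromium.launch(headless=True)
            page = browser.new_page()
            page.goto(f"https://lens.google.com/uploadbyurl?url={image_url}")
            # Google's bot interstitial redirects to a /sorry/ URL; treat it
            # as a signal to back off rather than parsing garbage.
            if "/sorry/" in page.url:
                browser.close()
                raise RuntimeError("Hit the CAPTCHA wall; back off and retry later")
            html = page.content()
            browser.close()
            return html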
by TZubiri
1 subcomment
Have you tried Llama?
In my experience it has been strictly better than GPT-OSS, but it might depend on exactly how it's used.