Show HN: I taught GPT-OSS-120B to see using Google Lens and OpenCV
41 points by vkaufmann
by l1am0
3 subcomments
I don't get this. Isn't this the same as saying "I taught my 5-year-old to calculate integrals by typing them into Wolfram Alpha"? The actual relevant cognitive task (integrals in my example, "seeing" in yours) is outsourced to an external API.
Why do I need gpt-oss-120B at all in this scenario? Couldn't I just call, e.g., the gemini-3-pro API directly from the Python script?
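The direct-call version is a short sketch (using the google-genai SDK; the model name is just the one from my comment, and the prompt and filename are placeholders):

    from google import genai
    from google.genai import types

    client = genai.Client(api_key="YOUR_API_KEY")

    with open("desk.jpg", "rb") as f:
        image_bytes = f.read()

    response = client.models.generate_content(
        model="gemini-3-pro",  # placeholder name from my comment; use any vision-capable model
        contents=[
            types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"),
            "What objects are on this desk?",
        ],
    )
    print(response.text)

One call, no Lens scraping, and the model actually sees the pixels.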
by leumon
2 subcomments
Next, try actually teaching it to see by training a projector with a vision encoder on gpt-oss.
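That's the LLaVA-style recipe: freeze the vision encoder and the LLM, and train only a small projector that maps patch embeddings into the LLM's token-embedding space on image-caption pairs. Rough sketch (dimensions are illustrative; check gpt-oss's actual hidden size in its config):

    import torch.nn as nn

    class VisionProjector(nn.Module):
        # Maps frozen vision-encoder patch embeddings into the LLM's
        # token-embedding space; only this module gets gradient updates.
        def __init__(self, vision_dim=1024, llm_dim=2880):
            super().__init__()
            self.proj = nn.Sequential(
                nn.Linear(vision_dim, llm_dim),
                nn.GELU(),
                nn.Linear(llm_dim, llm_dim),
            )

        def forward(self, patch_embeds):
            # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
            return self.proj(patch_embeds)

The projected patch embeddings get prepended to the text embeddings, so the frozen LLM attends over "image tokens" as if they were words.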
by vessenes
1 subcomment
Confused as to why you wouldn’t integrate a local VLM if you want a local LLM as the backbone. Plenty of 8B–30B VLMs out there that are visually competent.
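With the ollama Python client, for example (a sketch; llava here stands in for whichever local VLM you actually pull):

    import ollama  # pip install ollama

    response = ollama.chat(
        model="llava",  # any locally pulled vision model
        messages=[{
            "role": "user",
            "content": "What objects are on this desk?",
            "images": ["desk.jpg"],  # local path to the photo
        }],
    )
    print(response["message"]["content"])

Same local-only setup, no network round-trip to Google.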
by magic_hamster
0 subcomments
> GPT-OSS-120B, a text-only model with zero vision support, correctly identified an NVIDIA DGX Spark and a SanDisk USB drive from a desk photo.
But wasn't it Google Lens that actually identified them?
by N_Lens
4 subcomments
Looks like a TOS violation to me to scrape Google directly like that. While the concept of giving a text-only model 'pseudo-vision' is clever, I think the solution in its current form is a bit fragile. SerpAPI, the Google Custom Search API, etc. exist for a reason; for anything beyond personal tinkering, this is a liability.
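For comparison, the sanctioned route is roughly (a sketch assuming SerpAPI's google_lens engine; the parameter and field names are from their docs, so verify against the current API):

    from serpapi import GoogleSearch  # pip install google-search-results

    params = {
        "engine": "google_lens",
        "url": "https://example.com/desk-photo.jpg",  # image must be publicly reachable
        "api_key": "YOUR_SERPAPI_KEY",
    }
    results = GoogleSearch(params).get_dict()

    # Top visual matches, roughly what Lens shows in the browser
    for match in results.get("visual_matches", [])[:5]:
        print(match.get("title"), "-", match.get("link"))

It costs money, but it won't break the moment Google changes its markup or rate-limits you.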
by villgax
0 subcomments
Booyah yourself. This is like being able to call two APIs and calling it learning? I thought you did some VLM stuff with a projection.
by tanduv
0 subcomments
You eventually get hit with a CAPTCHA with the Playwright approach.
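At minimum you want to detect the block instead of feeding the CAPTCHA page to the model. A sketch (the endpoint and the /sorry/ check are assumptions based on Google's usual bot interstitial, not something I've verified against Lens specifically):

    from playwright.sync_api import sync_playwright

    def fetch_lens_results(image_url: str) -> str:
        with sync_playwright() as p:
            browser = p.chromium.launch(headless=True)
            page = browser.new_page()
            page.goto(f"https://lens.google.com/uploadbyurl?url={image_url}")
            # Google's bot interstitial redirects to a /sorry/ URL; treat it
            # as a signal to back off rather than parsing garbage.
            if "/sorry/" in page.url:
                browser.close()
                raise RuntimeError("Hit the CAPTCHA wall; back off and retry later")
            html = page.content()
            browser.close()
            return html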
by TZubiri
1 subcomment
Have you tried Llama?
In my experience it has been strictly better than GPT-OSS, but it might depend on exactly how it's used.