Hacker News

Gemini 3 Pro: the frontier of vision AI

566 points by xnx

by Workaccount2

27 subcomments

Well, it is the first model to get partial credit on an LLM image test I have, which is counting the legs of a dog. Specifically, a dog with 5 legs. This is a wild test, because LLMs get really pushy and insistent that the dog only has 4 legs.
In fact GPT-5 wrote an edge detection script to see where "golden dog feet" met "bright green grass" to prove to me that there were only 4 legs. The script found 5, and GPT-5 then said it was a bug, and adjusted the script sensitivity so it only located 4, lol.
Anyway, Gemini 3, while still being unable to count the legs first try, did identify "male anatomy" (its own words) also visible in the picture. The 5th leg was approximately where you could expect a well endowed dog to have a "5th leg".
That aside though, I still wouldn't call it particularly impressive.
As a note, Meta's image slicer correctly highlighted all 5 legs without a hitch. Maybe not quite a transformer, but interesting that it could properly interpret "dog leg" and ID them. Also the dogs with many legs (I have a few of them) all had their extra legs added by nano-banana.

by knollimar

4 subcomments

I do some electrical drafting work for construction and throw basic tasks at LLMs.
I gave it a shitty harness and it almost 1 shotted laying out outlets in a room based on a shitty pdf. I think if I gave it better control it could do a huge portion of my coworkers' jobs very soon.

by fngjdflmdflg

2 subcomments

These OCR improvements will almost certainly be brought to Google Books, which is great. Long term it can enable compressing all non-digital rare books into a manageable size that can be stored for less than $5,000.[0] It would also be great for archive.org to move to this from Tesseract. I wonder what the cost would be, both in raw cost to run, and via a paid API, to do that.
[0] https://annas-archive.org/blog/critical-window.html

by djoldman

3 subcomments

Interesting "ScreenSpot Pro" results:
```
    72.7% Gemini 3 Pro
    11.4% Gemini 2.5 Pro
    49.9% Claude Opus 4.5
    3.50% GPT-5.1
```
ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use
https://arxiv.org/abs/2504.07981

by simonw

4 subcomments

In case the article author sees this, the "HTML transcription" link is broken - it goes to https://aistudio-preprod.corp.google.com/prompts/1GUEWbLIlpX... which is a Google-employee-only URL.

by mhl47

1 subcomments

We are currently working on some Christmas puzzles that are, I would say, a bit more difficult from the visual side. GPT-5.1 completely failed at all of them, while Gemini 3 has solved two so far that I would consider rather impressive.
One was two screenshots of a phone screen with timestamped chats, and it had to take the nth letter of the mth word based on the timestamp. While this type of riddle could be in the training data, the ability to OCR this that well and understand the spatial relation to each object perfectly is something I have not seen from other models yet.

by TheAceOfHearts

1 subcomments

Since I think it's interesting to highlight the jagged intelligence, I have a simple word search puzzle [0] that Nano Banana Pro still struggles to solve correctly. Gemini 3 Pro with Code Execution is able to one-shot the problem and find the positions of each word (this is super impressive! one year ago it wasn't possible), but Nano Banana Pro fails to highlight the words correctly.
Here's the output from two tests I ran:
1. Asking Nano Banana Pro to solve the word search puzzle directly [1].
2. Asking Nano Banana Pro to highlight each word on the grid, with the position of every word included as part of the prompt [2].
The fact that it gets 2 words correct demonstrates meaningful progress, and it seems like we're really close to having a model that can one-shot this problem soon.
There's actually a bit of nuance required to solve this puzzle correctly which an older Gemini model struggled to do without additional nudging. You have to convert the grid or word list to use matching casing (the grid uses uppercase, the word list uses lowercase), and you need to recognize that "soup mix" needs to have the space removed when doing the search.
[0] https://imgur.com/ekwfHrN
[1] https://imgur.com/1nybezU
[2] https://imgur.com/18mK5i5
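The two normalization steps mentioned above (matching casing, dropping the space in "soup mix") can be sketched in a few lines. This is only an illustrative solver, not what Gemini's Code Execution actually ran; the function names and the tiny grid are hypothetical, and it assumes a rectangular grid with words running in any of the eight straight-line directions.

```python
from typing import Optional

# All eight directions a word can run in: (row step, column step).
DIRECTIONS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

Span = tuple[tuple[int, int], tuple[int, int]]


def normalize(word: str) -> str:
    """Match the grid's uppercase and drop spaces ("soup mix" -> "SOUPMIX")."""
    return word.replace(" ", "").upper()


def find_word(grid: list[str], word: str) -> Optional[Span]:
    """Return ((start_row, start_col), (end_row, end_col)), or None if absent.

    Assumes every row of the grid has the same length.
    """
    target = normalize(word)
    if not target or not grid:
        return None
    rows = [row.upper() for row in grid]
    height, width = len(rows), len(rows[0])
    for r in range(height):
        for c in range(width):
            for dr, dc in DIRECTIONS:
                end_r = r + dr * (len(target) - 1)
                end_c = c + dc * (len(target) - 1)
                # Skip directions that would run off the grid.
                if not (0 <= end_r < height and 0 <= end_c < width):
                    continue
                if all(rows[r + dr * i][c + dc * i] == ch
                       for i, ch in enumerate(target)):
                    return (r, c), (end_r, end_c)
    return None


# Hypothetical 4x7 grid, not the puzzle from the linked image.
grid = [
    "SOUPMIX",
    "AXXXXXX",
    "LXXXXXX",
    "TXXXXXX",
]
print(find_word(grid, "soup mix"))  # ((0, 0), (0, 6))
print(find_word(grid, "salt"))      # ((0, 0), (3, 0))
```

Without the `normalize` step, a lowercase "soup mix" from the word list never matches the uppercase, space-free run in the grid, which is the nuance older models reportedly needed a nudge on.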

by hodder

3 subcomments

"Gemini 3 Pro represents a generational leap from simple recognition to true visual and spatial reasoning."
Prompt: "wine glass full to the brim"
Image generated: 2/3 full wine glass.
True visual and spatial reasoning denied.

by aziis98

3 subcomments

> Pointing capability: Gemini 3 has the ability to point at specific locations in images by outputting pixel-precise coordinates. Sequences of 2D points can be strung together to perform complex tasks, such as estimating human poses or reflecting trajectories over time
Does somebody know how to correctly prompt the model for these tasks or even better provide some docs? The pictures with the pretty markers are appreciated but that section is a bit vague and without references
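For the post-processing side only, here is a minimal sketch of turning a returned point into pixel coordinates. It assumes the model answers with points as `[y, x]` pairs normalized to a 0-1000 range, which is the convention recalled from earlier Gemini spatial-understanding examples and may not hold for Gemini 3 (the quoted text says "pixel-precise", so check the actual response format first). The function name and the `scale` parameter are hypothetical.

```python
def to_pixels(point_yx: tuple[float, float],
              image_width: int,
              image_height: int,
              scale: float = 1000.0) -> tuple[int, int]:
    """Convert an assumed normalized [y, x] point into (x, y) pixel coordinates."""
    y_norm, x_norm = point_yx
    x = round(x_norm / scale * image_width)
    y = round(y_norm / scale * image_height)
    # Clamp so a point on the far edge still indexes a valid pixel.
    x = min(max(x, 0), image_width - 1)
    y = min(max(y, 0), image_height - 1)
    return x, y


print(to_pixels((500, 250), 1920, 1080))    # (480, 540)
print(to_pixels((1000, 1000), 1920, 1080))  # (1919, 1079)
```

Note the order swap: the assumed input is `[y, x]`, while most drawing libraries expect `(x, y)`.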

by ed

1 subcomments

What’s new here? I believe this is just gemini 3 which was released last month (the model id hasn’t changed AFAICT)

by siva7

2 subcomments

Interesting. When I asked Gemini 3 Pro to generate an infographic from my personal accounting sheet, it first failed to generate anything except a black background, then it generated something where it mixed different languages in a nonsensical way, with obvious typos and irrelevant information grouping. It's certainly a leap forward in OCR, rendering classic OCR useless.

by devinprater

3 subcomments

Audio-described YouTube please? That'd be so amazing! Even if I couldn't play Zelda yet, I could listen to a playthrough with Gemini describing it.

by MostlyStable

1 subcomments

Going to compare this to our current solution of Amazon's Textract service for analyzing handwritten datasheets. Textract, when extracting tables (which is what we use it for) does not allow for providing any context or information about the tables and what we expect them to contain, but it is really good at correctly recognizing hand written characters. All of my attempts at less specialized, more general models allow me to provide that context, which is helpful in some ways, but fail at the basic part of almost always correctly getting the character.
Hopefully Google pro marries the two together.

by hackeruser741

0 subcomment

It's fascinating how these models struggle with simple counting or novel configurations like a 5-legged dog or a 13-hour clock, despite excelling at complex language tasks. It highlights the difference between learning patterns from vast datasets and true conceptual understanding.

by axpy906

0 subcomment

So Gemini was the most non-deterministic model of them all and now we get this one with temperature at 1 and max thinking. It’s so random that it’s hard to justify putting in my setup right now.

by iamjackg

3 subcomments

Curious how this will fare when playing Pokemon Red.

by caseyf

1 subcomments

I'm playing with this and wondering if this is an actually good way to identify dominant colors and other features of a garment/product when using a photo where the item is styled and not isolated from the model or other garments

by k8sToGo

1 subcomments

When will we get Gemini 3 Flash?

by a-dub

1 subcomments

i like to put it in live mode and point it at my plants and have conversations about how they're doing. it properly identifies them and flags any signs of disease and then provides correct next steps.

by jonplackett

0 subcomment

Google really are a fully woken sleeping giant. More code reds being issued today I expect.

by causal

0 subcomment

Okay maybe this one isn't an exaggeration when they say leap forward

by pseudosavant

1 subcomments

I'm really fascinated by the opportunities to analyze videos. The amount of tokens it compresses down to, and what you can reason across those tokens, is incredible.

by drivebyhooting

0 subcomment

Screen understanding is huge for further automating dev work.

by ch2026

0 subcomment

what framework is being utilized for computer use here?

by stego-tech

2 subcomments

The document paints a super impressive picture, but the core constraint of “network connection to Google required so we can harvest your data” is still a big showstopper for me (and all cloud-based AI tooling, really).
I’d be curious to see how well something like this can be distilled down for isolated acceleration on SBCs or consumer kit, because that’s where the billions to be made reside (factories, remote sites, dangerous or sensitive facilities, etc).

by bovermyer

0 subcomment

I would be interested in seeing what G3P makes of the Dead Sea Scrolls or similarly old documents.

by themafia

1 subcomments

"the frontier"
I've never hated industry infatuation with a buzzword more.

by romanovcode

0 subcomment

I gotta say - processing video at 10fps is very impressive.

by genrader

0 subcomment

This is an excellent, short way to see that Gemini 3 Pro is substantially better at understanding the data you give it.
Knowing how to ask it correctly for the info you want is still a skill many people lack.

by kkukshtel

0 subcomment

sounds awesome but too bad it is impossible to figure out how to actually use these models and what I have to pay for/where

by Frannky

0 subcomment

It's a good model. I worry that they will be able to win the game by offering the best service for free, thanks to selling users' data, kind of like search, email, etc. It's sad. Not that the alternatives are better... You either trust sycophantic ChatGPT backed by Scama, go with woke Claude (they once banned my account for asking how some news was trying to influence me), Grok that feels like a 20-year-old sure about stuff that don't work, and Chinese models that are agenda-aligned...

by empressplay

0 subcomment

Yes, but can it play PacMan yet?

by dmarzio

0 subcomment

So we’re going to use this to make the maid from the Jetsons finally. Right?

by ichik

0 subcomment

Frankly, it's insane how laughably bad under scrutiny their own examples are. It both distorted the data and made the chart less readable (label placement, segment separation, missing labels, worse contrast). And it combined them into one, so you'll have a harder time comparing them than with the original image! Isn't it amazing that it added a toggle? Post author seems to think it deserves an exclamation point even.

by agentifysh

2 subcomments

im realizing how much of a bottleneck vision models are
im just a glorified speedreadin' promptin' QA at this point with codex
once it replaces the QA layer its truly over for software dev jobs
future would be a software genie where on aistudio you type: "go make counterstrike 1.6 clone, here is $500, you have two hours"
edit: saw the Screenspot benchmark and holy ** this is an insane jump!!! 11% to 71% even beating Opus 4.5's 50%...chatgpt is at 3.5% and it matches my experience with codex