I noticed that Gemini 3 Pro can no longer recognize the audio files I upload. It just gives me information from old chats or random stuff that isn’t relevant. It can’t even grasp the topic of the class anymore; it just spits out nonsense.
Something has changed. Just a few days ago it worked just fine with Gemini 2.5 Pro!
It's not just me: https://www.reddit.com/r/GeminiAI/comments/1p0givt/gemini_25...
But I am seeing it with Gemini 3 Pro on the web (Pro subscription). AI Studio is still working fine.
Summarizing a 3.5 hour council meeting is something of a holy grail of AI-assisted reporting. There are a LOT of meetings like that, and newspapers (especially smaller ones) can no longer afford to have a human reporter sit through them all.
I tried this prompt (against audio from https://www.youtube.com/watch?v=qgJ7x7R6gy0):
Output a Markdown transcript of this meeting. Include speaker
names and timestamps. Start with an outline of the key
meeting sections, each with a title and summary and timestamp
and list of participating names. Note in bold if anyone
raised their voices, interrupted each other or had
disagreements. Then follow with the full transcript.
Here's the result: https://gist.github.com/simonw/0b7bc23adb6698f376aebfd700943...

I'm not sure quite how to grade it here, especially since I haven't sat through the whole 3.5 hour meeting video myself.
It appears to have captured the gist of the meeting very well, but the transcript isn't close to an exact match for what was said, and the timestamps are incorrect, which makes the output very hard to trust. Could it have hallucinated things that didn't happen? Those can at least be spotted by digging into the video (or the YouTube transcript) to check that they actually occurred... but what if there was a key point that Gemini 3 omitted entirely?
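For anyone who wants to reproduce this against the API rather than the web UI, here's a minimal sketch. It assumes the google-genai Python SDK, a GEMINI_API_KEY in the environment, and the gemini-3-pro-preview model ID; adjust those to whatever your account actually exposes:

```python
# Minimal sketch: send a long audio file plus the transcript prompt to Gemini.
# Assumes the google-genai SDK (pip install google-genai), GEMINI_API_KEY set
# in the environment, and the gemini-3-pro-preview model ID.
from google import genai

client = genai.Client()

# Large files go through the Files API rather than inline bytes.
audio = client.files.upload(file="council_meeting.mp3")

prompt = (
    "Output a Markdown transcript of this meeting. Include speaker "
    "names and timestamps. Start with an outline of the key meeting "
    "sections, each with a title and summary and timestamp and list "
    "of participating names. Note in bold if anyone raised their "
    "voices, interrupted each other or had disagreements. Then "
    "follow with the full transcript."
)

response = client.models.generate_content(
    model="gemini-3-pro-preview",
    contents=[audio, prompt],
)
print(response.text)
```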
These things are getting really good at plain transcription (as long as you don't care about verbatim accuracy), but every additional dimension you add (timestamps, speaker assignment, etc.) makes the others worse. These work much better as independent processes whose outputs are then reconciled and refined by a multimodal LLM.
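In case anyone wants to try that split, here's a rough sketch of the idea. The specific stack is my own choice for illustration: whisper for transcription, pyannote.audio for diarization, and the google-genai SDK for the reconciliation pass; swap in whatever you prefer.

```python
# Rough sketch of the split-then-reconcile approach: run transcription and
# speaker diarization as separate passes, then hand both outputs to an LLM
# to merge. whisper/pyannote are stand-ins for any equivalent stack.
import whisper
from pyannote.audio import Pipeline
from google import genai

AUDIO = "council_meeting.wav"

# Pass 1: transcript with segment-level timestamps.
asr = whisper.load_model("medium")
segments = asr.transcribe(AUDIO)["segments"]
transcript = "\n".join(
    f"[{s['start']:.1f}-{s['end']:.1f}] {s['text'].strip()}" for s in segments
)

# Pass 2: who spoke when (the pyannote model requires a Hugging Face token).
diarizer = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")
turns = "\n".join(
    f"[{turn.start:.1f}-{turn.end:.1f}] {speaker}"
    for turn, _, speaker in diarizer(AUDIO).itertracks(yield_label=True)
)

# Pass 3: let a multimodal LLM reconcile the two independent streams.
client = genai.Client()
response = client.models.generate_content(
    model="gemini-3-pro-preview",
    contents=[
        "Merge this timestamped transcript with the diarization turns "
        "into a Markdown transcript with speaker labels:\n\n"
        f"TRANSCRIPT:\n{transcript}\n\nSPEAKER TURNS:\n{turns}"
    ],
)
print(response.text)
```

The point of the split is that each pass only has one job, so errors in one dimension (say, speaker boundaries) don't degrade the others the way a single do-everything prompt does.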
I'll need to update for V2!
Perhaps half with a web browser to view the results, and half working blind with the numbers alone?
Without having an LLM figure out the required command line parameters? Mad props!
Love the pivot in the pelican generation bench.