I noticed that Gemini 3 Pro can no longer recognize the audio files I upload. It just gives me information from old chats or random stuff that isn’t relevant. It can’t even grasp the topic of the class anymore; it just spits out nonsense.
Something has changed. Just a few days ago it worked just fine with Gemini 2.5 Pro!
It's not just me: https://www.reddit.com/r/GeminiAI/comments/1p0givt/gemini_25...
But I am seeing it with Gemini 3 Pro on the web (Pro subscription). AI Studio is still working fine.
Summarizing a 3.5 hour council meeting is something of a holy grail of AI-assisted reporting. There are a LOT of meetings like that, and newspapers (especially smaller ones) can no longer afford to have a human reporter sit through them all.
I tried this prompt (against audio from https://www.youtube.com/watch?v=qgJ7x7R6gy0):
Output a Markdown transcript of this meeting. Include speaker
names and timestamps. Start with an outline of the key
meeting sections, each with a title and summary and timestamp
and list of participating names. Note in bold if anyone
raised their voices, interrupted each other or had
disagreements. Then follow with the full transcript.
Here's the result: https://gist.github.com/simonw/0b7bc23adb6698f376aebfd700943...

I'm not sure quite how to grade it here, especially since I haven't sat through the whole 3.5 hour meeting video myself.
It appears to have captured the gist of the meeting very well, but the transcript isn't close to an exact match for what was said, and the timestamps are incorrect, which makes the output very hard to trust. Could it have hallucinated things that didn't happen? Those can at least be spotted by digging into the video (or the YouTube transcript) to check that they actually occurred... but what if there was a key point that Gemini 3 omitted entirely?
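For anyone who wants to reproduce this against the API rather than the web UI, here's a minimal sketch. It assumes the google-genai Python SDK, a GEMINI_API_KEY in the environment, and the gemini-3-pro-preview model ID; adjust those to whatever your account actually exposes:

```python
# Minimal sketch: send a long audio file plus the transcript prompt to Gemini.
# Assumes the google-genai SDK (pip install google-genai), GEMINI_API_KEY set
# in the environment, and the gemini-3-pro-preview model ID.
from google import genai

client = genai.Client()

# Large files go through the Files API rather than inline bytes.
audio = client.files.upload(file="council_meeting.mp3")

prompt = (
    "Output a Markdown transcript of this meeting. Include speaker "
    "names and timestamps. Start with an outline of the key meeting "
    "sections, each with a title and summary and timestamp and list "
    "of participating names. Note in bold if anyone raised their "
    "voices, interrupted each other or had disagreements. Then "
    "follow with the full transcript."
)

response = client.models.generate_content(
    model="gemini-3-pro-preview",
    contents=[audio, prompt],
)
print(response.text)
```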
These things are getting really good at plain transcription (as long as you don't care about verbatim accuracy), but every additional dimension you add (timestamps, speaker assignment, etc.) makes the others worse. These work much better as independent processes whose outputs are then reconciled and refined by a multimodal LLM.
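In case anyone wants to try that split, here's a rough sketch of the idea. The specific stack is my own choice for illustration: whisper for transcription, pyannote.audio for diarization, and the google-genai SDK for the reconciliation pass; swap in whatever you prefer.

```python
# Rough sketch of the split-then-reconcile approach: run transcription and
# speaker diarization as separate passes, then hand both outputs to an LLM
# to merge. whisper/pyannote are stand-ins for any equivalent stack.
import whisper
from pyannote.audio import Pipeline
from google import genai

AUDIO = "council_meeting.wav"

# Pass 1: transcript with segment-level timestamps.
asr = whisper.load_model("medium")
segments = asr.transcribe(AUDIO)["segments"]
transcript = "\n".join(
    f"[{s['start']:.1f}-{s['end']:.1f}] {s['text'].strip()}" for s in segments
)

# Pass 2: who spoke when (the pyannote model requires a Hugging Face token).
diarizer = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")
turns = "\n".join(
    f"[{turn.start:.1f}-{turn.end:.1f}] {speaker}"
    for turn, _, speaker in diarizer(AUDIO).itertracks(yield_label=True)
)

# Pass 3: let a multimodal LLM reconcile the two independent streams.
client = genai.Client()
response = client.models.generate_content(
    model="gemini-3-pro-preview",
    contents=[
        "Merge this timestamped transcript with the diarization turns "
        "into a Markdown transcript with speaker labels:\n\n"
        f"TRANSCRIPT:\n{transcript}\n\nSPEAKER TURNS:\n{turns}"
    ],
)
print(response.text)
```

The point of the split is that each pass only has one job, so errors in one dimension (say, speaker boundaries) don't degrade the others the way a single do-everything prompt does.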
I'll need to update for V2!
Perhaps half with a web browser to view the results, and half working blind with the numbers alone?
Without having an LLM figure out the required command line parameters? Mad props!
Love the pivot in the pelican generation bench.