- In transcribing a talk by Andrej, you already picked the most challenging case possible, speed-wise. His natural talking speed is already >=1.5x that of a normal human. He's one of the people for whom you absolutely have to set your YouTube speed back down to 1x to follow what's going on.
In the spirit of making more of an OpenAI minute, don't send it any silence.
E.g.
ffmpeg -i video-audio.m4a \
-af "silenceremove=start_periods=1:start_duration=0:start_threshold=-50dB:\
stop_periods=-1:stop_duration=0.02:stop_threshold=-50dB,\
apad=pad_dur=0.02" \
-c:a aac -b:a 128k output_minpause.m4a -y
will cut the talk down from 39m31s to 31m34s by replacing any silence longer than 20ms (at a -50dB threshold) with a 20ms pause. And to keep with the spirit of your post, I only measured that the input file got shorter; I didn't look at all at the quality of the transcription produced from the shorter version.
- A point on skimming vs taking the time to read something properly.
I read a transcript + summary of that exact talk. I thought it was fine but uninteresting, and I moved on.
Later I saw it had been put on YouTube and I was on the train, so I watched the whole thing at normal speed. Watching it end to end sparked a huge number of ideas, thoughts, and decisions.
This happens to me in other areas too. Watching a conference talk in person is far more useful to me than watching it online with other distractions. Watching it online is, in turn, more useful than reading a summary.
Going for a walk to think about something deeply beats a 10 minute session to "solve" the problem and forget it.
Slower is usually better for thinking.
by georgemandis
1 subcomments
- I was trying to summarize a 40-minute talk with OpenAI’s transcription API, but it was too long. So I sped it up with ffmpeg to fit within the 25-minute cap. It worked quite well (up to 3x speed) and was cheaper and faster, so I wrote about it.
Felt like a fun trick worth sharing. There’s a full script and cost breakdown.
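For a quick feel of the idea, here's a minimal sketch (file names, the 2x factor, and the bitrate are illustrative; the full script in the post has the details, including the yt-dlp step):
import subprocess
from openai import OpenAI

SPEED = 2.0  # 2x-3x worked well; 4x degraded quality

# Speed the audio up and re-encode at a low bitrate to shrink the upload.
subprocess.run([
    "ffmpeg", "-y", "-i", "talk.m4a",
    "-filter:a", f"atempo={SPEED}",
    "-c:a", "aac", "-b:a", "64k",
    "talk_fast.m4a",
], check=True)

client = OpenAI()
with open("talk_fast.m4a", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe", file=f
    )
print(transcript.text)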
- > Is It Accurate?
> I don’t know—I didn’t watch it, lol. That was the whole point. And if that answer makes you uncomfortable, buckle-up for this future we're hurtling toward. Boy, howdy.
This is a great bit of work, and the author accurately summarizes my discomfort.
- There was a similar trick which worked with Gemini versions prior to Gemini 2.0: they charged a flat rate of 258 tokens for an image, and it turns out you could fit more than 258 tokens of text in an image of text and use that for a discount!
by dataviz1000
1 subcomments
- I built a Chrome extension with one feature that transcribes audio to text in the browser using huggingface/transformers.js running the OpenAI Whisper model with WebGPU. It works perfectly! Here is a list of examples of all the things you can do in the browser with WebGPU for free. [0]
The last thing in the world I want to do is listen to or watch presidential social media posts, but, on the other hand, sometimes enormously stupid things are said which move the S&P 500 up or down $60 in a session. So this feature queries for new posts every minute, does OCR on images and transcribes video audio to text locally, and sends the post plus extracted text for analysis, all in the background inside a Chrome extension, before notifying me of anything economically significant.
[0] https://github.com/huggingface/transformers.js/tree/main/exa...
[1] https://github.com/adam-s/doomberg-terminal
- For anybody trying to do this in bulk, instead of using OpenAI's Whisper via their API, you can also use Groq [0], which is much cheaper:
[0] https://groq.com/pricing/
Groq is ~$0.02/hr with distil-large-v3, or ~$0.04/hr with whisper-large-v3-turbo. I believe OpenAI comes out to like ~$0.36/hr.
We do this internally with our tool that automatically transcribes local government council meetings right when they get uploaded to YouTube. It uses Groq by default, but I also added support for Replicate and Deepgram as backups because sometimes Groq errors out.
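For anyone curious, the call itself is just the OpenAI-compatible audio endpoint pointed at a different base URL. A rough sketch (the base URL and model names are from memory, so double-check Groq's docs):
import os
from openai import OpenAI

groq = OpenAI(
    api_key=os.environ["GROQ_API_KEY"],
    base_url="https://api.groq.com/openai/v1",
)
with open("council_meeting.mp3", "rb") as f:
    result = groq.audio.transcriptions.create(
        model="whisper-large-v3-turbo",  # or the cheaper distil model
        file=f,
    )
print(result.text)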
- >> by jumping straight to the point ...
Love this! I wish more authors followed this approach. So many articles wander all over the place before 'the point' appears.
If they tried, perhaps some 50% of authors would realize that they don't _have_ a point.
- Why would you give up your privacy by sending what interests you to OpenAI when Whisper doesn't need that much compute in the first place?
With faster-whisper (int8, batch=8) you can transcribe 13 minutes of audio in 51 seconds on CPU.
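Roughly, as a sketch (assuming a recent faster-whisper release that includes the batched pipeline; model size and file name are placeholders):
from faster_whisper import WhisperModel, BatchedInferencePipeline

# int8 on CPU with batched inference, as described above
model = WhisperModel("large-v3", device="cpu", compute_type="int8")
batched = BatchedInferencePipeline(model=model)

segments, info = batched.transcribe("talk.m4a", batch_size=8)
for seg in segments:
    print(f"[{seg.start:.1f} -> {seg.end:.1f}] {seg.text}")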
- I use the YouTube trick and will share it here: upload to YouTube, let their built-in transcription service turn the audio into text for you, and then use Gemini 2.5 Pro to rebuild the transcript.
ffmpeg \
-f lavfi \
-i color=c=black:s=1920x1080:r=5 \
-i file_you_want_transcripted.wav \
-c:v libx264 \
-preset medium \
-tune stillimage \
-crf 28 \
-c:a aac \
-b:a 192k \
-pix_fmt yuv420p \
-shortest \
file_you_upload_to_youtube_for_free_transcripts.mp4
This works VERY well for my needs.
- You can just dump the YouTube video link into Google AI Studio and ask it to transcribe the video with speaker labels, and even ask it to add useful visual cues, because the model is multimodal for video too.
by brendanfinan
2 subcomments
- would this also work for my video consisting of 10,000 PDFs?
https://news.ycombinator.com/item?id=44125598
- Love this idea, but the accuracy section is lacking. Couldn't you do a simple diff of the outputs and see how many differences there are? 0.5% or 5%?
by conjecTech
1 subcomments
- If you are hosting Whisper yourself, you can do something slightly more elegant, but with the same effect. You can downsample/pool the context 2:1 (or potentially more) a few layers into the encoder. That allows you to do the equivalent of speeding up audio without worrying about potential spectral losses. For Whisper large v3, that gets you nearly double throughput in exchange for a relative ~4% WER increase.
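A very rough, purely illustrative sketch of the idea against openai-whisper's module layout (the pooling depth, exact placement, and real WER impact are assumptions that depend on your setup):
import types
import torch.nn.functional as F
import whisper

model = whisper.load_model("large-v3")
POOL_AFTER = 4  # number of encoder blocks to run before pooling (illustrative)

def pooled_forward(self, x):
    # Same as the stock encoder forward, plus a 2:1 mean-pool over time.
    x = F.gelu(self.conv1(x))
    x = F.gelu(self.conv2(x))
    x = x.permute(0, 2, 1)
    x = (x + self.positional_embedding).to(x.dtype)
    for i, block in enumerate(self.blocks):
        x = block(x)
        if i + 1 == POOL_AFTER:
            # (batch, time, dim) -> (batch, time/2, dim)
            x = F.avg_pool1d(x.permute(0, 2, 1), kernel_size=2).permute(0, 2, 1)
    return self.ln_post(x)

model.encoder.forward = types.MethodType(pooled_forward, model.encoder)
print(model.transcribe("talk.m4a")["text"])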
- If you're already doing local ffmpeg stuff (i.e. pretty involved with code and scripting already), you're only a couple of steps away from just downloading the openai-whisper models (or even the faster-whisper models, which run about two times faster). Since this looks like personal usage and not production-quality code, you can use AI (e.g. Cursor) to write a script that runs the Whisper model inference in seconds.
Then there is no cost at all for any length of audio (since cost seems to be the primary concern of the article).
On my M1 Mac laptop it takes about 30 seconds to run on a 3-minute audio file. I'm guessing a 40-minute talk would take about 5-10 minutes.
- This is great, thank you for sharing. I work on these APIs at OpenAI; it's a surprise to me that it still works reasonably well at 2x/3x speed, but on the other hand for phone channels we get 8kHz audio that is upsampled to 24kHz for the model and it still works well. Note there's probably a measurable decrease in transcription accuracy that worsens as you deviate from 1x speed. Also we really need to support bigger/longer file uploads :)
by another_twist
0 subcomment
- You'd need a WER comparison to check whether there really is no drop in quality. With this trick, there might be trouble if the audio is noisy, and it may not always be obvious whether or not to speed up.
- Interesting approach to transcript generation!
I'm implementing a similar workflow for VideoToBe.com
My Current Pipeline:
Media Extraction - yt-dlp for reliable video/audio downloads
Local Transcription - OpenAI Whisper running on my own hardware (no API costs)
Storage & UI - Transcripts stored in S3 with a custom web interface for viewing
Y Combinator playlist
https://videotobe.com/play/playlist/ycombinator
and Andrej's talk is
https://videotobe.com/play/youtube/LCEmiRjPEtQ
After reading your blog post, I will be testing the effect of speeding up audio on locally hosted Whisper models. Running Whisper locally eliminates the ongoing cost concerns since my infrastructure is already a sunk cost. Speeding up the audio could be an interesting performance enhancement to explore!
- Appreciated the concise summary + code snippet upfront, followed by more detail and background for those interested. More articles should be written this way!
- Do the APIs support simultaneous voice transcription in a way that different voices are tagged? (either in text or as metadata)
If so: could you split the audio file, pitch-shift the latter half by, say, an octave, and then mix the two halves together to get a shorter audio file, then transcribe it and join the result back into linear form, with the tagging removed? (You could insert some prerecorded voice to mark where the second half starts.) If a pitch change is not enough, maybe manipulate the formants as well.
- Gemini 2.5 Pro is, in my usage, quite superior for high-quality transcriptions of phone calls, in Dutch in my case. As long as you upload the audio to GCS, you can easily process conversations of over an hour there. It correctly identified and labeled speakers.
The cheaper 2.5 flash made noticeably more mistakes, for example it didn't correctly output numbers while the Pro model did.
As for OpenAI, their gpt-4o-transcribe model did worse than 2.5 flash, completely messing up names of places and/or people. Plus it doesn't label the conversation in turns, it just outputs a single continuous piece of text.
- Omg long post. TLDR from an LLM for anyone interested
Speed your audio up 2–3× with ffmpeg before sending it to OpenAI’s gpt-4o-transcribe: the shorter file uses fewer input-tokens, cuts costs by roughly a third, and processes faster with little quality loss (4× is too fast). A sample yt-dlp → ffmpeg → curl script shows the workflow.
;)
- This seems like a good place for me to complain about the fact that the automatically generated subtitle files Youtube creates are horribly malformed. Every sentence is repeated twice. In many subtitle files, the subtitle timestamp ranges overlap one another while also repeating every sentence twice in two different ranges. It's absolutely bizarre and has been like this for years or possibly forever. Here's an example - I apologize that it's not in English. I don't know if this issue affects English. https://pastebin.com/raw/LTBps80F
- it's still decoding every frame and matching phonemes either way, but speeding it up reduces how many seconds they bill you for. so you may hack their billing logic more than the model itself.
also means the longer you talk, the more you pay even if the actual info density is the same. so if your voice has longer pauses or you speak slowly, you may be subsidizing inefficiency.
makes me think maybe the next big compression is in delivery cadence. just auto-optimize voice tone and pacing before sending it to LLM. feed it synthetic fast speech with no emotion, just high density words. you lose human warmth but gain 40% cost savings
- This is really interesting, although the cheapest route is still to use an alternative audio-compatible LLM (Gemini 2.0 Flash Lite, Phi 4 Multimodal) or an alternative host for Whisper (Deepinfra, Fal).
- Our team is working with soniox.com. They have the most accurate model that works in real time.
- In my experience, transcription software has no problem transcribing sped-up audio, or audio that is inaudible to humans or extremely loud (as long as it's not clipped). I wonder if LLM transcription works the same way.
by fallinditch
3 subcomments
- When extracting transcripts from YouTube videos, can anyone give advice on the best (cost effective, quick, accurate) way to do this?
I'm confused because I read in various places that the YouTube API doesn't provide access to transcripts ... so how do all these YouTube transcript extractor services do it?
I want to build my own YouTube summarizer app. Any advice and info on this topic greatly appreciated!
- If you're looking for a cheaper transcription API, you could also use https://Lemonfox.ai. We've optimized the API for long audio files and are much faster and cheaper than OpenAI.
by isubkhankulov
0 subcomment
- Transcripts get much more valuable when one diarizes the audio beforehand to determine which speaker said what.
I use this free tool to extract those and dump the transcripts into a LLM with basic prompts: https://contentflow.megalabs.co
by jasonjmcghee
1 subcomments
- Heads up, the token cost breakdown tables look white on white to me. I'm in dark mode on iOS using Brave.
by cprayingmantis
1 subcomments
- I noticed something similar with images as inputs to Claude: you can scale down the images and still get good outputs. There is an accuracy drop-off at a certain point, but the token savings are worth doing a little tuning there.
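Something like this, as a sketch (the Anthropic request shape and model alias are from memory; treat both as assumptions to verify):
import base64, io
import anthropic
from PIL import Image

# Downscale before sending to cut image tokens; tune the size to taste.
img = Image.open("screenshot.png").convert("RGB")
img.thumbnail((768, 768))
buf = io.BytesIO()
img.save(buf, format="JPEG", quality=80)
b64 = base64.standard_b64encode(buf.getvalue()).decode()

client = anthropic.Anthropic()
msg = client.messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=512,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64", "media_type": "image/jpeg", "data": b64}},
            {"type": "text", "text": "Summarize what this screenshot shows."},
        ],
    }],
)
print(msg.content[0].text)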
- It's also rude to talk slowly to them. Unless it's Siri.
by impossiblefork
0 subcomment
- Make the minutes longer, you mean.
by PeterStuer
1 subcomments
- I wonder how much time and battery transcoding/uploading/downloading over coffee-shop wifi would really save vs just running it locally through an optimized Whisper.
- The whisper model weights are free. You could save even more by just using them locally.
by donkey_brains
1 subcomments
- Hmm…doesn’t this technique effectively make the minute longer, not shorter? Because you can pack more speech into a minute of recording? Seems like making a minute shorter would be counterproductive.
by pottertheotter
0 subcomment
- You can just ask Gemini to summarize it for you. It's free. I do it all the time with YouTube videos.
Or you can just copy the transcript that YouTube provides below the video.
- With this logic, you should also be able to trim the parts that don't have words. Just add a dB cut-off and trim the audio before transcription.
Possibly another 10-20% gain?
- That's really cool! Also, isn't this effectively the same as supplying audio with a sampling rate of 8kHz instead of the 16kHz that the model is supposed to work with?
- We discovered this last month.
There is also probably a way to send a small sample of the audio at different speeds and compare the results, to find a speed optimization with no quality loss unique to each clip.
- Longer*
by yashasolutions
0 subcomment
- The question would be how to do that and still get proper timecodes when using Whisper to generate the subtitles.
by anshumankmr
0 subcomment
- Someone should try transcribing Eminem's Rap god with this trick.
- I guess it'd work even if you make it 2.5x or even 3x.
- Solution: charge by number of characters generated.
- So wait… is Whisper transcription really all that slow locally on an M3 MacBook? It's been a while since I used whisper.cpp, but I seem to remember it taking maybe 20 minutes on a comparatively slowpoke (and power-hungry) i5 12600K for maybe 40 minutes of audio; it might take less time on a faster M-series chip (maybe I'm imagining Apple's mobile silicon being more performant than even desktop Intel CPUs), and even less if there's support for the built-in GPU cores and other AI-optimized silicon.
Did I miss that the task was time sensitive?
by KPennig86852
0 subcomment
- But you know that you can run OpenAI's Whisper audio recognition model locally for free, right? It has very modest GPU requirements, and the new "turbo" model works quite fast (there are also several Python libraries which make it significantly faster still).
- This "hack" also works in real life, youtubers low to talk slowly to increase the video runtime so I watch everything other than songs at 2x speed (and that's only because their player doesn't let you go faster).
by fuzztester
0 subcomment
- Stop being slaves to extorters of any kind, and just leave.
There is tons of this happening everywhere, and we need to fight it and boycott it.
- I have a way that is (all but) free -- just watch the video if you care about it, or decide not to if you don't, and move on with your life.