- In transcribing a talk by Andrej, you already picked the most challenging case possible, speed-wise. His natural talking speed is already >=1.5x that of a normal human. He's one of the people for whom you absolutely have to set your YouTube speed back down to 1x to follow what's going on.
In the spirit of making more of an OpenAI minute, don't send it any silence.
E.g.
ffmpeg -i video-audio.m4a \
-af "silenceremove=start_periods=1:start_duration=0:start_threshold=-50dB:\
stop_periods=-1:stop_duration=0.02:stop_threshold=-50dB,\
apad=pad_dur=0.02" \
-c:a aac -b:a 128k output_minpause.m4a -y
will cut the talk down from 39m31s to 31m34s by replacing any silence longer than 20ms (at a -50dB threshold) with a 20ms pause. And to keep with the spirit of your post, I only measured that the input file got shorter; I didn't look at all at the quality of the transcription produced from the shorter version.
- A point on skimming vs taking the time to read something properly.
I read a transcript + summary of that exact talk. I thought it was fine but uninteresting, and I moved on.
Later I saw it had been put on YouTube and I was on the train, so I watched the whole thing at normal speed. Watching it end to end sparked a huge number of ideas, thoughts, and decisions.
This happens to me in other areas too. Watching a conference talk in person is far more useful to me than watching it online with other distractions. Watching it online is, in turn, more useful than reading a summary.
Going for a walk to think about something deeply beats a 10 minute session to "solve" the problem and forget it.
Slower is usually better for thinking.
by georgemandis
1 subcomments
- I was trying to summarize a 40-minute talk with OpenAI’s transcription API, but it was too long. So I sped it up with ffmpeg to fit within the 25-minute cap. It worked quite well (up to 3x speed) and was cheaper and faster, so I wrote about it.
Felt like a fun trick worth sharing. There’s a full script and cost breakdown.
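For a quick feel of the idea, here's a minimal sketch (file names, the 2x factor, and the bitrate are illustrative; the full script in the post has the details, including the yt-dlp step):
import subprocess
from openai import OpenAI

SPEED = 2.0  # 2x-3x worked well; 4x degraded quality

# Speed the audio up and re-encode at a low bitrate to shrink the upload.
subprocess.run([
    "ffmpeg", "-y", "-i", "talk.m4a",
    "-filter:a", f"atempo={SPEED}",
    "-c:a", "aac", "-b:a", "64k",
    "talk_fast.m4a",
], check=True)

client = OpenAI()
with open("talk_fast.m4a", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe", file=f
    )
print(transcript.text)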
- > Is It Accurate?
> I don’t know—I didn’t watch it, lol. That was the whole point. And if that answer makes you uncomfortable, buckle-up for this future we're hurtling toward. Boy, howdy.
This is a great bit of work, and the author accurately summarizes my discomfort.
- There was a similar trick which worked with Gemini versions prior to Gemini 2.0: they charged a flat rate of 258 tokens for an image, and it turns out you could fit more than 258 tokens of text in an image of text and use that for a discount!
by dataviz1000
1 subcomments
- I built a Chrome extension with one feature that transcribes audio to text in the browser using huggingface/transformers.js running the OpenAI Whisper model with WebGPU. It works perfectly! Here is a list of examples of all the things you can do in the browser with WebGPU for free. [0]
The last thing in the world I want to do is listen to or watch presidential social media posts, but, on the other hand, sometimes enormously stupid things are said which move the S&P 500 up or down $60 in a session. So this feature queries for new posts every minute, does OCR on images and transcribes video audio to text locally, and sends the post plus extracted text for analysis, all in the background inside a Chrome extension, before notifying me of anything economically significant.
[0] https://github.com/huggingface/transformers.js/tree/main/exa...
[1] https://github.com/adam-s/doomberg-terminal
- For anybody trying to do this in bulk, instead of using OpenAI's Whisper via their API, you can also use Groq [0], which is much cheaper:
[0] https://groq.com/pricing/
Groq is ~$0.02/hr with distil-large-v3, or ~$0.04/hr with whisper-large-v3-turbo. I believe OpenAI comes out to like ~$0.36/hr.
We do this internally with our tool that automatically transcribes local government council meetings right when they get uploaded to YouTube. It uses Groq by default, but I also added support for Replicate and Deepgram as backups because sometimes Groq errors out.
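For anyone curious, the call itself is just the OpenAI-compatible audio endpoint pointed at a different base URL. A rough sketch (the base URL and model names are from memory, so double-check Groq's docs):
import os
from openai import OpenAI

groq = OpenAI(
    api_key=os.environ["GROQ_API_KEY"],
    base_url="https://api.groq.com/openai/v1",
)
with open("council_meeting.mp3", "rb") as f:
    result = groq.audio.transcriptions.create(
        model="whisper-large-v3-turbo",  # or the cheaper distil model
        file=f,
    )
print(result.text)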
- >> by jumping straight to the point ...
Love this! I wish more authors followed this approach. So many articles wander all over the place before 'the point' appears.
If they tried, perhaps some 50% of authors would realize that they don't _have_ a point.
- Why would you give up your privacy by sending what interests you to OpenAI when Whisper doesn't need that much compute in the first place?
With faster-whisper (int8, batch=8) you can transcribe 13 minutes of audio in 51 seconds on CPU.
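Roughly, as a sketch (assuming a recent faster-whisper release that includes the batched pipeline; model size and file name are placeholders):
from faster_whisper import WhisperModel, BatchedInferencePipeline

# int8 on CPU with batched inference, as described above
model = WhisperModel("large-v3", device="cpu", compute_type="int8")
batched = BatchedInferencePipeline(model=model)

segments, info = batched.transcribe("talk.m4a", batch_size=8)
for seg in segments:
    print(f"[{seg.start:.1f} -> {seg.end:.1f}] {seg.text}")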
- I use the YouTube trick and will share it here: upload to YouTube, let their built-in transcription service turn the audio into text for you, and then use Gemini 2.5 Pro to rebuild the transcript.
ffmpeg \
-f lavfi \
-i color=c=black:s=1920x1080:r=5 \
-i file_you_want_transcripted.wav \
-c:v libx264 \
-preset medium \
-tune stillimage \
-crf 28 \
-c:a aac \
-b:a 192k \
-pix_fmt yuv420p \
-shortest \
file_you_upload_to_youtube_for_free_transcripts.mp4
This works VERY well for my needs.
- You can just dump the YouTube video link into Google AI Studio and ask it to transcribe the video with speaker labels, and even ask it to add useful visual cues, because the model is multimodal for video too.
by brendanfinan
2 subcomments
- would this also work for my video consisting of 10,000 PDFs?
https://news.ycombinator.com/item?id=44125598
- Love this idea, but the accuracy section is lacking. Couldn't you do a simple diff of the outputs and see how many differences there are? 0.5% or 5%?
by conjecTech
1 subcomments
- If you are hosting Whisper yourself, you can do something slightly more elegant, but with the same effect. You can downsample/pool the context 2:1 (or potentially more) a few layers into the encoder. That allows you to do the equivalent of speeding up audio without worrying about potential spectral losses. For Whisper large v3, that gets you nearly double throughput in exchange for a relative ~4% WER increase.
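A very rough, purely illustrative sketch of the idea against openai-whisper's module layout (the pooling depth, exact placement, and real WER impact are assumptions that depend on your setup):
import types
import torch.nn.functional as F
import whisper

model = whisper.load_model("large-v3")
POOL_AFTER = 4  # number of encoder blocks to run before pooling (illustrative)

def pooled_forward(self, x):
    # Same as the stock encoder forward, plus a 2:1 mean-pool over time.
    x = F.gelu(self.conv1(x))
    x = F.gelu(self.conv2(x))
    x = x.permute(0, 2, 1)
    x = (x + self.positional_embedding).to(x.dtype)
    for i, block in enumerate(self.blocks):
        x = block(x)
        if i + 1 == POOL_AFTER:
            # (batch, time, dim) -> (batch, time/2, dim)
            x = F.avg_pool1d(x.permute(0, 2, 1), kernel_size=2).permute(0, 2, 1)
    return self.ln_post(x)

model.encoder.forward = types.MethodType(pooled_forward, model.encoder)
print(model.transcribe("talk.m4a")["text"])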
- If you're already doing local ffmpeg stuff (i.e. pretty involved with code and scripting already), you're only a couple of steps away from just downloading the openai-whisper models (or even the faster-whisper models, which run about two times faster). Since this looks like personal usage and not production-quality code, you can use AI (e.g. Cursor) to write a script that runs the Whisper model inference in seconds.
Then there is no cost at all for any length of audio (since cost seems to be the primary concern of the article).
On my M1 Mac laptop it takes about 30 seconds to run on a 3-minute audio file. I'm guessing a 40-minute talk would take about 5-10 minutes.
- This is great, thank you for sharing. I work on these APIs at OpenAI; it's a surprise to me that it still works reasonably well at 2x/3x speed, but on the other hand for phone channels we get 8kHz audio that is upsampled to 24kHz for the model and it still works well. Note there's probably a measurable decrease in transcription accuracy that worsens as you deviate from 1x speed. Also we really need to support bigger/longer file uploads :)
by another_twist
0 subcomment
- You'd need a WER comparison to check whether there really is no drop in quality. With this trick, there might be trouble if the audio is noisy, and it may not always be obvious whether or not to speed up.
- Interesting approach to transcript generation!
I'm implementing a similar workflow for VideoToBe.com
My Current Pipeline:
Media Extraction - yt-dlp for reliable video/audio downloads
Local Transcription - OpenAI Whisper running on my own hardware (no API costs)
Storage & UI - Transcripts stored in S3 with a custom web interface for viewing
Y Combinator playlist
https://videotobe.com/play/playlist/ycombinator
and Andrej's talk is
https://videotobe.com/play/youtube/LCEmiRjPEtQ
After reading your blog post, I will be testing the effect of speeding up audio on locally hosted Whisper models. Running Whisper locally eliminates the ongoing cost concerns since my infrastructure is already a sunk cost. Speeding up the audio could be an interesting performance enhancement to explore!
- Appreciated the concise summary + code snippet upfront, followed by more detail and background for those interested. More articles should be written this way!
- Do the APIs support simultaneous voice transcription in a way that different voices are tagged? (either in text or as metadata)
If so: could you split the audio file, pitch-shift the latter half by, say, an octave, and then mix the two halves together to get a shorter audio file, then transcribe it and join the result back into linear form, with the tagging removed? (You could insert some prerecorded voice to mark where the second half starts.) If a pitch change is not enough, maybe manipulate the formants as well.
- Gemini 2.5 Pro is, in my usage, quite superior for high-quality transcriptions of phone calls, in Dutch in my case. As long as you upload the audio to GCS, you can easily process conversations of over an hour there. It correctly identified and labeled speakers.
The cheaper 2.5 flash made noticeably more mistakes, for example it didn't correctly output numbers while the Pro model did.
As for OpenAI, their gpt-4o-transcribe model did worse than 2.5 flash, completely messing up names of places and/or people. Plus it doesn't label the conversation in turns, it just outputs a single continuous piece of text.
- Omg long post. TLDR from an LLM for anyone interested
Speed your audio up 2–3× with ffmpeg before sending it to OpenAI’s gpt-4o-transcribe: the shorter file uses fewer input-tokens, cuts costs by roughly a third, and processes faster with little quality loss (4× is too fast). A sample yt-dlp → ffmpeg → curl script shows the workflow.
;)
- This seems like a good place for me to complain about the fact that the automatically generated subtitle files Youtube creates are horribly malformed. Every sentence is repeated twice. In many subtitle files, the subtitle timestamp ranges overlap one another while also repeating every sentence twice in two different ranges. It's absolutely bizarre and has been like this for years or possibly forever. Here's an example - I apologize that it's not in English. I don't know if this issue affects English. https://pastebin.com/raw/LTBps80F
- it's still decoding every frame and matching phonemes either way, but speeding it up reduces how many seconds they bill you for. so you may hack their billing logic more than the model itself.
also means the longer you talk, the more you pay even if the actual info density is the same. so if your voice has longer pauses or you speak slowly, you may be subsidizing inefficiency.
makes me think maybe the next big compression is in delivery cadence. just auto-optimize voice tone and pacing before sending it to LLM. feed it synthetic fast speech with no emotion, just high density words. you lose human warmth but gain 40% cost savings
- This is really interesting, although the cheapest route is still to use an alternative audio-compatible LLM (Gemini 2.0 Flash Lite, Phi 4 Multimodal) or an alternative host for Whisper (Deepinfra, Fal).
- Our team is working with soniox.com. They have the most accurate model that works in real time.
- In my experience, transcription software has no problem transcribing sped-up audio, or audio that is inaudible to humans or extremely loud (as long as it's not clipped). I wonder if LLM transcription works the same way.
by fallinditch
3 subcomments
- When extracting transcripts from YouTube videos, can anyone give advice on the best (cost effective, quick, accurate) way to do this?
I'm confused because I read in various places that the YouTube API doesn't provide access to transcripts ... so how do all these YouTube transcript extractor services do it?
I want to build my own YouTube summarizer app. Any advice and info on this topic greatly appreciated!
- If you're looking for a cheaper transcription API, you could also use https://Lemonfox.ai. We've optimized the API for long audio files and are much faster and cheaper than OpenAI.
by isubkhankulov
0 subcomment
- Transcripts get much more valuable when one diarizes the audio beforehand to determine which speaker said what.
I use this free tool to extract those and dump the transcripts into a LLM with basic prompts: https://contentflow.megalabs.co
by jasonjmcghee
1 subcomments
- Heads up, the token cost breakdown tables look white on white to me. I'm in dark mode on iOS using Brave.
by cprayingmantis
1 subcomments
- I noticed something similar with images as inputs to Claude: you can scale down the images and still get good outputs. There is an accuracy drop-off at a certain point, but the token savings are worth doing a little tuning there.
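Something like this, as a sketch (the Anthropic request shape and model alias are from memory; treat both as assumptions to verify):
import base64, io
import anthropic
from PIL import Image

# Downscale before sending to cut image tokens; tune the size to taste.
img = Image.open("screenshot.png").convert("RGB")
img.thumbnail((768, 768))
buf = io.BytesIO()
img.save(buf, format="JPEG", quality=80)
b64 = base64.standard_b64encode(buf.getvalue()).decode()

client = anthropic.Anthropic()
msg = client.messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=512,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64", "media_type": "image/jpeg", "data": b64}},
            {"type": "text", "text": "Summarize what this screenshot shows."},
        ],
    }],
)
print(msg.content[0].text)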
- It's also rude to talk slowly to them. Unless it's Siri.
by impossiblefork
0 subcomment
- Make the minutes longer, you mean.
by PeterStuer
1 subcomments
- I wonder how much time and battery transcoding/uploading/downloading over coffee-shop wifi would really save vs just running it locally through an optimized Whisper.
- The whisper model weights are free. You could save even more by just using them locally.
by donkey_brains
1 subcomments
- Hmm…doesn’t this technique effectively make the minute longer, not shorter? Because you can pack more speech into a minute of recording? Seems like making a minute shorter would be counterproductive.
by pottertheotter
0 subcomment
- You can just ask Gemini to summarize it for you. It's free. I do it all the time with YouTube videos.
Or you can just copy the transcript that YouTube provides below the video.
- With this logic, you should also be able to trim the parts that don't have words. Just add a dB cut-off and trim the audio before transcription.
Possibly another 10-20% gain?
- That's really cool! Also, isn't this effectively the same as supplying audio with a sampling rate of 8kHz instead of the 16kHz that the model is supposed to work with?
- We discovered this last month.
There is also probably a way to send a small sample of the audio at different speeds and compare the results, to find a speed optimization with no quality loss unique to each clip.
- Longer*
by yashasolutions
0 subcomment
- The question would be how to do that and still get proper timecodes when using Whisper to generate the subtitles.
by anshumankmr
0 subcomment
- Someone should try transcribing Eminem's Rap god with this trick.
- I guess it'd work even if you make it 2.5x or even 3x.
- Solution: charge by number of characters generated.
- So wait… is Whisper transcription really all that slow locally on an M3 MacBook? It's been a while since I used whisper.cpp, but I seem to remember it taking maybe 20 minutes on a comparatively slowpoke (and power-hungry) i5 12600K for maybe 40 minutes of audio; it might take less time on a faster M-series chip (maybe I'm imagining Apple's mobile silicon being more performant than even desktop Intel CPUs), and even less if there's support for the built-in GPU cores and other AI-optimized silicon.
Did I miss that the task was time sensitive?
by KPennig86852
0 subcomment
- But you know that you can run OpenAI's Whisper audio recognition model locally for free, right? It has very modest GPU requirements, and the new "turbo" model works quite fast (there are also several Python libraries which make it significantly faster still).
- This "hack" also works in real life, youtubers low to talk slowly to increase the video runtime so I watch everything other than songs at 2x speed (and that's only because their player doesn't let you go faster).
by fuzztester
0 subcomment
- Stop being slaves to extorters of any kind, and just leave.
There is tons of this happening everywhere, and we need to fight it and boycott it.
- I have a way that is (all but) free -- just watch the video if you care about it, or decide not to if you don't, and move on with your life.