Hacker News

Show HN: CPU-only transcription for YouTube, TikTok, X, Instagram videos

88 points by mrkn1

by piotrrojek

2 subcomments

If anyone is interested, these are the super-short zsh/bash functions I keep in my .zshrc for doing the same thing with plain whisper.cpp, ffmpeg and yt-dlp (`brew install whisper-cpp yt-dlp` on a Mac). I output VTT subtitles, but it's easy enough to change that to txt.

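  # Usage: yt_to_srt URL OUTPUT_BASE [LANGUAGE]
  # Despite the name, this writes OUTPUT_BASE.vtt (see --output-vtt below).
  yt_to_srt() {
    local url="$1"
    local output_base="$2"
    local language="${3:-en}"

    # Extract the audio track as 16 kHz WAV, the sample rate Whisper models expect.
    yt-dlp -x --audio-format wav --postprocessor-args "-ar 16000" -o "$output_base.wav" "$url"
    whisper-cli --language "$language" --model "$WHISPER_MODEL" --split-on-word --max-len 65 --output-vtt --output-file "$output_base" --file "$output_base.wav"
    rm "$output_base.wav"
  }

  # Usage: file_to_srt FILEPATH [LANGUAGE]
  # Writes <file name without extension>.vtt into the current directory.
  file_to_srt() {
    local filepath="$1"
    local language="${2:-en}"

    # Declare and assign separately so a basename failure is not masked by `local`.
    local filename
    filename=$(basename "$filepath")
    local filename_no_ext="${filename%.*}"
    local output_base="$filename_no_ext"
    local temp_wav="$output_base.wav"

    # Drop video (-vn) and convert to 16-bit, 16 kHz, mono PCM.
    ffmpeg -i "$filepath" -vn -acodec pcm_s16le -ar 16000 -ac 1 "$temp_wav"
    whisper-cli --language "$language" --model "$WHISPER_MODEL" --split-on-word --max-len 65 --output-vtt --output-file "$output_base" --file "$temp_wav"
    rm "$temp_wav"
  }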

Plus an additional bootstrap script for the large-v3-turbo model, from my chezmoi dotfiles:

  #!/bin/bash
  # Download whisper.cpp models from Hugging Face (runs once per machine).
  set -euo pipefail
  MODELS_DIR="$HOME/whisper-models"
  BASE_URL="https://huggingface.co/ggerganov/whisper.cpp/resolve/main"
  MODELS=("ggml-large-v3-turbo.bin" "ggml-tiny.bin")
  mkdir -p "$MODELS_DIR"
  for model in "${MODELS[@]}"; do
    if [ ! -f "$MODELS_DIR/$model" ]; then
      echo "Downloading $model..."
      curl -L --progress-bar -o "$MODELS_DIR/$model" "$BASE_URL/$model"
    else
      echo "$model already exists, skipping."
    fi
  done
  echo "Whisper models ready at $MODELS_DIR"
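
For the VTT-to-txt switch mentioned above, one route is to swap `--output-vtt` for whisper-cli's `--output-txt` flag. Another is to strip an existing .vtt file after the fact. A minimal sketch of the latter, assuming the simple layout whisper.cpp writes (a WEBVTT header, timing lines containing `-->`, then text lines, with no cue identifiers or styling blocks). `vtt_to_txt` is a hypothetical helper name, not part of the scripts above:

```bash
# Hypothetical helper: print only the spoken text of a simple .vtt file.
# Assumes no cue identifiers, NOTE blocks, or STYLE blocks are present.
vtt_to_txt() {
  awk '
    NR == 1 && /^WEBVTT/ { next }          # file header
    /-->/                { next }          # cue timing lines
    /^[[:space:]]*$/     { next }          # blank separators between cues
    { sub(/^[[:space:]]+/, ""); print }    # trim leading whitespace, keep the text
  ' "$1"
}

# Usage (file names are placeholders):
#   vtt_to_txt talk.vtt > talk.txt
```

Post-processing keeps the timed .vtt around for subtitles while still producing a plain transcript from the same run.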

by delis-thumbs-7e

1 subcomments

Maybe I'm a bit thick, but first we created this amazing way to transfer any text very cheaply and quickly over a network. Then we (well, I think it was Meta and Google) decided that no, everything must be a video. Then we added subtitles and AI transcriptions to those videos, and now we just download transcriptions of those videos, presumably to feed an LLM to make summaries of them in order to… Read. Them.
I think I’m gonna go read a book.

by throw98226

1 subcomments

Works extremely well. Command to install on Debian 13:

  sudo apt update && sudo apt install -y ffmpeg python3-pip python3-venv \
    && git clone https://github.com/kouhxp/yapsnap.git && cd yapsnap \
    && python3 -m venv ~/yapsnap-venv && source ~/yapsnap-venv/bin/activate \
    && pip install --upgrade pip && pip install .

On a 32GB ThinkPad X13, yapsnap processed a 21-minute YouTube video in under 2 minutes.
Very well done!

by spudlyo

2 subcomments

So, this project consists of a ~175 line README and a ~500 line Python program that glues yt-dlp and Kroko together. Neat.
I guess if it encourages you to install and figure out how to use ffmpeg, yt-dlp, kroko, numpy, and onnx that's a good thing. Sometimes just knowing a thing is possible is a huge benefit.

by jorritpr

1 subcomments

Very cool, I'm also working on a captioning/subtitling project for the lecture recordings for the university I work at.
My biggest challenge is finding a proper language model that is fast enough and accurate enough, since I have to caption about 600 hours of video per week and I would prefer to run all of this on a tiny server (2 cores, 4 GB memory). This tool could easily do that with the kroko model, but I'll have to test whether the accuracy is good enough.
Also, in my own scripts I'm using ffmpeg to download just the audio of the videos that I want to caption, which saves a lot of bandwidth and speeds up the whole process. As far as I can see this tool doesn't do that. It would be a nice feature to add, plus an option to turn the output into a working .srt file.
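
As a rough sizing check for the 600 hours per week figure: a week has 168 hours, so a single machine running around the clock would need to sustain about 3.6x realtime just to keep up. The sketch below works that out; `required_factor` and `observed_factor` are hypothetical helper names, and any throughput measured on other hardware (such as the 21-minute video in under 2 minutes reported elsewhere in this thread) is only loosely comparable to a 2-core server:

```bash
# Speed-up over realtime needed to keep pace with a weekly backlog.
# $1 = hours of media per week
# $2 = hours of compute available per week (default 168 = 24 * 7)
required_factor() {
  awk -v media="$1" -v avail="${2:-168}" 'BEGIN { printf "%.2f\n", media / avail }'
}

# Speed-up over realtime observed for one run.
# $1 = media duration, $2 = wall-clock time, both in the same unit
observed_factor() {
  awk -v media="$1" -v wall="$2" 'BEGIN { printf "%.2f\n", media / wall }'
}

required_factor 600     # prints 3.57
observed_factor 21 2    # prints 10.50
```

Since the 21-minute run finished in under 2 minutes, 10.50 is a lower bound for that machine. The required factor ignores download time, retries, and any hours the server is busy with other work.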

by mrkn1

0 subcomment

Added Diarization / Speaker Separation that is fast and CPU only. Thank you all for the great feedback and support. PRs welcome!

  yapsnap "https://www.youtube.com/watch?v=NzKJ-xO-VhE" --diarize

  SPEAKER_00 [00:00]: Welcome to the show.
  SPEAKER_01 [00:03]: Glad to be here, thanks for having me.
  SPEAKER_00 [00:08]: Let's get started.
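
For anyone who wants .srt out of this, here is a rough sketch of a converter. It assumes the output looks exactly like the sample above (`SPEAKER_NN [MM:SS]: text`, one utterance per line). Because only start times are shown, each cue is assumed to end where the next one starts, and the last cue gets an arbitrary 3-second tail. `diarized_to_srt` is a hypothetical helper, not a yapsnap feature:

```bash
# Hypothetical helper: turn "SPEAKER_NN [MM:SS]: text" lines into SRT cues.
# $1 = input file, $2 = seconds to give the final cue (default 3, a guess).
diarized_to_srt() {
  awk -v tail="${2:-3}" '
    # Format whole seconds as an SRT timestamp (HH:MM:SS,mmm).
    function ts(s) {
      return sprintf("%02d:%02d:%02d,000", int(s / 3600), int((s % 3600) / 60), s % 60)
    }
    # Only lines carrying an [MM:SS] stamp are treated as utterances.
    match($0, /\[[0-9]+:[0-9][0-9]\]/) {
      stamp = substr($0, RSTART + 1, RLENGTH - 2)
      split(stamp, p, ":")
      n++
      start[n]   = p[1] * 60 + p[2]
      speaker[n] = $1
      text[n]    = substr($0, RSTART + RLENGTH + 2)   # skip the ": " after the stamp
    }
    END {
      for (i = 1; i <= n; i++) {
        stop = (i < n) ? start[i + 1] : start[i] + tail
        printf "%d\n%s --> %s\n%s: %s\n\n", i, ts(start[i]), ts(stop), speaker[i], text[i]
      }
    }
  ' "$1"
}

# Usage (file names are placeholders):
#   diarized_to_srt transcript.txt > transcript.srt
```

Lines stamped in any other form (for example with an hours field) are skipped by this sketch, so longer recordings would need the pattern extended.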

by majorchord

1 subcomments

I thought ONNX models were only for text-to-speech? How does one tell them apart if I find some files online?

by niraj-agarwal

1 subcomments

Had Claude test it out on 3 videos. It worked at 5-8x realtime. The beauty of it is that it works on all videos, not just the ones with transcripts. Combine it with YouTube search and LLM takeaways from transcripts, and you have super-efficient content consumption. There are SaaS products that charge 1 cent per video for those with transcripts. There is a viable product in here somewhere, methinks.

by HDBaseT

2 subcomments

Wouldn't it still be more efficient to do GPU transcription anyway? Is this something where we could actually put the effectively useless NPUs in modern laptops to use?

by canadiantim

2 subcomments

Nice. Can it do speaker diarization?

by ranger_danger

1 subcomments

How is this so much faster than even GPU-based whisper?

by 7777777phil

1 subcomments

This is very simple and very cool! I just installed it on my Hetzner box, where I run a remote-controlled local agent, so now I can basically chat/email a video link to get a summary and/or ask questions. The only issue was YouTube's PO Token requirement (the web/mweb clients refuse to serve formats to datacenter IPs without a valid Proof-of-Origin token), so I first had to find a client that still works without a PO Token. Thanks for sharing!

by ranger_danger

1 subcomments

How can we transcribe other languages besides English?

by dmos62

2 subcomments

Now make it distinguish speakers and we really have something. As far as I know, that's significantly harder though.

by charcircuit

1 subcomments

Most of these platforms already have transcriptions built in.

by photonair

0 subcomment

[flagged]

by xnx

0 subcomment

[dead]

by chris_explicare

0 subcomment

[flagged]