Hacker News

We replaced H.264 streaming with JPEG screenshots (and it worked better)

517 points by quesobob

by qbow883

6 subcomments

Setting aside the various formatting problems and the LLM writing style, this just seems all kinds of wrong throughout.
> “Just lower the bitrate,” you say. Great idea. Now it’s 10Mbps of blocky garbage that’s still 30 seconds behind.
10Mbps should be way more than enough for a mostly static image with some scrolling text. (And 40Mbps are ridiculous.) This is very likely to be caused by bad encoding settings and/or a bad encoder.
> “What if we only send keyframes?” The post goes on to explain how this does not work because some other component needs to see P-frames. If that is the case, just configure your encoder to have very short keyframe intervals.
> And the size! A 70% quality JPEG of a 1080p desktop is like 100-150KB. A single H.264 keyframe is 200-500KB.
A single H.264 keyframe can be whatever size you want, *depending on how you configure your encoder*, which was apparently never seriously attempted. Why are we badly reinventing MJPEG instead of configuring the tools we already have? Lower the bitrate and keyint, use a better encoder for higher quality, lower the frame rate if you need to. (If 10 fps JPEGs are acceptable, surely you should try 10 fps H.264 too?)
But all in all the main problem seems to be squeezing an entire video stream through a single TCP connection. There are plenty of existing solutions for this. For example, this article never mentions DASH, which is made for these exact purposes.

by mikepavone

6 subcomments

> When the network is bad, you get... fewer JPEGs. That’s it. The ones that arrive are perfect.
This would make sense... if they were using UDP, but they are using TCP. All the JPEGs they send will get there eventually (unless the connection drops). JPEG does not fix your buffering and congestion control problems. What presumably happened here is the way they implemented their JPEG screenshots, they have some mechanism that minimizes the number of frames that are in-flight. This is not some inherent property of JPEG though.
> And the size! A 70% quality JPEG of a 1080p desktop is like 100-150KB. A single H.264 keyframe is 200-500KB. We’re sending LESS data per frame AND getting better reliability.
h.264 has better coding efficiency than JPEG. For a given target size, you should be able to get better quality from an h.264 IDR frame than a JPEG. There is no fixed size to an IDR frame.
Ultimately, the problem here is a lack of bandwidth estimation (apart from the sort of binary "good network"/"cafe mode" thing they ultimately implemented). To be fair, this is difficult to do and being stuck with TCP makes it a bit more difficult. Still, you can do an initial bandwidth probe and then look for increasing transmission latency as a sign that the network is congested. Back off your bitrate (and if needed reduce frame rate to maintain sufficient quality) until transmission latency starts to decrease again.
WebRTC will do this for you if you can use it, which actually suggests a different solution to this problem: use websockets for dumb corporate network firewall rules and just use WebRTC for everything else.
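The back-off loop described in this comment could look roughly like the sketch below. Everything here is an assumption for illustration: the class name, the thresholds, and the additive-increase/multiplicative-decrease policy are not taken from the article or from any particular library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BitrateController:
    """Latency-driven bitrate control (illustrative; all thresholds are assumptions)."""

    bitrate_kbps: float               # current encoder target
    min_kbps: float = 200.0           # hypothetical floor
    max_kbps: float = 10_000.0        # hypothetical ceiling, e.g. from an initial probe
    backoff: float = 0.7              # multiplicative decrease when latency grows
    probe_step_kbps: float = 100.0    # additive increase when latency is stable
    tolerance_ms: float = 20.0        # latency growth below this is treated as noise
    _last_latency_ms: Optional[float] = None

    def update(self, latency_ms: float) -> float:
        """Feed one transmission-latency sample; return the new target bitrate."""
        prev = self._last_latency_ms
        self._last_latency_ms = latency_ms
        if prev is None:
            # First sample: nothing to compare against yet.
            return self.bitrate_kbps
        if latency_ms > prev + self.tolerance_ms:
            # Latency is rising: frames are queueing somewhere, so back off.
            self.bitrate_kbps = max(self.min_kbps, self.bitrate_kbps * self.backoff)
        else:
            # Latency is flat or falling: cautiously probe for more bandwidth.
            self.bitrate_kbps = min(self.max_kbps, self.bitrate_kbps + self.probe_step_kbps)
        return self.bitrate_kbps
```

The caller would feed in one latency sample per acknowledged frame and pass the returned value to the encoder as its new target bitrate.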

by adamjs

6 subcomments

They might want to check out what VNC has been doing since 1998: keep the client-pull model, break the framebuffer up into tiles and, when the client requests an update, perform a diff against the last frame sent and composite the updated tiles client-side. (This is what VNC falls back to when it doesn’t have damage-tracking from the OS compositor.)
This would really cut down on the bandwidth of static coding terminals where 90% of the screen is just a cursor flashing or small bits of text moving.
If they really wanted to be ambitious they could also detect scrolling and do an optimization client-side where it translates some of the existing areas (look up CopyRect command in VNC).
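The tile-diff idea can be sketched in a few lines. This is a simplified illustration, not the RFB wire format: the function name, the 64-pixel tile size, and the assumption of tightly packed raw framebuffers are all choices made for this example.

```python
def changed_tiles(prev, curr, width, height, tile=64, bytes_per_pixel=4):
    """Return (x, y, w, h) rectangles for tiles whose pixels differ between two frames.

    prev and curr are raw framebuffers (bytes), row-major, with no row padding.
    """
    stride = width * bytes_per_pixel
    dirty = []
    for ty in range(0, height, tile):
        th = min(tile, height - ty)          # last row of tiles may be shorter
        for tx in range(0, width, tile):
            tw = min(tile, width - tx)       # last column of tiles may be narrower
            start = tx * bytes_per_pixel
            end = (tx + tw) * bytes_per_pixel
            for row in range(ty, ty + th):
                off = row * stride
                # Compare one scanline of this tile; stop at the first difference.
                if prev[off + start:off + end] != curr[off + start:off + end]:
                    dirty.append((tx, ty, tw, th))
                    break
    return dirty
```

Only the rectangles returned here would need to be compressed and sent; the client pastes them over its copy of the previous frame.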

by Dylan16807

2 subcomments

> When the network is bad, you get... fewer JPEGs. That’s it. The ones that arrive are perfect.
You can still have weird broken stallouts though.
I dunno, this article has some good problem solving but the biggest and mostly untouched issue is that they set the minimum h.264 bandwidth too high. H.264 can do a lot better than JPEG with a lot less bandwidth. But if you lock it at 40Mbps of course it's flaky. Try 1Mbps and iterate from there.
And going keyframe-only is the opposite of how you optimize video bandwidth.

by kccqzy

4 subcomments

There are so many things that I would have done differently.
> We added a keyframes_only flag. We modified the video decoder to check FrameType::Idr. We set GOP to 60 (one keyframe per second at 60fps). We tested.
Why muck around with P-frames and keyframes? Just make your video 1fps.
> Now it’s 10Mbps of blocky garbage that’s still 30 seconds behind.
10 Mbps is way too much. I occasionally watch YouTube videos where someone writes code. I set my quality to 1080p to be comparable with the article and YouTube serves me the video at way less than 1Mbps. I did a quick napkin math for a random coding video and it was 0.6Mbps. It’s not blocky garbage at all.

by andai

2 subcomments

Many moons ago I was using this software which would screenshot every five seconds and give you a little time lapse at the end of the day. So you could see how you were spending your computer time.
My hard disk ended up filling up with tens of gigabytes of screenshots.
I lowered the quality. I lowered the resolution, but this only delayed the inevitable.
One day I was looking through the folder and I noticed well almost all the image data on almost all of these screenshots is identical.
What if I created some sort of algorithm which would allow me to preserve only the changes?
I spent embarrassingly long thinking about this before realizing that I had begun to reinvent video compression!
So I just wrote a ffmpeg one-liner and got like 98% disk usage reduction :)

by nemothekid

1 subcomments

I'm very familiar with the stack and the pain of trying to livestream video to a browser. If JPEG screenshots work for your clients, then I would just stick with that.
The problem with wolf, gstreamer, moonlight, $third party, is you need to be familiar with how the underlying stack handles backpressure and error propagation, or else things will just "not work" and you will have no idea why. I've worked on 3 projects in the last 3 years where I started with gstreamer, got up and running - and while things worked in the happy path, the unhappy path was incredibly brittle and painful to debug. All 3 times I opted to just use the lower level libraries myself.
Given all of OP's requirements, I think something like NVIDIA Video Codec SDK to a websocket to MediaSource Extensions would be a better fit.
However, given that even this post seems to be LLM generated, I don't think the author would care to learn about the actual internals. I don't think this is a solution that could be vibe coded.

by somehnguy

3 subcomments

40mbps for video of an LLM typing text didn't immediately fire off alarm bells in anyone's head that their approach was horribly wrong? That's an insane amount of bandwidth for what they're trying to do.

by Tarean

0 subcomment

Having pair programmed over some truly awful and locked down connections before, dropped frames are infinitely better than blurred frames which make text unreadable whenever the mouse is moved. But 40mbps seems an awful lot for 1080p 60fps.
Temporal SVC (reduce framerate if bandwidth constrained) is pretty widely supported by now, right? Though maybe not for H.264, so it probably would have scaled nicely but only on WebRTC?

by dotancohen

2 subcomments

They're just streaming a video feed of an LLM running in a terminal? Why not stream the actual text? Or fetch it piecemeal over AJAX requests? They complain that corporate networks support only HTTPS and nothing else? Do they not understand what the first T stands for?

by keerthiko

0 subcomment

> The fix was embarrassingly simple: once you fall back to screenshots, stay there until the user explicitly clicks to retry.
There is another recovery option:
- increase the JPEG framerate every couple seconds until the bandwidth consumption approaches the H264 stream bandwidth estimate
- keep track of latency changes. If the client reports a stable latency range, and it is acceptable (<1s latency, <200ms variance?) and bandwidth use has reached 95% of H264 estimate, re-activate the stream
Given that text/code is what is being viewed, lower res and adaptive streaming (HLS) are not really viable solutions since they become unreadable at lower res.
If remote screen sharing is a core feature of the service, I think this is a reasonable next step for the product.
That said, IMO at a higher level if you know what you're streaming is human-readable text, it's better to send application data pipes to the stream rather than encoding screenspace videos. That does however require building bespoke decoders and client viewing if real time collaboration network clients don't already exist for the tools (but SSH and RTC code editors exist)

by lewq

3 subcomments

Hi, author of the post here. Just fixed up some formatting issues from when we copied it into substack, sorry about that. Yeah, I used Opus 4.5 to help me write it (and it actually made me laugh!). But the struggle was real. Something I didn't make clear enough in the post is that jpeg works because each screenshot is taken exactly when it's requested. Whereas streaming video is pushing a certain frame rate. The client driving the frame rate is exactly what makes it not queue frames. Yes, I wish we could UDP in enterprise networks too, but we can't. The problem actually isn't opening the UDP port, it's hosting UDP on their Kubernetes cluster. "You want to what?? We have ingress. For HTTPS"
Join our discord for private beta in January! https://discord.gg/VJftd844GE
(This post written by human)

by toledocavani

1 subcomments

This thread is great, truly the only way to get great answers on HN is to post a wrong blog. But stupid wrong blogs are unlikely to get onto the HN front page, kudos to the writer for striking the right balance between easy to understand, working, interesting but faulty solution.

by laurencerowe

0 subcomment

If you are ok with a second or so of latency then MPEG-DASH (standardized version of HTTP Live Streaming) is likely the best bet. You simply serve the video chunks over HTTP so it should be just as compatible as the JPEG solution used here but provide 60fps video rather than crappy jpegs.
The standard supports adaptive bit rate playback so you can provide both low quality and high quality videos and players can switch depending on bandwidth available.

by robrain

2 subcomments

"Think “screen share, but the thing being shared is a robot writing code.”"
Thinks: why not send text instead of graphics, then? I'm sure it's more complicated than that...

by rekshaw

1 subcomments

I remember 12 years ago, while the Flash vs Html war was still raging on (pre-html5), I created a framework to create web video playback using CSS and JPEGs. It would expect a set of big JPEGs, each containing the frames of the video in a grid (a "reel"), and play it by changing the css background position (and swap out the background with the next jpeg once a "reel" was complete).
It worked really well, and I also cloned the (at the time) Youtube player UI. Seeking, keyframes, flexible framerate, etc were all supported out of the box thanks to the simple underlying architecture.
https://github.com/VAS/animite

by karhuton

0 subcomment

I made this because I got tired of screensharing issues in corporate environments: https://bluescreen.live (code via github).
Screenshot once per second. Works everywhere.
I’m still waiting for mobile screenshare api support, so I could quickly use it to show stuff from my phone to other phones with the QR link.

by materialpoint

0 subcomment

The fact that they considered transmitting only keyframes speaks volumes about how inept they are. It can be a cool baseline test, but celebrating trendy choices, like Rust, and not understanding that keyframes and efficient differentials are key to achieving high video compression makes me go completely numb.

by MBCook

2 subcomments

So it’s video of an AI typing text?
Why not just send text? Why do you need video at all?

by plqbfbv

0 subcomment

I dabbled a bit with recoding/encoding videos in the past: 40mbps is basically blu-ray quality (1080p/4k depending on content), and it's being used to stream a mostly-static background with some text scrolling in front of it.
A 3-minute chat with Claude suggests 30FPS should be plenty (perhaps minor cursor lag can be noticed if it's drawn), with a GOP of 2s (60 frames) for fast recovery, VBR 1mbps average with a max bitrate at 1.2mbps for crappy connections, and bframes to minimize bandwidth usage (because we have hw encoding).
The crappiest of internet cafes should still be able to guarantee 1.2mbps (150KB/s). If they can do 5-10FPS with 150KB frames, they have 6-12mbps available. Worst case GOP can be reduced to 15 frames, so that there's 2x I-frames every second, and the latency is 500ms tops.
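The arithmetic in this comment can be checked with a few lines. This is only a sketch: the helper names are made up here, it assumes decimal units (1 Mbps = 1,000,000 bits/s, 1 KB = 1,000 bytes), and it reads the comment's "150kb" as kilobytes.

```python
def mbps_to_kilobytes_per_s(mbps):
    """Convert a link rate in megabits/s to kilobytes/s (decimal units)."""
    return mbps * 1_000_000 / 8 / 1_000


def stream_mbps(frame_kilobytes, fps):
    """Bandwidth needed to send `fps` frames per second of the given size."""
    return frame_kilobytes * 1_000 * 8 * fps / 1_000_000


def gop_seconds(gop_frames, fps):
    """Worst-case wait for the next keyframe, in seconds."""
    return gop_frames / fps


if __name__ == "__main__":
    print(mbps_to_kilobytes_per_s(1.2))              # 1.2 Mbps expressed in KB/s
    print(stream_mbps(150, 5), stream_mbps(150, 10))  # 150 KB JPEGs at 5 and 10 fps
    print(gop_seconds(60, 30), gop_seconds(15, 30))   # GOP of 60 and 15 frames at 30 fps
```

Under these assumptions the figures in the comment are consistent: 150 KB frames at 5 to 10 fps need roughly 6 to 12 Mbps, and a 15-frame GOP at 30 fps gives a keyframe every half second.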

by jayd16

0 subcomment

So they replaced a TCP connection with no congestion control with a synchronous poll of an endpoint which is inherently congestion controlled.
I wonder if they just tried restarting the stream at a lower bitrate once it got too delayed.
The talk about how the images look more crisp at a lower FPS is just tuning that I guess they didn't bother with.

by Jakob

0 subcomment

Yes, this is unfortunately still the way and was very common back when iOS Safari did not allow embedded video.
For a fast start of the video, reverse the implementation: instead of downgrading from Websockets to polling when connection fails, you should upgrade from polling to Websockets when the network allows.
Socket.io was one of the first libraries that did that switching and had it wrong first, too. They learned the enterprise network behaviour and switched the implementation.

by zipy124

0 subcomment

This is just poor engineering. H.264 streaming is obviously superior to JPEG streaming, else MJPEG (motion jpeg) would be standard for screen sharing. In addition if all you're sharing is a picture of text, and you have access to the text, you can just send the damn text instead and render it locally.

by andai

0 subcomment

I recognize this voice :) This is Claude.

by rcarmo

1 subcomments

This was the most entertaining thing I read all day. Kudos.
I've had similar experiences in the past when trying to do remote desktop streaming for digital signage (which is not particularly demanding in bandwidth terms). Multicast streaming video was the most efficient, but annoying to decode when you dropped data. I now wonder how far I could have gone with JPEGs...

by tcherasaro

0 subcomment

Reminds me of when I was working on the video system for a mast on a submarine 20 years ago.
Customer had impossible set of latency, resolution, processing and storage requirements for their video. They also insisted we use this new H.264 standard that just came out though not a requirement.
We quickly found MJPEG was superior for meeting their requirements in every way. It took a lot of convincing though. H.264 was and would still be a complete non-starter for them.

by any1

0 subcomment

I have some experience with pushing video frames over TCP.
It appears that the writer has jumped to conclusions at every turn and it's usually the wrong one.
The reason that the simple "poll for jpeg" method works is that polling is actually a very crude congestion control mechanism. The sender only sends the next frame when the receiver has received the last frame and asks for more. The downside of this is that network latency affects the frame rate.
The frame rate issue with the polling method can be solved by sending multiple frame requests at a time, but only as many as will fit within one RTT, so the client needs to know the minimum RTT and the sender's maximum frame rate.
The RFB (VNC) protocol does this, by the way. Well, the thing about rtt_min and frame rate isn't in the spec though.
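A rough model of the two points above: one-at-a-time polling ties the frame rate to the round trip, and keeping a few requests outstanding recovers it. This is a deliberately simplified sketch. The function names are invented here, and the model ignores the fact that concurrent responses share the same bandwidth.

```python
import math


def polling_fps(rtt_s, transfer_s, max_fps, in_flight=1):
    """Approximate frame rate when the client keeps `in_flight` requests outstanding.

    Each request costs one round trip plus the time to transfer the frame.
    """
    per_request_s = rtt_s + transfer_s
    return min(max_fps, in_flight / per_request_s)


def requests_to_fill_rtt(rtt_min_s, max_fps):
    """How many frame requests fit within one minimum RTT (at least one)."""
    return max(1, math.floor(rtt_min_s * max_fps))
```

For example, with a 100 ms round trip and 100 ms of transfer time per frame, a single outstanding request tops out at about 5 fps no matter how fast the sender can produce frames.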
Now, I will not go through every wrong assumption, but as for this nonsense about P-frames and I-frames: With TCP, you only need one I-frame. The rest can be all P-frames. I don't understand how they came to the conclusion that sending only I-frames over TCP might help with their latency problem. Just turn off B-frames and you should be OK.
The actual problem with the latency was that they had frames piling up in buffers between the sender and the receiver. If you're pushing video frames over TCP, you need feedback. The server needs to know how fast it can send. Otherwise, you get pile-up and a bunch of latency. That's all there is to it.
The simplest, absolutely foolproof way to do this is to use TCP's own congestion control. Spin up a thread that does two things: encodes video frames and sends them out on the socket using a blocking send/write call. Set SO_SNDBUF on that socket to a value that's proportional to your maximum latency tolerance and the rough size of your video frames.
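A minimal sketch of that approach using Python's standard `socket` module. The sizing formula and the function names are assumptions made for illustration; the comment only says the buffer should be proportional to the latency tolerance and the frame size. The operating system may also round or clamp the requested value (Linux, for instance, doubles it for bookkeeping).

```python
import socket


def send_buffer_bytes(max_latency_s, frame_bytes, fps):
    """Bytes of queued video that correspond to roughly `max_latency_s` of delay."""
    frames_allowed = max(1, int(max_latency_s * fps))
    return frames_allowed * frame_bytes


def configure_video_socket(sock, max_latency_s, frame_bytes, fps):
    """Cap the kernel send buffer so blocking sends push back on the encoder."""
    size = send_buffer_bytes(max_latency_s, frame_bytes, fps)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, size)
    sock.setblocking(True)
    return size


def send_frames(sock, encoded_frames):
    """Run in the encoder thread; `encoded_frames` should be a lazy iterator.

    sendall() blocks while the send buffer is full, so the next frame is not
    even encoded until the network has drained enough of the previous ones.
    """
    for frame in encoded_frames:
        sock.sendall(frame)
```

Passing a generator that encodes on demand as `encoded_frames` is what keeps the encoder and the sender in the same paced loop, as the comment suggests.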
One final bit of advice: use ffmpeg (libavcodec, libavformat, etc). It's much simpler to actually understand what you're doing with that than some convoluted gstreamer pipeline.

by egorfine

9 subcomments

> The constraint that ruined everything: It has to work on enterprise networks.
> You know what enterprise networks love? HTTP. HTTPS. Port 443. That’s it. That’s the list.
That's not enough.
Corporate networks also love to MITM their own workstations and reinterpret http traffic. So, no WebSockets and no Server-Sent Events either, because their corporate firewall is a piece of software no one in the world wants and everyone in the world hates, including its own developers. Thus it only supports a subset of HTTP/1.1 and sometimes it likes to change the content while keeping Content-Length intact.
And you have to work around that, because IT dept of the corporation will never lift restrictions.
I wish I was kidding.

by benterix

1 subcomments

> And the size! A 70% quality JPEG of a 1080p desktop is like 100-150KB. A single H.264 keyframe is 200-500KB.
I believe the latter can be adjusted in codec settings.

by imiric

0 subcomment

> By the time you see a bug, the AI has already committed it to main
If you have given your "AI" full control over your repo so that it can commit unreviewed code to the main branch, you have far greater problems than a 45 second video stream delay. Besides, you'd need superhuman abilities to spot a bug in hundreds of lines of generated code in under 45 seconds.
I know this example is rhetorical and likely produced by an LLM, but this entire project seems misguided. They're streaming video of a graphical text editor to a web browser client, instead of streaming text itself, or using a web-based editor. These are solved problems. This shouldn't be so complicated...

by refulgentis

1 subcomments

The LinkedIn slop tone, random bolding, miscopied Markdown tables make me invoke: "please read the copy you worked on with AI"
smaller thing: many, many, moons ago, I did a lot of work with H.264. "A single H.264 keyframe is 200-500KB." is fantastical.
Can't prove it wrong because it will be correct given arbitrary dimensions and encoding settings, but, it's pretty hard to end up with.
Just pulled a couple 1080p's off YouTube, biggest I-frame is 150KB, median is 58KB (`ffprobe $FILE -show_frames -of compact -show_entries frame=pict_type,pkt_size | grep -i "|pict_type=I"`)
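To go from that `ffprobe` output to the maximum and median, something like the following would do. It is a sketch that assumes each line of compact output looks like `frame|key=value|key=value`; the helper names are made up, and any sample values passed to it are the caller's own rather than measurements from this thread.

```python
import statistics


def iframe_sizes(compact_lines):
    """Extract pkt_size (bytes) for every I-frame line of ffprobe compact output."""
    sizes = []
    for line in compact_lines:
        # Split "frame|pkt_size=123|pict_type=I" into a {key: value} mapping.
        fields = dict(
            part.split("=", 1) for part in line.strip().split("|") if "=" in part
        )
        if fields.get("pict_type") == "I" and fields.get("pkt_size", "").isdigit():
            sizes.append(int(fields["pkt_size"]))
    return sizes


def summarize(sizes):
    """Return the count, largest, and median I-frame size in bytes."""
    return {
        "count": len(sizes),
        "max": max(sizes),
        "median": statistics.median(sizes),
    }
```

The lines can come straight from the pipeline above, for example by reading `sys.stdin` and passing it to `iframe_sizes`.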

by algesten

0 subcomment

WebSockets over TCP is probably always going to cause problems for streaming media.
WebRTC over UDP is one choice for lossy situations. Media over Quic might be another (is the future here?), and it might be more enterprise firewall friendly since HTTP3 is over Quic.

by petcat

3 subcomments

so did they reinvent mjpeg

by Terretta

0 subcomment

Helix is a commercial multi-protocol streaming server:
https://en.wikipedia.org/wiki/Helix_Universal_Server
HTTP Live Streaming is already a thing:
https://en.wikipedia.org/wiki/HTTP_Live_Streaming
See also DASH, M-JPEG, progressive download, etc.
> "Who knew?"
Everyone in the streaming industry, and not so long ago that it's been forgotten.

by wewewedxfgdf

2 subcomments

webp is smaller than jpeg
https://developers.google.com/speed/webp/docs/webp_study
ALSO - the blog author could simplify - you don't need any code at all at the web browser.
The <img> tag automatically does motion jpeg streaming.
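The mechanism being referred to is a `multipart/x-mixed-replace` HTTP response, where each part is one JPEG that replaces the previous image. A minimal sketch of the server-side framing follows; the boundary token and function names are arbitrary choices for this example, and browser behaviour with such streams may vary.

```python
BOUNDARY = "frame"  # arbitrary boundary token chosen for this sketch


def mjpeg_headers():
    """Response headers announcing a stream of replaceable parts."""
    return {"Content-Type": f"multipart/x-mixed-replace; boundary={BOUNDARY}"}


def mjpeg_part(jpeg_bytes):
    """Wrap one encoded JPEG as a single part of the multipart body."""
    head = (
        f"--{BOUNDARY}\r\n"
        "Content-Type: image/jpeg\r\n"
        f"Content-Length: {len(jpeg_bytes)}\r\n"
        "\r\n"
    ).encode("ascii")
    return head + jpeg_bytes + b"\r\n"


def mjpeg_body(frames):
    """Yield the response body chunk by chunk for an iterable of JPEG frames."""
    for jpeg in frames:
        yield mjpeg_part(jpeg)
```

Note that this is a server-push stream, so by itself it does not give the client-driven pacing that other comments in this thread credit for the JPEG approach working.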

by bArray

0 subcomment

I've literally been here (many) years ago whilst trying to stream video from a potato Linux SBC via WiFi. As you walked further away, the H264 stream would just die and hang, no matter what you did. Stream JPEGs? Worked excellently and adjusted the number of JPEGs per second depending on connection (only requested the next frame after the current one arrived or a timeout occurred).
This got me thinking about video calls, which have been notoriously bad on bad connections. Half the time I am just streaming a screen with static information on it, we're not watching videos together. And yet the streaming pipeline is optimised as this article suggests for the higher bandwidth modes - when we're never really using it at all.
The most important part about a video call is rarely the video, it is usually the audio. It's counter-intuitive but you are better off having your call without video than you are without sound, and yet when the video falls over it takes the audio with it. Insanity!

by didibus

1 subcomments

What I'm wondering is, why couldn't the AI generate this solution? And implement it all?
Why did they need to spend human time and effort to experiment, arrive at this solution and implement it?
I'm asking genuinely. I use GenAI a lot, every day, multiple times a day. It helps me write emails, documents, produce code, make configuration changes, create diagrams, research topics, etc.
Still, it's all assisted, I never use its output as is, the asks from me to the AI are small, so small, I wouldn't ever assign someone else a task this small. We're not talking 1 story point, we're talking 0.1 story point. And even with those, I have to review, re-prompt, dissect, and often manually fix up or complete the work.
Are there use-cases where this isn't true that I'm simply not tackling? Are there context engineering techniques that I simply fail to grasp? Are there agentic workflows that I don't have the patience to try?
How then, do models score so high on some of those tests, are the prompts to each question they solve hand crafted, rewritten multiple times until they find a prompt that one-shot the problem? Do they not consider all that human babysitting work as the model not truly solving the problem? Do they run the models with a GPU budget 100x that they sell us?

by rezonant

0 subcomment

I guess their LLM doesn't have much training data on how to do video engineering. The result? A "video" stack that looks like a junior engineer wrote it.

by liampulles

0 subcomment

I appreciate the honesty in this article, hacking a solution together that works is ultimately what counts. Having said that, why H264?
If I understand correctly, the clients of the video stream are web browsers and perhaps mobile devices, and the servers are Helix's. Would SVT-AV1 with low-latency mode not be an option?

by epx

1 subcomments

Would HLS be an option? I publish my home security cameras via WebRTC, but I keep HLS as an escape for hotel/cafe WiFi situations (MediaMTX makes it easy to offer both).

by binocarlos

3 subcomments

> I mashed F5 like a degenerate.
I love the style of this blog-post, you can really tell that Luke has been deep down in the rabbit hole, encountered the Balrog and lived to tell the tale.

by vincepaulushook

0 subcomment

Hi, I would concur with some of the comments. A key frame in H264 is already encoded in a similar way as JPEG. Major differences are the "defaults": the flexibility of JPEG in terms of color depth and color map, but that can also be addressed with a video codec. Then when using a video codec like H264, it will also contain differential frames which will only send differences. It depends on the content but these frames can be significantly smaller than a key frame, like 10x.
So the math is that H264 can nearly only be better than JPEG, assuming proper parameters for the type of content, the targeted transmission challenges, the transmission type.
Using JPEG is close to using only key frames from a compression stand point (not to say, it is exactly like that), which is close to older protocols like MPEG-1 (DVD), or close to intra-frames only codec (like used as intermediate formats, for editing or preservation). And the difference in size is a no-brainer, eventually this is the amount of data that needs to be sent to every user.
In my opinion, the first consequence of using JPEG only is the cost per device, the number of concurrent streams from a server and what not.
If the perception of quality is low with H264 compared to JPEG, some parameters need to be adjusted. And ultimately, H264 is already an old codec anyway, not the one I would recommend, newer ones can address visual perception and bandwidth in a much better way. the VP-8/9/AV1 family will reduce the "macro block" effect of the H.26x codecs. Using HDR will dramatically improve the quality and will crush any benefit from JPEG, benefits related to the number of bits per pixels and the poor 8bits color maps, with a much higher efficiency.
Should the volume of users and the cost per user be of any consideration, a lossy video codec will prevail.
Video projects are challenging in the details: wish you the best.

by throwaway173738

0 subcomment

This article reminds me so much of so many hardware providers I deal with at work who want to put equipment on-site and then spend the next year not understanding that our customers manage their own firewall. No, you can’t just add a new protocol or completely change where your stuff is deployed because then our support team has to contact hundreds of customers about thousands of sites.

by Eduard

0 subcomment

> A JPEG screenshot is self-contained. It either arrives complete, or it doesn’t. There’s no “partial decode.”
What about Progressive JPEG?

by avsn

0 subcomment

We did something similar in one of the places I've worked at. We sent xy coordinates and pointer events from our frontend app to our backend/3d renderer and received JPEG frames back. All of that wrapped in protobuf messages and sent via WS connection. Surprisingly it kinda worked, not "60fps worked" though obviously.

by socketcluster

0 subcomment

Next phase would be to do diffs between the JPEGs and if the diff is smaller than the next JPEG, only send the (gzipped) diff and reconstruct the next JPEG on the client side.
TBH, the obsession with standards is kind of nutty. It's not that hard to implement custom solutions that are better adapted to specific problems. Standards make sense when you want maximum interoperability but not everything requires this degree of interoperability these days. It's not such hassle to just provide a lightweight client in those cases.
For example, it's not ideal to use HTTP2 server push for realtime chat use cases. It was primarily intended for file push to avoid round-trip latency but HTTP is such a powerful and widespread protocol that people feel the need to use it for everything.

by nico

0 subcomment

Super interesting. Some time ago I wrote some code that breaks down a jpeg image into smaller frames of itself, then creates an h.264 video with the frames, outputting a smaller file than the original image
You can then extract the frames from the video and reconstruct the original jpeg
Additionally, instead of converting to video, you can use the smaller images of the original, to progressively load the bigger image, ie. when you get the first frame, you have a lower quality version of the whole image, then as you get more frames, the code progressively adds detail with the extra pixels contained in each frame
It was a fun project, but the extra compression doesn’t work for all images, and I also discovered how amazing jpeg is - you can get amazing compression just by changing the quality/size ratio parameter when creating a file

by wood_spirit

0 subcomment

A long time ago I was trying to get video multiplexing to work over mobile over 3G. We struggled with H264 which had broad enough hardware support but almost no tooling and software support on the phones we were targeting. Even with engineers from the phone manufacturer as liaison we struggled to get access to any kind of SDK etc. We ended up doing JPEG streaming instead, much like the article said. And it worked great but we discovered we were getting a fraction of the framerate reported in Flash players - the call to refresh the screen was async and the act of receiving and decoding the next frame starved the redraw so the phone spent more time receiving lots of frames but not showing them. Super annoying and I don’t think the project survived long enough for us to find a fix.

by dimatura

0 subcomment

About eight years ago I was trying to stream several videos of a drone over the internet for remote product demos. Since we were talking to customers while the demo happened, the latency needed to be less than a few seconds. I couldn't get that latency with the more standard streaming video options I tried, and at the time setting up something based on WebRTC seemed pretty daunting. I ended up doing something pretty much like JPEGs as well, via the jsmpeg library [1]. Worked great.
[1] https://jsmpeg.com/ (tagline: "decode like it's 1999")

by josephernest

1 subcomments

Related: for some hardware project, I have a backend server (either C++ or python) receiving frames from an industrial camera, uncompressed.
And I need these frames displayed in a web browser client but on the same computer (instead of network trip like in this article).
How would you do this ?
I eventually did more or less like OP with uncompressed frames.
My goal is to minimize CPU usage on the computer. Would h264 compression be a good thing here given source and destination are the same machine?
Other ideas?
NB: this camera cannot be directly accessed by the browser.

by praveen9920

0 subcomment

This reminds me of the time we built a big angular3 codebase for a content platform. When we had to launch, the search engines were expecting content to be part of page html while we are calling APIs to fetch the content ( angular3 didn’t have server side rendering at that point)
So the only plausible thing to do was pre-build html pages for content pages and let Angular’s JS take its time to load (for UX functionality). It looked like the page flickered when the JS loaded for the first time, but we solved the search engine problem.

by dehrmann

0 subcomment

> What if we only send keyframes?
I think the author reached this conclusion, but individual jpegs is essentially only keyframes.
> We don’t spam HTTP requests for individual frames like it’s 2009.
Uncompressed frames are huge, somewhere between 5 MB and 50 MB. The overhead of a request is negligible. It's also different when you're optimizing for latency and reliability where dropped frames are OK. Really, the lesson is they should have tried the easy thing first to see how good it was.

by saagarjha

0 subcomment

I’m currently doing this in one of my side projects: https://github.com/saagarjha/Ensemble. It works, kinda; it’s good enough for demos at least and I haven’t had much time to improve it. At some point you would really want to use an actual video encoder though because JPEGs are not cheap to encode and send even with hardware acceleration.

by bob1029

0 subcomment

> Why JPEGs Actually Slap
JPEG is extremely efficient to [de/en]code on modern CPUs. You can get close to 1080p60 per core if you use a library that leverages SIMD.
I sometimes struggle with the pursuit of perfect codec efficiency when our networks have become this fast. You can employ half-assed compression and still not max out a 1 Gbps pipe. From Netflix & Google's perspective it totally makes sense, but unless you are building a streaming video platform with billions of customers I don't see the point.

by gametheory87

0 subcomment

“It’s always TCP_NODELAY” seems relevant here: https://news.ycombinator.com/item?id=40310896

by dehrmann

0 subcomment

> We’re building Helix, an AI platform where autonomous coding agents work in cloud sandboxes. Users need to watch their AI assistants work. Think “screen share, but the thing being shared is a robot writing code.”
This feels like a fast dead end. Agents will get much faster pretty quickly, so synchronous human supervision isn't going to scale. I'd focus on systems that make high-signal asks of humans asynchronously.

by xnx

0 subcomment

You see a company that is bad at video streaming. I see a smart application of Cunningham's Law https://meta.wikimedia.org/wiki/Cunningham%27s_Law

by kiririn7

0 subcomment

"By the time you see a bug, the AI has already committed it to main" does anybody actually actively watch the code their agent is writing? i am watching movie recaps on my 2nd monitor. this seems like a problem that they assume exists because they dont actually use their product

by cwt137

0 subcomment

Everyone talks about WebSockets for pushing real-time data to the browser. This article highlights some of their drawbacks. I use Server-Sent Events (SSE) instead. A lot of the problems the author of the article faced are solved with SSE. Also, SSE scales way better than polling all the time.

by lostmsu

2 subcomments

If you already have latency detection, why not pause the H.264 frames, then when the ack comes, just force a keyframe and resume (perhaps with an adjusted target bitrate)?
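A minimal sketch of that pause/resume idea as a small state machine. The class name, the threshold, and the action strings are all hypothetical; wiring it to a real encoder (and actually requesting the keyframe) is left out.

```python
from dataclasses import dataclass


@dataclass
class PauseResumeGate:
    """Decides what to do with each encoded frame (hypothetical helper)."""

    max_latency_ms: float = 200.0  # assumed threshold, tune for your setup
    paused: bool = False
    need_keyframe: bool = False

    def on_latency_sample(self, latency_ms: float) -> None:
        # Client is too far behind: stop feeding the pipe.
        if latency_ms > self.max_latency_ms:
            self.paused = True

    def on_ack(self) -> None:
        # Client caught up. Frames were skipped while paused, so the next
        # P-frame would reference data the decoder never received.
        if self.paused:
            self.paused = False
            self.need_keyframe = True

    def decide(self, is_keyframe: bool) -> str:
        if self.paused:
            return "drop"
        if self.need_keyframe:
            if is_keyframe:
                self.need_keyframe = False
                return "send"
            # Skip this frame and ask the encoder for a keyframe instead.
            return "force_keyframe"
        return "send"
```

The point of the `need_keyframe` flag is that resuming mid-GOP is not safe: anything sent before the next keyframe would decode against missing references.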

by STELLANOVA

1 subcomments

We did something similar 12+ years ago, `streaming` an app running on AWS into the browser. Basically you could run 3D Studio Max on a Chromebook. The app actually runs on an AWS instance and just sends JPEGs to the browser to `stream` it. We did a lot of QoS logic and other stuff, but it actually worked pretty nicely. Adobe used it for some time to allow users to run Photoshop in the browser. Good old days..

by breve

0 subcomment

WebP is well supported in browsers these days. Use WebP for the screenshots instead of JPEG and it will reduce the file size:
https://developers.google.com/speed/webp/gallery1
https://caniuse.com/webp

by K0nserv

1 subcomments

You can do TURN using TLS/TCP over port 443. This can fool some firewalls, but will still fail in cases where an intercepting HTTP proxy is used.
The neat thing about ICE is that you get automatic fallbacks and best path selection. So best case IPv6 UDP, worst case TCP/TLS.
One of the nice things about HTTP/3 and QUIC will be that UDP port 443 will be more likely to be open in the future.

by Sean-Der

0 subcomment

Doesn’t matter now, but what led you to TURN?
You can run all WebRTC traffic over a single port. It’s a shame you spent so much time/were frustrated by ICE errors.
That’s great you got something better and with less complexity! I do think people push ‘you need UDP and BWE’ a little too zealously. If you have a homogeneous set of clients, stuff like RTMP/WebSockets seems to serve people well.

by mschuster91

0 subcomment

> We are professionals. We implement proper video codecs. We don’t spam HTTP requests for individual frames like it’s 2009.
I distinctly 'member doing CGI stuff with HTTP multipart responses... although I bet that with the exception of Apache, server (and especially: reverse proxy) side support for that has gone down the drain.
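For anyone who hasn't seen it, the multipart trick is `multipart/x-mixed-replace`: one long-lived HTTP response where each part replaces the previous image. A rough sketch of the wire framing, assuming an arbitrary boundary token (`frame`) and that the JPEG bytes come from somewhere else:

```python
BOUNDARY = b"frame"  # arbitrary token chosen for this sketch

# Sent once, at the start of the long-lived response.
RESPONSE_HEADERS = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: multipart/x-mixed-replace; boundary=" + BOUNDARY + b"\r\n"
    b"Cache-Control: no-cache\r\n"
    b"\r\n"
)


def multipart_part(jpeg: bytes) -> bytes:
    """Wrap one JPEG as a single part of the multipart stream."""
    return (
        b"--" + BOUNDARY + b"\r\n"
        b"Content-Type: image/jpeg\r\n"
        b"Content-Length: " + str(len(jpeg)).encode("ascii") + b"\r\n"
        b"\r\n" + jpeg + b"\r\n"
    )
```

The server writes `RESPONSE_HEADERS` once and then one `multipart_part(...)` per frame. Whether a given reverse proxy passes this through unbuffered is a separate question, as noted above.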

by sevensor

4 subcomments

No mention of PNGs? I don’t usually go to jpegs first for screenshots of text. Did png have worse compression? Burn more cpu? I’m sure there are good reasons, but it seems like they’ve glossed over the obvious choice here.
edit: Thanks for the answers! The consensus is that PNG encoding/decoding is too expensive compared to JPEG.

by monus

0 subcomment

Well, we are serving latency-sensitive remote control to <one of the biggest banks in the US> via WebRTC, which uses TURN over TLS, so you get 443 HTTPS for the whole traffic.
No NAT, no UDP, just pure TURN traffic over Cloudflare TURN with TLS.

by tracker1

0 subcomment

My only real curiosity is if .png or .webp are supported and how much slower and/or faster they are in practice over jpeg given the quality level needed to not artifact.

by colechristensen

0 subcomment

H.264 can be used to encode a single frame as an effective image with better compression than JPEG.

by poly2it

0 subcomment

Why is video streaming so difficult? We've been doing it for decades, so why is there seemingly no FOSS library which lets me encode an arbitrary dynamic frame rate image stream in Rust and get HD data with delta encoding in a browser receiver? This is insanity.

by andrewstuart

0 subcomment

I wrote a motion jpeg server for precisely this use case.
https://github.com/crowdwave/maryjane
The secret to great user experience is you return the current video frame at time of request.
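A minimal sketch of the "return the current frame at time of request" idea, not taken from the linked project: the capture side overwrites a single slot, and each request reads whatever is there, so a slow client never builds a backlog. The class and method names are made up for illustration.

```python
import threading
from typing import Optional, Tuple


class LatestFrame:
    """Single-slot store: writers overwrite, readers get the newest frame."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._frame: Optional[bytes] = None
        self._seq = 0  # increments on every publish

    def publish(self, jpeg: bytes) -> None:
        # Called by the capture loop; older frames are simply discarded.
        with self._lock:
            self._frame = jpeg
            self._seq += 1

    def current(self) -> Tuple[int, Optional[bytes]]:
        # Called by each request handler at the moment the request arrives.
        with self._lock:
            return self._seq, self._frame
```

The sequence number lets a handler skip the response if the client already has that frame.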

by moralestapia

0 subcomment

>A single H.264 keyframe is 200-500KB.
Hmm they must be doing something wrong, they're not usually that heavy.

by julik

0 subcomment

Having built an image sequence player using JPEGs back in the day - I can attest that it slappps.

by Animats

0 subcomment

This is a screen-sharing system, correct? Sharing screens with text? JPEG compression of text is bad. JPEG is terrible at hard edges. PNG is fine with them, and good at uniform areas of color, like text.

by yanngagnon

0 subcomment

If you don't need sound, or don't need to present the sound to the user in sync with the content, the still image solution is obvious.

by nrhrjrjrjtntbt

0 subcomment

That's fun. I take it JPEG (what settings lolz!) is compressing harder than a keyframe.
But you are watching code. Why not send the code? Plus any CSS/HTML used to render it pretty. Or in other words, why not a VS Code tunnel?

by abujazar

0 subcomment

This reminds me of https://github.com/memvid/memvid

by elzbardico

0 subcomment

There's no real reason other than bad configuration/coding for an H.264 1080p 30fps screen share stream to sustain 40 Mbps. You can watch an action movie at the same frame rate but at 4K resolution while using less than half this bandwidth.
The real solution is using WebRTC, like every single other fucking company that has to stream video is doing. Yes, enterprise consumers require additional configuration. Yes, sometimes you need to provide a "network requirements" sheet to your customer so they can open a ticket with their IT to configure an exception.
Second problem: enterprise networks are usually not as bad as internet cafe networks, but internet cafe networks are usually not locked down, so you should always try the best-case scenario first, with WebRTC and TURN servers on port 3478. That will also be the best option for really bad networks, but usually those networks are not enterprise networks.
Please configure your encoder; a 40 Mbps bit rate for what you're doing is way, way too much.
Test whether TURN is accessible: try it first over UDP (the best option, and it will also work in an internet cafe); if that fails, try TCP on port 443; still not working? Try TLS on port 443.
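That fallback order can be written down as a list of TURN URIs, tried in sequence. This is only a sketch: the host name is a placeholder, `probe` is a hypothetical reachability check you would have to supply, and in practice listing all three as ICE servers lets ICE pick the best working path on its own.

```python
from typing import Callable, List, Optional


def turn_candidates(host: str) -> List[str]:
    """TURN URIs in the order suggested above: UDP, then TCP 443, then TLS 443."""
    return [
        f"turn:{host}:3478?transport=udp",   # best case, standard TURN port
        f"turn:{host}:443?transport=tcp",    # plain TCP on the HTTPS port
        f"turns:{host}:443?transport=tcp",   # TURN over TLS on the HTTPS port
    ]


def first_reachable(urls: List[str], probe: Callable[[str], bool]) -> Optional[str]:
    """Return the first URI for which the caller-supplied probe succeeds."""
    for url in urls:
        if probe(url):
            return url
    return None
```

For example, `first_reachable(turn_candidates("turn.example.com"), probe)` returns `None` when nothing gets through, which is the point where a "network requirements" sheet comes in.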

by visiondude

2 subcomments

I'm very confused, couldn’t they have achieved a much better outcome with existing HLS tech with adaptive bitrate playlists? Seems they both created the problem and found a suboptimal solution.

by escapecharacter

1 subcomments

I guess this is great as long as you don't worry about audio sync?

by keepamovin

0 subcomment

This is similar to what BrowserBox does for the same reasons outlined. Glad to see the control afforded by "ye olde ways" is recognized and widely appreciated.

by ddtaylor

1 subcomments

A very stupid hack that can work to "fix" this could be to buffer the h264 stream at the data center using a proxy before sending it to the real client, etc.

by CrossVR

0 subcomment

This isn't a hack though, MJPEG (Motion JPEG) is an actual video format and has long been used for security camera footage.

by willseth

0 subcomment

“We didn’t have the expertise to build the thing we were building, got in way over our heads, and built a basic POC using legacy technology, which is fine.”

by ErroneousBosh

0 subcomment

So, they've invented MJPEG?
Or is it intra-only H.264?
I mean, none of this is especially new. It's an interesting trick though!

by Dwedit

0 subcomment

"Helix" also happens to be the name of an open-source project created by RealPlayer.

by kuon

1 subcomments

Wait what? 40Mbps for a remote desktop? Even 10Mbps is insane. I remember deploying Sun Rays over dialup and the image wasn't that bad; yes, it was low resolution and I think it was UDP, but the desktop was usable with surprisingly low latency.
To monitor an AI you can lower the bit depth considerably and not lose much detail about what is happening. If you control the web renderer, disable text anti-aliasing, and there might be other optimizations that can help. Tile & diff the image... But video encoders already do that, so it might just work out of the box.
Also, if your single H.264 image is larger than the JPEG then you are doing something wrong; JPEG is a very poor encoding compared to what we have today.
Look at how other remote desktop protocols do it: VNC, RDP...
Managing streams over corporate networks is well documented; many web frameworks include a "longpoll" fallback (or SSE) so streaming plays nice even without WebSockets. "Discovering" you cannot deploy whatever you want to an enterprise network is quite alarming.
I really don't want to be the graybeard guy saying "young engineers are bad", as I am more on the side of believing in the new generations, but please, don't act like computers spawned into existence in 2020 and that nothing has been done before.
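To make the "tile & diff" idea concrete, here is a minimal pure-Python sketch. It assumes 8-bit grayscale frames stored row-major as `bytes`, and the 64-pixel tile size is an arbitrary choice; a real implementation would work on the actual pixel format and then encode and send only the changed tiles.

```python
from typing import List, Tuple


def changed_tiles(
    prev: bytes, curr: bytes, width: int, height: int, tile: int = 64
) -> List[Tuple[int, int]]:
    """Return the (x, y) origin of every tile that differs between two frames.

    Frames are assumed to be 1 byte per pixel, row-major, width * height long.
    """
    changed = []
    for ty in range(0, height, tile):
        for tx in range(0, width, tile):
            w = min(tile, width - tx)  # clip tiles at the right edge
            for y in range(ty, min(ty + tile, height)):
                start = y * width + tx
                if prev[start:start + w] != curr[start:start + w]:
                    changed.append((tx, ty))
                    break  # one differing row is enough for this tile
    return changed
```

For a mostly static desktop with a blinking cursor, this returns one or two tiles per frame, which is roughly the saving a well-configured video encoder gets from inter-frame prediction.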

by tverbeure

1 subcomments

I’m surprised that H264 I-frame only compresses less than JPG.
Maybe because the basic frequency transform is 4x4 vs 8x8 for JPG?

by dicroce

0 subcomment

They should have used HLS. It's still pulling, and the client controls the downshifts if required...

by notpushkin

0 subcomment

Considering you already have a WebSocket open, why not just send JPEGs over it?

by inDigiNeous

0 subcomment

That was a fun read. Kudos to the writer. This is software development life.

by mring33621

0 subcomment

This is such a great post. I love the "The Oscillation Problem"!

by ddtaylor

0 subcomment

> I mashed F5 like a degenerate
Bargaining.

by boggyb

0 subcomment

You can make WebRTC work on enterprise networks by tunneling TURN TCP traffic over WebSocket. The flow looks like this:
client's WebRTC app using TURN (pointing to the same machine IP) <-> TCP server / WebSocket client (runs on client machine) <-> WebSocket server (relays TURN packets) <-> real TURN server <-> host's WebRTC app
https://github.com/amitv87/turn_ws_proxy
I implemented a similar technique for BrowserStack more than a decade ago to bypass enterprise firewalls by tunneling TURN packets over WebSockets/SSE/socket.io etc. The `tcp server / websocket/sse/socket.io client` was hosted as part of a packaged Chrome app / Firefox extension. The WebSocket and TURN servers were hosted on the same machine to minimize latency (they could have been embedded in the same process to reduce latency further).

by gethly

0 subcomment

oof. i knew instantly what the problem was and realised these people have no clue about how video even works. yet another vibe coded AI startup.

by ErroneousBosh

0 subcomment

You know what else I don't quite get? Why isn't "Your network is broken. Fix your network. Blocking UDP is idiotic. Get someone to set it up who has at least stood within hailing distance of a clue" an acceptable thing to say here?

by mgaunard

0 subcomment

RTP is specifically designed for real-time video.

by worksonmine

1 subcomments

I'm confused, do people actually watch their agents code like it was a screen share? Why does the AI even mess with that, just send a diff over text? Is it getting a keyboard next?
This is the definition of over-engineering. I don't usually criticize ideas but this is so stupid my head hurts.

by bandamo

0 subcomment

Would like to see what alternatives were looked at. RDP with an HTML client (Guacamole) seems like a good match.

by the8472

0 subcomment

> looks at TCP congestion control literature
> closes tab
Eh, there are a few easy things one can try. Make sure to use a non-ancient kernel on the sender side (to get the necessary features), then enable BBR and NOTSENT_LOWAT (https://blog.cloudflare.com/http-2-prioritization-with-nginx...) to avoid buffering more than what's in-flight and then start dropping websocket frames when the socket says it's full.
Also, with tighter integration with the h264 encoder loop one could tell it which frames weren't sent and account for that in pframe generation. But I guess that wasn't available with that stack.
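A rough sketch of the BBR + NOTSENT_LOWAT + drop-when-full approach using Python's `socket` module. It assumes Linux with the BBR module available; the 16 KiB threshold and the function names are made up for illustration, and every option is applied on a best-effort basis.

```python
import select
import socket


def tune_for_low_latency(sock: socket.socket, lowat_bytes: int = 16 * 1024) -> dict:
    """Best-effort: enable BBR and a small not-sent low-water mark."""
    applied = {"bbr": False, "notsent_lowat": False}
    cong = getattr(socket, "TCP_CONGESTION", None)  # not defined on every platform
    if cong is not None:
        try:
            sock.setsockopt(socket.IPPROTO_TCP, cong, b"bbr")
            applied["bbr"] = True
        except OSError:
            pass  # BBR not available, or not a TCP socket
    lowat = getattr(socket, "TCP_NOTSENT_LOWAT", None)
    if lowat is not None:
        try:
            sock.setsockopt(socket.IPPROTO_TCP, lowat, lowat_bytes)
            applied["notsent_lowat"] = True
        except OSError:
            pass
    return applied


def send_or_drop(sock: socket.socket, frame: bytes) -> bool:
    """Send the frame only if the socket is writable right now, else drop it."""
    _, writable, _ = select.select([], [sock], [], 0)  # zero timeout: just poll
    if not writable:
        return False  # kernel still holds unsent data; skip this frame
    sock.sendall(frame)  # note: can still block on a frame larger than the buffer
    return True
```

With the low-water mark set, the socket only reports writable once the unsent backlog is small, so `send_or_drop` ends up skipping frames instead of queueing them behind stale ones.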

by bayindirh

0 subcomment

I love this:
- It's 2025! We don't need to think like the savages of yore. Use video at 60FPS. Computing is cheap, network is reliable. Why do we need to remember old ways like savages?
It turns out that the network is not reliable...
- We will do as our ancestors did, and will send JPEGs, and that works?! Whoa, who guessed it!
Come on. Everything is new but nothing has changed. Sometimes the older tech is vastly better, and saves our butts or lives or both. We shouldn't be ashamed of using things proven to work.

by tylertyler

0 subcomment

I've found that WebM works much better because of the structure of the data in the container. I've also gone down similar routes using outdated tech and even inventing my own encoders and decoders trying to smooth things out, but what I've currently found is that the best approach is using WebM, because it is easier to lean on hardware encoders and decoders, including across browsers with the new WebCodecs APIs. What I've been working on is a little different from what is in this post, but I'm pretty sure this logic still stands.

by krater23

1 subcomments

"By the time you see a bug, the AI has already committed it to main"
Besides the fact that the author has no clue at all about encoding, MJPEG, VNC,....
Really, THIS is the product that they sell?! This sounds like horrible work. Observing a coding agent that does my job, but faster and crappier than me, and stopping it when it does total bullshit to prevent it from committing to main?

by j45

0 subcomment

One thing this article does point to indirectly is sometimes, simple scales, and complex fails.

by mannyv

0 subcomment

Awesome!
Good engineering: when you're not too proud to do the obvious, but sort of cheesy-sounding solution.

by almog

0 subcomment

Posts like this on the front page make me miss N-Gate so bad...

by nicman23

0 subcomment

close enough, welcome back mpeg

by JohnCClarke

0 subcomment

+1 - I made the same technology choice back in 2014. Seems like nothing has changed.
TL;DR: You can't keep things too simple.

by mxkyb

0 subcomment

why not media over quic

by piyushpr134

0 subcomment

how about using mjpeg ?

by develatio

0 subcomment

I cried. Then I laughed. Then I cried again. I can feel all the pain of the entire thing (don't ask me why). Amazing. Bravo!!

by HocusLocus

0 subcomment

This is a beautiful cope. Every time technology rolls out something that works great 90% of the time for 90% of the people, those 10%s pile up big time in support and lost productivity. You need functional systems that fall back gracefully to 1994 if necessary.
I started the first ISP in my area. We had two T1s to Miami. When HD audio and the rudiments of video started to increase in popularity, I'd always tell our modem customers, "A few minutes of video is a lifetime of email. Remember how exciting email was?"

by dengolius

1 subcomments

what about av1?

by hmontazeri

0 subcomment

Another case of we’re going backwards. The boring stuff is what works every time…

by dinobones

1 subcomments

You spent 3 months on this hacked together garbage when you probably could’ve just configured a pre-existing solution off the shelf with like 10 minutes of reading and understanding documentation.
This blog post reeks of “you can just do things” type of engineering. This is the quality of engineering I would expect from “TPOT” (that part of Twitter) where people talk about working 12 hour days. It’s cause they’re working 12 hours on bullshit like this.
Building some sweet custom codec or binary transportation algorithm was barely cute in like 1989. It definitely ain’t cute now.
How many of these AI and “agentic” companies are just misled engineers thinking they are cracked and writing needlessly complex solutions to problems that don't even exist?
Just burn it all down. Let it pop already.