- Responding to some technical points first, but after that I do see a future that isn't WebRTC. I don't think it matches where WebTransport+WebCodecs etc. is going, though.
> …but as a user, I would much rather wait an extra 200ms for my slow/expensive prompt to be accurate
This is the opposite of the feedback I get. Users want instant responses; any delay in generating responses or handling interruptions kills the magic. You also don't want to send faster than real time: if the user interrupts the model, you just wasted a bunch of bandwidth sending 3 minutes of audio of which only 10 seconds played.
> TTS is faster than real-time
Voice AI at the latest/aspirational end is moving away from what the author describes; see https://research.nvidia.com/labs/adlr/personaplex/. Audio is trickled in/out in 20ms frames.
> We really hope the user’s source IP/port never changes, because we broke that functionality.
That is supported. When a new source IP/port shows up for the same ICE ufrag, the connection follows it.
> It takes a minimum of 8* round trips (RTT)
That's wrong; see https://datatracker.ietf.org/doc/draft-hancke-webrtc-sped/
> I’d just stream audio over WebSockets
You lose stuff like AEC. You also push complexity onto clients. The simplicity of WebRTC (createOffer -> setRemoteDescription) is what lets people onboard easily. Lots of developers struggled with the Realtime API over WebSockets (lots of code, and having to do everything by hand).
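For context on how small that onboarding surface is, here is a rough sketch of the whole client-side handshake; "/signal" is a hypothetical signalling endpoint that returns the server's answer SDP:

```typescript
// Minimal WebRTC client handshake: one offer out, one answer in.
async function connect(): Promise<RTCPeerConnection> {
  const pc = new RTCPeerConnection();

  // Capture the mic; the browser's AEC/noise suppression come along for free.
  const mic = await navigator.mediaDevices.getUserMedia({ audio: true });
  mic.getTracks().forEach((track) => pc.addTrack(track, mic));

  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);

  // Hypothetical signalling endpoint: POST the offer SDP, get the answer SDP back.
  const res = await fetch("/signal", { method: "POST", body: pc.localDescription!.sdp });
  await pc.setRemoteDescription({ type: "answer", sdp: await res.text() });
  return pc;
}
```

Everything past this point (encryption, congestion control, jitter buffering) is the browser's problem, which is the onboarding win.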
----
I think if I had my choice I would keep the Offer/Answer model and do QUIC instead of DTLS+SCTP. Maybe RTP over QUIC? I personally don't feel strongly about the protocol itself. I don't know how to ship code to multiple clients (and customers' clients) with a much larger code footprint.
- I have a lot of experience in this area (and some patent applications). For Alexa, the device established a connection back to the server and kept it open, sending basically HTTP2/SPDY/something like it over the wire after it detected the wake word. This allowed the STT to start processing before you finished talking, so there is only a small delay in processing the last few chunks of your utterance.
The answer came back over the same connection.
In the case of OpenAI, they can't exactly keep a persistent connection open like Alexa does, but they can use HTTP2 from the phone and both iOS and Android will pretty much take care of that connection magically.
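A rough sketch of the web-client equivalent, assuming a hypothetical endpoint: a streaming request body lets the server start STT before the utterance ends. (Chromium currently requires the `duplex: "half"` option and an HTTP/2-capable connection for streaming request bodies.)

```typescript
// Stream audio chunks to the server over one HTTP/2 request as they arrive,
// so STT can begin before the user finishes talking.
async function streamUtterance(chunks: AsyncIterable<Uint8Array>): Promise<Response> {
  const body = new ReadableStream<Uint8Array>({
    async start(controller) {
      for await (const chunk of chunks) controller.enqueue(chunk);
      controller.close();
    },
  });
  // "https://api.example.com/utterance" is a placeholder endpoint.
  return fetch("https://api.example.com/utterance", {
    method: "POST",
    body,
    duplex: "half", // required by Chromium for streaming request bodies
  } as RequestInit);
}
```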
The author is absolutely right: a real-time protocol isn't necessary. It's more important to get all the data. The user won't even notice a delay until you get over 500ms, especially in the age of mobile phones, where most people are used to their real-time human-to-human communications having a delay.
(If you work at OpenAI or Anthropic, give me a shout, I'm happy to get into more details with you)
- I didn't make it all the way through the post, but I have to say I think he fundamentally misunderstands the purpose of WebRTC. He calls himself an expert, and yeah, he's written SFUs in Go and Rust at different companies... but his technical credentials do not mean he's correct.
Maybe it's a comprehension issue on my end, but he seems to treat things like STUN and DTLS as related, compounding issues (particularly in round-trip time), when they are really orthogonal.
Also, he spends too much time talking about how you can't resend packets, and reiterates that point by stating they tried really hard (at Discord?). That's where he lost the plot, imo.
The RTC in WebRTC is about real-time communication. Humans will naturally prefer the auditory experience of an occasional dropped packet vs. backed-up audio or audio that plays at an uneven rate. To clarify, I'm talking about human speech here.
If you can't tolerate packet loss, use a protocol based on TCP instead of UDP. But you know what happens when you send audio over poor network conditions with TCP? There will be pauses on the receiving end as it waits for the next in-order packet. Let's say the delay is multiple seconds. What should the receiving end do when packets start flowing again? Play the backed-up audio at the natural clock rate? Attempt to play the audio back at a higher rate to "catch up" with any other channels? People, humans, do not generally prefer that experience.
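To make the no-win concrete, here's a sketch of the policy a TCP-based receiver has to pick from once the stall clears (the millisecond thresholds are made-up illustrations):

```typescript
type CatchUpPolicy = "play-late" | "speed-up" | "flush-and-resync";

// After a TCP stall, every way of handling the backed-up audio degrades
// the experience differently; the thresholds here are arbitrary examples.
function pickPolicy(backlogMs: number): CatchUpPolicy {
  if (backlogMs < 150) return "play-late";  // conversation drifts out of sync
  if (backlogMs < 1500) return "speed-up";  // audible tempo/pitch artifacts
  return "flush-and-resync";                // i.e. drop audio anyway, just later
}
```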
Forget about WebRTC for a minute and instead think about TCP vs UDP for voice. VoIP has been based on UDP since the '90s for a reason.
- This poor soul. There are few protocols I hate implementing more than WebRTC. Getting a simple client going means you need to quickly acclimate to SDP, TURN/STUN, ICE candidates, offers, peer-to-peer protocols, and the complex handshake that gets implemented from scratch each time. I can't imagine re-writing the whole trenchcoat of protocols and unintended "best practices".
- > But nope, WebRTC has no buffering and renders based on arrival time. Like seriously, timestamps are just suggestions. It’s even more annoying when video enters the picture.
I felt that comment in my bones. Why would anyone possibly need to know the actual presentation timestamp and how it corresponds to actual real time? Evidently, no one working on WebRTC has ever had to synchronise data streams from varying sources with millisecond accuracy.
I was doing a demo of video stabilisation using a webcam and IMU module in the browser. It turns out the latencies between video->rtc->browser and sensor->websocket->browser are wildly different and not constant. The obvious solution would be to send UTC timestamps for the sensor data and synchronise in the browser. Not possible: the video has no UTC timestamp reference. When you control both sides of the WebRTC pipe, you can do fun things like send the UTC timestamp of the start of the stream, but this won't solve browser jitter. It worked well enough for a POC, but the entire solution had to be re-engineered.
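For anyone attempting the same thing, that stream-start workaround looks roughly like this sketch (the channel name and anchor format are made up, and it leaves the browser-jitter problem unsolved):

```typescript
declare const senderPc: RTCPeerConnection;   // the publishing peer
declare const receiverPc: RTCPeerConnection; // the consuming peer

// Sender peer: publish a wall-clock anchor for the media stream.
const clock = senderPc.createDataChannel("clock");
clock.onopen = () => clock.send(JSON.stringify({ streamStartUtcMs: Date.now() }));

// Receiver peer: map elapsed media time back onto UTC via the anchor.
receiverPc.ondatachannel = ({ channel }) => {
  channel.onmessage = (e) => {
    const { streamStartUtcMs } = JSON.parse(e.data);
    // Elapsed playout time must still be estimated locally, and browser
    // jitter makes that estimate wobble; hence "worked for a POC only".
    const toUtc = (elapsedMediaMs: number) => streamStartUtcMs + elapsedMediaMs;
  };
};
```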
- This is frustratingly one-sided writing. Yeah, WebRTC has limitations, but relying on a standard buys you a lot of correctness and reduces long-term engineering cost. The fact that WebRTC is complicated does not mean it is wrong; it means real-time media over the public internet is complicated.
Also, networking is inherently stateful. NAT traversal, jitter buffers, congestion control, packet loss, codec state, encryption, and session routing do not disappear because you put audio over TCP or WebSocket. Pretending otherwise is not architectural clarity. It is just moving the complexity somewhere less visible.
- What would actually be really interesting: text-to-speech on the device. You could easily stream text to the client, which could generate the voice in real time. Far less bandwidth, and latency is not really an issue.
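The browser half of this already exists in basic form; a sketch using the standard speechSynthesis API, assuming the server streams sentence-sized text chunks:

```typescript
// Speak streamed text locally; speechSynthesis queues utterances in order,
// so sentences play back-to-back as they arrive from the server.
async function speakStream(sentences: AsyncIterable<string>): Promise<void> {
  for await (const sentence of sentences) {
    speechSynthesis.speak(new SpeechSynthesisUtterance(sentence));
  }
}
```

Voice quality is the catch: on-device neural TTS would need more than this API currently guarantees.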
- > WebRTC is designed to degrade and drop my prompt during poor network conditions
If you want real time, that's what you're going to deal with. If you don't want real time, and instead imagine everything as STT -> Prompt -> TTS, then maybe you shouldn't even be sending audio on the wire at all.
- I run the Gemini Live API over a mesh-hosted managed WebRTC cloud. Works fantastically, and I've been running it for 2 years. You can try WebSockets, handle ephemeral keys, etc. etc., but when you speak with people running voice agents at scale in this space, many of the issues are solved with WebRTC and Pipecat and the many resources allocated to solved problems in this space. It certainly feels like overkill, and it probably is, but once the connection is established, it's pretty magical. The startup time and buffering have been solved for quicker voice connections too: https://github.com/pipecat-ai/pipecat-examples/tree/main/ins... (video is harder)
- I get what you are saying; I honestly thought it was me who didn't understand.
There are tons of ways to fine-tune WebRTC so it doesn't corrupt audio on a poor network; it has all the controls to smoothly trade off latency vs quality. Not just NACKs: FEC, disabling PLC/acceleration/deceleration, a larger jitter buffer (tons of parameters), etc.
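Two of those knobs as a sketch: `jitterBufferTarget` is the standardized receiver-side control, and Opus in-band FEC is typically requested by munging the SDP (the payload-type lookup here is simplified):

```typescript
// Knob 1: ask audio receivers to hold a deeper jitter buffer
// (trading latency for smoothness).
function deepenJitterBuffer(pc: RTCPeerConnection, targetMs: number): void {
  for (const receiver of pc.getReceivers()) {
    if (receiver.track.kind === "audio") {
      receiver.jitterBufferTarget = targetMs; // e.g. 250 (milliseconds)
    }
  }
}

// Knob 2: enable Opus in-band FEC via SDP munging before setRemoteDescription.
function enableOpusFec(sdp: string): string {
  const m = sdp.match(/a=rtpmap:(\d+) opus\/48000/);
  if (!m) return sdp;
  return sdp.replace(new RegExp(`a=fmtp:${m[1]} `), `a=fmtp:${m[1]} useinbandfec=1;`);
}
```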
Most of the glitches I heard with OpenAI's Voice were not WebRTC-related; to my ear they sounded more like real-time issues with their inference, which is a very different component to optimize.
- I used to work in WebRTC back in its earlier days, and our team developed the open-source rtc.io (https://github.com/rtc-io).
I never would have imagined that OpenAI is sending the full audio of a request to their servers. I had always assumed the audio was transcribed locally and then sent to the server.
The only reason I can think they'd want the full audio is for later model training, which, OK, fair enough, but this can still likely be done without the limitations of WebRTC.
- there are a lot of extremely smart people who have come back to WebRTC time and time again because it continues to solve problems other methods and protocols can't. that said, QUIC is certainly interesting going forward, but i primarily stream voice + vision at 1fps, so it just makes sense, and websockets fail and are insecure at scale for this use case (see https://www.daily.co/videosaurus/websockets-and-webrtc/). also just listen to Sean in this thread, dude knows what's up.
- Oh, is this why 1-800-CHATGPT is trash now? It worked great when I started using it months ago. The last few times I've called, the bot constantly interrupts herself, or stops as if I'm interrupting. I can't get a single full sentence out of her, so I stopped calling.
I've experienced super deranged behavior out of 1-800-CHATGPT too. When I was just bored and called to ask how she was doing and what her day was like, she spiraled into laughing maniacally. It was unsettling. That was just before the service became unreliable, so I'm really curious what changed about the architecture.
- I've been using LiveKit, which is also WebRTC-based, and it is super annoying when playback slows down or speeds up at times when the connection is not robust. We were using OpenAI's WebSocket-based Realtime audio, which was way too slow. So I don't know which one is better. Generally our users like the LiveKit implementation better, so maybe WebRTC with enough clever hacks is the answer.
This blog was super insightful for me in understanding the root problems in the current implementation, though.
- Amazing read. Blog posts rarely keep my attention like this one.
- Why does the voice need to be sent to the server? Why not perform speech-to-text on-device? Is the p10 phone/laptop not capable of this yet, despite every "dictation" feature I see in every modern OS?
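For what it's worth, browsers do expose dictation via the Web Speech API; the catch is that several browsers implement it by shipping the audio to their own cloud recognizer anyway. A sketch:

```typescript
// Browser dictation sketch; webkitSpeechRecognition is the prefixed fallback.
const Recognition =
  (window as any).SpeechRecognition ?? (window as any).webkitSpeechRecognition;
const recognizer = new Recognition();
recognizer.continuous = true;
recognizer.interimResults = true;
recognizer.onresult = (event: any) => {
  const result = event.results[event.results.length - 1];
  console.log(result[0].transcript, result.isFinal ? "(final)" : "(interim)");
};
recognizer.start();
```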
- I haven't really experienced disconnections while using ChatGPT. Gemini is the frustrating one: simply backgrounding the app (and the web version too) and resuming causes the response, or the conversation with an assigned ID, to disappear. Haha.
- Nice fun article. Gives me Why The Lucky Stiff vibes.
- I didn't understand: why is WebRTC good for Google Meet but not good for all other conferencing apps?
- Most of the problems happen because we want to simulate human conversations. While that's a good goal to have, another approach is to let users know clearly that they are talking to a bot. You will be surprised at how accommodating users can be when they know they are talking to a bot and just want their queries resolved.
- My biggest frustration with WebRTC was precisely captured in the article: even if you don't need p2p and your video source is a process on the same host as your browser, you still have to dance through connection setup like you're on the other side of the planet.
- Excellent writeup. I wish we had awards for blog posts when the person is a domain expert in the post's subject.
- Browser API reliability in general has a lot of undocumented edge cases — WebRTC isn't alone there.
- If you're just doing STT and TTS, why would you not do that locally and stream text?
- > ... I say hi to ~~Scarlett Johansson~~
Had a nice chuckle.
- Exactly what I thought when I read the original article, though to be fair, WebTransport is only now entering the mainstream, with Safari shipping support this year.
- Why worry about OpenAI? Their product will fail if it doesn't work. Then they'll figure it all out.
- I remember using the WebRTC data channel for p2p video. Browser-to-browser UDP is neat :) fun memories. Thank you for the read
- this misses a few key things but hits on many others
webrtc is a bad protocol, without a doubt. I do like websockets as an easy alternative, but you do need to reinvent decent portions of webrtc as a result
I like the idea of MoQ but it's not widely used. probably worth experimenting with, especially as video enters the chat
> and then a GPU pretends to talk to you via text-to-speech
OpenAI is speech-to-speech, there is no TTS in voice mode
> It takes a minimum of 8* round trips (RTT) to establish a WebRTC connection
signalling can be done long ahead of time, though I don't see this mentioned in the OpenAI blog. I also saw some new webrtc extensions that should reduce setup time further
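one version of that pre-signalling trick, as a sketch: negotiate an audio slot at page load, then attach the mic with replaceTrack (no renegotiation needed) once the user actually speaks:

```typescript
// Set up and signal the connection ahead of time with an empty audio slot.
const pc = new RTCPeerConnection();
const sender = pc.addTransceiver("audio").sender;
// ... exchange offer/answer with the server here, before the user speaks ...

// Later, on wake word / button press: attach the mic without renegotiating.
async function startTalking(): Promise<void> {
  const mic = await navigator.mediaDevices.getUserMedia({ audio: true });
  await sender.replaceTrack(mic.getAudioTracks()[0]);
}
```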
ultimately though, it comes down to
> It’s not like LLMs are particularly responsive anyway
I expect to see a shift in how S2S models work to be lower latency like the new voice API models that OpenAI announced
to be fair, the new models were released the day after this MoQ blog was published
- "WebRTC is the problem" is bait; his real claim is "WebRTC has annoying transport-layer characteristics that hurt cloud Voice AI scaling"...
Having just had to tackle this again for my own startup, I'm reminded of what you would lose by ditching WebRTC: the audio DSP pipeline, transmit-side VAD, echo cancellation, noise suppression, NAT traversal maturity, codec integration, browser ubiquity, etc.
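The DSP part of that list is literally one constraint object in the browser, which makes it easy to forget it exists until you ditch it; a module-level sketch:

```typescript
// The browser's built-in capture DSP pipeline, requested via constraints.
const stream = await navigator.mediaDevices.getUserMedia({
  audio: {
    echoCancellation: true, // AEC
    noiseSuppression: true,
    autoGainControl: true,
  },
});
```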
- interesting read, albeit over my head. i spent half of yesterday comparing Gemini Live (websockets) vs gpt-realtime-2, and while gpt is super good and seemingly more robust, Gemini connects faster.
- Probably because WebTransport is the lesser known alternative to WebRTC.
- Just use UDP
- Just give me mpegts in a <video> element, I'm dying.
- I've long had the feeling that WebRTC was intentionally over-engineered. Over-engineered and poorly documented.
IMO, tech standards should be simple and minimal and people should be able to implement whatever they want on top. I tend to stay away from complex web standards.
- This is interesting. Does niche knowledge in this area command a $1M salary?
- How is OpenAI Voice mode any different from a WhatsApp call? Ignoring the part where there's a GPU on the other side instead of a human, what is the technical challenge in the voice-call portion? It seems like that has been a solved problem for a long time now.
- Yet another victim of IPv4, and you still find countless detractors of IPv6 on every thread where it's mentioned.