Hacker News

Show HN: LemonSlice – Upgrade your voice agents to real-time video

130 points by lcolucci

by anigbrowl

2 subcomments

Absolutely Do Not Want.
> EDIT: Tell us what characters you want to see in the comments and we can make them for you to talk to (e.g. Max Headroom)
Sure, that kind of thing is great fun. But photorealistic avatars are gonna be abused to hell and back and everyone knows it. I would rather talk to a robot that looks like a robot, i.e. C-3PO. I would even chat with scary skeleton terminator. I do not want to talk with convincingly-human-appearing terminator. Constantly checking whether any given human appearing on a screen is real or not is a huge energy drain on my primate brain. I already find it tedious with textual data; doing it on realtime video imagery consumes considerably more energy.
Very impressive tech, well done on your engineering achievement and all, but this is a Bad Thing.

by armcat

2 subcomments

This is so awesome, well done LemonSlice team! Super interesting on the ASR->LLM->TTS pipeline, and I agree, you can make it super fast (I did something myself as a 2-hour hobby project: https://github.com/acatovic/ova). I've been following full-duplex models as well and so far couldn't get even PersonaPlex to run properly (without choppiness/latency), but have you peeps tried Sesame, e.g. https://app.sesame.com/?
I played around with your avatars and one thing that it lacks is that it's "not patient", it's rushing the user, so maybe something to try and finetune there? Great work overall!

by pickleballcourt

1 subcomments

One thing I've learnt from movie production is that what actually separates professional from amateur quality is the audio itself. Have you thought about implementing PersonaPlex from NVIDIA or other voice models that can both talk and listen at the same time?
Currently the conversation still feels too STT-LLM-TTS, which I think a lot of voice agents suffer from (it seems only Sesame and NVIDIA have nailed the natural conversation flow so far). Still, crazy good work training your own diffusion models. I remember taking a look at the latest literature on diffusion and was mind blown by the advances in the last few years or so since the U-Net architecture days.
EDIT: I see that the primary focus is on video generation not audio.

by convivialdingo

2 subcomments

That's super impressive! Definitely one of the best-quality conversational agents I've tried in terms of A/V sync and response times.
The text processing is running Qwen / Alibaba?

by snowmaker

2 subcomments

I made a golden retriever you can talk to using Lemon Slice: https://lemonslice.com/hn/agent_5af522f5042ff0a8
Having a real-time video conversation with an AI is a trippy feeling. Talk about a "feel the AGI" moment; it really does feel like the computer has come alive.

by r0fl

1 subcomments

Pricing is confusing
> Video Agents: Unlimited agents, up to 3 concurrent calls
> Creative Studio: 1min long videos, up to 3 concurrent generations
Does that mean I can have a total of 1 minute of video calls? Or video calls can only be 1 minute long? Or does it mean I can have unlimited calls, 3 calls at a time all month long?
Can I have different avatars or only the same avatar x 3?
Can I record the avatar and make videos and post on social media?

by skandan

1 subcomments

Wow this team is non-stop!!! Wild that this small crew is dropping hit after hit. Is there an open polymarket on who acquires them?

by zestyping

1 subcomments

When you generate real-time video of realistic-looking talking characters, the definition of success is fooling people into believing they are talking to a real person when they aren't.
If you pursue this, your explicit goal is deception, and it's a massively harmful kind of deception. I don't see how you can claim to be operating ethically here if that's your goal.

by Escapado

0 subcomment

This was interesting. Had a 5 minute chat with the Outsider from the Dishonored series. Just a one-sentence prompt and its phrasing was at least 60% there, but less cold and nicer in a sense than the video game counterpart. Still an interesting experiment. But I also know that maybe 12-24 months down the line, once this is available in real time on device, there will be an ungodly amount of smut coming from this.

by jonsoft

2 subcomments

I asked the Spanish tutor if he/it was familiar with the terms seseo[0] and ceceo[1] and he said it wasn't, which surprised me. Ideally it would be possible to choose which Spanish dialect to practise as mainland Spain pronunciation is very different to Latin America. In general it didn't convince me it was really hearing how I was pronouncing words, an important part of learning a language. I would say the tutor is useful for intermediate and advanced speakers but not beginners due to this and the speed at which he speaks.
At one point subtitles written in pseudo Chinese characters were shown; I can send a screenshot if this is useful.
The latency was slightly distracting, and as others have commented the NVIDIA Personaplex demos [2] are very impressive in this regard.
In general, a very positive experience, thank you.
[0] https://en.wikipedia.org/wiki/Phonological_history_of_Spanis... [1] https://en.wikipedia.org/wiki/Phonological_history_of_Spanis... [2] https://research.nvidia.com/labs/adlr/personaplex/

by bn-l

1 subcomments

I wish I could invest in this company. Really. This is the most exciting revenue opportunity I’ve seen during this recent AI hype cycle.

by wumms

2 subcomments

You could add a Max Headroom to the hn link. You might reach real time by interspersing freeze frames, duplicates, or static.

by dang

1 subcomments

https://lemonslice.com/hn/agent_4d10f62632fd841b
(Update of https://news.ycombinator.com/item?id=43785494)

by zvonimirs

2 subcomments

We're launching a new AI assistant and I wanted to make it feel alive, so I started to play around with LemonSlice and I loved it!! I wanted our assistant to be like a coworker, with the ability to create Loom-style videos. Here's what I created - https://drive.google.com/file/d/1nIpEvNkuXA0jeZVjHC8OjuJlT-3...
Anyway, big thumbs up for the LemonSlice team, I'm excited to see it progress. I can definitely see products start coming alive with tools like this.

by r0fl

2 subcomments

Wow I can’t get enough of this site! This is literally all I’ve been playing with for like half an hour. Even moved a meeting!
My mind is blown! It feels like the first time I used my microphone to chat with ai

by leetrout

2 subcomments

Quick feedback if you're still monitoring the thread:
I did /imagine cheeseburger and /imagine a fire extinguisher and both were correctly generated, but the agent has no context. When I ask what they are holding, in both cases they ramble about not holding anything and reference lemons and lemon trees.
I expected it to retain the context as the chat continues. If I ask it what it imagined it just tells me I can use /imagine.

by peddling-brink

2 subcomments

I got really excited when I saw that you were releasing your model.
> Today we're releasing our new model* - Lemon Slice 2, a 20B-parameter diffusion transformer that generates infinite-length video at 20fps on a single GPU - and opening up our API.
But after digging around for a while, searching for a Hugging Face link, I’m guessing this was just an unfortunate turn of phrase, and you are not, in fact, releasing an open-weights model that people can run themselves?
Oh well, this looks very cool regardless and congratulations on the release.

by dreamdeadline

2 subcomments

Cool! Do you plan to expose controls over the avatar’s movement, facial expressions, or emotional reactions so users can fine-tune interactions?

by sid-the-kid

0 subcomment

hey HN! one of the founders here. as of today, we are seeing informational avatars + roleplaying for training as the most common use cases. The roleplaying use-case was surprising to us. Think a nurse training to triage with AI patients. Or, SDRs practicing lead qualification with different kinds of clients.

by pbhjpbhj

1 subcomments

Sounds like an innovative approach, any IP protection on your tech?
Have your early versions made any sort of profit?
Absolutely amazing stuff to me. A teenager I very briefly showed it to was nonplussed - 'it's a talking head, isn't that really easy to do' ...

by beast200

1 subcomments

That's really impressive!

by bennyp101

2 subcomments

Heads up, your privacy policy[0] does not work in dark mode - I was going to comment saying it made no sense, then I highlighted the page and more text appeared :)
[0] https://lemonslice.com/privacy

by FatalLogic

1 subcomments

Your demo video defaults to play at 1.5x speed
You probably didn't intend to do that

by mdrzn

1 subcomments

"we're releasing our new model": is it downloadable and runnable locally? Could I create a "vTuber" persona with this model?

by slake

0 subcomment

That's amazing. Feels like a major step ahead. No lag, very snappy. Outstanding work.
Feels like those sci-fi shows where you can talk to Hari Seldon even though he lived like a 100 years ago.
My prediction, this will become really, really big.

by korneelf1

1 subcomments

Wow this is really cool, haven't seen real-time video generation that is this impressive yet!

by r0fl

2 subcomments

Where’s the hn playground to grab a free month?
I have so many websites that would do well with this!

by buddycorp

3 subcomments

I'm curious if I can plug in my own OpenAI realtime voice agents into this.

by koakuma-chan

1 subcomments

> You're probably thinking, how is this useful
I was thinking why the quality is so poor.

by r0fl

1 subcomments

Wow this is the most impressive thing I’ve seen on hacker news in years!!!!!
Take my money!!!!!!

by ripped_britches

1 subcomments

Very freaking impressive!

by dsrtslnd23

1 subcomments

where can I find the 20B model? it sounded like it would be open - but I am not sure with the phrasing...

by swyx

1 subcomments

this is like Tavus but it doesn't suck. congrats!

by wahnfrieden

2 subcomments

Please add InWorld TTS integration

by jedwhite

2 subcomments

That's an interesting insight about "stacking tricks" together. I'm curious where you found that approach hit limits. And what gives you an advantage if anything against others copying it. Getting real-time streaming with a 20B parameter diffusion model and 20fps on a single GPU seems objectively impressive. It's hard to resist just saying "wow" looking at the demo, but I know that's not helpful here. It is clearly a substantial technical achievement and I'm sure lots of other folks here would be interested in the limits with the approach and how generalizable it is.

by benswerd

2 subcomments

The last year vs this year is crazy

by shj2105

1 subcomments

Not working on mobile iOS

by ed_mercer

1 subcomments

This looks super awesome!

by marieschneegans

1 subcomments

This is next-level!

by givinguflac

2 subcomments

While the tech is impressive, from the perspective of an end user interacting with this, I want nothing to do with it, and I can’t support it. Neat as a one-off but destructive imho.
It’s bad enough some companies are doing AI-only interviews. I could see this used to train employees, interview people, replace people at call centers… it’s the next step towards an absolute nightmare. Automated phone trees are bad enough.
There will likely be little human interaction in those and many other situations, and hallucinations will definitely disqualify some people from some jobs.
I’m not anti AI, I’m anti destructive innovation in AI leading to personal health and societal issues, just like modern social media has. I’m not saying this tool is that, I’m saying it’s a foundation for that.
People can choose to not work on things that lead to eventual negative outcomes, and that’s a personal choice for everyone. Of course hindsight is 20/20 but some things can certainly be foreseen.
Apologies for the seemingly negative rant, but this positivity echo chamber in this thread is crazy and I wanted to provide an alternative feedback view.

by ProjectBarks

2 subcomments

Removing - Realized I made a mistake