- I've done some preliminary testing with Z-Image Turbo in the past week.
Thoughts:
- It's fast (~3 seconds on my RTX 4090)
- Surprisingly capable of maintaining image integrity even at high resolutions (1536x1024, sometimes 2048x2048)
- The prompt adherence is impressive for a 6B-parameter model
Some tests (2 / 4 passed):
https://imgpb.com/exMoQ
Personally, I find it works best as a refiner model downstream of Qwen-Image 20B, which has significantly better prompt understanding but an unnatural "smoothness" to its generated images.
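Roughly the two-stage idea, as a minimal diffusers sketch; the model IDs and whether Z-Image has an img2img pipeline in diffusers are assumptions on my part, not the exact workflow I use:

```python
# Hypothetical two-stage workflow: Qwen-Image for composition, Z-Image Turbo as a refiner.
# Model IDs and Z-Image img2img support in diffusers are assumptions, not confirmed.
import torch
from diffusers import AutoPipelineForText2Image, AutoPipelineForImage2Image

prompt = "a weathered lighthouse at dusk, volumetric fog, 35mm photo"

# Stage 1: Qwen-Image handles prompt understanding and overall composition.
base = AutoPipelineForText2Image.from_pretrained(
    "Qwen/Qwen-Image", torch_dtype=torch.bfloat16
).to("cuda")
draft = base(prompt=prompt, num_inference_steps=30).images[0]

# Stage 2: Z-Image Turbo lightly re-noises and re-denoises the draft.
# Low strength keeps Qwen's composition while replacing the "smooth" texture.
refiner = AutoPipelineForImage2Image.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo", torch_dtype=torch.bfloat16
).to("cuda")
final = refiner(
    prompt=prompt, image=draft, strength=0.3, num_inference_steps=8
).images[0]
final.save("refined.png")
```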
by danielbln
5 subcomments
- We've come a long way with these image models, and the things you can do with a paltry 6B parameters are super impressive. The community has adopted this model wholesale and left Flux(2) by the wayside. It helps that Z-Image isn't censored, whereas BFL (makers of Flux 2) dedicated something like a fifth of their press release to talking about how "safe" (read: censored and lobotomized) their model is.
- Z-Image seems to be the first successor to Stable Diffusion 1.5 that delivers better quality, capability, and extensibility across the board in an open model that can feasibly run locally. Excitement is high and an ecosystem is forming fast.
- The [demo PDF](https://github.com/Tongyi-MAI/Z-Image/blob/main/assets/Z-Ima...) has ~50 photos of attractive young women sitting/standing alone, and exactly two photos featuring young attractive men on their own.
It's incredibly clear who the devs assume the target market is.
- I have been testing this on my Framework Desktop. ComfyUI generally causes an amdgpu kernel fault after about 40 steps (across multiple prompts), so I spent a few hours building a workaround here: https://github.com/comfyanonymous/ComfyUI/pull/11143
Overall it's fun and impressive, with decent results using LoRA. You can achieve good-looking results with as few as 8 inference steps, which takes 15-20 seconds on a Strix Halo. I also created a llama.cpp inference custom node for prompt enhancement, which has been helping with overall output quality; the idea is sketched below.
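The prompt-enhancement part is roughly this, a minimal sketch against llama-server's OpenAI-compatible endpoint; the port, system prompt, and sampling parameters are placeholders, not the exact custom node:

```python
# Ask a local llama.cpp server (llama-server's OpenAI-compatible endpoint) to expand
# a short prompt into a detailed image prompt before it goes to the image model.
import requests

def enhance_prompt(short_prompt: str,
                   url: str = "http://127.0.0.1:8080/v1/chat/completions") -> str:
    payload = {
        "messages": [
            {"role": "system",
             "content": "Rewrite the user's idea as a single detailed image prompt: "
                        "subject, setting, lighting, lens, style."},
            {"role": "user", "content": short_prompt},
        ],
        "temperature": 0.7,
        "max_tokens": 200,
    }
    resp = requests.post(url, json=payload, timeout=60)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"].strip()

print(enhance_prompt("a fox in a snowy birch forest"))
```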
- It's amazing how much knowledge about the world fits into 16 GiB of the distilled model.
- We talked about this model in some depth on the last Pretrained episode:
https://youtu.be/5weFerGhO84?si=Eh_92_9PPKyiTU_h&t=1743
Some interesting takeaways imo:
- Uses existing model backbones for text encoding & semantic tokens (why reinvent the wheel if you don't need to?)
- Trains on a whole lot of synthetic captions of different lengths, ostensibly generated using some existing vision LLM
- Solid text generation support is facilitated by training on all OCR'd text from the ground truth image. This seems to match how Nano Banana Pro got so good as well; I've seen its thinking tokens sketch out exactly what text to say in the image before it renders.
- We have vLLM for running text LLMs in production. What is the equivalent for this model?
- As an AI outsider with a recent 24GB MacBook, can I follow the quick start[1] steps from the repo and expect decent results? How much time would it take to generate a single medium-quality image?
[1]: https://github.com/Tongyi-MAI/Z-Image?tab=readme-ov-file#-qu...
by ArcaneMoose
0 subcomments
- This model is awesome. I am building an infinite CYOA game and this was a drop-in replacement for my scene image generation. Faster, cheaper, and higher quality than what I was using before!
- Just want to learn: who actually needs or buys generated images?
by GuestFAUniverse
1 subcomment
- All the examples I tried were garbage. Looked decent -- no horrors -- but didn't do the job.
Anything involving "most cultures" came out as manga-influenced comic strips with kanji.
Useless.
- Very good. Not always perfect with text or with following the prompt exactly, but it's 6B, so... impressive.
by thot_experiment
0 subcomments
- I've messed with this a bit and the distill is incredibly overbaked. Curious to see the capabilities of the full model but I suspect even the base model is quite collapsed.
- I'm wondering: is it faster or slower when spread across two GPUs (RTX 3090)?
by reactordev
2 subcomments
- My issue with this model is that it keeps producing Chinese people and Chinese text. I have to very specifically go out of my way to say what race they are.
If I say "A man", it's fine. A black man, no problem. It's when I add context and instructions that it just seems to default to a Chinese man. Which is fine, but I would like to see more variety in the people it's trained on to create more diverse images. For non-people it's amazingly good.
- What kind of rig is required to run this?
by phantomathkg
1 subcomment
- Unfortunately, another China-censored model.
Simply ask it to generate "Tank Man" or "Lady Liberty Hong Kong" and the model returns a blackboard with text saying "Maybe Not Safe".
by idontwantthis
2 subcomments
- Does it run on Apple Silicon?
by pawelduda
2 subcomments
- Did anyone test it on a 5090? I saw some 30xx reports, and it seemed very fast.
- I'm particularly impressed by the fact that they seem to aim for photorealism rather than the semi-realistic AI-look that is common in many text-to-image models.
by ForOldHack
0 subcomments
- It would be more useful to have some standard for what one can expect in terms of hardware requirements and performance.
by BoredPositron
0 subcomments
- I wish they had used the WAN VAE.
- Dude, please give money to artists instead of using genAI