- I've done some preliminary testing with Z-Image Turbo in the past week.
Thoughts:
- It's fast (~3 seconds on my RTX 4090)
- Surprisingly capable of maintaining image integrity even at high resolutions (1536x1024, sometimes 2048x2048)
- The prompt adherence is impressive for a 6B-parameter model
Some tests (2 / 4 passed):
https://imgpb.com/exMoQ
Personally, I find it works best as a refiner model downstream of Qwen-Image 20B, which has significantly better prompt understanding but an unnatural "smoothness" to its generated images.
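Roughly the two-stage idea, as a minimal diffusers sketch; the model IDs and whether Z-Image has an img2img pipeline in diffusers are assumptions on my part, not the exact workflow I use:

```python
# Hypothetical two-stage workflow: Qwen-Image for composition, Z-Image Turbo as a refiner.
# Model IDs and Z-Image img2img support in diffusers are assumptions, not confirmed.
import torch
from diffusers import AutoPipelineForText2Image, AutoPipelineForImage2Image

prompt = "a weathered lighthouse at dusk, volumetric fog, 35mm photo"

# Stage 1: Qwen-Image handles prompt understanding and overall composition.
base = AutoPipelineForText2Image.from_pretrained(
    "Qwen/Qwen-Image", torch_dtype=torch.bfloat16
).to("cuda")
draft = base(prompt=prompt, num_inference_steps=30).images[0]

# Stage 2: Z-Image Turbo lightly re-noises and re-denoises the draft.
# Low strength keeps Qwen's composition while replacing the "smooth" texture.
refiner = AutoPipelineForImage2Image.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo", torch_dtype=torch.bfloat16
).to("cuda")
final = refiner(
    prompt=prompt, image=draft, strength=0.3, num_inference_steps=8
).images[0]
final.save("refined.png")
```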
by danielbln
5 subcomments
- We've come a long way with these image models, and the things you can do with a paltry 6B parameters are super impressive. The community has adopted this model wholesale and left Flux(2) by the wayside. It helps that Z-Image isn't censored, whereas BFL (makers of Flux 2) dedicated something like a fifth of their press release to talking about how "safe" (read: censored and lobotomized) their model is.
- Z-Image seems to be the first successor to Stable Diffusion 1.5 that delivers better quality, capability, and extensibility across the board in an open model that can feasibly run locally. Excitement is high and an ecosystem is forming fast.
- The [demo PDF](https://github.com/Tongyi-MAI/Z-Image/blob/main/assets/Z-Ima...) has ~50 photos of attractive young women sitting/standing alone, and exactly two photos featuring young attractive men on their own.
It's incredibly clear who the devs assume the target market is.
- I have been testing this on my Framework Desktop. ComfyUI generally causes an amdgpu kernel fault after about 40 steps (across multiple prompts), so I spent a few hours building a workaround here: https://github.com/comfyanonymous/ComfyUI/pull/11143
Overall it's fun and impressive, with decent results using LoRA. You can achieve good-looking results with as few as 8 inference steps, which takes 15-20 seconds on a Strix Halo. I also created a llama.cpp inference custom node for prompt enhancement, which has been helping with overall output quality; the idea is sketched below.
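The prompt-enhancement part is roughly this, a minimal sketch against llama-server's OpenAI-compatible endpoint; the port, system prompt, and sampling parameters are placeholders, not the exact custom node:

```python
# Ask a local llama.cpp server (llama-server's OpenAI-compatible endpoint) to expand
# a short prompt into a detailed image prompt before it goes to the image model.
import requests

def enhance_prompt(short_prompt: str,
                   url: str = "http://127.0.0.1:8080/v1/chat/completions") -> str:
    payload = {
        "messages": [
            {"role": "system",
             "content": "Rewrite the user's idea as a single detailed image prompt: "
                        "subject, setting, lighting, lens, style."},
            {"role": "user", "content": short_prompt},
        ],
        "temperature": 0.7,
        "max_tokens": 200,
    }
    resp = requests.post(url, json=payload, timeout=60)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"].strip()

print(enhance_prompt("a fox in a snowy birch forest"))
```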
- It's amazing how much knowledge about the world fits into 16 GiB of the distilled model.
- We talked about this model in some depth on the last Pretrained episode:
https://youtu.be/5weFerGhO84?si=Eh_92_9PPKyiTU_h&t=1743
Some interesting takeaways imo:
- Uses existing model backbones for text encoding & semantic tokens (why reinvent the wheel if you don't need to?)
- Trains on a whole lot of synthetic captions of different lengths, ostensibly generated using some existing vision LLM
- Solid text generation support is facilitated by training on all OCR'd text from the ground truth image. This seems to match how Nano Banana Pro got so good as well; I've seen its thinking tokens sketch out exactly what text to say in the image before it renders.
- We have vLLM for running text LLMs in production. What is the equivalent for this model?
- As an AI outsider with a recent 24GB MacBook, can I follow the quick start[1] steps from the repo and expect decent results? How much time would it take to generate a single medium-quality image?
[1]: https://github.com/Tongyi-MAI/Z-Image?tab=readme-ov-file#-qu...
by ArcaneMoose
0 subcomments
- This model is awesome. I am building an infinite CYOA game and this was a drop-in replacement for my scene image generation. Faster, cheaper, and higher quality than what I was using before!
- Just want to learn: who actually needs or buys generated images?
by GuestFAUniverse
1 subcomment
- All the examples I tried were garbage. Looked decent -- no horrors -- but didn't do the job.
Anything involving "most cultures" came out as manga-influenced comic strips with kanji.
Useless.
- Very good. Not always perfect with text or with following the prompt exactly, but it's 6B, so... impressive.
by thot_experiment
0 subcomments
- I've messed with this a bit and the distill is incredibly overbaked. Curious to see the capabilities of the full model but I suspect even the base model is quite collapsed.
- I'm wondering: is it faster or slower when spread across two GPUs (RTX 3090)?
by reactordev
2 subcomments
- My issue with this model is that it keeps producing Chinese people and Chinese text. I have to very specifically go out of my way to say what race they are.
If I say "A man", it's fine. A black man, no problem. It's when I add context and instructions that it just seems to default to a Chinese man. Which is fine, but I would like to see more variety in the people it's trained on to create more diverse images. For non-people it's amazingly good.
- What kind of rig is required to run this?
by phantomathkg
1 subcomment
- Unfortunately, another China-censored model.
Simply ask it to generate "Tank Man" or "Lady Liberty Hong Kong" and the model returns a blackboard with text saying "Maybe Not Safe".
by idontwantthis
2 subcomments
- Does it run on Apple Silicon?
by pawelduda
2 subcomments
- Did anyone test it on a 5090? I saw some 30xx reports, and it seemed very fast.
- I'm particularly impressed by the fact that they seem to aim for photorealism rather than the semi-realistic AI-look that is common in many text-to-image models.
by ForOldHack
0 subcomments
- It would be more useful to have some standard for what one can expect in terms of hardware requirements and performance.
by BoredPositron
0 subcomments
- I wish they had used the WAN VAE.
- Dude, please give money to artists instead of using genAI