I also created a small editing suite for myself where I can draw bounding boxes on images when they aren’t perfect and have them fixed, either just with a prompt, or by feeding them to Claude as an image and having it write the prompt to fix the issue for me (as a workflow on the API). It’s been quite a lot of fun to figure out what works. I am incredibly impressed by where this is all going.
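(For the curious, the "have Claude write the fix prompt" half of that loop might look roughly like the sketch below. This is only an illustration, not the commenter's actual code; the model name and helper function are assumptions.)

    # Sketch: send the cropped bounding-box region to Claude and get back an
    # editing prompt to feed to the image model. Names here are placeholders.
    import base64
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    def write_fix_prompt(crop_png_path: str, issue: str) -> str:
        """Ask Claude to describe the edit needed for the boxed region."""
        with open(crop_png_path, "rb") as f:
            crop_b64 = base64.standard_b64encode(f.read()).decode()

        message = client.messages.create(
            model="claude-sonnet-4-5",  # assumption: any vision-capable model works
            max_tokens=300,
            messages=[{
                "role": "user",
                "content": [
                    {"type": "image",
                     "source": {"type": "base64", "media_type": "image/png", "data": crop_b64}},
                    {"type": "text",
                     "text": f"This crop of a larger image has a problem: {issue}. "
                             "Write a concise image-editing prompt that fixes only this region."},
                ],
            }],
        )
        return message.content[0].text  # prompt to hand to the image editor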
Once you do have good storyboards, you can easily do start-to-end GenAI video generation (hopping from scene to scene), bring them to life, and build your own small visual animated universes.
I added a CLI to it (using Gemini CLI) and submitted a PR; you can run it like so:
    GEMINI_API_KEY="..." \
    uv run --with https://github.com/minimaxir/gemimg/archive/d6b9d5bbefa1e2ffc3b09086bc0a3ad70ca4ef22.zip \
    python -m gemimg "a racoon holding a hand written sign that says I love trash"
Result in this comment: https://github.com/minimaxir/gemimg/pull/7#issuecomment-3529...

> Nano Banana supports a context window of 32,768 tokens: orders of magnitude above T5’s 512 tokens and CLIP’s 77 tokens.
In my pipeline for generating highly complicated images (particularly comics [1]), I take advantage of this by sticking a Mistral 7B LLM in between that takes a given prompt as input and creates 4 variations of it before sending them all out.
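A minimal sketch of that kind of prompt-expansion step, assuming a local Mistral 7B served behind an OpenAI-compatible endpoint (e.g. Ollama or vLLM); the model name and URL are placeholders, not the actual pipeline:

    # Ask a local LLM for several richer variations of a base image prompt.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

    def expand_prompt(prompt: str, n: int = 4) -> list[str]:
        response = client.chat.completions.create(
            model="mistral:7b-instruct",  # placeholder model tag
            messages=[
                {"role": "system",
                 "content": f"Rewrite the user's image prompt into {n} richer, more "
                            "detailed variations. Return one variation per line."},
                {"role": "user", "content": prompt},
            ],
            temperature=0.9,  # some diversity between variations
        )
        lines = response.choices[0].message.content.splitlines()
        return [line.strip() for line in lines if line.strip()][:n]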
> Surprisingly, Nano Banana is terrible at style transfer even with prompt engineering shenanigans, which is not the case with any other modern image editing model.
This is true - though I find it works better by providing a minimum of two images. The first image is intended to be transformed, and the second image is used as "stylistic aesthetic reference". This doesn't always work since you're still bound by the original training data, but it is sometimes more effective than attempting to type out a long flavor text description of the style.
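A minimal sketch of that two-image setup, assuming the google-genai Python SDK (the model name and prompt wording are assumptions, not the commenter's exact setup):

    # First image: the one to transform; second image: the style reference.
    from google import genai
    from PIL import Image

    client = genai.Client()  # reads GEMINI_API_KEY from the environment

    content_img = Image.open("photo_to_transform.png")
    style_img = Image.open("style_reference.png")

    response = client.models.generate_content(
        model="gemini-2.5-flash-image",  # assumed Nano Banana model id
        contents=[
            content_img,
            style_img,
            "Redraw the first image in the style of the second image. "
            "The second image is only a stylistic aesthetic reference; "
            "keep the composition and subjects of the first image.",
        ],
    )

    # Save the first image part returned by the model.
    for part in response.candidates[0].content.parts:
        if part.inline_data is not None:
            with open("styled.png", "wb") as f:
                f.write(part.inline_data.data)
            break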
This looks like it's caused by 99% of the relative directions in image descriptions being given from the viewer's point of view, and by 99% of the ones that aren't referring to a human rather than to a skull-shaped pancake.
>Nano Banana is terrible at style transfer even with prompt engineering shenanigans
My context: I'm kind of fixated on visualizing my neighborhood as it would have appeared in the 18th century. I've been doing it in Sketchup, and then in Twinmotion, but neither of those produce "photorealistic" images... Twinmotion can get pretty close with a lot of work, but that's easier with modern architecture than it is with the more hand-made, brick-by-brick structures I'm modeling out.
As different AI image generators have emerged, I've tried them all in an effort to add the proverbial rough edges to snapshots of the models I've created, and it was not until Nano Banana that I ever saw anything even remotely workable.
Nano Banana manages to maintain the geometry of the scene, while applying new styles to it. Sometimes I do this with my Twinmotion renders, but what's really been cool to see is how well it takes a drawing, or engraving, or watercolor - and with as simple a prompt as "make this into a photo" it generates phenomenal results.
Similarly to the Paladin/Starbucks/Pirate example in the link though, I find that sometimes I need to misdirect a little bit, because if I'm peppering the prompt with details about the 18th century, I sometimes get a painterly image back. Instead, I'll tell it I want it to look like a photograph of a well preserved historic neighborhood, or a scene from a period film set in the 18th century.
As fantastic as the results can be, I'm not abandoning my manual modeling of these buildings and scenes. However, Nano Banana's interpretation of contemporary illustrations has helped me reshape how I think about some of the assumptions I made in my own models.
I first extract all the entities from the text, generate characters from an art style, and then start stitching them together into individual illustrations. It works much better with NB than anything else I tried before.
A 1024x1024 image seems to cost about 3 cents to generate.
Going further, one thing you can do is give Gemini 2.5 a system prompt like the following:
https://goto.isaac.sh/image-prompt
And then pass Gemini 2.5's output directly to Nano-Banana. Doing this yields very high-quality images. This is also good for style transfer and image combination. For example, if you then give Gemini 2.5 a user prompt that looks something like this:
I would like to perform style transfer. I will provide the image generation model a photograph alongside your generated prompt. Please write a prompt to transfer the following style: {{ brief style description here }}.
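Wiring the two stages together might look roughly like this (a sketch, assuming the google-genai Python SDK; the model names and the system-prompt file are placeholders, not the exact setup behind the link above):

    # Stage 1: a text model rewrites the request into a detailed image prompt.
    # Stage 2: that prompt goes straight to the image model.
    from google import genai
    from google.genai import types

    client = genai.Client()
    SYSTEM_PROMPT = open("image_prompt_system.txt").read()  # placeholder file

    def styled_image(user_request: str, out_path: str = "out.png") -> None:
        rewritten = client.models.generate_content(
            model="gemini-2.5-pro",  # assumed text model
            contents=user_request,
            config=types.GenerateContentConfig(system_instruction=SYSTEM_PROMPT),
        ).text

        response = client.models.generate_content(
            model="gemini-2.5-flash-image",  # assumed Nano Banana model id
            contents=rewritten,
        )
        for part in response.candidates[0].content.parts:
            if part.inline_data is not None:
                with open(out_path, "wb") as f:
                    f.write(part.inline_data.data)
                break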
You can get aesthetic, consistently styled images, like these:

Unfortunately I have to use ChatGPT for this; for some reason local models don't do well with such tasks. I don't know if it's just the extra prompting sauce that ChatGPT adds, or if diffusion models just aren't well suited to this kind of task.
- make massive, seemingly random edits to images
- adjust image scale
- make very fine-grained but pervasive detail changes obvious in an image diff
For instance, I have found that nano-banana will sporadically add a (convincing) fireplace to a room or a new garage behind a house. This happens even with explicit "ALL CAPS" instructions not to do so, and even when the temperature is set to zero, which makes it impossible to build a reliable app.
Has anyone had a better experience?
Yet when I give it simple tasks, like producing a 16:9 image instead of a square one, it ends up putting a 16:9 picture on a white background that pads it back out to a square.
When I ask it to generate an image with text, and then in a second request ask it to redo the image changing just one visual element, it ends up breaking the text I had previously asked for.
It's getting better at flattering people and telling them how clever and right they are than at actually doing the task.
Things like: Convert the people to clay figures similar to what one would see in a claymation.
And it would think it did it, but I could not perceive any change.
After several attempts, I added "Make the person 10 years younger". Suddenly it made a clay figure of the person.
Very cool post, thanks for sharing!
I had no idea that the context window was so large. I’d been instinctively keeping my prompts small because of experience with other models. I’m going to try much more detailed prompts now!
No, that simply is not true. If you actually compare the before and after, you can see it still regenerates all the details in the "unchanged" aspects. Texture, lighting, sharpness, even scale: it's all different, even if varyingly similar to the original.
Sure, they're cute for casual edits, but it really pains me when people suggest these things are suitable replacements for actual photo editing. Especially when it comes to people, or details outside the training data, there's a lot of nuance that can be lost as the model regenerates them, no matter how you prompt it.
Even if you
I didn't expect that. I would have definitely counted that as a "probably real" tally mark if grading an image.
I was trying to create a simple "mascot logo" for my pet project. I first created an account on Kittl [0] and even paid for one month, but it was quite cumbersome to generate images until I figured out I could just use the Nano Banana API myself.
Took me 4 prompts to ai-slop a small Python script I could run with uv that would generate a specified number of images with a given prompt (where I discovered some of the insights the author shows in their post). The resulting logo [1] was pretty much what I imagined. I manually added some text and played around with hue/saturation in Kittl (since I already paid for it :)) et voilà.
Feeding back the logo to iterate over it worked pretty nicely and it even spit out an "abstract version" [2] of the logo for favicons and stuff without a lot of effort.
All in all this took me 2 hours and around $2 (excluding the one-month Kittl subscription), and I would've never been able to draw something like that in Illustrator or similar.
[0] https://www.kittl.com/ [1] https://github.com/sidneywidmer/yass/blob/master/client/publ... [2] https://github.com/sidneywidmer/yass/blob/master/client/publ...
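A rough sketch of what such a batch-generation script could look like (not the commenter's actual code; the model name, filenames, and CLI shape are all assumptions):

    # Runnable with something like: uv run --with google-genai gen.py "prompt" -n 8
    import argparse
    from google import genai

    def main() -> None:
        parser = argparse.ArgumentParser(description="Generate N images for a prompt.")
        parser.add_argument("prompt")
        parser.add_argument("-n", "--count", type=int, default=4)
        args = parser.parse_args()

        client = genai.Client()  # reads GEMINI_API_KEY from the environment
        for i in range(args.count):
            response = client.models.generate_content(
                model="gemini-2.5-flash-image",  # assumed Nano Banana model id
                contents=args.prompt,
            )
            for part in response.candidates[0].content.parts:
                if part.inline_data is not None:
                    with open(f"logo_{i}.png", "wb") as f:
                        f.write(part.inline_data.data)
                    break

    if __name__ == "__main__":
        main()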
I figured out that if you write the text in Google Docs and share a screenshot with Nano Banana, it won't make any spelling mistakes.
So something like "can you write my name on this Wimbledon trophy, both images are attached. Use them" will work.
www.brandimagegen.com
if you want a premium account to try out, you can find my email in my bio!!
"YOU WILL BE PENALIZED FOR USING THEM"
That is disconcerting.
It’s pretty good, but one conspicuous thing is that most of the blueberries are pointing upwards.
(Do we say we software engineered something?)
I've noticed a lot of this misinformation floating around lately, and I can't help but wonder if it's intentional?
okay, look at imagen 4 ultra:
https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...
In this link, Imagen is instructed to render the verbatim prompt “the result of 4+5”, which shows that text; when it is not so instructed, it renders “4+5=9”.
Is Imagen thinking?
Let's compare to gemini 2.5 flash image (nano banana):
look carefully at the system prompt here: https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...
Gemini is instructed to reply in images first, and, if it thinks, to think using the image thinking tags. It seemingly cannot be prompted to show the verbatim text "the result of 4+5" without showing the answer "4+5=9". Of course it can show whatever exact text you want; the question is, does it rewrite the prompt (no) or do something else (yes)?
compare to ideogram, with prompt rewriting: https://ideogram.ai/g/GRuZRTY7TmilGUHnks-Mjg/0
without prompt rewriting: https://ideogram.ai/g/yKV3EwULRKOu6LDCsSvZUg/2
We can do the same exercises with Flux Kontext for editing versus Flash-2.5, if you think that editing is somehow unique in this regard.
Is prompt rewriting "thinking"? My point is, this article can't answer that question without dElViNg into the nuances of what multi-modal models really are.