by IdiotSavage
3 subcomments
- > Transform this image into a photographed claymation diorama of assorted artisan chocolates and candies […] viewed from a low-angle
Side note: whenever I read prompts for image generation, I notice very specific details which the model obviously ignored. Here the chocolates / candies in the last two images look anything but artisanal. They look very "sterile" and mass-produced. The viewing angle is also not accurate.
Why do we even bother writing such elaborate prompts, when the model ignores most of it anyway?
by danpalmer
2 subcomments
- I'm glad that we're making progress towards a deeper understanding of what LLMs are inherently good at and what they're inherently bad at (not to say incapable of doing, but stuff that is less likely to work due to fundamental limitations).
There's similarity here with, for example, defining the architecture of software, but letting an LLM write the functions. Or asking an LLM to write you the SQL query for your data analysis, rather than asking it to do your data analysis for you.
What I'd really like to see is a better-defined taxonomy of work, and studies on which bits work well with LLMs and which don't. I understand some of this intuitively, but am still building my intuition, and I see people tripping up on this all the time.
by samcollins
1 subcomment
- I found a simple technique to get reliable text and numbers in AI generated images.
I’m surprised the image models aren’t already doing this, so I wanted to share since I’m finding it so useful.
by smusamashah
3 subcomments
- This is just img2img where first image with correct structure was generated by code.
by petercooper
0 subcomments
- This seems analogous to how a human would do it accurately. If you asked an artist to paint stones in a large circular arrangement with the numbers in order in one shot, with no fixes or sketching allowed, it wouldn't be surprising to end up with problems in the arrangement.
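The numbered-stones example above can be made concrete: deterministic code lays out the exact geometry, so the image model only has to restyle it. A minimal stdlib-Python sketch that emits an SVG underdrawing of twelve numbered stones on a circle (all dimensions and names here are illustrative, not from the article):

```python
import math

def stone_circle_svg(n=12, size=512, radius=180, stone_r=30):
    """Emit an SVG underdrawing: n numbered 'stones' evenly spaced on a circle.
    The code guarantees order and placement; an img2img model would then only
    restyle it (texture, lighting), not re-derive the layout."""
    cx = cy = size // 2
    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" width="{size}" height="{size}">']
    for i in range(n):
        angle = 2 * math.pi * i / n - math.pi / 2  # start at 12 o'clock, go clockwise
        x = cx + radius * math.cos(angle)
        y = cy + radius * math.sin(angle)
        parts.append(f'<circle cx="{x:.1f}" cy="{y:.1f}" r="{stone_r}" fill="#aaa"/>')
        parts.append(f'<text x="{x:.1f}" y="{y:.1f}" text-anchor="middle" '
                     f'dominant-baseline="middle" font-size="24">{i + 1}</text>')
    parts.append('</svg>')
    return "\n".join(parts)
```

The resulting SVG (rasterized) would be the structural input to the img2img step.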
- I hope this kind of stuff puts the idea to rest that we're close to actual AGI. Outsourcing this kind of basic stuff which a real intelligence would be able to do "internally" is a hack which works for this specific case but would prevent further generalizations of the task at hand.
But I'm foreseeing the opposite. This kind of tool use will soon be integrated and hidden, such that people will eventually say "see, we solved the problem that AI can't do 123+456, now we are really, really close to AGI." Yeah, no: with an AGI, it would have been the AGI itself that came up with needing a tool, built the tool, and then used the tool. But that's not what LLMs are. They are statistical machines that predict tokens. They are very good at it, but that's not AGI.
by sparuchuri
1 subcomment
- This hack definitely falls in the “duh, why didn’t I think of that” category of tricks, but I'm glad to now have it next time imagegen comes up short.
- We've been doing this for a long time now; it's similar to using a depth map or a line drawing to control the silhouette.
- The standard objection: if the LLM is supposedly intelligent, why can’t it figure out on its own that this two-step process would achieve a better result?
- I wonder whether this could be used to fine-tune image models to provide better outputs. Something like this:
1. Algorithmically generate an underdrawing (e.g. place numbers and shapes randomly in the underdrawing).
2. Algorithmically generate a description of the underdrawing (e.g. for each shape, output text like "there is a square with the number three in the top left corner"). You might fuzz this by having an LLM rewrite the descriptions in a variety of ways.
3. Generate a "ground truth" image using the underdrawing and an image+text-to-image model.
4. Use the generated description and the generated "ground truth" image as training data for a text-to-image model.
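Steps 1 and 2 of that pipeline are cheap to sketch in stdlib Python. Everything below (the shape set, corner coordinates, phrasing) is made up for illustration; steps 3 and 4 need an actual model, so they appear only as comments:

```python
import random

SHAPES = ["square", "circle", "triangle"]
CORNERS = {"top left": (80, 80), "top right": (432, 80),
           "bottom left": (80, 432), "bottom right": (432, 432)}

def shape_svg(shape, x, y):
    # Illustrative primitives only; a real underdrawing generator could do more.
    if shape == "square":
        return f'<rect x="{x-40}" y="{y-40}" width="80" height="80" fill="none" stroke="black"/>'
    if shape == "circle":
        return f'<circle cx="{x}" cy="{y}" r="40" fill="none" stroke="black"/>'
    return f'<polygon points="{x},{y-40} {x-40},{y+40} {x+40},{y+40}" fill="none" stroke="black"/>'

def make_training_pair(seed):
    """Steps 1-2: a random underdrawing (as SVG) plus a matching description.
    Step 3 (not shown) would render the SVG into a 'ground truth' image with an
    image+text-to-image model; step 4 trains on the (description, image) pairs."""
    rng = random.Random(seed)
    shape = rng.choice(SHAPES)
    number = rng.randint(0, 9)
    corner, (x, y) = rng.choice(sorted(CORNERS.items()))
    svg = (f'<svg xmlns="http://www.w3.org/2000/svg" width="512" height="512">'
           f'{shape_svg(shape, x, y)}'
           f'<text x="{x}" y="{y}" text-anchor="middle" font-size="32">{number}</text>'
           f'</svg>')
    description = f"there is a {shape} with the number {number} in the {corner} corner"
    return description, svg
```

Because both the drawing and the description come from the same random draw, the labels are guaranteed consistent, which is exactly what the training data needs.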
- I was thinking about doing the opposite for the common task of "SVG of a pelican riding a bike". Obviously, directly spitting out the SVG is gonna be bad. But image gen can produce a really stunning photorealistic image easily. Probably a good way to get an LLM to produce a decent bike-pelican SVG is to generate an image first and then get the model to trace it into an SVG. After all, few human beings can generate SVG works of art by just typing out numbers into Notepad. At the core of it, we still rely on looking at it and thinking about it as an image.
- LLMs are like a box of chocolates...
- It's normal to first create a plan, then allow agents to write code. But it seems to surprise many that you'd first create a draft/outline of a picture, then go for a final render.
by BobbyTables2
1 subcomment
- How is it that LLMs aren’t good at rendering the sequence of numbers but can reliably put the supplied pieces all in the right order?
by cheekyant
1 subcomment
- Has anyone built a platform which has image to image pipelines and lets you use prompt to SVG generation from SOTA LLMs?
by docheinestages
0 subcomments
- And what happens if the model can't come up with a good enough SVG to begin with?
- Has anyone had good luck with making consistent game art and assets?
- Love the concluding note: it works, but not really.
Such is the LLM/GenAI craze: an entire article to show that it's nearly there, yet it's not, despite convoluted effort to make it just so on a very, very niche example.
- Transformers are great translators. So, yeah, starting with structured output like SVG is probably the best way to start.
It should be fairly trivial to fix any logic errors in the structured output, too.
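That fixability is the point: structured output can be checked mechanically before any pixels exist. A small sketch (using only the stdlib XML parser; the helper name and the expected-label check are illustrative) of what "fixing logic errors in the SVG" could start from:

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"

def check_svg_labels(svg_text, expected):
    """Parse the model's SVG and verify its <text> labels match the expected
    sequence. Because the output is structured, errors are detectable (and
    correctable) before rendering -- unlike in a finished raster image."""
    root = ET.fromstring(svg_text)  # raises ParseError if the SVG is malformed
    labels = [t.text for t in root.iter(f"{SVG_NS}text")]
    return labels == [str(e) for e in expected]

# A deliberately wrong model output: "3" where "2" belongs.
svg = ('<svg xmlns="http://www.w3.org/2000/svg">'
       '<text x="0" y="0">1</text><text x="50" y="0">3</text></svg>')
```

Here `check_svg_labels(svg, [1, 2])` returns `False`, flagging the bad label so it can be patched (or regenerated) in text form.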
by tracerbulletx
0 subcomments
- I've been doing charts for slides like this for a while. I noticed HTML viz was super reliable, but I could style it with a diffusion model. It's very useful for data viz.
by SomaticPirate
2 subcomments
- inb4 this technique is subsumed into the next MoE model release
LLMs are evolving so fast I wouldn’t be surprised if this technique was not needed in <6 months
- I wondered why I was losing all passion for creating.
These tips and tricks are part of the answer.
by globular-toast
0 subcomments
- Wait, where did it get the "Sweet Path//Trail of treats" thing from in the SVG? It wasn't about sweets at that point. Something missing here, I think.
by jeffrallen
0 subcomments
- I wish the opposite were true: that when I tell Gemini I want "a diagram of X", it immediately breaks out Python and matplotlib, instead of wasting my time with Nano Banana.
- Inpainting/guiding from a sketch is how I've always used diffusion models. I thought everyone did that, or at least everyone who wasn't just trying to get some arbitrary filler material without much care of what the output looked like.
- I feel sorry for the recipient.
by psychoslave
0 subcomments
- A few months ago I tried to make Le Chat (Mistral) output French poetry in alexandrines (12 syllables). Disastrous at first. Then, adding to the specifications that each line also had to be transposed into IPA and each syllable counted, it went better.
Still emotionally unrelatable, but it definitely produced something matching the specifications that were explicit and systematically enforced through deterministic means. For now I maintain that LLM limitations are such that they can't seize the ineffable, and they are so untrustworthy that they can only be employed under very clear and inescapable constraints, or they will go awry just as surely as water is wet.
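The "deterministic means" mentioned here can start very simple: count vowel groups as a rough syllable proxy. This is only a crude first-pass filter (real alexandrine scansion also handles mute 'e', diphthongs, and hiatus), and the function names are made up:

```python
import re

# Very rough French syllable estimate: count runs of consecutive vowels.
# Not a true scanner -- mute 'e', diaeresis and synaeresis are ignored --
# but good enough to reject lines that are nowhere near 12 syllables.
VOWELS = re.compile(r"[aeiouyàâéèêëîïôöùûü]+", re.IGNORECASE)

def rough_syllables(line):
    return len(VOWELS.findall(line))

def looks_like_alexandrine(line, tolerance=1):
    return abs(rough_syllables(line) - 12) <= tolerance
```

A checker like this could sit in a loop around the LLM, rejecting candidate lines until one passes, which is essentially what the IPA-plus-counting prompt was approximating by hand.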
- tldr: do a standard img2img workflow where you lay out a skeleton or low-res version, and then turn it into the final high-quality photorealistic version, instead of trying to zero-shot it purely from a text prompt.
by brentcrude
0 subcomments
- [dead]