Hacker News

Image Diffusion Models Exhibit Emergent Temporal Propagation in Videos

118 points by 50kIters

by ttul

3 subcomments

This is a cool result. Deep learning image models are trained on enormous amounts of data and the information recorded in their weights continues to astonish me. Over in the Stable Diffusion space, hobbyists (as opposed to professional researchers) are continuing to find new ways to squeeze intelligence out of models that were trained in 2022 and are considerably out of date compared with the latest “flow matching” models like Qwen Image and Flux.
Makes you wonder what intelligence is lurking in a 10T parameter model like Gemini 3 that we may not discover for some years yet…

by onesandofgrain

2 subcomments

Can someone smarter than me explain what this is about?

by N_Lens

0 subcomment

According to the paper the image models can 'recognize' and track objects in videos. There are a lot of emergent properties in both diffusion models and LLMs that don't align with simplistic descriptions such as 'next token predictor'. It's not surprising to me that 'diffusing' mass amounts of image data leads to semantic developments and the emergence of recognition.

by tpoacher

0 subcomment

If the authors are reading: I notice you used a "Soft IoU" for validation.
A large part of my 2017 PhD thesis [0] is dedicated to exploring the formulation and utility of soft validation operators, including this soft IoU, and the extent to which they are "better" / "more reliable" than thresholding (whether this occurs in isolation, or even when marginalised out, as with the AUC). Long story short, soft operators are at least an order of magnitude more reliable than their thresholding counterparts [1], despite the fact that thresholding still seems to be the industry/academia standard. This is the case for any set-operation-based operator, such as the Dice coefficient (a.k.a. F1-score), not just for the IoU. Recently, influential groups have proposed the Matthews correlation coefficient as a "better operator", but still treat it in binary / thresholding terms, which means it's still less reliable by an order of magnitude. I suspect this insight goes beyond images (e.g. the F1-score is often used in ML problems more generally, in situations where probabilistic outputs are thresholded to compare against binary ground truth labels), but I haven't tested that hypothesis explicitly beyond the image domain (yet).
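To make the soft-versus-thresholded contrast concrete, here is a minimal Python sketch. It is not taken from the paper or the thesis: the function names, the toy masks, and the 0.5 threshold are all assumptions chosen for illustration. It compares a min/max soft IoU against a thresholded IoU on two predictions that differ only slightly around the threshold.

```python
def soft_iou(pred, truth):
    """Soft IoU with min as fuzzy intersection and max as fuzzy union."""
    inter = sum(min(p, t) for p, t in zip(pred, truth))
    union = sum(max(p, t) for p, t in zip(pred, truth))
    return inter / union if union else 1.0


def thresholded_iou(pred, truth, thr=0.5):
    """Binarise both masks at `thr` (an assumed value), then take the IoU."""
    pred_bin = [1.0 if p >= thr else 0.0 for p in pred]
    truth_bin = [1.0 if t >= thr else 0.0 for t in truth]
    return soft_iou(pred_bin, truth_bin)


# Toy 6-pixel masks (hypothetical data). The two predictions differ by
# only 0.02 on two pixels, but those pixels sit on either side of 0.5.
truth = [1.0, 1.0, 1.0, 1.0, 0.0, 0.0]
pred_a = [0.9, 0.9, 0.51, 0.51, 0.1, 0.1]
pred_b = [0.9, 0.9, 0.49, 0.49, 0.1, 0.1]

for name, pred in (("pred_a", pred_a), ("pred_b", pred_b)):
    print(name,
          "soft:", round(soft_iou(pred, truth), 4),
          "thresholded:", thresholded_iou(pred, truth))
```

On this toy data the thresholded IoU jumps from 1.0 to 0.5, while the soft IoU moves by roughly 0.01. This only illustrates sensitivity near the threshold; it does not reproduce the reliability analysis in [0] or [1].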
In this work you effectively used the "Gödel" (i.e. min/max) fuzzy operator to define fuzzy intersection and union, for the purposes of using it in an IoU operator. There are other fuzzy norms with interesting properties that you can also explore. Other classical ones include product and Łukasiewicz. I show in [0] and [1] that these three have "best case scenario sub-pixel overlap", "average case" and "worst-case scenario" underlying semantics, respectively. (In other words, min/max should not be a random choice of T-norm, but a conscious choice which should match your problem, and what the operator is intended to validate specifically). In my own work, I then proceeded to show that if you take gradient direction at the boundary into account, you can come up with a fuzzy intersection/union pair which has directional semantics, and is even more reliable an operator when used to define a soft IoU.
Having said that, in your case you're comparing against a binary ground truth. This collapses all the different T-norms to the same value. I wonder if this is the reason you chose a binary ground truth. If yes, you might want to consider my work, and use the original 'soft' ground truths instead, for higher reliability, as well as the ability to define intersection semantics.
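The collapse follows from the boundary conditions every T-norm shares: T(a, 1) = a and T(a, 0) = 0, with the dual conorm giving S(a, 0) = a and S(a, 1) = 1. A small Python sketch of this, using the standard textbook definitions of the three norm pairs (the function names and the toy masks are assumptions for illustration, not code from the paper or from [0]/[1]):

```python
# Each function returns (fuzzy intersection, fuzzy union) for one pixel pair.
def godel(a, b):
    return min(a, b), max(a, b)


def product(a, b):
    return a * b, a + b - a * b


def lukasiewicz(a, b):
    return max(0.0, a + b - 1.0), min(1.0, a + b)


def soft_iou(pred, truth, norm_pair):
    """Soft IoU where `norm_pair` supplies the T-norm and its dual conorm."""
    inter = union = 0.0
    for p, t in zip(pred, truth):
        i, u = norm_pair(p, t)
        inter += i
        union += u
    return inter / union if union else 1.0


# Hypothetical 4-pixel example.
pred = [0.9, 0.6, 0.3, 0.1]
soft_truth = [0.8, 0.7, 0.4, 0.0]
binary_truth = [1.0, 1.0, 0.0, 0.0]

for label, truth in (("soft truth", soft_truth), ("binary truth", binary_truth)):
    scores = {f.__name__: round(soft_iou(pred, truth, f), 4)
              for f in (godel, product, lukasiewicz)}
    print(label, scores)
```

Against the binary ground truth all three pairs give the same score (up to floating-point rounding), whereas against the soft ground truth they diverge, with Gödel highest and Łukasiewicz lowest on this toy data.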
I hope the above is of interest / use to you :) (and, if you were to decide to cite my work, it wouldn't be the eeeeeend of the world, I gueeeeesss xD )
[0] https://ora.ox.ac.uk/objects/uuid:dc352697-c804-4257-8aec-08...
[1] https://repository.essex.ac.uk/24856/1/Papastylianou.etal201...