- I’m thankful that Meta still contributes to open source and shares models like this. I know there are several reasons to not like the company, but actions like this are much appreciated and benefit everyone.
by daemonologist
2 subcomments
- First impressions are that this model is extremely good - the "zero-shot" text prompted detection is a huge step ahead of what we've seen before (both compared to older zero-shot detection models and to recent general purpose VLMs like Gemini and Qwen). With human supervision I think it's even at the point of being a useful teacher model.
I put together a YOLO tune for climbing hold detection a while back (trained on 10k labels) and this is 90% as good out of the box - just misses some foot chips and low contrast wood holds, and can't handle as many instances. It would've saved me a huge amount of manual annotation though.
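For anyone wiring a big teacher model into a YOLO training loop like this, the annotation-export step is mostly bookkeeping. A minimal sketch, assuming teacher detections arrive as pixel-space boxes (the tuple format here is made up for illustration, not any particular model's output):

```python
# Sketch: export teacher-model detections (e.g. from a text-prompted run)
# to YOLO-format label lines. The `detections` structure is hypothetical --
# adapt it to whatever your teacher model actually returns.

def to_yolo_lines(detections, img_w, img_h):
    """detections: list of (class_id, x_min, y_min, x_max, y_max) in pixels.
    Returns YOLO label lines: 'class cx cy w h', normalized to [0, 1]."""
    lines = []
    for cls, x0, y0, x1, y1 in detections:
        cx = (x0 + x1) / 2 / img_w   # box center, normalized
        cy = (y0 + y1) / 2 / img_h
        w = (x1 - x0) / img_w        # box size, normalized
        h = (y1 - y0) / img_h
        lines.append(f"{cls} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}")
    return lines

print(to_yolo_lines([(0, 100, 200, 300, 400)], 640, 640))
```

Each output line goes into the per-image `.txt` label file YOLO trainers expect; the model-specific part is only how you map text prompts to class ids.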
- The 3D mesh generator is really cool too: https://ai.meta.com/sam3d/ It's not perfect, but it seems to handle occlusion very well (e.g. a person in a chair can be separated into a person mesh and a chair mesh) and it's very fast.
- Like the models before it, it struggles with my use case of tracing circuit board features. It's great with a pony on the beach but really isn't made for more rote, industrial-type applications. With proper fine-tuning it would probably work much better, but I haven't tried that yet. There are good examples online though.
by Benjamin_Dobell
2 subcomments
- For background removal (at least my niche use case of background removal of kids' drawings — https://breaka.club/blog/why-were-building-clubs-for-kids) I think BiRefNet v2 still works slightly better.
SAM3 seems to trace the images less precisely — it'll discard bits where kids draw outside the lines, which is okay, but it also seems to struggle around sharp corners and includes some of the white page that I'd like cut out.
Of course, SAM3 is significantly more powerful in that it does much more than simply cut out images. It seems to be able to identify what these kids' drawings represent. That's very impressive; AI models are typically trained on photos and adult illustrations, so they struggle with children's drawings. I could perhaps still use it to identify content, giving kids more freedom to draw what they like, and then, unprompted, attach appropriate behaviors to their drawings in-game.
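The cutout step itself is model-agnostic post-processing. A minimal sketch, assuming the model (BiRefNet, SAM3, or otherwise) hands back a per-pixel foreground probability map; none of this is a specific library's API:

```python
import numpy as np

# Sketch: turn a segmentation model's foreground map into an RGBA cutout.
# `image` is HxWx3 uint8, `mask` is HxW with values in [0, 1].

def cutout(image, mask, threshold=0.5):
    # Hard-threshold the probability map into a binary alpha channel.
    alpha = (mask >= threshold).astype(np.uint8) * 255
    # Stack RGB + alpha into an HxWx4 RGBA image.
    return np.dstack([image, alpha])

img = np.full((2, 2, 3), 200, dtype=np.uint8)
m = np.array([[0.9, 0.1], [0.8, 0.2]])
out = cutout(img, m)
print(out.shape, out[0, 0, 3], out[0, 1, 3])
```

The sharp-corner and white-page complaints above live entirely in the quality of `mask`; the compositing is the easy part.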
by fzysingularity
1 subcomments
- SAM3 is cool - you can already do this more interactively on chat.vlm.run [1], and do much more.
It's built on our new Orion [2] model; we've been able to integrate with SAM and several other computer-vision models in a truly composable manner. Video segmentation and tracking are also coming soon!
[1] https://chat.vlm.run
[2] https://vlm.run/orion
- With an avg latency of 4 seconds, this still couldn't be used for real-time video, correct?
[Update: I should have mentioned I got the 4 seconds from the roboflow.com links in this thread]
- We (Roboflow) have had early access to this model for the past few weeks. It's really, really good. This feels like a seminal moment for computer vision. I think there's a real possibility this launch goes down in history as "the GPT Moment" for vision.
The two areas I think this model is going to be transformative in the immediate term are for rapid prototyping and distillation.
Two years ago we released autodistill[1], an open source framework that uses large foundation models to create training data for training small realtime models. I'm convinced the idea was right, but too early; there wasn't a big model good enough to be worth distilling from back then. SAM3 is finally that model (and will be available in Autodistill today).
We are also taking a big bet on SAM3 and have built it into Roboflow as an integral part of the entire build and deploy pipeline[2], including a brand new product called Rapid[3], which reimagines the computer vision pipeline in a SAM3 world. It feels really magical to go from an unlabeled video to a fine-tuned realtime segmentation model with minimal human intervention in just a few minutes (and we rushed the release of our new SOTA realtime segmentation model[4] last week because it's the perfect lightweight complement to the large & powerful SAM3).
We also have a playground[5] up where you can play with the model and compare it to other VLMs.
[1] https://github.com/autodistill/autodistill
[2] https://blog.roboflow.com/sam3/
[3] https://rapid.roboflow.com
[4] https://github.com/roboflow/rf-detr
[5] https://playground.roboflow.com
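One generic way to sanity-check a distilled realtime student against a large teacher is plain IoU matching over their predicted boxes. This is a minimal sketch of that idea, not Roboflow's or Autodistill's actual evaluation code:

```python
# Sketch: measure how often student detections recover teacher detections.
# Boxes are (x0, y0, x1, y1) tuples in pixel coordinates.

def iou(a, b):
    # Intersection-over-union of two axis-aligned boxes.
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def recall_at_iou(teacher, student, thresh=0.5):
    """Fraction of teacher boxes matched by some student box at IoU >= thresh."""
    hits = sum(any(iou(t, s) >= thresh for s in student) for t in teacher)
    return hits / len(teacher) if teacher else 1.0

print(recall_at_iou([(0, 0, 10, 10), (20, 20, 30, 30)], [(1, 1, 10, 10)]))
```

Low recall flags the images worth sending back for human review, which is where the "minimal human intervention" loop earns its keep.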
by hodgehog11
1 subcomments
- This is an incredible model. But once again, we find an announcement for a new AI model with highly misleading graphs. That SA-Co Gold graph is particularly bad. Looks like I have another bad graph example for my introductory stats course...
by SubiculumCode
3 subcomments
- For my use case, segmentation is all about 3D segmentation of volumes in medical imaging. SAM 2 was tried, mostly using a 2D slice approach, but I don't think it was competitive with the current gold standard, nnU-Net[1].
[1] https://github.com/MIC-DKFZ/nnUNet
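The "2D slice approach" mentioned above amounts to running a 2D model on each axial slice and restacking the masks. A toy sketch, with a dummy thresholding function standing in for the real 2D segmenter:

```python
import numpy as np

# Sketch of the 2D-slice approach to 3D volumes: segment each axial slice
# independently, then restack. `segment_slice` is a placeholder stand-in
# for a real 2D model (SAM 2 or similar), not an actual API.

def segment_slice(sl):
    # Dummy "model": threshold the slice. A real segmenter goes here.
    return (sl > 0.5).astype(np.uint8)

def segment_volume(volume):
    # volume: (D, H, W) float array; returns a (D, H, W) binary mask,
    # built from one independent 2D pass per slice.
    return np.stack([segment_slice(volume[z]) for z in range(volume.shape[0])])

vol = np.zeros((3, 4, 4))
vol[1, 1:3, 1:3] = 1.0   # a small bright blob on the middle slice only
mask = segment_volume(vol)
print(mask.shape, int(mask.sum()))
```

Note the restacked mask has no cross-slice consistency at all, which is a big part of why this approach tends to lose to natively 3D architectures like nnU-Net.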
by ____tom____
2 subcomments
- Ok, I tried "convert body to 3D", which it seems to do well, but it just gives me the image; I see no way to export or use it. I can rotate it, but that's it.
Is there some functionality I'm missing? I've tried Safari and Firefox.
by featureofone
1 subcomments
- The SAM models are great. I used the latest version when building VideoVanish ( https://github.com/calledit/VideoVanish ) a video-editing GUI for removing or making objects vanish from videos.
That used SAM 2, and in my experience SAM 2 was more or less perfect—I didn’t really see the need for a SAM 3. Maybe it could have been better at segmenting without input.
But the new text prompt input seems nice; much easier to automate stuff using text input.
by xfeeefeee
2 subcomments
- I can't wait until it is easy to rotoscope / greenscreen / mask this stuff out accessibly for videos. I had tried Runway ML but it was... lacking, and the webui for fixing parts of it had similar issues.
I'm curious how this works for hair and transparent/translucent things. Probably not the best, but it doesn't seem to be mentioned anywhere. Presumably the output is a hard polygon/vector boundary rather than an alpha matte?
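On the alpha question: segmentation models generally emit hard binary masks, so soft edges for hair or translucency have to be recovered separately (proper alpha matting in practice). A crude sketch that just feathers the mask boundary with a box blur, as generic post-processing rather than anything these models do themselves:

```python
import numpy as np

# Sketch: soften a hard (binary) segmentation mask into an alpha channel
# by averaging over a (2r+1)x(2r+1) neighborhood. A crude stand-in for
# real alpha matting -- good enough to avoid jagged composite edges.

def feather(mask, radius=1):
    # mask: HxW float array in {0, 1}; returns soft alpha in [0, 1].
    k = 2 * radius + 1
    padded = np.pad(mask.astype(float), radius, mode="edge")
    out = np.zeros_like(mask, dtype=float)
    for dy in range(k):            # accumulate the k*k shifted copies
        for dx in range(k):
            out += padded[dy:dy + mask.shape[0], dx:dx + mask.shape[1]]
    return out / (k * k)

hard = np.array([[0, 0, 1, 1]], dtype=float)
print(feather(hard, 1).round(3).tolist())
```

For hair you'd want a real matting model on top of the segmentation mask; this only smooths the transition band.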
by Ey7NFZ3P0nzAe
0 subcomment
- > *Core contributor (Alphabetical, Equal Contribution), Intern, †Project leads, §Equal Contribution
I like seeing this
- Grateful to Meta for releasing models and giving free GPU access; it has been great for experimenting without the mental overhead of paying too much for inference. Thank you Zuck.
- There has been slow progress in computer vision over the last ~5 years. We are still not close to human performance. This is in contrast to language understanding, which has been solved: LLMs understand text at a human level (even if they have other limitations). But vision isn't solved. Foundation models struggle to segment some objects, they don't generalize to domains such as scientific images, etc. I wonder what's missing from these models. We have enough data in videos. Is it compute? Is the task not informative enough? Do we need agency in 3D?
by visioninmyblood
0 subcomment
- Claude, Gemini, and ChatGPT do image segmentation in surprising ways. We did a small evaluation [1] of different frontier models for image segmentation and understanding, and Claude is by far the most surprising in its results.
[1] https://news.ycombinator.com/item?id=45996392
- These models have been super cool, and it'd be nice if they made it into some editing program. Is there anything consumer-focused that has this tech?
- Seems like there's no API access. Has anyone got the weights? I'm not sure what to fill in for `affiliation`.
by sciencesama
3 subcomments
- Does the license allow for commercial purposes?
by 8f2ab37a-ed6c
0 subcomment
- Couple of questions for people in-the-know:
* Does Adobe have their version of this for use within Photoshop, with all of the new AI features they're releasing? Or are they using this behind the scenes?
* If so, how does this compare?
* What's the best-in-class segmentation model on the market?
- Is it possible to prompt this model with two or more texts for each image and get masks for each?
Something like: `inputs = processor(images=images, text=["cat", "dog"], return_tensors="pt").to(device)`?
- This would be good for a video editor.
by dangoodmanUT
0 subcomment
- This model is incredibly impressive. Text is definitely the right modality, and now the ability to intertwine it with an LLM creates insane unlocks - my mind is already storming with ideas of projects that are now not only possible, but trivial.
by HowardStark
1 subcomments
- Curious if anyone has done anything meaningful with SAM2 and streaming. SAM3 has built-in streaming support which is very exciting.
I’ve seen versions where people use an in-memory FS to write frames of a stream with SAM2. Maybe that is good enough?
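An alternative to the in-memory-FS trick is a small in-process ring buffer of decoded frames, feeding the segmenter directly. A sketch, with the actual per-frame SAM call left as a placeholder:

```python
from collections import deque

# Sketch: keep only the last N decoded frames in memory and let the
# segmenter pull from there. The "frames" here are just ints standing in
# for real image arrays; the per-frame model call is not shown.

class FrameBuffer:
    def __init__(self, maxlen=8):
        # deque with maxlen silently drops the oldest frame on overflow.
        self.frames = deque(maxlen=maxlen)

    def push(self, frame):
        self.frames.append(frame)

    def latest(self, n=1):
        # Most recent n frames, oldest-first.
        return list(self.frames)[-n:]

buf = FrameBuffer(maxlen=3)
for i in range(5):
    buf.push(i)  # pretend each int is a decoded frame
print(buf.latest(2), len(buf.frames))
```

This avoids filesystem round-trips entirely; whether it's enough depends on how much temporal context the tracker needs to re-seed.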
- Dang, that seems like it would work great for 3D game asset generation.
by bangaladore
1 subcomments
- Probably still can't get past a Google Captcha when on a VPN. Do I click the square with the shoe of the person who's riding the motorcycle?
- A brief history:
SAM 1 - visual prompts to create pixel-perfect masks in an image. No video. No class names. No open vocabulary.
SAM 2 - visual prompting for tracking on images and video. No open vocab.
SAM 3 - open-vocabulary concept segmentation on images and video.
Roboflow has been long on zero / few shot concept segmentation. We've opened up a research preview exploring a SAM 3 native direction for creating your own model: https://rapid.roboflow.com/
- Can anyone confirm whether this fits in a 3090? The files look to be about 3.5GB, but I can't work out what the overall memory needs will be.
by retinaros
1 subcomments
- A quick question: is it possible to identify multiple types of objects in a single prompt, or do you need to send multiple queries? E.g., if I have the prompt "donkey, dogs", will SAM3 return boxes with their classes in one shot, or do I need to send two queries?
by iandanforth
2 subcomments
- I wonder if we'll get an updated DeepSeek-OCR that incorporates this. Would be very cool!
by nowittyusername
0 subcomment
- This thing rocks. I can imagine so many uses for it. I really like the 3D pose estimation especially.
- Can it detect the speed of a vehicle in any video, unsupervised?
- Reminder that Nano Banana is also capable of image segmentation: https://x.com/phillip_lippe/status/1991555954908025123
- Obligatory xkcd: https://xkcd.com/1425/