First impressions are that this model is extremely good - the "zero-shot" text prompted detection is a huge step ahead of what we've seen before (both compared to older zero-shot detection models and to recent general purpose VLMs like Gemini and Qwen). With human supervision I think it's even at the point of being a useful teacher model.
I put together a YOLO tune for climbing hold detection a while back (trained on 10k labels) and this is 90% as good out of the box - just misses some foot chips and low contrast wood holds, and can't handle as many instances. It would've saved me a huge amount of manual annotation though.
As someone who works on a platform that users have used to label 1B images, I'm bullish that SAM 3 can automate at least 90% of the work. Data prep flips to models being human-assisted instead of humans being model-assisted (see "autolabel": https://blog.roboflow.com/sam3/). I'm optimistic the majority of users can now start by deploying a model and then curating data, instead of the inverse.
The 3D mesh generator is really cool too: https://ai.meta.com/sam3d/ It's not perfect, but it seems to handle occlusion very well (e.g. a person in a chair can be separated into a person mesh and a chair mesh) and it's very fast.
It's very impressive. Do they let you export a 3D mesh, though? I was only able to export a video. Do you have to buy tokens or something to export?
The models it creates are Gaussian splats, so if you're looking for traditional meshes you'd need a tool that can create meshes from splats.
Are you sure about that? They say "full 3D shape geometry, texture, and layout" which doesn't preclude it being a splat but maybe they just use splats for visualization?
The model is open weights, so you can run it yourself.
I couldn't download it. The model appears to be comparable to Sparc3D, Hunyuan, etc., but without a download, who can say? It is much faster, though.
You can download it at https://github.com/facebookresearch/sam3. For 3D, see https://github.com/facebookresearch/sam-3d-objects
I actually found the easiest way was to run it for free and see if it works for my use case of person de-identification: https://chat.vlm.run/chat/63953adb-a89a-4c85-ae8f-2d501d30a4...
Like the models before it, it struggles with my use case of tracing circuit board features. It's great with a pony on the beach but really isn't made for more rote industrial-type applications. With proper fine-tuning it would probably work much better, but I haven't tried that yet. There are good examples online, though.
Wow that sounds like a really interesting use-case for this. Can you link to some of those examples?
For background removal (at least my niche use case of background removal of kids' drawings — https://breaka.club/blog/why-were-building-clubs-for-kids), I think BiRefNet v2 is still working slightly better.
SAM3 seems to trace the images less precisely — it'll discard the bits where kids draw outside the lines, which is okay, but it also seems to struggle around sharp corners and includes a bit of the white page that I'd like cut out.
Of course, SAM3 is significantly more powerful in that it does much more than simply cut out images. It seems to be able to identify what these kids' drawings represent. That's very impressive; AI models are typically trained on photos and adult illustrations, and they usually struggle with children's drawings. So I could perhaps still use this for identifying content, giving kids more freedom to draw what they like, and then, unprompted, attach appropriate behavior to their drawings in-game.
With an average latency of 4 seconds, this still couldn't be used for real-time video, correct?
[Update: I should have mentioned I got the 4-second figure from the roboflow.com links in this thread]
Didn't see where you got those numbers, but surely that's just a problem of throwing more compute at it? From the blog post:
> This excellent performance comes with fast inference — SAM 3 runs in 30 milliseconds for a single image with more than 100 detected objects on an H200 GPU.
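If that 30 ms figure holds, the model itself runs at roughly 33 frames per second on an H200, so the 4-second number presumably reflects hosted-API overhead (network, queueing, preprocessing) rather than inference time; self-hosting on suitable hardware should get much closer to real time.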
This is an incredible model. But once again, we find an announcement for a new AI model with highly misleading graphs. That SA-Co Gold graph is particularly bad. Looks like I have another bad graph example for my introductory stats course...
We (Roboflow) have had early access to this model for the past few weeks. It's really, really good. This feels like a seminal moment for computer vision. I think there's a real possibility this launch goes down in history as "the GPT Moment" for vision. The two areas I think this model is going to be transformative in the immediate term are for rapid prototyping and distillation.
Two years ago we released autodistill[1], an open source framework that uses large foundation models to create training data for training small realtime models. I'm convinced the idea was right, but too early; there wasn't a big model good enough to be worth distilling from back then. SAM3 is finally that model (and will be available in Autodistill today).
We are also taking a big bet on SAM3 and have built it into Roboflow as an integral part of the entire build and deploy pipeline[2], including a brand new product called Rapid[3], which reimagines the computer vision pipeline in a SAM3 world. It feels really magical to go from an unlabeled video to a fine-tuned realtime segmentation model with minimal human intervention in just a few minutes (and we rushed the release of our new SOTA realtime segmentation model[4] last week because it's the perfect lightweight complement to the large & powerful SAM3).
We also have a playground[5] up where you can play with the model and compare it to other VLMs.
[1] https://github.com/autodistill/autodistill
[2] https://blog.roboflow.com/sam3/
[3] https://rapid.roboflow.com
[4] https://github.com/roboflow/rf-detr
[5] https://playground.roboflow.com
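For anyone who hasn't seen Autodistill, the flow is roughly: point a big promptable model at a folder of unlabeled images, let it auto-label them from a caption ontology, then train a small realtime model on the result. A minimal sketch, using the existing Grounded SAM base model from the Autodistill docs; the prompts, class names, and folders are illustrative, and the new SAM3 module should presumably slot into the same place:

    from autodistill.detection import CaptionOntology
    from autodistill_grounded_sam import GroundedSAM  # swap in the SAM3 base model once it lands
    from autodistill_yolov8 import YOLOv8

    # map text prompts (keys) to the class names you want in the dataset (values)
    base_model = GroundedSAM(ontology=CaptionOntology({"climbing hold": "hold"}))

    # auto-label a folder of unlabeled images with the big foundation model
    base_model.label("./images", extension=".jpg")

    # distill: train a small realtime model on the auto-labeled data
    target_model = YOLOv8("yolov8n.pt")
    target_model.train("./images_labeled/data.yaml", epochs=200)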
SAM3 is probably a great model to distill from when training smaller segmentation models, but isn't their DINOv2 a better example of a large foundation model to distill from for various computer vision tasks? I've seen it used as a starting point for models doing segmentation and depth estimation. Maybe there's a v3 coming soon?
https://dinov2.metademolab.com/
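For what it's worth, the usual "starting point" pattern with DINOv2 is a frozen backbone plus a small trainable head rather than distillation proper. A rough sketch — the torch.hub entrypoint is from the DINOv2 README, while the 1x1-conv head and the sizes are just illustrative:

    import torch
    import torch.nn as nn

    # frozen DINOv2 ViT-S/14 backbone from torch.hub (per the DINOv2 README)
    backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()

    # tiny trainable head, e.g. 2-class segmentation logits over patch features
    head = nn.Conv2d(384, 2, kernel_size=1)

    img = torch.randn(1, 3, 518, 518)  # H and W must be multiples of the 14 px patch size
    with torch.no_grad():
        # spatial feature map from the last block, shape (1, 384, 37, 37)
        feats = backbone.get_intermediate_layers(img, n=1, reshape=True)[0]

    logits = head(feats)  # coarse per-patch predictions; upsample to full resolution for masks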
DINOv3 was released earlier this year: https://ai.meta.com/dinov3/
I'm not sure if the work they did with DINOv3 went into SAM3. I don't see any mention of it in the paper, though I just skimmed it.
Thanks for the links! Can we run RF-DETR in the browser for background removal? This wasn't clear to me from the docs.
I was trying to figure out from their examples, but how are you breaking up the different "things" that you can detect in the image? Are you just running it with each prompt individually?
The model supports batch inference, so all prompts are sent to the model, and we parse the results.
This thing rocks. I can imagine so many uses for it. I really like the 3D pose estimation especially.
SAM3 is cool - you can already do this more interactively on chat.vlm.run [1], and do much more. It's built on our new Orion [2] model; we've been able to integrate with SAM and several other computer-vision models in a truly composable manner. Video segmentation and tracking is also coming soon!
[1] https://chat.vlm.run
[2] https://vlm.run/orion
Wow this is actually pretty cool, I was able to segment out the people and dog in the same chat. https://chat.vlm.run/chat/cba92d77-36cf-4f7e-b5ea-b703e612ea...
Even works with long range shots. https://chat.vlm.run/chat/e8bd5a29-a789-40aa-ae31-a510dc6478...
Nice, that's pretty neat.
I can't wait until it's easy and accessible to rotoscope / greenscreen / mask this stuff out for videos. I tried Runway ML but it was... lacking, and the web UI for fixing parts of it had similar issues.
I'm curious how this works for hair and transparent/translucent things. Probably not the best, but it doesn't seem to be mentioned anywhere? Presumably it's just a hard line or vector mask rather than an alpha matte, etc.?
I tried it on transparent glass mugs, and it does pretty well. At least better than other available models: https://i.imgur.com/OBfx9JY.png
Curious if you find interesting results - https://playground.roboflow.com
I'm pretty sure DaVinci Resolve does this already, and you can even track it; I don't know if it's available in the free version.
A brief history:
SAM 1 - Visual prompt to create pixel-perfect masks in an image. No video. No class names. No open vocabulary.
SAM 2 - Visual prompting for tracking on images and video. No open vocab.
SAM 3 - Open vocab concept segmentation on images and video.
Roboflow has been long on zero / few shot concept segmentation. We've opened up a research preview exploring a SAM 3 native direction for creating your own model: https://rapid.roboflow.com/
Curious if anyone has done anything meaningful with SAM2 and streaming. SAM3 has built-in streaming support which is very exciting.
I've seen versions where people use an in-memory FS to write frames of the stream for SAM2. Maybe that is good enough?
The native support for streaming in SAM3 is awesome, especially since it should also remove some of the memory accumulation for long sequences.
I used SAM2 for tracking tumors in real-time MRI images. With the default SAM2, loading images from disk, we could only process videos of 10^2 - 10^3 frames before running out of memory.
By developing/adapting a custom version (1) based on a modified implementation with real (almost) stateless streaming (2), we were able to increase that to 10^5 frames. While this was enough for our purposes, I spent way too much time debugging/investigating tiny differences between SAM2 versions. So it's great that the canonical version now supports streaming as well.
(Side note: I also know of people using SAM2 for real-time ultrasound imaging.)
1 https://github.com/LMUK-RADONC-PHYS-RES/mrgrt-target-localiz...
2 https://github.com/Gy920/segment-anything-2-real-time
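For anyone curious what that looks like in practice, the pattern is basically: seed the tracker once, then feed frames one at a time and keep only a small rolling state instead of a full per-frame memory bank. A sketch of the loop — the predictor class and its methods here are hypothetical placeholders, not the actual SAM 2/3 API; see the two repos above for working code:

    import cv2

    cap = cv2.VideoCapture("sequence.mp4")
    ok, first_frame = cap.read()

    # hypothetical streaming wrapper; real implementations live in the repos linked above
    predictor = StreamingSAMPredictor()
    predictor.init(first_frame, point=(320, 240))  # seed the target with a click on frame 0

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        mask = predictor.track(frame)  # only this frame's mask; no growing history is kept
        # ... use the mask, then drop it, so memory stays flat over 10^5+ frames ...

    cap.release()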
This model is incredibly impressive. Text is definitely the right modality, and now the ability to intertwine it with an LLM creates insane unlocks - my mind is already storming with ideas of projects that are now not only possible, but trivial.
Dang, it seems like the 3D side would work great for game asset generation.
Can it detect the speed of a vehicle in any video, unsupervised?
Does the license allow for commercial purposes?
Yes. It's a custom license with an Acceptable Use Policy that prohibits military use, plus export restrictions. The custom license permits commercial use.
I just checked, and it seems commercial use is permissible. Companies like vlm.run and Roboflow are using it commercially, as shown by their comments below. So I guess it can be used for commercial purposes.
Yes. But also note that redistribution of SAM 3 requires using the same SAM 3 license downstream. So libraries that attempt to, e.g., relicense the model as AGPL are non-compliant.
Yes, the license allows you to grift for your “AI startup”
This would be good for a video editor.
Probably still can't get past a Google Captcha when on a VPN. Do I click the square with the shoe of the person who's riding the motorcycle?
There are services that will bypass those for you via a browser extension.
Can anyone confirm whether this fits on a 3090? The files look to be about 3.5 GB, but I can't work out what the overall memory needs will be.
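Rough back-of-the-envelope: a ~3.5 GB checkpoint at fp32 is on the order of 870M parameters, so the weights alone would be around 1.75 GB in fp16; activations for a ~1024 px image typically add a few more GB, so a 24 GB 3090 should have plenty of headroom. That's an estimate from the file size, though, not a measurement.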
Obligatory xkcd: https://xkcd.com/1425/