First impressions are that this model is extremely good - the "zero-shot" text prompted detection is a huge step ahead of what we've seen before (both compared to older zero-shot detection models and to recent general purpose VLMs like Gemini and Qwen). With human supervision I think it's even at the point of being a useful teacher model.
I put together a YOLO tune for climbing hold detection a while back (trained on 10k labels) and this is 90% as good out of the box - just misses some foot chips and low contrast wood holds, and can't handle as many instances. It would've saved me a huge amount of manual annotation though.
As someone who works on a platform that users have used to label 1B images, I'm bullish that SAM 3 can automate at least 90% of the work. Data prep flips to models being human-assisted instead of humans being model-assisted (see "autolabel": https://blog.roboflow.com/sam3/). I'm optimistic the majority of users can now start by deploying a model and then curating data, instead of the inverse.
The 3D mesh generator is really cool too: https://ai.meta.com/sam3d/ It's not perfect, but it seems to handle occlusion very well (e.g. a person in a chair can be separated into a person mesh and a chair mesh) and it's very fast.
It's very impressive. Do they let you export a 3D mesh, though? I was only able to export a video. Do you have to buy tokens or something to export?
The models it creates are Gaussian splats, so if you're looking for traditional meshes you'd need a tool that can create meshes from splats.
Are you sure about that? They say "full 3D shape geometry, texture, and layout" which doesn't preclude it being a splat but maybe they just use splats for visualization?
The model is open weights, so you can run it yourself.
I couldn't download it. The model appears to be comparable to Sparc3D, Hunyuan, etc., but without a download, who can say? It is much faster, though.
You can download it at https://github.com/facebookresearch/sam3. For 3D, see https://github.com/facebookresearch/sam-3d-objects
I actually found the easiest way was to run it for free and see if it works for my use case of person de-identification: https://chat.vlm.run/chat/63953adb-a89a-4c85-ae8f-2d501d30a4...
Like the models before it, it struggles with my use case of tracing circuit board features. It's great with a pony on the beach but really isn't made for more rote industrial-type applications. With proper fine-tuning it would probably work much better, but I haven't tried that yet. There are good examples online, though.
Wow that sounds like a really interesting use-case for this. Can you link to some of those examples?
For background removal (at least my niche use case of background removal of kids' drawings — https://breaka.club/blog/why-were-building-clubs-for-kids), I think BiRefNet v2 is still working slightly better.
SAM3 seems to trace the images less precisely — it'll discard the bits where kids draw outside the lines, which is okay, but it also seems to struggle around sharp corners and includes a bit of the white page that I'd like cut out.
Of course, SAM3 is significantly more powerful in that it does much more than simply cut out images. It seems to be able to identify what these kids' drawings represent. That's very impressive; AI models are typically trained on photos and adult illustrations, and they usually struggle with children's drawings. So I could perhaps still use this for identifying content, giving kids more freedom to draw what they like, and then, unprompted, attach appropriate behavior to their drawings in-game.
With an average latency of 4 seconds, this still couldn't be used for real-time video, correct?
[Update: I should have mentioned I got the 4-second figure from the roboflow.com links in this thread]
Didn't see where you got those numbers, but surely that's just a problem of throwing more compute at it? From the blog post:
> This excellent performance comes with fast inference — SAM 3 runs in 30 milliseconds for a single image with more than 100 detected objects on an H200 GPU.
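If that 30 ms figure holds, the model itself runs at roughly 33 frames per second on an H200, so the 4-second number presumably reflects hosted-API overhead (network, queueing, preprocessing) rather than inference time; self-hosting on suitable hardware should get much closer to real time.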
This is an incredible model. But once again, we find an announcement for a new AI model with highly misleading graphs. That SA-Co Gold graph is particularly bad. Looks like I have another bad graph example for my introductory stats course...
We (Roboflow) have had early access to this model for the past few weeks. It's really, really good. This feels like a seminal moment for computer vision. I think there's a real possibility this launch goes down in history as "the GPT Moment" for vision. The two areas I think this model is going to be transformative in the immediate term are for rapid prototyping and distillation.
Two years ago we released autodistill[1], an open source framework that uses large foundation models to create training data for training small realtime models. I'm convinced the idea was right, but too early; there wasn't a big model good enough to be worth distilling from back then. SAM3 is finally that model (and will be available in Autodistill today).
We are also taking a big bet on SAM3 and have built it into Roboflow as an integral part of the entire build and deploy pipeline[2], including a brand new product called Rapid[3], which reimagines the computer vision pipeline in a SAM3 world. It feels really magical to go from an unlabeled video to a fine-tuned realtime segmentation model with minimal human intervention in just a few minutes (and we rushed the release of our new SOTA realtime segmentation model[4] last week because it's the perfect lightweight complement to the large & powerful SAM3).
We also have a playground[5] up where you can play with the model and compare it to other VLMs.
[1] https://github.com/autodistill/autodistill
[2] https://blog.roboflow.com/sam3/
[3] https://rapid.roboflow.com
[4] https://github.com/roboflow/rf-detr
[5] https://playground.roboflow.com
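For anyone who hasn't seen Autodistill, the flow is roughly: point a big promptable model at a folder of unlabeled images, let it auto-label them from a caption ontology, then train a small realtime model on the result. A minimal sketch, using the existing Grounded SAM base model from the Autodistill docs; the prompts, class names, and folders are illustrative, and the new SAM3 module should presumably slot into the same place:

    from autodistill.detection import CaptionOntology
    from autodistill_grounded_sam import GroundedSAM  # swap in the SAM3 base model once it lands
    from autodistill_yolov8 import YOLOv8

    # map text prompts (keys) to the class names you want in the dataset (values)
    base_model = GroundedSAM(ontology=CaptionOntology({"climbing hold": "hold"}))

    # auto-label a folder of unlabeled images with the big foundation model
    base_model.label("./images", extension=".jpg")

    # distill: train a small realtime model on the auto-labeled data
    target_model = YOLOv8("yolov8n.pt")
    target_model.train("./images_labeled/data.yaml", epochs=200)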
SAM3 is probably a great model to distill from when training smaller segmentation models, but isn't their DINOv2 a better example of a large foundation model to distill from for various computer vision tasks? I've seen it used as a starting point for models doing segmentation and depth estimation. Maybe there's a v3 coming soon?
https://dinov2.metademolab.com/
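For what it's worth, the usual "starting point" pattern with DINOv2 is a frozen backbone plus a small trainable head rather than distillation proper. A rough sketch — the torch.hub entrypoint is from the DINOv2 README, while the 1x1-conv head and the sizes are just illustrative:

    import torch
    import torch.nn as nn

    # frozen DINOv2 ViT-S/14 backbone from torch.hub (per the DINOv2 README)
    backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()

    # tiny trainable head, e.g. 2-class segmentation logits over patch features
    head = nn.Conv2d(384, 2, kernel_size=1)

    img = torch.randn(1, 3, 518, 518)  # H and W must be multiples of the 14 px patch size
    with torch.no_grad():
        # spatial feature map from the last block, shape (1, 384, 37, 37)
        feats = backbone.get_intermediate_layers(img, n=1, reshape=True)[0]

    logits = head(feats)  # coarse per-patch predictions; upsample to full resolution for masks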
DINOv3 was released earlier this year: https://ai.meta.com/dinov3/
I'm not sure if the work they did with DINOv3 went into SAM3. I don't see any mention of it in the paper, though I just skimmed it.
Thanks for the links! Can we run RF-DETR in the browser for background removal? This wasn't clear to me from the docs.
I was trying to figure out from their examples, but how are you breaking up the different "things" that you can detect in the image? Are you just running it with each prompt individually?
The model supports batch inference, so all prompts are sent to the model, and we parse the results.
This thing rocks. I can imagine so many uses for it. I really like the 3D pose estimation especially.
SAM3 is cool - you can already do this more interactively on chat.vlm.run [1], and do much more. It's built on our new Orion [2] model; we've been able to integrate with SAM and several other computer-vision models in a truly composable manner. Video segmentation and tracking is also coming soon!
[1] https://chat.vlm.run
[2] https://vlm.run/orion
Wow this is actually pretty cool, I was able to segment out the people and dog in the same chat. https://chat.vlm.run/chat/cba92d77-36cf-4f7e-b5ea-b703e612ea...
Even works with long range shots. https://chat.vlm.run/chat/e8bd5a29-a789-40aa-ae31-a510dc6478...
Nice, that's pretty neat.
I can't wait until it's easy and accessible to rotoscope / greenscreen / mask this stuff out for videos. I tried Runway ML but it was... lacking, and the web UI for fixing parts of it had similar issues.
I'm curious how this works for hair and transparent/translucent things. Probably not the best, but it doesn't seem to be mentioned anywhere? Presumably it's just a hard line or vector mask rather than an alpha matte, etc.?
I tried it on transparent glass mugs, and it does pretty well. At least better than other available models: https://i.imgur.com/OBfx9JY.png
Curious if you find interesting results - https://playground.roboflow.com
I'm pretty sure DaVinci Resolve does this already, and you can even track it; I don't know if it's available in the free version.
A brief history:
SAM 1 - Visual prompt to create pixel-perfect masks in an image. No video. No class names. No open vocabulary.
SAM 2 - Visual prompting for tracking on images and video. No open vocab.
SAM 3 - Open vocab concept segmentation on images and video.
Roboflow has been long on zero / few shot concept segmentation. We've opened up a research preview exploring a SAM 3 native direction for creating your own model: https://rapid.roboflow.com/
Curious if anyone has done anything meaningful with SAM2 and streaming. SAM3 has built-in streaming support which is very exciting.
I've seen versions where people use an in-memory FS to write frames of the stream for SAM2. Maybe that is good enough?
The native support for streaming in SAM3 is awesome, especially since it should also remove some of the memory accumulation for long sequences.
I used SAM2 for tracking tumors in real-time MRI images. With the default SAM2, loading images from disk, we could only process videos of 10^2 - 10^3 frames before running out of memory.
By developing/adapting a custom version (1) based on a modified implementation with real (almost) stateless streaming (2), we were able to increase that to 10^5 frames. While this was enough for our purposes, I spent way too much time debugging/investigating tiny differences between SAM2 versions. So it's great that the canonical version now supports streaming as well.
(Side note: I also know of people using SAM2 for real-time ultrasound imaging.)
1 https://github.com/LMUK-RADONC-PHYS-RES/mrgrt-target-localiz...
2 https://github.com/Gy920/segment-anything-2-real-time
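For anyone curious what that looks like in practice, the pattern is basically: seed the tracker once, then feed frames one at a time and keep only a small rolling state instead of a full per-frame memory bank. A sketch of the loop — the predictor class and its methods here are hypothetical placeholders, not the actual SAM 2/3 API; see the two repos above for working code:

    import cv2

    cap = cv2.VideoCapture("sequence.mp4")
    ok, first_frame = cap.read()

    # hypothetical streaming wrapper; real implementations live in the repos linked above
    predictor = StreamingSAMPredictor()
    predictor.init(first_frame, point=(320, 240))  # seed the target with a click on frame 0

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        mask = predictor.track(frame)  # only this frame's mask; no growing history is kept
        # ... use the mask, then drop it, so memory stays flat over 10^5+ frames ...

    cap.release()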
This model is incredibly impressive. Text is definitely the right modality, and now the ability to intertwine it with an LLM creates insane unlocks - my mind is already storming with ideas of projects that are now not only possible, but trivial.
Dang, it seems like the 3D side would work great for game asset generation.
Can it detect the speed of a vehicle in any video, unsupervised?
Does the license allow for commercial purposes?
Yes. It's a custom license with an Acceptable Use Policy that prohibits military use, plus export restrictions. The custom license permits commercial use.
I just checked, and it seems commercial use is permissible. Companies like vlm.run and Roboflow are using it commercially, as shown by their comments below. So I guess it can be used for commercial purposes.
Yes. But also note that redistribution of SAM 3 requires using the same SAM 3 license downstream. So libraries that attempt to, e.g., relicense the model as AGPL are non-compliant.
Yes, the license allows you to grift for your “AI startup”
This would be good for a video editor.
Probably still can't get past a Google Captcha when on a VPN. Do I click the square with the shoe of the person who's riding the motorcycle?
There are services that will bypass those for you via a browser extension.
Can anyone confirm whether this fits on a 3090? The files look to be about 3.5 GB, but I can't work out what the overall memory needs will be.
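Rough back-of-the-envelope: a ~3.5 GB checkpoint at fp32 is on the order of 870M parameters, so the weights alone would be around 1.75 GB in fp16; activations for a ~1024 px image typically add a few more GB, so a 24 GB 3090 should have plenty of headroom. That's an estimate from the file size, though, not a measurement.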
Obligatory xkcd: https://xkcd.com/1425/