
Intro To SAM: Segment Anything Model! 🚀

Redefining Computer Vision! - @mohit_agr18

In April 2023, Meta Research released a foundation model called the Segment Anything Model, or SAM. In this blog post we will get an overview of how it works and how it was trained.

In computer vision, segmentation refers to the process of identifying the image pixels that belong to a particular object. Instance segmentation goes further and distinguishes between pixels of individual objects of the same class.

However, training a segmentation model from scratch requires a lot of computing power, technical know-how, and a very large annotated dataset. This process can be time and cost intensive.

To tackle this, Meta Research released the Segment Anything Model.

Let's do a deep dive on SAM.

How does SAM work?

The model interacts with users through prompts in different formats: points, boxes, text, masks, or any other input that indicates which object to segment in the image.

For each prompt, the model returns a valid mask with a confidence score. For ambiguous prompts, SAM can output multiple valid masks, which can be ranked by confidence score.
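To make this concrete, here is a minimal sketch using the official segment-anything Python package; the checkpoint filename, image path, and point coordinates are placeholders.

```python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load a SAM checkpoint (placeholder path) and wrap it in a predictor.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

# Placeholder image; SAM expects RGB.
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)  # the heavy image encoder runs once here

# A single foreground point is an ambiguous prompt, so ask for
# multiple candidate masks and rank them by predicted quality.
masks, scores, logits = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),  # 1 = foreground, 0 = background
    multimask_output=True,
)
best_mask = masks[np.argmax(scores)]
```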

Trained on a massive dataset of 11 million images and more than 1.1 billion masks, SAM has acquired a broad understanding of the objects that appear in images. This enables SAM to generate masks for any object in an image, even if that object class was not present in the training data, without any additional training. This capability is often referred to as zero-shot transfer.

SAM's Structure: Brief Overview

The architecture is designed so that the prompt encoder and mask decoder can run in real time on a CPU. This is done so that annotation tools built on SAM can run in a web browser.

SAM architecture has three components:

  1. An image encoder: SAM uses a Masked Autoencoder (MAE) pre-trained Vision Transformer (ViT). The image encoder runs once per image and produces an image embedding. This step can be done ahead of prompting, on a GPU machine.

  2. A prompt encoder: SAM supports sparse prompts (points, boxes, text) and dense prompts (masks).

    1. Points and boxes are represented by positional encodings summed with learned embeddings.

    2. Free-form text is run through CLIP's text encoder, and the resulting text embedding is used as the prompt.

    3. Masks are embedded using convolutions and summed element-wise with the image embedding.

  3. A mask decoder: the image embedding and prompt embeddings are then combined by a lightweight decoder that predicts segmentation masks (see the sketch below).
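Putting the three components together, the flow looks roughly like the sketch below. It mirrors what the package's SamPredictor does internally; the checkpoint path, the dummy image tensor, and the point prompt are placeholders.

```python
import torch
from segment_anything import sam_model_registry

# Placeholder checkpoint path; any SAM checkpoint works the same way.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth").eval()

# Dummy preprocessed image (SAM expects 3x1024x1024 input) and a single
# foreground point prompt, batched as (B, N, 2) coords and (B, N) labels.
image = torch.zeros(1, 3, 1024, 1024)
point_coords = torch.tensor([[[512.0, 512.0]]])
point_labels = torch.tensor([[1]])

with torch.no_grad():
    # 1. Image encoder: heavy ViT, run once per image.
    image_embedding = sam.image_encoder(image)

    # 2. Prompt encoder: lightweight, run for each new prompt.
    sparse_emb, dense_emb = sam.prompt_encoder(
        points=(point_coords, point_labels), boxes=None, masks=None
    )

    # 3. Mask decoder: combines the two embeddings and predicts masks
    #    plus a quality (IoU) score for each candidate.
    low_res_masks, iou_predictions = sam.mask_decoder(
        image_embeddings=image_embedding,
        image_pe=sam.prompt_encoder.get_dense_pe(),
        sparse_prompt_embeddings=sparse_emb,
        dense_prompt_embeddings=dense_emb,
        multimask_output=True,
    )
```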

Segment Anything Training and Dataset Generation

Meta Research also open-sourced the dataset, SA-1B, of 11 million images with over 1 billion masks.

Training of the model and generation of the dataset happened together in three stages, the general idea being to use SAM to interactively annotate images and then use the new annotations to retrain SAM.

This cycle was repeated multiple times to iteratively improve both the dataset and the model.

  1. A model-assisted manual stage, where annotators used an interactive tool to annotate images and refine model-generated masks. At the start of this stage, SAM was trained on publicly available datasets; once enough new annotations were collected, it was retrained on them. Around 120k images were labelled and 4.3M masks were generated in this stage.

  2. A semi-automatic stage, where annotators focused on less prominent objects while the model predicted the remaining objects with high confidence. After this stage, the number of masks per image went from 44 to 72, and an additional 5.9M masks from 180k images were generated.

  3. A fully automatic stage, where annotation was done entirely by the model. In this stage, around 1.1 billion masks were generated from 11M images without any human involvement (the released automatic mask generator, sketched below, works in a similar spirit). This dataset, known as SA-1B, is now available for research purposes.
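The SamAutomaticMaskGenerator shipped with the segment-anything package illustrates this kind of fully automatic annotation: it prompts SAM with a grid of points across the image and keeps the high-quality, de-duplicated masks. A minimal sketch, with the checkpoint and image path as placeholders:

```python
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Placeholder checkpoint path and image.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)

# Each entry is a dict with the binary mask plus quality metadata.
for m in masks[:3]:
    print(m["area"], m["bbox"], m["predicted_iou"], m["stability_score"])
```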

In an upcoming blog post, we will explore how to use SAM for mask generation with various prompts. We will also explore Grounded SAM, which can detect and segment anything from text inputs.

  • To try out SAM, use this demo.

  • More details about the SA-1B dataset can be found here.

  • Read the SAM paper here.

Stay tuned for part 2, where we’ll bring an applied demo of SAM with code!
