Few-shot Segmentation in Computer Vision
Few-shot segmentation is a subfield of Deep Learning in which machine learning models learn to locate and identify objects from only a few annotated samples.
Introduction — Computer Vision in Manufacturing, Retail and Automotive Sectors
I have been developing Computer Vision solutions for manufacturing, retail, and primarily the automotive sector since 2019. In these industries, most computer vision systems rely on static cameras positioned at specific angles to monitor regions of interest on the shop floor. These solutions demand exceptionally high precision to be valuable.
In the automotive sector, for a model to be deemed “valuable” or “production-ready,” it must achieve 100% precision and a recall rate exceeding 98%, depending on the use case. With static camera placement and a defined field of view, it is possible to attain these results given a sufficient number of images. However, collecting “not okay” (negative) samples is often either very difficult or prohibitively expensive. Consequently, collecting more than 50 negative samples (images) is almost impossible.
The Era of Stacked Models
We addressed this challenge by dividing the localization and identification tasks into separate models. Most of the use cases we developed consist of multiple models, such as detection combined with classification or segmentation combined with classification. Detection or segmentation models identify the region where the inspection will occur, and the classification model then categorizes this region as “okay” or “not okay.” This approach helped us nearly achieve our ambitious target of 100% precision with a recall rate exceeding 98%.
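To make the pipeline concrete, here is a minimal sketch of the stacked approach. The interfaces are illustrative assumptions (a detector that returns xyxy boxes and a two-class classifier), not our exact production code:

```python
import torch

def inspect(frame: torch.Tensor, detector, classifier) -> list[str]:
    """Stacked-model inspection sketch: localize first, then classify."""
    # Stage 1: the detector localizes the regions to be inspected.
    # Assumed interface: an (N, 4) tensor of xyxy pixel boxes.
    boxes = detector(frame)
    verdicts = []
    for x1, y1, x2, y2 in boxes.int().tolist():
        # Stage 2: the classifier labels each cropped region.
        crop = frame[..., y1:y2, x1:x2]          # CHW crop of the ROI
        logits = classifier(crop.unsqueeze(0))   # assumed 2-class output
        verdicts.append("okay" if logits.argmax(dim=1).item() == 0
                        else "not okay")
    return verdicts
```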
However, as we approached these targets, new challenges emerged. Clients began requesting use cases with fewer positive and negative samples but demanding higher recall rates while maintaining 100% precision. To meet these new requirements, we innovated further by focusing on Deep Metric Learning, Few-shot Learning, and Fine-grained Visual Categorization. While I won’t go into detail about these techniques, this blog post provides an excellent explanation.
The Era of Deep Metric Learning
We started experimenting with Proxy-Anchor Loss and Multi-Similarity Loss. Multi-Similarity Loss worked well when we wanted the model to learn fine-grained features of the given input examples, and Proxy-Anchor Loss was excellent because “it had lower training complexity and could converge very fast”. However, we saw room for improvement in Proxy-Anchor Loss, so we changed its loss formula for negative samples.
From the original Proxy Anchor negative-pair term,

$$\frac{1}{|P|} \sum_{p \in P} \log\Bigl(1 + \sum_{x \in X_p^-} e^{\alpha \, (s(x,\,p) + \delta)}\Bigr),$$

where $P$ is the set of all proxies, $X_p^-$ the set of negative embeddings for proxy $p$, $s(\cdot,\cdot)$ the cosine similarity, $\alpha$ a scaling factor, and $\delta$ a margin,

to a modified negative-pair term whose loss vanishes as the cosine similarity of a negative pair approaches zero.
This change allowed us to achieve near-zero losses, which was not possible with the original Proxy Anchor Loss (PAL). The logic behind this improvement can be explained as follows:
The original PAL assigns a high loss even to negative pairs with near-zero cosine similarity. However, near-zero cosine similarity between a negative pair means the two embedding vectors are essentially orthogonal, i.e., already unrelated, so the loss calculated in this situation should be zero. As illustrated in the graph above, the modified formula calculates a near-zero loss for negative pairs with near-zero cosine similarity.
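The behavior can be checked numerically with the small sketch below. The original term follows the Proxy Anchor paper (with its default α = 32, δ = 0.1); the “modified” variant here, which simply flips the sign of the margin, is only one plausible formula with the stated property, not necessarily our exact modification:

```python
import torch

def pal_negative_term(sim, alpha=32.0, delta=0.1):
    # Original Proxy Anchor negative term for one proxy:
    # log(1 + sum_x exp(alpha * (s(x, p) + delta)))
    return torch.log(1 + torch.exp(alpha * (sim + delta)).sum())

def modified_negative_term(sim, alpha=32.0, delta=0.1):
    # Illustrative variant (an assumption, not the exact formula we used):
    # flipping the margin's sign makes the exponent negative around
    # sim = 0, driving the term toward log(1) = 0.
    return torch.log(1 + torch.exp(alpha * (sim - delta)).sum())

# One negative pair with near-zero cosine similarity:
sim = torch.tensor([0.0])
print(pal_negative_term(sim))       # ~3.24: penalized despite orthogonality
print(modified_negative_term(sim))  # ~0.04: near-zero loss, as desired
```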
We have also introduced a proxy optimization method that generates proxies that stay as close as possible to the pre-trained model’s initial embedding state while keeping the proxies of different classes as far as possible from each other. This optimization helped our models converge faster and to a lower minimum than with randomly initialized proxies. More info can be found in VeriMedi: Pill Identification using Proxy-based Deep Metric Learning and Exact Solution.
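A rough sketch of such an optimization is below. The exact objective is given in the VeriMedi paper; this simplified version, which anchors proxies to class-mean embeddings from a pre-trained backbone while pushing distinct proxies apart, is an assumption for illustration:

```python
import torch
import torch.nn.functional as F

def optimize_proxies(init_proxies, steps=500, lr=0.01, sep_weight=1.0):
    """Keep proxies near their initial positions, push classes apart."""
    proxies = init_proxies.clone().requires_grad_(True)
    opt = torch.optim.Adam([proxies], lr=lr)
    mask = torch.eye(init_proxies.size(0), dtype=torch.bool)
    for _ in range(steps):
        p = F.normalize(proxies, dim=1)
        sim = p @ p.t()                          # pairwise cosine similarity
        # Worst (largest) similarity to any other proxy, per proxy.
        sep_term = sim.masked_fill(mask, -1.0).max(dim=1).values.mean()
        anchor_term = (proxies - init_proxies).pow(2).sum()  # stay near init
        loss = anchor_term + sep_weight * sep_term
        opt.zero_grad()
        loss.backward()
        opt.step()
    return F.normalize(proxies.detach(), dim=1)

# Usage: in practice, init_proxies would be class-mean embeddings from the
# pre-trained backbone (random here only for the demo).
proxies = optimize_proxies(F.normalize(torch.randn(10, 128), dim=1))
```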
The next revolution is coming…
Both the stacked models and the Deep Metric Learning assets we created have been incredibly valuable. So far, my team and I have developed and deployed over 500 unique use cases for an automotive client, each meeting the criteria of 100% precision and over 99% recall. However, we have more than 1,000 additional use cases planned for development and deployment. When you consider the combined costs of computing, data labeling, man-hours, and generating “not okay” cases in an operational factory, the scope of the problem is much larger than initially anticipated. So, we need something new.
We have identified that the primary costs of computing and labeling stem from the detection and segmentation models. These models also require the most extensive data generation. If we could develop a solution capable of learning the region of interest or any specific object from just 5 “okay” and 5 “not okay” samples, it would revolutionize our entire development process. This breakthrough would significantly reduce expenses on computing, data storage, and labeling, and shorten development times. We might even empower workers to create their own models within 15 minutes for any repetitive visual inspection work.
The Era of Few-shot Segmentation Begins…
I saw this opportunity nearly 1.5 years ago and started studying the field of few-shot segmentation. I applied many of the published methods to production-level data, only to see them all fail.
So, I took the outcomes of my deep metric learning studies and adapted them for few-shot segmentation. The results of my work are shown in the experiments below.
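While the full method will be explained in a follow-up post, a minimal prototype-based sketch conveys the general idea of turning metric learning into segmentation: pool a foreground prototype from masked support features and score every query pixel by cosine similarity to it. This is a common baseline written for illustration, not necessarily the exact method in the repository:

```python
import torch
import torch.nn.functional as F

def predict_mask(backbone, support_img, support_mask, query_img, thr=0.5):
    """Prototype-based few-shot segmentation baseline (illustrative)."""
    with torch.no_grad():
        fs = backbone(support_img)   # assumed (1, C, h, w) feature map
        fq = backbone(query_img)     # assumed (1, C, h, w)
    # Downsample the (1, 1, H, W) binary support mask to the feature grid.
    m = F.interpolate(support_mask.float(), size=fs.shape[-2:], mode="nearest")
    # Masked average pooling -> one foreground prototype of shape (1, C).
    proto = (fs * m).sum(dim=(2, 3)) / m.sum().clamp(min=1e-6)
    # Cosine similarity between every query pixel and the prototype.
    sim = F.cosine_similarity(fq, proto[..., None, None], dim=1)  # (1, h, w)
    sim = F.interpolate(sim[:, None], size=query_img.shape[-2:],
                        mode="bilinear", align_corners=False)
    return sim.squeeze(1) > thr      # binary query mask (threshold assumed)
```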
For the sample images, I have chosen BMW images, as BMW is my favorite car manufacturer.
Experiments
Support Image & Mask:
Query Images and Mask Predictions:
The results are pretty good, as can be seen from the visualizations above. Also, after post-processing the masks, you can extract positive points from them and use Segment Anything to find an even better mask for the entire object. If you use Segment Anything 2, you can track any object in a video as well.
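As a sketch of that post-processing step, the snippet below samples positive points from the coarse few-shot mask and feeds them to Segment Anything’s point-prompt API. The point-sampling strategy and the checkpoint choice here are illustrative assumptions:

```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

def refine_with_sam(image, coarse_mask, ckpt="sam_vit_h_4b8939.pth", n_pts=5):
    """Refine a coarse binary mask with SAM point prompts."""
    predictor = SamPredictor(sam_model_registry["vit_h"](checkpoint=ckpt))
    predictor.set_image(image)                     # HWC uint8 RGB image
    ys, xs = np.nonzero(coarse_mask)               # coarse foreground pixels
    idx = np.random.choice(len(xs), size=min(n_pts, len(xs)), replace=False)
    points = np.stack([xs[idx], ys[idx]], axis=1)  # (N, 2) in (x, y) order
    labels = np.ones(len(points), dtype=np.int64)  # 1 = positive prompt
    masks, scores, _ = predictor.predict(point_coords=points,
                                         point_labels=labels,
                                         multimask_output=True)
    return masks[scores.argmax()]                  # best-scoring refined mask
```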
I will make another post explaining how everything works, along with the usage of the repository, but until then, you can reach the code repository through the link below:
LINK TO THE CODE REPOSITORY
Note: All research studies mentioned in this post were done with my personal resources and on my own time. I am the sole owner of the repository and share my rights with Nikita Karpuks in thanks for his much-appreciated contribution to this research and the code repository.