How to combine text and images. Part 2: Understanding Zero-Shot… | By Rabeya Tus Sadia

Photo by Lenin Estrada on Unsplash

It has been over a year since OpenAI first released the CLIP model, which introduced this method of combining images and caption text. This huge model was trained on 400 million (!) different pairs of images and captions found on the web.

By the end of this post, we will understand how zero-shot classification works with the CLIP model, with practical examples. The goal of CLIP is to learn how to classify images without any explicit labels.

Intuition

Like traditional supervised models, CLIP has two stages: a training stage and an inference stage (making predictions). I encourage you to read the blog posts dedicated to CLIP and how it was built and used or, better yet, the paper.

In short, during the training stage CLIP learns about images by “reading” the supporting text (i.e., sentences) corresponding to each image, as in the example below.

An example of CLIP architecture candidate input. Photo by The Lucky Neko on Unsplash

Even if you’ve never seen a cat, you should be able to read this text and figure out that the three things in the picture are “cats.” If you’ve seen enough pictures of cats with “cat” written on them, you can get a really good idea of whether or not there are cats in a picture.
Similarly, by looking at 400 million pairs of images and text on all kinds of subjects, the model learns how certain phrases and words match certain patterns in images. Once it has learned this, the model can apply that knowledge to other classification tasks. But wait a minute.

You might be wondering whether this “supporting text” isn’t just a label in disguise, meaning this isn’t the “label-free” training I promised at the beginning.
Supporting information like captions is a form of supervision, but captions are not labels. With this supporting information, we can use unstructured data that is full of information without manually distilling it into a single label (e.g., “These are my three beautiful cats…” vs. “cats”).
Collecting labels takes time and discards information that might be useful. With the CLIP approach, we can bypass this bottleneck and give the model as much information as possible.

But how exactly does the model learn from these supporting texts?

As the name of the architecture suggests, CLIP uses a technique called contrastive learning to understand the relationship between image and text pairs.

Summary of the CLIP approach. Image from the CLIP paper

Essentially, CLIP’s objective is to minimize the difference between an image’s encoding and the encoding of its corresponding text. In other words, the model must learn to make an image’s encoding and its corresponding text’s encoding as similar as possible.

Let’s take this idea a little deeper.

What are encodings? Encodings are simply representations of data in a lower dimension (shown as green and purple boxes in the image above). In an ideal world, the encoding of an image or text should reveal the most important and distinctive information about that image or text.
For example, all cat images should have the same encoding because they all contain cats. On the other hand, dog images should have different encodings.
In this perfect world, where similar objects have the same encoding and different objects have different encodings, it is easy to group images. If we give the model an image whose encoding is similar to the encodings of other “cat” images it has seen, the model can tell that the image is a cat.
The best way to categorize images, then, is to learn how to encode them as well as possible. In fact, this is the whole point of CLIP (and most of deep learning). We start with bad encodings (random encodings for each image), and we want the model to learn the best encodings (i.e., cat images have similar encodings).
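To make the contrastive objective concrete, here is a minimal PyTorch sketch of the idea described above: compute the similarity between every image encoding and every text encoding in a batch, and push the matching pairs to be the most similar. The function name and the fixed temperature are my own illustration, not code taken from the CLIP repository.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    # Normalize embeddings so the dot product is cosine similarity.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # Similarity matrix: entry (i, j) compares image i with caption j.
    # (CLIP actually learns the temperature; it is fixed here for simplicity.)
    logits = image_embeds @ text_embeds.t() / temperature

    # The matching pairs sit on the diagonal, so the target for row i is i.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: images -> texts and texts -> images.
    loss_i = F.cross_entropy(logits, targets)
    loss_t = F.cross_entropy(logits.t(), targets)
    return (loss_i + loss_t) / 2
```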

To use the CLIP model as a zero-shot classifier, you only need to define a list of possible classes, or descriptions, and CLIP will predict which class a given image most likely belongs to, based on its prior knowledge. Think of it as asking the model, “Which of these captions best matches this image?”
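In code, that question can be asked with the open-source clip package from the openai/CLIP repository. A minimal sketch, assuming a local image file and two placeholder captions (both are stand-ins you would replace with your own data):

```python
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Candidate captions -- placeholders; swap in your own class descriptions.
captions = ["picture of a dandelion flower", "picture of a daisy flower"]
text_tokens = clip.tokenize(captions).to(device)

# "example.jpg" is a placeholder path for any image you want to classify.
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text_tokens)

    # Cosine similarity between the image and each caption, as probabilities.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(captions, probs[0].tolist())))
```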

In this post, we’ll show you how to test the performance of CLIP on your own image dataset, using a public flower classification dataset as the example. The code is available in a Colab notebook.

First, download and install all CLIP dependencies.

To try CLIP on your own data, copy the notebook to your Drive and make sure a GPU is selected under Runtime (Google Colab gives you a free GPU to use). Then we run a few installs and clone the CLIP repo.
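In a Colab cell, the setup looks roughly like this; a sketch of the install steps, which may differ slightly from what the linked notebook pins:

```python
# Run in a Colab cell with a GPU runtime selected.
!pip install ftfy regex tqdm
!pip install git+https://github.com/openai/CLIP.git

# The notebook also clones the repository itself:
!git clone https://github.com/openai/CLIP.git
```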

Then, download the classification dataset.
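The exact download link lives in the notebook; as a placeholder, fetching and unpacking a zipped test set might look like the following (the URL and archive name below are hypothetical):

```python
# Hypothetical URL -- replace with the dataset link from the notebook.
!wget -q https://example.com/flowers_test_set.zip -O flowers_test_set.zip
!unzip -q flowers_test_set.zip -d test
```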

Here, the classes and images we want to test are stored in test set folders, and the candidate captions that CLIP will compare each image against are listed in the _tokenization.txt file.
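Assuming a layout of one sub-folder per class under a test/ directory, the class names can be read straight from the folder structure (this helper is my own sketch, not code from the notebook):

```python
import os

test_dir = "test"  # one sub-folder per class, e.g. test/daisy, test/dandelion

# Class names come from the folder structure, sorted alphabetically;
# _tokenization.txt will hold one caption per class in the same order.
class_names = sorted(
    d for d in os.listdir(test_dir)
    if os.path.isdir(os.path.join(test_dir, d))
)
print(class_names)  # e.g. ['daisy', 'dandelion']
```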

In this code section, you can see the auto-generated image captions for the classes. This is where you can apply your own prompt engineering: try different captions to build class descriptions that help CLIP identify the images correctly, and use your intuition to improve the results.
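For instance, a small helper (my own sketch) can wrap each class name in a caption template and write the result to _tokenization.txt, reusing class_names from the snippet above:

```python
# Prompt engineering: wrap each class name in a richer caption template.
template = "picture of a {} flower"

captions = [template.format(name) for name in class_names]

with open("_tokenization.txt", "w") as f:
    f.write("\n".join(captions) + "\n")

print(captions)  # e.g. ['picture of a daisy flower', 'picture of a dandelion flower']
```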

The final step is to run your test images through a prediction step.

CLIP takes as input an image and a list of possible class captions. You can set the class captions as you see fit in the _tokenization.txt file; just make sure they stay in the same order as the alphabetically sorted class_names (defined by the folder structure).

This is the basic inference loop. Essentially, we iterate over the images in our test folder, pass each image through the network along with our tokenized captions, see which caption CLIP assigns to each image, and finally check whether it matches the ground truth.
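A minimal sketch of that loop using the clip package is below; the folder layout and file names follow the description above, and the per-class bookkeeping is my own addition:

```python
import glob
import os

import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Class names from the folder structure, captions from _tokenization.txt,
# both in the same alphabetical order.
class_names = sorted(
    d for d in os.listdir("test") if os.path.isdir(os.path.join("test", d))
)
with open("_tokenization.txt") as f:
    captions = [line.strip() for line in f if line.strip()]
text_tokens = clip.tokenize(captions).to(device)

correct = {name: 0 for name in class_names}
total = {name: 0 for name in class_names}

with torch.no_grad():
    text_features = model.encode_text(text_tokens)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    for true_idx, name in enumerate(class_names):
        for path in glob.glob(os.path.join("test", name, "*")):
            image = preprocess(Image.open(path)).unsqueeze(0).to(device)
            image_features = model.encode_image(image)
            image_features /= image_features.norm(dim=-1, keepdim=True)

            # The predicted class is the caption with the highest similarity.
            pred_idx = (image_features @ text_features.T).argmax(dim=-1).item()
            total[name] += 1
            correct[name] += int(pred_idx == true_idx)

for name in class_names:
    if total[name]:
        print(f"{name}: {correct[name] / total[name]:.2%} accuracy")
```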

Then we compute some metrics. You can see that we got higher accuracy for dandelion than for daisy. When you use CLIP for your own classification task, it is worth experimenting with different caption wordings for your classification ontology, and remember that CLIP was trained to differentiate between image captions.

We tested the following ontologies on the flower dataset and saw the following results.

  • "dandelion" vs "daisy"] –> 46% accuracy (worse than guessing)
  • "dandelion flower" vs "daisy flower" –> 64% accuracy
  • "picture of a dandelion flower" vs "picture of a daisy flower" –> 97% accuracy

These results demonstrate the importance of providing the right class descriptions to CLIP and hint at the richness of the pre-training procedure, a feature that is entirely lost in traditional binary classification. OpenAI calls this process “prompt engineering.”

For more information on CLIP research, you can read the paper and check out OpenAI’s blog post.

That’s all for today.

Be happy and happy learning.
