Saturday, June 15, 2024

A Foundation Model for Medical AI

Introducing PLIP, a foundation model for pathology


We are seeing advances in all directions thanks to the ongoing AI revolution. OpenAI's GPT models are leading the way, showing how much foundation models can ease some of our daily tasks. New models are released every day, some of which can improve our writing or streamline some of our activities.

Many opportunities are opening up in front of us. AI products that assist us in our professional lives will be among the most significant tools we acquire in the coming years.

Where will the most significant changes occur? Where can we help people work faster? One of the most exciting avenues for AI models is the one that leads to medical AI tools.

One of the first foundation models for pathology is PLIP (Pathology Language and Image Pre-Training), which I describe in this blog post. PLIP is a vision-language model that embeds text and images in the same vector space, which enables multi-modal applications. Recently published in Nature Medicine, PLIP builds on the original CLIP model proposed by OpenAI in 2021:

Huang, Z., Bianchi, F., Yuksekgonul, M., Montine, T., Zou, J., A visual–language foundation model for pathology image analysis using medical Twitter. 2023, Nature Medicine.

All images, unless otherwise noted, are by the author.

Contrastive Pre-Training 101

We show that, by collecting data from social media and applying a few additional tricks, it is possible to build a model that can be applied successfully to pathology tasks in medical AI without the need for annotated data.

A full introduction to CLIP (the model from which PLIP is derived) and its contrastive loss is somewhat beyond the scope of this blog post, but a short introduction or refresher is still useful. The basic idea behind CLIP is that we can build a model that places text and images in a vector space where “images and their descriptions are going to be close together.”

The GIF above also shows how a model that embeds text and images in the same vector space can be used for classification. We can associate each image with one or more labels by looking at distances in the vector space: the closer the description is to the image, the better. We expect the closest label to be the true label of the image.

Note that once CLIP has been trained, you can embed any image or text you want. Keep in mind that although this GIF depicts a 2D space, the spaces used in CLIP typically have much higher dimensionality.

This means that once images and text are in the same vector space, there are many things we can do, from retrieval to zero-shot classification (finding which text label is most similar to an image).

How is CLIP trained? Simply put, the model is fed MANY image-text pairs and tries to pull matching items close together (as in the image above) while pushing all the rest apart. The more image-text pairs you have, the better the representation you will learn.
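To make the training objective a little more concrete, here is a minimal NumPy sketch of a CLIP-style symmetric contrastive loss. It is an illustration under stated assumptions, not the code used to train CLIP or PLIP: the function names are made up for this example, the embeddings are random stand-ins, and the temperature is fixed at 0.07 for simplicity.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy where row i of each matrix is a matching pair."""
    image_emb = l2_normalize(image_emb)
    text_emb = l2_normalize(text_emb)
    logits = image_emb @ text_emb.T / temperature   # (batch, batch) similarities
    idx = np.arange(logits.shape[0])                # matching pairs sit on the diagonal
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return (loss_i2t + loss_t2i) / 2

# Random stand-ins for encoder outputs (assumption: 4 pairs, 8 dimensions).
rng = np.random.default_rng(0)
images = rng.normal(size=(4, 8))
aligned_texts = images + 0.01 * rng.normal(size=(4, 8))  # captions close to their images
shuffled_texts = aligned_texts[::-1]                      # every caption paired with the wrong image

loss_aligned = contrastive_loss(images, aligned_texts)
loss_shuffled = contrastive_loss(images, shuffled_texts)
print(loss_aligned, loss_shuffled)
```

The loss is low when each image is closest to its own caption and grows when the pairs are mismatched, which is exactly the "pull together, push apart" behavior described above.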

We will stop here with the CLIP background; this should be enough to understand the rest of this post. I have a more in-depth blog post about CLIP on Towards Data Science.

Although CLIP was designed as a very general image-text model, there are domains in which it underperforms and domain-specific implementations do better (Zhang et al., 2023). One such domain is fashion (Chia et al., 2022).

Pre-Training for Pathology Language and Images (PLIP)
We now go over how we created PLIP, our enhanced adaptation of the original CLIP model created especially for pathology.

Creating a Dataset for Pre-Training in Pathology Language and Images
We need data, and it has to be of good enough quality to train a model. The question is where to find it. What we need are images with relevant captions, like the one in the GIF above.

Although there is a large amount of pathology data online, it often lacks annotations and may come in non-standard formats such as PDF files, slides, or YouTube videos.

We have to look elsewhere, and that elsewhere is social media. Social media platforms can give us access to a wealth of pathology-related content. Pathologists use social media to share their own research and to ask and answer questions from their peers (for more on how pathologists use social media, see Isom et al., 2017). There is also a set of commonly used Twitter hashtags that pathologists use to communicate.

In addition to the Twitter data, we collect a subset of images from the LAION dataset (Schuhmann et al., 2022), a massive collection of 5B image-text pairs.

Pathology Twitter

We use pathology-related Twitter hashtags to collect more than 100K tweets. The process is quite simple: we use the API to collect tweets that match a given set of hashtags. We remove tweets that contain a question mark, since they typically ask other pathologists for input (for example, “Which kind of tumor is this?”) rather than provide information we could use to build our model.

We extract tweets with specific keywords and remove sensitive content. We also remove all tweets that contain question marks, which pathologists use when asking their colleagues about possible rare cases.
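The keyword and question-mark filters described above can be sketched in a few lines of Python. This is only an illustration: the function name, the keywords, and the example tweets are hypothetical, and the sensitive-content filter is not shown.

```python
def keep_tweet(text, keywords):
    """Keep a tweet only if it has no question mark and mentions a keyword."""
    if "?" in text:
        return False  # questions to colleagues rarely describe the image
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)

# Hypothetical keywords and tweets, for illustration only.
keywords = ["#pathology", "#dermpath"]
tweets = [
    "Which kind of tumor is this? #pathology",
    "Classic basal cell carcinoma with peripheral palisading #dermpath",
    "Great weather at the conference today",
]

filtered = [tweet for tweet in tweets if keep_tweet(tweet, keywords)]
print(filtered)
```

Only the second tweet survives: the first one is a question and the third one matches no keyword.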

Sampling from LAION

Our strategy for collecting LAION data is as follows: we take our own images from Twitter and look for similar images in this large corpus. This way, we should get reasonably similar images, and hopefully these are also pathology images. LAION contains 5B image-text pairs.

Embedding and searching through 5B embeddings is very time-consuming, which makes it impractical to do ourselves. Fortunately, LAION provides pre-computed vector indexes that we can query with actual images through APIs! To find related images in LAION, we simply embed our images and run a K-NN search. Remember that each of these images comes with a caption, which is perfect for our use case.

This is a very simple way to extend our dataset using K-NN search over LAION. Starting from an image in our original corpus, we search LAION for similar images. Every image we retrieve comes with an actual caption.
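As a rough picture of what a K-NN search over embeddings does, here is a brute-force NumPy sketch. It is not how the LAION indexes work internally (those are pre-computed so that nobody has to scan 5B vectors); the function name, the random corpus, and the placeholder captions are assumptions made for this example.

```python
import numpy as np

def knn_search(query_emb, corpus_emb, k=3):
    """Return indices and scores of the k corpus rows most similar to the query."""
    query = query_emb / np.linalg.norm(query_emb)
    corpus = corpus_emb / np.linalg.norm(corpus_emb, axis=1, keepdims=True)
    scores = corpus @ query            # cosine similarity with every corpus item
    top = np.argsort(-scores)[:k]      # indices of the k highest scores
    return top, scores[top]

# Random stand-ins for a captioned image corpus (assumption: 1000 items, 64 dims).
rng = np.random.default_rng(0)
corpus_emb = rng.normal(size=(1000, 64))
captions = [f"caption {i}" for i in range(1000)]

# A query image that is a slightly perturbed copy of corpus item 42.
query_emb = corpus_emb[42] + 0.05 * rng.normal(size=64)

indices, scores = knn_search(query_emb, corpus_emb, k=3)
neighbors = [(captions[i], float(s)) for i, s in zip(indices, scores)]
print(neighbors)
```

Each retrieved neighbor carries its caption along, which is what makes the retrieved images usable as new image-text pairs.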

Ensuring Data Quality

Not all of the images we collect are good. For instance, from Twitter we collected many group photos from medical conferences. From LAION, we occasionally got fractal-like images that vaguely resemble a pathology pattern.

To filter these out, we trained a classifier using some pathology data as the positive class and ImageNet data as the negative class. This type of classifier has extremely high precision, which makes it easy to tell pathology images apart from random images found on the web.

Additionally, we apply an English language classifier to the LAION data to remove examples that are not in English.

Training Pathology Language and Image Pre-Training

The hardest aspect was gathering the data. When that is finished and we are confident in our data, we may begin training.

To train PLIP, we used the original OpenAI code. We built our own training loop, added cosine annealing for the loss, and made a few other tweaks here and there to make everything run smoothly and in a verifiable way (for example, with Comet ML tracking).
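For readers who have not met cosine annealing before, here is a small sketch of the schedule. The post does not spell out the exact configuration, so everything here is an assumption: cosine annealing is most commonly applied to the learning rate, and the function name, step count, and values below are illustrative only.

```python
import math

def cosine_annealing(step, total_steps, max_value=1e-4, min_value=0.0):
    """Decay smoothly from max_value to min_value along half a cosine wave."""
    progress = min(step, total_steps) / total_steps
    return min_value + 0.5 * (max_value - min_value) * (1 + math.cos(math.pi * progress))

# Hypothetical schedule over 1000 training steps.
total_steps = 1000
schedule = [cosine_annealing(step, total_steps) for step in range(total_steps + 1)]

# Value at the start, halfway through, and at the end of training.
print(schedule[0], schedule[total_steps // 2], schedule[-1])
```

The value changes slowly at the beginning and at the end and fastest in the middle, which tends to give a gentler finish than a linear decay.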

We trained hundreds of different models, compared parameters and optimization strategies, and eventually found one that worked well for us. More details are in the paper, but one of the most important factors when building this kind of contrastive model is making the training batch size as large as possible. Since every other pair in the batch serves as a negative example, a larger batch lets the model learn to discriminate as many elements as possible.

Pathology Language and Image Pre-Training for Medical AI

It is now time to put PLIP to the test. Does this foundation model meet industry standards?

We run a variety of tests to evaluate how well our PLIP model performs. The three most interesting ones are zero-shot classification, linear probing, and retrieval, but I’ll focus on the first two here. For the sake of brevity I’ll leave out the experimental configuration, but all of this information is available in the manuscript.

PLIP as a Zero-Shot Classifier
The GIF below shows how to do zero-shot classification with a model like PLIP. We use the dot product as a measure of similarity in the vector space (the higher, the more similar).

How zero-shot classification works. We embed an image along with all the labels and find which labels are closest to the image in the vector space.
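In code, zero-shot classification boils down to a dot product followed by an argmax. The sketch below uses random vectors in place of real PLIP embeddings, and the function name and label set are hypothetical; with a real model, the label embeddings would come from encoding the label texts.

```python
import numpy as np

def zero_shot_classify(image_emb, label_embs, labels):
    """Pick the label whose embedding has the highest dot product with the image."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    label_embs = label_embs / np.linalg.norm(label_embs, axis=1, keepdims=True)
    scores = label_embs @ image_emb   # one score per label: higher means more similar
    return labels[int(np.argmax(scores))], scores

# Hypothetical labels with random stand-in embeddings (assumption: 16 dims).
labels = ["tumor", "normal tissue", "stroma"]
rng = np.random.default_rng(1)
label_embs = rng.normal(size=(3, 16))

# An "image" embedding that lies close to the second label.
image_emb = label_embs[1] + 0.1 * rng.normal(size=16)

predicted, scores = zero_shot_classify(image_emb, label_embs, labels)
print(predicted)
```

No classifier is trained here: changing the set of labels only requires embedding new text.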

The following plot shows a quick comparison of PLIP vs. CLIP on two of the datasets we used for zero-shot classification. Using PLIP in place of CLIP yields a noticeable gain in performance.

PLIP vs CLIP performance (Weighted Macro F1) on two datasets for zero-shot classification. Note that y-axis stops at around 0.6 and not 1.

PLIP as a Feature Extractor for Linear Probing
Another way to use PLIP is as a feature extractor for pathology images. During training, PLIP sees many pathology images and learns to build vector embeddings for them.

Say you have some annotated data and want to train a new pathology classifier. You can extract image embeddings with PLIP and then train a logistic regression (or any other kind of classifier you like) on top of these embeddings. This is a quick and effective way to perform a classification task.
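Linear probing can be sketched as follows: the embeddings stay frozen and only a small linear classifier is trained on top. To keep the example self-contained, the logistic regression is written by hand in NumPy and the "embeddings" are synthetic stand-ins; in practice you would more likely feed real PLIP embeddings to a library implementation such as scikit-learn's LogisticRegression. All names and numbers below are assumptions made for illustration.

```python
import numpy as np

def train_logistic_regression(features, targets, lr=0.5, epochs=500):
    """Fit a binary logistic regression with plain gradient descent."""
    weights = np.zeros(features.shape[1])
    bias = 0.0
    for _ in range(epochs):
        probs = 1.0 / (1.0 + np.exp(-(features @ weights + bias)))
        error = probs - targets   # gradient of the log loss w.r.t. the logits
        weights -= lr * features.T @ error / len(targets)
        bias -= lr * error.mean()
    return weights, bias

def predict(features, weights, bias):
    """Return class 1 when the linear score is positive, otherwise class 0."""
    return (features @ weights + bias > 0).astype(int)

# Synthetic stand-ins for frozen image embeddings of two classes (32 dims).
rng = np.random.default_rng(0)
class_0 = rng.normal(loc=-1.0, size=(100, 32))
class_1 = rng.normal(loc=1.0, size=(100, 32))
features = np.vstack([class_0, class_1])
targets = np.array([0] * 100 + [1] * 100)

# Shuffle, then hold out 50 examples for evaluation.
order = rng.permutation(len(targets))
features, targets = features[order], targets[order]
train_x, test_x = features[:150], features[150:]
train_y, test_y = targets[:150], targets[150:]

weights, bias = train_logistic_regression(train_x, train_y)
accuracy = (predict(test_x, weights, bias) == test_y).mean()
print(f"held-out accuracy: {accuracy:.2f}")
```

Because only the weights of the linear layer are learned, the quality of the result depends almost entirely on how well the embeddings already separate the classes.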

Why does this work? The idea is that PLIP embeddings are pathology-specific, so a classifier trained on them should do better than one trained on general-purpose CLIP embeddings.

Here is an example comparing the performance of CLIP and PLIP on two datasets. Even though CLIP performs well, PLIP yields significantly better results.

PLIP vs CLIP performance (Macro F1) on two datasets for linear probing. Note that y-axis starts from 0.65 and not 0.

Using Pathology Language and Image Pre-Training
How do you use PLIP? Here are some examples of using PLIP in Python, along with a Streamlit demo you can use to play with the model a bit.

Code: APIs to Use PLIP
You can find further examples in our GitHub repository. We built an API that makes it easy to interact with the model:

from plip.plip import PLIP
import numpy as np

plip = PLIP('vinid/plip')

# `images` and `texts` are your own inputs and must be defined before this point
# we create image embeddings and text embeddings
image_embeddings = plip.encode_images(images, batch_size=32)
text_embeddings = plip.encode_text(texts, batch_size=32)

# we normalize the embeddings to unit norm (so that we can use dot product instead of cosine similarity to do comparisons)
image_embeddings = image_embeddings / np.linalg.norm(image_embeddings, ord=2, axis=-1, keepdims=True)
text_embeddings = text_embeddings / np.linalg.norm(text_embeddings, ord=2, axis=-1, keepdims=True)

You can also use the more standard HF API to load and use the model:

from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("vinid/plip")
processor = CLIPProcessor.from_pretrained("vinid/plip")

image = Image.open("images/image1.jpg")

# score the image against two candidate text labels
inputs = processor(text=["a photo of label 1", "a photo of label 2"],
                   images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # image-text similarity scores
probs = logits_per_image.softmax(dim=1)  # softmax over the labels gives label probabilities

Demonstration of PLIP as a Teaching Tool
We also believe that PLIP and future models can serve as effective educational tools for medical AI. With PLIP, users can do zero-shot retrieval: they enter a set of keywords and the application looks for the most similar image. We built a simple web app in Streamlit, which you can find here.

We appreciate your time reading this. We are eager to see how this technology develops in the future.

I’ll wrap up this blog post by talking about some crucial PLIP constraints and mentioning some other pieces I’ve written that you might find interesting.

Even though our results are interesting, PLIP has many limitations. Data alone is not enough to learn all the complex aspects of pathology. Although we built data filters to ensure data quality, we still need better evaluation metrics to understand what the model gets right and what it gets wrong.

More importantly, PLIP is not a perfect tool and can make many errors that require further investigation. It does not solve the current challenges of pathology. The results we see are definitely promising, and they open up a range of possibilities for future pathology models that combine vision and language. However, there is still a lot of work to do before these tools can be used in everyday medical practice.


I have a couple of other blog posts regarding CLIP modeling and CLIP limitations. For example:

Teaching CLIP Some Fashion

Training FashionCLIP, a domain-specific CLIP model for Fashion

Your Vision-Language Model Might Be a Bag of Words

We explore the limits of what vision-language models get about language in our Oral Paper at ICLR 2023

Chia, P.J., Attanasio, G., Bianchi, F., Terragni, S., Magalhães, A.R., Gonçalves, D., Greco, C., & Tagliabue, J. (2022). Contrastive language and vision learning of general fashion concepts. Scientific Reports, 12.

Isom, J.A., Walsh, M., & Gardner, J.M. (2017). Social Media and Pathology: Where Are We Now and Why Does it Matter? Advances in Anatomic Pathology.

Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., Schramowski, P., Kundurthy, S., Crowson, K., Schmidt, L., Kaczmarczyk, R., & Jitsev, J. (2022). LAION-5B: An open large-scale dataset for training next generation image-text models. ArXiv, abs/2210.08402.

Zhang, S., Xu, Y., Usuyama, N., Bagga, J.K., Tinn, R., Preston, S., Rao, R.N., Wei, M., Valluri, N., Wong, C., Lungren, M.P., Naumann, T., & Poon, H. (2023). Large-Scale Domain-Specific Pretraining for Biomedical Vision-Language Processing. ArXiv, abs/2303.00915.
