Uncovering the Potential of Serverless AI

The serverless model’s potential for improving the feasibility of working with AI tools is enormous, but there’s one pitfall tech managers still need to be aware of.

Shannon Lal
February 10, 2023

In late 2022, we started investigating serverless GPUs as a way to significantly reduce cloud infrastructure costs. Serverless GPU is an on-demand AI technology that lets companies pay only for what they use, a game changer for a team like ours exploring new AI models, though cold starts can add tens of seconds to an inference call.

We looked at how two serverless GPU platforms, Replicate and Banana, are implemented and how they perform, and did a deep dive on one of the technology’s remaining hurdles. Here’s what we learned.

Why we’re interested in serverless GPU

Designstripe is an intelligent design tool company that helps designers and non-designers build illustrations quickly. In the last year, new AI-powered content generation tools have arrived on the market, including Stable Diffusion, ChatGPT, BLIP and many, many more.

Our engineering team has been exploring these technologies to see how they can be used with our products, and we’re especially interested in evaluating image caption generators. Image captioning is a task where an image is fed to a model which then generates a caption based on the content detected within the image. One popular model is Salesforce’s BLIP.

This image was fed into the BLIP model:

Figure 1: Sample image used for caption generation


BLIP returned the following caption. As you can see, the caption is simple, but it is useful for basic image classification.

"a couple holding hands on the beach"

Our team wanted to explore the feasibility of using BLIP to help generate captions for illustrations used in our products. One of the challenges with these models is that they require large servers with a GPU (graphics processing unit) to train on datasets and to make inferences (i.e., predictions). The innovation team has been investigating different AI models for over a year using Amazon’s p3.2xlarge EC2 instances, which cost about $3 per hour to run.

As our work mostly involves inference tests and fine-tuning, deploying these models in a production environment would require multiple instances running continuously, driving our infrastructure costs up sharply. A better solution would be for our team to deploy these models and pay only for what we use: in other words, on demand.
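Some back-of-the-envelope arithmetic makes the trade-off concrete. All of the numbers below are illustrative assumptions, apart from the roughly $3/hour instance price mentioned above:

```python
# Back-of-the-envelope cost comparison. Only the ~$3/hour p3.2xlarge
# figure comes from the text above; everything else is assumed.
ALWAYS_ON_HOURLY_USD = 3.00        # approx. p3.2xlarge on-demand price
SERVERLESS_PER_SECOND_USD = 0.001  # hypothetical serverless GPU rate

monthly_always_on = ALWAYS_ON_HOURLY_USD * 24 * 30  # ~$2,160/month

inferences_per_month = 10_000  # assumed workload
seconds_per_inference = 3      # assumed warm inference time
monthly_serverless = (
    SERVERLESS_PER_SECOND_USD * inferences_per_month * seconds_per_inference
)  # ~$30/month

print(f"Always-on GPU: ${monthly_always_on:,.0f}/month")
print(f"Serverless GPU: ${monthly_serverless:,.0f}/month")
```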

We explored serverless technologies and came across a couple of providers offering serverless AI computing, also known as serverless GPU (the term we’ll use here, since GPUs are critical for AI models). We chose to focus on two established leaders in the marketplace, Banana and Replicate, for this experiment.
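To give a flavor of what this looks like in practice, here is roughly how an inference call goes through Replicate’s Python client; Banana’s banana_dev client follows a similar pattern. The model version hash is a placeholder, the input name is an assumption, and no GPU is provisioned until the call is made:

```python
# Sketch of a hosted BLIP call via Replicate's Python client.
# The version hash is a placeholder (look it up on replicate.com), and
# the REPLICATE_API_TOKEN environment variable must be set.
import replicate

output = replicate.run(
    "salesforce/blip:<version-hash>",  # placeholder version
    input={"image": open("couple_on_beach.jpg", "rb")},  # assumed input name
)
print(output)
```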

A comparison of Banana and Replicate’s pricing models:

But first: the problem of cold starts

Cold starts are a common problem in serverless computing, and it was critical for our team to take a closer look. Serverless computing is a cloud computing execution model in which the cloud provider allocates machine resources on demand, taking care of the servers on behalf of its customers. (As IBM puts it, “There are most definitely servers in serverless computing. ‘Serverless’ describes the developer’s experience with those servers — they are invisible to the developer, who doesn’t see them, manage them, or interact with them in any way.”) A key benefit of serverless computing is that you only pay for what you use; the trade-off is that, because the cloud provider only launches your service on demand, there can be a delay before your service can start processing requests. That delay is referred to as a cold start.

Figure 2: AWS - Lambda - Lifecycle of Request

The diagram above shows the request lifecycle of an AWS Lambda serverless function, which is similar to the lifecycle of a serverless GPU. In a cold-start scenario, AWS needs to fetch the source code, launch a container, bootstrap it and run the code, all of which adds significant time to your execution (Figure 2). Once the service is up and running, your provider will keep it running until no more requests come through; requests served in that state are warm starts. In API calls, cold starts can add a few seconds to your execution, while warm starts only add a couple of hundred milliseconds.
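You can observe this lifecycle yourself: in a Lambda function, module-level code runs once per container bootstrap, while the handler runs on every invocation. A minimal sketch:

```python
# Minimal sketch for observing cold vs. warm starts in an AWS Lambda
# function: module-level code executes once per container (the cold
# start), while the handler runs on every invocation (warm starts
# reuse the same container).
import time

CONTAINER_STARTED_AT = time.time()  # runs once, during bootstrap
invocation_count = 0

def handler(event, context):
    global invocation_count
    invocation_count += 1
    return {
        "cold_start": invocation_count == 1,
        "container_age_seconds": round(time.time() - CONTAINER_STARTED_AT, 2),
    }
```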

Serverless GPU has a similar lifecycle (Figure 3): it has to load your model, start the container, and execute the inference. The main difference between serverless functions and serverless GPUs is the model size and memory requirements. Serverless functions are normally only hundreds of megabytes in size and rarely require more than 1 GB of memory. Serverless GPU containers, on the other hand, are tens of gigabytes and easily require 16 to 32 GB of memory. As a result, an AI model’s cold start can add tens of seconds or even minutes to an inference call, while warm starts add seconds.

Figure 3: Assumed lifecycle for serverless GPU

Note: The diagram above shows our assumed request lifecycle, based on the information available from both Replicate and Banana.
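To cope with this lifecycle, both providers encourage the same basic pattern: do the expensive model load once, at container start, and keep the per-request path light. A sketch under that assumption (function names are illustrative; each platform has its own handler conventions):

```python
# Illustrative init/inference split for a serverless GPU container.
# The heavy work (downloading weights, moving them to the GPU) happens
# once per cold start; each warm request only runs the light path.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

def init():
    # Cold-start path: loading tens of GB of weights can take tens of
    # seconds to minutes, as measured in our experiment.
    global processor, model
    processor = BlipProcessor.from_pretrained(
        "Salesforce/blip-image-captioning-base"
    )
    model = BlipForConditionalGeneration.from_pretrained(
        "Salesforce/blip-image-captioning-base"
    ).to("cuda")

def inference(image_path: str) -> str:
    # Warm path: a single forward pass, typically seconds on a GPU.
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt").to("cuda")
    output_ids = model.generate(**inputs, max_new_tokens=20)
    return processor.decode(output_ids[0], skip_special_tokens=True)
```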

The experiment: Evaluating cold starts and warm starts when making an inference

We wanted to explore the feasibility of using serverless AI models in our production environment, and we needed a better understanding of cold-start and warm-start performance when making an inference. To evaluate this, we created a small test program that would send a sequence of four images (see “Images Used” below) to the Salesforce BLIP model, deployed on both Replicate and Banana.

We then waited for a fixed interval and repeated the sequence. For each sequence, we would measure the cold-start time with the first image — how long it took to generate the first inference — and the three remaining images would be used to measure the warm-start time.

Each run would execute 45 sequences, resulting in 180 inference calls. Each run would last 11 hours, and the total experiment would take place over five days:

  • Time period: November 13, 2022 to November 17, 2022
  • Total runs: five
  • Duration of each run: 11 hours
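The harness itself isn’t included in this post, but the logic described above amounts to something like the following sketch, where the endpoint URL, file names, and interval between sequences are assumptions:

```python
# Sketch of the measurement harness described above (reconstructed;
# the endpoint URL, file names, and interval are assumptions).
import time
import requests

IMAGES = ["banana.jpg", "drone.jpg", "couple_on_beach.jpg", "boardroom.jpg"]
ENDPOINT = "https://example.com/blip-inference"  # hypothetical endpoint
INTERVAL_SECONDS = 14 * 60  # assumed; ~15-minute gaps fit 45 sequences in 11 hours

def run_sequence() -> dict:
    timings = []
    for path in IMAGES:
        start = time.time()
        with open(path, "rb") as f:
            requests.post(ENDPOINT, files={"image": f}).raise_for_status()
        timings.append(time.time() - start)
    # The first call pays the cold start; the other three measure warm starts.
    return {"cold_start_s": timings[0], "warm_start_s": timings[1:]}

results = []
for _ in range(45):  # one run: 45 sequences, 180 inference calls
    results.append(run_sequence())
    time.sleep(INTERVAL_SECONDS)
```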

The results: Replicate’s management and bootstrapping lead to better performance on cold and warm starts

Scenario 1: Measuring inference on cold start

Replicate performs much better on the minimum cold-start metric. This is due to how each provider manages its platform. Replicate abstracts away how long it keeps the model running after no more requests come in: a model on Replicate can stay warm for a few seconds or a few minutes, but you don’t have any control over this.

Banana, on the other hand, gives the user more say. Figure 4 is a screenshot of Banana’s configuration panel. For this experiment, we set the idle timeout to 60 seconds, meaning the model would be destroyed after one minute of no activity. Since the interval between our sequences was always longer than one minute, we would always incur a cold start on the first image.

Figure 4: Banana model configuration panel

Scenario 2: Measuring inference on warm start

Replicate and Banana have comparable average warm-start times. However, as the interval between sequences increases, Banana takes longer. Further, the maximum warm-start time is consistently higher with Banana, which might be explained by how they handle their bootstrapping.

The evolution of and advances to come in serverless GPU

Based on the results of our experiment, serverless AI isn’t ready for real-time inference. While warm-start times are longer than we’d ideally like to see, they’re within an acceptable range for practical use. On the other hand, cold-start times — currently well above one minute, on average — are just too long for implementation right now.

When AWS first deployed Lambda functions in 2014, they had many issues around cold and warm starts. Most of these have since been resolved, and Lambda start times are now reasonably predictable. Serverless GPU isn’t there yet, but Replicate and Banana are actively looking at this and exploring solutions.

We want to thank the Banana and Replicate teams, as well as their user communities, who were extremely helpful in getting our models up and running. (Both platforms’ Discord channels helped resolve several issues and come highly recommended.) In 2023, we’ll see meaningful advances in serverless GPU, and we expect both cold- and warm-start times to drop significantly. This experiment was a quick evaluation of serverless GPU, and we plan to do more work in this area.

Please let us know if you have any suggestions on how we can improve this or know of any vendors we should look at. We appreciate any feedback and input. You can also find me here: LinkedIn | GitHub | Twitter

Images Used

Banana, Picture of Drone, Couple on Beach, Boardroom

References

  1. https://replicate.com/docs
  2. https://replicate.com/docs/how-does-replicate-work
  3. https://docs.banana.dev/banana-docs/
  4. https://www.ibm.com/topics/serverless
  5. https://shouldroforion.medium.com/battle-of-the-serverless-part-2-aws-lambda-cold-start-times-1d770ef3a7dc
  6. https://github.com/salesforce/BLIP
  7. https://arxiv.org/pdf/2201.12086.pdf — BLIP Research Paper


About the author:

Shannon Lal is VP of Technology and Engineering at designstripe. With 20 years of experience in software development and leadership, he specializes in DevOps and scaling large systems. He’s also contributed to several open-source projects. Outside of work, he loves to sail, ski and run with his family.
