GPU Virtual Machines For KAITO Models on AKS

The Kubernetes AI Toolchain Operator (KAITO) setup on AKS provisions GPU node pools. This blog explores the VM instance types, and the NVIDIA GPUs they carry, that are allocated to the provisioned AKS node pools. These node pools host KAITO-supported language models to expose an inference API service.

The KAITO project repo provides examples of AI inference models that are deployed as a Kaito workspace. You can find the full list at https://github.com/kaito-project/kaito/tree/main/examples/inference

Here is one example, the Falcon 7B model, whose workspace references a VM instance type with an NVIDIA GPU.
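A minimal workspace manifest for falcon-7b looks roughly like the following sketch. The field layout follows the Kaito Workspace CRD as used in the repo's examples; the workspace name and label are placeholders, so check the linked example files for the authoritative version:

```yaml
apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: workspace-falcon-7b          # placeholder name
resource:
  instanceType: "Standard_NC12s_v3"  # GPU VM size for the provisioned node pool
  labelSelector:
    matchLabels:
      apps: falcon-7b                # placeholder label
inference:
  preset:
    name: "falcon-7b"                # Kaito preset for the model
```

The `resource.instanceType` field is the part this post focuses on: it determines which GPU VM size Kaito asks AKS to provision for the node pool.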

For easy analysis, I compiled the following table from some of the inference model YAML files in the list above, together with their defined VM instance types. Note that only three GPU VM instance types are used across all the example YAML files.

| Model | VM Instance Type | vCPUs | Memory (GB) | Accelerator GPU | Cost (USD/month) |
|---|---|---|---|---|---|
| falcon-7b, llama-2-13b-chat, mistral-7b | Standard_NC12s_v3 | 12 (Intel Xeon E5-2690 v4, Broadwell) | 224 | 2x Nvidia Tesla V100 (16 GB) | $4,870 |
| falcon-40b, llama-2-70b, phi-3-medium | Standard_NC96ads_A100_v4 | 96 | 880 | 4x Nvidia A100 PCIe (80 GB) | $13,948 |
| phi-3-mini | Standard_NC6s_v3 | 6 | 112 | 1x Nvidia Tesla V100 (16 GB) | $2,435 |

The first two sets of models are large language models with billions of parameters. As you can see, these VM instance types come with Nvidia GPUs along with a large amount of CPU and memory. I also checked the Azure pricing calculator to find the cost per month in USD, which is staggering. So when deploying and operating these models in their GPU node pools, you want to save on costs by deprovisioning the Kaito workspace and node pool, or by shutting down the node pool when the inference service is not in use.
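Either of those cost-saving options can be done from the command line. This is a sketch with placeholder names (workspace, resource group, cluster, and node pool names are all assumptions - substitute your own):

```shell
# Option 1: deprovision the Kaito workspace entirely.
# Kaito removes the GPU node pool it provisioned for the workspace.
kubectl delete workspace workspace-falcon-7b

# Option 2: keep the workspace but stop the GPU node pool while idle...
az aks nodepool stop \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --nodepool-name gpunp

# ...and start it again when the inference service is needed.
az aks nodepool start \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --nodepool-name gpunp
```

Stopping the node pool deallocates the GPU VMs so you stop paying compute charges, while the node pool configuration stays in place for a quick restart.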

Note that you can modify the inference YAML files to use any other GPU-accelerated VM you like by choosing from the NC sub-family of GPU-accelerated VM series.
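Swapping the instance type in the workspace's resource block is the only change needed; the rest of the spec stays the same. For example (the size below is just one alternative NC-series choice; verify it is available in your region and meets the model's memory requirements):

```yaml
resource:
  instanceType: "Standard_NC24ads_A100_v4"  # example alternative: 1x A100 80 GB, 24 vCPUs
```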

Here’s some more background on those NVIDIA GPUs.

VM GPU NC family

NC-series VMs are ideal for training complex machine learning models and running AI applications. The NVIDIA GPUs provide significant acceleration for computations typically involved in deep learning and other intensive training tasks.

NCv3-series VMs are powered by NVIDIA Tesla V100 GPUs. For the NVIDIA product page, see https://www.nvidia.com/en-gb/data-center/tesla-v100/

The NC A100 v4 series is powered by NVIDIA A100 PCIe GPUs and third-generation AMD EPYC™ 7V13 (Milan) processors. The VMs feature up to 4 NVIDIA A100 PCIe GPUs with 80 GB of memory each, up to 96 non-multithreaded AMD EPYC Milan processor cores, and 880 GiB of system memory.

For the NVIDIA A100 product page, see https://www.nvidia.com/en-us/data-center/a100/

I hope this gives you a clearer understanding of the GPU node pools and what they provide, so that you are better informed when deploying these inference models with the appropriate hardware compute specifications.

See my other blog post, which gives a step-by-step deployment of Kaito: Effortlessly Setup Kaito v0.3.1 on Azure Kubernetes Service To Deploy A Large Language Model.
