Summary of AKS KAITO Preset Language Models and GPUs

The Azure Kubernetes Service (AKS) AI toolchain operator (KAITO) provides pre-configured deployments of language models, streamlining installation by automating GPU node provisioning and inference setup. Inference is the runtime process in which the model generates outputs (text, code, embeddings, etc.) from input prompts.

https://github.com/kaito-project/kaito/tree/official/v0.5.1/examples/inference

Opening any of these inference Workspace CRD YAML files, two configuration parameters stand out: the preset model name and the instanceType (VM SKU).
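For reference, a preset Workspace manifest from those examples has roughly this shape (the field values here are illustrative; see the linked repo for the exact manifests):

```yaml
apiVersion: kaito.sh/v1beta1
kind: Workspace
metadata:
  name: workspace-phi-4-mini
resource:
  instanceType: "Standard_NC24ads_A100_v4"   # the GPU VM SKU KAITO provisions
  labelSelector:
    matchLabels:
      apps: phi-4-mini
inference:
  preset:
    name: phi-4-mini-instruct                # the preset language model
```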

As I browse through these language model preset configuration examples, I ask myself:

  • What are these models suitable for?
  • For each VM SKU instance type, what are its capacity and general cost?

Supported Preset Model Families 

The following tables are taken from the inference YAML file examples.

DeepSeek

| Language model | Purpose / benefit / advantage | Best used for (scenarios) | instanceType |
|---|---|---|---|
| deepseek-r1-0528 | Strong multi-step reasoning | Complex planning, math/logic-heavy prompts | Standard_NC80adis_H100_v5 |
| deepseek-r1-distill-llama-8b | Cheaper/faster “reasoning-ish” behavior | Reasoning workloads with tighter latency/cost | Standard_NC24ads_A100_v4 |
| deepseek-r1-distill-qwen-14b | Distilled reasoning w/ more capacity | Mid-budget reasoning + analysis-heavy chat | Standard_NC24ads_A100_v4 |
| deepseek-v3-0324 | Strong general instruction/chat | General assistant, summarization, Q&A | Standard_NC80adis_H100_v5 |

Llama

| Language model | Best used for (scenarios) | instanceType |
|---|---|---|
| llama-3.1-8b-instruct | Default assistant, extraction, summarization | Standard_NC24ads_A100_v4, Standard_NC96ads_A100_v4 |
| llama-3.3-70b-instruct | Higher-quality responses, harder prompts | Standard_NC48ads_A100_v4 |

Falcon

| Language model | Best used for (scenarios) | instanceType |
|---|---|---|
| falcon-7b | Simple chat/completion; compatibility | Standard_NC24ads_A100_v4 |
| falcon-7b-instruct | Basic assistant tasks | Standard_NC12s_v3 |
| falcon-40b | Higher-quality completion prompts | Standard_NC96ads_A100_v4 |
| falcon-40b-instruct | Mid-tier assistant tasks | Standard_NC12s_v3 |

Mistral

| Language model | Best used for (scenarios) | instanceType |
|---|---|---|
| mistral-7b | Completion-style prompts, experimentation | Standard_NC24ads_A100_v4 |
| mistral-7b-instruct | Cost-effective chat + structured output | Standard_NC24ads_A100_v4 |
| mistral-large-3-675b-instruct | Max quality (high cost/latency) | Standard_NC80adis_H100_v5 |
| ministral-3-3b-instruct | Low-latency, high-QPS assistants | Standard_NV36ads_A10_v5 |
| ministral-3-8b-instruct | Better quality than 3B at modest cost | Standard_NC24ads_A100_v4 |
| ministral-3-14b-instruct | Stronger reasoning/writing than 8B | Standard_NC24ads_A100_v4 |

Phi

| Language model | Best used for (scenarios) | instanceType |
|---|---|---|
| phi-3-mini-4k-instruct | Quick Q&A, classification, short chat | Standard_NC4as_T4_v3 |
| phi-3-mini-128k-instruct | Long doc Q&A/summarization at low cost | Standard_NC6s_v3, Standard_NC24ads_A100_v4 |
| phi-3-medium-4k-instruct | Better quality than mini on typical prompts | Standard_NC8as_T4_v3 |
| phi-3-medium-128k-instruct | Long-context reasoning/summarization | Standard_NC24ads_A100_v4 |
| phi-3.5-mini-instruct | General low-cost assistant workloads | Standard_NC24ads_A100_v4 |
| phi-4 | Higher quality assistant than Phi-3.x | Standard_NC24ads_A100_v4 |
| phi-4-mini-instruct | Tool calling, function calling, fast chat | Standard_NC24ads_A100_v4 |

Qwen

| Language model | Best used for (scenarios) | instanceType |
|---|---|---|
| qwen2.5-coder-7b-instruct | Code gen, refactors, tests, code Q&A | Standard_NC24ads_A100_v4 |
| qwen2.5-coder-32b-instruct | Harder coding tasks, bigger refactors | Standard_NC24ads_A100_v4 |

For the instanceType VM SKUs, I find it difficult to understand what the names mean and what they cost. So here’s a basic breakdown that I find sufficient.

| Component | Example Value | Description |
|---|---|---|
| Family | N | The primary workload category (e.g., N for GPU/specialized). |
| Sub-family | C | Specialized differentiation within a family (e.g., C for compute-intensive GPU). |
| # of vCPUs | 24 | The number of virtual CPU cores allocated to the VM. |
| Additive Features | ads | Lowercase letters denoting specific hardware traits: a (AMD-based): the processor is an AMD EPYC instead of the default Intel; d (Local Disk): the VM includes a local, non-persistent temporary SSD (temp disk); s (Premium Storage): the VM supports Premium Storage disks. |
| Accelerator Type | A100 | The specific hardware accelerator. See table below. |
| Version | v4 | The generation of the underlying hardware (e.g., v4, v5). |
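To make the breakdown concrete, the components can be pulled apart programmatically. This is a rough sketch following the common `Standard_<family><sub-family><vCPUs><features>[_<accelerator>]_<version>` pattern seen in the SKUs above; it is illustrative only, not an official parser, and some Azure SKU names deviate from it:

```python
import re

# Decompose an Azure VM SKU name into the components described above.
SKU_RE = re.compile(
    r"^Standard_"
    r"(?P<family>[A-Z])"                     # workload category, e.g. N = GPU
    r"(?P<subfamily>[A-Z]?)"                 # e.g. C = compute-intensive GPU
    r"(?P<vcpus>\d+)"                        # virtual CPU count
    r"(?P<features>[a-z]*)"                  # additive features: a, d, s, ...
    r"(?:_(?P<accelerator>[A-Za-z0-9]+))?"   # optional accelerator, e.g. A100
    r"_(?P<version>v\d+)$"                   # hardware generation
)

def parse_sku(name: str) -> dict:
    """Return the named components of a SKU, or raise if it doesn't match."""
    m = SKU_RE.match(name)
    if not m:
        raise ValueError(f"unrecognized SKU format: {name}")
    return m.groupdict()

print(parse_sku("Standard_NC24ads_A100_v4"))
```

Note that SKUs without an accelerator token, like `Standard_NC12s_v3`, still parse; the accelerator field simply comes back empty.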

To drill down into the accelerator types:

| GPU | Architecture | Tier | Primary Usage |
|---|---|---|---|
| H100 | Hopper | Cutting-edge | Frontier AI training |
| A100 | Ampere | Enterprise flagship | Training + fine-tuning |
| A10 / A10G | Ampere | Mid-tier | Inference + moderate training |
| T4 | Turing | Entry-level | Inference |
| V100 | Volta | Legacy high-end | Older training workloads |

For testing purposes, and to keep costs down, I first test preset inference models on older GPUs. This helps me understand the deployment and build apps against the inference endpoint before moving on to more costly and capable models. You can take a preset configured for a newer architecture and change the instance type to a smaller-capacity or older SKU, but it is trial and error. A handful of times, the GPU driver loaded on the node was not compatible with the language model.


Finally, here is a rough cost breakdown. The takeaway is the relative pricing, which helps you judge your options.

| VM SKU | GPU Type | Linux Monthly Cost (USD) | Use Case | NVIDIA Family |
|---|---|---|---|---|
| Standard_NC6s_v3 | 1x V100 | $2,044.00 | Phi-3.5 Vision / Llama-8B | Volta |
| Standard_NC24ads_A100_v4 | 1x A100 | $2,737.50 | Llama-3.2-11B Vision | Ampere |
| Standard_NC48ads_A100_v4 | 2x A100 | $5,475.00 | Llama-3.3-70B (Single Node) | Ampere |
| Standard_ND96asr_v4 | 8x A100 | $21,535.00 | Llama-3.3-70B Multi-node Cluster | Ampere |
| Standard_ND96ov_v5 | 8x H100 | $82,125.00 | Frontier Model Training / Large Clusters | Hopper |
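Monthly figures can obscure the burn rate during short experiments, so it helps to convert them to hourly rates. A quick sketch using the table’s numbers, assuming the common 730-hour billing month Azure uses for estimates:

```python
# Monthly costs (USD) from the table above; hours-per-month is an assumption.
HOURS_PER_MONTH = 730

monthly_costs = {
    "Standard_NC6s_v3": 2044.00,
    "Standard_NC24ads_A100_v4": 2737.50,
    "Standard_NC48ads_A100_v4": 5475.00,
    "Standard_ND96asr_v4": 21535.00,
    "Standard_ND96ov_v5": 82125.00,
}

for sku, monthly in monthly_costs.items():
    print(f"{sku}: ~${monthly / HOURS_PER_MONTH:.2f}/hour")
```

So a one-hour smoke test of a preset on `Standard_NC24ads_A100_v4` runs a few dollars, while leaving an 8x H100 node up by accident costs over a hundred dollars an hour.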

Final Thoughts

I hope this provides some direction and narrows the focus on how to think through the options. It’s important to consider how LLMs and VM SKUs relate to one another. Having tested a handful of models and scenarios over the past few months, I wanted to bring more clarity to the process. I hope this helps.
