In my previous post, I gave an introduction and walked through my own high-level architecture for the RAG Engine. In this post, I will walk through installing the RAG Engine.
Demo prerequisites (AKS + KAITO Workspace + GPU auto-provisioning)
If you’re following my repo’s flow, the “platform” prerequisites are handled in 1-setup-aks-kaito-v0.80.sh and then 2-setup-nginx-ingress-controller.sh:
- Create an AKS cluster
- Install KAITO Workspace Controller
- Install Azure GPU Provisioner (with managed identity + federated credential)
- Deploy a KAITO Workspace (Phi-4) as a model inference service

The Phi-4 workspace matters because the RAGEngine needs an inference service endpoint to call. In this repo, the in-cluster URL is shown in 3-setup-kaito-rag.sh as:
http://workspace-phi-4-mini.default.svc.cluster.local/v1/chat/completions
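As an optional sanity check, you can call that endpoint from inside the cluster with a throwaway curl pod. This assumes the workspace exposes an OpenAI-compatible chat completions API and serves the phi-4-mini-instruct model, as in my repo:

kubectl run curl-test --rm -it --restart=Never --image=curlimages/curl -- \
  curl -sS -X POST http://workspace-phi-4-mini.default.svc.cluster.local/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "phi-4-mini-instruct", "messages": [{"role": "user", "content": "Say hello."}]}'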
At this point, the AKS cluster has two node pools: a system node pool, and a user node pool with one GPU node hosting the Phi-4 model.

The Phi-4 workspace serves model inference as a Kubernetes pod.
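To confirm, you can inspect the Workspace custom resource and its pod (resource names follow my repo; adjust if yours differ):

kubectl get workspace workspace-phi-4-mini
kubectl get pods -o wide | grep workspace-phi-4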

Install the RAG Engine Helm chart
helm repo add kaito https://kaito-project.github.io/kaito/charts/kaito
helm repo update
helm upgrade --install kaito-ragengine kaito/ragengine \
  --namespace kaito-ragengine \
  --create-namespace
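You can confirm the Helm release deployed cleanly before moving on:

helm list -n kaito-ragengine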

Verify RAGEngine Custom Resource Definition
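A quick way to do that, without assuming the exact CRD name:

kubectl get crd | grep -i ragengine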

Verify pod status in kaito-ragengine namespace
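The controller pod should show a Running status:

kubectl get pods -n kaito-ragengine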

Verify kaito-ragengine workload deployment
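For example:

kubectl get deployment -n kaito-ragengine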

Deploy a RAGEngine that uses the BGE Small model (BAAI/bge-small-en-v1.5) for embeddings. This is the Kubernetes YAML manifest:
apiVersion: kaito.sh/v1alpha1
kind: RAGEngine
metadata:
  name: ragengine-example
spec:
  compute:
    instanceType: "Standard_NC8as_T4_v3"
    labelSelector:
      matchLabels:
        apps: ragengine-example
  embedding:
    local:
      modelID: "BAAI/bge-small-en-v1.5"
  inferenceService:
    url: "http://workspace-phi-4-mini.default.svc.cluster.local/v1/chat/completions"
    contextWindowSize: 4000
Apply the manifest.
kubectl apply -f bge-small-ragengine.yaml

As a result, the KAITO controllers (together with the Azure GPU provisioner) orchestrate the creation of a new node pool with the GPU-enabled VM SKU “Standard_NC8as_T4_v3”.
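You can watch the new GPU node join the cluster; node and agent pool names are auto-generated, so yours will differ:

kubectl get nodes -o wide -w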

Allow at least several minutes for the deployment to finish.
The RAGEngine is deployed and ready.
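Readiness can be verified on the RAGEngine custom resource itself:

kubectl get ragengine ragengine-example
kubectl describe ragengine ragengine-example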

The RAGEngine Kubernetes deployment is also created.

Let’s test the deployed RAG Engine.
First, get the ingress IP from the previously deployed ingress controller:
INGRESS_IP=$(kubectl get svc ingress-nginx-controller -n ingress-nginx -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
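Sanity-check that the variable resolved to a public IP before proceeding:

echo "$INGRESS_IP"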
I like to set up an ingress controller and ingress rule so I don’t have to do any port forwarding in a local terminal session. See the runbook-style shell script where I set up the ingress rule for the RAG engine at https://github.com/RoyKimYYZ/aks-demos/blob/main/aks-kaito/2-setup-nginx-ingress-controller.sh
Ingest text documents and create a vector index with the following curl command. There are two documents: the first is sample text from a fictional cloud handbook, and the second is a short description of RAG.
RAG_PATH="rag-nostorage"curl -sS -X POST "http://$INGRESS_IP/$RAG_PATH/index" \ -H "Content-Type: application/json" \ -d '{ "index_name": "rag_index", "documents": [ { "text": "Northwind Research Cloud Handbook (Fictional) v2.3 (2026-01-15). Product: Beacon real-time analytics for IoT telemetry. Default retention: raw telemetry 30 days, aggregated metrics 400 days, alert history 180 days. Security rules: encryption at rest AES-256 and in transit TLS 1.2+. Service-to-service tokens must be short-lived with max TTL 60 minutes. Incident comms cadence: SEV-0 updates every 15 minutes, SEV-1 every 30 minutes. API limits (per tenant, production): Query API 300 requests/min with burst up to 2x for 90 seconds. Query constraints: max query window 31 days, max rows 50000, timeout 20 seconds.", "metadata": { "author": "Roy Kim" } }, { "text": "Retrieval Augmented Generation (RAG) is an architecture that augments the capabilities of a Large Language Model (LLM) by adding an information retrieval system that provides grounding data.", "metadata": { "author": "kaito" } } ] }' | jq
The JSON response includes the doc_id of each text chunk, the chunk text, and the assigned metadata.

To list indexes, run the following curl command:
curl -sS "http://$INGRESS_IP/$RAG_PATH/indexes" | jq

Now let’s ask a question against the indexed documents via the RAG Engine: “What is the default retention for raw telemetry vs aggregated metrics vs alert history? Give the numbers and units.”
I run the following curl command:
curl -sS -X POST "http://$INGRESS_IP/$RAG_PATH/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "index_name": "rag_index",
    "model": "phi-4-mini-instruct",
    "messages": [
      {"role": "system", "content": "You are a knowledgeable assistant."},
      {"role": "user", "content": "What is the default retention for raw telemetry vs aggregated metrics vs alert history? Give the numbers and units."}
    ],
    "temperature": 0.5,
    "max_tokens": 2048,
    "context_token_ratio": 0.5
  }' | jq
The JSON response’s content field contains the grounded answer:
“The default retention according to the Northwind Research Cloud Handbook v2.3 is as follows:
- Raw telemetry: 30 days
- Aggregated metrics: 400 days
- Alert history: 180 days”
Note that the curl command against the chat completions API sets the model to phi-4-mini-instruct. This model takes the retrieved search result(s) and generates a grounded response back to the user.
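As one more test, you can ask about a different fact from the ingested handbook, for example the Query API limits, reusing the same endpoint and model:

curl -sS -X POST "http://$INGRESS_IP/$RAG_PATH/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "index_name": "rag_index",
    "model": "phi-4-mini-instruct",
    "messages": [
      {"role": "user", "content": "What are the production Query API rate limits per tenant?"}
    ],
    "temperature": 0.5,
    "max_tokens": 1024
  }' | jq

A grounded answer should come back with 300 requests/min and a burst of up to 2x for 90 seconds, per the ingested handbook text.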
Final Thoughts
I have shown an option for implementing a RAG solution that you can manage and control entirely in your AKS cluster. Building a RAG solution normally involves a lot of plumbing, but with KAITO RAGEngine you get much of it out-of-the-box.