Did My Apps Go Down During AKS Upgrade?

I ran an experiment: a load test against 4 applications while executing an Azure Kubernetes Service (AKS) cluster upgrade, to see the effects and any downtime.

The applications hosted in my AKS cluster are:

  • AKS Helloworld – super simple application
  • Voting App – simple application with Redis backend
  • Bookinfo Istio demo – microservices architecture with Istio Service Mesh
  • WordPress – popular blogging platform with a MySQL database and Redis. This one has stateful persistent storage.

JMeter applies a constant HTTP traffic load to all of these sites with up to 350 threads (virtual users). The following shows the HTTP URLs being requested, scrolling live.

The summary report 59 seconds into the load test, as traffic is being applied.

The AKS cluster specs are:

  • 3 nodes with Ubuntu Linux and VM SKU Standard_D4s_v3
  • Current version 1.26.10

After some 30 minutes of load testing, this is the overall cluster performance, node count and active pod count.

The node pool has 3 nodes. The active load is putting CPU pressure on one of the nodes, and each node is on version 1.26.10.

While JMeter continues to apply load, I start upgrading the cluster to version 1.27.7.
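For anyone who wants to script this step rather than click through the portal, here is a minimal sketch that builds the Azure CLI call for the upgrade. It assumes the `az` CLI is installed and logged in. The resource group and cluster names are hypothetical placeholders, not the ones from my cluster.

```python
import subprocess


def build_upgrade_command(resource_group, cluster_name, target_version):
    """Build the `az aks upgrade` argument list for a cluster upgrade."""
    return [
        "az", "aks", "upgrade",
        "--resource-group", resource_group,
        "--name", cluster_name,
        "--kubernetes-version", target_version,
        "--yes",  # skip the interactive confirmation prompt
    ]


def run_upgrade(resource_group, cluster_name, target_version):
    """Run the upgrade; raises CalledProcessError if the CLI reports a failure."""
    command = build_upgrade_command(resource_group, cluster_name, target_version)
    subprocess.run(command, check=True)


if __name__ == "__main__":
    # Hypothetical names: replace with your own resource group and cluster.
    print(" ".join(build_upgrade_command("my-resource-group", "my-aks-cluster", "1.27.7")))
```

Printing the command first, as the example does, is a cheap way to double-check the target version before calling `run_upgrade`.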


After a few minutes, we can see that a new node on the target version 1.27.7 is provisioned.

We can observe some downtime for some of those applications. This could be due to high CPU pressure on one of the nodes.

Two nodes are now on 1.27.7 and pods are being scheduled onto them, while the older nodes are being drained of pods.

A node with 1.26.10 is removed.

A 3rd node with 1.27.7 is provisioned with 7 pods.

The newer nodes are increasing in pod count, while the older node is decreasing in pod count as pods are being evicted.

Pod count continues to increase on the newer nodes.

At this point, a look at the load test shows there have been some HTTP errors.

Yet with the upgrade cycle almost done, performance is stable and the HTTP requests return no errors.
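To pin down when the errors happened instead of eyeballing the live view, the JMeter results file can be bucketed per application and per minute. The sketch below assumes JMeter was configured to save results as CSV with a header row containing at least the `timeStamp`, `label` and `success` columns. The sample rows and labels are made up for illustration and are not data from my test.

```python
import csv
import io
from collections import defaultdict


def error_rate_by_minute(jtl_file):
    """Return {(label, minute): (errors, total)} from a JMeter CSV results file.

    `minute` is counted from the timestamp of the first row, so this assumes
    the first row is (close to) the earliest sample in the file.
    """
    counts = defaultdict(lambda: [0, 0])
    start_ms = None
    for row in csv.DictReader(jtl_file):
        timestamp_ms = int(row["timeStamp"])
        if start_ms is None:
            start_ms = timestamp_ms
        minute = (timestamp_ms - start_ms) // 60000
        bucket = counts[(row["label"], minute)]
        bucket[1] += 1  # total samples in this bucket
        if row["success"].strip().lower() != "true":
            bucket[0] += 1  # failed samples in this bucket
    return {key: tuple(value) for key, value in counts.items()}


# Illustrative sample only; a real file would be opened with open(path, newline="").
SAMPLE = """timeStamp,elapsed,label,responseCode,success
1700000000000,120,wordpress,200,true
1700000030000,95,voting-app,200,true
1700000065000,3000,wordpress,503,false
1700000070000,110,wordpress,200,true
"""

if __name__ == "__main__":
    for (label, minute), (errors, total) in sorted(error_rate_by_minute(io.StringIO(SAMPLE)).items()):
        print(f"{label} minute {minute}: {errors}/{total} failed")
```

Lining those minute buckets up against the times each node was drained would show whether the errors cluster around pod evictions or around the CPU pressure.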

Here is an HTTP response time graph from the load test to give a general idea. What’s notable is that the WordPress app, in dark purple, had slower response times in the later stage of the upgrade.

Here we see the node upgrades are complete, with the older nodes deprovisioned.

The AKS cluster upgrade has completed.

Final Thoughts

From my readings, AKS is designed to manage downtime during upgrades, but Kubernetes workloads have to be designed and configured for those situations, for example by employing Pod Disruption Budgets, pod scheduling tactics and multiple replicas. Keep in mind that managing downtime for stateful applications, such as the WordPress app with its MySQL database, can be tricky. This is something I need to investigate further, and I hope to share more rounds of upgrade testing under load in the coming months.
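As a starting point for that investigation, here is a minimal sketch of what a Pod Disruption Budget looks like and how it interacts with replica count during a node drain. The app name and the `app` label key are hypothetical and do not come from my deployments. The arithmetic covers only the simple case of an integer `minAvailable`.

```python
import json


def pdb_manifest(name, app_label, min_available):
    """Build a policy/v1 PodDisruptionBudget manifest as a plain dict."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name},
        "spec": {
            "minAvailable": min_available,
            # Assumes the pods carry an `app` label; adjust to your own labels.
            "selector": {"matchLabels": {"app": app_label}},
        },
    }


def allowed_disruptions(healthy_pods, min_available):
    """How many pods may be evicted right now without breaking the budget."""
    return max(0, healthy_pods - min_available)


if __name__ == "__main__":
    # kubectl accepts JSON manifests as well as YAML.
    print(json.dumps(pdb_manifest("voting-app-pdb", "voting-app", 2), indent=2))
    print(allowed_disruptions(3, 2))  # 3 replicas, keep 2: one eviction at a time
    print(allowed_disruptions(1, 1))  # 1 replica, keep 1: the drain has to wait
```

The second case is the one to watch: a single-replica app either goes down when its node is drained (no budget) or holds up the drain (budget of 1), so adding replicas goes hand in hand with adding a budget.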
