Scalability by Design: How We Learned to Plan Performance & Load Testing for Applications on AKS

Let me tell you a story.

When we first moved to cloud native, our team proudly deployed one of our most critical microservices onto Azure Kubernetes Service (AKS). It was modern, containerized, and fully integrated with Azure DevOps. We had CI/CD running like clockwork, YAML pipelines humming, and a cloud-native architecture that looked great in diagrams.

But then came the "real" traffic.

Chapter 1: When Everything Looked Fine—Until It Wasn’t

At first, everything worked well. Our pods responded under typical test conditions, and our dashboards stayed green. But during a quarterly deadline rush (yep, our product sees heavy load in certain quarters of the year), traffic suddenly spiked by 4x. That’s when things started to break.

Pods started throttling. We saw over-provisioning, random OOM kills, and confusing dashboards. One service was scaling to 20 pods while doing nothing. Another crashed during traffic bursts. Response times jumped from 300ms to over 2s. Errors crept into the logs—timeouts, database retries, cascading failures. Ironically, the autoscaler did kick in, but it didn’t help. We were scaling garbage.

“We never really planned for scalability. We just assumed Kubernetes would handle it.”

Chapter 2: The First Hard Lesson — Know What You’re Scaling

We went back to basics. Not Helm charts or YAML, but questions:

  • What are our most performance-sensitive APIs?
  • How do users actually interact with the system?
  • Are our workloads CPU-bound or I/O-bound?
  • How does load flow across services?

We realized that a single endpoint fanned out to multiple downstream services and triggered slow database joins. We could have predicted it with proper performance modeling.

Chapter 3: Start Small, Then Observe

We began setting intentionally low CPU and memory requests/limits in dev environments so we could observe usage rather than assume it.
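
A minimal sketch of what such a conservative starting point can look like in a dev Deployment (the numbers are illustrative placeholders to observe against, not recommendations):

    # Container resources for a dev Deployment: start small on purpose and let
    # observed usage drive the real values. All numbers are placeholders.
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 250m
        memory: 256Mi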

As we promoted to UAT and load environments, we monitored usage patterns with Datadog. Here's what worked for us:

  • Enable the Datadog Cluster Agent and kube-state-metrics to gather HPA metrics and container-level resource stats.
  • In Datadog, use the Live Containers and Containers Overview dashboards to track:
    • container.cpu.usage vs container.cpu.limit
    • container.memory.usage vs container.memory.limit
    • kubernetes.cpu.usage.total and kubernetes.memory.usage at pod and node level
  • Apply filters by namespace or deployment to identify under- or over-provisioned pods.
  • Use horizontal pod autoscaling dashboards to correlate metrics like CPU throttling (container.cpu.throttled) with sudden pod restarts.
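
If you deploy the agent with the official Datadog Helm chart, the first bullet corresponds roughly to values like these (a sketch; key names can differ across chart versions, so verify against yours):

    # values.yaml for the Datadog Helm chart (sketch; verify keys for your chart version)
    datadog:
      apiKeyExistingSecret: datadog-secret   # assumes the API key is stored in a Kubernetes secret
      kubeStateMetricsCore:
        enabled: true                        # collect kube-state-metrics-based cluster metrics
    clusterAgent:
      enabled: true                          # Cluster Agent for cluster-level and HPA/external metrics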

We discovered:

  • Some services peaked in memory briefly during load spikes — their memory requests and limits were set too low, leading to OOM kills.
  • Others were CPU-starved during cold starts due to low request values and slow initialization.

Updating HPA Based on Observations

After analysis:

  1. We updated our HPA configs to target 60–70% average CPU and memory utilization, leaving headroom:
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: my-api-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: my-api
      minReplicas: 2
      maxReplicas: 10
      metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 65
      - type: Resource
        resource:
          name: memory
          target:
            type: Utilization
            averageUtilization: 70
  2. We verified with Datadog that pod autoscaling was driven by actual load — not burst noise — by checking sustained utilization trends over 5–10 minute rolling windows.
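
To keep burst noise from driving scaling at the HPA level itself, autoscaling/v2 also supports an optional behavior section. A minimal sketch (the windows and policy values are illustrative, not the exact numbers we shipped):

    # Optional HPA behavior tuning: slow down scale-down so brief spikes and dips
    # don't cause pod churn. Values are illustrative placeholders.
    behavior:
      scaleDown:
        stabilizationWindowSeconds: 300   # look back 5 minutes before removing pods
        policies:
        - type: Percent
          value: 50                       # remove at most 50% of replicas per minute
          periodSeconds: 60
      scaleUp:
        stabilizationWindowSeconds: 0     # react to genuine load spikes immediately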

This approach helped us tune our Helm values and autoscaling rules based on real-world behavior, not theory.

Chapter 4: Simulate Reality, Not Ideals

We used k6 to model full user journeys, not just endpoints. We simulated:

  • Login
  • Document upload
  • Validation trigger
  • Processing wait
  • Report download

We tested from 100 to 2000 users. One API consistently failed once concurrent uploads exceeded 150. That test saved us more than any sprint metric.

Chapter 5: Enter KEDA and Virtual Nodes

We used HPA for scaling API pods based on latency and KEDA for event-driven scaling (Azure Service Bus).
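
For the Service Bus consumers, the scaling rule lives in a KEDA ScaledObject. A minimal sketch, with illustrative deployment, queue, and threshold values rather than our production settings:

    # KEDA ScaledObject: scale a worker Deployment on Azure Service Bus queue depth.
    apiVersion: keda.sh/v1alpha1
    kind: ScaledObject
    metadata:
      name: document-worker-scaler
    spec:
      scaleTargetRef:
        name: document-worker              # Deployment that consumes the queue
      minReplicaCount: 1
      maxReplicaCount: 20
      triggers:
      - type: azure-servicebus
        metadata:
          queueName: document-processing
          messageCount: "20"               # target backlog per replica before scaling out
          connectionFromEnv: SERVICEBUS_CONNECTION   # Service Bus connection string from an env var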

We also enabled virtual nodes to burst compute onto Azure Container Instances (ACI) during spikes. Now we were scaling smart—not wide.
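
For burst pods to actually land on the virtual nodes, the workload also needs the ACI node selector and tolerations from the AKS documentation; roughly:

    # Pod spec additions so burst replicas can schedule onto AKS virtual nodes (ACI).
    nodeSelector:
      kubernetes.io/role: agent
      beta.kubernetes.io/os: linux
      type: virtual-kubelet
    tolerations:
    - key: virtual-kubelet.io/provider
      operator: Exists
    - key: azure.com/aci
      effect: NoSchedule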

Chapter 6: Automation is Cheaper Than Panic

We baked performance tests into our CI/CD:

  • k6 tests pre-deploy
  • Fail builds on latency or error thresholds
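
In the Azure DevOps pipeline this boils down to a step that runs the k6 journey and fails the job when the thresholds defined in the script are breached (a sketch; the image, paths, and script name are illustrative):

    # Azure Pipelines step: run k6 before deploy. k6 exits non-zero when a
    # threshold in the script fails, which fails the build.
    - script: |
        docker run --rm \
          -v $(System.DefaultWorkingDirectory)/loadtests:/scripts \
          grafana/k6 run /scripts/user-journey.js
      displayName: Run k6 load test gate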

We added Azure Load Testing for staging and relied on Datadog + Grafana to monitor production metrics. We finally had a system we could trust to scale under pressure.

Epilogue: What I’d Tell Past Me

“Don’t assume Kubernetes will scale for you. It can, but only if you tell it how, when, and why.”

Start small with resource limits. Observe and refine as you promote through environments. Use tools like Datadog to let real usage guide your configurations. Plan scalability the way you plan features. Simulate pain before your users feel it.

Because when you plan scalability by design, scaling becomes a feature—not a fire.

Remember,

Scaling isn't magic — it's the outcome of good observability, collaboration, and patience. Once we began to measure and iterate, our Kubernetes setup stopped being a black box and started being a powerful lever for performance and cost-efficiency.