Scaling & performance tips

ArmoniK is designed to scale horizontally. This page describes the scaling mechanisms available and gives recommendations for tuning performance.


How autoscaling works

ArmoniK uses KEDA (Kubernetes Event-driven Autoscaler) together with the Kubernetes HPA (Horizontal Pod Autoscaler) to automatically scale the number of worker pods in each partition based on the number of queued tasks.

The flow is:

  1. Tasks are submitted and enter the queue.

  2. The Metrics Exporter computes aggregated task-count metrics and exposes them to Prometheus.

  3. KEDA reads those metrics and drives the HPA for each partition.

  4. The HPA scales worker pods up or down within the configured min_replicas / max_replicas bounds.

See Monitoring in ArmoniK for details on the Metrics Exporter and Prometheus setup.
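
As a rough illustration of step 4, the sketch below approximates how a replica count can be derived from a queue-length metric when the HPA uses an average-value target. The function name and the target_per_replica parameter are hypothetical, not ArmoniK settings. The real HPA also applies a tolerance and stabilisation windows, and KEDA itself handles the transition between zero and one replica.

```python
import math


def desired_replicas(queued_tasks: int, target_per_replica: int,
                     min_replicas: int, max_replicas: int) -> int:
    """Approximate replica count for a queue-length metric (illustrative only)."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    # Average-value target: ceil(total metric / target value per pod).
    wanted = math.ceil(queued_tasks / target_per_replica)
    # Clamp the result to the configured min_replicas / max_replicas bounds.
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    # Hypothetical target of 2 queued tasks per worker pod.
    for queued in (0, 5, 25):
        print(queued, desired_replicas(queued, 2, min_replicas=0, max_replicas=10))
```

With these assumed values, an empty queue yields zero replicas, and a queue of 25 tasks is capped at max_replicas.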


Configuring scaling per partition

Scaling bounds are set per partition in parameters.tfvars:

compute_plane = {
  default = {
    replicas     = 1   # initial replica count
    min_replicas = 0   # scale down to zero when idle
    max_replicas = 10  # upper bound
    # ...
  }
}

  • Set min_replicas = 0 to allow complete scale-down when there is no work, reducing costs on cloud deployments.

  • Set min_replicas >= 1 if cold-start latency is unacceptable for your workload.

  • Size max_replicas to the number of nodes (or node slots) available in the partition’s node pool.
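
One way to estimate the number of node slots for max_replicas is to divide each node's allocatable resources by the worker pod's requests and take the tighter of the CPU and memory limits. The sketch below assumes identical nodes and uses made-up figures. The function and parameter names are illustrative, and the pod request should be the sum over all containers in the pod.

```python
def max_replicas_for_pool(nodes: int,
                          node_cpu_millicores: int, node_memory_mib: int,
                          pod_cpu_millicores: int, pod_memory_mib: int) -> int:
    """Estimate how many worker pods fit in a pool of identical nodes."""
    if pod_cpu_millicores <= 0 or pod_memory_mib <= 0:
        raise ValueError("pod requests must be positive")
    # Pods that fit on one node, limited separately by CPU and by memory.
    pods_by_cpu = node_cpu_millicores // pod_cpu_millicores
    pods_by_memory = node_memory_mib // pod_memory_mib
    # The scarcer resource decides the number of slots per node.
    return nodes * min(pods_by_cpu, pods_by_memory)


if __name__ == "__main__":
    # Assumed pool: 5 nodes with 3900m CPU and 14000 MiB allocatable each,
    # and worker pods requesting 1000m CPU and 2048 MiB.
    print(max_replicas_for_pool(5, 3900, 14000, 1000, 2048))
```

In this example CPU is the limiting resource (3 pods per node), so the estimate is 15 replicas.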


Prometheus sizing

Prometheus is a critical component: it feeds the HPA metrics pipeline. Under-sizing it causes stale metrics and delayed scaling reactions.

Follow the golden rule of running two separate Prometheus instances:

Instance              Purpose                                 Retention
--------------------  --------------------------------------  ---------
HPA-dedicated         Feeds KEDA / HPA scaling decisions      Hours
Monitoring-dedicated  Long-term observability and dashboards  ≥ 1 week

Both instances require persistent storage configured before deployment. Without persistence, a Prometheus restart causes metric gaps that can stall autoscaling.
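
To size the persistent volumes, the Prometheus documentation gives a rule of thumb: disk space is roughly retention time multiplied by ingested samples per second multiplied by bytes per sample, with about 1 to 2 bytes per sample on average. The sketch below applies that formula. The ingestion rates and the 6-hour retention are placeholder assumptions, so measure the real values on your own deployment.

```python
def prometheus_disk_bytes(retention_seconds: int,
                          samples_per_second: float,
                          bytes_per_sample: float = 2.0) -> float:
    """Rule-of-thumb disk estimate: retention * ingestion rate * sample size."""
    return retention_seconds * samples_per_second * bytes_per_sample


if __name__ == "__main__":
    HOUR = 3600
    WEEK = 7 * 24 * HOUR
    # Assumed ingestion rates; both figures are placeholders.
    hpa = prometheus_disk_bytes(6 * HOUR, samples_per_second=1_000)
    monitoring = prometheus_disk_bytes(WEEK, samples_per_second=10_000)
    print(f"HPA-dedicated:        {hpa / 1e6:.1f} MB")
    print(f"Monitoring-dedicated: {monitoring / 1e9:.1f} GB")
```

The estimate excludes the write-ahead log and any safety margin, so provision volumes with headroom above these figures.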


Task granularity

Task duration has a strong impact on throughput and cost:

Duration         Recommendation
---------------  ------------------------------------------------------------
< 100 ms         Too short: scheduling overhead dominates. Batch work into larger tasks.
100 ms – 10 min  Ideal range. Preemptible/spot instances can be used safely.
> 10 min         Risk increases on spot instances. Split tasks or implement a checkpoint/recovery mechanism.

Tasks longer than one hour must be split into smaller sub-tasks (target: 10–20 minutes each).
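
The batching advice for very short tasks can be sketched as a greedy grouping of work items until each batch reaches a target duration. This is a minimal client-side illustration, not an ArmoniK API. The function names are hypothetical, and the 50 ms overhead in the usage example is an assumed figure, not a measured one.

```python
from typing import List, Sequence


def batch_by_duration(durations_ms: Sequence[int], target_ms: int) -> List[List[int]]:
    """Greedily group work items until each batch reaches target_ms."""
    if target_ms <= 0:
        raise ValueError("target_ms must be positive")
    batches: List[List[int]] = []
    current: List[int] = []
    total = 0
    for duration in durations_ms:
        current.append(duration)
        total += duration
        if total >= target_ms:
            # Batch is long enough: close it and start a new one.
            batches.append(current)
            current, total = [], 0
    if current:
        # Leftover items form a final, shorter batch.
        batches.append(current)
    return batches


def overhead_fraction(task_ms: float, overhead_ms: float) -> float:
    """Share of wall time spent on per-task overhead."""
    return overhead_ms / (task_ms + overhead_ms)


if __name__ == "__main__":
    batches = batch_by_duration([20] * 120, target_ms=1000)
    print([len(b) for b in batches])
    print(overhead_fraction(50, overhead_ms=50), overhead_fraction(1000, overhead_ms=50))
```

Under the assumed 50 ms of per-task overhead, a 50 ms task spends half of its wall time on overhead, while a 1 s batch spends less than 5 %.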


Worker resource requests and limits

Set CPU and memory requests to reflect actual average usage, and limits conservatively above that. Inaccurate requests lead to:

  • Over-requested: fewer pods scheduled per node, lower utilisation.

  • Under-requested: pods compete for resources, causing throttling or OOM kills.

Profile your worker under realistic load before setting these values in production.
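
As an illustration of turning profiling data into values, the sketch below derives a request from the average of observed usage samples and a limit from the observed peak plus a headroom margin. The 20 % headroom and the sample figures are assumptions chosen for the example. The right margin depends on how bursty the worker is.

```python
import math
import statistics
from typing import Sequence, Tuple


def suggest_request_and_limit(samples: Sequence[int],
                              headroom_percent: int = 20) -> Tuple[int, int]:
    """Return (request, limit): mean usage, and peak usage plus headroom."""
    if not samples:
        raise ValueError("at least one usage sample is required")
    # Request reflects average usage observed under realistic load.
    request = math.ceil(statistics.fmean(samples))
    # Limit sits above the observed peak by the chosen headroom.
    limit = math.ceil(max(samples) * (100 + headroom_percent) / 100)
    return request, limit


if __name__ == "__main__":
    # Hypothetical memory samples in MiB collected during a load test.
    memory_mib = [400, 500, 600, 900]
    print(suggest_request_and_limit(memory_mib))
```

The same function can be applied to CPU samples in millicores. The units of the result match the units of the input.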


Reducing cold-start time

When scaling from zero, the worker image must be pulled and the container started and initialised before the new pod can process tasks. To minimise this delay:

  • Pre-pull worker images on nodes using a DaemonSet or node image caching.

  • Keep worker images small and avoid heavy initialisation in the container entrypoint.

  • Use min_replicas = 1 for latency-sensitive partitions.


Avoiding single points of failure

  • Deploy the control plane with at least 2 replicas behind the ingress.

  • Use a managed MongoDB service (e.g. MongoDB Atlas) or a properly configured MongoDB replica set rather than a single MongoDB pod.

  • Avoid any application-level shared state that cannot scale horizontally (see Application Principles).