Scaling & performance tips
ArmoniK is designed to scale horizontally. This page describes the scaling mechanisms available and gives recommendations for tuning performance.
How autoscaling works
ArmoniK uses KEDA (Kubernetes Event-driven Autoscaling) together with the Kubernetes HPA (Horizontal Pod Autoscaler) to automatically scale the number of worker pods in each partition based on the number of queued tasks.
The flow is:
Tasks are submitted and enter the queue.
The Metrics Exporter computes aggregated task-count metrics and exposes them to Prometheus.
KEDA reads those metrics and drives the HPA for each partition.
The HPA scales worker pods up or down within the configured min_replicas/max_replicas bounds.
See Monitoring in ArmoniK for details on the Metrics Exporter and Prometheus setup.
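The scaling step of this flow can be pictured with a small sketch. It assumes the scaler aims for a fixed number of queued tasks per worker pod (`tasks_per_pod` is a hypothetical name, not an ArmoniK or KEDA setting) and applies a simple ceiling-and-clamp rule. The real KEDA/HPA pipeline also applies stabilisation windows and cooldown periods, which are left out here.

```python
import math


def desired_replicas(queued_tasks: int, tasks_per_pod: int,
                     min_replicas: int, max_replicas: int) -> int:
    """Illustrative replica count for a given queue depth.

    tasks_per_pod is an assumed target: how many queued tasks each
    worker pod is expected to absorb.
    """
    if tasks_per_pod <= 0:
        raise ValueError("tasks_per_pod must be positive")
    # Pods needed to cover the queue at the assumed per-pod target.
    wanted = math.ceil(queued_tasks / tasks_per_pod)
    # Clamp to the partition's configured bounds.
    return max(min_replicas, min(wanted, max_replicas))
```

With min_replicas = 0 an empty queue yields zero pods, and a very deep queue never pushes the result above max_replicas.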
Configuring scaling per partition
Scaling bounds are set per partition in parameters.tfvars:
compute_plane = {
  default = {
    replicas     = 1  # initial replica count
    min_replicas = 0  # scale down to zero when idle
    max_replicas = 10 # upper bound
    # ...
  }
}
Set min_replicas = 0 to allow complete scale-down when there is no work, reducing costs on cloud deployments.
Set min_replicas >= 1 if cold-start latency is unacceptable for your workload.
max_replicas should be sized to the number of nodes (or node slots) available in the partition’s node pool.
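As a rough aid, the recommendations above can be turned into a sanity check run before editing parameters.tfvars. This is a sketch, not an ArmoniK tool: `node_count` and `slots_per_node` are hypothetical inputs describing the partition's node pool, and the warning texts are illustrative.

```python
def check_partition_bounds(min_replicas: int, max_replicas: int,
                           node_count: int, slots_per_node: int) -> list[str]:
    """Return warnings for a partition's scaling bounds (illustrative)."""
    warnings = []
    # Assumed capacity: every node can host slots_per_node worker pods.
    capacity = node_count * slots_per_node
    if min_replicas > max_replicas:
        warnings.append("min_replicas exceeds max_replicas")
    if min_replicas == 0:
        warnings.append("scale-to-zero enabled: first tasks pay cold-start latency")
    if max_replicas > capacity:
        warnings.append(
            f"max_replicas={max_replicas} exceeds node pool capacity ({capacity} pods)"
        )
    return warnings
```

An empty list means the bounds are consistent with the guidance; any entry points at a trade-off to review rather than a hard error.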
Prometheus sizing
Prometheus is a critical component: it feeds the HPA metrics pipeline. Under-sizing it causes stale metrics and delayed scaling reactions.
Follow the golden rule of running two separate Prometheus instances:
| Instance | Purpose | Retention |
|---|---|---|
| HPA-dedicated | Feeds KEDA / HPA scaling decisions | Hours |
| Monitoring-dedicated | Long-term observability and dashboards | ≥ 1 week |
Both instances require persistent storage configured before deployment. Without persistence, a Prometheus restart causes metric gaps that can stall autoscaling.
Task granularity
Task duration has a strong impact on throughput and cost:
| Duration | Recommendation |
|---|---|
| < 100 ms | Too short: scheduling overhead dominates. Batch work into larger tasks. |
| 100 ms – 10 min | Ideal range. Preemptible/spot instances can be used safely. |
| > 10 min | Risk increases on spot instances. Split tasks or implement a checkpoint/recovery mechanism. |
Tasks longer than one hour must be split into smaller sub-tasks (target: 10–20 minutes each).
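The thresholds above lend themselves to a quick planning helper. The sketch below encodes the table's boundaries and estimates how many work items to batch, or how many sub-tasks to create. The defaults `target_task_s = 1.0` and `target_sub_task_s = 900` are assumptions chosen to fall inside the recommended ranges, not values prescribed by ArmoniK.

```python
import math

# Thresholds from the guidance above, in seconds.
TOO_SHORT_S = 0.1      # 100 ms
IDEAL_MAX_S = 600      # 10 min
MUST_SPLIT_S = 3600    # 1 hour


def classify_task(duration_s: float) -> str:
    """Map an expected task duration to the matching recommendation."""
    if duration_s < TOO_SHORT_S:
        return "too short: batch work into larger tasks"
    if duration_s <= IDEAL_MAX_S:
        return "ideal"
    if duration_s <= MUST_SPLIT_S:
        return "long: split or checkpoint when running on spot instances"
    return "must split"


def batch_size(item_duration_s: float, target_task_s: float = 1.0) -> int:
    """Work items to group so that one task lasts about target_task_s."""
    return max(1, math.ceil(target_task_s / item_duration_s))


def split_count(duration_s: float, target_sub_task_s: float = 900) -> int:
    """Sub-tasks needed so that each lasts at most target_sub_task_s."""
    return max(1, math.ceil(duration_s / target_sub_task_s))
```

For example, a two-hour job with the default 15-minute target would be cut into `split_count(7200)` sub-tasks, each of which `classify_task` places in the "long" band rather than "must split".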
Worker resource requests and limits
Set CPU and memory requests to reflect actual average usage, and limits conservatively above that. Inaccurate requests lead to:
Over-requested: fewer pods scheduled per node, lower utilisation.
Under-requested: pods compete for resources, causing throttling or OOM kills.
Profile your worker under realistic load before setting these values in production.
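To see why over-requesting lowers utilisation, it can help to estimate how many worker pods fit on one node. The sketch below assumes the scheduler packs pods purely by CPU and memory requests and ignores system-reserved resources and other pods on the node; all parameter names and figures are hypothetical.

```python
import math


def pods_per_node(node_cpu: float, node_mem_gib: float,
                  cpu_request: float, mem_request_gib: float) -> int:
    """Worker pods that fit on one node, limited by the scarcer resource."""
    if cpu_request <= 0 or mem_request_gib <= 0:
        raise ValueError("requests must be positive")
    by_cpu = math.floor(node_cpu / cpu_request)
    by_mem = math.floor(node_mem_gib / mem_request_gib)
    return min(by_cpu, by_mem)
```

On an assumed 16-vCPU, 64 GiB node, doubling the CPU request from 1 to 2 vCPU halves the result of `pods_per_node`, even if the worker's real usage is unchanged.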
Reducing cold-start time
When scaling from zero, the worker image must be pulled and the new pod started and initialised before it can process tasks. To minimise this delay:
Pre-pull worker images on nodes using a DaemonSet or node image caching.
Keep worker images small and avoid heavy initialisation in the container entrypoint.
Use min_replicas = 1 for latency-sensitive partitions.
Avoiding single points of failure
Deploy the control plane with at least 2 replicas behind the ingress.
Use a managed MongoDB service (e.g. MongoDB Atlas) or a properly configured MongoDB replica set rather than a single MongoDB pod.
Avoid any application-level shared state that cannot scale horizontally (see Application Principles).