Backup & restore

This page covers backup and restore strategies for the stateful components of an ArmoniK deployment.


What needs to be backed up

ArmoniK has the following stateful components:

| Component | Contains | Backup priority |
|---|---|---|
| MongoDB | Task metadata, session data, authentication config | High |
| Object storage (MinIO / S3 / GCS) | Task payloads and results | High |
| Queue (ActiveMQ / RabbitMQ / SQS / PubSub) | In-flight tasks | Low (see note below) |
| Prometheus | Metrics history | Low (recreatable, useful only for historical dashboards) |
| Seq / log storage | Structured application logs | Low (useful for post-mortem analysis only) |

The queue does not need to be backed up. Tasks lost during a failure surface as errors on the client side, and re-submission is not automatic: the client (or operator) must explicitly re-submit the affected tasks.


MongoDB

Self-managed MongoDB

Use mongodump to create a snapshot of the database:

mongodump --uri="mongodb://<host>:27017" --out=/backup/$(date +%Y%m%d)

To restore:

mongorestore --uri="mongodb://<host>:27017" /backup/<date>

Schedule regular dumps with a cron job and store the output in a location outside the Kubernetes cluster (e.g. an S3 bucket or NFS share).
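The scheduling advice above can be sketched as a small script that cron runs once a day. This is a minimal sketch, not part of ArmoniK: the variables BACKUP_ROOT, MONGO_URI and KEEP, the function names, and the retention-by-count policy are all assumptions made for illustration.

```shell
#!/bin/sh
# Sketch of a scheduled MongoDB backup with simple retention.
# BACKUP_ROOT, MONGO_URI and KEEP are hypothetical settings for this example.
BACKUP_ROOT="${BACKUP_ROOT:-/backup}"
MONGO_URI="${MONGO_URI:-mongodb://localhost:27017}"
KEEP="${KEEP:-7}"

# Dump into a directory named after today's date (YYYYMMDD).
run_dump() {
  target="$BACKUP_ROOT/$(date +%Y%m%d)"
  mkdir -p "$target" && mongodump --uri="$MONGO_URI" --out="$target"
}

# Remove all but the $KEEP newest dated directories; other entries are ignored.
prune_dumps() {
  ls -1 "$BACKUP_ROOT" | grep -E '^[0-9]{8}$' | sort -r |
    tail -n +"$((KEEP + 1))" |
    while read -r old; do rm -rf "${BACKUP_ROOT:?}/$old"; done
}

# Only act when invoked as "<script> run", so the file can also be sourced.
if [ "${1:-}" = "run" ]; then
  run_dump && prune_dumps
fi
```

Assuming the script is saved at a hypothetical path such as /usr/local/bin/mongo-backup.sh, a crontab entry like `0 2 * * * /usr/local/bin/mongo-backup.sh run` would take a dump every night at 02:00. BACKUP_ROOT should point at storage outside the cluster, or be synced there afterwards.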

When MongoDB runs inside the Kubernetes cluster, run mongodump inside the MongoDB pod, then copy the dump out of the pod:

kubectl -n armonik exec deploy/mongodb -- mongodump --out=/tmp/backup
kubectl -n armonik cp mongodb-<pod>:/tmp/backup ./backup
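As an alternative to writing the dump inside the pod and copying it out, mongodump can stream a compressed archive to standard output, which kubectl exec forwards to a local file. The sketch below assumes the same namespace and deployment name as the commands above; the function names are hypothetical, and any credentials or TLS options your MongoDB requires would have to be added.

```shell
# Stream a gzip-compressed dump from the MongoDB deployment to a local file,
# then check that the archive is a readable gzip stream.
dump_to_archive() {
  out="$1"
  kubectl -n armonik exec deploy/mongodb -- mongodump --archive --gzip > "$out" &&
    gzip -t "$out"
}

# Feed a local archive back to mongorestore through the pod's stdin (-i).
restore_from_archive() {
  kubectl -n armonik exec -i deploy/mongodb -- mongorestore --archive --gzip < "$1"
}
```

For example, `dump_to_archive ./armonik-$(date +%Y%m%d).archive.gz` produces a single file that can be uploaded to external storage without needing temporary space in the pod.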

After restoring MongoDB

If authentication is enabled, the RoleData, UserData, and AuthData collections are populated from parameters.tfvars by the authentication-in-database Job, not from the MongoDB backup itself. After restoring MongoDB, re-run this Job to repopulate those collections from the current configuration:

kubectl -n armonik get job authentication-in-database -o json \
  | jq "del(.spec.selector)" \
  | jq "del(.spec.template.metadata.labels)" \
  | kubectl -n armonik replace --force -f -
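Recreating the Job does not by itself confirm that the collections were repopulated, so it can help to wait for the Job to complete before continuing. The helper below is a sketch: the function name and the 300-second default timeout are assumptions, while the namespace and Job name are taken from the command above.

```shell
# Wait until the authentication Job reports the "complete" condition.
# On timeout or failure, print the Job's logs to stderr and return non-zero.
wait_for_auth_job() {
  limit="${1:-300s}"
  kubectl -n armonik wait --for=condition=complete --timeout="$limit" \
    job/authentication-in-database ||
    { kubectl -n armonik logs job/authentication-in-database >&2; return 1; }
}
```

Calling `wait_for_auth_job 120s` after the replace command blocks for at most two minutes and fails loudly if the Job did not finish.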

Object storage

AWS S3

Enable S3 versioning on the bucket used by ArmoniK. Use S3 lifecycle rules to move older versions to cheaper storage tiers and expire them after a retention period.

For cross-region disaster recovery, enable S3 Cross-Region Replication.
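The versioning and lifecycle settings described above can be applied with the AWS CLI. The following is a sketch under assumptions: the bucket name is whatever your deployment uses, the 30-day and 90-day thresholds and the STANDARD_IA storage class are illustrative choices, and the function names are hypothetical.

```shell
# Write a lifecycle policy that moves noncurrent versions to a cheaper tier
# after $2 days (default 30) and deletes them after $3 days (default 90).
write_lifecycle() {
  cat > "$1" <<EOF
{
  "Rules": [{
    "ID": "tier-then-expire-old-versions",
    "Status": "Enabled",
    "Filter": {"Prefix": ""},
    "NoncurrentVersionTransitions": [
      {"NoncurrentDays": ${2:-30}, "StorageClass": "STANDARD_IA"}
    ],
    "NoncurrentVersionExpiration": {"NoncurrentDays": ${3:-90}}
  }]
}
EOF
}

# Enable versioning on the bucket, then attach the lifecycle policy.
protect_bucket() {
  bucket="$1"; policy="$(mktemp)"
  write_lifecycle "$policy" &&
    aws s3api put-bucket-versioning --bucket "$bucket" \
      --versioning-configuration Status=Enabled &&
    aws s3api put-bucket-lifecycle-configuration --bucket "$bucket" \
      --lifecycle-configuration "file://$policy"
}
```

If the bucket is managed by Terraform, the same settings are better expressed there so that a later apply does not revert them.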

GCP Cloud Storage (GCS)

Enable object versioning on the GCS bucket. Use lifecycle management rules to control retention.
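A comparable sketch for GCS uses gsutil. The 90-day threshold and the function names are assumptions for illustration; the rule deletes only noncurrent (non-live) object versions.

```shell
# Write a lifecycle policy that deletes noncurrent versions $2 days
# (default 90) after they stopped being the live version.
write_gcs_lifecycle() {
  cat > "$1" <<EOF
{
  "rule": [{
    "action": {"type": "Delete"},
    "condition": {"isLive": false, "daysSinceNoncurrentTime": ${2:-90}}
  }]
}
EOF
}

# Turn on object versioning, then apply the lifecycle policy.
protect_gcs_bucket() {
  bucket="$1"; policy="$(mktemp)"
  write_gcs_lifecycle "$policy" &&
    gsutil versioning set on "gs://$bucket" &&
    gsutil lifecycle set "$policy" "gs://$bucket"
}
```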

MinIO (on-premises / local)

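MinIO supports server-side bucket replication to a secondary MinIO instance, configured through the MinIO console or the mc replicate commands. A simpler client-side alternative is continuous mirroring with mc mirror, which copies new and changed objects to the backup instance for as long as the command keeps running: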

mc mirror --watch minio/armonik-bucket backup-minio/armonik-bucket

For periodic snapshots, use mc cp --recursive or mc mirror (without --watch) to copy the bucket contents to an external location.
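A periodic snapshot can be sketched as a one-shot mirror into a dated directory. The minio alias and bucket name are taken from the command above; the destination root /backup/minio and the function name are assumptions.

```shell
# Mirror the bucket once into <root>/<YYYYMMDD> on the local filesystem.
snapshot_bucket() {
  dest="${1:-/backup/minio}/$(date +%Y%m%d)"
  mkdir -p "$dest" && mc mirror minio/armonik-bucket "$dest"
}
```

Each run creates a full copy, so pair it with a retention policy if the bucket is large.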


Certificates

Local deployments generate TLS certificates that expire after a configurable period (default: 7 days, i.e. 168 hours; recommended: 8760 hours, i.e. one year). These certificates are not backed up; if they are lost, redeploy to regenerate them.
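Because the default validity is short, it can be useful to check how close a certificate is to expiry before it causes connection failures. The helper below is a sketch based on openssl's -checkend option; the function name and the certificate path in the usage example are hypothetical.

```shell
# Succeed (exit 0) if the certificate in $1 expires within the next $2 hours.
# Note: an unreadable file also makes openssl fail, which is reported as
# "expiring", so treat a positive result as a prompt to investigate.
cert_expires_within_hours() {
  ! openssl x509 -in "$1" -noout -checkend "$(( $2 * 3600 ))" >/dev/null
}
```

For example, `cert_expires_within_hours ./ca.crt 24 && echo "redeploy soon"` warns one day ahead of expiry.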

For production deployments using custom certificates, store the CA and client certificates securely in a secrets manager (e.g. AWS Secrets Manager, GCP Secret Manager, HashiCorp Vault) and reference them at deploy time.


Disaster recovery checklist

  1. Restore MongoDB from the latest snapshot.

  2. Re-run the authentication-in-database job if authentication is enabled.

  3. Verify object storage is accessible and intact.

  4. Redeploy ArmoniK infrastructure if needed (terraform apply).

  5. Confirm all pods reach the Running state (or Completed, for Job pods) with kubectl get po -A.

  6. Re-submit any tasks that were in flight at the time of failure.
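The pod check in step 5 can be scripted so that it fails when something is not up. This is a sketch with a hypothetical function name; it filters on pod phase only, and a pod whose containers are crash-looping can still report the Running phase, so it complements rather than replaces a manual look at kubectl get po -A.

```shell
# List pods in any namespace whose phase is neither Running nor Succeeded.
# Returns 0 if there are none, 1 if some are listed, 2 if kubectl itself fails.
all_pods_healthy() {
  bad="$(kubectl get po -A --no-headers \
    --field-selector=status.phase!=Running,status.phase!=Succeeded)" || return 2
  [ -z "$bad" ] || { printf '%s\n' "$bad" >&2; return 1; }
}
```

A recovery runbook could call `all_pods_healthy || exit 1` before moving on to task re-submission.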