Monitoring in ArmoniK

Monitoring your ArmoniK deployment is crucial for maintaining performance, reliability, and resource optimization. By keeping a close eye on the system, you can quickly identify issues, optimize performance, and ensure that your application runs smoothly. Proper monitoring helps you understand system load and user behavior, and identify potential bottlenecks or failures before they escalate into major problems.

Metrics Exporter

Importance of the Metrics Exporter

The Metrics Exporter exposes, in Prometheus format, the number of tasks per status (and per partition): for example, how many tasks are queued, processing, or completed. It does not expose per-task metrics such as durations, resource usage, or error details; for those, use the ArmoniK CLI or the Admin GUI to inspect individual tasks. The task-status counts from the Metrics Exporter are the data source used by both KEDA (to scale compute planes up and down) and the Grafana dashboards, and are useful to:

Assess System Load: See at a glance how many tasks are queued vs. being processed, and spot a backlog building up.
Drive Autoscaling: KEDA uses these counts to scale compute planes up and down automatically.
Enable Proactive Management: A growing queue or a stalled “processing” count can indicate a problem before it affects end-users.

Metrics Exporter Tuning

Default Metric: The primary metric is queued, which is the total number of tasks in the Submitted, Dispatched, or Processing state. This provides a snapshot of the system’s workload.
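As a worked illustration of that definition, the sketch below sums three per-status counts into the queued value. The function name and the sample counts are invented for the example; they are not part of ArmoniK.

```shell
# Hypothetical helper: combine per-status task counts into the "queued" value.
# Arguments: submitted, dispatched, processing (non-negative integers).
queued_count() {
  echo $(( $1 + $2 + $3 ))
}

# Sample counts, invented for illustration only.
queued_count 12 3 5   # prints 20
```

Tasks already in Processing still count toward queued, so the value reflects all outstanding work rather than only the tasks that are waiting.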
To gather specific metrics tailored to your needs, you can use the following environment variable in the Core configmap:
```
MetricsExporter__Metrics
```
This will allow you to specify which additional metrics you want to track, ensuring that you are collecting data that is most relevant to your monitoring goals.
Additionally, the cache validity setting allows you to determine how long cached results of measurements are considered valid. Adjusting this can help you manage how quickly metrics are updated:
```
MetricsExporter__CacheValidity
```
The default cache validity is set to 5 seconds, which balances responsiveness and system load.
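Before changing either setting, it can help to check what is currently configured. The sketch below reads one key at a time from the ConfigMap using kubectl's JSONPath output. It assumes the armonik namespace and the core-configmap name used elsewhere on this page; the function names are hypothetical.

```shell
NAMESPACE="armonik"          # assumed namespace of the deployment
CONFIGMAP="core-configmap"   # assumed name of the Core ConfigMap

# Build the kubectl JSONPath expression that selects one key of the ConfigMap data.
data_jsonpath() {
  printf '{.data.%s}\n' "$1"
}

# Print the current value of one Metrics Exporter setting (empty output if unset).
show_setting() {
  kubectl -n "$NAMESPACE" get configmap "$CONFIGMAP" -o "jsonpath=$(data_jsonpath "$1")"
}
```

Running show_setting MetricsExporter__Metrics or show_setting MetricsExporter__CacheValidity against a live cluster prints the stored value. Empty output most likely means the key is not set, in which case the defaults described above apply.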

Task Statuses

Understanding the various task statuses is essential for effective monitoring. The list of task statuses for the current version of the ArmoniK API can be consulted here.

Adding More Metrics

To enhance your monitoring capabilities, you may wish to add additional metrics. The metric name to give to the Metrics Exporter is obtained by stripping the TASK_STATUS_ prefix from the status name and writing the remaining part in PascalCase. For example, TASK_STATUS_CREATING becomes Creating. With this in mind, here is how to include the tasks in status TASK_STATUS_CREATING and TASK_STATUS_RETRIED:

Edit the ConfigMap: Make sure you’re working with a running deployment of ArmoniK. Begin editing the core-configmap with this command:
```
$ kubectl -n armonik edit configmaps core-configmap
```
Note: For cloud deployments, ensure that the KUBECONFIG variable is exported in your current terminal context to access your Kubernetes cluster.

Update Metrics Configuration: In the configuration, you can specify which metrics to export. Add your desired metrics under the MetricsExporter__Metrics key. For example:

```
# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit.
#
apiVersion: v1
data:
  MetricsExporter__Metrics: "Retried, Creating" # <- Add this line
  .
# Rest of the file ...
```

Restart the Deployment: After updating the configuration, restart the metrics-exporter deployment so that it picks up your changes:
```
$ kubectl -n armonik rollout restart deployment metrics-exporter
```
Verify in Grafana: Finally, check Grafana to ensure that the new metrics are now visible and being reported correctly.
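The same procedure can also be scripted instead of using the interactive editor. The sketch below assumes the armonik namespace, the core-configmap ConfigMap, and the metrics-exporter deployment named in the steps above; the function names are hypothetical. It derives the metric names from the TASK_STATUS_ identifiers and applies them with kubectl patch, which replaces any value already stored under MetricsExporter__Metrics.

```shell
# Convert a status identifier such as TASK_STATUS_CREATING into its
# metric name (Creating): drop the prefix, then PascalCase each word.
metric_name() {
  printf '%s\n' "$1" | awk '{
    sub(/^TASK_STATUS_/, "")
    n = split($0, parts, "_")
    out = ""
    for (i = 1; i <= n; i++)
      out = out toupper(substr(parts[i], 1, 1)) tolower(substr(parts[i], 2))
    print out
  }'
}

# Join the metric names of all given statuses with ", ".
metrics_value() {
  value=""
  for status in "$@"; do
    name=$(metric_name "$status")
    if [ -z "$value" ]; then value=$name; else value="$value, $name"; fi
  done
  printf '%s\n' "$value"
}

# Build the JSON merge patch that sets MetricsExporter__Metrics.
build_patch() {
  printf '{"data":{"MetricsExporter__Metrics":"%s"}}\n' "$1"
}

# Patch the ConfigMap, restart the exporter, and wait for the rollout.
# Note: this overwrites the previous MetricsExporter__Metrics value.
apply_metrics() {
  kubectl -n armonik patch configmap core-configmap --type merge \
    -p "$(build_patch "$(metrics_value "$@")")" &&
  kubectl -n armonik rollout restart deployment metrics-exporter &&
  kubectl -n armonik rollout status deployment metrics-exporter
}
```

Calling apply_metrics TASK_STATUS_RETRIED TASK_STATUS_CREATING would write "Retried, Creating", the same value as in the manual edit, and then wait for the restarted exporter to become ready. The block only defines functions, so nothing touches the cluster until apply_metrics is called.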

Prometheus

Importance of Prometheus

Prometheus is a powerful monitoring toolkit designed for reliability and scalability. By integrating Prometheus into your ArmoniK deployment, you can:

Collect Time-Series Data: Gather metrics over time for better analysis and historical reference.
Perform Complex Queries: Leverage Prometheus’s querying capabilities to glean insights into your data and performance trends.

Note

Prometheus also supports alerting via Alertmanager, but Alertmanager is not deployed by default with ArmoniK (kube-prometheus.alertmanager.enabled is false). If you need alerting, deploy and configure Alertmanager separately.

General Guidelines

Ensure that Prometheus is properly dimensioned to handle the expected workload. This includes CPU, memory, and disk requirements.
Configure Prometheus with persistence settings so that data is retained even across restarts.
Set appropriate retention policies to manage how long data is kept, balancing storage capacity and the need for historical insights.
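For the disk part of the sizing, the Prometheus documentation gives a rough rule of thumb: needed disk space is approximately the retention time in seconds, times the ingested samples per second, times the bytes per sample (typically 1 to 2 bytes). The sketch below applies that rule. The ingestion rate of 1000 samples per second is an invented figure, so measure your own rate before relying on the result.

```shell
# Rough Prometheus disk estimate:
#   bytes = retention_seconds * samples_per_second * bytes_per_sample
estimate_disk_bytes() {
  echo $(( $1 * $2 * $3 ))
}

retention_seconds=$(( 7 * 24 * 3600 ))   # one week
samples_per_second=1000                  # assumed ingestion rate (illustrative)
bytes_per_sample=2                       # upper end of the usual 1-2 byte range

estimate_disk_bytes "$retention_seconds" "$samples_per_second" "$bytes_per_sample"
# prints 1209600000 (about 1.2 GB)
```

Treat the result as a lower bound and leave headroom when sizing the persistent volume.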

Suggested Configuration

For optimal performance, it’s recommended to run two instances of Prometheus:

Dedicated HPA Instance: This instance can be smaller and configured with a shorter data retention policy (on the order of hours) to handle the metrics specific to Horizontal Pod Autoscaling (HPA).
Monitoring Instance: This should be a larger instance, with a data retention period of at least 1 week to allow for comprehensive monitoring and analysis.
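When Prometheus is started directly, retention and storage location are set with the --storage.tsdb.retention.time and --storage.tsdb.path flags. The sketch below only builds the argument lists for the two instances. The 6h value, the data directories, and the function name are illustrative assumptions; a Helm- or Terraform-based deployment would set the equivalent values through its own configuration rather than on a command line.

```shell
# Hypothetical helper: print the storage flags for one Prometheus instance.
# $1 = retention (a Prometheus duration such as 6h or 7d), $2 = data directory.
prometheus_storage_args() {
  printf '%s %s\n' "--storage.tsdb.retention.time=$1" "--storage.tsdb.path=$2"
}

# Small HPA-oriented instance: retention on the order of hours (6h is an example).
prometheus_storage_args 6h /data/prometheus-hpa

# Larger monitoring instance: at least one week of history.
prometheus_storage_args 7d /data/prometheus-monitoring
```

Keeping the two data directories on separate persistent volumes lets each instance be sized and retained independently.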

Incorporating a robust monitoring setup using both the Metrics Exporter and Prometheus ensures that your ArmoniK deployment remains resilient, efficient, and responsive to the needs of your users.