Edit

Share via


PromQL for system metrics and Guest OS performance counters

This article provides guidance for querying OpenTelemetry system metrics and Guest OS performance counters using PromQL in Azure Monitor. This covers scenarios where Azure Monitor Agent or OpenTelemetry Collector gathers system-level telemetry data.

Tip

System metrics queries work in both workspace-scoped and resource-scoped modes. When using resource-scoped queries, filter and group by "Microsoft.resourceid" instead of generic identifiers like instance or host.name to ensure accurate scoping to your resources.

Prerequisites

System metrics overview

OpenTelemetry system metrics provide comprehensive visibility into Guest OS performance through standardized metric names and semantic conventions:

Core system metric categories

Category OpenTelemetry Metrics Traditional Equivalents
CPU system.cpu.utilization % Processor Time
Memory system.memory.usage, system.memory.utilization Available Memory, Memory Usage
Disk system.disk.io.bytes, system.disk.operations Disk Bytes/sec, Disk Operations/sec
Network system.network.io.bytes, system.network.packets Network Bytes/sec, Packets/sec
Process process.cpu.utilization, process.memory.usage Process-specific counters

CPU

# Dashboarding CPU utilization when query is scoped to a single VM resource
100 * (1 - avg ({"system.cpu.utilization", state="idle"}))

# Dashboarding to identify CPU hotspots across a fleet of VMs, scoped to an AMW, subscription or resource group
topk (5, 100 * (1 - avg by ("Microsoft.resourceid") ({"system.cpu.utilization", state="idle"})))

# Alerting on high CPU utilization, scoped to a single VM
avg_over_time({"system.cpu.utilization"}[5m]) > 0.8

# Dashboarding to identify top 5 CPU-consuming processes by command, scoped to a single VM.
topk(5, sum by ("process.command") ({"process.cpu.utilization"}))

# Identify CPU-bound processes, scoped to a single VM
{"process.cpu.utilization"} > 0.5

Memory

# Dashboarding available MB, scoped to a single VM
(sum ({"system.memory.limit"}) - sum ({"system.memory.usage", state="used"})) / (1024 * 1024)

# Alerting on high memory utilization
{"system.memory.utilization", state="used"} > 0.9

Disk I/O

# Dashboarding disk total throughput (Bytes/sec), scoped to a single VM. Can be filtered to Read or Write.
sum by (device, direction) rate({"system.disk.io"}[5m]) 

# Dashboarding disk total operations per second, scoped to a single VM. Can be filtered to Read or Write.
sum by (device, direction) rate({"system.disk.operations"}[5m]) 

# Dashboarding top 5 processes by disk operations (read and write), scoped to an AMW, subscription or resource group.
topk(5, sum by ("process.command", direction) (rate({"process.disk.operations"}[5m])))

# Alerting on high disk activity detection, scoped to a single VM.
rate({"system.disk.operations"}[5m]) > 1000

Migrating from KQL to PromQL

This section provides side-by-side comparisons of common KQL queries for system performance counters and their PromQL equivalents. Each example includes both a like-for-like migration and a recommended approach optimized for Prometheus.

Important

All PromQL queries use UTF-8 escaping syntax introduced in Prometheus 3.0. Label names containing dots (like Microsoft.resourceid) and metric names must be quoted and moved into braces, as documented in Prometheus 3.0 release.

KQL: Average CPU percentage per 15 minute interval

Perf 
| where CounterName == "% Processor Time"
| where ObjectName == "Processor"
| summarize avg(CounterValue) by bin(TimeGenerated, 15m), Computer, _ResourceId

PromQL: Like-for-like migration

This query replicates the KQL behavior by calculating CPU utilization from system.cpu.time:

100 * (
  1 -
  (
    sum by ("Microsoft.resourceid") (rate({"system.cpu.time", state="idle"}[15m])) /
    sum by ("Microsoft.resourceid") (rate({"system.cpu.time"}[15m]))
  )
)

Query parameters:

  • Step: 15m (matches KQL bin size)

Rationale: Replicates Windows' calculation of 100 - % Idle Time with identical time bin size for direct comparison.

100 * (1 - avg by ("Microsoft.resourceid") ({"system.cpu.utilization", state="idle"})) 

Query parameters:

  • Step: 1m

Rationale: Uses the cleaner system.cpu.utilization metric if emitted by your collector. The 1-minute step provides higher resolution dashboards and more responsive alerting.

KQL: Disk reads per second over 5 minutes

Perf
| where ObjectName == "LogicalDisk" and CounterName == "Disk Reads/sec"
| summarize avg(CounterValue) by bin(TimeGenerated, 5m), Computer, _ResourceId

PromQL: Like-for-like migration

sum by ("Microsoft.resourceid", device) (
  rate({"system.disk.operations", direction="read"}[5m])
)

Query parameters:

  • Step: 5m

Rationale: Matches KQL bin(TimeGenerated, 5m) exactly, aggregated by resource and device.

sum by ("Microsoft.resourceid", device) (
  rate({"system.disk.operations", direction="read"}[1m])
)

Query parameters:

  • Step: 1m

Rationale: More responsive metric for dashboards while maintaining accuracy. The 1-minute rate calculation provides smoother visualization.

KQL: Available memory over time

// Virtual Machine available memory 
// Chart the VM's available memory over time. 
Perf
| where ObjectName == "Memory" and
    (CounterName == "Available MBytes Memory" or  // Linux records
     CounterName == "Available MBytes")            // Windows records
| summarize avg(CounterValue) by bin(TimeGenerated, 15m), Computer, _ResourceId
| render timechart

PromQL: Like-for-like migration

This query aggregates multiple memory states to calculate available memory across Linux and Windows:

# Available MB, scoped to a single VM
(sum ({"system.memory.limit"}) - sum ({"system.memory.usage", state="used"})) / (1024 * 1024)

Query parameters:

  • Step: 15m

Rationale: Aggregates memory states representing available memory across operating systems. Converts bytes to megabytes to match KQL counter output. Uses 15-minute step to match the bin size.

avg by ("Microsoft.resourceid") ({"system.memory.utilization"})

Query parameters:

  • Step: 5m

Rationale: Alerting and autoscale scenarios benefit from utilization metrics which don't need expensive regex filters or unit conversions.

KQL: Bottom 10 free disk space percentage

// Bottom 10 Free disk space % 
// Bottom 10 Free disk space % by computer, for the last 7 days. 
Perf
| where TimeGenerated > ago(7d)
| where (ObjectName == "Logical Disk" or ObjectName == "LogicalDisk") 
    and CounterName contains "%" 
    and InstanceName != "_Total" 
    and InstanceName != "HarddiskVolume1"
| project TimeGenerated, Computer, ObjectName, CounterName, InstanceName, CounterValue 
| summarize arg_max(TimeGenerated, *) by Computer
| top 10 by CounterValue desc

PromQL: Like-for-like migration

sum by ("Microsoft.resourceid", device, mountpoint, type) (
  {"system.filesystem.usage", state="free", type=~"ext4|xfs|btrfs|ntfs|vfat"}
) / (1024 * 1024)

Query parameters:

  • Time range: 7d

Rationale: Converts bytes to megabytes to match Windows counter output. Filters common filesystem types and excludes system volumes using regex pattern matching.

KQL vs PromQL syntax comparison

The following table highlights key syntax differences when migrating from KQL to PromQL:

Concept KQL Syntax PromQL Syntax
Time binning bin(TimeGenerated, 15m) Use step parameter in query API (e.g., step=15m)
Rate calculation CounterValue (already rate) rate(metric[5m]) for counter metrics
Aggregation summarize avg(CounterValue) by Computer avg(metric) by ("Microsoft.resourceid")
Resource filtering where _ResourceId == "..." {metric, "Microsoft.resourceid"="..."}
String matching where Name contains "pattern" {metric, name=~".*pattern.*"}
Label with dots _ResourceId (underscore prefix) "Microsoft.resourceid" (quoted)
Metric name with dots N/A - uses table.column {"system.cpu.utilization"} (quoted)
Time range where TimeGenerated > ago(7d) API parameter: start and end
Top N results top 10 by CounterValue desc topk(10, metric)

Tip

When working with PromQL in Azure Monitor, always quote label names that contain dots (like "Microsoft.resourceid") and metric names with dots (like {"system.cpu.utilization"}). This UTF-8 escaping syntax is required in Prometheus 3.0+ and prevents common parsing errors.