If you have a mix of Proxmox nodes and external servers, getting a clean, truthful view of CPU, memory and disk is strangely hard. Proxmox’s built-in charts blur the line between free, cached and used memory, which makes planning resources awkward. I also needed email alerts for low disk space and high CPU or memory after learning the hard way that a full disk can freeze VMs and turn recovery into a risky dance. Finally, I wanted all nodes in one place, not just Proxmox, but remote KVMs, VPSs and dedicated servers from various providers.
This post is the setup I landed on: Prometheus for metrics and alerting, Grafana for dashboards, plus two small exporters that cover everything from the Proxmox API to plain Linux hosts. It is self-hosted, unlimited in nodes, quick to stand up with Docker on the main server, and easy to roll out to Ubuntu clients with a single apt install. Alerts go to my email via AWS WorkMail, and Grafana slots nicely into my self-hosted gethomepage.dev start page.
Here is the exact problem I wanted to solve:
- Proxmox does not report memory the way Linux admins think about it. It is easy to mistake cache for “used,” which makes it tough to judge real headroom when assigning RAM. Community threads are full of confused screenshots comparing “free” in Proxmox to “free” inside the guest. The short version is that Linux aggressively caches and “MemAvailable” is the right signal for planning, not raw “used.”
- I needed alerting to my inbox for CPU, memory and disk. Once a VM hits 100 percent disk, it can go unresponsive and recovery gets stressful. I wanted proactive warnings before that point.
- I wanted a single pane of glass that includes Proxmox nodes and the external servers I rent: one place to see trends, compare nodes and make resource decisions.
The stack in this guide fixes all three:
- Prometheus collects and stores metrics. It scrapes Linux nodes with node_exporter and pulls Proxmox cluster, node and VM metrics through pve_exporter.
- Grafana visualizes it and gives you beautiful, tweakable dashboards.
- Alertmanager (the Prometheus project's alert routing component, which runs alongside Prometheus) routes alerts to email.
- The result is self-hosted, flexible, unlimited in nodes and completely private.
What is Prometheus?
Prometheus is an open source monitoring and alerting toolkit. It pulls metrics on a schedule over HTTP, stores them in a local time-series database, and lets you query them with PromQL. Alert rules run on those metrics and fire into Alertmanager for delivery. Think of it as a pull-based metrics engine with a language you can reason about.
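As a tiny taste of PromQL, the expression below returns the fraction of space still free on each monitored root filesystem (the metric names come from node_exporter, which is introduced below):
node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}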
What is Grafana?
Grafana is an open source visualization platform. You point it at Prometheus as a data source, then build dashboards with panels, variables and drill-downs. Import community dashboards to get going quickly, then customize. Grafana also supports alerting, but in this guide we will keep alert routing in Alertmanager for simplicity.
How they work together
- Exporters sit close to your systems and expose metrics as simple text over HTTP. For Linux hosts we use node_exporter. For Proxmox we use prometheus-pve-exporter, which talks to the Proxmox API and exposes node, VM, storage and cluster metrics that Prometheus can scrape.
- Prometheus scrapes those endpoints, stores the data and evaluates alert rules.
- Alertmanager receives alerts from Prometheus and sends notifications to destinations like email.
- Grafana connects to Prometheus and renders dashboards. Grafana has first-class Prometheus support and a huge catalog of community dashboards you can import, including battle-tested ones for node_exporter and Proxmox.
Imagine each exporter is a mailbox hanging outside every server. Prometheus is the carrier who walks a route every 15 seconds, opens each mailbox and copies whatever metrics were left there. Grafana is the sorting room where you can lay out the letters any way you like. Alertmanager is the dispatcher who calls you if certain letters show up.
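To make the mailbox idea concrete, this is roughly what Prometheus sees when it opens one. A trimmed curl against a node_exporter endpoint (the address and values are illustrative):
curl -s http://10.0.1.10:9100/metrics | grep node_memory_Mem
# HELP node_memory_MemAvailable_bytes Memory information field MemAvailable_bytes.
# TYPE node_memory_MemAvailable_bytes gauge
node_memory_MemAvailable_bytes 6.442450944e+09
node_memory_MemTotal_bytes 8.589934592e+09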
Why I like this setup to solve my situation:
- It is self-hosted so your operational data stays private.
- There are no arbitrary limits on how many nodes you can monitor.
- Alerts are simple to express and easy to route to email, in my case an AWS WorkMail mailbox.
- Dashboards are flexible and fast to customize, so resource planning is easy.
- It drops cleanly into my browser’s start page. gethomepage.dev has a native Grafana widget, so I can keep a glanceable panel on my homepage.
- It is flexible enough to pull from Proxmox via pve_exporter or from inside any server via node_exporter.
Solving The “Real” Memory Problem
On Linux, cache and buffers are not wasted memory. The kernel will reuse them as soon as applications need RAM. That is why the MemAvailable metric is the one you should alert on. In PromQL the clean, noise-free expression for “used memory percentage” is:
100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
This aligns with real pressure rather than penalizing cache. We will use this formula for the dashboards and alerts that follow.
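If you want to sanity check the same number outside Prometheus, MemAvailable comes straight from the kernel, and free reports the same signal in its "available" column:
grep -E 'MemTotal|MemAvailable' /proc/meminfo
free -m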
Practical use cases this setup unlocks
- Right-size Proxmox RAM using MemAvailable to avoid overcommitting or leaving too much idle.
- Disk safety net with early warnings when root or data volumes cross 80 percent.
- CPU heat map across mixed infrastructure so you can move workloads intelligently.
- Shared view for Proxmox plus rented KVMs, VPSs and dedicated nodes in one place with one taxonomy.
Examples: ready-made dashboards that just work
- Node Exporter Full (Grafana dashboard ID 1860) gives you a rich host view covering CPU, memory, disk and network. Import it once and it works across every node_exporter instance.
- Proxmox via Prometheus (Grafana dashboard ID 10347) visualizes Proxmox cluster metrics exposed by pve_exporter, including nodes, guests and storage.
Setup Instructions
Main Grafana + Prometheus + AlertManager Server
We will run four containers on a single monitoring host: Prometheus, Alertmanager, Grafana and the Proxmox exporter. The exporter talks to the Proxmox API using a read only token, then exposes clean metrics for Prometheus to scrape. The official prompve/prometheus-pve-exporter image is maintained by the exporter project and is the simplest way to get Proxmox metrics flowing.
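Before touching Docker, give the exporter something to authenticate with. Here is a sketch of creating a read only API token on a PVE node, assuming a dedicated pve-exporter@pve user and a token named monitoring (the names are my own choice; adapt them and double check the pveum syntax against your PVE version):
# run on a Proxmox node as root
pveum user add pve-exporter@pve --comment "monitoring"
pveum acl modify / --roles PVEAuditor --users pve-exporter@pve
pveum user token add pve-exporter@pve monitoring --privsep 0
# --privsep 0 lets the token inherit the user's PVEAuditor role;
# the token secret is printed once, so copy it now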
1) Create folders and a Compose file
mkdir -p /opt/monitoring/{prometheus/rules,alertmanager,grafana/provisioning/datasources,pve-exporter}
cd /opt/monitoring
Create docker-compose.yml:
networks:
  monitor:
    driver: bridge

volumes:
  prometheus-data:
  grafana-data:

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    restart: always
    networks: [monitor]
    ports:
      - "9090:9090"
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--web.enable-lifecycle"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prometheus/rules:/etc/prometheus/rules:ro
      - prometheus-data:/prometheus
    depends_on:
      - pve-exporter

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    restart: always
    networks: [monitor]
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    restart: always
    networks: [monitor]
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=changeme
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
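The Compose file references a fourth container, pve-exporter, which Prometheus depends on and which the scrape config below reaches at pve-exporter:9221. Here is a minimal sketch of that service, assuming the prompve/prometheus-pve-exporter image reads its config from /etc/prometheus/pve.yml as its README describes (some versions also accept PVE_* environment variables, so check the docs for the release you pull):
  pve-exporter:
    image: prompve/prometheus-pve-exporter:latest
    container_name: pve-exporter
    restart: always
    networks: [monitor]
    ports:
      - "127.0.0.1:9221:9221"   # only reachable from the monitoring host itself
    volumes:
      - ./pve-exporter/pve.yml:/etc/prometheus/pve.yml:ro
And a matching pve-exporter/pve.yml, using the read only token created earlier (the token_value is a placeholder for your own secret):
default:
  user: pve-exporter@pve
  token_name: monitoring
  token_value: "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
  verify_ssl: true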
The :latest tags track the current releases of Prometheus, Alertmanager and Grafana, which is fine for a small homelab or SMB setup. You can pin versions later if you want change control.
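When you do pin, swap the :latest tags for explicit versions you have tested; for example (these tags are placeholders, check each project's releases page for current ones):
    image: prom/prometheus:v2.53.0
    image: prom/alertmanager:v0.27.0
    image: grafana/grafana:11.1.0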
2) Prometheus configuration
Create prometheus/prometheus.yml:
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "/etc/prometheus/rules/*.yml"

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["prometheus:9090"]

  # Proxmox via pve_exporter running in Docker on the same host
  - job_name: "proxmox"
    metrics_path: /pve
    params:
      module: [default]
      cluster: ["1"]
      node: ["1"]
    static_configs:
      # List your PVE nodes by IP or DNS
      - targets:
          - "10.0.0.11"  # pve-node-1
          - "10.0.0.12"  # pve-node-2
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: "pve-exporter:9221"

  # Linux servers with node_exporter
  - job_name: "nodes"
    static_configs:
      - targets:
          - "10.0.1.10:9100"  # ubuntu-vm-1
          - "10.0.1.11:9100"  # ubuntu-vm-2
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
That relabel_configs block converts the Proxmox node address into a target query parameter for the exporter, and rewrites the scrape to the exporter itself. This is the recommended pattern when the exporter is not installed directly on each PVE node.
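Once the stack is running you can exercise that path by hand, hitting the exporter exactly the way Prometheus will (swap in one of your PVE node addresses):
curl -s "http://localhost:9221/pve?module=default&cluster=1&node=1&target=10.0.0.11" | head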
3) Core alert rules: CPU, memory, disk
Create prometheus/rules/common-alerts.yml:
groups:
  - name: system-alerts
    rules:
      # CPU usage averaged over 5 minutes
      - alert: HighCPUUsage
        expr: avg by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m])) * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
          description: "Average CPU over 5m above 85 percent."

      # Memory used percent using MemAvailable
      - alert: HighMemoryUsage
        expr: 100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High memory on {{ $labels.instance }}"
          description: "MemAvailable suggests real pressure is over 85 percent."

      # Disk free below 15 percent, ignore pseudo filesystems
      - alert: LowDiskSpace
        expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay|squashfs",mountpoint!~"/run.*|/snap.*"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay|squashfs",mountpoint!~"/run.*|/snap.*"}) < 0.15
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }} mount {{ $labels.mountpoint }}"
          description: "Less than 15 percent free space for at least 10 minutes."

      # Optional: Proxmox guest down based on exporter signals
      - alert: ProxmoxGuestDown
        expr: pve_up * on(id, instance) group_left(name) pve_guest_info == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Guest down on {{ $labels.instance }}"
          description: "Proxmox guest appears down for 5 minutes."
That memory expression uses MemAvailable which tracks true headroom, not cache. This avoids false alarms and matches how modern Linux reports memory pressure.
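Before loading the rules it is worth linting them; the prom/prometheus image ships promtool, so a quick check from /opt/monitoring looks something like this:
docker run --rm --entrypoint promtool \
  -v "$(pwd)/prometheus/rules:/rules:ro" \
  prom/prometheus:latest check rules /rules/common-alerts.yml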
4) Alertmanager to AWS WorkMail via SMTP
Create alertmanager/alertmanager.yml:
route:
  receiver: email-me
  group_by: ["alertname", "instance"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: email-me
    email_configs:
      - to: you@your-domain.com
        from: alerts@your-domain.com
        smarthost: smtp.mail.us-west-2.awsapps.com:465  # change region
        auth_username: you@your-domain.com
        auth_identity: you@your-domain.com
        auth_password: "YOUR_WORKMAIL_PASSWORD"
        require_tls: true
        tls_config:
          insecure_skip_verify: false
Use the region specific WorkMail SMTP endpoint, for example smtp.mail.us-west-2.awsapps.com on port 465 with SSL, or the equivalent for your organization's region. This is the typical configuration for WorkMail's SMTP submission. If your mailbox is integrated with SES for outbound sending, you can also use the SES SMTP endpoint, such as email-smtp.us-east-1.amazonaws.com, with the SMTP credentials you created in SES.
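You can validate the file before starting the container with amtool, which ships in the Alertmanager image; a sketch from /opt/monitoring:
docker run --rm --entrypoint amtool \
  -v "$(pwd)/alertmanager:/cfg:ro" \
  prom/alertmanager:latest check-config /cfg/alertmanager.yml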
5) Provision Grafana so it “just works”
Create grafana/provisioning/datasources/prom.yml:
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
Now start the stack:
docker compose up -d
Open Grafana on port 3000 and log in with the admin credentials you set. Import two dashboards by their IDs and you get great defaults for both kinds of metrics:
- Node Exporter Full, dashboard ID 1860.
- Proxmox via Prometheus, dashboard ID 10347.
If you use a self hosted start page like gethomepage.dev, drop in its built in Grafana widget so the monitoring status is one click away on your browser homepage.
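For reference, the widget is only a few lines in homepage's services.yaml. A sketch, assuming Grafana is reachable at monitor.lan:3000 and you are comfortable storing read only credentials there (the field names follow the homepage docs, so verify them against your version):
- Monitoring:
    - Grafana:
        icon: grafana
        href: http://monitor.lan:3000
        widget:
          type: grafana
          url: http://monitor.lan:3000
          username: admin
          password: changeme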
Node Setup (Specific for Ubuntu servers)
For every Ubuntu server or VM you want to monitor, install prometheus-node-exporter. This exposes system level CPU, memory, disk and network metrics on port 9100.
# Ubuntu 20.04, 22.04, 24.04
sudo apt update
sudo apt install -y prometheus-node-exporter
sudo systemctl enable --now prometheus-node-exporter
You can verify with curl http://localhost:9100/metrics. If port 9100 is already in use, run the exporter on a different port, for example --web.listen-address=:9101 (the Ubuntu package reads its flags from /etc/default/prometheus-node-exporter), then update the target in prometheus.yml.
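On the Ubuntu package that is a one line change; something like this should do it, assuming your release uses the standard Debian defaults file:
# /etc/default/prometheus-node-exporter
ARGS="--web.listen-address=:9101"
Then restart the service:
sudo systemctl restart prometheus-node-exporter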
Add each node's ip:9100 to the nodes job in prometheus.yml on the monitoring server. Prometheus does not watch the file for changes, so after editing either send it a POST to /-/reload (enabled by --web.enable-lifecycle in the Compose file) or restart the container.
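Because the Compose file starts Prometheus with --web.enable-lifecycle, the reload is a single request from the monitoring host:
curl -X POST http://localhost:9090/-/reload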
Optional Setup
- Certificates for Proxmox: if you run Proxmox with self signed certs, either import the PVE CA into the exporter container or temporarily set verify_ssl: false in pve.yml to start quickly, then fix trust later. The exporter docs cover both approaches.
- Dashboards at a glance: pin a favorite Grafana panel into your gethomepage.dev dashboard so you have a 5 second status glance every morning.
- Grafana 12 adds nice quality of life features like tabs and faster tables, so staying on a modern release is worthwhile.
- Set up a load balancer or reverse proxy with TLS certificates in front of the stack, and keep everything behind a firewall.
Alternatives I considered and why I did not pick them
Zabbix
What it is good at: Full stack monitoring with discovery, agent and agentless options, maps, templates and very deep alerting. It can replace an entire NOC toolkit if you want a single product that does everything.
Why I passed: The UI feels heavier and more complex than Grafana for day to day use. It shines in very large, policy driven environments, but I wanted faster dashboard iteration and simpler panel building for a mixed homelab plus rented servers. Grafana’s editing experience is quicker for me, and Prometheus rules are easy to reason about.
Netdata
What it is good at: Real time troubleshooting with second level granularity. The per node dashboard is fantastic when you need to catch a noisy neighbor or a misbehaving process right now.
Why I passed: The free cloud tier caps you at a small number of nodes, and it is not ideal for long term trend analysis across many machines. I care about multi month patterns to plan RAM and storage in Proxmox and across VPSs, which is Prometheus and Grafana’s sweet spot.
Proxmox built in graphs and QEMU stats
What it is good at: Quick glance per VM and per node charts, integrated in the UI you already use.
Why I passed: The memory model does not give the exact “available vs used” number that tracks real pressure, and there is no built in cross platform alerting. I wanted proactive emails before a VM or host runs out of disk, plus a single place that also covers external servers.
How this setup maps to the original goals
- Accurate memory signals: MemAvailable based panels and alerts show true pressure, so it is easy to right size RAM in Proxmox and on external hosts.
- Proactive email alerts: you get warning and critical notifications for CPU, memory and disk in your mailbox before VMs become unresponsive.
- One place for everything: Proxmox nodes flow in via pve_exporter, all Ubuntu boxes via node_exporter, which gives you long term trends and unified dashboards.
- Self hosted and unlimited: the stack lives on your own server, there is no node cap, and you can extend it with additional exporters later.
- Friendly with your homepage: Grafana’s panels embed nicely in gethomepage.dev, so your monitoring becomes a habit rather than an afterthought.
Conclusion
Prometheus plus Grafana gives you a clean mental model. Exporters speak simple HTTP, Prometheus pulls on a schedule and stores everything, Alertmanager does the routing and Grafana turns it into a living dashboard. For a Proxmox centered lab with extra KVMs, VPSs and dedicated boxes scattered across providers, it hits the balance I wanted. It is private, scale friendly and quick to iterate. Most important, it solves the two failure modes that used to bite me. Memory is measured the way Linux actually works, and disk pressure sets off alarms long before a VM hangs.
If you want to take it further, add blackbox_exporter to watch public endpoints, Pushgateway for batch jobs that cannot be scraped, and recording rules for common service level indicators. The core stays the same, which is why this stack lasts.
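As a starting point for that last idea, a recording rule is just another entry in the rules directory. Here is a minimal sketch that precomputes per instance CPU utilization (the rule name follows the usual level:metric:operation convention but is my own choice):
groups:
  - name: sli-recordings
    rules:
      - record: instance:node_cpu_utilization:ratio_rate5m
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))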
