Running vLLM in production without proper monitoring is like flying blind. You need visibility into request latency (P50, P95, P99), token throughput, GPU cache usage, and error rates to optimize performance and costs. This step-by-step guide walks you through building a complete observability stack using Prometheus and Grafana, the same tools used by companies like Uber, GitLab, and DigitalOcean. In about 10 minutes, you'll have professional dashboards tracking 8 key metrics that matter for LLM inference performance. **Perfect for:** MLOps engineers, platform teams, and anyone running vLLM servers who wants production-ready monitoring without the complexity.
Learn how to set up comprehensive monitoring for your vLLM deployments in minutes using Prometheus and Grafana
Monitoring your vLLM inference servers is crucial for maintaining optimal performance, understanding usage patterns, and ensuring reliability in production. Whether you're running vLLM on RunPod, local infrastructure, or cloud platforms, having proper observability gives you insights into request latency, throughput, resource utilization, and system health.
In this guide, I'll walk you through setting up a complete monitoring stack for vLLM using Prometheus and Grafana, the industry-standard tools for metrics collection and visualization. This setup gives you professional-grade monitoring capabilities with minimal configuration effort.
Our monitoring stack tracks the essential metrics that matter for vLLM performance: request latency, token throughput, GPU cache usage, and error rates. It consists of three main components:
┌─────────────────┐      ┌─────────────────┐      ┌─────────────────┐
│   vLLM Server   │      │   Prometheus    │      │     Grafana     │
│  (RunPod/Cloud) │◀─────┤    (Scraper)    │◀─────┤   (Dashboard)   │
└─────────────────┘      └─────────────────┘      └─────────────────┘
         │                        │                        │
  /metrics endpoint         Stores metrics         Visualizes data
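For context, the /metrics endpoint serves plain Prometheus text exposition. Here is a truncated sample of what a scrape returns (the metric names are vLLM's standard ones; the model label value is illustrative):

```
# HELP vllm:num_requests_running Number of requests currently running on GPU.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="your-model"} 2.0
# HELP vllm:gpu_cache_usage_perc GPU KV-cache usage. 1 means 100 percent usage.
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="your-model"} 0.43
```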
This is the monitoring dashboard you will have for your vLLM server by the end of this guide:
Create the following directory structure:
monitoring/
├── prometheus/
│   ├── prometheus.yml
│   └── rules/
├── grafana/
│   ├── provisioning/
│   │   ├── datasources/
│   │   │   └── prometheus.yml
│   │   └── dashboards/
│   │       └── dashboard.yml
│   └── dashboards/
│       └── vllm-dashboard.json
├── docker-compose.monitoring.yml
├── monitoring.env
└── start-monitoring.sh
Create monitoring.env with your vLLM server details:
# RunPod/Cloud vLLM Server Configuration
VLLM_ENDPOINT=your-endpoint.proxy.runpod.net
VLLM_PORT=443
VLLM_PROTOCOL=https
VLLM_API_KEY=your-api-key-here
# Monitoring Configuration
PROMETHEUS_PORT=9090
GRAFANA_PORT=3000
GRAFANA_ADMIN_USER=admin
GRAFANA_ADMIN_PASSWORD=admin123
# vLLM Metrics Configuration
VLLM_METRICS_PATH=/metrics
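As a sanity check, you can source this file and confirm the scrape URL it implies. The fallbacks below mirror the sample values, so this sketch also works before the file exists:

```shell
# Compose the metrics URL from monitoring.env; defaults mirror the sample above
[ -f monitoring.env ] && { set -a; . ./monitoring.env; set +a; }
VLLM_PROTOCOL="${VLLM_PROTOCOL:-https}"
VLLM_ENDPOINT="${VLLM_ENDPOINT:-your-endpoint.proxy.runpod.net}"
VLLM_PORT="${VLLM_PORT:-443}"
VLLM_METRICS_PATH="${VLLM_METRICS_PATH:-/metrics}"
METRICS_URL="${VLLM_PROTOCOL}://${VLLM_ENDPOINT}:${VLLM_PORT}${VLLM_METRICS_PATH}"
echo "Scrape target: ${METRICS_URL}"
# Probe it once the endpoint and key are real:
# curl -fsS -H "Authorization: Bearer ${VLLM_API_KEY}" "${METRICS_URL}" | head
```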
Create monitoring/prometheus/prometheus.yml:
global:
  scrape_interval: 5s
  evaluation_interval: 5s
  scrape_timeout: 4s

rule_files:
  - "rules/*.yml"

scrape_configs:
  # vLLM Server Metrics
  - job_name: 'vllm-server'
    static_configs:
      - targets: ['your-endpoint.proxy.runpod.net:443']
    scheme: https
    metrics_path: '/metrics'
    scrape_interval: 5s
    scrape_timeout: 4s
    params:
      format: ['prometheus']
    authorization:
      type: Bearer
      credentials: ${VLLM_API_KEY}

  # vLLM Health Check
  - job_name: 'vllm-health'
    static_configs:
      - targets: ['your-endpoint.proxy.runpod.net:443']
    scheme: https
    metrics_path: '/health'
    scrape_interval: 5s
    authorization:
      type: Bearer
      credentials: ${VLLM_API_KEY}

  # Prometheus Self-Monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
    scrape_interval: 15s

Note: vLLM's /health endpoint returns a plain status response rather than Prometheus text format, so treat the vllm-health job as a reachability probe and drop it if it produces scrape parse errors.
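One caveat: Prometheus does not expand environment variables inside its config file, so a `${VLLM_API_KEY}` placeholder is read literally. A minimal workaround is to render the real config from a template at deploy time; the sketch below demonstrates the substitution on a toy template (the .tpl path is illustrative):

```shell
# Prometheus reads ${VLLM_API_KEY} literally, so substitute it before startup.
# Demo on a minimal template; point the paths at your real prometheus.yml template.
VLLM_API_KEY="demo-key-123"   # replace with your real key (e.g. from monitoring.env)
cat > /tmp/prometheus.yml.tpl <<'EOF'
authorization:
  type: Bearer
  credentials: ${VLLM_API_KEY}
EOF
sed "s|\${VLLM_API_KEY}|${VLLM_API_KEY}|g" /tmp/prometheus.yml.tpl > /tmp/prometheus.yml
cat /tmp/prometheus.yml
```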
Create monitoring/grafana/provisioning/datasources/prometheus.yml:
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true
Create monitoring/grafana/provisioning/dashboards/dashboard.yml:
apiVersion: 1

providers:
  - name: 'vLLM Dashboards'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /var/lib/grafana/dashboards
Create docker-compose.monitoring.yml:
services:
  prometheus:
    image: prom/prometheus:latest
    container_name: vllm-prometheus
    ports:
      - "${PROMETHEUS_PORT:-9090}:9090"
    volumes:
      - ./monitoring/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./monitoring/prometheus/rules:/etc/prometheus/rules
      - prometheus_data:/prometheus
    env_file:
      - monitoring.env
    environment:
      - VLLM_API_KEY=${VLLM_API_KEY}
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=200h'
      - '--web.enable-lifecycle'
    restart: unless-stopped
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:latest
    container_name: vllm-grafana
    ports:
      - "${GRAFANA_PORT:-3000}:3000"
    volumes:
      - grafana_data:/var/lib/grafana
      - ./monitoring/grafana/provisioning:/etc/grafana/provisioning
      - ./monitoring/grafana/dashboards:/var/lib/grafana/dashboards
    env_file:
      - monitoring.env
    environment:
      - GF_SECURITY_ADMIN_USER=${GRAFANA_ADMIN_USER:-admin}
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD:-admin123}
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_INSTALL_PLUGINS=grafana-clock-panel
    restart: unless-stopped
    networks:
      - monitoring

volumes:
  prometheus_data:
  grafana_data:

networks:
  monitoring:
    driver: bridge
Start the stack:

docker-compose -f docker-compose.monitoring.yml up -d
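Once the containers are running, both services expose unauthenticated health endpoints you can probe. This sketch assumes the default ports from monitoring.env:

```shell
# Smoke test: Prometheus and Grafana both expose health endpoints
PROM_HEALTH="http://localhost:${PROMETHEUS_PORT:-9090}/-/healthy"
GRAFANA_HEALTH="http://localhost:${GRAFANA_PORT:-3000}/api/health"
for url in "$PROM_HEALTH" "$GRAFANA_HEALTH"; do
  if curl -fsS --max-time 5 "$url" >/dev/null 2>&1; then
    echo "UP   $url"
  else
    echo "DOWN $url"
  fi
done
```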
If metrics don't appear on the dashboard, work through these checks:

1. Verify the vLLM server exposes metrics directly:

curl -H "Authorization: Bearer your-api-key" \
    https://your-endpoint/metrics

2. Check that Prometheus shows the target as UP at http://localhost:9090/targets

3. Confirm the endpoint and API key in monitoring.env are correct.

4. Inspect the container logs:

docker logs vllm-prometheus
docker logs vllm-grafana
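When the dashboard stays empty even though the target is UP, check that the scrape actually contains the series the panels query. This sketch greps a saved /metrics dump (written inline here as a sample) for vLLM's standard metric names; adjust the list for your vLLM version:

```shell
# Save a real dump with: curl -H "Authorization: Bearer <key>" https://<endpoint>/metrics > /tmp/metrics.dump
# Inline sample for illustration:
cat > /tmp/metrics.dump <<'EOF'
vllm:num_requests_running{model_name="m"} 2.0
vllm:num_requests_waiting{model_name="m"} 0.0
vllm:gpu_cache_usage_perc{model_name="m"} 0.43
EOF
for metric in vllm:num_requests_running vllm:num_requests_waiting vllm:gpu_cache_usage_perc; do
  if grep -q "^${metric}" /tmp/metrics.dump; then
    echo "found   ${metric}"
  else
    echo "missing ${metric}"
  fi
done
```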
Setting up comprehensive monitoring for vLLM doesn't have to be complex. With this Prometheus and Grafana stack, you get real-time visibility into request latency, token throughput, GPU cache usage, and error rates, all from a few configuration files and a single docker-compose command.
The monitoring setup described here provides the foundation for understanding your vLLM deployment's behavior, optimizing performance, and ensuring reliable service delivery. Whether you're running a single model or managing multiple deployments, this monitoring stack gives you the visibility needed for operational excellence.
Start with this basic setup, then expand with custom dashboards, alerting rules, and additional metrics as your monitoring needs evolve. The investment in proper observability pays dividends in system reliability, performance optimization, and operational confidence.
Ready to monitor your vLLM deployment? Download the complete monitoring stack configuration and get started in minutes!