Monitoring all the things with Prometheus, Loki, and Grafana

I finally got around to setting up a centralized solution for gathering and viewing metrics, status info, and logs from my servers. The goal is to make it easier to see the status of various devices and services, get alerts when things go wrong, and view logs alongside the metrics relevant to them. As usual, the services for the monitoring server and agents are deployed using Docker Swarm.

Overview

Information is gathered from a variety of sources into Prometheus and Loki, and can then be viewed in Grafana, either through custom dashboards or by manually querying whatever is stored. Prometheus gathers and stores metrics such as CPU, RAM, and HDD utilization, while Loki gathers and stores logs from various programs. I’m using a standalone VPS for monitoring so I can actually get alerts if any of the other servers go down; it’s also in a separate location from the other servers in case there is an issue with that location.

The Stack

Prometheus

Prometheus pulls metrics from services and exporters and stores them in a time series database. Since it’s one of the most popular ways of collecting metrics, a lot of programs natively expose Prometheus metrics, and there are a ton of exporters to gather everything else. Currently I’m using Prometheus to gather metrics from Docker containers, Traefik, and CrowdSec, and to do some status monitoring.

Docker Containers

Example Grafana Dashboard for Docker Containers

Docker natively supports Prometheus, and Prometheus can also get metrics directly from the Docker daemon; however, the metrics they provide are a bit limited. I wound up using cAdvisor to gather Docker container metrics. cAdvisor provides so many metrics that I limit which ones actually get stored in order to keep the database from getting bloated.
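As a sketch of what these metrics look like in practice, a PromQL query along these lines (using `container_cpu_usage_seconds_total`, one of the cAdvisor metrics I keep, grouped by the relabeled service name) graphs per-container CPU usage:

```promql
# CPU usage in cores per container over the last 5 minutes,
# grouped by the swarm service name relabeled into "name"
sum by (name) (rate(container_cpu_usage_seconds_total[5m]))
```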

Traefik

Example Grafana Dashboard for Traefik

Traefik is what I use for SSL termination and as a reverse proxy for any internet-facing services, so it provides Prometheus with network metrics for those services. Even though cAdvisor provides network metrics for each container, it can’t easily distinguish container-to-container traffic from internet traffic, so the Traefik metrics are useful for that.
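For example, with the `traefik_service_requests_total` counter I keep, a query like this (the `service` label comes from Traefik’s own metric labels) shows per-service request rates:

```promql
# Requests per second for each internet-facing service over 5 minutes
sum by (service) (rate(traefik_service_requests_total[5m]))
```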

CrowdSec

CrowdSec is like Fail2Ban: it monitors network connections to your machine and triggers actions based on various scenarios, such as dropping connections from an IP address that tries to brute-force SSH. It provides metrics about any actions it’s taken and the status of any scenarios.
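Since I only keep metrics matching `cs_.+`, a quick way to see everything CrowdSec exposes (without memorizing individual metric names) is a regex selector on the metric name:

```promql
# Current value of every stored CrowdSec metric, grouped by metric name
sum by (__name__) ({__name__=~"cs_.+"})
```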

Status Monitoring

The Blackbox exporter can probe endpoints over HTTP, HTTPS, DNS, TCP and ICMP. Currently I’m using the Blackbox exporter to monitor all the web services I run as a quick way to check their status and make sure that they’re actually operational. I’m also hosting an instance of Healthchecks to monitor the status of backups from various machines.
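The exporter’s main signal is `probe_success` (1 when the probe passed, 0 when it failed), which makes for a simple alert expression, something like:

```promql
# Matches any endpoint whose most recent probe failed
probe_success == 0
```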

Backup Healthchecks

There are a few other ways I could do this, but being able to monitor recurring tasks on machines that don’t need any other kind of monitoring is useful.

Hosts and SNMP

I don’t need to monitor my VPSes directly (I get enough information from cAdvisor) or anything using SNMP, but I did test these out in case I want to do some monitoring at home. Hosts are pretty straightforward since Prometheus has a node exporter for host metrics; this can easily be extended with the built-in textfile collector, which picks up the output of scripts run periodically. The SNMP exporter is useful for grabbing metrics from all sorts of devices, since SNMP is a standard protocol that’s been around for roughly 30 years. Prometheus has a generator that parses MIBs and creates an appropriate configuration file for monitoring whatever devices you have over SNMP.
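A minimal sketch of the textfile collector pattern: a periodic script writes metrics in the exposition format to a `.prom` file in whatever directory the node exporter is started with (`--collector.textfile.directory`; the path and metric name here are assumptions), using an atomic rename so a scrape never sees a half-written file.

```shell
#!/bin/sh
# Sketch of a textfile-collector script; the directory is an assumption and
# would need to match node_exporter's --collector.textfile.directory flag.
TEXTFILE_DIR="${TEXTFILE_DIR:-/tmp/node_exporter_textfile}"
mkdir -p "$TEXTFILE_DIR"
TMP="$TEXTFILE_DIR/backup.prom.tmp"
{
  echo '# HELP backup_last_run_timestamp_seconds Unix time this script last ran'
  echo '# TYPE backup_last_run_timestamp_seconds gauge'
  echo "backup_last_run_timestamp_seconds $(date +%s)"
} > "$TMP"
# Rename is atomic on the same filesystem, so scrapes never see a partial file
mv "$TMP" "$TEXTFILE_DIR/backup.prom"
```

Run it from cron at whatever interval suits the task; the node exporter picks up the file on its next scrape.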

Loki

Example Grafana Dashboard for Logs

Loki aggregates and stores logs that are pushed to it by clients. Setting up and configuring Loki is much, much simpler than Prometheus. Logs from the host machine can be pushed to Loki using Promtail, and the Loki Docker logging driver handles logs from all the Docker containers. That’s all I needed to get the logs I wanted, pretty simple, right?
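Once the logs are in Loki, a LogQL query like this (the labels match what my Promtail config attaches) pulls error lines for one host:

```logql
# All /var/log lines from one host that contain "error"
{job="varlogs", host="server.test"} |= "error"
```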

Grafana

Grafana allows you to create visualizations and alerts from Prometheus and Loki queries. Setting up Grafana itself isn’t too difficult; most of the challenge comes from learning how to query Prometheus and Loki and building useful dashboards from that information. I got ideas from various example dashboards but wound up either redoing them or making my own.
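For instance, a panel on a multi-machine dashboard might use the `host` label (which my scrape configs attach to every target) as a dashboard variable:

```promql
# Memory usage per container for whichever machine is picked in the $host dropdown
sum by (name) (container_memory_usage_bytes{host="$host"})
```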

Challenges

Remote Devices

One of the downsides of Prometheus is that it’s designed primarily for pulling metrics, which is fine for machines on the same network but tricky for anything only reachable over the internet. Prometheus and its exporters don’t offer much in the way of security, so exposing ports on remote machines just to gather metrics isn’t a great idea. Getting metrics from devices that don’t have a static IP or are behind NAT also poses some issues. There’s PushProx, which can be used to push metrics to Prometheus; however, there are some potential security issues with doing it this way, and those would also apply if you used a VPN instead of PushProx.

I wound up using grafana-cloud-agent, which can push metrics to Prometheus and logs to Loki. Despite the name, you don’t need to be using Grafana Cloud. It turns out Prometheus does support having metrics pushed directly to it; it’s just disabled by default. One issue with pushing metrics and logs to Prometheus and Loki remotely is that they don’t implement much, if any, security, so if you leave them exposed to the internet, anyone can query them or push metrics and logs to them. To counteract this I added basic HTTP authentication in front of them, which isn’t much, but it’s something.

Alerts

Grafana has some really odd limitations when it comes to alerts. Alerts can only be configured on graph panels, so you either forgo other kinds of visualization or create a duplicate panel that’s a graph. One of Grafana’s strengths is dashboard variables, which let you do things like reuse one dashboard for multiple machines: just choose the machine from the dropdown and the page reloads with that machine’s metrics. However, you can’t use variables when creating alerts, so if you want alerts for multiple machines you have to create a separate graph for each machine or implement a workaround.

Monitoring Server Example

Here are some example configurations that are peppered with relevant links and notes.

docker_swarm_server.yml

Example of the swarm stack I use on the monitoring server. This is where everything gets collected together and is made accessible.

version: "3.8"
# https://docs.docker.com/compose/compose-file/compose-file-v3/
# https://docs.docker.com/compose/compose-file/compose-file-v3/#extension-fields
x-logging:
  # https://docs.docker.com/compose/compose-file/compose-file-v3/#logging
  &loki-logging
  driver: loki
  # https://grafana.com/docs/loki/latest/clients/docker-driver/
  options:
    loki-url: "http://127.0.0.1:3100/loki/api/v1/push" # Containers and Loki are running on the same host

services:
  prometheus: # Collects and stores local and remote metrics
    # https://hub.docker.com/r/prom/prometheus
    # https://prometheus.io/docs/introduction/overview/
    image: prom/prometheus:latest
    volumes:
      - prometheus:/prometheus # Store database in named volume
    networks:
      - traefik # Overlay network for containers that need to be accessible over the internet
      - metrics # Overlay network to get metrics from local containers not in this swarm stack
      - monitor # Network for containers in this stack to communicate with each other
    entrypoint: # Override default entrypoint to enable features
      - /bin/prometheus
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      - --storage.tsdb.retention.time=30d
      - --web.console.libraries=/usr/share/prometheus/console_libraries
      - --web.console.templates=/usr/share/prometheus/consoles
      - --enable-feature=remote-write-receiver # Feature to allow for metrics to be pushed to Prometheus https://prometheus.io/docs/prometheus/latest/disabled_features/
    configs:
      - source: prometheus.yml
        target: /etc/prometheus/prometheus.yml
    logging: *loki-logging
    deploy:
      labels:
        # Labels for traefik https://doc.traefik.io/traefik/providers/docker/
        - "traefik.enable=true"
        - "traefik.http.routers.prometheus.entrypoints=websecure"
        - "traefik.http.routers.prometheus.rule=Host(`prometheus.server.test`)"
        - "traefik.http.services.prometheus.loadbalancer.server.port=9090" # Port traefik needs to route traffic to
        # Basic password auth https://doc.traefik.io/traefik/middlewares/basicauth/
        - "traefik.http.middlewares.prometheus_auth.basicauth.users=prom_agent:prom_agent_password"
        # Enable middleware
        - "traefik.http.routers.prometheus.middlewares=prometheus_auth@docker"

  loki: # Stores local and remote logs
    # https://hub.docker.com/r/grafana/loki
    # https://github.com/grafana/loki
    # https://github.com/grafana/loki/tree/main/production
    # https://grafana.com/docs/loki/latest/overview/
    image: grafana/loki:latest
    volumes:
      - loki:/loki
    networks:
      - traefik
      - monitor
    ports:
      - 3100:3100
    logging: *loki-logging
    deploy:
      labels:
        - "traefik.enable=true"
        - "traefik.http.routers.loki.entrypoints=websecure"
        - "traefik.http.routers.loki.rule=Host(`loki.server.test`)"
        - "traefik.http.services.loki.loadbalancer.server.port=3100"
        - "traefik.http.middlewares.loki_auth.basicauth.users=loki_agent:loki_agent_password"
        - "traefik.http.routers.loki.middlewares=loki_auth@docker"

  grafana: # View metrics and logs from Prometheus and Loki
    # https://hub.docker.com/r/grafana/grafana
    # https://grafana.com/docs/grafana/latest/
    image: grafana/grafana:latest
    volumes:
      - grafana:/var/lib/grafana
    networks:
      - traefik
      - monitor
    environment:
      - GF_SERVER_DOMAIN=grafana.server.test
      - GF_SERVER_ROOT_URL=https://grafana.server.test
      - GF_INSTALL_PLUGINS=grafana-worldmap-panel,flant-statusmap-panel
      - GF_SERVER_ENABLE_GZIP=true
    logging: *loki-logging
    deploy:
      labels:
        - "traefik.enable=true"
        - "traefik.http.routers.grafana.entrypoints=websecure"
        - "traefik.http.routers.grafana.rule=Host(`grafana.server.test`)"
        - "traefik.http.services.grafana.loadbalancer.server.port=3000"

  promtail: # Collects host logs for Loki
    # https://hub.docker.com/r/grafana/promtail
    # https://grafana.com/docs/loki/latest/clients/promtail/
    # https://github.com/grafana/loki/tree/main/clients/cmd/promtail
    image: grafana/promtail:latest
    volumes:
      - /var/log:/ext_logs:ro # Mount host logs
      - promtail:/promtail_pos
    networks:
      - monitor
    environment:
      - HOSTNAME=server.test # Logs get labeled with the hostname
    configs:
      - source: promtail.yml
        target: /etc/promtail/config.yml
    entrypoint:
      - /usr/bin/promtail
      - -config.file=/etc/promtail/config.yml
      - -config.expand-env # Allow environment variables to be used in the config file
    logging: *loki-logging

  cadvisor: # Gather metrics for containers running on the local host
    # https://github.com/google/cadvisor
    # https://prometheus.io/docs/guides/cadvisor/
    image: gcr.io/cadvisor/cadvisor:latest
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    networks:
      - monitor
    entrypoint:
      - /usr/bin/cadvisor 
      - -logtostderr 
      - -disable_metrics=referenced_memory,cpu_topology,resctrl,udp,advtcp,sched,hugetlb,memory_numa,tcp,process,accelerator,disk
    logging: *loki-logging

  blackbox: # Pings remote endpoints to see if they're up
    # https://hub.docker.com/r/prom/blackbox-exporter
    # https://github.com/prometheus/blackbox_exporter
    image: prom/blackbox-exporter:latest
    networks:
      - monitor
    configs:
      - source: blackbox.yml
        target: /etc/blackbox_exporter/config.yml
    logging: *loki-logging

  backup: # Backup all the datas
    image: mazzolino/restic:latest
    # https://github.com/djmaze/resticker
    environment:
      - BACKUP_CRON=0 0 1 * * *
      - RESTIC_REPOSITORY=s3:URL
      - AWS_ACCESS_KEY_ID=
      - AWS_SECRET_ACCESS_KEY=
      - RESTIC_PASSWORD_FILE=/run/secrets/MONITORING_RESTIC_PASSWORD_FILE
      - RESTIC_BACKUP_SOURCES=/data
      - RESTIC_BACKUP_ARGS=--verbose
      - RESTIC_FORGET_ARGS=--prune --keep-daily 7 --keep-weekly 4 --keep-monthly 12
      - PRE_COMMANDS=curl -m 10 --retry 3 https://healthcheck.server.test/ping/random-string/start # Pings healthcheck which I use for monitoring backups and other tasks
      - POST_COMMANDS_SUCCESS=curl -m 10 --retry 3 https://healthcheck.server.test/ping/random-string
      - POST_COMMANDS_FAILURE=curl -m 10 --retry 3 https://healthcheck.server.test/ping/random-string/fail
    volumes:
      - prometheus:/data/prometheus
      - loki:/data/loki
      - grafana:/data/grafana
    secrets:
      - MONITORING_RESTIC_PASSWORD_FILE
    logging: *loki-logging

volumes:
  prometheus:
  loki:
  grafana:
  promtail:

networks:
  monitor:
  metrics:
    external: true
  traefik:
    external: true

configs:
  prometheus.yml:
    external: true
  promtail.yml:
    external: true
  blackbox.yml:
    external: true

secrets:
  MONITORING_RESTIC_PASSWORD_FILE:
    external: true

prometheus.yml

Configuration file for Prometheus; this tells Prometheus what metrics to scrape from services running on the local host. To try and keep the database small, I only keep the metrics I’m interested in.

# https://prometheus.io/docs/prometheus/latest/configuration/configuration/
# https://grafana.com/docs/grafana-cloud/billing-and-usage/prometheus/usage-reduction/
global:
  scrape_interval: 30s

scrape_configs:
  - job_name: cadvisor # Local container metrics
    # https://github.com/google/cadvisor/blob/master/docs/storage/prometheus.md
    static_configs:
      - targets:
          - cadvisor:8080 # Can use the service name of a container to connect to it if containers are on the same network
        labels:
          host: server.test # Add hostname label to make it easier to differentiate machines
    metric_relabel_configs:
      # Change the name from swarmstack_container.1.xxxx to swarmstack_container
      - source_labels: [container_label_com_docker_swarm_service_name]
        target_label: name
        # Only keep data that is used in graphs to reduce size
      - source_labels: [__name__]
        regex: container_last_seen|container_memory_usage_bytes|container_network_receive_bytes_total|container_network_transmit_bytes_total|container_fs_reads_bytes_total|container_fs_writes_bytes_total|container_cpu_usage_seconds_total|container_start_time_seconds
        action: keep
        # Only keep necessary labels to keep things neat and reduce size
      - regex: Time|__name__|container_label_com_docker_stack_namespace|instance|job|name|interface|device|Value.*|host
        action: labelkeep

  - job_name: traefik # Network metrics for internet facing services
    # https://doc.traefik.io/traefik/observability/metrics/overview/
    static_configs:
      - targets:
          - traefik:8080
        labels:
          host: server.test
    metric_relabel_configs:
      # Only keep data that is used in graphs to reduce size
      - source_labels: [__name__]
        regex: traefik_service_requests_total|traefik_service_request_duration_seconds_sum|traefik_service_request_duration_seconds_count
        action: keep

  - job_name: crowdsec # Security related metrics
    # https://doc.crowdsec.net/Crowdsec/v1/observability/prometheus/
    static_configs:
      - targets:
          - crowdsec:6060
        labels:
          host: server.test
    metric_relabel_configs:
      # Only keep data that is used in graphs to reduce size
      - source_labels: [__name__]
        regex: cs_.+
        action: keep

  - job_name: blackbox # Remote website status
    metrics_path: /probe
    params:
      module: [http_get] # Check for an HTTP 200 response
    static_configs:
      - targets:
          - https://zeigren.com
          - https://shop.zeigren.com
          - https://docs.zeigren.com
          - https://kairohm.dev
          - https://bookstack.kairohm.dev
          - https://phabricator.kairohm.dev
          - https://inventree.kairohm.dev/part
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox:9115 # The blackbox exporter hostname:port

  - job_name: "healthchecks-backups" # Backups status
    # https://healthchecks.io/docs/configuring_prometheus/
    metrics_path: "/projects/random-string/metrics/random-string"
    static_configs:
      - targets:
          - "healthchecks:8000"

promtail.yml

Configuration for Promtail which is what gathers host logs for Loki.

# https://grafana.com/docs/loki/latest/clients/promtail/configuration/
server:
  disable: true

client:
  url: http://loki:3100/loki/api/v1/push

positions:
  filename: /promtail_pos/positions.yaml

scrape_configs:
  - job_name: varlogs
    static_configs:
    - targets:
      - localhost
      labels:
        job: varlogs
        host: ${HOSTNAME:-default_value} # Add hostname label
        __path__: /ext_logs/*log

blackbox.yml

Configuration for the Blackbox exporter which I’m only using to ping external websites.

# https://github.com/prometheus/blackbox_exporter/blob/master/CONFIGURATION.md
modules:
  http_get:
    prober: http
    timeout: 10s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      preferred_ip_protocol: "ip4"
      ip_protocol_fallback: false

Agent Example

Here are some example configurations that are peppered with relevant links and notes.

docker_swarm_agent.yml

Example of the swarm stack I use on any machine I want to monitor.

version: "3.8"
services:
  agent: # Scrapes local services and sends info to Prometheus and Loki on the monitoring server
    image: grafana/agent:latest
    # https://hub.docker.com/r/grafana/agent
    # https://grafana.com/docs/grafana-cloud/agent/
    # https://github.com/grafana/agent
    volumes:
      - /var/log:/ext_logs:ro
      - agent:/agent_location
    networks:
      - metrics
      - monitor
    environment:
      - HOSTNAME=zeigren.com
      - PROM_REMOTE_WRITE_URL=https://prometheus.server.test/api/v1/write
      - LOKI_REMOTE_WRITE_URL=https://loki.server.test/loki/api/v1/push
    entrypoint:
      - /bin/agent
      - -config.file=/etc/agent-config/agent.yml
      - -config.expand-env
    configs:
      - source: agent.yml
        target: /etc/agent-config/agent.yml
    secrets:
      - prom_agent_password
      - loki_agent_password

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    networks:
      - monitor
    entrypoint:
      - /usr/bin/cadvisor
      - -logtostderr
      - -disable_metrics=referenced_memory,cpu_topology,resctrl,udp,advtcp,sched,hugetlb,memory_numa,tcp,process,accelerator,disk

volumes:
  agent:

networks:
  monitor:
  metrics:
    external: true

configs:
  agent.yml:
    external: true

secrets:
  prom_agent_password:
    external: true
  loki_agent_password:
    external: true

agent.yml

Configuration for the Grafana agent which is pretty much a combination of prometheus.yml and promtail.yml but with some extra bits.

# https://github.com/grafana/agent/blob/main/docs/configuration-reference.md
prometheus:
  wal_directory: /agent_location/wal
  global:
    scrape_interval: 30s
  configs:
    - name: ${HOSTNAME:-default_value}
      scrape_configs:
        - job_name: traefik
          static_configs:
            - targets:
                - traefik:8080
              labels:
                host: ${HOSTNAME:-default_value}
          metric_relabel_configs:
            - source_labels: [__name__]
              regex: traefik_service_requests_total|traefik_service_request_duration_seconds_sum|traefik_service_request_duration_seconds_count
              action: keep

        - job_name: cadvisor
          static_configs:
            - targets:
                - cadvisor:8080
              labels:
                host: ${HOSTNAME:-default_value}
          metric_relabel_configs:
            - source_labels: [container_label_com_docker_swarm_service_name]
              target_label: name
            - source_labels: [__name__]
              regex: container_last_seen|container_memory_usage_bytes|container_network_receive_bytes_total|container_network_transmit_bytes_total|container_fs_reads_bytes_total|container_fs_writes_bytes_total|container_cpu_usage_seconds_total|container_start_time_seconds
              action: keep
            - regex: Time|__name__|container_label_com_docker_stack_namespace|instance|job|name|interface|device|Value.*|host
              action: labelkeep

        - job_name: crowdsec
          static_configs:
            - targets:
                - crowdsec:6060
              labels:
                host: ${HOSTNAME:-default_value}
          metric_relabel_configs:
            - source_labels: [__name__]
              regex: cs_.+
              action: keep
      remote_write:
        - url: ${PROM_REMOTE_WRITE_URL:-http://localhost:9090/api/v1/write}
          basic_auth:
            username: prom_agent
            password_file: /run/secrets/prom_agent_password # Get basic auth password from a Docker secret

loki:
  configs:
    - name: ${HOSTNAME:-default_value}
      positions:
        filename: /agent_location/positions.yaml
      scrape_configs:
        - job_name: varlogs
          static_configs:
            - targets:
                - localhost
              labels:
                job: varlogs
                host: ${HOSTNAME:-default_value}
                __path__: /ext_logs/*log
      clients:
        - url: ${LOKI_REMOTE_WRITE_URL:-http://localhost:3100/loki/api/v1/push}
          basic_auth:
            username: loki_agent
            password_file: /run/secrets/loki_agent_password