Monitoring Virtualized Servers with Telegraf

Tags: virtualization, edge, Linux

Dashboards are a great way to get at-a-glance diagnostics of a server's vital statistics. Consolidating this data from multiple servers in one place makes this server-sitting task even easier. A Linux server configured with Telegraf, InfluxDB and Grafana (a.k.a. a TIG stack) is an easy way to accomplish this. This post focuses on getting Telegraf, an open-source data collection agent, configured to collect server and vSphere metrics and ship them to an InfluxDB OSS v2.0 time-series database.

InfluxDB OSS version 2.0 has completely integrated its dashboard capabilities (Chronograf) and its data processing engine (Kapacitor). Prior to this integration, this functionality was achieved with a TICK stack. While InfluxDB OSS can accomplish all of this on its own, Grafana is a broader-reaching analytics and visualization package and is worth learning and exploring.

Installing Telegraf @ Servers

Telegraf will need to be installed on each of the servers to be monitored. The easiest way to do this is to add the Influx repository, specific to the server's Linux distribution, to the apt sources list. The example below is for Ubuntu 20.04 with codename focal. To determine which release to use, the command lsb_release -a will output the distribution details (Linux Standard Base) for the server. The curl command pulls the Influx repository public key and adds it as a trusted key to the apt subsystem. Apt will use this key to verify packages (and updates) received from the Influx repository moving forward.

echo "deb https://repos.influxdata.com/ubuntu focal stable" | sudo tee /etc/apt/sources.list.d/influxdb.list
curl -sL https://repos.influxdata.com/influxdb.key | sudo apt-key add -
sudo apt update && sudo apt install telegraf

Configuring Telegraf @ Servers

Create a stripped-down telegraf.conf file to configure the telegraf agent and the influxdb_v2 output plugin. The output is assigned a default bucket which would normally capture all telegraf data produced by this server. By setting bucket_tag, telegraf will instead use the value of the influx_bucket tag to route data to a specific bucket. The exclude_bucket_tag setting prevents telegraf from writing the routing tag into the data itself. The URLs, token and organization come from the influx configuration of the TIG server.

[global_tags]
  ... leave defaults

[agent]
  ... leave defaults

[[outputs.influxdb_v2]]
  urls = ["http://<TIG-server>:8086"]
  token = "xKyZ7ooVUBOmI...<your unique InfluxDB token>..."
  organization = "phaedrus"
  bucket_tag = "influx_bucket"
  exclude_bucket_tag = true
  bucket = "phaedrus_primary"
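The routing behavior of bucket_tag and exclude_bucket_tag can be sketched in Python. This is an illustrative model only, not Telegraf's actual code; the metric dictionary shape is an assumption for the example:

```python
def route_metric(metric, default_bucket, bucket_tag="influx_bucket",
                 exclude_bucket_tag=True):
    """Pick a destination bucket the way the influxdb_v2 output's
    bucket_tag option behaves: the tag value wins over the default,
    and with exclude_bucket_tag the routing tag is stripped before
    the point is written. (Illustrative sketch, not Telegraf's code.)"""
    tags = dict(metric.get("tags", {}))
    if exclude_bucket_tag:
        bucket = tags.pop(bucket_tag, default_bucket)
    else:
        bucket = tags.get(bucket_tag, default_bucket)
    return bucket, {**metric, "tags": tags}

# A cpu metric tagged by system.conf lands in the per-server bucket,
# with the routing tag removed before writing:
bucket, m = route_metric(
    {"name": "cpu", "tags": {"influx_bucket": "VSVR01WEB", "cpu": "cpu0"}},
    default_bucket="phaedrus_primary")
print(bucket)        # VSVR01WEB
print(m["tags"])     # {'cpu': 'cpu0'}
```

An untagged metric falls back to the default bucket, which is why the output still declares one.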

Under the /etc/telegraf/telegraf.d folder, create a system.conf file to configure the system plugins and pull the system stats from each server. Note the use of the influx_bucket tag, which directs the data for each system metric to a specific influx v2 bucket. A matching influx bucket must be created at the TIG server for each monitored server.

# Read metrics about cpu usage
[[inputs.cpu]]
  percpu = true
  totalcpu = true
  collect_cpu_time = false
  report_active = false
  [inputs.cpu.tags]
    influx_bucket = "VSVR01WEB"

# Read metrics about disk usage by mount point
[[inputs.disk]]
  ignore_fs = ["tmpfs", "devtmpfs", "devfs", "iso9660", "overlay", "aufs", "squashfs"]
  [inputs.disk.tags]
    influx_bucket = "VSVR01WEB"

# Read metrics about disk IO by device
[[inputs.diskio]]
  [inputs.diskio.tags]
    influx_bucket = "VSVR01WEB"

# Get kernel statistics from /proc/stat
[[inputs.kernel]]
  [inputs.kernel.tags]
    influx_bucket = "VSVR01WEB"

# Read metrics about memory usage
[[inputs.mem]]
  [inputs.mem.tags]
    influx_bucket = "VSVR01WEB"

# Get the number of processes and group them by status
[[inputs.processes]]
  [inputs.processes.tags]
    influx_bucket = "VSVR01WEB"

# Read metrics about swap memory usage
[[inputs.swap]]
  [inputs.swap.tags]
    influx_bucket = "VSVR01WEB"

# Read metrics about system load & uptime
[[inputs.system]]
  [inputs.system.tags]
    influx_bucket = "VSVR01WEB"

Telegraf Configuration: All of the basic telegraf configuration shown is pulled from the default telegraf.conf file. This file includes typical settings for a huge set of telegraf input and output plugins. In each case, the file includes commented sections to guide full customization for most plugins, including how to include or exclude specific metrics. For manageability and ease of deployment, the configuration has been broken into modular sections.

Once the configuration is complete for each server, the telegraf service can be enabled to start the polling.

sudo systemctl enable --now telegraf

Configuring Telegraf @ TIG Server

The TIG server should carry the same system monitoring configuration, described above, to add its own system metrics to the database. In addition, the vSphere plugin can be used to pull additional metrics specific to the ESXi hypervisor. A read-only vSphere user should be created to facilitate data polling from telegraf.

The following can be used as a vsphere.conf file under /etc/telegraf/telegraf.d.

[[inputs.vsphere]]
   vcenters = [ "https://<vcenter-host>/sdk" ]
   username = "<vSphere user>"
   password = "<vSphere user password>"
   vm_metric_include = []
   host_metric_include = []
   cluster_metric_include = []
   datastore_metric_include = []
   datacenter_metric_include = []
   insecure_skip_verify = true
   [inputs.vsphere.tags]
      influx_bucket = "VSPHERE"
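The empty *_metric_include lists above collect everything; when patterns are supplied, the plugin matches metric names glob-style. A rough Python model of that selection (a hypothetical helper for illustration, not the plugin's code):

```python
from fnmatch import fnmatch

def select_metrics(available, include):
    """Glob-style include filtering in the spirit of Telegraf's metric
    include lists: an empty list collects everything, otherwise keep
    metrics matching any pattern. (Sketch, not the plugin's code.)"""
    if not include:
        return list(available)
    return [m for m in available if any(fnmatch(m, pat) for pat in include)]

metrics = ["cpu.usage.average", "mem.active.average", "disk.read.average"]
print(select_metrics(metrics, []))         # all three metrics
print(select_metrics(metrics, ["cpu.*"]))  # ['cpu.usage.average']
```

Narrowing these lists is the main lever for reducing the volume of vSphere data written to InfluxDB.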

Dashboards & Monitoring

Once the telegraf instances are up and running, InfluxDB will be collecting vast quantities of system metrics across all servers. The volume is overwhelming, so carefully selecting the metrics you want to monitor is worth the effort. I use Grafana for dashboarding, and while there are several publicly available system and vSphere dashboards, most use InfluxQL to pull data instead of the more advanced Flux language. InfluxQL is a very SQL-like query language and appears to be supported by Influx v2, but it requires mapping Influx v2 buckets to Influx v1 style databases and retention policies. Flux has a very JavaScript-like syntax and makes use of a pipe operator |> to pass data from function to function in order to build advanced queries.

One of the easiest ways to get started with Flux for system and vSphere monitoring is to pull the related Influx templates and reverse engineer the queries to suit your needs. You can also review the InfluxQL-style Grafana dashboards, indicated with INFLUXDB 1.0.0 as a dependency, and pull the metric and filtering tags needed to build an equivalent Flux query.

A couple of good starting points are the InfluxDB community templates (which include Linux system and vSphere monitoring) and the public Grafana dashboard library.

Focus on the system and vSphere metrics that you understand and that are high-level indicators of your overall system performance. In my case, I'm concerned with the metrics related to the bottlenecks on my specific system: primarily RAM and disk space.

Server CPU Load

Flux query:

from(bucket: "VSVR01WEB")
    |> range(start: v.timeRangeStart)
    |> filter(fn: (r) => r._measurement == "system")
    |> filter(fn: (r) => r._field == "load1" or r._field == "load5" or r._field == "load15")
    |> aggregateWindow(every: v.windowPeriod, fn: mean, createEmpty: false)
    |> yield(name: "mean")
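aggregateWindow in the query above groups raw points into fixed time windows and reduces each window with the supplied function (mean here). A minimal Python analogue of that windowed mean, with timestamps in seconds (a simplification of Flux's behavior, for intuition only):

```python
def aggregate_window(points, every, fn=lambda vs: sum(vs) / len(vs)):
    """Bucket (timestamp, value) samples into windows of `every` seconds
    and reduce each window with `fn`, roughly what Flux's
    aggregateWindow(every: ..., fn: mean) does. Empty windows are
    skipped, as with createEmpty: false."""
    windows = {}
    for ts, val in points:
        windows.setdefault(ts // every, []).append(val)
    return {w * every: fn(vals) for w, vals in sorted(windows.items())}

samples = [(0, 1.0), (10, 3.0), (70, 5.0)]
print(aggregate_window(samples, every=60))   # {0: 2.0, 60: 5.0}
```

In Grafana, v.windowPeriod picks the window size automatically from the panel's time range, so the series stays readable at any zoom level.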

Dashboard Element:

System Load: load1, load5 and load15 are the 1, 5 and 15 minute load averages, i.e. the average number of processes either running or waiting for CPU time. On a single-CPU system, values over 1.0 mean the system was overloaded: processes were queued waiting for the CPU. Brief spikes are expected, but prolonged overloading indicates a system that's not keeping up with demand.
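Since load averages scale with core count, a multi-core server is only overloaded when the load exceeds the number of CPUs. A small hypothetical helper for turning a load sample into a dashboard status (the 0.7 "busy" threshold is an arbitrary choice of mine):

```python
def load_status(load1, ncpu):
    """Classify a 1-minute load average relative to available CPUs.
    Load counts runnable (or waiting) processes, so overload begins
    when it exceeds the CPU count."""
    per_cpu = load1 / ncpu
    if per_cpu > 1.0:
        return "overloaded"
    if per_cpu >= 0.7:          # arbitrary warning threshold
        return "busy"
    return "healthy"

print(load_status(0.4, 1))   # healthy
print(load_status(2.5, 2))   # overloaded
```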

Memory (RAM) Usage

Flux query:

from(bucket: "VSVR01WEB")
    |> range(start: v.timeRangeStart)
    |> filter(fn: (r) => r._measurement == "mem")
    |> filter(fn: (r) => r._field == "used_percent")
    |> yield(name: "mean")

Dashboard Element:

Root Disk Space Free

Flux query:

from(bucket: "VSVR01WEB")
    |> range(start: v.timeRangeStart)
    |> filter(fn: (r) => r._measurement == "disk")
    |> filter(fn: (r) => r._field == "free")
    |> filter(fn: (r) => r.path == "/")
    |> yield(name: "mean")
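The disk plugin reports free as raw bytes, which read better on a gauge after conversion. A small formatting helper (hypothetical, for preprocessing outside Grafana; Grafana's own unit mappings can do the same):

```python
def human_bytes(n):
    """Format a byte count using binary (1024-based) units."""
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if n < 1024:
            return f"{n:.1f} {unit}"
        n /= 1024
    return f"{n:.1f} PiB"

print(human_bytes(52_428_800))   # 50.0 MiB
```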

Dashboard Element:

Interesting vSphere Metrics

The following metrics are a short-list of interesting data to monitor for each of the virtual machines and incorporate into dashboard components with the appropriate Flux queries. These metrics have a number of additional parameters per data point that can be used to further refine and filter the queries.

Measurement      Field                Description                           Units
vsphere_vm_mem   active_average       VM memory actively used               KiB
vsphere_vm_mem   consumed_average     VM memory allocated                   KiB
vsphere_vm_cpu   usage_average        VM percent use of host CPU capacity   %
vsphere_vm_net   transmitted_average  network transmit rate                 KB/s
vsphere_vm_net   received_average     network receive rate                  KB/s
vsphere_vm_disk  read_average         disk read rate                        KB/s
vsphere_vm_disk  write_average        disk write rate                       KB/s
vsphere_vm_disk  usage_average        disk use rate (read + write)          KB/s
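The mixed units (KiB for memory, KB/s for rates) are easy to misread on a shared dashboard, so it helps to normalize them for display. A couple of conversion helpers (names are my own; MB/s here assumes decimal megabytes):

```python
def kib_to_gib(kib):
    """vSphere memory counters report KiB; convert to GiB for display."""
    return kib / (1024 * 1024)

def kbps_to_mbps(kbps):
    """vSphere rate counters report KB/s; convert to (decimal) MB/s."""
    return kbps / 1000

print(kib_to_gib(4_194_304))   # 4.0 (GiB)
print(kbps_to_mbps(2500))      # 2.5 (MB/s)
```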
