Dashboards are a great way to get at-a-glance diagnostics of a server's vital statistics, and consolidating this data from multiple servers in one place makes the server-sitting task even easier. A Linux server configured with Telegraf, InfluxDB and Grafana (a.k.a. a TIG stack) is an easy way to accomplish this. This post focuses on getting Telegraf, an open-source data collection agent, configured to collect server and vSphere metrics into an InfluxDB OSS v2.0 time-series database.
InfluxDB OSS version 2.0 has completely integrated its dashboard capabilities (Chronograf) and its data processing engine (Kapacitor); prior to this integration, that functionality required a TICK stack. While InfluxDB OSS is capable of all of this on its own, Grafana is a broader-reaching analytics and visualization package and is worth learning and exploring.
Telegraf will need to be installed on each of the servers to be monitored. The easiest way to do this is to add the Influx repository, specific to the server's Linux distribution, to the apt sources list. The example below is for Ubuntu 20.04, codename focal. To determine which release to use, the `lsb_release -a` command will output the distribution details (Linux Standard Base) for the server. The `curl` command is used to pull the Influx repository's public key and add it as a trusted key to the apt subsystem. Apt will use this key to verify packages (and updates) received from the Influx repository moving forward.
```shell
echo "deb https://repos.influxdata.com/ubuntu focal stable" | sudo tee /etc/apt/sources.list.d/influxdb.list
curl -sL https://repos.influxdata.com/influxdb.key | sudo apt-key add -
sudo apt-get update
sudo apt-get install telegraf
```
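Rather than hard-coding the codename, it can be derived from `lsb_release` directly. A small sketch, assuming a Debian/Ubuntu system; the fallback to focal is only for machines where `lsb_release` is not installed:

```shell
# Derive the distribution codename (e.g. "focal") instead of hard-coding it.
# Falls back to "focal" if lsb_release is unavailable.
CODENAME=$(lsb_release -cs 2>/dev/null || echo focal)
REPO_LINE="deb https://repos.influxdata.com/ubuntu ${CODENAME} stable"
echo "$REPO_LINE"
```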
Create a stripped-down telegraf.conf file to configure the telegraf agent and the influx (v2) output plugin. The output is assigned a default bucket which would normally capture all telegraf data produced by this server. By setting `bucket_tag`, telegraf will instead use the assigned `influx_bucket` tag to route data to a specific bucket. The `exclude_bucket_tag` option prevents telegraf from writing the bucket tag itself into the data. The URLs, token and organization are associated with the influx configuration of the TIG server.
```toml
[global_tags]
  # ... leave defaults

[agent]
  # ... leave defaults

[[outputs.influxdb_v2]]
  urls = ["http://192.168.1.32:8086"]
  token = "xKyZ7ooVUBOmI...<your unique InfluxDB token>..."
  organization = "phaedrus"
  bucket_tag = "influx_bucket"
  exclude_bucket_tag = true
  bucket = "phaedrus_primary"
```
Under the /etc/telegraf/telegraf.d folder, create a system.conf file to configure the system plugins and pull the system stats from each server. Note the use of the `influx_bucket` tag, which will direct the data for each system metric to a specific influx v2 bucket. A matching influx bucket must be created on the TIG server for each monitored server.
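On the TIG server, those buckets can be created in the InfluxDB web UI or with the `influx` CLI. A hedged example, assuming the CLI is already authenticated against the local instance and using the bucket and organization names from this post:

```shell
# create one bucket per monitored server (names are this post's examples)
influx bucket create --name VSVR01WEB --org phaedrus
influx bucket create --name VSPHERE --org phaedrus
```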
```toml
# Read metrics about cpu usage
[[inputs.cpu]]
  percpu = true
  totalcpu = true
  collect_cpu_time = false
  report_active = false
  [inputs.cpu.tags]
    influx_bucket = "VSVR01WEB"

# Read metrics about disk usage by mount point
[[inputs.disk]]
  ignore_fs = ["tmpfs", "devtmpfs", "devfs", "iso9660", "overlay", "aufs", "squashfs"]
  [inputs.disk.tags]
    influx_bucket = "VSVR01WEB"

# Read metrics about disk IO by device
[[inputs.diskio]]
  [inputs.diskio.tags]
    influx_bucket = "VSVR01WEB"

# Get kernel statistics from /proc/stat
[[inputs.kernel]]
  [inputs.kernel.tags]
    influx_bucket = "VSVR01WEB"

# Read metrics about memory usage
[[inputs.mem]]
  [inputs.mem.tags]
    influx_bucket = "VSVR01WEB"

# Get the number of processes and group them by status
[[inputs.processes]]
  [inputs.processes.tags]
    influx_bucket = "VSVR01WEB"

# Read metrics about swap memory usage
[[inputs.swap]]
  [inputs.swap.tags]
    influx_bucket = "VSVR01WEB"

# Read metrics about system load & uptime
[[inputs.system]]
  [inputs.system.tags]
    influx_bucket = "VSVR01WEB"
```
Telegraf Configuration: All of the basic telegraf configuration shown here is pulled from the default telegraf.conf file. That file includes typical settings for a huge set of telegraf input and output plugins, with commented sections guiding full customization for most of them, including how to include or exclude specific metrics. For manageability and ease of deployment, the configuration has been broken into modular sections.
Once the configuration is complete for each server, the telegraf service can be enabled to start the polling.
```shell
sudo systemctl enable --now telegraf
```
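If no data arrives, telegraf's built-in dry run is a quick sanity check; assuming the default config paths used above, it gathers the metrics once and prints them to stdout instead of writing to InfluxDB:

```shell
# one-shot gather; prints metrics to stdout, nothing is written to InfluxDB
telegraf --config /etc/telegraf/telegraf.conf --config-directory /etc/telegraf/telegraf.d --test
```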
The TIG server should carry the same system monitoring configuration, described above, to add its own system metrics to the database. In addition, the vSphere plugin can be used to pull additional metrics specific to the ESXi hypervisor. A read-only vSphere user should be created to facilitate data polling from telegraf.
The following can be used as a vsphere.conf file under /etc/telegraf/telegraf.d.
```toml
[[inputs.vsphere]]
  vcenters = [ "https://192.168.1.30/sdk" ]
  username = "<vSphere user>"
  password = "<vSphere user password>"

  ## empty include lists collect all available metrics
  vm_metric_include = []
  host_metric_include = []
  cluster_metric_include = []
  datastore_metric_include = []
  datacenter_metric_include = []

  ## skip TLS certificate verification (self-signed vCenter certs)
  insecure_skip_verify = true

  [inputs.vsphere.tags]
    influx_bucket = "VSPHERE"
```
Flux, InfluxDB's functional data scripting language, uses the pipe-forward operator `|>` to pass data from function to function in order to develop advanced queries.
One of the easiest ways to get started with Flux for system and vSphere monitoring is to pull the related Influx templates and reverse engineer the queries to suit your needs. You can also review the InfluxQL-style Grafana dashboards, indicated with INFLUXDB 1.0.0 as a dependency, and pull the metric and filtering tags needed to build an equivalent Flux query.
A couple of good starting points are below:
Focus on the system and vSphere metrics that you understand and are high-level indicators of your overall system performance. In my case, I'm concerned with the metrics related to the bottlenecks on my specific system; primarily RAM and disk space.
```flux
from(bucket: "VSVR01WEB")
  |> range(start: v.timeRangeStart)
  |> filter(fn: (r) => r._measurement == "system")
  |> filter(fn: (r) => r._field == "load1" or r._field == "load5" or r._field == "load15")
  |> aggregateWindow(every: v.windowPeriod, fn: mean, createEmpty: false)
  |> yield(name: "mean")
```
System Load: load1, load5 and load15 are the 1-, 5- and 15-minute load averages: the average number of processes running or waiting for CPU time, not a percentage. On a single-CPU system, values over 1.0 indicate the system was overloaded, which just means it had processes waiting for CPU time; on an N-core system, compare against N instead. Spikes of overloading are expected, but prolonged overloading indicates a system that's not keeping up with demand.
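Because the load fields are absolute counts, a per-CPU view can be derived in Flux by pivoting in the `n_cpus` field that the system plugin also reports. A sketch, so that 1.0 marks full utilization regardless of core count:

```flux
from(bucket: "VSVR01WEB")
  |> range(start: v.timeRangeStart)
  |> filter(fn: (r) => r._measurement == "system")
  |> filter(fn: (r) => r._field == "load1" or r._field == "n_cpus")
  |> pivot(rowKey: ["_time"], columnKey: ["_field"], valueColumn: "_value")
  // n_cpus is stored as an integer, so convert before dividing
  |> map(fn: (r) => ({ r with _value: r.load1 / float(v: r.n_cpus) }))
  |> yield(name: "load1_per_cpu")
```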
```flux
from(bucket: "VSVR01WEB")
  |> range(start: v.timeRangeStart)
  |> filter(fn: (r) => r._measurement == "mem")
  |> filter(fn: (r) => r._field == "used_percent")
  |> yield(name: "mean")
```
```flux
from(bucket: "VSVR01WEB")
  |> range(start: v.timeRangeStart)
  |> filter(fn: (r) => r._measurement == "disk")
  |> filter(fn: (r) => r._field == "free")
  |> filter(fn: (r) => r.path == "/")
  |> yield(name: "mean")
```
The following metrics are a short-list of interesting data to monitor for each of the virtual machines and incorporate into dashboard components with the appropriate Flux queries. These metrics have a number of additional parameters per data point that can be used to further refine and filter the queries.
| Measurement | Field | Description | Units |
| --- | --- | --- | --- |
| vsphere_vm_mem | active_average | VM memory used | KiB |
| vsphere_vm_mem | consumed_average | VM memory allocated | KiB |
| vsphere_vm_cpu | usage_average | VM percent use of host CPU capacity | % |
| vsphere_vm_net | transmitted_average | network transmit rates | KB/s |
| vsphere_vm_net | received_average | network receive rates | KB/s |
| vsphere_vm_disk | read_average | disk read rates | KB/s |
| vsphere_vm_disk | write_average | disk write rates | KB/s |
| vsphere_vm_disk | usage_average | disk use rates (read+write) | KB/s |
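As one example, a dashboard panel showing per-VM active memory could be driven by a query like the following sketch; the `vmname` tag is added to each point by the vSphere plugin, and the bucket name matches the configuration above:

```flux
from(bucket: "VSPHERE")
  |> range(start: v.timeRangeStart)
  |> filter(fn: (r) => r._measurement == "vsphere_vm_mem")
  |> filter(fn: (r) => r._field == "active_average")
  // one series per virtual machine
  |> group(columns: ["vmname"])
  |> aggregateWindow(every: v.windowPeriod, fn: mean, createEmpty: false)
  |> yield(name: "mean")
```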