OpenStack Tenants and Nagios Integration Script

Hi All! Here's a detailed article on integrating OpenStack tenants with Nagios! My promised articles on implementing High Availability within OpenStack (which I've now done for Icehouse and Kilo) are still just in my brain, so here's something else to keep my blog happy with a little more content 🙂

Scenario: Nagios Core and Telemetry

So you have Telemetry installed and running (bonus questions: did you completely automate the Telemetry controller build process? And handle the MongoDB database size?) and you have a functional Nagios Core. So how do you integrate your OpenStack tenants as elements in Nagios?

Here’s my answer, which unfortunately does rely on running from my Nagios server rather than trapping SNMP alerts sent from the Telemetry (Ceilometer) server. And the iteration process below is kinda slow and therefore lame. But on the good side: it gets my OpenStack Tenants shown in Nagios core!

Here’s a picture of how the tenants look when displayed in Nagios…note that I created a Nagios “hostgroup” called openstack-tenants so that they all show up together:
[Image: 010-OpenStack-Tenants-Nagios]

Approach

We’ll treat each OpenStack tenant as a Nagios “host” record. That means, since we’re using Nagios Core, that we’ll provide a separate configuration file that defines each OpenStack tenant with information filled in. Fortunately, Nagios permits descriptive information about each host to be expressed using standard HTML tables; we take advantage of this as shown:


define host {
  host_name                      ostenant-common
  hostgroups                     openstack-tenants
  retry_interval                 1
  notification_options           n
  check_interval                 5
  contact_groups                 admins
  register                       1
  max_check_attempts             10
  name                           ostenant-common
  check_period                   24x7
  notification_interval          120
  use                            generic-host
  check_command                  command-true
  notification_period            24x7
  notes                          <table><tr><th>ID</th><td>3e263cf06ad0460882f20ca402b6508b</td></tr><tr><th>Name</th><td>common</td></tr><tr><th>Description</th><td>Common services</td></tr><tr><th>Enabled?</th><td>True</td></tr><tr><th>Hypervisors</th><td>lposhostx020.hlsdev.local:lposhostx010.hlsdev.local:</td></tr><tr><th>Total VMs</th><td>3</td></tr><tr><th>Active VMs</th><td>3</td></tr><tr><th>Disk (GB)</th><td>24</td></tr><tr><th>vCPUs</th><td>5</td></tr><tr><th>RAM</th><td>2053</td></tr></table>
  notes_url                      /nagios/cgi-bin/status.cgi?host=ostenant-common
}

We’ll generate this code by analyzing each OpenStack Tenant. The result will be a display that looks like the following:
[Image: 015-OpenStack-Tenants-Nagios]

Finally…by using Telemetry calls I get a whole bunch of detail information about the tenant:
[Image: 020-OpenStack-Tenants-Nagios]

The way I get this Telemetry information is by using a Nagios command script, which I wrote and installed to the Nagios plugins folder (/usr/lib64/nagios/plugins on my CentOS 6.5 x64 server) as follows:


define command {
        command_line                   $USER1$/check_ceilometer -H $HOSTNAME$ -e "$ARG1$" -s cpu_util -n
        command_name                   check_ceilometer_cpu_util
}

define command {
        command_line                   $USER1$/check_ceilometer -H $HOSTNAME$ -e "$ARG1$" -s disk.read.bytes -n
        command_name                   check_ceilometer_disk_read_bytes
}
...lots more commands, one for each function supported
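One wiring note: each command still needs a corresponding service definition to attach it to the tenant "hosts". Here's a sketch of what that can look like; the service_description, intervals, and the $HOSTNAME$ argument are illustrative, so adjust them to your own generic-service template:

```
define service {
        use                            generic-service
        hostgroup_name                 openstack-tenants
        service_description            CPU Utilization
        check_command                  check_ceilometer_cpu_util!$HOSTNAME$
        check_interval                 5
        retry_interval                 1
        max_check_attempts             3
        notification_options           n
}
```

Because the service targets the openstack-tenants hostgroup, every generated tenant "host" picks it up automatically.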

There are two scripts I need for this to work, along with OpenStack credentials (and the OpenStack clients installed) on the Nagios server. I don't like that approach (credentials in a shell script!) but it's on my list of Things To Fix. I mitigate the risk by making the credentials readable only by my service account on Nagios, and by hiding the Nagios server from the outside world.
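For reference, the sourced credentials file is just a standard OpenStack RC file. A minimal sketch, assuming Keystone v2-style variables and with placeholder values only (use a dedicated least-privilege service account, not admin):

```shell
# Hypothetical contents of the credentials file sourced by the scripts below.
# Every value here is a placeholder.
export OS_USERNAME=nagios-svc
export OS_PASSWORD='not-a-real-password'
export OS_TENANT_NAME=services
export OS_AUTH_URL=http://controller:5000/v2.0
```

Locking this file down to mode 400 under the service account is what makes the "readable only by my service account" mitigation above actually hold.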


#!/bin/bash
# check_ceilometer ABr, 20141212
# invoke ceilometer API for Nagios integration
# Change Log
# ----------
# 20150422, ABr: integrated to puppet

# default values (overridden by arguments)
TENANT=demo
SERVICE='cpu_util'
IGNORE=0
THRESHOLD='50.0'
CRITICAL_THRESHOLD='80.0'

# assume file exists
source ~/lvosksclu100-rc-telemetry-access

# function to print the help info
printusage() {
  echo "This plug-in uses the OpenStack Ceilometer API to let Nagios query Ceilometer metrics of VMs."
  echo "usage:"
  echo "check_ceilometer -H nagios_hostname -e tenant_name -s metric_name [-n] [-t warning_threshold] [-T critical_threshold]"
  echo "-h: print this message"
  echo "-H: nagios hostname"
  echo "-e: tenant name"
  echo "-s service: valid name from ceilometer meter-list"
  echo "-n object: no thresholds (ignore)"
  echo "-t threshold: Threshold value which causes Nagios to create a warning message"
  echo "-T threshold for alert: Threshold value which causes Nagios to send a critical alert message"
  exit 3 # UNKNOWN, per Nagios plugin exit-code conventions
}

getvalue() {
  local l_tmp="$1"
  local l_col="$2"
  local l_value=$(cut -d'|' -f $l_col "$l_tmp")
  if [ -z "$l_value" ] ; then
    l_value="0"
  else
    l_value=$(echo $l_value | sed -e 's# ##g')
    if echo "$l_value" | grep --quiet -v -e '^[0-9\.EPep]\+$' 2>/dev/null; then
      l_value="0"
    fi
  fi
  echo $l_value
  return 0
}

#parse the arguments
while getopts ":hH:e:s:nt:T:" opt; do
  case $opt in
    H )     l_HOSTNAME=${OPTARG};;
    h )     printusage;;
    e )     TENANT=${OPTARG};;
    s )     SERVICE=${OPTARG};;
    n )     IGNORE=1;;
    t )     THRESHOLD=${OPTARG};;
    T )     CRITICAL_THRESHOLD=${OPTARG};;
    ? )     printusage;;
  esac
done

if [ ! -z "$l_HOSTNAME" ]; then
  if echo $l_HOSTNAME | grep --quiet -e "^ostenant-"; then
    TENANT=$(echo $l_HOSTNAME | sed -e 's#^ostenant-\(.*\)#\1#')
  fi
fi

# very simple plugin :)
l_tmp="/tmp/ceilometer.$$"
ceilometer --os-tenant-name=$TENANT statistics -m $SERVICE 2>/dev/null | head -n -1 | tail -n 1 > $l_tmp
l_max=$(getvalue "$l_tmp" 5)
l_min=$(getvalue "$l_tmp" 6)
l_avg=$(getvalue "$l_tmp" 7)
l_count=$(getvalue "$l_tmp" 9)
#logger "ceilometer: rm -f \"$l_tmp\""
rm -f "$l_tmp"
l_db_name=$SERVICE
[ "$l_db_name" = "cpu_util" ] && l_db_name="cpu"
echo "Max:$l_max; Min:$l_min; Avg:$l_avg; Cnt:$l_count|$l_db_name=$l_avg;;;$l_min;$l_max"
exit 0 # always report OK for now; warning/critical thresholds are not yet enforced

This command script outputs performance information in a format suitable for Nagios Graph so – over time – I can even get trends on CPU / RAM / HDD usage within a tenant.
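That final echo follows the standard Nagios plugin output convention: human-readable text, then a pipe, then perfdata as label=value;warn;crit;min;max. A tiny sketch with made-up numbers shows the shape:

```shell
#!/bin/bash
# Demonstrate the plugin's output line with fabricated values:
# text for the Nagios UI, then '|', then label=value;warn;crit;min;max
l_max=48.2; l_min=3.1; l_avg=12.5; l_count=7
l_db_name=cpu
echo "Max:$l_max; Min:$l_min; Avg:$l_avg; Cnt:$l_count|$l_db_name=$l_avg;;;$l_min;$l_max"
# prints: Max:48.2; Min:3.1; Avg:12.5; Cnt:7|cpu=12.5;;;3.1;48.2
```

The empty warn and crit fields are why no thresholds fire here; Nagios Graph only needs the value plus the min/max.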

Integrate Nagios with OpenStack

The actual script I use to create the faux Nagios “hosts” as representations of OpenStack tenants follows. Here’s the script – with annotations. I run the script as part of a cron job every day at 2:45AM because the script is, well, slow. (Remember: it’s iterating Every VM within Every Tenant in the OpenStack cluster.) However, the script could be run much more often if desired!
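The cron wiring itself is a single line. The install path and log file below are examples, not what the article prescribes:

```
# crontab entry for the service account; paths are examples
45 2 * * * /usr/local/sbin/sab-lvos-telemetry-nagios.sh >> /var/log/sab-lvos-telemetry-nagios.log 2>&1
```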


#!/bin/bash
# sab-lvos-telemetry-nagios.sh, ABr, 20141212
# Integrate OpenStack Telemetry with Nagios
# Change Log:
# -----------
# 20141212, ABr: initial creation
# 20141215, ABr: placing into production
# 20150102, ABr: save last run

########################################################################
# globals

# our runtime directory - accounts for being run in separate shell or
# from source.
g_SAB_LVOS_TELEMETRY_DIR_TEST="$(dirname "$0" 2>&1)"
g_SAB_LVOS_TELEMETRY_RC_TEST=$?
if [ $g_SAB_LVOS_TELEMETRY_RC_TEST -eq 0 ]; then
  g_SAB_LVOS_TELEMETRY_NAGIOS_DIR_RUN="$(cd $(dirname "$0"); pwd)"
else
  g_SAB_LVOS_TELEMETRY_NAGIOS_DIR_RUN="$(pwd)"
fi

# the folder where we stuff extra files for Nagios to include.
# be sure /etc/nagios/nagios.cfg has:
#   cfg_dir=/etc/nagios/extra-cfg
g_SAB_LVOS_TELEMETRY_NAGIOS_DIR_CFG='/etc/nagios/extra-cfg'

# the resource file with your OpenStack credentials in it. yes, this
# is horrible...it's on my many backlogged tasks to Fix This.
g_SAB_LVOS_TELEMETRY_NAGIOS_FILE_RC_NAME='openstack_rc_file'
g_SAB_LVOS_TELEMETRY_NAGIOS_FILE_RC_PATH="$g_SAB_LVOS_TELEMETRY_NAGIOS_DIR_RUN/$g_SAB_LVOS_TELEMETRY_NAGIOS_FILE_RC_NAME"

# temp file where we put our results
g_SAB_LVOS_TELEMETRY_NAGIOS_FILE_TENANTS_NAME='tenants'
g_SAB_LVOS_TELEMETRY_NAGIOS_FILE_TENANTS_TMP_PATH="/tmp/$g_SAB_LVOS_TELEMETRY_NAGIOS_FILE_TENANTS_NAME.$$"
g_SAB_LVOS_TELEMETRY_NAGIOS_FILE_TENANTS_PATH="$g_SAB_LVOS_TELEMETRY_NAGIOS_DIR_CFG/$g_SAB_LVOS_TELEMETRY_NAGIOS_FILE_TENANTS_NAME.cfg"
g_SAB_LVOS_TELEMETRY_NAGIOS_LAST_RUN_PATH="/tmp/$g_SAB_LVOS_TELEMETRY_NAGIOS_FILE_TENANTS_NAME.lastrun"

g_SAB_LVOS_TELEMETRY_NAGIOS_RC_OK=0
g_SAB_LVOS_TELEMETRY_NAGIOS_RC_ERROR=1

########################################################################
# functions

# cleanup: called before exit
function sab-lvos-telemetry-nagios-i-cleanup {
  yes | cp $g_SAB_LVOS_TELEMETRY_NAGIOS_FILE_TENANTS_TMP_PATH $g_SAB_LVOS_TELEMETRY_NAGIOS_LAST_RUN_PATH
  rm -f $g_SAB_LVOS_TELEMETRY_NAGIOS_FILE_TENANTS_TMP_PATH
  return 0
}

# exit system with cleanup
function sab-lvos-telemetry-nagios-i-exit {
  local l_rc=$1
  shift
  [ "$l_rc" -ne 0 ] && echo -n "ERROR: "
  echo "$* ($l_rc)"
  sab-lvos-telemetry-nagios-i-cleanup
  exit $l_rc
}

# exit only if error (where first parm is non-zero)
function sab-lvos-telemetry-nagios-i-error-exit {
  local l_rc=$1
  [ "$l_rc" -ne 0 ] && sab-lvos-telemetry-nagios-i-exit $*
  return 0
}

# read OpenStack file and source credentials
function sab-lvos-telemetry-nagios-i-source-rc {
  [ ! -f "$g_SAB_LVOS_TELEMETRY_NAGIOS_FILE_RC_PATH" ] && \
    sab-lvos-telemetry-nagios-i-error-exit $g_SAB_LVOS_TELEMETRY_NAGIOS_RC_ERROR \
    "Missing $g_SAB_LVOS_TELEMETRY_NAGIOS_FILE_RC_NAME"
  source "$g_SAB_LVOS_TELEMETRY_NAGIOS_FILE_RC_PATH"
  return 0
}

# load all tenants (assumes that OpenStack credentials are set)
function sab-lvos-telemetry-nagios-i-get-tenants {
  # note that, as of latest OpenStack clients, keystone is deprecated.
  # so we ignore errors (at some point I'll update to use "openstack" client)
  # split declaration from assignment so $? reflects the pipeline, not 'local'
  local l_tenant_ids
  l_tenant_ids=$(keystone tenant-list 2>/dev/null | tail -n +4 | head -n -1 | sort -k 3 | cut -d'|' -f 2)
  local l_rc=$?
  sab-lvos-telemetry-nagios-i-error-exit $l_rc $l_tenant_ids
  echo $l_tenant_ids
  return 0
}

# we have an OpenStack tenant ID, so get the details to put into Nagios
function sab-lvos-telemetry-nagios-i-tenant-dtl {
  local l_tenant_id=$1
  local l_attr=$2
  local l_result
  l_result=$(keystone tenant-get $l_tenant_id 2>/dev/null | grep -e "| \+$l_attr" | cut -d'|' -f 3)
  local l_rc=$?
  sab-lvos-telemetry-nagios-i-error-exit $l_rc "Tenant $l_tenant_id: " $l_result
  echo $l_result
  return 0
}

# we have an OpenStack VM's ID, so get its details to analyze
function sab-lvos-telemetry-nagios-i-vm-dtl {
  local l_vm_id=$1
  local l_attr=$2
  local l_result
  l_result=$(nova show $l_vm_id 2>&1 | grep -e "| \+$l_attr" | cut -d'|' -f 3)
  local l_rc=$?
  sab-lvos-telemetry-nagios-i-error-exit $l_rc "VM $l_vm_id: " $l_result
  echo $l_result
  return 0
}

# an OpenStack "flavor" defines disk / CPU / RAM usage. we get these
# details so we can determine how many resources an OpenStack tenant is
# using and report that data to Nagios.
function sab-lvos-telemetry-nagios-i-flavor-dtl {
  local l_flavor_id=$1
  local l_attr=$2
  local l_result
  l_result=$(nova flavor-show $l_flavor_id 2>&1 | grep -e "| \+$l_attr" | cut -d'|' -f 3)
  local l_rc=$?
  sab-lvos-telemetry-nagios-i-error-exit $l_rc "Flavor $l_flavor_id: " $l_result
  echo $l_result
  return 0
}

# here's the magic sauce where we take an OpenStack tenant and get lots of
# information that we can then display within Nagios.
function sab-lvos-telemetry-nagios-i-tenant-metadisplay {
  local l_tenant_id="$1"
  local l_tenant_name="$2"
  local l_tenant_description="$3"
  local l_tenant_enabled="$4"

  # first, get the list of VMs
  local l_tenant_vms=$(nova --os-tenant-id=$l_tenant_id list 2>&1 | tail -n +4 | head -n -1 | cut -d'|' -f 2)

  # and we do some countin' to find how many resources this tenant uses...
  local l_ctr=0
  local l_active=0
  local l_vcpus=0
  local l_ram=0
  local l_disk=0
  local l_used_hosts=''
  for i_tenant_vm in $l_tenant_vms; do
    l_ctr=$((l_ctr+1))

    # attributes we want to analyze
    local l_status=$(sab-lvos-telemetry-nagios-i-vm-dtl $i_tenant_vm status)
    local l_flavor_text=$(sab-lvos-telemetry-nagios-i-vm-dtl $i_tenant_vm flavor)
    local l_hostname=$(sab-lvos-telemetry-nagios-i-vm-dtl $i_tenant_vm 'OS-EXT-SRV-ATTR:hypervisor_hostname')

    # is the VM active?
    [ "$l_status" = "ACTIVE" ] && l_active=$((l_active+1))

    # get the HDD / RAM / CPU information used by this VM
    local l_flavor_id=$(echo $l_flavor_text | sed -e 's#.*(\(.*\)).*#\1#')
    local l_flavor_disk=$(sab-lvos-telemetry-nagios-i-flavor-dtl $l_flavor_id disk)
    local l_flavor_ram=$(sab-lvos-telemetry-nagios-i-flavor-dtl $l_flavor_id ram)
    local l_flavor_vcpus=$(sab-lvos-telemetry-nagios-i-flavor-dtl $l_flavor_id vcpus)

    # add to totals
    l_disk=$((l_disk+l_flavor_disk))
    l_vcpus=$((l_vcpus+l_flavor_vcpus))
    l_ram=$((l_ram+l_flavor_ram))

    # append hostname to used hosts
    if ! echo "$l_used_hosts" | grep --quiet -e "$l_hostname:"; then
      l_used_hosts="$l_hostname:$l_used_hosts"
    fi
  done

  # get the results in a table format we can feed to Nagios
  echo '<table><tr><th>ID</th><td>'$l_tenant_id'</td></tr><tr><th>Name</th><td>'$l_tenant_name'</td></tr><tr><th>Description</th><td>'$l_tenant_description'</td></tr><tr><th>Enabled?</th><td>'$l_tenant_enabled'</td></tr><tr><th>Hypervisors</th><td>'$l_used_hosts'</td></tr><tr><th>Total VMs</th><td>'$l_ctr'</td></tr><tr><th>Active VMs</th><td>'$l_active'</td></tr><tr><th>Disk (GB)</th><td>'$l_disk'</td></tr><tr><th>vCPUs</th><td>'$l_vcpus'</td></tr><tr><th>RAM</th><td>'$l_ram'</td></tr></table>'
  return 0
}

# build meta-data for one tenant
function sab-lvos-telemetry-nagios-i-bld-tenant-meta {
  local l_tenant_id=$1

  echo "  Getting basic tenant info..."
  local l_tenant_name=$(sab-lvos-telemetry-nagios-i-tenant-dtl $l_tenant_id name)
  local l_tenant_nagios_name="ostenant-$l_tenant_name"
  local l_tenant_description=$(sab-lvos-telemetry-nagios-i-tenant-dtl $l_tenant_id description)
  local l_tenant_notes_url="/nagios/cgi-bin/status.cgi?host=$l_tenant_nagios_name"
  local l_tenant_enabled=$(sab-lvos-telemetry-nagios-i-tenant-dtl $l_tenant_id enabled)

  echo "  Getting VM details for tenant..."
  local l_tenant_metadisplay=$(sab-lvos-telemetry-nagios-i-tenant-metadisplay $l_tenant_id "$l_tenant_name" "$l_tenant_description" "$l_tenant_enabled")

  # this is where we export information to Nagios. we treat the OpenStack tenant
  # as a Nagios "host"
  cat >> "$g_SAB_LVOS_TELEMETRY_NAGIOS_FILE_TENANTS_TMP_PATH" << EOF
define host {
  host_name                      $l_tenant_nagios_name
  hostgroups                     openstack-tenants
  retry_interval                 1
  notification_options           n
  check_interval                 5
  contact_groups                 admins
  register                       1
  max_check_attempts             10
  name                           $l_tenant_nagios_name
  check_period                   24x7
  notification_interval          120
  use                            generic-host
  check_command                  command-true
  notification_period            24x7
  notes                          $l_tenant_metadisplay
  notes_url                      $l_tenant_notes_url
}
EOF

  return 0
}

# iterate over all OpenStack tenants, output information to Nagios for each one.
function sab-lvos-telemetry-nagios-i-bld-all-meta {
  local l_tenant_ids=$*

  echo '' > "$g_SAB_LVOS_TELEMETRY_NAGIOS_FILE_TENANTS_TMP_PATH"
  for i_tenant_id in $l_tenant_ids; do
    echo "Processing tenant '$i_tenant_id'..."
    sab-lvos-telemetry-nagios-i-bld-tenant-meta $i_tenant_id
  done
  return 0
}

# finally, update Nagios from our results. we will restart Nagios if
# we detected a change in the OpenStack tenant display.
function sab-lvos-telemetry-nagios-i-update {
  local l_update=0
  if [ ! -f "$g_SAB_LVOS_TELEMETRY_NAGIOS_FILE_TENANTS_PATH" ]; then
    l_update=1
  else
    if ! diff "$g_SAB_LVOS_TELEMETRY_NAGIOS_FILE_TENANTS_TMP_PATH" "$g_SAB_LVOS_TELEMETRY_NAGIOS_FILE_TENANTS_PATH" >/dev/null 2>&1; then
      l_update=1
    fi
  fi
  if [ $l_update -eq 1 ]; then
    echo "Updating Nagios tenant file '$g_SAB_LVOS_TELEMETRY_NAGIOS_FILE_TENANTS_PATH'..."

    # overwrite our existing Nagios configuration file
    yes | cp "$g_SAB_LVOS_TELEMETRY_NAGIOS_FILE_TENANTS_TMP_PATH" "$g_SAB_LVOS_TELEMETRY_NAGIOS_FILE_TENANTS_PATH"
    chown root:nagios "$g_SAB_LVOS_TELEMETRY_NAGIOS_FILE_TENANTS_PATH"
    chmod 640 "$g_SAB_LVOS_TELEMETRY_NAGIOS_FILE_TENANTS_PATH"

    # we have a difference; force Nagios to reload itself
    service nagios restart
  else
    echo "Nagios tenant file '$g_SAB_LVOS_TELEMETRY_NAGIOS_FILE_TENANTS_PATH' is up-to-date."
  fi
  return 0
}

function sab-lvos-telemetry-nagios-run {
  echo "Sourcing RC file..."
  sab-lvos-telemetry-nagios-i-source-rc
  echo "Scanning all tenants..."
  local l_tenant_ids=$(sab-lvos-telemetry-nagios-i-get-tenants)
  sab-lvos-telemetry-nagios-i-bld-all-meta "$l_tenant_ids"
  sab-lvos-telemetry-nagios-i-update

  # only do this if you want details from our work shown
#  cat $g_SAB_LVOS_TELEMETRY_NAGIOS_FILE_TENANTS_TMP_PATH
  sab-lvos-telemetry-nagios-i-exit 0 "OK"
}

# do the work
sab-lvos-telemetry-nagios-run
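If you want to sanity-check the tail/head/sort/cut pipeline from sab-lvos-telemetry-nagios-i-get-tenants without touching a live cluster, you can feed it a mocked-up keystone table. The IDs and names below are fabricated, and the mock is left-aligned rather than centered like real keystone output:

```shell
#!/bin/bash
# Exercise the tenant-list parsing pipeline against a fake table.
mock_tenant_list() {
cat << 'EOF'
+----------------------------------+--------+---------+
|                id                | name   | enabled |
+----------------------------------+--------+---------+
| 9a1fa2b3c4d5e6f708192a3b4c5d6e7f | demo   | True    |
| 3e263cf06ad0460882f20ca402b6508b | common | True    |
+----------------------------------+--------+---------+
EOF
}
# same pipeline as the real function: skip the three header lines, drop the
# bottom border, sort by the name column, extract the id column
ids=$(mock_tenant_list | tail -n +4 | head -n -1 | sort -k 3 | cut -d'|' -f 2)
echo $ids
```

This should print the two IDs ordered by tenant name (common before demo), confirming the table-stripping logic before you point the script at production.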

Enjoy…and Happy Computing!
