Alerting for stale nodes on Chef with Nagios

Newer versions of Chef bring more options, more features and an improved knife status command, which brings us to the topic at hand: how to alert on stale nodes in Chef using Nagios.

The knife status command displays a brief summary of the nodes on a Chef server:-

knife status (options)

When used with the -H switch, it hides nodes that completed a successful Chef run within the past hour and shows when each remaining node last checked in, e.g.:-

knife status -H
20 hours ago,, ubuntu 10.04,,
3 hours ago, i-225f954f, ubuntu 10.04,,

The same last check-in information lets us alert on stale nodes with a small Ruby script and some settings in Nagios. Let’s start with the Ruby script:-

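#!/usr/bin/env ruby
require 'rubygems'
require 'chef/config'
require 'chef/rest'
require 'chef/search/query'

## Define the alert thresholds (in hours) and the client.rb path so the script can authenticate against the Chef server
critical = 12
warning = 1
# Assumed default location; the original path was not preserved, so adjust it to your setup
Chef::Config.from_file('/etc/chef/client.rb')

# Refuse to run with inconsistent thresholds (Nagios reads exit code 3 as UNKNOWN)
if warning > critical || warning < 0
        puts "UNKNOWN: warning must not exceed critical and must not be negative"
        exit 3
end

# Collect every node known to the Chef server
query = Chef::Search::Query.new
all_nodes = []
cnodes = []
wnodes = []
query.search('node', "*:*") do |node|
   all_nodes << node
end

# Sort nodes into critical and warning lists by whole hours since the last check-in
all_nodes.each do |node|
  hours = (Time.now.to_i - node['ohai_time'].to_i) / 3600
  if hours >= critical
    cnodes << node.name
  elsif hours >= warning
    wnodes << node.name
  end
end

# Print one status line and exit with the matching Nagios code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN)
if cnodes.length > 0
        puts "CRITICAL: "+cnodes.join(',')+" did not check in for "+critical.to_s+" hours"
        exit 2
elsif wnodes.length > 0
        puts "WARNING: "+wnodes.join(',')+" did not check in for "+warning.to_s+" hours"
        exit 1
elsif cnodes.length == 0 and wnodes.length == 0
        puts "OK: All nodes are ok!"
        exit 0
else
        # Defensive fallback; not reached with the conditions above
        puts "UNKNOWN"
        exit 3
end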

In the script above, a node that has not checked in within the 12-hour critical threshold is reported as CRITICAL, and a node that has missed the 1-hour warning threshold is reported as WARNING. Nagios then raises the alert using the configuration described below.

Please note that the machine running this check must be able to connect to the Chef server using the configuration file referenced in the script.
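
To try out the threshold logic before pointing the check at a live Chef server, here is a minimal Ruby sketch. It assumes plain hashes with 'name' and 'ohai_time' keys as stand-ins for Chef node objects; the stale_nodes helper and the sample data are hypothetical and not part of the original script.

```ruby
# Sketch only: classify nodes by whole hours since their last check-in.
# `nodes` is a hypothetical array of hashes standing in for Chef node objects.
def stale_nodes(nodes, now, warning_hours, critical_hours)
  result = { critical: [], warning: [] }
  nodes.each do |node|
    # ohai_time is a Unix timestamp; integer division drops partial hours
    hours = (now - node['ohai_time'].to_i) / 3600
    if hours >= critical_hours
      result[:critical] << node['name']
    elsif hours >= warning_hours
      result[:warning] << node['name']
    end
  end
  result
end

# Fixed "current time" so the example is reproducible
now = 1_700_000_000

# Made-up sample data: last check-in 20 hours, 3 hours and 10 minutes ago
sample = [
  { 'name' => 'web1', 'ohai_time' => now - 20 * 3600 },
  { 'name' => 'db1',  'ohai_time' => now - 3 * 3600 },
  { 'name' => 'app1', 'ohai_time' => now - 600 }
]

stale = stale_nodes(sample, now, 1, 12)
puts "critical: #{stale[:critical].join(',')}"
puts "warning: #{stale[:warning].join(',')}"
```

With thresholds of 1 and 12 hours, web1 lands in the critical list, db1 in the warning list, and app1 in neither because it checked in less than an hour ago.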

Install the script in your Nagios plugins directory (it must be executable by the Nagios user), for example:-

cp check_chef_nodes.rb /usr/lib64/nagios/plugins/check_chef_nodes.rb
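
Nagios decides a service's state from the plugin's exit status (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN), so it can help to run the installed plugin by hand before configuring Nagios. The following is a rough sketch of such a check; the nagios_state and run_plugin helpers are hypothetical names, and mapping any other exit code to UNKNOWN is a simplification made for this example rather than a description of how Nagios itself treats them.

```ruby
require 'open3'

# Standard Nagios plugin exit codes and the states they stand for
NAGIOS_STATES = { 0 => 'OK', 1 => 'WARNING', 2 => 'CRITICAL', 3 => 'UNKNOWN' }.freeze

# Translate an exit code into a state name.
# Any other value (or nil) is reported as UNKNOWN in this sketch.
def nagios_state(exit_code)
  NAGIOS_STATES.fetch(exit_code, 'UNKNOWN')
end

# Run a command given as an array of arguments and return [output, state].
# stdout and stderr are captured together so error messages are not lost.
def run_plugin(command)
  output, status = Open3.capture2e(*command)
  [output.strip, nagios_state(status.exitstatus)]
end

# Example, using the install path from the step above:
#   output, state = run_plugin(['/usr/lib64/nagios/plugins/check_chef_nodes.rb'])
#   puts "#{state}: #{output}"
```

If the reported state does not match the text the plugin printed, the script is most likely missing an explicit exit call for that branch.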

Then, in the Nagios configuration, define the command, host and service like this:-

define command {
       command_name check_chef_node_status
       command_line $USER1$/check_chef_nodes.rb
}

define host {
  use linux-server
  contact_groups admins
  host_name localhost
}

define service {
        use local-service ; Name of service template to use
        host_name localhost
        service_description Chef Node Health Check
        check_command check_chef_node_status
        notifications_enabled 0 ; Set to 1 to have Nagios send notifications for this service
}

Once everything is configured, restart Nagios and you should see a service named Chef Node Health Check.

That’s all for today, folks. You now have an alert for stale nodes on your Chef server and can take steps to keep all your nodes up to date.


