Alerting for stale nodes on Chef with Nagios

The new version of Chef brings more options, more features, and an even better knife status command. That brings us to the topic at hand: how to alert on stale nodes in Chef using Nagios.

The knife status command displays a brief summary of the nodes on a Chef server:

knife status (options)

When used with the -H switch, it shows how long ago the last successful Chef run took place on each node, hiding nodes that ran within the past hour, e.g.:

knife status -H
20 hours ago, dev-vm.nclouds.com, ubuntu 10.04, dev-vm.nclouds.com, 10.66.44.126
3 hours ago, i-225f954f, ubuntu 10.04, ec2-67-202-63-102.compute-1.amazonaws.com, 67.202.63.102
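
To make the output format concrete, here is a minimal Ruby sketch that turns lines like the ones above into a node name and an age in hours. The helper name parse_status_line is hypothetical, and the sketch assumes the "<N> minutes/hours ago, <name>, ..." layout shown in the sample; it is not part of knife itself.

```ruby
# Hypothetical helper, for illustration only: parse one line of
# `knife status` output into a node name and its age in hours.
# Assumes the "<N> <unit> ago, <name>, ..." layout from the sample above.
UNIT_HOURS = { 'minute' => 1.0 / 60, 'hour' => 1.0 }.freeze

def parse_status_line(line)
  age, name = line.split(',').map(&:strip)
  match = /\A(\d+)\s+(minute|hour)s?\s+ago\z/.match(age.to_s)
  return nil unless match # skip lines that do not fit the assumed layout

  { name: name, hours: match[1].to_i * UNIT_HOURS[match[2]] }
end

sample = <<~OUT
  20 hours ago, dev-vm.nclouds.com, ubuntu 10.04, dev-vm.nclouds.com, 10.66.44.126
  3 hours ago, i-225f954f, ubuntu 10.04, ec2-67-202-63-102.compute-1.amazonaws.com, 67.202.63.102
OUT

nodes = sample.lines.map { |l| parse_status_line(l) }.compact
nodes.each { |n| puts "#{n[:name]}: #{n[:hours]} hours" }
```

Parsing command output like this is fragile, which is one reason the script in this post queries the Chef server directly instead.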

The same last-run information lets us alert on stale nodes with a small Ruby script and some settings in Nagios. Let’s start with the Ruby script:

#!/opt/chef/embedded/bin/ruby
require 'rubygems'
require 'chef/config'
require 'chef/rest'
require 'chef/search/query'

# Alert thresholds in hours, and the path to client.rb so the script can query the Chef server
critical = 12
warning = 1
Chef::Config.from_file(File.expand_path("/etc/chef/client.rb"))

OK_STATE = 0
WARNING_STATE = 1
CRITICAL_STATE = 2
UNKNOWN_STATE = 3

# A misconfigured threshold is a plugin problem, not a node problem, so report UNKNOWN
if warning > critical || warning < 0
  puts "UNKNOWN: warning must not exceed critical and must not be negative"
  exit(UNKNOWN_STATE)
end

query = Chef::Search::Query.new
all_nodes = []
cnodes = []
wnodes = []
query.search('node', "*:*") do |node|
  all_nodes << node
end
all_nodes.each do |node|
  # Whole hours since the node's last recorded ohai run
  hours = (Time.now.to_i - node['ohai_time'].to_i) / 3600
  if hours >= critical
    cnodes << node.name
  elsif hours >= warning
    wnodes << node.name
  end
end
if cnodes.length > 0
  puts "CRITICAL: " + cnodes.join(',') + " did not check in for " + critical.to_s + " hours"
  exit(CRITICAL_STATE)
elsif wnodes.length > 0
  puts "WARNING: " + wnodes.join(',') + " did not check in for " + warning.to_s + " hours"
  exit(WARNING_STATE)
elsif cnodes.empty? && wnodes.empty?
  puts "OK: All nodes are ok!"
  exit(OK_STATE)
else
  puts "UNKNOWN"
  exit(UNKNOWN_STATE)
end

In the script above, a node that has not checked in for 12 hours or more is reported as CRITICAL, and a node that has not checked in for 1 hour or more is reported as WARNING. Nagios picks up the script's exit status and can alert on it once it is configured as described below.

Please note that the machine running the script needs to be able to connect to the Chef server using the /etc/chef/client.rb configuration referenced in the script.
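
The threshold logic can be tried without a Chef server. The following is a minimal sketch that mirrors the script's hour arithmetic; the function name classify_nodes and the sample node names are hypothetical and used only for illustration.

```ruby
# Sketch of the script's threshold logic as a pure function.
# `classify_nodes` and the sample data below are hypothetical.
def classify_nodes(last_seen, now:, warning: 1, critical: 12)
  critical_nodes = []
  warning_nodes = []
  last_seen.each do |name, ohai_time|
    # Integer division, same as in the script: whole hours since last check-in
    hours = (now.to_i - ohai_time.to_i) / 3600
    if hours >= critical
      critical_nodes << name
    elsif hours >= warning
      warning_nodes << name
    end
  end
  { critical: critical_nodes, warning: warning_nodes }
end

now = Time.at(1_000_000) # fixed reference time so the example is repeatable
sample = {
  'fresh-node' => now.to_i - 600,        # 10 minutes ago
  'slow-node'  => now.to_i - 3 * 3600,   # 3 hours ago
  'stale-node' => now.to_i - 20 * 3600   # 20 hours ago
}

result = classify_nodes(sample, now: now)
puts "critical: #{result[:critical].join(',')}"
puts "warning: #{result[:warning].join(',')}"
```

Because the division is integer division, a node that last ran 59 minutes ago counts as 0 hours and triggers nothing with the default warning threshold of 1.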

Install the script in your Nagios plugins directory and make sure it is executable, for example:

cp check_chef_nodes.rb /usr/lib64/nagios/plugins/check_chef_nodes.rb

Then, in the Nagios configuration, define the command, host, and service like this:

define command {
        command_name check_chef_node_status
        command_line $USER1$/check_chef_nodes.rb
        }

define host {
        use linux-server
        contact_groups admins
        address 127.0.0.1
        host_name localhost
        }

define service {
        use local-service ; Name of service template to use
        host_name localhost
        service_description Chef Node Health Check
        check_command check_chef_node_status
        notifications_enabled 0 ; Set to 1 to have Nagios send notifications for this service
        }

Once everything is configured, restart Nagios and you should see a service monitor named Chef Node Health Check.
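
If the service does not show the state you expect, it can help to run the plugin by hand and look at its exit status, since that is what Nagios reads. The sketch below assumes nothing beyond the four exit codes used in the script; run_plugin is a hypothetical helper, and a one-line stand-in command replaces the real plugin so the example runs anywhere.

```ruby
require 'open3'
require 'rbconfig'

# Exit codes as defined in the check script above
STATE_NAMES = { 0 => 'OK', 1 => 'WARNING', 2 => 'CRITICAL', 3 => 'UNKNOWN' }.freeze

# Hypothetical helper: run a plugin command, return [state name, output]
def run_plugin(*command)
  output, status = Open3.capture2e(*command)
  [STATE_NAMES.fetch(status.exitstatus, 'UNKNOWN'), output.strip]
end

# Stand-in for the real plugin so the sketch is self-contained
fake_plugin = [RbConfig.ruby, '-e',
               'puts "CRITICAL: node1 did not check in for 12 hours"; exit 2']

state, output = run_plugin(*fake_plugin)
puts "#{state}: #{output}"
```

On the Nagios host you would pass the installed plugin instead, for example run_plugin('/usr/lib64/nagios/plugins/check_chef_nodes.rb').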

That’s all for today, folks. You now have an alert for stale nodes on your Chef server and can take steps to keep all your nodes up to date.
