Configure HAProxy to remove host on Nagios scheduled downtime

I was messing around with HAProxy yesterday and thought it would be useful to integrate Nagios downtime into the process of taking a node out of the load balancer pool. This method uses Xinetd to emulate an HTTP response and isn't limited to HAProxy; it can be used with any load balancer that supports basic HTTP health checks... so, all of them?

The required components to make this demonstration work are:

  • Linux webserver with Xinetd
  • HAProxy server
  • Nagios server with Nagios-api installed
  • And root access to the above servers!

Now to get started, I used this guide to get Nagios-api up and running. Once you have the Nagios-api running, you should be able to query the status of your webserver via:

[rsty@home ~]$ curl -s http://192.168.33.10:8080/host/prod-web01 | python -mjson.tool
{
    "content": {
        "acknowledgement_type": "0",
        "active_checks_enabled": "1",
        "check_command": "check-host-alive",
        "check_execution_time": "0.010",
        "check_interval": "5.000000",
        "check_latency": "0.024",
        "check_options": "0",
        "check_period": "",
        "check_type": "0",
        "comment": [],
        "current_attempt": "1",
        "current_event_id": "0",
        "current_notification_id": "0",
        "current_notification_number": "0",
        "current_problem_id": "0",
        "current_state": "0",
        "downtime": [],
        "event_handler": "",
        "event_handler_enabled": "1",
        "failure_prediction_enabled": "1",
        "flap_detection_enabled": "1",
        "has_been_checked": "1",
        "host": "prod-web01",
        "host_name": "prod-web01",
        "is_flapping": "0",
        "last_check": "1428676190",
        "last_event_id": "0",
        "last_hard_state": "0",
        "last_hard_state_change": "1428674980",
        "last_notification": "0",
        "last_problem_id": "0",
        "last_state_change": "1428674980",
        "last_time_down": "0",
        "last_time_unreachable": "0",
        "last_time_up": "1428676200",
        "last_update": "1428676315",
        "long_plugin_output": "",
        "max_attempts": "10",
        "modified_attributes": "0",
        "next_check": "1428676500",
        "next_notification": "0",
        "no_more_notifications": "0",
        "notification_period": "24x7",
        "notifications_enabled": "1",
        "obsess_over_host": "1",
        "passive_checks_enabled": "1",
        "percent_state_change": "0.00",
        "plugin_output": "PING OK - Packet loss = 0%, RTA = 0.06 ms",
        "problem_has_been_acknowledged": "0",
        "process_performance_data": "1",
        "retry_interval": "1.000000",
        "scheduled_downtime_depth": "0",
        "services": [
            "smb:139-dosamba-prod-web01",
            "http:43326-donagios-prod-web01",
            "int:load-donagios-prod-web01",
            "int:process_postfix-dopostfix-prod-web01",
            "ssh:15022-docommon-prod-web01",
            "int:process_puppetmaster-dopuppetmaster-prod-web01",
            "int:process_puppetdb-dopuppetmaster-prod-web01",
            "int:disk_root-donagios-prod-web01",
            "int:process_smbd-dosamba-prod-web01",
            "int:process_nagios-donagios-prod-web01"
        ],
        "should_be_scheduled": "1",
        "state_type": "1",
        "type": "hoststatus"
    },
    "success": true
}

As you can see above, "scheduled_downtime_depth" is the field we are looking for. It is currently 0, meaning no downtime is set. We can grab that value with the following one-liner and save it for later:

[rsty@home ~]$ curl -s http://192.168.33.10:8080/host/prod-web01 | python -mjson.tool | grep time_depth | awk -F'"' '{print $4}'
0
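
As a side note, if jq happens to be installed on the host, the same value can be pulled without the grep/awk gymnastics (just an alternative; the one-liner above is what the rest of this post uses):

[rsty@home ~]$ curl -s http://192.168.33.10:8080/host/prod-web01 | jq -r '.content.scheduled_downtime_depth'
0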

Now the fun part begins: creating the Xinetd script to emulate the HTTP response. What we want is to return a 200 (OK) if our scheduled_downtime_depth query returns 0, and a 5xx (BAD) if it returns a non-zero value, meaning downtime is set. There are a few things we need to do:

  1. Write our script, which will return a 200 if our check passes and a 503 otherwise, and make it executable (chmod +x /opt/serverchk). In the script below, 192.168.33.10 is the Nagios server and prod-web01 is the Nagios-configured host for our webserver. The Xinetd script resides on the webserver, since that is where the health check from HAProxy will be directed:

/opt/serverchk

#!/bin/bash

DOWN=`curl -s http://192.168.33.10:8080/host/prod-web01 | python -mjson.tool | grep time_depth | awk -F'"' '{print $4}'`

if [ "$DOWN" == "0" ]
then
    # server is online, return http 200
    /bin/echo -e "HTTP/1.1 200 OK\r\n"
    /bin/echo -e "Content-Type: Content-Type: text/plain\r\n"
    /bin/echo -e "\r\n"
    /bin/echo -e "No downtime scheduled.\r\n"
    /bin/echo -e "\r\n"
else
    # server is offline, return http 503
    /bin/echo -e "HTTP/1.1 503 Service Unavailable\r\n"
    /bin/echo -e "Content-Type: Content-Type: text/plain\r\n"
    /bin/echo -e "\r\n"
    /bin/echo -e "**Downtime is SCHEDULED**\r\n"
    /bin/echo -e "\r\n"
fi
  2. Add the service name to the tail of /etc/services:
serverchk    8189/tcp # serverchk script
  3. Add the Xinetd configuration with the same service name as above:

/etc/xinetd.d/serverchk

# default: on
# description: serverchk
service serverchk
{
    flags           = REUSE
    socket_type     = stream
    port            = 8189
    wait            = no
    user            = nobody
    server          = /opt/serverchk
    log_on_failure  += USERID
    disable         = no
    only_from       = 0.0.0.0/0
    per_source      = UNLIMITED
}
  4. Restart xinetd:
[rsty@prod-web01 ~]$ sudo service xinetd restart
Redirecting to /bin/systemctl restart  xinetd.service

Now the web portion is complete. You can test it by curling the configured Xinetd service port from HAProxy (or any other host, if you didn't restrict access via 'only_from'):

root@haproxy:~# curl -s 192.168.56.101:8189
Content-Type: text/plain



No downtime scheduled.



root@haproxy:~#

Now that it works, we can configure HAProxy. To do so, let's look over the current backend config for our webserver. Here is the excerpt from /etc/haproxy/haproxy.cfg:

backend nagios-test_BACKEND
  balance roundrobin
  server nagios-test 192.168.56.101:80 check

We need to modify this by adding the httpchk option and pointing the check at our new port:

backend nagios-test_BACKEND
  option httpchk HEAD
  balance roundrobin
  server nagios-test 192.168.56.101:80 check port 8189
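
As an optional extra (not required for this demo), newer HAProxy versions also let you state explicitly which status counts as healthy via http-check expect, so the backend only stays up on an actual 200:

backend nagios-test_BACKEND
  option httpchk HEAD
  http-check expect status 200
  balance roundrobin
  server nagios-test 192.168.56.101:80 check port 8189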

Now let's reload HAProxy and check the status:

root@haproxy:~# sudo /etc/init.d/haproxy reload
 * Reloading haproxy haproxy                                                                                        [ OK ]
root@haproxy:~# echo 'show stat' | socat unix-connect:/var/lib/haproxy/stats stdio | grep test | cut -d',' -f1,18
nagios-test_BACKEND,UP
nagios-test_BACKEND,UP
root@haproxy:~#
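
In case you're wondering about the cut fields: 'show stat' emits CSV, and its header line (filtered out by the grep above) confirms that field 1 is the proxy name and field 18 is the status:

root@haproxy:~# echo 'show stat' | socat unix-connect:/var/lib/haproxy/stats stdio | head -1 | cut -d',' -f1,18
# pxname,status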

Excellent! Now let's put the host into maintenance mode (downtime) on Nagios and see what comes of it!

[admin@nagios nagios-api]~$ ./nagios-cli -H localhost -p 8080 schedule-downtime prod-web01 4h
[2015/04/10 15:16:59] {diesel} INFO|Sending command: [1428679019] SCHEDULE_HOST_DOWNTIME;prod-web01;1428679019;1428693419;1;0;14400;nagios-api;schedule downtime
[admin@nagios nagios-api]~$

Now let's verify all three pieces: the Nagios downtime value, the Xinetd script remotely from HAProxy on port 8189, and the status of the BACKEND resource:

root@haproxy:~# curl -s http://192.168.33.10:8080/host/prod-web01 | python -mjson.tool | grep time_depth
        "scheduled_downtime_depth": "1",
root@haproxy:~# curl -s 192.168.56.101:8189
Content-Type: text/plain



**Downtime is SCHEDULED**


root@haproxy:~# curl -sI 192.168.56.101:8189
HTTP/1.1 503 Service Unavailable

root@haproxy:~# echo 'show stat' | socat unix-connect:/var/lib/haproxy/stats stdio | grep test | cut -d',' -f1,18
nagios-test_BACKEND,DOWN
nagios-test_BACKEND,DOWN
root@haproxy:~#

As we can see, Nagios is reporting a non-zero value for downtime, the webserver shows our script working correctly and returning a 503, and HAProxy shows the node as down. Awesome! Now let's cancel the downtime and watch it come back up:

[admin@nagios nagios-api]~$ ./nagios-cli -H localhost -p 8080 cancel-downtime prod-web01
[2015/04/10 15:24:09] {diesel} INFO|Sending command: [1428679449] DEL_HOST_DOWNTIME;4
[admin@nagios nagios-api]~$

And...

root@haproxy:~# echo 'show stat' | socat unix-connect:/var/lib/haproxy/stats stdio | grep test | cut -d',' -f1,18
nagios-test_BACKEND,UP
nagios-test_BACKEND,UP
root@haproxy:~#

SUCCESS! So effectively, this Xinetd script can be deployed to all the webservers; just change the host that the Nagios-api query targets in each copy of the script (a sketch of a generic version follows below). Also, using Xinetd scripts in this fashion, you can perform many other "checks" on the servers behind the load balancer. Anything that can be done in a Bash script (or the language of your choice) can be turned into the boolean pass/fail state needed to bring the node online or offline.
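
As a minimal sketch of that idea, here is one way the script could derive the host to query from the machine it runs on, rather than hard-coding prod-web01. This assumes your Nagios host object names match the machines' short hostnames, which may not hold in your environment:

#!/bin/bash
# Hypothetical generic version of /opt/serverchk: query the Nagios API
# for this machine's own short hostname instead of a hard-coded host.
NAGIOS_API="http://192.168.33.10:8080"
HOST="$(hostname -s)"   # assumption: Nagios host names == short hostnames

DOWN=$(curl -s "${NAGIOS_API}/host/${HOST}" | python -mjson.tool | grep time_depth | awk -F'"' '{print $4}')

if [ "$DOWN" == "0" ]
then
    # No downtime scheduled: keep the node in rotation
    /bin/echo -e "HTTP/1.1 200 OK\r\n"
    /bin/echo -e "Content-Type: text/plain\r\n"
    /bin/echo -e "\r\n"
    /bin/echo -e "No downtime scheduled.\r\n"
else
    # Downtime is set: tell the load balancer to pull the node
    /bin/echo -e "HTTP/1.1 503 Service Unavailable\r\n"
    /bin/echo -e "Content-Type: text/plain\r\n"
    /bin/echo -e "\r\n"
    /bin/echo -e "**Downtime is SCHEDULED**\r\n"
fi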

I'd like to see if anyone else has done something similar to this or has any suggestions to improve! Please comment!

DISCLAIMER: Please test thoroughly before using this solution in a production environment. I am not liable for your mistakes 😉

IP to ASN lookup, using Redis

Hello all,

Just recently, I've had the need to map an IP address to an AS number very quickly for use in another project. I started by looking for ways to obtain the data necessary to build the IP-to-ASN map. I don't personally own a router that is running full BGP tables, and I didn't want to abuse one of the looking-glass routers for obvious reasons. In my search, I came across two well-formatted indexes: one mapping netmasks to ASNs, the other mapping ASNs to owner names.

Netmask to AS number: http://thyme.apnic.net/current/data-raw-table

AS number to AS owner: http://thyme.apnic.net/current/data-used-autnums

This is great, because I could create two hashes from them, keys and values included. Since I am new to Ruby, I started by searching for suitable libraries to scrape the data from the two maps above. I found a nice ruby-curl library on GitHub with some handy features, including a scan() method that would let me curl directly into an array using a regex... how nice! But after creating my initial script and pulling the data, the curl wouldn't put all the data into the array for some reason; it would stop somewhere between 80.x.x.x and 90.x.x.x every time. After messing around with it a bit, I couldn't get it to work how I wanted, so I switched to curb, which doesn't have the nice scan() method, but I was able to work with the data just how I wanted in no time.

I got the script working by putting all the key/values into a Ruby hash, and it was easy to pull values out, but this meant scraping the data from the tables above every time I ran the script. I started searching and considered MongoDB, but after some further testing I decided to go with Redis, due to the simple shape of the data I was playing with (just a bunch of key/values). Now the fun part begins: storing the data for persistence. The Redis Ruby client is awesome and very easy to use. You import values with redis.set and read them back with redis.get... nice and simple.

irb(main):007:0> redis.set("1.1.1.0/24", "15169")
=> "OK"
irb(main):008:0> redis.set("15169", "Google Inc.")
=> "OK"
irb(main):009:0> puts "Hey! Don't touch ASN #{redis.get("1.1.1.0/24")}! That's owned by #{redis.get("15169")}"
Hey! Don't touch ASN 15169! That's owned by Google Inc.

This script needs to support ANY IPv4 address, but the table contains only the specific netmask used by each AS, which may not be a /24 or any other mask easily identified by simple string matching. That makes things a little more difficult. What we need to do is convert the IP address into a value that can easily be compared with any other and, if there is no match, widen the mask and check again (example: /32, /31, /30 ... /24, /23, /22 ... until there is a match, which you will eventually run into). This is accomplished by converting the IP to a long. Ruby does not have this as a built-in function, so after researching and testing, I came up with:

ip = "1.2.3.4"
ipAry = ip.split(/\./)
long = ipAry[3].to_i | ipAry[2].to_i << 8 | ipAry[1].to_i << 16 | ipAry[0].to_i << 24

For 1.2.3.4, this returns 16909060. The script then takes that long IP and searches Redis. If no match is found, the next common network boundary is tested, and the search continues until redis.get no longer returns nil. For example:

    16909060 = no match = redis.get(16909060) = nil
    16909056 = MATCH = redis.get(16909056) = 15169
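
To make that concrete, zeroing the low 8 bits of 16909060 (1.2.3.4) gives 16909056, which is the long form of the /24 network address 1.2.3.0:

irb(main):001:0> (16909060 >> 8) << 8
=> 16909056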

With the logic worked out, it was time to make the script usable and do some final housekeeping. The ip2long code went into a method, along with the search for a matching long netmask. I then added argument support, so you can pass the IP address to the script. Since the Redis import would otherwise occur on every run, I had to find a nice way to skip it. I first tried counting the Redis keys and initiating the import only if there were fewer than 50k, but the "redis.keys('*').count" call would sometimes take up to 10 seconds to complete, which wouldn't give me the fast lookups I wanted. So I decided to just use optparse and add an option that initiates the Redis import; otherwise the script skips it and runs as if the necessary Redis keys were already there. The final code I ended up with is:

#!/usr/bin/env ruby

require 'curb'
require 'redis'
require 'optparse'

# Convert a dotted-quad IPv4 address to its 32-bit integer form.
def ip2long(ip)
  ipAry = ip.split(/\./)
  ipAry[3].to_i | ipAry[2].to_i << 8 | ipAry[1].to_i << 16 | ipAry[0].to_i << 24
end

# Widen the mask one bit at a time (/32, /31, ... /1), zeroing host bits,
# until the resulting network address exists as a Redis key.
def getMask(ipLong)
  redis = Redis.new
  (0..31).each {|msk|
    ipRef = (ipLong >> msk) << msk
    return ipRef unless redis.get(ipRef).nil?
  }
end

# Scrape the two APNIC tables into Redis: network-as-long => ASN,
# and ASN => owner name. Note that flushall wipes the whole Redis instance.
def redisImport
  redis = Redis.new
  redis.flushall
  http = Curl.get("http://thyme.apnic.net/current/data-raw-table")
  http.body_str.each_line {|s|
    s.scan(/([0-9.]+)\/[0-9]+\s+([0-9]+)/) {|x,y|
      z = ip2long(x)
      redis.set(z, y)
    }
  }
  http = Curl.get("http://thyme.apnic.net/current/data-used-autnums")
  http.body_str.each_line {|s|
    s.scan(/([0-9]+)\s+(.*)/) {|x,y|
      redis.set(x, y)
    }
  }
end

userIp = ARGV[0]
redis = Redis.new

# --charge (re)populates Redis and exits; any other invocation assumes
# the keys are already loaded.
ARGV.options do |opts|
  opts.on("--charge") { puts "Charging Redis..." ; redisImport ; exit }
  opts.parse!
end

ipLong = ip2long(userIp)
zMask = getMask(ipLong)
zAsn = redis.get(zMask)

puts "(" + userIp + ") belongs to ASN " + zAsn + " - " + redis.get(zAsn)

The initial Redis import is initiated with the --charge option. The import usually takes around 1 minute to complete, but the IP-to-ASN queries only take around 200ms afterwards, which is pretty fast. After the initial "charge", the script gets its values directly from Redis and no longer uses curl at all. That's the nice thing about a persistent cache: you can run the script over and over without losing the data, as you would if it lived in a Ruby array or hash. If you want to use this script, make sure you have redis-server running and the required gems installed (the install command is shown after the example below). Here is how the script functions:

[rsty@home ~]$ ruby ip2asn.rb --charge
Charging Redis...
[rsty@home ~]$ ruby ip2asn.rb 4.2.2.2
(4.2.2.2) belongs to ASN 3356 - Level 3 Communications, Inc.
[rsty@home ~]$
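
For completeness, the two gems match the require names at the top of the script (note that curb builds against libcurl, so you'll need the libcurl development headers present):

[rsty@home ~]$ gem install curb redis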

Leave a comment if you have any suggestions or tips to make this script better or more efficient, or just for general discussion. Here is the GitHub repo: https://github.com/nckrse/ruby-redis-ip2asn