I was messing around with HAProxy yesterday and thought it would be useful to integrate Nagios downtime into the process for taking a node off the load balancer. This method uses Xinetd to emulate HTTP headers and isn't limited to HAProxy: it can be used with any LB that supports basic HTTP header health checks... So all of them?
The required components to make this demonstration work are:
- Linux webserver with Xinetd
- HAProxy server
- Nagios server with Nagios-api installed
- And root access to the above servers!
Now to get started, I used this guide to get Nagios-api up and running. Once you have the Nagios-api running, you should be able to query the status of your webserver via:
[rsty@home ~]$ curl -s http://192.168.33.10:8080/host/prod-web01 | python -mjson.tool
{
    "content": {
        "acknowledgement_type": "0",
        "active_checks_enabled": "1",
        "check_command": "check-host-alive",
        "check_execution_time": "0.010",
        "check_interval": "5.000000",
        "check_latency": "0.024",
        "check_options": "0",
        "check_period": "",
        "check_type": "0",
        "comment": [],
        "current_attempt": "1",
        "current_event_id": "0",
        "current_notification_id": "0",
        "current_notification_number": "0",
        "current_problem_id": "0",
        "current_state": "0",
        "downtime": [],
        "event_handler": "",
        "event_handler_enabled": "1",
        "failure_prediction_enabled": "1",
        "flap_detection_enabled": "1",
        "has_been_checked": "1",
        "host": "prod-web01",
        "host_name": "prod-web01",
        "is_flapping": "0",
        "last_check": "1428676190",
        "last_event_id": "0",
        "last_hard_state": "0",
        "last_hard_state_change": "1428674980",
        "last_notification": "0",
        "last_problem_id": "0",
        "last_state_change": "1428674980",
        "last_time_down": "0",
        "last_time_unreachable": "0",
        "last_time_up": "1428676200",
        "last_update": "1428676315",
        "long_plugin_output": "",
        "max_attempts": "10",
        "modified_attributes": "0",
        "next_check": "1428676500",
        "next_notification": "0",
        "no_more_notifications": "0",
        "notification_period": "24x7",
        "notifications_enabled": "1",
        "obsess_over_host": "1",
        "passive_checks_enabled": "1",
        "percent_state_change": "0.00",
        "plugin_output": "PING OK - Packet loss = 0%, RTA = 0.06 ms",
        "problem_has_been_acknowledged": "0",
        "process_performance_data": "1",
        "retry_interval": "1.000000",
        "scheduled_downtime_depth": "0",
        "services": [
            "smb:139-dosamba-prod-web01",
            "http:43326-donagios-prod-web01",
            "int:load-donagios-prod-web01",
            "int:process_postfix-dopostfix-prod-web01",
            "ssh:15022-docommon-prod-web01",
            "int:process_puppetmaster-dopuppetmaster-prod-web01",
            "int:process_puppetdb-dopuppetmaster-prod-web01",
            "int:disk_root-donagios-prod-web01",
            "int:process_smbd-dosamba-prod-web01",
            "int:process_nagios-donagios-prod-web01"
        ],
        "should_be_scheduled": "1",
        "state_type": "1",
        "type": "hoststatus"
    },
    "success": true
}
In the output above, “scheduled_downtime_depth” is the field we are looking for. It is currently 0, so no downtime is set. We can easily grab that value with the following one-liner and save it for later:
[rsty@home ~]$ curl -s http://192.168.33.10:8080/host/prod-web01 | python -mjson.tool | grep time_depth | awk -F'"' '{print $4}'
0
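A small caveat on the one-liner: `grep time_depth` matches any line containing that substring, and it relies on `python -mjson.tool` putting each key on its own line. As a rough sketch (the helper name `extract_downtime_depth` is my own invention, not part of Nagios-api), the same value can be pulled out by matching the exact key with sed, which also works on the raw, un-prettified JSON:

```shell
#!/bin/bash
# Sketch: read a nagios-api host-status JSON document on stdin and print the
# numeric value of scheduled_downtime_depth.
# extract_downtime_depth is a hypothetical helper name, not part of nagios-api.
extract_downtime_depth() {
    # -n plus the p flag prints only lines where the substitution matched;
    # head keeps the first match in case the key ever shows up twice.
    sed -n 's/.*"scheduled_downtime_depth": *"\([0-9][0-9]*\)".*/\1/p' | head -n 1
}

# Example (needs network access to the Nagios server, so it is left commented out):
# curl -s http://192.168.33.10:8080/host/prod-web01 | extract_downtime_depth
```

This assumes the value is always a quoted integer, as it is in the output above.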
Now the fun part begins: creating the Xinetd script to emulate the HTTP header. We want to return a 200 (OK) if the scheduled_downtime_depth query returns 0, and a 5xx (BAD) if it returns a non-zero value, meaning downtime is set. There are a few things we need to do:
- Write our script, which will return a 200 if our check passes, otherwise it will return a 503. In the below script, 192.168.33.10 is the Nagios server and prod-web01 is the Nagios configured host for our web server. The Xinetd script will reside on the webserver since that is where the health check from HAProxy will be directed:
/opt/serverchk
#!/bin/bash
# Ask nagios-api for this host's scheduled_downtime_depth ("0" means no downtime is set)
DOWN=$(curl -s http://192.168.33.10:8080/host/prod-web01 | python -mjson.tool | grep time_depth | awk -F'"' '{print $4}')
if [ "$DOWN" == "0" ]
then
    # no downtime scheduled, return http 200
    # printf is used because echo appends its own newline after each "\r\n",
    # which would end the header block right after the status line
    printf 'HTTP/1.1 200 OK\r\n'
    printf 'Content-Type: text/plain\r\n'
    printf '\r\n'
    printf 'No downtime scheduled.\r\n'
else
    # downtime is scheduled (or the query failed), return http 503
    printf 'HTTP/1.1 503 Service Unavailable\r\n'
    printf 'Content-Type: text/plain\r\n'
    printf '\r\n'
    printf '**Downtime is SCHEDULED**\r\n'
fi
- Add the service name to the end of /etc/services:
serverchk 8189/tcp # serverchk script
- Add the xinetd configuration with the same service name as above:
/etc/xinetd.d/serverchk
# default: on
# description: serverchk
service serverchk
{
    flags = REUSE
    socket_type = stream
    port = 8189
    wait = no
    user = nobody
    server = /opt/serverchk
    log_on_failure += USERID
    disable = no
    only_from = 0.0.0.0/0
    per_source = UNLIMITED
}
- Restart xinetd
[rsty@prod-web01 ~]$ sudo service xinetd restart
Redirecting to /bin/systemctl restart xinetd.service
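One thing to keep in mind about the /opt/serverchk script: if the curl call to Nagios-api fails or times out, DOWN ends up empty and the script answers 503, so a Nagios outage would pull every node out of the load balancer. Whether that is what you want is a policy decision. Below is a minimal sketch of the opposite, "fail open" behaviour. The helper names `status_for_depth` and `respond` are hypothetical, and treating an empty answer as healthy is my assumption, not something the setup above requires:

```shell
#!/bin/bash
# Sketch: map a scheduled_downtime_depth value to an HTTP status.
# status_for_depth is a hypothetical helper; "fail open" is an assumed policy.
status_for_depth() {
    case "$1" in
        0)  echo "200 OK" ;;                    # no downtime set
        '') echo "200 OK" ;;                    # no answer from nagios-api: keep the node in rotation
        *)  echo "503 Service Unavailable" ;;   # anything else: treat as downtime
    esac
}

# Emit a minimal HTTP response (CRLF line endings) for a status and a body line.
respond() {
    printf 'HTTP/1.1 %s\r\n' "$1"
    printf 'Content-Type: text/plain\r\n'
    printf '\r\n'
    printf '%s\r\n' "$2"
}

# Example wiring (needs the Nagios server, so it is left commented out).
# --max-time keeps a hung nagios-api from stalling the health check:
# DOWN=$(curl -s --max-time 2 http://192.168.33.10:8080/host/prod-web01 | python -mjson.tool | grep time_depth | awk -F'"' '{print $4}')
# respond "$(status_for_depth "$DOWN")" "downtime depth: ${DOWN:-unknown}"
```

If you would rather fail closed, change the empty-string branch to return 503 and you are back to the behaviour of the script above.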
Now the web portion is complete. You can test it by curling the configured xinetd service port from the HAProxy server, or from any other host if you didn't restrict access via 'only_from':
[root@haproxy ~]$ curl -s 192.168.56.101:8189
No downtime scheduled.
root@haproxy:~#
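If you want to sanity-check the script's raw output before (or without) going through xinetd, something like the following may help. It is only a sketch: `check_http_response` is a made-up helper that verifies the first line is a status line and that a blank line closes the header block, then prints the status code. It does not validate anything beyond that.

```shell
#!/bin/bash
# Sketch: read a raw HTTP response on stdin and print its status code.
# Returns non-zero if the first line is not a status line, if a header line
# is malformed, or if no blank line terminates the header block.
# check_http_response is a hypothetical helper name.
check_http_response() {
    tr -d '\r' | awk '
        NR == 1 {
            if ($0 !~ /^HTTP\/1\.[01] [0-9][0-9][0-9]( |$)/) { bad = 1; exit }
            code = $2
            next
        }
        $0 == "" { ended = 1; exit }          # blank line closes the header block
        $0 !~ /^[^ :]+:/ { bad = 1; exit }    # header lines must look like "Name: value"
        END {
            if (bad || !ended) exit 1
            print code
        }
    '
}

# Example (run on the webserver; left commented out):
# /opt/serverchk | check_http_response
```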
Now that it works, we can configure HAProxy. To do so, let's look over the current backend config for our webserver. Here is the excerpt from /etc/haproxy/haproxy.cfg:
backend nagios-test_BACKEND
    balance roundrobin
    server nagios-test 192.168.56.101:80 check
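We need to modify this by adding the httpchk option and specifying the check port. Note that the method needs a URI after it; with a single argument, HAProxy treats that argument as the URI and keeps its default OPTIONS method:
backend nagios-test_BACKEND
    option httpchk HEAD /
    balance roundrobin
    server nagios-test 192.168.56.101:80 check port 8189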
Now let's reload HAProxy and check the status:
root@haproxy:~# sudo /etc/init.d/haproxy reload
* Reloading haproxy haproxy [ OK ]
root@haproxy:~# echo 'show stat' | socat unix-connect:/var/lib/haproxy/stats stdio | grep test | cut -d',' -f1,18
nagios-test_BACKEND,UP
nagios-test_BACKEND,UP
root@haproxy:~#
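The `cut -d',' -f1,18` above works because status happens to be the 18th column of the `show stat` CSV. As a sketch of a slightly less brittle variant, the helper below (the name `stat_status` is hypothetical) looks the columns up by name from the `# pxname,svname,...` header line and also prints the server name, so you can tell the server row from the BACKEND row:

```shell
#!/bin/bash
# Sketch: read HAProxy "show stat" CSV on stdin and print "pxname,svname,status"
# for every row whose proxy name equals $1. Columns are located by name from
# the "# pxname,svname,..." header line instead of by position.
# stat_status is a hypothetical helper name.
stat_status() {
    awk -F',' -v px="$1" '
        /^# / {
            sub(/^# /, "")                        # drop the comment marker; awk re-splits the fields
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        ("status" in col) && $(col["pxname"]) == px {
            print $(col["pxname"]) "," $(col["svname"]) "," $(col["status"])
        }
    '
}

# Example (run on the HAProxy server; left commented out):
# echo 'show stat' | socat unix-connect:/var/lib/haproxy/stats stdio | stat_status nagios-test_BACKEND
```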
Excellent! Now let's put the host into maintenance mode (downtime) on Nagios and see what comes of it!
[admin@nagios nagios-api]~$ ./nagios-cli -H localhost -p 8080 schedule-downtime prod-web01 4h
[2015/04/10 15:16:59] {diesel} INFO|Sending command: [1428679019] SCHEDULE_HOST_DOWNTIME;prod-web01;1428679019;1428693419;1;0;14400;nagios-api;schedule downtime
[admin@nagios nagios-api]~$
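The log line above shows the Nagios external command that nagios-cli sends: the host, a start and end timestamp, a "fixed" flag, a trigger ID, the duration in seconds, an author, and a comment. To make the arithmetic explicit (4h is 14400 seconds, and the end time is start plus duration), here is a small sketch that rebuilds the same line. `build_downtime_cmd` is a hypothetical helper for illustration only; nagios-cli already does this for you.

```shell
#!/bin/bash
# Sketch: build a Nagios SCHEDULE_HOST_DOWNTIME external command line.
# Field order: host;start;end;fixed;trigger_id;duration;author;comment
# build_downtime_cmd is a hypothetical helper, not part of nagios-api.
build_downtime_cmd() {
    local host="$1" start="$2" seconds="$3" author="$4" comment="$5"
    local end=$((start + seconds))
    # fixed=1 and trigger_id=0 mirror the values in the log line above
    printf '[%s] SCHEDULE_HOST_DOWNTIME;%s;%s;%s;1;0;%s;%s;%s\n' \
        "$start" "$host" "$start" "$end" "$seconds" "$author" "$comment"
}

# Reproduces the command from the log:
# build_downtime_cmd prod-web01 1428679019 14400 nagios-api "schedule downtime"
```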
And now let's check the Nagios downtime, query the xinetd script remotely from HAProxy on port 8189, and check the status of the BACKEND resource:
root@haproxy:~# curl -s http://192.168.33.10:8080/host/prod-web01 | python -mjson.tool | grep time_depth
"scheduled_downtime_depth": "1",
root@haproxy:~# curl -s 192.168.56.101:8189
**Downtime is SCHEDULED**
root@haproxy:~# curl -sI 192.168.56.101:8189
HTTP/1.1 503 Service Unavailable
root@haproxy:~# echo 'show stat' | socat unix-connect:/var/lib/haproxy/stats stdio | grep test | cut -d',' -f1,18
nagios-test_BACKEND,DOWN
nagios-test_BACKEND,DOWN
root@haproxy:~#
As we can see, Nagios is reporting a non-zero value for downtime, the web server shows our script working correctly and returning a 503, and HAProxy shows the node as down. Awesome! Now let's cancel the downtime to see it come back up:
[admin@nagios nagios-api]~$ ./nagios-cli -H localhost -p 8080 cancel-downtime prod-web01
[2015/04/10 15:24:09] {diesel} INFO|Sending command: [1428679449] DEL_HOST_DOWNTIME;4
[admin@nagios nagios-api]~$
And...
root@haproxy:~# echo 'show stat' | socat unix-connect:/var/lib/haproxy/stats stdio | grep test | cut -d',' -f1,18
nagios-test_BACKEND,UP
nagios-test_BACKEND,UP
root@haproxy:~#
SUCCESS! This xinetd script can be deployed on all the webservers by changing the host name in the Nagios-api URL inside the script to match each webserver. Using xinetd scripts in this fashion, you can also perform many other "checks" on the server behind the load balancer. Anything that can be done in a Bash script (or the language of your choice) can be turned into the boolean state needed to bring the node online or offline.
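To illustrate the "many other checks" idea, here is a rough sketch of folding several checks into the single up/down answer the load balancer needs. `overall_status` is a hypothetical helper, and the example checks in the comment are placeholders I made up; each check is just a shell command whose exit code says pass or fail.

```shell
#!/bin/bash
# Sketch: run every check command passed as an argument; print "200 OK" only
# if all of them exit 0, otherwise "503 Service Unavailable".
# overall_status is a hypothetical helper name.
overall_status() {
    local check
    for check in "$@"; do
        # each check runs in its own shell; its output is discarded
        if ! sh -c "$check" >/dev/null 2>&1; then
            echo "503 Service Unavailable"
            return 0
        fi
    done
    echo "200 OK"
}

# Example with placeholder checks (left commented out):
# overall_status 'pgrep -x httpd' 'test ! -f /etc/maintenance'
```

The first failing check short-circuits the rest, which keeps the health check fast when a node is already known to be out.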
I'd like to see if anyone else has done something similar to this or has any suggestions to improve! Please comment!
DISCLAIMER: Please test thoroughly before using this solution in a production environment. I am not liable for your mistakes 😉