Auto healing

Prerequisites

  • You must have the Heat Stack Owner role.

  • You must have aodhclient, which you can install via pip.

  • You must have a network set up that can host webservers.

  • You must have sourced an RC file on your command line.

Bullet point overview

  • Create a heat stack with two loadbalanced webservers.

  • Create a loadbalancer_member_health alarm.

  • Induce failure to one or more of the webservers.

  • Observe as the alarm is triggered and the errored webserver is replaced.

Process

This example will create an alarm that monitors a set of simulated webservers. We will configure the alarm so that, should a webserver go down, the alarm triggers and informs the heat stack that created the webservers to activate its autohealing feature. The webservers are simulated using netcat on an Ubuntu image in our project; these Ubuntu instances respond to requests with the message: “Welcome to my <IP address>”.

To get started we need to clone our example templates. These templates create most of the resources that are required for this example. However, the example still requires a network to be created beforehand for the resources to function.

$ git clone https://github.com/catalyst-cloud/catalystcloud-orchestration/
$ cd catalystcloud-orchestration/hot/autohealing/autohealing-single-server

Next, you will need to change some of the variables in these files. In “autohealing.yaml”, change the KEY NAME, NETWORK ID, and SUBNET ID, and also the IMAGE ID if your project is outside the Hamilton region. Similarly, in “webserver.yaml”, change the KEY NAME, NETWORK ID, and IMAGE ID.

Once these changes have been made and your yaml files have been saved, we want to make sure that they are valid for use. To do this, we can use the openstack commands below.

$ openstack orchestration template validate -f yaml -t autohealing.yaml
$ openstack orchestration template validate -f yaml -t webserver.yaml

If your template is valid, the console will output the template; if it is invalid, the console will return an error message instead. As long as our templates are valid, we can go to the next step, which is creating the stack.
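Before creating the stack, the two validation commands can optionally be wrapped so that you get a short pass/fail line per template instead of the full printed template. The following is only a sketch: validate_templates is a hypothetical helper name, and it assumes the openstack CLI exits with a non-zero status when a template is invalid.

```shell
# Sketch: validate several Heat templates and stop at the first invalid one.
# validate_templates is a hypothetical helper, not part of the openstack CLI.
validate_templates() {
    for template in "$@"; do
        # On success the CLI prints the parsed template; we discard it and
        # rely on the exit status only (assumed non-zero on failure).
        if openstack orchestration template validate -f yaml -t "$template" >/dev/null; then
            echo "OK: $template"
        else
            echo "INVALID: $template" >&2
            return 1
        fi
    done
}

# Usage (from the directory containing the templates):
#   validate_templates autohealing.yaml webserver.yaml
```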

$ openstack stack create autohealing-test -t autohealing.yaml -e env.yaml
$ export stackid=$(openstack stack show autohealing-test -c id -f value) && echo $stackid

We have now created the stack and exported a variable for repeated use throughout this example. Next we will want to list the stack resources so we can see what is being created.

$ watch openstack stack resource list $stackid
+----------------------------+--------------------------------------+----------------------------+-----------------+----------------------+
| resource_name              | physical_resource_id                 | resource_type              | resource_status | updated_time         |
+----------------------------+--------------------------------------+----------------------------+-----------------+----------------------+
| loadbalancer_public_ip     | d54dcfd2-944d-48e3-830f-xxxxxxxxxxxx | OS::Neutron::FloatingIP    | CREATE_COMPLETE | 2019-10-10T01:26:34Z |
| autoscaling_group          | 7a4f0dc9-5ff9-40ce-8bb8-xxxxxxxxxxxx | OS::Heat::AutoScalingGroup | CREATE_COMPLETE | 2019-10-10T01:26:34Z |
| listener                   | 1a0f2cd2-0d45-42f2-929c-xxxxxxxxxxxx | OS::Octavia::Listener      | CREATE_COMPLETE | 2019-10-10T01:26:35Z |
| loadbalancer_healthmonitor | 2773d0c1-bdcd-41c1-905d-xxxxxxxxxxxx | OS::Octavia::HealthMonitor | CREATE_COMPLETE | 2019-10-10T01:26:34Z |
| loadbalancer_pool          | 30129a16-f6b7-434f-9648-xxxxxxxxxxxx | OS::Octavia::Pool          | CREATE_COMPLETE | 2019-10-10T01:26:35Z |
| loadbalancer               | 5f9ea90e-97ae-4844-867e-xxxxxxxxxxxx | OS::Octavia::LoadBalancer  | CREATE_COMPLETE | 2019-10-10T01:26:35Z |
+----------------------------+--------------------------------------+----------------------------+-----------------+----------------------+
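Instead of watching the resource list by eye, the wait can be scripted. The sketch below polls the overall stack status until it ends in _COMPLETE or _FAILED. The helper name wait_for_stack, the number of attempts, and the polling interval are all assumptions chosen for illustration.

```shell
# Sketch: poll a stack until it leaves the *_IN_PROGRESS states.
# wait_for_stack is a hypothetical helper; defaults are arbitrary.
wait_for_stack() {
    stack=$1
    attempts=${2:-60}    # maximum number of polls
    interval=${3:-10}    # seconds between polls
    while [ "$attempts" -gt 0 ]; do
        status=$(openstack stack show "$stack" -c stack_status -f value)
        case "$status" in
            *_COMPLETE) echo "$status"; return 0 ;;
            *_FAILED)   echo "$status" >&2; return 1 ;;
        esac
        attempts=$((attempts - 1))
        sleep "$interval"
    done
    echo "timed out waiting for $stack" >&2
    return 2
}

# Usage:
#   wait_for_stack "$stackid" && echo "stack is ready"
```

Returning distinct exit codes for failure (1) and timeout (2) lets a calling script decide whether to inspect the stack failures or simply wait longer.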

Note

In case of any CREATE_FAILED statuses you can interrogate the stack for the error reasons with the command below.

$ openstack stack failures list autohealing-test

A common reason for resources failing to be created is that quotas are exceeded while attempting to create the stack. Address any actionable error messages, then delete the stack and try again.

Once these resources reach “CREATE_COMPLETE” the stack has finished and we can move on to testing our webservers. Before this, however, we are going to create some variables, as we will need to refer to certain resource IDs many times throughout this example. These are the ‘Load balancer ID’, ‘Autoscaling Group ID’, and the ‘Load balancer pool ID’.

$ lbid=$(openstack loadbalancer list | grep webserver_lb | awk '{print $2}');
$ asgid=$(openstack stack resource list $stackid | grep autoscaling_group | awk '{print $4}');
$ poolid=$(openstack loadbalancer status show $lbid | jq -r '.loadbalancer.listeners[0].pools[0].id')
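Because these lookups rely on grep, awk, and jq, a mismatch (for example a differently named load balancer) leaves a variable empty, or set to the string "null" in the case of jq -r, and later commands then fail in confusing ways. A small guard such as the sketch below can catch that early. require_ids is a hypothetical helper, not something provided by the example repository.

```shell
# Sketch: fail early if any captured ID is empty or "null".
# require_ids is a hypothetical helper; it takes name=value pairs.
require_ids() {
    missing=0
    for pair in "$@"; do
        name=${pair%%=*}     # text before the first "="
        value=${pair#*=}     # text after the first "="
        # jq -r prints the literal string "null" when a path does not exist.
        if [ -z "$value" ] || [ "$value" = "null" ]; then
            echo "missing value for $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage:
#   require_ids "lbid=$lbid" "asgid=$asgid" "poolid=$poolid" || echo "check the lookups above"
```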

Next we are going to test our webservers. The service running on each webserver simply responds with a short message including the private IP address of the current server, so we can tell which server has responded to our request. We can interact with the service by making curl requests to the public IP address.

$ openstack stack output show $stackid --all
+--------+-----------------------------------------+
| Field  | Value                                   |
+--------+-----------------------------------------+
| lb_vip | {                                       |
|        |   "output_value": "10.17.9.145",        |
|        |   "output_key": "lb_vip",               |
|        |   "description": "No description given" |
|        | }                                       |
| lb_ip  | {                                       |
|        |   "output_value": "103.254.157.70",     |
|        |   "output_key": "lb_ip",                |
|        |   "description": "No description given" |
|        | }                                       |
+--------+-----------------------------------------+

$ export lb_ip=103.254.157.70
$ while true; do curl $lb_ip; sleep 2; done
Welcome to my 192.168.2.200
Welcome to my 192.168.2.201
Welcome to my 192.168.2.200
Welcome to my 192.168.2.201
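The loop above runs until you interrupt it. If you would rather send a fixed number of requests and see how many replies came from each backend, something like the following sketch could be used. tally_backends is a hypothetical helper, and the five second timeout is an arbitrary choice.

```shell
# Sketch: send a fixed number of requests and count replies per backend.
# tally_backends is a hypothetical helper; the timeout value is arbitrary.
tally_backends() {
    url=$1
    requests=${2:-10}
    i=0
    while [ "$i" -lt "$requests" ]; do
        # --max-time keeps an unresponsive backend from stalling the loop.
        curl -s --max-time 5 "$url" || true
        echo    # make sure every reply ends with a newline
        i=$((i + 1))
    done | tr -d '\r' \
         | awk 'NF { count[$0]++ } END { for (line in count) print count[line], line }' \
         | sort -k2
}

# Usage:
#   tally_backends "$lb_ip" 10
```

With two healthy members the two counts should be roughly equal; once a member fails, all replies should come from the remaining one.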

The loadbalancer is alternating the traffic between these two servers on every request. To keep our service up and running and to make it resilient to failure, we are going to create a loadbalancer_member_health alarm. The alarm's function is to watch for failures in any of the loadbalancer members and initiate an autohealing action on them.

# We check that our loadbalancer members are all healthy before creating our alarm.
$ openstack loadbalancer member list $poolid
+--------------------------------------+------+----------------------------------+---------------------+---------------+---------------+------------------+--------+
| id                                   | name | project_id                       | provisioning_status | address       | protocol_port | operating_status | weight |
+--------------------------------------+------+----------------------------------+---------------------+---------------+---------------+------------------+--------+
| 4eeac1a8-7837-41d9-8299-xxxxxxxxxxxx |      | bb609fa4634849919b0192c060c02cd7 | ACTIVE              | 192.168.2.200 |            80 | ONLINE           |      1 |
| 2acbd21e-39d5-41fe-8fb9-xxxxxxxxxxxx |      | bb609fa4634849919b0192c060c02cd7 | ACTIVE              | 192.168.2.201 |            80 | ONLINE           |      1 |
+--------------------------------------+------+----------------------------------+---------------------+---------------+---------------+------------------+--------+

$ openstack alarm create --name test_lb_alarm \
--type loadbalancer_member_health \
--alarm-action trust+heat:// \
--repeat-actions false \
--autoscaling-group-id $asgid \
--pool-id $poolid \
--stack-id $stackid

+---------------------------+---------------------------------------+
| Field                     | Value                                 |
+---------------------------+---------------------------------------+
| alarm_actions             | ['trust+heat:']                       |
| alarm_id                  | 8c701d87-679a-4c27-939b-xxxxxxxxxxxx  |
| autoscaling_group_id      | 9ec5bb8c-3b7f-4a71-858d-xxxxxxxxxxxx  |
| description               | loadbalancer_member_health alarm rule |
| enabled                   | True                                  |
| insufficient_data_actions | []                                    |
| name                      | test_lb_alarm                         |
| ok_actions                | []                                    |
| pool_id                   | 0da0911a-0b07-4937-99ab-xxxxxxxxxxxx  |
| project_id                | eac679e489614xxxxxxce29d755fe289      |
| repeat_actions            | False                                 |
| severity                  | low                                   |
| stack_id                  | cc55271e-ddcd-4db0-8803-xxxxxxxxxxxx  |
| state                     | insufficient data                     |
| state_reason              | Not evaluated yet                     |
| state_timestamp           | 2019-10-31T01:19:22.992154            |
| time_constraints          | []                                    |
| timestamp                 | 2019-10-31T01:19:22.992154            |
| type                      | loadbalancer_member_health            |
| user_id                   | XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX      |
+---------------------------+---------------------------------------+

Below is a brief explanation of the various arguments we have constructed the alarm with:

  • --pool-id is the loadbalancer pool that the alarm will monitor for unhealthy members.

  • trust+heat:// tells the alarm to notify heat when a loadbalancer pool member is unhealthy. This is what initiates the healing action.

  • --stack-id is the name or ID of the stack which the alarm will initiate an update on.

  • --autoscaling-group-id is the autoscaling group which the resources belong to.

We can now view the alarm and see that its state is insufficient data. This is normal: the alarm is only set up to recognise loadbalancer members in the ERROR state, so while every member is healthy it has nothing to report.

$ openstack alarm list
+--------------------------------------+----------------------------+---------------+-------------------+----------+---------+
| alarm_id                             | type                       | name          | state             | severity | enabled |
+--------------------------------------+----------------------------+---------------+-------------------+----------+---------+
| 18be0104-feed-4415-b9a5-xxxxxxxxxxxx | loadbalancer_member_health | test_lb_alarm | insufficient data | low      | True    |
+--------------------------------------+----------------------------+---------------+-------------------+----------+---------+

Now that the alarm is in place we can test it out by simulating the failure of one of our application servers. For this example we can simulate a failure by ‘stopping’ a server.

# Find one of the server ids
$ openstack server list
+--------------------------------------+-------------------------------------------------------+--------+-----------------------------------------+---------------------+---------+
| ID                                   | Name                                                  | Status | Networks                                | Image               | Flavor  |
+--------------------------------------+-------------------------------------------------------+--------+-----------------------------------------+---------------------+---------+
| 4a35a813-ac9a-4195-9b25-xxxxxxxxxxxx | au-5z37-rowgvu2inhwa-25buammtmf2s-server-mkvfo7vxlv64 | ACTIVE | private_net=192.168.2.200, 10.17.9.148  | cirros-0.3.1-x86_64 | m1.tiny |
| b80aa773-7330-4a00-9666-xxxxxxxxxxxx | au-5z37-hlzbc66r2vrc-h6qxnp7n5wru-server-wyf3dksa6w3v | ACTIVE | private_net=192.168.2.201, 10.17.9.147  | cirros-0.3.1-x86_64 | m1.tiny |
+--------------------------------------+-------------------------------------------------------+--------+-----------------------------------------+---------------------+---------+

# Then we 'stop' this server
$ openstack server stop b80aa773-7330-4a00-9666-xxxxxxxxxxxx

If we curl our service again we can see that 192.168.2.201 has stopped responding to our requests and the one remaining server is receiving all the traffic.

$ while true; do curl $lb_ip; sleep 2; done
Welcome to my 192.168.2.200
Welcome to my 192.168.2.200
Welcome to my 192.168.2.200
Welcome to my 192.168.2.200

Querying the loadbalancer member pool also shows that one member's operating status is now reporting ERROR.

$ openstack loadbalancer member list $poolid
+--------------------------------------+------+----------------------------------+---------------------+---------------+---------------+------------------+--------+
| id                                   | name | project_id                       | provisioning_status | address       | protocol_port | operating_status | weight |
+--------------------------------------+------+----------------------------------+---------------------+---------------+---------------+------------------+--------+
| 4eeac1a8-7837-41d9-8299-xxxxxxxxxxxx |      | bb609fa4634849919b0192c060c02cd7 | ACTIVE              | 192.168.2.200 |            80 | ONLINE           |      1 |
| 2acbd21e-39d5-41fe-8fb9-xxxxxxxxxxxx |      | bb609fa4634849919b0192c060c02cd7 | ACTIVE              | 192.168.2.201 |            80 | ERROR            |      1 |
+--------------------------------------+------+----------------------------------+---------------------+---------------+---------------+------------------+--------+
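The same check can be scripted by asking the CLI for the operating_status column only and counting the ERROR entries. This is a sketch: count_error_members is a hypothetical helper name, and it relies on the operating_status column shown in the table above.

```shell
# Sketch: print how many members of a pool report an ERROR operating status.
# count_error_members is a hypothetical helper name.
count_error_members() {
    pool=$1
    openstack loadbalancer member list "$pool" -f value -c operating_status \
        | awk '$1 == "ERROR" { n++ } END { print n + 0 }'
}

# Usage:
#   count_error_members "$poolid"
```

A result greater than zero corresponds to the condition the alarm is watching for.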

Now that at least one member of the loadbalancer pool is reporting an operating status of ERROR, the conditions for the alarm to be triggered are satisfied and the alarm has transitioned from insufficient data to alarm.

$ openstack alarm list
+--------------------------------------+----------------------------+---------------+------------+----------+---------+
| alarm_id                             | type                       | name          | state      | severity | enabled |
+--------------------------------------+----------------------------+---------------+------------+----------+---------+
| 18be0104-feed-4415-b9a5-xxxxxxxxxxxx | loadbalancer_member_health | test_lb_alarm | alarm      | low      | True    |
+--------------------------------------+----------------------------+---------------+------------+----------+---------+

For the loadbalancer member health alarm the trust+heat:// action will mark the failed server as an unhealthy stack resource and then initiate a stack update.

$ openstack stack resource list $stackid
+----------------------------+--------------------------------------+----------------------------+--------------------+----------------------+
| resource_name              | physical_resource_id                 | resource_type              | resource_status    | updated_time         |
+----------------------------+--------------------------------------+----------------------------+--------------------+----------------------+
| loadbalancer_public_ip     | d54dcfd2-944d-48e3-830f-xxxxxxxxxxxx | OS::Neutron::FloatingIP    | CREATE_COMPLETE    | 2019-10-10T01:26:34Z |
| autoscaling_group          | 7a4f0dc9-5ff9-40ce-8bb8-xxxxxxxxxxxx | OS::Heat::AutoScalingGroup | UPDATE_IN_PROGRESS | 2019-10-10T01:53:06Z |
| listener                   | 1a0f2cd2-0d45-42f2-929c-xxxxxxxxxxxx | OS::Octavia::Listener      | CREATE_COMPLETE    | 2019-10-10T01:26:35Z |
| loadbalancer_healthmonitor | 2773d0c1-bdcd-41c1-905d-xxxxxxxxxxxx | OS::Octavia::HealthMonitor | CREATE_COMPLETE    | 2019-10-10T01:26:34Z |
| loadbalancer_pool          | 30129a16-f6b7-434f-9648-xxxxxxxxxxxx | OS::Octavia::Pool          | CREATE_COMPLETE    | 2019-10-10T01:26:35Z |
| loadbalancer               | 5f9ea90e-97ae-4844-867e-xxxxxxxxxxxx | OS::Octavia::LoadBalancer  | CREATE_COMPLETE    | 2019-10-10T01:26:35Z |
+----------------------------+--------------------------------------+----------------------------+--------------------+----------------------+

# After a few minutes, the stack status goes back to healthy. The ERROR load balancer member is replaced.
$ openstack stack resource list $stackid
+----------------------------+--------------------------------------+----------------------------+-----------------+----------------------+
| resource_name              | physical_resource_id                 | resource_type              | resource_status | updated_time         |
+----------------------------+--------------------------------------+----------------------------+-----------------+----------------------+
| loadbalancer_public_ip     | d54dcfd2-944d-48e3-830f-xxxxxxxxxxxx | OS::Neutron::FloatingIP    | CREATE_COMPLETE | 2019-10-10T01:26:34Z |
| autoscaling_group          | 7a4f0dc9-5ff9-40ce-8bb8-xxxxxxxxxxxx | OS::Heat::AutoScalingGroup | UPDATE_COMPLETE | 2019-10-10T01:53:06Z |
| listener                   | 1a0f2cd2-0d45-42f2-929c-xxxxxxxxxxxx | OS::Octavia::Listener      | CREATE_COMPLETE | 2019-10-10T01:26:35Z |
| loadbalancer_healthmonitor | 2773d0c1-bdcd-41c1-905d-xxxxxxxxxxxx | OS::Octavia::HealthMonitor | CREATE_COMPLETE | 2019-10-10T01:26:34Z |
| loadbalancer_pool          | 30129a16-f6b7-434f-9648-xxxxxxxxxxxx | OS::Octavia::Pool          | CREATE_COMPLETE | 2019-10-10T01:26:35Z |
| loadbalancer               | 5f9ea90e-97ae-4844-867e-xxxxxxxxxxxx | OS::Octavia::LoadBalancer  | CREATE_COMPLETE | 2019-10-10T01:26:35Z |
+----------------------------+--------------------------------------+----------------------------+-----------------+----------------------+

$ openstack loadbalancer member list $poolid
+--------------------------------------+------+----------------------------------+---------------------+---------------+---------------+------------------+--------+
| id                                   | name | project_id                       | provisioning_status | address       | protocol_port | operating_status | weight |
+--------------------------------------+------+----------------------------------+---------------------+---------------+---------------+------------------+--------+
| 4eeac1a8-7837-41d9-8299-xxxxxxxxxxxx |      | bb609fa4634849919b0192c060c02cd7 | ACTIVE              | 192.168.2.200 |            80 | ONLINE           |      1 |
| f354fe18-c801-4729-90bb-xxxxxxxxxxxx |      | bb609fa4634849919b0192c060c02cd7 | ACTIVE              | 192.168.2.202 |            80 | ONLINE           |      1 |
+--------------------------------------+------+----------------------------------+---------------------+---------------+---------------+------------------+--------+

Now that the stack update is complete, the new server will start responding to requests with a different IP than the failed member.

$ while true; do curl $lb_ip; sleep 2; done
Welcome to my 192.168.2.200
Welcome to my 192.168.2.202
Welcome to my 192.168.2.200
Welcome to my 192.168.2.202

Now that we’ve shown you can create an autohealing service using the alarm service, we can clean up this stack:

$ openstack stack delete $stackid
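Since the alarm in this example was created from the command line rather than by the heat template, it is presumably not a stack resource and may remain after the stack is deleted. The sketch below removes both. cleanup_autohealing is a hypothetical helper, and the alarm ID is the one shown by openstack alarm list.

```shell
# Sketch: remove the alarm, then delete the stack and wait for completion.
# cleanup_autohealing is a hypothetical helper name.
cleanup_autohealing() {
    alarm_id=$1
    stack_id=$2
    # Keep going even if the alarm is already gone.
    openstack alarm delete "$alarm_id" || echo "alarm $alarm_id not deleted" >&2
    # --yes skips the confirmation prompt; --wait blocks until the delete finishes.
    openstack stack delete --yes --wait "$stack_id"
}

# Usage:
#   cleanup_autohealing <alarm_id> "$stackid"
```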