Create downtimes with Instance Refresh for EC2 Auto Scaling

When AWS launched Instance Refresh for EC2 Auto Scaling last year, my colleagues and I were delighted: could we finally retire our half-baked, mostly working Lambda for restarting EC2 instances and hand the job over to AWS Auto Scaling?

I think we were not the only ones looking forward to this: Auto Scaling groups are an extremely popular feature of the most popular AWS service (EC2) of the biggest cloud provider in the world. At the same time, every one of these many customers is responsible for running a recent machine image with current security patches. That means refreshing the instances in an Auto Scaling group is something everybody out there needs to do on a regular basis¹, and anybody who does not like busywork will automate the process.

As it turns out, that is much trickier than it seems at first glance:

The restart interferes with Auto Scaling Policies
It can get disrupted by new deployments
AWS Auto Scaling tries to balance instance counts between availability zones while the goal is to remove instances with old AMIs.

If you try to use this new functionality in the Console, you are in for a little confusion:

/img/instance_refresh_console.png

So on the one hand, "Each instance is terminated first and then replaced, which temporarily reduces the capacity available", but on the other hand there is a minimum health percentage parameter which indicates the "percentage of the desired capacity of the Auto Scaling group must remain healthy during this operation to allow it to continue". So if the minimum health percentage is 100%, capacity should not go down during the instance refresh, right?

Sadly, that is not the case. The first sentence is the correct one:

Each instance is terminated first and then replaced, […]

The minimum health percentage parameter instead just decides how many instances are stopped at the same time. Setting it to 100% means one instance will be stopped at a time. So for an Auto Scaling group with a single instance this parameter is quite misleading: the one instance will be stopped and a new one started, whatever value is set as "minimum health percentage".
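To make the arithmetic concrete, here is a rough Python model of the behaviour described above. It is only a sketch: the function names are made up for this post, and the rounding is an assumption chosen to match what the text says (100% means one instance at a time), not AWS's published algorithm.

```python
def max_instances_replaced_at_once(desired_capacity: int, min_healthy_percentage: int) -> int:
    """Simplified model (an assumption, not AWS's documented algorithm).

    The percentage only limits how many instances are taken out together.
    At least one instance is always taken out, otherwise the refresh
    could never make progress.
    """
    # Ceiling division with integers: instances that have to stay healthy.
    must_stay_healthy = -(-desired_capacity * min_healthy_percentage // 100)
    return max(1, desired_capacity - must_stay_healthy)


def capacity_during_refresh(desired_capacity: int, min_healthy_percentage: int) -> int:
    """Instances still serving traffic while one batch is being replaced."""
    return desired_capacity - max_instances_replaced_at_once(
        desired_capacity, min_healthy_percentage
    )


if __name__ == "__main__":
    for desired, percentage in [(1, 100), (4, 100), (10, 90), (10, 50)]:
        print(
            f"desired={desired} min_healthy={percentage}% -> "
            f"{max_instances_replaced_at_once(desired, percentage)} replaced at once, "
            f"{capacity_during_refresh(desired, percentage)} still running"
        )
```

In this model, a group with one instance and a setting of 100% still ends up with zero running instances during the refresh, which is exactly the misleading case.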

So basically, refreshing the instances of an Auto Scaling group always reduces capacity. With a single instance, that means a guaranteed service interruption. With a decently utilized group of more than one instance, it has the potential for one. This is exactly what we experienced when we gave this feature a spin.

/img/502-great-success.jpg

So what is this feature useful for? We found one particular use for it: restarting instances with mounted EBS volumes. Any Auto Scaling group that contains such instances is tagged accordingly, and the Lambda we use for restarting services uses the instance refresh API call in this particular case.
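A minimal sketch of what such a dispatch could look like in Python follows. This is not our actual Lambda: the tag key and value, the function names and the `fallback_restart` callback are hypothetical. The only AWS call used is `start_instance_refresh` from the boto3 Auto Scaling client, and the `group` dictionary is assumed to have the shape returned by `describe_auto_scaling_groups`.

```python
from typing import Callable, Optional

# Hypothetical tag marking groups whose instances mount EBS volumes.
REFRESH_TAG_KEY = "restart-strategy"
REFRESH_TAG_VALUE = "instance-refresh"


def uses_instance_refresh(group: dict) -> bool:
    """True if the group carries the (hypothetical) marker tag."""
    return any(
        tag.get("Key") == REFRESH_TAG_KEY and tag.get("Value") == REFRESH_TAG_VALUE
        for tag in group.get("Tags", [])
    )


def restart_group(
    autoscaling_client,
    group: dict,
    fallback_restart: Callable[[str], None],
) -> Optional[str]:
    """Restart one Auto Scaling group.

    autoscaling_client is expected to behave like boto3.client("autoscaling").
    Returns the instance refresh ID, or None if the fallback was used.
    """
    name = group["AutoScalingGroupName"]
    if uses_instance_refresh(group):
        # Terminate-then-replace is acceptable here, so hand over to AWS.
        response = autoscaling_client.start_instance_refresh(
            AutoScalingGroupName=name,
            Strategy="Rolling",
            Preferences={"MinHealthyPercentage": 100},
        )
        return response["InstanceRefreshId"]
    # Every other group keeps going through the existing restart logic.
    fallback_restart(name)
    return None
```

Passing the client in as an argument keeps the function easy to try out with a stub before pointing it at a real account.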

¹ One could also treat machines like pets, not cattle, and patch them in place, but I would not recommend going down that path.

aws ec2