If you follow the ‘What’s new in AWS’ page, as many AWS professionals do, you will have noticed the recently announced drift detection support. We have dug into this new feature, and this article covers what we have found so far.

Drift detection is one of the many features that have been missing from the CloudFormation toolset for years. Ideally, we should always use CloudFormation to publish our infrastructure changes. In practice, however, we frequently have to change our infrastructure outside of CloudFormation. When we do that, the configuration of our infrastructure drifts away from what is defined in CloudFormation. From time to time we need a summary of what has drifted so we can decide what to do about it. Not all drifts are bad, and some are even expected. For example, in order to dynamically enable or disable an endpoint, we might have deployed a Lambda function that tweaks the priority of a load balancer rule, which would show up as a drift. We should therefore focus on the other drifts, the ones created during ad-hoc operations, such as a change to the maximum size of an auto scaling group made during a traffic peak.

Run drift detection with Python

Now that the feature is available, I wanted to write a Python script to detect the drifts in our stacks, and have Jenkins run this script periodically so we stay up to date on them. I installed the latest version of boto3 (>=1.9.44) and looked at the boto3 API: we need to trigger a drift detection with a DetectStackDrift call, then make a DescribeStackDriftDetectionStatus call to view the result. Since drift detection takes time to finish, we need to wrap the second call in a loop, so it looks like this:

import time

import boto3

client = boto3.client("cloudformation")
stack = "your-stack-name"
# detect_stack_drift returns a dict; the ID lives under "StackDriftDetectionId"
detection_id = client.detect_stack_drift(StackName=stack)["StackDriftDetectionId"]
while True:
    time.sleep(3)
    response = client.describe_stack_drift_detection_status(
        StackDriftDetectionId=detection_id
    )
    # Stop polling once the detection has either completed or failed
    if response["DetectionStatus"] != "DETECTION_IN_PROGRESS":
        break
print(f"Stack `{stack}` has a drift status: {response['StackDriftStatus']}")

In general, this snippet should give you the same result as what you can see in the AWS web console. Easy peasy.

Calling DetectStackDrift

Let’s now step back and take a look at the DetectStackDrift API. Apart from the obvious stack name parameter, this call optionally accepts a list of logical resource IDs defined in the stack (LogicalResourceIds).

Should we worry about resources in our stack that do not support drift detection yet? Not really. If you call the DetectStackDrift API without this list and your stack happens to contain resources that do not support drift detection, AWS will simply skip them.
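Both ways of calling the API can be sketched as follows. This is a minimal sketch, assuming a boto3 CloudFormation client is passed in; the helper name `start_drift_detection` is hypothetical, while `StackName` and `LogicalResourceIds` are the request parameters of `detect_stack_drift`.

```python
def start_drift_detection(client, stack_name, logical_resource_ids=None):
    """Trigger drift detection and return the detection ID.

    `client` is assumed to be a boto3 CloudFormation client, e.g.
    boto3.client("cloudformation"). This helper is illustrative only.
    """
    kwargs = {"StackName": stack_name}
    if logical_resource_ids:
        # Restrict detection to specific resources in the stack.
        kwargs["LogicalResourceIds"] = list(logical_resource_ids)
    response = client.detect_stack_drift(**kwargs)
    return response["StackDriftDetectionId"]
```

Leaving `logical_resource_ids` unset checks every supported resource in the stack, which is the default behaviour described above.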

Can we find out the last change time of all the resources in a stack, perhaps with the help of a DescribeStackResources call, and skip drift detection for that stack? Sadly, the answer is no. Although both ListStackResources and DescribeStackResources return a last-updated timestamp (LastUpdatedTimestamp and Timestamp respectively), it only records when the resource was last changed by CloudFormation. When the resource is changed outside of CloudFormation, this field is not updated. Therefore, you cannot short-circuit a DetectStackDrift call with list/describe stack resources calls.

In conclusion, unless you have very specific requirements, we recommend simply calling DetectStackDrift with the stack name as the only argument and letting AWS take care of the rest.

Getting drift status with DescribeStackDriftDetectionStatus

If you take a closer look at the response from a DescribeStackDriftDetectionStatus call, you will see several fields, including:

  • StackDriftStatus: whether a stack has drifted.
  • DetectionStatus: whether the detection is still in progress, has completed, or has failed.
  • DriftedStackResourceCount: number of resources that have drifted.

This is where things get tricky. On some occasions, AWS fails to detect the status of some of the resources defined in the template. In that case, the response looks something like this:

{
    "StackId": "arn:aws:cloudformation:ap-southeast-2:123456789012:stack/your-stack-name/a61c13dc-e875-11e8-8f0f-000c6c095ac3",
    "StackDriftDetectionId": "b11fdd5e-e875-11e8-8f0f-000c6c095ac3",
    "StackDriftStatus": "IN_SYNC",
    "DetectionStatus": "DETECTION_FAILED",
    "DetectionStatusReason": "Failed to detect drift on resource [SNSTopic]",
    "DriftedStackResourceCount": 0,
    "Timestamp": datetime.datetime(2018, 11, 14, 22, 59, 43, 536000, tzinfo=tzutc()),
}

For the record, the logical resource ID SNSTopic is indeed an SNS topic, which should have drift detection support. From what we can see, the behaviour is consistent within a stack: once drift detection has failed for a resource, it keeps failing no matter how many times you retry. It is inconsistent across stacks, however: the same resource type can fail in one stack and succeed in another.
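Because StackDriftStatus can read IN_SYNC even when the detection itself failed, a script should check DetectionStatus first. The following is a minimal sketch of such a check; the function name `summarize_detection` and its return values are hypothetical, while the field names and status strings are taken from the response shown above.

```python
def summarize_detection(response):
    """Turn a DescribeStackDriftDetectionStatus response into a verdict.

    Returns one of "pending", "failed", "drifted", "in_sync" or "unknown".
    """
    detection = response.get("DetectionStatus")
    if detection == "DETECTION_IN_PROGRESS":
        return "pending"
    if detection == "DETECTION_FAILED":
        # StackDriftStatus may still say IN_SYNC here; do not trust it.
        return "failed"
    drift = response.get("StackDriftStatus")
    if drift == "DRIFTED":
        return "drifted"
    if drift == "IN_SYNC":
        return "in_sync"
    return "unknown"


# Trimmed copy of the failed response shown above.
sample = {
    "StackDriftStatus": "IN_SYNC",
    "DetectionStatus": "DETECTION_FAILED",
    "DetectionStatusReason": "Failed to detect drift on resource [SNSTopic]",
    "DriftedStackResourceCount": 0,
}
print(summarize_detection(sample))  # prints "failed"
```

A periodic job could treat "failed" as a warning and report the DetectionStatusReason, rather than counting the stack as in sync.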

Other minor hiccups

The official documentation does mention some limitations. I recommend taking some time to read it first. Here’s a list of some additional hiccups that we have observed:

Empty PropertyDifferences

Some resources were reported as MODIFIED with no property differences. In the following response from CloudFormation, ExpectedProperties and ActualProperties are exactly the same and PropertyDifferences is an empty list, yet StackResourceDriftStatus is reported as MODIFIED.

{
    "StackId": "arn:aws:cloudformation:ap-southeast-2:123456789012:stack/your-stack-name/a61c13dc-e875-11e8-8f0f-000c6c095ac3",
    "LogicalResourceId": "TargetTrackingPolicy",
    "PhysicalResourceId": "some-arn",
    "ResourceType": "AWS::AutoScaling::ScalingPolicy",
    "ExpectedProperties": {
        "AutoScalingGroupName": "your-stack-name-AutoScalingGroup-L3YOV2E9J92N",
        "Cooldown": 900,
        "EstimatedInstanceWarmup": 300,
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingConfiguration": {
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ASGAverageCPUUtilization"
            },
            "TargetValue": 40
        }
    },
    "ActualProperties": {
        "AutoScalingGroupName": "your-stack-name-AutoScalingGroup-L3YOV2E9J92N",
        "Cooldown": 900,
        "EstimatedInstanceWarmup": 300,
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingConfiguration": {
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ASGAverageCPUUtilization"
            },
            "TargetValue": 40
        }
    },
    "PropertyDifferences": [],
    "StackResourceDriftStatus": "MODIFIED",
    "Timestamp": "2018-11-14T22:53:53.090Z"
}
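One way to keep such reports from drowning out real drifts is to separate them when post-processing the results. This is a sketch only, assuming the input is the "StackResourceDrifts" list of a DescribeStackResourceDrifts response; the function name `split_drifts` is hypothetical.

```python
def split_drifts(resource_drifts):
    """Separate drifts with real property differences from empty ones.

    Returns two lists of logical resource IDs: (real, suspicious).
    """
    real, suspicious = [], []
    for drift in resource_drifts:
        status = drift.get("StackResourceDriftStatus")
        if status in ("IN_SYNC", "NOT_CHECKED"):
            continue
        if status == "MODIFIED" and not drift.get("PropertyDifferences"):
            # Reported as MODIFIED, but nothing actually differs.
            suspicious.append(drift["LogicalResourceId"])
        else:
            real.append(drift["LogicalResourceId"])
    return real, suspicious
```

Resources in the second list are probably worth logging separately rather than dropping, since the report itself is unexpected.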

Flattened list

In the following excerpt from a DescribeStackResourceDrifts response, the NotificationTypes list was flattened into a string. I’m not 100% sure, but I think we manually saved the ASG, which re-saved every setting and flattened NotificationTypes. Both configurations behave the same way, so I would argue that this is not a real change and that AWS could have handled it internally.

# ExpectedProperties
"NotificationConfigurations": [
    {
        "NotificationTypes": [
            "autoscaling:EC2_INSTANCE_LAUNCH"
        ],
        "TopicARN": "arn:aws:sns:ap-southeast-2:123456789012:your-stack-name-SNSTopic"
    }
]
# ActualProperties
"NotificationConfigurations": [
    {
        "NotificationTypes": "autoscaling:EC2_INSTANCE_LAUNCH",
        "TopicARN": "arn:aws:sns:ap-southeast-2:123456789012:your-stack-name-SNSTopic"
    }
]
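If you post-process the drift results yourself, this kind of difference can be neutralised by treating a bare string as a one-element list before comparing. The helpers below are a hypothetical sketch and assume the properties have already been parsed into Python dicts, as in the excerpt above.

```python
def as_list(value):
    """Wrap a scalar in a list; leave lists untouched."""
    return value if isinstance(value, list) else [value]


def same_notification_config(expected, actual):
    """Compare two NotificationConfigurations entries, ignoring flattening."""
    return (
        expected.get("TopicARN") == actual.get("TopicARN")
        and as_list(expected.get("NotificationTypes"))
        == as_list(actual.get("NotificationTypes"))
    )
```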

Internal type change

In the following case, we adjusted a security group ingress rule and then changed it back. When the rule was saved the second time, AWS must have stored the IpProtocol as a name (tcp), whereas our template specifies it as a number (6). According to the RFC, protocol number 6 is simply TCP, so this should not count as a drift at all.

# ExpectedProperties
"SecurityGroupIngress": [
    {
        "FromPort": 5432,
        "IpProtocol": 6,
        "SourceSecurityGroupId": "sg-0fedcba9876543210",
        "SourceSecurityGroupOwnerId": 123456789012,
        "ToPort": 5432,
    }
]
# ActualProperties
"SecurityGroupIngress": [
    {
        "FromPort": 5432,
        "IpProtocol": "tcp",
        "SourceSecurityGroupId": "sg-0fedcba9876543210",
        "SourceSecurityGroupOwnerId": 123456789012,
        "ToPort": 5432,
    }
]
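A similar normalisation works here: map protocol numbers to names before comparing the two values. This is a sketch under the assumption that only a few common protocols matter; `PROTOCOL_NAMES` and `normalize_protocol` are hypothetical names, and the numbers are the standard IANA assignments for ICMP, TCP and UDP.

```python
# IANA protocol numbers for a few common protocols (not an exhaustive list).
PROTOCOL_NAMES = {1: "icmp", 6: "tcp", 17: "udp"}


def normalize_protocol(value):
    """Return a canonical lowercase protocol string for a number or a name."""
    text = str(value).strip().lower()
    if text.isdigit():
        # Unknown numbers are kept as their string form.
        return PROTOCOL_NAMES.get(int(text), text)
    return text


print(normalize_protocol(6) == normalize_protocol("tcp"))  # prints True
```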

Altered order in list items

One of my colleagues has reported that, for a list of security groups, a change in the order of the security groups is considered a drift. I have yet to witness this case myself.
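If this does turn out to happen, an order-insensitive comparison would filter it out. The helper below is a hypothetical sketch and is only appropriate for properties where order carries no meaning, such as a list of security group IDs.

```python
def same_members(expected, actual):
    """Compare two lists of IDs while ignoring their order."""
    return sorted(expected) == sorted(actual)
```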

The takeaways

The good:

  • It will definitely help us to better maintain our CloudFormation stacks.

The bad:

  • Expect some glitches.
  • Resource type support is limited (for now).

The ugly:

  • Some errors are silently ignored (for now).

Written by Kai Xia

Bookworm, Pythonista, DevOps, Melbournian.
